Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/TCP: Implement flush of all outstanding operations #7140

Merged
merged 1 commit into from
Aug 4, 2021

Conversation

dmitrygx
Copy link
Member

What

Implement flush of all outstanding operations.

Why ?

To fix uct_ep_flush() which don't wait for all operations being completed. As a result if fixes the following case:

/* client */
for (i= 0; i < 1000; ++i) { 
    req = ucp_ep_tag_send_nb(ep)
    reqs.push_back(req);
}
wait_for_reqs(reqs);

req = ucp_ep_close_nb(ep, FLUSH);
wait_for_req(req);

exit(0);
/* server */
for (i = 0; i < 1000; ++i) {
    req = ucp_worker_tag_recv_nb(worker);
    reqs.push_back(req);
}
wait_for_reqs(reqs);

/* 998-999 completed successfully, but 1-2 completed with error  */

How ?

  1. Do PUT operation if last_acked_sn != tx.sn in EP
  2. Fix EP flush conditions when checking resources. Return UCS_OK if connection has already been closed.
  3. Fix tests.

@dmitrygx dmitrygx force-pushed the topic/uct/tcp_ep_flush branch 4 times, most recently from 92ef441 to ef5509e Compare July 23, 2021 19:53
* Zcopy operation. PUT Zcopy sends PUT REQ message which triggers
* sending ACK message back. */
--ep->tx.sn;
status = uct_ep_put_zcopy(&ep->super.super, NULL, 0, 0, 0, NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why shutdown doesn't work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shutdown() doesn’t work if user ha star following flow:

ucp_ep_close_nbx(FLUSH);
ucp_worker_destroy();

Since uct_ep_close() shutdowns the connection and we should wait for it completion in epoll_wait(). So, real closing of the socket is deferred. And we destroy socket, when destroying Worker. It will be ok, if user had some the following flow:

ucp_ep_close_nbx(FLUSH);
// wait for some time to ensure that socket is closed after sholutdown
ucp_worker_destroy();

@dmitrygx dmitrygx force-pushed the topic/uct/tcp_ep_flush branch 2 times, most recently from 13d2a90 to 388e1a3 Compare July 27, 2021 07:51
@dmitrygx
Copy link
Member Author

@brminich @hoopoepg could you review pls?

@@ -1174,7 +1184,7 @@ static void ucp_perf_test_destroy_eps(ucx_perf_context_t* perf)

if (UCS_PTR_IS_PTR(req)) {
do {
ucp_worker_progress(perf->ucp.tctx[i].perf.ucp.worker);
ucp_perf_worker_progress(perf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does it fix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to progress all threads when doing a barrier or closing all EPs.
before the fix, only the first thread was progressed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did it actually cause a hang?
these threads are not communicating with one another so should not be a deadlock

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, one process moved to destroy the 2nd EP, while another process is flushing the 1st EP.
So, we should progress the 1st worker in the first process to send ACK message to the 1st EP of the second process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so can we progress all "thread_count" close reqs in parallel? like we do in OMPI for example

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then each thread should take care of it closing. in current perftest implementation, master thread closes (one by one) all EPs and progress workers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean still progress in master thread, but first start all close operations and then do the progress calls

@@ -987,7 +988,10 @@ void ucp_test_base::entity::ep_destructor(ucp_ep_h ep, entity *e)
ucs_status_t status;
ucp_tag_recv_info_t info;
do {
e->progress();
const ucp_test *test = dynamic_cast<const ucp_test*>(e->m_test);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it to progress TCP flush on the remote side?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it was the idea. otherwise, the test hangs.

* UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK flag has to be removed upon PUT
* ACK message receiving if there are no other PUT operations in-flight */
ep->flags |= UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK;
ucs_assert(ep->put_cnt != UINT32_MAX);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe better return NO_RESOURCE instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? we could do several PUT operations simultaneously

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean instead of having an assert you can return NO_RESOURCE, until some puts confirmed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, done

src/uct/tcp/tcp_ep.c Show resolved Hide resolved
if (status == UCS_ERR_NO_RESOURCE) {
return UCS_ERR_NO_RESOURCE;
}
return UCS_OK;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why return OK?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if (ep->flags & UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK) {
status = uct_tcp_ep_put_comp_add(ep, comp, ep->tx.put_sn);
if (ep->last_acked_sn != ep->tx.sn) {
/* Decrement the sequence number to not consider the flush operation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does it mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it fixes hang of the following loop:

do {
    uct_worker_progress(worker);
    uct_ep_flush(ep);
} while (status = UCS_INPROGRESS);

since EP is always has outstanding PUT sent by flush. So, we don't want to count them

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe insted call some internal function of tcp which will reuse the current sequence number?
it's confusing that sn is decremented

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that it is confusing, but it should almost duplicate PUT code. Is it ok?

src/uct/tcp/tcp_ep.c Show resolved Hide resolved
@@ -1452,6 +1452,14 @@ uct_tcp_ep_am_prepare(uct_tcp_iface_t *iface, uct_tcp_ep_t *ep,
*hdr = ep->tx.buf;
(*hdr)->am_id = am_id;

++ep->tx.sn;
if (ep->tx.sn == ep->last_acked_sn) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can it be?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. if we don't request ACK for long period of time. so, it happens that ep->tx.sn == ep->last_acked_sn == 2^32 - 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean that ep->tx.sn will wrap? Then why they will be necessarily 2^32 - 1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the value can wrap.
No, it could be another value, just an example, if no PUT operations done at all

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an idea - will prepare a commit

@dmitrygx
Copy link
Member Author

@brminich your comments were address. could you review pls?

Comment on lines 72 to 73
uct_tcp_ep_ctx_buf_empty(&ep->tx) &
(ep->put_cnt != UINT32_MAX))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like it will fail on assert below if ep->put_cnt == UINT32_MAX now, because of ep->conn_state

also do we really need this check for all ops?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we have to check it for all operations to avoid sending if something in the pending queue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, fixed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we assume puts will be unordered with other ops?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure that we can, IB support ordering for WRITE after SEND.
also, we have to check put_cnt inside uct_ep_pending_add(), otherwise the following test will be failed:

for (i = 0; i < UINT32_MAX; ++i) {
    status = uct_ep_put_zcopy();
    ucs_assert(status == UCS_INPROGRESS);
}
status = uct_ep_put_zcopy();
ucs_assert(status == UCS_ERR_NO_RESOURCES);

status = uct_ep_pending_add();
ucs_assert(status == UCS_OK);

status = uct_ep_am_short();
ucs_assert(status == UCS_ERR_NO_RESOURCES); // Fails

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brminich is it ok?

@dmitrygx dmitrygx requested review from brminich and yosefe July 29, 2021 10:12
@@ -69,7 +69,8 @@ static inline int uct_tcp_ep_ctx_buf_need_progress(uct_tcp_ep_ctx_t *ctx)
static inline ucs_status_t uct_tcp_ep_check_tx_res(uct_tcp_ep_t *ep)
{
if (ucs_likely((ep->conn_state == UCT_TCP_EP_CONN_STATE_CONNECTED) &&
uct_tcp_ep_ctx_buf_empty(&ep->tx))) {
uct_tcp_ep_ctx_buf_empty(&ep->tx) &&
(ep->put_cnt != UINT32_MAX))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you update ep->ctx instead so that it will be "not empty" to avoid extra branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we increment ep->put_cnt, ep->tx isn't empty. and when we do progress of TX oeprations we don't really know if it is PUT or some AM operations.
so, we can't do it.

Copy link
Member Author

@dmitrygx dmitrygx Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can have some flag which indicates "CAN_SEND" and set it when both are true:

  • TX is empty
  • put_cnt != UINT32_MAX
    but the checks will be in other place (data path) to manage this flag

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 1458 to 1463
if (ep->tx.sn == ep->last_acked_sn) {
/* If the TX sequence number is now the same as the last acked sequence
* number, ensure that they are different to request ACK through PUT in
* TCP ep flush operation */
--ep->last_acked_sn;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do it for put operations only?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, no
if ep->tx.sn is wrapped and now it is equal to ep->last_acked_sn, flush will return OK immediately, but we don't want it

@dmitrygx
Copy link
Member Author

@brminich is it ok now?

@@ -1174,7 +1184,7 @@ static void ucp_perf_test_destroy_eps(ucx_perf_context_t* perf)

if (UCS_PTR_IS_PTR(req)) {
do {
ucp_worker_progress(perf->ucp.tctx[i].perf.ucp.worker);
ucp_perf_worker_progress(perf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did it actually cause a hang?
these threads are not communicating with one another so should not be a deadlock

src/uct/tcp/tcp_cm.c Outdated Show resolved Hide resolved
src/uct/tcp/tcp_ep.c Outdated Show resolved Hide resolved
ucs_assert(ep->flags & UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK);
ep->flags &= ~UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK;
ucs_assert(ep->put_cnt != 0);
if (ucs_unlikely(ep->put_cnt == UINT32_MAX)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe make the sn 64 bit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it makes sense

if (ep->flags & UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK) {
status = uct_tcp_ep_put_comp_add(ep, comp, ep->tx.put_sn);
if (ep->last_acked_sn != ep->tx.sn) {
/* Decrement the sequence number to not consider the flush operation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe insted call some internal function of tcp which will reuse the current sequence number?
it's confusing that sn is decremented

@brminich
Copy link
Contributor

coverity error is relevant

Error: CONSTANT_EXPRESSION_RESULT:
/__w/1/s/src/uct/tcp/tcp_ep.c:2030:
result_independent_of_operands: "ep->put_cnt != 18446744073709551615UL" is always true regardless of the values of its operands. This occurs as a value.
+ cp -ar /home/swx-azure-svc_azpcontainer/22152-20210729.25/build/cov_build_release /__w/1/s/cov_build_release
+ echo 'not ok 1 Coverity Detected 1 failures'
+ modules_for_coverity_unload
not ok 1 Coverity Detected 1 failures

@@ -1174,7 +1184,7 @@ static void ucp_perf_test_destroy_eps(ucx_perf_context_t* perf)

if (UCS_PTR_IS_PTR(req)) {
do {
ucp_worker_progress(perf->ucp.tctx[i].perf.ucp.worker);
ucp_perf_worker_progress(perf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so can we progress all "thread_count" close reqs in parallel? like we do in OMPI for example

Comment on lines 63 to 72
/* Endpoint can do TX operations if 3 conditions are true:
* - endpoint is connected to a peer
* - TX buffer is empty
* - number of PUT operations done is not equal to UINT32_MAX */
#define UCT_TCP_EP_TX_DO_MAX 3

/* Endpoint can do RX operations if 1 conditions is true:
* - RX buffer is empty */
#define UCT_TCP_EP_RX_DO_MAX 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -337,6 +346,8 @@ struct uct_tcp_ep {
* closed as soon as the EP is connected
* using the new fd */
uct_tcp_ep_cm_id_t cm_id; /* EP connection mananger ID */
uint32_t last_acked_sn; /* Last acked operation sequence number */
uint64_t put_cnt; /* Number of PUT operations scheduled */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on other handm making it 64but increases ep size. we can't really have 4G outstanding PUT operations, otheriwse sn would wrap around anyway, right?
so better return NO_RESOURCES if put_cnt >= INT32_MAX/2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I don't see any differences to have uint32_t put_cnt and check put_cnt == UINT32_MAX instead of

so better return NO_RESOURCES if put_cnt >= INT32_MAX/2

Also, it will require checking put_cnt for other operations too. Then better to return back to what I suggested in 5a53785, i.e. having counter to simplify checking condition (it will be sing if condition instead of 3 ones)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to have 32-bit put_cnt and move it along with last_acked_sn under uct_tcp_ep_ctx_t. So, they will be inside union and put_cnt is valid for TX context, last_acked_sn - for RX context.

Comment on lines 1467 to 1472
if (tx_sn_inc && (++ep->tx.sn == ep->last_acked_sn)) {
/* If the TX sequence number is now the same as the last acked sequence
* number, ensure that they are different to request ACK through PUT in
* TCP ep flush operation */
--ep->last_acked_sn;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this PR could be done simpler (and w/o adding more counters):

  1. increase SN for flush operations as well
  2. keep flag on the ep of "whether there was put without flush":
    • put turns the flag on
    • flush is nop is flag is off, otherwise - put and turn the flag off

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, done

ucs_status_ptr_t **reqs;
ucs_status_t status;

reqs = ucs_malloc(thread_count * sizeof(*reqs), "ep_close_reqs");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use alloca

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (status != UCS_INPROGRESS) {
--num_in_prog;
ucp_request_release(reqs[i]);
reqs[i] = NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove req from the array (reqs[i]=reqs[--num-in_prog])
also, not add NULL reqs to array after close_nb

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

src/tools/perf/lib/libperf.c Show resolved Hide resolved
UCT_TCP_EP_FLAG_ON_PTR_MAP = UCS_BIT(9)
UCT_TCP_EP_FLAG_ON_PTR_MAP = UCS_BIT(8),
/* EP has some operations done without flush */
UCT_TCP_EP_FLAG_HAS_OPS_NO_FLUSH = UCS_BIT(9)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UCT_TCP_EP_FLAG_NEED_FLUSH

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

{
ctx->put_sn = UINT32_MAX;
ctx->buf = NULL;
ctx->sn = UINT32_MAX;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need to rename?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we increment this sn not only for PUT operations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use one flag (NEED_FLUSH) instead of counting am

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 918 to 919
ucs_assert(ep->tx.put_cnt != 0);
if (--ep->tx.put_cnt == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed to change this logic?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since if some operation AM operations done after PUT operations are scheduled, ep->tx.sn != put_ack->sn

}

if (ep->flags & UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK) {
status = uct_tcp_ep_put_comp_add(ep, comp, ep->tx.put_sn);
if (ep->rx.last_acked_sn != ep->tx.sn) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove UCT_TCP_EP_FLAG_HAS_OPS_NO_FLUSH flag when receiving ACK?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, the following case won't work then:

do {
    uct_worker_progress();
    status = uct_ep_flush();
} while (status == UCS_INPROGRESS);

src/tools/perf/lib/libperf.c Outdated Show resolved Hide resolved
src/tools/perf/lib/libperf.c Outdated Show resolved Hide resolved
src/uct/tcp/tcp_ep.c Outdated Show resolved Hide resolved
src/uct/tcp/tcp.h Outdated Show resolved Hide resolved
src/uct/tcp/tcp_ep.c Outdated Show resolved Hide resolved
src/uct/tcp/tcp_ep.c Outdated Show resolved Hide resolved
src/uct/tcp/tcp_ep.c Outdated Show resolved Hide resolved
Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls fix the conflict by a merge commit

@dmitrygx
Copy link
Member Author

dmitrygx commented Aug 3, 2021

pls fix the conflict by a merge commit

done
should I rebase & squash now?

@dmitrygx
Copy link
Member Author

dmitrygx commented Aug 4, 2021

@brminich could you review pls?

@dmitrygx dmitrygx merged commit 9d62432 into openucx:master Aug 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants