UCT/TCP: Implement flush of all outstanding operations #7140

dmitrygx · 2021-07-23T16:55:21Z

What

Implement flush of all outstanding operations.

Why ?

To fix uct_ep_flush(), which doesn't wait for all operations to complete. As a result, it fixes the following case:

/* client */
for (i = 0; i < 1000; ++i) {
    req = ucp_ep_tag_send_nb(ep);
    reqs.push_back(req);
}
wait_for_reqs(reqs);

req = ucp_ep_close_nb(ep, FLUSH);
wait_for_req(req);

exit(0);

/* server */
for (i = 0; i < 1000; ++i) {
    req = ucp_worker_tag_recv_nb(worker);
    reqs.push_back(req);
}
wait_for_reqs(reqs);

/* 998-999 completed successfully, but 1-2 completed with error  */

How ?

Do PUT operation if last_acked_sn != tx.sn in EP
Fix EP flush conditions when checking resources. Return UCS_OK if connection has already been closed.
Fix tests.
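
The first item above can be pictured with a toy model. This is a minimal sketch and not the UCX code: `toy_ep_t`, `toy_status_t` and every function below are hypothetical stand-ins, and the empty PUT that requests an ACK is modeled as a plain counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_OK, TOY_INPROGRESS } toy_status_t;

/* Hypothetical endpoint state; field names only mirror the PR text. */
typedef struct {
    uint32_t tx_sn;         /* sequence number of the last operation sent */
    uint32_t last_acked_sn; /* sequence number covered by the last ACK */
    unsigned ack_requests;  /* how many ACK requests flush has issued */
} toy_ep_t;

/* Any send-like operation advances the TX sequence number. */
void toy_ep_send(toy_ep_t *ep)
{
    assert(ep != NULL);
    ++ep->tx_sn;
}

/* The peer confirms everything up to and including 'sn'. */
void toy_ep_handle_ack(toy_ep_t *ep, uint32_t sn)
{
    assert(ep != NULL);
    ep->last_acked_sn = sn;
}

/* Flush reports completion only when nothing is left unacknowledged;
 * otherwise it asks the peer for an ACK and reports "in progress". */
toy_status_t toy_ep_flush(toy_ep_t *ep)
{
    assert(ep != NULL);
    if (ep->last_acked_sn == ep->tx_sn) {
        return TOY_OK;
    }
    ++ep->ack_requests;
    return TOY_INPROGRESS;
}
```

In this model a flush right after two sends returns `TOY_INPROGRESS`, and returns `TOY_OK` only once `toy_ep_handle_ack()` has been called with the current `tx_sn`.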

yosefe · 2021-07-25T14:48:52Z

src/uct/tcp/tcp_ep.c

+         * Zcopy operation. PUT Zcopy sends PUT REQ message which triggers
+         * sending ACK message back. */
+        --ep->tx.sn;
+        status = uct_ep_put_zcopy(&ep->super.super, NULL, 0, 0, 0, NULL);


why shutdown doesn't work?

shutdown() doesn't work if the user has the following flow:

ucp_ep_close_nbx(FLUSH);
ucp_worker_destroy();

uct_ep_close() shuts down the connection, and we have to wait for its completion in epoll_wait(), so the actual closing of the socket is deferred. The socket is then destroyed when the Worker is destroyed. It would be OK if the user had the following flow:

ucp_ep_close_nbx(FLUSH);
// wait for some time to ensure that the socket is closed after shutdown
ucp_worker_destroy();

dmitrygx · 2021-07-27T13:02:05Z

@brminich @hoopoepg could you review pls?

brminich · 2021-07-27T13:06:48Z

src/tools/perf/lib/libperf.c

@@ -1174,7 +1184,7 @@ static void ucp_perf_test_destroy_eps(ucx_perf_context_t* perf)

            if (UCS_PTR_IS_PTR(req)) {
                do {
-                    ucp_worker_progress(perf->ucp.tctx[i].perf.ucp.worker);
+                    ucp_perf_worker_progress(perf);


what does it fix?

need to progress all threads when doing a barrier or closing all EPs.
before the fix, only the first thread was progressed

did it actually cause a hang?
these threads are not communicating with one another so should not be a deadlock

yes, one process moved to destroy the 2nd EP, while another process is flushing the 1st EP.
So, we should progress the 1st worker in the first process to send ACK message to the 1st EP of the second process.

so can we progress all "thread_count" close reqs in parallel? like we do in OMPI for example

then each thread should take care of its own closing. in the current perftest implementation, the master thread closes all EPs (one by one) and progresses the workers.

i mean still progress in master thread, but first start all close operations and then do the progress calls
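
The "start everything first, then progress" idea can be sketched with a toy model. Nothing here is perftest or UCP code: `toy_close_req_t` and both functions are hypothetical, and a close request is reduced to a count of progress rounds it still needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical close request: finishes after 'steps_left' progress rounds. */
typedef struct {
    unsigned steps_left; /* progress rounds still needed */
    int      started;    /* non-zero once the close was initiated */
} toy_close_req_t;

/* One progress round over every worker: each started close advances. */
void toy_progress_all(toy_close_req_t *reqs, size_t count)
{
    size_t i;

    assert(reqs != NULL);
    for (i = 0; i < count; ++i) {
        if (reqs[i].started && (reqs[i].steps_left > 0)) {
            --reqs[i].steps_left;
        }
    }
}

/* Initiate every close first, then drive progress until all are done.
 * Returns the number of progress rounds that were needed. */
unsigned toy_close_all(toy_close_req_t *reqs, size_t count)
{
    unsigned rounds = 0;
    size_t i, pending;

    assert(reqs != NULL);
    for (i = 0; i < count; ++i) {
        reqs[i].started = 1;
    }

    for (;;) {
        pending = 0;
        for (i = 0; i < count; ++i) {
            pending += (reqs[i].steps_left > 0);
        }
        if (pending == 0) {
            return rounds;
        }
        toy_progress_all(reqs, count);
        ++rounds;
    }
}
```

Because all closes are in flight together, the number of rounds in this model equals the longest single close rather than the sum of all of them, and no endpoint waits for another one to be closed first.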

brminich · 2021-07-27T13:12:59Z

test/gtest/ucp/ucp_test.cc

@@ -987,7 +988,10 @@ void ucp_test_base::entity::ep_destructor(ucp_ep_h ep, entity *e)
    ucs_status_t        status;
    ucp_tag_recv_info_t info;
    do {
-        e->progress();
+        const ucp_test *test = dynamic_cast<const ucp_test*>(e->m_test);


is it to progress TCP flush on the remote side?

yes, it was the idea. otherwise, the test hangs.

brminich · 2021-07-27T13:19:21Z

src/uct/tcp/tcp_ep.c

-         * UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK flag has to be removed upon PUT
-         * ACK message receiving if there are no other PUT operations in-flight */
-        ep->flags |= UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK;
+    ucs_assert(ep->put_cnt != UINT32_MAX);


maybe better return NO_RESOURCE instead?

why? we could do several PUT operations simultaneously

i mean instead of having an assert you can return NO_RESOURCE, until some puts confirmed

good idea, done
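
A minimal sketch of the suggestion, under the assumption that the counter simply saturates at UINT32_MAX. The types and functions are hypothetical and only imitate the shape of the check; they are not taken from tcp_ep.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    TOY_PUT_INPROGRESS,
    TOY_PUT_ERR_NO_RESOURCE
} toy_put_status_t;

/* Hypothetical endpoint with a counter of unconfirmed PUT operations. */
typedef struct {
    uint32_t put_cnt;
} toy_put_ep_t;

/* Post a PUT. Instead of asserting that the counter is not saturated,
 * report "no resource" so the caller can retry after some PUTs are
 * confirmed. */
toy_put_status_t toy_put_post(toy_put_ep_t *ep)
{
    assert(ep != NULL);
    if (ep->put_cnt == UINT32_MAX) {
        return TOY_PUT_ERR_NO_RESOURCE;
    }
    ++ep->put_cnt;
    return TOY_PUT_INPROGRESS;
}

/* A PUT was confirmed by the peer: one slot becomes available again. */
void toy_put_ack(toy_put_ep_t *ep)
{
    assert(ep != NULL);
    assert(ep->put_cnt != 0);
    --ep->put_cnt;
}
```

The assert in `toy_put_ack()` stays because an ACK without an outstanding PUT would be a logic error, while a saturated counter in `toy_put_post()` is a recoverable condition.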

brminich · 2021-07-27T13:20:56Z

src/uct/tcp/tcp_ep.c

+        if (status == UCS_ERR_NO_RESOURCE) {
+            return UCS_ERR_NO_RESOURCE;
+        }
+        return UCS_OK;


why return OK?

brminich · 2021-07-27T13:22:27Z

src/uct/tcp/tcp_ep.c

-    if (ep->flags & UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK) {
-        status = uct_tcp_ep_put_comp_add(ep, comp, ep->tx.put_sn);
+    if (ep->last_acked_sn != ep->tx.sn) {
+        /* Decrement the sequence number to not consider the flush operation


what does it mean?

it fixes hang of the following loop:

do { uct_worker_progress(worker); status = uct_ep_flush(ep); } while (status == UCS_INPROGRESS);

since the EP always has an outstanding PUT sent by flush. So, we don't want to count them

maybe instead call some internal function of tcp which will reuse the current sequence number?
it's confusing that sn is decremented

I see that it is confusing, but the alternative would almost duplicate the PUT code. Is it ok?

brminich · 2021-07-27T13:24:57Z

src/uct/tcp/tcp_ep.c

@@ -1452,6 +1452,14 @@ uct_tcp_ep_am_prepare(uct_tcp_iface_t *iface, uct_tcp_ep_t *ep,
    *hdr          = ep->tx.buf;
    (*hdr)->am_id = am_id;

+    ++ep->tx.sn;
+    if (ep->tx.sn == ep->last_acked_sn) {


how can it be?

e.g. if we don't request an ACK for a long period of time, it can happen that ep->tx.sn == ep->last_acked_sn == 2^32 - 1

do you mean that ep->tx.sn will wrap? Then why would they necessarily be 2^32 - 1?

Yes, the value can wrap.
No, it could be another value; that was just an example for the case where no PUT operations were done at all

I have an idea - will prepare a commit

dmitrygx · 2021-07-27T21:07:23Z

@brminich your comments were addressed. could you review pls?

brminich · 2021-07-28T10:36:36Z

src/uct/tcp/tcp_ep.c

+                   uct_tcp_ep_ctx_buf_empty(&ep->tx) &
+                   (ep->put_cnt != UINT32_MAX))) {


looks like it will fail on assert below if ep->put_cnt == UINT32_MAX now, because of ep->conn_state

also do we really need this check for all ops?

yes, we have to check it for all operations to avoid sending if something is in the pending queue

good catch, fixed

can we assume puts will be unordered with other ops?

not sure that we can, IB supports ordering for WRITE after SEND.
also, we have to check put_cnt inside uct_ep_pending_add(), otherwise the following test will fail:

for (i = 0; i < UINT32_MAX; ++i) {
    status = uct_ep_put_zcopy();
    ucs_assert(status == UCS_INPROGRESS);
}
status = uct_ep_put_zcopy();
ucs_assert(status == UCS_ERR_NO_RESOURCE);
status = uct_ep_pending_add();
ucs_assert(status == UCS_OK);
status = uct_ep_am_short();
ucs_assert(status == UCS_ERR_NO_RESOURCE); // Fails

@brminich is it ok?

brminich · 2021-07-29T10:32:51Z

src/uct/tcp/tcp_ep.c

@@ -69,7 +69,8 @@ static inline int uct_tcp_ep_ctx_buf_need_progress(uct_tcp_ep_ctx_t *ctx)
 static inline ucs_status_t uct_tcp_ep_check_tx_res(uct_tcp_ep_t *ep)
 {
    if (ucs_likely((ep->conn_state == UCT_TCP_EP_CONN_STATE_CONNECTED) &&
-                   uct_tcp_ep_ctx_buf_empty(&ep->tx))) {
+                   uct_tcp_ep_ctx_buf_empty(&ep->tx) &&
+                   (ep->put_cnt != UINT32_MAX))) {


can you update ep->ctx instead so that it will be "not empty" to avoid extra branch?

when we increment ep->put_cnt, ep->tx isn't empty, and when we progress TX operations we don't really know whether it is a PUT or some AM operation.
so, we can't do it.

we can have some flag which indicates "CAN_SEND" and set it when both are true:
- TX is empty
- put_cnt != UINT32_MAX

but the checks to manage this flag will be in another place (data path)

brminich · 2021-07-29T11:21:44Z

src/uct/tcp/tcp_ep.c

+    if (ep->tx.sn == ep->last_acked_sn) {
+        /* If the TX sequence number is now the same as the last acked sequence
+         * number, ensure that they are different to request ACK through PUT in
+         * TCP ep flush operation */
+        --ep->last_acked_sn;
+    }


can we do it for put operations only?

unfortunately, no
if ep->tx.sn has wrapped and is now equal to ep->last_acked_sn, flush would return OK immediately, which we don't want
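
The wraparound case from this thread can be reproduced in a few lines. This is a sketch of the idea in the quoted hunk, not the actual function: `toy_sn_ep_t` and the helpers are hypothetical, and only the increment-then-compare logic mirrors the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical endpoint holding only the two sequence numbers. */
typedef struct {
    uint32_t tx_sn;
    uint32_t last_acked_sn;
} toy_sn_ep_t;

/* Advance the TX sequence number for one operation. Unsigned arithmetic
 * wraps modulo 2^32, so after enough unacknowledged operations tx_sn can
 * land on last_acked_sn again. In that case last_acked_sn is moved away
 * so that a later flush still sees a difference. */
void toy_sn_advance(toy_sn_ep_t *ep)
{
    assert(ep != NULL);
    ++ep->tx_sn;
    if (ep->tx_sn == ep->last_acked_sn) {
        --ep->last_acked_sn;
    }
}

/* Flush has work to do whenever the two numbers differ. */
int toy_sn_needs_flush(const toy_sn_ep_t *ep)
{
    assert(ep != NULL);
    return ep->tx_sn != ep->last_acked_sn;
}
```

Starting from `tx_sn == UINT32_MAX` and `last_acked_sn == 0`, one more operation wraps `tx_sn` to 0; without the adjustment the endpoint would look fully acknowledged even though nothing was confirmed.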

dmitrygx · 2021-07-29T15:34:05Z

@brminich is it ok now?

yosefe · 2021-07-29T11:59:09Z

src/tools/perf/lib/libperf.c

@@ -1174,7 +1184,7 @@ static void ucp_perf_test_destroy_eps(ucx_perf_context_t* perf)

            if (UCS_PTR_IS_PTR(req)) {
                do {
-                    ucp_worker_progress(perf->ucp.tctx[i].perf.ucp.worker);
+                    ucp_perf_worker_progress(perf);


did it actually cause a hang?
these threads are not communicating with one another so should not be a deadlock

yosefe · 2021-07-29T18:12:42Z

src/uct/tcp/tcp_ep.c

-        ucs_assert(ep->flags & UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK);
-        ep->flags &= ~UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK;
+    ucs_assert(ep->put_cnt != 0);
+    if (ucs_unlikely(ep->put_cnt == UINT32_MAX)) {


maybe make the sn 64 bit?

Ok, it makes sense

yosefe · 2021-07-29T18:15:09Z

src/uct/tcp/tcp_ep.c

-    if (ep->flags & UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK) {
-        status = uct_tcp_ep_put_comp_add(ep, comp, ep->tx.put_sn);
+    if (ep->last_acked_sn != ep->tx.sn) {
+        /* Decrement the sequence number to not consider the flush operation


maybe instead call some internal function of tcp which will reuse the current sequence number?
it's confusing that sn is decremented

brminich · 2021-07-30T07:33:40Z

coverity error is relevant

Error: CONSTANT_EXPRESSION_RESULT:
/__w/1/s/src/uct/tcp/tcp_ep.c:2030:
result_independent_of_operands: "ep->put_cnt != 18446744073709551615UL" is always true regardless of the values of its operands. This occurs as a value.
+ cp -ar /home/swx-azure-svc_azpcontainer/22152-20210729.25/build/cov_build_release /__w/1/s/cov_build_release
+ echo 'not ok 1 Coverity Detected 1 failures'
+ modules_for_coverity_unload
not ok 1 Coverity Detected 1 failures

yosefe · 2021-07-31T18:04:43Z

src/tools/perf/lib/libperf.c

@@ -1174,7 +1184,7 @@ static void ucp_perf_test_destroy_eps(ucx_perf_context_t* perf)

            if (UCS_PTR_IS_PTR(req)) {
                do {
-                    ucp_worker_progress(perf->ucp.tctx[i].perf.ucp.worker);
+                    ucp_perf_worker_progress(perf);


so can we progress all "thread_count" close reqs in parallel? like we do in OMPI for example

yosefe · 2021-07-31T18:05:26Z

src/uct/tcp/tcp.h

+/* Endpoint can do TX operations if 3 conditions are true:
+  * - endpoint is connected to a peer
+  * - TX buffer is empty
+  * - number of PUT operations done is not equal to UINT32_MAX */
+#define UCT_TCP_EP_TX_DO_MAX                 3
+
+/* Endpoint can do RX operations if 1 conditions is true:
+  * - RX buffer is empty */
+#define UCT_TCP_EP_RX_DO_MAX                 1
+


yosefe · 2021-07-31T18:11:21Z

src/uct/tcp/tcp.h

@@ -337,6 +346,8 @@ struct uct_tcp_ep {
                                                     * closed as soon as the EP is connected
                                                     * using the new fd */
    uct_tcp_ep_cm_id_t            cm_id;            /* EP connection mananger ID */
+    uint32_t                      last_acked_sn;    /* Last acked operation sequence number */
+    uint64_t                      put_cnt;          /* Number of PUT operations scheduled */


on the other hand, making it 64-bit increases the ep size. we can't really have 4G outstanding PUT operations, otherwise sn would wrap around anyway, right?
so better return NO_RESOURCES if put_cnt >= INT32_MAX/2

sorry, I don't see how having a uint32_t put_cnt and checking put_cnt == UINT32_MAX is any different from:

so better return NO_RESOURCES if put_cnt >= INT32_MAX/2

Also, it will require checking put_cnt for other operations too. Then it is better to go back to what I suggested in 5a53785, i.e. having a counter to simplify the check (it will be a single if condition instead of three)

I think it makes sense to have a 32-bit put_cnt and move it along with last_acked_sn under uct_tcp_ep_ctx_t. So, they will be inside a union: put_cnt is valid for the TX context, last_acked_sn for the RX context.

yosefe · 2021-07-31T18:16:44Z

src/uct/tcp/tcp_ep.c

+    if (tx_sn_inc && (++ep->tx.sn == ep->last_acked_sn)) {
+        /* If the TX sequence number is now the same as the last acked sequence
+         * number, ensure that they are different to request ACK through PUT in
+         * TCP ep flush operation */
+        --ep->last_acked_sn;
+    }


maybe this PR could be done simpler (and w/o adding more counters):
- increase SN for flush operations as well
- keep a flag on the ep for "whether there was a put without flush":
  - put turns the flag on
  - flush is a nop if the flag is off, otherwise it does a put and turns the flag off

good catch, done
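
One possible reading of this proposal, as a toy model. Everything below is hypothetical (`toy_fl_ep_t`, the function names, and in particular the way the flag is combined with the sequence-number comparison); the thread does not show the final code for this part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_FL_OK, TOY_FL_INPROGRESS } toy_fl_status_t;

/* Hypothetical endpoint state for the flag-based variant. */
typedef struct {
    uint32_t tx_sn;         /* advanced by operations and by flush's PUT */
    uint32_t last_acked_sn; /* sequence number carried by the last ACK */
    int      need_flush;    /* set by operations, cleared by flush */
} toy_fl_ep_t;

/* A regular operation: takes a sequence number and marks the endpoint. */
void toy_fl_op(toy_fl_ep_t *ep)
{
    assert(ep != NULL);
    ++ep->tx_sn;
    ep->need_flush = 1;
}

/* ACK from the peer for everything up to 'sn'. */
void toy_fl_ack(toy_fl_ep_t *ep, uint32_t sn)
{
    assert(ep != NULL);
    ep->last_acked_sn = sn;
}

/* Flush posts at most one ACK-requesting PUT per batch of operations.
 * That PUT takes a sequence number like any other operation, and the
 * flag is cleared so repeated flush calls do not post further PUTs. */
toy_fl_status_t toy_fl_flush(toy_fl_ep_t *ep)
{
    assert(ep != NULL);
    if (ep->need_flush) {
        ++ep->tx_sn;
        ep->need_flush = 0;
    }
    return (ep->last_acked_sn == ep->tx_sn) ? TOY_FL_OK : TOY_FL_INPROGRESS;
}
```

In this model a progress-and-flush loop terminates: the second and later flush calls leave `tx_sn` untouched, so the ACK for the single PUT eventually matches it.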

yosefe · 2021-08-03T08:29:54Z

src/tools/perf/lib/libperf.c

+    ucs_status_ptr_t **reqs;
+    ucs_status_t status;
+
+    reqs = ucs_malloc(thread_count * sizeof(*reqs), "ep_close_reqs");


yosefe · 2021-08-03T08:34:01Z

src/tools/perf/lib/libperf.c

+            if (status != UCS_INPROGRESS) {
+                --num_in_prog;
+                ucp_request_release(reqs[i]);
+                reqs[i] = NULL;


maybe remove req from the array (reqs[i] = reqs[--num_in_prog])
also, don't add NULL reqs to the array after close_nb
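
The swap-with-last removal suggested here can be sketched as follows. `toy_req_t` and both functions are hypothetical; a request is again reduced to a count of remaining progress steps, and the array is assumed to contain only real (non-NULL) requests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request: finishes after 'steps_left' progress calls. */
typedef struct {
    unsigned steps_left;
} toy_req_t;

/* Progress one request; returns non-zero when it has finished. */
int toy_req_progress(toy_req_t *req)
{
    assert(req != NULL);
    if (req->steps_left > 0) {
        --req->steps_left;
    }
    return req->steps_left == 0;
}

/* Progress every pending request until none remain. A finished request
 * is overwritten by the last pending one, so the array stays dense and
 * never needs NULL checks. Returns the number of rounds performed. */
unsigned toy_wait_all(toy_req_t **reqs, size_t num_in_prog)
{
    unsigned rounds = 0;
    size_t i;

    assert(reqs != NULL);
    while (num_in_prog > 0) {
        i = 0;
        while (i < num_in_prog) {
            if (toy_req_progress(reqs[i])) {
                /* Move the last pending request into slot i and visit
                 * this slot again on the next iteration. */
                reqs[i] = reqs[--num_in_prog];
            } else {
                ++i;
            }
        }
        ++rounds;
    }
    return rounds;
}
```

Not advancing `i` after a removal matters: the request moved into slot `i` has not been progressed in this round yet.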

yosefe · 2021-08-03T08:37:27Z

src/uct/tcp/tcp.h

-    UCT_TCP_EP_FLAG_ON_PTR_MAP         = UCS_BIT(9)
+    UCT_TCP_EP_FLAG_ON_PTR_MAP         = UCS_BIT(8),
+    /* EP has some operations done without flush */
+    UCT_TCP_EP_FLAG_HAS_OPS_NO_FLUSH   = UCS_BIT(9)


UCT_TCP_EP_FLAG_NEED_FLUSH

yosefe · 2021-08-03T08:38:39Z

src/uct/tcp/tcp_ep.c

 {
-    ctx->put_sn = UINT32_MAX;
-    ctx->buf    = NULL;
+    ctx->sn  = UINT32_MAX;


why need to rename?

because we increment this sn not only for PUT operations

use one flag (NEED_FLUSH) instead of counting am

yosefe · 2021-08-03T08:39:25Z

src/uct/tcp/tcp_ep.c

+    ucs_assert(ep->tx.put_cnt != 0);
+    if (--ep->tx.put_cnt == 0) {


why needed to change this logic?

since if some AM operations are done after PUT operations are scheduled, ep->tx.sn != put_ack->sn

yosefe · 2021-08-03T08:41:13Z

src/uct/tcp/tcp_ep.c

    }

-    if (ep->flags & UCT_TCP_EP_FLAG_PUT_TX_WAITING_ACK) {
-        status = uct_tcp_ep_put_comp_add(ep, comp, ep->tx.put_sn);
+    if (ep->rx.last_acked_sn != ep->tx.sn) {


can we remove UCT_TCP_EP_FLAG_HAS_OPS_NO_FLUSH flag when receiving ACK?

no, the following case won't work then:

do { uct_worker_progress(); status = uct_ep_flush(); } while (status == UCS_INPROGRESS);

yosefe

pls fix the conflict by a merge commit

dmitrygx · 2021-08-03T14:42:28Z

pls fix the conflict by a merge commit

done
should I rebase & squash now?

dmitrygx · 2021-08-04T07:18:02Z

@brminich could you review pls?

dmitrygx force-pushed the topic/uct/tcp_ep_flush branch 4 times, most recently from 92ef441 to ef5509e on July 23, 2021 19:53

yosefe reviewed Jul 25, 2021

dmitrygx force-pushed the topic/uct/tcp_ep_flush branch 2 times, most recently from 13d2a90 to 388e1a3 on July 27, 2021 07:51

brminich requested changes Jul 27, 2021

brminich reviewed Jul 28, 2021

dmitrygx requested review from brminich and yosefe July 29, 2021 10:12

brminich reviewed Jul 29, 2021

yosefe reviewed Jul 29, 2021

yosefe reviewed Jul 31, 2021

yosefe reviewed Aug 3, 2021

yosefe approved these changes Aug 3, 2021

UCT/TCP: Implement flush of all outstanding operations

68fb8e3

yosefe approved these changes Aug 3, 2021

dmitrygx force-pushed the topic/uct/tcp_ep_flush branch from e93811b to 68fb8e3 on August 3, 2021 15:04

dmitrygx requested a review from brminich August 4, 2021 07:17

brminich approved these changes Aug 4, 2021

dmitrygx merged commit 9d62432 into openucx:master Aug 4, 2021

This was referenced Aug 4, 2021

UCT/TCP: Implement flush of all outstanding operations [v1.11.x] #7188

Merged

TCP transport may cause Connection reset by peer when closing shortly after ucp_tag_send_nb #6922

Closed

lappazos mentioned this pull request Aug 11, 2021

TL/UCP: Fix bug in ep close openucx/ucc#287

Merged

yosefe mentioned this pull request Aug 11, 2021

TCP/TEST: Fix simultaneous ep close with ucp_hello_world #7224

Merged
