Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/UD: Fix completion callback not called for flush operation #7854

Merged

Conversation

yosefe
Copy link
Contributor

@yosefe yosefe commented Jan 17, 2022

What

Fix leak warning about ep_removed flush request (added by ucp_wireup_send_ep_removed, when the worker is destroyed.

Why?

Fix test failures like:

2022-01-17T17:25:13.3034557Z [ RUN      ] udx/test_ucp_wireup_errh_peer.stress_connect_force_disconnect/2 <ud_x/rma,no_ep_match>
2022-01-17T17:25:13.8430781Z [1642440313.842728] [swx-rdmz-ucx-gpu-01:5074 :0]           mpool.c:54   UCX  WARN  object 0x5494b40 {flags:0x200002 <no debug info>} was not returned to mpool ucp_requests
2022-01-17T17:25:13.8687245Z /scrap/azure/agent-11/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/test.cc:344: Failure
2022-01-17T17:25:13.8689017Z Failed
2022-01-17T17:25:13.8689885Z Got 0 errors and 1 warnings during the test
2022-01-17T17:25:13.8691642Z [     INFO ] < /scrap/azure/agent-11/AZP_WORKSPACE/1/s/contrib/../src/ucs/datastruct/mpool.c:54 object 0x5494b40 {flags:0x200002 <no debug info>} was not returned to mpool ucp_requests >
2022-01-17T17:25:13.8693215Z [  FAILED  ] udx/test_ucp_wireup_errh_peer.stress_connect_force_disconnect/2, where GetParam() = ud_x/rma,no_ep_match (565 ms)

How ?

Fix UD transport to not remove uct_completion_t descriptors from async queue without calling their callback

@yosefe yosefe added the Bugfix label Jan 17, 2022
@yosefe yosefe force-pushed the topic/ucp-ep-release-ep-removed-request-if branch from 1de9461 to bce3dea Compare January 18, 2022 19:23
Copy link
Member

@dmitrygx dmitrygx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it mean that some transports (specifically, UD) doesn't guarantee uct_ep_flush(LOCAL) completion even after discard (i.e. after uct_ep_pending_purge() + uct_ep_flush(CANCEL) + uct_ep_destroy())?
should they guarantee flush operation completion? if yes, I guess UD (and other transports) should be fixed.

@@ -2515,7 +2515,7 @@ static void ucp_worker_destroy_eps(ucp_worker_h worker,
ucs_debug("worker %p: destroy %s endpoints", worker, ep_type_name);
ucs_list_for_each_safe(ep_ext, tmp, ep_list, ep_list) {
ep = ucp_ep_from_ext_gen(ep_ext);
/* Cleanup pending operations on the UCP EP before destroying it, since
/* Cleanup pending operations on the UCP EP before destroying it, since
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls, fix alignment and remove TAB

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, unrelated change

@yosefe
Copy link
Contributor Author

yosefe commented Jan 19, 2022

does it mean that some transports (specifically, UD) doesn't guarantee uct_ep_flush(LOCAL) completion even after discard (i.e. after uct_ep_pending_purge() + uct_ep_flush(CANCEL) + uct_ep_destroy())? should they guarantee flush operation completion? if yes, I guess UD (and other transports) should be fixed.

RC an UD seem to call user flush completion from uct_ep_destroy, maybe something not fully working with UD - checking.
However, IMO it's not so good we call user uct_completion_t from ep_destroy, since we try to guarantee user callbacks would be called only from uct_worker_progress(). Maybe should introduce uct_ep_outstanding_purge() API to extract such operations without calling the callbacks directly

Before this fix, the iface async completion queue dispatch could remove
a completion desc from the queue but not call the upper layer callback.
It can happen when we destroy two ud endpoints, and the queue contains
completions for both of them: we would remove all completions, but call
the callback only for the first destroyed endpoint.

As a result, some UCP requests were not released wnd lanes are
destroyed.

The fix is to remove only the relevant completion descs from the queue.
@yosefe yosefe force-pushed the topic/ucp-ep-release-ep-removed-request-if branch from bce3dea to 70b18d5 Compare January 19, 2022 15:57
@yosefe yosefe changed the title UCP/EP: Release ep_removed request if its internal endpoint is destroyed UCT/UD: Fix completion callback not called for flush operation Jan 19, 2022
@yosefe
Copy link
Contributor Author

yosefe commented Jan 19, 2022

if yes, I guess UD (and other transports) should be fixed.

@dmitrygx reworked the PR to fix UD, can you pls take a look (more details in commit message)

@yosefe yosefe merged commit ea16f70 into openucx:master Jan 21, 2022
@yosefe yosefe deleted the topic/ucp-ep-release-ep-removed-request-if branch January 21, 2022 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants