Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/TAG/RNDV: Complete send operation only if RTS wasn't sent #230

Merged

Conversation

dmitrygx
Copy link

Fixes:

Error iodemo analyzer: [1640029744.908498] [DEMO] ERROR: iov data corruption (0x7f7087be9800: expected: segment=ed00 conn_id=10 seed=3c212088 got: segment=ed00 conn_id=ffff seed=ffffffff) at 485376 position detected on [UCX-connection 0x99dd10: #16 2.1.3.7:20011] (status=Endpoint is not connected) for operation (length=487269 mem_type=0 op="read")

http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/ucx_ompi/20211220_211114_26976_84801_jazz07.swx.labs.mlnx/html/test_stdout_w5cd6S.txt

UCP RNDV request should be completed during the request cancellation procedure only if the RTS packet wasn't sent to the peer (i.e. the peer is not aware of this operation). If the RTS packet was sent we should put the request on the list of canceled requests and it will be cleaned when all UCT lanes are destroyed and UCP EP is under destroying.

@dmitrygx
Copy link
Author

currently, it successfully passed 15 iterations of the reproducer from the link specified in the description.

@dmitrygx
Copy link
Author

@evgeny-leksikov @hoopoepg could you review pls?
it passes > 30 iterations without any errors.
but the error could be reproduced every 2-3 runs without the fix.

Comment on lines 422 to 423
if (!(sreq->send.ep->flags & UCP_EP_FLAG_REMOTE_CONNECTED)) {
if (sreq->flags & UCP_REQUEST_FLAG_RNDV_RTS_SENT) {
if (!(sreq->flags & UCP_REQUEST_FLAG_RNDV_RTS_SENT)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: maybe invert if/else? positive conditions are simpler for reading

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, done

@dmitrygx dmitrygx force-pushed the topic/ucp/rndv_complete_rts_int3 branch from 5c062a1 to 97be7b6 Compare December 23, 2021 08:53
@yosefe yosefe merged commit 14ca34d into yosefe:integration3 Jan 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants