UCP/AM: Fix request datatype state during CM switch #7432

Merged

@yosefe yosefe commented Sep 21, 2021

Why

Fix a failure like this one (internal link) on r-vmb-ppc-jenkins (MellanoxLab test):

[2021-09-21T21:49:03.447Z] [ RUN      ] all/test_ucp_sockaddr_cm_switch.rereg_memory_on_cm_switch/0
[2021-09-21T21:49:03.703Z] [     INFO ] Testing 65.65.65.12:0
[2021-09-21T21:49:03.703Z] [     INFO ] server listening on 65.65.65.12:33430
[2021-09-21T21:49:03.960Z] /scrap/jenkins/workspace/ucx-9/contrib/../test/gtest/ucp/test_ucp_sockaddr.cc:271: Failure
[2021-09-21T21:49:03.960Z] Error: Input/output error
[2021-09-21T21:49:03.960Z] [r-vmb-ppc-jenkins:14408:p-0:14408]  ucp_worker.c:2534 Assertion `worker->inprogress++ == 0' failed
[2021-09-21T21:49:05.327Z] 
[2021-09-21T21:49:05.327Z] /scrap/jenkins/workspace/ucx-9/contrib/../src/ucp/core/ucp_worker.c: [ ucp_worker_progress() ]
[2021-09-21T21:49:05.327Z]       ...
[2021-09-21T21:49:05.327Z]      2531     UCP_WORKER_THREAD_CS_ENTER_CONDITIONAL(worker);
[2021-09-21T21:49:05.327Z]      2532 
[2021-09-21T21:49:05.327Z]      2533     /* check that ucp_worker_progress is not called from within ucp_worker_progress */
[2021-09-21T21:49:05.327Z] ==>  2534     ucs_assert(worker->inprogress++ == 0);
[2021-09-21T21:49:05.327Z]      2535     count = uct_worker_progress(worker->uct);
[2021-09-21T21:49:05.327Z]      2536     ucs_async_check_miss(&worker->async);
[2021-09-21T21:49:05.327Z]      2537 
[2021-09-21T21:49:05.327Z] 

Issue started after merging #7403

When switching between transports, we can add new memory registration
handles to req->send.state.dt by calling ucp_send_request_add_reg_lane()
from ucp_do_am_zcopy_single(). We should not save 'state' before calling
add_reg_lane(); otherwise, the new memory registration will be
overridden.

This fixes the failure on r-vmb-ppc-jenkins in the test_ucp_sockaddr_cm_switch
test with the symptom "Error: Input/output error". It is actually a
local protection error (PD violation) caused by using the wrong uct_mem_h.
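
The ordering problem described above can be illustrated with a toy model. This is only a sketch, not the UCX code: every type and function name below (`toy_dt_state_t`, `toy_req_t`, `toy_add_reg_lane`, and the two send helpers) is hypothetical, and the assumption that the saved copy is later written back into the request is inferred from the description rather than taken from the actual change.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; these are NOT the real UCX types. */
typedef struct {
    int      memh[2];  /* toy "memory handle" per lane, 0 means not registered */
    unsigned offset;   /* toy send progress in bytes */
} toy_dt_state_t;

typedef struct {
    toy_dt_state_t state;
} toy_req_t;

/* Toy analogue of registering memory for a new lane: mutates req->state. */
static void toy_add_reg_lane(toy_req_t *req, int lane, int memh)
{
    req->state.memh[lane] = memh;
}

/* Problematic order: the snapshot is taken before the registration,
 * so writing it back discards the handle that was just added. */
static void toy_send_snapshot_first(toy_req_t *req, int lane, int memh)
{
    toy_dt_state_t saved = req->state; /* stale copy, lacks the new handle */

    toy_add_reg_lane(req, lane, memh);
    saved.offset += 8;                 /* pretend 8 bytes were sent */
    req->state = saved;                /* overwrites the new handle */
}

/* Corrected order: register first, then take the snapshot. */
static void toy_send_register_first(toy_req_t *req, int lane, int memh)
{
    toy_dt_state_t saved;

    toy_add_reg_lane(req, lane, memh);
    saved         = req->state;        /* copy already contains the handle */
    saved.offset += 8;                 /* pretend 8 bytes were sent */
    req->state    = saved;             /* handle is preserved */
}
```

Starting from a request whose lane 0 handle is set and whose lane 1 handle is 0, the first helper leaves lane 1 unregistered after the call, while the second keeps the handle passed in. In the toy model a lost handle is just a zero; in the real failure the consequence was a send that used the wrong uct_mem_h.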
@yosefe yosefe merged commit eb2aa0d into openucx:master Sep 22, 2021
@yosefe yosefe deleted the topic/ucp-am-fix-request-datatype-state-during branch September 22, 2021 10:59