UCP: Reregister memh on CM switch. #7403

Merged: 1 commit, Sep 19, 2021

petro-rudenko
Member

What

Fixes a segfault on an empty memh when the CM switches from rdmacm to tcp (because the listener was started on a non-RDMA NIC).

test/gtest/ucp/test_ucp_sockaddr.cc
void check_cm_fallback()
{
if (get_num_cms() < 2) {
UCS_TEST_SKIP_R("No CM for fallback to");
Contributor

You could create a boolean function "have_two_cm_components()" and use the UCS_TEST_SKIP_COND_P macro.

Member Author

Checking the number of CM components requires a worker. The condition function passed to the macro is evaluated before the test starts, so the worker is not available yet: https://github.com/openucx/ucx/blob/master/test/gtest/ucp/test_ucp_sockaddr.cc#L1359

Contributor

right
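
The ordering problem discussed above can be sketched with a small standalone model. This is only an illustration under stated assumptions: the `fixture` and `worker` types below are hypothetical stand-ins rather than the UCX gtest classes, and the model assumes that a macro-level skip condition runs before the fixture creates its worker.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical, simplified model of a test fixture; not the UCX gtest code.
struct worker {
    int num_cms;
};

class fixture {
public:
    explicit fixture(int num_cms) : m_num_cms(num_cms) {}

    // Creates the worker; assumed to run after any macro-level skip condition.
    void init() { m_worker.reset(new worker{m_num_cms}); }

    // Needs a live worker, so it cannot be used before init().
    int get_num_cms() const {
        if (!m_worker) {
            throw std::logic_error("worker is not created yet");
        }
        return m_worker->num_cms;
    }

    // Runtime check inside the test body, in the spirit of check_cm_fallback().
    // Returns false where the real test would call UCS_TEST_SKIP_R.
    bool run() {
        init();
        if (get_num_cms() < 2) {
            return false; // skipped: no CM to fall back to
        }
        return true; // the test body would continue here
    }

private:
    int                     m_num_cms;
    std::unique_ptr<worker> m_worker;
};

inline void example_usage() {
    fixture early(2);
    bool threw = false;
    try {
        early.get_num_cms(); // what a pre-init skip condition would attempt
    } catch (const std::logic_error&) {
        threw = true;
    }
    assert(threw);

    assert(!fixture(1).run()); // one CM: skipped at runtime
    assert(fixture(2).run());  // two CMs: the test proceeds
}
```

In this model the check can only succeed once `init()` has run, which is why the skip lives inside the test body rather than in a condition evaluated up front.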

protected:
ucp_rsc_index_t get_num_cms()
protected:
const ucp_rsc_index_t get_num_cms()
Contributor

ucp_rsc_index_t get_num_cms() const

Member Author

/__w/1/s/contrib/../test/gtest/ucp/test_ucp_sockaddr.cc: In member function ‘ucp_rsc_index_t test_ucp_sockaddr_cm_switch::get_num_cms() const’:
/__w/1/s/contrib/../test/gtest/ucp/test_ucp_sockaddr.cc:1358:47: error: passing ‘const test_ucp_sockaddr_cm_switch’ as ‘this’ argument discards qualifiers [-fpermissive]
const ucp_worker_h worker = sender().worker();

Removing const
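
The compiler error above is the usual const-correctness rule: inside a const member function `this` points to a const object, so only const member functions can be called on it. A minimal standalone sketch, assuming `sender()` is a non-const accessor as the error message suggests (the `fixture` and `entity` types are hypothetical stand-ins, not the real test classes):

```cpp
#include <cassert>

// Hypothetical stand-ins for the test fixture types; not the UCX gtest classes.
struct entity {
    int num_cms = 2;
};

class fixture {
public:
    // Non-const accessor, as assumed for sender() in the real fixture.
    entity& sender() { return m_sender; }

    // Declaring this function `const` would fail to compile with
    // "passing 'const fixture' as 'this' argument discards qualifiers",
    // because sender() is not a const member function.
    int get_num_cms() { return sender().num_cms; }

private:
    entity m_sender;
};

inline void example_usage() {
    fixture f;
    assert(f.get_num_cms() == 2);
}
```

Making `get_num_cms()` const would require a const overload of the accessor it calls, so dropping the qualifier is the smaller change.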

petro-rudenko force-pushed the ucp/rereg-mem-on-cm-switch branch 2 times, most recently from c9df428 to 3ef257e, September 15, 2021 08:02
@petro-rudenko
Member Author

The tests are flaky on the gpu-worker2 machine.
Adding ZCOPY_THRESH sometimes makes the flush on teardown fail:

UCS_TEST_P(test_ucp_sockaddr, zcopy_zero_fails, "ZCOPY_THRESH=0") {
    listen_and_communicate(false, SEND_DIRECTION_BIDI);
}
[peterr@swx-rdmz-ucx-gpu-02 gtest]$ GTEST_FILTER=*zcopy_zero_fails* ./gtest  2>&1 | tee test.log
[     INFO ] ugni is not available
[     INFO ] ugni,cuda_copy,rocm_copy is not available
[     INFO ] Using random seed of 494
Note: Google Test filter = *zcopy_zero_fails*
[==========] Running 96 tests from 12 test cases.
[----------] Global test environment set-up.
[----------] 8 tests from dcx/test_ucp_sockaddr
[ RUN      ] dcx/test_ucp_sockaddr.zcopy_zero_fails/0 <dc_x/tag>
[     INFO ] Testing 12.10.44.12:0
[     INFO ] server listening on 12.10.44.12:54709
ucp/test_ucp_sockaddr.cc:271: Failure
Error: Input/output error
[swx-rdmz-ucx-gpu-02:19365:0:19365]  ucp_worker.c:2531 Assertion `worker->inprogress++ == 0' failed

/tmp/peterr-test2/src/ucp/core/ucp_worker.c: [ ucp_worker_progress() ]
      ...
     2528     UCP_WORKER_THREAD_CS_ENTER_CONDITIONAL(worker);
     2529
     2530     /* check that ucp_worker_progress is not called from within ucp_worker_progress */
==>  2531     ucs_assert(worker->inprogress++ == 0);
     2532     count = uct_worker_progress(worker->uct);
     2533     ucs_async_check_miss(&worker->async);
     2534

==== backtrace (tid:  19365) ====
 0 0x00000000000655cd ucp_worker_progress()  /tmp/peterr-test2/src/ucp/core/ucp_worker.c:2531
 1 0x000000000092f6ba ucp_test_base::entity::progress()  /tmp/peterr-test2/test/gtest/ucp/ucp_test.cc:916
 2 0x000000000092ab6a ucp_test::progress()  /tmp/peterr-test2/test/gtest/ucp/ucp_test.cc:155
 3 0x000000000092b003 ucp_test::request_process()  /tmp/peterr-test2/test/gtest/ucp/ucp_test.cc:244
 4 0x000000000092b137 ucp_test::request_wait()  /tmp/peterr-test2/test/gtest/ucp/ucp_test.cc:267
 5 0x000000000092ac7c ucp_test::flush_worker()  /tmp/peterr-test2/test/gtest/ucp/ucp_test.cc:177
 6 0x000000000092ae30 ucp_test::disconnect()  /tmp/peterr-test2/test/gtest/ucp/ucp_test.cc:204
 7 0x000000000092a68e ucp_test::cleanup()  /tmp/peterr-test2/test/gtest/ucp/ucp_test.cc:78
 8 0x00000000005b3fc7 ucs::test_base::TearDownProxy()  /tmp/peterr-test2/test/gtest/common/test.cc:333
 9 0x0000000000739372 ucp_test::TearDown()  /tmp/peterr-test2/test/gtest/ucp/ucp_test.h:190
10 0x0000000000591c74 testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /tmp/peterr-test2/test/gtest/common/gtest-all.cc:3562
11 0x000000000058ce52 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>()  /tmp/peterr-test2/test/gtest/common/gtest-all.cc:3598
12 0x00000000005741a8 testing::Test::Run()  /tmp/peterr-test2/test/gtest/common/gtest-all.cc:3643
13 0x000000000057493c testing::TestInfo::Run()  /tmp/peterr-test2/test/gtest/common/gtest-all.cc:3812
14 0x0000000000574fcc testing::TestCase::Run()  /tmp/peterr-test2/test/gtest/common/gtest-all.cc:3930
15 0x000000000057b824 testing::internal::UnitTestImpl::RunAllTests()  /tmp/peterr-test2/test/gtest/common/gtest-all.cc:5808
16 0x0000000000593052 testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>()  /tmp/peterr-test2/test/gtest/common/gtest-all.cc:3562
17 0x000000000058dcb4 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>()  /tmp/peterr-test2/test/gtest/common/gtest-all.cc:3598
18 0x000000000057a460 testing::UnitTest::Run()  /tmp/peterr-test2/test/gtest/common/gtest-all.cc:5422
19 0x000000000059c461 RUN_ALL_TESTS()  /tmp/peterr-test2/test/gtest/common/gtest.h:20059
20 0x000000000059c34a main()  /tmp/peterr-test2/test/gtest/common/main.cc:109
21 0x00000000000223d5 __libc_start_main()  ???:0
22 0x000000000056ea79 _start()  ???:0
=================================
[swx-rdmz-ucx-gpu-02:19365:0:19365] Process frozen...

Sometimes it passes, though. Still checking. It always fails when the listener starts at 12.10.44.12. We probably need to remove the random IP selection to make the tests more deterministic.
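
For readers unfamiliar with the `worker->inprogress++ == 0` assertion in the trace above, here is a sketch of how a counter-based reentrancy guard of that kind behaves. It is not the UCX implementation: the `worker` struct and `progress()` function are hypothetical, the sketch returns false where the real code asserts, and how the real code restores the counter is an assumption.

```cpp
#include <cassert>

// Hypothetical stand-in for a worker with a reentrancy counter; not ucp_worker.
struct worker {
    int inprogress = 0;
};

// Returns false where the real code would hit the assertion
// `worker->inprogress++ == 0`, i.e. when progress is already marked as running.
template <typename Callback>
bool progress(worker& w, Callback cb) {
    if (w.inprogress++ != 0) {
        --w.inprogress; // undo our increment before reporting the violation
        return false;
    }
    cb(w);          // callbacks run while the counter is nonzero
    --w.inprogress; // restore the counter on the normal exit path
    return true;
}

inline void example_usage() {
    worker w;
    bool nested_ok = true;
    // A nested call made from inside a callback is rejected.
    bool outer_ok = progress(w, [&](worker& inner) {
        nested_ok = progress(inner, [](worker&) {});
    });
    assert(outer_ok);
    assert(!nested_ok);
    assert(w.inprogress == 0);

    // If an earlier call never restored the counter, every later call fails.
    w.inprogress = 1;
    assert(!progress(w, [](worker&) {}));
}
```

A guard like this fires both on genuinely nested calls and when an earlier call left the counter nonzero. Which of the two applies to the failure above is not established in this thread.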

@yosefe
Contributor

yosefe commented Sep 16, 2021

It happens when @avildema is running Docker tests on the GPU machines, which creates temporary network devices for Docker.
