hang in udrc/test_ucp_rma.nonblocking_stream_get_nbi_flush_worker/1 #1641

Closed
yosefe opened this issue Jun 27, 2017 · 2 comments

yosefe commented Jun 27, 2017

This happens because ucp_worker_flush() is blocking: while the sender spins inside it, the receiver makes no progress, so the sender is never able to switch from the stub_ep to the real ep.

(gdb) bt
#0  0x00007f5971b64f16 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#1  0x00007f597244f256 in ibv_poll_cq (arg=0x3b8ab90) at /usr/include/infiniband/verbs.h:1271
#2  uct_ib_poll_cq (arg=0x3b8ab90) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../src/uct/ib/base/ib_device.h:267
#3  uct_rc_verbs_iface_poll_rx_common (arg=0x3b8ab90) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../src/uct/ib/rc/verbs/rc_verbs_common.h:154
#4  uct_rc_verbs_iface_progress (arg=0x3b8ab90) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../src/uct/ib/rc/verbs/rc_verbs_iface.c:129
#5  0x00007f597244030a in ucs_callbackq_dispatch (worker=<value optimized out>) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../src/ucs/datastruct/callbackq.h:150
#6  uct_worker_progress (worker=<value optimized out>) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../src/uct/base/uct_worker.c:37
#7  0x00007f5971fdf283 in ucp_worker_progress (worker=0x3b9aa60) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../src/ucp/core/ucp_worker.c:850
#8  0x00007f5971fe22a0 in ucp_worker_flush_inner (worker=0x3b9aa60) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../src/ucp/rma/basic_rma.c:430
#9  ucp_worker_flush (worker=0x3b9aa60) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../src/ucp/rma/basic_rma.c:423
#10 0x000000000057a3b1 in ucp_test_base::entity::flush_worker (this=0x39c0330, worker_index=0) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/ucp/ucp_test.cc:362
#11 0x00000000004f5c70 in test_ucp_memheap::test_nonblocking_implicit_stream_xfer (this=0x3b62650, send=
    (void (test_ucp_memheap::*)(test_ucp_memheap *, ucp_test_base::entity *, size_t, void *, ucp_rkey_h, std::string &)) 0x508980 <test_ucp_rma::nonblocking_get_nbi(ucp_test_base::entity*, size_t, void*, ucp_rkey_h, std::string&)>, 
    size=4730, max_iter=300, alignment=1, malloc_allocate=false, is_ep_flush=false) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/ucp/test_ucp_memheap.cc:118
#12 0x00000000005012e0 in test_ucp_rma_nonblocking_stream_get_nbi_flush_worker_Test::test_body (this=0x3b62650) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/ucp/test_ucp_rma.cc:244
#13 0x0000000000436f9e in ucs::test_base::run (this=0x3b62650) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/test.cc:204
#14 0x0000000000437a0d in ucs::test_base::TestBodyProxy (this=0x3b62650) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/test.cc:230
#15 0x000000000043088d in HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0x3b626a8, method=&virtual testing::Test::TestBody(), location=0x64515a "the test body")
    at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/gtest-all.cc:3562
#16 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x3b626a8, method=&virtual testing::Test::TestBody(), location=0x64515a "the test body")
    at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/gtest-all.cc:3598
#17 0x0000000000428157 in testing::Test::Run (this=0x3b626a8) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/gtest-all.cc:3635
#18 0x000000000042822e in testing::TestInfo::Run (this=0x3ac3ae0) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/gtest-all.cc:3812
#19 0x0000000000428377 in testing::TestCase::Run (this=0x3aa35a0) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/gtest-all.cc:3930
#20 0x000000000042860c in testing::internal::UnitTestImpl::RunAllTests (this=0x3946aa0) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/gtest-all.cc:5802
#21 0x000000000043041d in HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x3946aa0, method=
    (bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl *)) 0x4283f0 <testing::internal::UnitTestImpl::RunAllTests()>, location=0x646110 "auxiliary test code (environments or event listeners)")
    at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/gtest-all.cc:3562
#22 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x3946aa0, method=
    (bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl *)) 0x4283f0 <testing::internal::UnitTestImpl::RunAllTests()>, location=0x646110 "auxiliary test code (environments or event listeners)")
    at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/gtest-all.cc:3598
#23 0x0000000000427859 in testing::UnitTest::Run (this=0x999860) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/gtest-all.cc:5416
#24 0x0000000000431b7f in RUN_ALL_TESTS (argc=1, argv=<value optimized out>) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/gtest.h:20059
#25 main (argc=1, argv=<value optimized out>) at /hpc/mtr_scrap/users/yosefe/ucx/contrib/../test/gtest/common/main.cc:79
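
The dependency described at the top of this comment can be modeled with a toy simulation. This is only a sketch, not UCX code: `ToyWorker`, `connect_pair`, `flush_self_only` and `flush_progress_both` are hypothetical names, and the wireup exchange is reduced to a single request/reply pair as an assumption. It shows why a flush that progresses only its own worker never finishes while the peer still has to answer, and why progressing both sides does finish.

```cpp
#include <cassert>

// Toy model only: every name below is hypothetical, none of this is UCX API.
struct ToyWorker {
    ToyWorker *peer = nullptr;
    bool wireup_request_pending = false; // request from the peer waiting to be handled
    bool wireup_reply_pending   = false; // reply from the peer waiting to be handled
    bool on_real_ep             = false; // false: still using the stub endpoint
    int  queued_ops             = 0;     // operations held back on the stub endpoint

    // One progress call: handle whatever is currently pending on this worker.
    void progress() {
        assert(peer != nullptr);
        if (wireup_request_pending) {
            wireup_request_pending = false;
            peer->wireup_reply_pending = true; // answer the initiator
        }
        if (wireup_reply_pending) {
            wireup_reply_pending = false;
            on_real_ep = true; // switch from the stub to the real endpoint
            queued_ops = 0;    // queued operations are replayed and complete
        }
    }

    bool flushed() const { return queued_ops == 0; }
};

// Sender has queued operations and a wireup request already delivered to the receiver.
void connect_pair(ToyWorker &sender, ToyWorker &receiver) {
    sender = ToyWorker();
    receiver = ToyWorker();
    sender.peer = &receiver;
    receiver.peer = &sender;
    sender.queued_ops = 3;
    receiver.wireup_request_pending = true;
}

// Blocking-style flush: progresses only the flushing worker.
// Bounded by max_iters so the model reports the hang instead of spinning forever.
bool flush_self_only(ToyWorker &w, int max_iters) {
    for (int i = 0; i < max_iters && !w.flushed(); ++i) {
        w.progress();
    }
    return w.flushed();
}

// Flush that also progresses the other worker on every iteration.
bool flush_progress_both(ToyWorker &w, ToyWorker &other, int max_iters) {
    for (int i = 0; i < max_iters && !w.flushed(); ++i) {
        w.progress();
        other.progress();
    }
    return w.flushed();
}
```

In this model, `flush_self_only(sender, 1000)` gives up with the sender still on the stub endpoint, whereas `flush_progress_both(sender, receiver, 1000)` completes once the receiver has replied. Whether the real fix belongs in the test or in the flush implementation is not something this sketch decides.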
@yosefe yosefe added the Bug label Jun 27, 2017

yosefe commented Jun 27, 2017

(gdb) get_flags ((ucp_stub_ep_t*)$ep2->uct_eps[0])->flags
$35 = 0
$36 = 1
(gdb) get_flags ((ucp_stub_ep_t*)$ep->uct_eps[0])->flags
$37 = 1
(gdb) get_flags $ep->flags 
$14 = 0
$15 = 2
$16 = 3
(gdb) get_flags $ep2->flags
$23 = 0
$24 = 1
$25 = 2
$26 = 3


yosefe commented Sep 25, 2017

A similar hang, this time without valgrind:
http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4668/label=hpc-test-node,worker=1/console

20:12:52 [ RUN      ] udrcx/test_ucp_rma.blocking_small/0
20:12:53 [       OK ] udrcx/test_ucp_rma.blocking_small/0 (314 ms)
20:12:53 [ RUN      ] udrcx/test_ucp_rma.nonblocking_stream_get_nbi_flush_ep/0
22:19:16 Build timed out (after 150 minutes). Marking the build as failed.
22:19:16 Build was aborted
22:19:16 TAP Reports Processing: START

@yosefe yosefe changed the title hang in udrc/test_ucp_rma.nonblocking_stream_get_nbi_flush_worker/1 [jenkins] hang in udrc/test_ucp_rma.nonblocking_stream_get_nbi_flush_worker/1 Oct 4, 2017
@yosefe yosefe changed the title [jenkins] hang in udrc/test_ucp_rma.nonblocking_stream_get_nbi_flush_worker/1 hang in udrc/test_ucp_rma.nonblocking_stream_get_nbi_flush_worker/1 Oct 4, 2017