remote access error in dc/test_ucp_tag_xfer.send_contig_recv_contig_exp_rndv_probe/0 #1770

Closed
yosefe opened this issue Aug 19, 2017 · 5 comments

yosefe commented Aug 19, 2017

http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4305/label=hpc-test-node2,worker=0/console

19:34:46 [ RUN      ] dc/test_ucp_tag_xfer.send_contig_recv_contig_exp_rndv_probe/0
19:34:56 [1502987696.852541] [hpc-test-node2:36589:3]      ib_device.c:171  UCX  ERROR IB Async event on mlx5_0: DCT access error on DCTN 0x2a8a3
19:34:56 mlx5: hpc-test-node2: got completion with error:
19:34:56 00000000 00000000 00000000 00000000
19:34:56 00000000 00000000 00000000 00000000
19:34:56 00000002 00000000 00000000 00000000
19:34:56 00000000 00008a12 1002a8af 0001a2d2
19:34:56 [1502987696.853057] [hpc-test-node2:36589:4]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2a8af: remote invalid request error syndrome 0x8a
19:34:56 [1502987696.856578] [hpc-test-node2:36589:4]     ucp_worker.c:410  UCX  ERROR Error Endpoint timeout was not handled for ep 0x9d8f000
yosefe added the Bug label Aug 19, 2017

yosefe commented Aug 20, 2017

http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4311/label=hpc-test-node2,worker=0/console

16:47:45 [ RUN      ] dc/uct_amo_add_test.add64/0
16:47:45 [1503236865.974003] [hpc-test-node2:50936:2]      ib_device.c:171  UCX  ERROR IB Async event on mlx5_0: DCT access error on DCTN 0x2f369
16:47:45 mlx5: hpc-test-node2: got completion with error:
16:47:45 00000000 00000000 00000000 00000000
16:47:45 00000000 00000000 00000000 00000000
16:47:45 00000002 00000000 00000000 00000000
16:47:45 00000000 00008a12 1202f397 0002c0d2
16:47:45 [1503236865.977817] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: remote invalid request error syndrome 0x8a
16:47:45 [1503236865.977833] [hpc-test-node2:50936:3]      uct_iface.c:345  UCX  ERROR Error Endpoint timeout was not handled for ep 0x31b3e30
16:47:45 [1503236865.981600] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981609] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981613] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981617] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981621] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981624] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981628] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981631] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981634] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981637] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981641] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981644] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981647] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981650] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 [1503236865.981653] [hpc-test-node2:50936:3]       dc_verbs.c:593  UCX  ERROR Send completion with error on qp 0x2f397: Work Request Flushed Error syndrome 0xf9
16:47:45 /scrap/jenkins/scrap/workspace/hpc-ucx-pr-3/label/hpc-test-node2/worker/0/contrib/../test/gtest/uct/test_amo.cc:185: Failure
16:47:45 Endpoint timeout
16:47:45 terminate called after throwing an instance of 'ucs::test_abort_exception'
16:47:45   what():  std::exception

yosefe commented Aug 20, 2017

http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4309/label=hpc-test-node2,worker=1/console

16:05:36 [ RUN      ] dc/test_ucp_tag_probe.send_rndv_msg_probe/0
16:05:36 [1503234336.389405] [hpc-test-node2:10248:1]      ib_device.c:171  UCX  ERROR IB Async event on mlx5_0: DCT access error on DCTN 0x2293e
16:05:36 mlx5: hpc-test-node2: got completion with error:
16:05:36 00000000 00000000 00000000 00000000
16:05:36 00000000 00000000 00000000 00000000
16:05:36 00000002 00000000 00000000 00000000
16:05:36 00000000 00008a12 1002294a 0001c6d2

alinask commented Aug 23, 2017

http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4328/label=hpc-arm-04,worker=1/console

18:06:44 [ RUN      ] dc/test_ucp_tag_xfer.send_contig_recv_contig_unexp_rndv/0
18:06:44 mlx5: hpc-arm-04.mtr.labs.mlnx: got completion with error:
18:06:44 00000000 00000000 00000000 00000000
18:06:44 00000000 00000000 00000000 00000000
18:06:44 00000002 00000000 00000000 00000000
18:06:44 00000000 00008a12 100023e8 00016cd2
18:06:44 [1503414404.338204] [hpc-arm-04:92478:0]      ib_device.c:151  UCX  WARN  IB Async event on mlx5_0: DCT access error on DCTN 0x23db
18:06:44 [1503414404.339639] [hpc-arm-04:92478:1]          dc_ep.c:44   UCX  WARN  ep (0x13d39280) is destroyed with 2 outstanding ops
18:06:44 [1503414404.339651] [hpc-arm-04:92478:1]      uct_iface.c:328  UCX  ERROR Error Endpoint timeout was not handled for ep 0x13d39280
18:06:44 [1503414404.341152] [hpc-arm-04:92478:1]       dc_verbs.c:615  UCX  ERROR Send completion with error: remote invalid request error

alinask commented Aug 28, 2017

20:08:31 [ RUN      ] dcx/test_ucp_tag_match.sync_send_unexp_rndv/0
20:08:31 [1503853711.184879] [hpc-arm-04:27141:1]      ib_device.c:171  UCX  ERROR IB Async event on mlx5_0: DCT access error on DCTN 0x15fb1
20:08:31 [1503853711.185637] [hpc-arm-04:27141:0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0x15fc3 wqe[1]: Invalid request (synd 0x12 vend 0x8a) opcode RDMA_READ
20:08:31 [1503853711.186333] [hpc-arm-04:27141:0]     ucp_worker.c:399  UCX  ERROR Error Endpoint timeout was not handled for ep 0xe01dee0
20:08:31 /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-04/worker/3/contrib/../test/gtest/ucp/test_ucp_tag_match.cc:424: Failure
20:08:31 Value of: recvbuf
20:08:31   Actual: { ',' (44, 0x2C), '\x6' (6), '0' (48, 0x30), 'u' (117, 0x75), '\0', '\0', '\0', '\0', '\xC9' (201), '=' (61, 0x3D), '\xE0' (224), '\x93' (147), '\x4' (4), '\0', '\0', '\0', '\xEB' (235), 'i' (105, 0x69), '\xC2' (194), '\xC6' (198), '-' (45, 0x2D), '\0', '\0', '\0', '?' (63, 0x3F), '#' (35, 0x23), '\x98' (152), '\xC3' (195), '\xC9' (201), '\x1' (1), '\0', '\0', ... }
20:08:31 Expected: sendbuf
20:08:31 Which is: { ',' (44, 0x2C), '\x6' (6), '0' (48, 0x30), 'u' (117, 0x75), '\0', '\0', '\0', '\0', '\xC9' (201), '=' (61, 0x3D), '\xE0' (224), '\x93' (147), '\x4' (4), '\0', '\0', '\0', '\xEB' (235), 'i' (105, 0x69), '\xC2' (194), '\xC6' (198), '-' (45, 0x2D), '\0', '\0', '\0', '?' (63, 0x3F), '#' (35, 0x23), '\x98' (152), '\xC3' (195), '\xC9' (201), '\x1' (1), '\0', '\0', ... }
20:08:41 /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-04/worker/3/contrib/../test/gtest/ucp/test_ucp_tag_match.cc:429: Failure
20:08:41 Value of: my_send_req->completed
20:08:41   Actual: false
20:08:41 Expected: true
20:08:41 [1503853721.295566] [hpc-arm-04:27141:0]          mpool.c:38   UCX  WARN  object 0xe2b7f40 was not returned to mpool ucp_requests
20:08:41 [1503853721.457042] [hpc-arm-04:27141:0]         rcache.c:284  UCX  WARN  mlx5_0: destroying inuse region 0xdfb49d0 [0xe52fd40..0xe6483e0] gt- rw ref 1 lkey 0x10803a rkey 0x10803a atomic: lkey 0xffffffff rkey 0xffff
20:08:41 /scrap/jenkins/workspace/hpc-ucx-pr/label/hpc-arm-04/worker/3/contrib/../test/gtest/common/test.cc:228: Failure
20:08:41 Failed
20:08:41 Got 2 warnings during the test
20:08:41 
[  FAILED  ] dcx/test_ucp_tag_match.sync_send_unexp_rndv/0, where GetParam() = \dc_mlx5 (10382 ms)
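
For anyone comparing the raw `mlx5: ... got completion with error:` dumps against the decoded UCX messages, here is a minimal sketch in Python. It rests on an assumption inferred only from the logs in this thread, not from the mlx5 documentation: in the last row of the dump, the second word carries the vendor syndrome and the syndrome in its two low bytes, and the third word carries the QP number in its low 24 bits. The function name `decode_cqe_tail` is hypothetical.

```python
def decode_cqe_tail(row: str) -> dict:
    """Decode the last 16-byte row of an mlx5 error-completion dump.

    ASSUMPTION (inferred from the logs in this thread only):
      word[1] low bytes = vendor syndrome, then syndrome
      word[2] low 24 bits = QP number
    """
    words = [int(w, 16) for w in row.split()]
    if len(words) != 4:
        raise ValueError("expected four 32-bit hex words")
    return {
        "vendor_syndrome": (words[1] >> 8) & 0xFF,
        "syndrome": words[1] & 0xFF,
        "qpn": words[2] & 0xFFFFFF,
    }


if __name__ == "__main__":
    # Last dump row from the first report in this issue.
    fields = decode_cqe_tail("00000000 00008a12 1002a8af 0001a2d2")
    print({name: hex(value) for name, value in fields.items()})
```

Under that assumption, the row from the first report yields vendor syndrome 0x8a, syndrome 0x12 and QP 0x2a8af, which agrees with the `qp 0x2a8af ... syndrome 0x8a` and `synd 0x12 vend 0x8a` lines above.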

yosefe commented Aug 28, 2017

MLNX internal ref: https://redmine.mellanox.com/issues/1117631
