[mtt] ucx-emulation-tcp: tcp_ep 0x37b76c0 (state=CONNECTED): recv(42) faile #6471

Closed
avildema opened this issue Mar 9, 2021 · 0 comments · Fixed by #7093
Assignees
Labels
Bug MTT MTT Error

avildema commented Mar 9, 2021

Configuration

OMPI: 4.1.1rc1
MOFED: MLNX_OFED_LINUX-5.1-2.5.8.0
Module: hpcx-gcc (2021-03-05)
Test module: none
Nodes: jazz x4 (ppn=28(x4), nodelist=jazz[05,07,13,15])

 
MTT log:
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/shmem/20210305_043538_172128_28209_jazz05.swx.labs.mlnx/html/test_stdout_o_g4qM.txt
 
Cmd:
/hpc/local/benchmarks/daily/next/2021-03-05/hpcx-gcc-redhat7.6/ompi/bin/oshrun -np 112 --display-map --bind-to core -mca oshmem_proc_group_cache_size 10000 -mca sshmem ucx -mca atomic ucx -mca coll '^hcoll' -mca coll_hcoll_enable 0 -mca spml ucx -mca pml ucx -x UCX_HANDLE_ERRORS=bt -x UCX_TLS=tcp -x UCX_NET_DEVICES=p2p2 --map-by slot -x SHMEM_SYMMETRIC_HEAP_SIZE=128M /hpc/mtr_scrap/users/mtt/scratch/shmem/20210305_043538_172128_28209_jazz05.swx.labs.mlnx/installs/Be3t/tests/verifier/tests-mellanox.git/verifier/install/bin/oshmem_test exec --no-colour --task=analysis:tc2 --task=analysis:tc3 --task=analysis:tc4 --task=analysis:tc5 --duration 10
 
Output:

PASS   analysis   reduce         Reduce performance.
[1614929386.419906] [jazz13:37304:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x2909ca0 (state=CONNECTED): recv(31) failed: Operation timed out
[1614929386.419906] [jazz13:37339:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x31a16c0 (state=CONNECTED): recv(62) failed: Operation timed out
[1614929386.427160] [jazz07:181620:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x37b76c0 (state=CONNECTED): recv(42) failed: Connection reset by remote peer
[1614929386.427156] [jazz07:181643:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x2744ea0 (state=CONNECTED): recv(59) failed: Connection reset by remote peer
[1614929386.628724] [jazz05:170975:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x2bd79d0 (state=CONNECTED): recv(28) failed: Operation timed out
[1614929387.927145] [jazz05:171018:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x28dc190 (state=CONNECTED): recv(36) failed: Operation timed out
[1614929387.927067] [jazz05:170997:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x23fa1d0 (state=CONNECTED): recv(36) failed: Operation timed out
[1614929387.927036] [jazz05:170966:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x2f60240 (state=CONNECTED): recv(59) failed: Operation timed out
[1614929387.927096] [jazz05:170979:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x323d270 (state=CONNECTED): recv(42) failed: Operation timed out
[1614929387.934330] [jazz07:181608:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x2e86840 (state=CONNECTED): recv(34) failed: Operation timed out
[1614929387.934369] [jazz07:181616:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x3e51c80 (state=CONNECTED): recv(58) failed: Operation timed out
[1614929387.934381] [jazz07:181621:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x3d00010 (state=CONNECTED): recv(49) failed: Operation timed out
[1614929387.995123] [jazz05:171004:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x29bd7c0 (state=CONNECTED): recv(30) failed: Connection reset by remote peer
[1614929388.005963] [jazz05:171009:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x39da640 (state=CONNECTED): recv(42) failed: Connection reset by remote peer
[1614929388.011493] [jazz05:170997:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x246d8e0 (state=CONNECTED): recv(43) failed: Connection reset by remote peer
[1614929388.644361] [jazz13:37298:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x2cdff90 (state=CONNECTED): recv(53) failed: Operation timed out
[1614929388.734877] [jazz07:181614:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x3aa10c0 (state=CONNECTED): recv(54) failed: Connection reset by remote peer
[1614929388.938531] [jazz07:181613:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x3cf01c0 (state=CONNECTED): recv(44) failed: Operation timed out
[1614929388.938494] [jazz07:181608:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x2e77ca0 (state=CONNECTED): recv(52) failed: Operation timed out
[1614929388.974960] [jazz05:171018:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x2919820 (state=CONNECTED): recv(33) failed: Connection reset by remote peer
[1614929388.975050] [jazz05:170991:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x2520840 (state=CONNECTED): recv(57) failed: Connection reset by remote peer
[1614929417.975222] [jazz15:193006:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x28dcb80 (state=CONNECTED): recv(33) failed: Connection reset by remote peer
[1614929417.975206] [jazz15:193016:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x38554f0 (state=CONNECTED): recv(57) failed: Connection reset by remote peer
[1614929417.975264] [jazz15:193024:0]          tcp_ep.c:1090 UCX  ERROR tcp_ep 0x330c420 (state=CONNECTED): recv(59) failed: Connection reset by remote peer
[jazz05.swx.labs.mlnx:170975] pml_ucx.c:890  Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
[jazz05.swx.labs.mlnx:170997] pml_ucx.c:890  Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
[jazz05.swx.labs.mlnx:170991] pml_ucx.c:890  Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
[jazz05.swx.labs.mlnx:171009] pml_ucx.c:890  Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
[jazz05.swx.labs.mlnx:171004] pml_ucx.c:890  Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
[jazz05.swx.labs.mlnx:171018] pml_ucx.c:890  Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
node=jazz05, pid=171018:
Thread 4 (Thread 0x7f4dca343700 (LWP 171038)):
#0  0x00007f4dccc69483 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f4dce511133 in epoll_dispatch (base=0x2733ae0, tv=<optimized out>) at epoll.c:407
#2  0x00007f4dce514b80 in opal_libevent2022_event_base_loop (base=0x2733ae0, flags=flags@entry=1) at event.c:1630
#3  0x00007f4dce4cf46e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007f4dccf3fdd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f4dccc68ead in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f4dc6d2f700 (LWP 171049)):
#0  0x00007f4dccc69483 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f4dce511133 in epoll_dispatch (base=0x2759770, tv=<optimized out>) at epoll.c:407
#2  0x00007f4dce514b80 in opal_libevent2022_event_base_loop (base=0x2759770, flags=flags@entry=1) at event.c:1630
#3  0x00007f4dc947834e in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4  0x00007f4dccf3fdd5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f4dccc68ead in clone () from /lib64/libc.so.6
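
To see which nodes report which failure in the output above, the UCX error lines can be tallied by node and reason. The following is a minimal sketch and not part of MTT or UCX: the regular expression is an assumption inferred from the log lines in this report, and the names `UCX_ERR` and `tally_ucx_errors` are hypothetical. It also copes with entries that were written back to back on one line without a newline.

```python
import re
from collections import Counter

# Assumed pattern, inferred from the log lines in this report;
# it is not an official UCX log grammar.
UCX_ERR = re.compile(
    r"\[\d+\.\d+\] \[(?P<node>[^:\]]+):(?P<pid>\d+):\d+\]\s+"
    r"tcp_ep\.c:\d+\s+UCX\s+ERROR tcp_ep 0x[0-9a-f]+ "
    r"\(state=(?P<state>\w+)\): recv\((?P<arg>\d+)\) failed: "
    # The reason runs until the next "[timestamp] [" or the end of the line,
    # so two entries fused on one line are still counted separately.
    r"(?P<reason>.+?)(?=\[\d+\.\d+\] \[|$)"
)


def tally_ucx_errors(text):
    """Count UCX tcp_ep recv errors per (node, reason) pair."""
    counts = Counter()
    for line in text.splitlines():
        for match in UCX_ERR.finditer(line):
            key = (match.group("node"), match.group("reason").strip())
            counts[key] += 1
    return counts
```

Calling `tally_ucx_errors` on the saved test stdout returns a `Counter` keyed by `(node, reason)`, which makes it easier to tell whether the timeouts and the connection resets cluster on particular nodes.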