Configuration
OMPI: 4.1.1rc1
MOFED: MLNX_OFED_LINUX-5.1-2.5.8.0
Module: hpcx-gcc (2021-03-05)
Test module: none
Nodes: jazz x4 (ppn=28(x4), nodelist=jazz[05,07,13,15])
MTT log:
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/shmem/20210305_043538_172128_28209_jazz05.swx.labs.mlnx/html/test_stdout_o_g4qM.txt
Cmd:
/hpc/local/benchmarks/daily/next/2021-03-05/hpcx-gcc-redhat7.6/ompi/bin/oshrun -np 112 --display-map --bind-to core -mca oshmem_proc_group_cache_size 10000 -mca sshmem ucx -mca atomic ucx -mca coll '^hcoll' -mca coll_hcoll_enable 0 -mca spml ucx -mca pml ucx -x UCX_HANDLE_ERRORS=bt -x UCX_TLS=tcp -x UCX_NET_DEVICES=p2p2 --map-by slot -x SHMEM_SYMMETRIC_HEAP_SIZE=128M /hpc/mtr_scrap/users/mtt/scratch/shmem/20210305_043538_172128_28209_jazz05.swx.labs.mlnx/installs/Be3t/tests/verifier/tests-mellanox.git/verifier/install/bin/oshmem_test exec --no-colour --task=analysis:tc2 --task=analysis:tc3 --task=analysis:tc4 --task=analysis:tc5 --duration 10
Output:
/hpc/local/benchmarks/daily/next/2021-03-05/hpcx-gcc-redhat7.6/ompi/bin/oshrun -np 112 --display-map --bind-to core -mca oshmem_proc_group_cache_size 10000 -mca sshmem ucx -mca atomic ucx -mca coll '^hcoll' -mca coll_hcoll_enable 0 -mca spml ucx -mca pml ucx -x UCX_HANDLE_ERRORS=bt -x UCX_TLS=tcp -x UCX_NET_DEVICES=p2p2 --map-by slot -x SHMEM_SYMMETRIC_HEAP_SIZE=128M /hpc/mtr_scrap/users/mtt/scratch/shmem/20210305_043538_172128_28209_jazz05.swx.labs.mlnx/installs/Be3t/tests/verifier/tests-mellanox.git/verifier/install/bin/oshmem_test exec --no-colour --task=analysis:tc2 --task=analysis:tc3 --task=analysis:tc4 --task=analysis:tc5 --duration 10
PASS analysis reduce Reduce performance.
[1614929386.419906] [jazz13:37304:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x2909ca0 (state=CONNECTED): recv(31) failed: Operation timed out
[1614929386.419906] [jazz13:37339:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x31a16c0 (state=CONNECTED): recv(62) failed: Operation timed out
[1614929386.427160] [jazz07:181620:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x37b76c0 (state=CONNECTED): recv(42) failed: Connection reset by remote peer
[1614929386.427156] [jazz07:181643:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x2744ea0 (state=CONNECTED): recv(59) failed: Connection reset by remote peer
[1614929386.628724] [jazz05:170975:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x2bd79d0 (state=CONNECTED): recv(28) failed: Operation timed out
[1614929387.927145] [jazz05:171018:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x28dc190 (state=CONNECTED): recv(36) failed: Operation timed out
[1614929387.927067] [jazz05:170997:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x23fa1d0 (state=CONNECTED): recv(36) failed: Operation timed out
[1614929387.927036] [jazz05:170966:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x2f60240 (state=CONNECTED): recv(59) failed: Operation timed out
[1614929387.927096] [jazz05:170979:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x323d270 (state=CONNECTED): recv(42) failed: Operation timed out
[1614929387.934330] [jazz07:181608:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x2e86840 (state=CONNECTED): recv(34) failed: Operation timed out
[1614929387.934369] [jazz07:181616:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x3e51c80 (state=CONNECTED): recv(58) failed: Operation timed out
[1614929387.934381] [jazz07:181621:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x3d00010 (state=CONNECTED): recv(49) failed: Operation timed out
[1614929387.995123] [jazz05:171004:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x29bd7c0 (state=CONNECTED): recv(30) failed: Connection reset by remote peer
[1614929388.005963] [jazz05:171009:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x39da640 (state=CONNECTED): recv(42) failed: Connection reset by remote peer
[1614929388.011493] [jazz05:170997:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x246d8e0 (state=CONNECTED): recv(43) failed: Connection reset by remote peer
[1614929388.644361] [jazz13:37298:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x2cdff90 (state=CONNECTED): recv(53) failed: Operation timed out
[1614929388.734877] [jazz07:181614:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x3aa10c0 (state=CONNECTED): recv(54) failed: Connection reset by remote peer
[1614929388.938531] [jazz07:181613:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x3cf01c0 (state=CONNECTED): recv(44) failed: Operation timed out
[1614929388.938494] [jazz07:181608:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x2e77ca0 (state=CONNECTED): recv(52) failed: Operation timed out
[1614929388.974960] [jazz05:171018:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x2919820 (state=CONNECTED): recv(33) failed: Connection reset by remote peer
[1614929388.975050] [jazz05:170991:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x2520840 (state=CONNECTED): recv(57) failed: Connection reset by remote peer
[1614929417.975222] [jazz15:193006:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x28dcb80 (state=CONNECTED): recv(33) failed: Connection reset by remote peer
[1614929417.975206] [jazz15:193016:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x38554f0 (state=CONNECTED): recv(57) failed: Connection reset by remote peer
[1614929417.975264] [jazz15:193024:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x330c420 (state=CONNECTED): recv(59) failed: Connection reset by remote peer
[jazz05.swx.labs.mlnx:170975] pml_ucx.c:890 Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
[jazz05.swx.labs.mlnx:170997] pml_ucx.c:890 Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
[jazz05.swx.labs.mlnx:170991] pml_ucx.c:890 Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
[jazz05.swx.labs.mlnx:171009] pml_ucx.c:890 Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
[jazz05.swx.labs.mlnx:171004] pml_ucx.c:890 Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
[jazz05.swx.labs.mlnx:171018] pml_ucx.c:890 Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
node=jazz05, pid=171018:
Thread 4 (Thread 0x7f4dca343700 (LWP 171038)):
#0 0x00007f4dccc69483 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f4dce511133 in epoll_dispatch (base=0x2733ae0, tv=<optimized out>) at epoll.c:407
#2 0x00007f4dce514b80 in opal_libevent2022_event_base_loop (base=0x2733ae0, flags=flags@entry=1) at event.c:1630
#3 0x00007f4dce4cf46e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4 0x00007f4dccf3fdd5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f4dccc68ead in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f4dc6d2f700 (LWP 171049)):
#0 0x00007f4dccc69483 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f4dce511133 in epoll_dispatch (base=0x2759770, tv=<optimized out>) at epoll.c:407
#2 0x00007f4dce514b80 in opal_libevent2022_event_base_loop (base=0x2759770, flags=flags@entry=1) at event.c:1630
#3 0x00007f4dc947834e in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:232
#4 0x00007f4dccf3fdd5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f4dccc68ead in clone () from /lib64/libc.so.6
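For triage, it may help to tally the UCX TCP errors in the output per node and per failure reason. The following is a minimal sketch, not part of the test suite: the function name `tally_ucx_errors` and the regular expression are assumptions built only from the line shape seen in this output, and it recognizes only the two failure reasons that appear here ("Operation timed out" and "Connection reset by remote peer").

```python
import re
from collections import Counter

# Assumed line shape, taken from the output in this report, e.g.:
# [1614929386.419906] [jazz13:37304:0] tcp_ep.c:1090 UCX ERROR tcp_ep 0x2909ca0
#   (state=CONNECTED): recv(31) failed: Operation timed out
# Only the two failure reasons seen in this report are matched.
UCX_ERR = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\s*"
    r"\[(?P<host>[\w.-]+):(?P<pid>\d+):\d+\]\s+"
    r"tcp_ep\.c:\d+\s+UCX\s+ERROR\s+tcp_ep\s+0x[0-9a-f]+\s+"
    r"\(state=(?P<state>\w+)\):\s+recv\(\d+\)\s+failed:\s+"
    r"(?P<reason>Operation timed out|Connection reset by remote peer)"
)


def tally_ucx_errors(log_text):
    """Return (counts, first_seen) for UCX tcp_ep recv errors in log_text.

    counts maps (host, reason) to the number of matching entries;
    first_seen maps host to the earliest timestamp logged for it.
    The whole text is scanned with finditer, so entries that were
    run together on one line are still found.
    """
    counts = Counter()
    first_seen = {}
    for match in UCX_ERR.finditer(log_text):
        host = match.group("host")
        counts[(host, match.group("reason"))] += 1
        ts = float(match.group("ts"))
        first_seen[host] = min(ts, first_seen.get(host, ts))
    return counts, first_seen
```

Passing the saved test output to `tally_ucx_errors` and sorting `first_seen` by value gives the order in which nodes started reporting errors. In the output above, the jazz15 entries are stamped around 1614929417, roughly 30 seconds after the first errors on jazz05, jazz07 and jazz13.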
dmitrygx