ppn=28
Command line:
/hpc/local/benchmarks/hpcx_install_2017-12-07/hpcx-gcc-redhat7.4/ompi-v3.1.x/bin/mpirun -np 224 --debug-daemons --display-map -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_4:1 -mca btl_openib_if_include mlx5_4:1 -mca coll '^hcoll' -x UCX_IB_GID_INDEX=0 -mca osc ucx -x UCX_TLS=rc,sm -x UCX_RC_VERBS_TM_ENABLE=y --map-by node /mnt/lustre/users/mtt/scratch/ucx_ompi/20171208_021625_23775_41003_clx-orion-001/installs/npmP/tests/mpich_tests/mpich-mellanox.git/test/mpi/coll/icalltoall
MPIR_server_arguments: NULL
Fri Dec 8 03:57:17 2017[1,78]<stderr>:[clx-orion-009:7956 :0] ptr_array.c:191 Assertion `!ucs_ptr_array_is_free(ptr_array, index)' failed
Fri Dec 8 03:57:18 2017[1,78]<stderr>:==== backtrace ====
Fri Dec 8 03:57:18 2017[1,78]<stderr>: 0 0x00000000000438d7 ucs_ptr_array_remove() /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/datastruct/ptr_array.c:191
Fri Dec 8 03:57:18 2017[1,78]<stderr>: 1 0x0000000000028ad5 uct_rc_ep_tag_rndv_cancel() /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/ib/rc/base/rc_ep.c:531
Fri Dec 8 03:57:18 2017[1,78]<stderr>: 2 0x0000000000022063 uct_ep_tag_rndv_cancel() /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/api/uct.h:2237
Fri Dec 8 03:57:18 2017[1,78]<stderr>: 3 0x000000000001c4c5 ucp_rndv_ats_handler() /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/ucp/tag/rndv.c:613
Fri Dec 8 03:57:18 2017[1,78]<stderr>: 4 0x000000000002e0ae uct_iface_invoke_am() /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/base/uct_iface.h:514
Fri Dec 8 03:57:18 2017[1,78]<stderr>: 5 0x0000000000015052 ucs_callbackq_dispatch() /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/datastruct/callbackq.h:168
Fri Dec 8 03:57:18 2017[1,78]<stderr>: 6 0x0000000000003277 mca_pml_ucx_progress() /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/mca/pml/ucx/pml_ucx.c:451
Fri Dec 8 03:57:18 2017[1,78]<stderr>: 7 0x0000000000032a04 opal_progress() /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/opal/runtime/opal_progress.c:222
Fri Dec 8 03:57:18 2017[1,78]<stderr>: 8 0x0000000000048ded sync_wait_st() /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/../opal/threads/wait_sync.h:83
Fri Dec 8 03:57:18 2017[1,78]<stderr>: 9 0x0000000000002d7e mca_coll_basic_alltoall_inter() /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/mca/coll/basic/coll_basic_alltoall.c:115
Fri Dec 8 03:57:18 2017[1,78]<stderr>:10 0x000000000005b911 PMPI_Alltoall() /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/mpi/c/profile/palltoall.c:107
Fri Dec 8 03:57:18 2017[1,78]<stderr>:11 0x00000000004024b3 MTest_Alltoall() icalltoall.c:0
Fri Dec 8 03:57:18 2017[1,78]<stderr>:12 0x0000000000402630 main() ???:0
Fri Dec 8 03:57:18 2017[1,78]<stderr>:13 0x0000000000021b35 __libc_start_main() ???:0
Fri Dec 8 03:57:18 2017[1,78]<stderr>:14 0x00000000004023a9 _start() ???:0
Fri Dec 8 03:57:18 2017[1,78]<stderr>:===================
Fri Dec 8 03:57:18 2017[1,78]<stderr>:[clx-orion-009:7956 :0] Process frozen...
http://e2e-gw.mellanox.com:4080//mnt/lustre/users/mtt/scratch/ucx_ompi/20171208_021625_23775_41003_clx-orion-001/html/test_stdout_XTg2Hz.txt
Failed to reproduce this after multiple attempts.
Attached a trace from all the processes: ucx_2056_traces.txt
AFAIR, I ran this command line in a loop around 50 times and it didn't reproduce.
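A reproduction loop of this kind could be sketched as below. This is an assumption about how such a loop might look, not the script that was actually used: `run_until_failure` and its `timeout` parameter are hypothetical, and the full `mpirun` command line from the report would be passed as `cmd`. Because the failing process reports "Process frozen...", a run that exceeds the timeout is counted as a failure.

```python
import subprocess
import sys


def run_until_failure(cmd, attempts=50, timeout=600):
    """Run cmd up to `attempts` times.

    Returns the 1-based index of the first run that exits non-zero or
    exceeds `timeout` seconds, or None if every run succeeds.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A frozen process never exits, so treat a timeout as a failure.
            return attempt
        if result.returncode != 0:
            # Keep the output of the failing run for later inspection.
            sys.stderr.write(result.stdout.decode(errors="replace"))
            return attempt
    return None
```

Calling `run_until_failure(cmd)` with `cmd` set to the argument list of the command line above returns `None` when none of the 50 runs fail, which matches the outcome described here.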
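The problem was that the RNDV receiver (the side that performs get_zcopy) sent a duplicate ATS to the sender. This is consistent with the backtrace: ucp_rndv_ats_handler() on the sender calls uct_ep_tag_rndv_cancel(), so a second ATS for the same operation would presumably make ucs_ptr_array_remove() run on an index that is already free.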
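The failure mode can be illustrated with a small model. The sketch below is not UCX code: `PtrArray` is a hypothetical Python stand-in for a slot array whose `remove()` refuses to release a slot that is already free (in the spirit of the assertion in the backtrace), and `on_ats` is a hypothetical handler that releases the slot of a rendezvous operation when an ATS arrives. Delivering the same ATS twice triggers the check on the second removal.

```python
class PtrArray:
    """Toy slot array: remove() fails if the slot is already free."""

    _FREE = object()  # sentinel marking a free slot

    def __init__(self):
        self._slots = []

    def insert(self, value):
        # Reuse a free slot if one exists, otherwise append a new one.
        for index, slot in enumerate(self._slots):
            if slot is self._FREE:
                self._slots[index] = value
                return index
        self._slots.append(value)
        return len(self._slots) - 1

    def is_free(self, index):
        return self._slots[index] is self._FREE

    def remove(self, index):
        # Raised explicitly so the check also fires under `python -O`.
        if self.is_free(index):
            raise AssertionError("slot %d is already free" % index)
        self._slots[index] = self._FREE


def on_ats(ops, index):
    """Hypothetical ATS handler: releases the slot of the finished operation."""
    ops.remove(index)


def deliver(ops, index, times):
    """Deliver the same ATS `times` times; report whether the check fired."""
    try:
        for _ in range(times):
            on_ats(ops, index)
    except AssertionError:
        return "assertion"
    return "ok"


ops = PtrArray()
single = deliver(ops, ops.insert("rndv-op-A"), times=1)
duplicate = deliver(ops, ops.insert("rndv-op-B"), times=2)
print(single, duplicate)
```

In this model a single ATS releases the slot cleanly, while the duplicate hits the free-slot check. A real fix would have to stop the receiver from sending the second ATS, or make the sender ignore it; which of these was chosen is not stated here.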
brminich