[mtt] failure on the icalltoall test #2056

Closed
alinask opened this issue Dec 10, 2017 · 3 comments


alinask commented Dec 10, 2017

ppn=28

Command line:

/hpc/local/benchmarks/hpcx_install_2017-12-07/hpcx-gcc-redhat7.4/ompi-v3.1.x/bin/mpirun -np 224 --debug-daemons --display-map -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_4:1 -mca btl_openib_if_include mlx5_4:1 -mca coll '^hcoll' -x UCX_IB_GID_INDEX=0 -mca osc ucx -x UCX_TLS=rc,sm -x UCX_RC_VERBS_TM_ENABLE=y --map-by node /mnt/lustre/users/mtt/scratch/ucx_ompi/20171208_021625_23775_41003_clx-orion-001/installs/npmP/tests/mpich_tests/mpich-mellanox.git/test/mpi/coll/icalltoall

MPIR_server_arguments: NULL
Fri Dec  8 03:57:17 2017[1,78]<stderr>:[clx-orion-009:7956 :0]   ptr_array.c:191  Assertion `!ucs_ptr_array_is_free(ptr_array, index)' failed
Fri Dec  8 03:57:18 2017[1,78]<stderr>:==== backtrace ====
Fri Dec  8 03:57:18 2017[1,78]<stderr>: 0 0x00000000000438d7 ucs_ptr_array_remove()  /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/datastruct/ptr_array.c:191
Fri Dec  8 03:57:18 2017[1,78]<stderr>: 1 0x0000000000028ad5 uct_rc_ep_tag_rndv_cancel()  /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/ib/rc/base/rc_ep.c:531
Fri Dec  8 03:57:18 2017[1,78]<stderr>: 2 0x0000000000022063 uct_ep_tag_rndv_cancel()  /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/api/uct.h:2237
Fri Dec  8 03:57:18 2017[1,78]<stderr>: 3 0x000000000001c4c5 ucp_rndv_ats_handler()  /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/ucp/tag/rndv.c:613
Fri Dec  8 03:57:18 2017[1,78]<stderr>: 4 0x000000000002e0ae uct_iface_invoke_am()  /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/uct/base/uct_iface.h:514
Fri Dec  8 03:57:18 2017[1,78]<stderr>: 5 0x0000000000015052 ucs_callbackq_dispatch()  /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ucx-master/src/ucs/datastruct/callbackq.h:168
Fri Dec  8 03:57:18 2017[1,78]<stderr>: 6 0x0000000000003277 mca_pml_ucx_progress()  /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/mca/pml/ucx/pml_ucx.c:451
Fri Dec  8 03:57:18 2017[1,78]<stderr>: 7 0x0000000000032a04 opal_progress()  /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/opal/runtime/opal_progress.c:222
Fri Dec  8 03:57:18 2017[1,78]<stderr>: 8 0x0000000000048ded sync_wait_st()  /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/../opal/threads/wait_sync.h:83
Fri Dec  8 03:57:18 2017[1,78]<stderr>: 9 0x0000000000002d7e mca_coll_basic_alltoall_inter()  /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/mca/coll/basic/coll_basic_alltoall.c:115
Fri Dec  8 03:57:18 2017[1,78]<stderr>:10 0x000000000005b911 PMPI_Alltoall()  /hpc/local/benchmarks/hpcx_install_2017-12-07/src/hpcx-gcc-redhat7.4/ompi-v3.1.x/ompi/mpi/c/profile/palltoall.c:107
Fri Dec  8 03:57:18 2017[1,78]<stderr>:11 0x00000000004024b3 MTest_Alltoall()  icalltoall.c:0
Fri Dec  8 03:57:18 2017[1,78]<stderr>:12 0x0000000000402630 main()  ???:0
Fri Dec  8 03:57:18 2017[1,78]<stderr>:13 0x0000000000021b35 __libc_start_main()  ???:0
Fri Dec  8 03:57:18 2017[1,78]<stderr>:14 0x00000000004023a9 _start()  ???:0
Fri Dec  8 03:57:18 2017[1,78]<stderr>:===================
Fri Dec  8 03:57:18 2017[1,78]<stderr>:[clx-orion-009:7956 :0] Process frozen...

http://e2e-gw.mellanox.com:4080//mnt/lustre/users/mtt/scratch/ucx_ompi/20171208_021625_23775_41003_clx-orion-001/html/test_stdout_XTg2Hz.txt

Failed to reproduce this after multiple attempts.

@alinask alinask added the Bug label Dec 10, 2017

alinask commented Dec 10, 2017

ucx_2056_traces.txt
Attached a trace from all the processes.

@yosefe yosefe added the MTT MTT Error label Dec 11, 2017

alinask commented Dec 18, 2017

AFAIR, I ran this command line in a loop around 50 times and the failure did not reproduce.

@brminich

The problem was that the RNDV receiver (the side that performs get_zcopy) sent a duplicate ATS to the sender.
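
Judging from the backtrace above (ucp_rndv_ats_handler() -> uct_ep_tag_rndv_cancel() -> ucs_ptr_array_remove()), a second ATS for the same request would presumably try to free a pointer-array slot that the first ATS had already freed. The toy model below is only a sketch of that failure mode under this assumption. Every name in it (`toy_ptr_array_t`, `toy_send_req_t`, `toy_on_ats`) is hypothetical and is not the UCX API or the actual fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ARRAY_SIZE 8

/* Toy pointer array (hypothetical, not ucs_ptr_array_t): NULL marks a free slot. */
typedef struct {
    void *slots[TOY_ARRAY_SIZE];
} toy_ptr_array_t;

static void toy_ptr_array_init(toy_ptr_array_t *array)
{
    for (size_t i = 0; i < TOY_ARRAY_SIZE; ++i) {
        array->slots[i] = NULL;
    }
}

/* Store a non-NULL pointer in the first free slot; return its index, or -1 if full. */
static int toy_ptr_array_insert(toy_ptr_array_t *array, void *ptr)
{
    assert(ptr != NULL);
    for (int i = 0; i < TOY_ARRAY_SIZE; ++i) {
        if (array->slots[i] == NULL) {
            array->slots[i] = ptr;
            return i;
        }
    }
    return -1;
}

/*
 * Free a slot. Returns false for an out-of-range or already-free slot;
 * the already-free case is what the assertion in the log guards against.
 */
static bool toy_ptr_array_remove(toy_ptr_array_t *array, int index)
{
    if (index < 0 || index >= TOY_ARRAY_SIZE || array->slots[index] == NULL) {
        return false;
    }
    array->slots[index] = NULL;
    return true;
}

/* Hypothetical sender-side request that remembers its rendezvous slot. */
typedef struct {
    int rndv_index;
} toy_send_req_t;

/* Hypothetical ATS handler: unconditionally cancels the rendezvous operation. */
static bool toy_on_ats(toy_ptr_array_t *array, const toy_send_req_t *req)
{
    return toy_ptr_array_remove(array, req->rndv_index);
}
```

In this model, calling `toy_on_ats()` once on a request succeeds, while a second call on the same request returns false, which corresponds to the point where the real code asserts.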
