wireup.c:473 Fatal: endpoint reconfiguration not supported yet #1534

Closed
brminich opened this issue May 21, 2017 · 1 comment

@brminich
Contributor

This problem occurs on Orion with the following command line:
mpirun -np 4 --bind-to core --report-bindings -mca pml ucx -x UCX_TLS=rc ./sendself

The workaround is to specify a particular network device (UCX_NET_DEVICES).
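
For example, a hedged sketch of the workaround (the device name mlx5_0:1 is an assumption; substitute an HCA/port that is actually present on the system, e.g. as listed by ibv_devinfo):

mpirun -np 4 --bind-to core --report-bindings -mca pml ucx -x UCX_TLS=rc -x UCX_NET_DEVICES=mlx5_0:1 ./sendself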

Need to check how device priority is handled.

brminich added the Bug label May 21, 2017
yosefe closed this as completed in eb7fd1b May 30, 2017
alinask added a commit to alinask/ucx that referenced this issue May 30, 2017
- Add the latency.overhead to the passed address so that each rank can
  see the same values when selecting a lane, since this value may be
  different for different ranks.

- Consider the remote peer's bandwidth in the rndv score function; this
  allows supporting cases where different ranks have different speeds on
  their HCAs (heterogeneous fabric).

- Enhance the logging for pack/unpack address to include the priority of
  the device and the latency overhead.

fixes openucx#1534

(cherry picked from commit eb7fd1b)
alinask reopened this Jun 11, 2017
@alinask
Contributor

alinask commented Jun 11, 2017

Reopening this issue since this error is printed again.

  1. The issue happens when more than one HCA is specified on the command line.
  2. Setting the UCX_IB_PREFER_NEAREST_DEVICE parameter to 'no' eliminates the issue (see the example command at the end of this comment).

To reproduce:

/hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -x UCX_IB_PREFER_NEAREST_DEVICE=y -np 512 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1,mlx5_0:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170610_005456_13446_748055_clx-hercules-001/installs/PbKb/tests/mpich_tests/mpich-mellanox.git/test/mpi/basic/patterns

  3. The problem seems to happen when both sockets are used: with 16 hosts, the test passes with 256 ranks (since only the first socket is used on each host, 16 cores per socket) but fails with more ranks.
    This is why it also reproduces with only 2 ranks on one host when using --map-by socket:

/hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -x UCX_IB_PREFER_NEAREST_DEVICE=y -np 2 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1,mlx5_0:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by socket --display-map /hpc/scrap/users/mtt/scratch/ucx_ompi/20170610_005456_13446_748055_clx-hercules-001/installs/PbKb/tests/mpich_tests/mpich-mellanox.git/test/mpi/basic/patterns

======================== JOB MAP ========================

Data for node: clx-hercules-081 Num slots: 32 Max slots: 0 Num procs: 2
Process OMPI jobid: [59095,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././././././././././././././.][./././././././././././././././.]
Process OMPI jobid: [59095,1] App: 0 Process rank: 1 Bound: socket 1[core 16[hwt 0]]:[./././././././././././././././.][B/././././././././././././././.]

=============================================================

links from mtt:
http://e2e-gw.mellanox.com:4080//hpc/scrap/users/mtt/scratch/ucx_ompi/20170609_042941_8216_747906_clx-hercules-065/html/test_stdout_WR4XBq.txt

http://e2e-gw.mellanox.com:4080//hpc/scrap/users/mtt/scratch/ucx_ompi/20170609_083322_21791_747909_clx-hercules-065/html/test_stdout_U6Qul1.txt

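For reference, a sketch of the workaround from item 2 above: the same 2-rank --map-by socket reproducer, with only UCX_IB_PREFER_NEAREST_DEVICE changed from 'y' to 'no' (per the observation above, this setting eliminates the error):

/hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -x UCX_IB_PREFER_NEAREST_DEVICE=no -np 2 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1,mlx5_0:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by socket --display-map /hpc/scrap/users/mtt/scratch/ucx_ompi/20170610_005456_13446_748055_clx-hercules-001/installs/PbKb/tests/mpich_tests/mpich-mellanox.git/test/mpi/basic/patterns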

alinask added a commit to alinask/ucx that referenced this issue Jun 15, 2017
- Added a small value to overcome floating-point imprecision in the
  score calculations.
- Print the lane configurations in case the tl-dev selection is
  incorrect.

fixes openucx#1534
yosefe closed this as completed in 075b7ba Jun 16, 2017
boehms pushed a commit to boehms/ucx that referenced this issue Oct 17, 2017
(cherry picked from commit eb7fd1b; same commit message as the May 30 commit above, fixes openucx#1534)