wireup.c:473 Fatal: endpoint reconfiguration not supported yet #1534

Closed
brminich opened this issue May 21, 2017 · 1 comment

@brminich
Contributor

This problem occurs on Orion with the following command line:
mpirun -np 4 --bind-to core --report-bindings -mca pml ucx -x UCX_TLS=rc ./sendself

The workaround is to specify a particular network device (UCX_NET_DEVICES).
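
For example, a hedged sketch of the workaround (the device name mlx5_0:1 is an assumption; substitute an HCA/port that is actually present on the system, e.g. as listed by ibv_devinfo):

mpirun -np 4 --bind-to core --report-bindings -mca pml ucx -x UCX_TLS=rc -x UCX_NET_DEVICES=mlx5_0:1 ./sendself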

Need to check how device priority is handled.

brminich added the Bug label May 21, 2017
yosefe closed this as completed in eb7fd1b May 30, 2017
alinask added a commit to alinask/ucx that referenced this issue May 30, 2017
- Add the latency.overhead to the passed address so that each rank can
  see the same values when selecting a lane, since this value may be
  different for different ranks.

- Consider the remote peer's bandwidth in the rndv score function; this
  allows supporting cases where different ranks have different speeds on
  their HCAs (heterogeneous fabric).

- Enhance the logging for pack/unpack address to include the priority of
  the device and the latency overhead.

fixes openucx#1534

(cherry picked from commit eb7fd1b)
alinask reopened this Jun 11, 2017
@alinask
Contributor

alinask commented Jun 11, 2017

Reopening this issue since this error is printed again.

  1. The issue happens when more than one HCA is specified on the command line.
  2. Setting the UCX_IB_PREFER_NEAREST_DEVICE parameter to 'no' eliminates the issue (see the example command at the end of this comment).

To reproduce:

/hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -x UCX_IB_PREFER_NEAREST_DEVICE=y -np 512 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1,mlx5_0:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170610_005456_13446_748055_clx-hercules-001/installs/PbKb/tests/mpich_tests/mpich-mellanox.git/test/mpi/basic/patterns

  3. The problem seems to happen when both sockets are used: with 16 hosts, the test passes with 256 ranks (since only the first socket is used on each host, 16 cores per socket) but fails with more ranks.
    This is why it also reproduces with only 2 ranks on one host when using --map-by socket:

/hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -x UCX_IB_PREFER_NEAREST_DEVICE=y -np 2 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1,mlx5_0:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by socket --display-map /hpc/scrap/users/mtt/scratch/ucx_ompi/20170610_005456_13446_748055_clx-hercules-001/installs/PbKb/tests/mpich_tests/mpich-mellanox.git/test/mpi/basic/patterns

======================== JOB MAP ========================

Data for node: clx-hercules-081 Num slots: 32 Max slots: 0 Num procs: 2
Process OMPI jobid: [59095,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0]]:[B/././././././././././././././.][./././././././././././././././.]
Process OMPI jobid: [59095,1] App: 0 Process rank: 1 Bound: socket 1[core 16[hwt 0]]:[./././././././././././././././.][B/././././././././././././././.]

=============================================================

links from mtt:
http://e2e-gw.mellanox.com:4080//hpc/scrap/users/mtt/scratch/ucx_ompi/20170609_042941_8216_747906_clx-hercules-065/html/test_stdout_WR4XBq.txt

http://e2e-gw.mellanox.com:4080//hpc/scrap/users/mtt/scratch/ucx_ompi/20170609_083322_21791_747909_clx-hercules-065/html/test_stdout_U6Qul1.txt

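For reference, a sketch of the workaround from item 2 above: the same 2-rank --map-by socket reproducer, with only UCX_IB_PREFER_NEAREST_DEVICE changed from 'y' to 'no' (per the observation above, this setting eliminates the error):

/hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -x UCX_IB_PREFER_NEAREST_DEVICE=no -np 2 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1,mlx5_0:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by socket --display-map /hpc/scrap/users/mtt/scratch/ucx_ompi/20170610_005456_13446_748055_clx-hercules-001/installs/PbKb/tests/mpich_tests/mpich-mellanox.git/test/mpi/basic/patterns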

alinask added a commit to alinask/ucx that referenced this issue Jun 15, 2017
- Added a small value to overcome floating-point imprecision in the
  score calculations.
- Print the lane configurations in case the tl-dev selection is
  incorrect.

fixes openucx#1534
yosefe closed this as completed in 075b7ba Jun 16, 2017
boehms pushed a commit to boehms/ucx that referenced this issue Oct 17, 2017
(cherry picked from commit eb7fd1b; same commit message as the May 30 commit above, fixes openucx#1534)