wireup.c:473 Fatal: endpoint reconfiguration not supported yet #1534
- Add the latency.overhead to the passed address so that each rank can see the same values when selecting a lane, since this value may be different for different ranks.
- Consider the remote peer's bandwidth in the rndv score function; this allows supporting cases where different ranks have different speeds on their HCAs (heterogeneous fabric).
- Enhance the logging for pack/unpack address: include the priority of the device and the latency overhead.

Fixes openucx#1534
(cherry picked from commit eb7fd1b)
Reopening this issue since this error is printed again.
To reproduce: /hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -x UCX_IB_PREFER_NEAREST_DEVICE=y -np 512 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1,mlx5_0:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by node /hpc/scrap/users/mtt/scratch/ucx_ompi/20170610_005456_13446_748055_clx-hercules-001/installs/PbKb/tests/mpich_tests/mpich-mellanox.git/test/mpi/basic/patterns
/hpc/local/benchmarks/hpcx_install_Friday/hpcx-gcc-redhat7.2/ompi-v2.x/bin/mpirun -x UCX_IB_PREFER_NEAREST_DEVICE=y -np 2 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core --tag-output --timestamp-output -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1,mlx5_0:1 -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca coll_hcoll_enable 0 -x UCX_TLS=rc,sm -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 100000 --map-by socket --display-map /hpc/scrap/users/mtt/scratch/ucx_ompi/20170610_005456_13446_748055_clx-hercules-001/installs/PbKb/tests/mpich_tests/mpich-mellanox.git/test/mpi/basic/patterns

======================== JOB MAP ========================
Data for node: clx-hercules-081   Num slots: 32   Max slots: 0   Num procs: 2
=============================================================
- Added a small value which overcomes float imprecision in score calculations.
- Print the lane configurations in case the tl-dev selection is incorrect.

Fixes openucx#1534
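The "small value" idea above can be sketched as an epsilon comparison. This is an illustration, not the actual UCX implementation, and the epsilon value here is invented: when two lane scores differ by less than the epsilon, they are treated as equal, so rounding noise cannot flip the selection order between ranks.

```c
#include <math.h>

/* Illustrative tolerance; the real value would be tuned to the score scale. */
#define SCORE_EPSILON 1e-6

/* Compare two lane scores, treating near-equal values as a tie so that
 * float imprecision cannot reorder lanes differently on different ranks.
 * Returns 0 on a tie, 1 if a wins, -1 if b wins. */
static int score_cmp(double a, double b)
{
    if (fabs(a - b) < SCORE_EPSILON) {
        return 0;
    }
    return (a > b) ? 1 : -1;
}
```

With a plain `a > b` comparison, two ranks computing 1.0 and 1.0 + 1e-9 for the same lane could pick different winners; `score_cmp` makes both sides agree on the tie.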
This problem occurs on Orion with the following cmd line:
mpirun -np 4 --bind-to core --report-bindings -mca pml ucx -x UCX_TLS=rc ./sendself
The workaround is to specify a particular network device (UCX_NET_DEVICES).
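Applied to the reproducer above, the workaround looks like this. The device name mlx5_0:1 is only an example; pick whichever HCA port is appropriate on the node:

```shell
# Pin UCX to a single network device so wireup never has to reconfigure
# the endpoint between devices (mlx5_0:1 is an example choice).
mpirun -np 4 --bind-to core --report-bindings -mca pml ucx \
       -x UCX_TLS=rc -x UCX_NET_DEVICES=mlx5_0:1 ./sendself
```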
Need to check how device priority is handled.