Hang on UD-only MPI runs on some hosts #1767

Closed
alex--m opened this issue Aug 17, 2017 · 6 comments

alex--m commented Aug 17, 2017

I noticed the problem on the Thor Cluster (Mellanox-internal). The problem did not reproduce on other hosts. The following command hangs, whereas changing ud to rc (or disabling UCX altogether) makes it pass:

salloc -N 4 mpirun -bind-to core -map-by node -mca pml ucx -x UCX_TLS=ud,shm,self -x UCX_NET_DEVICES=mlx5_0:1 osu_igather

It also hangs without HCOLL, with ud_x, and for the other non-blocking (i*) OSU collective benchmarks. With osu_gatherv it hangs at around a 4K message size.
Is there any additional data you'd like me to provide?
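
For reference, a minimal sketch of the two variants reported to pass, using the same invocation; the -mca pml ob1 fallback is only an assumption for what "disabling UCX altogether" would look like:

salloc -N 4 mpirun -bind-to core -map-by node -mca pml ucx -x UCX_TLS=rc,shm,self -x UCX_NET_DEVICES=mlx5_0:1 osu_igather  # rc instead of ud
salloc -N 4 mpirun -bind-to core -map-by node -mca pml ob1 osu_igather  # UCX disabled (assumed ob1 fallback)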

yosefe commented Aug 17, 2017

Is it ConnectX-5? What is the FW version?

yosefe commented Aug 17, 2017

see also #1513

alex--m commented Aug 17, 2017

Doesn't look related to #1513, since the hang is on the very first sends: it never starts printing benchmark results (I attached gdb; all processes seem to be waiting in poll_cq()). Also, this is UD, not RC (RC works).
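
A minimal sketch of how such backtraces can be collected, assuming one attaches to each hung rank by PID (the PID is a placeholder):

gdb -batch -ex 'thread apply all bt' -p <pid>  # dump all thread backtraces of one hung rank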

I'll post the FW & HCA versions.
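
(A sketch of how those can be read off the HCA directly; the device name mlx5_0 is taken from the command above:)

ibstat mlx5_0  # CA type, ports and firmware version
ibv_devinfo -d mlx5_0 | grep fw_ver  # firmware version only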

alex--m commented Aug 17, 2017

It is ConnectX-5. Still waiting on the FW version.

yosefe added the Bug label Sep 26, 2017
yosefe commented Dec 21, 2017

Currently it works for me on that cluster:

[root@thor001 ~]# ibstat mlx5_0
CA 'mlx5_0'
	CA type: MT4121
	Number of ports: 1
	Firmware version: 16.21.2010
	Port 1:
		State: Active
		Rate: 100
		Link layer: InfiniBand
[root@thor001 ~]#  mpirun --allow-run-as-root -bind-to core -map-by node -mca pml ucx -x UCX_TLS=ud,shm,self -x UCX_NET_DEVICES=mlx5_0:1 $HPCX_MPI_TESTS_DIR/osu-micro-benchmarks-5.3.2/osu_igather

# OSU MPI Non-blocking Gather Latency Test v5.3.2
# Overall = Coll. Init + Compute + MPI_Test + MPI_Wait

# Size           Overall(us)       Compute(us)    Pure Comm.(us)        Overlap(%)
...
   <results>
...

@alex--m can we close this?

alex--m commented Dec 27, 2017

If OSU passes, I consider it solved.

alex--m closed this as completed Dec 27, 2017