
SSH launch silently hangs with certain numbers of hosts in machine file #7087

Closed
mwheinz opened this issue Oct 11, 2019 · 5 comments

mwheinz commented Oct 11, 2019

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.1.4, 3.1.x (tip of 3.1.x branch)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

As part of OPA IFS install, or by manual build from source.

Please describe the system on which you are running

  • Operating system/version: RHEL 7.6
  • Computer hardware: X86_64 servers
  • Network type: OPA

Details of the problem

3.1.4: The problem appears with any number of hosts greater than 64.

3.1.x: If the machine file contains certain numbers of hosts, the job silently hangs during launch. Known bad host counts include 72 and 130; known good counts include 80 and 129.
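
(For anyone reproducing this: the host count is simply the number of lines in the machine file, so specific counts can be hit by truncating a full host list. The all_hosts filename below is purely illustrative, not a file from this cluster.)

# Hypothetical sketch: build machine files with a known-bad and a known-good host count
head -n 72 /root/mpi_apps/all_hosts > /root/mpi_apps/mpi_hosts_72   # 72 hosts: hangs on 3.1.x
head -n 80 /root/mpi_apps/all_hosts > /root/mpi_apps/mpi_hosts_80   # 80 hosts: launches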

The problem appears to be related to #6618, but unlike that issue, here the launch simply hangs and the workarounds suggested there (--mca routed_radix 1, --mca routed direct, etc.) do not resolve it. Completely disabling tree-based launching (-mca plm_rsh_no_tree_spawn 1) does resolve the issue. The problem may also differ somewhat between 3.1.4 and 3.1.x, and I am going to test whether it occurs in 4.0.x.

Sample command line:
[RHEL7.6 hds1fnb8261 20191011_0927 mpi_apps]# /usr/mpi/gcc/openmpi-3.1.4-hfi/bin/mpirun -np 80 -map-by node --allow-run-as-root --mca routed_radix 1 -machinefile /root/mpi_apps/mpi_hosts /bin/hostname
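
For comparison, a sketch of the same launch with the tree-spawn workaround mentioned above; the flag is the one from the description, while -np and the machine file path are simply reused from the sample command:

/usr/mpi/gcc/openmpi-3.1.4-hfi/bin/mpirun -np 80 -map-by node --allow-run-as-root -mca plm_rsh_no_tree_spawn 1 -machinefile /root/mpi_apps/mpi_hosts /bin/hostname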

Running verbose with 3.1.x, the last output is:

[hds1fnb8261:115916] [[26234,0],0] plm:rsh: final template argv:
	/usr/bin/ssh <template>     PATH=/usr/mpi/gcc/openmpi-3.1.4-hfi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-3.1.4-hfi/lib64:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-3.1.4-hfi/lib64:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/mpi/gcc/openmpi-3.1.4-hfi/bin/orted -mca ess "env" -mca ess_base_jobid "1719271424" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "128" -mca orte_hnp_uri "1719271424.0;tcp://10.127.238.168,10.127.114.69,192.168.0.125:60072;ud://2976.126.1" --mca routed_radix "1" -mca plm_base_verbose "100" -mca plm "rsh" -mca rmaps_base_mapping_policy "node" -mca pmix "^s1,s2,cray,isolated"
[hds1fnb6061:28893] mca: base: components_register: registering framework plm components
[hds1fnb6061:28893] mca: base: components_register: found loaded component rsh
[hds1fnb6061:28893] mca: base: components_register: component rsh register function successful
[hds1fnb6061:28893] mca: base: components_open: opening plm components
[hds1fnb6061:28893] mca: base: components_open: found loaded component rsh
[hds1fnb6061:28893] mca: base: components_open: component rsh open function successful
[hds1fnb6061:28893] mca:base:select: Auto-selecting plm components
[hds1fnb6061:28893] mca:base:select:(  plm) Querying component [rsh]
[hds1fnb6061:28893] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[hds1fnb6061:28893] mca:base:select:(  plm) Selected component [rsh]
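
(For reference, the plm:rsh trace above reflects launcher verbosity being turned up; the -mca plm_base_verbose "100" setting is visible in the orted template line. A minimal sketch of adding it to the sample command above:)

/usr/mpi/gcc/openmpi-3.1.4-hfi/bin/mpirun -np 80 -map-by node --allow-run-as-root --mca routed_radix 1 --mca plm_base_verbose 100 -machinefile /root/mpi_apps/mpi_hosts /bin/hostname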
mwheinz self-assigned this Oct 11, 2019
rhc54 (Contributor) commented Oct 11, 2019

I'd suggest first checking master and then working back to the release branches

mwheinz (Author) commented Oct 11, 2019

> I'd suggest first checking master and then working back to the release branches

Fair enough.

I'm still holding out a vague hope that it's some kind of configuration issue on this particular cluster but I don't have access to another one of sufficient size to compare.

mwheinz (Author) commented Oct 11, 2019

Problem does not occur with the 4.0.2 release.

mwheinz (Author) commented Oct 11, 2019

Problem does not exist in master.

gpaulsen (Member) commented:

Closing this issue as a dup of #6198. If this is incorrect, please feel free to reopen this and discuss.
