
Install Request: OpenMPI 4.0.2 or later (stale shared memory segments bug) #337

Closed
heatherkellyucl opened this issue Mar 25, 2020 · 12 comments


heatherkellyucl commented Mar 25, 2020

Related to IN:04155266 on Legion, but this is a general issue, as our most recent OpenMPI installs are 3.1.4 and 3.1.5 (beta module). OpenMPI 3.x versions after 3.1.1 have a bug where vader_segment.x shared memory files are left behind (only/mostly after an aborted run?). If they exist, a new run on those nodes will fail with:

node-o08a-029: Unable to allocate shared memory for intra-node messaging.
node-o08a-029: Delete stale shared memory files in /dev/shm.

Note that /dev/shm is not full in this case.
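As a workaround until a fixed build is in place, a minimal sketch of checking for and clearing the stale segments on an affected node (the file name pattern is taken from the error and bug reports; only remove files you own):

ls -l /dev/shm/vader_segment.* 2>/dev/null
find /dev/shm -maxdepth 1 -name 'vader_segment.*' -user "$USER" -delete
df -h /dev/shm    # confirm the filesystem itself is not full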

OpenMPI 4.0.2 and later have fixed a number of vader issues and use PMIx 3 rather than 2, which has better hooks for cleaning up at job shutdown.

Note: 4.0.x deprecates the openib BTL in favour of UCX.
https://www.open-mpi.org/software/ompi/major-changes.php
https://www.open-mpi.org/faq/?category=openfabrics#run-ucx
https://www.open-mpi.org/faq/?category=building#build-p2p

It also suggests building with --without-verbs when using UCX.
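If/when UCX is available, the relevant configure options for the new OpenMPI would look something like this (a sketch only; the install prefix, UCX path and make parallelism are assumptions, not the final build recipe):

./configure --prefix=/shared/ucl/apps/openmpi/4.0.2/gnu-4.9.2 \
            --with-ucx=/path/to/ucx \
            --without-verbs \
            --with-sge
make -j4 && make install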

See open-mpi/ompi#6322 and open-mpi/ompi#7220 for the bug reports.


heatherkellyucl commented Apr 20, 2020

Building on:

  • Kathleen
  • Thomas
  • Grace
  • Legion (requested)
  • modulefile

@heatherkellyucl

Installed; need to test that it works before doing the modulefile.

@heatherkellyucl

mpi_pi 2-node job worked on Kathleen.
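For context, the sort of two-node SGE test job involved looks roughly like this (PE name, slot count and module name are assumptions rather than the actual script used):

#!/bin/bash -l
#$ -l h_rt=0:15:0
#$ -pe mpi 80       # two 40-core Kathleen nodes, assuming the usual mpi PE
#$ -cwd
module load mpi/openmpi/4.0.3/gnu-4.9.2    # hypothetical module name for the test install
mpirun ./mpi_pi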

@heatherkellyucl

Also working across 2 nodes on Grace, on a single node on Myriad, and on a single node on the Legion build node. (I don't think I'm currently in any project that can submit multi-node jobs on what remains of Legion.)

Need to check multi-node on the Economics part of Myriad.

@heatherkellyucl

Not working across 2 nodes on Myriad-Economics.

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[node-d97a-020:89406] *** An error occurred in MPI_Init
[node-d97a-020:89406] *** reported by process [2074411009,39]
[node-d97a-020:89406] *** on a NULL communicator
[node-d97a-020:89406] *** Unknown error
[node-d97a-020:89406] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node-d97a-020:89406] ***    and potentially your MPI job)

@heatherkellyucl

We don't have OpenUCX installed; it is the recommended replacement for openib and probably something we should install.

ompi_info | grep mtl
                 MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                 MCA mtl: psm2 (MCA v2.1.0, API v2.0.0, Component v4.0.3)

ofi ought to work if we get the right options...

The mpirun wrapper runs $MPI_LAUNCHER --mca mtl '^psm2' -mca pml cm "$@", which is not working on its own here.
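For testing, one option is to call the underlying launcher with explicit selections instead of the wrapper's defaults (a sketch; $MPI_LAUNCHER is the variable the wrapper itself uses):

$MPI_LAUNCHER --mca pml ob1 --mca btl self,vader,tcp ./mpi_pi   # rule out the fabric: shared memory + TCP only
$MPI_LAUNCHER --mca pml cm --mca mtl ofi ./mpi_pi               # try the libfabric path explicitly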

@heatherkellyucl

I've been adding some of the verbose options.

mpirun --mca plm_base_verbose 10  --mca mtl_base_verbose 10  --mca mtl ofi ./mpi_pi

If we just specify ofi, then ofi picks verbs as the provider (and I left --without-verbs in the OpenMPI build options, as we had before, so maybe we should try a version built with verbs here since we do not have UCX).

[node-d97a-008.myriad.ucl.ac.uk:397694] mca: base: components_register: registering framework mtl components
[node-d97a-008.myriad.ucl.ac.uk:397694] mca: base: components_register: found loaded component ofi
[node-d97a-008.myriad.ucl.ac.uk:397694] mca: base: components_register: component ofi register function successful
[node-d97a-008.myriad.ucl.ac.uk:397694] mca: base: components_open: opening mtl components
[node-d97a-008.myriad.ucl.ac.uk:397694] mca: base: components_open: found loaded component ofi
[node-d97a-008.myriad.ucl.ac.uk:397694] mca: base: components_open: component ofi open function successful
[node-d97a-008.myriad.ucl.ac.uk:397694] mca:base:select: Auto-selecting mtl components
[node-d97a-008.myriad.ucl.ac.uk:397694] mca:base:select:(  mtl) Querying component [ofi]
[node-d97a-008.myriad.ucl.ac.uk:397694] mca:base:select:(  mtl) Query of component [ofi] set priority to 25
[node-d97a-008.myriad.ucl.ac.uk:397694] mca:base:select:(  mtl) Selected component [ofi]
[node-d97a-008.myriad.ucl.ac.uk:397694] select: initializing mtl component ofi
[node-d97a-008.myriad.ucl.ac.uk:397694] mtl_ofi_component.c:315: mtl:ofi:provider_include = "(null)"
[node-d97a-008.myriad.ucl.ac.uk:397694] mtl_ofi_component.c:318: mtl:ofi:provider_exclude = "shm,sockets,tcp,udp,rstream"
[node-d97a-008.myriad.ucl.ac.uk:397694] mtl_ofi_component.c:347: mtl:ofi:prov: verbs
[node-d97a-008.myriad.ucl.ac.uk:397694] select: init returned success
[node-d97a-008.myriad.ucl.ac.uk:397694] select: component ofi selected
[node-d97a-008.myriad.ucl.ac.uk:397694] mtl_ofi.c:116: fi_av_insert failed: 1

It is possible to set --mca mtl_ofi_provider_include to a specific provider, or --mca mtl_ofi_provider_exclude "verbs".
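Concretely, that looks like the following (provider names are illustrative; what is actually available depends on the node's libfabric build):

mpirun --mca mtl ofi --mca mtl_ofi_provider_exclude verbs ./mpi_pi
mpirun --mca mtl ofi --mca mtl_ofi_provider_include psm2 ./mpi_pi   # only meaningful on the OmniPath systems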

open-mpi/ompi#6570 (comment) says that "The default settings are indeed based on an assumption that ofi will not be the transport of choice for Mellanox IB cards as Mellanox directs their users to install UCX and go that route."

@heatherkellyucl

So, back to installing UCX?


heatherkellyucl commented Apr 27, 2020

Ok, OpenUCX is BSD-licensed and looks straightforward to build: https://openucx.readthedocs.io/en/master/
https://github.com/openucx/ucx/releases

I will install it centrally and test with a test module first.

  • Myriad
  • Grace
  • test
  • modulefile

(On the OmniPath systems it ought to work but would be slower than psm2; we don't want to use it there but could install it.)

  • Kathleen
  • Thomas

@heatherkellyucl

A default config with these modules

gcc-libs/4.9.2
compilers/gnu/4.9.2
numactl/2.0.12
binutils/2.29.1/gnu-4.9.2

gets this:

configure: =========================================================
configure: UCX build configuration:
configure:       Build prefix:   /home/cceahke/ucx-install
configure: Preprocessor flags:   -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure:         C compiler:   gcc -O3 -g -Wall -Werror
configure:       C++ compiler:   g++ -O3 -g -Wall -Werror
configure:       Multi-thread:   disabled
configure:          MPI tests:   disabled
configure:      Devel headers:   no
configure:           Bindings:   < >
configure:        UCT modules:   < ib rdmacm cma >
configure:       CUDA modules:   < >
configure:       ROCM modules:   < >
configure:         IB modules:   < cm >
configure:        UCM modules:   < >
configure:       Perf modules:   < >
configure: =========================================================
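For reference, roughly the sequence that produces a summary like the one above (the UCX version and exact commands here are assumptions):

wget https://github.com/openucx/ucx/releases/download/v1.8.0/ucx-1.8.0.tar.gz
tar -xzf ucx-1.8.0.tar.gz && cd ucx-1.8.0
./configure --prefix=$HOME/ucx-install    # defaults picked up the ib, rdmacm and cma modules as shown
make -j8 && make install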

@heatherkellyucl

OpenMPI 4.0.3 on Myriad now built with ucx.

[cceahke@login12 ompi-4.0.3]$ ompi_info | grep pml
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component v4.0.3)
[cceahke@login12 ompi-4.0.3]$ ompi_info | grep mtl
                 MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                 MCA mtl: psm2 (MCA v2.1.0, API v2.0.0, Component v4.0.3)

Trying

mpirun --mca plm_base_verbose 10  --mca mtl_base_verbose 10  --mca mtl ofi -mca pml ucx ./mpi_pi


heatherkellyucl commented Apr 28, 2020

That ran, but I think it used tcp entirely.

[node-d97a-021.myriad.ucl.ac.uk:186223] mca:base:select: Auto-selecting plm components
[node-d97a-021.myriad.ucl.ac.uk:186223] mca:base:select:(  plm) Querying component [rsh]
[node-d97a-021.myriad.ucl.ac.uk:186223] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[node-d97a-021.myriad.ucl.ac.uk:186223] mca:base:select:(  plm) Querying component [isolated]
[node-d97a-021.myriad.ucl.ac.uk:186223] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[node-d97a-021.myriad.ucl.ac.uk:186223] mca:base:select:(  plm) Querying component [slurm]
[node-d97a-021.myriad.ucl.ac.uk:186223] mca:base:select:(  plm) Selected component [rsh]
[node-d97a-021.myriad.ucl.ac.uk:186223] mca: base: close: component isolated closed
[node-d97a-021.myriad.ucl.ac.uk:186223] mca: base: close: unloading component isolated
[node-d97a-021.myriad.ucl.ac.uk:186223] mca: base: close: component slurm closed
[node-d97a-021.myriad.ucl.ac.uk:186223] mca: base: close: unloading component slurm
[node-d97a-021.myriad.ucl.ac.uk:186223] [[30014,0],0] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
[node-d97a-021.myriad.ucl.ac.uk:186223] [[30014,0],0] plm:rsh: final template argv:
        /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose <template>  orted -mca ess "env" -mca ess_base_jobid "1966997504" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "node-d[2:97]a-021,node-d[2:97]a-014@0(2)" -mca orte_hnp_uri "1966997504.0;tcp://10.34.6.71,169.254.95.120,10.128.26.221,10.128.18.221:56571" --mca plm_base_verbose "10" --mca mtl_base_verbose "10" --mca mtl "ofi" -mca pml "ucx" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "1966997504.0;tcp://10.34.6.71,169.254.95.120,10.128.26.221,10.128.18.221:56571" -mca pmix "^s1,s2,cray,isolated"
Starting server daemon at host "node-d97a-014"
Server daemon successfully started with task id "1.node-d97a-014"
Establishing /opt/geassist/bin/rshcommand session to host node-d97a-014.myriad.ucl.ac.uk ...
[node-d97a-014.myriad.ucl.ac.uk:126077] mca: base: components_register: registering framework plm components
[node-d97a-014.myriad.ucl.ac.uk:126077] mca: base: components_register: found loaded component rsh
[node-d97a-014.myriad.ucl.ac.uk:126077] mca: base: components_register: component rsh register function successful
[node-d97a-014.myriad.ucl.ac.uk:126077] mca: base: components_open: opening plm components
[node-d97a-014.myriad.ucl.ac.uk:126077] mca: base: components_open: found loaded component rsh
[node-d97a-014.myriad.ucl.ac.uk:126077] mca: base: components_open: component rsh open function successful
[node-d97a-014.myriad.ucl.ac.uk:126077] mca:base:select: Auto-selecting plm components
[node-d97a-014.myriad.ucl.ac.uk:126077] mca:base:select:(  plm) Querying component [rsh]
[node-d97a-014.myriad.ucl.ac.uk:126077] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[node-d97a-014.myriad.ucl.ac.uk:126077] mca:base:select:(  plm) Selected component [rsh]
[node-d97a-014.myriad.ucl.ac.uk:126077] [[30014,0],1] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
[node-d97a-021.myriad.ucl.ac.uk:186223] [[30014,0],0] complete_setup on job [30014,1]
[node-d97a-021.myriad.ucl.ac.uk:186223] [[30014,0],0] plm:base:receive update proc state command from [[30014,0],1]
[node-d97a-021.myriad.ucl.ac.uk:186223] [[30014,0],0] plm:base:receive got update_proc_state for job [30014,1]
[node-d97a-021.myriad.ucl.ac.uk:186223] [[30014,0],0] plm:base:receive update proc state command from [[30014,0],1]
[node-d97a-021.myriad.ucl.ac.uk:186223] [[30014,0],0] plm:base:receive got update_proc_state for job [30014,1]
[node-d97a-014.myriad.ucl.ac.uk:126077] mca: base: close: component rsh closed
[node-d97a-014.myriad.ucl.ac.uk:126077] mca: base: close: unloading component rsh
/opt/geassist/bin/rshcommand exited with exit code 0
reading exit code from shepherd ... [node-d97a-021.myriad.ucl.ac.uk:186223] mca: base: close: component rsh closed
[node-d97a-021.myriad.ucl.ac.uk:186223] mca: base: close: unloading component rsh

The line that is too long to read is
/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose <template> orted -mca ess "env" -mca ess_base_jobid "1966997504" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "node-d[2:97]a-021,node-d[2:97]a-014@0(2)" -mca orte_hnp_uri "1966997504.0;tcp://10.34.6.71,169.254.95.120,10.128.26.221,10.128.18.221:56571" --mca plm_base_verbose "10" --mca mtl_base_verbose "10" --mca mtl "ofi" -mca pml "ucx" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "1966997504.0;tcp://10.34.6.71,169.254.95.120,10.128.26.221,10.128.18.221:56571" -mca pmix "^s1,s2,cray,isolated"
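The tcp:// URIs in the plm output are the orted wire-up channel rather than the MPI traffic itself; to confirm what the pml actually selects, something like the following may help (standard verbosity knobs, values illustrative):

mpirun --mca pml ucx --mca pml_base_verbose 10 ./mpi_pi
UCX_LOG_LEVEL=info mpirun --mca pml ucx ./mpi_pi    # raise UCX's own log level to see what it initialises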
