Segfault in UCX 1.5.1 on Fedora Rawhide with 4.0.1 #6671

Closed
opoplawski opened this issue May 16, 2019 · 5 comments
Comments

@opoplawski
Contributor

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Open MPI 4.0.1
UCX 1.5.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Fedora Package

Please describe the system on which you are running

  • Operating system/version: Fedora Rawhide (31)
  • Computer hardware: x86_64
  • Network type: None/Ethernet

Details of the problem

Since updating the openmpi package in Fedora to 4.0.1, we have been seeing segfaults in the tests of a number of dependent packages.

Reported to UCX as openucx/ucx#3558, but there has been no response there yet.

Here is a traceback from netcdf:

FAIL: run_par_tests.sh
======================
Testing parallel I/O with HDF5...
[buildhw-08:1772 :0:1772] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f9280277948)
==== backtrace ====
    0  /lib64/libucs.so.0(+0x194a3) [0x7f92801ff4a3]
    1  /lib64/libucs.so.0(+0x1965a) [0x7f92801ff65a]
    2  /lib64/libuct.so.0(+0x1b72b) [0x7f928014a72b]
    3  /lib64/ld-linux-x86-64.so.2(+0xfe4a) [0x7f9281aa4e4a]
    4  /lib64/ld-linux-x86-64.so.2(+0xff51) [0x7f9281aa4f51]
    5  /lib64/ld-linux-x86-64.so.2(+0x13eae) [0x7f9281aa8eae]
    6  /lib64/libc.so.6(_dl_catch_exception+0x79) [0x7f92815133e9]
    7  /lib64/ld-linux-x86-64.so.2(+0x1372e) [0x7f9281aa872e]
    8  /lib64/libdl.so.2(+0x239c) [0x7f92813b639c]
    9  /lib64/libc.so.6(_dl_catch_exception+0x79) [0x7f92815133e9]
   10  /lib64/libc.so.6(_dl_catch_error+0x33) [0x7f9281513483]
   11  /lib64/libdl.so.2(+0x2af9) [0x7f92813b6af9]
   12  /lib64/libdl.so.2(dlopen+0x4a) [0x7f92813b642a]
   13  /usr/lib64/openmpi/lib/libopen-pal.so.40(+0x6ead7) [0x7f928114aad7]
   14  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_repository_open+0x1f4) [0x7f9281128524]
   15  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_component_find+0x35b) [0x7f92811274eb]
   16  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_components_register+0x2e) [0x7f9281132dfe]
   17  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_register+0x256) [0x7f92811332e6]
   18  /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_base_framework_open+0x14) [0x7f9281133344]
   19  /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x695) [0x7f92815f8795]
   20  /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x73) [0x7f9281628a53]
   21  ./tst_h_par(+0x2489) [0x55af8923a489]
   22  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f9281400f73]
   23  ./tst_h_par(+0x2dbe) [0x55af8923adbe]
===================
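Most frames in the backtrace above are unsymbolized (module plus offset only). As a rough sketch, the frames could be parsed and turned into `addr2line` invocations for anyone who has matching debug symbols installed. The helper names (`parse_frames`, `addr2line_commands`) and the regular expression are illustrative assumptions, not part of UCX or Open MPI, and the generated commands are only meaningful on a system with the same library builds.

```python
import re

# Matches frames such as:
#   "    0  /lib64/libucs.so.0(+0x194a3) [0x7f92801ff4a3]"
#   "    6  /lib64/libc.so.6(_dl_catch_exception+0x79) [0x7f92815133e9]"
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<module>\S+?)"
    r"\((?P<symbol>[^)+]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\s+\[(?P<address>0x[0-9a-fA-F]+)\]\s*$"
)

# A few frames copied from the backtrace in this report.
SAMPLE = """\
    0  /lib64/libucs.so.0(+0x194a3) [0x7f92801ff4a3]
    2  /lib64/libuct.so.0(+0x1b72b) [0x7f928014a72b]
    6  /lib64/libc.so.6(_dl_catch_exception+0x79) [0x7f92815133e9]
"""


def parse_frames(text):
    """Return one dict per backtrace frame found in `text` (hypothetical helper)."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append({
                "index": int(match["index"]),
                "module": match["module"],
                # Empty symbol means the frame is unsymbolized.
                "symbol": match["symbol"] or None,
                "offset": match["offset"],
            })
    return frames


def addr2line_commands(frames):
    """Build addr2line commands for unsymbolized frames only.

    For those frames the offset is relative to the module itself, which is
    the form addr2line expects for a shared object. Frames that already
    carry a symbol use a symbol-relative offset and are skipped.
    """
    return [
        f"addr2line -f -C -e {frame['module']} {frame['offset']}"
        for frame in frames
        if frame["symbol"] is None
    ]


if __name__ == "__main__":
    for command in addr2line_commands(parse_frames(SAMPLE)):
        print(command)
```

Only the offset-only frames are converted here, which in this trace covers the three frames inside libucs and libuct where the fault occurred.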

Other affected packages: https://apps.fedoraproject.org/koschei/affected-by/openmpi?epoch1=0&version1=3.1.4&release1=1.fc31&epoch2=0&version2=4.0.1&release2=1.fc31&collection=f31

@opoplawski
Contributor Author

I've disabled building openmpi with UCX in Fedora until this is resolved. This appears to have fixed the segfault, at the cost of no longer building the OpenSHMEM components.

@artpol84
Contributor

artpol84 commented May 30, 2019

Thanks @opoplawski, @yosefe is working on this.

@jsquyres
Member

@opoplawski Does Open MPI v4.0.2 fix this issue, perchance?

@opoplawski
Contributor Author

This appears to have been fixed with UCX 1.5.2.

@rhc54
Contributor

rhc54 commented Oct 19, 2019

Thanks!
