MPI_Init segfaults with UCX v1.12.0 #8054

Closed
dmcdougall opened this issue Mar 18, 2022 · 2 comments
dmcdougall commented Mar 18, 2022

Describe the bug

A call to MPI_Init gives this error message:

Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

This worked with UCX v1.11.2 and fails with UCX v1.12.0. Here is a gdb backtrace:

Thread 1 "hello_world" received signal SIGSEGV, Segmentation fault.
0x00007fffe6bd9d5f in uct_base_iface_t_init (self=0x7eab60, _myclass=0x7fffe6e12d20 <uct_base_iface_t_class>, _init_count=0x7fffffffd778, ops=0x7fffe6523be0 <uct_rocm_ipc_iface_ops>, internal_ops=0x0, md=0x7fffe6523ad0 <md>, worker=0x7e5ba0, params=0x7fffffffd930, config=0x7e9cd0) at ../../../src/uct/base/uct_iface.c:511
511         ucs_assert(internal_ops->iface_estimate_perf != NULL);
#0  0x00007fffe6bd9d5f in uct_base_iface_t_init (self=0x7eab60, _myclass=0x7fffe6e12d20 <uct_base_iface_t_class>, _init_count=0x7fffffffd778, ops=0x7fffe6523be0 <uct_rocm_ipc_iface_ops>, internal_ops=0x0, md=0x7fffe6523ad0 <md>, worker=0x7e5ba0, params=0x7fffffffd930, 
    config=0x7e9cd0) at ../../../src/uct/base/uct_iface.c:511
#1  0x00007fffe631dd0d in uct_rocm_ipc_iface_t_init (self=0x7eab60, _myclass=0x7fffe6523dc0 <uct_rocm_ipc_iface_t_class>, _init_count=0x7fffffffd778, md=0x7fffe6523ad0 <md>, worker=0x7e5ba0, params=0x7fffffffd930, tl_config=0x7e9cd0)
    at ../../../../src/uct/rocm/ipc/rocm_ipc_iface.c:228
#2  0x00007fffe631dec2 in uct_rocm_ipc_iface_t_new (arg0=0x7fffe6523ad0 <md>, arg1=0x7e5ba0, arg2=0x7fffffffd930, arg3=0x7e9cd0, obj_p=0x7e9a60) at ../../../../src/uct/rocm/ipc/rocm_ipc_iface.c:262
#3  0x00007fffe6bd5df8 in uct_iface_open (md=0x7fffe6523ad0 <md>, worker=0x7e5ba0, params=0x7fffffffd930, config=0x7e9cd0, iface_p=0x7e9a60) at ../../../src/uct/base/uct_md.c:267
#4  0x00007fffe6e72456 in ucp_worker_iface_open (worker=0x7fffe4095010, tl_id=6 '\006', iface_params=0x7fffffffd930, wiface_p=0x7e5a80) at ../../../src/ucp/core/ucp_worker.c:1173
#5  0x00007fffe6e70853 in ucp_worker_add_resource_ifaces (worker=0x7fffe4095010) at ../../../src/ucp/core/ucp_worker.c:974
#6  0x00007fffe6e75d6c in ucp_worker_create (context=0x765a40, params=0x7fffffffdee0, worker_p=0x7fffe75e9340 <ompi_pml_ucx+192>) at ../../../src/ucp/core/ucp_worker.c:2210
#7  0x00007fffe73e0a8f in mca_pml_ucx_init (enable_mpi_threads=0) at ../../../../../ompi/mca/pml/ucx/pml_ucx.c:306
#8  0x00007fffe73e5c59 in mca_pml_ucx_component_init (priority=0x7fffffffe05c, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../ompi/mca/pml/ucx/pml_ucx_component.c:118
#9  0x00007ffff7b590d4 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../ompi/mca/pml/base/pml_base_select.c:127
#10 0x00007ffff7b6d6e9 in ompi_mpi_init (argc=1, argv=0x7fffffffe328, requested=0, provided=0x7fffffffe1ec, reinit_ok=false) at ../../ompi/runtime/ompi_mpi_init.c:646
#11 0x00007ffff7aeb2e9 in PMPI_Init (argc=0x7fffffffe21c, argv=0x7fffffffe210) at pinit.c:67
#12 0x0000000000400709 in main (argc=1, argv=0x7fffffffe328) at hello_world.c:5
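
For what it's worth, this looks like a NULL pointer dereference rather than a failed assertion: frame #0 shows internal_ops=0x0, and uct_iface.c:511 reads internal_ops->iface_estimate_perf before the pointer itself is checked. A minimal standalone sketch of that pattern (illustration only, with made-up names; not the actual UCX code):

/* Minimal sketch of the failing pattern (hypothetical names, not the UCX source):
 * an ops-table pointer that can be NULL is dereferenced inside the assertion
 * before anything checks the pointer itself. */
#include <assert.h>
#include <stddef.h>

typedef struct {
    double (*iface_estimate_perf)(void *iface);
} internal_ops_t;

static int iface_init(const internal_ops_t *internal_ops)
{
    /* Analogous to ucs_assert(internal_ops->iface_estimate_perf != NULL):
     * if internal_ops is NULL, this read faults with SIGSEGV at address (nil)
     * before the assertion itself can report anything. */
    assert(internal_ops->iface_estimate_perf != NULL);
    return 0;
}

int main(void)
{
    return iface_init(NULL);  /* mirrors internal_ops=0x0 in frame #0 above */
}

Passing NULL here crashes inside the assert expression with SIGSEGV at address (nil), which matches the error message above.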

Steps to Reproduce

  • Command line
$ cat hello_world.c 
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_size;
    int world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    printf("Hello world from rank %d out of %d processors\n", world_rank, world_size);
    MPI_Finalize();
}

$ mpicc -ggdb -O0 hello_world.c -o hello_world

$ mpirun -np 1 ./hello_world
[HOSTNAME:76168:0:76168] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:  76168) ====
 0  /path/to/lib64/libucs.so.0(ucs_handle_error+0x73) [0x7f545c0ea0af]
 1  /path/to/lib64/libucs.so.0(+0x32e88) [0x7f545c0e9e88]
 2  /path/to/lib64/libucs.so.0(+0x32fcf) [0x7f545c0e9fcf]
 3  /path/to/lib64/libuct.so.0(uct_base_iface_t_init+0xf4) [0x7f545c33dd5f]
 4  /path/to/lib64/ucx/libuct_rocm.so.0(+0x7d0d) [0x7f54579cad0d]
 5  /path/to/lib64/ucx/libuct_rocm.so.0(+0x7ec2) [0x7f54579caec2]
 6  /path/to/lib64/libuct.so.0(uct_iface_open+0x18f) [0x7f545c339df8]
 7  /path/to/lib64/libucp.so.0(ucp_worker_iface_open+0x49f) [0x7f545c5d6456]
 8  /path/to/lib64/libucp.so.0(+0x5a853) [0x7f545c5d4853]
 9  /path/to/lib64/libucp.so.0(ucp_worker_create+0x6b2) [0x7f545c5d9d6c]
10  /path/to/lib64/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0xcb) [0x7f545cb44a8f]
11  /path/to/lib64/openmpi/mca_pml_ucx.so(+0x9c59) [0x7f545cb49c59]
12  /path/to/lib64/libmpi.so.40(mca_pml_base_select+0x272) [0x7f546910b0d4]
13  /path/to/lib64/libmpi.so.40(ompi_mpi_init+0x889) [0x7f546911f6e9]
14  /path/to/lib64/libmpi.so.40(MPI_Init+0x7f) [0x7f546909d2e9]
15  ./hello_world() [0x400709]
16  /lib64/libc.so.6(__libc_start_main+0xed) [0x7f5468a3634d]
17  ./hello_world() [0x40063a]
=================================
[HOSTNAME:76168] *** Process received signal ***
[HOSTNAME:76168] Signal: Segmentation fault (11)
[HOSTNAME:76168] Signal code:  (-6)
[HOSTNAME:76168] Failing at address: 0x3e800012988
[HOSTNAME:76168] [ 0] /lib64/libpthread.so.0(+0x13f80)[0x7f5468df9f80]
[HOSTNAME:76168] [ 1] /path/to/lib64/libuct.so.0(uct_base_iface_t_init+0xf4)[0x7f545c33dd5f]
[HOSTNAME:76168] [ 2] /path/to/lib64/ucx/libuct_rocm.so.0(+0x7d0d)[0x7f54579cad0d]
[HOSTNAME:76168] [ 3] /path/to/lib64/ucx/libuct_rocm.so.0(+0x7ec2)[0x7f54579caec2]
[HOSTNAME:76168] [ 4] /path/to/lib64/libuct.so.0(uct_iface_open+0x18f)[0x7f545c339df8]
[HOSTNAME:76168] [ 5] /path/to/lib64/libucp.so.0(ucp_worker_iface_open+0x49f)[0x7f545c5d6456]
[HOSTNAME:76168] [ 6] /path/to/lib64/libucp.so.0(+0x5a853)[0x7f545c5d4853]
[HOSTNAME:76168] [ 7] /path/to/lib64/libucp.so.0(ucp_worker_create+0x6b2)[0x7f545c5d9d6c]
[HOSTNAME:76168] [ 8] /path/to/lib64/openmpi/mca_pml_ucx.so(mca_pml_ucx_init+0xcb)[0x7f545cb44a8f]
[HOSTNAME:76168] [ 9] /path/to/lib64/openmpi/mca_pml_ucx.so(+0x9c59)[0x7f545cb49c59]
[HOSTNAME:76168] [10] /path/to/lib64/libmpi.so.40(mca_pml_base_select+0x272)[0x7f546910b0d4]
[HOSTNAME:76168] [11] /path/to/lib64/libmpi.so.40(ompi_mpi_init+0x889)[0x7f546911f6e9]
[HOSTNAME:76168] [12] /path/to/lib64/libmpi.so.40(MPI_Init+0x7f)[0x7f546909d2e9]
[HOSTNAME:76168] [13] ./hello_world[0x400709]
[HOSTNAME:76168] [14] /lib64/libc.so.6(__libc_start_main+0xed)[0x7f5468a3634d]
[HOSTNAME:76168] [15] ./hello_world[0x40063a]
[HOSTNAME:76168] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node HOSTNAME exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)

The v1.12.0 release tarball, built like so:

$ ../configure CFLAGS="-O0 -ggdb" CXXFLAGS="-O0 -ggdb" --prefix=/path/to --with-rocm --without-knem --without-cuda --without-java
$ make -j `nproc`
$ make install
  • Any UCX environment variables used
export OMPI_MCA_pml=ucx
export OMPI_MCA_btl=^openib,tcp

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • cat /etc/issue or cat /etc/redhat-release + uname -a
    • For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
$ cat /etc/issue

Welcome to SUSE Linux Enterprise Server 15 SP3  (x86_64) - Kernel \r (\l).

eth0: \4{eth0} \6{eth0}


$ uname -a
Linux HOSTNAME 5.3.18-57_11.0.18-cray_shasta_c #1 SMP Sun Jul 18 18:14:52 UTC 2021 (15c194a) x86_64 x86_64 x86_64 GNU/Linux
  • For GPU related issues:
    • GPU type

GPUs are AMD Instinct MI250X.

$ /usr/sbin/dkms status
amdgpu, 5.13.11.21.50-1384496, 5.3.18-57_11.0.18-cray_shasta_c, x86_64: installed

Additional information (depending on the issue)

  • OpenMPI version

OpenMPI v4.1.1 tarball built like so:

$ ../configure CFLAGS="-O0 -ggdb" CXXFLAGS="-O0 -ggdb" --prefix=/path/to --with-ucx=/path/to --without-verbs
$ make -j `nproc`
$ make install

Interestingly, I couldn't run ucx_info -d because it also segfaults. Here is a backtrace from ucx_info -d:

# Memory domain: rocm_ipc
#     Component: rocm_ipc
#             register: unlimited, cost: 9 nsec
#           remote key: 56 bytes
#
#      Transport: rocm_ipc
#         Device: rocm_ipc
#           Type: accelerator
#  System device: <unknown>
[Thread 0x7fffeffff700 (LWP 76976) exited]

Thread 1 "ucx_info" received signal SIGSEGV, Segmentation fault.
0x00007ffff77d7d5f in uct_base_iface_t_init (self=0x6615c0, _myclass=0x7ffff7a10d20 <uct_base_iface_t_class>, _init_count=0x7fffffffd838, ops=0x7ffff63dfbe0 <uct_rocm_ipc_iface_ops>, internal_ops=0x0, md=0x7ffff63dfad0 <md>, worker=0x6609d0, params=0x7fffffffdb30, config=0x631060) at ../../../src/uct/base/uct_iface.c:511
511         ucs_assert(internal_ops->iface_estimate_perf != NULL);
#0  0x00007ffff77d7d5f in uct_base_iface_t_init (self=0x6615c0, _myclass=0x7ffff7a10d20 <uct_base_iface_t_class>, _init_count=0x7fffffffd838, ops=0x7ffff63dfbe0 <uct_rocm_ipc_iface_ops>, internal_ops=0x0, md=0x7ffff63dfad0 <md>, worker=0x6609d0, params=0x7fffffffdb30, 
    config=0x631060) at ../../../src/uct/base/uct_iface.c:511
#1  0x00007ffff61d9d0d in uct_rocm_ipc_iface_t_init (self=0x6615c0, _myclass=0x7ffff63dfdc0 <uct_rocm_ipc_iface_t_class>, _init_count=0x7fffffffd838, md=0x7ffff63dfad0 <md>, worker=0x6609d0, params=0x7fffffffdb30, tl_config=0x631060)
    at ../../../../src/uct/rocm/ipc/rocm_ipc_iface.c:228
#2  0x00007ffff61d9ec2 in uct_rocm_ipc_iface_t_new (arg0=0x7ffff63dfad0 <md>, arg1=0x6609d0, arg2=0x7fffffffdb30, arg3=0x631060, obj_p=0x7fffffffd8e8) at ../../../../src/uct/rocm/ipc/rocm_ipc_iface.c:262
#3  0x00007ffff77d3df8 in uct_iface_open (md=0x7ffff63dfad0 <md>, worker=0x6609d0, params=0x7fffffffdb30, config=0x631060, iface_p=0x7fffffffd8e8) at ../../../src/uct/base/uct_md.c:267
#4  0x000000000040443b in print_iface_info (worker=0x6609d0, md=0x7ffff63dfad0 <md>, resource=0x65dc00) at ../../../../src/tools/info/tl_info.c:156
#5  0x00000000004054d9 in print_tl_info (md=0x7ffff63dfad0 <md>, tl_name=0x65dc00 "rocm_ipc", resources=0x65dc00, num_resources=1, print_opts=16, print_flags=0) at ../../../../src/tools/info/tl_info.c:375
#6  0x0000000000405a4c in print_md_info (component=0x7ffff63dfa00 <uct_rocm_ipc_component>, component_attr=0x7fffffffe0a0, md_name=0x7fffffffe060 "rocm_ipc", print_opts=16, print_flags=0, req_tl_name=0x0) at ../../../../src/tools/info/tl_info.c:482
#7  0x0000000000405d20 in print_uct_component_info (component=0x7ffff63dfa00 <uct_rocm_ipc_component>, print_opts=16, print_flags=0, req_tl_name=0x0) at ../../../../src/tools/info/tl_info.c:588
#8  0x0000000000405dbd in print_uct_info (print_opts=16, print_flags=0, req_tl_name=0x0) at ../../../../src/tools/info/tl_info.c:614
#9  0x00000000004068ed in main (argc=2, argv=0x7fffffffe2f8) at ../../../../src/tools/info/ucx_info.c:257

With UCX v1.11.2, ucx_info -d runs to completion; here is its output from the point where v1.12.0 fails:

# Memory domain: rocm_ipc
#     Component: rocm_ipc
#             register: unlimited, cost: 9 nsec
#           remote key: 56 bytes
#
#      Transport: rocm_ipc
#         Device: rocm_ipc
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 409600.00/ppn + 0.00 MB/sec
#              latency: 1 nsec
#             overhead: 0 nsec
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 4
#        put_align_mtu: <= 4
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 4
#        get_align_mtu: <= 4
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: none
#
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#      Transport: cma
#         Device: memory
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 400 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
dmcdougall added the Bug label Mar 18, 2022
yosefe (Contributor) commented Mar 19, 2022

I think we fixed this in v1.12.1. Can you please try with https://github.com/openucx/ucx/releases/tag/v1.12.1-rc4?
cc @edgargabriel

dmcdougall (Author) commented

Confirmed. UCX v1.11.2 works, UCX v1.12.0 segfaults, and UCX v1.12.1 works.

Thanks!
