UCX SEGV in osc_ucx_component.c #5083

Closed
gpaulsen opened this issue Apr 19, 2018 · 15 comments

Comments

@gpaulsen
Member

gpaulsen commented Apr 19, 2018

@xinzhao3 @jladd-mlnx
Many IBM tests on v3.1.x and on master have been failing for a number of weeks with a runtime SEGV due to the OSC UCX component.

I believe this should be easy to reproduce, though I'm not sure where the value of the 'flavor' argument is coming from.

I think we should either block v3.1.x or disable the UCX OSC component for v3.1.x until we figure this out, given how easy it is to hit this issue.

aint: osc_ucx_component.c:246: int mem_map(void **, size_t, ucp_mem_h *, ompi_osc_ucx_module_t *,
int): Assertion `flavor == 2 || flavor == 1' failed.
[c656f6n05:122836] *** Process received signal ***
[c656f6n05:122836] Signal: Aborted (6)
[c656f6n05:122836] Signal code:  (-6)
[c656f6n05:122836] [ 0] [0x3fff9fcd0478]
[c656f6n05:122836] [ 1] aint: osc_ucx_component.c:246: int mem_map(void **, size_t, ucp_mem_h *,
ompi_osc_ucx_module_t *, int): Assertion `flavor == 2 || flavor == 1' failed.
[c656f6n05:122835] *** Process received signal ***
[c656f6n05:122835] Signal: Aborted (6)
[c656f6n05:122835] Signal code:  (-6)
[c656f6n05:122835] [ 0] [0x3fffa46f0478]
[c656f6n05:122835] [ 1] /lib64/libc.so.6(abort+0x280)[0x3fff9f530d70]
[c656f6n05:122836] [ 2] /lib64/libc.so.6(abort+0x280)[0x3fffa3f50d70]
[c656f6n05:122835] [ 2] /lib64/libc.so.6(+0x348a4)[0x3fff9f5248a4]
[c656f6n05:122836] [ 3] /lib64/libc.so.6(+0x348a4)[0x3fffa3f448a4]
[c656f6n05:122835] [ 3] /lib64/libc.so.6(__assert_fail+0x64)[0x3fff9f524994]
[c656f6n05:122836] [ 4] /lib64/libc.so.6(__assert_fail+0x64)[0x3fffa3f44994
@xinzhao3
Contributor

Thanks @gpaulsen, looking into this now.

@xinzhao3
Contributor

@gpaulsen could you paste the command line and test code path to reproduce this error?

@xinzhao3
Contributor

@gpaulsen I found the test to reproduce this. I think MPI_Win_dynamic is wrong, modifying it now.
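The assertion in the log accepts only flavor values 1 and 2, so any other window flavor aborts the process. A minimal sketch of the alternative, returning a status per flavor instead of asserting, might look like the following. The enum values, the `flavor_map_policy` name, and the return codes are assumptions made for illustration; this is not Open MPI's code and not necessarily how the component was fixed.

```c
#include <stddef.h>

/* Hypothetical mirror of the MPI window flavors. The numeric values are an
 * assumption chosen to match the "flavor == 2 || flavor == 1" assertion. */
enum {
    FLAVOR_CREATE   = 1,  /* user-supplied buffer (MPI_Win_create)        */
    FLAVOR_ALLOCATE = 2,  /* library-allocated buffer (MPI_Win_allocate)  */
    FLAVOR_DYNAMIC  = 3,  /* no memory at creation; attached later        */
    FLAVOR_SHARED   = 4   /* shared-memory window                         */
};

/* Decide whether memory should be mapped when the window is created.
 * Returns 0 and sets *needs_map for flavors this sketch handles,
 * or -1 for flavors it does not support (instead of aborting). */
static int flavor_map_policy(int flavor, int *needs_map)
{
    switch (flavor) {
    case FLAVOR_CREATE:
    case FLAVOR_ALLOCATE:
        *needs_map = 1;
        return 0;
    case FLAVOR_DYNAMIC:
        /* A dynamic window has no memory yet; regions arrive via attach. */
        *needs_map = 0;
        return 0;
    default:
        return -1;  /* let the caller fall back to another component */
    }
}
```

With a status return, a caller that sees -1 can decline the window cleanly rather than taking down every rank with SIGABRT.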

@angainor

angainor commented Apr 20, 2018

@xinzhao3 Probably related: there is a problem when creating windows with empty (0-length) buffers using MPI_Win_create. This code is from the MPICH test suite, run with mpirun -np 2 -mca osc ucx

/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil ; -*- */
/*
 *
 *  (C) 2003 by Argonne National Laboratory.
 *      See COPYRIGHT in top-level directory.
 */
#include <mpi.h>
#include <stdio.h>
// #include "mpitest.h"

#define ELEM_SIZE 8

int main( int argc, char *argv[] )
{
    int     rank;
    int     errors = 0, all_errors = 0;
    int    *flavor, *model, flag;
    void   *buf;
    MPI_Win window;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /** Create using MPI_Win_create() **/

    if (rank > 0)
      MPI_Alloc_mem(rank*ELEM_SIZE, MPI_INFO_NULL, &buf);
    else
      buf = NULL;

    MPI_Win_create(buf, rank*ELEM_SIZE, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &window);
    MPI_Win_get_attr(window, MPI_WIN_CREATE_FLAVOR, &flavor, &flag);

    if (!flag) {
      printf("%d: MPI_Win_create - Error, no flavor\n", rank);
      errors++;
    } else if (*flavor != MPI_WIN_FLAVOR_CREATE) {
      printf("%d: MPI_Win_create - Error, bad flavor (%d)\n", rank, *flavor);
      errors++;
    }

    MPI_Win_get_attr(window, MPI_WIN_MODEL, &model, &flag);

    if (!flag) {
      printf("%d: MPI_Win_create - Error, no model\n", rank);
      errors++;
    } else if ( ! (*model == MPI_WIN_SEPARATE || *model == MPI_WIN_UNIFIED) ) {
      printf("%d: MPI_Win_create - Error, bad model (%d)\n", rank, *model);
      errors++;
    }

    MPI_Win_free(&window);

    if (buf)
      MPI_Free_mem(buf);

    MPI_Finalize();
    return errors ? 1 : 0;
}

@angainor

@xinzhao3 And the stack trace

[1524230578.179553] [c11-1:154185:0]         ucp_mm.c:264  UCX  ERROR Undefined address requires UCP_MEM_MAP_ALLOCATE flag
[c11-1:154185] osc_ucx_component.c:266: ucp_mem_map failed: -5
[c11-1:154185:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
 2 0x000000000006858c mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel7-u4-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat7.4-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:641
 3 0x0000000000068adc mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel7-u4-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat7.4-x86_64/mxm-v3.7/src/mxm/util/debug/debug.c:616
 4 0x0000000000035270 killpg()  ??:0
 5 0x0000000000012f4f ucp_mem_unmap_common()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel7-u4-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v2.1.0-gcc-MLNX_OFED_LINUX-4.2-1.2.0.0-redhat7.4-x86_64/ucx-v1.3.x/src/ucp/core/ucp_mm.c:345
 6 0x0000000000009389 mem_map()  /tmp/marcink/openmpi-3.1.0rc4/ompi/mca/osc/ucx/osc_ucx_component.c:290
 7 0x0000000000056ae3 ompi_win_create()  /tmp/marcink/openmpi-3.1.0rc4/ompi/win/win.c:245
 8 0x000000000007d021 PMPI_Win_create()  /tmp/marcink/openmpi-3.1.0rc4/ompi/mpi/c/profile/pwin_create.c:81
 9 0x0000000000400a41 main()  /nird/home/marcink/mpitest/win_flavors.c:31
10 0x0000000000021c05 __libc_start_main()  ??:0
11 0x0000000000400929 _start()  ??:0
===================
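One reading of this trace is that the map call fails for the NULL, zero-length buffer on rank 0, and the error path then unmaps a handle that was never set. A minimal sketch of a defensive pattern for that situation follows. It uses hypothetical stand-ins (`handle_t`, `fake_map`, `fake_unmap`, `map_region`, `unmap_region`) rather than the real UCX or Open MPI calls, and the -5 value only mirrors the code printed in the log.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a memory-registration API; not the UCX API. */
typedef struct { int mapped; } handle_t;

enum { MAP_OK = 0, MAP_ERR = -5 };  /* -5 mirrors the code in the log above */

static handle_t pool[1];

/* Fails for a NULL address and leaves *out untouched in that case. */
static int fake_map(void *addr, size_t len, handle_t **out)
{
    (void)len;
    if (addr == NULL)
        return MAP_ERR;
    pool[0].mapped = 1;
    *out = &pool[0];
    return MAP_OK;
}

/* Dereferences its argument, so a NULL or stale handle would crash. */
static void fake_unmap(handle_t *h)
{
    h->mapped = 0;
}

/* Guarded map: skip zero-length regions and never leave a stale handle. */
static int map_region(void *addr, size_t len, handle_t **out)
{
    *out = NULL;
    if (len == 0)
        return MAP_OK;          /* nothing to register for an empty window */
    int rc = fake_map(addr, len, out);
    if (rc != MAP_OK) {
        *out = NULL;            /* a failed map must not be unmapped later */
        return rc;
    }
    return MAP_OK;
}

/* Guarded unmap: only release handles that were actually mapped. */
static void unmap_region(handle_t **h)
{
    if (*h != NULL) {
        fake_unmap(*h);
        *h = NULL;
    }
}
```

The point of the sketch is that cleanup is driven by the handle's state, so an empty window or a failed registration leaves nothing for the error path to release.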

@gpaulsen
Member Author

@xinzhao3 do you need anything else from me?

@xinzhao3
Contributor

@gpaulsen I am working on this issue and am close to finishing. I will give an update at tomorrow's ompi meeting.

@xinzhao3
Contributor

@angainor could you try #5094 to see if it works now?

@angainor

@xinzhao3 I can try it tomorrow, our system is in maintenance today.

@bwbarrett
Member

Removed the blocker label; it appears we've minimized the damage on this one with #5135 and #5139, but not completely eliminated it. Leaving the ticket open until we run the problem to ground.

@jsquyres
Member

jsquyres commented Aug 7, 2018

Per 2018-08-07 webex, @xinzhao3 is going to check to see if this was a UCX error / has already been resolved. If we end up release noting this saying that there's a UCX issue in version X.Y.Z yadda yadda yadda, that would be fine.

@jsquyres
Member

@gpaulsen Per 2018-08-21 webex, @xinzhao3 confirmed that this was fixed on the UCX side (i.e., not in Open MPI). @gpaulsen will verify that this is no longer an issue with the latest released UCX.

@jsquyres jsquyres assigned gpaulsen and unassigned xinzhao3 Aug 21, 2018
@bwbarrett bwbarrett modified the milestones: v3.1.2, v3.1.3 Aug 22, 2018
@jsquyres
Member

@gpaulsen Ping.

@gpaulsen
Member Author

Waiting on an update from me. No update yet.

@bwbarrett bwbarrett modified the milestones: v3.1.3, v3.1.4 Oct 29, 2018
@bwbarrett bwbarrett modified the milestones: v3.1.4, v3.1.5 Apr 16, 2019
@gpaulsen
Member Author

No longer failing in v3.1.x with latest UCX.
