User's PMIx call fails after MPI_Init + UCX/HCOLL #6982

Open
angainor opened this issue Sep 17, 2019 · 30 comments

@angainor

angainor commented Sep 17, 2019

@artpol84 FYI As discussed with @rhc54 in an issue reported in the PMIx repo, it seems something in the OpenMPI runtime 'breaks' the PMIx infrastructure so that it is not possible to distribute user keys if a call to PMIx_Put + PMIx_Commit is made after the MPI_Init call. That is, PMIx_Get fails on the clients with error -46. If the user's code sets the custom key before MPI_Init, the code works as expected.

What's puzzling is that I only observe this problem when the UCX pml and HCOLL are enabled. I compiled the code attached at the end of this post against OMPI master + its internal PMIx, but I see the same behavior for OMPI 4.0.1 + PMIx 2.1.4:

$ mpirun -map-by node ./pmixtest
PMIx initialized
PMIx_Put on test-key
PMIx_Put on test-key
Tue Sep 17 14:47:15 2019 ERROR: pmixtest.c:55  Client ns 39583745 rank 1: PMIx_Get test-key: -46
Tue Sep 17 14:47:15 2019 ERROR: pmixtest.c:55  Client ns 39583745 rank 0: PMIx_Get test-key: -46

If I turn off UCX and HCOLL, things work as expected:

$ mpirun -mca pml ^ucx -mca coll_hcoll_enable 0 -map-by node ./pmixtest
PMIx initialized
PMIx_Put on test-key
PMIx_Put on test-key
PMIx_get test-key returned 256 bytes
0: obtained data "rank 1"
PMIx_get test-key returned 256 bytes
1: obtained data "rank 0"
PMIx finalized

Here is the reproducing code. To compile it, one needs to pass the include and link paths of the PMIx installation used by OpenMPI. I'd appreciate any insight.
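For example, assuming the source is saved as pmixtest.c and PMIX_DIR points at the external PMIx installation OpenMPI was built against (both names are placeholders), the build command would look roughly like this:

$ mpicc -o pmixtest pmixtest.c -I$PMIX_DIR/include -L$PMIX_DIR/lib -lpmix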

#include <stdio.h>
#include <stdlib.h>   /* exit() */
#include <string.h>   /* strlen(), strncpy() */
#include <time.h>     /* time(), ctime() */
#include <pmix.h>
#include <mpi.h>

static pmix_proc_t allproc = {};
static pmix_proc_t myproc = {};

#define ERR(msg, ...)							\
    do {								\
	time_t tm = time(NULL);						\
	char *stm = ctime(&tm);						\
	stm[strlen(stm)-1] = 0;						\
	fprintf(stderr, "%s ERROR: %s:%d  " msg "\n", stm, __FILE__, __LINE__, ## __VA_ARGS__); \
	exit(1);							\
    } while(0);


int pmi_set_string(const char *key, void *data, size_t size)
{
    int rc;
    pmix_value_t value;

    PMIX_VALUE_CONSTRUCT(&value);
    value.type = PMIX_BYTE_OBJECT;
    value.data.bo.bytes = data;
    value.data.bo.size  = size;
    if (PMIX_SUCCESS != (rc = PMIx_Put(PMIX_GLOBAL, key, &value))) {
        ERR("Client ns %s rank %d: PMIx_Put failed: %d\n", myproc.nspace, myproc.rank, rc);
    }

    if (PMIX_SUCCESS != (rc = PMIx_Commit())) {
        ERR("Client ns %s rank %d: PMIx_Commit failed: %d\n", myproc.nspace, myproc.rank, rc);
    }

    /* protect the data */
    value.data.bo.bytes = NULL;
    value.data.bo.size  = 0;
    PMIX_VALUE_DESTRUCT(&value);
    printf("PMIx_Put on %s\n", key);


    return 0;
}

int pmi_get_string(uint32_t peer_rank, const char *key, void **data_out, size_t *data_size_out)
{
    int rc;
    pmix_proc_t proc;
    pmix_value_t *pvalue;

    PMIX_PROC_CONSTRUCT(&proc);
    (void)strncpy(proc.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    proc.rank = peer_rank;
    if (PMIX_SUCCESS != (rc = PMIx_Get(&proc, key, NULL, 0, &pvalue))) {
        ERR("Client ns %s rank %d: PMIx_Get %s: %d\n", myproc.nspace, myproc.rank, key, rc);
    }
    if(pvalue->type != PMIX_BYTE_OBJECT){
        ERR("Client ns %s rank %d: PMIx_Get %s: got wrong data type\n", myproc.nspace, myproc.rank, key);
    }
    *data_out = pvalue->data.bo.bytes;
    *data_size_out = pvalue->data.bo.size;

    /* protect the data */
    pvalue->data.bo.bytes = NULL;
    pvalue->data.bo.size = 0;
    PMIX_VALUE_RELEASE(pvalue);
    PMIX_PROC_DESTRUCT(&proc);

    printf("PMIx_get %s returned %zi bytes\n", key, data_size_out[0]);

    return 0;
}

int main(int argc, char *argv[])
{
    char data[256] = {};
    char *data_out;
    size_t size_out;
    int rc;
    pmix_value_t *pvalue;

    // if MPI_Init is executed before PMIx_set, PMIx_Get fails with -46
    MPI_Init(&argc, &argv);

    if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
	ERR("PMIx_Init failed");
        exit(1);
    }
    if(myproc.rank == 0) printf("PMIx initialized\n");

    sprintf(data, "rank %d", myproc.rank);
    pmi_set_string("test-key", data, 256);

    // if MPI_Init is executed after PMIx_set, PMIx_Get works fine
    // MPI_Init(&argc, &argv);
    
    pmi_get_string((myproc.rank+1)%2, "test-key", (void**)&data_out, &size_out);
    printf("%d: obtained data \"%s\"\n", myproc.rank, data_out);

    if (PMIX_SUCCESS != (rc = PMIx_Finalize(NULL, 0))) {
        ERR("Client ns %s rank %d:PMIx_Finalize failed: %d\n", myproc.nspace, myproc.rank, rc);
    }
    if(myproc.rank == 0) printf("PMIx finalized\n");

    MPI_Finalize();
}
@artpol84
Contributor

@angainor thank you for the information.
@karasevb, please take a look.

@artpol84
Contributor

@jsquyres, for some reason I can't assign @karasevb. Can you help please?

@artpol84
Contributor

Thanks @jsquyres !

@jsquyres
Member

jsquyres commented Dec 3, 2019

@artpol84 @karasevb @jladd-mlnx Ping.

@bosilca
Member

bosilca commented Dec 3, 2019

I have noticed some weird issues with what I suppose is the memory allocator used by the UCX PML. In my case the issue arises if I dlopen CUDA (via cuBLAS) after an MPI_Init where the UCX PML is loaded but not enabled (so it is dlclosed). The issue manifests as a segfault deep inside dlopen while allocating some string. If I prevent the UCX PML from loading, then the code works just fine.

@artpol84
Contributor

@jsquyres @angainor sorry for the delay.
We will try to get to it ASAP.

@bosilca thank you for the note.

@karasevb please remind me of the UCX memory hooks setting that we can try.

@karasevb
Member

@angainor could you try running with -mca patcher ^overwrite added?

@angainor
Author

@karasevb unfortunately, still the same:

$ mpirun -x LD_LIBRARY_PATH -np 2 -H c3-3,c3-4 -mca patcher ^overwrite ./out 
PMIx initialized
PMIx_Put on test-key
PMIx_Put on test-key
Fri Dec 13 08:05:37 2019 ERROR: pmitest.c:55  Client ns 2865364993 rank 1: PMIx_Get test-key: -46

Fri Dec 13 08:05:37 2019 ERROR: pmitest.c:55  Client ns 2865364993 rank 0: PMIx_Get test-key: -46

@artpol84
Contributor

@bosilca disabling the patcher may help in your case.

@artpol84
Contributor

@angainor I'll try to repro next week.

@artpol84
Contributor

@angainor, I tried to reproduce with the following results:

First, I had to fix the return value; without the fix I was getting error messages about procs exiting with non-zero status:

--- repro.c	2019-12-17 11:23:36.216701105 -0800
+++ repro_new.c	2019-12-17 11:34:33.000156999 -0800
@@ -96,5 +96,6 @@
     }
     if(myproc.rank == 0) printf("PMIx finalized\n");
 
+    return 0;
     // MPI_Finalize();
 }

I was using HPCX that is based on ompi-v4.0.x (af04a9d) and running on 2 nodes:

$ module load hpcx-gcc && mpirun --map-by ppr:1:node hostname                                                                                                                                                                                 
cn1
cn2

For me the patched reproducer is passing successfully (note that I'm explicitly requesting the UCX PML):

$ module load hpcx-gcc && mpirun --map-by ppr:1:node --mca pml ucx ./repro                                                                                                                                                                    
PMIx_Put on ghex-rank-address
PMIx_Commit on ghex-rank-address
PMIx initialized
PMIx_Put on ghex-rank-address
PMIx_Commit on ghex-rank-address
PMIx_get ghex-rank-address returned 256 bytes
PMIx_get ghex-rank-address returned 256 bytes
PMIx finalized

Is there anything I am missing?

P.S. in order to build the repro I had to explicitly link it with OMPI's PMIx:

$ module load hpcx-gcc && mpicc -o repro repro.c  -L <HPCX_PATH>/ompi/lib/ -lpmix

otherwise, I was getting an error:

$ module load hpcx-gcc && mpicc -o repro repro.c                                                                                                                                                                                              
/tmp/cchpYK8Q.o: In function `pmi_set_string':
repro.c:(.text+0xfbe): undefined reference to `OPAL_MCA_PMIX3X_PMIx_Put'
repro.c:(.text+0x1082): undefined reference to `OPAL_MCA_PMIX3X_PMIx_Commit'
/tmp/cchpYK8Q.o: In function `pmi_get_string':
repro.c:(.text+0x11b7): undefined reference to `OPAL_MCA_PMIX3X_PMIx_Get'
/tmp/cchpYK8Q.o: In function `main':
repro.c:(.text+0x13bc): undefined reference to `OPAL_MCA_PMIX3X_PMIx_Init'
repro.c:(.text+0x14bc): undefined reference to `OPAL_MCA_PMIX3X_PMIx_Finalize'
collect2: error: ld returned 1 exit status

How are you building your repro? Could it be that a different PMIx version is being used? It doesn't look like it, since otherwise the presence or absence of UCX wouldn't matter.

I noticed that you pass LD_LIBRARY_PATH in your command invocation; why do you need it?

$ mpirun -x LD_LIBRARY_PATH -np 2 -H c3-3,c3-4 -mca patcher ^overwrite ./out 

@angainor
Author

angainor commented Dec 17, 2019

@artpol84 I'll be able to look at it in more detail tomorrow, but some thoughts: You have to run on different compute nodes. If you start the ranks on the same node, it works. I was building OpenMPI against an external PMIx version, which was also used to link the reproducing program. I did not use the internal PMIx, also for the reasons you've described.

LD_LIBRARY_PATH was needed on my system when I was running outside of slurm, simply specifying two compute nodes as a command-line argument to mpirun. If I do not do that, the environment is not preserved and the PMIx libs (and OpenMPI libs) are not available.

@artpol84
Contributor

artpol84 commented Dec 18, 2019

Thanks, @angainor

If you start the ranks on the same node, it works.

Yeah, I noticed that and that's why above I explicitly said I was using 2 nodes:

I was using HPCX that is based on ompi-v4.0.x (af04a9d) and running on 2 nodes:

$ module load hpcx-gcc && mpirun --map-by ppr:1:node hostname                                                                                                                                                                                 
cn1
cn2

I was building OpenMPI against an external PMIx version, which was also used to link the reproducing program.

Ok, this could be a key difference. I'll try to build my own OMPI and link it against PMIx.

LD_LIBRARY_PATH was needed on my system when I was running outside of slurm, simply specifying two compute nodes as a commandline arg to mpirun. If I do not do that, the environment is not preserved and the PMIx libs (and OpenMPI libs) are not available.

I see, I probably should try to run without Slurm as well.

@angainor
Author

@artpol84 The same happens when I run from within slurm, running as mpirun -np 2 -map-by node ./out

@angainor
Author

@artpol84 I checked again running on the same node with slurm, and also in this case things work.

FYR, this is the list of modules I load:

Currently Loaded Modules:
  1) StdEnv                      (S)   6) hpcx/2.4
  2) GCCcore/8.3.0                     7) GCC/8.3.0
  3) zlib/1.2.11-GCCcore-8.3.0   (H)   8) ucx/1.6.1
  4) binutils/2.32-GCCcore-8.3.0 (H)   9) openmpi/4.0.2
  5) pmix/3.1.4

I have custom versions of UCX and PMIx, but I also compile openmpi against HPCX and hcoll. Maybe that matters.

@angainor
Author

angainor commented Jan 8, 2020

@rhc54 @artpol84 I believe I've found the simplest use-case that breaks things, and it seems it is not related to UCX/OpenMPI. I can reproduce the behavior using PMIx code only. The problem is triggered by the following sequence of PMIx pseudo-calls:

PMIx_Put('user-variable-1', value);
PMIx_Fence();

PMIx_Put('user-variable-2', value);
PMIx_Get(peer, 'user-variable-2');

If I do a Put on a new user variable after a Fence has been called (which happens inside MPI_Init, I guess), then the no-modex version of PMIx_Get (without an explicit data collect) doesn't work. I would expect that, since user-variable-2 is not locally available, the PMIx server would go to the peer directly and ask for it. But it seems that no such attempt is made and instead an error is returned. I guess the assumption is that since a fence has been called, all data is up to date and there is no need to check at the source.

That is of course a problem :) To make my code work I have to add an explicit Fence after I Put user-variable-2. The problem with this is that I'd rather use a no-modex version without an AllGather of my custom keys, due to memory overhead. I have a near-neighbor application, and I don't want to store all keys locally.
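For clarity, this is a sketch of the pattern I'd like to rely on, using the helper functions from the full listing further below (no Fence between the Put/Commit and the Get):

sprintf(data, "rank %d", myproc.rank);
pmi_set_string("test-key-2", data, 256);   /* PMIx_Put + PMIx_Commit */

/* no pmix_exchange()/Fence here: the local server is expected to fetch
   the key from the peer's server on demand (direct modex) */
pmi_get_string((myproc.rank+1)%2, "test-key-2", (void**)&data_out, &size_out);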

@rhc54
Contributor

rhc54 commented Jan 8, 2020

The code sequence you show cannot work as it is missing a call to PMIx_Commit after the call to PMIx_Put - as a result, the data never flows from the client to the local server, and so a no-modex PMIx_Get cannot find it. The PMIx server never directly asks a peer for data - the peer has to make it available to the server via PMIx_Commit. Perhaps that is the problem in your code?
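For reference, with the missing commits added, the pseudo-sequence above would need to read:

PMIx_Put('user-variable-1', value);
PMIx_Commit();
PMIx_Fence();

PMIx_Put('user-variable-2', value);
PMIx_Commit();
PMIx_Get(peer, 'user-variable-2');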

@angainor
Author

angainor commented Jan 8, 2020

@rhc54 No, the PMIx_Commit is there - that was only pseudocode to demonstrate the call sequence. That part works, as I do get the data correctly when I comment out the first two lines.

@rhc54
Contributor

rhc54 commented Jan 8, 2020

Maybe you could provide us with a complete test code - I confess I'm getting confused.

@angainor
Author

angainor commented Jan 8, 2020

Of course! Lines 132-133 break things. If the second pmix_exchange below is not called, then the get for test-key-2 fails:

    /* the below two lines break the subsequent PMIx_Get query on a key set later */
    pmi_set_string("test-key-1", data, 256);
    pmix_exchange();
 
    sprintf(data, "rank %d", myproc.rank);
    pmi_set_string("test-key-2", data, 256);
    
    /* An explicit Fence has to be called again to read the `test-key-2` */
    pmix_exchange();

    pmi_get_string((myproc.rank+1)%2, "test-key-2", (void**)&data_out, &size_out);
    printf("%d: obtained data \"%s\"\n", myproc.rank, data_out);

The complete code without any MPI calls:

#include <stdio.h>
#include <stdlib.h>    /* exit() */
#include <string.h>    /* strlen(), strncpy() */
#include <time.h>      /* time(), ctime() */
#include <stdbool.h>   /* bool flag passed to PMIX_INFO_LOAD */
#include <pmix.h>
#include <sched.h>

static pmix_proc_t allproc = {};
static pmix_proc_t myproc = {};

#define ERR(msg, ...)							\
    do {								\
	time_t tm = time(NULL);						\
	char *stm = ctime(&tm);						\
	stm[strlen(stm)-1] = 0;						\
	fprintf(stderr, "%s ERROR: %s:%d  " msg "\n", stm, __FILE__, __LINE__, ## __VA_ARGS__); \
	exit(1);							\
    } while(0);


int pmi_set_string(const char *key, void *data, size_t size)
{
    int rc;
    pmix_value_t value;

    PMIX_VALUE_CONSTRUCT(&value);
    value.type = PMIX_BYTE_OBJECT;
    value.data.bo.bytes = data;
    value.data.bo.size  = size;
    if (PMIX_SUCCESS != (rc = PMIx_Put(PMIX_GLOBAL, key, &value))) {
        ERR("Client ns %s rank %d: PMIx_Put failed: %d\n", myproc.nspace, myproc.rank, rc);
    }

    if (PMIX_SUCCESS != (rc = PMIx_Commit())) {
        ERR("Client ns %s rank %d: PMIx_Commit failed: %d\n", myproc.nspace, myproc.rank, rc);
    }

    /* protect the data */
    value.data.bo.bytes = NULL;
    value.data.bo.size  = 0;
    PMIX_VALUE_DESTRUCT(&value);
    printf("PMIx_Put on %s\n", key);


    return 0;
}

int pmi_get_string(uint32_t peer_rank, const char *key, void **data_out, size_t *data_size_out)
{
    int rc;
    pmix_proc_t proc;
    pmix_value_t *pvalue;

    PMIX_PROC_CONSTRUCT(&proc);
    (void)strncpy(proc.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    proc.rank = peer_rank;
    if (PMIX_SUCCESS != (rc = PMIx_Get(&proc, key, NULL, 0, &pvalue))) {
        ERR("Client ns %s rank %d: PMIx_Get %s: %d\n", myproc.nspace, myproc.rank, key, rc);
    }
    if(pvalue->type != PMIX_BYTE_OBJECT){
        ERR("Client ns %s rank %d: PMIx_Get %s: got wrong data type\n", myproc.nspace, myproc.rank, key);
    }
    *data_out = pvalue->data.bo.bytes;
    *data_size_out = pvalue->data.bo.size;

    /* protect the data */
    pvalue->data.bo.bytes = NULL;
    pvalue->data.bo.size = 0;
    PMIX_VALUE_RELEASE(pvalue);
    PMIX_PROC_DESTRUCT(&proc);

    printf("PMIx_get %s returned %zi bytes\n", key, data_size_out[0]);

    return 0;
}

static volatile int fence_active = 0;
static void fence_status_cb(pmix_status_t status, void *cbdata)
{
    fence_active = 0;
}
    
int pmix_exchange()
{
    int rc;
    pmix_info_t info;
    bool flag;

    fence_active = 1;
    PMIX_INFO_CONSTRUCT(&info);
    flag = true;
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &flag, PMIX_BOOL);
    if (PMIX_SUCCESS != (rc = PMIx_Fence_nb(&allproc, 1, &info, 1, fence_status_cb, NULL))){
        fprintf(stderr, "Client ns %s rank %d: PMIx_Fence_nb failed: %d\n", myproc.nspace, myproc.rank, rc);
        exit(1);
    }
    PMIX_INFO_DESTRUCT(&info);
        
    /* wait for completion */
    while(fence_active);
    
    return 0;
}

int main(int argc, char *argv[])
{
    char data[256] = {};
    char *data_out;
    size_t size_out;
    int rc;
    pmix_value_t *pvalue;

    if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
	ERR("PMIx_Init failed");
        exit(1);
    }
    if(myproc.rank == 0) printf("PMIx initialized\n");

    /* job-related info is found in our nspace, assigned to the
     * wildcard rank as it doesn't relate to a specific rank. Setup
     * a name to retrieve such values */
    PMIX_PROC_CONSTRUCT(&allproc);
    (void)strncpy(allproc.nspace, myproc.nspace, PMIX_MAX_NSLEN);
    allproc.rank = PMIX_RANK_WILDCARD;

    /* get the number of procs in our job */
    if (PMIX_SUCCESS != (rc = PMIx_Get(&allproc, PMIX_JOB_SIZE, NULL, 0, &pvalue))) {
        fprintf(stderr, "Client ns %s rank %d: PMIx_Get job size failed: %d\n", myproc.nspace, myproc.rank, rc);
        exit(1);
    }
    uint32_t nprocs = pvalue->data.uint32;
    PMIX_VALUE_RELEASE(pvalue);

    /* the below two lines break the subsequent PMIx_Get query on a key set later */
    pmi_set_string("test-key-1", data, 256);
    pmix_exchange();
 
    sprintf(data, "rank %d", myproc.rank);
    pmi_set_string("test-key-2", data, 256);
    
    /* An explicit Fence has to be called again to read the `test-key-2` */
    pmix_exchange();

    pmi_get_string((myproc.rank+1)%2, "test-key-2", (void**)&data_out, &size_out);
    printf("%d: obtained data \"%s\"\n", myproc.rank, data_out);

    if (PMIX_SUCCESS != (rc = PMIx_Finalize(NULL, 0))) {
        ERR("Client ns %s rank %d:PMIx_Finalize failed: %d\n", myproc.nspace, myproc.rank, rc);
    }
    if(myproc.rank == 0) printf("PMIx finalized\n");
}

@rhc54
Contributor

rhc54 commented Jan 8, 2020

And you are running this using mpirun within OMPI (I see master and v4.0.x cited above), yes? I assume you built OMPI against the same external version of PMIx you used to compile your program - what version was that? I see a v2.1.4 cited above, but also something about using the OMPI internal PMIx, which would not be supported/possible.

@angainor
Author

angainor commented Jan 8, 2020

Right now I'm using OpenMPI 4.0.2 to run the test. It is compiled against an external pmix/3.1.4. My test code is compiled using the same external PMIx installation as OpenMPI is using.

@rhc54
Contributor

rhc54 commented Jan 8, 2020

Kewl - thx! Let me poke into it a bit.

@rhc54
Contributor

rhc54 commented Jan 8, 2020

Hmmm...well, I tested it against PMIx master with PRRTE master, and it works fine. I added your code to the "test" area, but I can remove it if you like - up to you. I wanted to keep it for further tests to ensure this continued to work. You'll find it here: openpmix/prrte#294

Please take a look and see if I am doing something wrong/different. From what I saw, it appears the issue is in the ORTE code, not PMIx - and that PRRTE is doing it correctly. We are getting ready to replace ORTE with PRRTE, but that will not be released until OMPI v5 this summer.

@angainor
Author

angainor commented Jan 8, 2020

@rhc54 Sure, please use the code as you see fit. Just to make sure: did you comment out the second pmix_exchange() call? The code I pasted above works, but it fails when you comment it out.

I can look at PRRTE tomorrow. I remember it did not work with the original code reported in this issue. But then I called MPI_Init. I did not test the PMIx-only version.

@rhc54
Contributor

rhc54 commented Jan 8, 2020

Yes - in fact, I removed all fences from the code path and it still worked.

@angainor
Author

angainor commented Jan 9, 2020

@rhc54 I tested with clean clones of prrte and pmix master. To reproduce the error you have to run the pmix_exchange() with data collect after the call to pmi_set_string("test-key-1", ...); and before the call to pmi_set_string("test-key-2", ...);. Then, pmi_get_string(..., "test-key-2", ...) fails with -46. I have commented on the PR with the line numbers.

@rhc54
Contributor

rhc54 commented Jan 9, 2020

Okay, I might fiddle a bit with the reproducer, as it only fails if I have two nodes and run one process per node. I suspect it is finding a problem in the PMIx server's code path when a host asks for modex info - it's just that the code itself only involves ranks 0 and 1, and so if those two ranks are on the same node it works just fine. I'll take a look at it today.

@angainor
Author

angainor commented Jan 9, 2020

@rhc54 Thanks! Yes, the ranks need to run on different compute nodes. So it seems that it is a problem with how the servers exchange the data between them.

@rhc54
Contributor

rhc54 commented Jan 9, 2020

Not precisely. I think the problem is that the two host daemons are sending requests to the other side, and those requests are being passed down into the respective PMIx server libraries. The problem is that those libraries are (a) looking for the requested key and not finding it, and then (b) checking to see if we have already received data from the target process. If (b) is true, then they immediately respond with "not found" instead of caching the request until the data appears.

The problem is that the remote client will "hang" forever in PMIx_Get if the target proc never posts the data. This makes use of the PMIX_TIMEOUT attribute rather critical. The standard does point that out, so while I expect there will be some "growing pains" out there, it technically is the proper behavior.
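As an illustration only (not from the reproducer above), inside a helper like the reporter's pmi_get_string a direct-modex Get guarded by PMIX_TIMEOUT would look roughly like this; the 10-second value is arbitrary:

pmix_info_t info;
pmix_value_t *pvalue;
int timeout = 10;   /* seconds before PMIx_Get gives up with PMIX_ERR_TIMEOUT */

PMIX_INFO_CONSTRUCT(&info);
PMIX_INFO_LOAD(&info, PMIX_TIMEOUT, &timeout, PMIX_INT);
rc = PMIx_Get(&proc, "test-key-2", &info, 1, &pvalue);
PMIX_INFO_DESTRUCT(&info);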

I'll try to have something for you to play with later in the day.
