Is having more than one GPU initialized UB in standard cuda-aware MPI? #12848

JackAKirk commented Oct 8, 2024

To be precise: is the following code example, which initializes two GPUs within a single MPI rank, valid, or is it undefined behavior in standard CUDA-aware MPI?

In particular, note the following part:

  int otherRank = myrank == 0 ? 1 : 0;
  cudaSetDevice(otherRank);
  cudaSetDevice(myrank);

Here I set a CUDA device that I don't use in that rank before correctly setting the device that I want to map to the MPI rank for MPI invocations. According to e.g. these docs https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#when-do-i-need-to-select-a-cuda-device nothing is specified that tells me this is invalid.
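
As a sanity check (not part of the reproducer below, just an assumption about how one might verify it), something like the following can be dropped in immediately before the MPI call to report which device is current on each rank:

  // Hypothetical check: report which CUDA device is current on this rank
  // just before the MPI call.
  int currentDev = -1;
  cudaGetDevice(&currentDev);
  printf("rank %d: current CUDA device is %d\n", myrank, currentDev);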

Here is the full example:

#include <cuda_runtime.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  int myrank;
  float *val_device, *val_host;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  // First initialize a context on the device this rank will not use for MPI...
  int otherRank = myrank == 0 ? 1 : 0;
  cudaSetDevice(otherRank);
  // ...then select the device actually intended for this rank.
  cudaSetDevice(myrank);
  int num = 1000000;
  val_host = (float *)malloc(sizeof(float) * num);
  cudaMalloc((void **)&val_device, sizeof(float) * num);
  for (int i = 0; i < 1; i++) {
    *val_host = -1.0;
    if (myrank != 0) {
      if (i == 0)
        printf("%s %d %s %f\n", "I am rank", myrank,
               "and my initial value is:", *val_host);
    }

    if (myrank == 0) {
      *val_host = 42.0;
      cudaMemcpy(val_device, val_host, sizeof(float), cudaMemcpyHostToDevice);
      if (i == 0)
        printf("%s %d %s %f\n", "I am rank", myrank,
               "and will broadcast value:", *val_host);
    }

    // Broadcast a single float from rank 0's device buffer to the other
    // ranks' device buffers.
    MPI_Bcast(val_device, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (myrank != 0) {
      cudaMemcpy(val_host, val_device, sizeof(float), cudaMemcpyDeviceToHost);
      if (i == 0)
        printf("%s %d %s %f\n", "I am rank", myrank,
               "and received broadcasted value:", *val_host);
    }
  }
  cudaFree(val_device);
  free(val_host);

  MPI_Finalize();
  return 0;
}

Then execute the above using two MPI ranks on a node with at least two GPUs available.
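
For reference, a build/run recipe along these lines should work; the exact compiler wrapper, CUDA paths, and launcher are installation-dependent, so treat them as placeholders:

mpicc reproducer.c -o reproducer -I${CUDA_HOME}/include -L${CUDA_HOME}/lib64 -lcudart
mpirun -np 2 ./reproducer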

It seems that the above example leads to data leaks on every CUDA-aware MPI implementation I have tried (Cray MPI, and Open MPI/MPICH built with UCX or, I think but am not certain, OpenFabrics), and this isn't specific to MPI_Bcast: using MPI_Send etc. has the same effect.
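
For example, a point-to-point variant along the lines of the following sketch (replacing the MPI_Bcast inside the loop above; the dest/tag values are just illustrative) shows the same symptom:

    if (myrank == 0)
      MPI_Send(val_device, 1, MPI_FLOAT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    else
      MPI_Recv(val_device, 1, MPI_FLOAT, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
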
From a few things that I've read, I think it may be implied that standard MPI does not support mapping multiple GPUs to a single rank, and that things like MPI endpoints or NCCL (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html#example-3-multiple-devices-per-thread) should be used in such a case (roughly as sketched below).
However, I cannot find anything explicitly stating this, so it would be good if someone could confirm it.
If it is in fact meant to be supported, then the above reproducer is a bug report for this case.
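
A rough sketch of that "multiple devices per thread" pattern, assuming NCCL is available in the build environment (names such as NUM_GPUS and broadcast_across_local_gpus are placeholders, and error handling is omitted):

#include <cuda_runtime.h>
#include <nccl.h>

#define NUM_GPUS 2

// One process drives both local GPUs via one NCCL communicator per device.
void broadcast_across_local_gpus(float **bufs, size_t count) {
  int devs[NUM_GPUS] = {0, 1};
  ncclComm_t comms[NUM_GPUS];
  cudaStream_t streams[NUM_GPUS];

  // One NCCL communicator and stream per device owned by this process.
  ncclCommInitAll(comms, NUM_GPUS, devs);
  for (int i = 0; i < NUM_GPUS; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamCreate(&streams[i]);
  }

  // Group the per-device calls so NCCL can launch them together.
  ncclGroupStart();
  for (int i = 0; i < NUM_GPUS; ++i)
    ncclBroadcast(bufs[i], bufs[i], count, ncclFloat, /*root=*/0, comms[i],
                  streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < NUM_GPUS; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
}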

Thanks
