
MTMG multi node #3932

Merged: 19 commits into rapidsai:branch-23.12 on Nov 6, 2023

Conversation

@ChuckHastings (Collaborator) commented Oct 12, 2023

This PR extends the MTMG framework to support a multi-node configuration.

The multi-node configuration of MTMG assumes some externally available mechanism for communicating configuration parameters and the NCCL unique id between the participating processes. For testing purposes we assume a shared file system (NFS is sufficient) where we can create a directory and read/write configuration information, rather than constructing a more complex communication mechanism.
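For illustration, here is a minimal sketch of how such a shared-directory handshake could look. The helper name, file name, and polling interval are assumptions for this example, not code from this PR:

// Minimal sketch (not this PR's test code): rank 0 creates the NCCL unique id and
// publishes it through a shared directory; other ranks poll until it appears.
#include <nccl.h>

#include <chrono>
#include <filesystem>
#include <fstream>
#include <thread>

ncclUniqueId exchange_unique_id(std::filesystem::path const& shared_dir, int global_rank)
{
  auto id_file = shared_dir / "nccl_unique_id.bin";
  ncclUniqueId unique_id{};

  if (global_rank == 0) {
    ncclGetUniqueId(&unique_id);
    // Write to a temporary file and rename it, so readers never see a partial write.
    auto tmp_file = shared_dir / "nccl_unique_id.bin.tmp";
    {
      std::ofstream out(tmp_file, std::ios::binary);
      out.write(reinterpret_cast<char const*>(&unique_id), sizeof(unique_id));
    }
    std::filesystem::rename(tmp_file, id_file);
  } else {
    // Other ranks wait until rank 0 has published the id, then read it back.
    while (!std::filesystem::exists(id_file)) {
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    std::ifstream in(id_file, std::ios::binary);
    in.read(reinterpret_cast<char*>(&unique_id), sizeof(unique_id));
  }
  return unique_id;
}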

Closes https://github.com/rapidsai/graph_dl/issues/329

@ChuckHastings self-assigned this Oct 12, 2023
@ChuckHastings added labels improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) Oct 24, 2023
@ChuckHastings marked this pull request as ready for review October 24, 2023 18:07
@ChuckHastings requested review from a team as code owners October 24, 2023 18:07
src/detail/shuffle_vertex_pairs.cu
src/detail/shuffle_vertex_pairs_int32_int32.cu
src/detail/shuffle_vertex_pairs_int32_int64.cu
src/detail/shuffle_vertex_pairs_int64_int64.cu
Contributor

What's the rationale behind this? Is this to cut compile time?

Contributor

We should be more consistent on this to keep our codebase maintainable.

Collaborator Author

Yes, this was to cut compile time. It was the one file that was constantly being recompiled while I was testing this, and it took a long time.

I have been splitting these files up as I come across them. I think ultimately we need to re-evaluate how we are doing type dispatching so that we can improve compile time more generally.
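For context, a hedged illustration of the per-type split (the names and signatures below are simplified stand-ins, not the actual shuffle_vertex_pairs declarations): the template definition lives in a shared header and each new .cu file explicitly instantiates one vertex/edge type combination, so touching one combination does not force recompiling the others.

// shuffle_vertex_pairs.cuh -- shared template definition (illustrative signature only)
#include <cstdint>
#include <vector>

template <typename vertex_t, typename weight_t>
void shuffle_vertex_pairs(std::vector<vertex_t>& majors,
                          std::vector<vertex_t>& minors,
                          std::vector<weight_t>& weights)
{
  // ... the expensive device code would live here ...
}

// shuffle_vertex_pairs_int32_int64.cu -- one explicit instantiation per translation unit
template void shuffle_vertex_pairs<int32_t, int64_t>(std::vector<int32_t>&,
                                                     std::vector<int32_t>&,
                                                     std::vector<int64_t>&);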

Contributor

Yeah... we should re-evaluate this, but AFAIK this isn't the file with the longest compile time, and splitting just this file looks a bit odd. No problem in doing this for local development, but pushing this change to the main repo (without really agreeing on how to tackle the compile time issue) sounds a bit premature.

@@ -70,6 +70,8 @@ class resource_manager_t {
{
std::lock_guard<std::mutex> lock(lock_);

CUGRAPH_EXPECTS(remote_rank_set_.find(rank) == remote_rank_set_.end(),
Contributor

Just to double check, is rank still local? I assume remote_rank_set_ stores global ranks, and can we query remote_rank_set_ with a local rank?

Contributor

Or, from now on, is rank global while device_id serves as the local GPU rank?

Collaborator Author

In my implementation here (which tries to somewhat mimic MPI), rank is global and device_id is local.

Contributor

Yeah.. in that case, we need to review the documentation.

  /**
   * @brief add a local GPU to the resource manager.
   *
   * @param rank       The rank to assign to the local GPU
   * @param device_id  The device_id corresponding to this rank
   */

It is not clear from the documentation alone that rank is global and device_id is local.

Collaborator Author

Updated the parameter names and documentation to be global_rank and local_device_id. Added a small improvement in the class documentation to identify the global rank concept.
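A sketch of what the updated documentation might look like; the surrounding declaration (method name and parameter types) is assumed for illustration, not copied from the push:

  /**
   * @brief Add a local GPU to the resource manager.
   *
   * Ranks are global across every node participating in the MTMG instance
   * (mimicking MPI ranks); the device id identifies the GPU on this node.
   *
   * @param global_rank      The global rank to assign to the local GPU
   * @param local_device_id  The local device_id corresponding to this global rank
   */
  void register_local_gpu(int global_rank, int local_device_id);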

} else {
local_ranks_to_include.push_back(rank);
}
});
Contributor

I think using std::copy_if with std::back_inserter better documents the intention.
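Something along these lines (the variable names are stand-ins for the ones in the diff, not the final code):

#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Collect every rank that is not in the remote set, i.e. the locally managed ranks.
std::vector<int> local_ranks(std::vector<int> const& ranks, std::set<int> const& remote_rank_set)
{
  std::vector<int> local_ranks_to_include;
  std::copy_if(ranks.begin(),
               ranks.end(),
               std::back_inserter(local_ranks_to_include),
               [&](int rank) { return remote_rank_set.count(rank) == 0; });
  return local_ranks_to_include;
}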

Collaborator Author

I will look at this change.

Collaborator Author

Updated in next push.

Comment on lines 133 to 138
//
// This is intended to mimic a multi-node host application (non-MPI) integrating
// with MTMG library. This is a simple implementation using a shared file system
// to pass configuration messages. Terribly inefficient, but should mimic
// expected behavior.
//
Contributor

What do you think about this vs. using MPI but creating just one process per node? We are already using MPI for testing, so the latter won't require a new execution environment. The former assumes the availability of something like NFS, and users need to set up the directory path correctly. This is not necessary if we re-use the existing MPI test environment and just create one process per node.

Collaborator Author

I was trying to avoid depending on MPI since we want the interface to work without MPI.

We can certainly create an MPI-based test that doesn't use MPI to initialize the comms. That would be a bit easier to use. I'd be concerned that we might come to rely on MPI in some subtle way.

Contributor

I was thinking more of updating

#define CUGRAPH_MG_TEST_PROGRAM_MAIN()                                                  \
  int main(int argc, char** argv)                                                       \
  {                                                                                     \
    cugraph::test::initialize_mpi(argc, argv);                                          \
    auto comm_rank = cugraph::test::query_mpi_comm_world_rank();                        \
    auto comm_size = cugraph::test::query_mpi_comm_world_size();                        \
    int num_gpus_per_node{};                                                            \
    RAFT_CUDA_TRY(cudaGetDeviceCount(&num_gpus_per_node));                              \
    RAFT_CUDA_TRY(cudaSetDevice(comm_rank % num_gpus_per_node));                        \
    ::testing::InitGoogleTest(&argc, argv);                                             \
    auto const cmd_opts = parse_test_options(argc, argv);                               \
    auto const rmm_mode = cmd_opts["rmm_mode"].as<std::string>();                       \
    auto resource       = cugraph::test::create_memory_resource(rmm_mode);              \
    rmm::mr::set_current_device_resource(resource.get());                               \
    cugraph::test::g_perf = cmd_opts["perf"].as<bool>();                                \
    cugraph::test::g_rmat_scale =                                                       \
      (cmd_opts.count("rmat_scale") > 0)                                                \
        ? std::make_optional<size_t>(cmd_opts["rmat_scale"].as<size_t>())               \
        : std::nullopt;                                                                 \
    cugraph::test::g_rmat_edge_factor =                                                 \
      (cmd_opts.count("rmat_edge_factor") > 0)                                          \
        ? std::make_optional<size_t>(cmd_opts["rmat_edge_factor"].as<size_t>())         \
        : std::nullopt;                                                                 \
    cugraph::test::g_test_file_name =                                                   \
      (cmd_opts.count("test_file_name") > 0)                                            \
        ? std::make_optional<std::string>(cmd_opts["test_file_name"].as<std::string>()) \
        : std::nullopt;                                                                 \
                                                                                        \
    auto ret = RUN_ALL_TESTS();                                                         \
    cugraph::test::finalize_mpi();                                                      \
    return ret;                                                                         \
  }

and creating CUGRAPH_MTMG_TEST_PROGRAM_MAIN(). We can just broadcast the NCCL unique ID here and we won't touch anything related to MPI after this (and we can even call cugraph::test::finalize_mpi() right after broadcasting the NCCL unique ID).

Collaborator Author

Got the MPI configuration working in the next push. I did not abstract it into a reusable function/macro; I can look at that when we add a second test (perhaps in a PR adding Louvain or Jaccard).

@seunghwak (Contributor) left a comment

LGTM in general.

I added a few comments for potential code improvements (better hiding the NCCL & MPI dependencies).

@@ -89,7 +100,6 @@ class instance_manager_t {
std::vector<std::unique_ptr<raft::handle_t>> raft_handle_{};
std::vector<std::unique_ptr<ncclComm_t>> nccl_comms_{};
Contributor

Why are we directly managing NCCL? Shouldn't this be an implementation detail of raft?

Collaborator Author

We can look at updating raft for this. I'll talk with Corey.

if (g_node_rank == 0) RAFT_NCCL_TRY(ncclGetUniqueId(&instance_manager_id));

RAFT_MPI_TRY(
MPI_Bcast(&instance_manager_id, sizeof(instance_manager_id), MPI_CHAR, 0, MPI_COMM_WORLD));
Contributor

If we want to hide the existence of MPI in individual tests, we may add

// valid only for MTMG tests, updated by CUGRAPH_MTMG_TEST_PROGRAM_MAIN()
std::optional<ncclUniqueId> g_nccl_unique_id{std::nullopt};

after

// these variables are updated by command line arguments
static bool g_perf{false};
static std::optional<size_t> g_rmat_scale{std::nullopt};
static std::optional<size_t> g_rmat_edge_factor{std::nullopt};
static std::optional<std::string> g_test_file_name{std::nullopt};

https://github.com/rapidsai/cugraph/blob/branch-23.12/cpp/tests/utilities/base_fixture.hpp#L119

A bit annoying to add a global variable (only valid for MTMG tests) to all C++ tests, but I think this is OK as this is only for testing and the variable is std::optional (so having invalid values for non-MTMG tests is somewhat acceptable).

Then, we can expose our dependency on MPI only within CUGRAPH_MTMG_TEST_PROGRAM_MAIN() and we can finalize MPI before invoking individual tests.

Collaborator Author

I was planning on visiting this issue in depth when I add Louvain. I was also thinking of adding some test methods to hide some of these things.

Collaborator Author

I tried creating g_nccl_unique_id as a global and initializing it in the main program. As currently constructed, each call to create_instance_manager needs a unique g_nccl_unique_id, so I couldn't reuse them. Each test could potentially use a different GPU configuration (something I eventually want to add), so I needed to move the creation of the nccl_unique_id back into the test.

What I'm imagining is creating an MTMG test function that will create and broadcast the nccl unique id so that the use of MPI is restricted there. It does require that we not finalize MPI before we run the tests. But if we capture this in helper methods then we should be able to easily identify if we are using MPI calls that could potentially be a problem within one of the MTMG tests.
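A minimal sketch of such a helper, assuming a hypothetical name and that MPI stays initialized for the duration of the test run (this is not the merged code):

#include <mpi.h>
#include <nccl.h>

// Each MTMG test calls this to get a fresh NCCL unique id; all MPI usage for
// the handshake is confined to this one helper, so stray MPI calls elsewhere
// in the MTMG tests are easy to spot.
ncclUniqueId create_and_broadcast_unique_id(int comm_rank)
{
  ncclUniqueId id{};
  if (comm_rank == 0) { ncclGetUniqueId(&id); }
  MPI_Bcast(&id, sizeof(id), MPI_CHAR, 0, MPI_COMM_WORLD);
  return id;
}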


#include <gtest/gtest.h>

#include <nccl.h>
Contributor

And I am not sure exposing NCCL in cuGraph is a good thing. Up to this point, we have largely hidden the existence of NCCL under raft; now we are explicitly dependent on NCCL. Hiding NCCL for MTMG may require an additional raft function and this can be a FIXME thing, but I feel like hiding the explicit NCCL dependency is a good design choice.

Collaborator Author

I will address this also in the Louvain PR. I can actually pick this dependency up through a raft header, so if we address your other comment (about ncclComm_t and migrating some functionality into raft) that should clean this up as well.

@ChuckHastings (Collaborator Author)

/merge

@rapids-bot (bot) merged commit 586451d into rapidsai:branch-23.12 Nov 6, 2023
83 checks passed
@ChuckHastings deleted the mtmg_multi_node branch December 1, 2023 21:57
Labels: improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)

4 participants