
MTMG multi node #3932

Merged: 19 commits into rapidsai:branch-23.12 on Nov 6, 2023

Conversation

@ChuckHastings (Collaborator) commented Oct 12, 2023

This PR extends the MTMG framework to support a multi-node configuration.

The multi-node configuration of MTMG assumes some externally available mechanism for communicating configuration parameters and the NCCL unique id between the participating processes. For testing purposes we assume a shared file system (NFS is sufficient) where we can create a directory and read/write configuration information, rather than constructing a more complex communication mechanism.
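For illustration, here is a minimal sketch of how such a shared-directory handshake could look. The helper name, file name, and polling interval are assumptions for this example, not code from this PR:

// Minimal sketch (not this PR's test code): rank 0 creates the NCCL unique id and
// publishes it through a shared directory; other ranks poll until it appears.
#include <nccl.h>

#include <chrono>
#include <filesystem>
#include <fstream>
#include <thread>

ncclUniqueId exchange_unique_id(std::filesystem::path const& shared_dir, int global_rank)
{
  auto id_file = shared_dir / "nccl_unique_id.bin";
  ncclUniqueId unique_id{};

  if (global_rank == 0) {
    ncclGetUniqueId(&unique_id);
    // Write to a temporary file and rename it, so readers never see a partial write.
    auto tmp_file = shared_dir / "nccl_unique_id.bin.tmp";
    {
      std::ofstream out(tmp_file, std::ios::binary);
      out.write(reinterpret_cast<char const*>(&unique_id), sizeof(unique_id));
    }
    std::filesystem::rename(tmp_file, id_file);
  } else {
    // Other ranks wait until rank 0 has published the id, then read it back.
    while (!std::filesystem::exists(id_file)) {
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    std::ifstream in(id_file, std::ios::binary);
    in.read(reinterpret_cast<char*>(&unique_id), sizeof(unique_id));
  }
  return unique_id;
}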

Closes https://github.com/rapidsai/graph_dl/issues/329

@ChuckHastings self-assigned this Oct 12, 2023
@ChuckHastings added labels improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) Oct 24, 2023
@ChuckHastings marked this pull request as ready for review October 24, 2023 18:07
@ChuckHastings requested review from a team as code owners October 24, 2023 18:07
src/detail/shuffle_vertex_pairs.cu
src/detail/shuffle_vertex_pairs_int32_int32.cu
src/detail/shuffle_vertex_pairs_int32_int64.cu
src/detail/shuffle_vertex_pairs_int64_int64.cu
Contributor

What's the rationale behind this? Is this to cut compile time?

Contributor

We should be more consistent on this to keep our codebase maintainable.

Collaborator Author

Yes, this was to cut compile time. It was the one file that was constantly being recompiled while I was testing this, and it took a long time.

I have been splitting these files up as I come across them. I think ultimately we need to re-evaluate how we are doing type dispatching so that we can improve compile time more generally.
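For context, a hedged illustration of the per-type split (the names and signatures below are simplified stand-ins, not the actual shuffle_vertex_pairs declarations): the template definition lives in a shared header and each new .cu file explicitly instantiates one vertex/edge type combination, so touching one combination does not force recompiling the others.

// shuffle_vertex_pairs.cuh -- shared template definition (illustrative signature only)
#include <cstdint>
#include <vector>

template <typename vertex_t, typename weight_t>
void shuffle_vertex_pairs(std::vector<vertex_t>& majors,
                          std::vector<vertex_t>& minors,
                          std::vector<weight_t>& weights)
{
  // ... the expensive device code would live here ...
}

// shuffle_vertex_pairs_int32_int64.cu -- one explicit instantiation per translation unit
template void shuffle_vertex_pairs<int32_t, int64_t>(std::vector<int32_t>&,
                                                     std::vector<int32_t>&,
                                                     std::vector<int64_t>&);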

Contributor

Yeah... we should re-evaluate this, but AFAIK this isn't the file with the longest compile time, and splitting just this file looks a bit odd. No problem in doing this for local development, but pushing this change to the main repo (without really agreeing on how to tackle the compile time issue) sounds a bit premature.

@@ -70,6 +70,8 @@ class resource_manager_t {
{
std::lock_guard<std::mutex> lock(lock_);

CUGRAPH_EXPECTS(remote_rank_set_.find(rank) == remote_rank_set_.end(),
Contributor

Just to double check, is rank still local? I assume remote_rank_set_ stores global ranks, and can we query remote_rank_set_ with a local rank?

Contributor

Or, from now on, is rank global while device_id serves as the local GPU rank?

Collaborator Author

In my implementation here (which tries to somewhat mimic MPI), rank is global and device_id is local.

Contributor

Yeah.. in that case, we need to review the documentation.

  /**
   * @brief add a local GPU to the resource manager.
   *
   * @param rank       The rank to assign to the local GPU
   * @param device_id  The device_id corresponding to this rank
   */

It is not clear from the documentation alone that rank is global and device_id is local.

Collaborator Author

Updated the parameter names and documentation to be global_rank and local_device_id. Added a small improvement in the class documentation to identify the global rank concept.
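A sketch of what the updated documentation might look like; the surrounding declaration (method name and parameter types) is assumed for illustration, not copied from the push:

  /**
   * @brief Add a local GPU to the resource manager.
   *
   * Ranks are global across every node participating in the MTMG instance
   * (mimicking MPI ranks); the device id identifies the GPU on this node.
   *
   * @param global_rank      The global rank to assign to the local GPU
   * @param local_device_id  The local device_id corresponding to this global rank
   */
  void register_local_gpu(int global_rank, int local_device_id);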

} else {
local_ranks_to_include.push_back(rank);
}
});
Contributor

I think using std::copy_if with std::back_inserter better documents the intention.
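Something along these lines (the variable names are stand-ins for the ones in the diff, not the final code):

#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Collect every rank that is not in the remote set, i.e. the locally managed ranks.
std::vector<int> local_ranks(std::vector<int> const& ranks, std::set<int> const& remote_rank_set)
{
  std::vector<int> local_ranks_to_include;
  std::copy_if(ranks.begin(),
               ranks.end(),
               std::back_inserter(local_ranks_to_include),
               [&](int rank) { return remote_rank_set.count(rank) == 0; });
  return local_ranks_to_include;
}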

Collaborator Author

I will look at this change.

Collaborator Author

Updated in next push.

Comment on lines 133 to 138
//
// This is intended to mimic a multi-node host application (non-MPI) integrating
// with MTMG library. This is a simple implementation using a shared file system
// to pass configuration messages. Terribly inefficient, but should mimic
// expected behavior.
//
Contributor

What do you think about this vs. using MPI but creating just one process per node? We are already using MPI for testing, so the latter won't require a new execution environment. The former assumes the availability of something like NFS, and users need to set up the directory path correctly. This is not necessary if we re-use the existing MPI test environment and just create one process per node.

Collaborator Author

I was trying to avoid depending on MPI since we want the interface to work without MPI.

We can certainly create an MPI-based test that doesn't use MPI to initialize the comms. That would be a bit easier to use. I'd be concerned that we might come to rely on MPI in some subtle way.

Contributor

I was thinking more of updating

#define CUGRAPH_MG_TEST_PROGRAM_MAIN()                                                  \
  int main(int argc, char** argv)                                                       \
  {                                                                                     \
    cugraph::test::initialize_mpi(argc, argv);                                          \
    auto comm_rank = cugraph::test::query_mpi_comm_world_rank();                        \
    auto comm_size = cugraph::test::query_mpi_comm_world_size();                        \
    int num_gpus_per_node{};                                                            \
    RAFT_CUDA_TRY(cudaGetDeviceCount(&num_gpus_per_node));                              \
    RAFT_CUDA_TRY(cudaSetDevice(comm_rank % num_gpus_per_node));                        \
    ::testing::InitGoogleTest(&argc, argv);                                             \
    auto const cmd_opts = parse_test_options(argc, argv);                               \
    auto const rmm_mode = cmd_opts["rmm_mode"].as<std::string>();                       \
    auto resource       = cugraph::test::create_memory_resource(rmm_mode);              \
    rmm::mr::set_current_device_resource(resource.get());                               \
    cugraph::test::g_perf = cmd_opts["perf"].as<bool>();                                \
    cugraph::test::g_rmat_scale =                                                       \
      (cmd_opts.count("rmat_scale") > 0)                                                \
        ? std::make_optional<size_t>(cmd_opts["rmat_scale"].as<size_t>())               \
        : std::nullopt;                                                                 \
    cugraph::test::g_rmat_edge_factor =                                                 \
      (cmd_opts.count("rmat_edge_factor") > 0)                                          \
        ? std::make_optional<size_t>(cmd_opts["rmat_edge_factor"].as<size_t>())         \
        : std::nullopt;                                                                 \
    cugraph::test::g_test_file_name =                                                   \
      (cmd_opts.count("test_file_name") > 0)                                            \
        ? std::make_optional<std::string>(cmd_opts["test_file_name"].as<std::string>()) \
        : std::nullopt;                                                                 \
                                                                                        \
    auto ret = RUN_ALL_TESTS();                                                         \
    cugraph::test::finalize_mpi();                                                      \
    return ret;                                                                         \
  }

and creating CUGRAPH_MTMG_TEST_PROGRAM_MAIN(). We can just broadcast the NCCL unique ID here and we won't touch anything related to MPI after this (and we can even call cugraph::test::finalize_mpi() right after broadcasting the NCCL unique ID).

Collaborator Author

Got the MPI configuration working in the next push. I did not abstract it into a reusable function/macro; I can look at that when we add a second test (perhaps in a PR adding Louvain or Jaccard).

@seunghwak (Contributor) left a comment

LGTM in general.

I added a few comments for potential code improvements (better hiding the NCCL & MPI dependencies).

@@ -89,7 +100,6 @@ class instance_manager_t {
std::vector<std::unique_ptr<raft::handle_t>> raft_handle_{};
std::vector<std::unique_ptr<ncclComm_t>> nccl_comms_{};
Contributor

Why are we directly managing NCCL? Shouldn't this be an implementation detail of raft?

Collaborator Author

We can look at updating raft for this. I'll talk with Corey.

if (g_node_rank == 0) RAFT_NCCL_TRY(ncclGetUniqueId(&instance_manager_id));

RAFT_MPI_TRY(
MPI_Bcast(&instance_manager_id, sizeof(instance_manager_id), MPI_CHAR, 0, MPI_COMM_WORLD));
Contributor

If we want to hide the existence of MPI in individual tests, we may add

// valid only for MTMG tests, updated by CUGRAPH_MTMG_TEST_PROGRAM_MAIN()
std::optional<ncclUniqueId> g_nccl_unique_id{std::nullopt};

after

// these variables are updated by command line arguments
static bool g_perf{false};
static std::optional<size_t> g_rmat_scale{std::nullopt};
static std::optional<size_t> g_rmat_edge_factor{std::nullopt};
static std::optional<std::string> g_test_file_name{std::nullopt};

https://github.com/rapidsai/cugraph/blob/branch-23.12/cpp/tests/utilities/base_fixture.hpp#L119

A bit annoying to add a global variable (only valid for MTMG tests) to all C++ tests, but I think this is OK as this is only for testing and the variable is std::optional (so having invalid values for non-MTMG tests is somewhat acceptable).

Then, we can expose our dependency on MPI only within CUGRAPH_MTMG_TEST_PROGRAM_MAIN() and we can finalize MPI before invoking individual tests.

Collaborator Author

I was planning on visiting this issue in depth when I add Louvain. I was also thinking of adding some test methods to hide some of these things.

Collaborator Author

I tried creating g_nccl_unique_id as a global and initializing it in the main program. As currently constructed, each call to create_instance_manager needs a unique g_nccl_unique_id, so I couldn't reuse them. Each test could potentially use a different GPU configuration (something I eventually want to add), so I needed to move the creation of the nccl_unique_id back into the test.

What I'm imagining is creating an MTMG test function that will create and broadcast the nccl unique id so that the use of MPI is restricted there. It does require that we not finalize MPI before we run the tests. But if we capture this in helper methods then we should be able to easily identify if we are using MPI calls that could potentially be a problem within one of the MTMG tests.
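A minimal sketch of such a helper, assuming a hypothetical name and that MPI stays initialized for the duration of the test run (this is not the merged code):

#include <mpi.h>
#include <nccl.h>

// Each MTMG test calls this to get a fresh NCCL unique id; all MPI usage for
// the handshake is confined to this one helper, so stray MPI calls elsewhere
// in the MTMG tests are easy to spot.
ncclUniqueId create_and_broadcast_unique_id(int comm_rank)
{
  ncclUniqueId id{};
  if (comm_rank == 0) { ncclGetUniqueId(&id); }
  MPI_Bcast(&id, sizeof(id), MPI_CHAR, 0, MPI_COMM_WORLD);
  return id;
}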


#include <gtest/gtest.h>

#include <nccl.h>
Contributor

And I am not sure exposing NCCL in cuGraph is a good thing. Up to this point, we have largely hidden the existence of NCCL under raft; now we are explicitly dependent on NCCL. Hiding NCCL for MTMG may require an additional raft function and this can be a FIXME thing, but I feel like hiding the explicit NCCL dependency is a good design choice.

Collaborator Author

I will address this also in the Louvain PR. I can actually pick this dependency up through a raft header, so if we address your other comment (about ncclComm_t and migrating some functionality into raft) that should clean this up as well.

@ChuckHastings (Collaborator Author)

/merge

@rapids-bot (bot) merged commit 586451d into rapidsai:branch-23.12 Nov 6, 2023
83 checks passed
@ChuckHastings deleted the mtmg_multi_node branch December 1, 2023 21:57
Labels: improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)

4 participants