nx-cugraph: adds ancestors, descendants, and BFS algos #4029

Merged
merged 10 commits into rapidsai:branch-24.02 on Dec 21, 2023

Conversation

@eriknw (Contributor) commented Nov 30, 2023

This PR adds the following:

  • ancestors
  • bfs_edges
  • bfs_layers
  • bfs_predecessors
  • bfs_successors
  • bfs_tree
  • descendants
  • descendants_at_distance
  • generic_bfs_edges

@eriknw requested a review from a team as a code owner November 30, 2023 22:13
This required skipping several tests that use algorithms that repeatedly call `bfs_layers`.
Also, refactor BFS algorithms to reduce repetition.
Also, update the conversion from networkx to handle non-dict edge data mappings.
@github-actions (bot) added the python label Dec 5, 2023
@rlratzel added the improvement and non-breaking labels Dec 5, 2023
@eriknw (Contributor, Author) commented Dec 7, 2023

@rlratzel what do you think about the three tests that fail due to running out of GPU memory? I don't think the graphs are particularly large. What can we try? Can we investigate? Could it help to call gc.collect() regularly during tests (since maybe device memory is being held by released Python objects, but Python's garbage collection doesn't trigger b/c host memory usage is low)? Or should we just skip the tests (and maybe run them one-by-one)?

@rlratzel (Contributor) commented Dec 9, 2023

> @rlratzel what do you think about the three tests that fail due to running out of GPU memory? I don't think the graphs are particularly large. What can we try? Can we investigate?

I was wondering if we have a memory leak, possibly related to PLC graph creation during the NX-to-PLC Graph conversion, which happens more often now that nx-cugraph implements more algos. Do you see this when you run tests locally, or see anything that resembles a leak when polling nvidia-smi during testing?

> Could it help to call gc.collect() regularly during tests (since maybe device memory is being held by released Python objects, but Python's garbage collection doesn't trigger b/c host memory usage is low)?

Maybe? If this is happening during tests, I'd be concerned it could happen in a large user application too, where users may not know to manually invoke the GC or would prefer not to. I'll try recreating this on my workstation using your branch. I can then add a test setup hook that calls gc.collect() before each test to see if that makes a difference.
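
For reference, a per-test GC hook along those lines might look like the following (a minimal sketch of a conftest.py fixture, not code from this PR):

# conftest.py (hypothetical sketch)
import gc

import pytest

@pytest.fixture(autouse=True)
def collect_garbage_before_each_test():
    # Force a full collection so device memory held by unreachable
    # Python objects is released before the next test allocates.
    gc.collect()
    yield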

@rlratzel (Contributor) commented

@eriknw I just ran your branch on my workstation and saw the same results - the same three tests failed with the same OOM error. I was polling nvidia-smi and never saw it exceed ~800MB, and my workstation GPUs have (I think) more memory than the V100s in the CI runners (I have 50GB cards).

I don't think this is a leak or anything gc.collect() will fix; I think there's a bug somewhere in the algos (or combination of algos) called from the tests. Are you seeing the same result when you run these tests locally?

I suspect something is causing an abrupt spike in memory which fails the test and the memory is reclaimed faster than my nvidia-smi polling interval, or there's a bug where we're computing an amount of memory to allocate that's much too big. I'm just guessing at this point though. I'll keep looking...
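
For finer-grained visibility than a manual nvidia-smi loop, one option is to sample device memory at a short interval while the tests run. A minimal sketch, assuming the pynvml package is available (not part of this PR):

import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust as needed
peak = 0
try:
    while True:
        used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
        peak = max(peak, used)
        print(f"used: {used / 2**20:8.0f} MiB  peak: {peak / 2**20:8.0f} MiB", end="\r")
        time.sleep(0.05)  # 50 ms sampling interval
except KeyboardInterrupt:
    print(f"\npeak observed: {peak / 2**20:.0f} MiB")
    pynvml.nvmlShutdown()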


@rlratzel (Contributor) commented

@eriknw - I have a minimal repro script. Can you use this to debug further? I'm going to stop for the night but I can help next week.

import networkx as nx
from networkx.algorithms import flow

G = nx.fast_gnp_random_graph(100, 0.0575, seed=42)
cut = nx.minimum_node_cut(G, flow_func=flow.boykov_kolmogorov)

G = nx.DiGraph()
G.add_nodes_from([0, 1, 2])
list(nx.algorithms.dag.antichains(G))

Running it with the cugraph backend:

(base) root@98e76e671240:/Projects/eriknw/cugraph# NETWORKX_AUTOMATIC_BACKENDS=cugraph python oom_repro.py
Traceback (most recent call last):
  File "/Projects/eriknw/cugraph/oom_repro.py", line 9, in <module>
    list(nx.algorithms.dag.antichains(G))
  File "/opt/conda/lib/python3.10/site-packages/networkx/algorithms/dag.py", line 940, in antichains
    TC = nx.transitive_closure_dag(G, topo_order)
  File "/opt/conda/lib/python3.10/site-packages/networkx/utils/decorators.py", line 770, in func
    return argmap._lazy_compile(__wrapper)(*args, **kwargs)
  File "<class 'networkx.utils.decorators.argmap'> compilation 28", line 4, in argmap_transitive_closure_dag_25
    import inspect
  File "/opt/conda/lib/python3.10/site-packages/networkx/utils/backends.py", line 576, in __call__
    return self.orig_func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/networkx/algorithms/dag.py", line 811, in transitive_closure_dag
    TC.add_edges_from((v, u) for u in nx.descendants_at_distance(TC, v, 2))
  File "/opt/conda/lib/python3.10/site-packages/networkx/utils/backends.py", line 569, in __call__
    return self._convert_and_call(
  File "/opt/conda/lib/python3.10/site-packages/networkx/utils/backends.py", line 817, in _convert_and_call
    result = getattr(backend, self.name)(*converted_args, **converted_kwargs)
  File "/Projects/eriknw/cugraph/python/nx-cugraph/nx_cugraph/utils/decorators.py", line 86, in __call__
    return self.__wrapped__(*args, **kwargs)
  File "/Projects/eriknw/cugraph/python/nx-cugraph/nx_cugraph/algorithms/traversal/breadth_first_search.py", line 239, in descendants_at_distance
    distances, predecessors, node_ids = plc.bfs(
  File "bfs.pyx", line 181, in pylibcugraph.bfs.bfs
  File "utils.pyx", line 53, in pylibcugraph.utils.assert_success
RuntimeError: non-success value returned from cugraph_bfs: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp

@eriknw (Contributor, Author) commented Dec 11, 2023

> Do you see this when you run tests locally, or see anything that resembles a leak when polling nvidia-smi during testing?

Yeah, I see it (just like you) when I run the tests locally. However, if I run each failing test individually, they pass.

Thanks for investigating and making a reproducer!

@eriknw requested a review from a team as a code owner December 20, 2023 13:09
@github-actions (bot) added the ci label Dec 20, 2023
@ajschmidt8 (Member) left a comment

Approving ops-codeowner file changes

Comment on lines +114 to +118
# Individually run tests that are skipped above b/c they may run out of memory
PYTEST_NO_SKIP=True ./run_nx_tests.sh --cov-append -k "TestDAG and test_antichains"
PYTEST_NO_SKIP=True ./run_nx_tests.sh --cov-append -k "TestMultiDiGraph_DAGLCA and test_all_pairs_lca_pairs_without_lca"
PYTEST_NO_SKIP=True ./run_nx_tests.sh --cov-append -k "TestDAGLCA and test_all_pairs_lca_pairs_without_lca"
PYTEST_NO_SKIP=True ./run_nx_tests.sh --cov-append -k "TestEfficiency and test_using_ego_graph"
Contributor commented

We can leave this in place in order to get this PR merged, but I think this is masking a bug (likely in pylibcugraph/libcugraph, not in nx-cugraph).

I filed an issue for this, and I'll open a separate PR to remove these lines and the skipped tests; that PR will be merged once the bug is fixed.

Contributor commented

The PR to eventually fix the bug is #4068

@rlratzel (Contributor) commented

/merge

@rapids-bot (bot) merged commit 8d5bba3 into rapidsai:branch-24.02 on Dec 21, 2023
74 checks passed
@rlratzel changed the title from "nx-cugraph: add ancestors and descendants" to "nx-cugraph: adds ancestors, descendants, and BFS algos" on Dec 21, 2023
rapids-bot (bot) pushed a commit that referenced this pull request on Jan 12, 2024
…valid vertex (#4077)

* Added error check to be sure that the source vertex is a valid vertex in the graph.
* Updated the `nx_cugraph.Graph` class to create PLC graphs using `vertices_array` so that isolated vertices are included. This is needed because the error check added in this PR would otherwise cause NetworkX tests to fail when isolated vertices are treated as invalid (see the sketch after this list).
* This resolves the problem that required the test workarounds in #4029, so those workarounds have been removed in this PR.
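
For illustration, a minimal sketch of the isolated-vertex case using plain NetworkX API (not code from the commit above):

import networkx as nx

G = nx.DiGraph()
G.add_edge(0, 1)
G.add_node(2)  # isolated vertex: no incident edges

# With isolated vertices included in the PLC graph, node 2 is accepted as a
# valid source, and an empty set is returned rather than an invalid-vertex error.
print(nx.descendants_at_distance(G, 2, 1))  # -> set()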

Closes #4067

Authors:
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Rick Ratzel (https://github.com/rlratzel)

Approvers:
  - Seunghwa Kang (https://github.com/seunghwak)
  - Ray Douglass (https://github.com/raydouglass)
  - Erik Welch (https://github.com/eriknw)

URL: #4077