nx-cugraph: adds ancestors, descendants, and BFS algos #4029

Merged
merged 10 commits into rapidsai:branch-24.02 on Dec 21, 2023

Conversation

@eriknw (Contributor) commented Nov 30, 2023

This PR adds the following:

  • ancestors
  • bfs_edges
  • bfs_layers
  • bfs_predecessors
  • bfs_successors
  • bfs_tree
  • descendants
  • descendants_at_distance
  • generic_bfs_edges

@eriknw requested a review from a team as a code owner November 30, 2023 22:13
This required skipping several tests that use algorithms that repeatedly call `bfs_layers`.
Also, refactor BFS algorithms to reduce repetition.
Also, update the conversion from networkx to handle non-dict edge data mappings.
@github-actions (bot) added the python label Dec 5, 2023
@rlratzel added the improvement and non-breaking labels Dec 5, 2023
@eriknw (Contributor, Author) commented Dec 7, 2023

@rlratzel what do you think about the three tests that fail due to running out of GPU memory? I don't think the graphs are particularly large. What can we try? Can we investigate? Could it help to call gc.collect() regularly during tests (since maybe device memory is being held by released Python objects, but Python's garbage collection doesn't trigger b/c host memory usage is low)? Or should we just skip the tests (and maybe run them one-by-one)?

@rlratzel (Contributor) commented Dec 9, 2023

> @rlratzel what do you think about the three tests that fail due to running out of GPU memory? I don't think the graphs are particularly large. What can we try? Can we investigate?

I was wondering if we have a memory leak, possibly related to PLC graph creation during the NX-to-PLC Graph conversion, which happens more often now that nx-cugraph implements more algos. Do you see this when you run tests locally, or see anything that resembles a leak when polling nvidia-smi during testing?

> Could it help to call gc.collect() regularly during tests (since maybe device memory is being held by released Python objects, but Python's garbage collection doesn't trigger b/c host memory usage is low)?

Maybe? If this is happening during tests, I'd be concerned it could happen in a large user application too, where users may not know to manually invoke the GC or would prefer not to. I'll try recreating this on my workstation using your branch. I can then add a test setup hook that calls gc.collect() before each test to see if that makes a difference.
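
For reference, a per-test GC hook along those lines might look like the following (a minimal sketch of a conftest.py fixture, not code from this PR):

# conftest.py (hypothetical sketch)
import gc

import pytest

@pytest.fixture(autouse=True)
def collect_garbage_before_each_test():
    # Force a full collection so device memory held by unreachable
    # Python objects is released before the next test allocates.
    gc.collect()
    yield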

@rlratzel (Contributor) commented

@eriknw I just ran your branch on my workstation and saw the same results - the same three tests failed with the same OOM error. I was polling nvidia-smi and never saw it exceed ~800MB, and my workstation GPUs have (I think) more memory than the V100s in the CI runners (I have 50GB cards).

I don't think this is a leak or anything gc.collect() will fix; I think there's a bug somewhere in the algos (or combination of algos) called from the tests. Are you seeing the same result when you run these tests locally?

I suspect something is causing an abrupt spike in memory which fails the test and the memory is reclaimed faster than my nvidia-smi polling interval, or there's a bug where we're computing an amount of memory to allocate that's much too big. I'm just guessing at this point though. I'll keep looking...
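
For finer-grained visibility than a manual nvidia-smi loop, one option is to sample device memory at a short interval while the tests run. A minimal sketch, assuming the pynvml package is available (not part of this PR):

import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust as needed
peak = 0
try:
    while True:
        used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
        peak = max(peak, used)
        print(f"used: {used / 2**20:8.0f} MiB  peak: {peak / 2**20:8.0f} MiB", end="\r")
        time.sleep(0.05)  # 50 ms sampling interval
except KeyboardInterrupt:
    print(f"\npeak observed: {peak / 2**20:.0f} MiB")
    pynvml.nvmlShutdown()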


@rlratzel (Contributor) commented

@eriknw - I have a minimal repro script. Can you use this to debug further? I'm going to stop for the night but I can help next week.

import networkx as nx
from networkx.algorithms import flow

G = nx.fast_gnp_random_graph(100, 0.0575, seed=42)
cut = nx.minimum_node_cut(G, flow_func=flow.boykov_kolmogorov)

G = nx.DiGraph()
G.add_nodes_from([0, 1, 2])
list(nx.algorithms.dag.antichains(G))

Running it with the cugraph backend:

(base) root@98e76e671240:/Projects/eriknw/cugraph# NETWORKX_AUTOMATIC_BACKENDS=cugraph python oom_repro.py
Traceback (most recent call last):
  File "/Projects/eriknw/cugraph/oom_repro.py", line 9, in <module>
    list(nx.algorithms.dag.antichains(G))
  File "/opt/conda/lib/python3.10/site-packages/networkx/algorithms/dag.py", line 940, in antichains
    TC = nx.transitive_closure_dag(G, topo_order)
  File "/opt/conda/lib/python3.10/site-packages/networkx/utils/decorators.py", line 770, in func
    return argmap._lazy_compile(__wrapper)(*args, **kwargs)
  File "<class 'networkx.utils.decorators.argmap'> compilation 28", line 4, in argmap_transitive_closure_dag_25
    import inspect
  File "/opt/conda/lib/python3.10/site-packages/networkx/utils/backends.py", line 576, in __call__
    return self.orig_func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/networkx/algorithms/dag.py", line 811, in transitive_closure_dag
    TC.add_edges_from((v, u) for u in nx.descendants_at_distance(TC, v, 2))
  File "/opt/conda/lib/python3.10/site-packages/networkx/utils/backends.py", line 569, in __call__
    return self._convert_and_call(
  File "/opt/conda/lib/python3.10/site-packages/networkx/utils/backends.py", line 817, in _convert_and_call
    result = getattr(backend, self.name)(*converted_args, **converted_kwargs)
  File "/Projects/eriknw/cugraph/python/nx-cugraph/nx_cugraph/utils/decorators.py", line 86, in __call__
    return self.__wrapped__(*args, **kwargs)
  File "/Projects/eriknw/cugraph/python/nx-cugraph/nx_cugraph/algorithms/traversal/breadth_first_search.py", line 239, in descendants_at_distance
    distances, predecessors, node_ids = plc.bfs(
  File "bfs.pyx", line 181, in pylibcugraph.bfs.bfs
  File "utils.pyx", line 53, in pylibcugraph.utils.assert_success
RuntimeError: non-success value returned from cugraph_bfs: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp

@eriknw (Contributor, Author) commented Dec 11, 2023

> Do you see this when you run tests locally, or see anything that resembles a leak when polling nvidia-smi during testing?

Yeah, I see it (just like you) when I run the tests locally. However, if I run each failing test individually, they pass.

Thanks for investigating and making a reproducer!

@eriknw requested a review from a team as a code owner December 20, 2023 13:09
@github-actions (bot) added the ci label Dec 20, 2023
@ajschmidt8 (Member) left a comment

Approving ops-codeowner file changes

Comment on lines +114 to +118
# Individually run tests that are skipped above b/c they may run out of memory
PYTEST_NO_SKIP=True ./run_nx_tests.sh --cov-append -k "TestDAG and test_antichains"
PYTEST_NO_SKIP=True ./run_nx_tests.sh --cov-append -k "TestMultiDiGraph_DAGLCA and test_all_pairs_lca_pairs_without_lca"
PYTEST_NO_SKIP=True ./run_nx_tests.sh --cov-append -k "TestDAGLCA and test_all_pairs_lca_pairs_without_lca"
PYTEST_NO_SKIP=True ./run_nx_tests.sh --cov-append -k "TestEfficiency and test_using_ego_graph"
Contributor commented

We can leave this in place in order to get this PR merged, but I think this is masking a bug (likely in pylibcugraph/libcugraph, not in nx-cugraph).

I filed an issue for this, and I'll open a separate PR to remove these lines and the skipped tests; that PR will be merged once the bug is fixed.

Contributor commented

The PR to eventually fix the bug is #4068

@rlratzel (Contributor) commented

/merge

@rapids-bot (bot) merged commit 8d5bba3 into rapidsai:branch-24.02 on Dec 21, 2023
74 checks passed
@rlratzel changed the title from "nx-cugraph: add ancestors and descendants" to "nx-cugraph: adds ancestors, descendants, and BFS algos" on Dec 21, 2023
rapids-bot (bot) pushed a commit that referenced this pull request on Jan 12, 2024
…valid vertex (#4077)

* Added error check to be sure that the source vertex is a valid vertex in the graph.
* Updated the `nx_cugraph.Graph` class to create PLC graphs using `vertices_array` so that isolated vertices are included. This is needed because the error check added in this PR would otherwise cause NetworkX tests to fail when isolated vertices are treated as invalid (see the sketch after this list).
* This resolves the problem that required the test workarounds in #4029, so those workarounds have been removed in this PR.
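
For illustration, a minimal sketch of the isolated-vertex case using plain NetworkX API (not code from the commit above):

import networkx as nx

G = nx.DiGraph()
G.add_edge(0, 1)
G.add_node(2)  # isolated vertex: no incident edges

# With isolated vertices included in the PLC graph, node 2 is accepted as a
# valid source, and an empty set is returned rather than an invalid-vertex error.
print(nx.descendants_at_distance(G, 2, 1))  # -> set()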

Closes #4067

Authors:
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Rick Ratzel (https://github.com/rlratzel)

Approvers:
  - Seunghwa Kang (https://github.com/seunghwak)
  - Ray Douglass (https://github.com/raydouglass)
  - Erik Welch (https://github.com/eriknw)

URL: #4077