Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Fix the hang in cuGraph Python Uniform Neighbor Sample, Add Logging to Bulk Sampler #3669

Merged

Conversation

alexbarghi-nv
Copy link
Member

@alexbarghi-nv alexbarghi-nv commented Jun 23, 2023

Some dask operations were not being done correctly, and time was being lost in broadcasting the rank and label arrays to all workers. This PR resolves those issues.

Also pulls in the previously-experimental changes that add logging to the bulk sampler.

Credit to @VibhuJawa for isolating and fixing the issues with the column merge in uniform_neighbor_sample and the new sampling notebook and shell script.

This PR does modify the sampling APIs so it is breaking. The API changes are necessary to avoid unnecessary shuffling, and eventually, to improve batch id assignment.

Dataset: ogbn_papers100M x 2; Fanout: [25, 25]; Batch Size: 512; Seeds Per Call: 524288
Current runtime: 2.69 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Previous runtime: 4.51 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Speedup: 1.7x

Dataset: ogbn_papers100M x 4; Fanout: [25, 25]; Batch Size: 512; Seeds Per Call: 524288
Current runtime: 6.32 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Previous runtime: 10.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)
Speedup: 1.7x

@alexbarghi-nv alexbarghi-nv self-assigned this Jun 23, 2023
@alexbarghi-nv alexbarghi-nv added bug Something isn't working breaking Breaking change labels Jun 23, 2023
@alexbarghi-nv alexbarghi-nv added this to the 23.08 milestone Jun 23, 2023
@alexbarghi-nv alexbarghi-nv changed the title [FIX] Fix the hang in cuGraph Python Uniform Neighbor Sample [FIX] Fix the hang in cuGraph Python Uniform Neighbor Sample, Add Logging to Bulk Sampler Jun 26, 2023
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more minor changes, but i think we should be good to go after that. I think the only thing is what @rlratzel requested around keeping the now legacy implementation around.

@VibhuJawa
Copy link
Member

This still seems to be actually hanging (not perf issues) for me when running 10 times on bulk samplinng. :-( . Can you try running it and ensure we dont cause hangs.

dataset='ogbn_papers100M'
dataset_root="/datasets/abarghi/"
output_root="/raid/vjawa/"
reverse_edges=True
add_edge_types=False
batch_size=1024
seeds_per_call=524288
fanout=[10,10,10]
replication_factor=4
seed=123

dataset_dir=dataset_root
output_path=output_root
persist=False

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor asks around raising deprecated warnings

Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 197 to 199
if isinstance(self.__batches, dask_cudf.DataFrame):
min_batch_id = min_batch_id.compute()
min_batch_id = int(min_batch_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we add this compute back ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

client.submit doesn't accept it uncomputed. What we had before worked with client.compute. I'm trying to find alternative that works for submit.

The error we get is this:

AttributeError("'Scalar' object has no attribute '_parent_meta'")

And wrapping with delayed doesn't work either, it gives a similar error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e.
delayed(lambda df : df.min())(self.__batches) doesn't work

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I found a way around that issue, but we have another problem where the execution hangs because we are setting allow_other_workers=False so it can't get the min

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can let it be for now and try to get other optimizations as a followup

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've since figured out that what we had with client.compute and not computing the min beforehand was wrong. It should cause a hang, which we were seeing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and thanks for adding support for the legacy args. I'm filing this issue to remind us to remove it for 23.10 (which gives users one release to migrate).

I have a question and a possible FIXME suggestion which need not hold up approval now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A notebook is nice, but just so I can plan ahead: do we want to run this notebook as part of the nightlies, and if so, can we use the notebook conversion script to run it from a command-line (this is what we do in CI, as seen here, which calls this)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do want to run this nightly.

@alexbarghi-nv
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 0372396 into rapidsai:branch-23.08 Jul 6, 2023
51 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants