Question about hostStream, deviceStream and userStream #1476

Open
MC952-arch opened this issue Oct 11, 2024 · 0 comments


Hi, I have a question about the multi-stream mechanism in NCCL. NCCL APIs such as ncclAllReduce let users pass a CUDA stream as an input argument. In the ncclLaunchPrepare function (defined in src/enqueue.cc), there is a comment that says:

// Semantically we want these dependencies for the kernels launched:
// 1. Launch host task on hostStream.
// 2. Launch kernel, depends on all of {deviceStream, hostStream, userStream[i]...}
// 3. {deviceStream, userStream[i]...} depend on kernel.
// We achieve this by:
// 1. userStream[0] waits on deviceStream
// 2. deviceStream waits on each of userStream[1...]
// 3. host task launch on hostStream
// 4. userStream[0] waits on hostStream
// 5. kernel launch on userStream[0]
// 6. deviceStream waits on userStream[0]
// 7. userStream[1...] each waits on deviceStream
// The two-level fan-in fan-out is because ncclStrongStreamWaitStream() requires
// at least one of the two streams to be strong-stream.
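
One way to check that the seven steps above produce the three intended dependencies is a small host-only C++ model. The sketch below is not NCCL code and calls no CUDA or NCCL APIs: `StreamModel`, `runLaunchSequence`, and the `prev:` markers are hypothetical names, and it assumes that a wait only picks up work already ordered into the waited-on stream at the moment the wait is issued (as with an event recorded on that stream). Under that assumption the model issues the fan-in as "deviceStream waits on userStream[1...]" first and "userStream[0] waits on deviceStream" second, so that userStream[0] inherits the other user streams' earlier work; this ordering is a property of the model, not a statement about NCCL's implementation.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Toy model, NOT NCCL code. Each stream carries the set of operations that
// anything enqueued on it next would be ordered after.
struct StreamModel {
    std::map<std::string, std::set<std::string>> tail;  // stream -> ops ordered before its next op
    std::map<std::string, std::set<std::string>> deps;  // op -> ops it is ordered after
    std::set<std::string> strong;                       // streams treated as "strong" in this model

    void addStream(const std::string& name, bool isStrong) {
        tail[name] = {"prev:" + name};  // stand-in for work already queued on the stream
        if (isStrong) strong.insert(name);
    }

    // "waiter waits on waited": waiter inherits a snapshot of waited's ordering.
    // Mirrors the quoted rule that at least one of the two streams must be strong.
    void wait(const std::string& waiter, const std::string& waited) {
        if (!strong.count(waiter) && !strong.count(waited))
            throw std::logic_error("at least one of the two streams must be strong");
        const std::set<std::string> snapshot = tail.at(waited);
        tail.at(waiter).insert(snapshot.begin(), snapshot.end());
    }

    // Enqueue op on stream: it is ordered after everything in the stream's tail.
    void launch(const std::string& stream, const std::string& op) {
        deps[op] = tail.at(stream);
        tail.at(stream).insert(op);
    }
};

// Replays the seven steps from the quoted comment on the toy model.
StreamModel runLaunchSequence(int nUserStreams) {
    assert(nUserStreams >= 1);
    StreamModel m;
    m.addStream("deviceStream", /*isStrong=*/true);
    m.addStream("hostStream", /*isStrong=*/true);
    std::vector<std::string> user;
    for (int i = 0; i < nUserStreams; ++i) {
        user.push_back("userStream[" + std::to_string(i) + "]");
        m.addStream(user.back(), /*isStrong=*/false);
    }

    // Fan-in (steps 2 then 1, see the assumption stated in the lead-in).
    for (int i = 1; i < nUserStreams; ++i) m.wait("deviceStream", user[i]);
    m.wait(user[0], "deviceStream");

    m.launch("hostStream", "hostTask");  // step 3
    m.wait(user[0], "hostStream");       // step 4
    m.launch(user[0], "kernel");         // step 5

    // Fan-out (steps 6 and 7).
    m.wait("deviceStream", user[0]);
    for (int i = 1; i < nUserStreams; ++i) m.wait(user[i], "deviceStream");
    return m;
}
```

In this model, `runLaunchSequence(3)` yields a kernel whose dependency set contains the earlier work of every stream plus the host task, and afterwards deviceStream and every user stream (but not hostStream) are ordered after the kernel. A direct wait between two user streams throws, which illustrates why the fan-in and fan-out are routed through deviceStream.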

I'm confused about how these streams (hostStream, deviceStream, userStream, and the API stream) relate to one another, and I'd like to understand the design goal behind them.

Thanks for your reply in advance.
