Hi, I have a question about how NCCL uses multiple CUDA streams. NCCL APIs such as ncclAllReduce let the user pass a CUDA stream as an input argument. In the ncclLaunchPrepare function (defined in src/enqueue.cc), there is a comment that says:
// Semantically we want these dependencies for the kernels launched:
// 1. Launch host task on hostStream.
// 2. Launch kernel, depends on all of {deviceStream, hostStream, userStream[i]...}
// 3. {deviceStream, userStream[i]...} depend on kernel.
// We achieve this by:
// 1. userStream[0] waits on deviceStream
// 2. deviceStream waits on each of userStream[1...]
// 3. host task launch on hostStream
// 4. userStream[0] waits on hostStream
// 5. kernel launch on userStream[0]
// 6. deviceStream waits on userStream[0]
// 7. userStream[1...] each waits on deviceStream
// The two-level fan-in fan-out is because ncclStrongStreamWaitStream() requires
// at least one of the two streams to be strong-stream.
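To make the question concrete, here is a minimal sketch of how I currently read the seven steps. It is a toy happens-before model, not NCCL code: ToyGraph, launch, wait, dependsOn and checkSemantics are hypothetical names I made up, streams are plain strings, and "A waits on B" is modeled as an edge from a new node on A to the latest node on B. The sketch assumes at least one user stream. It also takes a fanInFirst flag, because in this toy model the kernel only ends up ordered after userStream[1...] if deviceStream waits on them before userStream[0] waits on deviceStream. Whether the real code enqueues the first two steps in that order is an assumption on my side.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy happens-before graph. This is NOT NCCL code: streams are just names,
// and every enqueued operation is a node with edges to what it must follow.
struct ToyGraph {
  std::vector<std::vector<int>> deps;  // deps[n]: nodes that must finish before n
  std::map<std::string, int> tail;     // most recent node enqueued on each stream

  // Enqueue an operation on `stream`, ordered after that stream's earlier work.
  int launch(const std::string& stream) {
    std::vector<int> d;
    auto it = tail.find(stream);
    if (it != tail.end()) d.push_back(it->second);
    deps.push_back(d);
    int node = static_cast<int>(deps.size()) - 1;
    tail[stream] = node;
    return node;
  }

  // `waiter` waits on whatever is currently enqueued on `waited`.
  void wait(const std::string& waiter, const std::string& waited) {
    int node = launch(waiter);
    auto it = tail.find(waited);
    if (it != tail.end()) deps[node].push_back(it->second);
  }

  // True if `later` transitively depends on `earlier`.
  bool dependsOn(int later, int earlier) const {
    std::vector<int> stack(1, later);
    std::vector<bool> seen(deps.size(), false);
    while (!stack.empty()) {
      int n = stack.back();
      stack.pop_back();
      if (n == earlier) return true;
      if (seen[n]) continue;
      seen[n] = true;
      for (int d : deps[n]) stack.push_back(d);
    }
    return false;
  }
};

// Replays the seven steps for nUser >= 1 user streams and reports whether the
// three "semantically we want" dependencies hold in the toy model.
bool checkSemantics(int nUser, bool fanInFirst) {
  ToyGraph g;
  std::vector<std::string> user;
  for (int i = 0; i < nUser; ++i) user.push_back("userStream" + std::to_string(i));

  // Work already queued on each stream before the launch.
  std::vector<int> before;
  before.push_back(g.launch("deviceStream"));
  for (const auto& u : user) before.push_back(g.launch(u));

  if (fanInFirst) {
    for (int i = 1; i < nUser; ++i) g.wait("deviceStream", user[i]);  // step 2
    g.wait(user[0], "deviceStream");                                  // step 1
  } else {
    g.wait(user[0], "deviceStream");                                  // step 1
    for (int i = 1; i < nUser; ++i) g.wait("deviceStream", user[i]);  // step 2
  }
  int hostTask = g.launch("hostStream");                              // step 3
  g.wait(user[0], "hostStream");                                      // step 4
  int kernel = g.launch(user[0]);                                     // step 5
  g.wait("deviceStream", user[0]);                                    // step 6
  for (int i = 1; i < nUser; ++i) g.wait(user[i], "deviceStream");    // step 7

  // Work queued on each stream after the launch.
  std::vector<int> after;
  after.push_back(g.launch("deviceStream"));
  for (const auto& u : user) after.push_back(g.launch(u));

  bool ok = g.dependsOn(kernel, hostTask);
  for (int n : before) ok = ok && g.dependsOn(kernel, n);
  for (int n : after) ok = ok && g.dependsOn(n, kernel);
  return ok;
}
```

In this model checkSemantics(3, true) holds, while checkSemantics(3, false) does not, since the kernel never picks up the earlier work on userStream[1] and userStream[2]. With a single user stream both orders hold. The model says nothing about what makes a stream "strong" or why hostStream exists at all, which is the part I am asking about.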
I'm confused by this and would like to understand how these streams (hostStream, deviceStream, userStream, and the stream passed to the API) relate to each other, and what each one is designed for.
Thanks in advance for your reply.