
[SHArP] about the intranode allreduce performance with SHArP #1455

Open
shh2000 opened this issue Sep 19, 2024 · 0 comments
shh2000 commented Sep 19, 2024

Hello @sjeaugey and the NCCL team!

ENV:
8 × H100 SXM, NGC PyTorch environment

Test Method:
`NCCL_ALGO=[ALGO] ./all_reduce_perf -b 2G -e 4G -f 2 -g [gpus]`, where `all_reduce_perf` is compiled from the nccl-tests GitHub repo.
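
For reproducibility, the full sweep looks roughly like this (a sketch; the binary path and the value ranges are taken from the command above):

```bash
#!/usr/bin/env bash
# Sweep both algorithms over 2..8 GPUs on a single node.
# all_reduce_perf is built from https://github.com/NVIDIA/nccl-tests
for ALGO in RING NVLS; do
  for GPUS in 2 3 4 5 6 7 8; do
    echo "=== NCCL_ALGO=${ALGO}, gpus=${GPUS} ==="
    NCCL_ALGO="${ALGO}" ./all_reduce_perf -b 2G -e 4G -f 2 -g "${GPUS}"
  done
done
```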

Result:

  1. With ALGO=RING, busbw is about 365–370 GB/s for 2 to 8 GPUs, roughly 80% of the 450 GB/s NVLink bandwidth.
  2. With ALGO=NVLS, things become interesting. Busbw stays around 365–370 GB/s for 2 to 4 GPUs, but increases to about 420, 440, 460, and 480 GB/s as the GPU count grows from 5 to 8.
  3. Also with ALGO=NVLS, algbw increases from 258 GB/s at 5 GPUs to about 265, 270, and 273 GB/s at 6, 7, and 8 GPUs.

Question:

  1. How is the 480 GB/s busbw figure calculated? The SHArP ALU in NVSwitch 3 offers 400 GFLOPS at float32; is that 400 GFLOPS related in any way to the 480 GB/s? (See the sanity check after this list.)
  2. Each H100 should be directly connected to all NVSwitches, so how can algbw at 5 GPUs be lower than at 8 GPUs? And why do 2, 3, and 4 GPUs stay at ~370 GB/s, the same as Ring allreduce?
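
For context on question 1: my understanding from the nccl-tests doc/PERFORMANCE.md is that for allreduce the reported busbw is derived from the measured algbw with a ring-equivalent factor, regardless of which algorithm actually ran:

$$
\mathrm{busbw} = \mathrm{algbw} \times \frac{2(n-1)}{n},
\qquad
273 \times \frac{2 \times 7}{8} \approx 478\ \mathrm{GB/s} \approx 480\ \mathrm{GB/s}\ (n = 8).
$$

The same factor also matches the other NVLS points above (e.g. $258 \times 1.6 \approx 413$ GB/s at $n = 5$), so the 480 GB/s number may just fall out of this formula rather than out of a hardware spec, but I would like to confirm that.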

Looking forward to your reply.
