Poor NCCL allreduce performance #1453
You might want to try rerunning it with NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING. What does the topology look like in the file that's passed via NCCL_TOPO_FILE?
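For reference, a rerun along these lines would capture the requested debug output (a sketch; the binary path and log file name are assumptions):

```bash
# Enable NCCL init/env/tuning debug output for the benchmark run
# (binary path and log name are assumptions about the local setup)
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING \
  ./build/all_reduce_perf -b 1 -e 8G -f 2 -g 1 -n 20 2>&1 | tee nccl_debug.log
```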
Attached:
- Output of nvidia-smi topo -m
- Topology file
- Output from run with NCCL_DEBUG_SUBSYS=INIT,ENV,TUNING
Your latest log shows that NCCL chooses the Tree algorithm for small message sizes (<512KB), and then switches to NVLSTree-Simple up to 32MB, which is expected. Somewhat unusually (probably because of RoCE?), it switches to Ring-LL128 for 64MB-128MB, but for 256MB and above it switches to Ring-Simple, which is expected. You may want to try experimenting by disabling Ring (NCCL_ALGO="^Ring").

My guess is that you are suffering from some sort of network congestion. Have you tried experimenting with other values of the relevant NCCL tuning variables?
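A minimal sketch of such a comparison run, assuming the benchmark command from the original report and the standard NCCL_ALGO exclusion syntax:

```bash
# Exclude the Ring algorithm so NCCL falls back to Tree/NVLSTree at large sizes
NCCL_ALGO="^Ring" ./build/all_reduce_perf -b 1 -e 8G -f 2 -g 1 -n 20
```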
Thank you for your review of the data and suggestions. We've made some network infrastructure changes and are seeing improved performance. I'll get back after we've had more time to study the results.
We are seeing an issue with NCCL allreduce performance that we would appreciate Nvidia's help on.
We have three nodes split across two racks: Two nodes on one rack and one node on another rack.
Two-node performance either within a rack or across racks is OK. Three-node performance across racks is severely degraded.
We've replicated this on different sets of nodes and racks.
The configuration is as follows:
Three nodes: Two on one rack, and one on another rack
8 x H100 GPUs with NVLink in each node
8 x ConnectX-7 dual-port NICs on each node, with 200 Gb/s links
Each rack has two top-of-rack (TOR) switches; each NIC's ports are split between the TOR switches; TOR switches are connected with spine switches
Virtualized configuration: Nodes are Ubuntu 22.04 VMs
Module versions:
GPU information is provided in nvidia-smi -q output at the bottom
NCCL version 2.22.3+cuda12.5
NCCL environment variable settings:
NCCL command: all_reduce_perf -b 1 -e 8G -f 2 -g 1 -n 20
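For context, a sketch of how this benchmark might be launched across the three nodes with Open MPI (the hostfile, mapping flags, and binary path are assumptions, not the exact command we used):

```bash
# 3 nodes x 8 GPUs = 24 ranks, one GPU per process (-g 1)
mpirun -np 24 -hostfile hosts.txt --map-by ppr:8:node \
  -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 1 -e 8G -f 2 -g 1 -n 20
```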
Example output from the three-node case is below. Bandwidth at the 8 GB data size is about 99% lower than in our two-node case.
Degradation is noticeable but less severe at smaller data sizes starting at around 8 KB.
We also note that the drop-off in bandwidth going from the 32 MB to 64 MB data size is consistent across executions of the test.
The output of nvidia-smi -q for one GPU is provided below. This was captured with no workload running.
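A sketch of how that capture can be limited to a single GPU (GPU index 0 is an assumption):

```bash
# Query full status for a single GPU and save it for the report
nvidia-smi -q -i 0 > nvidia-smi-q-gpu0.txt
```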