300node 8GPU 4 IB NCCL TEST #1454

Open
gim4moon opened this issue Sep 19, 2024 · 4 comments

@gim4moon

Hello

Currently, we are supporting a client company in running nccl-tests.

We run the test with the script below:

mpirun -np 300 -N 1 -x NCCL_DEBUG=INFO --hostfile /nccl/hostfile \
    -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_num_concurrent 512 \
    --bind-to none -mca btl tcp,self -mca coll_hcoll_enable 0 \
    -x NCCL_SOCKET_IFNAME=bond0 \
    -x NCCL_IB_AR_THRESHOLD=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
    -x NCCL_IB_SPLIT_DATA_ON_QPS=0 -x NCCL_IB_QPS_PER_CONNECTION=2 -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
    -x PATH -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
    -x NCCL_NET_GDR_READ=1 -x NCCL_IGNORE_CPU_AFFINITY=1 -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH -x NCCL_DEBUG_SUBSYS=NET \
    /nccl/nccl-tests/build/all_reduce_perf -b 512 -e 8G -f 2 -g 8

The max busbw is only 14 GB/s.

Is there something wrong with the command? Please help me.
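
One small aside on the command itself: NCCL_DEBUG_SUBSYS is exported twice, and since an environment variable can only hold one value, only one of the two settings actually reaches the ranks (they are not merged). A single combined list keeps all of the requested subsystems, for example:

-x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,NET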

@kiskra-nvidia
Member

We need more info. What are the GPUs? What is the interconnect? The output of nvidia-smi and nvidia-smi topo -m from one of the nodes would be nice, as would a dump of the topology detected by NCCL. Can you include the NCCL debug output (from just one of the ranks, please! 😃), especially since you collect it already? It might be worth adding TUNING to the list of subsystems to debug...
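
A minimal sketch of how that information could be collected, reusing the hostfile and binary path from the command above (the /tmp output file names are just examples):

# On one of the nodes: hardware and topology snapshots
nvidia-smi > nvidia-smi.txt
nvidia-smi topo -m > nvidia-smi-topo.txt

# Small two-node run that adds TUNING, writes per-rank NCCL logs, and dumps the detected topology
mpirun -np 2 -N 1 --hostfile /nccl/hostfile \
    -x NCCL_DEBUG=INFO \
    -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,NET,TUNING \
    -x NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log \
    -x NCCL_TOPO_DUMP_FILE=/tmp/nccl-topo.xml \
    -x PATH -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
    /nccl/nccl-tests/build/all_reduce_perf -b 512 -e 8G -f 2 -g 8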

@gim4moon
Author

> We need more info. What are the GPUs? What is the interconnect? The output of nvidia-smi and nvidia-smi topo -m from one of the nodes would be nice, as would a dump of the topology detected by NCCL. Can you include the NCCL debug output (from just one of the ranks, please! 😃), especially since you collect it already? It might be worth adding TUNING to the list of subsystems to debug...

The node is a Dell XE9680.

The GPUs are 8× H100 per node.

The InfiniBand setup is 4× ConnectX-7 VPI cards (mlx5_0:1, mlx5_1:1, mlx5_2:1, mlx5_3:1) per node, plus 2× 200G Ethernet cards in a bonding configuration.

In the topology, the GPUs are connected to each other via NV18 (NVLink), and each GPU is connected to its NIC via PIX.

I'm sorry, but I can't provide the original nvidia-smi and topo output!

I appreciate any help you can provide.

@GeofferyGeng

> The InfiniBand setup is 4× ConnectX-7 VPI cards (mlx5_0:1, mlx5_1:1, mlx5_2:1, mlx5_3:1) per node, plus 2× 200G Ethernet cards in a bonding configuration.

NCCL's compatibility with NIC bonding is not very good, at least in the case of RoCE; I'm not sure whether the same applies to InfiniBand.

You could run the test on just a few nodes to check whether bonding is the issue.
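
A minimal sketch of such a check, assuming a small hostfile with just two nodes (the /nccl/hostfile_2nodes path is hypothetical). Pinning NCCL_IB_HCA to the four IB devices keeps the RDMA data path off the bonded Ethernet ports, while bond0 is used only for bootstrap traffic:

mpirun -np 2 -N 1 --hostfile /nccl/hostfile_2nodes \
    -x NCCL_DEBUG=INFO \
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
    -x NCCL_SOCKET_IFNAME=bond0 \
    -x PATH -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
    /nccl/nccl-tests/build/all_reduce_perf -b 512 -e 8G -f 2 -g 8

Comparing the 2-node busbw of that run against one forced onto the bonded interface (e.g. with NCCL_IB_DISABLE=1) would show whether bonding is the limiting factor.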

@gim4moon
Author

> > The InfiniBand setup is 4× ConnectX-7 VPI cards (mlx5_0:1, mlx5_1:1, mlx5_2:1, mlx5_3:1) per node, plus 2× 200G Ethernet cards in a bonding configuration.
>
> NCCL's compatibility with NIC bonding is not very good, at least in the case of RoCE; I'm not sure whether the same applies to InfiniBand.
>
> You could run the test on just a few nodes to check whether bonding is the issue.

It turned out that some of the node issues were due to faults in the SXM GPU boards and the PCI riser boards.

The faulty equipment has been replaced, and the busbw is now in the low 180 Gb/s range.

Is the current speed a good figure for a 4-NIC infrastructure with over 300 nodes?
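
For a rough point of reference (a back-of-envelope estimate, assuming each ConnectX-7 port runs at 400 Gb/s NDR, since the configured port speed was not stated):

per-node NIC bandwidth:    4 ports x 400 Gb/s = 1600 Gb/s ≈ 200 GB/s
all_reduce busbw ceiling:  roughly bounded by the per-node NIC bandwidth ≈ 200 GB/s

If the 180 figure above is in GB/s, it is around 90% of that assumed ceiling, i.e. close to line rate; if it is literally 180 Gb/s (≈ 22.5 GB/s), there would still be a large gap to what four such NICs can deliver.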
