300-node, 8-GPU, 4-IB NCCL test #1454
We need more info. What are the GPUs? What is the interconnect? The output of `nvidia-smi topo -m` would also help.
The nodes are Dell XE9680 servers with eight H100 GPUs each. Each node has four ConnectX-7 VPI InfiniBand cards (mlx5_0:1, mlx5_1:1, mlx5_2:1, mlx5_3:1) plus two 200G Ethernet cards in a bonded configuration. In the topology, GPU-to-GPU links show as NV18 and GPU-to-NIC links show as PIX. I'm sorry I can't provide the original nvidia-smi and topo output. I appreciate any help you can give.
NCCL's compatibility with NIC bonding is not very good, at least for RoCE; I'm not sure whether the same applies to InfiniBand. You could run the test on just a few nodes to isolate whether bonding is the problem.
It turned out that some of the node issues were caused by faults in the SXM GPU board and the PCIe riser board. The faulty hardware has now been replaced, and busbw is in the low 180 GB/s range. Is that a good figure for a 4-NIC fabric with over 300 nodes?
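As a rough sanity check on that figure, the per-node network ceiling can be estimated from the NIC count and line rate. This sketch assumes each ConnectX-7 runs at NDR 400 Gb/s, which the thread does not actually state, so treat the link speed as an assumption:

```python
# Back-of-envelope per-node bandwidth ceiling for a 4-NIC node.
# ASSUMPTION: each ConnectX-7 link runs at NDR 400 Gb/s (not confirmed in the thread).
LINK_GBPS = 400        # per-NIC line rate in Gb/s (assumed)
NICS_PER_NODE = 4      # mlx5_0 .. mlx5_3, from the thread

peak_gbs = NICS_PER_NODE * LINK_GBPS / 8   # convert Gb/s to GB/s
print(f"theoretical peak: {peak_gbs:.0f} GB/s")   # 4 * 400 / 8 = 200 GB/s

measured = 180  # GB/s busbw reported in the thread
print(f"fraction of peak: {measured / peak_gbs:.0%}")
```

Under that assumption, 180 GB/s is about 90% of the 200 GB/s theoretical peak, which would be a healthy result at this scale; with 200 Gb/s HDR links instead, the same arithmetic gives a 100 GB/s ceiling and the measurement would need re-checking.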
Hello,
We are currently running nccl-tests for a client company, using the script below:
```shell
mpirun -np 300 -N 1 -x NCCL_DEBUG=INFO --hostfile /nccl/hostfile \
    -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_num_concurrent 512 \
    --bind-to none -mca btl tcp,self -mca coll_hcoll_enable 0 \
    -x NCCL_SOCKET_IFNAME=bond0 \
    -x NCCL_IB_AR_THRESHOLD=0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
    -x NCCL_IB_SPLIT_DATA_ON_QPS=0 -x NCCL_IB_QPS_PER_CONNECTION=2 \
    -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
    -x PATH -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
    -x NCCL_NET_GDR_READ=1 -x NCCL_IGNORE_CPU_AFFINITY=1 \
    -x NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,NET \
    /nccl/nccl-tests/build/all_reduce_perf -b 512 -e 8G -f 2 -g 8
```

(Note: the original command passed `-x NCCL_DEBUG_SUBSYS` twice, so the second value, `NET`, silently overrode `INIT,ENV,GRAPH`; the two are merged above.)
The max busbw is only 14 GB/s.
Is there something wrong with the command? Please help.
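For interpreting that number: nccl-tests derives busbw from the measured algorithm bandwidth using a per-collective correction factor, which for all_reduce is 2(n-1)/n (see the project's doc/PERFORMANCE.md). A minimal sketch of the conversion, using the 2400 ranks implied by 300 nodes x 8 GPUs:

```python
def allreduce_busbw(algbw_gbs: float, nranks: int) -> float:
    """Bus bandwidth for all_reduce, per nccl-tests' 2*(n-1)/n factor."""
    return algbw_gbs * 2 * (nranks - 1) / nranks

# 300 nodes * 8 GPUs = 2400 ranks; at large n the factor approaches 2,
# so a 14 GB/s busbw corresponds to roughly 7 GB/s of algorithm bandwidth.
print(allreduce_busbw(7.0, 2400))
```

Since the factor saturates near 2 at this scale, busbw here is effectively hardware-limited throughput; 14 GB/s against a multi-NIC node strongly suggests traffic was not using all four IB HCAs (consistent with the bonding suspicion above).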