All ranks are not trained. They are blocked all the time #39

Open

ADAM-CT opened this issue Jan 15, 2020 · 3 comments

Comments

ADAM-CT commented Jan 15, 2020

My environment:

server1: 4 GPUs

server2: 4 GPUs

Initialization completes, but none of the ranks make training progress; they stay blocked indefinitely.

Here is the output of each rank:

in rank0:
Finished initializing process group; backend: gloo, rank: 0, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank1:
Finished initializing process group; backend: gloo, rank: 1, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank2:
Finished initializing process group; backend: gloo, rank: 2, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank3:
Finished initializing process group; backend: gloo, rank: 3, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank4:
Finished initializing process group; backend: gloo, rank: 4, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002

in rank5:
Finished initializing process group; backend: gloo, rank: 5, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002

in rank6:
Finished initializing process group; backend: gloo, rank: 6, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.156 (1.804)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002

in rank7:
Finished initializing process group; backend: gloo, rank: 7, world_size: 8
Send ranks: {}
Receive ranks: {'out1': [4, 5, 6], 'target': [4, 5, 6]}
Setting up process groups for broadcasts...
Letting in 0 warm-up minibatches
Running training for 20016 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690624.000 bytes send_tensors 0.000 seconds send_tensors_size 0.000 bytes
Epoch: 0 Step 0 Learning rate: 0.010000
Epoch: [0][0/20016] Time: 8.293 (8.293) Epoch time [hr]: 0.002 (46.107) Memory: 1.284 (1.636) Loss: 6.9063 (6.9063) Prec@1: 0.000 (0.000) Prec@5: 0.000 (0.000)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 25690112.000 bytes
Optimizer step took: 0.005
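
For what it's worth, a hang like this (every rank prints its first step and then blocks in communication) can be reproduced without PipeDream by running a bare gloo broadcast across the two servers. A minimal sketch, assuming MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are exported for each of the 8 processes; if this also hangs, the problem is inter-server gloo connectivity rather than PipeDream's runtime:

# Minimal gloo smoke test (a sketch, not PipeDream code).
# Run one process per rank on each server with the same
# MASTER_ADDR/MASTER_PORT; RANK and WORLD_SIZE differ per process.
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="env://",  # reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
)
t = torch.zeros(1)
if dist.get_rank() == 0:
    t.fill_(42.0)
dist.broadcast(t, src=0)  # blocks here if the two servers cannot reach each other
print("rank", dist.get_rank(), "received", t.item())
dist.destroy_process_group()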

ADAM-CT (Author) commented Jan 15, 2020

Finally it throws this runtime error:

Exception in thread Thread-9:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "../communication.py", line 619, in recv_helper_thread
    sub_process_group=sub_process_group)
  File "../communication.py", line 654, in _recv
    group=sub_process_group)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 755, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1800000ms for recv operation to complete
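
The 1800000 ms in the message is torch.distributed's default 30-minute collective timeout, so the broadcast in recv_helper_thread waited half an hour without receiving anything; the peer was unreachable rather than merely slow. On servers with more than one network interface, a common cause is gloo binding to an interface the other machine cannot reach. A sketch of the two knobs worth trying first ("eth0" is a placeholder; use the interface that actually connects the two servers):

# Not PipeDream's code; a sketch of the usual workarounds.
import os
from datetime import timedelta
import torch.distributed as dist

# Pin gloo to the right NIC; must be set before init_process_group.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"

dist.init_process_group(
    backend="gloo",
    init_method="env://",
    timeout=timedelta(hours=2),  # default is 30 min, i.e. the 1800000 ms above
)

GLOO_SOCKET_IFNAME has to be set in the environment of every rank on both servers (it also accepts a comma-separated list of interfaces).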

letianzhao commented

I have the same issue. Have you solved this problem?
Thank you.

Q1Shane commented Apr 19, 2021

I have the same issue! Have you found a solution?
