All ranks are not trained. They are blocked all the time #39

Open

ADAM-CT opened this issue Jan 15, 2020 · 3 comments

Comments

ADAM-CT commented Jan 15, 2020

My environment:

server1: 4 GPUs

server2: 4 GPUs

Initialization completes, but none of the ranks make training progress; they stay blocked indefinitely.

Here is the output of each rank:

in rank0:
Finished initializing process group; backend: gloo, rank: 0, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank1:
Finished initializing process group; backend: gloo, rank: 1, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank2:
Finished initializing process group; backend: gloo, rank: 2, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank3:
Finished initializing process group; backend: gloo, rank: 3, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)

in rank4:
Finished initializing process group; backend: gloo, rank: 4, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002

in rank5:
Finished initializing process group; backend: gloo, rank: 5, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002

in rank6:
Finished initializing process group; backend: gloo, rank: 6, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.156 (1.804)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002

in rank7:
Finished initializing process group; backend: gloo, rank: 7, world_size: 8
Send ranks: {}
Receive ranks: {'out1': [4, 5, 6], 'target': [4, 5, 6]}
Setting up process groups for broadcasts...
Letting in 0 warm-up minibatches
Running training for 20016 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690624.000 bytes send_tensors 0.000 seconds send_tensors_size 0.000 bytes
Epoch: 0 Step 0 Learning rate: 0.010000
Epoch: [0][0/20016] Time: 8.293 (8.293) Epoch time [hr]: 0.002 (46.107) Memory: 1.284 (1.636) Loss: 6.9063 (6.9063) Prec@1: 0.000 (0.000) Prec@5: 0.000 (0.000)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 25690112.000 bytes
Optimizer step took: 0.005
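
For what it's worth, a hang like this (every rank prints its first step and then blocks in communication) can be reproduced without PipeDream by running a bare gloo broadcast across the two servers. A minimal sketch, assuming MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are exported for each of the 8 processes; if this also hangs, the problem is inter-server gloo connectivity rather than PipeDream's runtime:

# Minimal gloo smoke test (a sketch, not PipeDream code).
# Run one process per rank on each server with the same
# MASTER_ADDR/MASTER_PORT; RANK and WORLD_SIZE differ per process.
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="env://",  # reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
)
t = torch.zeros(1)
if dist.get_rank() == 0:
    t.fill_(42.0)
dist.broadcast(t, src=0)  # blocks here if the two servers cannot reach each other
print("rank", dist.get_rank(), "received", t.item())
dist.destroy_process_group()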

ADAM-CT (Author) commented Jan 15, 2020

Finally it throws this runtime error:

Exception in thread Thread-9:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "../communication.py", line 619, in recv_helper_thread
    sub_process_group=sub_process_group)
  File "../communication.py", line 654, in _recv
    group=sub_process_group)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 755, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1800000ms for recv operation to complete
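
The 1800000 ms in the message is torch.distributed's default 30-minute collective timeout, so the broadcast in recv_helper_thread waited half an hour without receiving anything; the peer was unreachable rather than merely slow. On servers with more than one network interface, a common cause is gloo binding to an interface the other machine cannot reach. A sketch of the two knobs worth trying first ("eth0" is a placeholder; use the interface that actually connects the two servers):

# Not PipeDream's code; a sketch of the usual workarounds.
import os
from datetime import timedelta
import torch.distributed as dist

# Pin gloo to the right NIC; must be set before init_process_group.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"

dist.init_process_group(
    backend="gloo",
    init_method="env://",
    timeout=timedelta(hours=2),  # default is 30 min, i.e. the 1800000 ms above
)

GLOO_SOCKET_IFNAME has to be set in the environment of every rank on both servers (it also accepts a comma-separated list of interfaces).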

letianzhao commented

I have the same issue. Have you solved this problem?
Thank you.

Q1Shane commented Apr 19, 2021

I have the same issue! Have you found a solution?
