server crashed at the beginning #270

Closed
jasperzhong opened this issue Jul 18, 2020 · 1 comment

Comments

@jasperzhong
Contributor

@vycezhong I ran this PR with our MXNet VGG-16 test to check for regressions. I used 2 worker nodes, each with 8 GPUs, and 2 server nodes. One of the server nodes core dumps consistently. Is this something you've seen before? I didn't change the test to use gradient compression, so dmlc/ps-lite#168 shouldn't matter here.

[00:06:35] byteps/server/server.cc:430: BytePS server engine uses 16 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[00:06:35] byteps/server/server.cc:438: Enable engine scheduling for BytePS server
[00:06:35] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[00:06:35] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[00:06:35] src/van.cc:421: Bind to role=server, ip=xxxxxxx, port=48413, is_recovery=0
[00:06:35] src/./zmq_van.h:287: Start ZMQ recv thread
[00:06:35] src/van.cc:510: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=xxxxxxxx, port=48413, is_recovery=0 } }. THIS IS NOT DATA MSG!
[00:07:34] src/van.cc:535: 1 => 2147483647. Meta: request=0, timestamp=3, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=xxx.196, port=35657, is_recovery=0 role=server, id=8, ip=xxx.195, port=61601, is_recovery=0 role=server, id=10, ip=xxx.144, port=48413, is_recovery=0 role=worker, id=11, ip=xxx.142, port=29591, is_recovery=0 role=scheduler, id=1, ip=xxx.195, port=9000, is_recovery=0 } }. THIS IS NOT DATA MSG!
[00:07:34] src/van.cc:370: S[10] is connected to others
[00:07:35] src/van.cc:510: ? => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:07:35] src/van.cc:535: 1 => 10. Meta: request=0, timestamp=8, control={ cmd=BARRIER, barrier_group=-564201712 }. THIS IS NOT DATA MSG!
[00:07:35] src/van.cc:510: ? => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:07:35] src/van.cc:535: 11 => 10. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140723023324848, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=256 data_size=4
[00:07:35] src/van.cc:535: 9 => 10. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140724865464560, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=256 data_size=4
Segmentation fault      (core dumped) bpslaunch

Originally posted by @pleasantrabbit in #225 (comment)
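
For anyone triaging a log like the one above, here is a minimal Python sketch that pulls the sender, receiver, control command, and key out of each `src/van.cc` message line, so the last messages before the segfault are easier to compare. The function name `summarize` and the regular expressions are my own assumptions based only on the line format shown in this log; they are not part of BytePS or ps-lite. Keys are also rendered in hex purely as a convenience for eyeballing them.

```python
import re

# Assumed line shape, taken from the log above:
#   "[hh:mm:ss] src/van.cc:NNN: <src> => <dst>. Meta: <fields>"
LINE_RE = re.compile(
    r"src/van\.cc:\d+: (?P<src>\S+) => (?P<dst>\d+)\. Meta: (?P<meta>.*)"
)
KEY_RE = re.compile(r"\bkey=(\d+)")
CMD_RE = re.compile(r"\bcmd=(\w+)")


def summarize(log_text):
    """Return one dict per van.cc message line, in log order.

    Hypothetical helper for reading the log; lines that do not match
    the assumed format (for example the final "Segmentation fault"
    line) are skipped.
    """
    rows = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue
        meta = match.group("meta")
        key = KEY_RE.search(meta)
        cmd = CMD_RE.search(meta)
        rows.append({
            "src": match.group("src"),   # "?" when the sender id is not yet assigned
            "dst": int(match.group("dst")),
            "cmd": cmd.group(1) if cmd else None,      # control messages only
            "key": int(key.group(1)) if key else None,  # data messages only
            "key_hex": hex(int(key.group(1))) if key else None,
        })
    return rows
```

Calling `summarize(open("server.log").read())[-2:]` (the file name is a placeholder) would return the last two parsed messages, which in this log are the two pushes from nodes 11 and 9 to server 10 right before the crash.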

@jasperzhong
Contributor Author

jasperzhong commented Jul 18, 2020

I tried to use "Reference in new issue". Why was it created in the original repo? My mistake, please just ignore it.
