Add horovodrun launch agent for Wholegraph #200
/label feature request
Looks good to me. Thanks for contributing to Wholegraph, @Tomcli. @BradReesWork, could you please kick off the CI? Thanks!
/okay to test
Hi @Tomcli, it seems there are some code style issues in your code that are causing the CI to fail. We recommend running the pre-commit tool to check code style before committing, as described here: https://docs.rapids.ai/api/cuspatial/stable/developer_guide/contributing_guide/ . Could you run pre-commit and then commit again? Thanks. If that is too much trouble, I can also open a PR and commit your code for you.
Thank you @linhu-nv for providing the link to the contributing guide. I fixed the license check and verified it with my local pre-commit check.
No problem, @Tomcli. @BradReesWork, could you please kick off the CI again? Thanks.
/okay to test
/merge
We have many users running the Kubeflow training operator who are also interested in using Wholegraph. Many of our MPIJob users still use HorovodRun as the startup command. We therefore want to add HorovodRun as one of the Wholegraph launch agents so that our users can run Wholegraph on top of Kubeflow.
The new function is similar to the existing MPI launch agent: the horovod library is imported only on demand. Because of an issue with horovod.torch (see horovod/horovod#4009), horovod.tensorflow is used, and only for the Horovod initialization call. After Horovod is initialized, the program can continue to run normal PyTorch code within each rank, just as it does with the mpi4py launch agent.
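For readers unfamiliar with the pattern, here is a minimal sketch of an on-demand Horovod initialization. It is not the code from this PR: the helper name `init_with_horovod`, the `RankInfo` tuple, and the `module_name` parameter are hypothetical and exist only for illustration. The only Horovod calls assumed are `init()`, `rank()`, `size()`, `local_rank()`, and `local_size()`, which `horovod.tensorflow` provides.

```python
import importlib
from typing import NamedTuple


class RankInfo(NamedTuple):
    """Hypothetical container for the values a launch agent typically needs."""
    rank: int
    size: int
    local_rank: int
    local_size: int


def init_with_horovod(module_name: str = "horovod.tensorflow") -> RankInfo:
    """Hypothetical helper: initialize Horovod and report this process's ranks.

    The module is imported inside the function, so users of other launch
    agents do not need Horovod installed.
    """
    hvd = importlib.import_module(module_name)  # on-demand import
    hvd.init()  # the only step that needs Horovod itself
    return RankInfo(hvd.rank(), hvd.size(), hvd.local_rank(), hvd.local_size())
```

Under `horovodrun`, each process would call the helper once at startup and then pass the returned ranks to its usual PyTorch setup. How Wholegraph consumes these values is not shown here.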
fixes #201