Add horovodrun launch agent for Wholegraph #200

Merged
merged 3 commits into from
Aug 8, 2024

Conversation

Tomcli
Contributor

@Tomcli Tomcli commented Jul 26, 2024

Many of our users run the Kubeflow training operator and are also interested in using Wholegraph. Many of our MPIJob users still use HorovodRun as the startup command. Therefore, we want to add HorovodRun as one of the Wholegraph launch agents so that these users can run Wholegraph on top of Kubeflow.

The new function is similar to the existing MPI launch agent: the horovod library is imported only on demand. The horovod.tensorflow library is used solely for the Horovod initialization command because of an issue with horovod.torch (see horovod/horovod#4009). After Horovod is initialized, each rank can continue to run normal PyTorch code, just as it does with the mpi4py-based agent.
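
For readers who want a picture of the on-demand import pattern described above, here is a minimal sketch. It is not the code from this PR: the names `RankInfo` and `rank_info_from_horovod` and the injectable `import_module` parameter are hypothetical, chosen only for illustration. The Horovod calls it relies on (`init`, `rank`, `size`, `local_rank`, `local_size`) are part of the public `horovod.tensorflow` API.

```python
import importlib
from typing import Callable, NamedTuple


class RankInfo(NamedTuple):
    """Rank layout of the current process (hypothetical helper type)."""
    rank: int
    world_size: int
    local_rank: int
    local_size: int


def rank_info_from_horovod(
    import_module: Callable[[str], object] = importlib.import_module,
) -> RankInfo:
    """Initialize Horovod and return this process's rank layout.

    The importer is a parameter only so the sketch can be exercised
    without Horovod installed; this is an assumption of the example,
    not something the PR does.
    """
    # Import on demand, so Horovod is required only when this agent is selected.
    hvd = import_module("horovod.tensorflow")
    # horovod.tensorflow is used for initialization and rank queries only;
    # the training code that follows can remain plain PyTorch.
    hvd.init()
    return RankInfo(hvd.rank(), hvd.size(), hvd.local_rank(), hvd.local_size())
```

Under a launch such as `horovodrun -np 4 python train.py`, each process would call the helper once and then hand the resulting rank and world size to whatever communicator setup it uses.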

fixes #201

copy-pr-bot bot commented Jul 26, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Tomcli
Contributor Author

Tomcli commented Jul 26, 2024

/label feature request

Contributor

@linhu-nv linhu-nv left a comment

Looks good to me. Thanks for contributing to Wholegraph, @Tomcli. @BradReesWork, could you please help kick off the CI? Thanks!

@BradReesWork BradReesWork added the "improvement" (Improves an existing functionality) and "non-breaking" (Introduces a non-breaking change) labels Jul 30, 2024
@BradReesWork
Member

/okay to test

@linhu-nv
Contributor

Hi @Tomcli, it seems there are some code style issues in your code, which cause the CI to fail. I recommend running the pre-commit tool to check code style before committing, as described here: https://docs.rapids.ai/api/cuspatial/stable/developer_guide/contributing_guide/ . Can you check with pre-commit and then commit again? Thanks. If that is too troublesome, I can also open a PR and commit your code.

@Tomcli
Contributor Author

Tomcli commented Aug 6, 2024

Thank you @linhu-nv for providing the link to the contributing guide. I fixed the license check and verified with my local pre-commit check.

@Tomcli Tomcli changed the base branch from branch-24.08 to branch-24.10 August 6, 2024 23:58
@linhu-nv
Contributor

linhu-nv commented Aug 7, 2024

No problem, @Tomcli. @BradReesWork, could you please kick off the CI again? Thanks.

@BradReesWork
Member

/okay to test

@BradReesWork
Member

/merge

@rapids-bot rapids-bot bot merged commit 928b5d6 into rapidsai:branch-24.10 Aug 8, 2024
48 checks passed
@Tomcli Tomcli deleted the horovodrun branch August 8, 2024 20:11
Development

Successfully merging this pull request may close these issues.

[Feature Request] Add HorovodRun Launch Agent
3 participants