Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapt mpi-operator v2 to Kueue #505

Closed
tenzen-y opened this issue Jan 24, 2023 · 14 comments
Closed

Adapt mpi-operator v2 to Kueue #505

tenzen-y opened this issue Jan 24, 2023 · 14 comments

Comments

@tenzen-y
Copy link
Member

tenzen-y commented Jan 24, 2023

Many users hope mpi-operator v2 adapts Kueue.

Blocked by #504 kubernetes-sigs/kueue#369
Kueue side issue: kubernetes-sigs/kueue#65

/kind feature

@tenzen-y tenzen-y mentioned this issue Jan 26, 2023
10 tasks
@alculquicondor
Copy link
Collaborator

I suppose you mean creating and syncing a Workload object for each MPIJob.

Do you think we should have this code in mpi-operator or in kueue?

@tenzen-y
Copy link
Member Author

I suppose you mean creating and syncing a Workload object for each MPIJob.

Yes, I meant we implement a controller for Workload resources like the job-controller.
https://github.com/kubernetes-sigs/kueue/blob/10d35322c252c2724467cf4617e79e94e1bd0c8a/pkg/controller/workload/job/job_controller.go

Do you think we should have this code in mpi-operator or in kueue?

IIUC, kueue is designed not to hold third-party dependencies.
So we might need to add that code into mpi-operator, right?

@alculquicondor
Copy link
Collaborator

Yes, we could have this controller in any of the repos. Wherever it goes, we should have them for kueue+mpi-operator. Both repos have the setup for E2E tests.

Having it in kueue might be better for the time being as things are still changing. But having it in mpi-operator serves as proof that custom jobs don't have to be in tree (and we can add references from the kueue README).

I personally prefer to have it in mpi-operator if the OWNERS don't mind (cc @terrytangyuan)

@ahg-g, wdyt?

@terrytangyuan
Copy link
Member

terrytangyuan commented Jan 26, 2023

I am okay with either approach as long as there are E2E tests.

@ahg-g
Copy link

ahg-g commented Jan 26, 2023

If we put it in mpi-operator repo, I suppose it will run as a reconciler within the same binary, not yet another operator, correct?

@alculquicondor
Copy link
Collaborator

Yes, same binary, possibly guarded by a command line flag to enable

@tenzen-y
Copy link
Member Author

Having it in kueue might be better for the time being as things are still changing. But having it in mpi-operator serves as proof that custom jobs don't have to be in tree (and we can add references from the kueue README).

I'm also fine either way.

If we select the latter, my concern is when kueue will stop serving that controller and donate that controller to the mpi-operator repo.

If kueue keeps serving that controller, we might face why kueue does not provide other controllers (e.g., Ray, Argo, Spark, and more).

However, if we select the former, the mpi-operator may face API changes of kueue since kueue has alpha status as you say.

@alculquicondor
Copy link
Collaborator

We have maintainers in both sides, so I think we can manage :)

@tenzen-y
Copy link
Member Author

Great!
+1 for having it in this repo.

@mwielgus
Copy link

mwielgus commented Feb 9, 2023

Kueue would like to have consistent integration model. We are grateful that MPI-operator is willing to add some Kueue specific code, but other frameworks may not be that welcoming and prefer to keep outside the code that doesn't have to be in their repo. From that perspective, doing the non-critical integration (other than suspend logic) outside of frameworks sounds like a better option.

@tenzen-y
Copy link
Member Author

tenzen-y commented Feb 9, 2023

It sounds reasonable. I agree, @mwielgus.
It isn't easy to convince all communities to have the workload-controller for kueue.

@tenzen-y
Copy link
Member Author

tenzen-y commented Feb 9, 2023

We can close this issue and work on the kueue side. However, for a while, I would like to keep opening this issue to know what others think.

@tenzen-y
Copy link
Member Author

/close

@google-oss-prow
Copy link

@tenzen-y: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants