This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Supplement more information about tfjob into podgroup when enable-gang-scheduler is set true in tf-operator #112

Open
jiangkaihua opened this issue Feb 1, 2021 · 1 comment

Comments

@jiangkaihua
Contributor

When enable-gang-scheduler=true, tf-operator creates a PodGroup custom resource so that the gang scheduler Volcano can allocate the pods. However, when the PodGroup is created in func SyncPodGroup:

func (jc *JobController) SyncPodGroup(job metav1.Object, minAvailableReplicas int32) (*v1beta1.PodGroup, error) {

	createPodGroup := &v1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name: job.GetName(),
			OwnerReferences: []metav1.OwnerReference{
				*jc.GenOwnerReference(job),
			},
		},
		Spec: v1beta1.PodGroupSpec{
			MinMember: minAvailable.IntVal,
		},
	}
	createdPodGroup, err := volcanoClientSet.SchedulingV1beta1().PodGroups(job.GetNamespace()).Create(createPodGroup)

Only the pod information and minMember are set in the PodGroup, which results in missing functionality as well as unpredictable bugs during allocation.

For example, since the minResources field is not filled in the PodGroup, the gang scheduler Volcano cannot distinguish TFJobs from BestEffort jobs, because both kinds of job have nil minResources. As a result, every TFJob can become Inqueue, and the enqueue and reserve actions lose their effect.

So in my opinion, we need to supplement more information about the TFJob in the PodGroup, such as minMember, queue, and other fields, to make sure the gang scheduler works correctly.
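To make the proposal concrete, here is a minimal sketch of how the extra fields might be derived from a job's replica specs. It is not the tf-operator implementation. `ReplicaSpec`, `PodGroupSpec`, and `BuildPodGroupSpec` are hypothetical, simplified stand-ins written in plain Go so the example runs without Kubernetes or Volcano dependencies. It also assumes that minMember is the total replica count and that minResources is the sum of per-pod requests across all replicas.

```go
package main

import "fmt"

// ReplicaSpec is a hypothetical, simplified stand-in for one TFJob replica type.
type ReplicaSpec struct {
	Replicas int32
	// Requests maps a resource name to the per-pod request
	// (milli-cores for "cpu", bytes for "memory" in this sketch).
	Requests map[string]int64
}

// PodGroupSpec is a hypothetical, simplified stand-in for the fields
// discussed in this issue: minMember, queue, and minResources.
type PodGroupSpec struct {
	MinMember    int32
	Queue        string
	MinResources map[string]int64
}

// BuildPodGroupSpec sums replicas and resource requests over all replica
// types. Assumption: every replica must be scheduled for the gang to start.
func BuildPodGroupSpec(replicas map[string]ReplicaSpec, queue string) PodGroupSpec {
	spec := PodGroupSpec{Queue: queue, MinResources: map[string]int64{}}
	for _, r := range replicas {
		spec.MinMember += r.Replicas
		for name, perPod := range r.Requests {
			// Total request for this replica type = per-pod request * replica count.
			spec.MinResources[name] += perPod * int64(r.Replicas)
		}
	}
	return spec
}

func main() {
	replicas := map[string]ReplicaSpec{
		"PS":     {Replicas: 1, Requests: map[string]int64{"cpu": 1000, "memory": 2 << 30}},
		"Worker": {Replicas: 2, Requests: map[string]int64{"cpu": 2000, "memory": 4 << 30}},
	}
	spec := BuildPodGroupSpec(replicas, "default")
	fmt.Println(spec.MinMember, spec.Queue, spec.MinResources["cpu"], spec.MinResources["memory"])
}
```

With a non-empty minResources, a job built this way would no longer look identical to a BestEffort job from the scheduler's point of view. Mapping these values onto the real PodGroup fields is left open here.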

@gaocegege
Member

SGTM.
