
Proposal: Add Error JobConditionType to reflect controller error (Resource Quota Error) into the status #47

Open
CJZheng91 opened this issue Oct 2, 2019 · 4 comments

@CJZheng91

Problem: Among the current JobConditionType constants, there is no status that indicates an error happening in the controller. This causes problems, e.g. when pod creation fails because it exceeds the resource quota. Take TFJob as an example (https://github.com/kubeflow/tf-operator/blob/5adee6f30c86484897db33188af591d5976d1cd2/pkg/control/pod_control.go#L138): if pod creation there exceeds the resource quota, the Create func returns an error, and this information is only exposed to the higher-level job as an event.

Solution: Add an Error JobConditionType to the kubeflow/common API, and surface pod creation errors (or other types of errors if necessary) in the JobStatus, e.g. in the updateStatusSingle func for TFJob (https://github.com/kubeflow/tf-operator/blob/5adee6f30c86484897db33188af591d5976d1cd2/pkg/controller.v1/tensorflow/status.go#L61). This Error status would simply indicate a retriable error that happened in the controller.
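
For concreteness, here is a minimal sketch of what the addition could look like next to the existing JobConditionType constants in kubeflow/common (JobCreated, JobRunning, JobRestarting, JobSucceeded, JobFailed); the JobError name and its comment are illustrative, not settled API:

const (
	// JobError means one of the job's subresources (e.g. a pod) failed to be
	// created or deleted by the controller, for instance because a resource
	// quota was exceeded. The error is considered retriable.
	JobError JobConditionType = "Error"
)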

@czheng94 commented Apr 23, 2020

Any thoughts on this issue?

Some updates supporting the necessity of the Error ConditionType:

Similar "Error" conditions do exist in other Kubernetes components, e.g. in Deployment and ReplicaSet.

ReplicaSet has ReplicaSetReplicaFailure as a condition type:
https://github.com/kubernetes/kubernetes/blob/abe6321296123aaba8e83978f7d17951ab1b64fd/pkg/apis/apps/types.go#L788-L793

const (
	// ReplicaSetReplicaFailure is added in a replica set when one of its pods fails to be created
	// due to insufficient quota, limit ranges, pod security policy, node selectors, etc. or deleted
	// due to kubelet being down or finalizers are failing.
	ReplicaSetReplicaFailure ReplicaSetConditionType = "ReplicaFailure"
)

Deployment has DeploymentReplicaFailure as a condition type:
https://github.com/kubernetes/kubernetes/blob/abe6321296123aaba8e83978f7d17951ab1b64fd/pkg/apis/apps/types.go#L460-L473

const (
	// Available means the deployment is available, ie. at least the minimum available
	// replicas required are up and running for at least minReadySeconds.
	DeploymentAvailable DeploymentConditionType = "Available"
	// Progressing means the deployment is progressing. Progress for a deployment is
	// considered when a new replica set is created or adopted, and when new pods scale
	// up or old pods scale down. Progress is not estimated for paused deployments or
	// when progressDeadlineSeconds is not specified.
	DeploymentProgressing DeploymentConditionType = "Progressing"
	// ReplicaFailure is added in a deployment when one of its pods fails to be created
	// or deleted.
	DeploymentReplicaFailure DeploymentConditionType = "ReplicaFailure"
)

The Deployment controller checks the underlying ReplicaSet's condition types and propagates ReplicaSetReplicaFailure up into the Deployment status:
https://github.com/kubernetes/kubernetes/blob/abe6321296123aaba8e83978f7d17951ab1b64fd/pkg/controller/deployment/progress.go#L101-L106

	// Move failure conditions of all replica sets in deployment conditions. For now,
	// only one failure condition is returned from getReplicaFailures.
	if replicaFailureCond := dc.getReplicaFailures(allRSs, newRS); len(replicaFailureCond) > 0 {
		// There will be only one ReplicaFailure condition on the replica set.
		util.SetDeploymentCondition(&newStatus, replicaFailureCond[0])
	} else {
		util.RemoveDeploymentCondition(&newStatus, apps.DeploymentReplicaFailure)
	}

Right now, pod creation errors due to resource quota are only exposed as events in Kubeflow jobs, which is not a good channel for exposing, and especially for propagating, errors. Enabling this Error condition in the Kubeflow job CR status would make error propagation much easier for users who are building custom CRs on top of Kubeflow jobs.
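
As an illustration of that propagation, a sketch of how a controller for a custom CR could consume the proposed condition; hasJobError is a hypothetical helper, and commonv1.JobError assumes the constant proposed above:

import (
	commonv1 "github.com/kubeflow/common/pkg/apis/common/v1"
	corev1 "k8s.io/api/core/v1"
)

// hasJobError reports whether a Kubeflow job's status carries the proposed
// Error condition, letting a controller built on top of the job propagate
// the failure without watching pod-level events.
func hasJobError(status commonv1.JobStatus) (commonv1.JobCondition, bool) {
	for _, cond := range status.Conditions {
		if cond.Type == commonv1.JobError && cond.Status == corev1.ConditionTrue {
			return cond, true
		}
	}
	return commonv1.JobCondition{}, false
}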

@terrytangyuan (Member)

cc @Jeffwan @gaocegege @johnugeorge @ywskycn @merlintang WDYT? Any objections to bringing this type of error to the Kubeflow CR level? It would be convenient to surface this in the CR-level status so users don't have to check each individual subresource, but my main concern is that it's hard to define which types of errors are retriable, because this depends on the specific controller, the use case, and the lower-level mechanism for handling/propagating errors in different clusters.

@czheng94

@terrytangyuan

but my main concern is that it's hard to define which types of errors are retriable, because this depends on the specific controller,

I don't think we need to define these specific error types in kubeflow/common. They can all just use the "Error" JobConditionType; then, in each specific controller, we can use JobCondition.Reason and JobCondition.Message to expose the specific error.
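
A sketch of that pattern, assuming the JobError constant proposed above (setErrorCondition is a hypothetical helper, and the reason string is only an example):

import (
	commonv1 "github.com/kubeflow/common/pkg/apis/common/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setErrorCondition records a retriable controller error (e.g. a pod
// creation failure due to resource quota) as the proposed Error condition,
// carrying the controller-specific details in Reason and Message.
func setErrorCondition(status *commonv1.JobStatus, reason string, err error) {
	now := metav1.Now()
	cond := commonv1.JobCondition{
		Type:               commonv1.JobError, // proposed in this issue
		Status:             corev1.ConditionTrue,
		Reason:             reason,      // e.g. "PodCreateFailed"
		Message:            err.Error(), // e.g. the quota error from the API server
		LastUpdateTime:     now,
		LastTransitionTime: now,
	}
	// Upsert: replace an existing Error condition or append a new one.
	for i, c := range status.Conditions {
		if c.Type == cond.Type {
			status.Conditions[i] = cond
			return
		}
	}
	status.Conditions = append(status.Conditions, cond)
}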
