This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Consider supporting SuccessPolicy and FailurePolicy #99

Open
terrytangyuan opened this issue Jun 17, 2020 · 4 comments

Comments

@terrytangyuan
Member

We recently added SuccessPolicy to tf-operator (kubeflow/training-operator#1165) and are considering adding FailurePolicy to handle failure cases (kubeflow/training-operator#1170). Once these mature, and if we see a common pattern in other operators, we should consider moving them to kubeflow/common.

cc @gaocegege @Jeffwan @johnugeorge @ChanYiLin @pingsutw

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/feature 0.77
area/operator 0.85


@kf-label-bot-dev

Issue-Label Bot is automatically applying the labels:

Label Probability
feature 0.77


@Jeffwan
Member

Jeffwan commented Jun 22, 2020

Having success and failure policies would be great: it would make it easier for different frameworks to handle errors, and it would help make the reconciler logic extensible.

@zw0610
Member

zw0610 commented Aug 11, 2020

With fault-tolerant and elastic distributed training spreading to more frameworks, a universal definition of failure and success for a distributed training job would help developers clarify the logic for handling pods that have failed or recently joined.
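
As a rough illustration of what such a shared definition could look like, here is a minimal Python sketch. It is not the tf-operator or kubeflow/common API: the `Pod` type, the policy names and values, and the `min_workers` parameter are all hypothetical, and only the pod phases (`Running`, `Succeeded`, `Failed`) are borrowed from Kubernetes.

```python
from dataclasses import dataclass
from enum import Enum


class SuccessPolicy(Enum):
    # Hypothetical policy values, for illustration only.
    DEFAULT = "Default"          # the chief (or worker 0 if there is no chief) decides
    ALL_WORKERS = "AllWorkers"   # every worker must succeed


class FailurePolicy(Enum):
    # Hypothetical policy values, for illustration only.
    ANY_POD = "AnyPod"    # a single failed pod fails the job
    ELASTIC = "Elastic"   # tolerate failures while enough workers remain


@dataclass(frozen=True)
class Pod:
    name: str
    role: str    # "chief" or "worker"
    index: int
    phase: str   # "Running", "Succeeded", or "Failed"


def job_succeeded(pods, policy):
    """Return True if the job counts as succeeded under the given policy."""
    workers = [p for p in pods if p.role == "worker"]
    if policy is SuccessPolicy.ALL_WORKERS:
        return bool(workers) and all(p.phase == "Succeeded" for p in workers)
    # DEFAULT: the chief decides; without a chief, worker 0 decides.
    chiefs = [p for p in pods if p.role == "chief"]
    leaders = chiefs or [p for p in workers if p.index == 0]
    return bool(leaders) and all(p.phase == "Succeeded" for p in leaders)


def job_failed(pods, policy, min_workers=1):
    """Return True if the job counts as failed under the given policy."""
    if policy is FailurePolicy.ANY_POD:
        return any(p.phase == "Failed" for p in pods)
    # ELASTIC: the job fails only when too few non-failed workers remain.
    healthy = [p for p in pods if p.role == "worker" and p.phase != "Failed"]
    return len(healthy) < min_workers


pods = [
    Pod("worker-0", "worker", 0, "Succeeded"),
    Pod("worker-1", "worker", 1, "Failed"),
]
print(job_succeeded(pods, SuccessPolicy.DEFAULT))
print(job_failed(pods, FailurePolicy.ELASTIC, min_workers=2))
```

Keeping the two decisions as pure functions over pod state is one way a reconciler could stay framework-agnostic: each operator would only choose a policy rather than reimplement the checks.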

georgkaleido pushed a commit to georgkaleido/common that referenced this issue Jun 9, 2022