Make dependency-group-timeout check ignored until all tasks scheduled #623

Merged 1 commit on Dec 9, 2021

@zuston zuston commented Dec 9, 2021

Bug Fix

When some resources are not yet satisfied (so not all tasks have been scheduled) and the dependency-timeout-check conf is specified, the check throws a NullPointerException, like:

2021-12-09 06:18:04 INFO  ApplicationMaster:1199 - Successfully started container container_e03_1582553233674_1290236_01_000033
2021-12-09 06:18:04 ERROR TFRuntime:149 - Failed to check dependency timeout.
java.lang.NullPointerException
	at com.linkedin.tony.runtime.MLGenericRuntime.lambda$groupDependencyTimeout$1(MLGenericRuntime.java:211)
	at java.util.stream.ReferencePipeline$5$1.accept(ReferencePipeline.java:227)
	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.LongPipeline.reduce(LongPipeline.java:443)
	at java.util.stream.LongPipeline.max(LongPipeline.java:406)
	at com.linkedin.tony.runtime.MLGenericRuntime.groupDependencyTimeout(MLGenericRuntime.java:212)
	at com.linkedin.tony.runtime.MLGenericRuntime.isHealthy(MLGenericRuntime.java:147)
	at com.linkedin.tony.ApplicationMaster.monitor(ApplicationMaster.java:749)
	at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:422)
	at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:356)

Solution

To get accurate info about the running tasks, the dependency-group-timeout check should be skipped until all tasks have been scheduled.
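
The idea can be sketched roughly as below. This is a minimal illustration, not the actual TonY code: the `Task` class, its nullable `startTimeMs` field, and the `isGroupTimedOut` method are hypothetical names, and the timeout rule (elapsed time since the latest task start) is an assumption chosen only to mirror the `mapToLong(...).max()` pattern visible in the stack trace.

```java
import java.util.List;

// Hypothetical sketch; names and the timeout rule are assumptions, not TonY's API.
final class DependencyTimeoutSketch {

    /** Stand-in for a task; startTimeMs stays null until the task is scheduled. */
    static final class Task {
        final String name;
        final Long startTimeMs;

        Task(String name, Long startTimeMs) {
            this.name = name;
            this.startTimeMs = startTimeMs;
        }
    }

    /** True only when every slot holds a task that already has a start time. */
    static boolean allTasksScheduled(List<Task> tasks) {
        return tasks.stream().allMatch(t -> t != null && t.startTimeMs != null);
    }

    /**
     * Returns true if the group exceeded the timeout. The check is skipped
     * (reported as "not timed out") while any task is still unscheduled,
     * so the stream below never dereferences a missing start time.
     */
    static boolean isGroupTimedOut(List<Task> tasks, long timeoutMs, long nowMs) {
        if (tasks.isEmpty() || !allTasksScheduled(tasks)) {
            return false; // guard: ignore the check until all tasks are scheduled
        }
        long latestStartMs = tasks.stream()
                .mapToLong(t -> t.startTimeMs) // safe: no null values past the guard
                .max()
                .getAsLong();                  // safe: list is non-empty
        return nowMs - latestStartMs > timeoutMs;
    }
}
```

Without the guard, a partially scheduled group would hit the unboxing of a null start time inside the lambda, which is the kind of failure the log above shows.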

Tips:
I added a test case for the problem above. If you remove the MLGenericRuntime.groupDependencyTimeout change, you can rerun the testPartialTaskScheduledShouldPass test case to reproduce the problem.

@zuston zuston requested a review from oliverhu December 9, 2021 03:13
@zuston zuston merged commit ce3f207 into tony-framework:master Dec 9, 2021
@zuston zuston mentioned this pull request Dec 15, 2021
zuston added a commit to zuston/TonY that referenced this pull request Feb 9, 2022
zuston pushed a commit to zuston/TonY that referenced this pull request Feb 9, 2022
Backport: Make dependency-group-timeout check ignored until all tasks scheduled tony-framework#623

See merge request !75