Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't stop Adaptive on error #8871

Merged
merged 6 commits into from
Sep 27, 2024

Conversation

hendrikmakait
Copy link
Member

Closes #8808

  • Tests added / passed
  • Passes pre-commit run --all-files

Copy link
Contributor

github-actions bot commented Sep 10, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    25 files  ± 0      25 suites  ±0   10h 22m 58s ⏱️ + 8m 24s
 4 128 tests + 2   4 014 ✅ + 3    110 💤  -  1  4 ❌ ±0 
47 686 runs  +24  45 586 ✅ +35  2 095 💤  - 12  5 ❌ +1 

For more details on these failures, see this check.

Results for commit b42358a. ± Comparison against base commit 80b3af5.

This pull request removes 6 and adds 8 tests. Note that renamed tests count towards both.
distributed.deploy.tests.test_adaptive ‑ test_adaptive_scale_down_override
distributed.deploy.tests.test_adaptive_core ‑ test_adapt_oserror_safe_target
distributed.deploy.tests.test_adaptive_core ‑ test_adapt_oserror_scale
distributed.deploy.tests.test_adaptive_core ‑ test_adapt_stop_del
distributed.deploy.tests.test_adaptive_core ‑ test_adaptive_logs_stopping_once
distributed.deploy.tests.test_adaptive_core ‑ test_interval
distributed.deploy.tests.test_adaptive ‑ test_adapt_callback_logs_error_in_scale_down
distributed.deploy.tests.test_adaptive ‑ test_adapt_gets_stopped_on_cluster_close
distributed.deploy.tests.test_adaptive ‑ test_adapt_logs_error_in_safe_target
distributed.deploy.tests.test_adaptive ‑ test_adapt_stop_del
distributed.deploy.tests.test_adaptive ‑ test_adaptive_logs_stopping_once[False]
distributed.deploy.tests.test_adaptive ‑ test_adaptive_logs_stopping_once[True]
distributed.deploy.tests.test_adaptive ‑ test_adaptive_stops_on_cluster_status_change
distributed.deploy.tests.test_adaptive ‑ test_interval

♻️ This comment has been updated with latest results.

Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine to me. It's not clear from reading #2904 why that stop was introduced in the first place.

@hendrikmakait hendrikmakait marked this pull request as draft September 10, 2024 17:37
@hendrikmakait
Copy link
Member Author

This seems fine to me. It's not clear from reading #2904 why that stop was introduced in the first place.

My gut feeling is that it was introduced to cover the case where the cluster closes, and the scheduler becomes unavailable. I think this should be covered by the Cluster object (and is indeed handled by Cluster). The open question is how to enforce this from subclasses or duck-typed clusters. (This is why I moved this to draft.)

@jacobtomlinson
Copy link
Member

Yeah I feel like the cluster object should stop adapting if the scheduler connection closes. I think we want to implement a protocol for cluster objects that sets out expectations and formalises the API a little. But for now I think implementing it in SpecCluster should be enough?

@hendrikmakait
Copy link
Member Author

@jacobtomlinson: Conceptually, I agree with your approach, but for now I'd like to go the more conversative route of having Adaptive itself check whether its cluster is still running. This has caused the PR to blow up a bit because some functionality and tests needed to move between adaptive and adaptive_core.

@hendrikmakait hendrikmakait marked this pull request as ready for review September 13, 2024 09:19
Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thanks for taking the time here Hendrik.

Have you tried these changes out with any of the projects that use adaptive scaling? AFAIK that's only dask-jobqueue and dask-cloudprovider today.

cluster: Cluster,
interval: str | float | timedelta | None = None,
minimum: int | None = None,
maximum: float | None = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this is a little confusing given that we actually either want int or math.inf, per

if not isinstance(maximum, int) and not math.isinf(maximum):
raise TypeError(f"maximum must be int or inf; got {maximum}")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong feelings here, this just follow the numeric tower in PEP 484, i.e., int | float can be simplified to float. Unfortunately there isn't a canonical INF literal that I'm aware of, instead we have math.inf, float("inf"), np.inf, and more. Happy to change it back if people find it confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intuitively I would say that None is equivalent to math.inf in this case. I expect most users would assume that setting the value to None would not set an upper bound. They might even assume that setting it to 0 does the same.

Should we attempt to define an INF literal in dask.typing that we can use here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable but let's do that in a different PR. I'll roll my changes to the type back for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that None is not equivalent to math.inf in this case. Instead, we read the config value if None is provided.

@hendrikmakait
Copy link
Member Author

Have you tried these changes out with any of the projects that use adaptive scaling? AFAIK that's only dask-jobqueue and dask-cloudprovider today.

I've executed both test suites locally once without failures. There were quite a few skipped tests, so YMMV.

@hendrikmakait hendrikmakait merged commit e52da46 into dask:main Sep 27, 2024
23 of 30 checks passed
@hendrikmakait hendrikmakait deleted the dont-stop-adaptive-on-oserror branch September 27, 2024 11:00
@jacobtomlinson
Copy link
Member

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adaptive stops whenever it encounters an error
3 participants