Can't start LocalCUDACluster w/ ucx #1148
I'm seeing this on a box with 4 T4s, as well as a box with 8x A100s. It doesn't seem machine- or card-specific.
Could you check the …
Added UCX to the issue description.
Could you then check if there are any other traces, as asked above? After that, could you try installing …
Actually, before downgrading to …
Here's the full trace: https://gist.github.com/randerzander/16e59b627fc8abf1efa349f819eba735
I'm seeing the same thing with UCX 1.14, with and without …
This is an issue at shutdown, but it shouldn't prevent any usage. We think the cause is a change in connection handling in distributed; we're bisecting to find out.
dask/distributed#7593 is the offending PR. EDIT: I had pasted the wrong link before.
Unfortunately, it does impact usage. Simple things like …
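For context, here is a hypothetical sketch of the kind of trivial workload affected; the exact snippet from this comment was not preserved, so the cluster setup is borrowed from the reproducers later in the thread, and the submitted task is illustrative only.

```python
# Hypothetical example only: any simple round-trip through the
# scheduler on a UCX-protocol cluster was reportedly affected.
from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":
    with LocalCUDACluster(protocol="ucx") as cluster:
        with Client(cluster) as client:
            # A trivial task submission and result fetch.
            print(client.submit(lambda x: x + 1, 41).result())
```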
Given the non-trivial failures, can you locally revert dask/distributed#7644 and see if that passes for you?
@randerzander Can you try adding … to the top of your script? It must occur before …
Yes, that resolved both the cluster shutdown issue and the workload failures! Thanks for the workaround!
@randerzander can you also check if the following works around the problem for you?

```python
def workaround_7726():
    from distributed._concurrent_futures_thread import _python_exit
    from distributed.client import _close_global_client
    from distributed.deploy.spec import close_clusters
    import distributed
    import atexit
    import weakref

    def shutdown():
        # This mimics a function in distributed.__init__ which we can't
        # get a handle on (because it is del'd), and must be registered
        # after the other functions
        distributed._python_shutting_down = True

    # These functions must be unregistered and then re-registered
    for fn in [_python_exit, _close_global_client, close_clusters]:
        atexit.unregister(fn)
    # So that this finalizer is in an atexit hook after them.
    # Note that atexit handlers are called last-in, first-out.
    # See https://docs.python.org/3/library/atexit.html
    weakref.finalize(lambda: None, lambda: None)
    # And re-register them.
    for fn in [_python_exit, close_clusters, _close_global_client, shutdown]:
        atexit.register(fn)
```

Run this function before you boot a cluster.
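A minimal sketch of the intended call order, assuming `workaround_7726` is defined as above (the cluster arguments are illustrative):

```python
from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":
    # The workaround must run before any cluster or client is created.
    workaround_7726()
    cluster = LocalCUDACluster(protocol="ucx")
    client = Client(cluster)
```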
If this solution works, I would be inclined to use it rather than put out a patch release of distributed.
The fact that this has to run before spinning up the cluster suggests it must be implemented by the user. Am I right, @wence-? Did you have success implementing it as part of Dask-CUDA directly (without requiring any user interaction)?
I haven't yet, but we could run it, for example, in …
This works for the minimal reproducer (not sure about cluster restarts):

```python
from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":
    cluster = LocalCUDACluster(protocol="ucx")
    client = Client(cluster)
    del cluster
    print("shutting down...")
    print(client)
```

```diff
diff --git a/dask_cuda/local_cuda_cluster.py b/dask_cuda/local_cuda_cluster.py
index 656f614..4781048 100644
--- a/dask_cuda/local_cuda_cluster.py
+++ b/dask_cuda/local_cuda_cluster.py
@@ -23,6 +23,35 @@ from .utils import (
 )
 
 
+def workaround_distributed_7726():
+    import atexit
+    import weakref
+
+    import distributed
+    from distributed._concurrent_futures_thread import _python_exit
+    from distributed.client import _close_global_client
+    from distributed.deploy.spec import close_clusters
+
+    def shutdown():
+        # This mimics a function in distributed.__init__ which we can't
+        # get a handle on (because it is del'd), and must be registered
+        # after the other functions
+        distributed._python_shutting_down = True
+
+    # These functions must be unregistered and then re-registered
+    for fn in [_python_exit, _close_global_client, close_clusters]:
+        atexit.unregister(fn)
+    # So that this finalizer is in an atexit hook after them
+    weakref.finalize(lambda: None, lambda: None)
+    # And re-register them.
+    for fn in [_python_exit, close_clusters, _close_global_client, shutdown]:
+        atexit.register(fn)
+
+
+workaround_distributed_7726()
+del workaround_distributed_7726
+
+
 class LoggedWorker(Worker):
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
```
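As an aside, the last-in, first-out ordering of atexit handlers that both variants of the workaround rely on can be demonstrated in a small standalone sketch:

```python
import atexit

atexit.register(lambda: print("registered first, runs last"))
atexit.register(lambda: print("registered second, runs first"))

# At interpreter exit this prints:
#   registered second, runs first
#   registered first, runs last
```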
I can confirm that. The more fragile part seems to be that the client must import …

EDIT: That's what I tested:

repro.py:

```python
from dask.distributed import Client, LocalCluster
import dask_cuda

if __name__ == "__main__":
    client = Client("ucx://10.33.227.163:8786")
    print(client)
```
If …
Yes, I think the client connections are the problematic ones :(
The client is problematic in the cases we've seen. I suspect @randerzander's case may be happening elsewhere too, but I'm not sure either.
This should now be resolved by the distributed 2023.3.2.1 release. Thank you @randerzander for raising it!
With a conda environment using the latest nightlies: …

Repro: …

Trace: …