Removal of Finalizer Causes Breakage for UCX Protocol #7726
@quasiben can you share a bit more information about what is actually breaking? I outlined in #7639 (comment) that I am deeply confused about why this is causing any breakage. I discussed there how this finalizer, even if called, should not have any effect, and everything in the CPython docs suggests that this finalizer should not even be called. I would like to ensure that we cover the functionality you require and make sure this is not just a red herring. I do not want to assert on functionality that the CPython docs tell me should not even be there.
In the MRE, the breakage occurs when the client/cluster are shutting down: a bunch of exceptions are raised and everything hangs until the process ultimately times out after 60 seconds. The person who originally reported the issue, @randerzander, also mentioned it occurs during his workflows. I can't say for sure what happens there because we're trying to work on the MRE first, but I imagine this occurs when workers are getting restarted. @randerzander please add more information about when it happens, if this is known.
I think we can for now discard the idea that it "should not be called"; it's pretty clear at this point that it is called, even with
Ultimately, I agree. The problem here seems to be that there are no guarantees on the ordering of destruction of objects. This seems to happen because the ordering is in fact unpredictable. For instance,
FWIW, we just need any finalizer here to fix Ben's initial repro: e.g. this runs fine for me if I add this patch:

```diff
diff --git a/distributed/utils.py b/distributed/utils.py
index 4c1a6642..64904402 100644
--- a/distributed/utils.py
+++ b/distributed/utils.py
@@ -1401,7 +1401,7 @@ def is_valid_xml(text):
 
 _offload_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="Dask-Offload")
 
-
+weakref.finalize(_offload_executor, lambda: None)
 
 
 def import_term(name: str) -> AnyType:
     """Return the fully qualified term
```
I think Python makes a best-effort attempt to run finalizers that contain ref-cycles at interpreter shutdown, and probably we've just been lucky that the particular combination of finalizers has been such that things run in an order that happens to work. I don't think this is a clean solution, but it might be sufficient to patch over the issue for now. As another data point, if I just completely cull the global
It seems to be something to do with some data-structure setup that I don't think should be relied on, but if I create a no-op finalizer before the atexit handler in `distributed/deploy/spec.py` the cluster closes gracefully, and if I create it after, the interpreter segfaults. That is:

Cluster closes gracefully:

```diff
diff --git a/distributed/deploy/spec.py b/distributed/deploy/spec.py
index c35ec764..134a3bd8 100644
--- a/distributed/deploy/spec.py
+++ b/distributed/deploy/spec.py
@@ -686,6 +686,9 @@ async def run_spec(spec: dict[str, Any], *args: Any) -> dict[str, Worker | Nanny
     return workers
 
 
+weakref.finalize(lambda: None, lambda: None)
+
+
 @atexit.register
 def close_clusters():
     for cluster in list(SpecCluster._instances):
```

Interpreter segfaults:

```diff
diff --git a/distributed/deploy/spec.py b/distributed/deploy/spec.py
index c35ec764..34bbe68f 100644
--- a/distributed/deploy/spec.py
+++ b/distributed/deploy/spec.py
@@ -693,3 +693,6 @@ def close_clusters():
         with suppress(gen.TimeoutError, TimeoutError):
             if getattr(cluster, "status", Status.closed) != Status.closed:
                 cluster.close(timeout=10)
+
+
+weakref.finalize(lambda: None, lambda: None)
```
Apologies @fjetter for not mentioning this earlier. I think you are seeing some escalation here because there is a bit of urgency for us (RAPIDS). We are going through a RAPIDS release now and only discovered this bug yesterday (bad timing on our part), and we had/have pinned to Dask 2023.3.2. This Dask release includes PR #7644, and thus we are scrambling a bit for a short-term fix. There was hope that we could work around this issue on our end, and that hope led me to delay mentioning our sense of urgency -- my apologies for not laying it all out right away. Perhaps @wence- and I can continue exploring down the path Lawrence mentioned in the previous issue while we simultaneously produce a release 2023.3.2.1 with the commit reverted.
Thanks @wence- ! What you are posting in #7726 (comment) is very interesting and aligns with my suspicions that there is something else going on.
So much fun. What Python version was this on? The place where this is raising is interesting, since it's again one of those "catch all" atexit handlers. I'm happy to add any harmless patches to the code base (i.e. also fine reverting the original change) if that unblocks your release, but I would like us to make progress on making this more robust. It looks like the RAPIDS team is hit by this rather frequently. If there is anything we can or should refactor to make this easier to maintain, we should talk about it. I'm counting on help from the RAPIDS team for this.
This is 3.10. But I think it's the combination with UCX that is causing a segfault. We've somehow lost it well before we get the segfault, and so various invariants are not maintained and (sometimes) the asyncio implementation bugs out.
Yeah, I would like to understand this. I am generally leery of state teardown in atexit handlers, since the ordering is import-order sensitive and that seems wrong. My reading of the tea leaves suggests that something like the following is happening:
```
Exception ignored in atexit callback: <function close_clusters at 0x7f86149e3e20>
Traceback (most recent call last):
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/deploy/spec.py", line 695, in close_clusters
cluster.close(timeout=10)
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/deploy/cluster.py", line 219, in close
return self.sync(self._close, callback_timeout=timeout)
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/utils.py", line 349, in sync
return sync(
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/utils.py", line 416, in sync
raise exc.with_traceback(tb)
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/utils.py", line 389, in f
result = yield future
File "/home/wence/Documents/apps/mambaforge/envs/test-weakref/lib/python3.10/site-packages/tornado/gen.py", line 769, in run
value = future.result()
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/utils.py", line 1849, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/home/wence/Documents/apps/mambaforge/envs/test-weakref/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/deploy/spec.py", line 441, in _close
await self._correct_state()
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/deploy/spec.py", line 348, in _correct_state_internal
await self.scheduler_comm.retire_workers(workers=list(to_close))
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/core.py", line 1185, in send_recv_from_rpc
comm = await self.live_comm()
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/core.py", line 1144, in live_comm
comm = await connect(
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/comm/core.py", line 292, in connect
comm = await wait_for(
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/utils.py", line 1849, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/home/wence/Documents/apps/mambaforge/envs/test-weakref/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/home/wence/Documents/src/rapids/third-party/distributed/distributed/comm/ucx.py", line 466, in connect
ep = await ucp.create_endpoint(ip, port)
File "/home/wence/Documents/apps/mambaforge/envs/test-weakref/lib/python3.10/site-packages/ucp/core.py", line 1004, in create_endpoint
return await _get_ctx().create_endpoint(
File "/home/wence/Documents/apps/mambaforge/envs/test-weakref/lib/python3.10/site-packages/ucp/core.py", line 304, in create_endpoint
ucx_ep = ucx_api.UCXEndpoint.create(
File "ucp/_libs/ucx_endpoint.pyx", line 255, in ucp._libs.ucx_api.UCXEndpoint.create
AssertionError:
```

What looks to be going on here is that the cluster shutdown is somehow happening after the UCX-Py resources have been released, so during finalization of the cluster there is an attempt to resurrect the comm channel, which fails because UCX is already shut down. I don't really understand how the object ownership model is supposed to work in this case right now (i.e. who is responsible for teardown, in which order) in these kinds of setups where all the workers and so forth are booted in the background.
Here's an exemplar backtrace for the segfault case. As you can see, the GC is completely hosed at this point.

With matching Python backtrace:
Collecting all atexit handlers in a single module and registering them in the "correct" order might already be helpful, wouldn't it? FWIW, I don't think anybody likes the atexit handlers. If we can ensure proper cleanup without them somehow, that would be very interesting.
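To illustrate what that might buy us: atexit runs handlers in last-in-first-out order, so a single registration site makes the teardown sequence explicit instead of depending on import order. A minimal sketch, with made-up handler names rather than distributed's real ones:

```python
# exit_handlers.py -- hypothetical central registration module (sketch only)
import atexit


def stop_offload_executor():
    print("2. offload executor stopped")


def close_clusters():
    print("1. clusters closed")


# atexit is last-in-first-out: the handler registered last runs first,
# so registering everything here pins the shutdown order explicitly.
atexit.register(stop_offload_executor)  # registered first -> runs last
atexit.register(close_clusters)         # registered last  -> runs first
```

At interpreter exit this prints "1. clusters closed" followed by "2. offload executor stopped".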
So there are, AFAICT, three atexit handlers. If I remove the registration of all of them and run the reproducer, then on shutting down I get:
```
<Client: 'ucx://127.0.0.1:58631' processes=1 threads=1, memory=124.45 GiB>
[1680262822.127658] [shallot:51402] UCXPY ERROR Non-thread-safe operation invoked on an event loop other than the current one
Traceback (most recent call last):
File "/home/wence/Documents/apps/mambaforge/envs/test-weakref/lib/python3.10/site-packages/ucp/_libs/exceptions.py", line 13, in log_errors
yield
File "ucp/_libs/transfer_tag.pyx", line 141, in ucp._libs.ucx_api._tag_recv_callback
File "/home/wence/Documents/apps/mambaforge/envs/test-weakref/lib/python3.10/site-packages/ucp/comm.py", line 16, in _cb_func
future.set_exception(exception)
File "/home/wence/Documents/apps/mambaforge/envs/test-weakref/lib/python3.10/asyncio/base_events.py", line 755, in call_soon
self._check_thread()
File "/home/wence/Documents/apps/mambaforge/envs/test-weakref/lib/python3.10/asyncio/base_events.py", line 792, in _check_thread
raise RuntimeError(
RuntimeError: Non-thread-safe operation invoked on an event loop other than the current one
```

So somehow some event loop is out of whack.
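For reference, that RuntimeError comes from asyncio's debug-mode thread check: `loop.call_soon` (which future callbacks go through) verifies it is being invoked on the loop's own thread, and `loop.call_soon_threadsafe` is the supported way to schedule work from another thread. A small sketch that reproduces the message, unrelated to UCX-Py's actual callback code:

```python
import asyncio
import threading


async def main():
    loop = asyncio.get_running_loop()
    done = loop.create_future()

    def from_other_thread():
        try:
            # Scheduling directly on the loop from a foreign thread trips
            # the same "Non-thread-safe operation" check (in debug mode).
            loop.call_soon(done.set_result, "unsafe")
        except RuntimeError as exc:
            print("raised:", exc)
            # The thread-safe way to hand work back to the loop:
            loop.call_soon_threadsafe(done.set_result, "safe")

    threading.Thread(target=from_other_thread).start()
    print("result:", await done)


# Debug mode enables the thread check that produced the traceback above.
asyncio.run(main(), debug=True)
```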
Having done some more debugging (with @pentschev), the hypothesis as to why this no-op weakref finalizer "fixes" things is that it fortuitously happens to arrange that everything that is torn down is torn down on the thread that owns the resource. So we don't have a solution right now (certainly not a quick fix that is less dangerous than re-introducing this no-op finalization).
The last atexit finalizer in the chain of finalizers that distributed registers needs to be one in the weakref module. Otherwise we observe hangs in the shutdown of UCX-Py where one thread has the GIL but not the UCX spinlock, and vice versa. To ensure this, unregister the known shutdown finalizers in distributed, add our weakref finalizer, and then re-register them, along with a final re-implementation of the "python shutting down" handler, which must run before other handlers in distributed.
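Roughly, that re-ordering looks like the following. The handler names are placeholders, not the functions distributed actually registers, and the trick only works if no finalize object has been created earlier in the process (the weakref module registers its atexit hook just once, on first use):

```python
import atexit
import weakref


def close_clusters():       # placeholder for distributed's shutdown hooks
    pass


def shutdown_offload():     # placeholder
    pass


# Pull the known handlers out of atexit...
atexit.unregister(close_clusters)
atexit.unregister(shutdown_offload)

# ...make sure the weakref module's internal atexit hook is registered
# now (creating any finalize object does this the first time)...
weakref.finalize(lambda: None, lambda: None)

# ...and put the handlers back.  Because atexit is last-in-first-out,
# the re-registered handlers run before the weakref hook, i.e. the
# weakref hook becomes the last finalizer in the chain.
atexit.register(shutdown_offload)
atexit.register(close_clusters)
```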
The first atexit finalizer in the chain of finalizers that distributed registers needs to be one in the weakref module. Otherwise we observe hangs in the shutdown of UCX-Py where one thread has the GIL but not the UCX spinlock, and vice versa. To ensure this, register a no-op handler at the top of `__init__.py`.
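In other words, something along these lines near the top of `distributed/__init__.py` (a sketch; the rest of the module is elided):

```python
import weakref

# Creating any weakref.finalize object causes the weakref module to
# register its internal atexit hook.  Doing this before any other
# distributed module registers atexit handlers pins where that hook
# sits in the shutdown ordering.
weakref.finalize(lambda: None, lambda: None)
```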
A little more detail. tl;dr: from the Python docs (and also by looking at the implementation of weakref.finalize):
(Note that these calls are always made, even if the object is still alive.) How does this work? The first time a finalize object is created, the weakref module registers its own cleanup function with atexit. Distributed also registers some atexit handlers to run during shutdown. atexit handlers are run in last-in-first-out order, so we can have two scenarios:
In scenario 1 the destruction order will be ...; UCX-Py uses .... If we're in scenario 1, all is good. In scenario 2 we have a problem: the first atexit handler to fire is ...
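A self-contained toy version of the two scenarios (names invented, not distributed's or UCX-Py's code): whichever hook is registered later fires first, so swapping the two registrations below swaps the teardown order.

```python
import atexit
import weakref


class Resource:
    """Stand-in for something like UCX-Py's application context."""


resource = Resource()  # kept alive until interpreter exit


def library_shutdown_handler():
    # Stand-in for a handler (like distributed's) that may still need
    # the resource while it runs.
    print("library atexit handler ran")


# Scenario 1: the finalize object (and hence weakref's atexit hook) is
# created first, the library handler is registered second.  atexit is
# last-in-first-out, so at exit the library handler runs first and the
# weakref finalizer releases the resource only afterwards.
weakref.finalize(resource, print, "weakref finalizer ran (resource released)")
atexit.register(library_shutdown_handler)

# Scenario 2: swap the two calls above and the weakref hook fires first,
# tearing the resource down before the library handler gets to use it.
```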
Following up here because I notice that UCX tests are failing intermittently quite often on GPU CI: https://gpuci.gpuopenanalytics.com/job/dask/job/distributed/job/prb/job/distributed-prb/6516/ I'm assuming this is related to the fact that the finalizer patch wasn't merged into. Interested in what we want to do here in the short term: should we xfail the UCX test module (or a subset of it) for now, or is there a modification we could make to the tests to unblock this trivially?
This is interesting. I would expect all ... (Sorry for the late reply, I was busy/out for a while.)
I don't recall, unfortunately. I think this is because the destructor is somehow being run by an atexit handler, which can be on a different thread (it's hard to tell from the documentation/implementation what, if any, guarantees are given).
As mentioned in #7639 (comment), we are seeing what we think is a bug due to the removal of the finalizer for a ThreadPoolExecutor. We are observing this behavior when using UCX/UCX-Py (see #7639 (comment) for a more detailed explanation).
Note: this MRE has no GPU/CUDA elements, only UCX/UCX-Py.
The above produces errors like the following:
We think #7644 is the culprit here because when the finalizer is added back the errors are gone :). I confirmed the finalizer is being called with the following patch:
When using this patch we observe no errors and confirmation that the finalizer is being called:
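The diagnostic patch and its output aren't captured above; a sketch of that kind of check (hypothetical, not the exact patch applied to `distributed/utils.py`) would be a finalizer that simply logs when it runs:

```python
import weakref
from concurrent.futures import ThreadPoolExecutor

_offload_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="Dask-Offload")

# Purely diagnostic: print a message when the finalizer fires, confirming
# that it really is invoked during interpreter shutdown.
weakref.finalize(_offload_executor, print, "Dask-Offload executor finalizer called")
```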