Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

πŸ› Engine: daemon restart fails with circus.exc.ConflictError #6041

Open
mbercx opened this issue May 29, 2023 · 4 comments
Open

πŸ› Engine: daemon restart fails with circus.exc.ConflictError #6041

mbercx opened this issue May 29, 2023 · 4 comments

Comments

@mbercx
Copy link
Member

mbercx commented May 29, 2023

Describe the bug

From time to time, I find my processes are no longer updating. When trying to restart the daemon, stopping the daemon fails due to a timeout:

❯ verdi daemon restart --reset
Profile: dev
Stopping the daemon... FAILED
Critical: Connection to the daemon timed out.

Restarting afterwards is not a problem:

❯ verdi daemon start
Starting the daemon with 1 workers... OK

I find no errors in the daemon logs, but in the circus logs I find the following (trimmed for brevity, full error messages log below):

2023-05-29 21:48:37 circus[2237] [INFO] Arbiter exiting
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 tornado.general[2246] [WARNING] Got events for stream <zmq.eventloop.zmqstream.ZMQStream object at 0x104370580> attached to closed socket: Socket operation on non-socket
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2237] [INFO] circusd-stats stopped
2023-05-29 21:48:38 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:39 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
[...]
2023-05-29 21:48:53 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:53 circus[2237] [INFO] aiida-dev stopped
Full Error log
2023-05-29 21:48:37 circus[2237] [INFO] Arbiter exiting
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 tornado.general[2246] [WARNING] Got events for stream <zmq.eventloop.zmqstream.ZMQStream object at 0x104370580> attached to closed socket: Socket operation on non-socket
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2237] [INFO] circusd-stats stopped
2023-05-29 21:48:38 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:39 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:40 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:41 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:42 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:43 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:44 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:45 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:46 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:47 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:48 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:49 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:50 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:51 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:52 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:53 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:53 circus[2237] [INFO] aiida-dev stopped

Steps to reproduce

I haven't found a way to consistently reproduce the problem yet, but it seems to occur more often when I am running processes that involve large data transfers. For the event above, a dozen orso calculation jobs had paused due to connection issues, which I then had restarted with verdi process play -a.

Your environment

  • Operating system [e.g. Linux]: macOS Monterey v12.5
  • Python version [e.g. 3.7.1]: Python 3.9.16
  • aiida-core version [e.g. 1.2.1]: sph/fix/6013/verdi-computer-test branch, it seems. Commit 47cd515, but I've also had it happen when running on main.
  • circus: 0.18.0

Additional context

From an offline discussion I know @unkcpz has also run into this issue.

@mbercx
Copy link
Member Author

mbercx commented May 30, 2023

I've been running into this one quite consistently this morning, so reproduction has become less of an issue. πŸ˜… Now also running on main.

One of my processes seems to have also excepted due to this issue:

  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/events.py", line 97, in run
    await self._callback(*self._args, **self._kwargs)
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/processes.py", line 567, in _run_task
    result = await coro(*args, **kwargs)
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/utils.py", line 245, in wrap
    return coro_or_fn(*args, **kwargs)
  File "/Users/mbercx/project/super/code/aiida-core/aiida/engine/processes/workchains/workchain.py", line 413, in _on_awaitable_finished
    self.resume()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 98, in transition
    result = wrapped(self, *a, **kw)
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/processes.py", line 1112, in resume
    return self._state.resume(*args)  # type: ignore
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/process_states.py", line 339, in resume
    self._waiting_future.set_result(value)
asyncio.exceptions.InvalidStateError: invalid state

@sphuber
Copy link
Contributor

sphuber commented Jun 2, 2023

Thanks for the report @mbercx . I am pretty sure that the error in the circus log is unrelated to any Processes excepting. So the second stack trace you mentioned should be independent. I think the circus exception is mostly innocent and should not have any negative effects to the system functioning.

The second stack trace is problematic however as it actually causes AiiDA processes to except. I am not sure what could cause this. Apparently the value that is getting set on the future is invalid, but annoyingly the exception raised by asyncio doesn't say what that value is.

If you say that you can consistently reproduce it, what would be super useful is to add some print statements to plumpy/process_states printing value on line 339. This way we get an idea what is getting passed and we can retrace where it is coming from and why it is missing a value.

@mbercx
Copy link
Member Author

mbercx commented Jun 3, 2023

If you say that you can consistently reproduce it

This was for the circus-related error. I'll see if I will still bump into the other one.

@sphuber
Copy link
Contributor

sphuber commented Sep 22, 2023

This may be of interest: circus-tent/circus#1202

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants