
Scheduler crashes in SSHCluster in 2023.3.2 but not in 2023.3.1 #7724

Closed
jabbera opened this issue Mar 29, 2023 · 9 comments · Fixed by #7729

Comments

@jabbera

jabbera commented Mar 29, 2023

Describe the issue: SSHCluster does not work in 2023.3.2 because the scheduler exits early with an exit code of 1:

INFO:distributed.deploy.ssh:2023-03-29 18:21:07,199 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2023-03-29 18:21:07,204 - distributed.deploy.ssh - INFO - 2023-03-29 18:21:07,204 - distributed.scheduler - INFO - State start
INFO:distributed.deploy.ssh:2023-03-29 18:21:07,204 - distributed.scheduler - INFO - State start
2023-03-29 18:21:07,207 - distributed.deploy.ssh - INFO - 2023-03-29 18:21:07,206 - distributed.scheduler - DEBUG - Clear task state
INFO:distributed.deploy.ssh:2023-03-29 18:21:07,206 - distributed.scheduler - DEBUG - Clear task state
2023-03-29 18:21:07,207 - distributed.deploy.ssh - INFO - 2023-03-29 18:21:07,207 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.15.40.68:36143
INFO:distributed.deploy.ssh:2023-03-29 18:21:07,207 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.15.40.68:36143
INFO:asyncssh:[conn=0, chan=1] Received exit status 1
INFO:asyncssh:[conn=0, chan=1] Received channel close
INFO:asyncssh:[conn=0, chan=1] Channel closed
INFO:asyncssh:[conn=0, chan=1] Sending KILL signal

When rolling back to 2023.3.1 the scheduler starts successfully:

2023-03-29 18:23:33,874 - distributed.deploy.ssh - INFO - 2023-03-29 18:23:33,873 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO:distributed.deploy.ssh:2023-03-29 18:23:33,873 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2023-03-29 18:23:33,878 - distributed.deploy.ssh - INFO - 2023-03-29 18:23:33,878 - distributed.scheduler - INFO - State start
INFO:distributed.deploy.ssh:2023-03-29 18:23:33,878 - distributed.scheduler - INFO - State start
2023-03-29 18:23:33,882 - distributed.deploy.ssh - INFO - 2023-03-29 18:23:33,881 - distributed.scheduler - DEBUG - Clear task state
INFO:distributed.deploy.ssh:2023-03-29 18:23:33,881 - distributed.scheduler - DEBUG - Clear task state
2023-03-29 18:23:33,883 - distributed.deploy.ssh - INFO - 2023-03-29 18:23:33,882 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.15.40.68:40305
INFO:distributed.deploy.ssh:2023-03-29 18:23:33,882 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.15.40.68:40305
INFO:asyncssh:Opening SSH connection to localhost, port 22
INFO:asyncssh:[conn=1] Connected to SSH server at localhost, port 22

Minimal Complete Verifiable Example:

import logging
logging.basicConfig(level=logging.DEBUG)

from distributed import SSHCluster
cluster = SSHCluster(["localhost", "localhost"])

Anything else we need to know?: Full repro here:

git clone https://github.com/jabbera/distributed-bug.git
cd distributed-bug
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements-bug.txt
python demo.py

Environment:

  • Dask version: 2023.3.2
  • Python version: 3.10.5
  • Operating System: Ubuntu 20.04.5 LTS
  • Install method (conda, pip, source): pip
@jabbera jabbera changed the title Scheduler crashes un SSHCluster in 2023.3.2 but not in 2023.3.1 Scheduler crashes in SSHCluster in 2023.3.2 but not in 2023.3.1 Mar 29, 2023
@jabbera
Author

jabbera commented Mar 29, 2023

Rolling this back sorts the issue: #7631

@jrbourbeau
Member

cc @milesgranger @jacobtomlinson for visibility

@jabbera
Author

jabbera commented Mar 29, 2023

I did a little more digging, and the crash happens somewhere in here, specifically in the template.format call:

return template.format(
**toolz.merge(os.environ, dict(scheme=scheme, host=host, port=port))
)

Replacing it with a constant string avoids the crash.
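For context, here is a minimal illustration (not dask code) of why the formatting call itself can crash: str.format raises KeyError as soon as the template names a placeholder that is missing from the supplied mapping, which matches the early exit seen in the logs above.

```python
# Minimal illustration (not dask code): str.format raises KeyError when the
# template names a placeholder missing from the supplied mapping.
template = "{scheme}://{host}:{port}/{MISSING_VAR}/status"
try:
    template.format(scheme="http", host="localhost", port=8787)
except KeyError as e:
    print("KeyError:", e)  # KeyError: 'MISSING_VAR'
```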

@jabbera
Author

jabbera commented Mar 29, 2023

I've figured out what is going on here but I don't know how to fix it in dask. I have the following environment variables set:

DASK_DISTRIBUTED__DASHBOARD__LINK='{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status'

This is somehow making its way over to the SSHCluster (I'm assuming via dask config serialization).
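As a rough sketch of the inheritance mechanism I'm assuming here (function names and encoding are illustrative, not dask's actual API): the local config, dashboard link template included, gets serialized and handed to the remote scheduler process, e.g. via an environment variable on the SSH command line, so the raw template string arrives on a host where the JUPYTERHUB_* variables don't exist.

```python
import base64
import json

# Hypothetical sketch of config inheritance (illustrative names, not dask's
# actual API): serialize the local config to a string, decode it remotely.
def serialize_config(config: dict) -> str:
    return base64.urlsafe_b64encode(json.dumps(config).encode()).decode()

def deserialize_config(payload: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(payload.encode()).decode())

local_config = {
    "distributed": {
        "dashboard": {
            "link": "{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}"
                    "proxy/{port}/status"
        }
    }
}

payload = serialize_config(local_config)
remote_config = deserialize_config(payload)
# The template arrives verbatim on the remote side, placeholders and all:
assert remote_config == local_config
```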

The issue is that those environment variables (JUPYTERHUB_EXTERNAL_BASE_URL, JUPYTERHUB_SERVICE_PREFIX) are not available in the SSH session, since they are set in the profile, so the template.format call fails:

KeyError: 'JUPYTERHUB_EXTERNAL_BASE_URL'
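The failure can be reproduced without dask at all; this sketch emulates the remote environment (the host and port values are made up for the example):

```python
import os

# Illustrative reproduction: the template names JUPYTERHUB_* placeholders that
# exist in the local profile but not in the bare SSH session on the remote host.
template = "{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"

# Emulate the remote environment merged with the connection fields
# (toolz.merge behaves like a plain dict merge here).
fields = {**os.environ, "scheme": "tcp", "host": "10.15.40.68", "port": 36143}
fields.pop("JUPYTERHUB_EXTERNAL_BASE_URL", None)  # not set on the remote host

try:
    template.format(**fields)
except KeyError as e:
    print("KeyError:", e)  # KeyError: 'JUPYTERHUB_EXTERNAL_BASE_URL'
```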

I understand how to get the correct scheduler link manually. I'd prefer that this situation not cause the scheduler to crash; ideally it would just fall back on its old behavior if the link can't be crafted.

PS: These errors are not being propagated back to the process that started the cluster, which has made debugging this much harder.

@jacobtomlinson
Member

jacobtomlinson commented Mar 30, 2023

Thanks for taking the time to dig into this. It sounds like there are two things going on here.

First is that when DASK_DISTRIBUTED__DASHBOARD__LINK has been misconfigured, SSHCluster crashes for hard-to-understand reasons. The root of that problem is that you're not seeing helpful error messages, which makes debugging it a pain. The fix for this would be to explore why tracebacks aren't making it back from the remote process.

The other question is whether we can make the failure mode less aggressive when DASK_DISTRIBUTED__DASHBOARD__LINK is misconfigured. I'm not sure falling back to the default behaviour would be best though, as it would likely mask the problem and make it hard to debug. Perhaps a better route would be to catch the exception and log an error saying that formatting failed but continue onwards.

link = format_dashboard_link(addr, server.port)
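A rough sketch of that catch-and-log fallback (illustrative only; the real format_dashboard_link signature and call sites live in distributed):

```python
import logging
from typing import Optional

logger = logging.getLogger("distributed.illustration")

def format_dashboard_link_safe(template: str, **fields) -> Optional[str]:
    """Illustrative wrapper (not the actual distributed code): try to format
    the dashboard link; on failure, log an error and carry on without one."""
    try:
        return template.format(**fields)
    except (KeyError, IndexError, ValueError) as e:
        logger.error("Failed to format dashboard link; continuing without one: %r", e)
        return None

# A template referencing a variable absent on the remote host no longer crashes:
print(format_dashboard_link_safe("{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status", port=8787))  # None
print(format_dashboard_link_safe("http://{host}:{port}/status", host="10.15.40.68", port=8787))
```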

@jabbera
Author

jabbera commented Mar 30, 2023

Perhaps a better route would be to catch the exception and log an error saying that formatting failed but continue onwards.

Indeed this would be the best solution.

@mplough-kobold
Contributor

First is that when DASK_DISTRIBUTED__DASHBOARD__LINK has been misconfigured, SSHCluster crashes for hard-to-understand reasons. The root of that problem is that you're not seeing helpful error messages, which makes debugging it a pain. The fix for this would be to explore why tracebacks aren't making it back from the remote process.

When I initially read this, I didn't totally understand what you meant by "misconfigured" here. As I understand it, the problem is that the link includes an environment variable that exists only on the host and not on the cluster.

Thus, these would be incorrect...

export DASK_DISTRIBUTED__DASHBOARD__LINK="{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"
# or
export DASK_DISTRIBUTED__DASHBOARD__LINK="{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"

...and this would be correct:
(quoted from dask/dask-labextension#109 (comment)):

export DASK_DISTRIBUTED__DASHBOARD__LINK="proxy/{port}/status"

However, the correct link doesn't work.

Suppose I have a JupyterHub deployment and I access my notebook server at:

https://jupyterhub.example.com/user/matt.plough/my-named-server/lab

Setting DASK_DISTRIBUTED__DASHBOARD__LINK="proxy/{port}/status" results in the browser creating the following link in the output of a cell that displays the client:

https://jupyterhub.example.com/user/matt.plough/my-named-server/files/proxy/8787/status?_xsrf=[some token]

This is incorrect due to the inclusion of /files, something that does not occur when JUPYTERHUB_SERVICE_PREFIX is part of the DASK_DISTRIBUTED__DASHBOARD__LINK variable.

The recommendation in the Dask documentation of /user/<user>/proxy/8787/status cannot accommodate named servers, and is not flexible enough to deal with standard servers and named servers on the same box. Use of JUPYTERHUB_SERVICE_PREFIX eliminates all of these problems.

How should users configure the DASK_DISTRIBUTED__DASHBOARD__LINK variable when using a JupyterHub proxy?

@jacobtomlinson
Member

How should users configure the DASK_DISTRIBUTED__DASHBOARD__LINK variable when using a JupyterHub proxy?

I think this question is separate from the bug highlighted here. Could you open a new issue for this?

@mplough-kobold
Contributor

How should users configure the DASK_DISTRIBUTED__DASHBOARD__LINK variable when using a JupyterHub proxy?

I think this question is separate from the bug highlighted here. Could you open a new issue for this?

Good idea, and done - see #7736.
