
Scheduler crashes in SSHCluster in 2023.3.2 but not in 2023.3.1 #7724

Closed
jabbera opened this issue Mar 29, 2023 · 9 comments · Fixed by #7729

Comments

@jabbera

jabbera commented Mar 29, 2023

Describe the issue: SSHCluster does not work in 2023.3.2 because the scheduler exits early with an exit code of 1:

INFO:distributed.deploy.ssh:2023-03-29 18:21:07,199 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2023-03-29 18:21:07,204 - distributed.deploy.ssh - INFO - 2023-03-29 18:21:07,204 - distributed.scheduler - INFO - State start
INFO:distributed.deploy.ssh:2023-03-29 18:21:07,204 - distributed.scheduler - INFO - State start
2023-03-29 18:21:07,207 - distributed.deploy.ssh - INFO - 2023-03-29 18:21:07,206 - distributed.scheduler - DEBUG - Clear task state
INFO:distributed.deploy.ssh:2023-03-29 18:21:07,206 - distributed.scheduler - DEBUG - Clear task state
2023-03-29 18:21:07,207 - distributed.deploy.ssh - INFO - 2023-03-29 18:21:07,207 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.15.40.68:36143
INFO:distributed.deploy.ssh:2023-03-29 18:21:07,207 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.15.40.68:36143
INFO:asyncssh:[conn=0, chan=1] Received exit status 1
INFO:asyncssh:[conn=0, chan=1] Received channel close
INFO:asyncssh:[conn=0, chan=1] Channel closed
INFO:asyncssh:[conn=0, chan=1] Sending KILL signal

When rolling back to 2023.3.1 the scheduler starts successfully:

2023-03-29 18:23:33,874 - distributed.deploy.ssh - INFO - 2023-03-29 18:23:33,873 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO:distributed.deploy.ssh:2023-03-29 18:23:33,873 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2023-03-29 18:23:33,878 - distributed.deploy.ssh - INFO - 2023-03-29 18:23:33,878 - distributed.scheduler - INFO - State start
INFO:distributed.deploy.ssh:2023-03-29 18:23:33,878 - distributed.scheduler - INFO - State start
2023-03-29 18:23:33,882 - distributed.deploy.ssh - INFO - 2023-03-29 18:23:33,881 - distributed.scheduler - DEBUG - Clear task state
INFO:distributed.deploy.ssh:2023-03-29 18:23:33,881 - distributed.scheduler - DEBUG - Clear task state
2023-03-29 18:23:33,883 - distributed.deploy.ssh - INFO - 2023-03-29 18:23:33,882 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.15.40.68:40305
INFO:distributed.deploy.ssh:2023-03-29 18:23:33,882 - distributed.scheduler - INFO -   Scheduler at:   tcp://10.15.40.68:40305
INFO:asyncssh:Opening SSH connection to localhost, port 22
INFO:asyncssh:[conn=1] Connected to SSH server at localhost, port 22

Minimal Complete Verifiable Example:

import logging
logging.basicConfig(level=logging.DEBUG)

from distributed import SSHCluster
cluster = SSHCluster(["localhost", "localhost"])

Anything else we need to know?: Full repro here:

git clone https://github.com/jabbera/distributed-bug.git
cd distributed-bug
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements-bug.txt
python demo.py

Environment:

  • Dask version: 2023.3.2
  • Python version: 3.10.5
  • Operating System: Ubuntu 20.04.5 LTS
  • Install method (conda, pip, source): pip
@jabbera jabbera changed the title Scheduler crashes un SSHCluster in 2023.3.2 but not in 2023.3.1 Scheduler crashes in SSHCluster in 2023.3.2 but not in 2023.3.1 Mar 29, 2023
@jabbera
Author

jabbera commented Mar 29, 2023

Rolling this back sorts the issue: #7631

@jrbourbeau
Member

cc @milesgranger @jacobtomlinson for visibility

@jabbera
Author

jabbera commented Mar 29, 2023

I did a little more digging, and the crash happens somewhere in here, specifically in the template.format call:

return template.format(
**toolz.merge(os.environ, dict(scheme=scheme, host=host, port=port))
)

Replacing it with a constant string avoids the crash.
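For context, here is a minimal illustration (not dask code) of why the formatting call itself can crash: str.format raises KeyError as soon as the template names a placeholder that is missing from the supplied mapping, which matches the early exit seen in the logs above.

```python
# Minimal illustration (not dask code): str.format raises KeyError when the
# template names a placeholder missing from the supplied mapping.
template = "{scheme}://{host}:{port}/{MISSING_VAR}/status"
try:
    template.format(scheme="http", host="localhost", port=8787)
except KeyError as e:
    print("KeyError:", e)  # KeyError: 'MISSING_VAR'
```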

@jabbera
Author

jabbera commented Mar 29, 2023

I've figured out what is going on here but I don't know how to fix it in dask. I have the following environment variables set:

DASK_DISTRIBUTED__DASHBOARD__LINK='{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status'

This is somehow making its way over to the SSHCluster (I'm assuming via dask config serialization).
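As a rough sketch of the inheritance mechanism I'm assuming here (function names and encoding are illustrative, not dask's actual API): the local config, dashboard link template included, gets serialized and handed to the remote scheduler process, e.g. via an environment variable on the SSH command line, so the raw template string arrives on a host where the JUPYTERHUB_* variables don't exist.

```python
import base64
import json

# Hypothetical sketch of config inheritance (illustrative names, not dask's
# actual API): serialize the local config to a string, decode it remotely.
def serialize_config(config: dict) -> str:
    return base64.urlsafe_b64encode(json.dumps(config).encode()).decode()

def deserialize_config(payload: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(payload.encode()).decode())

local_config = {
    "distributed": {
        "dashboard": {
            "link": "{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}"
                    "proxy/{port}/status"
        }
    }
}

payload = serialize_config(local_config)
remote_config = deserialize_config(payload)
# The template arrives verbatim on the remote side, placeholders and all:
assert remote_config == local_config
```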

The issue is that those environment variables (JUPYTERHUB_EXTERNAL_BASE_URL, JUPYTERHUB_SERVICE_PREFIX) are not available in the SSH session, since they are set in the profile, so the template.format call fails:

KeyError: 'JUPYTERHUB_EXTERNAL_BASE_URL'
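The failure can be reproduced without dask at all; this sketch emulates the remote environment (the host and port values are made up for the example):

```python
import os

# Illustrative reproduction: the template names JUPYTERHUB_* placeholders that
# exist in the local profile but not in the bare SSH session on the remote host.
template = "{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"

# Emulate the remote environment merged with the connection fields
# (toolz.merge behaves like a plain dict merge here).
fields = {**os.environ, "scheme": "tcp", "host": "10.15.40.68", "port": 36143}
fields.pop("JUPYTERHUB_EXTERNAL_BASE_URL", None)  # not set on the remote host

try:
    template.format(**fields)
except KeyError as e:
    print("KeyError:", e)  # KeyError: 'JUPYTERHUB_EXTERNAL_BASE_URL'
```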

I understand how to get the correct scheduler link manually. I'd prefer that this situation not cause the scheduler to crash; ideally it would just fall back on its old behavior if the link can't be crafted.

PS: These errors are not being propagated back to the process that started the cluster, which has made debugging this much harder.

@jacobtomlinson
Member

jacobtomlinson commented Mar 30, 2023

Thanks for taking the time to dig into this. It sounds like there are two things going on here.

First is that when DASK_DISTRIBUTED__DASHBOARD__LINK has been misconfigured, SSHCluster crashes for hard-to-understand reasons. The root of that problem is that you're not seeing helpful error messages, which makes debugging it a pain. The fix for this would be to explore why tracebacks aren't making it back from the remote process.

The other question is whether we can make the failure mode less aggressive when DASK_DISTRIBUTED__DASHBOARD__LINK is misconfigured. I'm not sure falling back to the default behaviour would be best though, as it would likely mask the problem and make it hard to debug. Perhaps a better route would be to catch the exception and log an error saying that formatting failed but continue onwards.

link = format_dashboard_link(addr, server.port)
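A rough sketch of that catch-and-log fallback (illustrative only; the real format_dashboard_link signature and call sites live in distributed):

```python
import logging
from typing import Optional

logger = logging.getLogger("distributed.illustration")

def format_dashboard_link_safe(template: str, **fields) -> Optional[str]:
    """Illustrative wrapper (not the actual distributed code): try to format
    the dashboard link; on failure, log an error and carry on without one."""
    try:
        return template.format(**fields)
    except (KeyError, IndexError, ValueError) as e:
        logger.error("Failed to format dashboard link; continuing without one: %r", e)
        return None

# A template referencing a variable absent on the remote host no longer crashes:
print(format_dashboard_link_safe("{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status", port=8787))  # None
print(format_dashboard_link_safe("http://{host}:{port}/status", host="10.15.40.68", port=8787))
```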

@jabbera
Author

jabbera commented Mar 30, 2023

Perhaps a better route would be to catch the exception and log an error saying that formatting failed but continue onwards.

Indeed this would be the best solution.

@mplough-kobold
Contributor

First is that when DASK_DISTRIBUTED__DASHBOARD__LINK has been misconfigured, SSHCluster crashes for hard-to-understand reasons. The root of that problem is that you're not seeing helpful error messages, which makes debugging it a pain. The fix for this would be to explore why tracebacks aren't making it back from the remote process.

When I initially read this, I didn't totally understand what you meant by "misconfigured" here. As I understand it, the problem is that the link includes an environment variable that exists only on the host and not on the cluster.

Thus, these would be incorrect...

export DASK_DISTRIBUTED__DASHBOARD__LINK="{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"
# or
export DASK_DISTRIBUTED__DASHBOARD__LINK="{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"

...and this would be correct:
(quoted from dask/dask-labextension#109 (comment)):

export DASK_DISTRIBUTED__DASHBOARD__LINK="proxy/{port}/status"

However, the correct link doesn't work.

Suppose I have a JupyterHub deployment and I access my notebook server at:

https://jupyterhub.example.com/user/matt.plough/my-named-server/lab

Setting DASK_DISTRIBUTED__DASHBOARD__LINK="proxy/{port}/status" results in the browser creating the following link in the output of a cell that displays the client:

https://jupyterhub.example.com/user/matt.plough/my-named-server/files/proxy/8787/status?_xsrf=[some token]

This is incorrect due to the inclusion of /files, something that does not occur when JUPYTERHUB_SERVICE_PREFIX is part of the DASK_DISTRIBUTED__DASHBOARD__LINK variable.

The recommendation in the Dask documentation of /user/<user>/proxy/8787/status cannot accommodate named servers, and is not flexible enough to deal with standard servers and named servers on the same box. Use of JUPYTERHUB_SERVICE_PREFIX eliminates all of these problems.

How should users configure the DASK_DISTRIBUTED__DASHBOARD__LINK variable when using a JupyterHub proxy?

@jacobtomlinson
Member

How should users configure the DASK_DISTRIBUTED__DASHBOARD__LINK variable when using a JupyterHub proxy?

I think this question is separate from the bug highlighted here. Could you open a new issue for this?

@mplough-kobold
Contributor

How should users configure the DASK_DISTRIBUTED__DASHBOARD__LINK variable when using a JupyterHub proxy?

I think this question is separate from the bug highlighted here. Could you open a new issue for this?

Good idea, and done - see #7736.
