
Allow tunnel server upgrade without disconnecting user environments #233

Open
royra opened this issue Sep 20, 2023 · 5 comments
Labels
bug Something isn't working enhancement New feature or request need spec points: 5 Very high complexity

Comments

royra commented Sep 20, 2023

Currently, when deploying the tunnel server, user environments will be briefly disconnected while the CTA (agent) reconnects to the new instance. This can cause incoming requests to the environments to fail with 502 "environment not found" errors.

Suggested solution - a cooperative rollout flow, compatible with Kubernetes rolling update (although it's quite generic and can be used with other orchestration infra).

  • The tunnel server will handle SIGTERM to start a graceful shutdown flow: it will notify its connected clients to reconnect (see below), then wait until all of its client connections have ended or a configurable timeout has passed, and then exit.
  • When the CTA (agent) is notified of the pending tunnel server shutdown, it will:
    • Create a new SSH client connection to its configured tunnel server URL. The infra (e.g., K8s) will route the new connection to the new tunnel server instance.
    • Once the new SSH client connection is established, re-establish all existing forwards on it, so that new requests come in through the new SSH connection.
    • Allow existing TCP forward connections on the old SSH connection to complete. This assumes they are short-lived HTTP requests. Long-lived connections (e.g., websockets) will eventually be terminated from the remote side (when the tunnel server timeout expires), but are assumed to be designed to recover from disconnections.
    • Once all the old TCP connections are closed, close the old SSH connection.
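The server-side drain flow above could be sketched roughly like this. This is a simulation with asyncio, not the actual tunnel server code; the `GracefulServer` class and its connection bookkeeping are assumptions made for illustration:

```python
import asyncio
import signal  # used by the real SIGTERM hook, shown commented out below

class GracefulServer:
    """Sketch of the SIGTERM-driven drain flow described above."""

    def __init__(self, drain_timeout: float = 30.0):
        self.drain_timeout = drain_timeout
        self.connections: set[asyncio.Event] = set()  # one "closed" event per client
        self.shutting_down = asyncio.Event()

    def track(self) -> asyncio.Event:
        """Register a client connection; the event is set when it closes."""
        closed = asyncio.Event()
        self.connections.add(closed)
        return closed

    async def shutdown(self):
        # 1. Notify connected clients to reconnect (transport-specific; see below).
        self.shutting_down.set()
        # 2. Wait for all client connections to end, or the timeout to pass.
        pending = [c.wait() for c in self.connections if not c.is_set()]
        if pending:
            try:
                await asyncio.wait_for(asyncio.gather(*pending), self.drain_timeout)
            except asyncio.TimeoutError:
                pass  # timeout passed: exit anyway, dropping stragglers
        # 3. Exit (here: just return).

async def main():
    server = GracefulServer(drain_timeout=5.0)
    # In a real process we'd hook SIGTERM:
    # asyncio.get_running_loop().add_signal_handler(
    #     signal.SIGTERM, lambda: asyncio.ensure_future(server.shutdown()))
    conn = server.track()
    async def client():
        await server.shutting_down.wait()  # "reconnect" notification
        await asyncio.sleep(0.1)           # finish in-flight requests
        conn.set()                         # close old connection
    await asyncio.gather(client(), server.shutdown())
    return conn.is_set()

print(asyncio.run(main()))  # → True
```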

Currently there is no simple way for the SSH server to notify its clients of an event. An application-level "server events" channel can be created by having the CTA initiate a specific "control" command session (exec) on its client connection and wait for it to end, treating that as the signal. Alternatively, instead of using the SSH connection, the CTA can accept an HTTP request on its own API endpoint; however, this requires the tunnel server to identify the specific tunnel of each connected CTA.
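The control-channel idea can be simulated like this: the client starts a long-lived "control" task and treats its completion as the reconnect signal. A minimal sketch only; the real implementation would run an SSH exec session rather than an asyncio future:

```python
import asyncio

async def cta(control_done: asyncio.Future) -> str:
    """CTA side: wait for the control session to end, then reconnect."""
    await control_done  # the server ends the session to signal its shutdown
    # ...open a new SSH connection and re-establish forwards (omitted)...
    return "reconnected"

async def main():
    loop = asyncio.get_running_loop()
    control_done = loop.create_future()
    # Server side: ending the control session is the shutdown notification.
    loop.call_later(0.05, control_done.set_result, None)
    return await cta(control_done)

print(asyncio.run(main()))  # → reconnected
```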

@royra royra added bug Something isn't working enhancement New feature or request need spec points: 5 Very high complexity labels Sep 20, 2023
Yshayy commented Sep 20, 2023

I think a problem we have is that once there are two tunnel servers (old and new):

  • The new one is getting all incoming requests (due to k8s service behavior), but has not yet established connections with the CTAs.
    In this case, either the routing layer should be aware of which CTAs have connected, or the new tunnel server should pass the traffic to the old instance.
    This solution can support multiple instances but requires some sort of mechanism for the tunnel server to be aware of the other tunnel servers.

If we want to support only the upgrade and HA case, an alternative solution can be a sort of modified blue-green deployment of two instances with a swap mechanism:

  • The CTA connects to two instances via two different external SSH URLs (discovery can be done using DNS SRV records)
  • Both tunnel servers produce the same external URLs for traffic
  • A Kubernetes service forwards traffic only to the active deployment
  • During an upgrade, we upgrade the inactive deployment
  • After the upgrade, we switch the active deployment (using service+labels)
  • Kubernetes then forwards external traffic to the new active deployment

During this time, the old deployment is still alive and forwards traffic to the CTA.
The trick here is that we're not shutting down the active deployment, only switching traffic.
But this solution is more specific to the case of HA during an upgrade.

royra commented Sep 20, 2023

You're right, it won't work as I suggested.

I like the blue/green idea, but I think there's a way to do it without two URLs at the CTA.

By extracting the stunnel/sslh to a separate deployment, we can define different k8s services for the SSH and HTTP endpoints. Normally they will point to the same deployment. When upgrading:

  • create the new deployment and wait for it to become healthy
  • point the SSH service at it
  • send a SIGTERM to the old deployment's tunnel server (not sure how to do that nicely, but it can be scripted)
  • wait for the CTAs to switch
  • update the HTTP service to point to the new deployment
  • delete the old deployment
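Assuming label-selector-based routing, the staged switch above can be modeled as two independently repointable services. A toy model with plain dicts, not actual Kubernetes API calls; the "blue"/"green" names are illustrative:

```python
# Toy model of the staged switch: two services (SSH and HTTP) that can be
# repointed to deployments independently, as in the upgrade steps above.
services = {"ssh": "blue", "http": "blue"}  # both start on the old deployment

def routed_to(service: str) -> str:
    """Which deployment currently receives this service's traffic."""
    return services[service]

# 1. Create the new ("green") deployment and wait for it to be healthy (omitted).
# 2. Point the SSH service at it: new CTA connections land on green.
services["ssh"] = "green"
assert routed_to("ssh") == "green" and routed_to("http") == "blue"
# 3.-4. SIGTERM the old tunnel server; CTAs reconnect and land on green,
#        while HTTP traffic still reaches the old instance during the drain.
# 5. Point the HTTP service at the new deployment.
services["http"] = "green"
# 6. Delete the old deployment (omitted).
print(services)  # → {'ssh': 'green', 'http': 'green'}
```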

royra commented Sep 20, 2023

This sounds a bit painful, though; if it's covered by the distributed tunnel server solution, maybe it's best to wait for that.

Yshayy commented Sep 20, 2023

We can use DNS SRV records so that the CTA only needs to know one URL; this can be optional, for simplicity.
In practice, the CTA will query DNS for SRV records and, if such records are found, get the tunnel servers to connect to.
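SRV-based discovery would have the CTA resolve a name like `_ssh._tcp.tunnel.example.com` (a hypothetical name) and pick a target by priority and weight. A sketch of just the selection logic, roughly following RFC 2782; the record values are made up, and the DNS query itself would need a resolver library:

```python
import random
from collections import namedtuple

# A DNS SRV record: lower priority wins; weight breaks ties probabilistically.
SRV = namedtuple("SRV", "priority weight target port")

def pick_target(records: list[SRV], rng: random.Random) -> SRV:
    """Pick one SRV record: lowest priority group, then weighted random."""
    best = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == best]
    weights = [r.weight or 1 for r in candidates]  # simplistic: avoid all-zero
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical records for two tunnel server instances plus a fallback:
records = [
    SRV(10, 60, "tunnel-blue.example.com", 22),
    SRV(10, 40, "tunnel-green.example.com", 22),
    SRV(20, 0, "tunnel-backup.example.com", 22),  # only if priority 10 is gone
]
chosen = pick_target(records, random.Random(0))
print(chosen.target)
```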

Deployment itself shouldn't be too difficult with K8s:
it's two deployments with different labels and one service with a selector on blue/green.

Yshayy commented Sep 20, 2023

The different-services approach (stunnel/sslh) with a single tunnel server deployment is a bit tricky, because there are multiple CTAs here, so there isn't a single switching point.
