
[Bug]: Error: Not connected - is it possible to make connection timeout configurable? #519

Open
1 task done
akramarev opened this issue Jun 24, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@akramarev

What happened?

Please compare these two outputs from livecycle/preevy-up-action@v2.4.0:

Build step done in 716.21s
    Error: Not connected
Error: Process completed with exit code 1.

and

Build step done in 31.32s
- Copying files: Calculating...
✔ Copied 4 files to remote machine
Running: docker compose up -d --remove-orphans --no-build
- Connecting to remote docker socket...
✔ Connected to remote docker socket

The first one is a cold run of my workflow where the build stage took >10m and succeeded, but copying the artifacts failed right after it. The second output is from the same workflow retried: it reused cached images, so the build phase finished in ~30s and preevy-up-action had no problems copying the artifacts to the remote machine.

I suspect there is a timeout (an SSH connection timeout) of roughly 10 minutes somewhere. Is it possible to make it configurable for docker-compose stacks that require a longer build phase?
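
For reference, here is a minimal sketch of what a configurable keepalive/timeout could look like, assuming the SSH client is built on the Node ssh2 library (that's my assumption; the option names below are ssh2's, not existing preevy or action inputs):

import { Client } from 'ssh2';

// Sketch only, not preevy code: keep an otherwise-idle SSH connection alive
// during a long external build by sending periodic keepalive packets.
const conn = new Client();

conn
  .on('ready', () => {
    console.log('SSH connection ready');
  })
  .on('error', (err) => {
    console.error('SSH connection error:', err);
  })
  .connect({
    host: 'my-lightsail-instance.example.com', // hypothetical host
    username: 'ubuntu',
    privateKey: process.env.SSH_PRIVATE_KEY,
    readyTimeout: 60_000,      // ms to wait for the handshake to complete
    keepaliveInterval: 15_000, // send a keepalive packet every 15s
    keepaliveCountMax: 4,      // give up after 4 unanswered keepalives
  });

Exposing values like these (or their equivalents in whatever client preevy actually uses) as action inputs would cover the "make it configurable" part of this request.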

Add screenshots

Please see the previous section for the error details.

Steps to reproduce the behavior

My setup:

  • public tunnel server
  • deploy runtime: AWS Lightsail
  • a docker-compose file with a service whose build step takes a long time (~10m)
  • a GitHub Actions workflow with livecycle/preevy-up-action@v2.4.0 that uses the GHA builder and GHCR

GHA:

...
      - name: Set up Docker Buildx
        id: buildx_setup
        uses: docker/setup-buildx-action@v3
        with:
          buildkitd-config: .github/buildkitd.toml

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ github.token }}

      - uses: livecycle/preevy-up-action@v2.4.0
        id: preevy
        with:
          install: gh-release
          profile-url: "${{ vars.PREEVY_PROFILE_URL }}"
          args: --registry ghcr.io/my-org --builder ${{ steps.buildx_setup.outputs.name }}
          docker-compose-yaml-paths: "./docker-compose.yml"
        env:
          GITHUB_TOKEN: ${{ github.token }}

Expected behavior

Avoid the Error: Not connected failure when the build step takes a long time, i.e. either make the timeout configurable or retry the connection a few times (see the sketch below).
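
And for the "retry the connection" alternative, a rough sketch (again assuming an ssh2-style client; connectWithRetry is a hypothetical helper, not an existing preevy API):

import { Client, ConnectConfig } from 'ssh2';

// Hypothetical helper, not part of preevy: retry the SSH connection a few
// times with a fixed delay before surfacing "Not connected".
async function connectWithRetry(
  config: ConnectConfig,
  attempts = 3,
  delayMs = 5_000,
): Promise<Client> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await new Promise<Client>((resolve, reject) => {
        const conn = new Client();
        conn.once('ready', () => resolve(conn));
        conn.once('error', reject);
        conn.connect(config);
      });
    } catch (err) {
      if (attempt >= attempts) throw err;
      console.warn(`SSH connect attempt ${attempt} failed, retrying in ${delayMs}ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}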

What OS are you seeing the problem on?

Linux

Additional context

No response

Record

  • I agree to follow this project's Code of Conduct
@akramarev added the bug (Something isn't working) label on Jun 24, 2024
@akramarev
Author

Hmm, I noticed the same issue with builds that took "only" ~7 minutes:

Screenshot 2024-06-24 at 6 11 48 PM

@royra
Collaborator

royra commented Jun 29, 2024

Hey @akramarev, can you post some of the logs before the error? Preferably add --debug. I've had builds longer than 10m and they did not time out. Usually the SSH connection is active during the build (messages are being sent), so there is no reason for it to time out.

What I have seen in the past is builds that overload the machine's resources to the point where the SSH server hangs. You can try running top while the machine is building (connect using preevy ssh). You can also try a larger instance type, or offload the build to the GH Actions runner. LMK.

@akramarev
Author

akramarev commented Jun 29, 2024

Thanks for the reply @royra. I observe this problem only when I offload the build to the GHA runner (please check the "My setup" section above for details); if I use the default builder (the build happens on the remote Lightsail instance), I don't have this issue.

So my suspicion is that while the GHA runner is building the image, preevy doesn't actively use the SSH connection it opened earlier, and it has timed out by the time GHA finishes the build and is ready to upload the artifacts. During the build, I can SSH into the Lightsail machine and see that it's almost idle.

Attaching logs (build details in the middle manually redacted):
preevy-logs.txt

@royra
Collaborator

royra commented Jun 29, 2024

Sorry, I missed the fact that you're already offloading the build.

Can you look at the SSH server logs on the Lightsail instance? If there's nothing interesting, try changing the LogLevel setting in /etc/ssh/sshd_config.
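
For example (not something preevy sets up, just standard sshd configuration), raising the log level and restarting sshd makes the server log connection setup/teardown in more detail, which should show whether it drops the idle connection or gets restarted:

# /etc/ssh/sshd_config -- the default LogLevel is INFO;
# VERBOSE (or DEBUG1) logs per-connection details
LogLevel VERBOSE

followed by sudo systemctl restart ssh (on Ubuntu the unit is named ssh).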

@akramarev
Author

Thank you for your reply @royra, and thanks to the Livecycle team for keeping this GitHub issue open.

I noticed in /var/log/auth.log that the SSH server restarted right at the moment the preevy action reported that it had successfully configured the new Lightsail machine:
Screenshot 2024-09-06 at 6 01 29 PM

At the same time, /var/log/syslog indicates that cloud-init is the process that restarted sshd:

Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Starting Execute cloud user/final scripts...
Sep  7 00:41:01 ip-172-26-8-210 dockerd[1011]: time="2024-09-07T00:41:01.470172700Z" level=info msg="API listen on /run/docker.sock"
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Starting Update UTMP about System Runlevel Changes...
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Finished Update UTMP about System Runlevel Changes.
Sep  7 00:41:01 ip-172-26-8-210 cloud-init[2353]: Lightsail: Starting Instance Initialization.
Sep  7 00:41:01 ip-172-26-8-210 cloud-init[2353]: Lightsail: SSH CA Public Key created.
Sep  7 00:41:01 ip-172-26-8-210 cloud-init[2353]: Lightsail: SSH CA Public Key registered.
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Stopping OpenBSD Secure Shell server...
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: ssh.service: Succeeded.
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Stopped OpenBSD Secure Shell server.
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Starting OpenBSD Secure Shell server...
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Started OpenBSD Secure Shell server.
Sep  7 00:41:01 ip-172-26-8-210 cloud-init[2353]: Lightsail: sshd restarted.
Sep  7 00:41:02 ip-172-26-8-210 cloud-init: #############################################################
Sep  7 00:41:02 ip-172-26-8-210 cloud-init: -----BEGIN SSH HOST KEY FINGERPRINTS-----
-- {redacted}
Sep  7 00:41:02 ip-172-26-8-210 cloud-init: -----END SSH HOST KEY FINGERPRINTS-----
Sep  7 00:41:02 ip-172-26-8-210 cloud-init: #############################################################
Sep  7 00:41:02 ip-172-26-8-210 cloud-init[2353]: Cloud-init v. 23.3.3-0ubuntu0~20.04.1 running 'modules:final' at Sat, 07 Sep 2024 00:41:01 +0000. Up 72.92 seconds.
Sep  7 00:41:02 ip-172-26-8-210 cloud-init[2353]: Cloud-init v. 23.3.3-0ubuntu0~20.04.1 finished at Sat, 07 Sep 2024 00:41:02 +0000. Datasource DataSourceEc2Local.  Up 73.33 seconds
...

As expected, any further attempt to restart the job:

  1. does not lead to an sshd restart
  2. succeeds, i.e. the context is copied to the Lightsail instance and compose runs

Is there anything you can suggest in this situation?
