Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osbuild-worker: fix "crashing" on worker registration issues #4357

Merged
merged 1 commit into from
Sep 10, 2024

Conversation

mvo5
Copy link
Contributor

@mvo5 mvo5 commented Sep 10, 2024

When the osbuild worker cannot register itself with the server on startup the worker will "crash". This is inconsistent with the existing behavior in workerHeartbeat() which deals with connectivity or other server issue gracefully and retries periodically.

To unify the behavior this commit changes the behavior and only issues a logrus.Warnf instead of the previous Falalf when the registration fails.


This is an alternative version of #4320 that contains the minimal change to achieve the goal with a test. I feel quite rude to open this PR and I'm sorry for this. I'm mostly doing this as I feel after 51 discussion comments in the other PR it might be easier/more illustrative to offer an alternative version instead of keeping the discussion in the other PR going.

The key points (to me):

  • the change is minimal to achieve the goal
  • the test is targeted exactly at the change
  • the commit message explains in some (but not too much) details the "what/why/how" for future reference (note that it's the git commit message itself, not the github PR description which is not connected to the git history so won't help when doing e.g. git blame)

I hope this helps, I sincerely want to help here, this is not asserting domination or playing one-upmanship or anything like this. I'm sure there are things to improve with this PR too.

P.S. This does not contain the unix socket change, mostly because I did not want to spend more time thinking how to test it (it's probably easy but I wanted to timebox it). Also with unix sockets connectivity issues are more rare so it seemed less important (but I could be wrong here).

[edit: typo fix]

@schuellerf
Copy link
Contributor

I do acknowledge (and did before) that your PRs are smaller and cleaner and I'm also mostly interested in getting this released to unblock the ansible PR.

The PR #4320 for sure is more extended as it also tries to make the startup cleaner (avoid 401 in the good case)
also there is an extended test for the call sequence which improves the test quality, I guess.

Would you mind taking this test over (just changing the sequence to be "the current" one) so this doesn't get lost and should help in identifying other startup changes.
… the one with the expectedCalls, but that's obvious, right?

@schuellerf
Copy link
Contributor

although, sure, this can be done in a separate PR

schuellerf
schuellerf previously approved these changes Sep 10, 2024
@mvo5
Copy link
Contributor Author

mvo5 commented Sep 10, 2024

I do acknowledge (and did before) that your PRs are smaller and cleaner and I'm also mostly interested in getting this released to unblock the ansible PR.

The PR #4320 for sure is more extended as it also tries to make the startup cleaner (avoid 401 in the good case) also there is an extended test for the call sequence which improves the test quality, I guess.

Would you mind taking this test over (just changing the sequence to be "the current" one) so this doesn't get lost and should help in identifying other startup changes. … the one with the expectedCalls, but that's obvious, right?

Thanks, part of why I opened this PR is precisely to avoid mixing this change with other changes. It's totally fine to have a PR about test improvements (which I really like) or changing the way the startup works, my point (and why I opened this PR) is that that those should (IMHO) be their own PRs. The rational is to avoid exactly what happend in #4320 - a lot of discussion/questions about topics that where not part of the (direct) goal that distracted from getting the change required to unblock ansible(?) landed.

In my experience when a PR becomes controversial the best course of action is to make it smaller and try to extract the essence/most important changes on their own. Then one-by-one the other changes can be proposed/discussed/merged.

So my suggestion would be:

  • let's try to get this (or a variant of it) landed
  • open the test sequence changes as their own PR as they are usful on their own
  • discuss/open the lazy vs non-lazy token refresh but keep it separate as it was causing the most controversy

Don't get me wrong, this is not a dig on your work, I'm really just trying to help establishing a flow that makes it quicker/easier to land it :)

When the osbuild worker cannot register itself with the server
on startup the worker will "crash". This is inconsistent with the
existing behavior in `workerHeartbeat()` which deals with connectivity
or other server issue gracefully and retries periodically.

To unify the behavior this commit changes the behavior and only
issues a `logrus.Warnf` instead of the previous `Falalf` when
the registration fails.

Co-authored-by: Florian Schüller <florian.schueller@redhat.com>
Copy link
Contributor

@schuellerf schuellerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • let's try to get this (or a variant of it) landed
    🥳
  • open the test sequence changes as their own PR as they are usful on their own
  • discuss/open the lazy vs non-lazy token refresh but keep it separate as it was causing the most controversy

Fixing the just the "crash" helps for sure, and I'll continue to roll out the task I referenced from my pr. I really doubt that the other two will happen soonish with this approach, let's see…

@mvo5
Copy link
Contributor Author

mvo5 commented Sep 10, 2024

  • let's try to get this (or a variant of it) landed
    🥳
  • open the test sequence changes as their own PR as they are usful on their own
  • discuss/open the lazy vs non-lazy token refresh but keep it separate as it was causing the most controversy

Fixing the just the "crash" helps for sure, and I'll continue to roll out the task I referenced from my pr. I really doubt that the other two will happen soonish with this approach, let's see…

Great to hear that this will unblock the ansible worker PR.

As for the other two:

  • I think the test change is uncontroversial if it just replaces the number with the actual expected calls without a behavior change (in it's own PR ideally)
  • the change for the lazy/non-lazy refresh is controversial but it's orthogonal to the "Fatalf" -> "Warnf" change or the test change so that seems fine (or am I misunderstanding something, is this also blocking you?)

Copy link
Member

@croissanne croissanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mvo5 mvo5 merged commit 3df26ed into osbuild:main Sep 10, 2024
48 of 50 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants