Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sambacc: add a retry loop to ctdb.monitor_cluster_meta_changes #130

Merged

Conversation

phlogistonjohn
Copy link
Collaborator

Add a loop that tries the ctdb reloadnodes command after an increasing delay. This is an attempt to fix a condition where ctdbd is apparently not ready to handle the ctdb reloadnodes command. In this case the command would be run, but fail and an exception would be raised in the monitor_cluster_meta_changes function would raise an exception. This would be caught by the command-level retry loop. However, this command-level retry loop will simply re-run monitor_cluster_meta_changes and this function now no longer has the same initial clustermeta state and has effectively "forgotten" that it needs to run reloadnodes. This new retry loop adds a level of error handling inside the monitor_cluster_meta_changes function so that we will retry with a bounded number of attempts.

Add a loop that tries the `ctdb reloadnodes` command after an increasing
delay. This is an attempt to fix a condition where ctdbd is apparently
not ready to handle the `ctdb reloadnodes` command. In this case the
command would be run, but fail and an exception would be raised in the
monitor_cluster_meta_changes function would raise an exception. This
would be caught by the command-level retry loop. However, this
command-level retry loop will simply re-run monitor_cluster_meta_changes
and this function now no longer has the same initial clustermeta state
and has effectively "forgotten" that it needs to run reloadnodes.  This
new retry loop adds a level of error handling inside the
monitor_cluster_meta_changes function so that we will retry with a
bounded number of attempts.

Signed-off-by: John Mulligan <jmulligan@redhat.com>
@phlogistonjohn phlogistonjohn marked this pull request as ready for review August 16, 2024 16:29
Copy link
Collaborator

@synarete synarete left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. See minor comment.

tries: int = 5,
) -> None:
for idx in range(tries):
time.sleep(1 << idx)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: have upper bound to sleep time. Something like time.sleep(min(1 << idx, 60)). Just in case someone naively increases tries to large value.

Copy link
Collaborator

@anoopcs9 anoopcs9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

@mergify mergify bot merged commit 1b72854 into samba-in-kubernetes:master Aug 19, 2024
9 checks passed
@phlogistonjohn phlogistonjohn deleted the jjm-ctdb-reload-err-handle branch August 20, 2024 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants