Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/ingester: prevent shutdowns from processing during joining handoff #1114

Merged
merged 1 commit into from
Oct 9, 2019

Conversation

rfratto
Copy link
Member

@rfratto rfratto commented Oct 3, 2019

This commit fixes a race condition where an ingester that is shut down during the joining handoff (i.e., receiving chunks from a leaving ingester before claiming its tokens) hangs and can never shut down
cleanly. A shutdown mutex is implemented which is obtained at the start of the handoff process and released after the handoff process completes. This race condition also prevented the leaving ingester from completing its shutdown, since it waits for the joining ingester to claim the tokens.

This means that if a brand new ingester is shut down, it will always have finished receiving chunks from the previous leaving ingester and have finished obtaining the tokens from the ingester attempting to leave.

Checklist

  • Tests updated

This commit fixes a race condition where an ingester that is shut down
during the joining handoff (i.e., receiving chunks from a leaving
ingester before claiming its tokens) hangs and can never shut down
cleanly. A shutdown mutex is implemented which is obtained at the
start of the handoff process and released after the handoff process
completes. This race condition also prevented the leaving ingester from
completing its shutdown, since it waits for the joining ingester to
claim the tokens.

This means that if a brand new ingester is shut down, it will always
have finished receiving chunks from the previous leaving ingester and
have finished obtaining the tokens from the ingester attempting to
leave.
@pstibrany
Copy link
Member

Wouldn't it be better in this case to cancel transfer instead, when shutdown is triggered? Then the ingester has less to deal with.

(TransferChunks method could be passed a context so that it knows when it gets canceled).

@rfratto
Copy link
Member Author

rfratto commented Oct 4, 2019

I'm not sure about that. If we cancel the transfer, then we're forced to do a flush. The transfers are fairly quick and this shouldn't hold back the shutdown process too much.

I'm also not sure how easy it would be for TransferChunks to be cancelled since it's not invoked from the lifecycler. It's invoked as a remote method from the original leaving ingester.

@rfratto rfratto requested a review from slim-bean October 7, 2019 16:30
Copy link
Collaborator

@slim-bean slim-bean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@slim-bean slim-bean merged commit 2cb3c82 into grafana:master Oct 9, 2019
@rfratto rfratto deleted the fix-shutdown-during-transfer branch November 19, 2019 14:32
@chaudum chaudum added the type/bug Somehing is not working as expected label Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/loki type/bug Somehing is not working as expected
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants