
eviction: regression test + distinguish layer write from map insert #4005

Merged 17 commits into main from problame/threshold-eviction-policy-unit-tests on May 4, 2023

Conversation

@problame (Contributor) commented Apr 11, 2023

This patch adds a regression test for threshold-based layer eviction.
The test asserts the basic invariant that, if left alone, the residence statuses stabilize, with some layers resident and some layers evicted.
This covers both the last-access-time-threshold-based eviction itself and the "imitate access" hacks that we put in recently.
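To make "stabilize" concrete, here is a rough sketch of the invariant; the helper and its types are hypothetical, not the actual test code:

```rust
// Hypothetical helper: `history` is a sequence of per-layer residence
// snapshots (true = resident), taken once per eviction-task iteration.
// "Stabilized" means the last two snapshots are identical and the final
// state contains both resident and evicted layers.
fn statuses_stabilized(history: &[Vec<bool>]) -> bool {
    match history {
        [.., prev, last] => prev == last && last.contains(&true) && last.contains(&false),
        _ => false,
    }
}
```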

The aggressive period and threshold values revealed a subtle bug, which this patch also fixes.
The symptom was that, without the Rust changes in this patch, there were occasional test failures due to WARN... unexpectedly downloading log messages.
These log messages were caused by the "imitate access" calls of the eviction task.
But the whole point of the "imitate access" hack was to prevent eviction of the layers that we access there.
After some digging, I found the root cause, which is the following race condition:

  1. Compact: write out an L1 layer from several L0 layers. This records the residence event LayerCreate with the current timestamp.
  2. Eviction: imitate access for the logical size calculation. This accesses the L0 layers, because the L1 layer is not yet in the layer map.
  3. Compact: grab the layer map lock, add the new L1 to the layer map, remove the L0s, release the layer map lock.
  4. Eviction: observe the new L1 layer, whose only activity timestamp is the LayerCreate event.

The L1 layer had no chance of being accessed until after (3).
So, if enough time passes between (1) and (3), then (4) will observe a layer with now - last_activity > threshold and evict it.
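For illustration, a minimal sketch of the check that step (4) effectively runs; the function and its signature are assumptions, not the actual eviction-task code:

```rust
use std::time::{Duration, SystemTime};

// In the racy case above, `last_activity` is only the LayerCreate
// timestamp from step (1), so a long enough gap between (1) and (3)
// makes this return true and the layer gets evicted.
fn should_evict(last_activity: SystemTime, threshold: Duration) -> bool {
    SystemTime::now()
        .duration_since(last_activity)
        .unwrap_or(Duration::ZERO)
        > threshold
}
```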

The fix is to require the first record_residence_event to happen while we already hold the layer map lock.
The API requires a ref to a BatchedUpdates as a witness that we are inside the layer map lock.
That is not fool-proof; e.g., new call sites for insert_historic could simply forget to record the residence event.
It would be nice to prevent this at the type level.
In the meantime, we have a rate-limited log message to warn us if such an implementation error sneaks in in the future.
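A minimal sketch of the witness idea, assuming simplified shapes for the types involved (only the names record_residence_event, insert_historic, and BatchedUpdates come from this PR):

```rust
use std::time::SystemTime;

struct LayerMap; // stand-in for the real layer map

// Witness type: in the real code, a value of this type only exists
// while the layer map lock is held.
struct BatchedUpdates<'a> {
    _map: &'a mut LayerMap,
}

impl LayerMap {
    fn batch_update(&mut self) -> BatchedUpdates<'_> {
        BatchedUpdates { _map: self }
    }
}

struct Layer {
    last_activity: Option<SystemTime>,
}

impl Layer {
    // The witness parameter does nothing at runtime; it merely forces the
    // caller to record the first residence event inside the locked layer
    // map update, closing the (1)-(3) window described above.
    fn record_residence_event(&mut self, _witness: &BatchedUpdates<'_>) {
        self.last_activity = Some(SystemTime::now());
    }
}
```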

fixes #3593
fixes #3942

@github-actions bot commented Apr 11, 2023

Test results for 9b41b5f:

debug build: 225 tests run: 215 passed, 0 failed, 10 skipped (full report)
release build: 225 tests run: 215 passed, 0 failed, 10 skipped (full report)


@problame requested a review from koivunej April 11, 2023 18:41
@problame marked this pull request as ready for review April 11, 2023 18:41
@problame requested review from a team as code owners April 11, 2023 18:41
@problame requested review from lubennikovaav and removed request for a team April 11, 2023 18:41
@problame (Contributor, Author) left a comment

Should I duplicate the explainer of the race condition somewhere in the code, as a comment? If so, where?

@problame removed the request for review from lubennikovaav April 11, 2023 18:42
@koivunej (Member) commented

> WARN... unexpectedly downloading

I think we saw this in logs but probably forgot to add a note. But good that you re-hit that.

@koivunej (Member) commented

> So, we have to do it at the call sites of insert_historic.

Makes me think the #3775 fix or wrapping is more needed than not.

@koivunej (Member) left a comment

Mostly trusting the PR description, the code matching it, and the test passing; looking good. My comments are mostly questions.

@problame (Contributor, Author) commented

> Makes me think the #3775 fix or wrapping is more needed than not.

Yeah, had that thought as well.

@problame force-pushed the problame/threshold-eviction-policy-unit-tests branch from 72fad66 to 767c57f on April 13, 2023 14:51
@problame (Contributor, Author) commented

Changes:

  1. Implemented an alternative approach that puts more responsibility on the layer-creating code and pushes some checks into the type system by requiring a witness. See commit feat: alternative approach that takes a witness for layer map lock held.
  2. Rebased (insert_historic can now fail; the branch had conflicts and CI wouldn't run on it anymore).

@koivunej WDYT, is the new approach better?

koivunej previously approved these changes Apr 14, 2023
@koivunej (Member) left a comment

Still on the approval stance. I don't see a huge difference, but the witness certainly does guide towards the wanted implementation. At the same time, it cannot guide towards not leaving the recording out; that doesn't really fit into our current types and would require some two-step builder (see the sketch after this comment).

But it's not like we are trying to make a fool-proof library out of this, and the only downside is disk-based eviction giving out a warning.
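For what it's worth, one purely hypothetical shape such a two-step builder could take: have the insert return a #[must_use] token that only recording the residence event consumes, so a call site that forgets the recording gets a compiler warning:

```rust
struct Layer;
struct BatchedUpdates; // simplified stand-in for the witness type

#[must_use = "a residence event must be recorded after inserting a layer"]
struct MustRecordResidence<'a> {
    layer: &'a Layer,
}

impl BatchedUpdates {
    // Step one: inserting the layer yields a token ...
    fn insert_historic<'a>(&mut self, layer: &'a Layer) -> MustRecordResidence<'a> {
        MustRecordResidence { layer }
    }
}

impl<'a> MustRecordResidence<'a> {
    // ... step two: only this consumes it.
    fn record_residence_event(self) {
        let _ = self.layer; // record SystemTime::now() for this layer
    }
}
```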

Review threads: pageserver/src/tenant/storage_layer.rs (outdated, resolved); test_runner/regress/test_metric_collection.py (resolved)
@problame (Contributor, Author) commented

@koivunej, in reaction to your comment on the log message, I pushed feat: rate-limit logging + centralize in the latest_activity method. I think this is the max amount of yak-shaving I'm able to tolerate ;)

@koivunej (Member) left a comment

I need to dismiss my earlier review. Either I have confused two PRs, or I don't really follow the latest changes. I don't have time to look at this this evening; I will need to continue later.

Review threads: pageserver/src/tenant/storage_layer.rs (outdated, resolved); pageserver/src/tenant/timeline.rs (outdated, resolved); test_runner/regress/test_metric_collection.py (resolved)
@koivunej (Member) left a comment

I preferred the non-last-minute version; having to require the rate-limited logging is a problem in itself, caused by the fact that we currently don't always have a last-access timestamp, for what I understand are API-design reasons. It wasn't a problem in the past, so I have already forgotten why it is a problem now.

I preferred the pre-rate-limited version at f480844^, which I already approved; the logging could be done at the call sites (for disk-usage-based eviction), and for the eviction task we could maybe handle it in some summary, if at all.

Looking at this, it will take a lot longer to understand in which case it remains None. I think I proposed SystemTime::now for that, which I can't find now.

@problame (Contributor, Author) commented

> I preferred the non-last-minute version; having to require the rate-limited logging is a problem in itself [...]

We never expect to hit the log message, because we are careful to add the record_residence_event calls in all the right places.
However, if we forget to do that in a branch not covered by tests (e.g., in a future refactoring), the risk is that we hit it for many layers, thereby flooding the ERROR log.

> the logging could be done at the call sites (for disk-usage-based eviction), and for the eviction task we could maybe handle it in some summary, if at all.

For disk-usage-based eviction, rate-limiting to once per iteration is fine.

For the per-timeline eviction task, rate-limiting to once per iteration still leaves num_timelines / period ERROR logs per pageserver, e.g., 2000 msg / 10 min ≈ 3 msg/s.

IMO that's too much, hence the global rate limit.
It's an implementation issue after all, and it affects all call sites similarly.
Hence, IMO it's nonsense to log at the call sites, especially if one of them (the per-timeline eviction task) would require rate limiting anyway.

The ugliness of the rate limiting is constrained to the impl, where we have a "nice" comment explaining the situation.

> Looking at this, it will take a lot longer to understand in which case it remains None. I think I proposed SystemTime::now for that [...]

The warn message will still print every 10 seconds. It really takes no genius to grep for the message in the codebase, then do git blame / read the surrounding code, which is well-commented. So I think it's easy to understand what's going on.
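For reference, a sketch of the kind of global, static-mutex-based rate limit under discussion; the shape is an assumption, only the 10-second interval comes from this thread:

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

// One global limiter shared by all call sites, so a widespread
// implementation error produces at most one warning per interval
// instead of flooding the log.
static LAST_WARNED: Mutex<Option<Instant>> = Mutex::new(None);

fn warn_rate_limited(msg: &str) {
    let mut last = LAST_WARNED.lock().unwrap();
    let now = Instant::now();
    if last.map_or(true, |t| now.duration_since(t) >= Duration::from_secs(10)) {
        *last = Some(now);
        eprintln!("WARN: {msg}"); // the real code would use tracing::warn!
    }
}
```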


Does the above convince you? If not, please state what to do with the per-timeline eviction task logging. I'll implement that on top of f480844^ then.

@koivunej (Member) left a comment

I still dislike the approach (RateLimitClosure, or whatever it is called after the rename) and the static mutexes, but perhaps it truly is unavoidable, or at least I don't have the energy to come up with an alternative. At least the current review comments need to be addressed.

@problame merged commit 7dd9553 into main May 4, 2023
@problame deleted the problame/threshold-eviction-policy-unit-tests branch May 4, 2023 14:16
@problame (Contributor, Author) commented May 4, 2023

Updated the PR description to the final commit message.
