Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: deletion queue & generation validation for deletions #5207

Merged
merged 69 commits into from
Sep 26, 2023

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Sep 5, 2023

Problem

Pageservers must not delete objects or advertise updates to remote_consistent_lsn without checking that they hold the latest generation for the tenant in question (see the RFC)

In this PR:

  • A new "deletion queue" subsystem is introduced, through which deletions flow
  • RemoteTimelineClient is modified to send deletions through the deletion queue:
    • For GC & compaction, deletions flow through the full generation verifying process
    • For timeline deletions, deletions take a fast path that bypasses generation verification
  • The last_uploaded_consistent_lsn value in UploadQueue is replaced with a mechanism that maintains a "projected" lsn (equivalent to the previous property), and a "visible" LSN (which is the one that we may share with safekeepers).
  • Until control_plane_api is set, all deletions skip generation validation
  • Tests are introduced for the new functionality in test_pageserver_generations.py

Once this lands, if a pageserver is configured with the control_plane_api configuration added in #5163, it becomes safe to attach a tenant to multiple pageservers concurrently.

Summary of changes

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@github-actions
Copy link

github-actions bot commented Sep 5, 2023

2520 tests run: 2403 passed, 0 failed, 117 skipped (full report)


Flaky tests (5)

Postgres 16

  • test_crafted_wal_end[last_wal_record_xlog_switch_ends_on_page_boundary]: release
  • test_partial_evict_tenant: release
  • test_many_timelines: release
  • test_wal_lagging: release

Postgres 14

  • test_get_tenant_size_with_multiple_branches: release

Code coverage (full report)

  • functions: 52.9% (8018 of 15145 functions)
  • lines: 81.2% (47003 of 57884 lines)

The comment gets automatically updated with the latest test results
8af4131 at 2023-09-26T15:05:57.985Z :recycle:

@jcsp jcsp force-pushed the jcsp/generation-numbers-pt3 branch 3 times, most recently from afb48d0 to 13f45cf Compare September 6, 2023 16:07
@jcsp jcsp changed the title Jcsp/generation numbers pt3 pageserver: deletion queue & generation validation for deletions Sep 6, 2023
@jcsp jcsp added t/feature Issue type: feature, for new features or requests c/storage/pageserver Component: storage: pageserver labels Sep 6, 2023
jcsp added a commit that referenced this pull request Sep 8, 2023
…#5231)

## Problem

Currently our testing environment only supports running a single
pageserver at a time. This is insufficient for testing failover and
migrations.
- Dependency of writing tests for #5207 

## Summary of changes

- `neon_local` and `neon_fixture` now handle multiple pageservers
- This is a breaking change to the `.neon/config` format: any local
environments will need recreating
- Existing tests continue to work unchanged:
  - The default number of pageservers is 1
- `NeonEnv.pageserver` is now a helper property that retrieves the first
pageserver if there is only one, else throws.
- Pageserver data directories are now at `.neon/pageserver_{n}` where n
is 1,2,3...
- Compatibility tests get some special casing to migrate neon_local
configs: these are not meant to be backward/forward compatible, but they
were treated that way by the test.
Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big chunk but let's handle rest of changes if any after merging.

@jcsp jcsp enabled auto-merge (squash) September 26, 2023 11:44
Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed validation & recovery.

I'm having trouble keeping track of error handling correctness in validator.rs when we write to disk.
At the offsite, we agreed that local IO error should crash the pageserver.
Started Slack thread about that.

Other than that, only few small remarks.

pageserver/src/deletion_queue/validator.rs Show resolved Hide resolved
pageserver/src/deletion_queue/list_writer.rs Show resolved Hide resolved
pageserver/src/deletion_queue/list_writer.rs Show resolved Hide resolved
pageserver/src/deletion_queue/validator.rs Show resolved Hide resolved
Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agreed in Slack that there will be a follow-up PR that will make the pageserver process abort/panic/exit if there are any local filesystem IO errors encountered by deletion queue.

With that, I'm OK with merging this PR.

(I still dislike the over-use of the term "flush", but, let's take that to a follow-up if any.)

@jcsp
Copy link
Contributor Author

jcsp commented Sep 26, 2023

This hit a test failure that didn't have an existing ticket: #5385 -- I looked at the test and I don't think it's touching anything in that changed in this PR, and the test also looks like it might be racy by not waiting for pageserver ingest after it writes to the database. Will follow up separately.

@koivunej
Copy link
Member

(re-ran failed)

@jcsp jcsp merged commit ba92668 into main Sep 26, 2023
40 checks passed
@jcsp jcsp deleted the jcsp/generation-numbers-pt3 branch September 26, 2023 15:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/feature Issue type: feature, for new features or requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants