Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean up logging around on-demand downloads #4030

Merged
merged 3 commits into from
Apr 27, 2023

Conversation

problame
Copy link
Contributor

@problame problame commented Apr 14, 2023

  • Remove repeated tenant & timeline from span
  • Demote logging of the path to debug level
  • Log completion at info level, in the same function where we log errors
  • distinguish between layer file download success & on-demand download succeeding as a whole in the log message wording
  • Assert that the span contains a tenant id and a timeline id

fixes #3945

Before:

  INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: download complete: /storage/pageserver/data/tenants/$TENANT_ID/timelines/$TIMELINE_ID/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91
  INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates.

After:

  INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: layer file download finished
  INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates.
  INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: on-demand download successful

@problame problame requested a review from koivunej April 14, 2023 12:39
Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving even with 2 => 3 expansion, but I wish that was still considered. Need to upgrade log analysis dashboards.

@github-actions
Copy link

github-actions bot commented Apr 14, 2023

Test results for 50d8430:


debug build: 219 tests run: 209 passed, 0 failed, 10 (full report)


release build: 219 tests run: 209 passed, 0 failed, 10 (full report)


@problame
Copy link
Contributor Author

problame commented Apr 21, 2023

@problame problame force-pushed the problame/ondemand-log-cleanup branch from 3221e35 to f03a2d1 Compare April 25, 2023 15:32
@problame problame marked this pull request as ready for review April 25, 2023 15:33
@problame problame requested review from a team as code owners April 25, 2023 15:33
@problame problame requested review from knizhnik, arssher and a team and removed request for a team and knizhnik April 25, 2023 15:33
@problame
Copy link
Contributor Author

@arssher adding you / safekeepers to the review because the logging code is shared. The perf overhead of the tracing_error is non-zero, but, I'd like to have it enabled in release builds to see what the actual overhead is in prod. Are safekeepers ok with that?

- Remove repeated tenant & timeline from span
- Demote logging of the path to debug level
- Log completion at info level, in the same function where we log errors
- distinguish between layer file download success & on-demand download
  succeeding as a whole in the log message wording
- Assert that the span contains a tenant id and a timeline id

The assert uncovered that walreceiver_connection uses TenantTimelineId
in the span. I changed that to tenant_id and timeline_id.

fixes #3945

Before:

```
  INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: download complete: /storage/pageserver/data/tenants/$TENANT_ID/timelines/$TIMELINE_ID/000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91
  INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{tenant_id=$TENANT_ID timeline_id=$TIMELINE_ID layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates.
```

After:

```
  INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: layer file download finished
  INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: Rebuilt layer map. Did 9 insertions to process a batch of 1 updates.
  INFO compaction_loop{tenant_id=$TENANT_ID}:compact_timeline{timeline=$TIMELINE_ID}:download_remote_layer{layer=000000000000000000000000000000000000-FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF__00000000020C8A71-00000000020CAF91}: on-demand download successful
```
@problame problame force-pushed the problame/ondemand-log-cleanup branch from f03a2d1 to 6d5a7fc Compare April 26, 2023 09:06
@problame
Copy link
Contributor Author

Rebased to get the fix for the failed test

commit bfd45dd6713a2e9038954cf0368b1a082937b045
Author: Joonas Koivunen <joonas@neon.tech>
Date:   Tue Apr 25 18:41:09 2023 +0300

    test_tenant_config: allow ERROR from eviction task (#4074)

@arssher
Copy link
Contributor

arssher commented Apr 26, 2023

Is the only reason for tracing_error inclusion is debug_assert_current_span_has_tenant_and_timeline_id? We try not to have many logs in perf sensitive places, so this shouldn't be an issue, but just to understand.

Also seems like tracing_error layer can be made optional, if needed.

@problame
Copy link
Contributor Author

Is the only reason for tracing_error inclusion is debug_assert_current_span_has_tenant_and_timeline_id?

Yes.

We try not to have many logs in perf sensitive places, so this shouldn't be an issue, but just to understand.

I don't understand this sentence.
We're not logging anything in debug_assert_current_span_has_tenant_and_timeline_id.
What it does, essentially, is walk up the stack of tracing::Span and ensure that, along that walk, we see a tenant_id or tenant field, and a timeline_id or timeline field.

Also seems like tracing_error layer can be made optional, if needed.

True, since debug_assert_current_span_has_tenant_and_timeline_id is cfg(debug_assertion), we can make the tracing_error layer cfg(debug_assertion) as well.

But, as I said,

The perf overhead of the tracing_error is non-zero, but, I'd like to have it enabled in release builds to see what the actual overhead is in prod. Are safekeepers ok with that?

So, I read it's not Ok for you (SKs) to have it enabled always?

@arssher
Copy link
Contributor

arssher commented Apr 26, 2023

I don't understand this sentence.

I should have made it separate paragraph, not related to debug_assert_current_span_has_tenant_and_timeline_id.

So I think it is ok, I just ensured we can easily turn it off if needed.

@problame
Copy link
Contributor Author

So I think it is ok, I just ensured we can easily turn it off if needed.

Ok, it can't be turned off currently, so, I take as an action item that it should be configurable and off by default?

@arssher
Copy link
Contributor

arssher commented Apr 26, 2023

I doubt this is important currently, my message that we try not to log a lot in perf sensitive places was supposed to be an argument why this shouldn't be important. I think we can return here only if we spot any problems.

But if the whole deal is a matter of adding single #[cfg(debug_assertions)] in logging.rs, why not.

@arssher
Copy link
Contributor

arssher commented Apr 26, 2023

but, I'd like to have it enabled in release builds to see what the actual overhead is in prod.

Ah, I missed that part. Then leaving as is is definitely ok.

@problame
Copy link
Contributor Author

@arssher now I already put in the work :D
The latest commit constrains the enablement to just pageserver, and there only if cfg!(debug_assertions).

That takes out almost all the performance risk with this PR.

@arssher
Copy link
Contributor

arssher commented Apr 26, 2023

cool!

@problame problame enabled auto-merge (squash) April 26, 2023 15:26
@problame problame removed the request for review from a team April 27, 2023 09:54
@problame problame merged commit 9ea7b5d into main Apr 27, 2023
@problame problame deleted the problame/ondemand-log-cleanup branch April 27, 2023 09:54
problame added a commit that referenced this pull request Apr 27, 2023
PR  `build: run clippy for powerset of features (#4077)`
brought us a `clippy --release` pass.

It was merged after #4030, which fails under `clippy --release` with

```
error: static `TENANT_ID_EXTRACTOR` is never used
    --> pageserver/src/tenant/timeline.rs:4270:16
     |
4270 |     pub static TENANT_ID_EXTRACTOR: once_cell::sync::Lazy<
     |                ^^^^^^^^^^^^^^^^^^^
     |
     = note: `-D dead-code` implied by `-D warnings`

error: static `TIMELINE_ID_EXTRACTOR` is never used
    --> pageserver/src/tenant/timeline.rs:4276:16
     |
4276 |     pub static TIMELINE_ID_EXTRACTOR: once_cell::sync::Lazy<
     |                ^^^^^^^^^^^^^^^^^^^^^
```

A merge queue would have prevented this.
problame added a commit that referenced this pull request Apr 27, 2023
PR `build: run clippy for powerset of features (#4077)` brought us a
`clippy --release` pass.

It was merged after #4030, which fails under `clippy --release` with

```
error: static `TENANT_ID_EXTRACTOR` is never used
    --> pageserver/src/tenant/timeline.rs:4270:16
     |
4270 |     pub static TENANT_ID_EXTRACTOR: once_cell::sync::Lazy<
     |                ^^^^^^^^^^^^^^^^^^^
     |
     = note: `-D dead-code` implied by `-D warnings`

error: static `TIMELINE_ID_EXTRACTOR` is never used
    --> pageserver/src/tenant/timeline.rs:4276:16
     |
4276 |     pub static TIMELINE_ID_EXTRACTOR: once_cell::sync::Lazy<
     |                ^^^^^^^^^^^^^^^^^^^^^
```

A merge queue would have prevented this.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

don't log full file path in pageserver/src/tenant/remote_timeline_client/download.rs::download_layer_file
3 participants