Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TiKV GcKeys task doesn't work when called with multiple keys (at least in 5.1 but I think for everything) #11217

Closed
frew opened this issue Nov 3, 2021 · 12 comments · Fixed by #11248
Labels
affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. severity/critical type/bug The issue is confirmed as a bug.

Comments

@frew
Copy link
Contributor

frew commented Nov 3, 2021

Bug Report

In gc worker, https://github.com/tikv/tikv/blob/master/src/server/gc_worker/gc_worker.rs#L340 the GcKeys GC task when called with multiple keys tries to get the list of regions overlapping the key range for the task and then do a sorted merge of the keys and the space encompassed by the regions. However, it fails to prepend the keys with the data prefix ('z') when passing them to get_regions_in_range, get_regions_in_range doesn't prepend the keys, and so the keys form an invalid range and no regions are ever returned. This makes garbage collection of delete markers not happen.

To verify the issue I added some logging statements in a branch: https://github.com/frew/tikv/pull/1/files#diff-fad0fef4b49a4159243096a9212032ae40e8b88cd0deb3c60df42a3dbdc639dfR385 - an example line from the first logging statement showing the issue in our cluster:

[2021/11/03 06:46:28.397 +00:00] [INFO] [region_info_accessor.rs:385] ["gc get_regions_in_range"] [start_key="[116, 128, 0, 0, 0, 0, 0, 0, 255, 159, 95, 114, 128, 0, 0, 12, 81, 255, 152, 74, 144, 0, 0, 0, 0, 0, 250]"] [first_region="Some(([122, 116, 128, 0, 0, 0, 0, 0, 0, 255, 19, 0, 0, 0, 0, 0, 0, 0, 248], 53))"] [regions_size=126845]

(note the prefix 122 on the first_region but not on the start_key)

I suggest adding data_key() to https://github.com/tikv/tikv/blob/master/src/server/gc_worker/gc_worker.rs#L338 and the following line, but defer to your expertise.

What version of TiKV are you using?

5.1.2

What operating system and CPU are you using?

Linux on GCP n2d

@frew
Copy link
Contributor Author

frew commented Nov 3, 2021

Tagging @hicqu @Connor1996 @MyonKeminta from #9959

@Connor1996
Copy link
Member

Nice catch! It does have the problem. Would you like to file a PR to fix it?

@MyonKeminta
Copy link
Contributor

Thanks for your reporting and digging so deep to find the cause.

I suggest adding data_key() to https://github.com/tikv/tikv/blob/master/src/server/gc_worker/gc_worker.rs#L338 and the following line, but defer to your expertise.

Your suggested change should work, but I suggest adding the data_key conversion here so that all interfaces of the RegionInfoAccessor are consistent, which means they all accepts normal encoded keys instead of 'z'-prefixed keys.

@frew
Copy link
Contributor Author

frew commented Nov 3, 2021

Oh, good point in making RegionInfoAccessor consistent. Will cut a PR

@frew
Copy link
Contributor Author

frew commented Nov 3, 2021

Added PR - going to backport to 5.1.2 locally and test on our cluster. Could somebody on the PingCAP side take adding whatever tests you think appropriate?

@frew
Copy link
Contributor Author

frew commented Nov 3, 2021

Posted this as a comment on the PR as well but as-written, doesn't work: Actually this has a problem - get_regions_in_range is also called from CompactionGuardGenerator, where we already have a data key. I think that function is also busted because it then calls data_end_key(&region.end_key) where I think region end_key is already data encoded? This is starting to seem like it really needs an actual implementation in the type system for data encoded keys vs. non data encoded keys or something and I'm not convinced I have enough context to pursue.

@frew
Copy link
Contributor Author

frew commented Nov 4, 2021

Okay, here's a version (with logging as well) that's working locally: https://github.com/frew/tikv/pull/1/files

Ran into two additional issues:

  1. get_regions_in_range isn't really what we want here - we need the regions encompassing the range (i.e. including a region that has an endpoint outside of the range), so added a function to do that (haven't checked the other callsite for get_regions_in_range to see if it's using it effectively
  2. There was another encoding issue around data_key, so fixed that

@yiwu-arbug
Copy link

Posted this as a comment on the PR as well but as-written, doesn't work: Actually this has a problem - get_regions_in_range is also called from CompactionGuardGenerator, where we already have a data key. I think that function is also busted because it then calls data_end_key(&region.end_key) where I think region end_key is already data encoded? This is starting to seem like it really needs an actual implementation in the type system for data encoded keys vs. non data encoded keys or something and I'm not convinced I have enough context to pursue.

CompactionGuardGenerator code looks fine. it is applying data_end_key on region.end_key, which is unencoded.

@frew
Copy link
Contributor Author

frew commented Nov 4, 2021

CompactionGuardGenerator code looks fine. it is applying data_end_key on region.end_key, which is unencoded.

The problem is the other way - self.smallest_key and self.largest_key appear to already have the 'z' prefix, so following @MyonKeminta 's suggestion of making the function prepend internally doesn't work.

@Connor1996
Copy link
Member

Added PR - going to backport to 5.1.2 locally and test on our cluster. Could somebody on the PingCAP side take adding whatever tests you think appropriate?

okay, I’ll add a test and fix it

@frew
Copy link
Contributor Author

frew commented Nov 4, 2021

@Connor1996 okay, the current version (+logging) which seems to be working when applied to our 5.1.2 cluster is at: https://github.com/frew/tikv/pull/1/files

Please let me know (+ feel free to ping me in Slack) if you have any questions. The logging is probably more verbose than you'd want to merge into mainline, but I think everything else is reasonable to merge. One thing I haven't checked is whether the other users of get_regions_in_range should also be using get_regions_encompassing_range or if the logic is right for them

@frew
Copy link
Contributor Author

frew commented Nov 4, 2021

(deleted one comment that misunderstood something)

One proposal to make code simpler: Connor1996#2

Other than that, looks good to me as far as I understand it.

@yiwu-arbug yiwu-arbug added the type/bug The issue is confirmed as a bug. label Nov 4, 2021
Connor1996 added a commit to Connor1996/tikv that referenced this issue Nov 5, 2021
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
ti-chi-bot added a commit that referenced this issue Nov 8, 2021
* fix gc keys doesn't work

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

* close #11217 simplify rangekey

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

* improve test

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

* fix clippy

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

* add metrics

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

* use closed interval for end key

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
5kbpers pushed a commit to 5kbpers/tikv that referenced this issue Nov 19, 2021
…#11248)

* fix gc keys doesn't work

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

* close tikv#11217 simplify rangekey

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

* improve test

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

* fix clippy

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

* add metrics

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

* use closed interval for end key

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
Signed-off-by: 5kbpers <tangminghua@pingcap.com>
@youjiali1995 youjiali1995 added affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. labels Nov 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.0 This bug affects 5.0.x versions. affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. severity/critical type/bug The issue is confirmed as a bug.
Projects
None yet
5 participants