br restore failed when injection tikv failure 5 minutes every 5 minutes #56046

Open
Lily2025 opened this issue Sep 12, 2024 · 3 comments
Labels
component/br This issue is related to BR of TiDB. type/bug The issue is confirmed as a bug.

Comments

@Lily2025
Lily2025 commented Sep 12, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1. Run br restore.
2. Inject a TiKV failure that lasts 5 minutes, once every 5 minutes, twice in total.

br.log.2024-09-10T18.03.34Z.zip

2. What did you expect to see? (Required)

br restore succeeds.

3. What did you see instead (Required)

br restore

start time: 2024-09-11 02:03:34, failed time: 2024-09-11 02:19:48
stdout:
Detail BR log in /tmp/br.log.2024-09-10T18.03.34Z

[2024/09/10 18:19:48.220 +00:00] [INFO] [collector.go:73] [DataBase Restore failed summary] [total-ranges=129] [ranges-succeed=128] [ranges-failed=1] [split-region=9m5.021030894s] [restore-ranges=7897] [unit-name=file] [error=rpc error: code = Unavailable desc = Cancelling all calls; rpc error: code = Unavailable desc = connection error: desc = \transport: error while dialing: dial tcp 10.200.68.65:20160: connect: connection refused; rpc error: code = Unavailable desc = connection error

4. What is your TiDB version? (Required)

./tidb-server -V
Release Version: v6.5.11
Edition: Community
Git Commit Hash: 305cf42
Git Branch: HEAD
UTC Build Time: 2024-09-10 08:34:23
GoVersion: go1.19.13
Race Enabled: false
TiKV Min Version: 6.2.0-alpha
Check Table Before Drop: false
Store: unistore
2024-09-11T02:03:30.451+0800

./br -V
Release Version: v6.5.11
Git Commit Hash: 305cf42
Git Branch: HEAD
Go Version: go1.19.13
UTC Build Time: 2024-09-10 08:35:44
Race Enabled: false

@Lily2025 added the type/bug label on Sep 12, 2024
@YuJuncen
Contributor

In this case, one download request suffers from both the first TiKV outage and the second one.

> echo -e (cat br.log.2024-09-10T18.03.34ZE | tail -n1) | rg '^ -' | uniq -c
      1  -  rpc error: code = Unavailable desc = Cancelling all calls
     71  -  rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 10.200.68.65:20160: connect: connection refused\"
     49  -  rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial tcp 10.200.59.172:20160: connect: connection refused\"

According to our code (release-6.5, 1097ba8), there are 128 retries in total, and the average backoff is about 4s.
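For readers without `rg` or a fish shell, the tally above can be approximated in Python. This is a minimal sketch that assumes the last log line carries its error list as ` - ` bullets separated by escaped `\n` sequences, as the pipeline implies. The function name `tally_errors` and the sample string are hypothetical and not taken from the real log.

```python
from collections import Counter

def tally_errors(last_line: str) -> Counter:
    """Count each distinct ' - ' bullet in a line whose newlines are escaped."""
    # Turn the literal backslash-n pairs into real newlines, as `echo -e` does here.
    text = last_line.replace("\\n", "\n")
    # Keep only bullet lines, mirroring the `rg '^ -'` filter.
    bullets = [line.strip() for line in text.split("\n") if line.startswith(" -")]
    return Counter(bullets)

# Hypothetical sample input, not copied from the attached log.
sample = "errors:\\n - connection refused A\\n - connection refused A\\n - cancelled"
counts = tally_errors(sample)
for message, n in counts.most_common():
    print(n, message)
```

Unlike `uniq -c`, which only merges adjacent duplicates, `Counter` merges every identical message regardless of position, so the two can differ when identical errors are interleaved with others.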

@YuJuncen
Contributor

So, when the total downtime exceeds 480s, BR may fail. The probability depends on the interval between the two TiKV outages: the longer the interval, the lower the probability of BR failure.

My guess is that when the two outages are more than 15 minutes apart, the downtime counter that leads to failure can be reset.

@YuJuncen
Contributor

Note: restore requires all nodes to be online. When one TiKV goes down, the request starts retrying. Even after that TiKV comes back, the request may still be unfinished, so consecutive outages of different TiKVs may consume the same retry counter. Once all retry chances are consumed, BR fails.
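The arithmetic behind this can be sketched as a toy model. It is not BR's actual retry code: it assumes the 480s downtime budget quoted earlier in the thread, treats each second of outage as draining that budget one-for-one, and uses a hypothetical `reset_between` flag to stand in for the guessed counter reset.

```python
RETRY_BUDGET_S = 480  # total downtime one request can absorb, as quoted in the thread

def request_survives(outages_s, reset_between=False, budget_s=RETRY_BUDGET_S):
    """Return True if a single request outlasts the given outages (seconds each)."""
    if reset_between:
        # Assumed behaviour: the counter resets, so each outage is judged alone.
        return all(d <= budget_s for d in outages_s)
    # Shared counter: downtime from every outage accumulates against one budget.
    return sum(outages_s) <= budget_s

five_min = 5 * 60
print(request_survives([five_min]))            # one 5-minute outage
print(request_survives([five_min, five_min]))  # two outages sharing the counter
print(request_survives([five_min, five_min], reset_between=True))
```

Under these assumptions a single 5-minute outage fits within the budget, while two 5-minute outages charged to the same request add up to 600s and exceed it, which matches the failure reported here.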

@jebter added the component/br label on Sep 13, 2024