
store/copr: support batch coprocessor requests by store #39525

Merged (18 commits, Dec 1, 2022)


you06 (Contributor) commented Dec 1, 2022

Signed-off-by: you06 <you1474600@gmail.com>

What problem does this PR solve?

Issue Number: ref #39361

Problem Summary:

Fanout query creates too many table reader tasks.

What is changed and how it works?

Batching tasks by store reduces the number of RPC requests and the serialization/deserialization cost. In the fanout scenario, this mechanism batches the fanout tasks together.
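A minimal sketch of the batching idea, assuming a hypothetical, simplified `copTask` type with a region ID, a store ID, and a list of batched tasks. `batchTasksByStore` and its fields are illustrative and are not TiDB's actual code. Tasks that target the same store are attached to one primary task until `storeBatchSize` extra tasks have been added, so one RPC carries the requests of several regions.

```go
package main

import "fmt"

// copTask is a hypothetical, simplified stand-in for a coprocessor task;
// the real task type carries many more fields.
type copTask struct {
	regionID uint64
	storeID  uint64
	// batched holds the extra tasks that travel in the same RPC.
	batched []*copTask
}

// batchTasksByStore attaches tasks that target the same store to one
// primary task, allowing at most storeBatchSize extra tasks per primary.
// A storeBatchSize of 0 (or less) disables batching.
func batchTasksByStore(tasks []*copTask, storeBatchSize int) []*copTask {
	if storeBatchSize <= 0 {
		return tasks
	}
	out := make([]*copTask, 0, len(tasks))
	// store2Idx maps a store ID to the index of its open primary task in out.
	store2Idx := make(map[uint64]int, 16)
	for _, t := range tasks {
		if idx, ok := store2Idx[t.storeID]; ok && len(out[idx].batched) < storeBatchSize {
			out[idx].batched = append(out[idx].batched, t)
			continue
		}
		// No open primary for this store, or it is full: start a new one.
		store2Idx[t.storeID] = len(out)
		out = append(out, t)
	}
	return out
}

func main() {
	tasks := []*copTask{
		{regionID: 1, storeID: 1},
		{regionID: 2, storeID: 1},
		{regionID: 3, storeID: 2},
		{regionID: 4, storeID: 1},
	}
	for _, t := range batchTasksByStore(tasks, 8) {
		fmt.Printf("store %d: primary region %d + %d batched\n", t.storeID, t.regionID, len(t.batched))
	}
}
```

With a batch size of 8, the four region tasks in `main` collapse into two RPCs, one per store. The store-to-index map mirrors the `store2Idx` map visible in the diff excerpt in the review, but how the real code fills and resets it is an assumption here.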

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note


Support batch coprocessor requests by store.

Signed-off-by: you06 <you1474600@gmail.com>

fix missing max value

Signed-off-by: you06 <you1474600@gmail.com>
you06 requested a review from a team as a code owner on December 1, 2022 03:29
ti-chi-bot (Member) commented Dec 1, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • cfzjywxk
  • sticnarf

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

ti-chi-bot added the release-note (denotes a PR that will be considered when it comes time to generate release notes) and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) labels Dec 1, 2022
Signed-off-by: you06 <you1474600@gmail.com>
Signed-off-by: you06 <you1474600@gmail.com>
ti-chi-bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) and removed the size/L label Dec 1, 2022
Signed-off-by: you06 <you1474600@gmail.com>
sessionctx/variable/sysvar.go (outdated, resolved)
}
task := batchedTask.task
if regionErr := batchResp.GetRegionError(); regionErr != nil {
	logutil.BgLogger().Info("DBG region error", zap.String("err", regionErr.String()))
Contributor commented:

Debug log?

var err error
resolveLockDetail, err = worker.handleLockErr(bo, lockErr, task)
if err != nil {
	return nil, err
}
return []*copTask{task}, nil
Contributor commented:

We should still handle the remaining batch responses and merge them.

Contributor commented:

The same for region error, I think.

Contributor commented:

It is possible that an error (lock, region miss, or another error) is returned in the original response while the batched responses return OK.
Every lock error may lead to lock resolving, and every region error may lead to a region-miss retry.
If order is not required, we could return the results of the successful responses and not execute them again.

Contributor (author) commented:

An error from worker.handleLockErr means that resolving the lock failed or the backoff timed out. Shouldn't we return that error to the client?

Contributor commented:

No problem in L1148. But if the lock is resolved, shouldn't we return more than just the task itself? BatchResponses may include both successful and failed results, and each of them should either be returned through the channel or be returned for retry. (Or am I misunderstanding something here?)

taskID := uint64(0)
var store2Idx map[uint64]int
if req.StoreBatchSize > 0 {
	store2Idx = make(map[uint64]int, 16)
Contributor commented:

After cache.SplitKeyRangesByBuckets(bo, ranges), the ranges are split by the ordered region ranges, and the ranges within each region are also ordered. For example:

Region 1          Region 2            Region 3
[1, 2], [3, 4]    [5, 10], [15, 20]   [21, 25]
task1   task2     task3    task4      task5

So if KeepOrder is required, I think batch processing could still work. The difference is that, when order is required, the coprocessor client cannot respond to the caller if task5 has finished while task2 has not.

@sticnarf @you06
What do you think? Please correct me if I missed anything.

Contributor (author) commented:

I made a small change to the example.

Region 1   Region 2     Region 3               Region 4
[1, 2]     [3, 4]       [5, 10], [15, 20]      [21, 25]
task1      task2        task3    task4         task5

Suppose regions 1, 2, and 4 are located in store 1, and region 3 is located in store 2. There are two ways to batch the tasks:

  1. [task1, task2, task5], [task3, task4]

In this way, we achieve the maximum batch size, but task5 has to wait until task4 is received.

  2. [task1, task2], [task3, task4], [task5]

In this way, we don't need to reorder the responses.

sticnarf (Contributor) commented Dec 1, 2022:

This greatly reduces the benefit of batching. A batch may involve hundreds of regions in one store, and it is very common for region ranges on different stores to interleave.

Instead, I think we should record the range or the order index of each response and sort the responses after all of them have been received. This can be done in a later iteration.

Contributor commented:

Handling the ordering-related work in the next iteration is fine with me. For now, we could simply disable batching when order is required to keep things simple.
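The interim decision amounts to a gate on the batch size. A sketch assuming a hypothetical `copRequest` struct whose `KeepOrder` and `StoreBatchSize` fields are modeled on the names used in this thread and in the `req.StoreBatchSize` diff excerpt; the real request type and where the check lives may differ.

```go
package main

import "fmt"

// copRequest is a hypothetical subset of a coprocessor request.
type copRequest struct {
	KeepOrder      bool
	StoreBatchSize int
}

// effectiveStoreBatchSize returns 0 (batching disabled) when the caller
// needs ordered results, and the configured size otherwise. Treating a
// negative size as "disabled" is an assumption of this sketch.
func effectiveStoreBatchSize(req copRequest) int {
	if req.KeepOrder || req.StoreBatchSize < 0 {
		return 0
	}
	return req.StoreBatchSize
}

func main() {
	fmt.Println(effectiveStoreBatchSize(copRequest{KeepOrder: true, StoreBatchSize: 4}))  // 0
	fmt.Println(effectiveStoreBatchSize(copRequest{KeepOrder: false, StoreBatchSize: 4})) // 4
}
```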

ti-chi-bot added the status/LGT1 label (indicates that a PR has LGTM 1) Dec 1, 2022
ti-chi-bot added the status/LGT2 label (indicates that a PR has LGTM 2) and removed the status/LGT1 label Dec 1, 2022
cfzjywxk (Contributor) commented Dec 1, 2022

/merge

ti-chi-bot (Member) commented:

This pull request has been accepted and is ready to merge.

Commit hash: 12b3c39

ti-chi-bot added the status/can-merge label (indicates a PR has been approved by a committer) Dec 1, 2022
sticnarf (Contributor) commented Dec 1, 2022

There are some lint errors that need fixing.

you06 (Contributor, author) commented Dec 1, 2022

> There're some lint errors that need fixing.

There were some mistakes in processing the lock-resolve details; I've fixed the lint errors for now.

Signed-off-by: you06 <you1474600@gmail.com>
ti-chi-bot removed the status/can-merge label Dec 1, 2022
cfzjywxk (Contributor) commented Dec 1, 2022

/merge

ti-chi-bot (Member) commented:

This pull request has been accepted and is ready to merge.

Commit hash: 8ec0efc

ti-chi-bot added the status/can-merge label (indicates a PR has been approved by a committer) Dec 1, 2022
ti-chi-bot and others added 2 commits December 1, 2022 22:39
Signed-off-by: you06 <you1474600@gmail.com>
ti-chi-bot removed the status/can-merge label Dec 1, 2022
you06 (Contributor, author) commented Dec 1, 2022

/merge

ti-chi-bot (Member) commented:

This pull request has been accepted and is ready to merge.

Commit hash: c1011b3

ti-chi-bot added the status/can-merge label (indicates a PR has been approved by a committer) Dec 1, 2022
you06 (Contributor, author) commented Dec 1, 2022

/run-mysql-test

3 similar comments

ti-chi-bot merged commit 9d9eaca into pingcap:master Dec 1, 2022
sre-bot (Contributor) commented Dec 1, 2022

TiDB MergeCI notify

🔴 Bad news! 3 CI jobs are still failing after this PR was merged.
These failures do not appear to have been introduced by this PR.

| CI Name | Result | Duration | Compare with Parent commit |
|---|---|---|---|
| idc-jenkins-ci-tidb/integration-common-test | 🔴 failed 1, success 16, total 17 | 34 min | Existing failure |
| idc-jenkins-ci/integration-cdc-test | 🔴 failed 2, success 38, total 40 | 24 min | Existing failure |
| idc-jenkins-ci-tidb/common-test | 🔴 failed 1, success 10, total 11 | 14 min | Existing failure |
| idc-jenkins-ci-tidb/tics-test | 🟢 all 1 tests passed | 5 min 57 sec | Existing passed |
| idc-jenkins-ci-tidb/integration-ddl-test | 🟢 all 6 tests passed | 5 min 1 sec | Existing passed |
| idc-jenkins-ci-tidb/sqllogic-test-1 | 🟢 all 26 tests passed | 4 min 50 sec | Existing passed |
| idc-jenkins-ci-tidb/sqllogic-test-2 | 🟢 all 28 tests passed | 4 min 37 sec | Existing passed |
| idc-jenkins-ci-tidb/mybatis-test | 🟢 all 1 tests passed | 4 min 21 sec | Existing passed |
| idc-jenkins-ci-tidb/integration-compatibility-test | 🟢 all 1 tests passed | 2 min 31 sec | Existing passed |
| idc-jenkins-ci-tidb/plugin-test | 🟢 build success, plugin test success | 4 min | Existing passed |

Labels: release-note, size/XL, status/can-merge, status/LGT2
Projects: None yet

5 participants