Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

schedule: Do not send an operator of a region wth a stale epoch #1659

Merged
merged 11 commits into from
Aug 23, 2019

Conversation

shafreeck
Copy link
Contributor

Signed-off-by: Shafreeck Sea shafreeck@gmail.com

What problem does this PR solve?

closes #1652

When PD receives a region heartbeat, it dispatches the operator of that region. The dispatching uses the epoch from the heartbeat rather than from the operator when it is created. This causes a bug described in #1652.

What is changed and how it works?

If the operator to dispatch has a stale region epoch, cancel it.

Check List

Tests

  • Unit test

@shafreeck shafreeck self-assigned this Jul 31, 2019
@zyxbest
Copy link

zyxbest commented Jul 31, 2019

/run-integration-ddl-test

@shafreeck shafreeck changed the title schedule: Do not send an operator of a region wth a stale epoch WIP: schedule: Do not send an operator of a region wth a stale epoch Jul 31, 2019
@shafreeck shafreeck changed the title WIP: schedule: Do not send an operator of a region wth a stale epoch schedule: Do not send an operator of a region wth a stale epoch Aug 1, 2019
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
@shafreeck
Copy link
Contributor Author

/rebuild

@shafreeck
Copy link
Contributor Author

/run-integration-ddl-test

…ool"

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
@codecov-io
Copy link

codecov-io commented Aug 14, 2019

Codecov Report

Merging #1659 into master will decrease coverage by 0.03%.
The diff coverage is 81.13%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1659      +/-   ##
==========================================
- Coverage    76.7%   76.66%   -0.04%     
==========================================
  Files         158      158              
  Lines       15527    15569      +42     
==========================================
+ Hits        11910    11936      +26     
- Misses       2600     2612      +12     
- Partials     1017     1021       +4
Impacted Files Coverage Δ
server/schedule/operator_controller.go 89.37% <100%> (+0.3%) ⬆️
pkg/mock/mockhbstream/mockhbstream.go 86.48% <100%> (+0.77%) ⬆️
pkg/mock/mockcluster/mockcluster.go 91.75% <100%> (+0.08%) ⬆️
server/schedule/operator/operator.go 86% <60%> (-1.26%) ⬇️
pkg/etcdutil/etcdutil.go 76.81% <0%> (-11.6%) ⬇️
server/member/leader.go 76.53% <0%> (-2.05%) ⬇️
server/checker/replica_checker.go 80% <0%> (-1.88%) ⬇️
server/handler.go 49.6% <0%> (-0.53%) ⬇️
server/cluster.go 82.67% <0%> (-0.26%) ⬇️
server/grpc_service.go 59.56% <0%> (+0.43%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54975cd...2c2ce00. Read the comment docs.

Copy link
Member

@rleungx rleungx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest LGTM.

server/schedule/operator_controller_test.go Outdated Show resolved Hide resolved
c.Assert(op.ConfVerChanged(), Equals, 1)
c.Assert(len(stream.MsgCh()), Equals, 2)

// add and disaptch op again, the op should be stale
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// add and disaptch op again, the op should be stale
// add and dispatch op again, the op should be stale

Co-Authored-By: Ryan Leung <rleungx@gmail.com>
@shafreeck
Copy link
Contributor Author

/run-all-tests

@rleungx
Copy link
Member

rleungx commented Aug 22, 2019

There is still a type which needs to be fixed. @shafreeck

shafreeck and others added 4 commits August 23, 2019 00:05
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>
@shafreeck shafreeck merged commit 141496b into tikv:master Aug 23, 2019
Luffbee pushed a commit that referenced this pull request Aug 27, 2019
* *: unify get store function everywhere (#1671)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

*  server: use leader lease to determine tso service validity (#1676)

Signed-off-by: disksing <i@disksing.com>

* test: fix tests (#1696)

* test: fix region syncer test

Signed-off-by: disksing <i@disksing.com>

* add config-check flag for pd-server (#1695)

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* operator: rewrite move region related functions (#1667)

* *: support setting endKey for ScanRange (#1700)

Signed-off-by: disksing <i@disksing.com>

* *: reduce some unnecessary parameters (#1698)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: Do not send an operator of a region wth a stale epoch (#1659)

* schedule: Do not send an operator of a region wth a stale epoch

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: check the version changed by the operator self

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix unit test

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix to avoid dispatching a stale opstep

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refactor "ConsumeConfVer() int" to "ExpectConfVerChange() bool"

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: fix typo in comment

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* fix typo

Co-Authored-By: Ryan Leung <rleungx@gmail.com>

* dispatch: fix unittest

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refine format

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* server: fix the dead lock in scatter region (#1706)

Signed-off-by: Ryan Leung <rleungx@gmail.com>
Luffbee added a commit that referenced this pull request Sep 9, 2019
* *: unify get store function everywhere (#1671)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* remove unnecessary parentheses

*  server: use leader lease to determine tso service validity (#1676)

Signed-off-by: disksing <i@disksing.com>

* change internal stat values to float64

* add pending operator influence

* add metrics of pending influence

* fix metrics

* fix panic

* adjust pending influence of balanceHotWrite

* change weight of pending influence

* test: fix tests (#1696)

* test: fix region syncer test

Signed-off-by: disksing <i@disksing.com>

* decrease region rolling window; store pending influence in scheduler

* add config-check flag for pd-server (#1695)

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* decrease possiblility transfer hot write leader

* change pending influence weight

* add unstarted op metrics

* add logs for debug

* add log for debug

* add logs for debug

* add logs for debug

* add logs for debug

* add logs for debug

* add logs for debug

* add logs for debug

* Revert "add logs for debug"

This reverts commit e74c7a9.

* add metrics for hotspot operators

* operator: rewrite move region related functions (#1667)

* add metrics for pending operators

* *: support setting endKey for ScanRange (#1700)

Signed-off-by: disksing <i@disksing.com>

* fix bug

* fix bug

* fix bug

* fix metrics thread-safe bug

* fix logic bug

* *: reduce some unnecessary parameters (#1698)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: Do not send an operator of a region wth a stale epoch (#1659)

* schedule: Do not send an operator of a region wth a stale epoch

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: check the version changed by the operator self

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix unit test

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix to avoid dispatching a stale opstep

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refactor "ConsumeConfVer() int" to "ExpectConfVerChange() bool"

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: fix typo in comment

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* fix typo

Co-Authored-By: Ryan Leung <rleungx@gmail.com>

* dispatch: fix unittest

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refine format

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* server: fix the dead lock in scatter region (#1706)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* add drop time for operator

* use IsDropped to recognize canceled ops

* try to fix trans leader burst

* try to fix trans leader burst

* add zombie influence

* change select src dst strategy; improve op_controller

* change select src strategy

* fix bug

* tools: fix set namespace in pd-ctl (#1701)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tools: fix parse url without http prefix (#1703)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tests: support deadlock detection in make test (#1704)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* Makefile: fix failpoint enable (#1722)

Signed-off-by: nolouch <nolouch@gmail.com>

* checker: fix the issue that a region does not merge to the sibling with smaller size (#1723)

Signed-off-by: disksing <i@disksing.com>

* tools: balance region simulator (#1708)

* scheduler: do not remove the operator when the step does not finish (#1715)

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* operator: fix the AddLearner config version judgment (#1732)

Signed-off-by: nolouch <nolouch@gmail.com>

* tools: fix TLS in pd control (#1729)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* syncer: support TLS for region syncer (#1728)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: fix a thread-safe bug and improve code (#1719)
Luffbee added a commit that referenced this pull request Sep 11, 2019
* *: unify get store function everywhere (#1671)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

*  server: use leader lease to determine tso service validity (#1676)

Signed-off-by: disksing <i@disksing.com>

* test: fix tests (#1696)

* test: fix region syncer test

Signed-off-by: disksing <i@disksing.com>

* add config-check flag for pd-server (#1695)

Signed-off-by: cwen0 <cwenyin0@gmail.com>

* operator: rewrite move region related functions (#1667)

* *: support setting endKey for ScanRange (#1700)

Signed-off-by: disksing <i@disksing.com>

* *: reduce some unnecessary parameters (#1698)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: Do not send an operator of a region wth a stale epoch (#1659)

* schedule: Do not send an operator of a region wth a stale epoch

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: check the version changed by the operator self

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix unit test

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* schedule: fix to avoid dispatching a stale opstep

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refactor "ConsumeConfVer() int" to "ExpectConfVerChange() bool"

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: fix typo in comment

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* fix typo

Co-Authored-By: Ryan Leung <rleungx@gmail.com>

* dispatch: fix unittest

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* dispatch: refine format

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* server: fix the dead lock in scatter region (#1706)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tools: fix set namespace in pd-ctl (#1701)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tools: fix parse url without http prefix (#1703)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* tests: support deadlock detection in make test (#1704)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* Makefile: fix failpoint enable (#1722)

Signed-off-by: nolouch <nolouch@gmail.com>

* checker: fix the issue that a region does not merge to the sibling with smaller size (#1723)

Signed-off-by: disksing <i@disksing.com>

* tools: balance region simulator (#1708)

* scheduler: do not remove the operator when the step does not finish (#1715)

Signed-off-by: Shafreeck Sea <shafreeck@gmail.com>

* operator: fix the AddLearner config version judgment (#1732)

Signed-off-by: nolouch <nolouch@gmail.com>

* tools: fix TLS in pd control (#1729)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* syncer: support TLS for region syncer (#1728)

Signed-off-by: Ryan Leung <rleungx@gmail.com>

* schedule: fix a thread-safe bug and improve code (#1719)

* statistics: fix region flow calculation (#1688)

Signed-off-by: jiyingtk <jiyingtk@mail.ustc.edu.cn>

* makefile: improve deadlock-enable/disable (#1736)

* api: fix missing keys statistic in region information (#1741)

Signed-off-by: nolouch <nolouch@gmail.com>

* *: update go version to 1.13 (#1742)

Signed-off-by: disksing <i@disksing.com>

* coordinator: add the operator cost time in log field (#1748)

Signed-off-by: nolouch <nolouch@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The number of peers to remove may exceed expectation when timeout happends
5 participants