Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture restart: tikv reported duplicated request to the same region, which is not expected #2386

Closed
Tammyxia opened this issue Jul 27, 2021 · 3 comments · Fixed by #3019
Closed
Assignees
Labels
affects-4.0 affects-5.0 affects-5.1 affects-5.2 area/ticdc Issues or PRs related to TiCDC. component/kv-client TiKV kv log client component. severity/major type/bug The issue is confirmed as a bug.

Comments

@Tammyxia
Copy link

Tammyxia commented Jul 27, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.
  • 2xCDC server, 2x changefeed, 200 tables.
    [root@IDC-172-16-xxx~]# tiup cdc:v4.0.14 cli changefeed list --pd=http://172.16.xxx:2379
    Starting component cdc: /root/.tiup/components/cdc/v4.0.14/cdc cli changefeed list --pd=http://172.16.xxx:2379
    [
    {
    "id": "kafka-task-11",
    "summary": {
    "state": "normal",
    "tso": 426606071522525189,
    "checkpoint": "2021-07-27 16:03:34.536",
    "error": null
    }
    },
    {
    "id": "replication-task-11",
    "summary": {
    "state": "normal",
    "tso": 426606071719133316,
    "checkpoint": "2021-07-27 16:03:35.286",
    "error": null
    }
    }
    ]

$ tiup cdc:v4.0.14 cli capture list --pd=http://172.16.xxx:2379
Starting component cdc: /root/.tiup/components/cdc/v4.0.14/cdc cli capture list --pd=http://172.16.xxx:2379
[
{
"id": "3378b726-26cb-4963-8280-3ee679024a76",
"is-owner": false,
"address": "172.16.xxx2:8300"
},
{
"id": "d2600da4-cdc4-4420-84a4-e57e826ffbc7",
"is-owner": true,
"address": "172.16.xxx:8300"
}
]

  • Load data: $ bin/go-ycsb run mysql -P ./workloads/oncall2929 -p mysql.host=172.16.xxx -p mysql.port=4000 -p operationcount=5000000 --threads=10
    $ bin/go-ycsb run mysql -P ./workloads/oncall2929 -p mysql.host=172.16.xxx -p mysql.port=4000 -p operationcount=5000000 --threads=200 -p databasename=test123 -p tablename=test222
  1. What did you expect to see?
    No error
  2. What did you see instead?
    cdc log has panic:

[2021/07/27 16:04:34.191 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=4552] [regionID=1448181] [startKey=748000000000000eff425f698000000000ff0000020400000000ff05f7bc4c07800034ffa8e4eeb8000419a9ff9a00000000000000fd] [endKey=748000000000000eff425f698000000000ff0000020400000000ff05f818db07800034ff8cf46008000419a9ff9a00000000000000fd] [checkpointTs=426606086831734920]
[2021/07/27 16:04:34.191 +08:00] [INFO] [client.go:858] ["start new request"] [request="{"header":{"cluster_id":6985429421391688222,"ticdc_version":"4.0.14"},"region_id":1448181,"region_epoch":{"conf_ver":1097,"version":5725},"checkpoint_ts":426606086831734920,"start_key":"dIAAAAAAAA7/Ql9pgAAAAAD/AAACBAAAAAD/Bfe8TAeAADT/qOTuuAAEGan/mgAAAAAAAAD9","end_key":"dIAAAAAAAA7/Ql9pgAAAAAD/AAACBAAAAAD/BfgY2weAADT/jPRgCAAEGan/mgAAAAAAAAD9","request_id":656854}"] [addr=172.16.6.11:20170]
[2021/07/27 16:04:34.191 +08:00] [INFO] [client.go:1345] ["region state entry will be replaced because received message of newer requestID"] [regionID=1448181] [oldRequestID=656846] [requestID=656854] [addr=172.16.6.11:20170]
[2021/07/27 16:04:34.192 +08:00] [INFO] [client.go:1039] ["EventFeed disconnected"] [regionID=1448181] [requestID=656854] [span="[748000000000000eff425f698000000000ff0000020400000000ff05f7bc4c07800034ffa8e4eeb8000419a9ff9a00000000000000fd, 748000000000000eff425f698000000000ff0000020400000000ff05f818db07800034ff8cf46008000419a9ff9a00000000000000fd)"] [checkpoint=426606086831734920] [error="[CDC:ErrEventFeedEventError]duplicate_request:<region_id:1448181 > "]
[2021/07/27 16:04:34.192 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=4552] [regionID=1448181] [startKey=748000000000000eff425f698000000000ff0000020400000000ff05f7bc4c07800034ffa8e4eeb8000419a9ff9a00000000000000fd] [endKey=748000000000000eff425f698000000000ff0000020400000000ff05f818db07800034ff8cf46008000419a9ff9a00000000000000fd] [checkpointTs=426606086831734920]
[2021/07/27 16:04:34.192 +08:00] [PANIC] [client.go:1173] ["tikv reported duplicated request to the same region, which is not expected"] [regionID=1448181] [stack="github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).handleError\n\tgithub.mirror.nvdadr.com/pingcap/ticdc@/cdc/kv/client.go:1173\ngithub.mirror.nvdadr.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func4\n\tgithub.mirror.nvdadr.com/pingcap/ticdc@/cdc/kv/client.go:630\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57"]
[2021/07/27 16:04:51.532 +08:00] [INFO] [util.go:99] ["init log"] [file=/data2/cdcup/cdc-8300/log/cdc.log] [level=info]
[2021/07/27 16:04:51.532 +08:00] [INFO] [version.go:47] ["Welcome to Change Data Capture (CDC)"] [release-version=v4.0.14] [git-hash=5a7851967f686da896b45acd3f3e968bfe53d6bd] [git-branch=heads/refs/tags/v4.0.14] [utc-build-time="2021-07-15 05:48:23"] [go-version="go version go1.13 linux/amd64"] [failpoint-build=false]

  1. Versions of the cluster

    • Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

      v4.0.14
      
    • TiCDC version (execute cdc version):

       [release-version=v4.0.14] [git-hash=5a7851967f686da896b45acd3f3e968bfe53d6bd] [git-branch=heads/refs/tags/v4.0.14]
      
@Tammyxia Tammyxia added type/bug The issue is confirmed as a bug. severity/major labels Jul 27, 2021
@Tammyxia
Copy link
Author

tikv log:
[2021/07/27 16:04:34.191 +08:00] [ERROR] [endpoint.rs:465] ["cdc duplicate register"] [downstream_id=DownstreamID(78244)] [req_id=656854] [conn_id=ConnID(2288)] [region_id=1448181]

@overvenus overvenus added the component/kv-client TiKV kv log client component. label Jul 27, 2021
@overvenus
Copy link
Member

Another case of this issue.

# TiKV log
[2021/09/19 14:30:33.188 +08:00] [INFO] [endpoint.rs:457] ["cdc register region"] [downstream_id=DownstreamID(177133)] [req_id=2590727] [conn_id=ConnID(98)] [region_id=3603516]
[2021/09/19 14:30:33.786 +08:00] [ERROR] [endpoint.rs:449] ["cdc duplicate register"] [downstream_id=DownstreamID(177139)] [req_id=2590738] [conn_id=ConnID(98)] [region_id=3603516]

# TiCDC log
[2021/09/19 14:30:33.187 +08:00] [INFO] [client.go:859] ["start new request"] [request="{\"header\":{\"cluster_id\":6900853575174453058,\"ticdc_version\":\"4.0.13\"},\"region_id\":3603516,\"region_epoch\":{\"conf_ver\":22814,\"version\":3432},\"checkpoint_ts\":427827666839142402,\"start_key\":\"dIAAAAAAABX/z19ygAAAACr//pzHAAAAAAD6\",\"end_key\":\"dIAAAAAAABX/z2AAAAAAAAD5\",\"request_id\":2590727}"] [addr=10.19.85.4:20162]
[2021/09/19 14:30:33.208 +08:00] [INFO] [client.go:1269] ["region state entry will be replaced because received message of newer requestID"] [regionID=3603516] [oldRequestID=2590141] [requestID=2590727] [addr=10.19.85.4:20162]
[2021/09/19 14:30:33.323 +08:00] [INFO] [client.go:968] ["EventFeed disconnected"] [regionID=3603516] [requestID=2590727] [span="[7480000000000015ffcf5f72800000002afffe9cc70000000000fa, 7480000000000015ffcf60000000000000f9)"] [checkpoint=427827666839142402] [error="[CDC:ErrPrewriteNotMatch]prewrite not match, key: [1110100 10000000 0 0 0 0 0 10101 11001111 1011111 1110010 10000000 0 0 0 101011 0 100111 11000001], start-ts: 427827665449779212"]
[2021/09/19 14:30:33.323 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=42] [regionID=3603516] [startKey=7480000000000015ffcf5f72800000002afffe9cc70000000000fa] [endKey=7480000000000015ffcf60000000000000f9] [checkpointTs=427827666839142402]
[2021/09/19 14:30:33.323 +08:00] [INFO] [region_cache.go:622] ["switch region peer to next due to send request fail"] [current="region ID: 3603516, meta: id:3603516 start_key:\"t\\200\\000\\000\\000\\000\\000\\025\\377\\317_r\\200\\000\\000\\000*\\377\\376\\234\\307\\000\\000\\000\\000\\000\\372\" end_key:\"t\\200\\000\\000\\000\\000\\000\\026\\377\\034\\000\\000\\000\\000\\000\\000\\000\\370\" region_epoch:<conf_ver:22814 version:3432 > peers:<id:24641259 store_id:13 > peers:<id:24642858 store_id:6 > peers:<id:24645746 store_id:5 > , peer: id:24642858 store_id:6 , addr: 10.19.85.4:20162, idx: 1, reqStoreType: TiKvOnly, runStoreType: tikv"] [needReload=false] [error="[CDC:ErrPrewriteNotMatch]prewrite not match, key: [1110100 10000000 0 0 0 0 0 10101 11001111 1011111 1110010 10000000 0 0 0 101011 0 100111 11000001], start-ts: 427827665449779212"] [errorVerbose="[CDC:ErrPrewriteNotMatch]prewrite not match, key: [1110100 10000000 0 0 0 0 0 10101 11001111 1011111 1110010 10000000 0 0 0 101011 0 100111 11000001], start-ts: 427827665449779212
github.com/pingcap/errors.AddStack
	github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174
github.com/pingcap/errors.(*Error).GenWithStackByArgs
	github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:156
github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).singleEventFeed
	github.com/pingcap/ticdc@/cdc/kv/client.go:1562
github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).partialRegionFeed
	github.com/pingcap/ticdc@/cdc/kv/client.go:936
github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).sendRegionChangeEvent.func1
	github.com/pingcap/ticdc@/cdc/kv/client.go:1303
golang.org/x/sync/errgroup.(*Group).Go.func1
	golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57
runtime.goexit
	runtime/asm_amd64.s:1357"]
[2021/09/19 14:30:33.786 +08:00] [INFO] [client.go:1269] ["region state entry will be replaced because received message of newer requestID"] [regionID=3603516] [oldRequestID=2590727] [requestID=2590738] [addr=10.19.85.4:20162]
[2021/09/19 14:30:33.786 +08:00] [INFO] [client.go:968] ["EventFeed disconnected"] [regionID=3603516] [requestID=2590738] [span="[7480000000000015ffcf5f72800000002afffe9cc70000000000fa, 7480000000000015ffcf60000000000000f9)"] [checkpoint=427827666839142402] [error="[CDC:ErrEventFeedEventError]duplicate_request:<region_id:3603516 > "]
[2021/09/19 14:30:33.786 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=42] [regionID=3603516] [startKey=7480000000000015ffcf5f72800000002afffe9cc70000000000fa] [endKey=7480000000000015ffcf60000000000000f9] [checkpointTs=427827666839142402]
[2021/09/19 14:30:33.786 +08:00] [PANIC] [client.go:1101] ["tikv reported duplicated request to the same region, which is not expected"] [regionID=3603516] [stack="github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).handleError
	github.com/pingcap/ticdc@/cdc/kv/client.go:1101
github.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func3
	github.com/pingcap/ticdc@/cdc/kv/client.go:614
golang.org/x/sync/errgroup.(*Group).Go.func1
	golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57"]

In above log, TiCDC restarts a new feed (request id 2590738) of region 3603516 due to an internal error CDC:ErrPrewriteNotMatch, but old feed (request id 2590727) in TiKV is still alive, so TiKV reports duplicate request error.

@overvenus
Copy link
Member

There are two problems in the above comment.

  1. TiCDC reports ErrPrewriteNotMatch for unknown reason.
  2. ErrPrewriteNotMatch cause TiCDC reconnect while an old request is still alive in TiKV.

For 1, we need to open a new issue and continue to investigate.
For 2, we can a) in short term, workaround the panic by reconnection EventFeed RPC, b) in long term, let TiCDC notify TiKV to close captured regions (this may require changing EventFeed gRPC protocol).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-4.0 affects-5.0 affects-5.1 affects-5.2 area/ticdc Issues or PRs related to TiCDC. component/kv-client TiKV kv log client component. severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants