
tests linearizability: trigger snapshot related failpoints #15104

Merged: 1 commit merged into etcd-io:main on Feb 6, 2023

Conversation

@lavacat commented Jan 13, 2023

Fixes #14726, except raftBeforeFollowerSend.

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>

@serathius (Member) commented Jan 14, 2023

Looks promising, great work! Suggestions:

  • Use network blackholing to simulate a member's revision falling behind.
  • Use the HTTP failpoint endpoint for now, since it's simpler and doesn't require restarting the binary (see the sketch after this list).
  • Let's not reduce the snapshot count below 100. That's already low, risking performance degradation.
  • Instead of connecting to each member, we can just connect to the whole cluster.
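A minimal sketch of what triggering a failpoint over the gofail HTTP endpoint could look like, assuming the member was started with GOFAIL_HTTP set (as in the Procfile later in this thread). The helper, port, and failpoint name are illustrative, not code from this PR:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// activateFailpoint arms a gofail failpoint over the HTTP endpoint
// exposed when the server is started with GOFAIL_HTTP. A PUT to
// /<failpoint-name> with a term such as `panic` activates it.
// Hypothetical helper; names and port are illustrative.
func activateFailpoint(host, name, term string) error {
	url := fmt.Sprintf("http://%s/%s", host, name)
	req, err := http.NewRequest(http.MethodPut, url, strings.NewReader(term))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("failed to activate %q: %s", name, resp.Status)
	}
	return nil
}

func main() {
	// The address matches the GOFAIL_HTTP value a member could be started with.
	if err := activateFailpoint("127.0.0.1:11180", "raftBeforeSaveSnap", "panic"); err != nil {
		fmt.Println(err)
	}
}
```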

I managed to get it working. Please take a look: serathius@a13adec

What is left is to integrate the snapshot failpoints into the existing tests. Since snapshot failpoints can only be triggered on a follower, they cannot be enabled in a single-node cluster. You should be able to skip those failpoints by changing their Available method, as sketched below.
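A hedged sketch of that skip, under the assumption that failpoints in the linearizability suite expose an Available method; followerFailpoint and the int parameter are simplifications, not the actual signature:

```go
package failpoint

// followerFailpoint is a simplified stand-in for the failpoint type
// in tests/linearizability; the real interface may differ.
type followerFailpoint struct {
	name    string
	payload string
}

// Available reports whether this failpoint can be triggered on the
// given cluster. Snapshot failpoints fire only on a follower, and a
// single-node cluster has no follower, so they are skipped there.
func (f followerFailpoint) Available(clusterSize int) bool {
	return clusterSize > 1
}
```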

@codecov-commenter
Codecov Report

Merging #15104 (a13adec) into main (6315f1c) will decrease coverage by 0.13%.
The diff coverage is n/a.

❗ Current head a13adec differs from pull request most recent head f877cd3. Consider uploading reports for the commit f877cd3 to get more accurate results.

@@            Coverage Diff             @@
##             main   #15104      +/-   ##
==========================================
- Coverage   74.75%   74.62%   -0.14%     
==========================================
  Files         415      415              
  Lines       34341    34341              
==========================================
- Hits        25673    25628      -45     
- Misses       7038     7075      +37     
- Partials     1630     1638       +8     
Flag Coverage Δ
all 74.62% <ø> (-0.14%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
client/v3/leasing/util.go 88.33% <0.00%> (-10.00%) ⬇️
client/pkg/v3/tlsutil/tlsutil.go 83.33% <0.00%> (-8.34%) ⬇️
server/etcdserver/api/v3rpc/interceptor.go 73.43% <0.00%> (-4.17%) ⬇️
client/v3/leasing/txn.go 88.09% <0.00%> (-3.18%) ⬇️
client/v3/experimental/recipes/key.go 75.34% <0.00%> (-2.74%) ⬇️
client/v3/experimental/recipes/double_barrier.go 68.83% <0.00%> (-2.60%) ⬇️
client/v3/concurrency/election.go 79.68% <0.00%> (-2.35%) ⬇️
server/etcdserver/api/rafthttp/msgappv2_codec.go 71.30% <0.00%> (-1.74%) ⬇️
client/v3/client.go 82.33% <0.00%> (-1.27%) ⬇️
pkg/adt/interval_tree.go 85.96% <0.00%> (-1.26%) ⬇️
... and 16 more


@serathius (Member)
ping @lavacat, are you planning to finish this?

@lavacat (Author) commented Jan 20, 2023

@serathius yes, sorry, I got sidetracked on another issue.
Thanks for the review; your suggestions make sense to me. I'm trying to debug why we need to wait for 1000 revs.

@serathius (Member)
ping, this change is blocking #15102

@lavacat (Author) commented Jan 24, 2023

@serathius I've enabled debug logs for the proxy, and it reports the correct info: 0 B of modified data.

logger.go:130: 2023-01-23T23:34:44.833Z	DEBUG	modified rx	{"data-received": "51 B", "data-modified": "0 B", "from": "tcp://localhost:20011", "to": "tcp://localhost:20013"}

But the blackholed process still receives non-empty entries in applyAll, and appliedi keeps increasing. This goes on for ~1s, then stops. Still looking for the root cause.

@serathius (Member)
Please let me know if you need any help.

@lavacat (Author) commented Jan 26, 2023

@serathius if you have time, please take a look. I'll continue investigating as well. It seems to me that after blackholing traffic, the peer should get deactivated and the logs should show failed to dial errors. I don't see that.

@serathius (Member)
I would guess the reason is that blackholing just drops packets but doesn't interrupt the connection.
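For illustration, a toy forwarder showing why that distinction matters: while blackholed it reads and discards bytes, so the TCP connection stays established and the peer sees silence rather than a dial error. The ports come from the proxy log above; everything else is a sketch, not etcd's proxy code:

```go
package main

import (
	"io"
	"log"
	"net"
	"sync/atomic"
)

// blackhole toggles whether forwarded bytes are delivered or dropped.
var blackhole atomic.Bool

// forward copies src to dst; while blackholed it reads and discards
// bytes instead, keeping src's TCP connection established, so the
// sender sees silence rather than a connection error.
func forward(dst io.Writer, src io.Reader) {
	buf := make([]byte, 32*1024)
	for {
		n, err := src.Read(buf)
		if err != nil {
			return
		}
		if blackhole.Load() {
			continue // drop the bytes, keep the connection alive
		}
		if _, err := dst.Write(buf[:n]); err != nil {
			return
		}
	}
}

func main() {
	// Ports match the proxy log earlier in the thread.
	ln, err := net.Listen("tcp", "127.0.0.1:20011")
	if err != nil {
		log.Fatal(err)
	}
	for {
		in, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		out, err := net.Dial("tcp", "127.0.0.1:20013")
		if err != nil {
			log.Fatal(err)
		}
		go forward(out, in)
		go forward(in, out)
	}
}
```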

@lavacat (Author) commented Jan 28, 2023

OK, I think the explanation of this issue is that cutting traffic between the --initial-advertise-peer-urls and --listen-peer-urls of one node isn't enough to fully isolate it from the rest of the cluster. It will still establish connections to the peers listed in --initial-cluster.

Here is an example Procfile to reproduce this: etcd1 has different --listen-peer-urls and --initial-advertise-peer-urls, and no proxy is set up. etcd1 will still be applying puts.

stress-put: tools/benchmark/benchmark --endpoints=127.0.0.1:22379,127.0.0.1:32379 --clients=27 --conns=3 put --sequential-keys --key-space-size=100000 --total=100000

etcd1: GOFAIL_HTTP="127.0.0.1:11180" bin/etcd --name infra1 --snapshot-count=1000 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:11111 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:11111,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof
etcd2: GOFAIL_HTTP="127.0.0.1:22280" bin/etcd --name infra2 --snapshot-count=1000 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:11111,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof
etcd3: GOFAIL_HTTP="127.0.0.1:33380" bin/etcd --name infra3 --snapshot-count=1000 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:11111,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof
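(This layout mirrors etcd's own Procfile, which is typically run with goreman, e.g. goreman -f Procfile start, assuming bin/etcd and tools/benchmark/benchmark are already built.)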

@lavacat (Author) commented Jan 30, 2023

Each peer X registers a writer with the other peers, so they can write messages addressed to X.
When peer Y is blackholed, X can't register a writer with Y. But there is still a working pipeline that Y created to X.
So when Y tries to send a message to X, the pipeline is picked instead of the usual writer.

That's my understanding based on debugging. I think this isn't a bug, but there isn't much documentation around the design.

I'm still not sure why blackholing eventually takes effect. But to fix the test we need to blackhole the pipeline traffic somehow.
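A toy sketch of that routing, with simplified types (this is my mental model, not the verbatim rafthttp code):

```go
package main

import "fmt"

type message struct{ to uint64 }

// peer sketches the two transport paths described above: a stream
// writer that the remote peer registers with us, and a pipeline
// connection we dialed ourselves.
type peer struct {
	writerActive bool
	streamc      chan message // writer registered by the remote peer
	pipelinec    chan message // pipeline dialed from this side
}

// pick prefers the stream writer; when the remote peer is blackholed
// and never registered one, messages fall back to the pipeline.
func (p *peer) pick() (chan message, string) {
	if p.writerActive {
		return p.streamc, "stream writer"
	}
	return p.pipelinec, "pipeline"
}

func main() {
	p := &peer{streamc: make(chan message, 1), pipelinec: make(chan message, 1)}
	ch, route := p.pick()
	ch <- message{to: 1}
	fmt.Println("routed via", route) // "pipeline": no writer registered
}
```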

@serathius (Member)
Great job on the investigation! Seems like an issue with using a proxy for network blackholing; however, we can still work around it.

Let's focus on getting the PR merged: leave 1000 for now and create a follow-up issue.
If we can trigger the snapshot failpoints consistently, we should be OK. Could you rerun the test 100 times to confirm?

@lavacat (Author) commented Feb 1, 2023

@serathius I made a small improvement: instead of waiting for 1000 revs, check revLeader - revBlackholedMember.
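A minimal sketch of that check, assuming clientv3's Maintenance.Status API; the endpoints, gap value, and helper name are illustrative:

```go
package check

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// waitForRevisionGap polls member statuses until the leader's revision
// is at least gap ahead of the blackholed member's, which (with a low
// --snapshot-count) forces the member to recover via snapshot.
func waitForRevisionGap(ctx context.Context, c *clientv3.Client, leaderEp, blackholedEp string, gap int64) error {
	for {
		leader, err := c.Status(ctx, leaderEp)
		if err != nil {
			return err
		}
		blackholed, err := c.Status(ctx, blackholedEp)
		if err != nil {
			return err
		}
		if leader.Header.Revision-blackholed.Header.Revision >= gap {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
}
```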

In terms of stability, I sometimes get Model is not linearizable due to panic: interface conversion: interface {} is nil, not model.EtcdRequest. I need to investigate why, but otherwise snapshot triggering is consistent in my local runs.

    linearizability_test.go:360: Model is not linearizable
    logger.go:130: 2023-02-01T06:37:39.145Z	INFO	Saving operation history	{"path": "/tmp/linearizability/TestLinearizability_Snapshot/history.json"}
    logger.go:130: 2023-02-01T06:37:39.187Z	INFO	Saving member data dir	{"member": "TestLinearizabilitySnapshot-test-0", "path": "/tmp/linearizability/TestLinearizability_Snapshot/TestLinearizabilitySnapshot-test-0"}
    logger.go:130: 2023-02-01T06:37:39.193Z	INFO	Saving member data dir	{"member": "TestLinearizabilitySnapshot-test-1", "path": "/tmp/linearizability/TestLinearizability_Snapshot/TestLinearizabilitySnapshot-test-1"}
    logger.go:130: 2023-02-01T06:37:39.199Z	INFO	Saving member data dir	{"member": "TestLinearizabilitySnapshot-test-2", "path": "/tmp/linearizability/TestLinearizability_Snapshot/TestLinearizabilitySnapshot-test-2"}
    logger.go:130: 2023-02-01T06:37:39.199Z	INFO	Saving visualization	{"path": "/tmp/linearizability/TestLinearizability_Snapshot/history.html"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	closing test cluster...
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-0"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-1"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-2"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	closing server...	{"name": "TestLinearizabilitySnapshot-test-0"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-0"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	removing directory	{"data-dir": "/tmp/TestLinearizabilitySnapshot3613164201/001"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	closing server...	{"name": "TestLinearizabilitySnapshot-test-1"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-1"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	removing directory	{"data-dir": "/tmp/TestLinearizabilitySnapshot3613164201/002"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	closing server...	{"name": "TestLinearizabilitySnapshot-test-2"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-2"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	removing directory	{"data-dir": "/tmp/TestLinearizabilitySnapshot3613164201/003"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	closed test cluster.
--- FAIL: TestLinearizability (15.84s)
    --- FAIL: TestLinearizability/Snapshot (15.84s)
panic: interface conversion: interface {} is nil, not model.EtcdRequest [recovered]
	panic: interface conversion: interface {} is nil, not model.EtcdRequest

goroutine 74 [running]:
testing.tRunner.func1.2({0xf48c60, 0xc003b354a0})
	/usr/local/go/src/testing/testing.go:1396 +0x24e
testing.tRunner.func1()
	/usr/local/go/src/testing/testing.go:1399 +0x39f
panic({0xf48c60, 0xc003b354a0})
	/usr/local/go/src/runtime/panic.go:884 +0x212
go.etcd.io/etcd/tests/v3/linearizability/model.glob..func3({0x0?, 0x0?}, {0x1026040?, 0xc0013a8fc0?})
	/go/src/go.etcd.io/etcd/tests/linearizability/model/model.go:53 +0x188
github.com/anishathalye/porcupine.computeVisualizationData({0x1112218, 0x1112210, 0x11132b8, 0x11132c0, 0x1112220, 0x11132c8, 0x1112208}, {{0xc001602078, 0x1, 0x1}, ...})
	/go/pkg/mod/github.com/anishathalye/porcupine@v0.1.4/visualization.go:50 +0x94f
github.com/anishathalye/porcupine.Visualize({0x0, 0x0, 0x11132b8, 0x11132c0, 0x0, 0x11132c8, 0x0}, {{0xc001602078, 0x1, 0x1}, ...}, ...)
	/go/pkg/mod/github.com/anishathalye/porcupine@v0.1.4/visualization.go:102 +0xa5
github.com/anishathalye/porcupine.VisualizePath({0x0, 0x0, 0x11132b8, 0x11132c0, 0x0, 0x11132c8, 0x0}, {{0xc001602078, 0x1, 0x1}, ...}, ...)
	/go/pkg/mod/github.com/anishathalye/porcupine@v0.1.4/visualization.go:122 +0x16e
go.etcd.io/etcd/tests/v3/linearizability.checkOperationsAndPersistResults(0xc000560680, 0x23c6?, {0xc001838000, 0x23c5, 0x23c6}, 0x2?)
	/go/src/go.etcd.io/etcd/tests/linearizability/linearizability_test.go:372 +0x55a
go.etcd.io/etcd/tests/v3/linearizability.TestLinearizability.func1(0xc000560680)
	/go/src/go.etcd.io/etcd/tests/linearizability/linearizability_test.go:134 +0x3b7
testing.tRunner(0xc000560680, 0xc0002de330)
	/usr/local/go/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:1493 +0x35f
FAIL	go.etcd.io/etcd/tests/v3/linearizability	16.268s
FAIL
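For context on this panic class: a plain type assertion on a nil interface{} panics, while the comma-ok form does not. A standalone reproduction, with a local EtcdRequest standing in for model.EtcdRequest:

```go
package main

import "fmt"

type EtcdRequest struct{ Key string }

func main() {
	var in interface{} // nil, like an operation whose request was never recorded

	// Comma-ok assertion: safe, ok is false for a nil interface.
	if req, ok := in.(EtcdRequest); ok {
		fmt.Println("request:", req.Key)
	} else {
		fmt.Println("no request recorded")
	}

	// Plain assertion: panics with
	// "interface conversion: interface {} is nil, not main.EtcdRequest".
	_ = in.(EtcdRequest)
}
```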

@serathius (Member)
In terms of stability, I sometimes get Model is not linearizable due to panic: interface conversion: interface {} is nil, not model.EtcdRequest. I need to investigate why, but otherwise snapshot triggering is consistent in my local runs.

That's great. The error you linked is usually connected to incorrect history merging or patching. Still, it's weird, as I thought I had addressed all of those.

@serathius (Member)
Please let me know when the PR is ready for review.

@lavacat (Author) commented Feb 3, 2023

@serathius I've run into another Model is not linearizable. In this case there were no interface {} is nil, not model.EtcdRequest errors (see the html in Archive.zip).

Archive.zip

Otherwise the PR is ready. Maybe we can merge but disable this scenario while I debug the Model is not linearizable problem.

@serathius (Member)
Otherwise the PR is ready. Maybe we can merge but disable this scenario while I debug the Model is not linearizable problem.

Sounds good. Please comment the scenario out and leave a TODO to fix and restore it.

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>

@lavacat marked this pull request as ready for review February 5, 2023 08:04
@lavacat (Author) commented Feb 5, 2023

@serathius I think the functional test failure is a flake. Documented it in #14826 (comment)

Can you please trigger a rerun of the tests?

@serathius merged commit dc51cf1 into etcd-io:main on Feb 6, 2023
@serathius (Member)
@lavacat Looked at the file you sent; it looks like a data inconsistency. Will investigate it more.

Successfully merging this pull request may close these issues:

  • Implement remaining gofailpoints in linearizability tests (#14726)