
tests linearizability: trigger snapshot related failpoints #15104

Merged: 1 commit merged into etcd-io:main on Feb 6, 2023

Conversation

@lavacat commented Jan 13, 2023

Fixes #14726, except raftBeforeFollowerSend.

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>

@serathius (Member) commented Jan 14, 2023

Looks promising, great work! Suggestions:

  • Use network blackholing to simulate a member's revision falling behind.
  • Use the HTTP failpoint endpoint for now, since it's simpler and doesn't require restarting the binary (see the sketch after this list).
  • Let's not reduce the snapshot count below 100. That's already low, risking performance degradation.
  • Instead of connecting to each member, we can just connect to the whole cluster.
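A minimal sketch of what triggering a failpoint over the gofail HTTP endpoint could look like, assuming the member was started with GOFAIL_HTTP set (as in the Procfile later in this thread). The helper, port, and failpoint name are illustrative, not code from this PR:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// activateFailpoint arms a gofail failpoint over the HTTP endpoint
// exposed when the server is started with GOFAIL_HTTP. A PUT to
// /<failpoint-name> with a term such as `panic` activates it.
// Hypothetical helper; names and port are illustrative.
func activateFailpoint(host, name, term string) error {
	url := fmt.Sprintf("http://%s/%s", host, name)
	req, err := http.NewRequest(http.MethodPut, url, strings.NewReader(term))
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("failed to activate %q: %s", name, resp.Status)
	}
	return nil
}

func main() {
	// The address matches the GOFAIL_HTTP value a member could be started with.
	if err := activateFailpoint("127.0.0.1:11180", "raftBeforeSaveSnap", "panic"); err != nil {
		fmt.Println(err)
	}
}
```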

I managed to get it working. Please take a look: serathius@a13adec

What is left is to integrate the snapshot failpoints into the existing tests. Since snapshot failpoints can only be triggered on a follower, they cannot be enabled in a single-node cluster. You should be able to skip those failpoints by changing their Available method, as sketched below.
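A hedged sketch of that skip, under the assumption that failpoints in the linearizability suite expose an Available method; followerFailpoint and the int parameter are simplifications, not the actual signature:

```go
package failpoint

// followerFailpoint is a simplified stand-in for the failpoint type
// in tests/linearizability; the real interface may differ.
type followerFailpoint struct {
	name    string
	payload string
}

// Available reports whether this failpoint can be triggered on the
// given cluster. Snapshot failpoints fire only on a follower, and a
// single-node cluster has no follower, so they are skipped there.
func (f followerFailpoint) Available(clusterSize int) bool {
	return clusterSize > 1
}
```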

@codecov-commenter
Codecov Report

Merging #15104 (a13adec) into main (6315f1c) will decrease coverage by 0.13%.
The diff coverage is n/a.

❗ Current head a13adec differs from pull request most recent head f877cd3. Consider uploading reports for the commit f877cd3 to get more accurate results.

@@            Coverage Diff             @@
##             main   #15104      +/-   ##
==========================================
- Coverage   74.75%   74.62%   -0.14%     
==========================================
  Files         415      415              
  Lines       34341    34341              
==========================================
- Hits        25673    25628      -45     
- Misses       7038     7075      +37     
- Partials     1630     1638       +8     
Flag Coverage Δ
all 74.62% <ø> (-0.14%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
client/v3/leasing/util.go 88.33% <0.00%> (-10.00%) ⬇️
client/pkg/v3/tlsutil/tlsutil.go 83.33% <0.00%> (-8.34%) ⬇️
server/etcdserver/api/v3rpc/interceptor.go 73.43% <0.00%> (-4.17%) ⬇️
client/v3/leasing/txn.go 88.09% <0.00%> (-3.18%) ⬇️
client/v3/experimental/recipes/key.go 75.34% <0.00%> (-2.74%) ⬇️
client/v3/experimental/recipes/double_barrier.go 68.83% <0.00%> (-2.60%) ⬇️
client/v3/concurrency/election.go 79.68% <0.00%> (-2.35%) ⬇️
server/etcdserver/api/rafthttp/msgappv2_codec.go 71.30% <0.00%> (-1.74%) ⬇️
client/v3/client.go 82.33% <0.00%> (-1.27%) ⬇️
pkg/adt/interval_tree.go 85.96% <0.00%> (-1.26%) ⬇️
... and 16 more


@serathius (Member)
ping @lavacat, are you planning to finish this?

@lavacat (Author) commented Jan 20, 2023

@serathius yes, sorry, I got sidetracked on another issue.
Thanks for the review; your suggestions make sense to me. I'm trying to debug why we need to wait for 1000 revs.

@serathius (Member)
ping, this change is blocking #15102

@lavacat (Author) commented Jan 24, 2023

@serathius I've enabled debug logs for the proxy, and it reports the correct info: 0 B of modified data.

logger.go:130: 2023-01-23T23:34:44.833Z	DEBUG	modified rx	{"data-received": "51 B", "data-modified": "0 B", "from": "tcp://localhost:20011", "to": "tcp://localhost:20013"}

But the blackholed process still receives non-empty entries in applyAll, and appliedi keeps increasing. This goes on for ~1s, then stops. Still looking for the root cause.

@serathius (Member)
Please let me know if you need any help.

@lavacat (Author) commented Jan 26, 2023

@serathius if you have time, please take a look. I'll continue investigating as well. It seems to me that after blackholing traffic, the peer should get deactivated and the logs should show failed to dial errors. I don't see that.

@serathius (Member)
I would guess the reason is that blackholing just drops packets but doesn't interrupt the connection.
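For illustration, a toy forwarder showing why that distinction matters: while blackholed it reads and discards bytes, so the TCP connection stays established and the peer sees silence rather than a dial error. The ports come from the proxy log above; everything else is a sketch, not etcd's proxy code:

```go
package main

import (
	"io"
	"log"
	"net"
	"sync/atomic"
)

// blackhole toggles whether forwarded bytes are delivered or dropped.
var blackhole atomic.Bool

// forward copies src to dst; while blackholed it reads and discards
// bytes instead, keeping src's TCP connection established, so the
// sender sees silence rather than a connection error.
func forward(dst io.Writer, src io.Reader) {
	buf := make([]byte, 32*1024)
	for {
		n, err := src.Read(buf)
		if err != nil {
			return
		}
		if blackhole.Load() {
			continue // drop the bytes, keep the connection alive
		}
		if _, err := dst.Write(buf[:n]); err != nil {
			return
		}
	}
}

func main() {
	// Ports match the proxy log earlier in the thread.
	ln, err := net.Listen("tcp", "127.0.0.1:20011")
	if err != nil {
		log.Fatal(err)
	}
	for {
		in, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		out, err := net.Dial("tcp", "127.0.0.1:20013")
		if err != nil {
			log.Fatal(err)
		}
		go forward(out, in)
		go forward(in, out)
	}
}
```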

@lavacat (Author) commented Jan 28, 2023

OK, I think the explanation of this issue is that cutting traffic between the --initial-advertise-peer-urls and --listen-peer-urls of one node isn't enough to fully isolate it from the rest of the cluster. It will still establish connections to the peers listed in --initial-cluster.

Here is an example Procfile to reproduce this: etcd1 has different --listen-peer-urls and --initial-advertise-peer-urls, and no proxy is set up. etcd1 will still be applying puts.

stress-put: tools/benchmark/benchmark --endpoints=127.0.0.1:22379,127.0.0.1:32379 --clients=27 --conns=3 put --sequential-keys --key-space-size=100000 --total=100000

etcd1: GOFAIL_HTTP="127.0.0.1:11180" bin/etcd --name infra1 --snapshot-count=1000 --listen-client-urls http://127.0.0.1:2379 --advertise-client-urls http://127.0.0.1:2379 --listen-peer-urls http://127.0.0.1:12380 --initial-advertise-peer-urls http://127.0.0.1:11111 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:11111,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof
etcd2: GOFAIL_HTTP="127.0.0.1:22280" bin/etcd --name infra2 --snapshot-count=1000 --listen-client-urls http://127.0.0.1:22379 --advertise-client-urls http://127.0.0.1:22379 --listen-peer-urls http://127.0.0.1:22380 --initial-advertise-peer-urls http://127.0.0.1:22380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:11111,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof
etcd3: GOFAIL_HTTP="127.0.0.1:33380" bin/etcd --name infra3 --snapshot-count=1000 --listen-client-urls http://127.0.0.1:32379 --advertise-client-urls http://127.0.0.1:32379 --listen-peer-urls http://127.0.0.1:32380 --initial-advertise-peer-urls http://127.0.0.1:32380 --initial-cluster-token etcd-cluster-1 --initial-cluster 'infra1=http://127.0.0.1:11111,infra2=http://127.0.0.1:22380,infra3=http://127.0.0.1:32380' --initial-cluster-state new --enable-pprof
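(This layout mirrors etcd's own Procfile, which is typically run with goreman, e.g. goreman -f Procfile start, assuming bin/etcd and tools/benchmark/benchmark are already built.)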

@lavacat (Author) commented Jan 30, 2023

Each peer X registers a writer with the other peers, so they can write messages addressed to X.
When peer Y is blackholed, X can't register a writer with Y. But there is still a working pipeline that Y created to X.
So when Y tries to send a message to X, the pipeline is picked instead of the usual writer.

That's my understanding based on debugging. I think this isn't a bug, but there isn't much documentation around the design.

I'm still not sure why blackholing eventually takes effect. But to fix the test we need to blackhole the pipeline traffic somehow.
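A toy sketch of that routing, with simplified types (this is my mental model, not the verbatim rafthttp code):

```go
package main

import "fmt"

type message struct{ to uint64 }

// peer sketches the two transport paths described above: a stream
// writer that the remote peer registers with us, and a pipeline
// connection we dialed ourselves.
type peer struct {
	writerActive bool
	streamc      chan message // writer registered by the remote peer
	pipelinec    chan message // pipeline dialed from this side
}

// pick prefers the stream writer; when the remote peer is blackholed
// and never registered one, messages fall back to the pipeline.
func (p *peer) pick() (chan message, string) {
	if p.writerActive {
		return p.streamc, "stream writer"
	}
	return p.pipelinec, "pipeline"
}

func main() {
	p := &peer{streamc: make(chan message, 1), pipelinec: make(chan message, 1)}
	ch, route := p.pick()
	ch <- message{to: 1}
	fmt.Println("routed via", route) // "pipeline": no writer registered
}
```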

@serathius (Member)
Great job on the investigation! Seems like an issue with using a proxy for network blackholing; however, we can still work around it.

Let's focus on getting the PR merged: leave 1000 for now and create a follow-up issue.
If we can trigger the snapshot failpoints consistently, we should be OK. Could you rerun the test 100 times to confirm?

@lavacat (Author) commented Feb 1, 2023

@serathius I made a small improvement: instead of waiting for 1000 revs, check revLeader - revBlackholedMember.
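A minimal sketch of that check, assuming clientv3's Maintenance.Status API; the endpoints, gap value, and helper name are illustrative:

```go
package check

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// waitForRevisionGap polls member statuses until the leader's revision
// is at least gap ahead of the blackholed member's, which (with a low
// --snapshot-count) forces the member to recover via snapshot.
func waitForRevisionGap(ctx context.Context, c *clientv3.Client, leaderEp, blackholedEp string, gap int64) error {
	for {
		leader, err := c.Status(ctx, leaderEp)
		if err != nil {
			return err
		}
		blackholed, err := c.Status(ctx, blackholedEp)
		if err != nil {
			return err
		}
		if leader.Header.Revision-blackholed.Header.Revision >= gap {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
}
```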

In terms of stability, I sometimes get Model is not linearizable due to panic: interface conversion: interface {} is nil, not model.EtcdRequest. I need to investigate why, but otherwise snapshot triggering is consistent in my local runs.

    linearizability_test.go:360: Model is not linearizable
    logger.go:130: 2023-02-01T06:37:39.145Z	INFO	Saving operation history	{"path": "/tmp/linearizability/TestLinearizability_Snapshot/history.json"}
    logger.go:130: 2023-02-01T06:37:39.187Z	INFO	Saving member data dir	{"member": "TestLinearizabilitySnapshot-test-0", "path": "/tmp/linearizability/TestLinearizability_Snapshot/TestLinearizabilitySnapshot-test-0"}
    logger.go:130: 2023-02-01T06:37:39.193Z	INFO	Saving member data dir	{"member": "TestLinearizabilitySnapshot-test-1", "path": "/tmp/linearizability/TestLinearizability_Snapshot/TestLinearizabilitySnapshot-test-1"}
    logger.go:130: 2023-02-01T06:37:39.199Z	INFO	Saving member data dir	{"member": "TestLinearizabilitySnapshot-test-2", "path": "/tmp/linearizability/TestLinearizability_Snapshot/TestLinearizabilitySnapshot-test-2"}
    logger.go:130: 2023-02-01T06:37:39.199Z	INFO	Saving visualization	{"path": "/tmp/linearizability/TestLinearizability_Snapshot/history.html"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	closing test cluster...
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-0"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-1"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-2"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	closing server...	{"name": "TestLinearizabilitySnapshot-test-0"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-0"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	removing directory	{"data-dir": "/tmp/TestLinearizabilitySnapshot3613164201/001"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	closing server...	{"name": "TestLinearizabilitySnapshot-test-1"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-1"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	removing directory	{"data-dir": "/tmp/TestLinearizabilitySnapshot3613164201/002"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	closing server...	{"name": "TestLinearizabilitySnapshot-test-2"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	stopping server...	{"name": "TestLinearizabilitySnapshot-test-2"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	removing directory	{"data-dir": "/tmp/TestLinearizabilitySnapshot3613164201/003"}
    logger.go:130: 2023-02-01T06:37:39.211Z	INFO	closed test cluster.
--- FAIL: TestLinearizability (15.84s)
    --- FAIL: TestLinearizability/Snapshot (15.84s)
panic: interface conversion: interface {} is nil, not model.EtcdRequest [recovered]
	panic: interface conversion: interface {} is nil, not model.EtcdRequest

goroutine 74 [running]:
testing.tRunner.func1.2({0xf48c60, 0xc003b354a0})
	/usr/local/go/src/testing/testing.go:1396 +0x24e
testing.tRunner.func1()
	/usr/local/go/src/testing/testing.go:1399 +0x39f
panic({0xf48c60, 0xc003b354a0})
	/usr/local/go/src/runtime/panic.go:884 +0x212
go.etcd.io/etcd/tests/v3/linearizability/model.glob..func3({0x0?, 0x0?}, {0x1026040?, 0xc0013a8fc0?})
	/go/src/go.etcd.io/etcd/tests/linearizability/model/model.go:53 +0x188
github.com/anishathalye/porcupine.computeVisualizationData({0x1112218, 0x1112210, 0x11132b8, 0x11132c0, 0x1112220, 0x11132c8, 0x1112208}, {{0xc001602078, 0x1, 0x1}, ...})
	/go/pkg/mod/github.com/anishathalye/porcupine@v0.1.4/visualization.go:50 +0x94f
github.com/anishathalye/porcupine.Visualize({0x0, 0x0, 0x11132b8, 0x11132c0, 0x0, 0x11132c8, 0x0}, {{0xc001602078, 0x1, 0x1}, ...}, ...)
	/go/pkg/mod/github.com/anishathalye/porcupine@v0.1.4/visualization.go:102 +0xa5
github.com/anishathalye/porcupine.VisualizePath({0x0, 0x0, 0x11132b8, 0x11132c0, 0x0, 0x11132c8, 0x0}, {{0xc001602078, 0x1, 0x1}, ...}, ...)
	/go/pkg/mod/github.com/anishathalye/porcupine@v0.1.4/visualization.go:122 +0x16e
go.etcd.io/etcd/tests/v3/linearizability.checkOperationsAndPersistResults(0xc000560680, 0x23c6?, {0xc001838000, 0x23c5, 0x23c6}, 0x2?)
	/go/src/go.etcd.io/etcd/tests/linearizability/linearizability_test.go:372 +0x55a
go.etcd.io/etcd/tests/v3/linearizability.TestLinearizability.func1(0xc000560680)
	/go/src/go.etcd.io/etcd/tests/linearizability/linearizability_test.go:134 +0x3b7
testing.tRunner(0xc000560680, 0xc0002de330)
	/usr/local/go/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:1493 +0x35f
FAIL	go.etcd.io/etcd/tests/v3/linearizability	16.268s
FAIL
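For context on this panic class: a plain type assertion on a nil interface{} panics, while the comma-ok form does not. A standalone reproduction, with a local EtcdRequest standing in for model.EtcdRequest:

```go
package main

import "fmt"

type EtcdRequest struct{ Key string }

func main() {
	var in interface{} // nil, like an operation whose request was never recorded

	// Comma-ok assertion: safe, ok is false for a nil interface.
	if req, ok := in.(EtcdRequest); ok {
		fmt.Println("request:", req.Key)
	} else {
		fmt.Println("no request recorded")
	}

	// Plain assertion: panics with
	// "interface conversion: interface {} is nil, not main.EtcdRequest".
	_ = in.(EtcdRequest)
}
```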

@serathius (Member)
In terms of stability, I sometimes get Model is not linearizable due to panic: interface conversion: interface {} is nil, not model.EtcdRequest. I need to investigate why, but otherwise snapshot triggering is consistent in my local runs.

That's great. The error you linked is usually connected to incorrect history merging or patching. Still, it's weird, as I thought I had addressed all of those.

@serathius (Member)
Please let me know when the PR is ready for review.

@lavacat (Author) commented Feb 3, 2023

@serathius I've run into another Model is not linearizable. In this case there were no interface {} is nil, not model.EtcdRequest errors (see the html in Archive.zip).

Archive.zip

Otherwise the PR is ready. Maybe we can merge but disable this scenario while I debug the Model is not linearizable problem.

@serathius (Member)
Otherwise the PR is ready. Maybe we can merge but disable this scenario while I debug the Model is not linearizable problem.

Sounds good. Please comment the scenario out and leave a TODO to fix and restore it.

Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>

@lavacat marked this pull request as ready for review February 5, 2023 08:04
@lavacat (Author) commented Feb 5, 2023

@serathius I think the functional test failure is a flake. Documented it in #14826 (comment)

Can you please trigger a rerun of the tests?

@serathius merged commit dc51cf1 into etcd-io:main on Feb 6, 2023
@serathius (Member)
@lavacat Looked at the file you sent; it looks like a data inconsistency. Will investigate it more.

Successfully merging this pull request may close these issues:

  • Implement remaining gofailpoints in linearizability tests (#14726)