
failpoint functional testing not covering etcd startup well #11898

Closed
jpbetz opened this issue May 15, 2020 · 2 comments · Fixed by #11913
jpbetz commented May 15, 2020

As part of #11888, one question we had was: why wasn't this caught by our failpoint functional testing?

I looked over the code and I think there is a simple answer. The way we inject failpoint failures is:

  • Get a list of all failpoints from etcd via an HTTP request
  • Create functional testing cases for them
  • Inject them into the cluster by doing an HTTP PUT to enable each failpoint

See failpointFailures:

inject := makeInjectFailpoint(fp, fcmd)

This works great for many of the failpoints, but for failpoints in the startup phase of the etcd lifecycle there is a problem: when etcd is restarted, any failpoints set on that etcd member via HTTP requests are cleared.

This means the functional tester has only a very narrow window, between when etcd starts up and when the first few startup failpoints are reached, in which to enable those failpoints via an HTTP request and have them be exercised.

I think the reason we didn't catch #11888 with failure testing is that hitting this window is sufficiently unlikely.

Possible ways to fix this:

  1. Keep track of which failpoints have been enabled on each etcd member and, when a member restarts, re-enable them via the GOFAIL_FAILPOINTS environment variable. (Also, when we "recover" a failpoint, remember to stop including it in the environment variable.)
  2. Whenever starting an etcd member, inject a failpoint via the environment variable according to some probability function.

I like #1 but am curious what others think.

@gyuho @jingyih @YoyinZyc @wenjiaswe

jpbetz commented May 18, 2020

Hmm, actually, it might be much simpler. case_failpoints.go only waits for a snapshot to occur after enabling failpoints with a name that contains "Snap", and in #11888 we just added one between the snapshot file save and the WAL entry write (https://github.com/etcd-io/etcd/blob/master/etcdserver/raft.go#L237).

jpbetz commented May 18, 2020

Opened #11913 to persist gofail failpoints across restarts, since it seems like something we should be doing. I don't think it matters for anything urgent: for #11888 the problem with testing was that there was no failpoint with "Snap" in its name between the operations that were causing the problem (which has since been fixed). But I figured I'd add this for completeness.
