storage: Raft log catching up extremely slowly #27983

Closed
bdarnell opened this issue Jul 27, 2018 · 10 comments · Fixed by #28511
Labels
C-performance Perf of queries or internals. Solution not expected to change functional behavior.

@bdarnell
Contributor

A cluster that got into a weird state has revealed something that doesn't appear to be working correctly in raft. The cluster was running 2.0.4 with the patch in #27804 (comment).

A range had gotten into a state with only two replicas, and one of them had been down for a long time. The live node was left as leader and continually added to its raft log. After the downed follower came back online, that multi-GB raft log had to be copied to the follower before new progress could be made (including establishing a new lease for the range or adding a third replica).

This process went extremely slowly, taking 30 hours in one case to transfer a few GB. While this was happening, the range status page showed that the follower's Last Index was increasing steadily, but the Commit and Applied indexes remained constant until the follower caught up completely. This is unexpected; the Commit and Applied indexes should increase as the follower progresses, tracking only a few message round trips behind the Last index.

Note that raft leadership was stable during this time (term number was only 141), so it was not the case that elections were being called frequently (which could slow down the leader and prevent it from committing entries as it processes MsgAppResps).

@bdarnell bdarnell added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 27, 2018
@bdarnell bdarnell self-assigned this Jul 27, 2018
@bdarnell
Contributor Author

One theory: Raft has a flow control mechanism which limits the number of outstanding unacknowledged MsgApps, but also optimistically opens up one slot in the flow control window for every MsgHeartbeatResp received. Some quick math lines up: if we could only send one 16KB message per 500ms heartbeat interval, our throughput would be 2.7GB/day. The commit index is only advanced for MsgAppResp, not MsgHeartbeatResp, so this would also explain the lack of advancement.
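
As a back-of-the-envelope check on that arithmetic (a quick sketch; the 16KB message size and 500ms heartbeat interval are the figures quoted above, not measured cluster settings):

package main

import "fmt"

func main() {
	const (
		msgBytes         = 16 << 10 // one 16KB MsgApp per heartbeat tick
		heartbeatSeconds = 0.5      // 500ms heartbeat interval
		secondsPerDay    = 24 * 60 * 60
	)
	bytesPerDay := msgBytes / heartbeatSeconds * secondsPerDay
	// Prints ~2.8 GB/day (about 2.6 GiB), in line with the ~2.7GB/day estimate above.
	fmt.Printf("~%.1f GB/day\n", bytesPerDay/1e9)
}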

In order for this to happen, we would need:

  • No MsgAppResps in the steady state. Even one would be enough to bump the commit index
  • Reliable MsgHeartbeatResp delivery (and in the other direction, MsgApp and MsgHeartbeat). This rules out a network partition (including a one-way partition)
  • The follower must be in ProgressStateReplicate. I think it must be in this state continuously, because flipping between ProgressStateReplicate and ProgressStateProbe would reset the flow control window. (flipping between the two states could be an alternative explanation for the poor throughput, although I don't immediately see how it could explain the lack of advancement)

The ProgressStateReplicate requirement is tricky. Followers start in ProgressStateProbe and only enter ProgressStateReplicate when the leader receives an MsgAppResp. So we had at least one successful MsgAppResp, then no more.
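
To make the mechanism concrete, here is a simplified sketch of the leader-side bookkeeping this theory relies on (illustrative names and types only, not etcd/raft's actual code):

package main

import "fmt"

type progressState int

const (
	stateProbe     progressState = iota // send one MsgApp at a time and wait for an ack
	stateReplicate                      // stream MsgApps, bounded by the inflights window
)

// progress is the leader's view of a single follower (a hypothetical stand-in
// for etcd/raft's Progress).
type progress struct {
	state       progressState
	maxInflight int
	inflights   []uint64 // log indexes of unacknowledged MsgApps (the flow control window)
}

func (pr *progress) full() bool { return len(pr.inflights) >= pr.maxInflight }

// A successful MsgAppResp promotes the follower to stateReplicate and frees
// every slot up to the acknowledged index.
func (pr *progress) onAppResp(index uint64) {
	pr.state = stateReplicate
	for len(pr.inflights) > 0 && pr.inflights[0] <= index {
		pr.inflights = pr.inflights[1:]
	}
}

// A MsgHeartbeatResp carries no log index, so it cannot advance the commit
// index; it only opens one slot in a full window. If MsgAppResps never get
// through, this is the follower's sole relief: one MsgApp per heartbeat.
func (pr *progress) onHeartbeatResp() {
	if pr.full() && len(pr.inflights) > 0 {
		pr.inflights = pr.inflights[1:]
	}
}

func main() {
	pr := &progress{state: stateReplicate, maxInflight: 4, inflights: []uint64{10, 11, 12, 13}}
	pr.onHeartbeatResp() // full window: one slot opens optimistically
	pr.onAppResp(12)     // acks everything up to index 12
	fmt.Println(pr.state == stateReplicate, pr.inflights) // true [13]
}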

This special case is one way in which MsgAppResp is different from other messages:

case raftpb.MsgAppResp:
	// A successful (non-reject) MsgAppResp contains one piece of
	// information: the highest log index. Raft currently queues up
	// one MsgAppResp per incoming MsgApp, and we may process
	// multiple messages in one handleRaftReady call (because
	// multiple messages may arrive while we're blocked syncing to
	// disk). If we get redundant MsgAppResps, drop all but the
	// last (we've seen that too many MsgAppResps can overflow
	// message queues on the receiving side).
	//
	// Note that this reorders the chosen MsgAppResp relative to
	// other messages (including any MsgAppResps with the Reject flag),
	// but raft is fine with this reordering.
	//
	// TODO(bdarnell): Consider pushing this optimization into etcd/raft.
	// Similar optimizations may be possible for other message types,
	// although MsgAppResp is the only one that has been seen as a
	// problem in practice.
	if !message.Reject && message.Index > lastAppResp.Index {
		lastAppResp = message
		drop = true
	}

@bdarnell
Contributor Author

In support of this theory, the cluster showed substantially fewer MsgAppResps than MsgApps during this period:

[image: graph comparing MsgApp and MsgAppResp message rates during this period]

@bdarnell
Contributor Author

The "dropped messages" stat was also zero during this time, so the messages were not getting dropped at the raft transport layer.

@tbg tbg added this to the 2.1 milestone Jul 31, 2018
@tbg tbg added A-coreperf and removed A-disaster-recovery A-kv-transactions Relating to MVCC and the transactional model. A-kv-distribution Relating to rebalancing and leasing. A-kv-client Relating to the KV client and the KV interface. A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 31, 2018
@bdarnell
Contributor Author

bdarnell commented Aug 6, 2018

The fact that the commit index does not advance is normal: a new leader cannot commit any entries from past terms until it has committed an entry from its own term (which requires that a quorum of followers is fully up to date). This is described in section 3.6.2 of the raft dissertation.
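
For reference, that rule can be sketched roughly as follows (a minimal sketch with assumed names, not etcd/raft's implementation): the quorum-replicated index is only committed if the entry at that index is from the leader's current term.

package main

import (
	"fmt"
	"sort"
)

// maybeCommit sketches the section 3.6.2 rule: commit the highest index that a
// quorum has replicated, but only if the entry at that index is from the
// leader's current term (committing it implicitly commits everything before it).
func maybeCommit(matchIndexes []uint64, committed, currentTerm uint64, termAt func(uint64) uint64) uint64 {
	sorted := append([]uint64(nil), matchIndexes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorumIndex := sorted[len(sorted)/2] // highest index replicated on a majority
	if quorumIndex > committed && termAt(quorumIndex) == currentTerm {
		return quorumIndex
	}
	return committed // entries from earlier terms stay uncommitted for now
}

func main() {
	// Hypothetical numbers: two replicas at term 141, the leader at index 100,
	// the follower slowly catching up at index 40, and every entry in that
	// stretch of log from an earlier term. The commit index cannot move until
	// the follower is fully caught up and an own-term entry commits.
	termAt := func(uint64) uint64 { return 3 }
	fmt.Println(maybeCommit([]uint64{100, 40}, 10, 141, termAt)) // 10: no advancement
}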

The only problem, then, is that the log replication is proceeding extremely slowly. I still think the theory in the second comment above is the best one. The fact that entries cannot be committed makes this more plausible, since successful MsgAppResps would not bump the commit index and invalidate the theory. It is enough for most MsgAppResps to be dropped, instead of requiring all of them to fail.

Alternately, this could reflect a node alternating between ProgressStateProbe and ProgressStateReplicate. I had dismissed this theory because I think it requires some MsgAppResps to succeed, but that is now possible.

@bdarnell bdarnell changed the title storage: Raft not committing new entries incrementally storage: Raft log catching up extremely slowly Aug 6, 2018
@bdarnell
Contributor Author

bdarnell commented Aug 6, 2018

OK, I've figured it out. There are two issues with raft's flow control, one of which is exacerbated by our MsgAppResp-coalescing optimization (and perhaps by our configuration that allows for a large number of small messages in flight, instead of the reverse).

Raft's flow control mostly tries to operate on a "one in, one out" principle. After receiving a MsgAppResp, multiple slots may free up in the flow control window, but only one MsgApp will be sent. The only thing that can lead to multiple inflight messages is incoming traffic (which is rare here, since the only incoming traffic is the occasional lease request as the range tries to reestablish itself). And then when we do manage to get multiple messages in flight at once, the fact that we coalesce redundant MsgAppResps means that we will tend to fall back into the one-at-a-time mode.

Even worse, there's a bug in the "one in, one out" implementation: if the flow control window was not full when the MsgAppResp was received, no new message is sent at all. This degrades to "one message per heartbeat interval", which matches the measured transfer rates.

Instead of attempting to send exactly the right number of messages, raft should be trying to fill up the flow control window whenever it sends MsgApp.
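
A hypothetical sketch of the difference (illustrative names only; the real change is the etcd patch linked below): rather than sending at most one MsgApp per ack, the leader keeps appending until the follower's flow control window is full or the log is exhausted.

package main

import "fmt"

// Illustrative leader/follower bookkeeping; not etcd/raft's real types.
type follower struct {
	next        uint64   // next log index to send
	inflights   []uint64 // unacked MsgApp indexes (flow control window)
	maxInflight int
}

func (f *follower) full() bool { return len(f.inflights) >= f.maxInflight }

type leader struct{ lastIndex uint64 }

func (l *leader) send(f *follower) {
	f.inflights = append(f.inflights, f.next)
	f.next++
}

// The "one in, one out" behavior: after an ack frees slots, send at most one
// new MsgApp, so the window only fills up if there is fresh incoming traffic.
func (l *leader) sendOneAppend(f *follower) {
	if !f.full() && f.next <= l.lastIndex {
		l.send(f)
	}
}

// The fix described above: fill the window whenever sending MsgApp, so a
// far-behind follower is caught up at full window depth.
func (l *leader) fillWindow(f *follower) {
	for !f.full() && f.next <= l.lastIndex {
		l.send(f)
	}
}

func main() {
	l := &leader{lastIndex: 1_000_000}
	f := &follower{next: 1, maxInflight: 64}
	l.sendOneAppend(f)
	fmt.Println(len(f.inflights)) // 1: without new proposals the window never fills
	l.fillWindow(f)
	fmt.Println(len(f.inflights)) // 64: the window stays full while catching up
}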

bdarnell added a commit to bdarnell/etcd that referenced this issue Aug 7, 2018
We allow multiple in-flight append messages, but prior to this change
the only way we'd ever send them is if there is a steady stream of new
proposals. Catching up a follower that is far behind would be
unnecessarily slow (this is exacerbated by a quirk of CockroachDB's
use of raft which limits our ability to catch up via snapshot in some
cases).

See cockroachdb/cockroach#27983
@bdarnell
Contributor Author

bdarnell commented Aug 7, 2018

etcd-io/etcd#9985

bdarnell added a commit to bdarnell/etcd that referenced this issue Aug 8, 2018
We allow multiple in-flight append messages, but prior to this change
the only way we'd ever send them is if there is a steady stream of new
proposals. Catching up a follower that is far behind would be
unnecessarily slow (this is exacerbated by a quirk of CockroachDB's
use of raft which limits our ability to catch up via snapshot in some
cases).

See cockroachdb/cockroach#27983
bdarnell added a commit to bdarnell/cockroach that referenced this issue Aug 13, 2018
Picks up etcd-io/etcd#9982 and etcd-io/etcd#9985 (and no other changes
to packages we use).

Fixes cockroachdb#27983
Fixes cockroachdb#27804

Release note (bug fix): Additional fixes for out-of-memory errors
caused by very large raft logs.

Release note (performance improvement): Greatly improved performance
when catching up followers that are behind when raft logs are large.
craig bot pushed a commit that referenced this issue Aug 13, 2018
28511: vendor: Update etcd r=tschottdorf a=bdarnell

Picks up etcd-io/etcd#9982 and etcd-io/etcd#9985 (and no other changes
to packages we use).

Fixes #27983
Fixes #27804

Release note (bug fix): Additional fixes for out-of-memory errors
caused by very large raft logs.

Release note (performance improvement): Greatly improved performance
when catching up followers that are behind when raft logs are large.

Co-authored-by: Ben Darnell <ben@bendarnell.com>
@craig craig bot closed this as completed in #28511 Aug 13, 2018
bdarnell added a commit to cockroachdb/etcd that referenced this issue Aug 13, 2018
We allow multiple in-flight append messages, but prior to this change
the only way we'd ever send them is if there is a steady stream of new
proposals. Catching up a follower that is far behind would be
unnecessarily slow (this is exacerbated by a quirk of CockroachDB's
use of raft which limits our ability to catch up via snapshot in some
cases).

See cockroachdb/cockroach#27983
@DorianZheng

@bdarnell

A range had gotten into a state with only two replicas, and one of them had been down for a long time. The live node was left as leader and continually added to its raft log.

Why can the live node become leader? A majority of two replicas is two, and one of them is down, so how can the other one become leader?

@tbg
Member

tbg commented Aug 20, 2018

@DorianZheng I think what happens is that the leader was elected before quorum was lost. A leader steps down if it hears about a higher term, and followers campaign if they don't receive heartbeats. This means that a sole leader will remain in this state; neither of the two mechanisms will make it step down.

Of course it won't be able to make any progress, but it can append to the uncommitted Raft log.

@DorianZheng

Thanks @tschottdorf for answering. So will the leader use heartbeats to detect whether it is still the leader? If not, why?

@bdarnell
Contributor Author

bdarnell commented Sep 9, 2018

We don't currently use heartbeats to detect when we can no longer be leader because our followers have gone away. We used to in 1.1, but we turned that off in 2.0 because it was expensive to do this work for every replica even when they were idle. (We could probably do this more cheaply if we wanted to put more work into it, but we didn't expect it to be an issue.)
