storage: Raft log catching up extremely slowly #27983

Closed
bdarnell opened this issue Jul 27, 2018 · 10 comments · Fixed by #28511
Labels
C-performance Perf of queries or internals. Solution not expected to change functional behavior.

@bdarnell
Contributor

A cluster that got into a weird state has revealed something that doesn't appear to be working correctly in raft. The cluster was running 2.0.4 with the patch in #27804 (comment).

A range had gotten into a state with only two replicas, and one of them had been down for a long time. The live node was left as leader and continually added to its raft log. After the downed follower came back online, that multi-GB raft log had to be copied to the follower before new progress could be made (including establishing a new lease for the range or adding a third replica).

This process went extremely slowly, taking 30 hours in one case to transfer a few GB. While this was happening, the range status page showed that the follower's Last Index was increasing steadily, but the Commit and Applied indexes remained constant until the follower caught up completely. This is unexpected; the Commit and Applied indexes should increase as the follower progresses, tracking only a few message round trips behind the Last index.

Note that raft leadership was stable during this time (term number was only 141), so it was not the case that elections were being called frequently (which could slow down the leader and prevent it from committing entries as it processes MsgAppResps).

@bdarnell bdarnell added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 27, 2018
@bdarnell bdarnell self-assigned this Jul 27, 2018
@bdarnell
Contributor Author

One theory: Raft has a flow control mechanism which limits the number of outstanding unacknowledged MsgApps, but also optimistically opens up one slot in the flow control window for every MsgHeartbeatResp received. Some quick math lines up: if we could only send one 16KB message per 500ms heartbeat interval, our throughput would be 2.7GB/day. The commit index is only advanced for MsgAppResp, not MsgHeartbeatResp, so this would also explain the lack of advancement.
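
As a back-of-the-envelope check on that arithmetic (a quick sketch; the 16KB message size and 500ms heartbeat interval are the figures quoted above, not measured cluster settings):

package main

import "fmt"

func main() {
	const (
		msgBytes         = 16 << 10 // one 16KB MsgApp per heartbeat tick
		heartbeatSeconds = 0.5      // 500ms heartbeat interval
		secondsPerDay    = 24 * 60 * 60
	)
	bytesPerDay := msgBytes / heartbeatSeconds * secondsPerDay
	// Prints ~2.8 GB/day (about 2.6 GiB), in line with the ~2.7GB/day estimate above.
	fmt.Printf("~%.1f GB/day\n", bytesPerDay/1e9)
}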

In order for this to happen, we would need:

  • No MsgAppResps in the steady state. Even one would be enough to bump the commit index
  • Reliable MsgHeartbeatResp delivery (and in the other direction, MsgApp and MsgHeartbeat). This rules out a network partition (including a one-way partition)
  • The follower must be in ProgressStateReplicate. I think it must be in this state continuously, because flipping between ProgressStateReplicate and ProgressStateProbe would reset the flow control window. (flipping between the two states could be an alternative explanation for the poor throughput, although I don't immediately see how it could explain the lack of advancement)

The ProgressStateReplicate requirement is tricky. Followers start in ProgressStateProbe and only enter ProgressStateReplicate when the leader receives an MsgAppResp. So we had at least one successful MsgAppResp, then no more.
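
To make the mechanism concrete, here is a simplified sketch of the leader-side bookkeeping this theory relies on (illustrative names and types only, not etcd/raft's actual code):

package main

import "fmt"

type progressState int

const (
	stateProbe     progressState = iota // send one MsgApp at a time and wait for an ack
	stateReplicate                      // stream MsgApps, bounded by the inflights window
)

// progress is the leader's view of a single follower (a hypothetical stand-in
// for etcd/raft's Progress).
type progress struct {
	state       progressState
	maxInflight int
	inflights   []uint64 // log indexes of unacknowledged MsgApps (the flow control window)
}

func (pr *progress) full() bool { return len(pr.inflights) >= pr.maxInflight }

// A successful MsgAppResp promotes the follower to stateReplicate and frees
// every slot up to the acknowledged index.
func (pr *progress) onAppResp(index uint64) {
	pr.state = stateReplicate
	for len(pr.inflights) > 0 && pr.inflights[0] <= index {
		pr.inflights = pr.inflights[1:]
	}
}

// A MsgHeartbeatResp carries no log index, so it cannot advance the commit
// index; it only opens one slot in a full window. If MsgAppResps never get
// through, this is the follower's sole relief: one MsgApp per heartbeat.
func (pr *progress) onHeartbeatResp() {
	if pr.full() && len(pr.inflights) > 0 {
		pr.inflights = pr.inflights[1:]
	}
}

func main() {
	pr := &progress{state: stateReplicate, maxInflight: 4, inflights: []uint64{10, 11, 12, 13}}
	pr.onHeartbeatResp() // full window: one slot opens optimistically
	pr.onAppResp(12)     // acks everything up to index 12
	fmt.Println(pr.state == stateReplicate, pr.inflights) // true [13]
}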

This special case is one way in which MsgAppResp is different from other messages:

case raftpb.MsgAppResp:
	// A successful (non-reject) MsgAppResp contains one piece of
	// information: the highest log index. Raft currently queues up
	// one MsgAppResp per incoming MsgApp, and we may process
	// multiple messages in one handleRaftReady call (because
	// multiple messages may arrive while we're blocked syncing to
	// disk). If we get redundant MsgAppResps, drop all but the
	// last (we've seen that too many MsgAppResps can overflow
	// message queues on the receiving side).
	//
	// Note that this reorders the chosen MsgAppResp relative to
	// other messages (including any MsgAppResps with the Reject flag),
	// but raft is fine with this reordering.
	//
	// TODO(bdarnell): Consider pushing this optimization into etcd/raft.
	// Similar optimizations may be possible for other message types,
	// although MsgAppResp is the only one that has been seen as a
	// problem in practice.
	if !message.Reject && message.Index > lastAppResp.Index {
		lastAppResp = message
		drop = true
	}

@bdarnell
Contributor Author

In support of this theory, the cluster showed substantially fewer MsgAppResps than MsgApps during this period:

[image: graph comparing MsgApp and MsgAppResp message rates during this period]

@bdarnell
Contributor Author

The "dropped messages" stat was also zero during this time, so the messages were not getting dropped at the raft transport layer.

@tbg tbg added this to the 2.1 milestone Jul 31, 2018
@tbg tbg added A-coreperf and removed A-disaster-recovery A-kv-transactions Relating to MVCC and the transactional model. A-kv-distribution Relating to rebalancing and leasing. A-kv-client Relating to the KV client and the KV interface. A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 31, 2018
@bdarnell
Contributor Author

bdarnell commented Aug 6, 2018

The fact that the commit index does not advance is normal: a new leader cannot commit any entries from past terms until it has committed an entry from its own term (which requires that a quorum of followers is fully up to date). This is described in section 3.6.2 of the raft dissertation.
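
For reference, that rule can be sketched roughly as follows (a minimal sketch with assumed names, not etcd/raft's implementation): the quorum-replicated index is only committed if the entry at that index is from the leader's current term.

package main

import (
	"fmt"
	"sort"
)

// maybeCommit sketches the section 3.6.2 rule: commit the highest index that a
// quorum has replicated, but only if the entry at that index is from the
// leader's current term (committing it implicitly commits everything before it).
func maybeCommit(matchIndexes []uint64, committed, currentTerm uint64, termAt func(uint64) uint64) uint64 {
	sorted := append([]uint64(nil), matchIndexes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorumIndex := sorted[len(sorted)/2] // highest index replicated on a majority
	if quorumIndex > committed && termAt(quorumIndex) == currentTerm {
		return quorumIndex
	}
	return committed // entries from earlier terms stay uncommitted for now
}

func main() {
	// Hypothetical numbers: two replicas at term 141, the leader at index 100,
	// the follower slowly catching up at index 40, and every entry in that
	// stretch of log from an earlier term. The commit index cannot move until
	// the follower is fully caught up and an own-term entry commits.
	termAt := func(uint64) uint64 { return 3 }
	fmt.Println(maybeCommit([]uint64{100, 40}, 10, 141, termAt)) // 10: no advancement
}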

The only problem, then, is that the log replication is proceeding extremely slowly. I still think the theory in the second comment above is the best one. The fact that entries cannot be committed makes this more plausible, since successful MsgAppResps would not bump the commit index and invalidate the theory. It is enough for most MsgAppResps to be dropped, instead of requiring all of them to fail.

Alternately, this could reflect a node alternating between ProgressStateProbe and ProgressStateReplicate. I had dismissed this theory because I think it requires some MsgAppResps to succeed, but that is now possible.

@bdarnell bdarnell changed the title storage: Raft not committing new entries incrementally storage: Raft log catching up extremely slowly Aug 6, 2018
@bdarnell
Contributor Author

bdarnell commented Aug 6, 2018

OK, I've figured it out. There are two issues with raft's flow control, one of which is exacerbated by our MsgAppResp-coalescing optimization (and perhaps by our configuration that allows for a large number of small messages in flight, instead of the reverse).

Raft's flow control mostly tries to operate on a "one in, one out" principle. After receiving a MsgAppResp, multiple slots may free up in the flow control window, but only one MsgApp will be sent. The only thing that can lead to multiple inflight messages is incoming traffic (which is rare here, since the only incoming traffic is the occasional lease request as the range tries to reestablish itself). And then when we do manage to get multiple messages in flight at once, the fact that we coalesce redundant MsgAppResps means that we will tend to fall back into the one-at-a-time mode.

Even worse, there's a bug in the "one in, one out" implementation: if the flow control window was not full when the MsgAppResp was received, no new message is sent at all. This degrades to "one message per heartbeat interval", which matches the measured transfer rates.

Instead of attempting to send exactly the right number of messages, raft should be trying to fill up the flow control window whenever it sends MsgApp.
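
A hypothetical sketch of the difference (illustrative names only; the real change is the etcd patch linked below): rather than sending at most one MsgApp per ack, the leader keeps appending until the follower's flow control window is full or the log is exhausted.

package main

import "fmt"

// Illustrative leader/follower bookkeeping; not etcd/raft's real types.
type follower struct {
	next        uint64   // next log index to send
	inflights   []uint64 // unacked MsgApp indexes (flow control window)
	maxInflight int
}

func (f *follower) full() bool { return len(f.inflights) >= f.maxInflight }

type leader struct{ lastIndex uint64 }

func (l *leader) send(f *follower) {
	f.inflights = append(f.inflights, f.next)
	f.next++
}

// The "one in, one out" behavior: after an ack frees slots, send at most one
// new MsgApp, so the window only fills up if there is fresh incoming traffic.
func (l *leader) sendOneAppend(f *follower) {
	if !f.full() && f.next <= l.lastIndex {
		l.send(f)
	}
}

// The fix described above: fill the window whenever sending MsgApp, so a
// far-behind follower is caught up at full window depth.
func (l *leader) fillWindow(f *follower) {
	for !f.full() && f.next <= l.lastIndex {
		l.send(f)
	}
}

func main() {
	l := &leader{lastIndex: 1_000_000}
	f := &follower{next: 1, maxInflight: 64}
	l.sendOneAppend(f)
	fmt.Println(len(f.inflights)) // 1: without new proposals the window never fills
	l.fillWindow(f)
	fmt.Println(len(f.inflights)) // 64: the window stays full while catching up
}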

bdarnell added a commit to bdarnell/etcd that referenced this issue Aug 7, 2018
We allow multiple in-flight append messages, but prior to this change
the only way we'd ever send them is if there is a steady stream of new
proposals. Catching up a follower that is far behind would be
unnecessarily slow (this is exacerbated by a quirk of CockroachDB's
use of raft which limits our ability to catch up via snapshot in some
cases).

See cockroachdb/cockroach#27983
@bdarnell
Contributor Author

bdarnell commented Aug 7, 2018

etcd-io/etcd#9985

bdarnell added a commit to bdarnell/etcd that referenced this issue Aug 8, 2018
We allow multiple in-flight append messages, but prior to this change
the only way we'd ever send them is if there is a steady stream of new
proposals. Catching up a follower that is far behind would be
unnecessarily slow (this is exacerbated by a quirk of CockroachDB's
use of raft which limits our ability to catch up via snapshot in some
cases).

See cockroachdb/cockroach#27983
bdarnell added a commit to bdarnell/cockroach that referenced this issue Aug 13, 2018
Picks up etcd-io/etcd#9982 and etcd-io/etcd#9985 (and no other changes
to packages we use).

Fixes cockroachdb#27983
Fixes cockroachdb#27804

Release note (bug fix): Additional fixes for out-of-memory errors
caused by very large raft logs.

Release note (performance improvement): Greatly improved performance
when catching up followers that are behind when raft logs are large.
craig bot pushed a commit that referenced this issue Aug 13, 2018
28511: vendor: Update etcd r=tschottdorf a=bdarnell

Picks up etcd-io/etcd#9982 and etcd-io/etcd#9985 (and no other changes
to packages we use).

Fixes #27983
Fixes #27804

Release note (bug fix): Additional fixes for out-of-memory errors
caused by very large raft logs.

Release note (performance improvement): Greatly improved performance
when catching up followers that are behind when raft logs are large.

Co-authored-by: Ben Darnell <ben@bendarnell.com>
@craig craig bot closed this as completed in #28511 Aug 13, 2018
bdarnell added a commit to cockroachdb/etcd that referenced this issue Aug 13, 2018
We allow multiple in-flight append messages, but prior to this change
the only way we'd ever send them is if there is a steady stream of new
proposals. Catching up a follower that is far behind would be
unnecessarily slow (this is exacerbated by a quirk of CockroachDB's
use of raft which limits our ability to catch up via snapshot in some
cases).

See cockroachdb/cockroach#27983
@DorianZheng

@bdarnell

A range had gotten into a state with only two replicas, and one of them had been down for a long time. The live node was left as leader and continually added to its raft log.

Why can the live node become leader? A majority of two replicas is two, and one of them is down, so how can the other one become leader?

@tbg
Member

tbg commented Aug 20, 2018

@DorianZheng I think what happens is that the leader was elected before quorum was lost. A leader steps down if it hears about a higher term, and followers campaign if they don't receive heartbeats. This means that a sole leader will remain in this state; neither of the two mechanisms will make it step down.

Of course it won't be able to make any progress, but it can append to the uncommitted Raft log.

@DorianZheng

Thanks @tschottdorf for answering. So will the leader use heartbeats to detect whether it is still the leader? If not, why?

@bdarnell
Contributor Author

bdarnell commented Sep 9, 2018

We don't currently use heartbeats to detect when we can no longer be leader because our followers have gone away. We used to in 1.1, but we turned that off in 2.0 because it was expensive to do this work for every replica even when they were idle. (We could probably do this more cheaply if we wanted to put more work into it, but we didn't expect it to be an issue.)
