
Poor performance and cluster stability when lots of leases are expiring. #9360

Closed
mgates opened this issue Feb 26, 2018 · 31 comments

@mgates

mgates commented Feb 26, 2018

We've been using etcd v3 leases that we let expire to add a TTL to keys. We've been noticing very poor performance with lease expiration versus manual key deletion. This is true both on our main 5-node cluster and on a single-node cluster we set up for testing. In particular, when lots of leases are expiring, request time, especially p99 request time, goes through the roof, and we end up with a lot of request timeouts.

It also seems that the cluster gets behind and leases don't expire on time either; the keys are still there but have a negative TTL.

We saw this in our production cluster (5 8-CPU VMs), where we paralyzed our cluster with about 300-400 lease expirations per second. It's able to handle well over that in set and delete requests.

We'd be happy to run any tests or gather any data that might be helpful.

@jcalvert
Contributor

jcalvert commented Feb 26, 2018

We were able to observe this behavior outside of our application code by using the following simple Go program. When pointing at localhost we are going through the gRPC proxy. Without leases, the cluster is able to handle this traffic easily with no degradation.

package main

import (
        "context"
        "log"
        "sync"
        "time"

        "github.com/coreos/etcd/clientv3"
        uuid "github.com/nu7hatch/gouuid"
)

func main() {
        cli, err := clientv3.New(clientv3.Config{
                Endpoints:   []string{"localhost:2379"},
                DialTimeout: 5 * time.Second,
        })
        if err != nil {
                log.Fatal(err)
        }
        defer cli.Close()
        log.Printf("Start")

        var wg sync.WaitGroup
        for k := 0; k < 200; k++ {
                wg.Add(1)
                go func() {
                        defer wg.Done()
                        for i := 0; i < 3000; i++ {
                                // Grant a 5-second lease; the key written below is
                                // removed when the lease expires.
                                resp, err := cli.Grant(context.TODO(), 5)
                                if err != nil {
                                        log.Fatal(err)
                                }

                                u, err := uuid.NewV4()
                                if err != nil {
                                        log.Fatal(err)
                                }
                                // Drop the WithLease option to see the no-lease baseline.
                                _, err = cli.Put(context.TODO(), u.String(), "bar", clientv3.WithLease(resp.ID))
                                if err != nil {
                                        log.Fatal(err)
                                }
                        }
                }()
        }
        wg.Wait()
        log.Printf("End")
}

@xiang90
Contributor

xiang90 commented Feb 26, 2018

what is the version of your etcd server?

@gyuho
Contributor

gyuho commented Feb 26, 2018

@mgates How do all these leases expire at the same time? Did you restart the machine?
We've added randomization on lease revoke in 3.3.

@mgates
Author

mgates commented Feb 26, 2018

We've been using 3.2, but we tested on 3.3 and had the same issue.

@mgates
Author

mgates commented Feb 26, 2018

The leases mostly expire one minute after creation, so at different times but with the same TTL. We've certainly done rolling restarts of etcd, but we weren't doing them during the times when we've seen this.

@gyuho
Contributor

gyuho commented Feb 26, 2018

We saw this in our production cluster (5 8-CPU VMs), where we paralyzed our cluster with about 300-400 lease expirations per second. It's able to handle well over that in set and delete requests.

How many leases are there in total? Reproducible steps would be helpful.

@jcalvert
Contributor

In our application usage, the keys being set are the result of many concurrent API requests. An analogy would be a reservation system where locks are taken out on a resource; if a resource is not released, we rely on lease expiration to enforce a TTL on the lock. In this way we naturally have many lease TTLs that expire concurrently without the attached keys being related.

@jcalvert
Contributor

TL;DR: We notice much worse performance when we allow the server to expire leases than when we revoke them actively. We're trying to determine whether using leases as a way to expire long-lived keys is an anti-pattern or whether we have hit an unexpected performance bug.

In our production system, we were doing approximately 1200 gRPC operations per second. Lease grants went from approximately 400 to 620 per second, with the additional 220 being leases that were not revoked. These lease durations ranged from 60 seconds to 8 hours. During this time our p99 request duration (http_request_duration_microseconds) went from 50ms to 250ms and remained high until the remaining leases expired, roughly 8 hours, even though the change was only in production for about 30 minutes. We were able to determine this by looking at the etcd_debugging_server_lease_expired_total values. Our previous benchmarks against this production cluster had shown it able to handle burst traffic of 6000 gRPC operations per second. This seems to indicate that issuing leases and then letting them expire degrades performance, and as mentioned before we could query and see leases with negative TTLs, so keys were not expiring on time.

We wanted to validate this by running a very minimal test outside of our application code and production infrastructure. We set up a cluster of 5 m3.large EC2 instances running 3.2.0, with the data directory backed by tmpfs. We initially had EBS volumes with 2000 provisioned IOPS and switched to the ramdisk to rule out disk as the source of the issue, but found no real difference.

We ran 600,000 total loops, each granting a lease and then setting a random key with value “bar” attached to that lease, with a 5-second lease TTL. Total run time was 8 minutes. Lease grants peaked at 1893 per second and fell off to about 1300. Server-side lease expirations peaked at 830 per second and continued for another 8 minutes after all operations sent to the cluster had stopped. 3240 requests failed and we received several hundred “apply entries took too long” messages in the logs. This was all done using the very simple Go program provided above as a model for our real-world application behavior. Hopefully this is some information towards reproducing the issue, but we will try to provide something more turnkey as a reproduction test tomorrow.

@gyuho
Contributor

gyuho commented Feb 26, 2018

@jcalvert Do you have any concurrent Lease API workloads (e.g. Lease Lookup, Lease List)?

Or just overlapping Grant and Revoke?

@jcalvert
Contributor

@gyuho No lookup, list, or heartbeat.

@xiang90
Contributor

xiang90 commented Feb 27, 2018

The v3 API is not measured by http_request_duration_microseconds, so I am not sure why you want to check that metric.

@jcalvert
Contributor

@xiang90 We saw quite a bit of movement on that metric. What would be a more appropriate metric for measuring request time, then? We saw an equivalent rise on our application side, where we wrap each request to etcd in a statsd timer.

@gyuho
Contributor

gyuho commented Feb 27, 2018

@jcalvert For v3, we have gRPC metrics

e.g.

grpc_server_handled_total{grpc_code="DeadlineExceeded",grpc_method="DeleteRange",grpc_service="etcdserverpb.KV",grpc_type="unary"} 0

Or

# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
etcd_server_proposals_failed_total 0
# HELP etcd_server_proposals_pending The current number of pending proposals to commit.
# TYPE etcd_server_proposals_pending gauge
etcd_server_proposals_pending 0

@jcalvert
Contributor

Looking at the comment here, the hard limit on the lease revoke rate, and the fact that lessor.go has a mutex around operations like Grant, it seems that in general having a large number of concurrently unexpired leases is going to be consistently bad for performance. Is there any chance that performance considerations like this will be added to the roadmap for etcd?

@gyuho
Contributor

gyuho commented Feb 27, 2018

@jcalvert That limit is for restricting lease revoke spikes when a node restarts.

This issue is more about Revoke path, right?

I need to investigate more where bottlenecks are.

Is this #9360 (comment) a good way to reproduce your issue?
Or please let me know if you have any other easier way to reproduce.

@jcalvert
Contributor

@gyuho Yes, although I believe the effect is more pronounced if the lease time is extended to 5 minutes.

@gyuho
Contributor

gyuho commented Feb 27, 2018

@jcalvert Ok, thanks. I will double-check our code base.

@mgates
Author

mgates commented Feb 27, 2018

Hey folks, I got some numbers. These are all on a one-node cluster with a ramdisk backing the data directory, and with the test running on the same VPS, so network and disk latency shouldn't be a factor. I restarted the node and cleared out the data dir before each run.

When we revoked the lease right away (https://gist.github.com/mgates/f79cbccf9f61ae5fd9d5c85d0b984b41), it took 7 minutes and 30 seconds to run, and there was one 'apply entries took too long' warning in the logs.

With a 5-second lease, where we deleted the key (https://gist.github.com/mgates/b866ab74625f4dd83eb6ebc20d988eff) but let the lease expire on its own, it took 9 and a half minutes, and we had 225 of those messages.

Increasing the lease to 60 seconds took 10.5 minutes and produced 441 of those messages.
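
For reference, here is a minimal, self-contained sketch of the loop body these runs compare; putThenDelete, the key name, and the revoke flag are our own illustrative names, not the exact gist code:

package main

import (
        "context"
        "log"
        "time"

        "github.com/coreos/etcd/clientv3"
)

// putThenDelete writes a key under a lease and deletes it. With revoke=true
// (variant A) the lease is revoked immediately; with revoke=false (variant B)
// the server is left to expire the lease on its own.
func putThenDelete(ctx context.Context, cli *clientv3.Client, key string, leaseTTL int64, revoke bool) error {
        resp, err := cli.Grant(ctx, leaseTTL)
        if err != nil {
                return err
        }
        if _, err = cli.Put(ctx, key, "bar", clientv3.WithLease(resp.ID)); err != nil {
                return err
        }
        if _, err = cli.Delete(ctx, key); err != nil {
                return err
        }
        if revoke {
                // Variant A: revoke right away instead of waiting for expiry.
                _, err = cli.Revoke(ctx, resp.ID)
        }
        return err
}

func main() {
        cli, err := clientv3.New(clientv3.Config{
                Endpoints:   []string{"localhost:2379"},
                DialTimeout: 5 * time.Second,
        })
        if err != nil {
                log.Fatal(err)
        }
        defer cli.Close()
        if err := putThenDelete(context.TODO(), cli, "some-key", 5, true); err != nil {
                log.Fatal(err)
        }
}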

I can get you the full test logs if you want, or get more detail about how long the long-running actions took if it would help.

@gyuho
Contributor

gyuho commented Feb 27, 2018

I can get you the full test logs if you want, or get more detail about how long the long-running actions took if it would help.

Sure, full server logs + output of /metrics would help.

@mgates
Author

mgates commented Feb 27, 2018

Here you go, let me know if you need more data or anything: https://gist.github.com/mgates/bcf10617030922980048dfa08f6f208f

@gyuho
Contributor

gyuho commented Mar 1, 2018

@mgates Thanks for the logs. Will take a look.

/cc @ximenzaoshi

@mgates
Author

mgates commented Mar 5, 2018

Hey folks, we were thinking of taking a stab at implementing the backing heap suggested by the comment in https://github.com/coreos/etcd/blob/master/lease/lessor.go#L123. We just wanted to check whether you had made any headway in a different direction before we dive in.

@gyuho
Contributor

gyuho commented Mar 5, 2018

@mgates We are working to find the root cause of another lease issue #9374.

Rather not optimize before we resolve possible bugs.

@mgates
Author

mgates commented Mar 6, 2018

Hey folks, we took a stab at some profiling and exploratory coding. It looks like there is a lot of CPU time spent checking the leases map:

Fetching profile over HTTP from http://localhost:2379/debug/pprof/profile
Saved profile in /home/pair/pprof/pprof.etcd.samples.cpu.009.pb.gz
File: etcd
Type: cpu
Time: Mar 6, 2018 at 10:13pm (UTC)
Duration: 30s, Total samples = 7.97s (26.57%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 7.84s, 98.37% of 7.97s total
Dropped 3 nodes (cum <= 0.04s)
Showing top 10 nodes out of 16
      flat  flat%   sum%        cum   cum%
     6.37s 79.92% 79.92%      6.37s 79.92%  runtime._ExternalCode
     0.69s  8.66% 88.58%      1.09s 13.68%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/lease.(*Lease).Remaining
     0.28s  3.51% 92.10%      0.28s  3.51%  runtime.mapiternext
     0.16s  2.01% 94.10%      1.57s 19.70%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/lease.(*lessor).findExpiredLeases
     0.08s  1.00% 95.11%      0.08s  1.00%  runtime.freedefer
     0.07s  0.88% 95.98%      0.07s  0.88%  sync.(*RWMutex).RLock
     0.05s  0.63% 96.61%      0.13s  1.63%  runtime.deferreturn
     0.05s  0.63% 97.24%      0.05s  0.63%  runtime.newdefer
     0.05s  0.63% 97.87%      0.05s  0.63%  sync.(*RWMutex).RUnlock
     0.04s   0.5% 98.37%      1.13s 14.18%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/lease.(*Lease).expired

We went ahead and implemented a heap to give us a priority queue of leases to expire (as suggested by the comment left in the code); our performance for setting 200,000 keys with leases went up 25%, and all of the slow-request warnings disappeared from the logs.
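
For context, this is a minimal sketch of the kind of expiry min-heap we have in mind, built on container/heap; leaseItem, leaseQueue, and expiredLeases are illustrative names, not the actual lessor types:

package main

import (
        "container/heap"
        "fmt"
        "time"
)

// leaseItem pairs a lease ID with its expiry time.
type leaseItem struct {
        id     int64
        expiry time.Time
}

// leaseQueue is a min-heap ordered by soonest expiry, so finding expired
// leases no longer requires scanning every live lease in a map.
type leaseQueue []leaseItem

func (q leaseQueue) Len() int            { return len(q) }
func (q leaseQueue) Less(i, j int) bool  { return q[i].expiry.Before(q[j].expiry) }
func (q leaseQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *leaseQueue) Push(x interface{}) { *q = append(*q, x.(leaseItem)) }
func (q *leaseQueue) Pop() interface{} {
        old := *q
        n := len(old)
        item := old[n-1]
        *q = old[:n-1]
        return item
}

// expiredLeases pops every lease whose expiry is not after now.
func expiredLeases(q *leaseQueue, now time.Time) []int64 {
        var expired []int64
        for q.Len() > 0 && !(*q)[0].expiry.After(now) {
                expired = append(expired, heap.Pop(q).(leaseItem).id)
        }
        return expired
}

func main() {
        q := &leaseQueue{}
        now := time.Now()
        heap.Push(q, leaseItem{id: 1, expiry: now.Add(-time.Second)}) // already expired
        heap.Push(q, leaseItem{id: 2, expiry: now.Add(time.Minute)})  // still live
        fmt.Println(expiredLeases(q, now))                            // prints [1]
}

The point is that finding expired leases becomes proportional to the number of leases actually due, rather than to the total number of live leases.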

We're going to get some more benchmarks tomorrow, and we'll hopefully have a pull request later this week, but we understand if you want to let it sit while the bug hunting continues.

@gyuho
Contributor

gyuho commented Mar 6, 2018

@mgates Sorry for delay. And thanks for working on the improvements!

@jpbetz
Contributor

jpbetz commented May 30, 2018

We're seeing similar issues on etcd clusters serving Kubernetes events, where the lease count is in the 200k+ range. xref kubernetes/kubernetes#47532

@xiang90
Contributor

xiang90 commented May 30, 2018

@jpbetz Just a side note, we should really try not to create so many leases in Kubernetes since it is not necessary.

@jpbetz
Contributor

jpbetz commented May 30, 2018

@xiang90 I completely agree. At the very least we could improve this by using a lease per time bucket (e.g. one minute) and reusing that lease for all events that should expire in that bucket. This would reduce the default lease count to a fixed number (e.g. 60 total if we did 1m buckets for 1h of event TTL).
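
A rough client-side sketch of what that bucketing could look like; bucketLessor, leaseForTTL, the one-minute granularity, and the +60s padding are assumptions for illustration, not the actual Kubernetes change:

package main

import (
        "context"
        "sync"
        "time"

        "github.com/coreos/etcd/clientv3"
)

// bucketLessor hands out one shared lease per expiry bucket instead of one
// lease per event, capping the number of live leases at roughly TTL/bucket.
type bucketLessor struct {
        mu     sync.Mutex
        cli    *clientv3.Client
        leases map[int64]clientv3.LeaseID // unix minute of expiry -> lease ID
}

func newBucketLessor(cli *clientv3.Client) *bucketLessor {
        return &bucketLessor{cli: cli, leases: map[int64]clientv3.LeaseID{}}
}

// leaseForTTL returns the lease for the one-minute bucket containing now+ttl,
// granting it on first use. Stale buckets are never pruned here; a real
// implementation would clean them up.
func (b *bucketLessor) leaseForTTL(ctx context.Context, ttl time.Duration) (clientv3.LeaseID, error) {
        bucket := time.Now().Add(ttl).Truncate(time.Minute).Unix()
        b.mu.Lock()
        defer b.mu.Unlock()
        if id, ok := b.leases[bucket]; ok {
                return id, nil
        }
        // Grant a bit beyond the requested TTL so every key in the bucket
        // lives at least as long as it asked for.
        resp, err := b.cli.Grant(ctx, int64(ttl/time.Second)+60)
        if err != nil {
                return 0, err
        }
        b.leases[bucket] = resp.ID
        return resp.ID, nil
}

func main() {
        cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
        if err != nil {
                panic(err)
        }
        defer cli.Close()

        bl := newBucketLessor(cli)
        id, err := bl.leaseForTTL(context.TODO(), time.Hour)
        if err != nil {
                panic(err)
        }
        // All events expiring in the same minute share this lease.
        if _, err = cli.Put(context.TODO(), "events/example", "payload", clientv3.WithLease(id)); err != nil {
                panic(err)
        }
}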

@jbguerraz

Hello @mgates ! I'm wondering if there is a chance to get a pull request from your side ? :)
Thank you!

@jingyih
Contributor

jingyih commented Sep 17, 2019

Hello @mgates ! I'm wondering if there is a chance to get a pull request from your side ? :)
Thank you!

The improvement was merged and is included in release 3.4. Ref: #9418

@gyuho
Contributor

gyuho commented Sep 17, 2019

Let's close it. And revisit if any other lease performance issue arises.

@gyuho gyuho closed this as completed Sep 17, 2019