
Poor performance and cluster stability when lots of leases are expiring. #9360

Closed
mgates opened this issue Feb 26, 2018 · 31 comments

@mgates

mgates commented Feb 26, 2018

We've been using etcd v3 leases that we let expire to add a TTL to keys. We've been noticing very poor performance with lease expiration versus manual key deletion. This is true both on our main 5-node cluster and on a single-node cluster we set up for testing. In particular, when lots of leases are expiring, request time, especially p99 request time, goes through the roof, and we end up with a lot of request timeouts.

It also seems that the cluster gets behind and leases don't expire on time either; the keys are still there but have a negative TTL.

We saw this in our production cluster (5 8-CPU VMs), where we paralyzed our cluster with about 300-400 lease expirations per second. It's able to handle well over that in set and delete requests.

We'd be happy to run any tests or gather any data that might be helpful.

@jcalvert
Contributor

jcalvert commented Feb 26, 2018

We were able to observe this behavior outside of our application code by using the following simple Go program. When pointing at localhost we are going through the gRPC proxy. Without leases, the cluster is able to handle this traffic easily with no degradation.

package main

import (
        "context"
        "log"
        "sync"
        "time"

        "github.com/coreos/etcd/clientv3"
        uuid "github.com/nu7hatch/gouuid"
)

func main() {
        cli, err := clientv3.New(clientv3.Config{
                Endpoints:   []string{"localhost:2379"},
                DialTimeout: 5 * time.Second,
        })
        if err != nil {
                log.Fatal(err)
        }
        defer cli.Close()
        log.Printf("Start")

        var wg sync.WaitGroup
        for k := 0; k < 200; k++ {
                wg.Add(1)
                go func() {
                        defer wg.Done()
                        for i := 0; i < 3000; i++ {
                                // Grant a 5-second lease; the key written below is
                                // removed when the lease expires.
                                resp, err := cli.Grant(context.TODO(), 5)
                                if err != nil {
                                        log.Fatal(err)
                                }

                                u, err := uuid.NewV4()
                                if err != nil {
                                        log.Fatal(err)
                                }
                                // Drop the WithLease option to see the no-lease baseline.
                                _, err = cli.Put(context.TODO(), u.String(), "bar", clientv3.WithLease(resp.ID))
                                if err != nil {
                                        log.Fatal(err)
                                }
                        }
                }()
        }
        wg.Wait()
        log.Printf("End")
}

@xiang90
Contributor

xiang90 commented Feb 26, 2018

what is the version of your etcd server?

@gyuho
Contributor

gyuho commented Feb 26, 2018

@mgates How do all these leases expire at the same time? Did you restart the machine?
We've added randomization on lease revoke in 3.3.

@mgates
Author

mgates commented Feb 26, 2018

We've been using 3.2, but we tested on 3.3 and had the same issue.

@mgates
Author

mgates commented Feb 26, 2018

The leases mostly expire one minute after creation, so at different times but with the same TTL. We've certainly done rolling restarts of etcd, but we weren't doing them during the times when we've seen this.

@gyuho
Contributor

gyuho commented Feb 26, 2018

We saw this in our production cluster (5 8-CPU VMs), where we paralyzed our cluster with about 300-400 lease expirations per second. It's able to handle well over that in set and delete requests.

How many leases are there in total? Reproducible steps would be helpful.

@jcalvert
Contributor

In our application usage, the keys being set are the result of many concurrent API requests. An analogy would be a reservation system where locks are taken out on a resource; if a resource is not released, we rely on lease expiration to enforce a TTL on the lock. In this way we naturally have many lease TTLs that expire concurrently without the attached keys being related.

@jcalvert
Contributor

TL;DR: We notice much worse performance when we allow the server to expire leases than when we revoke them actively. We're trying to determine whether using leases as a way to expire long-lived keys is an anti-pattern or whether we have hit an unexpected performance bug.

In our production system, we were doing approximately 1200 gRPC operations per second. Lease grants went from approximately 400 to 620 per second, with the additional 220 being leases that were not revoked. These lease durations ranged from 60 seconds to 8 hours. During this time our p99 request duration (http_request_duration_microseconds) went from 50ms to 250ms and remained high until the remaining leases expired, roughly 8 hours, even though the change was only in production for about 30 minutes. We were able to determine this by looking at the etcd_debugging_server_lease_expired_total values. Our previous benchmarks against this production cluster had shown it able to handle burst traffic of 6000 gRPC operations per second. This seems to indicate that issuing leases and then letting them expire degrades performance, and as mentioned before we could query and see leases with negative TTLs, so keys were not expiring on time.

We wanted to validate this by running a very minimal test outside of our application code and production infrastructure. We set up a cluster of 5 m3.large EC2 instances running 3.2.0, with the data directory backed by tmpfs. We initially had EBS volumes with 2000 provisioned IOPS and switched to the ramdisk to rule out disk as the source of the issue, but found no real difference.

We ran 600,000 total loops, each granting a lease and then setting a random key with value “bar” attached to that lease, with a 5-second lease TTL. Total run time was 8 minutes. Lease grants peaked at 1893 per second and fell off to about 1300. Server-side lease expirations peaked at 830 per second and continued for another 8 minutes after all operations sent to the cluster had stopped. 3240 requests failed and we received several hundred “apply entries took too long” messages in the logs. This was all done using the very simple Go program provided above as a model for our real-world application behavior. Hopefully this is some information towards reproducing the issue, but we will try to provide something more turnkey as a reproduction test tomorrow.

@gyuho
Contributor

gyuho commented Feb 26, 2018

@jcalvert Do you have any concurrent Lease API workloads (e.g. Lease Lookup, Lease List)?

Or just overlapping Grant and Revoke?

@jcalvert
Contributor

@gyuho No lookup, list, or heartbeat.

@xiang90
Contributor

xiang90 commented Feb 27, 2018

The v3 API is not measured by http_request_duration_microseconds, so I am not sure why you want to check that metric.

@jcalvert
Contributor

@xiang90 We saw quite a bit of movement on that metric. What would be a more appropriate metric for measuring request time, then? We saw an equivalent rise on our application side, where we wrap each request to etcd in a statsd timer.

@gyuho
Contributor

gyuho commented Feb 27, 2018

@jcalvert For v3, we have gRPC metrics

e.g.

grpc_server_handled_total{grpc_code="DeadlineExceeded",grpc_method="DeleteRange",grpc_service="etcdserverpb.KV",grpc_type="unary"} 0

Or

# HELP etcd_server_proposals_failed_total The total number of failed proposals seen.
# TYPE etcd_server_proposals_failed_total counter
etcd_server_proposals_failed_total 0
# HELP etcd_server_proposals_pending The current number of pending proposals to commit.
# TYPE etcd_server_proposals_pending gauge
etcd_server_proposals_pending 0

@jcalvert
Contributor

Looking at the comment here, the hard limit on the lease revoke rate, and the fact that lessor.go has a mutex around operations like Grant, it seems that in general having a large number of concurrently unexpired leases is going to be consistently bad for performance. Is there any chance that performance considerations like this will be added to the roadmap for etcd?

@gyuho
Contributor

gyuho commented Feb 27, 2018

@jcalvert That limit is for restricting lease revoke spikes when a node restarts.

This issue is more about Revoke path, right?

I need to investigate more where bottlenecks are.

Is this #9360 (comment) a good way to reproduce your issue?
Or please let me know if you have any other easier way to reproduce.

@jcalvert
Contributor

@gyuho Yes, although I believe the effect is more pronounced if the lease time is extended to 5 minutes.

@gyuho
Contributor

gyuho commented Feb 27, 2018

@jcalvert Ok, thanks. I will double-check our code base.

@mgates
Author

mgates commented Feb 27, 2018

Hey folks, I got some numbers. These are all on a one-node cluster with a ramdisk backing the data directory, and with the test running on the same VPS, so network and disk latency shouldn't be a factor. I restarted the node and cleared out the data dir before each run.

When we revoked the lease right away (https://gist.github.com/mgates/f79cbccf9f61ae5fd9d5c85d0b984b41), it took 7 minutes and 30 seconds to run, and there was one 'apply entries took too long' warning in the logs.

With a 5-second lease, where we deleted the key (https://gist.github.com/mgates/b866ab74625f4dd83eb6ebc20d988eff) but let the lease expire on its own, it took 9 and a half minutes, and we had 225 of those messages.

Increasing the lease to 60 seconds took 10.5 minutes and produced 441 of those messages.
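
For reference, here is a minimal, self-contained sketch of the loop body these runs compare; putThenDelete, the key name, and the revoke flag are our own illustrative names, not the exact gist code:

package main

import (
        "context"
        "log"
        "time"

        "github.com/coreos/etcd/clientv3"
)

// putThenDelete writes a key under a lease and deletes it. With revoke=true
// (variant A) the lease is revoked immediately; with revoke=false (variant B)
// the server is left to expire the lease on its own.
func putThenDelete(ctx context.Context, cli *clientv3.Client, key string, leaseTTL int64, revoke bool) error {
        resp, err := cli.Grant(ctx, leaseTTL)
        if err != nil {
                return err
        }
        if _, err = cli.Put(ctx, key, "bar", clientv3.WithLease(resp.ID)); err != nil {
                return err
        }
        if _, err = cli.Delete(ctx, key); err != nil {
                return err
        }
        if revoke {
                // Variant A: revoke right away instead of waiting for expiry.
                _, err = cli.Revoke(ctx, resp.ID)
        }
        return err
}

func main() {
        cli, err := clientv3.New(clientv3.Config{
                Endpoints:   []string{"localhost:2379"},
                DialTimeout: 5 * time.Second,
        })
        if err != nil {
                log.Fatal(err)
        }
        defer cli.Close()
        if err := putThenDelete(context.TODO(), cli, "some-key", 5, true); err != nil {
                log.Fatal(err)
        }
}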

I can get you the full test logs if you want, or get more detail about how long the long-running actions took if it would help.

@gyuho
Contributor

gyuho commented Feb 27, 2018

I can get you the full test logs if you want, or get more detail about how long the long-running actions took if it would help.

Sure, full server logs + output of /metrics would help.

@mgates
Author

mgates commented Feb 27, 2018

Here you go, let me know if you need more data or anything: https://gist.github.com/mgates/bcf10617030922980048dfa08f6f208f

@gyuho
Contributor

gyuho commented Mar 1, 2018

@mgates Thanks for the logs. Will take a look.

/cc @ximenzaoshi

@mgates
Author

mgates commented Mar 5, 2018

Hey folks, we were thinking of taking a stab at implementing the backing heap suggested by the comment in https://github.com/coreos/etcd/blob/master/lease/lessor.go#L123. We just wanted to check whether you had made any headway in a different direction before we dive in.

@gyuho
Contributor

gyuho commented Mar 5, 2018

@mgates We are working to find the root cause of another lease issue #9374.

Rather not optimize before we resolve possible bugs.

@mgates
Author

mgates commented Mar 6, 2018

Hey folks, we took a stab at some profiling and exploratory coding. It looks like there is a lot of CPU time spent checking the leases map:

Fetching profile over HTTP from http://localhost:2379/debug/pprof/profile
Saved profile in /home/pair/pprof/pprof.etcd.samples.cpu.009.pb.gz
File: etcd
Type: cpu
Time: Mar 6, 2018 at 10:13pm (UTC)
Duration: 30s, Total samples = 7.97s (26.57%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 7.84s, 98.37% of 7.97s total
Dropped 3 nodes (cum <= 0.04s)
Showing top 10 nodes out of 16
      flat  flat%   sum%        cum   cum%
     6.37s 79.92% 79.92%      6.37s 79.92%  runtime._ExternalCode
     0.69s  8.66% 88.58%      1.09s 13.68%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/lease.(*Lease).Remaining
     0.28s  3.51% 92.10%      0.28s  3.51%  runtime.mapiternext
     0.16s  2.01% 94.10%      1.57s 19.70%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/lease.(*lessor).findExpiredLeases
     0.08s  1.00% 95.11%      0.08s  1.00%  runtime.freedefer
     0.07s  0.88% 95.98%      0.07s  0.88%  sync.(*RWMutex).RLock
     0.05s  0.63% 96.61%      0.13s  1.63%  runtime.deferreturn
     0.05s  0.63% 97.24%      0.05s  0.63%  runtime.newdefer
     0.05s  0.63% 97.87%      0.05s  0.63%  sync.(*RWMutex).RUnlock
     0.04s   0.5% 98.37%      1.13s 14.18%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/lease.(*Lease).expired

We went ahead and implemented a heap to give us a priority queue of leases to expire (as suggested by the comment left in the code); our performance for setting 200,000 keys with leases went up 25%, and all of the slow-request warnings disappeared from the logs.
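
For context, this is a minimal sketch of the kind of expiry min-heap we have in mind, built on container/heap; leaseItem, leaseQueue, and expiredLeases are illustrative names, not the actual lessor types:

package main

import (
        "container/heap"
        "fmt"
        "time"
)

// leaseItem pairs a lease ID with its expiry time.
type leaseItem struct {
        id     int64
        expiry time.Time
}

// leaseQueue is a min-heap ordered by soonest expiry, so finding expired
// leases no longer requires scanning every live lease in a map.
type leaseQueue []leaseItem

func (q leaseQueue) Len() int            { return len(q) }
func (q leaseQueue) Less(i, j int) bool  { return q[i].expiry.Before(q[j].expiry) }
func (q leaseQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *leaseQueue) Push(x interface{}) { *q = append(*q, x.(leaseItem)) }
func (q *leaseQueue) Pop() interface{} {
        old := *q
        n := len(old)
        item := old[n-1]
        *q = old[:n-1]
        return item
}

// expiredLeases pops every lease whose expiry is not after now.
func expiredLeases(q *leaseQueue, now time.Time) []int64 {
        var expired []int64
        for q.Len() > 0 && !(*q)[0].expiry.After(now) {
                expired = append(expired, heap.Pop(q).(leaseItem).id)
        }
        return expired
}

func main() {
        q := &leaseQueue{}
        now := time.Now()
        heap.Push(q, leaseItem{id: 1, expiry: now.Add(-time.Second)}) // already expired
        heap.Push(q, leaseItem{id: 2, expiry: now.Add(time.Minute)})  // still live
        fmt.Println(expiredLeases(q, now))                            // prints [1]
}

The point is that finding expired leases becomes proportional to the number of leases actually due, rather than to the total number of live leases.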

We're going to get some more benchmarks tomorrow, and we'll hopefully have a pull request later this week, but we understand if you want to let it sit while the bug hunting continues.

@gyuho
Contributor

gyuho commented Mar 6, 2018

@mgates Sorry for delay. And thanks for working on the improvements!

@jpbetz
Contributor

jpbetz commented May 30, 2018

We're seeing similar issues on etcd clusters serving Kubernetes events, where the lease count is in the 200k+ range. xref kubernetes/kubernetes#47532

@xiang90
Contributor

xiang90 commented May 30, 2018

@jpbetz Just a side note, we should really try not to create so many leases in Kubernetes since it is not necessary.

@jpbetz
Contributor

jpbetz commented May 30, 2018

@xiang90 I completely agree. At the very least we could improve this by using a lease per time bucket (e.g. one minute) and reusing that lease for all events that should expire in that bucket. This would reduce the default lease count to a fixed number (e.g. 60 total if we did 1m buckets for 1h of event TTL).
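
A rough client-side sketch of what that bucketing could look like; bucketLessor, leaseForTTL, the one-minute granularity, and the +60s padding are assumptions for illustration, not the actual Kubernetes change:

package main

import (
        "context"
        "sync"
        "time"

        "github.com/coreos/etcd/clientv3"
)

// bucketLessor hands out one shared lease per expiry bucket instead of one
// lease per event, capping the number of live leases at roughly TTL/bucket.
type bucketLessor struct {
        mu     sync.Mutex
        cli    *clientv3.Client
        leases map[int64]clientv3.LeaseID // unix minute of expiry -> lease ID
}

func newBucketLessor(cli *clientv3.Client) *bucketLessor {
        return &bucketLessor{cli: cli, leases: map[int64]clientv3.LeaseID{}}
}

// leaseForTTL returns the lease for the one-minute bucket containing now+ttl,
// granting it on first use. Stale buckets are never pruned here; a real
// implementation would clean them up.
func (b *bucketLessor) leaseForTTL(ctx context.Context, ttl time.Duration) (clientv3.LeaseID, error) {
        bucket := time.Now().Add(ttl).Truncate(time.Minute).Unix()
        b.mu.Lock()
        defer b.mu.Unlock()
        if id, ok := b.leases[bucket]; ok {
                return id, nil
        }
        // Grant a bit beyond the requested TTL so every key in the bucket
        // lives at least as long as it asked for.
        resp, err := b.cli.Grant(ctx, int64(ttl/time.Second)+60)
        if err != nil {
                return 0, err
        }
        b.leases[bucket] = resp.ID
        return resp.ID, nil
}

func main() {
        cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
        if err != nil {
                panic(err)
        }
        defer cli.Close()

        bl := newBucketLessor(cli)
        id, err := bl.leaseForTTL(context.TODO(), time.Hour)
        if err != nil {
                panic(err)
        }
        // All events expiring in the same minute share this lease.
        if _, err = cli.Put(context.TODO(), "events/example", "payload", clientv3.WithLease(id)); err != nil {
                panic(err)
        }
}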

@jbguerraz

Hello @mgates ! I'm wondering if there is a chance to get a pull request from your side ? :)
Thank you!

@jingyih
Contributor

jingyih commented Sep 17, 2019

Hello @mgates ! I'm wondering if there is a chance to get a pull request from your side ? :)
Thank you!

The improvement was merged and is included in release 3.4. Ref: #9418

@gyuho
Contributor

gyuho commented Sep 17, 2019

Let's close it. And revisit if any other lease performance issue arises.

@gyuho gyuho closed this as completed Sep 17, 2019