
Rate Limiter


When using RocksDB, users may want to throttle the maximum write speed within a certain limit for many reasons. For example, flash writes cause terrible spikes in read latency if they exceed a certain threshold. Since you've been reading this page, you probably already know why you need a rate limiter. Fortunately, RocksDB includes a native RateLimiter that should be adequate for most use cases.

How to use

Create a RateLimiter object by calling NewGenericRateLimiter. It can be created separately for each RocksDB instance, or shared among RocksDB instances to control their aggregated write rate of flush and compaction.

RateLimiter* rate_limiter = NewGenericRateLimiter(
    rate_bytes_per_sec /* int64_t */, 
    refill_period_us /* int64_t */,
    fairness /* int32_t */);
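
The limiter takes effect once it is attached to a DB's options. A minimal sketch, reusing the rate_limiter pointer created above:

// Attach the limiter to a DB via its options. Assigning the same
// shared_ptr to several instances' options makes them share one rate budget.
rocksdb::Options options;
options.rate_limiter.reset(rate_limiter);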

Params

  • rate_bytes_per_sec: this is the only parameter you need to set most of the time. It controls the total write rate of compaction and flush in bytes per second. Currently, RocksDB does not enforce a rate limit on anything other than flush and compaction; e.g., writes to the WAL are not limited.

  • refill_period_us: this controls how often tokens are refilled. For example, when rate_bytes_per_sec is set to 10MB/s and refill_period_us is set to 100ms, then 1MB is refilled every 100ms internally. A larger value leads to burstier writes, while a smaller value introduces more CPU overhead. The default value of 100,000 (100ms) should work for most cases.

  • fairness: RateLimiter accepts and queues requests at user-pri, high-pri, mid-pri and low-pri. Within a given refill interval, with one exception, a request queue of higher priority is satisfied before those of lower priority, but with a 1/fairness chance of being satisfied after them, to avoid starvation. The one exception is the user-pri queue, which is always satisfied before the others without being subject to the fairness mechanism (see below for more on user-pri).

    Currently, RocksDB assigns low-pri to requests from compaction and high-pri to requests from flush. Compaction requests could be starved if flush requests arrived continuously; the fairness parameter avoids this by granting compaction requests a 1/fairness chance of going first even while flush requests are pending.

    The fairness parameter also works with more priority levels. In general, it means that, within a given refill interval, there is a 1/fairness chance that a higher-pri request queue (e.g., high-pri) will be satisfied after all of its lower-pri request queues (e.g., mid-pri and low-pri). More concretely, if we have high-pri, mid-pri and low-pri queues, then within a given refill interval, the possible orders of satisfying the request queues, with their corresponding probabilities, are as follows (a simulation sketch of this selection appears after this list):

    {high_pri, mid_pri, low_pri} with probability (1 - 1/fairness) * (1 - 1/fairness);

    {high_pri, low_pri, mid_pri} with probability (1 - 1/fairness) * 1/fairness;

    {mid_pri, low_pri, high_pri} with probability 1/fairness * (1 - 1/fairness);

    {low_pri, mid_pri, high_pri} with probability 1/fairness^2.

    To make this more concrete, suppose fairness = 10 and, at the beginning of a certain refill interval, we start with 100 high-pri requests, 20 mid-pri requests and 10 low-pri requests queued. Further suppose that the fairness mechanism decided at the beginning of this interval to satisfy the high-pri queue first, and that enough bytes were refilled to satisfy the first 9 requests in the high-pri queue. Then there are two cases after satisfying the 9th high-pri request:

    (1) If we still have enough bytes for the 10th request from the high-pri queue (or more), high-pri requests continue to be satisfied until we either run out of high-pri requests (and move on to the next priority queue decided by the fairness mechanism at the beginning of this interval) or run out of bytes.

    (2) If we don't have enough bytes for the 10th request from the high-pri queue, we wait for the next refill interval and run the fairness mechanism again to decide the satisfaction order. In that case, there is a 9/10 chance that the 10th request satisfied will again be high-pri, a 1/10 * 9/10 chance it will be mid-pri, and a 1/10 * 1/10 chance it will be low-pri.

    As described above, user-pri represents a priority above all others, so the user-pri queue is always satisfied first without being subject to the fairness mechanism. It is intended for foreground operations triggered by users (e.g., reads from SST files, writes to the WAL), as opposed to operations triggered by RocksDB internals (e.g., flush, compaction). In other words, user-pri requests are never blocked by requests of other priorities. For example, if we have user-pri, high-pri, mid-pri and low-pri queues, then within a given refill interval, the possible orders of satisfying the request queues, with their corresponding probabilities, are as follows:

    {user_pri, high_pri, mid_pri, low_pri} with probability (1 - 1/fairness) * (1 - 1/fairness);

    {user_pri, high_pri, low_pri, mid_pri} with probability (1 - 1/fairness) * 1/fairness;

    {user_pri, mid_pri, low_pri, high_pri} with probability 1/fairness * (1 - 1/fairness);

    {user_pri, low_pri, mid_pri, high_pri} with probability 1/fairness^2.

    You should be fine leaving the fairness parameter at its default of 10.
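
The order selection above can be thought of as one demotion coin flip per non-lowest priority. The following is a minimal sketch, not RocksDB's actual implementation, that reproduces the documented order probabilities; all names in it are hypothetical:

#include <random>
#include <string>
#include <vector>

// Sketch: draw the satisfaction order for one refill interval. Every queue
// except the lowest is demoted behind all of its lower-pri queues with
// probability 1/fairness; a user_pri queue (not drawn here) would always be
// placed first, unconditionally.
std::vector<std::string> PickSatisfactionOrder(int fairness, std::mt19937& rng) {
  std::uniform_int_distribution<int> coin(1, fairness);  // hits 1 with prob 1/fairness
  const std::vector<std::string> pris = {"high_pri", "mid_pri", "low_pri"};
  std::vector<std::string> order;    // queues satisfied at their normal rank
  std::vector<std::string> demoted;  // queues pushed behind all lower-pri ones
  for (const std::string& pri : pris) {
    if (pri != pris.back() && coin(rng) == 1) {
      demoted.insert(demoted.begin(), pri);  // goes after every lower-pri queue
    } else {
      order.push_back(pri);
    }
  }
  order.insert(order.end(), demoted.begin(), demoted.end());
  // e.g. {high_pri, mid_pri, low_pri} with probability (1 - 1/fairness)^2,
  //      {low_pri, mid_pri, high_pri} with probability (1/fairness)^2.
  return order;
}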

Maximum bytes of a single request

  • Notice that although tokens are refilled at an interval set by refill_period_us, the maximum bytes that can be granted in a single request must be bounded, since we do not want to see the following scenario:

    Suppose we have rate_bytes_per_sec = 200, and request A from thread A asking for 1 million bytes arrives slightly before request B from thread B asking for 1 byte. Suppose these two requests have the same priority, so they are enqueued to the same queue. By the FIFO principle, request A is satisfied first, which takes a long time. If we don't bound the maximum bytes per request, request B is blocked until request A has been granted all the bytes it needs; if we do bound them, request A has to make multiple requests, allowing request B to "squeeze" between those requests in the queue and be satisfied much earlier.

    Therefore our implementation bounds the maximum bytes that can be granted in a single request by GetSingleBurstBytes().
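
For callers issuing their own requests, this means a large write should be split into chunks of at most GetSingleBurstBytes(). A minimal sketch, assuming the rate_limiter from earlier and a hypothetical payload size total_bytes:

#include <algorithm>

// Request a large payload in bounded chunks so other requests can interleave.
int64_t total_bytes = 1024 * 1024;  // hypothetical payload size
while (total_bytes > 0) {
  int64_t chunk = std::min(total_bytes, rate_limiter->GetSingleBurstBytes());
  rate_limiter->Request(chunk, rocksdb::Env::IO_LOW);  // may block until refill
  // ... perform the write of `chunk` bytes ...
  total_bytes -= chunk;
}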

Request token before writes

  • Tokens should be requested before each write. If the request cannot be satisfied immediately, the call blocks until tokens are refilled to fulfill it. For example,
// block if tokens are not enough
rate_limiter->Request(1024 /* bytes */, rocksdb::Env::IO_HIGH); 
Status s = db->Flush(rocksdb::FlushOptions());
  • Users can also dynamically change the rate limiter's bytes per second with SetBytesPerSecond() when needed. See include/rocksdb/rate_limiter.h for more API details.
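
For example, one might throttle background writes harder during peak traffic (the 5MB/s value below is purely illustrative):

// Tighten the limit during peak hours; call again later to relax it.
rate_limiter->SetBytesPerSecond(5 * 1024 * 1024 /* 5MB/s */);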

Customization

Users whose requirements go beyond the functionality provided by RocksDB's native RateLimiter can implement their own by extending the RateLimiter interface in include/rocksdb/rate_limiter.h.
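
A hypothetical skeleton is sketched below. The exact set of virtual methods to override is defined in include/rocksdb/rate_limiter.h and varies across RocksDB versions, so treat this only as the general shape, not the complete interface:

#include "rocksdb/rate_limiter.h"

// Hypothetical custom limiter; the method set shown reflects one version of
// the header. Consult include/rocksdb/rate_limiter.h for what must be overridden.
class MyRateLimiter : public rocksdb::RateLimiter {
 public:
  explicit MyRateLimiter(int64_t rate) : rate_(rate) {}

  void SetBytesPerSecond(int64_t bytes_per_second) override {
    rate_ = bytes_per_second;
  }

  void Request(const int64_t bytes, const rocksdb::Env::IOPriority pri) override {
    // Custom admission policy goes here, e.g. deferring low-pri requests.
  }

  int64_t GetSingleBurstBytes() const override { return 1024 * 1024; }

  int64_t GetTotalBytesThrough(
      const rocksdb::Env::IOPriority pri) const override {
    return 0;  // bookkeeping omitted in this sketch
  }

  int64_t GetTotalRequests(const rocksdb::Env::IOPriority pri) const override {
    return 0;  // bookkeeping omitted in this sketch
  }

  int64_t GetBytesPerSecond() const override { return rate_; }

 private:
  int64_t rate_;
};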
