Adaptive hashing #1796

Closed
wants to merge 4 commits into from
193 changes: 193 additions & 0 deletions text/0000-adaptive-hashing.md
@@ -0,0 +1,193 @@
- Feature Name: adaptive_hashing
- Start Date: 2016-11-20
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary

Implement adaptive hashing for HashMap. Initialize hash maps using the fastest practical hash
function, and fall back to SipHash in case of a potential DoS attack.

# Motivation

Hash DoS is a denial-of-service attack that exploits hash collisions: an attacker chooses keys that
collide, so that hash table operations become slow. Consider creating a HashMap from a given list of
keys:

```rust
use std::collections::HashMap;

fn make_map(keys: Vec<usize>) -> HashMap<usize, i32> {
    // Review comment (Member): typo on =>
    let mut map = HashMap::new();
    for key in keys {
        map.insert(key, 0);
    }
    map
}
```

Let's suppose that the `keys` array is an input that comes from the outside world. A simple case
of DoS happens when a server receives an HTTP request with thousands of deliberately chosen
parameters. Processing just one such request can take minutes.

The `keys` array is crafted to get the slowest possible run time. In the worst case, all keys
hash to the same bucket, so we no longer benefit from hashing. Each iteration of the loop in the
example code takes O(n) time, and the entire function executes in O(n^2) time. The hash map behaves
like an unsorted dynamic array that is searched linearly. We might as well write:

```rust
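fn make_map(keys: Vec<usize>) -> Vec<(usize, i32)> {
    let mut map: Vec<(usize, i32)> = Vec::new();
    for key in keys {
        // Linear scan over all stored pairs: O(n) per insertion.
        if let Some(index) = map.iter().position(|&(k, _)| k == key) {
            map[index] = (key, 0);
        } else {
            map.push((key, 0));
        }
    }
    map
}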
```
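The quadratic behavior can be made concrete by counting probes. The following is a minimal sketch, assuming a toy linear-probing table rather than the standard library's Robin Hood implementation; `total_probes`, `identity` and `constant` are hypothetical names introduced only for this illustration.

```rust
/// Inserts `keys` into a toy linear-probing table with `capacity` slots and
/// returns how many occupied slots were stepped over in total.
/// Assumes `keys.len() <= capacity`, so an empty slot always exists.
fn total_probes(keys: &[usize], capacity: usize, hash: fn(usize) -> usize) -> usize {
    let mut slots: Vec<Option<usize>> = vec![None; capacity];
    let mut probes = 0;
    for &key in keys {
        let mut i = hash(key) % capacity;
        while let Some(existing) = slots[i] {
            if existing == key {
                break; // key already present, overwrite in place
            }
            probes += 1;
            i = (i + 1) % capacity;
        }
        slots[i] = Some(key);
    }
    probes
}

/// A well-behaved hash for distinct small integers: no collisions.
fn identity(key: usize) -> usize {
    key
}

/// The attacker's ideal: every key lands in the same bucket.
fn constant(_key: usize) -> usize {
    0
}
```

With 100 distinct keys, the colliding case steps over 0 + 1 + ... + 99 slots, which is the n(n-1)/2 growth described above, while the collision-free case steps over none.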

We are only considering slow insertions, because we don’t need to worry about lookup. The cost of
inserting an element includes the cost of searching for that element. Immediately after inserting an
element, the cost of looking it up will be equal or smaller. Later, after some number of unrelated
insertions, the cost of looking up that element will still be limited by some threshold.

To prevent all Hash DoS attacks, we need to make sure that HashMap is protected. The standard
library's HashMap currently uses SipHash-1-3 for all its lookups to protect from Hash DoS.
Unfortunately, this comes with a tradeoff. Some people believe SipHash is too slow and consider the
non-ideal performance of HashMap for small keys to be its main drawback. Others see the use of
SipHash as a good compromise between security and speed.

Is SipHash really slow, and why? We can simply count the operations it performs. One SipHash round
involves 14 64-bit operations. SipHash-1-3 runs one round for each 8 bytes of input and three rounds
for finalization, so it involves 16 operations for each 8 bytes of input (14 for the round plus two
XORs that mix in the message word) and 42 operations for finalization. Hashing an input of 8 bytes
therefore needs 58 operations. Out-of-order execution allows more than one operation per cycle on
modern CPUs, and SipHash uses only simple operations: addition, bitwise rotation and XOR. Still, we
can see that SipHash is relatively slow for small values. Ideally, hashing an integer should take
only 7 instructions.
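The operation count can be checked against the round function itself. Below is a sketch of one SipRound as published in the SipHash specification, with each operation numbered. The constants and the helper `ops_for_one_word` are illustrative names, and the two message XORs are an assumption about how the figure of 16 is reached.

```rust
/// One SipRound over the four 64-bit state words.
/// Each numbered line is a single addition, rotation or XOR.
fn sip_round(v: &mut [u64; 4]) {
    v[0] = v[0].wrapping_add(v[1]); // 1
    v[1] = v[1].rotate_left(13);    // 2
    v[1] ^= v[0];                   // 3
    v[0] = v[0].rotate_left(32);    // 4
    v[2] = v[2].wrapping_add(v[3]); // 5
    v[3] = v[3].rotate_left(16);    // 6
    v[3] ^= v[2];                   // 7
    v[0] = v[0].wrapping_add(v[3]); // 8
    v[3] = v[3].rotate_left(21);    // 9
    v[3] ^= v[0];                   // 10
    v[2] = v[2].wrapping_add(v[1]); // 11
    v[1] = v[1].rotate_left(17);    // 12
    v[1] ^= v[2];                   // 13
    v[2] = v[2].rotate_left(32);    // 14
}

const OPS_PER_ROUND: usize = 14;
const COMPRESSION_ROUNDS: usize = 1; // the "1" in SipHash-1-3
const FINALIZATION_ROUNDS: usize = 3; // the "3" in SipHash-1-3
const MESSAGE_XORS: usize = 2; // assumed: message word XORed in before and after the round

/// Operation count for hashing a single 8-byte word, following the text.
fn ops_for_one_word() -> usize {
    let per_word = COMPRESSION_ROUNDS * OPS_PER_ROUND + MESSAGE_XORS;
    let finalization = FINALIZATION_ROUNDS * OPS_PER_ROUND;
    per_word + finalization
}
```

Note that this tally follows the text and leaves out initialization and length padding.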

Several dynamic languages use SipHash for their hash tables. However, Rust is a systems programming
language, so the slowdown from hashing is more noticeable than in those languages.

Perl uses a mechanism similar to adaptive hashing for its dictionaries implemented with chaining.
Java uses chaining and changes a linked list to a binary tree when its length exceeds some
threshold.

Fortunately, Robin Hood hashing can be easily extended with adaptive hashing.

# Detailed design
## The algorithm for adaptive hashing

A HashMap with adaptive hashing has two states. One state is called “fast mode” and the other is
“safe mode”. The fast mode is the initial state for HashMaps with keys of a type that can be hashed
in one shot. Otherwise, a HashMap with complex keys is always in safe mode. We switch to the safe
mode when all of the following conditions are met:

- an inserted entry's displacement >= 128, or the number of entries displaced by an inserted
entry >= 512
- the load of the map is smaller than 20%
- the map is in the fast mode

The second condition reduces the odds of switching to safe hashing. The chance that the first
condition is satisfied is tiny, and the chance that both are satisfied at the same time is
negligible. Moreover, we add a flag to the map. The flag delays displacement reduction until the
next insertion to make code simpler. Otherwise, rebuilding the map would invalidate our entry.
The pseudocode for a function that replaces `insert` is:

```
fn safeguarded_insert(map, key):
    entry = insert(map, key)
    if the entry's displacement >= 128 or the number of entries displaced by entry >= 512:
        set the flag for reducing displacement
    return entry
```

Before the next insertion operation, the state must be checked. Conveniently, the `reserve` method
is always called before insertion and entry search, so we add the following code to `reserve`:

```
fn reserve(map, ...):
    if the flag for reducing displacement is set and the map uses fast hashing:
        if the load of the map is higher than 20%:
            grow the map
        else:
            switch the map's hash state to safe hashing
            rebuild the map
        clear the flag for reducing displacement
    // ...
```
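The two pseudocode fragments can be combined into a small state machine. The sketch below models only the bookkeeping and assumes the caller supplies the displacement and the number of shifted entries for each insertion. `AdaptiveState`, `Mode`, `Action`, `after_insert` and `on_reserve` are hypothetical names, not part of the actual HashMap implementation, and growing is modelled as doubling the capacity.

```rust
/// Thresholds quoted in the RFC text.
const DISPLACEMENT_THRESHOLD: usize = 128;
const SHIFT_THRESHOLD: usize = 512;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Fast,
    Safe,
}

/// What `reserve` decided to do before the next insertion.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Nothing,
    Grow,
    SwitchToSafe,
}

/// Hypothetical bookkeeping for one map; the buckets themselves are omitted.
struct AdaptiveState {
    mode: Mode,
    reduce_displacement: bool,
    len: usize,
    capacity: usize,
}

impl AdaptiveState {
    fn new(capacity: usize) -> Self {
        AdaptiveState { mode: Mode::Fast, reduce_displacement: false, len: 0, capacity }
    }

    /// Mirrors `safeguarded_insert`: record the insertion and set the flag
    /// when either threshold is reached.
    fn after_insert(&mut self, displacement: usize, shifted: usize) {
        self.len += 1;
        if displacement >= DISPLACEMENT_THRESHOLD || shifted >= SHIFT_THRESHOLD {
            self.reduce_displacement = true;
        }
    }

    /// Mirrors the check added to `reserve`.
    fn on_reserve(&mut self) -> Action {
        if !(self.reduce_displacement && self.mode == Mode::Fast) {
            return Action::Nothing;
        }
        self.reduce_displacement = false;
        // Load above 20% is equivalent to len * 5 > capacity.
        if self.len * 5 > self.capacity {
            self.capacity *= 2; // assumed growth policy
            Action::Grow
        } else {
            self.mode = Mode::Safe;
            Action::SwitchToSafe
        }
    }
}
```

In this model a long probe sequence at high load only grows the map. The switch to safe hashing happens only when a threshold is reached while the load is at or below 20%.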

Here’s a state diagram for HashMap with adaptive hashing. The dashed edge means the state change is
very unlikely, and the dotted edge means the state change is enormously unlikely.

<img width="800" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/adaptive.svg">

## Choosing constants

The thresholds of 128 and 512 are chosen to minimize the chance of exceeding them. In particular, we
want that chance to be less than 10^-8 with a load of 90% and less than 10^-30 with a load of 20%.
For displacement, the smallest k that fits our needs is 90, so we round that up to 128. For the
number of forward-shifted buckets, we choose k=512. Keep in mind that the run length is a sum of the
displacement and the number of forward-shifted buckets, so its threshold is 128+512=640. Even though
the probability of having a run length of more than 640 buckets may be higher than the probability
we want, it should be low enough.

<img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/lookup_cost.png">

Review comment (Member): These might want to get encoded as base64-encoded data URIs, so the RFC is self-contained.


<img width="600" src="https://cdn.rawgit.com/pczarn/code/d62cd067ca84ff049ef196aa1b7773d67b4189d4/rust/robinhood/run_length.png">

## Choosing hash functions

For hashing integers, the best choice is a mixer similar to the one used in SipHash’s finalizer. For
strings and slices of integers, we will use FarmHash. (The Hasher trait must allow one-shot hashing
for FarmHash.) Using any other key type means your HashMap will do safe hashing.

Review comment (Member): No design that I know of has resolved the issue that we run into with one shot hashing, even for a few select types: custom types that define Hash exactly like strings/slices do.
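To give a sense of what a one-shot integer mixer looks like, here is a sketch that uses the 64-bit finalizer from MurmurHash3 as a stand-in. This is not the mixer the RFC proposes, which is left unspecified here. `mix64` and `bucket_for` are illustrative names, and the table capacity is assumed to be a power of two.

```rust
/// MurmurHash3's 64-bit finalizer, used here only as an example of a
/// cheap integer mixer: three shift-XORs and two multiplications.
fn mix64(mut k: u64) -> u64 {
    k ^= k >> 33;
    k = k.wrapping_mul(0xff51_afd7_ed55_8ccd);
    k ^= k >> 33;
    k = k.wrapping_mul(0xc4ce_b9fe_1a85_ec53);
    k ^= k >> 33;
    k
}

/// Maps a key to a bucket index. Assumes `capacity` is a power of two,
/// so masking is equivalent to taking the remainder.
fn bucket_for(key: u64, capacity: usize) -> usize {
    (mix64(key) as usize) & (capacity - 1)
}
```

A mixer of this kind has no secret key, so it offers no protection against chosen collisions. That is why the design pairs it with a fallback to SipHash.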

# Consequences
## For the performance of Rust programs

The impact is minimal on programs that rarely use HashMaps. The increase in binary size should be
small. For programs that spend a large portion of their run time using HashMap with primitive keys,
the speedup should be noticeable.

On 32-bit platforms, the benefit of using a 32-bit hash function instead of SipHash is higher,
because SipHash’s round involves 30 32-bit operations.

## For the HashMap API

One day, we may want to hash HashMaps. The hashing infrastructure can be changed to allow it. The
implementation of Hash for HashMap can hash the hashes stored in every map, rather than However,

Review comment (Member): rather than what?

adaptive hashing makes it harder to write a correct and performant implementation of Hash for
HashMap. If two HashMaps (that can be compared) have equal values, they must hash to the same
integer. However, with adaptive hashing, HashMap can switch to the safe mode, which means it no
longer stores the same hashes as other HashMaps that remain in ‘fast’ mode. The only way to handle
the situation for the safe mode is to rehash all keys as if the HashMap were in ‘fast’ mode, which
may take a significant time.

## For the order of iteration

Currently, HashMaps have a nondeterministic order of iteration by default. This is seen as a good
thing, because programmers who test their programs will not come to rely on a fixed iteration order.

Review comment (Member): nit: “programmers that test programs” makes little grammatical sense. Maybe: “because programmers then ensure programs do not rely on a fixed iteration order”

Otherwise, programmers might not know that their programs only work with a specific iteration order.
To keep nondeterministic order, SipHash’s thread-local seed may be used for all hashers.

# Drawbacks

More complex code needs to be maintained. There’s a risk of having a bug in the algorithm or in the
code.

# Alternatives

- We can reject adaptive hashing. SipHash-1-3 may be fast enough.
- We can restrict adaptive hashing to integer keys. With this limitation, we don't need FarmHash in
the standard library.
- We can use some other fast one-shot hasher instead of FarmHash.
- We can use an additional fast hash function for fast streaming hashing. The improvement would
be small.

Review comment (Member): nit: “add use” maybe should be either “add” or “use”.

- We can set FarmHash's seed to a random value for nondeterminism.
- When a map is emptied, its hash function does not matter anymore. As a special case, we can detect
operations that clear maps in safe mode, and reset them back to fast mode.
- We can let users declare their types as one-shot hashable.

# Unresolved questions

Is there any hasher that is faster than FarmHash?

Are the chosen thresholds reasonably low?