Design consistent hashing
Table of Contents
Consistent hashing is a commonly used technique to achieve horizontal scaling, to distribute requests/data efficiently and evenly across servers.
The rehashing problem
해싱을 하면 제법 골고루 랜덤한 값이 나오긴 하지만, 일반적으로 이 값이 너무 크기 때문에 곧바로 쓸 수 없고 모듈러 연산으로 원하는 크기만큼 줄인다. 만약에 n개의 서버에 (across servers) 해싱을 하고 싶다면, 해싱한 값을 n으로 모듈러 하게 된다. 문제는 서버 수, 즉 n이 바뀌는 경우이다. 이때는 모든 모듈러 연산의 n이 바뀌기 때문에, 같은 서버라도 해시 값이 바뀌게 된다. 이를 리해싱 문제라고 한다.
Most keys are redistributed. This causes a storm of cache misses.
Consistent hashing
A special kind of hashing such that when a hash table is re-sized and consistent hashing is used, only k/n keys need to be remapped on average, where k is the number of keys, and n is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped.
Basic approach
- Assume SHA1 is used as the hash function.
- SHA1 space goes from 0 to $2^{160}-1$.
- By collecting both ends, we get a hash ring.
- We map servers based on server IP or name onto the ring.
- One thing worth mentioning is that hash function used here is different from the one in “the rehashing problem.”, and there is no modular operation!
- To determine which server a key is stored on, we go rightward/clockwise from the key position on the ring until a server is found.
- Adding a new server will only require redistribution of a fraction of keys (usually 1/n).
- When a server is removed, only a fraction of keys require redistribution with consistent hashing. The rest of the keys are unaffected.
There are two issues in this basic approach.
- It is impossible to keep the same size of partitions on the ring for all servers considering a server can be added or removed. A partition is the hash space between adjacent servers. It is possible that the size of the partitions on the ring assigned to each server is very small or fairly large.
- It is possible to have a non-uniform key distribution on the ring.
Virtual nodes
- A virtual node refers to the real node, and each server is represented by multiple virtual nodes on the ring.
- As the number of virtual nodes increases, the distribution of keys becomes more balanced.
- Because the standard deviation gets smaller with more virtual nodes, leading to balanced data distribution!
- With one or two hundred virtual nodes, the standard deviation is between 5-10% of the mean.
Wrap up
The benefits of consistent hashing include:
- Minimised keys are redistributed when servers are added/removed.
- Easy to scale horizontally because data are more evenly distributed.
- Mitigate hotspot key problem. Excessive access to a specific shard could cause server overload.
Widely used:
- Partitioning component of Amazon’s Dynamo
- Data partitioning across the cluster in Apache Cassandra
- Discord chat application
- Akamai content delivery network
- Maglev network load balancer