Design a key-value store
Table of Contents
- Understand the problem and establish design scope
- Single server key-value store
- Distributed key-value store
- System components
- Summary
A key-value store is a non-relational database in which each unique identifier is stored as a key with its associated value.
The key must be unique, and the value can only be accessed through its key. Keys can be plain text or hashed values. For performance reasons, shorter keys work better.
- put(key, value): insert “value” associated with the “key”
- get(key): get “value” associated with the “key”
Understand the problem and establish design scope
- Trade-offs:
- Read, write, and memory usage
- Consistency vs. availability
- Characteristics to be considered:
- The size of a key-value pair is small: less than 10KB.
- Ability to store big data.
- High availability: the system responds quickly, even during failures.
- High scalability: the system can be scaled to support a large data set.
- Automatic scaling: the addition/deletion of servers should be automatic based on traffic.
- Tunable consistency.
- Low latency.
Single server key-value store
- An easy, intuitive approach is to store key-value pairs in a hash table, which keeps everything in memory.
- Fitting everything in memory may be impossible due to space constraints.
- Two optimisations can be done:
- Data compression
- Store only frequently used data in memory, and the rest on disk (see the sketch after this list).
- A single server can reach its capacity very quickly. A distributed key-value store is required to support big data.
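A minimal sketch of the single-server idea above, assuming string keys, a toy `SingleServerStore` class, and a simple "spill the coldest entry to disk" policy standing in for the optimisations listed; it is illustrative, not a production design.

```python
import hashlib
import os
import pickle
from collections import OrderedDict

class SingleServerStore:
    def __init__(self, data_dir="kv_data", max_in_memory=1000):
        self.memory = OrderedDict()      # recency-ordered in-memory hash table
        self.max_in_memory = max_in_memory
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)

    def _path(self, key):
        digest = hashlib.md5(key.encode()).hexdigest()
        return os.path.join(self.data_dir, digest + ".bin")

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)     # mark the key as most recently used
        if len(self.memory) > self.max_in_memory:
            # Spill the least recently used entry to disk to stay within memory limits.
            cold_key, cold_value = self.memory.popitem(last=False)
            with open(self._path(cold_key), "wb") as f:
                pickle.dump(cold_value, f)

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)    # fall back to disk for cold keys
        return None
```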
Distributed key-value store
System components
- Based on three popular key-value store systems: Dynamo, Cassandra, and BigTable.
Data partition
- It is infeasible to fit the complete data set in a single server.
- Split the data into smaller partitions, and store them in multiple servers.
- Two challenges while partitioning the data:
- Distribute data across multiple servers evenly.
- Minimise data movement when nodes are added or removed.
- Consistent hashing is a great technique to solve these problems.
- Automatic scaling: servers could be added or removed automatically depending on the load.
- Heterogeneity: the number of virtual nodes for a server is proportional to the server's capacity.
- This technique acknowledges the performance differences between heterogeneous servers and distributes load in proportion to capacity (first introduced in the Dynamo paper); see the consistent hashing sketch after this list.
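A minimal consistent-hashing sketch with virtual nodes. The use of MD5 for ring positions, `bisect` for lookups, and a `weight` parameter standing in for server capacity are assumptions for illustration only.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, base_vnodes=100):
        self.base_vnodes = base_vnodes
        self.ring = []        # sorted list of (position, server) pairs
        self.positions = []   # positions only, kept in sync for bisect lookups

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_server(self, server, weight=1.0):
        # Heterogeneity: a more capable server gets proportionally more virtual nodes.
        for i in range(int(self.base_vnodes * weight)):
            bisect.insort(self.ring, (self._hash(f"{server}#vnode{i}"), server))
        self.positions = [pos for pos, _ in self.ring]

    def remove_server(self, server):
        self.ring = [(pos, s) for pos, s in self.ring if s != server]
        self.positions = [pos for pos, _ in self.ring]

    def get_server(self, key):
        # Walk clockwise: the first virtual node at or after the key's hash owns the key.
        idx = bisect.bisect_right(self.positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing()
ring.add_server("s0", weight=1.0)
ring.add_server("s1", weight=2.0)   # twice the capacity -> twice the virtual nodes
print(ring.get_server("user:42"))
```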
Data replication
- To achieve high availability and reliability, data must be replicated asynchronously over N servers (N is configurable).
- After a key is mapped to a position on the hash ring, walk clockwise from that position and choose the first N servers on the ring to store data copies.
- With virtual nodes, the first N nodes on the ring may be owned by fewer than N physical servers. To avoid this issue, we only choose unique physical servers while performing the clockwise walk (see the sketch after this list).
- For better reliability, replicas are placed in distinct data centres.
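A sketch of the replica-placement walk described above: starting from the key's position, pick the first N unique physical servers and skip additional virtual nodes of servers already chosen. Representing the ring as a sorted list of (position, server) tuples and the toy positions in the example are assumptions.

```python
import bisect
import hashlib

def key_position(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def replica_servers(ring, key, n=3):
    # ring: sorted list of (position, server) tuples, including virtual nodes.
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_right(positions, key_position(key))
    replicas = []
    for i in range(len(ring)):
        _, server = ring[(start + i) % len(ring)]   # wrap around the ring
        if server not in replicas:                  # only count unique physical servers
            replicas.append(server)
        if len(replicas) == n:
            break
    return replicas

# Toy ring with virtual nodes: s0 and s1 each own two positions.
ring = sorted([(10, "s0"), (40, "s1"), (70, "s0"), (90, "s2"), (120, "s1")])
print(replica_servers(ring, "user:42", n=3))   # three distinct physical servers
```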
Consistency
- Data is replicated at multiple nodes, so it must be synchronised across replicas.
- Quorum Consensus can guarantee consistency for both read and write operations.

- N = The number of replicas.
- W = A write quorum of size W. For a write operation to be considered successful, it must be acknowledged by at least W replicas.
- R = A read quorum of size R. For a read operation to be considered successful, it must be acknowledged by at least R replicas.
An example with N = 3:
- If W = 1, the coordinator must receive at least one acknowledgement before the write operation is considered successful. If we get an ack from s1, we no longer need to wait for acks from s0 and s2. A coordinator acts as a proxy between the client and the nodes.
How to configure N, R, W?
- R=1 and W=N: the system is optimised for fast reads.
- W=1 and R=N: the system is optimised for fast writes.
- W+R > N: strong consistency.
- There must be at least one overlapping node that has the latest data to ensure consistency.
- W + R ≤ N: strong consistency is not guaranteed. (A minimal coordinator sketch follows this list.)
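A minimal coordinator sketch for quorum reads and writes. The in-memory `Replica` objects, the per-key version counter, and the synchronous loop standing in for parallel requests are all simplifying assumptions; it only illustrates how W and R gate success.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.data = {}                              # key -> (version, value)

    def write(self, key, version, value):
        if not self.online:
            raise ConnectionError(self.name)
        self.data[key] = (version, value)

    def read(self, key):
        if not self.online:
            raise ConnectionError(self.name)
        return self.data.get(key, (0, None))

class Coordinator:
    def __init__(self, replicas, w, r):
        self.replicas, self.w, self.r = replicas, w, r
        self.version = 0

    def put(self, key, value):
        self.version += 1
        acks = 0
        for replica in self.replicas:               # send the write to all N replicas
            try:
                replica.write(key, self.version, value)
                acks += 1
            except ConnectionError:
                continue                            # an offline replica does not acknowledge
        return acks >= self.w                       # success once at least W replicas acked

    def get(self, key):
        responses = []
        for replica in self.replicas:
            try:
                responses.append(replica.read(key))
            except ConnectionError:
                continue
        if len(responses) < self.r:
            return None                             # read quorum not reached
        return max(responses, key=lambda vv: vv[0])[1]   # newest version wins

replicas = [Replica(f"s{i}") for i in range(3)]     # N = 3
store = Coordinator(replicas, w=2, r=2)             # W + R > N -> strong consistency
store.put("name", "alice")
print(store.get("name"))                            # 'alice'
```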
Consistency models
A consistency model defines the degree of data consistency.
- Strong consistency: any read operation returns a value corresponding to the result of the most recent successful write. A client never sees out-of-date data.
- Typically achieved by forcing a replica not to accept new reads/writes until every replica has agreed on the current write.
- Not ideal for highly available systems.
- Weak consistency: subsequent read operations may not see the most updated value.
- Eventual consistency: a specific form of weak consistency, which is, given enough time, all updates are propagated, and all replicas become consistent.
- Dynamo, Cassandra
- Allows inconsistent values to enter the system, and forces the client to read the values and reconcile them.
Inconsistency resolution
- A vector clock: [server, version] pair associated with a data item.
Assume a vector clock is represented by D([s1,v1], [s2,v2], …, [sn,vn]), where D is a data item, vn is a version counter, and sn is a server number. If data item D is written to server si, the system must perform one of the following tasks:
- increment vi if [si, vi] exists
- otherwise, create a new entry [si, 1].
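A sketch of this update rule, assuming a vector clock is represented as a plain dict of server-to-counter entries.

```python
def update_vector_clock(clock, server):
    """Increment the counter for `server`, creating the entry if it is absent."""
    clock = dict(clock)                       # copy; each version gets its own clock
    clock[server] = clock.get(server, 0) + 1  # increment vi, or create [si, 1]
    return clock

d1 = update_vector_clock({}, "s1")            # D1([s1, 1])
d2 = update_vector_clock(d1, "s1")            # D2([s1, 2])
d3 = update_vector_clock(d2, "s2")            # D3([s1, 2], [s2, 1])
print(d3)                                     # {'s1': 2, 's2': 1}
```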

- A client writes a data item D1 to the system, and the write is handled by server s1, which now has the vector clock D1([s1, 1]).
- Another client reads the latest D1, updates it to D2, and writes it back. D2 descends from D1, so it overwrites D1. The write is handled by the same server s1, which now has vector clock D2([s1, 2]).
- Another client reads the latest D2, updates it to D3, and writes it back. The write is handled by s2, which now has vector clock D3([s1,2], [s2,1]).
- Another client reads the latest D2, updates it to D4, and writes it back. The write is handled by s3, which now has vector clock D4([s1,2], [s3,1]).
- When another client reads D3 and D4, it discovers a conflict, which is caused by data item D2 being modified by both s2 and s3. The conflict is resolved by the client and updated data is sent to the server. The write is handled by s1, which now has D5([s1,3], [s2,1], [s3,1]).
Using vector clocks, it is easy to tell that a version X is an ancestor of (i.e., has no conflict with) version Y if the version counter for each participant in the vector clock of Y is greater than or equal to the one in version X.
- D1([s1,1]) is an ancestor of D2([s1,2])
- D2([s1, 2]) is an ancestor of D3([s1, 2], [s2, 1])
- (If a server is missing from a vector clock, its counter can be treated as 0.)
You can tell that a version X is a sibling (i.e., a conflict) of Y if there is any participant in Y’s vector clock who has a counter that is less than its corresponding counter in X.
- D3([s1, 2], [s2, 1]) and D4([s1, 2], [s3, 1]) are in conflict.
- D([s0, 1], [s1, 2]) and D([s0, 2], [s1, 1]) are in conflict.
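A sketch of ancestor versus conflict detection with vector clocks, treating missing servers as counter 0 as noted above; the dict representation matches the update-rule sketch earlier and is an assumption.

```python
def is_ancestor(x, y):
    """True if version X is an ancestor of version Y (no conflict)."""
    return all(y.get(server, 0) >= counter for server, counter in x.items())

def in_conflict(x, y):
    """True if X and Y are siblings: neither descends from the other."""
    return not is_ancestor(x, y) and not is_ancestor(y, x)

d3 = {"s1": 2, "s2": 1}
d4 = {"s1": 2, "s3": 1}
print(is_ancestor({"s1": 1}, {"s1": 2}))   # True: D1 is an ancestor of D2
print(in_conflict(d3, d4))                 # True: D3 and D4 are siblings
```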
Vector clocks have two notable downsides:
- They add complexity to the client, which must implement conflict resolution.
- Space complexity: the [server: version] pairs in a vector clock can grow rapidly.
- One fix is to set a threshold for the length and drop the oldest pairs when the limit is exceeded. This can lead to inefficiencies in reconciliation because the descendant relationship can no longer be determined accurately, but Dynamo's experiments show that this is an acceptable solution.
Handling failures
As with any large system at scale, failures are not only inevitable but common. Handling failure scenarios is very important.
Failure detection
In a distributed system, it is insufficient to believe that a server is down because another server says so. Usually, it requires at least two independent sources of information to mark a server down.
A gossip protocol is a better solution.
- Each node maintains a node membership list. It contains member IDs and heartbeat counters.
- Each node periodically increments its heartbeat counter.
- Each node periodically sends heartbeats to a set of random nodes, which in turn propagate to another set of nodes.
- Once a node receives heartbeats, its membership list is updated with the latest info.
- If a member's heartbeat has not increased for more than a predefined period, the member is considered offline (see the sketch after this list).
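A sketch of gossip-style failure detection. The member-list structure, the gossip fan-out of 3, and the offline timeout are assumptions chosen only to illustrate the steps above.

```python
import random
import time

FANOUT = 3            # how many random peers each node gossips to per round
OFFLINE_AFTER = 10.0  # seconds without a heartbeat increase before marking offline

class Member:
    def __init__(self, node_id):
        self.node_id = node_id
        self.heartbeat = 0
        self.last_updated = time.time()

class GossipNode:
    def __init__(self, node_id, peer_ids):
        self.node_id = node_id
        self.members = {nid: Member(nid) for nid in peer_ids + [node_id]}

    def tick(self):
        # Each node periodically increments its own heartbeat counter.
        me = self.members[self.node_id]
        me.heartbeat += 1
        me.last_updated = time.time()

    def gossip_targets(self):
        # Pick a random set of peers to send the membership list to.
        peers = [nid for nid in self.members if nid != self.node_id]
        return random.sample(peers, min(FANOUT, len(peers)))

    def receive(self, remote_members):
        # Merge the incoming membership list, keeping the highest heartbeat seen.
        for nid, remote in remote_members.items():
            local = self.members.setdefault(nid, Member(nid))
            if remote.heartbeat > local.heartbeat:
                local.heartbeat = remote.heartbeat
                local.last_updated = time.time()

    def offline_members(self):
        now = time.time()
        return [nid for nid, m in self.members.items()
                if now - m.last_updated > OFFLINE_AFTER]
```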
Handling temporary failures
After failures have been detected through the gossip protocol, strict quorum read/write operations could be blocked.
- Sloppy quorum: used to improve availability. Instead of enforcing the quorum requirement, the system chooses the first W healthy servers for writes and the first R healthy servers for reads on the hash ring. Offline servers are ignored.
- Hinted handoff: if a server becomes unavailable, another server will process its requests temporarily. When the down server comes back up, the changes are pushed back to it to achieve data consistency (a sketch follows below).
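A tiny hinted-handoff sketch under stated assumptions: `store` maps server names to in-memory dicts, and `hints` queues the writes a down server missed so they can be replayed when it recovers. The names and structures are illustrative only.

```python
from collections import defaultdict

hints = defaultdict(list)   # down_server -> list of (key, value) writes it missed

def write_with_handoff(healthy_server, down_server, key, value, store):
    # Sloppy quorum: the write goes to the next healthy server instead of blocking.
    store[healthy_server][key] = value
    hints[down_server].append((key, value))   # remember the hint for later replay

def replay_hints(recovered_server, store):
    # When the down server comes back up, push the missed writes to it.
    for key, value in hints.pop(recovered_server, []):
        store[recovered_server][key] = value

store = {"s1": {}, "s2": {}}
write_with_handoff("s2", "s1", "name", "alice", store)   # s1 is down; s2 covers
replay_hints("s1", store)                                # s1 recovers
print(store["s1"])                                       # {'name': 'alice'}
```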
Handling permanent failures
- Anti-entropy protocol:
- Keeps replicas in sync.
- Compares each piece of data on the replicas and updates each replica to the newest version.
- A Merkle tree is used for inconsistency detection and to minimise the amount of data transferred: a tree in which every non-leaf node is labelled with the hash of the labels or values (in the case of leaves) of its child nodes. Hash trees allow efficient and secure verification of the contents of large data structures.
- Using a Merkle tree, the amount of data that needs to be synchronised is proportional to the differences between the two replicas, not to the amount of data they contain. In real-world systems, the bucket size is quite big (e.g., 1,000 keys). A comparison sketch follows this list.
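A minimal sketch of detecting which buckets differ between two replicas. The bucketing scheme, the tiny bucket count, and comparing bucket hashes directly once the roots differ (rather than descending the tree level by level) are simplifying assumptions.

```python
import hashlib

def bucket_hash(bucket):
    # Hash all key-value pairs in a bucket in a stable (sorted) order.
    h = hashlib.sha256()
    for key in sorted(bucket):
        h.update(f"{key}={bucket[key]}".encode())
    return h.hexdigest()

def build_merkle_level(hashes):
    # Pair up child hashes and hash them together to form the parent level.
    if len(hashes) == 1:
        return hashes
    parents = []
    for i in range(0, len(hashes), 2):
        pair = "".join(hashes[i:i + 2])
        parents.append(hashlib.sha256(pair.encode()).hexdigest())
    return build_merkle_level(parents)

def merkle_root(buckets):
    return build_merkle_level([bucket_hash(b) for b in buckets])[0]

def diff_buckets(buckets_a, buckets_b):
    # If the roots match, the replicas are in sync; otherwise compare bucket
    # hashes to find which buckets need to be synchronised.
    if merkle_root(buckets_a) == merkle_root(buckets_b):
        return []
    return [i for i, (a, b) in enumerate(zip(buckets_a, buckets_b))
            if bucket_hash(a) != bucket_hash(b)]

replica_a = [{"k1": "v1"}, {"k2": "v2"}, {"k3": "v3"}, {"k4": "v4"}]
replica_b = [{"k1": "v1"}, {"k2": "v2-stale"}, {"k3": "v3"}, {"k4": "v4"}]
print(diff_buckets(replica_a, replica_b))   # [1]: only bucket 1 needs syncing
```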
Handling data centre outage
- Replicate data across multiple data centres.
System architecture diagram

- Clients communicate with the key-value store through simple APIs: get(key) and put(key, value).
- A coordinator is a node that acts as a proxy between the client and the key-value store.
- Nodes are distributed on a consistent hash ring.
- The system is completely decentralised; adding and moving nodes can be automatic.
- Data is replicated at multiple nodes.
- There is no single point of failure, as every node has the same set of responsibilities.

Write path

- The write request is persisted on a commit log.
- Data is saved in the memory cache.
- When the memory cache is full or reaches a threshold, data is flushed to an SSTable on disk.
Note that a sorted-string table (SSTable) is a sorted list of <key, value> pairs.
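A sketch of this write path: append to a commit log, update the memory cache (memtable), and flush to a sorted SSTable file when the threshold is reached. The JSON file formats, the tiny threshold, and the `WritePath` class are assumptions for illustration.

```python
import json
import os

class WritePath:
    def __init__(self, data_dir="kvstore", memtable_limit=4):
        self.data_dir = data_dir
        os.makedirs(data_dir, exist_ok=True)
        self.commit_log = open(os.path.join(data_dir, "commit.log"), "a")
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstable_count = 0

    def put(self, key, value):
        # 1. Persist the write to the commit log for durability.
        self.commit_log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.commit_log.flush()
        # 2. Save the data in the memory cache.
        self.memtable[key] = value
        # 3. Flush to an SSTable on disk when the cache reaches its threshold.
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        path = os.path.join(self.data_dir, f"sstable_{self.sstable_count}.json")
        with open(path, "w") as f:
            # SSTable: keys are written in sorted order.
            json.dump({k: self.memtable[k] for k in sorted(self.memtable)}, f)
        self.sstable_count += 1
        self.memtable.clear()
```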
Read path

- If data is in the memory cache, it is returned to the client directly.

Otherwise,
- The system first checks if the data is in the memory cache. If not, it goes to the next step.
- The system checks the bloom filter.
- The bloom filter is used to figure out which SSTables might contain the key.
- The SSTables return the result for the key.
- The result is returned to the client (a read-path sketch follows this list).
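A sketch of this read path: check the memory cache first, then use a Bloom filter per SSTable to decide which tables might contain the key. The tiny Bloom filter parameters and representing SSTables as in-memory dicts are assumptions; real systems use tuned sizes and on-disk files.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos] for pos in self._positions(key))

def read(key, memtable, sstables):
    # sstables: list of (bloom_filter, {key: value}) pairs, newest first.
    if key in memtable:                      # 1. Return directly if in the memory cache.
        return memtable[key]
    for bloom, table in sstables:            # 2. Use Bloom filters to skip SSTables.
        if bloom.might_contain(key) and key in table:
            return table[key]                # 3. The SSTable returns the result.
    return None                              # key not found anywhere

bloom = BloomFilter()
bloom.add("k1")
print(read("k1", memtable={}, sstables=[(bloom, {"k1": "v1"})]))   # 'v1'
```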
Summary
| Goal or Problems | Technique |
|---|---|
| Ability to store big data | Use consistent hashing to spread the load across servers |
| High availability (reads) | Data replication, Multi data centres |
| High availability (writes) | Versioning and conflict resolution with vector clocks |
| Dataset partition | Consistent hashing |
| Incremental scalability | Consistent hashing |
| Heterogeneity | Consistent hashing with server capacities |
| Tunable consistency | Quorum consensus |
| Handling temporary failures | Sloppy quorum and hinted hand-off |
| Handling permanent failures | Merkle tree |
| Handling data centre outage | Multi data centres, replication across data centres |