Introduction
In leader-based replication, a single designated node accepts writes and propagates changes to followers.
This design simplifies consistency reasoning but introduces a single point of failure for write availability.
Leaderless replication takes a fundamentally different approach: any replica can accept writes directly from clients and consistency is achieved through quorum protocols and conflict resolution rather than write ordering at a leader.
Amazon's Dynamo system, described in a landmark 2007 paper, popularized this approach for production systems.
The core ideas have since been adopted by Riak, Apache Cassandra, and Voldemort, among others.
Understanding leaderless replication requires grappling with quorum arithmetic, read repair, anti-entropy, version vectors, and the inherent tradeoffs between availability and consistency that these systems embrace.
System Model
A leaderless replication system stores copies of each data item on N replica nodes.
The client (or a coordinator node acting on the client's behalf) sends write requests directly to all N replicas in parallel.
Similarly, reads go to multiple replicas in parallel.
The system does not designate any replica as authoritative.
Two parameters govern the consistency semantics:
- W: the number of replicas that must acknowledge a write before the client considers it successful.
- R: the number of replicas that must respond to a read before the client returns a value.
The fundamental quorum condition is:
W + R > N
When this condition holds, the read set and write set must overlap by at least one node.
This guarantees that at least one node in every read quorum has seen the most recent write, enabling the client to identify the latest value (assuming a mechanism to distinguish newer from older values exists).
Common configurations include:
| Configuration | N | W | R | Properties |
|---|---|---|---|---|
| Strict majority | 3 | 2 | 2 | Balanced read/write latency, tolerates 1 failure |
| Read-optimized | 3 | 3 | 1 | Fast reads (contact only 1 replica), but writes require all N replicas to be available |
| Write-optimized | 3 | 1 | 3 | Fast writes (acknowledged by 1 replica), but reads must contact all N replicas |
Setting W + R <= N is legal but yields eventual consistency: reads may return stale data because the read set might not include any node that participated in the latest write quorum.
Walkthrough
Write Path
- The client sends a write request for key K with value V to all N replicas.
- Each replica that receives the request stores the value locally along with a version identifier (a version vector or a simple version number, depending on the implementation).
- The client waits for W acknowledgments. Once W replicas have confirmed the write, the client treats the operation as successful.
- If fewer than W replicas respond within a timeout, the write fails from the client's perspective. Some replicas may have already stored the value, creating partial writes that anti-entropy will eventually propagate.
Read Path
- The client sends a read request for key K to all N replicas in parallel.
- The client waits for R responses.
- If the R responses contain different versions of the value, the client (or coordinator) selects the most recent version using version metadata.
- If a stale value is detected at some replicas, the coordinator performs read repair: it sends the latest value back to the replicas that returned outdated data.
Read Repair Pseudocode
Note: The following pseudocode uses scalar version numbers for clarity.
In systems that use version vectors, the comparison requires a dominates() function rather than a simple less-than operator (see the Conflict Resolution section).
function read(key, N, R):
responses = send_read_to_all_replicas(key, N)
wait_for(R, responses)
latest = max(responses, by=scalar_version_number)
# Read repair: fix stale replicas
for response in responses:
if response.version < latest.version:
send_write(response.replica, key, latest)
return latest.value
Sloppy Quorums and Hinted Handoff
Strict quorums require that reads and writes always go to the designated N home nodes for a key.
In practice, some of these nodes may be temporarily unreachable.
Dynamo introduced sloppy quorums to handle this scenario.
With sloppy quorums, if a home node for key K is unreachable, the coordinator routes the request to another node that is not among the designated N replicas.
This stand-in node stores the value with a hint indicating which node should ultimately own the data.
When the intended recipient recovers, the stand-in forwards the data.
This is called hinted handoff.
Sloppy quorums improve write availability but weaken the consistency guarantee.
The W nodes that acknowledged a write may not overlap with the R nodes contacted during a subsequent read because some of the W acknowledgments came from non-home nodes.
In this case, W + R > N no longer guarantees that a read sees the latest write.
function sloppy_write(key, value, W, preference_list):
# preference_list is ordered; select up to N targets,
# substituting reachable non-home nodes for unreachable home nodes.
targets = []
for node in preference_list:
if len(targets) == N:
break
if node.is_reachable():
targets.append(node)
else:
substitute = find_reachable_non_home_node()
if substitute is not None:
targets.append(HintedNode(substitute, intended=node))
# If no substitute found, this slot remains unfilled;
# the write may fail to reach W acknowledgments.
acks = 0
for target in targets:
if target.store(key, value):
acks += 1
return acks >= W
Conflict Resolution
Because multiple replicas accept writes concurrently with no coordination, conflicting writes can occur.
Two clients may write different values for the same key at overlapping times, and different replicas may end up with different values.
The system must detect and resolve these conflicts.
Last-Writer-Wins (LWW)
The simplest approach: attach a timestamp to each write and always keep the value with the highest timestamp.
This is deterministic and simple but discards concurrent writes silently.
If two clients write concurrently, one write is lost without notification.
Cassandra uses LWW semantics: each column value carries a client-supplied timestamp, and the higher timestamp wins on conflict.
This means correctness depends on clients providing accurate and monotonically increasing timestamps.
Version Vectors
A more precise approach uses version vectors.
Each replica maintains a per-replica counter.
The version vector for a value is a tuple of these counters, one per replica.
Given two version vectors, you can determine whether one causally supersedes the other or whether they are concurrent.
Note on terminology: Dynamo's original paper used the term vector clocks, which traditionally track causality per client or per operation. Version vectors track per-replica write counts and are more precisely suited to replica-level conflict detection. Modern systems such as Riak use dotted version vectors to avoid false conflicts. The terms are sometimes used interchangeably in the literature, but the distinction matters for correctness.
Given version vectors A = [r1:2, r2:1] and B = [r1:1, r2:2]:
A does not dominate B (because A[r2] < B[r2])
B does not dominate A (because B[r1] < A[r1])
Therefore A and B are concurrent: conflict detected.
When a conflict is detected, the system can either merge the values automatically (if the data type supports it, as with CRDTs) or return all conflicting versions (called siblings) to the client for application-level resolution.
Riak exposes siblings to the application, allowing domain-specific merge logic.
Anti-Entropy
Read repair only fixes stale replicas for keys that are actually read.
Keys that are rarely read may remain inconsistent across replicas indefinitely.
To address this, Dynamo-style systems run a background anti-entropy process.
Anti-entropy works by comparing data across replicas and copying missing or outdated values.
Comparing entire data sets is expensive, so systems use Merkle trees (hash trees) to efficiently identify differences.
Each node maintains a Merkle tree over its key space.
Two nodes can identify which key ranges differ by comparing tree roots and walking down only the branches where hashes diverge.
This reduces the amount of data transferred during synchronization.
The anti-entropy process runs continuously but at low priority, ensuring that replicas converge over time even without read traffic.
Consistency Semantics and Limitations
Even with strict quorums (W + R > N), Dynamo-style systems do not provide linearizability.
Several scenarios break linearizability guarantees:
- Sloppy quorums break the overlap property, as described above.
- Concurrent writes to the same key may be resolved in ways that do not reflect a total order.
- Partial failures during writes: if a write succeeds on W replicas but fails on the remainder, there is no rollback. A subsequent read might or might not see the new value depending on which replicas are contacted.
- Read repair race conditions: if a read detects a stale value and initiates repair, but another concurrent read contacts the stale replica before repair completes, the second read may return stale data.
Dynamo-style systems deliberately make this tradeoff.
They prioritize availability and partition tolerance (in the CAP theorem sense), accepting weaker consistency in exchange for the ability to serve reads and writes even when some nodes are unreachable.
Key Points
- Leaderless replication allows any replica to accept writes, eliminating the write-availability bottleneck of leader-based designs.
- Quorum parameters W, R, and N control the consistency-availability tradeoff; W + R > N ensures read-write overlap, but does not guarantee linearizability.
- Read repair fixes stale replicas opportunistically during reads, while background anti-entropy (using Merkle trees) ensures convergence for unread keys.
- Sloppy quorums with hinted handoff improve availability during node failures at the cost of weakening quorum overlap guarantees.
- Conflict resolution is unavoidable in leaderless systems; strategies range from last-writer-wins (simple but lossy) to version vectors with application-level merge.
- These systems are designed for high availability and partition tolerance, explicitly trading away strong consistency guarantees.
References
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. "Dynamo: Amazon's Highly Available Key-Value Store." Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), 2007.
Kleppmann, M. "Designing Data-Intensive Applications." O'Reilly Media, 2017. Chapter 5: Replication.
Gilbert, S. and Lynch, N. "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services." ACM SIGACT News, 33(2), 2002.
Shapiro, M., Preguiça, N., Baquero, C., and Zawirski, M. "Conflict-Free Replicated Data Types." Proceedings of the 13th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), 2011.
Preguiça, N., Baquero, C., and Shapiro, M. "Conflict-Free Replicated Data Types (CRDTs)." arXiv:1805.06113, 2018. (For dotted version vectors and their advantages over plain version vectors.)