Distributed databases face a fundamental tension between consistency, availability, and partition tolerance. The CAP theorem tells us we cannot have all three simultaneously during a network partition. Rather than making a single, system-wide choice, Cassandra and similar systems allow operators to tune consistency on a per-operation basis. This approach, known as tunable consistency, lets you select different consistency levels for reads and writes depending on the requirements of each query.
Background: Replication and Quorums
Cassandra uses a peer-to-peer architecture where data is replicated across N nodes (the replication factor). When a client issues a read or write, a coordinator node forwards the request to the replicas responsible for that partition key. The consistency level (CL) determines how many replica acknowledgments the coordinator must receive before responding to the client.
The core insight comes from quorum systems. If you write to W replicas and read from R replicas, then as long as W + R > N, the read set and write set must overlap. This overlap guarantees that at least one replica contacted during a read holds the most recent acknowledged write. This is the foundation of tunable consistency.
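The overlap guarantee can be checked exhaustively for small N. A minimal Python sketch (function and variable names are illustrative, not part of any Cassandra API):

```python
from itertools import combinations

def overlap_guaranteed(n: int, w: int, r: int) -> bool:
    """Check exhaustively that every W-replica write set intersects
    every R-replica read set among n replicas."""
    replicas = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(replicas, w)
        for rs in combinations(replicas, r)
    )

# With N=3: QUORUM writes (W=2) + QUORUM reads (R=2) always overlap.
assert overlap_guaranteed(3, 2, 2)      # W + R = 4 > 3
# ONE/ONE does not: a read may miss the written replica entirely.
assert not overlap_guaranteed(3, 1, 1)  # W + R = 2 <= 3
```

The brute-force check mirrors the algebraic argument: if W + R > N, two subsets of sizes W and R drawn from N elements cannot be disjoint.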
Common consistency levels in Cassandra:
| Level | Replicas Required | Behavior |
|---|---|---|
| ONE | 1 | Lowest latency, weakest consistency |
| TWO | 2 | Two replicas must respond |
| THREE | 3 | Three replicas must respond |
| QUORUM | ⌊N/2⌋ + 1 | Majority of replicas |
| LOCAL_QUORUM | ⌊N_local/2⌋ + 1 | Majority within the local datacenter |
| EACH_QUORUM | ⌊N_dc/2⌋ + 1 per DC | Majority in every datacenter (writes only in Cassandra 2.x+; not supported for reads) |
| ALL | N | Every replica must respond |
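The QUORUM row is simple integer arithmetic; a quick sketch of the majority calculation:

```python
def quorum(n: int) -> int:
    """Majority quorum size: floor(N/2) + 1."""
    return n // 2 + 1

assert quorum(3) == 2   # RF=3: QUORUM needs 2 replicas
assert quorum(5) == 3   # RF=5: QUORUM needs 3 replicas
assert quorum(6) == 4   # even RF still requires a strict majority
```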
How It Works
Write Path
When a client sends a write with consistency level W, the coordinator sends the mutation to all N replicas (or all replicas in the relevant datacenters). It waits for W acknowledgments before responding success to the client. The remaining replicas will still receive the write asynchronously if they haven't already. Each write carries a client-supplied or coordinator-generated timestamp, which Cassandra uses for last-write-wins (LWW) conflict resolution.
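The write path can be modeled with a toy in-memory replica (a sketch under simplifying assumptions, not the real implementation: real Cassandra sends mutations concurrently and replies as soon as W acknowledgments arrive, while this sequential model waits for everyone):

```python
class Replica:
    def __init__(self, name):
        self.name, self.up, self.store = name, True, {}

    def write(self, key, value, ts):
        if not self.up:
            return False              # unreachable: no ACK
        # Last-write-wins: keep only the highest-timestamp version.
        if key not in self.store or self.store[key][1] < ts:
            self.store[key] = (value, ts)
        return True

def coordinator_write(replicas, key, value, ts, w):
    """Send the mutation to all N replicas; report success iff at
    least W acknowledge."""
    acks = sum(rep.write(key, value, ts) for rep in replicas)
    return acks >= w

# N=3 with replica C unreachable: QUORUM (W=2) succeeds, ALL (W=3) fails.
a, b, c = Replica("A"), Replica("B"), Replica("C")
c.up = False
assert coordinator_write([a, b, c], "user:42", "Alice", 100, w=2)
assert not coordinator_write([a, b, c], "user:42", "Alice", 100, w=3)
```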
Read Path
When a client sends a read with consistency level R, the coordinator optimizes the request by sending a full data request to one replica and digest (hash) requests to the remaining R-1 replicas. If all digests agree, the coordinator returns the data from the first replica. If digests diverge, the coordinator fetches full data from all R replicas, compares timestamps, and returns the value with the highest timestamp. If the R replicas return divergent values, the coordinator performs read repair: it sends the most recent value to any out-of-date replicas that were contacted. Cassandra can also perform read repair probabilistically in the background on the remaining (N - R) replicas (note: the per-table read_repair_chance and dc_local_read_repair_chance settings were deprecated in Cassandra 4.0; background repair behavior may differ by version).
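The digest-then-repair flow can be sketched as follows, modeling each replica as a dict of key → (value, timestamp) and using a hash as a stand-in for Cassandra's digest (illustrative names throughout):

```python
import hashlib

def digest(record):
    """Stand-in for Cassandra's digest: a hash over value + timestamp."""
    return hashlib.sha256(repr(record).encode()).hexdigest()

def coordinator_read(stores, key, r):
    """Read at CL=R over replica stores (dicts of key -> (value, ts)).
    One full-data request plus R-1 digest requests; on mismatch, fetch
    full data, resolve by highest timestamp (LWW), and read-repair the
    stale contacted replicas."""
    contacted = stores[:r]
    data = contacted[0].get(key)                       # full data request
    if all(digest(s.get(key)) == digest(data) for s in contacted[1:]):
        return data                                    # digests agree: fast path
    # Digest mismatch: compare full versions and pick the newest.
    versions = [s.get(key) for s in contacted if s.get(key) is not None]
    latest = max(versions, key=lambda v: v[1])
    for s in contacted:                                # read repair
        if s.get(key) != latest:
            s[key] = latest
    return latest

# Replica B holds ts=100; replica C is stale at ts=90.
b = {"user:42": ("Alice v5", 100)}
c = {"user:42": ("Alice v4", 90)}
assert coordinator_read([b, c], "user:42", r=2) == ("Alice v5", 100)
assert c["user:42"] == ("Alice v5", 100)   # stale replica repaired
```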
The Consistency Guarantee
The combination of R and W determines the consistency semantics:
- Strong consistency (latest-write visibility): W + R > N. Every read is guaranteed to contact at least one replica that holds the most recent acknowledged write, ensuring the coordinator can return the latest value under Cassandra's last-write-wins model. Note: this does not constitute strict linearizability due to clock skew and LWW semantics (see Caveats below).
- Eventual consistency: W + R ≤ N. Reads may return stale data, but anti-entropy mechanisms (read repair, hinted handoff, Merkle tree-based repair) will eventually converge replicas.
For example, with N=3 and QUORUM for both reads and writes, W=2 and R=2, so W + R = 4 > 3. This guarantees that the read set overlaps with the write set. With N=3 and CL=ONE for both (W=1, R=1), we get W + R = 2, which does not exceed N, so stale reads are possible.
Walkthrough
Consider a concrete scenario with replication factor N=3, replicas A, B, and C, and a key user:42.
Step 1: Write with QUORUM (W=2)
Client -> Coordinator: WRITE user:42 = {name: "Alice", v: 5} at timestamp t=100
Coordinator -> Replica A: WRITE user:42 = {name: "Alice", v: 5, ts: 100}
Coordinator -> Replica B: WRITE user:42 = {name: "Alice", v: 5, ts: 100}
Coordinator -> Replica C: WRITE user:42 = {name: "Alice", v: 5, ts: 100}
Replica A: ACK (received, written to memtable + commit log)
Replica B: ACK (received, written to memtable + commit log)
Replica C: [slow, has not responded yet]
Coordinator has 2 ACKs >= W=2.
Coordinator -> Client: WRITE SUCCESS
At this point, replicas A and B hold version ts=100. Replica C may still be processing or temporarily unreachable. It will eventually receive the write, either from the still-pending replication message or via hinted handoff.
Step 2: Read with QUORUM (R=2)
Client -> Coordinator: READ user:42
Coordinator sends a full data request to Replica B and a digest request to Replica C.
Replica B: responds with {name: "Alice", v: 5, ts: 100}
Replica C: responds with digest of {name: "Alice", v: 4, ts: 90} [stale]
Digests disagree. Coordinator fetches full data from Replica C:
Replica C: {name: "Alice", v: 4, ts: 90}
Coordinator compares timestamps: ts=100 > ts=90.
Coordinator -> Client: READ RESULT = {name: "Alice", v: 5}
Coordinator initiates read repair:
Coordinator -> Replica C: WRITE user:42 = {name: "Alice", v: 5, ts: 100}
Because W + R = 4 > 3 = N, at least one of the two replicas contacted during the read (B in this case) must have seen the write. The coordinator correctly returns the latest value and repairs the stale replica.
Step 3: What If We Used CL=ONE for Both?
WRITE with W=1: Only Replica A acknowledges. Client sees success.
READ with R=1: Coordinator contacts Replica C, which still has ts=90.
Client receives stale data.
This illustrates why W + R > N matters.
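The three steps above can be reproduced in a few lines, modeling each replica as a dict with last-write-wins resolution (a toy model, not Cassandra's storage engine):

```python
# Each replica is a dict of key -> (value, ts); writes follow last-write-wins.
def lww_write(replica, key, value, ts):
    if key not in replica or replica[key][1] < ts:
        replica[key] = (value, ts)

A, B, C = {}, {}, {}

# Step 1: QUORUM write (W=2). A and B ack ts=100; slow C still holds ts=90.
lww_write(C, "user:42", {"name": "Alice", "v": 4}, 90)
lww_write(A, "user:42", {"name": "Alice", "v": 5}, 100)
lww_write(B, "user:42", {"name": "Alice", "v": 5}, 100)

# Step 2: a QUORUM read (R=2) contacting B and C must overlap the write set,
# and LWW resolution picks the ts=100 version.
latest = max((rep["user:42"] for rep in (B, C)), key=lambda v: v[1])
assert latest == ({"name": "Alice", "v": 5}, 100)

# Step 3: a CL=ONE read that happens to contact only C returns stale data.
assert C["user:42"][1] == 90
```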
Practical Trade-offs
Latency vs. Consistency
Higher consistency levels require more replicas to respond, which means the operation's latency is bounded by the slowest of the required replicas. With CL=ALL, a single slow or failed replica causes the operation to fail or time out. With CL=ONE, you get the latency of the fastest replica but risk stale reads.
Availability vs. Consistency
With N=3 and CL=QUORUM, you can tolerate one replica failure. With CL=ALL, zero failures are tolerated. With CL=ONE, you can tolerate N-1 failures. This is the direct manifestation of the CAP trade-off, made explicit per query.
Mixed Consistency Levels
A common pattern is to use strong writes with weaker reads (or vice versa). For instance, with N=3:
- Write at CL=ALL (W=3), Read at CL=ONE (R=1): W + R = 4 > 3. Latest-write visibility with fast reads but fragile writes (zero replica failures tolerated).
- Write at CL=ONE (W=1), Read at CL=ALL (R=3): W + R = 4 > 3. Latest-write visibility with fast writes but expensive reads and zero read-path failure tolerance.
- Write at CL=QUORUM (W=2), Read at CL=QUORUM (R=2): Balanced. Tolerates one failure on either path.
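The three patterns can be sanity-checked in a couple of lines (assuming N=3, as above):

```python
N = 3  # replication factor in the examples above
combos = {
    "ALL write / ONE read":       (3, 1),
    "ONE write / ALL read":       (1, 3),
    "QUORUM write / QUORUM read": (2, 2),
}
for name, (w, r) in combos.items():
    assert w + r > N                # latest-write visibility holds
    print(f"{name}: tolerates {N - w} write-path, {N - r} read-path failures")
```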
Caveats and Failure Modes
Tunable consistency in Cassandra does not provide linearizability in the strict sense, even with QUORUM reads and writes. Cassandra uses last-write-wins with timestamps, and clock skew between nodes can cause a write with a lower timestamp to "win" even if it was issued later in real time. Lightweight transactions (using Paxos) are required for true linearizability, at significant performance cost.
Additionally, hinted handoff introduces subtlety. If a coordinator cannot reach a replica, it stores a "hint" locally and forwards it when the replica recovers. Hints allow writes to succeed with fewer live replicas, but they are not replayed instantly, and they can be lost if the coordinator itself fails before delivering them. By default, Cassandra only stores hints for a limited window (e.g., 3 hours); replicas that are down longer than this window may miss writes entirely until a full repair is run.
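A toy model of the hint window makes the failure mode concrete (the 3-hour constant mirrors Cassandra's default; the function and variable names here are made up for illustration):

```python
HINT_WINDOW = 3 * 60 * 60  # seconds; mirrors Cassandra's ~3-hour default

def deliver_hints(hints, replica, downtime):
    """Replay stored hints once a replica recovers. If it was down longer
    than the hint window, the hints were discarded and the replica misses
    those writes until a full repair runs."""
    if downtime > HINT_WINDOW:
        return []                              # hints expired: writes lost
    for key, (value, ts) in hints:
        if key not in replica or replica[key][1] < ts:
            replica[key] = (value, ts)         # last-write-wins replay
    return hints

recovered = {}
hints = [("user:42", ("Alice", 100))]
assert deliver_hints(hints, recovered, downtime=600)         # back in 10 min
assert recovered["user:42"] == ("Alice", 100)
assert deliver_hints(hints, {}, downtime=4 * 60 * 60) == []  # down 4 h: lost
```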
Anti-Entropy Mechanisms
Tunable consistency relies on background convergence mechanisms to ensure replicas eventually agree:
- Read repair: During reads, divergent values are detected and the latest version is propagated to stale replicas that were contacted. Background probabilistic read repair on non-contacted replicas was configurable in older Cassandra versions but was deprecated in Cassandra 4.0.
- Hinted handoff: When a replica is unreachable, the coordinator stores mutations as hints and replays them when the replica comes back online, subject to the configured hint window.
- Anti-entropy repair (Merkle trees): Operators run periodic full or incremental repairs. Replicas exchange Merkle tree digests of their data to identify and resolve inconsistencies. This is the only mechanism that guarantees full convergence across all replicas, including those that were down longer than the hint window.
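A minimal Merkle-root sketch shows how two replicas can compare an entire token range with a single hash exchange (illustrative only; Cassandra's actual trees are bounded-depth and per-range):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root hash over a list of (key, ts) leaves, pairing nodes level by
    level (an odd node is carried up unchanged). Replicas compare roots;
    equal roots mean the range needs no streaming."""
    level = [h(repr(leaf).encode()) for leaf in leaves]
    while len(level) > 1:
        nxt = [h(level[i] + level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]

in_sync = merkle_root([("k1", 100), ("k2", 200)])
assert merkle_root([("k1", 100), ("k2", 200)]) == in_sync   # identical data
assert merkle_root([("k1", 100), ("k2", 199)]) != in_sync   # divergent replica
```

A digest mismatch at the root tells the replicas to recurse into subtrees and stream only the rows that actually differ, rather than the whole dataset.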
Without regular anti-entropy repair, clusters relying on CL=ONE for reads can accumulate unbounded staleness. Operational discipline is essential.
Key Points
- Tunable consistency allows per-operation selection of how many replicas must acknowledge a read or write, enabling fine-grained control over the consistency-availability trade-off.
- The quorum condition W + R > N guarantees that reads and writes overlap on at least one replica, ensuring the latest acknowledged write is visible under Cassandra's LWW model.
- Lower consistency levels (ONE, TWO) reduce latency and improve availability at the cost of potentially returning stale data.
- Higher consistency levels (QUORUM, ALL) provide stronger guarantees but increase tail latency and reduce fault tolerance.
- Cassandra's last-write-wins conflict resolution with timestamps means that even QUORUM-level operations do not provide true linearizability; lightweight transactions (Paxos) are needed for that.
- Anti-entropy mechanisms (read repair, hinted handoff, Merkle tree repair) are essential companions to tunable consistency, ensuring eventual convergence of replicas.
- Mixed consistency levels across reads and writes can satisfy W + R > N while optimizing for the access pattern, such as write-heavy or read-heavy workloads.
- EACH_QUORUM is supported for writes only (not reads) in current Cassandra versions.
References
Gilbert, S. and Lynch, N. "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services." ACM SIGACT News, Vol. 33, No. 2, 2002.
DeCandia, G. et al. "Dynamo: Amazon's Highly Available Key-Value Store." Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), 2007. (Cassandra's replication and consistency model draws heavily from Dynamo's design, though the two are distinct systems.)
Lakshman, A. and Malik, P. "Cassandra: A Decentralized Structured Storage System." ACM SIGOPS Operating Systems Review, Vol. 44, No. 2, 2010.
Vogels, W. "Eventually Consistent." Communications of the ACM, Vol. 52, No. 1, 2009.
Lamport, L. "Paxos Made Simple." ACM SIGACT News, Vol. 32, No. 4, 2001.