Distributed databases face a fundamental tension between consistency, availability, and partition tolerance.
The CAP theorem tells us we cannot have all three simultaneously during a network partition.
Rather than making a fixed choice, systems like Apache Cassandra let operators configure consistency guarantees on a per-operation basis.
This approach, known as tunable consistency, allows different queries within the same system to make different tradeoffs depending on their requirements.
Background: Replication and Quorums
Cassandra replicates each piece of data across N nodes, where N is the replication factor configured per keyspace.
When a client issues a read or write, Cassandra's coordinator node sends the request to all N replicas but only waits for a subset to respond before acknowledging success.
The size of that subset is the consistency level (CL).
The key consistency levels are:
- ONE: Wait for a single replica to acknowledge.
- TWO / THREE: Wait for two or three replicas.
- QUORUM: Wait for ⌊N/2⌋ + 1 replicas.
- ALL: Wait for all N replicas.
- LOCAL_QUORUM: A quorum within the local datacenter only.
- EACH_QUORUM: A quorum in every datacenter.
- ANY (writes only): The write is acknowledged if it reaches at least one node, including hinted handoff nodes. Note that this does not guarantee the data has been written to any actual replica — it may exist only as a hint.
The classical quorum condition for strong consistency is:
R + W > N
Where R is the number of replicas read, W is the number of replicas written to, and N is the replication factor.
If this inequality holds, the read set and write set must share at least one replica by the pigeonhole principle, so a read will contact at least one node that saw the most recent write.
This provides a best-effort "strong" consistency under the last-write-wins model, but it is not full linearizability — clock skew and the LWW conflict resolution mechanism can still cause anomalies (see Edge Cases below).
True linearizability requires lightweight transactions.
How Tunable Consistency Works
Write Path
When a client sends a write with consistency level W:
- The coordinator determines all N replica nodes for the partition key using consistent hashing on the token ring.
- The coordinator sends the mutation to all N replicas.
- Each replica writes the data to a commit log and memtable, and then sends an acknowledgment.
- The coordinator waits for W acknowledgments before responding to the client.
- If fewer than W replicas respond within the timeout, the coordinator returns an error to the client, and the replicas that did receive the write still retain it.
Read Path
When a client sends a read with consistency level R:
- As a Cassandra-specific optimization, the coordinator sends a full data request to one replica and digest requests to R - 1 others (rather than fetching full data from all R). When R = 1, only a full data request is sent with no digest requests. The coordinator still waits for R total responses.
- If digests match, the coordinator returns the data.
- If digests diverge, the coordinator requests full data from all R replicas, reconciles using timestamps (last-write-wins), and returns the most recent value.
- A read repair is triggered in the background to update stale replicas.
Walkthrough
Consider a keyspace with replication factor N = 3 and a single datacenter.
Scenario A: Strong consistency
Write CL = QUORUM → W = 2
Read CL = QUORUM → R = 2
R + W = 4 > 3 = N ✓ (strong consistency)
A write succeeds once 2 of 3 replicas acknowledge.
A subsequent read contacts 2 replicas.
At least one of those 2 must be among the 2 that acknowledged the write.
The read will therefore see the latest value (after timestamp-based reconciliation), assuming clocks are synchronized.
Scenario B: Eventual consistency, write-heavy optimization
Write CL = ONE → W = 1
Read CL = ONE → R = 1
R + W = 2 ≤ 3 = N ✗ (no strong consistency guarantee)
The read might contact the single replica that did not receive the latest write.
The client may see stale data until anti-entropy mechanisms (read repair, hinted handoff, Merkle-tree-based repair) propagate the update.
Scenario C: High write durability, fast reads
Write CL = ALL → W = 3
Read CL = ONE → R = 1
R + W = 4 > 3 = N ✓ (strong consistency)
This satisfies the quorum overlap condition but severely sacrifices write availability: if any single replica is unavailable, all writes fail.
This pattern is fragile in production and should only be used when the durability requirement is extreme and the replica set is highly reliable.
Scenario D: Multi-datacenter
With N = 3 per datacenter (6 total replicas across 2 DCs):
Write CL = LOCAL_QUORUM → W = 2 (in the local DC)
Read CL = LOCAL_QUORUM → R = 2 (in the local DC)
This gives strong consistency within a single datacenter and avoids cross-DC latency on the critical path.
However, a read in DC-B immediately after a write in DC-A may return stale data, since the local quorum in DC-B may not yet have received the write.
EACH_QUORUM on writes addresses this at the cost of higher latency.
The Quorum Overlap Argument (Pseudocode)
function is_strongly_consistent(R, W, N):
// Pigeonhole principle: if R + W > N, then
// the set of R read-replicas and W write-replicas
// must share at least one member.
overlap = R + W - N
return overlap > 0
function coordinate_write(key, value, W, N):
replicas = get_replicas(key, N)
acks = 0
timestamp = now()
for replica in replicas:
send_async(replica, WRITE, key, value, timestamp)
while acks < W:
response = await_any_response(timeout)
if response == ACK:
acks += 1
if timeout_exceeded:
raise WriteTimeoutException
return SUCCESS
function coordinate_read(key, R, N):
replicas = get_replicas(key, N)
select one replica for full_data_request
// When R > 1, send digest requests to the remaining R - 1 replicas.
// When R = 1, no digest requests are sent.
if R > 1:
select R - 1 replicas for digest_request
// Coordinator waits for all R responses total
responses = await_responses(R, timeout)
if R == 1 or all_digests_match(responses):
return responses.full_data
else:
// Read repair: fetch full data from all R replicas
full_responses = fetch_full(replicas, R)
latest = max(full_responses, by=timestamp)
trigger_background_repair(replicas, latest)
return latest
Edge Cases and Caveats
The R + W > N rule provides strong consistency only under specific assumptions.
Several practical concerns weaken the guarantee.
Timestamp Skew
Cassandra uses last-write-wins (LWW) conflict resolution based on client-supplied or coordinator-assigned timestamps.
If clocks are skewed between nodes, a "later" write can be silently discarded in favor of an "earlier" one with a higher timestamp.
NTP synchronization helps but does not eliminate this risk.
This is why quorum reads and writes do not provide true linearizability.
Hinted Handoff and ANY
A write with CL = ANY can be acknowledged when a coordinator or live node stores a hint for an unavailable target replica.
At this point, no actual replica holds the written data — only a hint exists.
The data is replayed to the target replica when it comes back online.
Until replay completes, a read at any consistency level, including CL = ALL, will not see this write because the data has not been written to any replica yet.
Lightweight Transactions
For operations requiring true linearizability (e.g., compare-and-set), tunable consistency is insufficient.
Cassandra provides lightweight transactions (LWTs) based on a Paxos consensus protocol.
These involve multiple round trips covering the Paxos phases (prepare/promise, read of current state, propose/accept, and commit) versus one for a normal write, carrying significantly higher latency but provide serial isolation for the affected partition.
Membership Changes
During topology changes (nodes joining, leaving, or being decommissioned), the replica set for a given key can shift.
Pending writes to old replicas and reads from new replicas can violate the overlap guarantee temporarily, Cassandra mitigates this through streaming and repair but does not fully eliminate the window.
Cassandra mitigates this through streaming and repair but does not fully eliminate the window.
Read Repair Limitations
Read repair is probabilistic in older Cassandra versions and only covers the replicas contacted during the read.
Full anti-entropy repair (using Merkle trees) must be run periodically to ensure all replicas converge.
Without regular repair, stale data can persist indefinitely on replicas that are never read from.
Practical Guidance
Choosing consistency levels requires understanding the workload:
| Pattern | Write CL | Read CL | Tradeoff |
|---|---|---|---|
| General purpose | QUORUM | QUORUM | Balanced consistency and availability |
| Write-heavy logging | ONE | ONE | Maximum throughput, eventual consistency |
| Financial records | QUORUM | QUORUM | Strong consistency; prefer LWTs for CAS operations |
| Multi-DC with local reads | LOCAL_QUORUM | LOCAL_QUORUM | Low latency, strong consistency per DC |
| Maximum write availability (fire-and-forget) | ANY | — | Acknowledged write may exist only as a hint; do not pair with reads expecting to see this write until hint replay is confirmed |
A common anti-pattern is writing at QUORUM and reading at ONE.
For N = 3, QUORUM gives W = 2, so R + W = 1 + 2 = 3, which equals N — it does not satisfy R + W > N.
This configuration does not guarantee overlap between the read set and write set, and can return stale data.
Key Points
- Tunable consistency allows each read and write operation to independently select how many replicas must respond, trading consistency for availability and latency on a per-query basis.
- The quorum overlap condition R + W > N is the foundation for strong consistency: it guarantees at least one replica in common between the read set and the write set.
- CL = QUORUM for both reads and writes is the standard recipe for strong consistency with N = 3, tolerating one replica failure for both operations.
- Last-write-wins conflict resolution depends on synchronized clocks, making NTP accuracy a correctness concern rather than just an operational one.
- LOCAL_QUORUM provides low-latency strong consistency within a single datacenter but does not guarantee cross-datacenter consistency.
- True linearizability requires lightweight transactions (Paxos-based), not quorum reads and writes alone.
- CL = ANY acknowledges a write before it reaches any replica; subsequent reads at any level may not see the write until hint replay completes.
- Anti-entropy repair must run periodically to correct divergence that read repair and hinted handoff cannot fully resolve.
References
DeCandia, G., Hastorun, D., Jampani, M., et al. "Dynamo: Amazon's Highly Available Key-Value Store." Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), 2007.
Lakshman, A. and Malik, P. "Cassandra: A Decentralized Structured Storage System." ACM SIGOPS Operating Systems Review, 44(2), 2010.
Gilbert, S. and Lynch, N. "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services." ACM SIGACT News, 33(2), 2002.
Lamport, L. "The Part-Time Parliament." ACM Transactions on Computer Systems, 16(2), 1998.
Vogels, W. "Eventually Consistent." Communications of the ACM, 52(1), 2009.