Distributed databases face a fundamental tension between consistency, availability, and partition tolerance. The CAP theorem tells us we cannot have all three simultaneously during a network partition. Rather than making a single, system-wide choice, Cassandra and similar systems allow operators to tune consistency on a per-operation basis. This approach, known as tunable consistency, lets you select different consistency levels for reads and writes depending on the requirements of each query.
Background: Replication and Quorums
Cassandra uses a peer-to-peer architecture where data is replicated across N nodes (the replication factor). When a client issues a read or write, a coordinator node forwards the request to the replicas responsible for that partition key. The consistency level (CL) determines how many replica acknowledgments the coordinator must receive before responding to the client.
The core insight comes from quorum systems. If you write to W replicas and read from R replicas, then as long as W + R > N, the read set and write set must overlap. This overlap guarantees that at least one replica contacted during a read holds the most recent acknowledged write. This is the foundation of tunable consistency.
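The overlap guarantee can be checked exhaustively for small N. A minimal Python sketch (function and variable names are illustrative, not part of any Cassandra API):

```python
from itertools import combinations

def overlap_guaranteed(n: int, w: int, r: int) -> bool:
    """Check exhaustively that every W-replica write set intersects
    every R-replica read set among n replicas."""
    replicas = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(replicas, w)
        for rs in combinations(replicas, r)
    )

# With N=3: QUORUM writes (W=2) + QUORUM reads (R=2) always overlap.
assert overlap_guaranteed(3, 2, 2)      # W + R = 4 > 3
# ONE/ONE does not: a read may miss the written replica entirely.
assert not overlap_guaranteed(3, 1, 1)  # W + R = 2 <= 3
```

The brute-force check mirrors the algebraic argument: if W + R > N, two subsets of sizes W and R drawn from N elements cannot be disjoint.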
Common consistency levels in Cassandra:
| Level | Replicas Required | Behavior |
|---|---|---|
| ONE | 1 | Lowest latency, weakest consistency |
| TWO | 2 | Two replicas must respond |
| THREE | 3 | Three replicas must respond |
| QUORUM | ⌊N/2⌋ + 1 | Majority of replicas |
| LOCAL_QUORUM | ⌊N_local/2⌋ + 1 | Majority within the local datacenter |
| EACH_QUORUM | ⌊N_dc/2⌋ + 1 per DC | Majority in every datacenter (writes only in Cassandra 2.x+; not supported for reads) |
| ALL | N | Every replica must respond |
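The QUORUM row is simple integer arithmetic; a quick sketch of the majority calculation:

```python
def quorum(n: int) -> int:
    """Majority quorum size: floor(N/2) + 1."""
    return n // 2 + 1

assert quorum(3) == 2   # RF=3: QUORUM needs 2 replicas
assert quorum(5) == 3   # RF=5: QUORUM needs 3 replicas
assert quorum(6) == 4   # even RF still requires a strict majority
```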
How It Works
Write Path
When a client sends a write with consistency level W, the coordinator sends the mutation to all N replicas (or all replicas in the relevant datacenters). It waits for W acknowledgments before responding success to the client. The remaining replicas will still receive the write asynchronously if they haven't already. Each write carries a client-supplied or coordinator-generated timestamp, which Cassandra uses for last-write-wins (LWW) conflict resolution.
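The write path can be modeled with a toy in-memory replica (a sketch under simplifying assumptions, not the real implementation: real Cassandra sends mutations concurrently and replies as soon as W acknowledgments arrive, while this sequential model waits for everyone):

```python
class Replica:
    def __init__(self, name):
        self.name, self.up, self.store = name, True, {}

    def write(self, key, value, ts):
        if not self.up:
            return False              # unreachable: no ACK
        # Last-write-wins: keep only the highest-timestamp version.
        if key not in self.store or self.store[key][1] < ts:
            self.store[key] = (value, ts)
        return True

def coordinator_write(replicas, key, value, ts, w):
    """Send the mutation to all N replicas; report success iff at
    least W acknowledge."""
    acks = sum(rep.write(key, value, ts) for rep in replicas)
    return acks >= w

# N=3 with replica C unreachable: QUORUM (W=2) succeeds, ALL (W=3) fails.
a, b, c = Replica("A"), Replica("B"), Replica("C")
c.up = False
assert coordinator_write([a, b, c], "user:42", "Alice", 100, w=2)
assert not coordinator_write([a, b, c], "user:42", "Alice", 100, w=3)
```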
Read Path
When a client sends a read with consistency level R, the coordinator optimizes the request by sending a full data request to one replica and digest (hash) requests to the remaining R-1 replicas. If all digests agree, the coordinator returns the data from the first replica. If digests diverge, the coordinator fetches full data from all R replicas, compares timestamps, and returns the value with the highest timestamp. If the R replicas return divergent values, the coordinator performs read repair: it sends the most recent value to any out-of-date replicas that were contacted. Cassandra can also perform read repair probabilistically in the background on the remaining (N - R) replicas (note: the per-table read_repair_chance and dc_local_read_repair_chance settings were deprecated in Cassandra 4.0; background repair behavior may differ by version).
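The digest-then-repair flow can be sketched as follows, modeling each replica as a dict of key → (value, timestamp) and using a hash as a stand-in for Cassandra's digest (illustrative names throughout):

```python
import hashlib

def digest(record):
    """Stand-in for Cassandra's digest: a hash over value + timestamp."""
    return hashlib.sha256(repr(record).encode()).hexdigest()

def coordinator_read(stores, key, r):
    """Read at CL=R over replica stores (dicts of key -> (value, ts)).
    One full-data request plus R-1 digest requests; on mismatch, fetch
    full data, resolve by highest timestamp (LWW), and read-repair the
    stale contacted replicas."""
    contacted = stores[:r]
    data = contacted[0].get(key)                       # full data request
    if all(digest(s.get(key)) == digest(data) for s in contacted[1:]):
        return data                                    # digests agree: fast path
    # Digest mismatch: compare full versions and pick the newest.
    versions = [s.get(key) for s in contacted if s.get(key) is not None]
    latest = max(versions, key=lambda v: v[1])
    for s in contacted:                                # read repair
        if s.get(key) != latest:
            s[key] = latest
    return latest

# Replica B holds ts=100; replica C is stale at ts=90.
b = {"user:42": ("Alice v5", 100)}
c = {"user:42": ("Alice v4", 90)}
assert coordinator_read([b, c], "user:42", r=2) == ("Alice v5", 100)
assert c["user:42"] == ("Alice v5", 100)   # stale replica repaired
```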
The Consistency Guarantee
The combination of R and W determines the consistency semantics:
- Strong consistency (latest-write visibility): W + R > N. Every read is guaranteed to contact at least one replica that holds the most recent acknowledged write, ensuring the coordinator can return the latest value under Cassandra's last-write-wins model. Note: this does not constitute strict linearizability due to clock skew and LWW semantics (see Caveats below).
- Eventual consistency: W + R ≤ N. Reads may return stale data, but anti-entropy mechanisms (read repair, hinted handoff, Merkle tree-based repair) will eventually converge replicas.
For example, with N=3 and QUORUM for both reads and writes, W=2 and R=2, so W + R = 4 > 3. This guarantees that the read set overlaps with the write set. With N=3 and CL=ONE for both (W=1, R=1), we get W + R = 2, which does not exceed N, so stale reads are possible.
Walkthrough
Consider a concrete scenario with replication factor N=3, replicas A, B, and C, and a key user:42.
Step 1: Write with QUORUM (W=2)
Client -> Coordinator: WRITE user:42 = {name: "Alice", v: 5} at timestamp t=100
Coordinator -> Replica A: WRITE user:42 = {name: "Alice", v: 5, ts: 100}
Coordinator -> Replica B: WRITE user:42 = {name: "Alice", v: 5, ts: 100}
Coordinator -> Replica C: WRITE user:42 = {name: "Alice", v: 5, ts: 100}
Replica A: ACK (received, written to memtable + commit log)
Replica B: ACK (received, written to memtable + commit log)
Replica C: [slow, has not responded yet]
Coordinator has 2 ACKs >= W=2.
Coordinator -> Client: WRITE SUCCESS
At this point, replicas A and B hold version ts=100. Replica C may still be processing or temporarily unreachable. It will eventually receive the write, either from the still-pending replication message or via hinted handoff.
Step 2: Read with QUORUM (R=2)
Client -> Coordinator: READ user:42
Coordinator sends a full data request to Replica B and a digest request to Replica C.
Replica B: responds with {name: "Alice", v: 5, ts: 100}
Replica C: responds with digest of {name: "Alice", v: 4, ts: 90} [stale]
Digests disagree. Coordinator fetches full data from Replica C:
Replica C: {name: "Alice", v: 4, ts: 90}
Coordinator compares timestamps: ts=100 > ts=90.
Coordinator -> Client: READ RESULT = {name: "Alice", v: 5}
Coordinator initiates read repair:
Coordinator -> Replica C: WRITE user:42 = {name: "Alice", v: 5, ts: 100}
Because W + R = 4 > 3 = N, at least one of the two replicas contacted during the read (B in this case) must have seen the write. The coordinator correctly returns the latest value and repairs the stale replica.
Step 3: What If We Used CL=ONE for Both?
WRITE with W=1: Only Replica A acknowledges. Client sees success.
READ with R=1: Coordinator contacts Replica C, which still has ts=90.
Client receives stale data.
This illustrates why W + R > N matters.
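The three steps above can be reproduced in a few lines, modeling each replica as a dict with last-write-wins resolution (a toy model, not Cassandra's storage engine):

```python
# Each replica is a dict of key -> (value, ts); writes follow last-write-wins.
def lww_write(replica, key, value, ts):
    if key not in replica or replica[key][1] < ts:
        replica[key] = (value, ts)

A, B, C = {}, {}, {}

# Step 1: QUORUM write (W=2). A and B ack ts=100; slow C still holds ts=90.
lww_write(C, "user:42", {"name": "Alice", "v": 4}, 90)
lww_write(A, "user:42", {"name": "Alice", "v": 5}, 100)
lww_write(B, "user:42", {"name": "Alice", "v": 5}, 100)

# Step 2: a QUORUM read (R=2) contacting B and C must overlap the write set,
# and LWW resolution picks the ts=100 version.
latest = max((rep["user:42"] for rep in (B, C)), key=lambda v: v[1])
assert latest == ({"name": "Alice", "v": 5}, 100)

# Step 3: a CL=ONE read that happens to contact only C returns stale data.
assert C["user:42"][1] == 90
```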
Practical Trade-offs
Latency vs. Consistency
Higher consistency levels require more replicas to respond, which means the operation's latency is bounded by the slowest of the required replicas. With CL=ALL, a single slow or failed replica causes the operation to fail or time out. With CL=ONE, you get the latency of the fastest replica but risk stale reads.
Availability vs. Consistency
With N=3 and CL=QUORUM, you can tolerate one replica failure. With CL=ALL, zero failures are tolerated. With CL=ONE, you can tolerate N-1 failures. This is the direct manifestation of the CAP trade-off, made explicit per query.
Mixed Consistency Levels
A common pattern is to use strong writes with weaker reads (or vice versa). For instance, with N=3:
- Write at CL=ALL (W=3), Read at CL=ONE (R=1): W + R = 4 > 3. Latest-write visibility with fast reads but fragile writes (zero replica failures tolerated).
- Write at CL=ONE (W=1), Read at CL=ALL (R=3): W + R = 4 > 3. Latest-write visibility with fast writes but expensive reads and zero read-path failure tolerance.
- Write at CL=QUORUM (W=2), Read at CL=QUORUM (R=2): Balanced. Tolerates one failure on either path.
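The three patterns can be sanity-checked in a couple of lines (assuming N=3, as above):

```python
N = 3  # replication factor in the examples above
combos = {
    "ALL write / ONE read":       (3, 1),
    "ONE write / ALL read":       (1, 3),
    "QUORUM write / QUORUM read": (2, 2),
}
for name, (w, r) in combos.items():
    assert w + r > N                # latest-write visibility holds
    print(f"{name}: tolerates {N - w} write-path, {N - r} read-path failures")
```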
Caveats and Failure Modes
Tunable consistency in Cassandra does not provide linearizability in the strict sense, even with QUORUM reads and writes. Cassandra uses last-write-wins with timestamps, and clock skew between nodes can cause a write with a lower timestamp to "win" even if it was issued later in real time. Lightweight transactions (using Paxos) are required for true linearizability, at significant performance cost.
Additionally, hinted handoff introduces subtlety. If a coordinator cannot reach a replica, it stores a "hint" locally and forwards it when the replica recovers. Hints allow writes to succeed with fewer live replicas, but they are not replayed instantly, and they can be lost if the coordinator itself fails before delivering them. By default, Cassandra only stores hints for a limited window (e.g., 3 hours); replicas that are down longer than this window may miss writes entirely until a full repair is run.
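A toy model of the hint window makes the failure mode concrete (the 3-hour constant mirrors Cassandra's default; the function and variable names here are made up for illustration):

```python
HINT_WINDOW = 3 * 60 * 60  # seconds; mirrors Cassandra's ~3-hour default

def deliver_hints(hints, replica, downtime):
    """Replay stored hints once a replica recovers. If it was down longer
    than the hint window, the hints were discarded and the replica misses
    those writes until a full repair runs."""
    if downtime > HINT_WINDOW:
        return []                              # hints expired: writes lost
    for key, (value, ts) in hints:
        if key not in replica or replica[key][1] < ts:
            replica[key] = (value, ts)         # last-write-wins replay
    return hints

recovered = {}
hints = [("user:42", ("Alice", 100))]
assert deliver_hints(hints, recovered, downtime=600)         # back in 10 min
assert recovered["user:42"] == ("Alice", 100)
assert deliver_hints(hints, {}, downtime=4 * 60 * 60) == []  # down 4 h: lost
```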
Anti-Entropy Mechanisms
Tunable consistency relies on background convergence mechanisms to ensure replicas eventually agree:
- Read repair: During reads, divergent values are detected and the latest version is propagated to stale replicas that were contacted. Background probabilistic read repair on non-contacted replicas was configurable in older Cassandra versions but was deprecated in Cassandra 4.0.
- Hinted handoff: When a replica is unreachable, the coordinator stores mutations as hints and replays them when the replica comes back online, subject to the configured hint window.
- Anti-entropy repair (Merkle trees): Operators run periodic full or incremental repairs. Replicas exchange Merkle tree digests of their data to identify and resolve inconsistencies. This is the only mechanism that guarantees full convergence across all replicas, including those that were down longer than the hint window.
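A minimal Merkle-root sketch shows how two replicas can compare an entire token range with a single hash exchange (illustrative only; Cassandra's actual trees are bounded-depth and per-range):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root hash over a list of (key, ts) leaves, pairing nodes level by
    level (an odd node is carried up unchanged). Replicas compare roots;
    equal roots mean the range needs no streaming."""
    level = [h(repr(leaf).encode()) for leaf in leaves]
    while len(level) > 1:
        nxt = [h(level[i] + level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]

in_sync = merkle_root([("k1", 100), ("k2", 200)])
assert merkle_root([("k1", 100), ("k2", 200)]) == in_sync   # identical data
assert merkle_root([("k1", 100), ("k2", 199)]) != in_sync   # divergent replica
```

A digest mismatch at the root tells the replicas to recurse into subtrees and stream only the rows that actually differ, rather than the whole dataset.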
Without regular anti-entropy repair, clusters relying on CL=ONE for reads can accumulate unbounded staleness. Operational discipline is essential.
Key Points
- Tunable consistency allows per-operation selection of how many replicas must acknowledge a read or write, enabling fine-grained control over the consistency-availability trade-off.
- The quorum condition W + R > N guarantees that reads and writes overlap on at least one replica, ensuring the latest acknowledged write is visible under Cassandra's LWW model.
- Lower consistency levels (ONE, TWO) reduce latency and improve availability at the cost of potentially returning stale data.
- Higher consistency levels (QUORUM, ALL) provide stronger guarantees but increase tail latency and reduce fault tolerance.
- Cassandra's last-write-wins conflict resolution with timestamps means that even QUORUM-level operations do not provide true linearizability; lightweight transactions (Paxos) are needed for that.
- Anti-entropy mechanisms (read repair, hinted handoff, Merkle tree repair) are essential companions to tunable consistency, ensuring eventual convergence of replicas.
- Mixed consistency levels across reads and writes can satisfy W + R > N while optimizing for the access pattern, such as write-heavy or read-heavy workloads.
- EACH_QUORUM is supported for writes only (not reads) in current Cassandra versions.
References
Gilbert, S. and Lynch, N. "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services." ACM SIGACT News, Vol. 33, No. 2, 2002.
DeCandia, G. et al. "Dynamo: Amazon's Highly Available Key-Value Store." Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), 2007. (Cassandra's replication and consistency model draws heavily from Dynamo's design, though the two are distinct systems.)
Lakshman, A. and Malik, P. "Cassandra: A Decentralized Structured Storage System." ACM SIGOPS Operating Systems Review, Vol. 44, No. 2, 2010.
Vogels, W. "Eventually Consistent." Communications of the ACM, Vol. 52, No. 1, 2009.
Lamport, L. "Paxos Made Simple." ACM SIGACT News, Vol. 32, No. 4, 2001.