
Fencing Tokens and Lock Safety

April 25, 2026 · 10 min read

Fencing tokens transform lock safety from a timing-dependent property into an ordering-dependent one by requiring the storage layer to reject writes from stale lock holders.

Distributed locks are fundamentally unreliable without additional mechanisms to prevent stale lock holders from corrupting shared state.
A process that believes it holds a lock may have already lost it due to a GC pause, network delay, or clock skew, yet still proceed to write to a storage system.
Fencing tokens solve this problem by attaching a monotonically increasing token to each lock acquisition and requiring the storage layer to reject operations bearing outdated tokens.

The Problem: Unsafe Lock Usage

diagram-1: A sequence diagram with actors Client A, Lock Service, Client B, and Storage: A acquires the lock, A enters a long GC pause, the lease expires, B acquires the lock, then both A and B issue writes to Storage and both writes are applied (the mutual-exclusion violation).

Consider a simple distributed lock protecting a shared resource (a database row, a file, a leadership role).
Client A acquires the lock, then experiences a long GC pause.
The lock's lease expires.
Client B acquires the lock and begins writing.
Client A resumes, unaware that its lock has expired, and also writes.
The result is a classic mutual exclusion violation: two clients both believe they hold the lock and both modify shared state.

This scenario is not hypothetical.
Any lock implementation relying on time-based leases (which is nearly all of them in distributed systems) is vulnerable.
The root cause is that a lock holder cannot reliably know whether its lease is still valid at the moment it performs a write.
Checking the lease before writing introduces a time-of-check-to-time-of-use (TOCTOU) gap that no local clock can close.

The insight behind fencing tokens is simple: instead of relying on the client to police itself, push the safety guarantee into the resource being protected.
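To make the TOCTOU gap concrete, here is a minimal, runnable Python sketch of the unsafe check-then-write pattern. The `Lock`, `Storage`, and `lease_is_valid` names are illustrative stand-ins, not a real lock-service API:

```python
import time
from dataclasses import dataclass

# Illustrative stand-ins for a lease-based lock and an unfenced store.

@dataclass
class Lock:
    expiry: float  # lease expiry on the local monotonic clock

class Storage:
    def __init__(self):
        self.value = None
    def write(self, value):
        self.value = value  # no fencing: accepts any write

def lease_is_valid(lock: Lock) -> bool:
    return time.monotonic() < lock.expiry

def unsafe_write(lock: Lock, storage: Storage, value):
    # TOCTOU gap: the lease can expire (or a GC pause can strike)
    # between this check and the write below.
    if not lease_is_valid(lock):
        raise RuntimeError("lost the lock")
    storage.write(value)  # may execute after the lease has expired

# Simulate a pause that outlasts the lease: the lease is now invalid,
# yet the storage layer has no way to tell and accepts the write anyway.
lock = Lock(expiry=time.monotonic() + 0.01)  # 10 ms lease
storage = Storage()
assert lease_is_valid(lock)
time.sleep(0.05)  # simulated GC pause
assert not lease_is_valid(lock)
storage.write("stale")  # accepted: nothing fences the stale holder
assert storage.value == "stale"
```

No placement of the `lease_is_valid` check can close the gap, because the pause can always occur between the check and the write; that is exactly what pushing the check into the storage layer fixes.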

How Fencing Tokens Work

When a lock service grants a lock, it also returns a fencing token: a number that increases monotonically with each successive lock grant.
The client includes this token in every request to the shared resource.
The resource (or a proxy in front of it) tracks the highest token it has seen and rejects any request carrying a token strictly lower than that maximum.

This converts a timing-dependent safety property into an ordering-dependent one.
Even if client A's stale write arrives after client B has already written, the resource will reject it because A's token is lower than B's.

Properties of a Fencing Token

  1. Monotonicity. Each new lock grant must produce a token strictly greater than all previously issued tokens for that lock.
  2. Uniqueness. No two distinct lock grants produce the same token.
  3. Durability. The token counter must survive lock service restarts. Losing the counter and resetting to zero would allow old tokens to be accepted again.
  4. Lock service linearizability. Fencing tokens are only as strong as the guarantee that the lock service itself never issues the same token to two different clients. This requires the lock service to be linearizable (e.g., backed by a consensus protocol). A non-linearizable lock service can produce duplicate tokens under split-brain conditions, defeating the fencing guarantee.

These properties can be achieved with a simple persistent counter, a database sequence, or any linearizable compare-and-swap register.
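As a single-node illustration of the durability property, here is a sketch of a file-backed monotonic counter. The class name and file layout are assumptions for the example; a replicated lock service would derive the counter from its consensus log instead. The key point is that the new value is fsync'd to disk *before* the token is handed out, so a restart never re-issues an old token:

```python
import os
import tempfile

class DurableTokenCounter:
    """Sketch of a durable, monotonic token counter (single node only).

    The counter is persisted and fsync'd before next_token() returns,
    so a crash-and-restart resumes from the last issued token rather
    than resetting to zero.
    """
    def __init__(self, path: str):
        self.path = path
        if os.path.exists(path):
            with open(path) as f:
                self.value = int(f.read())
        else:
            self.value = 0

    def next_token(self) -> int:
        self.value += 1
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            f.write(str(self.value))
            f.flush()
            os.fsync(f.fileno())   # persist before returning the token
        os.replace(tmp, self.path)  # atomic rename: no torn counter file
        return self.value

path = os.path.join(tempfile.mkdtemp(), "fencing_counter")
counter = DurableTokenCounter(path)
t1 = counter.next_token()
t2 = counter.next_token()
assert t2 == t1 + 1  # strictly increasing

# Simulated restart: a fresh instance resumes from the persisted value.
restarted = DurableTokenCounter(path)
assert restarted.next_token() == t2 + 1
```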

Walkthrough

Below is a step-by-step illustration of fencing in action, followed by pseudocode for both the lock service and the storage layer.

Scenario

diagram-2: A sequence diagram of the concrete steps: the Lock Service grants token=33 to Client A; Client A pauses; the Lock Service grants token=34 to Client B; Client B's write(token=34) is accepted and max_token becomes 34; Client A's write(token=33) is rejected because 33 < max_token=34.
  1. Client A calls acquire_lock("resource-X"). The lock service grants the lock and returns token = 33.
  2. Client A enters a long GC pause (or network partition).
  3. The lock's lease expires.
  4. Client B calls acquire_lock("resource-X"). The lock service grants the lock and returns token = 34.
  5. Client B sends write(resource-X, value=B, token=34) to the storage service. The storage service records max_token[resource-X] = 34 and accepts the write.
  6. Client A wakes up from its GC pause and sends write(resource-X, value=A, token=33) to the storage service.
  7. The storage service compares 33 < max_token[resource-X] = 34 and rejects the write.

Client A's stale operation is safely blocked.

Pseudocode: Lock Service

state:
    token_counter: persistent integer, initially 0
    lock_holders: map[resource_id -> (client_id, expiry)]

function acquire_lock(resource_id, client_id, lease_duration):
    if lock_holders[resource_id] exists and not expired:
        return LOCK_BUSY

    token_counter := token_counter + 1       // must be persisted atomically
    expiry := now() + lease_duration
    lock_holders[resource_id] := (client_id, expiry)
    return LOCK_GRANTED, token_counter

Pseudocode: Storage Service (Fencing Check)

state:
    max_token: map[resource_id -> integer], initially 0
    data: map[resource_id -> value]

function write(resource_id, value, token):
    if token < max_token[resource_id]:
        return REJECTED, "fencing token is stale"

    max_token[resource_id] := token
    data[resource_id] := value
    return OK

Design note on equal tokens: The check above uses strict less-than, meaning a repeated write with the same token as the current maximum is accepted.
This is a deliberate choice to support idempotent retries by the same lock holder (e.g., a client that retries a write after a network timeout without re-acquiring the lock).
If your application requires that each write be accepted exactly once, you should use token <= max_token to reject equal-token duplicates, but you must then ensure clients do not need to retry writes without re-acquiring the lock.

The storage-side check is the critical component.
If the storage system does not enforce the fencing check, then the token is merely advisory and provides no safety guarantee.
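The two pseudocode services above can be condensed into a single runnable Python simulation. This is an in-memory, single-process sketch only (the counter is not actually durable and there is no networking); it replays the walkthrough scenario, with tokens starting at 1 rather than 33:

```python
import time

class LockService:
    # Mirrors the pseudocode. In-memory only: a real service must
    # persist token_counter atomically with each grant.
    def __init__(self):
        self.token_counter = 0
        self.lock_holders = {}  # resource_id -> (client_id, expiry)

    def acquire_lock(self, resource_id, client_id, lease_duration):
        holder = self.lock_holders.get(resource_id)
        if holder and holder[1] > time.monotonic():
            return "LOCK_BUSY", None
        self.token_counter += 1
        expiry = time.monotonic() + lease_duration
        self.lock_holders[resource_id] = (client_id, expiry)
        return "LOCK_GRANTED", self.token_counter

class StorageService:
    def __init__(self):
        self.max_token = {}  # resource_id -> highest token seen
        self.data = {}

    def write(self, resource_id, value, token):
        # Fencing check: reject tokens strictly lower than the max seen.
        if token < self.max_token.get(resource_id, 0):
            return "REJECTED"
        self.max_token[resource_id] = token
        self.data[resource_id] = value
        return "OK"

# Replay the walkthrough.
locks, storage = LockService(), StorageService()
_, token_a = locks.acquire_lock("resource-X", "A", lease_duration=0.01)
time.sleep(0.02)                       # A's GC pause outlasts the lease
_, token_b = locks.acquire_lock("resource-X", "B", lease_duration=10)
assert storage.write("resource-X", "B", token_b) == "OK"
assert storage.write("resource-X", "A", token_a) == "REJECTED"
assert storage.data["resource-X"] == "B"  # B's value survives A's stale write
```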

Integration Challenges

Storage Systems Must Cooperate

The most significant practical difficulty is that the storage layer must understand and enforce fencing tokens.
Many off-the-shelf databases and file systems do not natively support this.
Options include:

  • Using a conditional write primitive that already exists (e.g., ZooKeeper's setData with a version number, or etcd's transactions with mod_revision checks).
  • Adding a proxy layer in front of the storage system that performs the fencing check before forwarding the write.
  • Encoding the token into an application-level optimistic concurrency control field (e.g., a version column in a relational database with a WHERE version < token guard).

Each approach has trade-offs.
Native support is cleanest.
Proxies add latency and become a single point of failure.
Application-level encoding requires discipline across every code path that touches the resource.
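The application-level encoding option can be sketched with SQLite and a conditional UPDATE. Table and column names here are illustrative; the guard accepts equal tokens (fence <= token) to permit idempotent retries, matching the strict-less-than rejection rule of the fencing check:

```python
import sqlite3

# Sketch of application-level fencing via a version/fence column.
# A write succeeds only if its token is at least the stored fence;
# a stale token updates zero rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource (id TEXT PRIMARY KEY, value TEXT, fence INTEGER)"
)
conn.execute("INSERT INTO resource VALUES ('resource-X', NULL, 0)")

def fenced_write(resource_id: str, value: str, token: int) -> bool:
    cur = conn.execute(
        "UPDATE resource SET value = ?, fence = ? WHERE id = ? AND fence <= ?",
        (value, token, resource_id, token),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means the token was stale

assert fenced_write("resource-X", "B", 34) is True    # current holder
assert fenced_write("resource-X", "A", 33) is False   # stale token rejected
row = conn.execute(
    "SELECT value FROM resource WHERE id = 'resource-X'"
).fetchone()
assert row[0] == "B"
```

The discipline problem mentioned above is visible here: every code path that writes this row must go through `fenced_write`; one raw UPDATE that skips the guard silently reopens the hole.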

Multi-Resource Operations

Fencing tokens protect a single resource per lock.
If a lock holder needs to modify multiple resources atomically, the fencing token alone is insufficient.
You need either a transactional storage layer that can check the token across all resources in a single atomic operation, or a design where the multiple writes are idempotent and convergent.
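The transactional option can be sketched as a storage layer that applies a write batch only if the token passes the fencing check on every resource it touches. This is a single-process illustration in which a `threading.Lock` stands in for a real storage transaction:

```python
import threading

class MultiResourceStorage:
    """Sketch: a batch is applied all-or-nothing, and only if the
    token is not stale for *any* resource in the batch."""
    def __init__(self):
        self._mutex = threading.Lock()  # stands in for a transaction
        self.max_token = {}
        self.data = {}

    def write_all(self, writes: dict, token: int) -> bool:
        with self._mutex:
            # Check the token against every resource before touching any.
            if any(token < self.max_token.get(r, 0) for r in writes):
                return False  # stale for at least one resource: reject all
            for resource_id, value in writes.items():
                self.max_token[resource_id] = token
                self.data[resource_id] = value
            return True

storage = MultiResourceStorage()
assert storage.write_all({"X": 1, "Y": 1}, token=34) is True
# A stale holder must not update even the resources it has never
# written: 33 is stale for Y, so the whole batch (including Z) is rejected.
assert storage.write_all({"Y": 2, "Z": 2}, token=33) is False
assert "Z" not in storage.data
```

Without the all-or-nothing check, the stale holder's batch could land on Z while being rejected on Y, leaving the two resources mutually inconsistent.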

Token Generation in Replicated Lock Services

If the lock service itself is replicated (as with ZooKeeper, etcd, or Chubby), the monotonic token must be derived from the replicated state machine.
In ZooKeeper, for instance, the zxid (transaction ID) or the sequential ephemeral node number serves as a natural fencing token because both are guaranteed to be monotonically increasing across the entire cluster.

In systems using Raft-based consensus (like etcd), the Raft log index can serve this purpose.
The key requirement is that the token's monotonicity is a consequence of the consensus protocol's ordering guarantee, not of any single node's local state.

Comparison with Other Approaches

Epoch-Based Leader Election

Some systems use an "epoch" or "term" number (as in Raft) to fence stale leaders.
This is conceptually identical to fencing tokens.
A new leader has a higher term number, and followers reject messages from leaders with lower terms.
The fencing token pattern generalizes this idea beyond consensus protocols to any lock-protected resource.

Delay-Based Approaches

An alternative strategy is to wait out the maximum possible clock drift plus network delay before using a lock.
For example, Google's TrueTime API in Spanner commits transactions only after waiting for the uncertainty interval to pass.
This approach trades latency for safety and requires infrastructure (GPS/atomic clocks) that most systems do not have.
Fencing tokens avoid this cost.

CAS (Compare-And-Swap) and Optimistic Concurrency Control

CAS operations and optimistic concurrency control (OCC) — for example, a version column with a conditional UPDATE ... WHERE version = expected — are complementary to fencing tokens rather than strict alternatives.
A CAS operation is itself an atomic read-modify-write: the storage system atomically checks the current value and writes only if it matches the expected value, so there is no TOCTOU window within a single CAS call.

The key distinction is directional: fencing tokens require only a one-sided monotonic check (reject if token is lower than the max seen), which means the client does not need to know the current stored version before writing.
CAS/OCC requires the client to supply the exact expected current version, which means a read is needed before each write to obtain that version.
In practice, many robust systems combine both: they use a fencing token to reject clearly stale writers and CAS/OCC to handle fine-grained version conflicts among current writers.
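The combination can be sketched as a store that runs both checks on each write: the one-sided fencing check first (no prior read needed), then the exact-version CAS check. The class and return-value names are illustrative:

```python
class FencedVersionedStore:
    """Sketch combining both checks on a single resource:
    - fencing (one-sided): reject tokens below the highest seen;
    - CAS/OCC (exact): reject writes whose expected_version does not
      match the current stored version."""
    def __init__(self):
        self.max_token = 0
        self.version = 0
        self.value = None

    def write(self, value, token: int, expected_version: int) -> str:
        if token < self.max_token:
            return "REJECTED_STALE_TOKEN"       # clearly stale lock holder
        if expected_version != self.version:
            return "REJECTED_VERSION_CONFLICT"  # lost a fine-grained race
        self.max_token = token
        self.version += 1
        self.value = value
        return "OK"

store = FencedVersionedStore()
assert store.write("v1", token=34, expected_version=0) == "OK"
# A stale holder is caught by the fencing check alone, read-free.
assert store.write("old", token=33, expected_version=1) == "REJECTED_STALE_TOKEN"
# The current holder with an out-of-date read is caught by the CAS check.
assert store.write("v2", token=34, expected_version=0) == "REJECTED_VERSION_CONFLICT"
assert store.write("v2", token=34, expected_version=1) == "OK"
```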

When Fencing Is Not Enough

Fencing tokens guarantee that writes from stale lock holders (those holding an outdated token) are rejected by the storage layer.
They ensure that the most recent lock holder's writes are never overwritten by an older one.
However, there are scenarios they do not fully address:

Intermediate read visibility. Because network messages can be reordered, it is possible for a stale write (token 33) to arrive at the storage layer before the newer write (token 34), be accepted, and then be overwritten when token 34 arrives.
The final stored state will be correct (token 34's value wins), but a concurrent reader between those two events may briefly observe the stale value.
For applications requiring strict real-time ordering of reads — where a read that begins after a write completes must see that write — additional mechanisms such as linearizable reads or read fences are needed.

Liveness. Fencing tokens do not help with lock liveness.
If the lock service itself is unavailable, no tokens are issued and the system stalls.
Fencing is a safety mechanism, not an availability mechanism.

Correctness of lock service. As noted above, if the lock service is not itself linearizable and issues duplicate tokens due to a split-brain or implementation bug, fencing tokens cannot prevent two clients from each believing their token is authoritative.

Key Points

  • A distributed lock based on leases alone cannot prevent stale lock holders from corrupting shared state after a GC pause, network delay, or clock skew.
  • A fencing token is a monotonically increasing number issued with each lock grant, included in all subsequent client operations on the protected resource.
  • Safety depends on the storage layer rejecting any write carrying a token lower than the highest token it has previously accepted.
  • The comparison operator (strict less-than vs. less-than-or-equal) in the fencing check is a design choice that trades off idempotent retry support against duplicate-write rejection.
  • The lock service's token counter must be durable and, in replicated setups, derived from the consensus protocol's ordering guarantees. The lock service must itself be linearizable.
  • Fencing tokens generalize the epoch/term-based leader fencing used in consensus protocols like Raft and Paxos.
  • Storage systems must actively cooperate by enforcing the fencing check; without this, the token provides no safety guarantee.
  • CAS and optimistic concurrency control are complementary to fencing tokens, not strict replacements; robust systems often use both.
  • Fencing tokens address safety (mutual exclusion of writes) but do not address liveness or real-time read ordering.

References

Martin Kleppmann. "How to do distributed locking." martin.kleppmann.com, 2016.

Martin Kleppmann. Designing Data-Intensive Applications. O'Reilly Media, 2017. Chapter 8: The Trouble with Distributed Systems.

Mike Burrows. "The Chubby lock service for loosely-coupled distributed systems." Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.

Cary G. Gray and David R. Cheriton. "Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency." Proceedings of the 12th ACM Symposium on Operating Systems Principles (SOSP), 1989.

Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. "ZooKeeper: Wait-free coordination for internet-scale systems." Proceedings of the USENIX Annual Technical Conference, 2010.
