Fencing Tokens and Lock Safety

Distributed locks are one of the most misunderstood primitives in systems engineering. A process acquires a lock, believing it holds exclusive access to a shared resource. But in an asynchronous network with unbounded delays, process pauses, and clock skew, the lock may have already expired by the time the process acts on it. Fencing tokens exist to address this gap: they transform a "best effort" lock into a mechanism that can provide actual safety guarantees at the resource level.

The Problem with Distributed Locks

Consider a common scenario. Process A acquires a lease-based lock from a lock service (e.g., ZooKeeper, etcd, or a Redis-based lock). The lock has a timeout of 30 seconds. Process A then begins some work: it reads from a database, computes a result, and prepares to write. But between acquiring the lock and issuing the write, Process A experiences a long GC pause, a network partition, or simply slow I/O. Thirty-five seconds elapse. The lock expires. Process B acquires the same lock and begins its own write. Process A resumes, unaware the lock has expired, and issues its write. Both processes have now written to the shared resource concurrently.

No amount of tuning the lock timeout eliminates this problem. You cannot bound process pauses in a system running on garbage-collected runtimes, virtual machines, or commodity operating systems. The core issue is that the lock holder has no reliable way to know whether it still holds the lock at the exact moment it performs the protected operation.

This is not a theoretical concern. Martin Kleppmann described this failure mode in detail in his 2016 analysis of the Redlock algorithm, demonstrating that lease-based distributed locks, on their own, cannot guarantee mutual exclusion in asynchronous systems.

What Is a Fencing Token?

A fencing token is a monotonically increasing value issued by the lock service each time a lock is granted. When a client acquires the lock, it receives not just permission but also a token (e.g., an integer that increments with each lock acquisition). The client must present this token to the resource server when performing any protected operation. The resource server then validates the token: it rejects any request bearing a token that is less than or equal to the highest token it has already processed.

This shifts the safety guarantee from the lock service to the resource server. The lock service provides ordering. The resource server enforces it.

Walkthrough

Below is a step-by-step walkthrough illustrating how fencing tokens prevent stale writes.

Scenario Without Fencing

diagram-1 — Without fencing: GC-paused client overwrites new lock-holder's write

Process A acquires lock, timeout = 30s.
Process A begins work (read, compute, prepare write).
Process A experiences a GC pause lasting 35 seconds.
Lock expires.
Process B acquires lock, performs its write to the storage service.
Process A resumes, issues its (now stale) write to the storage service.
Result: Process A's stale write overwrites Process B's valid write. Data corruption.

Scenario With Fencing

diagram-2 — With fencing: storage rejects stale token T=33 after T=34 advances highest_seen

Process A acquires lock, receives fencing token T=33.
Process A begins work.
Process A experiences a GC pause lasting 35 seconds.
Lock expires.
Process B acquires lock, receives fencing token T=34.
Process B writes to the storage service, presenting T=34. Storage server records highest_seen = 34.
Process A resumes, attempts to write to the storage service, presenting T=33.
Storage server checks: 33 < 34. Rejects the write.
Result: Process B's write is preserved. Safety is maintained.

Pseudocode: Resource Server Validation

diagram-3 — Resource-server write path: persist highest token durably before commit

class ResourceServer:
    highest_token = load_from_durable_storage()  # Must survive restarts

    function handle_write(request):
        if request.fencing_token <= highest_token:
            return REJECT("stale fencing token")
        
        highest_token = request.fencing_token
        persist_highest_token(highest_token)  # Must be durable before the write
        perform_write(request.data)
        return OK

This logic is simple, but the simplicity is the point. The safety mechanism lives in a single comparison at the resource boundary, not in complex distributed consensus within the lock service.

Important: The highest_token value must be stored durably. If the resource server crashes and restarts with highest_token reset to zero, it would accept previously-rejected stale tokens, breaking the safety guarantee. In practice, highest_token should be persisted to stable storage (or encoded in the data record itself, as with a conditional-write approach) before the write is committed.

Pseudocode: Client Protocol

function do_protected_work(lock_service, resource_server):
    lock, token = lock_service.acquire("my-resource")
    if lock == nil:
        return FAILURE("could not acquire lock")
    
    try:
        data = compute_result()
        result = resource_server.write(data, fencing_token=token)
        if result == REJECT:
            log("Lock was stale; write rejected by resource server")
            return FAILURE("fenced")
        return SUCCESS
    finally:
        lock_service.release("my-resource")

Why the Resource Server Must Participate

A critical property of fencing tokens is that safety does not depend on the lock holder behaving correctly. Even if a client is buggy, paused, or partitioned, the resource server independently enforces ordering. This is what makes the approach robust: it does not require the client to check whether its lock is still valid before acting. The server-side check is the last line of defense and the only one that matters for correctness.

This design reflects a general principle in distributed systems: place safety enforcement at the point of state mutation, not at the point of intent. Clients can be arbitrarily delayed or confused. The storage layer, which actually commits changes, is the right place to enforce invariants.

Not all storage systems support fencing tokens natively. If your storage backend is a relational database, you can implement fencing using conditional writes (e.g., UPDATE ... SET value = X, token = 34 WHERE token < 34). This is effectively a per-row fencing check, and it has the advantage of being atomic with the write itself, avoiding the separate durability concern described above. If the storage backend is an object store or a third-party API that does not support conditional writes, implementing fencing becomes much harder, and in some cases impossible without a proxy layer.

Fencing Tokens vs. Other Approaches

Lock Extension (Heartbeating)

Some systems allow lock holders to periodically extend their lease. This reduces the window in which a stale client might act, but it does not eliminate it. A sufficiently long pause can still cause the client to miss its heartbeat window. Heartbeating is a performance optimization, not a safety mechanism.

Epoch-Based Approaches

Systems like Raft and ZAB use epoch or term numbers that serve a conceptually similar purpose. A leader is only valid during its epoch. If a stale leader from a previous epoch sends messages, followers reject them because the epoch is outdated. Fencing tokens are the application-level analog of this pattern.

CAS (Compare-And-Swap)

Compare-and-swap operations at the storage layer can provide similar safety guarantees. The client reads the current version of the data, performs its computation, and writes back only if the version has not changed. In fact, the conditional-write implementation of fencing described above is essentially a CAS on the token value. The distinction is one of framing: CAS is typically described as protecting a specific data item against concurrent modification by any writer, while fencing tokens are framed as protecting against a specific stale lock holder. In practice, the two approaches are complementary and often combined.

Limitations and Practical Considerations

Fencing tokens require cooperation from the resource server. If the resource does not check tokens, the mechanism provides no safety. This is a real constraint in systems where the "resource" is an external service you do not control.

Token monotonicity must be guaranteed by the lock service. If the lock service itself can issue duplicate or out-of-order tokens (due to split-brain, for example), the fencing guarantee breaks down. In practice, this means the lock service should use a consensus-backed counter or a sequentially-allocated identifier. ZooKeeper's sequential ephemeral nodes provide a convenient way to generate monotonically increasing lock tokens, since ZooKeeper's sequential node counter is maintained by a consensus-backed log. Note that ZooKeeper's internal zxid (transaction ID) is a different concept — it tracks ZooKeeper's own internal state changes, not lock acquisitions — and should not be used directly as a fencing token.

There is a subtle issue with token gaps. If Process A receives token 33 and Process C receives token 35 (skipping 34), the system still works correctly. The only requirement is strict monotonicity, not contiguity.

Fencing tokens do not help with the "efficiency" use case for locks. If you use a lock purely to avoid redundant work (and duplicate work is merely wasteful, not incorrect), fencing is unnecessary overhead. Fencing matters only when correctness depends on mutual exclusion.

Finally, fencing tokens address write safety but not read safety. A stale client might still read data that it interprets incorrectly because its lock has expired. Protecting reads typically requires versioned reads or snapshot isolation at the storage layer.

Key Points

Distributed locks based on leases cannot guarantee mutual exclusion in asynchronous systems because process pauses and network delays are unbounded.
A fencing token is a monotonically increasing integer issued with each lock acquisition, presented to the resource server on every protected operation.
The resource server rejects any request carrying a token less than or equal to the highest token it has already processed.
The resource server's highest_token must be stored durably; a restart that resets it to zero would allow stale tokens to be accepted again.
Safety enforcement belongs at the resource server (point of state mutation), not at the client (point of intent), because clients can be arbitrarily delayed.
Fencing tokens are the application-level equivalent of epoch/term numbers used in consensus protocols like Raft and ZAB.
The lock service must use a consensus-backed mechanism to guarantee strict token monotonicity; ZooKeeper sequential nodes are suitable, but ZooKeeper's internal zxid is not the same thing and should not be used directly as a fencing token.
Fencing tokens only matter when correctness requires mutual exclusion; for "efficiency only" locks where duplicate work is merely wasteful, they are unnecessary.

References

Martin Kleppmann. "How to do distributed locking." martin.kleppmann.com, 2016.

Martin Kleppmann. Designing Data-Intensive Applications. O'Reilly Media, 2017. Chapter 8: The Trouble with Distributed Systems.

Cary G. Gray and David R. Cheriton. "Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency." Proceedings of the 12th ACM Symposium on Operating Systems Principles (SOSP), 1989.

Tushar Chandra, Robert Griesemer, and Joshua Redstone. "Paxos Made Live: An Engineering Perspective." Proceedings of the 26th ACM Symposium on Principles of Distributed Computing (PODC), 2007.

Patrick Hunt, Mahadev Konar, Flavio Junqueira, and Benjamin Reed. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." Proceedings of the USENIX Annual Technical Conference, 2010.