MVCC Implementation: Timestamps and Version Chains

Multi-Version Concurrency Control (MVCC) is the dominant concurrency control mechanism in modern database systems.
Rather than blocking readers while writers hold locks, MVCC maintains multiple physical versions of each data item, allowing transactions to read a consistent snapshot without acquiring read locks.
The core implementation challenge lies in how systems assign timestamps, organize version chains, and determine which version is visible to a given transaction.
This article examines these mechanisms in detail.

Why MVCC Exists

Traditional two-phase locking (2PL) enforces serializability by making readers and writers block each other.
This creates contention under read-heavy workloads, which dominate most OLTP systems.
MVCC breaks this dependency: writers create new versions instead of modifying data in place, and readers access the version that was current at their snapshot time.
The result is that reads never block writes, and writes never block reads.

This benefit comes at a cost.
The system must maintain version metadata, garbage-collect obsolete versions, and resolve which version each transaction should see.
These problems are solved through timestamp assignment and version chain management.

Timestamp Assignment

Every MVCC implementation must assign some form of ordering identifier to transactions.
This identifier determines the logical point in time at which a transaction "sees" the database.
There are several approaches.

Monotonic Counter

The simplest scheme uses a globally incrementing counter.
Each transaction receives a unique identifier at begin time (for snapshot isolation) or at commit time (for serializable snapshot isolation).
PostgreSQL uses a 32-bit transaction ID (XID) drawn from a global counter with special wraparound logic.
Visibility is determined not by comparing wall-clock timestamps but by comparing XIDs: a tuple version is visible to a transaction if its creating XID committed before the reader's snapshot, and its deleting XID (if any) had not yet committed.
This XID-based snapshot mechanism differs from wall-clock or HLC timestamp schemes but serves the same logical purpose.

The advantage of monotonic counters is simplicity.
The disadvantage is that a global counter becomes a contention point under high concurrency — every transaction start or commit requires an atomic fetch-and-increment on a shared cache line.

Hybrid Logical Clocks (HLC)

Distributed systems cannot rely on a single global counter.
HLCs combine a physical clock component with a logical counter, producing timestamps that respect causality while staying close to wall-clock time.
CockroachDB uses HLCs to assign MVCC timestamps across nodes without centralized coordination.

Commit Timestamps vs. Begin Timestamps

diagram-1 — Begin- vs commit-timestamp transactions: snapshot read vs validated commit

A critical design choice is when the snapshot identifier is assigned.
Under begin-timestamp schemes (used by PostgreSQL, where the snapshot is fixed at transaction start via the XID), the transaction reads data as of the moment it began.
Under commit-timestamp schemes (used by Hekaton, SQL Server's in-memory engine), the transaction receives its final ordering position at commit time.
Commit-timestamp ordering simplifies serializability validation but requires the system to defer visibility decisions until after commit.

Note: MySQL/InnoDB opens a "read view" at the start of a consistent read (similar to a begin-timestamp), but reconstructs old versions by applying undo records backward from the current heap version.
This hybrid approach shares properties of both schemes.

Version Chain Structures

diagram-2 — Version chain layouts: newest-to-oldest vs oldest-to-newest, chain heads

When a tuple is updated, the system must store both the old and new versions and link them so that readers can traverse to the correct version.
Two primary layouts exist.

Newest-to-Oldest (N2O)

The head of the chain points to the most recent version.
Each version contains a back-pointer to the previous version.
This is the layout used by Hekaton and MySQL/InnoDB (via undo logs).

When a transaction with snapshot timestamp T reads a tuple, it starts at the head and walks backward until it finds the first version whose commit timestamp is less than or equal to T.
For workloads where most reads target recent data, N2O is efficient because the traversal terminates quickly.

Oldest-to-Newest (O2N)

The head of the chain points to the original version, and each version has a forward pointer to its successor.
PostgreSQL stores updated rows as new heap tuples, and the old tuple's t_ctid field points forward to the new one.
However, PostgreSQL's visibility algorithm does not rely primarily on pointer-chasing along this chain.
Instead, each heap tuple carries xmin (the XID that inserted it) and xmax (the XID that deleted or updated it), and the reader checks these fields directly against its snapshot to determine visibility.
The t_ctid forward pointer is used mainly for index-to-heap resolution after an update, not as the primary traversal mechanism.

O2N benefits long-running analytical queries that read old snapshots, since the version they need is at or near the head.
However, short transactions reading current data must locate the latest visible version.

Append-Only vs. Delta Storage

In append-only storage, each version is a complete copy of the tuple.
In delta storage, the system stores only the changed columns as a diff.
Applying deltas to reconstruct a version adds CPU cost but reduces storage overhead.
Oracle and MySQL store undo records (deltas) in a separate undo tablespace.
PostgreSQL stores full tuple copies in the main heap.

Walkthrough

The following walkthrough illustrates a read operation under an N2O version chain with begin-timestamp snapshot isolation.

Setup

Tuple X has three versions: V1 (ts=10), V2 (ts=25), V3 (ts=40).
The version chain head points to V3.
Transaction T_read has snapshot timestamp T_s = 30.

Steps

function ReadVersion(chain_head, T_s):
    current = chain_head          // Start at newest version (V3, ts=40)
    
    while current is not null:
        if current.commit_ts <= T_s and current.status == COMMITTED:
            return current        // This version is visible
        
        if current.status == ABORTED:
            current = current.prev
            continue              // Skip aborted versions
        
        if current.status == IN_PROGRESS:
            // Another transaction created this version but hasn't committed.
            // Under snapshot isolation, this version is invisible.
            current = current.prev
            continue
        
        current = current.prev    // Version too new; keep walking
    
    return NOT_FOUND              // Tuple did not exist at T_s

Execution Trace

current = V3 (ts=40). Since 40 > 30, this version is too new. Move to V3.prev.
current = V2 (ts=25). Since 25 <= 30 and status is COMMITTED, return V2.

Transaction T_read sees the value written by the transaction that committed at timestamp 25, which is the correct snapshot.

Visibility with Commit-Timestamp Ordering

Under commit-timestamp schemes (like Hekaton), the visibility check differs.
A transaction does not know its own commit timestamp until it commits, so reads during execution use a tentative begin timestamp.
At commit time, the system validates that no concurrent transaction has written a version that postdates the version that was read.
If validation fails, the transaction aborts.

function ValidateReadSet(T):
    for each (tuple, version_read) in T.read_set:
        latest = GetLatestCommittedVersion(tuple)
        // Abort if a newer committed version exists that was committed
        // after the version we read — meaning our read is now stale.
        if latest != version_read and latest.commit_ts > version_read.commit_ts:
            ABORT(T)
    COMMIT(T)

Garbage Collection

Old versions that are no longer visible to any active transaction are garbage.
Failing to collect them causes unbounded storage growth and longer chain traversals.
There are three common strategies.

Transaction-level GC: Each transaction, during its normal traversal, marks unreachable versions for reclamation.
PostgreSQL's VACUUM process is a background variant of this approach.

Epoch-based GC: The system tracks the oldest active snapshot timestamp (T_min).
Any version whose successor was committed before T_min is unreachable and can be freed.
Hekaton uses cooperative epoch-based reclamation, where worker threads contribute to GC during idle cycles.

Version-chain pruning: During an update, the system opportunistically removes intermediate dead versions from the chain, shortening future traversals.
This is common in delta-storage systems where undo records can be discarded once no snapshot references them.

A system that falls behind on garbage collection degrades in both space and read latency.
Long-running transactions are particularly harmful because they pin T_min at a low value, preventing reclamation of versions that are dead to all other transactions.
This is a well-known operational issue in PostgreSQL (table bloat) and Oracle (ORA-01555, "snapshot too old," when undo is aggressively reclaimed).

Trade-offs Between Layouts

Property	N2O (Hekaton, InnoDB)	O2N (PostgreSQL)
Short txn read cost	Low (head is newest)	Requires locating latest visible version
Long txn read cost	High (must traverse back)	Lower (older versions near head)
Write cost	Prepend to chain	Append new heap tuple, update old heap tuple pointer
Index maintenance	Index points to chain head (stable)	Index may need update if tuple moves (HOT optimization keeps index entries pointing to the original slot when only heap tuples change, avoiding index updates)
GC complexity	Tail pruning	Head pruning or full VACUUM

The N2O layout generally favors OLTP workloads.
O2N can be advantageous for mixed workloads with long-running analytical snapshots, though in practice PostgreSQL's design was motivated more by its process-per-connection architecture, and heap-based storage than by an explicit optimization for long reads.

Key Points

MVCC eliminates read-write contention by maintaining multiple tuple versions, allowing each transaction to read a consistent snapshot without locks.
Timestamp assignment (global counter, HLC, or hardware clocks) determines transaction ordering and is a critical scalability bottleneck in high-throughput systems.
PostgreSQL uses XID-based snapshot isolation, where visibility is determined by comparing transaction IDs and their commit status, not wall-clock timestamps.
The choice between begin-timestamp and commit-timestamp snapshot acquisition affects both the visibility algorithm and the validation logic required for serializability.
Version chains can be organized newest-to-oldest or oldest-to-newest, with each layout favoring different workload profiles in terms of read traversal cost.
Versions can be stored as full tuple copies (append-only) or as deltas, trading CPU reconstruction cost against storage efficiency.
Garbage collection of obsolete versions is essential to prevent unbounded storage growth and chain-traversal degradation, and its effectiveness is directly tied to the oldest active snapshot.
Long-running transactions degrade MVCC performance system-wide by pinning the garbage collection watermark and preventing version reclamation.

References

Reed, D.P. "Naming and Synchronization in a Decentralized Computer System." MIT PhD Dissertation, 1978.

Berenson, H., Bernstein, P., Gray, J., Melton, J., O'Neil, E., O'Neil, P. "A Critique of ANSI SQL Isolation Levels." Proceedings of the ACM SIGMOD International Conference on Management of Data, 1995.

Larson, P., Blanas, S., Diaconu, C., Freedman, C., Patel, J., Zwilling, M. "High-Performance Concurrency Control Mechanisms for Main-Memory Databases." Proceedings of the VLDB Endowment, Vol. 5, No. 4, 2011.

Wu, Y., Arulraj, J., Lin, J., Xian, R., Pavlo, A. "An Empirical Evaluation of In-Memory Multi-Version Concurrency Control." Proceedings of the VLDB Endowment, Vol. 10, No. 7, 2017.