
MVCC Implementation: Timestamps and Version Chains

April 19, 2026 · 10 min read

MVCC systems use timestamp-ordered visibility rules over linked version chains to provide snapshot consistency without read-write locking, with garbage collection as the critical operational constraint.

Multi-Version Concurrency Control (MVCC) is the dominant concurrency control mechanism in modern database systems.
Rather than forcing readers to block writers or vice versa, MVCC maintains multiple physical versions of each logical tuple, allowing transactions to read a consistent snapshot of the database without acquiring locks on data.
The two core implementation pillars are timestamp ordering (to determine visibility) and version chains (to organize and traverse the physical versions).
Understanding how these components interact is essential for anyone working on storage engines or diagnosing performance at the database layer.

Timestamps and Transaction Ordering

Every MVCC system needs a way to establish a total or partial order over transactions.
This ordering determines which version of a tuple a given transaction should see.
The mechanism for producing these identifiers varies across systems, but the concept is uniform: each transaction receives a timestamp at some defined point in its lifecycle, and this timestamp governs visibility.

Timestamp Assignment Strategies

Diagram 1: Timestamp generation approaches (monotonic counters, hybrid logical clocks, begin vs. commit timestamps), showing where each assigns its timestamp in the transaction lifecycle, how it encodes order and causality, and its main consistency/scalability trade-offs.

There are several practical approaches to generating timestamps:

  • Monotonically increasing counters. A simple atomic counter incremented on each BEGIN. PostgreSQL uses a 32-bit transaction ID (XID) for its core MVCC machinery; comparisons are modulo 2^32, so only about two billion XIDs can be "in the past" at once, and old tuples must be periodically "frozen" to prevent wraparound ambiguity. PostgreSQL 13+ also exposes a 64-bit xid8 type in some contexts, but the underlying storage-level XID remains 32-bit. MySQL/InnoDB uses an internal transaction sequence number (trx_id); its on-disk representation (the DB_TRX_ID field) is 6 bytes (48 bits).
  • Hybrid logical clocks (HLC). Systems like CockroachDB combine a physical (wall-clock) component with a logical counter to preserve causal ordering across nodes while remaining loosely synchronized with real time.
  • Commit timestamps vs. begin timestamps. Some systems assign visibility based on the timestamp at transaction start (snapshot isolation), while others use the commit timestamp (serializable variants). The choice affects both the anomalies permitted and the bookkeeping required.
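The hybrid logical clock mentioned above can be sketched in a few lines. This is a minimal rendering of the general HLC algorithm, not CockroachDB's actual implementation; the class and method names are illustrative. The key property is that timestamps never run behind the local physical clock yet still advance past any timestamp received from another node, preserving causality.

```python
import time

class HybridLogicalClock:
    """Minimal HLC: a (wall_time, logical) pair that never runs behind
    the local physical clock and preserves causal order across nodes."""

    def __init__(self):
        self.wall = 0      # highest physical component seen so far (microseconds)
        self.logical = 0   # tie-breaking logical counter

    def _physical_now(self):
        return int(time.time() * 1_000_000)

    def now(self):
        """Timestamp a local event, e.g. a transaction begin."""
        pt = self._physical_now()
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1          # clock has not advanced: bump counter
        return (self.wall, self.logical)

    def update(self, remote_wall, remote_logical):
        """Merge a timestamp received from another node (receive rule)."""
        pt = self._physical_now()
        prev_wall = self.wall
        self.wall = max(prev_wall, remote_wall, pt)
        if self.wall == prev_wall and self.wall == remote_wall:
            self.logical = max(self.logical, remote_logical) + 1
        elif self.wall == prev_wall:
            self.logical += 1
        elif self.wall == remote_wall:
            self.logical = remote_logical + 1
        else:
            self.logical = 0           # fresh physical time dominates
        return (self.wall, self.logical)
```

Comparing (wall, logical) pairs lexicographically yields a causally consistent order even when node clocks are skewed by less than the message delay.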

A transaction T with begin-timestamp Tbegin operating under snapshot isolation should see exactly those versions committed by transactions with commit-timestamp less than Tbegin.
This is the fundamental visibility rule, though the exact boundary condition (strict less-than vs. less-than-or-equal) is implementation-specific and tied to how the snapshot high-water mark is defined.

Visibility Check

For a given tuple version created by transaction Tcreate and optionally deleted or superseded by transaction Tdelete, transaction T sees the version if and only if:

  1. Tcreate committed before T's snapshot,
  2. Tdelete either does not exist, has not committed, or committed after T's snapshot.

In PostgreSQL's terminology, this involves checking xmin (the creating transaction) and xmax (the deleting/updating transaction) against the transaction's snapshot, which includes both the snapshot high-water mark (xmax) and the set of in-progress transaction IDs at snapshot time.
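The two-part rule can be expressed as a small function. This is a deliberately simplified sketch in the spirit of PostgreSQL's xmin/xmax check; the parameter names are illustrative, and real implementations also handle the reader's own uncommitted writes, subtransactions, and per-statement command IDs.

```python
def tuple_visible(xmin, xmax, committed, snapshot_xmax, in_progress):
    """Simplified snapshot-visibility check for one tuple version.

    xmin / xmax    -- creating / deleting txn IDs (xmax is None if never deleted)
    committed      -- set of txn IDs known to have committed
    snapshot_xmax  -- snapshot high-water mark: txns >= this are invisible
    in_progress    -- txn IDs that were in flight when the snapshot was taken
    """
    def effects_visible(xid):
        # A transaction's effects count only if it committed, began before
        # the high-water mark, and was not in flight at snapshot time.
        return (xid in committed
                and xid < snapshot_xmax
                and xid not in in_progress)

    if not effects_visible(xmin):
        return False                      # creator invisible: skip this version
    if xmax is None:
        return True                       # never deleted or superseded
    return not effects_visible(xmax)      # invisible deletion: row still lives
```

Note that both conditions from the rule appear: the creator must be visible, and the deleter (if any) must not be.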

Version Chains

Diagram 2: The three major version-chain organizations (oldest-to-newest append-only, newest-to-oldest append-only, delta/undo storage) side by side, showing where new versions live, where the primary index points, and the direction of pointer traversal for reads and updates.

When a tuple is updated, the system must create a new version and link it to the old one.
The data structure connecting these versions is the version chain: a linked sequence of tuple versions for the same logical row.
The three major version chain organizations, as classified by Wu et al. (2017), differ in where new versions are stored, how the chain is ordered, and where the primary index points.

Append-Only Storage (Oldest-to-Newest)

New versions are appended to the same table storage.
The primary index points to the oldest version (the head of the chain), and traversal walks forward through successively newer versions to find the one visible to the requesting transaction.
This is simple to implement but penalizes read-heavy workloads on frequently updated rows: every read must traverse versions from the oldest forward until it finds the appropriate visible version.
Because the index does not require updating on each write (it still points to the original version), write amplification is lower, but long chains impose read latency.

Append-Only Storage (Newest-to-Oldest)

The chain is maintained so that the most recent version is at the head.
The primary index is updated on every write to point to the new head.
Reads of the latest version are fast (single pointer dereference), but every update requires an index update, and long-running transactions reading old snapshots must traverse the chain backward through newer versions they cannot see.
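The write path under this organization is short. The sketch below uses hypothetical Version and index structures to show the three steps an update performs: create the new head, mark the old head superseded, and repoint the index.

```python
class Version:
    """One physical version of a logical tuple (illustrative)."""
    def __init__(self, value, created_by):
        self.value = value
        self.created_by = created_by
        self.deleted_by = None
        self.next = None   # pointer toward the next-older version

def update(index, tuple_id, new_value, txn_id):
    """Install a new version at the head of a newest-to-oldest chain."""
    old_head = index.get(tuple_id)     # current newest version, or None
    new_head = Version(new_value, txn_id)
    new_head.next = old_head           # link newest -> older
    if old_head is not None:
        old_head.deleted_by = txn_id   # mark the old version superseded
    index[tuple_id] = new_head         # the index always tracks the head
    return new_head
```

The final line is the cost noted above: every update rewrites the index entry, whereas the oldest-to-newest scheme leaves it untouched.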

Delta Storage

Instead of storing full tuple copies, the system stores a compact "delta" (before-image or after-image) and keeps the master version in place.
MySQL/InnoDB uses this approach: the current version lives in the clustered index (primary B-tree), and old versions are reconstructed by applying undo log records backward.
This is efficient for workloads where most reads target the latest version, but version reconstruction for old snapshots becomes expensive proportional to chain length.
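Version reconstruction under delta storage looks roughly like the following. This is a sketch in the spirit of undo-based reconstruction, not InnoDB's actual undo record format; the field names and the trx_ts commit-timestamp field are illustrative.

```python
def reconstruct(current_row, undo_chain, snapshot_ts):
    """Walk an undo chain backward until the row matches the snapshot.

    current_row -- dict of the in-place (newest) column values plus
                   'trx_ts': commit timestamp of the writing transaction
    undo_chain  -- newest-to-oldest list of undo records, each holding the
                   before-image of changed columns and the prior 'trx_ts'
    """
    row = dict(current_row)
    for undo in undo_chain:
        if row["trx_ts"] < snapshot_ts:
            break                         # this version is visible: done
        row.update(undo["before"])        # roll back one change
        row["trx_ts"] = undo["trx_ts"]    # now positioned at the prior version
    if row["trx_ts"] < snapshot_ts:
        return row
    return None                           # no version old enough exists
```

The loop makes the cost model explicit: reconstruction work grows linearly with the number of undo records between the current version and the snapshot.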

A Note on PostgreSQL's Heap

PostgreSQL's version storage does not fit cleanly into the Wu et al. taxonomy.
PostgreSQL writes each new tuple version as a new heap tuple (append-only), storing both old and new versions in the heap.
Indexes contain pointers (ctids) to specific heap tuple locations.
Within a single heap page, a Heap-Only Tuple (HOT) chain may link versions without requiring an index update, keeping the index pointing to the original tuple location and chaining versions via ctid pointers within that page.
HOT chains are strictly intra-page; cross-page version access does not use an equivalent linked-list mechanism.
Readers evaluate xmin/xmax fields on each candidate tuple to determine visibility.
This design simplifies writes at the cost of vacuum overhead and potential read-time work to locate the correct version.

Walkthrough

The following walkthrough illustrates how an update creates a new version and how a concurrent reader resolves visibility under a newest-to-oldest append-only scheme.

Step-by-Step: Update and Read

Diagram 3: The update/read sequence for this walkthrough: the updater creates V2, sets deleted_by on V1, and updates the index pointer; the concurrent reader then traverses the chain, applying the visibility checks (creator/deleter committed? commit_ts < snapshot? membership in active_txns) at each decision point to determine which version is returned.

Initial state:
  Tuple X has one version: V1 {value=100, created_by=T1(committed), deleted_by=NULL}
  Index entry for X points to V1.

Transaction T10 (begin_ts=10) starts.
Transaction T15 (begin_ts=15) starts.

Step 1: T15 updates X to value=200.
  - System creates V2 {value=200, created_by=T15, deleted_by=NULL}
  - V1 is updated: deleted_by = T15
  - V2.next = V1  (V2 is now the chain head)
  - Index entry for X updated to point to V2.
  - Chain: [V2 (val=200, T15)] -> [V1 (val=100, T1)]

Step 2: T15 commits (commit_ts=16).

Step 3: T10 reads X.
  - T10's snapshot includes transactions committed before ts=10.
  - Follow index to chain head: V2.
  - Visibility check on V2:
      created_by=T15, commit_ts=16. Is 16 < 10? No.
      V2 is NOT visible to T10.
  - Follow V2.next to V1.
  - Visibility check on V1:
      created_by=T1 (committed, assume commit_ts=2). Is 2 < 10? Yes.
      deleted_by=T15 (committed at ts=16). Is 16 < 10? No.
      V1 IS visible to T10 (created before snapshot, not yet deleted in snapshot).
  - T10 reads value=100.

Step 4: New transaction T20 (begin_ts=20) reads X.
  - Follow index to V2.
  - created_by=T15, commit_ts=16. Is 16 < 20? Yes.
  - deleted_by=NULL.
  - V2 IS visible. T20 reads value=200.

This demonstrates how two transactions with different snapshots see different versions of the same logical row without any locking.

Pseudocode: Version Chain Traversal

function read(tuple_id, snapshot_ts, active_txns):
    version = index.lookup(tuple_id)    // head of chain
    while version != NULL:
        creator_committed = is_committed(version.created_by)
        creator_visible = creator_committed
                          AND commit_ts(version.created_by) < snapshot_ts
                          AND version.created_by NOT IN active_txns

        if creator_visible:
            deleter = version.deleted_by
            if deleter == NULL:
                return version.value
            deleter_committed = is_committed(deleter)
            deleter_visible = deleter_committed
                              AND commit_ts(deleter) < snapshot_ts
                              AND deleter NOT IN active_txns
            if NOT deleter_visible:
                return version.value
        version = version.next
    return NOT_FOUND   // tuple does not exist in this snapshot

The active_txns set is critical.
A transaction that started before our snapshot but has not yet committed must be treated as invisible.
Its writes are excluded not simply because its numeric timestamp is low, but explicitly because its transaction ID appears in the active_txns set — a snapshot-time record of all in-flight transactions.
This is why PostgreSQL's snapshot includes both a high-water mark (xmax) and a list of in-progress XIDs: numeric ordering of XIDs alone is insufficient.
Note that the strict less-than boundary (< snapshot_ts) shown here is illustrative; exact boundary semantics vary by implementation.
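The pseudocode translates directly into runnable Python. The sketch below uses an illustrative Version class and reproduces the walkthrough's final state, so the two reads at the end correspond to Steps 3 and 4.

```python
class Version:
    """One version in a newest-to-oldest chain (illustrative)."""
    def __init__(self, value, created_by, deleted_by=None, older=None):
        self.value = value
        self.created_by = created_by
        self.deleted_by = deleted_by
        self.next = older   # pointer toward the next-older version

def read(version, snapshot_ts, active_txns, commit_ts):
    """Walk the chain from the head; return the first visible value.

    commit_ts   -- dict mapping committed txn id -> commit timestamp
    active_txns -- txn ids that were in flight when the snapshot was taken
    """
    def visible(txn_id):
        return (txn_id in commit_ts                  # committed
                and commit_ts[txn_id] < snapshot_ts  # before our snapshot
                and txn_id not in active_txns)       # not in flight at snapshot

    while version is not None:
        if visible(version.created_by):
            if version.deleted_by is None or not visible(version.deleted_by):
                return version.value
        version = version.next
    return None   # tuple does not exist in this snapshot

# Reproduce the walkthrough: V1 by T1 (commit_ts=2), V2 by T15 (commit_ts=16).
v1 = Version(100, created_by=1, deleted_by=15)
v2 = Version(200, created_by=15, older=v1)     # chain head
commit_ts = {1: 2, 15: 16}

# T10 (snapshot 10) sees V1's value; T20 (snapshot 20) sees V2's.
assert read(v2, 10, set(), commit_ts) == 100
assert read(v2, 20, set(), commit_ts) == 200
```

A snapshot taken before T1 committed (e.g. snapshot_ts=1) finds no visible version at all and returns None, matching the NOT_FOUND branch.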

Garbage Collection

Version chains grow with every update.
Without reclamation, chain traversal degrades and storage bloats.
Garbage collection (GC) identifies and removes versions that are no longer visible to any active transaction.
The oldest active snapshot defines the GC horizon: any version superseded before this horizon can be safely reclaimed.

Three common GC approaches exist:

  • Transaction-level GC. Each transaction tracks the versions it created. Upon determining that no active snapshot can see the old version, the transaction (or a background thread) reclaims it. Used in Hekaton.
  • Tuple-level GC (background). A dedicated vacuum process scans the version store and prunes dead versions. PostgreSQL's VACUUM is the canonical example. It is decoupled from transaction execution but introduces I/O overhead and can fall behind under heavy write loads.
  • Tuple-level GC (cooperative). Readers prune stale versions as they traverse chains. This amortizes GC cost across read operations but adds latency variance to reads.
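The horizon computation and the pruning it enables can be sketched concretely. The names below are illustrative; this shows the invariant all three approaches share: on a newest-to-oldest chain, everything older than the first version visible at the horizon is unreachable by any current or future snapshot.

```python
class Version:
    """One version in a newest-to-oldest chain (illustrative)."""
    def __init__(self, value, created_by, older=None):
        self.value = value
        self.created_by = created_by
        self.next = older

def gc_horizon(active_snapshot_ts, newest_commit_ts):
    """Oldest snapshot any active transaction can read; if nothing is
    active, everything up to the newest commit is reclaimable."""
    return min(active_snapshot_ts) if active_snapshot_ts else newest_commit_ts

def prune(head, horizon, commit_ts):
    """Cut a newest-to-oldest chain after the first version visible at
    the horizon; no snapshot can ever read anything older."""
    version = head
    while version is not None:
        created = commit_ts.get(version.created_by)
        if created is not None and created < horizon:
            version.next = None    # older versions are now unreachable
            break
        version = version.next
    return head
```

This also makes the long-running-transaction problem visible: a single old snapshot_ts in the active set drags the horizon backward and blocks pruning across every chain in the database.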

Failing to keep up with garbage collection is a real operational problem.
PostgreSQL's XID wraparound protection and InnoDB's "history list length" metric both reflect the consequences of deferred version cleanup.

Trade-offs Across Implementations

Aspect          | PostgreSQL                                                                     | MySQL/InnoDB                      | Hekaton (SQL Server)
Version storage | Heap (append-only, new tuple per update)                                       | Undo log (delta)                  | Append-only (newest-to-oldest)
Index points to | Specific heap tuple ctid; HOT chains reuse the index entry within a page       | Current (clustered index) version | Newest version
Chain order     | Versions scattered in heap; HOT chain within a page; no cross-page linked list | Newest-to-oldest (via undo)       | Newest-to-oldest
GC mechanism    | Background VACUUM                                                              | Purge threads                     | Cooperative + background
Timestamp type  | 32-bit XID (64-bit xid8 in some contexts)                                      | Internal trx_id sequence          | Begin + end timestamps

Each design reflects a different bet on the expected workload.
Delta storage (InnoDB) optimizes for reading current data.
PostgreSQL's approach simplifies writes at the cost of read-time heap evaluation and vacuum overhead.
Hekaton targets high-throughput OLTP with latch-free structures.

Key Points

  • MVCC avoids read-write conflicts by maintaining multiple physical versions of each tuple, using timestamps to determine which version each transaction should see.
  • Timestamp assignment strategy (counter, HLC, begin vs. commit) directly affects the consistency guarantees, scalability, and anomaly profile of the system.
  • The three version chain organizations (oldest-to-newest, newest-to-oldest, delta storage) present distinct trade-offs between read performance, write amplification, and index maintenance cost; real systems like PostgreSQL may not map cleanly to any single category.
  • Visibility checks require evaluating both the creating and deleting transaction against the reader's snapshot, including explicit tracking of in-progress transactions — numeric timestamp ordering alone is insufficient.
  • Garbage collection is not optional: without timely reclamation, version chains degrade read latency and consume unbounded storage.
  • Long-running transactions are toxic to MVCC performance because they pin the GC horizon and force retention of old versions across the entire database.
  • Real systems combine these mechanisms with additional optimizations such as index-only visibility maps (PostgreSQL), rollback segment reuse (InnoDB), and epoch-based reclamation (Hekaton).

References

Wu, Y., Arulraj, J., Lin, J., Xian, R., and Pavlo, A. "An Empirical Evaluation of In-Memory Multi-Version Concurrency Control." Proceedings of the VLDB Endowment, Vol. 10, No. 7, 2017.

Berenson, H., Bernstein, P., Gray, J., Melton, J., O'Neil, E., and O'Neil, P. "A Critique of ANSI SQL Isolation Levels." Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, 1995.

Bernstein, P. A. and Goodman, N. "Concurrency Control in Distributed Database Systems." ACM Computing Surveys, Vol. 13, No. 2, 1981.

Larson, P., Blanas, S., Diaconu, C., Freedman, C., Patel, J., and Zwilling, M. "High-Performance Concurrency Control Mechanisms for Main-Memory Databases." Proceedings of the VLDB Endowment, Vol. 5, No. 4, 2011.
