Anti-Caching: Evicting Tuples Instead of Pages

Introduction

Traditional disk-oriented database systems organize their buffer pools around fixed-size pages, typically 4 KB to 16 KB.
When memory pressure increases, the buffer pool manager evicts entire pages to disk, regardless of how many tuples on that page are actually "cold." This granularity mismatch means that hot tuples sharing a page with cold tuples can be evicted unnecessarily, forcing expensive disk fetches on subsequent access.

Anti-caching inverts this model.
Instead of pulling pages from disk into memory on demand, an anti-caching system starts from the assumption that all data resides in memory and selectively evicts individual cold tuples to disk when memory grows scarce.
This tuple-level eviction granularity, first proposed in the context of the H-Store/VoltDB lineage of in-memory databases, allows the system to keep the working set in memory with far greater precision than page-level buffer management.

Motivation: Why Pages Are the Wrong Unit

In a page-oriented buffer pool, eviction decisions operate on pages.
A single page might contain dozens or hundreds of tuples.
Some of those tuples are accessed frequently; others have not been touched in hours.
When the buffer manager selects a page for eviction using LRU or CLOCK, it evicts all tuples on that page indiscriminately.
This leads to two concrete problems:

Collateral eviction. Hot tuples co-located with cold tuples get written to disk and must be fetched back soon after, wasting I/O bandwidth.
Wasted memory. Keeping an entire page resident just because one tuple on it is hot means the cold tuples on that page occupy memory that could be used for genuinely active data.

For workloads with skewed access patterns (which describe most OLTP workloads), these problems are significant.
Zipfian distributions are common; a small fraction of tuples receive the vast majority of accesses.
Anti-caching exploits this skew by evicting precisely the tuples that are cold, keeping memory utilization aligned with actual access frequency.

Architecture

An anti-caching system has several components that differ from a traditional DBMS.

In-Memory Table and Tuple Storage

The primary storage is an in-memory table structure (hash table or tree index) that holds all "hot" tuples.
Each tuple is individually addressable.
There is no concept of a slotted page in the primary path.
Tuples are allocated from a memory pool and referenced directly by pointers in the index.

Eviction Metadata

Each tuple (or each block of tuples) carries metadata for tracking access recency.
A lightweight LRU approximation, such as a chain of timestamps, or a CLOCK-style reference bit, is maintained.
The system periodically scans or samples this metadata to identify eviction candidates.

Block Table on Disk

diagram-3 — Block table mapping of evicted tuples to on-disk blocks

When tuples are evicted, they are serialized into variable-size blocks and written to disk (or an SSD).
A block table maps each evicted tuple's primary key to its on-disk block identifier and offset.
This mapping is kept in memory so that the system can detect, during query processing, whether a requested tuple has been evicted.

Eviction and Merge-Back

Evicted tuples are removed from the in-memory table indexes, but their keys are retained in a special "evicted" marker or tombstone entry.
When a transaction accesses an evicted tuple, the system fetches the entire block containing that tuple from disk, merges the tuples back into the in-memory table, and restarts the transaction.
This merge-back can be done synchronously (blocking the transaction) or asynchronously (aborting and retrying).

Walkthrough

diagram-1 — Eviction and retrieval sequence in anti-caching

The following walkthrough describes the eviction and retrieval cycle in an anti-caching system.

Eviction Phase

procedure EVICT(memory_threshold):
    if current_memory_usage() < memory_threshold:
        return

    candidates = sample_cold_tuples(N)     // N tuples with oldest access times
    sort candidates by table affinity       // group by table for efficient serialization

    for each group G in candidates:
        block = new DiskBlock()
        for each tuple t in G:
            block.append(serialize(t))
            block_table.insert(t.primary_key, block.id, block.offset)
            remove t from in-memory table
            insert eviction_tombstone(t.primary_key) into table index
        write block to disk

    update memory accounting

Retrieval Phase

procedure ACCESS(key):
    entry = index.lookup(key)

    if entry is a regular tuple:
        update access timestamp for entry
        return entry

    if entry is an eviction_tombstone:
        block_id = block_table.lookup(key)
        block = read_block_from_disk(block_id)

        for each tuple t in block:
            merge t back into in-memory table
            remove eviction_tombstone(t.primary_key)
            block_table.remove(t.primary_key)

        abort_and_restart current transaction
        // On restart, the tuple will be found in memory

A few details deserve emphasis.
First, the entire block is merged back, not just the single requested tuple.
This amortizes I/O cost and exploits spatial locality among co-evicted tuples.
Second, the transaction that triggered the fetch is typically aborted and restarted rather than blocked, because in-memory OLTP systems (like H-Store) use single-threaded partition execution where blocking would stall all other work on that partition.

Design Tradeoffs

Eviction Granularity and Block Size

diagram-2 — Tradeoff between block size, I/O efficiency, and eviction precision

The choice of disk block size controls a tradeoff between I/O efficiency and precision.
Larger blocks amortize disk I/O overhead but bring back more potentially unnecessary tuples during merge-back.
Smaller blocks are more precise but increase I/O operations and metadata overhead.
Typical configurations use blocks in the range of 256 KB to 1 MB.

LRU Tracking Overhead

Maintaining per-tuple access timestamps adds overhead to every read and write operation.
In an in-memory system where tuple access takes microseconds, even a few nanoseconds of additional bookkeeping per access can measurably affect throughput.
Approximate tracking (CLOCK bits, sampled timestamps) is preferred over exact LRU chains.

Transaction Abort Cost

When an evicted tuple is accessed, the triggering transaction must be aborted and retried.
For workloads where the fraction of accesses to cold data is small (say, under 5%), this cost is acceptable.
If a significant fraction of transactions touch evicted data, the abort-and-retry overhead becomes dominant, and anti-caching degrades.
The system works best when the in-memory working set fits comfortably, with only a tail of cold data spilling to disk.

Comparison with Traditional Buffer Pool

Property	Page-Oriented Buffer Pool	Anti-Caching
Eviction unit	Fixed-size page	Individual tuple
Primary residence	Disk	Memory
Cold/hot separation	Coarse (page-level)	Fine (tuple-level)
Index on evicted data	B-tree on disk	Tombstone in memory index
Miss penalty	Page fault, synchronous I/O	Transaction abort + block fetch
Best regime	Large data, moderate memory	Data mostly fits in memory

When Anti-Caching Wins (and When It Doesn't)

Anti-caching delivers its biggest advantage when the database is "almost" in-memory: the total data size moderately exceeds available RAM, but the active working set fits.
In this regime, tuple-level eviction keeps memory usage tightly aligned with access patterns, and the system avoids the overhead of a full buffer pool manager.

When data significantly exceeds memory (e.g., 10x), anti-caching becomes less effective.
The block table metadata itself grows large, abort-and-retry costs escalate, and the system increasingly resembles a disk-oriented system with extra overhead.
In that regime, a well-tuned buffer pool with prefetching and large page sizes may perform better.

Anti-caching also assumes OLTP workloads with point lookups and small transactions.
Analytical queries that scan large fractions of a table would trigger massive merge-back operations, overwhelming the system with I/O and transaction restarts.

Relationship to Larger-than-Memory Engines

Anti-caching is one of several approaches to handling data sets that exceed memory in systems originally designed for in-memory operation.
Other approaches include:

Project Siberia (Hekaton): Uses offline classification to identify cold tuples and migrate them to a cold store, with a Bloom filter to avoid unnecessary cold-store lookups.
LeanStore and Umbra: Use pointer swizzling and variable-size pages to blur the line between in-memory and disk-based access, maintaining buffer management but with lower overhead than traditional systems.

Anti-caching is distinguished by its willingness to accept transaction aborts as the mechanism for handling cold data access, which keeps the hot path (all data in memory) completely free of buffer-pool indirection.

Key Points

Anti-caching evicts individual cold tuples to disk rather than entire pages, achieving finer-grained memory management than traditional buffer pools.
Evicted tuples are grouped into disk blocks and tracked via an in-memory block table that maps primary keys to on-disk locations.
Tombstone entries in the in-memory index mark evicted tuples, allowing the system to detect cold accesses at query time without additional lookup structures.
When a transaction touches an evicted tuple, the system fetches the containing block, merges all its tuples back into memory, and aborts/restarts the transaction.
The approach is most effective when data moderately exceeds RAM but the working set fits in memory, typical of skewed OLTP workloads.
LRU tracking at tuple granularity introduces per-access overhead that must be minimized through approximation techniques like CLOCK bits or sampled timestamps.
Anti-caching degrades when data far exceeds memory or when workloads involve large scans, because abort-and-retry costs and block table metadata become dominant.

References

DeBrabant, J., Pavlo, A., Tu, S., Stonebraker, M., and Zdonik, S. "Anti-Caching: A New Approach to Database Management System Architecture." Proceedings of the VLDB Endowment, Vol. 6, No. 14, 2013.

Stonebraker, M., Madden, S., Abadi, D.J., Harizopoulos, S., Hachem, N., and Helland, P. "The End of an Architectural Era (It's Time for a Complete Rewrite)." Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), 2007.

Eldawy, A., Levandoski, J., and Larson, P.-A. "Trekking Through Siberia: Managing Cold Data in a Memory-Optimized Database." Proceedings of the VLDB Endowment, Vol. 7, No. 11, 2014.

Leis, V., Haubenschild, M., Kemper, A., and Neumann, T. "LeanStore: In-Memory Data Management Beyond Main Memory." Proceedings of the 34th IEEE International Conference on Data Engineering (ICDE), 2018.