Sizing memory for a database system is one of the most consequential capacity planning decisions an engineer makes.
Too little RAM leads to excessive disk I/O and poor latency.
Too much wastes money and may mask architectural problems.
Memory advisor algorithms attempt to answer a deceptively simple question: given a workload, what is the optimal amount of memory to allocate to the buffer pool, sort areas, hash join buffers, and cache?
This article examines the algorithmic foundations behind memory advisors, covering working set estimation, miss-rate curves, and the marginal-benefit optimization frameworks used by systems like Oracle's Automatic Memory Management, DB2's Self-Tuning Memory Manager (STMM), and MySQL's InnoDB buffer pool sizing heuristics.
The Core Problem: Miss-Rate Curves
The fundamental data structure behind memory advisory is the miss-rate curve (MRC), sometimes called the miss-ratio curve.
An MRC maps buffer pool size to the expected cache miss rate for a given workload.
If you can construct an accurate MRC, memory sizing reduces to finding the knee of the curve or optimizing a cost function that balances memory cost against I/O cost.
Formally, for a workload trace of page references, the miss rate at cache size $c$ is:
$$MR(c) = \frac{\text{number of accesses that miss at size } c}{\text{total accesses}}$$
The challenge is that naively computing this requires simulating the cache replacement policy at every possible size, which is expensive.
The key insight enabling practical advisors is that for stack-based replacement algorithms (LRU and compatible variants), the miss rate at size $c$ can be derived from a single pass through the reference trace using a reuse distance histogram.
Reuse Distance and Stack Algorithms
The reuse distance (also called LRU stack distance) of a memory reference is the number of distinct pages accessed between two consecutive accesses to the same page.
For an LRU cache of size $c$, a reference is a hit if and only if its reuse distance is less than $c$.
This property, formalized by Mattson et al. (1970) as the inclusion property, states that a cache of size $k$ always holds a subset of the pages held by a cache of size $k+1$ under the same reference trace.
LRU satisfies this property; policies such as LFU or MRU generally do not.
LFU fails because eviction decisions depend on global frequency counts rather than recency, so the set of pages held at size $k$ is not guaranteed to be a subset of those held at size $k+1$ under the same trace.
MRU fails because it evicts the most recently used page, which produces non-monotone inclusion behavior on typical workloads.
For inclusion-property algorithms, a single pass through the trace can produce a histogram of reuse distances, and the miss rate at any cache size can be computed by summing the histogram bins for distances $\geq c$.
Exact reuse distance computation can be performed in $O(N \log M)$ time using an augmented order-statistic tree (e.g., a balanced BST augmented with subtree sizes to support rank queries), where $N$ is the trace length and $M$ is the number of distinct pages.
At each access, the algorithm looks up the page's current position in the LRU stack, computes its rank (number of distinct pages above it), records that rank as the reuse distance, and moves the page to the top.
Space consumption is $O(M)$.
Note: a Fenwick tree can be adapted for this purpose by maintaining a bit-array over LRU positions and using prefix-sum queries for rank computation, but the position-to-page mapping requires additional bookkeeping; a self-balancing BST with order-statistic augmentation (such as a splay tree) is often simpler to reason about.
For databases with millions of buffer pages and billions of accesses, even this log-linear cost is often prohibitive in production, motivating the approximate approaches below.
Approximate Approaches
Several approximation techniques make online MRC estimation tractable:
-
Counter stacks (Wires et al., 2014): Use probabilistic counters (HyperLogLog sketches) arranged in a hierarchy of stack levels to approximate reuse distances. The key advantage is sublinear space: rather than tracking every distinct page, each level maintains a sketch, achieving $O(\log^2 M)$ space per level with $O(\log M)$ levels, for a total space cost far below the $O(M)$ of the exact method. Per-access processing cost remains $O(\log M)$ amortized.
-
SHARDS (Waldspurger et al., 2015): Spatially hashed sampling that reduces the effective trace size while preserving the shape of the MRC. A fixed hash threshold $T$ (relative to the hash space size) is applied to each page identifier; only references whose hash falls below $T$ are processed. The sampling rate is $T / H_{max}$, where $H_{max}$ is the maximum hash value, and miss-rate estimates are scaled accordingly.
-
AET (Hu et al., 2016): Average Eviction Time transforms the problem into estimating the average time a page survives in cache at a given size, avoiding explicit reuse distance tracking.
Algorithm
The following walkthrough describes a practical memory advisor algorithm modeled on the approach used in Oracle's V$DB_CACHE_ADVICE and similar systems.
Step 1: Instrument the Buffer Pool
Maintain a shadow LRU stack (or an approximate structure like a counter stack).
Every buffer pool access records the reuse distance of the accessed page.
Step 2: Build the Reuse Distance Histogram
Accumulate reuse distances into a histogram with logarithmically spaced bins.
Cold misses (first access to a page) go into a special "infinity" bin.
function UPDATE_HISTOGRAM(page_id, lru_stack, histogram):
if page_id not in lru_stack:
histogram[INFINITY] += 1
lru_stack.push_to_top(page_id)
else:
distance = lru_stack.depth(page_id)
bin = floor(log2(distance))
histogram[bin] += 1
lru_stack.move_to_top(page_id)
Step 3: Derive the Miss-Rate Curve
Convert the histogram into a cumulative miss-rate curve using a suffix sum.
For each candidate cache size $c$, the miss rate is the fraction of accesses with reuse distance $\geq c$.
The loop below iterates bins in descending order (largest distances first, INFINITY last in ascending order means first in reversed order) to build the suffix count correctly:
function COMPUTE_MRC(histogram, total_accesses):
// sorted_bins: ascending numeric order, with INFINITY appended at the end
sorted_bins = sorted(histogram.keys()) // e.g., [0, 1, 2, ..., k, INFINITY]
// Build suffix sum by iterating in reverse (INFINITY first):
// suffix_count[bin] = total accesses with reuse distance >= lower bound of bin
suffix_count = {}
running = 0
for bin in reversed(sorted_bins): // INFINITY, k, ..., 1, 0
running += histogram[bin]
suffix_count[bin] = running
// Map each bin's lower-bound cache size to its miss rate
mrc = {}
for bin in sorted_bins:
if bin == INFINITY:
continue
size = 2^bin // lower bound of bin
mrc[size] = suffix_count[bin] / total_accesses
return mrc
Step 4: Compute Marginal Benefit
For each memory consumer (buffer pool, sort area, and PGA/UGA caches), compute the marginal benefit of adding one unit of memory.
The marginal benefit at size $c$ is the reduction in physical I/O per unit of additional memory:
function MARGINAL_BENEFIT(mrc, io_cost, total_accesses, c, delta):
// io_cost: average cost of a physical I/O in time units
miss_reduction = mrc[c] - mrc[c + delta]
return (miss_reduction * total_accesses * io_cost) / delta
Step 5: Global Optimization via Marginal Benefit Equalization
When multiple memory consumers compete for a fixed memory budget, the optimal allocation equalizes marginal benefit across all consumers.
This is a classic resource allocation problem.
Because miss-rate curves are monotonically non-increasing and convex (equivalently, hit-rate curves are concave), marginal benefit is non-increasing — exhibiting diminishing returns.
When the marginal benefit functions derived from the MRCs are indeed concave (which holds for well-behaved, stationary workloads but may be violated for noisy or approximated MRCs), the objective function is separable and concave, which guarantees that a greedy allocation is globally optimal:
function ALLOCATE_MEMORY(consumers, total_memory, delta):
// Initialize each consumer to minimum viable size
for each consumer c:
c.alloc = c.min_size
remaining = total_memory - sum(c.alloc for c in consumers)
while remaining >= delta:
// Find consumer with highest marginal benefit
best = argmax(c: MARGINAL_BENEFIT(c.mrc, c.io_cost, c.total_accesses, c.alloc, delta)
for c in consumers
where c.alloc + delta <= c.max_size)
if best is None:
break
best.alloc += delta
remaining -= delta
return {c.name: c.alloc for c in consumers}
The greedy approach terminates with the globally optimal allocation under the concavity assumption.
In practice, if approximated MRCs introduce non-concavities, a small smoothing pass over the marginal benefit sequence is advisable before running allocation.
Non-LRU Policies and Practical Complications
Real database buffer pools rarely use pure LRU.
InnoDB uses a modified LRU with a "young" and "old" sublist.
PostgreSQL uses a clock-sweep algorithm with per-buffer usage counts — a second-chance CLOCK variant.
This is distinct from the published CLOCK-Pro algorithm and should not be conflated with it.
Oracle uses a touch-count based scheme.
These are not stack algorithms, so the Mattson inclusion property does not hold exactly.
Practical advisors handle this in several ways:
-
Simulation at sampled sizes. Rather than deriving the full MRC from reuse distances, run lightweight simulations at 5–10 discrete cache sizes. Oracle's advisory views work this way, maintaining ghost entries (metadata-only markers for evicted pages) at several hypothetical pool sizes.
-
Workload-aware correction factors. Sequential scan patterns, prefetch behavior, and multi-version concurrency control (MVCC) bloat all distort the relationship between reuse distance and actual miss rate. Advisors apply heuristics to filter or discount sequential scan traffic.
-
Temporal decay. Workloads change. Advisors typically use exponentially weighted moving averages (EWMA) over recent sampling windows rather than accumulating statistics from startup.
Beyond the Buffer Pool
Memory advisors must also account for non-buffer-pool memory consumers:
-
Sort and hash areas. When a sort or hash join spills to disk, performance degrades sharply. The advisory here is often threshold-based: estimate the working set of in-flight queries and ensure enough memory to avoid spills for the p95 case.
-
Lock tables and metadata caches. These are typically small but cause catastrophic performance cliffs when undersized. The advisory is usually a function of concurrency level and schema complexity, not workload trace analysis.
-
Connection-level memory. Each database connection consumes memory for session state, parse trees, and result buffers. With connection pooling, the advisor must model the expected concurrency, not the configured maximum.
DB2's STMM uses a control-theoretic approach: it periodically (every few minutes) adjusts memory allocations across these consumers, treating each one as a control loop with the miss rate or spill rate as the feedback signal.
The controller applies proportional adjustments bounded by a maximum step size to prevent oscillation.
When Advisors Fail
Memory advisors produce misleading results in several scenarios:
-
Phase-changing workloads. If OLTP runs during the day and batch analytics runs at night, an MRC computed over a mixed window recommends a size that is wrong for both phases.
-
Correlated access patterns. MRC-based analysis assumes statistical stationarity. Workloads with strong temporal correlations (e.g., hot partition rotations) violate this assumption.
-
Prefetch and readahead. Physical I/O is not proportional to miss rate when the storage subsystem does large sequential reads. Reducing the miss rate from 2% to 1% may not halve I/O if most misses are already coalesced into sequential prefetches.
-
Flash and tiered storage. When the I/O cost function is non-uniform (some misses go to NVMe, others to spinning disk), the simple marginal-benefit model needs a per-tier cost weighting.
Key Points
- Miss-rate curves are the foundational data structure for memory advisory, mapping cache size to expected miss rate for a given workload.
- The Mattson inclusion property enables efficient MRC construction for LRU and compatible stack-based policies via a single-pass reuse distance histogram. LFU does not satisfy this property because its eviction decisions depend on global frequency rather than recency; MRU does not satisfy it because evicting the most recently used page produces non-monotone inclusion behavior.
- Exact reuse distance computation runs in $O(N \log M)$ time with an augmented order-statistic tree; approximate algorithms (SHARDS, counter stacks, and AET) reduce memory and processing overhead for production-scale online use.
- Optimal memory allocation across competing consumers (buffer pool, sort areas, and caches) is achieved by equalizing marginal I/O benefit per unit of memory; correctness follows from the concavity of hit-rate curves, which holds for well-behaved stationary workloads.
- Real buffer pools use non-LRU policies (InnoDB's young/old sublist, PostgreSQL's CLOCK-based sweep, and Oracle's touch-count scheme), requiring simulation-based or correction-factor approaches rather than pure stack distance analysis.
- Advisors degrade in accuracy under phase-changing workloads, correlated access patterns, and tiered storage configurations.
- Connection-level and metadata memory should not be ignored, as these can cause sharp performance cliffs even when the buffer pool is well-sized.
References
R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. "Evaluation Techniques for Storage Hierarchies." IBM Systems Journal, 9(2):78–117, 1970.
C. A. Waldspurger, N. Park, A. Garthwaite, and I. Ahmad. "Efficient MRC Construction with SHARDS." Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST), 2015.
J. Wires, S. Ingram, Z. Drudi, N. Harvey, and A. Warfield. "Characterizing Storage Workloads with Counter Stacks." Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
X. Hu, X. Wang, Y. Li, L. Zhou, Y. Luo, C. Ding, S. Jiang, and Z. Wang. "AET: Average Eviction Time, a Simple and Effective MRC Construction Algorithm." Proceedings of ACM SIGMETRICS, 2016.
IBM DB2 Knowledge Center. "Self-Tuning Memory Manager (STMM)." IBM Documentation, 2020.