
B-Tree Internals and Page Splits

April 19, 2026 · 11 min read

Page splits are the critical bottleneck in B-tree write performance, and their cost is managed through split-point selection, concurrency protocols, and separator truncation.

Introduction

The B-tree remains the dominant on-disk index structure in relational databases, key-value stores, and file systems.
Introduced by Bayer and McCreight in 1972, it was designed to minimize disk I/O by maintaining a balanced, wide, and shallow tree.
Despite decades of alternatives (LSM-trees, fractal trees, skip lists), the B-tree persists because its design maps directly onto the fixed-size page abstraction used by storage hardware and operating systems.

Understanding the internal mechanics of B-trees, particularly how nodes are structured and how page splits propagate, is essential for anyone tuning database performance or building storage engines.
Page splits are one of the most expensive operations a B-tree performs, and their behavior directly affects write amplification, space utilization, and tail latency.

B-Tree Structure

A B-tree of order m is a balanced search tree where each internal node contains between ⌈m/2⌉ and m children, and all leaves reside at the same depth.
In practice, database implementations almost universally use the B+-tree variant, where:

  1. Internal nodes store only keys and child pointers. They serve purely as a routing structure.
  2. Leaf nodes store keys and their associated values (or pointers to heap tuples). Leaves are typically linked in a doubly-linked list to support efficient range scans.

Each node corresponds to a fixed-size page on disk, commonly 4 KB, 8 KB, or 16 KB.
InnoDB uses 16 KB pages; PostgreSQL uses 8 KB.
The page size determines the fanout: a wider page holds more keys per node, producing a shallower tree and fewer I/O operations per lookup.
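
As a back-of-the-envelope illustration (the 16-byte entry and 64-byte header sizes below are assumptions; real page formats carry more per-entry overhead):

```python
import math

def estimated_fanout(page_bytes, entry_bytes, header_bytes=64):
    # Rough keys-per-internal-node estimate; sizes are illustrative.
    return (page_bytes - header_bytes) // entry_bytes

def tree_levels(num_keys, fanout):
    # Approximate number of levels, including the leaf level.
    return max(1, math.ceil(math.log(num_keys, fanout)))

# An 8 KB page with 16-byte (key + child pointer) entries holds ~500
# entries, so 500 million keys need a tree only four levels deep.
fanout = estimated_fanout(8192, 16)        # 508
levels = tree_levels(500_000_000, fanout)  # 4
```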

Routing Invariant

In the B+-tree variant used by most databases, internal (routing) nodes hold separator keys that partition the key space among child pointers.
A common convention is: for a separator key k at position i in an internal node, all keys in the subtree rooted at child i are strictly less than k, and all keys in the subtree rooted at child i+1 are greater than or equal to k.
Leaf nodes store the actual data records, and the same key may therefore appear in both a leaf and as a separator in an ancestor internal node.
Implementations vary in their exact ≤ vs. < boundary conventions; the examples in this article follow the convention that the separator promoted during a split is the first (smallest) key of the new right leaf.
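
Under that convention, picking the child to descend into reduces to an upper-bound binary search over the separators. A minimal sketch (child_index is a hypothetical helper, not any engine's API):

```python
from bisect import bisect_right

def child_index(separators, key):
    # Index of the child subtree that may contain `key`, under the
    # convention that child i+1 holds all keys >= separators[i].
    return bisect_right(separators, key)

# Separators [15, 40] partition three children:
# keys < 15 -> child 0, 15 <= keys < 40 -> child 1, keys >= 40 -> child 2.
```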

Page Layout

diagram-1
Diagram 1: Anatomy of a single B-tree page: the header (page ID, level, key count, free-space offset, sibling pointers), the fixed-size item-pointer array, variable-length key-value records growing from the page end toward the pointers, and the free-space gap between them. Arrows show the indirection from item pointers to record bytes and the leaf next/prev sibling links.

A typical B-tree page contains:

  • Page header: metadata including page ID, level in the tree, number of keys, free space offset, and pointers to the next and previous sibling pages (for leaves).
  • Item pointers (line pointers): a fixed-size array of offsets pointing to the actual key-value entries within the page.
  • Key-value entries: variable-length records stored from the end of the page, growing toward the item pointer array.
  • Free space: the gap between the end of the item pointer array and the beginning of the entries.

This indirection through item pointers allows entries to be logically reordered without physically moving data within the page.
When a key is deleted, its item pointer can be marked dead and the space reclaimed lazily during page compaction.
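
The indirection can be sketched as a toy slotted page (the SlottedPage class and its fields are illustrative; real headers carry far more state):

```python
class SlottedPage:
    # Toy slotted page: item pointers grow from the front, record bytes
    # from the back; deletion just marks the pointer dead.
    def __init__(self, size=8192):
        self.size = size
        self.slots = []       # [offset, length, dead?] item pointers
        self.heap_end = size  # records grow downward from the page end

    def free_space(self):
        return self.heap_end - 4 * len(self.slots)  # assume 4-byte slots

    def insert(self, record: bytes):
        if self.free_space() < len(record) + 4:
            raise OverflowError("page full -> split required")
        self.heap_end -= len(record)
        self.slots.append([self.heap_end, len(record), False])
        return len(self.slots) - 1  # the slot index is the stable handle

    def delete(self, slot):
        self.slots[slot][2] = True  # reclaimed lazily by compaction
```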

Search

Searching a B+-tree traverses from root to leaf.
At each internal node, a binary search (or interpolation search) over the sorted keys determines which child pointer to follow.
For a tree with N keys and fanout f, the number of levels (including the leaf level) is approximately ⌈log_f(N)⌉, so the number of page reads per point lookup equals the tree height.
With a fanout of 500 and 500 million keys, the tree height is only 4 levels.
The root and first few levels are almost always cached in the buffer pool, so a point lookup typically requires one or two physical disk reads.

Page Splits

A page split occurs when an insertion targets a node that is already full.
The node must be divided into two nodes, and the parent must be updated to include a pointer to the new node.
If the parent is also full, the split propagates upward, potentially reaching the root and increasing tree height by one.

Why Splits Are Expensive

Page splits involve multiple coordinated writes:

  1. Allocating a new page on disk.
  2. Copying roughly half the entries from the original page to the new page.
  3. Updating the parent node to insert the new separator key and child pointer.
  4. Updating sibling pointers in the leaf linked list.
  5. Writing a WAL (write-ahead log) record to ensure crash recovery can reconstruct or undo the split.

All of these operations must appear atomic to concurrent readers, and most implementations hold a write latch on the splitting page and its parent during the operation.
Some engines (like InnoDB) use optimistic latching: they latch only the leaf initially, and restart with pessimistic root-to-leaf latching if a split is detected.

Walkthrough

The following walkthrough describes a leaf-level page split in a B+-tree.
Assume order m = 4 (each leaf holds at most 4 keys; each internal node holds at most 3 separator keys and 4 child pointers).

The separator convention used here: a separator key k means all keys in the right subtree are ≥ k.
Separator keys in internal nodes are copied up from the first key of the new right leaf.

Initial State

         [15 | 40]                           <- Internal node (root)
        /    |     \
  [5,10,13,14] [15,25,30,35] [40,45,50]      <- Leaf nodes (linked)

Note: The root key 15 is the first key of the middle leaf; 40 is the first key of the rightmost leaf, consistent with the separator convention.

Step-by-step: Insert key 12

diagram-2
Diagram 2: The insert of key 12, step by step: root-to-leaf traversal, detection of the full leaf, allocation of a new page, split-point computation over the merged entries, redistribution into left and right leaves with promoted separator 12, parent and sibling-pointer updates, and the WAL record that commits the split.

Step 1: Traverse to target leaf. Compare 12 against root keys. 12 < 15, so follow the leftmost child pointer.
Arrive at leaf [5, 10, 13, 14].

Step 2: Check for space. The leaf holds 4 keys (maximum capacity for order m = 4).
No room for key 12.
A split is required.

Step 3: Allocate a new leaf page. Request a free page from the page allocator (free list or extent bitmap).

Step 4: Choose the split point. Merge the 4 existing keys with the new key into a temporary sorted array: [5, 10, 12, 13, 14] (5 entries).
Divide at the midpoint: mid = floor(5 / 2) = 2.
The left leaf receives entries [0, mid) = [5, 10]; the right leaf receives entries [mid, end) = [12, 13, 14].
The separator key promoted to the parent is 12 — the first (smallest) key of the new right leaf.

Step 5: Redistribute keys.

Original leaf (left):  [5, 10]
New leaf (right):      [12, 13, 14]

Step 6: Update the parent. Insert separator key 12 into the parent along with a pointer to the new right leaf.
The root previously had keys [15 | 40] and 3 children, and it now has keys [12 | 15 | 40] and 4 children.

        [12 | 15 | 40]
       /    |      |     \
  [5,10] [12,13,14] [15,25,30,35] [40,45,50]

Step 7: Update sibling pointers. The new right leaf's next pointer is set to the former sibling [15,25,30,35].
The original left leaf's next pointer is updated to point to the new right leaf [12,13,14].

Step 8: Write WAL record. A single compound WAL record captures the entire split operation so that crash recovery can replay or undo it atomically.

Pseudocode

def insert(tree, key, value):
    leaf = find_leaf(tree.root, key)

    if leaf.has_space():
        leaf.insert_sorted(key, value)
        leaf.write_to_disk()
        return

    # Split required
    new_leaf = allocate_page()

    # Merge existing entries with the new entry into one sorted run
    entries = leaf.entries + [(key, value)]
    entries.sort(key=lambda kv: kv[0])  # order by key only

    mid = len(entries) // 2  # integer (floor) division
    leaf.entries = entries[0:mid]
    new_leaf.entries = entries[mid:]

    # Update sibling linked list
    new_leaf.next = leaf.next
    new_leaf.prev = leaf
    if leaf.next is not None:
        leaf.next.prev = new_leaf
    leaf.next = new_leaf
    new_leaf.parent = leaf.parent  # shares the old leaf's parent for now

    # Separator key is the first key in the new right leaf
    separator = new_leaf.entries[0][0]

    # Write both leaves (WAL logging of the split is omitted here)
    leaf.write_to_disk()
    new_leaf.write_to_disk()

    # Insert separator into parent (may trigger recursive split)
    insert_into_parent(tree, leaf.parent, separator, new_leaf)


def insert_into_parent(tree, parent, key, right_child):
    if parent is None:
        # Splitting the root: create a new root
        old_root = tree.root
        new_root = allocate_page()
        new_root.keys = [key]
        new_root.children = [old_root, right_child]
        old_root.parent = new_root
        right_child.parent = new_root
        tree.root = new_root  # tree height grows by one
        return

    if parent.has_space():
        parent.insert_sorted(key, right_child)
        parent.write_to_disk()
    else:
        # Recursive split of internal node
        split_internal(tree, parent, key, right_child)
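
The walkthrough's numbers can be checked with a standalone helper (a self-contained sketch, independent of the engine-level pseudocode above; split_leaf is illustrative):

```python
def split_leaf(keys, new_key, order=4):
    # Split a full leaf of `order` keys on inserting new_key. Returns
    # (left_keys, right_keys, separator) under the convention that the
    # separator is the first key of the new right leaf.
    assert len(keys) == order, "split only fires on a full leaf"
    merged = sorted(keys + [new_key])
    mid = len(merged) // 2
    left, right = merged[:mid], merged[mid:]
    return left, right, right[0]

left, right, sep = split_leaf([5, 10, 13, 14], 12)
# left == [5, 10], right == [12, 13, 14], sep == 12
```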

Split Strategies and Optimizations

Monotonic Insert Optimization

When keys are inserted in sequential order (auto-increment primary keys, timestamps), a naive midpoint split wastes space.
The left page ends up half-full and never receives another insert.
PostgreSQL addresses this with a right-biased split of the rightmost leaf: when the new key is greater than every existing key in that leaf, almost all existing entries stay in the left page and only the new key goes into the new right page, yielding near-100% page fill for append-only workloads.
A related "fastpath" optimization caches the rightmost leaf so that successive appends can skip the root-to-leaf descent (and its buffer locks) entirely, an additional concurrency benefit.
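
The space effect is easy to see in a toy model of appends to the rightmost leaf (simulate_appends is illustrative, not PostgreSQL's actual logic):

```python
def simulate_appends(n_keys, capacity=100, right_biased=True):
    # Insert keys 0..n_keys-1 in ascending order; report average leaf fill.
    pages = [[]]
    for k in range(n_keys):
        page = pages[-1]          # monotonic keys always hit the rightmost leaf
        if len(page) < capacity:
            page.append(k)
        elif right_biased:
            pages.append([k])     # keep the full left page; only the new key moves
        else:
            mid = capacity // 2   # naive 50/50 midpoint split
            pages.append(page[mid:] + [k])
            del page[mid:]
    return sum(len(p) for p in pages) / (len(pages) * capacity)

# Right-biased splits leave every page but the last 100% full; midpoint
# splits strand a trail of half-full pages behind the insertion point.
```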

Pre-split (Top-Down Splitting)

Some implementations split nodes proactively during the downward traversal of an insert operation.
If a node is full, it is split before descending further, even if the actual insertion will target a child that is not full.
This guarantees that a split never propagates upward, simplifying concurrency control because no latches on ancestor nodes need to be held during upward propagation.
This approach is sometimes called top-down or eager splitting and trades occasional unnecessary splits for simpler latch management.

Suffix Truncation

Internal nodes only need enough key information to correctly route searches.
Rather than copying the full key into the parent during a split, the separator can be truncated to the shortest prefix that still distinguishes the left and right children.
This increases fanout and reduces tree height; PostgreSQL has implemented suffix truncation since version 12.
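
For string keys, the shortest valid separator can be derived from the last key going left and the first key going right (a sketch of the idea, not any engine's exact algorithm):

```python
def truncated_separator(left_max: str, right_min: str) -> str:
    # Shortest prefix of right_min that is still > left_max, so it routes
    # left_max to the left child and right_min (>= separator) to the right.
    for i in range(1, len(right_min) + 1):
        prefix = right_min[:i]
        if prefix > left_max:
            return prefix
    return right_min

# "blueberry" vs. "strawberry" need only the one-byte separator "s".
```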

Fill Factor

Most databases expose a FILLFACTOR parameter that controls how full leaf pages are during bulk loads or index creation.
Setting it below 100% reserves space in each page for future inserts, reducing split frequency at the cost of increased space usage and potentially deeper trees.
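
The trade-off is straightforward to quantify (pages_needed is an illustrative helper; real entries are variable-length):

```python
def pages_needed(n_entries, entries_per_page, fillfactor=100):
    # Leaf pages required when bulk-loading at a given fill factor.
    usable = max(1, entries_per_page * fillfactor // 100)
    return -(-n_entries // usable)  # ceiling division

# Loading 1,000,000 entries at 100 entries per page:
dense = pages_needed(1_000_000, 100, fillfactor=100)  # 10,000 pages, no slack
loose = pages_needed(1_000_000, 100, fillfactor=90)   # 11,112 pages, 10 free slots each
```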

Concurrency During Splits

diagram-3
Diagram 3: Optimistic vs. pessimistic splitting: the optimistic path descends with read latches only and write-latches the leaf, restarting with top-down write latches when a split is detected; the Lehman-Yao alternative uses right links so concurrent readers can follow an in-progress split without holding ancestor latches.

The classic approach to concurrency in B-trees uses a "crabbing" (lock coupling) protocol: acquire a latch on the child before releasing the latch on the parent.
During splits, this means holding latches on multiple levels simultaneously, which limits concurrency.

Modern implementations largely avoid holding multi-level latch chains during ordinary inserts.
InnoDB's optimistic protocol is representative:

  1. Traverse to the leaf holding only read latches, releasing each parent latch before latching the child.
  2. Acquire a write latch on the leaf.
  3. If a split is needed, release all latches, restart from the root with write latches on the full path (pessimistic mode), and perform the split.

Lehman and Yao's 1981 paper introduced "right-link" pointers on each node, allowing concurrent searches to detect and follow a split in progress without requiring the searcher to hold latches on multiple tree levels simultaneously.
This design forms the basis of PostgreSQL's nbtree implementation.
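
The heart of the protocol is the "move right" check: every node stores a high key, and a reader that lands on a node whose high key is at or below its search key simply follows the right link. A simplified sketch (real implementations also re-latch as they move; Node here is illustrative):

```python
from dataclasses import dataclass

@dataclass
class Node:
    keys: list
    high_key: object = None     # upper bound of this node's key range
    right_link: "Node" = None   # Lehman-Yao right sibling pointer

def move_right(node, search_key):
    # Follow right links past any concurrent splits this reader missed.
    while node.high_key is not None and search_key >= node.high_key:
        node = node.right_link
    return node

# A reader holding a stale pointer to the pre-split left page still finds 13:
left = Node(keys=[5, 10], high_key=12)
right = Node(keys=[12, 13, 14], high_key=15)
left.right_link = right
```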

Key Points

  • Each B-tree node maps to a fixed-size disk page, and the page size determines fanout, tree height, and I/O cost per lookup.
  • Page splits are triggered when a full node receives an insertion, and they may cascade upward to the root.
  • A split involves allocating a new page, redistributing entries, updating the parent with a separator key (the first key of the new right child under the common convention), adjusting sibling pointers, and writing a WAL record atomically.
  • Separator key conventions (strict < vs. ≤) vary by implementation; it is important to understand the convention used when reasoning about correctness.
  • Monotonic key patterns cause pathological 50% page utilization under naive splits; right-biased split strategies mitigate this.
  • Suffix truncation of separator keys in internal nodes increases fanout and reduces tree height.
  • Optimistic latching and Lehman-Yao right-link pointers enable high-concurrency access during splits without holding root-to-leaf latch chains.
  • Fill factor configuration trades space for reduced split frequency in write-heavy workloads.

References

R. Bayer and E. McCreight, "Organization and Maintenance of Large Ordered Indexes," Acta Informatica, vol. 1, no. 3, pp. 173–189, 1972.

P. L. Lehman and S. B. Yao, "Efficient Locking for Concurrent Operations on B-Trees," ACM Transactions on Database Systems, vol. 6, no. 4, pp. 650–670, 1981.

D. Comer, "The Ubiquitous B-Tree," ACM Computing Surveys, vol. 11, no. 2, pp. 121–137, 1979.
