
B-Tree Internals and Page Splits

April 19, 2026 · 11 min read

Page splits are the critical bottleneck in B-tree write performance, and their cost is managed through split-point selection, concurrency protocols, and separator truncation.

Introduction

The B-tree remains the dominant on-disk index structure in relational databases, key-value stores, and file systems.
Introduced by Bayer and McCreight in 1972, it was designed to minimize disk I/O by maintaining a balanced, wide, and shallow tree.
Despite decades of alternatives (LSM-trees, fractal trees, skip lists), the B-tree persists because its design maps directly onto the fixed-size page abstraction used by storage hardware and operating systems.

Understanding the internal mechanics of B-trees, particularly how nodes are structured and how page splits propagate, is essential for anyone tuning database performance or building storage engines.
Page splits are one of the most expensive operations a B-tree performs, and their behavior directly affects write amplification, space utilization, and tail latency.

B-Tree Structure

A B-tree of order m is a balanced search tree where each internal node contains between ⌈m/2⌉ and m children, and all leaves reside at the same depth.
In practice, database implementations almost universally use the B+-tree variant, where:

  1. Internal nodes store only keys and child pointers. They serve purely as a routing structure.
  2. Leaf nodes store keys and their associated values (or pointers to heap tuples). Leaves are typically linked in a doubly-linked list to support efficient range scans.

Each node corresponds to a fixed-size page on disk, commonly 4 KB, 8 KB, or 16 KB.
InnoDB uses 16 KB pages; PostgreSQL uses 8 KB.
The page size determines the fanout: a wider page holds more keys per node, producing a shallower tree and fewer I/O operations per lookup.
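
As a back-of-the-envelope illustration (the 16-byte entry and 64-byte header sizes below are assumptions; real page formats carry more per-entry overhead):

```python
import math

def estimated_fanout(page_bytes, entry_bytes, header_bytes=64):
    # Rough keys-per-internal-node estimate; sizes are illustrative.
    return (page_bytes - header_bytes) // entry_bytes

def tree_levels(num_keys, fanout):
    # Approximate number of levels, including the leaf level.
    return max(1, math.ceil(math.log(num_keys, fanout)))

# An 8 KB page with 16-byte (key + child pointer) entries holds ~500
# entries, so 500 million keys need a tree only four levels deep.
fanout = estimated_fanout(8192, 16)        # 508
levels = tree_levels(500_000_000, fanout)  # 4
```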

Routing Invariant

In the B+-tree variant used by most databases, internal (routing) nodes hold separator keys that partition the key space among child pointers.
A common convention is: for a separator key k at position i in an internal node, all keys in the subtree rooted at child i are strictly less than k, and all keys in the subtree rooted at child i+1 are greater than or equal to k.
Leaf nodes store the actual data records, and the same key may therefore appear in both a leaf and as a separator in an ancestor internal node.
Implementations vary in their exact ≤ vs. < boundary conventions; the examples in this article follow the convention that the separator promoted during a split is the first (smallest) key of the new right leaf.
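
Under that convention, picking the child to descend into reduces to an upper-bound binary search over the separators. A minimal sketch (child_index is a hypothetical helper, not any engine's API):

```python
from bisect import bisect_right

def child_index(separators, key):
    # Index of the child subtree that may contain `key`, under the
    # convention that child i+1 holds all keys >= separators[i].
    return bisect_right(separators, key)

# Separators [15, 40] partition three children:
# keys < 15 -> child 0, 15 <= keys < 40 -> child 1, keys >= 40 -> child 2.
```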

Page Layout

diagram-1
Diagram 1: Anatomy of a single B-tree page: the header (page ID, level, key count, free-space offset, sibling pointers), the fixed-size item-pointer array, variable-length key-value records growing from the page end toward the pointers, and the free-space gap between them. Arrows show the indirection from item pointers to record bytes and the leaf next/prev sibling links.

A typical B-tree page contains:

  • Page header: metadata including page ID, level in the tree, number of keys, free space offset, and pointers to the next and previous sibling pages (for leaves).
  • Item pointers (line pointers): a fixed-size array of offsets pointing to the actual key-value entries within the page.
  • Key-value entries: variable-length records stored from the end of the page, growing toward the item pointer array.
  • Free space: the gap between the end of the item pointer array and the beginning of the entries.

This indirection through item pointers allows entries to be logically reordered without physically moving data within the page.
When a key is deleted, its item pointer can be marked dead and the space reclaimed lazily during page compaction.
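
The indirection can be sketched as a toy slotted page (the SlottedPage class and its fields are illustrative; real headers carry far more state):

```python
class SlottedPage:
    # Toy slotted page: item pointers grow from the front, record bytes
    # from the back; deletion just marks the pointer dead.
    def __init__(self, size=8192):
        self.size = size
        self.slots = []       # [offset, length, dead?] item pointers
        self.heap_end = size  # records grow downward from the page end

    def free_space(self):
        return self.heap_end - 4 * len(self.slots)  # assume 4-byte slots

    def insert(self, record: bytes):
        if self.free_space() < len(record) + 4:
            raise OverflowError("page full -> split required")
        self.heap_end -= len(record)
        self.slots.append([self.heap_end, len(record), False])
        return len(self.slots) - 1  # the slot index is the stable handle

    def delete(self, slot):
        self.slots[slot][2] = True  # reclaimed lazily by compaction
```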

Search

Searching a B+-tree traverses from root to leaf.
At each internal node, a binary search (or interpolation search) over the sorted keys determines which child pointer to follow.
For a tree with N keys and fanout f, the number of levels (including the leaf level) is approximately ⌈log_f(N)⌉, so the number of page reads per point lookup equals the tree height.
With a fanout of 500 and 500 million keys, the tree height is only 4 levels.
The root and first few levels are almost always cached in the buffer pool, so a point lookup typically requires one or two physical disk reads.

Page Splits

A page split occurs when an insertion targets a node that is already full.
The node must be divided into two nodes, and the parent must be updated to include a pointer to the new node.
If the parent is also full, the split propagates upward, potentially reaching the root and increasing tree height by one.

Why Splits Are Expensive

Page splits involve multiple coordinated writes:

  1. Allocating a new page on disk.
  2. Copying roughly half the entries from the original page to the new page.
  3. Updating the parent node to insert the new separator key and child pointer.
  4. Updating sibling pointers in the leaf linked list.
  5. Writing a WAL (write-ahead log) record to ensure crash recovery can reconstruct or undo the split.

All of these operations must appear atomic to concurrent readers, and most implementations hold a write latch on the splitting page and its parent during the operation.
Some engines (like InnoDB) use optimistic latching: they latch only the leaf initially, and restart with pessimistic root-to-leaf latching if a split is detected.

Walkthrough

The following walkthrough describes a leaf-level page split in a B+-tree.
Assume order m = 4 (each leaf holds at most 4 keys; each internal node holds at most 3 separator keys and 4 child pointers).

The separator convention used here: a separator key k means all keys in the right subtree are ≥ k.
Separator keys in internal nodes are copied up from the first key of the new right leaf.

Initial State

         [15 | 40]                           <- Internal node (root)
        /    |     \
  [5,10,13,14] [15,25,30,35] [40,45,50]      <- Leaf nodes (linked)

Note: The root key 15 is the first key of the middle leaf; 40 is the first key of the rightmost leaf, consistent with the separator convention.

Step-by-step: Insert key 12

diagram-2
Diagram 2: The insert of key 12, step by step: root-to-leaf traversal, detection of the full leaf, allocation of a new page, split-point computation over the merged entries, redistribution into left and right leaves with promoted separator 12, parent and sibling-pointer updates, and the WAL record that commits the split.

Step 1: Traverse to target leaf. Compare 12 against root keys. 12 < 15, so follow the leftmost child pointer.
Arrive at leaf [5, 10, 13, 14].

Step 2: Check for space. The leaf holds 4 keys (maximum capacity for order m = 4).
No room for key 12.
A split is required.

Step 3: Allocate a new leaf page. Request a free page from the page allocator (free list or extent bitmap).

Step 4: Choose the split point. Merge the 4 existing keys with the new key into a temporary sorted array: [5, 10, 12, 13, 14] (5 entries).
Divide at the midpoint: mid = floor(5 / 2) = 2.
The left leaf receives entries [0, mid) = [5, 10]; the right leaf receives entries [mid, end) = [12, 13, 14].
The separator key promoted to the parent is 12 — the first (smallest) key of the new right leaf.

Step 5: Redistribute keys.

Original leaf (left):  [5, 10]
New leaf (right):      [12, 13, 14]

Step 6: Update the parent. Insert separator key 12 into the parent along with a pointer to the new right leaf.
The root previously had keys [15 | 40] and 3 children, and it now has keys [12 | 15 | 40] and 4 children.

        [12 | 15 | 40]
       /    |      |     \
  [5,10] [12,13,14] [15,25,30,35] [40,45,50]

Step 7: Update sibling pointers. The new right leaf's next pointer is set to the former sibling [15,25,30,35].
The original left leaf's next pointer is updated to point to the new right leaf [12,13,14].

Step 8: Write WAL record. A single compound WAL record captures the entire split operation so that crash recovery can replay or undo it atomically.

Pseudocode

def insert(tree, key, value):
    leaf = find_leaf(tree.root, key)

    if leaf.has_space():
        leaf.insert_sorted(key, value)
        leaf.write_to_disk()
        return

    # Split required
    new_leaf = allocate_page()

    # Merge existing entries with the new entry into one sorted run
    entries = leaf.entries + [(key, value)]
    entries.sort(key=lambda kv: kv[0])  # order by key only

    mid = len(entries) // 2  # integer (floor) division
    leaf.entries = entries[0:mid]
    new_leaf.entries = entries[mid:]

    # Update sibling linked list
    new_leaf.next = leaf.next
    new_leaf.prev = leaf
    if leaf.next is not None:
        leaf.next.prev = new_leaf
    leaf.next = new_leaf
    new_leaf.parent = leaf.parent  # shares the old leaf's parent for now

    # Separator key is the first key in the new right leaf
    separator = new_leaf.entries[0][0]

    # Write both leaves (WAL logging of the split is omitted here)
    leaf.write_to_disk()
    new_leaf.write_to_disk()

    # Insert separator into parent (may trigger recursive split)
    insert_into_parent(tree, leaf.parent, separator, new_leaf)


def insert_into_parent(tree, parent, key, right_child):
    if parent is None:
        # Splitting the root: create a new root
        old_root = tree.root
        new_root = allocate_page()
        new_root.keys = [key]
        new_root.children = [old_root, right_child]
        old_root.parent = new_root
        right_child.parent = new_root
        tree.root = new_root  # tree height grows by one
        return

    if parent.has_space():
        parent.insert_sorted(key, right_child)
        parent.write_to_disk()
    else:
        # Recursive split of internal node
        split_internal(tree, parent, key, right_child)
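
The walkthrough's numbers can be checked with a standalone helper (a self-contained sketch, independent of the engine-level pseudocode above; split_leaf is illustrative):

```python
def split_leaf(keys, new_key, order=4):
    # Split a full leaf of `order` keys on inserting new_key. Returns
    # (left_keys, right_keys, separator) under the convention that the
    # separator is the first key of the new right leaf.
    assert len(keys) == order, "split only fires on a full leaf"
    merged = sorted(keys + [new_key])
    mid = len(merged) // 2
    left, right = merged[:mid], merged[mid:]
    return left, right, right[0]

left, right, sep = split_leaf([5, 10, 13, 14], 12)
# left == [5, 10], right == [12, 13, 14], sep == 12
```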

Split Strategies and Optimizations

Monotonic Insert Optimization

When keys are inserted in sequential order (auto-increment primary keys, timestamps), a naive midpoint split wastes space.
The left page ends up half-full and never receives another insert.
PostgreSQL addresses this with a right-biased split of the rightmost leaf: when the new key is greater than every existing key in that leaf, almost all existing entries stay in the left page and only the new key goes into the new right page, yielding near-100% page fill for append-only workloads.
A related "fastpath" optimization caches the rightmost leaf so that successive appends can skip the root-to-leaf descent (and its buffer locks) entirely, an additional concurrency benefit.
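
The space effect is easy to see in a toy model of appends to the rightmost leaf (simulate_appends is illustrative, not PostgreSQL's actual logic):

```python
def simulate_appends(n_keys, capacity=100, right_biased=True):
    # Insert keys 0..n_keys-1 in ascending order; report average leaf fill.
    pages = [[]]
    for k in range(n_keys):
        page = pages[-1]          # monotonic keys always hit the rightmost leaf
        if len(page) < capacity:
            page.append(k)
        elif right_biased:
            pages.append([k])     # keep the full left page; only the new key moves
        else:
            mid = capacity // 2   # naive 50/50 midpoint split
            pages.append(page[mid:] + [k])
            del page[mid:]
    return sum(len(p) for p in pages) / (len(pages) * capacity)

# Right-biased splits leave every page but the last 100% full; midpoint
# splits strand a trail of half-full pages behind the insertion point.
```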

Pre-split (Top-Down Splitting)

Some implementations split nodes proactively during the downward traversal of an insert operation.
If a node is full, it is split before descending further, even if the actual insertion will target a child that is not full.
This guarantees that a split never propagates upward, simplifying concurrency control because no latches on ancestor nodes need to be held during upward propagation.
This approach is sometimes called top-down or eager splitting and trades occasional unnecessary splits for simpler latch management.

Suffix Truncation

Internal nodes only need enough key information to correctly route searches.
Rather than copying the full key into the parent during a split, the separator can be truncated to the shortest prefix that still distinguishes the left and right children.
This increases fanout and reduces tree height; PostgreSQL has implemented suffix truncation since version 12.
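
For string keys, the shortest valid separator can be derived from the last key going left and the first key going right (a sketch of the idea, not any engine's exact algorithm):

```python
def truncated_separator(left_max: str, right_min: str) -> str:
    # Shortest prefix of right_min that is still > left_max, so it routes
    # left_max to the left child and right_min (>= separator) to the right.
    for i in range(1, len(right_min) + 1):
        prefix = right_min[:i]
        if prefix > left_max:
            return prefix
    return right_min

# "blueberry" vs. "strawberry" need only the one-byte separator "s".
```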

Fill Factor

Most databases expose a FILLFACTOR parameter that controls how full leaf pages are during bulk loads or index creation.
Setting it below 100% reserves space in each page for future inserts, reducing split frequency at the cost of increased space usage and potentially deeper trees.
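
The trade-off is straightforward to quantify (pages_needed is an illustrative helper; real entries are variable-length):

```python
def pages_needed(n_entries, entries_per_page, fillfactor=100):
    # Leaf pages required when bulk-loading at a given fill factor.
    usable = max(1, entries_per_page * fillfactor // 100)
    return -(-n_entries // usable)  # ceiling division

# Loading 1,000,000 entries at 100 entries per page:
dense = pages_needed(1_000_000, 100, fillfactor=100)  # 10,000 pages, no slack
loose = pages_needed(1_000_000, 100, fillfactor=90)   # 11,112 pages, 10 free slots each
```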

Concurrency During Splits

diagram-3
Diagram 3: Optimistic vs. pessimistic splitting: the optimistic path descends with read latches only and write-latches the leaf, restarting with top-down write latches when a split is detected; the Lehman-Yao alternative uses right links so concurrent readers can follow an in-progress split without holding ancestor latches.

The classic approach to concurrency in B-trees uses a "crabbing" (lock coupling) protocol: acquire a latch on the child before releasing the latch on the parent.
During splits, this means holding latches on multiple levels simultaneously, which limits concurrency.

Modern implementations largely avoid holding multi-level latch chains during ordinary inserts.
InnoDB's optimistic protocol is representative:

  1. Traverse to the leaf holding only read latches, releasing each parent latch before latching the child.
  2. Acquire a write latch on the leaf.
  3. If a split is needed, release all latches, restart from the root with write latches on the full path (pessimistic mode), and perform the split.

Lehman and Yao's 1981 paper introduced "right-link" pointers on each node, allowing concurrent searches to detect and follow a split in progress without requiring the searcher to hold latches on multiple tree levels simultaneously.
This design forms the basis of PostgreSQL's nbtree implementation.
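
The heart of the protocol is the "move right" check: every node stores a high key, and a reader that lands on a node whose high key is at or below its search key simply follows the right link. A simplified sketch (real implementations also re-latch as they move; Node here is illustrative):

```python
from dataclasses import dataclass

@dataclass
class Node:
    keys: list
    high_key: object = None     # upper bound of this node's key range
    right_link: "Node" = None   # Lehman-Yao right sibling pointer

def move_right(node, search_key):
    # Follow right links past any concurrent splits this reader missed.
    while node.high_key is not None and search_key >= node.high_key:
        node = node.right_link
    return node

# A reader holding a stale pointer to the pre-split left page still finds 13:
left = Node(keys=[5, 10], high_key=12)
right = Node(keys=[12, 13, 14], high_key=15)
left.right_link = right
```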

Key Points

  • Each B-tree node maps to a fixed-size disk page, and the page size determines fanout, tree height, and I/O cost per lookup.
  • Page splits are triggered when a full node receives an insertion, and they may cascade upward to the root.
  • A split involves allocating a new page, redistributing entries, updating the parent with a separator key (the first key of the new right child under the common convention), adjusting sibling pointers, and writing a WAL record atomically.
  • Separator key conventions (strict < vs. ≤) vary by implementation; it is important to understand the convention used when reasoning about correctness.
  • Monotonic key patterns cause pathological 50% page utilization under naive splits; right-biased split strategies mitigate this.
  • Suffix truncation of separator keys in internal nodes increases fanout and reduces tree height.
  • Optimistic latching and Lehman-Yao right-link pointers enable high-concurrency access during splits without holding root-to-leaf latch chains.
  • Fill factor configuration trades space for reduced split frequency in write-heavy workloads.

References

R. Bayer and E. McCreight, "Organization and Maintenance of Large Ordered Indexes," Acta Informatica, vol. 1, no. 3, pp. 173–189, 1972.

P. L. Lehman and S. B. Yao, "Efficient Locking for Concurrent Operations on B-Trees," ACM Transactions on Database Systems, vol. 6, no. 4, pp. 650–670, 1981.

D. Comer, "The Ubiquitous B-Tree," ACM Computing Surveys, vol. 11, no. 2, pp. 121–137, 1979.
