DBW.

Intermediate

Write-Ahead Logging and Crash Recovery

Article diagram
April 19, 2026·9 min read

Write-ahead logging, combined with the three-phase ARIES recovery protocol, provides crash atomicity and durability without sacrificing runtime performance by decoupling the timing of log persistence from data page persistence.

Introduction

Databases make a fundamental promise: committed transactions will survive crashes, and uncommitted transactions will leave no trace.
Delivering on this promise while maintaining high throughput is one of the hardest problems in systems engineering.
Write-ahead logging (WAL) is the dominant technique that makes it possible.

The core invariant is simple.
Before any modification to a database page is flushed to durable storage, the log records describing that modification must be flushed first.
This single rule, combined with a carefully designed recovery protocol, provides both durability and atomicity even when the system crashes at an arbitrary point in execution.

The WAL Protocol

WAL operates on three rules, often called the WAL protocol rules:

  1. Undo rule (steal policy): Before a dirty page is written to disk, all log records pertaining to modifications on that page must first be forced to stable storage. This ensures that if a crash occurs after the page is written but before the transaction commits, the recovery system can undo the change.

  2. Redo rule (no-force policy): Before a transaction is reported as committed, all log records for that transaction must be forced to stable storage. The actual data pages need not be written. This ensures that if a crash occurs after the commit acknowledgment, the recovery system can redo the change.

  3. Log ordering: Log records are written sequentially and assigned monotonically increasing Log Sequence Numbers (LSNs). Each database page carries a pageLSN field indicating the LSN of the most recent log record applied to that page.

A system that can evict dirty pages of uncommitted transactions is called a steal system.
A system that does not require all dirty pages to be written at commit is called a no-force system.
Most production databases (PostgreSQL, MySQL/InnoDB, SQL Server) use steal/no-force policies because this combination maximizes buffer pool flexibility and minimizes commit latency.
The cost is a more complex recovery algorithm, which is precisely what ARIES provides.

Log Record Structure

A typical WAL log record contains:

FieldDescription
LSNUnique, monotonically increasing sequence number
transactionIDThe transaction that generated this record
prevLSNThe previous log record for the same transaction (forms a per-transaction linked list)
typeUPDATE, COMMIT, ABORT, CLR, CHECKPOINT, END
pageIDThe page modified (for update records)
offsetPosition within the page
beforeImageData before modification (for undo)
afterImageData after modification (for redo)
undoNextLSNFor CLRs only: the next record to undo if this compensation itself must be undone

Compensation Log Records (CLRs) are written during undo operations.
They are redo-only records that describe the inverse of an earlier update.
The undoNextLSN field allows recovery to skip over already-undone work, preventing repeated undo operations if the system crashes during recovery itself.

ARIES: The Recovery Algorithm

ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) is the most influential WAL-based recovery algorithm.
Published by Mohan et al. in 1992, it forms the basis of recovery in IBM DB2, SQL Server, InnoDB, and many other systems.
ARIES recovery proceeds in three passes over the log.

Data Structures

diagram-1
Show the in-memory Transaction Table and Dirty Page Table and their relationships to the on-disk log, buffer pool pages (with pageLSN), and persistent storage; include arrows for operations: log appends from transactions, buffer-manager page evictions conditioned on log flush, and checkpoint snapshots capturing TT/DPT.

Two in-memory structures are critical:

  • Transaction Table (TT): Tracks active transactions, their state, and the LSN of their most recent log record (lastLSN).
  • Dirty Page Table (DPT): Tracks pages in the buffer pool that may contain unflushed modifications. Each entry stores a recLSN, the LSN of the first log record that dirtied the page since it was last flushed.

Both structures are captured during fuzzy checkpoints, which are written periodically without halting transaction processing.

Walkthrough

Recovery proceeds in three sequential phases:

Phase 1: Analysis

function Analysis(checkpoint):
    load TT and DPT from checkpoint record
    scan log forward from checkpoint LSN to end of log

    for each log record R:
        if R.transactionID not in TT:
            add R.transactionID to TT
        set TT[R.transactionID].lastLSN = R.LSN

        if R.type == COMMIT:
            mark TT[R.transactionID] as committed
        elif R.type == END:
            remove R.transactionID from TT

        if R modifies a page and R.pageID not in DPT:
            add R.pageID to DPT with recLSN = R.LSN

    // After analysis:
    // TT contains all transactions active at crash time
    // DPT contains all pages potentially dirty at crash time
    // Committed transactions still in TT need END records
    // Uncommitted transactions in TT need undo

The analysis phase establishes the precise state of the system at the moment of the crash.
It determines which transactions need to be redone (all of them, including those that will later be undone) and which pages might be out of date on disk.

Phase 2: Redo

function Redo(DPT):
    startLSN = min(recLSN for all entries in DPT)
    scan log forward from startLSN to end of log

    for each update or CLR record R:
        if R.pageID not in DPT:
            skip  // page was already flushed
        if R.LSN < DPT[R.pageID].recLSN:
            skip  // page already reflects this change
        
        fetch page P from disk
        if P.pageLSN >= R.LSN:
            skip  // page already reflects this change
        
        reapply R.afterImage to page P
        set P.pageLSN = R.LSN

Redo is unconditional in principle but conditional in practice.
The three skip conditions are optimizations that avoid unnecessary I/O.
The pageLSN check is the definitive test: if the page on disk already has an LSN greater than or equal to the log record's LSN, the modification survived the crash and does not need to be reapplied.

Note that ARIES redoes updates for all transactions, including those that will be rolled back.
This is the "repeating history" principle.
It restores the database to the exact state it was in at the crash point, which simplifies reasoning about correctness and ensures that CLRs written before the crash are also reapplied.

Phase 3: Undo

diagram-2
Illustrate the undo loop: build max-heap of loser lastLSNs, pop largest LSN, branch on whether record is CLR or regular update, actions for writing CLRs, pushing prevLSN or undoNextLSN, applying inverse to pages, and writing END records, highlighting how undoNextLSN skips work across crashes.
function Undo(TT):
    // Collect lastLSN of all uncommitted (loser) transactions
    toUndo = max-heap of {lastLSN for each loser in TT}

    while toUndo is not empty:
        LSN = toUndo.pop()   // largest LSN first
        R = fetch log record at LSN

        if R is a CLR:
            if R.undoNextLSN != NULL:
                toUndo.push(R.undoNextLSN)
            else:
                write END record for R.transactionID
        else:
            // R is a regular update: undo it
            write CLR for R with undoNextLSN = R.prevLSN
            apply the inverse operation to the page
            if R.prevLSN != NULL:
                toUndo.push(R.prevLSN)
            else:
                write END record for R.transactionID

Undo processes log records in reverse LSN order across all loser transactions simultaneously using a single backward pass.
When it encounters a CLR, it follows the undoNextLSN pointer, which skips over the range of records already compensated.
This is how ARIES guarantees bounded recovery even if the system crashes repeatedly during recovery.
Each CLR is itself redo-able (captured in Phase 2 on a subsequent recovery attempt), and the undoNextLSN chain ensures no log record is undone more than once across multiple crash-recovery cycles.

Checkpointing

diagram-3
Depict the temporal sequence of a fuzzy checkpoint: BEGIN_CHECKPOINT, snapshot TT/DPT while concurrent transactions run (showing possible changes), END_CHECKPOINT with captured TT/DPT, and master-record update; annotate effects such as reduced redo window and how background page flushing narrows recLSN gaps.

Without checkpoints, recovery would need to replay the entire log from the beginning of time.
Fuzzy checkpoints bound recovery time by persisting snapshots of the transaction table and dirty page table.
The checkpoint process works as follows:

  1. Write a BEGIN_CHECKPOINT record.
  2. Capture a snapshot of the TT and DPT (these can change during the capture).
  3. Write an END_CHECKPOINT record containing the captured TT and DPT.
  4. Write the LSN of BEGIN_CHECKPOINT to a special fixed location on disk (the master record).

Because the checkpoint does not force dirty pages to disk or block ongoing transactions, it is "fuzzy." The analysis phase corrects any inaccuracies introduced by concurrent activity during the checkpoint window.

Aggressive background page flushing reduces the gap between recLSN values in the DPT and the current log tail, which directly reduces redo work during recovery.

Practical Considerations

Group commit. Forcing the log to disk on every commit is expensive.
Systems batch multiple commit records into a single fsync call, amortizing the I/O cost across transactions.
This introduces a small delay before commit acknowledgment but dramatically improves throughput.

Log buffer management. The log buffer is a sequential, append-only region in memory.
Because log writes are sequential, they are well-suited to both HDDs and SSDs.
The critical path for commit latency is the time to flush the log buffer, not the time to write data pages.

Physiological logging. Most systems do not log pure physical (byte-level), pure logical (operation-level), or changes.
Instead, they use physiological logging: physical to the page (identifying the exact page) but logical within the page (describing the operation, such as "insert key X into slot 5").
This allows the page's internal layout to differ between the time the log record was written and the time it is replayed, as long as the page-level operation is deterministic.

Interaction with buffer management. The steal policy requires that the buffer manager check the log before evicting a dirty page.
Specifically, it must ensure all log records up to the page's pageLSN have been flushed.
This coordination between the buffer manager and the log manager is a subtle but essential correctness requirement.

Key Points

  • Write-ahead logging requires that log records reach stable storage before the corresponding data page modifications, ensuring undo and redo information is always available after a crash.
  • The steal/no-force buffer management policy maximizes flexibility and performance but requires a recovery algorithm capable of both undo and redo.
  • ARIES recovery uses three phases (analysis, redo, and undo) and the "repeating history" principle to restore the database to its pre-crash state before rolling back incomplete transactions.
  • Compensation Log Records (CLRs) with undoNextLSN pointers guarantee bounded recovery time even if the system crashes repeatedly during the recovery process itself.
  • Fuzzy checkpoints limit recovery time without blocking transaction processing, and background page flushing further reduces the redo window.
  • Physiological logging (physical to page, logical within page) balances log volume with flexibility, and in page layout management.
  • The pageLSN stored on each database page is the definitive mechanism for determining whether a redo operation is necessary, enabling idempotent recovery.

References

C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, P. Schwarz. "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging." ACM Transactions on Database Systems, 17(1), March 1992.

Ramakrishnan, R. and Gehrke, J. "Database Management Systems." 3rd Edition, McGraw-Hill, 2003. Chapters 18-20.

Gray, J. and Reuter, A. "Transaction Processing: Concepts and Techniques." Morgan Kaufmann, 1993.

Hellerstein, J., Stonebraker, M., Hamilton, J. "Architecture of a Database System." Foundations and Trends in Databases, 1(2), 2007.

Newsletter

Signal
over noise.

Database deep-dives, delivered once a week. Storage engines, query optimization, and the data layer.

You will receive Databases Weekly.