Write-Ahead Logging and Crash Recovery

Introduction

Databases promise durability: once a transaction commits, its effects survive power failures, crashes, and restarts.
Delivering on this promise while maintaining high throughput is one of the central engineering challenges in database systems.
Write-ahead logging (WAL) is the dominant technique for solving this problem.
The core invariant is simple: before any modified page is flushed from the buffer pool to disk, the log records describing that modification must already be durable on stable storage.
This single rule, combined with a disciplined recovery protocol, is sufficient to guarantee both atomicity and durability in the face of arbitrary crashes.

The WAL Protocol

WAL rests on two fundamental rules, formalized in the ARIES recovery algorithm:

Write-ahead rule: Before a dirty page is written to the database on disk, all log records pertaining to updates on that page must be flushed to the log on stable storage.
Commit rule: A transaction is not considered committed until all of its log records, including the commit record, have been flushed to stable storage.

A third important property follows from the choice of buffer management policy:

Undo capability: For steal-based buffer managers (where uncommitted pages can be flushed to disk before the transaction commits), the log must contain enough information to undo uncommitted changes after a crash. This is a consequence of the steal policy, not an independent WAL invariant, but it is required for correct recovery under ARIES.

These rules decouple the logical ordering of operations from the physical ordering of disk writes.
A transaction can modify hundreds of pages in the buffer pool without issuing a single random I/O to the data files.
Only sequential appends to the log are required at commit time, and sequential I/O is dramatically cheaper than random I/O on both spinning disks and, to a lesser extent, SSDs.

Log Record Structure

diagram-2 — WAL log-record fields and pageLSN linkage to data pages

Each log record typically contains:

LSN (Log Sequence Number): A monotonically increasing identifier, often the byte offset in the log file.
Transaction ID: Which transaction generated this record.
Page ID: Which page was modified.
Previous LSN: The LSN of the previous log record for the same transaction, forming a per-transaction chain for undo.
Redo information: The after-image or physical operation needed to re-apply the change.
Undo information: The before-image or inverse operation needed to reverse the change.
Record type: Update, commit, abort, compensation log record (CLR), checkpoint, etc.

The LSN is also stored on each data page (the pageLSN field), recording the most recent log record whose effects are reflected on that page.
This is critical during recovery for determining whether a redo operation has already been applied.

Checkpointing

Without checkpointing, recovery would need to replay the entire log from the beginning of time.
A checkpoint bounds the work required during recovery by recording a snapshot of the system state.
ARIES uses fuzzy checkpoints, which do not require halting transaction processing.
A fuzzy checkpoint writes the following to the log:

The active transaction table (ATT): all in-progress transactions and their most recent LSN.
The dirty page table (DPT): all pages in the buffer pool that have been modified but not yet flushed, along with the recoveryLSN (the LSN of the earliest log record that dirtied each page).

The checkpoint does not force dirty pages to disk.
It simply records enough metadata so that recovery knows where to start scanning the log.

Walkthrough

ARIES Recovery: Three-Pass Protocol

diagram-1 — ARIES three-pass recovery: Analysis → Redo (with skip rules) → Undo

ARIES recovery proceeds in three phases after a crash.
Assuming the most recent checkpoint has been located in the log.

Phase 1: Analysis

1. Read the checkpoint record. Initialize ATT and DPT from the checkpoint.
2. Scan the log forward from the checkpoint LSN to the end of the log.
3. For each log record encountered:
   a. If the record's transaction is not in ATT, add it (status = in-progress).
   b. If the record is a COMMIT record, change the transaction's status
      to committed in ATT. (The transaction remains in the ATT until
      an END record is seen or the end of the log is reached.)
   c. If the record is an UPDATE record:
      - Update the transaction's lastLSN in ATT.
      - If the page is not in DPT, add it with recoveryLSN = current LSN.
   d. If the record is an END record, remove the transaction from ATT.
4. At the end of analysis:
   - ATT contains all transactions that had not yet received an END record
     at crash time. Those with status = committed will be confirmed during
     redo; those with status = in-progress are losers to be undone.
   - DPT contains all pages that were potentially dirty at crash time.

Phase 2: Redo

1. Determine the starting point: the minimum recoveryLSN across all
   entries in the DPT. The recoveryLSN is the earliest LSN that could
   have dirtied a page still in the DPT; any log record with an LSN
   strictly below this minimum is guaranteed to be reflected on disk
   already and need not be considered.
2. Scan the log forward from that minimum recoveryLSN.
3. For each redo-able log record (UPDATE or CLR):
   a. If the page is not in DPT, skip (page was already flushed and
      its changes are durable).
   b. If the record's LSN < recoveryLSN for that page in DPT, skip
      (this update was already on disk before the crash).
   c. Read the page from disk. If pageLSN >= record's LSN, skip
      (the update is already reflected on the page).
   d. Otherwise, re-apply the logged operation and set pageLSN = record's LSN.

The redo phase repeats history: it brings the database to the exact state it was in at the moment of the crash, including any modifications made by transactions that had not committed.

Phase 3: Undo

1. Collect the set of "loser" transactions: all transactions in ATT
   with status = in-progress (i.e., those that did not commit).
2. Build a set of lastLSNs, one per loser transaction.
3. While the set of lastLSNs is not empty:
   a. Pick the largest LSN from the set (process in reverse log order).
   b. If the record is a CLR:
      - If CLR.undoNextLSN is null, write an END record for this transaction
        and remove it from the set.
      - Else, replace this transaction's entry with CLR.undoNextLSN.
   c. If the record is an UPDATE:
      - Undo the operation (apply the inverse).
      - Write a CLR to the log. The CLR's undoNextLSN points to the
        previousLSN of the record being undone.
      - Replace this transaction's entry with the previousLSN.
   d. If the previousLSN is null, write an END record; remove from set.

Compensation log records (CLRs) are redo-only log records that describe undo actions.
They serve a critical purpose: if the system crashes during recovery itself, the CLRs prevent the same undo work from being repeated.
The undoNextLSN pointer in each CLR allows the undo phase to skip over already-undone operations, guaranteeing that recovery is idempotent and bounded in the number of passes.

Practical Considerations

Group Commit

Flushing the log to stable storage on every commit is expensive.
Group commit batches multiple transaction commit records into a single fsync call.
Transactions that commit within a small time window share the cost of one durable write.
PostgreSQL, MySQL/InnoDB, and most modern systems implement some variants of group commit.
The trade-off is a small increase in commit latency for a large increase in throughput.

Log-Structured Considerations

WAL as described here assumes an update-in-place storage engine.
Log-structured merge trees (LSM trees), used by systems like RocksDB, blur the line between the log and the data store.
The write-ahead log in an LSM system primarily protects the in-memory memtable: if the process crashes before the memtable is flushed to an SSTable on disk, the WAL replays the lost writes.
Recovery is generally simpler than in update-in-place engines because the on-disk SSTables are immutable and always internally consistent.
However, a complete recovery must also account for compactions that were in progress at crash time, ensuring that partially written compaction outputs do not corrupt the logical dataset.

Performance of Recovery

The time to recover is bounded by the interval between checkpoints and the volume of log generated in that interval.
More frequent checkpoints reduce recovery time but increase I/O during normal operation.
Some systems (such as SQL Server) support parallel redo, applying log records to different pages concurrently to speed up the redo phase.

Interaction with Buffer Management

The WAL protocol interacts tightly with the buffer pool's page replacement policy.
Under a steal policy, the buffer manager can evict dirty pages from uncommitted transactions (requiring undo capability).
Under a no-force policy, the buffer manager is not required to flush all dirty pages at commit time (requiring redo capability).
ARIES assumes steal/no-force, which gives the buffer manager maximum flexibility and the best runtime performance at the cost of more complex recovery.

Key Points

The write-ahead rule requires that log records reach stable storage before the corresponding dirty data pages are flushed, ensuring that every committed change can be reconstructed after a crash.
ARIES recovery uses three phases (analysis, redo, undo) to restore the database to a consistent state: it repeats history by redoing all logged operations, then rolls back the changes of uncommitted (loser) transactions.
Fuzzy checkpoints avoid stalling transaction processing by recording the active transaction table and dirty page table without forcing pages to disk.
Compensation log records (CLRs) make the undo phase idempotent, so a crash during recovery does not cause unbounded work or logical errors.
The steal/no-force buffer management policy maximizes runtime performance but requires both redo and undo capabilities in the log.
Group commit amortizes the cost of durable log flushes across multiple transactions, trading marginal latency for significant throughput gains.
The LSN on each data page (pageLSN) enables the redo phase to skip operations that are already reflected on disk, avoiding redundant work.

References

C. Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, Peter Schwarz. "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging." ACM Transactions on Database Systems, 17(1), March 1992.

Jim Gray and Andreas Reuter. "Transaction Processing: Concepts and Techniques." Morgan Kaufmann, 1993.

Raghu Ramakrishnan and Johannes Gehrke. "Database Management Systems." 3rd Edition, McGraw-Hill, 2003.

Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman. "Concurrency Control and Recovery in Database Systems." Addison-Wesley, 1987.