Introduction
Databases make a fundamental promise: committed transactions will survive crashes, and uncommitted transactions will leave no trace.
Delivering on this promise while maintaining high throughput is one of the hardest problems in systems engineering.
Write-ahead logging (WAL) is the dominant technique that makes it possible.
The core invariant is simple.
Before any modification to a database page is flushed to durable storage, the log records describing that modification must be flushed first.
This single rule, combined with a carefully designed recovery protocol, provides both durability and atomicity even when the system crashes at an arbitrary point in execution.
The WAL Protocol
WAL operates on three rules, often called the WAL protocol rules:
-
Undo rule (steal policy): Before a dirty page is written to disk, all log records pertaining to modifications on that page must first be forced to stable storage. This ensures that if a crash occurs after the page is written but before the transaction commits, the recovery system can undo the change.
-
Redo rule (no-force policy): Before a transaction is reported as committed, all log records for that transaction must be forced to stable storage. The actual data pages need not be written. This ensures that if a crash occurs after the commit acknowledgment, the recovery system can redo the change.
-
Log ordering: Log records are written sequentially and assigned monotonically increasing Log Sequence Numbers (LSNs). Each database page carries a
pageLSNfield indicating the LSN of the most recent log record applied to that page.
A system that can evict dirty pages of uncommitted transactions is called a steal system.
A system that does not require all dirty pages to be written at commit is called a no-force system.
Most production databases (PostgreSQL, MySQL/InnoDB, SQL Server) use steal/no-force policies because this combination maximizes buffer pool flexibility and minimizes commit latency.
The cost is a more complex recovery algorithm, which is precisely what ARIES provides.
Log Record Structure
A typical WAL log record contains:
| Field | Description |
|---|---|
LSN | Unique, monotonically increasing sequence number |
transactionID | The transaction that generated this record |
prevLSN | The previous log record for the same transaction (forms a per-transaction linked list) |
type | UPDATE, COMMIT, ABORT, CLR, CHECKPOINT, END |
pageID | The page modified (for update records) |
offset | Position within the page |
beforeImage | Data before modification (for undo) |
afterImage | Data after modification (for redo) |
undoNextLSN | For CLRs only: the next record to undo if this compensation itself must be undone |
Compensation Log Records (CLRs) are written during undo operations.
They are redo-only records that describe the inverse of an earlier update.
The undoNextLSN field allows recovery to skip over already-undone work, preventing repeated undo operations if the system crashes during recovery itself.
ARIES: The Recovery Algorithm
ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) is the most influential WAL-based recovery algorithm.
Published by Mohan et al. in 1992, it forms the basis of recovery in IBM DB2, SQL Server, InnoDB, and many other systems.
ARIES recovery proceeds in three passes over the log.
Data Structures
Two in-memory structures are critical:
- Transaction Table (TT): Tracks active transactions, their state, and the LSN of their most recent log record (
lastLSN). - Dirty Page Table (DPT): Tracks pages in the buffer pool that may contain unflushed modifications. Each entry stores a
recLSN, the LSN of the first log record that dirtied the page since it was last flushed.
Both structures are captured during fuzzy checkpoints, which are written periodically without halting transaction processing.
Walkthrough
Recovery proceeds in three sequential phases:
Phase 1: Analysis
function Analysis(checkpoint):
load TT and DPT from checkpoint record
scan log forward from checkpoint LSN to end of log
for each log record R:
if R.transactionID not in TT:
add R.transactionID to TT
set TT[R.transactionID].lastLSN = R.LSN
if R.type == COMMIT:
mark TT[R.transactionID] as committed
elif R.type == END:
remove R.transactionID from TT
if R modifies a page and R.pageID not in DPT:
add R.pageID to DPT with recLSN = R.LSN
// After analysis:
// TT contains all transactions active at crash time
// DPT contains all pages potentially dirty at crash time
// Committed transactions still in TT need END records
// Uncommitted transactions in TT need undo
The analysis phase establishes the precise state of the system at the moment of the crash.
It determines which transactions need to be redone (all of them, including those that will later be undone) and which pages might be out of date on disk.
Phase 2: Redo
function Redo(DPT):
startLSN = min(recLSN for all entries in DPT)
scan log forward from startLSN to end of log
for each update or CLR record R:
if R.pageID not in DPT:
skip // page was already flushed
if R.LSN < DPT[R.pageID].recLSN:
skip // page already reflects this change
fetch page P from disk
if P.pageLSN >= R.LSN:
skip // page already reflects this change
reapply R.afterImage to page P
set P.pageLSN = R.LSN
Redo is unconditional in principle but conditional in practice.
The three skip conditions are optimizations that avoid unnecessary I/O.
The pageLSN check is the definitive test: if the page on disk already has an LSN greater than or equal to the log record's LSN, the modification survived the crash and does not need to be reapplied.
Note that ARIES redoes updates for all transactions, including those that will be rolled back.
This is the "repeating history" principle.
It restores the database to the exact state it was in at the crash point, which simplifies reasoning about correctness and ensures that CLRs written before the crash are also reapplied.
Phase 3: Undo
function Undo(TT):
// Collect lastLSN of all uncommitted (loser) transactions
toUndo = max-heap of {lastLSN for each loser in TT}
while toUndo is not empty:
LSN = toUndo.pop() // largest LSN first
R = fetch log record at LSN
if R is a CLR:
if R.undoNextLSN != NULL:
toUndo.push(R.undoNextLSN)
else:
write END record for R.transactionID
else:
// R is a regular update: undo it
write CLR for R with undoNextLSN = R.prevLSN
apply the inverse operation to the page
if R.prevLSN != NULL:
toUndo.push(R.prevLSN)
else:
write END record for R.transactionID
Undo processes log records in reverse LSN order across all loser transactions simultaneously using a single backward pass.
When it encounters a CLR, it follows the undoNextLSN pointer, which skips over the range of records already compensated.
This is how ARIES guarantees bounded recovery even if the system crashes repeatedly during recovery.
Each CLR is itself redo-able (captured in Phase 2 on a subsequent recovery attempt), and the undoNextLSN chain ensures no log record is undone more than once across multiple crash-recovery cycles.
Checkpointing
Without checkpoints, recovery would need to replay the entire log from the beginning of time.
Fuzzy checkpoints bound recovery time by persisting snapshots of the transaction table and dirty page table.
The checkpoint process works as follows:
- Write a
BEGIN_CHECKPOINTrecord. - Capture a snapshot of the TT and DPT (these can change during the capture).
- Write an
END_CHECKPOINTrecord containing the captured TT and DPT. - Write the LSN of
BEGIN_CHECKPOINTto a special fixed location on disk (the master record).
Because the checkpoint does not force dirty pages to disk or block ongoing transactions, it is "fuzzy." The analysis phase corrects any inaccuracies introduced by concurrent activity during the checkpoint window.
Aggressive background page flushing reduces the gap between recLSN values in the DPT and the current log tail, which directly reduces redo work during recovery.
Practical Considerations
Group commit. Forcing the log to disk on every commit is expensive.
Systems batch multiple commit records into a single fsync call, amortizing the I/O cost across transactions.
This introduces a small delay before commit acknowledgment but dramatically improves throughput.
Log buffer management. The log buffer is a sequential, append-only region in memory.
Because log writes are sequential, they are well-suited to both HDDs and SSDs.
The critical path for commit latency is the time to flush the log buffer, not the time to write data pages.
Physiological logging. Most systems do not log pure physical (byte-level), pure logical (operation-level), or changes.
Instead, they use physiological logging: physical to the page (identifying the exact page) but logical within the page (describing the operation, such as "insert key X into slot 5").
This allows the page's internal layout to differ between the time the log record was written and the time it is replayed, as long as the page-level operation is deterministic.
Interaction with buffer management. The steal policy requires that the buffer manager check the log before evicting a dirty page.
Specifically, it must ensure all log records up to the page's pageLSN have been flushed.
This coordination between the buffer manager and the log manager is a subtle but essential correctness requirement.
Key Points
- Write-ahead logging requires that log records reach stable storage before the corresponding data page modifications, ensuring undo and redo information is always available after a crash.
- The steal/no-force buffer management policy maximizes flexibility and performance but requires a recovery algorithm capable of both undo and redo.
- ARIES recovery uses three phases (analysis, redo, and undo) and the "repeating history" principle to restore the database to its pre-crash state before rolling back incomplete transactions.
- Compensation Log Records (CLRs) with
undoNextLSNpointers guarantee bounded recovery time even if the system crashes repeatedly during the recovery process itself. - Fuzzy checkpoints limit recovery time without blocking transaction processing, and background page flushing further reduces the redo window.
- Physiological logging (physical to page, logical within page) balances log volume with flexibility, and in page layout management.
- The pageLSN stored on each database page is the definitive mechanism for determining whether a redo operation is necessary, enabling idempotent recovery.
References
C. Mohan, D. Haderle, B. Lindsay, H. Pirahesh, P. Schwarz. "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging." ACM Transactions on Database Systems, 17(1), March 1992.
Ramakrishnan, R. and Gehrke, J. "Database Management Systems." 3rd Edition, McGraw-Hill, 2003. Chapters 18-20.
Gray, J. and Reuter, A. "Transaction Processing: Concepts and Techniques." Morgan Kaufmann, 1993.
Hellerstein, J., Stonebraker, M., Hamilton, J. "Architecture of a Database System." Foundations and Trends in Databases, 1(2), 2007.