Building a correct crash recovery mechanism is one of the hardest problems in database engineering.
Building confidence that the mechanism actually works under real failure conditions is arguably harder.
A recovery protocol can be formally correct on paper, yet still lose data due to subtle bugs in implementation, unexpected filesystem behavior, or hardware that reorders writes.
Crash recovery testing and power-loss simulation exist to close this gap between theoretical correctness and production reliability.
Why Crash Recovery Is Difficult to Test
Database systems rely on write-ahead logging (WAL), checkpointing, and careful ordering of I/O operations to guarantee durability and atomicity.
The correctness of these mechanisms depends on assumptions about when data reaches stable storage.
These assumptions are routinely violated in practice:
- Filesystems may reorder, coalesce, or delay writes.
- Disk controllers with volatile write caches can acknowledge writes before they reach the platter or NAND.
- Operating systems may write back dirty pages in an order different from the one the application intended.
- Power loss can occur at any point during a multi-step I/O sequence, producing partially written pages (torn writes).
A crash recovery test must systematically explore these failure points and verify that the database can recover to a consistent state after each one.
Approaches to Crash Simulation
Software-Level Fault Injection
The most accessible approach intercepts I/O system calls and simulates failures at chosen points.
By replacing or wrapping write(), fsync(), fdatasync(), and rename(), a test harness can:
- Record every write operation along with its offset, size, and content.
- At a chosen point, simulate a crash by halting the process.
- Reconstruct a set of plausible on-disk states by replaying prefixes of the recorded write sequence.
- Attempt recovery on each reconstructed state and verify consistency.
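The four steps above can be sketched in a few lines of Python. This is a minimal in-memory model, not SQLite's harness: `RecordingFile`, `apply_writes`, and `prefix_states` are hypothetical names, and the sketch assumes writes persist strictly in issue order, which real storage does not guarantee.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingFile:
    """Hypothetical wrapper that records writes instead of touching a disk."""
    log: list = field(default_factory=list)  # (offset, data) in issue order

    def write(self, offset: int, data: bytes) -> None:
        self.log.append((offset, bytes(data)))


def apply_writes(initial: bytes, writes) -> bytes:
    """Apply (offset, data) writes to a copy of the initial image."""
    image = bytearray(initial)
    for offset, data in writes:
        end = offset + len(data)
        if end > len(image):
            image.extend(b"\x00" * (end - len(image)))  # grow the file
        image[offset:end] = data
    return bytes(image)


def prefix_states(initial: bytes, log):
    """Yield the image after 0, 1, ..., len(log) writes have persisted."""
    for n in range(len(log) + 1):
        yield n, apply_writes(initial, log[:n])


if __name__ == "__main__":
    f = RecordingFile()
    f.write(0, b"HDR1")
    f.write(4, b"data")
    for n, image in prefix_states(b"\x00" * 8, f.log):
        print(n, image)  # a real harness would run recovery on each image
```

A real harness would hand each reconstructed image to the database's recovery path rather than printing it.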
SQLite's crash testing infrastructure works this way.
Its test harness, sometimes called the "crash simulation" layer, intercepts VFS calls and systematically truncates the write sequence at every possible point.
After each simulated crash, the test reopens the database and runs integrity checks.
LMDB uses a simpler model. Because it relies on copy-on-write B-trees and never overwrites live data, its crash safety properties are easier to reason about, but the project still uses fault injection to confirm that partially written pages do not corrupt the stable copy of the tree.
Block-Level Record and Replay
A more realistic approach operates below the filesystem, capturing block-level writes to the storage device.
Tools like dm-log-writes in the Linux kernel sit between the filesystem and the block device as a device-mapper target.
Every block write and flush operation is logged to a separate device.
After capturing a workload, a test harness replays the log up to each flush boundary (or partway between flush boundaries) and mounts the resulting image to check for consistency.
This approach captures filesystem reordering behavior that application-level fault injection misses.
The key insight is that writes between two flush (FUA/FLUSH) operations can land in any order or only partially.
A thorough test must explore subsets and permutations of the writes within each flush interval.
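As a rough illustration of exploring a flush interval, the sketch below enumerates the block images a crash could leave behind. It assumes a hypothetical log format of `("write", block, data)` and `("flush",)` tuples, treats each write as a single atomic block, and is not the format used by dm-log-writes. Under that single-block assumption, applying every subset in issue order reaches the same final images as permuting the writes, so the sketch enumerates subsets only.

```python
from itertools import combinations


def split_intervals(log):
    """Split a log of ("write", block, data) / ("flush",) entries at flushes."""
    intervals, current = [], []
    for entry in log:
        if entry[0] == "flush":
            intervals.append(current)
            current = []
        else:
            current.append(entry)
    intervals.append(current)  # writes issued after the last flush
    return intervals


def crash_states(log):
    """Yield every block image reachable by a crash, as {block: data} dicts.

    Writes in completed flush intervals are treated as durable; any subset
    of the writes in the interval being crashed may have reached the device.
    """
    durable = {}
    for interval in split_intervals(log):
        for k in range(len(interval) + 1):
            for subset in combinations(interval, k):
                image = dict(durable)
                for _, block, data in subset:  # subset keeps issue order
                    image[block] = data
                yield image
        for _, block, data in interval:  # the flush makes the interval durable
            durable[block] = data
```

The number of states per interval is 2^W for W writes, which is why a practical harness bounds or samples the enumeration.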
Actual Power-Loss Testing
Some organizations use hardware rigs that physically cut power to storage devices under load.
The device is then reconnected, and the filesystem and database are checked for consistency.
This is the most realistic form of testing but also the least reproducible.
Results vary across device firmware versions, temperature, and even the electrical characteristics of the power cut.
Companies like Google, Meta, and storage device manufacturers maintain dedicated power-loss testing infrastructure.
The typical setup involves a relay or electronic switch controlled by a host machine, a device under test, and an orchestration script that runs a workload, cuts power at a random or targeted moment, and then verifies recovery.
Walkthrough
The following walkthrough describes a systematic crash recovery test using block-level write recording, modeled on the approach used by btrfs and other Linux filesystem developers with dm-log-writes.
Step 1: Set Up Write Logging
Create a device-mapper target that logs all writes and flushes from the filesystem to a log device.
dmsetup create log-writes --table \
"0 <size> log-writes /dev/target_dev /dev/log_dev"
Step 2: Run the Workload
Execute the database workload (inserts, updates, transactions with commits) against a filesystem mounted on the logged device.
Each fsync or fdatasync call by the database will typically cause the filesystem to issue a flush (or FUA write) to the block layer, which appears as a flush entry in the log.
Step 3: Identify Crash Points
Parse the log to enumerate all flush markers.
The interesting crash points are:
- Immediately before each flush marker (all writes in the interval are pending).
- Immediately after each flush marker (all preceding writes are durable).
- At each individual write within a flush interval (partial completion).
For a log with F flush markers and an average of W writes per interval, the number of crash points at flush boundaries is F, and the number of intra-interval crash points is F * W.
If testing write reordering, the space grows combinatorially, so practical testing samples from this space.
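To make the growth concrete, the sketch below computes the counts for given F and W and draws a fixed-size sample of intra-interval crash points. The function names and the uniform sampling strategy are illustrative assumptions, not part of any existing tool.

```python
import math
import random


def crash_point_counts(flushes: int, writes_per_interval: int) -> dict:
    """Count crash points for F flush markers and W writes per interval."""
    f, w = flushes, writes_per_interval
    return {
        "flush_boundaries": f,               # one per flush marker
        "intra_interval": f * w,             # one per individual write
        "subsets": f * 2 ** w,               # any subset of an interval may land
        "orderings": f * math.factorial(w),  # every order of a full interval
    }


def sample_crash_points(flushes, writes_per_interval, budget, seed=0):
    """Pick up to `budget` distinct (interval, write_index) pairs to replay."""
    rng = random.Random(seed)  # fixed seed keeps a failing run reproducible
    space = [(i, j) for i in range(flushes) for j in range(writes_per_interval)]
    return rng.sample(space, min(budget, len(space)))
```

For F = 100 and W = 8 this gives 100 boundary points and 800 intra-interval points, but 25,600 subsets and 4,032,000 orderings, which is why sampling becomes necessary once reordering is in scope.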
Step 4: Replay and Verify
For each selected crash point:
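# <mark> is a named mark recorded in the log, e.g. with `dmsetup message log-writes 0 mark <mark>`
replay-log --log /dev/log_dev --replay /dev/test_dev --end-mark <mark>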
mount /dev/test_dev /mnt/test
# Run database recovery
database --recover --data-dir /mnt/test
# Check consistency
database --check-integrity --data-dir /mnt/test
umount /mnt/test
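The same sequence can be driven from a script so that it runs once per crash point. The sketch below is one possible shape: the device paths and the `database` command are the placeholders used in this walkthrough, `commands_for` and `verify_crash_point` are hypothetical helper names, and the `run` parameter is injectable so the loop can be exercised without real devices.

```python
import subprocess

LOG_DEV = "/dev/log_dev"      # placeholder paths from the walkthrough
TEST_DEV = "/dev/test_dev"
MOUNT_POINT = "/mnt/test"


def commands_for(mark: str) -> list[list[str]]:
    """Build the replay/recover/check commands for one crash point."""
    return [
        ["replay-log", "--log", LOG_DEV, "--replay", TEST_DEV, "--end-mark", mark],
        ["mount", TEST_DEV, MOUNT_POINT],
        ["database", "--recover", "--data-dir", MOUNT_POINT],
        ["database", "--check-integrity", "--data-dir", MOUNT_POINT],
    ]


def verify_crash_point(mark: str, run=subprocess.run) -> bool:
    """Return True if every step exits with status 0; unmount if mounted."""
    mounted = False
    try:
        for argv in commands_for(mark):
            if run(argv).returncode != 0:
                return False
            if argv[0] == "mount":
                mounted = True
        return True
    finally:
        if mounted:
            run(["umount", MOUNT_POINT])
```

Unmounting in a `finally` block keeps one failed crash point from leaving the test device mounted and blocking the next replay.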
Step 5: Verify Consistency Invariants
The consistency check after recovery must verify:
- Atomicity and durability: every transaction whose commit was acknowledged before the crash point is fully present; no uncommitted transaction is partially present.
- Ordering: if transaction B was committed after transaction A, and B is present, then A must also be present.
- Structural integrity: all internal data structures (B-tree nodes, page headers, free-space maps) are well-formed.
- Referential integrity: no dangling pointers, no orphaned pages.
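# Pseudocode: replay, attempt_recovery, the check_* helpers, and FAIL stand in
# for harness-specific functions; FAIL is assumed to record the failure.
for crash_point in crash_points:
    disk_image = replay(log, crash_point)
    recovered_db = attempt_recovery(disk_image)
    if recovered_db is None:
        FAIL("Recovery failed at crash point", crash_point)
        continue  # nothing left to check if recovery itself failed
    if not check_atomicity(recovered_db, committed_txns):
        FAIL("Atomicity violation at crash point", crash_point)
    if not check_ordering(recovered_db, txn_order):
        FAIL("Ordering violation at crash point", crash_point)
    if not check_structural_integrity(recovered_db):
        FAIL("Structural corruption at crash point", crash_point)
    if not check_referential_integrity(recovered_db):
        FAIL("Referential integrity violation at crash point", crash_point)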
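The atomicity and ordering checks can be made concrete under a simplified model. The sketch below assumes each transaction inserts a non-empty set of unique row keys and never deletes or overwrites, that the harness knows which commits were acknowledged before the crash point, and that the recovered database can report the set of row keys it contains. The signatures are simplified for illustration and differ from the loop above.

```python
def check_atomicity(present: set, txn_rows: dict, acked: set) -> bool:
    """Acknowledged transactions must be complete; all others all-or-nothing."""
    for txn, rows in txn_rows.items():
        found = rows & present
        if txn in acked and found != rows:
            return False  # lost part or all of an acknowledged transaction
        if found and found != rows:
            return False  # partially applied transaction
    return True


def check_ordering(present: set, txn_rows: dict, commit_order: list) -> bool:
    """Surviving transactions must form a prefix of the commit order."""
    seen_missing = False
    for txn in commit_order:
        survived = txn_rows[txn] <= present
        if survived and seen_missing:
            return False  # a later commit survived while an earlier one did not
        seen_missing = seen_missing or not survived
    return True
```

Note that a transaction whose commit was in flight at the crash point may legitimately be fully present even though it was never acknowledged, so the atomicity check only rejects partial results for unacknowledged transactions.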
Handling Torn Writes and Sector Atomicity
Storage devices typically guarantee atomicity only at the sector level (512 bytes or 4096 bytes).
A database page is often 4 KiB, 8 KiB, or 16 KiB.
If power is lost during a page write, the resulting page can contain a mix of old and new data.
This is a torn write.
Databases handle torn writes in several ways:
- Double-write buffers (InnoDB): pages are written to a sequential buffer area first, then to their final location. Recovery can reconstruct any torn page from the double-write buffer.
- Full-page writes (PostgreSQL): after each checkpoint, the first modification to a page causes the entire page image to be written to WAL. Recovery replays the full page image, overwriting any torn version.
- Page checksums: a checksum stored in each page detects torn writes after the fact, so that recovery can discard the torn page and reconstruct it from WAL.
A thorough crash recovery test must simulate torn writes, not just missing writes.
This means replaying a partial write (e.g., only the first 512 bytes of a 16 KiB page) and confirming that recovery handles it correctly.
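A minimal sketch of generating such torn images follows. It assumes a 512-byte atomic sector, a page layout with a 4-byte CRC32 at the front (a hypothetical layout, not that of any particular database), and that sectors reach the device in ascending order; a fuller harness would also try other sector subsets.

```python
import zlib

SECTOR = 512  # assumed atomic write unit


def seal(payload: bytes) -> bytes:
    """Prefix a payload with a CRC32 so a torn page can be detected."""
    return zlib.crc32(payload).to_bytes(4, "little") + payload


def is_intact(page: bytes) -> bool:
    """Return True if the stored CRC32 matches the rest of the page."""
    return int.from_bytes(page[:4], "little") == zlib.crc32(page[4:])


def torn_variants(old: bytes, new: bytes, sector: int = SECTOR):
    """Yield pages where only the first k sectors of `new` reached the disk."""
    assert len(old) == len(new) and len(new) % sector == 0
    for k in range(1, len(new) // sector):
        cut = k * sector
        yield new[:cut] + old[cut:]


if __name__ == "__main__":
    page_size = 16 * 1024
    old = seal(b"A" * (page_size - 4))
    new = seal(b"B" * (page_size - 4))
    torn = list(torn_variants(old, new))
    print(len(torn), "torn images;", sum(map(is_intact, torn)), "pass the checksum")
```

Each torn image would be written into the replayed disk image in place of the page, after which recovery should either repair the page or report it, never silently accept it.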
Tooling in Practice
Several notable tools and frameworks exist for crash recovery testing:
- SQLite's crash simulation VFS: intercepts file I/O, explores all crash points, and runs integrity checks. This is one of the most mature and well-documented systems.
- dm-log-writes (Linux): kernel-level block write logging used by filesystem developers. The replay-log userspace tool reconstructs disk states.
- CrashMonkey (University of Texas at Austin): a systematic framework that explores reorderings of block writes within flush intervals, targeting filesystem crash consistency bugs.
- Jepsen: although primarily focused on distributed systems, its process-kill faults also exercise single-node crash recovery paths.
- ALICE (Application-Level Intelligent Crash Explorer): a tool from the University of Wisconsin-Madison that identifies crash vulnerabilities in applications by analyzing system-call traces and systematically constructing crash states.
Common Bugs Found
Crash recovery testing routinely finds bugs that are invisible to conventional testing:
- Missing fsync calls on parent directories after file creation or rename.
- Incorrect WAL replay that applies log records out of order.
- Recovery code that assumes page writes are atomic.
- Checkpointing logic that allows a checkpoint record to become durable before all the data it references.
- File truncation or extension operations that are not properly logged.
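The first bug in the list above is easy to illustrate. The sketch below shows one common way to replace a file durably on a POSIX system: sync the new contents, rename over the target, then fsync the parent directory so the rename itself survives a crash. The function name and the `.tmp` suffix are illustrative choices, and opening a directory with `os.open` is assumed to work as it does on Linux.

```python
import os


def durable_replace(path: str, data: bytes) -> None:
    """Atomically replace `path` with `data` and make the rename durable."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical naming scheme for the temp file

    # 1. Write the new contents to a temporary file and force them to disk.
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # 2. Atomically swap the temporary file into place.
    os.replace(tmp_path, path)

    # 3. fsync the parent directory; without this step the rename may be
    #    lost on power failure even though the file data was synced.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Omitting step 3 usually goes unnoticed in normal testing, which is why this class of bug tends to surface only under crash simulation.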
Key Points
- Crash recovery correctness cannot be verified by unit tests alone; it requires systematic exploration of possible on-disk states after failures.
- Block-level write recording (e.g., dm-log-writes) captures filesystem reordering behavior that application-level fault injection misses.
- Writes between flush operations can land in any order or only partially, and tests must explore subsets and permutations within each flush interval.
- Torn writes (partial page writes) are a distinct failure mode from missing writes and require dedicated test coverage.
- Real hardware power-loss testing is the most realistic but least reproducible method; it complements but does not replace systematic software simulation.
- Consistency verification after simulated recovery must check atomicity, ordering, structural integrity, and referential integrity.
- Crash testing tools like ALICE and CrashMonkey have found numerous real bugs in production filesystems and databases, demonstrating that ad hoc testing is insufficient.
References
Pillai, T. S., Chidambaram, V., Alagappan, R., Al-Kiswany, S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. "All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications." OSDI 2014.
Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., and Schwarz, P. "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging." ACM Transactions on Database Systems, 17(1), 1992.
Zheng, M., Tucek, J., Qin, F., and Lillibridge, M. "Understanding the Robustness of SSDs under Power Fault." FAST 2013.
Alagappan, R., Ganesan, A., Patel, Y., Pillai, T. S., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. "Correlated Crash Vulnerabilities." OSDI 2016.