Database systems make durability guarantees that must hold even when power is cut mid-write, kernels panic, or storage devices reorder operations.
Testing these guarantees is notoriously difficult.
A system can run for years without revealing a crash recovery bug, only to lose data during a real failure.
This article covers the techniques, tools, and methodologies used to systematically test crash recovery logic and simulate power-loss scenarios in storage systems.
Why Crash Recovery Bugs Are Hard to Find
Crash recovery correctness depends on the exact ordering and atomicity of writes to durable storage.
Modern storage stacks introduce multiple layers where reordering can occur: the filesystem, the block layer, the disk controller's volatile write cache, and even the drive firmware itself.
A crash recovery bug may only manifest when a failure happens at a precise point in a sequence of writes, and only when those writes have been partially persisted in a specific combination.
Conventional testing (unit tests, integration tests, and stress tests) exercises the normal execution path.
Crash recovery bugs live on the abnormal path: the state left on disk after an incomplete operation.
The combinatorial space of possible crash points and partial-write states is enormous, making brute-force exploration impractical without purpose-built tooling.
Approaches to Power-Loss Simulation
Physical Power-Fault Injection
The most direct approach is to repeatedly run a workload and physically cut power to the storage device.
Teams at companies like Google and Microsoft have used relay-controlled power supplies to automate this.
After each power cut, the system reboots, runs recovery, and a checker validates invariants.
Physical testing is valuable because it captures real device behavior, including firmware quirks, volatile cache flushes, and sector-tearing.
However, it is slow (each cycle involves a reboot), non-deterministic (you cannot precisely control which writes were in-flight), and not reproducible.
It works best as a final validation step rather than a primary development tool.
Block-Level Record and Replay
A more controlled technique intercepts block I/O operations, records the sequence of writes, and then systematically generates possible crash states by choosing subsets of pending writes that could have reached durable storage.
This approach is the foundation of tools like Alice (Application-Level Intelligent Crash Explorer), developed at the University of Wisconsin-Madison.
The key insight is that after a successful fsync or fdatasync call, all writes previously issued to the synced file must be durable.
Between sync points, any subset of the outstanding writes could have been persisted, and some individual writes may be only partially completed (torn writes).
By enumerating these possible "crash states," we can check whether recovery produces a consistent result for each one.
Filesystem-Level Simulation
Some tools operate at the filesystem level rather than the block level.
For example, dm-flakey in Linux is a device-mapper target that alternates between working and failing time intervals, dropping or corrupting writes during the failing periods.
LazyFS, a FUSE-based filesystem, intercepts POSIX calls and simulates a volatile page cache that can be "lost" on demand, mimicking a crash where write() calls succeeded but data never reached disk because no fsync was issued.
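The fix for the failure mode LazyFS targets is to force data through the page cache explicitly. A minimal sketch (the helper name is mine, not from any tool):

```python
import os

def write_durably(path: str, data: bytes) -> None:
    """Write data and force it out of the volatile page cache."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # without this line, a crash can lose data that write() "accepted"
    finally:
        os.close(fd)
```

Under a LazyFS-style test, dropping the `os.fsync` call is exactly what turns a passing run into a reproducible data-loss scenario.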
Model-Based Approaches
Formal approaches model the storage stack as a state machine and use specification languages with model checkers (such as TLA+ with TLC, or Alloy) to explore possible crash states exhaustively.
While these do not test the actual code, they can verify that a recovery protocol is correct by design before implementation begins.
Walkthrough
The following walkthrough describes the core algorithm used by crash-state exploration tools like Alice.
The goal is to take a recorded I/O trace and generate all valid crash states, then run recovery against each one.
Step 1: Record the I/O Trace
Run the target workload under a tracing layer that captures every block write and every sync operation.
Each entry in the trace contains:
- The logical block address (LBA) and data for each write
- The type of sync barrier (fsync, fdatasync, sync, write barriers)
- The ordering relationships imposed by the sync operations
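An in-memory representation of such a trace might look like the following; the field and type names are illustrative assumptions, not the format of any particular tool:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Union

class SyncKind(Enum):
    FSYNC = auto()
    FDATASYNC = auto()
    SYNC = auto()
    BARRIER = auto()

@dataclass
class WriteOp:
    lba: int      # logical block address the write targets
    data: bytes   # payload to be written at that address

@dataclass
class SyncOp:
    kind: SyncKind  # which barrier primitive produced this entry

# A trace is an ordered list of writes and sync barriers.
TraceEntry = Union[WriteOp, SyncOp]
```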
Step 2: Partition Writes Into Epochs
Group the writes into epochs separated by sync points.
All writes in epoch k are guaranteed durable before any write in epoch k+1 can be considered durable.
Epoch 0: [W1, W2, W3] <- before first fsync
--- fsync ---
Epoch 1: [W4, W5] <- between first and second fsync
--- fsync ---
Epoch 2: [W6, W7, W8] <- after second fsync, before crash
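The epoch split above can be sketched in a few lines; the tuple-based trace encoding here is an assumption for illustration:

```python
def partition_epochs(trace):
    """Split an ordered I/O trace into epochs delimited by sync points.

    `trace` entries are either ("write", op) or ("sync",).
    Returns a list of epochs, each a list of write ops.
    """
    epochs = [[]]
    for entry in trace:
        if entry[0] == "sync":
            epochs.append([])        # a sync point closes the current epoch
        else:
            epochs[-1].append(entry[1])
    return epochs
```

Applied to the trace above, this yields `[[W1, W2, W3], [W4, W5], [W6, W7, W8]]`.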
Step 3: Generate Candidate Crash States
For a crash occurring during epoch n:
- All writes from epochs 0 through n-1 are fully durable.
- For epoch n, any subset of writes may have reached disk.
- Each write in the subset may be fully persisted or torn (partially written).
function generate_crash_states(trace, crash_epoch):
    durable = all writes from epochs 0..crash_epoch-1
    pending = writes from epoch crash_epoch
    for each subset S of pending:
        for each write W in S:
            for each torn_variant T of W:   // full write, or torn at a sector boundary
                crash_state = apply(durable + (S - {W}) + {T}, base_image)
                yield crash_state
The number of subsets is exponential in the number of pending writes.
In practice, tools apply heuristics: they prioritize writes to metadata over data, test only "interesting" subset boundaries (e.g., prefix subsets respecting submission order), and limit torn-write simulation to sector-aligned tears.
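A runnable sketch of the pruned enumeration, combining the prefix-subset heuristic with sector-aligned tears (the encoding of writes as `(lba, data)` pairs is my own assumption):

```python
SECTOR = 512

def torn_variants(data: bytes):
    """Yield every sector-aligned torn prefix of a write, then the full write."""
    for end in range(SECTOR, len(data), SECTOR):
        yield data[:end]              # torn: only the first `end` bytes landed
    yield data                        # the fully persisted variant

def prefix_crash_states(epoch_writes):
    """Enumerate crash states under the prefix heuristic: writes are assumed
    to persist in submission order, so only prefixes of the epoch survive.
    Each write is an (lba, data) pair; a crash state maps lba -> bytes."""
    yield {}                          # crash before any pending write landed
    for i, (lba, data) in enumerate(epoch_writes):
        durable = dict(epoch_writes[:i])   # earlier writes fully persisted
        for variant in torn_variants(data):
            state = dict(durable)
            state[lba] = variant      # the last write may be torn or complete
            yield state
```

This reduces the state count from exponential in the epoch size to linear in the number of writes (times the number of tear points per write).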
Step 4: Validate Each Crash State
For each generated crash state:
- Mount the filesystem (or open the database) in recovery mode.
- Run the recovery procedure.
- Check application-level invariants. For a database, this might mean verifying that committed transactions are present and uncommitted transactions are fully rolled back. For a filesystem, it might mean checking that the directory tree is consistent.
function validate(crash_state, invariant_checker):
    disk_image = crash_state
    recovered = run_recovery(disk_image)
    if not invariant_checker(recovered):
        report_bug(crash_state)
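The outer driver loop that ties Steps 3 and 4 together can be sketched as follows; `run_recovery` and `check_invariants` are placeholders for the system under test:

```python
def explore(crash_states, run_recovery, check_invariants):
    """Run recovery on every candidate crash state; collect the failures."""
    violations = []
    for state in crash_states:
        recovered = run_recovery(state)
        if not check_invariants(recovered):
            violations.append(state)   # a reproducible failing disk image
    return violations
```

Each entry in `violations` is a complete disk image, so a reported bug can be replayed deterministically during debugging.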
Step 5: Report Violations
When a crash state leads to an invariant violation after recovery, the tool reports the exact sequence of writes, the subset that was persisted, and the specific invariant that was broken.
This gives developers a reproducible scenario for debugging.
Practical Tools and Frameworks
Alice (2014): Systematically explores crash vulnerabilities in applications by recording POSIX-level I/O and generating crash states.
It found numerous bugs in databases (LevelDB, LMDB), version control systems (Git, Mercurial), and other data-intensive applications.
CrashMonkey (2019): A kernel-level framework for testing Linux filesystem crash consistency.
It uses dm-log-writes to record block I/O, then generates crash states and checks filesystem invariants using custom checkers.
dm-flakey / dm-log-writes: Linux device-mapper targets. dm-flakey drops or corrupts I/O during configured failure intervals to simulate an unreliable device. dm-log-writes records all block writes so they can be replayed to construct crash states.
LazyFS: A FUSE filesystem that decouples write() from persistence, explicitly modeling the volatile page cache.
Useful for testing applications that may rely on implicit durability guarantees the kernel does not actually provide.
Jepsen: While primarily focused on distributed systems consistency, Jepsen's methodology of injecting faults and checking invariants shares the same philosophical foundation.
Some Jepsen tests include process kills (simulating crashes) and verify single-node recovery.
Common Bugs Found by Crash Testing
Several categories of bugs recur across systems:
Missing fsync on directory entries. On many Linux filesystems, creating a file and fsyncing it does not guarantee the directory entry is durable.
After a crash, the file may not exist.
Many applications, including early versions of LevelDB, had this bug.
Incorrect write ordering assumptions. Applications sometimes assume that writes are persisted in submission order.
Without explicit barriers, the storage stack may reorder them, leading to states where metadata points to uninitialized data blocks.
Incomplete WAL protocol implementation. A write-ahead log must ensure that the log record is durable before the corresponding data page modification is written.
Bugs in barrier placement can cause the data page to be durable without the log record, making rollback impossible.
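The correct barrier placement can be shown in a minimal sketch; this is not a full ARIES implementation, and the log record format here is invented for illustration:

```python
import json
import os

def commit(log_fd: int, page_file: str, page_no: int, new_page: bytes,
           page_size: int = 4096) -> None:
    """Write-ahead rule: the log record must be durable before the page."""
    header = json.dumps({"page": page_no, "len": len(new_page)}).encode() + b"\n"
    os.write(log_fd, header + new_page)
    os.fsync(log_fd)                       # barrier: log record durable FIRST
    pfd = os.open(page_file, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.pwrite(pfd, new_page, page_no * page_size)
        os.fsync(pfd)                      # only now may the page be durable
    finally:
        os.close(pfd)
```

Swapping the two fsync-delimited phases (or omitting the first fsync) is precisely the barrier-placement bug described above: a crash can leave the page durable with no log record to undo it.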
Torn writes on non-atomic boundaries. Even when a write is logically a single operation, it may span multiple disk sectors.
A crash mid-write produces a block that is half old data and half new data.
Systems that do not use checksums or double-write buffers to detect this can silently process corrupted pages.
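A per-page checksum is the simplest detection mechanism. A sketch of the idea (the 4-byte CRC header layout is an assumption, not any specific system's page format):

```python
import struct
import zlib

PAGE_SIZE = 4096
HEADER = struct.Struct("<I")  # 4-byte CRC32 stored at the start of the page

def seal_page(payload: bytes) -> bytes:
    """Prepend a CRC32 of the payload so a torn write is detectable on read."""
    assert len(payload) == PAGE_SIZE - HEADER.size
    return HEADER.pack(zlib.crc32(payload)) + payload

def page_is_intact(page: bytes) -> bool:
    """Recompute the checksum; a torn (half old, half new) page will mismatch."""
    (stored_crc,) = HEADER.unpack_from(page)
    return stored_crc == zlib.crc32(page[HEADER.size:])
```

On read, a mismatch means the page was torn (or otherwise corrupted) and recovery must fall back to the log or a double-write buffer rather than trust the page contents.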
Limitations of Simulation
No simulation perfectly captures real hardware behavior.
Disk firmware may have undocumented write reordering.
SSDs may lose data from volatile DRAM caches even if they report write completion.
Some NVMe drives have been found to violate the flush semantics they advertise.
Simulation tools typically operate under a model of storage behavior (e.g., "writes are atomic at the sector level, reorderable within an epoch").
If the real device violates the model, bugs can still slip through.
This is why mature storage systems combine simulation-based testing with physical power-fault injection.
Key Points
- Crash recovery bugs only manifest under specific partial-write states that normal testing never exercises, requiring specialized fault injection and state exploration tools.
- Block I/O traces partitioned into sync-delimited epochs form the basis for systematically generating possible crash states.
- The combinatorial explosion of possible crash states requires heuristics such as prefix-based subset selection and sector-aligned torn-write simulation to keep exploration tractable.
- Missing fsync calls (especially on directory entries) and incorrect write ordering assumptions are among the most common bugs found by crash testing tools.
- Simulation tools operate under a model of storage behavior; real hardware may violate these models, so physical power-fault injection remains a necessary complement.
- Tools like Alice, CrashMonkey, and LazyFS provide practical, automatable frameworks for crash recovery testing at different layers of the storage stack.
References
Pillai, T.S., Chidambaram, V., Alagappan, R., Al-Kiswany, S., Arpaci-Dusseau, A.C., and Arpaci-Dusseau, R.H. "All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications." Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., and Schwarz, P. "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging." ACM Transactions on Database Systems, 17(1), 1992.
Alagappan, R., Arun, V., Pillai, T.S., Chidambaram, V., Arpaci-Dusseau, A.C., and Arpaci-Dusseau, R.H. "Protocol-Aware Recovery for Consensus-Based Storage." Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST), 2018.
Mohan, J., Martinez, A., Ponnapalli, S., Raju, P., and Chidambaram, V. "CrashMonkey and ACE: Systematically Testing File-System Crash Consistency." ACM Transactions on Storage, 15(2), 2019.