Introduction
Replication is the most intuitive way to achieve fault tolerance in distributed storage: keep three copies of every block, and you can survive two failures.
The cost is a 3x storage overhead.
For systems storing petabytes or exabytes of data, that multiplier translates directly into hardware budgets, power consumption, and operational complexity.
Erasure coding offers a fundamentally better trade-off.
By encoding data with redundant parity fragments, a system can tolerate the same number of failures as three-way replication while using significantly less raw storage — a (4, 2) code, for example, tolerates two failures at only 1.5x overhead, and a (10, 4) code tolerates four failures at roughly 1.4x overhead, compared to 3x for three-way replication that tolerates only two failures.
This is not a free lunch.
Erasure coding introduces computational overhead during reads and writes, complicates repair after failures, and changes the failure-domain calculus.
But for large-scale, read-heavy workloads, the economics are compelling.
Google, Facebook, Microsoft, and virtually every hyperscaler use erasure coding in their storage infrastructure.
Fundamentals
An erasure code takes k data blocks and produces n total blocks (n = k + m), where m is the number of parity blocks.
Any k of the n blocks are sufficient to reconstruct the original data.
The system tolerates up to m simultaneous block losses.
The code rate is k/n.
A (10, 4) code, for example, has a code rate of 10/14 ≈ 0.71, meaning storage overhead is roughly 1.4x.
Compare this to three-way replication at 3.0x overhead for tolerating two failures.
The (10, 4) code tolerates four failures at less than half the storage cost.
Reed-Solomon Codes
Reed-Solomon (RS) codes are the most widely deployed erasure codes in distributed storage.
They are Maximum Distance Separable (MDS), meaning they achieve the theoretically optimal trade-off between redundancy and fault tolerance: an (n, k) MDS code can recover from exactly n - k erasures, and no code can do better with the same overhead.
RS encoding operates over a finite field, typically GF(2^8) for byte-oriented storage.
The encoding process treats the k data blocks as coefficients of a polynomial of degree k - 1, then evaluates that polynomial at n distinct points.
The resulting n values form the encoded blocks.
Because any k points uniquely determine a polynomial of degree k - 1 (via Lagrange interpolation or Gaussian elimination), any k of the n encoded blocks suffice for reconstruction.
Beyond Reed-Solomon
RS codes are optimal in redundancy but expensive in repair.
When a single block fails, classic RS requires downloading k blocks to reconstruct it, even though only one block is missing.
The repair bandwidth is k times the block size, which can dominate network costs in large clusters.
This observation motivated a family of codes optimized for repair:
- Locally Repairable Codes (LRCs) add local parity blocks that cover small subsets of data blocks, allowing single-failure repair from a small group rather than the full stripe. Azure Storage uses a (12, 2, 2) LRC that tolerates three failures while requiring only fragments from a local group of six for most repairs.
- Regenerating codes minimize the total data transferred during repair by allowing helpers to send compact, coded sub-blocks rather than full blocks. Minimum Storage Regenerating (MSR) codes achieve the information-theoretic lower bound on repair bandwidth while maintaining MDS storage efficiency.
- Product codes and fountain codes (e.g., RaptorQ) trade MDS optimality for encoding/decoding speed, useful in streaming or CDN contexts.
Walkthrough
The following walkthrough illustrates Reed-Solomon encoding and single-block recovery for a (4, 2) code over GF(7) (a small prime field chosen for clarity).
Encoding
Parameters: k = 2 data blocks, m = 2 parity blocks, n = 4 total blocks
Field: GF(7), arithmetic mod 7
Evaluation points: α = [1, 2, 3, 4]
Step 1: Treat data blocks as polynomial coefficients.
Data: d0 = 3, d1 = 5
Polynomial: P(x) = 3 + 5x
Step 2: Evaluate P(x) at each of the n = 4 points.
P(1) = 3 + 5*1 = 8 mod 7 = 1
P(2) = 3 + 5*2 = 13 mod 7 = 6
P(3) = 3 + 5*3 = 18 mod 7 = 4
P(4) = 3 + 5*4 = 23 mod 7 = 2
Step 3: Distribute encoded blocks to 4 nodes.
Node 0: block = 1 (evaluation at x=1)
Node 1: block = 6 (evaluation at x=2)
Node 2: block = 4 (evaluation at x=3)
Node 3: block = 2 (evaluation at x=4)
Decoding (Recovery from 2 Erasures)
Suppose Node 1 and Node 2 fail. Surviving blocks:
(x=1, y=1) and (x=4, y=2)
Step 1: Solve for polynomial P(x) = a0 + a1*x using Lagrange interpolation.
L0(x) = (x - 4) / (1 - 4) = (x - 4) / (-3)
In GF(7): -3 ≡ 4, and 4^{-1} = 2 (since 4*2 = 8 ≡ 1 mod 7)
L0(x) = (x - 4) * 2 = 2x - 8 ≡ 2x + 6 (mod 7)
(since -8 ≡ -1 ≡ 6 mod 7)
L1(x) = (x - 1) / (4 - 1) = (x - 1) / 3
In GF(7): 3^{-1} = 5 (since 3*5 = 15 ≡ 1 mod 7)
L1(x) = (x - 1) * 5 = 5x - 5 ≡ 5x + 2 (mod 7)
(since -5 ≡ 2 mod 7)
Step 2: Reconstruct P(x).
P(x) = 1 * L0(x) + 2 * L1(x)
= 1*(2x + 6) + 2*(5x + 2)
= 2x + 6 + 10x + 4
= 12x + 10
≡ 5x + 3 (mod 7)
Step 3: Verify.
P(x) = 3 + 5x ✓ (matches original polynomial)
Recovered data: d0 = 3, d1 = 5
This process generalizes to arbitrary k and m.
In production systems, the field is GF(2^8) or GF(2^16), and encoding/decoding use matrix operations over Vandermonde or Cauchy matrices, often accelerated with SIMD instructions.
System Design Considerations
Stripe Size and Placement
An erasure-coded stripe of n blocks should be placed across n distinct failure domains (racks, availability zones, or regions).
The stripe width n directly constrains the minimum cluster size and affect data placement strategies.
Wider stripes improve storage efficiency but increase the coordination surface and the blast radius of correlated failures.
Write Path
Erasure coding is typically applied as a background process to already-replicated data.
HDFS, for instance, writes data with three-way replication and then converts cold data to erasure-coded format asynchronously.
This avoids the latency penalty of computing parity on the write path and simplifies the consistency model.
Systems that encode inline (such as Azure Storage and Ceph with BlueStore) must handle partial stripe writes carefully, often using a journaling or log-structured approach to ensure atomicity.
Read Path and Degraded Reads
Normal reads touch a single block on a single node.
When a node is unavailable, the system performs a degraded read: it fetches k blocks from surviving nodes, decodes the missing block, and returns the result.
This increases both latency and network traffic.
To bound tail latency, some systems issue speculative reads to k + δ nodes and use the first k responses (a technique used by Google's Colossus).
Repair
When a node fails permanently, the system must reconstruct its blocks and place them on a new node.
Repair is expensive: for a (14, 10) RS code, repairing one block requires reading 10 blocks — transferring 10 times that single block’s volume across the network.
In large clusters, background repair traffic can saturate network links.
LRCs mitigate this by reducing the number of blocks needed for most repairs.
Regenerating codes reduce the bytes transferred per helper.
Both come with trade-offs in code complexity and, in the case of LRCs, slightly higher storage overhead than pure MDS codes.
Practical Deployments
| System | Code | Parameters | Notes |
|---|---|---|---|
| HDFS (Hadoop 3.x) | RS, XOR | (6,3), (10,4) | Supports both RS and simple XOR |
| Azure Storage | LRC | (12,2,2) | Local + global parity |
| Google Colossus | RS | Varies | Inline encoding |
| Ceph | RS, LRC | Configurable | Plugin-based EC profiles |
| Facebook f4 | RS | (10,4) | Warm/cold BLOB storage |
Key Points
- Erasure coding achieves the same fault tolerance as replication at significantly lower storage overhead, typically 1.4x to 1.8x versus 3x for three-way replication.
- Reed-Solomon codes are MDS-optimal, meaning they extract the maximum fault tolerance for a given level of redundancy, but they impose high repair bandwidth costs.
- Locally Repairable Codes reduce repair cost by introducing local parity groups, at the expense of slightly higher storage overhead compared to pure MDS codes.
- Stripe placement must respect failure-domain boundaries (rack, zone, or region) to ensure that correlated failures do not exceed the code's tolerance.
- Degraded reads are substantially more expensive than normal reads, making erasure coding best suited for read-heavy, latency-tolerant, or cold-storage workloads.
- Most production systems apply erasure coding as a background transformation on cooled data, avoiding inline encoding overhead on the hot write path.
- The choice of code parameters (k, m, and locality) is a system-level trade-off among storage efficiency, repair cost, read amplification, and failure tolerance.
References
Reed, I.S. and Solomon, G. "Polynomial Codes over Certain Finite Fields." Journal of the Society for Industrial and Applied Mathematics, 8(2), 1960.
Huang, C., Simitci, H., Xu, Y., et al. "Erasure Coding in Windows Azure Storage." Proceedings of the USENIX Annual Technical Conference (ATC), 2012.
Rashmi, K.V., Shah, N.B., and Kumar, P.V. "Optimal Exact-Regenerating Codes for Distributed Storage at the MSR and MBR Points via a Product-Matrix Construction." IEEE Transactions on Information Theory, 57(8), 2011.
Sathiamoorthy, M., Asteris, M., Papailiopoulos, D., et al. "XORing Elephants: Novel Erasure Codes for Big Data." Proceedings of the VLDB Endowment, 6(5), 2013.