Erasure Coding for Distributed Storage

Introduction

Replication is the most intuitive way to achieve fault tolerance in distributed storage: keep three copies of every block, and you can survive two failures.
The cost is a 3x storage overhead.
For systems storing petabytes or exabytes of data, that multiplier translates directly into hardware budgets, power consumption, and operational complexity.

Erasure coding offers a fundamentally better trade-off.
By encoding data with redundant parity fragments, a system can tolerate the same number of failures as three-way replication while using significantly less raw storage — a (4, 2) code, for example, tolerates two failures at only 1.5x overhead, and a (10, 4) code tolerates four failures at roughly 1.4x overhead, compared to 3x for three-way replication that tolerates only two failures.
This is not a free lunch.
Erasure coding introduces computational overhead during reads and writes, complicates repair after failures, and changes the failure-domain calculus.
But for large-scale, read-heavy workloads, the economics are compelling.
Google, Facebook, Microsoft, and virtually every hyperscaler use erasure coding in their storage infrastructure.

Fundamentals

diagram-1 — Encoder producing n blocks and reconstruction from any k blocks

An erasure code takes k data blocks and produces n total blocks (n = k + m), where m is the number of parity blocks.
Any k of the n blocks are sufficient to reconstruct the original data.
The system tolerates up to m simultaneous block losses.

The code rate is k/n.
A (10, 4) code, for example, has a code rate of 10/14 ≈ 0.71, meaning storage overhead is roughly 1.4x.
Compare this to three-way replication at 3.0x overhead for tolerating two failures.
The (10, 4) code tolerates four failures at less than half the storage cost.

Reed-Solomon Codes

Reed-Solomon (RS) codes are the most widely deployed erasure codes in distributed storage.
They are Maximum Distance Separable (MDS), meaning they achieve the theoretically optimal trade-off between redundancy and fault tolerance: an (n, k) MDS code can recover from exactly n - k erasures, and no code can do better with the same overhead.

RS encoding operates over a finite field, typically GF(2^8) for byte-oriented storage.
The encoding process treats the k data blocks as coefficients of a polynomial of degree k - 1, then evaluates that polynomial at n distinct points.
The resulting n values form the encoded blocks.
Because any k points uniquely determine a polynomial of degree k - 1 (via Lagrange interpolation or Gaussian elimination), any k of the n encoded blocks suffice for reconstruction.

Beyond Reed-Solomon

RS codes are optimal in redundancy but expensive in repair.
When a single block fails, classic RS requires downloading k blocks to reconstruct it, even though only one block is missing.
The repair bandwidth is k times the block size, which can dominate network costs in large clusters.

This observation motivated a family of codes optimized for repair:

Locally Repairable Codes (LRCs) add local parity blocks that cover small subsets of data blocks, allowing single-failure repair from a small group rather than the full stripe. Azure Storage uses a (12, 2, 2) LRC that tolerates three failures while requiring only fragments from a local group of six for most repairs.
Regenerating codes minimize the total data transferred during repair by allowing helpers to send compact, coded sub-blocks rather than full blocks. Minimum Storage Regenerating (MSR) codes achieve the information-theoretic lower bound on repair bandwidth while maintaining MDS storage efficiency.
Product codes and fountain codes (e.g., RaptorQ) trade MDS optimality for encoding/decoding speed, useful in streaming or CDN contexts.

Walkthrough

The following walkthrough illustrates Reed-Solomon encoding and single-block recovery for a (4, 2) code over GF(7) (a small prime field chosen for clarity).

Encoding

Parameters: k = 2 data blocks, m = 2 parity blocks, n = 4 total blocks
Field: GF(7), arithmetic mod 7
Evaluation points: α = [1, 2, 3, 4]

Step 1: Treat data blocks as polynomial coefficients.
   Data: d0 = 3, d1 = 5
   Polynomial: P(x) = 3 + 5x

Step 2: Evaluate P(x) at each of the n = 4 points.
   P(1) = 3 + 5*1 = 8  mod 7 = 1
   P(2) = 3 + 5*2 = 13 mod 7 = 6
   P(3) = 3 + 5*3 = 18 mod 7 = 4
   P(4) = 3 + 5*4 = 23 mod 7 = 2

Step 3: Distribute encoded blocks to 4 nodes.
   Node 0: block = 1  (evaluation at x=1)
   Node 1: block = 6  (evaluation at x=2)
   Node 2: block = 4  (evaluation at x=3)
   Node 3: block = 2  (evaluation at x=4)

Decoding (Recovery from 2 Erasures)

Suppose Node 1 and Node 2 fail. Surviving blocks:
   (x=1, y=1) and (x=4, y=2)

Step 1: Solve for polynomial P(x) = a0 + a1*x using Lagrange interpolation.

   L0(x) = (x - 4) / (1 - 4) = (x - 4) / (-3)
          In GF(7): -3 ≡ 4, and 4^{-1} = 2 (since 4*2 = 8 ≡ 1 mod 7)
          L0(x) = (x - 4) * 2 = 2x - 8 ≡ 2x + 6  (mod 7)
          (since -8 ≡ -1 ≡ 6 mod 7)

   L1(x) = (x - 1) / (4 - 1) = (x - 1) / 3
          In GF(7): 3^{-1} = 5 (since 3*5 = 15 ≡ 1 mod 7)
          L1(x) = (x - 1) * 5 = 5x - 5 ≡ 5x + 2  (mod 7)
          (since -5 ≡ 2 mod 7)

Step 2: Reconstruct P(x).
   P(x) = 1 * L0(x) + 2 * L1(x)
         = 1*(2x + 6) + 2*(5x + 2)
         = 2x + 6 + 10x + 4
         = 12x + 10
         ≡ 5x + 3  (mod 7)

Step 3: Verify.
   P(x) = 3 + 5x  ✓  (matches original polynomial)
   Recovered data: d0 = 3, d1 = 5

This process generalizes to arbitrary k and m.
In production systems, the field is GF(2^8) or GF(2^16), and encoding/decoding use matrix operations over Vandermonde or Cauchy matrices, often accelerated with SIMD instructions.

System Design Considerations

Stripe Size and Placement

diagram-2 — Stripe blocks placed across distinct failure domains (racks)

An erasure-coded stripe of n blocks should be placed across n distinct failure domains (racks, availability zones, or regions).
The stripe width n directly constrains the minimum cluster size and affect data placement strategies.
Wider stripes improve storage efficiency but increase the coordination surface and the blast radius of correlated failures.

Write Path

Erasure coding is typically applied as a background process to already-replicated data.
HDFS, for instance, writes data with three-way replication and then converts cold data to erasure-coded format asynchronously.
This avoids the latency penalty of computing parity on the write path and simplifies the consistency model.
Systems that encode inline (such as Azure Storage and Ceph with BlueStore) must handle partial stripe writes carefully, often using a journaling or log-structured approach to ensure atomicity.

Read Path and Degraded Reads

Normal reads touch a single block on a single node.
When a node is unavailable, the system performs a degraded read: it fetches k blocks from surviving nodes, decodes the missing block, and returns the result.
This increases both latency and network traffic.
To bound tail latency, some systems issue speculative reads to k + δ nodes and use the first k responses (a technique used by Google's Colossus).

Repair

When a node fails permanently, the system must reconstruct its blocks and place them on a new node.
Repair is expensive: for a (14, 10) RS code, repairing one block requires reading 10 blocks — transferring 10 times that single block’s volume across the network.
In large clusters, background repair traffic can saturate network links.

LRCs mitigate this by reducing the number of blocks needed for most repairs.
Regenerating codes reduce the bytes transferred per helper.
Both come with trade-offs in code complexity and, in the case of LRCs, slightly higher storage overhead than pure MDS codes.

Practical Deployments

System	Code	Parameters	Notes
HDFS (Hadoop 3.x)	RS, XOR	(6,3), (10,4)	Supports both RS and simple XOR
Azure Storage	LRC	(12,2,2)	Local + global parity
Google Colossus	RS	Varies	Inline encoding
Ceph	RS, LRC	Configurable	Plugin-based EC profiles
Facebook f4	RS	(10,4)	Warm/cold BLOB storage

Key Points

Erasure coding achieves the same fault tolerance as replication at significantly lower storage overhead, typically 1.4x to 1.8x versus 3x for three-way replication.
Reed-Solomon codes are MDS-optimal, meaning they extract the maximum fault tolerance for a given level of redundancy, but they impose high repair bandwidth costs.
Locally Repairable Codes reduce repair cost by introducing local parity groups, at the expense of slightly higher storage overhead compared to pure MDS codes.
Stripe placement must respect failure-domain boundaries (rack, zone, or region) to ensure that correlated failures do not exceed the code's tolerance.
Degraded reads are substantially more expensive than normal reads, making erasure coding best suited for read-heavy, latency-tolerant, or cold-storage workloads.
Most production systems apply erasure coding as a background transformation on cooled data, avoiding inline encoding overhead on the hot write path.
The choice of code parameters (k, m, and locality) is a system-level trade-off among storage efficiency, repair cost, read amplification, and failure tolerance.

References

Reed, I.S. and Solomon, G. "Polynomial Codes over Certain Finite Fields." Journal of the Society for Industrial and Applied Mathematics, 8(2), 1960.

Huang, C., Simitci, H., Xu, Y., et al. "Erasure Coding in Windows Azure Storage." Proceedings of the USENIX Annual Technical Conference (ATC), 2012.

Rashmi, K.V., Shah, N.B., and Kumar, P.V. "Optimal Exact-Regenerating Codes for Distributed Storage at the MSR and MBR Points via a Product-Matrix Construction." IEEE Transactions on Information Theory, 57(8), 2011.

Sathiamoorthy, M., Asteris, M., Papailiopoulos, D., et al. "XORing Elephants: Novel Erasure Codes for Big Data." Proceedings of the VLDB Endowment, 6(5), 2013.