Distributed File Systems (GFS, HDFS)

Introduction

Large-scale data processing demands storage systems that can span thousands of machines while presenting a coherent abstraction to applications.
The Google File System (GFS) and the Hadoop Distributed File System (HDFS) are two foundational designs that shaped how the industry thinks about storing and retrieving data at scale.
GFS was described in a 2003 paper by Ghemawat, Gobioff, and Leung.
HDFS is its open-source descendant, built as part of the Apache Hadoop project.
Both systems make specific, deliberate trade-offs around consistency, availability, and performance that reflect the workloads they were designed for: large sequential reads and appends, batch analytics, and write-once-read-many access patterns.

Understanding these systems is valuable not because they represent the final word in distributed storage, but because their architectural decisions illustrate core distributed systems principles: replication, fault tolerance, metadata management, and the tension between consistency and throughput.

Architecture Overview

Single-Master Design

diagram-1 — GFS/HDFS single-master topology: metadata via master, data via chunkservers

Both GFS and HDFS use a single-master architecture.
In GFS, this node is called the master; in HDFS, it is the NameNode.
The master is responsible for all metadata operations: maintaining the namespace (directory tree), mapping files to their constituent chunks (or blocks), and tracking which chunk servers (DataNodes in HDFS) hold replicas of each chunk.

Data never flows through the master.
Clients contact the master to learn where chunks reside, then read and write data directly to the chunk servers.
This separation of the metadata path from the data path is critical.
It prevents the master from becoming a throughput bottleneck for large data transfers, limiting its role to relatively lightweight metadata lookups, and lease management.

The single-master model simplifies consistency reasoning but introduces a single point of failure.
GFS mitigated this with operation log replication and shadow masters that could serve read-only metadata in a degraded mode.
HDFS added a Secondary NameNode (which is a checkpointing helper, not a hot standby — despite its misleading name, it cannot take over if the primary NameNode fails) and later introduced a true HA architecture using a shared edit log on a Quorum Journal Manager (QJM), which is the recommended approach, or legacy shared NFS storage.

Chunks and Blocks

GFS splits files into fixed-size chunks of 64 MB.
HDFS originally also used 64 MB blocks (as described in the 2010 MSST paper), with modern Hadoop 2.x+ configurations defaulting to 128 MB blocks.
Each chunk or block is stored as a regular file on the local filesystem of the chunk server or DataNode.

Large chunk sizes reduce the amount of metadata the master must store and decrease the frequency of client-master interactions.
A single file of 1 TB, divided into 128 MB blocks, requires only ~8,000 metadata entries.
The trade-off is internal fragmentation for small files and the "small files problem" in HDFS, where millions of tiny files can exhaust NameNode heap memory.

Replication

diagram-3 — Cross-rack 3-replica placement: 1 local, 2 on a remote rack (different nodes)

Each chunk is replicated (typically three copies) across different machines, and ideally across different racks, to tolerate both machine and rack-level failures.
The master manages replica placement and initiates re-replication when it detects that a chunk is under-replicated (for instance, after a DataNode heartbeat timeout).

Rack-aware placement in HDFS follows a specific policy: the first replica goes on the local DataNode (or a random node if the client is not a DataNode), the second goes on a different rack, and the third goes on a different node within the same rack as the second.
This balances fault tolerance against cross-rack network traffic.

Write Path

GFS introduced a distinctive write pipeline.
Understanding it in detail clarifies how the system achieves high throughput while maintaining some degree of ordering.

Record Append in GFS

GFS supports a record append operation that is atomic and allows concurrent appenders.
The master grants a lease on a chunk to one of its replicas, designating it as the primary.
The primary serializes all mutations to the chunk, assigning them a sequential order.

Walkthrough

diagram-2 — GFS write protocol: lease grant, pipelined data push, primary-serialized apply

The following steps describe how a client performs a record append in GFS:

1. Client requests the master for the current lease holder (primary)
   and locations of all replicas for the target chunk.

2. Master responds with the identity of the primary and the
   secondary replicas. Client caches this information.

3. Client pushes data to ALL replicas in a pipelined fashion.
   Data flows along a chain of chunk servers chosen to minimize
   network distance. Each server forwards data to the next
   closest server that has not yet received it.
   Data is buffered in each replica's LRU buffer cache and has
   not yet been applied to the chunk file.

4. Once all replicas acknowledge receipt of the data, the client
   sends a write request to the primary.

5. Primary assigns a consecutive serial number to the mutation,
   applies it to its own state, and forwards the write order
   (with the serial number) to all secondaries.

6. Each secondary applies the mutation in serial-number order
   and acknowledges the primary.

7. Primary replies to the client.
   - If all replicas succeeded: success.
   - If any replica failed: the client retries.
     This can result in duplicate or partially written records
     at some replicas, which is why GFS provides "at-least-once"
     append semantics at the record level. Applications are
     expected to include checksums and unique IDs in records
     and filter duplicates on read.

This design decouples data flow (step 3, pipelined to all replicas) from control flow (steps 4-7, serialized through the primary).
This decoupling allows the network bandwidth to be fully utilized for data transfer while the primary handles ordering cheaply.

HDFS Write Pipeline

HDFS uses a similar pipeline for block writes but with a simpler model.
The client writes to the first DataNode, which forwards to the second, which forwards to the third.
Acknowledgments flow back in reverse order.
HDFS does not support concurrent appends to the same block from multiple writers; a file has a single writer that holds the lease.

Consistency Model

GFS has a relaxed consistency model, defined in terms of file region states after mutations:

After a successful single-writer mutation, the file region is defined: all replicas are identical and reflect the mutation.
After a failed mutation, the region is inconsistent: replicas may differ.
After concurrent successful record appends, the GFS paper describes regions as defined interspersed with inconsistent: each individual record append is written atomically at least once, but padding and duplicate records may appear, and the offsets at which a given record lands may differ across replicas. Applications must tolerate duplicates and gaps (typically by embedding checksums and unique IDs in records and filtering on read).

HDFS provides a somewhat stronger model for its primary use case.
Because a block has a single writer and HDFS uses a write pipeline with acknowledgment, successfully written blocks are consistent and defined.
HDFS does not support random writes to existing blocks; data is append-only or write-once.
This sidesteps many consistency complications.

Neither system provides the kind of strong consistency you would expect from a database.
Both are designed for batch workloads where occasional duplicates or retries are acceptable.

Fault Tolerance Mechanisms

Heartbeats and Lease Expiry

Chunk servers (DataNodes) send periodic heartbeats to the master (NameNode).
If a heartbeat is missed beyond a configured timeout, the master considers the server dead and schedules re-replication of its chunks.
Leases on chunks expire if the primary fails to renew them, allowing the master to grant a new lease to a different replica.

Checksumming

Both systems checksum data at the chunk/block level.
GFS stores 32-bit checksums for each 64 KB sub-block.
If a read returns corrupted data, the chunk server reports the error, the client retries from another replica, and the master schedules re-replication from a healthy replica.
HDFS uses CRC32 checksums per 512-byte segment by default.

Master/NameNode Recovery

The GFS master persists its operation log and checkpoints.
On recovery, it replays the log from the last checkpoint.
It does not persistently store chunk-to-server mappings; instead, it rebuilds this map by polling all chunk servers at startup.

HDFS's NameNode persists the namespace image (fsimage) and an edit log.
The Secondary NameNode (or Checkpoint Node) periodically merges the edit log into the fsimage to prevent unbounded log growth.
In HA configurations, a Standby NameNode replays edit log entries from the Quorum Journal Manager in near real-time and can take over within seconds.

Limitations and Evolution

The single-master design, while simple, has scaling limits.
The master must hold all metadata in RAM, bounding the number of files and blocks.
GFS internally evolved into Colossus, which distributes metadata across multiple servers.
HDFS federation allows multiple NameNodes to manage independent namespace volumes, sharing a pool of DataNodes.

Three-way replication carries a 200% storage overhead.
HDFS 3.x introduced erasure coding (Reed-Solomon codes), which — depending on configuration — can achieve comparable fault tolerance at roughly 50% overhead for typical deployments (e.g., RS-6-3, which tolerates 3 failures across 9 storage units), at the cost of increased CPU and network usage during recovery.
Configurations with fewer data and parity units will have higher overhead.

Both systems were designed for large sequential I/O.
They perform poorly for random reads of small records or low-latency access patterns.
Systems like HBase (built atop HDFS) or Bigtable (built atop GFS/Colossus) address those workloads by layering structured storage on top of the distributed file system.

Key Points

GFS and HDFS use a single-master architecture that separates metadata management from data transfer, preventing the master from becoming a data throughput bottleneck.
Large chunk sizes (64–128 MB) reduce metadata volume and client-master interaction frequency but create challenges for workloads with many small files.
Data is replicated (typically three copies) with rack-aware placement to tolerate both machine and rack failures.
GFS decouples data flow (pipelined push to all replicas) from control flow (serial ordering through the primary), maximizing network utilization.
The consistency model is relaxed: after concurrent successful record appends, regions are defined interspersed with inconsistent (padding and duplicates may appear), and applications must handle this by embedding checksums and unique IDs in records.
Fault tolerance relies on heartbeats, lease expiry, checksumming, operation log replay, and automatic re-replication of under-replicated chunks.
Scaling limitations of the single-master design led to distributed metadata systems (Colossus, HDFS Federation) and storage efficiency improvements like erasure coding.

References

Ghemawat, S., Gobioff, H., and Leung, S.T. "The Google File System." Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003.

Shvachko, K., Kuang, H., Radia, S., and Chansler, R. "The Hadoop Distributed File System." Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), 2010.

Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R.E. "Bigtable: A Distributed Storage System for Structured Data." Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.

White, T. "Hadoop: The Definitive Guide." O'Reilly Media, 4th Edition, 2015.