Introduction
Large-scale data processing demands storage systems that can span thousands of machines while presenting a coherent abstraction to applications.
The Google File System (GFS) and the Hadoop Distributed File System (HDFS) are two foundational distributed file systems that solved this problem for batch-oriented, data-intensive workloads.
GFS was designed internally at Google to support its search infrastructure and MapReduce pipelines.
HDFS, an open-source implementation inspired heavily by GFS, became the storage layer for the Apache Hadoop ecosystem.
Both systems share a core set of design decisions shaped by pragmatic observations about hardware failure rates, workload characteristics, and the economics of commodity servers.
Understanding these systems provides a baseline vocabulary for reasoning about modern distributed storage, from cloud object stores to purpose-built analytics engines.
Architecture
Single-Master Design
Both GFS and HDFS use a single-master architecture.
In GFS, this node is called the master; in HDFS, it is the NameNode.
The master is responsible for all metadata operations: managing the file namespace, controlling access, and tracking the mapping of files to their constituent chunks (GFS) or blocks (HDFS) across storage nodes.
Storage nodes are called chunkservers in GFS and DataNodes in HDFS.
These nodes store the actual data on local disks and serve read/write requests directly to clients.
A critical design choice is that data never flows through the master.
The master only provides the client with the locations of relevant chunks, and the client then communicates directly with the chunkservers.
This separation of metadata and data paths prevents the master from becoming a throughput bottleneck for data transfer.
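As a minimal sketch of this split (hypothetical class, path, and server names), the master answers only small location queries, and the client then fetches bytes directly from a chunkserver:

```python
class Master:
    """Holds only metadata: which servers hold which chunk (toy version)."""
    def __init__(self):
        # (file path, chunk index) -> chunkserver IDs holding replicas
        self.locations = {("/logs/web.log", 0): ["cs-1", "cs-4", "cs-7"]}

    def lookup(self, path, chunk_index):
        return self.locations[(path, chunk_index)]

class Chunkserver:
    """Stores the actual bytes and serves reads directly to clients."""
    def __init__(self, chunks):
        self.chunks = chunks  # chunk handle -> bytes on local disk

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

master = Master()
servers = {"cs-1": Chunkserver({("/logs/web.log", 0): b"GET /index.html 200\n"})}

# Control path: one tiny metadata RPC to the master.
replicas = master.lookup("/logs/web.log", 0)
# Data path: bulk bytes flow straight from a chunkserver, never via the master.
data = servers[replicas[0]].read(("/logs/web.log", 0), 0, 19)
```

Because the master handles only the small `lookup` message, its capacity bounds metadata throughput, not data throughput.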
The single-master design simplifies consistency reasoning and metadata management significantly, but it introduces a single point of failure.
GFS addressed this with operation log replication and shadow masters.
HDFS's original Secondary NameNode, despite its name, was not a failover target: it only merged the edit log into periodic checkpoints to bound recovery time. True high availability arrived in HDFS 2.x with the Standby NameNode, which tails a shared edit log (backed by a quorum journal manager or NFS) and can take over from the active NameNode.

Chunk/Block Storage
Files are divided into fixed-size segments.
GFS uses 64 MB chunks; HDFS uses 128 MB blocks by default (originally 64 MB).
These sizes are much larger than typical filesystem block sizes (4 KB) and are chosen deliberately for workloads dominated by large sequential reads and appends.
Large chunk sizes reduce the amount of metadata the master must store, since fewer chunks are needed per file.
They also reduce the frequency of client-master interactions: a client performing a sequential read of a large file needs to look up chunk locations infrequently.
On the storage nodes, large chunks mean that most I/O operations within a chunk are sequential, which is efficient even on spinning disks.
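The metadata savings are easy to quantify. A rough calculation (illustrative arithmetic only) for a single 1 TiB file:

```python
TIB = 2**40

def chunks_needed(file_size, chunk_size):
    # Ceiling division: a partial trailing chunk still needs a metadata entry.
    return -(-file_size // chunk_size)

gfs_entries = chunks_needed(TIB, 64 * 2**20)    # 64 MiB chunks -> 16,384 entries
posix_entries = chunks_needed(TIB, 4 * 2**10)   # 4 KiB blocks  -> 268,435,456 entries
```

Four orders of magnitude fewer entries per file is what makes an all-in-memory metadata design feasible at scale.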
Each chunk is replicated across multiple chunkservers (default replication factor of 3 in both systems).
Replicas are placed with awareness of rack topology to tolerate both node and rack failures.
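A sketch in the spirit of HDFS's default placement policy (hypothetical data shapes; the real policy also weighs load and free space): the first replica goes near the writer, the second to a different rack, and the third to a different node on the second replica's rack.

```python
import random

def place_replicas(writer, topology):
    """topology: rack name -> list of node names. Returns 3 replica nodes."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    first = writer                                   # replica 1: the writer's node
    remote_racks = [r for r in topology if r != rack_of[writer]]
    second_rack = random.choice(remote_racks)        # replica 2: a different rack,
    second = random.choice(topology[second_rack])    #   surviving a rack failure
    third = random.choice(                           # replica 3: same rack as 2,
        [n for n in topology[second_rack] if n != second])  # different node
    return [first, second, third]
```

This layout survives the loss of any single node or any single rack while keeping two of the three replicas on one rack to limit cross-rack write traffic.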
Metadata Management
The master holds all metadata in memory for fast lookups.
This includes the file-to-chunk mapping, chunk locations, and the namespace tree.
GFS keeps the namespace as a flat lookup table with pathname keys protected by read-write locks, rather than a traditional directory tree with per-directory data structures.
Persistence is achieved through an operation log that records all metadata mutations.
This log is the canonical record of metadata state.
On startup, the master replays the log to reconstruct in-memory state.
Periodic checkpoints (stored as a compact B-tree-like structure in GFS) truncate the log and speed up recovery.
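A sketch of this log-plus-checkpoint scheme (assumed operation shapes; GFS's actual checkpoint is a compact B-tree-like image, not a dictionary):

```python
class MetadataStore:
    def __init__(self):
        self.namespace = {}   # path -> list of chunk handles
        self.log = []         # operation log since the last checkpoint

    def apply(self, op):
        kind, path, handle = op
        if kind == "create":
            self.namespace[path] = []
        elif kind == "add_chunk":
            self.namespace[path].append(handle)

    def mutate(self, op):
        self.log.append(op)   # record in the log *before* mutating state
        self.apply(op)

    def checkpoint(self):
        # A checkpoint captures current state so the log can be truncated.
        snapshot = {p: list(h) for p, h in self.namespace.items()}
        self.log.clear()
        return snapshot

    @classmethod
    def recover(cls, snapshot, log_tail):
        store = cls()
        store.namespace = {p: list(h) for p, h in snapshot.items()}
        for op in log_tail:   # replay only mutations after the checkpoint
            store.apply(op)
        return store
```

Recovery cost is proportional to the log tail since the last checkpoint, not to the full mutation history.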
Chunk location information is not persisted by the master.
Instead, each chunkserver reports its chunks to the master via heartbeat messages at startup and periodically thereafter.
This eliminates consistency problems between the master's persisted view and the actual state of chunkservers, which can change independently due to disk failures or restarts.
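The rebuild is conceptually a fold over heartbeat reports (hypothetical data shapes); nothing in the location table comes from the master's disk:

```python
def rebuild_locations(heartbeats):
    """heartbeats: iterable of (chunkserver_id, [chunk handles it holds]).
    Returns chunk handle -> set of servers currently holding a replica."""
    locations = {}
    for server, handles in heartbeats:
        for handle in handles:
            locations.setdefault(handle, set()).add(server)
    return locations
```

Since each chunkserver has the final word on what it actually stores, the master's view can never be stale in a way its own persistence would have to reconcile.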
Walkthrough
Write Path in GFS
The GFS write path for record appends (the most common write operation) illustrates several key design decisions.
Below is a step-by-step walkthrough of how a client appends data to a file.
1. Client requests chunk locations from the Master.
   - Master identifies the last chunk of the file.
   - If no lease exists, Master grants a lease to one replica, designating it as the PRIMARY.
   - Master returns the identities of the primary and secondary replicas to the client.
2. Client pushes data to ALL replicas.
   - Data is pushed in a pipelined, chain-like fashion: client -> closest replica -> next closest -> ...
   - Data is buffered in each chunkserver's LRU cache.
   - Note: data flow is decoupled from control flow.
3. Client sends a write request to the PRIMARY.
   - Primary assigns a consecutive serial number to the mutation.
   - Primary applies the mutation to its own local state.
4. Primary forwards the write request to all SECONDARY replicas.
   - Each secondary applies the mutation in serial-number order.
5. Secondaries acknowledge completion to the primary.
6. Primary replies to the client.
   - If any secondary failed, the client retries.
   - Partial failures can leave replicas in inconsistent states (some replicas have the record, some do not).
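The steps above can be condensed into a toy simulation (hypothetical classes; the pipelined data push, lease management, and failure retries are elided):

```python
class Replica:
    def __init__(self):
        self.buffer = {}    # mutation id -> data pushed in step 2
        self.applied = []   # (serial number, data), in serial-number order

    def push(self, mut_id, data):
        self.buffer[mut_id] = data                              # step 2: data flow only

    def apply(self, serial, mut_id):
        self.applied.append((serial, self.buffer.pop(mut_id)))  # steps 3-4

def record_append(primary, secondaries, mut_id, data):
    for replica in [primary, *secondaries]:
        replica.push(mut_id, data)     # step 2: push data to every replica
    serial = len(primary.applied)      # step 3: the primary picks the order
    primary.apply(serial, mut_id)
    for sec in secondaries:            # step 4: secondaries apply in that order
        sec.apply(serial, mut_id)
    return serial                      # steps 5-6: acknowledgements elided

primary, sec1, sec2 = Replica(), Replica(), Replica()
record_append(primary, [sec1, sec2], "m1", b"rec-A")
record_append(primary, [sec1, sec2], "m2", b"rec-B")
```

Even with concurrent clients, every replica applies the same mutations in the same serial-number order, which is exactly the total order the primary's lease makes possible.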
A few details deserve emphasis.
The decoupling of data flow from control flow (step 2 vs. step 3) allows the network to be utilized efficiently.
Data is pushed along a chain that follows network topology, saturating each link in sequence, rather than having the primary fan out data to all replicas simultaneously.
The primary serializes all mutations through serial number assignment, which establishes a total order across concurrent appends to the same chunk.
Consistency Model
GFS has a relaxed consistency model, weaker than what most POSIX filesystems provide.
After a successful record append, GFS guarantees only that the data is written atomically at least once at some offset.
Different replicas of the same chunk may contain duplicate records or padding from failed appends.
In GFS terminology, a file region is "consistent" if all clients see the same data regardless of which replica they read, and "defined" if, in addition, clients see exactly what one mutation wrote in its entirety. Serial successful mutations leave a region defined; concurrent successful writes can leave it consistent but undefined (all replicas agree, but the data mingles fragments from multiple clients); a failed mutation leaves it inconsistent.
Applications are expected to handle this.
Google's libraries use record-level checksums and unique identifiers to detect and discard duplicates and corrupted fragments.
This was a deliberate tradeoff: stronger consistency would have required more complex protocols and reduced throughput for the append-heavy workloads GFS targeted.
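A sketch of the reader-side defense described above (the record framing and use of CRC32 are assumptions for illustration; GFS left the exact format to application libraries):

```python
import zlib

def clean_records(records):
    """records: iterable of (unique record id, payload bytes, 32-bit checksum).
    Drops corrupted fragments/padding and duplicates from retried appends."""
    seen, out = set(), []
    for rec_id, payload, checksum in records:
        if zlib.crc32(payload) != checksum:
            continue               # corrupted fragment or padding: discard
        if rec_id in seen:
            continue               # duplicate from an at-least-once retry
        seen.add(rec_id)
        out.append(payload)
    return out
```

With idempotent, self-validating records, at-least-once append semantics become at-most-once from the application's point of view.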
HDFS initially offered a simpler write model (single-writer semantics, write-once-read-many) that avoided some of these complexities.
A file in HDFS can only be written by one client at a time, and once closed, it is immutable.
This simplification made its consistency properties easier to reason about for most Hadoop workloads.
Fault Tolerance and Replication
Both systems rely on replication as their primary fault-tolerance mechanism.
With a replication factor of 3, the system can tolerate two simultaneous replica failures for any given chunk without data loss.
Chunk re-replication is triggered when the master detects (via heartbeats) that the number of available replicas for a chunk has fallen below the target.
The master prioritizes re-replication of chunks with fewer surviving replicas.
Replication bandwidth is throttled to avoid overwhelming the cluster network during recovery from large-scale failures (such as a rack going offline).
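The prioritization rule can be sketched as a simple sort (hypothetical data shapes; GFS also weighs factors such as whether a chunk is blocking live clients):

```python
def rereplication_queue(chunk_replicas, target=3):
    """chunk_replicas: chunk handle -> list of surviving replica servers.
    Returns under-replicated chunks, those closest to data loss first."""
    under = [(len(replicas), chunk)
             for chunk, replicas in chunk_replicas.items()
             if len(replicas) < target]
    under.sort()                    # fewest surviving replicas first
    return [chunk for _, chunk in under]
```

A chunk down to one replica is one disk failure from permanent loss, so it jumps the queue ahead of chunks that merely lost one of three copies.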
Data integrity is verified using checksums.
Each chunkserver maintains its own checksums; GFS keeps a 32-bit checksum for each 64 KB block within a chunk.
Checksums are verified on every read.
If corruption is detected, the chunkserver reports the error, and the client reads from a different replica.
The master then initiates re-replication from a valid copy.
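A sketch of read-time verification (CRC32 stands in for GFS's unspecified 32-bit checksum; function names are hypothetical):

```python
import zlib

BLOCK = 64 * 1024   # GFS checksums each 64 KB sub-block of a chunk

def checksum_chunk(data):
    """One 32-bit checksum per 64 KB sub-block of the chunk."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verified_read(data, sums, offset, length):
    """Verify every sub-block the requested range touches before returning."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for i in range(first, last + 1):
        if zlib.crc32(data[i * BLOCK:(i + 1) * BLOCK]) != sums[i]:
            raise IOError(f"corrupt sub-block {i}: retry another replica")
    return data[offset:offset + length]
```

Keeping checksums per 64 KB sub-block means a small read only verifies the sub-blocks it overlaps, rather than rehashing the whole 64 MB chunk.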
For master fault tolerance, the operation log is replicated synchronously to multiple remote machines before any mutation is acknowledged to the client.
If the master process fails, it can be restarted quickly from the latest checkpoint plus replayed log.
In HDFS 2.x, automatic failover to a standby NameNode is managed by ZooKeeper-based leader election.
Limitations and Evolution
The single-master design, while simplifying the system, imposes scalability limits on metadata capacity.
Since all metadata lives in the master's memory, the number of files and blocks the system can manage is bounded by available RAM.
For GFS, this limit was eventually reached at Google, motivating the development of successor systems.
HDFS addressed this partially through federation (HDFS Federation), which partitions the namespace across multiple independent NameNodes.
The relaxed consistency model of GFS made it unsuitable for applications requiring strong consistency.
Google later built systems like Colossus (GFS's successor), Megastore, and Spanner to address these gaps.
In the Hadoop ecosystem, the append-only semantics of HDFS created friction for workloads requiring updates, leading to systems like Apache Kudu and Delta Lake that layer mutable semantics on top of or beside HDFS.
The default three-way replication carries a 200% storage overhead.
Both systems have moved toward erasure coding as a more storage-efficient alternative for cold data.
HDFS 3.x natively supports erasure coding with Reed-Solomon codes, reducing storage overhead to roughly 50% while maintaining comparable fault tolerance.
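The overhead comparison is straightforward arithmetic. With RS(6,3), every 6 data cells are stored alongside 3 parity cells, and any 3 lost cells can be reconstructed; 3x replication tolerates only 2 lost replicas:

```python
def overhead(stored_units, data_units):
    """Extra storage beyond the data itself, as a fraction of the data."""
    return (stored_units - data_units) / data_units

replication_3x = overhead(3, 1)   # three full copies of every byte
rs_6_3 = overhead(6 + 3, 6)       # 6 data + 3 parity cells per stripe
```

Erasure coding pays for this efficiency with reconstruction cost: repairing one lost cell requires reading several surviving cells, which is why it suits cold data better than hot data.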
Key Points
- GFS and HDFS use a single-master architecture that separates metadata management from data storage, keeping the master off the data path to avoid throughput bottlenecks.
- Large chunk sizes (64-128 MB) optimize for sequential access patterns, reduce metadata volume, and minimize client-master interactions.
- The master holds all metadata in memory for low-latency lookups, with persistence provided by a replicated operation log and periodic checkpoints.
- Chunk location data is not persisted by the master; it is reconstructed from chunkserver heartbeat reports, eliminating stale-state consistency problems.
- GFS provides a relaxed consistency model (at-least-once append semantics), pushing deduplication and validation responsibilities to the application layer.
- Three-way replication provides fault tolerance at the cost of 200% storage overhead, with modern deployments increasingly adopting erasure coding for cold data.
- The single-master design imposes a metadata scalability ceiling, motivating successor systems like Colossus and architectural extensions like HDFS Federation.
References
Ghemawat, S., Gobioff, H., and Leung, S.-T. "The Google File System." Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003.
Shvachko, K., Kuang, H., Radia, S., and Chansler, R. "The Hadoop Distributed File System." Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), 2010.
Dean, J. and Ghemawat, S. "MapReduce: Simplified Data Processing on Large Clusters." Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004.
White, T. "Hadoop: The Definitive Guide." O'Reilly Media, 4th Edition, 2015.