Introduction
Collecting, storing, and querying metrics at scale is a deceptively hard distributed systems problem.
A single moderately sized microservices deployment can produce millions of time series, each generating a data point every 15 seconds.
At that rate, the ingestion pipeline must handle hundreds of thousands of samples per second, the storage layer must efficiently compress and retain months or years of data, and the query engine must return aggregated results over arbitrary time windows within interactive latency budgets.
Prometheus emerged as the dominant open-source solution for metrics collection in cloud-native environments, but its single-node storage architecture imposes hard limits on retention and throughput.
Systems like M3 (developed at Uber) and Thanos were built to address these limitations, extending the Prometheus data model into horizontally scalable, distributed architectures.
M3 achieves this through a purpose-built distributed time series database with consistent hash partitioning; Thanos takes a different approach, attaching sidecar processes to existing Prometheus instances and using object storage (e.g., S3, GCS, ) as a shared long-term backend, avoiding the need to replace Prometheus's local TSDB.
Understanding the design tradeoffs in these systems is essential for anyone operating observability infrastructure at scale.
The Prometheus Data Model and Pull Architecture
Prometheus models all data as time series, where each series is uniquely identified by a metric name and a set of key-value label pairs.
A single sample is a tuple of (timestamp, float64 value).
This model is simple but expressive: labels allow high-cardinality decomposition of metrics along arbitrary dimensions (service name, endpoint, HTTP status code, pod ID, etc.).
Prometheus uses a pull-based collection model.
The Prometheus server periodically scrapes HTTP endpoints exposed by instrumented targets.
This inverts the typical push model and has several practical consequences:
- The monitoring system controls the sample rate, simplifying capacity planning.
- Target liveness is implicit: if a scrape fails, the target is down.
- Service discovery integrates naturally, since Prometheus discovers targets rather than targets discovering Prometheus.
The pull model works well within a single cluster but introduces challenges at scale.
A single Prometheus instance must scrape all targets within its configured scope, creating a fan-in bottleneck.
Federation (hierarchical scraping of one Prometheus by another) partially addresses this, but it introduces staleness and restricts the queries that can be federated efficiently.
Prometheus TSDB Internals
The Prometheus local time series database (TSDB) is the core of its storage layer, and its design are worth understanding in detail.
TSDB organizes data into blocks, each initially covering approximately two hours.
Within each block, data is stored in compressed chunks on disk, indexed by an inverted index that maps label pairs to series IDs.
Note that this two-hour duration applies to the initial block size written by head compaction.
As blocks age, TSDB's compaction process merges smaller blocks into progressively larger ones — up to roughly 10% of the configured retention period — reducing the number of blocks the query engine must scan over long time ranges.
Write Path
New samples arrive from scrapes and are written to an in-memory head block.
The head block contains the most recent, actively written data.
It is backed by a write-ahead log (WAL) for crash recovery.
Once the head block's time range is complete, it is compacted into an immutable on-disk block.
Compaction
TSDB periodically merges smaller blocks into larger ones, similar to LSM-tree compaction but operating on time-aligned blocks rather than sorted runs.
Compaction reduces the number of blocks the query engine must scan and enables more efficient compression of longer contiguous series.
Compression
Prometheus uses Gorilla-style compression (from Facebook's in-memory TSDB paper).
Timestamps are delta-of-delta encoded, and values are XOR-compressed against the previous value.
For workloads with regular scrape intervals and slowly changing values, this achieves roughly 1.37 bytes per sample — a figure reported by Facebook for Gorilla's in-memory workload.
Prometheus TSDB on disk achieves similar compression ratios for comparable workloads, though actual ratios vary: counter-heavy workloads with stable scrape intervals compress more aggressively, while high-entropy gauge values compress less efficiently, with practical on-disk figures typically ranging from about 1.3 to 2 bytes per sample.
The compression ratio is nonetheless critical for practical retention at high cardinality.
Walkthrough
Gorilla-Style Sample Compression (Simplified)
The following walkthrough describes how Prometheus (and M3) compresses a stream of (timestamp, value) pairs within a single chunk.
The encoding boundaries below are adapted from the Gorilla paper and are illustrative; the Prometheus implementation uses similar but not always identical bucket boundaries in practice.
Timestamp encoding:
1. Store the first timestamp t0 in full (e.g., 64 bits).
2. For the second timestamp t1, store delta = t1 - t0.
3. For each subsequent timestamp tn:
a. Compute delta_n = tn - t(n-1)
b. Compute delta_of_delta = delta_n - delta_(n-1)
c. If delta_of_delta == 0:
Write a single '0' bit.
d. Else:
Encode delta_of_delta using variable-length
prefix coding (approximate ranges from the paper):
- [-64, 64] -> '10' + 7 bits
- [-256, 255] -> '110' + 9 bits
- [-2048, 2047] -> '1110' + 12 bits
- otherwise -> '1111' + 32 bits
Value encoding:
1. Store the first value v0 as a full 64-bit float.
2. For each subsequent value vn:
a. Compute xor = float64bits(vn) XOR float64bits(v(n-1))
b. If xor == 0:
Write a single '0' bit (value unchanged).
c. Else:
Write '1' bit.
Let leading = number of leading zero bits in xor
Let trailing = number of trailing zero bits in xor
If leading >= prev_leading AND trailing >= prev_trailing
(i.e., the meaningful bits fit within the previously
stored bit window):
Write '0' bit, then the meaningful bits only,
reusing the stored leading/trailing counts.
Else:
Write '1' bit, 5 bits for leading count,
6 bits for length of meaningful bits,
then the meaningful bits.
Update stored leading/trailing counts.
The key condition for reusing the previous window is that the new XOR's meaningful bits fit entirely within the bit range defined by the previous encoding — that is, the new leading zero count is greater than or equal to the previous leading count, and the new trailing zero count is greater than or equal to the previous trailing count.
This avoids re-emitting the window header on every sample.
This scheme exploits the fact that successive timestamps are nearly equidistant (regular scrape intervals) and successive values are often identical or close in their floating-point representation.
In practice, time series from counters and gauges compress extremely well.
Scaling Beyond a Single Node: M3
M3 is a distributed metrics platform originally built at Uber to handle billions of active time series across multiple data centers.
It consists of several components:
- M3DB: A distributed time series database that provides horizontally scalable storage with configurable replication.
- M3 Coordinator: A sidecar that bridges Prometheus remote write/read with M3DB, allowing Prometheus to use M3 as a long-term storage backend.
- M3 Aggregator: A distributed, stateful aggregation tier that downsamples and rolls up metrics before storage.
- M3 Query: A query engine compatible with PromQL.
M3DB Architecture
M3DB uses consistent hashing to partition time series across a cluster of storage nodes.
Each series is assigned to a set of replica nodes based on its series ID (a hash of the metric name and label set).
This is conceptually similar to how Dynamo or Cassandra partition data, but optimized for time series workloads.
Key design decisions in M3DB:
Placement and replication. M3DB maintains a placement (ring) that maps virtual nodes to physical hosts.
Replication factor is configurable (typically RF=3).
Writes are sent to all replicas, and reads can be configured for eventual or quorum consistency.
Namespaces and retention. M3DB supports multiple namespaces, each with its own retention period and block size.
A common pattern is to store raw-resolution data for 48 hours and downsampled (e.g., 5-minute) data for months or years.
The M3 Aggregator performs this downsampling in a streaming fashion.
Memory-mapped storage. M3DB stores recent data in memory and flushes completed blocks to disk.
Disk blocks are memory-mapped for query access, minimizing heap pressure.
This is important for Go-based systems where large heaps cause GC pauses.
Commit log. M3DB maintains a commit log (analogous to a WAL) at the node level, spanning all series on that node.
Unlike Prometheus's per-head-block WAL, M3DB's commit log is a node-wide append-only log truncated after successful block flushes, ensuring durability across restart without replaying the entire block.
Inverted index. Like Prometheus TSDB, M3DB maintains an inverted index from label pairs to series.
In the distributed case, each node maintains an index shard for the series it owns.
M3DB uses a specialized posting list implementation based on Roaring Bitmaps for efficient set intersection during label-based queries.
Write Path in M3
- A Prometheus instance sends samples via the remote write API to M3 Coordinator.
- M3 Coordinator looks up the placement to determine which M3DB nodes own each series.
- Samples are batched and sent to replica nodes over gRPC.
- Each M3DB node writes samples to an in-memory buffer, appending to Gorilla-compressed chunks.
- A commit log ensures durability.
- Periodically, completed time blocks are flushed to disk and the commit log is truncated.
Query Path
PromQL queries arrive at M3 Query, which fans out to M3DB nodes that own the relevant series.
Each node performs local filtering and decompression, returning raw or partially aggregated results.
M3 Query merges results and evaluates the full PromQL expression.
For queries spanning downsampled namespaces, M3 Query transparently selects the appropriate resolution.
Thanos: Object Storage as Long-Term Backend
Thanos offers a complementary scaling strategy that avoids replacing Prometheus entirely.
Rather than introducing a new distributed database, Thanos attaches a sidecar process to each existing Prometheus instance.
The sidecar uploads completed TSDB blocks to object storage (S3, GCS, Azure Blob Storage, etc.) as they are written by Prometheus’s normal compaction cycle.
Key components:
- Sidecar: Runs alongside each Prometheus instance, exposes a gRPC Store API for queries, and uploads completed blocks to object storage.
- Store Gateway: Reads blocks from object storage and serves them via the Store API, enabling queries over historical data without keeping it in memory.
- Querier: A globally federated query layer that fans out PromQL queries to sidecars and store gateways, deduplicating results from replicated Prometheus instances.
- Compactor: Applies Prometheus-style compaction and downsampling to blocks in object storage, producing lower-resolution blocks for long-term retention.
- Ruler: Evaluates recording and alerting rules at global scope, across all Prometheus instances.
The primary tradeoff relative to M3 is that Thanos inherits Prometheus's per-instance cardinality limits for recent data (each Prometheus still has a bounded local TSDB), while M3 provides a unified distributed ingestion path.
Thanos excels in environments where existing Prometheus deployments are already in place and where object storage costs and operational simplicity are priorities.
Challenges at Scale
Several problems become acute at large scale:
Cardinality explosion. A label with N unique values multiplies the number of time series by N.
Unbounded labels (request IDs, user IDs) can create millions of series, overwhelming both ingestion and indexing.
Both Prometheus and M3 require careful label governance.
Query fan-out. A query like sum(rate(http_requests_total[5m])) by (service) may touch millions of series across hundreds of nodes.
Efficient execution requires pushing computation (filtering, partial aggregation) to storage nodes rather than centralizing it.
Clock skew and ordering. Distributed ingestion means samples may arrive out of order.
M3DB handles this by buffering writes within a configurable time window and rejecting samples outside it.
Prometheus TSDB historically rejected out-of-order samples in the head block.
Starting with Prometheus 2.39, configurable out-of-order ingestion support was introduced via the --storage.tsdb.out-of-order-time-window flag, which allows a bounded time window within which out-of-order samples are accepted into the head block. (The separate --storage.tsdb.allow-overlapping-blocks flag, available earlier, controls merging of overlapping on-disk blocks produced by federation or backfill, and is distinct from the out-of-order ingestion window.)
Multi-tenancy. Shared metrics platforms must isolate tenants to prevent a noisy neighbor from degrading the system.
M3 Coordinator supports per-tenant rate limiting, and M3DB namespaces can enforce per-tenant retention and cardinality limits.
Key Points
- Prometheus uses a pull-based model with a local TSDB that organizes data into time-aligned, compactable blocks with Gorilla-style compression, achieving roughly 1.37 bytes per sample under typical workloads (this figure originates from the Gorilla paper for in-memory storage; practical on-disk ratios in Prometheus are workload-dependent, typically 1.3–2 bytes per sample).
- The Prometheus data model (metric name plus label pairs identifying each time series) is simple but creates cardinality management challenges at scale.
- M3DB extends the Prometheus model into a distributed architecture using consistent hashing for partitioning, configurable replication, and memory-mapped block storage.
- Thanos offers an alternative scaling strategy: sidecar processes attach to existing Prometheus instances and ship completed TSDB blocks to object storage, enabling long-term retention and global querying without replacing Prometheus's local TSDB.
- Gorilla compression (delta-of-delta for timestamps, XOR for values) is fundamental to both Prometheus and M3, enabling practical storage costs at high ingest rates.
- Distributed query execution requires pushing filtering and partial aggregation to storage nodes to avoid moving decompressed data across the network.
- Streaming downsampling (via M3 Aggregator or Thanos Compactor) is essential for long-term retention, reducing storage costs by orders of magnitude for historical data.
- Cardinality governance, including bounding label values and enforcing per-tenant limits, is an operational prerequisite for any metrics system operating above a few million active series.
References
Pelkonen, T., Franklin, S., Teller, J., Cavallaro, P., Huang, Q., Meza, J., and Veeraraghavan, K. "Gorilla: A Fast, Scalable, In-Memory Time Series Database." Proceedings of the VLDB Endowment, Vol. 8, No. 12, 2015.
Prometheus Authors. "Prometheus: Monitoring system and time series database." https://prometheus.io/docs/introduction/overview/
Uber Engineering. "M3: Uber's Open Source, Large-scale Metrics Platform for Prometheus." https://eng.uber.com/m3/
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. "Dynamo: Amazon's Highly Available Key-value Store." SOSP 2007.
Reinartz, F. "Writing a Time Series Database from Scratch." PromCon 2017. https://promcon.io/2017-munich/talks/writing-a-time-series-database-from-scratch/
Thanos Authors. "Thanos - Highly available Prometheus setup with long term storage capabilities." https://thanos.io/