
Stream-Table Duality

May 10, 2026 · 9 min read

Streams and tables are formally dual representations of the same data, convertible via changelog materialization and change data capture, forming the foundation of modern stream processing architectures.

Introduction

In distributed systems and data processing, two fundamental abstractions govern how we represent and manipulate data: streams and tables.
A stream is an unbounded, ordered sequence of immutable events.
A table is a mutable, point-in-time snapshot of state keyed by some identifier.
These two abstractions appear fundamentally different, yet they are formally dual to each other.
Every stream can be converted into a table, and every table can be converted into a stream.

This duality is not merely a conceptual curiosity.
It is the theoretical foundation underpinning modern stream processing systems like Apache Kafka, Apache Flink, and ksqlDB.
Understanding it deeply changes how you design data architectures, reason about consistency, and build systems that unify batch and real-time processing.

Foundations

The stream-table duality was most clearly articulated in the context of Apache Kafka by Jay Kreps and later formalized in the Kafka Streams programming model.
The core insight draws from database theory: a database's changelog (the write-ahead log) and its tables are two representations of the same information.

Streams

A stream is an append-only, immutable sequence of keyed records.
Each record represents a fact: something happened.
A stream of user click events, a stream of temperature sensor readings, or a stream of financial transactions all fit this model.
Records in a stream are never updated in place.
New records may supersede old ones semantically, but the old records remain in the log.

Formally, a stream S is an ordered sequence of records:

S = [(k₁, v₁, t₁), (k₂, v₂, t₂), (k₃, v₃, t₃), ...]

where each tuple contains a key k, a value v, and a timestamp t.
The sequence is ordered by timestamp (or offset).
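
To make this concrete, here is a minimal Python sketch of a stream as an append-only list of immutable keyed records (the Record type and field names are illustrative, not tied to any particular system):

from typing import Any, List, NamedTuple

class Record(NamedTuple):
    """One immutable stream record: a fact that something happened."""
    key: str
    value: Any
    timestamp: int   # or a Kafka-style offset

# A stream is an append-only sequence, ordered by timestamp/offset.
stream: List[Record] = [
    Record("k1", "v1", 1),
    Record("k2", "v2", 2),
    Record("k1", "v3", 3),   # semantically supersedes k1's earlier value,
]                            # but the earlier record stays in the log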

Tables

A table is a collection of keyed records representing the latest state for each key.
A table of user profiles, for instance, contains at most one record per user ID.
When a new record arrives for a given key, it replaces the previous record.

Formally, a table T is a function from keys to values at a point in time:

T(t) = { k → v | (k, v, t′) is the record with the largest timestamp t′ ≤ t for key k }

A table is inherently mutable.
Its contents change as new data arrives.

The Duality

Stream to Table

[Diagram: Materializing a stream into a table (k1 superseded)]

Given a stream, you can derive a table by replaying the stream from the beginning and maintaining the latest value for each key.
This is equivalent to applying a reduce (fold) operation, grouped by key.
In database terms, you are materializing a changelog into a table.

If the stream contains records [(k1, v1), (k2, v2), (k1, v3)], the resulting table after processing all three records is {k1 → v3, k2 → v2}.
The earlier value for k1 has been superseded.

This is exactly what happens when a database replays its write-ahead log during crash recovery.
The log (stream) is folded into the current table state.
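
As a minimal sketch (the materialize helper is illustrative), the stream-to-table conversion is a left fold that upserts each record into a map keyed by record key:

from typing import Any, Dict, Iterable, Tuple

def materialize(stream: Iterable[Tuple[str, Any]]) -> Dict[str, Any]:
    """Fold a changelog stream into a table: the latest value per key wins."""
    table: Dict[str, Any] = {}
    for key, value in stream:
        table[key] = value   # upsert: later records supersede earlier ones
    return table

# The example from the text:
assert materialize([("k1", "v1"), ("k2", "v2"), ("k1", "v3")]) == {"k1": "v3", "k2": "v2"}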

Table to Stream

[Diagram: Deriving a changelog stream from table updates]

Given a table, you can derive a stream by capturing every change made to the table as an event.
Each insert, update, or delete becomes a record in the output stream.
This is Change Data Capture (CDC), and it produces what is often called a changelog stream.

If a table transitions from {k1 → v1} to {k1 → v2, k2 → v3}, the derived stream contains the records [(k1, v2), (k2, v3)], representing the two changes that occurred.

Formal Relationship

[Diagram: Round-trip duality and the effect of log compaction]

The two conversions are inverses.
If you start with a stream, materialize it into a table, and then capture changes from that table, you recover the original stream (modulo compaction of intermediate values for the same key).
If you start with a table, capture its changelog, and then replay that changelog, you recover the table.

table(stream(T)) = T
stream(table(S)) ⊆ S  (equal under compaction)

The asymmetry in the second equation reflects log compaction: consecutive updates to the same key in the original stream may collapse into a single entry in the derived stream.
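
These properties can be checked directly under the toy definitions above (materialize and compact are illustrative helpers; compact keeps each surviving record in the position of its last occurrence, much as Kafka's log compaction keeps it at its original offset):

from typing import Any, Dict, Iterable, List, Tuple

def materialize(stream: Iterable[Tuple[str, Any]]) -> Dict[str, Any]:
    """stream -> table: the latest value per key wins."""
    table: Dict[str, Any] = {}
    for key, value in stream:
        table[key] = value
    return table

def compact(stream: List[Tuple[str, Any]]) -> List[Tuple[str, Any]]:
    """Keep only the last record per key, ordered by its last occurrence."""
    latest: Dict[str, Any] = {}
    for key, value in stream:
        latest.pop(key, None)    # move the key to its final position
        latest[key] = value
    return list(latest.items())

original = [("k1", "v1"), ("k2", "v2"), ("k1", "v3")]

# stream(table(S)) recovers S only up to compaction:
assert compact(original) == [("k2", "v2"), ("k1", "v3")]

# table(stream(T)) = T: the compacted changelog rebuilds the same table.
assert materialize(compact(original)) == materialize(original)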

Walkthrough

Consider a concrete example.
A stream of user profile update events arrives in a Kafka topic:

Offset 1: (user-42, {name: "Alice", city: "NYC"})
Offset 2: (user-73, {name: "Bob", city: "SF"})
Offset 3: (user-42, {name: "Alice", city: "LA"})
Offset 4: (user-73, null)                          // tombstone: delete

Step 1: Materialize stream to table.

Process each record in offset order, upserting it into a key-value store.

After offset 1: { user-42 → {name: "Alice", city: "NYC"} }
After offset 2: { user-42 → {name: "Alice", city: "NYC"},
                  user-73 → {name: "Bob", city: "SF"} }
After offset 3: { user-42 → {name: "Alice", city: "LA"},
                  user-73 → {name: "Bob", city: "SF"} }
After offset 4: { user-42 → {name: "Alice", city: "LA"} }

After offset 3, Alice's city changed from NYC to LA.
After offset 4, Bob's record was deleted (the null value acts as a tombstone).
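
A sketch of Step 1 in Python, with a None value treated as a tombstone (the materialize helper and record layout are illustrative):

from typing import Dict, Iterable, Optional, Tuple

def materialize(stream: Iterable[Tuple[str, Optional[dict]]]) -> Dict[str, dict]:
    """Upsert each record into the table; a None value deletes the key."""
    table: Dict[str, dict] = {}
    for key, value in stream:
        if value is None:
            table.pop(key, None)   # tombstone: drop the key if present
        else:
            table[key] = value
    return table

topic = [
    ("user-42", {"name": "Alice", "city": "NYC"}),
    ("user-73", {"name": "Bob", "city": "SF"}),
    ("user-42", {"name": "Alice", "city": "LA"}),
    ("user-73", None),   # tombstone: delete
]

assert materialize(topic) == {"user-42": {"name": "Alice", "city": "LA"}}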

Step 2: Derive a changelog stream from the table.

If we capture every mutation to the table as a new event, we get:

Change 1: (user-42, {name: "Alice", city: "NYC"})   // INSERT
Change 2: (user-73, {name: "Bob", city: "SF"})       // INSERT
Change 3: (user-42, {name: "Alice", city: "LA"})     // UPDATE
Change 4: (user-73, null)                             // DELETE

This is identical to the original stream.
The round-trip is lossless.

Step 3: Log compaction.

Kafka's log compaction feature exploits this duality.
It retains only the latest value for each key in a topic, effectively converting a stream into a compacted stream that can bootstrap a table without replaying the entire history.
After compaction, the topic would contain:

(user-42, {name: "Alice", city: "LA"})
(user-73, null)

This is sufficient to reconstruct the final table state.
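
A compaction sketch over the same four records (illustrative; the tombstone survives so that consumers rebuilding a table still learn about the delete, although Kafka eventually drops tombstones after a retention window, which is not modeled here):

from typing import Dict, List, Optional, Tuple

def compact(stream: List[Tuple[str, Optional[dict]]]) -> List[Tuple[str, Optional[dict]]]:
    """Keep only the last record per key; tombstones are retained."""
    latest: Dict[str, Optional[dict]] = {}
    for key, value in stream:
        latest.pop(key, None)    # move the key to its final position
        latest[key] = value
    return list(latest.items())

topic = [
    ("user-42", {"name": "Alice", "city": "NYC"}),
    ("user-73", {"name": "Bob", "city": "SF"}),
    ("user-42", {"name": "Alice", "city": "LA"}),
    ("user-73", None),   # tombstone
]

assert compact(topic) == [
    ("user-42", {"name": "Alice", "city": "LA"}),
    ("user-73", None),
]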

Practical Implications

Unified Batch and Stream Processing

The duality dissolves the traditional boundary between batch and stream processing.
A batch dataset is simply a bounded table.
A table is a materialized stream.
Therefore, batch processing is a special case of stream processing applied to a finite changelog.
This is the philosophical basis of the Kappa Architecture, which argues that a single stream processing pipeline can replace separate batch and real-time pipelines.

Stateful Stream Processing

In systems like Kafka Streams and Flink, stream-table duality is operationalized through state stores.
When a stream processor performs an aggregation (e.g., counting events per key), it materializes a table internally.
That table is backed by a local state store (often RocksDB), and every update to it is also written to a changelog topic.
If the processor fails, the state store is rebuilt from the changelog topic, which is itself a stream.
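
As a language-neutral sketch in plain Python (not the Kafka Streams or Flink API), a per-key count is itself an internal table, and each update to it can be emitted as a changelog record that later rebuilds the state store:

from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

def count_by_key(events: Iterable[Tuple[str, dict]]) -> Iterator[Tuple[str, int]]:
    """Aggregate a stream into per-key counts and emit every table update
    downstream as a changelog record."""
    counts: Dict[str, int] = defaultdict(int)    # the local "state store"
    for key, _event in events:
        counts[key] += 1
        yield key, counts[key]    # changelog record: new value for this key

clicks = [("user-42", {"page": "/"}),
          ("user-73", {"page": "/a"}),
          ("user-42", {"page": "/b"})]

assert list(count_by_key(clicks)) == [("user-42", 1), ("user-73", 1), ("user-42", 2)]

# Replaying the changelog (latest value per key) restores the counts table,
# which is how a restarted processor rebuilds its state store.
assert dict(count_by_key(clicks)) == {"user-42": 2, "user-73": 1}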

Event Sourcing and CQRS

Event sourcing stores the stream of state-changing events as the system of record, rather than the current state.
The current state (table) is derived on demand by replaying events.
This is a direct application of the stream-to-table direction of the duality.
Command Query Responsibility Segregation (CQRS) extends this by maintaining multiple derived tables (read models) from a single authoritative event stream.
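
A minimal event-sourcing sketch (the account events are illustrative): the event log is the system of record, and a read model is just one possible fold over it.

from typing import Dict, List, Tuple

# The stream of facts is the source of truth.
events: List[Tuple[str, str, int]] = [
    ("acct-1", "deposited", 100),
    ("acct-2", "deposited", 50),
    ("acct-1", "withdrew", 30),
]

def current_balances(log: List[Tuple[str, str, int]]) -> Dict[str, int]:
    """Derive a current-state table (one read model) by replaying the events."""
    balances: Dict[str, int] = {}
    for account, kind, amount in log:
        delta = amount if kind == "deposited" else -amount
        balances[account] = balances.get(account, 0) + delta
    return balances

assert current_balances(events) == {"acct-1": 70, "acct-2": 50}

Under CQRS, several such read models can be folded independently from the same authoritative event stream.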

Database Replication

Database replication protocols rely on the table-to-stream direction.
A primary database captures its mutations as a replication log (binlog, WAL) and ships that stream to replicas, which materialize it back into tables.
The correctness of replication depends on the faithfulness of this round-trip.

Common Misconceptions

"Streams are for real-time and tables are for batch." This is a false dichotomy.
A table is a continuously updated materialization of a stream.
The distinction is representational, not temporal.

"Converting a stream to a table loses information." It does lose the history of intermediate states for a given key.
But this is by design: a table answers "what is the current state?" while a stream answers "what happened?" Both are complete representations within their respective semantics.

"You need a database for tables and a message broker for streams." Systems like Kafka blur this line.
A compacted Kafka topic functions as both a stream and a table, depending on how you consume it.
Reading from the beginning and maintaining state gives you a table.
Subscribing to new records gives you a stream.

Key Points

  • A stream is an unbounded, ordered sequence of immutable events; a table is a mutable, point-in-time mapping from keys to values.
  • Any stream can be materialized into a table by replaying events and maintaining the latest value per key, equivalent to folding a changelog.
  • Any table can be converted into a stream by capturing every insert, update, and delete as a changelog event (Change Data Capture).
  • These two transformations are inverses of each other, forming a formal duality where information is preserved across round-trips (modulo log compaction).
  • Log compaction in systems like Kafka exploits this duality by retaining only the latest value per key, enabling efficient table bootstrapping from a stream.
  • This duality is the theoretical basis for unifying batch and stream processing, as seen in the Kappa Architecture and systems like Kafka Streams and Flink.
  • Practical applications include event sourcing, CQRS, database replication, and stateful stream processing with embedded state stores.

References

Kreps, Jay. "I Heart Logs: Event Data, Stream Processing, and Data Integration." O'Reilly Media, 2014.

Kleppmann, Martin. "Designing Data-Intensive Applications." O'Reilly Media, 2017. Chapters 3 and 11.

Sax, Matthias J., Guozhang Wang, Matthias Weidlich, and Johann-Christoph Freytag. "Streams and Tables: Two Sides of the Same Coin." Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics (BIRTE), 2018.

Carbone, Paris, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. "Apache Flink: Stream and Batch Processing in a Single Engine." Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.
