Backfill and Reprocessing in Streaming Pipelines

Streaming pipelines are designed to process data continuously, but the reality of production systems demands the ability to go backwards.
Bugs in processing logic, late-arriving data, schema changes, new feature requirements, and regulatory corrections all create scenarios where historical data must be reprocessed.
This capability, broadly called backfill and reprocessing, is one of the hardest operational problems in stream processing and one of the least discussed relative to its importance.

The Problem Space

A streaming pipeline typically consumes events from a durable log (Kafka, Kinesis, Pulsar), applies transformations or aggregations, and writes results to a downstream sink.
The pipeline maintains some form of processing state: watermarks, windows, counters, join buffers.
When something goes wrong with historical output, engineers face a deceptively simple question: how do we re-derive correct results from raw input?

The difficulty arises from several interacting concerns:

State entanglement. Streaming operators accumulate state over time. Reprocessing a slice of history means either reconstructing the state that existed at that point or carefully managing how reprocessed data interacts with current state.
Output side effects. Downstream consumers may have already acted on incorrect results. Reprocessing must produce corrective outputs without creating duplicates or inconsistencies.
Resource contention. Running a backfill alongside a live pipeline competes for the same compute, network, and storage resources.
Temporal semantics. Event time, processing time, and ingestion time may diverge significantly during a backfill. Watermark logic and late-data handling behave differently when replaying months of data at wire speed.

Architectural Approaches

diagram-1 — Lambda vs Kappa pipeline topologies and trade-offs

Dual-Pipeline (Lambda-Style)

The Lambda Architecture, originally proposed by Nathan Marz, addresses reprocessing by maintaining two parallel systems: a batch layer that periodically recomputes results from the complete dataset, and a speed layer that handles recent data.
During a backfill, the batch layer is simply re-run.

This approach works but carries significant operational cost.
Maintaining two codebases (or two runtimes for the same logic) introduces drift, and merging batch and real-time results at the serving layer is error-prone.

Unified Replay (Kappa-Style)

Jay Kreps proposed the Kappa Architecture as a simplification: use a single streaming pipeline for both real-time processing and reprocessing.
To backfill, you spin up a second instance of the pipeline reading from the beginning of the retained log, let it catch up, then swap it in to replace the original.

This requires the input log to retain data for the full reprocessing window, which can be expensive.
It also assumes the pipeline can process historical data fast enough to catch up within an acceptable timeframe.

Hybrid Approaches

Most production systems land somewhere between these extremes.
Common patterns include:

Selective replay from snapshots. Checkpoint the pipeline state periodically. To reprocess, restore a checkpoint from before the corruption window and replay only the affected time range.
Side-by-side execution. Run the backfill pipeline in a separate consumer group, writing to a staging sink. Validate results, then atomically swap the output tables.
Changelog-based correction. Instead of full reprocessing, emit correction events (retractions and re-additions) into the existing pipeline.

Walkthrough

The following walkthrough describes a typical backfill procedure for a Kafka-based streaming pipeline using Flink-style checkpointing.
Assume a bug was discovered in a windowed aggregation that produced incorrect results between timestamps T1 and T2.

Step 1: Identify the Corruption Window

Determine the event-time range [T1, T2] affected by the bug.
Identify which Kafka topic partitions and offset ranges correspond to this window.
This mapping is often non-trivial because event-time ordering and offset ordering are not identical.

Step 2: Select a Valid Checkpoint

Find the most recent checkpoint taken before T1 where the pipeline state was still correct.
Call this checkpoint C0.
If no such checkpoint exists (e.g., the bug was present from the start), a cold start from offset zero is required.

Step 3: Deploy the Corrected Pipeline

1. Deploy fixed pipeline version as a NEW job (do not modify the live job).
2. Initialize state from checkpoint C0.
3. Configure consumer offsets to the positions recorded in C0.
4. Set output sink to a STAGING destination (not production).
5. Start consuming from the restored offsets.

Step 4: Bounded Replay

diagram-2 — Bounded replay control flow to stop when T2 windows are closed

1. Process events from C0's offsets forward.
2. For each event with event_time > T2:
     IF all windows covering [T1, T2] have been closed and flushed:
       MARK replay as complete.
       STOP processing.
3. Monitor: compare throughput against the live pipeline.
   Ensure the backfill pipeline can sustain >= 10x real-time throughput
   to complete within operational SLA.

Step 5: Validate and Swap

1. Run validation queries against STAGING output:
   - Row counts per time bucket vs. expected values.
   - Checksums on key aggregation columns.
   - Spot-check known-correct records.
2. IF validation passes:
   a. Pause downstream consumers of PRODUCTION output.
   b. Atomically swap STAGING tables into PRODUCTION
      (e.g., ALTER TABLE ... EXCHANGE PARTITION, or update a pointer).
   c. Resume downstream consumers.
   d. Stop the backfill pipeline job.
3. IF validation fails:
   - Investigate discrepancies.
   - Repeat from Step 3 with further fixes.

Step 6: Reconcile Ongoing State

If the live pipeline continued running during the backfill, there is a gap between the backfill's terminal offset and the live pipeline's current position.
Two options:

Continue the backfill pipeline past T2 until it catches up to the live pipeline's position, then replace the live job entirely.
Merge outputs by taking the backfill's results for [T1, T2] and the live pipeline's results for everything after T2.

The first option is simpler and less error-prone.

Operational Considerations

Log Retention

Backfill is only possible if the raw input data is still available.
Kafka's default retention of 7 days is insufficient for most reprocessing scenarios.
Organizations that take reprocessing seriously typically configure retention of 30 to 90 days or use tiered storage to keep older segments in object storage (S3, GCS) at lower cost.

Idempotent Writes

The output sink must handle duplicate or overlapping writes gracefully.
Upsert semantics keyed on a deterministic record identifier make reprocessing significantly easier than append-only sinks.
If the sink is append-only (e.g., a data lake), the backfill must delete or overwrite the affected partitions before writing corrected data.

Exactly-Once and Fencing

When running a backfill pipeline alongside a live pipeline, both may attempt to write to the same sink.
Fencing mechanisms (Kafka transactional IDs, database advisory locks) prevent the backfill from corrupting live output and vice versa.
Flink's two-phase commit sink integration, for example, uses unique transactional ID prefixes per job to prevent cross-job interference.

Watermark Behavior During Replay

During normal operation, watermarks advance roughly in sync with wall-clock time.
During backfill, events spanning hours or days arrive in seconds.
This compresses the watermark timeline and can trigger massive window firings simultaneously.
Operators must ensure that the downstream sink and any intermediate buffers can absorb these bursts.
Throttling the source consumption rate is sometimes necessary, at the cost of longer backfill duration.

Schema Evolution

If the reprocessing need arises from a schema change, the backfill pipeline must be able to deserialize historical data written under the old schema.
Schema registries with backward/forward compatibility guarantees (as in Confluent Schema Registry or AWS Glue Schema Registry) are essential.
Without them, backfill across schema boundaries becomes a data engineering nightmare.

When Not to Backfill

Backfill is expensive.
Before undertaking one, consider alternatives:

Compensating events. If the error is bounded and well-understood, emitting correction events may be cheaper than full reprocessing.
Materialized view refresh. If the output is a derived view over an immutable event log stored in a queryable format (e.g., Iceberg, Delta Lake), recomputing the view with a batch query may be faster than replaying through the streaming pipeline.
Accepting the error. If the impact is within acceptable tolerance (e.g., off-by-small-amount in a non-critical metric), documenting the discrepancy and fixing forward may be the pragmatic choice.

Key Points

Backfill capability must be designed into streaming pipelines from the start; retrofitting it onto a pipeline that assumes append-only, forward-only processing is painful and error-prone.
Log retention policy directly determines the reprocessing horizon. Tiered storage makes long retention economically viable.
Running backfill as a separate job writing to a staging sink, then validating and swapping, is safer than in-place reprocessing of a live pipeline.
Idempotent, upsert-capable sinks dramatically simplify reprocessing by eliminating the need for manual deduplication.
Watermark compression during replay causes bursty window firings that can overwhelm downstream systems if not anticipated.
Checkpoint and savepoint mechanisms (as in Flink) provide the state snapshots necessary to start reprocessing from a known-good point without replaying the entire history.
Schema evolution support in the serialization layer is a prerequisite for reprocessing data that spans schema changes.

References

Kreps, J. (2014). "Questioning the Lambda Architecture." O'Reilly Radar. https://www.oreilly.com/radar/questioning-the-lambda-architecture/

Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., & Tzoumas, K. (2015). "Apache Flink: Stream and Batch Processing in a Single Engine." Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4).

Akidau, T., Bradshaw, R., Chambers, C., Chernyak, S., Fernández-Moctezuma, R., Lax, R., McVeety, S., Mills, D., Perry, F., Schmidt, E., & Whittle, S. (2015). "The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing." Proceedings of the VLDB Endowment, 8(12), 1792-1803.

Marz, N., & Warren, J. (2015). "Big Data: Principles and Best Practices of Scalable Real-Time Data Systems." Manning Publications.

Kleppmann, M. (2017). "Designing Data-Intensive Applications." O'Reilly Media. Chapter 11: Stream Processing.