Schema Evolution in Event Streams

Introduction

Event-driven architectures rely on a simple contract: producers write events, and consumers read them.
The schema of those events defines the contract.
In practice, schemas change constantly.
Fields are added, renamed, removed, or retyped.
When a producer updates an event schema and a consumer has not yet been updated (or vice versa), the system must not break.
This is the problem of schema evolution.

In batch systems, schema changes can be coordinated with downtime or migration scripts.
In streaming systems, events are immutable, durable, and potentially replayed over long time horizons.
Old events coexist with new events in the same topic or partition.
Producers and consumers are deployed independently.
This makes schema evolution in event streams a harder, more consequential problem than in request/response systems.

Why Schema Evolution Matters

An event stream is an append-only log.
Once written, an event cannot be altered.
A Kafka topic might retain events for days, weeks, or indefinitely (as in log-compacted topics).
At any point, a consumer might read events produced months ago alongside events produced seconds ago.
The consumer's deserialization logic must handle all of them.

Independent deployment makes this worse.
Producer team A deploys a new version of an event at 10:00.
Consumer team B deploys their updated reader at 14:00.
During that four-hour window, consumer B must still function correctly against the new events.
If a schema change is not backward compatible, consumer B crashes.

The core tension is between agility (teams should be able to evolve their schemas freely) and stability (consumers should not break when schemas change).
Schema evolution strategies exist to manage this tension.

Compatibility Modes

diagram-1 — Backward / forward / full compatibility Venn (with transitive variant)

The standard framework for reasoning about schema evolution uses four compatibility modes, most clearly codified in the Confluent Schema Registry.

Backward Compatibility

A new schema is backward compatible if code using the new schema can read data written with the old schema.
In other words, consumers can be upgraded to the new schema before producers: the upgraded consumer will correctly read events that were written under the old schema.
The typical safe change is adding a new field with a default value — old events lacking that field are still readable by the new consumer, which substitutes the default.

Forward Compatibility

A new schema is forward compatible if code using the old schema can read data written with the new schema.
This means producers can be upgraded before consumers: old consumers will correctly read events written under the new schema.
Forward compatibility is critical during deployment gaps where a producer has shipped a new schema version but some consumers are still running the old one.

Note that how unknown fields are handled depends on the serialization format.
In Protobuf, unknown field numbers are skipped by old readers.
In Avro, forward compatibility requires that the old reader schema be able to resolve against the new writer schema — fields present in the writer schema but absent in the reader schema are silently dropped during resolution, but this depends on the Avro runtime performing schema resolution, not on the reader passively ignoring bytes.

Full Compatibility

Full compatibility requires both backward and forward compatibility simultaneously.
This is the strictest useful mode for most production systems.
It means any reader and any writer, across the current and previous schema versions, can interoperate.

Transitive Variants

Each mode has a transitive variant (e.g., BACKWARD_TRANSITIVE) that checks compatibility not just against the immediately preceding version, but against all prior versions.
This matters for event streams with long retention, where a consumer might replay events written under a schema from many versions ago.

Serialization Formats and Their Properties

Not all serialization formats handle evolution equally.

Apache Avro

Avro is the format most closely associated with schema evolution in Kafka-based streaming systems.
It uses a writer schema (embedded or referenced at write time) and a reader schema (held by the consumer).
The Avro runtime resolves differences between the two schemas at read time using well-defined resolution rules.
Fields present in the writer schema but absent in the reader schema are silently discarded.
Fields present in the reader schema but absent in the writer schema are filled with defaults (or cause a SchemaResolutionError if no default is defined).

This reader/writer schema resolution model is what makes Avro particularly well-suited to event streams.
The schema is not merely documentation; it is executable.

Protocol Buffers

Protobuf uses field numbers to identify fields on the wire.
Adding new fields (with new numbers) is always forward compatible, since old readers skip unknown field numbers.
Removing fields is safe as long as the field number is never reused.
Protobuf does not use Avro's explicit reader/writer resolution model, but its tagging scheme provides implicit forward and backward compatibility for additive changes.
In high-throughput environments where Protobuf tooling is already established, it is a strong alternative to Avro.

JSON Schema

JSON is flexible but dangerous.
Without a schema registry enforcing compatibility, JSON event schemas tend to drift silently.
JSON Schema can formalize the contract, but the lack of a compact binary encoding and the absence of built-in reader/writer resolution makes it a weaker choice for high-throughput event streams.

Walkthrough

The following walkthrough shows how a schema registry enforces compatibility when a producer attempts to register a new schema version.

Step-by-Step: Schema Compatibility Check

diagram-2 — Registry compatibility loop over prior schemas, returns 409 on incompatibility

1. Producer builds a new schema S_new for subject "order-events-value".

2. Producer sends S_new to the Schema Registry via:
   POST /subjects/order-events-value/versions

3. Registry retrieves the compatibility mode for this subject.
   (e.g., FULL_TRANSITIVE)

4. Registry retrieves all previously registered schemas for the subject:
   [S_v1, S_v2, ..., S_v(n-1)]

5. For FULL_TRANSITIVE, the registry checks:
   a. For each S_v(i) in [S_v1 ... S_v(n-1)]:
      - Can a reader using S_new read data written with S_v(i)?   (backward)
      - Can a reader using S_v(i) read data written with S_new?   (forward)

6. If all checks pass:
   - Assign S_new the version number n.
   - Return schema ID to producer.
   - Producer serializes the event using the Confluent wire format:
     [magic byte 0x00][4-byte schema ID][Avro-encoded bytes]

7. If any check fails:
   - Return HTTP 409 Conflict with details of the incompatibility.
   - Producer must revise the schema.

At read time, the consumer extracts the magic byte and schema ID from the event payload, fetches the writer schema from the registry (caching it locally), and uses it in combination with its own reader schema to deserialize the event.

Pseudocode: Avro Reader/Writer Resolution

diagram-3 — Avro reader/writer field resolution: defaults applied, writer-only fields discarded

def resolve(writer_schema, reader_schema, encoded_bytes):
    """
    Conceptual model of Avro reader/writer schema resolution.

    Note: In the actual Avro runtime, decoding and resolution occur
    in a single pass against both schemas simultaneously. This
    two-step representation is simplified for clarity.
    """
    decoded = decode_with(writer_schema, encoded_bytes)
    result = {}

    for field in reader_schema.fields:
        if field.name in decoded:
            # Field exists in writer output. Type promotion may apply.
            result[field.name] = promote(decoded[field.name],
                                         writer_type=writer_schema.field_type(field.name),
                                         reader_type=field.type)
        elif field.has_default:
            # Field missing from writer schema. Use reader default.
            result[field.name] = field.default
        else:
            raise SchemaResolutionError(
                f"Field '{field.name}' missing from writer schema and has no default"
            )

    # Fields in writer schema but not in reader schema are silently discarded.
    return result

This resolution logic is what makes Avro tolerant of independent schema changes across producers and consumers.

Practical Rules for Safe Evolution

The following rules apply broadly, regardless of serialization format.

Never remove a required field without a default. If a consumer expects a field and the producer stops sending it, deserialization fails unless the reader can substitute a default.
Never reuse field identifiers. In Protobuf, never reuse a field number. In Avro, never reuse a field name with a different type.
Always add new fields with defaults. This ensures backward compatibility: old events without the field can still be read by new consumers.
Use union types for optionality. In Avro, making a field a union of null and the desired type is the standard way to express optional fields safely.
Treat schema changes as API versioning. A schema is a public interface. Apply the same discipline you would to a REST API: deprecation periods, changelog documentation, and consumer notifications.
Prefer additive changes. Adding fields is almost always safe. Renaming, removing, or retyping fields is almost always dangerous. Structure your events so that evolution is additive wherever possible.

Schema Registries in Practice

A schema registry is not optional infrastructure for serious event streaming systems.
It serves as the single source of truth for schema versions and enforces compatibility at write time, before incompatible data enters the stream.
The Confluent Schema Registry is the most widely used implementation for Kafka-based systems.
AWS Glue Schema Registry and Apicurio Registry are alternatives.

The registry also enables schema ID embedding.
Instead of including the full schema in every event (which Avro's file format does, but is wasteful for streams), the producer writes a small header — in the Confluent wire format, a magic byte followed by a 4-byte schema ID — before the encoded payload.
Consumers use this ID to fetch and cache the writer schema.
This keeps event payloads compact while preserving the information needed for reader/writer resolution.

Common Pitfalls

One frequent mistake is treating schema evolution as purely a serialization concern.
In reality, schema changes propagate through the entire data pipeline.
A new field added to an event may require changes in stream processors, materialized views, sink connectors, and downstream databases.
Schema compatibility at the serialization layer prevents deserialization failures, but it does not prevent semantic misinterpretation.

Another pitfall is neglecting transitive compatibility in long-retention systems.
If your topic retains events for 90 days and you check compatibility only against the previous version, a consumer replaying from the beginning of the topic may encounter a schema from several versions ago that is incompatible with its current reader schema.

Finally, teams sometimes disable compatibility checks to unblock a deployment.
This creates a time bomb.
The incompatible schema version will persist in the registry and in the topic, potentially causing failures for any consumer that encounters it now or in the future.

Key Points

Event streams require schema evolution strategies because events are immutable, long-lived, and read by independently deployed consumers.
Backward compatibility (new schema reads old data) allows consumers to be upgraded first; forward compatibility (old schema reads new data) allows producers to be upgraded first.
Avro's reader/writer schema resolution model makes it a particularly strong serialization choice for evolving event schemas in Kafka-based systems; Protobuf is a strong alternative where its tooling is already in use.
Schema registries enforce compatibility checks at write time, preventing incompatible schemas from entering production streams.
Transitive compatibility checking is essential for streams with long retention periods, where consumers may replay events spanning many schema versions.
Safe evolution rules (additive changes, defaults on new fields, no field identifier reuse) apply across all serialization formats.
Schema evolution is not just a serialization problem; it has downstream effects on processors, databases, and materialized views.

References

Martin Kleppmann. Designing Data-Intensive Applications. O'Reilly Media, 2017.

Apache Avro Specification, Version 1.11.1. Apache Software Foundation, 2023.

Confluent Schema Registry Documentation: Schema Evolution and Compatibility. Confluent, Inc., 2023.

Jay Kreps. "The Log: What every software engineer should know about real-time data's unifying abstraction." LinkedIn Engineering Blog, 2013.

Martin Kleppmann, Alastair R. Beresford, and Boerge Svingen. "Online Event Processing: Achieving consistency where distributed transactions have failed." Communications of the ACM, Vol. 62, No. 5, 2019.