Distributed Tracing (OpenTelemetry, Jaeger)

Introduction

In a monolithic application, a stack trace tells you everything you need to know about a request's execution path.
In a distributed system composed of dozens or hundreds of services, a single user request fans out across process boundaries, networks, and serialization formats.
The stack trace dies at the process boundary.
Distributed tracing solves this by reconstructing the causal chain of operations across services, giving engineers a unified view of latency, errors, and dependencies.

This article covers the foundational concepts of distributed tracing, the data model that underpins it, and how OpenTelemetry and Jaeger implement these ideas in practice.

The Problem

Consider a request that enters an API gateway, calls an authentication service, queries a product catalog, checks inventory, and writes to an order database.
If that request takes 1200ms instead of the expected 200ms, where did the time go?
Without tracing, you are left correlating timestamps across separate log streams from separate services, a brittle and error-prone process.

Distributed tracing provides three capabilities that logs and metrics alone cannot:

Causal ordering across service boundaries, not just temporal ordering.
Per-request latency decomposition, showing exactly which service or RPC contributed how much wall-clock time.
Dependency topology discovery, revealing the actual runtime call graph (which may differ from what architecture diagrams claim).

Core Data Model

The tracing data model, formalized by the OpenTracing specification and now carried forward by OpenTelemetry, is built on two primitives: traces and spans.

Traces

A trace represents the entire lifecycle of a single request through the system.
It is identified by a globally unique trace_id (typically 128 bits).
A trace is not a single data structure that lives in one place; it is the logical collection of all spans sharing the same trace_id.

Spans

A span represents a single unit of work: an RPC, a database query, a function invocation.
Each span contains:

trace_id: links the span to its parent trace.
span_id: a unique identifier for this span (typically 64 bits).
parent_span_id: the span that caused this one. The root span has no parent.
operation_name: a human-readable label (e.g., GET /api/orders).
start_time and duration: wall-clock timing.
tags / attributes: key-value metadata (e.g., http.status_code=200).
events / logs: timestamped annotations within the span's lifetime.
status: success, error, or unset.
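
The field list above can be sketched as a small data class. This is a minimal illustration, not the OpenTelemetry SDK's span type: the class name Span, the helpers new_trace_id and new_span_id, and the child and finish methods are all assumptions made for this example.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


def new_trace_id() -> str:
    """Return a random 128-bit trace ID as 32 lowercase hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Return a random 64-bit span ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


@dataclass
class Span:
    # Hypothetical model mirroring the fields listed in the text.
    trace_id: str
    span_id: str
    operation_name: str
    parent_span_id: Optional[str] = None  # None marks the root span
    start_time: float = field(default_factory=time.time)  # wall clock, seconds
    duration: Optional[float] = None  # seconds; set when the span finishes
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)  # (timestamp, name) tuples
    status: str = "unset"  # "unset", "ok", or "error"

    def child(self, operation_name: str) -> "Span":
        """Create a child span in the same trace, parented to this span."""
        return Span(self.trace_id, new_span_id(), operation_name,
                    parent_span_id=self.span_id)

    def finish(self, status: str = "ok") -> None:
        """Record the elapsed wall-clock time and the final status."""
        self.duration = time.time() - self.start_time
        self.status = status


root = Span(new_trace_id(), new_span_id(), "GET /api/orders")
root.attributes["http.status_code"] = 200
query = root.child("INSERT orders")
query.finish()
root.finish()
```

The child span copies trace_id from its parent and records the parent's span_id, which is all that is needed to reconstruct the relationship later.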

The parent-child relationships between spans most commonly form a tree: each span has at most one parent, and the root span has none.
The OpenTelemetry specification also allows a span to reference other spans via Link objects (for example, to connect a consumer span back to a producer span across a message queue), which can produce a more general directed acyclic graph (DAG).
In practice, the tree structure covers the vast majority of use cases.

Context Propagation

The mechanism that ties spans across process boundaries is context propagation.
When service A calls service B, the trace_id and span_id must be transmitted alongside the request.
This is typically done via HTTP headers (e.g., W3C traceparent), gRPC metadata, or message queue headers.

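The W3C Trace Context standard defines the traceparent header as four dash-separated fields: a version (currently 00), the trace ID, the parent span ID, and the trace flags: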

traceparent: 00-<trace_id_hex_32chars>-<parent_span_id_hex_16chars>-<trace_flags_hex_2chars>

For example:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

This single header is sufficient for a downstream service to create a child span that links back to the correct trace and parent.
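
As a rough sketch of what extraction and injection involve, the following parses and writes a version 00 traceparent header using only the standard library. The names SpanContext, extract, and inject are chosen for this example and are not the OpenTelemetry propagator API. The sketch handles only version 00 and assumes a plain dict with a lowercase header key.

```python
import re
from typing import NamedTuple, Optional

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex trace flags
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def extract(headers: dict) -> Optional[SpanContext]:
    """Parse an incoming traceparent header; return None if absent or invalid."""
    value = headers.get("traceparent", "").strip()
    match = _TRACEPARENT_RE.match(value)
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def inject(ctx: SpanContext, headers: dict) -> None:
    """Write the caller's context into outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
parent = extract(incoming)
```

A receiving service would use parent.trace_id as its own trace_id and parent.span_id as the parent_span_id of the span it creates. When extract returns None, the service starts a new trace instead.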

Walkthrough

The following walkthrough shows how a trace is constructed across three services during a single request.
Identifiers are shown symbolically (T1, S1, etc.) for readability; in practice they are hex-encoded byte arrays conforming to the W3C format shown above.

Step-by-step: Trace Construction

diagram-1 — Span creation and context propagation across services for one request

1. Request arrives at API Gateway.
   - No incoming traceparent header.
   - Generate trace_id = T1, span_id = S1.
   - Create root span: {trace_id: T1, span_id: S1, parent: null, op: "GET /order"}.
   - Start timer.

2. API Gateway calls Auth Service via HTTP.
   - Inject header: traceparent: 00-<T1>-<S1>-01
   - Auth Service extracts context from header.
   - Generate span_id = S2.
   - Create span: {trace_id: T1, span_id: S2, parent: S1, op: "validate_token"}.
   - Auth Service processes request, finishes span S2 (duration: 15ms).
   - Auth Service reports span S2 to collector.

3. API Gateway calls Order Service via HTTP.
   - Inject header: traceparent: 00-<T1>-<S1>-01
   - Order Service extracts context.
   - Generate span_id = S3.
   - Create span: {trace_id: T1, span_id: S3, parent: S1, op: "create_order"}.

4. Order Service queries PostgreSQL.
   - Generate span_id = S4.
   - Create span: {trace_id: T1, span_id: S4, parent: S3, op: "INSERT orders"}.
   - Query completes (duration: 45ms). Finish span S4.

5. Order Service finishes (duration: 60ms). Finish span S3.
6. API Gateway finishes (duration: 95ms). Finish span S1.

7. Each service reports its finished spans (S1-S4) to the collector asynchronously.
   The backend groups them into trace T1 using the shared trace_id.

The resulting trace tree:

S1: GET /order (95ms)
├── S2: validate_token (15ms)
└── S3: create_order (60ms)
    └── S4: INSERT orders (45ms)
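
The tree above can be rebuilt from a flat, unordered list of spans by indexing them on their parent ID. The sketch below uses the symbolic IDs and durations from the walkthrough. The tuple layout and the function names build_children, render, and self_time are assumptions for illustration, and self_time further assumes that child spans run one after another rather than in parallel.

```python
from collections import defaultdict

# (span_id, parent_span_id, operation, duration_ms), in arbitrary arrival order
spans = [
    ("S4", "S3", "INSERT orders", 45),
    ("S2", "S1", "validate_token", 15),
    ("S1", None, "GET /order", 95),
    ("S3", "S1", "create_order", 60),
]


def build_children(spans):
    """Index spans by parent ID so the tree can be walked from the root."""
    children = defaultdict(list)
    for span in spans:
        children[span[1]].append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s[0])  # stable order for display only
    return children


def render(children, parent=None, depth=0):
    """Return one indented line per span, depth-first from the root."""
    lines = []
    for span_id, _, op, duration in children.get(parent, []):
        lines.append(f"{'  ' * depth}{span_id}: {op} ({duration}ms)")
        lines.extend(render(children, span_id, depth + 1))
    return lines


def self_time(children, span):
    """Duration not covered by direct children (assumes sequential children)."""
    return span[3] - sum(child[3] for child in children.get(span[0], []))


children = build_children(spans)
print("\n".join(render(children)))
```

Under the sequential assumption, the gateway span S1 spends 95 - (15 + 60) = 20ms in its own code, and S3 spends 60 - 45 = 15ms outside the database query. This is the per-request latency decomposition described earlier.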

OpenTelemetry Architecture

OpenTelemetry (OTel) is the CNCF project that provides a vendor-neutral standard for telemetry data collection.
It was announced in 2019 as a merger of the earlier OpenTracing and OpenCensus projects, consolidating their communities and specifications into a single standard.
The sections below cover its two main layers, instrumentation and collection, and then sampling.

Instrumentation Layer

This layer consists of the API and SDKs that application code interacts with.
OTel provides both manual instrumentation (explicit span creation) and automatic instrumentation (agent-based or library hooks that instrument common frameworks like gRPC, HTTP clients, and database drivers, without code changes).

from opentelemetry import trace

# Requires the opentelemetry-api package. Without a configured SDK,
# the tracer is a no-op and no spans are exported.
tracer = trace.get_tracer("order-service")

def create_order(db, order_id):
    # db is assumed to be a database handle that exposes execute().
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("order.id", order_id)
        # Note: recording raw SQL in db.statement attributes can expose sensitive
        # data and create high cardinality. Use parameterized query templates or
        # the semantic conventions db.operation + db.name instead in production.
        span.set_attribute("db.operation", "INSERT")
        span.set_attribute("db.name", "orders")
        return db.execute("INSERT INTO orders ...")

Collector Layer

diagram-2 — OpenTelemetry Collector pipeline: receivers, processors, exporters

The OpenTelemetry Collector is a standalone process that receives, processes, and exports telemetry data.
It operates as a pipeline with three stages:

Receivers: accept data over OTLP (OpenTelemetry Protocol), Jaeger, Zipkin, and other formats.
Processors: batch spans, filter, sample, or enrich with attributes.
Exporters: send data to backends (Jaeger, Tempo, Datadog, etc.).

This architecture decouples instrumentation from the storage backend.
You can switch from Jaeger to Grafana Tempo without changing application code.
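
The three stages can be pictured with a toy pipeline. This is a conceptual model only, not the Collector's implementation or configuration format. The Pipeline class, the two processor functions, the GET /healthz endpoint, and the staging attribute value are all hypothetical.

```python
def drop_health_checks(batch):
    """Filter processor: discard spans for a (hypothetical) health endpoint."""
    return [span for span in batch if span["name"] != "GET /healthz"]


def add_environment(batch):
    """Enrichment processor: attach a deployment attribute to every span."""
    return [
        {**span, "attributes": {**span.get("attributes", {}),
                                "deployment.environment": "staging"}}
        for span in batch
    ]


class Pipeline:
    """Toy receiver -> processors -> exporters chain with size-based batching."""

    def __init__(self, processors, exporters, batch_size=2):
        self.processors = processors
        self.exporters = exporters
        self.batch_size = batch_size
        self._buffer = []

    def receive(self, span):
        """Receiver stage: accept one span and flush when the batch is full."""
        self._buffer.append(span)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Run the buffered batch through every processor, then every exporter."""
        batch, self._buffer = self._buffer, []
        for process in self.processors:
            batch = process(batch)
        if batch:
            for export in self.exporters:
                export(batch)


jaeger_like = []  # stand-in for a real backend
pipeline = Pipeline([drop_health_checks, add_environment], [jaeger_like.extend])
pipeline.receive({"name": "GET /healthz"})
pipeline.receive({"name": "GET /order"})
```

Swapping the backend means replacing the exporter list. The code that calls receive, which stands in for instrumented applications, does not change.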

Sampling

In high-throughput systems, tracing every request is prohibitively expensive.
OTel supports two sampling strategies:

Head-based sampling: the decision to trace is made at the root span, before any work is done. Simple and predictable, but it cannot consider downstream information (e.g., whether the request will eventually fail).
Tail-based sampling: the decision is made after the trace is complete, at the collector level. This allows policies like "keep all traces with errors" or "keep all traces exceeding 500ms." Tail-based sampling requires the collector to buffer complete traces before deciding, which increases memory pressure.
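
The difference between the two strategies shows up in what each decision function is allowed to look at. The sketch below is illustrative only: head_sample is similar in spirit to ratio-based samplers that derive the decision from the trace ID, but it is not the OpenTelemetry SDK's sampler, and the span dictionaries and the 500ms default are assumptions.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide from the trace ID alone, so every service reaches the same verdict."""
    # Treat the low 64 bits of the ID as a uniform random number in [0, 2**64).
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < int(ratio * 2**64)


def tail_sample(spans: list, latency_threshold_ms: float = 500.0) -> bool:
    """Decide after the trace is complete: keep traces with errors or slow roots."""
    has_error = any(span["status"] == "error" for span in spans)
    root = next((s for s in spans if s["parent_span_id"] is None), None)
    too_slow = root is not None and root["duration_ms"] > latency_threshold_ms
    return has_error or too_slow


fast_ok = [{"parent_span_id": None, "status": "ok", "duration_ms": 95}]
fast_failed = fast_ok + [{"parent_span_id": "S1", "status": "error", "duration_ms": 15}]
slow_ok = [{"parent_span_id": None, "status": "ok", "duration_ms": 1200}]

keep = [tail_sample(t) for t in (fast_ok, fast_failed, slow_ok)]
```

head_sample needs nothing but the ID, so it can run before any work is done. tail_sample needs every span of the trace, which is why the collector must buffer traces until they are complete.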

Jaeger

Jaeger, originally developed at Uber and now a CNCF graduated project, is a distributed tracing backend.
It handles storage, querying, and visualization of trace data.

Components

jaeger-collector: receives spans from agents or directly from OTel collectors, validates them, and writes to storage.
jaeger-query: serves the UI and API for trace retrieval and search.
Storage backend: Jaeger supports Cassandra, Elasticsearch/OpenSearch, Kafka (as a buffer), Badger (for local development), and ClickHouse (via community plugins). The storage schema is optimized for lookups by trace_id and for indexed searches by service name, operation, tags, and duration.

Jaeger and OpenTelemetry Convergence

Jaeger has progressively adopted OpenTelemetry as its instrumentation and collection layer.
As of Jaeger v2, the Jaeger backend is built directly on top of the OpenTelemetry Collector, using it as the core pipeline engine.
The Jaeger-specific SDKs are deprecated in favor of OTel SDKs.
This means Jaeger is now primarily a storage and query layer, while OpenTelemetry handles collection and transport.

Practical Considerations

Clock skew. Spans from different machines use different clocks.
Without clock synchronization (NTP or better), parent spans can appear to start after their children.
Jaeger applies heuristics at query time to detect and adjust for clock skew, but the fundamental problem requires good time synchronization infrastructure.
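
To make the symptom concrete, here is a minimal heuristic that checks whether a child span falls outside its parent's time interval and computes the shift that would bring it back inside. It is a simplified illustration and not Jaeger's actual adjustment algorithm. The function name and the dictionary keys are assumptions.

```python
def skew_adjustment_ms(parent: dict, child: dict) -> float:
    """Return the shift (ms) to add to the child's start so it fits in the parent.

    Returns 0 when the child already lies within the parent's interval, or
    when the child is longer than the parent (no shift could make it fit).
    """
    parent_end = parent["start_ms"] + parent["duration_ms"]
    child_end = child["start_ms"] + child["duration_ms"]
    if child["duration_ms"] > parent["duration_ms"]:
        return 0
    if child["start_ms"] < parent["start_ms"]:
        return parent["start_ms"] - child["start_ms"]  # child starts too early
    if child_end > parent_end:
        return parent_end - child_end  # child ends too late (negative shift)
    return 0


parent = {"start_ms": 1000, "duration_ms": 95}
early_child = {"start_ms": 990, "duration_ms": 15}   # clock runs 10ms behind
late_child = {"start_ms": 1090, "duration_ms": 15}   # clock runs 10ms ahead
ok_child = {"start_ms": 1005, "duration_ms": 15}

shifts = [skew_adjustment_ms(parent, c) for c in (early_child, late_child, ok_child)]
```

A heuristic like this can only make the display plausible. It cannot recover the true offsets, which is why synchronized clocks remain necessary.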

Data volume. Trace storage scales with request volume multiplied by the average number of spans per trace.
A system processing 100K requests/second with an average of 10 spans per trace generates 1M spans/second.
Sampling is not optional at this scale.
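
The arithmetic can be written out as a small calculation. The request rate and spans per trace come from the text. The 1% sampling rate and the 500 bytes per stored span are hypothetical figures chosen only to show how the numbers scale, so measure your own values before planning capacity.

```python
SECONDS_PER_DAY = 86_400


def spans_per_second(requests_per_second, spans_per_trace, sample_rate=1.0):
    """Span ingest rate after applying a trace-level sampling rate."""
    return requests_per_second * spans_per_trace * sample_rate


def storage_bytes_per_day(span_rate, bytes_per_span):
    """Raw storage written per day, ignoring indexes, replication, and compression."""
    return span_rate * bytes_per_span * SECONDS_PER_DAY


unsampled = spans_per_second(100_000, 10)                  # 1,000,000 spans/s
sampled = spans_per_second(100_000, 10, sample_rate=0.01)  # hypothetical 1% rate

# Hypothetical average span size of 500 bytes.
unsampled_tb_per_day = storage_bytes_per_day(unsampled, 500) / 1e12
sampled_tb_per_day = storage_bytes_per_day(sampled, 500) / 1e12
```

Under these assumed figures, the unsampled stream writes roughly 43 TB per day and the 1% sample roughly 0.43 TB per day.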

Instrumentation overhead. Span creation, context propagation, and asynchronous reporting all consume CPU and memory.
In practice, the overhead is usually small with proper batching and sampling, but it is nonzero and should be measured against your own workload.

Trace completeness. If any service in the call chain fails to propagate context, the trace is broken.
This makes instrumentation coverage a team-wide discipline, not an individual choice.
Missing instrumentation in one service creates a gap that affects every trace passing through it.
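
One way to spot broken propagation is to look for traces that start in a service that should never be the entry point. When a hop drops the context, the downstream service begins a fresh trace with its own root span. The check below is a sketch under assumptions: the span dictionaries, the service names, and the ENTRY_SERVICES set are hypothetical.

```python
from collections import defaultdict

# Hypothetical: the only services expected to start a new trace.
ENTRY_SERVICES = {"api-gateway"}


def find_broken_traces(spans):
    """Return sorted trace IDs whose root span came from a non-entry service."""
    roots = defaultdict(list)
    for span in spans:
        if span["parent_span_id"] is None:
            roots[span["trace_id"]].append(span["service"])
    return sorted(
        trace_id
        for trace_id, services in roots.items()
        if any(service not in ENTRY_SERVICES for service in services)
    )


spans = [
    {"trace_id": "T1", "span_id": "S1", "parent_span_id": None, "service": "api-gateway"},
    {"trace_id": "T1", "span_id": "S2", "parent_span_id": "S1", "service": "auth"},
    # The gateway-to-order hop dropped the header, so order-service started T2.
    {"trace_id": "T2", "span_id": "S3", "parent_span_id": None, "service": "order-service"},
    {"trace_id": "T2", "span_id": "S4", "parent_span_id": "S3", "service": "order-service"},
]
broken = find_broken_traces(spans)
```

Here the request from the walkthrough appears as two unrelated traces, T1 and T2, and the check flags T2 because its root belongs to an internal service.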

Key Points

A trace is a collection of causally related spans sharing a trace_id, forming a tree (or occasionally a DAG via span links) that represents a single request's journey through a distributed system.
Context propagation (via headers like W3C traceparent) is the mechanism that links spans across process boundaries and is the single most critical piece to get right.
OpenTelemetry provides a vendor-neutral instrumentation and collection standard, decoupling application code from the choice of tracing backend.
Jaeger serves as a storage, query, and visualization layer for trace data, and has converged with OpenTelemetry for its collection pipeline.
Tail-based sampling enables intelligent trace retention (e.g., keeping error traces) but requires buffering complete traces at the collector, increasing operational complexity.
Instrumentation coverage must be treated as a system-wide concern; a single uninstrumented service breaks trace continuity for all requests traversing it.
Clock synchronization across hosts is a prerequisite for meaningful span timing, not an afterthought.

References

Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., and Shanbhag, C. "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure." Google Technical Report, 2010.

W3C Trace Context Specification. "Trace Context - W3C Recommendation." W3C, 2021. https://www.w3.org/TR/trace-context/

OpenTelemetry Specification. "OpenTelemetry Specification." Cloud Native Computing Foundation. https://opentelemetry.io/docs/specs/otel/

Shkuro, Y. "Mastering Distributed Tracing." Packt Publishing, 2019.

Kaldor, J., Mace, J., Bejda, M., Gao, E., Kuropatwa, W., O'Neill, J., Ong, K. W., Schaller, B., Shan, P., Viscomi, B., Venkataraman, V., Veeraraghavan, K., and Song, Y. J. "Canopy: An End-to-End Performance Tracing And Analysis System." Proceedings of the 26th Symposium on Operating Systems Principles (SOSP), 2017.