A request hits your API gateway, bounces through five microservices, touches two databases, and returns an error. Logs say “something failed.” Which service? Which call? What was the payload?

Distributed tracing answers these questions by connecting the dots across service boundaries.

The Core Concepts

Traces and Spans

A trace represents a complete request journey. A span represents a single operation within that journey.

Trace
└── Span 1: API Gateway
    ├── Span 2: Auth Service
    └── Span 3: Order Service
        ├── Span: Inventory Check
        ├── Span: Payment Processing
        └── Span: Database Query

Each span has:

  • Trace ID: Links all spans in a request
  • Span ID: Unique identifier for this span
  • Parent Span ID: Who called me
  • Start/End time: Duration
  • Attributes: Key-value metadata
  • Events: Timestamped logs within the span
  • Status: OK, Error, or Unset

Context Propagation

The trace ID must travel with the request. Standard headers:

# W3C Trace Context (recommended)
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: vendor=value

# B3 (Zipkin style)
X-B3-TraceId: 80f198ee56343ba864fe8b2a57d3eff7
X-B3-SpanId: e457b5a2e4d86bd1
X-B3-ParentSpanId: 05e3ac9a4f6e3b90
X-B3-Sampled: 1

OpenTelemetry Implementation

OpenTelemetry is the vendor-neutral standard for instrumenting traces. Use it.

Python Setup

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Configure provider
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Auto-instrument libraries
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

# Get tracer for manual instrumentation
tracer = trace.get_tracer(__name__)

Manual Span Creation

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        # Add attributes
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.source", "web")
        
        try:
            # Child span for validation
            with tracer.start_as_current_span("validate_order"):
                validate(order_id)
            
            # Child span for payment
            with tracer.start_as_current_span("charge_payment") as payment_span:
                payment_span.set_attribute("payment.method", "credit_card")
                result = charge_customer(order_id)
                payment_span.set_attribute("payment.amount", result.amount)
            
            # Add event (like a log within the span)
            span.add_event("Order processed successfully", {
                "order.total": result.total
            })
            
        except PaymentError as e:
            span.set_status(Status(StatusCode.ERROR, str(e)))
            span.record_exception(e)
            raise

Go Setup

package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func initTracer() func() {
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatalf("creating OTLP exporter: %v", err)
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("order-service"),
        )),
    )
    
    otel.SetTracerProvider(tp)
    
    return func() { tp.Shutdown(context.Background()) }
}

func processOrder(ctx context.Context, orderID string) error {
    tracer := otel.Tracer("order-service")
    
    ctx, span := tracer.Start(ctx, "process_order")
    defer span.End()
    
    span.SetAttributes(
        attribute.String("order.id", orderID),
    )
    
    // Child span
    _, childSpan := tracer.Start(ctx, "validate_order")
    err := validate(orderID)
    childSpan.End()
    
    if err != nil {
        span.RecordError(err)
        return err
    }
    
    return nil
}

Node.js Setup

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

const provider = new NodeTracerProvider();

provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({ url: 'http://otel-collector:4317' })
  )
);

provider.register();

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

// Manual tracing
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('order-service');

async function processOrder(orderId) {
  return tracer.startActiveSpan('process_order', async (span) => {
    span.setAttribute('order.id', orderId);
    
    try {
      await validate(orderId);
      const result = await chargePayment(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Collector Configuration

The OpenTelemetry Collector receives, processes, and exports traces.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  
  # Add service name if missing
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
  
  # Sample to reduce volume
  probabilistic_sampler:
    sampling_percentage: 10

exporters:
  # Jaeger
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  
  # Or Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  
  # Debug output
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger, logging]

Deploy with Docker Compose:

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"  # UI
      - "14250:14250"  # gRPC

Sampling Strategies

Tracing everything is expensive. Sample intelligently.

Head-Based Sampling

Decide at trace start whether to sample:

from opentelemetry.sdk.trace.sampling import TraceIdRatioBased, ParentBased

# Sample 10% of traces
sampler = ParentBased(root=TraceIdRatioBased(0.1))

provider = TracerProvider(sampler=sampler)

Tail-Based Sampling

Decide after seeing the full trace (requires collector):

# Collector config
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      
      # Always keep slow requests
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 5000
      
      # Sample 5% of everything else
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

Priority Sampling

Flag important requests to always trace:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import (
    Decision, ParentBased, Sampler, SamplingResult, TraceIdRatioBased
)

class PrioritySampler(Sampler):
    """Always sample payment routes; ratio-sample everything else."""

    def __init__(self, fallback_ratio: float = 0.1):
        self._fallback = TraceIdRatioBased(fallback_ratio)

    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        target = str((attributes or {}).get("http.target", ""))
        if target.startswith("/api/payments"):
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return self._fallback.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state)

    def get_description(self) -> str:
        return "PrioritySampler"

provider = TracerProvider(sampler=ParentBased(root=PrioritySampler()))

Querying Traces

Jaeger Query Examples

Find slow requests:

service=order-service
operation=process_order
minDuration=2s

Find errors:

service=payment-service
error=true

Find by attribute:

service=api-gateway
order.id=abc123

Grafana Tempo with TraceQL

{ duration > 2s && resource.service.name = "order-service" }
{ span.order.id = "abc123" }
{ resource.db.system = "postgresql" } | count() > 3
{ span.http.status_code >= 500 }

Common Patterns

Async Job Tracing

Preserve trace context across queues:

import json
from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)

def enqueue_job(job_data):
    """Producer: inject trace context into message"""
    carrier = {}
    inject(carrier)
    
    message = {
        "data": job_data,
        "trace_context": carrier
    }
    queue.publish(json.dumps(message))

def process_job(message):
    """Consumer: extract and continue trace"""
    data = json.loads(message)
    context = extract(data.get("trace_context", {}))
    
    with tracer.start_as_current_span(
        "process_job",
        context=context,
        kind=SpanKind.CONSUMER
    ):
        do_work(data["data"])

Database Query Tracing

from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

SQLAlchemyInstrumentor().instrument(
    engine=engine,
    enable_commenter=True,  # Adds trace ID to SQL comments
)

# Queries now include:
# SELECT * FROM users /* traceparent='00-abc...' */

External Service Calls

import requests
from opentelemetry.instrumentation.requests import RequestsInstrumentor

RequestsInstrumentor().instrument()

# Trace context automatically propagated in headers
response = requests.get("https://api.external.com/data")
# Request includes: traceparent: 00-...

Debugging with Traces

The Investigation Flow

  1. Start with the error: Find traces with error=true
  2. Identify the failing span: Which service/operation failed?
  3. Check span attributes: What was the input?
  4. Look at parent spans: What called this?
  5. Check timing: Was there unusual latency before failure?
  6. Compare to working traces: What’s different?
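Steps 1 and 2 can be scripted. The sketch below queries the HTTP endpoint the Jaeger UI itself uses (`/api/traces` — convenient, but not a stable public API) and picks out the spans that recorded an error; `http://jaeger:16686` is a placeholder host:

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

JAEGER = "http://jaeger:16686"  # placeholder host

def find_error_traces(service, limit=20):
    """Fetch recent traces for a service that carry the error=true tag."""
    qs = urlencode({
        "service": service,
        "tags": json.dumps({"error": "true"}),
        "limit": limit,
    })
    with urlopen(f"{JAEGER}/api/traces?{qs}") as resp:
        return json.load(resp)["data"]

def failing_spans(trace_doc):
    """Step 2 of the flow: pick out the spans that recorded an error."""
    return [
        s for s in trace_doc["spans"]
        if any(t["key"] == "error" and t["value"] for t in s["tags"])
    ]

# A trimmed trace document in the shape Jaeger's API returns
example = {
    "spans": [
        {"operationName": "charge_payment", "duration": 1200,
         "tags": [{"key": "error", "value": True}]},
        {"operationName": "validate_order", "duration": 300, "tags": []},
    ]
}
assert [s["operationName"] for s in failing_spans(example)] == ["charge_payment"]
```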

Adding Debug Context

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
with tracer.start_as_current_span("risky_operation") as span:
    # Add context that helps debugging
    span.set_attribute("user.id", user_id)
    span.set_attribute("feature.flag", "new_algorithm")
    span.set_attribute("request.size_bytes", len(payload))
    
    # Log decision points
    span.add_event("cache_miss", {"key": cache_key})
    span.add_event("fallback_activated", {"reason": "timeout"})
    
    result = do_risky_thing()
    
    span.set_attribute("result.count", len(result))

The Checklist

  1. Instrument all services — One uninstrumented service breaks the chain
  2. Propagate context — Use standard headers (W3C Trace Context)
  3. Add meaningful attributes — IDs, sizes, flags, decisions
  4. Sample appropriately — 100% in dev, tail-sample in prod
  5. Connect logs to traces — Include trace ID in log messages
  6. Set up alerting — Error rate, latency percentiles by trace
  7. Document trace queries — Share common debugging patterns

Distributed tracing turns “something’s broken” into “line 47 of payment-service timed out waiting for inventory-service, which was blocked on a database lock.” That’s the difference between debugging for hours and debugging for minutes.

Instrument everything. Sample wisely. Follow the breadcrumbs.