When a request fails in a distributed system, the question isn’t if something went wrong—it’s where. Logs tell you what happened. Metrics tell you how often. But tracing tells you the story.
The Problem with Logs and Metrics Alone
You’ve got 15 microservices. A user reports slow checkout. You check the logs—thousands of entries. You check the metrics—latency is up, but which service? You’re playing detective without a map.
This is where distributed tracing shines. It connects the dots across service boundaries, showing you the exact path a request takes and where time is spent.
What Distributed Tracing Actually Does
A trace is a tree of spans. Each span represents a unit of work: an HTTP request, a database query, a queue operation. Spans carry context that propagates across services, linking parent and child operations.
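To make the tree concrete, here is a toy trace rendered as plain objects; the span names and timings are invented for illustration, not pulled from a real system:

```javascript
// A trace as a flat list of spans linked by parentId; durations in ms.
// All names and numbers here are illustrative.
const spans = [
  { id: 'a', parentId: null, name: 'GET /checkout',       durationMs: 420 },
  { id: 'b', parentId: 'a',  name: 'cart-service GET',    durationMs: 80  },
  { id: 'c', parentId: 'a',  name: 'product-service GET', durationMs: 310 },
  { id: 'd', parentId: 'c',  name: 'SELECT products',     durationMs: 280 },
];

// Share of total request time spent in the database call:
const root = spans.find((s) => s.parentId === null);
const db = spans.find((s) => s.name === 'SELECT products');
const share = Math.round((db.durationMs / root.durationMs) * 100); // ≈ 67
```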
One glance tells you the database call inside the product service is eating 67% of your request time.
OpenTelemetry: The Modern Standard
OpenTelemetry (OTel) has become the industry standard for instrumentation. It provides vendor-neutral APIs and SDKs for traces, metrics, and logs.
Basic Setup in Node.js
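A minimal sketch of that setup using the OTel Node SDK; the service name and collector endpoint are placeholders, and the exact API may shift between SDK versions:

```javascript
// tracing.js — load before the rest of the app: node -r ./tracing.js app.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'checkout-service', // placeholder
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // placeholder collector endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```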
Auto-instrumentation handles HTTP, database clients, and popular frameworks. You get tracing without touching your application code.
Manual Spans for Business Logic
Auto-instrumentation captures infrastructure, but business-critical operations need explicit spans:
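A hedged sketch using the OpenTelemetry API; `chargePayment` and the attribute names are placeholders standing in for your own business logic:

```javascript
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('checkout');

async function processOrder(order) {
  // startActiveSpan makes this span the parent of anything created inside.
  return tracer.startActiveSpan('process-order', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.value', order.total);
      return await chargePayment(order); // hypothetical business call
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```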
Context Propagation: The Magic Ingredient
The real power comes from propagating trace context across service boundaries. When Service A calls Service B, it passes a trace ID in headers:
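Concretely, OTel's default propagator uses the W3C Trace Context `traceparent` header; the hex IDs below are illustrative:

```javascript
// W3C Trace Context: version - trace ID - parent span ID - flags.
// The IDs here are made up for illustration.
const traceId = '4bf92f3577b34da6a3ce929d0e0e4736'; // 32 hex chars
const spanId = '00f067aa0ba902b7';                  // 16 hex chars
const traceparent = `00-${traceId}-${spanId}-01`;   // flag 01 = sampled

// Service A attaches it to the outgoing request, e.g.:
// fetch('http://service-b/api', { headers: { traceparent } });
```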
Service B continues the trace, adding its spans as children. This creates the unified view across your entire stack.
Propagation Across Message Queues
HTTP headers work for synchronous calls. For async messaging, inject context into message attributes:
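A sketch of that injection using the OTel propagation API, assuming an amqplib-style channel; `publishWithContext` and `handleMessage` are hypothetical helper names:

```javascript
const { context, propagation } = require('@opentelemetry/api');

// Producer: inject the active trace context into message headers.
function publishWithContext(channel, queue, payload) {
  const headers = {};
  propagation.inject(context.active(), headers); // writes `traceparent` into headers
  channel.sendToQueue(queue, Buffer.from(JSON.stringify(payload)), { headers });
}

// Consumer: extract the context and continue the trace as a child span.
function handleMessage(msg, tracer, processFn) {
  const parentCtx = propagation.extract(context.active(), msg.properties.headers);
  return context.with(parentCtx, () =>
    tracer.startActiveSpan('consume-message', (span) => {
      try {
        return processFn(msg);
      } finally {
        span.end();
      }
    })
  );
}
```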
Sampling Strategies
Tracing everything at scale is expensive. Smart sampling keeps costs manageable while preserving visibility:
Head-based sampling: Decide at trace start whether to sample.
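With the OTel Node SDK, a 10% head-based sample might look like this; the ratio is an example, and `ParentBasedSampler` ensures downstream services honor the root's decision:

```javascript
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

// Sample 10% of new traces; respect the parent's decision for child spans
// so a trace is never half-sampled across services.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});
```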
Tail-based sampling: Collect all spans, decide later. Keep errors and slow traces, sample normal ones. More powerful but requires a collector like Jaeger or the OTel Collector.
Adaptive sampling: Increase sampling during incidents, reduce during normal operation.
Backend Options
Where do traces go?
- Jaeger: Open source, battle-tested, great for self-hosted
- Zipkin: Simpler, also open source
- Grafana Tempo: Cost-effective, integrates with Grafana stack
- Datadog/New Relic/Honeycomb: SaaS with powerful UIs
OTel’s vendor-neutral approach means you can switch backends without changing instrumentation.
Practical Tips
Start with auto-instrumentation. Get value immediately, add manual spans for critical paths later.
Add business context. Span attributes like `user.id`, `order.value`, and `feature.flag` make traces actionable.
Set up error alerting. Trigger alerts when error spans spike, with direct links to traces.
Use trace IDs in logs. Correlate log entries with traces:

```javascript
logger.info('Processing order', {
  traceId: trace.getSpan(context.active())?.spanContext().traceId,
  orderId: order.id,
});
```

Sample intelligently. 100% sampling in dev, adaptive in production.
The Payoff
With tracing in place, debugging goes from “which of these 15 services is slow?” to “this database query in the product service takes 165ms because it’s missing an index.”
That’s not just faster incident response—it’s the difference between guessing and knowing.
Distributed tracing isn’t optional for microservices. It’s how you understand what’s actually happening in production.