When a request fails in a distributed system, the question isn’t if something went wrong—it’s where. Logs tell you what happened. Metrics tell you how often. But tracing tells you the story.
The Problem with Logs and Metrics Alone
You’ve got 15 microservices. A user reports slow checkout. You check the logs—thousands of entries. You check the metrics—latency is up, but which service? You’re playing detective without a map.
This is where distributed tracing shines. It connects the dots across service boundaries, showing you the exact path a request takes and where time is spent.
What Distributed Tracing Actually Does
A trace is a tree of spans. Each span represents a unit of work: an HTTP request, a database query, a queue operation. Spans carry context that propagates across services, linking parent and child operations.
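To make the tree concrete, here is a toy trace rendered as plain objects; the span names and timings are invented for illustration, not pulled from a real system:

```javascript
// A trace as a flat list of spans linked by parentId; durations in ms.
// All names and numbers here are illustrative.
const spans = [
  { id: 'a', parentId: null, name: 'GET /checkout',       durationMs: 420 },
  { id: 'b', parentId: 'a',  name: 'cart-service GET',    durationMs: 80  },
  { id: 'c', parentId: 'a',  name: 'product-service GET', durationMs: 310 },
  { id: 'd', parentId: 'c',  name: 'SELECT products',     durationMs: 280 },
];

// Share of total request time spent in the database call:
const root = spans.find((s) => s.parentId === null);
const db = spans.find((s) => s.name === 'SELECT products');
const share = Math.round((db.durationMs / root.durationMs) * 100); // ≈ 67
```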
One glance tells you the database call inside the product service is eating 67% of your request time.
OpenTelemetry: The Modern Standard
OpenTelemetry (OTel) has become the industry standard for instrumentation. It provides vendor-neutral APIs and SDKs for traces, metrics, and logs.
Basic Setup in Node.js
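A minimal sketch of that setup using the OTel Node SDK; the service name and collector endpoint are placeholders, and the exact API may shift between SDK versions:

```javascript
// tracing.js — load before the rest of the app: node -r ./tracing.js app.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'checkout-service', // placeholder
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // placeholder collector endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```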
Auto-instrumentation handles HTTP, database clients, and popular frameworks. You get tracing without touching your application code.
Manual Spans for Business Logic
Auto-instrumentation captures infrastructure, but business-critical operations need explicit spans:
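A hedged sketch using the OpenTelemetry API; `chargePayment` and the attribute names are placeholders standing in for your own business logic:

```javascript
const { trace, context, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('checkout');

async function processOrder(order) {
  // startActiveSpan makes this span the parent of anything created inside.
  return tracer.startActiveSpan('process-order', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.value', order.total);
      return await chargePayment(order); // hypothetical business call
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```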
Context Propagation: The Magic Ingredient
The real power comes from propagating trace context across service boundaries. When Service A calls Service B, it passes a trace ID in headers:
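Concretely, OTel's default propagator uses the W3C Trace Context `traceparent` header; the hex IDs below are illustrative:

```javascript
// W3C Trace Context: version - trace ID - parent span ID - flags.
// The IDs here are made up for illustration.
const traceId = '4bf92f3577b34da6a3ce929d0e0e4736'; // 32 hex chars
const spanId = '00f067aa0ba902b7';                  // 16 hex chars
const traceparent = `00-${traceId}-${spanId}-01`;   // flag 01 = sampled

// Service A attaches it to the outgoing request, e.g.:
// fetch('http://service-b/api', { headers: { traceparent } });
```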
Service B continues the trace, adding its spans as children. This creates the unified view across your entire stack.
Propagation Across Message Queues
HTTP headers work for synchronous calls. For async messaging, inject context into message attributes:
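A sketch of that injection using the OTel propagation API, assuming an amqplib-style channel; `publishWithContext` and `handleMessage` are hypothetical helper names:

```javascript
const { context, propagation } = require('@opentelemetry/api');

// Producer: inject the active trace context into message headers.
function publishWithContext(channel, queue, payload) {
  const headers = {};
  propagation.inject(context.active(), headers); // writes `traceparent` into headers
  channel.sendToQueue(queue, Buffer.from(JSON.stringify(payload)), { headers });
}

// Consumer: extract the context and continue the trace as a child span.
function handleMessage(msg, tracer, processFn) {
  const parentCtx = propagation.extract(context.active(), msg.properties.headers);
  return context.with(parentCtx, () =>
    tracer.startActiveSpan('consume-message', (span) => {
      try {
        return processFn(msg);
      } finally {
        span.end();
      }
    })
  );
}
```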
Sampling Strategies
Tracing everything at scale is expensive. Smart sampling keeps costs manageable while preserving visibility:
Head-based sampling: Decide at trace start whether to sample.
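With the OTel Node SDK, a 10% head-based sample might look like this; the ratio is an example, and `ParentBasedSampler` ensures downstream services honor the root's decision:

```javascript
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

// Sample 10% of new traces; respect the parent's decision for child spans
// so a trace is never half-sampled across services.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});
```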
Tail-based sampling: Collect all spans, decide later. Keep errors and slow traces, sample normal ones. More powerful but requires a collector like Jaeger or the OTel Collector.
Adaptive sampling: Increase sampling during incidents, reduce during normal operation.
Backend Options
Where do traces go?
- Jaeger: Open source, battle-tested, great for self-hosted
- Zipkin: Simpler, also open source
- Grafana Tempo: Cost-effective, integrates with Grafana stack
- Datadog/New Relic/Honeycomb: SaaS with powerful UIs
OTel’s vendor-neutral approach means you can switch backends without changing instrumentation.
Practical Tips
Start with auto-instrumentation. Get value immediately, add manual spans for critical paths later.
Add business context. Span attributes like `user.id`, `order.value`, and `feature.flag` make traces actionable.
Set up error alerting. Trigger alerts when error spans spike, with direct links to traces.
Use trace IDs in logs. Correlate log entries with traces:

```javascript
logger.info('Processing order', {
  traceId: trace.getSpan(context.active())?.spanContext().traceId,
  orderId: order.id,
});
```

Sample intelligently. 100% sampling in dev, adaptive in production.
The Payoff
With tracing in place, debugging goes from “which of these 15 services is slow?” to “this database query in the product service takes 165ms because it’s missing an index.”
That’s not just faster incident response—it’s the difference between guessing and knowing.
Distributed tracing isn’t optional for microservices. It’s how you understand what’s actually happening in production.