When your service goes down at 3 AM, you need answers fast. Observability—the ability to understand what’s happening inside your systems from their external outputs—is what separates a 5-minute fix from a 3-hour nightmare.
The three pillars of observability are logs, metrics, and traces. Each tells a different part of the story.
Logs: The Narrative
Logs are discrete events. They tell you what happened in human-readable terms.
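For example, a structured log event can be rendered as one JSON object per line. Here is a minimal standard-library Python sketch (field names like `request_id` and the payment values are illustrative; a real service would use a logging library rather than hand-rolling this):

```python
import json
from datetime import datetime, timezone

def log_event(level, message, **context):
    """Render one log event as a single JSON line (illustrative sketch)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **context,  # the who/what/why: user IDs, request IDs, amounts
    }
    return json.dumps(record)

print(log_event("ERROR", "payment failed",
                user_id=12345, request_id="req-abc123", amount_cents=1999))
```

Every field becomes searchable and aggregatable, which is exactly what the unstructured version can't give you.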
Best Practices for Logging
Structure your logs. JSON is your friend. Unstructured logs like “Payment failed for user 12345” are hard to search and aggregate.
Include context. Every log should answer: who, what, when, and ideally why. Always include request IDs for correlation.
Use appropriate levels:
- DEBUG: Development details
- INFO: Normal operations worth noting
- WARN: Something unexpected, but handled
- ERROR: Something failed
- FATAL: Service can’t continue
Sample high-volume logs. If you’re logging every request, you’ll drown in data. Sample at ingestion time, but keep 100% of errors.
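The keep-all-errors rule applies wherever you sample. As a sketch of the idea using Python’s standard logging module (the 1% rate is illustrative, and in practice you’d apply the same logic in your ingestion pipeline rather than in-process):

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Pass every WARNING-and-above record; sample lower levels at a fixed rate."""

    def __init__(self, sample_rate: float = 0.01):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # keep 100% of warnings and errors
        return random.random() < self.sample_rate  # sample the rest

logger = logging.getLogger("api")
logger.addFilter(SamplingFilter(sample_rate=0.01))
```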
Metrics: The Numbers
Metrics are numeric measurements over time. They tell you how your system is performing.
The Four Golden Signals
Google’s SRE book identified four critical metrics:
- Latency: How long requests take
- Traffic: How many requests you’re handling
- Errors: How many requests fail
- Saturation: How “full” your system is
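Computed from raw request records, the four signals might look like this. A standard-library sketch with invented sample data (in practice a metrics system such as Prometheus does this aggregation for you, and the capacity figure is an assumption):

```python
import math

# (latency_ms, succeeded) pairs for one measurement window; invented data
window = [(12, True), (15, True), (11, True), (480, False), (14, True)]

latencies = sorted(ms for ms, _ in window)
traffic = len(window)                                        # requests handled
error_rate = sum(1 for _, ok in window if not ok) / traffic  # fraction failed
p99 = latencies[min(traffic - 1, math.ceil(0.99 * traffic) - 1)]  # nearest-rank percentile
saturation = traffic / 1000                                  # vs. assumed 1000-request capacity

print(f"traffic={traffic} errors={error_rate:.0%} p99={p99}ms saturation={saturation:.1%}")
```

Note how the one slow, failed request dominates the p99 while barely moving the average: tail latency is why percentiles, not means, are the standard latency signal.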
Metrics vs Logs
A common question: if I have logs, why do I need metrics?
Metrics are aggregatable. “What’s the 99th percentile latency over the last hour?” is easy with metrics, painful with logs.
Metrics are cheap. Storing a counter increment costs almost nothing. Storing a full log line costs orders of magnitude more.
Metrics enable alerting. “Alert if error rate exceeds 1%” requires metrics. You can’t efficiently scan logs in real time for this.
Use logs for debugging individual requests. Use metrics for understanding system behavior.
Traces: The Journey
Traces follow a request as it travels through your distributed system. They tell you where time is spent.
Implementing Tracing
Modern tracing follows the OpenTelemetry standard:
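Conceptually, a tracer creates spans that share a trace ID, record their parent span, and time themselves. The toy sketch below illustrates that span model in plain Python; it is NOT the real OpenTelemetry API, which provides the same shape via `tracer.start_as_current_span`:

```python
import time
import uuid
from contextlib import contextmanager

class ToyTracer:
    """Illustrates the span model OpenTelemetry standardizes; not the real API."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex  # shared by every span in the trace
        self.finished = []
        self._stack = []

    @contextmanager
    def span(self, name):
        parent = self._stack[-1]["span_id"] if self._stack else None
        s = {"trace_id": self.trace_id, "span_id": uuid.uuid4().hex[:16],
             "parent_id": parent, "name": name, "start": time.time()}
        self._stack.append(s)
        try:
            yield s
        finally:
            self._stack.pop()
            s["duration_ms"] = (time.time() - s["start"]) * 1000
            self.finished.append(s)

tracer = ToyTracer()
with tracer.span("process_order"):
    with tracer.span("charge_card"):
        pass  # the call to the payment service would go here
```

Each finished span knows its trace, its parent, and its duration, which is all a backend needs to reassemble the request's journey as a timeline.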
Context Propagation
The magic of distributed tracing is context propagation. When Service A calls Service B, it passes trace context in headers:
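In the W3C Trace Context format that OpenTelemetry uses, that context travels in a `traceparent` header carrying a version, the trace ID, the caller’s span ID, and sampling flags (the hex values below are example values):

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```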
This links all spans from all services into a single trace.
Connecting the Pillars
The real power comes from connecting all three:
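The glue is a shared identifier: stamp the active trace ID onto every log line, and emit the same labels on your metrics. A standard-library sketch of the logging half (in a real service, `get_trace_id` would read the current span from your tracing library instead of returning a hard-coded example value):

```python
import json
import logging

class TraceContextFilter(logging.Filter):
    """Attach the active trace ID to every record so logs and traces join up."""

    def __init__(self, get_trace_id):
        super().__init__()
        self.get_trace_id = get_trace_id  # callable returning the current trace ID

    def filter(self, record):
        record.trace_id = self.get_trace_id()
        return True

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({"level": record.levelname,
                           "message": record.getMessage(),
                           "trace_id": getattr(record, "trace_id", None)})

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.addFilter(TraceContextFilter(lambda: "4bf92f3577b34da6"))  # example value
logger.warning("payment retry exhausted")
```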
Now when you see an error spike in your metrics dashboard, you can:
- Click through to see related logs
- Find the trace_id in those logs
- Open the trace to see exactly where things went wrong
Tool Recommendations
Logs:
- Self-hosted: Loki + Grafana
- Managed: Datadog, Splunk, AWS CloudWatch
Metrics:
- Self-hosted: Prometheus + Grafana
- Managed: Datadog, New Relic, AWS CloudWatch
Traces:
- Self-hosted: Jaeger, Tempo
- Managed: Datadog APM, Honeycomb, AWS X-Ray
All-in-one: Grafana Cloud, Datadog, and Elastic all offer unified observability platforms.
Start Small, Grow Incrementally
You don’t need all three pillars on day one:
- Week 1: Add structured logging with request IDs
- Week 2: Add the four golden signals as metrics
- Week 3: Add tracing to your most critical path
- Week 4: Connect them with shared identifiers
The goal isn’t perfect observability—it’s enough observability to debug problems quickly. Start with what hurts most, and expand from there.
Good observability is an investment that pays dividends at 3 AM. Your on-call self will thank your past self.