Monitoring answers “is it working?” Observability answers “why isn’t it working?” The difference matters when you’re debugging a production incident at 3am.
The three pillars of observability — logs, metrics, and traces — each provide different perspectives. Together, they create a complete picture of system behavior.
Logs: The Narrative
Logs tell you what happened, in order:
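A sketch of what that narrative looks like, using Python's standard `logging` module; the checkout flow and messages are invented for illustration, and the output is captured in memory so the example is self-contained:

```python
import io
import logging

# Capture log output in a buffer so the example is self-contained.
buf = io.StringIO()
log = logging.getLogger("checkout")
log.setLevel(logging.INFO)
log.addHandler(logging.StreamHandler(buf))

# One request's story, told in order (hypothetical flow):
log.info("request received: POST /checkout")
log.info("inventory reserved: sku=ABC-123")
log.warning("payment gateway slow: 1200ms")
log.info("order confirmed: order_id=4521")

print(buf.getvalue())
```

Read top to bottom, the lines reconstruct what this one request did and where it slowed down.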
Good for:
- Debugging specific requests
- Understanding error context
- Audit trails
- Ad-hoc investigation
Challenges:
- High volume, high cost
- Searching requires indexing
- Context scattered across services
Structured Logging
Always log structured data:
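A sketch of the difference: the same event as an unstructured string versus named, typed fields (the field names here are illustrative):

```python
import json

# Unstructured: grep-able, but not queryable.
unstructured = "user 123 bought item 456 for $99.99"

# Structured: every field becomes a named, typed column in the log store.
event = {
    "event": "purchase_completed",
    "user_id": 123,
    "item_id": 456,
    "amount_usd": 99.99,
}
line = json.dumps(event)
print(line)
```

With the structured form, "purchases over $100" is a filter on `amount_usd`, not a fragile regex.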
Structured logs are queryable: “show all purchases over $100 in the last hour.”
Metrics: The Numbers
Metrics are numerical measurements aggregated over time.
Good for:
- Dashboards and visualization
- Alerting on thresholds
- Capacity planning
- Trend analysis
Challenges:
- High cardinality kills performance
- Aggregation loses detail
- Choosing what to measure
The RED Method (Request-focused)
For services:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency distribution
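A toy in-process sketch of the three RED signals; a real service would use a metrics library (Prometheus client, StatsD), and the sample values below are invented:

```python
from dataclasses import dataclass, field

@dataclass
class RedMetrics:
    requests: int = 0
    errors: int = 0
    durations_ms: list = field(default_factory=list)

    def observe(self, duration_ms, ok=True):
        # Rate: count requests; Errors: count failures; Duration: record latency.
        self.requests += 1
        if not ok:
            self.errors += 1
        self.durations_ms.append(duration_ms)

    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0

    def p99_ms(self):
        d = sorted(self.durations_ms)
        return d[min(len(d) - 1, int(0.99 * len(d)))]

m = RedMetrics()
for ms in (12, 15, 11, 900):
    m.observe(ms)
m.observe(5000, ok=False)
```

Duration is kept as a distribution rather than an average: the p99 tells you about the requests an average would hide.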
The USE Method (Resource-focused)
For infrastructure:
- Utilization: Percentage of resource busy
- Saturation: Queue length, backlog
- Errors: Error count
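The same idea sketched for a resource, here a hypothetical worker pool (names and numbers are made up):

```python
from dataclasses import dataclass

@dataclass
class UseSnapshot:
    busy_workers: int
    total_workers: int
    queue_depth: int   # saturation: work waiting because the resource is full
    error_count: int

    def utilization(self):
        return self.busy_workers / self.total_workers

    def saturated(self):
        # High utilization alone isn't a problem; a growing backlog is.
        return self.queue_depth > 0

snap = UseSnapshot(busy_workers=8, total_workers=10, queue_depth=3, error_count=0)
```

Utilization and saturation are deliberately separate signals: a resource at 100% with an empty queue is efficient, while one at 100% with a backlog is a bottleneck.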
Traces: The Journey
Traces follow a single request across service boundaries, recording each hop as a timed span.
Good for:
- Understanding distributed latency
- Finding bottlenecks
- Debugging cross-service issues
- Dependency mapping
Challenges:
- Sampling required at scale
- Instrumentation overhead
- Context propagation across boundaries
Implementing Tracing
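A minimal homemade tracer to show the core idea (production systems would use OpenTelemetry or similar): each span records its name, parent, and duration, and finished spans would be exported to a backend.

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []  # a real tracer exports these to a backend instead

@contextmanager
def span(name, parent_id=None):
    s = {"id": uuid.uuid4().hex[:16], "parent": parent_id,
         "name": name, "start": time.monotonic()}
    try:
        yield s
    finally:
        s["duration_ms"] = (time.monotonic() - s["start"]) * 1000
        finished_spans.append(s)

# A request that fans out into child work (hypothetical names):
with span("GET /checkout") as root:
    with span("db.query", parent_id=root["id"]):
        pass
    with span("payment.charge", parent_id=root["id"]):
        pass
```

The parent links are what let a trace viewer reassemble the spans into a tree and show where the time went.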
Propagate context to downstream services:
Connecting the Pillars
The magic happens when logs, metrics, and traces connect:
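The connection is shared identifiers. A sketch: the same trace id appears in the log line and as an exemplar on the metric, so you can pivot between signals (field names are illustrative):

```python
import json

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

# The log line carries the trace id of the request that produced it.
log_line = json.dumps({"level": "error", "event": "payment_failed",
                       "trace_id": trace_id})

# The metric carries an exemplar linking this increment to one concrete trace.
metric = {"name": "payment_errors_total", "value": 1,
          "exemplar": {"trace_id": trace_id}}
```

A spike in `payment_errors_total` leads to an exemplar trace, and the trace id leads to every log line that request produced.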
From an alert (metric), jump to the trace. From the trace, find the relevant logs. Each provides context the others lack.
Correlation IDs
Every request gets a unique ID that flows everywhere:
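A hypothetical middleware sketch: reuse the caller's ID if one arrived, otherwise mint a new one, and make it available to everything downstream via a context variable:

```python
import uuid
from contextvars import ContextVar

request_id = ContextVar("request_id", default="-")

def with_request_id(handler):
    def wrapped(headers):
        # Honor an incoming ID so the chain stays connected; mint one otherwise.
        rid = headers.get("X-Request-ID") or uuid.uuid4().hex
        token = request_id.set(rid)
        try:
            return handler()
        finally:
            request_id.reset(token)
    return wrapped

@with_request_id
def handle():
    # Any log call, metric label, or span attribute can read request_id.get().
    return request_id.get()
```

The header name `X-Request-ID` is a common convention, not a standard; the point is that one ID survives the whole hop chain.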
Now every log, metric label, and trace span can include request_id.
Sampling Strategies
At scale, you can’t keep everything:
Head-based sampling: Decide at the start whether to trace this request.
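A sketch of head-based sampling: derive the decision deterministically from the trace id, so every service in the request path makes the same keep/drop call without coordinating:

```python
def head_sample(trace_id, rate=0.01):
    # Same trace id -> same decision everywhere, no coordination needed.
    return int(trace_id, 16) % 10_000 < rate * 10_000

sampled = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", rate=0.01)
```

The cost of deciding up front: you cannot know yet whether this request will turn out to be interesting.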
Tail-based sampling: Decide after the request completes.
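Tail-based sampling sees the finished trace, so it can keep everything interesting and sample the uneventful majority (the thresholds below are invented):

```python
def tail_keep(trace, baseline_rate=0.01):
    # Always keep errors and slow traces; sample the boring rest.
    if trace["error"] or trace["duration_ms"] > 1_000:
        return True
    return int(trace["trace_id"], 16) % 100 < baseline_rate * 100
```

The trade-off is buffering: every span of a trace must be held somewhere until the request completes and the decision can be made.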
Adaptive sampling: Adjust rate based on traffic.
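A sketch of the adjustment: aim for a roughly constant volume of traces per second, sampling everything at low traffic and dialing the rate down as traffic grows (the target is an invented example):

```python
def adaptive_rate(requests_per_sec, target_traces_per_sec=10.0):
    # Trace everything until traffic exceeds the target volume,
    # then scale the rate down proportionally.
    if requests_per_sec <= target_traces_per_sec:
        return 1.0
    return target_traces_per_sec / requests_per_sec
```

At 5 req/s every request is traced; at 1000 req/s only 1% are, but the backend still receives about 10 traces per second either way.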
Alerting Philosophy
Alert on symptoms, not causes:
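The distinction sketched as alert conditions (thresholds are invented): page on what users experience; record causes for later investigation.

```python
def page_oncall(error_rate, p99_ms):
    # Symptoms: users are seeing errors or slowness right now.
    return error_rate > 0.01 or p99_ms > 2_000

def worth_a_ticket(cpu_util, disk_util):
    # Causes: worth investigating, but not worth waking anyone at 3am.
    return cpu_util > 0.9 or disk_util > 0.8
```

High CPU with a healthy error rate and latency is a ticket; a 3% error rate is a page regardless of what the CPU graph says.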
Symptoms tell you users are affected. Causes require investigation.
Alert Fatigue
Every alert should be:
- Actionable: Someone can do something about it
- Urgent: It needs attention now (or use a ticket, not an alert)
- Real: Low false positive rate
If you’re ignoring alerts, you have too many or they’re misconfigured.
Tool Choices
Metrics: Prometheus, InfluxDB, Datadog, CloudWatch
Logs: Elasticsearch, Loki, Splunk, CloudWatch Logs
Traces: Jaeger, Zipkin, Datadog APM, AWS X-Ray
Unified platforms: Datadog, Grafana Cloud, New Relic, Honeycomb
Starting out? Prometheus + Loki + Tempo with Grafana covers all three pillars, is fully open source, and the pieces integrate well with each other.
The Observability Mindset
Observability isn’t a product you buy. It’s a property of your system:
- Instrument everything: If it’s not measured, it doesn’t exist
- Correlate across signals: Connect logs, metrics, traces
- Ask arbitrary questions: Can you debug issues you’ve never seen before?
- Reduce MTTR: The goal is faster incident resolution
The system that can explain its own behavior is the system you can actually operate.
Logs tell the story. Metrics show the trends. Traces reveal the journey. Separately, each is useful. Together, they’re essential.
Build observability in from the start. Instrument before you need to debug. Connect your signals with correlation IDs. The incident you resolve in minutes instead of hours will justify every line of instrumentation code.