You’ve got logging. Great. Now your system is down and you’re grepping through 50GB of text trying to figure out why. Sound familiar?
Observability isn’t about collecting more data. It’s about collecting the right data and making it queryable. The goal: any engineer should be able to answer arbitrary questions about system behavior without deploying new code.
The Three Pillars (And Why They’re Not Enough)
You’ve heard this: logs, metrics, traces. The “three pillars of observability.” It’s a useful framework, but it misses something crucial: correlation.
Individual pillars are noise. Correlated pillars are signal.
When your trace links to the logs that happened during that request, which link to the metrics that spiked at that moment — that’s observability.
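In code, the join key is a shared trace ID carried by every pillar. A minimal sketch (the `trace_id` field name and ID generation here are illustrative; a real service takes the ID from its tracing library, e.g. the current OpenTelemetry span context):

```python
import json
import secrets

# Illustrative only: in production the trace ID comes from your tracer,
# not from a random generator.
trace_id = secrets.token_hex(16)

def log_event(event: str, **fields) -> str:
    """Emit a structured log line that carries the active trace ID,
    so logs can be joined against traces and metrics."""
    record = {"event": event, "trace_id": trace_id, **fields}
    line = json.dumps(record)
    print(line)
    return line

line = log_event("checkout.failed", user_id="u-123", status=502)
```

Every log line emitted during a request now shares the request’s trace ID, which is exactly the handle you need to pivot between pillars.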
Structured Logging: Your First Real Win
If you do nothing else, do this: stop logging strings. Log structured events.
Bad: a free-form string you can only search with regex.
Good: a structured event whose fields you can filter and aggregate.
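Sketched in code (field names and values are illustrative):

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

# Bad: an unstructured string -- querying this later means regex gymnastics.
log.info("User u-123 purchased item sku-9 for $120.00 in 2.3s")

# Good: a structured event -- every field is individually queryable.
def log_event(event: str, **fields) -> str:
    line = json.dumps({"event": event, **fields})
    log.info(line)
    return line

line = log_event("purchase.completed",
                 user_id="u-123", item="sku-9",
                 amount_usd=120.00, duration_s=2.3)
```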
The second version is queryable. You can ask: “Show me all purchases over $100 that took longer than 2 seconds” without regex gymnastics.
The Metrics That Actually Matter
Stop measuring everything. Start measuring what tells a story.
The RED Method (for services):
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Distribution of request latency
The USE Method (for resources):
- Utilization: Average time the resource is busy servicing work
- Saturation: Queue depth / backlog
- Errors: Error events
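In Prometheus terms, the RED method boils down to a handful of recording rules. A sketch, assuming an `http_request_duration_seconds` histogram with `service` and `status` labels (swap in your own metric names):

```yaml
groups:
  - name: red-metrics
    rules:
      # Rate: requests per second, per service.
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_request_duration_seconds_count[5m]))
      # Errors: failed requests per second, per service.
      - record: service:http_errors:rate5m
        expr: sum by (service) (rate(http_request_duration_seconds_count{status=~"5.."}[5m]))
      # Duration: median and tail latency from the histogram.
      - record: service:http_latency_seconds:p50
        expr: histogram_quantile(0.50, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: service:http_latency_seconds:p99
        expr: histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```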
Four recording rules give you a complete picture of any service’s health. That’s it.
Distributed Tracing Without the Pain
Tracing gets complicated fast. Here’s how to make it work:
1. Instrument at the edges
Start with HTTP handlers and database calls. Don’t trace every function.
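In practice you’d reach for OpenTelemetry here; the hand-rolled sketch below (all names illustrative) just shows the shape — one span around the HTTP handler, one around the database call, and nothing in between:

```python
import time
from contextlib import contextmanager

SPANS = []  # stand-in for a real trace exporter

@contextmanager
def span(name: str, **attrs):
    """Record one span: name, attributes, duration."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({"name": name,
                      "duration_ms": (time.perf_counter() - start) * 1000,
                      **attrs})

def fetch_order(order_id: str):
    # Edge #2: the database call gets its own span.
    with span("db.query", statement="SELECT * FROM orders WHERE id = ?"):
        return {"id": order_id, "total": 120.00}

def handle_get_order(order_id: str):
    # Edge #1: the HTTP handler gets a span; helper functions in
    # between do not.
    with span("GET /orders/{id}", order_id=order_id):
        return fetch_order(order_id)

handle_get_order("o-42")
```

Two spans per request is often enough to tell you whether time went to your code or to the database.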
2. Sample intelligently
100% sampling is expensive and rarely necessary. Sample:
- All errors (always)
- All slow requests (>p99)
- 1-10% of everything else
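The decision above, sketched in code (thresholds are illustrative; in production this usually lives in your collector’s tail-sampling policy rather than application code):

```python
import random

def should_sample(is_error: bool, duration_ms: float,
                  p99_ms: float, base_rate: float = 0.05) -> bool:
    """Keep all errors, all slow requests, and a small slice of the rest."""
    if is_error:
        return True                       # all errors, always
    if duration_ms > p99_ms:
        return True                       # all slow requests (> p99)
    return random.random() < base_rate    # 1-10% of everything else

# Errors and slow requests are kept regardless of the dice roll.
assert should_sample(is_error=True, duration_ms=10, p99_ms=500)
assert should_sample(is_error=False, duration_ms=900, p99_ms=500)
```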
3. Add context, not noise
Every span should answer: what happened, to what, and why should I care?
Alerting That Doesn’t Suck
Your alerts should be symptoms, not causes. Alert on user impact, investigate causes.
Bad alert: “CPU usage above 80%”
Good alert: “Error rate for checkout flow exceeds 1%”
The first tells you a metric moved. The second tells you users are hurting.
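Sketched as a Prometheus alerting rule (the metric name, labels, and runbook URL are placeholders for your own):

```yaml
groups:
  - name: checkout-alerts
    rules:
      - alert: CheckoutErrorRateHigh
        # Symptom, not cause: users are failing to check out.
        expr: |
          sum(rate(http_requests_total{route="/checkout",status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{route="/checkout"}[5m])) > 0.01
        for: 2m
        labels:
          severity: page
        annotations:
          summary: Checkout error rate above 1% for 2 minutes
          runbook_url: https://runbooks.example.com/checkout-errors
```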
Notice the `for: 2m` — don’t alert on transient blips. And notice the `runbook_url` — every alert should link to what to do about it.
The Stack That Works
After years of iteration, here’s what actually works for most teams:
- Metrics: Prometheus + Grafana (or VictoriaMetrics for scale)
- Logs: Loki or Elasticsearch (Loki if you’re already in Grafana-land)
- Traces: Tempo, Jaeger, or Honeycomb
- Correlation: Grafana with linked data sources
The magic is in the linking. Same dashboard shows metrics, lets you jump to traces for outliers, which link to relevant logs.
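One way to wire up that linking is Grafana datasource provisioning: a derived field on the Loki datasource turns each log line’s trace ID into a click-through to Tempo. A sketch — the UIDs, URL, and regex below are assumptions about your setup and log format:

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Extract the trace_id field from each JSON log line and
        # link it to the matching trace in the Tempo datasource.
        - name: TraceID
          matcherRegex: '"trace_id":"([a-f0-9]+)"'
          datasourceUid: tempo
          url: '$${__value.raw}'
```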
Start Here
- Today: Add trace IDs to all your logs
- This week: Set up the RED metrics for one critical service
- This month: Get one dashboard that shows metrics + traces + logs together
- This quarter: Roll it out to everything
Observability isn’t a destination. It’s a practice. Start small, but start now.
The difference between “we have monitoring” and “we have observability” is the difference between having data and being able to answer questions. Build for the questions you don’t know you’ll ask yet.