You’ve got logging. Great. Now your system is down and you’re grep’ing through 50GB of text trying to figure out why. Sound familiar?

Observability isn’t about collecting more data. It’s about collecting the right data and making it queryable. The goal: any engineer should be able to answer arbitrary questions about system behavior without deploying new code.

The Three Pillars (And Why They’re Not Enough)

You’ve heard this: logs, metrics, traces. The “three pillars of observability.” It’s a useful framework, but it misses something crucial: correlation.

Individual pillars are noise. Correlated pillars are signal.

# Grafana Tempo trace example showing correlation
traceparent: 00-abc123-def456-01
span_name: api_request
duration_ms: 2340
http_status: 500
service: payment-api
linked_logs:
  - timestamp: "2026-03-12T03:45:23Z"
    message: "Payment gateway timeout after 2000ms"
linked_metrics:
  - payment_gateway_latency_p99: 1850ms
  - payment_gateway_errors_total: 47

When your trace links to the logs that happened during that request, which link to the metrics that spiked at that moment — that’s observability.
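That log-to-trace link comes from stamping every log record with the active trace ID. Here is a minimal stdlib-only sketch of the idea; a real setup would read the ID from the current OpenTelemetry span context, and the `current_trace_id` variable and field names are illustrative:

```python
import contextvars
import logging

# Hypothetical context variable holding the active trace ID; a real setup
# would read this from the current OpenTelemetry span context instead.
current_trace_id = contextvars.ContextVar("current_trace_id", default="")

class TraceIdFilter(logging.Filter):
    """Stamp every log record with the trace ID of the current request."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

logger = logging.getLogger("payment-api")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('{"message": "%(message)s", "trace_id": "%(trace_id)s"}'))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)

current_trace_id.set("abc123")  # set by middleware when the request starts
logger.info("Payment gateway timeout after 2000ms")
# The emitted line carries trace_id "abc123", joinable to the trace above
```

With the trace ID on every line, "show me the logs for this trace" becomes a single equality filter in your log store.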

Structured Logging: Your First Real Win

If you do nothing else, do this: stop logging strings. Log structured events.

Bad:

logger.info(f"User {user_id} purchased {item} for ${price}")

Good:

logger.info("purchase_completed", extra={
    "user_id": user_id,
    "item_id": item,
    "price_cents": round(price * 100),  # avoid float cents like 14999.000000002
    "currency": "USD",
    "payment_method": method,
    "trace_id": get_current_trace_id()
})

The second version is queryable. You can ask: “Show me all purchases over $100 that took longer than 2 seconds” without regex gymnastics.
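To make that concrete, here is a sketch of that query against parsed events. The sample lines are invented, and the duration_ms field is an assumption (it would typically be added by request middleware):

```python
import json

# Hypothetical structured log lines as they might sit in a log store
log_lines = [
    '{"event": "purchase_completed", "price_cents": 14999, "duration_ms": 2340}',
    '{"event": "purchase_completed", "price_cents": 499, "duration_ms": 120}',
    '{"event": "purchase_completed", "price_cents": 25000, "duration_ms": 90}',
]

# "All purchases over $100 that took longer than 2 seconds": plain field
# comparisons on parsed events, no regex required
slow_big_purchases = [
    event for event in map(json.loads, log_lines)
    if event["event"] == "purchase_completed"
    and event["price_cents"] > 10_000
    and event["duration_ms"] > 2_000
]
print(len(slow_big_purchases))  # -> 1
```

The same comparison expresses directly in any log store that indexes fields (Loki's LogQL, Elasticsearch queries), which is the whole point of structuring the event.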

The Metrics That Actually Matter

Stop measuring everything. Start measuring what tells a story.

The RED Method (for services):

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Distribution of request latency

The USE Method (for resources):

  • Utilization: Average time resource is busy
  • Saturation: Queue depth / backlog
  • Errors: Error events
# Prometheus rules for the RED method
groups:
  - name: service_red_metrics
    rules:
      - record: service:request_rate:5m
        expr: sum(rate(http_requests_total[5m])) by (service)
      
      - record: service:error_rate:5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      
      - record: service:latency_p99:5m
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))

Three recording rules give you a complete picture of any service’s health. That’s it.
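If histogram_quantile feels opaque: it estimates a quantile by linear interpolation within cumulative histogram buckets. A rough sketch of the idea, with invented bucket data (Prometheus’s real implementation handles more edge cases):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets: a sorted list of
    (upper_bound_seconds, cumulative_count), like Prometheus *_bucket series."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 1000 requests: 600 under 100ms, 900 under 500ms, 990 under 1s, all under 2.5s
buckets = [(0.1, 600), (0.5, 900), (1.0, 990), (2.5, 1000)]
print(histogram_quantile(0.99, buckets))  # -> 1.0 (seconds)
```

This is also why p99 accuracy depends on your bucket boundaries: the answer can only be as precise as the bucket the rank falls into.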

Distributed Tracing Without the Pain

Tracing gets complicated fast. Here’s how to make it work:

1. Instrument at the edges

Start with HTTP handlers and database calls. Don’t trace every function.

func handleRequest(w http.ResponseWriter, r *http.Request) {
    ctx, span := tracer.Start(r.Context(), "handle_request",
        trace.WithAttributes(
            attribute.String("http.method", r.Method),
            attribute.String("http.route", r.URL.Path), // ideally the matched route template, not the raw path
        ),
    )
    defer span.End()
    
    // Your handler logic, passing ctx to downstream calls
}

2. Sample intelligently

100% sampling is expensive and rarely necessary. Sample:

  • All errors (always)
  • All slow requests (>p99)
  • 1-10% of everything else
# OpenTelemetry Collector tail-sampling config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 2000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
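The config above boils down to a per-trace keep/drop decision. A hypothetical sketch of that logic (the thresholds and names are assumptions mirroring the policy described here):

```python
import random

def should_sample(status_code, duration_ms, base_rate=0.10, slow_ms=2000):
    """Keep all errors and all slow requests, plus a baseline
    fraction of everything else."""
    if status_code >= 500:
        return True   # all errors, always
    if duration_ms >= slow_ms:
        return True   # all slow requests
    return random.random() < base_rate  # 1-10% of everything else

should_sample(502, 120)   # True: errors are always kept
should_sample(200, 2600)  # True: slower than the 2000ms threshold
```

Because the error and latency checks run first, the expensive traces you actually need for debugging are never lost to the dice roll.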

3. Add context, not noise

Every span should answer: what happened, to what, and why should I care?
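A hypothetical before/after for the payment span above; every attribute name here is invented for illustration:

```python
# Answers "what happened, to what, and why should I care"
good_attributes = {
    "payment.gateway": "stripe",      # what we called
    "payment.amount_cents": 14999,    # to what
    "payment.retry_count": 2,         # why this request is interesting
}

# Noise: constant, unbounded, or better captured elsewhere
noisy_attributes = {
    "hostname": "ip-10-0-3-17",       # belongs on the resource, not every span
    "full_request_body": "...",       # unbounded cardinality, PII risk
}
```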

Alerting That Doesn’t Suck

Your alerts should be symptoms, not causes. Alert on user impact, investigate causes.

Bad alert: “CPU usage above 80%”
Good alert: “Error rate for checkout flow exceeds 1%”

The first tells you a metric moved. The second tells you users are hurting.

# Prometheus alerting rule example
groups:
  - name: user_impact
    rules:
      - alert: CheckoutErrorRateHigh
        expr: |
          sum(rate(http_requests_total{handler="checkout", status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{handler="checkout"}[5m]))
          > 0.01
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Checkout errors affecting users"
          runbook_url: "https://wiki.example.com/runbooks/checkout-errors"

Two details matter here. The for: 2m clause means you don’t page on transient blips. The runbook_url annotation means every alert links to what to do about it.
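The for: 2m hold can be sketched as a window check: a hypothetical evaluator (the 15-second evaluation step is an assumption) that fires only when every recent evaluation breached the threshold:

```python
def should_fire(error_rates, threshold=0.01, hold_seconds=120, step_seconds=15):
    """Fire only if the error rate exceeded the threshold at every
    evaluation step across the whole hold window ('for: 2m')."""
    needed = hold_seconds // step_seconds
    recent = error_rates[-needed:]
    return len(recent) == needed and all(r > threshold for r in recent)

should_fire([0.02] * 8)            # True: breach sustained for the full 2m
should_fire([0.02] * 7 + [0.005])  # False: a single dip resets the clock
```

The trade-off is latency: a hold window delays the page by up to that window, which is the price of not waking people for noise.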

The Stack That Works

After years of iteration, here’s what actually works for most teams:

  • Metrics: Prometheus + Grafana (or VictoriaMetrics for scale)
  • Logs: Loki or Elasticsearch (Loki if you’re already in Grafana-land)
  • Traces: Tempo, Jaeger, or Honeycomb
  • Correlation: Grafana with linked data sources

The magic is in the linking: the same dashboard shows your metrics, lets you jump from an outlier to its trace, and from the trace to the logs for that request.

Start Here

  1. Today: Add trace IDs to all your logs
  2. This week: Set up the RED metrics for one critical service
  3. This month: Get one dashboard that shows metrics + traces + logs together
  4. This quarter: Roll it out to everything

Observability isn’t a destination. It’s a practice. Start small, but start now.


The difference between “we have monitoring” and “we have observability” is the difference between having data and being able to answer questions. Build for the questions you don’t know you’ll ask yet.