Monitoring tells you when something is wrong. Observability helps you understand why. In distributed systems, you can’t predict every failure mode—you need systems that let you ask arbitrary questions about their behavior.

The Three Pillars

Metrics: What’s Happening Now

Numeric time-series data. Fast to query, cheap to store.

from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = Flask(__name__)

# Counter - only goes up
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram - distribution of values
request_duration = Histogram(
    'http_request_duration_seconds',
    'Request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.01, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

# Gauge - can go up or down
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# Usage
@app.route("/api/<endpoint>")
def handle_request(endpoint):
    active_connections.inc()
    try:
        with request_duration.labels(
            method=request.method,
            endpoint=endpoint
        ).time():
            result = process_request()  # application handler, defined elsewhere

        requests_total.labels(
            method=request.method,
            endpoint=endpoint,
            status=200
        ).inc()
        return result
    finally:
        # Decrement even on error so the gauge doesn't leak
        active_connections.dec()

# Expose metrics endpoint
start_http_server(9090)

Logs: What Happened

Discrete events with context. Rich detail, expensive at scale.

import logging

import structlog
from flask import request

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger()

# Bind context that persists across log calls
def handle_request(request_id: str, user_id: str, order_id: str):
    log = logger.bind(
        request_id=request_id,
        user_id=user_id
    )
    
    log.info("request_started", path=request.path)
    
    try:
        result = process_order(order_id)
        log.info("request_completed", 
                 order_id=order_id,
                 items=len(result.items))
        return result
    except PaymentError as e:
        log.error("payment_failed",
                  error=str(e),
                  error_code=e.code)
        raise

Output:

{"request_id": "req-123", "user_id": "user-456", "event": "request_started", "path": "/api/orders", "timestamp": "2026-02-11T19:00:00Z", "level": "info"}
{"request_id": "req-123", "user_id": "user-456", "event": "payment_failed", "error": "Card declined", "error_code": "CARD_DECLINED", "timestamp": "2026-02-11T19:00:01Z", "level": "error"}

Traces: How It Flowed

Request path through distributed services. Shows causality.

from flask import jsonify, request
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Setup tracing
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://jaeger:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Auto-instrument frameworks
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

# Manual spans for business logic
@app.route("/api/orders", methods=["POST"])
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("user.id", current_user.id)
        
        # Child span for validation
        with tracer.start_as_current_span("validate_order"):
            validate(request.json)
        order = build_order(request.json)  # hypothetical helper that constructs the order
        
        # Child span for payment
        with tracer.start_as_current_span("process_payment") as payment_span:
            payment_span.set_attribute("payment.amount", order.total)
            result = payment_service.charge(order)
            payment_span.set_attribute("payment.status", result.status)
        
        # Child span for fulfillment
        with tracer.start_as_current_span("create_fulfillment"):
            fulfillment_service.create(order)
        
        return jsonify(order.to_dict())

Correlating the Three Pillars

The power comes from correlation. Same request ID across all three:

import uuid

import structlog
from flask import g, request
from opentelemetry import trace

logger = structlog.get_logger()

@app.before_request
def set_correlation_id():
    # Get from header or generate
    g.request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))
    g.trace_id = trace.get_current_span().get_span_context().trace_id
    
    # Bind to logger
    g.log = logger.bind(
        request_id=g.request_id,
        trace_id=format(g.trace_id, '032x')
    )

@app.after_request
def add_correlation_headers(response):
    response.headers['X-Request-ID'] = g.request_id
    return response

Now you can:

  1. See a spike in error metrics
  2. Filter logs by that time window
  3. Find specific request IDs with errors
  4. Pull the full trace to see which service failed

Service Level Objectives (SLOs)

Define what “good” means:

# SLO definition
slos:
  - name: api-availability
    description: "API returns successful responses"
    indicator:
      events:
        good: http_requests_total{status=~"2.."}
        total: http_requests_total
    target: 99.9
    window: 30d

  - name: api-latency
    description: "API responds within 200ms"
    indicator:
      events:
        good: http_request_duration_seconds_bucket{le="0.2"}
        total: http_request_duration_seconds_count
    target: 95.0
    window: 30d
Then track how quickly you're consuming that budget:
# Calculate error budget
def calculate_error_budget(slo_target: float, window_requests: int, 
                          failed_requests: int) -> dict:
    allowed_failures = window_requests * (1 - slo_target / 100)
    remaining = allowed_failures - failed_requests
    
    return {
        "budget_total": allowed_failures,
        "budget_used": failed_requests,
        "budget_remaining": remaining,
        "budget_remaining_percent": (remaining / allowed_failures) * 100
    }

# Example: 99.9% SLO over 1M requests
# Allowed failures: 1000
# If we've had 300 failures, 700 remaining (70% budget left)
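The numbers in that comment can be checked directly; this is just the same formula re-run standalone with the example inputs:

```python
# Same error-budget formula as above, applied to the worked example:
# 99.9% SLO over 1M requests, 300 failures so far.
target, window, failed = 99.9, 1_000_000, 300

allowed = window * (1 - target / 100)  # failures the SLO permits
remaining = allowed - failed

print(round(allowed))                    # 1000 allowed failures
print(round(remaining))                  # 700 remaining
print(round(remaining / allowed * 100))  # 70% of budget left
```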

Alerting on Symptoms, Not Causes

Page on what users experience (error rates, latency), not on internal causes like CPU or disk:

# Prometheus alerting rules
groups:
- name: slo-alerts
  rules:
  # Alert on SLO breach, not individual errors
  - alert: HighErrorRate
    expr: |
      (
        sum(rate(http_requests_total{status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total[5m]))
      ) > 0.01
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Error rate above 1% for 5 minutes"
      
  # Alert on error budget burn rate
  - alert: ErrorBudgetBurn
    expr: |
      (
        1 - (
          sum(rate(http_requests_total{status=~"2.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))
        )
      ) > (1 - 0.999) * 14.4
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Burning error budget 14x faster than sustainable"
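Why 14.4? At that burn rate, a 30-day error budget is gone in roughly two days, which is what justifies a critical page. A quick sketch of the arithmetic (the 14.4 multiplier follows the common multiwindow burn-rate convention):

```python
window_days = 30
burn_rate = 14.4

# Fraction of the whole 30-day budget consumed per hour at this rate
budget_per_hour = burn_rate / (window_days * 24)  # 14.4 / 720 = 2% per hour

# Days until the entire budget is exhausted if the burn continues
days_to_exhaust = window_days / burn_rate  # ~2.08 days

print(f"{budget_per_hour:.1%} of budget per hour")
print(f"budget exhausted in {days_to_exhaust:.2f} days")
```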

Dashboard Design

Lead with the four golden signals: traffic (request rate), errors, latency, and saturation:

# Grafana dashboard as code
dashboard = {
    "title": "Service Overview",
    "rows": [
        {
            "title": "Golden Signals",
            "panels": [
                {
                    "title": "Request Rate",
                    "type": "graph",
                    "targets": [{
                        "expr": "sum(rate(http_requests_total[5m]))"
                    }]
                },
                {
                    "title": "Error Rate",
                    "type": "graph",
                    "targets": [{
                        "expr": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
                    }]
                },
                {
                    "title": "Latency (p50, p95, p99)",
                    "type": "graph",
                    "targets": [
                        {"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))"},
                        {"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"},
                        {"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"}
                    ]
                },
                {
                    "title": "Saturation (CPU/Memory)",
                    "type": "graph",
                    "targets": [{
                        "expr": "container_memory_usage_bytes / container_spec_memory_limit_bytes"
                    }]
                }
            ]
        }
    ]
}

Debugging with Observability

When something breaks:

# 1. Check metrics - what's the symptom?
# High error rate starting at 14:32

# 2. Query logs around that time
{
  "query": "level:error AND timestamp:[2026-02-11T14:30:00 TO 2026-02-11T14:35:00]",
  "sort": "timestamp:asc"
}

# 3. Find a specific failing request
# request_id: req-abc123, error: "connection timeout to payment-service"

# 4. Pull the trace
curl "http://jaeger:16686/api/traces/abc123def456"

# 5. See the trace shows:
# api-gateway (50ms) 
#   → order-service (30ms)
#     → payment-service (TIMEOUT after 30s)

# 6. Check payment-service metrics
# CPU at 100%, queue depth spiking

# Root cause: Payment service overloaded, need to scale

OpenTelemetry Collector

Unified pipeline for all signals:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  jaeger:
    # Note: recent collector releases removed this exporter; Jaeger now
    # ingests OTLP directly (point an otlp exporter at jaeger:4317 instead).
    endpoint: jaeger:14250
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Key Takeaways

  1. Metrics for alerting — cheap, fast, good for dashboards
  2. Logs for context — rich detail when you need to investigate
  3. Traces for causality — understand request flow across services
  4. Correlate everything — same request ID across all three
  5. Alert on symptoms — error rates and latency, not CPU or disk
  6. Define SLOs — know what “good enough” means before incidents

Observability isn’t a tool you install—it’s a property of your system. Build it in from the start, and debugging becomes archaeology with a map instead of digging in the dark.