Your dashboards are green. Your alerts are quiet. Then a user tweets that your app is broken.

This is the monitoring trap: you’re measuring what you expected to fail, not what actually failed.

Observability is the escape hatch.

Monitoring vs Observability

Monitoring is asking predefined questions:

  • Is the server up?
  • Is CPU under 80%?
  • Are requests completing in under 200ms?

Observability is being able to ask any question:

  • Why did this specific user’s request fail?
  • What changed between yesterday and today?
  • Which service in the chain is causing latency?

Monitoring is a subset of observability. You need both.

The Three Pillars

1. Metrics

Numeric measurements over time. Great for dashboards and alerts.

# Prometheus metrics with Python
from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, start_http_server

app = Flask(__name__)

# Count requests by method, endpoint, and status code
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Track latency distribution
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

# Expose /metrics on port 8000 for Prometheus to scrape
start_http_server(8000)

@app.route('/api/users')
def get_users():
    with REQUEST_LATENCY.labels(method='GET', endpoint='/api/users').time():
        users = fetch_users()
        REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status=200).inc()
        return jsonify(users)

Prometheus scrape config:

# prometheus.yml
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['app:8000']
    scrape_interval: 15s
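Once Prometheus is scraping, questions about the counter above become PromQL queries. A couple of sketches (the label names match the Python example):

```promql
# Requests per second, broken out by endpoint
sum by (endpoint) (rate(http_requests_total[5m]))

# Fraction of requests returning 5xx over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
```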

2. Logs

Discrete events with context. The source of truth for debugging.

import uuid

import structlog
from flask import request, jsonify

# Structured logging: every entry is a set of key-value pairs
logger = structlog.get_logger()

@app.before_request
def add_request_id():
    request.id = str(uuid.uuid4())

@app.route('/api/orders', methods=['POST'])
def create_order():
    log = logger.bind(
        request_id=request.id,
        user_id=current_user.id,
        endpoint='/api/orders'
    )

    log.info("order_creation_started")

    try:
        order = process_order(request.json)
        log.info("order_created", order_id=order.id, total=order.total)
        return jsonify(order)
    except PaymentError as e:
        log.error("payment_failed", error=str(e), card_last_four=e.card[-4:])
        raise

Output (JSON for parsing):

{
  "event": "order_created",
  "request_id": "abc-123",
  "user_id": 42,
  "order_id": 789,
  "total": 99.99,
  "timestamp": "2026-02-10T17:30:00Z"
}

3. Traces

The path of a request through your system. Essential for microservices.

from flask import jsonify
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup tracing: export spans to Jaeger over OTLP
trace.set_tracer_provider(TracerProvider())
otlp_exporter = OTLPSpanExporter(endpoint="http://jaeger:4317")
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)
tracer = trace.get_tracer(__name__)

@app.route('/api/checkout')
def checkout():
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", current_user.id)

        # Each call below starts its own child span
        # (via manual spans or auto-instrumentation)
        cart = get_cart()                      # span: get_cart
        inventory = check_inventory(cart)      # span: check_inventory
        payment = process_payment(cart.total)  # span: process_payment
        order = create_order(cart, payment)    # span: create_order

        span.set_attribute("order.id", order.id)
        return jsonify(order)

Trace visualization shows the waterfall:

checkout (250ms)
├─ get_cart (15ms)
├─ check_inventory (40ms)
├─ process_payment (150ms)  ← bottleneck!
└─ create_order (45ms)

Connecting the Pillars

The magic happens when you connect them:

# Correlation IDs tie everything together
import structlog
from flask import request
from opentelemetry import trace

@app.before_request
def setup_observability():
    span_context = trace.get_current_span().get_span_context()
    request.trace_id = format(span_context.trace_id, '032x')

    # Attach the trace ID to every log line in this request
    structlog.contextvars.bind_contextvars(trace_id=request.trace_id)

# Now you can:
# 1. See an error in metrics (spike in 500s)
# 2. Find the trace ID from logs
# 3. View the full trace to see which service failed
# 4. Drill into that service's logs for the specific error
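With logs shipped to Loki, step 4 is a single query: filter on the trace ID that structlog attached. A sketch in LogQL, with an assumed `job` label and a placeholder trace ID:

```logql
# All log lines for one request, across every service
{job="my-app"} | json | trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
```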

Practical Alerting

Alert on symptoms, not causes:

# Good: Alert on user-facing impact
groups:
  - name: slos
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"
      
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, 
            rate(http_request_duration_seconds_bucket[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 500ms"

# Bad: Alert on causes (too noisy)
# - CPU above 80% (might be fine!)
# - Disk above 90% (might have days left)
# - Memory growing (might be normal GC pattern)

The Stack

A modern observability stack:

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

  loki:
    image: grafana/loki
    ports:
      - "3100:3100"

  jaeger:
    image: jaegertracing/all-in-one
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC

volumes:
  grafana-data:

Or use managed services:

  • Datadog — All-in-one, expensive
  • Grafana Cloud — Good free tier
  • AWS CloudWatch — If you’re already in AWS
  • Honeycomb — Best for high-cardinality exploration

SLOs: Tying It Together

Service Level Objectives make observability actionable:

# Define your SLOs
SLOS = {
    'availability': {
        'target': 0.999,  # 99.9% of requests succeed
        'window': '30d'
    },
    'latency': {
        'target': 0.95,   # 95% of requests under 200ms
        'threshold': 0.2,
        'window': '30d'
    }
}

# Calculate error budget
# 99.9% availability over 30 days = 43 minutes of allowed downtime
# If you've used 20 minutes, you have 23 minutes left
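The arithmetic in those comments is easy to check. A quick sketch (`error_budget_minutes` is a hypothetical helper, not part of any library):

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of downtime an availability SLO allows per window."""
    return (1 - target) * window_days * 24 * 60

# 99.9% over 30 days allows ~43 minutes of downtime
budget = error_budget_minutes(0.999, 30)
print(round(budget, 1))  # 43.2

# 20 minutes already burned leaves ~23 minutes of budget
remaining = budget - 20
print(round(remaining, 1))  # 23.2
```

Dropping one nine to 99% multiplies the budget tenfold (about 7.2 hours), which is why SLO targets are chosen deliberately rather than set to "as many nines as possible".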

Grafana query for error budget:

# Error budget remaining (as percentage)
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) / (1 - 0.999)

The Real Win

Good observability changes how you work:

Before: “The site is slow. Let me check 47 dashboards and grep through logs.”

After: “Error rate spiked at 3pm. Here’s the trace. The payment service timed out calling Stripe. Here are the 12 affected users.”

That’s the difference between fighting fires and understanding systems.


Building your observability stack? Hit me up on Twitter.