Monitoring answers “is it working?” Observability answers “why isn’t it working?” The difference matters when you’re debugging a production incident at 3am.

The three pillars of observability — logs, metrics, and traces — each provide different perspectives. Together, they create a complete picture of system behavior.

Logs: The Narrative

Logs tell you what happened, in order:

{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "request_started", "request_id": "abc123", "path": "/api/users"}
{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "db_query", "request_id": "abc123", "duration_ms": 45}
{"timestamp": "2026-02-23T13:00:02Z", "level": "error", "event": "request_failed", "request_id": "abc123", "error": "connection timeout"}

Good for:

  • Debugging specific requests
  • Understanding error context
  • Audit trails
  • Ad-hoc investigation

Challenges:

  • High volume, high cost
  • Searching requires indexing
  • Context scattered across services

Structured Logging

Always log structured data:

# Bad - unstructured
logger.info(f"User {user_id} purchased {product} for ${price}")

# Good - structured
logger.info("purchase_completed", 
    user_id=user_id, 
    product_id=product, 
    price=price,
    currency="USD")

Structured logs are queryable: “show all purchases over $100 in the last hour.”
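In practice that query runs in your log backend (LogQL, SQL, Splunk's query language), but the idea is easy to see in plain code. A minimal sketch over JSON-lines logs, stdlib only; the `purchases_over` helper is hypothetical, and field names follow the purchase example above:

```python
import json
from datetime import datetime, timezone

def purchases_over(lines, min_price, since):
    """Filter JSON-lines logs: purchase events above min_price at/after `since`."""
    matches = []
    for line in lines:
        rec = json.loads(line)
        ts = datetime.fromisoformat(rec["timestamp"].replace("Z", "+00:00"))
        if (rec.get("event") == "purchase_completed"
                and rec.get("price", 0) > min_price
                and ts >= since):
            matches.append(rec)
    return matches

logs = [
    '{"timestamp": "2026-02-23T13:00:01Z", "event": "purchase_completed", "user_id": "u1", "price": 250}',
    '{"timestamp": "2026-02-23T13:00:02Z", "event": "purchase_completed", "user_id": "u2", "price": 40}',
    '{"timestamp": "2026-02-23T09:00:00Z", "event": "purchase_completed", "user_id": "u3", "price": 500}',
]
since = datetime(2026, 2, 23, 12, 30, tzinfo=timezone.utc)
matches = purchases_over(logs, 100, since)  # only u1: over $100 AND in the window
```

With unstructured strings, the same question would require fragile regex work; with structured fields it's a simple predicate.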

Metrics: The Numbers

Metrics are numerical measurements over time:

http_requests_total{method="GET", status="200"} 1024
http_requests_total{method="GET", status="500"} 532
http_request_duration_seconds{quantile="0.99"} 0.245

Good for:

  • Dashboards and visualization
  • Alerting on thresholds
  • Capacity planning
  • Trend analysis

Challenges:

  • High cardinality kills performance
  • Aggregation loses detail
  • Choosing what to measure
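The cardinality problem is multiplicative: every distinct combination of label values becomes its own time series, so one unbounded label (user IDs, request IDs) blows up storage. A back-of-the-envelope illustration (label values are made up):

```python
from itertools import product

# Hypothetical label values for a single counter
methods = ["GET", "POST", "PUT", "DELETE", "PATCH"]    # 5 values
statuses = ["200", "201", "301", "400", "404", "500"]  # 6 values
user_ids = [f"user_{i}" for i in range(1000)]          # unbounded in real traffic

# Every distinct label combination is a separate time series
series = list(product(methods, statuses, user_ids))
print(len(series))  # 5 * 6 * 1000 = 30,000 series from one metric
```

This is why high-cardinality identifiers belong in logs and traces, not metric labels.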

The RED Method (Request-focused)

For services:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency distribution
# Prometheus metrics
import time

from prometheus_client import Counter, Histogram

requests = Counter('http_requests_total', 'Total requests', ['method', 'status'])
latency = Histogram('http_request_duration_seconds', 'Request latency')

@app.middleware("http")
async def metrics_middleware(request, call_next):
    start = time.time()
    response = await call_next(request)

    requests.labels(method=request.method, status=response.status_code).inc()
    latency.observe(time.time() - start)

    return response

The USE Method (Resource-focused)

For infrastructure:

  • Utilization: Percentage of resource busy
  • Saturation: Queue length, backlog
  • Errors: Error count
# Example alerts
- alert: HighCPUUtilization
  expr: cpu_usage_percent > 80
  for: 5m

- alert: DiskNearlyFull
  expr: disk_usage_percent > 90
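The same three USE signals can be read off almost any resource, not just hosts. A toy sketch for an in-process worker pool, with illustrative names and numbers:

```python
from dataclasses import dataclass

@dataclass
class PoolStats:
    workers: int  # total workers in the pool
    busy: int     # workers currently handling a task
    queued: int   # tasks waiting for a free worker
    errors: int   # failed tasks since the last scrape

def use_snapshot(s: PoolStats) -> dict:
    """USE metrics for a worker pool: Utilization, Saturation, Errors."""
    return {
        "utilization": s.busy / s.workers,  # fraction of capacity in use
        "saturation": s.queued,             # backlog depth: work that cannot start
        "errors": s.errors,
    }

snap = use_snapshot(PoolStats(workers=10, busy=8, queued=3, errors=1))
```

Note that saturation is the earlier warning: a pool can sit at 100% utilization and be healthy, but a growing queue means demand already exceeds capacity.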

Traces: The Journey

Traces follow a request across service boundaries:

Trace ID: xyz789

api-gateway
  user-service
    db-query (45ms)
    preferences-service (40ms)

Good for:

  • Understanding distributed latency
  • Finding bottlenecks
  • Debugging cross-service issues
  • Dependency mapping

Challenges:

  • Sampling required at scale
  • Instrumentation overhead
  • Context propagation across boundaries

Implementing Tracing

from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

tracer = trace.get_tracer(__name__)

@app.get("/api/users/{user_id}")
async def get_user(user_id: int):
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user_id", user_id)
        
        with tracer.start_as_current_span("db_query"):
            user = await db.get_user(user_id)
        
        with tracer.start_as_current_span("enrich_user"):
            preferences = await preferences_service.get(user_id)
        
        return {**user, **preferences}

Propagate context to downstream services:

# Inject trace context into outgoing requests
headers = {}
TraceContextTextMapPropagator().inject(headers)
response = requests.get(url, headers=headers)
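On the receiving side, the downstream service reads the same context back out, typically with `TraceContextTextMapPropagator().extract(headers)`. Under the hood, that parses the W3C `traceparent` header. A stdlib-only sketch of the format, for intuition; real services should use the propagator rather than hand-rolled parsing:

```python
def parse_traceparent(header: str) -> dict:
    """Parse a W3C traceparent header: version-trace_id-parent_id-flags."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,          # 16-byte trace ID, hex-encoded
        "parent_span_id": parent_id,   # 8-byte ID of the caller's span
        "sampled": bool(int(flags, 16) & 0x01),  # bit 0 = sampled flag
    }

tp = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
```

The new span becomes a child of `parent_span_id`, which is what stitches the per-service spans into one trace.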

Connecting the Pillars

The magic happens when logs, metrics, and traces connect:

{
  "timestamp": "2026-02-23T13:00:01Z",
  "level": "error",
  "message": "Payment failed",
  "trace_id": "xyz789",
  "span_id": "abc123",
  "user_id": "user_456"
}

From an alert (metric), jump to the trace. From the trace, find the relevant logs. Each provides context the others lack.

Correlation IDs

Every request gets a unique ID that flows everywhere:

import structlog
from uuid import uuid4

@app.middleware("http")
async def add_request_id(request, call_next):
    request_id = request.headers.get('X-Request-ID', str(uuid4()))

    # Add to logging context
    structlog.contextvars.bind_contextvars(request_id=request_id)

    response = await call_next(request)
    response.headers['X-Request-ID'] = request_id

    return response

Now every log, metric label, and trace span can include request_id.
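structlog does this binding for you; a dependency-free sketch of the same idea, using only the stdlib (a `contextvars` variable plus a `logging.Filter` that stamps each record), shows there's no magic involved:

```python
import logging
from contextvars import ContextVar
from uuid import uuid4

# Holds the current request's ID; contextvars keep it isolated per async task
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request_id to every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('{"request_id": "%(request_id)s", "msg": "%(message)s"}'))
handler.addFilter(RequestIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

request_id_var.set(str(uuid4()))   # done once per request, e.g. in middleware
logger.info("payment_started")     # emits JSON carrying the bound request_id
```

Set the variable once in middleware and every log line in that request's code path is automatically correlated.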

Sampling Strategies

At scale, you can’t keep everything:

Head-based sampling: Decide at the start whether to trace this request.

# Sample 10% of requests
if random.random() < 0.1:
    start_trace()

Tail-based sampling: Decide after the request completes.

# Keep all errors and slow requests
if response.status >= 500 or duration > 1.0:
    keep_trace()

Adaptive sampling: Adjust rate based on traffic.

# More sampling during low traffic
sample_rate = 0.1 if requests_per_second > 1000 else 0.5
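In practice, head-based sampling is usually made deterministic rather than random: the keep/drop decision is computed from the trace ID itself (OpenTelemetry's `TraceIdRatioBased` sampler takes this approach), so every service that sees the same trace makes the same decision and traces are never half-kept. A sketch of the idea:

```python
def should_sample(trace_id: int, rate: float) -> bool:
    """Deterministic head-based sampling: a pure function of the trace ID,
    so all services agree on keep vs. drop without coordination."""
    bound = int(rate * (1 << 64))
    # Compare the low 64 bits of the trace ID against the rate threshold
    return (trace_id & ((1 << 64) - 1)) < bound

tid = 0x0AF7651916CD43DD8448EB211C80319C
keep = should_sample(tid, 0.1)  # identical result on every service in the call chain
```

Random per-service sampling, by contrast, would leave you with fragments: the gateway keeps its span while a downstream service drops its half of the same trace.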

Alerting Philosophy

Alert on symptoms, not causes:

# Bad - alerts on cause
- alert: DatabaseConnectionPoolExhausted
  expr: db_connections_available == 0

# Good - alerts on symptom
- alert: HighErrorRate
  expr: rate(http_requests_total{status="500"}[5m]) > 0.01

# Then investigate: why are requests failing? (maybe connection pool)

Symptoms tell you users are affected. Causes require investigation.
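One refinement worth noting: the error-rate expression above uses an absolute threshold, so the same alert behaves differently at low and high traffic. Alerting on the error *ratio* is often more robust. A PromQL sketch, with an illustrative 1% threshold:

```yaml
- alert: HighErrorRatio
  expr: |
    sum(rate(http_requests_total{status="500"}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
```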

Alert Fatigue

Every alert should be:

  • Actionable: Someone can do something about it
  • Urgent: It needs attention now (or use a ticket, not an alert)
  • Real: Low false positive rate

If you’re ignoring alerts, you have too many or they’re misconfigured.

Tool Choices

Metrics: Prometheus, InfluxDB, Datadog, CloudWatch
Logs: Elasticsearch, Loki, Splunk, CloudWatch Logs
Traces: Jaeger, Zipkin, Datadog APM, AWS X-Ray

Unified platforms: Datadog, Grafana Cloud, New Relic, Honeycomb

Starting out? Prometheus + Loki + Tempo with Grafana covers all three pillars, is fully open source, and the pieces integrate well.

The Observability Mindset

Observability isn’t a product you buy. It’s a property of your system:

  • Instrument everything: If it’s not measured, it doesn’t exist
  • Correlate across signals: Connect logs, metrics, traces
  • Ask arbitrary questions: Can you debug issues you’ve never seen before?
  • Reduce MTTR: The goal is faster incident resolution

The system that can explain its own behavior is the system you can actually operate.


Logs tell the story. Metrics show the trends. Traces reveal the journey. Separately, each is useful. Together, they’re essential.

Build observability in from the start. Instrument before you need to debug. Connect your signals with correlation IDs. The incident you resolve in minutes instead of hours will justify every line of instrumentation code.