Monitoring tells you when something is wrong. Observability helps you understand why. In distributed systems, you can’t predict every failure mode—you need systems that let you ask arbitrary questions about their behavior.
## The Three Pillars

### Metrics: What’s Happening Now
Numeric time-series data. Fast to query, cheap to store.
```python
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = Flask(__name__)

# Counter - only goes up
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram - distribution of values
request_duration = Histogram(
    'http_request_duration_seconds',
    'Request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.01, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

# Gauge - can go up or down
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# Usage
@app.route("/api/<endpoint>")
def handle_request(endpoint):
    active_connections.inc()
    try:
        with request_duration.labels(
            method=request.method,
            endpoint=endpoint
        ).time():
            result = process_request()
        requests_total.labels(
            method=request.method,
            endpoint=endpoint,
            status=200
        ).inc()
        return result
    finally:
        # Decrement even if the request raises, so the gauge doesn't leak
        active_connections.dec()

# Expose metrics on a separate port
start_http_server(9090)
```
### Logs: What Happened
Discrete events with context. Rich detail, expensive at scale.
```python
import logging
import structlog
from flask import request

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

logger = structlog.get_logger()

def handle_request(request_id: str, user_id: str, order_id: str):
    # Bind context that persists across log calls
    log = logger.bind(
        request_id=request_id,
        user_id=user_id
    )
    log.info("request_started", path=request.path)

    try:
        result = process_order(order_id)
        log.info("request_completed",
                 order_id=order_id,
                 items=len(result.items))
        return result
    except PaymentError as e:
        log.error("payment_failed",
                  error=str(e),
                  error_code=e.code)
        raise
```
Output:
```json
{"request_id": "req-123", "user_id": "user-456", "event": "request_started", "path": "/api/orders", "timestamp": "2026-02-11T19:00:00Z", "level": "info"}
{"request_id": "req-123", "user_id": "user-456", "event": "payment_failed", "error": "Card declined", "error_code": "CARD_DECLINED", "timestamp": "2026-02-11T19:00:01Z", "level": "error"}
```
### Traces: How It Flowed
Request path through distributed services. Shows causality.
```python
from flask import Flask, jsonify, request
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Setup tracing
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://jaeger:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Auto-instrument frameworks
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

# Manual spans for business logic
# (current_user, payment_service, fulfillment_service come from app context)
@app.route("/api/orders", methods=["POST"])
def create_order():
    with tracer.start_as_current_span("create_order") as span:
        span.set_attribute("user.id", current_user.id)

        # Child span for validation
        with tracer.start_as_current_span("validate_order"):
            order = validate(request.json)

        # Child span for payment
        with tracer.start_as_current_span("process_payment") as payment_span:
            payment_span.set_attribute("payment.amount", order.total)
            result = payment_service.charge(order)
            payment_span.set_attribute("payment.status", result.status)

        # Child span for fulfillment
        with tracer.start_as_current_span("create_fulfillment"):
            fulfillment_service.create(order)

        return jsonify(order.to_dict())
```
## Correlating the Three Pillars
The power comes from correlation. Same request ID across all three:
```python
import uuid

from flask import g, request
from opentelemetry import trace

@app.before_request
def set_correlation_id():
    # Get from header or generate
    g.request_id = request.headers.get('X-Request-ID', str(uuid.uuid4()))
    g.trace_id = trace.get_current_span().get_span_context().trace_id

    # Bind to logger
    g.log = logger.bind(
        request_id=g.request_id,
        trace_id=format(g.trace_id, '032x')
    )

@app.after_request
def add_correlation_headers(response):
    response.headers['X-Request-ID'] = g.request_id
    return response
```
Now you can:
- See a spike in error metrics
- Filter logs by that time window
- Find specific request IDs with errors
- Pull the full trace to see which service failed
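As a minimal sketch of that workflow, here is how you might scan structured JSON log lines (in the shape produced by the structlog setup above) for failed requests and collect their trace IDs; `find_failing_traces` is an illustrative helper, not part of any library:

```python
import json

def find_failing_traces(log_lines, level="error"):
    """Scan JSON log lines and collect trace IDs of failed requests."""
    trace_ids = []
    for line in log_lines:
        event = json.loads(line)
        if event.get("level") == level and "trace_id" in event:
            trace_ids.append(event["trace_id"])
    return trace_ids

logs = [
    '{"level": "info", "event": "request_started", "trace_id": "abc123"}',
    '{"level": "error", "event": "payment_failed", "trace_id": "def456"}',
]
print(find_failing_traces(logs))  # → ['def456']
```

Each trace ID it returns can then be fed straight to your tracing backend's lookup API.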
## Service Level Objectives (SLOs)
Define what “good” means:
```yaml
# SLO definition
slos:
  - name: api-availability
    description: "API returns successful responses"
    indicator:
      events:
        good: http_requests_total{status=~"2.."}
        total: http_requests_total
    target: 99.9
    window: 30d

  - name: api-latency
    description: "API responds within 200ms"
    indicator:
      events:
        good: http_request_duration_seconds_bucket{le="0.2"}
        total: http_request_duration_seconds_count
    target: 95.0
    window: 30d
```
```python
# Calculate error budget
def calculate_error_budget(slo_target: float, window_requests: int,
                           failed_requests: int) -> dict:
    allowed_failures = window_requests * (1 - slo_target / 100)
    remaining = allowed_failures - failed_requests
    return {
        "budget_total": allowed_failures,
        "budget_used": failed_requests,
        "budget_remaining": remaining,
        "budget_remaining_percent": (remaining / allowed_failures) * 100
    }

# Example: 99.9% SLO over 1M requests
# Allowed failures: 1000
# If we've had 300 failures, 700 remaining (70% budget left)
```
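The same arithmetic translates an availability target into allowed downtime, which is often the more intuitive framing. A quick sketch (illustrative helper, assuming full outages count entirely against the budget):

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total outage an availability target permits per window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target / 100)

# e.g. 99.9% allows ~43.2 minutes of downtime per 30 days
for target in (99.0, 99.9, 99.99):
    print(f"{target}%: {allowed_downtime_minutes(target):.1f} min per 30 days")
```

Each extra "nine" cuts the budget by a factor of ten, which is why 99.99% is so much harder to operate than 99.9%.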
## Alerting on Symptoms, Not Causes
```yaml
# Prometheus alerting rules
groups:
  - name: slo-alerts
    rules:
      # Alert on SLO breach, not individual errors
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 1% for 5 minutes"

      # Alert on error budget burn rate
      - alert: ErrorBudgetBurn
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status=~"2.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Burning error budget 14x faster than sustainable"
```
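Where does 14.4 come from? It is the burn-rate framing from the Google SRE Workbook: a burn rate of B exhausts the whole error budget in window/B, and 14.4 is the rate at which one hour consumes 2% of a 30-day budget. A sketch of the arithmetic (the helper name is illustrative):

```python
def burn_rate_facts(burn_rate: float, window_days: float = 30) -> dict:
    """How fast a given burn rate consumes an error budget."""
    days_to_exhaustion = window_days / burn_rate
    hours_in_window = window_days * 24
    # Fraction of the total budget consumed per hour at this rate
    budget_per_hour = burn_rate / hours_in_window
    return {
        "days_to_exhaustion": days_to_exhaustion,
        "budget_consumed_per_hour_pct": budget_per_hour * 100,
    }

facts = burn_rate_facts(14.4)
# ~2.08 days to exhaust the budget; 2% of the budget burned per hour
```

A sustained burn rate of 1.0 would use exactly the whole budget over the window; 14.4x means you will blow through a month's budget in about two days, which is worth a page.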
## Dashboard Design
```python
# Grafana dashboard as code
dashboard = {
    "title": "Service Overview",
    "rows": [
        {
            "title": "Golden Signals",
            "panels": [
                {
                    "title": "Request Rate",
                    "type": "graph",
                    "targets": [{
                        "expr": "sum(rate(http_requests_total[5m]))"
                    }]
                },
                {
                    "title": "Error Rate",
                    "type": "graph",
                    "targets": [{
                        "expr": "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))"
                    }]
                },
                {
                    "title": "Latency (p50, p95, p99)",
                    "type": "graph",
                    "targets": [
                        {"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))"},
                        {"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"},
                        {"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"}
                    ]
                },
                {
                    "title": "Saturation (CPU/Memory)",
                    "type": "graph",
                    "targets": [{
                        "expr": "container_memory_usage_bytes / container_spec_memory_limit_bytes"
                    }]
                }
            ]
        }
    ]
}
```
## Debugging with Observability
When something breaks:
```text
# 1. Check metrics - what's the symptom?
#    High error rate starting at 14:32

# 2. Query logs around that time
{
  "query": "level:error AND timestamp:[2026-02-11T14:30:00 TO 2026-02-11T14:35:00]",
  "sort": "timestamp:asc"
}

# 3. Find a specific failing request
#    request_id: req-abc123, error: "connection timeout to payment-service"

# 4. Pull the trace
curl "http://jaeger:16686/api/traces/abc123def456"

# 5. The trace shows:
#    api-gateway (50ms)
#      → order-service (30ms)
#        → payment-service (TIMEOUT after 30s)

# 6. Check payment-service metrics
#    CPU at 100%, queue depth spiking

# Root cause: payment-service overloaded, needs to scale
```
## OpenTelemetry Collector
Unified pipeline for all signals:
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  jaeger:
    endpoint: jaeger:14250
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
```
## Key Takeaways
- Metrics for alerting — cheap, fast, good for dashboards
- Logs for context — rich detail when you need to investigate
- Traces for causality — understand request flow across services
- Correlate everything — same request ID across all three
- Alert on symptoms — error rates and latency, not CPU or disk
- Define SLOs — know what “good enough” means before incidents
Observability isn’t a tool you install; it’s a property of your system. Build it in from the start, and debugging becomes archaeology with a map instead of digging in the dark.