Your dashboards are green. Your alerts are quiet. Then a user tweets that your app is broken.
This is the monitoring trap: you’re measuring what you expected to fail, not what actually failed.
Observability is the escape hatch.
Monitoring vs Observability

Monitoring is asking predefined questions:

- Is the server up?
- Is CPU under 80%?
- Are requests completing in under 200ms?

Observability is being able to ask any question:

- Why did this specific user's request fail?
- What changed between yesterday and today?
- Which service in the chain is causing latency?

Monitoring is a subset of observability. You need both.
The Three Pillars

1. Metrics

Numeric measurements over time. Great for dashboards and alerts.
# Prometheus metrics with Python
from prometheus_client import Counter, Histogram, start_http_server

# Count requests
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Track latency
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

# Expose /metrics on port 8000 for Prometheus to scrape
start_http_server(8000)

@app.route('/api/users')
def get_users():
    with REQUEST_LATENCY.labels(method='GET', endpoint='/api/users').time():
        users = fetch_users()
        REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status=200).inc()
        return jsonify(users)
Prometheus scrape config:
# prometheus.yml
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['app:8000']
    scrape_interval: 15s
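With 15-second scrapes, Prometheus sees each counter as a series of periodic samples, and queries like PromQL's `rate()` turn those samples into a per-second rate. A toy stdlib sketch of that idea (not Prometheus internals; sample values invented):

```python
# Counter samples as (timestamp_seconds, value), scraped every 15s
samples = [(0, 100), (15, 130), (30, 175), (45, 220), (60, 280)]

def simple_rate(samples):
    """Per-second increase between the first and last sample —
    roughly what PromQL's rate() computes, ignoring counter resets."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

simple_rate(samples)  # 3.0 requests/second over the window
```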
2. Logs

Discrete events with context. The source of truth for debugging.
import structlog
import uuid

# Structured logging
logger = structlog.get_logger()

@app.before_request
def add_request_id():
    request.id = str(uuid.uuid4())

@app.route('/api/orders', methods=['POST'])
def create_order():
    log = logger.bind(
        request_id=request.id,
        user_id=current_user.id,
        endpoint='/api/orders'
    )
    log.info("order_creation_started")
    try:
        order = process_order(request.json)
        log.info("order_created", order_id=order.id, total=order.total)
        return jsonify(order)
    except PaymentError as e:
        log.error("payment_failed", error=str(e), card_last_four=e.card[-4:])
        raise
Output (JSON for parsing):
{
  "event": "order_created",
  "request_id": "abc-123",
  "user_id": 42,
  "order_id": 789,
  "total": 99.99,
  "timestamp": "2026-02-10T17:30:00Z"
}
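The payoff of JSON output is that logs become queryable data instead of text to grep. A minimal stdlib sketch (log lines invented for illustration):

```python
import json

# Hypothetical structured log lines, one JSON object per line
log_lines = [
    '{"event": "order_created", "request_id": "abc-123", "order_id": 789}',
    '{"event": "payment_failed", "request_id": "def-456"}',
]

def find_by_request_id(lines, request_id):
    """Filter structured logs by an exact field match."""
    events = (json.loads(line) for line in lines)
    return [e for e in events if e.get("request_id") == request_id]

matches = find_by_request_id(log_lines, "abc-123")  # one matching event
```

In production the same filter is a one-line query in Loki, Elasticsearch, or whatever backend stores the JSON.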
3. Traces

The path of a request through your system. Essential for microservices.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

otlp_exporter = OTLPSpanExporter(endpoint="http://jaeger:4317")
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

@app.route('/api/checkout')
def checkout():
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", current_user.id)

        # Each call creates a child span
        cart = get_cart()                      # Span: get_cart
        inventory = check_inventory(cart)      # Span: check_inventory
        payment = process_payment(cart.total)  # Span: process_payment
        order = create_order(cart, payment)    # Span: create_order

        span.set_attribute("order.id", order.id)
        return jsonify(order)
Trace visualization shows the waterfall:
checkout (250ms)
├─ get_cart (15ms)
├─ check_inventory (40ms)
├─ process_payment (150ms)   ← bottleneck!
└─ create_order (45ms)
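Conceptually, a trace is just a tree of timed spans, and the bottleneck is whichever child dominates the parent's duration. A stdlib sketch of that model (not the OpenTelemetry API; durations reconstructed from the waterfall above):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)

# The checkout trace as a span tree
checkout = Span("checkout", 250, [
    Span("get_cart", 15),
    Span("check_inventory", 40),
    Span("process_payment", 150),
    Span("create_order", 45),
])

# The slowest child span is the first place to look
bottleneck = max(checkout.children, key=lambda s: s.duration_ms)
```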
Connecting the Pillars

The magic happens when you connect them:
# Correlation IDs tie everything together
@app.before_request
def setup_observability():
    trace_id = trace.get_current_span().get_span_context().trace_id
    request.trace_id = format(trace_id, '032x')

    # Add trace ID to all logs
    structlog.contextvars.bind_contextvars(trace_id=request.trace_id)

# Now you can:
# 1. See an error in metrics (spike in 500s)
# 2. Find the trace ID from logs
# 3. View the full trace to see which service failed
# 4. Drill into that service's logs for the specific error
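One detail worth spelling out: OpenTelemetry trace IDs are 128-bit integers, and `format(trace_id, '032x')` renders them as the 32-character lowercase hex string used in W3C Trace Context headers, so the ID in your logs matches the one in `traceparent`. A quick check (trace ID taken from the W3C spec's `traceparent` example):

```python
# A 128-bit trace ID as a Python integer
trace_id = 0x4bf92f3577b34da6a3ce929d0e0e4736

# Zero-padded, 32-char, lowercase hex — the form that appears in
# traceparent headers and in the log lines bound above
hex_id = format(trace_id, '032x')  # "4bf92f3577b34da6a3ce929d0e0e4736"
```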
Practical Alerting

Alert on symptoms, not causes:
# Good: Alert on user-facing impact
groups:
  - name: slos
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1% for 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 500ms"

# Bad: Alert on causes (too noisy)
# - CPU above 80% (might be fine!)
# - Disk above 90% (might have days left)
# - Memory growing (might be normal GC pattern)
The Stack

A modern observability stack:
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

  loki:
    image: grafana/loki
    ports:
      - "3100:3100"

  jaeger:
    image: jaegertracing/all-in-one
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC

volumes:
  grafana-data:
Or use managed services:
- Datadog — All-in-one, expensive
- Grafana Cloud — Good free tier
- AWS CloudWatch — If you're already in AWS
- Honeycomb — Best for high-cardinality exploration

SLOs: Tying It Together

Service Level Objectives make observability actionable:
# Define your SLOs
SLOS = {
    'availability': {
        'target': 0.999,  # 99.9% of requests succeed
        'window': '30d'
    },
    'latency': {
        'target': 0.95,   # 95% of requests under 200ms
        'threshold': 0.2,
        'window': '30d'
    }
}

# Calculate error budget
# 99.9% availability over 30 days = 43 minutes of allowed downtime
# If you've used 20 minutes, you have 23 minutes left
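The arithmetic in those comments is easy to encode; a small helper (hypothetical, not part of any library):

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed 'bad' minutes for an availability SLO over a window."""
    return (1 - slo_target) * window_days * 24 * 60

budget = error_budget_minutes(0.999, 30)  # ~43.2 minutes
remaining = budget - 20                   # ~23.2 minutes left after using 20
```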
Grafana query for error budget:
# Error budget remaining, as a fraction (1.0 = full budget left)
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) / (1 - 0.999)
The Real Win

Good observability changes how you work:
Before: “The site is slow. Let me check 47 dashboards and grep through logs.”
After: “Error rate spiked at 3pm. Here’s the trace. The payment service timed out calling Stripe. Here are the 12 affected users.”
That’s the difference between fighting fires and understanding systems.
Building your observability stack? Hit me up on Twitter.