Most monitoring dashboards are useless. Hundreds of metrics, dozens of graphs, all green—until something breaks and you’re scrambling through charts trying to find the one that matters.
Good monitoring isn’t about collecting everything. It’s about knowing what to look at when things go wrong.
## The Three Pillars

Observability has three pillars: metrics, logs, and traces. Each answers a different question.
**Metrics**: What is happening? (Aggregated numbers over time)

- Request rate, error rate, latency
- CPU, memory, disk usage
- Queue depth, connection count

**Logs**: Why did it happen? (Detailed event records)

- Error messages and stack traces
- Request/response bodies
- State changes

**Traces**: Where did it happen? (Request flow across services)

- Which service was slow?
- What path did the request take?
- Where did it fail?

You need all three. Metrics tell you there’s a problem. Logs tell you why. Traces tell you where.
## Start With RED and USE

Don’t invent metrics. Use established frameworks.
RED Method (for services):
- **R**ate: Requests per second
- **E**rrors: Failed requests per second
- **D**uration: Latency distribution
```python
import time

from flask import request
from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

@app.before_request
def start_timer():
    request.start_time = time.time()

@app.after_request
def record_metrics(response):
    latency = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.endpoint
    ).observe(latency)
    return response
```
USE Method (for resources):

- **U**tilization: How busy is it? (CPU at 80%)
- **S**aturation: How overloaded is it? (Queue depth)
- **E**rrors: Is it failing? (Disk errors)

Every resource (CPU, memory, disk, network) should have all three.
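To make USE concrete, here is a minimal sketch deriving all three numbers for a hypothetical worker pool; the `WorkerPool` class and its fields are invented for illustration:

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class WorkerPool:
    """Hypothetical resource we want USE metrics for."""
    total_workers: int
    busy_workers: int = 0
    queue: deque = field(default_factory=deque)  # jobs waiting for a worker
    error_count: int = 0

    def use_metrics(self) -> dict:
        return {
            # Utilization: fraction of the resource currently busy
            "utilization": self.busy_workers / self.total_workers,
            # Saturation: work queued because the resource is full
            "saturation": len(self.queue),
            # Errors: failures attributable to this resource
            "errors": self.error_count,
        }

pool = WorkerPool(total_workers=10, busy_workers=8)
pool.queue.extend(["job-1", "job-2", "job-3"])
m = pool.use_metrics()  # {'utilization': 0.8, 'saturation': 3, 'errors': 0}
```

In a real service each of these would be exported as a gauge rather than returned from a method, but the three questions stay the same.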
## The Four Golden Signals

Google’s SRE book defines four signals every service needs:

- **Latency**: Time to serve a request
- **Traffic**: Demand on your system
- **Errors**: Rate of failed requests
- **Saturation**: How “full” your service is

If you only monitor four things, monitor these.
```yaml
# Prometheus alerting rules for golden signals
groups:
  - name: golden_signals
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 1 second"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1%"

      - alert: TrafficDrop
        expr: rate(http_requests_total[5m]) < rate(http_requests_total[5m] offset 1h) * 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Traffic dropped by more than 50%"
```
## Percentiles, Not Averages

Average latency is a lie.
```
Requests: [50ms, 50ms, 50ms, 50ms, 5000ms]

Average:  1040ms
P50:      50ms
P95:      5000ms
P99:      5000ms
```
The average says “about 1 second.” Reality: most requests are fast, but one in five users gets a terrible experience.
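The gap is easy to reproduce. A quick sketch with the five timings above, using a simple nearest-rank percentile (not how Prometheus estimates quantiles, which interpolates within histogram buckets):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value >= p% of the samples."""
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)  # ceil(p/100 * n), as a 1-based rank
    return ordered[int(rank) - 1]

latencies_ms = [50, 50, 50, 50, 5000]

average = sum(latencies_ms) / len(latencies_ms)  # 1040.0
p50 = percentile(latencies_ms, 50)               # 50
p95 = percentile(latencies_ms, 95)               # 5000
```

The single 5-second outlier drags the average up twentyfold while the median sits untouched at 50ms.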
Always use percentiles:
- P50 (median): Typical experience
- P95: What slow users see
- P99: Worst case (mostly)
```python
# Histogram buckets for latency
LATENCY_BUCKETS = [
    0.001,  # 1ms
    0.005,  # 5ms
    0.01,   # 10ms
    0.025,  # 25ms
    0.05,   # 50ms
    0.1,    # 100ms
    0.25,   # 250ms
    0.5,    # 500ms
    1.0,    # 1s
    2.5,    # 2.5s
    5.0,    # 5s
    10.0,   # 10s
]

# Query percentiles in Prometheus:
# histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```
## Alerting That Doesn’t Suck

Bad alerts train people to ignore alerts.
Symptoms of bad alerting:

- Alert fatigue (too many alerts)
- False positives (alerts that aren’t problems)
- Missing context (alert fires, no idea why)
- No runbook (alert fires, no idea what to do)

Rules for good alerts:

1. Alert on symptoms, not causes
```yaml
# Bad: Alerts on cause
- alert: HighCPU
  expr: cpu_usage > 80%

# Good: Alerts on symptom
- alert: HighLatency
  expr: p95_latency > 500ms
```
High CPU might be fine if latency is fine. Alert on what users experience.
2. Every alert needs a runbook
```yaml
- alert: DatabaseConnectionPoolExhausted
  expr: db_pool_available_connections == 0
  annotations:
    summary: "No available database connections"
    runbook_url: "https://wiki.internal/runbooks/db-pool-exhausted"
    description: |
      The connection pool is exhausted. Check:
      1. Are queries running slowly? Check slow query log.
      2. Are connections being leaked? Check application logs.
      3. Is traffic higher than usual? Check request rate.
```
3. Use severity levels meaningfully
```yaml
# Critical: User-facing, needs immediate action
severity: critical

# Warning: Degraded, but functional
severity: warning

# Info: FYI, no action needed
severity: info
```
Don’t page at 3 AM for warnings.
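One way to enforce that rule is to route notifications by severity at a single choke point, so nothing but `critical` can reach the pager. A minimal sketch; the channel names and the `route_alert` helper are hypothetical:

```python
# Hypothetical routing table: only critical alerts page a human.
ROUTES = {
    "critical": "pagerduty",  # wakes someone up, any hour
    "warning": "slack",       # reviewed during business hours
    "info": "dashboard",      # visible, but never notifies
}

def route_alert(severity: str) -> str:
    """Pick the notification channel for an alert's severity label."""
    # Unknown severities escalate rather than vanish: fail loud, not silent.
    return ROUTES.get(severity, ROUTES["critical"])
```

Treating an unrecognized severity as critical is a deliberate choice: a typo in a rule file should over-notify, never silently drop.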
4. Include actionable context
```yaml
annotations:
  summary: "High error rate on payment service"
  description: |
    Error rate: {{ $value | printf "%.2f" }}%
    Service: {{ $labels.service }}
    Endpoint: {{ $labels.endpoint }}
    Most common error: {{ $labels.error_type }}
  dashboard: "https://grafana.internal/d/payments"
  logs: "https://logs.internal/payments?level=error&last=15m"
```
## Dashboard Design

The three-dashboard strategy:
1. Overview Dashboard
For: Quick health check
Shows: Golden signals for all services
Use: Start here during incidents
2. Service Dashboard
For: Debugging specific service
Shows: Detailed metrics for one service
Use: Drill down after overview shows problem
3. Debug Dashboard
For: Deep investigation
Shows: Everything (resource usage, dependencies, internals)
Use: Root cause analysis
Layout principles:

- Most important metrics at top
- Group related metrics
- Use consistent colors (green=good, red=bad)
- Show rate of change, not just current value

```
┌─ Service Health ─────────────────────────────────┐
│ [Request Rate ▁▂▃▄▅▆▇█]      [CPU 45%]      ✓    │
│ [Error Rate   ▁▁▁▁▂▁▁▁]      [Memory 62%]   ✓    │
│ [P95 Latency  ▂▂▃▂▂▂▂▂]      [Disk 34%]     ⚠    │
│ API ✓    DB ✓    Cache ✓     [Network 125 Mbps]  │
└──────────────────────────────────────────────────┘
```
## Log Aggregation Done Right

Logs are useless if you can’t search them.
Structured logging:
```python
import structlog

logger = structlog.get_logger()

# Bad
logger.info(f"User {user_id} purchased {item} for ${price}")

# Good
logger.info(
    "purchase_completed",
    user_id=user_id,
    item_id=item,
    price=price,
    currency="USD",
    payment_method="credit_card"
)
```
Output:
```json
{
  "timestamp": "2026-03-11T05:30:00Z",
  "level": "info",
  "event": "purchase_completed",
  "user_id": "usr_12345",
  "item_id": "prod_67890",
  "price": 29.99,
  "currency": "USD",
  "payment_method": "credit_card",
  "trace_id": "abc123",
  "service": "checkout"
}
```
Now you can query: “Show all purchases over $100 in the last hour that failed.”
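Because every record is a JSON object, that query is just a filter over fields. A stdlib-only sketch (the sample records are invented, and a real log store would also apply the time window):

```python
import json

# Invented sample records; a real aggregator would ship these as JSON lines.
raw_logs = """\
{"event": "purchase_completed", "price": 150.0, "status": "failed"}
{"event": "purchase_completed", "price": 29.99, "status": "ok"}
{"event": "user_login", "user_id": "usr_1"}
{"event": "purchase_completed", "price": 500.0, "status": "failed"}
""".splitlines()

# "Purchases over $100 that failed" becomes a plain predicate on fields.
matches = [
    rec for rec in map(json.loads, raw_logs)
    if rec.get("event") == "purchase_completed"
    and rec.get("price", 0) > 100
    and rec.get("status") == "failed"
]
# matches holds the two failed purchases over $100
```

Try expressing that same query against the f-string version of the log line: you end up writing regexes against prose, which is exactly what structured logging exists to avoid.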
Log levels matter:
```python
# DEBUG: Detailed debugging info (off in production)
logger.debug("cache_lookup", key=key, hit=True)

# INFO: Normal operations worth recording
logger.info("user_login", user_id=user_id)

# WARNING: Something unexpected but handled
logger.warning("rate_limit_approached", user_id=user_id, current=95, limit=100)

# ERROR: Something failed, needs attention
logger.error("payment_failed", user_id=user_id, error=str(e), payment_id=payment_id)

# CRITICAL: System is broken
logger.critical("database_connection_lost", error=str(e))
```
## Distributed Tracing

When a request touches multiple services, traces show the journey.
```
[Request] ─→ [API Gateway] ─→ [Auth Service] ─→ [Cache]
                 │
                 ├─→ [User Service] ─→ [Database]
                 │
                 └─→ [Payment Service] ─→ [Stripe API]
```
OpenTelemetry instrumentation:
```python
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Auto-instrument HTTP clients
RequestsInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

@app.route('/checkout')
def checkout():
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user_id", request.user_id)

        with tracer.start_as_current_span("validate_cart"):
            validate_cart()

        with tracer.start_as_current_span("process_payment"):
            process_payment()

        with tracer.start_as_current_span("send_confirmation"):
            send_confirmation()
```
The trace propagates automatically through HTTP headers. When payment service is slow, you see exactly where in the chain the delay occurred.
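Concretely, the context usually travels as the W3C Trace Context `traceparent` header: a version, a 32-hex-character trace ID shared by every span, the caller's 16-hex-character span ID, and trace flags. A small parser sketch; the `parse_traceparent` helper is illustrative, since OpenTelemetry's propagators handle this for you:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C Trace Context `traceparent` header into its four fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    return {
        "version": version,                # currently "00"
        "trace_id": trace_id,              # 32 hex chars, shared by all spans
        "parent_span_id": parent_span_id,  # 16 hex chars, the caller's span
        "sampled": flags == "01",          # sampling decision rides along
    }

# Example value from the W3C Trace Context specification
ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
```

Every service that logs the `trace_id` from this header ties its log lines to the same trace, which is what makes the metrics-logs-traces triangle navigable.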
## SLOs: Defining “Good Enough”

Service Level Objectives make reliability measurable.
Define your SLO:
```
99.9% of requests complete successfully within 500ms
over a rolling 30-day window
```
Calculate error budget:
```
30 days = 43,200 minutes
0.1% error budget = 43.2 minutes of downtime allowed
```
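The arithmetic generalizes to any availability target; a small helper (the function name is mine, for illustration):

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Minutes of failure allowed by an availability SLO over a window."""
    total_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1 - slo) * total_minutes

budget = error_budget_minutes(0.999)  # ≈ 43.2 minutes
```

Adding a nine shrinks the budget tenfold: 99.99% leaves about 4.3 minutes a month, which rules out any manual response to incidents.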
Track burn rate:
```yaml
# Alert when burning error budget too fast
- alert: ErrorBudgetBurnRate
  expr: |
    (
      1 - (
        sum(rate(http_requests_total{status="200"}[1h]))
        /
        sum(rate(http_requests_total[1h]))
      )
    ) > 14.4 * (1 - 0.999)
  for: 5m
  annotations:
    summary: "Burning error budget 14x faster than sustainable"
```
If you’re burning budget 14x faster than sustainable, you’ll exhaust the monthly budget in about 2 days.
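The two-day figure falls out of simple division: at `burn_rate` times the sustainable pace, the budget lasts `window / burn_rate`. A quick sketch (the helper is hypothetical):

```python
def days_to_exhaustion(burn_rate: float, window_days: float = 30) -> float:
    """How long a full error budget lasts at a given burn rate.

    A burn rate of 1.0 spends the budget exactly over the whole window;
    14.4x spends a 30-day budget in roughly two days.
    """
    return window_days / burn_rate

days = days_to_exhaustion(14.4)  # ≈ 2.08 days
```

This is why multi-window burn-rate alerts pair a fast threshold like 14.4x (page now) with a slower one (ticket tomorrow): the divisor tells you how much reaction time each threshold buys.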
## The Minimum Viable Monitoring

Start here:
- **Metrics**: Prometheus + Grafana (or cloud equivalent)
- **Logs**: Structured JSON → Loki/Elasticsearch/CloudWatch
- **Traces**: OpenTelemetry → Jaeger/Tempo/X-Ray
- **Alerts**: PagerDuty/Opsgenie with proper escalation

First dashboards:
- Service overview (golden signals)
- Infrastructure (CPU, memory, disk per host)
- Dependencies (database, cache, queues)

First alerts:
- Error rate > 1%
- P95 latency > SLO threshold
- Service down (health check failing)
- Disk > 85% full
- Error budget burn rate too high

Build more as you learn what you need. The goal isn’t comprehensive monitoring—it’s useful monitoring.
Good monitoring is invisible when things work and invaluable when they don’t.