Most monitoring dashboards are useless. Hundreds of metrics, dozens of graphs, all green—until something breaks and you’re scrambling through charts trying to find the one that matters.

Good monitoring isn’t about collecting everything. It’s about knowing what to look at when things go wrong.

The Three Pillars

Observability has three pillars: metrics, logs, and traces. Each answers different questions.

Metrics: What is happening? (Aggregated numbers over time)

  • Request rate, error rate, latency
  • CPU, memory, disk usage
  • Queue depth, connection count

Logs: Why did it happen? (Detailed event records)

  • Error messages and stack traces
  • Request/response bodies
  • State changes

Traces: Where did it happen? (Request flow across services)

  • Which service was slow?
  • What path did the request take?
  • Where did it fail?

You need all three. Metrics tell you there’s a problem. Logs tell you what. Traces tell you where.

Start With RED and USE

Don’t invent metrics. Use established frameworks.

RED Method (for services):

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Latency distribution
import time

from flask import Flask, request
from prometheus_client import Counter, Histogram

app = Flask(__name__)

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

@app.before_request
def start_timer():
    request.start_time = time.time()

@app.after_request
def record_metrics(response):
    latency = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.endpoint,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.endpoint
    ).observe(latency)
    return response

USE Method (for resources):

  • Utilization: How busy is it? (CPU at 80%)
  • Saturation: How overloaded? (Queue depth)
  • Errors: Failures? (Disk errors)

Every resource (CPU, memory, disk, network) should have all three.
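
As a concrete sketch, disk utilization can be read straight from the standard library; saturation and errors need OS-level sources, so they are stubbed here (the function name and dict shape are illustrative, not a standard API):

```python
import shutil

def disk_use_snapshot(path="/"):
    """One USE snapshot for a disk volume (illustrative helper)."""
    usage = shutil.disk_usage(path)
    return {
        "utilization": usage.used / usage.total,  # how full the volume is
        "saturation": None,  # I/O queue depth: from /proc/diskstats or node_exporter
        "errors": None,      # I/O error count: from kernel logs or SMART data
    }

snapshot = disk_use_snapshot("/")
```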

The Four Golden Signals

Google’s SRE book defines four signals every service needs:

  1. Latency: Time to serve a request
  2. Traffic: Demand on your system
  3. Errors: Rate of failed requests
  4. Saturation: How “full” your service is

If you only monitor four things, monitor these.

# Prometheus alerting rules for golden signals
groups:
  - name: golden_signals
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency above 1 second"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 1%"

      - alert: TrafficDrop
        expr: rate(http_requests_total[5m]) < rate(http_requests_total[5m] offset 1h) * 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Traffic dropped by more than 50%"

Percentiles, Not Averages

Average latency is a lie.

Requests: [50ms, 50ms, 50ms, 50ms, 5000ms]
Average:  1040ms
P50:      50ms
P95:      5000ms
P99:      5000ms

The average says “about 1 second.” Reality: most requests are fast, but one in five users gets a terrible experience.

Always use percentiles:

  • P50 (median): Typical experience
  • P95: What slow users see
  • P99: Worst case (mostly)
# Histogram buckets for latency
LATENCY_BUCKETS = [
    0.001,   # 1ms
    0.005,   # 5ms
    0.01,    # 10ms
    0.025,   # 25ms
    0.05,    # 50ms
    0.1,     # 100ms
    0.25,    # 250ms
    0.5,     # 500ms
    1.0,     # 1s
    2.5,     # 2.5s
    5.0,     # 5s
    10.0,    # 10s
]

# Query percentiles in Prometheus
# histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
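
To see why the average misleads, here is a nearest-rank percentile computed by hand on the latency sample from above (the helper function is illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ranked = sorted(samples)
    k = max(math.ceil(p / 100 * len(ranked)) - 1, 0)
    return ranked[k]

latencies_ms = [50, 50, 50, 50, 5000]
average = sum(latencies_ms) / len(latencies_ms)  # 1040.0 -- "about 1 second"
p50 = percentile(latencies_ms, 50)               # 50 -- the typical request
p95 = percentile(latencies_ms, 95)               # 5000 -- what slow users see
```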

Alerting That Doesn’t Suck

Bad alerts train people to ignore alerts.

Symptoms of bad alerting:

  • Alert fatigue (too many alerts)
  • False positives (alerts that aren’t problems)
  • Missing context (alert fires, no idea why)
  • No runbook (alert fires, no idea what to do)

Rules for good alerts:

1. Alert on symptoms, not causes

# Bad: alerts on a cause
- alert: HighCPU
  expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8

# Good: alerts on a symptom
- alert: HighLatency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5

High CPU might be fine if latency is fine. Alert on what users experience.

2. Every alert needs a runbook

- alert: DatabaseConnectionPoolExhausted
  expr: db_pool_available_connections == 0
  annotations:
    summary: "No available database connections"
    runbook_url: "https://wiki.internal/runbooks/db-pool-exhausted"
    description: |
      The connection pool is exhausted. Check:
      1. Are queries running slowly? Check slow query log.
      2. Are connections being leaked? Check application logs.
      3. Is traffic higher than usual? Check request rate.

3. Use severity levels meaningfully

# Critical: User-facing, needs immediate action
severity: critical
# Warning: Degraded, but functional
severity: warning
# Info: FYI, no action needed
severity: info

Don’t page at 3 AM for warnings.
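
Severity routing is enforced in the pager layer, not in Prometheus. A sketch of an Alertmanager route that pages only on critical alerts and sends warnings to chat (receiver names and keys are placeholders):

```yaml
route:
  receiver: slack-warnings        # default: no page
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty-oncall  # only critical wakes someone up

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<your-integration-key>"
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts"
```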

4. Include actionable context

annotations:
  summary: "High error rate on payment service"
  description: |
    Error rate: {{ $value | printf "%.2f" }}%
    Service: {{ $labels.service }}
    Endpoint: {{ $labels.endpoint }}
    Most common error: {{ $labels.error_type }}
  dashboard: "https://grafana.internal/d/payments"
  logs: "https://logs.internal/payments?level=error&last=15m"

Dashboard Design

The three-dashboard strategy:

1. Overview Dashboard
   For: Quick health check
   Shows: Golden signals for all services
   Use: Start here during incidents

2. Service Dashboard
   For: Debugging a specific service
   Shows: Detailed metrics for one service
   Use: Drill down after the overview shows a problem

3. Debug Dashboard
   For: Deep investigation
   Shows: Everything (resource usage, dependencies, internals)
   Use: Root cause analysis

Layout principles:

  • Most important metrics at top
  • Group related metrics
  • Use consistent colors (green=good, red=bad)
  • Show rate of change, not just current value
[Dashboard mockup: a "Service Health" row with Request Rate, Error Rate, and P95 Latency, above resource panels for CPU, Memory, Disk, Cache, Network, DB, and Queues]

Log Aggregation Done Right

Logs are useless if you can’t search them.

Structured logging:

import structlog

logger = structlog.get_logger()

# Bad
logger.info(f"User {user_id} purchased {item} for ${price}")

# Good
logger.info(
    "purchase_completed",
    user_id=user_id,
    item_id=item,
    price=price,
    currency="USD",
    payment_method="credit_card"
)

Output:

{
  "timestamp": "2026-03-11T05:30:00Z",
  "level": "info",
  "event": "purchase_completed",
  "user_id": "usr_12345",
  "item_id": "prod_67890",
  "price": 29.99,
  "currency": "USD",
  "payment_method": "credit_card",
  "trace_id": "abc123",
  "service": "checkout"
}

Now you can query: “Show all purchases over $100 in the last hour that failed.”
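
That query is trivial once logs are structured JSON. A self-contained sketch against raw log lines (the `purchase_failed` event name and field names mirror the example above but are assumptions):

```python
import json
from datetime import datetime, timedelta, timezone

def failed_large_purchases(log_lines, min_price=100.0, window=timedelta(hours=1)):
    """Return purchase-failure events above min_price within the time window."""
    cutoff = datetime.now(timezone.utc) - window
    matches = []
    for line in log_lines:
        event = json.loads(line)
        ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        if (event.get("event") == "purchase_failed"
                and event.get("price", 0) > min_price
                and ts >= cutoff):
            matches.append(event)
    return matches
```

In production this filtering runs in the log backend (Loki, Elasticsearch, CloudWatch Insights), but the query maps one-to-one onto these field comparisons.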

Log levels matter:

# DEBUG: Detailed debugging info (off in production)
logger.debug("cache_lookup", key=key, hit=True)

# INFO: Normal operations worth recording
logger.info("user_login", user_id=user_id)

# WARNING: Something unexpected but handled
logger.warning("rate_limit_approached", user_id=user_id, current=95, limit=100)

# ERROR: Something failed, needs attention
logger.error("payment_failed", user_id=user_id, error=str(e), payment_id=payment_id)

# CRITICAL: System is broken
logger.critical("database_connection_lost", error=str(e))

Distributed Tracing

When a request touches multiple services, traces show the journey.

[Trace diagram: Request → API Gateway → Auth Service → User Service → Payment Service, with hops into Cache, Database, and the Stripe API]

OpenTelemetry instrumentation:

from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Auto-instrument HTTP clients
RequestsInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

@app.route('/checkout')
def checkout():
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user_id", request.user_id)
        
        with tracer.start_as_current_span("validate_cart"):
            validate_cart()
        
        with tracer.start_as_current_span("process_payment"):
            process_payment()
        
        with tracer.start_as_current_span("send_confirmation"):
            send_confirmation()

The trace propagates automatically through HTTP headers. When payment service is slow, you see exactly where in the chain the delay occurred.
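
The header doing that propagation is W3C `traceparent`. A dependency-free sketch of its format (parsing only the four required fields):

```python
def parse_traceparent(header):
    """Split a W3C traceparent header: version-traceid-spanid-flags."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,          # "00" for the current spec
        "trace_id": trace_id,        # 32 hex chars, constant across all services
        "parent_span_id": span_id,   # 16 hex chars, the calling span
        "sampled": bool(int(flags, 16) & 0x01),
    }

span = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# span["trace_id"] is what ties the checkout span to the payment service's spans
```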

SLOs: Defining “Good Enough”

Service Level Objectives make reliability measurable.

Define your SLO:

99.9% of requests completed successfully within 500ms,
over a 30-day rolling window

Calculate error budget:

30 days = 43,200 minutes
0.1% error budget = 43.2 minutes of downtime allowed

Track burn rate:

# Alert when burning error budget too fast
- alert: ErrorBudgetBurnRate
  expr: |
    (
      1 - (
        sum(rate(http_requests_total{status!~"5.."}[1h]))
        /
        sum(rate(http_requests_total[1h]))
      )
    ) > 14.4 * (1 - 0.999)
  for: 5m
  annotations:
    summary: "Burning error budget 14x faster than sustainable"

If you’re burning budget 14x faster than sustainable, you’ll exhaust the monthly budget in about 2 days.
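
The arithmetic behind that threshold, as a sketch (function names are illustrative):

```python
def error_budget_minutes(slo, window_days=30):
    """Downtime allowed by an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def days_until_exhausted(burn_rate, window_days=30):
    """At N-times-sustainable burn, the budget lasts window / N days."""
    return window_days / burn_rate

budget = error_budget_minutes(0.999)  # ~43.2 minutes
runway = days_until_exhausted(14.4)   # ~2.08 days -- hence the 14.4x threshold
```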

The Minimum Viable Monitoring

Start here:

  1. Metrics: Prometheus + Grafana (or cloud equivalent)
  2. Logs: Structured JSON → Loki/Elasticsearch/CloudWatch
  3. Traces: OpenTelemetry → Jaeger/Tempo/X-Ray
  4. Alerts: PagerDuty/Opsgenie with proper escalation

First dashboards:

  • Service overview (golden signals)
  • Infrastructure (CPU, memory, disk per host)
  • Dependencies (database, cache, queues)

First alerts:

  • Error rate > 1%
  • P95 latency > SLO threshold
  • Service down (health check failing)
  • Disk > 85% full
  • Error budget burn rate too high
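
The "disk > 85% full" alert from that list, sketched with node_exporter's filesystem metrics (thresholds and excluded fstypes are a starting point, not a rule):

```yaml
- alert: DiskAlmostFull
  expr: |
    (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
      / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.15
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Filesystem {{ $labels.mountpoint }} is over 85% full"
```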

Build more as you learn what you need. The goal isn’t comprehensive monitoring—it’s useful monitoring.


Good monitoring is invisible when things work and invaluable when they don’t.