Good monitoring saves you from outages. Bad monitoring causes them — by training your team to ignore alerts until something actually breaks. Here’s how to avoid the most common anti-patterns.

Anti-Pattern 1: Alerting on Symptoms, Not Impact

# ❌ BAD: CPU is high
- alert: HighCPU
  expr: node_cpu_usage > 80
  for: 5m
  labels:
    severity: critical

High CPU isn’t a problem. Slow responses are a problem. Users don’t care about your CPU graphs.

# ✅ GOOD: User-facing latency degraded
- alert: APILatencyHigh
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "95th percentile latency above 500ms"
    impact: "Users experiencing slow page loads"

Alert on what users experience, not on what your infrastructure is doing internally.
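If that quantile expression gets reused across alerts and dashboards, a recording rule keeps it in one place. A sketch, assuming standard Prometheus recording rules — the record name below is illustrative:

```yaml
# Sketch: precompute the p95 once; alerts and dashboards then
# query the recorded series instead of repeating the expression.
groups:
  - name: latency-recordings
    rules:
      - record: job:http_request_duration_seconds:p95_5m
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

The alert expression above then shrinks to `job:http_request_duration_seconds:p95_5m > 0.5`, which is also cheaper to evaluate on every rule cycle.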

Anti-Pattern 2: Static Thresholds for Dynamic Systems

# ❌ BAD: Fixed threshold regardless of time
- alert: LowTraffic
  expr: rate(http_requests_total[5m]) < 100

At 3 AM, 100 requests per second is normal. At 3 PM, it’s an outage. (Note that rate() returns per-second values.) Static thresholds don’t understand context.

# ✅ GOOD: Compare to the same hour last week
- alert: TrafficAnomaly
  expr: |
    rate(http_requests_total[5m])
    <
    avg_over_time(rate(http_requests_total[5m])[1h:5m] offset 1w) * 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Traffic 50% below the same hour last week"

Compare current behavior to what’s normal for this time of day and this day of week.

Anti-Pattern 3: No Severity Levels

When everything is critical, nothing is:

# ❌ BAD: One severity for all alerts
- alert: DiskSpaceLow
  labels:
    severity: critical  # Is it though?

- alert: DatabaseDown
  labels:
    severity: critical  # This one actually is

Define what severities mean and use them consistently:

# ✅ GOOD: Clear severity definitions
# critical: User-facing, immediate response required, page on-call
# warning: Degraded, needs attention within hours
# info: Notable event, no action required

- alert: DiskSpace80Percent
  expr: disk_used_percent > 80
  labels:
    severity: warning
  annotations:
    runbook: "https://wiki/runbooks/disk-space"

- alert: DiskSpace95Percent
  expr: disk_used_percent > 95
  labels:
    severity: critical

Anti-Pattern 4: Alerts Without Runbooks

# ❌ BAD: What am I supposed to do with this?
- alert: KafkaLagHigh
  expr: kafka_consumer_lag > 10000
  annotations:
    summary: "Kafka lag is high"

The on-call engineer at 3 AM shouldn’t have to figure out what this means or how to fix it.

# ✅ GOOD: Actionable context
- alert: KafkaLagHigh
  expr: kafka_consumer_lag > 10000
  for: 10m
  annotations:
    summary: "Consumer {{ $labels.consumer_group }} lag above 10k"
    impact: "Events processing delayed, users may see stale data"
    likely_causes: |
      - Consumer pods crashed (check `kubectl get pods -l app=consumer`)
      - Upstream producer spike (check producer rate dashboard)
      - Slow downstream dependency (check DB latency)
    runbook: "https://wiki/runbooks/kafka-lag"
    dashboard: "https://grafana/d/kafka-overview"

Every alert should answer: What’s wrong? What’s the impact? Where do I start investigating?

Anti-Pattern 5: Alerting on Every Error

# ❌ BAD: Any error triggers alert
- alert: ApplicationError
  expr: rate(app_errors_total[1m]) > 0
  labels:
    severity: critical

Errors are normal. A 0.1% error rate on high-traffic services means thousands of errors per day. You can’t investigate every one.

# ✅ GOOD: Alert on error rate changes
- alert: ErrorRateSpike
  expr: |
    rate(app_errors_total[5m]) / rate(app_requests_total[5m]) > 0.05
    and
    rate(app_errors_total[5m]) > 10
  for: 5m
  annotations:
    summary: "Error rate above 5% ({{ $value | humanizePercentage }})"

Alert when error rate is both elevated and there’s enough traffic to be statistically meaningful.

Anti-Pattern 6: Missing Dependency Context

# ❌ BAD: Alerts without dependency awareness
- alert: ServiceADown
  expr: up{job="service-a"} == 0

- alert: ServiceBDown  
  expr: up{job="service-b"} == 0

- alert: ServiceCDown
  expr: up{job="service-c"} == 0

If Service A depends on Service B, and Service B is down, you’ll get alerts for both. The on-call sees two fires when there’s really one.

# ✅ GOOD: Suppress dependent alerts
- alert: ServiceBDown
  expr: up{job="service-b"} == 0

- alert: ServiceADown
  expr: |
    up{job="service-a"} == 0
    and on ()
    up{job="service-b"} == 1  # Only alert if dependency is healthy
    # "on ()" is needed because the two series have different
    # job labels; without it the "and" never matches.

Or use Alertmanager’s inhibition rules:

# alertmanager.yml
inhibit_rules:
  - source_matchers:
      - alertname = ServiceBDown
    target_matchers:
      - alertname = ServiceADown
    equal: ['environment']

Anti-Pattern 7: No Alert Testing

You test your code. Do you test your alerts?

# Test that the alert fires when it should.
# set_metric and get_firing_alerts are placeholders for your own
# harness (e.g. a Pushgateway write and a query against the
# Prometheus /api/v1/alerts endpoint).
import time

def test_high_latency_alert():
    # Inject synthetic latency above the 500ms threshold
    set_metric("http_request_duration_seconds", 0.8, {"quantile": "0.95"})

    # Wait for a Prometheus evaluation cycle
    time.sleep(60)

    # Verify the alert fired
    alerts = get_firing_alerts()
    assert any(a["alertname"] == "APILatencyHigh" for a in alerts)

Run chaos engineering on your alerting. Turn things off and verify alerts fire. If they don’t fire in testing, they won’t fire in production.
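Short of full chaos experiments, Prometheus’s own promtool can unit-test alerting rules offline against synthetic series. A minimal sketch against the ServiceBDown rule from the dependency example — file names here are illustrative:

```yaml
# alert-tests.yaml — run with: promtool test rules alert-tests.yaml
rule_files:
  - alerts.yaml            # the file containing ServiceBDown
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="service-b"}'
        values: '0 0 0 0 0 0'   # service down for six minutes
    alert_rule_test:
      - eval_time: 5m
        alertname: ServiceBDown
        exp_alerts:
          - exp_labels:
              job: service-b
```

Because this runs against synthetic data, it can live in CI next to the rule files and catch a broken expression before it ships.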

Anti-Pattern 8: Alert Spam During Deploys

Deploys cause transient issues. Pods restart, connections drop, error rates spike briefly. If every deploy triggers a page, you’ll train people to ignore pages.

# Silence alerts during deploy windows
# deploy-silence.yaml
matchers:
  - name: severity
    value: warning
  - name: environment  
    value: production
startsAt: 2024-01-15T14:00:00Z
endsAt: 2024-01-15T14:30:00Z
createdBy: deploy-automation
comment: "Scheduled deploy window"

Better: Make your deploys not trigger alerts in the first place with proper health checks and graceful shutdowns.
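What that looks like in practice, sketched for Kubernetes — names, ports, and timings below are illustrative: a readiness probe so traffic drains before shutdown, and a preStop delay so in-flight requests finish:

```yaml
# Illustrative deployment fragment: drain before dying, so a
# rolling deploy doesn't spike error rates.
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: api
      readinessProbe:          # stop routing traffic when unhealthy
        httpGet:
          path: /healthz       # hypothetical health endpoint
          port: 8080
        periodSeconds: 5
      lifecycle:
        preStop:
          exec:                # give the LB time to deregister the pod
            command: ["sh", "-c", "sleep 10"]
```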

The Right Approach

Good alerting follows these principles:

  1. Page on user impact, not infrastructure metrics
  2. Every alert must be actionable — if you can’t do anything, it’s a log, not an alert
  3. Alerts need context — runbooks, dashboards, likely causes
  4. Severity should map to response time — critical = now, warning = today, info = never
  5. Test your alerts — chaos engineering for observability
  6. Review and prune — if an alert hasn’t been useful in 30 days, delete it
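Principle 4 can be enforced in Alertmanager routing rather than left to convention. A sketch, with receiver names as placeholders for your real integrations:

```yaml
# alertmanager.yml (sketch) — severity decides the channel,
# so "critical" is the only thing that pages.
route:
  receiver: slack-warnings          # default: non-paging channel
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall    # page immediately
    - matchers: ['severity="info"']
      receiver: "null"              # visible in Alertmanager, never notifies
receivers:
  - name: pagerduty-oncall
  - name: slack-warnings
  - name: "null"
```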

Metrics to Track

Monitor your monitoring:

-- Alert fatigue indicators
SELECT 
  alertname,
  COUNT(*) as fire_count,
  AVG(resolved_at - fired_at) as avg_duration,
  SUM(CASE WHEN acknowledged THEN 0 ELSE 1 END) as ignored_count
FROM alerts
WHERE fired_at > NOW() - INTERVAL '30 days'
GROUP BY alertname
ORDER BY ignored_count DESC;

If an alert fires frequently but is rarely acknowledged, it’s noise. Either fix the underlying issue or delete the alert.


The goal isn’t maximum coverage. It’s signal-to-noise ratio. One actionable alert is worth more than a hundred that get ignored.