Good monitoring saves you from outages. Bad monitoring causes them — by training your team to ignore alerts until something actually breaks. Here’s how to avoid the most common anti-patterns.
## Anti-Pattern 1: Alerting on Symptoms, Not Impact

```yaml
# ❌ BAD: CPU is high
- alert: HighCPU
  expr: node_cpu_usage > 80
  for: 5m
  labels:
    severity: critical
```
High CPU isn’t a problem. Slow responses are a problem. Users don’t care about your CPU graphs.
```yaml
# ✅ GOOD: User-facing latency degraded
- alert: APILatencyHigh
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "95th percentile latency above 500ms"
    impact: "Users experiencing slow page loads"
```
Alert on what users experience, not on what your infrastructure is doing internally.
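To see why a percentile is the right thing to page on, here is a minimal sketch of what that `histogram_quantile(0.95, ...)` expression approximates — the latency below which 95% of requests fall. This is a simplified nearest-rank calculation, not Prometheus's interpolated version:

```python
def p95_latency(durations_seconds):
    """Return an approximate 95th-percentile latency (nearest-rank method)
    from a list of observed request durations."""
    ordered = sorted(durations_seconds)
    index = int(0.95 * (len(ordered) - 1))
    return ordered[index]
```

A single slow outlier among 100 fast requests leaves the p95 untouched — which is exactly why one spike in a CPU graph shouldn't page anyone, but a shifted p95 should.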
## Anti-Pattern 2: Static Thresholds for Dynamic Systems

```yaml
# ❌ BAD: Fixed threshold regardless of time
- alert: LowTraffic
  expr: rate(http_requests_total[5m]) < 100
```
At 3 AM, 100 requests per second might be normal. At 3 PM, the same number is an outage. Static thresholds don’t understand context.
```yaml
# ✅ GOOD: Compare to historical baseline
- alert: TrafficAnomaly
  expr: |
    rate(http_requests_total[5m])
    <
    avg_over_time(rate(http_requests_total[5m])[7d:1h]) * 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Traffic 50% below the 7-day average"
```
Compare current behavior to what’s normal. Note that the subquery above averages over the whole week; for a baseline that respects time of day and day of week, compare against the same window one week earlier with `offset 7d`.
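The baseline comparison itself is simple; here is the same check in plain code, where `baseline_rates` stands in for samples from the historical window (names are illustrative):

```python
def traffic_anomalous(current_rate, baseline_rates, factor=0.5):
    """True when current traffic drops below `factor` of the historical
    mean -- the same shape as the TrafficAnomaly expression."""
    baseline = sum(baseline_rates) / len(baseline_rates)
    return current_rate < baseline * factor
```

The point is that the threshold is derived from the data, not hard-coded.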
## Anti-Pattern 3: No Severity Levels
When everything is critical, nothing is:
```yaml
# ❌ BAD: One severity for all alerts
- alert: DiskSpaceLow
  labels:
    severity: critical  # Is it though?
- alert: DatabaseDown
  labels:
    severity: critical  # This one actually is
```
Define what severities mean and use them consistently:
```yaml
# ✅ GOOD: Clear severity definitions
# critical: User-facing, immediate response required, page on-call
# warning:  Degraded, needs attention within hours
# info:     Notable event, no action required
- alert: DiskSpace80Percent
  expr: disk_used_percent > 80
  labels:
    severity: warning
  annotations:
    runbook: "https://wiki/runbooks/disk-space"
- alert: DiskSpace95Percent
  expr: disk_used_percent > 95
  labels:
    severity: critical
```
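Consistency is easier if the severity definitions live in code rather than in a comment. A hypothetical policy table (the names and channels here are illustrative, not an Alertmanager API):

```python
# Encode what each severity means and how it gets delivered
SEVERITY_POLICY = {
    "critical": {"notify": "page-oncall", "respond_within": "now"},
    "warning":  {"notify": "ticket",      "respond_within": "hours"},
    "info":     {"notify": "log-only",    "respond_within": None},
}

def route(alert):
    """Route an alert dict by its severity label. Unknown severities page,
    on the theory that a mislabeled alert is worse than an extra page."""
    severity = alert.get("labels", {}).get("severity", "critical")
    return SEVERITY_POLICY.get(severity, SEVERITY_POLICY["critical"])["notify"]
```

With a table like this, "what does warning mean?" has exactly one answer.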
## Anti-Pattern 4: Alerts Without Runbooks

```yaml
# ❌ BAD: What am I supposed to do with this?
- alert: KafkaLagHigh
  expr: kafka_consumer_lag > 10000
  annotations:
    summary: "Kafka lag is high"
```
The on-call engineer at 3 AM shouldn’t have to figure out what this means or how to fix it.
```yaml
# ✅ GOOD: Actionable context
- alert: KafkaLagHigh
  expr: kafka_consumer_lag > 10000
  for: 10m
  annotations:
    summary: "Consumer {{ $labels.consumer_group }} lag above 10k"
    impact: "Events processing delayed, users may see stale data"
    likely_causes: |
      - Consumer pods crashed (check `kubectl get pods -l app=consumer`)
      - Upstream producer spike (check producer rate dashboard)
      - Slow downstream dependency (check DB latency)
    runbook: "https://wiki/runbooks/kafka-lag"
    dashboard: "https://grafana/d/kafka-overview"
```
Every alert should answer: What’s wrong? What’s the impact? Where do I start investigating?
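Those three questions can be enforced mechanically. A small lint sketch that checks alert rules (parsed into dicts) for the required annotations — the required set here is an assumption, adjust to your own conventions:

```python
# Annotations every alert must carry: what's wrong, what's the impact,
# and where to start investigating
REQUIRED_ANNOTATIONS = {"summary", "impact", "runbook"}

def missing_annotations(rule):
    """Return a sorted list of required annotations the rule lacks."""
    present = set(rule.get("annotations", {}))
    return sorted(REQUIRED_ANNOTATIONS - present)
```

Wire a check like this into CI and an annotation-free alert never ships.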
## Anti-Pattern 5: Alerting on Every Error

```yaml
# ❌ BAD: Any error triggers alert
- alert: ApplicationError
  expr: rate(app_errors_total[1m]) > 0
  labels:
    severity: critical
```
Errors are normal. A 0.1% error rate on high-traffic services means thousands of errors per day. You can’t investigate every one.
```yaml
# ✅ GOOD: Alert on error rate changes
- alert: ErrorRateSpike
  expr: |
    rate(app_errors_total[5m]) / rate(app_requests_total[5m]) > 0.05
    and
    rate(app_errors_total[5m]) > 10
  for: 5m
  annotations:
    summary: "Error rate above 5% ({{ $value | humanizePercentage }})"
```
Alert when the error rate is elevated *and* there’s enough traffic for it to be statistically meaningful.
## Anti-Pattern 6: Missing Dependency Context

```yaml
# ❌ BAD: Alerts without dependency awareness
- alert: ServiceADown
  expr: up{job="service-a"} == 0
- alert: ServiceBDown
  expr: up{job="service-b"} == 0
- alert: ServiceCDown
  expr: up{job="service-c"} == 0
```
If Service A depends on Service B, and Service B is down, you’ll get alerts for both. The on-call sees two fires when there’s really one.
```yaml
# ✅ GOOD: Suppress dependent alerts
- alert: ServiceBDown
  expr: up{job="service-b"} == 0
- alert: ServiceADown
  expr: |
    up{job="service-a"} == 0
    and on ()
    up{job="service-b"} == 1  # Only alert if the dependency is healthy
```
Or use Alertmanager’s inhibition rules:
```yaml
# alertmanager.yml
inhibit_rules:
  - source_match:
      alertname: ServiceBDown
    target_match:
      alertname: ServiceADown
    equal: ['environment']
```
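Either way, the underlying logic is the same. A sketch of that suppression in plain code, with a hypothetical dependency map — given the set of down services, report only the root causes:

```python
# Who depends on whom (illustrative; in practice this comes from a
# service catalog or deployment metadata)
DEPENDS_ON = {"service-a": {"service-b"}, "service-c": set()}

def root_cause_alerts(down):
    """Suppress a down service when one of its dependencies is also down --
    the dependency is the fire; the dependent is just smoke."""
    return sorted(s for s in down
                  if not (DEPENDS_ON.get(s, set()) & down))
```

When `service-a` and `service-b` are both down, the on-call sees one alert, pointing at the actual problem.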
## Anti-Pattern 7: No Alert Testing
You test your code. Do you test your alerts?
```python
import time

# Test that the alert fires when it should
def test_high_latency_alert():
    # Inject synthetic high latency (set_metric and get_firing_alerts are
    # test-harness helpers assumed by this example)
    set_metric("http_request_duration_seconds", 0.8, {"quantile": "0.95"})
    # Wait for an evaluation cycle
    time.sleep(60)
    # Verify the alert fired
    alerts = get_firing_alerts()
    assert any(a["alertname"] == "APILatencyHigh" for a in alerts)
```
Run chaos engineering on your alerting. Turn things off and verify alerts fire. If they don’t fire in testing, they won’t fire in production.
## Anti-Pattern 8: Alert Spam During Deploys
Deploys cause transient issues. Pods restart, connections drop, error rates spike briefly. If every deploy triggers a page, you’ll train people to ignore pages.
```yaml
# Silence alerts during deploy windows
# deploy-silence.yaml
matchers:
  - name: severity
    value: warning
  - name: environment
    value: production
startsAt: 2024-01-15T14:00:00Z
endsAt: 2024-01-15T14:30:00Z
createdBy: deploy-automation
comment: "Scheduled deploy window"
```
Better: Make your deploys not trigger alerts in the first place with proper health checks and graceful shutdowns.
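The core of a graceful shutdown is small. A sketch of a SIGTERM handler that flips the readiness flag so the load balancer drains traffic before the process exits — the readiness-probe wiring is assumed, not shown:

```python
import signal

class GracefulShutdown:
    """On SIGTERM, stop reporting ready; in-flight requests finish while
    new traffic goes elsewhere. (Sketch of an assumed deploy setup.)"""

    def __init__(self):
        self.ready = True
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Fail the readiness probe; do NOT exit immediately
        self.ready = False
```

If the pod drains cleanly, there is no transient error spike, and nothing to silence.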
## The Right Approach
Good alerting follows these principles:
- Page on user impact, not infrastructure metrics
- Every alert must be actionable — if you can’t do anything, it’s a log, not an alert
- Alerts need context — runbooks, dashboards, likely causes
- Severity should map to response time — critical = now, warning = today, info = never
- Test your alerts — chaos engineering for observability
- Review and prune — if an alert hasn’t been useful in 30 days, delete it
## Metrics to Track
Monitor your monitoring:
```sql
-- Alert fatigue indicators
SELECT
  alertname,
  COUNT(*) AS fire_count,
  AVG(resolved_at - fired_at) AS avg_duration,
  SUM(CASE WHEN acknowledged THEN 0 ELSE 1 END) AS ignored_count
FROM alerts
WHERE fired_at > NOW() - INTERVAL '30 days'
GROUP BY alertname
ORDER BY ignored_count DESC;
```
If an alert fires frequently but is rarely acknowledged, it’s noise. Either fix the underlying issue or delete the alert.
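That "fires frequently, rarely acknowledged" test can be made into a concrete cutoff. A sketch using the counts from the query above — the 80% threshold is an arbitrary illustration, not a standard:

```python
def is_noise(fire_count, ignored_count, ignore_ratio=0.8):
    """Flag an alert as noise when most of its firings go unacknowledged."""
    return fire_count > 0 and ignored_count / fire_count >= ignore_ratio
```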
The goal isn’t maximum coverage. It’s signal-to-noise ratio. One actionable alert is worth more than a hundred that get ignored.