On-call doesn't have to mean sleepless nights and burnout. With the right practices, you can maintain reliable systems while keeping your team healthy. Here's how.

Alert Design

Only Alert on Actionable Items

# Bad: Noisy, leads to alert fatigue
- alert: HighCPU
  expr: cpu_usage > 80
  for: 1m
  
# Good: Actionable, symptom-based
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: warning
    runbook: "https://runbooks.example.com/high-error-rate"
  annotations:
    summary: "Error rate {{ $value | humanizePercentage }} exceeds 1%"
    impact: "Users experiencing failures"
    action: "Check application logs and recent deployments"
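A quick way to sanity-check what that expression alerts on is to plug in numbers; a tiny sketch with made-up request rates:

# Sanity-check the error-rate ratio the rule computes (numbers are made up)
errors_per_sec = 1.2     # rate of 5xx responses over the 5m window
requests_per_sec = 95.0  # rate of all responses over the 5m window

error_rate = errors_per_sec / requests_per_sec
print(f"{error_rate:.2%}")   # 1.26%
print(error_rate > 0.01)     # True -> alert fires once this holds for 5m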

Severity Levels

# severity-definitions.yaml
severities:
  critical:
    description: "Service completely down, immediate action required"
    response_time: "5 minutes"
    notification: "page + phone call"
    examples:
      - "Total service outage"
      - "Data loss occurring"
      - "Security breach detected"
  
  warning:
    description: "Degraded service, action needed soon"
    response_time: "30 minutes"
    notification: "page"
    examples:
      - "Error rate elevated but service functional"
      - "Approaching resource limits"
      - "Single replica down"
  
  info:
    description: "Notable event, no immediate action"
    response_time: "Next business day"
    notification: "Slack"
    examples:
      - "Deployment completed"
      - "Scheduled maintenance starting"

Runbooks

Every alert needs a runbook:

# Runbook: High Error Rate

## Alert
`HighErrorRate` - Error rate exceeds 1% for 5+ minutes

## Impact
Users receiving 5xx errors. Potential revenue loss.

## Quick Diagnosis

1. **Check recent deployments**
   ```bash
   kubectl rollout history deploy/api-service
   ```
   If recent deploy, consider rollback.

2. **Check logs for patterns**
   ```bash
   kubectl logs -l app=api-service --since=10m | grep -i error | head -50
   ```

3. **Check dependencies**
   ```bash
   curl -s http://api-service/health/dependencies | jq
   ```

## Common Causes & Fixes

### Database Connection Issues
```bash
# Check connection pool
kubectl exec -it deploy/api-service -- curl localhost:8080/metrics | grep db_pool

# If exhausted, restart pods
kubectl rollout restart deploy/api-service
```

### Memory Pressure
```bash
# Check memory usage
kubectl top pods -l app=api-service

# If OOMKilled, check for memory leaks in recent commits
```

### Downstream Service Failure
```bash
# Check circuit breaker status
curl http://api-service/health/circuit-breakers

# If payment-service is open, check payment-service health
```

## Escalation

If unresolved after 15 minutes, escalate to:

- Slack: #incidents
- Page: @backend-oncall-secondary

Incident Response Checklist

# incident-checklist.yaml
incident_response:
  declare:
    - Assign Incident Commander
    - Create incident channel
    - Notify stakeholders
    - Start timeline
  
  triage:
    - Assess customer impact
    - Determine severity
    - Identify affected systems
    - Check for related incidents
  
  mitigate:
    - Implement quick fix (rollback, scale, redirect)
    - Communicate status externally if needed
    - Monitor for improvement
  
  resolve:
    - Confirm metrics normalized
    - Verify customer impact resolved
    - Document resolution steps
    - Schedule postmortem
  
  follow_up:
    - Write postmortem within 48 hours
    - Create action items
    - Share learnings
    - Update runbooks
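The declare step benefits from a tiny helper that mints a consistent incident ID (the postmortem below uses the INC-YYYYMMDD-XXXX form) and records the basics. A minimal sketch; the channel-naming convention and any Slack or paging integration are assumptions to replace with your own:

# declare_incident.py - illustrative sketch
from datetime import datetime, timezone
from uuid import uuid4

def new_incident_id() -> str:
    # e.g. "INC-20260211-A3F2": UTC date plus a short random suffix
    date = datetime.now(timezone.utc).strftime("%Y%m%d")
    return f"INC-{date}-{uuid4().hex[:4].upper()}"

def declare_incident(summary: str, severity: str, commander: str) -> dict:
    incident_id = new_incident_id()
    return {
        "id": incident_id,
        "summary": summary,
        "severity": severity,
        "commander": commander,
        # Channel name follows the incident ID so it is easy to find later (assumed convention)
        "channel": f"#inc-{incident_id.lower()}",
        "started_at": datetime.now(timezone.utc).isoformat(),
    }

print(declare_incident("API error rate above 1%", "critical", "@alice"))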

Blameless Postmortems

# Postmortem: INC-20260211-A3F2

## Summary
API error rate spiked to 15% for 23 minutes due to database 
connection pool exhaustion after a connection leak was introduced 
in deploy v2.3.4.

## Impact
- Duration: 23 minutes (14:32 - 14:55 EST)
- Users affected: ~12,000
- Revenue impact: ~$4,500 estimated
- SLO impact: 0.3% of monthly error budget consumed

## Timeline
| Time | Event |
|------|-------|
| 14:15 | Deploy v2.3.4 rolled out |
| 14:32 | Error rate alert fired |
| 14:35 | On-call engineer acknowledged |
| 14:38 | Identified connection pool exhaustion |
| 14:42 | Rolled back to v2.3.3 |
| 14:50 | Error rate normalized |
| 14:55 | Incident resolved |

## Root Cause
A new database query in v2.3.4 used a raw connection instead of 
the connection pool, leaking connections on each request. Under 
load, the pool exhausted within 17 minutes.

## What Went Well
- Alert fired promptly when error rate exceeded threshold
- On-call engineer had runbook for connection pool issues
- Rollback was fast and effective

## What Went Poorly
- Connection leak wasn't caught in code review
- No integration test for connection pool behavior
- Took 7 minutes to identify root cause

## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add connection pool integration test | @alice | 2026-02-18 |
| Add static analysis rule for raw connections | @bob | 2026-02-18 |
| Update code review checklist | @team | 2026-02-14 |
| Add connection pool saturation alert | @charlie | 2026-02-15 |

## Lessons Learned
Connection management is subtle and tests should cover pool behavior 
under load, not just happy-path queries.
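Action items only help if someone notices when they slip. A minimal sketch that flags overdue items, assuming they are mirrored from the table above into a small list:

# action_items.py - illustrative sketch
from datetime import date

action_items = [
    {"action": "Add connection pool integration test", "owner": "@alice",
     "due": date(2026, 2, 18), "done": False},
    {"action": "Update code review checklist", "owner": "@team",
     "due": date(2026, 2, 14), "done": True},
]

def overdue(items, today=None):
    # Open items whose due date has already passed
    today = today or date.today()
    return [i for i in items if not i["done"] and i["due"] < today]

for item in overdue(action_items, today=date(2026, 2, 20)):
    print(f"OVERDUE: {item['action']} (owner {item['owner']}, due {item['due']})")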

On-Call Rotation

# oncall_scheduler.py
from datetime import datetime, timedelta
from typing import List

# Fixed reference point so all rotation math agrees
ROTATION_EPOCH = datetime(2026, 1, 1)

class OnCallScheduler:
    def __init__(self, team_members: List[str],
                 rotation_days: int = 7):
        self.team = team_members
        self.rotation_days = rotation_days

    def get_current_oncall(self) -> dict:
        # Number of complete rotations since the epoch picks the primary
        rotations = (datetime.now() - ROTATION_EPOCH).days // self.rotation_days

        primary_idx = rotations % len(self.team)
        secondary_idx = (primary_idx + 1) % len(self.team)

        return {
            "primary": self.team[primary_idx],
            "secondary": self.team[secondary_idx],
            "rotation_ends": self._next_rotation_time(),
        }

    def _next_rotation_time(self) -> datetime:
        # Measure from the same epoch so this agrees with get_current_oncall
        now = datetime.now()
        days_into_rotation = (now - ROTATION_EPOCH).days % self.rotation_days
        return now + timedelta(days=self.rotation_days - days_into_rotation)

    def handoff_checklist(self) -> List[str]:
        return [
            "Review active incidents",
            "Check pending alerts",
            "Review recent deployments",
            "Verify PagerDuty app is working",
            "Confirm laptop/phone charged",
            "Share context on ongoing issues",
        ]
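Usage at handoff time is a couple of lines; the names here are placeholders:

# Quick usage sketch (placeholder team)
scheduler = OnCallScheduler(["alice", "bob", "charlie", "dana"])
print(scheduler.get_current_oncall())
# e.g. {"primary": "charlie", "secondary": "dana", "rotation_ends": datetime(...)}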

Sustainable Practices

Protect Sleep

# pagerduty-schedule.yaml
# Route non-critical alerts away from sleeping hours
routing_rules:
  - condition: 
      severity: warning
      time: "22:00-08:00"
    action: 
      route_to: slack_channel
      suppress_page: true
  
  - condition:
      severity: critical
    action:
      route_to: on_call_primary
      escalate_after: 5m
      escalate_to: on_call_secondary
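The subtle part of that routing rule is the quiet window wrapping midnight. A minimal sketch of the paging decision, taking the 22:00-08:00 window and severity names from the config above:

# page_or_slack.py - illustrative sketch
from datetime import time

QUIET_START, QUIET_END = time(22, 0), time(8, 0)

def in_quiet_hours(t: time) -> bool:
    # The window wraps midnight, so it is "after start OR before end"
    return t >= QUIET_START or t < QUIET_END

def should_page(severity: str, now: time) -> bool:
    # Critical always pages; warnings go to Slack during quiet hours
    if severity == "critical":
        return True
    return not in_quiet_hours(now)

print(should_page("warning", time(2, 30)))   # False -> Slack only
print(should_page("critical", time(2, 30)))  # True  -> page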

Track On-Call Load

# Track pages per rotation
from collections import Counter
from datetime import datetime
from statistics import mean
from typing import List

def is_after_hours(timestamp: datetime) -> bool:
    # Matches the quiet window above: outside 08:00-22:00 local time
    return timestamp.hour < 8 or timestamp.hour >= 22

def oncall_health_report(incidents: List[dict]) -> dict:
    return {
        "total_pages": len(incidents),
        "after_hours_pages": len([i for i in incidents
                                  if is_after_hours(i["time"])]),
        "mean_time_to_ack": mean([i["ack_time"] for i in incidents]),
        "mean_time_to_resolve": mean([i["resolve_time"] for i in incidents]),
        "top_alert_sources": Counter([i["alert"] for i in incidents]).most_common(5),
        "actionable_rate": len([i for i in incidents if i["was_actionable"]]) / len(incidents),
    }

# If after_hours_pages > 2/week, investigate and fix
# If actionable_rate < 80%, alerts need tuning

Key Principles

  1. Alert on symptoms, not causes: users care about errors, not CPU
  2. Every alert needs a runbook: if you can't write one, don't alert
  3. Blameless postmortems: focus on systems, not people
  4. Protect sleep: tired engineers make mistakes
  5. Measure on-call load: track pages and fix noisy alerts
  6. Practice incidents: game days build muscle memory

Good on-call is a system property, not individual heroism. Build the systems that let your team respond effectively without burning out.