Something is broken in production. Customers are complaining. Your heart rate is elevated. What now?

Having a playbook before the incident happens is the difference between a coordinated response and chaos. Here’s the playbook.

The First 5 Minutes

1. Acknowledge the Incident

Someone needs to own it. Right now.

🚨 INCIDENT: Payment processing failing
Commander: @alice
Status: Investigating
Channel: #incident-2024-03-12

Create a dedicated channel immediately. All incident communication goes there.
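If you use Slack, even channel creation can be scripted so nobody fumbles with the UI mid-incident. A minimal sketch using Slack's `conversations.create` method (the `SLACK_TOKEN` variable and the naming scheme are assumptions about your setup; the command is printed for review rather than executed):

```shell
# Build the incident channel name from today's date.
# The "incident-YYYY-MM-DD" scheme is a hypothetical convention.
INCIDENT_CHANNEL="incident-$(date +%Y-%m-%d)"

create_channel_cmd() {
  # Print the Slack API call instead of running it blindly.
  printf 'curl -s -X POST https://slack.com/api/conversations.create -H "Authorization: Bearer %s" -d "name=%s"\n' "$SLACK_TOKEN" "$1"
}

create_channel_cmd "$INCIDENT_CHANNEL"
```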

2. Assess Severity

SEV-1 (Critical):

  • Complete service outage
  • Data loss or corruption
  • Security breach
  • Revenue-impacting for all customers

SEV-2 (Major):

  • Partial outage
  • Degraded performance for many users
  • Key feature unavailable

SEV-3 (Minor):

  • Limited impact
  • Workaround available
  • Single customer affected

Severity determines who gets paged and what SLAs apply.
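That mapping is worth encoding somewhere executable rather than in tribal knowledge. A sketch of a severity-to-paging lookup (the team names are hypothetical placeholders, not a standard):

```shell
# Map severity to who gets paged. Team names are illustrative only.
page_targets() {
  case "$1" in
    SEV-1) echo "primary-oncall secondary-oncall eng-leadership" ;;
    SEV-2) echo "primary-oncall" ;;
    SEV-3) echo "ticket-queue" ;;  # no page, just a ticket
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

page_targets SEV-1
```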

3. Communicate Externally

Don’t wait until you have answers. Acknowledge the problem:

We're aware of issues affecting [service] and are actively investigating.
Updates to follow.

Silence is worse than “we don’t know yet.”

The Investigation Phase

Gather Context

# Recent deployments
git log --oneline --since="2 hours ago"

# Recent config changes
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Error spike timing (assumes each log line starts with a timestamp)
grep "ERROR" /var/log/app.log | awk '{print $1, $2}' | uniq -c | tail -20

Questions to answer quickly:

  • What changed? (deploys, config, traffic)
  • When did it start?
  • What’s the blast radius?
  • Is it getting worse?

Check the Usual Suspects

1. Recent deploys

# Kubernetes
kubectl rollout history deployment/myapp
kubectl rollout undo deployment/myapp  # If recent deploy looks suspicious

2. Dependencies

# Database
pg_isready -h db.example.com
psql -h db.example.com -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# External services
curl -s -o /dev/null -w "%{http_code}" https://api.stripe.com/v1/health

3. Resource exhaustion

# Memory
free -h
kubectl top pods --sort-by=memory

# Disk
df -h
kubectl exec -it mypod -- df -h

# Connections
ss -s
netstat -an | grep ESTABLISHED | wc -l

4. Traffic patterns

# Request rate
grep "$(date +%H:%M)" /var/log/nginx/access.log | wc -l

# Error rate
grep "$(date +%H:%M)" /var/log/nginx/error.log | wc -l
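Raw counts are easier to act on as a ratio. A quick sketch that turns the two counts above into an integer error percentage (good enough for a go/no-go signal):

```shell
# Integer error percentage from a request count and an error count.
error_rate() {
  requests=$1; errors=$2
  if [ "$requests" -eq 0 ]; then
    echo 0   # avoid divide-by-zero when there's no traffic
  else
    echo $(( errors * 100 / requests ))
  fi
}

error_rate 2000 60   # prints 3
```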

The Debugging Loop

  1. Hypothesize: “I think it’s X because Y”
  2. Test: Check logs, metrics, or run a quick test
  3. Confirm or eliminate: Move to next hypothesis
  4. Document: Write what you tried in the incident channel

Don’t go silent. Even “checking database connections now” keeps the team aligned.
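That running commentary is cheap to produce if you make it frictionless. A tiny hypothetical helper that timestamps notes into a local log the scribe (or a bot) can sweep into the incident channel (the log path is an assumption):

```shell
# Append a timestamped note to a local incident log.
# /tmp/incident-notes.log is a placeholder path.
note() {
  printf '%s %s\n' "$(date +%H:%M)" "$*" >> /tmp/incident-notes.log
}

note "checking database connections now"
tail -1 /tmp/incident-notes.log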

Roles During an Incident

Incident Commander (IC)

  • Owns the incident end-to-end
  • Coordinates responders
  • Makes decisions when there’s disagreement
  • Keeps timeline and status updated
  • Decides when to escalate

The IC doesn’t need to be the most senior person. They need to stay calm and organized.

Technical Lead

  • Drives the investigation
  • Assigns debugging tasks
  • Proposes and implements fixes
  • Flags when they need help

Communications Lead

  • Updates status page
  • Drafts customer communications
  • Handles incoming support tickets
  • Shields the team from external noise

Scribe

  • Documents everything in real-time
  • Captures timeline of events
  • Notes hypotheses tested
  • Records what worked and didn’t

This role is often forgotten and always valuable.

Communication Templates

Internal Updates (Every 15 minutes)

🕐 10:45 AM Update
Status: Investigating
Impact: ~30% of payments failing
Current hypothesis: Database connection pool exhausted
Next steps: Analyzing slow queries, increasing pool size
ETA: Unknown

External Updates

Initial acknowledgment:

We're experiencing issues with [service]. Our team is investigating.
We'll provide updates every 30 minutes.

Progress update:

Update: We've identified the issue affecting [service] and are
implementing a fix. Some users may still experience [symptom]
until the fix rolls out.

Resolution:

Resolved: The issue affecting [service] has been fixed. All systems
are operating normally. We apologize for any inconvenience. A detailed
post-mortem will be shared within 48 hours.

Making the Call: Rollback vs Fix Forward

Rollback When:

  • Recent deploy is the obvious cause
  • Rollback is fast and safe
  • You don’t understand the root cause yet
  • Impact is high and continuing
# Kubernetes rollback
kubectl rollout undo deployment/myapp

# Feature flag disable
curl -X POST https://flagservice/api/flags/new-payment-flow/disable

Fix Forward When:

  • The fix is simple and well-understood
  • Rollback would cause data issues
  • The bug existed before the deploy
  • You’ve tested the fix

The default should be rollback. Fix forward requires higher confidence.
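Whichever way you go, verify the change actually landed before declaring victory. A sketch of a rollback-and-verify step (the deployment name is a placeholder, and the commands are printed for review rather than executed; the 120s timeout is an assumption to tune):

```shell
# Print the rollback-and-verify commands for a given deployment.
rollback_and_verify() {
  echo "kubectl rollout undo deployment/$1"
  echo "kubectl rollout status deployment/$1 --timeout=120s"
}

rollback_and_verify myapp
```

Drop the `echo`s to run the commands for real; `rollout status` blocks until the rollback is complete or the timeout hits, which is your signal to escalate.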

After the Incident

Immediate (Within 1 hour)

  1. Confirm service is stable
  2. Final customer communication
  3. Hand off to next on-call if needed
  4. Quick debrief: “What happened? What did we do?”

Short-term (Within 48 hours)

Blameless Post-mortem:

## Incident: Payment Processing Outage
**Date:** 2024-03-12
**Duration:** 47 minutes
**Severity:** SEV-1
**Author:** @alice

### Summary
Payment processing failed for 30% of requests due to database 
connection pool exhaustion.

### Timeline
- 10:02 - Alerts fire for increased payment errors
- 10:05 - IC declared, investigation started
- 10:12 - Identified connection pool exhaustion
- 10:18 - Increased pool size, deployed
- 10:25 - Error rate returning to normal
- 10:49 - Incident resolved

### Root Cause
A slow query in the new recommendation service held connections 
longer than expected. Combined with a 20% traffic increase, 
this exhausted the connection pool.

### What Went Well
- Fast detection (3 minutes to alert)
- Clear ownership
- Good communication

### What Could Be Improved
- No alert on connection pool utilization
- Recommendation service not load tested
- Runbook for "database connections" outdated

### Action Items
- [ ] Add connection pool utilization alert (@bob, 3 days)
- [ ] Load test recommendation service (@carol, 1 week)
- [ ] Update database runbook (@dave, 3 days)

Long-term

Track action items to completion. Schedule a review if the same category of incident happens again.

The Runbook Library

Every alert should link to a runbook. Minimum contents:

## Alert: High Error Rate

### What This Means
More than 1% of requests are returning 5xx errors.

### Likely Causes
1. Recent deployment (check: `kubectl rollout history`)
2. Downstream dependency failure (check: dashboard link)
3. Database issues (check: `pg_stat_activity`)
4. Resource exhaustion (check: `kubectl top pods`)

### Immediate Actions
1. Check recent deployments
2. Check error logs: `kubectl logs -l app=myapp --tail=100`
3. If recent deploy, consider rollback

### Escalation
If not resolved in 15 minutes, page @database-team

### Related
- Dashboard: [link]
- Playbook: [link]
- Last incident: INC-1234

Tools of the Trade

  • Incident management: PagerDuty, Opsgenie, Incident.io
  • Communication: Slack/Teams with dedicated channels
  • Status page: Statuspage.io, Cachet, Instatus
  • Timeline: Datadog, Grafana with annotations
  • Post-mortems: Notion, Confluence, or plain markdown in git

The Meta-Lesson

The goal isn’t to prevent all incidents. It’s to:

  1. Detect fast (minutes, not hours)
  2. Respond systematically (not heroically)
  3. Communicate clearly (internally and externally)
  4. Learn effectively (fix the system, not the blame)

Practice incident response before you need it. Run game days. Review past incidents. Build the muscle memory.


The best incident response is the one you’ve practiced so many times it’s boring. When real incidents feel routine, you’re doing it right.