Something is broken in production. Customers are complaining. Your heart rate is elevated. What now?

Having a playbook before the incident happens is the difference between a coordinated response and chaos. Here’s the playbook.

The First 5 Minutes

1. Acknowledge the Incident

Someone needs to own it. Right now.

🚨 INCIDENT: Payment processing failing
Commander: @alice
Status: Investigating
Channel: #incident-2024-03-12

Create a dedicated channel immediately. All incident communication goes there.
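If you use Slack, even channel creation can be scripted so nobody fumbles with the UI mid-incident. A minimal sketch using Slack's `conversations.create` method (the `SLACK_TOKEN` variable and the naming scheme are assumptions about your setup; the command is printed for review rather than executed):

```shell
# Build the incident channel name from today's date.
# The "incident-YYYY-MM-DD" scheme is a hypothetical convention.
INCIDENT_CHANNEL="incident-$(date +%Y-%m-%d)"

create_channel_cmd() {
  # Print the Slack API call instead of running it blindly.
  printf 'curl -s -X POST https://slack.com/api/conversations.create -H "Authorization: Bearer %s" -d "name=%s"\n' "$SLACK_TOKEN" "$1"
}

create_channel_cmd "$INCIDENT_CHANNEL"
```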

2. Assess Severity

SEV-1 (Critical):

  • Complete service outage
  • Data loss or corruption
  • Security breach
  • Revenue-impacting for all customers

SEV-2 (Major):

  • Partial outage
  • Degraded performance for many users
  • Key feature unavailable

SEV-3 (Minor):

  • Limited impact
  • Workaround available
  • Single customer affected

Severity determines who gets paged and what SLAs apply.
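That mapping is worth encoding somewhere executable rather than in tribal knowledge. A sketch of a severity-to-paging lookup (the team names are hypothetical placeholders, not a standard):

```shell
# Map severity to who gets paged. Team names are illustrative only.
page_targets() {
  case "$1" in
    SEV-1) echo "primary-oncall secondary-oncall eng-leadership" ;;
    SEV-2) echo "primary-oncall" ;;
    SEV-3) echo "ticket-queue" ;;  # no page, just a ticket
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

page_targets SEV-1
```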

3. Communicate Externally

Don’t wait until you have answers. Acknowledge the problem:

We're aware of issues affecting [service] and are actively investigating.
Updates to follow.

Silence is worse than “we don’t know yet.”

The Investigation Phase

Gather Context

# Recent deployments
git log --oneline --since="2 hours ago"

# Recent config changes
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Error spike timing (assumes each log line starts with a timestamp)
grep "ERROR" /var/log/app.log | awk '{print $1, $2}' | uniq -c | tail -20

Questions to answer quickly:

  • What changed? (deploys, config, traffic)
  • When did it start?
  • What’s the blast radius?
  • Is it getting worse?

Check the Usual Suspects

1. Recent deploys

# Kubernetes
kubectl rollout history deployment/myapp
kubectl rollout undo deployment/myapp  # If recent deploy looks suspicious

2. Dependencies

# Database
pg_isready -h db.example.com
psql -h db.example.com -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

# External services
curl -s -o /dev/null -w "%{http_code}" https://api.stripe.com/v1/health

3. Resource exhaustion

# Memory
free -h
kubectl top pods --sort-by=memory

# Disk
df -h
kubectl exec -it mypod -- df -h

# Connections
ss -s
netstat -an | grep ESTABLISHED | wc -l

4. Traffic patterns

# Request rate
grep "$(date +%H:%M)" /var/log/nginx/access.log | wc -l

# Error rate
grep "$(date +%H:%M)" /var/log/nginx/error.log | wc -l
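Raw counts are easier to act on as a ratio. A quick sketch that turns the two counts above into an integer error percentage (good enough for a go/no-go signal):

```shell
# Integer error percentage from a request count and an error count.
error_rate() {
  requests=$1; errors=$2
  if [ "$requests" -eq 0 ]; then
    echo 0   # avoid divide-by-zero when there's no traffic
  else
    echo $(( errors * 100 / requests ))
  fi
}

error_rate 2000 60   # prints 3
```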

The Debugging Loop

  1. Hypothesize: “I think it’s X because Y”
  2. Test: Check logs, metrics, or run a quick test
  3. Confirm or eliminate: Move to next hypothesis
  4. Document: Write what you tried in the incident channel

Don’t go silent. Even “checking database connections now” keeps the team aligned.
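That running commentary is cheap to produce if you make it frictionless. A tiny hypothetical helper that timestamps notes into a local log the scribe (or a bot) can sweep into the incident channel (the log path is an assumption):

```shell
# Append a timestamped note to a local incident log.
# /tmp/incident-notes.log is a placeholder path.
note() {
  printf '%s %s\n' "$(date +%H:%M)" "$*" >> /tmp/incident-notes.log
}

note "checking database connections now"
tail -1 /tmp/incident-notes.log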

Roles During an Incident

Incident Commander (IC)

  • Owns the incident end-to-end
  • Coordinates responders
  • Makes decisions when there’s disagreement
  • Keeps timeline and status updated
  • Decides when to escalate

The IC doesn’t need to be the most senior person. They need to stay calm and organized.

Technical Lead

  • Drives the investigation
  • Assigns debugging tasks
  • Proposes and implements fixes
  • Flags when they need help

Communications Lead

  • Updates status page
  • Drafts customer communications
  • Handles incoming support tickets
  • Shields the team from external noise

Scribe

  • Documents everything in real-time
  • Captures timeline of events
  • Notes hypotheses tested
  • Records what worked and didn’t

This role is often forgotten and always valuable.

Communication Templates

Internal Updates (Every 15 minutes)

🕐 10:45 AM Update
Status: Investigating
Impact: ~30% of payments failing
Current hypothesis: Database connection pool exhausted
Next steps: Analyzing slow queries, increasing pool size
ETA: Unknown

External Updates

Initial acknowledgment:

We're experiencing issues with [service]. Our team is investigating.
We'll provide updates every 30 minutes.

Progress update:

Update: We've identified the issue affecting [service] and are
implementing a fix. Some users may still experience [symptom]
until the fix rolls out.

Resolution:

Resolved: The issue affecting [service] has been fixed. All systems
are operating normally. We apologize for any inconvenience. A detailed
post-mortem will be shared within 48 hours.

Making the Call: Rollback vs Fix Forward

Rollback When:

  • Recent deploy is the obvious cause
  • Rollback is fast and safe
  • You don’t understand the root cause yet
  • Impact is high and continuing
# Kubernetes rollback
kubectl rollout undo deployment/myapp

# Feature flag disable
curl -X POST https://flagservice/api/flags/new-payment-flow/disable

Fix Forward When:

  • The fix is simple and well-understood
  • Rollback would cause data issues
  • The bug existed before the deploy
  • You’ve tested the fix

The default should be rollback. Fix forward requires higher confidence.
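Whichever way you go, verify the change actually landed before declaring victory. A sketch of a rollback-and-verify step (the deployment name is a placeholder, and the commands are printed for review rather than executed; the 120s timeout is an assumption to tune):

```shell
# Print the rollback-and-verify commands for a given deployment.
rollback_and_verify() {
  echo "kubectl rollout undo deployment/$1"
  echo "kubectl rollout status deployment/$1 --timeout=120s"
}

rollback_and_verify myapp
```

Drop the `echo`s to run the commands for real; `rollout status` blocks until the rollback is complete or the timeout hits, which is your signal to escalate.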

After the Incident

Immediate (Within 1 hour)

  1. Confirm service is stable
  2. Final customer communication
  3. Hand off to next on-call if needed
  4. Quick debrief: “What happened? What did we do?”

Short-term (Within 48 hours)

Blameless Post-mortem:

## Incident: Payment Processing Outage
**Date:** 2024-03-12
**Duration:** 47 minutes
**Severity:** SEV-1
**Author:** @alice

### Summary
Payment processing failed for 30% of requests due to database 
connection pool exhaustion.

### Timeline
- 10:02 - Alerts fire for increased payment errors
- 10:05 - IC declared, investigation started
- 10:12 - Identified connection pool exhaustion
- 10:18 - Increased pool size, deployed
- 10:25 - Error rate returning to normal
- 10:49 - Incident resolved

### Root Cause
A slow query in the new recommendation service held connections 
longer than expected. Combined with a 20% traffic increase, 
this exhausted the connection pool.

### What Went Well
- Fast detection (3 minutes to alert)
- Clear ownership
- Good communication

### What Could Be Improved
- No alert on connection pool utilization
- Recommendation service not load tested
- Runbook for "database connections" outdated

### Action Items
- [ ] Add connection pool utilization alert (@bob, 3 days)
- [ ] Load test recommendation service (@carol, 1 week)
- [ ] Update database runbook (@dave, 3 days)

Long-term

Track action items to completion. Schedule a review if the same category of incident happens again.

The Runbook Library

Every alert should link to a runbook. Minimum contents:

## Alert: High Error Rate

### What This Means
More than 1% of requests are returning 5xx errors.

### Likely Causes
1. Recent deployment (check: `kubectl rollout history`)
2. Downstream dependency failure (check: dashboard link)
3. Database issues (check: `pg_stat_activity`)
4. Resource exhaustion (check: `kubectl top pods`)

### Immediate Actions
1. Check recent deployments
2. Check error logs: `kubectl logs -l app=myapp --tail=100`
3. If recent deploy, consider rollback

### Escalation
If not resolved in 15 minutes, page @database-team

### Related
- Dashboard: [link]
- Playbook: [link]
- Last incident: INC-1234

Tools of the Trade

  • Incident management: PagerDuty, Opsgenie, Incident.io
  • Communication: Slack/Teams with dedicated channels
  • Status page: Statuspage.io, Cachet, Instatus
  • Timeline: Datadog, Grafana with annotations
  • Post-mortems: Notion, Confluence, or plain markdown in git

The Meta-Lesson

The goal isn’t to prevent all incidents. It’s to:

  1. Detect fast (minutes, not hours)
  2. Respond systematically (not heroically)
  3. Communicate clearly (internally and externally)
  4. Learn effectively (fix the system, not the blame)

Practice incident response before you need it. Run game days. Review past incidents. Build the muscle memory.


The best incident response is the one you’ve practiced so many times it’s boring. When real incidents feel routine, you’re doing it right.