When production is on fire, you need a process—not panic. A good incident response framework gets you from “everything’s broken” to “everything’s fixed” with minimal chaos.

Incident Lifecycle

Detection → Triage → Response → Resolution → Postmortem

Each phase has specific goals and actions.

Phase 1: Detection

How you find out something’s wrong:

Automated:

  • Monitoring alerts
  • Health check failures
  • Error rate spikes
  • Customer-facing synthetic tests

Manual:

  • Customer reports
  • Internal user reports
  • Social media mentions

Detection Best Practices

```yaml
# Alert should provide immediate context
- alert: CheckoutErrorRate
  expr: checkout_error_rate > 0.01  # threshold illustrative; match your SLO
  for: 5m
  annotations:
    summary: "Checkout errors at {{ $value }}%"
    dashboard: "https://grafana.internal/d/checkout"
    runbook: "https://wiki.internal/runbooks/checkout-errors"
    slack_channel: "#incidents-payments"
```

Don’t make responders hunt for information.

Phase 2: Triage

Quickly assess severity and ownership.

Severity Levels

| Level | Impact | Response Time | Example |
|-------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | Immediate, all hands | Site down, database corruption |
| SEV2 | Major feature broken | 15 minutes | Checkout broken, login failing |
| SEV3 | Degraded performance | 1 hour | Slow responses, minor feature broken |
| SEV4 | Minor issue | Next business day | Cosmetic bug, non-critical error |
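In practice, the response-time column is often encoded in tooling so paging and cadence follow automatically from the declared level. A minimal sketch of that lookup (the `response_time` helper name and its strings are illustrative, not an established tool):

```bash
#!/usr/bin/env bash
# Map a declared severity level to its target response time.
# The strings mirror the table above; adapt them to your own policy.
response_time() {
  case "$1" in
    SEV1) echo "immediate" ;;
    SEV2) echo "15 minutes" ;;
    SEV3) echo "1 hour" ;;
    SEV4) echo "next business day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_time SEV2   # prints "15 minutes"
```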

Triage Questions

  1. What’s broken? (Symptom)
  2. Who’s affected? (Scope)
  3. What’s the business impact? (Severity)
  4. When did it start? (Timeline)
  5. Did anything change recently? (Cause hypothesis)
```markdown
# Quick triage checklist
- [ ] What service/feature is affected?
- [ ] How many users impacted?
- [ ] Is it getting worse or stable?
- [ ] Recent deployments in last 2 hours?
- [ ] Any infrastructure changes?
```
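The "recent deployments" check is just a time-window comparison; a minimal sketch of the logic (`recent_deploy` is a hypothetical helper working on epoch seconds — in reality the timestamps would come from `kubectl` or your CI system):

```bash
#!/usr/bin/env bash
# Did a deployment land inside the triage window (last 2 hours)?
# Arguments are epoch seconds: deployment time, current time.
recent_deploy() {
  local deploy_ts=$1 now=$2 window=$((2 * 3600))
  if (( now - deploy_ts <= window )); then
    echo "recent"
  else
    echo "old"
  fi
}

recent_deploy 1000 5000   # 4000s ago → prints "recent"
```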

Phase 3: Response

Declare the Incident

For SEV1/SEV2, formally declare:

```
🚨 INCIDENT DECLARED: SEV1
Service: Checkout
Impact: Customers unable to complete purchases
Started: 14:32 UTC
Status: Investigating
IC: @oncall-engineer
Channel: #incident-checkout
```

Roles

Incident Commander (IC):

  • Coordinates response
  • Makes decisions
  • Doesn’t debug (delegates)
  • Manages communication

Technical Lead:

  • Leads debugging
  • Proposes fixes
  • Implements solutions

Communications Lead:

  • Updates status page
  • Notifies stakeholders
  • Handles customer comms

For small teams, one person might fill multiple roles. For major incidents, separate them.

Communication Cadence

```
SEV1: Update every 15 minutes
SEV2: Update every 30 minutes
SEV3: Update every 1 hour

Update even when you don't know anything new yet.
```

Status Page Updates

```markdown
# 14:45 UTC - Investigating
We're investigating issues with the checkout process.
Some customers may experience errors when completing purchases.

# 15:00 UTC - Identified
We've identified the issue as a database connection problem.
Our team is working on a fix.

# 15:20 UTC - Monitoring
A fix has been deployed. We're monitoring to confirm resolution.

# 15:45 UTC - Resolved
The checkout issue has been resolved. All systems operational.
```

Common Response Actions

Rollback

Often the fastest fix:

```bash
# Kubernetes
kubectl rollout undo deployment/checkout-service

# Docker Compose (pin the previous image tag first, then re-pull)
docker-compose pull
docker-compose up -d

# Feature flags
curl -X POST "https://flagsmith.internal/api/flags/new-checkout/off"
```
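Under pressure it's easy to run the wrong rollback; some teams wrap these commands in a dry-run guard so the exact command can be reviewed before it touches the cluster. A minimal sketch (the `rollback` helper and the `RUN=1` convention are assumptions, not a standard tool):

```bash
#!/usr/bin/env bash
# Print the rollback command unless RUN=1, so a responder
# (or a second pair of eyes) can review it first.
rollback() {
  local deploy=$1
  local cmd="kubectl rollout undo deployment/$deploy"
  if [ "${RUN:-0}" = "1" ]; then
    $cmd
  else
    echo "DRY RUN: $cmd"
  fi
}

rollback checkout-service   # prints "DRY RUN: kubectl rollout undo deployment/checkout-service"
```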

Scale Up

If it’s a capacity issue:

```bash
# Kubernetes
kubectl scale deployment/api --replicas=10

# AWS Auto Scaling
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name api-asg \
  --desired-capacity 10
```
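Before picking a replica count, a quick ceiling-division estimate helps: replicas = ceil(current load / per-replica capacity). A sketch with assumed numbers (the per-replica capacity figure would come from your own load tests):

```bash
#!/usr/bin/env bash
# replicas = ceil(current requests/sec / per-replica capacity),
# via integer ceiling division.
replicas_needed() {
  local rps=$1 per_replica=$2
  echo $(( (rps + per_replica - 1) / per_replica ))
}

replicas_needed 950 100   # prints 10
```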

Failover

Switch to backup systems:

```bash
# Database failover
aws rds failover-db-cluster --db-cluster-identifier prod-cluster

# DNS failover
aws route53 change-resource-record-sets \
  --hosted-zone-id ZONE_ID \
  --change-batch file://failover.json
```

Isolate

Stop the bleeding:

```bash
# Block problematic traffic
iptables -A INPUT -s 192.168.1.100 -j DROP

# Disable problematic feature
redis-cli SET feature:new_checkout:enabled false

# Rate limit
curl -X PUT "https://api-gateway/routes/checkout/rate-limit" \
  -d '{"requests_per_second": 10}'
```

Phase 4: Resolution

Confirm the Fix

Don’t declare victory too early:

```markdown
# Monitor for 15-30 minutes after fix
- Error rate returned to baseline?
- Latency normal?
- No new alerts?
- Customer reports stopped?
```
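The "back to baseline" check can be made mechanical instead of eyeballed; a minimal sketch (the 20% tolerance and the units are assumptions — feed it your real metrics):

```bash
#!/usr/bin/env bash
# Is the current error rate back within 20% of baseline?
# Returns success (0) only if current <= baseline * 1.2.
back_to_baseline() {
  local current=$1 baseline=$2
  awk -v c="$current" -v b="$baseline" 'BEGIN { exit !(c <= b * 1.2) }'
}

if back_to_baseline 5 4; then echo "confirmed"; else echo "keep monitoring"; fi
# prints "keep monitoring" (5 > 4 * 1.2)
```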

Document Timeline

While it’s fresh:

```markdown
## Timeline
- 14:32 - First alert fired (checkout error rate)
- 14:35 - On-call acknowledged, began investigation
- 14:42 - Identified recent deployment as potential cause
- 14:48 - Rollback initiated
- 14:52 - Rollback complete
- 15:05 - Error rate returned to normal
- 15:20 - Incident resolved
```

Close the Incident

```
✅ INCIDENT RESOLVED
Duration: 48 minutes
Root cause: Database connection pool exhaustion after deployment
Resolution: Rolled back checkout service
Postmortem: Scheduled [link]
```

Phase 5: Postmortem

Blameless Culture

The goal is learning, not punishment. Focus on systems, not individuals.

Bad: “John deployed broken code”
Good: “Our deployment pipeline lacked sufficient validation”

Postmortem Template

```markdown
# Incident Postmortem: Checkout Outage 2026-03-04

## Summary
48-minute outage affecting all checkout attempts.

## Impact
- 1,247 failed checkout attempts
- Estimated $15,000 lost revenue
- 47 customer support tickets

## Timeline
[Detailed timeline from detection to resolution]

## Root Cause
Connection pool sized for 100 connections. New deployment
increased queries-per-request from 3 to 8, exhausting pool.

## What Went Well
- Alert fired within 2 minutes
- Quick identification of recent deployment
- Rollback was smooth

## What Went Poorly
- No load testing for new code path
- Connection pool metrics not monitored
- Took 10 minutes to decide to rollback

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| Add connection pool metrics to dashboard | Alice | 2026-03-07 |
| Add load test for checkout flow | Bob | 2026-03-14 |
| Create rollback decision tree | Carol | 2026-03-10 |

## Lessons Learned
- Database connection changes need load testing
- Decision to rollback should be faster (< 5 min)
```

Follow Through

Track action items to completion. Postmortems without follow-through are theater.
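Tracking can be as simple as a script over the action-item table; a minimal sketch (the pipe-separated input format and `overdue_items` helper are assumptions for illustration):

```bash
#!/usr/bin/env bash
# Flag action items past their due date. Input lines on stdin:
# "<due-date>|<owner>|<action>", dates in YYYY-MM-DD (sorts lexically).
overdue_items() {
  local today=$1
  while IFS='|' read -r due owner action; do
    if [[ "$due" < "$today" ]]; then
      echo "OVERDUE: $action ($owner, due $due)"
    fi
  done
}

printf '%s\n' \
  '2026-03-07|Alice|Add connection pool metrics' \
  '2026-03-14|Bob|Add load test' |
  overdue_items 2026-03-10
# → OVERDUE: Add connection pool metrics (Alice, due 2026-03-07)
```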

Incident Tools

Communication

  • Slack/Teams: Real-time coordination
  • PagerDuty/Opsgenie: Alerting and escalation
  • Status page: Customer communication

Coordination

  • Incident.io, FireHydrant: Incident management
  • Zoom/Meet: War room calls
  • Shared doc: Timeline and notes

Investigation

  • Grafana/Datadog: Metrics and dashboards
  • Kibana/Loki: Log analysis
  • Jaeger/Honeycomb: Distributed tracing

On-Call Essentials

Runbooks

Every alert needs a runbook:

```markdown
# Runbook: High Error Rate - Checkout Service

## Alert
checkout_error_rate > 1% for 5 minutes

## Likely Causes
1. Database connection issues
2. Payment provider outage
3. Recent deployment bug

## Investigation Steps
1. Check error logs: `kubectl logs -l app=checkout --tail=100`
2. Check database: `psql -c "SELECT count(*) FROM pg_stat_activity"`
3. Check payment provider status: https://status.stripe.com

## Common Fixes
- If database: Restart connection pooler
- If payment provider: Enable fallback provider
- If deployment: Rollback

## Escalation
If unresolved after 15 minutes, page @payments-team
```

Handoffs

```markdown
# On-Call Handoff: 2026-03-04

## Active Issues
- None

## Recently Resolved
- SEV2 checkout outage (see postmortem)

## Watch Items
- New checkout deployment scheduled for tomorrow
- Database maintenance window Friday 2am

## Notes
- Runbook for payment errors updated
- Bob is on PTO, escalate to Carol instead
```

The goal isn’t zero incidents—it’s fast detection, efficient response, and continuous improvement. Every incident is a learning opportunity.