When production is on fire, you need a process—not panic. A good incident response framework gets you from “everything’s broken” to “everything’s fixed” with minimal chaos.

Incident Lifecycle

Detection → Triage → Response → Resolution → Postmortem

Each phase has specific goals and actions.

Phase 1: Detection

How you find out something’s wrong:

Automated:

  • Monitoring alerts
  • Health check failures
  • Error rate spikes
  • Customer-facing synthetic tests

Manual:

  • Customer reports
  • Internal user reports
  • Social media mentions

Detection Best Practices

```yaml
# Alert should provide immediate context
- alert: CheckoutErrorRate
  expr: checkout_error_rate > 0.01  # threshold illustrative; match your SLO
  for: 5m
  annotations:
    summary: "Checkout errors at {{ $value }}%"
    dashboard: "https://grafana.internal/d/checkout"
    runbook: "https://wiki.internal/runbooks/checkout-errors"
    slack_channel: "#incidents-payments"
```

Don’t make responders hunt for information.

Phase 2: Triage

Quickly assess severity and ownership.

Severity Levels

| Level | Impact | Response Time | Example |
|-------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | Immediate, all hands | Site down, database corruption |
| SEV2 | Major feature broken | 15 minutes | Checkout broken, login failing |
| SEV3 | Degraded performance | 1 hour | Slow responses, minor feature broken |
| SEV4 | Minor issue | Next business day | Cosmetic bug, non-critical error |
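In practice, the response-time column is often encoded in tooling so paging and cadence follow automatically from the declared level. A minimal sketch of that lookup (the `response_time` helper name and its strings are illustrative, not an established tool):

```bash
#!/usr/bin/env bash
# Map a declared severity level to its target response time.
# The strings mirror the table above; adapt them to your own policy.
response_time() {
  case "$1" in
    SEV1) echo "immediate" ;;
    SEV2) echo "15 minutes" ;;
    SEV3) echo "1 hour" ;;
    SEV4) echo "next business day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_time SEV2   # prints "15 minutes"
```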

Triage Questions

  1. What’s broken? (Symptom)
  2. Who’s affected? (Scope)
  3. What’s the business impact? (Severity)
  4. When did it start? (Timeline)
  5. Did anything change recently? (Cause hypothesis)
```markdown
# Quick triage checklist
- [ ] What service/feature is affected?
- [ ] How many users impacted?
- [ ] Is it getting worse or stable?
- [ ] Recent deployments in last 2 hours?
- [ ] Any infrastructure changes?
```
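The "recent deployments" check is just a time-window comparison; a minimal sketch of the logic (`recent_deploy` is a hypothetical helper working on epoch seconds — in reality the timestamps would come from `kubectl` or your CI system):

```bash
#!/usr/bin/env bash
# Did a deployment land inside the triage window (last 2 hours)?
# Arguments are epoch seconds: deployment time, current time.
recent_deploy() {
  local deploy_ts=$1 now=$2 window=$((2 * 3600))
  if (( now - deploy_ts <= window )); then
    echo "recent"
  else
    echo "old"
  fi
}

recent_deploy 1000 5000   # 4000s ago → prints "recent"
```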

Phase 3: Response

Declare the Incident

For SEV1/SEV2, formally declare:

```
🚨 INCIDENT DECLARED: SEV1
Service: Checkout
Impact: Customers unable to complete purchases
Started: 14:32 UTC
Status: Investigating
IC: @oncall-engineer
Channel: #incident-checkout
```

Roles

Incident Commander (IC):

  • Coordinates response
  • Makes decisions
  • Doesn’t debug (delegates)
  • Manages communication

Technical Lead:

  • Leads debugging
  • Proposes fixes
  • Implements solutions

Communications Lead:

  • Updates status page
  • Notifies stakeholders
  • Handles customer comms

For small teams, one person might fill multiple roles. For major incidents, separate them.

Communication Cadence

```
SEV1: Update every 15 minutes
SEV2: Update every 30 minutes
SEV3: Update every 1 hour

Update even when you don't know anything new yet.
```

Status Page Updates

```markdown
# 14:45 UTC - Investigating
We're investigating issues with the checkout process.
Some customers may experience errors when completing purchases.

# 15:00 UTC - Identified
We've identified the issue as a database connection problem.
Our team is working on a fix.

# 15:20 UTC - Monitoring
A fix has been deployed. We're monitoring to confirm resolution.

# 15:45 UTC - Resolved
The checkout issue has been resolved. All systems operational.
```

Common Response Actions

Rollback

Often the fastest fix:

```bash
# Kubernetes
kubectl rollout undo deployment/checkout-service

# Docker Compose (pin the previous image tag first, then re-pull)
docker-compose pull
docker-compose up -d

# Feature flags
curl -X POST "https://flagsmith.internal/api/flags/new-checkout/off"
```
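Under pressure it's easy to run the wrong rollback; some teams wrap these commands in a dry-run guard so the exact command can be reviewed before it touches the cluster. A minimal sketch (the `rollback` helper and the `RUN=1` convention are assumptions, not a standard tool):

```bash
#!/usr/bin/env bash
# Print the rollback command unless RUN=1, so a responder
# (or a second pair of eyes) can review it first.
rollback() {
  local deploy=$1
  local cmd="kubectl rollout undo deployment/$deploy"
  if [ "${RUN:-0}" = "1" ]; then
    $cmd
  else
    echo "DRY RUN: $cmd"
  fi
}

rollback checkout-service   # prints "DRY RUN: kubectl rollout undo deployment/checkout-service"
```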

Scale Up

If it’s a capacity issue:

```bash
# Kubernetes
kubectl scale deployment/api --replicas=10

# AWS Auto Scaling
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name api-asg \
  --desired-capacity 10
```
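Before picking a replica count, a quick ceiling-division estimate helps: replicas = ceil(current load / per-replica capacity). A sketch with assumed numbers (the per-replica capacity figure would come from your own load tests):

```bash
#!/usr/bin/env bash
# replicas = ceil(current requests/sec / per-replica capacity),
# via integer ceiling division.
replicas_needed() {
  local rps=$1 per_replica=$2
  echo $(( (rps + per_replica - 1) / per_replica ))
}

replicas_needed 950 100   # prints 10
```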

Failover

Switch to backup systems:

```bash
# Database failover
aws rds failover-db-cluster --db-cluster-identifier prod-cluster

# DNS failover
aws route53 change-resource-record-sets \
  --hosted-zone-id ZONE_ID \
  --change-batch file://failover.json
```

Isolate

Stop the bleeding:

```bash
# Block problematic traffic
iptables -A INPUT -s 192.168.1.100 -j DROP

# Disable problematic feature
redis-cli SET feature:new_checkout:enabled false

# Rate limit
curl -X PUT "https://api-gateway/routes/checkout/rate-limit" \
  -d '{"requests_per_second": 10}'
```

Phase 4: Resolution

Confirm the Fix

Don’t declare victory too early:

```markdown
# Monitor for 15-30 minutes after fix
- Error rate returned to baseline?
- Latency normal?
- No new alerts?
- Customer reports stopped?
```
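The "back to baseline" check can be made mechanical instead of eyeballed; a minimal sketch (the 20% tolerance and the units are assumptions — feed it your real metrics):

```bash
#!/usr/bin/env bash
# Is the current error rate back within 20% of baseline?
# Returns success (0) only if current <= baseline * 1.2.
back_to_baseline() {
  local current=$1 baseline=$2
  awk -v c="$current" -v b="$baseline" 'BEGIN { exit !(c <= b * 1.2) }'
}

if back_to_baseline 5 4; then echo "confirmed"; else echo "keep monitoring"; fi
# prints "keep monitoring" (5 > 4 * 1.2)
```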

Document Timeline

While it’s fresh:

```markdown
## Timeline
- 14:32 - First alert fired (checkout error rate)
- 14:35 - On-call acknowledged, began investigation
- 14:42 - Identified recent deployment as potential cause
- 14:48 - Rollback initiated
- 14:52 - Rollback complete
- 15:05 - Error rate returned to normal
- 15:20 - Incident resolved
```

Close the Incident

```
✅ INCIDENT RESOLVED
Duration: 48 minutes
Root cause: Database connection pool exhaustion after deployment
Resolution: Rolled back checkout service
Postmortem: Scheduled [link]
```

Phase 5: Postmortem

Blameless Culture

The goal is learning, not punishment. Focus on systems, not individuals.

Bad: “John deployed broken code”
Good: “Our deployment pipeline lacked sufficient validation”

Postmortem Template

```markdown
# Incident Postmortem: Checkout Outage 2026-03-04

## Summary
48-minute outage affecting all checkout attempts.

## Impact
- 1,247 failed checkout attempts
- Estimated $15,000 lost revenue
- 47 customer support tickets

## Timeline
[Detailed timeline from detection to resolution]

## Root Cause
Connection pool sized for 100 connections. New deployment
increased queries-per-request from 3 to 8, exhausting pool.

## What Went Well
- Alert fired within 2 minutes
- Quick identification of recent deployment
- Rollback was smooth

## What Went Poorly
- No load testing for new code path
- Connection pool metrics not monitored
- Took 10 minutes to decide to rollback

## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| Add connection pool metrics to dashboard | Alice | 2026-03-07 |
| Add load test for checkout flow | Bob | 2026-03-14 |
| Create rollback decision tree | Carol | 2026-03-10 |

## Lessons Learned
- Database connection changes need load testing
- Decision to rollback should be faster (< 5 min)
```

Follow Through

Track action items to completion. Postmortems without follow-through are theater.
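Tracking can be as simple as a script over the action-item table; a minimal sketch (the pipe-separated input format and `overdue_items` helper are assumptions for illustration):

```bash
#!/usr/bin/env bash
# Flag action items past their due date. Input lines on stdin:
# "<due-date>|<owner>|<action>", dates in YYYY-MM-DD (sorts lexically).
overdue_items() {
  local today=$1
  while IFS='|' read -r due owner action; do
    if [[ "$due" < "$today" ]]; then
      echo "OVERDUE: $action ($owner, due $due)"
    fi
  done
}

printf '%s\n' \
  '2026-03-07|Alice|Add connection pool metrics' \
  '2026-03-14|Bob|Add load test' |
  overdue_items 2026-03-10
# → OVERDUE: Add connection pool metrics (Alice, due 2026-03-07)
```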

Incident Tools

Communication

  • Slack/Teams: Real-time coordination
  • PagerDuty/Opsgenie: Alerting and escalation
  • Status page: Customer communication

Coordination

  • Incident.io, FireHydrant: Incident management
  • Zoom/Meet: War room calls
  • Shared doc: Timeline and notes

Investigation

  • Grafana/Datadog: Metrics and dashboards
  • Kibana/Loki: Log analysis
  • Jaeger/Honeycomb: Distributed tracing

On-Call Essentials

Runbooks

Every alert needs a runbook:

```markdown
# Runbook: High Error Rate - Checkout Service

## Alert
checkout_error_rate > 1% for 5 minutes

## Likely Causes
1. Database connection issues
2. Payment provider outage
3. Recent deployment bug

## Investigation Steps
1. Check error logs: `kubectl logs -l app=checkout --tail=100`
2. Check database: `psql -c "SELECT count(*) FROM pg_stat_activity"`
3. Check payment provider status: https://status.stripe.com

## Common Fixes
- If database: Restart connection pooler
- If payment provider: Enable fallback provider
- If deployment: Rollback

## Escalation
If unresolved after 15 minutes, page @payments-team
```

Handoffs

```markdown
# On-Call Handoff: 2026-03-04

## Active Issues
- None

## Recently Resolved
- SEV2 checkout outage (see postmortem)

## Watch Items
- New checkout deployment scheduled for tomorrow
- Database maintenance window Friday 2am

## Notes
- Runbook for payment errors updated
- Bob is on PTO, escalate to Carol instead
```

The goal isn’t zero incidents—it’s fast detection, efficient response, and continuous improvement. Every incident is a learning opportunity.