How to handle production incidents effectively. Covers severity levels, communication, resolution, and learning from failures.
March 4, 2026 · 9 min · 1736 words · Rob Washington
When production is on fire, you need a process—not panic. A good incident response framework gets you from “everything’s broken” to “everything’s fixed” with minimal chaos.
# Alert should provide immediate context
- alert: CheckoutErrorRate
  annotations:
    summary: "Checkout errors at {{ $value }}%"
    dashboard: "https://grafana.internal/d/checkout"
    runbook: "https://wiki.internal/runbooks/checkout-errors"
    slack_channel: "#incidents-payments"
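
One way to keep alerts this useful is to check that every rule carries the same context before it ships. The sketch below is a minimal example, assuming the rule has already been parsed into a dict; the function name `missing_annotations` and the list of required keys are hypothetical, taken from the annotations in the alert above rather than from any tool's API.

```python
# Hypothetical policy: every alert must carry these annotation keys.
REQUIRED_ANNOTATIONS = ("summary", "dashboard", "runbook", "slack_channel")


def missing_annotations(rule: dict) -> list[str]:
    """Return the required annotation keys that are absent or empty in a rule."""
    annotations = rule.get("annotations") or {}
    return [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]


# The checkout rule, written as the dict a YAML parser would produce.
checkout_rule = {
    "alert": "CheckoutErrorRate",
    "annotations": {
        "summary": "Checkout errors at {{ $value }}%",
        "dashboard": "https://grafana.internal/d/checkout",
        "runbook": "https://wiki.internal/runbooks/checkout-errors",
        "slack_channel": "#incidents-payments",
    },
}

# A made-up rule with no dashboard, runbook, or channel, to show a failure.
bare_rule = {"alert": "DiskFull", "annotations": {"summary": "Disk is full"}}

for rule in (checkout_rule, bare_rule):
    print(rule["alert"], missing_annotations(rule))
```

A check like this could run in CI so that an alert without a runbook never reaches the on-call engineer.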
# Quick triage checklist
- [ ] What service/feature is affected?
- [ ] How many users are impacted?
- [ ] Is it getting worse or stable?
- [ ] Recent deployments in the last 2 hours?
- [ ] Any infrastructure changes?
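
The checklist answers can be captured in a small record so they turn into a severity call and a list of first suspects. This is a rough sketch only: the `Triage` class, the `SEV1`/`SEV2`/`SEV3` labels, and the 10% and 50% thresholds are all assumptions for illustration, so substitute your own severity definitions.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    """Answers to the five checklist questions."""
    affected: str               # service or feature that is affected
    users_impacted_pct: float   # rough share of users impacted, 0-100
    getting_worse: bool         # True if the trend is still degrading
    recent_deploy: bool         # any deployment in the last 2 hours
    infra_change: bool          # any infrastructure change


def suggest_severity(t: Triage) -> str:
    """Map triage answers to a severity label using hypothetical thresholds."""
    if t.users_impacted_pct >= 50 or (t.users_impacted_pct >= 10 and t.getting_worse):
        return "SEV1"
    if t.users_impacted_pct >= 10 or t.getting_worse:
        return "SEV2"
    return "SEV3"


def first_suspects(t: Triage) -> list[str]:
    """List recent changes worth checking first, in checklist order."""
    suspects = []
    if t.recent_deploy:
        suspects.append("recent deployment")
    if t.infra_change:
        suspects.append("infrastructure change")
    return suspects


# Example: every checkout attempt failing shortly after a deployment.
checkout = Triage("checkout", users_impacted_pct=100, getting_worse=False,
                  recent_deploy=True, infra_change=False)
print(suggest_severity(checkout), first_suspects(checkout))
```

The point is less the exact thresholds than having them written down before the incident, so nobody debates severity while the site is down.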
# 14:45 UTC - Investigating
We're investigating issues with the checkout process.
Some customers may experience errors when completing purchases.
# 15:00 UTC - Identified
We've identified the issue as a database connection problem.
Our team is working on a fix.
# 15:20 UTC - Monitoring
A fix has been deployed. We're monitoring to confirm resolution.
# 15:45 UTC - Resolved
The checkout issue has been resolved. All systems operational.
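
Status updates like these are easier to send under pressure when the format is generated rather than typed. Here is a minimal sketch that renders one update in the same shape as the examples above; `format_update` and `PHASES` are hypothetical names, not part of any status page tool.

```python
from datetime import datetime, timezone

# The four phases used in the updates above, in the order they occur.
PHASES = ("Investigating", "Identified", "Monitoring", "Resolved")


def format_update(when: datetime, phase: str, message: str) -> str:
    """Render one status update as a '# HH:MM UTC - Phase' header plus body."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase!r}")
    # Convert to UTC so every update uses the same clock.
    stamp = when.astimezone(timezone.utc).strftime("%H:%M UTC")
    return f"# {stamp} - {phase}\n{message}"


update = format_update(
    datetime(2026, 3, 4, 14, 45, tzinfo=timezone.utc),
    "Investigating",
    "We're investigating issues with the checkout process.",
)
print(update)
```

Rejecting unknown phases keeps the public timeline consistent, and converting to UTC avoids mixed time zones when responders are spread across regions.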
# Incident Postmortem: Checkout Outage 2026-03-04
## Summary
48-minute outage affecting all checkout attempts.
## Impact
- 1,247 failed checkout attempts
- Estimated $15,000 lost revenue
- 47 customer support tickets
## Timeline
[Detailed timeline from detection to resolution]
## Root Cause
The connection pool was sized for 100 connections. A new deployment
increased queries per request from 3 to 8, exhausting the pool.
## What Went Well
- Alert fired within 2 minutes
- Quick identification of recent deployment
- Rollback was smooth
## What Went Poorly
- No load testing for new code path
- Connection pool metrics not monitored
- Took 10 minutes to decide to roll back
## Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| Add connection pool metrics to dashboard | Alice | 2026-03-07 |
| Add load test for checkout flow | Bob | 2026-03-14 |
| Create rollback decision tree | Carol | 2026-03-10 |
## Lessons Learned
- Changes that increase database query volume need load testing
- The decision to roll back should be faster (<5 min)
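
The root cause in the postmortem above comes down to simple arithmetic, and it can help to see it worked through. The sketch below estimates average connections in use with Little's law (arrival rate times time held). The pool size of 100 and the jump from 3 to 8 queries per request come from the postmortem; the traffic of 300 requests per second and the 50 ms query time are invented for illustration, and the model assumes queries run one after another and each holds a connection only while it runs.

```python
POOL_SIZE = 100  # from the root cause in the postmortem


def connections_needed(requests_per_sec: float, queries_per_request: int,
                       query_seconds: float) -> float:
    """Average connections in use, by Little's law: arrival rate x time held."""
    return requests_per_sec * queries_per_request * query_seconds


# Hypothetical traffic figures, chosen only to illustrate the effect.
RPS = 300
QUERY_SECONDS = 0.05

for qpr in (3, 8):
    need = connections_needed(RPS, qpr, QUERY_SECONDS)
    status = "exhausted" if need > POOL_SIZE else "ok"
    print(f"{qpr} queries/request -> ~{need:.0f} connections ({status})")
```

Under these assumed numbers the same traffic fits comfortably at 3 queries per request and overruns the pool at 8, which is why a load test on the new code path would likely have caught the problem before release.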