The deployment window is a relic. Scheduled maintenance pages, late-night deploys, crossing fingers and hoping—none of this should exist in 2026. Your users shouldn’t know when you deploy. They shouldn’t care.
Zero-downtime deployment isn’t magic. It’s engineering discipline applied to a specific problem: how do you replace running code without dropping requests?
The Fundamental Challenge
During deployment, you have two versions of your application:
- Old version: Currently serving traffic
- New version: Ready to serve traffic
The challenge: transition from old to new without dropping connections or serving errors.
Strategy 1: Rolling Deployment
The simplest approach. Replace instances one at a time.
How it works:
- Start a new instance with v2
- Wait for health checks to pass
- Add v2 to load balancer
- Remove one v1 from load balancer
- Terminate v1 instance
- Repeat until all v1 replaced
Kubernetes example:
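A minimal Deployment sketch (the name, image tag, and probe path are placeholders): `maxSurge: 1` with `maxUnavailable: 0` gives exactly the one-at-a-time behavior described above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp              # placeholder name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod during rollout
      maxUnavailable: 0    # never dip below desired capacity
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:v2           # the version being rolled out
          readinessProbe:           # gate traffic on a real readiness check
            httpGet:
              path: /readyz
              port: 8080
```

Kubernetes only adds a new pod to the Service once its readiness probe passes, which is the "wait for health checks" step for free.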
Pros:
- Simple to implement
- Low resource overhead (only one extra instance at a time)
- Gradual rollout catches issues early
Cons:
- Both versions serve traffic simultaneously
- Database migrations must be backward compatible
- Rollback requires another rolling deployment
Strategy 2: Blue-Green Deployment
Run two identical environments. Switch traffic all at once.
Implementation with AWS ALB:
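A sketch with the AWS CLI (listener and target-group ARNs are placeholder environment variables): the listener's default action is the pointer, and changing it moves all traffic at once.

```shell
# Check that green is healthy before switching
aws elbv2 describe-target-health \
  --target-group-arn "$GREEN_TG_ARN" \
  --query 'TargetHealthDescriptions[].TargetHealth.State'

# Flip the listener from blue to green -- all traffic moves at once
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions Type=forward,TargetGroupArn="$GREEN_TG_ARN"
```

Rollback is the same `modify-listener` call pointed back at the blue target group.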
Pros:
- Instant rollback (just switch back)
- Clean separation between versions
- Easy to test green before switching
Cons:
- Double the infrastructure cost (temporarily)
- Database must support both versions during switch
- DNS propagation delays if the switch is DNS-based rather than load-balancer-based
Strategy 3: Canary Deployment
Route a small percentage of traffic to the new version. Increase gradually.
Nginx weighted upstream:
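A sketch (upstream addresses are placeholders): nginx distributes requests in proportion to `weight`, so 19:1 sends roughly 5% of traffic to the canary.

```nginx
upstream app_backend {
    server 10.0.1.10:8080 weight=19;  # v1 -- stable, ~95%
    server 10.0.1.20:8080 weight=1;   # v2 -- canary, ~5%
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backend;
    }
}
```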
AWS ALB weighted target groups:
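The equivalent with weighted target groups, sketched with the AWS CLI (ARNs are placeholder variables):

```shell
# Send 5% of traffic to the canary target group
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions '[{
    "Type": "forward",
    "ForwardConfig": {
      "TargetGroups": [
        {"TargetGroupArn": "'"$STABLE_TG_ARN"'", "Weight": 95},
        {"TargetGroupArn": "'"$CANARY_TG_ARN"'", "Weight": 5}
      ]
    }
  }]'
```

Increasing the rollout is the same call with different weights: 75/25, 50/50, 0/100.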
Pros:
- Minimize blast radius of bugs
- Real production traffic validation
- Can abort at any point
Cons:
- Requires good observability to detect issues
- More complex routing logic
- Users may see inconsistent behavior during rollout
The Database Problem
Code deploys are easy. Databases are hard.
The trap:
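For example (a hypothetical `users` table): a one-step column rename breaks every v1 instance the moment it runs, because v1 still queries the old name.

```sql
-- v1 reads and writes "username"; this rename makes every one of
-- its queries fail immediately, mid-deployment.
ALTER TABLE users RENAME COLUMN username TO handle;
```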
The solution: Expand-Contract pattern
Phase 1: Expand (backward compatible)
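Continuing the hypothetical rename: expand by adding the new column alongside the old one.

```sql
-- Add the new column; v1 doesn't know it exists and keeps working.
ALTER TABLE users ADD COLUMN handle TEXT;

-- Backfill existing rows from the old column.
UPDATE users SET handle = username WHERE handle IS NULL;
```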
Deploy v2, which writes to both columns and reads from the new one.
Phase 2: Contract (after all instances are v2)
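Only once no running instance touches the old column is it safe to contract:

```sql
-- Safe only after every instance reads and writes "handle".
ALTER TABLE users DROP COLUMN username;
```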
Timeline:
- Deploy migration (expand)
- Deploy v2 code
- Wait for all v1 instances gone
- Deploy cleanup migration (contract)
This adds complexity but prevents downtime.
Connection Draining
When removing an instance from rotation, don’t kill active connections.
The wrong way:
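A sketch of the failure mode:

```shell
# SIGKILL: the process dies instantly, dropping every in-flight
# request on the floor. Users see connection resets.
kill -9 "$PID"
```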
The right way:
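The graceful sequence, sketched:

```shell
# 1. Remove the instance from the load balancer first
# 2. SIGTERM: the process stops accepting new connections,
#    finishes in-flight requests, then exits on its own
kill -TERM "$PID"
# 3. Only escalate to SIGKILL after a grace period expires
```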
Implementation:
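A minimal sketch using Python's stdlib `http.server` (the handler and port are placeholders): on SIGTERM the server stops accepting new connections, and `serve_forever()` returns once in-flight requests complete.

```python
import signal
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep request logging quiet

def serve_until_sigterm(server: HTTPServer) -> None:
    """Serve until SIGTERM, then drain: stop accepting new
    connections and return after in-flight requests finish."""
    def on_sigterm(signum, frame):
        # shutdown() must run on another thread, or the serving
        # loop deadlocks waiting on itself
        threading.Thread(target=server.shutdown).start()

    signal.signal(signal.SIGTERM, on_sigterm)
    server.serve_forever()
```

Typical usage would be `serve_until_sigterm(HTTPServer(("0.0.0.0", 8080), Handler))`; a real service would also flip its readiness check to failing as soon as the drain begins, so the load balancer stops routing to it.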
Load balancer settings:
- AWS ALB: deregistration_delay.timeout_seconds = 30
- Kubernetes: terminationGracePeriodSeconds: 30
Health Checks That Work
Bad health checks cause downtime during deploys.
Too simple:
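Sketched as a bare handler function:

```python
def health() -> tuple[str, int]:
    # Always 200: proves the process is running, and nothing else.
    # A wedged database or missing config passes this check too.
    return ("OK", 200)
```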
Too complex:
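At the other extreme (the `check_*` probes here are hypothetical stand-ins for real dependency clients):

```python
# Hypothetical dependency probes -- stand-ins for real clients.
def check_db() -> bool:
    return True

def check_cache() -> bool:
    return True

def check_payment_api() -> bool:
    return True

def health() -> tuple[str, int]:
    # Every probe hits every dependency. One slow or flaky
    # downstream and this instance is marked unhealthy and killed,
    # even if most requests would still succeed.
    if not (check_db() and check_cache() and check_payment_api()):
        return ("UNHEALTHY", 503)
    return ("OK", 200)
```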
If any dependency is slow, health checks fail, and your instance gets killed—even though it might be able to serve requests.
Just right:
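One workable split: a dependency-free liveness check, plus a readiness check that probes only hard dependencies with a short timeout (again with a hypothetical `check_db`).

```python
# Hypothetical probe for the one hard dependency.
def check_db(timeout: float) -> bool:
    return True

def liveness() -> tuple[str, int]:
    # Liveness: is this process functional? No dependency checks --
    # a failing dependency shouldn't get the process restarted.
    return ("OK", 200)

def readiness() -> tuple[str, int]:
    # Readiness: can this instance serve traffic right now?
    # Probe only hard dependencies, with a short timeout.
    if not check_db(timeout=1.0):
        return ("NOT READY", 503)
    return ("OK", 200)
```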
Kubernetes uses all three:
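A container-spec fragment (paths, port, and timings are placeholders):

```yaml
startupProbe:              # gates the other probes during cold start
  httpGet:
    path: /livez
    port: 8080
  failureThreshold: 30     # up to 30 x 5s = 150s to start
  periodSeconds: 5
livenessProbe:             # restart the container if this fails
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:            # pull from Service endpoints if this fails
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
```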
Graceful Startup
New instances need time before handling traffic.
Problems during cold start:
- JIT compilation hasn’t warmed up
- Caches are empty
- Connection pools aren’t filled
- Lazy-loaded resources aren’t loaded
Solution: Warm-up period
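One way to sketch it (the `app` hooks and paths here are hypothetical stand-ins for your framework's startup APIs): do the expensive work before the instance reports ready.

```python
def warm_up(app) -> None:
    # Run before the instance passes readiness checks, so the first
    # real user request doesn't pay cold-start costs.
    app.connect_pools()            # fill DB/HTTP connection pools
    app.load_caches()              # pre-populate hot cache entries
    for path in ("/", "/api/products"):   # hypothetical hot paths
        app.self_request(path)     # exercise hot code: JIT, lazy init
    app.mark_ready()               # only now start passing readiness
```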
Rollback Strategies
Deployment succeeded. Then metrics tank. Now what?
Rollback options:
Re-deploy previous version (slow)
- Works for any strategy
- Takes as long as a normal deploy
Blue-green switch (instant)
- Just change the pointer back
- Requires keeping old environment ready
Feature flag disable (instant)
- New code stays deployed
- Problematic feature turned off
- Requires feature flags built in
Traffic shift (instant)
- Canary: shift back to 0% new
- Keep new version for debugging
Automated rollback:
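A sketch of the control loop; the threshold, window, and the `get_error_rate` / `rollback` hooks are assumptions to wire into your metrics system and deploy tooling.

```python
import time

def monitor_and_rollback(get_error_rate, rollback,
                         threshold=0.05,    # abort above 5% errors
                         window_s=300.0,    # watch for 5 minutes
                         interval_s=30.0) -> bool:
    """Poll the canary's error rate. Roll back and return False if it
    crosses the threshold; return True if the window passes clean."""
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        if get_error_rate() > threshold:
            rollback()
            return False
        time.sleep(interval_s)
    return True
```

The same loop works for any of the rollback options above: `rollback` can flip a blue-green pointer, zero out canary weights, or disable a feature flag.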
Putting It Together
A practical deployment pipeline:
- Build artifact (container image, binary)
- Test in staging environment
- Deploy canary (5% traffic)
- Monitor for 5 minutes
- Gradual rollout (25% → 50% → 100%)
- Monitor for 30 minutes
- Cleanup old version
If any step fails, roll back automatically.
What you need:
- Health checks (liveness, readiness, startup)
- Connection draining
- Backward-compatible database migrations
- Observability (metrics, logs, alerts)
- Automated rollback triggers
Zero-downtime deployment isn’t a feature you turn on. It’s a set of practices that, together, make “deploy whenever” safe and boring.
Boring is good. Boring means your users don’t notice.
The best deployment is the one nobody notices happened.