“We’re deploying, please hold” is not an acceptable user experience. Whether you’re running a startup or enterprise infrastructure, users expect services to just work. Here’s how to ship code without the maintenance windows.

The Goal: Invisible Deploys

A zero-downtime deployment means users never notice you’re deploying. No error pages, no dropped connections, no “please refresh” messages. The old version serves traffic until the new version is proven healthy.

Strategy 1: Rolling Deployments

The simplest approach. Replace instances one at a time:

# Kubernetes rolling update config
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Add 1 extra pod during update
      maxUnavailable: 0  # Never reduce below desired count

How it works:

  1. Start one new pod with new version
  2. Wait for it to pass health checks
  3. Remove one old pod
  4. Repeat until all pods are updated
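The loop above can be sketched as a toy Python simulation. Here `start_pod`, `is_healthy`, and `stop_pod` stand in for orchestrator calls; they are illustrative, not a real API:

```python
def rolling_update(old_pods, new_version, start_pod, is_healthy, stop_pod):
    """Replace pods one at a time: surge by one, verify health, then retire an old pod."""
    new_pods = []
    remaining = list(old_pods)
    while remaining:
        pod = start_pod(new_version)   # maxSurge: 1 — add one extra pod first
        if not is_healthy(pod):
            stop_pod(pod)              # failed health check: abort with old pods untouched
            raise RuntimeError("rollout aborted: new pod unhealthy")
        new_pods.append(pod)
        stop_pod(remaining.pop())      # maxUnavailable: 0 — only now remove an old pod
    return new_pods
```

Note the ordering: capacity never drops below the desired count, and a failing health check stops the rollout before any old pod is removed.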

Pros:

  • Simple to implement
  • Built into Kubernetes, ECS, most orchestrators
  • Minimal extra resource usage

Cons:

  • Slow for large deployments
  • Both versions run simultaneously (must be compatible)
  • Rollback requires another full roll

Strategy 2: Blue-Green Deployments

Run two complete environments. Switch traffic atomically:

Load balancer → Blue (live, v1) / Green (standby, v2); switch traffic on deploy

Implementation with AWS ALB:

# Deploy to green environment
aws ecs update-service --cluster prod --service app-green \
  --task-definition app:42

# Wait for green to stabilize
aws ecs wait services-stable --cluster prod --services app-green

# Switch traffic (update target group weights)
aws elbv2 modify-rule --rule-arn $RULE_ARN \
  --actions '[{"Type":"forward","ForwardConfig":{"TargetGroups":[
    {"TargetGroupArn":"'$GREEN_TG'","Weight":100},
    {"TargetGroupArn":"'$BLUE_TG'","Weight":0}
  ]}}]'

Pros:

  • Instant rollback (just switch back)
  • Full environment testing before switch
  • No mixed versions serving traffic

Cons:

  • Double the infrastructure cost
  • Database migrations get complicated
  • State synchronization between environments
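Conceptually, the cutover is an atomic pointer swap. A minimal in-process sketch of the idea (the `Router` class and pool names are illustrative, not from any library):

```python
class Router:
    """Routes every request to exactly one environment; the swap is a single assignment."""

    def __init__(self, blue, green):
        self.pools = {"blue": blue, "green": green}
        self.live = "blue"  # all traffic goes here

    def handle(self, request):
        return self.pools[self.live](request)

    def cut_over(self, target, smoke_test):
        # Only switch if the standby environment passes its checks first
        if not smoke_test(self.pools[target]):
            raise RuntimeError(f"{target} failed smoke test; staying on {self.live}")
        self.live = target  # one assignment: no request ever sees a half-switched state
```

Rollback is the same operation in reverse, which is why it is effectively instant.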

Strategy 3: Canary Deployments

Ship to a small percentage of traffic first. Expand if metrics look good:

# Argo Rollouts canary strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5       # 5% traffic to canary
        - pause: {duration: 5m}
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1

The key: automated analysis. Don’t rely on humans watching dashboards at 3am:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result >= 0.99
      provider:
        prometheus:
          query: |
            sum(rate(http_requests_total{status=~"2.."}[5m])) /
            sum(rate(http_requests_total[5m]))

If error rate spikes, the rollout automatically aborts and rolls back.
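The analysis gate boils down to a simple comparison. A sketch of the promote-or-abort decision made at each step (counts and the 0.99 threshold mirror the template above; the function name is an assumption):

```python
def canary_verdict(success_count, total_count, threshold=0.99):
    """Promote only if the 2xx ratio meets the threshold; treat missing data as failure."""
    if total_count == 0:
        return "abort"  # no traffic is not evidence of health
    success_rate = success_count / total_count
    return "promote" if success_rate >= threshold else "abort"
```

The "no data means abort" branch matters: a canary that receives no traffic has proven nothing.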

The Database Problem

Zero-downtime deploys break when your schema changes aren’t backward compatible:

-- ❌ This breaks running v1 code:
ALTER TABLE users DROP COLUMN legacy_field;

-- ❌ This too:
ALTER TABLE users RENAME COLUMN name TO full_name;

Solution: Expand-Contract migrations

Deploy 1: Add new column (both versions work)
Deploy 2: Backfill data, app uses new column
Deploy 3: Remove old column
-- Step 1: Expand (safe)
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- Step 2: Backfill (background job)
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- Step 3: Contract (only after all code uses full_name)
ALTER TABLE users DROP COLUMN name;

Each step is a separate deploy. Old code keeps working until you’re ready to cut over.
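Between the expand and contract deploys, application code typically writes both columns so that either version reads consistent data. A hedged sketch of that dual write (the table matches the example above; the helper and `db` interface are assumptions):

```python
def save_user(db, user_id, name):
    # Dual-write during the transition: v1 code reads `name`, v2 reads `full_name`.
    # Only valid between the expand and contract steps, while both columns exist.
    db.execute(
        "UPDATE users SET name = %s, full_name = %s WHERE id = %s",
        (name, name, user_id),
    )
```

Once every reader uses `full_name`, the dual write is removed in the same deploy that drops the old column.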

Health Checks That Matter

Your orchestrator can only make good decisions with good health signals:

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# ❌ BAD: Always returns healthy
@app.get("/health")
def health():
    return {"status": "ok"}

# ✅ GOOD: Actually checks dependencies
@app.get("/health")
def health():
    checks = {
        "database": check_db_connection(),
        "redis": check_redis_connection(),
        "disk": check_disk_space(),
    }

    all_healthy = all(c["healthy"] for c in checks.values())
    status_code = 200 if all_healthy else 503

    return JSONResponse(
        content={"status": "healthy" if all_healthy else "unhealthy", "checks": checks},
        status_code=status_code,
    )
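Each check helper can follow the same shape: run a probe, and never let an exception take down the health endpoint itself. A small sketch (`run_check` and the probe callables are placeholders for real connection tests, not a library API):

```python
def run_check(probe):
    """Wrap a dependency probe so a failure reports unhealthy instead of raising."""
    try:
        probe()  # e.g. SELECT 1 for the database, PING for Redis
        return {"healthy": True}
    except Exception as exc:
        return {"healthy": False, "error": str(exc)}
```

For example, `check_db_connection` could be `lambda: run_check(lambda: pool.execute("SELECT 1"))`. The key property: a dead dependency produces a 503 with diagnostics, not a crashed `/health` handler.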

Configure appropriate timing:

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30  # Give app time to start
  periodSeconds: 10
  failureThreshold: 3      # 3 failures before restart

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 1      # Remove from LB immediately

Key insight: Readiness checks should be stricter than liveness checks. A pod that can’t serve traffic should be removed from the load balancer immediately, but killing and restarting it might just make things worse.

Graceful Shutdown

When a pod is terminated, it needs time to finish in-flight requests:

import signal
import sys

# `server`, `db`, and `redis` are your app's own handles; the method names
# here are illustrative, not a specific framework's API.
def graceful_shutdown(signum, frame):
    # Stop accepting new requests
    server.stop_accepting()

    # Wait for in-flight requests; 25s stays inside the 30s termination grace period
    server.wait_for_completion(timeout=25)

    # Clean up connections
    db.close()
    redis.close()

    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_shutdown)

Kubernetes config:

spec:
  terminationGracePeriodSeconds: 30
  containers:
    - lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]  # Let LB drain

The preStop hook runs before SIGTERM. Use it to give load balancers time to remove the pod from rotation.

Feature Flags: Deploy ≠ Release

The ultimate zero-downtime strategy is separating deployment from release:

if feature_flags.is_enabled("new_checkout_flow", user_id=user.id):
    return new_checkout_flow(cart)
else:
    return legacy_checkout_flow(cart)

Now you can:

  • Deploy code anytime (it’s dormant behind a flag)
  • Enable for internal users first
  • Gradually roll out to percentages
  • Instant rollback by disabling the flag
  • A/B test features with real traffic
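Percentage rollouts are usually deterministic: hash the user ID into a stable bucket so the same user always gets the same answer. A minimal sketch of the idea (not any specific flag library's API):

```python
import hashlib

def is_enabled(flag_name, user_id, rollout_percent):
    """Deterministically bucket a user into 0-99; enable if below the rollout percentage."""
    key = f"{flag_name}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < rollout_percent
```

Because buckets are stable, growing the rollout from 5% to 25% to 100% only ever adds users; nobody flips back and forth between variants mid-rollout.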

This is how companies ship hundreds of deploys per day without fear.

Putting It Together

My recommended approach for most teams:

  1. Start with rolling deployments — it’s good enough for most apps
  2. Add proper health checks — readiness and liveness, checking real dependencies
  3. Implement graceful shutdown — finish in-flight requests before dying
  4. Use expand-contract migrations — never break backward compatibility
  5. Graduate to canary when scale demands it — with automated analysis

Zero-downtime deployments aren’t about fancy tools. They’re about respecting the contract between old code and new code, giving systems time to transition, and having automated guardrails to catch problems before users do.

Ship fast. Ship often. Ship invisibly.