Deploying new code shouldn’t mean crossing your fingers. Blue-green and canary deployments let you release changes with confidence, validate them with real traffic, and roll back in seconds if something goes wrong.

Blue-Green Deployments

Blue-green maintains two identical production environments. One serves traffic (blue), while the other stands ready (green). To deploy, you push to green, test it, then switch traffic over.

Users ──> Router ──> Blue  (v1.0, live)
                     Green (v1.1, idle, ready to take over)

Kubernetes Implementation

# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    app: myapp
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: app
        image: myapp:1.0.0
        ports:
        - containerPort: 8080
---
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    app: myapp
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: app
        image: myapp:1.1.0
        ports:
        - containerPort: 8080
---
# service.yaml - switch by changing selector
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch
  ports:
  - port: 80
    targetPort: 8080

Deployment Script

#!/bin/bash
set -e

NEW_VERSION=$1
CURRENT=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT" == "blue" ]; then
  TARGET="green"
else
  TARGET="blue"
fi

echo "Current: $CURRENT, Deploying to: $TARGET"

# Update the standby deployment
kubectl set image deployment/app-$TARGET app=myapp:$NEW_VERSION

# Wait for rollout
kubectl rollout status deployment/app-$TARGET --timeout=300s

# Run smoke tests against standby
# (assumes a per-version Service named app-blue / app-green exists)
kubectl run smoke-test --rm -it --image=curlimages/curl \
  --restart=Never -- curl -f http://app-$TARGET:8080/health

# Switch traffic
kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$TARGET\"}}}"

echo "Switched to $TARGET (v$NEW_VERSION)"

Instant Rollback

#!/bin/bash
# rollback.sh - switch back to previous version

CURRENT=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT" == "blue" ]; then
  PREVIOUS="green"
else
  PREVIOUS="blue"
fi

kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$PREVIOUS\"}}}"
echo "Rolled back to $PREVIOUS"

Canary Deployments

Canary deployments route a small percentage of traffic to the new version, gradually increasing if metrics look good.

Canary progression:

  Day 1: 95% stable,  5% canary
  Day 2: 75% stable, 25% canary
  Day 3: 50% stable, 50% canary
  Day 4:  0% stable, 100% canary (promoted)

Kubernetes with Ingress

# Using nginx ingress annotations for traffic splitting
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  # 10% to canary
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80
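
To widen the canary as confidence grows, the weight annotation can be updated in place (a sketch; pick percentages to match your own schedule):

```bash
# Shift 25% of traffic to the canary; repeat with higher values to promote
kubectl annotate ingress myapp-canary \
  nginx.ingress.kubernetes.io/canary-weight="25" --overwrite
```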

Progressive Rollout with Flagger

Flagger automates canary analysis and promotion:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
    targetPort: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    webhooks:
    - name: smoke-test
      type: pre-rollout
      url: http://flagger-loadtester/
      timeout: 30s
      metadata:
        type: bash
        cmd: "curl -f http://myapp-canary:8080/health"
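
The stepped promotion Flagger performs can be modeled in a few lines (an illustration of the stepWeight/maxWeight settings above, not Flagger's actual code):

```python
def canary_weights(step_weight: int = 10, max_weight: int = 50):
    """Yield the canary traffic weights Flagger walks through before
    promotion: step_weight, 2*step_weight, ... up to max_weight."""
    weight = step_weight
    while weight <= max_weight:
        yield weight
        weight += step_weight

# With the settings above: one step per analysis interval, promoted
# once every check has passed at max_weight.
print(list(canary_weights()))  # → [10, 20, 30, 40, 50]
```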

Manual Canary with Python

import hashlib
import random

from flask import Flask, request
import requests

app = Flask(__name__)

# Configuration
CANARY_PERCENTAGE = 10
STABLE_BACKEND = "http://stable-service:8080"
CANARY_BACKEND = "http://canary-service:8080"

def get_backend(user_id: str = None) -> str:
    """
    Determine which backend to route to.
    Hash the user_id so the same user always sees the same version
    during rollout. Python's built-in hash() is salted per process,
    so use a stable digest instead.
    """
    if user_id:
        # Consistent routing based on user
        hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
        use_canary = hash_val < CANARY_PERCENTAGE
    else:
        # Random for anonymous users
        use_canary = random.randint(1, 100) <= CANARY_PERCENTAGE

    return CANARY_BACKEND if use_canary else STABLE_BACKEND

@app.route('/', defaults={'path': ''})
@app.route('/<path:path>')
def proxy(path):
    user_id = request.headers.get('X-User-ID')
    backend = get_backend(user_id)
    
    # Forward request
    resp = requests.request(
        method=request.method,
        url=f"{backend}/{path}",
        headers={k: v for k, v in request.headers if k != 'Host'},
        data=request.get_data(),
        allow_redirects=False
    )
    
    # Mirror the backend's body and status, adding a header that
    # indicates which version served the request
    response = app.make_response((resp.content, resp.status_code))
    response.headers['X-Served-By'] = 'canary' if backend == CANARY_BACKEND else 'stable'
    return response
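
A quick way to sanity-check this routing scheme (a standalone sketch using the same stable-digest idea; `is_canary` is a name introduced here for illustration):

```python
import hashlib

CANARY_PERCENTAGE = 10

def is_canary(user_id: str) -> bool:
    # Stable digest: the same user_id lands in the same bucket
    # in every process, unlike the built-in hash()
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100 < CANARY_PERCENTAGE

# The same user is always routed the same way...
assert is_canary("user-42") == is_canary("user-42")

# ...and across many users, roughly CANARY_PERCENTAGE percent hit the canary
share = sum(is_canary(f"user-{i}") for i in range(10_000)) / 10_000
print(f"{share:.1%} of users routed to canary")
```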

Monitoring During Rollout

Track key metrics to decide whether to proceed or rollback:

import os
import time

from flask import request
from prometheus_client import Counter, Histogram

# Track by version
requests_total = Counter(
    'http_requests_total',
    'Total requests',
    ['version', 'status_code', 'endpoint']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'Request duration',
    ['version', 'endpoint']
)

# In your app
@app.before_request
def track_request():
    request.start_time = time.time()

@app.after_request
def record_metrics(response):
    version = os.environ.get('APP_VERSION', 'unknown')
    duration = time.time() - request.start_time
    
    requests_total.labels(
        version=version,
        status_code=response.status_code,
        endpoint=request.endpoint
    ).inc()
    
    request_duration.labels(
        version=version,
        endpoint=request.endpoint
    ).observe(duration)
    
    return response

Automated Rollback Trigger

# canary_monitor.py
import requests
from prometheus_api_client import PrometheusConnect

prom = PrometheusConnect(url="http://prometheus:9090")

def check_canary_health(canary_version: str) -> bool:
    # Check error rate
    error_query = f'''
        sum(rate(http_requests_total{{version="{canary_version}", status_code=~"5.."}}[5m]))
        /
        sum(rate(http_requests_total{{version="{canary_version}"}}[5m]))
    '''
    error_rate = prom.custom_query(error_query)
    
    if error_rate and float(error_rate[0]['value'][1]) > 0.01:  # >1% errors
        return False
    
    # Check latency
    latency_query = f'''
        histogram_quantile(0.99, 
            rate(http_request_duration_seconds_bucket{{version="{canary_version}"}}[5m])
        )
    '''
    p99_latency = prom.custom_query(latency_query)
    
    if p99_latency and float(p99_latency[0]['value'][1]) > 0.5:  # >500ms p99
        return False
    
    return True

def rollback_canary():
    # Trigger rollback via your deployment tool
    requests.post("http://deployment-api/rollback")
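
The decision logic that ties these together can be kept as a pure function so it is testable without Prometheus (`decide` is a name introduced here; in production you would feed it `check_canary_health()` results on a timer and call `rollback_canary()` on a "rollback" verdict):

```python
def decide(results, max_failures: int = 3):
    """Given a sequence of health-check results (True/False per
    interval), return 'rollback' after max_failures consecutive
    failures, mirroring Flagger's threshold behavior; a healthy
    check resets the counter."""
    failures = 0
    for healthy in results:
        failures = 0 if healthy else failures + 1
        if failures >= max_failures:
            return "rollback"
    return "promote"
```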

When to Use Which

| Strategy   | Best For                                                       | Trade-offs                                        |
|------------|----------------------------------------------------------------|---------------------------------------------------|
| Blue-Green | Database migrations, breaking changes, compliance requirements | Requires 2x resources during deployment           |
| Canary     | Gradual rollouts, A/B testing, risk-sensitive changes          | More complex routing and monitoring               |
| Rolling    | Simple updates, stateless services                             | Harder to roll back; mixed versions during deploy |

Best Practices

  1. Always have a rollback plan — test it before you need it
  2. Monitor the right metrics — error rates, latency, business KPIs
  3. Use consistent routing — same user should see same version
  4. Automate promotion criteria — remove human error from the loop
  5. Keep both versions compatible — especially for database schemas
  6. Test the deployment process — not just the code

Zero-downtime deployments aren’t just about uptime — they’re about deploying with confidence. When rollback is instant and painless, you ship faster because the cost of mistakes is lower.