Blue-green deployment is one of those patterns that sounds simple until you try to implement it at scale. Here’s what actually works.

The Core Concept

You maintain two identical production environments: Blue (current) and Green (new). Traffic flows to Blue while you deploy and test on Green. When ready, you flip traffic to Green. If something breaks, flip back to Blue.

Simple in theory. Let’s talk about the messy reality.

Infrastructure Setup

Load Balancer Configuration

Your load balancer is the traffic director. With AWS ALB:

resource "aws_lb_target_group" "blue" {
  name     = "app-blue"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
  }
}

resource "aws_lb_target_group" "green" {
  name     = "app-green"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
  }
}

resource "aws_lb_listener" "main" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.blue.arn
  }
}

Switching Traffic

The actual switch is just updating the listener:

#!/bin/bash
# switch-traffic.sh
set -euo pipefail

CURRENT=$(aws elbv2 describe-listeners \
  --listener-arns "$LISTENER_ARN" \
  --query 'Listeners[0].DefaultActions[0].TargetGroupArn' \
  --output text)

if [[ $CURRENT == *"blue"* ]]; then
  TARGET=$GREEN_TG_ARN
  NEW="green"
else
  TARGET=$BLUE_TG_ARN
  NEW="blue"
fi

aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions "Type=forward,TargetGroupArn=$TARGET"

echo "Switched traffic to $NEW"

The Database Problem

Here’s where most blue-green implementations fail. Your two environments share a database. That means:

  1. Schema changes must be backward compatible
  2. You can’t just drop columns
  3. Migrations need to be multi-phase

The Expand-Contract Pattern

Phase 1: Expand (deploy with both Blue and Green)

-- Add new column, keep old one
ALTER TABLE users ADD COLUMN email_verified_at TIMESTAMP;

Phase 2: Migrate data (background job)

-- Backfill data
UPDATE users SET email_verified_at = created_at WHERE is_email_verified = true;

Phase 3: Contract (only after both environments use new column)

-- Safe to remove old column
ALTER TABLE users DROP COLUMN is_email_verified;

This is slow. It’s annoying. It’s also the only way to do zero-downtime schema changes safely.

Handling Stateful Services

Session State

Don’t store sessions in memory. Use Redis or your database:

from flask import Flask
from flask_session import Session
import redis

app = Flask(__name__)
app.config['SESSION_TYPE'] = 'redis'
app.config['SESSION_REDIS'] = redis.Redis(
    host='redis.internal',
    port=6379,
    db=0
)
Session(app)

Background Jobs

Jobs started on Blue shouldn’t fail when traffic moves to Green:

# Good: Job is self-contained; enqueue only the ID
class ProcessOrderJob:
    def perform(self, order_id):
        order = Order.find(order_id)
        # All data is fetched fresh

# Bad: Job depends on in-memory state
class ProcessOrderJob:
    def perform(self, order):
        # If 'order' was pickled from Blue, Green might not have
        # the same class structure to unpickle it
        ...

Smoke Testing Green

Before switching, validate Green works:

#!/bin/bash
# smoke-test.sh

GREEN_URL="https://green.internal.example.com"

# Health check
curl -sf "$GREEN_URL/health" || exit 1

# Critical paths
curl -sf "$GREEN_URL/api/v1/status" || exit 1

# Test login flow
TOKEN=$(curl -sf "$GREEN_URL/api/v1/login" \
  -H "Content-Type: application/json" \
  -d '{"user":"test@test.com","pass":"test123"}' \
  | jq -r '.token')

# jq prints the literal string "null" for a missing key, so check both
{ [ -n "$TOKEN" ] && [ "$TOKEN" != "null" ]; } || exit 1

# Test authenticated endpoint
curl -sf "$GREEN_URL/api/v1/me" \
  -H "Authorization: Bearer $TOKEN" || exit 1

echo "Smoke tests passed"

Gradual Rollout with Weighted Routing

Instead of flipping 100% at once, use weighted target groups:

# Send 10% to green
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions '[
    {
      "Type": "forward",
      "ForwardConfig": {
        "TargetGroups": [
          {"TargetGroupArn": "'$BLUE_TG_ARN'", "Weight": 90},
          {"TargetGroupArn": "'$GREEN_TG_ARN'", "Weight": 10}
        ]
      }
    }
  ]'

Monitor error rates, latency, and business metrics. If things look good, increment to 25%, 50%, 100%.
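That ramp-and-monitor loop reduces to a small decision function plus a payload builder. A sketch under assumptions (the weight ladder and the 1% error threshold are illustrative choices, not prescribed values; `weighted_actions` builds the same JSON shape passed to `modify-listener` above):

```python
WEIGHT_STEPS = [10, 25, 50, 100]  # percent of traffic sent to green

def next_weight(current: int, error_rate: float, max_error_rate: float = 0.01) -> int:
    """Return the next green weight, or 0 to signal an immediate
    rollback of all traffic to blue."""
    if error_rate > max_error_rate:
        return 0
    for step in WEIGHT_STEPS:
        if step > current:
            return step
    return 100  # already fully shifted

def weighted_actions(blue_arn: str, green_arn: str, green_weight: int) -> list:
    """Build the --default-actions payload for elbv2 modify-listener."""
    return [{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": blue_arn, "Weight": 100 - green_weight},
                {"TargetGroupArn": green_arn, "Weight": green_weight},
            ]
        }
    }]
```

Keeping the decision pure (metrics in, weight out) makes the rollout policy unit-testable, separate from the AWS call that applies it.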

Rollback Strategy

Rollback should be instant. If you need to deploy a fix in order to roll back, you’ve failed.

#!/bin/bash
# rollback.sh

# Switch back to previous target group
aws elbv2 modify-listener \
  --listener-arn $LISTENER_ARN \
  --default-actions Type=forward,TargetGroupArn=$PREVIOUS_TG_ARN

# Alert the team
curl -X POST "$SLACK_WEBHOOK" \
  -d '{"text":"🚨 Production rollback executed. Green deployment reverted to Blue."}'

Track which version is “blue” and which is “green” somewhere persistent:

aws ssm put-parameter \
  --name "/app/active-environment" \
  --value "blue" \
  --type String \
  --overwrite
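With the active color stored in SSM, deploy tooling can derive the idle target deterministically instead of guessing. A minimal sketch (the parameter name follows the snippet above; the boto3 usage in the comment is the standard `ssm` client):

```python
def idle_environment(active: str) -> str:
    """Given the active color, return the one that is safe to deploy to."""
    if active not in ("blue", "green"):
        raise ValueError(f"unexpected environment: {active!r}")
    return "green" if active == "blue" else "blue"

# Usage with boto3 (not executed here):
#   ssm = boto3.client("ssm")
#   active = ssm.get_parameter(Name="/app/active-environment")["Parameter"]["Value"]
#   deploy_to = idle_environment(active)
```

Failing loudly on an unexpected value matters: a typo in the parameter should stop the pipeline, not pick a color at random.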

Cost Considerations

Running two identical environments doubles your infrastructure cost during deployment. Minimize this window:

  1. Only spin up Green when deploying
  2. Use spot instances for Green during testing
  3. Tear down the old environment quickly after successful switch
  4. Or use Kubernetes, where the two environments are just labeled ReplicaSets behind a Service

Kubernetes Alternative

In Kubernetes, you often don’t need explicit blue-green. Rolling updates give you similar benefits:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
      - name: app
        image: app:v2.0.0
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

But for full blue-green with instant rollback, use Argo Rollouts:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  strategy:
    blueGreen:
      activeService: app-active
      previewService: app-preview
      autoPromotionEnabled: false

When Blue-Green Makes Sense

Good fit:

  • Critical services where you need instant rollback
  • Significant releases that need full validation before exposure
  • Compliance requirements for deployment verification
  • Services with complex integration testing requirements

Maybe not worth it:

  • Microservices with rolling updates and feature flags
  • Services with infrequent deployments
  • Stateless services where canary releases work fine
  • Resource-constrained environments

Key Takeaways

  1. Database compatibility is the hard part — use expand-contract migrations
  2. Externalize all state — sessions, caches, job queues
  3. Smoke test before switching — automated, comprehensive, blocking
  4. Use weighted routing for gradual rollout — not all-or-nothing
  5. Rollback must be instant — one command, no deployment

Blue-green deployments give you confidence. When you can flip back in seconds, you deploy more often. When you deploy more often, each deployment is smaller. Smaller deployments are safer.

The setup cost is real, but the payoff is fearless releases. 🌍