Blue-green deployment is one of those patterns that sounds simple until you try to implement it at scale. Here’s what actually works.
## The Core Concept
You maintain two identical production environments: Blue (current) and Green (new). Traffic flows to Blue while you deploy and test on Green. When ready, you flip traffic to Green. If something breaks, flip back to Blue.
Simple in theory. Let’s talk about the messy reality.
## Infrastructure Setup

### Load Balancer Configuration
Your load balancer is the traffic director. With AWS ALB:
```hcl
resource "aws_lb_target_group" "blue" {
  name     = "app-blue"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
  }
}

resource "aws_lb_target_group" "green" {
  name     = "app-green"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 10
  }
}

resource "aws_lb_listener" "main" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  # HTTPS listeners also require an SSL policy and a certificate
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.blue.arn
  }
}
```
### Switching Traffic
The actual switch is just updating the listener:
```bash
#!/bin/bash
# switch-traffic.sh
set -euo pipefail

CURRENT=$(aws elbv2 describe-listeners \
  --listener-arns "$LISTENER_ARN" \
  --query 'Listeners[0].DefaultActions[0].TargetGroupArn' \
  --output text)

if [[ $CURRENT == *"blue"* ]]; then
  TARGET=$GREEN_TG_ARN
  NEW="green"
else
  TARGET=$BLUE_TG_ARN
  NEW="blue"
fi

aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions Type=forward,TargetGroupArn="$TARGET"

echo "Switched traffic to $NEW"
```
## The Database Problem
Here’s where most blue-green implementations fail. Your two environments share a database. That means:
- Schema changes must be backward compatible
- You can’t just drop columns
- Migrations need to be multi-phase
### The Expand-Contract Pattern

Phase 1: Expand (deploy while both Blue and Green are live)

```sql
-- Add the new column, keep the old one
ALTER TABLE users ADD COLUMN email_verified_at TIMESTAMP;
```

Phase 2: Migrate data (background job)

```sql
-- Backfill the new column from the old one
UPDATE users SET email_verified_at = created_at WHERE is_email_verified = true;
```

Phase 3: Contract (only after both environments read the new column)

```sql
-- Now it is safe to remove the old column
ALTER TABLE users DROP COLUMN is_email_verified;
```
This is slow. It’s annoying. It’s also the only way to do zero-downtime schema changes safely.
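What makes the contract phase safe is that application code dual-writes during the expand phase. A minimal sketch in plain Python, using dict-shaped rows and the column names from the migration above (everything else here is hypothetical):

```python
from datetime import datetime, timezone

# Hypothetical sketch: during the expand phase, write BOTH the old and the
# new column, so Blue (which reads is_email_verified) and Green (which reads
# email_verified_at) stay consistent no matter which one serves the request.
def mark_email_verified(user: dict) -> dict:
    now = datetime.now(timezone.utc)
    user["is_email_verified"] = True   # old column, still read by Blue
    user["email_verified_at"] = now    # new column, read by Green
    return user

# Green reads the new column but falls back to the old one until the
# backfill job has finished.
def is_verified(user: dict) -> bool:
    if user.get("email_verified_at") is not None:
        return True
    return bool(user.get("is_email_verified"))
```

Only once every row has been backfilled and no deployed code reads the old column does the contract migration run.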
## Handling Stateful Services

### Session State
Don’t store sessions in memory. Use Redis or your database:
```python
from flask import Flask
from flask_session import Session
import redis

app = Flask(__name__)
app.config['SESSION_TYPE'] = 'redis'
app.config['SESSION_REDIS'] = redis.Redis(
    host='redis.internal',
    port=6379,
    db=0,
)
Session(app)
```
### Background Jobs
Jobs started on Blue shouldn’t fail when traffic moves to Green:
```python
# Good: the job is self-contained and takes only an ID
class ProcessOrderJob:
    def perform(self, order_id):
        order = Order.find(order_id)
        # All data is fetched fresh from the database
        ...

# Bad: the job depends on a serialized in-memory object
class ProcessOrderJob:
    def perform(self, order):
        # If 'order' was pickled on Blue, Green might not have the
        # same class structure and will fail to deserialize it
        ...
```
## Smoke Testing Green
Before switching, validate Green works:
```bash
#!/bin/bash
# smoke-test.sh
set -euo pipefail

GREEN_URL="https://green.internal.example.com"

# Health check
curl -sf "$GREEN_URL/health" || exit 1

# Critical paths
curl -sf "$GREEN_URL/api/v1/status" || exit 1

# Test the login flow
TOKEN=$(curl -sf "$GREEN_URL/api/v1/login" \
  -H "Content-Type: application/json" \
  -d '{"user":"test@test.com","pass":"test123"}' \
  | jq -r '.token')
[ -n "$TOKEN" ] && [ "$TOKEN" != "null" ] || exit 1

# Test an authenticated endpoint
curl -sf "$GREEN_URL/api/v1/me" \
  -H "Authorization: Bearer $TOKEN" || exit 1

echo "Smoke tests passed"
```
## Gradual Rollout with Weighted Routing
Instead of flipping 100% at once, use weighted target groups:
```bash
# Send 10% of traffic to green
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions '[
    {
      "Type": "forward",
      "ForwardConfig": {
        "TargetGroups": [
          {"TargetGroupArn": "'$BLUE_TG_ARN'", "Weight": 90},
          {"TargetGroupArn": "'$GREEN_TG_ARN'", "Weight": 10}
        ]
      }
    }
  ]'
```
Monitor error rates, latency, and business metrics. If things look good, increment to 25%, 50%, 100%.
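The ramp is easy to automate once you separate the go/no-go decision from the AWS call. A sketch of the gate logic in Python; the thresholds, the 10→25→50→100 schedule, and the function names are all made up, and the metric collection and `modify-listener` call are left out:

```python
# Hypothetical ramp gate: given the current green weight and observed
# metrics, return the next weight, 0 to abort back to blue, or 100 when done.
RAMP_STEPS = [10, 25, 50, 100]

def next_green_weight(current: int, error_rate: float, p99_latency_ms: float,
                      max_error_rate: float = 0.01,
                      max_p99_ms: float = 500.0) -> int:
    # Any breached threshold means: send everything back to blue
    if error_rate > max_error_rate or p99_latency_ms > max_p99_ms:
        return 0
    if current >= 100:
        return 100
    # Otherwise advance to the next step in the schedule
    for step in RAMP_STEPS:
        if step > current:
            return step
    return 100
```

A deploy pipeline would call this in a loop: apply the weight, wait for a soak period, re-read the metrics, and repeat until it returns 100 or 0.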
## Rollback Strategy
Rollback should be instant. If you have to deploy a fix in order to roll back, you've failed.
```bash
#!/bin/bash
# rollback.sh
set -euo pipefail

# Switch back to the previous target group
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions Type=forward,TargetGroupArn="$PREVIOUS_TG_ARN"

# Alert the team
curl -X POST "$SLACK_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{"text":"🚨 Production rollback executed. Green deployment reverted to Blue."}'
```
Track which version is “blue” and which is “green” somewhere persistent:
```bash
aws ssm put-parameter \
  --name "/app/active-environment" \
  --value "blue" \
  --type String \
  --overwrite
```
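Tooling that consumes the parameter (deploy scripts, rollback automation, dashboards) then derives the idle environment from the active one. A trivial helper, sketched in Python with the boto3 fetch omitted to keep it self-contained:

```python
# Hypothetical helper: given the active environment from the parameter store,
# return the idle one, i.e. the environment the next deploy should target.
def idle_environment(active: str) -> str:
    environments = {"blue", "green"}
    if active not in environments:
        raise ValueError(f"unknown environment: {active!r}")
    return (environments - {active}).pop()
```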
## Cost Considerations
Running two identical environments doubles your infrastructure cost during deployment. Minimize this window:
- Only spin up Green when deploying
- Use spot instances for Green during testing
- Tear down the old environment quickly after successful switch
- Or use Kubernetes and just manage replica sets
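The overhead is easy to estimate: it's your environment's hourly cost multiplied by how long Green exists. A back-of-the-envelope sketch with made-up numbers:

```python
# Rough cost model: the blue-green premium is the hourly cost of one full
# environment times the hours Green is alive. All numbers are illustrative.
def blue_green_overhead(hourly_cost: float, green_hours_per_deploy: float,
                        deploys_per_month: int) -> float:
    return hourly_cost * green_hours_per_deploy * deploys_per_month

# A $40/hour environment with Green alive 2 hours per deploy, at 20 deploys
# a month, costs an extra $1,600/month. Shrinking the window to 30 minutes
# cuts that to $400.
```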
## Kubernetes Alternative
In Kubernetes, you often don’t need explicit blue-green. Rolling updates give you similar benefits:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 10
  selector:
    matchLabels:
      app: app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: app:v2.0.0
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
```
But for full blue-green with instant rollback, use Argo Rollouts:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  # replicas, selector, and template go here, as in a Deployment
  strategy:
    blueGreen:
      activeService: app-active
      previewService: app-preview
      autoPromotionEnabled: false
```
## When Blue-Green Makes Sense
Good fit:
- Critical services where you need instant rollback
- Significant releases that need full validation before exposure
- Compliance requirements for deployment verification
- Services with complex integration testing requirements
Maybe not worth it:
- Microservices with rolling updates and feature flags
- Services with infrequent deployments
- Stateless services where canary releases work fine
- Resource-constrained environments
## Key Takeaways
- Database compatibility is the hard part — use expand-contract migrations
- Externalize all state — sessions, caches, job queues
- Smoke test before switching — automated, comprehensive, blocking
- Use weighted routing for gradual rollout — not all-or-nothing
- Rollback must be instant — one command, no deployment
Blue-green deployments give you confidence. When you can flip back in seconds, you deploy more often. When you deploy more often, each deployment is smaller. Smaller deployments are safer.
The setup cost is real, but the payoff is fearless releases. 🌍