“We don’t test in production” sounds responsible until you realize: production is the only environment that’s actually production. Staging lies to you. Here’s how to test in production safely.
## Why Staging Fails
Staging environments differ from production in ways that matter:
- Data: Sanitized, outdated, or synthetic
- Scale: A fraction of production load, often none at all
- Integrations: Sandbox APIs with different behavior
- Users: Developers clicking around, not real usage patterns
- Infrastructure: Smaller instances, shared resources
That bug that only appears under real load with real data? Staging won’t catch it.
## The Mental Shift
Testing in production doesn’t mean “YOLO deploy and hope.” It means:
- Deploy dark: Code is in production but not active
- Expose gradually: Increase exposure in controlled steps
- Observe aggressively: Know immediately when something breaks
- Rollback instantly: One button to undo
## Feature Flags: Deploy Without Releasing
Separate deployment from release:
```python
from flagsmith import Flagsmith

flags = Flagsmith(environment_key=FLAGSMITH_KEY)

@app.get("/checkout")
def checkout(user_id: str):
    if flags.get_identity_flags(user_id).is_feature_enabled("new_checkout"):
        return new_checkout_flow()
    else:
        return legacy_checkout_flow()
```
Now you can:
- Deploy new checkout to production (dark)
- Enable for internal users only
- Enable for 1% of users
- Monitor for errors
- Gradually increase to 100%
- Or kill it instantly if something breaks
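The "1% of users" step only works if bucketing is deterministic: a user who gets the new checkout today must still get it tomorrow. Flagsmith handles this for you; if you ever roll your own, the core idea is a stable hash. A minimal sketch (illustrative, not Flagsmith's actual algorithm):

```python
import hashlib

def in_rollout(user_id: str, feature: str, percentage: float) -> bool:
    """Deterministically bucket a user into [0, 100) using a stable hash.

    The same user always lands in the same bucket, so raising the
    percentage only ever adds users -- nobody flips between variants.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = (int(digest[:8], 16) % 10000) / 100  # 0.00 .. 99.99
    return bucket < percentage
```

Because the hash includes the feature name, a user enabled for `new_checkout` isn't automatically enabled for every other experiment.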
## Canary Deployments
Route a small percentage of traffic to new code:
```yaml
# Kubernetes canary deployment
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  # 10% of traffic
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary
                port:
                  number: 80
```
Monitor the canary separately. If error rates spike, route traffic back to stable.
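The "route traffic back" decision can itself be automated: compare the canary's error rate against the stable baseline and roll back when it degrades. A minimal sketch of such a policy; the thresholds and function name are illustrative assumptions, not from any particular tool:

```python
def canary_healthy(canary_errors: int, canary_total: int,
                   stable_errors: int, stable_total: int,
                   max_ratio: float = 2.0, min_requests: int = 100) -> bool:
    """Decide whether the canary should keep receiving traffic.

    Waits for a minimum sample size before judging, then compares the
    canary's error rate against the stable baseline with some slack.
    """
    if canary_total < min_requests:
        return True  # not enough data to judge yet
    canary_rate = canary_errors / canary_total
    stable_rate = stable_errors / max(stable_total, 1)
    # Alert only if the canary errors at max_ratio times the baseline
    # (with a small absolute floor so a near-zero baseline isn't a trap)
    return canary_rate <= max(stable_rate * max_ratio, 0.001)
```

Run this on a timer against your metrics store; a `False` result flips the canary weight back to zero.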
## Observability: Your Safety Net
You can’t test in production without seeing what’s happening:
```python
from prometheus_client import Counter, Histogram
import structlog

logger = structlog.get_logger()

checkout_attempts = Counter(
    'checkout_attempts_total',
    'Checkout attempts',
    ['variant', 'status']
)

checkout_duration = Histogram(
    'checkout_duration_seconds',
    'Checkout duration',
    ['variant']
)

def new_checkout_flow():
    with checkout_duration.labels(variant='new').time():
        try:
            result = process_new_checkout()
            checkout_attempts.labels(variant='new', status='success').inc()
            logger.info("checkout_completed", variant="new", order_id=result.id)
            return result
        except Exception as e:
            checkout_attempts.labels(variant='new', status='error').inc()
            logger.error("checkout_failed", variant="new", error=str(e))
            raise
```
Dashboard this:
- Success rate by variant
- Latency percentiles by variant
- Error types and frequencies
- Business metrics (conversion rate, cart value)
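If the Prometheus stack isn't in place yet, the same per-variant rollup can be computed from raw events. A minimal sketch using nearest-rank p95; the event shape and function name are illustrative assumptions:

```python
import math
from collections import defaultdict

def summarize(events):
    """Aggregate (variant, status, duration_seconds) events into
    per-variant success rate and p95 latency."""
    grouped = defaultdict(list)
    for variant, status, duration in events:
        grouped[variant].append((status, duration))

    stats = {}
    for variant, rows in grouped.items():
        durations = sorted(d for _, d in rows)
        rank = max(0, math.ceil(0.95 * len(durations)) - 1)  # nearest-rank p95
        stats[variant] = {
            "success_rate": sum(s == "success" for s, _ in rows) / len(rows),
            "p95_seconds": durations[rank],
        }
    return stats
```

Comparing `stats["new"]` against `stats["legacy"]` side by side is exactly what the dashboard should show.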
## Chaos Engineering
Intentionally break things to test resilience:
```python
# Chaos middleware (ASGI)
import asyncio
import random

class ChaosMiddleware:
    def __init__(self, app, failure_rate=0.01):
        self.app = app
        self.failure_rate = failure_rate

    async def __call__(self, scope, receive, send):
        if random.random() < self.failure_rate:
            # Inject artificial latency
            await asyncio.sleep(random.uniform(0.5, 2.0))
        await self.app(scope, receive, send)

# Only enable chaos where you've deliberately opted in -- e.g. a deployment
# reserved for chaos testing. (Per-user targeting would go inside __call__,
# after identifying the user from the request.)
if CHAOS_TESTING_ENABLED:
    app = ChaosMiddleware(app, failure_rate=0.05)
```
Or use tools like Gremlin, Chaos Monkey, or Litmus for infrastructure-level chaos.
## Shadow Traffic
Test new code with production traffic without affecting users:
```python
async def handle_request(request):
    # Primary path - what users see
    response = await legacy_handler(request)
    # Shadow path - async, fire-and-forget
    asyncio.create_task(shadow_handler(request, response))
    return response

async def shadow_handler(request, legacy_response):
    """Process request with new code, log differences, don't return to user."""
    try:
        new_response = await new_handler(request)
        # Compare responses
        if new_response != legacy_response:
            logger.warning(
                "shadow_mismatch",
                request_id=request.id,
                legacy=legacy_response,
                new=new_response,
            )
    except Exception as e:
        logger.error("shadow_error", error=str(e))
```
Real production traffic, zero user impact.
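In practice a naive `!=` comparison is too noisy: timestamps, request IDs, and other volatile fields differ on every pair of responses. A field-aware diff cuts the false positives; this is a minimal sketch, and the default ignore list is an assumption you'd tune per endpoint:

```python
def diff_responses(legacy: dict, new: dict,
                   ignore=("timestamp", "request_id")) -> list:
    """Compare two response payloads, skipping fields expected to differ.

    Returns a list of (key, legacy_value, new_value) mismatches;
    an empty list means the responses agree on everything that matters.
    """
    mismatches = []
    for key in sorted(set(legacy) | set(new)):
        if key in ignore:
            continue
        a, b = legacy.get(key), new.get(key)
        if a != b:
            mismatches.append((key, a, b))
    return mismatches
```

Log the mismatch list instead of the raw responses and the shadow_mismatch events become directly actionable.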
## Synthetic Monitoring
Continuously test production with fake requests:
```python
import time

# Scheduled job every minute
async def synthetic_checkout_test():
    test_user = get_synthetic_test_user()
    test_product = get_test_product()
    try:
        start = time.time()
        result = await checkout_api.create_order(
            user_id=test_user.id,
            product_id=test_product.id,
            is_synthetic=True  # Mark for filtering
        )
        duration = time.time() - start
        metrics.record("synthetic.checkout.success", duration)
        # Clean up test order
        await checkout_api.cancel_order(result.order_id)
    except Exception as e:
        metrics.record("synthetic.checkout.failure")
        alert_oncall(f"Synthetic checkout failed: {e}")
```
Know about outages before users report them.
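The probe also needs a scheduler that survives its own failures: one bad run must not kill the loop. A minimal asyncio sketch; the injectable `sleep` parameter is an assumption added to make the loop testable:

```python
import asyncio

async def run_synthetic_probe(probe, interval=60.0, iterations=None,
                              sleep=asyncio.sleep):
    """Run a probe coroutine on a fixed schedule.

    Exceptions are recorded as failures and the next run still happens
    on time. With iterations=None the loop runs forever.
    """
    results = []
    count = 0
    while iterations is None or count < iterations:
        try:
            results.append(("success", await probe()))
        except Exception as e:
            results.append(("failure", str(e)))
        count += 1
        if iterations is None or count < iterations:
            await sleep(interval)
    return results
```

In production you'd run `run_synthetic_probe(synthetic_checkout_test)` as a long-lived task (or hand the scheduling to cron or Kubernetes CronJobs instead).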
## Rollback Strategy
Every production test needs an escape hatch:
```bash
#!/bin/bash
# rollback.sh

# Feature flag: instant
curl -X PATCH "$FLAGSMITH_API/features/new_checkout" \
  -d '{"enabled": false}'

# Kubernetes: seconds
kubectl rollout undo deployment/api

# Database migration: have a down migration ready
alembic downgrade -1
```
Practice rollbacks. If you’ve never done it, you’ll fumble when it matters.
## What to Test in Production
| Test Type | Method | Risk Level |
|---|---|---|
| Performance | Canary + metrics | Low |
| Feature correctness | Feature flags + internal users | Low |
| Integration behavior | Shadow traffic | Very low |
| Resilience | Chaos engineering (controlled) | Medium |
| Full user flow | Synthetic monitoring | Very low |
## What NOT to Test in Production
- Untested database migrations (test in staging first)
- Security-sensitive changes without review
- Anything without rollback capability
- Changes to payment processing (use sandbox first)
## The Rule
Deploy to production often. Release to users carefully.
Production testing isn’t reckless—it’s realistic. Staging gives you confidence in controlled conditions. Production gives you confidence in the real world. You need both.