“We don’t test in production” sounds responsible until you realize: production is the only environment that’s actually production. Staging lies to you. Here’s how to test in production safely.
## Why Staging Fails
Staging environments differ from production in ways that matter:
- Data: Sanitized, outdated, or synthetic
- Scale: A fraction of production load, often none at all
- Integrations: Sandbox APIs with different behavior
- Users: Developers clicking around, not real usage patterns
- Infrastructure: Smaller instances, shared resources
That bug that only appears under real load with real data? Staging won’t catch it.
## The Mental Shift
Testing in production doesn’t mean “YOLO deploy and hope.” It means:
- Deploy dark: Code is in production but not active
- Expose gradually: Increase exposure in controlled steps
- Observe aggressively: Know immediately when something breaks
- Rollback instantly: One button to undo
## Feature Flags: Deploy Without Releasing
Separate deployment from release:
```python
from flagsmith import Flagsmith

flags = Flagsmith(environment_key=FLAGSMITH_KEY)

@app.get("/checkout")
def checkout(user_id: str):
    if flags.get_identity_flags(user_id).is_feature_enabled("new_checkout"):
        return new_checkout_flow()
    else:
        return legacy_checkout_flow()
```
Now you can:
- Deploy new checkout to production (dark)
- Enable for internal users only
- Enable for 1% of users
- Monitor for errors
- Gradually increase to 100%
- Or kill it instantly if something breaks
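The "1% of users" step only works if bucketing is deterministic: a user who gets the new checkout today must still get it tomorrow. Flagsmith handles this for you; if you ever roll your own, the core idea is a stable hash. A minimal sketch (illustrative, not Flagsmith's actual algorithm):

```python
import hashlib

def in_rollout(user_id: str, feature: str, percentage: float) -> bool:
    """Deterministically bucket a user into [0, 100) using a stable hash.

    The same user always lands in the same bucket, so raising the
    percentage only ever adds users -- nobody flips between variants.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = (int(digest[:8], 16) % 10000) / 100  # 0.00 .. 99.99
    return bucket < percentage
```

Because the hash includes the feature name, a user enabled for `new_checkout` isn't automatically enabled for every other experiment.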
## Canary Deployments
Route a small percentage of traffic to new code:
```yaml
# Kubernetes canary deployment
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  # 10% of traffic
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary
                port:
                  number: 80
```
Monitor the canary separately. If error rates spike, route traffic back to stable.
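The "route traffic back" decision can itself be automated: compare the canary's error rate against the stable baseline and roll back when it degrades. A minimal sketch of such a policy; the thresholds and function name are illustrative assumptions, not from any particular tool:

```python
def canary_healthy(canary_errors: int, canary_total: int,
                   stable_errors: int, stable_total: int,
                   max_ratio: float = 2.0, min_requests: int = 100) -> bool:
    """Decide whether the canary should keep receiving traffic.

    Waits for a minimum sample size before judging, then compares the
    canary's error rate against the stable baseline with some slack.
    """
    if canary_total < min_requests:
        return True  # not enough data to judge yet
    canary_rate = canary_errors / canary_total
    stable_rate = stable_errors / max(stable_total, 1)
    # Alert only if the canary errors at max_ratio times the baseline
    # (with a small absolute floor so a near-zero baseline isn't a trap)
    return canary_rate <= max(stable_rate * max_ratio, 0.001)
```

Run this on a timer against your metrics store; a `False` result flips the canary weight back to zero.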
## Observability: Your Safety Net
You can’t test in production without seeing what’s happening:
```python
from prometheus_client import Counter, Histogram
import structlog

logger = structlog.get_logger()

checkout_attempts = Counter(
    'checkout_attempts_total',
    'Checkout attempts',
    ['variant', 'status']
)

checkout_duration = Histogram(
    'checkout_duration_seconds',
    'Checkout duration',
    ['variant']
)

def new_checkout_flow():
    with checkout_duration.labels(variant='new').time():
        try:
            result = process_new_checkout()
            checkout_attempts.labels(variant='new', status='success').inc()
            logger.info("checkout_completed", variant="new", order_id=result.id)
            return result
        except Exception as e:
            checkout_attempts.labels(variant='new', status='error').inc()
            logger.error("checkout_failed", variant="new", error=str(e))
            raise
```
Dashboard this:
- Success rate by variant
- Latency percentiles by variant
- Error types and frequencies
- Business metrics (conversion rate, cart value)
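If the Prometheus stack isn't in place yet, the same per-variant rollup can be computed from raw events. A minimal sketch using nearest-rank p95; the event shape and function name are illustrative assumptions:

```python
import math
from collections import defaultdict

def summarize(events):
    """Aggregate (variant, status, duration_seconds) events into
    per-variant success rate and p95 latency."""
    grouped = defaultdict(list)
    for variant, status, duration in events:
        grouped[variant].append((status, duration))

    stats = {}
    for variant, rows in grouped.items():
        durations = sorted(d for _, d in rows)
        rank = max(0, math.ceil(0.95 * len(durations)) - 1)  # nearest-rank p95
        stats[variant] = {
            "success_rate": sum(s == "success" for s, _ in rows) / len(rows),
            "p95_seconds": durations[rank],
        }
    return stats
```

Comparing `stats["new"]` against `stats["legacy"]` side by side is exactly what the dashboard should show.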
## Chaos Engineering
Intentionally break things to test resilience:
```python
# Chaos middleware (ASGI)
import asyncio
import random

class ChaosMiddleware:
    def __init__(self, app, failure_rate=0.01):
        self.app = app
        self.failure_rate = failure_rate

    async def __call__(self, scope, receive, send):
        if random.random() < self.failure_rate:
            # Inject artificial latency
            await asyncio.sleep(random.uniform(0.5, 2.0))
        await self.app(scope, receive, send)

# Only enable chaos where you've deliberately opted in -- e.g. a deployment
# reserved for chaos testing. (Per-user targeting would go inside __call__,
# after identifying the user from the request.)
if CHAOS_TESTING_ENABLED:
    app = ChaosMiddleware(app, failure_rate=0.05)
```
Or use tools like Gremlin, Chaos Monkey, or Litmus for infrastructure-level chaos.
## Shadow Traffic
Test new code with production traffic without affecting users:
```python
async def handle_request(request):
    # Primary path - what users see
    response = await legacy_handler(request)
    # Shadow path - async, fire-and-forget
    asyncio.create_task(shadow_handler(request, response))
    return response

async def shadow_handler(request, legacy_response):
    """Process request with new code, log differences, don't return to user."""
    try:
        new_response = await new_handler(request)
        # Compare responses
        if new_response != legacy_response:
            logger.warning(
                "shadow_mismatch",
                request_id=request.id,
                legacy=legacy_response,
                new=new_response,
            )
    except Exception as e:
        logger.error("shadow_error", error=str(e))
```
Real production traffic, zero user impact.
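In practice a naive `!=` comparison is too noisy: timestamps, request IDs, and other volatile fields differ on every pair of responses. A field-aware diff cuts the false positives; this is a minimal sketch, and the default ignore list is an assumption you'd tune per endpoint:

```python
def diff_responses(legacy: dict, new: dict,
                   ignore=("timestamp", "request_id")) -> list:
    """Compare two response payloads, skipping fields expected to differ.

    Returns a list of (key, legacy_value, new_value) mismatches;
    an empty list means the responses agree on everything that matters.
    """
    mismatches = []
    for key in sorted(set(legacy) | set(new)):
        if key in ignore:
            continue
        a, b = legacy.get(key), new.get(key)
        if a != b:
            mismatches.append((key, a, b))
    return mismatches
```

Log the mismatch list instead of the raw responses and the shadow_mismatch events become directly actionable.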
## Synthetic Monitoring
Continuously test production with fake requests:
```python
import time

# Scheduled job every minute
async def synthetic_checkout_test():
    test_user = get_synthetic_test_user()
    test_product = get_test_product()
    try:
        start = time.time()
        result = await checkout_api.create_order(
            user_id=test_user.id,
            product_id=test_product.id,
            is_synthetic=True  # Mark for filtering
        )
        duration = time.time() - start
        metrics.record("synthetic.checkout.success", duration)
        # Clean up test order
        await checkout_api.cancel_order(result.order_id)
    except Exception as e:
        metrics.record("synthetic.checkout.failure")
        alert_oncall(f"Synthetic checkout failed: {e}")
```
Know about outages before users report them.
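The probe also needs a scheduler that survives its own failures: one bad run must not kill the loop. A minimal asyncio sketch; the injectable `sleep` parameter is an assumption added to make the loop testable:

```python
import asyncio

async def run_synthetic_probe(probe, interval=60.0, iterations=None,
                              sleep=asyncio.sleep):
    """Run a probe coroutine on a fixed schedule.

    Exceptions are recorded as failures and the next run still happens
    on time. With iterations=None the loop runs forever.
    """
    results = []
    count = 0
    while iterations is None or count < iterations:
        try:
            results.append(("success", await probe()))
        except Exception as e:
            results.append(("failure", str(e)))
        count += 1
        if iterations is None or count < iterations:
            await sleep(interval)
    return results
```

In production you'd run `run_synthetic_probe(synthetic_checkout_test)` as a long-lived task (or hand the scheduling to cron or Kubernetes CronJobs instead).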
## Rollback Strategy
Every production test needs an escape hatch:
```bash
#!/bin/bash
# rollback.sh

# Feature flag: instant
curl -X PATCH "$FLAGSMITH_API/features/new_checkout" \
  -d '{"enabled": false}'

# Kubernetes: seconds
kubectl rollout undo deployment/api

# Database migration: have a down migration ready
alembic downgrade -1
```
Practice rollbacks. If you’ve never done it, you’ll fumble when it matters.
## What to Test in Production
| Test Type | Method | Risk Level |
|---|---|---|
| Performance | Canary + metrics | Low |
| Feature correctness | Feature flags + internal users | Low |
| Integration behavior | Shadow traffic | Very low |
| Resilience | Chaos engineering (controlled) | Medium |
| Full user flow | Synthetic monitoring | Very low |
## What NOT to Test in Production
- Untested database migrations (test in staging first)
- Security-sensitive changes without review
- Anything without rollback capability
- Changes to payment processing (use sandbox first)
## The Rule
Deploy to production often. Release to users carefully.
Production testing isn’t reckless—it’s realistic. Staging gives you confidence in controlled conditions. Production gives you confidence in the real world. You need both.