Testing in Production: Because Staging Never Tells the Whole Story

“We don’t test in production” sounds responsible until you realize: production is the only environment that’s actually production. Staging lies to you. Here’s how to test in production safely.

Why Staging Fails

Staging environments differ from production in ways that matter:

- Data: Sanitized, outdated, or synthetic
- Scale: 1% of production traffic
- Integrations: Sandbox APIs with different behavior
- Users: Developers clicking around, not real usage patterns
- Infrastructure: Smaller instances, shared resources

That bug that only appears under real load with real data? Staging won’t catch it. ...
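One common building block for safe production testing is tagging synthetic traffic so downstream services and analytics can filter it out. A minimal sketch — the `X-Test-Traffic` header name and marker values are illustrative assumptions, not from the article:

```python
# Hypothetical convention: internal test clients send X-Test-Traffic
# with a known marker value; real user traffic never carries it.

TEST_TRAFFIC_HEADER = "X-Test-Traffic"
ALLOWED_MARKERS = {"synthetic-probe", "smoke-test"}

def is_test_traffic(headers: dict) -> bool:
    """Return True if the request is tagged as synthetic test traffic."""
    return headers.get(TEST_TRAFFIC_HEADER, "") in ALLOWED_MARKERS

def record_metric(headers: dict, metrics: list) -> None:
    """Record a request in business analytics only for real user traffic."""
    if not is_test_traffic(headers):
        metrics.append("real_request")
```

Synthetic probes exercise the full production path, while the marker keeps them out of business metrics and (if you choose) out of persistent side effects.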

March 1, 2026 · 4 min · 836 words · Rob Washington

curl Patterns for API Development and Testing

curl is the universal HTTP client. Every developer should know it well—for debugging, testing, and automation.

Basic Requests

```shell
# GET (default)
curl https://api.example.com/users

# POST with data
curl -X POST https://api.example.com/users \
  -d '{"name":"alice"}'

# With headers
curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer token123" \
  https://api.example.com/users
```

JSON APIs

```shell
# POST JSON (sets Content-Type automatically)
curl -X POST https://api.example.com/users \
  --json '{"name":"alice","email":"alice@example.com"}'

# Or explicitly
curl -X POST https://api.example.com/users \
  -H "Content-Type: application/json" \
  -d '{"name":"alice"}'

# Pretty print response
curl -s https://api.example.com/users | jq .

# From file
curl -X POST https://api.example.com/users \
  -H "Content-Type: application/json" \
  -d @payload.json
```

HTTP Methods

```shell
curl -X GET https://api.example.com/users/1
curl -X POST https://api.example.com/users -d '...'
curl -X PUT https://api.example.com/users/1 -d '...'
curl -X PATCH https://api.example.com/users/1 -d '...'
curl -X DELETE https://api.example.com/users/1
curl -I https://api.example.com/users          # Headers only (-X HEAD would hang waiting for a body)
curl -X OPTIONS https://api.example.com/users  # CORS preflight
```

Authentication

```shell
# Bearer token
curl -H "Authorization: Bearer eyJhbGc..." \
  https://api.example.com/me

# Basic auth
curl -u username:password https://api.example.com/users
# Or
curl -H "Authorization: Basic $(echo -n 'user:pass' | base64)" \
  https://api.example.com/users

# API key in header
curl -H "X-API-Key: secret123" https://api.example.com/data

# API key in query
curl "https://api.example.com/data?api_key=secret123"
```

Response Inspection

```shell
# Show headers (HEAD request)
curl -I https://api.example.com/users

# Include headers with body
curl -i https://api.example.com/users

# Verbose (see full request/response)
curl -v https://api.example.com/users

# Just status code
curl -s -o /dev/null -w "%{http_code}" https://api.example.com/users

# Multiple stats
curl -s -o /dev/null \
  -w "code: %{http_code}\ntime: %{time_total}s\nsize: %{size_download} bytes\n" \
  https://api.example.com/users
```

File Uploads

```shell
# Form upload
curl -X POST https://api.example.com/upload \
  -F "file=@document.pdf"

# Multiple files
curl -X POST https://api.example.com/upload \
  -F "file1=@doc1.pdf" \
  -F "file2=@doc2.pdf"

# With additional form fields
curl -X POST https://api.example.com/upload \
  -F "file=@photo.jpg" \
  -F "description=Profile photo" \
  -F "public=true"

# Binary data
curl -X POST https://api.example.com/upload \
  -H "Content-Type: application/octet-stream" \
  --data-binary @file.bin
```

Handling Redirects

```shell
# Follow redirects
curl -L https://example.com/shortened-url

# Show redirect chain (case-insensitive match for the Location header)
curl -L -v https://example.com/shortened-url 2>&1 | grep -i "< location"

# Limit redirects
curl -L --max-redirs 3 https://example.com/url
```

Timeouts and Retries

```shell
# Connection timeout (seconds)
curl --connect-timeout 5 https://api.example.com/

# Max time for entire operation
curl --max-time 30 https://api.example.com/slow-endpoint

# Retry on transient failures
curl --retry 3 --retry-delay 2 https://api.example.com/

# Retry on all errors, not just transient ones
curl --retry 3 --retry-all-errors https://api.example.com/
```

Saving Output

```shell
# Save to file
curl -o response.json https://api.example.com/users

# Use remote filename
curl -O https://example.com/file.zip

# Save headers separately
curl -D headers.txt -o body.json https://api.example.com/users

# Append to file
curl https://api.example.com/users >> all_responses.json
```

Cookie Handling

```shell
# Send cookies
curl -b "session=abc123; token=xyz" https://api.example.com/

# Save cookies to file
curl -c cookies.txt https://api.example.com/login \
  -d "user=admin&pass=secret"

# Use saved cookies
curl -b cookies.txt https://api.example.com/dashboard

# Both save and send
curl -b cookies.txt -c cookies.txt https://api.example.com/
```

Testing Webhooks

```shell
# Simulate GitHub webhook
curl -X POST http://localhost:3000/webhook \
  -H "Content-Type: application/json" \
  -H "X-GitHub-Event: push" \
  -H "X-Hub-Signature-256: sha256=..." \
  -d @github-payload.json

# Stripe webhook
curl -X POST http://localhost:3000/stripe/webhook \
  -H "Content-Type: application/json" \
  -H "Stripe-Signature: t=...,v1=..." \
  -d @stripe-event.json
```

SSL/TLS Options

```shell
# Skip certificate verification (development only!)
curl -k https://self-signed.example.com/

# Use specific CA bundle
curl --cacert /path/to/ca-bundle.crt https://api.example.com/

# Client certificate
curl --cert client.crt --key client.key https://mtls.example.com/
```

Proxy Configuration

```shell
# HTTP proxy
curl -x http://proxy:8080 https://api.example.com/

# SOCKS5 proxy
curl --socks5 localhost:1080 https://api.example.com/

# No proxy for specific hosts
curl --noproxy "localhost,*.internal" https://api.example.com/
```

Performance Testing

```shell
# Time breakdown
curl -s -o /dev/null -w "\
namelookup:    %{time_namelookup}s\n\
connect:       %{time_connect}s\n\
appconnect:    %{time_appconnect}s\n\
pretransfer:   %{time_pretransfer}s\n\
redirect:      %{time_redirect}s\n\
starttransfer: %{time_starttransfer}s\n\
total:         %{time_total}s\n" \
  https://api.example.com/

# Loop for load testing (basic)
for i in {1..100}; do
  curl -s -o /dev/null -w "%{http_code} %{time_total}\n" \
    https://api.example.com/
done
```

Config Files

Create ~/.curlrc for defaults: ...
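As an illustration of what such defaults might look like (these entries are an assumption, not the article's — a curlrc lists one long-form option per line, with an optional `=`):

```
# ~/.curlrc — applied to every curl invocation
silent
show-error
connect-timeout = 5
max-time = 30
```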

February 28, 2026 · 6 min · 1116 words · Rob Washington

Testing Strategies: The Pyramid That Actually Works

Everyone agrees testing is important. Few agree on how much, what kind, and where to focus. The testing pyramid provides a framework: many fast unit tests, fewer integration tests, even fewer end-to-end tests. But the pyramid only works if you understand why each layer exists.

The Pyramid

- E2E tests (few, slow, expensive): Test the full system like a user would
- Integration tests (some, medium speed): Test components working together
- Unit tests (many, fast, cheap): Test individual functions/classes in isolation

...
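As a sketch of how the layers divide responsibility (the `apply_discount`/`checkout_total` names are hypothetical, not from the article): a unit test pins down one function's logic in isolation, while an integration test checks components composed together.

```python
# Hypothetical example of unit-level vs integration-level coverage.

def apply_discount(price: float, percent: float) -> float:
    """Pure pricing rule — ideal territory for many fast unit tests."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

def checkout_total(items: list, discount_percent: float) -> float:
    """Composes cart summation with the pricing rule — an integration seam."""
    return apply_discount(sum(items), discount_percent)

# Unit test: one function, no collaborators
def test_apply_discount():
    assert apply_discount(100.0, 10) == 90.0

# Integration test: two components working together
def test_checkout_total():
    assert checkout_total([20.0, 30.0], 50) == 25.0
```

The same split scales up: in a real codebase the integration layer would exercise a database or HTTP boundary rather than a second pure function.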

February 24, 2026 · 7 min · 1367 words · Rob Washington

Infrastructure Testing: Validating Your IaC Before Production

You test your application code. Why not your infrastructure code? Infrastructure as Code (IaC) has the same failure modes as any software: bugs, regressions, unintended side effects. Yet most teams treat Terraform and Ansible like configuration files rather than code that deserves tests.

Why Infrastructure Testing Matters

A Terraform plan looks correct until it:

- Creates a security group that’s too permissive
- Deploys to the wrong availability zone
- Sets instance types that exceed your budget
- Breaks networking in ways that only manifest at runtime

Manual review catches some issues. Automated testing catches more. ...

February 16, 2026 · 6 min · 1115 words · Rob Washington

Chaos Engineering: Breaking Your Systems to Make Them Stronger

You don’t know if your system handles failures gracefully until failures happen. Chaos engineering lets you find out on your terms—in controlled conditions, during business hours, with engineers ready to respond.

The Principles

1. Define steady state — What does “healthy” look like?
2. Hypothesize — “The system will continue serving traffic if one pod dies”
3. Inject failure — Kill the pod
4. Observe — Did steady state hold?
5. Learn — Fix what broke, update runbooks

Start Simple: Kill Things

Random Pod Termination

```python
# chaos_pod_killer.py
import logging
import random
import time
from datetime import datetime

from kubernetes import client, config

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ChaosPodKiller:
    def __init__(self, namespace: str, label_selector: str):
        config.load_incluster_config()  # or load_kube_config() for local
        self.v1 = client.CoreV1Api()
        self.namespace = namespace
        self.label_selector = label_selector

    def get_targets(self) -> list:
        """Get pods matching selector."""
        pods = self.v1.list_namespaced_pod(
            namespace=self.namespace,
            label_selector=self.label_selector
        )
        return [pod.metadata.name for pod in pods.items
                if pod.status.phase == "Running"]

    def kill_random_pod(self, dry_run: bool = True) -> str:
        """Kill a random pod from targets."""
        targets = self.get_targets()
        if len(targets) <= 1:
            logger.warning("Only one pod running, skipping kill")
            return None

        victim = random.choice(targets)
        if dry_run:
            logger.info(f"DRY RUN: Would delete pod {victim}")
        else:
            logger.info(f"Deleting pod {victim}")
            self.v1.delete_namespaced_pod(
                name=victim,
                namespace=self.namespace,
                grace_period_seconds=0
            )
        return victim

    def run_experiment(self, duration_minutes: int = 30,
                       interval_seconds: int = 300):
        """Run chaos experiment for duration."""
        end_time = datetime.now().timestamp() + (duration_minutes * 60)
        logger.info(f"Starting chaos experiment for {duration_minutes} minutes")

        while datetime.now().timestamp() < end_time:
            victim = self.kill_random_pod(dry_run=False)
            if victim:
                logger.info(f"Killed {victim}, waiting {interval_seconds}s")
            time.sleep(interval_seconds)

        logger.info("Chaos experiment completed")

# Usage
if __name__ == "__main__":
    killer = ChaosPodKiller(
        namespace="production",
        label_selector="app=api-service"
    )
    killer.run_experiment(duration_minutes=30, interval_seconds=300)
```

Kubernetes CronJob for Continuous Chaos

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: chaos-pod-killer
  namespace: chaos-system
spec:
  schedule: "*/10 9-17 * * 1-5"  # Every 10 min, 9-5, weekdays only
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: chaos-runner
          containers:
            - name: chaos
              image: chaos-toolkit:latest
              command:
                - python
                - /scripts/chaos_pod_killer.py
              env:
                - name: TARGET_NAMESPACE
                  value: "production"
                - name: LABEL_SELECTOR
                  value: "chaos-enabled=true"
          restartPolicy: Never
```

Network Chaos

Introduce Latency

```yaml
# Using Chaos Mesh
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  delay:
    latency: "100ms"
    jitter: "20ms"
    correlation: "50"
  duration: "5m"
  scheduler:
    cron: "@every 1h"
```

Packet Loss

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss
spec:
  action: loss
  mode: one
  selector:
    labelSelectors:
      app: payment-service
  loss:
    loss: "10"  # 10% packet loss
    correlation: "50"
  duration: "2m"
```

Network Partition

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api-service
  direction: both
  target:
    selector:
      namespaces:
        - production
      labelSelectors:
        app: database
  duration: "30s"
```

Resource Stress

CPU Stress

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: api-service
  stressors:
    cpu:
      workers: 2
      load: 80  # 80% CPU usage
  duration: "5m"
```

Memory Stress

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: api-service
  stressors:
    memory:
      workers: 1
      size: "512MB"  # Allocate 512MB
  duration: "3m"
```

Application-Level Chaos

HTTP Fault Injection with Istio

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-chaos
spec:
  hosts:
    - api-service
  http:
    - fault:
        abort:
          percentage:
            value: 5
          httpStatus: 503
        delay:
          percentage:
            value: 10
          fixedDelay: 2s
      route:
        - destination:
            host: api-service
```

Custom Failure Injection

```python
# chaos_middleware.py
import os
import random
import time
from functools import wraps

import requests

CHAOS_ENABLED = os.getenv("CHAOS_ENABLED", "false").lower() == "true"
CHAOS_FAILURE_RATE = float(os.getenv("CHAOS_FAILURE_RATE", "0.0"))
CHAOS_LATENCY_MS = int(os.getenv("CHAOS_LATENCY_MS", "0"))

def chaos_middleware(f):
    """Inject chaos into function calls."""
    @wraps(f)
    def wrapper(*args, **kwargs):
        if not CHAOS_ENABLED:
            return f(*args, **kwargs)

        # Random failure
        if random.random() < CHAOS_FAILURE_RATE:
            raise Exception("Chaos injection: random failure")

        # Random latency
        if CHAOS_LATENCY_MS > 0:
            delay = random.randint(0, CHAOS_LATENCY_MS) / 1000
            time.sleep(delay)

        return f(*args, **kwargs)
    return wrapper

# Usage (PAYMENT_URL defined elsewhere)
@chaos_middleware
def call_payment_service(order):
    return requests.post(PAYMENT_URL, json=order.to_dict())
```

Experiment Framework

```python
# chaos_experiment.py
import time
from dataclasses import dataclass
from typing import Callable, List

import requests

@dataclass
class SteadyStateCheck:
    name: str
    check: Callable[[], bool]

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str
    steady_state_checks: List[SteadyStateCheck]
    inject_failure: Callable[[], None]
    rollback: Callable[[], None]
    duration_seconds: int = 60

def run_experiment(experiment: ChaosExperiment) -> dict:
    """Run a chaos experiment with proper controls."""
    result = {
        "name": experiment.name,
        "hypothesis": experiment.hypothesis,
        "success": False,
        "steady_state_before": {},
        "steady_state_after": {},
        "errors": []
    }

    # 1. Verify steady state before
    print("Checking steady state before experiment...")
    for check in experiment.steady_state_checks:
        try:
            passed = check.check()
            result["steady_state_before"][check.name] = passed
            if not passed:
                result["errors"].append(f"Pre-check failed: {check.name}")
                return result
        except Exception as e:
            result["errors"].append(f"Pre-check error: {check.name}: {e}")
            return result

    # 2. Inject failure
    print(f"Injecting failure: {experiment.name}")
    try:
        experiment.inject_failure()
    except Exception as e:
        result["errors"].append(f"Injection failed: {e}")
        experiment.rollback()
        return result

    # 3. Wait for duration
    print(f"Waiting {experiment.duration_seconds}s...")
    time.sleep(experiment.duration_seconds)

    # 4. Check steady state during/after
    print("Checking steady state after experiment...")
    for check in experiment.steady_state_checks:
        try:
            passed = check.check()
            result["steady_state_after"][check.name] = passed
        except Exception as e:
            result["steady_state_after"][check.name] = False
            result["errors"].append(f"Post-check error: {check.name}: {e}")

    # 5. Rollback
    print("Rolling back...")
    experiment.rollback()

    # 6. Evaluate hypothesis
    result["success"] = all(result["steady_state_after"].values())
    return result

# Example experiment (prometheus client assumed to be configured elsewhere)
def check_api_responding():
    response = requests.get("http://api-service/health", timeout=5)
    return response.status_code == 200

def check_error_rate_low():
    # Query Prometheus
    result = prometheus.query(
        'sum(rate(http_requests_total{status=~"5.."}[1m])) '
        '/ sum(rate(http_requests_total[1m]))'
    )
    return float(result) < 0.01

def kill_one_api_pod():
    killer = ChaosPodKiller("production", "app=api-service")
    killer.kill_random_pod(dry_run=False)

def noop_rollback():
    pass  # Kubernetes will restart the pod

experiment = ChaosExperiment(
    name="api-pod-failure",
    hypothesis="API continues serving traffic when one pod dies",
    steady_state_checks=[
        SteadyStateCheck("api_responding", check_api_responding),
        SteadyStateCheck("error_rate_low", check_error_rate_low),
    ],
    inject_failure=kill_one_api_pod,
    rollback=noop_rollback,
    duration_seconds=60
)

result = run_experiment(experiment)
print(f"Experiment {'PASSED' if result['success'] else 'FAILED'}")
```

Game Days

Schedule regular chaos exercises: ...

February 11, 2026 · 8 min · 1552 words · Rob Washington

Testing Infrastructure: From Terraform Plans to Production Validation

Infrastructure code deserves the same testing rigor as application code. A typo in Terraform can delete a database. An untested Ansible role can break production. Let’s build confidence with proper testing.

The Testing Pyramid for Infrastructure

- E2E tests (full deploys)
- Integration tests (real cloud resources)
- Unit tests (static checks, plan validation)

Unit Testing: Static Analysis

Terraform Validation

```shell
# Built-in validation
terraform init
terraform validate
terraform fmt -check

# Custom validation rules
terraform plan -out=tfplan
terraform show -json tfplan > plan.json
```

```python
# tests/test_terraform_plan.py
import json
import pytest

@pytest.fixture
def plan():
    with open('plan.json') as f:
        return json.load(f)

def test_no_resources_destroyed(plan):
    """Ensure no resources are being destroyed."""
    changes = plan.get('resource_changes', [])
    destroyed = [c for c in changes
                 if 'delete' in c.get('change', {}).get('actions', [])]
    assert len(destroyed) == 0, \
        f"Resources being destroyed: {[d['address'] for d in destroyed]}"

def test_no_public_s3_buckets(plan):
    """Ensure S3 buckets aren't public."""
    changes = plan.get('resource_changes', [])
    for change in changes:
        if change['type'] == 'aws_s3_bucket':
            after = change.get('change', {}).get('after', {})
            acl = after.get('acl', 'private')
            assert acl == 'private', \
                f"Bucket {change['address']} has public ACL: {acl}"

def test_instances_have_tags(plan):
    """Ensure EC2 instances have required tags."""
    required_tags = {'Environment', 'Owner', 'Project'}
    changes = plan.get('resource_changes', [])
    for change in changes:
        if change['type'] == 'aws_instance':
            after = change.get('change', {}).get('after', {})
            tags = set(after.get('tags', {}).keys())
            missing = required_tags - tags
            assert not missing, \
                f"Instance {change['address']} missing tags: {missing}"
```

Policy as Code with OPA

```rego
# policy/terraform.rego
package terraform

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_security_group_rule"
    resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
    resource.change.after.from_port == 22
    msg := sprintf("SSH open to world in %v", [resource.address])
}

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_db_instance"
    resource.change.after.publicly_accessible == true
    msg := sprintf("RDS instance %v is publicly accessible", [resource.address])
}
```

```shell
# Run OPA checks
terraform show -json tfplan | opa eval -i - -d policy/ "data.terraform.deny"
```

Integration Testing with Terratest

Terratest deploys real infrastructure, validates it, then tears it down: ...

February 11, 2026 · 7 min · 1358 words · Rob Washington

Feature Flags: Ship Fast, Control Risk, and Test in Production

Deploying code and releasing features don’t have to be the same thing. Feature flags let you ship code to production while controlling who sees it, when they see it, and how quickly you roll it out. This separation is transformative for both velocity and safety.

Why Feature Flags?

Traditional deployment: code goes live, everyone gets it immediately, and if something breaks, you redeploy. With feature flags:

- Deploy anytime — code exists in production but isn’t active
- Release gradually — 1% of users, then 10%, then 50%, then everyone
- Instant rollback — flip a switch, no deployment needed
- Test in production — real traffic, real conditions, controlled exposure
- A/B testing — compare variants with actual user behavior

Basic Implementation

Start simple. A feature flag is just a conditional: ...
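"A feature flag is just a conditional" can be sketched like this (a hedged example — the hash-based bucketing and names are illustrative, not the article's implementation). Hashing the user ID makes the rollout deterministic: the same user stays in or out of the flag as the percentage grows.

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into [0, 100) and gate on the rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# The flag is just a conditional at the call site:
def render_dashboard(user_id: str) -> str:
    if flag_enabled("new-dashboard", user_id, rollout_percent=10):
        return "new dashboard"
    return "old dashboard"
```

Raising `rollout_percent` from 10 to 50 only adds users to the enabled set; nobody who already saw the new dashboard flips back, which keeps gradual rollouts stable.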

February 11, 2026 · 7 min · 1365 words · Rob Washington