Devops

Feature Flags: Deploy Doesn't Mean Release

Separating deployment from release is one of the best things you can do for your team’s sanity. Feature flags make this possible. The Core Idea 1 2 3 4 5 6 7 8 9 10 11 12 # Without flags: deploy = release def checkout(): process_payment() send_confirmation() # With flags: deploy != release def checkout(): process_payment() if feature_enabled("new_confirmation_email"): send_new_confirmation() # Deployed but not released else: send_confirmation() Code ships to production. Flag decides if users see it. ...

Structured Logging: Stop Parsing Log Lines

Unstructured logs are technical debt. Structured logs are queryable, parseable, and actually useful when things break. The Problem # 2 2 2 0 0 0 U 2 2 2 n 6 6 6 s - - - t 0 0 0 r 2 2 2 u - - - c 2 2 2 t 8 8 8 u r 1 1 1 e 0 0 0 d : : : : 1 1 1 5 5 5 g : : : o 2 2 2 o 3 4 5 d I E I l N R N u F R F c O O O k R U R p s F e a e a q r r i u s l e i a e s n l d t g i c t c t e o o h m i l p p s o r l g o e g c t e e e d s d s i i n o n r f d 2 r e 3 o r 4 m m 1 s 1 2 9 3 2 4 . 5 1 : 6 8 c . o 1 n . n 1 e c t i o n t i m e o u t Regex hell when you need to extract user, IP, order ID, or duration. ...

The Twelve-Factor App: What Actually Matters

The Twelve-Factor methodology is from 2011 but remains relevant. Here’s what matters in practice, and what’s become outdated. The Factors, Ranked by Impact Critical (Ignore at Your Peril) III. Config in Environment 1 2 3 4 5 # Bad DATABASE_URL = "postgres://localhost/myapp" # hardcoded # Good DATABASE_URL = os.environ["DATABASE_URL"] Config includes credentials, per-environment values, and feature flags. Environment variables work everywhere: containers, serverless, bare metal. VI. Stateless Processes 1 2 3 4 5 6 7 8 9 10 11 # Bad: storing session in memory sessions = {} @app.post("/login") def login(user): sessions[user.id] = {"logged_in": True} # Dies with process # Good: external session store @app.post("/login") def login(user): redis.set(f"session:{user.id}", {"logged_in": True}) If your process dies, can another pick up the work? Statelessness enables horizontal scaling, rolling deploys, and crash recovery. ...

Graceful Shutdown: Stop Dropping Requests

Every deployment is a potential outage if your application doesn’t shut down gracefully. Here’s how to do it right. The Problem 1 2 3 4 5 . . . . . K P Y I U u o o n s b d u - e e r f r r i l s n s a i e p g s t r p h e e e t e s m e o x r e s v i e r e e t q r n d s u o d e r s f i s s r m t S o m s d I m e u G d g r T s i e i E e a t n R r t g M v e c i l o " c y n z e n e e r e c o n t - d i d p o o o n w i n n r t t e i s s m e e t " d e p l o y s The fix: handle SIGTERM, finish existing work, then exit. ...

Linux Performance Troubleshooting: The First Five Minutes

When a server is slow and people are yelling, you need a systematic approach. Here’s what to run in the first five minutes. The Checklist 1 2 3 4 5 6 7 8 uptime dmesg | tail vmstat 1 5 mpstat -P ALL 1 3 pidstat 1 3 iostat -xz 1 3 free -h sar -n DEV 1 3 Let’s break down what each tells you. 1. uptime 1 2 $ uptime 16:30:01 up 45 days, 3:22, 2 users, load average: 8.42, 6.31, 5.12 Load averages: 1-minute, 5-minute, 15-minute. ...

SOPS: Git-Friendly Secrets Management

The eternal problem: you need secrets in your repo for deployment, but you can’t commit plaintext credentials. SOPS solves this elegantly by encrypting only the values while leaving keys readable. Why SOPS? Traditional approaches: Environment variables: Work, but no version control Vault: Great, but complex for small teams AWS Secrets Manager: Vendor lock-in, API calls at runtime .env files in .gitignore: Hope nobody commits them SOPS encrypts secrets in place. You commit encrypted files. CI/CD decrypts at deploy time. Full audit trail in git. ...

Prometheus Alerting Rules That Won't Wake You Up at 3am

The difference between good alerting and bad alerting is whether you still trust your pager after six months. Here’s how to build alerts that matter. The Golden Rule: Alert on Symptoms, Not Causes 1 2 3 4 5 6 7 8 9 10 11 12 13 # Bad: alerts on a cause - alert: HighCPU expr: node_cpu_seconds_total > 80 for: 5m # Good: alerts on user-facing symptom - alert: HighLatency expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5 for: 5m labels: severity: warning annotations: summary: "95th percentile latency above 500ms" Users don’t care if CPU is high. They care if the site is slow. ...

GitHub Actions Patterns for Practical CI/CD

GitHub Actions has become the default CI/CD for many teams. Here are patterns I’ve seen work well in production, and a few anti-patterns to avoid. The Foundation: A Reusable Test Workflow 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 name: Test on: push: branches: [main] pull_request: branches: [main] jobs: test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Setup Node uses: actions/setup-node@v4 with: node-version: '20' cache: 'npm' - run: npm ci - run: npm test Key details: ...

Kubernetes Troubleshooting Patterns for Production

Kubernetes hides complexity until something breaks. Then you need to know where to look. Here’s a systematic approach to debugging production issues. The Debugging Hierarchy Start broad, narrow down: Cluster level: Nodes healthy? Resources available? Namespace level: Deployments running? Services configured? Pod level: Containers starting? Logs clean? Container level: Process running? Resources sufficient? Quick Health Check 1 2 3 4 5 6 7 8 9 10 11 # Node status kubectl get nodes -o wide # All pods across namespaces kubectl get pods -A # Pods not running kubectl get pods -A | grep -v Running | grep -v Completed # Events (recent issues) kubectl get events -A --sort-by='.lastTimestamp' | tail -20 Pod Troubleshooting Pod States State Meaning Check Pending Can’t be scheduled Resources, node selectors, taints ContainerCreating Image pulling or volume mounting Image name, pull secrets, PVCs CrashLoopBackOff Container crashing repeatedly Logs, resource limits, probes ImagePullBackOff Can’t pull image Image name, registry auth Error Container exited with error Logs Pending Pods 1 2 3 4 5 6 7 8 9 10 11 12 13 14 # Why is it pending? kubectl describe pod my-pod # Look for: # - Insufficient cpu/memory # - No nodes match nodeSelector # - Taints not tolerated # - PVC not bound # Check node resources kubectl describe nodes | grep -A5 "Allocated resources" # Check PVC status kubectl get pvc CrashLoopBackOff 1 2 3 4 5 6 7 8 9 10 11 12 13 14 # Get logs from current container kubectl logs my-pod # Get logs from previous (crashed) container kubectl logs my-pod --previous # Get logs from specific container kubectl logs my-pod -c my-container # Follow logs kubectl logs -f my-pod # Last N lines kubectl logs --tail=100 my-pod Common causes: ...

Kubernetes Troubleshooting: A Practical Field Guide

When a Kubernetes deployment goes sideways at 3am, you need a systematic approach. Here’s the troubleshooting playbook I’ve developed from watching countless production incidents. The First Three Commands Before diving deep, these three commands tell you 80% of what you need: 1 2 3 4 5 6 7 8 # What's not running? kubectl get pods -A | grep -v Running | grep -v Completed # What happened recently? kubectl get events -A --sort-by='.lastTimestamp' | tail -20 # Resource pressure? kubectl top nodes Run these first. Always. ...