Devops

Monitoring Anti-Patterns: When Alerts Become Noise

Good monitoring saves you from outages. Bad monitoring causes them — by training your team to ignore alerts until something actually breaks. Here’s how to avoid the most common anti-patterns. Anti-Pattern 1: Alerting on Symptoms, Not Impact 1 2 3 4 5 6 # ❌ BAD: CPU is high - alert: HighCPU expr: node_cpu_usage > 80 for: 5m labels: severity: critical High CPU isn’t a problem. Slow responses are a problem. Users don’t care about your CPU graphs. ...

Container Image Security: Scanning Before You Ship

Every container image you deploy is a collection of dependencies you didn’t write. Some of those dependencies have known vulnerabilities. The question isn’t whether your images have CVEs — it’s whether you know about them before attackers do. The Problem A typical container image includes: A base OS (Alpine, Debian, Ubuntu) Language runtime (Python, Node, Go) Application dependencies (npm packages, pip modules) Your actual code Each layer can introduce vulnerabilities. That innocent FROM python:3.11 pulls in hundreds of packages you’ve never audited. ...

GitOps Workflow Patterns: Infrastructure as Pull Requests

GitOps sounds simple: put your infrastructure in Git, let a controller sync it to your cluster. In practice, there are a dozen ways to get it wrong. Here’s what works. The Core Principle Git is the source of truth. Not the cluster. Not a dashboard. Not someone’s kubectl session. D e v e l o p e r s → i n G g i l t ↑ e → s o C u o r n c t e r o l l e r → C l u s t e r If the cluster state doesn’t match Git, the controller fixes it. If someone manually changes the cluster, the controller reverts it. This is the contract. ...

Zero-Downtime Deployments: Strategies That Actually Work

“We’re deploying, please hold” is not an acceptable user experience. Whether you’re running a startup or enterprise infrastructure, users expect services to just work. Here’s how to ship code without the maintenance windows. The Goal: Invisible Deploys A zero-downtime deployment means users never notice you’re deploying. No error pages, no dropped connections, no “please refresh” messages. The old version serves traffic until the new version is proven healthy. Strategy 1: Rolling Deployments The simplest approach. Replace instances one at a time: ...

Ansible Idempotency Patterns: Write Playbooks That Don't Break Things

The promise of Ansible is simple: describe your desired state, run the playbook, and the system converges to that state. Run it again, nothing changes. That’s idempotency—and it’s harder to achieve than it sounds. Here’s how to write playbooks that won’t surprise you on the second run. The Problem: Commands That Lie The command and shell modules are where idempotency goes to die: 1 2 3 # ❌ BAD: Always reports "changed", even when nothing changed - name: Create database command: createdb myapp This fails on the second run because the database already exists. Worse, it always shows as “changed” even when it shouldn’t run at all. ...

Environment Configuration Patterns: From Dev to Production

Configuration management sounds simple until you’re debugging why production is reading from the staging database at 3am. Here’s how to structure configuration so environments stay isolated and secrets stay secret. The Twelve-Factor Baseline The Twelve-Factor App got it right: store config in environment variables. But that’s just the starting point. Real systems need layers. ┌ │ ├ │ ├ │ ├ │ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ R ─ S ─ E ─ D ─ ─ u ─ e ─ n ─ e ─ ─ n ─ c ─ v ─ f ─ ─ t ─ r ─ i ─ a ─ ─ i ─ e ─ r ─ u ─ ─ m ─ t ─ o ─ l ─ ─ e ─ s ─ n ─ t ─ ─ ─ ─ m ─ ─ ─ E ─ M ─ e ─ c ─ ─ n ─ a ─ n ─ o ─ ─ v ─ n ─ t ─ n ─ ─ i ─ a ─ - ─ f ─ ─ r ─ g ─ s ─ i ─ ─ o ─ e ─ p ─ g ─ ─ n ─ r ─ e ─ ─ ─ m ─ ─ c ─ i ─ ─ e ─ ( ─ i ─ n ─ ─ n ─ V ─ f ─ ─ ─ t ─ a ─ i ─ c ─ ─ ─ u ─ c ─ o ─ ─ V ─ l ─ ─ d ─ ─ a ─ t ─ c ─ e ─ ─ r ─ , ─ o ─ ─ ─ i ─ ─ n ─ ─ ─ a ─ A ─ f ─ ─ ─ b ─ W ─ i ─ ─ ─ l ─ S ─ g ─ ─ ─ e ─ ─ ─ ─ ─ s ─ S ─ f ─ ─ ─ ─ S ─ i ─ ─ ─ ─ M ─ l ─ ─ ─ ─ , ─ e ─ ─ ─ ─ ─ s ─ ─ ─ ─ e ─ ─ ─ ─ ─ t ─ ─ ─ ─ ─ c ─ ─ ─ ─ ─ . ─ ─ ─ ─ ─ ) ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │ ┤ │ ┤ │ ┤ │ ┘ ← ← H L i o g w h e e s s t t p p r r i i o o r r i i t t y y Each layer overrides the one below it. Defaults live in code, environment-specific values in config files, secrets in a secrets manager, and runtime overrides in environment variables. ...

Ansible Role Design Patterns: Building Reusable Automation

Ansible playbooks get messy fast. Roles fix this by packaging related tasks, variables, and handlers into reusable units. But bad roles are worse than no roles — they hide complexity without managing it. Here’s how to design roles that actually help. The Basic Structure r o l m e y s _ t h d v t f m / r a a e a e i e o s m n m f m r m m c l s t m l k a d a a a s a p o e c a a e s i l i u i / i l n s r / i / / n e n l n n a f / i n . r . t . . t i p . y s y s y y e g t y m / m / m m s . . m l l l l j s l 2 h # # # # # # # E E D R J S R n v e o i t o t e f l n a l r n a e j t e y t u a i - l v 2 c m p t t a e o r r t f t i i v i e i a n g a a m l d t g r b p e a e i l l s t r a e a a e b s t d l e a e ( s n t s h d a i s ( g d k l h e s o e p w r e e n s p d t r e e n p c c r e i e d e c e s e n d c e e n ) c e ) Only create directories you need. An empty handlers/ is noise. ...

LLM Prompt Engineering for DevOps Automation

LLMs are becoming infrastructure. Not just chatbots — actual components in automation pipelines. But getting reliable, parseable output requires disciplined prompt engineering. Here’s what works for DevOps use cases. The Core Problem LLMs are probabilistic. Ask the same question twice, get different answers. That’s fine for chat. It’s terrible for automation that needs to parse structured output. The solution: constrain the output format and validate aggressively. Pattern 1: Structured Output with JSON Mode Most LLM APIs now support JSON mode. Use it. ...

Health Check Patterns: Liveness, Readiness, and Startup Probes

Your load balancer routes traffic to a pod that’s crashed. Or kills a pod that’s just slow. Or restarts a pod that’s still initializing. Health checks prevent these failures — when configured correctly. Most teams get them wrong. Here’s how to get them right. The Three Probe Types Kubernetes offers three distinct probes, each with a different purpose: Probe Question Failure Action Liveness Is the process alive? Restart container Readiness Can it handle traffic? Remove from Service Startup Has it finished starting? Delay other probes Liveness: “Should I restart this?” Detects when your app is stuck — deadlocked, infinite loop, unrecoverable state. ...

Kubernetes Resource Limits: Right-Sizing Containers for Stability

Your pod got OOMKilled. Or throttled to 5% CPU. Or evicted because the node ran out of resources. The fix isn’t “add more resources” — it’s understanding how Kubernetes scheduling actually works. Requests vs Limits Requests: What you’re guaranteed. Kubernetes uses this for scheduling. Limits: The ceiling. Exceed this and bad things happen. 1 2 3 4 5 6 7 resources: requests: memory: "256Mi" cpu: "250m" # 0.25 cores limits: memory: "512Mi" cpu: "500m" # 0.5 cores What Happens When You Exceed Them Resource Exceed Request Exceed Limit CPU Throttled when node is busy Hard throttled always Memory Fine if available OOMKilled immediately CPU is compressible — you slow down but survive. Memory is not — you die. ...