Devops

Load Balancing: Distributing Traffic Without Playing Favorites

You’ve scaled horizontally — multiple servers ready to handle requests. Now you need something to decide which server handles each request. That’s load balancing, and the strategy you choose affects latency, reliability, and resource utilization. Round Robin: The Default Each server gets requests in rotation: Server 1, Server 2, Server 3, Server 1, Server 2… 1 2 3 4 5 upstream backend { server app1:8080; server app2:8080; server app3:8080; } Pros: Simple to understand and implement Even distribution over time No state to maintain Cons: ...

Database Migrations Without Downtime: The Expand-Contract Pattern

Database migrations are terrifying. One wrong move and you’ve locked tables, broken queries, or corrupted data. The traditional approach — maintenance window, stop traffic, migrate, restart — works but costs you availability and customer trust. The expand-contract pattern lets you migrate schemas incrementally, with zero downtime, while your application keeps serving traffic. The Core Idea Never make breaking changes in a single step. Instead: Expand: Add the new structure alongside the old Migrate: Move data and update application code Contract: Remove the old structure Each step is independently deployable and reversible. ...

API Versioning: Breaking Changes Without Breaking Clients

Every API eventually needs to change in ways that break existing clients. Field removed, response format changed, authentication updated. The question isn’t whether you’ll have breaking changes — it’s how you’ll manage them. Good versioning gives you freedom to evolve while giving clients stability and migration paths. The Three Schools of Versioning URL Path Versioning G G E E T T / / a a p p i i / / v 1 2 / / u u s s e e r r s s / / 1 1 2 2 3 3 Pros: ...

Environment Variables Done Right: Configuration Without Chaos

The twelve-factor app methodology tells us: store config in the environment. It’s good advice. Environment variables separate config from code, making the same artifact deployable across environments. But “store config in environment variables” doesn’t mean “scatter random ENV vars everywhere and hope for the best.” Done poorly, environment configuration becomes untraceable, untestable, and unmaintainable. The Baseline Pattern Centralize environment variable access: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 import os from dataclasses import dataclass from typing import Optional @dataclass class Config: database_url: str redis_url: str api_key: str debug: bool max_workers: int log_level: str @classmethod def from_env(cls) -> 'Config': return cls( database_url=os.environ['DATABASE_URL'], redis_url=os.environ['REDIS_URL'], api_key=os.environ['API_KEY'], debug=os.environ.get('DEBUG', 'false').lower() == 'true', max_workers=int(os.environ.get('MAX_WORKERS', '4')), log_level=os.environ.get('LOG_LEVEL', 'INFO'), ) # Single source of truth config = Config.from_env() Benefits: ...

Retry Patterns: Exponential Backoff and Beyond

Networks fail. Databases hiccup. External APIs return 503. In distributed systems, transient failures are not exceptional — they’re expected. The question isn’t whether to retry, but how. Bad retry logic turns a brief outage into a cascading failure. Good retry logic absorbs transient issues invisibly. The Naive Retry (Don’t Do This) 1 2 3 4 5 6 7 # Immediate retry loop - a recipe for disaster def call_api(): while True: try: return requests.get(API_URL) except RequestException: continue # Hammer the failing service This creates a thundering herd. If the service is struggling, every client immediately retrying makes it worse. You’re not recovering from failure — you’re accelerating toward total collapse. ...

Graceful Shutdown: Finishing What You Started

When Kubernetes scales down your deployment or you push a new release, your running containers receive SIGTERM. Then, after a grace period, SIGKILL. The difference between graceful and chaotic shutdown is what happens in those seconds between the two signals. A request half-processed, a database transaction uncommitted, a file partially written — these are the artifacts of ungraceful shutdown. They create inconsistent state, failed requests, and debugging nightmares. The Signal Sequence 1 2 3 4 5 6 7 . . . . . . . S G P P P P I I r r r r r f G a o o o o T c c c c c s E e e e e e t R s s s s i M p s s s s l e l s r s s s e e i h h h x r n o o o o i u t d u u u t n l l l s n t c d d d i o o w n u s f c i g p n t i l t : r t o n o h o d p i s S c o s e c I e w a h o G s n c c d K s c i o e I b e n n L e p - n 0 L g t f e i i l c ( n n i t n s g g i o h o ( n t n m d e s e e w w r f o c c a w r l y u o k e ) l r a t k n : l y 3 0 s i n K u b e r n e t e s ) Basic Signal Handling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 import signal import sys shutdown_requested = False def handle_sigterm(signum, frame): global shutdown_requested print("SIGTERM received, initiating graceful shutdown...") shutdown_requested = True signal.signal(signal.SIGTERM, handle_sigterm) signal.signal(signal.SIGINT, handle_sigterm) # Ctrl+C # Main loop checks shutdown flag while not shutdown_requested: process_next_item() # Cleanup after loop exits cleanup() sys.exit(0) Web Servers: Stop Accepting, Finish Processing Most web frameworks have built-in graceful shutdown. The pattern: ...

Health Check Endpoints: More Than Just 200 OK

Every container orchestrator, load balancer, and monitoring system asks the same question: is this service healthy? The answer you provide determines whether traffic gets routed, containers get replaced, and alerts get fired. A health check that lies — always returning 200 even when the database is down — is worse than no health check at all. It creates false confidence while your users experience failures. The Three Types of Health Checks Liveness: “Is the process alive?” Liveness checks answer: should this container be killed and restarted? ...

Structured Logging: Stop Grepping, Start Querying

You’ve seen this log line before: 2 0 2 6 - 0 2 - 2 3 0 5 : 3 0 : 0 0 I N F O U s e r j o h n @ e x a m p l e . c o m l o g g e d i n f r o m 1 9 2 . 1 6 8 . 1 . 1 0 0 a f t e r 2 f a i l e d a t t e m p t s Human readable. Grep-able. And completely useless for answering questions like “how many users had failed login attempts yesterday?” or “what’s the P95 response time for requests from the EU region?” ...

API Rate Limiting Strategies That Don't Annoy Your Users

Every API needs rate limiting. Without it, one enthusiastic script kiddie or a bug in a client application can take down your entire service. The question isn’t whether to rate limit — it’s how to do it without making your API frustrating to use. The Naive Approach (And Why It Fails) 1 2 3 # Don't do this if requests_this_minute > 100: return 429, "Rate limit exceeded" Fixed limits per time window are simple to implement and almost always wrong. They create the “thundering herd” problem: all your users hit the limit at minute :00, back off, retry at :01, and create a synchronized spike that’s worse than no limit at all. ...

Terraform State Management: Avoiding the Footguns

Terraform state is where infrastructure-as-code meets reality. It’s also where most Terraform disasters originate. Here’s how to manage state without losing sleep. The Problem Terraform tracks what it’s created in a state file. This file maps your HCL resources to real infrastructure. Without it, Terraform can’t update or destroy anything — it doesn’t know what exists. The default is a local file called terraform.tfstate. This works fine until: Someone else needs to run Terraform Your laptop dies Two people run apply simultaneously You accidentally commit secrets to Git Rule 1: Remote State from Day One Never use local state for anything beyond experiments: ...