Backend

Environment Variables Done Right: Configuration Without Chaos

The twelve-factor app methodology tells us: store config in the environment. It’s good advice. Environment variables separate config from code, making the same artifact deployable across environments. But “store config in environment variables” doesn’t mean “scatter random ENV vars everywhere and hope for the best.” Done poorly, environment configuration becomes untraceable, untestable, and unmaintainable. The Baseline Pattern Centralize environment variable access: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 import os from dataclasses import dataclass from typing import Optional @dataclass class Config: database_url: str redis_url: str api_key: str debug: bool max_workers: int log_level: str @classmethod def from_env(cls) -> 'Config': return cls( database_url=os.environ['DATABASE_URL'], redis_url=os.environ['REDIS_URL'], api_key=os.environ['API_KEY'], debug=os.environ.get('DEBUG', 'false').lower() == 'true', max_workers=int(os.environ.get('MAX_WORKERS', '4')), log_level=os.environ.get('LOG_LEVEL', 'INFO'), ) # Single source of truth config = Config.from_env() Benefits: ...

Retry Patterns: Exponential Backoff and Beyond

Networks fail. Databases hiccup. External APIs return 503. In distributed systems, transient failures are not exceptional — they’re expected. The question isn’t whether to retry, but how. Bad retry logic turns a brief outage into a cascading failure. Good retry logic absorbs transient issues invisibly. The Naive Retry (Don’t Do This) 1 2 3 4 5 6 7 # Immediate retry loop - a recipe for disaster def call_api(): while True: try: return requests.get(API_URL) except RequestException: continue # Hammer the failing service This creates a thundering herd. If the service is struggling, every client immediately retrying makes it worse. You’re not recovering from failure — you’re accelerating toward total collapse. ...

Graceful Shutdown: Finishing What You Started

When Kubernetes scales down your deployment or you push a new release, your running containers receive SIGTERM. Then, after a grace period, SIGKILL. The difference between graceful and chaotic shutdown is what happens in those seconds between the two signals. A request half-processed, a database transaction uncommitted, a file partially written — these are the artifacts of ungraceful shutdown. They create inconsistent state, failed requests, and debugging nightmares. The Signal Sequence 1 2 3 4 5 6 7 . . . . . . . S G P P P P I I r r r r r f G a o o o o T c c c c c s E e e e e e t R s s s s i M p s s s s l e l s r s s s e e i h h h x r n o o o o i u t d u u u t n l l l s n t c d d d i o o w n u s f c i g p n t i l t : r t o n o h o d p i s S c o s e c I e w a h o G s n c c d K s c i o e I b e n n L e p - n 0 L g t f e i i l c ( n n i t n s g g i o h o ( n t n m d e s e e w w r f o c c a w r l y u o k e ) l r a t k n : l y 3 0 s i n K u b e r n e t e s ) Basic Signal Handling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 import signal import sys shutdown_requested = False def handle_sigterm(signum, frame): global shutdown_requested print("SIGTERM received, initiating graceful shutdown...") shutdown_requested = True signal.signal(signal.SIGTERM, handle_sigterm) signal.signal(signal.SIGINT, handle_sigterm) # Ctrl+C # Main loop checks shutdown flag while not shutdown_requested: process_next_item() # Cleanup after loop exits cleanup() sys.exit(0) Web Servers: Stop Accepting, Finish Processing Most web frameworks have built-in graceful shutdown. The pattern: ...

Health Check Endpoints: More Than Just 200 OK

Every container orchestrator, load balancer, and monitoring system asks the same question: is this service healthy? The answer you provide determines whether traffic gets routed, containers get replaced, and alerts get fired. A health check that lies — always returning 200 even when the database is down — is worse than no health check at all. It creates false confidence while your users experience failures. The Three Types of Health Checks Liveness: “Is the process alive?” Liveness checks answer: should this container be killed and restarted? ...

Structured Logging: Stop Grepping, Start Querying

You’ve seen this log line before: 2 0 2 6 - 0 2 - 2 3 0 5 : 3 0 : 0 0 I N F O U s e r j o h n @ e x a m p l e . c o m l o g g e d i n f r o m 1 9 2 . 1 6 8 . 1 . 1 0 0 a f t e r 2 f a i l e d a t t e m p t s Human readable. Grep-able. And completely useless for answering questions like “how many users had failed login attempts yesterday?” or “what’s the P95 response time for requests from the EU region?” ...

Database Connection Pooling: The Performance Win You're Probably Missing

Database connections are expensive. Each new connection requires TCP handshake, authentication, session initialization, and memory allocation on both client and server. Do this for every web request and you’ve got a performance problem hiding in plain sight. The Problem: Connection Overhead A typical PostgreSQL connection takes 50-100ms to establish. For a web request that runs a 5ms query, you’re spending 10-20x more time on connection setup than actual work. 1 2 3 4 5 6 7 8 # The naive approach - connection per request def get_user(user_id): conn = psycopg2.connect(DATABASE_URL) # 50-100ms cursor = conn.cursor() cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,)) # 5ms result = cursor.fetchone() conn.close() return result Under load, this creates connection storms. Each concurrent request opens its own connection. The database has connection limits. You hit those limits, new connections queue, latency spikes, everything falls apart. ...

API Rate Limiting Strategies That Don't Annoy Your Users

Every API needs rate limiting. Without it, one enthusiastic script kiddie or a bug in a client application can take down your entire service. The question isn’t whether to rate limit — it’s how to do it without making your API frustrating to use. The Naive Approach (And Why It Fails) 1 2 3 # Don't do this if requests_this_minute > 100: return 429, "Rate limit exceeded" Fixed limits per time window are simple to implement and almost always wrong. They create the “thundering herd” problem: all your users hit the limit at minute :00, back off, retry at :01, and create a synchronized spike that’s worse than no limit at all. ...