Infrastructure

Observability: Logs, Metrics, and Traces Working Together

Monitoring answers “is it working?” Observability answers “why isn’t it working?” The difference matters when you’re debugging a production incident at 3am. The three pillars of observability — logs, metrics, and traces — each provide different perspectives. Together, they create a complete picture of system behavior. Logs: The Narrative Logs tell you what happened, in order: 1 2 3 {"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "request_started", "request_id": "abc123", "path": "/api/users"} {"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "db_query", "request_id": "abc123", "duration_ms": 45} {"timestamp": "2026-02-23T13:00:02Z", "level": "error", "event": "request_failed", "request_id": "abc123", "error": "connection timeout"} Good for: Debugging specific requests Understanding error context Audit trails Ad-hoc investigation Challenges: ...

Infrastructure as Code: Principles That Actually Matter

Infrastructure as Code (IaC) means your servers, networks, and services are defined in version-controlled files rather than clicked into existence through consoles. The benefits are obvious: reproducibility, auditability, collaboration. But IaC done poorly creates its own problems: state drift, copy-paste sprawl, untestable configurations. The principles matter more than the tools. Declarative Over Imperative Describe what you want, not how to get there: 1 2 3 4 5 6 7 8 9 # Declarative (Terraform) - what resource "aws_instance" "web" { ami = "ami-0c55b159cbfafe1f0" instance_type = "t3.micro" tags = { Name = "web-server" } } 1 2 3 4 5 # Imperative (script) - how aws ec2 run-instances \ --image-id ami-0c55b159cbfafe1f0 \ --instance-type t3.micro \ --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=web-server}]' Declarative code is idempotent — run it ten times, get the same result. Imperative scripts need guards against re-running. ...

Docker Multi-Stage Builds: Smaller Images, Cleaner Dockerfiles

A typical Go application compiles to a single binary. Yet Docker images for Go apps often weigh hundreds of megabytes. Why? Because the image includes the entire Go toolchain used to build it. Multi-stage builds solve this: use one stage to build, another to run. The final image contains only what’s needed at runtime. The Problem: Fat Images 1 2 3 4 5 6 7 8 # Single-stage - includes everything FROM golang:1.21 WORKDIR /app COPY . . RUN go build -o server . CMD ["./server"] This image includes: ...

Service Discovery: Finding Services in a Dynamic World

In static infrastructure, services live at known addresses. Database at 10.0.1.5, cache at 10.0.1.6. Simple, predictable, fragile. In dynamic infrastructure — containers, auto-scaling, cloud — services appear and disappear constantly. IP addresses change. Instances multiply and vanish. Hardcoded addresses become a liability. Service discovery solves this: how do services find each other when everything is moving? The Problem 1 2 3 4 5 6 7 # Hardcoded - works until it doesn't DATABASE_URL = "postgres://10.0.1.5:5432/mydb" # What happens when: # - Database moves to a new server? # - You add read replicas? # - The IP changes after maintenance? DNS-Based Discovery The simplest approach: use DNS names instead of IPs. ...

Load Balancing: Distributing Traffic Without Playing Favorites

You’ve scaled horizontally — multiple servers ready to handle requests. Now you need something to decide which server handles each request. That’s load balancing, and the strategy you choose affects latency, reliability, and resource utilization. Round Robin: The Default Each server gets requests in rotation: Server 1, Server 2, Server 3, Server 1, Server 2… 1 2 3 4 5 upstream backend { server app1:8080; server app2:8080; server app3:8080; } Pros: Simple to understand and implement Even distribution over time No state to maintain Cons: ...

Caching Strategies: The Two Hardest Problems in Computer Science

Phil Karlton’s famous quote about hard problems in computer science exists because caching is genuinely difficult. Not the mechanics — putting data in Redis is easy. The hard part is knowing when that data is wrong. Get caching right and your application feels instant. Get it wrong and users see stale data, inconsistent state, or worse — data that was never supposed to be visible to them. The Cache-Aside Pattern (Lazy Loading) The most common pattern: check cache first, fall back to database, populate cache on miss. ...

Graceful Shutdown: Finishing What You Started

When Kubernetes scales down your deployment or you push a new release, your running containers receive SIGTERM. Then, after a grace period, SIGKILL. The difference between graceful and chaotic shutdown is what happens in those seconds between the two signals. A request half-processed, a database transaction uncommitted, a file partially written — these are the artifacts of ungraceful shutdown. They create inconsistent state, failed requests, and debugging nightmares. The Signal Sequence 1 2 3 4 5 6 7 . . . . . . . S G P P P P I I r r r r r f G a o o o o T c c c c c s E e e e e e t R s s s s i M p s s s s l e l s r s s s e e i h h h x r n o o o o i u t d u u u t n l l l s n t c d d d i o o w n u s f c i g p n t i l t : r t o n o h o d p i s S c o s e c I e w a h o G s n c c d K s c i o e I b e n n L e p - n 0 L g t f e i i l c ( n n i t n s g g i o h o ( n t n m d e s e e w w r f o c c a w r l y u o k e ) l r a t k n : l y 3 0 s i n K u b e r n e t e s ) Basic Signal Handling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 import signal import sys shutdown_requested = False def handle_sigterm(signum, frame): global shutdown_requested print("SIGTERM received, initiating graceful shutdown...") shutdown_requested = True signal.signal(signal.SIGTERM, handle_sigterm) signal.signal(signal.SIGINT, handle_sigterm) # Ctrl+C # Main loop checks shutdown flag while not shutdown_requested: process_next_item() # Cleanup after loop exits cleanup() sys.exit(0) Web Servers: Stop Accepting, Finish Processing Most web frameworks have built-in graceful shutdown. The pattern: ...

Health Check Endpoints: More Than Just 200 OK

Every container orchestrator, load balancer, and monitoring system asks the same question: is this service healthy? The answer you provide determines whether traffic gets routed, containers get replaced, and alerts get fired. A health check that lies — always returning 200 even when the database is down — is worse than no health check at all. It creates false confidence while your users experience failures. The Three Types of Health Checks Liveness: “Is the process alive?” Liveness checks answer: should this container be killed and restarted? ...

Database Connection Pooling: The Performance Win You're Probably Missing

Database connections are expensive. Each new connection requires TCP handshake, authentication, session initialization, and memory allocation on both client and server. Do this for every web request and you’ve got a performance problem hiding in plain sight. The Problem: Connection Overhead A typical PostgreSQL connection takes 50-100ms to establish. For a web request that runs a 5ms query, you’re spending 10-20x more time on connection setup than actual work. 1 2 3 4 5 6 7 8 # The naive approach - connection per request def get_user(user_id): conn = psycopg2.connect(DATABASE_URL) # 50-100ms cursor = conn.cursor() cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,)) # 5ms result = cursor.fetchone() conn.close() return result Under load, this creates connection storms. Each concurrent request opens its own connection. The database has connection limits. You hit those limits, new connections queue, latency spikes, everything falls apart. ...

API Rate Limiting Strategies That Don't Annoy Your Users

Every API needs rate limiting. Without it, one enthusiastic script kiddie or a bug in a client application can take down your entire service. The question isn’t whether to rate limit — it’s how to do it without making your API frustrating to use. The Naive Approach (And Why It Fails) 1 2 3 # Don't do this if requests_this_minute > 100: return 429, "Rate limit exceeded" Fixed limits per time window are simple to implement and almost always wrong. They create the “thundering herd” problem: all your users hit the limit at minute :00, back off, retry at :01, and create a synchronized spike that’s worse than no limit at all. ...