Kubernetes

Health Check Endpoints: More Than Just 200 OK

Every modern service needs health check endpoints. Load balancers probe them. Kubernetes uses them. Monitoring systems scrape them. But a naive implementation that returns 200 OK whenever the process is running tells you almost nothing useful. Here’s how to build health checks that actually help.

Two Types of Health

- Liveness: Is the process alive and not deadlocked?
- Readiness: Can this instance handle requests right now?

These are different questions with different answers:

```python
# Assumes FastAPI. check_database(), check_cache() and check_disk_space()
# are defined elsewhere and each return a dict with a "healthy" key.
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# Liveness: Am I alive?
@app.get("/health/live")
def liveness():
    # If this returns, the process is alive
    return {"status": "alive"}

# Readiness: Can I serve traffic?
@app.get("/health/ready")
def readiness():
    checks = {
        "database": check_database(),
        "cache": check_cache(),
        "disk_space": check_disk_space(),
    }
    all_healthy = all(c["healthy"] for c in checks.values())
    return JSONResponse(
        status_code=200 if all_healthy else 503,
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
    )
```

Why separate them? ...
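The `check_*` helpers that the readiness endpoint calls are not shown in this excerpt. The following is a minimal sketch of one possible shape, assuming only that each helper returns a dict with a `healthy` key. The `run_check` wrapper, the extra `latency_ms` and `error` fields, and the 10% free-space threshold are illustrative assumptions, not part of the original post.

```python
import shutil
import time
from typing import Any, Callable, Dict


def run_check(probe: Callable[[], None]) -> Dict[str, Any]:
    """Run a probe that raises on failure and report the outcome as a dict."""
    started = time.monotonic()
    try:
        probe()
        healthy, error = True, None
    except Exception as exc:  # a failing dependency must not crash the endpoint
        healthy, error = False, str(exc)
    latency_ms = round((time.monotonic() - started) * 1000, 1)
    return {"healthy": healthy, "latency_ms": latency_ms, "error": error}


def check_disk_space(path: str = "/", min_free_fraction: float = 0.10) -> Dict[str, Any]:
    """Hypothetical disk check: unhealthy when free space drops below the threshold."""
    def probe() -> None:
        usage = shutil.disk_usage(path)
        if usage.free / usage.total < min_free_fraction:
            raise RuntimeError(f"only {usage.free} of {usage.total} bytes free")
    return run_check(probe)
```

Database and cache checks could reuse `run_check` by passing a probe that issues a cheap query or ping and raises on failure.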

Graceful Shutdown: Stop Dropping Requests

Every deployment is a potential outage if your application doesn’t shut down gracefully. Here’s how to do it right.

The Problem

1. Kubernetes sends SIGTERM
2. Pod is removed from service endpoints
3. Your app exits immediately
4. In-flight requests get connection reset
5. Users see errors during "zero-downtime" deploys

The fix: handle SIGTERM, finish existing work, then exit. ...
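The "handle SIGTERM, finish existing work, then exit" idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the post's own implementation: `GracefulWorker` and its job queue are hypothetical stand-ins for whatever server or consumer loop the application actually runs.

```python
import signal
import threading
from collections import deque


class GracefulWorker:
    """Processes queued jobs until asked to stop (hypothetical example class)."""

    def __init__(self, jobs):
        self.pending = deque(jobs)
        self.completed = []
        self.stopping = threading.Event()

    def handle_sigterm(self, signum, frame):
        # Only set a flag here; the loop decides when it is safe to exit.
        self.stopping.set()

    def install(self):
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self):
        # Stop picking up new jobs once the flag is set, but never
        # abandon the job that is already in progress.
        while self.pending and not self.stopping.is_set():
            job = self.pending.popleft()
            self.completed.append(job())
        return len(self.pending)  # jobs left for another instance to pick up
```

In a real service, `install()` would be called once at startup and the process would exit after `run()` returns. The signal handler does no work itself, which keeps it safe to run at any point in the loop.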

Kubernetes Troubleshooting Patterns for Production

Kubernetes hides complexity until something breaks. Then you need to know where to look. Here’s a systematic approach to debugging production issues.

The Debugging Hierarchy

Start broad, narrow down:

1. Cluster level: Nodes healthy? Resources available?
2. Namespace level: Deployments running? Services configured?
3. Pod level: Containers starting? Logs clean?
4. Container level: Process running? Resources sufficient?

Quick Health Check

```bash
# Node status
kubectl get nodes -o wide

# All pods across namespaces
kubectl get pods -A

# Pods not running
kubectl get pods -A | grep -v Running | grep -v Completed

# Events (recent issues)
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
```

Pod Troubleshooting

Pod States

| State | Meaning | Check |
|---|---|---|
| Pending | Can’t be scheduled | Resources, node selectors, taints |
| ContainerCreating | Image pulling or volume mounting | Image name, pull secrets, PVCs |
| CrashLoopBackOff | Container crashing repeatedly | Logs, resource limits, probes |
| ImagePullBackOff | Can’t pull image | Image name, registry auth |
| Error | Container exited with error | Logs |

Pending Pods

```bash
# Why is it pending?
kubectl describe pod my-pod

# Look for:
# - Insufficient cpu/memory
# - No nodes match nodeSelector
# - Taints not tolerated
# - PVC not bound

# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"

# Check PVC status
kubectl get pvc
```

CrashLoopBackOff

```bash
# Get logs from current container
kubectl logs my-pod

# Get logs from previous (crashed) container
kubectl logs my-pod --previous

# Get logs from specific container
kubectl logs my-pod -c my-container

# Follow logs
kubectl logs -f my-pod

# Last N lines
kubectl logs --tail=100 my-pod
```

Common causes: ...

Kubernetes Troubleshooting: A Practical Field Guide

When a Kubernetes deployment goes sideways at 3am, you need a systematic approach. Here’s the troubleshooting playbook I’ve developed from watching countless production incidents.

The First Three Commands

Before diving deep, these three commands tell you 80% of what you need:

```bash
# What's not running?
kubectl get pods -A | grep -v Running | grep -v Completed

# What happened recently?
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Resource pressure?
kubectl top nodes
```

Run these first. Always. ...
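Filtering on the STATUS column with `grep -v Running` can miss pods that report `Running` while their containers are not ready. A rough sketch of a stricter filter over the output of `kubectl get pods -A -o json` follows. It relies only on the standard pod fields `status.phase` and `status.containerStatuses`; the function name, the `NotReady` label, and the sample data are made up for illustration.

```python
import json

HEALTHY_PHASES = {"Running", "Succeeded"}


def unhealthy_pods(pod_list_json: str):
    """Return (namespace, name, reason) for pods that need attention."""
    problems = []
    for pod in json.loads(pod_list_json).get("items", []):
        meta = pod["metadata"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        reason = None if phase in HEALTHY_PHASES else phase
        for container in status.get("containerStatuses", []):
            waiting = container.get("state", {}).get("waiting")
            if waiting:
                # e.g. CrashLoopBackOff or ImagePullBackOff
                reason = waiting.get("reason", "Waiting")
            elif phase == "Running" and not container.get("ready", False):
                reason = reason or "NotReady"
        if reason:
            problems.append((meta.get("namespace", ""), meta["name"], reason))
    return problems


# Hand-written sample in the shape of `kubectl get pods -A -o json`
SAMPLE = json.dumps({"items": [
    {"metadata": {"namespace": "myapp", "name": "web-1"},
     "status": {"phase": "Running",
                "containerStatuses": [{"ready": True, "state": {"running": {}}}]}},
    {"metadata": {"namespace": "myapp", "name": "web-2"},
     "status": {"phase": "Running",
                "containerStatuses": [{"ready": False,
                                       "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"namespace": "jobs", "name": "migrate"},
     "status": {"phase": "Pending"}},
]})

for namespace, name, reason in unhealthy_pods(SAMPLE):
    print(f"{namespace}/{name}: {reason}")
```

Against a live cluster, the JSON from `kubectl` would replace `SAMPLE`.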

Environment Variable Patterns for 12-Factor Apps

Environment variables are the 12-factor way to configure applications. But “just use env vars” glosses over real complexity. Here’s how to do it well.

The Basics Done Right

Type Coercion

Environment variables are always strings. Handle conversion explicitly:

```python
import os

def get_env_bool(key: str, default: bool = False) -> bool:
    value = os.getenv(key, "").lower()
    if value in ("true", "1", "yes", "on"):
        return True
    if value in ("false", "0", "no", "off"):
        return False
    return default

def get_env_int(key: str, default: int = 0) -> int:
    try:
        return int(os.getenv(key, default))
    except ValueError:
        return default

# Usage
DEBUG = get_env_bool("DEBUG", False)
PORT = get_env_int("PORT", 8080)
WORKERS = get_env_int("WORKERS", 4)
```

Required vs Optional

Fail fast on missing required config: ...
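The excerpt stops before showing its fail-fast code. One way the idea might look is sketched below; this is an assumption for illustration, not the post's version. `require_env`, `MissingConfigError`, and the variable names are hypothetical.

```python
import os


class MissingConfigError(RuntimeError):
    """Raised at startup when required settings are absent."""


def require_env(*keys, environ=None):
    """Return the requested variables, or raise listing every missing one."""
    env = os.environ if environ is None else environ
    # Treat empty strings as missing: an empty DATABASE_URL is not usable.
    missing = [key for key in keys if not env.get(key)]
    if missing:
        raise MissingConfigError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {key: env[key] for key in keys}


# Example with an explicit mapping; a real app would omit `environ`
# so that os.environ is read once at startup.
config = require_env(
    "DATABASE_URL",
    environ={"DATABASE_URL": "postgres://localhost/app"},
)
```

Reporting every missing key in one error avoids the fix-one-redeploy-find-the-next cycle.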

Container Health Check Patterns That Actually Work

Your container says it’s healthy. Your users say the app is broken. Sound familiar? Basic health checks only tell you if a process is running. Here’s how to build checks that catch real problems.

Beyond “Is It Alive?”

Most health checks look like this:

```dockerfile
HEALTHCHECK CMD curl -f http://localhost:8080/health || exit 1
```

This tells you the HTTP server responds. It doesn’t tell you:

- Can the app reach the database?
- Is the cache connected?
- Are critical background workers running?
- Is the disk filling up?

Layered Health Checks

Implement three levels: ...
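For images that ship Python but not `curl`, the same probe can be approximated with the standard library. This is a sketch under that assumption; the `probe` function is hypothetical and the URL is the one from the `HEALTHCHECK` line above. It returns a process-style exit code: 0 for a 2xx response, 1 for anything else.

```python
import http.client
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 2.0) -> int:
    """Return 0 if the URL answers with a 2xx status, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers HTTP error statuses, refused connections, timeouts,
        # broken responses and malformed URLs.
        return 1
```

A health check script would end with something like `sys.exit(probe("http://localhost:8080/health"))`. Like the `curl` version, this only proves the HTTP server responds. What makes the check meaningful is the endpoint's own logic.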

Container Orchestration Patterns for AI Workloads

Running AI workloads in containers presents unique challenges that traditional web application patterns don’t address. GPU scheduling, model caching, and bursty inference traffic all require thoughtful architecture. Here’s what actually works in production.

The GPU Scheduling Problem

Standard Kubernetes scheduling assumes CPU and memory are your primary constraints. When you add GPUs to the mix, everything changes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
    - name: model
      image: my-registry/llm-server:v1.2
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: "32Gi"
        requests:
          nvidia.com/gpu: 1
          memory: "24Gi"
```

The naive approach (one GPU per pod) works until you realize GPUs cost $2-4/hour and sit idle between requests. MIG (Multi-Instance GPU) and time-slicing help, but they introduce complexity: ...

Kubernetes Debugging: From Pod Failures to Cluster Issues

Kubernetes abstracts away infrastructure until something breaks. Then you need to peel back the layers. These debugging patterns will help you find problems fast.

First Steps: Get the Lay of the Land

```bash
# Cluster health
kubectl cluster-info
kubectl get nodes
kubectl top nodes

# Namespace overview
kubectl get all -n myapp

# Events (recent issues surface here)
kubectl get events -n myapp --sort-by='.lastTimestamp'
```

Pod Debugging

Check Pod Status

```bash
# List pods with status
kubectl get pods -n myapp

# Detailed pod info
kubectl describe pod mypod -n myapp

# Common status meanings:
# Pending          - Waiting for scheduling or image pull
# Running          - At least one container running
# Succeeded        - All containers completed successfully
# Failed           - All containers terminated, at least one failed
# CrashLoopBackOff - Container crashing repeatedly
# ImagePullBackOff - Can't pull container image
```

View Logs

```bash
# Current logs
kubectl logs mypod -n myapp

# Previous container (after crash)
kubectl logs mypod -n myapp --previous

# Follow logs
kubectl logs -f mypod -n myapp

# Specific container (multi-container pod)
kubectl logs mypod -n myapp -c mycontainer

# Last N lines
kubectl logs mypod -n myapp --tail=100

# Last hour of logs
kubectl logs mypod -n myapp --since=1h
```

Execute Commands in Pod

```bash
# Shell into running container
kubectl exec -it mypod -n myapp -- /bin/sh

# Run specific command
kubectl exec mypod -n myapp -- cat /etc/config/app.yaml

# Specific container
kubectl exec -it mypod -n myapp -c mycontainer -- /bin/sh
```

Debug Crashed Containers

```bash
# Check why it crashed
kubectl describe pod mypod -n myapp | grep -A 10 "Last State"

# View previous logs
kubectl logs mypod -n myapp --previous

# Run debug container (K8s 1.25+)
kubectl debug mypod -n myapp -it --image=busybox --target=mycontainer
```

Common Pod Issues

ImagePullBackOff

```bash
# Check events for details
kubectl describe pod mypod -n myapp | grep -A 5 Events

# Common causes:
# - Wrong image name/tag
# - Private registry without imagePullSecrets
# - Registry rate limiting (Docker Hub)

# Verify image exists
docker pull myimage:tag

# Check imagePullSecrets
kubectl get pod mypod -n myapp -o jsonpath='{.spec.imagePullSecrets}'
```

CrashLoopBackOff

```bash
# Get exit code
kubectl describe pod mypod -n myapp | grep "Exit Code"

# Exit codes:
# 0   - Success (shouldn't be crashing)
# 1   - Application error
# 137 - SIGKILL (128 + 9), usually OOMKilled (out of memory)
# 139 - SIGSEGV (128 + 11), segmentation fault
# 143 - SIGTERM (128 + 15), graceful shutdown

# Check resource limits
kubectl describe pod mypod -n myapp | grep -A 5 Limits
```

Pending Pods

```bash
# Check why not scheduled
kubectl describe pod mypod -n myapp | grep -A 10 Events

# Common causes:
# - Insufficient resources
# - Node selector/affinity not matched
# - Taints without tolerations
# - PVC not bound

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"
```

Resource Issues

Memory Problems

```bash
# Check pod resource usage
kubectl top pod mypod -n myapp

# Check for OOMKilled
kubectl describe pod mypod -n myapp | grep OOMKilled

# View memory limits
kubectl get pod mypod -n myapp -o jsonpath='{.spec.containers[*].resources}'
```

CPU Throttling

```bash
# Check CPU usage vs limits
kubectl top pod mypod -n myapp

# In container, check throttling
# (cgroup v1 path; on cgroup v2 nodes the file is /sys/fs/cgroup/cpu.stat)
kubectl exec mypod -n myapp -- cat /sys/fs/cgroup/cpu/cpu.stat
```

Networking Debugging

Service Connectivity

```bash
# Check service exists
kubectl get svc -n myapp

# Check endpoints (are pods backing the service?)
kubectl get endpoints myservice -n myapp

# Test from within cluster
kubectl run debug --rm -it --image=busybox -- /bin/sh
# Then: wget -qO- http://myservice.myapp.svc.cluster.local

# DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup myservice.myapp.svc.cluster.local
```

Pod-to-Pod Communication

```bash
# Get pod IPs
kubectl get pods -n myapp -o wide

# Test connectivity from one pod to another
kubectl exec mypod1 -n myapp -- wget -qO- http://10.0.0.5:8080

# Check network policies
kubectl get networkpolicies -n myapp
```

Ingress Issues

```bash
# Check ingress configuration
kubectl describe ingress myingress -n myapp

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Verify backend service
kubectl get svc myservice -n myapp
```

ConfigMaps and Secrets

```bash
# Verify ConfigMap exists and has expected data
kubectl get configmap myconfig -n myapp -o yaml

# Check if mounted correctly
kubectl exec mypod -n myapp -- ls -la /etc/config/
kubectl exec mypod -n myapp -- cat /etc/config/app.yaml

# Verify Secret
kubectl get secret mysecret -n myapp -o jsonpath='{.data.password}' | base64 -d

# Check environment variables
kubectl exec mypod -n myapp -- env | grep MY_VAR
```

Persistent Volumes

```bash
# Check PVC status
kubectl get pvc -n myapp

# Describe for binding issues
kubectl describe pvc mypvc -n myapp

# Check PV
kubectl get pv

# Verify mount in pod
kubectl exec mypod -n myapp -- df -h
kubectl exec mypod -n myapp -- ls -la /data
```

Node Issues

```bash
# Node status
kubectl get nodes
kubectl describe node mynode

# Check conditions (quoted so the shell does not expand [*])
kubectl get nodes -o custom-columns='NAME:.metadata.name,CONDITIONS:.status.conditions[*].type'

# Node resource pressure
kubectl describe node mynode | grep -A 5 Conditions

# Pods on specific node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=mynode

# Drain node for maintenance
kubectl drain mynode --ignore-daemonsets --delete-emptydir-data
```

Control Plane Debugging

```bash
# API server health
kubectl get --raw='/healthz'

# Component status (deprecated but useful)
kubectl get componentstatuses

# Check system pods
kubectl get pods -n kube-system

# API server logs (if accessible)
kubectl logs -n kube-system kube-apiserver-master

# etcd health
kubectl exec -n kube-system etcd-master -- etcdctl endpoint health
```

Useful Debug Containers

```bash
# Network debugging
kubectl run netdebug --rm -it --image=nicolaka/netshoot -- /bin/bash

# DNS debugging
kubectl run dnsdebug --rm -it --image=tutum/dnsutils -- /bin/bash

# General debugging
kubectl run debug --rm -it --image=busybox -- /bin/sh
```

Systematic Debugging Checklist

1. Events first: kubectl get events --sort-by='.lastTimestamp'
2. Describe the resource: kubectl describe <resource> <name>
3. Check logs: kubectl logs <pod> (and --previous)
4. Verify dependencies: ConfigMaps, Secrets, Services, PVCs
5. Check resources: CPU, memory limits and usage
6. Test connectivity: DNS, service endpoints, network policies
7. Compare with working: Diff against known good configuration

Quick Reference

```bash
# Pod not starting
kubectl describe pod <name>
kubectl get events

# Pod crashing
kubectl logs <pod> --previous
kubectl describe pod <name> | grep "Exit Code"

# Can't connect to service
kubectl get endpoints <service>
kubectl run debug --rm -it --image=busybox -- wget -qO- http://<service>

# Resource issues
kubectl top pods
kubectl describe node | grep -A 5 "Allocated"

# Config issues
kubectl exec <pod> -- env
kubectl exec <pod> -- cat /path/to/config
```

Kubernetes debugging is methodical. Start with events, drill into describe output, check logs, and verify each dependency. Most issues are configuration mismatches: wrong image tags, missing secrets, insufficient resources. ...
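The exit codes listed under CrashLoopBackOff follow the usual shell convention that a code above 128 means "terminated by signal (code - 128)". A small sketch of that arithmetic follows. The helper name is hypothetical, it assumes a Linux signal table, and an application can still choose to return a high code on its own.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short explanation."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform
            name = f"signal {signum}"
        return f"killed by {name}"
    return "application error"


for code in (0, 1, 137, 139, 143):
    print(code, describe_exit_code(code))
```

SIGKILL (137) is what the kernel's OOM killer sends, which is why that code so often accompanies an `OOMKilled` reason in `kubectl describe` output.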

Container Orchestration Patterns: Beyond 'Just Deploy It'

Running one container is easy. Running hundreds in production, reliably, at scale? That’s where patterns emerge. These aren’t Kubernetes-specific (though that’s where you’ll see them most). They’re fundamental approaches to composing containers into systems that actually work.

The Sidecar Pattern

A sidecar is a helper container that runs alongside your main application container, sharing the same pod/network namespace.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  containers:
    # Main application
    - name: app
      image: myapp:1.0
      ports:
        - containerPort: 8080

    # Sidecar: log shipper
    - name: log-shipper
      image: fluentd:latest
      volumeMounts:
        - name: logs
          mountPath: /var/log/app

  volumes:
    - name: logs
      emptyDir: {}
```

Common sidecar use cases: ...

Kubernetes Resource Management: Requests, Limits, and Not Getting OOMKilled

Kubernetes needs to know how much CPU and memory your containers need. Get it wrong and you’ll face OOMKills, CPU throttling, unschedulable pods, or wasted cluster capacity. Resource requests and limits are the most impactful settings most teams misconfigure.

Requests vs Limits

- Requests: What you’re guaranteed. Used for scheduling.
- Limits: What you can’t exceed. Enforced at runtime.

```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
```

This pod: ...
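To make the units concrete: `250m` means 250 millicores (a quarter of a CPU) and `256Mi` means 256 × 1024 × 1024 bytes. The sketch below converts the values from the manifest above. The helper names are hypothetical and it handles only a subset of the Kubernetes quantity syntax.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes: binary (Ki, Mi, ...) and decimal (k, M, ...)
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def cpu_to_millicores(quantity: str) -> int:
    """'250m' -> 250, '0.5' -> 500, '2' -> 2000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """'256Mi' -> 268435456; a bare number is taken as bytes."""
    # Check two-letter suffixes before one-letter ones
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(Decimal(quantity))


request_bytes = memory_to_bytes("256Mi")
limit_bytes = memory_to_bytes("512Mi")
print(f"memory limit is {limit_bytes / request_bytes:.1f}x the request")
print(f"cpu request: {cpu_to_millicores('250m')} millicores")
```

The gap between request and limit is burst headroom: the scheduler reserves only the request, so a node can be overcommitted on the difference.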