Troubleshooting

Why Your systemd Service Fails Silently (And How to Actually Debug It)

Your systemd service is “active (running)” but the application isn’t responding. No errors in systemctl status. The journal shows it started. Everything looks fine. Except it isn’t.

This is one of the most frustrating debugging scenarios in Linux administration. Here’s how to actually figure out what’s wrong.

The Problem: Green Status, Dead Application

    $ systemctl status myapp
    ● myapp.service - My Application
       Loaded: loaded (/etc/systemd/system/myapp.service; enabled)
       Active: active (running) since Fri 2026-03-20 10:00:00 EDT; 2h ago
     Main PID: 12345 (myapp)

Looks healthy. But curl localhost:8080 times out. What’s happening? ...
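One quick check for the "green status, dead application" case is whether anything is actually listening on the port. The sketch below is only an illustration under assumptions: `port_is_listening` is a hypothetical helper (not part of systemd or iproute2), and it assumes the usual `ss -ltn` column layout where the fourth column is the local address and port.

```shell
# port_is_listening PORT: hypothetical helper, not a standard tool.
# Reads `ss -ltn` output on stdin and succeeds (exit 0) if some listening
# socket's local address ends in :PORT. Assumes column 4 is "Local Address:Port".
port_is_listening() {
    awk -v port="$1" '
        NR > 1 {
            # Split on ":" and take the last piece, so "[::]:8080" also works.
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}

# Usage (on the affected host; myapp and 8080 come from the example above):
#   ss -ltn | port_is_listening 8080 || journalctl -u myapp -n 50 --no-pager
```

If the helper fails while systemd still reports the unit as running, the process is alive but never bound the port (or bound a different one), which narrows the search to the application's own startup path.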

Fix: Docker Container Stuck in Restart Loop

Your container starts, immediately dies, restarts, dies again. The docker ps output shows “Restarting (1) 2 seconds ago” and you’re watching it cycle endlessly. Here’s how to break the loop and find the actual problem.

Step 1: Check the Exit Code

First, figure out how it’s dying:

    docker inspect --format='{{.State.ExitCode}}' container_name

Common exit codes:

- 0: Clean exit (shouldn’t restart unless you have restart: always)
- 1: Application error (check logs)
- 137: Killed with SIGKILL, most often by the OOM killer (out of memory)
- 139: Segmentation fault
- 143: SIGTERM received (graceful shutdown request)

Step 2: Read the Logs (Before They Disappear)

Logs from a container in a restart loop are easy to miss, and they are gone once the container is removed. Catch them quickly: ...
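The exit-code list above follows the usual shell convention that a code above 128 means "killed by signal (code minus 128)", which is why 137 is SIGKILL (9) and 143 is SIGTERM (15). A minimal sketch of that mapping, where `describe_exit_code` is a hypothetical helper name and not a Docker command:

```shell
# describe_exit_code CODE: hypothetical helper, not part of Docker.
# Prints a likely cause for a container exit code, using the convention
# that codes above 128 mean "terminated by signal (CODE - 128)".
describe_exit_code() {
    code=$1
    case "$code" in
        0)   echo "clean exit" ;;
        1)   echo "application error" ;;
        137) echo "SIGKILL (often the OOM killer)" ;;
        139) echo "SIGSEGV (segmentation fault)" ;;
        143) echo "SIGTERM (graceful shutdown request)" ;;
        *)
            # Non-numeric input makes the test fail quietly and falls through.
            if [ "$code" -gt 128 ] 2>/dev/null; then
                echo "killed by signal $((code - 128))"
            else
                echo "application-specific exit code"
            fi
            ;;
    esac
}

# Usage (requires Docker; container_name is a placeholder):
#   describe_exit_code "$(docker inspect --format='{{.State.ExitCode}}' container_name)"
```

For a 137, `docker inspect --format='{{.State.OOMKilled}}' container_name` helps tell an out-of-memory kill apart from a plain `docker kill`.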

Fix: Docker Container Can't Resolve DNS (And Why It Happens)

Your container builds fine, starts fine, then fails with Could not resolve host or Temporary failure in name resolution. Here’s how to fix it.

Quick Diagnosis

First, confirm it’s actually DNS and not a network issue:

    # Get a shell in the container
    docker exec -it <container_name> sh

    # Test DNS specifically
    nslookup google.com
    # or
    cat /etc/resolv.conf
    ping 8.8.8.8  # If this works but nslookup fails, it's DNS

If ping 8.8.8.8 works but nslookup google.com fails, you have a DNS problem. If both fail, it’s a broader network issue. ...
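The decision rule above can be written down as a small function. This is a sketch, not a standard tool: `classify_failure` is a hypothetical name, and the fourth case (ping fails but DNS works) is an assumption added here, since the excerpt only covers the other outcomes.

```shell
# classify_failure PING_STATUS DNS_STATUS: hypothetical helper.
# Takes the exit statuses of `ping -c 1 8.8.8.8` and `nslookup google.com`
# (0 means success) and prints which layer looks broken.
classify_failure() {
    ping_status=$1
    dns_status=$2
    if [ "$ping_status" -eq 0 ] && [ "$dns_status" -eq 0 ]; then
        echo "ok"
    elif [ "$ping_status" -eq 0 ]; then
        echo "dns"            # IP connectivity works, name resolution does not
    elif [ "$dns_status" -eq 0 ]; then
        echo "ping-blocked"   # assumption: ICMP is filtered, DNS itself is fine
    else
        echo "network"        # neither works: broader network issue
    fi
}

# Usage inside the container:
#   ping -c 1 8.8.8.8 >/dev/null 2>&1; p=$?
#   nslookup google.com >/dev/null 2>&1; d=$?
#   classify_failure "$p" "$d"
```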

Why Your Cron Job Isn't Running: A Troubleshooting Guide

You’ve added your cron job, the syntax looks right, but nothing happens. No output, no errors, just silence. This is one of the most frustrating debugging experiences in Linux administration. Here’s how to systematically find and fix the problem.

Check If Cron Is Actually Running

First, verify the cron daemon is running:

    # systemd systems (Ubuntu 16.04+, CentOS 7+, Debian 8+)
    systemctl status cron
    # or on some systems
    systemctl status crond

    # Older init systems
    service cron status

If cron isn’t running, start it: ...
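Because the unit is called cron on some distributions and crond on others, a script can simply try both. A minimal sketch for systemd hosts, where `find_cron_unit` is a hypothetical helper name:

```shell
# find_cron_unit: hypothetical helper for systemd hosts.
# Prints the first of "cron" / "crond" that systemd reports as active,
# or returns 1 if neither is.
find_cron_unit() {
    for unit in cron crond; do
        if systemctl is-active --quiet "$unit" 2>/dev/null; then
            echo "$unit"
            return 0
        fi
    done
    return 1
}

# Usage:
#   if unit=$(find_cron_unit); then
#       echo "cron daemon is running as $unit"
#   else
#       echo "no active cron unit found"
#   fi
```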

Kubernetes Debugging: A Practical Field Guide

Your pod won’t start. The service isn’t routing. Something’s wrong but kubectl isn’t telling you what. Here’s how to actually debug Kubernetes problems.

The Debugging Hierarchy

Work from the outside in:

1. Cluster level: Is the cluster healthy?
2. Node level: Are nodes ready?
3. Pod level: Is the pod running?
4. Container level: Is the container healthy?
5. Application level: Is the app working?

Most problems are at levels 3-5. Start there. ...

Kubernetes Troubleshooting: A Practical Field Guide

Kubernetes failures are rarely mysterious once you know where to look. The problem is knowing where to look. This guide covers the systematic approach to diagnosing common Kubernetes issues.

The Diagnostic Hierarchy

Start broad, drill down:

    Cluster → Node → Pod → Container → Application

At each level, the same questions apply: ...

DNS Troubleshooting: When dig Is Your Best Friend

DNS issues are deceptively simple. “It’s always DNS” is a meme because it’s true. Here’s how to actually debug it.

The Essential Commands

dig: Your Primary Tool

    # Basic lookup
    dig example.com

    # Short answer only
    dig +short example.com

    # Specific record type
    dig example.com MX
    dig example.com TXT
    dig example.com CNAME

    # Query specific nameserver
    dig @8.8.8.8 example.com

    # Trace the full resolution path
    dig +trace example.com

Understanding dig Output

    ;; QUESTION SECTION:
    ;example.com.            IN    A

    ;; ANSWER SECTION:
    example.com.    ...      IN    A    93.184.216.34

    ;; Query time: ... msec
    ;; SERVER: ...
    ;; WHEN: ...
    ;; MSG SIZE  rcvd: ...

Key fields: ...
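A common use of `dig @server` and `dig +short` together is checking whether two nameservers give the same answer. The sketch below is an illustration only: `resolvers_agree` is a hypothetical helper, and 1.1.1.1 in the usage line is an assumed second public resolver that does not appear in the excerpt.

```shell
# resolvers_agree NAME SERVER_A SERVER_B: hypothetical helper.
# Succeeds when both nameservers return the same non-empty set of
# `dig +short` answers for NAME (order-insensitive).
resolvers_agree() {
    name=$1
    first=$(dig +short "@$2" "$name" | sort)
    second=$(dig +short "@$3" "$name" | sort)
    [ -n "$first" ] && [ "$first" = "$second" ]
}

# Usage (needs network access; 1.1.1.1 is an assumed second resolver):
#   resolvers_agree example.com 8.8.8.8 1.1.1.1 && echo "same answers"
```

A mismatch is a hint, not proof of a fault: CDNs and geo-aware DNS can legitimately hand different answers to different resolvers.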

Linux Performance Troubleshooting: The First Five Minutes

When a server is slow and people are yelling, you need a systematic approach. Here’s what to run in the first five minutes.

The Checklist

    uptime
    dmesg | tail
    vmstat 1 5
    mpstat -P ALL 1 3
    pidstat 1 3
    iostat -xz 1 3
    free -h
    sar -n DEV 1 3

Let’s break down what each tells you.

1. uptime

    $ uptime
    16:30:01 up 45 days, 3:22, 2 users, load average: 8.42, 6.31, 5.12

Load averages: 1-minute, 5-minute, 15-minute. ...
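To use those three numbers in a script, you can pull them out of the `uptime` line. The sketch below makes a few assumptions: `load_averages` and `is_overloaded` are hypothetical helper names, the output uses "." as the decimal separator (run with `LC_ALL=C` if unsure), and comparing load against the CPU count is only a rough rule of thumb that the excerpt itself does not state.

```shell
# load_averages: hypothetical helper. Reads one line of `uptime` output on
# stdin and prints the 1-, 5-, and 15-minute load averages separated by spaces.
# Accepts both "load average:" (Linux) and "load averages:" (BSD/macOS).
load_averages() {
    sed -n 's/.*load averages\{0,1\}: *//p' | tr -d ','
}

# is_overloaded LOAD CPUS: hypothetical helper. Succeeds when LOAD is
# greater than the number of CPUs (a rough heuristic, not a hard rule).
is_overloaded() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load > cpus) }'
}

# Usage:
#   set -- $(LC_ALL=C uptime | load_averages)
#   is_overloaded "$1" "$(nproc)" && echo "1-minute load $1 exceeds CPU count"
```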

Kubernetes Troubleshooting Patterns for Production

Kubernetes hides complexity until something breaks. Then you need to know where to look. Here’s a systematic approach to debugging production issues.

The Debugging Hierarchy

Start broad, narrow down:

- Cluster level: Nodes healthy? Resources available?
- Namespace level: Deployments running? Services configured?
- Pod level: Containers starting? Logs clean?
- Container level: Process running? Resources sufficient?

Quick Health Check

    # Node status
    kubectl get nodes -o wide

    # All pods across namespaces
    kubectl get pods -A

    # Pods not running
    kubectl get pods -A | grep -v Running | grep -v Completed

    # Events (recent issues)
    kubectl get events -A --sort-by='.lastTimestamp' | tail -20

Pod Troubleshooting

Pod States

| State | Meaning | Check |
| --- | --- | --- |
| Pending | Can’t be scheduled | Resources, node selectors, taints |
| ContainerCreating | Image pulling or volume mounting | Image name, pull secrets, PVCs |
| CrashLoopBackOff | Container crashing repeatedly | Logs, resource limits, probes |
| ImagePullBackOff | Can’t pull image | Image name, registry auth |
| Error | Container exited with error | Logs |

Pending Pods

    # Why is it pending?
    kubectl describe pod my-pod

    # Look for:
    # - Insufficient cpu/memory
    # - No nodes match nodeSelector
    # - Taints not tolerated
    # - PVC not bound

    # Check node resources
    kubectl describe nodes | grep -A5 "Allocated resources"

    # Check PVC status
    kubectl get pvc

CrashLoopBackOff

    # Get logs from current container
    kubectl logs my-pod

    # Get logs from previous (crashed) container
    kubectl logs my-pod --previous

    # Get logs from specific container
    kubectl logs my-pod -c my-container

    # Follow logs
    kubectl logs -f my-pod

    # Last N lines
    kubectl logs --tail=100 my-pod

Common causes: ...
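The `grep -v Running | grep -v Completed` filter from the quick health check matches the word anywhere on the line and cannot see pods that are Running but not Ready. A column-aware variant is sketched below. It is an illustration under assumptions: `not_healthy` is a hypothetical helper, and it assumes the default `kubectl get pods -A` layout where READY is the third column and STATUS the fourth.

```shell
# not_healthy: hypothetical helper. Reads `kubectl get pods -A` output on
# stdin and prints the header plus every pod that is neither Completed nor
# fully ready and Running. Assumes columns: NAMESPACE NAME READY STATUS ...
not_healthy() {
    awk '
        NR == 1 { print; next }           # keep the header row
        $4 == "Completed" { next }        # finished jobs are fine
        {
            split($3, ready, "/")         # READY looks like "1/2"
            if ($4 != "Running" || ready[1] != ready[2]) print
        }
    '
}

# Usage (requires kubectl and cluster access):
#   kubectl get pods -A | not_healthy
```

This also surfaces pods stuck at 0/1 Running, which usually points at a failing readiness probe.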

Kubernetes Troubleshooting: A Practical Field Guide

When a Kubernetes deployment goes sideways at 3am, you need a systematic approach. Here’s the troubleshooting playbook I’ve developed from watching countless production incidents.

The First Three Commands

Before diving deep, these three commands tell you 80% of what you need:

    # What's not running?
    kubectl get pods -A | grep -v Running | grep -v Completed

    # What happened recently?
    kubectl get events -A --sort-by='.lastTimestamp' | tail -20

    # Resource pressure?
    kubectl top nodes

Run these first. Always. ...