Devops

SSH Security Hardening: Protecting Your Most Critical Access Point

SSH is the front door to your infrastructure. Every server, every cloud instance, every piece of automation relies on it. A compromised SSH setup means game over. These hardening steps dramatically reduce your attack surface. Disable Password Authentication Passwords can be brute-forced. Keys can’t (practically): 1 2 3 4 # /etc/ssh/sshd_config PasswordAuthentication no ChallengeResponseAuthentication no UsePAM no 1 2 # Restart SSH (keep your current session open!) sudo systemctl restart sshd Before doing this, ensure your key is in ~/.ssh/authorized_keys. ...

Linux Performance Debugging: Finding What's Slow and Why

Something’s slow. Users are complaining. Your monitoring shows high latency but not why. You SSH into the server and need to figure out what’s wrong — fast. This is a systematic approach to Linux performance debugging. Start with the Big Picture Before diving deep, get an overview: 1 2 3 4 5 6 # System load and uptime uptime # 17:00:00 up 45 days, load average: 8.52, 4.23, 2.15 # Load average: 1/5/15 minute averages # Compare to CPU count: load 8 on 4 CPUs = overloaded 1 2 # Quick health check top -bn1 | head -20 CPU: Who’s Using It? 1 2 3 4 5 6 # Real-time CPU usage top # Sort by CPU: press 'P' # Sort by memory: press 'M' # Show threads: press 'H' 1 2 3 4 5 # CPU usage by process (snapshot) ps aux --sort=-%cpu | head -10 # CPU time accumulated ps aux --sort=-time | head -10 1 2 3 4 # Per-CPU breakdown mpstat -P ALL 1 5 # Shows each CPU core's utilization # Look for: one core at 100% (single-threaded bottleneck) 1 2 3 4 5 6 7 8 9 10 11 # What's the CPU actually doing? vmstat 1 5 # r b swpd free buff cache si so bi bo in cs us sy id wa st # 3 0 0 245612 128940 2985432 0 0 8 24 312 892 45 12 40 3 0 # r: processes waiting for CPU (high = CPU-bound) # b: processes blocked on I/O (high = I/O-bound) # us: user CPU time # sy: system CPU time # wa: waiting for I/O # id: idle Memory: Running Out? 1 2 3 4 5 6 7 # Memory overview free -h # total used free shared buff/cache available # Mem: 31Gi 12Gi 2.1Gi 1.2Gi 17Gi 17Gi # "available" is what matters, not "free" # Linux uses free memory for cache — that's good 1 2 3 4 5 6 7 8 # Top memory consumers ps aux --sort=-%mem | head -10 # Memory details per process pmap -x <pid> # Or from /proc cat /proc/<pid>/status | grep -i mem 1 2 3 # Check for OOM killer activity dmesg | grep -i "out of memory" journalctl -k | grep -i oom 1 2 3 # Swap usage (if swapping, you're in trouble) swapon -s vmstat 1 5 # si/so columns show swap in/out Disk I/O: The Hidden Bottleneck 1 2 3 4 5 6 7 8 9 10 11 # I/O wait in top top # Look at %wa in the CPU line # Detailed I/O stats iostat -xz 1 5 # Device r/s w/s rkB/s wkB/s await %util # sda 150.00 200.00 4800 8000 12.5 85.0 # await: average I/O wait time (ms) - high = slow disk # %util: disk utilization - 100% = saturated 1 2 3 4 5 6 # Which processes are doing I/O? iotop -o # Shows only processes actively doing I/O # Or without iotop pidstat -d 1 5 1 2 3 # Check for disk errors dmesg | grep -i error smartctl -a /dev/sda # SMART data for disk health Network: Connections and Throughput 1 2 3 4 5 6 7 8 # Network interface stats ip -s link show # Watch for errors, drops, overruns # Real-time bandwidth iftop # Or nload 1 2 3 4 5 6 7 # Connection states ss -s # Total: 1523 (kernel 1847) # TCP: 892 (estab 654, closed 89, orphaned 12, timewait 127) # Many TIME_WAIT: connection churn # Many ESTABLISHED: legitimate load or connection leak 1 2 3 4 5 # Connections per state ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn # Connections by remote IP ss -tn | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head 1 2 3 4 5 6 # Check for packet loss ping -c 100 <gateway> # Any loss = network problem # TCP retransmits netstat -s | grep -i retrans Process-Level Debugging 1 2 3 4 5 6 # What's a specific process doing? strace -p <pid> -c # Summary of system calls strace -p <pid> -f -e trace=network # Network-related calls only 1 2 3 4 5 6 7 8 # Open files and connections lsof -p <pid> # Just network connections lsof -i -p <pid> # Files in a directory lsof +D /var/log 1 2 3 4 5 # Process threads ps -T -p <pid> # Thread CPU usage top -H -p <pid> System-Wide Tracing 1 2 3 4 5 6 7 # What's happening system-wide? perf top # Real-time view of where CPU cycles go # Record and analyze perf record -g -a sleep 10 perf report 1 2 3 4 5 6 7 8 9 # BPF-based tools (if available) # CPU usage by function profile-bpfcc 10 # Disk latency histogram biolatency-bpfcc # TCP connection latency tcpconnlat-bpfcc The Checklist Approach When debugging, go through systematically: ...

Git Workflow Strategies: Branching Models That Scale

Git doesn’t enforce a workflow. You can branch however you want, merge whenever you want, push whatever you want. That freedom is powerful and dangerous. A good branching strategy reduces merge conflicts, enables parallel development, and makes releases predictable. A bad one creates confusion, broken builds, and deployment fear. Trunk-Based Development The simplest model: everyone commits to main, frequently. m a i n : ─ f ( ─ e h ─ a o ● │ t u ─ u r ─ r s ─ e ) ● ─ ─ f ( ─ i h ● │ x o ─ u ─ r ─ s ● ) ─ ─ ─ ● f ( ─ e h ─ a o ─ t u ● │ u r ─ r s ─ e ) ─ ● ─ ─ ─ Rules: ...

Ansible Playbook Patterns: Writing Automation That Doesn't Break

Ansible’s simplicity is seductive. YAML tasks, SSH connections, no agents. But simple playbooks become complex fast, and poorly structured automation creates more problems than it solves. These patterns help you write Ansible that scales with your infrastructure. Idempotency: Safe to Run Twice Every task should be safe to run repeatedly with the same result: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 # Idempotent - creates file if missing, no-op if exists - name: Create config directory file: path: /etc/myapp state: directory mode: '0755' # Not idempotent - appends every run - name: Add config line shell: echo "setting=value" >> /etc/myapp/config # Idempotent version - name: Add config line lineinfile: path: /etc/myapp/config line: "setting=value" Use Ansible modules over shell commands. Modules are designed for idempotency. ...

Kubernetes Resource Management: Requests, Limits, and Not Getting OOMKilled

Kubernetes needs to know how much CPU and memory your containers need. Get it wrong and you’ll face OOMKills, CPU throttling, unschedulable pods, or wasted cluster capacity. Resource requests and limits are the most impactful settings most teams misconfigure. Requests vs Limits Requests: What you’re guaranteed. Used for scheduling. Limits: What you can’t exceed. Enforced at runtime. 1 2 3 4 5 6 7 resources: requests: memory: "256Mi" cpu: "250m" limits: memory: "512Mi" cpu: "500m" This pod: ...

CI/CD Pipelines: From Commit to Production Safely

Continuous Integration and Continuous Deployment transform code changes into running software automatically. Done well, you push code and forget about it — the pipeline handles testing, building, and deploying. Done poorly, you spend more time fighting the pipeline than writing code. The Pipeline Stages C o m m i t → B u i l d → T e s t → S e c u r i t y → A r t i f a c t → D e p l o y → V e r i f y Each stage is a gate. Fail any stage, stop the pipeline. ...

Observability: Logs, Metrics, and Traces Working Together

Monitoring answers “is it working?” Observability answers “why isn’t it working?” The difference matters when you’re debugging a production incident at 3am. The three pillars of observability — logs, metrics, and traces — each provide different perspectives. Together, they create a complete picture of system behavior. Logs: The Narrative Logs tell you what happened, in order: 1 2 3 {"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "request_started", "request_id": "abc123", "path": "/api/users"} {"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "db_query", "request_id": "abc123", "duration_ms": 45} {"timestamp": "2026-02-23T13:00:02Z", "level": "error", "event": "request_failed", "request_id": "abc123", "error": "connection timeout"} Good for: Debugging specific requests Understanding error context Audit trails Ad-hoc investigation Challenges: ...

Infrastructure as Code: Principles That Actually Matter

Infrastructure as Code (IaC) means your servers, networks, and services are defined in version-controlled files rather than clicked into existence through consoles. The benefits are obvious: reproducibility, auditability, collaboration. But IaC done poorly creates its own problems: state drift, copy-paste sprawl, untestable configurations. The principles matter more than the tools. Declarative Over Imperative Describe what you want, not how to get there: 1 2 3 4 5 6 7 8 9 # Declarative (Terraform) - what resource "aws_instance" "web" { ami = "ami-0c55b159cbfafe1f0" instance_type = "t3.micro" tags = { Name = "web-server" } } 1 2 3 4 5 # Imperative (script) - how aws ec2 run-instances \ --image-id ami-0c55b159cbfafe1f0 \ --instance-type t3.micro \ --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=web-server}]' Declarative code is idempotent — run it ten times, get the same result. Imperative scripts need guards against re-running. ...

Docker Multi-Stage Builds: Smaller Images, Cleaner Dockerfiles

A typical Go application compiles to a single binary. Yet Docker images for Go apps often weigh hundreds of megabytes. Why? Because the image includes the entire Go toolchain used to build it. Multi-stage builds solve this: use one stage to build, another to run. The final image contains only what’s needed at runtime. The Problem: Fat Images 1 2 3 4 5 6 7 8 # Single-stage - includes everything FROM golang:1.21 WORKDIR /app COPY . . RUN go build -o server . CMD ["./server"] This image includes: ...

Service Discovery: Finding Services in a Dynamic World

In static infrastructure, services live at known addresses. Database at 10.0.1.5, cache at 10.0.1.6. Simple, predictable, fragile. In dynamic infrastructure — containers, auto-scaling, cloud — services appear and disappear constantly. IP addresses change. Instances multiply and vanish. Hardcoded addresses become a liability. Service discovery solves this: how do services find each other when everything is moving? The Problem 1 2 3 4 5 6 7 # Hardcoded - works until it doesn't DATABASE_URL = "postgres://10.0.1.5:5432/mydb" # What happens when: # - Database moves to a new server? # - You add read replicas? # - The IP changes after maintenance? DNS-Based Discovery The simplest approach: use DNS names instead of IPs. ...