Every ops engineer has a story about a shell script that worked perfectly — until it didn’t. Usually at 3 AM. Usually in production.
These patterns won’t make your scripts bulletproof, but they’ll stop the most common failures.
## Start Every Script Right

```bash
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
```
This preamble should be muscle memory:
- `set -e`: Exit on any error (non-zero return code)
- `set -u`: Exit on undefined variables
- `set -o pipefail`: Catch errors in pipes (not just the last command)
- `IFS=$'\n\t'`: Safer word splitting (newline and tab only, not spaces)
Without these, a typo like `rm -rf $UNSET_VAR/` could wipe your root filesystem.
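To see `set -u` in action, here's a minimal sketch that attempts the dangerous expansion inside a subshell so the failure can be observed (the variable name is hypothetical and assumed unset):

```shell
#!/usr/bin/env bash

# Attempt to expand an unset variable with set -u active.
# The subshell exits non-zero instead of silently expanding to "".
if (set -u; : "$UNSET_VAR") 2>/dev/null; then
    result="expanded silently"
else
    result="aborted by set -u"
fi
echo "$result"
```

Without `set -u`, that expansion quietly becomes the empty string — which is exactly how `rm -rf $UNSET_VAR/` turns into `rm -rf /`.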
## Trap Errors for Cleanup

```bash
cleanup() {
    local exit_code=$?
    rm -f "${TEMP_FILE:-}" 2>/dev/null || true  # safe even if TEMP_FILE was never set
    echo "Script exited with code: $exit_code"
    exit "$exit_code"
}

trap cleanup EXIT

TEMP_FILE=$(mktemp)
# Script continues... cleanup runs automatically on exit
```
The EXIT trap fires regardless of how the script ends — success, error, or interrupt. Use it for:
- Removing temporary files
- Releasing locks
- Logging script completion
- Restoring changed state
## Validate Inputs Early

```bash
validate_inputs() {
    if [[ -z "${1:-}" ]]; then
        echo "Error: Missing required argument: environment" >&2
        echo "Usage: $0 <environment> [options]" >&2
        exit 1
    fi

    if [[ ! "$1" =~ ^(dev|staging|prod)$ ]]; then
        echo "Error: Invalid environment '$1'. Must be: dev, staging, prod" >&2
        exit 1
    fi

    if [[ ! -f "$CONFIG_FILE" ]]; then
        echo "Error: Config file not found: $CONFIG_FILE" >&2
        exit 1
    fi
}

validate_inputs "$@"
```
Fail fast, fail clearly. Don’t let the script run halfway before discovering bad input.
## Use Functions for Reusability

```bash
log() {
    local level="$1"
    shift
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*" | tee -a "$LOG_FILE"
}

log_info()  { log "INFO" "$@"; }
log_warn()  { log "WARN" "$@"; }
log_error() { log "ERROR" "$@" >&2; }

die() {
    log_error "$@"
    exit 1
}

# Usage
log_info "Starting deployment to $ENVIRONMENT"
[[ -f "$CONFIG" ]] || die "Config file missing: $CONFIG"
```
Consistent logging makes debugging possible. Timestamps and log levels aren’t optional — they’re how you reconstruct what happened.
## Safe Defaults and Confirmations

```bash
# Default to safe mode
DRY_RUN="${DRY_RUN:-true}"
FORCE="${FORCE:-false}"

if [[ "$ENVIRONMENT" == "prod" && "$FORCE" != "true" ]]; then
    echo "⚠️  You're about to deploy to PRODUCTION"
    read -rp "Type 'yes' to continue: " confirm
    [[ "$confirm" == "yes" ]] || die "Deployment cancelled"
fi

if [[ "$DRY_RUN" == "true" ]]; then
    echo "[DRY RUN] Would execute: $command"
else
    eval "$command"
fi
```
Scripts that touch production should be paranoid by default. Require explicit opt-in for destructive actions.
## Lock Files for Single Instance

```bash
LOCK_FILE="/var/run/${SCRIPT_NAME}.lock"

acquire_lock() {
    if ! mkdir "$LOCK_FILE" 2>/dev/null; then
        local pid
        pid=$(cat "$LOCK_FILE/pid" 2>/dev/null || true)
        if [[ -n "$pid" ]] && kill -0 "$pid" 2>/dev/null; then
            die "Script already running (PID: $pid)"
        fi
        # Stale lock, remove it
        rm -rf "$LOCK_FILE"
        mkdir "$LOCK_FILE"
    fi
    echo $$ > "$LOCK_FILE/pid"
}

release_lock() {
    rm -rf "$LOCK_FILE"
}

acquire_lock
# Register the trap only after the lock is ours, so a failed
# acquisition doesn't delete the running instance's lock
trap release_lock EXIT
```
Using mkdir for locking is atomic and avoids race conditions. The PID check detects and cleans up stale locks from crashed processes.
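Where `flock(1)` from util-linux is available, the same single-instance guarantee can be written more compactly — the kernel releases the lock when the process exits, so no PID bookkeeping or cleanup trap is needed. A minimal sketch, with an illustrative lock path:

```shell
#!/usr/bin/env bash
set -euo pipefail

LOCK_FILE="${TMPDIR:-/tmp}/myscript.lock"  # illustrative path

# Hold the lock file open on FD 200 and take an exclusive,
# non-blocking lock; it is released automatically on exit.
exec 200>"$LOCK_FILE"
if ! flock -n 200; then
    echo "Script already running" >&2
    exit 1
fi

echo "lock acquired"
# ... the rest of the script runs under the lock ...
```

The trade-off: `flock` is Linux-specific, while the `mkdir` approach works on any POSIX system.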
## Retry with Backoff

```bash
retry() {
    local max_attempts="${1:-3}"
    local delay="${2:-5}"
    local cmd="${*:3}"
    local attempt=1

    until eval "$cmd"; do
        if ((attempt >= max_attempts)); then
            log_error "Command failed after $max_attempts attempts: $cmd"
            return 1
        fi
        log_warn "Attempt $attempt failed. Retrying in ${delay}s..."
        sleep "$delay"
        ((attempt++))
        ((delay *= 2))  # Exponential backoff
    done
}

# Usage
retry 5 2 curl -sf "https://api.example.com/health"
```
Network calls fail. APIs rate-limit. Exponential backoff is the polite way to retry.
## Handle Secrets Properly

```bash
# Never log secrets
get_secret() {
    local secret_name="$1"
    # Fetch from vault, AWS Secrets Manager, etc.
    aws secretsmanager get-secret-value \
        --secret-id "$secret_name" \
        --query 'SecretString' \
        --output text
}

# Use process substitution to avoid secrets in ps output
some_command --password-file <(get_secret "db-password")

# Or use environment variables
export DB_PASSWORD
DB_PASSWORD=$(get_secret "db-password")
some_command  # Reads $DB_PASSWORD internally
```
Never echo secrets. Never put them in command-line arguments (visible in ps). Prefer environment variables or file descriptors.
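As a tiny, runnable illustration of the file-descriptor approach — both function names here are hypothetical stand-ins, not a real API:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for a real secret fetcher (vault, AWS, etc.)
get_secret() { printf 's3cr3t'; }

# The consumer takes a *file path*, so the secret itself never
# appears in its argv (and therefore never in ps output)
read_password_file() { cat "$1"; }

password=$(read_password_file <(get_secret))
echo "read ${#password} characters without exposing them in argv"
```

The process substitution `<(get_secret)` hands the consumer a path like `/dev/fd/63`; only that path, not the secret, is visible to other users.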
## Portable Path Handling

```bash
# Get the script's own directory as an absolute path
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Resolve relative paths
CONFIG_FILE="${SCRIPT_DIR}/../config/settings.yaml"
CONFIG_FILE="$(cd "$(dirname "$CONFIG_FILE")" && pwd)/$(basename "$CONFIG_FILE")"

# Safe path joining
join_path() {
    local result="$1"
    shift
    for part in "$@"; do
        result="${result%/}/${part#/}"
    done
    echo "$result"
}
```
Scripts get called from unexpected directories. Absolute paths prevent “file not found” mysteries.
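One caveat: the `BASH_SOURCE` one-liner returns the symlink's directory if the script itself is invoked through a symlink. Where GNU `readlink -f` is available, it follows the link to the real file. A quick sketch that builds its own symlink to demonstrate:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create a real script and a symlink pointing at it, then resolve the link
workdir=$(mktemp -d)
printf '#!/usr/bin/env bash\n' > "$workdir/real.sh"
ln -s "$workdir/real.sh" "$workdir/link.sh"

resolved=$(readlink -f "$workdir/link.sh")
echo "link resolves to: ${resolved##*/}"

rm -rf "$workdir"
```

On BSD/macOS, older `readlink` lacks `-f`; `realpath` or a `cd`/`pwd -P` loop is the usual fallback there.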
## Template for Production Scripts

```bash
#!/usr/bin/env bash
set -euo pipefail

# Configuration
readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly LOG_FILE="/var/log/${SCRIPT_NAME%.*}.log"

# Logging
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$1] ${*:2}" | tee -a "$LOG_FILE"; }
log_info()  { log "INFO" "$@"; }
log_error() { log "ERROR" "$@" >&2; }
die() { log_error "$@"; exit 1; }

# Cleanup (guarded so set -u doesn't trip if TEMP_FILE was never set)
cleanup() { rm -f "${TEMP_FILE:-}" 2>/dev/null || true; }
trap cleanup EXIT

# Validation
[[ $# -ge 1 ]] || die "Usage: $SCRIPT_NAME <environment>"

# Main
main() {
    local environment="$1"
    log_info "Starting $SCRIPT_NAME for $environment"
    # Your logic here
    log_info "Completed successfully"
}

main "$@"
```
Copy this. Adapt it. Sleep better knowing your scripts handle edge cases.
Computing Arts is where I share hard-won lessons from production. More at computingarts.com.