Every ops engineer has a story about a shell script that worked perfectly — until it didn’t. Usually at 3 AM. Usually in production.

These patterns won’t make your scripts bulletproof, but they’ll stop the most common failures.

Start Every Script Right

#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'

This preamble should be muscle memory:

  • set -e: Exit immediately when any command fails (returns non-zero)
  • set -u: Treat references to undefined variables as errors
  • set -o pipefail: A pipeline fails if any command in it fails, not just the last one
  • IFS=$'\n\t': Split words only on newlines and tabs, never on spaces

Without set -u, a typo like rm -rf $UNSET_VAR/ silently expands to rm -rf / and could wipe your root filesystem.
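To see the difference set -u makes, here's a throwaway demo (DEMO_UNSET_VAR is a made-up name that should exist nowhere):

```shell
# The same unset-variable expansion, with and without set -u.
with_u=$(bash -uc 'echo "rm -rf $DEMO_UNSET_VAR/"' 2>&1 || true)
without_u=$(bash -c 'echo "rm -rf $DEMO_UNSET_VAR/"')

echo "without -u: $without_u"   # the path collapses to /
echo "with    -u: $with_u"      # bash aborts with "unbound variable"
```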

Trap Errors for Cleanup

cleanup() {
    local exit_code=$?
    # ${TEMP_FILE:-} keeps set -u from failing if the trap fires before mktemp runs
    rm -f "${TEMP_FILE:-}" 2>/dev/null
    echo "Script exited with code: $exit_code"
    exit $exit_code
}

trap cleanup EXIT

TEMP_FILE=$(mktemp)
# Script continues... cleanup runs automatically on exit

The EXIT trap fires regardless of how the script ends — success, error, or interrupt. Use it for:

  • Removing temporary files
  • Releasing locks
  • Logging script completion
  • Restoring changed state
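Building on the pattern above, here's a sketch that also routes Ctrl-C and kill through the same cleanup path, with a guard in case the trap fires before mktemp has run:

```shell
#!/usr/bin/env bash
set -euo pipefail

cleanup() {
    local exit_code=$?
    # ${TEMP_FILE:-} keeps set -u happy if we die before mktemp runs
    rm -f "${TEMP_FILE:-}"
    echo "cleanup ran (exit code: $exit_code)" >&2
    exit "$exit_code"
}
trap cleanup EXIT
# INT (Ctrl-C) and TERM (kill) fall through to the EXIT trap;
# 130 and 143 are the conventional 128+signal exit codes.
trap 'exit 130' INT
trap 'exit 143' TERM

TEMP_FILE=$(mktemp)
echo "working in $TEMP_FILE"
```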

Validate Inputs Early

validate_inputs() {
    if [[ -z "${1:-}" ]]; then
        echo "Error: Missing required argument: environment" >&2
        echo "Usage: $0 <environment> [options]" >&2
        exit 1
    fi

    if [[ ! "$1" =~ ^(dev|staging|prod)$ ]]; then
        echo "Error: Invalid environment '$1'. Must be: dev, staging, prod" >&2
        exit 1
    fi

    # ${CONFIG_FILE:-} stays set -u safe if CONFIG_FILE was never defined
    if [[ ! -f "${CONFIG_FILE:-}" ]]; then
        echo "Error: Config file not found: $CONFIG_FILE" >&2
        exit 1
    fi
}

validate_inputs "$@"

Fail fast, fail clearly. Don’t let the script run halfway before discovering bad input.
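In the same spirit, it's worth validating the environment, not just the arguments. A sketch (the command list is illustrative) that checks required tools up front:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fail before doing any work if a dependency is missing.
require_commands() {
    local missing=0
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "Error: required command not found: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

require_commands ls grep sed || exit 1
```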

Use Functions for Reusability

log() {
    local level="$1"
    shift
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*" | tee -a "$LOG_FILE"
}

log_info()  { log "INFO"  "$@"; }
log_warn()  { log "WARN"  "$@"; }
log_error() { log "ERROR" "$@" >&2; }

die() {
    log_error "$@"
    exit 1
}

# Usage
log_info "Starting deployment to $ENVIRONMENT"
[[ -f "$CONFIG" ]] || die "Config file missing: $CONFIG"

Consistent logging makes debugging possible. Timestamps and log levels aren’t optional — they’re how you reconstruct what happened.
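One useful extension (LOG_LEVEL and the priority mapping are assumptions, not part of the helpers above): a threshold so DEBUG output can stay on in dev and off in prod:

```shell
#!/usr/bin/env bash
set -euo pipefail

LOG_LEVEL="${LOG_LEVEL:-INFO}"

# Map level names to numeric priorities for comparison.
log_priority() {
    case "$1" in
        DEBUG) echo 0 ;;
        INFO)  echo 1 ;;
        WARN)  echo 2 ;;
        ERROR) echo 3 ;;
        *)     echo 1 ;;
    esac
}

log() {
    local level="$1"; shift
    # Drop messages below the configured threshold.
    (( $(log_priority "$level") >= $(log_priority "$LOG_LEVEL") )) || return 0
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*"
}

log DEBUG "suppressed at the default INFO threshold"
log INFO  "this line prints"
```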

Safe Defaults and Confirmations

# Default to safe mode
DRY_RUN="${DRY_RUN:-true}"
FORCE="${FORCE:-false}"

if [[ "$ENVIRONMENT" == "prod" && "$FORCE" != "true" ]]; then
    echo "⚠️  You're about to deploy to PRODUCTION"
    read -rp "Type 'yes' to continue: " confirm
    [[ "$confirm" == "yes" ]] || die "Deployment cancelled"
fi

if [[ "$DRY_RUN" == "true" ]]; then
    echo "[DRY RUN] Would execute: $command"
else
    eval "$command"
fi

Scripts that touch production should be paranoid by default. Require explicit opt-in for destructive actions.
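A related pattern worth considering: funnel every mutating command through one wrapper instead of eval-ing strings. Passing arguments via "$@" sidesteps the quoting hazards of eval "$command" (the run name is just a convention):

```shell
#!/usr/bin/env bash
set -euo pipefail

DRY_RUN="${DRY_RUN:-true}"

# Execute the command, or just describe it in dry-run mode.
run() {
    if [[ "$DRY_RUN" == "true" ]]; then
        echo "[DRY RUN] Would execute: $*"
    else
        "$@"
    fi
}

run rm -rf /tmp/build-artifacts   # safe by default: only prints
```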

Lock Files for Single Instance

LOCK_FILE="/var/run/${SCRIPT_NAME}.lock"

acquire_lock() {
    if ! mkdir "$LOCK_FILE" 2>/dev/null; then
        local pid
        pid=$(cat "$LOCK_FILE/pid" 2>/dev/null || true)  # || true: a missing pid file must not trip set -e
        if [[ -n "$pid" ]] && kill -0 "$pid" 2>/dev/null; then
            die "Script already running (PID: $pid)"
        fi
        # Stale lock, remove it
        rm -rf "$LOCK_FILE"
        mkdir "$LOCK_FILE"
    fi
    echo $$ > "$LOCK_FILE/pid"
}

release_lock() {
    rm -rf "$LOCK_FILE"
}

acquire_lock
# Register the trap only after we own the lock: if acquire_lock die()s,
# an earlier trap would delete the other process's live lock on exit.
trap release_lock EXIT

Using mkdir for locking is atomic: two processes can never both succeed in creating a fresh lock. The PID check detects and cleans up stale locks from crashed processes, though the stale-recovery path itself has a narrow race window if two instances find the same dead lock at once.
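If your targets are Linux, flock(1) from util-linux is a simpler alternative sketch: the kernel drops the lock when the process exits, so stale-lock cleanup disappears entirely (the lock path is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

LOCK_FILE="/tmp/myscript.lock"

# Open the lock file on a dedicated file descriptor, then try to
# take an exclusive lock without blocking.
exec 200>"$LOCK_FILE"
if ! flock -n 200; then
    echo "Script already running" >&2
    exit 1
fi

echo "lock acquired; released automatically on exit"
```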

Retry with Backoff

retry() {
    local max_attempts="${1:-3}"
    local delay="${2:-5}"
    local cmd="${*:3}"
    local attempt=1

    until eval "$cmd"; do
        if ((attempt >= max_attempts)); then
            log_error "Command failed after $max_attempts attempts: $cmd"
            return 1
        fi
        log_warn "Attempt $attempt failed. Retrying in ${delay}s..."
        sleep "$delay"
        ((attempt++))
        ((delay *= 2))  # Exponential backoff
    done
}

# Usage
retry 5 2 curl -sf "https://api.example.com/health"

Network calls fail. APIs rate-limit. Exponential backoff is the polite way to retry.
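One refinement to the pattern above: add jitter so a fleet of clients retrying in lockstep doesn't hammer the server at the same instant (the function name is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Exponential backoff with jitter: attempt N waits somewhere between
# base*2^(N-1) seconds and twice that, instead of a fixed, synchronized delay.
backoff_delay() {
    local base="$1" attempt="$2"
    local delay=$(( base * (2 ** (attempt - 1)) ))
    echo $(( delay + RANDOM % (delay + 1) ))
}

# Third attempt with a 2s base: prints a value between 8 and 16.
backoff_delay 2 3
```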

Handle Secrets Properly

# Never log secrets
get_secret() {
    local secret_name="$1"
    # Fetch from vault, AWS Secrets Manager, etc.
    aws secretsmanager get-secret-value \
        --secret-id "$secret_name" \
        --query 'SecretString' \
        --output text
}

# Use process substitution to avoid secrets in ps output
some_command --password-file <(get_secret "db-password")

# Or use environment variables
export DB_PASSWORD
DB_PASSWORD=$(get_secret "db-password")
some_command  # Reads $DB_PASSWORD internally

Never echo secrets. Never put them in command-line arguments (visible in ps). Prefer environment variables or file descriptors.
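A companion sketch for the logging side: scrub known secret values before anything reaches a log file. (Deliberately naive: it breaks if the secret contains sed metacharacters, and the function name is illustrative.)

```shell
#!/usr/bin/env bash
set -euo pipefail

# Replace a known secret with a placeholder in whatever streams through.
redact() {
    local secret="$1"
    sed "s/${secret}/[REDACTED]/g"
}

echo "connecting with token=hunter2" | redact "hunter2"
```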

Portable Path Handling

# Get the script's directory (note: this does not resolve symlinks to the script itself)
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Resolve relative paths
CONFIG_FILE="${SCRIPT_DIR}/../config/settings.yaml"
CONFIG_FILE="$(cd "$(dirname "$CONFIG_FILE")" && pwd)/$(basename "$CONFIG_FILE")"

# Safe path joining
join_path() {
    local result="$1"
    shift
    for part in "$@"; do
        result="${result%/}/${part#/}"
    done
    echo "$result"
}

Scripts get called from unexpected directories. Absolute paths prevent “file not found” mysteries.
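When the script may be invoked through a symlink, SCRIPT_DIR above points at the symlink's directory, not the real one. A portable resolution sketch (readlink -f would be shorter, but it's GNU-only):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Follow a chain of symlinks to the script's real location.
resolve_script_path() {
    local source="${BASH_SOURCE[0]:-$0}"
    while [[ -L "$source" ]]; do
        local dir
        dir="$(cd "$(dirname "$source")" && pwd)"
        source="$(readlink "$source")"
        # A relative target is interpreted relative to the link's directory
        [[ "$source" == /* ]] || source="$dir/$source"
    done
    echo "$(cd "$(dirname "$source")" && pwd)/$(basename "$source")"
}

SCRIPT_PATH="$(resolve_script_path)"
SCRIPT_DIR="$(dirname "$SCRIPT_PATH")"
```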

Template for Production Scripts

#!/usr/bin/env bash
set -euo pipefail

# Configuration
readonly SCRIPT_NAME="$(basename "$0")"
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly LOG_FILE="/var/log/${SCRIPT_NAME%.*}.log"

# Logging
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] [$1] ${*:2}" | tee -a "$LOG_FILE"; }
log_info()  { log "INFO" "$@"; }
log_error() { log "ERROR" "$@" >&2; }
die() { log_error "$@"; exit 1; }

# Cleanup
cleanup() { rm -f "${TEMP_FILE:-}" 2>/dev/null; }  # ${TEMP_FILE:-} is set -u safe if no temp file was created
trap cleanup EXIT

# Validation
[[ $# -ge 1 ]] || die "Usage: $SCRIPT_NAME <environment>"

# Main
main() {
    local environment="$1"
    log_info "Starting $SCRIPT_NAME for $environment"
    
    # Your logic here
    
    log_info "Completed successfully"
}

main "$@"

Copy this. Adapt it. Sleep better knowing your scripts handle edge cases.


Computing Arts is where I share hard-won lessons from production. More at computingarts.com.