Error Handling Philosophy: Fail Gracefully, Recover Quickly

Errors are inevitable. Networks fail. Disks fill up. Services crash. Users input garbage. The question isn’t whether your system will encounter errors — it’s how it will behave when it does.

Good error handling is the difference between “the system recovered automatically” and “we lost customer data.”

Fail Fast, Fail Loud

Detect problems early and report them clearly:

# ❌ Bad: Silent failure
def process_order(order):
    try:
        result = payment_service.charge(order)
    except Exception:
        pass  # Swallow error, pretend it worked
    return {"status": "ok"}

# ✅ Good: Fail fast with clear error
def process_order(order):
    if not order.items:
        raise ValidationError("Order must have at least one item")
    
    if order.total <= 0:
        raise ValidationError("Order total must be positive")
    
    result = payment_service.charge(order)  # Let it fail if it fails
    return {"status": "ok", "charge_id": result.id}

Silent failures are debugging nightmares. If something goes wrong, make it obvious.

Don’t Catch What You Can’t Handle

# ❌ Bad: Catching everything
def run_operation():
    try:
        return complex_operation()
    except Exception as e:
        logger.error(f"Error: {e}")
        return None  # Now what?

# ✅ Good: Catch specific errors you can handle
def run_operation():
    try:
        return complex_operation()
    except ConnectionTimeout:
        # We know how to handle this: retry
        return retry_with_backoff(complex_operation)
    except ValidationError as e:
        # We know how to handle this: return error to user
        return {"error": str(e)}, 400
    # Let other exceptions propagate: they're bugs

If you can’t meaningfully recover, let the error bubble up.

Errors Are Data

Errors should be structured, not strings:

# ❌ Bad: String errors
raise Exception("Payment failed")

# ✅ Good: Structured errors
from typing import Optional

class PaymentError(Exception):
    def __init__(self, code: str, message: str, charge_id: Optional[str] = None):
        self.code = code
        self.message = message
        self.charge_id = charge_id
        super().__init__(message)

raise PaymentError(
    code="CARD_DECLINED",
    message="Card was declined by issuer",
    charge_id="ch_123"
)

Structured errors can be:

Logged with context
Translated for users
Handled differently by code
Monitored and alerted on
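
To make those four uses concrete, here is a minimal sketch that reuses the PaymentError shape from above. The `to_dict` method, the `USER_MESSAGES` table, the `RETRYABLE_CODES` set, and the `handle_payment_error` function are illustrative assumptions, not part of any library.

```python
from typing import Optional


class PaymentError(Exception):
    """Same shape as the PaymentError above, plus a hypothetical to_dict()."""

    def __init__(self, code: str, message: str, charge_id: Optional[str] = None):
        self.code = code
        self.message = message
        self.charge_id = charge_id
        super().__init__(message)

    def to_dict(self) -> dict:
        # Flat dict: easy to attach to a log line or a metrics event
        return {"code": self.code, "message": self.message, "charge_id": self.charge_id}


# Hypothetical lookup tables, keyed on the machine-readable code
USER_MESSAGES = {
    "CARD_DECLINED": "Your card was declined. Please try another payment method.",
    "PROVIDER_TIMEOUT": "We could not reach the payment provider. Please try again.",
}
RETRYABLE_CODES = {"PROVIDER_TIMEOUT"}


def handle_payment_error(err: PaymentError) -> dict:
    """Turn one structured error into a log payload, a user message, and a retry decision."""
    return {
        "log": err.to_dict(),
        "user_message": USER_MESSAGES.get(err.code, "Payment failed. Please contact support."),
        "retryable": err.code in RETRYABLE_CODES,
        "metric": f"payment_error.{err.code.lower()}",
    }


outcome = handle_payment_error(
    PaymentError(code="CARD_DECLINED", message="Card was declined by issuer", charge_id="ch_123")
)
print(outcome["user_message"])
print(outcome["metric"])
```

Because every decision keys off `err.code` rather than the message text, the wording shown to users can change without breaking handlers or dashboards.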

Error Boundaries

Contain failures so they don’t cascade:

class OrderProcessor:
    def process_order(self, order):
        # Core operation — must succeed
        charge = self.charge_payment(order)
        
        # Non-critical operations — failures shouldn't fail the order
        self.send_confirmation_email(order)  # Fire and forget
        self.update_analytics(order)          # Best effort
        
        return charge
    
    def send_confirmation_email(self, order):
        try:
            email_service.send(order.user.email, "confirmation", order)
        except Exception as e:
            # Log but don't fail the order
            logger.warning(f"Failed to send confirmation: {e}")
            # Queue for retry later
            retry_queue.add("send_email", order.id)
    
    def update_analytics(self, order):
        try:
            analytics.track("order_completed", order.to_dict())
        except Exception as e:
            # Analytics failure is never critical
            logger.debug(f"Analytics failed: {e}")
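
The two wrappers above share one pattern: try, log, carry on. One possible way to factor it out is a small decorator. The `best_effort` name and its parameters are hypothetical, and the two decorated functions are stand-ins for real service calls.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def best_effort(default=None, level=logging.WARNING):
    """Hypothetical decorator: log any exception from a non-critical step and return `default`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Contain the failure here so the caller's core operation is unaffected
                logger.log(level, "Non-critical step %s failed", func.__name__, exc_info=True)
                return default

        return wrapper

    return decorator


@best_effort(default=False)
def update_analytics(order_id: str) -> bool:
    # Stand-in for a real analytics call that happens to be down
    raise ConnectionError("analytics service unreachable")


@best_effort(default=False)
def send_confirmation_email(order_id: str) -> bool:
    return True  # Stand-in for a successful send


analytics_ok = update_analytics("order_42")
email_ok = send_confirmation_email("order_42")
```

A broad `except Exception` is acceptable here only because the decorated steps are explicitly non-critical. Applying the same decorator to the payment charge would reintroduce the silent failures described earlier.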

Graceful Degradation

When dependencies fail, degrade functionality instead of failing completely:

async def get_product_page(product_id: str):
    product = await products_db.get(product_id)  # Critical
    
    # Non-critical: show page even if these fail
    try:
        recommendations = await recommendation_service.get(product_id)
    except ServiceUnavailable:
        recommendations = []  # Show page without recommendations
    
    try:
        reviews = await reviews_service.get(product_id)
    except ServiceUnavailable:
        reviews = None  # Show page without reviews
    
    return render("product.html", product, recommendations, reviews)
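
A dependency that responds slowly can hurt as much as one that is down, so a fallback is often paired with a timeout. The sketch below assumes a hypothetical `with_fallback` helper and two simulated services. The 0.5 second default is an arbitrary illustration, not a recommendation.

```python
import asyncio


class ServiceUnavailable(Exception):
    """Stand-in for the ServiceUnavailable error used above."""


async def with_fallback(coro, fallback, timeout: float = 0.5):
    """Hypothetical helper: await `coro`, but return `fallback` if it is slow or unavailable."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except (ServiceUnavailable, asyncio.TimeoutError):
        return fallback


async def fetch_recommendations(product_id: str) -> list:
    raise ServiceUnavailable("recommendations are down")  # Simulated outage


async def fetch_reviews(product_id: str) -> list:
    await asyncio.sleep(0.01)  # Simulated fast, healthy service
    return [{"rating": 5, "text": "Great"}]


async def load_optional_sections(product_id: str) -> dict:
    # Fetch both non-critical sections concurrently; each one degrades independently
    recommendations, reviews = await asyncio.gather(
        with_fallback(fetch_recommendations(product_id), fallback=[]),
        with_fallback(fetch_reviews(product_id), fallback=[]),
    )
    return {"recommendations": recommendations, "reviews": reviews}


sections = asyncio.run(load_optional_sections("sku_1"))
```

Only the two expected failure modes are caught. Any other exception still propagates, which keeps genuine bugs visible.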

Circuit Breakers

Stop calling a failing service:

import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout  # Seconds to wait before a trial call
        self.last_failure = None
        self.state = "closed"  # closed, open, half-open

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Monotonic clock: unaffected by wall-clock adjustments
            if time.monotonic() - self.last_failure > self.reset_timeout:
                self.state = "half-open"  # Let one trial call through
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.on_failure()
            raise
        self.on_success()
        return result

    def on_success(self):
        self.failures = 0
        self.state = "closed"

    def on_failure(self):
        self.failures += 1
        self.last_failure = time.monotonic()
        # A failed trial call in half-open reopens the circuit immediately
        if self.state == "half-open" or self.failures >= self.threshold:
            self.state = "open"

# Usage
payment_breaker = CircuitBreaker()

def charge_payment(order):
    return payment_breaker.call(payment_service.charge, order)

Retries with Backoff

Transient failures often resolve themselves:

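import logging
import random
import time

logger = logging.getLogger(__name__)

class TransientError(Exception):
    """Placeholder base class for failures that are worth retrying."""

def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError as e:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last error

            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)

            # Log before sleeping so the warning is emitted when the failure occurs
            logger.warning(f"Retry {attempt + 1}/{max_retries} in {delay + jitter:.1f}s: {e}")
            time.sleep(delay + jitter)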

But know when NOT to retry:

Validation errors (won’t succeed on retry)
Authentication errors (need new credentials)
Rate limits (respect Retry-After header)
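
For the rate-limit case, the Retry-After header carries either a number of seconds or an HTTP-date. The sketch below shows one way to turn either form into a delay. The `retry_after_seconds` function is a hypothetical helper, and how the header value is obtained depends on your HTTP client.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Parse a Retry-After header value into a delay in seconds.

    Returns None when the header is missing or cannot be parsed,
    so the caller can fall back to its own backoff schedule.
    """
    if not header_value:
        return None
    value = header_value.strip()

    # Form 1: delay in whole seconds, e.g. "120"
    if value.isascii() and value.isdigit():
        return float(value)

    # Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)

    now = now or datetime.now(timezone.utc)
    # A date in the past means the caller may retry immediately
    return max(0.0, (retry_at - now).total_seconds())


delay = retry_after_seconds("120")
```

In practice you would likely also cap the returned delay, since a misbehaving server could otherwise ask the client to wait for hours.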

Idempotency

Make operations safe to retry:

def process_payment(order_id: str, idempotency_key: str):
    # Check if we already processed this
    existing = db.get_payment(idempotency_key)
    if existing is not None:
        return existing  # Return same result

    # Process payment
    # Caution: the check above and the save below are not atomic, so two
    # concurrent requests with the same key can both reach this charge.
    result = payment_service.charge(order_id)

    # Store with idempotency key
    db.save_payment(idempotency_key, result)

    return result

Clients that resend the same idempotency key can safely retry without double-charging.
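
One way to close the gap between the check and the save is to make the whole sequence atomic. The sketch below does that with a lock around an in-memory dict. `InMemoryPaymentStore` and `run_once` are hypothetical names. A real system would more likely rely on a unique constraint on the key column in the database.

```python
import threading
import uuid


class InMemoryPaymentStore:
    """Hypothetical stand-in for a database table with a unique idempotency_key column."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}

    def run_once(self, idempotency_key: str, operation):
        # Holding the lock makes check, charge, and save a single atomic step
        with self._lock:
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            result = operation()
            self._results[idempotency_key] = result
            return result


charges = []  # Records every real charge so we can count them


def charge(order_id: str) -> dict:
    charges.append(order_id)
    return {"order_id": order_id, "charge_id": f"ch_{len(charges)}"}


store = InMemoryPaymentStore()

# The client generates the key once per logical payment and reuses it on every retry
key = str(uuid.uuid4())
first = store.run_once(key, lambda: charge("order_42"))
retry = store.run_once(key, lambda: charge("order_42"))
```

A single global lock serializes every payment, which is fine for illustration but too coarse for production. Locking per key, or letting the database reject duplicate keys, avoids that bottleneck.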

Meaningful Error Messages

# ❌ Bad: Useless errors
raise Exception("Error")
raise Exception("Something went wrong")
raise Exception("null pointer")

# ✅ Good: Actionable errors
raise ValidationError(
    "Email address is invalid",
    field="email",
    value=email,
    hint="Must be a valid email format (e.g., user@example.com)"
)

raise ConfigurationError(
    "Database connection failed",
    config_key="DATABASE_URL",
    hint="Check that the database is running and the URL is correct"
)

Good error messages answer:

What happened?
Where did it happen?
What can be done about it?
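
Python's built-in exceptions do not accept keyword arguments such as `field` or `hint`, so the examples above need custom classes. Here is a minimal sketch of how they might be defined. `ActionableError` is a hypothetical base class, and the string format is one choice among many.

```python
from typing import Any, Optional


class ActionableError(Exception):
    """Hypothetical base class: a message plus optional structured context and a hint."""

    def __init__(self, message: str, *, hint: Optional[str] = None, **context: Any):
        super().__init__(message)
        self.message = message  # What happened
        self.context = context  # Where it happened, e.g. field="email"
        self.hint = hint        # What can be done about it

    def __str__(self) -> str:
        parts = [self.message]
        if self.context:
            details = ", ".join(f"{key}={value!r}" for key, value in self.context.items())
            parts.append(f"({details})")
        if self.hint:
            parts.append(f"Hint: {self.hint}")
        return " ".join(parts)


class ValidationError(ActionableError):
    pass


class ConfigurationError(ActionableError):
    pass


error = ValidationError(
    "Email address is invalid",
    field="email",
    value="not-an-email",
    hint="Must be a valid email format (e.g., user@example.com)",
)
print(error)
```

Be careful about which values go into the context: echoing a password or card number into an error message moves sensitive data into logs.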

Logging Errors

import structlog

logger = structlog.get_logger()

def process_order(order):
    try:
        result = payment_service.charge(order)
        logger.info("payment_succeeded", 
            order_id=order.id, 
            amount=order.total,
            charge_id=result.id
        )
        return result
    except PaymentError as e:
        logger.error("payment_failed",
            order_id=order.id,
            amount=order.total,
            error_code=e.code,
            error_message=e.message,
            # Include context for debugging
            user_id=order.user_id,
            payment_method=order.payment_method.type
        )
        raise

Log errors with context. Future-you debugging at 3am will be grateful.

The Error Handling Checklist

For each error:
- [ ] Can the system recover automatically?
- [ ] If not, is the failure contained?
- [ ] Is the error logged with context?
- [ ] Does the user get a helpful message?
- [ ] Is the operation safe to retry?
- [ ] Are we alerted if this happens too often?
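
The last item, alerting when an error happens too often, can be approximated with a sliding-window counter. This is a sketch only: `ErrorRateMonitor` is a hypothetical class, and most teams would delegate this to their monitoring system rather than application code.

```python
import time
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sliding-window counter: reports when errors exceed a threshold."""

    def __init__(self, threshold: int, window_seconds: float, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock  # Injectable so tests do not need to sleep
        self._timestamps = deque()

    def record_error(self) -> bool:
        """Record one error; return True if the threshold is now exceeded."""
        now = self.clock()
        self._timestamps.append(now)
        # Drop errors that have fallen out of the window
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold


# Simulated clock so the example runs instantly
fake_now = [0.0]
monitor = ErrorRateMonitor(threshold=2, window_seconds=60, clock=lambda: fake_now[0])

alerts = [monitor.record_error() for _ in range(3)]  # Three errors at t=0

fake_now[0] = 120.0  # Two minutes later the earlier errors have expired
later_alert = monitor.record_error()
```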

Philosophy Summary

Fail fast: Detect problems early
Fail loud: Make errors visible
Fail gracefully: Degrade instead of crash
Fail safe: Default to secure state
Recover quickly: Retry, circuit break, self-heal

Error handling isn’t about preventing errors — it’s about responding to them well. The system that handles errors gracefully earns user trust. The system that crashes mysteriously loses it.

Expect failures. Plan for them. Test them. The error path is as important as the happy path.
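
Testing the error path usually means forcing a dependency to fail. The sketch below does this with the standard library's `unittest.mock`. The `process_order` function is a simplified stand-in for the one shown earlier, taking a plain dict and an injected `payment_service` so the example is self-contained.

```python
import unittest
from unittest import mock


class PaymentError(Exception):
    """Stand-in for the PaymentError defined earlier."""


def process_order(order: dict, payment_service) -> dict:
    """Simplified stand-in for the process_order function above."""
    if not order.get("items"):
        raise ValueError("Order must have at least one item")
    result = payment_service.charge(order)  # Errors propagate to the caller
    return {"status": "ok", "charge_id": result["id"]}


class ProcessOrderErrorPathTest(unittest.TestCase):
    def test_empty_order_is_rejected_before_charging(self):
        payment_service = mock.Mock()
        with self.assertRaises(ValueError):
            process_order({"items": []}, payment_service)
        # Failing fast means the payment service is never touched
        payment_service.charge.assert_not_called()

    def test_payment_failure_propagates(self):
        payment_service = mock.Mock()
        # side_effect makes the mock raise instead of returning a value
        payment_service.charge.side_effect = PaymentError("Card was declined by issuer")
        with self.assertRaises(PaymentError):
            process_order({"items": ["book"]}, payment_service)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ProcessOrderErrorPathTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Both tests assert on what the function refuses to do, which is the behavior a test of the happy path never exercises.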
