Errors are inevitable. Networks fail. Disks fill up. Services crash. Users input garbage. The question isn’t whether your system will encounter errors — it’s how it will behave when it does.
Good error handling is the difference between “the system recovered automatically” and “we lost customer data.”
Fail Fast, Fail Loud
Detect problems early and report them clearly:
```python
# ❌ Bad: Silent failure
def process_order(order):
    try:
        result = payment_service.charge(order)
    except Exception:
        pass  # Swallow error, pretend it worked
    return {"status": "ok"}


# ✅ Good: Fail fast with clear error
def process_order(order):
    if not order.items:
        raise ValidationError("Order must have at least one item")
    if order.total <= 0:
        raise ValidationError("Order total must be positive")

    result = payment_service.charge(order)  # Let it fail if it fails
    return {"status": "ok", "charge_id": result.id}
```
Silent failures are debugging nightmares. If something goes wrong, make it obvious.
Don’t Catch What You Can’t Handle
```python
# run_operation is an illustrative wrapper so the return statements are valid.

# ❌ Bad: Catching everything
def run_operation():
    try:
        result = complex_operation()
    except Exception as e:
        logger.error(f"Error: {e}")
        return None  # Now what?
    return result


# ✅ Good: Catch specific errors you can handle
def run_operation():
    try:
        result = complex_operation()
    except ConnectionTimeout:
        # We know how to handle this: retry
        return retry_with_backoff(complex_operation)
    except ValidationError as e:
        # We know how to handle this: return error to user
        return {"error": str(e)}, 400
    # Let other exceptions propagate: they're bugs
    return result
```
If you can’t meaningfully recover, let the error bubble up.
Errors Are Data
Errors should be structured, not strings:
```python
from typing import Optional

# ❌ Bad: String errors
raise Exception("Payment failed")


# ✅ Good: Structured errors
class PaymentError(Exception):
    def __init__(self, code: str, message: str, charge_id: Optional[str] = None):
        self.code = code
        self.message = message
        self.charge_id = charge_id
        super().__init__(message)


raise PaymentError(
    code="CARD_DECLINED",
    message="Card was declined by issuer",
    charge_id="ch_123",
)
```
Structured errors can be:
- Logged with context
- Translated for users
- Handled differently by code
- Monitored and alerted on
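As a rough sketch of what "handled differently by code" could look like, the snippet below turns a structured error into an HTTP response. The `to_dict` method, the `STATUS_BY_CODE` mapping, the `GATEWAY_TIMEOUT` code, and `to_http_response` are illustrative assumptions, not part of the `PaymentError` class shown above.

```python
from typing import Optional, Tuple


class PaymentError(Exception):
    """Structured payment error with the same fields as the class above."""

    def __init__(self, code: str, message: str, charge_id: Optional[str] = None):
        self.code = code
        self.message = message
        self.charge_id = charge_id
        super().__init__(message)

    def to_dict(self) -> dict:
        # Hypothetical helper: a machine-readable form for logs and API bodies.
        return {
            "code": self.code,
            "message": self.message,
            "charge_id": self.charge_id,
        }


# Assumed mapping from error code to HTTP status; adjust to your own codes.
STATUS_BY_CODE = {"CARD_DECLINED": 402, "GATEWAY_TIMEOUT": 503}


def to_http_response(error: PaymentError) -> Tuple[dict, int]:
    """Turn a structured error into a (body, status) pair."""
    status = STATUS_BY_CODE.get(error.code, 500)  # unknown codes become 500
    return {"error": error.to_dict()}, status
```

Because the code is a field rather than a substring of the message, the mapping keeps working when someone rewords the message.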
Error Boundaries
Contain failures so they don’t cascade:
```python
class OrderProcessor:
    def process_order(self, order):
        # Core operation: must succeed
        charge = self.charge_payment(order)

        # Non-critical operations: failures shouldn't fail the order
        self.send_confirmation_email(order)  # Fire and forget
        self.update_analytics(order)  # Best effort

        return charge

    def send_confirmation_email(self, order):
        try:
            email_service.send(order.user.email, "confirmation", order)
        except Exception as e:
            # Log but don't fail the order
            logger.warning(f"Failed to send confirmation: {e}")
            # Queue for retry later
            retry_queue.add("send_email", order.id)

    def update_analytics(self, order):
        try:
            analytics.track("order_completed", order.to_dict())
        except Exception as e:
            # Analytics failure is never critical
            logger.debug(f"Analytics failed: {e}")
```
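If many non-critical steps share the same "log it and move on" boundary, the try/except can be factored into a decorator. This is a minimal sketch: the `best_effort` name is made up for illustration, and the `update_analytics` function here is a stand-in that always fails.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def best_effort(func):
    """Run func, log any exception, and return None instead of raising."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # logger.exception records the traceback along with the message.
            logger.exception("Non-critical step %s failed", func.__name__)
            return None

    return wrapper


@best_effort
def update_analytics(order_id: str) -> None:
    # Stand-in for a real analytics call; always fails for demonstration.
    raise ConnectionError("analytics service unreachable")
```

Reserve a boundary like this for steps that are truly optional. Wrapping the payment call in it would bring back the silent failure from the first section.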
Graceful Degradation
When dependencies fail, degrade functionality instead of failing completely:
```python
async def get_product_page(product_id: str):
    product = await products_db.get(product_id)  # Critical

    # Non-critical: show page even if these fail
    try:
        recommendations = await recommendation_service.get(product_id)
    except ServiceUnavailable:
        recommendations = []  # Show page without recommendations

    try:
        reviews = await reviews_service.get(product_id)
    except ServiceUnavailable:
        reviews = None  # Show page without reviews

    return render("product.html", product, recommendations, reviews)
```
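A dependency that hangs can hurt the page as much as one that fails outright, so a degradation helper might also apply a timeout. The sketch below uses `asyncio.wait_for` from the standard library. The `with_fallback` helper, its 0.5 second default, and the `slow_recommendations` stand-in are assumptions for illustration.

```python
import asyncio


async def with_fallback(awaitable, fallback, timeout: float = 0.5):
    """Await a non-critical call; return `fallback` on timeout or connection error."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout)
    except (asyncio.TimeoutError, ConnectionError):
        return fallback


async def slow_recommendations(product_id: str) -> list:
    # Stand-in for a dependency that takes far too long to answer.
    await asyncio.sleep(5)
    return ["never returned in time"]


async def demo() -> list:
    # The page gets an empty list after 10 ms instead of waiting 5 seconds.
    return await with_fallback(slow_recommendations("p1"), [], timeout=0.01)
```

Only the listed exception types trigger the fallback. Anything else still propagates, in keeping with the earlier advice about not catching what you can't handle.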
Circuit Breakers
Stop calling a failing service:
```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure = None
        self.state = "closed"  # closed, open, half-open

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # time.monotonic() is unaffected by system clock adjustments
            if time.monotonic() - self.last_failure > self.reset_timeout:
                self.state = "half-open"
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failures = 0
        self.state = "closed"

    def on_failure(self):
        self.failures += 1
        self.last_failure = time.monotonic()
        if self.failures >= self.threshold:
            self.state = "open"


# Usage
payment_breaker = CircuitBreaker()


def charge_payment(order):
    return payment_breaker.call(payment_service.charge, order)
```
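To watch the closed, open, and half-open transitions without waiting a full minute, one option is a condensed variant of the breaker with an injectable clock. This is a sketch rather than a replacement for the class above: the `clock` parameter, the `opened_at` attribute, the computed `state` property, and the `flaky` stand-in are assumptions added for testability.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Condensed breaker whose state is derived from an injectable clock."""

    def __init__(self, failure_threshold=5, reset_timeout=60, clock=time.monotonic):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.opened_at = None  # time at which the circuit last opened
        self.clock = clock  # assumption: injected so tests need not sleep

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at > self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("Circuit breaker is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result


def flaky():
    # Stand-in for a call to a service that is currently down.
    raise ConnectionError("payment service down")
```

In a test, passing `clock=lambda: now[0]` and bumping `now[0]` past `reset_timeout` moves the breaker to half-open, after which one successful call closes it again.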
Retries with Backoff
Transient failures often resolve themselves:
```python
import logging
import random
import time

logger = logging.getLogger(__name__)


class TransientError(Exception):
    """Placeholder for errors worth retrying (timeouts, dropped connections)."""


def retry_with_backoff(func, max_retries=3, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError as e:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.1)
            # Log before sleeping so the message is timestamped at the failure
            logger.warning(f"Retry {attempt + 1}/{max_retries}: {e}")
            time.sleep(delay + jitter)
```
But know when NOT to retry:
- Validation errors (won’t succeed on retry)
- Authentication errors (need new credentials)
- Rate limits (don’t retry immediately; wait for the interval given in the Retry-After header)
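One way to encode the list above is a small function that decides whether to retry and how long to wait. It is a sketch under assumptions: the `RETRYABLE_STATUSES` set, the `next_delay` name, and the 30 second cap are illustrative choices, and only the seconds form of `Retry-After` is handled (the header may also carry an HTTP date).

```python
import random
from typing import Optional

# Assumed classification; tune this set for the API you are calling.
RETRYABLE_STATUSES = {429, 502, 503, 504}


def next_delay(
    status: int,
    attempt: int,
    retry_after: Optional[str] = None,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if retrying is pointless."""
    if status not in RETRYABLE_STATUSES:
        return None  # e.g. 400 validation or 401 authentication errors
    if retry_after is not None and retry_after.isdigit():
        # The server told us how long to wait; respect it.
        return float(retry_after)
    # Otherwise fall back to capped exponential backoff with up to 10% jitter.
    delay = min(max_delay, base_delay * (2 ** attempt))
    return delay + random.uniform(0, delay * 0.1)
```

A caller would sleep for the returned value and try again, or give up immediately when it gets `None`.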
Idempotency
Make operations safe to retry:
```python
def process_payment(order_id: str, idempotency_key: str):
    # Check if we already processed this
    existing = db.get_payment(idempotency_key)
    if existing:
        return existing  # Return same result

    # Process payment
    result = payment_service.charge(order_id)

    # Store with idempotency key
    db.save_payment(idempotency_key, result)
    return result
```
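Clients can safely retry without double-charging, provided the lookup and the save behave atomically (for example, through a unique constraint on the key). Without that, two concurrent retries can both pass the check and both charge.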
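A minimal sketch of an atomic variant, using an in-memory SQLite table whose primary key rejects a second claim on the same idempotency key. The table layout and the `fake_charge` stand-in are assumptions for illustration. What to do when the charge fails after the key has been claimed is deliberately left out.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (idempotency_key TEXT PRIMARY KEY, charge_id TEXT)"
)

charge_calls = []


def fake_charge(order_id: str) -> str:
    # Stand-in for payment_service.charge; records each call it receives.
    charge_calls.append(order_id)
    return f"ch_{len(charge_calls)}"


def process_payment(order_id: str, idempotency_key: str) -> Optional[str]:
    try:
        # Claim the key first; the primary key makes a second claim fail.
        with conn:
            conn.execute(
                "INSERT INTO payments (idempotency_key) VALUES (?)",
                (idempotency_key,),
            )
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT charge_id FROM payments WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0]  # None while the first attempt is still in flight

    charge_id = fake_charge(order_id)
    with conn:
        conn.execute(
            "UPDATE payments SET charge_id = ? WHERE idempotency_key = ?",
            (charge_id, idempotency_key),
        )
    return charge_id
```

The design choice here is to let the database arbitrate: whichever request inserts the key first is the only one allowed to charge.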
Meaningful Error Messages
```python
# ❌ Bad: Useless errors
raise Exception("Error")
raise Exception("Something went wrong")
raise Exception("null pointer")

# ✅ Good: Actionable errors
raise ValidationError(
    "Email address is invalid",
    field="email",
    value=email,
    hint="Must be a valid email format (e.g., user@example.com)",
)

raise ConfigurationError(
    "Database connection failed",
    config_key="DATABASE_URL",
    hint="Check that the database is running and the URL is correct",
)
```
Good error messages answer:
- What happened?
- Where did it happen?
- What can be done about it?
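The examples above pass `field`, `value`, and `hint` as keyword arguments, which the built-in exceptions don't accept. One possible definition that does, and that answers all three questions in its string form, is sketched here. The exact layout of `__str__` is an assumption.

```python
from typing import Any, Optional


class ValidationError(Exception):
    """Validation failure that carries what, where, and what to do about it."""

    def __init__(
        self,
        message: str,
        *,
        field: Optional[str] = None,
        value: Any = None,
        hint: Optional[str] = None,
    ):
        super().__init__(message)
        self.message = message  # what happened
        self.field = field      # where it happened
        self.value = value      # the offending input (avoid logging secrets)
        self.hint = hint        # what can be done about it

    def __str__(self) -> str:
        parts = [self.message]
        if self.field is not None:
            parts.append(f"(field={self.field!r}, value={self.value!r})")
        if self.hint:
            parts.append(f"Hint: {self.hint}")
        return " ".join(parts)
```

Keeping the pieces as attributes means an API layer can return `hint` to the user while the log keeps the full detail.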
Logging Errors
```python
import structlog

logger = structlog.get_logger()


def process_order(order):
    try:
        result = payment_service.charge(order)
        logger.info(
            "payment_succeeded",
            order_id=order.id,
            amount=order.total,
            charge_id=result.id,
        )
        return result
    except PaymentError as e:
        logger.error(
            "payment_failed",
            order_id=order.id,
            amount=order.total,
            error_code=e.code,
            error_message=e.message,
            # Include context for debugging
            user_id=order.user_id,
            payment_method=order.payment_method.type,
        )
        raise
```
Log errors with context. Future-you debugging at 3am will be grateful.
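If adding a logging dependency isn't an option, a similar effect is possible with the standard library alone. The sketch below is one way to do it: the `JsonFormatter` class and the convention of passing context as `extra={"context": {...}}` are assumptions, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, merging in any context dict."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Assumed convention: callers pass extra={"context": {...}}, which
        # the logging module attaches to the record as `record.context`.
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
```

A call such as `logger.error("payment_failed", extra={"context": {"order_id": "o-1"}})` then produces a single JSON line that a log aggregator can index by field.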
The Error Handling Checklist
For each error:
- [ ] Can the system recover automatically?
- [ ] If not, is the failure contained?
- [ ] Is the error logged with context?
- [ ] Does the user get a helpful message?
- [ ] Is the operation safe to retry?
- [ ] Are we alerted if this happens too often?
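The last item on the checklist, "too often", needs a definition before it can trigger an alert. A simple one is "more than N errors within a sliding window". Here is a sketch of that idea. The `ErrorRateMonitor` name, its defaults, and the injectable `clock` are assumptions, and a real setup would more likely rely on a metrics system than on in-process state.

```python
import time
from collections import deque


class ErrorRateMonitor:
    """Track error timestamps and flag when too many fall inside a window."""

    def __init__(self, max_errors: int = 10, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._timestamps = deque()

    def record_error(self) -> bool:
        """Record one error; return True if the alert threshold is exceeded."""
        now = self.clock()
        self._timestamps.append(now)
        # Drop timestamps that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors
```

Calling `record_error()` from an exception handler and paging someone when it returns `True` covers the alerting item without alerting on every single failure.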
Philosophy Summary
- Fail fast: Detect problems early
- Fail loud: Make errors visible
- Fail gracefully: Degrade instead of crash
- Fail safe: Default to secure state
- Recover quickly: Retry, circuit break, self-heal
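"Fail safe" is the one principle in this list without an example earlier in the post, so here is a minimal sketch. It assumes a permission check passed in as a callable. The `is_allowed` name and its signature are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def is_allowed(user_id: str, resource: str, check_permission) -> bool:
    """Return True only if the permission check positively succeeds."""
    try:
        return check_permission(user_id, resource) is True
    except Exception:
        # Fail safe: if the policy service is down, deny rather than allow.
        # Fail loud: still record the failure with its traceback.
        logger.exception("Permission check failed for %s on %s", user_id, resource)
        return False
```

The default on error is the secure outcome (deny), and anything other than an explicit `True` is treated as a refusal.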
Error handling isn’t about preventing errors — it’s about responding to them well. The system that handles errors gracefully earns user trust. The system that crashes mysteriously loses it.
Expect failures. Plan for them. Test them. The error path is as important as the happy path.
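Testing the error path can be as simple as injecting a failure with `unittest.mock` and asserting on what happens next. The `checkout` function below is a made-up stand-in for an order flow with one critical step and one best-effort step.

```python
from unittest import mock


def checkout(order_id: str, charge, send_email) -> dict:
    """Charge is critical; the confirmation email is best effort."""
    charge_id = charge(order_id)  # any exception here propagates to the caller
    try:
        send_email(order_id)
        email_sent = True
    except Exception:
        email_sent = False
    return {"charge_id": charge_id, "email_sent": email_sent}


def test_email_failure_does_not_fail_order():
    charge = mock.Mock(return_value="ch_1")
    send_email = mock.Mock(side_effect=ConnectionError("smtp down"))

    result = checkout("order-1", charge, send_email)

    assert result == {"charge_id": "ch_1", "email_sent": False}
    charge.assert_called_once_with("order-1")
```

The `side_effect` argument makes the mock raise when called, which is usually enough to exercise a failure branch that would otherwise only run in production.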