Every production system has dependencies. APIs, databases, caches, third-party services. Each one can fail. The question isn’t if they’ll fail, but how your system behaves when they do.
Graceful degradation means your system continues providing value—reduced, maybe, but value—when dependencies are unavailable. The opposite is cascade failure: one service dies, and everything dies with it.
Here are the patterns that make the difference.
## The Hierarchy of Degradation
Not all degradation is equal. Design for multiple levels:
- **Level 0: Full functionality.** Everything works. This is your baseline.
- **Level 1: Reduced freshness.** Data is stale but present. A cache hit instead of a live query.
- **Level 2: Reduced features.** Some features unavailable; core functionality preserved.
- **Level 3: Static fallback.** Pre-computed or cached responses. Read-only mode.
- **Level 4: Informative failure.** Can't serve the request, but explain why clearly.
Most systems jump from Level 0 to Level 4. The goal is to have meaningful stops in between.
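One lightweight way to make those stops explicit in code is an enum attached to every response, so callers and dashboards can see which level actually served a request. A minimal sketch (the names and `tag` helper are illustrative, not from any library):

```python
from enum import IntEnum

class DegradationLevel(IntEnum):
    FULL = 0     # live data, all features
    STALE = 1    # cache hit instead of a live query
    REDUCED = 2  # some features off, core preserved
    STATIC = 3   # pre-computed / read-only fallback
    FAILED = 4   # informative failure, no data

def tag(payload, level: DegradationLevel):
    """Attach the serving level so clients and metrics can see it."""
    return {
        "data": payload,
        "level": level.name,
        "degraded": level > DegradationLevel.FULL,
    }

print(tag({"items": []}, DegradationLevel.STALE))
# {'data': {'items': []}, 'level': 'STALE', 'degraded': True}
```

Because the levels are ordered integers, you can alert when the fleet-wide average level creeps above zero.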
## Pattern 1: Circuit Breakers
Stop calling a failing service. Fail fast instead of timing out.
```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"  # closed, open, half-open
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"  # probe with one real call
            else:
                raise CircuitOpenError("Service unavailable")
        try:
            result = func(*args, **kwargs)
            # Any success closes the circuit and resets the count,
            # so we trip on consecutive failures, not lifetime failures
            self.state = "closed"
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.threshold:
                self.state = "open"
            raise
```
When open: Return cached data, default values, or skip the feature entirely.
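Wiring the breaker into a call site might look like the sketch below. The pricing service, cache, and SKU names are hypothetical stand-ins, and the breaker is a condensed copy of the class above so the example runs on its own:

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:  # condensed version of the class above
    def __init__(self, failure_threshold=3, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"
            else:
                raise CircuitOpenError("Service unavailable")
        try:
            result = func(*args, **kwargs)
            self.state, self.failures = "closed", 0
            return result
        except Exception:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.threshold:
                self.state = "open"
            raise

breaker = CircuitBreaker(failure_threshold=3)
price_cache = {"sku-1": 9.99}  # stale but serviceable fallback

def fetch_price(sku):
    # Stand-in for a real pricing-service client that is currently down
    raise ConnectionError("pricing service down")

def get_price(sku):
    try:
        return breaker.call(fetch_price, sku)
    except (CircuitOpenError, ConnectionError):
        return price_cache.get(sku)  # degrade to the cached value

for _ in range(5):  # after 3 failures the breaker opens
    print(get_price("sku-1"), breaker.state)
```

The first three calls pay the cost of a real (failing) request; the last two fail fast because the breaker is open, and every call still returns the cached price.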
## Pattern 2: Fallback Chains
Define a sequence of increasingly degraded responses.
```python
async def get_user_recommendations(user_id):
    # Level 0: Personalized real-time recommendations
    try:
        return await recommendation_service.get_personalized(user_id)
    except ServiceUnavailable:
        pass

    # Level 1: Cached personalized recommendations
    cached = await cache.get(f"recs:{user_id}")
    if cached:
        return cached.with_stale_indicator()

    # Level 2: Segment-based recommendations
    try:
        segment = await user_service.get_segment(user_id)
        return await recommendation_service.get_for_segment(segment)
    except ServiceUnavailable:
        pass

    # Level 3: Global popular items
    popular = await cache.get("recs:global:popular")
    if popular:
        return popular.with_generic_indicator()

    # Level 4: Static fallback
    return STATIC_FALLBACK_RECOMMENDATIONS
```
Each level is explicitly designed. You know what users get at each degradation point.
## Pattern 3: Bulkheads
Isolate failures so they don’t spread.
```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Thread pool per dependency
user_service_pool = ThreadPoolExecutor(max_workers=10)
payment_service_pool = ThreadPoolExecutor(max_workers=5)
notification_service_pool = ThreadPoolExecutor(max_workers=3)

async def process_order(order):
    # Payment pool exhausted? Users can still browse.
    # Notifications slow? Payments still work.
    user = await run_in_pool(user_service_pool, get_user, order.user_id)
    payment = await run_in_pool(payment_service_pool, charge, order)

    # Notification is fire-and-forget with a timeout
    try:
        await asyncio.wait_for(
            run_in_pool(notification_service_pool, notify, user, order),
            timeout=2.0,
        )
    except asyncio.TimeoutError:
        queue_for_retry(notify, user, order)
```
Bulkheads prevent one slow dependency from consuming all resources.
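`run_in_pool` above is not a standard-library function; assuming the pools are `ThreadPoolExecutor`s, one way to implement it is a thin wrapper over `loop.run_in_executor`:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def run_in_pool(pool, func, *args):
    """Run a blocking call in the dependency's dedicated pool."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, func, *args)

# Example: a 2-worker pool acting as the bulkhead for one dependency
demo_pool = ThreadPoolExecutor(max_workers=2)

def slow_lookup(x):
    return x * 2  # stand-in for a blocking client call

async def main():
    results = await asyncio.gather(
        *(run_in_pool(demo_pool, slow_lookup, i) for i in range(4))
    )
    print(results)  # [0, 2, 4, 6]

asyncio.run(main())
```

When the pool's workers are all busy, further calls queue inside that executor instead of consuming threads that other dependencies need, which is exactly the isolation the bulkhead is for.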
## Pattern 4: Read-Your-Writes Consistency with Fallback
When the primary database is down, read from replicas—but track what the user just wrote.
```python
import time
from collections import namedtuple
from cachetools import TTLCache  # third-party: pip install cachetools

WriteRecord = namedtuple("WriteRecord", ["value", "written_at"])

class ConsistentReader:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        # Remember each user's writes for 30s -- long enough to
        # cover replication lag during a primary outage
        self.recent_writes = TTLCache(maxsize=10000, ttl=30)

    async def read(self, key, user_id):
        # If the user just wrote this key, they must see their write
        write_record = self.recent_writes.get((user_id, key))
        if write_record:
            return write_record.value

        # Try the primary
        try:
            return await self.primary.read(key)
        except Unavailable:
            pass

        # Fall back to a replica (might be stale, but this user hasn't written)
        return await self.replicas.read(key)

    async def write(self, key, value, user_id):
        result = await self.primary.write(key, value)
        self.recent_writes[(user_id, key)] = WriteRecord(value, time.time())
        return result
```
Users see their own writes; stale reads only affect data they haven’t touched.
## Pattern 5: Feature Flags as Degradation Controls
Use feature flags to manually (or automatically) degrade specific features.
```python
async def handle_search(query):
    if not feature_flags.get("search_enabled"):
        return SearchResult.unavailable("Search temporarily disabled")

    if not feature_flags.get("search_ai_ranking"):
        # Fall back to simpler ranking
        return await simple_search(query)

    if not feature_flags.get("search_spell_check"):
        # Skip spell check, use the query as-is
        return await ai_search(query, spell_check=False)

    return await ai_search(query)
```
When load spikes or a dependency struggles, disable expensive features first.
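The flag store itself can be as simple as a dict with an automatic override: when a health signal crosses a threshold, the most expensive flags switch off without a human in the loop. A sketch under assumed names (the `FeatureFlags` class, shed order, and thresholds are illustrative, not a real library API):

```python
class FeatureFlags:
    """Flags with load-based auto-degradation, most expensive first."""

    def __init__(self, flags, shed_order):
        self.flags = dict(flags)
        self.shed_order = shed_order  # flags to disable as load rises

    def get(self, name):
        return self.flags.get(name, False)

    def on_load(self, load):
        # Shed one feature per 0.1 of load above 0.7 (tunable)
        to_shed = max(0, int((load - 0.7) * 10))
        for i, flag in enumerate(self.shed_order):
            self.flags[flag] = i >= to_shed

flags = FeatureFlags(
    {"search_enabled": True, "search_ai_ranking": True, "search_spell_check": True},
    shed_order=["search_ai_ranking", "search_spell_check", "search_enabled"],
)
flags.on_load(0.85)  # sheds only the most expensive feature
print(flags.flags)
# {'search_enabled': True, 'search_ai_ranking': False, 'search_spell_check': True}
```

The shed order encodes a product decision: which features you are willing to lose first. Making it explicit beats deciding during an incident.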
## Pattern 6: Timeout Budgets
Allocate a total timeout, then distribute it across operations.
```python
async def composite_request(request, total_budget=5.0):
    budget = TimeBudget(total_budget)

    # Critical: must complete. Gets 40% of the budget.
    user = await budget.run(
        get_user(request.user_id),
        allocation=0.4,
        required=True,
    )

    # Important: should complete. Gets 30% of the budget.
    try:
        recommendations = await budget.run(
            get_recommendations(user),
            allocation=0.3,
            required=False,
        )
    except BudgetExceeded:
        recommendations = FALLBACK_RECOMMENDATIONS

    # Nice-to-have: if time remains. Gets the remaining budget.
    try:
        notifications = await budget.run(
            get_notifications(user),
            allocation=budget.remaining(),
            required=False,
        )
    except BudgetExceeded:
        notifications = None

    return Response(user, recommendations, notifications)
```
The budget enforces that one slow dependency can’t starve others.
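`TimeBudget` isn't a library class. A minimal sketch follows, with one simplifying convention to match both call styles above: an `allocation` of 1.0 or less is treated as a fraction of the total budget, anything larger as raw seconds (so `budget.remaining()` works as the last allocation). Either way the slice is capped by the overall deadline:

```python
import asyncio
import time

class BudgetExceeded(Exception):
    pass

class TimeBudget:
    def __init__(self, total):
        self.total = total
        self.deadline = time.monotonic() + total

    def remaining(self):
        return max(0.0, self.deadline - time.monotonic())

    async def run(self, coro, allocation=1.0, required=True):
        # `required` mirrors the calling code: both paths raise, and
        # the caller decides whether to substitute a fallback.
        slice_ = self.total * allocation if allocation <= 1.0 else allocation
        timeout = min(slice_, self.remaining())
        try:
            return await asyncio.wait_for(coro, timeout)
        except asyncio.TimeoutError:
            raise BudgetExceeded(f"exceeded {timeout:.2f}s slice") from None

async def main():
    budget = TimeBudget(0.2)
    fast = await budget.run(asyncio.sleep(0.01, result="ok"), allocation=0.5)
    print(fast)  # ok
    try:
        await budget.run(asyncio.sleep(1.0), allocation=0.25, required=False)
    except BudgetExceeded as e:
        print("degraded:", e)

asyncio.run(main())
```

The key property is the `min(slice_, self.remaining())` cap: even if earlier calls overrun their allocations, later calls can never push the request past its overall deadline.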
## Pattern 7: Partial Success Responses
Return what you have, clearly marking what’s missing.
```python
from dataclasses import dataclass, field

@dataclass
class PartialResponse:
    data: dict = field(default_factory=dict)
    missing: list[str] = field(default_factory=list)
    degraded: list[str] = field(default_factory=list)

    def to_json(self):
        return {
            "data": self.data,
            "_meta": {
                "partial": bool(self.missing or self.degraded),
                "missing": self.missing,
                "degraded": self.degraded,
            },
        }

async def get_dashboard():
    response = PartialResponse()

    try:
        response.data["metrics"] = await get_metrics()
    except Timeout:
        response.data["metrics"] = await get_cached_metrics()
        response.degraded.append("metrics")
    except Unavailable:
        response.missing.append("metrics")

    try:
        response.data["alerts"] = await get_alerts()
    except Unavailable:
        response.missing.append("alerts")

    return response
```
Clients can decide how to handle partial data instead of getting nothing.
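On the client side, the `_meta` block lets the UI make an informed choice per section rather than failing the whole page. A sketch of one consumer (the renderer and its wording are illustrative; the field names follow the response shape above):

```python
def render_dashboard(payload):
    """Render each section, marking degraded ones as stale and
    showing a placeholder for missing ones."""
    meta = payload.get("_meta", {})
    lines = []
    for section, value in payload.get("data", {}).items():
        note = " (stale)" if section in meta.get("degraded", []) else ""
        lines.append(f"{section}{note}: {value}")
    for section in meta.get("missing", []):
        lines.append(f"{section}: unavailable, retry shortly")
    return "\n".join(lines)

payload = {
    "data": {"metrics": {"qps": 120}},
    "_meta": {"partial": True, "missing": ["alerts"], "degraded": ["metrics"]},
}
print(render_dashboard(payload))
# metrics (stale): {'qps': 120}
# alerts: unavailable, retry shortly
```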
## Anti-Pattern: Silent Degradation
The worst kind of degradation is invisible degradation.
```python
# BAD: Silent fallback
def get_user(user_id):
    try:
        return user_service.get(user_id)
    except:
        return DEFAULT_USER  # 😱 Who knows this happened?
```
Always log, metric, or mark degraded responses:
```python
# GOOD: Observable degradation
def get_user(user_id):
    try:
        return user_service.get(user_id)
    except Exception as e:
        metrics.increment("user_service.fallback")
        logger.warning(f"User service unavailable, using fallback: {e}")
        return FallbackUser(user_id, degraded=True)
```
## Testing Degradation
You can’t trust patterns you haven’t tested.
```python
import random

import httpx
import pytest

@pytest.fixture
def chaos_mode():
    """Randomly fail dependencies during tests."""
    original_request = httpx.Client.request

    def chaotic_request(self, *args, **kwargs):
        if random.random() < 0.3:  # 30% failure rate
            raise httpx.ConnectError("Chaos!")
        return original_request(self, *args, **kwargs)

    httpx.Client.request = chaotic_request
    yield
    httpx.Client.request = original_request

def test_dashboard_under_chaos(chaos_mode):
    """The dashboard should return partial data, not crash."""
    response = client.get("/dashboard")
    assert response.status_code == 200
    assert "_meta" in response.json()  # should indicate partial data
```
Run chaos tests in staging. Find out how your system degrades before production does.
## The Mindset Shift
Reliability isn’t about preventing failure. It’s about making failure cheap.
Every dependency call should have an answer to: “What happens when this times out?” If the answer is “the whole request fails,” you have work to do.
Design the degraded states first. Make them explicit. Test them. Then hope you rarely need them—but know they’re there when you do.