Networks fail. Databases hiccup. External APIs return 503. In distributed systems, transient failures are not exceptional — they’re expected. The question isn’t whether to retry, but how.
Bad retry logic turns a brief outage into a cascading failure. Good retry logic absorbs transient issues invisibly.
## The Naive Retry (Don’t Do This)

```python
import requests
from requests.exceptions import RequestException

# Immediate retry loop - a recipe for disaster
def call_api():
    while True:
        try:
            return requests.get(API_URL)
        except RequestException:
            continue  # Hammer the failing service
```
This creates a thundering herd. If the service is struggling, every client immediately retrying makes it worse. You’re not recovering from failure — you’re accelerating toward total collapse.
## Exponential Backoff

The fundamental pattern: wait longer between each retry.

```python
import time

class TransientError(Exception):
    """Whatever error type your client treats as retryable."""

class MaxRetriesExceeded(Exception):
    pass

def call_with_backoff(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)  # 1, 2, 4, 8, 16...
            time.sleep(delay)
    raise MaxRetriesExceeded()
```
First retry after 1 second. Then 2. Then 4. Then 8. The spacing gives the failing service time to recover instead of piling on more requests.
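One refinement worth making: uncapped exponential growth quickly reaches delays of minutes. A small sketch (the `max_delay` parameter and the `capped_delay` helper are illustrative additions, not part of the code above) clamps the wait:

```python
def capped_delay(attempt, base_delay=1, max_delay=30):
    # Exponential growth, clamped so late retries don't wait for minutes
    return min(base_delay * (2 ** attempt), max_delay)
```

With `base_delay=1` and `max_delay=30`, the sequence is 1, 2, 4, 8, 16, 30, 30, and so on.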
## Add Jitter: Break the Herd

Exponential backoff without jitter still creates synchronized retry waves. A thousand clients all waiting 4 seconds will all retry at the same moment.

```python
import random
import time

def call_with_jitter(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay)  # Add randomness
            time.sleep(delay + jitter)
    raise MaxRetriesExceeded()
```
Full jitter variant (often better):

```python
delay = random.uniform(0, base_delay * (2 ** attempt))
```
Jitter spreads retry attempts across time, smoothing the load on recovering services.
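A quick, purely illustrative simulation makes the difference concrete: a thousand clients computing their third-retry delay, with and without full jitter.

```python
import random

def fixed_delay(attempt, base_delay=1):
    # Plain exponential backoff: every client computes the same wait
    return base_delay * (2 ** attempt)

def full_jitter_delay(attempt, base_delay=1):
    # Full jitter: each client waits a random fraction of the window
    return random.uniform(0, base_delay * (2 ** attempt))

# 1,000 clients on their third attempt (attempt index 2)
fixed = {fixed_delay(2) for _ in range(1000)}
jittered = {full_jitter_delay(2) for _ in range(1000)}
# `fixed` collapses to a single value - a synchronized wave;
# `jittered` spreads across the whole 0-4 second window
```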
## Which Errors to Retry

Not everything is retryable:

```python
RETRYABLE_STATUS_CODES = {408, 429, 500, 502, 503, 504}
NON_RETRYABLE_STATUS_CODES = {400, 401, 403, 404, 422}

def should_retry(response):
    if response.status_code in RETRYABLE_STATUS_CODES:
        return True
    if response.status_code in NON_RETRYABLE_STATUS_CODES:
        return False
    # Unknown status - retry cautiously
    return response.status_code >= 500
```
- **Retry:** timeouts, rate limits, server errors
- **Don’t retry:** bad requests, auth failures, not found
Retrying a 400 Bad Request won’t fix your malformed payload. Retrying a 401 won’t give you valid credentials.
When services tell you when to retry, listen:
```python
def get_retry_delay(response, default_delay):
    retry_after = response.headers.get('Retry-After')
    if retry_after:
        try:
            return int(retry_after)
        except ValueError:
            # Could be an HTTP date instead
            pass
    return default_delay
```
Ignoring Retry-After is rude and counterproductive. The service is telling you exactly when it’ll be ready.
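The HTTP-date form of the header can be handled with the standard library. A sketch (the `parse_retry_after` name is my own, not from the snippet above) using `email.utils.parsedate_to_datetime`:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, default_delay=1.0):
    # Retry-After is either delta-seconds ("120") or an HTTP-date
    # ("Wed, 21 Oct 2015 07:28:00 GMT")
    if value is None:
        return default_delay
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default_delay  # Unparsable header - fall back
    return max(0.0, (target - datetime.now(timezone.utc)).total_seconds())
```

Clamping to zero means a date in the past simply permits an immediate retry.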
## Circuit Breakers: Stop Retrying Entirely
When a service is truly down, retries are wasted effort. Circuit breakers detect sustained failure and stop trying:
```python
from datetime import datetime, timedelta

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure = None
        self.state = 'closed'  # closed, open, half-open

    def call(self, func):
        if self.state == 'open':
            if self._should_attempt_recovery():
                self.state = 'half-open'
            else:
                raise CircuitOpenError()
        try:
            result = func()
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = 'closed'

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure = datetime.now()
        if self.failure_count >= self.failure_threshold:
            self.state = 'open'

    def _should_attempt_recovery(self):
        return datetime.now() - self.last_failure > timedelta(seconds=self.recovery_timeout)
```
States:
- Closed: Normal operation, requests go through
- Open: Too many failures, requests immediately rejected
- Half-open: Testing if service recovered, allow one request
## Retry Budgets: Global Limits
Individual request retries can combine into system-wide overload. Retry budgets limit total retry traffic:
```python
from threading import Lock
from time import time

class RetryBudget:
    def __init__(self, max_retries_per_second=10):
        self.max_retries = max_retries_per_second
        self.tokens = max_retries_per_second
        self.last_refill = time()
        self.lock = Lock()

    def acquire(self):
        with self.lock:
            self._refill()
            if self.tokens > 0:
                self.tokens -= 1
                return True
            return False

    def _refill(self):
        now = time()
        elapsed = now - self.last_refill
        self.tokens = min(self.max_retries,
                          self.tokens + elapsed * self.max_retries)
        self.last_refill = now

budget = RetryBudget()

def call_with_budget(func):
    try:
        return func()
    except TransientError:
        if budget.acquire():
            return func()  # Allowed to retry
        raise  # Budget exhausted, fail fast
```
This prevents retry storms from overwhelming downstream services.
## Library Support
Don’t reinvent this — use battle-tested libraries:
**Python (tenacity):**
```python
import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=60)
)
def call_api():
    response = requests.get(API_URL)
    response.raise_for_status()
    return response.json()
```
**JavaScript (axios-retry):**
```javascript
const axios = require('axios');
const axiosRetry = require('axios-retry');

axiosRetry(axios, {
  retries: 3,
  retryDelay: axiosRetry.exponentialDelay,
  retryCondition: (error) => {
    return axiosRetry.isNetworkOrIdempotentRequestError(error)
      || error.response?.status === 429;
  }
});
```
**Go (hashicorp/go-retryablehttp):**
```go
client := retryablehttp.NewClient()
client.RetryMax = 5
client.RetryWaitMin = 1 * time.Second
client.RetryWaitMax = 30 * time.Second
```
## Idempotency: Make Retries Safe
Retrying a non-idempotent operation is dangerous. “Create user” retried three times might create three users.
Solutions:
- Idempotency keys: Client generates unique ID, server deduplicates
- Read-before-write: Check if operation already succeeded
- Natural idempotency: Design operations that are safe to repeat
```python
def create_order_idempotent(idempotency_key, order_data):
    # Check if we already processed this request
    existing = db.query(
        "SELECT * FROM orders WHERE idempotency_key = %s",
        idempotency_key
    )
    if existing:
        return existing  # Return cached result

    # Process the order
    order = create_order(order_data)
    order.idempotency_key = idempotency_key
    db.save(order)
    return order
```
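On the client side, the key must be generated once, before the first attempt, and reused across every retry. A minimal sketch, where `TransientError` and the `post` callable are stand-ins for your client's actual error type and transport:

```python
import uuid

class TransientError(Exception):
    """Stand-in for whatever retryable error your HTTP client raises."""

def submit_order(order_data, post, max_retries=3):
    # One key for all attempts: the server sees the duplicates and
    # returns the original result instead of creating new orders
    idempotency_key = str(uuid.uuid4())
    last_error = None
    for _ in range(max_retries):
        try:
            return post(idempotency_key, order_data)
        except TransientError as exc:
            last_error = exc
    raise last_error
```

Generating a fresh key inside the loop would defeat the whole scheme: each retry would look like a new request.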
Retry logic is one of the most impactful reliability patterns you can implement. Done poorly, it amplifies failures. Done well, it makes transient issues invisible to users.
Start with exponential backoff and jitter. Add circuit breakers for sustained failures. Implement retry budgets for system-wide protection. Use idempotency keys to make retries safe.
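The backoff, jitter, and error-classification pieces combine naturally. One possible shape, with status-code sets and parameter defaults that are illustrative rather than prescriptive:

```python
import random
import time

RETRYABLE = {408, 429, 500, 502, 503, 504}

def call_resiliently(send, max_retries=5, base_delay=1.0, max_delay=30.0):
    # `send` is any callable returning an object with a `.status_code`
    response = send()
    for attempt in range(max_retries):
        if response.status_code not in RETRYABLE:
            return response
        # Capped exponential backoff with full jitter
        delay = random.uniform(0, min(base_delay * (2 ** attempt), max_delay))
        time.sleep(delay)
        response = send()
    return response  # Out of retries - surface the last response
```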
The goal isn’t to never fail — it’s to fail gracefully and recover automatically.