Networks fail. Databases hiccup. External APIs return 503. In distributed systems, transient failures are not exceptional — they’re expected. The question isn’t whether to retry, but how.

Bad retry logic turns a brief outage into a cascading failure. Good retry logic absorbs transient issues invisibly.

The Naive Retry (Don’t Do This)

# Immediate retry loop - a recipe for disaster
import requests
from requests.exceptions import RequestException
def call_api():
    while True:
        try:
            return requests.get(API_URL)
        except RequestException:
            continue  # Hammer the failing service

This creates a thundering herd. If the service is struggling, every client immediately retrying makes it worse. You’re not recovering from failure — you’re accelerating toward total collapse.

Exponential Backoff

The fundamental pattern: wait longer between each retry.

import time
import random

def call_with_backoff(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            
            delay = base_delay * (2 ** attempt)  # 1, 2, 4, 8, 16...
            time.sleep(delay)
    
    raise MaxRetriesExceeded()

First retry after 1 second. Then 2. Then 4. Then 8. The spacing gives the failing service time to recover instead of piling on more requests.

Add Jitter: Break the Herd

Exponential backoff without jitter still creates synchronized retry waves. A thousand clients all waiting 4 seconds will all retry at the same moment.

def call_with_jitter(func, max_retries=5, base_delay=1):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise
            
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay)  # Add randomness
            time.sleep(delay + jitter)

Full jitter variant (often better):

delay = random.uniform(0, base_delay * (2 ** attempt))

Jitter spreads retry attempts across time, smoothing the load on recovering services.
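To see the effect, a quick simulation: with plain backoff, every client on the same attempt sleeps for exactly the same time; full jitter scatters them across the whole window (a minimal sketch, `full_jitter_delay` is an illustrative name):

```python
import random

def full_jitter_delay(attempt, base_delay=1):
    # Full jitter: pick uniformly from [0, base_delay * 2^attempt]
    return random.uniform(0, base_delay * (2 ** attempt))

# 1,000 clients all on attempt 3: plain backoff would have every one of
# them retry after exactly 8 seconds; full jitter spreads them over 0-8s.
delays = [full_jitter_delay(3) for _ in range(1000)]
assert all(0 <= d <= 8 for d in delays)
```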

Which Errors to Retry

Not everything is retryable:

RETRYABLE_STATUS_CODES = {408, 429, 500, 502, 503, 504}
NON_RETRYABLE_STATUS_CODES = {400, 401, 403, 404, 422}

def should_retry(response):
    if response.status_code in RETRYABLE_STATUS_CODES:
        return True
    if response.status_code in NON_RETRYABLE_STATUS_CODES:
        return False
    # Unknown status - retry cautiously
    return response.status_code >= 500

Retry: timeouts, rate limits, server errors.
Don’t retry: bad requests, auth failures, not found.

Retrying a 400 Bad Request won’t fix your malformed payload. Retrying a 401 won’t give you valid credentials.

Respect Retry-After Headers

When services tell you when to retry, listen:

def get_retry_delay(response, default_delay):
    retry_after = response.headers.get('Retry-After')
    if retry_after:
        try:
            return int(retry_after)
        except ValueError:
            # Could be an HTTP date instead
            pass
    return default_delay

Ignoring Retry-After is rude and counterproductive. The service is telling you exactly when it’ll be ready.
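The helper above only handles the delay-seconds form, but Retry-After can also be an HTTP date. A sketch handling both with the standard library (`parse_retry_after` is an illustrative name):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, default_delay=1.0):
    """Handle both Retry-After forms: delay-seconds or an HTTP date."""
    if value is None:
        return default_delay
    try:
        return max(0.0, float(value))  # e.g. "120"
    except ValueError:
        pass
    try:
        # e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
        return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return default_delay  # Unparseable - fall back to our own backoff
```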

Circuit Breakers: Stop Retrying Entirely

When a service is truly down, retries are wasted effort. Circuit breakers detect sustained failure and stop trying:

from datetime import datetime, timedelta

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure = None
        self.state = 'closed'  # closed, open, half-open
    
    def call(self, func):
        if self.state == 'open':
            if self._should_attempt_recovery():
                self.state = 'half-open'
            else:
                raise CircuitOpenError()
        
        try:
            result = func()
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise
    
    def _on_success(self):
        self.failure_count = 0
        self.state = 'closed'
    
    def _on_failure(self):
        self.failure_count += 1
        self.last_failure = datetime.now()
        if self.failure_count >= self.failure_threshold:
            self.state = 'open'
    
    def _should_attempt_recovery(self):
        return datetime.now() - self.last_failure > timedelta(seconds=self.recovery_timeout)

States:

  • Closed: Normal operation, requests go through
  • Open: Too many failures, requests immediately rejected
  • Half-open: Testing if service recovered, allow one request
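Callers typically wrap each downstream request in the breaker and degrade gracefully while it is open. A minimal sketch of that pattern (`AlwaysOpenBreaker` is a stub with the same call() contract, standing in for a tripped CircuitBreaker so the fallback path runs):

```python
class CircuitOpenError(Exception):
    pass

class AlwaysOpenBreaker:
    # Stub with the same call() contract as CircuitBreaker, permanently open.
    def call(self, func):
        raise CircuitOpenError()

breaker = AlwaysOpenBreaker()

def fetch_profile(user_id):
    try:
        return breaker.call(lambda: {"id": user_id, "source": "live"})
    except CircuitOpenError:
        # Circuit is open: serve stale data instead of hammering the service
        return {"id": user_id, "source": "cache"}

assert fetch_profile(7)["source"] == "cache"
```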

Retry Budgets: Global Limits

Individual request retries can combine into system-wide overload. Retry budgets limit total retry traffic:

from threading import Lock
from time import time

class RetryBudget:
    def __init__(self, max_retries_per_second=10):
        self.max_retries = max_retries_per_second
        self.tokens = max_retries_per_second
        self.last_refill = time()
        self.lock = Lock()
    
    def acquire(self):
        with self.lock:
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False
    
    def _refill(self):
        now = time()
        elapsed = now - self.last_refill
        self.tokens = min(self.max_retries, 
                         self.tokens + elapsed * self.max_retries)
        self.last_refill = now

budget = RetryBudget()

def call_with_budget(func):
    try:
        return func()
    except TransientError:
        if budget.acquire():
            return func()  # Allowed to retry
        raise  # Budget exhausted, fail fast

This prevents retry storms from overwhelming downstream services.

Library Support

Don’t reinvent this — use battle-tested libraries:

Python (tenacity):

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=60)
)
def call_api():
    response = requests.get(API_URL)
    response.raise_for_status()
    return response.json()

JavaScript (axios-retry):

const axios = require('axios');
const axiosRetry = require('axios-retry');

axiosRetry(axios, {
  retries: 3,
  retryDelay: axiosRetry.exponentialDelay,
  retryCondition: (error) => {
    return axiosRetry.isNetworkOrIdempotentRequestError(error)
      || error.response?.status === 429;  // response is absent on network errors
  }
});

Go (hashicorp/go-retryablehttp):

client := retryablehttp.NewClient()
client.RetryMax = 5
client.RetryWaitMin = 1 * time.Second
client.RetryWaitMax = 30 * time.Second

Idempotency: Make Retries Safe

Retrying a non-idempotent operation is dangerous. “Create user” retried three times might create three users.

Solutions:

  • Idempotency keys: Client generates unique ID, server deduplicates
  • Read-before-write: Check if operation already succeeded
  • Natural idempotency: Design operations that are safe to repeat
def create_order_idempotent(idempotency_key, order_data):
    # Check if we already processed this request
    existing = db.query(
        "SELECT * FROM orders WHERE idempotency_key = %s",
        idempotency_key
    )
    if existing:
        return existing  # Return cached result
    
    # Process the order
    order = create_order(order_data)
    order.idempotency_key = idempotency_key
    db.save(order)
    return order

Retry logic is one of the most impactful reliability patterns you can implement. Done poorly, it amplifies failures. Done well, it makes transient issues invisible to users.

Start with exponential backoff and jitter. Add circuit breakers for sustained failures. Implement retry budgets for system-wide protection. Use idempotency keys to make retries safe.
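Put together, the core loop might look like this; a minimal sketch assuming a TransientError exception type, using full jitter with a delay cap (`call_resilient` is an illustrative name):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for whatever transient failure your client raises."""

def call_resilient(func, max_retries=5, base_delay=1.0, max_delay=30.0,
                   sleep=time.sleep):
    """Exponential backoff with full jitter and a bounded retry count."""
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # Out of attempts - surface the failure
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # Full jitter

# Usage: a call that succeeds on its third attempt
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError()
    return "ok"

assert call_resilient(flaky, sleep=lambda _: None) == "ok"
```

Injecting `sleep` keeps the helper testable; production code would leave the default.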

The goal isn’t to never fail — it’s to fail gracefully and recover automatically.