When something fails, retry it. Simple, right?

Not quite. Naive retries can turn a minor hiccup into a cascading failure. Retry too aggressively and you overwhelm the recovering service. Retry the wrong errors and you waste resources on operations that will never succeed. Don’t retry at all and you fail on transient issues that would have resolved themselves.

Here’s how to build retries that help rather than hurt.

What to Retry

Not every error deserves a retry:

class RetryableError(Exception):
    """Errors worth retrying."""
    pass

class PermanentError(Exception):
    """Errors that will never succeed on retry."""
    pass

def classify_error(error: Exception) -> type:
    """Determine if an error is worth retrying."""
    
    # Network errors: usually transient
    if isinstance(error, (ConnectionError, TimeoutError)):
        return RetryableError
    
    # HTTP status codes
    if hasattr(error, 'status_code'):
        status = error.status_code
        
        # 429 Too Many Requests: retry after backoff
        # 500, 502, 503, 504: server issues, often transient
        if status in (429, 500, 502, 503, 504):
            return RetryableError
        
        # 400, 401, 403, 404: client errors, won't change on retry
        if 400 <= status < 500:
            return PermanentError
    
    # Database errors
    if "deadlock" in str(error).lower():
        return RetryableError
    if "duplicate key" in str(error).lower():
        return PermanentError
    
    # Default: don't retry unknown errors
    return PermanentError

Retry: Network timeouts, rate limits, server errors, deadlocks, temporary unavailability.

Don’t retry: Authentication failures, validation errors, not found, permission denied.

Exponential Backoff

Fixed-delay retries (wait 1s, 1s, 1s…) don’t give systems time to recover. Exponential backoff increases the delay between attempts:

import time
import random

def exponential_backoff(
    attempt: int,
    base: float = 1.0,
    max_delay: float = 60.0,
    jitter: bool = True
) -> float:
    """Calculate delay for attempt N."""
    delay = min(base * (2 ** attempt), max_delay)
    
    if jitter:
        # Full jitter: random value between 0 and calculated delay
        delay = random.uniform(0, delay)
    
    return delay

# Attempt 0: 0-1s
# Attempt 1: 0-2s
# Attempt 2: 0-4s
# Attempt 3: 0-8s
# Attempt 4: 0-16s
# ...capped at 60s

Why Jitter Matters

Without jitter, clients that failed together retry together:

Time 0:  1000 clients fail
Time 1s: 1000 clients retry (all at once)
Time 3s: 1000 clients retry (all at once)
Time 7s: 1000 clients retry (all at once)

With jitter, retries spread out:

Time 0:    1000 clients fail
Time 0-1s: ~300 clients retry (spread across 1s)
Time 1-3s: ~300 clients retry (spread across 2s)
Time 3-7s: ~300 clients retry (spread across 4s)

The recovering service gets a gradual ramp instead of repeated spikes.
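The effect is easy to see in a toy simulation (illustrative only, mirroring the attempt-0 math in `exponential_backoff` above): without jitter every client computes the identical delay, so all retries land on the same instant; with full jitter they scatter across the whole interval.

```python
import random

def first_retry_delay(base: float = 1.0, jitter: bool = True) -> float:
    """Delay before the first retry (attempt 0), as in exponential_backoff."""
    delay = base * (2 ** 0)
    if jitter:
        delay = random.uniform(0, delay)
    return delay

# 1000 clients that all failed at the same instant
no_jitter = [first_retry_delay(jitter=False) for _ in range(1000)]
with_jitter = [first_retry_delay(jitter=True) for _ in range(1000)]

print(len(set(no_jitter)))                       # 1 -- a synchronized spike at t=1s
print(all(0 <= d <= 1.0 for d in with_jitter))   # True -- spread across 0-1s
```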

The Retry Decorator

Encapsulate retry logic for reuse:

import functools
import logging
import time
from typing import Tuple, Type

logger = logging.getLogger(__name__)

def retry(
    max_attempts: int = 3,
    retryable_exceptions: Tuple[Type[Exception], ...] = (Exception,),
    base_delay: float = 1.0,
    max_delay: float = 60.0,
):
    """Decorator that retries on failure with exponential backoff."""
    
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except retryable_exceptions as e:
                    last_exception = e
                    
                    if attempt < max_attempts - 1:
                        delay = exponential_backoff(
                            attempt, base_delay, max_delay
                        )
                        logger.warning(
                            f"Attempt {attempt + 1} failed, "
                            f"retrying in {delay:.1f}s: {e}"
                        )
                        time.sleep(delay)
                    else:
                        logger.error(
                            f"All {max_attempts} attempts failed: {e}"
                        )
            
            raise last_exception
        
        return wrapper
    return decorator

# Usage (assumes the third-party `requests` library is imported)
@retry(max_attempts=5, retryable_exceptions=(ConnectionError, TimeoutError))
def fetch_data(url: str) -> dict:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

Async Retries

For async code, use asyncio.sleep:

import asyncio

async def retry_async(
    func,
    max_attempts: int = 3,
    base_delay: float = 1.0,
):
    """Retry an async function with exponential backoff."""
    last_exception = None
    
    for attempt in range(max_attempts):
        try:
            return await func()
        except Exception as e:
            last_exception = e
            
            if attempt < max_attempts - 1:
                delay = exponential_backoff(attempt, base_delay)
                await asyncio.sleep(delay)
    
    raise last_exception

# Usage
result = await retry_async(
    lambda: client.fetch(url),
    max_attempts=5
)

Circuit Breakers: When to Stop Retrying

If a service is truly down, retrying just wastes resources and adds latency. Circuit breakers stop retrying temporarily:

import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject immediately
    HALF_OPEN = "half_open"  # Testing if recovered

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_requests: int = 1,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0
        self.half_open_successes = 0
    
    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        
        if self.state == CircuitState.OPEN:
            # Check if recovery timeout elapsed
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_successes = 0
                return True
            return False
        
        if self.state == CircuitState.HALF_OPEN:
            return True
        
        return False
    
    def record_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_requests:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0
    
    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN
        elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage (external_service is a stand-in for your client)
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""
    pass

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def call_external_service():
    if not breaker.can_execute():
        raise CircuitOpenError("Circuit breaker is open")
    
    try:
        result = external_service.call()
        breaker.record_success()
        return result
    except Exception as e:
        breaker.record_failure()
        raise

Retry Budgets

Limit total retry load across the system:

import threading
import time

class RetryBudget:
    def __init__(self, tokens_per_second: float, max_tokens: float):
        self.tokens_per_second = tokens_per_second
        self.max_tokens = max_tokens
        self.tokens = max_tokens
        self.last_refill = time.time()
        self.lock = threading.Lock()
    
    def acquire(self) -> bool:
        """Try to acquire a retry token. Returns False if budget exhausted."""
        with self.lock:
            self._refill()
            
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False
    
    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(
            self.max_tokens,
            self.tokens + elapsed * self.tokens_per_second
        )
        self.last_refill = now

# Usage: allow 10 retries per second, burst up to 50.
# BudgetExhaustedError and MaxRetriesExceeded are our own exception types.
class BudgetExhaustedError(Exception):
    pass

class MaxRetriesExceeded(Exception):
    pass

budget = RetryBudget(tokens_per_second=10, max_tokens=50)

def retry_with_budget(func, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return func()
        except RetryableError:
            if attempt < max_attempts - 1:
                if not budget.acquire():
                    raise BudgetExhaustedError("Retry budget exhausted")
                time.sleep(exponential_backoff(attempt))
    raise MaxRetriesExceeded()

Idempotency: Make Retries Safe

Retrying a non-idempotent operation can cause duplicates:

# Dangerous: might charge twice
def charge_customer(customer_id: str, amount: float):
    payment_service.charge(customer_id, amount)

# Safe: idempotency key prevents duplicates
def charge_customer_safely(
    customer_id: str,
    amount: float,
    idempotency_key: str
):
    payment_service.charge(
        customer_id,
        amount,
        idempotency_key=idempotency_key
    )

# Generate the key once, reuse it on every retry attempt
idempotency_key = f"order_{order_id}_payment"

@retry(max_attempts=3, retryable_exceptions=(ConnectionError, TimeoutError))
def charge_with_retries():
    charge_customer_safely(customer_id, amount, idempotency_key)

charge_with_retries()

Most payment APIs (Stripe, Square) support idempotency keys. For your own services, track processed keys:

import json

# Assumes an async Redis client (e.g. redis.asyncio) is available as `redis`
async def process_with_idempotency(key: str, operation):
    # Check if already processed
    existing = await redis.get(f"idempotency:{key}")
    if existing:
        return json.loads(existing)
    
    # Process and store result
    result = await operation()
    await redis.setex(
        f"idempotency:{key}",
        86400,  # Keep for 24 hours
        json.dumps(result)
    )
    return result
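The same idea works without Redis for a single process. A minimal in-memory sketch (assuming results are JSON-serializable; the names here are illustrative):

```python
import json

_processed: dict = {}  # key -> cached JSON result

def process_once(key: str, operation):
    """Run operation at most once per key; repeats return the cached result."""
    if key in _processed:
        return json.loads(_processed[key])
    result = operation()
    _processed[key] = json.dumps(result)
    return result

calls = []
def charge():
    calls.append(1)          # the side effect we must not repeat
    return {"charged": True}

print(process_once("order_42_payment", charge))  # runs charge
print(process_once("order_42_payment", charge))  # cached; charge not re-run
print(len(calls))  # 1
```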

The Checklist

Before adding retries:

  • Errors classified (retryable vs permanent)
  • Exponential backoff implemented
  • Jitter added to prevent thundering herd
  • Maximum attempts capped
  • Circuit breaker for persistent failures
  • Retry budget to limit system-wide load
  • Idempotency ensured for non-safe operations
  • Logging for retry attempts (for debugging)
  • Metrics tracked (retry rate, success rate after retry)
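For the last two items, even a process-local counter is a start (metric names here are illustrative; in production you would feed the same counts into your metrics system):

```python
from collections import Counter

metrics = Counter()

def record_attempt(attempt: int, succeeded: bool):
    """Track how often we retry and how often retries actually help."""
    metrics["attempts"] += 1
    if attempt > 0:
        metrics["retries"] += 1
        if succeeded:
            metrics["successes_after_retry"] += 1

# Simulated history: first attempt fails, the retry succeeds
record_attempt(0, succeeded=False)
record_attempt(1, succeeded=True)
print(metrics["retries"], metrics["successes_after_retry"])  # 1 1
```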

Retries are a reliability primitive. Done well, they smooth over transient failures invisibly. Done poorly, they amplify problems and create cascading failures. The difference is in the details.