Networks fail. Services go down. Databases get overwhelmed. The question isn’t whether your requests will fail—it’s how gracefully you handle it when they do.

Naive retry logic can turn a minor hiccup into a catastrophic cascade. Smart retry logic can make your system resilient to transient failures. The difference is in the details.

The Naive Approach (Don’t Do This)

# Bad: immediate retry loop
import requests

def fetch_data(url):
    for attempt in range(5):
        try:
            response = requests.get(url, timeout=5)
            return response.json()
        except requests.RequestException:
            continue
    raise Exception("Failed after 5 attempts")

This code has several problems:

  • No delay between retries — hammers an already-struggling service
  • No distinction between error types — retries on 404s that will never succeed
  • No backoff — all clients retry simultaneously, creating thundering herd
  • Fixed retry count — doesn’t adapt to the situation
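
Of these, the missing error classification is the easiest to get wrong. A minimal sketch of the idea (the function name and status set are illustrative, not from any particular library):

```python
# Illustrative sketch: retry only statuses that can succeed on a later attempt.
TRANSIENT_STATUS = {429, 500, 502, 503, 504}

def should_retry(status_code: int) -> bool:
    """4xx client errors (except 429) will fail identically on retry."""
    return status_code in TRANSIENT_STATUS

print(should_retry(503))  # True  - server overloaded, worth retrying
print(should_retry(404))  # False - the resource will not appear on retry
```

In a real client you would also treat connection errors and timeouts as retryable, since they are transient by nature.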

Exponential Backoff

The core idea: wait longer between each retry. If the first attempt fails, wait 1 second. If the second fails, wait 2 seconds. Then 4, 8, 16…

import time
import random
import requests
from functools import wraps

class RetryableError(Exception):
    """Raised for responses whose status code should trigger a retry."""
    pass

def exponential_backoff(
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
    retryable_exceptions: tuple = (requests.RequestException, RetryableError),
    retryable_status_codes: tuple = (429, 500, 502, 503, 504),
):
    """Decorator for exponential backoff with jitter."""
    
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            
            for attempt in range(max_retries + 1):
                try:
                    result = func(*args, **kwargs)
                    
                    # Check for retryable HTTP status codes
                    if hasattr(result, 'status_code'):
                        if result.status_code in retryable_status_codes:
                            raise RetryableError(f"Status {result.status_code}")
                    
                    return result
                    
                except retryable_exceptions as e:
                    last_exception = e
                    
                    if attempt == max_retries:
                        break
                    
                    # Calculate delay with exponential backoff
                    delay = min(
                        base_delay * (exponential_base ** attempt),
                        max_delay
                    )
                    
                    # Add jitter (±25%)
                    jitter = delay * 0.25 * (2 * random.random() - 1)
                    delay = delay + jitter
                    
                    print(f"Attempt {attempt + 1} failed, retrying in {delay:.2f}s...")
                    time.sleep(delay)
            
            raise last_exception
        
        return wrapper
    return decorator

# Usage
@exponential_backoff(max_retries=5, base_delay=1.0)
def fetch_data(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

Why Jitter Matters

Without jitter, all clients that failed at the same time will retry at the same time. This creates synchronized waves of traffic that can overwhelm a recovering service.

Without jitter: clients A, B, and C all fail at t=0, then all three retry at exactly t=1, t=2, t=4... Each synchronized wave hits the recovering service at once, and it is overwhelmed again. With jitter: A, B, and C retry at slightly different times within each window, the load is spread out, and the service recovers.
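
The effect is easy to see numerically. A quick sketch with five hypothetical clients on their third retry (delay 2^3 = 8 seconds):

```python
import random

base = 8.0  # e.g. the 3rd retry with base_delay=1: 1 * 2**3 seconds

# Without jitter, every client computes the identical delay...
without_jitter = [base for _ in range(5)]

# ...with full jitter, the retries spread across the whole window.
with_jitter = [random.uniform(0, base) for _ in range(5)]

print(without_jitter)  # five identical values: a synchronized spike
print(with_jitter)     # five different values: load spread over 8 seconds
```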

There are two common jitter strategies:

# Full jitter: random value between 0 and the calculated delay
delay = random.uniform(0, base_delay * (2 ** attempt))

# Decorrelated jitter: based on the previous delay
# (seed previous_delay with base_delay before the first attempt)
delay = min(max_delay, random.uniform(base_delay, previous_delay * 3))

In AWS's published analysis of these strategies, both full jitter and decorrelated jitter dramatically outperform plain exponential backoff; full jitter is the simplest safe default.
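
Decorrelated jitter is stateful: each delay feeds into the next. A minimal sketch of how the sequence evolves (the function name is mine, not from any library):

```python
import random

def decorrelated_jitter_delays(base: float = 1.0, cap: float = 60.0, n: int = 5):
    """Generate n successive retry delays using decorrelated jitter."""
    delays = []
    sleep = base
    for _ in range(n):
        # Each delay is drawn between base and 3x the previous delay,
        # so delays grow on average but never leave [base, cap].
        sleep = min(cap, random.uniform(base, sleep * 3))
        delays.append(sleep)
    return delays

print(decorrelated_jitter_delays())
```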

Retry-After Headers

Many APIs tell you exactly when to retry via the Retry-After header:

from email.utils import parsedate_to_datetime

def smart_retry(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(5):
            response = func(*args, **kwargs)
            
            if response.status_code == 429:  # Too Many Requests
                retry_after = response.headers.get('Retry-After')
                
                if retry_after:
                    # Could be seconds or an HTTP date
                    try:
                        delay = int(retry_after)
                    except ValueError:
                        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
                        delay = parsedate_to_datetime(retry_after).timestamp() - time.time()
                    
                    delay = max(0, delay)
                    print(f"Rate limited. Waiting {delay}s as requested...")
                    time.sleep(delay)
                    continue
            
            return response
        
        raise Exception("Max retries exceeded")
    return wrapper

Idempotency Keys

Retries are only safe if the operation is idempotent. For non-idempotent operations (like payments), use idempotency keys:

import uuid
import requests

def create_payment(amount: float, recipient: str, idempotency_key: str = None):
    """Create a payment with idempotency protection."""
    
    if idempotency_key is None:
        idempotency_key = str(uuid.uuid4())
    
    response = requests.post(
        "https://api.payment.com/charges",
        json={"amount": amount, "recipient": recipient},
        headers={
            "Idempotency-Key": idempotency_key,
            "Authorization": f"Bearer {API_KEY}"
        }
    )
    
    return response

# Safe to retry - same key = same result
key = str(uuid.uuid4())
for attempt in range(3):
    try:
        result = create_payment(100.00, "user@example.com", idempotency_key=key)
        break
    except requests.RequestException:
        continue

Circuit Breaker Integration

Retries handle transient failures. Circuit breakers handle prolonged outages:

from datetime import datetime, timedelta
from enum import Enum
from threading import Lock

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if service recovered

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""
    pass

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 30,
        expected_exception: type = Exception
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        
        self.state = CircuitState.CLOSED
        self.failures = 0
        self.last_failure_time = None
        self.lock = Lock()
    
    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self.state = CircuitState.HALF_OPEN
                else:
                    raise CircuitOpenError("Circuit breaker is open")
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except self.expected_exception:
            self._on_failure()
            raise
    
    def _should_attempt_reset(self):
        return (
            datetime.now() - self.last_failure_time 
            > timedelta(seconds=self.recovery_timeout)
        )
    
    def _on_success(self):
        with self.lock:
            self.failures = 0
            self.state = CircuitState.CLOSED
    
    def _on_failure(self):
        with self.lock:
            self.failures += 1
            self.last_failure_time = datetime.now()
            
            # A failed probe while HALF_OPEN reopens the circuit immediately
            if (self.failures >= self.failure_threshold
                    or self.state == CircuitState.HALF_OPEN):
                self.state = CircuitState.OPEN

# Usage: combine with retry
circuit = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

@exponential_backoff(max_retries=3)
def fetch_with_circuit_breaker(url):
    return circuit.call(requests.get, url)

The Complete Pattern

For production systems, combine all these patterns:

def resilient_request(
    url: str,
    method: str = "GET",
    max_retries: int = 3,
    circuit_breaker: CircuitBreaker = None,
    idempotency_key: str = None,
    **kwargs
):
    """Production-ready resilient HTTP request."""
    
    headers = kwargs.pop("headers", {})
    if idempotency_key:
        headers["Idempotency-Key"] = idempotency_key
    
    for attempt in range(max_retries + 1):
        try:
            # Check circuit breaker (allow OPEN -> HALF_OPEN after the timeout)
            if circuit_breaker and circuit_breaker.state == CircuitState.OPEN:
                if not circuit_breaker._should_attempt_reset():
                    raise CircuitOpenError("Service unavailable")
                circuit_breaker.state = CircuitState.HALF_OPEN
            
            # Make request
            response = requests.request(
                method, url, headers=headers, **kwargs
            )
            
            # Handle rate limiting (assumes a delta-seconds Retry-After)
            if response.status_code == 429:
                try:
                    retry_after = int(response.headers.get("Retry-After", "1"))
                except ValueError:  # HTTP-date form; fall back to a short wait
                    retry_after = 1
                time.sleep(retry_after)
                continue
            
            # Success
            if circuit_breaker:
                circuit_breaker._on_success()
            
            return response
            
        except requests.RequestException as e:
            if circuit_breaker:
                circuit_breaker._on_failure()
            
            if attempt == max_retries:
                raise
            
            # Exponential backoff with jitter
            delay = min(60, (2 ** attempt) + random.uniform(0, 1))
            time.sleep(delay)
    
    raise Exception("Request failed")

Conclusion

Retry logic seems simple until you’ve debugged a thundering herd at 3am. The patterns covered here—exponential backoff, jitter, retry-after headers, idempotency keys, and circuit breakers—form the foundation of resilient distributed systems.

Remember:

  • Always add jitter to prevent synchronized retries
  • Respect Retry-After headers when provided
  • Use idempotency keys for non-idempotent operations
  • Combine with circuit breakers for prolonged outages
  • Only retry on transient errors (5xx, timeouts, connection errors)

Your system will fail. Make sure it fails gracefully.