How to implement robust retry logic with exponential backoff, jitter, and circuit breakers to build resilient distributed systems.
February 12, 2026 · 8 min · 1546 words · Rob Washington
Networks fail. Services go down. Databases get overwhelmed. The question isn’t whether your requests will fail—it’s how gracefully you handle it when they do.
Naive retry logic can turn a minor hiccup into a catastrophic cascade. Smart retry logic can make your system resilient to transient failures. The difference is in the details.
```python
import time
import random
import requests
from functools import wraps


class RetryableError(Exception):
    """Raised when a response carries a retryable HTTP status code."""


def exponential_backoff(
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
    retryable_exceptions: tuple = (requests.RequestException,),
    retryable_status_codes: tuple = (429, 500, 502, 503, 504),
):
    """Decorator for exponential backoff with jitter."""
    # Always catch RetryableError so retryable status codes actually trigger a retry
    caught = (RetryableError, *retryable_exceptions)

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries + 1):
                try:
                    result = func(*args, **kwargs)
                    # Check for retryable HTTP status codes
                    if hasattr(result, 'status_code'):
                        if result.status_code in retryable_status_codes:
                            raise RetryableError(f"Status {result.status_code}")
                    return result
                except caught as e:
                    last_exception = e
                    if attempt == max_retries:
                        break
                    # Calculate delay with exponential backoff
                    delay = min(base_delay * (exponential_base ** attempt), max_delay)
                    # Add jitter (±25%)
                    jitter = delay * 0.25 * (2 * random.random() - 1)
                    delay = delay + jitter
                    print(f"Attempt {attempt + 1} failed, retrying in {delay:.2f}s...")
                    time.sleep(delay)
            raise last_exception
        return wrapper
    return decorator


# Usage
@exponential_backoff(max_retries=5, base_delay=1.0)
def fetch_data(url):
    # Return the response itself so the decorator can inspect status_code.
    # Calling raise_for_status() here would also retry non-transient 4xx errors.
    return requests.get(url, timeout=10)
```
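To make the growth of the delays concrete, here is a small sketch that computes the un-jittered schedule from the same capped formula the decorator uses. The helper name `backoff_schedule` is made up for illustration; it only does the arithmetic and never sleeps.

```python
def backoff_schedule(max_retries=5, base_delay=1.0, max_delay=60.0, exponential_base=2.0):
    """Return the un-jittered delay (in seconds) before each retry."""
    return [
        min(base_delay * (exponential_base ** attempt), max_delay)
        for attempt in range(max_retries)
    ]


schedule = backoff_schedule(max_retries=7)
print(schedule)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
print(sum(schedule))  # 123.0
```

The seventh delay would be 64 seconds without the cap, so `max_delay` clamps it to 60. Adding up the schedule is a quick way to check the worst-case time a caller can spend waiting.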
Without jitter, all clients that failed at the same time will retry at the same time. This creates synchronized waves of traffic that can overwhelm a recovering service.
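The effect is easy to see in a toy simulation. The sketch below assumes 1,000 clients that all fail at t=0 and counts how many retries land in each 100 ms bucket, with and without jitter. The function name, client count, and bucket size are illustrative choices, not measurements from a real system.

```python
import random
from collections import Counter


def retry_buckets(num_clients, attempts, base_delay=1.0, jitter=False, seed=0):
    """Count retries per 100 ms bucket when every client fails at t=0."""
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(num_clients):
        now = 0.0
        for attempt in range(attempts):
            delay = base_delay * (2 ** attempt)
            if jitter:
                delay = rng.uniform(0, delay)  # spread the retry over [0, delay]
            now += delay
            buckets[int(now * 10)] += 1  # 100 ms buckets
    return buckets


no_jitter = retry_buckets(1000, 4, jitter=False)
with_jitter = retry_buckets(1000, 4, jitter=True)

print(max(no_jitter.values()))    # 1000: every client retries in the same instant
print(max(with_jitter.values()))  # a much smaller peak, spread across many buckets
```

Without jitter the service sees four spikes of 1,000 simultaneous requests. With jitter the same number of retries arrives as a smoother stream.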
There are two common jitter strategies:
```python
# Full jitter: random value between 0 and calculated delay
delay = random.uniform(0, base_delay * (2 ** attempt))

# Decorrelated jitter: based on previous delay
delay = min(max_delay, random.uniform(base_delay, previous_delay * 3))
```
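The AWS Architecture Blog's analysis of backoff strategies found that both full jitter and decorrelated jitter cut client work and server load substantially compared with un-jittered backoff. The choice between the two is less clear-cut, so either is a reasonable default.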
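Decorrelated jitter needs to remember the previous delay between attempts, which the one-line formula leaves implicit. One way to carry that state is a generator. This is a minimal sketch: the name `decorrelated_jitter` and the injectable `rng` parameter are assumptions made for illustration.

```python
import random


def decorrelated_jitter(base_delay=1.0, max_delay=60.0, rng=random):
    """Yield an endless sequence of decorrelated-jitter delays in seconds."""
    previous_delay = base_delay
    while True:
        delay = min(max_delay, rng.uniform(base_delay, previous_delay * 3))
        yield delay
        previous_delay = delay  # the next delay is drawn relative to this one


# A seeded generator makes the sequence reproducible for demonstration.
delays = decorrelated_jitter(base_delay=1.0, max_delay=60.0, rng=random.Random(42))
for attempt in range(5):
    print(f"retry {attempt + 1}: sleep {next(delays):.2f}s")
```

Each delay falls between `base_delay` and `max_delay`, and the upper bound of the random range grows with the previous delay rather than with the attempt number.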
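```python
import time
from email.utils import parsedate_to_datetime
from functools import wraps


def smart_retry(func):
    """Retry on 429 responses, waiting as long as the Retry-After header asks."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(5):
            response = func(*args, **kwargs)
            if response.status_code == 429:  # Too Many Requests
                retry_after = response.headers.get('Retry-After')
                if retry_after:
                    # Could be seconds or HTTP date
                    try:
                        delay = int(retry_after)
                    except ValueError:
                        # Parse HTTP date format
                        delay = parsedate_to_datetime(retry_after).timestamp() - time.time()
                    # A date in the past would give a negative delay, which time.sleep rejects
                    delay = max(0, delay)
                    print(f"Rate limited. Waiting {delay}s as requested...")
                    time.sleep(delay)
                    continue
            return response
        raise Exception("Max retries exceeded")
    return wrapper
```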
```python
import uuid
import requests

# API_KEY is assumed to be defined elsewhere (for example, loaded from configuration)


def create_payment(amount: float, recipient: str, idempotency_key: str = None):
    """Create a payment with idempotency protection."""
    if idempotency_key is None:
        idempotency_key = str(uuid.uuid4())
    response = requests.post(
        "https://api.payment.com/charges",
        json={"amount": amount, "recipient": recipient},
        headers={
            "Idempotency-Key": idempotency_key,
            "Authorization": f"Bearer {API_KEY}",
        },
        timeout=10,
    )
    return response


# Safe to retry - same key = same result
key = str(uuid.uuid4())  # generate once, outside the loop, so every attempt reuses it
for attempt in range(3):
    try:
        result = create_payment(100.00, "user@example.com", idempotency_key=key)
        break
    except requests.RequestException:
        continue
```
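The "same key = same result" guarantee depends on the server remembering which keys it has already processed. The sketch below is a hypothetical in-memory stand-in for such a server, not the behavior of any particular payment API. The class and method names are invented for illustration.

```python
import uuid


class PaymentServer:
    """Hypothetical in-memory stand-in for an API that honors idempotency keys."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored response
        self.charges = []   # charges that were actually executed

    def create_charge(self, idempotency_key, amount, recipient):
        # Replay the stored response instead of charging a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        charge = {"id": len(self.charges) + 1, "amount": amount, "recipient": recipient}
        self.charges.append(charge)
        self._results[idempotency_key] = charge
        return charge


server = PaymentServer()
key = str(uuid.uuid4())
first = server.create_charge(key, 100.00, "user@example.com")
retry = server.create_charge(key, 100.00, "user@example.com")  # e.g. after a client timeout

print(first == retry)       # True
print(len(server.charges))  # 1
```

The retry returns the original charge and no second charge is created. Real services typically add details this sketch omits, such as expiring old keys, and those details vary by provider.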
```python
from datetime import datetime, timedelta
from enum import Enum
from threading import Lock


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if service recovered


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 30,
        expected_exception: type = Exception,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        self.state = CircuitState.CLOSED
        self.failures = 0
        self.last_failure_time = None
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self.state = CircuitState.HALF_OPEN
                else:
                    raise CircuitOpenError("Circuit breaker is open")
        # Run the call outside the lock: the handlers below acquire it themselves
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except self.expected_exception:
            self._on_failure()
            raise

    def _should_attempt_reset(self):
        return (
            datetime.now() - self.last_failure_time
            > timedelta(seconds=self.recovery_timeout)
        )

    def _on_success(self):
        with self.lock:
            self.failures = 0
            self.state = CircuitState.CLOSED

    def _on_failure(self):
        with self.lock:
            self.failures += 1
            self.last_failure_time = datetime.now()
            if self.failures >= self.failure_threshold:
                self.state = CircuitState.OPEN


# Usage: combine with retry
circuit = CircuitBreaker(failure_threshold=5, recovery_timeout=30)


@exponential_backoff(max_retries=3)
def fetch_with_circuit_breaker(url):
    # Only exceptions count as failures here; a 5xx response that is
    # returned without raising does not trip the breaker.
    return circuit.call(requests.get, url, timeout=10)
```
Retry logic seems simple until you’ve debugged a thundering herd at 3am. The patterns covered here—exponential backoff, jitter, retry-after headers, idempotency keys, and circuit breakers—form the foundation of resilient distributed systems.
Remember:
Always add jitter to prevent synchronized retries
Respect Retry-After headers when provided
Use idempotency keys for non-idempotent operations
Combine with circuit breakers for prolonged outages
Only retry on transient errors (429 and retryable 5xx responses, timeouts, connection errors)
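The last point can be captured in a small classifier that the retry loop consults before sleeping. This is a sketch using only built-in exception types. The name `should_retry` is hypothetical, and the status-code set mirrors the defaults used earlier in this post. A library such as requests raises its own exception classes, which would need to be added to the tuple.

```python
RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


def should_retry(status_code=None, exception=None):
    """Return True only for failures that are likely to be transient."""
    if exception is not None:
        # Built-in timeout and connection errors are usually worth retrying.
        return isinstance(exception, (TimeoutError, ConnectionError))
    if status_code is not None:
        return status_code in RETRYABLE_STATUS_CODES
    return False


print(should_retry(status_code=503))                    # True
print(should_retry(status_code=404))                    # False
print(should_retry(exception=ConnectionResetError()))   # True
print(should_retry(exception=ValueError("bad input")))  # False
```

Keeping the decision in one function makes it easy to audit which failures are retried and to avoid retrying errors, such as a 404 or a validation failure, that will not succeed on a second attempt.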
Your system will fail. Make sure it fails gracefully.