Retrying failed operations sounds simple. Getting it right is surprisingly subtle. Here's how to retry without making things worse.
March 10, 2026 · 8 min · 1619 words · Rob Washington
When something fails, retry it. Simple, right?
Not quite. Naive retries can turn a minor hiccup into a cascading failure. Retry too aggressively and you overwhelm the recovering service. Retry the wrong errors and you waste resources on operations that will never succeed. Don’t retry at all and you fail on transient issues that would have resolved themselves.
Here’s how to build retries that help rather than hurt.
The first step is deciding which errors are worth retrying at all:

```python
class RetryableError(Exception):
    """Errors worth retrying."""
    pass

class PermanentError(Exception):
    """Errors that will never succeed on retry."""
    pass

def classify_error(error: Exception) -> type:
    """Determine if an error is worth retrying."""
    # Network errors: usually transient
    if isinstance(error, (ConnectionError, TimeoutError)):
        return RetryableError

    # HTTP status codes
    if hasattr(error, 'status_code'):
        status = error.status_code
        # 429 Too Many Requests: retry after backoff
        # 500, 502, 503, 504: server issues, often transient
        if status in (429, 500, 502, 503, 504):
            return RetryableError
        # 400, 401, 403, 404: client errors, won't change on retry
        if 400 <= status < 500:
            return PermanentError

    # Database errors
    if "deadlock" in str(error).lower():
        return RetryableError
    if "duplicate key" in str(error).lower():
        return PermanentError

    # Default: don't retry unknown errors
    return PermanentError
```
Retry: Network timeouts, rate limits, server errors, deadlocks, temporary unavailability.
Don’t retry: Authentication failures, validation errors, not found, permission denied.
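To see the classifier in action, here's a quick sketch. The call site usually only needs a yes/no answer, so a thin `should_retry` wrapper (a name introduced here for illustration) keeps retry loops readable; the classifier below is a condensed copy of the one above so the snippet runs standalone:

```python
class RetryableError(Exception):
    """Worth retrying."""

class PermanentError(Exception):
    """Will never succeed on retry."""

def classify_error(error: Exception) -> type:
    # Condensed copy of the classifier above
    if isinstance(error, (ConnectionError, TimeoutError)):
        return RetryableError
    status = getattr(error, "status_code", None)
    if status is not None:
        if status in (429, 500, 502, 503, 504):
            return RetryableError
        if 400 <= status < 500:
            return PermanentError
    return PermanentError

def should_retry(error: Exception) -> bool:
    """The call site only needs a boolean."""
    return classify_error(error) is RetryableError

assert should_retry(TimeoutError("connect timed out"))   # transient: retry
assert not should_retry(ValueError("invalid payload"))   # permanent: fail fast
```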
Once you know an error is retryable, space out the attempts. Exponential backoff with jitter prevents a thundering herd of synchronized retries:

```python
import random

def exponential_backoff(
    attempt: int,
    base: float = 1.0,
    max_delay: float = 60.0,
    jitter: bool = True,
) -> float:
    """Calculate delay for attempt N."""
    delay = min(base * (2 ** attempt), max_delay)
    if jitter:
        # Full jitter: random value between 0 and calculated delay
        delay = random.uniform(0, delay)
    return delay

# Attempt 0: 0-1s
# Attempt 1: 0-2s
# Attempt 2: 0-4s
# Attempt 3: 0-8s
# Attempt 4: 0-16s
# ...capped at 60s
```
With classification and backoff in hand, wrap them in a decorator:

```python
import functools
import logging
import time
from typing import Tuple, Type

import requests

logger = logging.getLogger(__name__)

def retry(
    max_attempts: int = 3,
    retryable_exceptions: Tuple[Type[Exception], ...] = (Exception,),
    base_delay: float = 1.0,
    max_delay: float = 60.0,
):
    """Decorator that retries on failure with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except retryable_exceptions as e:
                    last_exception = e
                    if attempt < max_attempts - 1:
                        delay = exponential_backoff(attempt, base_delay, max_delay)
                        logger.warning(
                            f"Attempt {attempt + 1} failed, "
                            f"retrying in {delay:.1f}s: {e}"
                        )
                        time.sleep(delay)
                    else:
                        logger.error(f"All {max_attempts} attempts failed: {e}")
            raise last_exception
        return wrapper
    return decorator

# Usage
@retry(max_attempts=5, retryable_exceptions=(ConnectionError, TimeoutError))
def fetch_data(url: str) -> dict:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()
```
The async version follows the same shape, swapping `time.sleep` for `asyncio.sleep`:

```python
import asyncio

async def retry_async(
    func,
    max_attempts: int = 3,
    base_delay: float = 1.0,
):
    """Retry an async function with exponential backoff."""
    last_exception = None
    for attempt in range(max_attempts):
        try:
            return await func()
        except Exception as e:
            last_exception = e
            if attempt < max_attempts - 1:
                delay = exponential_backoff(attempt, base_delay)
                await asyncio.sleep(delay)
    raise last_exception

# Usage
result = await retry_async(lambda: client.fetch(url), max_attempts=5)
```
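A quick way to convince yourself the loop behaves: a coroutine that fails twice and then succeeds, mimicking a transient error. The helper below is a condensed copy of the one above, with a short fixed backoff so the demo runs fast:

```python
import asyncio

async def retry_async(func, max_attempts=3, base_delay=0.01):
    """Condensed copy of the helper above, standalone for this demo."""
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return await func()
        except Exception as e:
            last_exc = e
            if attempt < max_attempts - 1:
                await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_exc

attempts = {"n": 0}

async def flaky_fetch():
    """Fails twice, then succeeds -- a typical transient failure."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return {"status": "ok"}

result = asyncio.run(retry_async(flaky_fetch, max_attempts=5))
assert result == {"status": "ok"}
assert attempts["n"] == 3  # two failures absorbed, third attempt succeeded
```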
Backoff protects against synchronized retries, but it doesn't stop every client from hammering a service that is clearly down. A circuit breaker does: after enough failures it rejects calls immediately, then probes cautiously for recovery.

```python
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject immediately
    HALF_OPEN = "half_open"  # Testing if recovered

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call."""
    pass

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_requests: int = 1,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0
        self.half_open_successes = 0

    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            # Check if recovery timeout elapsed
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_successes = 0
                return True
            return False
        if self.state == CircuitState.HALF_OPEN:
            return True
        return False

    def record_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_requests:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
        else:
            self.failure_count = 0

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN
        elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage
breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def call_external_service():
    if not breaker.can_execute():
        raise CircuitOpenError("Circuit breaker is open")
    try:
        result = external_service.call()
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        raise
```
A circuit breaker guards one dependency at a time. A retry budget caps total retry traffic across the whole process, so retries can never multiply load during a widespread incident:

```python
import threading
import time

class BudgetExhaustedError(Exception):
    """Raised when the retry budget has no tokens left."""
    pass

class MaxRetriesExceeded(Exception):
    """Raised when all attempts fail."""
    pass

class RetryBudget:
    def __init__(self, tokens_per_second: float, max_tokens: float):
        self.tokens_per_second = tokens_per_second
        self.max_tokens = max_tokens
        self.tokens = max_tokens
        self.last_refill = time.time()
        self.lock = threading.Lock()

    def acquire(self) -> bool:
        """Try to acquire a retry token. Returns False if budget exhausted."""
        with self.lock:
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(
            self.max_tokens,
            self.tokens + elapsed * self.tokens_per_second,
        )
        self.last_refill = now

# Usage: Allow 10 retries per second, burst up to 50
budget = RetryBudget(tokens_per_second=10, max_tokens=50)

def retry_with_budget(func, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return func()
        except RetryableError:
            if attempt < max_attempts - 1:
                if not budget.acquire():
                    raise BudgetExhaustedError("Retry budget exhausted")
                time.sleep(exponential_backoff(attempt))
    raise MaxRetriesExceeded()
```
Most payment APIs (Stripe, Square) support idempotency keys. For your own services, track processed keys:
```python
import json

# Assumes `redis` is an async Redis client (e.g. redis.asyncio.Redis)
async def process_with_idempotency(key: str, operation):
    # Check if already processed
    existing = await redis.get(f"idempotency:{key}")
    if existing:
        return json.loads(existing)

    # Process and store result
    result = await operation()
    await redis.setex(
        f"idempotency:{key}",
        86400,  # Keep for 24 hours
        json.dumps(result),
    )
    return result
```
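The key itself matters: it must be stable across retries of the same logical operation, or the deduplication never fires. One approach is to derive it from what the operation *is* rather than when it runs. A hedged sketch (the function and field names here are illustrative, not from any particular API):

```python
import hashlib

def idempotency_key(user_id: str, action: str, payload: dict) -> str:
    """Derive a stable key from the operation's identity.

    The same logical request always hashes to the same key, so a retry
    hits the cached result instead of re-executing the side effect.
    """
    canonical = f"{user_id}:{action}:" + "&".join(
        f"{k}={payload[k]}" for k in sorted(payload)  # order-independent
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

k1 = idempotency_key("u42", "charge", {"amount": 500, "currency": "usd"})
k2 = idempotency_key("u42", "charge", {"currency": "usd", "amount": 500})
assert k1 == k2  # same operation, same key, regardless of dict order
```

An alternative is generating a random UUID once per logical request and threading it through every retry; derivation just removes the need to store it.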
Finally, measure your retries. Track the retry rate and the success rate after retry: together they tell you whether retries are quietly absorbing transient failures or papering over a deeper problem.
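A minimal sketch of that bookkeeping, using an in-memory counter for illustration (in production you'd emit these through your metrics library instead):

```python
from collections import Counter

metrics = Counter()

def instrumented_retry(func, max_attempts=3):
    """A bare retry loop that records the metrics worth watching."""
    for attempt in range(max_attempts):
        try:
            result = func()
            if attempt > 0:
                metrics["success_after_retry"] += 1
            return result
        except Exception:
            metrics["attempts_failed"] += 1
            if attempt < max_attempts - 1:
                metrics["retries"] += 1
            else:
                metrics["exhausted"] += 1
                raise

# A function that fails once, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient")
    return "ok"

assert instrumented_retry(flaky) == "ok"
assert metrics["retries"] == 1
assert metrics["success_after_retry"] == 1
```

A rising `success_after_retry` count means retries are earning their keep; a rising `exhausted` count means they're just adding latency before the same failure.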
Retries are a reliability primitive. Done well, they smooth over transient failures invisibly. Done poorly, they amplify problems and create cascading failures. The difference is in the details.