Every API needs rate limiting. Without it, one enthusiastic script kiddie or a bug in a client application can take down your entire service. The question isn’t whether to rate limit — it’s how to do it without making your API frustrating to use.

The Naive Approach (And Why It Fails)

```python
# Don't do this
if requests_this_minute > 100:
    return 429, "Rate limit exceeded"
```

Fixed limits per time window are simple to implement and almost always wrong. They create the “thundering herd” problem: all your users hit the limit at minute :00, back off, retry at :01, and create a synchronized spike that’s worse than no limit at all.

Token Bucket: The Industry Standard

The token bucket algorithm is what most production APIs actually use. Imagine a bucket that holds tokens. Each request consumes a token. Tokens refill at a steady rate.

```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def consume(self, tokens=1):
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
```

This allows bursts (up to the bucket capacity) while maintaining a steady average rate. A user can make 100 requests instantly if they’ve been idle, but sustained usage averages out to the refill rate.
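To make the burst behavior concrete, here is the bucket restated with an injectable clock (an addition for the demo, so it runs deterministically instead of depending on wall time):

```python
# The TokenBucket above, restated with a fake clock for a deterministic demo.
class TokenBucket:
    def __init__(self, capacity, refill_rate, clock):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.clock = clock              # callable returning seconds
        self.last_refill = clock()

    def consume(self, tokens=1):
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now

fake_time = [0.0]
bucket = TokenBucket(capacity=100, refill_rate=10, clock=lambda: fake_time[0])

# An idle user can burst up to capacity: of 150 instant requests, 100 succeed.
burst = sum(bucket.consume() for _ in range(150))   # burst == 100

# Five seconds later, 5 s * 10 tokens/s = 50 tokens have refilled.
fake_time[0] = 5.0
after_wait = sum(bucket.consume() for _ in range(150))  # after_wait == 50
```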

Sliding Window: Smoother Than Fixed Windows

If you want time-based limits without the thundering herd, use sliding windows:

```python
import time
from uuid import uuid4

def sliding_window_check(user_id, window_seconds=60, max_requests=100):
    # `redis` is assumed to be a connected redis-py client.
    # Note: this check-then-add sequence is not atomic; production
    # implementations typically wrap it in a Lua script or pipeline.
    now = time.time()
    window_start = now - window_seconds
    key = f"ratelimit:{user_id}"

    # Drop entries that have fallen out of the window, then count the rest
    redis.zremrangebyscore(key, 0, window_start)
    if redis.zcard(key) >= max_requests:
        return False

    # Add this request
    redis.zadd(key, {str(uuid4()): now})
    redis.expire(key, window_seconds)
    return True
```

The window slides with time, so there’s no magic moment when everyone’s limits reset simultaneously.

The Headers That Make Rate Limiting Bearable

Rate limiting without communication is just rejection. Good APIs tell clients exactly where they stand:

```http
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1640000000
```

When you do reject a request:

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1640000000
```

The Retry-After header is crucial. It tells clients exactly when to try again instead of forcing them to guess (or worse, retry immediately in a loop).
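Assembling these headers server-side is mechanical. A minimal sketch, assuming `limit`, `remaining`, and `reset_at` come from whatever per-user limiter state you keep (the function name and signature are illustrative, not from any particular framework):

```python
import time

def rate_limit_headers(limit, remaining, reset_at, now=None):
    """Build the X-RateLimit-* headers; add Retry-After when exhausted."""
    now = time.time() if now is None else now
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_at)),
    }
    if remaining <= 0:
        # Round up so clients never retry a second too early
        headers["Retry-After"] = str(max(1, int(reset_at - now + 0.999)))
    return headers

rate_limit_headers(1000, 0, 1640000030, now=1640000000)
# includes "Retry-After": "30", matching the 429 response above
```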

Differentiated Limits

Not all requests are equal. Not all users are equal.

```python
RATE_LIMITS = {
    "free": {"requests_per_minute": 60, "burst": 10},
    "pro": {"requests_per_minute": 600, "burst": 100},
    "enterprise": {"requests_per_minute": 6000, "burst": 1000},
}

ENDPOINT_COSTS = {
    "/api/simple": 1,
    "/api/search": 5,
    "/api/generate": 20,
}
```

Expensive operations (database-heavy queries, AI inference, file processing) should cost more “tokens” than simple lookups. This lets you protect your infrastructure while still allowing high volumes of cheap requests.
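Wiring the two tables together might look like the sketch below, where a plain integer stands in for a per-user bucket like the one shown earlier (`check_request` is an illustrative helper, not part of any library):

```python
ENDPOINT_COSTS = {
    "/api/simple": 1,
    "/api/search": 5,
    "/api/generate": 20,
}

def check_request(tokens_available, endpoint):
    """Return (allowed, tokens_left) after charging the endpoint's cost."""
    cost = ENDPOINT_COSTS.get(endpoint, 1)  # unlisted endpoints cost 1
    if tokens_available >= cost:
        return True, tokens_available - cost
    return False, tokens_available

allowed, left = check_request(25, "/api/generate")  # one AI call: 5 tokens left
denied, _ = check_request(left, "/api/generate")    # 5 < 20: denied
still_ok, _ = check_request(left, "/api/search")    # but a cheap search fits
```

The same token budget admits one expensive call or many cheap ones, which is exactly the trade-off the cost table encodes.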

Graceful Degradation Over Hard Failures

Instead of immediately returning 429, consider degraded responses:

```python
def handle_request(user_id, request):
    tokens = get_remaining_tokens(user_id)

    if tokens > 50:
        return full_response(request)
    elif tokens > 10:
        return cached_response(request)  # Serve stale data
    elif tokens > 0:
        return minimal_response(request)  # Essential fields only
    else:
        return rate_limit_response()
```

Users experiencing degraded service are less frustrated than users hitting a brick wall. They can still function, just with reduced capability.

Client-Side: Be a Good Citizen

If you’re consuming rate-limited APIs, build resilience into your client:

```python
import random
import time
from functools import wraps

def with_retry(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                response = func(*args, **kwargs)

                if response.status_code != 429:
                    return response

                # Honor Retry-After if present; otherwise back off exponentially
                retry_after = int(response.headers.get('Retry-After',
                                                       base_delay * (2 ** attempt)))
                # Jitter desynchronizes clients that were limited at the same moment
                time.sleep(retry_after + random.uniform(0, 1))

            return response
        return wrapper
    return decorator
```

Exponential backoff with jitter prevents synchronized retries. Respecting Retry-After prevents wasted requests.

The Human Element

Technical implementation is half the battle. The other half is communication:

  • Document your limits clearly. Don’t make users discover them through 429 errors.
  • Provide usage dashboards. Let users see their consumption before they hit limits.
  • Alert before cutting off. Email when users approach 80% of their quota.
  • Make upgrades easy. If someone needs more capacity, the path should be obvious.

Rate limiting is a conversation between your API and its consumers. Done well, it’s invisible to legitimate users and protective against abuse. Done poorly, it’s a constant source of frustration and support tickets.

The best rate limiter is one your users never notice — until they try to abuse it.