Rate limiting is the bouncer at your API’s door. Too strict, and legitimate users get frustrated. Too loose, and one bad actor can take down your service. Here’s how to find the balance.

Why Rate Limit?

Without limits, a single client can:

  • Exhaust your database connections
  • Burn through your third-party API quotas
  • Inflate your cloud bill
  • Deny service to everyone else

Rate limiting isn’t about being restrictive—it’s about being fair.

The Basics: Token Bucket

The token bucket is the workhorse of rate limiting:

import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: int          # Max tokens
    refill_rate: float     # Tokens per second
    tokens: float = field(init=False, default=0.0)
    last_refill: float = field(init=False, default=0.0)
    
    def __post_init__(self):
        self.tokens = self.capacity
        self.last_refill = time.time()
    
    def consume(self, tokens: int = 1) -> bool:
        now = time.time()
        # Refill based on time passed
        elapsed = now - self.last_refill
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now
        
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

# 100 requests per minute, with burst capacity of 10
bucket = TokenBucket(capacity=10, refill_rate=100/60)

Why token bucket? It allows bursts (up to capacity) while enforcing an average rate. Real traffic is bursty—pure fixed-window limiting feels punishing.
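To make the burst behavior concrete, here's a self-contained sketch (the class is repeated from above so the snippet runs on its own): twelve back-to-back requests drain the burst capacity, then get throttled until tokens refill.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: int          # Max tokens
    refill_rate: float     # Tokens per second
    tokens: float = field(init=False, default=0.0)
    last_refill: float = field(init=False, default=0.0)

    def __post_init__(self):
        self.tokens = self.capacity
        self.last_refill = time.time()

    def consume(self, tokens: int = 1) -> bool:
        now = time.time()
        # Refill based on elapsed time, capped at capacity
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

# Fire 12 requests back to back: the burst capacity (10) is served
# immediately, then the bucket is empty until it refills at ~1.67/s.
bucket = TokenBucket(capacity=10, refill_rate=100 / 60)
results = [bucket.consume() for _ in range(12)]
```

The first ten calls succeed instantly; the two after the burst fail until roughly 0.6 seconds have passed per token.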

Fixed Window vs Sliding Window

Fixed window resets at boundaries (every minute on the minute):

# Assumes `redis` is an already-connected redis-py client
def fixed_window_check(key: str, limit: int, window_seconds: int) -> bool:
    window = int(time.time() / window_seconds)
    cache_key = f"ratelimit:{key}:{window}"
    
    count = redis.incr(cache_key)
    if count == 1:
        redis.expire(cache_key, window_seconds)
    
    return count <= limit

Problem: A user can hit 100 requests at 11:59:59 and 100 more at 12:00:01—200 requests in 2 seconds.

Sliding window smooths this:

import uuid

def sliding_window_check(key: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    window_start = now - window_seconds
    cache_key = f"ratelimit:{key}"
    
    # Remove old entries, add this request (unique member, so concurrent
    # requests with identical timestamps don't overwrite each other), count
    pipe = redis.pipeline()
    pipe.zremrangebyscore(cache_key, 0, window_start)
    pipe.zadd(cache_key, {f"{now}:{uuid.uuid4().hex}": now})
    pipe.zcard(cache_key)
    pipe.expire(cache_key, window_seconds)
    _, _, count, _ = pipe.execute()
    
    return count <= limit

Trade-off: Sliding window is fairer but uses more memory (stores each request timestamp).

Different Limits for Different Things

One limit rarely fits all. Layer them:

RATE_LIMITS = {
    'global': {'requests': 10000, 'window': 60},      # Per minute, all users
    'authenticated': {'requests': 1000, 'window': 60}, # Per user per minute
    'anonymous': {'requests': 100, 'window': 60},      # Per IP per minute
    'expensive': {'requests': 10, 'window': 60},       # /search, /export
    'writes': {'requests': 50, 'window': 60},          # POST/PUT/DELETE
}

def check_all_limits(request) -> tuple[bool, str | None]:
    checks = [('global', get_global_key())]
    
    if request.method in ('POST', 'PUT', 'DELETE'):
        checks.append(('writes', get_user_key(request)))
    
    if is_expensive_endpoint(request.path):
        checks.append(('expensive', get_user_key(request)))
    
    if request.user:
        checks.append(('authenticated', get_user_key(request)))
    else:
        checks.append(('anonymous', request.remote_addr))
    
    for limit_name, key in checks:
        if not check_limit(limit_name, key):
            return False, limit_name
    
    return True, None

Communicate Clearly

When you reject a request, tell the client what happened:

from fastapi.responses import JSONResponse

def rate_limit_response(limit_name: str, retry_after: int):
    return JSONResponse(
        status_code=429,
        content={
            "error": "rate_limit_exceeded",
            "message": f"Too many requests. Limit: {limit_name}",
            "retry_after": retry_after
        },
        headers={
            "Retry-After": str(retry_after),
            "X-RateLimit-Limit": str(RATE_LIMITS[limit_name]['requests']),
            "X-RateLimit-Remaining": "0",
            "X-RateLimit-Reset": str(int(time.time()) + retry_after)
        }
    )

Always include:

  • Retry-After header (seconds until they can retry)
  • X-RateLimit-* headers so clients can self-regulate
  • A human-readable message

Graceful Degradation

Instead of hard rejection, consider degraded service:

def handle_request(request):
    limit_status = check_limits(request)
    
    if limit_status == 'ok':
        return full_response(request)
    elif limit_status == 'soft_limit':
        # Serve from cache, skip expensive operations
        return cached_response(request)
    elif limit_status == 'hard_limit':
        return rate_limit_response()

Levels:

  1. Under limit: Full service
  2. Soft limit: Cached/simplified responses
  3. Hard limit: 429 rejection

This keeps users functional even when they’re hitting limits.
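The `check_limits` call above is pseudocode; the missing piece is where the soft limit sits. One simple scheme (the 80% threshold is an assumed knob, not from the text) classifies current usage against the limit:

```python
def classify_usage(count: int, limit: int, soft_fraction: float = 0.8) -> str:
    """Map the request count in the current window to a service level."""
    if count <= limit * soft_fraction:
        return 'ok'          # full service
    if count <= limit:
        return 'soft_limit'  # degraded: cached/simplified responses
    return 'hard_limit'      # 429 rejection
```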

Distributed Rate Limiting

Single-server counters don’t work when you have multiple instances:

# Redis-based distributed limiter
class DistributedRateLimiter:
    def __init__(self, redis_client):
        self.redis = redis_client
    
    def check(self, key: str, limit: int, window: int) -> dict:
        lua_script = """
        local key = KEYS[1]
        local limit = tonumber(ARGV[1])
        local window = tonumber(ARGV[2])
        
        local count = redis.call('INCR', key)
        if count == 1 then
            redis.call('EXPIRE', key, window)
        end
        
        local ttl = redis.call('TTL', key)
        
        return {count, ttl, limit - count}
        """
        
        now = int(time.time())
        window_key = f"{key}:{now // window}"
        
        count, ttl, remaining = self.redis.eval(
            lua_script, 1, window_key, limit, window
        )
        
        return {
            'allowed': count <= limit,
            'remaining': max(0, remaining),
            'reset': ttl
        }

Why a Lua script? Atomicity. If INCR and EXPIRE run as separate commands, a client can INCR, then crash before the EXPIRE, leaving a counter key that never resets. The script runs as a single atomic operation on the Redis server.

Client-Side Rate Limiting

Don’t just limit—help clients avoid limits:

import asyncio

# SDK that self-limits
class APIClient:
    def __init__(self):
        # 100 requests/minute = ~1.67 tokens/second
        self.bucket = TokenBucket(capacity=10, refill_rate=100 / 60)
    
    async def request(self, endpoint, **kwargs):
        # Wait for token availability
        while not self.bucket.consume():
            await asyncio.sleep(0.1)
        
        response = await self._do_request(endpoint, **kwargs)
        
        # Update bucket based on headers
        if 'X-RateLimit-Remaining' in response.headers:
            remaining = int(response.headers['X-RateLimit-Remaining'])
            self.bucket.tokens = min(self.bucket.capacity, remaining)
        
        return response

Best practice: If you have an official SDK, build rate limiting into it. Users who use your SDK shouldn’t hit 429s.

The Human Element

Rate limits protect systems. But systems serve humans. When setting limits:

  1. Start generous, tighten if needed. It’s easier to reduce limits than to apologize for being too strict.

  2. Have an escape hatch. Enterprise customers, partners, and internal services often need higher limits. Build the mechanism before you need it.

  3. Monitor before limiting. Understand your actual traffic patterns before setting arbitrary numbers.

  4. Limit by resource, not just requests. A request that returns 10KB is different from one that returns 10MB. Consider payload-aware limiting.
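Point 4 folds neatly into the token bucket: charge a cost proportional to the work instead of a flat token per request. A minimal sketch, with illustrative constants:

```python
def request_cost(response_bytes: int, base_cost: int = 1,
                 bytes_per_token: int = 100_000) -> int:
    """Flat base cost plus one extra token per 100 KB of payload."""
    return base_cost + response_bytes // bytes_per_token

# With the TokenBucket from earlier, charge the computed cost:
#   bucket.consume(tokens=request_cost(len(body)))
```

Under these numbers a 10 KB response costs 1 token while a 10 MB one costs 101, so heavy consumers throttle themselves proportionally.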

Rate limiting done well is invisible. Users never hit it because the limits match real usage. That’s the goal.