“There are only two hard things in computer science: cache invalidation and naming things.” Phil Karlton’s quip endures because caching is genuinely difficult. Not the mechanics — putting data in Redis is easy. The hard part is knowing when that data is wrong.

Get caching right and your application feels instant. Get it wrong and users see stale data, inconsistent state, or worse — data that was never supposed to be visible to them.

The Cache-Aside Pattern (Lazy Loading)

The most common pattern: check cache first, fall back to database, populate cache on miss.

def get_user(user_id):
    # Check cache
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    
    # Cache miss - fetch from database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    
    # Populate cache
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    
    return user

Pros:

  • Simple to understand
  • Cache only contains data that’s actually requested
  • Database is source of truth

Cons:

  • First request is always slow (cache miss)
  • Stale data if database updated without cache invalidation
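
A third gap worth noting: if the requested row doesn't exist, every lookup misses the cache and falls through to the database. One common mitigation is to cache the absence too ("negative caching"), usually with a short TTL. A minimal sketch, with in-memory dicts standing in for Redis and the database and a made-up `NOT_FOUND` sentinel:

```python
import json

# In-memory stand-ins for Redis and the database (illustration only)
cache = {}
db = {1: {"id": 1, "name": "Ada"}}
NOT_FOUND = "__not_found__"  # made-up sentinel marking a cached miss

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return None if cached == NOT_FOUND else json.loads(cached)

    user = db.get(user_id)
    # Cache the absence too (with a short TTL in real Redis), so repeated
    # lookups for a missing id don't all fall through to the database
    cache[key] = NOT_FOUND if user is None else json.dumps(user)
    return user
```

The second lookup for a missing id is then served from the sentinel instead of the database.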

Write-Through Cache

Write to cache and database simultaneously:

def update_user(user_id, data):
    # Update database
    db.execute("UPDATE users SET name = %s WHERE id = %s", (data['name'], user_id))
    
    # Update cache
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    
    return user

Pros:

  • Cache always consistent with database
  • No stale reads after writes

Cons:

  • Write latency increased (two writes per operation)
  • Cache contains data that may never be read

Write-Behind (Write-Back) Cache

Write to cache immediately, sync to database asynchronously:

def update_user(user_id, data):
    # Write to cache
    user = {**get_user(user_id), **data}
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    
    # Queue database write
    queue.enqueue('sync_user_to_db', user_id, user)
    
    return user

Pros:

  • Fast writes (only cache latency)
  • Batch database writes possible

Cons:

  • Data loss risk if cache fails before sync
  • Complexity of managing async writes
  • Eventual consistency
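
The worker on the other end of that queue is where the batching pays off. As a sketch of the idea (the queue shape and helper name here are assumptions, not any particular library's API), the worker can collapse queued writes so each user hits the database once:

```python
from collections import OrderedDict

def drain_queue(pending_writes):
    """Collapse queued (user_id, user) writes: each user is written once, last value wins."""
    batch = OrderedDict()
    for user_id, user in pending_writes:
        batch[user_id] = user  # a later write for the same user replaces the earlier one
    # One UPDATE per user instead of one per queued write
    return list(batch.items())
```

Three queued updates touching two users become two database writes, which the worker can then flush in a single transaction.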

Cache Invalidation Strategies

Time-Based Expiration (TTL)

The simplest approach: data expires after a fixed time.

redis.setex("user:123", 3600, data)  # Expires in 1 hour

Good for:

  • Data that changes infrequently
  • Data where slight staleness is acceptable
  • When you can’t easily track all update paths

Bad for:

  • Frequently changing data (wasteful)
  • Data requiring immediate consistency
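
One cheap refinement even within plain TTL expiration: add jitter, so keys written at the same moment don't all expire at the same moment. This softens the thundering herd problem covered later. A sketch:

```python
import random

def jittered_ttl(base_seconds, spread=0.1):
    """Return base TTL +/- spread, so co-written keys expire at different times."""
    delta = base_seconds * spread
    return int(base_seconds + random.uniform(-delta, delta))

# redis.setex("user:123", jittered_ttl(3600), data)
```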

Event-Based Invalidation

Delete cache when underlying data changes:

def update_user(user_id, data):
    db.execute("UPDATE users SET name = %s WHERE id = %s", (data['name'], user_id))
    redis.delete(f"user:{user_id}")  # Invalidate cache

The challenge: knowing all the places data might be cached.

def update_user(user_id, data):
    db.execute("UPDATE users SET name = %s WHERE id = %s", (data['name'], user_id))
    
    # Invalidate all related caches
    redis.delete(f"user:{user_id}")
    redis.delete(f"user_profile:{user_id}")
    redis.delete(f"user_settings:{user_id}")
    
    # Invalidate aggregates that include this user
    user = db.query("SELECT team_id FROM users WHERE id = %s", user_id)
    redis.delete(f"team_members:{user['team_id']}")

This gets messy fast. Consider pub/sub for complex invalidation:

def update_user(user_id, data):
    db.execute("UPDATE users SET name = %s WHERE id = %s", (data['name'], user_id))
    redis.publish('user_updated', json.dumps({'user_id': user_id}))

# Subscriber handles all invalidations
def handle_user_updated(message):
    # redis-py delivers pub/sub payloads under the 'data' key
    user_id = json.loads(message['data'])['user_id']
    redis.delete(f"user:{user_id}")
    redis.delete(f"user_profile:{user_id}")
    # ... etc

Version-Based Keys

Include a version in cache keys that changes on update:

def get_user_cache_key(user_id):
    row = db.query("SELECT cache_version FROM users WHERE id = %s", user_id)
    return f"user:{user_id}:v{row['cache_version']}"

def update_user(user_id, data):
    db.execute("""
        UPDATE users 
        SET name = %s, cache_version = cache_version + 1 
        WHERE id = %s
    """, (data['name'], user_id))
    # Old cache key automatically becomes orphaned

Old entries naturally expire via TTL. No explicit invalidation needed. The trade-off: every read now pays for a version lookup, so keep that query cheap (or cache the version itself).
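
To make the mechanics concrete, here is a toy version with an in-memory dict standing in for the `users.cache_version` column; bumping the version reroutes all readers to a fresh key and strands the old one:

```python
versions = {123: 7}  # stand-in for the users.cache_version column

def get_user_cache_key(user_id):
    return f"user:{user_id}:v{versions[user_id]}"

def bump_version(user_id):
    # What "SET cache_version = cache_version + 1" does in SQL
    versions[user_id] += 1
```

After the bump, readers build `user:123:v8`; the `v7` entry is never read again and simply ages out via its TTL.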

The Thundering Herd Problem

When a popular cache key expires, hundreds of requests simultaneously hit the database:

Cache key expires, then:
    Request 1: cache miss, query database
    Request 2: cache miss, query database
    Request 3: cache miss, query database
Result: database overwhelmed

Solutions:

Lock while rebuilding:

def get_user(user_id):
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    
    # Acquire the lock atomically with an expiry; a separate SETNX + EXPIRE
    # can leave a lock with no TTL if the process dies between the two calls
    lock_key = f"lock:user:{user_id}"
    if redis.set(lock_key, "1", nx=True, ex=10):
        try:
            user = db.query("SELECT * FROM users WHERE id = %s", user_id)
            redis.setex(f"user:{user_id}", 3600, json.dumps(user))
            return user
        finally:
            redis.delete(lock_key)
    else:
        # Another request is rebuilding, wait and retry
        time.sleep(0.1)
        return get_user(user_id)

Stale-while-revalidate:

def get_user(user_id):
    cached = redis.get(f"user:{user_id}")
    if cached:
        data = json.loads(cached)
        if data['_expires_at'] < time.time():
            # Expired but usable - refresh async
            queue.enqueue('refresh_user_cache', user_id)
        return data
    
    # Hard miss - must fetch synchronously
    return fetch_and_cache_user(user_id)
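
For this to work, the write side has to embed the soft deadline in the payload and set the real Redis TTL well past it, so the stale copy stays servable. A sketch of the assumed `fetch_and_cache_user` helper, with in-memory dicts standing in for Redis and the database:

```python
import json
import time

# In-memory stand-ins for Redis and the database (illustration only)
cache = {}
db = {1: {"id": 1, "name": "Ada"}}

SOFT_TTL = 60    # seconds until the entry counts as stale
HARD_TTL = 3600  # the real Redis TTL: how long a stale copy stays servable

def fetch_and_cache_user(user_id):
    user = dict(db[user_id])
    # Embed the soft deadline the read path checks via _expires_at
    user["_expires_at"] = time.time() + SOFT_TTL
    # With real Redis: redis.setex(f"user:{user_id}", HARD_TTL, json.dumps(user))
    cache[f"user:{user_id}"] = json.dumps(user)
    return user
```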

Multi-Level Caching

Not all caches are equal:

Request → L1 (in-memory, local) → L2 (Redis, shared) → Database

from functools import lru_cache

@lru_cache(maxsize=1000)
def get_user_l1(user_id):
    # Check L2 (Redis)
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    
    # Miss - fetch from database
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    return user

Invalidation becomes harder — now you need to clear both levels:

def invalidate_user(user_id):
    get_user_l1.cache_clear()  # Clear entire L1 (crude but safe)
    redis.delete(f"user:{user_id}")  # Clear L2
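
Because `lru_cache` has no TTL, an L1 entry on a process that misses an invalidation can stay stale indefinitely. A common compromise is a TTL-bounded L1, so staleness is at least capped. A minimal, not-thread-safe sketch:

```python
import time

class TTLCache:
    """Tiny in-process cache: entries expire after ttl seconds."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self.store[key]  # lazily evict on read
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)
```

With, say, a 30-second L1 TTL, a missed invalidation means at most 30 seconds of per-process staleness rather than forever.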

Cache Key Design

Good keys are:

  • Predictable (can be reconstructed)
  • Unique (no collisions)
  • Debuggable (human-readable)
# Bad: opaque hash
key = hashlib.md5(f"{user_id}{query}".encode()).hexdigest()

# Good: structured and readable
key = f"user:{user_id}:profile"
key = f"search:products:category={cat}:page={page}"
key = f"api:v2:users:{user_id}:posts:limit=10"

Include version prefix for easy bulk invalidation:

CACHE_VERSION = "v3"

def cache_key(parts):
    return f"{CACHE_VERSION}:{':'.join(parts)}"

# Deploying with breaking cache format? Bump CACHE_VERSION
# Old keys naturally expire, no migration needed

Monitoring Cache Health

Track these metrics:

  • Hit rate: Percentage of requests served from cache (target: >90%)
  • Miss rate: Requests that hit the database
  • Eviction rate: Keys removed due to memory pressure
  • Latency: Cache read/write times
def get_user(user_id):
    start = time.time()
    cached = redis.get(f"user:{user_id}")
    
    if cached:
        metrics.increment('cache.hit', tags=['type:user'])
        metrics.timing('cache.latency', time.time() - start)
        return json.loads(cached)
    
    metrics.increment('cache.miss', tags=['type:user'])
    # ... fetch from database

Low hit rate? Either your TTLs are too short, your working set is too large for cache size, or your access patterns are too random for caching to help.
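
You don't have to instrument every call site to get a baseline: Redis itself counts hits and misses, exposed by `INFO stats` as `keyspace_hits` and `keyspace_misses`. Computing the rate from those counters is a one-liner (the commented call assumes redis-py):

```python
def hit_rate(keyspace_hits, keyspace_misses):
    """Hit rate as a fraction; returns 0.0 when there is no traffic yet."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else 0.0

# With redis-py:
# stats = redis.info("stats")
# rate = hit_rate(stats["keyspace_hits"], stats["keyspace_misses"])
```

Note these counters are server-wide and cumulative since startup, so they're a coarse signal; per-key-type hit rates still need application-side metrics like the ones above.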


Caching is a trade-off between speed and correctness. Perfect consistency means no caching. Perfect speed means permanent caching. Everything else is finding the right balance for your use case.

Start with cache-aside and TTL expiration. Add event-based invalidation for data that needs freshness. Use multi-level caching for high-traffic paths. Monitor constantly.

And remember: the best cache invalidation strategy is one simple enough that you can reason about it at 3am during an incident.