Webhooks are the internet’s way of saying “hey, something happened.” Simple in concept, surprisingly tricky in practice.

The challenge: HTTP is unreliable, servers go down, networks flake out, and your webhook payload might arrive zero times, once, or five times. Building reliable webhook infrastructure means handling all of these gracefully.

The Sender Side

Retry with Exponential Backoff

First attempt fails? Try again. But not immediately.

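import json
import random
import time

import requests

RETRY_DELAYS = [60, 300, 900, 3600, 14400, 43200]  # seconds

def deliver_webhook(event, attempt=0):
    # Sign the same timestamp that goes out in the header, and send the same
    # compact JSON that sign_payload() serializes, so the receiver can verify
    # the signature against the raw request body.
    timestamp = int(time.time())
    body = json.dumps(event.payload, separators=(',', ':'))
    try:
        response = requests.post(
            event.webhook_url,
            data=body.encode(),
            timeout=30,
            headers={
                "Content-Type": "application/json",
                "X-Webhook-ID": event.id,
                "X-Webhook-Timestamp": str(timestamp),
                # event.secret is assumed to hold this endpoint's signing secret
                "X-Webhook-Signature": sign_payload(event.payload, event.secret, timestamp),
            },
        )

        if 200 <= response.status_code < 300:
            mark_delivered(event)
            return

        if response.status_code >= 500:
            # Server error, retry
            schedule_retry(event, attempt)
        else:
            # Client error (4xx), don't retry
            mark_failed(event, response.status_code)

    except (requests.Timeout, requests.ConnectionError):
        # Network failure, retry
        schedule_retry(event, attempt)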

def schedule_retry(event, attempt):
    if attempt >= len(RETRY_DELAYS):
        mark_failed(event, "max_retries")
        return
    
    delay = RETRY_DELAYS[attempt]
    # Add jitter to prevent thundering herd
    jitter = random.uniform(0, delay * 0.1)
    queue_at = time.time() + delay + jitter
    
    enqueue(event, attempt + 1, queue_at)

Key decisions:

  • Retry on 5xx and network errors
  • Don’t retry on 4xx (client’s problem)
  • Exponential delays: 1min, 5min, 15min, 1hr, 4hr, 12hr
  • Add jitter to spread out retries
  • Give up after the last retry, about 17 hours after the first attempt (the delays sum to 62,460 seconds)
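
To see how long this schedule keeps trying, the delays can simply be summed. The sketch below assumes the RETRY_DELAYS list and the 10% jitter from the sender code; retry_schedule is a hypothetical helper, and time spent waiting on the requests themselves (up to the 30-second timeout per attempt) is ignored.

```python
import random

RETRY_DELAYS = [60, 300, 900, 3600, 14400, 43200]  # seconds, as in the sender code

def retry_schedule(delays, jitter_fraction=0.1, rng=random):
    """Return the cumulative seconds after the first attempt at which each retry fires."""
    elapsed = 0.0
    schedule = []
    for delay in delays:
        # Same jitter rule as schedule_retry(): up to 10% extra per delay
        elapsed += delay + rng.uniform(0, delay * jitter_fraction)
        schedule.append(elapsed)
    return schedule

base_total = sum(RETRY_DELAYS)                 # total wait without jitter
worst_total = base_total * 1.1                 # upper bound with full jitter
print(f"between {base_total / 3600:.1f} and {worst_total / 3600:.1f} hours")
```

If a full day of retries is the goal, the list needs one more entry or longer final delays.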

Signature Verification

Sign your payloads so receivers can verify authenticity:

import hashlib
import hmac
import json

def sign_payload(payload: dict, secret: str, timestamp: int) -> str:
    message = f"{timestamp}.{json.dumps(payload, separators=(',', ':'))}"
    signature = hmac.new(
        secret.encode(),
        message.encode(),
        hashlib.sha256
    ).hexdigest()
    return f"v1={signature}"

Headers sent:

X-Webhook-Timestamp: 1706889600
X-Webhook-Signature: v1=5d41402abc4b2a76b9719d911017c592

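Include the timestamp in the signed message to prevent replay attacks: an attacker who captures a request cannot resend it later with a fresh timestamp, because changing the timestamp invalidates the signature.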

Idempotency Keys

Every webhook needs a unique ID that receivers can use for deduplication:

event = {
    "id": "evt_abc123xyz",  # Unique, never reused
    "type": "order.completed",
    "created_at": "2024-02-03T10:30:00Z",
    "data": {...}
}

This lets receivers deduplicate if they receive the same event multiple times.

Dead Letter Queue

After max retries, don’t just drop events:

def mark_failed(event, reason):
    # Move to dead letter queue for manual review
    dead_letter_queue.push({
        "event": event,
        "reason": reason,
        "failed_at": time.time(),
        "attempts": event.attempt_count
    })
    
    # Notify webhook owner
    if event.failure_notification_url:
        send_failure_notification(event)

Give customers visibility into failed deliveries and a way to replay them.
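
A replay path can be quite small. The sketch below is one possible shape, not a prescribed design: DeadLetterQueue is a hypothetical in-memory stand-in for a durable store, events are plain dicts rather than the event objects used above, and deliver is any callable that returns True on success.

```python
import time
from collections import deque

class DeadLetterQueue:
    """In-memory stand-in for a durable dead letter store (hypothetical)."""

    def __init__(self):
        self._entries = deque()

    def push(self, entry):
        self._entries.append(entry)

    def __len__(self):
        return len(self._entries)

    def replay(self, deliver, event_ids=None):
        """Re-deliver dead-lettered events and keep the ones that fail again.

        deliver(event) must return True on success. If event_ids is given,
        only those events are replayed and the rest stay queued.
        """
        replayed, remaining = [], deque()
        while self._entries:
            entry = self._entries.popleft()
            wanted = event_ids is None or entry["event"]["id"] in event_ids
            if wanted and deliver(entry["event"]):
                replayed.append(entry["event"]["id"])
            else:
                remaining.append(entry)
        self._entries = remaining
        return replayed

dlq = DeadLetterQueue()
for event_id in ("evt_1", "evt_2"):
    dlq.push({"event": {"id": event_id}, "reason": "max_retries",
              "failed_at": time.time(), "attempts": 6})

# Pretend only evt_1's endpoint has recovered
replayed = dlq.replay(lambda event: event["id"] == "evt_1")
```

Events that fail again stay in the queue, so a replay never loses anything.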

The Receiver Side

Verify Signatures

Always verify before processing:

import time

from fastapi import FastAPI, Request, HTTPException

app = FastAPI()

@app.post("/webhooks/stripe")
async def receive_webhook(request: Request):
    payload = await request.body()
    timestamp = request.headers.get("X-Webhook-Timestamp")
    signature = request.headers.get("X-Webhook-Signature")

    # Reject requests with missing or malformed headers up front
    try:
        sent_at = int(timestamp)
    except (TypeError, ValueError):
        raise HTTPException(status_code=401, detail="Missing or malformed timestamp")
    if not signature:
        raise HTTPException(status_code=401, detail="Missing signature")

    if not verify_signature(payload, timestamp, signature, WEBHOOK_SECRET):
        raise HTTPException(status_code=401, detail="Invalid signature")

    # Check timestamp to prevent replay attacks (5 min tolerance)
    if abs(time.time() - sent_at) > 300:
        raise HTTPException(status_code=401, detail="Timestamp outside tolerance")

    # Process webhook...
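
The endpoint above calls verify_signature without defining it. A minimal sketch follows, assuming the sender signs "<timestamp>.<raw body>" with HMAC-SHA256 and the v1= prefix as in the signing code earlier, and that the bytes it signs are exactly the bytes it sends. The secret and payload in the usage lines are made up for illustration.

```python
import hashlib
import hmac

def verify_signature(payload: bytes, timestamp: str, signature: str, secret: str) -> bool:
    """Recompute the v1 signature over "<timestamp>.<raw body>" and compare."""
    message = timestamp.encode() + b"." + payload
    expected = "v1=" + hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking, through timing, where the strings differ
    return hmac.compare_digest(expected.encode(), signature.encode())

# Usage with a made-up secret and body
body = b'{"id":"evt_abc123xyz","type":"order.completed"}'
good = "v1=" + hmac.new(b"s3cret", b"1706889600." + body, hashlib.sha256).hexdigest()
ok = verify_signature(body, "1706889600", good, "s3cret")
tampered = verify_signature(body + b" ", "1706889600", good, "s3cret")
```

Verifying against the raw body rather than re-serialized JSON matters: any difference in whitespace or key order changes the bytes and breaks the signature.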

Respond Quickly, Process Later

The sender is waiting. Don’t make them wait for your business logic:

@app.post("/webhooks/payments")
async def receive_payment_webhook(request: Request):
    # Verify signature (reading the request body is async, so await the helper)
    event = await verify_and_parse(request)
    
    # Store immediately
    await store_event(event)
    
    # Queue for processing
    await task_queue.enqueue("process_payment_event", event.id)
    
    # Respond fast
    return {"status": "received"}

Do the actual work asynchronously. The webhook endpoint should return in under 5 seconds, well inside the sender's timeout (30 seconds in the sender code above).

Idempotent Processing

You might receive the same event multiple times. Handle it:

async def process_payment_event(event_id: str):
    event = await get_event(event_id)
    
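    # Check and process inside one transaction so two workers handling the
    # same event cannot both pass the check
    async with db.transaction():
        if await is_processed(event.idempotency_key):
            logger.info(f"Duplicate event {event_id}, skipping")
            return

        # Do the work
        await handle_payment(event.data)

        # Mark as processed (same transaction)
        await mark_processed(event.idempotency_key)

The idempotency check and the work must be in the same transaction to prevent race conditions. A unique constraint on the idempotency key adds a backstop: the database rejects a second insert of the same key even if two workers both pass the check.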
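
Here is a self-contained sketch of that idea using SQLite, where a primary key on the idempotency key does the deduplication. The table names, the single-account schema, and process_once are assumptions made for the example, not part of the code above.

```python
import sqlite3

def process_once(conn, idempotency_key, amount):
    """Apply a payment exactly once; return False for a duplicate delivery."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            # The PRIMARY KEY makes a second insert of the same key fail
            conn.execute(
                "INSERT INTO processed_events (key) VALUES (?)", (idempotency_key,)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = 1", (amount,)
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (key TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 0)")
conn.commit()

first = process_once(conn, "evt_abc123xyz", 50)    # applied
second = process_once(conn, "evt_abc123xyz", 50)   # duplicate, skipped
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
```

Because the marker row and the balance update commit together, a crash between them cannot leave the event half-processed.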

Handle Out-of-Order Delivery

Events might arrive out of order. Your logic should handle this:

async def handle_order_event(event):
    order = await get_order(event.data.order_id)
    
    # Check event timestamp against last processed
    if event.created_at <= order.last_event_at:
        logger.info("Stale event, ignoring")
        return
    
    # Check state machine allows this transition
    if not can_transition(order.status, event.type):
        logger.warning(f"Invalid transition: {order.status} -> {event.type}")
        # Maybe queue for reprocessing later?
        return
    
    await apply_event(order, event)
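
The can_transition check can be a plain lookup table. The statuses and event types below are a hypothetical order lifecycle invented for illustration; only order.completed appears elsewhere in this post.

```python
# Hypothetical order lifecycle: current status -> event types allowed next
ALLOWED_TRANSITIONS = {
    "pending": {"order.paid", "order.cancelled"},
    "paid": {"order.shipped", "order.refunded"},
    "shipped": {"order.completed", "order.refunded"},
    "completed": {"order.refunded"},
    "cancelled": set(),
    "refunded": set(),
}

def can_transition(status: str, event_type: str) -> bool:
    """Return True if an event of event_type is valid for an order in status."""
    return event_type in ALLOWED_TRANSITIONS.get(status, set())

in_order = can_transition("pending", "order.paid")
skipped_ahead = can_transition("pending", "order.shipped")
```

An event that arrives early (order.shipped before order.paid) fails the check, which is the case where queueing it for later reprocessing is worth considering.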

Circuit Breaker for Downstream Calls

If processing involves calling other services, protect yourself:

from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
async def notify_inventory_service(order):
    response = await http_client.post(
        "https://inventory.internal/update",
        json=order.to_dict()
    )
    response.raise_for_status()

When the downstream service is down, fail fast instead of blocking all webhook processing.
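
To make the behavior concrete, here is a stripped-down breaker written from scratch. It is a sketch of the pattern, not the circuitbreaker library's implementation: SimpleCircuitBreaker and CircuitOpenError are hypothetical names, it is synchronous, and it is not thread-safe.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Recovery timeout elapsed: let one trial call through (half-open)
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count
        self.failures = 0
        self.opened_at = None
        return result

def flaky():
    raise RuntimeError("inventory service down")

now = [0.0]  # fake clock so the example does not sleep
breaker = SimpleCircuitBreaker(failure_threshold=2, recovery_timeout=10, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(flaky)
    except RuntimeError:
        pass

try:
    breaker.call(lambda: "ok")
    rejected = False
except CircuitOpenError:
    rejected = True  # the call never reached the downstream service
```

While the circuit is open, calls are rejected without touching the network, so a dead dependency costs microseconds per event instead of a full timeout.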

Monitoring and Alerting

Sender Metrics

Track these:

  • Delivery success rate
  • Retry rate by attempt number
  • Average delivery latency
  • Dead letter queue size
  • Endpoints consistently failing

from prometheus_client import Counter, Histogram

# Prometheus metrics
webhook_deliveries = Counter(
    'webhook_deliveries_total',
    'Total webhook delivery attempts',
    ['endpoint', 'status']
)

webhook_latency = Histogram(
    'webhook_delivery_seconds',
    'Webhook delivery latency',
    ['endpoint']
)

Alert when:

  • Success rate drops below 95%
  • Dead letter queue grows
  • Specific customer endpoint failing repeatedly

Receiver Metrics

Track:

  • Events received per minute
  • Processing success/failure rate
  • Processing latency
  • Duplicate event rate
  • Queue depth

Alert when:

  • Processing queue backs up
  • Failure rate spikes
  • Unusual duplicate rate (sender issue?)

Testing Webhooks

For Senders

def test_webhook_retry_on_500():
    mock_endpoint = create_mock_endpoint()
    mock_endpoint.respond_with(500, times=2)
    mock_endpoint.respond_with(200, times=1)
    
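    # deliver_webhook() reads the URL from the event; retries are queued, so
    # drain them with a hypothetical run_pending_retries() test helper
    test_event.webhook_url = mock_endpoint.url
    deliver_webhook(test_event)
    run_pending_retries()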
    
    assert mock_endpoint.call_count == 3
    assert test_event.status == "delivered"

def test_webhook_gives_up_after_max_retries():
    mock_endpoint = create_mock_endpoint()
    mock_endpoint.respond_with(500, times=100)  # Always fail
    
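    # deliver_webhook() reads the URL from the event; retries are queued, so
    # drain them with a hypothetical run_pending_retries() test helper
    test_event.webhook_url = mock_endpoint.url
    deliver_webhook(test_event)
    run_pending_retries()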
    
    assert test_event.status == "failed"
    assert dead_letter_queue.contains(test_event)

For Receivers

def test_duplicate_event_ignored():
    event = create_test_event()
    
    # First delivery
    response1 = client.post("/webhooks", json=event)
    assert response1.status_code == 200
    
    # Duplicate delivery
    response2 = client.post("/webhooks", json=event)
    assert response2.status_code == 200
    
    # Only processed once
    assert get_processing_count(event.id) == 1

def test_invalid_signature_rejected():
    event = create_test_event()
    
    response = client.post(
        "/webhooks",
        json=event,
        headers={"X-Webhook-Signature": "invalid"}
    )
    
    assert response.status_code == 401

Common Patterns

Event Fan-Out

One event, multiple destinations:

async def publish_event(event):
    subscribers = await get_subscribers(event.type)
    
    for subscriber in subscribers:
        await enqueue_delivery(event, subscriber.webhook_url)

Each subscriber gets independent retry logic.

Webhook Versioning

When your payload format changes:

{
  "api_version": "2024-02-01",
  "type": "order.completed",
  "data": {...}
}

Let customers choose their API version. Maintain backward compatibility or give migration windows.
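
One way to support pinned versions is a renderer per version, selected by the subscriber's setting. This is a sketch under invented assumptions: the 2023-06-01 version, the field rename, and build_payload are all hypothetical.

```python
# Hypothetical history: version 2024-02-01 renamed "total" to "amount_total"
# and added "currency"
def render_v2023_06_01(order):
    return {"order_id": order["id"], "total": order["amount"]}

def render_v2024_02_01(order):
    return {
        "order_id": order["id"],
        "amount_total": order["amount"],
        "currency": order["currency"],
    }

RENDERERS = {
    "2023-06-01": render_v2023_06_01,
    "2024-02-01": render_v2024_02_01,
}
LATEST_VERSION = "2024-02-01"

def build_payload(event_type, order, api_version=None):
    """Render an event in the subscriber's pinned version (latest if unset)."""
    version = api_version or LATEST_VERSION
    if version not in RENDERERS:
        raise ValueError(f"unsupported api_version: {version}")
    return {"api_version": version, "type": event_type, "data": RENDERERS[version](order)}

order = {"id": "ord_1", "amount": 500, "currency": "usd"}
old_payload = build_payload("order.completed", order, api_version="2023-06-01")
new_payload = build_payload("order.completed", order)
```

Retiring a version then means deleting one renderer once its migration window closes.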

Batch Webhooks

For high-volume events, batch them:

{
  "batch_id": "batch_xyz",
  "events": [
    {"id": "evt_1", "type": "click", "data": {...}},
    {"id": "evt_2", "type": "click", "data": {...}},
    ...
  ]
}

Reduces HTTP overhead but adds complexity in processing.
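
Most of that complexity is partial failure: some events in a batch succeed and others do not, and the whole batch may be redelivered. A minimal receiver-side sketch, assuming the batch shape shown above; process_batch is a hypothetical helper and seen_ids is an in-memory stand-in for a durable store of processed event IDs.

```python
def process_batch(batch, handle, seen_ids):
    """Process each event in a batch at most once; return the IDs that failed.

    handle(event) is expected to raise on failure.
    """
    failed = []
    for event in batch["events"]:
        if event["id"] in seen_ids:
            continue  # duplicate from a redelivered batch
        try:
            handle(event)
        except Exception:
            failed.append(event["id"])  # left unmarked so a retry picks it up
        else:
            seen_ids.add(event["id"])
    return failed

batch = {
    "batch_id": "batch_xyz",
    "events": [
        {"id": "evt_1", "type": "click"},
        {"id": "evt_2", "type": "click"},
    ],
}
seen, handled = set(), []

def handle_failing_on_evt_2(event):
    if event["id"] == "evt_2":
        raise RuntimeError("transient failure")
    handled.append(event["id"])

first_pass = process_batch(batch, handle_failing_on_evt_2, seen)
# The same batch is redelivered; only evt_2 is processed this time
second_pass = process_batch(batch, lambda event: handled.append(event["id"]), seen)
```

Deduplicating per event rather than per batch is what makes redelivering the whole batch safe.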


Webhooks look simple: POST some JSON to a URL. The reliability comes from handling everything that can go wrong: network failures, server errors, duplicates, out-of-order delivery, replay attacks. Build these patterns in from the start, and your webhook system will be one less thing keeping you up at night.