Your payment service is down. Every request to it times out after 30 seconds. You have 100 requests per second hitting that endpoint. Do the math: within 30 seconds, you’ve got 3,000 threads waiting on a dead service, and your entire application is choking.

This is where circuit breakers earn their keep.
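The arithmetic in the opening scenario is Little's law: requests in flight equal the arrival rate times how long each request is held. A quick sketch (the function name is just illustrative) shows 3,000 requests parked at any moment once timeouts start firing, and 6,000 sent into the dead service every minute:

```typescript
// Little's law: concurrent in-flight requests = arrival rate x time each request is held.
function inFlightRequests(requestsPerSecond: number, holdSeconds: number): number {
  return requestsPerSecond * holdSeconds;
}

// 100 requests/second, each stuck for the full 30-second timeout.
const waitingAtOnce = inFlightRequests(100, 30);

// Requests sent into the dead service over one minute.
const attemptedPerMinute = 100 * 60;
```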

The Problem: Cascading Failures

In distributed systems, a single failing dependency can take down everything. Without protection, your system will:

  1. Keep sending requests to the dead service
  2. Exhaust connection pools waiting for timeouts
  3. Queue up requests behind the slow ones
  4. Eventually crash under the load

The naive approach—retrying harder—makes things worse. You’re DDoSing your own failing service while burning through your resources.

The Circuit Breaker Pattern

Borrowed from electrical engineering, the circuit breaker pattern detects failures and prevents the system from repeatedly trying operations that are likely to fail.

[State diagram: CLOSED (requests pass through) -> OPEN (requests rejected immediately) when the failure threshold is exceeded; OPEN -> HALF-OPEN (limited requests allowed) after the timeout expires; HALF-OPEN -> CLOSED on success; HALF-OPEN -> OPEN on failure.]
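The three states and their transitions can be written down as a small lookup table. This is only a sketch of the state machine; the event names are made up for illustration and are not part of any library:

```typescript
type BreakerState = 'closed' | 'open' | 'half_open';
type BreakerEvent =
  | 'failure_threshold_reached'
  | 'timeout_elapsed'
  | 'success_threshold_reached'
  | 'probe_failed';

// Transition table: any (state, event) pair not listed leaves the state unchanged.
const transitions: Record<BreakerState, Partial<Record<BreakerEvent, BreakerState>>> = {
  closed: { failure_threshold_reached: 'open' },
  open: { timeout_elapsed: 'half_open' },
  half_open: { success_threshold_reached: 'closed', probe_failed: 'open' },
};

function nextState(state: BreakerState, event: BreakerEvent): BreakerState {
  return transitions[state][event] ?? state;
}
```

Keeping the transitions in one table makes it easy to see that OPEN can only be left through the timeout, never directly by a success.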

Basic Implementation

Here’s a TypeScript circuit breaker that captures the essential behavior:

enum CircuitState {
  CLOSED = 'closed',
  OPEN = 'open',
  HALF_OPEN = 'half_open'
}

interface CircuitBreakerOptions {
  failureThreshold: number;      // Failures before opening
  successThreshold: number;      // Successes to close from half-open
  timeout: number;               // Milliseconds before trying half-open
}

class CircuitBreaker {
  private state: CircuitState = CircuitState.CLOSED;
  private failures: number = 0;
  private successes: number = 0;
  private lastFailureTime: number = 0;
  
  constructor(private options: CircuitBreakerOptions) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === CircuitState.OPEN) {
      if (Date.now() - this.lastFailureTime > this.options.timeout) {
        this.state = CircuitState.HALF_OPEN;
        this.successes = 0;
      } else {
        throw new Error('Circuit breaker is open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.failures = 0;
    
    if (this.state === CircuitState.HALF_OPEN) {
      this.successes++;
      if (this.successes >= this.options.successThreshold) {
        this.state = CircuitState.CLOSED;
      }
    }
  }

  private onFailure(): void {
    this.failures++;
    this.lastFailureTime = Date.now();

    if (this.state === CircuitState.HALF_OPEN || 
        this.failures >= this.options.failureThreshold) {
      this.state = CircuitState.OPEN;
    }
  }
}
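One thing the class above does not do is limit traffic while HALF_OPEN: every caller is let through until the first failure. Here is a minimal sketch of one way to admit a single probe at a time. The class name `ProbingBreaker`, the `tryAcquire`/`recordSuccess`/`recordFailure` split, and the injectable clock are all assumptions made for illustration, and for brevity it closes after one successful probe instead of honoring `successThreshold`:

```typescript
type ProbeState = 'closed' | 'open' | 'half_open';

class ProbingBreaker {
  private state: ProbeState = 'closed';
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;

  constructor(
    private failureThreshold: number,
    private timeoutMs: number,
    private now: () => number = () => Date.now(), // injectable clock (assumption, for testing)
  ) {}

  // Returns true if the caller may proceed; false means fail fast.
  tryAcquire(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt > this.timeoutMs) {
      this.state = 'half_open';
      this.probeInFlight = false;
    }
    if (this.state === 'closed') return true;
    if (this.state === 'half_open' && !this.probeInFlight) {
      this.probeInFlight = true; // only one probe at a time
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.probeInFlight = false;
    if (this.state === 'half_open') this.state = 'closed';
  }

  recordFailure(): void {
    this.failures++;
    this.probeInFlight = false;
    if (this.state === 'half_open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  // Async wrapper with the same shape as execute() above.
  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (!this.tryAcquire()) throw new Error('Circuit breaker is open');
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }
}
```

Separating the synchronous bookkeeping from the async wrapper also makes the state machine easy to unit test with a fake clock.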

Real-World Configuration

The defaults matter. Here’s a reasonable starting point for a payment service:

const paymentCircuit = new CircuitBreaker({
  failureThreshold: 5,      // Trip after 5 consecutive failures
  successThreshold: 3,      // Need 3 successes to close
  timeout: 30000            // Try again after 30 seconds
});

// Usage
async function processPayment(order: Order): Promise<PaymentResult> {
  return paymentCircuit.execute(async () => {
    return await paymentService.charge(order);
  });
}

For different service types, adjust accordingly:

Service Type   Failure Threshold   Timeout   Rationale
Payment        5                   30s       High stakes, needs quick recovery
Email          10                  60s       Can tolerate delays
Cache          3                   10s       Should be fast or not at all
External API   5                   120s      Third parties need longer recovery
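Those per-service values can live in one typed map, so every breaker is built from the same source of truth. This is a sketch: the `BreakerSettings` shape and the key names are assumptions, and since the table gives no success threshold, none is included here:

```typescript
interface BreakerSettings {
  failureThreshold: number; // failures before opening
  timeoutMs: number;        // how long to stay open before probing again
}

type ServiceType = 'payment' | 'email' | 'cache' | 'externalApi';

// Values taken directly from the table above; timeouts converted to milliseconds.
const breakerSettings: Record<ServiceType, BreakerSettings> = {
  payment: { failureThreshold: 5, timeoutMs: 30000 },
  email: { failureThreshold: 10, timeoutMs: 60000 },
  cache: { failureThreshold: 3, timeoutMs: 10000 },
  externalApi: { failureThreshold: 5, timeoutMs: 120000 },
};
```

Using `Record<ServiceType, ...>` means the compiler complains if a new service type is added without settings.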

Handling Open Circuits Gracefully

When the circuit opens, you have options beyond just failing:

async function getProductRecommendations(userId: string): Promise<Product[]> {
  try {
    return await recommendationCircuit.execute(async () => {
      return await recommendationService.getForUser(userId);
    });
  } catch (error) {
    if (error instanceof Error && error.message === 'Circuit breaker is open') {
      // Fallback: return cached or generic recommendations
      return await cache.get(`recommendations:${userId}`) 
        ?? await getDefaultRecommendations();
    }
    throw error;
  }
}
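Matching on the error message string works, but it breaks as soon as someone rewords the message. A sturdier option is a dedicated error class that the breaker throws when it rejects a call. The names `CircuitOpenError` and `withFallback` below are hypothetical, and the sketch assumes the breaker is changed to throw this class:

```typescript
// A dedicated error type, so callers can tell "rejected by the breaker"
// apart from errors thrown by the wrapped call itself.
class CircuitOpenError extends Error {
  constructor(public readonly circuitName: string) {
    super(`Circuit "${circuitName}" is open`);
    this.name = 'CircuitOpenError';
    // Keeps instanceof working if the code is compiled down to ES5.
    Object.setPrototypeOf(this, CircuitOpenError.prototype);
  }
}

// Runs `primary`; uses `fallback` only when the breaker rejected the call.
async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: () => Promise<T>,
): Promise<T> {
  try {
    return await primary();
  } catch (error: unknown) {
    if (error instanceof CircuitOpenError) {
      return fallback();
    }
    throw error; // real failures still propagate
  }
}
```

Errors from the wrapped call itself are deliberately not swallowed, so a genuine bug does not get hidden behind cached data.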

Common fallback strategies:

  • Cache: Return stale but valid data
  • Default: Return safe default values
  • Degrade: Offer reduced functionality
  • Queue: Accept the request for later processing
  • Redirect: Route to an alternate service
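Several of these strategies can be chained: try the cache, then fall back to a default. A rough sketch of such a chain, where `firstAvailable` and the `Strategy` type are illustrative names and not from any library:

```typescript
type Strategy<T> = () => Promise<T | undefined>;

// Tries each strategy in order and returns the first defined result.
// A strategy that throws or resolves to undefined is skipped.
async function firstAvailable<T>(strategies: Strategy<T>[], lastResort: T): Promise<T> {
  for (const strategy of strategies) {
    try {
      const value = await strategy();
      if (value !== undefined) return value;
    } catch {
      // This strategy is unavailable; move on to the next one.
    }
  }
  return lastResort;
}
```

The `lastResort` value is required, so the chain always produces something even when every strategy fails.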

Advanced: Sliding Window Failure Detection

Simple consecutive failure counts can be noisy. A sliding window gives better signal:

class SlidingWindowCircuitBreaker {
  private requests: { timestamp: number; success: boolean }[] = [];
  private windowMs: number = 60000; // 1 minute window
  
  private getFailureRate(): number {
    const now = Date.now();
    this.requests = this.requests.filter(r => now - r.timestamp < this.windowMs);
    
    if (this.requests.length < 10) return 0; // Need minimum sample
    
    const failures = this.requests.filter(r => !r.success).length;
    return failures / this.requests.length;
  }

  private shouldOpen(): boolean {
    return this.getFailureRate() > 0.5; // Open when more than 50% of requests in the window fail
  }
}

This approach is more resilient to transient failures and gives you percentage-based thresholds.
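The fragment above leaves out how outcomes get recorded. A fuller sketch of the same idea follows, with the window size, minimum sample, and threshold passed in as parameters. The class name `FailureRateWindow`, the `record` method, and the injectable clock are assumptions added for illustration:

```typescript
class FailureRateWindow {
  private outcomes: { timestamp: number; success: boolean }[] = [];

  constructor(
    private windowMs: number,
    private minSamples: number,
    private maxFailureRate: number,
    private now: () => number = () => Date.now(), // injectable clock (assumption, for testing)
  ) {}

  // Call once per completed request.
  record(success: boolean): void {
    this.outcomes.push({ timestamp: this.now(), success });
  }

  failureRate(): number {
    // Drop outcomes that have aged out of the window.
    const cutoff = this.now() - this.windowMs;
    this.outcomes = this.outcomes.filter(o => o.timestamp > cutoff);

    // Too few samples to judge: report no failures.
    if (this.outcomes.length < this.minSamples) return 0;

    const failures = this.outcomes.filter(o => !o.success).length;
    return failures / this.outcomes.length;
  }

  shouldOpen(): boolean {
    return this.failureRate() > this.maxFailureRate;
  }
}
```

Filtering the array on every call is fine at modest request rates; a high-throughput service would more likely use fixed time buckets with counters.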

Monitoring and Observability

A circuit breaker you can’t observe is a circuit breaker that will surprise you:

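// MetricsClient and alerting stand in for whatever metrics and paging clients you use.
class ObservableCircuitBreaker extends CircuitBreaker {
  constructor(
    private name: string,
    options: CircuitBreakerOptions,
    private metrics: MetricsClient
  ) {
    super(options);
  }

  // Must be invoked on every state transition. The basic CircuitBreaker above
  // keeps its state private, so it needs a hook (for example a protected
  // setState method) that calls this.
  protected recordStateChange(from: CircuitState, to: CircuitState): void {
    this.metrics.increment('circuit_breaker.state_change', {
      name: this.name,
      from,
      to
    });
    
    if (to === CircuitState.OPEN) {
      this.metrics.increment('circuit_breaker.trips', { name: this.name });
      // Alert on-call
      alerting.notify(`Circuit ${this.name} opened`);
    }
  }
}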

Key metrics to track:

  • State changes: When and how often circuits trip
  • Failure rate: What’s triggering the breaker
  • Rejection rate: How many requests are being fast-failed
  • Recovery time: How long until services heal
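To make those metrics concrete, here is a small in-memory sketch that tracks trips, rejection rate, and recovery time. `BreakerStats` and its method names are hypothetical; in practice these numbers would be pushed to your metrics system, not held in memory:

```typescript
class BreakerStats {
  private allowed = 0;
  private rejected = 0;
  private trips = 0;
  private openedAt: number | null = null;
  private lastRecoveryMs: number | null = null;

  recordAllowed(): void { this.allowed++; }
  recordRejected(): void { this.rejected++; }

  // Call when the breaker enters OPEN (including a failed half-open probe).
  recordOpened(atMs: number): void {
    this.trips++;
    // Keep the first open time so recovery is measured from the original trip.
    if (this.openedAt === null) this.openedAt = atMs;
  }

  // Call when the breaker returns to CLOSED.
  recordClosed(atMs: number): void {
    if (this.openedAt !== null) {
      this.lastRecoveryMs = atMs - this.openedAt;
      this.openedAt = null;
    }
  }

  // Share of requests that were fast-failed by the breaker.
  rejectionRate(): number {
    const total = this.allowed + this.rejected;
    return total === 0 ? 0 : this.rejected / total;
  }

  snapshot(): { trips: number; rejectionRate: number; lastRecoveryMs: number | null } {
    return {
      trips: this.trips,
      rejectionRate: this.rejectionRate(),
      lastRecoveryMs: this.lastRecoveryMs,
    };
  }
}
```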

The Bulkhead Pattern: Circuit Breakers’ Partner

Circuit breakers prevent cascading failures. Bulkheads isolate them:

const pools = {
  payments: new ConnectionPool({ maxSize: 20 }),
  notifications: new ConnectionPool({ maxSize: 10 }),
  analytics: new ConnectionPool({ maxSize: 5 })
};

// Each service gets its own pool
// If analytics is slow, payments still has full capacity

Combine both patterns: circuit breakers per service, connection pools per service. Failures stay contained.
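The `ConnectionPool` above is left undefined, so here is a sketch of the same isolation idea using a plain concurrency cap instead. The `Bulkhead` class is an assumption for illustration; it rejects work outright when full, where a real pool might queue it:

```typescript
class Bulkhead {
  private active = 0;

  constructor(private maxConcurrent: number) {}

  // Runs fn if a slot is free; otherwise rejects immediately.
  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      throw new Error('Bulkhead is full');
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--; // always free the slot, even on failure
    }
  }

  inUse(): number {
    return this.active;
  }
}

// One bulkhead per dependency: a slow analytics service cannot eat payment capacity.
const bulkheads = {
  payments: new Bulkhead(20),
  notifications: new Bulkhead(10),
  analytics: new Bulkhead(5),
};
```

Wrapping the call as `bulkheads.payments.run(() => paymentCircuit.execute(...))` applies both patterns to the same request.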

Libraries Worth Using

Don’t build circuit breakers from scratch for production. These are battle-tested:

Node.js:

Python:

Go:

When Not to Use Circuit Breakers

They’re not universal solutions:

  • Idempotent batch jobs: Better to retry with backoff
  • User-initiated retries: Let humans decide when to retry
  • Fire-and-forget: If you don’t care about the result, don’t track failures
  • Single points of failure: A circuit breaker on your only database connection just makes failures more confusing
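For the batch-job case, retry with backoff usually means exponentially growing delays with some randomness mixed in. A minimal sketch of the delay calculation, where the function names and the "full jitter" choice are assumptions and not something the list above prescribes:

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// "Full jitter": pick a random delay between 0 and the exponential cap,
// so many workers retrying at once do not all hit the service together.
function jitteredDelayMs(
  attempt: number,
  baseMs: number,
  maxMs: number,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  return Math.floor(random() * backoffDelayMs(attempt, baseMs, maxMs));
}
```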

Key Takeaways

  1. Fail fast: Open circuits reject requests immediately instead of waiting for timeouts
  2. Fail gracefully: Always have a fallback strategy
  3. Tune for your service: Payment systems and analytics have different tolerance
  4. Observe everything: You can’t fix what you can’t see
  5. Combine with bulkheads: Isolation and protection work better together

Circuit breakers turn catastrophic failures into graceful degradation. Your users get “temporarily unavailable” instead of “entire site is down.” That’s the difference between an incident and a blip.


Your system will fail. The question is whether the failure spreads or stays contained.