Traditional feature flags are straightforward: flip a boolean, show a button. AI features are messier. The output varies. Costs scale non-linearly. User expectations are unclear. And when it breaks, it doesn’t throw a clean error—it confidently gives wrong answers.
Here’s how to think about feature flags when the feature itself is probabilistic.
The Problem With Standard Rollouts
When you ship a new checkout button, you can test it. Click, observe, done. If 5% of users get the new button and it breaks, you know immediately.
When you ship an AI summarization feature, “works” is subjective. The summary might be accurate but verbose. Or concise but missing key points. Or perfect for technical users but confusing for others. Your monitoring doesn’t catch it because there’s no exception—just a user who quietly stops using the feature.
Standard percentage rollouts miss these failure modes entirely.
Four Layers of AI Feature Flags
Layer 1: Access Gates
The basic on/off. Who can see the feature at all?
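A minimal sketch of an access gate, assuming a simple dictionary stands in for whatever flag store you use (the flag name `ai_summarize.enabled` is illustrative):

```python
# Stand-in flag store; a real system would read from LaunchDarkly, Unleash, etc.
flags = {"ai_summarize.enabled": True}

def ai_summarize_enabled(user_id: str) -> bool:
    """Gate: can this user see the AI summarization feature at all?"""
    return flags.get("ai_summarize.enabled", False)
```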
Nothing novel here. But it’s not enough.
Layer 2: Model Selection
Different models have different costs, latencies, and quality profiles. Flag the model, not just the feature.
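One way to sketch this: resolve the model per user segment from a flag, with a default fallback. The segment names and model identifiers below are assumptions, not a real configuration:

```python
# Flagged model selection: the model is configuration, not code.
MODEL_FLAGS = {
    "free": "claude-haiku",      # cheap, fast
    "pro": "claude-sonnet",      # higher quality, higher cost
    "default": "claude-haiku",   # fallback for unknown segments
}

def select_model(segment: str) -> str:
    """Pick the backing model for a user segment via the flag store."""
    return MODEL_FLAGS.get(segment, MODEL_FLAGS["default"])
```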
This lets you:
- Test expensive models on small segments
- Fall back gracefully during outages
- A/B test model quality without code changes
Layer 3: Behavior Tuning
AI features have knobs beyond just “which model.” Temperature, max tokens, system prompts, retry strategies—all of these can be flagged.
Why flag these? Because optimal settings vary by load, user segment, and time. During peak traffic, you might reduce max_tokens to cut costs. During off-hours, you might bump temperature for users who opted into experimental mode.
Layer 4: Guardrails
This is where AI feature flags diverge most from traditional ones. You need flags that control safety and quality bounds.
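A minimal guardrail check, assuming flag-controlled length bounds and a blocklist (the specific bounds and terms are illustrative):

```python
# Flag-controlled output guardrails: reject summaries outside sane bounds.
GUARDRAIL_FLAGS = {
    "summary.min_words": 20,
    "summary.max_words": 400,
    "summary.blocked_terms": ["as an ai language model"],
}

def passes_guardrails(summary: str) -> bool:
    """Return False for outputs that should be filtered before the user sees them."""
    n_words = len(summary.split())
    if not (GUARDRAIL_FLAGS["summary.min_words"] <= n_words <= GUARDRAIL_FLAGS["summary.max_words"]):
        return False
    lowered = summary.lower()
    return not any(term in lowered for term in GUARDRAIL_FLAGS["summary.blocked_terms"])
```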
These guardrails prevent the weird edge cases that erode trust. A summary that’s inexplicably 50 words or 5,000 words? Filtered out. Content that somehow got past the model’s training? Caught.
Cost Circuit Breakers
AI features have variable costs. A bad prompt, an adversarial user, or a sudden spike in usage can blow through your budget in hours.
Feature flags should include cost controls:
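A sketch of a per-user spend breaker; the cap and in-memory tracking are illustrative assumptions (a real implementation would use a shared store with a time window):

```python
from collections import defaultdict

SPEND_LIMIT_USD = 10.0        # flag-controlled cap per user per window
_spend = defaultdict(float)   # in-memory stand-in for a shared counter

def record_and_check(user_id: str, cost_usd: float) -> bool:
    """Record spend; return False (trip the breaker) once the cap is exceeded."""
    _spend[user_id] += cost_usd
    return _spend[user_id] <= SPEND_LIMIT_USD
```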
This saved us last month when a customer accidentally triggered summarization on a 500-page document in a loop. The circuit breaker killed it after $12 instead of $1,200.
Gradual Rollouts Are Different
With traditional features, you roll out to 1%, then 5%, then 20%, then 100%. Linear progression.
With AI features, consider rolling out by:
Complexity: Start with simple inputs, then allow complex ones
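A complexity gate might look like this, using input length as a rough proxy for complexity (the 2,000-word cap is an illustrative flag value to raise as confidence grows):

```python
MAX_INPUT_WORDS = 2_000  # flag-controlled; start low, raise over the rollout

def input_allowed(text: str) -> bool:
    """Early rollout: only accept short documents; complex inputs come later."""
    return len(text.split()) <= MAX_INPUT_WORDS
```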
Reversibility: Start with features where errors are recoverable
- Good first rollout: AI-suggested tags (user can remove them)
- Bad first rollout: AI-written emails (user might not notice errors)
Stakes: Start where mistakes are cheap
- Good: Summarizing internal docs
- Bad: Summarizing legal contracts
Monitoring Differently
Traditional feature flags: Did the feature load? Did users click it?
AI feature flags: Did the output make sense? Did users trust it?
Track these:
- Edit rate: If users modify AI output, it’s not meeting expectations
- Abandonment rate: If users request AI help then ignore the result
- Feedback ratio: Thumbs up vs down (if you have UI for it)
- Cost per successful interaction: Not just total cost, but cost per output users actually used
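The metrics above can be sketched as per-variant counters, so outcomes are comparable across flag values (event names and structure here are assumptions):

```python
from collections import Counter

events = Counter()  # stand-in for your metrics pipeline

def record(variant: str, outcome: str) -> None:
    """Tag every outcome event with the flag variant that produced it."""
    events[(variant, outcome)] += 1

def edit_rate(variant: str) -> float:
    """Fraction of shown outputs the user modified, for one flag variant."""
    shown = events[(variant, "shown")]
    return events[(variant, "edited")] / shown if shown else 0.0
```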
Now you can answer: “When we switched from Sonnet to Haiku for free users, did acceptance rate drop?” And: “Is the new system prompt actually better, or just cheaper?”
The Kill Switch
Every AI feature needs an instant kill switch. Not “reduced to 0%”—completely off.
Document where this lives. Practice using it. When the model provider has an outage at 3 AM, you want to flip one switch, not debug why your feature is hanging.
Summary
AI features need more flags than traditional features:
- Access gates: Who sees it
- Model selection: What powers it
- Behavior tuning: How it behaves
- Guardrails: What it’s allowed to output
- Cost controls: What it’s allowed to spend
Roll out based on complexity and stakes, not just user percentage. Monitor outcomes, not just usage. And always have a kill switch within reach.
The goal isn’t to eliminate risk—AI features are inherently less predictable than traditional code. The goal is to contain the blast radius when something goes wrong, and give yourself knobs to tune the tradeoffs in real time.