Traditional feature flags are straightforward: flip a boolean, show a button. AI features are messier. The output varies. Costs scale non-linearly. User expectations are unclear. And when it breaks, it doesn’t throw a clean error—it confidently gives wrong answers.
Here’s how to think about feature flags when the feature itself is probabilistic.
The Problem With Standard Rollouts
When you ship a new checkout button, you can test it. Click, observe, done. If 5% of users get the new button and it breaks, you know immediately.
When you ship an AI summarization feature, “works” is subjective. The summary might be accurate but verbose. Or concise but missing key points. Or perfect for technical users but confusing for others. Your monitoring doesn’t catch it because there’s no exception—just a user who quietly stops using the feature.
Standard percentage rollouts miss these failure modes entirely.
Four Layers of AI Feature Flags
Layer 1: Access Gates
The basic on/off. Who can see the feature at all?
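A minimal sketch of an access gate, assuming a simple dictionary stands in for whatever flag store you use (the flag name `ai_summarize.enabled` is illustrative):

```python
# Stand-in flag store; a real system would read from LaunchDarkly, Unleash, etc.
flags = {"ai_summarize.enabled": True}

def ai_summarize_enabled(user_id: str) -> bool:
    """Gate: can this user see the AI summarization feature at all?"""
    return flags.get("ai_summarize.enabled", False)
```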
Nothing novel here. But it’s not enough.
Layer 2: Model Selection
Different models have different costs, latencies, and quality profiles. Flag the model, not just the feature.
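One way to sketch this: resolve the model per user segment from a flag, with a default fallback. The segment names and model identifiers below are assumptions, not a real configuration:

```python
# Flagged model selection: the model is configuration, not code.
MODEL_FLAGS = {
    "free": "claude-haiku",      # cheap, fast
    "pro": "claude-sonnet",      # higher quality, higher cost
    "default": "claude-haiku",   # fallback for unknown segments
}

def select_model(segment: str) -> str:
    """Pick the backing model for a user segment via the flag store."""
    return MODEL_FLAGS.get(segment, MODEL_FLAGS["default"])
```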
This lets you:
- Test expensive models on small segments
- Fall back gracefully during outages
- A/B test model quality without code changes
Layer 3: Behavior Tuning
AI features have knobs beyond just “which model.” Temperature, max tokens, system prompts, retry strategies—all of these can be flagged.
Why flag these? Because optimal settings vary by load, user segment, and time. During peak traffic, you might reduce max_tokens to cut costs. During off-hours, you might bump temperature for users who opted into experimental mode.
Layer 4: Guardrails
This is where AI feature flags diverge most from traditional ones. You need flags that control safety and quality bounds.
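A minimal guardrail check, assuming flag-controlled length bounds and a blocklist (the specific bounds and terms are illustrative):

```python
# Flag-controlled output guardrails: reject summaries outside sane bounds.
GUARDRAIL_FLAGS = {
    "summary.min_words": 20,
    "summary.max_words": 400,
    "summary.blocked_terms": ["as an ai language model"],
}

def passes_guardrails(summary: str) -> bool:
    """Return False for outputs that should be filtered before the user sees them."""
    n_words = len(summary.split())
    if not (GUARDRAIL_FLAGS["summary.min_words"] <= n_words <= GUARDRAIL_FLAGS["summary.max_words"]):
        return False
    lowered = summary.lower()
    return not any(term in lowered for term in GUARDRAIL_FLAGS["summary.blocked_terms"])
```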
These guardrails prevent the weird edge cases that erode trust. A summary that’s inexplicably 50 words or 5,000 words? Filtered out. Content that somehow got past the model’s training? Caught.
Cost Circuit Breakers
AI features have variable costs. A bad prompt, an adversarial user, or a sudden spike in usage can blow through your budget in hours.
Feature flags should include cost controls:
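A sketch of a per-user spend breaker; the cap and in-memory tracking are illustrative assumptions (a real implementation would use a shared store with a time window):

```python
from collections import defaultdict

SPEND_LIMIT_USD = 10.0        # flag-controlled cap per user per window
_spend = defaultdict(float)   # in-memory stand-in for a shared counter

def record_and_check(user_id: str, cost_usd: float) -> bool:
    """Record spend; return False (trip the breaker) once the cap is exceeded."""
    _spend[user_id] += cost_usd
    return _spend[user_id] <= SPEND_LIMIT_USD
```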
This saved us last month when a customer accidentally triggered summarization on a 500-page document in a loop. The circuit breaker killed it after $12 instead of $1,200.
Gradual Rollouts Are Different
With traditional features, you roll out to 1%, then 5%, then 20%, then 100%. Linear progression.
With AI features, consider rolling out by:
Complexity: Start with simple inputs, then allow complex ones
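A complexity gate might look like this, using input length as a rough proxy for complexity (the 2,000-word cap is an illustrative flag value to raise as confidence grows):

```python
MAX_INPUT_WORDS = 2_000  # flag-controlled; start low, raise over the rollout

def input_allowed(text: str) -> bool:
    """Early rollout: only accept short documents; complex inputs come later."""
    return len(text.split()) <= MAX_INPUT_WORDS
```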
Reversibility: Start with features where errors are recoverable
- Good first rollout: AI-suggested tags (user can remove them)
- Bad first rollout: AI-written emails (user might not notice errors)
Stakes: Start where mistakes are cheap
- Good: Summarizing internal docs
- Bad: Summarizing legal contracts
Monitoring Differently
Traditional feature flags: Did the feature load? Did users click it?
AI feature flags: Did the output make sense? Did users trust it?
Track these:
- Edit rate: If users modify AI output, it’s not meeting expectations
- Abandonment rate: If users request AI help then ignore the result
- Feedback ratio: Thumbs up vs down (if you have UI for it)
- Cost per successful interaction: Not just total cost, but cost per output users actually used
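The metrics above can be sketched as per-variant counters, so outcomes are comparable across flag values (event names and structure here are assumptions):

```python
from collections import Counter

events = Counter()  # stand-in for your metrics pipeline

def record(variant: str, outcome: str) -> None:
    """Tag every outcome event with the flag variant that produced it."""
    events[(variant, outcome)] += 1

def edit_rate(variant: str) -> float:
    """Fraction of shown outputs the user modified, for one flag variant."""
    shown = events[(variant, "shown")]
    return events[(variant, "edited")] / shown if shown else 0.0
```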
Now you can answer: “When we switched from Sonnet to Haiku for free users, did acceptance rate drop?” And: “Is the new system prompt actually better, or just cheaper?”
The Kill Switch
Every AI feature needs an instant kill switch. Not “reduced to 0%”—completely off.
Document where this lives. Practice using it. When the model provider has an outage at 3 AM, you want to flip one switch, not debug why your feature is hanging.
Summary
AI features need more flags than traditional features:
- Access gates: Who sees it
- Model selection: What powers it
- Behavior tuning: How it behaves
- Guardrails: What it’s allowed to output
- Cost controls: What it’s allowed to spend
Roll out based on complexity and stakes, not just user percentage. Monitor outcomes, not just usage. And always have a kill switch within reach.
The goal isn’t to eliminate risk—AI features are inherently less predictable than traditional code. The goal is to contain the blast radius when something goes wrong, and give yourself knobs to tune the tradeoffs in real time.