When you deploy an LLM-powered agent in production, traditional APM dashboards only tell half the story. You can track latency, error rates, and throughput — but what about what the agent actually did? Did it hallucinate? Did it spiral into an infinite retry loop? Did it spend $47 on tokens chasing a dead end?

Here’s how to build observability for autonomous agents that actually helps.

The Three Pillars of Agent Observability

Standard observability (logs, metrics, traces) still matters. But agents need three additional dimensions:

1. Token Economics

Track costs per operation, not just aggregate spend:

```python
from datetime import datetime, timezone


class TokenTracker:
    # Per-million-token rates. These are placeholder values — substitute
    # your provider's current pricing.
    PRICING = {
        "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    }

    def __init__(self):
        self.sessions = {}

    def calculate_cost(self, input_tokens: int, output_tokens: int, model: str) -> float:
        rates = self.PRICING.get(model, {"input": 0.0, "output": 0.0})
        return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

    def log_usage(self, session_id: str, operation: str,
                  input_tokens: int, output_tokens: int, model: str):
        cost = self.calculate_cost(input_tokens, output_tokens, model)
        self.sessions.setdefault(session_id, []).append({
            "operation": operation,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost,
            "timestamp": datetime.now(timezone.utc)
        })
```

Set alerts for:

  • Sessions exceeding cost thresholds
  • Unusually high input/output ratios (might indicate tool output explosion)
  • Rapid succession of API calls (retry storms)
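
The first of those checks can be a small function over the `TokenTracker` data. A minimal sketch — the `alert` callback and the threshold value are assumptions, not part of any standard API; wire `alert` to whatever notifier you use:

```python
COST_THRESHOLD_USD = 5.00  # hypothetical per-session budget; tune to your workload

def check_session_cost(tracker, session_id: str, alert):
    """Fire an alert once a session's cumulative spend crosses the threshold.

    `tracker` is any object with a `sessions` dict of usage records;
    `alert` is a hypothetical callback (e.g. a Slack or PagerDuty notifier).
    """
    total = sum(entry["cost"] for entry in tracker.sessions.get(session_id, []))
    if total > COST_THRESHOLD_USD:
        alert(f"Session {session_id} has spent ${total:.2f}")
    return total
```

Run this on a schedule or after each `log_usage` call; the same shape works for the input/output-ratio and retry-storm checks.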

2. Decision Auditing

Every agent decision should be traceable. When something goes wrong, you need to answer: why did it do that?

```python
import hashlib
import json
from datetime import datetime, timezone


def log_decision(context: dict, options: list, chosen: str, reasoning: str):
    """
    Record the agent's decision-making process for later analysis.
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # sort_keys keeps the hash stable regardless of key insertion order
        "context_hash": hashlib.sha256(
            json.dumps(context, sort_keys=True).encode()
        ).hexdigest(),
        "options_considered": options,
        "chosen_action": chosen,
        "reasoning_summary": reasoning[:500],  # truncate for storage
        "trace_id": get_current_trace_id()  # app-specific; wire to your tracing library
    }
```

Store these in a searchable format. When investigating incidents, you want to query: “Show me all decisions where the agent chose to retry after a 429 error.”
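
Even a flat JSONL file makes that query possible. A sketch, assuming each line is one record from `log_decision` and that retry decisions mention the status code in their reasoning summary — both assumptions about how you write the logs:

```python
import json

def find_decisions(path: str, action: str, reasoning_contains: str):
    """Scan a JSONL decision log for records matching an action and a
    substring of the reasoning summary."""
    matches = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if (rec["chosen_action"] == action
                    and reasoning_contains in rec["reasoning_summary"]):
                matches.append(rec)
    return matches

# e.g. find_decisions("decisions.jsonl", "retry", "429")
```

At scale you would push this into a log store with real indexing, but the query shape stays the same.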

3. Behavioral Drift Detection

Agents can drift subtly over time — especially if they learn from interactions or have access to evolving context. Track behavioral baselines:

  • Response length distribution: Sudden changes might indicate prompt injection or context pollution
  • Tool usage patterns: Is the agent using tools it shouldn’t? Avoiding tools it should?
  • Completion rates: Are tasks completing successfully at the same rate as last week?

```yaml
# Example alerting rules for behavioral drift
alerts:
  - name: response_length_drift
    condition: |
      abs(avg(response_tokens, 1h) - avg(response_tokens, 24h))
      / avg(response_tokens, 24h) > 0.3
    severity: warning

  - name: tool_usage_anomaly
    condition: |
      count(tool_calls WHERE tool = 'delete_file', 1h) >
      3 * avg(count(tool_calls WHERE tool = 'delete_file'), 7d)
    severity: critical
```

Structured Logging for Agents

JSON logs are table stakes. For agents, add these fields:

```json
{
  "timestamp": "2026-02-21T04:15:32Z",
  "session_id": "sess_abc123",
  "turn_number": 7,
  "event_type": "tool_call",
  "tool_name": "read_file",
  "tool_args": {"path": "/etc/config.yaml"},
  "tool_result_preview": "database:\n  host: localhost...",
  "latency_ms": 23,
  "trace_id": "trace_xyz789",
  "cost_usd": 0.0003,
  "model": "claude-sonnet-4-20250514",
  "context_tokens": 4521
}
```

The turn_number field is crucial — it lets you reconstruct conversation flow when debugging.
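
Reconstruction then reduces to a filter and a sort over parsed log events. A minimal sketch, assuming the JSON events above have already been loaded into dicts:

```python
def reconstruct_session(events, session_id: str):
    """Return one session's events in turn order, for replaying an incident
    as the agent experienced it."""
    session = [e for e in events if e["session_id"] == session_id]
    return sorted(session, key=lambda e: e["turn_number"])
```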

Real-Time Dashboards That Help

Your agent dashboard should answer these questions at a glance:

  1. Right now: How many agents are running? What are they doing?
  2. Cost: What’s today’s spend? Any runaway sessions?
  3. Health: Error rates, success rates, average task completion time
  4. Behavioral: Any drift from baseline? Unusual patterns?

A simple Grafana setup starts with a handful of PromQL queries:

```promql
# Active sessions
sum(agent_sessions_active)

# Cost rate (dollars per minute)
rate(agent_cost_usd_total[5m]) * 60

# Tool call distribution
topk(10, sum by (tool_name) (rate(agent_tool_calls_total[1h])))

# Error rate by type
sum by (error_type) (rate(agent_errors_total[5m]))
```

The Human-in-the-Loop Consideration

Some actions should require human approval. Build observability around the approval flow:

  • Time from request to approval (are humans bottlenecking agents?)
  • Approval vs. rejection rates (are agents asking for the right things?)
  • Approval bypass attempts (security concern)

Log approval decisions with the same rigor as agent decisions — they’re part of the system too.
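
Concretely, approval events can reuse the shape of the decision records. A sketch — the field names here are assumptions, not a standard schema:

```python
from datetime import datetime, timezone

def log_approval(request_id: str, requested_action: str, approver: str,
                 approved: bool, requested_at: datetime):
    """Record a human approval decision alongside the agent's decision logs."""
    decided_at = datetime.now(timezone.utc)
    return {
        "event_type": "human_approval",
        "request_id": request_id,
        "requested_action": requested_action,
        "approver": approver,
        "approved": approved,
        # request-to-decision latency: the "are humans bottlenecking agents?" metric
        "approval_latency_s": (decided_at - requested_at).total_seconds(),
        "timestamp": decided_at.isoformat(),
    }
```

Aggregating `approval_latency_s` and the `approved` ratio directly answers the first two bullets above.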

Practical Tips

Start with what breaks. Don’t instrument everything on day one. Add observability when incidents reveal blind spots.

Sample expensive logging. Full conversation history is valuable but expensive to store. Sample at 10% for detailed logs, aggregate everything.
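
One way to sample, sketched below: hash the session ID rather than rolling a die per event, so a sampled session keeps its *entire* history instead of a random 10% of disconnected events. The function name and rate are illustrative:

```python
import hashlib

def should_log_verbose(session_id: str, rate: float = 0.10) -> bool:
    """Deterministic per-session sampling: the same session always gets the
    same answer, so sampled sessions are logged end to end."""
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```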

Correlate with business outcomes. The best metric isn’t “tool calls per minute” — it’s “successful customer resolutions” or “accurate code completions.” Tie agent behavior to actual value.

Build kill switches. Observability isn’t just watching — it’s reacting. Build automated circuit breakers that pause agents when metrics go critical.
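
A minimal circuit-breaker sketch, with `get_metric` and `pause_agents` as hypothetical hooks into your metrics store and orchestrator:

```python
class CircuitBreaker:
    """Trip once when a watched metric exceeds its threshold, pausing agents.

    `get_metric` returns the current metric value; `pause_agents` is whatever
    your orchestrator exposes to halt running sessions.
    """

    def __init__(self, threshold: float, get_metric, pause_agents):
        self.threshold = threshold
        self.get_metric = get_metric
        self.pause_agents = pause_agents
        self.tripped = False

    def check(self) -> bool:
        value = self.get_metric()
        if value > self.threshold and not self.tripped:
            self.tripped = True  # trip only once; reset is a human decision
            self.pause_agents(reason=f"metric {value:.2f} exceeded {self.threshold}")
        return self.tripped
```

Deliberately, there is no automatic reset: un-tripping the breaker should be a human action after the incident is understood.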

Conclusion

LLM agents are probabilistic systems operating with significant autonomy. Traditional monitoring tells you if they’re running. Agent observability tells you what they’re doing and whether it’s right.

The investment pays off the first time you debug a production incident by tracing exactly which tool call triggered which decision that led to which unexpected outcome. Without it, you’re debugging black boxes.

Start simple: token tracking, decision logs, behavioral baselines. Build from there as your agents mature.