LLMs are becoming infrastructure. Not just chatbots — actual components in automation pipelines. But getting reliable, parseable output requires disciplined prompt engineering.
Here’s what works for DevOps use cases.
## The Core Problem
LLMs are probabilistic. Ask the same question twice, get different answers. That’s fine for chat. It’s terrible for automation that needs to parse structured output.
The solution: constrain the output format and validate aggressively.
## Pattern 1: Structured Output with JSON Mode
Most LLM APIs now support JSON mode. Use it.
```python
import json

import openai


def analyze_error_log(log_content: str) -> dict:
    """Analyze an error log and return a structured diagnosis."""
    response = openai.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": """You are a DevOps log analyzer.
Return JSON with exactly these fields:
{
  "severity": "critical|high|medium|low",
  "category": "network|database|memory|disk|application|unknown",
  "root_cause": "brief description",
  "suggested_action": "specific remediation step",
  "confidence": 0.0-1.0
}"""
            },
            {
                "role": "user",
                "content": f"Analyze this error log:\n\n{log_content}"
            }
        ],
        temperature=0.1  # low temperature for consistency
    )
    return json.loads(response.choices[0].message.content)
```
Key points:
- `response_format={"type": "json_object"}` guarantees syntactically valid JSON, not schema adherence
- Define the exact schema in the system prompt
- Use a low temperature (0.1-0.3) for consistent output
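Because JSON mode guarantees syntax but not schema, the result can still carry an unknown severity or an out-of-range confidence. A minimal sanity check before anything downstream acts on the diagnosis (a sketch, assuming the `analyze_error_log` function above):

```python
VALID_SEVERITIES = {"critical", "high", "medium", "low"}


def analyze_error_log_checked(log_content: str) -> dict:
    """Sketch: reject well-formed JSON that violates the schema."""
    diagnosis = analyze_error_log(log_content)
    if diagnosis.get("severity") not in VALID_SEVERITIES:
        raise ValueError(f"unexpected severity: {diagnosis.get('severity')!r}")
    if not 0.0 <= float(diagnosis.get("confidence", -1)) <= 1.0:
        raise ValueError("confidence out of range")
    return diagnosis
```

Pattern 4 below generalizes this idea with Pydantic.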
## Pattern 2: Chain of Thought for Complex Analysis
For multi-step reasoning, make the LLM show its work:
````python
import json
import re


def diagnose_incident(metrics: dict, logs: str, alerts: list) -> dict:
    """Multi-step incident diagnosis."""
    prompt = f"""Analyze this incident step by step.

## Metrics
{json.dumps(metrics, indent=2)}

## Recent Logs
{logs}

## Active Alerts
{json.dumps(alerts, indent=2)}

Think through this systematically:
1. What anomalies do you see in the metrics?
2. Do the logs correlate with metric anomalies?
3. Which alerts are symptoms vs root cause?
4. What's the most likely root cause?
5. What's the recommended remediation?

After your analysis, provide a JSON summary:
```json
{{
  "anomalies": ["list of observed anomalies"],
  "root_cause": "most likely cause",
  "confidence": 0.0-1.0,
  "remediation": ["ordered steps to resolve"],
  "escalate": true/false
}}
```"""
    response = call_llm(prompt)

    # Extract the JSON block from the free-form reasoning
    json_match = re.search(r'```json\s*(.*?)\s*```', response, re.DOTALL)
    if json_match:
        return json.loads(json_match.group(1))
    raise ValueError("Failed to extract structured response")
````
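A quick usage sketch; the inputs and the `page_oncall` hook are made up for illustration:

```python
incident = diagnose_incident(
    metrics={"cpu_pct": 97, "p99_latency_ms": 4200},
    logs="ERROR api: upstream timeout connecting to db-primary",
    alerts=[{"name": "HighLatency", "state": "firing"}],
)
if incident["escalate"]:
    page_oncall(incident)  # hypothetical escalation hook
```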
## Pattern 3: Few-Shot Examples for Consistency
Show the LLM exactly what you want:
```python
CLASSIFICATION_PROMPT = """Classify infrastructure alerts into actionable categories.

Examples:

Alert: "CPU usage exceeded 90% on web-server-01"
Classification: {"category": "capacity", "urgency": "medium", "auto_remediate": true, "action": "scale_horizontal"}

Alert: "SSL certificate expires in 7 days for api.example.com"
Classification: {"category": "security", "urgency": "high", "auto_remediate": true, "action": "renew_certificate"}

Alert: "Database connection pool exhausted on primary"
Classification: {"category": "database", "urgency": "critical", "auto_remediate": false, "action": "page_oncall"}

Alert: "Disk usage at 85% on logging-server"
Classification: {"category": "capacity", "urgency": "low", "auto_remediate": true, "action": "cleanup_logs"}

Now classify this alert:

Alert: "{alert_text}"
Classification:"""
```
Few-shot examples calibrate the model’s output format and decision boundaries better than lengthy instructions.
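One wrinkle: the example classifications contain literal braces, so filling the template with `str.format` would raise a `KeyError`. Plain substitution sidesteps that (a sketch, assuming the `call_llm` wrapper used throughout this post):

```python
import json


def classify_alert(alert_text: str) -> dict:
    # str.replace, not str.format: str.format would misread the literal
    # braces in the few-shot example JSON as placeholders.
    prompt = CLASSIFICATION_PROMPT.replace("{alert_text}", alert_text)
    return json.loads(call_llm(prompt))
```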
## Pattern 4: Validation and Retry
Never trust LLM output blindly:
```python
import json
import logging
from typing import Literal

import tenacity
from pydantic import BaseModel, validator

logger = logging.getLogger(__name__)


class AlertClassification(BaseModel):
    category: Literal["capacity", "security", "database", "network", "application"]
    urgency: Literal["critical", "high", "medium", "low"]
    auto_remediate: bool
    action: str

    @validator('action')
    def action_must_be_valid(cls, v):
        valid_actions = ['scale_horizontal', 'scale_vertical', 'restart_service',
                         'renew_certificate', 'page_oncall', 'cleanup_logs', 'investigate']
        if v not in valid_actions:
            raise ValueError(f'action must be one of {valid_actions}')
        return v


@tenacity.retry(
    stop=tenacity.stop_after_attempt(3),
    wait=tenacity.wait_exponential(multiplier=1, min=1, max=10),
    retry=tenacity.retry_if_exception_type(ValueError),
    reraise=True,  # surface the final ValueError instead of tenacity.RetryError
)
def classify_alert_validated(alert_text: str) -> AlertClassification:
    """Classify an alert with validation and retry."""
    response = call_llm(CLASSIFICATION_PROMPT.replace("{alert_text}", alert_text))

    # Parse and validate
    try:
        data = json.loads(response)
        return AlertClassification(**data)
    except (json.JSONDecodeError, ValueError) as e:
        # Log the raw response for debugging
        logger.warning(f"Invalid LLM response: {response}, error: {e}")
        raise ValueError(f"Failed to parse valid classification: {e}")
```
Pydantic models catch schema violations. Tenacity retries on validation failures.
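With `reraise=True`, callers see a plain `ValueError` after the last failed attempt, which keeps the fallback path simple (a usage sketch; `rule_based_classify` is the fallback from Anti-Pattern 3 below):

```python
alert = "Disk usage at 92% on db-primary"
try:
    classification = classify_alert_validated(alert)
except ValueError:
    # Three attempts failed validation; degrade to deterministic rules.
    classification = rule_based_classify(alert)
```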
## Pattern 5: Context Window Management
Long logs need chunking:
```python
def analyze_large_log(log_path: str, max_tokens: int = 4000) -> dict:
    """Analyze a large log file in chunks."""
    with open(log_path) as f:
        content = f.read()

    # Rough estimate: 1 token ≈ 4 characters
    chunk_size = max_tokens * 4
    chunks = [content[i:i + chunk_size] for i in range(0, len(content), chunk_size)]

    results = []
    for i, chunk in enumerate(chunks):
        analysis = analyze_error_log(chunk)
        analysis['chunk_index'] = i
        results.append(analysis)

    # Aggregate findings
    return aggregate_analyses(results)


def aggregate_analyses(results: list[dict]) -> dict:
    """Combine chunk analyses into an overall assessment."""
    # Find the highest severity (lowest index in severity_order)
    severity_order = ['critical', 'high', 'medium', 'low']
    worst_severity = min(results, key=lambda x: severity_order.index(x['severity']))

    # Collect unique issues
    root_causes = list(set(r['root_cause'] for r in results))

    return {
        'overall_severity': worst_severity['severity'],
        'issues_found': root_causes,
        'chunks_analyzed': len(results)
    }
```
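The 4-characters-per-token estimate can drift badly on dense log formats. When exact boundaries matter, a tokenizer library such as `tiktoken` splits on real token counts (a sketch, assuming the `cl100k_base` encoding used by GPT-4-family models):

```python
import tiktoken


def chunk_by_tokens(content: str, max_tokens: int = 4000) -> list[str]:
    """Split text on exact token boundaries instead of estimating."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(content)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]
```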
## Pattern 6: Caching for Cost Control
LLM calls are expensive. Cache aggressively:
```python
import hashlib

import redis

redis_client = redis.Redis()


def cached_llm_call(prompt: str, ttl: int = 3600) -> str:
    """Cache LLM responses keyed by prompt hash."""
    cache_key = f"llm:{hashlib.sha256(prompt.encode()).hexdigest()}"

    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return cached.decode()

    # Call LLM and cache the result
    response = call_llm(prompt)
    redis_client.setex(cache_key, ttl, response)
    return response
```
For alert classification, the same alert text should always get the same classification. Cache it.
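Wiring the cache into classification is one line of composition (a sketch combining the pieces above; the 24-hour TTL is an arbitrary choice):

```python
import json


def classify_alert_cached(alert_text: str, ttl: int = 86400) -> dict:
    # Identical alert text yields an identical prompt, hence a cache hit.
    prompt = CLASSIFICATION_PROMPT.replace("{alert_text}", alert_text)
    return json.loads(cached_llm_call(prompt, ttl=ttl))
```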
## Anti-Patterns to Avoid
1. Asking open-ended questions
```python
# ❌ Bad - unbounded response
"What do you think about this log?"

# ✅ Good - constrained response
"Classify this log entry. Return only: ERROR, WARNING, or INFO"
```
2. Trusting confidence scores
```python
# ❌ Bad - LLM confidence is often miscalibrated
if analysis['confidence'] > 0.8:
    auto_remediate()

# ✅ Good - use confidence as one signal among many
if analysis['confidence'] > 0.8 and analysis['category'] in SAFE_TO_AUTO_REMEDIATE:
    auto_remediate()
```
3. No fallback for failures
```python
# ❌ Bad - crashes on LLM failure
result = classify_alert(alert)
take_action(result)

# ✅ Good - graceful degradation
try:
    result = classify_alert(alert)
    take_action(result)
except LLMError:
    # Fall back to rule-based classification
    result = rule_based_classify(alert)
    take_action(result)
    notify_oncall("LLM classification failed, using fallback")
```
## The Bottom Line
LLMs in DevOps automation work when you:
- Constrain output format (JSON mode, schemas)
- Use low temperature for consistency
- Validate everything with Pydantic
- Retry on validation failures
- Cache identical requests
- Always have a fallback
Treat LLMs as unreliable components. Build defensive automation around them. When they work, they’re magic. When they fail, your fallbacks catch it.