Troubleshooting & FAQ
Production incident playbooks, decision trees, common failure modes, and step-by-step debugging procedures for agent systems.
When something breaks in production, speed matters more than perfection. This document is designed for on-call engineers to diagnose and fix common issues quickly.
In simple terms: “What do I do when my harness breaks? Where do I look first? How do I fix it?”
Quick Reference: First Steps
When something is broken:
- Check if it’s actually broken → Look at metrics (error rate, latency)
- Identify the symptom → Use decision tree below
- Check recent changes → Did something deploy in last 30 minutes?
- Look at structured logs → Filter by error, agent ID, session ID
- Isolate the component → Is it the model? Tool? Memory? Cost control?
- Apply fix → Choose “Quick fix” if urgent, “Proper fix” for long-term
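The checklist above can be condensed into a first-pass triage helper. This is an illustrative sketch only: the metric names, thresholds, and the `last_deploy` argument are assumptions, not part of any real harness API.

```python
from datetime import datetime, timedelta, timezone

def triage(metrics: dict, last_deploy: datetime) -> list[str]:
    """First-pass triage over the checklist above; returns findings in order.

    Metric keys and thresholds here are illustrative assumptions.
    """
    findings = []
    # Step 1: is it actually broken? Check error rate and latency.
    if metrics.get("error_rate", 0.0) > 0.05:
        findings.append("error_rate_high")
    if metrics.get("p99_latency_s", 0.0) > metrics.get("latency_slo_s", 10.0):
        findings.append("latency_high")
    # Step 3: did something deploy in the last 30 minutes?
    if datetime.now(timezone.utc) - last_deploy < timedelta(minutes=30):
        findings.append("recent_deploy")
    return findings
```

If this returns an empty list, the remaining steps (structured logs, component isolation) are where to look next.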
Part 1: Symptom-Based Diagnosis Decision Tree
Use this tree to identify what’s broken:
Something seems wrong
├─ Error rate > 5% (check metrics)
│ ├─ Specific tool failing (look for tool error pattern)
│ │ ├─ Tool timeout → See "Tool Issues: Timeout"
│ │ ├─ Tool returns bad format → See "Tool Issues: Unexpected Format"
│ │ ├─ Permission denied → See "Tool Issues: Permission Denied"
│ │ └─ Tool not found → See "Tool Issues: Not Registered"
│ │
│ ├─ Agent producing garbage (reasoning nonsensical)
│ │ ├─ Agent ignoring instructions → See "Agent Debugging: Ignoring Instructions"
│ │ ├─ Agent making wrong tool calls → See "Agent Debugging: Wrong Tool Calls"
│ │ └─ Hallucinations increased → See "Quality Issues: Hallucinations"
│ │
│ ├─ All requests failing with same error
│ │ ├─ "Rate limit exceeded" → See "Common Error Messages: 429"
│ │ ├─ "Context window exceeded" → See "Common Error Messages: Context Exceeded"
│ │ ├─ "Model not found" → See "Common Error Messages: Model Not Found"
│ │ └─ Other API error → See "Common Error Messages"
│ │
│ └─ Random/intermittent failures
│ ├─ Network timeouts → See "Performance Issues: Network Timeouts"
│ ├─ Database connection errors → See "Deployment Issues: Database Errors"
│ └─ Memory corruption → See "Memory Issues: Corruption"
│
├─ Latency > expected (check p50/p95/p99)
│ ├─ High on all requests → See "Performance Issues: Slow Inference"
│ ├─ High on specific tool → See "Performance Issues: Tool Bottleneck"
│ ├─ Intermittent spikes → See "Performance Issues: Queue Backlog"
│ └─ First request slow, rest fast → See "Performance Issues: Slow Inference (model loading)"
│
├─ Cost > budget (check cost tracking)
│ ├─ Spike in last 24 hours → See "Cost & Budget Issues: Unexpected Spike"
│ ├─ Gradual increase over time → See "Cost & Budget Issues: Gradual Creep"
│ ├─ Wrong tokens charged → See "Cost & Budget Issues: Token Counting Mismatch"
│ └─ Runaway agent → See "Cost & Budget Issues: Runaway Agent"
│
├─ Agent looping (iterations not stopping)
│ ├─ Stuck on same decision → See "Agent Debugging: Stuck in Loop"
│ ├─ Context window filling up → See "Memory Issues: Memory Loss"
│ └─ Too many retries on failed tool → See "Agent Debugging: Ignoring Instructions"
│
├─ Agent timing out (total duration > timeout)
│ ├─ Normal operations taking too long → See "Agent Debugging: Timing Out"
│ ├─ Waiting for tool response → See "Tool Issues: Timeout"
│ └─ Memory consolidation slow → See "Memory Issues: Memory Consolidation Slow"
│
└─ Can't find information / logs missing
├─ Logs disappeared → See "Deployment Issues: Missing Logs"
├─ Metrics not being recorded → See "Deployment Issues: Health Checks Failing"
└─ Dashboard shows no data → See "Operations: Observability Misconfigured"
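One way to make the tree operational is a small routing table that maps an identified symptom to the playbook section to open. The section names below come from this document; the symptom keys are hypothetical labels your monitoring might emit.

```python
# Routing table mirroring the decision tree above (partial, illustrative).
PLAYBOOK = {
    "tool_timeout": "Tool Issues: Timeout",
    "tool_bad_format": "Tool Issues: Unexpected Format",
    "tool_not_found": "Tool Issues: Not Registered",
    "rate_limited": "Common Error Messages: 429",
    "context_exceeded": "Common Error Messages: Context Exceeded",
    "agent_looping": "Agent Debugging: Stuck in Loop",
    "cost_spike": "Cost & Budget Issues: Unexpected Spike",
}

def route(symptom: str) -> str:
    """Return the playbook section for a symptom, or fall back to the tree."""
    return PLAYBOOK.get(symptom, "Part 1: Symptom-Based Diagnosis Decision Tree")
```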
Part 2: Agent Debugging
Symptom: Agent Stuck in Loop
What you’ll see:
- Agent iteration count keeps increasing (10, 20, 50+)
- Agent making same decision/tool call repeatedly
- Session doesn’t complete or times out
- Context tokens increasing with each iteration
- Cost climbing without making progress
Root causes:
- Tool always fails — tool is broken, agent keeps retrying
- Agent doesn’t understand error — error message is unclear
- Instruction contradiction — agent told to keep trying indefinitely
- No termination logic — agent has no “give up” condition
- Tool returns infinite loop — e.g., search results pointing to search
Diagnostic steps:
# Step 1: Check iteration limit
if session_log.loop_iterations > 15:
    print("ALERT: Agent exceeded normal iteration count")
# Normal: 3-8 iterations
# Concerning: 10-15 iterations
# Critical: 20+ iterations

# Step 2: Look at repetition pattern
recent_steps = session_log.last_n_steps(5)
tools_called = [step.tool_name for step in recent_steps]
print(f"Last 5 tools: {tools_called}")
# If all same → Looping

# Step 3: Check error in failed tools
for step in recent_steps:
    if step.status == "failed":
        print(f"Tool {step.tool_name} failed: {step.error}")
        # Is the error message helpful?
        # Is the tool actually broken?

# Step 4: Check context window usage
print(f"Context usage: {session_log.context_tokens_used} / {session_log.context_limit}")
context_ratio = session_log.context_tokens_used / session_log.context_limit
if context_ratio > 0.85:
    print("WARNING: Approaching context limit, may be running out of space")
Quick fix (< 5 minutes):
1. Kill the session immediately (don't wait for timeout)
- Set iteration limit to 12 (or lower if you see looping at 8)
2. Look at the last 3 tool calls in logs
- Same tool repeated? → That tool is broken
- Different tools but same result? → Agent isn't understanding error
3. Check which tool is failing
- If "web_search": Search API might be rate-limited or down
- If custom tool: That tool implementation may be broken
4. Restart without that tool (comment it out)
- Does the agent succeed without it? → Confirms tool is the problem
Proper fix (permanent):
- Add mandatory termination logic:

  MAX_ITERATIONS = 12

  if iteration_count >= MAX_ITERATIONS:
      return {
          "status": "incomplete",
          "reason": "max_iterations_reached",
          "best_effort_result": last_valid_output
      }

- Improve error messages so the agent understands:

  # Bad error message (agent doesn't know what to do)
  raise Exception("Tool failed")

  # Good error message (agent knows it's a retry issue)
  raise Exception(
      "Web search timed out after 30 seconds. "
      "Either the site is slow or the query is too broad. "
      "Try: 1) Wait 10 seconds then retry, or 2) Try a different query"
  )

- Add loop detection to logs:

  # Detect when agent is repeating itself
  if iteration > 2:
      prev_tool = steps[-2].tool_name
      curr_tool = steps[-1].tool_name
      if prev_tool == curr_tool == steps[-3].tool_name:
          log.warning("LOOP_DETECTED", {
              "tool": curr_tool,
              "repetitions": count_repetitions(curr_tool, steps)
          })
          # Force intervention or termination

- Test with the failing tool disabled:

  Does the agent succeed with tool X disabled?
  YES → Confirms tool X is broken; fix it.
  NO → The problem is elsewhere (agent logic or instruction clarity).
Symptom: Agent Ignoring Instructions
What you’ll see:
- Agent makes decisions contrary to explicit instructions
- Agent uses forbidden tools
- Agent generates output in wrong format despite instruction
- Agent skips required steps
Root causes:
- Instruction buried in context — agent can’t see it due to context length
- Conflicting instructions — instructions contradict each other
- Instructions too vague — agent interprets them differently
- Model drift — model behavior changed, needs re-tuning
- Tool choice conflict — agent thinks different tool is better
- Temperature too high — model being too creative/random
Diagnostic steps:
# Step 1: Verify instruction was in context
instruction = "Always use tool X for data retrieval"
if instruction in session_log.full_context:
    print("✓ Instruction is in context")
else:
    print("✗ Instruction NOT in context (likely pruned due to length)")
    print(f"Context usage: {session_log.context_tokens} / {session_log.limit}")

# Step 2: Check what tool agent used
expected_tool = "expected_tool"  # the tool the instruction mandates
for step in session_log.steps:
    if step.tool_name != expected_tool:
        print(f"Agent chose {step.tool_name}, expected {expected_tool}")
        print(f"Reasoning: {step.reasoning}")

# Step 3: Check temperature/sampling parameters
print(f"Temperature: {session_log.model_params.temperature}")  # 0.0 = deterministic, 1.0 = creative
print(f"Top P: {session_log.model_params.top_p}")

# Step 4: Check if this is recent behavior
recent_10_sessions = get_last_n_sessions(10)
violations = sum(1 for s in recent_10_sessions if instruction_violated(s))
print(f"Instruction violations in last 10 sessions: {violations} / 10")
# If all 10 violated → chronic problem
# If 2-3 violated → occasional issue (may be model randomness)
Quick fix (< 5 minutes):
1. Check if instruction is being pruned (context too long)
→ Reduce memory size, compress old sessions, shorten instruction
2. Check for conflicting instructions
→ Search for "always use X" and "use Y if..."
→ Clarify which takes priority
3. If "occasional" violation (2-3 out of 10):
→ Reduce temperature (more deterministic)
→ Restart with fresh session (cold restart may help)
4. If "chronic" violation (8+ out of 10):
→ Instruction is being ignored, need proper fix
Proper fix (permanent):
- Ensure the instruction is at the start of context:

  # Bad: instruction at the end of a long context
  system_prompt = [
      instructions_file,        # 2K tokens
      memory_file,              # 50K tokens
      conversation_history,     # 100K tokens
      instruction_constraint,   # buried here!
  ]

  # Good: constraints at start, most recent history at end
  system_prompt = [
      constraint_instruction,   # critical: at start!
      task_instruction,         # what to do
      memory_file,              # 50K tokens
      conversation_history,     # 100K tokens, at end so most recent
  ]

- Make instructions concrete with examples:

  # Vague (agent might ignore)
  "Use the search tool when appropriate"

  # Concrete (agent will follow)
  """MANDATORY: When you need information not in your memory:
  1. Use search_web_tool FIRST
  2. If search returns nothing, use search_knowledge_base SECOND
  3. Only use local_files if neither returns results

  Example:
  User asks "What is Llama 2?"
  GOOD: search_web_tool("Llama 2 model")
  BAD: search_knowledge_base("Llama 2 model") ← Wrong order
  BAD: Ask user for more info ← Don't do this, search first
  """

- Add a verification step:

  # After each agent step, verify it followed the rules
  if required_tool not in [call.name for call in step.tool_calls]:
      # Agent violated the instruction, log it
      log_alert("INSTRUCTION_VIOLATION", {
          "instruction": constraint,
          "tool_required": required_tool,
          "tool_used": step.tool_calls[0].name,
          "session_id": session_id
      })

- Test with a reduced context window:

  Does the agent follow the instruction with context_limit=50K instead of 200K?
  YES → The instruction was being pruned; reduce memory.
  NO → The agent is ignoring the instruction; add a stronger constraint.
Symptom: Agent Producing Garbage Output
What you’ll see:
- Output is nonsensical, incoherent
- Output contains false information (hallucinations)
- Output mixes multiple unrelated topics
- Output contains jailbreak artifacts or strange formatting
- Output quality fine in staging, broken in production
Root causes:
- Context corruption — old memory mixed with current task
- Model hallucinating — producing false information confidently
- Prompt injection — malicious input changed agent behavior
- Cache collision — KV cache mixing responses from different sessions
- Quantization artifact — rare precision error from 4-bit quantization
- Model drift — production model differs from staging
- Temperature too high — model generating random tokens
Diagnostic steps:
# Step 1: Check context for corruption
context_summary = analyze_context_windows(session_log)
print("Context sources:")
for source in context_summary.sources:
    print(f"  - {source}: {source.token_count} tokens, age={source.age_hours}h")
# Example of corruption:
# Task: "Summarize recent sales"
# Context accidentally includes: "Nuclear weapons safety procedures"
# ← This contamination causes garbage output

# Step 2: Check if output is hallucination
for fact in output_facts:
    if fact not in session_log.full_context:
        print(f"HALLUCINATION: '{fact}' not found in context")
        # Agent made this up

# Step 3: Check input for injection
if "<|system|>" in user_input or "ignore instructions" in user_input.lower():
    print("POSSIBLE_INJECTION: Input contains jailbreak patterns")

# Step 4: Check if staging/production models match
staging_model = "claude-3-5-sonnet-20240514"
prod_model = session_log.model
if staging_model != prod_model:
    print(f"MODEL MISMATCH: Staging uses {staging_model}, prod uses {prod_model}")
    print("→ Test staging with prod model to see if issue reproduces")

# Step 5: Check temperature
print(f"Temperature: {session_log.temperature}")
if session_log.temperature > 0.7:
    print("WARNING: High temperature may cause randomness")
Quick fix (< 5 minutes):
1. Set temperature to 0.0 (deterministic)
→ Eliminates randomness, see if output improves
2. Clear memory/context
→ Cold restart session without old memory
→ Does output improve? → Memory corruption confirmed
3. Check if staging and production use same model
→ If different, recreate issue in staging first
4. Check input for obvious injection patterns
→ Any "<|system|>" or "ignore instructions"?
→ If yes, increase input validation
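Quick fixes 1 and 2 can be combined into a single retry helper. A minimal sketch, assuming your harness exposes a model call that accepts `temperature` and `memory` parameters (both names are illustrative, not a real API):

```python
def deterministic_retry(call_model, prompt: str):
    """Re-run a failing request with randomness and old context removed.

    `call_model` stands in for your harness's model-call function; the
    `temperature` and `memory` keyword names are assumptions.
    """
    return call_model(
        prompt,
        temperature=0.0,  # quick fix 1: deterministic sampling
        memory=None,      # quick fix 2: cold restart, no old memory
    )
```

If the output improves under these settings, the problem was randomness or memory corruption rather than the model itself.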
Proper fix (permanent):
- Prevent context corruption:

  # Tag context sources with session ID
  memory_entry = {
      "content": text,
      "session_id": current_session_id,  # MUST match current session
      "created_at": timestamp
  }

  # Before using memory, verify it's from the same task/session
  for entry in memory:
      if entry.session_id != current_session_id:
          if entry.age_hours > 24:
              # Old entry from a different session, skip it
              continue

- Add hallucination detection:

  # Verify each fact in output appears in context
  facts = extract_facts(output)
  for fact in facts:
      if fact not in context and not is_known_fact(fact):
          # This is a potential hallucination
          add_fact_verification_step()
          # Ask agent to cite source or admit uncertainty

- Strict input validation against injection:

  def validate_input(user_input: str) -> bool:
      dangerous_patterns = [
          "<|system|>", "<|user|>",    # Jailbreak markers
          "ignore instructions",        # Direct override
          "pretend you",                # Role change
          "forget your instructions",   # Memory wipe
          "you are now",                # System swap
      ]
      for pattern in dangerous_patterns:
          if pattern.lower() in user_input.lower():
              log_alert("INJECTION_ATTEMPT", {"input": user_input})
              return False
      return True

- Test staging with the production model:

  If staging uses model V1 and prod uses V2:
  1. Update staging to use V2
  2. Re-run quality tests
  3. If quality drops → V2 needs tuning
  4. If quality same → Issue is elsewhere (not the model change)
Symptom: Agent Running Out of Memory / Context Window Exceeded
What you’ll see:
- Error: “Context length exceeded” or “prompt too long”
- Agent abruptly stops mid-task
- Session fails on the 10th+ iteration (context filling up over time)
- Long-running tasks fail but short tasks succeed
Root causes:
- Memory not being consolidated — old sessions piling up
- Conversation history too long — keeping all old messages
- Model context limit too low — using 4K context model instead of 128K
- Tool results too large — search returns 10K tokens of junk
- Logging too verbose — logging every intermediate step
- Context size increased — recent change expanded startup memory
Diagnostic steps:
# Step 1: Check context usage over time
for step in session_log.steps:
    print(f"Iteration {step.iteration}: {step.context_used} tokens")
# Should grow slowly, then plateau
# If growing linearly: memory not being pruned/consolidated

# Step 2: Measure memory file sizes
print(f"CLAUDE.md: {get_file_size('CLAUDE.md')} tokens")
print(f"MEMORY.md: {get_file_size('MEMORY.md')} tokens")
print(f"Topic files: {sum(get_file_size(f) for f in topic_files)} tokens")
# Good baseline:
# CLAUDE.md: 500-1000 tokens (instructions)
# MEMORY.md: 5000-10000 tokens (compact facts)
# Topic files: 500-2000 tokens each
# Total startup: < 20K tokens

# Step 3: Check model context limit
print(f"Model: {session_log.model}")
print(f"Context limit: {session_log.context_limit} tokens")
# If limit is 4K: Too small
# If limit is 128K: Good for long tasks
# Check: Did a recent change switch to a smaller model?

# Step 4: Measure tool output sizes
for step in session_log.steps:
    if step.tool_name == "search_web":
        output_tokens = count_tokens(step.tool_output)
        print(f"Search result: {output_tokens} tokens")
        # Results > 5K tokens? Too verbose

# Step 5: Check consolidation logs
consolidation_logs = filter_logs(session_log, event="memory_consolidated")
if len(consolidation_logs) == 0:
    print("WARNING: No consolidation events in logs")
    print("→ Memory never consolidated, context keeps growing")
Quick fix (< 5 minutes):
1. Reduce startup memory
→ Comment out old session summaries in MEMORY.md
→ Keep only last 2 sessions
→ Should drop startup from 50K → 15K tokens
2. Enable aggressive memory consolidation
→ Consolidate memory every 5 iterations instead of 15
→ This keeps context from growing unbounded
3. Switch to larger context model if available
→ e.g., from a 100K-context model to a 200K-context model
→ If same model, may not help (limit is hard limit)
4. Prune verbose logging
→ Stop logging every intermediate step
→ Log only: errors, tool calls, final decision
→ Should reduce context by 30-40%
5. Limit tool output
→ Cap search results to 2K tokens
→ Summarize large results before using them
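Quick fix 5 can be sketched as a simple output cap applied before tool results enter the context. This uses the rough "~4 characters per token" heuristic as a stand-in for a real tokenizer; swap in your model's tokenizer for exact counts.

```python
def cap_tool_output(text: str, max_tokens: int = 2000) -> str:
    """Truncate oversized tool output before it enters the context.

    Assumes ~4 characters per token; replace with a real tokenizer
    for exact budgeting.
    """
    max_chars = max_tokens * 4
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[...truncated to fit context budget]"
```

For search results in particular, summarizing before truncating usually preserves more signal than a hard cut, but the cap guarantees the budget either way.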
Proper fix (permanent):
- Implement automatic memory consolidation:

  def consolidate_memory_if_needed():
      context_used = get_context_usage()
      context_limit = model.context_limit
      usage_ratio = context_used / context_limit

      if usage_ratio > 0.60:  # Consolidate at 60%
          # Summarize old conversation
          old_messages = get_messages_before_n_iterations_ago(10)
          summary = compress_conversation(old_messages)

          # Replace old messages with summary
          replace_old_messages_with_summary(summary)

          log.info("MEMORY_CONSOLIDATED", {
              "tokens_before": context_used,
              "tokens_after": get_context_usage(),
              "compression_ratio": context_used / get_context_usage()
          })

- Set a hard context limit with graceful degradation:

  MAX_CONTEXT_USAGE = 0.85 * model.context_limit

  if context_used > MAX_CONTEXT_USAGE:
      # Instead of crashing, gracefully degrade
      log.warn("APPROACHING_CONTEXT_LIMIT", {
          "usage": context_used,
          "limit": model.context_limit
      })

      # Option 1: Consolidate memory
      consolidate_memory()

      # Option 2: Remove oldest messages
      prune_old_messages(count=5)

      # Option 3: Save and restart session
      save_session_summary()
      return {"status": "checkpoint_reached", "continue_in_new_session": True}

- Right-size the model for task length:

  task_complexity = estimate_task_complexity(user_prompt)

  if task_complexity == "simple":
      model = "claude-3-5-sonnet"  # 200K context, cheap
  elif task_complexity == "complex":
      model = "claude-3-opus"      # 200K context, more capable
  else:
      model = "claude-3-5-sonnet"  # Default

  # Re-evaluate model choice based on actual task

- Establish memory file size budgets:

  # Define maximum sizes
  MEMORY_BUDGETS = {
      "CLAUDE.md": 1000,       # Instructions
      "MEMORY.md": 15000,      # Compact facts
      "topic_files": 2000,     # Each topic file
      "startup_total": 20000   # Total startup overhead
  }

  # Enforce in CI/CD
  for filepath, budget in MEMORY_BUDGETS.items():
      actual_size = get_file_size(filepath)
      if actual_size > budget:
          raise Exception(f"{filepath} exceeds budget: {actual_size} > {budget}")
Symptom: Agent Making Wrong Tool Calls
What you’ll see:
- Agent calls wrong tool for the task
- Agent calls tool with wrong parameters
- Agent calls tools in wrong order
- Agent uses tool when it shouldn’t
Root causes:
- Tool descriptions unclear — agent doesn’t understand what tool does
- Parameter validation missing — agent sends bad params, tool fails
- Tool schema wrong — schema doesn’t match tool’s actual interface
- Agent doesn’t understand task — misinterprets what user asked
- Too many similar tools — agent confused between similar options
Diagnostic steps:
# Step 1: Check tool descriptions
for tool in available_tools:
    print(f"Tool: {tool.name}")
    print(f"Description: {tool.description}")
    # Is it clear what this tool does?
    # Would you know to use it from the description?

# Example of a bad description:
# "data" - What data? When to use it? (Unhelpful)

# Example of a good description:
# "search_web: Search the public internet for current information.
#  Use when you need recent news, facts, or information not in your memory.
#  Returns: Top 5 results with titles, URLs, summaries (max 2K tokens each)"

# Step 2: Check parameter types
for tool in available_tools:
    for param in tool.parameters:
        print(f"  {param.name}: {param.type} (required={param.required})")
        # Are types clear? (string, integer, list)
        # Are they documented?

# Step 3: Check tool usage in logs
for step in session_log.steps:
    tool = step.tool_name
    params = step.tool_params
    print(f"Tool: {tool}, Params: {params}")
    # Does this make sense?
    # If tool expects ["query"], did agent provide query?
    # If tool expects {"file_path", "action"}, did agent provide both?

# Step 4: Check for parameter errors
for step in session_log.steps:
    if step.status == "error":
        error = step.error_message
        if "parameter" in error.lower() or "type" in error.lower():
            print(f"PARAMETER_ERROR: {error}")

# Step 5: Check for tool confusion patterns
tool_sequence = [step.tool_name for step in session_log.steps]
for tool_name in set(tool_sequence):
    if tool_sequence.count(tool_name) > 3:
        # Agent kept using same tool, suggests confusion about alternatives
        print(f"Agent overused {tool_name}")
Quick fix (< 5 minutes):
1. Check tool schema matches reality
→ Run tool with example params from schema
→ If it fails → Schema is wrong, update schema
2. Look at agent's reasoning for tool choice
→ Why did agent pick tool X?
→ Is reasoning correct? (If reasoning is wrong, LLM is confused)
3. If tool called with wrong params:
→ Add parameter validation to tool
→ Return helpful error message explaining required params
→ Agent will learn and retry correctly
4. If too many similar tools:
→ Combine similar tools into one with "action" parameter
→ E.g., search_web, search_knowledge_base, search_local
→ Instead: search(source: "web|knowledge|local", query)
Proper fix (permanent):
- Improve tool descriptions with examples:

  # Bad
  tools = [{
      "name": "search",
      "description": "Search for information"
  }]

  # Good
  tools = [{
      "name": "search_web",
      "description": """
      Search the public internet for current information.

      When to use:
      - Need recent news or events (< 1 week old)
      - Need facts not in your memory
      - Need to verify current information

      Do NOT use:
      - For private/internal documents (use search_knowledge_base instead)
      - For files on user's computer (use search_local instead)

      Returns: Top 5 results with titles, URLs, summaries

      Example:
      Query: "machine learning"
      Result: [
          {"title": "What is ML?", "url": "...", "summary": "..."},
          ...
      ]
      """,
      "parameters": {
          "query": {
              "type": "string",
              "description": "Search query (e.g., 'latest AI models 2026')",
              "examples": ["GPT-4 release date", "Llama 3.1 performance"]
          }
      }
  }]

- Add parameter validation:

  def search_web(query: str) -> List[dict]:
      # Validate
      if not query or len(query) < 2:
          raise ValueError(
              "Query too short. Minimum 2 characters. "
              "Example: 'Python machine learning' not 'a'"
          )
      if len(query) > 200:
          raise ValueError(
              "Query too long (max 200 chars). "
              "Try shorter: 'AI models' not 'What are the latest developments in AI...'"
          )
      # Execute
      return search_implementation(query)

- Test that tool schemas match reality:

  # In your test suite
  def test_tool_schema_matches_implementation():
      for tool_name, tool_func in tools.items():
          schema = tool_schemas[tool_name]

          # Get required params from schema
          required_params = [p for p in schema.params if p.required]

          # Try calling with all required params
          example_kwargs = generate_example_params(required_params)
          try:
              tool_func(**example_kwargs)
          except TypeError as e:
              raise AssertionError(
                  f"Tool {tool_name} schema doesn't match implementation: {e}"
              )

- Reduce tool cardinality with an action parameter:

  # Instead of many similar tools:
  # search_web, search_knowledge_base, search_local, search_arxiv

  # Use one tool with an action param:
  def search(query: str, action: str = "web") -> List[dict]:
      """
      Search for information from multiple sources.

      action:
      - "web": Public internet (recent, current)
      - "knowledge": Internal knowledge base (comprehensive)
      - "local": Files on user's computer (private)
      - "arxiv": Academic papers (research)

      Example:
      search("machine learning", action="web")
      search("company policy", action="knowledge")
      """
      if action == "web":
          return search_web_impl(query)
      elif action == "knowledge":
          return search_kb_impl(query)
      # ... etc
Part 3: Tool Issues
Issue: Tool Not Found / Not Registered
Error message:
Tool 'web_search' not found. Available tools: [search_web, get_page]
Symptoms:
- Agent tries to call tool that doesn’t exist
- Error: “Tool not registered”
- Tool works locally but fails in production
Root causes:
- Tool not registered in harness — tool function exists but not in tool list
- Typo in tool name — agent calls web_search but the actual name is search_web
- Tool removed in recent deploy — tool was available before, not now
- Different deployment — staging has tool, production doesn’t
- Dynamic tool loading failed — tool file missing or syntax error
Diagnostic steps:
# Step 1: List available tools
print("Available tools:")
for tool in agent.available_tools:
    print(f"  - {tool.name}")

# Step 2: Check if tool is registered
tool_name = "web_search"
if tool_name not in agent.available_tools:
    print(f"✗ Tool '{tool_name}' not registered")
    # Find similar names
    similar = find_similar(tool_name, agent.available_tools)
    print(f"  Did you mean: {similar}?")

# Step 3: Check tool file exists
import os
if not os.path.exists("tools/web_search.py"):
    print("✗ Tool file missing: tools/web_search.py")

# Step 4: Check for syntax errors in tool file
try:
    import tools.web_search
    print("✓ Tool imports successfully")
except SyntaxError as e:
    print(f"✗ Syntax error in tool: {e}")

# Step 5: Compare staging vs production
staging_tools = get_tools_from("staging")
prod_tools = get_tools_from("production")
missing_in_prod = set(staging_tools) - set(prod_tools)
if missing_in_prod:
    print(f"Tools in staging but NOT in production: {missing_in_prod}")
Quick fix (< 5 minutes):
1. Check available tools in agent
→ Print list of registered tools
→ Is the tool there?
2. If tool should exist:
→ Check tool file for syntax errors
→ Restart harness/reload tools
3. If tool missing in production:
→ Did recent deploy remove it?
→ Check deployment diff (what changed?)
→ Rollback if needed
4. If typo in tool name:
→ Agent is calling 'web_search' but actual name is 'search_web'
→ Either: A) Rename tool to match, or B) Update agent prompt
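For typo cases, the stdlib's difflib can surface likely near-miss names; this is one possible implementation of the `find_similar` helper used in the diagnostic steps above.

```python
import difflib

def suggest_tool_name(requested: str, available: list[str]) -> list[str]:
    """Suggest likely intended tool names when a lookup fails.

    cutoff=0.5 is a loose threshold so transposed names like
    web_search / search_web still match.
    """
    return difflib.get_close_matches(requested, available, n=3, cutoff=0.5)
```

Including the suggestions in the "Tool not found" error message lets the agent self-correct on the next step instead of retrying the same bad name.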
Proper fix (permanent):
- Standardize tool naming:

  # Establish a naming convention
  # All search tools: search_* (search_web, search_knowledge, search_local)
  # All file tools:   file_*   (file_read, file_write, file_list)
  # All code tools:   run_*    (run_python, run_bash, run_sql)

  # Document in CLAUDE.md
  TOOL_NAMING_CONVENTION = """
  Prefix by category:
  - search_*: Information retrieval
  - file_*: File operations
  - run_*: Code execution
  - email_*: Email operations
  """

- Add tool validation to startup:

  def validate_tools_on_startup():
      for tool_name in EXPECTED_TOOLS:
          if tool_name not in agent.available_tools:
              raise RuntimeError(
                  f"Expected tool '{tool_name}' not registered. "
                  f"Available: {list(agent.available_tools.keys())}"
              )

          # Test that tool is callable
          try:
              tool = agent.available_tools[tool_name]
              # Don't actually call, just verify it's callable
              assert callable(tool)
          except Exception as e:
              raise RuntimeError(f"Tool '{tool_name}' not callable: {e}")

- Add tool alias support:

  # If tools are named differently in production vs the agent prompt
  TOOL_ALIASES = {
      "web_search": "search_web",  # Agent calls web_search, actual is search_web
      "fetch_url": "get_page",     # Agent calls fetch_url, actual is get_page
  }

  def resolve_tool_name(requested_name):
      if requested_name in TOOL_ALIASES:
          actual_name = TOOL_ALIASES[requested_name]
          log.warning("TOOL_ALIAS_USED", {
              "requested": requested_name,
              "actual": actual_name
          })
          return actual_name
      return requested_name

- Test tool availability in CI/CD:

  # In your test suite
  def test_all_required_tools_available():
      from harness import agent

      required_tools = [
          "search_web",
          "read_file",
          "write_file",
          "run_python",
          # ... etc
      ]

      for tool_name in required_tools:
          assert tool_name in agent.available_tools, \
              f"Required tool '{tool_name}' not registered"
Issue: Tool Failing with Errors
Error messages:
Tool 'web_search' failed: Connection timeout
Tool 'send_email' failed: Authentication failed
Tool 'read_file' failed: File not found
Symptoms:
- Specific tool always fails
- Tool fails intermittently
- Tool fails with specific input
- Tool works in local testing but fails in production
Root causes:
- Network timeout — API is slow or down
- Authentication failed — credentials missing or expired
- Permission denied — insufficient permissions
- Resource not found — file/URL doesn’t exist
- Rate limited — too many requests to external API
- Resource exhausted — disk full, memory full
Diagnostic steps:
# Step 1: Reproduce the error
tool = agent.get_tool("web_search")
try:
    result = tool(query="test")
    print("✓ Tool works")
except Exception as e:
    print(f"✗ Tool fails: {e}")

    # Step 2: Check error details
    error_details = {
        "error_type": type(e).__name__,  # TimeoutError, AuthError, etc
        "error_message": str(e),
        "error_code": getattr(e, "code", None)
    }
    print(f"Error details: {error_details}")

# Step 3: Check external service status
error_type = error_details["error_type"]
if error_type == "TimeoutError":
    # Check if API is up
    status = check_api_status("https://api.example.com/health")
    print(f"API status: {status}")

if error_type == "AuthError":
    # Check credentials
    creds = get_credentials()
    if creds is None:
        print("✗ Credentials missing")
    else:
        print(f"✓ Credentials present (expires {creds.expires_at})")

# Step 4: Check rate limiting (response = the failing HTTP response, if captured)
rate_limit_remaining = response.headers.get("X-RateLimit-Remaining")
if rate_limit_remaining == "0":
    print("WARNING: Rate limit exceeded")
    print(f"Resets at: {response.headers.get('X-RateLimit-Reset')}")

# Step 5: Check logs for patterns
failures = get_tool_failures("web_search", last_n_hours=1)
print(f"Failures in last hour: {len(failures)}")
for failure in failures:
    print(f"  {failure.timestamp}: {failure.error}")
Quick fix (< 5 minutes):
1. Check if external API/service is down
→ Visit status page or health endpoint
→ If down, wait for it to recover (not your problem)
2. Check credentials/API keys
→ Are they set in environment?
→ Are they still valid? (Check expiration)
→ Test with curl/Postman first
3. If rate limited:
→ Slow down request rate
→ Check quota in API dashboard
→ Request increase if needed
4. If timeout:
→ Increase timeout value (if configurable)
→ Check network connectivity
→ Check if API is slow
5. If permission denied:
→ Check user/account has permission
→ Check if API key has required scopes
→ Check firewall/network policies
Proper fix (permanent):
- Add retry logic with exponential backoff:

  def call_tool_with_retry(tool_name, *args, max_retries=3, **kwargs):
      import time
      for attempt in range(max_retries):
          try:
              tool = agent.get_tool(tool_name)
              result = tool(*args, **kwargs)
              return result
          except TimeoutError:
              if attempt < max_retries - 1:
                  wait_time = 2 ** attempt  # 1s, 2s, 4s
                  log.warning(f"Tool {tool_name} timeout, retrying in {wait_time}s")
                  time.sleep(wait_time)
              else:
                  raise
          except RateLimitError as e:
              # Don't retry immediately for rate limit
              raise e

- Add health checks and a circuit breaker:

  class ToolHealthCheck:
      def __init__(self, tool_name):
          self.tool_name = tool_name
          self.failure_count = 0
          self.failure_threshold = 5
          self.is_healthy = True

      def check_health(self):
          # Try calling tool with a simple test
          try:
              result = test_tool_call()
              self.failure_count = 0
              self.is_healthy = True
          except Exception:
              self.failure_count += 1
              if self.failure_count >= self.failure_threshold:
                  self.is_healthy = False
                  log.alert("TOOL_UNHEALTHY", {
                      "tool": self.tool_name,
                      "failures": self.failure_count
                  })

      def should_use_tool(self):
          if not self.is_healthy:
              # Tool is failing, don't use it
              return False
          return True

- Log all tool failures with context:

  def execute_tool(tool_name, params):
      log_entry = {
          "timestamp": datetime.now(),
          "event": "tool_call",
          "tool_name": tool_name,
          "params": params,
          "session_id": current_session_id
      }
      try:
          result = tools[tool_name](**params)
          log_entry["status"] = "success"
          return result
      except Exception as e:
          log_entry["status"] = "failed"
          log_entry["error_type"] = type(e).__name__
          log_entry["error_message"] = str(e)
          log_entry["error_traceback"] = traceback.format_exc()
          log.error("TOOL_FAILED", log_entry)
          raise

- Validate tool parameters before calling:

  def validate_tool_params(tool_name, params):
      schema = tool_schemas[tool_name]
      for param_name, param_config in schema.parameters.items():
          if param_config.required and param_name not in params:
              raise ValueError(
                  f"Missing required parameter '{param_name}' for tool '{tool_name}'"
              )

          # Validate types
          param_value = params.get(param_name)
          expected_type = param_config.type
          if param_value is not None and not isinstance(param_value, expected_type):
              raise TypeError(
                  f"Parameter '{param_name}' must be {expected_type}, "
                  f"got {type(param_value)}"
              )
Issue: Tool Timeout
Symptoms:
- Tool takes 30+ seconds to respond
- Tool never returns (timeout after N seconds)
- Some requests timeout, others are fast
- Timeouts increase over time (resource leak?)
Root causes:
- External API is slow — search engine, database is overloaded
- Network latency — slow network connection
- Tool implementation inefficient — code doing too much work
- Tool hanging — infinite loop, deadlock, or waiting for response
- Resource exhaustion — database connection pool empty, memory full
Diagnostic steps:
# Step 1: Measure tool latency
start = time.time()
try:
result = tool(query="test")
elapsed = time.time() - start
print(f"Tool latency: {elapsed:.2f}s")
except TimeoutError:
elapsed = time.time() - start
print(f"Tool timeout after {elapsed:.2f}s")
# Step 2: Check network latency to external services
import subprocess  # used by the measure_ping helper (wraps the `ping` CLI)
latency = measure_ping("api.example.com")
print(f"Network latency: {latency:.2f}ms")
# Step 3: Check tool implementation
import inspect
source = inspect.getsource(tool_function)
# Look for:
# - Synchronous I/O (requests, urllib) → Use async instead
# - Large loops without timeout
# - Database queries without indexes
# Step 4: Check resource usage during tool call
import psutil
process = psutil.Process()
initial_memory = process.memory_info().rss
result = tool(query="test")
final_memory = process.memory_info().rss
memory_growth = final_memory - initial_memory
print(f"Memory growth: {memory_growth / 1024 / 1024:.2f} MB")
# Step 5: Check logs for patterns
slow_calls = get_tool_calls("web_search", filter={"duration_ms": ">5000"})
print(f"Calls > 5s: {len(slow_calls)}")
for call in slow_calls:
print(f" {call.timestamp}: {call.duration_ms}ms, query={call.params['query']}")
Quick fix (< 5 minutes):
1. Increase timeout value
→ If timeout is 10s, increase to 30s
→ Doesn't fix slowness, but prevents crashes
2. Check if external API is slow
→ Test API directly (curl request)
→ Check API status page
→ If API is slow: not your problem
3. Check network connectivity
→ High latency? → Move closer to API or use proxy
4. If specific queries are slow:
→ Add caching for common queries
→ Avoid re-fetching same results
5. Implement fallback
→ If tool times out, use cached/default value
→ Continue instead of failing
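Quick-fix steps 1 and 5 can be combined: if a blocking tool doesn't expose a timeout parameter, you can impose a hard deadline from the outside and fall back instead of failing. A minimal sketch using the standard library (note the caveat in the comment; `call_with_deadline` is a hypothetical helper name):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_EXECUTOR = ThreadPoolExecutor(max_workers=4)

def call_with_deadline(tool_fn, *args, timeout=10, fallback=None, **kwargs):
    """Run a blocking tool with a hard deadline; return fallback on timeout.
    Caveat: the worker thread keeps running after the deadline, so this caps
    caller latency but does not free the underlying resource."""
    future = _EXECUTOR.submit(tool_fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return fallback
```

Usage: `call_with_deadline(search_tool, "query", timeout=5, fallback={"results": []})` keeps the agent loop moving while you fix the slow tool.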
Proper fix (permanent):
- Use async I/O instead of blocking:

# Bad: blocking I/O
def search_web(query):
    import requests  # blocking
    response = requests.get(f"https://api.search.com?q={query}")
    return response.json()

# Good: async I/O
async def search_web(query):
    import aiohttp  # non-blocking
    async with aiohttp.ClientSession() as session:
        async with session.get(f"https://api.search.com?q={query}") as resp:
            return await resp.json()

- Add timeout with graceful degradation:

import asyncio

async def search_web_with_timeout(query, timeout=5):
    try:
        return await asyncio.wait_for(search_web(query), timeout=timeout)
    except asyncio.TimeoutError:
        # Instead of crashing, fall back to a cached result
        cached = get_cached_result(query)
        if cached:
            log.warning("TOOL_TIMEOUT_USING_CACHE", {
                "query": query,
                "cache_age": get_cache_age(query)
            })
            return cached
        # No cache either: return an empty default result
        return {"error": "timeout", "results": []}

- Implement caching for repeated queries:

import hashlib
import time

SEARCH_CACHE = {}
CACHE_TTL = 3600  # 1 hour

def search_web_cached(query, cache_ttl=CACHE_TTL):
    cache_key = hashlib.md5(query.encode()).hexdigest()
    if cache_key in SEARCH_CACHE:
        cached_entry = SEARCH_CACHE[cache_key]
        age = time.time() - cached_entry["timestamp"]
        if age < cache_ttl:
            return cached_entry["result"]
    # Not in cache or expired: fetch
    result = search_web(query)  # may still time out
    SEARCH_CACHE[cache_key] = {
        "timestamp": time.time(),
        "result": result
    }
    return result

- Monitor tool latency continuously:

TOOL_LATENCIES = {
    "web_search": [],
    "read_file": [],
    # ...
}

def track_tool_latency(tool_name, duration_ms):
    TOOL_LATENCIES[tool_name].append(duration_ms)
    # Calculate percentiles
    latencies = sorted(TOOL_LATENCIES[tool_name])
    p50 = latencies[len(latencies) // 2]
    p95 = latencies[int(len(latencies) * 0.95)]
    p99 = latencies[int(len(latencies) * 0.99)]
    # Alert on degradation
    if p99 > LATENCY_THRESHOLD:
        log.alert("TOOL_LATENCY_HIGH", {
            "tool": tool_name,
            "p50": p50, "p95": p95, "p99": p99
        })
Part 4: Memory Issues
Issue: Memory Corruption
Symptoms:
- Agent uses wrong facts/outdated information
- Agent mixes up information from different sessions
- Agent contradicts itself (says X then says not X)
- Quality suddenly drops
Root causes:
- Session mixing — memory from session A leaks into session B
- Cache stale — old cached result is served instead of fresh
- Consolidation error — summarization loses important details
- File corruption — memory file partially written/truncated
Diagnostic steps:
# Step 1: Check for session contamination
session_a = get_session("session-123")
session_b = get_session("session-456")
context_a = session_a.full_context
context_b = session_b.full_context
# Are they independent?
if any_facts_in_both(context_a, context_b):
print("WARNING: Sessions share facts (should be independent)")
# Step 2: Check memory file integrity
import hashlib
with open("MEMORY.md", "r") as f:
content = f.read()
checksum = hashlib.md5(content.encode()).hexdigest()
expected_checksum = KNOWN_GOOD_CHECKSUM
if checksum != expected_checksum:
print("✗ Memory file corrupted (checksum mismatch)")
# Step 3: Verify consolidation didn't lose info
before_consolidation = get_memory_snapshot("before")
after_consolidation = get_memory_snapshot("after")
lost_facts = facts_in_before_not_after(before_consolidation, after_consolidation)
if lost_facts:
print(f"✗ Consolidation lost {len(lost_facts)} facts:")
for fact in lost_facts:
print(f" - {fact}")
# Step 4: Check cache staleness
cache_entry = get_cache("query-123")
age = time.time() - cache_entry.created_at
if age > CACHE_TTL:
print(f"WARNING: Cache entry is stale ({age}s old, TTL={CACHE_TTL}s)")
# Step 5: Check for partial writes
file_path = "MEMORY.md"
file_size = os.path.getsize(file_path)
expected_size = estimate_file_size(file_content)
if file_size != expected_size:
print(f"WARNING: File size mismatch ({file_size} vs expected {expected_size})")
print("→ File may have been partially written")
Quick fix (< 5 minutes):
1. Cold restart session
→ Start new session without old memory
→ Does quality improve? → Memory corruption confirmed
2. Clear cache
→ Delete SEARCH_CACHE
→ Memory files should regenerate
3. Check file permissions
→ Can harness write to MEMORY.md?
→ Are there write conflicts?
4. Revert recent memory changes
→ If MEMORY.md was recently edited, revert
→ git checkout MEMORY.md
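Step 4 is easier to reason about if you can see exactly what changed. A minimal sketch using `difflib` (the `.MEMORY.backup` path assumes you keep a backup copy, as described under the proper fixes; adjust to your layout):

```python
import difflib

def diff_against_backup(current_path="MEMORY.md", backup_path=".MEMORY.backup"):
    """Unified diff of the current memory file against the last backup.
    Lines starting with '-' were lost or changed since the backup."""
    with open(current_path) as f:
        current = f.readlines()
    with open(backup_path) as f:
        backup = f.readlines()
    return list(difflib.unified_diff(backup, current,
                                     fromfile=backup_path, tofile=current_path))
```

An empty diff means the file itself is intact and the corruption is likely in caching or session handling instead.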
Proper fix (permanent):
- Isolate sessions with a session ID:

# Every memory entry must include session_id
class MemoryEntry:
    def __init__(self, content, session_id):
        self.content = content
        self.session_id = session_id
        self.created_at = datetime.now()

# Before using memory, verify session_id matches
def get_memory_for_session(session_id):
    all_entries = load_memory_file()
    return [e for e in all_entries if e.session_id == session_id]

- Implement memory versioning:

# Save versions of MEMORY.md:
#   MEMORY.md                     (current)
#   .MEMORY.backup                (previous)
#   .MEMORY.v1, .MEMORY.v2, ...   (history)
def save_memory_with_backup(new_memory_content):
    if os.path.exists("MEMORY.md"):
        shutil.copy("MEMORY.md", ".MEMORY.backup")
    # Write new version
    with open("MEMORY.md", "w") as f:
        f.write(new_memory_content)
    # Keep history
    timestamp = int(time.time())
    shutil.copy("MEMORY.md", f".MEMORY.v{timestamp}")

def rollback_memory(version):
    """Restore memory to a previous version"""
    shutil.copy(f".MEMORY.v{version}", "MEMORY.md")
    log.info("MEMORY_ROLLED_BACK", {"version": version})

- Add memory file validation:

def validate_memory_file():
    """Check the memory file for corruption"""
    with open("MEMORY.md", "r") as f:
        content = f.read()
    # Check for common corruption signs
    if len(content) == 0:
        raise Exception("Memory file is empty (truncation)")
    if content.count("```") % 2 != 0:
        raise Exception("Memory file has unmatched code blocks (partial write)")
    # Verify JSON blocks are valid
    import json
    for block in extract_json_blocks(content):
        try:
            json.loads(block)
        except json.JSONDecodeError as e:
            raise Exception(f"Invalid JSON in memory: {e}")
    return True

- Implement atomic writes:

import tempfile

def write_memory_atomically(content):
    """Write the memory file atomically (no partial writes)"""
    # Write to a temporary file first
    with tempfile.NamedTemporaryFile(
        mode="w", dir=".", delete=False, suffix=".tmp"
    ) as tmp:
        tmp.write(content)
        tmp_path = tmp.name
    # Validate the temporary file
    validate_memory_file_at_path(tmp_path)
    # Only then replace the original
    os.replace(tmp_path, "MEMORY.md")
    log.info("MEMORY_WRITTEN_ATOMICALLY")
Issue: Memory Loss
Symptoms:
- Agent doesn’t remember previous sessions
- Agent repeats work from earlier
- Agent says “I don’t have context” but information existed in memory
Root causes:
- Memory file not persisted — in-memory cache, lost on restart
- Memory pruned too aggressively — old memories deleted
- Memory not loaded on startup — file exists but not read
- Wrong session ID — looking for memories from different session
- Memory file deleted — accidental deletion or crash
Diagnostic steps:
# Step 1: Check if memory file exists
import os
if not os.path.exists("MEMORY.md"):
print("✗ MEMORY.md does not exist")
else:
file_size = os.path.getsize("MEMORY.md")
print(f"✓ MEMORY.md exists ({file_size} bytes)")
# Step 2: Check if memory is being read on startup
startup_log = get_session_log(session_id).startup_events
memory_events = [e for e in startup_log if e.event == "memory_loaded"]
if not memory_events:
print("✗ Memory not being loaded on startup")
else:
for event in memory_events:
print(f"✓ Loaded {event.facts_count} facts from MEMORY.md")
# Step 3: Check if memory is being written
write_events = get_logs(event="memory_written", last_n_hours=24)
if not write_events:
print("WARNING: No memory writes in last 24 hours")
else:
print(f"✓ Memory written {len(write_events)} times")
# Step 4: Check if information is in memory file
fact = "Important fact that should be remembered"
with open("MEMORY.md", "r") as f:
memory_content = f.read()
if fact in memory_content:
print(f"✓ Fact is in MEMORY.md")
else:
print(f"✗ Fact NOT in MEMORY.md")
print("→ Was it ever saved?")
# Step 5: Check memory pruning settings
from harness.config import MEMORY_CONFIG
print(f"Memory retention: {MEMORY_CONFIG.retention_days} days")
print(f"Max memory size: {MEMORY_CONFIG.max_tokens} tokens")
print(f"Pruning frequency: every {MEMORY_CONFIG.prune_interval_hours} hours")
Quick fix (< 5 minutes):
1. Check if MEMORY.md exists
→ If it doesn't, create it with bootstrap facts
2. Check if memory is being loaded
→ Look for memory_loaded event in startup
→ If missing, add memory loading to startup
3. Check if memory is persisted
→ Write a test fact to MEMORY.md
→ Restart harness
→ Is the fact still there?
4. If memory is being pruned too aggressively:
→ Increase retention period (retention_days)
→ Increase max memory size (max_tokens)
→ Reduce pruning frequency
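Step 3's persistence test can be scripted so it is repeatable. A minimal sketch (`persistence_round_trip` is a hypothetical helper; in a real check you would restart the harness between the write and the read):

```python
def persistence_round_trip(path="MEMORY.md"):
    """Append a sentinel fact, then confirm it survives a re-read.
    Restart the harness between the write and the read to test real
    persistence rather than just the filesystem."""
    sentinel = "SENTINEL: persistence check"
    with open(path, "a") as f:
        f.write("\n" + sentinel + "\n")
    with open(path) as f:
        return sentinel in f.read()
```

If this returns True but the agent still "forgets", the file is persisting fine and the bug is in loading (Step 2) or session-ID matching.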
Proper fix (permanent):
- Implement automatic memory persistence:

def load_memory_on_startup():
    """Load all memory files on startup"""
    memory_files = [
        "CLAUDE.md",        # Instructions
        "MEMORY.md",        # Consolidated facts
        "current_task.md"   # Current work
    ]
    for filepath in memory_files:
        if os.path.exists(filepath):
            with open(filepath, "r") as f:
                content = f.read()
            agent.memory.add(filepath, content)
            log.info("MEMORY_LOADED", {"file": filepath})
        else:
            log.warning("MEMORY_FILE_MISSING", {"file": filepath})
    return agent.memory

# Call on startup
agent.memory = load_memory_on_startup()

- Implement periodic memory checkpoints:

import threading

def memory_checkpoint_loop():
    """Save memory every N minutes"""
    while True:
        time.sleep(300)  # every 5 minutes
        # Get current memory state
        memory_content = agent.memory.export()
        # Write to file
        write_memory_atomically(memory_content)
        log.debug("MEMORY_CHECKPOINT", {
            "size_bytes": len(memory_content),
            "timestamp": datetime.now()
        })

# Start the checkpoint thread
checkpoint_thread = threading.Thread(
    target=memory_checkpoint_loop,
    daemon=True
)
checkpoint_thread.start()

- Implement memory recovery:

import glob

def recover_memory_from_backup():
    """If memory is corrupted, recover from backup"""
    if os.path.exists(".MEMORY.backup"):
        log.alert("MEMORY_RECOVERY_STARTING", {"source": ".MEMORY.backup"})
        shutil.copy(".MEMORY.backup", "MEMORY.md")
        return True
    # If no backup, try version history
    versions = glob.glob(".MEMORY.v*")
    if versions:
        latest_version = max(versions)
        log.alert("MEMORY_RECOVERY_FROM_VERSION", {"source": latest_version})
        shutil.copy(latest_version, "MEMORY.md")
        return True
    # No backup or versions: reset to empty
    log.alert("MEMORY_RESET", {"reason": "no_backup_available"})
    write_memory_atomically("")
    return False

- Verify memory on each load:

def load_and_validate_memory():
    """Load memory and verify it's not corrupted"""
    try:
        memory = load_memory_on_startup()
        # Validate
        if len(memory) == 0:
            log.warning("MEMORY_EMPTY")
        # Verify basic structure
        facts_count = count_facts(memory)
        log.info("MEMORY_LOADED", {
            "facts_count": facts_count,
            "bytes": len(str(memory))
        })
        return memory
    except MemoryCorruptionError:
        log.alert("MEMORY_CORRUPTED", {"action": "attempting recovery"})
        if recover_memory_from_backup():
            return load_memory_on_startup()
        # Recovery failed: start with empty memory
        return Memory()
Part 5: Cost & Budget Issues
Issue: Unexpected Cost Spike
Symptoms:
- Daily cost > 2× normal
- Unexpected charge from API provider
- Cost spike with no corresponding increase in usage
- One specific agent/session costs $100+ when typical is $10
Root causes:
- Runaway token generation — agent producing huge outputs
- Loop with high tokens — agent looping and using context each time
- Expensive model — switched to more expensive model
- Inefficient prompts — prompts grew in token size
- New feature using expensive model — verification using expensive LLM
Diagnostic steps:
# Step 1: Identify timing of spike
cost_by_hour = get_costs_by_hour(last_24_hours=True)
for hour, cost in cost_by_hour:
if cost > 2 * NORMAL_HOURLY_COST:
print(f"SPIKE at {hour}: ${cost} (2x normal)")
# Step 2: Identify which agent/session caused spike
expensive_sessions = get_sessions_sorted_by_cost(limit=10)
for session in expensive_sessions:
print(f"Session {session.id}: ${session.cost}")
print(f" Agent: {session.agent_id}")
print(f" Duration: {session.duration_seconds}s")
print(f" Iterations: {session.loop_iterations}")
print(f" Input tokens: {session.input_tokens}")
print(f" Output tokens: {session.output_tokens}")
# Step 3: Check if model changed
logs = get_logs(event="session_start", last_24_hours=True)
models_used = set(log.model for log in logs)
print(f"Models used: {models_used}")
if len(models_used) > 1:
print("WARNING: Multiple models used")
model_costs = {}
for model in models_used:
cost = sum(log.cost for log in logs if log.model == model)
model_costs[model] = cost
print(f"Cost by model: {model_costs}")
# Step 4: Check if prompts grew
old_prompt_size = get_avg_prompt_size(days=7)
new_prompt_size = get_avg_prompt_size(days=1)
growth = (new_prompt_size - old_prompt_size) / old_prompt_size
if growth > 0.2:
print(f"WARNING: Prompts grew {growth:.1%}")
# Step 5: Check iteration counts
expensive_session = expensive_sessions[0]
for step in expensive_session.steps:
print(f"Iteration {step.iteration}: "
f"input={step.input_tokens}, output={step.output_tokens}")
if step.output_tokens > 5000:
print(f" ^ Huge output ({step.output_tokens} tokens)")
Quick fix (< 5 minutes):
1. Identify the expensive session
→ Which session caused the spike?
→ What was it doing?
2. Check if model is wrong
→ Should it be using Claude 3.5 or Claude 3 Opus?
→ Revert to correct model
3. If looping excessively:
→ Set max iterations to 10
→ Kill any sessions > 15 iterations
4. If output tokens huge:
→ Check if agent is generating full documents
→ Limit output size
5. Enable cost alerts
→ Alert if cost > budget per session
→ Prevent cascade of expensive requests
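Step 3's "kill any sessions > 15 iterations" can be scripted against whatever session listing your harness exposes. A minimal sketch (the dict shape and the iteration cap are illustrative assumptions):

```python
MAX_ITERATIONS = 15

def find_runaway_sessions(sessions, max_iterations=MAX_ITERATIONS):
    """Return the ids of sessions that exceeded the iteration cap.
    Each session is assumed to be a dict with 'id' and 'iterations' keys."""
    return [s["id"] for s in sessions if s["iterations"] > max_iterations]
```

Feed the result to your harness's terminate call (e.g. `terminate_session(sid)` for each id) to stop the spike before it compounds.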
Proper fix (permanent):
- Implement per-session cost budgets:

class CostBudgetEnforcer:
    def __init__(self, max_cost_per_session: float = 1.0):
        self.max_cost = max_cost_per_session

    def check_budget_before_step(self, session_id: str):
        current_cost = get_session_cost(session_id)
        if current_cost > self.max_cost:
            raise BudgetExceededError(
                f"Session cost ${current_cost} exceeds budget ${self.max_cost}"
            )

    def check_budget_after_step(self, session_id: str, step_cost: float):
        current_cost = get_session_cost(session_id)
        if current_cost > self.max_cost:
            log.alert("BUDGET_EXCEEDED", {
                "session_id": session_id,
                "cost": current_cost,
                "budget": self.max_cost
            })
            terminate_session(session_id)

# Use in the agent loop
enforcer = CostBudgetEnforcer(max_cost_per_session=5.0)
for step in agent_steps:
    enforcer.check_budget_before_step(session.id)
    result = execute_step()
    enforcer.check_budget_after_step(session.id, result.cost)

- Implement cost alerts:

def cost_alert_system():
    """Alert when costs exceed thresholds"""
    COST_THRESHOLDS = {
        "daily": 1000,   # alert if daily cost > $1000
        "hourly": 100,   # alert if hourly cost > $100
        "session": 10,   # alert if session cost > $10
        "step": 1,       # alert if step cost > $1
    }
    while True:
        costs = get_current_costs()
        if costs["daily"] > COST_THRESHOLDS["daily"]:
            send_alert(f"Daily cost ${costs['daily']} exceeded")
        if costs["hourly"] > COST_THRESHOLDS["hourly"]:
            send_alert(f"Hourly cost ${costs['hourly']} exceeded")
        time.sleep(60)

- Track and alert on model changes:

EXPECTED_MODELS = {
    "general_agent": "claude-3-5-sonnet",
    "verification_agent": "claude-3-opus",
}

def verify_model_on_startup(agent_id: str):
    expected = EXPECTED_MODELS[agent_id]
    actual = get_model_for_agent(agent_id)
    if expected != actual:
        log.alert("MODEL_MISMATCH", {
            "agent_id": agent_id,
            "expected": expected,
            "actual": actual,
            "cost_difference": get_cost_difference(expected, actual)
        })

- Implement cost attribution:

def log_cost_attribution():
    """Break down costs by agent, model, tool, etc."""
    costs_by_agent = {}
    costs_by_model = {}
    costs_by_tool = {}
    for session in get_all_sessions():
        agent = session.agent_id
        model = session.model
        costs_by_agent[agent] = costs_by_agent.get(agent, 0) + session.cost
        costs_by_model[model] = costs_by_model.get(model, 0) + session.cost
        for step in session.steps:
            if step.tool_name:
                costs_by_tool[step.tool_name] = \
                    costs_by_tool.get(step.tool_name, 0) + step.cost
    log.info("COST_ATTRIBUTION", {
        "by_agent": costs_by_agent,
        "by_model": costs_by_model,
        "by_tool": costs_by_tool
    })
Issue: Cost Exceeding Budget
Symptoms:
- Monthly cost exceeds allocated budget
- No single spike, but slow creep upward
- New feature is more expensive than projected
- Cost per task higher than expected
Root causes:
- Inefficient prompts — prompts larger than necessary
- Inefficient model choice — using expensive model for simple tasks
- No caching — repeating expensive computations
- Feature too expensive — new feature costs more than projected
- Volume growth — more requests than anticipated
Diagnostic steps:
# Step 1: Compare projected vs actual costs
budget = get_monthly_budget()
actual_cost = get_monthly_cost()
print(f"Budget: ${budget}")
print(f"Actual: ${actual_cost}")
print(f"Over budget by: ${actual_cost - budget}")
# Step 2: Break down costs by feature
costs_by_feature = {}
for session in get_sessions_this_month():
feature = session.tags[0] if session.tags else "unknown"
costs_by_feature[feature] = costs_by_feature.get(feature, 0) + session.cost
for feature, cost in sorted(costs_by_feature.items(), key=lambda x: x[1], reverse=True):
print(f"{feature}: ${cost}")
# Step 3: Compare to baseline
baseline_cost_per_task = get_historical_average("cost_per_task")
current_cost_per_task = get_current_average("cost_per_task")
change = (current_cost_per_task - baseline_cost_per_task) / baseline_cost_per_task
print(f"Cost per task: ${baseline_cost_per_task} → ${current_cost_per_task} ({change:.1%})")
# Step 4: Check model distribution
models = {}
for session in get_sessions_this_month():
model = session.model
models[model] = models.get(model, 0) + session.cost
print("Cost by model:")
for model, cost in sorted(models.items(), key=lambda x: x[1], reverse=True):
print(f" {model}: ${cost}")
# Step 5: Check for low-hanging optimization
caching_potential = estimate_caching_potential()
print(f"Caching potential: Save ${caching_potential}")
model_switch_potential = estimate_model_switch_potential()
print(f"Model switch potential: Save ${model_switch_potential}")
Quick fix (< 5 minutes):
1. Identify the most expensive feature
→ Break down by feature tag
→ Focus on top 3 expensive features
2. Check if there's easy caching potential
→ Same queries repeating?
→ Add caching, reduce cost 20-30%
3. Check model choice
→ Is expensive model necessary?
→ Can you use cheaper model for 80% of tasks?
4. Reduce prompt size if possible
→ Remove unnecessary context
→ Compress memory file
→ Each 1000 tokens saved = 3-15% cost reduction
5. Adjust routing/filtering
→ Can some tasks be answered without LLM?
→ Route simple tasks to tool instead of LLM
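Step 5's pre-LLM routing can start as something very simple. A minimal sketch of a keyword-based router (the keyword list and the "tool"/"llm" labels are illustrative; real routing logic would be tuned to your traffic):

```python
def route_task(query, simple_keywords=("ping", "status", "version")):
    """Send trivially simple requests to a cheap deterministic handler
    instead of paying for a model call."""
    words = query.lower().split()
    if len(words) <= 3 and any(k in words for k in simple_keywords):
        return "tool"   # deterministic handler, no model call
    return "llm"        # everything else still goes to the model
```

Even a crude router like this removes the cheapest requests from the model bill; measure the hit rate before investing in anything smarter.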
Proper fix (permanent):
- Implement cost-aware model routing:

def select_model_for_task(task_complexity: str):
    """Route to the cheapest model that meets requirements"""
    MODELS = {
        "simple":   ("gemini-2-flash", 0.06),     # cheapest, fast
        "moderate": ("claude-3-5-sonnet", 1.0),   # good balance
        "complex":  ("claude-3-opus", 5.0),       # best reasoning
    }
    model, estimated_cost = MODELS[task_complexity]
    # If the estimated cost exceeds the threshold, try a cheaper tier first
    if estimated_cost > COST_THRESHOLD:
        cheaper_tiers = [
            tier for tier, (_, cost) in MODELS.items()
            if cost < estimated_cost
        ]
        # Test whether the cheaper model handles the task acceptably
        if cheaper_tiers and try_with_model(MODELS[cheaper_tiers[0]][0], task):
            model, estimated_cost = MODELS[cheaper_tiers[0]]
    return model, estimated_cost

- Implement smart caching:

QUERY_CACHE = {}
CACHE_TTL = 86400  # 24 hours

def get_with_cache(query: str, expensive_operation):
    cache_key = hashlib.sha256(query.encode()).hexdigest()
    if cache_key in QUERY_CACHE:
        entry = QUERY_CACHE[cache_key]
        age = time.time() - entry["timestamp"]
        if age < CACHE_TTL:
            log.debug("CACHE_HIT", {"query": query})
            entry["hits"] += 1
            return entry["result"]
    # Not cached (or expired): execute
    result = expensive_operation()
    QUERY_CACHE[cache_key] = {
        "result": result,
        "timestamp": time.time(),
        "hits": 0
    }
    return result

# Estimate savings: every hit avoided one paid query
cache_hits = sum(entry["hits"] for entry in QUERY_CACHE.values())
total_savings = cache_hits * AVERAGE_QUERY_COST
print(f"Cache savings: ${total_savings}")

- Implement per-feature cost tracking:

def track_feature_cost(feature_name: str, session_cost: float):
    """Track cumulative cost per feature"""
    FEATURE_BUDGETS = {
        "search": 100,     # max $100/month for search
        "summarize": 50,   # max $50/month for summarize
        "translate": 30,   # max $30/month for translate
    }
    current_month_cost = get_feature_cost_this_month(feature_name)
    budget = FEATURE_BUDGETS.get(feature_name, float('inf'))
    if current_month_cost + session_cost > budget:
        log.alert("FEATURE_BUDGET_EXCEEDED", {
            "feature": feature_name,
            "current_cost": current_month_cost,
            "session_cost": session_cost,
            "budget": budget
        })
Part 6: Quality Issues
Issue: Hallucinations Increased
Symptoms:
- Model making up facts not in context
- Model confident about false information
- Model citing sources that don’t exist
- Factual accuracy dropped
Root causes:
- Model drift — model behavior changed with update
- Prompt changed — instruction change causing more creativity
- Temperature increased — more randomness/creativity
- Memory corruption — mixing up facts from different contexts
- Context too short — model hallucinating to fill gaps
Diagnostic steps:
# Step 1: Measure hallucination rate
responses = get_responses_this_week()
hallucinations = []
for response in responses:
facts = extract_facts(response)
for fact in facts:
if not is_in_context(fact, response.context):
if not is_known_fact(fact):
hallucinations.append({
"fact": fact,
"response": response.id,
"timestamp": response.timestamp
})
hallucination_rate = len(hallucinations) / len(responses)
print(f"Hallucination rate: {hallucination_rate:.1%}")
baseline_rate = get_historical_hallucination_rate()
if hallucination_rate > baseline_rate * 1.5:
print(f"WARNING: 50% increase from baseline ({baseline_rate:.1%})")
# Step 2: Check for recent changes
recent_changes = get_recent_changes(last_24_hours=True)
for change in recent_changes:
print(f"Change: {change.type}")
if change.type == "prompt":
print(f" Before: {change.old_value[:100]}")
print(f" After: {change.new_value[:100]}")
elif change.type == "model":
print(f" {change.old_value} → {change.new_value}")
elif change.type == "temperature":
print(f" {change.old_value} → {change.new_value}")
# Step 3: Check model and parameters
print(f"Model: {agent.model}")
print(f"Temperature: {agent.temperature}")
print(f"Top P: {agent.top_p}")
# Higher temperature = more random/creative
if agent.temperature > 0.5:
print("WARNING: High temperature may cause hallucinations")
# Step 4: Check context size
avg_context_size = get_average_context_size()
print(f"Average context: {avg_context_size} tokens")
if avg_context_size < 1000:
print("WARNING: Small context may cause hallucinations")
# Step 5: Compare staging vs production
staging_hallucination_rate = get_hallucination_rate("staging")
prod_hallucination_rate = get_hallucination_rate("production")
print(f"Staging: {staging_hallucination_rate:.1%}")
print(f"Production: {prod_hallucination_rate:.1%}")
if prod_hallucination_rate > staging_hallucination_rate:
print("WARNING: Production has higher hallucination rate")
Quick fix (< 5 minutes):
1. Reduce temperature
→ Set temperature to 0.3 instead of 0.7
→ More deterministic = fewer hallucinations
2. Check for recent model change
→ Did you upgrade model in last 24h?
→ Rollback to previous model
→ Test if hallucination rate drops
3. Check for prompt changes
→ Did someone edit the system prompt?
→ Revert prompt to working version
4. Add fact verification step
→ After agent generates response
→ Agent must cite sources for each fact
→ If no source, agent must admit uncertainty
Proper fix (permanent):
- Implement a fact verification loop:

def verify_facts_in_response(response: str, context: str):
    """Verify each fact in the response comes from the context"""
    facts = extract_facts(response)
    unverified_facts = [
        fact for fact in facts
        if fact not in context and not is_well_known_fact(fact)
    ]
    if unverified_facts:
        # Ask the agent to remove or cite these facts
        prompt = f"""
        Your response contains these facts not in the provided context:
        {unverified_facts}

        For each fact:
        - Remove it if it's speculation
        - Or cite which document supports it

        Revised response:
        """
        return agent.continue_conversation(prompt)
    return response

- Add confidence scoring:

def add_confidence_scores(response: str):
    """Ask the agent to add confidence scores to facts"""
    prompt = """
    Review your response and add confidence scores:
    - [HIGH]: Directly from provided documents
    - [MEDIUM]: Reasonable inference from documents
    - [LOW]: General knowledge, not in documents
    - [UNCERTAIN]: Not sure, may be wrong

    Example: "The company has [HIGH] 1000 employees and likely [MEDIUM]
    plans expansion, though I'm [UNCERTAIN] about the timeline."

    Response with confidence scores:
    """
    return agent.continue_conversation(prompt)

- Baseline and monitor the hallucination rate:

class HallucinationMonitor:
    def __init__(self, baseline_rate: float = 0.05):
        self.baseline_rate = baseline_rate          # 5%
        self.alert_threshold = baseline_rate * 1.5  # 7.5%

    def check_hallucination_rate(self):
        current_rate = measure_current_hallucination_rate()
        if current_rate > self.alert_threshold:
            log.alert("HALLUCINATION_RATE_HIGH", {
                "baseline": self.baseline_rate,
                "current": current_rate,
                "threshold": self.alert_threshold
            })
            return False
        return True

monitor = HallucinationMonitor()
monitor.check_hallucination_rate()
Part 7: Performance Issues
Issue: Slow Inference
Symptoms:
- Model takes 5-10+ seconds to generate first token
- All requests slow, not just some
- Latency increases over time (doesn’t improve with restart)
- Model loading slower than before
Root causes:
- Large context window — more tokens = slower processing
- Model size increase — switched to larger model
- GPU out of memory — falling back to CPU (orders of magnitude slower)
- Model not cached — reloading model from disk each time
- Increased load — GPU busy with other requests
Diagnostic steps:
# Step 1: Measure latency
start = time.time()
response = model.generate("test prompt")
latency = time.time() - start
first_token_latency = response.metrics["first_token_ms"]
print(f"Total latency: {latency*1000:.0f}ms")
print(f"First token: {first_token_latency:.0f}ms")
# Normal first token: 50-200ms
# Slow first token: > 500ms (suggests issue)
# Step 2: Check GPU usage
import gpustat
gpu_info = gpustat.new_query()
for gpu in gpu_info:
    print(f"GPU {gpu.index}: {gpu.utilization}% used, {gpu.memory_used}/{gpu.memory_total} MB")
if any(gpu.memory_used > gpu.memory_total * 0.9 for gpu in gpu_info):
    print("WARNING: GPU running low on memory")
# Step 3: Check context size
context_size = count_tokens(full_context)
print(f"Context size: {context_size} tokens")
# More context = slower processing
# Typical: 5-10K tokens
# Slow: > 100K tokens
# Step 4: Check model size
model_info = get_model_info(model_name)
print(f"Model: {model_name}")
print(f"Model size: {model_info.parameters} parameters")
# 7B model: ~13GB memory
# 13B model: ~26GB memory
# 70B model: ~140GB memory (needs multi-GPU)
# Step 5: Check if model is cached
if is_model_loaded_in_memory():
print("✓ Model in memory (fast)")
else:
print("✗ Model not in memory, will load from disk (slow)")
Quick fix (< 5 minutes):
1. Check if GPU is out of memory
→ Run nvidia-smi
→ If used > 90%, try restarting to free memory
2. Check context size
→ Is it much larger than before?
→ Reduce context (prune old memories)
3. Check if model loaded
→ Is model in GPU memory?
→ Load it once, don't reload each request
4. Reduce batch size if applicable
→ If processing multiple requests, reduce batch
→ Gives GPU more free memory per request
5. Profile to find bottleneck
→ Which part is slow? (model loading, inference, tokenization?)
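For step 5, a lightweight stage timer is often enough to find the bottleneck before reaching for a full profiler. A minimal sketch using only the standard library (the stage names are illustrative):

```python
import time
from contextlib import contextmanager

STAGE_TIMINGS = {}

@contextmanager
def timed(stage):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        STAGE_TIMINGS.setdefault(stage, []).append(time.perf_counter() - start)
```

Wrap each suspect stage, e.g. `with timed("tokenize"): ...`, `with timed("model_load"): ...`, `with timed("inference"): ...`, then compare the accumulated lists in `STAGE_TIMINGS` to see where the time actually goes.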
Proper fix (permanent):
- Implement model caching:

import gc

class ModelCache:
    def __init__(self):
        self.cached_models = {}

    def load_model(self, model_name: str):
        if model_name not in self.cached_models:
            print(f"Loading {model_name}...")
            model = load_model_from_disk(model_name)
            self.cached_models[model_name] = model
        return self.cached_models[model_name]

    def unload_unused_models(self):
        # Keep only the 2 most recently used models in memory
        if len(self.cached_models) > 2:
            oldest = min(self.cached_models.items(),
                         key=lambda x: x[1].last_used)
            del self.cached_models[oldest[0]]
            gc.collect()

- Implement async batching:

import asyncio

class InferenceBatcher:
    def __init__(self, batch_size: int = 4):
        self.batch_size = batch_size
        self.queue = asyncio.Queue()

    async def add_request(self, prompt: str):
        await self.queue.put(prompt)

    async def process_batches(self):
        while True:
            batch = []
            # Collect up to batch_size pending requests
            for _ in range(self.batch_size):
                try:
                    batch.append(self.queue.get_nowait())
                except asyncio.QueueEmpty:
                    break
            if batch:
                # Process the batch together (faster than one-by-one)
                results = model.generate_batch(batch)
                dispatch_results(results)  # placeholder: hand results back to callers
            await asyncio.sleep(0.1)

- Monitor and alert on latency degradation:

class LatencyMonitor:
    def __init__(self):
        self.baseline_latency = 150  # ms
        self.alert_threshold = 500   # ms

    def check_latency(self, latency_ms: float):
        if latency_ms > self.alert_threshold:
            degradation = (latency_ms - self.baseline_latency) / self.baseline_latency
            log.alert("LATENCY_DEGRADATION", {
                "baseline": self.baseline_latency,
                "current": latency_ms,
                "degradation": f"{degradation:.0%}"
            })
Part 8: Common Error Messages
Error: 429 Rate Limit Exceeded
Message:
APIError: 429 Rate limit exceeded. Please retry after 60 seconds.
What it means: You’ve made too many requests to the API. The API is rate-limiting you to prevent abuse.
Causes:
- Too many concurrent requests
- Exceeded monthly token quota
- Burst above the provider's requests-per-minute or tokens-per-minute limit
Quick fix:
import time

def call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError as e:
            if attempt < max_retries - 1:
                wait_time = int(e.retry_after) if hasattr(e, 'retry_after') else 2 ** attempt
                print(f"Rate limited, waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
Error: Context Window Exceeded
Message:
ContextLengthExceededError: Prompt too long (8,532 tokens > 4,096 max)
What it means: Your prompt (context + message) is too long for the model. Need to reduce it.
Quick fix:
- Switch to model with larger context window (e.g., Claude 3.5 with 200K)
- Reduce startup memory (CLAUDE.md, MEMORY.md)
- Summarize old messages
- Use compression (LLM Wiki pattern)
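A minimal sketch of the "summarize or drop old messages" option: keep the newest messages within a token budget, always preserving the system prompt. The whitespace-based `count_tokens` here is a crude stand-in for your real tokenizer:

```python
def count_tokens(text: str) -> int:
    # Crude approximation; swap in your tokenizer's real count
    return len(text.split())

def trim_context(messages: list, max_tokens: int) -> list:
    # Keep the most recent messages that fit the budget,
    # always preserving the first (system) message.
    system, rest = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system)
    kept = []
    for msg in reversed(rest):
        cost = count_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))

msgs = ["system prompt", "old msg one", "old msg two", "recent question"]
print(trim_context(msgs, max_tokens=8))
```

Pruning from the oldest end first works because the system prompt and recent turns usually matter most.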
Error: Model Not Found
Message:
APIError: Model 'gpt-4-turbo-2024-04-09' not found
Causes:
- Model deprecated/removed
- Typo in model name
- Wrong API provider
Quick fix:
- List available models:
curl https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY"
- Check model documentation for current available models
- Use standardized model names from documentation
Part 9: FAQ — Frequently Asked Questions
Q: Why is my agent looping?
A: Agents loop when:
- Tool keeps failing — agent thinks it should retry
- Task is ambiguous — agent doesn’t know when to stop
- No termination logic — max_iterations not set
Fix: See “Agent Stuck in Loop” section above.
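A termination guard combining the two mechanical fixes (cap iterations, detect identical repeated tool calls) can be sketched like this; `LoopGuard` is illustrative, not from a specific framework:

```python
class LoopGuard:
    """Illustrative termination guard: caps iterations, flags repeated calls."""

    def __init__(self, max_iterations: int = 15, max_repeats: int = 3):
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self.history = []

    def check(self, iteration: int, tool_name: str, tool_args: str):
        # Returns a termination reason string, or None to continue
        if iteration >= self.max_iterations:
            return "max_iterations"
        self.history.append((tool_name, tool_args))
        if self.history.count((tool_name, tool_args)) >= self.max_repeats:
            return "repeated_identical_call"
        return None

guard = LoopGuard()
reason = None
for i in range(10):
    # An agent re-issuing the exact same tool call every iteration
    reason = guard.check(i, "search", "query=foo")
    if reason:
        print(f"Stopping agent: {reason}")
        break
```

Call `check()` once per loop iteration and terminate (or escalate to a human) on any non-None reason.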
Q: How do I reduce costs?
A: Top cost-reduction tactics (in order of impact):
- Use smaller model (80% saving): SLM (7B) for loop, LLM (70B+) for verify only
- Cache results (30-50% saving): Repeat queries shouldn’t re-run
- Reduce context (20-40% saving): Compress memory, use LLM Wiki pattern
- Use quantization (20% saving): 4-bit models need ~4× less GPU memory, so they fit on cheaper hardware with minimal quality loss
- Route smart (10-20% saving): Simple tasks don’t need expensive model
Example: Hybrid setup can save up to 80-90% vs pure cloud (when most requests route locally):
- 80% of requests → cheap local SLM (Phi, Mistral 7B) = ~$0
- 20% of requests → verify with Claude Opus = ~$3/1M tokens
- Total: ~$0.60/1M tokens vs $15/1M for pure Claude
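The blended figure is just a weighted average of the two routes; as a sanity check:

```python
def blended_cost_per_million(local_share, local_cost, cloud_cost):
    # Weighted average cost per 1M tokens across local and cloud routes
    return local_share * local_cost + (1 - local_share) * cloud_cost

# 80% local (~$0) + 20% cloud verification (~$3/1M)
cost = blended_cost_per_million(local_share=0.8, local_cost=0.0, cloud_cost=3.0)
print(f"${cost:.2f}/1M tokens blended")
```

The savings depend entirely on `local_share`: if routing degrades and more traffic falls through to the cloud model, blended cost climbs toward the pure-cloud price.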
Q: What model size do I need?
A: Choose based on task complexity:
| Task | Model | Reason |
|---|---|---|
| Classification | 7B SLM | Fast, cheap, good enough |
| Summarization | 13B SLM | Good balance |
| Q&A retrieval | 13B SLM | Needs reasoning but not deep |
| Code generation | 34B SLM | Needs better code understanding |
| Complex reasoning | 70B LLM | Requires deep reasoning |
| Verification | 70B+ LLM | Needs high accuracy |
Q: Should I use cloud or local models?
A: Decision tree:
Tokens/day < 100K?
→ Cloud (cheaper for low volume)
Tokens/day 100K-1M?
→ Hybrid (local for loop, cloud for verify)
Tokens/day > 1M?
→ Local self-hosted (cheaper at scale)
Needs latest model?
→ Cloud (local models lag by 6-12 months)
Sensitive data?
→ Local (keep data on-premise)
No GPU available?
→ Cloud (can't run local models)
Q: How do I debug agent decisions?
A: Enable detailed logging:
# Log every step
for step in agent.steps:
    print(f"Step {step.iteration}:")
    print(f"  Reasoning: {step.reasoning}")
    print(f"  Tool: {step.tool_name}")
    print(f"  Result: {step.tool_result[:200]}")
    print(f"  Cost: ${step.cost}")

# Check if reasoning makes sense
# If reasoning is wrong → LLM confused, need clearer instruction
# If reasoning right but tool wrong → Tool choice issue
Q: What’s the difference between ReAct and Tree of Thoughts?
A:
| Framework | How it works | Best for | Cost |
|---|---|---|---|
| ReAct | Think → Act → Observe (loop) | Most tasks, default choice | Baseline (1-8 iterations) |
| Tree of Thoughts | Explore multiple branches, keep best | Complex problems, deep reasoning | 3-5× more expensive (many branches) |
| Reflexion | Act → Get feedback → Self-correct | Quality improvement, when first try fails | 2-3× cost (add reflection step) |
Recommendation: Start with ReAct. Use Tree of Thoughts only if ReAct success rate < 70%.
Q: How much GPU memory do I need?
A: For different model sizes:
| Model | Memory | GPU | Cost/month |
|---|---|---|---|
| 7B | 14GB | 1× RTX 4090 | $500 |
| 13B | 26GB | 1× RTX 4090 (needs 4-bit quantization to fit 24GB) | $500 |
| 34B | 68GB | 1× H100 | $3K |
| 70B | 140GB | 2× H100 | $6K |
| 405B | 810GB | Requires specialized hardware | $20K+ |
Cheaper alternative: Use cloud API (pay per token, no hardware cost).
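The memory column follows a rule of thumb of roughly 2 bytes per parameter for fp16 weights (before KV-cache and activation overhead); the helper below just encodes that arithmetic:

```python
def estimated_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    # fp16 weights: ~2 bytes/param; 4-bit quantized: ~0.5 bytes/param
    return params_billions * bytes_per_param

for size in (7, 13, 70):
    print(f"{size}B model: ~{estimated_memory_gb(size):.0f}GB (fp16)")
```

With 4-bit quantization (`bytes_per_param=0.5`) a 13B model drops to roughly 7GB and fits a single consumer GPU.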
Q: Is my harness secure?
A: Security checklist:
- Input validation (check for injection patterns)
- Output filtering (no PII leaks)
- Rate limiting (prevent abuse)
- Audit logging (track who did what)
- Secrets management (no hardcoded API keys)
- Sandboxing (restrict tool access)
See 10_security_and_safety.md for full details.
Q: How do I monitor production?
A: Essential metrics:
ESSENTIAL_METRICS = [
    "error_rate",            # % of requests failing
    "latency_p50/p95/p99",   # Request duration
    "cost_per_task",         # Token cost trending
    "success_rate",          # % of agents reaching goal
    "loop_iterations",       # Avg steps per task (higher = less efficient)
    "memory_usage",          # RAM / context window usage
    "loop_detection",        # Count of stuck agents
]
Alerting:
- Error rate > 5% → page on-call
- Cost/task > 2× baseline → page on-call
- Success rate drops > 10% → investigate
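These three alerting rules can be encoded directly; `evaluate_alerts` below is an illustrative sketch, not an existing monitoring API:

```python
def evaluate_alerts(metrics: dict, baseline: dict) -> list:
    # Returns which of the rules above fired for this metrics snapshot
    fired = []
    if metrics["error_rate"] > 0.05:
        fired.append("page: error_rate > 5%")
    if metrics["cost_per_task"] > 2 * baseline["cost_per_task"]:
        fired.append("page: cost/task > 2x baseline")
    if baseline["success_rate"] - metrics["success_rate"] > 0.10:
        fired.append("investigate: success_rate dropped > 10%")
    return fired

baseline = {"cost_per_task": 0.05, "success_rate": 0.92}
metrics = {"error_rate": 0.08, "cost_per_task": 0.04, "success_rate": 0.90}
print(evaluate_alerts(metrics, baseline))
```

Keeping the thresholds in one function makes it easy to test the rules against historical incidents and to keep the runbook and alerting config in sync.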
Q: What’s the best prompt?
A: No single “best” prompt, but follow these principles:
- Clear role: “You are a Python expert”
- Clear task: “Your job is to review this code”
- Clear constraints: “Don’t suggest breaking changes”
- Clear output format: “Return JSON with keys: issues, severity”
- Examples: Show 1-2 examples of good responses
Bad prompt:
"Write code"
Good prompt:
You are a senior Python engineer.
Review this Python code and identify bugs.
Focus on: memory leaks, infinite loops, security issues.
Output as JSON: {"issues": [{"line": 5, "type": "memory_leak", "fix": "..."}]}
Example:
Code: for x in data: items.append(x) # grows unbounded
Issue: Memory leak if data is large, items is never freed
Fix: Use generator instead: (x for x in data)
Part 10: Decision Trees for Diagnosis
When Error Rate Spikes
Error rate > 5%?
├─ Check specific error in logs
│ ├─ "Tool not found" → Tool missing/renamed
│ ├─ "Rate limit" → API quota exceeded
│ ├─ "Timeout" → External service slow
│ └─ "Model error" → Model offline/changed
│
├─ Check recent changes (last 2 hours)
│ ├─ Deployed new code? → Rollback
│ ├─ Changed prompt? → Revert prompt
│ ├─ Switched model? → Switch back
│ └─ No recent changes → Check external services
│
└─ Check metrics
├─ Latency high? → Performance issue
├─ Cost high? → Runaway agent
└─ Memory high? → Memory leak
When Cost Increases
Cost > budget?
├─ Identify the expensive session
│ ├─ High iteration count? → Loop issue (see "Stuck in Loop")
│ ├─ High output tokens? → Agent over-generating
│ └─ Many small costs? → Repeated expensive operations
│
├─ Check model used
│ ├─ Using expensive model? → Switch to cheaper
│ ├─ Changed model? → Revert
│ └─ Using correct model? → Check iteration count
│
└─ Quick wins
├─ Cache search results (30% savings)
├─ Use cheaper model for 80% of requests (80% savings)
└─ Reduce startup memory (10-20% savings)
Part 11: Incident Playbook
Incident: Cost $5K in 24 hours (Normal: $100)
Timeline (do this ASAP):
- Minute 1-2: Kill agent if still running
- Minute 3-5: Identify which session/agent caused spike
- Minute 6-10: Check logs for what it was doing
- Minute 11-15: Implement hard cost limit (prevent repeat)
- Hour 1: Root-cause analysis (why did this happen?)
- Hour 2: Fix and validate fix
Debug steps:
# Step 1: Find expensive sessions
expensive_sessions = get_sessions_by_cost(sort="descending")
culprit = expensive_sessions[0]
print(f"Session {culprit.id}:")
print(f"  Cost: ${culprit.cost}")
print(f"  Duration: {culprit.duration_seconds}s")
print(f"  Iterations: {culprit.loop_iterations}")
print(f"  Input tokens: {culprit.input_tokens}")
print(f"  Output tokens: {culprit.output_tokens}")

# Step 2: Check what it was doing
for step in culprit.steps[:10]:  # First 10 iterations
    print(f"Iteration {step.iteration}:")
    print(f"  Tool: {step.tool_name}")
    print(f"  Tokens: in={step.input_tokens}, out={step.output_tokens}")
    print(f"  Cost: ${step.cost}")
# Was it looping? Generating huge outputs? Using expensive model?

# Step 3: Check for the root cause
if culprit.loop_iterations > 20:
    print("ROOT CAUSE: Agent looping excessively")
    # See "Agent Stuck in Loop" fix
elif culprit.output_tokens > 50000:
    print("ROOT CAUSE: Agent generating huge outputs")
    # Check what it was generating
elif culprit.model == "claude-3-opus":
    print("ROOT CAUSE: Used expensive model instead of cheap one")
    # Check why it switched models
Prevent repeat:
# Add hard cost limit
class HardCostLimit:
    def __init__(self, max_cost_per_session: float = 5.0):
        self.max_cost = max_cost_per_session

    def check(self, session_cost: float):
        if session_cost > self.max_cost:
            kill_session_immediately()
            alert_ops("COST_LIMIT_HIT")
            raise Exception(f"Cost ${session_cost} exceeds limit ${self.max_cost}")

# Deploy immediately
limit = HardCostLimit(max_cost_per_session=5.0)
Conclusion
When something breaks in production, speed and calm matter most. Use these tools:
- Decision tree → Narrow down the problem fast
- Diagnostic steps → Verify your hypothesis
- Quick fix → Stop the bleeding (< 5 min)
- Proper fix → Prevent it recurring (permanent)
- Prevention → Add monitoring/checks
Most production incidents follow patterns. If you’ve seen it once, you can fix it again—faster.
Keep this runbook bookmarked. Update it with new incidents you find.
Quick Reference: Commands
# View logs for a specific error
grep "ERROR" harness.log | grep "tool_timeout" | tail -20
# Check which agent is expensive
jq '.sessions | sort_by(.cost) | reverse | .[0]' sessions.json
# Count iterations for a session
jq '.steps | length' session.json
# Check model used
jq '.model' session.json
# Get cost breakdown
jq '{model: .model, cost: .cost, tokens: .input_tokens + .output_tokens}' session.json
Further Reading
- 09_operations_and_observability.md — Full logging and monitoring guide
- 10_security_and_safety.md — Security hardening
- 11_testing_and_qa.md — Quality assurance
- 13_cost_management.md — Deep cost analysis
Validation Checklist
How do you know you got this right?
Performance Checks
- Decision tree diagnostic identifies root cause in <5 minutes
- Quick fix resolves symptom in <5 minutes (service restored)
- Proper fix prevents recurrence (no duplicate incidents in 2+ weeks)
- Runbook tested: new on-call follows steps successfully
Implementation Checks
- Decision tree covers 90%+ of real production incidents
- Diagnostic steps for each symptom documented with example logs
- Quick fix is safe: temporary measure that doesn’t cause data loss
- Proper fix implemented: code change or monitoring addition deployed
- Prevention measures in place: monitoring alert or hard limit added
- Commands cheatsheet tested: each one returns expected data format
- Runbook updated after every incident: lessons captured
Integration Checks
- Logging provides needed data: can trace request from input to output
- Monitoring alerts match runbook symptoms: alert fires when issue occurs
- Escalation procedures defined: who to contact if fix fails
- Incident postmortem process: how to prevent recurrence
Common Failure Modes
- Decision tree doesn’t match real issues: Test against last 10 incidents; update
- Logs don’t provide diagnostic info: Missing request IDs, timing, error context
- Quick fix is too complex: Takes 10 minutes; simplify or document better
- Same incident repeats: Prevention measure didn’t work; verify it’s deployed
- Runbook outdated: Logs format changed, commands broken; maintain as code changes
Sign-Off Criteria
- Runbook tested by someone unfamiliar with codebase (clarity check)
- All 3 real incidents resolved successfully using runbook
- Prevention measures verified deployed: alerts fire, limits enforced
- Team trained: on-call can follow runbook independently
- Documentation complete: why issues happen, not just what to do
See Also
- Doc 09 (Operations & Observability): Structured logging and monitoring setup
- Doc 13 (Cost Management): Cost spike diagnosis and prevention
- Doc 16 (Evaluation & Benchmarking): Quality regression detection and response