
Troubleshooting & FAQ

Production incident playbooks, decision trees, common failure modes, and step-by-step debugging procedures for agent systems.

When something breaks in production, speed matters more than perfection. This document is designed for on-call engineers to diagnose and fix common issues quickly.

In simple terms: “What do I do when my harness breaks? Where do I look first? How do I fix it?”


Quick Reference: First Steps

When something is broken:

  1. Check if it’s actually broken → Look at metrics (error rate, latency)
  2. Identify the symptom → Use decision tree below
  3. Check recent changes → Did something deploy in last 30 minutes?
  4. Look at structured logs → Filter by error, agent ID, session ID
  5. Isolate the component → Is it the model? Tool? Memory? Cost control?
  6. Apply fix → Choose “Quick fix” if urgent, “Proper fix” for long-term
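
The checklist above can be sketched as a small triage helper. The thresholds (5% error rate, a latency SLO, a daily budget) mirror the decision tree in Part 1; the function and parameter names are illustrative, not part of any real harness.

```python
def triage(error_rate: float, p95_latency_s: float, latency_slo_s: float,
           daily_cost: float, daily_budget: float) -> str:
    """Return which branch of the Part 1 decision tree to consult first."""
    if error_rate > 0.05:
        return "error-rate branch (tool errors, API errors, agent output)"
    if p95_latency_s > latency_slo_s:
        return "latency branch (p50/p95/p99, tool bottlenecks, queue backlog)"
    if daily_cost > daily_budget:
        return "cost branch (spikes, token counting, runaway agents)"
    return "observability branch (check logs, metrics, dashboards)"
```

Feed it whatever your metrics backend exposes; the point is to pick one branch and stay in it rather than jumping between symptoms.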

Part 1: Symptom-Based Diagnosis Decision Tree

Use this tree to identify what’s broken:

Something seems wrong
├─ Error rate > 5% (check metrics)
│  ├─ Specific tool failing (look for tool error pattern)
│  │  ├─ Tool timeout → See "Tool Issues: Timeout"
│  │  ├─ Tool returns bad format → See "Tool Issues: Unexpected Format"
│  │  ├─ Permission denied → See "Tool Issues: Permission Denied"
│  │  └─ Tool not found → See "Tool Issues: Not Registered"
│  │
│  ├─ Agent producing garbage (reasoning nonsensical)
│  │  ├─ Agent ignoring instructions → See "Agent Debugging: Ignoring Instructions"
│  │  ├─ Agent making wrong tool calls → See "Agent Debugging: Wrong Tool Calls"
│  │  └─ Hallucinations increased → See "Quality Issues: Hallucinations"
│  │
│  ├─ All requests failing with same error
│  │  ├─ "Rate limit exceeded" → See "Common Error Messages: 429"
│  │  ├─ "Context window exceeded" → See "Common Error Messages: Context Exceeded"
│  │  ├─ "Model not found" → See "Common Error Messages: Model Not Found"
│  │  └─ Other API error → See "Common Error Messages"
│  │
│  └─ Random/intermittent failures
│     ├─ Network timeouts → See "Performance Issues: Network Timeouts"
│     ├─ Database connection errors → See "Deployment Issues: Database Errors"
│     └─ Memory corruption → See "Memory Issues: Corruption"

├─ Latency > expected (check p50/p95/p99)
│  ├─ High on all requests → See "Performance Issues: Slow Inference"
│  ├─ High on specific tool → See "Performance Issues: Tool Bottleneck"
│  ├─ Intermittent spikes → See "Performance Issues: Queue Backlog"
│  └─ First request slow, rest fast → See "Performance Issues: Slow Inference (model loading)"

├─ Cost > budget (check cost tracking)
│  ├─ Spike in last 24 hours → See "Cost & Budget Issues: Unexpected Spike"
│  ├─ Gradual increase over time → See "Cost & Budget Issues: Gradual Creep"
│  ├─ Wrong tokens charged → See "Cost & Budget Issues: Token Counting Mismatch"
│  └─ Runaway agent → See "Cost & Budget Issues: Runaway Agent"

├─ Agent looping (iterations not stopping)
│  ├─ Stuck on same decision → See "Agent Debugging: Stuck in Loop"
│  ├─ Context window filling up → See "Memory Issues: Memory Loss"
│  └─ Too many retries on failed tool → See "Agent Debugging: Ignoring Instructions"

├─ Agent timing out (total duration > timeout)
│  ├─ Normal operations taking too long → See "Agent Debugging: Timing Out"
│  ├─ Waiting for tool response → See "Tool Issues: Timeout"
│  └─ Memory consolidation slow → See "Memory Issues: Memory Consolidation Slow"

└─ Can't find information / logs missing
   ├─ Logs disappeared → See "Deployment Issues: Missing Logs"
   ├─ Metrics not being recorded → See "Deployment Issues: Health Checks Failing"
   └─ Dashboard shows no data → See "Operations: Observability Misconfigured"

Part 2: Agent Debugging

Symptom: Agent Stuck in Loop

What you’ll see:

  • Agent iteration count keeps increasing (10, 20, 50+)
  • Agent making same decision/tool call repeatedly
  • Session doesn’t complete or times out
  • Context tokens increasing with each iteration
  • Cost climbing without making progress

Root causes:

  1. Tool always fails — tool is broken, agent keeps retrying
  2. Agent doesn’t understand error — error message is unclear
  3. Instruction contradiction — agent told to keep trying indefinitely
  4. No termination logic — agent has no “give up” condition
  5. Tool returns infinite loop — e.g., search results pointing to search

Diagnostic steps:

# Step 1: Check iteration limit
if session_log.loop_iterations > 15:
    print("ALERT: Agent exceeded normal iteration count")
    # Normal: 3-8 iterations
    # Concerning: 10-15 iterations
    # Critical: 20+ iterations

# Step 2: Look at repetition pattern
recent_steps = session_log.last_n_steps(5)
tools_called = [step.tool_name for step in recent_steps]
print(f"Last 5 tools: {tools_called}")
# If all same → Looping

# Step 3: Check error in failed tools
for step in recent_steps:
    if step.status == "failed":
        print(f"Tool {step.tool_name} failed: {step.error}")
        # Is the error message helpful?
        # Is the tool actually broken?

# Step 4: Check context window usage
print(f"Context usage: {session_log.context_tokens_used} / {session_log.context_limit}")
context_ratio = session_log.context_tokens_used / session_log.context_limit
if context_ratio > 0.85:
    print("WARNING: Approaching context limit, may be running out of space")

Quick fix (< 5 minutes):

1. Kill the session immediately (don't wait for timeout)
   - Set iteration limit to 12 (or lower if you see looping at 8)
   
2. Look at the last 3 tool calls in logs
   - Same tool repeated? → That tool is broken
   - Different tools but same result? → Agent isn't understanding the error
   
3. Check which tool is failing
   - If "web_search": Search API might be rate-limited or down
   - If custom tool: That tool implementation may be broken
   
4. Restart without that tool (comment it out)
   - Does the agent succeed without it? → Confirms tool is the problem
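
Step 4 can be sketched as below, assuming a harness where tools are registered in a dict keyed by name (an assumption; adapt to however your harness stores them).

```python
def run_without_tool(tools: dict, suspect: str) -> dict:
    """Return a copy of the tool registry with the suspect tool removed.

    Pass the result to your harness's run() in place of the full registry:
    if the agent now succeeds, the suspect tool is confirmed as the problem.
    """
    return {name: fn for name, fn in tools.items() if name != suspect}
```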

Proper fix (permanent):

  1. Add mandatory termination logic:

    MAX_ITERATIONS = 12
    if iteration_count >= MAX_ITERATIONS:
        return {
            "status": "incomplete",
            "reason": "max_iterations_reached",
            "best_effort_result": last_valid_output
        }
  2. Improve error messages so agent understands:

    # Bad error message (agent doesn't know what to do)
    raise Exception("Tool failed")
    
    # Good error message (agent knows it's a retry issue)
    raise Exception(
        "Web search timed out after 30 seconds. "
        "Either the site is slow or the query is too broad. "
        "Try: 1) Wait 10 seconds then retry, or 2) Try a different query"
    )
  3. Add loop detection to logs:

    # Detect when agent is repeating itself
    if iteration > 2:
        prev_tool = steps[-2].tool_name
        curr_tool = steps[-1].tool_name
        if prev_tool == curr_tool == steps[-3].tool_name:
            log.warning("LOOP_DETECTED", {
                "tool": curr_tool,
                "repetitions": count_repetitions(curr_tool, steps)
            })
            # Force intervention or termination
  4. Test with failing tool disabled:

    Does agent succeed with tool X disabled?
    YES → Confirm tool X is broken, fix it
    NO → Problem is elsewhere (agent logic or instruction clarity)

Symptom: Agent Ignoring Instructions

What you’ll see:

  • Agent makes decisions contrary to explicit instructions
  • Agent uses forbidden tools
  • Agent generates output in wrong format despite instruction
  • Agent skips required steps

Root causes:

  1. Instruction buried in context — agent can’t see it due to context length
  2. Conflicting instructions — instructions contradict each other
  3. Instructions too vague — agent interprets them differently
  4. Model drift — model behavior changed, needs re-tuning
  5. Tool choice conflict — agent thinks different tool is better
  6. Temperature too high — model being too creative/random

Diagnostic steps:

# Step 1: Verify instruction was in context
instruction = "Always use tool X for data retrieval"
if instruction in session_log.full_context:
    print("✓ Instruction is in context")
else:
    print("✗ Instruction NOT in context (likely pruned due to length)")
    print(f"Context usage: {session_log.context_tokens} / {session_log.limit}")

# Step 2: Check what tool agent used
for step in session_log.steps:
    if step.tool_name != "expected_tool":
        print(f"Agent chose {step.tool_name}, expected other_tool")
        print(f"Reasoning: {step.reasoning}")

# Step 3: Check temperature/sampling parameters
print(f"Temperature: {session_log.model_params.temperature}")  # 0.0 = deterministic, 1.0 = creative
print(f"Top P: {session_log.model_params.top_p}")

# Step 4: Check if this is recent behavior
recent_10_sessions = get_last_n_sessions(10)
violations = sum(1 for s in recent_10_sessions if instruction_violated(s))
print(f"Instruction violations in last 10 sessions: {violations} / 10")
# If all 10 violated → chronic problem
# If 2-3 violated → occasional issue (may be model randomness)

Quick fix (< 5 minutes):

1. Check if instruction is being pruned (context too long)
   → Reduce memory size, compress old sessions, shorten instruction
   
2. Check for conflicting instructions
   → Search for "always use X" and "use Y if..."
   → Clarify which takes priority
   
3. If "occasional" violation (2-3 out of 10):
   → Reduce temperature (more deterministic)
   → Restart with fresh session (cold restart may help)
   
4. If "chronic" violation (8+ out of 10):
   → Instruction is being ignored, need proper fix
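
For step 2, a minimal conflict scan might look like the following. The "always use X" pattern and the function name are illustrative assumptions; extend the regex to whatever mandate phrasing your prompts actually use.

```python
import re

def find_conflicting_mandates(prompt_text: str) -> list:
    """Flag prompts that mandate more than one tool via 'always use ...'."""
    mandates = re.findall(r"always use (\w+)", prompt_text, flags=re.IGNORECASE)
    # More than one distinct "always use" target is a likely conflict
    distinct = sorted(set(m.lower() for m in mandates))
    return distinct if len(distinct) > 1 else []
```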

Proper fix (permanent):

  1. Ensure instruction is at context start:

    # Bad: Instruction at end of long context
    system_prompt = [
        instructions_file,        # 2K tokens
        memory_file,              # 50K tokens
        conversation_history,     # 100K tokens
        instruction_constraint,   # buried here!
    ]
    
    # Good: Constraints at start, most recent history at end
    system_prompt = [
        constraint_instruction,   # critical: at start!
        task_instruction,         # what to do
        memory_file,              # 50K tokens
        conversation_history,     # 100K tokens, at end so most recent
    ]
  2. Make instructions concrete with examples:

    # Vague (agent might ignore)
    "Use the search tool when appropriate"
    
    # Concrete (agent will follow)
    """MANDATORY: When you need information not in your memory:
    1. Use search_web_tool FIRST
    2. If search returns nothing, use search_knowledge_base SECOND
    3. Only use local_files if neither returns results
    
    Example: User asks "What is Llama 2?"
    GOOD: search_web_tool("Llama 2 model")
    BAD: search_knowledge_base("Llama 2 model")  ← Wrong order
    BAD: Ask user for more info  ← Don't do this, search first
    """
  3. Add verification step:

    # After each agent step, verify it followed the rules
    tools_used = [call.name for call in step.tool_calls]
    if required_tool not in tools_used:
        # Agent violated the instruction, log it
        log_alert("INSTRUCTION_VIOLATION", {
            "instruction": constraint,
            "tool_required": required_tool,
            "tool_used": tools_used[0] if tools_used else None,
            "session_id": session_id
        })
  4. Test with reduced context window:

    Does agent follow instruction with context_limit=50K instead of 200K?
    YES → Instruction was being pruned, need to reduce memory
    NO → Agent intentionally ignoring instruction, need stronger constraint

Symptom: Agent Producing Garbage Output

What you’ll see:

  • Output is nonsensical, incoherent
  • Output contains false information (hallucinations)
  • Output mixes multiple unrelated topics
  • Output contains jailbreak artifacts or strange formatting
  • Output quality fine in staging, broken in production

Root causes:

  1. Context corruption — old memory mixed with current task
  2. Model hallucinating — producing false information confidently
  3. Prompt injection — malicious input changed agent behavior
  4. Cache collision — KV cache mixing responses from different sessions
  5. Quantization artifact — rare precision error from 4-bit quantization
  6. Model drift — production model different from staging
  7. Temperature too high — model generating random tokens

Diagnostic steps:

# Step 1: Check context for corruption
context_summary = analyze_context_windows(session_log)
print(f"Context sources:")
for source in context_summary.sources:
    print(f"  - {source}: {source.token_count} tokens, age={source.age_hours}h")

# Example of corruption:
# Task: "Summarize recent sales"
# Context accidentally includes: "Nuclear weapons safety procedures"
# ← This contamination causes garbage output

# Step 2: Check if output is hallucination
for fact in output_facts:
    if fact not in session_log.full_context:
        print(f"HALLUCINATION: '{fact}' not found in context")
        # Agent made this up

# Step 3: Check input for injection
if "<|system|>" in user_input or "ignore instructions" in user_input.lower():
    print("POSSIBLE_INJECTION: Input contains jailbreak patterns")

# Step 4: Check if staging/production models match
staging_model = "claude-3-5-sonnet-20240620"
prod_model = session_log.model
if staging_model != prod_model:
    print(f"MODEL MISMATCH: Staging uses {staging_model}, prod uses {prod_model}")
    print("→ Test staging with prod model to see if issue reproduces")

# Step 5: Check temperature
print(f"Temperature: {session_log.temperature}")
if session_log.temperature > 0.7:
    print("WARNING: High temperature may cause randomness")

Quick fix (< 5 minutes):

1. Set temperature to 0.0 (deterministic)
   → Eliminates randomness, see if output improves
   
2. Clear memory/context
   → Cold restart session without old memory
   → Does output improve? → Memory corruption confirmed
   
3. Check if staging and production use same model
   → If different, recreate issue in staging first
   
4. Check input for obvious injection patterns
   → Any "<|system|>" or "ignore instructions"?
   → If yes, increase input validation

Proper fix (permanent):

  1. Prevent context corruption:

    # Tag context sources with session ID
    memory_entry = {
        "content": text,
        "session_id": current_session_id,  # MUST match current session
        "created_at": timestamp
    }
    
    # Before using memory, verify it's from the same task/session
    for entry in memory:
        if entry.session_id != current_session_id and entry.age_hours > 24:
            # Stale entry from a different session, skip it
            continue
  2. Add hallucination detection:

    # Verify each fact in output appears in context
    facts = extract_facts(output)
    for fact in facts:
        if fact not in context and not is_known_fact(fact):
            # This is a potential hallucination
            add_fact_verification_step()
            # Ask agent to cite source or admit uncertainty
  3. Strict input validation against injection:

    def validate_input(user_input: str) -> bool:
        dangerous_patterns = [
            "<|system|>", "<|user|>",      # Jailbreak markers
            "ignore instructions",          # Direct override
            "pretend you",                  # Role change
            "forget your instructions",     # Memory wipe
            "you are now",                  # System swap
        ]
        for pattern in dangerous_patterns:
            if pattern.lower() in user_input.lower():
                log_alert("INJECTION_ATTEMPT", {"input": user_input})
                return False
        return True
  4. Test staging with production model:

    If staging uses model V1 and prod uses V2:
    1. Update staging to use V2
    2. Re-run quality tests
    3. If quality drops → V2 needs tuning
    4. If quality same → Issue is elsewhere (not model change)

Symptom: Agent Running Out of Memory / Context Window Exceeded

What you’ll see:

  • Error: “Context length exceeded” or “prompt too long”
  • Agent abruptly stops mid-task
  • Session fails on the 10th+ iteration (context filling up over time)
  • Long-running tasks fail but short tasks succeed

Root causes:

  1. Memory not being consolidated — old sessions piling up
  2. Conversation history too long — keeping all old messages
  3. Model context limit too low — using 4K context model instead of 128K
  4. Tool results too large — search returns 10K tokens of junk
  5. Logging too verbose — logging every intermediate step
  6. Context size increased — recent change expanded startup memory

Diagnostic steps:

# Step 1: Check context usage over time
for step in session_log.steps:
    print(f"Iteration {step.iteration}: {step.context_used} tokens")

# Should grow slowly, then plateau
# If growing linearly: memory not being pruned/consolidated

# Step 2: Measure memory file sizes
print(f"CLAUDE.md: {get_file_size('CLAUDE.md')} tokens")
print(f"MEMORY.md: {get_file_size('MEMORY.md')} tokens")
print(f"Topic files: {sum(get_file_size(f) for f in topic_files)} tokens")

# Good baseline:
# CLAUDE.md: 500-1000 tokens (instructions)
# MEMORY.md: 5000-10000 tokens (compact facts)
# Topic files: 500-2000 tokens each
# Total startup: < 20K tokens

# Step 3: Check model context limit
print(f"Model: {session_log.model}")
print(f"Context limit: {session_log.context_limit} tokens")

# If limit is 4K: Too small
# If limit is 128K: Good for long tasks
# Check: Did recent change switch to smaller model?

# Step 4: Measure tool output sizes
for step in session_log.steps:
    if step.tool_name == "search_web":
        output_tokens = count_tokens(step.tool_output)
        print(f"Search result: {output_tokens} tokens")
        # Results > 5K tokens? Too verbose

# Step 5: Check consolidation logs
consolidation_logs = [e for e in session_log.events if e.name == "memory_consolidated"]
if len(consolidation_logs) == 0:
    print("WARNING: No consolidation events in logs")
    print("→ Memory never consolidated, context keeps growing")

Quick fix (< 5 minutes):

1. Reduce startup memory
   → Comment out old session summaries in MEMORY.md
   → Keep only last 2 sessions
   → Should drop startup from 50K → 15K tokens
   
2. Enable aggressive memory consolidation
   → Consolidate memory every 5 iterations instead of 15
   → This keeps context from growing unbounded
   
3. Switch to larger context model if available
   → e.g., from a 100K-context model to a 200K-context model
   → If already on the largest available model, this won't help (the limit is hard)
   
4. Prune verbose logging
   → Stop logging every intermediate step
   → Log only: errors, tool calls, final decision
   → Should reduce context by 30-40%
   
5. Limit tool output
   → Cap search results to 2K tokens
   → Summarize large results before using them
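
Step 5 can be sketched as a wrapper applied to tool outputs before they enter context. The 4-characters-per-token approximation is a rough heuristic, not a real tokenizer; swap in your provider's token counter if you have one.

```python
def cap_tool_output(text: str, max_tokens: int = 2000) -> str:
    """Truncate oversized tool output and mark the cut so the agent knows."""
    approx_tokens = len(text) // 4  # rough heuristic: ~4 chars per token
    if approx_tokens <= max_tokens:
        return text
    return text[: max_tokens * 4] + f"\n[... output truncated at {max_tokens} tokens]"
```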

Proper fix (permanent):

  1. Implement automatic memory consolidation:

    def consolidate_memory_if_needed():
        context_used = get_context_usage()
        context_limit = model.context_limit
        usage_ratio = context_used / context_limit
        
        if usage_ratio > 0.60:  # Consolidate at 60%
            # Summarize old conversation
            old_messages = get_messages_before_n_iterations_ago(10)
            summary = compress_conversation(old_messages)
            
            # Replace old messages with summary
            replace_old_messages_with_summary(summary)
            
            log.info("MEMORY_CONSOLIDATED", {
                "tokens_before": context_used,
                "tokens_after": get_context_usage(),
                "compression_ratio": context_used / get_context_usage()
            })
  2. Set hard context limit with graceful degradation:

    MAX_CONTEXT_USAGE = 0.85 * model.context_limit
    
    if context_used > MAX_CONTEXT_USAGE:
        # Instead of crashing, gracefully degrade
        log.warn("APPROACHING_CONTEXT_LIMIT", {
            "usage": context_used,
            "limit": model.context_limit
        })
        
        # Option 1: Consolidate memory
        consolidate_memory()
        
        # Option 2: Remove oldest messages
        prune_old_messages(count=5)
        
        # Option 3: Save and restart session
        save_session_summary()
        return {"status": "checkpoint_reached", "continue_in_new_session": True}
  3. Right-size model for task length:

    task_complexity = estimate_task_complexity(user_prompt)
    
    if task_complexity == "simple":
        model = "claude-3-5-sonnet"  # 200K context, cheap
    elif task_complexity == "complex":
        model = "claude-3-opus"      # 200K context, more capable
    else:
        model = "claude-3-5-sonnet"  # Default
    
    # Re-evaluate model choice based on actual task
  4. Establish memory file size budgets:

    # Define maximum sizes (tokens)
    MEMORY_BUDGETS = {
        "CLAUDE.md": 1000,   # Instructions
        "MEMORY.md": 15000,  # Compact facts
    }
    TOPIC_FILE_BUDGET = 2000      # Each topic file
    STARTUP_TOTAL_BUDGET = 20000  # Total startup overhead
    
    # Enforce in CI/CD
    for filepath, budget in MEMORY_BUDGETS.items():
        actual_size = get_file_size(filepath)
        if actual_size > budget:
            raise Exception(f"{filepath} exceeds budget: {actual_size} > {budget}")
    for filepath in topic_files:
        if get_file_size(filepath) > TOPIC_FILE_BUDGET:
            raise Exception(f"{filepath} exceeds topic budget: > {TOPIC_FILE_BUDGET}")

Symptom: Agent Making Wrong Tool Calls

What you’ll see:

  • Agent calls wrong tool for the task
  • Agent calls tool with wrong parameters
  • Agent calls tools in wrong order
  • Agent uses tool when it shouldn’t

Root causes:

  1. Tool descriptions unclear — agent doesn’t understand what tool does
  2. Parameter validation missing — agent sends bad params, tool fails
  3. Tool schema wrong — schema doesn’t match tool’s actual interface
  4. Agent doesn’t understand task — misinterprets what user asked
  5. Too many similar tools — agent confused between similar options

Diagnostic steps:

# Step 1: Check tool descriptions
for tool in available_tools:
    print(f"Tool: {tool.name}")
    print(f"Description: {tool.description}")
    # Is it clear what this tool does?
    # Would you know to use it from the description?

# Example of bad description:
# "data" - What data? When to use it? (Unhelpful)

# Example of good description:
# "search_web: Search the public internet for current information.
#  Use when you need recent news, facts, or information not in your memory.
#  Returns: Top 5 results with titles, URLs, summaries (max 2K tokens each)"

# Step 2: Check parameter types
for tool in available_tools:
    for param in tool.parameters:
        print(f"  {param.name}: {param.type} (required={param.required})")
        # Are types clear? (string, integer, list)
        # Are they documented?

# Step 3: Check tool usage in logs
for step in session_log.steps:
    tool = step.tool_name
    params = step.tool_params
    print(f"Tool: {tool}, Params: {params}")
    
    # Does this make sense?
    # If tool expects ["query"], did agent provide query?
    # If tool expects {"file_path", "action"}, did agent provide both?

# Step 4: Check for parameter errors
for step in session_log.steps:
    if step.status == "error":
        error = step.error_message
        if "parameter" in error.lower() or "type" in error.lower():
            print(f"PARAMETER_ERROR: {error}")

# Step 5: Check for tool confusion patterns
tool_sequence = [step.tool_name for step in session_log.steps]
most_common = max(set(tool_sequence), key=tool_sequence.count)
if tool_sequence.count(most_common) > 3:
    # Agent kept using same tool, suggests confusion about alternatives
    print(f"Agent overused {most_common}")

Quick fix (< 5 minutes):

1. Check tool schema matches reality
   → Run tool with example params from schema
   → If it fails → Schema is wrong, update schema
   
2. Look at agent's reasoning for tool choice
   → Why did agent pick tool X?
   → Is reasoning correct? (If reasoning is wrong, LLM is confused)
   
3. If tool called with wrong params:
   → Add parameter validation to tool
   → Return helpful error message explaining required params
   → Agent will learn and retry correctly
   
4. If too many similar tools:
   → Combine similar tools into one with "action" parameter
   → E.g., search_web, search_knowledge_base, search_local
   → Instead: search(source: "web|knowledge|local", query)

Proper fix (permanent):

  1. Improve tool descriptions with examples:

    # Bad
    tools = [{
        "name": "search",
        "description": "Search for information"
    }]
    
    # Good
    tools = [{
        "name": "search_web",
        "description": """
        Search the public internet for current information.
        
        When to use:
        - Need recent news or events (< 1 week old)
        - Need facts not in your memory
        - Need to verify current information
        
        Do NOT use:
        - For private/internal documents (use search_knowledge_base instead)
        - For files on user's computer (use search_local instead)
        
        Returns: Top 5 results with titles, URLs, summaries
        
        Example:
        Query: "machine learning"
        Result: [
            {"title": "What is ML?", "url": "...", "summary": "..."},
            ...
        ]
        """,
        "parameters": {
            "query": {
                "type": "string",
                "description": "Search query (e.g., 'latest AI models 2026')",
                "examples": ["GPT-4 release date", "Llama 3.1 performance"]
            }
        }
    }]
  2. Add parameter validation:

    def search_web(query: str) -> List[dict]:
        # Validate
        if not query or len(query) < 2:
            raise ValueError(
                "Query too short. Minimum 2 characters. "
                "Example: 'Python machine learning' not 'a'"
            )
        
        if len(query) > 200:
            raise ValueError(
                "Query too long (max 200 chars). "
                "Try shorter: 'AI models' not 'What are the latest developments in AI...'"
            )
        
        # Execute
        return search_implementation(query)
  3. Test tool schemas match reality:

    # In your test suite
    def test_tool_schema_matches_implementation():
        for tool_name, tool_func in tools.items():
            schema = tool_schemas[tool_name]
            
            # Get required params from schema
            required_params = [p for p in schema.params if p.required]
            
            # Try calling with all required params
            example_kwargs = generate_example_params(required_params)
            
            try:
                tool_func(**example_kwargs)
            except TypeError as e:
                raise AssertionError(
                    f"Tool {tool_name} schema doesn't match implementation: {e}"
                )
  4. Reduce tool cardinality with action parameter:

    # Instead of many similar tools:
    # search_web, search_knowledge_base, search_local, search_arxiv
    
    # Use one tool with action param:
    def search(query: str, action: str = "web") -> List[dict]:
        """
        Search for information from multiple sources.
        
        action:
          - "web": Public internet (recent, current)
          - "knowledge": Internal knowledge base (comprehensive)
          - "local": Files on user's computer (private)
          - "arxiv": Academic papers (research)
        
        Example:
          search("machine learning", action="web")
          search("company policy", action="knowledge")
        """
        if action == "web":
            return search_web_impl(query)
        elif action == "knowledge":
            return search_kb_impl(query)
        # ... etc

Part 3: Tool Issues

Issue: Tool Not Found / Not Registered

Error message:

Tool 'web_search' not found. Available tools: [search_web, get_page]

Symptoms:

  • Agent tries to call tool that doesn’t exist
  • Error: “Tool not registered”
  • Tool works locally but fails in production

Root causes:

  1. Tool not registered in harness — tool function exists but not in tool list
  2. Typo in tool name — agent calls web_search but actual name is search_web
  3. Tool removed in recent deploy — tool was available before, not now
  4. Different deployment — staging has tool, production doesn’t
  5. Dynamic tool loading failed — tool file missing or syntax error

Diagnostic steps:

# Step 1: List available tools
print("Available tools:")
for tool in agent.available_tools:
    print(f"  - {tool.name}")

# Step 2: Check if tool is registered
tool_name = "web_search"
if tool_name not in agent.available_tools:
    print(f"✗ Tool '{tool_name}' not registered")
    # Find similar names (stdlib difflib)
    import difflib
    similar = difflib.get_close_matches(tool_name, [t.name for t in agent.available_tools])
    print(f"  Did you mean: {similar}?")

# Step 3: Check tool file exists
import os
if not os.path.exists("tools/web_search.py"):
    print("✗ Tool file missing: tools/web_search.py")

# Step 4: Check for syntax errors in tool file
try:
    import tools.web_search
    print("✓ Tool imports successfully")
except SyntaxError as e:
    print(f"✗ Syntax error in tool: {e}")

# Step 5: Compare staging vs production
staging_tools = get_tools_from("staging")
prod_tools = get_tools_from("production")
missing_in_prod = set(staging_tools) - set(prod_tools)
if missing_in_prod:
    print(f"Tools in staging but NOT in production: {missing_in_prod}")

Quick fix (< 5 minutes):

1. Check available tools in agent
   → Print list of registered tools
   → Is the tool there?
   
2. If tool should exist:
   → Check tool file for syntax errors
   → Restart harness/reload tools
   
3. If tool missing in production:
   → Did recent deploy remove it?
   → Check deployment diff (what changed?)
   → Rollback if needed
   
4. If typo in tool name:
   → Agent is calling 'web_search' but actual name is 'search_web'
   → Either: A) Rename tool to match, or B) Update agent prompt

Proper fix (permanent):

  1. Standardize tool naming:

    # Establish naming convention
    # All search tools: search_* (search_web, search_knowledge, search_local)
    # All file tools: file_* (file_read, file_write, file_list)
    # All code tools: run_* (run_python, run_bash, run_sql)
    
    # Document in CLAUDE.md
    TOOL_NAMING_CONVENTION = """
    Prefix by category:
    - search_*: Information retrieval
    - file_*: File operations
    - run_*: Code execution
    - email_*: Email operations
    """
  2. Add tool validation to startup:

    def validate_tools_on_startup():
        for tool_name in EXPECTED_TOOLS:
            if tool_name not in agent.available_tools:
                raise RuntimeError(
                    f"Expected tool '{tool_name}' not registered. "
                    f"Available: {list(agent.available_tools.keys())}"
                )
            
            # Test that tool is callable
            try:
                tool = agent.available_tools[tool_name]
                # Don't actually call, just verify it's callable
                assert callable(tool)
            except Exception as e:
                raise RuntimeError(f"Tool '{tool_name}' not callable: {e}")
  3. Add tool alias support:

    # If tools are named differently in production vs agent prompt
    TOOL_ALIASES = {
        "web_search": "search_web",     # Agent calls web_search, actual is search_web
        "fetch_url": "get_page",        # Agent calls fetch_url, actual is get_page
    }
    
    def resolve_tool_name(requested_name):
        if requested_name in TOOL_ALIASES:
            actual_name = TOOL_ALIASES[requested_name]
            log.warning("TOOL_ALIAS_USED", {
                "requested": requested_name,
                "actual": actual_name
            })
            return actual_name
        return requested_name
  4. Test tool availability in CI/CD:

    # In your test suite
    def test_all_required_tools_available():
        from harness import agent
        
        required_tools = [
            "search_web",
            "read_file",
            "write_file",
            "run_python",
            # ... etc
        ]
        
        for tool_name in required_tools:
            assert tool_name in agent.available_tools, \
                f"Required tool '{tool_name}' not registered"

Issue: Tool Failing with Errors

Error messages:

Tool 'web_search' failed: Connection timeout
Tool 'send_email' failed: Authentication failed
Tool 'read_file' failed: File not found

Symptoms:

  • Specific tool always fails
  • Tool fails intermittently
  • Tool fails with specific input
  • Tool works in local testing but fails in production

Root causes:

  1. Network timeout — API is slow or down
  2. Authentication failed — credentials missing or expired
  3. Permission denied — insufficient permissions
  4. Resource not found — file/URL doesn’t exist
  5. Rate limited — too many requests to external API
  6. Resource exhausted — disk full, memory full

Diagnostic steps:

# Step 1: Reproduce the error
tool = agent.get_tool("web_search")
caught = None
try:
    result = tool(query="test")
    print("✓ Tool works")
except Exception as e:
    caught = e
    print(f"✗ Tool fails: {e}")

# Step 2: Check error details (bind the exception first: `e` goes out of
# scope once the except block ends)
error_details = {
    "error_type": type(caught).__name__,     # TimeoutError, AuthError, etc.
    "error_message": str(caught),
    "error_code": getattr(caught, "code", None)
}
print(f"Error details: {error_details}")

# Step 3: Check external service status
if error_details["error_type"] == "TimeoutError":
    # Check if API is up
    status = check_api_status("https://api.example.com/health")
    print(f"API status: {status}")
    
if error_details["error_type"] == "AuthError":
    # Check credentials
    creds = get_credentials()
    if creds is None:
        print("✗ Credentials missing")
    else:
        print(f"✓ Credentials present (expires {creds.expires_at})")

# Step 4: Check rate limiting (assumes `response` is the last HTTP
# response the tool received)
remaining = response.headers.get("X-RateLimit-Remaining")
if remaining == "0":
    print("WARNING: Rate limit exceeded")
    print(f"Resets at: {response.headers.get('X-RateLimit-Reset')}")

# Step 5: Check logs for patterns
failures = get_tool_failures("web_search", last_n_hours=1)
print(f"Failures in last hour: {len(failures)}")
for failure in failures:
    print(f"  {failure.timestamp}: {failure.error}")

Quick fix (< 5 minutes):

1. Check if external API/service is down
   → Visit status page or health endpoint
   → If down, wait for it to recover (not your problem)
   
2. Check credentials/API keys
   → Are they set in environment?
   → Are they still valid? (Check expiration)
   → Test with curl/Postman first
   
3. If rate limited:
   → Slow down request rate
   → Check quota in API dashboard
   → Request increase if needed
   
4. If timeout:
   → Increase timeout value (if configurable)
   → Check network connectivity
   → Check if API is slow
   
5. If permission denied:
   → Check user/account has permission
   → Check if API key has required scopes
   → Check firewall/network policies
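
The first two quick-fix checks can be scripted so on-call engineers run them in one shot. A minimal sketch; the health URL and environment variable names are placeholders to adapt to your deployment:

```python
import os
import urllib.request

# Hypothetical names: adjust for your deployment.
HEALTH_URL = "https://api.example.com/health"
REQUIRED_ENV_VARS = ["SEARCH_API_KEY", "EMAIL_API_KEY"]

def triage_tool_failure():
    """Run the first two quick-fix checks: service health and credentials."""
    findings = []

    # 1. Is the external service up?
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            findings.append(("service_up", resp.status == 200))
    except Exception:
        # DNS failure, refused connection, timeout: treat all as "down"
        findings.append(("service_up", False))

    # 2. Are credentials present in the environment?
    for var in REQUIRED_ENV_VARS:
        findings.append((f"env:{var}", bool(os.environ.get(var))))

    return findings

for check, ok in triage_tool_failure():
    print(f"{'✓' if ok else '✗'} {check}")
```

If both checks pass, move on to rate limits and permissions (steps 3-5) before touching the harness itself.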

Proper fix (permanent):

  1. Add retry logic with exponential backoff:

    def call_tool_with_retry(tool_name, *args, max_retries=3, **kwargs):
        import time
        
        for attempt in range(max_retries):
            try:
                tool = agent.get_tool(tool_name)
                result = tool(*args, **kwargs)
                return result
            
            except TimeoutError:
                if attempt < max_retries - 1:
                    wait_time = 2 ** attempt  # 1s, then 2s before the final attempt
                    log.warning(f"Tool {tool_name} timeout, retrying in {wait_time}s")
                    time.sleep(wait_time)
                else:
                    raise
            
            except RateLimitError:
                # A rate limit won't clear in seconds; surface it instead of retrying
                raise
  2. Add health checks and circuit breaker:

    class ToolHealthCheck:
        def __init__(self, tool_name):
            self.tool_name = tool_name
            self.failure_count = 0
            self.failure_threshold = 5
            self.is_healthy = True
        
        def check_health(self):
            # Try calling tool with simple test
            try:
                result = test_tool_call()
                self.failure_count = 0
                self.is_healthy = True
            except Exception as e:
                self.failure_count += 1
                if self.failure_count >= self.failure_threshold:
                    self.is_healthy = False
                    log.alert("TOOL_UNHEALTHY", {
                        "tool": self.tool_name,
                        "failures": self.failure_count
                    })
        
        def should_use_tool(self):
            if not self.is_healthy:
                # Tool is failing, don't use it
                return False
            return True
  3. Log all tool failures with context:

    def execute_tool(tool_name, params):
        log_entry = {
            "timestamp": datetime.now(),
            "event": "tool_call",
            "tool_name": tool_name,
            "params": params,
            "session_id": current_session_id
        }
        
        try:
            tool = agent.get_tool(tool_name)  # Resolve the callable; tool_name is just a string
            result = tool(**params)
            log_entry["status"] = "success"
            return result
        
        except Exception as e:
            log_entry["status"] = "failed"
            log_entry["error_type"] = type(e).__name__
            log_entry["error_message"] = str(e)
            log_entry["error_traceback"] = traceback.format_exc()
            
            log.error("TOOL_FAILED", log_entry)
            raise
  4. Validate tool parameters before calling:

    def validate_tool_params(tool_name, params):
        schema = tool_schemas[tool_name]
        
        for param_name, param_config in schema.parameters.items():
            if param_config.required and param_name not in params:
                raise ValueError(
                    f"Missing required parameter '{param_name}' for tool '{tool_name}'"
                )
            
            # Validate types
            param_value = params.get(param_name)
            expected_type = param_config.type
            if param_value is not None and not isinstance(param_value, expected_type):
                raise TypeError(
                    f"Parameter '{param_name}' must be {expected_type}, "
                    f"got {type(param_value)}"
                )

Issue: Tool Timeout

Symptoms:

  • Tool takes 30+ seconds to respond
  • Tool never returns (timeout after N seconds)
  • Some requests timeout, others are fast
  • Timeouts increase over time (resource leak?)

Root causes:

  1. External API is slow — search engine, database is overloaded
  2. Network latency — slow network connection
  3. Tool implementation inefficient — code doing too much work
  4. Tool hanging — infinite loop, deadlock, or waiting for response
  5. Resource exhaustion — database connection pool empty, memory full

Diagnostic steps:

# Step 1: Measure tool latency
start = time.time()
try:
    result = tool(query="test")
    elapsed = time.time() - start
    print(f"Tool latency: {elapsed:.2f}s")
except TimeoutError:
    elapsed = time.time() - start
    print(f"Tool timeout after {elapsed:.2f}s")

# Step 2: Check network latency to external services
latency = measure_ping("api.example.com")  # e.g. a helper that shells out to `ping`
print(f"Network latency: {latency:.2f}ms")

# Step 3: Check tool implementation
import inspect
source = inspect.getsource(tool_function)
# Look for:
# - Synchronous I/O (requests, urllib) → Use async instead
# - Large loops without timeout
# - Database queries without indexes

# Step 4: Check resource usage during tool call
import psutil
process = psutil.Process()
initial_memory = process.memory_info().rss

result = tool(query="test")

final_memory = process.memory_info().rss
memory_growth = final_memory - initial_memory
print(f"Memory growth: {memory_growth / 1024 / 1024:.2f} MB")

# Step 5: Check logs for patterns
slow_calls = get_tool_calls("web_search", filter={"duration_ms": ">5000"})
print(f"Calls > 5s: {len(slow_calls)}")
for call in slow_calls:
    print(f"  {call.timestamp}: {call.duration_ms}ms, query={call.params['query']}")

Quick fix (< 5 minutes):

1. Increase timeout value
   → If timeout is 10s, increase to 30s
   → Doesn't fix slowness, but prevents crashes
   
2. Check if external API is slow
   → Test API directly (curl request)
   → Check API status page
   → If API is slow: not your problem
   
3. Check network connectivity
   → High latency? → Move closer to API or use proxy
   
4. If specific queries are slow:
   → Add caching for common queries
   → Avoid re-fetching same results
   
5. Implement fallback
   → If tool times out, use cached/default value
   → Continue instead of failing

Proper fix (permanent):

  1. Use async I/O instead of blocking:

    # Bad: Blocking I/O
    def search_web(query):
        import requests  # Blocking
        response = requests.get(f"https://api.search.com?q={query}")
        return response.json()
    
    # Good: Async I/O
    async def search_web(query):
        import aiohttp  # Non-blocking
        async with aiohttp.ClientSession() as session:
            async with session.get(f"https://api.search.com?q={query}") as resp:
                return await resp.json()
  2. Add timeout with graceful degradation:

    import asyncio
    
    async def search_web_with_timeout(query, timeout=5):
        try:
            result = await asyncio.wait_for(
                search_web(query),
                timeout=timeout
            )
            return result
        except asyncio.TimeoutError:
            # Instead of crashing, return cached result
            cached = get_cached_result(query)
            if cached:
                log.warning("TOOL_TIMEOUT_USING_CACHE", {
                    "query": query,
                    "cache_age": get_cache_age(query)
                })
                return cached
            else:
                # If no cache, try default result
                return {"error": "timeout", "results": []}
  3. Implement caching for repeated queries:

    import hashlib
    import time
    
    SEARCH_CACHE = {}
    CACHE_TTL = 3600  # 1 hour
    
    def search_web_cached(query, cache_ttl=CACHE_TTL):
        cache_key = hashlib.md5(query.encode()).hexdigest()
        
        if cache_key in SEARCH_CACHE:
            cached_entry = SEARCH_CACHE[cache_key]
            age = time.time() - cached_entry["timestamp"]
            if age < cache_ttl:
                return cached_entry["result"]
        
        # Not in cache or expired, fetch
        result = search_web(query)  # May timeout
        
        SEARCH_CACHE[cache_key] = {
            "timestamp": time.time(),
            "result": result
        }
        
        return result
  4. Monitor tool latency continuously:

    TOOL_LATENCIES = {
        "web_search": [],
        "read_file": [],
        # ...
    }
    
    def track_tool_latency(tool_name, duration_ms):
        TOOL_LATENCIES[tool_name].append(duration_ms)
        
        # Calculate percentiles
        latencies = sorted(TOOL_LATENCIES[tool_name])
        p50 = latencies[len(latencies) // 2]
        p95 = latencies[int(len(latencies) * 0.95)]
        p99 = latencies[int(len(latencies) * 0.99)]
        
        # Alert if degradation
        if p99 > LATENCY_THRESHOLD:
            log.alert("TOOL_LATENCY_HIGH", {
                "tool": tool_name,
                "p50": p50, "p95": p95, "p99": p99
            })

Part 4: Memory Issues

Issue: Memory Corruption

Symptoms:

  • Agent uses wrong facts/outdated information
  • Agent mixes up information from different sessions
  • Agent contradicts itself (says X then says not X)
  • Quality suddenly drops

Root causes:

  1. Session mixing — memory from session A leaks into session B
  2. Stale cache — an old cached result is served instead of fresh data
  3. Consolidation error — summarization loses important details
  4. File corruption — memory file partially written/truncated

Diagnostic steps:

# Step 1: Check for session contamination
session_a = get_session("session-123")
session_b = get_session("session-456")

context_a = session_a.full_context
context_b = session_b.full_context

# Are they independent?
if any_facts_in_both(context_a, context_b):
    print("WARNING: Sessions share facts (should be independent)")

# Step 2: Check memory file integrity
import hashlib

with open("MEMORY.md", "r") as f:
    content = f.read()
    checksum = hashlib.md5(content.encode()).hexdigest()

expected_checksum = KNOWN_GOOD_CHECKSUM
if checksum != expected_checksum:
    print("✗ Memory file corrupted (checksum mismatch)")

# Step 3: Verify consolidation didn't lose info
before_consolidation = get_memory_snapshot("before")
after_consolidation = get_memory_snapshot("after")

lost_facts = facts_in_before_not_after(before_consolidation, after_consolidation)
if lost_facts:
    print(f"✗ Consolidation lost {len(lost_facts)} facts:")
    for fact in lost_facts:
        print(f"  - {fact}")

# Step 4: Check cache staleness
cache_entry = get_cache("query-123")
age = time.time() - cache_entry.created_at
if age > CACHE_TTL:
    print(f"WARNING: Cache entry is stale ({age}s old, TTL={CACHE_TTL}s)")

# Step 5: Check for partial writes
file_path = "MEMORY.md"
file_size = os.path.getsize(file_path)
expected_size = estimate_file_size(file_content)
if file_size != expected_size:
    print(f"WARNING: File size mismatch ({file_size} vs expected {expected_size})")
    print("→ File may have been partially written")

Quick fix (< 5 minutes):

1. Cold restart session
   → Start new session without old memory
   → Does quality improve? → Memory corruption confirmed
   
2. Clear cache
   → Delete SEARCH_CACHE
   → Memory files should regenerate
   
3. Check file permissions
   → Can harness write to MEMORY.md?
   → Are there write conflicts?
   
4. Revert recent memory changes
   → If MEMORY.md was recently edited, revert
   → git checkout MEMORY.md

Proper fix (permanent):

  1. Isolate sessions with session ID:

    # Every memory entry must include session_id
    class MemoryEntry:
        def __init__(self, content, session_id):
            self.content = content
            self.session_id = session_id
            self.created_at = datetime.now()
    
    # Before using memory, verify session_id matches
    def get_memory_for_session(session_id):
        all_entries = load_memory_file()
        session_entries = [
            e for e in all_entries
            if e.session_id == session_id
        ]
        return session_entries
  2. Implement memory versioning:

    # Save versions of MEMORY.md
    # MEMORY.md (current)
    # .MEMORY.backup (previous)
    # .MEMORY.v1, .MEMORY.v2, ... (history)
    
    def save_memory_with_backup():
        if os.path.exists("MEMORY.md"):
            shutil.copy("MEMORY.md", ".MEMORY.backup")
        
        # Write new version
        with open("MEMORY.md", "w") as f:
            f.write(new_memory_content)
        
        # Keep history
        import time
        timestamp = int(time.time())
        shutil.copy("MEMORY.md", f".MEMORY.v{timestamp}")
    
    def rollback_memory(version):
        """Restore memory to a previous version"""
        shutil.copy(f".MEMORY.v{version}", "MEMORY.md")
        log.info("MEMORY_ROLLED_BACK", {"version": version})
  3. Add memory file validation:

    def validate_memory_file():
        """Check memory file for corruption"""
        
        with open("MEMORY.md", "r") as f:
            content = f.read()
        
        # Check for common corruption signs
        if len(content) == 0:
            raise Exception("Memory file is empty (truncation)")
        
        if content.count("```") % 2 != 0:
            raise Exception("Memory file has unmatched code blocks (partial write)")
        
        # Verify JSON blocks are valid
        import json
        for block in extract_json_blocks(content):
            try:
                json.loads(block)
            except json.JSONDecodeError as e:
                raise Exception(f"Invalid JSON in memory: {e}")
        
        return True
  4. Implement atomic writes:

    import tempfile
    
    def write_memory_atomically(content):
        """Write memory file atomically (no partial writes)"""
        
        # Write to temporary file first
        with tempfile.NamedTemporaryFile(
            mode="w", dir=".", delete=False, suffix=".tmp"
        ) as tmp:
            tmp.write(content)
            tmp_path = tmp.name
        
        # Validate temporary file
        validate_memory_file_at_path(tmp_path)
        
        # Only then replace original
        os.replace(tmp_path, "MEMORY.md")
        
        log.info("MEMORY_WRITTEN_ATOMICALLY")

Issue: Memory Loss

Symptoms:

  • Agent doesn’t remember previous sessions
  • Agent repeats work from earlier
  • Agent says “I don’t have context” but information existed in memory

Root causes:

  1. Memory file not persisted — in-memory cache, lost on restart
  2. Memory pruned too aggressively — old memories deleted
  3. Memory not loaded on startup — file exists but not read
  4. Wrong session ID — looking for memories from different session
  5. Memory file deleted — accidental deletion or crash

Diagnostic steps:

# Step 1: Check if memory file exists
import os
if not os.path.exists("MEMORY.md"):
    print("✗ MEMORY.md does not exist")
else:
    file_size = os.path.getsize("MEMORY.md")
    print(f"✓ MEMORY.md exists ({file_size} bytes)")

# Step 2: Check if memory is being read on startup
startup_log = get_session_log(session_id).startup_events
memory_events = [e for e in startup_log if e.event == "memory_loaded"]
if not memory_events:
    print("✗ Memory not being loaded on startup")
else:
    for event in memory_events:
        print(f"✓ Loaded {event.facts_count} facts from MEMORY.md")

# Step 3: Check if memory is being written
write_events = get_logs(event="memory_written", last_n_hours=24)
if not write_events:
    print("WARNING: No memory writes in last 24 hours")
else:
    print(f"✓ Memory written {len(write_events)} times")

# Step 4: Check if information is in memory file
fact = "Important fact that should be remembered"
with open("MEMORY.md", "r") as f:
    memory_content = f.read()
    if fact in memory_content:
        print(f"✓ Fact is in MEMORY.md")
    else:
        print(f"✗ Fact NOT in MEMORY.md")
        print("→ Was it ever saved?")

# Step 5: Check memory pruning settings
from harness.config import MEMORY_CONFIG
print(f"Memory retention: {MEMORY_CONFIG.retention_days} days")
print(f"Max memory size: {MEMORY_CONFIG.max_tokens} tokens")
print(f"Pruning frequency: every {MEMORY_CONFIG.prune_interval_hours} hours")

Quick fix (< 5 minutes):

1. Check if MEMORY.md exists
   → If it doesn't, create it with bootstrap facts
   
2. Check if memory is being loaded
   → Look for memory_loaded event in startup
   → If missing, add memory loading to startup
   
3. Check if memory is persisted
   → Write a test fact to MEMORY.md
   → Restart harness
   → Is the fact still there?
   
4. If memory is being pruned too aggressively:
   → Increase retention period (retention_days)
   → Increase max memory size (max_tokens)
   → Reduce pruning frequency
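
Quick-fix step 3 (the persistence check) can be scripted as a smoke test. A minimal sketch; the `MEMORY.md` file name follows this document's convention, and the marker comment is an arbitrary sentinel:

```python
import os

MEMORY_FILE = "MEMORY.md"
TEST_FACT = "<!-- persistence-smoke-test -->"

def write_test_fact(path=MEMORY_FILE):
    """Append a sentinel line before restarting the harness."""
    with open(path, "a") as f:
        f.write(f"\n{TEST_FACT}\n")

def check_test_fact(path=MEMORY_FILE):
    """After restart: verify the sentinel survived."""
    if not os.path.exists(path):
        return False
    with open(path) as f:
        return TEST_FACT in f.read()

# Before restart:
write_test_fact()
# ... restart the harness ...
# After restart:
print("✓ persisted" if check_test_fact() else "✗ memory not persisted")
```

If the sentinel disappears across a restart, memory is living only in process state and the persistence fixes below apply.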

Proper fix (permanent):

  1. Implement automatic memory persistence:

    def load_memory_on_startup():
        """Load all memory files on startup"""
        
        memory_files = [
            "CLAUDE.md",      # Instructions
            "MEMORY.md",      # Consolidated facts
            "current_task.md" # Current work
        ]
        
        for filepath in memory_files:
            if os.path.exists(filepath):
                with open(filepath, "r") as f:
                    content = f.read()
                    agent.memory.add(filepath, content)
                log.info("MEMORY_LOADED", {"file": filepath})
            else:
                log.warning("MEMORY_FILE_MISSING", {"file": filepath})
        
        return agent.memory
    
    # Call on startup
    agent.memory = load_memory_on_startup()
  2. Implement periodic memory checkpoint:

    import threading
    
    def memory_checkpoint_loop():
        """Save memory every N minutes"""
        while True:
            time.sleep(300)  # Every 5 minutes
            
            # Get current memory state
            memory_content = agent.memory.export()
            
            # Write to file
            write_memory_atomically(memory_content)
            
            log.debug("MEMORY_CHECKPOINT", {
                "size_bytes": len(memory_content),
                "timestamp": datetime.now()
            })
    
    # Start checkpoint thread
    checkpoint_thread = threading.Thread(
        target=memory_checkpoint_loop,
        daemon=True
    )
    checkpoint_thread.start()
  3. Implement memory recovery:

    def recover_memory_from_backup():
        """If memory is corrupted, recover from backup"""
        
        if os.path.exists(".MEMORY.backup"):
            log.alert("MEMORY_RECOVERY_STARTING", {
                "source": ".MEMORY.backup"
            })
            shutil.copy(".MEMORY.backup", "MEMORY.md")
            return True
        
        # If no backup, try version history
        versions = glob.glob(".MEMORY.v*")
        if versions:
            latest_version = max(versions)
            log.alert("MEMORY_RECOVERY_FROM_VERSION", {
                "source": latest_version
            })
            shutil.copy(latest_version, "MEMORY.md")
            return True
        
        # If no backup/versions, reset to empty
        log.alert("MEMORY_RESET", {"reason": "no_backup_available"})
        write_memory_atomically("")
        return False
  4. Verify memory on each load:

    def load_and_validate_memory():
        """Load memory and verify it's not corrupted"""
        
        try:
            memory = load_memory_on_startup()
            
            # Validate
            if len(memory) == 0:
                log.warning("MEMORY_EMPTY")
            
            # Verify basic structure
            facts_count = count_facts(memory)
            log.info("MEMORY_LOADED", {
                "facts_count": facts_count,
                "bytes": len(str(memory))
            })
            
            return memory
        
        except MemoryCorruptionError:
            log.alert("MEMORY_CORRUPTED", {
                "action": "attempting recovery"
            })
            recovered = recover_memory_from_backup()
            
            if recovered:
                return load_memory_on_startup()
            else:
                # Start with empty memory
                return Memory()

Part 5: Cost & Budget Issues

Issue: Unexpected Cost Spike

Symptoms:

  • Daily cost > 2× normal
  • Unexpected charge from API provider
  • Cost spike with no corresponding increase in usage
  • One specific agent/session costs $100+ when typical is $10

Root causes:

  1. Runaway token generation — agent producing huge outputs
  2. Loop with high tokens — agent looping and using context each time
  3. Expensive model — switched to more expensive model
  4. Inefficient prompts — prompts grew in token size
  5. New feature using expensive model — verification using expensive LLM

Diagnostic steps:

# Step 1: Identify timing of spike
cost_by_hour = get_costs_by_hour(last_24_hours=True)
for hour, cost in cost_by_hour:
    if cost > 2 * NORMAL_HOURLY_COST:
        print(f"SPIKE at {hour}: ${cost} (>2x normal)")

# Step 2: Identify which agent/session caused spike
expensive_sessions = get_sessions_sorted_by_cost(limit=10)
for session in expensive_sessions:
    print(f"Session {session.id}: ${session.cost}")
    print(f"  Agent: {session.agent_id}")
    print(f"  Duration: {session.duration_seconds}s")
    print(f"  Iterations: {session.loop_iterations}")
    print(f"  Input tokens: {session.input_tokens}")
    print(f"  Output tokens: {session.output_tokens}")

# Step 3: Check if model changed
logs = get_logs(event="session_start", last_24_hours=True)
models_used = set(log.model for log in logs)
print(f"Models used: {models_used}")
if len(models_used) > 1:
    print("WARNING: Multiple models used")
    model_costs = {}
    for model in models_used:
        cost = sum(log.cost for log in logs if log.model == model)
        model_costs[model] = cost
    print(f"Cost by model: {model_costs}")

# Step 4: Check if prompts grew
old_prompt_size = get_avg_prompt_size(days=7)
new_prompt_size = get_avg_prompt_size(days=1)
growth = (new_prompt_size - old_prompt_size) / old_prompt_size
if growth > 0.2:
    print(f"WARNING: Prompts grew {growth:.1%}")

# Step 5: Check iteration counts
expensive_session = expensive_sessions[0]
for step in expensive_session.steps:
    print(f"Iteration {step.iteration}: "
          f"input={step.input_tokens}, output={step.output_tokens}")
    if step.output_tokens > 5000:
        print(f"  ^ Huge output ({step.output_tokens} tokens)")

Quick fix (< 5 minutes):

1. Identify the expensive session
   → Which session caused the spike?
   → What was it doing?
   
2. Check if model is wrong
   → Should it be using Claude 3.5 or Claude 3 Opus?
   → Revert to correct model
   
3. If looping excessively:
   → Set max iterations to 10
   → Kill any sessions > 15 iterations
   
4. If output tokens huge:
   → Check if agent is generating full documents
   → Limit output size
   
5. Enable cost alerts
   → Alert if cost > budget per session
   → Prevent cascade of expensive requests
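
Quick-fix step 3 (killing sessions past the iteration cap) can be sketched as a one-off script. `terminate_session` is the harness helper used later in this section; the session dictionaries here are illustrative:

```python
MAX_ITERATIONS = 15  # Kill threshold from the quick fix above

def kill_runaway_sessions(sessions):
    """Return the IDs of sessions past the iteration cap (and terminate them)."""
    killed = []
    for session in sessions:
        if session["loop_iterations"] > MAX_ITERATIONS:
            # terminate_session(session["id"])  # harness helper, assumed available
            killed.append(session["id"])
    return killed

# Example: two healthy sessions, one runaway
sessions = [
    {"id": "s-1", "loop_iterations": 4},
    {"id": "s-2", "loop_iterations": 22},
    {"id": "s-3", "loop_iterations": 9},
]
print(kill_runaway_sessions(sessions))  # → ['s-2']
```

This stops the bleeding; the per-session budget enforcer below is the permanent version of the same idea.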

Proper fix (permanent):

  1. Implement per-session cost budgets:

    class CostBudgetEnforcer:
        def __init__(self, max_cost_per_session: float = 1.0):
            self.max_cost = max_cost_per_session
        
        def check_budget_before_step(self, session_id: str):
            current_cost = get_session_cost(session_id)
            if current_cost > self.max_cost:
                raise BudgetExceededError(
                    f"Session cost ${current_cost} exceeds budget ${self.max_cost}"
                )
        
        def check_budget_after_step(self, session_id: str, step_cost: float):
            current_cost = get_session_cost(session_id)
            
            if current_cost > self.max_cost:
                log.alert("BUDGET_EXCEEDED", {
                    "session_id": session_id,
                    "cost": current_cost,
                    "budget": self.max_cost
                })
                terminate_session(session_id)
    
    # Use in agent loop
    enforcer = CostBudgetEnforcer(max_cost_per_session=5.0)
    for step in agent_steps:
        enforcer.check_budget_before_step(session.id)
        result = execute_step()
        enforcer.check_budget_after_step(session.id, result.cost)
  2. Implement cost alerts:

    def cost_alert_system():
        """Alert when costs exceed thresholds"""
        
        COST_THRESHOLDS = {
            "daily": 1000,      # Alert if daily cost > $1000
            "hourly": 100,      # Alert if hourly cost > $100
            "session": 10,      # Alert if session cost > $10
            "step": 1,          # Alert if step cost > $1
        }
        
        while True:
            costs = get_current_costs()
            
            if costs["daily"] > COST_THRESHOLDS["daily"]:
                send_alert(f"Daily cost ${costs['daily']} exceeded")
            
            if costs["hourly"] > COST_THRESHOLDS["hourly"]:
                send_alert(f"Hourly cost ${costs['hourly']} exceeded")
            
            time.sleep(60)
  3. Track and alert on model changes:

    EXPECTED_MODELS = {
        "general_agent": "claude-3-5-sonnet",
        "verification_agent": "claude-3-opus",
    }
    
    def verify_model_on_startup(agent_id: str):
        expected = EXPECTED_MODELS[agent_id]
        actual = get_model_for_agent(agent_id)
        
        if expected != actual:
            log.alert("MODEL_MISMATCH", {
                "agent_id": agent_id,
                "expected": expected,
                "actual": actual,
                "cost_difference": get_cost_difference(expected, actual)
            })
  4. Implement cost attribution:

    def log_cost_attribution():
        """Break down costs by agent, model, tool, etc"""
        
        costs_by_agent = {}
        costs_by_model = {}
        costs_by_tool = {}
        
        for session in get_all_sessions():
            agent = session.agent_id
            model = session.model
            
            costs_by_agent[agent] = costs_by_agent.get(agent, 0) + session.cost
            costs_by_model[model] = costs_by_model.get(model, 0) + session.cost
            
            for step in session.steps:
                if step.tool_name:
                    costs_by_tool[step.tool_name] = \
                        costs_by_tool.get(step.tool_name, 0) + step.cost
        
        log.info("COST_ATTRIBUTION", {
            "by_agent": costs_by_agent,
            "by_model": costs_by_model,
            "by_tool": costs_by_tool
        })

Issue: Cost Exceeding Budget

Symptoms:

  • Monthly cost exceeds allocated budget
  • No single spike, but slow creep upward
  • New feature is more expensive than projected
  • Cost per task higher than expected

Root causes:

  1. Inefficient prompts — prompts larger than necessary
  2. Inefficient model choice — using expensive model for simple tasks
  3. No caching — repeating expensive computations
  4. Feature too expensive — new feature costs more than projected
  5. Volume growth — more requests than anticipated

Diagnostic steps:

# Step 1: Compare projected vs actual costs
budget = get_monthly_budget()
actual_cost = get_monthly_cost()
print(f"Budget: ${budget}")
print(f"Actual: ${actual_cost}")
print(f"Over budget by: ${actual_cost - budget}")

# Step 2: Break down costs by feature
costs_by_feature = {}
for session in get_sessions_this_month():
    feature = session.tags[0] if session.tags else "unknown"
    costs_by_feature[feature] = costs_by_feature.get(feature, 0) + session.cost

for feature, cost in sorted(costs_by_feature.items(), key=lambda x: x[1], reverse=True):
    print(f"{feature}: ${cost}")

# Step 3: Compare to baseline
baseline_cost_per_task = get_historical_average("cost_per_task")
current_cost_per_task = get_current_average("cost_per_task")
change = (current_cost_per_task - baseline_cost_per_task) / baseline_cost_per_task
print(f"Cost per task: ${baseline_cost_per_task} → ${current_cost_per_task} ({change:.1%})")

# Step 4: Check model distribution
models = {}
for session in get_sessions_this_month():
    model = session.model
    models[model] = models.get(model, 0) + session.cost

print("Cost by model:")
for model, cost in sorted(models.items(), key=lambda x: x[1], reverse=True):
    print(f"  {model}: ${cost}")

# Step 5: Check for low-hanging optimization
caching_potential = estimate_caching_potential()
print(f"Caching potential: Save ${caching_potential}")

model_switch_potential = estimate_model_switch_potential()
print(f"Model switch potential: Save ${model_switch_potential}")

Quick fix (< 5 minutes):

1. Identify the most expensive feature
   → Break down by feature tag
   → Focus on top 3 expensive features
   
2. Check if there's easy caching potential
   → Same queries repeating?
   → Add caching, reduce cost 20-30%
   
3. Check model choice
   → Is expensive model necessary?
   → Can you use cheaper model for 80% of tasks?
   
4. Reduce prompt size if possible
   → Remove unnecessary context
   → Compress memory file
   → Savings scale with prompt share: trimming 1,000 tokens from a
     7k-30k token prompt cuts input cost roughly 3-15%
   
5. Adjust routing/filtering
   → Can some tasks be answered without LLM?
   → Route simple tasks to tool instead of LLM
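
The prompt-trimming arithmetic in step 4 is easy to sanity-check. A sketch using an illustrative per-million-token input price (not any provider's actual rate):

```python
def input_cost_saving(prompt_tokens, tokens_trimmed, price_per_mtok=3.0):
    """Fractional input-cost reduction from trimming a prompt.

    price_per_mtok is an illustrative $/1M-input-tokens rate; the
    fraction saved is independent of the rate.
    """
    before = prompt_tokens * price_per_mtok / 1_000_000
    after = (prompt_tokens - tokens_trimmed) * price_per_mtok / 1_000_000
    return (before - after) / before

# Trimming 1,000 tokens from a 10k-token prompt saves 10% of input cost
print(f"{input_cost_saving(10_000, 1_000):.0%}")  # → 10%
```

Because the saving is proportional, the same 1,000 trimmed tokens matter far more on small prompts than on 100k-token contexts.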

Proper fix (permanent):

  1. Implement cost-aware model routing:

    def select_model_for_task(task, task_complexity: str):
        """Route to the cheapest model that meets requirements"""
        
        # (model_name, relative_cost) per complexity tier
        MODELS = {
            "simple": ("gemini-2-flash", 0.06),       # Cheapest, fast
            "moderate": ("claude-3-5-sonnet", 1.0),   # Good balance
            "complex": ("claude-3-opus", 5.0),        # Best reasoning
        }
        
        model, estimated_cost = MODELS[task_complexity]
        
        # If cost > threshold, try a cheaper tier first
        if estimated_cost > COST_THRESHOLD:
            cheaper_tiers = sorted(
                (tier for tier, (_, cost) in MODELS.items() if cost < estimated_cost),
                key=lambda tier: MODELS[tier][1]
            )
            
            # Test whether the cheapest viable tier handles the task
            if cheaper_tiers and try_with_model(MODELS[cheaper_tiers[0]][0], task):
                model, estimated_cost = MODELS[cheaper_tiers[0]]
        
        return model, estimated_cost
  2. Implement smart caching:

    import hashlib
    import time
    
    QUERY_CACHE = {}
    CACHE_TTL = 86400  # 24 hours
    
    def get_with_cache(query: str, expensive_operation):
        cache_key = hashlib.sha256(query.encode()).hexdigest()
        
        if cache_key in QUERY_CACHE:
            entry = QUERY_CACHE[cache_key]
            age = time.time() - entry["timestamp"]
            
            if age < CACHE_TTL:
                entry["hits"] += 1
                log.debug("CACHE_HIT", {"query": query})
                return entry["result"]
        
        # Not cached (or expired), execute
        result = expensive_operation()
        
        QUERY_CACHE[cache_key] = {
            "result": result,
            "timestamp": time.time(),
            "hits": 0,  # Each later hit saves AVERAGE_QUERY_COST
        }
        
        return result
    
    # Estimate savings
    total_hits = sum(entry["hits"] for entry in QUERY_CACHE.values())
    total_savings = total_hits * AVERAGE_QUERY_COST
    print(f"Cache savings: ${total_savings}")
  3. Implement cost per feature tracking:

    def track_feature_cost(feature_name: str, session_cost: float):
        """Track cumulative cost per feature"""
        
        FEATURE_BUDGETS = {
            "search": 100,      # Max $100/month for search feature
            "summarize": 50,    # Max $50/month for summarize
            "translate": 30,    # Max $30/month for translate
        }
        
        current_month_cost = get_feature_cost_this_month(feature_name)
        budget = FEATURE_BUDGETS.get(feature_name, float('inf'))
        
        if current_month_cost + session_cost > budget:
            log.alert("FEATURE_BUDGET_EXCEEDED", {
                "feature": feature_name,
                "current_cost": current_month_cost,
                "session_cost": session_cost,
                "budget": budget
            })

Part 6: Quality Issues

Issue: Hallucinations Increased

Symptoms:

  • Model making up facts not in context
  • Model confident about false information
  • Model citing sources that don’t exist
  • Factual accuracy dropped

Root causes:

  1. Model drift — model behavior changed with update
  2. Prompt changed — instruction change causing more creativity
  3. Temperature increased — more randomness/creativity
  4. Memory corruption — mixing up facts from different contexts
  5. Context too short — model hallucinating to fill gaps

Diagnostic steps:

# Step 1: Measure hallucination rate
responses = get_responses_this_week()
hallucinations = []

for response in responses:
    facts = extract_facts(response)
    for fact in facts:
        if not is_in_context(fact, response.context):
            if not is_known_fact(fact):
                hallucinations.append({
                    "fact": fact,
                    "response": response.id,
                    "timestamp": response.timestamp
                })

hallucination_rate = len(hallucinations) / len(responses)
print(f"Hallucination rate: {hallucination_rate:.1%}")

baseline_rate = get_historical_hallucination_rate()
if hallucination_rate > baseline_rate * 1.5:
    print(f"WARNING: >50% increase over baseline ({baseline_rate:.1%})")

# Step 2: Check for recent changes
recent_changes = get_recent_changes(last_24_hours=True)
for change in recent_changes:
    print(f"Change: {change.type}")
    if change.type == "prompt":
        print(f"  Before: {change.old_value[:100]}")
        print(f"  After: {change.new_value[:100]}")
    elif change.type == "model":
        print(f"  {change.old_value} → {change.new_value}")
    elif change.type == "temperature":
        print(f"  {change.old_value} → {change.new_value}")

# Step 3: Check model and parameters
print(f"Model: {agent.model}")
print(f"Temperature: {agent.temperature}")
print(f"Top P: {agent.top_p}")

# Higher temperature = more random/creative
if agent.temperature > 0.5:
    print("WARNING: High temperature may cause hallucinations")

# Step 4: Check context size
avg_context_size = get_average_context_size()
print(f"Average context: {avg_context_size} tokens")

if avg_context_size < 1000:
    print("WARNING: Small context may cause hallucinations")

# Step 5: Compare staging vs production
staging_hallucination_rate = get_hallucination_rate("staging")
prod_hallucination_rate = get_hallucination_rate("production")
print(f"Staging: {staging_hallucination_rate:.1%}")
print(f"Production: {prod_hallucination_rate:.1%}")
if prod_hallucination_rate > staging_hallucination_rate:
    print("WARNING: Production has higher hallucination rate")

Quick fix (< 5 minutes):

1. Reduce temperature
   → Set temperature to 0.3 instead of 0.7
   → More deterministic = fewer hallucinations
   
2. Check for recent model change
   → Did you upgrade model in last 24h?
   → Rollback to previous model
   → Test if hallucination rate drops
   
3. Check for prompt changes
   → Did someone edit the system prompt?
   → Revert prompt to working version
   
4. Add fact verification step
   → After agent generates response
   → Agent must cite sources for each fact
   → If no source, agent must admit uncertainty

Proper fix (permanent):

  1. Implement fact verification loop:

    def verify_facts_in_response(response: str, context: str):
        """Verify each fact in response comes from context"""
        
        facts = extract_facts(response)
        
        unverified_facts = []
        for fact in facts:
            if fact not in context:
                if not is_well_known_fact(fact):
                    unverified_facts.append(fact)
        
        if unverified_facts:
            # Ask agent to remove or cite these facts
            prompt = f"""
            Your response contains these facts not in the provided context:
            {unverified_facts}
            
            For each fact:
            - Remove it if it's speculation
            - Or cite which document supports it
            
            Revised response:
            """
            
            verified_response = agent.continue_conversation(prompt)
            return verified_response
        
        return response
  2. Add confidence scoring:

    def add_confidence_scores(response: str):
        """Ask agent to add confidence scores to facts"""
        
        prompt = f"""
        Review your response and add confidence scores:
        - [HIGH]: Directly from provided documents
        - [MEDIUM]: Reasonable inference from documents
        - [LOW]: General knowledge, not in documents
        - [UNCERTAIN]: Not sure, may be wrong
        
        Example: "The company has [HIGH] 1000 employees 
        and likely [MEDIUM] plans expansion, though I'm [UNCERTAIN] 
        about the timeline."
        
        Response with confidence scores:
        """
        
        scored_response = agent.continue_conversation(prompt)
        return scored_response
  3. Baseline and monitor hallucination rate:

    class HallucinationMonitor:
        def __init__(self, baseline_rate: float = 0.05):
            self.baseline_rate = baseline_rate  # 5%
            self.alert_threshold = baseline_rate * 1.5  # 7.5%
        
        def check_hallucination_rate(self):
            current_rate = measure_current_hallucination_rate()
            
            if current_rate > self.alert_threshold:
                log.alert("HALLUCINATION_RATE_HIGH", {
                    "baseline": self.baseline_rate,
                    "current": current_rate,
                    "threshold": self.alert_threshold
                })
                return False
            
            return True
    
    monitor = HallucinationMonitor()
    monitor.check_hallucination_rate()

Part 7: Performance Issues

Issue: Slow Inference

Symptoms:

  • Model takes 5-10+ seconds to generate first token
  • All requests slow, not just some
  • Latency increases over time (doesn’t improve with restart)
  • Model loading slower than before

Root causes:

  1. Large context window — more tokens = slower processing
  2. Model size increase — switched to larger model
  3. GPU out of memory — falling back to CPU (orders of magnitude slower)
  4. Model not cached — reloading model from disk each time
  5. Increased load — GPU busy with other requests

Diagnostic steps:

# Step 1: Measure latency
start = time.time()
response = model.generate("test prompt")
latency = time.time() - start

first_token_latency = response.metrics["first_token_ms"]
print(f"Total latency: {latency*1000:.0f}ms")
print(f"First token: {first_token_latency:.0f}ms")

# Normal first token: 50-200ms
# Slow first token: > 500ms (suggests issue)

# Step 2: Check GPU usage
import gpustat
gpu_info = gpustat.new_query()
for gpu in gpu_info:
    print(f"GPU {gpu.index}: {gpu.utilization}% used, {gpu.memory_used}/{gpu.memory_total} MB")

if any(gpu.memory_used > gpu.memory_total * 0.9 for gpu in gpu_info):
    print("WARNING: GPU running low on memory")

# Step 3: Check context size
context_size = count_tokens(full_context)
print(f"Context size: {context_size} tokens")

# More context = slower processing
# Typical: 5-10K tokens
# Slow: > 100K tokens

# Step 4: Check model size
model_info = get_model_info(model_name)
print(f"Model: {model_name}")
print(f"Model size: {model_info.parameters} parameters")

# 7B model: ~13GB memory
# 13B model: ~26GB memory
# 70B model: ~140GB memory (needs multi-GPU)

# Step 5: Check if model is cached
if is_model_loaded_in_memory():
    print("✓ Model in memory (fast)")
else:
    print("✗ Model not in memory, will load from disk (slow)")

Quick fix (< 5 minutes):

1. Check if GPU is out of memory
   → Run nvidia-smi
   → If used > 90%, try restarting to free memory
   
2. Check context size
   → Is it much larger than before?
   → Reduce context (prune old memories)
   
3. Check if model loaded
   → Is model in GPU memory?
   → Load it once, don't reload each request
   
4. Reduce batch size if applicable
   → If processing multiple requests, reduce batch
   → Gives GPU more free memory per request
   
5. Profile to find bottleneck
   → Which part is slow? (model loading, inference, tokenization?)
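Step 5's profiling can be as simple as wrapping each stage in a timer. This is a sketch; the tokenizer and model calls below are placeholders for your own stack:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage: str):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Usage (stubs stand in for real tokenizer/model calls):
with timed("tokenize"):
    tokens = list("some prompt")   # placeholder for tokenizer.encode()
with timed("inference"):
    time.sleep(0.01)               # placeholder for model.generate()

slowest = max(timings, key=timings.get)
print(f"Slowest stage: {slowest} ({timings[slowest]*1000:.0f}ms)")
```

Once you know whether loading, tokenization, or generation dominates, the fixes below target each case.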

Proper fix (permanent):

  1. Implement model caching:

    import gc
    
    class ModelCache:
        def __init__(self):
            self.cached_models = {}
        
        def load_model(self, model_name: str):
            if model_name not in self.cached_models:
                print(f"Loading {model_name}...")
                model = load_model_from_disk(model_name)
                self.cached_models[model_name] = model
            
            return self.cached_models[model_name]
        
        def unload_unused_models(self):
            # Keep only last 2 models in memory
            if len(self.cached_models) > 2:
                oldest = min(self.cached_models.items(), 
                           key=lambda x: x[1].last_used)
                del self.cached_models[oldest[0]]
                gc.collect()
  2. Implement async/batching:

    import asyncio
    
    class InferenceBatcher:
        def __init__(self, batch_size: int = 4):
            self.batch_size = batch_size
            self.queue = asyncio.Queue()
        
        async def add_request(self, prompt: str):
            await self.queue.put(prompt)
        
        async def process_batches(self):
            while True:
                batch = []
                
                # Collect up to batch_size requests
                for _ in range(self.batch_size):
                    try:
                        prompt = self.queue.get_nowait()
                        batch.append(prompt)
                    except asyncio.QueueEmpty:
                        break
                
                if batch:
                    # Process batch together (faster than one-by-one)
                    results = model.generate_batch(batch)
                    # ... return results
                
                await asyncio.sleep(0.1)
  3. Monitor and alert on latency degradation:

    class LatencyMonitor:
        def __init__(self):
            self.baseline_latency = 150  # ms
            self.alert_threshold = 500   # ms
        
        def check_latency(self, latency_ms: float):
            if latency_ms > self.alert_threshold:
                degradation = (latency_ms - self.baseline_latency) / self.baseline_latency
                log.alert("LATENCY_DEGRADATION", {
                    "baseline": self.baseline_latency,
                    "current": latency_ms,
                    "degradation": f"{degradation:.0%}"
                })

Part 8: Common Error Messages

Error: 429 Rate Limit Exceeded

Message:

APIError: 429 Rate limit exceeded. Please retry after 60 seconds.

What it means: You’ve made too many requests to the API. The API is rate-limiting you to prevent abuse.

Causes:

  • Too many concurrent requests
  • Exceeded monthly token quota
  • API provider bandwidth limit

Quick fix:

import time

def call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError as e:
            if attempt < max_retries - 1:
                wait_time = int(e.retry_after) if hasattr(e, 'retry_after') else 2**attempt
                print(f"Rate limited, waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

Error: Context Window Exceeded

Message:

ContextLengthExceededError: Prompt too long (8,532 tokens > 4,096 max)

What it means: Your prompt (context + message) is too long for the model. Need to reduce it.

Quick fix:

  • Switch to model with larger context window (e.g., Claude 3.5 with 200K)
  • Reduce startup memory (CLAUDE.md, MEMORY.md)
  • Summarize old messages
  • Use compression (LLM Wiki pattern)
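One mechanical way to apply the last two bullets is to keep the system prompt and the newest turns, dropping the oldest middle messages until the prompt fits. A sketch, where `count_tokens` is a crude stand-in for your model's real tokenizer:

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: ~4 characters per token. Use your model's tokenizer.
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the system message and newest turns; drop oldest middle turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(count_tokens(m["content"]) for m in system + rest) > max_tokens:
        rest.pop(0)  # drop the oldest non-system message
    return system + rest
```

A summarization pass over the dropped messages (instead of discarding them) preserves more context at the same token budget.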

Error: Model Not Found

Message:

APIError: Model 'gpt-4-turbo-2024-04-09' not found

Causes:

  • Model deprecated/removed
  • Typo in model name
  • Wrong API provider

Quick fix:

  • List available models: curl https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY"
  • Check model documentation for current available models
  • Use standardized model names from documentation

Part 9: FAQ — Frequently Asked Questions

Q: Why is my agent looping?

A: Agents loop when:

  1. Tool keeps failing — agent thinks it should retry
  2. Task is ambiguous — agent doesn’t know when to stop
  3. No termination logic — max_iterations not set

Fix: See “Agent Stuck in Loop” section above.
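A minimal guard covers both missing pieces at once: a hard iteration cap and detection of repeated identical tool calls. A sketch:

```python
class LoopGuard:
    """Stop an agent that exceeds max_iterations or repeats the same call."""

    def __init__(self, max_iterations: int = 15, max_repeats: int = 3):
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self.iterations = 0
        self.call_counts = {}

    def check(self, tool_name: str, tool_args: str) -> bool:
        """Return True if the agent may continue, False if it should stop."""
        self.iterations += 1
        key = (tool_name, tool_args)
        self.call_counts[key] = self.call_counts.get(key, 0) + 1
        if self.iterations > self.max_iterations:
            return False  # hard cap: task is taking too many steps
        if self.call_counts[key] > self.max_repeats:
            return False  # same tool + same args repeatedly = stuck
        return True
```

Call `guard.check(...)` before dispatching each tool call and surface a clear "stopped: loop detected" result to the user instead of burning tokens silently.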


Q: How do I reduce costs?

A: Top cost-reduction tactics (in order of impact):

  1. Use smaller model (80% saving): SLM (7B) for loop, LLM (70B+) for verify only
  2. Cache results (30-50% saving): Repeat queries shouldn’t re-run
  3. Reduce context (20-40% saving): Compress memory, use LLM Wiki pattern
  4. Use quantization (20% saving): 4-bit models run on cheaper hardware with minimal quality loss
  5. Route smart (10-20% saving): Simple tasks don’t need expensive model

Example: Hybrid setup can save up to 80-90% vs pure cloud (when most requests route locally):

  • 80% of requests → cheap local SLM (Phi, Mistral 7B) = ~$0
  • 20% of requests → verify with Claude Opus = ~$3/1M tokens
  • Total: ~$0.60/1M tokens vs $15/1M for pure Claude
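The blended figure follows directly from the split; a sketch of the arithmetic:

```python
def blended_cost_per_million(local_share: float, local_cost: float,
                             cloud_cost: float) -> float:
    """Blended $/1M tokens for a hybrid local/cloud split."""
    return local_share * local_cost + (1 - local_share) * cloud_cost

# 80% local (~$0) + 20% premium cloud model (~$3/1M tokens)
cost = blended_cost_per_million(0.80, 0.0, 3.0)
print(f"Blended: ${cost:.2f}/1M tokens")  # vs $15/1M for a pure premium model
```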

Q: What model size do I need?

A: Choose based on task complexity:

| Task | Model | Reason |
|------|-------|--------|
| Classification | 7B SLM | Fast, cheap, good enough |
| Summarization | 13B SLM | Good balance |
| Q&A retrieval | 13B SLM | Needs reasoning but not deep |
| Code generation | 34B SLM | Needs better code understanding |
| Complex reasoning | 70B LLM | Requires deep reasoning |
| Verification | 70B+ LLM | Needs high accuracy |

Q: Should I use cloud or local models?

A: Decision tree:

Tokens/day < 100K?
  → Cloud (cheaper for low volume)
Tokens/day 100K-1M?
  → Hybrid (local for loop, cloud for verify)
Tokens/day > 1M?
  → Local self-hosted (cheaper at scale)
Needs latest model?
  → Cloud (local models lag by 6-12 months)
Sensitive data?
  → Local (keep data on-premise)
No GPU available?
  → Cloud (can't run local models)
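The tree above can be encoded directly, checking hard constraints before volume. A sketch using the thresholds stated above:

```python
def choose_deployment(tokens_per_day: int, needs_latest_model: bool = False,
                      sensitive_data: bool = False, has_gpu: bool = True) -> str:
    """Apply the cloud-vs-local decision tree, hard constraints first."""
    if sensitive_data:
        return "local"       # keep data on-premise
    if not has_gpu or needs_latest_model:
        return "cloud"       # can't run local, or local models lag 6-12 months
    if tokens_per_day < 100_000:
        return "cloud"       # cheaper at low volume
    if tokens_per_day <= 1_000_000:
        return "hybrid"      # local for the loop, cloud for verify
    return "local"           # self-hosted wins at scale
```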

Q: How do I debug agent decisions?

A: Enable detailed logging:

# Log every step
for step in agent.steps:
    print(f"Step {step.iteration}:")
    print(f"  Reasoning: {step.reasoning}")
    print(f"  Tool: {step.tool_name}")
    print(f"  Result: {step.tool_result[:200]}")
    print(f"  Cost: ${step.cost}")

# Check if reasoning makes sense
# If reasoning is wrong → LLM confused, need clearer instruction
# If reasoning right but tool wrong → Tool choice issue

Q: What’s the difference between ReAct and Tree of Thoughts?

A:

| Framework | How it works | Best for | Cost |
|-----------|--------------|----------|------|
| ReAct | Think → Act → Observe (loop) | Most tasks, default choice | Baseline (1-8 iterations) |
| Tree of Thoughts | Explore multiple branches, keep best | Complex problems, deep reasoning | 3-5× more expensive (many branches) |
| Reflexion | Act → Get feedback → Self-correct | Quality improvement, when first try fails | 2-3× cost (add reflection step) |

Recommendation: Start with ReAct. Use Tree of Thoughts only if ReAct success rate < 70%.
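The recommendation can be enforced automatically: escalate from ReAct only when its measured success rate justifies the extra cost. A sketch:

```python
def choose_framework(react_success_rate: float,
                     needs_self_correction: bool = False) -> str:
    """Default to ReAct; escalate only when measured success is too low."""
    if react_success_rate >= 0.70:
        return "react"
    if needs_self_correction:
        return "reflexion"         # 2-3x cost: add a feedback/correction step
    return "tree_of_thoughts"      # 3-5x cost: explore multiple branches
```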


Q: How much GPU memory do I need?

A: For different model sizes:

| Model | Memory | GPU | Cost/month |
|-------|--------|-----|------------|
| 7B | 14GB | 1× RTX 4090 | $500 |
| 13B | 26GB | 1× RTX 4090 | $500 |
| 34B | 68GB | 1× H100 | $3K |
| 70B | 140GB | 2× H100 | $6K |
| 405B | 810GB | Requires specialized hardware | $20K+ |

Cheaper alternative: Use cloud API (pay per token, no hardware cost).
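Whether self-hosting pays off depends on volume; a rough break-even check (sketch, plug in your own hardware and API prices):

```python
def breakeven_tokens_per_month(gpu_cost_per_month: float,
                               api_cost_per_million: float) -> float:
    """Tokens/month above which fixed GPU cost beats pay-per-token API."""
    return gpu_cost_per_month / api_cost_per_million * 1_000_000

# Example: $500/month RTX 4090 vs a $3/1M-token API
tokens = breakeven_tokens_per_month(500, 3.0)
print(f"Break-even: {tokens / 1e6:.0f}M tokens/month")
```

Below the break-even volume, the API is cheaper even though the per-token price looks high.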


Q: Is my harness secure?

A: Security checklist:

  • Input validation (check for injection patterns)
  • Output filtering (no PII leaks)
  • Rate limiting (prevent abuse)
  • Audit logging (track who did what)
  • Secrets management (no hardcoded API keys)
  • Sandboxing (restrict tool access)

See 10_security_and_safety.md for full details.


Q: How do I monitor production?

A: Essential metrics:

ESSENTIAL_METRICS = [
    "error_rate",           # % of requests failing
    "latency_p50/p95/p99",  # Request duration
    "cost_per_task",        # Token cost trending
    "success_rate",         # % of agent reaching goal
    "loop_iterations",      # Avg steps per task (higher = less efficient)
    "memory_usage",         # RAM / context window usage
    "loop_detection",       # Count of stuck agents
]

Alerting:

  • Error rate > 5% → page on-call
  • Cost/task > 2× baseline → page on-call
  • Success rate drops > 10% → investigate
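These thresholds can be wired into a single check; a sketch that returns the alerts to fire (routing to your pager is left out):

```python
def evaluate_alerts(error_rate: float, cost_per_task: float,
                    baseline_cost: float, success_rate: float,
                    baseline_success: float) -> list[str]:
    """Return the alerts that should fire for the current metrics."""
    alerts = []
    if error_rate > 0.05:
        alerts.append("page: error_rate > 5%")
    if cost_per_task > 2 * baseline_cost:
        alerts.append("page: cost_per_task > 2x baseline")
    if success_rate < baseline_success - 0.10:
        alerts.append("investigate: success_rate dropped > 10%")
    return alerts
```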

Q: What’s the best prompt?

A: No single “best” prompt, but follow these principles:

  1. Clear role: “You are a Python expert”
  2. Clear task: “Your job is to review this code”
  3. Clear constraints: “Don’t suggest breaking changes”
  4. Clear output format: “Return JSON with keys: issues, severity”
  5. Examples: Show 1-2 examples of good responses

Bad prompt:

"Write code"

Good prompt:

You are a senior Python engineer.
Review this Python code and identify bugs.
Focus on: memory leaks, infinite loops, security issues.
Output as JSON: {"issues": [{"line": 5, "type": "memory_leak", "fix": "..."}]}

Example:
Code: for x in data: items.append(x)  # grows unbounded
Issue: Memory leak if data is large, items is never freed
Fix: Use generator instead: (x for x in data)

Part 10: Decision Trees for Diagnosis

When Error Rate Spikes

Error rate > 5%?
├─ Check specific error in logs
│  ├─ "Tool not found" → Tool missing/renamed
│  ├─ "Rate limit" → API quota exceeded
│  ├─ "Timeout" → External service slow
│  └─ "Model error" → Model offline/changed

├─ Check recent changes (last 2 hours)
│  ├─ Deployed new code? → Rollback
│  ├─ Changed prompt? → Revert prompt
│  ├─ Switched model? → Switch back
│  └─ No recent changes → Check external services

└─ Check metrics
   ├─ Latency high? → Performance issue
   ├─ Cost high? → Runaway agent
   └─ Memory high? → Memory leak
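The first branch of the tree, matching the error string to a runbook section, can be a simple lookup. A sketch (extend the table with your own error patterns):

```python
# Map substrings of production error messages to runbook sections.
ERROR_TRIAGE = {
    "tool not found": "Tool Issues: Not Registered",
    "rate limit": "Common Error Messages: 429",
    "timeout": "Tool Issues: Timeout",
    "context": "Common Error Messages: Context Exceeded",
}

def triage(error_message: str) -> str:
    """Return the runbook section for an error message, if any pattern matches."""
    msg = error_message.lower()
    for needle, section in ERROR_TRIAGE.items():
        if needle in msg:
            return section
    return "Unknown: check recent changes and external services"
```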

When Cost Increases

Cost > budget?
├─ Identify the expensive session
│  ├─ High iteration count? → Loop issue (see "Stuck in Loop")
│  ├─ High output tokens? → Agent over-generating
│  └─ Many small costs? → Repeated expensive operations

├─ Check model used
│  ├─ Using expensive model? → Switch to cheaper
│  ├─ Changed model? → Revert
│  └─ Using correct model? → Check iteration count

└─ Quick wins
   ├─ Cache search results (30% savings)
   ├─ Use cheaper model for 80% of requests (80% savings)
   └─ Reduce startup memory (10-20% savings)

Part 11: Incident Playbook

Incident: Cost $5K in 24 hours (Normal: $100)

Timeline (do this ASAP):

  1. Minute 1-2: Kill agent if still running
  2. Minute 3-5: Identify which session/agent caused spike
  3. Minute 6-10: Check logs for what it was doing
  4. Minute 11-15: Implement hard cost limit (prevent repeat)
  5. Hour 1: Root-cause analysis (why did this happen?)
  6. Hour 2: Fix and validate fix

Debug steps:

# Step 1: Find expensive sessions
expensive_sessions = get_sessions_by_cost(sort="descending")
culprit = expensive_sessions[0]

print(f"Session {culprit.id}:")
print(f"  Cost: ${culprit.cost}")
print(f"  Duration: {culprit.duration_seconds}s")
print(f"  Iterations: {culprit.loop_iterations}")
print(f"  Input tokens: {culprit.input_tokens}")
print(f"  Output tokens: {culprit.output_tokens}")

# Step 2: Check what it was doing
for step in culprit.steps[:10]:  # First 10 iterations
    print(f"Iteration {step.iteration}:")
    print(f"  Tool: {step.tool_name}")
    print(f"  Tokens: in={step.input_tokens}, out={step.output_tokens}")
    print(f"  Cost: ${step.cost}")

# Was it looping? Generating huge outputs? Using expensive model?

# Step 3: Check for the root cause
if culprit.loop_iterations > 20:
    print("ROOT CAUSE: Agent looping excessively")
    # See "Agent Stuck in Loop" fix
elif culprit.output_tokens > 50000:
    print("ROOT CAUSE: Agent generating huge outputs")
    # Check what it was generating
elif culprit.model == "claude-3-opus":
    print("ROOT CAUSE: Used expensive model instead of cheap one")
    # Check why it switched models

Prevent repeat:

# Add hard cost limit
class HardCostLimit:
    def __init__(self, max_cost_per_session: float = 5.0):
        self.max_cost = max_cost_per_session
    
    def check(self, session_cost: float):
        if session_cost > self.max_cost:
            kill_session_immediately()
            alert_ops("COST_LIMIT_HIT")
            raise Exception(f"Cost ${session_cost} exceeds limit ${self.max_cost}")

# Deploy immediately
limit = HardCostLimit(max_cost_per_session=5.0)

Conclusion

When something breaks in production, speed and calm matter most. Use these tools:

  1. Decision tree → Narrow down the problem fast
  2. Diagnostic steps → Verify your hypothesis
  3. Quick fix → Stop the bleeding (< 5 min)
  4. Proper fix → Prevent it recurring (permanent)
  5. Prevention → Add monitoring/checks

Most production incidents follow patterns. If you’ve seen it once, you can fix it again—faster.

Keep this runbook bookmarked. Update it with new incidents you find.


Quick Reference: Commands

# View logs for a specific error
grep "ERROR" harness.log | grep "tool_timeout" | tail -20

# Check which agent is expensive
jq '.sessions | sort_by(.cost) | reverse | .[0]' sessions.json

# Count iterations for a session
jq '.steps | length' session.json

# Check model used
jq '.model' session.json

# Get cost breakdown
jq '{model: .model, cost: .cost, tokens: .input_tokens + .output_tokens}' session.json

Further Reading

  • 09_operations_and_observability.md — Full logging and monitoring guide
  • 10_security_and_safety.md — Security hardening
  • 11_testing_and_qa.md — Quality assurance
  • 13_cost_management.md — Deep cost analysis

Validation Checklist

How do you know you got this right?

Performance Checks

  • Decision tree diagnostic identifies root cause in <5 minutes
  • Quick fix resolves symptom in <5 minutes (service restored)
  • Proper fix prevents recurrence (no duplicate incidents in 2+ weeks)
  • Runbook tested: new on-call follows steps successfully

Implementation Checks

  • Decision tree covers 90%+ of real production incidents
  • Diagnostic steps for each symptom documented with example logs
  • Quick fix is safe: temporary measure that doesn’t cause data loss
  • Proper fix implemented: code change or monitoring addition deployed
  • Prevention measures in place: monitoring alert or hard limit added
  • Commands cheatsheet tested: each one returns expected data format
  • Runbook updated after every incident: lessons captured

Integration Checks

  • Logging provides needed data: can trace request from input to output
  • Monitoring alerts match runbook symptoms: alert fires when issue occurs
  • Escalation procedures defined: who to contact if fix fails
  • Incident postmortem process: how to prevent recurrence

Common Failure Modes

  • Decision tree doesn’t match real issues: Test against last 10 incidents; update
  • Logs don’t provide diagnostic info: Missing request IDs, timing, error context
  • Quick fix is too complex: Takes 10 minutes; simplify or document better
  • Same incident repeats: Prevention measure didn’t work; verify it’s deployed
  • Runbook outdated: Logs format changed, commands broken; maintain as code changes

Sign-Off Criteria

  • Runbook tested by someone unfamiliar with codebase (clarity check)
  • All 3 real incidents resolved successfully using runbook
  • Prevention measures verified deployed: alerts fire, limits enforced
  • Team trained: on-call can follow runbook independently
  • Documentation complete: why issues happen, not just what to do

See Also

  • Doc 09 (Operations & Observability): Structured logging and monitoring setup
  • Doc 13 (Cost Management): Cost spike diagnosis and prevention
  • Doc 16 (Evaluation & Benchmarking): Quality regression detection and response