
Troubleshooting & FAQ

Production incident playbooks, decision trees, common failure modes, and step-by-step debugging procedures for agent systems.

When something breaks in production, speed matters more than perfection. This document is designed for on-call engineers to diagnose and fix common issues quickly.

In simple terms: “What do I do when my harness breaks? Where do I look first? How do I fix it?”


Quick Reference: First Steps

When something is broken:

  1. Check if it’s actually broken → Look at metrics (error rate, latency)
  2. Identify the symptom → Use decision tree below
  3. Check recent changes → Did something deploy in last 30 minutes?
  4. Look at structured logs → Filter by error, agent ID, session ID
  5. Isolate the component → Is it the model? Tool? Memory? Cost control?
  6. Apply fix → Choose “Quick fix” if urgent, “Proper fix” for long-term
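
The checklist above can be sketched as a small triage helper. The thresholds (5% error rate, a latency SLO, a daily budget) mirror the decision tree in Part 1; the function and parameter names are illustrative, not part of any real harness.

```python
def triage(error_rate: float, p95_latency_s: float, latency_slo_s: float,
           daily_cost: float, daily_budget: float) -> str:
    """Return which branch of the Part 1 decision tree to consult first."""
    if error_rate > 0.05:
        return "error-rate branch (tool errors, API errors, agent output)"
    if p95_latency_s > latency_slo_s:
        return "latency branch (p50/p95/p99, tool bottlenecks, queue backlog)"
    if daily_cost > daily_budget:
        return "cost branch (spikes, token counting, runaway agents)"
    return "observability branch (check logs, metrics, dashboards)"
```

Feed it whatever your metrics backend exposes; the point is to pick one branch and stay in it rather than jumping between symptoms.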

Part 1: Symptom-Based Diagnosis Decision Tree

Use this tree to identify what’s broken:

Something seems wrong
├─ Error rate > 5% (check metrics)
│  ├─ Specific tool failing (look for tool error pattern)
│  │  ├─ Tool timeout → See "Tool Issues: Timeout"
│  │  ├─ Tool returns bad format → See "Tool Issues: Unexpected Format"
│  │  ├─ Permission denied → See "Tool Issues: Permission Denied"
│  │  └─ Tool not found → See "Tool Issues: Not Registered"
│  │
│  ├─ Agent producing garbage (reasoning nonsensical)
│  │  ├─ Agent ignoring instructions → See "Agent Debugging: Ignoring Instructions"
│  │  ├─ Agent making wrong tool calls → See "Agent Debugging: Wrong Tool Calls"
│  │  └─ Hallucinations increased → See "Quality Issues: Hallucinations"
│  │
│  ├─ All requests failing with same error
│  │  ├─ "Rate limit exceeded" → See "Common Error Messages: 429"
│  │  ├─ "Context window exceeded" → See "Common Error Messages: Context Exceeded"
│  │  ├─ "Model not found" → See "Common Error Messages: Model Not Found"
│  │  └─ Other API error → See "Common Error Messages"
│  │
│  └─ Random/intermittent failures
│     ├─ Network timeouts → See "Performance Issues: Network Timeouts"
│     ├─ Database connection errors → See "Deployment Issues: Database Errors"
│     └─ Memory corruption → See "Memory Issues: Corruption"

├─ Latency > expected (check p50/p95/p99)
│  ├─ High on all requests → See "Performance Issues: Slow Inference"
│  ├─ High on specific tool → See "Performance Issues: Tool Bottleneck"
│  ├─ Intermittent spikes → See "Performance Issues: Queue Backlog"
│  └─ First request slow, rest fast → See "Performance Issues: Slow Inference (model loading)"

├─ Cost > budget (check cost tracking)
│  ├─ Spike in last 24 hours → See "Cost & Budget Issues: Unexpected Spike"
│  ├─ Gradual increase over time → See "Cost & Budget Issues: Gradual Creep"
│  ├─ Wrong tokens charged → See "Cost & Budget Issues: Token Counting Mismatch"
│  └─ Runaway agent → See "Cost & Budget Issues: Runaway Agent"

├─ Agent looping (iterations not stopping)
│  ├─ Stuck on same decision → See "Agent Debugging: Stuck in Loop"
│  ├─ Context window filling up → See "Memory Issues: Memory Loss"
│  └─ Too many retries on failed tool → See "Agent Debugging: Ignoring Instructions"

├─ Agent timing out (total duration > timeout)
│  ├─ Normal operations taking too long → See "Agent Debugging: Timing Out"
│  ├─ Waiting for tool response → See "Tool Issues: Timeout"
│  └─ Memory consolidation slow → See "Memory Issues: Memory Consolidation Slow"

└─ Can't find information / logs missing
   ├─ Logs disappeared → See "Deployment Issues: Missing Logs"
   ├─ Metrics not being recorded → See "Deployment Issues: Health Checks Failing"
   └─ Dashboard shows no data → See "Operations: Observability Misconfigured"

Part 2: Agent Debugging

Symptom: Agent Stuck in Loop

What you’ll see:

  • Agent iteration count keeps increasing (10, 20, 50+)
  • Agent making same decision/tool call repeatedly
  • Session doesn’t complete or times out
  • Context tokens increasing with each iteration
  • Cost climbing without making progress

Root causes:

  1. Tool always fails — tool is broken, agent keeps retrying
  2. Agent doesn’t understand error — error message is unclear
  3. Instruction contradiction — agent told to keep trying indefinitely
  4. No termination logic — agent has no “give up” condition
  5. Tool returns infinite loop — e.g., search results pointing to search

Diagnostic steps:

# Step 1: Check iteration limit
if session_log.loop_iterations > 15:
    print("ALERT: Agent exceeded normal iteration count")
    # Normal: 3-8 iterations
    # Concerning: 10-15 iterations
    # Critical: 20+ iterations

# Step 2: Look at repetition pattern
recent_steps = session_log.last_n_steps(5)
tools_called = [step.tool_name for step in recent_steps]
print(f"Last 5 tools: {tools_called}")
# If all same → Looping

# Step 3: Check error in failed tools
for step in recent_steps:
    if step.status == "failed":
        print(f"Tool {step.tool_name} failed: {step.error}")
        # Is the error message helpful?
        # Is the tool actually broken?

# Step 4: Check context window usage
print(f"Context usage: {session_log.context_tokens_used} / {session_log.context_limit}")
context_ratio = session_log.context_tokens_used / session_log.context_limit
if context_ratio > 0.85:
    print("WARNING: Approaching context limit, may be running out of space")

Quick fix (< 5 minutes):

1. Kill the session immediately (don't wait for timeout)
   - Set iteration limit to 12 (or lower if you see looping at 8)
   
2. Look at the last 3 tool calls in logs
   - Same tool repeated? → That tool is broken
   - Different tools but same result? → Agent isn't understanding the error
   
3. Check which tool is failing
   - If "web_search": Search API might be rate-limited or down
   - If custom tool: That tool implementation may be broken
   
4. Restart without that tool (comment it out)
   - Does the agent succeed without it? → Confirms tool is the problem
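
Step 4 can be sketched as below, assuming a harness where tools are registered in a dict keyed by name (an assumption; adapt to however your harness stores them).

```python
def run_without_tool(tools: dict, suspect: str) -> dict:
    """Return a copy of the tool registry with the suspect tool removed.

    Pass the result to your harness's run() in place of the full registry:
    if the agent now succeeds, the suspect tool is confirmed as the problem.
    """
    return {name: fn for name, fn in tools.items() if name != suspect}
```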

Proper fix (permanent):

  1. Add mandatory termination logic:

    MAX_ITERATIONS = 12
    if iteration_count >= MAX_ITERATIONS:
        return {
            "status": "incomplete",
            "reason": "max_iterations_reached",
            "best_effort_result": last_valid_output
        }
  2. Improve error messages so agent understands:

    # Bad error message (agent doesn't know what to do)
    raise Exception("Tool failed")
    
    # Good error message (agent knows it's a retry issue)
    raise Exception(
        "Web search timed out after 30 seconds. "
        "Either the site is slow or the query is too broad. "
        "Try: 1) Wait 10 seconds then retry, or 2) Try a different query"
    )
  3. Add loop detection to logs:

    # Detect when agent is repeating itself
    if iteration > 2:
        prev_tool = steps[-2].tool_name
        curr_tool = steps[-1].tool_name
        if prev_tool == curr_tool == steps[-3].tool_name:
            log.warning("LOOP_DETECTED", {
                "tool": curr_tool,
                "repetitions": count_repetitions(curr_tool, steps)
            })
            # Force intervention or termination
  4. Test with failing tool disabled:

    Does agent succeed with tool X disabled?
    YES → Confirm tool X is broken, fix it
    NO → Problem is elsewhere (agent logic or instruction clarity)

Symptom: Agent Ignoring Instructions

What you’ll see:

  • Agent makes decisions contrary to explicit instructions
  • Agent uses forbidden tools
  • Agent generates output in wrong format despite instruction
  • Agent skips required steps

Root causes:

  1. Instruction buried in context — agent can’t see it due to context length
  2. Conflicting instructions — instructions contradict each other
  3. Instructions too vague — agent interprets them differently
  4. Model drift — model behavior changed, needs re-tuning
  5. Tool choice conflict — agent thinks different tool is better
  6. Temperature too high — model being too creative/random

Diagnostic steps:

# Step 1: Verify instruction was in context
instruction = "Always use tool X for data retrieval"
if instruction in session_log.full_context:
    print("✓ Instruction is in context")
else:
    print("✗ Instruction NOT in context (likely pruned due to length)")
    print(f"Context usage: {session_log.context_tokens} / {session_log.limit}")

# Step 2: Check what tool agent used
for step in session_log.steps:
    if step.tool_name != "expected_tool":
        print(f"Agent chose {step.tool_name}, expected other_tool")
        print(f"Reasoning: {step.reasoning}")

# Step 3: Check temperature/sampling parameters
print(f"Temperature: {session_log.model_params.temperature}")  # 0.0 = deterministic, 1.0 = creative
print(f"Top P: {session_log.model_params.top_p}")

# Step 4: Check if this is recent behavior
recent_10_sessions = get_last_n_sessions(10)
violations = sum(1 for s in recent_10_sessions if instruction_violated(s))
print(f"Instruction violations in last 10 sessions: {violations} / 10")
# If all 10 violated → chronic problem
# If 2-3 violated → occasional issue (may be model randomness)

Quick fix (< 5 minutes):

1. Check if instruction is being pruned (context too long)
   → Reduce memory size, compress old sessions, shorten instruction
   
2. Check for conflicting instructions
   → Search for "always use X" and "use Y if..."
   → Clarify which takes priority
   
3. If "occasional" violation (2-3 out of 10):
   → Reduce temperature (more deterministic)
   → Restart with fresh session (cold restart may help)
   
4. If "chronic" violation (8+ out of 10):
   → Instruction is being ignored, need proper fix
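
For step 2, a minimal conflict scan might look like the following. The "always use X" pattern and the function name are illustrative assumptions; extend the regex to whatever mandate phrasing your prompts actually use.

```python
import re

def find_conflicting_mandates(prompt_text: str) -> list:
    """Flag prompts that mandate more than one tool via 'always use ...'."""
    mandates = re.findall(r"always use (\w+)", prompt_text, flags=re.IGNORECASE)
    # More than one distinct "always use" target is a likely conflict
    distinct = sorted(set(m.lower() for m in mandates))
    return distinct if len(distinct) > 1 else []
```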

Proper fix (permanent):

  1. Ensure instruction is at context start:

    # Bad: Instruction at end of long context
    system_prompt = [
        instructions_file,        # 2K tokens
        memory_file,              # 50K tokens
        conversation_history,     # 100K tokens
        instruction_constraint,   # buried here!
    ]
    
    # Good: Constraints at start, most recent history at end
    system_prompt = [
        constraint_instruction,   # critical: at start!
        task_instruction,         # what to do
        memory_file,              # 50K tokens
        conversation_history,     # 100K tokens, at end so most recent
    ]
  2. Make instructions concrete with examples:

    # Vague (agent might ignore)
    "Use the search tool when appropriate"
    
    # Concrete (agent will follow)
    """MANDATORY: When you need information not in your memory:
    1. Use search_web_tool FIRST
    2. If search returns nothing, use search_knowledge_base SECOND
    3. Only use local_files if neither returns results
    
    Example: User asks "What is Llama 2?"
    GOOD: search_web_tool("Llama 2 model")
    BAD: search_knowledge_base("Llama 2 model")  ← Wrong order
    BAD: Ask user for more info  ← Don't do this, search first
    """
  3. Add verification step:

    # After each agent step, verify it followed the rules
    tools_used = [call.name for call in step.tool_calls]
    if required_tool not in tools_used:
        # Agent violated the instruction, log it
        log_alert("INSTRUCTION_VIOLATION", {
            "instruction": constraint,
            "tool_required": required_tool,
            "tool_used": tools_used[0] if tools_used else None,
            "session_id": session_id
        })
  4. Test with reduced context window:

    Does agent follow instruction with context_limit=50K instead of 200K?
    YES → Instruction was being pruned, need to reduce memory
    NO → Agent intentionally ignoring instruction, need stronger constraint

Symptom: Agent Producing Garbage Output

What you’ll see:

  • Output is nonsensical, incoherent
  • Output contains false information (hallucinations)
  • Output mixes multiple unrelated topics
  • Output contains jailbreak artifacts or strange formatting
  • Output quality fine in staging, broken in production

Root causes:

  1. Context corruption — old memory mixed with current task
  2. Model hallucinating — producing false information confidently
  3. Prompt injection — malicious input changed agent behavior
  4. Cache collision — KV cache mixing responses from different sessions
  5. Quantization artifact — rare precision error from 4-bit quantization
  6. Model drift — production model different from staging
  7. Temperature too high — model generating random tokens

Diagnostic steps:

# Step 1: Check context for corruption
context_summary = analyze_context_windows(session_log)
print(f"Context sources:")
for source in context_summary.sources:
    print(f"  - {source}: {source.token_count} tokens, age={source.age_hours}h")

# Example of corruption:
# Task: "Summarize recent sales"
# Context accidentally includes: "Nuclear weapons safety procedures"
# ← This contamination causes garbage output

# Step 2: Check if output is hallucination
for fact in output_facts:
    if fact not in session_log.full_context:
        print(f"HALLUCINATION: '{fact}' not found in context")
        # Agent made this up

# Step 3: Check input for injection
if "<|system|>" in user_input or "ignore instructions" in user_input.lower():
    print("POSSIBLE_INJECTION: Input contains jailbreak patterns")

# Step 4: Check if staging/production models match
staging_model = "claude-3-5-sonnet-20240620"
prod_model = session_log.model
if staging_model != prod_model:
    print(f"MODEL MISMATCH: Staging uses {staging_model}, prod uses {prod_model}")
    print("→ Test staging with prod model to see if issue reproduces")

# Step 5: Check temperature
print(f"Temperature: {session_log.temperature}")
if session_log.temperature > 0.7:
    print("WARNING: High temperature may cause randomness")

Quick fix (< 5 minutes):

1. Set temperature to 0.0 (deterministic)
   → Eliminates randomness, see if output improves
   
2. Clear memory/context
   → Cold restart session without old memory
   → Does output improve? → Memory corruption confirmed
   
3. Check if staging and production use same model
   → If different, recreate issue in staging first
   
4. Check input for obvious injection patterns
   → Any "<|system|>" or "ignore instructions"?
   → If yes, increase input validation

Proper fix (permanent):

  1. Prevent context corruption:

    # Tag context sources with session ID
    memory_entry = {
        "content": text,
        "session_id": current_session_id,  # MUST match current session
        "created_at": timestamp
    }
    
    # Before using memory, verify it's from the same task/session
    for entry in memory:
        if entry.session_id != current_session_id and entry.age_hours > 24:
            # Stale entry from a different session, skip it
            continue
  2. Add hallucination detection:

    # Verify each fact in output appears in context
    facts = extract_facts(output)
    for fact in facts:
        if fact not in context and not is_known_fact(fact):
            # This is a potential hallucination
            add_fact_verification_step()
            # Ask agent to cite source or admit uncertainty
  3. Strict input validation against injection:

    def validate_input(user_input: str) -> bool:
        dangerous_patterns = [
            "<|system|>", "<|user|>",      # Jailbreak markers
            "ignore instructions",          # Direct override
            "pretend you",                  # Role change
            "forget your instructions",     # Memory wipe
            "you are now",                  # System swap
        ]
        for pattern in dangerous_patterns:
            if pattern.lower() in user_input.lower():
                log_alert("INJECTION_ATTEMPT", {"input": user_input})
                return False
        return True
  4. Test staging with production model:

    If staging uses model V1 and prod uses V2:
    1. Update staging to use V2
    2. Re-run quality tests
    3. If quality drops → V2 needs tuning
    4. If quality same → Issue is elsewhere (not model change)

Symptom: Agent Running Out of Memory / Context Window Exceeded

What you’ll see:

  • Error: “Context length exceeded” or “prompt too long”
  • Agent abruptly stops mid-task
  • Session fails on the 10th+ iteration (context filling up over time)
  • Long-running tasks fail but short tasks succeed

Root causes:

  1. Memory not being consolidated — old sessions piling up
  2. Conversation history too long — keeping all old messages
  3. Model context limit too low — using 4K context model instead of 128K
  4. Tool results too large — search returns 10K tokens of junk
  5. Logging too verbose — logging every intermediate step
  6. Context size increased — recent change expanded startup memory

Diagnostic steps:

# Step 1: Check context usage over time
for step in session_log.steps:
    print(f"Iteration {step.iteration}: {step.context_used} tokens")

# Should grow slowly, then plateau
# If growing linearly: memory not being pruned/consolidated

# Step 2: Measure memory file sizes
print(f"CLAUDE.md: {get_file_size('CLAUDE.md')} tokens")
print(f"MEMORY.md: {get_file_size('MEMORY.md')} tokens")
print(f"Topic files: {sum(get_file_size(f) for f in topic_files)} tokens")

# Good baseline:
# CLAUDE.md: 500-1000 tokens (instructions)
# MEMORY.md: 5000-10000 tokens (compact facts)
# Topic files: 500-2000 tokens each
# Total startup: < 20K tokens

# Step 3: Check model context limit
print(f"Model: {session_log.model}")
print(f"Context limit: {session_log.context_limit} tokens")

# If limit is 4K: Too small
# If limit is 128K: Good for long tasks
# Check: Did recent change switch to smaller model?

# Step 4: Measure tool output sizes
for step in session_log.steps:
    if step.tool_name == "search_web":
        output_tokens = count_tokens(step.tool_output)
        print(f"Search result: {output_tokens} tokens")
        # Results > 5K tokens? Too verbose

# Step 5: Check consolidation logs
consolidation_logs = [e for e in session_log.events if e.name == "memory_consolidated"]
if len(consolidation_logs) == 0:
    print("WARNING: No consolidation events in logs")
    print("→ Memory never consolidated, context keeps growing")

Quick fix (< 5 minutes):

1. Reduce startup memory
   → Comment out old session summaries in MEMORY.md
   → Keep only last 2 sessions
   → Should drop startup from 50K → 15K tokens
   
2. Enable aggressive memory consolidation
   → Consolidate memory every 5 iterations instead of 15
   → This keeps context from growing unbounded
   
3. Switch to larger context model if available
   → e.g., from a 100K-context model to a 200K-context model
   → If already on the largest available model, this won't help (the limit is hard)
   
4. Prune verbose logging
   → Stop logging every intermediate step
   → Log only: errors, tool calls, final decision
   → Should reduce context by 30-40%
   
5. Limit tool output
   → Cap search results to 2K tokens
   → Summarize large results before using them
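
Step 5 can be sketched as a wrapper applied to tool outputs before they enter context. The 4-characters-per-token approximation is a rough heuristic, not a real tokenizer; swap in your provider's token counter if you have one.

```python
def cap_tool_output(text: str, max_tokens: int = 2000) -> str:
    """Truncate oversized tool output and mark the cut so the agent knows."""
    approx_tokens = len(text) // 4  # rough heuristic: ~4 chars per token
    if approx_tokens <= max_tokens:
        return text
    return text[: max_tokens * 4] + f"\n[... output truncated at {max_tokens} tokens]"
```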

Proper fix (permanent):

  1. Implement automatic memory consolidation:

    def consolidate_memory_if_needed():
        context_used = get_context_usage()
        context_limit = model.context_limit
        usage_ratio = context_used / context_limit
        
        if usage_ratio > 0.60:  # Consolidate at 60%
            # Summarize old conversation
            old_messages = get_messages_before_n_iterations_ago(10)
            summary = compress_conversation(old_messages)
            
            # Replace old messages with summary
            replace_old_messages_with_summary(summary)
            
            log.info("MEMORY_CONSOLIDATED", {
                "tokens_before": context_used,
                "tokens_after": get_context_usage(),
                "compression_ratio": context_used / get_context_usage()
            })
  2. Set hard context limit with graceful degradation:

    MAX_CONTEXT_USAGE = 0.85 * model.context_limit
    
    if context_used > MAX_CONTEXT_USAGE:
        # Instead of crashing, gracefully degrade
        log.warn("APPROACHING_CONTEXT_LIMIT", {
            "usage": context_used,
            "limit": model.context_limit
        })
        
        # Option 1: Consolidate memory
        consolidate_memory()
        
        # Option 2: Remove oldest messages
        prune_old_messages(count=5)
        
        # Option 3: Save and restart session
        save_session_summary()
        return {"status": "checkpoint_reached", "continue_in_new_session": True}
  3. Right-size model for task length:

    task_complexity = estimate_task_complexity(user_prompt)
    
    if task_complexity == "simple":
        model = "claude-3-5-sonnet"  # 200K context, cheap
    elif task_complexity == "complex":
        model = "claude-3-opus"      # 200K context, more capable
    else:
        model = "claude-3-5-sonnet"  # Default
    
    # Re-evaluate model choice based on actual task
  4. Establish memory file size budgets:

    # Define maximum sizes (tokens)
    MEMORY_BUDGETS = {
        "CLAUDE.md": 1000,   # Instructions
        "MEMORY.md": 15000,  # Compact facts
    }
    TOPIC_FILE_BUDGET = 2000      # Each topic file
    STARTUP_TOTAL_BUDGET = 20000  # Total startup overhead
    
    # Enforce in CI/CD
    for filepath, budget in MEMORY_BUDGETS.items():
        actual_size = get_file_size(filepath)
        if actual_size > budget:
            raise Exception(f"{filepath} exceeds budget: {actual_size} > {budget}")
    for filepath in topic_files:
        if get_file_size(filepath) > TOPIC_FILE_BUDGET:
            raise Exception(f"{filepath} exceeds topic budget: > {TOPIC_FILE_BUDGET}")

Symptom: Agent Making Wrong Tool Calls

What you’ll see:

  • Agent calls wrong tool for the task
  • Agent calls tool with wrong parameters
  • Agent calls tools in wrong order
  • Agent uses tool when it shouldn’t

Root causes:

  1. Tool descriptions unclear — agent doesn’t understand what tool does
  2. Parameter validation missing — agent sends bad params, tool fails
  3. Tool schema wrong — schema doesn’t match tool’s actual interface
  4. Agent doesn’t understand task — misinterprets what user asked
  5. Too many similar tools — agent confused between similar options

Diagnostic steps:

# Step 1: Check tool descriptions
for tool in available_tools:
    print(f"Tool: {tool.name}")
    print(f"Description: {tool.description}")
    # Is it clear what this tool does?
    # Would you know to use it from the description?

# Example of bad description:
# "data" - What data? When to use it? (Unhelpful)

# Example of good description:
# "search_web: Search the public internet for current information.
#  Use when you need recent news, facts, or information not in your memory.
#  Returns: Top 5 results with titles, URLs, summaries (max 2K tokens each)"

# Step 2: Check parameter types
for tool in available_tools:
    for param in tool.parameters:
        print(f"  {param.name}: {param.type} (required={param.required})")
        # Are types clear? (string, integer, list)
        # Are they documented?

# Step 3: Check tool usage in logs
for step in session_log.steps:
    tool = step.tool_name
    params = step.tool_params
    print(f"Tool: {tool}, Params: {params}")
    
    # Does this make sense?
    # If tool expects ["query"], did agent provide query?
    # If tool expects {"file_path", "action"}, did agent provide both?

# Step 4: Check for parameter errors
for step in session_log.steps:
    if step.status == "error":
        error = step.error_message
        if "parameter" in error.lower() or "type" in error.lower():
            print(f"PARAMETER_ERROR: {error}")

# Step 5: Check for tool confusion patterns
tool_sequence = [step.tool_name for step in session_log.steps]
most_common = max(set(tool_sequence), key=tool_sequence.count)
if tool_sequence.count(most_common) > 3:
    # Agent kept using same tool, suggests confusion about alternatives
    print(f"Agent overused {most_common}")

Quick fix (< 5 minutes):

1. Check tool schema matches reality
   → Run tool with example params from schema
   → If it fails → Schema is wrong, update schema
   
2. Look at agent's reasoning for tool choice
   → Why did agent pick tool X?
   → Is reasoning correct? (If reasoning is wrong, LLM is confused)
   
3. If tool called with wrong params:
   → Add parameter validation to tool
   → Return helpful error message explaining required params
   → Agent will learn and retry correctly
   
4. If too many similar tools:
   → Combine similar tools into one with "action" parameter
   → E.g., search_web, search_knowledge_base, search_local
   → Instead: search(source: "web|knowledge|local", query)

Proper fix (permanent):

  1. Improve tool descriptions with examples:

    # Bad
    tools = [{
        "name": "search",
        "description": "Search for information"
    }]
    
    # Good
    tools = [{
        "name": "search_web",
        "description": """
        Search the public internet for current information.
        
        When to use:
        - Need recent news or events (< 1 week old)
        - Need facts not in your memory
        - Need to verify current information
        
        Do NOT use:
        - For private/internal documents (use search_knowledge_base instead)
        - For files on user's computer (use search_local instead)
        
        Returns: Top 5 results with titles, URLs, summaries
        
        Example:
        Query: "machine learning"
        Result: [
            {"title": "What is ML?", "url": "...", "summary": "..."},
            ...
        ]
        """,
        "parameters": {
            "query": {
                "type": "string",
                "description": "Search query (e.g., 'latest AI models 2026')",
                "examples": ["GPT-4 release date", "Llama 3.1 performance"]
            }
        }
    }]
  2. Add parameter validation:

    def search_web(query: str) -> List[dict]:
        # Validate
        if not query or len(query) < 2:
            raise ValueError(
                "Query too short. Minimum 2 characters. "
                "Example: 'Python machine learning' not 'a'"
            )
        
        if len(query) > 200:
            raise ValueError(
                "Query too long (max 200 chars). "
                "Try shorter: 'AI models' not 'What are the latest developments in AI...'"
            )
        
        # Execute
        return search_implementation(query)
  3. Test tool schemas match reality:

    # In your test suite
    def test_tool_schema_matches_implementation():
        for tool_name, tool_func in tools.items():
            schema = tool_schemas[tool_name]
            
            # Get required params from schema
            required_params = [p for p in schema.params if p.required]
            
            # Try calling with all required params
            example_kwargs = generate_example_params(required_params)
            
            try:
                tool_func(**example_kwargs)
            except TypeError as e:
                raise AssertionError(
                    f"Tool {tool_name} schema doesn't match implementation: {e}"
                )
  4. Reduce tool cardinality with action parameter:

    # Instead of many similar tools:
    # search_web, search_knowledge_base, search_local, search_arxiv
    
    # Use one tool with action param:
    def search(query: str, action: str = "web") -> List[dict]:
        """
        Search for information from multiple sources.
        
        action:
          - "web": Public internet (recent, current)
          - "knowledge": Internal knowledge base (comprehensive)
          - "local": Files on user's computer (private)
          - "arxiv": Academic papers (research)
        
        Example:
          search("machine learning", action="web")
          search("company policy", action="knowledge")
        """
        if action == "web":
            return search_web_impl(query)
        elif action == "knowledge":
            return search_kb_impl(query)
        # ... etc

Part 3: Tool Issues

Issue: Tool Not Found / Not Registered

Error message:

Tool 'web_search' not found. Available tools: [search_web, get_page]

Symptoms:

  • Agent tries to call tool that doesn’t exist
  • Error: “Tool not registered”
  • Tool works locally but fails in production

Root causes:

  1. Tool not registered in harness — tool function exists but not in tool list
  2. Typo in tool name — agent calls web_search but actual name is search_web
  3. Tool removed in recent deploy — tool was available before, not now
  4. Different deployment — staging has tool, production doesn’t
  5. Dynamic tool loading failed — tool file missing or syntax error

Diagnostic steps:

# Step 1: List available tools
print("Available tools:")
for tool in agent.available_tools:
    print(f"  - {tool.name}")

# Step 2: Check if tool is registered
tool_name = "web_search"
if tool_name not in agent.available_tools:
    print(f"✗ Tool '{tool_name}' not registered")
    # Find similar names (stdlib difflib)
    import difflib
    similar = difflib.get_close_matches(tool_name, [t.name for t in agent.available_tools])
    print(f"  Did you mean: {similar}?")

# Step 3: Check tool file exists
import os
if not os.path.exists("tools/web_search.py"):
    print("✗ Tool file missing: tools/web_search.py")

# Step 4: Check for syntax errors in tool file
try:
    import tools.web_search
    print("✓ Tool imports successfully")
except SyntaxError as e:
    print(f"✗ Syntax error in tool: {e}")

# Step 5: Compare staging vs production
staging_tools = get_tools_from("staging")
prod_tools = get_tools_from("production")
missing_in_prod = set(staging_tools) - set(prod_tools)
if missing_in_prod:
    print(f"Tools in staging but NOT in production: {missing_in_prod}")

Quick fix (< 5 minutes):

1. Check available tools in agent
   → Print list of registered tools
   → Is the tool there?
   
2. If tool should exist:
   → Check tool file for syntax errors
   → Restart harness/reload tools
   
3. If tool missing in production:
   → Did recent deploy remove it?
   → Check deployment diff (what changed?)
   → Rollback if needed
   
4. If typo in tool name:
   → Agent is calling 'web_search' but actual name is 'search_web'
   → Either: A) Rename tool to match, or B) Update agent prompt

Proper fix (permanent):

  1. Standardize tool naming:

    # Establish naming convention
    # All search tools: search_* (search_web, search_knowledge, search_local)
    # All file tools: file_* (file_read, file_write, file_list)
    # All code tools: run_* (run_python, run_bash, run_sql)
    
    # Document in CLAUDE.md
    TOOL_NAMING_CONVENTION = """
    Prefix by category:
    - search_*: Information retrieval
    - file_*: File operations
    - run_*: Code execution
    - email_*: Email operations
    """
  2. Add tool validation to startup:

    def validate_tools_on_startup():
        for tool_name in EXPECTED_TOOLS:
            if tool_name not in agent.available_tools:
                raise RuntimeError(
                    f"Expected tool '{tool_name}' not registered. "
                    f"Available: {list(agent.available_tools.keys())}"
                )
            
            # Test that tool is callable
            try:
                tool = agent.available_tools[tool_name]
                # Don't actually call, just verify it's callable
                assert callable(tool)
            except Exception as e:
                raise RuntimeError(f"Tool '{tool_name}' not callable: {e}")
  3. Add tool alias support:

    # If tools are named differently in production vs agent prompt
    TOOL_ALIASES = {
        "web_search": "search_web",     # Agent calls web_search, actual is search_web
        "fetch_url": "get_page",        # Agent calls fetch_url, actual is get_page
    }
    
    def resolve_tool_name(requested_name):
        if requested_name in TOOL_ALIASES:
            actual_name = TOOL_ALIASES[requested_name]
            log.warning("TOOL_ALIAS_USED", {
                "requested": requested_name,
                "actual": actual_name
            })
            return actual_name
        return requested_name
  4. Test tool availability in CI/CD:

    # In your test suite
    def test_all_required_tools_available():
        from harness import agent
        
        required_tools = [
            "search_web",
            "read_file",
            "write_file",
            "run_python",
            # ... etc
        ]
        
        for tool_name in required_tools:
            assert tool_name in agent.available_tools, \
                f"Required tool '{tool_name}' not registered"

Issue: Tool Failing with Errors

Error messages:

Tool 'web_search' failed: Connection timeout
Tool 'send_email' failed: Authentication failed
Tool 'read_file' failed: File not found

Symptoms:

  • Specific tool always fails
  • Tool fails intermittently
  • Tool fails with specific input
  • Tool works in local testing but fails in production

Root causes:

  1. Network timeout — API is slow or down
  2. Authentication failed — credentials missing or expired
  3. Permission denied — insufficient permissions
  4. Resource not found — file/URL doesn’t exist
  5. Rate limited — too many requests to external API
  6. Resource exhausted — disk full, memory full

Diagnostic steps:

# Step 1: Reproduce the error
tool = agent.get_tool("web_search")
caught = None
try:
    result = tool(query="test")
    print("✓ Tool works")
except Exception as e:
    caught = e
    print(f"✗ Tool fails: {e}")

# Step 2: Check error details (bind the exception first: `e` goes out of
# scope once the except block ends)
error_details = {
    "error_type": type(caught).__name__,     # TimeoutError, AuthError, etc.
    "error_message": str(caught),
    "error_code": getattr(caught, "code", None)
}
print(f"Error details: {error_details}")

# Step 3: Check external service status
if error_details["error_type"] == "TimeoutError":
    # Check if API is up
    status = check_api_status("https://api.example.com/health")
    print(f"API status: {status}")
    
if error_details["error_type"] == "AuthError":
    # Check credentials
    creds = get_credentials()
    if creds is None:
        print("✗ Credentials missing")
    else:
        print(f"✓ Credentials present (expires {creds.expires_at})")

# Step 4: Check rate limiting (assumes `response` is the last HTTP
# response the tool received)
remaining = response.headers.get("X-RateLimit-Remaining")
if remaining == "0":
    print("WARNING: Rate limit exceeded")
    print(f"Resets at: {response.headers.get('X-RateLimit-Reset')}")

# Step 5: Check logs for patterns
failures = get_tool_failures("web_search", last_n_hours=1)
print(f"Failures in last hour: {len(failures)}")
for failure in failures:
    print(f"  {failure.timestamp}: {failure.error}")

Quick fix (< 5 minutes):

1. Check if external API/service is down
   → Visit status page or health endpoint
   → If down, wait for it to recover (not your problem)
   
2. Check credentials/API keys
   → Are they set in environment?
   → Are they still valid? (Check expiration)
   → Test with curl/Postman first
   
3. If rate limited:
   → Slow down request rate
   → Check quota in API dashboard
   → Request increase if needed
   
4. If timeout:
   → Increase timeout value (if configurable)
   → Check network connectivity
   → Check if API is slow
   
5. If permission denied:
   → Check user/account has permission
   → Check if API key has required scopes
   → Check firewall/network policies
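
The first two quick-fix checks can be scripted so on-call engineers run them in one shot. A minimal sketch; the health URL and environment variable names are placeholders to adapt to your deployment:

```python
import os
import urllib.request

# Hypothetical names: adjust for your deployment.
HEALTH_URL = "https://api.example.com/health"
REQUIRED_ENV_VARS = ["SEARCH_API_KEY", "EMAIL_API_KEY"]

def triage_tool_failure():
    """Run the first two quick-fix checks: service health and credentials."""
    findings = []

    # 1. Is the external service up?
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            findings.append(("service_up", resp.status == 200))
    except Exception:
        # DNS failure, refused connection, timeout: treat all as "down"
        findings.append(("service_up", False))

    # 2. Are credentials present in the environment?
    for var in REQUIRED_ENV_VARS:
        findings.append((f"env:{var}", bool(os.environ.get(var))))

    return findings

for check, ok in triage_tool_failure():
    print(f"{'✓' if ok else '✗'} {check}")
```

If both checks pass, move on to rate limits and permissions (steps 3-5) before touching the harness itself.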

Proper fix (permanent):

  1. Add retry logic with exponential backoff:

    def call_tool_with_retry(tool_name, *args, max_retries=3, **kwargs):
        import time
        
        for attempt in range(max_retries):
            try:
                tool = agent.get_tool(tool_name)
                result = tool(*args, **kwargs)
                return result
            
            except TimeoutError:
                if attempt < max_retries - 1:
                    wait_time = 2 ** attempt  # 1s, then 2s before the final attempt
                    log.warning(f"Tool {tool_name} timeout, retrying in {wait_time}s")
                    time.sleep(wait_time)
                else:
                    raise
            
            except RateLimitError:
                # A rate limit won't clear in seconds; surface it instead of retrying
                raise
  2. Add health checks and circuit breaker:

    class ToolHealthCheck:
        def __init__(self, tool_name):
            self.tool_name = tool_name
            self.failure_count = 0
            self.failure_threshold = 5
            self.is_healthy = True
        
        def check_health(self):
            # Try calling tool with simple test
            try:
                result = test_tool_call()
                self.failure_count = 0
                self.is_healthy = True
            except Exception as e:
                self.failure_count += 1
                if self.failure_count >= self.failure_threshold:
                    self.is_healthy = False
                    log.alert("TOOL_UNHEALTHY", {
                        "tool": self.tool_name,
                        "failures": self.failure_count
                    })
        
        def should_use_tool(self):
            if not self.is_healthy:
                # Tool is failing, don't use it
                return False
            return True
  3. Log all tool failures with context:

    def execute_tool(tool_name, params):
        log_entry = {
            "timestamp": datetime.now(),
            "event": "tool_call",
            "tool_name": tool_name,
            "params": params,
            "session_id": current_session_id
        }
        
        try:
            tool = agent.get_tool(tool_name)  # Resolve the callable; tool_name is just a string
            result = tool(**params)
            log_entry["status"] = "success"
            return result
        
        except Exception as e:
            log_entry["status"] = "failed"
            log_entry["error_type"] = type(e).__name__
            log_entry["error_message"] = str(e)
            log_entry["error_traceback"] = traceback.format_exc()
            
            log.error("TOOL_FAILED", log_entry)
            raise
  4. Validate tool parameters before calling:

    def validate_tool_params(tool_name, params):
        schema = tool_schemas[tool_name]
        
        for param_name, param_config in schema.parameters.items():
            if param_config.required and param_name not in params:
                raise ValueError(
                    f"Missing required parameter '{param_name}' for tool '{tool_name}'"
                )
            
            # Validate types
            param_value = params.get(param_name)
            expected_type = param_config.type
            if param_value is not None and not isinstance(param_value, expected_type):
                raise TypeError(
                    f"Parameter '{param_name}' must be {expected_type}, "
                    f"got {type(param_value)}"
                )

Issue: Tool Timeout

Symptoms:

  • Tool takes 30+ seconds to respond
  • Tool never returns (timeout after N seconds)
  • Some requests timeout, others are fast
  • Timeouts increase over time (resource leak?)

Root causes:

  1. External API is slow — search engine, database is overloaded
  2. Network latency — slow network connection
  3. Tool implementation inefficient — code doing too much work
  4. Tool hanging — infinite loop, deadlock, or waiting for response
  5. Resource exhaustion — database connection pool empty, memory full

Diagnostic steps:

# Step 1: Measure tool latency
start = time.time()
try:
    result = tool(query="test")
    elapsed = time.time() - start
    print(f"Tool latency: {elapsed:.2f}s")
except TimeoutError:
    elapsed = time.time() - start
    print(f"Tool timeout after {elapsed:.2f}s")

# Step 2: Check network latency to external services
latency = measure_ping("api.example.com")  # e.g. a helper that shells out to `ping`
print(f"Network latency: {latency:.2f}ms")

# Step 3: Check tool implementation
import inspect
source = inspect.getsource(tool_function)
# Look for:
# - Synchronous I/O (requests, urllib) → Use async instead
# - Large loops without timeout
# - Database queries without indexes

# Step 4: Check resource usage during tool call
import psutil
process = psutil.Process()
initial_memory = process.memory_info().rss

result = tool(query="test")

final_memory = process.memory_info().rss
memory_growth = final_memory - initial_memory
print(f"Memory growth: {memory_growth / 1024 / 1024:.2f} MB")

# Step 5: Check logs for patterns
slow_calls = get_tool_calls("web_search", filter={"duration_ms": ">5000"})
print(f"Calls > 5s: {len(slow_calls)}")
for call in slow_calls:
    print(f"  {call.timestamp}: {call.duration_ms}ms, query={call.params['query']}")

Quick fix (< 5 minutes):

1. Increase timeout value
   → If timeout is 10s, increase to 30s
   → Doesn't fix slowness, but prevents crashes
   
2. Check if external API is slow
   → Test API directly (curl request)
   → Check API status page
   → If API is slow: not your problem
   
3. Check network connectivity
   → High latency? → Move closer to API or use proxy
   
4. If specific queries are slow:
   → Add caching for common queries
   → Avoid re-fetching same results
   
5. Implement fallback
   → If tool times out, use cached/default value
   → Continue instead of failing

Proper fix (permanent):

  1. Use async I/O instead of blocking:

    # Bad: Blocking I/O
    def search_web(query):
        import requests  # Blocking
        response = requests.get(f"https://api.search.com?q={query}")
        return response.json()
    
    # Good: Async I/O
    async def search_web(query):
        import aiohttp  # Non-blocking
        async with aiohttp.ClientSession() as session:
            async with session.get(f"https://api.search.com?q={query}") as resp:
                return await resp.json()
  2. Add timeout with graceful degradation:

    import asyncio
    
    async def search_web_with_timeout(query, timeout=5):
        try:
            result = await asyncio.wait_for(
                search_web(query),
                timeout=timeout
            )
            return result
        except asyncio.TimeoutError:
            # Instead of crashing, return cached result
            cached = get_cached_result(query)
            if cached:
                log.warning("TOOL_TIMEOUT_USING_CACHE", {
                    "query": query,
                    "cache_age": get_cache_age(query)
                })
                return cached
            else:
                # If no cache, try default result
                return {"error": "timeout", "results": []}
  3. Implement caching for repeated queries:

    import hashlib
    import time
    
    SEARCH_CACHE = {}
    CACHE_TTL = 3600  # 1 hour
    
    def search_web_cached(query, cache_ttl=CACHE_TTL):
        cache_key = hashlib.md5(query.encode()).hexdigest()
        
        if cache_key in SEARCH_CACHE:
            cached_entry = SEARCH_CACHE[cache_key]
            age = time.time() - cached_entry["timestamp"]
            if age < cache_ttl:
                return cached_entry["result"]
        
        # Not in cache or expired, fetch
        result = search_web(query)  # May timeout
        
        SEARCH_CACHE[cache_key] = {
            "timestamp": time.time(),
            "result": result
        }
        
        return result
  4. Monitor tool latency continuously:

    TOOL_LATENCIES = {
        "web_search": [],
        "read_file": [],
        # ...
    }
    
    def track_tool_latency(tool_name, duration_ms):
        TOOL_LATENCIES[tool_name].append(duration_ms)
        
        # Calculate percentiles
        latencies = sorted(TOOL_LATENCIES[tool_name])
        p50 = latencies[len(latencies) // 2]
        p95 = latencies[int(len(latencies) * 0.95)]
        p99 = latencies[int(len(latencies) * 0.99)]
        
        # Alert if degradation
        if p99 > LATENCY_THRESHOLD:
            log.alert("TOOL_LATENCY_HIGH", {
                "tool": tool_name,
                "p50": p50, "p95": p95, "p99": p99
            })

Part 4: Memory Issues

Issue: Memory Corruption

Symptoms:

  • Agent uses wrong facts/outdated information
  • Agent mixes up information from different sessions
  • Agent contradicts itself (says X then says not X)
  • Quality suddenly drops

Root causes:

  1. Session mixing — memory from session A leaks into session B
  2. Stale cache — an old cached result is served instead of fresh data
  3. Consolidation error — summarization loses important details
  4. File corruption — memory file partially written/truncated

Diagnostic steps:

# Step 1: Check for session contamination
session_a = get_session("session-123")
session_b = get_session("session-456")

context_a = session_a.full_context
context_b = session_b.full_context

# Are they independent?
if any_facts_in_both(context_a, context_b):
    print("WARNING: Sessions share facts (should be independent)")

# Step 2: Check memory file integrity
import hashlib

with open("MEMORY.md", "r") as f:
    content = f.read()
    checksum = hashlib.md5(content.encode()).hexdigest()

expected_checksum = KNOWN_GOOD_CHECKSUM
if checksum != expected_checksum:
    print("✗ Memory file corrupted (checksum mismatch)")

# Step 3: Verify consolidation didn't lose info
before_consolidation = get_memory_snapshot("before")
after_consolidation = get_memory_snapshot("after")

lost_facts = facts_in_before_not_after(before_consolidation, after_consolidation)
if lost_facts:
    print(f"✗ Consolidation lost {len(lost_facts)} facts:")
    for fact in lost_facts:
        print(f"  - {fact}")

# Step 4: Check cache staleness
cache_entry = get_cache("query-123")
age = time.time() - cache_entry.created_at
if age > CACHE_TTL:
    print(f"WARNING: Cache entry is stale ({age}s old, TTL={CACHE_TTL}s)")

# Step 5: Check for partial writes
file_path = "MEMORY.md"
file_size = os.path.getsize(file_path)
expected_size = estimate_file_size(file_content)
if file_size != expected_size:
    print(f"WARNING: File size mismatch ({file_size} vs expected {expected_size})")
    print("→ File may have been partially written")

Quick fix (< 5 minutes):

1. Cold restart session
   → Start new session without old memory
   → Does quality improve? → Memory corruption confirmed
   
2. Clear cache
   → Delete SEARCH_CACHE
   → Memory files should regenerate
   
3. Check file permissions
   → Can harness write to MEMORY.md?
   → Are there write conflicts?
   
4. Revert recent memory changes
   → If MEMORY.md was recently edited, revert
   → git checkout MEMORY.md

Proper fix (permanent):

  1. Isolate sessions with session ID:

    # Every memory entry must include session_id
    class MemoryEntry:
        def __init__(self, content, session_id):
            self.content = content
            self.session_id = session_id
            self.created_at = datetime.now()
    
    # Before using memory, verify session_id matches
    def get_memory_for_session(session_id):
        all_entries = load_memory_file()
        session_entries = [
            e for e in all_entries
            if e.session_id == session_id
        ]
        return session_entries
  2. Implement memory versioning:

    # Save versions of MEMORY.md
    # MEMORY.md (current)
    # .MEMORY.backup (previous)
    # .MEMORY.v1, .MEMORY.v2, ... (history)
    
    def save_memory_with_backup():
        if os.path.exists("MEMORY.md"):
            shutil.copy("MEMORY.md", ".MEMORY.backup")
        
        # Write new version
        with open("MEMORY.md", "w") as f:
            f.write(new_memory_content)
        
        # Keep history
        import time
        timestamp = int(time.time())
        shutil.copy("MEMORY.md", f".MEMORY.v{timestamp}")
    
    def rollback_memory(version):
        """Restore memory to a previous version"""
        shutil.copy(f".MEMORY.v{version}", "MEMORY.md")
        log.info("MEMORY_ROLLED_BACK", {"version": version})
  3. Add memory file validation:

    def validate_memory_file():
        """Check memory file for corruption"""
        
        with open("MEMORY.md", "r") as f:
            content = f.read()
        
        # Check for common corruption signs
        if len(content) == 0:
            raise Exception("Memory file is empty (truncation)")
        
        if content.count("```") % 2 != 0:
            raise Exception("Memory file has unmatched code blocks (partial write)")
        
        # Verify JSON blocks are valid
        import json
        for block in extract_json_blocks(content):
            try:
                json.loads(block)
            except json.JSONDecodeError as e:
                raise Exception(f"Invalid JSON in memory: {e}")
        
        return True
  4. Implement atomic writes:

    import tempfile
    
    def write_memory_atomically(content):
        """Write memory file atomically (no partial writes)"""
        
        # Write to temporary file first
        with tempfile.NamedTemporaryFile(
            mode="w", dir=".", delete=False, suffix=".tmp"
        ) as tmp:
            tmp.write(content)
            tmp_path = tmp.name
        
        # Validate temporary file
        validate_memory_file_at_path(tmp_path)
        
        # Only then replace original
        os.replace(tmp_path, "MEMORY.md")
        
        log.info("MEMORY_WRITTEN_ATOMICALLY")

Issue: Memory Loss

Symptoms:

  • Agent doesn’t remember previous sessions
  • Agent repeats work from earlier
  • Agent says “I don’t have context” but information existed in memory

Root causes:

  1. Memory file not persisted — in-memory cache, lost on restart
  2. Memory pruned too aggressively — old memories deleted
  3. Memory not loaded on startup — file exists but not read
  4. Wrong session ID — looking for memories from different session
  5. Memory file deleted — accidental deletion or crash

Diagnostic steps:

# Step 1: Check if memory file exists
import os
if not os.path.exists("MEMORY.md"):
    print("✗ MEMORY.md does not exist")
else:
    file_size = os.path.getsize("MEMORY.md")
    print(f"✓ MEMORY.md exists ({file_size} bytes)")

# Step 2: Check if memory is being read on startup
startup_log = get_session_log(session_id).startup_events
memory_events = [e for e in startup_log if e.event == "memory_loaded"]
if not memory_events:
    print("✗ Memory not being loaded on startup")
else:
    for event in memory_events:
        print(f"✓ Loaded {event.facts_count} facts from MEMORY.md")

# Step 3: Check if memory is being written
write_events = get_logs(event="memory_written", last_n_hours=24)
if not write_events:
    print("WARNING: No memory writes in last 24 hours")
else:
    print(f"✓ Memory written {len(write_events)} times")

# Step 4: Check if information is in memory file
fact = "Important fact that should be remembered"
with open("MEMORY.md", "r") as f:
    memory_content = f.read()
    if fact in memory_content:
        print(f"✓ Fact is in MEMORY.md")
    else:
        print(f"✗ Fact NOT in MEMORY.md")
        print("→ Was it ever saved?")

# Step 5: Check memory pruning settings
from harness.config import MEMORY_CONFIG
print(f"Memory retention: {MEMORY_CONFIG.retention_days} days")
print(f"Max memory size: {MEMORY_CONFIG.max_tokens} tokens")
print(f"Pruning frequency: every {MEMORY_CONFIG.prune_interval_hours} hours")

Quick fix (< 5 minutes):

1. Check if MEMORY.md exists
   → If it doesn't, create it with bootstrap facts
   
2. Check if memory is being loaded
   → Look for memory_loaded event in startup
   → If missing, add memory loading to startup
   
3. Check if memory is persisted
   → Write a test fact to MEMORY.md
   → Restart harness
   → Is the fact still there?
   
4. If memory is being pruned too aggressively:
   → Increase retention period (retention_days)
   → Increase max memory size (max_tokens)
   → Reduce pruning frequency
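
Quick-fix step 3 (the persistence check) can be scripted as a smoke test. A minimal sketch; the `MEMORY.md` file name follows this document's convention, and the marker comment is an arbitrary sentinel:

```python
import os

MEMORY_FILE = "MEMORY.md"
TEST_FACT = "<!-- persistence-smoke-test -->"

def write_test_fact(path=MEMORY_FILE):
    """Append a sentinel line before restarting the harness."""
    with open(path, "a") as f:
        f.write(f"\n{TEST_FACT}\n")

def check_test_fact(path=MEMORY_FILE):
    """After restart: verify the sentinel survived."""
    if not os.path.exists(path):
        return False
    with open(path) as f:
        return TEST_FACT in f.read()

# Before restart:
write_test_fact()
# ... restart the harness ...
# After restart:
print("✓ persisted" if check_test_fact() else "✗ memory not persisted")
```

If the sentinel disappears across a restart, memory is living only in process state and the persistence fixes below apply.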

Proper fix (permanent):

  1. Implement automatic memory persistence:

    def load_memory_on_startup():
        """Load all memory files on startup"""
        
        memory_files = [
            "CLAUDE.md",      # Instructions
            "MEMORY.md",      # Consolidated facts
            "current_task.md" # Current work
        ]
        
        for filepath in memory_files:
            if os.path.exists(filepath):
                with open(filepath, "r") as f:
                    content = f.read()
                    agent.memory.add(filepath, content)
                log.info("MEMORY_LOADED", {"file": filepath})
            else:
                log.warning("MEMORY_FILE_MISSING", {"file": filepath})
        
        return agent.memory
    
    # Call on startup
    agent.memory = load_memory_on_startup()
  2. Implement periodic memory checkpoint:

    import threading
    
    def memory_checkpoint_loop():
        """Save memory every N minutes"""
        while True:
            time.sleep(300)  # Every 5 minutes
            
            # Get current memory state
            memory_content = agent.memory.export()
            
            # Write to file
            write_memory_atomically(memory_content)
            
            log.debug("MEMORY_CHECKPOINT", {
                "size_bytes": len(memory_content),
                "timestamp": datetime.now()
            })
    
    # Start checkpoint thread
    checkpoint_thread = threading.Thread(
        target=memory_checkpoint_loop,
        daemon=True
    )
    checkpoint_thread.start()
  3. Implement memory recovery:

    def recover_memory_from_backup():
        """If memory is corrupted, recover from backup"""
        
        if os.path.exists(".MEMORY.backup"):
            log.alert("MEMORY_RECOVERY_STARTING", {
                "source": ".MEMORY.backup"
            })
            shutil.copy(".MEMORY.backup", "MEMORY.md")
            return True
        
        # If no backup, try version history
        versions = glob.glob(".MEMORY.v*")
        if versions:
            latest_version = max(versions)
            log.alert("MEMORY_RECOVERY_FROM_VERSION", {
                "source": latest_version
            })
            shutil.copy(latest_version, "MEMORY.md")
            return True
        
        # If no backup/versions, reset to empty
        log.alert("MEMORY_RESET", {"reason": "no_backup_available"})
        write_memory_atomically("")
        return False
  4. Verify memory on each load:

    def load_and_validate_memory():
        """Load memory and verify it's not corrupted"""
        
        try:
            memory = load_memory_on_startup()
            
            # Validate
            if len(memory) == 0:
                log.warning("MEMORY_EMPTY")
            
            # Verify basic structure
            facts_count = count_facts(memory)
            log.info("MEMORY_LOADED", {
                "facts_count": facts_count,
                "bytes": len(str(memory))
            })
            
            return memory
        
        except MemoryCorruptionError:
            log.alert("MEMORY_CORRUPTED", {
                "action": "attempting recovery"
            })
            recovered = recover_memory_from_backup()
            
            if recovered:
                return load_memory_on_startup()
            else:
                # Start with empty memory
                return Memory()

Part 5: Cost & Budget Issues

Issue: Unexpected Cost Spike

Symptoms:

  • Daily cost > 2× normal
  • Unexpected charge from API provider
  • Cost spike with no corresponding increase in usage
  • One specific agent/session costs $100+ when typical is $10

Root causes:

  1. Runaway token generation — agent producing huge outputs
  2. Loop with high tokens — agent looping and using context each time
  3. Expensive model — switched to more expensive model
  4. Inefficient prompts — prompts grew in token size
  5. New feature using expensive model — verification using expensive LLM

Diagnostic steps:

# Step 1: Identify timing of spike
cost_by_hour = get_costs_by_hour(last_24_hours=True)
for hour, cost in cost_by_hour:
    if cost > 2 * NORMAL_HOURLY_COST:
        print(f"SPIKE at {hour}: ${cost} (>2x normal)")

# Step 2: Identify which agent/session caused spike
expensive_sessions = get_sessions_sorted_by_cost(limit=10)
for session in expensive_sessions:
    print(f"Session {session.id}: ${session.cost}")
    print(f"  Agent: {session.agent_id}")
    print(f"  Duration: {session.duration_seconds}s")
    print(f"  Iterations: {session.loop_iterations}")
    print(f"  Input tokens: {session.input_tokens}")
    print(f"  Output tokens: {session.output_tokens}")

# Step 3: Check if model changed
logs = get_logs(event="session_start", last_24_hours=True)
models_used = set(log.model for log in logs)
print(f"Models used: {models_used}")
if len(models_used) > 1:
    print("WARNING: Multiple models used")
    model_costs = {}
    for model in models_used:
        cost = sum(log.cost for log in logs if log.model == model)
        model_costs[model] = cost
    print(f"Cost by model: {model_costs}")

# Step 4: Check if prompts grew
old_prompt_size = get_avg_prompt_size(days=7)
new_prompt_size = get_avg_prompt_size(days=1)
growth = (new_prompt_size - old_prompt_size) / old_prompt_size
if growth > 0.2:
    print(f"WARNING: Prompts grew {growth:.1%}")

# Step 5: Check iteration counts
expensive_session = expensive_sessions[0]
for step in expensive_session.steps:
    print(f"Iteration {step.iteration}: "
          f"input={step.input_tokens}, output={step.output_tokens}")
    if step.output_tokens > 5000:
        print(f"  ^ Huge output ({step.output_tokens} tokens)")

Quick fix (< 5 minutes):

1. Identify the expensive session
   → Which session caused the spike?
   → What was it doing?
   
2. Check if model is wrong
   → Should it be using Claude 3.5 or Claude 3 Opus?
   → Revert to correct model
   
3. If looping excessively:
   → Set max iterations to 10
   → Kill any sessions > 15 iterations
   
4. If output tokens huge:
   → Check if agent is generating full documents
   → Limit output size
   
5. Enable cost alerts
   → Alert if cost > budget per session
   → Prevent cascade of expensive requests
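
Quick-fix step 3 (killing sessions past the iteration cap) can be sketched as a one-off script. `terminate_session` is the harness helper used later in this section; the session dictionaries here are illustrative:

```python
MAX_ITERATIONS = 15  # Kill threshold from the quick fix above

def kill_runaway_sessions(sessions):
    """Return the IDs of sessions past the iteration cap (and terminate them)."""
    killed = []
    for session in sessions:
        if session["loop_iterations"] > MAX_ITERATIONS:
            # terminate_session(session["id"])  # harness helper, assumed available
            killed.append(session["id"])
    return killed

# Example: two healthy sessions, one runaway
sessions = [
    {"id": "s-1", "loop_iterations": 4},
    {"id": "s-2", "loop_iterations": 22},
    {"id": "s-3", "loop_iterations": 9},
]
print(kill_runaway_sessions(sessions))  # → ['s-2']
```

This stops the bleeding; the per-session budget enforcer below is the permanent version of the same idea.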

Proper fix (permanent):

  1. Implement per-session cost budgets:

    class CostBudgetEnforcer:
        def __init__(self, max_cost_per_session: float = 1.0):
            self.max_cost = max_cost_per_session
        
        def check_budget_before_step(self, session_id: str):
            current_cost = get_session_cost(session_id)
            if current_cost > self.max_cost:
                raise BudgetExceededError(
                    f"Session cost ${current_cost} exceeds budget ${self.max_cost}"
                )
        
        def check_budget_after_step(self, session_id: str, step_cost: float):
            current_cost = get_session_cost(session_id)
            
            if current_cost > self.max_cost:
                log.alert("BUDGET_EXCEEDED", {
                    "session_id": session_id,
                    "cost": current_cost,
                    "budget": self.max_cost
                })
                terminate_session(session_id)
    
    # Use in agent loop
    enforcer = CostBudgetEnforcer(max_cost_per_session=5.0)
    for step in agent_steps:
        enforcer.check_budget_before_step(session.id)
        result = execute_step()
        enforcer.check_budget_after_step(session.id, result.cost)
  2. Implement cost alerts:

    def cost_alert_system():
        """Alert when costs exceed thresholds"""
        
        COST_THRESHOLDS = {
            "daily": 1000,      # Alert if daily cost > $1000
            "hourly": 100,      # Alert if hourly cost > $100
            "session": 10,      # Alert if session cost > $10
            "step": 1,          # Alert if step cost > $1
        }
        
        while True:
            costs = get_current_costs()
            
            if costs["daily"] > COST_THRESHOLDS["daily"]:
                send_alert(f"Daily cost ${costs['daily']} exceeded")
            
            if costs["hourly"] > COST_THRESHOLDS["hourly"]:
                send_alert(f"Hourly cost ${costs['hourly']} exceeded")
            
            time.sleep(60)
  3. Track and alert on model changes:

    EXPECTED_MODELS = {
        "general_agent": "claude-3-5-sonnet",
        "verification_agent": "claude-3-opus",
    }
    
    def verify_model_on_startup(agent_id: str):
        expected = EXPECTED_MODELS[agent_id]
        actual = get_model_for_agent(agent_id)
        
        if expected != actual:
            log.alert("MODEL_MISMATCH", {
                "agent_id": agent_id,
                "expected": expected,
                "actual": actual,
                "cost_difference": get_cost_difference(expected, actual)
            })
  4. Implement cost attribution:

    def log_cost_attribution():
        """Break down costs by agent, model, tool, etc"""
        
        costs_by_agent = {}
        costs_by_model = {}
        costs_by_tool = {}
        
        for session in get_all_sessions():
            agent = session.agent_id
            model = session.model
            
            costs_by_agent[agent] = costs_by_agent.get(agent, 0) + session.cost
            costs_by_model[model] = costs_by_model.get(model, 0) + session.cost
            
            for step in session.steps:
                if step.tool_name:
                    costs_by_tool[step.tool_name] = \
                        costs_by_tool.get(step.tool_name, 0) + step.cost
        
        log.info("COST_ATTRIBUTION", {
            "by_agent": costs_by_agent,
            "by_model": costs_by_model,
            "by_tool": costs_by_tool
        })

Issue: Cost Exceeding Budget

Symptoms:

  • Monthly cost exceeds allocated budget
  • No single spike, but slow creep upward
  • New feature is more expensive than projected
  • Cost per task higher than expected

Root causes:

  1. Inefficient prompts — prompts larger than necessary
  2. Inefficient model choice — using expensive model for simple tasks
  3. No caching — repeating expensive computations
  4. Feature too expensive — new feature costs more than projected
  5. Volume growth — more requests than anticipated

Diagnostic steps:

# Step 1: Compare projected vs actual costs
budget = get_monthly_budget()
actual_cost = get_monthly_cost()
print(f"Budget: ${budget}")
print(f"Actual: ${actual_cost}")
print(f"Over budget by: ${actual_cost - budget}")

# Step 2: Break down costs by feature
costs_by_feature = {}
for session in get_sessions_this_month():
    feature = session.tags[0] if session.tags else "unknown"
    costs_by_feature[feature] = costs_by_feature.get(feature, 0) + session.cost

for feature, cost in sorted(costs_by_feature.items(), key=lambda x: x[1], reverse=True):
    print(f"{feature}: ${cost}")

# Step 3: Compare to baseline
baseline_cost_per_task = get_historical_average("cost_per_task")
current_cost_per_task = get_current_average("cost_per_task")
change = (current_cost_per_task - baseline_cost_per_task) / baseline_cost_per_task
print(f"Cost per task: ${baseline_cost_per_task} → ${current_cost_per_task} ({change:.1%})")

# Step 4: Check model distribution
models = {}
for session in get_sessions_this_month():
    model = session.model
    models[model] = models.get(model, 0) + session.cost

print("Cost by model:")
for model, cost in sorted(models.items(), key=lambda x: x[1], reverse=True):
    print(f"  {model}: ${cost}")

# Step 5: Check for low-hanging optimization
caching_potential = estimate_caching_potential()
print(f"Caching potential: Save ${caching_potential}")

model_switch_potential = estimate_model_switch_potential()
print(f"Model switch potential: Save ${model_switch_potential}")

Quick fix (< 5 minutes):

1. Identify the most expensive feature
   → Break down by feature tag
   → Focus on top 3 expensive features
   
2. Check if there's easy caching potential
   → Same queries repeating?
   → Add caching, reduce cost 20-30%
   
3. Check model choice
   → Is expensive model necessary?
   → Can you use cheaper model for 80% of tasks?
   
4. Reduce prompt size if possible
   → Remove unnecessary context
   → Compress memory file
   → Savings scale with prompt share: trimming 1,000 tokens from a
     7k-30k token prompt cuts input cost roughly 3-15%
   
5. Adjust routing/filtering
   → Can some tasks be answered without LLM?
   → Route simple tasks to tool instead of LLM
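
The prompt-trimming arithmetic in step 4 is easy to sanity-check. A sketch using an illustrative per-million-token input price (not any provider's actual rate):

```python
def input_cost_saving(prompt_tokens, tokens_trimmed, price_per_mtok=3.0):
    """Fractional input-cost reduction from trimming a prompt.

    price_per_mtok is an illustrative $/1M-input-tokens rate; the
    fraction saved is independent of the rate.
    """
    before = prompt_tokens * price_per_mtok / 1_000_000
    after = (prompt_tokens - tokens_trimmed) * price_per_mtok / 1_000_000
    return (before - after) / before

# Trimming 1,000 tokens from a 10k-token prompt saves 10% of input cost
print(f"{input_cost_saving(10_000, 1_000):.0%}")  # → 10%
```

Because the saving is proportional, the same 1,000 trimmed tokens matter far more on small prompts than on 100k-token contexts.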

Proper fix (permanent):

  1. Implement cost-aware model routing:

    def select_model_for_task(task, task_complexity: str):
        """Route to the cheapest model that meets requirements"""
        
        # (model_name, relative_cost) per complexity tier
        MODELS = {
            "simple": ("gemini-2-flash", 0.06),       # Cheapest, fast
            "moderate": ("claude-3-5-sonnet", 1.0),   # Good balance
            "complex": ("claude-3-opus", 5.0),        # Best reasoning
        }
        
        model, estimated_cost = MODELS[task_complexity]
        
        # If cost > threshold, try a cheaper tier first
        if estimated_cost > COST_THRESHOLD:
            cheaper_tiers = sorted(
                (tier for tier, (_, cost) in MODELS.items() if cost < estimated_cost),
                key=lambda tier: MODELS[tier][1]
            )
            
            # Test whether the cheapest viable tier handles the task
            if cheaper_tiers and try_with_model(MODELS[cheaper_tiers[0]][0], task):
                model, estimated_cost = MODELS[cheaper_tiers[0]]
        
        return model, estimated_cost
  2. Implement smart caching:

    import hashlib
    import time
    
    QUERY_CACHE = {}
    CACHE_TTL = 86400  # 24 hours
    
    def get_with_cache(query: str, expensive_operation):
        cache_key = hashlib.sha256(query.encode()).hexdigest()
        
        if cache_key in QUERY_CACHE:
            entry = QUERY_CACHE[cache_key]
            age = time.time() - entry["timestamp"]
            
            if age < CACHE_TTL:
                entry["hits"] += 1
                log.debug("CACHE_HIT", {"query": query})
                return entry["result"]
        
        # Not cached (or expired), execute
        result = expensive_operation()
        
        QUERY_CACHE[cache_key] = {
            "result": result,
            "timestamp": time.time(),
            "hits": 0,  # Each later hit saves AVERAGE_QUERY_COST
        }
        
        return result
    
    # Estimate savings
    total_hits = sum(entry["hits"] for entry in QUERY_CACHE.values())
    total_savings = total_hits * AVERAGE_QUERY_COST
    print(f"Cache savings: ${total_savings}")
  3. Implement cost per feature tracking:

    def track_feature_cost(feature_name: str, session_cost: float):
        """Track cumulative cost per feature"""
        
        FEATURE_BUDGETS = {
            "search": 100,      # Max $100/month for search feature
            "summarize": 50,    # Max $50/month for summarize
            "translate": 30,    # Max $30/month for translate
        }
        
        current_month_cost = get_feature_cost_this_month(feature_name)
        budget = FEATURE_BUDGETS.get(feature_name, float('inf'))
        
        if current_month_cost + session_cost > budget:
            log.alert("FEATURE_BUDGET_EXCEEDED", {
                "feature": feature_name,
                "current_cost": current_month_cost,
                "session_cost": session_cost,
                "budget": budget
            })

Part 6: Quality Issues

Issue: Hallucinations Increased

Symptoms:

  • Model making up facts not in context
  • Model confident about false information
  • Model citing sources that don’t exist
  • Factual accuracy dropped

Root causes:

  1. Model drift — model behavior changed with update
  2. Prompt changed — instruction change causing more creativity
  3. Temperature increased — more randomness/creativity
  4. Memory corruption — mixing up facts from different contexts
  5. Context too short — model hallucinating to fill gaps

Diagnostic steps:

# Step 1: Measure hallucination rate
responses = get_responses_this_week()
hallucinations = []

for response in responses:
    facts = extract_facts(response)
    for fact in facts:
        if not is_in_context(fact, response.context):
            if not is_known_fact(fact):
                hallucinations.append({
                    "fact": fact,
                    "response": response.id,
                    "timestamp": response.timestamp
                })

hallucination_rate = len(hallucinations) / len(responses)
print(f"Hallucination rate: {hallucination_rate:.1%}")

baseline_rate = get_historical_hallucination_rate()
if hallucination_rate > baseline_rate * 1.5:
    print(f"WARNING: >50% increase over baseline ({baseline_rate:.1%})")

# Step 2: Check for recent changes
recent_changes = get_recent_changes(last_24_hours=True)
for change in recent_changes:
    print(f"Change: {change.type}")
    if change.type == "prompt":
        print(f"  Before: {change.old_value[:100]}")
        print(f"  After: {change.new_value[:100]}")
    elif change.type == "model":
        print(f"  {change.old_value} → {change.new_value}")
    elif change.type == "temperature":
        print(f"  {change.old_value} → {change.new_value}")

# Step 3: Check model and parameters
print(f"Model: {agent.model}")
print(f"Temperature: {agent.temperature}")
print(f"Top P: {agent.top_p}")

# Higher temperature = more random/creative
if agent.temperature > 0.5:
    print("WARNING: High temperature may cause hallucinations")

# Step 4: Check context size
avg_context_size = get_average_context_size()
print(f"Average context: {avg_context_size} tokens")

if avg_context_size < 1000:
    print("WARNING: Small context may cause hallucinations")

# Step 5: Compare staging vs production
staging_hallucination_rate = get_hallucination_rate("staging")
prod_hallucination_rate = get_hallucination_rate("production")
print(f"Staging: {staging_hallucination_rate:.1%}")
print(f"Production: {prod_hallucination_rate:.1%}")
if prod_hallucination_rate > staging_hallucination_rate:
    print("WARNING: Production has higher hallucination rate")

Quick fix (< 5 minutes):

1. Reduce temperature
   → Set temperature to 0.3 instead of 0.7
   → More deterministic = fewer hallucinations
   
2. Check for recent model change
   → Did you upgrade model in last 24h?
   → Rollback to previous model
   → Test if hallucination rate drops
   
3. Check for prompt changes
   → Did someone edit the system prompt?
   → Revert prompt to working version
   
4. Add fact verification step
   → After agent generates response
   → Agent must cite sources for each fact
   → If no source, agent must admit uncertainty

Proper fix (permanent):

  1. Implement fact verification loop:

    def verify_facts_in_response(response: str, context: str):
        """Verify each fact in response comes from context"""
        
        facts = extract_facts(response)
        
        unverified_facts = []
        for fact in facts:
            if fact not in context:
                if not is_well_known_fact(fact):
                    unverified_facts.append(fact)
        
        if unverified_facts:
            # Ask agent to remove or cite these facts
            prompt = f"""
            Your response contains these facts not in the provided context:
            {unverified_facts}
            
            For each fact:
            - Remove it if it's speculation
            - Or cite which document supports it
            
            Revised response:
            """
            
            verified_response = agent.continue_conversation(prompt)
            return verified_response
        
        return response
  2. Add confidence scoring:

    def add_confidence_scores(response: str):
        """Ask agent to add confidence scores to facts"""
        
        prompt = f"""
        Review your response and add confidence scores:
        - [HIGH]: Directly from provided documents
        - [MEDIUM]: Reasonable inference from documents
        - [LOW]: General knowledge, not in documents
        - [UNCERTAIN]: Not sure, may be wrong
        
        Example: "The company has [HIGH] 1000 employees 
        and likely [MEDIUM] plans expansion, though I'm [UNCERTAIN] 
        about the timeline."
        
        Response with confidence scores:
        """
        
        scored_response = agent.continue_conversation(prompt)
        return scored_response
  3. Baseline and monitor hallucination rate:

    class HallucinationMonitor:
        def __init__(self, baseline_rate: float = 0.05):
            self.baseline_rate = baseline_rate  # 5%
            self.alert_threshold = baseline_rate * 1.5  # 7.5%
        
        def check_hallucination_rate(self):
            current_rate = measure_current_hallucination_rate()
            
            if current_rate > self.alert_threshold:
                log.alert("HALLUCINATION_RATE_HIGH", {
                    "baseline": self.baseline_rate,
                    "current": current_rate,
                    "threshold": self.alert_threshold
                })
                return False
            
            return True
    
    monitor = HallucinationMonitor()
    monitor.check_hallucination_rate()

Part 7: Performance Issues

Issue: Slow Inference

Symptoms:

  • Model takes 5-10+ seconds to generate first token
  • All requests slow, not just some
  • Latency increases over time (doesn’t improve with restart)
  • Model loading slower than before

Root causes:

  1. Large context window — more tokens = slower processing
  2. Model size increase — switched to larger model
  3. GPU out of memory — falling back to CPU (orders of magnitude slower)
  4. Model not cached — reloading model from disk each time
  5. Increased load — GPU busy with other requests

Diagnostic steps:

# Step 1: Measure latency
start = time.time()
response = model.generate("test prompt")
latency = time.time() - start

first_token_latency = response.metrics["first_token_ms"]
print(f"Total latency: {latency*1000:.0f}ms")
print(f"First token: {first_token_latency:.0f}ms")

# Normal first token: 50-200ms
# Slow first token: > 500ms (suggests issue)

# Step 2: Check GPU usage
import gpustat
gpu_info = gpustat.new_query()
for gpu in gpu_info:
    print(f"GPU {gpu.index}: {gpu.utilization}% used, {gpu.memory_used}/{gpu.memory_total} MB")

if any(gpu.memory_used > gpu.memory_total * 0.9 for gpu in gpu_info):
    print("WARNING: GPU running low on memory")

# Step 3: Check context size
context_size = count_tokens(full_context)
print(f"Context size: {context_size} tokens")

# More context = slower processing
# Typical: 5-10K tokens
# Slow: > 100K tokens

# Step 4: Check model size
model_info = get_model_info(model_name)
print(f"Model: {model_name}")
print(f"Model size: {model_info.parameters} parameters")

# 7B model: ~13GB memory
# 13B model: ~26GB memory
# 70B model: ~140GB memory (needs multi-GPU)

# Step 5: Check if model is cached
if is_model_loaded_in_memory():
    print("✓ Model in memory (fast)")
else:
    print("✗ Model not in memory, will load from disk (slow)")

Quick fix (< 5 minutes):

1. Check if GPU is out of memory
   → Run nvidia-smi
   → If used > 90%, try restarting to free memory
   
2. Check context size
   → Is it much larger than before?
   → Reduce context (prune old memories)
   
3. Check if model loaded
   → Is model in GPU memory?
   → Load it once, don't reload each request
   
4. Reduce batch size if applicable
   → If processing multiple requests, reduce batch
   → Gives GPU more free memory per request
   
5. Profile to find bottleneck
   → Which part is slow? (model loading, inference, tokenization?)
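Step 5's profiling can be as simple as wrapping each stage in a timer. This is a sketch; the tokenizer and model calls below are placeholders for your own stack:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage: str):
    """Accumulate wall-clock time per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Usage (stubs stand in for real tokenizer/model calls):
with timed("tokenize"):
    tokens = list("some prompt")   # placeholder for tokenizer.encode()
with timed("inference"):
    time.sleep(0.01)               # placeholder for model.generate()

slowest = max(timings, key=timings.get)
print(f"Slowest stage: {slowest} ({timings[slowest]*1000:.0f}ms)")
```

Once you know whether loading, tokenization, or generation dominates, the fixes below target each case.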

Proper fix (permanent):

  1. Implement model caching:

    import gc
    
    class ModelCache:
        def __init__(self):
            self.cached_models = {}
        
        def load_model(self, model_name: str):
            if model_name not in self.cached_models:
                print(f"Loading {model_name}...")
                model = load_model_from_disk(model_name)
                self.cached_models[model_name] = model
            
            return self.cached_models[model_name]
        
        def unload_unused_models(self):
            # Keep only last 2 models in memory
            if len(self.cached_models) > 2:
                oldest = min(self.cached_models.items(), 
                           key=lambda x: x[1].last_used)
                del self.cached_models[oldest[0]]
                gc.collect()
  2. Implement async/batching:

    import asyncio
    
    class InferenceBatcher:
        def __init__(self, batch_size: int = 4):
            self.batch_size = batch_size
            self.queue = asyncio.Queue()
        
        async def add_request(self, prompt: str):
            await self.queue.put(prompt)
        
        async def process_batches(self):
            while True:
                batch = []
                
                # Collect up to batch_size requests
                for _ in range(self.batch_size):
                    try:
                        prompt = self.queue.get_nowait()
                        batch.append(prompt)
                    except asyncio.QueueEmpty:
                        break
                
                if batch:
                    # Process batch together (faster than one-by-one)
                    results = model.generate_batch(batch)
                    # ... return results
                
                await asyncio.sleep(0.1)
  3. Monitor and alert on latency degradation:

    class LatencyMonitor:
        def __init__(self):
            self.baseline_latency = 150  # ms
            self.alert_threshold = 500   # ms
        
        def check_latency(self, latency_ms: float):
            if latency_ms > self.alert_threshold:
                degradation = (latency_ms - self.baseline_latency) / self.baseline_latency
                log.alert("LATENCY_DEGRADATION", {
                    "baseline": self.baseline_latency,
                    "current": latency_ms,
                    "degradation": f"{degradation:.0%}"
                })

Part 8: Common Error Messages

Error: 429 Rate Limit Exceeded

Message:

APIError: 429 Rate limit exceeded. Please retry after 60 seconds.

What it means: You’ve made too many requests to the API. The API is rate-limiting you to prevent abuse.

Causes:

  • Too many concurrent requests
  • Exceeded monthly token quota
  • API provider bandwidth limit

Quick fix:

import time

def call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError as e:
            if attempt < max_retries - 1:
                wait_time = int(e.retry_after) if hasattr(e, 'retry_after') else 2**attempt
                print(f"Rate limited, waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

Error: Context Window Exceeded

Message:

ContextLengthExceededError: Prompt too long (8,532 tokens > 4,096 max)

What it means: Your prompt (context + message) is too long for the model. Need to reduce it.

Quick fix:

  • Switch to model with larger context window (e.g., Claude 3.5 with 200K)
  • Reduce startup memory (CLAUDE.md, MEMORY.md)
  • Summarize old messages
  • Use compression (LLM Wiki pattern)
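One mechanical way to apply the last two bullets is to keep the system prompt and the newest turns, dropping the oldest middle messages until the prompt fits. A sketch, where `count_tokens` is a crude stand-in for your model's real tokenizer:

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: ~4 characters per token. Use your model's tokenizer.
    return max(1, len(text) // 4)

def trim_to_budget(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the system message and newest turns; drop oldest middle turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(count_tokens(m["content"]) for m in system + rest) > max_tokens:
        rest.pop(0)  # drop the oldest non-system message
    return system + rest
```

A summarization pass over the dropped messages (instead of discarding them) preserves more context at the same token budget.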

Error: Model Not Found

Message:

APIError: Model 'gpt-4-turbo-2024-04-09' not found

Causes:

  • Model deprecated/removed
  • Typo in model name
  • Wrong API provider

Quick fix:

  • List available models: curl https://api.openai.com/v1/models -H "Authorization: Bearer $OPENAI_API_KEY"
  • Check model documentation for current available models
  • Use standardized model names from documentation

Part 9: FAQ — Frequently Asked Questions

Q: Why is my agent looping?

A: Agents loop when:

  1. Tool keeps failing — agent thinks it should retry
  2. Task is ambiguous — agent doesn’t know when to stop
  3. No termination logic — max_iterations not set

Fix: See “Agent Stuck in Loop” section above.
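A minimal guard covers both missing pieces at once: a hard iteration cap and detection of repeated identical tool calls. A sketch:

```python
class LoopGuard:
    """Stop an agent that exceeds max_iterations or repeats the same call."""

    def __init__(self, max_iterations: int = 15, max_repeats: int = 3):
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self.iterations = 0
        self.call_counts = {}

    def check(self, tool_name: str, tool_args: str) -> bool:
        """Return True if the agent may continue, False if it should stop."""
        self.iterations += 1
        key = (tool_name, tool_args)
        self.call_counts[key] = self.call_counts.get(key, 0) + 1
        if self.iterations > self.max_iterations:
            return False  # hard cap: task is taking too many steps
        if self.call_counts[key] > self.max_repeats:
            return False  # same tool + same args repeatedly = stuck
        return True
```

Call `guard.check(...)` before dispatching each tool call and surface a clear "stopped: loop detected" result to the user instead of burning tokens silently.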


Q: How do I reduce costs?

A: Top cost-reduction tactics (in order of impact):

  1. Use smaller model (80% saving): SLM (7B) for loop, LLM (70B+) for verify only
  2. Cache results (30-50% saving): Repeat queries shouldn’t re-run
  3. Reduce context (20-40% saving): Compress memory, use LLM Wiki pattern
  4. Use quantization (20% saving): 4-bit models run on cheaper hardware with minimal quality loss
  5. Route smart (10-20% saving): Simple tasks don’t need expensive model

Example: Hybrid setup can save up to 80-90% vs pure cloud (when most requests route locally):

  • 80% of requests → cheap local SLM (Phi, Mistral 7B) = ~$0
  • 20% of requests → verify with Claude Opus = ~$3/1M tokens
  • Total: ~$0.60/1M tokens vs $15/1M for pure Claude
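The blended figure follows directly from the split; a sketch of the arithmetic:

```python
def blended_cost_per_million(local_share: float, local_cost: float,
                             cloud_cost: float) -> float:
    """Blended $/1M tokens for a hybrid local/cloud split."""
    return local_share * local_cost + (1 - local_share) * cloud_cost

# 80% local (~$0) + 20% premium cloud model (~$3/1M tokens)
cost = blended_cost_per_million(0.80, 0.0, 3.0)
print(f"Blended: ${cost:.2f}/1M tokens")  # vs $15/1M for a pure premium model
```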

Q: What model size do I need?

A: Choose based on task complexity:

| Task | Model | Reason |
|------|-------|--------|
| Classification | 7B SLM | Fast, cheap, good enough |
| Summarization | 13B SLM | Good balance |
| Q&A retrieval | 13B SLM | Needs reasoning but not deep |
| Code generation | 34B SLM | Needs better code understanding |
| Complex reasoning | 70B LLM | Requires deep reasoning |
| Verification | 70B+ LLM | Needs high accuracy |

Q: Should I use cloud or local models?

A: Decision tree:

Tokens/day < 100K?
  → Cloud (cheaper for low volume)
Tokens/day 100K-1M?
  → Hybrid (local for loop, cloud for verify)
Tokens/day > 1M?
  → Local self-hosted (cheaper at scale)
Needs latest model?
  → Cloud (local models lag by 6-12 months)
Sensitive data?
  → Local (keep data on-premise)
No GPU available?
  → Cloud (can't run local models)
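The tree above can be encoded directly, checking hard constraints before volume. A sketch using the thresholds stated above:

```python
def choose_deployment(tokens_per_day: int, needs_latest_model: bool = False,
                      sensitive_data: bool = False, has_gpu: bool = True) -> str:
    """Apply the cloud-vs-local decision tree, hard constraints first."""
    if sensitive_data:
        return "local"       # keep data on-premise
    if not has_gpu or needs_latest_model:
        return "cloud"       # can't run local, or local models lag 6-12 months
    if tokens_per_day < 100_000:
        return "cloud"       # cheaper at low volume
    if tokens_per_day <= 1_000_000:
        return "hybrid"      # local for the loop, cloud for verify
    return "local"           # self-hosted wins at scale
```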

Q: How do I debug agent decisions?

A: Enable detailed logging:

# Log every step
for step in agent.steps:
    print(f"Step {step.iteration}:")
    print(f"  Reasoning: {step.reasoning}")
    print(f"  Tool: {step.tool_name}")
    print(f"  Result: {step.tool_result[:200]}")
    print(f"  Cost: ${step.cost}")

# Check if reasoning makes sense
# If reasoning is wrong → LLM confused, need clearer instruction
# If reasoning right but tool wrong → Tool choice issue

Q: What’s the difference between ReAct and Tree of Thoughts?

A:

| Framework | How it works | Best for | Cost |
|-----------|--------------|----------|------|
| ReAct | Think → Act → Observe (loop) | Most tasks, default choice | Baseline (1-8 iterations) |
| Tree of Thoughts | Explore multiple branches, keep best | Complex problems, deep reasoning | 3-5× more expensive (many branches) |
| Reflexion | Act → Get feedback → Self-correct | Quality improvement, when first try fails | 2-3× cost (add reflection step) |

Recommendation: Start with ReAct. Use Tree of Thoughts only if ReAct success rate < 70%.
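The recommendation can be enforced automatically: escalate from ReAct only when its measured success rate justifies the extra cost. A sketch:

```python
def choose_framework(react_success_rate: float,
                     needs_self_correction: bool = False) -> str:
    """Default to ReAct; escalate only when measured success is too low."""
    if react_success_rate >= 0.70:
        return "react"
    if needs_self_correction:
        return "reflexion"         # 2-3x cost: add a feedback/correction step
    return "tree_of_thoughts"      # 3-5x cost: explore multiple branches
```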


Q: How much GPU memory do I need?

A: For different model sizes:

| Model | Memory | GPU | Cost/month |
|-------|--------|-----|------------|
| 7B | 14GB | 1× RTX 4090 | $500 |
| 13B | 26GB | 1× RTX 4090 | $500 |
| 34B | 68GB | 1× H100 | $3K |
| 70B | 140GB | 2× H100 | $6K |
| 405B | 810GB | Requires specialized hardware | $20K+ |

Cheaper alternative: Use cloud API (pay per token, no hardware cost).
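Whether self-hosting pays off depends on volume; a rough break-even check (sketch, plug in your own hardware and API prices):

```python
def breakeven_tokens_per_month(gpu_cost_per_month: float,
                               api_cost_per_million: float) -> float:
    """Tokens/month above which fixed GPU cost beats pay-per-token API."""
    return gpu_cost_per_month / api_cost_per_million * 1_000_000

# Example: $500/month RTX 4090 vs a $3/1M-token API
tokens = breakeven_tokens_per_month(500, 3.0)
print(f"Break-even: {tokens / 1e6:.0f}M tokens/month")
```

Below the break-even volume, the API is cheaper even though the per-token price looks high.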


Q: Is my harness secure?

A: Security checklist:

  • Input validation (check for injection patterns)
  • Output filtering (no PII leaks)
  • Rate limiting (prevent abuse)
  • Audit logging (track who did what)
  • Secrets management (no hardcoded API keys)
  • Sandboxing (restrict tool access)

See 10_security_and_safety.md for full details.


Q: How do I monitor production?

A: Essential metrics:

ESSENTIAL_METRICS = [
    "error_rate",           # % of requests failing
    "latency_p50/p95/p99",  # Request duration
    "cost_per_task",        # Token cost trending
    "success_rate",         # % of agent reaching goal
    "loop_iterations",      # Avg steps per task (higher = less efficient)
    "memory_usage",         # RAM / context window usage
    "loop_detection",       # Count of stuck agents
]

Alerting:

  • Error rate > 5% → page on-call
  • Cost/task > 2× baseline → page on-call
  • Success rate drops > 10% → investigate
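These thresholds can be wired into a single check; a sketch that returns the alerts to fire (routing to your pager is left out):

```python
def evaluate_alerts(error_rate: float, cost_per_task: float,
                    baseline_cost: float, success_rate: float,
                    baseline_success: float) -> list[str]:
    """Return the alerts that should fire for the current metrics."""
    alerts = []
    if error_rate > 0.05:
        alerts.append("page: error_rate > 5%")
    if cost_per_task > 2 * baseline_cost:
        alerts.append("page: cost_per_task > 2x baseline")
    if success_rate < baseline_success - 0.10:
        alerts.append("investigate: success_rate dropped > 10%")
    return alerts
```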

Q: What’s the best prompt?

A: No single “best” prompt, but follow these principles:

  1. Clear role: “You are a Python expert”
  2. Clear task: “Your job is to review this code”
  3. Clear constraints: “Don’t suggest breaking changes”
  4. Clear output format: “Return JSON with keys: issues, severity”
  5. Examples: Show 1-2 examples of good responses

Bad prompt:

"Write code"

Good prompt:

You are a senior Python engineer.
Review this Python code and identify bugs.
Focus on: memory leaks, infinite loops, security issues.
Output as JSON: {"issues": [{"line": 5, "type": "memory_leak", "fix": "..."}]}

Example:
Code: for x in data: items.append(x)  # grows unbounded
Issue: Memory leak if data is large, items is never freed
Fix: Use generator instead: (x for x in data)

Part 10: Decision Trees for Diagnosis

When Error Rate Spikes

Error rate > 5%?
├─ Check specific error in logs
│  ├─ "Tool not found" → Tool missing/renamed
│  ├─ "Rate limit" → API quota exceeded
│  ├─ "Timeout" → External service slow
│  └─ "Model error" → Model offline/changed

├─ Check recent changes (last 2 hours)
│  ├─ Deployed new code? → Rollback
│  ├─ Changed prompt? → Revert prompt
│  ├─ Switched model? → Switch back
│  └─ No recent changes → Check external services

└─ Check metrics
   ├─ Latency high? → Performance issue
   ├─ Cost high? → Runaway agent
   └─ Memory high? → Memory leak
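The first branch of the tree, matching the error string to a runbook section, can be a simple lookup. A sketch (extend the table with your own error patterns):

```python
# Map substrings of production error messages to runbook sections.
ERROR_TRIAGE = {
    "tool not found": "Tool Issues: Not Registered",
    "rate limit": "Common Error Messages: 429",
    "timeout": "Tool Issues: Timeout",
    "context": "Common Error Messages: Context Exceeded",
}

def triage(error_message: str) -> str:
    """Return the runbook section for an error message, if any pattern matches."""
    msg = error_message.lower()
    for needle, section in ERROR_TRIAGE.items():
        if needle in msg:
            return section
    return "Unknown: check recent changes and external services"
```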

When Cost Increases

Cost > budget?
├─ Identify the expensive session
│  ├─ High iteration count? → Loop issue (see "Stuck in Loop")
│  ├─ High output tokens? → Agent over-generating
│  └─ Many small costs? → Repeated expensive operations

├─ Check model used
│  ├─ Using expensive model? → Switch to cheaper
│  ├─ Changed model? → Revert
│  └─ Using correct model? → Check iteration count

└─ Quick wins
   ├─ Cache search results (30% savings)
   ├─ Use cheaper model for 80% of requests (80% savings)
   └─ Reduce startup memory (10-20% savings)

Part 11: Incident Playbook

Incident: Cost $5K in 24 hours (Normal: $100)

Timeline (do this ASAP):

  1. Minute 1-2: Kill agent if still running
  2. Minute 3-5: Identify which session/agent caused spike
  3. Minute 6-10: Check logs for what it was doing
  4. Minute 11-15: Implement hard cost limit (prevent repeat)
  5. Hour 1: Root-cause analysis (why did this happen?)
  6. Hour 2: Fix and validate fix

Debug steps:

# Step 1: Find expensive sessions
expensive_sessions = get_sessions_by_cost(sort="descending")
culprit = expensive_sessions[0]

print(f"Session {culprit.id}:")
print(f"  Cost: ${culprit.cost}")
print(f"  Duration: {culprit.duration_seconds}s")
print(f"  Iterations: {culprit.loop_iterations}")
print(f"  Input tokens: {culprit.input_tokens}")
print(f"  Output tokens: {culprit.output_tokens}")

# Step 2: Check what it was doing
for step in culprit.steps[:10]:  # First 10 iterations
    print(f"Iteration {step.iteration}:")
    print(f"  Tool: {step.tool_name}")
    print(f"  Tokens: in={step.input_tokens}, out={step.output_tokens}")
    print(f"  Cost: ${step.cost}")

# Was it looping? Generating huge outputs? Using expensive model?

# Step 3: Check for the root cause
if culprit.loop_iterations > 20:
    print("ROOT CAUSE: Agent looping excessively")
    # See "Agent Stuck in Loop" fix
elif culprit.output_tokens > 50000:
    print("ROOT CAUSE: Agent generating huge outputs")
    # Check what it was generating
elif culprit.model == "claude-3-opus":
    print("ROOT CAUSE: Used expensive model instead of cheap one")
    # Check why it switched models

Prevent repeat:

# Add hard cost limit
class HardCostLimit:
    def __init__(self, max_cost_per_session: float = 5.0):
        self.max_cost = max_cost_per_session
    
    def check(self, session_cost: float):
        if session_cost > self.max_cost:
            kill_session_immediately()
            alert_ops("COST_LIMIT_HIT")
            raise Exception(f"Cost ${session_cost} exceeds limit ${self.max_cost}")

# Deploy immediately
limit = HardCostLimit(max_cost_per_session=5.0)

Conclusion

When something breaks in production, speed and calm matter most. Use these tools:

  1. Decision tree → Narrow down the problem fast
  2. Diagnostic steps → Verify your hypothesis
  3. Quick fix → Stop the bleeding (< 5 min)
  4. Proper fix → Prevent it recurring (permanent)
  5. Prevention → Add monitoring/checks

Most production incidents follow patterns. If you’ve seen it once, you can fix it again—faster.

Keep this runbook bookmarked. Update it with new incidents you find.


Quick Reference: Commands

# View logs for a specific error
grep "ERROR" harness.log | grep "tool_timeout" | tail -20

# Check which agent is expensive
jq '.sessions | sort_by(.cost) | reverse | .[0]' sessions.json

# Count iterations for a session
jq '.steps | length' session.json

# Check model used
jq '.model' session.json

# Get cost breakdown
jq '{model: .model, cost: .cost, tokens: .input_tokens + .output_tokens}' session.json

Further Reading

  • 09_operations_and_observability.md — Full logging and monitoring guide
  • 10_security_and_safety.md — Security hardening
  • 11_testing_and_qa.md — Quality assurance
  • 13_cost_management.md — Deep cost analysis

Validation Checklist

How do you know you got this right?

Performance Checks

  • Decision tree diagnostic identifies root cause in <5 minutes
  • Quick fix resolves symptom in <5 minutes (service restored)
  • Proper fix prevents recurrence (no duplicate incidents in 2+ weeks)
  • Runbook tested: new on-call follows steps successfully

Implementation Checks

  • Decision tree covers 90%+ of real production incidents
  • Diagnostic steps for each symptom documented with example logs
  • Quick fix is safe: temporary measure that doesn’t cause data loss
  • Proper fix implemented: code change or monitoring addition deployed
  • Prevention measures in place: monitoring alert or hard limit added
  • Commands cheatsheet tested: each one returns expected data format
  • Runbook updated after every incident: lessons captured

Integration Checks

  • Logging provides needed data: can trace request from input to output
  • Monitoring alerts match runbook symptoms: alert fires when issue occurs
  • Escalation procedures defined: who to contact if fix fails
  • Incident postmortem process: how to prevent recurrence

Common Failure Modes

  • Decision tree doesn’t match real issues: Test against last 10 incidents; update
  • Logs don’t provide diagnostic info: Missing request IDs, timing, error context
  • Quick fix is too complex: Takes 10 minutes; simplify or document better
  • Same incident repeats: Prevention measure didn’t work; verify it’s deployed
  • Runbook outdated: Logs format changed, commands broken; maintain as code changes

Sign-Off Criteria

  • Runbook tested by someone unfamiliar with codebase (clarity check)
  • All 3 real incidents resolved successfully using runbook
  • Prevention measures verified deployed: alerts fire, limits enforced
  • Team trained: on-call can follow runbook independently
  • Documentation complete: why issues happen, not just what to do

See Also

  • Doc 09 (Operations & Observability): Structured logging and monitoring setup
  • Doc 13 (Cost Management): Cost spike diagnosis and prevention
  • Doc 16 (Evaluation & Benchmarking): Quality regression detection and response