The Model is the Easy Part

Every conversation I have about AI agents seems to be about the model. Claude versus GPT versus whatever-just-shipped. Which size, which provider, which release. Almost nobody talks about what’s around the model — the deterministic code, the tools, the memory, the loop, the sandbox — and that’s where the engineering work actually lives. It’s also where the system either holds together for the long run or quietly falls apart.

I’m a cloud infrastructure engineer. I’d spent the last year using Claude Code to write web apps and iOS apps and never thought hard about what was inside the assistant. Then I sat down to build an AI agent for myself — not to call someone else’s — and realised I’d been thinking about the wrong layer the whole time. The model is the easy part. The system you wrap around it is the work, and the patterns are surprisingly close to things infrastructure engineers already know.

I picked genealogy as the domain because I wanted something concrete: real records, real ambiguity, a real test of whether the agent could actually be useful at the end. The result is an app I run on the MacBook I’m typing this on. It uses Qwen 2.5 14B running locally through MLX, costs a few pence of electricity per session, and has surfaced 473 ancestors across roughly three hundred years of British civil records. What surprised me, building it, is how little of the work was about the model.

Inside the harness

The word that fits best for the surrounding system is harness, in the same sense as a horse harness — a structured rigging that lets a powerful but undirected animal pull a cart in a useful direction without bolting, falling over, or eating the cargo. The model does the work. The harness makes the work productive.

A complete harness has seven components. The model is one of them. The other six are where the engineering goes, and they map cleanly onto infrastructure patterns you probably already use:

Tools — the things the model can call. For the genealogy agent: eight record-source parsers (FreeBMD, census transcripts, parish registers), a deterministic 4-gate scorer, a clustering engine for collapsing near-duplicates. These are ordinary functions with structured inputs and outputs. The model calls them through a registry; they return data.
Memory — what the agent knows. Layered: durable long-term state (the family tree), session-scoped working state (the current research lead), retrievable knowledge (a markdown wiki the agent loads selectively). Maps onto cache / database / object store, with different lifetimes and access patterns.
A planning loop — what the agent does, continuously. Discover, score, strategise, integrate, repeat. Iteration caps, stuck detection, graceful degradation. Reads like an event loop or a state machine; the patterns transfer directly.
A sandbox — what the agent’s allowed to do. Almost nothing in the genealogy app crosses into the family tree without me confirming it. Default-deny, explicit allowances, audit trail. Same shape as IAM policy enforcement.
Persistence — durable storage across sessions. The agent loses everything on a restart unless you write it down. Boring; necessary.
Orchestration — the wiring that makes the other six behave as one coherent system rather than a pile of components. Session lifecycle, dependency injection, the bits that make the components composable.
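To make the tools and planning-loop items concrete, here is a minimal sketch in Python (the app itself is Swift, so this is an illustration, not the app's code). The registry decorator, the `search_births` tool, its canned data, and the `max_iterations` and `max_stuck` parameters are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical tool registry: ordinary functions with structured input and output.
TOOLS: Dict[str, Callable[[dict], dict]] = {}

def tool(name: str):
    """Register a function under the name the model uses to call it."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("search_births")
def search_births(args: dict) -> dict:
    # Stand-in for a real record-source parser; returns canned data only.
    index = {"Barker": [{"name": "Mary Barker", "year": 1858}]}
    return {"records": index.get(args["surname"], [])}

@dataclass
class LoopResult:
    findings: List[dict] = field(default_factory=list)
    iterations: int = 0
    stopped_because: str = ""

def run_loop(leads: List[dict], max_iterations: int = 10, max_stuck: int = 2) -> LoopResult:
    """Work through leads with an iteration cap and simple stuck detection."""
    result = LoopResult()
    stuck = 0
    queue = list(leads)
    while queue:
        if result.iterations >= max_iterations:
            result.stopped_because = "iteration cap"
            return result
        lead = queue.pop(0)
        result.iterations += 1
        handler = TOOLS.get(lead["tool"])
        # Graceful degradation: an unknown tool counts as an empty result.
        records = handler(lead["args"])["records"] if handler else []
        if records:
            result.findings.extend(records)
            stuck = 0
        else:
            stuck += 1
            if stuck >= max_stuck:
                # Several fruitless leads in a row: stop rather than spin.
                result.stopped_because = "stuck"
                return result
    result.stopped_because = "queue empty"
    return result
```

Calling `run_loop([{"tool": "search_births", "args": {"surname": "Barker"}}])` walks one lead through the registry and returns the findings along with the reason the loop stopped, which is the part worth logging.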

That’s the architecture. Nearly all of my build time went into these six components. The seventh, the model itself, was a system prompt of about a page, a thin Swift wrapper around the MLX inference call, and a JSON schema for the output. The clever-looking part of the whole project was the easy part.
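The value of a JSON schema for the output is that the harness can reject anything off-contract before acting on it. A minimal sketch of that check, in Python with only the standard library: the three field names (`action`, `arguments`, `rationale`) are hypothetical, not the app's actual schema.

```python
import json

# Hypothetical output contract: field name -> required Python type.
REQUIRED_FIELDS = {"action": str, "arguments": dict, "rationale": str}

def parse_model_output(raw: str) -> dict:
    """Parse one model reply; raise ValueError on anything off-contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    for name, expected in REQUIRED_FIELDS.items():
        # A missing field yields None, which fails the type check too.
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    return data
```

A reply that fails here never reaches a tool; the loop can retry or drop the lead instead of guessing what the model meant.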

Where the rules came from

The 4-gate scorer is the bit that actually decides whether two records refer to the same ancestor — whether a “Mary Barker, age 23” in an 1881 census matches a “Mary Barker, 1858” in the BMD birth index. It’s deterministic code: rules about evidence, scoring functions, thresholds, no AI at runtime.
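As an illustration of what gated matching can look like, here is a sketch in Python. The four gates below (surname, forename, birth year, place) and the one-year tolerance are assumptions chosen for the example; the post does not spell out the real gates, and the real scorer is Swift.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Record:
    forename: str
    surname: str
    birth_year: int              # stated, or derived from a census age
    place: Optional[str] = None

def from_census(forename: str, surname: str, age: int, census_year: int,
                place: Optional[str] = None) -> Record:
    """A census gives an age, so the birth year is derived from it."""
    return Record(forename, surname, census_year - age, place)

def gate_surname(a: Record, b: Record) -> bool:
    return a.surname.lower() == b.surname.lower()

def gate_forename(a: Record, b: Record) -> bool:
    return a.forename.lower() == b.forename.lower()

def gate_birth_year(a: Record, b: Record, tolerance: int = 1) -> bool:
    # Assumed tolerance: a census age can be a year out either way.
    return abs(a.birth_year - b.birth_year) <= tolerance

def gate_place(a: Record, b: Record) -> bool:
    # Assumption: an unknown place is not evidence against a match.
    if a.place is None or b.place is None:
        return True
    return a.place.lower() == b.place.lower()

GATES = [
    ("surname", gate_surname),
    ("forename", gate_forename),
    ("birth_year", gate_birth_year),
    ("place", gate_place),
]

def score(a: Record, b: Record) -> dict:
    """Run every gate; the pair matches only if all of them pass."""
    results = {name: gate(a, b) for name, gate in GATES}
    return {"match": all(results.values()), "gates": results}
```

With the example from the text, `score(from_census("Mary", "Barker", 23, 1881), Record("Mary", "Barker", 1858))` derives a birth year of 1858 from the census age and passes all four gates. Returning the per-gate results alongside the verdict is what makes a rejected pair explainable later.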

But AI built it. Not through me sitting in iterative sessions with a model: I exposed the harness via MCP and let Claude Code drive the build loop autonomously. It is recursive self-improvement of a sort, with Claude Code as the outer agent and the harness in progress as both the target of refinement and the test surface.

The setup:

The harness exposes its tools — record-source queries, scorer evaluations, profile lookups, lead writes — through MCP (the standard protocol for exposing tools to AI models). To Claude Code, the harness is a tool server.
The ground truth is a set of thirteen close-family profiles where the births, marriages, and parents are already known with confidence.
Claude Code runs an OODA loop (observe, orient, decide, act) against the harness: observe what the tools return and what the harness’s logs say it did to get there, orient against the known facts, decide what’s missing or wrong, act via MCP calls to refine rules, extend queries, or codify newly-noticed behaviour in a record source.

A representative iteration: Claude Code asks the harness — through an MCP tool call — to verify, validate, or extend a known fact about a profile. The harness runs its current rules and queries against the record sources, returns its findings, and emits structured logs of what it tried and why. Claude Code analyses output and logs against two criteria: did the harness establish the known fact, and is it backed by source-citation records that justify it. If either check fails, Claude Code acts — a new scoring rule, broader query fanout, a codification of a pattern it’s noticed in a particular source’s behaviour. The refinement is committed; the loop iterates to the next fact, then the next profile. Self-improving, because the harness is being modified by an outer agent that uses the harness’s own tools as the test surface.
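The two checks in that iteration are simple enough to sketch. In the Python below, `KnownFact`, `Finding`, and the `harness` callable are hypothetical stand-ins: in the real setup the harness is reached through MCP tool calls, which this sketch replaces with a plain function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class KnownFact:
    profile: str
    field_name: str
    expected: str

@dataclass
class Finding:
    value: str
    citations: List[str] = field(default_factory=list)

# Stand-in for an MCP tool call: (profile, field) -> finding, or None if nothing found.
Harness = Callable[[str, str], Optional[Finding]]

def evaluate(fact: KnownFact, finding: Optional[Finding]) -> List[str]:
    """Return the failed checks; an empty list means both checks passed."""
    failures = []
    if finding is None or finding.value != fact.expected:
        failures.append("fact not established")
    if finding is None or not finding.citations:
        failures.append("no source citation")
    return failures

def run_pass(facts: List[KnownFact], harness: Harness) -> Dict[Tuple[str, str], List[str]]:
    """One pass over the ground truth; the report lists only the facts that failed."""
    report = {}
    for fact in facts:
        failures = evaluate(fact, harness(fact.profile, fact.field_name))
        if failures:
            report[(fact.profile, fact.field_name)] = failures
    return report
```

An empty report means the current rules and queries reproduce the seed. Anything else is the outer agent's work list for the next refinement.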

After a few weeks of this, the rules and queries had converged to the point of reproducing the thirteen-profile seed reliably. The second phase extended the loop to profiles further out in the tree. Spelling variants the seed hadn’t shown. Date drift the rules hadn’t accounted for. Sources that worked for close family but failed for nineteenth-century rural records. Each gap surfaced a refinement. The loop continued until new profiles succeeded more often than they failed.

What came out is what’s now in production: a body of rules and a body of structured queries that exercise them, co-built because they depend on each other. That rule layer has no model in it at runtime. It is pure deterministic code that Claude Code, via MCP-driven OODA, helped me construct.

This is the harness pattern applied to itself, at a higher level. At runtime, the agent’s model proposes actions, the rules decide which to take, I confirm the consequential ones. At build-time, Claude Code proposes rules and queries via MCP, the known facts decide which to keep, I confirm which versions to commit. Same proposer-decider-confirmer pattern, recursive across levels. The runtime rules don’t have to be written by hand; they can be built by an outer agent that uses the harness itself as the test surface.
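The runtime half of that pattern fits in a few lines. This Python sketch assumes hypothetical action names and a `confirm` callback standing in for the human; anything not explicitly listed is denied, and every verdict lands in an audit list.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical action names. Anything not listed here is denied by default.
SAFE_ACTIONS = {"search_records", "score_pair"}
CONSEQUENTIAL_ACTIONS = {"write_to_tree"}

def decide(action: dict) -> str:
    """Deterministic decider: allow, ask for confirmation, or deny."""
    name = action.get("action")
    if name in SAFE_ACTIONS:
        return "allow"
    if name in CONSEQUENTIAL_ACTIONS:
        return "confirm"
    return "deny"

def run_step(proposals: Iterable[dict],
             confirm: Callable[[dict], bool],
             audit: List[Tuple[str, str]]) -> List[dict]:
    """Filter the model's proposals through the rules, then through the human."""
    accepted = []
    for action in proposals:
        verdict = decide(action)
        if verdict == "confirm":
            # Consequential actions need an explicit yes from the confirmer.
            verdict = "allow" if confirm(action) else "deny"
        audit.append((str(action.get("action")), verdict))
        if verdict == "allow":
            accepted.append(action)
    return accepted
```

The build-time version has the same shape with different parties: proposed rules in place of proposed actions, the known facts as the decider, and a commit as the confirmation.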

The model is the most replaceable component

This is the load-bearing observation, and the reason this post has the title it does.

You can swap Claude for Qwen for Gemini for whatever-comes-next, and the harness keeps working. You can’t swap the deterministic code the same way; that code is your product. The 4-gate scorer is a body of carefully tested rules about evidence. The parsers track website changes that arrive without warning. The clustering engine handles the cases where the same person shows up under slightly different parish names. None of that work is AI work. All of it is what determines whether the system finds real ancestors or surfaces noise.

Most of the AI projects I’ve watched go wrong did so because someone treated the model as the whole system and built a thin wrapper around it. The quality doesn’t hold because the model is making decisions the deterministic code should be making. The system doesn’t last because there’s nothing to maintain except the prompt. The model is the part that gets all the marketing; the harness is the part that does the work.

The architecture even pays for itself. A purpose-built harness uses roughly 25× fewer tokens per call than a general-purpose hosted agent doing the same job — domain-specific rules instead of general-purpose instructions, tool descriptions held in Swift rather than in the prompt, fresh context built per call instead of carried as conversation history, structured JSON output instead of prose. When the harness is doing its job, the model gets to do just the model’s job, and the token budget falls out cleanly.
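One of those four factors, fresh context versus carried history, is easy to see as arithmetic. The Python sketch below is a toy model under stated assumptions, not a measurement: the function names and parameters are invented for the illustration, and it is not meant to reproduce the 25× figure.

```python
def carried_history_tokens(calls: int, base: int, per_turn: int) -> int:
    """Total input tokens when call k resends the base prompt plus all k earlier turns."""
    return sum(base + k * per_turn for k in range(calls))

def fresh_context_tokens(calls: int, base: int, per_call: int) -> int:
    """Total input tokens when each call gets the base prompt plus only its own context."""
    return calls * (base + per_call)
```

The carried-history total grows with the square of the number of calls, because every turn is resent on every later call. The fresh-context total grows linearly, so the gap widens the longer a session runs.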

I wrote it all down

I built the genealogy agent and, slowly, wrote down what I’d learned while doing it. The notes turned into a book — twenty-five chapters covering everything from picking a model through running one in production, written for engineers who want to understand AI systems as systems rather than as API endpoints. The PDF lives here: The Harness Handbook. The harness itself ships as Ancestor Research on the Mac App Store, if you want to see what came out of the build process.

A note on how the book got written, because pretending otherwise would be silly: I used Claude Code heavily as a research collaborator and verified every claim I could verify. Some content from early drafts was wrong — fabricated citations, off-by-a-thousand benchmarks — and I cut or corrected those. The test for any chapter is whether it’s true and useful, not how it was produced.

If you take one thing from this post, take this: when you sit down to build an AI agent, the model choice is the easy part. The harness is where the engineering goes. The harness is what you’ll be maintaining a year from now. Design that well; let the model be an ingredient rather than the centrepiece.