Build the agent, not the model.
An illustrated field guide to the harness — the system wrapped around a model that turns it into an agent that ships. Environments, memory, tools, planning, security, and the loop that improves itself.
The Essays
The Harness Is the Product
A deployed agent's capability is the product of model quality and harness quality — and in 2026, the harness term has the steeper gradient.
The Agent Evaluation Crisis
Agent benchmark numbers measure demos, not the capability teams think they bought — and the fix is a measurement regime borrowed from deep RL, not a better leaderboard.
Environments Are the Bottleneck
Agentic RL is not blocked on algorithms or compute. It is blocked on environments — and the teams that industrialize environment supply will own agent training the way data-pipeline teams owned supervised learning.
Agents That Learn on the Job
Deployed agents are amnesic by default. The deployment already generates the experience stream continual learning always wanted — and the system-space half of the loop is shippable today.
The Memory Stack
Agent memory is not one problem but a stack — working, episodic, semantic, procedural. Most production failures are layer-confusion: solving one layer's problem with another layer's tool.
Tools, Skills, and the Action Interface
Agent capability leaks at the action interface: the model knows a tool is needed and fails to use it, and the dominant protocol taxes every turn. The real question is whether the tools→skills evolution is engineered to compound — or to collapse.
Planning and the Myopia Problem
A reasoning model's chain of thought looks like a plan: it weighs futures, considers options, deliberates. Extract the search tree behind it and the deliberation turns out to be theater — the model expands deep branches and then chooses by the shallow ones. Where the stakes justify the cost, the fix is not a longer chain of thought. It is search you move outside the model.
Multi-Agent Systems and Their Failure Modes
A multi-agent system fails in ways a single agent cannot — its diversity collapses, its blame becomes untraceable, its coordination cost outgrows the work. The systems that survive do not fix the org chart. They make coordination something the system learns or something it pays for.
Software Engineering Agents: The Proving Ground
Software engineering is where agents grew up — the only domain that handed them verifiable rewards, endless environments, and expert oversight all at once. The platform layer is settled now. What is still open — cost, coordination, deployment at scale — is the playbook every other domain inherits next.
Securing the Agentic Perimeter
An agent is an attack surface that acts. Goal hijack, tool misuse, and memory poisoning are not prompt-injection-with-extra-steps — they are a new perimeter where the payload runs with the agent's privileges. This is the one chapter where practitioner documents lead the research.
Agent Ops: Running Agents in Production
Production agents need an ops discipline the way services needed SRE — and its founding move is architectural. Make the append-only event log the source of truth and derive the agent loop from it. Auditability, forking, replay, cost control, and context hygiene are not five features you bolt on. They are five consequences of one inversion.
Self-Improving Agents
Every harness improvement this series described was made by a human. Here are the papers where that stops being true — and the guardrails for a loop that edits itself.
“Two teams deploy the same frontier model. One ships an agent that resolves most of its tasks; the other stalls, loops, and quietly fails. Nothing about the model varied.”
This series follows a single thread: a deployed agent's capability is the product of model quality and harness quality — and in 2026 the harness has the steeper gradient. Each essay pairs a close read of the primary literature with a picture you can hold in your head, and ends in something you can build.
Harness Engineering · the sister series to Continual Intelligence