SERIES 02 · Twelve essays · Harness Engineering

Build the agent, not the model.

An illustrated field guide to the harness — the system wrapped around a model that turns it into an agent that ships. Environments, memory, tools, planning, security, and the loop that improves itself.


Illustrated essays

313 minutes of reading

69k words, fully sourced

Four parts, one argument

The Essays

Four parts · read in order

AG Agents & Harness · AR Agentic RL · CL Continual Learning · ME Memory · TO Tools & Skills · MA Multi-Agent · SW Software Engineering · EV Evaluation · SE Security · FN Foundations · SI Self-Improvement · SK Skills · WM World Models · DF Diffusion & Parallel Gen · IE Inference & Serving · RL RL & Post-Training

I · The Problem · Name the discipline; expose the measurement crisis.

1 / 12

AG · TO

The Harness Is the Product

A deployed agent's capability is the product of model quality and harness quality — and in 2026, the harness term has the steeper gradient.

20 min read

2 / 12

EV · AG

The Agent Evaluation Crisis

Agent benchmark numbers measure demos, not the capability teams think they bought — and the fix is a measurement regime borrowed from deep RL, not a better leaderboard.

23 min read

II · The Engine · Build the agent's training loop: environments, learning, memory, tools.

3 / 12

AR · AG

Environments Are the Bottleneck

Agentic RL is not blocked on algorithms or compute. It is blocked on environments — and the teams that industrialize environment supply will own agent training the way data-pipeline teams owned supervised learning.

26 min read

4 / 12

CL · AG

Agents That Learn on the Job

Deployed agents are amnesic by default. The deployment already generates the experience stream continual learning always wanted — and the system-space half of the loop is shippable today.

27 min read

5 / 12

ME · AG

The Memory Stack

Agent memory is not one problem but a stack — working, episodic, semantic, procedural. Most production failures are layer-confusion: solving one layer's problem with another layer's tool.

26 min read

6 / 12

TO · AG

Tools, Skills, and the Action Interface

Agent capability leaks at the action interface: the model knows a tool is needed and fails to use it, and the dominant protocol taxes every turn. The real question is whether the tools→skills evolution is engineered to compound — or to collapse.

23 min read

III · The Build · Navigate the hard problems: planning, coordination, the proving ground.

7 / 12

AG · EV

Planning and the Myopia Problem

A reasoning model's chain of thought looks like a plan: it weighs futures, considers options, deliberates. Extract the search tree behind it and the deliberation turns out to be theater — the model expands deep branches and then chooses by the shallow ones. Where the stakes justify the cost, the fix is not a longer chain of thought. It is search you move outside the model.

25 min read

8 / 12

MA · AG

Multi-Agent Systems and Their Failure Modes

A multi-agent system fails in ways a single agent cannot — its diversity collapses, its blame becomes untraceable, its coordination cost outgrows the work. The systems that survive do not fix the org chart. They make coordination something the system learns or something it pays for.

23 min read

9 / 12

SW · AG

Software Engineering Agents: The Proving Ground

Software engineering is where agents grew up — the only domain that handed them verifiable rewards, endless environments, and expert oversight all at once. The platform layer is settled now. What is still open — cost, coordination, deployment at scale — is the playbook every other domain inherits next.

26 min read

IV · The Deployment Frontier · Ship it: security, operations, and autonomous improvement.

10 / 12

SE · AG

Securing the Agentic Perimeter

An agent is an attack surface that acts. Goal hijack, tool misuse, and memory poisoning are not prompt-injection-with-extra-steps — they are a new perimeter where the payload runs with the agent's privileges. This is the one chapter where practitioner documents lead the research.

30 min read

11 / 12

AG · ME

Agent Ops: Running Agents in Production

Production agents need an ops discipline the way services needed SRE — and its founding move is architectural. Make the append-only event log the source of truth and derive the agent loop from it. Auditability, forking, replay, cost control, and context hygiene are not five features you bolt on. They are five consequences of one inversion.

23 min read

12 / 12

AG · CL

Self-Improving Agents

Every harness improvement this series described was made by a human. Here are the papers where that stops being true — and the guardrails for a loop that edits itself.

24 min read

13 / 12

CL · EV

Closing the Loop

Twelve essays argued that a coding agent's harness should learn from its own runs. This is the lab report: four cycles on a frozen 24-task suite, learning as gated external memory, a local 7B, $0 — everything published, including the rejection.

17 min read

Series 03 · The Long Bet

What survives in agentic AI

The long-horizon companion to this field guide: a selection lens for the literature and the three-to-ten-year bets it picks. Six illustrated essays.

“Two teams deploy the same frontier model. One ships an agent that resolves most of its tasks; the other stalls, loops, and quietly fails. Nothing about the model varied.”

This series follows a single thread: a deployed agent's capability is the product of model quality and harness quality — and in 2026 the harness has the steeper gradient. Each essay pairs a close read of the primary literature with a picture you can hold in your head, and ends in something you can build.

Harness Engineering · the sister series to Continual Intelligence