Patrick Jiang's Harness-1 externalizes memory for a 20B search agent

The paper reports 0.730 average curated recall across eight retrieval benchmarks, with code and model weights now public.

Why it matters

Harness-1 points to a live question in agent design: whether better long-horizon behavior comes from larger models alone, or from training smaller models to use explicit memory, verification and search interfaces.

Patrick Jiang (@patpcj) introduced Harness-1, a 20B-parameter search agent that moves search state outside the model and into a structured harness, according to a 13-post thread on X and the accompanying paper. The paper lists authors from the University of Illinois at Urbana-Champaign, UC Berkeley and Chroma, and describes Harness-1 as a retrieval subagent trained with reinforcement learning inside a stateful search harness.

Launch video (@patpcj)

The core claim is that search agents fail partly because they are asked to act as a memory system, note taker, verifier and librarian all at once, inside one expanding transcript. Harness-1 instead keeps candidate documents, curated evidence, importance tags, search history, evidence links, verification records, deduplication, compression and context-budget markers in an external working memory. The model still makes the semantic decisions: what to search, which documents to read, what to keep or discard, what to verify and when to stop.
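To make that division of labor concrete, here is a minimal sketch of an external working memory. It is an illustration only, not the paper's implementation: the `WorkingMemory` class, its field names and the character-based budget are all assumptions, and it covers only a few of the mechanisms listed above (candidates, curated evidence with importance tags, search history, deduplication and a context budget).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WorkingMemory:
    """Hypothetical external search state; names are illustrative, not from the paper."""
    budget_chars: int = 2000                                   # assumed context budget
    candidates: Dict[str, str] = field(default_factory=dict)   # doc_id -> text
    curated: Dict[str, str] = field(default_factory=dict)      # doc_id -> importance tag
    history: List[str] = field(default_factory=list)           # queries issued so far

    def record_search(self, query: str, results: Dict[str, str]) -> List[str]:
        """Log the query and store only documents not seen before (deduplication)."""
        self.history.append(query)
        new_ids = [doc_id for doc_id in results if doc_id not in self.candidates]
        for doc_id in new_ids:
            self.candidates[doc_id] = results[doc_id]
        return new_ids

    def keep(self, doc_id: str, importance: str) -> None:
        """The model decides what to keep; the harness only records the decision."""
        if doc_id not in self.candidates:
            raise KeyError(f"unknown document: {doc_id}")
        self.curated[doc_id] = importance

    def discard(self, doc_id: str) -> None:
        """Drop a document from the curated set if it is there."""
        self.curated.pop(doc_id, None)

    def render(self) -> str:
        """Compact view shown to the model, truncated to the context budget."""
        lines = [f"searches: {len(self.history)}"]
        lines += [f"[{tag}] {doc_id}" for doc_id, tag in sorted(self.curated.items())]
        return "\n".join(lines)[: self.budget_chars]
```

In this sketch the bookkeeping lives in ordinary code, while every call to `keep` or `discard` would come from a model decision. The model sees only the output of `render`, not the full transcript of past tool calls.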

The authors say Harness-1 is built on gpt-oss-20b and trained first to operate the harness, then with reinforcement learning over full search episodes. The SFT stage used 899 filtered trajectories, and the RL stage used 3,453 SEC training queries. The paper also says the same working-memory renderer is used for teacher rollouts, supervised replay, RL rollout and evaluation, reducing the gap between training and deployment.
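The shared-renderer point can be pictured with a small sketch. Everything here is hypothetical: the `render_state` function, the `STAGES` names and the dictionary-shaped state are stand-ins chosen for illustration, not the released code. The idea is simply that one deterministic function produces the model's view of memory in every stage.

```python
from typing import Callable, Dict, List

State = Dict[str, List[str]]  # assumed shape: section name -> entries

# Stage names mirror the four uses described in the paper summary above.
STAGES = ("teacher_rollout", "supervised_replay", "rl_rollout", "evaluation")


def render_state(state: State) -> str:
    """Single deterministic text view of the working memory."""
    return "\n".join(f"{key}: {', '.join(state[key])}" for key in sorted(state))


def build_prompts(state: State, renderer: Callable[[State], str] = render_state) -> Dict[str, str]:
    """Render the same state for every stage with the same function."""
    return {stage: renderer(state) for stage in STAGES}
```

Because each stage calls the same renderer, a given memory state yields an identical prompt during training and evaluation, which is the train/deploy consistency the authors describe.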

On the paper's eight retrieval benchmarks across web, finance, patents and multi-hop QA, Harness-1 reports 0.730 average curated recall. The authors say that beats the next-strongest open search subagent by 11.4 points and remains competitive with much larger frontier-model searchers, though Opus-4.6 is still ahead on average under their protocol. They also frame transfer as the main signal: Harness-1 improved over Context-1 by 7.9 recall points on source-family benchmarks and 17.0 points on held-out transfer benchmarks.
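For readers unfamiliar with the metric, the sketch below shows one plausible reading of it. It assumes curated recall is the fraction of gold (relevant) documents that end up in the agent's final curated set, and that the headline figure is an unweighted mean over benchmarks. The paper's exact definition and averaging may differ, and the function names are made up for this example.

```python
from typing import Dict, Iterable


def curated_recall(curated_ids: Iterable[str], gold_ids: Iterable[str]) -> float:
    """Share of gold documents present in the final curated set (assumed definition)."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(gold & set(curated_ids)) / len(gold)


def macro_average(per_benchmark: Dict[str, float]) -> float:
    """Unweighted mean across benchmarks (assumed averaging scheme)."""
    if not per_benchmark:
        raise ValueError("no benchmark scores given")
    return sum(per_benchmark.values()) / len(per_benchmark)
```

Under this reading, an agent that curates two of four gold documents for a query scores 0.5 on it, regardless of how many irrelevant documents it also kept.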

The paper's ablations support the harness thesis. Disabling all harness mechanisms at inference time on a BrowseComp+ subset cut recall by 12.2% relative to the full system, while removing a single mechanism typically hurt final-answer recall. The authors argue that the harness is not just implementation plumbing but the decision substrate the policy learns to use.
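The article quotes some gaps in recall points and the ablation as a relative percentage, and the two are easy to conflate. The short sketch below shows the difference using made-up recall values (0.50 and 0.44 are toy numbers, not results from the paper), with hypothetical helper names.

```python
def point_gap(a: float, b: float) -> float:
    """Absolute difference between two recall scores, in percentage points."""
    return (a - b) * 100


def relative_drop(full: float, ablated: float) -> float:
    """Drop from the full system's recall, as a fraction of that recall."""
    if full <= 0:
        raise ValueError("full-system recall must be positive")
    return (full - ablated) / full


# Toy values only: a fall from 0.50 to 0.44 is about 6 points but a 12% relative drop.
toy_points = round(point_gap(0.50, 0.44), 1)
toy_relative = round(relative_drop(0.50, 0.44), 2)
```

The same change therefore looks larger when stated relatively whenever the baseline is below 1.0, which is worth keeping in mind when comparing the figures above.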

Jiang released the paper, code and model weights, along with a Hugging Face paper page. He also credited Chroma (@trychroma) for supporting the work and Tinker (@tinkerapi) for training infrastructure. The benchmark results are self-reported in the paper and launch materials, not independently validated here.
