DeepReinforce releases Ornith-1.0 for self-scaffolding coding agents

The MIT-licensed model family spans 9B to 397B parameters, but its benchmark lead rests on DeepReinforce's own harness-heavy evaluations.

Why it matters

Ornith-1.0 tests whether open coding models can absorb the agent scaffold itself, not just sit behind one, while staying permissive enough for commercial deployment.

DeepReinforce has released Ornith-1.0, an open-source family of coding models that tries to move the intelligence of a software agent from the surrounding harness into the model itself.

The release, published in late June, spans four variants: 9B Dense, 31B Dense, 35B MoE and 397B MoE. DeepReinforce says the models are post-trained from Gemma 4 and Qwen 3.5 base models and specialized for agentic coding tasks rather than chat-style code completion. The weights are listed in a Hugging Face collection under the deepreinforce-ai organization, including the 9B, 35B and 397B models, plus GGUF and FP8 builds for local and lower-precision deployment.

The point of Ornith-1.0 is not just another coding benchmark table. DeepReinforce is betting that the next step in coding agents is a model that learns how to build its own task scaffold. In the company's description, Ornith-1.0 is trained to generate both the solution rollout and the task-specific harness that guides that rollout. Reward from the execution outcome is propagated to both stages, so the model is optimized not only to answer the task, but to author the orchestration that helps it get there.
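A minimal sketch can make the two-stage credit assignment concrete. It is not DeepReinforce's code: the `Episode` fields, the group-normalized advantage and the function names are all assumptions chosen for illustration. The one property it takes from the company's description is that a single execution reward is credited to both the harness tokens and the rollout tokens.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Episode:
    # Hypothetical record of one attempt at a task.
    harness_tokens: list[str]  # stage 1: the task-specific scaffold the model wrote
    rollout_tokens: list[str]  # stage 2: the solution attempt run under that scaffold
    reward: float              # execution outcome, e.g. 1.0 if the tests pass


def group_advantages(rewards: list[float]) -> list[float]:
    # Assumption: rewards are normalized within a group of attempts at the same task.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]


def token_credit(episodes: list[Episode]) -> list[dict[str, list[float]]]:
    # The same per-episode advantage is assigned to every token in both stages,
    # so the scaffold is reinforced by the outcome it helped produce.
    advantages = group_advantages([ep.reward for ep in episodes])
    return [
        {
            "harness": [adv] * len(ep.harness_tokens),
            "rollout": [adv] * len(ep.rollout_tokens),
        }
        for ep, adv in zip(episodes, advantages)
    ]


episodes = [
    Episode(["plan", "run_tests"], ["edit", "commit", "done"], reward=1.0),
    Episode(["skip_tests"], ["edit", "done"], reward=0.0),
]
credit = token_credit(episodes)
```

In this toy group, the episode whose tests pass gets a positive advantage on its harness tokens as well as its rollout tokens, while the failing episode's scaffold is pushed down along with its solution.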

That is a direct shot at the current agent stack, where much of the performance comes from the wrapper: memory, retry logic, shell discipline, test selection, tool-use policy and error recovery. In most coding-agent systems, that scaffold is engineered outside the model and reused across task categories. DeepReinforce's claim is that Ornith-1.0 treats the scaffold as a learnable object that mutates over training, letting task-specific strategies emerge from reinforcement learning instead of hand design.

The benchmark claims are strong and should be read as company-reported until independently reproduced. DeepReinforce says Ornith-1.0-397B scores 77.5 on Terminal-Bench 2.1 using its Terminus-2 setup and 82.4 on SWE-Bench Verified. In the same table, DeepReinforce lists Claude Opus 4.7 at 70.3 on Terminal-Bench 2.1 and 80.8 on SWE-Bench Verified, MiniMax M3 at 64.0 on Terminal-Bench 2.1, and DeepSeek-V4-Pro at 64.0 to 66.5 on Terminal-Bench 2.1 depending on harness. The company also says the 397B model reaches 62.2 on SWE-Bench Pro and 78.9 on SWE-Bench Multilingual.

The smaller models are the more commercially interesting part of the release. DeepReinforce says Ornith-1.0-35B scores 64.2 on Terminal-Bench 2.1 with Terminus-2 and 75.6 on SWE-Bench Verified, putting it above the company's listed results for Qwen3.5-35B, Qwen3.6-35B and Gemma4-31B on several agentic coding benchmarks. The 9B model, positioned for edge or local deployment, is reported at 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified. If those numbers hold outside DeepReinforce's setup, the 9B and 35B releases matter more than the flagship: they are the sizes operators can plausibly run, quantize, fine-tune and embed into private developer workflows without waiting on a frontier API.

DeepReinforce is also trying to make the licensing easy. The 397B model card lists an MIT license, and the Hugging Face collection describes Ornith-1.0 as an open-source LLM family for agentic coding. That puts the release in the more permissive lane of the open-weights market, where adoption is often gated less by model quality than by whether a legal team can approve commercial use.

The hard part is trust. A model that learns its own scaffold has an obvious failure mode: it can learn to satisfy the verifier rather than solve the task. DeepReinforce addresses that directly in the Ornith-1.0 writeup, describing three defenses: an immutable outer trust boundary around the environment, tool surface and test isolation; a deterministic monitor that zeroes out trajectories that touch forbidden paths or verification scripts; and a frozen LLM judge used as a veto on top of the verifier for intent-level gaming that can happen inside the allowed tool surface.
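The second and third defenses amount to a gate on the reward, which can be sketched in a few lines. This is an illustration under stated assumptions, not the company's implementation: the forbidden roots, the function names and the choice to treat a judge veto as zero reward are all hypothetical. The writeup itself only says the monitor zeroes out offending trajectories and the judge acts as a veto.

```python
import posixpath
from typing import Callable, Iterable

# Hypothetical protected locations; the real list is not published in the article.
FORBIDDEN_ROOTS = ("/tests", "/verifier")


def touches_forbidden(paths: Iterable[str], roots=FORBIDDEN_ROOTS) -> bool:
    # Deterministic check: normalize each path so "../" tricks do not slip through.
    # Note: normpath is purely lexical and does not resolve symlinks.
    for raw in paths:
        path = posixpath.normpath(raw)
        if any(path == root or path.startswith(root + "/") for root in roots):
            return True
    return False


def gated_reward(
    verifier_reward: float,
    touched_paths: Iterable[str],
    judge_vetoes: Callable[[], bool],
) -> float:
    # Monitor first: any contact with a forbidden path zeroes the trajectory.
    if touches_forbidden(touched_paths):
        return 0.0
    # Judge second: a frozen model can veto gaming inside the allowed tool surface.
    # Assumption: a veto is treated as zero reward.
    if judge_vetoes():
        return 0.0
    return verifier_reward


clean = gated_reward(1.0, ["/work/src/app.py"], lambda: False)
tampered = gated_reward(1.0, ["/work/../tests/run.sh"], lambda: False)
vetoed = gated_reward(1.0, ["/work/src/app.py"], lambda: True)
```

The ordering is a design choice in this sketch: the cheap deterministic check runs before the judge, so the more expensive model call is only made for trajectories that stayed inside the allowed surface.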

Those safeguards are necessary, not cosmetic. Agentic coding benchmarks are unusually exposed to reward hacking because the model is operating in a software environment with files, tests, scripts and hidden state. A system that can improve its own scaffold is also a system that can learn where the evaluator is brittle. DeepReinforce's anti-hacking section is therefore one of the most important parts of the release: it acknowledges that self-improvement can produce better agents and better cheaters using the same optimization loop.

The training section points to another practical constraint: long rollouts are expensive, and in an asynchronous pipeline the tokens they contain can go stale, having been generated by an older version of the policy than the one being updated. DeepReinforce says Ornith-1.0 uses an asynchronous pipeline-RL strategy with a staleness weight that downweights older generated tokens and drops them after a threshold. The token-level GRPO loss is weighted by that staleness term, an implementation detail that matters because agentic coding rollouts can take far longer than ordinary chat completions.
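The staleness idea can be sketched without a deep-learning framework. Everything specific here is an assumption: the article does not give the form of the weight, so the geometric decay, the `max_lag` cutoff, the clipping range and the normalization over kept tokens are illustrative choices, and the clipped surrogate is the generic PPO-style term rather than a confirmed detail of DeepReinforce's loss.

```python
import math


def staleness_weight(lag: int, decay: float = 0.5, max_lag: int = 4) -> float:
    # lag = number of policy updates since the token was generated (hypothetical unit).
    # Older tokens count for less; past the threshold they are dropped entirely.
    if lag > max_lag:
        return 0.0
    return decay ** lag


def weighted_token_loss(tokens, clip_eps: float = 0.2) -> float:
    """tokens: iterable of (logp_new, logp_old, advantage, lag) tuples."""
    total, kept = 0.0, 0
    for logp_new, logp_old, advantage, lag in tokens:
        weight = staleness_weight(lag)
        if weight == 0.0:
            continue  # token is too stale to train on
        ratio = math.exp(logp_new - logp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate = min(ratio * advantage, clipped * advantage)
        total += weight * surrogate
        kept += 1
    # Assumption: average over the tokens that survived the staleness cutoff.
    return -total / kept if kept else 0.0


fresh = weighted_token_loss([(0.0, 0.0, 1.0, 0)])
one_step_old = weighted_token_loss([(0.0, 0.0, 1.0, 1)])
with_dropped = weighted_token_loss([(0.0, 0.0, 1.0, 0), (0.0, 0.0, 1.0, 9)])
```

With these toy inputs, a token that is one update old contributes half as much as a fresh one, and a token nine updates old contributes nothing, which is the behavior the article describes in words.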

DeepReinforce's public footprint remains thin. Its Hugging Face organization card describes the team as focused on work that could lead toward superintelligence and lists prior papers and datasets including GrandCode, CUDA-L2 and CUDA-L1. It does not name founders or disclose funding in the materials reviewed for this story. That leaves the release to stand mostly on reproducibility: the weights, the model cards, the benchmark harness details and whether outside users can recreate the gains.

For now, Ornith-1.0 is a clear marker in the coding-agent race. DeepReinforce is not just shipping a model that writes code; it is shipping a model family built around the claim that the agent scaffold itself should be learned. If independent evaluations confirm the reported results, that shifts pressure from hand-built agent frameworks toward training methods that internalize more of the agent loop inside open models.
