Shafiq Joty's team proposes Procedural Memory Distillation for less forgetful LLM training

PMD is presented as a training-time method that stores lessons from prior model attempts, then uses them to teach the next policy.


Why it matters

If PMD's claims hold up, it points to a practical path for AI teams to reuse failure traces during training, not just add retrieval at inference time.

Illustration: The process of AI memory distillation and progressive learning

Shafiq Joty (@JotyShafiq) introduced Procedural Memory Distillation, or PMD, in a thread on X, framing it as a way to make feedback-based LLM training carry lessons forward instead of treating each attempt as disposable.


Joty's diagnosis is blunt: most AI training is "amnesiac." In his framing, the loop is "Rollout -> reward -> update -> forget," even though each rollout may contain reusable signals about strategies that worked, mistakes that repeated, and reasoning patterns worth preserving.

The thread credits Ye Liu (@YeLiu918), Srijan Bansal (@SrijanBansal1), Bo Pang, Yang Li, Zeyu Leo Liu, Yifei Ming, Zixuan Ke, Joty, and Semih Yavuz (@semih__yavuz), along with the @SFResearch account. It presents PMD as research, not as a commercial product launch; no pricing, customer, funding, or deployment details are included in the supplied source material.

What PMD is trying to change

According to Joty, PMD organizes training experience into three levels. The first is raw experience: trajectories, successes, failures, and verifier feedback. The second is insight: strategies drawn from wins, lessons from losses, and contrasts between correct and incorrect reasoning. The third is behavior: reusable patterns distilled from those experiences and insights.

The proposed loop is straightforward in outline. A model attempts problems. A verifier scores the results. PMD stores the experience. The model reflects on what succeeded and failed. A "memory-conditioned self-teacher" then trains the next policy, and the process repeats.
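The loop described above can be sketched as a toy program. This is only an illustration of the control flow under strong assumptions, not the team's implementation: the policy is a lookup table of integer guesses, the verifier is an exact-match check, the reflection step is reduced to sorting outcomes into failures and successes, and every name here (`Memory`, `verifier`, `self_teacher`, `run_loop`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical memory store; field names are illustrative, not from the thread."""
    failed: dict = field(default_factory=dict)  # problem -> set of answers that failed
    solved: dict = field(default_factory=dict)  # problem -> answer that passed

    def store(self, problem, answer, passed):
        # Toy stand-in for "store + reflect": keep wins and losses separately.
        if passed:
            self.solved[problem] = answer
        else:
            self.failed.setdefault(problem, set()).add(answer)


def verifier(problem, answer, targets):
    """Exact-match verifier (an assumption; real verifiers are task-specific)."""
    return answer == targets[problem]


def self_teacher(problem, memory):
    """Memory-conditioned teacher: reuse a known success, else avoid known failures."""
    if problem in memory.solved:
        return memory.solved[problem]
    tried = memory.failed.get(problem, set())
    candidate = 0
    while candidate in tried:
        candidate += 1
    return candidate


def run_loop(targets, rounds):
    policy = {problem: 0 for problem in targets}  # toy "policy": one guess per problem
    memory = Memory()
    accuracy = []
    for _ in range(rounds):
        passed_count = 0
        for problem in targets:
            answer = policy[problem]                     # rollout
            passed = verifier(problem, answer, targets)  # verifier score
            memory.store(problem, answer, passed)        # memory update
            passed_count += passed
        for problem in targets:
            # "Training": the next policy adopts the teacher's memory-informed answer.
            policy[problem] = self_teacher(problem, memory)
        accuracy.append(passed_count / len(targets))
    return policy, accuracy


policy, accuracy = run_loop({"1+1": 2, "2+3": 5}, rounds=6)
print(accuracy)  # [0.0, 0.0, 0.5, 0.5, 0.5, 1.0]
```

In this toy, each round's rollouts change the memory and the memory changes the next round's policy, which is the dependency the thread describes. Everything else, including how reflection and distillation into weights actually work, is left out.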

The important claim is that the model policy and the memory co-evolve. Rollouts update memory, and memory shapes the learner that produces later rollouts. That makes PMD different from simply adding a notebook to the model at inference time. Joty says PMD "doesn't give models a permanent external notebook"; instead, it lets models use memory during training, absorb useful lessons into weights, and discard the rest.

The benchmark claims

The numbers in the thread are self-reported by the research team and should be read as claims until the underlying paper and experimental details are available. Against SDPO, which Joty calls the strongest self-distillation baseline, the thread says PMD improves Qwen3-8B on SciKnowEval from 74.4 to 77.2, a 2.8-point absolute gain, and on LiveCodeBench from 47.9 to 51.7, a 3.8-point gain.

For OLMo3-7B, the thread reports a science benchmark move from 69.5 to 73.3, a 3.8-point absolute gain, and a code benchmark move from 45.0 to 51.1, a 6.1-point gain. Joty's thread lists those as +5.5% and +13.6%, respectively; those percentages match the relative improvement over the baseline score, not the absolute point difference.
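The point gains and percentage gains can be reconciled with a short calculation. The sketch below uses only the scores quoted from the thread; the helper name `gains` is hypothetical, and the relative figures for Qwen3-8B are computed here for comparison rather than reported in the thread.

```python
def gains(before, after):
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = round(after - before, 1)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


# OLMo3-7B scores as quoted from the thread
print(gains(69.5, 73.3))  # (3.8, 5.5)  science benchmark
print(gains(45.0, 51.1))  # (6.1, 13.6) code benchmark

# Qwen3-8B scores as quoted; relative gains are derived here, not from the thread
print(gains(74.4, 77.2))  # (2.8, 3.8)  SciKnowEval
print(gains(47.9, 51.7))  # (3.8, 7.9)  LiveCodeBench
```

The relative figure divides the point gain by the baseline score, so a lower baseline makes the same point gain look larger in percentage terms.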

The team also claims learned memory can transfer across model sizes: memory distilled from Qwen3-8B improves models from 1.7B to 32B, according to the thread. The thread also says that retrieving more memories leads to better performance, and that PMD's co-evolved memory transfers better than memory built with a frozen policy.

The caveat inside the claim

The strongest part of the post is not the headline benchmark movement but the ablation story. Joty says freezing the policy while memory evolves improves memory but prevents the model from internalizing it, while freezing memory as the policy evolves makes the memory stale and causes gains to plateau. Letting both evolve is presented as the part that drives consistent improvement.

The thread does not provide enough detail to judge whether the comparison controls for compute, verifier design, training steps, dataset splits, number of runs, or statistical significance. It also does not define SDPO or include the project and paper URL in the supplied source text, even though the final post refers to a project and paper.

Still, the direction of the work is notable. Many current LLM training and agent systems generate enormous amounts of failed and partially successful work, then use only a thin reward signal from it. PMD is a bet that the traces themselves can become structured training material. As Joty put it, "experience itself is a learning signal."
