QuarqLabs says its open-source agent scored 98.2% on LongMemEval-S

Quarq Agent uses local FAISS storage, layered memory and three separate LLM roles to tackle long-context recall.

By Ryan Merket · Published Jun 1, 2026, 11:41pm CT

Why it matters

Long-term memory is becoming a practical battleground for agents. QuarqLabs' claim is notable because it emphasizes local storage and retrieval architecture, not just bigger models or longer context windows.

An abstract, layered representation of an AI agent's memory architecture and recall process (Mixed-media paper collage with torn newsprint, photographic cutouts, tape and staples, and slight scanner shadow)

QuarqLabs says Quarq Agent, its open-source memory-first AI agent, scored 98.2% on LongMemEval-S, a long-memory benchmark built around 500 questions, about 57 million tokens of conversation data and roughly 50 sessions per question.

The result, published in an article on X, is self-reported. QuarqLabs says the run used a local FAISS vector store, a layered memory architecture and three specialized LLM roles: a retrieval planner using gpt-4o-mini, a generator using gpt-4.1 and a learning model using gpt-4.1. The code is listed at Quarq Agent's GitHub repo.

QuarqLabs frames the project as a step toward continual learning, not just longer context windows. Quarq Agent separates semantic memory, procedural memory and episodic memory, then combines vector search, keyword search, metadata filtering and temporal validation. Its evaluation pipeline wipes memory, feeds benchmark sessions in eight-message chunks, synchronizes learning before the final question and uses GPT-5 with reasoning_effort="medium" as a binary judge.
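The evaluation loop described above can be sketched in a few lines of Python. This is a minimal illustration, not QuarqLabs' code: `StubAgent`, `run_question` and `substring_judge` are hypothetical names, the stub agent simply replays everything it has stored, and a substring check stands in for the GPT-5 judge. Only the sequence of steps (wipe, ingest in eight-message chunks, sync, answer, binary verdict) comes from the company's description.

```python
from typing import Callable, Iterator

CHUNK_SIZE = 8  # messages per chunk, as stated in QuarqLabs' description


def chunk_messages(messages: list[dict], size: int = CHUNK_SIZE) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` messages."""
    for start in range(0, len(messages), size):
        yield messages[start:start + size]


class StubAgent:
    """Hypothetical stand-in for an agent; it has no real retrieval or learning."""

    def __init__(self) -> None:
        self.memory: list[dict] = []
        self.ingested_chunks = 0
        self.synced = False

    def wipe(self) -> None:
        self.memory.clear()
        self.ingested_chunks = 0
        self.synced = False

    def ingest(self, chunk: list[dict]) -> None:
        self.memory.extend(chunk)
        self.ingested_chunks += 1

    def sync_learning(self) -> None:
        # A real agent would wait here for background learning to finish.
        self.synced = True

    def answer(self, question: str) -> str:
        # Assumption: the stub just returns everything it remembers.
        return " ".join(m["content"] for m in self.memory)


def substring_judge(question: str, expected: str, answer: str) -> bool:
    """Toy binary judge; the article says the real run used an LLM judge."""
    return expected.lower() in answer.lower()


def run_question(
    agent: StubAgent,
    sessions: list[list[dict]],
    question: str,
    expected: str,
    judge: Callable[[str, str, str], bool],
) -> bool:
    """Wipe memory, feed every session in chunks, sync, then judge one answer."""
    agent.wipe()
    for session in sessions:
        for chunk in chunk_messages(session):
            agent.ingest(chunk)
    agent.sync_learning()
    return judge(question, expected, agent.answer(question))
```

A benchmark score under this scheme would be the share of questions for which `run_question` returns `True`.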

The most concrete technical bet is QuarqLabs' Temporal Truth Protocol, which tries to separate storage date, event date, simulated narrative date and relative dates. That matters because long-memory agents often fail by treating when something was stored as when it happened. QuarqLabs also says Quarq Agent can run a second retrieval pass when the first pass lacks enough evidence, rather than forcing the generator to answer from weak context.
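To make the two ideas concrete, here is a small Python sketch under stated assumptions. It is not the Temporal Truth Protocol itself, whose internals the article does not spell out. The names `MemoryRecord`, `resolve_event_date` and `retrieve_with_second_pass`, the tiny table of relative phrases, and the "fewer than `min_hits` results" test for weak evidence are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Optional

# Hypothetical lookup for a few relative phrases; real text needs far more coverage.
RELATIVE_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1, "last week": -7}


@dataclass
class MemoryRecord:
    text: str
    stored_on: date                   # when the agent wrote the record
    narrative_on: date                # "today" inside the simulated conversation
    event_on: Optional[date] = None   # when the described event happened, if known


def resolve_event_date(phrase: str, narrative_on: date) -> Optional[date]:
    """Anchor a relative phrase to the narrative date, never to the storage date."""
    offset = RELATIVE_OFFSETS.get(phrase.strip().lower())
    if offset is None:
        return None
    return narrative_on + timedelta(days=offset)


def make_record(text: str, relative_phrase: str, narrative_on: date, stored_on: date) -> MemoryRecord:
    """Build a record that keeps the three dates in separate fields."""
    return MemoryRecord(
        text=text,
        stored_on=stored_on,
        narrative_on=narrative_on,
        event_on=resolve_event_date(relative_phrase, narrative_on),
    )


def retrieve_with_second_pass(
    query: str,
    first_pass: Callable[[str], list[str]],
    second_pass: Callable[[str], list[str]],
    min_hits: int = 2,
) -> list[str]:
    """Run a second search only when the first returns fewer than `min_hits` results."""
    hits = first_pass(query)
    if len(hits) >= min_hits:
        return hits
    merged = list(hits)
    for hit in second_pass(query):
        if hit not in merged:  # keep first-pass order, skip duplicates
            merged.append(hit)
    return merged
```

For example, a message saying "yesterday" in a conversation dated May 10, 2023, but ingested on June 1, 2026, would get an event date of May 9, 2023, while the storage date stays June 1, 2026.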

