General-purpose LLMs beat specialized AI tools in Nature Medicine study

The paper tested OpenEvidence and UpToDate Expert AI against GPT-5.2, Gemini 3.1 Pro and Claude Opus 4.6 across three medical evaluations.

By RW Staff · Published Jun 12, 2026, 3:43pm CT

Why it matters

Clinical AI vendors sell specialization as the reason health systems should trust and pay for their tools. This study undercuts that claim: on these benchmarks, the frontier models beat the specialist wrappers, which makes independent evaluation harder for buyers to avoid.

A visual comparison of general-purpose AI and specialized clinical AI within a medical research setting.

General-purpose frontier LLMs outperformed two specialized clinical AI tools across medical benchmarks in a Nature Medicine brief communication published June 12.

Krithik Vishwanath and co-authors compared OpenEvidence and UpToDate Expert AI with GPT-5.2, Gemini 3.1 Pro and Claude Opus 4.6. The clinical products are built on LLMs and marketed for medical use, but the researchers wrote that their architectures, base models and training pipelines are not public, leaving clinicians and health systems to judge them with limited independent evidence.

The evaluation used 500 MedQA questions, 500 HealthBench items and a real-clinical-queries benchmark built from 100 de-identified queries that physicians had submitted to a general-purpose language model in a live clinical environment. For the real-query test, 12 US clinicians reviewed outputs in a randomized, blinded fashion, generating 1,800 model-question annotations.

On MedQA, Gemini scored 97.4% accuracy, GPT scored 94.2% and Claude scored 90.2%, compared with 89.6% for OpenEvidence and 88.4% for UpToDate Expert AI, according to the paper. The authors also reported that frontier LLMs led across all three evaluations, while the clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the real-query benchmark.
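For readers who want a feel for the size of those MedQA gaps, the reported percentages can be converted into approximate counts of correct answers. The sketch below is illustrative arithmetic only, not the authors' analysis: it assumes each percentage was computed over the full 500-question sample, and the names `implied_correct` and `question_gap` are hypothetical helpers introduced here.

```python
# Reported MedQA accuracy (percent), as quoted in the article.
REPORTED_ACCURACY = {
    "Gemini 3.1 Pro": 97.4,
    "GPT-5.2": 94.2,
    "Claude Opus 4.6": 90.2,
    "OpenEvidence": 89.6,
    "UpToDate Expert AI": 88.4,
}

# Assumption: every percentage is computed over all 500 MedQA questions.
N_QUESTIONS = 500


def implied_correct(accuracy_pct: float, n: int = N_QUESTIONS) -> int:
    """Convert a reported accuracy percentage into an implied count of correct answers."""
    return round(accuracy_pct * n / 100)


def question_gap(system_a: str, system_b: str) -> int:
    """Implied number of additional questions system_a answered correctly versus system_b."""
    return implied_correct(REPORTED_ACCURACY[system_a]) - implied_correct(
        REPORTED_ACCURACY[system_b]
    )


if __name__ == "__main__":
    for name, pct in REPORTED_ACCURACY.items():
        print(f"{name}: about {implied_correct(pct)} of {N_QUESTIONS} correct")
    print("Gemini 3.1 Pro vs OpenEvidence:", question_gap("Gemini 3.1 Pro", "OpenEvidence"))
    print("Claude Opus 4.6 vs OpenEvidence:", question_gap("Claude Opus 4.6", "OpenEvidence"))
```

Under that assumption, the spread between the top frontier model and the best clinical tool works out to a few dozen questions, while the gap between Claude and OpenEvidence comes down to a handful. The counts say nothing about statistical significance, which the article does not report.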

The result does not prove that general-purpose chatbots are safe replacements for clinical workflow tools. It does put pressure on a core vendor claim: that medical packaging, retrieval and domain tuning necessarily outperform the frontier models underneath them.
