Derivation Web · source_8a1115bef50b474f

source · text/markdown

sha256 830fc4c4a7b374c0b779f7f26c1606e27eb146d4de66fd47360ea4dfdcb38e5d

by researka:v2 · 2026-06-10 14:45:11.672967+04:00

**Selected angle:** `source`

## One-sentence thesis

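Across 5 direct receipts that share MedQA as the evaluation shape and accuracy as the metric, the cited systems report results that can be set side by side against MedQA benchmark baselines. Reported values are 67.6% (Flan-PaLM, cited in two receipts), over 90% (GPT-4, under specific prompts), 72.6%, and 60.3% (agent systems on AgentClinic MedQA).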


**Interpretation note:** This is a hypothesis-generating alpha memo, not confirmatory evidence; subgroup or context-derived claims require independent replication.

## Why this is surprising

The signal is bounded to MedQA accuracy: the receipts are comparable because they share the same benchmark, task, and metric shape, even though the individual systems may differ.

## Evidence Landscape

**Bounded research question:** Do independent direct receipts on MedQA continue to support a signal on accuracy for the cited systems when comparators are kept explicit?

## Evidence receipts

- `fact_id=llm_evaluation/auto/2022/medqa_207573` (`A_core`) — Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US Medical License Ex… doi=10.48550/arxiv.2212.13138
- `fact_id=llm_evaluation/auto/2023/medqa_325097` (`A_core`) — Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical t… doi=10.1038/s41586-023-06291-2
- `fact_id=llm_evaluation/auto/2024/accuracy_326755` (`A_core`) — Under specific prompts, GPT-4 has achieved over 90% accuracy on the MedQA dataset, surpassing ordinary medical practitioners. doi=10.1145/3718391.3718410
- `fact_id=llm_evaluation/auto/2024/mmlu_207616` (`A_core`) — The model achieved 72.6% accuracy on MedQA, outperforming the previous SOTA by 2.4%, and 81.7% accuracy on MMLU medical-subset, establishing itself as the first OS LLM to surpass 80% accuracy on this benchmark. doi=10.1038/s41598-024-64827-6
- `fact_id=model_eval/auto/2026/accuracy_218254` (`A_core`) — …, web browsing, code development and execution, and text file editing) agent systems yielded only modest accuracy gains over baseline LLMs, reaching 60.3% and 28.0% in AgentClinic MedQA and MIMIC, 30.3% on MedAgentsBench, and 8.6% on HLE te… doi=10.1038/s41746-026-02443-6
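
The receipt lines above follow a regular shape (identifier, tier tag, excerpt, DOI), so they can be split into fields mechanically. The following is a minimal sketch, assuming that shape holds for every receipt; `RECEIPT_RE` and `parse_receipt` are hypothetical names introduced here for illustration and are not part of any researka tooling.

```python
import re

# Assumed line shape, inferred from the receipts listed in this memo:
#   - `fact_id=<id>` (`<tier>`) — <excerpt> doi=<doi>
RECEIPT_RE = re.compile(
    r"^- `fact_id=(?P<fact_id>[^`]+)` \(`(?P<tier>[^`]+)`\) — "
    r"(?P<excerpt>.*) doi=(?P<doi>\S+)$"
)


def parse_receipt(line: str) -> dict:
    """Split one receipt bullet into fact_id, tier, excerpt, and doi."""
    match = RECEIPT_RE.match(line.strip())
    if match is None:
        # Fail loudly rather than guess at a line that does not fit the shape.
        raise ValueError(f"unrecognized receipt line: {line[:40]!r}")
    return match.groupdict()


# The GPT-4 receipt from the list, copied verbatim.
example = (
    "- `fact_id=llm_evaluation/auto/2024/accuracy_326755` (`A_core`) — "
    "Under specific prompts, GPT-4 has achieved over 90% accuracy on the "
    "MedQA dataset, surpassing ordinary medical practitioners. "
    "doi=10.1145/3718391.3718410"
)
receipt = parse_receipt(example)
print(receipt["fact_id"], receipt["tier"], receipt["doi"])
```

The excerpt is kept as free text on purpose: several excerpts are truncated, so any attempt to pull a model name or score out of them would need a separate, more cautious step.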

## What this changes

Treat this as a benchmark-shaped evidence bundle, not a broad claim about the whole topic. The next extraction should preserve model, baseline, and protocol fields for each receipt.
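
One way to picture a receipt that preserves model, baseline, and protocol fields is a small record type. This is only a sketch under stated assumptions: the `Receipt` class, its field names, and `missing_fields` are hypothetical and do not describe the actual extraction schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Receipt:
    """Hypothetical receipt record; field names are illustrative assumptions."""

    fact_id: str
    doi: str
    benchmark: str
    metric: str
    value_pct: Optional[float]
    # The three fields the memo asks the next extraction to preserve.
    model: Optional[str] = None
    baseline: Optional[str] = None
    protocol: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return which of model, baseline, protocol are still unfilled."""
        names = ("model", "baseline", "protocol")
        return [name for name in names if getattr(self, name) is None]


# The 72.6% receipt: its excerpt names a comparator ("previous SOTA")
# but not the model or the prompting protocol, so those stay None.
r = Receipt(
    fact_id="llm_evaluation/auto/2024/mmlu_207616",
    doi="10.1038/s41598-024-64827-6",
    benchmark="MedQA",
    metric="accuracy",
    value_pct=72.6,
    baseline="previous SOTA",
)
print(r.missing_fields())
```

Leaving unknown fields as `None` rather than filling them with a guess keeps the gaps in each receipt visible to whoever reviews the bundle.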

## Limitations

- This is an alpha memo, not a settled review, guideline, or broad consensus claim.
- This memo synthesizes cited source receipts; it does not conduct a new meta-analysis or systematic review.
- Interpret the thesis only within the cited receipt bundle and the explicit weakening checks below.

## What would weaken this

- Independent receipts fail to reproduce the claimed contrast.
- The effect depends on one protocol, subgroup, comparator, or extraction artifact.
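
A rough way to probe the second check is to drop each receipt in turn and see how much a simple summary of the reported values moves. The sketch below assumes the five values listed in the thesis and uses an unweighted mean purely as a spread indicator; it is not a pooled estimate or a meta-analysis, and `leave_one_out` is a hypothetical helper.

```python
from statistics import mean

# Values as listed in the thesis, keyed by the tail of each fact_id.
reported = {
    "medqa_207573": 67.6,
    "medqa_325097": 67.6,
    "accuracy_326755": 90.0,  # the receipt itself says "over 90%"
    "mmlu_207616": 72.6,
    "accuracy_218254": 60.3,  # reported on AgentClinic MedQA
}


def leave_one_out(values: dict) -> dict:
    """Unweighted mean of the remaining values after dropping each key."""
    result = {}
    for dropped in values:
        rest = [v for k, v in values.items() if k != dropped]
        result[dropped] = mean(rest)
    return result


full_mean = mean(list(reported.values()))
loo = leave_one_out(reported)

# The receipt whose removal shifts the summary the most.
most_influential = max(loo, key=lambda k: abs(loo[k] - full_mean))

print(f"all receipts: {full_mean:.2f}")
for key, value in loo.items():
    print(f"without {key}: {value:.2f}")
print("largest shift when dropping:", most_influential)
```

If one receipt dominates the shift, that is a prompt to re-read its protocol and comparator before leaning on the bundle, not a finding in itself.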

## Strongest counter-evidence

- _No direct opposing receipt was selected by this run. Treat that as a bundle limitation, not a claim that the wider literature has no counter-evidence._

metadata

{
  "article_type": "alpha_memo",
  "domain_slug": "general",
  "researka_object_type": "submission",
  "researka_submission_id": "09628efd-49bb-4403-a3eb-fa62d68316eb",
  "title": "Model eval: Medqa Accuracy is the shared direct-receipt signal"
}
