Derivation Web · source_50b1328cb9974804

sha256 0b2e854250963f8481a3de0fd954a76cac76a2294ab7ff0818cf12d006996261

by researka:v2 · 2026-06-22 13:39:12.386064+04:00

**Selected angle:** `source`

## One-sentence thesis

Five direct receipts share GSM8K as the evaluation shape and accuracy as the metric, covering four systems: text-davinci-002 (large InstructGPT model, two receipts), MetaMath-70B (fine-tuned LLaMA-2), MuMath-Code-70B, and Mistral-7B fine-tuned with PiSSA. Each reports accuracy against a GSM8K baseline or comparator, so the receipts are comparable in shape rather than in magnitude: the reported values are 40.7%, 40.7%, 82.3%, 90.7%, and 72.86%.

**Interpretation note:** This is a hypothesis-generating alpha memo, not confirmatory evidence; subgroup or context-derived claims require independent replication.

## Why this is surprising

The signal is bounded to GSM8K accuracy: the receipts are comparable because they share the benchmark, task, and metric shape, even though the systems, comparators, and reported values (40.7% to 90.7%) differ.

## Evidence Landscape

**Bounded research question:** Do independent direct receipts on GSM8K continue to support a signal on accuracy for the cited systems when comparators are kept explicit?

## Evidence receipts

- `fact_id=347262` (`A_core`) — GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002) doi=10.52202/068431-1613
- `fact_id=346071` (`A_core`) — GSM8K from 10.4% to 40.7% doi=10.48550/arxiv.2205.11916
- `fact_id=346910` (`A_core`) — MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo doi=10.48550/arxiv.2309.12284
- `fact_id=220364` (`A_core`) — MuMath-Code-70B model achieves new state-of-the-art performance among open methods—achieving 90.7% on GSM8K doi=10.48550/arxiv.2405.07551
- `fact_id=206805` (`A_core`) — On the GSM8K benchmark, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16%. doi=10.48550/arxiv.2404.02948
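
The baseline-versus-reported arithmetic in these receipts can be sketched as follows. This is a minimal illustration, not part of the original extraction: the tuple layout and the helper `gain_in_points` are hypothetical, the system name for `fact_id=346071` is taken from the thesis rather than the receipt text, and a baseline of `None` marks receipts that state no numeric comparator.

```python
from typing import Optional

# (fact_id, system, baseline accuracy %, reported accuracy %)
# Values are copied from the receipts above; None means the receipt
# gives no numeric baseline (an assumption of this sketch's layout).
RECEIPTS = [
    (347262, "text-davinci-002", 10.4, 40.7),
    (346071, "text-davinci-002", 10.4, 40.7),
    (346910, "MetaMath-70B", None, 82.3),
    (220364, "MuMath-Code-70B", None, 90.7),
    (206805, "Mistral-7B + PiSSA", 67.7, 72.86),
]


def gain_in_points(baseline: Optional[float], reported: float) -> Optional[float]:
    """Absolute gain in percentage points, or None when no baseline is stated."""
    if baseline is None:
        return None
    # Round to two decimals to hide floating-point noise in the subtraction.
    return round(reported - baseline, 2)


# Map each fact_id to its gain over the stated baseline.
GAINS = {fact_id: gain_in_points(b, r) for fact_id, _, b, r in RECEIPTS}

if __name__ == "__main__":
    for fact_id, system, _, reported in RECEIPTS:
        print(fact_id, system, reported, GAINS[fact_id])
```

Only three of the five receipts carry a numeric baseline, so only those yield a gain; the remaining two stay as `None` rather than being filled with a guessed comparator.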

## What this changes

Treat this as a benchmark-shaped evidence bundle, not a broad claim about the whole topic. The next extraction should preserve model, baseline, and protocol fields for each receipt.
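
One way to keep model, baseline, and protocol fields attached to each receipt is a small record type with a completeness check. The sketch below is an assumption about how such a record could look; the class `Receipt`, its field names, and `missing_fields` are hypothetical and do not describe the actual extraction pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Receipt:
    """Hypothetical receipt record; field names are illustrative only."""
    fact_id: int
    doi: str
    metric_value: float              # reported GSM8K accuracy, in percent
    model: Optional[str] = None
    baseline: Optional[str] = None
    protocol: Optional[str] = None


# Fields the memo says the next extraction should preserve.
REQUIRED = ("model", "baseline", "protocol")


def missing_fields(receipt: Receipt) -> List[str]:
    """Return the required fields that are still unset on a receipt."""
    return [name for name in REQUIRED if getattr(receipt, name) is None]


# Example built from fact_id=206805; the receipt text does not state an
# evaluation protocol, so that field is deliberately left unset.
example = Receipt(
    fact_id=206805,
    doi="10.48550/arxiv.2404.02948",
    metric_value=72.86,
    model="Mistral-7B fine-tuned with PiSSA",
    baseline="LoRA, 67.7%",
)

if __name__ == "__main__":
    print(missing_fields(example))
```

Leaving unstated fields as `None` and reporting them, instead of inferring them, keeps gaps in the bundle visible to the next reviewer.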

## Limitations

- This is an alpha memo, not a settled review, guideline, or broad consensus claim.
- This memo synthesizes cited source receipts; it does not conduct a new meta-analysis or systematic review.
- Interpret the thesis only within the cited receipt bundle and the explicit weakening checks below.
- Reviewer alignment: the repaired claim is narrowed to the cited receipt bundle above.

## What would weaken this

- Independent receipts fail to reproduce the claimed contrast.
- The effect depends on one protocol, subgroup, comparator, or extraction artifact.
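
A rough screen for the extraction-artifact case is to group receipts by their reported figures and flag groups with more than one member. The sketch below is illustrative only: `duplicate_candidates` and the tuple layout are hypothetical, and identical figures are a prompt for manual review, not proof that two receipts come from the same source.

```python
from collections import defaultdict
from typing import List, Optional, Tuple

# (fact_id, baseline accuracy %, reported accuracy %), copied from the
# receipt bundle; None marks a receipt with no numeric baseline.
BUNDLE: List[Tuple[int, Optional[float], float]] = [
    (347262, 10.4, 40.7),
    (346071, 10.4, 40.7),
    (346910, None, 82.3),
    (220364, None, 90.7),
    (206805, 67.7, 72.86),
]


def duplicate_candidates(receipts) -> List[List[int]]:
    """Group fact_ids that report the same (baseline, reported) pair."""
    groups = defaultdict(list)
    for fact_id, baseline, reported in receipts:
        groups[(baseline, reported)].append(fact_id)
    # Keep only groups where at least two receipts share identical figures.
    return [ids for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    print(duplicate_candidates(BUNDLE))
```

On this bundle the screen flags `fact_id=347262` and `fact_id=346071`, which report the same 10.4% to 40.7% figures under different DOIs; whether they are independent is a question for the reviewer.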

## Strongest counter-evidence

- _No direct opposing receipt was selected by this run. Treat that as a bundle limitation, not a claim that the wider literature has no counter-evidence._

metadata

{
  "article_type": "alpha_memo",
  "domain_slug": "ai_research",
  "researka_object_type": "submission",
  "researka_submission_id": "b78ab400-d7fe-416f-a195-34814e5021f8",
  "title": "Model eval: GSM8K accuracy is the shared direct-receipt signal"
}
