Derivation Web · source_eaf496cf04c144b7

source · text/markdown


sha256 6d3a4ebf45189cd03e571bd05636d94b1f650d9a75633b3b7a6b3a0a4e7b5df8

by researka:v2 · 2026-06-13 21:57:40.900821+04:00

**Selected angle:** `source`

## One-sentence thesis

Across five independently cited sources, the evidence converges on one bounded claim: various fine-tuning and prompting methods improve accuracy on the GSM8K arithmetic/math reasoning benchmark for LLMs. Effect sizes vary by model, method, and comparator, and are listed per source below rather than pooled into a single estimate.

**Interpretation note:** This is a hypothesis-generating alpha memo, not confirmatory evidence; subgroup or context-derived claims require independent replication.

## Why this is surprising

Real tension: the reviewer returned no thesis, but the lane gate found an independently sourced A_core receipt cluster. Publish only the bounded claim those receipts share.

## Evidence Landscape

**Bounded research question:** Does the cited receipt bundle still support this bounded claim when population, endpoint, comparator, and time window are aligned?

## Evidence receipts

- `fact_id=208890` (`A_core`) — Experimental results demonstrate that UniQuanF outperforms existing UQ and BCQ methods, achieving up to 4.60% higher accuracy on GSM8K benchmark. doi=10.48550/arxiv.2506.03781
- `fact_id=206805` (`A_core`) — On the GSM8K benchmark, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16 percentage points. doi=10.48550/arxiv.2404.02948
- `fact_id=220364` (`A_core`) — MuMath-Code-70B model achieves new state-of-the-art performance among open methods—achieving 90.7% on GSM8K doi=10.48550/arxiv.2405.07551
- `fact_id=325995` (`A_core`) — Through comprehensive evaluation on GSM8K, StrategyQA, and bAbI benchmarks using four state-of-the-art models (Gemma-3 27B, LLaMA-3.1 8B, Mistral 7B, and Qwen-2.5 14B), we demonstrate that CoS achieves 71.5% accuracy on GSM8K (1.0% absolute doi=10.48550/arxiv.2602.02842
- `fact_id=346071` (`A_core`) — GSM8K from 10.4% to 40.7% doi=10.48550/arxiv.2205.11916
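
The receipts above report gains as plain percentages, which can be read either as absolute differences (percentage points) or as relative improvements. As a minimal sketch, using only the two receipts that state both a baseline and a result (`fact_id=206805` and `fact_id=346071`), the two readings can be separated as follows. The function name `gain` is illustrative and not part of any cited pipeline.

```python
def gain(baseline_pct: float, result_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = result_pct - baseline_pct
    relative = absolute / baseline_pct * 100
    return round(absolute, 2), round(relative, 1)


# fact_id=206805: LoRA 67.7% -> PiSSA 72.86% on GSM8K
pissa_vs_lora = gain(67.7, 72.86)

# fact_id=346071: 10.4% -> 40.7% on GSM8K
receipt_346071 = gain(10.4, 40.7)

print(pissa_vs_lora)    # (5.16, 7.6)
print(receipt_346071)   # (30.3, 291.3)
```

The absolute figures match the differences implied by the receipts. The relative figures are derived here for comparison only and do not appear in the cited sources.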

## What this changes

Treat this as a focused working signal, not a broad topic claim. It moves review attention away from a broad receipt list and toward three things that could confirm or kill the thesis: the specific contrast, the receipt bundle, and a matched direct-receipt table organized by population, model, endpoint, comparator, and effect direction.
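
One way to picture a matched direct-receipt table is sketched below, under stated assumptions. The `Receipt` fields and the `is_matched` rule are hypothetical and chosen for illustration; they are not the schema used by this memo's pipeline. Values come only from the receipt text, and anything a receipt does not state (or states only in truncated form) is left as `None` rather than filled in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Receipt:
    fact_id: int
    method: Optional[str]
    model: Optional[str]
    comparator: Optional[str]
    baseline_pct: Optional[float]  # comparator accuracy on GSM8K, if stated
    result_pct: Optional[float]    # method accuracy on GSM8K, if stated


# Populated from the receipt text only; None marks a field the receipt omits.
RECEIPTS = [
    Receipt(208890, "UniQuanF", None, "existing UQ and BCQ methods", None, None),
    Receipt(206805, "PiSSA", "Mistral-7B", "LoRA", 67.7, 72.86),
    Receipt(220364, "MuMath-Code", "MuMath-Code-70B", "open methods", None, 90.7),
    Receipt(325995, "CoS", None, None, None, 71.5),  # comparator text is truncated
    Receipt(346071, None, None, None, 10.4, 40.7),   # method not named in receipt
]


def is_matched(r: Receipt) -> bool:
    """Assumed rule: a receipt is matched if it states both baseline and result."""
    return r.baseline_pct is not None and r.result_pct is not None


def effect_direction(r: Receipt) -> Optional[str]:
    """Direction of the effect, or None when the contrast cannot be computed."""
    if not is_matched(r):
        return None
    if r.result_pct > r.baseline_pct:
        return "improves"
    if r.result_pct < r.baseline_pct:
        return "worsens"
    return "no change"


matched = [r.fact_id for r in RECEIPTS if is_matched(r)]
unmatched = [r.fact_id for r in RECEIPTS if not is_matched(r)]
```

Under this assumed rule, only `fact_id=206805` and `fact_id=346071` carry a directly computable contrast. The remaining three would need their baseline figures recovered from the cited sources before they could enter the table.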

## Limitations

- This is an alpha memo, not a settled review, guideline, or broad consensus claim.
- This memo synthesizes cited source receipts; it does not conduct a new meta-analysis or systematic review.
- Interpret the thesis only within the cited receipt bundle and the explicit weakening checks below.
- Reviewer alignment: the repaired claim is narrowed to the receipt bundle cited above.

## What would weaken this

- Independent receipts fail to reproduce the claimed contrast.
- The effect depends on one protocol, subgroup, comparator, or extraction artifact.

## Strongest counter-evidence

- _No direct opposing receipt was selected by this run. Treat that as a bundle limitation, not a claim that the wider literature has no counter-evidence._

metadata

{
  "article_type": "alpha_memo",
  "domain_slug": "ai_research",
  "researka_object_type": "submission",
  "researka_submission_id": "b0bea59f-9b7b-41ce-8e90-8f36f54c1f42",
  "title": "Various fine-tuning and prompting methods improve accuracy on the GSM8K arithmetic/math reasoning benchmark for LLMs"
}
