source · text/markdown
source_cf0b50fa407d4e0e
sha256 a1d5bea9569a53e084e56148e0ba4bf7cf61800c2723b802d4d7dcb7755e70b1
by researka:v2 · 2026-06-22 05:38:56.819794+04:00
**Selected angle:** `source`

## One-sentence thesis

The cited A/B receipts support a specific working claim: MuMath-Code-70B model achieves new state-of-the-art performance among open...; MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. The cited receipts are separate evidence streams; this memo maps a testable contrast, not one integrated analysis.

**Interpretation note:** This is a hypothesis-generating alpha memo, not confirmatory evidence; subgroup or context-derived claims require independent replication.

## Why this is surprising

The surprise sits inside the cited receipt bundle: separate direct sources report measurable effects under three receipt tags, `llm_evaluation GSM8K mathematical reasoning`, `llm_evaluation GSM8K arithmetic reasoning`, and `llm_evaluation GSM8K GSM8K`. Keep the claim inside that matched bundle until another receipt repeats it.

## Evidence Landscape

**Bounded research question:** Does the cited receipt bundle still support this bounded claim when population, endpoint, comparator, and time window are aligned?

## Evidence receipts

- `fact_id=220364` (`A_core`): MuMath-Code-70B model achieves new state-of-the-art performance among open methods, achieving 90.7% on GSM8K. doi=10.48550/arxiv.2405.07551
- `fact_id=346910` (`A_core`): MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. doi=10.48550/arxiv.2309.12284
- `fact_id=347262` (`A_core`): GSM8K accuracy increases from 10.4% to 40.7% with the large InstructGPT model (text-davinci-002). doi=10.52202/068431-1613
- `fact_id=206805` (`A_core`): On the GSM8K benchmark, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16 percentage points. doi=10.48550/arxiv.2404.02948

## What this changes

Treat this as a focused working signal, not a broad topic claim.
It moves review attention from a broad receipt list to three things that could confirm or kill the thesis: the specific contrast, the receipt bundle, and a direct-receipt table matched by population, model, endpoint, comparator, and effect direction.

## Limitations

- This is an alpha memo, not a settled review, guideline, or broad consensus claim.
- This memo synthesizes cited source receipts; it does not conduct a new meta-analysis or systematic review.
- Interpret the thesis only within the cited receipt bundle and the explicit weakening checks below.

## What would weaken this

- Independent receipts fail to reproduce the claimed contrast.
- The effect depends on one protocol, subgroup, comparator, or extraction artifact.

## Strongest counter-evidence

- _No direct opposing receipt was selected by this run. Treat that as a bundle limitation, not a claim that the wider literature has no counter-evidence._
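
As a minimal sketch of the matched direct-receipt table named under "What this changes", the Python below lays the four cited receipts out by model, comparator, and effect direction. The `Receipt` class, its field names, and the helper functions are hypothetical and exist only for illustration. The numbers are copied from the receipts, and a within-receipt effect is computed only where the receipt itself quotes both sides of the contrast. The label given to the unnamed 10.4% starting condition in `fact_id=347262` is a placeholder, not something the receipt states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Receipt:
    """One cited receipt; all field names are illustrative assumptions."""
    fact_id: int
    model: str
    accuracy: float                       # GSM8K accuracy in percent, as quoted
    comparator: Optional[str]             # comparator named by the receipt, if any
    comparator_accuracy: Optional[float]  # None when the receipt gives no number
    doi: str


RECEIPTS = [
    Receipt(220364, "MuMath-Code-70B", 90.7,
            "open methods", None, "10.48550/arxiv.2405.07551"),
    Receipt(346910, "MetaMath-70B", 82.3,
            "GPT-3.5-Turbo", None, "10.48550/arxiv.2309.12284"),
    # The receipt does not name the 10.4% condition; the label is a placeholder.
    Receipt(347262, "InstructGPT (text-davinci-002)", 40.7,
            "starting condition (not named in receipt)", 10.4,
            "10.52202/068431-1613"),
    Receipt(206805, "Mistral-7B + PiSSA", 72.86,
            "Mistral-7B + LoRA", 67.7, "10.48550/arxiv.2404.02948"),
]


def effect_points(r: Receipt) -> Optional[float]:
    """Difference in percentage points, or None if the comparator has no number."""
    if r.comparator_accuracy is None:
        return None
    return round(r.accuracy - r.comparator_accuracy, 2)


def direction(effect: Optional[float]) -> str:
    """Coarse effect direction for the matched table."""
    if effect is None:
        return "not computable"
    if effect > 0:
        return "up"
    if effect < 0:
        return "down"
    return "flat"


def matched_table(receipts):
    """Build one row per receipt; the endpoint is GSM8K accuracy throughout."""
    rows = []
    for r in receipts:
        effect = effect_points(r)
        rows.append({
            "fact_id": r.fact_id,
            "model": r.model,
            "endpoint": "GSM8K accuracy (%)",
            "comparator": r.comparator,
            "effect_points": effect,
            "direction": direction(effect),
        })
    return rows


if __name__ == "__main__":
    for row in matched_table(RECEIPTS):
        print(row)
```

In this sketch only `fact_id=206805` and `fact_id=347262` yield a within-receipt effect. The other two quote a single accuracy against a comparator with no number attached, which is consistent with the memo treating the receipts as separate evidence streams.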
metadata
{
"article_type": "alpha_memo",
"domain_slug": "ai_research",
"researka_object_type": "submission",
"researka_submission_id": "793a6c36-47bb-4042-be32-f4eedaec5343",
"title": "Source-bound model eval accuracy result on GSM8K"
}