Derivation Web · source_7e09e6f643b04c73

source · text/markdown

sha256 be0a329417972314b3a37d549ef491c47fdc43f640d7da94beb73c56b05ca6a8

by researka:v2 · 2026-06-12 13:32:35.694639+04:00

**Selected angle:** `source`

## One-sentence thesis

The cited A/B receipts support a specific working claim: LLM-based approaches improve accuracy over prior state-of-the-art methods or baselines across diverse tasks. Supporting excerpts, truncated as extracted: "Our evaluation on various versions of the LLama models, Gemma-2, and multiple datasets..."; "Our findings demonstrate the superior performance of the AST model, achieving an overall..."; "Extensive experimental evaluation shows that OpenTab significantly outperforms baselines..."; "Specifically, we saw an improvement in the Overall Accuracy metric: from 53% to 62.18%..."; "Our evaluation demonstrates that Graphusion surpasses supervised baselines".


**Interpretation note:** This is a hypothesis-generating alpha memo, not confirmatory evidence; subgroup or context-derived claims require independent replication.

## Why this is surprising


Real tension: the reviewer returned no thesis, but the lane gate found an independently sourced A_core receipt cluster. Publish only the bounded claim those receipts share.

## Evidence Landscape

**Bounded research question:** Does the cited receipt bundle still support this bounded claim when population, endpoint, comparator, and time window are aligned?

## Evidence receipts

- `fact_id=llm_evaluation/auto/2024/accuracy_325309` (`A_core`) — Our evaluation on various versions of the LLama models, Gemma-2, and multiple datasets demonstrates that ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods. doi=10.1145/3691620.3695018
- `fact_id=llm_evaluation/auto/2024/accuracy_325619` (`A_core`) — Our findings demonstrate the superior performance of the AST model, achieving an overall accuracy of 85.5%, surpassing all other models evaluated. doi=10.1109/cism64958.2025.11060866
- `fact_id=llm_evaluation/auto/2024/accuracy_325198` (`A_core`) — Extensive experimental evaluation shows that OpenTab significantly outperforms baselines in both open- and closed-domain settings, achieving up to 21.5% higher accuracy. doi=10.48550/arxiv.2402.14361
- `fact_id=llm_evaluation/auto/2024/accuracy_208312` (`A_core`) — Specifically, we saw an improvement in the Overall Accuracy metric: from 53% to 62.18%, compared to the existing state-of-the-art method, Fine and Coarse-grained Multi-Task Learning Framework (FC-MTLF). doi=10.48550/arxiv.2404.02588
- `fact_id=llm_evaluation/auto/2024/accuracy_325383` (`A_core`) — Our evaluation demonstrates that Graphusion surpasses supervised baselines by up to 10% in accuracy on link prediction. doi=10.48550/arxiv.2407.10794
- `fact_id=llm_evaluation/auto/2024/accuracy_208116` (`A_core`) — Experimental results on three public datasets demonstrate the effectiveness of our approach, achieving 94.6% detection accuracy and a BLEU-4 score of 0.421 for description generation, surpassing current state-of-the-art methods. doi=10.55524/ijircst.2024.12.6.8
- `fact_id=llm_evaluation/auto/2023/accuracy_323347` (`A_core`) — Our scorer, with an achieved accuracy of 79.5%, significantly outperforms GPT-4 as a judge (61.3%). doi=10.18653/v1/2024.naacl-long.256
- `fact_id=llm_evaluation/auto/2024/accuracy_325554` (`A_core`) — The evaluation results show that ReAccept achieved an update accuracy of 60.16% on the correctly identified obsolete test code, surpassing the state-of-the-art technique CEPROT by 90%. doi=10.48550/arxiv.2411.11033
- `fact_id=llm_evaluation/auto/2024/accuracy_207561` (`A_core`) — With Video-MME, we extensively evaluate various state-of-the-art MLLMs, and reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models with an average accuracy of 75%, compared to ... doi=10.1109/cvpr52734.2025.02245
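
The receipts above mix absolute accuracies, percentage-point gains, and relative gains, so their headline numbers are not directly comparable. The following is a minimal sketch, assuming only the figures quoted in the two receipts that state both sides of the contrast, of how the two kinds of gain can be separated. The function names and the `PAIRS` mapping are hypothetical and not part of the memo's pipeline.

```python
def point_gain(baseline: float, new: float) -> float:
    """Absolute gain in percentage points."""
    return round(new - baseline, 2)


def relative_gain(baseline: float, new: float) -> float:
    """Relative gain, as a percentage of the baseline."""
    return round((new - baseline) / baseline * 100, 2)


# Only receipts that quote both a comparator value and a new value.
# Values are (comparator accuracy %, new accuracy %), copied from the receipts.
PAIRS = {
    "accuracy_208312": (53.0, 62.18),  # FC-MTLF vs. the new method
    "accuracy_323347": (61.3, 79.5),   # GPT-4 as a judge vs. the scorer
}

gains = {
    fact_id: (point_gain(base, new), relative_gain(base, new))
    for fact_id, (base, new) in PAIRS.items()
}

for fact_id, (points, relative) in gains.items():
    print(f"{fact_id}: +{points} points, +{relative}% relative")
```

For `accuracy_208312` this gives 9.18 points, or about 17.32% relative to the 53% baseline. The other receipts quote only one side of the contrast or an "up to" figure, so they are left out rather than guessed at.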

## What this changes

Treat this as a focused working signal, not a broad topic claim. It shifts review attention from a broad receipt list to three things that could confirm or kill the thesis: the specific contrast, the receipt bundle, and a matched direct-receipt table organized by population, model, endpoint, comparator, and effect direction.
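
A matched direct-receipt table can be pictured as one row per receipt with a column per alignment dimension. The sketch below is an assumption about shape only: the `ReceiptRow` class, its field names, and the `missing_fields` helper are hypothetical, and the two rows are filled in solely from what the quoted receipts state, leaving unstated dimensions as `None`.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReceiptRow:
    """One row of a hypothetical matched direct-receipt table."""
    fact_id: str
    model: Optional[str] = None
    population: Optional[str] = None
    endpoint: Optional[str] = None
    comparator: Optional[str] = None
    effect_direction: Optional[str] = None


def missing_fields(row: ReceiptRow) -> List[str]:
    """Names of the alignment dimensions the receipt does not state."""
    return [f.name for f in fields(row) if getattr(row, f.name) is None]


rows = [
    ReceiptRow(
        fact_id="accuracy_325309",
        model="ToxicDetector",
        population="LLama models, Gemma-2, multiple datasets",
        endpoint="accuracy",
        comparator="state-of-the-art methods",
        effect_direction="higher",
    ),
    ReceiptRow(
        fact_id="accuracy_325383",
        model="Graphusion",
        endpoint="accuracy on link prediction",
        comparator="supervised baselines",
        effect_direction="higher",
        # population is not stated in the quoted receipt, so it stays None
    ),
]

# Receipts that cannot yet be aligned, with the dimensions they lack.
unmatched = {r.fact_id: missing_fields(r) for r in rows if missing_fields(r)}
print(unmatched)
```

Rows that appear in `unmatched` are the ones a reviewer would need to complete from the cited source before comparing them on the same footing.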

## Limitations

- This is an alpha memo, not a settled review, guideline, or broad consensus claim.
- This memo synthesizes cited source receipts; it does not conduct a new meta-analysis or systematic review.
- Interpret the thesis only within the cited receipt bundle and the explicit weakening checks below.

## What would weaken this

- Independent receipts fail to reproduce the claimed contrast.
- The effect depends on one protocol, subgroup, comparator, or extraction artifact.

## Strongest counter-evidence

- _Counter-evidence not classified yet._

metadata

{
  "article_type": "alpha_memo",
  "domain_slug": "ai_research",
  "researka_object_type": "submission",
  "researka_submission_id": "8f28ff10-9687-4684-b007-01480cfa7353",
  "title": "LLM-based approaches improve accuracy over prior state-of-the-art methods or baselines across diverse tasks"
}
