source · text/markdown
source_7e09e6f643b04c73
sha256 be0a329417972314b3a37d549ef491c47fdc43f640d7da94beb73c56b05ca6a8
by researka:v2 · 2026-06-12 13:32:35.694639+04:00
**Selected angle:** `source`

## One-sentence thesis

The cited A/B receipts support a specific working claim: Our evaluation on various versions of the LLama models, Gemma-2, and multiple datasets...; Our findings demonstrate the superior performance of the AST model, achieving an overall...; Extensive experimental evaluation shows that OpenTab significantly outperforms baselines...; Specifically, we saw an improvement in the Overall Accuracy metric: from 53% to 62.18%...; Our evaluation demonstrates that Graphusion surpasses supervised baselines

**Interpretation note:** This is a hypothesis-generating alpha memo, not confirmatory evidence; subgroup or context-derived claims require independent replication.

## Why this is surprising

Real tension: the reviewer returned no thesis, but the lane gate found an independently sourced A_core receipt cluster. Publish only the bounded claim those receipts share.

## Evidence Landscape

**Bounded research question:** Does the cited receipt bundle still support this bounded claim when population, endpoint, comparator, and time window are aligned?

## Evidence receipts

- `fact_id=llm_evaluation/auto/2024/accuracy_325309` (`A_core`): Our evaluation on various versions of the LLama models, Gemma-2, and multiple datasets demonstrates that ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods. doi=10.1145/3691620.3695018
- `fact_id=llm_evaluation/auto/2024/accuracy_325619` (`A_core`): Our findings demonstrate the superior performance of the AST model, achieving an overall accuracy of 85.5%, surpassing all other models evaluated. doi=10.1109/cism64958.2025.11060866
- `fact_id=llm_evaluation/auto/2024/accuracy_325198` (`A_core`): Extensive experimental evaluation shows that OpenTab significantly outperforms baselines in both open- and closed-domain settings, achieving up to 21.5% higher accuracy. doi=10.48550/arxiv.2402.14361
- `fact_id=llm_evaluation/auto/2024/accuracy_208312` (`A_core`): Specifically, we saw an improvement in the Overall Accuracy metric: from 53% to 62.18%, compared to the existing state-of-the-art method, Fine and Coarse-grained Multi-Task Learning Framework (FC-MTLF). doi=10.48550/arxiv.2404.02588
- `fact_id=llm_evaluation/auto/2024/accuracy_325383` (`A_core`): Our evaluation demonstrates that Graphusion surpasses supervised baselines by up to 10% in accuracy on link prediction. doi=10.48550/arxiv.2407.10794
- `fact_id=llm_evaluation/auto/2024/accuracy_208116` (`A_core`): Experimental results on three public datasets demonstrate the effectiveness of our approach, achieving 94.6% detection accuracy and a BLEU-4 score of 0.421 for description generation, surpassing current state-of-the-art methods. doi=10.55524/ijircst.2024.12.6.8
- `fact_id=llm_evaluation/auto/2023/accuracy_323347` (`A_core`): Our scorer, with an achieved accuracy of 79.5%, significantly outperforms GPT-4 as a judge (61.3%). doi=10.18653/v1/2024.naacl-long.256
- `fact_id=llm_evaluation/auto/2024/accuracy_325554` (`A_core`): The evaluation results show that ReAccept achieved an update accuracy of 60.16% on the correctly identified obsolete test code, surpassing the state-of-the-art technique CEPROT by 90%. doi=10.48550/arxiv.2411.11033
- `fact_id=llm_evaluation/auto/2024/accuracy_207561` (`A_core`): With Video-MME, we extensively evaluate various state-of-the-art MLLMs, and reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models with an average accuracy of 75%, compared to doi=10.1109/cvpr52734.2025.02245

## What this changes

Treat this as a focused working signal, not a broad topic claim. It moves review attention from a broad receipt list to the specific contrast, receipt bundle, and matched direct-receipt table by population, model, endpoint, comparator, and effect direction that could confirm or kill the thesis.

## Limitations

- This is an alpha memo, not a settled review, guideline, or broad consensus claim.
- This memo synthesizes cited source receipts; it does not conduct a new meta-analysis or systematic review.
- Interpret the thesis only within the cited receipt bundle and the explicit weakening checks below.

## What would weaken this

- Independent receipts fail to reproduce the claimed contrast.
- The effect depends on one protocol, subgroup, comparator, or extraction artifact.

## Strongest counter-evidence

- _Counter-evidence not classified yet._
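
The receipts above quote gains in different forms: `accuracy_208312` and `accuracy_323347` each state two accuracy levels, while others give a single "by up to 10%" or "by 90%" figure. When aligning comparators, it may help to compute both the absolute gain (percentage points) and the relative gain (percent of baseline) for any receipt that states two levels. The following is a minimal sketch under that assumption, not part of the source pipeline: the `AccuracyPair` class and its method names are hypothetical, and the only inputs are figures quoted in the receipts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccuracyPair:
    """Two accuracy levels (in percent) quoted by a single receipt.

    Hypothetical helper for illustration; not part of any cited system.
    """
    label: str
    baseline_pct: float
    method_pct: float

    def absolute_gain_points(self) -> float:
        # Difference in percentage points, e.g. 62.18 - 53.0.
        return self.method_pct - self.baseline_pct

    def relative_gain_pct(self) -> float:
        # Gain expressed as a percentage of the baseline level.
        return 100.0 * self.absolute_gain_points() / self.baseline_pct


# Figures copied from the two receipts that state both levels.
pairs = [
    AccuracyPair("accuracy_208312 (vs FC-MTLF)", 53.0, 62.18),
    AccuracyPair("accuracy_323347 (vs GPT-4 as a judge)", 61.3, 79.5),
]

for p in pairs:
    print(
        f"{p.label}: "
        f"+{p.absolute_gain_points():.2f} points, "
        f"+{p.relative_gain_pct():.2f}% relative"
    )
```

For these two receipts the absolute and relative figures differ substantially (about 9.18 points versus about 17.32% for the first), which is one reason a matched table should record which form each receipt reports before effect sizes are compared.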
metadata
{
"article_type": "alpha_memo",
"domain_slug": "ai_research",
"researka_object_type": "submission",
"researka_submission_id": "8f28ff10-9687-4684-b007-01480cfa7353",
"title": "LLM-based approaches improve accuracy over prior state-of-the-art methods or baselines across diverse tasks"
}