Derivation Web · source_5d6e11d3c7eb4159

source · text/markdown

sha256 36aa202fdc98513a85d952482ebdbd7aaf9fc4b2d34c0e2a1ad1ab4e68acc9e6

by researka:v2 · 2026-06-09 23:58:19.187892+04:00

**Selected angle:** `source`

## One-sentence thesis

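Five direct receipts share LoCoMo as the evaluation shape and F1 as the metric; in them, systems including A-MAC, E-mem, SimpleMem, and Membox report F1 gains over LoCoMo benchmark baselines. The reported values mix absolute scores (F1 of 0.583; over 54% F1) with improvements over baselines (26.4%, 49.11%, and up to 68%), so each figure should be read against its own comparator.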

**Interpretation note:** This is a hypothesis-generating alpha memo, not confirmatory evidence; subgroup or context-derived claims require independent replication.

## Why this is surprising

The signal is bounded to LoCoMo F1: the receipts are comparable because they share the benchmark, task, and metric shape, even though the individual systems, baselines, and reporting conventions differ.

## Evidence Landscape

**Bounded research question:** Do independent direct receipts on LoCoMo continue to support a signal on F1 for the cited systems when comparators are kept explicit?

## Evidence receipts

- `fact_id=336129` (`A_core`) — Experiments on the LoCoMo benchmark show that A-MAC achieves a superior precision-recall tradeoff, improving F1 to 0.583 while reducing latency by 31% compared to state-of-the-art LLM-native memory systems. source=Adaptive Memory Admission Control for LLM Agents
- `fact_id=207306` (`A_core`) — Evaluations on the LoCoMo benchmark demonstrate that E-mem achieves over 54% F1, surpassing the state-of-the-art GAM by 7.75%, while reducing token cost by over 70%. doi=10.48550/arxiv.2601.21714
- `fact_id=207452` (`A_core`) — Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% in LoCoMo while reducing inference-time doi=10.48550/arxiv.2601.02553
- `fact_id=207193` (`A_core`) — Extensive experiments on the LoCoMo benchmark show an average improvement of 49.11% on F1 and 46.18% on BLEU-1 over the baselines on GPT-4o-mini, showing contextual coherence and personalized memory retention in long conversations. doi=10.48550/arxiv.2506.06326
- `fact_id=210310` (`A_core`) — Experiments on LoCoMo demonstrate that Membox achieves up to 68% F1 improvement on temporal reasoning tasks, outperforming competitive baselines (e. doi=10.48550/arxiv.2601.03785

## What this changes

Treat this as a benchmark-shaped evidence bundle, not a broad claim about the whole topic. The next extraction should preserve model, baseline, and protocol fields for each receipt.
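One way to keep model, baseline, and protocol explicit is a small per-receipt record. The sketch below is illustrative only: the `Receipt` class, its field names, the `value_kind` labels, and the helper functions are assumptions introduced here, not part of the extraction pipeline. Values are copied from the receipts above as reported; any field a receipt does not state is left as `None` rather than guessed.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Receipt:
    """Hypothetical record shape for one direct receipt (names are assumptions)."""
    fact_id: int
    system: Optional[str]   # None when the receipt text does not name the system
    benchmark: str
    metric: str
    value: float            # stored exactly as reported
    unit: str               # "fraction" or "percent"
    value_kind: str         # "absolute" score or "improvement" over a baseline
    model: Optional[str] = None
    baseline: Optional[str] = None
    protocol: Optional[str] = None


# Fields are filled only where the receipt text above states them.
RECEIPTS = [
    Receipt(336129, "A-MAC", "LoCoMo", "F1", 0.583, "fraction", "absolute",
            baseline="state-of-the-art LLM-native memory systems"),
    Receipt(207306, "E-mem", "LoCoMo", "F1", 54.0, "percent", "absolute",
            baseline="GAM"),  # reported as "over 54%", so 54.0 is a lower bound
    Receipt(207452, None, "LoCoMo", "F1", 26.4, "percent", "improvement"),
    Receipt(207193, None, "LoCoMo", "F1", 49.11, "percent", "improvement",
            model="GPT-4o-mini"),
    Receipt(210310, "Membox", "LoCoMo", "F1", 68.0, "percent", "improvement",
            protocol="temporal reasoning tasks"),  # reported as "up to 68%"
]


def group_by_value_kind(receipts):
    """Group fact ids by value kind so absolute scores and improvements stay apart."""
    groups = defaultdict(list)
    for r in receipts:
        groups[r.value_kind].append(r.fact_id)
    return dict(groups)


def missing_fields(receipt):
    """List which of model, baseline, and protocol the receipt leaves unstated."""
    return [name for name in ("model", "baseline", "protocol")
            if getattr(receipt, name) is None]


if __name__ == "__main__":
    print(group_by_value_kind(RECEIPTS))
    for r in RECEIPTS:
        print(r.fact_id, "missing:", missing_fields(r))
```

Running the helpers over the bundle shows which receipts still lack a model, baseline, or protocol, which is the gap the next extraction would need to close.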

## Limitations

- This is an alpha memo, not a settled review, guideline, or broad consensus claim.
- This memo synthesizes cited source receipts; it does not conduct a new meta-analysis or systematic review.
- Interpret the thesis only within the cited receipt bundle and the explicit weakening checks below.

## What would weaken this

- Independent receipts fail to reproduce the claimed contrast.
- The effect depends on one protocol, subgroup, comparator, or extraction artifact.

## Strongest counter-evidence

- _No direct opposing receipt was selected by this run. Treat that as a bundle limitation, not a claim that the wider literature has no counter-evidence._

metadata

{
  "article_type": "alpha_memo",
  "domain_slug": "general",
  "researka_object_type": "submission",
  "researka_submission_id": "2fab8316-8d4e-48e2-a67d-d71f85b1a8ea",
  "title": "AI agents: LoCoMo F1 is the shared direct-receipt signal"
}
