MemoryAgentBench Dataset Overview¶
This document focuses on the dataset layout and practical usage. It does not cover benchmark reproduction or scoring pipelines because this repository does not currently ship dedicated evaluation code.
Quick Summary¶
MemoryAgentBench evaluates four memory-related capabilities:
- Accurate Retrieval: retrieve specific facts from long histories.
- Test-Time Learning: learn a task from interaction history and apply it later.
- Long-Range Understanding: build a global understanding of long documents.
- Conflict Resolution: update stale facts when later evidence contradicts them.
At the dataset level, every example follows the same high-level pattern:
{
    "context": str,
    "questions": list[str],
    "answers": list[list[str]],
    "metadata": {
        "source": str,
        # other fields are source-specific and are often null
    },
}
Two details are easy to miss:
- `questions` and `answers` are aligned by position.
- `answers[i]` is always a list of acceptable gold strings, even when there is only one candidate answer (see the sketch below).
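A quick sanity check makes both details concrete. This is a minimal sketch, assuming the dataset loads as shown in the Practical Usage Notes below:

```python
from datasets import load_dataset

# Sketch: check the two alignment invariants on one row.
dataset = load_dataset("ai-hyz/MemoryAgentBench")
row = dataset["Accurate_Retrieval"][0]

# One candidate list per question, and every list is non-empty.
assert len(row["questions"]) == len(row["answers"])
assert all(isinstance(candidates, list) and len(candidates) >= 1
           for candidates in row["answers"])
```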
Split Inventory¶
| Split | Examples | What it tests | Sources in local snapshot |
|---|---|---|---|
| `Accurate_Retrieval` | 22 | fact lookup from long contexts | `ruler_qa1_197K`, `ruler_qa2_421K`, `eventqa_full`, `eventqa_65536`, `eventqa_131072`, `longmemeval_s*` |
| `Test_Time_Learning` | 6 | learning from demonstrations in context | `recsys_redial_full`, `icl_banking77_5900shot_balance`, `icl_clinic150_7050shot_balance`, `icl_nlu_8296shot_balance`, `icl_trec_coarse_6600shot_balance`, `icl_trec_fine_6400shot_balance` |
| `Long_Range_Understanding` | 110 | summarization and multi-question global understanding | `infbench_sum_eng_shots2`, `detective_qa` |
| `Conflict_Resolution` | 8 | resolving stale or conflicting facts | `factconsolidation_mh_6k`, `factconsolidation_mh_32k`, `factconsolidation_mh_64k`, `factconsolidation_mh_262k`, `factconsolidation_sh_6k`, `factconsolidation_sh_32k`, `factconsolidation_sh_64k`, `factconsolidation_sh_262k` |
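The inventory above can be checked directly against the hub snapshot. A minimal sketch, assuming `metadata.source` is populated for every row (see Common Row Structure below):

```python
from datasets import load_dataset

# Sketch: print each split's size and its distinct sources.
dataset = load_dataset("ai-hyz/MemoryAgentBench")
for split_name, split in dataset.items():
    sources = sorted({row["metadata"]["source"] for row in split})
    print(split_name, len(split), sources)
```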
Common Row Structure¶
Each row packs one long context together with many aligned QA pairs.
This follows the benchmark's "inject once, query multiple times" design.
Core fields¶
- `context`: one large string containing the full material the agent is allowed to remember. Depending on the source, this may be a long article, a chunked event history, dialogue transcripts, or a synthetic long context.
- `questions`: the prompts asked against that context.
- `answers`: a list of gold answer candidate lists. The outer list lines up with `questions`. The inner list contains one or more acceptable strings.
- `metadata`: extra source-specific fields. Only `source` is consistently present across all splits.
Metadata fields seen in the snapshot¶
Not every source fills every metadata field. The full schema allows:
`source`, `qa_pair_ids`, `demo`, `keypoints`, `previous_events`, `haystack_sessions`, `question_dates`, `question_ids`, `question_types`
In practice:
- `qa_pair_ids` and `source` are present for every row.
- `demo` and `keypoints` are used by `Long_Range_Understanding`.
- `previous_events` appears in the EventQA-based AR rows.
- `haystack_sessions`, `question_dates`, `question_ids`, and `question_types` appear in the `longmemeval_s*` AR rows (a scan that confirms this follows below).
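To see which fields a given source actually populates, a defensive scan like the following works. This is a sketch, not shipped tooling:

```python
from collections import defaultdict

from datasets import load_dataset

# Sketch: map each source to the metadata fields it actually populates.
dataset = load_dataset("ai-hyz/MemoryAgentBench")
fields_by_source = defaultdict(set)
for split in dataset.values():
    for row in split:
        meta = row["metadata"]
        for key, value in meta.items():
            if value is not None:
                fields_by_source[meta["source"]].add(key)

for source, fields in sorted(fields_by_source.items()):
    print(source, sorted(fields))
```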
Split Details¶
Accurate Retrieval¶
This split contains 22 examples:
- `ruler_qa1_197K`: 1 example, 100 questions
- `ruler_qa2_421K`: 1 example, 100 questions
- `eventqa_full`: 5 examples, 100 questions each
- `eventqa_65536`: 5 examples, 100 questions each
- `eventqa_131072`: 5 examples, 100 questions each
- `longmemeval_s*`: 5 examples, 60 questions each
Why this split is slightly unusual:
- Most AR sub-datasets use exactly one gold string per question.
- `ruler_qa1_197K` is the exception. Every question there stores 3 or 4 answer candidates, and many are duplicated or near-duplicate surface forms.
Examples from `ruler_qa1_197K`:
- `1066` vs `In 1066`
- `Bayeux Tapestry` vs `the Bayeux Tapestry`
- `King Ethelred II` vs `Ethelred II`
Practical implication:
- Evaluation should treat `answers[i]` as a candidate set and accept a prediction if it matches any normalized candidate string (a sketch follows this list).
- A small number of `ruler_qa1_197K` candidates are not clean paraphrases, so downstream evaluation code should be explicit about normalization and tie-breaking rules.
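For illustration, a normalization rule along these lines would collapse the near-duplicate surface forms shown above. The specific rule (lowercasing, punctuation stripping, dropping a few function words) is an assumption; the dataset does not prescribe one:

```python
import string

# Assumption: a small stopword set tuned to the surface forms above.
STOPWORDS = {"a", "an", "the", "in"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, drop a few function words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(t for t in text.split() if t not in STOPWORDS)

def matches_any(prediction: str, candidates: list[str]) -> bool:
    """Accept a prediction if it equals any normalized gold candidate."""
    pred = normalize(prediction)
    return any(pred == normalize(c) for c in candidates)

# e.g. matches_any("in 1066", ["1066", "In 1066"]) -> True
```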
Test-Time Learning¶
This split contains 6 examples:
- `recsys_redial_full`: 200 questions
- the remaining 5 sources: 100 questions each
The payload shape is the same as AR, but the semantics are different:
- the `context` encodes demonstrations or prior interaction turns
- the later `questions` test whether the model learned a task from that context
- answers are usually labels, classes, or item identifiers rather than free-form spans from the context
The `recsys_redial_full` example is notably large and uses recommendation item IDs as answers.
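To get a feel for the label-style answers, a quick peek at the split works. A sketch, assuming the dataset is already loaded as in the earlier snippets:

```python
# Sketch: show each Test_Time_Learning source with its question count
# and the gold candidate list for its first question.
for row in dataset["Test_Time_Learning"]:
    source = row["metadata"]["source"]
    print(source, len(row["questions"]), row["answers"][0])
```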
Long-Range Understanding¶
This split contains 110 examples:
- `infbench_sum_eng_shots2`: 100 examples, 1 question each
- `detective_qa`: 10 examples, between 6 and 10 questions each
This is the most heterogeneous split:
- `infbench_sum_eng_shots2` looks like long-document summarization. The single question typically asks for a long summary, and the single answer is a long reference summary.
- `detective_qa` is multi-question comprehension over long contexts.
This split is also the one that makes the heaviest use of metadata:
- `demo` provides in-context demonstration material
- `keypoints` stores auxiliary reference points for evaluation or inspection
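Since only this split uses these fields, access them defensively. A minimal sketch:

```python
# Sketch: inspect demo and keypoints without assuming they exist.
for row in dataset["Long_Range_Understanding"]:
    meta = row["metadata"]
    if meta.get("demo") is not None:
        print(meta["source"], "has demo material")
    if meta.get("keypoints"):
        print(meta["source"], "keypoints:", len(meta["keypoints"]))
```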
Conflict Resolution¶
This split contains 8 examples. Every example has 100 questions, and the sources vary by difficulty and context length:
- `factconsolidation_sh_*`: single-hop conflict resolution
- `factconsolidation_mh_*`: multi-hop conflict resolution
- suffixes such as `6k`, `32k`, `64k`, and `262k` indicate different context length settings
Compared with AR, the challenge here is not only locating a fact but deciding which fact is still current after later evidence updates or overrides earlier statements.
Practical Usage Notes¶
Load the dataset¶
from datasets import load_dataset
dataset = load_dataset("ai-hyz/MemoryAgentBench")
accurate_retrieval = dataset["Accurate_Retrieval"]
test_time_learning = dataset["Test_Time_Learning"]
long_range_understanding = dataset["Long_Range_Understanding"]
conflict_resolution = dataset["Conflict_Resolution"]
Iterate safely over questions and answers¶
Because the dataset is row-oriented, most consumers will want to flatten each row into per-question records:
for row in dataset["Accurate_Retrieval"]:
for question, answer_candidates in zip(row["questions"], row["answers"], strict=True):
source = row["metadata"]["source"]
# Compare your prediction against any normalized candidate in
# `answer_candidates`.
Recommended evaluation assumption¶
Treat each `answers[i]` as a set of acceptable gold strings, not as a single
canonical answer. This matters most for `ruler_qa1_197K`, but using the same
rule across all splits keeps the logic uniform.
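Putting the pieces together, a uniform scorer might look like the following. It reuses the illustrative `normalize`/`matches_any` helpers sketched in the Accurate Retrieval section, which are an assumption rather than an official metric, and `predict` is a hypothetical model callable:

```python
def score_split(split, predict) -> float:
    """Sketch: fraction of questions whose prediction matches any
    normalized gold candidate in the row's answer lists."""
    correct = total = 0
    for row in split:
        for question, candidates in zip(row["questions"], row["answers"],
                                        strict=True):
            prediction = predict(row["context"], question)
            correct += matches_any(prediction, candidates)
            total += 1
    return correct / total
```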
Recommended inspection assumption¶
Treat metadata as sparse and source-specific. Code should use defensive access
patterns such as `row["metadata"].get("question_types")` instead of assuming
every field exists for every source.