MemoryAgentBench Dataset Overview

This document focuses on the dataset layout and practical usage. It does not cover benchmark reproduction or scoring pipelines because this repository does not currently ship dedicated evaluation code.

Quick Summary

MemoryAgentBench evaluates four memory-related capabilities:

  • Accurate Retrieval: retrieve specific facts from long histories.
  • Test-Time Learning: learn a task from interaction history and apply it later.
  • Long-Range Understanding: build a global understanding of long documents.
  • Conflict Resolution: update stale facts when later evidence contradicts them.

At the dataset level, every example follows the same high-level pattern:

{
    "context": str,
    "questions": list[str],
    "answers": list[list[str]],
    "metadata": {
        "source": str,
        # other fields are source-specific and are often null
    },
}

Two details are easy to miss:

  • questions and answers are aligned by position.
  • answers[i] is always a list of acceptable gold strings, even when there is only one candidate answer.
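
A quick sanity check on a single row makes both points concrete. The sketch below is illustrative only; it uses the repository and split names listed later in this document:

from datasets import load_dataset

row = load_dataset("ai-hyz/MemoryAgentBench", split="Accurate_Retrieval")[0]

# questions and answers line up index by index.
assert len(row["questions"]) == len(row["answers"])

# Each answers[i] is a list of acceptable gold strings,
# even when it contains exactly one candidate.
print(row["questions"][0])
print(row["answers"][0])  # always a list, possibly with a single string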

Split Inventory

Split | Examples | What it tests | Sources in local snapshot
Accurate_Retrieval | 22 | fact lookup from long contexts | ruler_qa1_197K, ruler_qa2_421K, eventqa_full, eventqa_65536, eventqa_131072, longmemeval_s*
Test_Time_Learning | 6 | learning from demonstrations in context | recsys_redial_full, icl_banking77_5900shot_balance, icl_clinic150_7050shot_balance, icl_nlu_8296shot_balance, icl_trec_coarse_6600shot_balance, icl_trec_fine_6400shot_balance
Long_Range_Understanding | 110 | summarization and multi-question global understanding | infbench_sum_eng_shots2, detective_qa
Conflict_Resolution | 8 | resolving stale or conflicting facts | factconsolidation_mh_6k, factconsolidation_mh_32k, factconsolidation_mh_64k, factconsolidation_mh_262k, factconsolidation_sh_6k, factconsolidation_sh_32k, factconsolidation_sh_64k, factconsolidation_sh_262k
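
The Examples column can be checked against a local download with a few lines (a sketch, assuming the same load_dataset call shown in the Practical Usage Notes):

from datasets import load_dataset

dataset = load_dataset("ai-hyz/MemoryAgentBench")
for split_name, split in dataset.items():
    print(split_name, len(split))  # e.g. Accurate_Retrieval 22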

Common Row Structure

Each row packs one long context together with many aligned QA pairs. This follows the benchmark's "inject once, query multiple times" design.

Core fields

  • context: one large string containing the full material the agent is allowed to remember. Depending on the source, this may be a long article, a chunked event history, dialogue transcripts, or a synthetic long context.
  • questions: the prompts asked against that context.
  • answers: a list of gold answer candidate lists. The outer list lines up with questions. The inner list contains one or more acceptable strings.
  • metadata: extra source-specific fields. In the local snapshot, only source and qa_pair_ids are populated on every row; the remaining fields are often null.

Metadata fields seen in the snapshot

Not every source fills every metadata field. The full schema includes:

  • source
  • qa_pair_ids
  • demo
  • keypoints
  • previous_events
  • haystack_sessions
  • question_dates
  • question_ids
  • question_types

In practice:

  • qa_pair_ids and source are present for every row.
  • demo and keypoints are used by Long_Range_Understanding.
  • previous_events appears in the EventQA-based AR rows.
  • haystack_sessions, question_dates, question_ids, and question_types appear in the longmemeval_s* AR rows.
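
To see which fields are actually populated for each source in a split, a small inspection sketch along these lines works (again assuming the load_dataset call from the Practical Usage Notes):

from collections import defaultdict

from datasets import load_dataset

dataset = load_dataset("ai-hyz/MemoryAgentBench")

# For each source, collect the metadata keys that are non-null at least once.
populated = defaultdict(set)
for row in dataset["Accurate_Retrieval"]:
    meta = row["metadata"]
    for key, value in meta.items():
        if value is not None:
            populated[meta["source"]].add(key)

for source, keys in sorted(populated.items()):
    print(source, sorted(keys))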

Split Details

Accurate Retrieval

This split contains 22 examples:

  • ruler_qa1_197K: 1 example, 100 questions
  • ruler_qa2_421K: 1 example, 100 questions
  • eventqa_full: 5 examples, 100 questions each
  • eventqa_65536: 5 examples, 100 questions each
  • eventqa_131072: 5 examples, 100 questions each
  • longmemeval_s*: 5 examples, 60 questions each

Why this split is slightly unusual:

  • Most AR sub-datasets use exactly one gold string per question.
  • ruler_qa1_197K is the exception. Every question there stores 3 or 4 answer candidates, and many are duplicated or near-duplicate surface forms.

Examples from ruler_qa1_197K:

  • 1066 vs In 1066
  • Bayeux Tapestry vs the Bayeux Tapestry
  • King Ethelred II vs Ethelred II

Practical implication:

  • Evaluation should treat answers[i] as a candidate set and accept a prediction if it matches any normalized candidate string.
  • A small number of ruler_qa1_197K candidates are not clean paraphrases, so downstream evaluation code should be explicit about normalization and tie-breaking rules.
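
The dataset does not prescribe a normalization scheme, so the following is only one possible sketch; the rules in normalize (lowercasing, punctuation stripping, dropping a leading article or preposition) are assumptions chosen to cover the surface forms listed above:

import string

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, drop a leading article/preposition,
    # and collapse whitespace. These rules are illustrative, not official.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = text.split()
    if words and words[0] in {"a", "an", "the", "in"}:
        words = words[1:]
    return " ".join(words)

def is_correct(prediction: str, candidates: list[str]) -> bool:
    # Accept the prediction if it matches any normalized gold candidate.
    return normalize(prediction) in {normalize(c) for c in candidates}

assert is_correct("in 1066", ["1066", "In 1066"])
assert is_correct("Bayeux Tapestry", ["the Bayeux Tapestry"])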

Test-Time Learning

This split contains 6 examples:

  • recsys_redial_full: 200 questions
  • the remaining 5 sources: 100 questions each

The payload shape is the same as AR, but the semantics are different:

  • the context encodes demonstrations or prior interaction turns
  • the later questions test whether the model learned a task from that context
  • answers are usually labels, classes, or item identifiers rather than free-form spans from the context

The recsys_redial_full example is notably large and uses recommendation item IDs as answers.
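
Because the gold answers are labels or item identifiers rather than spans, a simple per-row exact-match accuracy is a reasonable starting point. The sketch below is not an official metric; predict is a hypothetical callable standing in for whatever agent is being tested:

def ttl_row_accuracy(row, predict) -> float:
    # `predict` is hypothetical: (context, question) -> predicted label string.
    hits = 0
    for question, candidates in zip(row["questions"], row["answers"]):
        prediction = predict(row["context"], question)
        if prediction.strip().lower() in {c.strip().lower() for c in candidates}:
            hits += 1
    return hits / len(row["questions"])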

Long-Range Understanding

This split contains 110 examples:

  • infbench_sum_eng_shots2: 100 examples, 1 question each
  • detective_qa: 10 examples, between 6 and 10 questions each

This is the most heterogeneous split:

  • infbench_sum_eng_shots2 looks like long-document summarization. The single question typically asks for a long summary, and the single answer is a long reference summary.
  • detective_qa is multi-question comprehension over long contexts.

This split is also the one that makes the heaviest use of metadata:

  • demo provides in-context demonstration material
  • keypoints stores auxiliary reference points for evaluation or inspection
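
A defensive sketch for pulling that auxiliary material out of the metadata (availability may differ between the two sources, so the code guards against null values):

from datasets import load_dataset

dataset = load_dataset("ai-hyz/MemoryAgentBench")

for row in dataset["Long_Range_Understanding"]:
    meta = row["metadata"]
    # Both fields may be null, so check before using them.
    demo = meta.get("demo")
    keypoints = meta.get("keypoints")
    if demo is not None or keypoints is not None:
        print(meta["source"], demo is not None, keypoints is not None)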

Conflict Resolution

This split contains 8 examples. Every example has 100 questions, and the sources vary by difficulty and context length:

  • factconsolidation_sh_*: single-hop conflict resolution
  • factconsolidation_mh_*: multi-hop conflict resolution
  • suffixes such as 6k, 32k, 64k, and 262k indicate different context length settings

Compared with AR, the challenge here is not only locating a fact but deciding which fact is still current after later evidence updates or overrides earlier statements.

Practical Usage Notes

Load the dataset

from datasets import load_dataset

dataset = load_dataset("ai-hyz/MemoryAgentBench")

accurate_retrieval = dataset["Accurate_Retrieval"]
test_time_learning = dataset["Test_Time_Learning"]
long_range_understanding = dataset["Long_Range_Understanding"]
conflict_resolution = dataset["Conflict_Resolution"]

Iterate safely over questions and answers

Because the dataset is row-oriented, most consumers will want to flatten each row into per-question records:

for row in dataset["Accurate_Retrieval"]:
    source = row["metadata"]["source"]
    # strict=True (Python 3.10+) raises if questions and answers ever fall
    # out of alignment instead of silently truncating the shorter list.
    for question, answer_candidates in zip(row["questions"], row["answers"], strict=True):
        # Compare your prediction against any normalized candidate in
        # `answer_candidates`; see the Accurate Retrieval notes above.
        ...

Treat each answers[i] as a set of acceptable gold strings, not as a single canonical answer. This matters most for ruler_qa1_197K, but using the same rule across all splits keeps the logic uniform.

Treat metadata as sparse and source-specific. Code should use defensive access patterns such as row["metadata"].get("question_types") instead of assuming every field exists for every source.
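
For example, a sketch that tallies question_types over the Accurate_Retrieval split without assuming the field is set for non-longmemeval sources:

from collections import Counter

from datasets import load_dataset

dataset = load_dataset("ai-hyz/MemoryAgentBench")

type_counts = Counter()
for row in dataset["Accurate_Retrieval"]:
    # question_types is only populated for the longmemeval_s* rows.
    question_types = row["metadata"].get("question_types")
    if question_types:
        type_counts.update(question_types)

print(type_counts)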