Build agents to learn from experiences using Amazon Bedrock AgentCore episodic memory
Today, most agents operate only on what’s visible in the current interaction: they can access facts and knowledge, but they can’t remember how they solved similar problems before or why certain approaches worked or failed. This creates a significant gap in their ability to learn and improve over time. Amazon Bedrock AgentCore episodic memory addresses this limitation by capturing and surfacing experience-level knowledge for AI agents. Although semantic memory helps an agent remember what it knows, episodic memory documents how it arrived there: the goal, reasoning steps, actions, outcomes, and reflections. By converting each interaction into a structured episode, you can enable agents to recall knowledge and interpret and apply prior reasoning. This helps agents adapt across sessions, avoid repeating mistakes, and evolve their planning over time.
Amazon Bedrock AgentCore Memory is a fully managed service that helps developers create context-aware AI agents through both short-term memory and long-term intelligent memory capabilities. To learn more, see Amazon Bedrock AgentCore Memory: Building context-aware agents and Building smarter AI agents: AgentCore long-term memory deep dive.
In this post, we walk you through the complete architecture to structure and store episodes, discuss the reflection module, and share compelling benchmarks that demonstrate significant improvements in agent task success rates.
Key challenges in designing agent episodic memory
Episodic memory enables agents to retain and reason over their own experiences. However, designing such a system requires solving several key challenges to make sure experiences remain coherent, evaluable, and reusable:
- Maintaining temporal and causal coherence – Episodes need to preserve the order and cause-effect flow of reasoning steps, actions, and outcomes so the agent can understand how its decisions evolved.
- Detecting and segmenting multiple goals – Sessions often involve overlapping or shifting goals. The episodic memory must identify and separate them to avoid mixing unrelated reasoning traces.
- Learning from experience – Each episode should be evaluated for success or failure. Reflection should then compare similar past episodes to identify generalizable patterns and principles, enabling the agent to adapt those insights to new goals rather than replaying prior trajectories.
In the next section, we describe how to build an AgentCore episodic memory strategy, covering its extraction, storage, retrieval, and reflection pipeline and how these components work together to help transform experience into adaptive intelligence.
How AgentCore episodic memory works
When your agentic application sends conversational events to AgentCore Memory, raw interactions get transformed into rich episodic memory records through an intelligent extraction and reflection process. The following diagram illustrates how this episodic memory strategy works and how simple agent conversations become meaningful, reflective memories that shape future interactions.
The following diagram illustrates the detailed data flow of the same architecture with more elaborate details.
The preceding diagrams illustrate the different steps in the episodic memory strategy. The first two steps (marked pink and purple) are grouped together as a two-stage approach of the episode extraction module that serves distinct but complementary purposes. The third step (marked as blue) is the reflection module, which helps the agent learn from the past experience. In the following sections, we discuss the steps in detail.
Episode extraction module
The episode extraction module is the foundational step in the episodic strategy that transforms raw user-agent interaction data into structured, meaningful episodes. We follow a two-stage approach where the stages are designed to capture both granular step-wise mechanics of each interaction (called turn extraction) and broader episode-wise knowledge to create coherent narratives (called episode extraction). To make an analogy, think of it in terms of taking notes during a meeting (turn level) and writing the meeting summary at the end of the meeting (episode). Both stages are valuable but serve different purposes when learning from experience.
In the first stage of episode extraction, the system performs turn-level processing to understand what went right or wrong. Here, single exchange units between the user and the agent called conversational turns are identified, segmented, and transformed into structured summaries in the following dimensions:
- Turn situation – A brief description of the circumstances and context that the assistant is responding to in this turn. This includes the immediate context, the user’s overarching objectives that might span multiple turns, and the relevant history from previous interactions that informed the current exchange.
- Turn intent – The assistant’s specific purpose and primary goal for this turn, essentially answering the question “What was the assistant trying to accomplish in this moment?”
- Turn action – A detailed record of the concrete steps taken during the interaction, documenting which specific tools were used, what input arguments or parameters were provided to each tool, and how the assistant translated intent into executable actions.
- Turn thought – The reasoning behind the assistant’s decisions, explaining the “why” behind tool selection and approach.
- Turn assessment – An honest evaluation of whether the assistant successfully achieved its stated goal for this specific turn, providing immediate feedback on the effectiveness of the chosen approach and actions taken.
- Goal assessment – A broader perspective on whether the user’s overall objective across the entire conversation appears to be satisfied or progressing toward completion, looking beyond individual turns to evaluate holistic success.
After processing and structuring individual turns, the system proceeds to the episode extraction stage, when a user completes their goal (detected by the large language model) or an interaction ends. This helps capture the complete user journey, because a user’s goal often spans multiple turns and individual turn data alone can’t convey whether the overall objective was achieved or what the holistic strategy looked like. In this stage, sequentially related turns are synthesized into coherent episodic memories that capture complete user journeys, from initial request to final resolution:
- Episode situation – The broader circumstances that initiated the user’s need for assistance
- Episode intent – A clear articulation of what the user ultimately wanted to accomplish
- Success evaluation – A definitive assessment of whether the conversation achieved its intended purpose for each episode
- Evaluation justification – Concrete reasoning for success or failure assessments, grounded in specific conversational moments that demonstrate progress toward or away from user goals
- Episode insights – Insights capturing proven effective approaches and identifying pitfalls to avoid for the current episode
Reflection module
The reflection module highlights the ability of Amazon Bedrock AgentCore episodic memory to learn from past experiences and generate insights that help improve future performance. This is where individual episode learnings evolve into generalizable knowledge that can guide agents across diverse scenarios.
The reflection module operates through cross-episodic reflection, retrieving past similar successful episodes based on user intent and reflecting across multiple episodes to achieve more generalizable insights. When new episodes are processed, the system performs the following actions:
- Using the user intent as a semantic key, the system identifies historically successful and relevant episodes from the vector store that share similar goals, contexts, or problem domains.
- The system analyzes patterns across the main episode and relevant episodes, looking for transferable insights about what approaches work consistently across different contexts.
- Existing reflection knowledge is reviewed and either enhanced with new insights or expanded with entirely new patterns discovered through cross-episodic analysis.
At the end of the process, each reflection memory record contains the following information:
- Use case – When and where the insight applies, including relevant user goals and trigger conditions
- Hints (insights) – Actionable guidance covering tool selection strategies, effective approaches, and pitfalls to avoid
- Confidence scoring – A score (0.1–1.0) indicating how well the insight generalizes across different scenarios
Episodes provide agents with concrete examples of how similar problems were solved before. These case studies show the specific tools used, reasoning applied, and outcomes achieved, including both successes and failures. This creates a learning framework where agents can follow proven strategies and avoid documented mistakes.
Reflection memories extract patterns from multiple episodes to deliver strategic insights. Instead of individual cases, they reveal which tools work best, what decision-making approaches succeed, and which factors drive outcomes. These distilled principles give agents higher-level guidance for navigating complex scenarios.
Custom override configurations
Although built-in memory strategies cover the common use cases, many domains require tailored approaches for memory processing. The system supports built-in strategy overrides through custom prompts that extend the built-in logic, helping teams adapt memory handling to their specific requirement. You can implement the following custom override configurations:
- Custom prompts – These prompts focus on criteria and logic rather than output formats and help developers define the following:
- Extraction criteria – What information gets extracted or filtered out.
- Consolidation rules – How related memories should be consolidated.
- Conflict resolution – How to handle contradictory information.
- Insight generation – How cross-episode reflections are synthesized.
- Custom model: AgentCore Memory supports custom model selection for memory extraction, consolidation, and reflection operations. This flexibility helps developers balance accuracy and latency based on their specific requirements. You can define them using APIs when you create the
_memory_resource_as a strategy override or through the Amazon Bedrock AgentCore console (as shown in the following screenshot). - Namespaces: Namespaces provide a hierarchical organization for episodes and reflections, enabling access to your agent’s experiences at different levels of granularity and providing a seamless natural logical grouping. For instance, to design a namespace for a travel application, episodes could be stored under
travel_booking/users/userABC/episodesand reflections could reside attravel_booking/users/userABC. Note that the namespace for reflections must be a sub-path of the namespace for episodes.
Performance evaluation
We evaluated Amazon Bedrock AgentCore episodic memory on real-world goal completion benchmarks from the retail and airline domain (sampled from τ2-bench). These benchmarks contain tasks that mirror actual customer service scenarios where agents need to help users achieve specific goals.
We compared three different setups in our experiments:
- For the baseline, we ran the agent (built with Anthropic’s Claude 3.7) without interacting with the memory component.
- For memory-augmented agents, we explored two methods of using memories:
- In-context learning examples – The first method uses extracted episodes as in-context learning examples. Specifically, we constructed a tool named
retrieve_exemplars(tool definition in appendix) that agents can use by issuing a query (for example, “how to get refund?”) to get step-by-step instructions from the episodes repository. When agents face similar problems, the retrieved episodes will be added into the context to guide the agent to take the next action. - Reflection-as-guidance – The second method we explored is reflection-as-guidance. Specifically, we construct a tool named
retrieve_reflections(tool definition in appendix) that agents can use to access broader insights from past experiences. Similar toretrieve_exemplars, the agent can generate a query to retrieve reflections as context, gaining insights to make informed decisions about strategy and approach rather than specific step-by-step actions.
- In-context learning examples – The first method uses extracted episodes as in-context learning examples. Specifically, we constructed a tool named
We used the following evaluation methodology:
- The baseline agent first processes a set of historical customer interactions, which become the source for memory extraction.
- The agent then receives new user queries from τ2-bench.
- Each query is attempted four times in parallel.
- To evaluate, pass rate metrics are measured across these four attempts. Pass^k measures the percentage of tasks where the agent succeeded in at least k out of four attempts:
- Pass^1: Succeeded at least once (measures capability)
- Pass^2: Succeeded at least twice (measures reliability)
- Pass^3: Succeeded at least three times (measures consistency)
The results in the following table show clear improvements across both domains and multiple attempts.
| System | Memory Type used by Agent | Retail | |||||
|---|---|---|---|---|---|---|---|
| Pass^1 | Pass^2 | Pass^3 | Pass^1 | Pass^2 | Pass^3 | ||
| Baseline | No Memory | 65.80% | 49.70% | 42.10% | 47% | 33.30% | 24% |
| Memory-Augmented Agent | Episodes as ICL Example | 69.30% | 53.80% | 43.40% | 55.00% | 46.70% | 43.00% |
| Cross Episodes Reflection Memory | 77.20% | 64.30% | 55.70% | 58% | 46% | 41% | |
Memory-augmented agents consistently outperform the baseline across domains and consistency levels. Crucially, these results demonstrate that different memory retrieval strategies are better suited to different task characteristics. Cross-episode reflection improved Pass^1 by +11.4% and Pass^3 by +13.6% over the baseline, suggesting that generalized strategic insights are particularly valuable when handling open-ended customer service scenarios with diverse interaction patterns. In contrast, the airline domain – characterized by complex, rule-based policies and multi-step procedures—benefits more from episodes as examples, which achieved the highest Pass^3 (43.0% vs 41.0% for reflection). This indicates that concrete step-by-step examples help agents navigate structured workflows reliably. The relative improvement is most pronounced at higher consistency thresholds (Pass^3), where memory helps agents avoid the mistakes that cause intermittent failures.
Best practices for using episodic memory
The key to effective episodic memory is knowing when to use it and which type fits your situation. In this section, we discuss what we’ve learned works best.
When to use episodic memory
Episodic memory delivers the most value when you match the right memory type to your current need. It is ideal for complex, multi-step tasks where context matters and past experience matters significantly, such as debugging code, planning trips, and analyzing data. It’s also particularly valuable for repetitive workflows where learning from previous attempts can dramatically improve outcomes, and for domain-specific problems where accumulated expertise makes a real difference.
However, episodic memory isn’t always the right choice. You can skip it for simple, one-time questions like weather checks or basic facts that don’t need reasoning or context. Simple customer service conversations, basic Q&A, or casual chats don’t need the advanced features that episodic memory adds. The true benefit of episodic memory is observed over time. For short tasks, a session summary provides sufficient information. However, for complex tasks and repetitive workflows, episodic memory helps agents build on past experiences and continuously improve their performance.
Choosing episodes vs. reflection
Episodes work best when you’re facing similar specific problems and need clear guidance. If you’re debugging a React component that won’t render, episodes can show you exactly how similar problems were fixed before, including the specific tools used, thinking process, and results. They give you real examples when general advice isn’t enough, showing the complete path from finding the problem to solving it.
Reflection memories work best when you need strategic guidance across broader contexts rather than specific step-by-step solutions. Use reflections when you’re facing a new type of problem and need to understand general principles, like “What’s the most effective approach for data visualization tasks?” or “Which debugging strategies tend to work best for API integration issues?” Reflections are particularly valuable when you’re making high-level decisions about tool selection and which method to follow, or understanding why certain patterns consistently succeed or fail.
Before starting tasks, check reflections for strategy guidance, look at similar episodes for solution patterns, and find high-confidence mistakes documented in previous attempts. During tasks, look at episodes when you hit roadblocks, use reflection insights for tool choices, and think about how your current situation differs from past examples.
Conclusion
Episodic memory fills a critical gap in current agent capabilities. By storing complete reasoning paths and learning from outcomes, agents can avoid repeating mistakes and build on successful strategies.
Episodic memory completes the memory framework of Amazon Bedrock AgentCore alongside summarization, semantic, and preference memory. Each serves a specific purpose: summarization manages context length, semantic memory stores facts, preference memory handles personalization, and episodic memory captures experience. The combination helps give agents both structured knowledge and practical experience to handle complex tasks more effectively.
To learn more about episodic memory, refer to Episodic memory strategy, How to best retrieve episodes to improve agentic performance, and the AgentCore Memory GitHub samples.
Appendix
In this section, we discuss two methods of using memories for memory-augmented agents.
Episode example
The following is an example using extracted episodes as in-context learning examples:
Reflection example
The following is an example of Reflection memory, which can be used for agent guidance:
Tool definitions
The following code is the tool definition for retrieve_exemplars:
def retrieve_exemplars(task: str) -> str:
"""
Retrieve example processes to help solve the given task.
Args:
task: The task to solve that requires example processes.
Returns:
str: The example processes to help solve the given task.
"""
The following is the tool definition for retrieve_reflections:
def retrieve_reflections(task: str, k: int = 5) -> str:
"""
Retrieve synthesized reflection knowledge from past agent experiences by matching
against knowledge titles and use cases. Each knowledge entry contains: (1) a descriptive title,
(2) specific use cases describing the types of goals where this knowledge applies and when to apply it,
and (3) actionable hints including best practices from successful episodes and common pitfalls to avoid
from failed episodes. Use this to get strategic guidance for similar tasks.
Args:
task: The current task or goal you are trying to accomplish. This will be matched
against knowledge titles and use cases to find relevant reflection knowledge. Describe your task
clearly to get the most relevant matches.
k: Number of reflection knowledge entries to retrieve. Default is 5.
Returns:
str: The synthesized reflection knowledge from past agent experiences.
"""
About the Authors








