Key Takeaways
- The stateless nature of large language models leads to issues such as forgetting, lack of personalization, and degraded long-context performance. Memory modules are critical infrastructure for building production-grade AI Agents
- Memory systems are divided into short-term memory (session buffers, working memory) and long-term memory (summary memory, structured knowledge bases, vector storage). Memory generation, storage, and retrieval strategies must be designed according to specific scenarios
- Open-source solutions Mem0, Letta, and LangMem each have different focuses and can be deeply integrated with services such as Amazon Bedrock, Aurora, and OpenSearch. The managed solution Bedrock AgentCore Memory provides out-of-the-box short-term/long-term memory capabilities
The Memory Dilemma of Large Language Models
Large language models have demonstrated remarkable capabilities in text processing and generation, but they are fundamentally stateless systems. Each interaction with an LLM is an independent event, and the model itself does not possess the ability to “remember” historical conversations or accumulate experience. This fundamental limitation presents multiple challenges in practical applications:
Forgetting problems caused by context window limitations: LLMs process information through a limited Context Window, and all inputs (including prompts and historical conversation fragments) must fit within this window. Once the amount of information exceeds the window capacity, the model can no longer access this content, resulting in the so-called “forgetting” phenomenon. This poses a serious obstacle for application scenarios that require tracing historical context.
Limited capability for handling multi-turn complex tasks: For complex scenarios requiring cross-turn conversations, continuous state tracking, or execution of a series of subtasks, LLMs struggle to maintain coherence and task progress. In Agent application scenarios, tool definitions and tool return values occupy context space, and since Agents possess autonomous working capabilities, the average number of interactions with the LLM increases significantly, making this problem even more pronounced.
Difficulty achieving personalized services: Due to the inability to remember specific users’ historical preferences, usage habits, or past interactions, LLMs struggle to provide truly personalized experiences. Each interaction is like meeting for the first time, requiring users to repeatedly provide the same background information.
Dual pressure of performance and cost from long contexts: Processing longer contexts means more computation, leading to increased inference time and slower response speeds. Research shows that even as model context windows continue to expand, their ability to retrieve key information in ultra-long contexts may actually decline. A more direct impact is the escalation of token costs—the longer the context, the more input tokens, and the higher the cost per API call. For applications with high-frequency interactions or large text processing, this quickly accumulates into considerable operational costs.
Core Value of Memory Modules
The design goal of memory systems is precisely to overcome the inherent limitations of LLMs mentioned above, endowing intelligent agents with the following key capabilities:
- Long-term retention and efficient management: Store information beyond the LLM’s context window, enabling efficient retrieval and filtering to fundamentally solve information forgetting problems
- Continuous knowledge updates: Achieve self-improvement and knowledge iteration by storing interaction experiences with the environment and users
- Personalized services: Record user preferences and historical interactions to provide truly customized responses
- Complex task support: Track multi-Agent task progress and intermediate results to ensure coherent task completion and optimize decision paths
- Improved interaction quality: Maintain context coherence, support deep reasoning, and learn from mistakes through reflection mechanisms
AI Agent Memory Type System
An intelligent agent’s memory system is primarily divided into two major categories: short-term memory and long-term memory. This classification draws from cognitive science research on human memory.
Short-term Memory/Working Memory
Short-term Memory (STM) is the system by which an intelligent agent maintains the immediate context of current conversations and tasks, primarily consisting of two components:
- Session buffer (Context) memory: Maintains a rolling window of recent conversation history to ensure contextually relevant responses
- Working memory: Stores temporary information for current tasks, such as intermediate calculation results and variable values
Short-term memory is limited by context window size and is suitable for simple conversations and single-task scenarios. Its characteristics include fast access speed, limited capacity, and short lifecycle.
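To make this concrete, here is a minimal sketch of a session buffer in Python (class and parameter names are illustrative, not from any particular framework): the deque's maxlen enforces the rolling window, so the oldest turns fall away just as a bounded context would force.

```python
from collections import deque

class SessionBuffer:
    """Minimal rolling-window short-term memory: keeps the last N turns."""

    def __init__(self, max_turns: int = 10):
        self.turns = deque(maxlen=max_turns)  # oldest turns drop off automatically

    def add_turn(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def as_messages(self) -> list[dict]:
        """Return the window in the chat-message format most LLM APIs expect."""
        return list(self.turns)

buffer = SessionBuffer(max_turns=10)
buffer.add_turn("user", "My name is Alice.")
buffer.add_turn("assistant", "Nice to meet you, Alice!")
```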
Long-term Memory
Long-term Memory (LTM) is the form of memory that intelligent agents use to persistently store knowledge across sessions and tasks, corresponding to persistently stored memories in the human brain, such as factual knowledge and past experiences. Long-term memory implementation typically relies on external storage or knowledge bases, with specific forms including:
- Summary memory: Distills long conversation content into key summaries for storage, effectively compressing information volume
- Structured knowledge bases: Uses databases or knowledge graphs to store structured information, supporting precise queries
- Vector storage: Implements semantic-based memory retrieval through vector databases, supporting fuzzy matching
Long-term memory enables intelligent agents to accumulate experience and knowledge over time, making it particularly suitable for knowledge-intensive applications and scenarios requiring long-term personalization. In enterprise-level applications, long-term memory is often the key factor distinguishing ordinary chatbots from truly intelligent assistants.
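The vector-storage form of long-term memory can be sketched in a few lines. In the sketch below, embed() is a toy stand-in for a real embedding model (in production this would be, for example, a Bedrock-hosted embedding model), and the class is hypothetical scaffolding showing the remember/recall cycle:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a real embedding model: bag-of-bytes, L2-normalized
    # so that the dot product below equals cosine similarity.
    v = np.zeros(256)
    for b in text.lower().encode("utf-8"):
        v[b] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

class VectorMemory:
    """Minimal long-term memory: semantic recall via cosine similarity."""

    def __init__(self):
        self.entries: list[tuple[str, np.ndarray]] = []

    def remember(self, text: str) -> None:
        self.entries.append((text, embed(text)))

    def recall(self, query: str, top_k: int = 3) -> list[str]:
        q = embed(query)
        scored = sorted(self.entries, key=lambda e: float(e[1] @ q), reverse=True)
        return [text for text, _ in scored[:top_k]]
```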
Technical Considerations for Memory Management and Usage
When designing and developing Agent memory systems, four core dimensions must be comprehensively considered: memory content selection, write strategy design, memory structure organization, and retrieval implementation.
Memory Generation: Determining What Information Needs to Be Remembered
The primary task in building an intelligent agent memory system is to determine what information is worth remembering based on specific scenarios. These memories are often multi-dimensional and dynamic information structures, covering:
- Temporal dimension: Understanding time-dependent relationships and sequences
- Spatial dimension: Interpreting location-based and geometric relationships
- Participant states: Tracking multiple entities and their evolving conditions
- Intent context: Understanding goals, motivations, and implicit purposes
- Cultural context: Interpreting communication within specific social and cultural frameworks
Not all conversation content needs to be preserved long-term. Here is an analysis of memory priorities for four common scenarios:
Code assistant agents: Memory should focus on user project context and preferences, including codebase structure (file organization, module relationships), naming styles (variable naming conventions, code formatting styles), commonly used frameworks/libraries, and code snippets or instructions provided by users. With persistent memory, the AI can keep referencing previously stored project background, “remember” the user’s tech stack, maintain consistency in technical decisions, and spare developers from repeatedly explaining the project architecture or correcting deviations from their standards.
Intelligent customer service agents: Memory priorities are the user’s history and preferences, including their current task status, previously asked questions, fault records, product usage, service configurations, and solution records. When a user raises a similar question again, the system can recall earlier advice and the steps already attempted and address the current issue directly, yielding faster problem resolution and higher customer satisfaction.
Personal assistant agents: Memory priorities include user personal information and schedules, goals (such as fitness or learning plans), behavior patterns (such as which days of the week the user exercises), and preferences for applications and services (such as preferred reminder methods). As interactions accumulate, persistent long-term memory lets the agent steadily adapt to the user, gradually reducing dependence on explicit instructions and enabling more proactive, considerate service.
Recommendation service agents: Memory priorities include users’ explicit feedback (such as likes or explicitly expressing dislike for certain products) and implicit feedback (such as browsing history, click behavior, purchase history), using this to build interest profiles, continuously learn and adjust recommendation strategies, improving recommendation conversion rates and user loyalty.
Memory Strategy Design
An intelligent agent’s memory updates can be implemented through two mechanisms: turn-based triggering or event-based triggering:
- Turn-based triggering: Automatically generate summaries and store them in memory every 3-5 conversation turns, suitable for continuous conversation scenarios
- Event-based triggering: Record information at key nodes such as task completion or scene transitions. For example, customer service saves solutions when problem resolution is complete, or personal assistants write to the calendar after updating schedules
Developers can implement monitoring logic so that, when conversation accumulates or the topic shifts, the large model generates a summary of recent turns, extracts key information, and adds tags for easy retrieval. The system can also let users actively mark information to be remembered, for example through verbal commands or interface operations, and delete specific memories on request, ensuring user control over their data.
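A minimal sketch of such a dual-trigger write strategy might look like the following; all names are hypothetical, and summarize() stands in for an LLM call that would distill key facts and add tags:

```python
SUMMARY_INTERVAL = 4          # turn-based trigger threshold; tune per scenario
TRIGGER_EVENTS = {"task_completed", "topic_shift"}

def summarize(turns: list[dict]) -> str:
    # Stand-in: in practice, ask the LLM to compress the turns into key facts.
    return " | ".join(t["content"] for t in turns)[:200]

class MemoryWriter:
    """Sketch of a dual-trigger (turn-based + event-based) write strategy."""

    def __init__(self, store: list):
        self.store = store        # any long-term store; a plain list here
        self.pending: list[dict] = []

    def on_turn(self, role: str, content: str) -> None:
        self.pending.append({"role": role, "content": content})
        if len(self.pending) >= SUMMARY_INTERVAL:   # turn-based triggering
            self._flush(reason="interval")

    def on_event(self, event: str) -> None:
        if event in TRIGGER_EVENTS:                 # event-based triggering
            self._flush(reason=event)

    def _flush(self, reason: str) -> None:
        if self.pending:
            self.store.append({"summary": summarize(self.pending), "tag": reason})
            self.pending.clear()
```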
Memory Storage: Organizational Structure Design
Memory data typically adopts a three-tier structure of User → Session → Memory Fragment for management:
- User layer: Distinguishes different account spaces, implementing data isolation
- Session layer: Isolates conversation contexts, supporting session-level memory management
- Memory fragment layer: Stores specific content and metadata (such as time, keywords, source, etc.)
Complex systems may need to maintain multiple memory stores, including short-term working memory, long-term episodic memory, semantic knowledge bases, etc. Proper structural design facilitates rapid retrieval and effective management of memory content.
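Expressed as data structures, the three-tier organization could look like this (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class MemoryFragment:
    """Leaf tier: the stored content plus its metadata."""
    content: str
    created_at: str
    keywords: list[str] = field(default_factory=list)
    source: str = "conversation"

@dataclass
class Session:
    """Middle tier: isolates one conversation's context."""
    session_id: str
    fragments: list[MemoryFragment] = field(default_factory=list)

@dataclass
class UserMemory:
    """Top tier: one isolated memory space per account."""
    user_id: str
    sessions: dict[str, Session] = field(default_factory=dict)
```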
Memory Retrieval: Query and Recall Logic
Intelligent agents need to retrieve relevant information from the memory store based on current conversation intent. Main retrieval methods include:
- Keyword matching: Fast retrieval based on exact keywords
- Vector semantic search: Fuzzy matching based on semantic similarity
- Metadata filtering: Filtering based on metadata such as time, type, source, etc.
The system ranks retrieved memories by relevance, selects the most relevant content to add to the conversation context, and uses it to generate more accurate responses.
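In practice the three retrieval methods are often combined into a single hybrid pipeline. The sketch below reuses the embed() helper from the earlier vector-memory sketch; the fragment fields (content, created_at, vector) are assumptions carried over from the structure sketch above:

```python
def retrieve(fragments: list[dict], query: str, *,
             keywords: list[str] | None = None,
             after: str | None = None,
             top_k: int = 5) -> list[dict]:
    """Hybrid recall: metadata filter -> keyword match -> semantic ranking."""
    hits = fragments
    if after:                                           # metadata filtering
        hits = [f for f in hits if f["created_at"] >= after]
    if keywords:                                        # keyword matching
        hits = [f for f in hits if any(k in f["content"] for k in keywords)]
    q = embed(query)                                    # vector semantic search
    hits = sorted(hits, key=lambda f: float(f["vector"] @ q), reverse=True)
    return hits[:top_k]                                 # most relevant first
```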
Synergistic Relationship Between Context Engineering and Memory
Context Engineering forms a symbiotic relationship with memory systems, jointly supporting an intelligent agent’s cognitive capabilities. The memory system serves as an “information warehouse,” storing historical conversations, knowledge, and user preferences; while context engineering plays the role of “intelligent dispatcher,” deciding what information to retrieve from memory and how to organize and present it to the LLM.
The core of context engineering lies in the fact that an LLM’s performance and effectiveness fundamentally depend on the context it receives. Systems that implement context engineering generally contain three types of foundational components:
- Context retrieval and generation: Covers prompt generation and external knowledge acquisition
- Context processing: Involves long sequence processing, self-refinement, and structured information integration
- Context management: Focuses on memory hierarchies, compression techniques, and optimization strategies
From a technical definition perspective, context engineering reconceptualizes the context C as a set of dynamically structured information components c1, c2, …, cn. Each component is sourced, filtered, and formatted by its own function, and a high-level assembly function A orchestrates the final result: C = A(c1, c2, …, cn).
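As a toy illustration of the assembly function A, the sketch below packs already-sourced, filtered, and formatted components into a fixed token budget, highest priority first. The priority scheme and the len()-based token counting are stand-ins, not part of any formal definition:

```python
def assemble_context(components: list[tuple[int, str]], token_budget: int,
                     count_tokens=len) -> str:
    """Toy assembly function A: pack components c1..cn into the window,
    highest priority first, without exceeding the budget."""
    chosen, used = [], 0
    for priority, text in sorted(components, reverse=True):
        cost = count_tokens(text)   # len() is a crude token-count stand-in
        if used + cost <= token_budget:
            chosen.append(text)
            used += cost
    return "\n\n".join(chosen)

context = assemble_context(
    [(3, "System instructions..."), (2, "Retrieved memory..."), (1, "History...")],
    token_budget=2000,
)
```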
Context Engineering Practice Case
In an Agent project for automated document processing and generation, the input documents totaled more than 500 pages, far beyond the model’s maximum token limit, while the project demanded high recall and accuracy in the generated content. The following context engineering strategies were implemented to solve this problem:
- Document chunking: Split large document collections into appropriately sized chunks and store them in the file system
- Summary generation: Generate concise text summaries for each document chunk to provide content overviews, and generate summary information for the entire document
- Dynamic context management: Empower the Agent with autonomous selection capabilities, enabling it to dynamically retrieve relevant document chunks based on task requirements
- Context optimization: Automatically release contexts that are no longer needed after task completion to optimize resource utilization
This approach enabled the Agent to effectively process document collections exceeding the model’s context limit while maintaining high accuracy.
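A condensed sketch of the chunk-and-index step might look like this; the directory, chunk size, and the first-120-characters “summary” are placeholders (in practice the summary would be LLM-generated):

```python
import pathlib

CHUNK_DIR = pathlib.Path("chunks")   # hypothetical working directory

def chunk_and_index(text: str, chunk_size: int = 4000) -> list[dict]:
    """Split a large corpus into chunks on disk and keep only a lightweight
    summary index in the Agent's context; full chunks are loaded on demand
    and dropped from the prompt once a subtask completes."""
    CHUNK_DIR.mkdir(exist_ok=True)
    index = []
    for i in range(0, len(text), chunk_size):
        chunk = text[i:i + chunk_size]
        path = CHUNK_DIR / f"chunk_{i // chunk_size}.txt"
        path.write_text(chunk, encoding="utf-8")
        index.append({
            "path": str(path),
            # Stand-in summary (first 120 chars); use an LLM summary in practice.
            "summary": chunk[:120],
        })
    return index   # small enough to sit in the prompt; chunks load on demand
```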
In-Depth Analysis of Mainstream Memory Frameworks
Based on the design principles and core components outlined above, the industry has seen a variety of memory mechanism implementations emerge. The following provides a comparative analysis from two perspectives: open-source frameworks (Mem0, Letta, LangMem) and managed services (Bedrock AgentCore Memory).
Mem0: Intelligent Memory Management Framework
Mem0 is an open-source memory framework designed specifically for AI Agents, enabling state persistence through intelligent memory management. It supports multiple memory types including working memory, factual memory, episodic memory, and semantic memory, providing intelligent LLM-based extraction, filtering, and decay mechanisms that effectively reduce computational costs. It also supports multimodal processing and Graph memory capabilities, with options for both managed services and self-hosted deployment.
Core Architecture Modules:
- Core Memory Layer: Builds the core logic to determine the appropriate implementation for adding, retrieving, updating, and deleting memories
- Large Language Model Layer: Responsible for extracting key information from user input and generating decisions on how to update memories
- Embedding Model and Vector Storage Layer: Supports vectorized storage and retrieval of memories
- Graph Storage Layer: Stores extracted entity relationships, enriching the organizational structure of memories
- Persistent Storage Layer: Stores operational information for the memory system
Key Technical Innovations:
- Dual-LLM Architecture: Implements a division of labor through two separate LLM calls: the first focuses on information extraction, the second handles the memory-update decision
- Context-Aware Processing: Analyzes new data within the context of existing memories, ensuring consistency and coherence in the memory system
- Intelligent Deduplication Mechanism: Combines vector similarity search with LLM judgment to prevent redundant information storage
- Conflict Resolution Capability: When contradictory information appears, intelligently determines the appropriate action to retain, update, or delete
Integration with AWS Services:
- Model Services: Supports multiple Amazon Bedrock models, including Claude 3.7 Sonnet for complex reasoning and Titan Text Embeddings V2 for vectorization
- Vector Storage: Amazon Aurora PostgreSQL (Serverless v2), Amazon OpenSearch Service
- Graph Data Storage: Amazon Neptune Analytics
- Development Framework: AWS’s open-source Strands Agents framework ships a built-in mem0_memory tool based on Mem0
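A minimal Mem0-on-AWS sketch, assuming Mem0’s documented provider configuration scheme; the model IDs and OpenSearch endpoint below are placeholders to replace with your own resources:

```python
from mem0 import Memory

# Provider names follow Mem0's config scheme; model IDs and the OpenSearch
# endpoint are placeholders -- substitute your own Bedrock/OpenSearch setup.
config = {
    "llm": {"provider": "aws_bedrock",
            "config": {"model": "anthropic.claude-3-7-sonnet-20250219-v1:0"}},
    "embedder": {"provider": "aws_bedrock",
                 "config": {"model": "amazon.titan-embed-text-v2:0"}},
    "vector_store": {"provider": "opensearch",
                     "config": {"host": "https://my-domain.es.amazonaws.com",
                                "collection_name": "agent_memories"}},
}

m = Memory.from_config(config)
m.add("I prefer TypeScript and the repo uses pnpm workspaces.", user_id="alice")
hits = m.search("What package manager does the user use?", user_id="alice")
```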
Letta (formerly MemGPT): Virtual Memory Architecture
Letta’s design philosophy analogizes LLM agents to computer operating systems, adopting a “virtual memory” concept to manage agent memory. Its core innovation lies in a dual-layer memory architecture:
- In-Context Memory: System instructions, readable/writable memory blocks, and current conversations that exist directly within the model’s context window
- Out-of-Context Memory: Long-term storage for conversation history and external knowledge
When the context window approaches capacity, the system automatically compresses conversation history into recursive summaries and stores them as memory blocks, while preserving the original conversations for subsequent retrieval. Through tools such as core_memory_append, core_memory_replace, and recall, it enables memory editing and retrieval, allowing AI agents to maintain coherence across long-term interactions.
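A minimal sketch of creating and messaging a Letta agent via the letta_client SDK follows; the server URL and model handles are placeholders (a Bedrock-backed model endpoint would be configured on the server side):

```python
from letta_client import Letta

# Connect to a self-hosted Letta server (URL is a placeholder).
client = Letta(base_url="http://localhost:8283")

agent = client.agents.create(
    memory_blocks=[   # in-context, agent-editable memory blocks
        {"label": "persona", "value": "You are a patient e-commerce support agent."},
        {"label": "human", "value": "Customer: Alice; prefers email follow-ups."},
    ],
    model="openai/gpt-4o-mini",                   # placeholder model handle
    embedding="openai/text-embedding-3-small",    # placeholder embedding handle
)

reply = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Where is my order #1234?"}],
)
```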
Integration Example with AWS Service Stack (e-commerce customer service bot):
- Use Amazon Bedrock’s Claude or Titan models as the base LLM
- Employ Amazon Aurora PostgreSQL and Amazon OpenSearch Service as vector storage backends
- Leverage Amazon ElastiCache to improve efficiency in inference and Q&A scenarios
- Implement serverless memory management architecture through AWS Lambda
LangMem: LangChain Ecosystem Memory Solution
LangMem, developed by LangChain, aims to solve the “amnesia” problem in AI agents. It provides long-term memory capabilities for AI agents, enabling them to maintain knowledge continuity across sessions and remember user preferences, past interactions, and important facts.
Three Core Memory Types:
- Semantic Memory: Stores objective facts, user preferences, and foundational knowledge as long-term persistent memory embedded in system prompts. Can be saved via Collection mode to preserve complete historical information, or via Profile mode to retain only the latest state
- Episodic Memory: Captures the agent’s interaction experiences, storing not only conversation content but also complete context and reasoning processes. Serves as short-term memory primarily used for constructing user prompts
- Procedural Memory: Focuses on “how-to” operational knowledge, starting from initial system prompts and continuously optimizing through ongoing feedback and accumulated experience
Advanced Features: Active memory management, shared memory mechanisms, namespace organization, and personalized continuous evolution capabilities enable AI Agents to dynamically store information based on importance and support knowledge sharing among multiple agents.
LangMem primarily integrates with LangGraph and supports Amazon Bedrock. For memory storage, it provides a built-in InMemoryStore for rapid iteration and prototyping, as well as PostgreSQL support.
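A minimal LangMem-with-LangGraph sketch, assuming the langmem memory tools and a Bedrock-backed model via langchain-aws; the model IDs, vector dimension, and namespace layout are placeholders:

```python
from langchain_aws import BedrockEmbeddings, ChatBedrock
from langgraph.prebuilt import create_react_agent
from langgraph.store.memory import InMemoryStore
from langmem import create_manage_memory_tool, create_search_memory_tool

# InMemoryStore suits prototyping; swap in a PostgreSQL-backed store for production.
store = InMemoryStore(index={
    "dims": 1024,  # Titan Text Embeddings V2 vector size (placeholder config)
    "embed": BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0"),
})

agent = create_react_agent(
    ChatBedrock(model_id="anthropic.claude-3-5-sonnet-20240620-v1:0"),  # placeholder
    tools=[
        # These tools let the agent write and search its own long-term
        # memories, namespaced per user.
        create_manage_memory_tool(namespace=("memories", "{user_id}")),
        create_search_memory_tool(namespace=("memories", "{user_id}")),
    ],
    store=store,
)
```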
Amazon Bedrock AgentCore Memory: Managed Memory Solution
Compared to open-source frameworks, AWS provides a ready-to-use managed service through the memory module in Bedrock AgentCore, the AI Agent building platform, helping developers more quickly enable memory capabilities for AI Agents. No underlying resource operations are required—industry-leading memory systems can be integrated with a single click.
Dual-Mode Memory Architecture:
- Short-term memory: Records the most recent conversation turns within a session, ensuring the agent can “remember” the current conversation context
- Long-term memory: Extracts structured key information from conversations, retaining knowledge across multiple sessions, enabling the agent to “learn” user preferences, facts, summaries, and other information
Built-in Memory Strategies:
- SemanticMemoryStrategy: Extracts facts and knowledge from conversations for later querying
- SummaryMemoryStrategy: Generates conversation summaries for each session, distilling main content
- UserPreferenceMemoryStrategy: Captures user preferences, styles, and repeated choices
- CustomMemoryStrategy: Developers can provide custom prompts and select specific foundation models to perform memory extraction
Memory Usage Methods:
- Call list_events to retrieve short-term memory conversation records and append them to the LLM’s prompt
- Use the retrieve_memories interface to semantically query long-term memory
- Memory as Tool: Wrap Memory as a tool for LLM invocation by registering it through AgentCoreMemoryToolProvider, allowing the model to autonomously call the retrieve action to query memories or use the record action to store new information
All data is stored by AWS in encrypted form and isolated using namespaces, ensuring memory data from different applications or users remains separated.
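Putting these calls together, here is a minimal sketch assuming the bedrock-agentcore Python SDK’s MemoryClient; the memory ID, actor/session IDs, and namespace are placeholders:

```python
from bedrock_agentcore.memory import MemoryClient

client = MemoryClient(region_name="us-east-1")

# Short-term memory: record this turn, then fetch recent conversation events.
client.create_event(
    memory_id="mem-example-id",          # placeholder memory resource ID
    actor_id="alice",
    session_id="session-1",
    messages=[("I prefer window seats.", "USER")],
)
recent = client.list_events(
    memory_id="mem-example-id",
    actor_id="alice",
    session_id="session-1",
    max_results=10,
)

# Long-term memory: semantic query against extracted memory records.
prefs = client.retrieve_memories(
    memory_id="mem-example-id",
    namespace="/users/alice/preferences",  # placeholder namespace
    query="seating preferences",
)
```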
Technology Selection Recommendations
When choosing a memory framework, consider the following factors comprehensively:
- Customization Requirements: If deep customization of memory logic is needed, open-source frameworks (Mem0, Letta, LangMem) provide greater flexibility
- Operational Costs: Managed services (Bedrock AgentCore Memory) can significantly reduce operations and maintenance overhead, since the underlying infrastructure, scaling, and security are handled by AWS