# ReMe Short-Term Memory Example
This example demonstrates how to use `ReMeShortTermMemory` to manage an agent's short-term memory so that long, tool-heavy conversations stay within the model's context window.
As AI agents evolved from simple chatbots to sophisticated autonomous systems, the focus shifted from "prompt engineering" to "context engineering". While prompt engineering focused on crafting effective instructions for language models, context engineering addresses a more fundamental challenge: managing the ever-growing conversation and tool execution history that agents accumulate.
Agentic systems work by binding LLMs with tools and running them in a loop where the agent decides which tools to call and feeds results back into the message history. This creates a snowball effect: every iteration appends new tool calls and tool results, so the message history grows steadily over the course of a task.
When context grows too large, model performance degrades significantly, a phenomenon known as "context rot": the model increasingly overlooks or misuses information buried deep in a long history.
Agents face a critical tension: they need enough history to make informed decisions, yet an overloaded context degrades the very reasoning it is meant to support.
Context management aims to keep "just enough" information in the window—sufficient for effective decision-making while leaving room for retrieval and expansion, without overwhelming the model.
Effective short-term memory management is therefore essential for agents that run long, multi-step, tool-heavy tasks.
`ReMeShortTermMemory` implements proven context management strategies: compaction, which offloads verbose tool messages to external storage and keeps short previews in the active context, and compression, which uses an LLM to summarize older conversation history into compact state snapshots.
By implementing these strategies, ReMeShortTermMemory enables agents to handle arbitrarily long conversations and complex tasks while maintaining optimal performance throughout.
Install agentscope and ensure you have a valid DashScope API key in your environment variables.
Note: The example is built with the DashScope chat model. If you want to use OpenAI models instead, modify the model initialization in the example code accordingly (a sketch of the swap is included in the walkthrough below).
```bash
# Install agentscope from source
cd {PATH_TO_AGENTSCOPE}
pip install -e .

# Install dependencies
pip install reme-ai python-dotenv
```
Set up your API key:
```bash
export DASHSCOPE_API_KEY='YOUR_API_KEY'
```
Or create a `.env` file:
```
DASHSCOPE_API_KEY=YOUR_API_KEY
```
Run the example:
```bash
python short_term_memory_example.py
```
The example will:

- Register the custom `grep` and `read_file` tools
- Add initial messages containing an oversized tool response to trigger automatic compaction
- Run a `ReActAgent` query whose tool calls are managed by the short-term memory
This section provides a detailed walkthrough of the `short_term_memory_example.py` code, explaining how each component works together to create an agent with intelligent context management.
### `ReMeShortTermMemory` Class Parameters

The `ReMeShortTermMemory` class accepts the following initialization parameters:
- `model` (`DashScopeChatModel | OpenAIChatModel | None`): Language model for compression operations. Must be either `DashScopeChatModel` or `OpenAIChatModel`. This model is used for LLM-based compression when generating compact state snapshots. Required.
- `reme_config_path` (`str | None`): Optional path to a ReMe configuration file for custom settings. Use this to provide advanced ReMe configurations beyond the standard parameters. Default: `None`.
- `working_summary_mode` (`str`): Strategy for working memory management. Controls how the memory system handles context overflow. Default: `"auto"`.
  - `"compact"`: Only compact verbose tool messages by storing full content externally and keeping short previews in the active context.
  - `"compress"`: Only apply LLM-based compression to generate compact state snapshots of conversation history.
  - `"auto"`: First run compaction, then optionally run compression if the compaction ratio exceeds `compact_ratio_threshold`. This is the recommended mode for most use cases.
- `compact_ratio_threshold` (`float`): Threshold for compaction effectiveness in `"auto"` mode. If `(compacted_tokens / original_tokens) > compact_ratio_threshold`, compression is applied after compaction; for example, if compaction only shrinks 10,000 tokens to 8,000 (ratio 0.8 > 0.75), compression runs as well. This ensures compression only runs when compaction alone isn't sufficient. Valid range: 0.0 to 1.0. Default: `0.75`.
- `max_total_tokens` (`int`): Maximum token count threshold before compression is triggered. This limit does not include the `keep_recent_count` messages or system messages, which are always preserved. Should be set to 20%-50% of your model's context window size to leave room for new tool calls and responses. Default: `20000`.
- `max_tool_message_tokens` (`int`): Maximum token count for an individual tool message before compaction. Tool messages exceeding this limit are stored externally in files, with only a short preview kept in the active context. This is the maximum tolerable length for a single tool response. Default: `2000`.
- `group_token_threshold` (`int | None`): Maximum token count per compression group when splitting messages for LLM compression. When set to a positive integer, long message sequences are split into smaller batches for compression. If `None` or `0`, all messages are compressed in a single group. Use this to control the granularity of compression operations. Default: `None`.
- `keep_recent_count` (`int`): Number of most recent messages to preserve without compression or compaction. These messages remain in full in the active context to maintain conversation continuity and provide usage examples for the agent. The example uses `1` for demonstration purposes; in production, a value of `10` is recommended to maintain better conversation flow. Default: `10`.
- `store_dir` (`str`): Directory path for storing offloaded message content and compressed history files. This is where external files containing full tool responses and compressed message history are saved. The directory will be created automatically if it doesn't exist. Default: `"inmemory"`.
- `**kwargs` (`Any`): Additional arguments passed to `ReMeApp` initialization. Use this to pass any extra configuration options supported by the underlying ReMe application.
- **Token Budget Strategy**: Set `max_total_tokens` to 20%-50% of your model's context window. For example, if your model has a 128K context window, set `max_total_tokens` between 25,600 and 64,000 tokens.
- **Compaction vs. Compression**: Compaction offloads oversized tool messages to external files, while compression summarizes older history with the LLM; use `"auto"` mode to benefit from both strategies.
- **Recent Message Retention**: Higher `keep_recent_count` values (e.g., 10) provide better context continuity but consume more tokens. Lower values (e.g., 1) are more aggressive but may lose important recent context.
- **Tool Message Handling**: Adjust `max_tool_message_tokens` based on your typical tool response sizes. If your tools frequently return large outputs (e.g., file contents, API responses), consider a higher threshold or ensure compaction is enabled. A configuration following these guidelines is sketched below.
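Putting these guidelines together, a production-leaning configuration might look like the following sketch. The 128K context window and the concrete values are illustrative assumptions, and `llm` is assumed to be a chat model initialized as shown later in the walkthrough:

```python
# A hedged tuning sketch; all concrete numbers here are illustrative assumptions.
CONTEXT_WINDOW = 128_000  # assumed context window of the underlying model

short_term_memory = ReMeShortTermMemory(
    model=llm,                                   # DashScopeChatModel or OpenAIChatModel
    working_summary_mode="auto",                 # compaction first, then compression if needed
    max_total_tokens=int(CONTEXT_WINDOW * 0.3),  # 38,400 tokens, within the 20%-50% guideline
    max_tool_message_tokens=4000,                # raised for tools that return large outputs
    keep_recent_count=10,                        # production-recommended retention
    store_dir="memory_store",                    # arbitrary directory for offloaded content
)
```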
```mermaid
flowchart TD
    A[Start: Load Environment] --> B[Create Toolkit]
    B --> C[Register Tools: grep & read_file]
    C --> D[Initialize LLM Model]
    D --> E[Create ReMeShortTermMemory]
    E --> F[Enter Async Context Manager]
    F --> G[Add Initial Messages with Large Tool Response]
    G --> H[Memory Auto-Compacts Large Content]
    H --> I[Create ReActAgent with Memory]
    I --> J[User Sends Query]
    J --> K[Agent Uses Tools to Search/Read]
    K --> L[Tool Responses Added to Memory]
    L --> M{Memory Token Limit?}
    M -->|Exceeded| N[Auto-Compact/Summarize]
    M -->|OK| O[Agent Generates Response]
    N --> O
    O --> P[Return Response to User]
    P --> Q[Exit Context Manager]
    Q --> End[End]

    style H fill:#e1f5ff
    style N fill:#ffe1e1
    style O fill:#e1ffe1
```
The example demonstrates a complete workflow from tool registration to agent interaction. Here's a detailed breakdown:
```python
import asyncio
import os

from dotenv import load_dotenv

load_dotenv()
```
The code starts by loading environment variables (including the DashScope API key) from a `.env` file.
The example defines two custom tools that demonstrate how to integrate retrieval operations:
**`grep` Tool**: Searches for patterns in files using regular expressions
```python
from agentscope.message import TextBlock
from agentscope.tool import ToolResponse


async def grep(file_path: str, pattern: str, limit: str) -> ToolResponse:
    """A powerful search tool for finding patterns in files..."""
    from reme_ai.retrieve.working import GrepOp

    op = GrepOp()
    await op.async_call(file_path=file_path, pattern=pattern, limit=limit)
    return ToolResponse(
        content=[TextBlock(type="text", text=op.output)],
    )
```
**`read_file` Tool**: Reads specific line ranges from files
```python
async def read_file(file_path: str, offset: int, limit: int) -> ToolResponse:
    """Reads and returns the content of a specified file..."""
    from reme_ai.retrieve.working import ReadFileOp

    op = ReadFileOp()
    await op.async_call(file_path=file_path, offset=offset, limit=limit)
    return ToolResponse(
        content=[TextBlock(type="text", text=op.output)],
    )
```
> **Important Note on Tool Replaceability:**
>
> - The `grep` and `read_file` tools shown here are example implementations using ReMe's built-in operations.
> - You can replace them with your own retrieval tools, such as:
>   - Vector database embedding retrieval (e.g., ChromaDB, Pinecone, Weaviate)
>   - Web search APIs (e.g., Google Search, Bing Search)
>   - Database query tools (e.g., SQL queries, MongoDB queries)
>   - Custom domain-specific search solutions
> - Similarly, the offline write operations (used internally by `ReMeShortTermMemory` to store compacted content) can be customized by modifying the `write_text_file` function in AgentScope's tool system.
> - The key requirement is that your tools return `ToolResponse` objects with appropriate content blocks.
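For instance, a replacement backed by a vector database could follow the same shape as the built-in tools. In this sketch, `VectorStore` and its `query` method are hypothetical placeholders standing in for your actual client (ChromaDB, Pinecone, etc.); only the `ToolResponse` return contract comes from AgentScope:

```python
from agentscope.message import TextBlock
from agentscope.tool import ToolResponse

from my_vector_store import VectorStore  # hypothetical client, not a real package

store = VectorStore(collection="project_docs")  # hypothetical constructor


async def search_docs(query: str, top_k: int = 5) -> ToolResponse:
    """Retrieve the top-k document chunks most similar to the query."""
    hits = await store.query(query, top_k=top_k)  # assumed async query API
    text = "\n\n".join(hit.text for hit in hits)
    return ToolResponse(
        content=[TextBlock(type="text", text=text)],
    )
```

Such a tool is registered exactly like the built-in ones, e.g. `toolkit.register_tool_function(search_docs)`.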
```python
from agentscope.model import DashScopeChatModel

llm = DashScopeChatModel(
    model_name="qwen3-coder-30b-a3b-instruct",
    api_key=os.environ.get("DASHSCOPE_API_KEY"),
    stream=False,
    generate_kwargs={
        "temperature": 0.001,
        "seed": 0,
    },
)
```
The model is configured with low temperature for consistent, deterministic responses. This same model will be used for both agent reasoning and memory summarization operations.
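As mentioned in the setup notes, you can swap in an OpenAI model instead. A hedged sketch of that swap is shown below; the model name is only an example, and you should check your AgentScope version for the exact constructor parameters:

```python
import os

from agentscope.model import OpenAIChatModel

# Illustrative alternative to DashScopeChatModel; "gpt-4o-mini" is an example name.
llm = OpenAIChatModel(
    model_name="gpt-4o-mini",
    api_key=os.environ.get("OPENAI_API_KEY"),
    stream=False,
)
```

If you switch models, remember to use a matching formatter (e.g., AgentScope's `OpenAIChatFormatter` instead of `DashScopeChatFormatter`) when constructing the agent.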
```python
short_term_memory = ReMeShortTermMemory(
    model=llm,
    working_summary_mode="auto",     # Automatic memory management
    compact_ratio_threshold=0.75,    # Compress when compaction keeps > 75% of tokens
    max_total_tokens=20000,          # Set to 20%-50% of model's context window
    max_tool_message_tokens=2000,    # Maximum tolerable tool response length
    group_token_threshold=None,      # Max tokens per LLM compression batch; None means no splitting
    keep_recent_count=1,             # Keep 1 recent message intact (1 for demo; use 10 in production)
    store_dir="inmemory",            # Storage directory for offloaded content
)
```
This configuration enables automatic memory management that will:

- Compact tool messages whose length exceeds `max_tool_message_tokens`
- Compress conversation history when the total exceeds `max_total_tokens` and the compaction ratio exceeds `compact_ratio_threshold`

```python
async with short_term_memory:
    # All memory operations happen here
    ...
```

The `async with` statement ensures proper initialization and cleanup of memory resources. This is the recommended approach for using `ReMeShortTermMemory`.
The example demonstrates memory compaction by adding a large tool response:
```python
# Read README content and repeat it 10 times to simulate a large response
with open("../../../../README.md", encoding="utf-8") as f:
    readme_content = f.read()

memories = [
    {
        "role": "user",
        "content": "搜索下项目资料",  # "Search the project materials"
    },
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [...],  # Tool call metadata
    },
    {
        "role": "tool",
        "content": readme_content * 10,  # Large tool response (10x README)
        "tool_call_id": "call_6596dafa2a6a46f7a217da",
    },
]

await short_term_memory.add(
    ReMeShortTermMemory.list_to_msg(memories),
    allow_duplicates=True,
)
```
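`list_to_msg` is a convenience helper that converts plain role/content dictionaries into AgentScope `Msg` objects before storage. Assuming `add` follows the standard AgentScope memory interface, `Msg` objects can likely be passed directly as well, as in this sketch:

```python
from agentscope.message import Msg

# Sketch: adding a Msg directly (assumes the standard memory `add` interface).
await short_term_memory.add(
    Msg(role="user", content="Search the project materials", name="user"),
)
```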
When this large content is added, `ReMeShortTermMemory` will:

- Detect that the tool message exceeds `max_tool_message_tokens` (the maximum tolerable tool response length, set to 2000 in this example)
- Offload the full content to a file under `store_dir`, keeping only a short preview in the active context

The agent is then created with the memory attached:

```python
agent = ReActAgent(
    name="react",
    sys_prompt=(
        "You are a helpful assistant. "
        # Translation of the Chinese instructions below:
        # "Tool call results may be cached locally. You can first use `Grep` to
        # locate lines matching keywords or a regular expression, then use
        # `ReadFile` to read the code near those locations."
        "工具调用的调用可能会被缓存到本地。"
        "可以先使用`Grep`匹配关键词或者正则表达式所在行数,然后通过`ReadFile`读取位置附近的代码。"
        # ... more instructions
    ),
    model=llm,
    formatter=DashScopeChatFormatter(),
    toolkit=toolkit,
    memory=short_term_memory,  # Memory is integrated here
    max_iters=20,
)
```
The agent is configured with:

- The `grep` and `read_file` tools registered in the toolkit
- The `short_term_memory` instance for automatic context management

```python
msg = Msg(
    role="user",
    # "In the project materials, who is the first author of the agentscope_v1 paper?"
    content="项目资料中,agentscope_v1论文的一作是谁?",
    name="user",
)
msg = await agent(msg)
print(f"✓ Agent response: {msg.get_text_content()}\n")
```
When the agent processes this message:

1. It calls `grep` to search for "agentscope_v1" in the project materials
2. It uses `read_file` to read the relevant sections near the matches
3. The tool responses are added to memory, where they are compacted or compressed automatically if they exceed the configured limits
4. The agent generates its final answer from the managed context

The overall structure of `main()` ties these pieces together:

```python
async def main() -> None:
    # 1. Create toolkit and register tools
    toolkit = Toolkit()
    toolkit.register_tool_function(grep)
    toolkit.register_tool_function(read_file)

    # 2. Initialize LLM
    llm = DashScopeChatModel(...)

    # 3. Create short-term memory
    short_term_memory = ReMeShortTermMemory(...)

    # 4. Use async context manager
    async with short_term_memory:
        # 5. Add initial messages (with large content to demo compaction)
        await short_term_memory.add(messages, allow_duplicates=True)

        # 6. Create agent with memory
        agent = ReActAgent(..., memory=short_term_memory, ...)

        # 7. Interact with agent
        response = await agent(user_message)


asyncio.run(main())
```
Key points to remember:

- Tools return `ToolResponse` objects that are seamlessly integrated into memory
- Use `async with short_term_memory:` to ensure proper resource management
- The `grep` and `read_file` tools are examples; you can replace them with any retrieval mechanism that fits your use case

### Async Context Manager Usage

`ReMeShortTermMemory` implements the async context manager protocol, which ensures proper initialization and cleanup of resources. There are two ways to use it:
#### Using the `async with` Statement

The recommended approach is to use the `async with` statement, which automatically handles resource management:
```python
async with short_term_memory:
    # Memory is initialized here
    await short_term_memory.add(messages)
    response = await agent(msg)
# Memory is automatically cleaned up when exiting the block
```
#### Manual `__aenter__` and `__aexit__` Calls

You can also manually call `__aenter__` and `__aexit__` if you need more control:
```python
# Manually initialize
await short_term_memory.__aenter__()
try:
    # Use the memory
    await short_term_memory.add(messages)
    response = await agent(msg)
finally:
    # Manually clean up
    await short_term_memory.__aexit__(None, None, None)
```
> **Note:** It's recommended to use the `async with` statement as it ensures proper resource cleanup even if an exception occurs.
You can customize the ReMe config by passing a config path:
```python
short_term_memory = ReMeShortTermMemory(
    model=llm,
    reme_config_path="path/to/your/config.yaml",  # Pass your custom ReMe configuration
    # ... other parameters
)
```
For more configuration options, refer to the ReMe documentation.
The `short_term_memory_example.py` file demonstrates:

- Custom `grep` and `read_file` tools for searching and reading files
- Automatic short-term memory management with `ReMeShortTermMemory`

The example shows a typical workflow:

1. Use the `grep` tool to find relevant content
2. Use the `read_file` tool to read specific sections