docs/examples/memory/custom_multi_turn_memory.ipynb
Recent research has shown that LLM performance can degrade significantly over long multi-turn conversations.
To help avoid this, we can implement custom short-term and long-term memory in LlamaIndex, ensuring that the conversation never accumulates too many turns by condensing the history as we go.
Using the code from this notebook, you may see improvements in your own agents, since it limits how many turns end up in your chat history.
NOTE: This notebook was tested with llama-index-core>=0.12.37, as that version included some fixes to make this work nicely.
%pip install -U llama-index-core llama-index-llms-openai
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
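If you want to double-check the installed version against the note above, here's a quick optional snippet (a minimal sketch using the standard library's package metadata):
from importlib.metadata import version

# should print 0.12.37 or later for this notebook to work as described
print(version("llama-index-core"))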
To make this work, we need two things:
1. A custom memory block that condenses incoming chat messages into a single string while keeping that string under a token limit
2. A Memory instance that uses that memory block, and has token limits configured such that multi-turn conversations are always flushed to the memory block for handling

First, the custom memory block:
import tiktoken
from pydantic import Field
from typing import List, Optional, Any
from llama_index.core.llms import ChatMessage, TextBlock
from llama_index.core.memory import Memory, BaseMemoryBlock
class CondensedMemoryBlock(BaseMemoryBlock[str]):
    current_memory: List[str] = Field(default_factory=list)
    token_limit: int = Field(default=50000)
    tokenizer: tiktoken.Encoding = tiktoken.encoding_for_model(
        "gpt-4o"
    )  # all openai models use 4o tokenizer these days

    async def _aget(
        self, messages: Optional[List[ChatMessage]] = None, **block_kwargs: Any
    ) -> str:
        """Return the current memory block contents."""
        return "\n".join(self.current_memory)

    async def _aput(self, messages: List[ChatMessage]) -> None:
        """Push messages into the memory block. (Only handles text content)"""
        # construct a string for each message
        for message in messages:
            text_contents = "\n".join(
                block.text
                for block in message.blocks
                if isinstance(block, TextBlock)
            )
            memory_str = f"<message role={message.role}>"
            if text_contents:
                memory_str += f"\n{text_contents}"

            # include additional kwargs, like tool calls, when needed
            # filter out injected session_id
            kwargs = {
                key: val
                for key, val in message.additional_kwargs.items()
                if key != "session_id"
            }
            if kwargs:
                memory_str += f"\n({kwargs})"

            memory_str += "\n</message>"
            self.current_memory.append(memory_str)

        # ensure this memory block doesn't get too large
        message_length = sum(
            len(self.tokenizer.encode(message))
            for message in self.current_memory
        )
        while message_length > self.token_limit:
            self.current_memory = self.current_memory[1:]
            message_length = sum(
                len(self.tokenizer.encode(message))
                for message in self.current_memory
            )
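Before wiring the block into a Memory instance, we can sanity-check it on its own. This is just an illustrative sketch that calls the _aput and _aget methods defined above directly; the demo_block object here is throwaway and isn't used later in the notebook.
# create a throwaway block and push a single message through it
demo_block = CondensedMemoryBlock(name="demo")
await demo_block._aput(
    [ChatMessage(role="user", content="Hi! Just testing the block.")]
)

# print the condensed string representation
print(await demo_block._aget())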
And then, a Memory instance that uses that block, configured with a very small short-term token limit (via a tiny chat_history_token_ratio) so that nearly every turn overflows the short-term buffer and gets flushed into the block:
block = CondensedMemoryBlock(name="condensed_memory")

memory = Memory.from_defaults(
    session_id="test-mem-01",
    token_limit=60000,
    token_flush_size=5000,
    async_database_uri="sqlite+aiosqlite:///:memory:",
    memory_blocks=[block],
    insert_method="user",
    # Prevent the short-term chat history from containing too many turns!
    # This limit will effectively mean that the short-term memory is always flushed
    chat_history_token_ratio=0.0001,
)
Let's explore using this with some dummy messages, and observe how the memory is managed.
initial_messages = [
    ChatMessage(role="user", content="Hello! My name is Logan"),
    ChatMessage(role="assistant", content="Hello! How can I help you?"),
    ChatMessage(role="user", content="What is the capital of France?"),
    ChatMessage(role="assistant", content="The capital of France is Paris"),
]
await memory.aput_messages(initial_messages)
Then, let's add our next user message!
await memory.aput_messages(
    [ChatMessage(role="user", content="What was my name again?")]
)
With that, we can explore what the chat history looks like before sending to an LLM.
chat_history = await memory.aget()

for message in chat_history:
    print(message.role)
    print(message.content)
    print()
Great! Even though we added many messages, they all get condensed into a single user message!
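If you're curious what that condensed content looks like under the hood, you can also peek at the raw strings stored on the block itself (an optional, illustrative check; current_memory is the field we defined on CondensedMemoryBlock above):
# each entry is one "<message role=...>...</message>" string built by _aput
for entry in block.current_memory:
    print(entry)
    print()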
Let's try with an actual agent next.
Here, we can create a FunctionAgent with some simple tools that uses our memory.
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b


def divide(a: float, b: float) -> float:
    """Divide two numbers."""
    return a / b


def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b


def subtract(a: float, b: float) -> float:
    """Subtract two numbers."""
    return a - b
llm = OpenAI(model="gpt-4.1-mini")
agent = FunctionAgent(
    tools=[multiply, divide, add, subtract],
    llm=llm,
    system_prompt="You are a helpful assistant that can do simple math operations with tools.",
)
block = CondensedMemoryBlock(name="condensed_memory")

memory = Memory.from_defaults(
    session_id="test-mem-01",
    token_limit=60000,
    token_flush_size=5000,
    async_database_uri="sqlite+aiosqlite:///:memory:",
    memory_blocks=[block],
    insert_method="user",
    # Prevent the short-term chat history from containing too many turns!
    # This limit will effectively mean that the short-term memory is always flushed
    chat_history_token_ratio=0.0001,
)
resp = await agent.run("What is (3214 * 322) / 2?", memory=memory)
print(resp)
current_chat_history = await memory.aget()

for message in current_chat_history:
    print(message.role)
    print(message.content)
    print()
Perfect! Since the memory didn't have a new user message yet, it inserted one containing the current memory contents. On the next user message, that memory and the new user message will get combined, just as we saw earlier.
Let's try a few follow-ups to confirm this is working properly.
resp = await agent.run(
    "What was the last question I asked you?", memory=memory
)
print(resp)
resp = await agent.run(
    "And how did you go about answering that message?", memory=memory
)
print(resp)
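As a final optional check, we can inspect the memory once more after these follow-ups, using the same pattern as before, to confirm that the full exchange keeps getting condensed into the condensed_memory block:
final_chat_history = await memory.aget()

for message in final_chat_history:
    print(message.role)
    print(message.content)
    print()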