25.5 Memory: Conversation Buffer, Summary, and Vector Store Memory
Right, so you’ve got your LLM chain set up. It takes a prompt, it gives a response. It’s clever, but it’s got the memory of a goldfish. You ask it “What did I just say?” and it stares back blankly. For a conversation, this is useless. This is where LangChain’s memory systems come in—they’re the duct tape and baling wire we use to give our stateless LLMs a semblance of a past.
The core idea is painfully simple: we stuff the history of the conversation into the prompt we send to the LLM. How we do that, and how well, is where the nuance and the trade-offs live. We’re going to look at three primary methods: the “kitchen sink” approach (keep everything), the abstractive approach (summarize it), and the retrieval approach (look up only what’s relevant). Buckle up.
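To make the "stuff it into the prompt" idea concrete, here is a minimal sketch in plain Python with no LangChain involved. The build_prompt helper and the preamble text are illustrative assumptions, not LangChain's actual template.

```python
# Hypothetical helper: not part of LangChain, just the core idea in miniature.
def build_prompt(history, user_input,
                 preamble="The following is a conversation between a human and an AI."):
    """Concatenate prior turns and the new input into one prompt string."""
    lines = [preamble]
    for human_msg, ai_msg in history:
        lines.append(f"Human: {human_msg}")
        lines.append(f"AI: {ai_msg}")
    # The new question goes last, with an open "AI:" slot for the model to fill.
    lines.append(f"Human: {user_input}")
    lines.append("AI:")
    return "\n".join(lines)


history = [
    ("Hi there!", "Hello! How can I assist you today?"),
    ("My name is Alex.", "Nice to meet you, Alex!"),
]
prompt = build_prompt(history, "What's my name?")
print(prompt)
```

Every memory type in this section is a different answer to one question: what goes into that history argument?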
ConversationBufferMemory: The Kitchen Sink
This is the simplest form of memory. It does exactly what it sounds like: it buffers the entire conversation. Every single message from you (the human) and the AI gets concatenated into a long string and shoved into the prompt for the next call.
It’s incredibly straightforward and works shockingly well for short conversations. The LLM has the complete, verbatim context. But you see the problem, right? LLMs have a finite context window, and every call resends the whole transcript. Cost climbs with each turn, and once the history outgrows the window you start getting errors. It’s like trying to stuff a mattress into a glove compartment.
Here’s how you set it up. Notice how we use save_context to explicitly save each turn of the conversation. This is the low-level control.
from langchain.memory import ConversationBufferMemory
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
# Initialize the memory object
memory = ConversationBufferMemory()
# Let's simulate a conversation
memory.save_context({"input": "Hi there!"}, {"output": "Hello! How can I assist you today?"})
memory.save_context({"input": "My name is Alex."}, {"output": "Nice to meet you, Alex!"})
# Now, let's see what's in the buffer
print(memory.buffer)
# Output: Human: Hi there!\nAI: Hello! How can I assist you today?\nHuman: My name is Alex.\nAI: Nice to meet you, Alex!
# Using it in a chain is where it shines
llm = OpenAI(temperature=0)
conversation = ConversationChain(llm=llm, memory=memory, verbose=True)
# The next prompt will include the entire buffer
response = conversation.predict(input="What's my name?")
# The LLM sees: "The following is a friendly conversation...\nHuman: Hi there!\nAI: Hello!...\nHuman: What's my name?"
print(response) # Correctly outputs something like "Your name is Alex."
Best Practice/Pitfall: Never, ever use this for a long-running conversation. It’s a ticking token bomb. Use it for prototyping or for interactions you know will be brief.
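One common way to defuse the token bomb is to keep only the most recent turns. The sketch below shows that idea in plain Python. Both estimate_tokens and trim_to_budget are hypothetical helpers, and counting whitespace-separated words is a crude stand-in for a real tokenizer. LangChain's ConversationBufferWindowMemory applies a similar idea by keeping only the last k exchanges.

```python
def estimate_tokens(text):
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def trim_to_budget(turns, max_tokens):
    """Keep the most recent turns whose combined estimated size fits the budget."""
    kept = []
    used = 0
    # Walk backwards from the newest turn and stop at the first one that won't fit.
    for human_msg, ai_msg in reversed(turns):
        cost = estimate_tokens(human_msg) + estimate_tokens(ai_msg)
        if used + cost > max_tokens:
            break
        kept.append((human_msg, ai_msg))
        used += cost
    kept.reverse()  # restore chronological order
    return kept


turns = [
    ("Hi there!", "Hello! How can I assist you today?"),
    ("My name is Alex.", "Nice to meet you, Alex!"),
    ("I sell quirky socks.", "Quirky socks are a fantastic niche!"),
]
recent = trim_to_budget(turns, max_tokens=20)
print(recent)
```

With this budget the greeting is dropped and the two later turns survive. Tighten the budget further and the turn containing the user's name goes too, which is exactly the kind of silent forgetting a window introduces.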
ConversationSummaryMemory: The CliffsNotes Version
To get around the context window apocalypse, the clever idea is to summarize the conversation so far instead of storing it word-for-word. This is ConversationSummaryMemory. It uses an LLM (yes, an LLM to summarize for your main LLM, welcome to the meta-world) to fold each new exchange into a running summary, so the prompt carries a concise paragraph instead of the full transcript. The price is an extra LLM call on every turn.
This is fantastic for long-running conversations. You can chat for hours, and the memory footprint remains roughly the same. The huge downside? Summarization is lossy. Details get fuzzy, forgotten, or even hallucinated. The LLM might remember that you talked about your dog, but forget its name is “Rover,” fundamentally changing the quality of the interaction.
from langchain.memory import ConversationSummaryMemory
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
# Notice we pass an llm to the memory this time! It needs one to do the summarizing.
llm = OpenAI(temperature=0)
memory = ConversationSummaryMemory(llm=llm)
# Feed it a longer conversation
memory.save_context(
    {"input": "I'm feeling really excited about learning LangChain."},
    {"output": "That's great to hear! It's a powerful tool for building LLM applications."},
)
memory.save_context(
    {"input": "I'm building a customer support bot for my e-commerce store that sells quirky socks."},
    {"output": "Quirky socks are a fantastic niche! I can help you design a bot that can handle order status inquiries and style recommendations."},
)
# Let's see the summary it's created so far instead of the raw buffer
print(memory.buffer)
# Output will be a summarized version, something like:
# "The human is excited about learning LangChain and is building a customer support bot for their e-commerce store that sells quirky socks. The AI expressed enthusiasm and offered to help with order status and style recommendations."
conversation = ConversationChain(llm=llm, memory=memory, verbose=True)
response = conversation.predict(input="What am I building?")
# The prompt will include the summary, not the raw chat, so the model can still tell it's a bot for a sock store.
print(response)
Best Practice/Pitfall: The quality of your summary is only as good as the LLM you use for it. A weaker LLM for summarization will lead to a dumber main LLM, because the main LLM only ever sees what the summarizer chose to keep. Tweak the prompt for the summarization step if you need to guide it to capture specific details.
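The mechanism can be sketched without any LLM at all: keep one running summary string and, after each turn, hand the old summary plus the new lines to a summarizer. In the sketch below, RollingSummary and naive_summarizer are hypothetical names. The summarizer is a deliberately dumb stand-in that keeps only the first sentence of each message, so the lossiness is easy to see. Unlike a real LLM summarizer, it does not compress across turns.

```python
class RollingSummary:
    """Keeps a single running summary instead of the raw transcript."""

    def __init__(self, summarize):
        # summarize(previous_summary, new_lines) -> new summary string
        self.summarize = summarize
        self.summary = ""

    def save_turn(self, human_msg, ai_msg):
        new_lines = f"Human: {human_msg}\nAI: {ai_msg}"
        self.summary = self.summarize(self.summary, new_lines)


def naive_summarizer(previous_summary, new_lines):
    # Stand-in for an LLM call: keep only the first sentence of each line.
    firsts = []
    for line in new_lines.split("\n"):
        first_sentence = line.split(". ")[0].rstrip(".")
        firsts.append(first_sentence + ".")
    return (previous_summary + " " + " ".join(firsts)).strip()


memory = RollingSummary(naive_summarizer)
memory.save_turn("I have a dog. His name is Rover.",
                 "Dogs are great. Rover is a fine name.")
print(memory.summary)
```

The summary still knows there is a dog, but "Rover" is gone. A real LLM summarizer is far smarter than this stub, yet the failure mode is the same: whatever the summarizer drops, the main LLM never sees.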
Vector Store Memory: The “I Have a Photographic Memory” Flex
This is the most complex and powerful option. In LangChain the class is VectorStoreRetrieverMemory. Instead of a linear buffer or a summary, it stores each exchange as a separate embedding in a vector database (like Chroma or Pinecone). When a new input comes in, it performs a similarity search to find the most relevant past interactions and injects only those into the prompt.
This is genius because it’s both precise and efficient. You can have a million past conversations, and for the user asking “what’s the status of my order?”, it will find the most relevant messages about their order and ignore the time they asked about your return policy three weeks ago. It solves the context window problem and the lossiness problem… in theory.
The reality is more complicated. You’re now managing a whole damn database. The setup is heavier. And the biggest pitfall: the similarity search only returns what sits close to the current query in embedding space, so it can miss crucial context that isn’t semantically similar to the question being asked. It’s recall, not remembrance.
from langchain.memory import VectorStoreRetrieverMemory
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chains import ConversationChain
# Sketch: assumes the same legacy langchain package as the earlier examples, with chromadb installed
# An in-memory Chroma collection that embeds text with OpenAI embeddings
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())
# k controls how many past exchanges get pulled into each prompt
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
memory = VectorStoreRetrieverMemory(retriever=retriever)
# Let's simulate a conversation; each exchange is embedded and stored
memory.save_context({"input": "Hi there!"}, {"output": "Hello! How can I assist you today?"})
memory.save_context({"input": "My name is Alex."}, {"output": "Nice to meet you, Alex!"})
# Now, let's see what a similarity search pulls back for a given input
print(memory.load_memory_variables({"input": "What's my name?"})["history"])
# Should return the exchange about the name rather than the greeting
# Using it in a chain works the same way as before
llm = OpenAI(temperature=0)
conversation = ConversationChain(llm=llm, memory=memory, verbose=True)
# The next prompt will include only the retrieved exchange(s), not the whole history
response = conversation.predict(input="What's my name?")
print(response) # Should output something like "Your name is Alex."
Best Practice/Pitfall: This is overkill for most simple applications. The cost and complexity of running a vector database just isn’t worth it for a basic chat widget. But for a system that needs to reference a vast history of interactions accurately, it’s unbeatable. Always ensure your embedding model is appropriate for your domain.
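The retrieval step itself can be sketched with a toy embedding. The code below uses bag-of-words counts and cosine similarity in plain Python. The embed, cosine, and retrieve helpers are hypothetical, and a real system would swap in a learned embedding model and a vector database. The point is only to show the store-then-search shape and one way it can miss.

```python
import math
from collections import Counter


def embed(text):
    # Toy embedding: lowercase word counts. A real system would use a learned model.
    words = [w.strip(".,!?'\"").lower() for w in text.split()]
    return Counter(w for w in words if w)


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, memories, k=1):
    """Return the k stored messages most similar to the query."""
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]


memories = [
    "My order 1234 has not shipped yet.",
    "What is your return policy?",
    "My name is Alex.",
]
hit = retrieve("Has my order shipped?", memories, k=1)
miss = retrieve("Where is my package?", memories, k=1)
print(hit)
print(miss)
```

The first query lands on the order message. The second asks about the same thing in different words, and this toy model returns the name message instead, because "package" and "order" share nothing it can see. Learned embeddings narrow that gap considerably, but the failure mode is the same in kind: if the relevant memory doesn't sit near the query, it never reaches the prompt.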
So, which one do you choose? It’s a classic engineering trade-off: simplicity vs. context length vs. accuracy. Start with ConversationBufferMemory to get something working, switch to ConversationSummaryMemory when context length becomes an issue, and only reach for VectorStoreRetrieverMemory when you have a real need to query a long-term, detailed history. Now go build something that remembers.