Summary: Virtual Context Management for Large Language Models (arxiv.org)
8,437 words - PDF document
One Line
MemGPT enhances large language models by implementing a memory hierarchy to improve context management.
Key Points
- MemGPT is a system that addresses the limited context window of large language models (LLMs) by implementing virtual context management.
- MemGPT implements a memory hierarchy similar to traditional operating systems, with main context and external context.
- MemGPT allows the LLM to manage its own context and retrieve relevant historical data missing from the in-context information.
- MemGPT is evaluated in two domains: document analysis and conversational agents, outperforming existing LLM-based approaches in both.
- MemGPT utilizes self-directed editing and retrieval to autonomously update and search through its own memory based on the current context.
- Experimental results show that MemGPT significantly outperforms fixed-context baselines in deep memory retrieval tasks and conversation opener tasks.
- The paper acknowledges the limitations of relying on proprietary closed-source models for MemGPT's performance, but anticipates that future open-source models will enable MemGPT-style operation.
- Future directions for MemGPT include applying it to other domains, integrating different memory tier technologies, and improving control flow and memory management policies.
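The memory hierarchy named in the key points above can be sketched as a small data structure: a size-limited main context (analogous to RAM) backed by an unbounded external context (analogous to disk). Everything here — the class name, the word-count token estimate, the oldest-first eviction policy, the substring search — is an illustrative assumption, not the paper's implementation:

```python
# Hypothetical sketch of MemGPT's two-tier memory hierarchy. Names and the
# word-based token budget are assumptions for illustration only.

class VirtualContext:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens            # the LLM's fixed context window
        self.main_context: list[str] = []       # what the LLM actually sees
        self.external_context: list[str] = []   # overflow storage

    def _used(self) -> int:
        # crude token estimate: one token per whitespace-separated word
        return sum(len(m.split()) for m in self.main_context)

    def append(self, message: str) -> None:
        self.main_context.append(message)
        # evict oldest messages to external storage when over budget,
        # mirroring a page-out from RAM to disk
        while self._used() > self.max_tokens and len(self.main_context) > 1:
            self.external_context.append(self.main_context.pop(0))

    def search_external(self, query: str) -> list[str]:
        # naive substring retrieval standing in for semantic search
        return [m for m in self.external_context if query.lower() in m.lower()]
```

The point of the sketch is the invariant: the visible window stays bounded while evicted messages remain searchable, which is what lets the model later "retrieve relevant historical data missing from the in-context information."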
Summaries
19 word summary
MemGPT is an OS-inspired system that improves large language models by introducing a memory hierarchy for better context management.
87 word summary
MemGPT is an OS-inspired system that addresses the limited context window issue in large language models. It introduces a memory hierarchy with main and external context, allowing the model to manage its own context and retrieve relevant historical data. MemGPT outperforms existing approaches in document analysis and conversational agents. It supports repeated context modifications and autonomously updates its memory based on current context. Experimental results show significant improvements in deep memory retrieval and conversation opener tasks. Future research will explore further improvements and applications in various domains.
132 word summary
MemGPT is an OS-inspired system that solves the limited-context-window problem of large language models (LLMs) by implementing virtual context management. It introduces a memory hierarchy with main context and external context, allowing the LLM to manage its own context and retrieve relevant historical data. MemGPT outperforms existing LLM-based approaches in the document analysis and conversational agent domains. It supports repeated context modifications within a single task. The implementation includes main context for LLM inputs and external context for out-of-context information. MemGPT autonomously updates and searches through its memory based on the current context using detailed instructions in the preprompt. Experimental results show that MemGPT significantly outperforms fixed-context baselines in deep memory retrieval tasks and conversation opener tasks. Future research will explore further improvements and applications of MemGPT in various domains.
363 word summary
MemGPT is an OS-inspired system that tackles the limited context window of large language models (LLMs) by implementing virtual context management. It introduces a memory hierarchy, with main context (akin to RAM) and external context (akin to disk memory), allowing the LLM to manage its own context and retrieve relevant historical data. MemGPT supports repeated context modifications within a single task.
The system is evaluated in document analysis and conversational agents domains. In document analysis, MemGPT can analyze large documents that exceed LLMs' input capacity. In conversational agents, it maintains long-term memory, consistency, and engagement over extended dialogues. MemGPT outperforms existing LLM-based approaches in both domains.
The implementation includes main context, holding LLM inputs, and external context for out-of-context information. MemGPT provides function calls for the LLM processor to manage its memory. External context includes recall storage and archival storage for past interactions and facts/experiences beyond the main context limit.
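The function-call interface described above might look roughly like the following. The function names echo the storage tiers the summary names (working context, recall storage, archival storage), but the exact signatures, return values, and the dispatcher are assumptions, not the actual MemGPT API:

```python
# Hypothetical memory-management functions exposed to the LLM processor.
# Names mirror the paper's storage tiers; signatures are illustrative.

working_context: dict[str, str] = {}   # compact facts pinned in main context
recall_storage: list[str] = []         # full history of past interactions
archival_storage: list[str] = []       # long-term facts beyond the window

def working_context_replace(key: str, value: str) -> str:
    working_context[key] = value
    return f"updated {key}"

def archival_insert(text: str) -> str:
    archival_storage.append(text)
    return "stored"

def archival_search(query: str) -> list[str]:
    return [t for t in archival_storage if query.lower() in t.lower()]

def recall_search(query: str) -> list[str]:
    return [m for m in recall_storage if query.lower() in m.lower()]

FUNCTIONS = {
    "working_context_replace": working_context_replace,
    "archival_insert": archival_insert,
    "archival_search": archival_search,
    "recall_search": recall_search,
}

def dispatch(call: dict):
    # the LLM emits a structured call; the runtime executes it and feeds
    # the result back into main context, with no user intervention
    return FUNCTIONS[call["name"]](**call["args"])
```

The design choice worth noting is that the LLM never touches storage directly: it only emits structured calls, and the runtime executes them and appends the results back into the context.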
MemGPT utilizes self-directed editing and retrieval, autonomously updating and searching through its memory based on the current context. It uses detailed instructions in the preprompt to guide its interaction with memory systems. Control flow and function chaining enable MemGPT to handle unbounded context using LLMs with finite windows.
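Control flow and function chaining, as just described, can be sketched as a loop in which each function call may request a follow-up execution step ("heartbeat") instead of yielding back to the user. The flag name, the step cap, and the `llm_step` stub are assumptions modeled on the paper's description, not its implementation:

```python
# Sketch of MemGPT-style function chaining: the processor keeps running as
# long as the last action requested a heartbeat, letting a finite-window LLM
# string together multi-step memory operations before replying.

def run_agent(llm_step, max_steps: int = 10) -> list[str]:
    trace: list[str] = []
    chaining = True
    steps = 0
    while chaining and steps < max_steps:
        action = llm_step(trace)   # stub standing in for a real model call
        trace.append(action["name"])
        # an action without a heartbeat request yields control to the user
        chaining = action.get("request_heartbeat", False)
        steps += 1
    return trace
```

For example, the model might chain a memory search into a reply: a search call with a heartbeat, followed by a message-sending call that ends the turn.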
Experimental results show that MemGPT significantly outperforms fixed-context baselines in deep memory retrieval tasks and conversation opener tasks. It maintains coherence and engages users by leveraging memory to remember facts and personalize responses. In document analysis tasks, MemGPT's performance is unaffected by increased context length, while fixed-context baselines experience degradation.
The paper acknowledges the limitations of relying on proprietary closed-source models for MemGPT's performance but anticipates future open-source models will enable MemGPT-style operation. Potential future directions include applying MemGPT to other domains with massive or unbounded contexts, integrating different memory tier technologies, and improving control flow and memory management policies.
In conclusion, MemGPT is an OS-inspired LLM system that addresses the limited context window by implementing virtual context management. It intelligently manages different memory tiers and utilizes interrupts to control the flow between itself and the user. MemGPT demonstrates improved performance in document analysis and conversational agents compared to existing LLM-based approaches. Future research will explore further improvements and applications of MemGPT in various domains.
485 word summary
MemGPT is a system that addresses the limited context window of large language models (LLMs) by implementing virtual context management, inspired by hierarchical memory systems in traditional operating systems. The goal is to provide extended context within the LLM's limited context window, enabling tasks like extended conversations and document analysis. MemGPT intelligently manages different memory tiers and utilizes interrupts to control the flow between itself and the user.
The paper introduces MemGPT as an OS-inspired LLM system that treats context windows as a constrained memory resource. It implements a memory hierarchy similar to traditional OSes, with main context (analogous to RAM) and external context (analogous to disk memory). MemGPT allows the LLM to manage its own context and retrieve relevant historical data missing from the in-context information. It also supports repeated context modifications within a single task.
MemGPT is evaluated in two domains: document analysis and conversational agents. In document analysis, MemGPT is able to analyze large documents that exceed the input capacity of modern LLMs. In conversational agents, MemGPT can maintain long-term memory, consistency, and engagement over extended dialogues. MemGPT outperforms existing LLM-based approaches in both domains.
The implementation of MemGPT includes main context, which holds the LLM inputs, and external context, which stores out-of-context information. MemGPT provides function calls that allow the LLM processor to manage its own memory without user intervention. Main context consists of system instructions, conversational context, and working context. External context includes recall storage and archival storage for retrieving past interactions and storing facts or experiences beyond the main context limit.
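The three-part main context described above — system instructions, conversational context, and working context — could be assembled into a single prompt along these lines. The section headers and layout are assumptions; the paper specifies the parts, not this particular serialization:

```python
# Hypothetical assembly of MemGPT's main context from its three components:
# read-only system instructions, a writable working context of pinned facts,
# and the FIFO conversational queue.

def build_prompt(system_instructions: str,
                 working_context: dict[str, str],
                 conversation: list[str]) -> str:
    facts = "\n".join(f"- {k}: {v}" for k, v in working_context.items())
    return "\n".join([
        "### SYSTEM", system_instructions,
        "### WORKING CONTEXT", facts,
        "### CONVERSATION", *conversation,
    ])
```

Keeping the working context as explicit key-value facts is what makes it editable by function calls, unlike the append-only conversational queue.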
MemGPT utilizes self-directed editing and retrieval, where it autonomously updates and searches through its own memory based on the current context. It uses detailed instructions within the preprompt to guide its interaction with memory systems. Control flow and function chaining allow MemGPT to handle unbounded context using LLMs with finite context windows.
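One concrete way the self-directed context management above might handle overflow: flush the oldest conversational messages to recall storage and replace them in the queue with a summary placeholder. The fixed-size policy and the placeholder summarizer below are purely illustrative — a real system would have the LLM generate the summary:

```python
# Sketch of queue eviction under context pressure: overflow messages are
# copied to recall storage (so they remain searchable) and collapsed into
# a single summary stub in the visible window.

def evict(queue: list[str], recall: list[str], max_msgs: int) -> list[str]:
    if len(queue) <= max_msgs:
        return queue
    # evict enough of the oldest messages that summary + remainder fit
    overflow = queue[:len(queue) - max_msgs + 1]
    recall.extend(overflow)                      # full copies kept for search
    summary = f"[summary of {len(overflow)} earlier messages]"
    return [summary] + queue[len(queue) - max_msgs + 1:]
```

Nothing is lost under this scheme: the window shrinks, but every evicted message can still be recovered through recall search.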
Experimental results show that MemGPT significantly outperforms fixed-context baselines in deep memory retrieval tasks and conversation opener tasks. MemGPT maintains coherence and engages users by leveraging its memory to remember relevant facts and personalize responses. In document analysis tasks, MemGPT's performance is unaffected by increased context length, while fixed-context baselines experience performance degradation.
The paper acknowledges the limitations of relying on proprietary closed-source models for MemGPT's performance, but anticipates that future open-source models will improve enough to enable MemGPT-style operation. Potential future directions include applying MemGPT to other domains with massive or unbounded contexts, integrating different memory tier technologies, and improving control flow and memory management policies.
In conclusion, MemGPT is an OS-inspired LLM system that addresses the limited context window of LLMs by implementing virtual context management. It intelligently manages different memory tiers and utilizes interrupts to control the flow between itself and the user. MemGPT demonstrates improved performance in document analysis and conversational agents compared to existing LLM-based approaches. Future research will explore further improvements and applications of MemGPT in various domains.