Summary: "Memory Injections: Correcting Multi-Hop Reasoning Failures" (arxiv.org)
8,347 words - PDF document
One Line
The article discusses the problem of multi-hop reasoning failures in Large Language Models and suggests a solution called memory injections.
Key Points
- Large Language Models (LLMs) exhibit multi-hop reasoning failures during inference
- Memory injections are proposed as a remedy: prompt-specific information is injected into critical LLM locations during inference (see the worked example after this list)
- Prompt pairs were evaluated for factual and grammatical accuracy, and lists of common words were compiled by part of speech for the random-injection experiments
- Injecting relevant information at each head is important for model accuracy
- Random injections of tokens from different parts of speech lead to a decrease in predictive performance
- Recent research focuses on understanding the mechanisms of linear layers in language models and using LLMs for knowledge editing
- The input text includes a list of references related to language models, knowledge editing, memory injections, and multi-hop reasoning
- Examples of factual statements are provided at the end of the input text
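As a worked illustration of the failure mode described above (constructed from the factual statements quoted at the end of this page, not an example taken from the paper): the single-hop prompt "The father of Hermes is" is typically completed correctly with "Zeus", while the two-hop prompt "The father of the Greek messenger god is" first requires resolving "Greek messenger god" to "Hermes", and a model that misses that first hop fails. A memory injection supplies the missing hop by adding the embedded tokens for "Hermes" to the model's hidden states mid-inference.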
Summaries
24 word summary
This article addresses multi-hop reasoning failures in Large Language Models (LLMs) and proposes memory injections, a method that inserts prompt-specific information into critical LLM locations.
37 word summary
This article discusses the issue of multi-hop reasoning failures in Large Language Models (LLMs) and proposes a solution called memory injections. The method involves injecting prompt-specific information, referred to as "memories," into critical locations of LLMs.
371 word summary
This article focuses on addressing the multi-hop reasoning failures of Large Language Models (LLMs) during inference. The authors propose a method called memory injections, which involves injecting pertinent prompt-specific information, referred to as "memories," into critical LLM locations during inference.
Multi-hop prompts require an additional inference step compared to single-hop prompts. The transformer architecture consists of embedded inputs, a residual stream, multi-headed self-attention (MHSA) layers, and multi-layer perceptron (MLP) layers. Each MHSA layer is defined by a set of learned parameter matrices.
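For reference, these matrices define the standard multi-headed attention computation (standard notation; the paper's exact symbols may differ):

$$\mathrm{head}_i(X) = \mathrm{softmax}\!\left(\frac{(X W_i^Q)(X W_i^K)^\top}{\sqrt{d_k}}\right) X W_i^V, \qquad \mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O$$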
We conducted an evaluation of prompt pairs to assess factual and grammatical accuracy. We also used a subset of the Corpus of Contemporary American English to generate lists of common words by part of speech. We worked with two pretrained GPT-2 models.
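A minimal sketch of how a single prompt pair might be scored with a pretrained GPT-2 model via Hugging Face transformers; the prompts and the scoring criterion are illustrative assumptions, not the paper's exact experimental setup:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def next_token_prob(prompt: str, target: str) -> float:
    """Probability the model assigns to `target` as the next token after `prompt`."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]        # next-token logits
    # Leading space: GPT-2's BPE treats " Zeus" and "Zeus" as different tokens.
    target_id = tokenizer.encode(" " + target)[0]      # first sub-token of the target
    return torch.softmax(logits, dim=-1)[target_id].item()

# Single-hop vs. two-hop phrasing of the same fact (constructed example):
print(next_token_prob("The father of Hermes is", "Zeus"))
print(next_token_prob("The father of the Greek messenger god is", "Zeus"))
```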
We present a method for injecting a missing hop directly into the output hidden states of an attention head in a transformer model. The process involves tokenizing the memory, encoding the tokens as binary (one-hot) vectors, and embedding them back into the model's latent space. The embedded memory is then scaled and added to the attention head's output hidden states during inference.
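A minimal sketch of one way such an injection could be implemented with a PyTorch forward hook, reusing `model`, `tokenizer`, and `next_token_prob` from the previous sketch; the layer index, the pooling over memory tokens, and the `scale` coefficient are illustrative assumptions rather than the paper's exact procedure:

```python
import torch

def make_injection_hook(model, tokenizer, memory: str, scale: float = 4.0):
    # Tokenize the memory and map the token ids back into the model's latent
    # space using its own input-embedding matrix (wte), then pool over tokens.
    ids = tokenizer(memory, return_tensors="pt").input_ids
    memory_vec = model.transformer.wte(ids).sum(dim=1)     # shape: (1, hidden_dim)

    def hook(module, inputs, output):
        # GPT-2 attention modules return a tuple; output[0] holds the hidden states.
        hidden = output[0] + scale * memory_vec            # add scaled memory at every position
        return (hidden,) + output[1:]

    return hook

# Register the hook on one attention block (layer 8 is an arbitrary choice),
# run inference as usual, then remove the hook.
handle = model.transformer.h[8].attn.register_forward_hook(
    make_injection_hook(model, tokenizer, "Hermes"))
print(next_token_prob("The father of the Greek messenger god is", "Zeus"))
handle.remove()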
The study demonstrates that injecting relevant information at each head is important for model accuracy. The effects of randomly injecting tokens from different parts of speech on model accuracy are also assessed. The results show that random injections decrease predictive performance, indicating that the benefit of an injection depends on its relevance to the prompt.
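A sketch of the corresponding control condition, using the hook above with randomly sampled words instead of the relevant memory (the word list here is a stand-in for the corpus-derived part-of-speech lists):

```python
import random

common_nouns = ["table", "river", "music", "window"]   # placeholder for a corpus-derived noun list
handle = model.transformer.h[8].attn.register_forward_hook(
    make_injection_hook(model, tokenizer, random.choice(common_nouns)))
print(next_token_prob("The father of the Greek messenger god is", "Zeus"))
handle.remove()
```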
Recent research has focused on understanding the mechanisms of linear layers in language models and how they retrieve information. Some studies have examined the intermediate activations of LLMs to uncover reasoning mechanisms. There is also interest in using LLMs as tools for knowledge editing.
This excerpt is a list of references to various papers and studies related to language models and knowledge editing. The references include papers on the capabilities and limitations of language models like GPT-3, the evaluation of knowledge editing in language models, and related techniques.
A further excerpt lists references to papers and conference proceedings related to memory injections and multi-hop reasoning, including authors, titles, and publication years. Additionally, three figures show heatmaps depicting the average percent difference between pre- and post-injection model predictions.
The paper closes with example factual statements: Nelson Mandela ended apartheid in South Africa. John F. Kennedy was assassinated by Lee Harvey Oswald. The father of Hermes is Zeus. Dušan Hanák, the director of I Love, You Love, was born in Bratislava.