Summary: Neural Retrieval for Knowledge-Grounded Dialogue (aclanthology.org)
13,753 words - PDF document
One Line
This article explores the effectiveness of retrieval augmentation in dialogue models for reducing knowledge hallucination and improving knowledge utilization, highlighting the importance of the retrieval component and identifying optimal architectures and decoding strategies for better model performance.
Key Points
- Retrieval augmentation reduces factual incorrectness and knowledge hallucination in dialogue models.
- Retrieval-augmented architectures consist of retrievers, rankers, and encoder-decoders.
- Retrieval augmentation improves knowledge utilization and conversational ability in dialogue models.
- Poly-encoder Transformers, end-to-end-trained retrievers, and dialogue turn-based retrieval enhance performance.
- Retrieval augmentation demonstrates effectiveness in reducing knowledge hallucination and addressing challenges in open-domain knowledge-grounded dialogue.
Summaries
226 word summary
This article examines retrieval augmentation in dialogue models for addressing factual incorrectness and knowledge hallucination. It demonstrates the effectiveness of retrieval augmentation in improving knowledge utilization and reducing hallucination. The study analyzes different components of retrieval augmentation architectures and proposes retrieval-augmented generation (RAG) as a solution. The study highlights the importance of the retrieval component for performance and identifies the DPR-Poly setup and FiD-RAG model as achieving the best retrieval metrics. Models that condition on multiple documents simultaneously show higher engagement and consistency scores. The FiD architecture is optimal for considering multiple documents jointly. The study emphasizes the need for metrics like Knowledge F1 to capture hallucination within models and underscores the effectiveness of retrieval-augmented generation in reducing knowledge hallucination while maintaining conversational ability. The authors also explore the impact of different encoder-decoder architectures, sources of knowledge, number of retrieved and re-ranked documents, and decoding strategies on model performance. They find that larger models have lower perplexity but may sacrifice knowledge use for fluency. Limiting the knowledge base to topics present in the dataset improves performance, while scaling up the knowledge source reduces performance. Increasing the number of retrieved documents improves perplexity and F1 measures but leads to higher levels of hallucination. Beam search with beam-blocking yields the highest F1 scores, while nucleus sampling with a higher p value increases variety but results in more hallucination.
279 word summary
This article explores the use of retrieval augmentation in dialogue models to address factual incorrectness and knowledge hallucination. The models utilize retrievers, rankers, and encoder-decoders to generate coherent responses based on multi-turn dialogue context. Experimental results on the Wizard of Wikipedia and CMU Document Grounded Conversations datasets demonstrate the effectiveness of retrieval augmentation in improving knowledge utilization and reducing hallucination. The study analyzes different components of retrieval augmentation architectures, such as Poly-encoder Transformers and end-to-end-trained retrievers, to assess their impact on performance. The authors propose retrieval-augmented generation (RAG) as a solution, comparing different architectures and retrieval methods on the two datasets. The study highlights the importance of the retrieval component for performance, with the DPR-Poly setup and FiD-RAG model achieving the best retrieval metrics. Models that condition on multiple documents simultaneously show higher engagement and consistency scores. The FiD architecture is optimal for considering multiple documents jointly. The study emphasizes the need for metrics like Knowledge F1 to capture hallucination within models and underscores the effectiveness of retrieval-augmented generation in reducing knowledge hallucination while maintaining conversational ability. The authors also explore the impact of different encoder-decoder architectures, sources of knowledge, number of retrieved and re-ranked documents, and decoding strategies on model performance. They find that larger models have lower perplexity but may sacrifice knowledge use for fluency. Limiting the knowledge base to topics present in the dataset improves performance, while scaling up the knowledge source reduces performance. Increasing the number of retrieved documents improves perplexity and F1 measures but leads to higher levels of hallucination.
Beam search with beam-blocking yields the highest F1 scores, while nucleus sampling with a higher p value increases variety but results in more hallucination.
373 word summary
This article discusses the use of retrieval augmentation in dialogue models to reduce factual incorrectness and knowledge hallucination. The models incorporate retrievers, rankers, and encoder-decoders to generate conversationally coherent responses based on multi-turn dialogue context. The effectiveness of retrieval augmentation is demonstrated through experiments on the Wizard of Wikipedia and CMU Document Grounded Conversations datasets, showing improved knowledge utilization and reduction in hallucination. Different components of retrieval augmentation architectures, such as Poly-encoder Transformers and end-to-end-trained retrievers, are analyzed for their impact on performance. The study concludes that retrieval augmentation significantly enhances the performance of dialogue models in open-domain knowledge-grounded dialogue.
The authors propose retrieval-augmented generation (RAG) as a solution to reduce knowledge hallucination and improve conversational ability in conversational agents. They compare different architectures and retrieval methods for RAG on the Wizard of Wikipedia and CMU-DoG datasets. The strength of the retrieval component is found to be crucial for performance, with the DPR-Poly setup and FiD-RAG model achieving the best retrieval metrics and demonstrating the importance of retrieval for knowledge-grounded dialogue.
Models that condition on multiple documents simultaneously make for more engaging conversationalists, with RAG-Token, RAG-Turn, and FiD-RAG achieving higher F1 scores and better engagement/consistency scores than RAG-Sequence. The FiD architecture is optimal for considering multiple documents jointly, with higher F1/KF1 scores and lower human-evaluated hallucination. However, all models suffer from more hallucination when conditioned on more documents.
The study highlights the need for metrics like Knowledge F1 to capture hallucination within models and emphasizes the effectiveness of retrieval-augmented generation in reducing knowledge hallucination while maintaining conversational ability and generalization to unseen distributions.
The authors also explore the impact of different encoder-decoder architectures, sources of knowledge, number of retrieved and re-ranked documents, and decoding strategies on model performance. They find that larger models tend to have lower perplexity but may sacrifice knowledge use for conversational fluency. Limiting the knowledge base to topics present in the dataset improves performance, while scaling up the knowledge source reduces performance. Increasing the number of retrieved documents improves perplexity and F1 measures but leads to higher levels of hallucination. Beam search with beam-blocking yields the highest F1 scores, while nucleus sampling with a higher p value increases variety but results in more hallucination. Pre-training the neural retriever does not yield noticeable downstream gains.
824 word summary
Retrieval augmentation is a technique that reduces factual incorrectness and knowledge hallucination in state-of-the-art dialogue models. By incorporating neural retrieval in the loop, these models are able to generate conversationally coherent responses based on complex multi-turn dialogue context. The retrieval-augmented architectures consist of retrievers, rankers, and encoder-decoders, with the goal of maximizing knowledgeability while retaining conversational ability. These models have been shown to achieve state-of-the-art performance on knowledge-grounded conversational tasks, demonstrating open-domain conversational capabilities and effective generalization to scenarios not seen during training. Human evaluations have confirmed that these models substantially reduce knowledge hallucination.
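The retrieve-then-generate loop described above can be sketched as a minimal pipeline. The toy overlap-based retriever and placeholder generator below are illustrative stand-ins (the paper's models use trained neural retrievers and encoder-decoders), intended only to show where each component sits:

```python
from collections import Counter

def retrieve(query, documents, k=2):
    """Toy retriever: score documents by unigram overlap with the query.
    A neural retriever would use learned dense embeddings instead."""
    q_tokens = Counter(query.lower().split())
    scored = []
    for doc in documents:
        d_tokens = Counter(doc.lower().split())
        overlap = sum((q_tokens & d_tokens).values())
        scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def generate(dialogue_context, retrieved_docs):
    """Placeholder generator: a real model conditions an encoder-decoder
    on the dialogue context together with the retrieved documents."""
    return f"(response grounded in {len(retrieved_docs)} documents)"

docs = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Python is a programming language created by Guido van Rossum.",
    "Paris is the capital of France.",
]
context = "Tell me about Paris"
top_docs = retrieve(context, docs)
reply = generate(context, top_docs)
```

The point of the structure is that the generator never sees the full knowledge source, only the top-k documents the retriever selects, which is why retriever quality dominates downstream performance.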
The effectiveness of retrieval augmentation is demonstrated in experiments using the Wizard of Wikipedia (WoW) and CMU Document Grounded Conversations (CMU-DoG) datasets. Retrieval-augmented models outperform baselines without retrieval augmentation, with significant improvements in knowledge utilization as measured by Knowledge F1 scores. Human evaluations further confirm the reduction of hallucination in model responses.
Different components of retrieval augmentation architectures are analyzed to determine their impact on performance. Improvements are made by employing Poly-encoder Transformers for finer-grained context-candidate scoring, utilizing end-to-end-trained retrievers in the Fusion-in-Decoder technique, and implementing a dialogue turn-based retrieval mechanism to better incorporate contextual information. These enhancements result in state-of-the-art performance on knowledge-grounded conversational tasks.
The ablations and comparisons conducted in the experiments reveal that retrieval augmentation significantly enhances the performance of dialogue models. By incorporating retrievers trained through retrieval-augmented generation, models achieve better knowledge utilization, reduce hallucination, and improve overall conversational ability. The use of neural retrieval in open-domain knowledge-grounded dialogue proves to be an effective approach for addressing the challenges posed by this task.
In this work, the authors focus on the problem of knowledge hallucination in conversational agents. They propose retrieval-augmented generation (RAG) as a solution to reduce hallucination and improve conversational ability. They compare different architectures and retrieval methods for RAG on the Wizard of Wikipedia (WoW) and CMU-DoG datasets.
The authors find that the strength of the retrieval component is crucial for performance. The DPR-Poly setup, which combines dense passage retrieval (DPR) with a poly-encoder re-ranker, obtains the best retrieval metrics on WoW. The FiD-RAG model, which uses a retriever tuned for open-domain dialogue, demonstrates the importance of retrieval for knowledge-grounded dialogue.
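The two-stage DPR-Poly setup can be illustrated with a toy sketch: dense retrieval selects a short list by maximum inner product, and a re-ranker rescores it. The embeddings and re-ranker scores below are made-up stand-ins; in the paper they come from trained DPR encoders and a poly-encoder respectively:

```python
def dot(u, v):
    """Inner product between a query vector and a passage vector."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical pre-computed embeddings; DPR would produce these with
# two trained Transformer encoders (one for queries, one for passages).
passage_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.2, 0.8, 0.1],
    "doc_c": [0.1, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.1]

# Stage 1: dense retrieval keeps the passages with the highest inner product.
candidates = sorted(
    passage_vecs, key=lambda d: dot(query_vec, passage_vecs[d]), reverse=True
)[:2]

# Stage 2: a re-ranker (a poly-encoder in the paper) rescores the short
# list with finer-grained context-candidate scoring; faked here as a lookup.
rerank_scores = {"doc_a": 0.7, "doc_b": 0.95, "doc_c": 0.4}
best = max(candidates, key=lambda d: rerank_scores[d])
```

The division of labor is the design choice: cheap dot products prune the full corpus, and the more expensive re-ranker is only run over the surviving candidates.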
Models that condition on multiple documents simultaneously make for more engaging conversationalists. RAG-Token, RAG-Turn, and FiD-RAG achieve higher F1 scores and better engagement/consistency scores than RAG-Sequence. RAG-Turn also shows benefits on automated metrics when conditioning on turns of dialogue independently.
The FiD architecture is optimal when considering multiple documents jointly, with higher F1/KF1 scores and lower human-evaluated hallucination. However, all models suffer from more hallucination when conditioned on more documents.
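The Fusion-in-Decoder idea (encode each document independently, decode over all encodings jointly) can be shown with a toy token-level stand-in; the `encode` function below fakes what would be a Transformer encoder producing hidden states:

```python
def encode(context, document):
    """Stand-in encoder: returns a token list; a real FiD encoder returns
    a sequence of hidden states per (context, document) pair."""
    return (context + " " + document).split()

def fusion_in_decoder(context, documents):
    """FiD: each document is encoded independently with the context,
    and the decoder attends over the concatenation of all encodings."""
    encodings = [encode(context, doc) for doc in documents]
    fused = [tok for enc in encodings for tok in enc]  # concatenate encodings
    return fused

fused = fusion_in_decoder("who built it", ["doc one text", "doc two text"])
```

Because encoding cost grows linearly in the number of documents while the decoder still sees everything at once, FiD scales to more documents than concatenating them all into a single encoder input.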
Standard metrics used for open-domain dialogue are not sufficient to capture hallucination within models. Metrics like Knowledge F1 are necessary to study model performance.
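Knowledge F1 compares the generated response against the gold knowledge sentence rather than the gold response. A minimal unigram-overlap version, assuming simple whitespace tokenization (the paper's exact tokenization may differ):

```python
from collections import Counter

def knowledge_f1(response, gold_knowledge):
    """Unigram F1 between a generated response and the gold knowledge
    sentence: high when the response actually uses the knowledge."""
    r = Counter(response.lower().split())
    g = Counter(gold_knowledge.lower().split())
    overlap = sum((r & g).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(r.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

score = knowledge_f1(
    "the eiffel tower was completed in 1889",
    "the eiffel tower was completed in 1889 in paris",
)
```

A fluent but hallucinated response can score well on response-level F1 yet near zero on Knowledge F1, which is why the metric exposes hallucination that standard metrics miss.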
Overall, the authors demonstrate that retrieval-augmented generation can reduce knowledge hallucination in conversational agents while maintaining conversational ability and generalizing to unseen distributions.
This article discusses various aspects of neural retrieval for knowledge-grounded dialogue. The authors compare different models and architectures, evaluate the impact of pre-training, decoding strategies, and the source of knowledge on model performance.
In terms of model comparison, the authors find that different encoder-decoder architectures yield similar results in terms of generation metrics. However, larger models tend to have lower perplexity but may sacrifice knowledge use for conversational fluency. The inclusion of a neural model trained for retrieval is found to be necessary for improved performance.
The authors also explore the effects of different sources of knowledge on model performance. They find that limiting the knowledge base to topics present in the dataset improves performance, while scaling up the knowledge source to the full Wikipedia knowledge base reduces performance.
The number of retrieved and re-ranked documents is found to affect model performance. Increasing the number of retrieved documents improves perplexity and F1 measures, but leads to higher levels of hallucination. The authors suggest that limiting the number of re-ranked documents to 25 strikes a balance between performance and computational overhead.
The decoding strategy used also impacts model performance. Beam search with beam-blocking yields the highest F1 scores, while nucleus sampling with a higher p value increases variety but results in more hallucination.
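The p value in nucleus sampling controls how many tokens survive the top-p filter; a larger p keeps a bigger candidate set, which increases variety but, as noted, also hallucination. A minimal sketch of the filtering step, over an illustrative next-token distribution:

```python
def nucleus_filter(probs, p=0.9):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches p, then renormalize (top-p / nucleus filtering)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return {tok: pr / total for tok, pr in kept.items()}

# Illustrative next-token distribution, not from any real model.
dist = {"paris": 0.5, "london": 0.3, "berlin": 0.15, "oslo": 0.05}
nucleus = nucleus_filter(dist, p=0.9)
```

With p=0.9 the low-probability tail ("oslo") is cut; raising p toward 1.0 would keep it, admitting rarer and riskier tokens into the sample.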
Pre-training the neural retriever does not yield noticeable downstream gains, suggesting that fine-tuning is sufficient. Similarly, pre-training the re-ranker does not substantially improve performance.
Overall, the study highlights the importance of neural retrieval for knowledge-grounded dialogue and provides insights into various factors that affect model performance.