Summary: Retrieval-Augmented Language Generation with Continuous Information Gathering (arxiv.org)
14,392 words - PDF document
One Line
Researchers propose retrieval-augmented language models as a solution to inaccurate output and demonstrate the effectiveness of their method, FLARE, on generating natural language text for complex questions.
Key Points
- Retrieval-augmented language models (LMs) can improve accuracy in long-form generation tasks.
- FLARE is a forward-looking active retrieval-augmented generation framework that decides when and what to retrieve during long-form generation.
- FLARE achieves superior or competitive performance compared to single-time and multi-time retrieval baselines.
- Multi-time retrieval baselines trigger retrieval at fixed intervals, every token or every few tokens; FLARE instead retrieves adaptively when generation confidence is low.
- The approach involves implicit and explicit query formulation methods and iteratively retrieves information based on the upcoming sentence.
- The proposed method for generating natural language responses to ambiguous questions involves identifying and answering each possible interpretation separately.
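As a rough illustration of the loop described in the key points, the following sketch uses stub functions in place of a real language model and retriever. All names, the canned sentences, the stop condition, and the confidence threshold are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of the FLARE generation loop (not the paper's code).
# A stub LM proposes a tentative next sentence with a confidence score; when
# confidence is low, the sentence is used as a search query and regenerated.

def lm_generate_sentence(context, docs=None):
    # Stand-in for a real LM call. Returns (sentence, min_token_probability).
    if docs:
        return ("Joe Biden attended the University of Delaware.", 0.95)
    return ("Joe Biden attended a university.", 0.40)

def retrieve(query, k=3):
    # Stand-in for a retriever such as BM25 over a Wikipedia corpus.
    return [f"doc about: {query}"][:k]

def flare_generate(user_input, max_sentences=5, threshold=0.6):
    answer = []
    for _ in range(max_sentences):
        context = user_input + " " + " ".join(answer)
        tentative, confidence = lm_generate_sentence(context)
        if confidence < threshold:
            # Forward-looking retrieval: the tentative sentence is the query.
            docs = retrieve(tentative)
            tentative, _ = lm_generate_sentence(context, docs=docs)
        answer.append(tentative)
        if tentative.endswith("University of Delaware."):  # stub stop condition
            break
    return " ".join(answer)
```

The key design point this sketch mirrors is that the query reflects what the model is *about* to say, rather than what it has already said.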
Summaries
210 word summary
Researchers propose retrieval-augmented language models as a solution to inaccurate output from relying solely on input. They introduce active retrieval-augmented generation, where a language model actively decides when and what to retrieve during the generation process. The study tests FLARE along with baselines over 4 long-form knowledge-intensive generation tasks and achieves superior or competitive performance on all tasks, demonstrating the effectiveness of the method. FLARE generates natural language text sentence by sentence, retrieving as needed rather than at the fixed intervals (every token or every few tokens) used by multi-time retrieval baselines. The paper evaluates the effectiveness of FLARE on 4 different datasets, including generating comprehensive answers to complex questions and utilizing world and commonsense knowledge to generate an answer. The document presents a natural language processing system that generates multiple answers to ambiguous questions by identifying and answering each possible interpretation separately. One related paper discusses the challenges of open-domain multi-document summarization and retrieval for open-domain question answering, proposing a retrieval-augmented language model that combines retrieval and language models within a single transformer. Another paper explores the use of retrieval-augmented language models for generating long-form answers to multi-hop questions, emphasizing the importance of chain-of-thought reasoning and consistency. The document also includes tables with statistics and experimental settings for different tasks/datasets, as well as hyperparameters and excluded domains.
390 word summary
The document presents a natural language processing system that generates multiple answers to ambiguous questions by identifying and answering each possible interpretation separately. The method involves considering different criteria or versions to determine the correct interpretation and generate a concise answer. The document provides examples of 2WikiMultihopQA questions and their answers, emphasizing the use of the Search API and step-by-step thinking. One paper discusses the challenges of open-domain multi-document summarization and retrieval for open-domain question answering, proposing a retrieval-augmented language model that combines retrieval and language models within a single transformer. Another paper explores the use of retrieval-augmented language models for generating long-form answers to multi-hop questions, emphasizing the importance of chain-of-thought reasoning and consistency. The document also includes tables with statistics and experimental settings for different tasks/datasets, as well as hyperparameters and excluded domains. FLARE generates natural language text sentence by sentence, retrieving as needed rather than at the fixed intervals (every token or every few tokens) used by multi-time retrieval baselines. It generates questions that can be answered by retrieved information using explicit and implicit query generation. The paper evaluates the effectiveness of FLARE on 4 different datasets, including generating comprehensive answers to complex questions and utilizing world and commonsense knowledge to generate an answer. The paper proposes an active retrieval augmented generation framework called FLARE that iteratively generates a temporary next sentence, uses it as a query to retrieve relevant documents, and then regenerates the next sentence conditioning on the retrieved documents. The approach involves iteratively generating search queries and retrieving relevant documents to regenerate the next sentence.
FLARE achieves superior or competitive performance compared to single-time and multi-time retrieval baselines. Researchers propose retrieval-augmented language models as a solution to inaccurate output from relying solely on input. They introduce active retrieval-augmented generation, where a language model actively decides when and what to retrieve during the generation process. The paper discusses approaches to active retrieval, including forward-looking active retrieval augmented generation (FLARE), which iteratively generates a temporary next sentence based on user input and initial retrieval results. Retrieving extra information during the generation process is necessary for generating complex content. Long-form generation tasks, such as open-domain summarization and multihop question answering, can benefit from active retrieval-augmented generation. The study tests FLARE along with baselines over 4 long-form knowledge-intensive generation tasks and achieves superior or competitive performance on all tasks, demonstrating the effectiveness of the method.
933 word summary
Retrieval-augmented language models (LMs) are proposed as a solution to the issue of LMs generating inaccurate output due to relying solely on input. The researchers from Carnegie Mellon University's Language Technologies Institute and Sea AI Lab propose a new paradigm for long-form generation tasks called active retrieval-augmented generation, where a language model (LM) actively decides when and what to retrieve throughout the generation process. The paper discusses several approaches to active retrieval, including forward-looking active retrieval augmented generation (FLARE), which iteratively generates a temporary next sentence based on user input and initial retrieval results. The paper highlights the importance of retrieving extra information as needed during the generation process to ensure that the LM has enough knowledge to generate complex content. The paper also presents several examples of long-form generation tasks, including open-domain summarization and multihop question answering, where active retrieval-augmented generation can be applied. The study tests FLARE along with baselines comprehensively over 4 long-form knowledge-intensive generation tasks and achieves superior or competitive performance on all tasks, demonstrating the effectiveness of the method. The paper proposes a retrieval-augmented language generation framework called FLARE, which uses active retrieval augmented generation methods to aid long-form generation with retrieval. FLARE iteratively generates a temporary next sentence, uses it as a query to retrieve relevant documents, and then regenerates the next sentence conditioning on the retrieved documents. FLARE achieves superior or competitive performance compared to single-time and multi-time retrieval baselines. The paper discusses confidence-based active retrieval and direct FLARE methods to address issues with generating search queries and disrupting answer generation. 
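The confidence-based trigger mentioned above can be sketched as a simple check over token probabilities: retrieval fires only if the tentative next sentence contains any token whose probability falls below a threshold. The threshold value here is illustrative, not the paper's setting.

```python
def needs_retrieval(token_probs, theta=0.5):
    # Active retrieval trigger: fire only when the LM is uncertain, i.e. when
    # any token in the tentative next sentence has probability below theta.
    # theta = 0 never retrieves; a high theta retrieves for nearly every sentence.
    return any(p < theta for p in token_probs)

# High-confidence sentence: no retrieval needed.
print(needs_retrieval([0.9, 0.8, 0.95]))   # False
# A single uncertain token is enough to trigger retrieval.
print(needs_retrieval([0.9, 0.3, 0.95]))   # True
```

This matches the paper's finding summarized later in the document: high-confidence sentences may not need retrieval at all, so retrieving for every sentence wastes calls and can inject distracting documents.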
The approach involves iteratively generating search queries and retrieving relevant documents to regenerate the next sentence. FLARE is a method for generating natural language text by iteratively retrieving and incorporating information from external sources. Unlike multi-time retrieval baselines, which are triggered either by every token or every few tokens, FLARE retrieves only when the model lacks the needed knowledge. FLARE generates questions that can be answered by the retrieved information using explicit and implicit query generation. The paper discusses retrieval-augmented language generation for multihop question answering and commonsense reasoning, and evaluates the effectiveness of FLARE on 4 different datasets. The tasks include generating comprehensive answers to complex questions, answering ambiguous questions, and utilizing world and commonsense knowledge to generate an answer. The article discusses open-domain summarization and its evaluation metrics, including ROUGE and UniEval. It presents the FLARE method for retrieval-augmented language generation, which outperforms other baseline methods in multihop QA tasks and is effective at generating on-topic output. The paper also conducts an ablation study to evaluate the efficacy of various design choices, including the importance of forward-looking retrieval and question decomposition. The paper proposes an active retrieval augmentation framework for long-form language generation that retrieves relevant information during generation. The framework uses a combination of implicit and explicit query formulation methods and iteratively retrieves information based on the upcoming sentence. The paper concludes by acknowledging the support of funding agencies and individuals who contributed to the experiments. Finally, it mentions the authors and references related to the paper, including the use of FLARE for answering open-domain questions and several datasets used for evaluating language models.
This collection of papers covers various topics in natural language processing and machine learning, showcasing the current state of the art in the field. One paper discusses the challenges of open-domain multi-document summarization and retrieval for open-domain question answering, proposing a retrieval-augmented language model that combines retrieval and language models within a single transformer. Another paper explores the use of retrieval-augmented language models for generating long-form answers to multi-hop questions, emphasizing the importance of chain-of-thought reasoning and consistency. The document also includes tables with statistics and experimental settings for different tasks/datasets, as well as hyperparameters and excluded domains. Retrieval-Augmented Language Generation with Continuous Information Gathering provides examples of 2WikiMultihopQA questions and their answers, emphasizing the use of the Search API and step-by-step thinking. The questions include identifying director's death dates, determining Nepalese Royals' grandchildren, identifying film directors who died later, and more. The document also includes a compilation of yes or no questions with their respective answers.
The proposed method for generating natural language responses to ambiguous questions involves identifying and answering each possible interpretation separately. Examples of ambiguous questions include release dates of Bohemian Rhapsody in different countries, the second largest city in the USA by population or area, and actresses who played Bonnie in different versions of Gone with the Wind. The method involves considering different criteria or versions to determine the correct interpretation and generate a concise answer. The document also includes examples of straightforward questions that can be answered with a yes or no. The document, Retrieval-Augmented Language Generation with Continuous Information Gathering, presents a natural language processing system that generates multiple answers to ambiguous questions. The system considers different interpretations of the question and provides answers for each. Examples of ambiguous questions include those about music, movies, and sports teams. For each question, hints are provided to identify which aspect of the question is ambiguous. The system considers different charts, regions, releases, versions, locations, and venues to provide accurate answers for each interpretation. The Aslanhane Mosque in Turkey was built in 1290 and restored in 2010-2013. "Untold Legends: The Warrior's Code" is a hack n' slash action role-playing game with mixed reviews. St. Augustine was destroyed and then rebuilt by Spanish settlers after the Treaty of Nonsuch and the Raid on St. Augustine in 1586 by Sir Francis Drake. Lakewood is a historic mansion in Alabama with a rare plan for early Alabama architecture. Carlos Moedas is a Portuguese politician known for promoting innovation and entrepreneurship in Europe.
3730 word summary
Carlos Moedas, a Portuguese politician, has had a successful career in politics. He was born in Beja, Portugal in 1970 and was educated at the University of Lisbon where he studied engineering. He worked at Goldman Sachs before entering politics. He served as the Minister of Environment, Spatial Planning and Energy from 2011 to 2014 and later as the European Commissioner for Research, Science and Innovation from 2014 to 2019. Moedas is known for his work on promoting innovation and entrepreneurship in Europe.
Lakewood is a historic mansion located in Livingston, Alabama. Built in 1840, it features a rare plan for early Alabama architecture, with a brick ground floor topped by one-and-a-half-stories of wood-frame construction. The facade is divided into five bays with central entrance doors on the ground and principal floors. The bays are divided by two-story Doric pilasters, with the middle third of the facade occupied by a two-tiered tetrastyle Doric portico. The house has formal rooms on the principal floor and bedrooms on the upper floor, with a central hallway present on all levels.
In 1586, Sir Francis Drake led an expedition to raid and plunder the Spanish New World. He had previously attacked Santiago in November 1585 and Santo Domingo in January 1586. He then attacked the important city of Cartagena on February 19th before heading north towards the Spanish settlement of St. Augustine. There, he hoped to avenge his fellow Protestants who had been ordered to leave by the Spanish under Pedro Menéndez de Avilés. After passing the Cuban coast, Drake's fleet traveled north within sight of land along the Florida peninsula. They stopped to gather supplies and counter an outbreak of dysentery before making the Transatlantic crossing back to England. Raid on St. Augustine: Background: After the Treaty of Nonsuch, Philip II of Spain declared war on England. The fort in St. Augustine was destroyed and replaced with Castillo de San Marcos. The Spanish settlers had to rebuild after the English left. Aftermath: When the Spanish settlers returned, they found the settlement in ruins with very little left.
Untold Legends: The Warrior's Code: Gameplay: The game is a hack n' slash action role-playing game with a focus on action-oriented combat. Reception: The game received "mixed or average reviews" according to Metacritic.
Aslanhane Mosque: History: The mosque is one of the oldest mosques in Turkey and was built during the reign of Mesud II of the Anatolian Seljuks in 1290. It was commissioned by Ahi leaders Husamettin and Hasaneddin and was later repaired by Şerafettin in 1330. The mosque was restored by the Directorate General of Foundations in the 2010–2013 period. Location: The mosque is located in the old quarter of Ankara next to Ankara Castle and overlooks Ankara at an altitude of 947 meters.
The Breakfast Club: Interpretation: The question is ambiguous in terms of which specific name is being referred to - the character's name or the actor's name. Therefore, both interpretations need to be considered when answering the question.
Roy Orbison's First Number One Hit: Interpretation: The question has two interpretations - one for the US Hot 100 and Canada, and one for the UK and Ireland. Retrieval-Augmented Language Generation with Continuous Information Gathering is a document that discusses a system for generating answers to ambiguous questions. The system considers different interpretations of the question and provides answers for each. The document provides examples of ambiguous questions and their different interpretations. The examples include questions about music, movies, and sports teams. For each question, the document provides hints on which aspect of the question is ambiguous and suggests possible interpretations. The system considers different charts, regions, releases, versions, locations, and venues to provide accurate answers for each interpretation. Retrieval-Augmented Language Generation with Continuous Information Gathering is a document that discusses a natural language processing system. The system is designed to interpret ambiguous questions and provide multiple possible answers based on different interpretations. The document provides examples of ambiguous questions and how the system can be used to provide multiple answers. The examples include questions about the release date of Bohemian Rhapsody, the second largest city in the USA, who played Bonnie in Gone with the Wind, and the name of the criminal's character in The Breakfast Club. The system uses hints to determine which aspect of the question is ambiguous and provides answers based on all possible interpretations. Question 1: What is the criminal's name in the breakfast club? This question is ambiguous in terms of which specific name is being referred to - the character's name or the actor's - so we need to figure out its interpretations and answer them one by one.
Question 2: What was roy orbison's first number one hit? This question is ambiguous in terms of which specific chart or region is being referred to, so we need to figure out its interpretations and answer them one by one.
Question 3: When does the movie summer of 84 come out? This question is ambiguous in terms of which release of the movie is being referred to, so we need to figure out its interpretations and answer them one by one.
Question 4: When did xbox one come out in australia? This question is ambiguous in terms of which specific version of the Xbox One is being referred to, so we need to figure out its interpretations and answer them one by one.
Question 5: Where do the Philadelphia Eagles play their home games? This question has 3 interpretations: (1) What city do the Philadelphia Eagles play their home games? (2) In what sports complex do the Philadelphia Eagles play their home games? (3) What stadium do the Philadelphia Eagles play their home games? The answers to all interpretations are provided. Retrieval-Augmented Language Generation with Continuous Information Gathering proposes a method for generating natural language responses to ambiguous questions by identifying and answering each possible interpretation separately. The document provides examples of ambiguous questions and their multiple interpretations, including questions about the release date of Bohemian Rhapsody in different countries, the second largest city in the USA by population or area, and the actress who played Bonnie in different versions of Gone with the Wind. The method involves considering different criteria or versions to determine the correct interpretation and generate a concise answer. The document also includes examples of straightforward questions that can be answered with a yes or no, such as whether a pear would sink in water or whether a llama could give birth twice during the War in Vietnam. The following text excerpt is a compilation of yes or no questions with their respective answers. The first question is about whether a llama can birth twice during the War in Vietnam, to which the answer is no. The second question asks if it is common to see frost during some college commencements, to which the answer is yes. The third question is about whether Hydrogen's atomic number squared exceeds the number of Spice Girls, to which the answer is no. The fourth question asks if Brooke Shields could succeed at the University of Pennsylvania, to which the answer is yes. The fifth question is about whether hamsters provide food for any animals, to which the answer is yes. 
The sixth question is about which country the director of the film Citizen Mavzik is from, to which the answer is not provided in the text. The seventh question asks who is the grandchild of Krishna Shah (Nepalese Royal), to which the answer is Prithvipati Shah. The eighth question is about which film had the director die later, The Gal Who Took the West or Twenty Plus Two, to which the answer is Twenty Plus Two. The ninth question asks when did the director of film Laughter In Hell die, to which the answer is August 25, 1963. The tenth question asks who was born first out of Martin Hodge and Ivania Martinich, to which the answer is Martin Hodge. The eleventh question asks who is Boraqchin (Wife Of Ogedei)'s father-in-law, to which the answer is Genghis Khan. The twelfth question asks if the director of film Coolie No. 1 (1995 Film) and director of film The Sensational Trial have the same nationality, to which the answer is no. The thirteenth question asks if Kurram Garhi and Trojkrsti are located in the same country, to which the answer is no. Retrieval-Augmented Language Generation with Continuous Information Gathering provides examples of 2WikiMultihopQA questions and their answers. The document emphasizes two skills for answering questions: utilizing the Search API and thinking step-by-step. The questions include identifying a director's death date, determining the grandchild of a Nepalese Royal, identifying a film's director who died later, determining the birth order of two individuals, identifying a father-in-law, determining if two directors have the same nationality, identifying the location of two places, and identifying the meaning behind colors on a flag. The answers are provided with step-by-step reasoning and in some cases, utilizing the Search API. This document discusses retrieval-augmented language generation using continuous information gathering.
The document includes tables with statistics and experimental settings for different tasks/datasets, as well as prompts and exemplars for these tasks/datasets. The hyperparameters of FLARE on different datasets and the Wikipedia-related domains excluded from Bing's search results are also listed in tables. The references include papers from the proceedings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Retrieval-augmented language generation with continuous information gathering is a topic covered in a research paper. The paper discusses the use of models in generating long-form answers to multi-hop questions. It also covers the use of chain-of-thought reasoning and Toolformer-style models in language generation. The paper emphasizes the importance of consistency in chain-of-thought reasoning and highlights self-ask-style prompting for knowledge-intensive multi-step questions. Additionally, the paper covers the use of open-domain question answering and probabilistic relevance frameworks in information retrieval. It also cites the paper by Adam Roberts, Colin Raffel, and Noam Shazeer asking how much knowledge can be packed into the parameters of a language model, alongside references on unsupervised multitask learners and on transfer learning with a unified text-to-text transformer. Other referenced work discusses the limits of these models, narrowing the compositionality gap in language models, measuring the quality of abstractive summarization, and training language models to follow instructions. Further papers mentioned include Language Models as Knowledge Bases?, Retrieval-Augmented Generation for Question Answering, and You Only Need One Model for Open-Domain Question Answering.
The article discusses the retrieval-augmented language generation approach for solving complex tasks in natural language processing (NLP). It explores the challenges of open-domain multi-document summarization and retrieval for open-domain question answering. The authors propose a method of composing retrieval and language models within a single transformer. They also discuss the use of dense passage retrieval and demonstrate-search-predict techniques in NLP. Additionally, the article highlights the importance of memorization and nearest neighbor language models in learning representations. The authors also introduce a supervised challenge dataset for reading comprehension and discuss the hurdles to progress in long-form question answering. Finally, the article mentions various researchers and their contributions in the field of NLP. This document is a collection of various papers related to natural language processing and machine learning. The papers cover topics such as retrieval-augmented language models, question answering, few-shot learning, summarization, and language modeling. Some of the papers highlight new datasets or benchmarks for evaluating models, while others propose new methods or techniques for improving model performance. Overall, the papers showcase the current state of the art in natural language processing research. This is a list of authors and references related to the paper "Retrieval-Augmented Language Generation with Continuous Information Gathering". The paper discusses the use of retrieval-based models for generating text, specifically in the context of answering open-domain questions. The authors experiment with a technique called FLARE, which involves continuous retrieval and integration of information during generation. They note that this approach can be computationally expensive and may not always improve performance. 
The paper also mentions several datasets used for evaluating language models, including ELI5 and Wizard of Wikipedia. The paper proposes an active retrieval augmentation framework for long-form language generation that retrieves relevant information during generation. The framework uses a combination of implicit and explicit query formulation methods and iteratively retrieves information based on the upcoming sentence. The effectiveness of the framework is demonstrated through experimental results on four tasks/datasets. The paper also investigates the importance of active retrieval and compares the performance of different query formulation methods. The results show that high-confidence sentences may not always need retrieval and that the use of low-confidence tokens can impede generation. The paper concludes by acknowledging the support of funding agencies and individuals who contributed to the experiments. The paper presents FLARE, a retrieval-augmented language generation method that uses forward-looking retrieval to improve question answering performance. FLARE outperforms baselines on StrategyQA, ASQA, ASQA-hint, WikiAsp, and 2WikiMultihopQA datasets. The paper also conducts an ablation study to evaluate the efficacy of various design choices, including the importance of forward-looking retrieval, the use of retrieval instructions and exemplars, and the effectiveness of question decomposition. The study shows that forward-looking retrieval is more powerful than past-context-based retrieval, and that retrieval instructions and exemplars are challenging to generate manually. The study also shows that the question decomposition approach achieves the best performance among all baselines. The article discusses the FLARE (Forward-Looking Active Retrieval augmented generation) method for retrieval-augmented language generation with continuous information gathering. 
A baseline variant uses previous sentences to retrieve related entities and relations for use in generating subsequent sentences; FLARE instead queries with the forthcoming sentence. FLARE outperforms other baseline methods in multihop QA tasks, and is effective at identifying and generating on-topic output. The article provides thorough comparisons with baselines, experimental results, and evaluations using various metrics such as ROUGE and UniEval. The FLARE method can retrieve information from open-web documents, and is effective for generating aspect-based summaries about entities from 20 domains in Wikipedia. This text excerpt discusses various natural language processing tasks and their evaluation metrics. The first task is open-domain summarization, which aims to generate a comprehensive summary about a specific topic by gathering information from the open web. Evaluation metrics for this task include ROUGE, named-entity F1, and UniEval.
The second task is answering ambiguous questions, where systems must provide answers that cover all possible interpretations of the question. The testbed for this task is a collection of crowdsourced questions with multiple interpretations, and outputs are evaluated using exact match and Disambig-F1 scores.
The third task is generating comprehensive answers to questions seeking complex information. The testbed for this task is a dataset of long-form QA, and outputs are evaluated with exact match and F1, with BM25 used as the retriever.
The final task is commonsense reasoning, which requires systems to utilize both world and commonsense knowledge to generate an answer. Evaluation metrics for this task include EM and token-level F1, precision, and recall. The paper discusses retrieval-augmented language generation for multihop question answering. The authors work within the maximum input length of text-davinci-003 and retrieve several documents using the format in Prompt 3.3. They use exemplars to retrieve documents and incorporate retrieval results for exemplars. The authors use Wikipedia articles as the retrieval corpus, BM25 as the retriever, and state-of-the-art CoT methods. The goal of multihop QA is to answer complex questions through a process of information retrieval and reasoning, and the authors evaluate the effectiveness of FLARE on 4 different datasets. Question decomposition approaches require task-specific prompt engineering, which restricts their generalizability to new tasks. The paper discusses different approaches to generating follow-up questions and concludes that dynamic retrieval is effective for generating follow-up questions. The article presents a framework for retrieval-augmented language generation. Baseline approaches involve multi-time retrieval triggered either by every token or every few tokens. The retrieved information is used to guide the language model in generating relevant outputs. The article compares different retrieval strategies and proposes three baseline categories based on when information is retrieved. The document corpus is obtained either from open web search engines or from curated datasets. The user input is formatted to include relevant documents, which are ranked according to relevance. This document outlines a method called Retrieval-Augmented Language Generation with Continuous Information Gathering (FLARE) that generates natural language text by iteratively retrieving and incorporating information from external sources.
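The EM and token-level F1 metrics mentioned above can be sketched as follows. This is a minimal version: the SQuAD-style answer normalization that such evaluations typically apply (article and punctuation stripping) is omitted for brevity, and only lowercasing and whitespace tokenization are kept.

```python
from collections import Counter

def exact_match(pred, gold):
    # EM: 1 if the normalized strings are identical, else 0.
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    # Token-level F1 over whitespace tokens, built from precision and recall.
    p_toks, g_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "the cat")` gives precision 2/3 and recall 1, hence F1 = 0.8.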
The process begins with an initial query, which can be generated using a language model (LM) or provided by the user. Then, FLARE generates sentences by iteratively retrieving and incorporating information from documents, using a ranked document list to guide subsequent sentences. To form queries that the retrieved information can answer, FLARE uses two methods: implicit query formulation, which masks out low-confidence tokens and uses the remaining sentence directly as the query, and explicit query generation, which prompts the LM to generate a question about each span whose token probabilities fall below a threshold. FLARE aims to improve retrieval accuracy and remove potential distractions from the generated text. The paper discusses retrieval-augmented language generation using continuous information gathering. The approach involves using a query to retrieve relevant information from a large dataset and generate long-form content. The authors propose confidence-based active retrieval and direct FLARE methods to address issues with generating search queries and disrupting answer generation. The approach involves iteratively generating search queries and retrieving relevant documents to regenerate the next sentence. The authors found that LMs can effectively combine two skills and generate meaningful search queries to aid subsequent generation. The paper proposes a retrieval-augmented language generation framework called FLARE. Two methods are proposed to instruct the LM to generate answers using retrieval-encouraging instructions and retrieval queries when necessary while generating. The framework is inspired by Toolformer and implements active retrieval augmented generation methods to reflect the intent of upcoming generation. The structure of the prompt is shown in Prompt 3.1, which generates search queries while performing the task.
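The two query formulation strategies can be sketched as follows. The token lists, probabilities, and threshold are illustrative; the LM call that would turn each low-confidence span into an explicit question is omitted, so only the span-extraction step of the explicit variant is shown.

```python
def implicit_query(tokens, probs, theta=0.5):
    # Implicit formulation: mask out low-confidence tokens and use the
    # remaining high-confidence tokens directly as the retrieval query.
    return " ".join(t for t, p in zip(tokens, probs) if p >= theta)

def low_confidence_spans(tokens, probs, theta=0.5):
    # Explicit formulation (first step): collect contiguous low-confidence
    # spans; each span would then be turned into a question by prompting
    # the LM (that call is omitted here).
    spans, cur = [], []
    for t, p in zip(tokens, probs):
        if p < theta:
            cur.append(t)
        elif cur:
            spans.append(" ".join(cur))
            cur = []
    if cur:
        spans.append(" ".join(cur))
    return spans
```

On a tentative sentence like "Joe Biden attended the University of Pennsylvania" where the university name is uncertain, the implicit query keeps only the confident prefix, while the explicit variant isolates "University of Pennsylvania" as the span to ask about.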
Retrieved documents are prepended before the generation, and LMs should issue queries only when needed so as to avoid unnecessary or inappropriate retrieval. At each step, generation conditions on the retrieved documents and continues until the next search query is generated or the output ends. The input for each test case consists of several task-related exemplars and several search-related exemplars; since the LM is accessible only through an API, this behavior is elicited by few-shot prompting. A straightforward way of expressing an information need is to generate "[Search(query)]" inline; in this FLARE-with-retrieval-instructions variant, the user input serves as the initial query and subsequent generation conditions on each new retrieval.

More broadly, the paper proposes a framework for active retrieval augmented generation, in which the model decides when and what to retrieve throughout the generation process, with the goal of aiding long-form generation with retrieval. The most common prior approach retrieves once, using the input as the query, and generates the answer from the retrieved documents. The paper examines various ways of determining when and what to retrieve, and proposes Forward-Looking Active Retrieval augmented generation (FLARE), which iteratively generates a temporary next sentence, uses it as a query to retrieve relevant documents, and then regenerates the next sentence conditioned on the retrieved documents. FLARE is evaluated on 4 diverse tasks/datasets involving long outputs, including multihop QA, open-domain summarization, commonsense reasoning, and long-form QA, and achieves superior or competitive performance compared to single-time and multi-time retrieval baselines, demonstrating its effectiveness and generalizability. The paper argues that what LMs intend to generate in the future should inform what is retrieved, and adopts an active retrieval strategy that retrieves only when LMs generate low-probability tokens.
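The inline "[Search(query)]" mechanism implies a small parsing step: when the call appears in generated text, generation pauses, the query is issued, and the call is stripped before continuing. A minimal sketch of that step, with the function name and cleanup behavior as assumptions rather than the paper's code:

```python
import re

# Matches an inline retrieval call of the form "[Search(query)]".
SEARCH_PATTERN = re.compile(r"\[Search\((.+?)\)\]")

def extract_search_query(generated_text):
    """Return (query, text_with_call_removed), or (None, text)
    when the generated text contains no retrieval call."""
    match = SEARCH_PATTERN.search(generated_text)
    if match is None:
        return None, generated_text
    cleaned = generated_text[:match.start()] + generated_text[match.end():]
    return match.group(1), cleaned
```

In a full loop, a non-`None` query would trigger the retriever, and generation would resume conditioned on the new documents with the call removed from the running output.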
The paper proposes a new paradigm for long-form generation tasks called active retrieval-augmented generation, in which a language model (LM) actively decides when and what to retrieve throughout the generation process. The goal is to avoid unnecessary or inappropriate retrieval and to ensure that LMs retrieve information only when they lack the knowledge required to continue generating. Among the approaches discussed is forward-looking active retrieval augmented generation (FLARE), which iteratively generates a temporary next sentence conditioned on the user input and earlier retrieval results, retrieving extra information as needed during generation so the LM has enough knowledge to produce complex content. Example long-form generation tasks where this applies include open-domain summarization and multihop question answering.

Generative LMs have become a foundational component of many natural language processing (NLP) systems. Retrieval-augmented LMs pair a parametric LM with a non-parametric retrieval component that looks up relevant information in external knowledge resources such as document corpora. In recent years, increasingly powerful large LMs have demonstrated the ability to handle complex tasks from the input alone, but they are prone to hallucination; a promising remedy is to augment generation with retrieval, fetching relevant knowledge once information needs become clear from the user's input and then generating a complete answer conditioned on the retrieved documents.
Forward-Looking Active Retrieval augmented generation (FLARE) is a generic retrieval-augmented generation method that actively decides when and what to retrieve over the course of generation. FLARE iteratively predicts the upcoming sentence to anticipate future content, then uses that prediction as a query to retrieve relevant documents. The authors test FLARE along with baselines comprehensively over 4 long-form knowledge-intensive generation tasks and achieve superior or competitive performance on all of them, demonstrating the method's effectiveness. The work provides a generalized view of active retrieval augmented generation for tasks involving long-form output, such as long-form QA, open-domain summarization, and chain-of-thought (CoT) reasoning. Retrieval-augmented language models (LMs) retrieve information from external knowledge resources to ground generated text, addressing the factually inaccurate output that LMs produce when relying solely on their input. Large language models have remarkable abilities, but they tend to hallucinate; continuously gathering information during text generation is a promising way to mitigate this. The study was conducted by researchers from Carnegie Mellon University's Language Technologies Institute and Sea AI Lab.