Summary: ExpertQA - Evaluating Factuality and Attribution in Language Models (arxiv.org)
12,337 words - PDF document
One Line
The study assesses the factuality and attribution of language model responses across domains using the ExpertQA dataset.
Key Points
- Language models are being used in various fields, such as medicine and law, but ensuring accurate information supported by reliable sources is crucial.
- Previous studies on factuality and attribution in language models have not focused on domain-specific scenarios.
- The study evaluates factuality and attribution in language models using the ExpertQA dataset.
- Annotators judge the factual correctness of claims based on their expertise, evidence from the system, and minimal internet browsing.
- GPT-4 generates citations to URLs that often point to trustworthy domains, but the content on those pages frequently does not match the claims it is cited to support.
- The authors evaluate automatic attribution by using an NLI classifier from prior work as an AutoAIS system to predict attribution labels.
- The Attributable to Identified Sources (AIS) framework, proposed in prior work, is used for human evaluation of attribution.
Summaries
18 word summary
This study evaluates the factuality and attribution of language model responses across fields using the ExpertQA dataset.
43 word summary
Language models are being used in various fields, such as medicine and law, but ensuring their accuracy and reliability is crucial. The study evaluates factuality and attribution in language models using the ExpertQA dataset. Annotators judge the factual correctness of claims based on their expertise, evidence from the system, and minimal internet browsing.
617 word summary
Language models are being used in various fields, such as medicine and law, but ensuring that they provide accurate information supported by reliable sources is crucial. Previous studies on factuality and attribution in language models have not focused on domain-specific scenarios; this evaluation addresses that gap.
The study evaluates factuality and attribution in language models using the ExpertQA dataset. Retrieve-and-read systems struggle to produce citations for all cite-worthy claims, and high-stakes domains like medicine and law show a large percentage of incomplete attributions and unreliable sources.
Table 1 shows examples from ExpertQA, including question types and counts. Table 2 categorizes question types according to different information needs. The study collected over 3000 questions from experts across a range of fields.
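To make the dataset structure concrete, here is a minimal sketch of how one such example could be represented in memory; the field names are hypothetical, and the released ExpertQA data may use a different schema.

```python
# Hedged sketch: a possible in-memory representation of an ExpertQA example.
# Field names are hypothetical; the released dataset may differ.
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str                       # one atomic claim from a system answer
    evidence: list[str] = field(default_factory=list)  # cited URLs or passages
    factuality: str | None = None   # e.g. "Definitely correct" (expert label)


@dataclass
class ExpertQAExample:
    question: str
    question_type: str              # information-need category (cf. Table 2)
    domain: str                     # expert's field, e.g. medicine or law
    claims: list[Claim] = field(default_factory=list)
```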
Annotators are asked to judge the factual correctness of claims based on their expertise, evidence from the system, and minimal internet browsing. They are instructed to be conservative in their judgments and to label a claim as "Definitely correct" only if every part of the claim is accurate.
This excerpt discusses the evaluation setup for factuality and attribution. The authors use a prompt from Table 10 and consider commercial systems like BingChat. They sample responses from BingChat and other systems, excluding abstained answers, so the number of evaluated examples varies across systems.
GPT-4 generates citations to URLs that often lead to trustworthy domains, but the content on these pages often does not match the claims it is cited to support. Both prompting and retrieval-augmented systems generate mostly relevant claims, though a significant percentage of claims are irrelevant or void.
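A first step toward this kind of citation analysis can be sketched as follows: extract the cited URLs from a response and tally their domains. The regex and function names below are illustrative, not taken from the paper.

```python
# Hedged sketch: extract cited URLs from a model response and group them by
# domain, as a first step toward checking where citations point.
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\)\]>]+")


def domain_counts(response: str) -> Counter:
    """Count how often each web domain is cited in a response."""
    return Counter(urlparse(u).netloc for u in URL_RE.findall(response))


# Example: two citations to the same domain are tallied together.
print(domain_counts(
    "See https://www.nih.gov/a and https://www.nih.gov/b and https://example.com/c"
))
```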
In this study, the authors also evaluate automatic prediction of attribution labels. They use the NLI classifier from a previous work as an AutoAIS system to predict attribution labels for claim-evidence pairs, and compare the AutoAIS predictions against human judgments.
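This AutoAIS setup can be approximated with an off-the-shelf NLI model by scoring each evidence-claim pair and treating entailment as attribution. A minimal sketch, assuming a generic MNLI model (roberta-large-mnli) as a stand-in rather than the authors' exact classifier:

```python
# Hedged sketch: approximating AutoAIS with an off-the-shelf NLI model.
# roberta-large-mnli is a stand-in, not necessarily the paper's classifier.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")


def is_attributable(evidence: str, claim: str) -> bool:
    """Label a claim attributable iff the cited evidence entails it."""
    out = nli({"text": evidence, "text_pair": claim})
    return out["label"] == "ENTAILMENT"


# Example: a claim paraphrased from its evidence should be entailed.
print(is_attributable(
    "Aspirin reduces the risk of heart attack in some patients.",
    "Aspirin can lower heart attack risk.",
))
```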
The framework of Attributable to Identified Sources (AIS) is proposed for human evaluation of attributions. Systems still struggle to provide precise attributions for cite-worthy statements. Automatic methods for measuring attribution have been explored, including textual entailment models among other approaches.
The remaining excerpts list references to research papers, preprints, and technical reports related to evaluating factuality and attribution in language models. The references cover topics such as large language models, attributed question answering, claim verification, long-form question answering, QA-based factual consistency evaluation, legal reasoning in language models, automated fact-checking, automatic evaluation of summaries, collaborative datasets for generative information-seeking, benchmarks for factuality evaluation, cross-lingual and conversational question answering, information retrieval, user goals in web search, instruction-following models, and language models for dialog applications.
The excerpt on the annotation study notes the following:
- The study involved 484 participants from 26 different countries who were considered experts in their fields.
- Participants were informed that their annotations would be used to evaluate the capabilities of language models in providing truthful answers.
This excerpt collects tables and prompts used in the evaluation. Table 9 outlines a prompt for GPT-4 and BingChat that instructs the systems to answer questions and support their answers with citations.
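A citation-eliciting prompt in this spirit might look like the sketch below; the wording is illustrative, not the authors' Table 9 text.

```python
# Hedged sketch: a citation-eliciting prompt template in the spirit of the
# paper's Table 9. The wording is illustrative, not the authors' exact prompt.
PROMPT_TEMPLATE = (
    "Answer the following question from an expert in {field}. "
    "Support every claim in your answer with a citation to a source URL, "
    "using bracketed markers like [1] and listing the URLs at the end.\n\n"
    "Question: {question}\nAnswer:"
)

prompt = PROMPT_TEMPLATE.format(
    field="medicine",
    question="How does aspirin reduce the risk of heart attack?",
)
```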