Summary: R-Tuning: Teaching Large Language Models to Refuse Unknown Questions (arxiv.org)
9,720 words - PDF document
One Line
R-Tuning is a technique that assesses the knowledge limitations of large language models, pinpoints areas of uncertainty, teaches them to decline queries they are unsure about, and enhances their performance on tasks they are knowledgeable about.
Key Points
- Hallucination, the propensity of large language models (LLMs) to generate non-existent facts, is a predominant issue with these models
- The significant gap between the knowledge in human-labeled instruction tuning datasets and the parametric knowledge of LLMs is a major cause of hallucination
- The proposed Refusal-Aware Instruction Tuning (R-Tuning) method identifies uncertain questions that are beyond the model's knowledge, and constructs a refusal-aware dataset to teach the model to express uncertainty when faced with such questions
- The refusal ability learned by R-Tuning is found to be a generalizable meta-skill that benefits from multi-task training
- Incorporating uncertainty learning into large model training can improve both the model's ability to estimate uncertainty and its overall accuracy
Summaries
20 word summary
R-Tuning measures knowledge gaps, identifies uncertain questions, and teaches LLMs to refuse unknown queries while improving accuracy on known tasks.
48 word summary
R-Tuning measures the knowledge gap between pre-trained LLMs and instruction data, identifying uncertain questions. It appends uncertainty expressions to these questions, teaching the model to refuse unknown queries while improving accuracy on known tasks. Experiments show the refusal ability generalizes across tasks and is enhanced through multi-task training.
122 word summary
This paper presents R-Tuning, a novel approach to address hallucination in large language models (LLMs). R-Tuning first measures the knowledge gap between the pre-trained model and the instruction tuning data, identifying uncertain questions beyond the model's knowledge. It then constructs a refusal-aware dataset by appending uncertainty expressions to uncertain questions, while keeping original labels for certain questions. This teaches the model to express uncertainty when faced with unknown questions, rather than hallucinating answers. Experiments show R-Tuning enables the model to refuse uncertain questions while improving accuracy on questions it can answer. Importantly, the refusal ability generalizes across tasks and is further enhanced through multi-task training. The authors suggest incorporating uncertainty learning into LLM training can improve both uncertainty estimation and overall accuracy.
329 word summary
This paper presents a novel approach called Refusal-Aware Instruction Tuning (R-Tuning) to address the hallucination issue in large language models (LLMs). The key insight is that the significant gap between the knowledge in human-labeled instruction tuning datasets and the parametric knowledge of LLMs is a major cause of hallucination.
R-Tuning consists of two main steps. First, it measures the knowledge gap between the pre-trained model and the instruction tuning data, identifying uncertain questions that are beyond the model's knowledge. This is done by comparing the model's predictions on the training data with the ground-truth labels. Questions where the prediction matches the label are considered "certain", while mismatched questions are "uncertain".
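To make the first step concrete, the sketch below queries a pre-trained causal LM on each training question and splits the data by whether the greedy answer matches the gold label. It is a minimal illustration, not the paper's implementation: the checkpoint name, prompt format, and containment-based matching rule are all assumptions.

```python
# Sketch: partition instruction-tuning data into "certain" and "uncertain" examples
# by comparing the pre-trained model's greedy answer with the ground-truth label.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openlm-research/open_llama_3b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def greedy_answer(question: str, max_new_tokens: int = 32) -> str:
    prompt = f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

def split_dataset(examples):
    """examples: list of {"question": ..., "answer": ...} dicts."""
    certain, uncertain = [], []
    for ex in examples:
        pred = greedy_answer(ex["question"])
        # a simple containment check stands in for whatever matching rule the paper uses
        if ex["answer"].lower() in pred.lower():
            certain.append(ex)
        else:
            uncertain.append(ex)
    return certain, uncertain
```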
Second, R-Tuning constructs a refusal-aware dataset by appending uncertainty expressions to the uncertain questions, while keeping the original labels for the certain questions. This teaches the model to express uncertainty when faced with questions outside its knowledge boundary, rather than hallucinating answers.
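The second step is then mostly bookkeeping over the split produced above. In the sketch below, certain examples keep their original label and uncertain examples get an uncertainty expression appended; the exact wording of the suffix and the prompt/completion format are assumptions for illustration.

```python
# Sketch: build the refusal-aware training set from the certain/uncertain split.
UNSURE_SUFFIX = " I am unsure."  # illustrative uncertainty expression

def build_refusal_aware_data(certain, uncertain):
    data = []
    for ex in certain:
        # certain questions keep their original label
        data.append({"prompt": f"Question: {ex['question']}\nAnswer:",
                     "completion": f" {ex['answer']}."})
    for ex in uncertain:
        # uncertain questions are marked with an uncertainty expression
        data.append({"prompt": f"Question: {ex['question']}\nAnswer:",
                     "completion": f" {ex['answer']}.{UNSURE_SUFFIX}"})
    return data
```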
The authors' experiments on diverse datasets show that R-Tuning enables the model to refuse uncertain questions while improving accuracy on the questions it is willing to answer, compared to traditional fine-tuning. Importantly, the refusal ability learned by R-Tuning is found to be a "meta-skill" that generalizes across tasks and is further enhanced through multi-task training.
A key finding is that learning uncertainty during training, rather than just applying uncertainty filtering at test time, yields better results. This suggests that incorporating uncertainty learning into large model training can improve both the model's ability to estimate uncertainty and its overall accuracy. Further analysis reveals that uncertain questions have higher perplexity and entropy in the model's predictions, explaining why R-Tuning is effective at distinguishing them.
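The perplexity observation can be checked directly by scoring each gold answer under the pre-trained model and comparing the distributions for certain versus uncertain questions. The sketch below (reusing the model and tokenizer loaded earlier) converts the average token cross-entropy of the answer into a perplexity; masking the prompt out of the loss is an implementation choice, not a detail taken from the paper.

```python
import math
import torch

@torch.no_grad()
def answer_perplexity(question: str, answer: str) -> float:
    """Perplexity of the gold answer tokens, conditioned on the question."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + answer, return_tensors="pt").input_ids.to(model.device)
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100                 # ignore prompt tokens in the loss
    loss = model(full_ids, labels=labels).loss    # mean cross-entropy over answer tokens
    return math.exp(loss.item())
```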
The authors also explore variants of R-Tuning, demonstrating the flexibility and effectiveness of the core approach. Overall, this work takes an important step towards building more reliable and trustworthy large language models that can better recognize the limits of their own knowledge, with broad implications for improving the safety and robustness of next-generation AI systems.
471 word summary
Refusal-Aware Instruction Tuning (R-Tuning) for Large Language Models
This paper presents a novel approach called Refusal-Aware Instruction Tuning (R-Tuning) to address the hallucination issue in large language models (LLMs). The key insight is that the significant gap between the knowledge in human-labeled instruction tuning datasets and the parametric knowledge of LLMs is a major cause of hallucination.
R-Tuning consists of two main steps. First, it measures the knowledge gap between the parametric knowledge of the pre-trained model and the instruction tuning data, and identifies uncertain questions that are beyond the model's knowledge. This is done by comparing the model's predictions on the training data with the ground-truth labels. Questions where the prediction matches the label are considered "certain", while mismatched questions are "uncertain".
Second, R-Tuning constructs a refusal-aware dataset by appending uncertainty expressions (e.g., "I am unsure") to the uncertain questions, while keeping the original labels for the certain questions. This teaches the model to express uncertainty when faced with questions outside its knowledge boundary, rather than hallucinating answers.
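For completeness, fine-tuning on the refusal-aware data can be as simple as standard supervised training of a causal LM on the concatenated prompt and completion. The loop below is a bare-bones sketch that assumes a full-precision model, batch size 1, and prompt tokens masked out of the loss; the learning rate and epoch count are arbitrary choices, and a real run would use batching and a proper trainer.

```python
from torch.optim import AdamW

def finetune(model, tokenizer, refusal_aware_data, epochs: int = 1, lr: float = 2e-5):
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for ex in refusal_aware_data:            # batch size 1 for clarity
            prompt_len = tokenizer(ex["prompt"], return_tensors="pt").input_ids.shape[1]
            full_ids = tokenizer(ex["prompt"] + ex["completion"],
                                 return_tensors="pt").input_ids.to(model.device)
            labels = full_ids.clone()
            labels[:, :prompt_len] = -100        # train only on the completion tokens
            loss = model(full_ids, labels=labels).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    model.eval()
    return model
```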
The authors conduct experiments on both single-task and multi-task settings, evaluating on 7 diverse datasets. The results show that R-Tuning enables the model to refuse to answer uncertain questions, while improving the accuracy on the questions it is willing to answer, compared to traditional fine-tuning approaches. Importantly, the refusal ability learned by R-Tuning is found to be a "meta-skill" that generalizes across tasks, and is further enhanced through multi-task training.
A key finding is that learning uncertainty during training, rather than just applying uncertainty filtering at test time, yields better results. This suggests that incorporating uncertainty learning into large model training can improve both the model's ability to estimate uncertainty and its overall accuracy. Further analysis reveals that uncertain questions have higher perplexity and entropy in the model's predictions, explaining why R-Tuning is effective at distinguishing them.
The authors also explore variants of R-Tuning, including an unsupervised identification strategy and a label replacement method. These alternatives demonstrate the flexibility and effectiveness of the core R-Tuning approach.
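The summary does not spell out how the unsupervised variant identifies uncertain questions. One plausible instantiation, sketched below, samples several answers per question and treats low agreement among the samples as a sign of uncertainty; the number of samples, temperature, and agreement threshold are all assumptions, not the paper's settings.

```python
from collections import Counter

def sample_answers(question: str, k: int = 5, temperature: float = 0.7):
    prompt = f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outs = model.generate(**inputs, max_new_tokens=32, do_sample=True,
                          temperature=temperature, num_return_sequences=k)
    start = inputs["input_ids"].shape[1]
    return [tokenizer.decode(o[start:], skip_special_tokens=True).strip() for o in outs]

def is_uncertain(question: str, k: int = 5, agreement_threshold: float = 0.6) -> bool:
    """Flag a question as uncertain if the sampled answers disagree too much."""
    answers = sample_answers(question, k=k)
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / k < agreement_threshold
```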
In summary, the main contributions of this work are:
1. Identifying the knowledge gap between instruction tuning data and parametric knowledge as a key cause of hallucination in LLMs.
2. Proposing the R-Tuning method to teach LLMs to refuse unknown questions by distinguishing certain and uncertain data during instruction tuning.
3. Showing that the refusal ability learned by R-Tuning is a generalizable meta-skill that benefits from multi-task training.
4. Discovering the advantages of incorporating uncertainty learning into large model training, both in reducing computational overhead and improving overall model accuracy.
Overall, this work takes an important step towards building more reliable and trustworthy large language models that can better recognize the limits of their own knowledge. The insights and techniques developed here could have broad implications for improving the safety and robustness of next-generation AI systems.
832 word summary
R-Tuning: Teaching Large Language Models to Refuse Unknown Questions
Hanning Zhang*, Shizhe Diao*, Yong Lin*, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, Tong Zhang (The Hong Kong University of Science and Technology and University of Illinois Urbana-Champaign; *equal contribution). Code: https://github.com/shizhediao/R-Tuning
Abstract: Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face their challenges. A predominant issue is the propensity for these models to generate non-existent facts, a concern termed hallucination. Our research is motivated by the observation that previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not. When the question is out of the parametric knowledge, it will try to make up something and fail to indicate when it lacks knowledge. In this paper, we present a new approach called Refusal-Aware Instruction Tuning (R-Tuning). This approach is formalized by first identifying the knowledge gap between parametric knowledge and the instruction tuning data. Then, we construct the refusal-aware data based on the knowledge intersection, to tune LLMs to refrain from responding to questions beyond its parametric knowledge. Experimental results demonstrate this new instruction tuning approach effectively improves a model's ability to answer known questions and refrain from answering unknown questions. Furthermore, when tested on out-of-domain datasets, the refusal ability was found to be a meta-skill that could be generalized to other tasks. Further analysis surprisingly finds that learning the uncertainty during training displays a better ability to estimate uncertainty than uncertainty-based testing.
Introduction (excerpt): Large language models (LLMs) have demonstrated remarkable performance across numerous tasks; however, they are also plagued by various issues, such as the propensity of large models to fabricate non-existent facts, a phenomenon commonly referred to as hallucination (Maynez et al., 2020a).
[Figure 1: the intersection between parametric knowledge (what the model already knows) and the instruction tuning data (what the model might not know).]
Refusal-Aware Instruction Tuning (R-Tuning) for Large Language Models
This paper proposes a novel instruction tuning method called Refusal-Aware Instruction Tuning (R-Tuning) to address the hallucination issue in large language models (LLMs). The key insight is that the significant gap between the knowledge in human-labeled instruction tuning datasets and the parametric knowledge of LLMs is a major cause of hallucination.
R-Tuning consists of two main steps. First, it measures the knowledge gap between the parametric knowledge of the pre-trained model and the instruction tuning data, and identifies uncertain questions that are beyond the model's knowledge. This is done by comparing the model's predictions on the training data with the ground-truth labels. Questions where the prediction matches the label are considered "certain", while mismatched questions are "uncertain".
Second, R-Tuning constructs a refusal-aware dataset by appending uncertainty expressions (e.g., "I am unsure") to the uncertain questions, while keeping the original labels for the certain questions. This teaches the model to express uncertainty when faced with questions outside its knowledge boundary, rather than hallucinating answers.
The authors conduct experiments on both single-task and multi-task settings, evaluating on 7 diverse datasets. The results show that R-Tuning enables the model to refuse to answer uncertain questions, while improving the accuracy on the questions it is willing to answer, compared to traditional fine-tuning approaches. Importantly, the refusal ability learned by R-Tuning is found to be a "meta-skill" that generalizes across tasks, and is further enhanced through multi-task training.
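Evaluating such a model involves two quantities rather than one: how often it refuses, and how accurate it is on the questions it chooses to answer. A simple scorer along those lines is sketched below; detecting refusals by matching an uncertainty keyword is an assumption about the output format, not the paper's exact protocol.

```python
def evaluate(predictions, references, refusal_marker: str = "unsure"):
    """predictions/references: parallel lists of answer strings."""
    answered_correct, answered_total, refused = 0, 0, 0
    for pred, ref in zip(predictions, references):
        if refusal_marker in pred.lower():
            refused += 1
            continue
        answered_total += 1
        answered_correct += int(ref.lower() in pred.lower())
    return {
        "refusal_rate": refused / len(predictions),
        "accuracy_on_answered": answered_correct / max(answered_total, 1),
    }
```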
A key finding is that learning uncertainty during training, rather than just applying uncertainty filtering at test time, yields better results. This suggests that incorporating uncertainty learning into large model training can improve both the model's ability to estimate uncertainty and its overall accuracy. Further analysis reveals that uncertain questions have higher perplexity and entropy in the model's predictions, explaining why R-Tuning is effective at distinguishing them.
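Entropy can likewise be measured at the token level by averaging the entropy of the model's next-token distribution over the answer positions; higher values on uncertain questions would be consistent with the analysis above. The sketch below reuses the model and tokenizer loaded earlier and is a generic computation, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_entropy(question: str, answer: str) -> float:
    """Average entropy (in nats) of the predictive distribution over answer tokens."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + answer, return_tensors="pt").input_ids.to(model.device)
    logits = model(full_ids).logits[0]                        # (seq_len, vocab)
    probs = F.softmax(logits[prompt_len - 1 : -1], dim=-1)    # distributions that predict answer tokens
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy.mean().item()
```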
The authors also explore variants of R-Tuning, including an unsupervised identification strategy and a label replacement method. These alternatives demonstrate the flexibility and effectiveness of the core R-Tuning approach.
In summary, the main contributions of this work are:
1. Identifying the knowledge gap between instruction tuning data and parametric knowledge as a key cause of hallucination in LLMs.
2. Proposing the R-Tuning method to teach LLMs to refuse unknown questions by distinguishing certain and uncertain data during instruction tuning.
3. Showing that the refusal ability learned by R-Tuning is a generalizable meta-skill that benefits from multi-task training.
4. Discovering the advantages of incorporating uncertainty learning into large model training, both in reducing computational overhead and improving overall model accuracy.
Overall, this work takes an important step towards building more reliable and trustworthy large language models that can better recognize the limits of their own knowledge. The insights and techniques developed here could have broad implications for improving the safety and robustness of next-generation AI systems.