Summary: "Self-Correction for LLMs: Mistake Finding and Correction" (arxiv.org)
6,895 words - PDF document
One Line
The paper examines the self-correction abilities of Large Language Models (LLMs) and finds that they have difficulty identifying mistakes, but backtracking can effectively correct incorrect outputs without impacting correct ones.
Key Points
- Large Language Models (LLMs) show limited self-correction ability: across the models evaluated, they struggle to identify logical mistakes in their own reasoning.
- The authors propose a backtracking method that uses mistake location information to improve LLM performance in output correction.
- Backtracking is shown to effectively correct incorrect outputs without significantly affecting correct outputs.
- Prompting for mistake location alone is not a reliable strategy for determining correctness in LLMs.
- Backtracking can correct logical errors in Chain-of-Thought reasoning traces, even without gold standard labels.
- Further research is needed to evaluate backtracking on a larger scale and in more realistic settings.
- The authors highlight the potential of dedicated reward models for mistake finding, and note that the ability to find mistakes may transfer across tasks.
Summaries
24 word summary
This paper evaluates Large Language Models' (LLMs) self-correction abilities, finding they struggle with mistake finding. Backtracking effectively corrects incorrect outputs without affecting correct ones.
66 word summary
This paper examines the self-correction abilities of Large Language Models (LLMs). It evaluates LLMs on a dataset of logical mistakes and finds that LLMs struggle with mistake finding. The authors propose a backtracking method that effectively corrects incorrect outputs without affecting correct ones. They also investigate using mistake location as a proxy for correctness and discuss the limitations and potential for further research in evaluating backtracking.
155 word summary
This paper examines the self-correction abilities of Large Language Models (LLMs). It introduces the concept of mistake finding and output correction as two components of the self-correction process. The authors evaluate various LLMs on a dataset of logical mistakes called BIG-Bench Mistake and find that LLMs struggle with mistake finding. They propose a backtracking method that uses mistake location information to improve output correction. The method effectively corrects originally incorrect outputs without significantly affecting originally correct ones. The paper also investigates using mistake location as a proxy for correctness and determines that it is not a reliable strategy. The authors demonstrate the effectiveness of backtracking with gold mistake location labels and simulated reward models. They conclude by discussing the limitations of their dataset and the potential for further research in evaluating backtracking on a larger scale. Overall, the paper contributes to understanding LLMs' self-correction capabilities and the potential use of reward models in the process.
384 word summary
This paper focuses on the self-correction capabilities of Large Language Models (LLMs). While previous research has shown promise in improving LLM outputs in terms of style and quality, there is limited evidence that LLMs can identify and correct their own reasoning and logical errors without external feedback. To address this, the authors break down the self-correction process into two components: mistake finding and output correction.
For mistake finding, the authors introduce BIG-Bench Mistake, a dataset of logical mistakes in Chain-of-Thought reasoning traces. They evaluate several state-of-the-art LLMs on this dataset and find that LLMs generally struggle with finding logical mistakes. This highlights the need for further improvements in mistake finding.
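To make the dataset concrete, here is a minimal sketch of how one annotated trace might be represented. The field names (`task`, `steps`, `mistake_index`) are illustrative assumptions, not the dataset's documented schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AnnotatedTrace:
    """One Chain-of-Thought trace with a human mistake annotation.

    Field names here are illustrative; the actual BIG-Bench Mistake
    schema may differ.
    """
    task: str                     # e.g. "word_sorting"
    steps: List[str]              # the CoT reasoning steps, in order
    mistake_index: Optional[int]  # index of the first incorrect step,
                                  # or None if the trace has no mistake

# A made-up example record:
trace = AnnotatedTrace(
    task="word_sorting",
    steps=[
        "The words to sort are: banana, apple, cherry.",
        "Alphabetically, 'apple' comes first.",
        "Next comes 'cherry'.",  # first logical mistake: 'banana' < 'cherry'
        "So the sorted list is: apple, cherry, banana.",
    ],
    mistake_index=2,  # 0-based index of the first mistaken step
)
```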
For output correction, the authors propose a backtracking method that uses information about mistake location to improve performance. They demonstrate that this method can correct outputs that are originally incorrect, with minimal effect on outputs that are originally correct. The backtracking method is seen as a lightweight alternative to reinforcement learning methods and remains effective with a reward model at 60-70% accuracy.
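A minimal sketch of the backtracking idea as described above, assuming a hypothetical `generate_step(prefix, temperature)` wrapper around the model: steps before the reported mistake are kept verbatim, the mistaken step is resampled at a higher temperature so the model can deviate from its original output, and the remainder of the trace is regenerated. The decoding parameters and stopping convention are assumptions for illustration, not the paper's exact settings.

```python
from typing import Callable, List, Optional

def backtrack(
    steps: List[str],
    mistake_index: int,
    generate_step: Callable[[List[str], float], Optional[str]],
) -> List[str]:
    """Regenerate a CoT trace from the location of its first mistake.

    `generate_step(prefix, temperature)` is a hypothetical LLM wrapper
    that returns the next reasoning step, or None when the trace is done.
    """
    prefix = steps[:mistake_index]  # steps before the mistake are kept verbatim

    # Resample the mistaken step at temperature 1.0 until it differs
    # from the original, so the correction actually changes something.
    new_step = generate_step(prefix, 1.0)
    while new_step == steps[mistake_index]:
        new_step = generate_step(prefix, 1.0)
    prefix.append(new_step)

    # Regenerate the rest of the trace greedily (temperature 0).
    while (step := generate_step(prefix, 0.0)) is not None:
        prefix.append(step)
    return prefix
```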
The paper also explores the concept of using mistake location as a proxy for correctness. They investigate whether LLMs can reliably determine the correctness of a trace based on mistake location alone. The results show that prompting for mistake location is a poor strategy for determining correctness, as the weighted average F1 scores are lower than a baseline that predicts all traces as incorrect.
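That baseline comparison can be illustrated in a few lines: when most traces contain a mistake, a trivial classifier that labels every trace as incorrect already earns a substantial weighted F1, and prompted mistake finding scores below this bar. The labels below are made up for illustration, not figures from the paper.

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels: 1 = trace contains a mistake, 0 = mistake-free.
# In BIG-Bench Mistake most traces do contain mistakes, so the
# "predict everything incorrect" baseline is hard to beat.
gold = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
baseline = [1] * len(gold)  # label every trace as incorrect

print(f1_score(gold, baseline, average="weighted", zero_division=0))
# ~0.71 on these made-up labels; prompted mistake finding would need
# to beat this trivial score to be a useful correctness signal.
```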
In addition, the authors conduct experiments to evaluate the effectiveness of backtracking. They show that backtracking with gold mistake location labels can correct logical errors in CoT traces. They also explore the use of simulated reward models and demonstrate that backtracking is still effective even without gold standard labels.
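One simple way to realize such a simulated reward model, sketched under the assumption that it returns the gold mistake location with probability equal to the target accuracy and a uniformly random wrong location otherwise (the paper's exact simulation procedure may differ):

```python
import random

def simulated_mistake_locator(
    gold_index: int, num_steps: int, accuracy: float, rng: random.Random
) -> int:
    """Return the gold mistake location with probability `accuracy`,
    otherwise a uniformly random wrong location (illustrative only)."""
    if rng.random() < accuracy:
        return gold_index
    wrong = [i for i in range(num_steps) if i != gold_index]
    return rng.choice(wrong)

rng = random.Random(0)
# Feeding backtracking from a "65%-accurate" locator, reusing the
# backtrack() sketch above:
# location = simulated_mistake_locator(gold, len(steps), 0.65, rng)
# fixed = backtrack(steps, location, generate_step)
```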
The paper concludes by discussing the limitations of their dataset and the need for further research to evaluate backtracking on a larger scale and in more realistic settings. They also highlight the potential of using dedicated reward models for mistake finding and the transferability of learning to find mistakes in out-of-distribution tasks.
Overall, this paper provides insights into the self-correction capabilities of LLMs and proposes a backtracking method for correcting logical errors. The findings contribute to the understanding of LLMs' ability to identify and correct their own mistakes, and the potential for using reward models in the self-correction process.