Summary of "GRACE: Discriminator-Guided Chain-of-Thought Reasoning" (arxiv.org)
20,805 words - PDF document
One Line
GRACE improves the performance of pre-trained language models by incorporating a Correctness Discriminator, leading to better accuracy and sample efficiency in complex reasoning tasks.
Key Points
- GRACE is a stepwise decoding approach that addresses the limitations of pre-trained language models in complex multi-step reasoning tasks.
- GRACE uses a Correctness Discriminator to guide the decoding process towards producing correct reasoning steps.
- GRACE outperforms baselines, including greedy decoding and self-consistency, in terms of final answer accuracy and intermediate reasoning correctness.
- The proposed method does not require LM training and only relies on sampling from the LM distribution.
- GRACE achieves higher accuracy in final answers and intermediate reasoning steps compared to baselines.
- Discriminator size and discriminator score coefficient have an impact on model performance.
- GRACE is more sample-efficient compared to vanilla self-consistency.
- The study explores the use of discriminators in training language models for math word problem solving.
Summaries
21 word summary
GRACE enhances pre-trained language models with a Correctness Discriminator, outperforming baselines in complex reasoning tasks. It improves accuracy and sample efficiency.
53 word summary
GRACE (Discriminator-Guided Chain-of-Thought Reasoning) improves pre-trained language models in complex reasoning tasks. It introduces a Correctness Discriminator and outperforms baselines like greedy decoding, verifiers, and self-consistency. GRACE involves negative sampling, alignment, and learning. It demonstrates higher accuracy in final answers and intermediate reasoning steps, surpassing greedy decoding and self-consistency while being more sample-efficient.
148 word summary
GRACE (Discriminator-Guided Chain-of-Thought Reasoning) is a decoding approach that enhances the performance of pre-trained language models in complex multi-step reasoning tasks. It introduces a Correctness Discriminator and does not require LM training or fine-tuning. GRACE outperforms baselines such as greedy decoding, verifiers, and self-consistency, improving final answer accuracy and reasoning chain correctness. The method involves negative sampling, alignment, and learning. During guided stepwise decoding, a pool of candidate next steps is sampled using nucleus sampling. Each candidate is scored using LM probability and the discriminator score, and the top-scored step is selected. The paper provides detailed information on training, evaluation, and results. GRACE demonstrates higher accuracy in final answers and intermediate reasoning steps, surpassing greedy decoding and self-consistency while being more sample-efficient. The paper discusses related work, limitations, and future directions of GRACE. Overall, GRACE shows promise in enhancing the correctness and quality of reasoning in language models.
432 word summary
GRACE (Discriminator-Guided Chain-of-Thought Reasoning) is a decoding approach that improves pre-trained language models' performance in complex multi-step reasoning tasks. It introduces a Correctness Discriminator to guide the decoding process and does not require LM training or fine-tuning. GRACE was evaluated on various reasoning tasks and demonstrated significant performance gains compared to baselines such as greedy decoding, verifiers, and self-consistency.
The proposed method involves three steps: negative sampling, alignment, and learning. Negative sampling collects incorrect solutions by sampling from the LM and keeping only solutions with incorrect final answers. Alignment aligns these incorrect solutions with reference solutions using the Needleman-Wunsch algorithm. Learning involves training the discriminator to distinguish between correct and incorrect steps.
During guided stepwise decoding, a pool of candidate next steps is sampled using nucleus sampling from the LM distribution. Each candidate is scored using LM probability and the discriminator score, and the top-scored step is selected and added to the prefix until a final answer is generated.
GRACE outperformed baselines on multiple benchmarks, improving final answer accuracy and reasoning chain correctness. It does not require LM training and solely relies on sampling from the LM distribution.
The paper provides detailed information on the training and evaluation processes, as well as the results and analysis of using discriminators in training language models for math word problem solving. GRACE's effectiveness is verified on two language models, FLAN-T5 Large and LLaMA, in various reasoning tasks. Results show higher final answer accuracy, especially when combined with self-consistency.
Evaluation of intermediate step correctness demonstrates that GRACE surpasses greedy decoding and self-consistency, indicating improved reasoning chain correctness. Trace error evaluation reveals a reduction in major mistakes compared to greedy decoding, demonstrating improved reasoning quality.
Further analysis explores the efficiency and performance of GRACE. It requires fewer samples than vanilla self-consistency to achieve the same accuracy, and increasing the discriminator score coefficient improves final answer accuracy. The size of the discriminator model is also investigated, showing that smaller discriminators can still achieve high accuracy.
The related work in controlled generation and multi-step reasoning is discussed, emphasizing the novelty of GRACE in its fine-grained control over solution decoding and its novel training process for the discriminator model. The paper addresses the limitations and future directions of GRACE, including potential overhead and extension to commercial APIs.
In conclusion, GRACE is an effective approach for improving multi-step reasoning in language models. It achieves higher accuracy in final answers and intermediate reasoning steps compared to baselines, outperforming greedy decoding and self-consistency while being more sample-efficient. The proposed method shows promise in enhancing the correctness and quality of reasoning in language models.
522 word summary
GRACE (Discriminator-Guided Chain-of-Thought Reasoning) is a stepwise decoding approach that addresses the limitations of pre-trained language models (LMs) in complex multi-step reasoning tasks. It proposes the use of a Correctness Discriminator to guide the decoding process towards producing correct reasoning steps. GRACE does not require LM training or fine-tuning and relies solely on sampling from the LM distribution. The approach was evaluated on various reasoning tasks and demonstrated substantial performance gains compared to baselines such as greedy decoding, verifiers, and self-consistency. When combined with self-consistency, GRACE outperformed all baselines by sizable margins. It improved both the final answer accuracy and the correctness of intermediate reasoning.
The proposed method involves three steps: negative sampling, alignment, and learning. Negative sampling collects solutions with incorrect steps by sampling from the LM and keeping only solutions with incorrect final answers. Alignment aligns these incorrect solutions with reference solutions using the Needleman-Wunsch algorithm. Learning involves training the discriminator to distinguish between correct and incorrect steps using a max-margin loss objective.
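The max-margin objective described above can be sketched as a hinge loss over a (correct, incorrect) step pair; the margin value and per-pair formulation here are illustrative assumptions, not the paper's exact hyperparameters:

```python
def max_margin_loss(score_correct: float, score_incorrect: float,
                    margin: float = 1.0) -> float:
    """Hinge-style max-margin loss for one (correct, incorrect) step pair.

    The loss is zero once the discriminator scores the correct step at
    least `margin` higher than the incorrect one; margin=1.0 is an
    illustrative choice, not the paper's setting.
    """
    return max(0.0, margin - score_correct + score_incorrect)
```

In training, this would be averaged over the aligned step pairs produced by the negative-sampling and alignment stages and minimized with gradient descent over the discriminator's parameters.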
During guided stepwise decoding, a pool of candidate next steps is sampled using nucleus sampling from the LM distribution. Each candidate is scored using a combination of LM probability and the discriminator score. The top-scored step is selected and added to the prefix until a final answer is generated.
GRACE outperformed greedy decoding, self-consistency, and verifiers on multiple benchmarks, significantly improving final answer accuracy and reasoning chain correctness. It does not require LM training and only relies on sampling from the LM distribution.
The paper provides detailed information on the training and evaluation processes, as well as the results and analysis of using discriminators in training language models for math word problem solving. The effectiveness of GRACE is verified on two language models, FLAN-T5 Large and LLaMA, in various reasoning tasks. Results show that GRACE outperforms baselines, achieving higher final answer accuracy. When combined with self-consistency, GRACE produces even better results. It improves the correctness of both final answers and intermediate reasoning steps.
Evaluation of intermediate step correctness shows that GRACE surpasses both greedy decoding and self-consistency, indicating improved reasoning chain correctness. Trace error evaluation indicates a reduction in major mistakes compared to greedy decoding, demonstrating improved reasoning quality.
The paper includes further analysis, exploring the efficiency and performance of GRACE. It requires fewer samples than vanilla self-consistency to achieve the same accuracy and shows that increasing the discriminator score coefficient improves final answer accuracy. The size of the discriminator model is also investigated, revealing that smaller discriminators can still achieve high accuracy.
The related work in controlled generation and multi-step reasoning is discussed, highlighting the novelty of GRACE in its fine-grained control over solution decoding and its novel training process for the discriminator model. The paper addresses the limitations and future directions of GRACE, including potential overhead and extension to commercial APIs.
In conclusion, GRACE is an effective approach for improving multi-step reasoning in language models. It achieves higher accuracy in final answers and intermediate reasoning steps compared to baselines. It outperforms greedy decoding and self-consistency while being more sample-efficient. The proposed method shows promise in enhancing the correctness and quality of reasoning in language models.
1753 word summary
GRACE (Discriminator-Guided Chain-of-Thought Reasoning) is a stepwise decoding approach that addresses the limitations of pre-trained language models (LMs) in complex multi-step reasoning tasks. LMs struggle with problems that require multi-step reasoning because they often assign high likelihoods to incorrect steps. This leads to incorrect solutions when decoding strategies optimize for solution likelihood. To overcome this issue, GRACE proposes the use of a Correctness Discriminator, trained with a contrastive loss over correct and incorrect steps, to guide the decoding process towards producing correct reasoning steps.
GRACE does not require LM training or fine-tuning. It only relies on sampling from the LM distribution. The approach was evaluated using models from the FLAN-T5 and LLaMA families on four math and two symbolic reasoning tasks. GRACE demonstrated substantial performance gains compared to greedy decoding, verifiers, and self-consistency in most settings. When combined with self-consistency, GRACE outperformed all baselines by sizable margins. Human and LLM evaluations on GSM8K showed that GRACE improved both the final answer accuracy and the correctness of intermediate reasoning.
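Self-consistency, the strongest baseline above, amounts to a majority vote over final answers from independently sampled reasoning chains. A minimal sketch, where the sampler callable and sample count are stand-ins for running the LM with temperature/nucleus sampling:

```python
from collections import Counter

def self_consistency(sample_final_answer, n_samples: int = 20):
    """Sample n reasoning chains and return the most frequent final answer.

    `sample_final_answer` stands in for one LM decode (with sampling)
    followed by extracting the final answer from the generated chain.
    """
    answers = [sample_final_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

GRACE's reported sample-efficiency gain means it reaches a given accuracy with a smaller `n_samples` than this vanilla voting scheme.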
The proposed method involves three steps: negative sampling, alignment, and learning. Negative sampling involves collecting solutions with incorrect steps by sampling from the LM and keeping only solutions with incorrect final answers. Alignment aligns these incorrect solutions with reference solutions using the Needleman-Wunsch algorithm to create examples with correct and incorrect steps. Learning involves training the discriminator with a max-margin loss objective to distinguish between correct and incorrect steps.
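The alignment step is the classic Needleman-Wunsch dynamic program. A minimal sketch over step sequences, where the similarity function is a stand-in for the embedding-based similarity (e.g. ROSCOE or SimCSE) the paper describes:

```python
def needleman_wunsch(a, b, sim, gap: float = -1.0):
    """Globally align step lists a and b, maximizing total similarity.

    Returns aligned (step_a, step_b) pairs, with None marking a gap.
    `sim` is a stand-in for an embedding-based step similarity.
    """
    n, m = len(a), len(b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]  # best score table
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    # Traceback to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + sim(a[i - 1], b[j - 1]):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    while i > 0:
        pairs.append((a[i - 1], None)); i -= 1
    while j > 0:
        pairs.append((None, b[j - 1])); j -= 1
    return list(reversed(pairs))
```

Sampled steps that align to a reference step with low similarity, or to a gap, can then be labeled incorrect, yielding the step-level (correct, incorrect) pairs the discriminator is trained on.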
During guided stepwise decoding, a pool of candidate next steps is sampled using nucleus sampling from the LM distribution. Each candidate is then scored using a combination of LM probability and the discriminator score. The top-scored step is selected and added to the prefix, and this process is repeated until a final answer is generated.
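The candidate-scoring step can be sketched as follows; the linear interpolation with coefficient `beta` is an assumption about the exact combination rule, chosen to mirror the "discriminator score coefficient" analyzed later:

```python
def select_next_step(candidates, lm_logprob, disc_score, beta: float = 0.8):
    """Pick the next reasoning step from a nucleus-sampled candidate pool.

    Each candidate is scored by combining the LM log-probability with the
    discriminator's correctness score; the (1 - beta)/beta interpolation
    is illustrative, not necessarily the paper's exact formula.
    """
    def combined(step):
        return (1.0 - beta) * lm_logprob(step) + beta * disc_score(step)
    return max(candidates, key=combined)
```

In the full loop, the selected step is appended to the prefix, a fresh candidate pool is sampled conditioned on the extended prefix, and decoding stops once a final answer is produced.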
The approach was evaluated on various reasoning tasks, including math word problems and symbolic reasoning tasks. GRACE outperformed greedy decoding, self-consistency, and verifiers on multiple benchmarks. It significantly improved final answer accuracy and reasoning chain correctness. The proposed method does not require LM training and only relies on sampling from the LM distribution.
In conclusion, GRACE is a stepwise decoding approach that addresses the limitations of pre-trained LMs in complex multi-step reasoning tasks. It employs a Correctness Discriminator to guide the decoding process towards producing correct reasoning steps. The approach outperforms baselines and improves both final answer accuracy and intermediate reasoning correctness. It does not require LM training and only relies on sampling from the LM distribution.
The GRACE Discriminator-Guided Chain-of-Thought Reasoning method is proposed in this paper to improve multi-step reasoning in language models. The approach involves training a step-level discriminator model that guides the decoding process towards correct reasoning steps. The effectiveness of GRACE is verified on two language models, FLAN-T5 Large and LLaMA, in various reasoning tasks. Results show that GRACE outperforms baselines such as greedy decoding and self-consistency, achieving higher final answer accuracy on math reasoning tasks. When combined with self-consistency, GRACE produces even better results, outperforming vanilla self-consistency with random sampling. Similar trends are observed in few-shot prompted tasks. The accuracy boost achieved by GRACE is significant, demonstrating its ability to improve the correctness of both final answers and intermediate reasoning steps.
Evaluation of intermediate step correctness is also conducted to assess the quality of the reasoning chains produced by GRACE. The prefix correctness (PC) metric is used to measure the correctness of the reasoning steps so far. Results show that GRACE surpasses both greedy decoding and self-consistency in PC, indicating improved reasoning chain correctness. Furthermore, trace error (TE) evaluation is performed to identify major mistakes in solutions with correct final answers. GRACE demonstrates a reduction in TE compared to greedy decoding, indicating improved reasoning quality.
Several analyses are conducted to explore the efficiency and performance of GRACE. It is found that GRACE requires fewer samples than vanilla self-consistency to achieve the same accuracy, making it more sample-efficient. The effect of the discriminator score coefficient is also studied, showing that increasing the coefficient improves final answer accuracy up to a certain level. The size of the discriminator model is investigated, revealing that smaller discriminators can still achieve high accuracy.
The related work in controlled generation and multi-step reasoning is discussed, highlighting the novelty of GRACE in its fine-grained control over solution decoding and its novel training process for the discriminator model. The limitations and future directions of GRACE are also addressed, including the overhead incurred by sampling and computing discriminator scores, reliance on reference solutions for alignment, and the potential for extending GRACE to commercial APIs.
In conclusion, GRACE is an effective approach for improving multi-step reasoning in language models. By training a step-level discriminator model and guiding the decoding process, GRACE achieves higher accuracy in final answers and intermediate reasoning steps compared to baselines. It demonstrates better performance than greedy decoding and self-consistency, while also being more sample-efficient. The proposed method shows promise in enhancing the correctness and quality of reasoning in language models.
The study explores the use of discriminators in training language models to solve math word problems. The authors reference several related papers that have investigated similar topics. They discuss the importance of faithful reasoning and interpretable logical reasoning in language models. The authors also mention the use of plug and play language models and large language models as reasoning teachers.
The paper provides implementation details, including hyperparameters and training methods. For example, the authors describe the use of discriminators in step scoring and the training process for the discriminator. They also explain the decoding process, which involves sampling reasoning steps and using a calculator for math operations.
The authors discuss the solution alignment algorithm used to align sampled solutions with the ground-truth solution. They filter out low-quality samples based on alignment cost. The algorithm utilizes embeddings obtained from ROSCOE or SimCSE to compute the alignment.
The paper includes information about the few-shot prompt used to evaluate the model's ability to predict prefix correctness. They manually annotate 100 model-generated solutions from GSM8K to assess the model's performance. The few-shot prompt achieves a high macro F1 score in predicting prefix correctness.
The authors provide details about human evaluation, including datasets used and reference generation for step-by-step solutions. They explain how reference solutions are generated and how they filter out solutions that do not reach the correct final answer. The dataset statistics are also presented.
The paper presents further analysis, including the impact of discriminator size on model performance. Larger discriminators tend to perform better, especially for more complex tasks like GSM8K. The authors show that GRACE outperforms self-consistency and greedy decoding on GSM8K and SVAMP datasets.
Finally, the authors provide statistics on the annotator agreement for human evaluation. They achieved high agreement and used one annotator to annotate all 400 solutions.
Overall, the paper provides detailed information on the training and evaluation processes, as well as the results and analysis of using discriminators in training language models for math word problem solving.
D Few-shot prompts Tables 8, 9, and 10 show the 6-shot prompt used with LLaMA 7B with GSM8K, SVAMP, and MultiArith, respectively. Tables 11 and 12 show the 6-shot prompt used with LLaMA 13B for the Coin Flip and Tracking Shuffled Objects, respectively.
H Sample Outputs Tables 13, 14, 15, 16, 17, 18, 19, 20, 21 show compared sampled solutions from both GRACE and self-consistency where GRACE reached the correct answer but self-consistency did not.
Q: Jorge is 24 years younger than Simon. In 2005, Jorge is 16 years old. In 2010, how old would Simon be? A: In 2005, Simon was 40 years old. There are 5 years between 2005 and 2010. In 2010, Simon would be 45 years old.
Q: An entrepreneur is crowdfunding a new business effort. He has three different dollar amount levels of financial backing options and each level is ten times as high as the previous one. He needs to raise $12,000 to get his business off the ground. He succeeded after getting two backers at the highest level of financial backing, three at the second level, and ten at the lowest level. How many dollars was the highest level of financial backing? A: The lowest level of financial backing was $50. Thus, the highest level of financial backing was $5,000.
Q: 1/3 of the townspeople have received the full COVID vaccine. 1/3 are immune because they already recovered from COVID. If 1/6 of the townspeople are both vaccinated and already had COVID, what percent of the town is immune in some way? A: 50% of the town is immune in some way.
Q: James was doing some shopping in his favorite shop and he saw an offer on shoes that give him a second pair of shoes for half off the original price. He took the offer and took the first pair of shoes for $40 and the second one for $60. At the cashier, the lady told him that the offer applies strictly to the cheaper pair of shoes, and also that he was selected for an extra discount of a fourth off the total amount. How much money did James end up paying for both pairs of shoes? A: James ended up paying $60 for both pairs of shoes.
Q: A wildlife team is monitoring the number of birds in a park. There are 3 blackbirds in each of the park's 7 trees. There are also 13 magpies roaming around the park. How many birds are in the park in total? A: There are 34 birds in the park in total.
Table 8 shows the 6-shot prompt used with LLaMA 7B for GSM8K.
Q: Paul sold some books in a garage sale, leaving him with 27 books. If he had 136 books initially, how many books did he sell? A: Paul sold 109 books in the garage sale.
Q: White t-shirts can be purchased in packages of 13. If mom buys 39 white t-shirts, how many packages will she have? A: Mom will have 3 packages of t-shirts.
Q: There are 384 oranges and 192 bananas in Philip's collection. If the bananas are organized into 345 groups and oranges are organized into 16 groups, how big is