Summary: Distilling Step-by-Step! Outperforming Larger Language Models (arxiv.org)
6,805 words - PDF document
One Line
Distilling Step-by-Step trains smaller task-specific language models by extracting rationales from larger language models and using them as additional supervision, outperforming standard finetuning and distillation while using less training data and smaller model sizes.
Key Points
- Distilling Step-by-Step is a new method for training smaller task-specific models that outperform larger language models (LLMs).
- The method involves extracting rationales from LLMs and using them to train smaller models with less training data and smaller model sizes.
- Distilling Step-by-Step consistently outperforms standard finetuning and task distillation methods across various NLP tasks and datasets.
- The approach is data-efficient, requires less computation cost for deployment, and improves model interpretability.
- The authors also discuss extending the approach to improve multilingual QA and to reduce anti-social behaviors in LLMs.
Summaries
314 word summary
This paper presents Distilling Step-by-Step, an approach to outperforming larger language models, evaluated on several datasets: CQA, ANLI, e-SNLI, ASDiv, and SVAMP. The authors randomly subsample 10% of each dataset and augment the examples with human-labeled explanations, then train T5-XXL (11B), T5-Base (220M), and T5-Large (770M) models with specific hyperparameters. The paper includes implementation and experiment details and references related work, covering techniques such as weighted distillation with unlabeled examples, language model finetuning for text classification, self-supervised models for semi-supervised learning, and model compression for more efficient training and deployment.
The core technique, Distilling Step-by-Step, extracts rationales from larger language models (LLMs) and uses them as informative supervision when training smaller task-specific models. The method reduces the training data required to curate smaller models and can exceed the original LLM's performance, doing better than standard finetuning and distillation with less data and a smaller model. The authors also discuss extending Distilling Step-by-Step to improve multilingual QA and to reduce anti-social behaviors in LLMs.
Whereas standard task distillation trains a task-specific model by treating a teacher LLM's predicted labels as ground truths, Distilling Step-by-Step additionally trains on the LLM's rationales. It consistently outperforms both standard finetuning and standard task distillation, even when using much less labeled and unlabeled data.
Distilling Step-by-Step thus introduces a new paradigm for training smaller models that outperform LLMs, using a distillation approach that extracts rationales from LLMs as informative task knowledge for training smaller task-specific models. The approach efficiently leverages additional unlabeled data to match LLM performance while reducing the computation cost of deployment, is data-efficient, and has been shown to outperform larger language models even on fully unlabeled datasets.
715 word summary
Distilling Step-by-Step introduces a new paradigm for training smaller models that outperform larger language models (LLMs). The method uses a distillation approach that extracts rationales from LLMs as informative task knowledge for training smaller task-specific models, which reduces both the deployed model size and the data required for training, and allows additional unlabeled data to be leveraged efficiently to match LLM performance at lower deployment cost.
The framework trains smaller language models using generated rationales: natural language explanations that justify a predicted label. Chain-of-thought (CoT) prompting, with demonstrations pairing an example input with a rationale and label, is used to prompt the larger language model (LLM), not to train it; the LLM then generates output labels and rationales for an unlabeled dataset, and these outputs are used to train smaller downstream models. The proposed method is data-efficient and has been shown to outperform larger language models even on fully unlabeled datasets.
In experiments, the authors compare Distilling Step-by-Step to two common methods of learning task-specific models: standard finetuning (on human labels) and standard task distillation, which trains a task-specific model by treating a teacher LLM's predicted labels as ground truths. Distilling Step-by-Step consistently outperforms both, even when using much less labeled and unlabeled data. With only a coarse-grained search over model sizes and data amounts, it surpasses PaLM's Few-shot CoT performance using much smaller models and less data: models 2000x smaller on e-SNLI and 45x smaller on ANLI and CQA.
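The rationale-extraction stage described above can be sketched in a few lines: a few-shot CoT prompt is assembled from demonstrations (input, rationale, label), the LLM completes it for a new unlabeled input, and the completion is parsed back into a rationale and a label. This is a minimal sketch under assumptions: the prompt template, the demo example, and the "So the answer is" parsing convention are illustrative stand-ins, not the paper's actual prompts.

```python
def build_cot_prompt(demos, new_input):
    """Assemble a few-shot CoT prompt: each demo shows an input, a
    free-text rationale, and the final label; the new unlabeled input
    is appended so the LLM continues with its own rationale."""
    parts = []
    for d in demos:
        parts.append(
            f"Q: {d['input']}\n"
            f"A: {d['rationale']} So the answer is {d['label']}.\n"
        )
    parts.append(f"Q: {new_input}\nA:")
    return "\n".join(parts)

def parse_rationale_and_label(completion):
    """Split an LLM completion of the assumed form
    '<rationale> So the answer is <label>.' into its two parts."""
    rationale, _, label = completion.partition("So the answer is")
    return rationale.strip(), label.strip(" .")

# Hypothetical demonstration and completion, for illustration only.
demos = [{
    "input": "Sammy wanted to go to where the people were. Where might he go?",
    "rationale": "The answer must be a place with many people.",
    "label": "populated areas",
}]
prompt = build_cot_prompt(demos, "Where would you find a large crowd?")
rationale, label = parse_rationale_and_label(
    "Crowds gather in busy public places. So the answer is a city square."
)
```

In the actual pipeline the completion would come from the teacher LLM; the parsed (rationale, label) pairs then become training targets for the smaller model.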
Distilling Step-by-Step also exploits the value of added examples far more efficiently than standard task distillation when working toward the performance level of Few-shot CoT. On SVAMP, adding unlabeled examples from ASDiv closes the gap to Few-shot CoT, whereas standard distillation still struggles to catch up. In short, the technique extracts rationales from larger language models (LLMs) and uses them as informative supervision for training smaller task-specific models, reducing the training data needed to curate smaller models while outperforming the original LLM, standard finetuning, and standard distillation with less data and a smaller model. The authors also discuss extending Distilling Step-by-Step to improve multilingual QA and to reduce anti-social behaviors in LLMs.
The document covers various techniques for improving the performance of large language models, including weighted distillation with unlabeled examples, language model fine-tuning for text classification, self-supervised models for semi-supervised learning, and model compression for more efficient training and deployment. Other topics include interpretable question-answering pipelines and the role of explanation data in model learning.
The text excerpt contains a list of references and authors related to language models and their performance, covering topics such as training models with explanations, evaluating model explanations, solving math word problems, reasoning, and distillation. The references also cover conferences such as the Association for Computational Linguistics and the Conference on Fairness, Accountability, and Transparency.
The referenced literature spans a range of topics in natural language processing and machine learning, including network adaptation via additive side networks (side-tuning), bootstrapping reasoning with reasoning, using annotator rationales to improve machine learning, measuring association between labels and free-text rationales, eliciting reasoning in large models, faithful language reasoning using prompt-generated rationales, and distilling task-specific knowledge from BERT into simple neural networks. Other cited papers discuss language models for dialog applications, commonsense question answering, transfer learning with Jacobian matching, and training large-scale generative language models.
On the experimental side, the authors use several datasets (CQA, ANLI, e-SNLI, ASDiv, and SVAMP) and provide statistics for each in Table 1. They randomly subsample 10% of each dataset and augment the examples with human-labeled explanations. They train T5-XXL (11B), T5-Base (220M), and T5-Large (770M) models with specific hyperparameters using publicly available packages from huggingface/transformers, and run their experiments on cloud A100x16 GPU instances. The paper includes implementation and experiment details and references other relevant works.
1393 word summary
The paper discusses Distilling Step-by-Step, an approach for outperforming larger language models. The authors use several datasets (CQA, ANLI, e-SNLI, ASDiv, and SVAMP) and provide statistics for each in Table 1. They randomly subsample 10% of each dataset and augment the examples with human-labeled explanations. They train T5-XXL (11B), T5-Base (220M), and T5-Large (770M) models with specific hyperparameters using publicly available packages from huggingface/transformers, and run their experiments on cloud A100x16 GPU instances. Implementation and experiment details and references to related work are included.
The references cover a range of topics in natural language processing and machine learning, including network adaptation via additive side networks (side-tuning), bootstrapping reasoning with reasoning, using annotator rationales to improve machine learning, measuring association between labels and free-text rationales, eliciting reasoning in large models, faithful language reasoning using prompt-generated rationales, and distilling task-specific knowledge from BERT into simple neural networks. Other cited papers discuss language models for dialog applications, commonsense question answering, transfer learning with Jacobian matching, and training large-scale generative language models. The citations draw from various conferences and journals, such as the European Conference on Computer Vision, the Association for Computational Linguistics, and machine learning conferences, and further cover training models with explanations, evaluating model explanations, solving math word problems, reasoning, and distillation.
Authors cited include Stephen H Bach, Ryan Smith, Jason A Fries, Braden Hancock, Michael C Hughes, Danish Pruthi, Colin Raffel, Noam Shazeer, Adina Williams, and Richard Socher, among others. The references also span venues such as the Association for Computational Linguistics and the Conference on Fairness, Accountability, and Transparency, along with preprints on zero-shot reasoners and large language models.
Related work discussed includes techniques for improving the performance of large language models, such as weighted distillation with unlabeled examples, language model finetuning for text classification, self-supervised models for semi-supervised learning, and model compression for more efficient training and deployment. Other topics include interpretable question-answering pipelines and the role of explanation data in model learning.
The paper itself proposes Distilling Step-by-Step, a technique that extracts rationales from larger language models (LLMs) and uses them as informative supervision when training smaller task-specific models. The method reduces the training data required to curate smaller models and can exceed the original LLM's performance. While Distilling Step-by-Step has limitations, it performs better than standard finetuning, standard distillation, and larger LLMs using less data and smaller models; the authors also discuss extending it to improve multilingual QA and to reduce anti-social behaviors in LLMs.
The method surpasses the LLM's Few-shot CoT performance with only a coarse-grained search over model sizes and data amounts, outperforming PaLM's Few-shot CoT with much smaller models trained on less data. Results are plotted under both human-labeled and unlabeled settings. Distilling Step-by-Step exploits the value of added examples far more efficiently than standard task distillation in reaching Few-shot CoT performance: it outperforms Few-shot CoT using models 2000x smaller on e-SNLI and 45x smaller on ANLI and CQA, and on SVAMP, adding unlabeled examples from ASDiv closes the gap to Few-shot CoT whereas standard distillation still struggles to catch up. Standard finetuning, by contrast, fails to match the LLM's performance at the same model size.
Experimentally, Distilling Step-by-Step (DSS) consistently outperforms standard finetuning and distillation across varying model sizes and tasks, beating Few-shot CoT and PINTO tuning on all four datasets considered. DSS requires much less unlabeled data than standard task distillation, and achieves better performance than larger language models such as PaLM using smaller T5 models. For datasets where the distilled model underperforms, the authors propose augmenting the relatively small number of available data points. Throughout, the comparison is against the two common methods of learning task-specific models: standard finetuning and standard task distillation, which trains a task-specific model by treating a teacher LLM's predicted labels as ground truths.
They conduct experiments on four popular benchmark datasets across three different NLP tasks and show that Distilling Step-by-Step consistently outperforms both baseline methods, even when using much less labeled and unlabeled data. The authors also investigate the minimum resources required for Distilling Step-by-Step to outperform LLMs, showing that it matches LLM performance with much smaller model sizes while reducing both the number of training examples and the deployment cost. More dataset and implementation details are included in the appendices.
Methodologically, the distillation uses only a small subset of the full unlabeled dataset, and the LLM generates intermediate reasoning steps that guide the smaller model toward the resultant label; the smaller model is trained not only to predict task labels but also to generate the corresponding rationales. Compared with standard finetuning and task distillation, this is shown to be more effective, making Distilling Step-by-Step a promising method for natural language processing tasks.
The underlying framework trains smaller language models using rationales, i.e., natural language explanations for predicted labels. CoT prompting with demonstrations that pair an example input with a rationale is used to prompt the larger language model (LLM), which then generates output labels and rationales for an unlabeled dataset; these outputs are used to train the smaller downstream models. The method is data-efficient, outperforms larger language models even on fully unlabeled datasets, and its effectiveness has been demonstrated through various experiments.
In this framework, task prefixes are added to input examples, and the smaller model is trained to produce different outputs depending on the prefix: the task label for one prefix, the rationale for the other. Generated rationales are thus used to train small task-specific models in a multi-task learning setting, which reduces the need for large amounts of labeled data and improves model interpretability. The authors relate their approach to other recent knowledge distillation research and propose future investigations into combining human-generated and LLM-generated rationales.
Distilling Step-by-Step distills the capabilities of larger language models (LLMs) into smaller task-specific models that can reason with chain-of-thought (CoT) style rationales. It efficiently leverages additional unlabeled data to match LLM performance, and the resulting smaller models outperform LLMs while requiring less data and lower computation cost at deployment.
LLMs are challenging to deploy in real-world applications because of their sheer size: the memory and compute needed to serve them are far beyond affordable for most product teams, especially for applications that require low-latency performance. To circumvent these challenges, practitioners often instead train smaller task-specific models, but matching an LLM's strong zero/few-shot performance with such models has traditionally required large amounts of training data.
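The multi-task setup described above can be sketched as follows: task prefixes route the same input toward two different targets (label vs. rationale), and the two losses are combined with a weight. The prefix strings and the weight name `lam` are illustrative assumptions, not the authors' exact choices.

```python
def make_multitask_examples(x, label, rationale):
    """Return (input, target) pairs for the two sub-tasks: a prefix
    tells the model whether to emit the task label or the rationale."""
    return [
        ("[label] " + x, label),
        ("[rationale] " + x, rationale),
    ]

def combined_loss(label_loss, rationale_loss, lam=1.0):
    """Multi-task objective: L = L_label + lam * L_rationale."""
    return label_loss + lam * rationale_loss

# Hypothetical NLI-style example, for illustration only.
pairs = make_multitask_examples(
    "premise: A dog runs. hypothesis: An animal moves.",
    "entailment",
    "A dog is an animal and running is a form of moving.",
)
```

At inference time only the label prefix is needed, so the rationale-generation head adds no deployment cost; that is one reason the multi-task formulation is attractive over simply concatenating rationale and label into a single target.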
Distilling Step-by-Step extracts rationales from LLMs as informative task knowledge for training smaller task-specific models, reducing both the deployed model size and the data required for training. Compared to LLMs, it achieves better performance using substantially smaller models and far fewer labeled/unlabeled training examples; compared to finetuning and distillation, which require large amounts of human-annotated or LLM-generated labels to perform well, it reaches comparable or better performance with much less training data.