Summary: Fast Inference from Transformers via Speculative Decoding (arxiv.org)
8,453-word PDF document
One Line
Fast Inference from Transformers via Speculative Decoding speeds up the inference process of large autoregressive models by using efficient approximation models to generate speculative prefixes for slower target models.
Key Points
- Fast Inference from Transformers via Speculative Decoding is a method developed to accelerate inference from large autoregressive models like Transformers.
- Speculative decoding involves using more efficient approximation models to generate speculative prefixes for the slower target models.
- The method reduces the number of serial calls to the target model; the achievable speedup grows as the divergence between the approximation and target probability distributions shrinks.
- T5-small achieves the highest speedup among the tested approximation models.
- Speculative decoding enables fast inference from transformers by decoding multiple tokens in parallel, providing 2X-3X speedups compared to optimized implementations like T5X; a minimal sketch of the core loop follows below.
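The core loop is compact enough to sketch. Below is a minimal, runnable Python illustration of one speculative decoding iteration, with toy softmax distributions standing in for the approximation model Mq and the target model Mp; the function names, toy vocabulary, and stand-in distributions are assumptions for illustration, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # toy vocabulary size (assumption for the demo)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def q_dist(prefix):
    # Hypothetical stand-in for the cheap approximation model Mq.
    return softmax(np.cos(np.arange(V) + len(prefix)))

def p_dist(prefix):
    # Hypothetical stand-in for the large target model Mp.
    return softmax(np.sin(np.arange(V) + 0.7 * len(prefix)))

def speculative_step(prefix, gamma=4):
    """One iteration: draft gamma tokens with Mq, verify with one parallel Mp call."""
    drafts, q_probs = [], []
    for _ in range(gamma):                      # serial, but Mq is cheap
        q = q_dist(prefix + drafts)
        drafts.append(int(rng.choice(V, p=q)))
        q_probs.append(q)
    # One (conceptually batched) Mp call scores all gamma+1 prefixes at once.
    p_probs = [p_dist(prefix + drafts[:i]) for i in range(gamma + 1)]
    accepted = []
    for i, x in enumerate(drafts):
        if rng.random() < min(1.0, p_probs[i][x] / q_probs[i][x]):
            accepted.append(x)                  # draft token passes the test
        else:
            # Rejected: resample once from the normalized residual max(0, p - q).
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            accepted.append(int(rng.choice(V, p=residual / residual.sum())))
            return prefix + accepted            # stop at the first rejection
    # All drafts accepted: sample one bonus token from Mp's last distribution.
    accepted.append(int(rng.choice(V, p=p_probs[gamma])))
    return prefix + accepted

print(speculative_step([1, 2, 3]))
```

The key property, proved in the paper, is that this accept/resample rule leaves the output distributed exactly as if every token had been sampled from Mp directly.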
Summaries
26 word summary
Fast Inference from Transformers via Speculative Decoding accelerates inference from large autoregressive models by using efficient approximation models to generate speculative prefixes for slower target models.
43 word summary
Fast Inference from Transformers via Speculative Decoding is a method developed to accelerate inference from large autoregressive models like Transformers. It involves using more efficient approximation models to generate speculative prefixes for the slower target models, reducing the number of serial calls to the target model while guaranteeing identical outputs.
423 word summary
Fast Inference from Transformers via Speculative Decoding is a method developed to accelerate inference from large autoregressive models like Transformers. The approach involves using more efficient approximation models to generate speculative prefixes for the slower target models. By running the target model in parallel on these speculative prefixes, several tokens can be verified and emitted per serial call.
The excerpt discusses the use of speculative decoding to improve fast inference from transformers. It assumes that p(x) and q(x) are the distributions given by Mp and Mq, respectively. The expected number of tokens produced by Algorithm 1 is a capped geometric variable: with per-token acceptance probability α and γ speculative tokens per iteration, E(#generated tokens) = (1 − α^(γ+1)) / (1 − α).
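Evaluated numerically, the formula makes the trade-off tangible; a small sketch (the function name and example values are illustrative assumptions):

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected number of tokens generated per serial call to the target
    model Mp, per the paper's capped-geometric result:
    E = (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# With acceptance rate alpha = 0.8 and gamma = 5 drafts per iteration,
# each serial Mp call yields about 3.69 tokens on average.
print(expected_tokens(0.8, 5))  # -> 3.689...
```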
The excerpt discusses the reduction factor in the number of serial calls to the target model and the improvement factor in walltime as a function of the divergence between the probability distributions. It introduces corollaries and theorems to support these findings. The text also analyzes the number of arithmetic operations, which increases by a factor depending on γ and the relative cost c of the approximation model.
The excerpt discusses fast inference from transformers using speculative decoding. It presents a graph showing the optimal γ as a function of α for various values of the cost coefficient c, and another graph showing the speedup factor and the increase in arithmetic operations as functions of γ.
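The same quantities let one search for the optimal γ numerically. A brief sketch, assuming the paper's expected walltime improvement factor (1 − α^(γ+1)) / ((1 − α)(γc + 1)) and a simple brute-force search (the search range and helper names are assumptions):

```python
def speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected walltime improvement factor, where c is the cost of one
    Mq step relative to one Mp step."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

def optimal_gamma(alpha: float, c: float, max_gamma: int = 64) -> int:
    # Brute-force over small integer gamma; cheap enough for this range.
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))

for a in (0.6, 0.8, 0.9):
    g = optimal_gamma(a, c=0.05)
    print(f"alpha={a}: optimal gamma={g}, speedup={speedup(a, g, 0.05):.2f}x")
```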
The document discusses a method called speculative decoding that improves the speed of inference from Transformers. Speculative decoding involves using an approximation model, Mq, to make predictions before the main model, Mp, is called. The number of calls to Mq per iteration, γ, is a tunable parameter that trades extra speculative work for fewer serial calls to Mp.
T5-small, with 77M parameters and a good balance of c and α, achieves the highest speedup among the tested approximation models. The empirical α values for different target models and approximation models are summarized in Table 3. Approximation models that are too large incur a high per-step cost c, while models that are too small achieve lower acceptance rates α.
Speculative Decoding is a method that enables fast inference from transformers by decoding multiple tokens in parallel. It supports general approximation models and guarantees identical outputs. This method provides 2X-3X speedups compared to optimized implementations like T5X.
This text excerpt includes a list of references to various research papers related to fast inference from transformers and language modeling. The papers mentioned cover topics such as speculative sampling, transfer learning with text-to-text transformers, scaling language modeling with Pathways, and adding early exits to accelerate decoding.
The text also includes a list of references to various papers and books related to efficient transformers for language modeling, computer architecture, deep autoregressive models, distilling knowledge in neural networks, and adaptive attention span in Transformers.
The excerpt discusses the concept of fast inference from transformers using speculative decoding. It explains the mathematical equations and probabilities involved in the process. The text also compares speculative sampling to rejection sampling and highlights the efficiency of speculative sampling: unlike iterative rejection sampling, it needs at most one resampling step per token. Additionally, it discusses the theoretical predictions and their agreement with the empirical results.
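For concreteness, the speculative sampling rule the excerpt refers to can be written in a few lines; this transcribes the standard accept/resample rule in the paper's p, q notation:

```latex
% Draft a token from the approximation distribution, then verify it
% against the target distribution.
\[
  x \sim q(x), \qquad
  \Pr[\text{accept } x] \;=\; \min\!\left(1,\ \frac{p(x)}{q(x)}\right).
\]
% On rejection, resample once from the normalized residual distribution:
\[
  x \sim p'(x) \;=\; \mathrm{norm}\!\big(\max\big(0,\ p(x) - q(x)\big)\big).
\]
% The combined procedure produces samples distributed exactly as p(x),
% and, unlike rejection sampling, it never loops: one draft, at most one resample.
```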