Summary: Exponentially Faster Language Modeling with FFFs (arxiv.org)
3,879 words - PDF document
One Line
ETH Zurich researchers have developed UltraFastBERT, a language model utilizing fast feedforward networks to achieve quicker inference times.
Key Points
- UltraFastBERT is a modified version of the BERT architecture that uses fast feedforward networks (FFFs) in its intermediate layers.
- FFFs engage only a small fraction of their neurons for each individual inference, resulting in exponential acceleration (see the minimal sketch after this list).
- UltraFastBERT-1x11, the deepest model, uses only 0.3% of its neurons during inference and achieves a 78x CPU speedup.
- There is a significant potential for further acceleration with FFFs, with a theoretical speedup promise of 341x for BERT-base models.
- Efficient implementations of FFF inference are necessary to fully unlock the acceleration potential.
- CPU implementations of FFF inference using BLAS library routines achieve speedups ranging from 48x to 78x over traditional feedforward inference.
- GPU implementations of FFF inference using PyTorch and custom CUDA kernels achieve a 3.15x speedup over traditional feedforward inference.
- The implementation of conditional neural execution primitives in device programming interfaces is crucial for fully realizing the benefits of FFFs.
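To make the tree descent concrete, here is a minimal single-input sketch of FFF inference. It is a loose paraphrase of the mechanism, not the authors' released code: the parameter names, the ReLU activation, and the heap-style node indexing are our assumptions.

```python
import torch

# Assumed parameter shapes for one FFF layer (illustrative, not the paper's API):
#   w_in : (n_nodes, d_model) input weight vector per tree node
#   b_in : (n_nodes,)         bias per tree node
#   w_out: (n_nodes, d_model) output weight vector per tree node
# where n_nodes = 2**(depth + 1) - 1 for a balanced binary tree.

def fff_forward(x, w_in, b_in, w_out, depth):
    """Descend one root-to-leaf path, so only depth + 1 of the
    n_nodes neurons are ever evaluated for this input."""
    y = torch.zeros_like(x)
    node = 0  # root; heap-style indexing: children of node i are 2i+1 and 2i+2
    for _ in range(depth + 1):
        pre = w_in[node] @ x + b_in[node]      # this neuron's pre-activation
        y = y + torch.relu(pre) * w_out[node]  # only this neuron contributes
        node = 2 * node + 1 + int(pre > 0)     # sign picks left or right child
    return y

# For UltraFastBERT-1x11: depth = 11, n_nodes = 4095, and a single call
# evaluates just 12 neurons.
x = torch.randn(768)
w_in, w_out, b_in = torch.randn(4095, 768), torch.randn(4095, 768), torch.randn(4095)
y = fff_forward(x, w_in, b_in, w_out, depth=11)
```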
Summaries
18 word summary
ETH Zurich researchers have created UltraFastBERT, a language model that uses fast feedforward networks for faster inference times.
59 word summary
ETH Zurich researchers have developed UltraFastBERT, a language model that utilizes fast feedforward networks (FFFs) for faster inference times. UltraFastBERT uses only 0.3% of its neurons during inference while performing similarly to other BERT models. The researchers provide high-level CPU and PyTorch code that achieves significant speedups over baseline implementations, demonstrating the acceleration delivered by FFFs in language modeling.
118 word summary
Researchers from ETH Zurich have developed UltraFastBERT, a language model that utilizes fast feedforward networks (FFFs) to achieve faster inference times. FFFs organize neurons into a balanced binary tree and only execute one branch based on the input, using only a fraction of their neurons for individual inferences. UltraFastBERT uses only 0.3% of its neurons during inference, while performing on par with similar BERT models. The researchers provide high-level CPU code that achieves a 78x speedup over the optimized baseline feedforward implementation and a PyTorch implementation that delivers a 40x speedup over the equivalent batched feedforward inference. The researchers share their training code, benchmarking setup, and model weights, demonstrating the significant acceleration achieved by FFFs in language modeling.
449 word summary
Researchers from ETH Zurich have developed a language model called UltraFastBERT that utilizes fast feedforward networks (FFFs) to achieve significantly faster inference times. FFFs organize their neurons into a balanced binary tree and only execute one branch of the tree based on the input, allowing them to use only a fraction of their neurons for individual inferences. This is in contrast to traditional feedforward networks (FFs), which compute the dot product of the input with every neuron's weight vector, touching all neurons for every input.
The researchers demonstrate the effectiveness of FFFs by presenting UltraFastBERT, a BERT variant that uses only 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. The researchers provide high-level CPU code that achieves a 78x speedup over the optimized baseline feedforward implementation and a PyTorch implementation that delivers a 40x speedup over the equivalent batched feedforward inference.
The researchers note that there is currently no native, efficient implementation of conditional matrix multiplication (CMM), which is the operation performed by FFFs during inference. They provide a set of CPU implementations based on pointer-batched matrix multiplication routines of the BLAS library. The researchers also share their training code, benchmarking setup, and model weights.
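To illustrate the access pattern that CMM substitutes for dense multiplication, here is a batched sketch under the same illustrative assumptions as the single-input example above; it is not the authors' BLAS pointer-batched code, only the gather-then-dot pattern that such an implementation accelerates.

```python
import torch

def conditional_matmul(X, w_in, b_in, depth):
    """Sketch of conditional matrix multiplication (CMM) over a batch.

    A dense FF layer would compute X @ w_in.T, touching all rows of w_in
    for every input. Here each input row visits only depth + 1 rows, and
    which rows it visits depends on signs computed on the way down."""
    batch = X.shape[0]
    acts = torch.zeros(batch, depth + 1)                    # path pre-activations
    idx = torch.zeros(batch, depth + 1, dtype=torch.long)   # visited node ids
    node = torch.zeros(batch, dtype=torch.long)             # every row starts at the root
    for level in range(depth + 1):
        idx[:, level] = node
        # gather one weight row per input, then a per-row dot product
        pre = (w_in[node] * X).sum(dim=1) + b_in[node]
        acts[:, level] = pre
        node = 2 * node + 1 + (pre > 0).long()              # branch left or right
    return acts, idx
```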
UltraFastBERT is a variant of the BERT architecture that replaces the feedforward layers with FFFs. It performs on par with other BERT-like models of similar size and training procedure. The intermediate layers of UltraFastBERT are exponentially faster due to the use of FFFs: the time complexity of a forward pass through an FFF is O(log₂ n) rather than the O(n) of a standard FF, where n is the number of neurons in the layer. UltraFastBERT therefore uses only a small fraction of its neurons for inference, with the number of engaged neurons determined by the depth of the FFF tree.
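As a worked check of these numbers: a balanced binary tree of maximum depth 11 holds 2¹² − 1 = 4095 neurons, a single inference descends one root-to-leaf path and thus evaluates only 11 + 1 = 12 of them (12/4095 ≈ 0.3%), and the ratio 4095/12 ≈ 341 is where the 341x theoretical speedup for BERT-base-sized models comes from.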
The researchers provide a comparison between CPU and GPU implementations at various levels of optimization and note that there is potential for even greater acceleration. They make the weights of their best model, UltraFastBERT-1x11-long, public and train a full range of increasingly deeper and narrower models. These models retain at least 96% of the downstream predictive performance of the original BERT-base model on various tasks.
In terms of inference performance, the researchers compare the speed of several FF/FFF inference implementations. They find that FFF inference on CPU achieves a 78x speedup over the fastest FF implementation using BLAS Level 3 routines. On GPU, the PyTorch BMM implementation of FFF delivers a 3.15x speedup over the fastest FF implementation.
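The BMM route can be pictured as a two-step pipeline: first determine each token's path through the tree (as in the CMM sketch above), then gather the corresponding output weights and mix them with one batched matrix multiplication. The sketch below is our reading of that idea, not the authors' benchmark code.

```python
import torch

def fff_output_via_bmm(acts, idx, w_out):
    """Combine each token's path activations with the output weights of
    the nodes it visited, using a single torch.bmm call.

    acts : (batch, depth + 1) pre-activations along each token's path
    idx  : (batch, depth + 1) node indices visited by each token
    w_out: (n_nodes, d_model) per-neuron output weights"""
    gathered = w_out[idx]  # (batch, depth + 1, d_model)
    # (batch, 1, depth+1) @ (batch, depth+1, d_model) -> (batch, 1, d_model)
    return torch.bmm(torch.relu(acts).unsqueeze(1), gathered).squeeze(1)

# Chained with the CMM sketch above:
#   acts, idx = conditional_matmul(X, w_in, b_in, depth=11)
#   Y = fff_output_via_bmm(acts, idx, w_out)
```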
Overall, the researchers demonstrate the significant acceleration achieved by FFFs in language modeling and provide code and weights for their UltraFastBERT models, paving the way for further optimization and improvement.
607 word summary
Researchers from ETH Zurich have developed a language model called UltraFastBERT that uses fast feedforward networks (FFFs) to achieve exponentially faster inference times. FFFs organize their neurons into a balanced binary tree and only execute one branch of the tree based on the input, allowing them to use only a fraction of their neurons for individual inferences. This is in contrast to traditional feedforward networks (FFs), which compute the dot product of the input with every neuron's weight vector, touching all neurons for every input. The researchers demonstrate the effectiveness of FFFs by presenting UltraFastBERT, a BERT variant that uses only 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. The researchers provide high-level CPU code that achieves a 78x speedup over the optimized baseline feedforward implementation and a PyTorch implementation that delivers a 40x speedup over the equivalent batched feedforward inference.
The researchers note that there is currently no native, efficient implementation of conditional matrix multiplication (CMM), which is the operation performed by FFFs during inference. They provide a set of CPU implementations based on pointer-batched matrix multiplication routines of the BLAS library. Although there is already clear evidence of significant acceleration, there is potential for even more improvement. The researchers also share their training code, benchmarking setup, and model weights.
The majority of the parameters in large language models are held in the feedforward layers. However, not all of the neurons in these layers need to be engaged in the computation of the layer output at inference time for every input. The attention layers in language models, which have been extensively studied for speed improvements, are left untouched in UltraFastBERT. The focus is solely on the intermediate layers that host the feedforward networks.
UltraFastBERT is a variant of the BERT architecture that replaces the feedforward layers with FFFs. It performs on par with other BERT-like models of similar size and training procedure. The intermediate layers of UltraFastBERT are exponentially faster due to the use of FFFs: the time complexity of a forward pass through an FFF is O(log₂ n) rather than the O(n) of a standard FF, where n is the number of neurons in the layer. UltraFastBERT therefore uses only a small fraction of its neurons for inference, with the number of engaged neurons determined by the depth of the FFF tree.
The researchers round the number of neurons in the feedforward networks to 4095, the number of nodes in a balanced binary tree of maximum depth 11. As a result, UltraFastBERT engages only about 0.3% of the neurons of a BERT-base-sized model for each inference. The researchers provide a comparison between CPU and GPU implementations at various levels of optimization and note that there is potential for even greater acceleration.
The researchers make the weights of their best model, UltraFastBERT-1x11-long, public. They also train a full range of increasingly deeper and narrower models, starting from UltraFastBERT-3072x0 and proceeding with UltraFastBERT-1536x1, UltraFastBERT-512x2, and so on. They find that UltraFastBERT models retain at least 96% of the downstream predictive performance of the original BERT-base model on various tasks.
In terms of inference performance, the researchers compare the speed of several FF/FFF inference implementations. They find that FFF inference on CPU achieves a 78x speedup over the fastest FF implementation using BLAS Level 3 routines. On GPU, the PyTorch BMM implementation of FFF delivers a 3.15x speedup over the fastest FF implementation.
The researchers note that although there is currently no efficient implementation of CMM, the conditionality introduced by FFFs does not make them incompatible with existing hardware and programming interfaces. CPUs, especially edge CPUs, can benefit from the reduction in per-inference computation that FFFs provide.