Summary: Exponentially Faster Language Modeling with FFFs (arxiv.org)
3,879 words - PDF document
One Line
ETH Zurich researchers have developed UltraFastBERT, a language model utilizing fast feedforward networks to achieve quicker inference times.
Key Points
- UltraFastBERT is a modified version of the BERT architecture that uses fast feedforward networks (FFFs) in its intermediate layers.
- FFFs engage only a small fraction of their neurons for each individual inference, resulting in exponential acceleration (see the minimal sketch after this list).
- UltraFastBERT-1x11, the deepest model, uses only 0.3% of its neurons during inference and achieves a 78x CPU speedup.
- There is a significant potential for further acceleration with FFFs, with a theoretical speedup promise of 341x for BERT-base models.
- Efficient implementations of FFF inference are necessary to fully unlock the acceleration potential.
- CPU implementations of FFF inference using BLAS library routines achieve speedups ranging from 48x to 78x over traditional feedforward inference.
- GPU implementations of FFF inference using PyTorch and custom CUDA kernels achieve a 3.15x speedup over traditional feedforward inference.
- The implementation of conditional neural execution primitives in device programming interfaces is crucial for fully realizing the benefits of FFFs.
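To make the tree descent concrete, here is a minimal single-input sketch of FFF inference. It is a loose paraphrase of the mechanism, not the authors' released code: the parameter names, the ReLU activation, and the heap-style node indexing are our assumptions.

```python
import torch

# Assumed parameter shapes for one FFF layer (illustrative, not the paper's API):
#   w_in : (n_nodes, d_model) input weight vector per tree node
#   b_in : (n_nodes,)         bias per tree node
#   w_out: (n_nodes, d_model) output weight vector per tree node
# where n_nodes = 2**(depth + 1) - 1 for a balanced binary tree.

def fff_forward(x, w_in, b_in, w_out, depth):
    """Descend one root-to-leaf path, so only depth + 1 of the
    n_nodes neurons are ever evaluated for this input."""
    y = torch.zeros_like(x)
    node = 0  # root; heap-style indexing: children of node i are 2i+1 and 2i+2
    for _ in range(depth + 1):
        pre = w_in[node] @ x + b_in[node]      # this neuron's pre-activation
        y = y + torch.relu(pre) * w_out[node]  # only this neuron contributes
        node = 2 * node + 1 + int(pre > 0)     # sign picks left or right child
    return y

# For UltraFastBERT-1x11: depth = 11, n_nodes = 4095, and a single call
# evaluates just 12 neurons.
x = torch.randn(768)
w_in, w_out, b_in = torch.randn(4095, 768), torch.randn(4095, 768), torch.randn(4095)
y = fff_forward(x, w_in, b_in, w_out, depth=11)
```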
Summaries
18 word summary
ETH Zurich researchers have created UltraFastBERT, a language model that uses fast feedforward networks for faster inference times.
59 word summary
ETH Zurich researchers have developed UltraFastBERT, a language model that utilizes fast feedforward networks (FFFs) for faster inference times. UltraFastBERT uses only 0.3% of its neurons during inference while performing similarly to other BERT models. The researchers provide high-level CPU and PyTorch code that achieves significant speedups over baseline implementations, demonstrating the acceleration delivered by FFFs in language modeling.
118 word summary
Researchers from ETH Zurich have developed UltraFastBERT, a language model that utilizes fast feedforward networks (FFFs) to achieve faster inference times. FFFs organize neurons into a balanced binary tree and only execute one branch based on the input, using only a fraction of their neurons for individual inferences. UltraFastBERT uses only 0.3% of its neurons during inference, while performing on par with similar BERT models. The researchers provide high-level CPU code that achieves a 78x speedup over the optimized baseline feedforward implementation and a PyTorch implementation that delivers a 40x speedup over the equivalent batched feedforward inference. The researchers share their training code, benchmarking setup, and model weights, demonstrating the significant acceleration achieved by FFFs in language modeling.
449 word summary
Researchers from ETH Zurich have developed a language model called UltraFastBERT that utilizes fast feedforward networks (FFFs) to achieve significantly faster inference times. FFFs organize their neurons into a balanced binary tree and only execute one branch of the tree based on the input, allowing them to use only a fraction of their neurons for individual inferences. This is in contrast to traditional feedforward networks (FFs), which compute the dot product of the input with every neuron's weight vector, touching all neurons for every input.
The researchers demonstrate the effectiveness of FFFs by presenting UltraFastBERT, a BERT variant that uses only 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. The researchers provide high-level CPU code that achieves a 78x speedup over the optimized baseline feedforward implementation and a PyTorch implementation that delivers a 40x speedup over the equivalent batched feedforward inference.
The researchers note that there is currently no native, efficient implementation of conditional matrix multiplication (CMM), which is the operation performed by FFFs during inference. They provide a set of CPU implementations based on pointer-batched matrix multiplication routines of the BLAS library. The researchers also share their training code, benchmarking setup, and model weights.
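To illustrate the access pattern that CMM substitutes for dense multiplication, here is a batched sketch under the same illustrative assumptions as the single-input example above; it is not the authors' BLAS pointer-batched code, only the gather-then-dot pattern that such an implementation accelerates.

```python
import torch

def conditional_matmul(X, w_in, b_in, depth):
    """Sketch of conditional matrix multiplication (CMM) over a batch.

    A dense FF layer would compute X @ w_in.T, touching all rows of w_in
    for every input. Here each input row visits only depth + 1 rows, and
    which rows it visits depends on signs computed on the way down."""
    batch = X.shape[0]
    acts = torch.zeros(batch, depth + 1)                    # path pre-activations
    idx = torch.zeros(batch, depth + 1, dtype=torch.long)   # visited node ids
    node = torch.zeros(batch, dtype=torch.long)             # every row starts at the root
    for level in range(depth + 1):
        idx[:, level] = node
        # gather one weight row per input, then a per-row dot product
        pre = (w_in[node] * X).sum(dim=1) + b_in[node]
        acts[:, level] = pre
        node = 2 * node + 1 + (pre > 0).long()              # branch left or right
    return acts, idx
```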
UltraFastBERT is a variant of the BERT architecture that replaces the feedforward layers with FFFs. It performs on par with other BERT-like models of similar size and training procedure. The intermediate layers of UltraFastBERT are exponentially faster due to the use of FFFs: the time complexity of a forward pass through an FFF is O(log₂ n) rather than the O(n) of a standard FF, where n is the number of neurons in the layer. UltraFastBERT therefore uses only a small fraction of its neurons for inference, with the number of engaged neurons determined by the depth of the FFF tree.
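As a worked check of these numbers: a balanced binary tree of maximum depth 11 holds 2¹² − 1 = 4095 neurons, a single inference descends one root-to-leaf path and thus evaluates only 11 + 1 = 12 of them (12/4095 ≈ 0.3%), and the ratio 4095/12 ≈ 341 is where the 341x theoretical speedup for BERT-base-sized models comes from.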
The researchers provide a comparison between CPU and GPU implementations at various levels of optimization and note that there is potential for even greater acceleration. They make the weights of their best model, UltraFastBERT-1x11-long, public and train a full range of increasingly deeper and narrower models. These models retain at least 96% of the downstream predictive performance of the original BERT-base model on various tasks.
In terms of inference performance, the researchers compare the speed of several FF/FFF inference implementations. They find that FFF inference on CPU achieves a 78x speedup over the fastest FF implementation using BLAS Level 3 routines. On GPU, the PyTorch BMM implementation of FFF delivers a 3.15x speedup over the fastest FF implementation.
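The BMM route can be pictured as a two-step pipeline: first determine each token's path through the tree (as in the CMM sketch above), then gather the corresponding output weights and mix them with one batched matrix multiplication. The sketch below is our reading of that idea, not the authors' benchmark code.

```python
import torch

def fff_output_via_bmm(acts, idx, w_out):
    """Combine each token's path activations with the output weights of
    the nodes it visited, using a single torch.bmm call.

    acts : (batch, depth + 1) pre-activations along each token's path
    idx  : (batch, depth + 1) node indices visited by each token
    w_out: (n_nodes, d_model) per-neuron output weights"""
    gathered = w_out[idx]  # (batch, depth + 1, d_model)
    # (batch, 1, depth+1) @ (batch, depth+1, d_model) -> (batch, 1, d_model)
    return torch.bmm(torch.relu(acts).unsqueeze(1), gathered).squeeze(1)

# Chained with the CMM sketch above:
#   acts, idx = conditional_matmul(X, w_in, b_in, depth=11)
#   Y = fff_output_via_bmm(acts, idx, w_out)
```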
Overall, the researchers demonstrate the significant acceleration achieved by FFFs in language modeling and provide code and weights for their UltraFastBERT models, paving the way for further optimization and improvement.
607 word summary
Researchers from ETH Zurich have developed a language model called UltraFastBERT that uses fast feedforward networks (FFFs) to achieve exponentially faster inference times. FFFs organize their neurons into a balanced binary tree and only execute one branch of the tree based on the input, allowing them to use only a fraction of their neurons for individual inferences. This is in contrast to traditional feedforward networks (FFs), which compute the dot product of the input with every neuron's weight vector, touching all neurons for every input. The researchers demonstrate the effectiveness of FFFs by presenting UltraFastBERT, a BERT variant that uses only 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. The researchers provide high-level CPU code that achieves a 78x speedup over the optimized baseline feedforward implementation and a PyTorch implementation that delivers a 40x speedup over the equivalent batched feedforward inference.
The researchers note that there is currently no native, efficient implementation of conditional matrix multiplication (CMM), which is the operation performed by FFFs during inference. They provide a set of CPU implementations based on pointer-batched matrix multiplication routines of the BLAS library. Although there is already clear evidence of significant acceleration, there is potential for even more improvement. The researchers also share their training code, benchmarking setup, and model weights.
The majority of the parameters in large language models are held in the feedforward layers. However, not all of the neurons in these layers need to be engaged in the computation of the layer output at inference time for every input. The attention layers in language models, which have been extensively studied for speed improvements, are left untouched in UltraFastBERT. The focus is solely on the intermediate layers that host the feedforward networks.
UltraFastBERT is a variant of the BERT architecture that replaces the feedforward layers with FFFs. It performs on par with other BERT-like models of similar size and training procedure. The intermediate layers of UltraFastBERT are exponentially faster due to the use of FFFs: the time complexity of a forward pass through an FFF is O(log₂ n) rather than the O(n) of a standard FF, where n is the number of neurons in the layer. UltraFastBERT therefore uses only a small fraction of its neurons for inference, with the number of engaged neurons determined by the depth of the FFF tree.
The researchers round the number of neurons in the feedforward networks to 4095, the number of nodes in a balanced binary tree of maximum depth 11. As a result, UltraFastBERT engages only about 0.3% of the neurons of a BERT-base-sized model for each inference. The researchers provide a comparison between CPU and GPU implementations at various levels of optimization and note that there is potential for even greater acceleration.
The researchers make the weights of their best model, UltraFastBERT-1x11-long, public. They also train a full range of increasingly deeper and narrower models, starting from UltraFastBERT-3072x0 and proceeding with UltraFastBERT-1536x1, UltraFastBERT-512x2, and so on. They find that UltraFastBERT models retain at least 96% of the downstream predictive performance of the original BERT-base model on various tasks.
In terms of inference performance, the researchers compare the speed of several FF/FFF inference implementations. They find that FFF inference on CPU achieves a 78x speedup over the fastest FF implementation using BLAS Level 3 routines. On GPU, the PyTorch BMM implementation of FFF delivers a 3.15x speedup over the fastest FF implementation.
The researchers note that although there is currently no efficient implementation of CMM, the conditionality introduced by FFFs does not make them incompatible with existing hardware and programming interfaces. CPUs, especially edge CPUs, can benefit from the reduction in per-inference computation that FFFs provide.