Summary: Int8 Matrix Multiplication for Transformers (arxiv.org)
11,384 words - PDF document
One Line
The study explores Int8 matrix multiplication for transformers and proposes LLM.int8(), which combines vector-wise quantization with a mixed-precision decomposition for outlier features to roughly halve inference memory use while maintaining performance.
Key Points
- LLM.int8() is a new mixed-precision decomposition scheme that reduces memory usage in large transformer language models while maintaining performance.
- The method uses vector-wise quantization and a two-part quantization procedure for emergent outliers to load a 175B parameter transformer with 8-bit weights without performance degradation.
- Outlier features affect at least 25% of transformer layers and 6% of sequence dimensions, and their removal can cause attention and perplexity degradation.
- The study finds that LLM.int8() is the only method with a favorable scaling trend for Int8 matrix multiplication on transformers ranging from 125M to 13B parameters.
- Int8 matrix multiplication can speed up the feed-forward layers of differently sized GPT-3 transformers, but outliers disrupt symmetric absmax quantization and favor asymmetric zeropoint quantization.
Summaries
285 word summary
This study explores the use of Int8 matrix multiplication for transformers and finds that 8-bit feed-forward network layers do not degrade performance, while 8-bit attention linear projections do. Benchmarking Int8 inference against 16-bit shows that the overall speedup or slowdown from Int8 is smaller than the raw matrix multiplication numbers suggest, since a large part of the overall inference runtime is fixed communication overhead. Various methods can reduce quantization error, such as linear programming optimization, learned quantization, step-size quantization, soft quantization, and mixed-precision quantization procedures. The article presents LLM.int8(), a method for quantizing multi-billion parameter transformers for inference using mixed 8-bit and 16-bit precision while preserving predictive performance and reducing the memory footprint. The related work covers low-bitwidth quantization of deep neural networks, particularly transformers, and the reference list spans transformer models, autoregressive generative modeling, quantization methods, and low-bit neural networks. The study examines the impact of outlier features on transformer models, finding that they greatly affect model performance: their removal leads to a decrease in top-1 probability and an increase in perplexity. The paper proposes a mixed-precision decomposition technique for matrix multiplication in transformers with outlier feature dimensions, reducing the memory footprint of the model by about 50%. It evaluates various quantization methods for Int8 matrix multiplication on transformers ranging from 125M to 13B parameters and shows that the mixed-precision decomposition scheme LLM.int8() reduces memory use in large transformer language models while maintaining performance. The paper also provides open-source software and shows modest matrix multiplication speedups for GPT-3 models of size 6.7B parameters or larger.
871 word summary
A new mixed-precision decomposition scheme called LLM.int8() has been developed to reduce memory use in large transformer language models while maintaining performance. The method uses vector-wise quantization with separate normalization constants to quantize most features and a two-part quantization procedure for emergent outliers. The procedure makes it possible to load a 175B parameter transformer with 8-bit weights and immediately use it for inference without any performance degradation. The paper also provides open-source software and shows modest matrix multiplication speedups for GPT-3 models of size 6.7B parameters or larger. For asymmetric distributions such as ReLU outputs, zeropoint quantization is used: it shifts the input distribution into the full range [-127, 127] by scaling with the normalized dynamic range nd_x and then shifting by the zeropoint zp_x. The paper proposes a mixed-precision decomposition technique for matrix multiplication in transformers with outlier feature dimensions. The technique separates these dimensions into 8-bit and 16-bit categories, allowing high-precision multiplication of outliers while using memory-efficient matrix multiplication with 8-bit weights for the rest. The combination of vector-wise quantization and mixed-precision decomposition reduces the memory footprint of the model by about 50%. The study evaluates the performance of various quantization methods for Int8 matrix multiplication on transformers ranging from 125M to 13B parameters and finds that LLM.int8() is the only method with a favorable scaling trend, while other methods fail beyond 6.7B parameters. Two experimental setups - language modeling perplexity and zeroshot accuracy degradation - measure the robustness of the quantization methods, using different corpora for evaluation and comparing several quantization baselines. The study examines the impact of outlier features on transformer models, defining them as features with significantly larger magnitudes than others in the same dimension. Outlier features affect at least 25% of transformer layers and 6% of sequence dimensions, and their removal causes attention and perplexity degradation: removing outliers leads to a decrease in top-1 probability and an increase in perplexity. Outliers are concentrated in only 6 hidden dimensions, and their number increases strictly monotonically with decreasing perplexity. The emergence of large magnitude features across all layers of a transformer occurs suddenly between 6B and 6.7B parameters. Quantization techniques have been developed to address outlier features, but they all struggle to deal with outliers effectively, which are asymmetrically distributed and occur in critical feature dimensions, affecting transformer performance. The article presents LLM.int8(), a method for quantizing multi-billion parameter transformers for inference using mixed 8-bit and 16-bit precision, with a focus on preserving predictive performance while reducing memory footprint. The authors suggest that finer quantization granularity can be an effective means to quantize large models and that their method is complementary to other methods.
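To make the roughly 50% memory reduction from 8-bit weights concrete, here is a small illustrative calculation. It counts only weight storage at approximate parameter counts and ignores activations, the 16-bit outlier computation, embeddings, and framework overhead; the model names and parameter counts are stated for illustration only.

```python
# Illustrative weight-memory estimate: fp16 stores 2 bytes per parameter,
# Int8 stores 1 byte, so quantizing the weights roughly halves the footprint.
# (Ignores activations, 16-bit outlier columns, and framework overhead.)
models = {"OPT-6.7B": 6.7e9, "OPT-13B": 13e9, "OPT-66B": 66e9, "OPT-175B": 175e9}

for name, params in models.items():
    fp16_gb = params * 2 / 1e9   # 16-bit weights: 2 bytes per parameter
    int8_gb = params * 1 / 1e9   # 8-bit weights: 1 byte per parameter
    print(f"{name}: fp16 ≈ {fp16_gb:.0f} GB, int8 ≈ {int8_gb:.0f} GB")
```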
The related work also discusses efficient low-bitwidth quantization of deep neural networks, particularly transformers; the main practical impact of LLM.int8() is enabling access to large models that previously could not fit into GPU memory. The paper includes a list of references related to language processing and deep learning, covering topics such as transformer models, autoregressive generative modeling, quantization methods, and low-bit neural networks, and it cites F8Net, a fixed-point 8-bit only multiplication scheme for network quantization, along with research on optimizing large-scale transformer models through post-training quantization, integer quantization, and low-bit transformer language models. The submission checklist covers proofs, assumptions, societal impacts, limitations, and code/dataset licenses, and the related work section discusses quantization methods for Transformers, including 8-bit and sub-billion parameter masked language models. The study focuses on Int8 matrix multiplication for transformers that does not degrade performance and that can run on commonly used GPUs. Outliers are essential for large softmax probabilities in transformers and occur more frequently as perplexity decreases. Various methods can reduce quantization error, such as linear programming optimization, learned quantization, step-size quantization, soft quantization, and mixed-precision quantization procedures. Int8 matrix multiplication can speed up the feed-forward layers of differently sized GPT-3 transformers, but quantization and decomposition overheads are significant for models with a hidden dimension of 2560 or smaller. Finally, outliers disrupt symmetric absmax quantization and favor asymmetric zeropoint quantization, and they occur universally across models trained with different software and inference frameworks. Results show that 8-bit feed-forward network layers cause no degradation, while 8-bit attention linear projections lead to performance degradation; vector-wise quantization improves upon previous methods, and mixed-precision decomposition is critical to avoid degradation when using 8-bit attention projections. Initial results on small and large-scale language modeling show that doing attention in 8-bit severely degrades performance, and that this cannot be fully recovered with mixed-precision decomposition. Benchmarking Int8 inference against 16-bit shows that the overall speedup or slowdown from Int8 is smaller than raw kernel numbers suggest, since a large part of the overall inference runtime is fixed communication overhead. End-to-end inference with BLOOM-176B in Hugging Face shows that Int8 inference only works well for large models with big hidden dimensions.
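The effect of outliers on symmetric absmax quantization can be illustrated with a tiny round-trip experiment (illustrative numbers only, not from the paper): a single large-magnitude value shrinks the scaling constant and degrades precision for every other value in the tensor.

```python
import torch

def absmax_quantize_dequantize(x):
    # Symmetric absmax quantization to Int8 and back, for measuring round-trip error.
    c = 127.0 / x.abs().max().clamp(min=1e-5)
    return torch.round(x * c).clamp(-127, 127) / c

# Well-behaved activations: small round-trip error.
x = torch.randn(4096)
print("no outlier :", (absmax_quantize_dequantize(x) - x).abs().mean().item())

# A single large-magnitude outlier shrinks the scaling constant,
# so quantization precision for all other values collapses.
x_out = x.clone()
x_out[0] = 60.0
print("with outlier:", (absmax_quantize_dequantize(x_out) - x_out).abs().mean().item())
```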
2616 word summary
This paper examines the use of Int8 matrix multiplication for transformers. The study finds that 8-bit feed-forward network (FFN) layers show no degradation, while 8-bit attention linear projections lead to performance degradation. Vector-wise quantization improves upon previous methods, and mixed-precision decomposition is critical to avoid degradation when using 8-bit attention projections. The study compares 8-bit finetuning on RoBERTa-large with other Int8 methods and finds that doing attention in 8-bit severely degrades performance at large scale. Initial results on small and large-scale language modeling confirm that 8-bit attention severely degrades performance and that it cannot be fully recovered with mixed-precision decomposition. The study also tests Int8 training and inference for Transformers against 32-bit baselines: training with 8-bit feed-forward networks is straightforward, but other layers require additional techniques or different data types, and linear projections with Int8 data types and vector-wise quantization degrade NMT performance. Benchmarking Int8 inference against 16-bit shows that the overall speedup or slowdown from Int8 is smaller than raw kernel numbers suggest, since a large part of the overall inference runtime is fixed communication overhead. End-to-end inference with BLOOM-176B in Hugging Face shows that Int8 inference only works well for large models with big hidden dimensions. Int8 matrix multiplication can speed up the feed-forward layers of differently sized GPT-3 transformers; Table 5 provides detailed benchmarks of raw matrix multiplication and quantization overheads. The quantization and decomposition overhead is significant, and only models with a hidden dimension of 2560 or smaller are slowed down. The raw Int8 matrix multiplication in cuBLASLt can be about two times faster than cuBLAS, but only for large matrix multiplications. The work focuses on memory efficiency to make models accessible, whereas Int8 methods are also often used to accelerate inference. Outliers disrupt symmetric absmax quantization and favor asymmetric zeropoint quantization. These outliers occur for models trained in different software frameworks and run in different inference frameworks, and they appear to be universal. Outliers, which are present in about 0.1% of all features and have a magnitude 3-20x larger than other feature magnitudes, are essential for large softmax probabilities in transformers. They become more common with increasing scale and occur in almost all sequence dimensions. Outliers are usually one-sided, and the quartiles of the dimensions with maximum range show that they become more pronounced as perplexity decreases. Various methods improve quantization precision, such as linear programming optimization, learned quantization, step-size quantization, soft quantization, and mixed-precision quantization procedures. Low-bitwidth quantization work that uses fewer than 8 bits usually targets convolutional networks (CNNs), reducing their memory footprint and increasing inference speed on mobile devices while minimizing model degradation.
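As a rough illustration of how such outlier feature dimensions could be flagged from collected activations, here is a minimal sketch based on the criteria described in this summary (magnitude of at least 6, affecting at least 25% of layers and 6% of sequence dimensions). The function name, tensor layout, and thresholds as keyword arguments are assumptions for illustration, not the paper's code.

```python
import torch

def find_outlier_dims(hidden_states, magnitude=6.0, layer_frac=0.25, seq_frac=0.06):
    """Flag feature dimensions that behave like 'outlier features'.

    hidden_states: list of tensors of shape (seq_len, hidden_dim),
    one per transformer layer (hypothetical layout for illustration).
    """
    num_layers = len(hidden_states)
    hidden_dim = hidden_states[0].shape[1]

    layer_hits = torch.zeros(hidden_dim)  # number of layers where a dim exceeds the threshold
    seq_hits = torch.zeros(hidden_dim)    # fraction of sequence positions affected, summed over layers
    for h in hidden_states:
        over = h.abs() >= magnitude              # (seq_len, hidden_dim) boolean mask
        layer_hits += over.any(dim=0).float()    # dim exceeds the threshold somewhere in this layer
        seq_hits += over.float().mean(dim=0)     # fraction of positions affected in this layer

    is_outlier = (layer_hits / num_layers >= layer_frac) & (seq_hits / num_layers >= seq_frac)
    return torch.nonzero(is_outlier).flatten()

# Toy usage: random activations with one artificially large feature dimension.
states = [torch.randn(32, 512) for _ in range(12)]
for h in states:
    h[:, 300] *= 10  # make dimension 300 an outlier in every layer
print(find_outlier_dims(states))  # expected: tensor([300])
```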
This study focuses on Int8 matrix multiplication for transformers that does not degrade performance and that can run on commonly used GPUs. Related methods compare against zeropoint and row-wise quantization in the forward pass and zeropoint-row-wise quantization in the backward pass, but remain limited to sub-billion parameter transformers; this is the first study of vector-wise quantization for autoregressive, large-scale models. Other related work uses channel-wise quantization for convolutions combined with row-wise quantization. In contrast with other methods, the model can be used directly without performance degradation, whereas other methods require quantization-aware finetuning or post-training quantization to make the model usable in low precision. The paper also discusses various methods for quantization of Transformers, including 8-bit and sub-billion parameter masked language models (MLMs). The 8-bit method makes many models accessible that were not accessible before; Table 3 compares the memory footprint of 16-bit inference and LLM.int8() for different open-source models. The experimental setup section details the type of resources used, and the code, data, and instructions needed to reproduce the main experiments are included in the supplemental material. The submission checklist asks about including proofs, stating assumptions, discussing societal impacts, describing limitations, ensuring the claims in the abstract and introduction are accurate, including licenses for code and datasets, and providing justifications for the answers. The paper ends with a list of references to related work on matrix multiplication for transformers, including research on optimizing large-scale transformer models through post-training quantization, integer quantization, and low-bit transformer language models, as well as papers on attention mechanisms, commonsense reasoning, and Mesh-TensorFlow for supercomputers, published between 2016 and 2022 and covering natural language processing, deep learning, and neural machine translation. The references also include work on high-performance low-precision deep learning inference, the limits of transfer learning with a unified text-to-text transformer, how outlier dimensions in transformers are driven by token frequency, binary neural networks, fully quantized networks for object detection, and F8Net, a fixed-point 8-bit only multiplication scheme for network quantization. Further references cover language processing and deep learning, including transformer models, autoregressive generative modeling, quantization methods, low-bit neural networks, high-performance natural language models, compute-optimal large language models, 8-bit optimizers via block-wise quantization, few-shot language model evaluation, extreme model compression, and binary neural networks, drawn from various conferences and journals in artificial intelligence and machine learning.
The related work also covers efficient training of deep neural networks with low-bitwidth quantization, particularly for transformers, referencing research on 2-bit quantized neural networks, the 8-bit floating point format, and binary BERT quantization. The authors also highlight the potential benefits and drawbacks of large pretrained models, such as the recent Open Pretrained Transformers (OPT), which can now be run using their new Int8 inference method. The main impact of the work is enabling access to large models that previously could not fit into GPU memory, particularly OPT-175B/BLOOM, using resource-efficient combinations of GPUs; a table lists model/GPU combinations and the maximum model size that can be run with the method. The paper quantizes multi-billion parameter transformers to mixed 8-bit and 16-bit precision for inference. The authors introduce their method, LLM.int8(), which achieves zero-degradation quantization by using vector-wise quantization and a decomposition method that isolates outlier features in a separate 16-bit matrix multiplication. The focus is on preserving predictive performance while reducing the memory footprint; the analysis covers only the Int8 data type, and the authors do not study training or finetuning. They suggest that finer quantization granularity can be an effective means to quantize large models and that their method is complementary to other methods such as nuQmm and ZeroQuant, leaving the exploration of additional quantization methods for future work. In large transformers, the emergence of outlier features has been connected to autoregressive modeling, layer normalization, and the token frequency distribution. Outlier features are defined as large magnitude values in language models that can affect performance. Quantization techniques have been developed to address outlier features, with Int8 data types being the focus of this study, although FP8 data types offer superior performance for small magnitude values. Various quantization methods exist, including zeropoint quantization, vector-wise quantization, and row-wise quantization, but they all struggle to deal with outliers effectively, which are asymmetrically distributed and occur in critical feature dimensions, affecting transformer performance. The study focuses on the impact of outliers in Int8 matrix multiplication for transformers: outliers greatly impact model performance, and their removal leads to a decrease in top-1 probability and an increase in perplexity. Outliers are concentrated in only 6 different hidden dimensions, and their number increases strictly monotonically with decreasing perplexity. The emergence of large magnitude features across all layers of a transformer occurs suddenly between 6B and 6.7B parameters. The number of outlier feature dimensions is only roughly proportional to model size; it is more closely related to perplexity, which is affected by multiple factors such as the amount and quality of training data. Quantization fails starting at the 6.7B scale because the range of the quantization distribution becomes too large, disrupting Int8 quantization precision. The study analyzes the impact of outlier features on transformer models, defining them as features with a magnitude significantly larger than other features in the same dimension.
The study finds that outlier features are systematic in large models, affecting at least 25% of transformer layers in the same feature dimension, and removing them causes attention and perplexity degradation. The study uses mixed-precision decomposition and defines outlier features with a threshold magnitude of 6 or larger. Several GPT-2 models trained with different software frameworks are evaluated, and outlier features co-occur with the sudden degradation of performance in quantization methods. Outlier features emerge in the attention projection dimensions across all hidden states in transformers, affecting at least 25% of layers and 6% of sequence dimensions with a magnitude of at least 6, and they strongly affect the attention and predictive performance of transformers. The article also explains the advantages of zeropoint quantization and why they disappear once mixed-precision decomposition is used. For large matrix multiplications, LLM.int8() is recommended over plain vector-wise quantization, which scales poorly and degenerates into random performance. LLM.int8() maintains full 16-bit performance even for models with 6.7B parameters or more, and for large matrix multiplications the raw Int8 kernels can be about two times faster than the FP16 baseline. The analysis of decomposition and quantization performance for small vs. large models shows that outlier features are highly systematic, representing at most 7 unique feature dimensions. The study evaluates the performance of various quantization methods for Int8 matrix multiplication on transformers ranging from 125M to 13B parameters, using OPT models to measure degradation in zeroshot performance and evaluating language modeling perplexity. LLM.int8() is the only method with a favorable scaling trend, while other methods fail beyond 6.7B parameters, and the decomposition operation consumes only about 0.1% additional memory. Two experimental setups - language modeling perplexity and zeroshot accuracy degradation - measure the robustness of quantization methods, using different corpora for evaluation and comparing different quantization baselines. The paper proposes a mixed-precision decomposition technique for matrix multiplication in transformers with outlier feature dimensions: these dimensions are separated into 8-bit and 16-bit categories, allowing high-precision multiplication of outliers while using memory-efficient matrix multiplication with 8-bit weights for the rest. The authors demonstrate that outlier features are extremely sparse and systematic in practice, making up only about 0.1% of all feature dimensions. Row-wise quantization, which assigns one scaling constant to each row of the hidden state, is not sufficient for outlier features, so vector-wise quantization is used instead. The combination of vector-wise quantization and mixed-precision decomposition reduces the memory footprint of the model by about 50%. To handle the large magnitude outlier features that occur in all transformer layers beyond the 6.7B scale, vector-wise quantization alone is no longer sufficient, and mixed-precision decomposition is developed for this purpose, as described in more detail below.
The main challenge with quantization methods that use a single scaling constant per tensor is that a single outlier can reduce the quantization precision of all other values. It therefore helps to have multiple scaling constants per tensor, such as block-wise constants (Dettmers et al., 2022), so that the effect of an outlier is confined to each block; the paper improves upon one of the most common ways of blocking quantization, row-wise quantization (Khudia et al., 2021), by using vector-wise quantization. To perform Int8 matrix multiplication with 16-bit float inputs and outputs, given hidden states X_f16 ∈ R^(s×h) and weights W_f16 ∈ R^(h×o) with sequence dimension s, feature dimension h, and output dimension o, both inputs are quantized to Int8 using 16-bit quantization constants such as nd_a_f16 and nd_b_f16, multiplied with Int32 accumulation, and rescaled back to 16-bit: X_f16 W_f16 = C_f16 ≈ S_f16 · C_i32 = S_f16 · Q(X_f16) Q(W_f16), where Q(·) denotes absmax or zeropoint quantization and S_f16 rescales the Int32 result using the quantization constants.
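A minimal sketch of how such a vector-wise (absmax) Int8 matmul with 16-bit inputs and outputs can look: one scaling constant per row of the hidden states and one per column of the weights, with the Int32 product rescaled by their outer product. Names, layout, and the CPU emulation of the Int8 kernel are illustrative assumptions, not the paper's implementation.

```python
import torch

def vectorwise_int8_matmul(X_f16, W_f16):
    """Sketch of vector-wise absmax Int8 matmul with 16-bit inputs/outputs."""
    Xf, Wf = X_f16.float(), W_f16.float()

    # Per-row / per-column absmax scaling constants, clamped to avoid division by zero.
    c_x = 127.0 / Xf.abs().amax(dim=1, keepdim=True).clamp(min=1e-5)  # shape (s, 1)
    c_w = 127.0 / Wf.abs().amax(dim=0, keepdim=True).clamp(min=1e-5)  # shape (1, o)

    X_i8 = torch.round(Xf * c_x).to(torch.int8)
    W_i8 = torch.round(Wf * c_w).to(torch.int8)

    # Int8 x Int8 with Int32 accumulation (emulated here with int32 tensors on CPU).
    C_i32 = X_i8.to(torch.int32) @ W_i8.to(torch.int32)

    # Dequantize with the outer product of the scaling constants.
    return (C_i32.to(torch.float32) / (c_x * c_w)).to(torch.float16)

# Toy check against a regular matmul.
X = torch.randn(4, 64, dtype=torch.float16)
W = torch.randn(64, 8, dtype=torch.float16)
print((vectorwise_int8_matmul(X, W).float() - X.float() @ W.float()).abs().max())
```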
To use zeropoint quantization in an operation, we feed both the tensor X_i8 and the zeropoint zp_xi16 into a special instruction which adds zp_xi16 to each element of X_i8 before performing a 16-bit integer operation. Zeropoint quantization shifts the input distribution into the full range [-127, 127] by scaling with the normalized dynamic range nd_x and then shifting by the zeropoint zp_x. With this affine transformation, any input tensor will use all bits of the data type, thus reducing the quantization error.
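A minimal sketch of asymmetric (zeropoint) quantization of a single tensor, using one common formulation of the affine transform; the exact definitions of nd_x and zp_x here are assumptions for illustration, not necessarily the paper's equations.

```python
import torch

def zeropoint_quantize(x):
    """Map an asymmetric distribution onto the full [-127, 127] range.

    nd_x scales by the normalized dynamic range, zp_x shifts so that
    min(x) -> -127 and max(x) -> 127 (one common formulation; illustrative).
    """
    nd_x = 254.0 / (x.max() - x.min()).clamp(min=1e-5)  # normalized dynamic range
    zp_x = torch.round(-x.min() * nd_x) - 127           # zeropoint shift
    x_i8 = torch.clamp(torch.round(x * nd_x + zp_x), -127, 127).to(torch.int8)
    return x_i8, nd_x, zp_x

def zeropoint_dequantize(x_i8, nd_x, zp_x):
    # Invert the affine transformation.
    return (x_i8.to(torch.float32) - zp_x) / nd_x

# ReLU-like (non-negative) activations: absmax quantization would leave [-127, 0) unused.
x = torch.relu(torch.randn(1024))
x_i8, nd_x, zp_x = zeropoint_quantize(x)
print((zeropoint_dequantize(x_i8, nd_x, zp_x) - x).abs().max())
```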
For asymmetric distributions such as ReLU outputs, zeropoint quantization is especially useful: in absmax quantization all values in [-127, 0) go unused, whereas in zeropoint quantization the full [-127, 127] range is used. The paper discusses these quantization techniques for matrix multiplication in transformer models: absolute maximum (absmax) quantization is the most commonly used technique, while zeropoint quantization offers higher precision for asymmetric distributions. The paper pushes quantization techniques to their breaking point by scaling transformer models. The resulting method, LLM.int8(), uses a combination of vector-wise quantization and mixed-precision decomposition to perform 16-bit matrix multiplication for the outlier feature dimensions and 8-bit matrix multiplication for the other 99.9% of dimensions. The method allows inference in LLMs with up to 175B parameters without any performance degradation. The paper also provides open-source software and shows modest matrix multiplication speedups for GPT-3 models of size 6.7B parameters or larger. This new multi-billion-scale Int8 quantization procedure for transformers solves two challenges: the need for higher quantization precision and the systematic large magnitude outlier features that ruin quantization precision. Setting these outlier feature dimensions to zero substantially decreases top-1 attention softmax probability mass per sequence, yet the outliers are concentrated in only 6 feature dimensions across the entire transformer. The result is achieved with the LLM.int8() method, which maintains 16-bit accuracy, together with a new descriptive analysis of the emergence of extreme outliers in the feature dimensions during inference. The procedure makes it possible to load a 175B parameter transformer with 8-bit weights and immediately use it for inference without any performance degradation. This work is a major breakthrough in multi-billion parameter quantization, which has remained an open challenge. Large transformer language models are widely used in NLP but require significant GPU memory for inference. To reduce memory use, 8-bit quantization methods have been developed, but they often degrade performance and require further tuning after training. To address this, a new mixed-precision decomposition scheme isolates outlier features in transformer language models: vector-wise quantization with separate normalization constants is used to quantize most features, while a two-part quantization procedure copes with emergent outliers. The feed-forward and attention projection layers in transformers are responsible for 95% of consumed parameters and computation; the remaining parameters come mostly from the embedding layer. Using LLM.int8(), it is possible to perform inference in LLMs with up to 175B parameters without any performance degradation: a 175B parameter 16/32-bit checkpoint can be loaded, converted to Int8, and used immediately. The software is open source.
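Putting the pieces together, below is a minimal sketch of the decomposition idea: columns of the hidden state whose magnitude reaches the outlier threshold (6 here, as in the analysis above) are multiplied in 16-bit precision, while the remaining roughly 99.9% of columns go through the Int8 path. This is a simplified illustration of the technique, not the bitsandbytes implementation; all function names are hypothetical.

```python
import torch

def int8_absmax_matmul(X, W):
    """Vector-wise absmax Int8 matmul (same sketch as above, condensed; float32 in/out)."""
    c_x = 127.0 / X.abs().amax(dim=1, keepdim=True).clamp(min=1e-5)
    c_w = 127.0 / W.abs().amax(dim=0, keepdim=True).clamp(min=1e-5)
    X_i8 = torch.round(X * c_x).to(torch.int8)
    W_i8 = torch.round(W * c_w).to(torch.int8)
    C_i32 = X_i8.to(torch.int32) @ W_i8.to(torch.int32)
    return C_i32.to(torch.float32) / (c_x * c_w)

def llm_int8_style_matmul(X_f16, W_f16, threshold=6.0):
    """Sketch of mixed-precision decomposition: outlier feature dimensions
    (columns of X with any |value| >= threshold) are multiplied in 16-bit
    precision, all other dimensions use the Int8 path. Illustration only."""
    X, W = X_f16.float(), W_f16.float()
    outlier = (X.abs() >= threshold).any(dim=0)   # (h,) boolean mask over feature dims
    regular = ~outlier

    out = int8_absmax_matmul(X[:, regular], W[regular, :])   # ~99.9% of dimensions in Int8
    if outlier.any():
        out = out + X[:, outlier] @ W[outlier, :]            # outliers kept in high precision
    return out.to(torch.float16)

# Toy usage: inject a few large-magnitude outlier columns into the hidden state.
X = torch.randn(8, 128, dtype=torch.float16)
X[:, [3, 77]] += 20.0
W = torch.randn(128, 32, dtype=torch.float16)
print((llm_int8_style_matmul(X, W).float() - X.float() @ W.float()).abs().max())
```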