Summary: Partially-Binarized Large Language Models for Compression (arxiv.org)
7,899 words - PDF document
One Line
Partially-Binarized Large Language Models (PB-LLM) compress LLMs to extremely low bit-widths while preserving their linguistic reasoning capacity, and outperform existing quantization methods in performance and training efficiency.
Key Points
- Partially-Binarized Large Language Models (PB-LLM) can compress Large Language Models (LLMs) using network binarization.
- PB-LLM maintains the linguistic reasoning capacity of LLMs while achieving extreme low-bit quantization.
- Applying existing binarization algorithms directly to LLMs is ineffective, which highlights the importance of salient weights for extreme low-bit quantization.
- PB-LLM filters out a small ratio of salient weights during binarization and allocates them to higher-bit storage (see the sketch after this list).
- Post-training quantization (PTQ) and quantization-aware training (QAT) are used to recover the capacity of the quantized LLMs.
- PB-LLM outperforms existing LLM quantization methods in terms of performance and training efficiency.
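The core idea, keeping a small fraction of salient weights at higher precision while binarizing the rest, can be illustrated with a minimal sketch. The magnitude-based saliency criterion, the per-tensor scaling factor, and the `salient_ratio` parameter below are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def partially_binarize(weight: torch.Tensor, salient_ratio: float = 0.05):
    """Minimal sketch of partial binarization (not the paper's exact method):
    keep a small fraction of salient (large-magnitude) weights in higher
    precision and binarize the remaining weights with one scaling factor."""
    flat = weight.abs().flatten()
    k = max(1, int(salient_ratio * flat.numel()))
    # Magnitude threshold above which weights are treated as salient (assumption).
    threshold = torch.topk(flat, k).values.min()
    salient_mask = weight.abs() >= threshold

    # Scale for the binarized part: mean absolute value of the non-salient weights.
    scale = weight[~salient_mask].abs().mean()
    binarized = torch.sign(weight) * scale

    # Salient weights keep their original (higher-precision) values;
    # everything else collapses to {-scale, +scale}.
    return torch.where(salient_mask, weight, binarized), salient_mask
```

In practice this would be applied weight matrix by weight matrix of a LLaMA-style model, with the salient mask stored alongside the quantized tensor.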
Summaries
22 word summary
Partially-Binarized Large Language Models compress LLMs using network binarization, maintaining linguistic reasoning capacity. PB-LLM outperforms existing methods in performance and training efficiency.
68 word summary
Partially-Binarized Large Language Models (PB-LLM) compress Large Language Models (LLMs) using network binarization, achieving extreme low-bit quantization while maintaining linguistic reasoning capacity. Because existing binarization algorithms are ineffective for LLM quantization, salient weights are crucial: PB-LLM filters out a small ratio of salient weights and allocates them to higher-bit storage. Post-training quantization and quantization-aware training then recover the capacity of the quantized models. PB-LLM outperforms existing LLM quantization methods in performance and training efficiency.
128 word summary
This paper introduces Partially-Binarized Large Language Models (PB-LLM) as a method for compressing Large Language Models (LLMs) using network binarization. The authors propose PB-LLM as a way to achieve extreme low-bit quantization while maintaining the linguistic reasoning capacity of LLMs. They explore the ineffectiveness of existing binarization algorithms for LLM quantization and highlight the importance of salient weights in achieving low-bit quantization. PB-LLM filters a small ratio of salient weights during binarization, allocating them to higher-bit storage. The methodology is extended to recover the capacities of quantized LLMs through post-training quantization (PTQ) and quantization-aware training (QAT). The authors present experiments on LLaMA-7B and evaluate the performance of PB-LLM on various tasks. The results show that PB-LLM outperforms existing LLM quantization methods in terms of performance and training efficiency.
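The QAT side of the recovery step is commonly implemented with a straight-through estimator, so that gradients flow to the latent full-precision weights even though the forward pass uses binarized values. The sketch below is an assumption about how such a step could look, not PB-LLM's exact training recipe.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Straight-through estimator for binarized weights, a standard
    quantization-aware-training trick (illustrative; not PB-LLM's exact QAT)."""

    @staticmethod
    def forward(ctx, weight: torch.Tensor) -> torch.Tensor:
        # Forward: binarize with a per-tensor scale (mean absolute value).
        scale = weight.abs().mean()
        return torch.sign(weight) * scale

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> torch.Tensor:
        # Backward: pass the gradient straight through to the latent
        # full-precision weights so they can still be updated.
        return grad_output

# During QAT, a linear layer would call BinarizeSTE.apply(self.weight)
# for the non-salient weights, while the salient ones stay in higher precision.
```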