Summary: Robustly Binarized Multi-distilled Transformer (arxiv.org)
8,566 words - PDF document
One Line
The paper discusses the challenges of deploying pre-trained transformers in resource-constrained environments and proposes improvements for higher-accuracy binary transformers, including a two-set binarization scheme and a model called BiT created through multi-step distillation.
Key Points
- Pre-trained transformers face challenges in resource-constrained environments due to their large number of parameters and computational complexity.
- The authors propose improvements to enable binary transformers with higher accuracy, including a two-set binarization scheme.
- The BiT model is created through multi-step distillation into increasingly quantized models, ensuring the student model remains close to the teacher model (see the sketch after this list).
- The Robustly Binarized Multi-distilled Transformer (BiT) achieves improved accuracy through elastic binarization and multi-distillation.
- The performance of the BiT model is compared to progressive distillation on selected GLUE tasks.
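To make the "student stays close to the teacher" idea concrete, here is a minimal sketch of a transformer distillation loss that matches the teacher's logits and hidden states. The helper name, loss weighting, and temperature below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, temperature=1.0):
    """student_out / teacher_out: (logits, list_of_hidden_states)."""
    s_logits, s_hidden = student_out
    t_logits, t_hidden = teacher_out

    # Soft-label loss: KL divergence between teacher and student predictions.
    logit_loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Feature loss: MSE between intermediate hidden states, layer by layer.
    hidden_loss = sum(F.mse_loss(s, t) for s, t in zip(s_hidden, t_hidden))

    return logit_loss + hidden_loss
```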
Summaries
36 word summary
The paper addresses challenges of using pre-trained transformers in resource-constrained environments. The authors propose improvements for higher accuracy in binary transformers, including a two-set binarization scheme. They introduce a model called BiT, created through distillation.
38 word summary
The paper discusses the challenges of deploying pre-trained transformers in resource-constrained environments. The authors propose improvements to enable binary transformers with higher accuracy, including a two-set binarization scheme. They introduce a model called BiT, created through distillation.
427 word summary
The paper discusses the challenges of deploying pre-trained transformers in resource-constrained environments due to their large number of parameters and computational complexity. The authors propose a series of improvements to enable binary transformers with higher accuracy. These improvements include a two-set binarization scheme, an elastic binarization function, and a multi-step distillation procedure.
The paper introduces a model called BiT, which is created through a process of distillation into increasingly quantized models. This method ensures that the student model remains close to the teacher model while also providing a good initialization. The approach significantly reduces the accuracy gap to the full-precision model.
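The multi-step distillation schedule can be sketched as a simple loop. Here, `make_student` and `distill` are hypothetical callables supplied by the caller (build a lower-precision copy of a model; run ordinary knowledge distillation against a fixed teacher), and the bit-width schedule shown is illustrative rather than the paper's exact recipe.

```python
def multi_distill(full_precision_model, train_data, make_student, distill,
                  schedule=((1, 8), (1, 4), (1, 2), (1, 1))):
    """schedule: (weight_bits, activation_bits) pairs, most to least precise."""
    teacher = full_precision_model
    for weight_bits, act_bits in schedule:
        # Initialize the next student from the current teacher, quantize it
        # more aggressively, then distill it from that teacher.
        student = make_student(teacher, weight_bits, act_bits)
        student = distill(student, teacher, train_data)
        teacher = student  # the distilled student becomes the next teacher
    return teacher  # final binary (W1A1) model
```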
The document discusses the Robustly Binarized Multi-distilled Transformer (BiT) model. BiT binarizes activations to the two values {0, 1} and binarizes weights to the two values {-1, +1}.
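As a rough sketch of the two-set idea (not the paper's exact formulation), the two value sets can be written as two small functions; the per-tensor mean-absolute-value scaling used below is an assumption.

```python
import torch

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    # Weights use the symmetric set {-alpha, +alpha}; scaling by the mean
    # absolute value is one common choice and may differ from the paper's.
    alpha = w.abs().mean()
    return alpha * torch.sign(w)

def binarize_activations(x: torch.Tensor) -> torch.Tensor:
    # Non-negative activations (e.g. Softmax or ReLU outputs) use the set
    # {0, alpha}: scale, clamp to [0, 1], then round to the nearest level.
    alpha = x.abs().mean()
    return alpha * torch.clamp(x / alpha, 0, 1).round()
```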
The Robustly Binarized Multi-distilled Transformer (BiT) is a model that achieves improved accuracy through the use of elastic binarization and multi-distillation. Elastic binarization provides a 15.7% accuracy boost by replacing the fixed binarization function with one whose scale and threshold are learned during training.
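The following is a minimal sketch of an elastic binarization function with a learnable scale and threshold trained through a straight-through estimator; the parameter names, initialization, and exact gradient formulation are assumptions and likely differ in detail from the paper's.

```python
import torch
import torch.nn as nn

class ElasticBinarizer(nn.Module):
    """Binarizes inputs to {0, alpha} with a learnable scale and threshold."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))  # learnable scale
        self.beta = nn.Parameter(torch.tensor(0.0))   # learnable threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        soft = torch.clamp((x - self.beta) / self.alpha, 0, 1)
        hard = soft.round()
        # Straight-through estimator: the forward pass uses the hard 0/1
        # values, while gradients flow through the soft clamped values.
        return self.alpha * (soft + (hard - soft).detach())
```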
The document presents the results of a study on the performance of a binary neural network model called the Robustly Binarized Multi-distilled Transformer. The study compares the performance of different variations of the model on various language understanding tasks. The results show that the full BiT model outperforms the other variations across the evaluated tasks.
In an ablation study on the effects of different components on the GLUE dataset without data augmentation, the authors evaluate the performance of various models. They compare the results of different methods, including BERT base, the BiBERT baseline, BinaryBERT, and BiT.
The authors of the document present a robustly binarized multi-distilled transformer model that achieves high accuracy on the GLUE benchmark. They show that moving from fixed to elastic binarization significantly improves the baseline accuracy, surpassing the current state of the art.
The excerpted text includes a list of references from various papers in the field of natural language processing and neural networks. These references cover topics such as stochastic model recognition algorithms, language models as few-shot learners, semantic textual similarity evaluation, and training of deep neural networks.
This section lists references to various papers and studies related to quantization and binary neural networks. Key papers mentioned include "QKD: Quantization-aware knowledge distillation" by Jangho Kim et al. (2019).
This document includes references to various papers and studies related to the topic of quantization and compression of deep neural networks, particularly in the context of natural language processing (NLP) models like BERT. The mentioned papers discuss techniques such as distillation, quantization-aware training, and binarization.
The paper compares the BiT model to progressive distillation on selected GLUE tasks. The teacher model used is the full-precision BERT model.