Summary: 2-Bit Quantization of Large Language Models (arxiv.org)
19,237 words - PDF document
One Line
QuIP is a two-bit quantization method that makes large language models more efficient to run by ensuring, and then exploiting, incoherence between the weight and proxy Hessian matrices.
Key Points
- QuIP is a two-bit quantization method for large language models that improves runtime efficiency.
- Quantization is most effective when weight and proxy Hessian matrices are incoherent.
- LDLQ is an adaptive rounding method, optimal within the class of rounding procedures with linear feedback, that updates each column of the weight matrix using a linear function of the rounding residuals of the columns already quantized.
- QuIP computes the quantization range based on the spectrum of the weight matrix instead of the maximum value.
- Incoherence processing in QuIP greatly improves the performance of all quantization methods at lower weight bits.
Summaries
24 word summary
QuIP is a two-bit quantization method for large language models that improves runtime efficiency by leveraging the incoherence between weight and proxy Hessian matrices.
36 word summary
This work presents QuIP, a two-bit quantization method for large language models (LLMs) that improves runtime efficiency. The method leverages the incoherence between weight and proxy Hessian matrices to achieve effective quantization. The document also covers the method's theoretical guarantees and empirical results.
692 word summary
This work introduces QuIP, a two-bit quantization method for large language models (LLMs) that improves runtime efficiency. The key insight is that quantization is most effective when the weight and proxy Hessian matrices are incoherent. QuIP consists of an adaptive rounding procedure that minimizes a quadratic proxy of the quantization error, together with incoherence processing that multiplies the weight and Hessian matrices by random orthogonal matrices before and after rounding.
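To make the notion of incoherence concrete, the usual definitions bound how large any single entry can be relative to the matrix's overall scale; the following is a sketch in that standard form (the symbols µ, Q, and Λ are the conventional ones, not quoted verbatim from this summary):

```latex
% A symmetric proxy Hessian H = Q \Lambda Q^T (eigendecomposition) is called
% mu-incoherent if no eigenvector entry is too large:
\[
  |Q_{ij}| \;\le\; \frac{\mu}{\sqrt{n}} \quad \text{for all } i, j,
\]
% and a weight matrix W in R^{m x n} is mu-incoherent if no entry is large
% relative to the Frobenius norm:
\[
  |W_{ij}| \;\le\; \mu \, \frac{\|W\|_F}{\sqrt{mn}} \quad \text{for all } i, j.
\]
```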
The document discusses the LDLQ method, an optimal adaptive rounding method for large language models. The method iteratively updates columns of the weight matrix, rounding each one after adding a linear function of the rounding residuals of the columns already quantized. The final rounded weight matrix satisfies the matrix equation Ŵ = Q(W + (W − Ŵ)U), where Q denotes the elementwise rounding operation and U is a strictly upper-triangular feedback matrix that LDLQ takes from the LDL decomposition of the proxy Hessian H.
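As an illustration of this column-by-column recursion, here is a minimal NumPy sketch of adaptive rounding with linear feedback in which the feedback matrix comes from an LDL-style factorization of the proxy Hessian. It assumes the weights have already been rescaled to an integer grid and that H is positive definite; the function names and the exact factorization/ordering conventions are illustrative, not copied from the paper's implementation.

```python
import numpy as np

def nearest_round(x, n_levels=4):
    """Nearest rounding onto the integer grid {0, ..., n_levels-1}.
    Assumes the weights were already affinely rescaled into this range."""
    return np.clip(np.round(x), 0, n_levels - 1)

def ldl_feedback(H):
    """Factor a symmetric positive-definite proxy Hessian H as
    H = (U + I) D (U + I)^T with U strictly upper triangular,
    via a 'reversed' Cholesky factorization."""
    n = H.shape[0]
    J = np.eye(n)[::-1]                    # exchange (reversal) permutation
    L = np.linalg.cholesky(J @ H @ J)      # J H J = L L^T, L lower triangular
    M = J @ L @ J                          # upper triangular, with H = M M^T
    A = M / np.diag(M)                     # unit upper triangular (columns rescaled)
    return A - np.eye(n)                   # strictly upper triangular feedback U

def ldlq_round(W, H, n_levels=4):
    """Adaptive rounding with linear feedback: quantize the columns of W left
    to right, feeding the rounding residuals of already-quantized columns
    forward through U, so that W_hat = Q(W + (W - W_hat) U)."""
    U = ldl_feedback(H)
    W_hat = np.zeros_like(W, dtype=float)
    for k in range(W.shape[1]):
        feedback = (W[:, :k] - W_hat[:, :k]) @ U[:k, k]
        W_hat[:, k] = nearest_round(W[:, k] + feedback, n_levels)
    return W_hat
```

Because U is strictly upper triangular, the feedback for column k only involves columns already quantized, so the sequential loop computes exactly the fixed point Ŵ = Q(W + (W − Ŵ)U).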
We derive explicit proxy losses for plain nearest and stochastic rounding, comparing them to what LDLQ achieves via Lemma 2. In the worst case, stochastic rounding incurs a proxy loss of (m/4) tr(H); in the average case, the nearest and stochastic rounding losses likewise scale with tr(H), whereas LDLQ's bounds depend on tr(D), the trace of the diagonal factor from the LDL decomposition of H.
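For reference, the proxy objective these bounds are stated against is the quadratic error weighted by the proxy Hessian; schematically (the LDLQ constant is left generic because it is not quoted in this summary):

```latex
% Quadratic proxy of the per-layer quantization error, with proxy Hessian H:
\[
  \ell(\hat{W}) \;=\; \operatorname{tr}\!\bigl((\hat{W} - W)\, H\, (\hat{W} - W)^{\top}\bigr).
\]
% Shape of the comparison sketched above: nearest/stochastic rounding pay on the
% order of tr(H) in the worst case, while LDLQ pays the analogous quantity with
% tr(D) in place of tr(H), where D is the diagonal factor of the LDL
% decomposition of H and tr(D) <= tr(H):
\[
  \ell_{\text{stoch}} \;\le\; \tfrac{m}{4}\operatorname{tr}(H),
  \qquad
  \ell_{\text{LDLQ}} \;\lesssim\; \operatorname{tr}(D) \;\le\; \operatorname{tr}(H).
\]
```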
A method for quantizing large language models using random orthogonal matrices is described. Additional heuristics are outlined, including rescaling the matrices to shrink the relevant spectrum and computing the quantization range from the spectrum of the weight matrix instead of its maximum value. Greedy local search over the rounding grid can then be applied to further reduce the proxy loss.
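A minimal sketch of this kind of incoherence processing, using dense random orthogonal matrices drawn via a QR decomposition (the paper's actual construction of the random transforms, and the exact handedness of the multiplications, may differ):

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix,
    with the usual sign fix on the diagonal of R."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

def incoherence_preprocess(W, H, seed=0):
    """Rotate W and the proxy Hessian H by random orthogonal matrices so
    that, with high probability, no single entry dominates (incoherence).
    The rotated pair (W_tilde, H_tilde) is what gets quantized."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    U = random_orthogonal(m, rng)
    V = random_orthogonal(n, rng)
    W_tilde = U @ W @ V
    H_tilde = V.T @ H @ V      # chosen so the proxy loss is preserved by the rotation
    return W_tilde, H_tilde, U, V

def incoherence_postprocess(W_hat_tilde, U, V):
    """Mathematical inverse of the rotation, mapping quantized weights back
    to the original basis."""
    return U.T @ W_hat_tilde @ V.T
```

With these conventions the proxy loss is invariant: tr(Ẽ H̃ Ẽᵀ) equals tr(E H Eᵀ) for the corresponding errors in the two bases, so rounding can be done entirely in the rotated (incoherent) space.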
The paper discusses the quantization of large language models using a method called QuIP. The authors present Theorem 7, which states that there exists an assignment of hyperparameters in the quantization algorithm that ensures all quantized weights fall within the desired range.
QuIP's incoherence processing greatly improves the performance of all quantization methods at lower weight bits, including nearest quantization at two bits. QuIP-RG modifications may provide additional improvements but require further study. The relative contributions of QuIP's individual processing steps are also examined.
This excerpt contains a list of references to papers and conferences related to quantization of large language models. The references include work on post-training quantization, low-bit vision transformers, block reconstruction, noisy bias-enhanced activation quantization, and related topics.
PyTorch, an imperative-style deep learning library, is referenced in the document. The paper aims to push quantization of large language models (LLMs) into the 2-bits-per-weight regime so that powerful LLMs can be run more efficiently.
The authors propose a method called QuIP for quantizing large language models. They compute the quantization range from the spectrum of the weight matrix rather than from the typical maximum value, using a fixed multiplier of 2.4 consistently across all experiments.
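A hedged sketch of a spectrum-based range choice: the clipping range is set from the root-mean-square entry size (via the Frobenius norm) times a fixed multiplier, with 2.4 used as the multiplier value quoted above. The precise formula, and the function names, are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def spectrum_based_half_range(W, rho=2.4):
    """Half-width of the quantization range chosen from the overall spread
    of W's entries (Frobenius norm), not from max|W|, so isolated outliers
    do not inflate the step size."""
    m, n = W.shape
    rms = np.linalg.norm(W, "fro") / np.sqrt(m * n)   # RMS entry magnitude
    return rho * rms

def quantize_to_range(W, half_range, n_levels=4):
    """Nearest rounding of W onto a symmetric grid of n_levels points
    spanning [-half_range, +half_range]."""
    step = 2 * half_range / (n_levels - 1)
    offset = (n_levels - 1) / 2
    codes = np.clip(np.round(W / step + offset), 0, n_levels - 1)
    return (codes - offset) * step
```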
When using 2-bit quantization for large language models, multiple passes of greedy updates are typically run. The paper compares the proxy loss of LDLQ and nearest rounding by analyzing the traces of the matrices D and H. Bounds on the proxy loss for LDLQ are expressed in terms of tr(D), which is compared against tr(H).
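The greedy updates can be sketched as a local search that perturbs one quantized entry at a time by one grid step and keeps the move only when it lowers the proxy loss; the incremental-loss formula in the comment follows directly from expanding the trace. The function name and the unit grid step are assumptions for illustration.

```python
import numpy as np

def greedy_pass(W, W_hat, H, step=1.0):
    """One pass of greedy local search: for each quantized entry, try moving
    one grid step up or down and keep the move if it lowers the proxy loss
    tr((W_hat - W) H (W_hat - W)^T).  Changing entry (i, j) by delta changes
    the loss by 2*delta*(E H)_{ij} + delta^2 * H_{jj}, where E = W_hat - W.
    Multiple passes can be run until no entry moves; grid-boundary clipping
    is omitted for brevity."""
    W_hat = W_hat.copy()
    G = (W_hat - W) @ H                    # keeps (E H)_{ij} available in O(1)
    m, n = W.shape
    for i in range(m):
        for j in range(n):
            best_delta, best_change = 0.0, 0.0
            for delta in (-step, step):
                change = 2 * delta * G[i, j] + delta * delta * H[j, j]
                if change < best_change:
                    best_change, best_delta = change, delta
            if best_delta != 0.0:
                W_hat[i, j] += best_delta
                G[i, :] += best_delta * H[j, :]   # update row i of E H after the move
    return W_hat
```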
Tables 5, 6, 7, 8, and 9 show the results of quantizing the OPT-30b, OPT-13b, OPT-6.7b, OPT-2.7b, and OPT-1.3b models, respectively.
Our study demonstrates that incoherence processing enables a marked improvement in quantization at 2 bits with all rounding methods. Detailed tables compare the performance of different quantization and pre-/post-processing methods on language generation and zero-shot tasks across the OPT model family.
OPTQ/LDLQ, LDLQ-RG, and Greedy perform similarly at 2 bits and outperform Nearest. In the evaluation of adaptive rounding with linear feedback, biased rounding is typically used. Weighted averages of the proxy loss support these comparisons.
Table 13 shows the average perplexity difference between unbiased and biased rounding for LDLQ/OPTQ on WikiText2, PTB, and C4. The results indicate that biased rounding performs better than unbiased rounding, particularly at lower bit widths.
The text excerpt discusses the quantization of large language models. It presents a proof showing that a global minimum of a certain function occurs when a specific condition is met. It then introduces a lemma concerning the LDL (Cholesky) decomposition of a matrix and its role in the subsequent proofs.
The text discusses the LDL decomposition of a matrix H and its connection to the worst-case and average-case losses. It then presents upper and lower bounds derived from these calculations. The text also includes proofs for the incoherence processing step and for the incoherence guarantees it provides.
The text excerpt discusses the quantization of large language models. It introduces a proxy loss function and explains how it can be written in block form. It also discusses the LDL decomposition of a matrix and its application in the quantization procedure.
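One simple way to see such a block structure is that the trace objective separates over the rows of the error matrix; this particular decomposition is an illustration and not necessarily the exact block form used in the paper:

```latex
% With E = \hat{W} - W, the trace objective splits into independent row terms:
\[
  \operatorname{tr}\!\bigl(E H E^{\top}\bigr)
  \;=\; \sum_{i=1}^{m} E_{i,:}\, H\, E_{i,:}^{\top},
\]
% so each row of the weight matrix can be analyzed against the same proxy Hessian H.
```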
By applying the union bound and setting the right-hand side equal to the target failure probability, the proof shows that |(ŵ − w)ᵀu| is bounded by a multiple of ‖u‖₁ times a factor logarithmic in the inverse failure probability. The second statement of the result follows analogously.