Summary
Outlier-Weighed Layerwise Sparsity for Pruning Large Language Models (arxiv.org)
7,434 words - PDF document
One Line
Researchers have developed OWL, a pruning method for Large Language Models (LLMs) that improves pruned-model performance by using non-uniform layerwise sparsity ratios; it outperforms previous methods and prunes a 65B LLaMA model in 4 seconds.
Key Points
- Large Language Models (LLMs) have gained popularity for their impressive performance in various applications.
- Researchers have explored network pruning techniques to reduce the size of LLMs without sacrificing performance.
- Non-uniform layerwise sparsity can yield improved results in LLM pruning compared to uniformly pruning all layers.
- The paper introduces a novel LLM pruning methodology called Outlier Weighed Layerwise Sparsity (OWL) that assigns non-uniform layerwise sparsity ratios based on the distribution of outliers within LLMs (see the sketch after this list).
- OWL outperforms previous state-of-the-art methods, such as Wanda and SparseGPT, at high sparsity levels.
- OWL and uniform sparsity perform better than other layerwise sparsity schemes, while the ERK family is less suitable for LLM pruning.
- OWL has comparable computational complexity to other methods and efficiently prunes large LLM models within seconds.
- OWL highlights the importance of layerwise sparsity ratios in LLM pruning and opens up possibilities for specialized sparse algorithms and optimized deployment of LLMs in practical applications.
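The key points above refer to a per-layer outlier ratio without defining it. Below is a minimal sketch of one plausible way to measure it, using a Wanda-style score (|weight| times the L2 norm of the corresponding input feature); the threshold factor m and the function name are illustrative assumptions rather than the paper's exact definition.

```python
import torch

def layer_outlier_ratio(weight: torch.Tensor,
                        input_feature_norms: torch.Tensor,
                        m: float = 5.0) -> float:
    """Fraction of weights in one linear layer whose Wanda-style score
    |W_ij| * ||X_j||_2 exceeds m times the layer's mean score.

    weight:              (out_features, in_features) weight matrix
    input_feature_norms: (in_features,) L2 norms of the layer's input features,
                         gathered from a small calibration set
    m:                   outlier threshold factor (assumed value, not from the paper)
    """
    scores = weight.abs() * input_feature_norms.unsqueeze(0)  # broadcast across rows
    threshold = m * scores.mean()
    return (scores > threshold).float().mean().item()
```

Computing this ratio for every linear layer gives the layerwise outlier distribution that the non-uniform sparsity ratios are meant to track; a sketch of turning those ratios into per-layer sparsity targets appears with the detailed summary below.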
Summaries
52 word summary
Researchers introduce OWL, a pruning method for Large Language Models (LLMs) that improves performance by incorporating non-uniform layerwise sparsity ratios. OWL outperforms previous methods, achieving larger reductions in perplexity with comparable computational complexity. It efficiently prunes a 65B LLaMA model in 4 seconds, opening new possibilities for specialized sparse algorithms and LLM optimization.
74 word summary
Researchers have developed OWL, a pruning methodology for reducing the size of Large Language Models (LLMs) without sacrificing performance. OWL incorporates non-uniform layerwise sparsity ratios based on the outlier ratio within each layer, resulting in improved performance. Empirical evaluations show that OWL outperforms previous methods, achieving larger reductions in perplexity. It offers comparable computational complexity and efficiently prunes a 65B LLaMA model within 4 seconds, introducing new possibilities for specialized sparse algorithms and optimizing LLM deployment.
138 word summary
Researchers have developed a novel pruning methodology called Outlier Weighed Layerwise Sparsity (OWL) to reduce the size of Large Language Models (LLMs) without compromising performance. The authors conducted a comprehensive analysis of token features within LLMs and discovered a strong correlation with the emergence of outliers. OWL incorporates non-uniform layerwise sparsity ratios that align with the outlier ratios observed within each layer, resulting in improved performance. Empirical evaluations demonstrate that OWL outperforms previous methods, achieving larger reductions in perplexity. Comparisons with other layerwise sparsity methods show that OWL and uniform sparsity perform better than the alternatives, while the ERK family of methods is less suitable for LLM pruning. OWL also offers comparable computational complexity and efficiently prunes a 65B LLaMA model within 4 seconds. Overall, OWL introduces new possibilities for specialized sparse algorithms and optimizes the deployment of LLMs in practical applications.
378 word summary
Large Language Models (LLMs) have gained popularity for their impressive performance in various applications. However, their large size poses challenges in terms of practical deployment. To address this issue, researchers have explored network pruning techniques to reduce the size of LLMs without sacrificing performance. Previous pruning strategies for LLMs have focused on uniformly pruning all layers at equivalent sparsity levels. However, recent trends in vision models have shown that non-uniform layerwise sparsity can yield improved results. This paper investigates the reasons behind this disparity and proposes a novel LLM pruning methodology called Outlier Weighed Layerwise Sparsity (OWL).
The authors conduct a comprehensive analysis of the distribution of token features within LLMs and discover a strong correlation with the emergence of outliers, which are features with significantly greater magnitudes than the others. Inspired by this finding, they introduce OWL, which incorporates a tailored set of non-uniform layerwise sparsity ratios specifically designed for LLM pruning. Each layer's sparsity ratio is set according to the outlier ratio observed within that layer, aligning layerwise weight sparsity with the outlier distribution.
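As an illustration of how such layerwise ratios might be assigned, the sketch below spreads a global sparsity target across layers according to their outlier ratios while keeping the mean sparsity at the target. The shift range lam, the linear mapping, and the convention that outlier-heavy layers end up with lower sparsity (i.e., keep more of their weights) are assumptions made for this sketch, not the paper's exact formula.

```python
def allocate_layer_sparsity(outlier_ratios, target_sparsity=0.7, lam=0.08):
    """Map per-layer outlier ratios to per-layer sparsity targets.

    Each layer's sparsity is shifted away from the global target within
    [target - lam, target + lam]; layers whose outlier ratio is above the
    mean receive lower sparsity (more weights kept), and the mean sparsity
    across layers stays at target_sparsity. lam and the linear mapping
    are illustrative assumptions.
    """
    mean_ratio = sum(outlier_ratios) / len(outlier_ratios)
    spread = max(abs(r - mean_ratio) for r in outlier_ratios) or 1.0
    sparsities = []
    for r in outlier_ratios:
        # normalised deviation in [-1, 1]; more outliers -> lower sparsity
        shift = lam * (r - mean_ratio) / spread
        sparsities.append(target_sparsity - shift)
    return sparsities
```

The resulting per-layer targets can then be handed to an existing pruning criterion such as magnitude pruning, Wanda, or SparseGPT in place of a single uniform ratio.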
The empirical evaluation of OWL across different LLMs demonstrates its advantages over previous methods. For instance, OWL outperforms the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity points, respectively, at a high sparsity level of 70%. The results show that OWL consistently improves performance and achieves larger reductions in perplexity than other methods.
The paper also compares OWL with other layerwise sparsity methods, such as global pruning, uniform sparsity, and Erdős–Rényi Kernel (ERK). The results indicate that OWL and uniform sparsity perform better than other methods, while the ERK family of methods is less suitable for LLM pruning.
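For context, the Erdős–Rényi family assigns higher density (lower sparsity) to layers with fewer parameters. A common formulation from the dynamic sparse training literature, used here as an assumption since the summary does not spell out the paper's exact variant, scales each linear layer's density with (n_in + n_out) / (n_in * n_out) and then rescales so that the overall sparsity hits the target.

```python
def erdos_renyi_sparsity(layer_shapes, target_sparsity=0.7):
    """Erdős–Rényi layerwise sparsity for linear layers.

    layer_shapes: list of (out_features, in_features) tuples.
    Layer density is proportional to (n_out + n_in) / (n_out * n_in),
    rescaled so the parameter-weighted global density matches the target.
    Clipping at full density is not redistributed in this sketch.
    """
    raw = [(o + i) / (o * i) for o, i in layer_shapes]
    params = [o * i for o, i in layer_shapes]
    target_density = 1.0 - target_sparsity
    scale = target_density * sum(params) / sum(r * p for r, p in zip(raw, params))
    densities = [min(1.0, scale * r) for r in raw]
    return [1.0 - d for d in densities]
```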
In terms of efficiency, OWL's computational complexity is comparable to that of Wanda and SparseGPT. When applied on top of Wanda, its pruning time is only slightly higher, and it prunes a 65B LLaMA model within 4 seconds.
In conclusion, this work highlights the importance of layerwise sparsity ratios in LLM pruning and introduces OWL as an effective pruning methodology. OWL takes into account the distribution of outliers within LLMs and achieves better performance than existing methods. The findings open up new possibilities for specialized sparse algorithms and for optimized deployment of LLMs in practical applications.