Summary: Efficient Transformer Knowledge Distillation: A Performance Review (arxiv.org)
7,537 words - PDF document
One Line
This study evaluates how knowledge distillation affects efficient attention transformers in pretrained language models and introduces GONERD, a new long-context NER dataset.
Key Points
- This study focuses on model compression and efficient attention mechanisms in pretrained transformer language models.
- Knowledge distillation is used to compress efficient attention transformers while preserving performance.
- Efficient attention models allow for processing longer sequences with reduced computational overhead.
- Knowledge distillation trains a smaller, simpler student model to reproduce the behavior of a larger, already-trained teacher model.
- The combination of knowledge distillation and efficient attention architectures results in compressed models with preserved performance and reduced inference times.
- The GONERD dataset is introduced to evaluate the performance of NER models on long-context sequences.
- Performing knowledge distillation before fine-tuning preserves 97.4% of CoNLL-2003 NER performance and improves performance on GONERD.
- Further research is needed to explore distillation methods tailored for specific efficient attention mechanisms, tasks, and architectures.
Summaries
19 word summary
This study assesses knowledge distillation's impact on efficient attention transformers in pretrained language models, highlighting a new NER dataset.
51 word summary
This study evaluates the performance of knowledge distillation on efficient attention transformers in pretrained transformer language models. Distilled efficient attention transformers can maintain the original model's performance while reducing inference times. The researchers introduce a new long-context Named Entity Recognition (NER) dataset called GONERD, addressing a gap in long-context NER benchmarking.
131 word summary
This study examines the combination of model compression and efficient attention mechanisms in pretrained transformer language models. The researchers evaluate the performance of knowledge distillation on efficient attention transformers and introduce a new long-context Named Entity Recognition (NER) dataset called GONERD. Distilled efficient attention transformers can maintain a significant amount of the original model's performance while reducing inference times. Transformer-based models like BERT and RoBERTa struggle with processing long-context sequences, leading to the development of efficient attention transformer models. However, these models are still computationally expensive to train and deploy. Therefore, the researchers explore the combination of knowledge distillation and efficient attention architectures. The results demonstrate that distilled efficient attention models can preserve performance while reducing inference times. The introduction of the GONERD dataset fills a gap in long-context NER benchmarking.
441 word summary
This study explores the combination of model compression and efficient attention mechanisms in pretrained transformer language models. The researchers evaluate the performance of knowledge distillation on efficient attention transformers and introduce a new long-context Named Entity Recognition (NER) dataset called GONERD. The results show that distilled efficient attention transformers can maintain a significant amount of the original model's performance while reducing inference times. This demonstrates that knowledge distillation is an effective method for creating high-performing efficient attention models at low cost.
Transformer-based models like BERT and RoBERTa have achieved state-of-the-art performance in Natural Language Processing (NLP) tasks but struggle with processing long-context sequences due to their limited input length. Efficient attention transformer models, such as Longformer, Big Bird, Nystromformer, and LSG, have been developed to address this limitation by accepting longer sequences with reduced computational overhead.
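To make the longer input window concrete, here is a minimal sketch using the Hugging Face transformers API with the public allenai/longformer-base-4096 checkpoint; the exact checkpoints used in the study may differ.

```python
from transformers import AutoTokenizer, AutoModel

# Longformer accepts inputs up to 4096 tokens, versus 512 for BERT/RoBERTa.
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

long_document = "word " * 3000  # stand-in for a genuinely long input
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True, max_length=4096)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```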
While efficient attention models require fewer computational resources than their non-efficient counterparts, they are still expensive to train and deploy. This increases operational costs and makes the models difficult to deploy on resource-limited hardware or in scenarios with limited internet access. In response, the NLP community has been exploring cheaper yet performant models created through Knowledge Distillation (KD).
Knowledge Distillation involves training a larger, complex model (teacher model) and distilling its knowledge into a smaller, simpler model (student model). This technique has successfully compressed BERT-based models and reduced their computational requirements. However, there has been limited research on combining KD and efficient attention architectures.
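The paper's exact training objective is not reproduced in this summary, but a minimal PyTorch sketch of the standard soft-target distillation loss, with illustrative temperature and weighting values, looks like this:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Soft-target distillation: KL divergence between temperature-softened
    teacher and student distributions, mixed with hard-label cross-entropy."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 to keep gradient magnitudes comparable.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```

During training, the teacher's logits are computed with gradients disabled, and only the student's parameters are updated.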
In this study, the researchers focus on combining KD and efficient attention architectures. They evaluate the performance of compressed efficient attention models using knowledge distillation on various tasks, including GLUE, SQuAD, HotpotQA, TriviaQA, CoNLL-2003, and GONERD. The results demonstrate that the distilled efficient attention models can preserve a significant amount of the original model's performance while reducing inference times by up to 57.8%.
The researchers introduce GONERD, a new long-context NER dataset, to address the need for a benchmark in long-context NER. They evaluate the performance of NER models on both CoNLL-2003 and GONERD datasets and find that performing knowledge distillation prior to fine-tuning on NER preserves 97.4% of CoNLL-2003 performance and improves GONERD performance.
In conclusion, this study shows that knowledge distillation is an effective method for creating high-performing efficient attention models at low cost. It provides insights into the trade-offs and benefits of compressed efficient attention models and highlights the value of combining KD and efficient attention architectures. The introduction of the GONERD dataset fills a gap in long-context NER benchmarking. The researchers have released all models on the Hugging Face Hub for general use. Further research is needed to explore distillation methods tailored for individual efficient attention mechanisms, tasks, and architectures.
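Since the models are stated to be on the Hugging Face Hub, loading one for NER inference would follow the standard transformers pattern; the model ID below is a hypothetical placeholder, not the authors' actual checkpoint name.

```python
from transformers import pipeline

# Hypothetical placeholder ID; substitute the checkpoint name from the
# authors' Hugging Face Hub release.
ner = pipeline(
    "token-classification",
    model="example-org/distilled-longformer-ner",
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

text = "The European Commission met Google representatives in Brussels."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```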
484 word summary
This study focuses on the intersection of model compression and efficient attention mechanisms in pretrained transformer language models. The researchers evaluate the performance of model compression via knowledge distillation on efficient attention transformers. They introduce a new long-context Named Entity Recognition (NER) dataset called GONERD and analyze the performance of NER models on long sequences. The results show that distilled efficient attention transformers can preserve a significant amount of the original model's performance while reducing inference times. The study demonstrates that knowledge distillation is an effective method for creating high-performing efficient attention models at low cost.
Transformer-based models, such as BERT and RoBERTa, have achieved state-of-the-art performance in Natural Language Processing (NLP) tasks. However, these models have limitations in processing long-context sequences due to their short maximum input length of 512 tokens. Efficient attention transformer models, such as Longformer, Big Bird, Nystromformer, and LSG, have been developed to address this limitation by accepting longer sequences with reduced computational overhead.
While efficient attention models require fewer computational resources than their non-efficient counterparts, they are still computationally expensive to train and deploy. This leads to increased operational costs and difficulty deploying the models on resource-limited hardware or in scenarios with limited internet access. In response to these challenges, the NLP community has been exploring cheaper yet performant models, such as those created through Knowledge Distillation (KD).
Knowledge Distillation is a technique that takes a larger, complex model (teacher model) and distills its knowledge into a smaller, simpler model (student model). This process has been successful in compressing BERT-based models and reducing their computational requirements. However, little work has investigated the combination of KD and efficient attention architectures.
In this study, the researchers focus on combining KD and efficient attention architectures. They evaluate the performance of compressed efficient attention models using knowledge distillation on various tasks, including GLUE, SQuAD, HotpotQA, TriviaQA, CoNLL-2003, and GONERD. The results show that the distilled efficient attention models can preserve a significant amount of the original model's performance while reducing inference times by up to 57.8%.
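As an illustrative harness for this kind of latency comparison (not the paper's benchmarking code), one can time teacher and student forward passes on an identical tokenized batch:

```python
import time
import torch

@torch.no_grad()
def mean_inference_seconds(model, inputs, n_runs=20):
    """Average forward-pass latency over n_runs, after one warm-up call."""
    model.eval()
    model(**inputs)  # warm-up so one-time setup cost is excluded
    start = time.perf_counter()
    for _ in range(n_runs):
        model(**inputs)
    return (time.perf_counter() - start) / n_runs

# Usage sketch: reduction = 1 - student_time / teacher_time, where each
# time comes from mean_inference_seconds on the same tokenized batch.
```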
The researchers also introduce GONERD, a new long-context NER dataset, to address the need for a benchmark in long-context NER. They evaluate the performance of NER models on both CoNLL-2003 and GONERD datasets and find that performing knowledge distillation prior to fine-tuning on NER preserves 97.4% of CoNLL-2003 performance and improves GONERD performance.
In conclusion, this study demonstrates that knowledge distillation is an effective method for creating high-performing efficient attention models at low cost. It provides insights into the performance trade-offs and benefits of compressed efficient attention models and highlights the value of combining KD and efficient attention architectures. The introduction of the GONERD dataset fills a gap in long-context NER benchmarking. The researchers have released all models on the Hugging Face Hub for general use. Further research is needed to explore distillation methods tailored for individual efficient attention mechanisms, tasks, and architectures.