Summary: The Poison of Alignment in Language Models (arxiv.org)
3,273 words - PDF document
One Line
The paper examines how alignment in instruction-tuning datasets affects large language models, comparing curated and web-crawled datasets and highlighting the importance of data cleaning and deduplication for model performance.
Key Points
- Alignment in supervised fine-tuning datasets can limit the harmful content generation of large language models (LLMs).
- Aligned answers in instruction-tuning data significantly worsen performance on reasoning benchmarks, by 4-33%.
- Dataset cleaning and preparation are crucial for improving the performance of supervised instruction fine-tuning.
- Dataset cleaning methods, such as alignment removal, can enhance the performance of LLMs.
- The quality of data has a greater impact on model performance than data quantity.
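The alignment-removal step named in the key points can be pictured as a simple keyword filter over responses. This is a minimal sketch under assumptions: the refusal-phrase list and the record format (dicts with a `response` key) are illustrative, not the paper's actual method.

```python
# Hypothetical sketch: drop SFT examples whose responses contain
# alignment-style refusals. Phrase list and record format are
# illustrative assumptions, not taken from the paper.
REFUSAL_MARKERS = [
    "as an ai language model",
    "i cannot assist with",
    "i'm sorry, but i can't",
]

def is_aligned(response: str) -> bool:
    """Heuristic check for alignment-style refusal text."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def remove_alignment(dataset):
    """Keep only examples whose response carries no refusal marker."""
    return [ex for ex in dataset if not is_aligned(ex["response"])]
```

In practice, published cleanings of this kind use longer, hand-curated phrase lists; the mechanism (substring match, then drop) stays the same.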
Summaries
37 word summary
This paper explores how alignment affects large language models (LLMs) in instruction tuning datasets. It questions the superiority of curated datasets over web-crawled datasets and emphasizes the need for data cleaning and deduplication to enhance model performance.
40 word summary
This paper examines the effect of alignment on large language models (LLMs), specifically in instruction tuning datasets. It challenges the notion that curated datasets outperform web-crawled datasets and emphasizes the importance of data cleaning and deduplication for optimal model performance.
208 word summary
This paper discusses the impact of alignment on the performance of large language models (LLMs). Alignment refers to deliberately reinforcing models not to respond to certain user inputs; it is present in instruction-tuning datasets such as OpenAssistant or Guanaco.
A study challenges the belief that curated datasets perform better than web-crawled datasets for language models. It also highlights the importance of cleaning and deduplicating data to achieve optimal model performance. The quality of data has a greater impact on model performance than data quantity.
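The deduplication emphasized above can be approximated by hashing normalized text and keeping the first occurrence of each key. This is a generic sketch, not the paper's pipeline; the record format (dicts with `instruction` and `response` keys) is an assumption.

```python
import hashlib

def normalize(text: str) -> str:
    """Collapse whitespace and case so near-identical copies hash alike."""
    return " ".join(text.lower().split())

def deduplicate(examples):
    """Keep the first occurrence of each normalized instruction+response pair."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(
            normalize(ex["instruction"] + ex["response"]).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```

Exact-match hashing like this misses paraphrased duplicates; fuzzy methods (e.g. MinHash) are the usual next step when that matters.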
The excerpt discusses the data-cleaning process for a language-model dataset, with details on how datasets were merged and alignment was removed. The authors eliminated low-quality chats with non-informative content, short input texts, and low average tokens per message, among other criteria.
This study highlights the negative impact of alignment on the performance of language models. The presence of alignment in supervised fine-tuning (SFT) data behaves similarly to dataset poisoning, leading to a significant decrease in reasoning ability that previous fine-tuning approaches did not account for.
The paper closes with references to related work on language models and their training, covering benchmark datasets, evaluation methods, data extraction, transfer learning, language-modeling datasets, instruction tuning, and fine-tuning.