Summary: Energy and Carbon Considerations of Fine-Tuning BERT (arxiv.org)
7,102 words - PDF document
One Line
The study examines the environmental impact of fine-tuning BERT models in natural language processing and offers recommendations for improving energy efficiency.
Key Points
- Fine-tuning BERT models is a routine step in the NLP model lifecycle that contributes meaningfully to energy use and emissions.
- Pre-training BERT draws more energy than fine-tuning, but fine-tuning is performed more frequently by individual actors.
- The number of training tokens is a reasonable heuristic for estimating fine-tuning energy use (see the sketch after this list).
- Sequence length has a stronger influence on energy intensity in the fine-tuning phase compared to inference.
- Fine-tuning energy efficiency should be studied separately from pre-training and inference workloads in NLP models.
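To make the token-count heuristic concrete, here is a minimal back-of-the-envelope sketch. The per-token energy constant and all numbers are hypothetical placeholders that would have to be measured on one's own hardware; they are not values reported in the paper:

```python
def estimate_finetuning_energy_kwh(num_examples: int,
                                   avg_tokens_per_example: float,
                                   kwh_per_million_tokens: float) -> float:
    """Rough estimate assuming energy scales roughly linearly with training tokens.

    kwh_per_million_tokens is a hardware- and model-specific constant that must
    be measured empirically (e.g., with an energy meter or CodeCarbon); the
    value used below is a placeholder, not a figure from the paper.
    """
    total_tokens = num_examples * avg_tokens_per_example
    return (total_tokens / 1e6) * kwh_per_million_tokens

# Hypothetical: 50,000 examples, ~128 tokens each, 0.05 kWh per million tokens.
print(estimate_finetuning_energy_kwh(50_000, 128, 0.05))  # ~0.32 kWh
```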
Summaries
22 word summary
This study analyzes the energy and carbon footprint of fine-tuning BERT models in NLP, offering insights and recommendations for improving energy efficiency.
68 word summary
This study examines the energy and carbon footprint of fine-tuning BERT models in NLP. The authors quantify the energy requirements of fine-tuning across tasks, datasets, and hardware settings. Pre-training BERT consumes more energy than fine-tuning, but fine-tuning's much greater frequency makes its energy and carbon footprint important. Training-token counts give a reasonable energy estimate, and sequence length strongly affects energy intensity. The study offers recommendations for improving energy efficiency in NLP.
161 word summary
This study examines the energy and carbon footprint of fine-tuning BERT models in natural language processing (NLP). The authors conduct an empirical study to quantify the energy requirements of fine-tuning across different tasks, datasets, and hardware settings. They compare the energy use of fine-tuning to that of pre-training and inference and offer recommendations for improving fine-tuning energy efficiency. The study finds that pre-training BERT consumes far more energy than a single fine-tuning run, but fine-tuning is performed much more frequently, making its energy and carbon footprint important to account for. The number of training tokens is a reasonable heuristic for estimating fine-tuning energy use, and sequence length has a stronger impact on energy intensity during fine-tuning than during inference. The authors stress the need to study fine-tuning energy efficiency separately from pre-training and inference workloads and hope their findings will inform decision-making in the NLP community. The study concludes with limitations and ethical considerations. Overall, the study offers valuable insights and recommendations for improving fine-tuning energy efficiency in NLP.
354 word summary
This study focuses on the energy and carbon footprint of fine-tuning BERT models in natural language processing (NLP). While previous research has primarily examined the energy costs of pre-training language models, fine-tuning is a crucial step that must be considered. The authors perform an empirical study to quantify the energy requirements of fine-tuning across various tasks, datasets, and hardware settings. They compare fine-tuning energy use to pre-training and inference and provide recommendations for improving fine-tuning energy efficiency.
The authors note that the typical NLP model lifecycle includes data ingestion, pre-training, fine-tuning, and inference, all of which contribute to energy use and emissions. However, there is a lack of data quantifying the relative contributions of each phase. To address this gap, the authors conduct experiments to isolate the factors that influence fine-tuning dynamics. They compare fine-tuning energy use across different datasets, tasks, and hardware setups and measure energy consumption using CodeCarbon software and physical energy meters.
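As a rough illustration of the software-based measurement approach, the sketch below wraps a training run with CodeCarbon's EmissionsTracker. Here, fine_tune() is a hypothetical placeholder for the actual training loop, and this is not necessarily the authors' exact instrumentation:

```python
from codecarbon import EmissionsTracker  # pip install codecarbon

tracker = EmissionsTracker(project_name="bert-finetuning")
tracker.start()
try:
    fine_tune()  # hypothetical placeholder for the actual fine-tuning loop
finally:
    # stop() returns the estimated emissions in kg CO2-equivalent.
    emissions_kg = tracker.stop()

print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")
```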
The results show that pre-training BERT draws substantially more energy than a single fine-tuning run. However, fine-tuning is performed far more frequently, by many individual actors, making it important to account for its energy and carbon footprint. The study finds that pre-training BERT consumes as much energy as many fine-tuning runs, with the exact multiple depending on dataset size. The number of training tokens is a reasonable heuristic for estimating fine-tuning energy use. The study also shows that sequence length has a stronger influence on energy intensity during fine-tuning than during inference.
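The pre-training vs. fine-tuning comparison reduces to a simple ratio; with purely hypothetical energy figures (the paper reports its own measured values), it looks like this:

```python
# Hypothetical numbers for illustration only, not the paper's measurements.
pretraining_energy_kwh = 1_000.0  # assumed one-time pre-training cost
finetuning_energy_kwh = 2.5       # assumed cost of one fine-tuning run

equivalent_runs = pretraining_energy_kwh / finetuning_energy_kwh
print(f"Pre-training ~ {equivalent_runs:.0f} fine-tuning runs")  # 400
```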
The authors emphasize the need to study fine-tuning energy efficiency separately from pre-training and inference workloads in NLP models. They hope that their findings will inform decision-making within and beyond the NLP community. The study concludes with limitations, such as the focus on specific tasks and architectures, and ethical considerations regarding the carbon emissions generated during the experiments.
Overall, this study provides valuable insights into the energy and carbon considerations of fine-tuning BERT models in NLP. It identifies the factors that drive fine-tuning energy requirements and underscores the value of studying fine-tuning energy efficiency in its own right. The recommendations can guide researchers and practitioners in improving the energy efficiency of their fine-tuning workflows.