Summary: Power Laws for Hyperparameter Optimization (arxiv.org)
9,295 words - PDF document
One Line
The paper proposes a new multi-fidelity strategy for hyperparameter optimization based on power law surrogates, with the resulting Deep Power Law method setting a new state of the art in HPO for deep learning by modeling optimization curves as simple power law functions.
Key Points
- The paper proposes the Deep Power Law (DPL) ensembles method for hyperparameter optimization (HPO) in machine learning, specifically in deep learning, achieving state-of-the-art results.
- DPL models optimization curves as simple power law functions (see the sketch after this list) and uses multi-fidelity methods such as successive halving and Hyperband to improve HPO efficiency.
- The proposed method exploits scaling laws to estimate performance and achieves better results than strong HPO baselines for Deep Learning (DL) models.
- The study compares the performance of various HPO methods, with DPL consistently outperforming others.
- The paper explores hyperparameter optimization for transformers in Large Language Models and presents analyses on the effectiveness of DPL for HPO.
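To make the core modeling idea concrete, here is a minimal sketch (not the paper's code) that fits a common three-parameter power law, ŷ(b) = α + β·b^(−γ), to the first epochs of a validation-loss curve and extrapolates it to the full budget; the exact parameterization and fitting procedure in the paper may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(b, alpha, beta, gamma):
    # Three-parameter power law: the loss flattens toward alpha as b grows.
    return alpha + beta * b ** (-gamma)

# Synthetic validation-loss curve observed for the first 10 epochs.
budgets = np.arange(1, 11).astype(float)
losses = 0.2 + 0.8 * budgets ** (-0.7) + np.random.normal(0, 0.01, 10)

# Fit the coefficients to the partial curve ...
coeffs, _ = curve_fit(power_law, budgets, losses, p0=(0.1, 1.0, 0.5), maxfev=10_000)

# ... and extrapolate to the full budget of 100 epochs.
print(f"forecast loss at epoch 100: {power_law(100.0, *coeffs):.3f}")
```

Extrapolating from a short prefix of the curve is what lets a multi-fidelity method stop unpromising configurations early.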
Summaries
322 word summary
This study compares the efficiency and exploration ability of various hyperparameter optimization (HPO) methods, with the Deep Power Laws (DPL) method consistently outperforming the alternatives. It investigates how efficiently DPL explores promising configurations and includes per-dataset performances of all methods. The paper presents an algorithm called Gray-box HPO with Deep Power Laws, which exploits scaling laws to estimate performance and achieves better results than strong HPO baselines for Deep Learning (DL) models. It also explores HPO for transformers in Large Language Models across two fidelity dimensions and presents three analyses, including one on the effectiveness of DPL for HPO in Large Language Models; the results show that DPL explores well and assigns budget only to configurations with lower regret than other methods, and the authors conclude that their hypothesis on the quality of the HPO results holds. The proposed multi-fidelity strategy uses the Expected Improvement acquisition function and a neural network that maps a configuration to the power law coefficients of its learning curve, and it performs well in most cases, achieving top performance on many configurations. The study also investigates learning curves that do not follow a power law pattern and proposes two ways of handling them. Overall, the empirical evidence suggests that the presented power law model can accurately forecast learning curves and improve HPO performance. By modeling optimization curves as simple power law functions and combining them with multi-fidelity methods such as successive halving and Hyperband, the DPL ensembles method achieves a new state of the art in HPO for deep learning and helps make HPO for deep learning a feasible reality.
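Because DPL is an ensemble, its forecast comes with an uncertainty estimate. A hedged sketch of turning several fitted power laws into a mean-and-spread prediction at the full budget (the number of members and the coefficient values below are hypothetical):

```python
import numpy as np

def ensemble_forecast(members, b_max):
    """Aggregate K fitted power laws, each given as (alpha, beta, gamma),
    into a mean/std forecast of the loss at the full budget b_max."""
    preds = np.array([a + b * b_max ** (-g) for a, b, g in members])
    return preds.mean(), preds.std()

# Five hypothetical members, e.g. fitted on bootstrapped partial curves.
members = [(0.21, 0.79, 0.68), (0.19, 0.82, 0.71), (0.22, 0.77, 0.66),
           (0.20, 0.80, 0.70), (0.18, 0.84, 0.73)]
mu, sigma = ensemble_forecast(members, b_max=100)
```

The mean and standard deviation are exactly the quantities an acquisition function such as Expected Improvement consumes (see the sketch further below).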
763 word summary
The paper proposes the Deep Power Law (DPL) ensembles method for hyperparameter optimization (HPO) in machine learning, specifically in deep learning. DPL models optimization curves as simple power law functions and achieves a new state of the art in HPO for deep learning. The paper discusses the use of multi-fidelity methods such as successive halving and Hyperband to improve HPO efficiency and the potential for DPL to make HPO for deep learning a feasible reality. Scaling laws and power law surrogates are used to conduct multi-fidelity HPO with Bayesian optimization; the prediction is typically based on the assumption that performance improves quickly at the beginning of training and then flattens towards the end. The proposed multi-fidelity strategy uses the Expected Improvement acquisition function and a neural network that maps a configuration to the power law coefficients of its learning curve. The paper describes the HPO benchmarks and the experimental protocol used to evaluate them, and compares various HPO methods, including the proposed one, in experiments on different datasets. The results show that the proposed method performs well in most cases, achieving top performance on many configurations. The study also investigates learning curves that do not follow a power law pattern and proposes two ways of handling them. Overall, the empirical evidence suggests that the presented power law model can accurately forecast learning curves and improve HPO performance. The experiments were run on a CPU cluster whose nodes have two Intel Xeon E5-2630v4 CPUs (20 CPU cores in total) running at 2.2 GHz. The document also explores HPO for transformers in Large Language Models: the authors used HPO to tune three learning rate hyperparameters and ablated the embedding size of the multi-head attention layers. They present three analyses, including one on the effectiveness of their method DPL for HPO in Large Language Models. The results show that their method explores well and assigns budget only to configurations with lower regret than other methods, and the authors conclude that their hypothesis on the quality of the HPO results holds.
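For context on the multi-fidelity machinery mentioned above, here is a bare-bones sketch of successive halving; the budgets, the elimination factor eta, and the evaluate signature are illustrative assumptions, not the paper's setup.

```python
import random

def successive_halving(configs, evaluate, min_budget=1, eta=3, rounds=3):
    """Evaluate all configs at a small budget, keep the best 1/eta,
    and repeat with eta-times more budget for each survivor."""
    budget = min_budget
    for _ in range(rounds):
        scores = {c: evaluate(c, budget) for c in configs}  # lower is better
        configs = sorted(configs, key=scores.get)[:max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

# Toy usage: 27 random "configurations" scored by a noisy synthetic objective
# whose noise shrinks as the budget grows, mimicking a maturing learning curve.
best = successive_halving(
    [random.uniform(0, 1) for _ in range(27)],
    evaluate=lambda c, b: (c - 0.3) ** 2 + random.gauss(0, 0.1 / b),
)
```

Hyperband runs several such brackets with different trade-offs between the number of starting configurations and the starting budget.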
The proposed method, Deep Power Law (DPL), exploits scaling laws to estimate performance and achieves better results than strong HPO baselines for Deep Learning (DL) models. It outperforms Random Search and BOHB, a rival gray-box HPO method, for small transformers. DPL discovers better configurations than the baselines in every proxy space with small embedding sizes, and the configurations it discovers on small search spaces achieve competitive results on full-scale transformers.
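This proxy-to-full-scale transfer relies on the fidelities ranking configurations similarly, which the paper measures with Pearson correlation (Table 3, per the longer summary below). A hedged sketch of that check with hypothetical numbers:

```python
import numpy as np

# Hypothetical validation losses of the same 8 configurations evaluated
# once in a small-embedding proxy space and once at full scale.
proxy_loss = np.array([0.91, 0.85, 0.88, 0.79, 0.95, 0.82, 0.87, 0.90])
full_loss = np.array([0.62, 0.55, 0.59, 0.51, 0.68, 0.54, 0.58, 0.63])

# A high Pearson correlation suggests the cheap proxy ranks configurations
# much like the expensive full-scale model, justifying HPO on the proxy.
r = np.corrcoef(proxy_loss, full_loss)[0, 1]
print(f"Pearson r between fidelities: {r:.2f}")
```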
The document includes lists of references cited in the paper and related to hyperparameter optimization in machine learning. The references cover topics such as Bayesian optimization, automatic hyperparameter tuning, and learning curves.
The paper presents an algorithm called Gray-box HPO with Deep Power Laws. The method evaluates initial configurations and budgets, computes the next budget and recommends the next configuration, and fits a DPL ensemble from the history H at each iteration, ultimately returning the hyperparameter configuration with the smallest validation loss (see the sketch below). For the Large Language Model study, hyperparameter optimization is carried out using power laws over two fidelity dimensions, the number of training steps and the model's embedding size (the nanoGPT-Bench setting), and the search space is designed so that even the most resource-intensive experiments stay within the limits of a single GPU day. The authors use an ensemble of surrogate models to optimize the cross-entropy loss for next-token prediction, and they apply DPL, BOHB, and random search to proxy tasks.

The paper further investigates the efficacy of Deep Power Laws in trading off exploration against exploitation in a continuous HPO search space, and it uses the Syne Tune library to interface with the TaskSet and LCBench benchmarks. The results show that DPL incurs only a minor time overhead while performing HPO and proves to be an efficient method for identifying optimal hyperparameters. The study compares the performance of various HPO methods, including Dragonfly, SMAC, MF-DNN, DEHB, ASHA, Hyperband, BOHB, LCNet, and Random Search, in terms of efficiency and ability to explore promising configurations; DPL consistently outperforms the others, although its exploration efficiency decreases when fewer partial observations of the learning curve are available. Per-dataset performances of all methods are included, along with a post-hoc analysis of DPL's efficiency. The benchmarks used are LCBench, TaskSet, and PD1, and the comparison includes multi-fidelity methods; the study applies a polynomial schedule and includes datasets with a learning curve length greater than 10.
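The loop structure described above can be sketched as follows; the surrogate and acquisition stand-ins here are deliberately simplistic placeholders for the DPL ensemble and Expected Improvement, so this illustrates the shape of the algorithm rather than the paper's implementation.

```python
import math
import random

def forecast(history, config):
    # Placeholder surrogate: mean observed loss per config, optimistic
    # (zero) for unseen configs. DPL would fit a power-law ensemble here.
    obs = [loss for c, _, loss in history if c == config]
    return sum(obs) / len(obs) if obs else 0.0

def graybox_hpo(candidates, evaluate, hpo_iterations=30):
    """Gray-box loop: evaluate an initial configuration for one budget
    unit, then repeatedly refit on the history H, recommend the next
    configuration, and extend its learning curve by one budget unit."""
    history = [(candidates[0], 1, evaluate(candidates[0], 1))]
    for _ in range(hpo_iterations):
        best = min(candidates, key=lambda c: forecast(history, c))
        budget = 1 + max((b for c, b, _ in history if c == best), default=0)
        history.append((best, budget, evaluate(best, budget)))
    return min(history, key=lambda t: t[2])[0]  # smallest validation loss

# Toy usage: tune a learning rate against a synthetic loss curve.
lrs = [10 ** random.uniform(-5, -2) for _ in range(20)]
best_lr = graybox_hpo(lrs, lambda lr, b: abs(math.log10(lr) + 3.5) + 1.0 / b)
```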
1953 word summary
The study compares the performance of various hyperparameter optimization (HPO) methods in terms of their efficiency and ability to explore promising configurations, and the DPL method consistently outperforms the other HPO methods. The study also investigates how efficiently DPL explores promising configurations, finding that this efficiency decreases when fewer partial observations of the learning curve are available. Per-dataset performances of all methods are included, together with a post-hoc analysis of DPL's efficiency. The study uses several benchmarks, including LCBench, TaskSet, and PD1, and its baselines, which include multi-fidelity methods, are Dragonfly, SMAC, MF-DNN, DEHB, ASHA, Hyperband, BOHB, LCNet, and Random Search. It applies a polynomial schedule and includes datasets with a learning curve length greater than 10. The Syne Tune library is used to interface with the TaskSet and LCBench benchmarks; the TaskSet benchmark consists of 1000 diverse tasks, but the study focuses on 12 NLP tasks. The results show that DPL has only a minor time overhead when performing HPO and proves to be an efficient method for identifying optimal hyperparameters, demonstrating a substantial speedup in anytime performance compared to the baseline algorithms.

The study further investigates the efficacy of Deep Power Laws (DPL) in trading off exploration against exploitation in a continuous HPO search space. The search space comprises about 10^4 potential configurations, and the experiment focuses on optimizing the two most critical hyperparameters, learning rate and weight decay, with ranges of [10^-5, 10^-2] for the learning rate and [0, 10^-1] for the weight decay, while keeping the remaining hyperparameters fixed as in the baseline model. The study employs EfficientNetV2 as the benchmarking model, trains it on the CIFAR10 dataset using the timm library, and runs HPO budgets ranging from 1 to 6 full function evaluations. The results show that DPL consistently outperforms the baselines in terms of the mean incumbent value and proves effective at trading off exploration against exploitation in a continuous HPO search space.

For the Large Language Model study, the paper runs DPL, BOHB, and random search on proxy tasks, identifying the oracle value with an absolute tolerance of 0.01. The proxy tasks are established by sampling the embedding size from a log scale (6, 12, ..., 96, 192), and the performance correlation between the different fidelities is reported as Pearson correlation in Table 3. The validation curves during model training are depicted in Figure 9, the distribution of GPU-hours required for training across different model fidelity values is visualized in Figure 11, and the scaling of model size in terms of bytes, FLOPS, and runtime, based on average values across all nanoGPT-Bench configurations, is shown in Figure 10. Hyperparameter optimization is performed using power laws over two fidelity dimensions, the number of training steps and the embedding size; the fidelity space is constructed according to Table 2, which details the warmup steps, the minimum and maximum learning rates, and the hyperparameters. The search space is designed so that even the most resource-intensive experiments stay within the limits of a single GPU day.
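A sketch of how such a two-dimensional fidelity space can be enumerated; the embedding sizes follow the log-scale grid quoted above (the doubling pattern between the quoted endpoints is an inference), and the step grid is an illustrative assumption.

```python
from itertools import product

# Embedding sizes on the log scale quoted above (doubling pattern inferred).
embedding_sizes = [6, 12, 24, 48, 96, 192]
# Step budgets are an assumption; the summary only fixes 350 steps in total.
step_budgets = list(range(50, 351, 50))

# Every (embedding size, training steps) pair is one fidelity at which a
# hyperparameter configuration can be partially evaluated.
fidelities = list(product(embedding_sizes, step_budgets))
print(f"{len(fidelities)} fidelity levels")
```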
The training process involves 350 steps, with each step encompassing 1000 random samples at a batch size of 12. The authors use an ensemble of surrogate models to optimize the cross-entropy loss for next-token prediction. They consider different formulations for the power law functions used as their surrogate, start from an initial history of one randomly sampled hyperparameter configuration evaluated for one epoch for both DPL and every baseline, and continuously refine the model for 20 epochs at every HPO iteration. The surrogate is a 2-layer feedforward neural network with 128 units per layer and Leaky ReLU as the non-linearity; it has 3 output units, which are combined with the budget b to yield the power law output, and a GLU activation is applied only to two of the three power-law coefficient outputs (a sketch of such a network follows below). The Gray-box HPO with Deep Power Laws algorithm evaluates initial configurations and budgets, computes the next budget and recommends the next configuration, fits a DPL ensemble from the history H, and returns the hyperparameter configuration with the smallest validation loss.

The document's reference list covers hyperparameter optimization in machine learning, including Bayesian methods, bandit-based approaches, deep neural networks, techniques for optimizing large numbers of hyperparameters, estimating predictive uncertainty, automatic hyperparameter tuning, and learning curves; it spans conference proceedings, technical reports, open-source corpora, and research papers.

The document introduces Deep Power Law (DPL), a probabilistic surrogate based on an ensemble of power law functions, for hyperparameter optimization (HPO) in deep learning (DL) models. The proposed method exploits scaling laws to estimate performance and achieves better results than strong HPO baselines for DL. DPL is tested against 7 baselines on 59 datasets and diverse search spaces for DL architectures. It outperforms Random Search and BOHB, a rival gray-box HPO method, for small transformers; it discovers better configurations than the baselines in every proxy space with small embedding sizes, and the configurations it discovers on small search spaces achieve competitive results on full-scale transformers. The document also presents a study on HPO for transformers in Large Language Models: the authors conducted experiments on a smaller GPT-2 model and then applied the findings to a larger transformer model on the OpenWebText dataset, using HPO to tune three learning rate hyperparameters and ablating the embedding size of the multi-head attention layers. They present three analyses, including one on the effectiveness of DPL for HPO in Large Language Models; the results show that their method explores well and assigns budget only to configurations with lower regret than other methods, and the authors conclude that their hypothesis on the quality of the HPO results holds.
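Here is a minimal PyTorch sketch of the coefficient network described above (2 hidden layers, 128 units, Leaky ReLU, 3 outputs combined with the budget b); the power law parameterization and the choice of output activations are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DPLSurrogate(nn.Module):
    """Maps a hyperparameter configuration to three power law
    coefficients and evaluates the implied curve at a budget b."""

    def __init__(self, n_hparams: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_hparams, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
        )
        self.head = nn.Linear(hidden, 3)  # -> alpha, beta, gamma

    def forward(self, config: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        alpha, beta, gamma = self.head(self.body(config)).unbind(dim=-1)
        # Assumed parameterization: loss(b) = alpha + beta * b^(-gamma).
        # Softplus keeps gamma positive so the curve flattens with budget;
        # the paper instead applies a GLU-style activation to two outputs.
        return alpha + beta * budget.pow(-nn.functional.softplus(gamma))

# One forward pass: a batch of 4 configs with 3 hyperparameters each.
net = DPLSurrogate(n_hparams=3)
pred = net(torch.rand(4, 3), budget=torch.full((4,), 10.0))
```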
The document analyzes hyperparameter optimization (HPO) under a power law assumption and provides empirical evidence that power law surrogates lead to state-of-the-art HPO results. The study includes experiments on multiple datasets and benchmarks, comparing the proposed method (DPL) with baselines such as SMAC, Random Search, DEHB, BOHB, ASHA, and Dragonfly, and the results show that DPL performs well in most cases, achieving top performance on many configurations. The study also investigates learning curves that do not follow a power law pattern and proposes two ways of handling them. Overall, the empirical evidence suggests that the presented power law model can accurately forecast learning curves, validating the hypothesis that the power law assumption improves HPO performance. The experiments were run on a CPU cluster whose nodes have two Intel Xeon E5-2630v4 CPUs (20 CPU cores in total) running at 2.2 GHz.

In the experiments, the various HPO methods, including DEHB, SMAC, ASHA, and Dragonfly, start with a history of one randomly sampled hyperparameter configuration evaluated for one step/epoch. The learning curve length is fixed for LCBench and TaskSet but varies for PD1, and results are reported up to the time it took Random Search to evaluate 20 hyperparameter configurations. The regret is the difference in evaluation metric between the best hyperparameter configuration found during optimization and the best possible hyperparameter configuration (the oracle); the evaluation metric differs between benchmarks, with LCBench using balanced accuracy, TaskSet using loss, and PD1 using accuracy. The benchmarks feature different optimization tasks evaluated on various search spaces, including NLP tasks, deep learning benchmarks, and statistical modeling corpora. The protocol standardizes hyperparameter values by min-max scaling and uses BOCA (via the Dragonfly library), SMAC, Hyperband, DEHB, and ASHA as baselines. The paper provides detailed pseudocode of the proposed method and describes how new configurations with no learning curve evaluations are handled.

The paper proposes a novel multi-fidelity strategy for hyperparameter optimization using power law surrogates. The method employs the Expected Improvement acquisition function with the posterior mean and variance estimated at the full budget, so the acquisition incorporates both the mean and the uncertainty of the predictions and trades off exploration against exploitation (see the sketch below). The power law surrogate is trained on a history of learning curve evaluations, using a parametric neural network that maps a configuration to the power law coefficients of its learning curve. The performance of machine learning methods is assumed to follow a power law relationship between the validation loss and the number of optimization epochs. The term budget refers to a learning curve step, and the evaluation of a configuration for a budget is defined as f(λ, b) : Λ × B → ℝ⁺, where B = (0, b_max]. Hyperparameter optimization then amounts to finding the optimal configuration of a machine learning method, via an HPO policy that is parameterized and learned to minimize the validation loss.
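The Expected Improvement step above is standard; here is a sketch for the minimization setting, consuming the ensemble's mean and standard deviation at the full budget (the closed form below is the textbook EI formula, not code from the paper):

```python
import math

def expected_improvement(mu, sigma, best_loss):
    """EI for minimization: expected amount by which the forecast loss at
    the full budget improves on the incumbent best_loss. A large mean
    improvement or a large uncertainty both increase the score, which is
    the exploration/exploitation trade-off."""
    if sigma <= 0.0:
        return max(best_loss - mu, 0.0)
    z = (best_loss - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best_loss - mu) * cdf + sigma * pdf

# e.g. an ensemble forecast of mu=0.21, sigma=0.02 against an incumbent of 0.24
score = expected_improvement(0.21, 0.02, 0.24)
```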
Bayesian optimization is the most popular type of policy for HPO due to its ability to balance the exploration and exploitation aspects of minimizing the validation loss. A small-scale model can be used to transfer hyperparameters to a large-scale version. Scaling laws describe the performance of deep learning models as a power law function of dataset size or model size, and power law surrogates can be fit to conduct multi-fidelity HPO with Bayesian optimization. Learning curve prediction algorithms can be combined with successive halving to predict configuration performance; another approach is to take learning curves from already evaluated configurations and find an affine transformation that yields a well-matched learning curve. Such predictions are typically based on the assumption that performance increases at the beginning of training and then flattens towards the end.

The paper proposes a probabilistic surrogate method for hyperparameter optimization (HPO) called Deep Power Law (DPL) ensembles, which models optimization curves as simple power law functions. The method is tested against seven strong HPO baselines on 59 datasets from three diverse modalities (tabular, image, and natural language processing), achieving a new state of the art in HPO for deep learning. The paper also introduces a mechanism for combining DPL with Bayesian optimization and discusses the potential for DPL to make HPO for deep learning a feasible reality. The authors highlight the power law assumption on learning curves and the use of multi-fidelity methods such as successive halving and Hyperband, which have been shown to improve HPO efficiency; they also discuss related work in HPO for deep learning and present a large-scale experimental protocol.

Hyperparameter optimization is a major challenge for machine learning, particularly for Deep Learning (DL) methods, due to their high training costs. Recently, gray-box HPO (multi-fidelity HPO) has emerged as a promising paradigm for HPO in DL. In this work, the authors propose Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern. Their method dynamically decides which configurations to pause and which to train incrementally by making use of these gray-box predictions, and it achieves the best any-time results across all benchmarks compared to all competitors. The authors compare their method against 7 state-of-the-art competitors on 3 benchmarks covering tabular, image, and NLP datasets and 59 diverse tasks. Their method focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance and exploits the observed performance of all types of hyperparameter configurations.