Summary: Power Laws for Hyperparameter Optimization (arxiv.org)
9,295 words - PDF document
One Line
The paper proposes a new multi-fidelity strategy for hyperparameter optimization based on power law surrogates, with the resulting Deep Power Law method setting a new state of the art in HPO for deep learning by modeling optimization curves as simple power law functions.
Key Points
- The paper proposes the Deep Power Law (DPL) ensembles method for hyperparameter optimization (HPO) in machine learning, specifically in deep learning, achieving state-of-the-art results.
- DPL models optimization curves as simple power law functions (see the sketch after this list) and uses multi-fidelity methods such as successive halving and Hyperband to improve HPO efficiency.
- The proposed method exploits scaling laws to estimate performance and achieves better results than strong HPO baselines for Deep Learning (DL) models.
- The study compares the performance of various HPO methods, with DPL consistently outperforming others.
- The paper explores hyperparameter optimization for transformers in Large Language Models and presents analyses on the effectiveness of DPL for HPO.
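To make the core modeling idea concrete, here is a minimal sketch (not the paper's code) that fits a common three-parameter power law, ŷ(b) = α + β·b^(−γ), to the first epochs of a validation-loss curve and extrapolates it to the full budget; the exact parameterization and fitting procedure in the paper may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(b, alpha, beta, gamma):
    # Three-parameter power law: the loss flattens toward alpha as b grows.
    return alpha + beta * b ** (-gamma)

# Synthetic validation-loss curve observed for the first 10 epochs.
budgets = np.arange(1, 11).astype(float)
losses = 0.2 + 0.8 * budgets ** (-0.7) + np.random.normal(0, 0.01, 10)

# Fit the coefficients to the partial curve ...
coeffs, _ = curve_fit(power_law, budgets, losses, p0=(0.1, 1.0, 0.5), maxfev=10_000)

# ... and extrapolate to the full budget of 100 epochs.
print(f"forecast loss at epoch 100: {power_law(100.0, *coeffs):.3f}")
```

Extrapolating from a short prefix of the curve is what lets a multi-fidelity method stop unpromising configurations early.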
Summaries
322 word summary
This study compares the efficiency and exploration ability of various hyperparameter optimization (HPO) methods, with the Deep Power Laws (DPL) method consistently outperforming the alternatives. It investigates how efficiently DPL explores promising configurations and includes per-dataset performances of all methods. The paper presents an algorithm called Gray-box HPO with Deep Power Laws, which exploits scaling laws to estimate performance and achieves better results than strong HPO baselines for Deep Learning (DL) models. It also explores HPO for transformers in Large Language Models across two fidelity dimensions and presents three analyses, including one on the effectiveness of DPL for HPO in Large Language Models; the results show that DPL explores well and assigns budget only to configurations with lower regret than other methods, and the authors conclude that their hypothesis on the quality of the HPO results holds. The proposed multi-fidelity strategy uses the Expected Improvement acquisition function and a neural network that maps a configuration to the power law coefficients of its learning curve, and it performs well in most cases, achieving top performance on many configurations. The study also investigates learning curves that do not follow a power law pattern and proposes two ways of handling them. Overall, the empirical evidence suggests that the presented power law model can accurately forecast learning curves and improve HPO performance. By modeling optimization curves as simple power law functions and combining them with multi-fidelity methods such as successive halving and Hyperband, the DPL ensembles method achieves a new state of the art in HPO for deep learning and helps make HPO for deep learning a feasible reality.
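Because DPL is an ensemble, its forecast comes with an uncertainty estimate. A hedged sketch of turning several fitted power laws into a mean-and-spread prediction at the full budget (the number of members and the coefficient values below are hypothetical):

```python
import numpy as np

def ensemble_forecast(members, b_max):
    """Aggregate K fitted power laws, each given as (alpha, beta, gamma),
    into a mean/std forecast of the loss at the full budget b_max."""
    preds = np.array([a + b * b_max ** (-g) for a, b, g in members])
    return preds.mean(), preds.std()

# Five hypothetical members, e.g. fitted on bootstrapped partial curves.
members = [(0.21, 0.79, 0.68), (0.19, 0.82, 0.71), (0.22, 0.77, 0.66),
           (0.20, 0.80, 0.70), (0.18, 0.84, 0.73)]
mu, sigma = ensemble_forecast(members, b_max=100)
```

The mean and standard deviation are exactly the quantities an acquisition function such as Expected Improvement consumes (see the sketch further below).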
763 word summary
The paper proposes the Deep Power Law (DPL) ensembles method for hyperparameter optimization (HPO) in machine learning, specifically in deep learning. DPL models optimization curves as simple power law functions and achieves a new state of the art in HPO for deep learning. The paper discusses the use of multi-fidelity methods such as successive halving and Hyperband to improve HPO efficiency and the potential for DPL to make HPO for deep learning a feasible reality. Scaling laws and power law surrogates are used to conduct multi-fidelity HPO with Bayesian optimization; the prediction is typically based on the assumption that performance improves quickly at the beginning of training and then flattens towards the end. The proposed multi-fidelity strategy uses the Expected Improvement acquisition function and a neural network that maps a configuration to the power law coefficients of its learning curve. The paper describes the HPO benchmarks and the experimental protocol used to evaluate them, and compares various HPO methods, including the proposed one, in experiments on different datasets. The results show that the proposed method performs well in most cases, achieving top performance on many configurations. The study also investigates learning curves that do not follow a power law pattern and proposes two ways of handling them. Overall, the empirical evidence suggests that the presented power law model can accurately forecast learning curves and improve HPO performance. The experiments were run on a CPU cluster whose nodes have two Intel Xeon E5-2630v4 CPUs (20 CPU cores in total) running at 2.2 GHz. The document also explores HPO for transformers in Large Language Models: the authors used HPO to tune three learning rate hyperparameters and ablated the embedding size of the multi-head attention layers. They present three analyses, including one on the effectiveness of their method DPL for HPO in Large Language Models. The results show that their method explores well and assigns budget only to configurations with lower regret than other methods, and the authors conclude that their hypothesis on the quality of the HPO results holds.
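For context on the multi-fidelity machinery mentioned above, here is a bare-bones sketch of successive halving; the budgets, the elimination factor eta, and the evaluate signature are illustrative assumptions, not the paper's setup.

```python
import random

def successive_halving(configs, evaluate, min_budget=1, eta=3, rounds=3):
    """Evaluate all configs at a small budget, keep the best 1/eta,
    and repeat with eta-times more budget for each survivor."""
    budget = min_budget
    for _ in range(rounds):
        scores = {c: evaluate(c, budget) for c in configs}  # lower is better
        configs = sorted(configs, key=scores.get)[:max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

# Toy usage: 27 random "configurations" scored by a noisy synthetic objective
# whose noise shrinks as the budget grows, mimicking a maturing learning curve.
best = successive_halving(
    [random.uniform(0, 1) for _ in range(27)],
    evaluate=lambda c, b: (c - 0.3) ** 2 + random.gauss(0, 0.1 / b),
)
```

Hyperband runs several such brackets with different trade-offs between the number of starting configurations and the starting budget.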
The proposed method, Deep Power Law (DPL), exploits scaling laws to estimate performance and achieves better results than strong HPO baselines for Deep Learning (DL) models. It outperforms Random Search and BOHB, a rival gray-box HPO method, for small transformers. DPL discovers better configurations than the baselines in every proxy space with small embedding sizes, and the configurations it discovers on small search spaces achieve competitive results on full-scale transformers.
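This proxy-to-full-scale transfer relies on the fidelities ranking configurations similarly, which the paper measures with Pearson correlation (Table 3, per the longer summary below). A hedged sketch of that check with hypothetical numbers:

```python
import numpy as np

# Hypothetical validation losses of the same 8 configurations evaluated
# once in a small-embedding proxy space and once at full scale.
proxy_loss = np.array([0.91, 0.85, 0.88, 0.79, 0.95, 0.82, 0.87, 0.90])
full_loss = np.array([0.62, 0.55, 0.59, 0.51, 0.68, 0.54, 0.58, 0.63])

# A high Pearson correlation suggests the cheap proxy ranks configurations
# much like the expensive full-scale model, justifying HPO on the proxy.
r = np.corrcoef(proxy_loss, full_loss)[0, 1]
print(f"Pearson r between fidelities: {r:.2f}")
```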
The document includes lists of references cited in the paper and related to hyperparameter optimization in machine learning. The references cover topics such as Bayesian optimization, automatic hyperparameter tuning, and learning curves.
The paper presents an algorithm called Gray-box HPO with Deep Power Laws. The method evaluates initial configurations and budgets, computes the next budget and recommends the next configuration, and fits a DPL ensemble from the history H at each iteration, ultimately returning the hyperparameter configuration with the smallest validation loss (see the sketch below). For the Large Language Model study, hyperparameter optimization is carried out using power laws over two fidelity dimensions, the number of training steps and the model's embedding size (the nanoGPT-Bench setting), and the search space is designed so that even the most resource-intensive experiments stay within the limits of a single GPU day. The authors use an ensemble of surrogate models to optimize the cross-entropy loss for next-token prediction, and they apply DPL, BOHB, and random search to proxy tasks.

The paper further investigates the efficacy of Deep Power Laws in trading off exploration against exploitation in a continuous HPO search space, and it uses the Syne Tune library to interface with the TaskSet and LCBench benchmarks. The results show that DPL incurs only a minor time overhead while performing HPO and proves to be an efficient method for identifying optimal hyperparameters. The study compares the performance of various HPO methods, including Dragonfly, SMAC, MF-DNN, DEHB, ASHA, Hyperband, BOHB, LCNet, and Random Search, in terms of efficiency and ability to explore promising configurations; DPL consistently outperforms the others, although its exploration efficiency decreases when fewer partial observations of the learning curve are available. Per-dataset performances of all methods are included, along with a post-hoc analysis of DPL's efficiency. The benchmarks used are LCBench, TaskSet, and PD1, and the comparison includes multi-fidelity methods; the study applies a polynomial schedule and includes datasets with a learning curve length greater than 10.
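The loop structure described above can be sketched as follows; the surrogate and acquisition stand-ins here are deliberately simplistic placeholders for the DPL ensemble and Expected Improvement, so this illustrates the shape of the algorithm rather than the paper's implementation.

```python
import math
import random

def forecast(history, config):
    # Placeholder surrogate: mean observed loss per config, optimistic
    # (zero) for unseen configs. DPL would fit a power-law ensemble here.
    obs = [loss for c, _, loss in history if c == config]
    return sum(obs) / len(obs) if obs else 0.0

def graybox_hpo(candidates, evaluate, hpo_iterations=30):
    """Gray-box loop: evaluate an initial configuration for one budget
    unit, then repeatedly refit on the history H, recommend the next
    configuration, and extend its learning curve by one budget unit."""
    history = [(candidates[0], 1, evaluate(candidates[0], 1))]
    for _ in range(hpo_iterations):
        best = min(candidates, key=lambda c: forecast(history, c))
        budget = 1 + max((b for c, b, _ in history if c == best), default=0)
        history.append((best, budget, evaluate(best, budget)))
    return min(history, key=lambda t: t[2])[0]  # smallest validation loss

# Toy usage: tune a learning rate against a synthetic loss curve.
lrs = [10 ** random.uniform(-5, -2) for _ in range(20)]
best_lr = graybox_hpo(lrs, lambda lr, b: abs(math.log10(lr) + 3.5) + 1.0 / b)
```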
1953 word summary
The study compares the performance of various hyperparameter optimization (HPO) methods in terms of their efficiency and ability to explore promising configurations, and the DPL method consistently outperforms the other HPO methods. The study also investigates how efficiently DPL explores promising configurations, finding that this efficiency decreases when fewer partial observations of the learning curve are available. Per-dataset performances of all methods are included, together with a post-hoc analysis of DPL's efficiency. The study uses several benchmarks, including LCBench, TaskSet, and PD1, and its baselines, which include multi-fidelity methods, are Dragonfly, SMAC, MF-DNN, DEHB, ASHA, Hyperband, BOHB, LCNet, and Random Search. It applies a polynomial schedule and includes datasets with a learning curve length greater than 10. The Syne Tune library is used to interface with the TaskSet and LCBench benchmarks; the TaskSet benchmark consists of 1000 diverse tasks, but the study focuses on 12 NLP tasks. The results show that DPL has only a minor time overhead when performing HPO and proves to be an efficient method for identifying optimal hyperparameters, demonstrating a substantial speedup in anytime performance compared to the baseline algorithms.

The study further investigates the efficacy of Deep Power Laws (DPL) in trading off exploration against exploitation in a continuous HPO search space. The search space comprises about 10^4 potential configurations, and the experiment focuses on optimizing the two most critical hyperparameters, learning rate and weight decay, with ranges of [10^-5, 10^-2] for the learning rate and [0, 10^-1] for the weight decay, while keeping the remaining hyperparameters fixed as in the baseline model. The study employs EfficientNetV2 as the benchmarking model, trains it on the CIFAR10 dataset using the timm library, and runs HPO budgets ranging from 1 to 6 full function evaluations. The results show that DPL consistently outperforms the baselines in terms of the mean incumbent value and proves effective at trading off exploration against exploitation in a continuous HPO search space.

For the Large Language Model study, the paper runs DPL, BOHB, and random search on proxy tasks, identifying the oracle value with an absolute tolerance of 0.01. The proxy tasks are established by sampling the embedding size from a log scale (6, 12, ..., 96, 192), and the performance correlation between the different fidelities is reported as Pearson correlation in Table 3. The validation curves during model training are depicted in Figure 9, the distribution of GPU-hours required for training across different model fidelity values is visualized in Figure 11, and the scaling of model size in terms of bytes, FLOPS, and runtime, based on average values across all nanoGPT-Bench configurations, is shown in Figure 10. Hyperparameter optimization is performed using power laws over two fidelity dimensions, the number of training steps and the embedding size; the fidelity space is constructed according to Table 2, which details the warmup steps, the minimum and maximum learning rates, and the hyperparameters. The search space is designed so that even the most resource-intensive experiments stay within the limits of a single GPU day.
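A sketch of how such a two-dimensional fidelity space can be enumerated; the embedding sizes follow the log-scale grid quoted above (the doubling pattern between the quoted endpoints is an inference), and the step grid is an illustrative assumption.

```python
from itertools import product

# Embedding sizes on the log scale quoted above (doubling pattern inferred).
embedding_sizes = [6, 12, 24, 48, 96, 192]
# Step budgets are an assumption; the summary only fixes 350 steps in total.
step_budgets = list(range(50, 351, 50))

# Every (embedding size, training steps) pair is one fidelity at which a
# hyperparameter configuration can be partially evaluated.
fidelities = list(product(embedding_sizes, step_budgets))
print(f"{len(fidelities)} fidelity levels")
```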
The training process involves 350 steps, with each step encompassing 1000 random samples at a batch size of 12. The authors use an ensemble of surrogate models to optimize the cross-entropy loss for next-token prediction. They consider different formulations for the power law functions used as their surrogate, start from an initial history of one randomly sampled hyperparameter configuration evaluated for one epoch for both DPL and every baseline, and continuously refine the model for 20 epochs at every HPO iteration. The surrogate is a 2-layer feedforward neural network with 128 units per layer and Leaky ReLU as the non-linearity; it has 3 output units, which are combined with the budget b to yield the power law output, and a GLU activation is applied only to two of the three power-law coefficient outputs (a sketch of such a network follows below). The Gray-box HPO with Deep Power Laws algorithm evaluates initial configurations and budgets, computes the next budget and recommends the next configuration, fits a DPL ensemble from the history H, and returns the hyperparameter configuration with the smallest validation loss.

The document's reference list covers hyperparameter optimization in machine learning, including Bayesian methods, bandit-based approaches, deep neural networks, techniques for optimizing large numbers of hyperparameters, estimating predictive uncertainty, automatic hyperparameter tuning, and learning curves; it spans conference proceedings, technical reports, open-source corpora, and research papers.

The document introduces Deep Power Law (DPL), a probabilistic surrogate based on an ensemble of power law functions, for hyperparameter optimization (HPO) in deep learning (DL) models. The proposed method exploits scaling laws to estimate performance and achieves better results than strong HPO baselines for DL. DPL is tested against 7 baselines on 59 datasets and diverse search spaces for DL architectures. It outperforms Random Search and BOHB, a rival gray-box HPO method, for small transformers; it discovers better configurations than the baselines in every proxy space with small embedding sizes, and the configurations it discovers on small search spaces achieve competitive results on full-scale transformers. The document also presents a study on HPO for transformers in Large Language Models: the authors conducted experiments on a smaller GPT-2 model and then applied the findings to a larger transformer model on the OpenWebText dataset, using HPO to tune three learning rate hyperparameters and ablating the embedding size of the multi-head attention layers. They present three analyses, including one on the effectiveness of DPL for HPO in Large Language Models; the results show that their method explores well and assigns budget only to configurations with lower regret than other methods, and the authors conclude that their hypothesis on the quality of the HPO results holds.
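Here is a minimal PyTorch sketch of the coefficient network described above (2 hidden layers, 128 units, Leaky ReLU, 3 outputs combined with the budget b); the power law parameterization and the choice of output activations are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DPLSurrogate(nn.Module):
    """Maps a hyperparameter configuration to three power law
    coefficients and evaluates the implied curve at a budget b."""

    def __init__(self, n_hparams: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_hparams, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
        )
        self.head = nn.Linear(hidden, 3)  # -> alpha, beta, gamma

    def forward(self, config: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        alpha, beta, gamma = self.head(self.body(config)).unbind(dim=-1)
        # Assumed parameterization: loss(b) = alpha + beta * b^(-gamma).
        # Softplus keeps gamma positive so the curve flattens with budget;
        # the paper instead applies a GLU-style activation to two outputs.
        return alpha + beta * budget.pow(-nn.functional.softplus(gamma))

# One forward pass: a batch of 4 configs with 3 hyperparameters each.
net = DPLSurrogate(n_hparams=3)
pred = net(torch.rand(4, 3), budget=torch.full((4,), 10.0))
```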
The document analyzes hyperparameter optimization (HPO) under a power law assumption and provides empirical evidence that power law surrogates lead to state-of-the-art HPO results. The study includes experiments on multiple datasets and benchmarks, comparing the proposed method (DPL) with baselines such as SMAC, Random Search, DEHB, BOHB, ASHA, and Dragonfly, and the results show that DPL performs well in most cases, achieving top performance on many configurations. The study also investigates learning curves that do not follow a power law pattern and proposes two ways of handling them. Overall, the empirical evidence suggests that the presented power law model can accurately forecast learning curves, validating the hypothesis that the power law assumption improves HPO performance. The experiments were run on a CPU cluster whose nodes have two Intel Xeon E5-2630v4 CPUs (20 CPU cores in total) running at 2.2 GHz.

In the experiments, the various HPO methods, including DEHB, SMAC, ASHA, and Dragonfly, start with a history of one randomly sampled hyperparameter configuration evaluated for one step/epoch. The learning curve length is fixed for LCBench and TaskSet but varies for PD1, and results are reported up to the time it took Random Search to evaluate 20 hyperparameter configurations. The regret is the difference in evaluation metric between the best hyperparameter configuration found during optimization and the best possible hyperparameter configuration (the oracle); the evaluation metric differs between benchmarks, with LCBench using balanced accuracy, TaskSet using loss, and PD1 using accuracy. The benchmarks feature different optimization tasks evaluated on various search spaces, including NLP tasks, deep learning benchmarks, and statistical modeling corpora. The protocol standardizes hyperparameter values by min-max scaling and uses BOCA (via the Dragonfly library), SMAC, Hyperband, DEHB, and ASHA as baselines. The paper provides detailed pseudocode of the proposed method and describes how new configurations with no learning curve evaluations are handled.

The paper proposes a novel multi-fidelity strategy for hyperparameter optimization using power law surrogates. The method employs the Expected Improvement acquisition function with the posterior mean and variance estimated at the full budget, so the acquisition incorporates both the mean and the uncertainty of the predictions and trades off exploration against exploitation (see the sketch below). The power law surrogate is trained on a history of learning curve evaluations, using a parametric neural network that maps a configuration to the power law coefficients of its learning curve. The performance of machine learning methods is assumed to follow a power law relationship between the validation loss and the number of optimization epochs. The term budget refers to a learning curve step, and the evaluation of a configuration for a budget is defined as f(λ, b) : Λ × B → ℝ⁺, where B = (0, b_max]. Hyperparameter optimization then amounts to finding the optimal configuration of a machine learning method, via an HPO policy that is parameterized and learned to minimize the validation loss.
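The Expected Improvement step above is standard; here is a sketch for the minimization setting, consuming the ensemble's mean and standard deviation at the full budget (the closed form below is the textbook EI formula, not code from the paper):

```python
import math

def expected_improvement(mu, sigma, best_loss):
    """EI for minimization: expected amount by which the forecast loss at
    the full budget improves on the incumbent best_loss. A large mean
    improvement or a large uncertainty both increase the score, which is
    the exploration/exploitation trade-off."""
    if sigma <= 0.0:
        return max(best_loss - mu, 0.0)
    z = (best_loss - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best_loss - mu) * cdf + sigma * pdf

# e.g. an ensemble forecast of mu=0.21, sigma=0.02 against an incumbent of 0.24
score = expected_improvement(0.21, 0.02, 0.24)
```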
Bayesian optimization is the most popular type of policy for HPO due to its ability to balance the exploration and exploitation aspects of minimizing the validation loss. A small-scale model can be used to transfer hyperparameters to a large-scale version. Scaling laws describe the performance of deep learning models as a power law function of dataset size or model size, and power law surrogates can be fit to conduct multi-fidelity HPO with Bayesian optimization. Learning curve prediction algorithms can be combined with successive halving to predict configuration performance; another approach is to take learning curves from already evaluated configurations and find an affine transformation that yields a well-matched learning curve. Such predictions are typically based on the assumption that performance increases at the beginning of training and then flattens towards the end.

The paper proposes a probabilistic surrogate method for hyperparameter optimization (HPO) called Deep Power Law (DPL) ensembles, which models optimization curves as simple power law functions. The method is tested against seven strong HPO baselines on 59 datasets from three diverse modalities (tabular, image, and natural language processing), achieving a new state of the art in HPO for deep learning. The paper also introduces a mechanism for combining DPL with Bayesian optimization and discusses the potential for DPL to make HPO for deep learning a feasible reality. The authors highlight the power law assumption on learning curves and the use of multi-fidelity methods such as successive halving and Hyperband, which have been shown to improve HPO efficiency; they also discuss related work in HPO for deep learning and present a large-scale experimental protocol.

Hyperparameter optimization is a major challenge for machine learning, particularly for Deep Learning (DL) methods, due to their high training costs. Recently, gray-box HPO (multi-fidelity HPO) has emerged as a promising paradigm for HPO in DL. In this work, the authors propose Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern. Their method dynamically decides which configurations to pause and which to train incrementally by making use of these gray-box predictions, and it achieves the best any-time results across all benchmarks compared to all competitors. The authors compare their method against 7 state-of-the-art competitors on 3 benchmarks covering tabular, image, and NLP datasets and 59 diverse tasks. Their method focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance and exploits the observed performance of all types of hyperparameter configurations.