Summary Utilizing Deep Learning for Automated Database Tuning arxiv.org
4,861 words - PDF document
One Line
This article presents an automated, machine-learning-based solution for managing database system configurations, extending OtterTune with techniques such as GMM clustering and ensemble models to improve latency prediction in automated DBMS tuning.
Key Points
- The article discusses the challenges of managing database system configurations and the lack of standardization among configuration knobs.
- The authors propose an automated solution that utilizes supervised and unsupervised machine learning techniques to identify influential knobs, analyze unseen workloads, and provide recommendations for optimal knob settings.
- The effectiveness of the proposed approach is demonstrated through the evaluation of a tool called OtterTune on three different database management systems (DBMSs).
- The authors extend the automated technique introduced in the original OtterTune paper by utilizing previously collected training data to optimize new DBMS deployments and improve latency prediction.
- The article highlights the complexity of DBMSs and the multitude of configuration knobs that impact performance and scalability, leading to the need for automatic tuning tools.
- The authors propose a new approach that reuses training data from previous tuning sessions to optimize DBMS performance for new applications, reducing the time and resources needed for optimization.
- The main objective of the authors' work is to extend OtterTune and propose novel machine learning models from previously collected data to prune redundant metrics, map unseen workloads, and improve latency prediction.
- The experiment discussed in the document aimed to automate the tuning of database management system (DBMS) configurations using deep learning techniques.
Summaries
107 word summary
This article proposes an automated solution for managing database system configurations using machine learning techniques. The authors evaluate the effectiveness of their approach using a tool called OtterTune on three different database management systems. They extend the technique by incorporating GMM clustering and ensemble models with non-linear models to improve latency prediction. The authors' work aims to prune redundant metrics, map unseen workloads to previous workloads, and improve latency prediction. The article provides an overview of the system architecture and highlights the limitations of previous automated DBMS tuning methods. The experiment shows that both neural networks and GMM clustering can improve the performance of automated DBMS tuning.
363 word summary
This article discusses the challenges of managing database system configurations and proposes an automated solution that utilizes supervised and unsupervised machine learning techniques. The authors evaluate the effectiveness of their approach using a tool called OtterTune on three different database management systems (DBMSs). The results show that OtterTune's recommendations are comparable to or even surpass configurations generated by existing tools or human experts.
The authors extend the automated technique introduced in the original OtterTune paper by utilizing previously collected training data to optimize new DBMS deployments. They focus on improving latency prediction by incorporating GMM clustering and combining ensemble models with non-linear models.
Most existing tools for DBMS tuning have limitations, such as being designed for specific DBMSs or requiring manual steps. The authors propose a new approach that reuses training data from previous tuning sessions to optimize DBMS performance for new applications. This approach reduces the time and resources needed for optimization and achieves significant improvements in latency compared to default settings or other tuning advisors.
The main objective of the authors' work is to extend OtterTune and propose novel machine learning models from previously collected data to prune redundant metrics, map unseen workloads to previous workloads, and improve latency prediction.
The article also provides an overview of the entire system architecture, including important steps such as metrics pruning and automated tuning through workload mapping. The authors discuss the limitations of previous automated DBMS tuning methods and highlight the drawbacks of focusing on individual DBMS instances.
The experiment aimed to automate the tuning of DBMS configurations using deep learning techniques. The researchers used a combination of different models and algorithms to optimize the performance of the DBMS.
The first part of the experiment focused on tuning the hyperparameters of the Gaussian Process Regression (GPR) model. The results showed that replacing K-means clustering with Gaussian Mixture Model (GMM) clustering slightly improved the performance of the model. The random forest algorithm did not perform as well as GPR, while the neural network-based model showed even better performance, capturing complex relationships and achieving lower MSE values.
Overall, the experiments demonstrated that both neural networks and GMM clustering can improve the performance of automated DBMS tuning.
419 word summary
This article discusses the challenges of managing database system configurations and proposes an automated solution that utilizes supervised and unsupervised machine learning techniques. The authors evaluate the effectiveness of their approach using a tool called OtterTune on three different database management systems (DBMSs). The results show that OtterTune's recommendations are comparable to or even surpass configurations generated by existing tools or human experts.
The authors extend the automated technique introduced in the original OtterTune paper by utilizing previously collected training data to optimize new DBMS deployments. They focus on improving latency prediction by incorporating GMM clustering and combining ensemble models with non-linear models.
The article highlights the complexity of DBMSs and the multitude of configuration knobs that impact performance and scalability. Most existing tools have limitations, such as being designed for specific DBMSs or requiring manual steps. The authors propose a new approach that reuses training data from previous tuning sessions to optimize DBMS performance for new applications. This approach reduces the time and resources needed for optimization and achieves significant improvements in latency compared to default settings or other tuning advisors.
The main objective of the authors' work is to extend OtterTune and propose novel machine learning models from previously collected data to prune redundant metrics, map unseen workloads to previous workloads, and improve latency prediction.
The article also provides an overview of the entire system architecture, including important steps such as metrics pruning and automated tuning through workload mapping. The authors discuss the limitations of previous automated DBMS tuning methods and highlight the drawbacks of focusing on individual DBMS instances.
In the metrics pruning stage, the authors carry out factor analysis and K-means clustering to capture the variability of system performance and differentiate different workloads. Data preprocessing steps include removing duplicate columns, converting boolean knob values to integers, and dividing files by workloads.
The experiment discussed in the document aimed to automate the tuning of database management system (DBMS) configurations using deep learning techniques. The researchers used a combination of different models and algorithms to optimize the performance of the DBMS.
The first part of the experiment focused on tuning the hyperparameters of the Gaussian Process Regression (GPR) model. The results showed that replacing K-means clustering with Gaussian Mixture Model (GMM) clustering slightly improved the performance of the model. The random forest algorithm did not perform as well as GPR, while the neural network-based model showed even better performance, capturing complex relationships and achieving lower MSE values.
Overall, the experiments demonstrated that both neural networks and GMM clustering can improve the performance of automated DBMS tuning.
1070 word summary
This article discusses the challenges of managing database system configurations and the lack of standardization among configuration knobs. To address this issue, the authors propose an automated solution that utilizes supervised and unsupervised machine learning techniques. This solution aims to identify influential knobs, analyze unseen workloads, and provide recommendations for optimal knob settings. The effectiveness of this approach is demonstrated through the evaluation of a tool called OtterTune on three different database management systems (DBMSs). The results show that OtterTune's recommendations are comparable to or even surpass configurations generated by existing tools or human experts.
The authors extend the automated technique introduced in the original OtterTune paper by utilizing previously collected training data to optimize new DBMS deployments. They focus on improving latency prediction by incorporating GMM clustering to streamline metrics selection and combining ensemble models with non-linear models for more accurate prediction modeling.
The article emphasizes the complexity of DBMSs and the multitude of configuration knobs that impact performance and scalability. The default configurations of these knobs are often suboptimal, leading to the need for automatic tuning tools. However, most existing tools have limitations, such as being designed for specific DBMSs or requiring manual steps. The authors propose a new approach that reuses training data from previous tuning sessions to optimize DBMS performance for new applications. This approach reduces the time and resources needed for optimization and achieves significant improvements in latency compared to default settings or other tuning advisors.
The main objective of the authors' work is to extend OtterTune and propose novel machine learning models from previously collected data to prune redundant metrics, map unseen workloads to previous workloads, and improve latency prediction. They achieve this through steps such as pruning important metrics, workload mapping based on performance measurements, and prediction modeling using regression models and neural networks.
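The workload-mapping step named above matches a target workload against previously observed workloads by distance over the pruned metrics. A minimal sketch of that idea (the workload names and metric vectors below are made up for illustration; this is not the authors' code):

```python
import numpy as np

def map_workload(target_metrics, offline_workloads):
    """Score each offline workload by Euclidean distance to the target
    over the pruned, normalized metrics; lower score = better match."""
    scores = {
        wid: float(np.linalg.norm(target_metrics - metrics))
        for wid, metrics in offline_workloads.items()
    }
    return min(scores, key=scores.get)

# toy example: three previously observed workloads (names hypothetical)
offline = {
    "tpcc": np.array([0.2, 0.9, 0.4]),
    "ycsb": np.array([0.8, 0.1, 0.5]),
    "wiki": np.array([0.4, 0.7, 0.6]),
}
best = map_workload(np.array([0.25, 0.85, 0.45]), offline)  # closest: "tpcc"
```

In the paper the distances are computed per pruned metric and aggregated into a score; here a single Euclidean norm over a metric vector stands in for that aggregation.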
The article also provides an overview of the entire system architecture, which includes important steps such as metrics pruning and automated tuning through workload mapping. The authors discuss the limitations of previous automated DBMS tuning methods, which relied on heuristic or cost-based algorithms. They highlight the drawbacks of focusing on individual DBMS instances and the varying impact of configuration knobs based on workload.
In the metrics pruning stage, the authors carry out factor analysis and K-means clustering to capture the variability of system performance and differentiate different workloads. Data preprocessing steps include removing duplicate columns, converting boolean knob values to integers, and dividing files by workloads.
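The preprocessing steps listed above can be sketched with pandas (the frame and column names are hypothetical stand-ins for the collected knob/metric files):

```python
import pandas as pd

# toy frame standing in for one collected observation file
# (column names are hypothetical)
df = pd.DataFrame({
    "workload": ["A", "A", "B", "B"],
    "knob_fsync": [True, False, True, True],   # boolean knob
    "metric_reads": [10, 12, 30, 31],
    "metric_reads_dup": [10, 12, 30, 31],      # exact duplicate column
})

# 1. remove duplicate columns (identical value vectors)
df = df.loc[:, ~df.T.duplicated()]

# 2. convert boolean knob values to integers
df["knob_fsync"] = df["knob_fsync"].astype(int)

# 3. divide the data by workload
by_workload = {name: grp.drop(columns="workload")
               for name, grp in df.groupby("workload")}
```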
Overall, this article presents an automated solution utilizing deep learning for tuning database management systems.
In this study, the authors utilize deep learning for automated database tuning. They begin with metric pruning: factor analysis (FA) transforms the high-dimensional metric data into a low-dimensional representation, where each factor is a linear combination of the original variables and can be interpreted in the same way as coefficients in linear regression. The authors observe that only the first 30 factors, those with eigenvalues greater than 1, are significant for the DBMS metric data. K-means clustering is then carried out on the factors, with silhouette analysis used to identify the optimal number of clusters: silhouette scores are calculated for K-means models with different cluster counts, and the model with the highest score is selected. Once the optimal number of clusters is identified, the metrics are segregated into clusters and the metric closest to each cluster centroid is selected to represent the entire cluster; redundant metrics that lie too close to each other in the scatterplot are removed. The pruned metrics are then normalized and provided as input for the later stages.
Gaussian process regression (GPR) is used to train the prediction model, with K-means clustering identifying and segregating metrics into meaningful groups. Workload mapping matches the current target workload to offline workloads, scoring each offline workload by the Euclidean distances over the pruned metrics. Latency prediction is carried out in two stages. The authors evaluate GMM clustering and random forests as alternatives to GPR, and also experiment with neural networks. Results are assessed with silhouette analysis, mean squared error (MSE), and mean absolute percentage error (MAPE). Scaling and hyperparameter tuning are performed to improve the performance of the system.
The experiment discussed in the document aimed to automate the tuning of database management system (DBMS) configurations using deep learning techniques. The researchers used a combination of different models and algorithms to optimize the performance of the DBMS.
The first part of the experiment focused on tuning the hyperparameters of the Gaussian Process Regression (GPR) model. By adjusting the alpha parameter to lower values, the model's performance improved. The researchers also built random forest trees with different depths and estimators, selecting the ones with the best evaluation scores as hyperparameters.
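That hyperparameter sweep can be sketched with scikit-learn's GaussianProcessRegressor, where `alpha` is the noise term added to the kernel diagonal (the data here is a synthetic low-noise function, not the paper's DBMS measurements, and the alpha grid is an assumption):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 5.0, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.05, 40)   # low-noise target

X_tr, X_te, y_tr, y_te = X[:30], X[30:], y[:30], y[30:]

# sweep alpha and keep the value with the best held-out MSE
results = {}
for alpha in (1e-10, 1e-4, 1e-1, 1.0):
    gpr = GaussianProcessRegressor(
        kernel=ConstantKernel(1.0) * RBF(1.0), alpha=alpha, random_state=0
    ).fit(X_tr, y_tr)
    results[alpha] = mean_squared_error(y_te, gpr.predict(X_te))
best_alpha = min(results, key=results.get)
```

On data whose true noise is small, low alpha values tend to win this sweep, which mirrors the researchers' observation that lowering alpha improved performance.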
The results of the experiment showed that replacing K-means clustering with Gaussian Mixture Model (GMM) clustering slightly improved the performance of the model, as measured by Mean Absolute Percentage Error (MAPE) and Mean Squared Error (MSE). However, the performance of the random forest algorithm was not as good as that of GPR. The neural network-based model showed even better performance, as it was able to capture complex relationships and achieved lower MSE values.
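The random forest vs. neural network comparison can be illustrated with scikit-learn on a synthetic non-linear "latency" surface (the function, sizes, and hyperparameters below are invented; the paper compares the models on real tuning data, where the neural network achieved the lowest MSE):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
# synthetic stand-in: three "knob" settings -> latency-like response
# with a non-linear interaction between the first two knobs
X = rng.uniform(0.0, 1.0, size=(400, 3))
y = X[:, 0] * X[:, 1] + 0.5 * np.sin(3.0 * X[:, 2])

X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "neural_net": MLPRegressor(hidden_layer_sizes=(64, 64),
                               max_iter=3000, random_state=0),
}
mse = {name: mean_squared_error(y_te, m.fit(X_tr, y_tr).predict(X_te))
       for name, m in models.items()}
```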
Despite the superior performance of the neural network model, it should be noted that neural networks have a higher tendency to overfit when given less data. Overall, the experiments demonstrated that both neural networks and GMM clustering can improve results in automated DBMS tuning.
The conclusion of the experiment highlighted that automatic DBMS tuning is an active area of research. The researchers proposed an automatic approach that leverages past experience and collects new information to tune DBMS configurations. They achieved a MAPE of 69% using the baseline implementation, which utilized feature aggregation, K-means clustering for metric pruning, and GPR for prediction modeling. Replacing K-means clustering with EM-clustering reduced MAPE to 67%, suggesting that GMM clustering could be an alternative. The deep learning-based approach outperformed both clustering approaches with a MAPE score of 65%.
In summary, the experiment demonstrated the effectiveness of deep learning techniques in automated DBMS tuning. The neural network-based approach showed superior performance in capturing complex relationships, while GMM clustering proved to be a viable alternative to K-means clustering. However, the researchers acknowledged the potential for overfitting with neural networks and the need for further research in this area.