# Deep-dive into Bayesian Hyper-Parameter Tuning

An intuition and implementation summary of the Sequential Model-Based Optimization algorithm for Bayesian hyper-parameter tuning

# Model Hyper-Parameter vs Model Parameter

For a machine learning model, hyper-parameters are set before training and remain **static** during training.

Model hyper-parameters include:

- Model layout and model attributes, like the number of trees and max depth of trees for Random Forest, the number of layers and units per layer for a Neural Network, the number of clusters for k-means, etc.
- Training settings, like learning rate, regularization rate, batch size, etc.

Model parameters, on the other hand, are **dynamic**: they evolve during training and are learned from the training data so that the model fits it.

Model parameters include:

- Weights or coefficients for a regression model
- Split conditions for a decision tree
- Centroids for k-means clustering
- etc.
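To make the distinction concrete, here is a minimal scikit-learn sketch (the model and data are illustrative): hyper-parameters go into the constructor before training, while parameters such as `coef_` only exist after fitting.

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # y = 2x + 1

# Hyper-parameter: chosen before training, static during training
model = Ridge(alpha=0.1)

model.fit(X, y)

# Model parameters: learned from the training data during fit()
print(model.coef_, model.intercept_)
```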

# Bayesian Hyper-Parameter Tuning Conceptual Model

Bayesian Hyper-Parameter Tuning implementation is based on the below Sequential Model-Based Optimization (**SMBO**) algorithm.

**X**: the hyper-parameter space.

**f**: a black-box function, **yᵢ** = **f**(**xᵢ**), where **xᵢ** is one sample from the hyper-parameter space. Here **f**(**xᵢ**) can be a model loss function evaluated with model hyper-parameters **xᵢ**.

**D**: the dataset {(**x₁**, **y₁**), . . . , (**xₙ**, **yₙ**)}, pairing each hyper-parameter sample **xᵢ** with its corresponding model performance metric **yᵢ**.

**M**: the statistical surrogate model, which fits dataset **D**.

**S**: the acquisition function, which uses statistical model **M** to select the next hyper-parameter sample **xᵢ** that may achieve a better model performance metric **yᵢ**.

The ultimate purpose of SMBO is to estimate the conditional distribution P(y | x); based on this distribution, we can find the hyper-parameters **x** with the best model performance **y**.
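The SMBO loop above can be sketched in plain Python. This is a toy skeleton, not a real implementation: the nearest-neighbour surrogate and pure-exploitation acquisition below are illustrative stand-ins that show where **M** and **S** plug in.

```python
import random

def smbo(f, space_sample, surrogate_fit, acquisition, n_init=3, n_iter=10):
    """Generic SMBO loop: D accumulates (x, y) pairs, surrogate M is
    refit each round, acquisition S picks the next candidate."""
    D = [(x, f(x)) for x in (space_sample() for _ in range(n_init))]
    for _ in range(n_iter):
        M = surrogate_fit(D)                              # fit M to D
        candidates = [space_sample() for _ in range(100)]
        x_next = max(candidates, key=lambda c: acquisition(M, c))
        D.append((x_next, f(x_next)))                     # one true evaluation of f
    return min(D, key=lambda p: p[1])                     # best (x, y) found

def surrogate_fit(D):
    # toy surrogate: predict y at x as the y of the nearest observed x
    def predict(x):
        return min(D, key=lambda p: abs(p[0] - x))[1]
    return predict

def acquisition(M, x):
    # toy acquisition: pure exploitation (lower predicted y is better)
    return -M(x)

random.seed(0)
best_x, best_y = smbo(lambda x: (x - 2.0) ** 2,           # toy objective
                      lambda: random.uniform(0.0, 5.0),   # toy search space
                      surrogate_fit, acquisition)
```

Swapping `surrogate_fit` for a GP, TPE, or Random Forest model and `acquisition` for PI or EI yields the real variants discussed below.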

# Statistical Surrogate Model (M)

In the SMBO algorithm, **M** is a statistical model which fits the sampled dataset **D** {(**x₁**, **y₁**), . . . , (**xₙ**, **yₙ**)}.

Practically, there are three common ways to build model *M*:

- Gaussian Processes (GPs)
- Tree Parzen Estimators (TPE)
- Random Forests

## Intuition using Gaussian Processes

Dataset **D** {(**x₁**, **y₁**), . . . , (**xₙ**, **yₙ**)} can be viewed in an **n**-dimensional space whose dimensions are **X** = [**x₁, x₂,** . . . **, xₙ**].

Assume **y** = [**y₁, y₂,** . . . **, yₙ**] follows the multivariate Gaussian distribution

P(**y** | **X**) ~ **N**(**μ**, **K**)

**μ** = [m(**x₁**), . . . , m(**xₙ**)], where m represents the mean function

**K** is the covariance matrix; each matrix value **k**(**xᵢ**, **xⱼ**) is defined by a kernel function, and if points **xᵢ** and **xⱼ** are considered similar by the kernel, then **yᵢ** and **yⱼ** are expected to be similar.

If a new sample (**X***, **y***) is added to dataset **D**, the joint distribution (**y**, **y***) is still a multivariate Gaussian:

(**y**, **y***) ~ **N**([μ, μ*], [[K, K*], [K*ᵀ, K(X*, X*)]]), where K* = k(X, X*)

So we can follow the process: sample **x***, calculate **y*** = **f**(**x***), then update the joint distribution (**y**, **y***).

Once enough samples are collected, the distribution becomes accurate enough to model the relationship between **x** and **y**.

Next, we need to convert the joint distribution **P(y, y*, X, X*)** into the conditional distribution **P(y* | y, X, X*)**, so we can predict the output **y*** for any **X***:

μ* = K*ᵀ K⁻¹ y,  Σ* = K(X*, X*) − K*ᵀ K⁻¹ K*

Gaussian Processes are quite complex; the above is just a general outline of the Gaussian Process, as I understand it.
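As a rough numpy sketch of that conditioning step (the RBF kernel, zero mean function, and small noise jitter are my illustrative assumptions):

```python
import numpy as np

def rbf(a, b, length=1.0):
    # squared-exponential kernel: similar inputs -> similar outputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(X, y, X_star, noise=1e-8):
    """P(y* | y, X, X*): condition the joint Gaussian on the observed y."""
    K = rbf(X, X) + noise * np.eye(len(X))   # jitter for numerical stability
    K_s = rbf(X, X_star)
    K_ss = rbf(X_star, X_star)
    K_inv = np.linalg.inv(K)
    mu_star = K_s.T @ K_inv @ y              # posterior mean
    cov_star = K_ss - K_s.T @ K_inv @ K_s    # posterior covariance
    return mu_star, cov_star

# three observations of a toy function, then predict at a new point
X = np.array([-1.0, 0.0, 1.0])
y = np.sin(X)
mu, cov = gp_posterior(X, y, np.array([0.5]))
```

Near the observed points the posterior mean tracks the true function and the posterior variance shrinks; far from them the variance grows back toward the prior.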

## Intuition using Tree Parzen Estimators (TPE)

Different from GP, TPE doesn’t directly construct a statistical model P(y | x); instead, TPE chooses to model the distribution P(x | y) as follows:

**y*** is a threshold value; for example, we can choose the median value of y from dataset **D** {(**x₁**, **y₁**), . . . , (**xₙ**, **yₙ**)}

**l**(x) is the distribution of x when **y** = **f**(x) < **y***

**g**(x) is the distribution of x when **y** = **f**(x) ≥ **y***

Next, TPE chooses the next sample **X*** = argmax **l**(x)/**g**(x) (assuming a smaller **f(x)** value means better model performance).

Then, calculate **y*** = **f**(**X***) and append (**X***, **y***) to dataset **D**.

This is one iteration of the TPE process; TPE then repeats re-calculating **l**(x) and **g**(x) and sampling **X*** until reaching the pre-set number of iterations.
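A toy sketch of one TPE iteration in 1-D (using `scipy.stats.gaussian_kde` for both densities; the objective, the median split for **y***, and the candidate grid are all illustrative choices):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def f(x):
    return (x - 2.0) ** 2               # toy objective (smaller is better)

# existing dataset D of (x, y) pairs
xs = rng.uniform(0.0, 5.0, size=30)
ys = f(xs)

y_star = np.median(ys)                  # threshold y*
l = gaussian_kde(xs[ys < y_star])       # l(x): density of the "good" x
g = gaussian_kde(xs[ys >= y_star])      # g(x): density of the "bad" x

# choose the candidate that maximizes l(x) / g(x)
candidates = np.linspace(0.0, 5.0, 200)
x_next = candidates[np.argmax(l(candidates) / g(candidates))]
```

The ratio l(x)/g(x) peaks where "good" samples concentrate, so `x_next` lands near the minimum at x = 2 without ever modeling P(y | x) directly.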

## Intuition using Random Forests

This method aims to use a Gaussian distribution **y** ~ **N**(μ(x), σ²(x)) to model the dataset **D** {(**x₁**, **y₁**), . . . , (**xₙ**, **yₙ**)}.

Assume a Random Forest **r**(**x**) with **B** trees is fit to dataset **D**; the Gaussian distribution is then defined by the empirical mean and variance of the individual tree predictions rᵦ(x):

μ(x) = (1/B) Σᵦ rᵦ(x),  σ²(x) = (1/B) Σᵦ (rᵦ(x) − μ(x))²
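A sketch of those per-tree statistics with scikit-learn (the toy data and the mean/variance-over-trees construction are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(40, 1))
y = np.sin(X[:, 0])                      # toy target

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def rf_surrogate(x):
    """Gaussian surrogate at x: mean and variance over the B trees."""
    preds = np.array([tree.predict(x) for tree in forest.estimators_])
    return preds.mean(axis=0), preds.var(axis=0)

mu, var = rf_surrogate(np.array([[1.0]]))
```

The variance across trees plays the same role as the GP posterior variance: it is large where the forest disagrees, which is where exploration is worthwhile.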

# Acquisition Function (S)

The acquisition function is the sampling criterion by which the next sample of hyper-parameters is chosen from the hyper-parameter space.

The criterion finds new samples through exploration and exploitation of the hyper-parameter space.

Exploration means sampling from areas that have large uncertainty (variance).

Exploitation means sampling from areas we already know have good performance.

Below are the two most popular acquisition functions:

## Probability of Improvement (PI)

Probability of Improvement tries to find the sample which has the maximum probability of improving the model loss function.

Assume in the current sample dataset **D** {(**x₁**, **y₁**), . . . , (**xₙ**, **yₙ**)}, the minimal **yᵢ** is **y***; then **PI** is **P**(**y** < **y***), and the next sample is **x** = **argmax** (**P**(**y** < **y***)).

If we use a Gaussian Process to model dataset **D**, the **PI** expression is

PI(x) = Φ((y* − μ(x)) / σ(x))

where **Φ** is the Gaussian cumulative distribution function.

The next sample is **x** = **argmax** PI(x).
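Plugging in illustrative surrogate outputs μ(x), σ(x) and a current best y*, PI reduces to a single Gaussian CDF evaluation:

```python
from scipy.stats import norm

def probability_of_improvement(mu, sigma, y_best):
    # PI(x) = Phi((y* - mu(x)) / sigma(x)), for minimization
    return norm.cdf((y_best - mu) / sigma)

# a point whose predicted loss is below the current best is likely to improve
pi = probability_of_improvement(mu=0.3, sigma=0.2, y_best=0.5)
```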

## Expected Improvement (EI)

**PI** focuses on choosing the sample with the maximum probability of improving the model loss function, while **Expected Improvement** focuses on how much the sample can improve it, choosing the sample that maximizes the expected improvement.

In the Gaussian Process model, the **EI** expression is

EI(x) = (y* − μ(x)) Φ(Z) + σ(x) φ(Z), where Z = (y* − μ(x)) / σ(x)

The next sample is **x** = **argmax** EI(x).

Based on the **EI** expression, the next sample is a trade-off between minimizing **μ**(x) (exploitation) and maximizing the variance **κ**(x, x) (exploration).
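The same illustrative numbers make the EI trade-off easy to check numerically (minimization form, treating the minimal observed yᵢ as y*):

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    # EI(x) = (y* - mu) * Phi(Z) + sigma * phi(Z), Z = (y* - mu) / sigma
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# a point predicted below the current best has a large expected improvement
ei = expected_improvement(mu=0.3, sigma=0.2, y_best=0.5)
```

Note the exploration term at work: for a point whose predicted mean is worse than y*, a larger σ still yields a larger EI.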

## Example for Gaussian Process and Acquisition Function

The example will use a Gaussian Process to fit the objective function *f* below.

*(The **scikit-optimize** python library is used for Gaussian Process Regression.)*

```python
import numpy as np

def f(x):
    return np.sin(5 * x[0]) * (1 - np.tanh(x[0] ** 2))
```

Fit function **f** using Gaussian Process Regression and the **EI** acquisition function:

```python
from skopt import gp_minimize
from skopt.plots import plot_gaussian_process
import matplotlib.pyplot as plt

res = gp_minimize(f,
                  [(-2.0, 2.0)],        # search space for x
                  acq_func="EI",
                  n_random_starts=8,
                  random_state=1234)

plt.rcParams["figure.figsize"] = (10, 30)
n_samples = 6

for n_iter in range(n_samples):
    # Plot the GP posterior approximation of f
    plt.subplot(n_samples * 2, 1, 2 * n_iter + 1)
    show_legend = (n_iter == 0)
    plot_gaussian_process(res,
                          n_calls=n_iter,
                          objective=f,
                          show_legend=show_legend,
                          show_title=False)

    # Plot the acquisition function and the next sample point
    plt.subplot(n_samples * 2, 1, 2 * n_iter + 2)
    plot_gaussian_process(res,
                          n_calls=n_iter,
                          show_legend=show_legend,
                          show_title=False,
                          show_mu=False,
                          show_acq_func=True,
                          show_observations=False,
                          show_next_point=True)
```

# Example for Bayesian Hyper-Parameter Tuning

This example uses the **scikit-optimize** python library and a Gaussian Process surrogate model to find the best hyper-parameters for a Gradient Boosting Regressor.

```python
import numpy as np
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args

boston = load_boston()
X, y = boston.data, boston.target
n_features = X.shape[1]

# gradient boosted trees hyper-parameter space
space = [Integer(1, 5, name='max_depth'),
         Real(10**-5, 10**0, "log-uniform", name='learning_rate'),
         Integer(1, n_features, name='max_features'),
         Integer(2, 100, name='min_samples_split'),
         Integer(1, 100, name='min_samples_leaf')]

@use_named_args(space)
def objective(**params):
    reg = GradientBoostingRegressor(n_estimators=50, random_state=0)
    reg.set_params(**params)
    return -np.mean(
        cross_val_score(reg, X, y, cv=5, n_jobs=-1,
                        scoring="neg_mean_absolute_error"))

res_gp = gp_minimize(objective, space, acq_func="EI",
                     n_calls=100, random_state=0)

print("""Best parameters:
- max_depth=%d
- learning_rate=%.6f
- max_features=%d
- min_samples_split=%d
- min_samples_leaf=%d""" % (res_gp.x[0], res_gp.x[1],
                            res_gp.x[2], res_gp.x[3],
                            res_gp.x[4]))
```

The second example uses the **hyperopt** library and a Tree Parzen Estimators (TPE) surrogate model.

```python
import numpy as np
from hyperopt import tpe, hp, fmin, STATUS_OK, Trials, space_eval
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

boston = load_boston()
X, y = boston.data, boston.target
n_features = X.shape[1]

# gradient boosted trees hyper-parameter space
space = {
    "max_depth": hp.choice('max_depth', range(1, 5, 1)),
    "learning_rate": hp.loguniform('learning_rate', np.log(0.005), np.log(0.1)),
    "max_features": hp.choice("max_features", range(1, n_features, 1)),
    "min_samples_split": hp.choice("min_samples_split", range(2, 100, 1)),
    "min_samples_leaf": hp.choice("min_samples_leaf", range(1, 100, 1))
}

def hyperparameter_tuning(params):
    reg = GradientBoostingRegressor(n_estimators=50, random_state=0)
    reg.set_params(**params)
    mae = -np.mean(
        cross_val_score(reg, X, y, cv=5, n_jobs=-1,
                        scoring="neg_mean_absolute_error"))
    return {"loss": mae, "status": STATUS_OK}

trials = Trials()
best = fmin(
    fn=hyperparameter_tuning,
    space=space,
    algo=tpe.suggest,
    max_evals=100,
    trials=trials
)

# fmin returns indices for hp.choice parameters; space_eval maps them
# back to the actual hyper-parameter values
best_params = space_eval(space, best)

print("""Best parameters:
- max_depth=%d
- learning_rate=%.6f
- max_features=%d
- min_samples_split=%d
- min_samples_leaf=%d""" % (best_params["max_depth"],
                            best_params["learning_rate"],
                            best_params["max_features"],
                            best_params["min_samples_split"],
                            best_params["min_samples_leaf"]))
```

# Final Summary

Based on the demo tests above, different surrogate models may return different best parameters, which is amusing. Even with the same surrogate model, different initial samples may produce different results, due to local-minimum traps and different balances of exploration and exploitation. So we end up tuning the parameters of the algorithm that tunes the model hyper-parameters. My current conclusion for Bayesian hyper-parameter tuning: it can be a useful reference, but it is not a silver bullet for model hyper-parameter tuning in practical projects.
