In this series of posts we discuss two hyperparameter tuning algorithms: Grid Search and Bayesian Optimization with TPE.
In Part 2, we cover the following topics:
What is TPE? Why TPE?
Bayesian Optimization in Brief
TPE specifics in Brief
The Tree-Structured Parzen Estimator (TPE) is a Bayesian optimization method that helps users run “informed” searches by making use of information from past evaluations.
To illustrate, consider a scenario where we need to minimize a loss function by testing different numbers of estimators (trees) in a Random Forest.
Based on the picture below, which shows the results of several initial runs, human visual judgment would probably suggest that smaller estimator sizes, like those in the red circle, are more likely to yield lower loss.
This is exactly the “intelligence” of a Bayesian tuning method: unlike grid search, which blindly runs every single combination defined by the user without “learning” anything from each run’s result and wastes resources searching all over the place, it learns from past evaluation results to narrow the search space down to combinations that are likely to perform well.
Illustration of the “intelligence” of the Bayesian tuning method
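To make this concrete, below is a minimal sketch of such a search using the hyperopt library, which implements TPE. The synthetic dataset, the 10–500 search range, and the cross-validation scoring are assumptions added purely for illustration.

```python
# Illustrative sketch: a TPE-driven search over the number of trees in a Random Forest.
# The synthetic dataset, search range, and CV setup are assumptions for demonstration only.
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=0.3, random_state=0)

def objective(params):
    """Loss to minimize: negative cross-validated R^2 for the proposed forest size."""
    model = RandomForestRegressor(n_estimators=int(params["n_estimators"]), random_state=0)
    return -cross_val_score(model, X, y, cv=3).mean()  # hyperopt minimizes, so negate

# Search space: number of estimators between 10 and 500, in steps of 10
space = {"n_estimators": hp.quniform("n_estimators", 10, 500, 10)}

trials = Trials()  # keeps the history of past evaluations that TPE learns from
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30, trials=trials)
print(best)  # the best n_estimators found within the budget
```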
In essence, Bayesian hyperparameter optimization is the process of:
“Build[ing] a probability model of the objective function to propose smarter choices for the next set of hyperparameters to evaluate.” (Koehrsen, 2018)
The probability model of the objective function is also called a “surrogate”, expressed as the probability of the score (y) given the hyperparameters (x):
P (y | x) = P (score | hyperparameters)
By making use of the surrogate and keeping track of past evaluation results, the optimization process can be understood as the following steps (a minimal code sketch follows the list):
Build a surrogate probability model of the objective function
Find the hyperparameters that perform best on the surrogate
Apply these hyperparameters to the objective function
Update the surrogate model incorporating the new results
Repeat steps 2-4 until max iterations or time is reached
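Here is a minimal sketch of this loop on a toy one-dimensional problem. The Gaussian-process surrogate and the simple “pick the surrogate’s predicted minimum” rule are assumptions made for brevity; real implementations choose the next point with an acquisition function such as expected improvement, and TPE builds its surrogate differently, as described next.

```python
# Sketch of the generic Bayesian-optimization loop (steps 1-5 above).
# The toy 1-D objective and the Gaussian-process surrogate are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    """A toy objective we pretend is expensive to evaluate."""
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
X_obs = list(rng.uniform(-3, 3, size=2))   # a couple of initial evaluations
y_obs = [objective(x) for x in X_obs]

for _ in range(20):                                   # step 5: repeat until the budget is spent
    surrogate = GaussianProcessRegressor()            # steps 1 & 4: (re)build the surrogate
    surrogate.fit(np.array(X_obs).reshape(-1, 1), y_obs)
    candidates = np.linspace(-3, 3, 200).reshape(-1, 1)
    x_next = candidates[np.argmin(surrogate.predict(candidates))][0]  # step 2: best on surrogate
    X_obs.append(x_next)
    y_obs.append(objective(x_next))                    # step 3: evaluate the true objective

print(min(y_obs), X_obs[int(np.argmin(y_obs))])        # best score found and its hyperparameter
```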
As the number of iterations of the above process increases, the surrogate approximation becomes closer to the objective function. Therefore, hyperparameters that maximize/minimize the surrogate are also likely to maximize/minimize the objective function.
The figures above illustrate this process.
The black line is the surrogate model with uncertainty region in grey.
The red line is the true objective function.
We see how the surrogate approximation becomes closer to the objective function from 2 evaluations (left figure) to 8 evaluations (right figure).
Different Bayesian hyperparameter optimization methods have different ways of building the surrogate probability model P (y | x).
The Tree-Structured Parzen Estimator (TPE) builds its surrogate by applying the Bayes rule. Instead of modeling P (y | x) directly, it models P (x | y), the probability of the hyperparameters given the score on the objective function. The flip follows from the Bayes rule formula: P (y | x) = P (x | y) * P (y) / P (x).
Specifically, TPE models P (x | y) by using a threshold y* to build two different probability distributions over the hyperparameters x, l(x) and g(x).
l(x) is built from the group of hyperparameters that yield objective scores lower than the threshold, and g(x) is built from the group that yields objective scores greater than or equal to the threshold.
Mathematically, it is expressed as: P (x | y) = l(x) if y < y*, and P (x | y) = g(x) if y ≥ y*.
The left figure below illustrates the threshold y* with the dotted line; the right figure illustrates the two distributions l(x) and g(x).
Specifically, l(x) and g(x) are Gaussian Mixture Models (GMMs). TPE fits and updates these two distributions on every trial, for every parameter.
TPE then draws samples x from l(x), the distribution of the better-performing group, and evaluates the ratio l(x)/g(x) for each sample.
The hyperparameter x that yields the maximum value of this ratio, which corresponds to the greatest expected improvement, is chosen to be evaluated on the objective function.
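For a single hyperparameter, this selection step can be sketched as follows. The toy trial history, the 15% threshold quantile, and the use of SciPy Gaussian KDEs as stand-ins for the Parzen estimators are all illustrative assumptions.

```python
# Sketch of TPE's candidate-selection step for a single hyperparameter.
# The toy trial history, the 15% threshold quantile, and the candidate count are assumptions.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x_hist = rng.uniform(10, 500, size=50)                      # hyperparameter values tried so far
y_hist = (x_hist - 120) ** 2 / 1e4 + rng.normal(0, 1, 50)   # their (toy) losses

y_star = np.quantile(y_hist, 0.15)                          # threshold y* splitting the trials
l_kde = gaussian_kde(x_hist[y_hist < y_star])               # l(x): density of the "good" group
g_kde = gaussian_kde(x_hist[y_hist >= y_star])              # g(x): density of the rest

candidates = l_kde.resample(size=100)[0]                    # draw candidate x's from l(x)
ratio = l_kde(candidates) / g_kde(candidates)               # proxy for expected improvement
x_next = candidates[np.argmax(ratio)]                       # propose the x that maximizes l/g
print(x_next)                                               # next value to evaluate on the objective
```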
As the number of trials increases, the surrogate becomes a closer approximation of the objective function, and therefore the hyperparameters chosen by evaluating the surrogate are also likely to yield good results, or score improvements, on the objective function.
REFERENCES:
Koehrsen, W. (2018, June 24). A Conceptual Explanation of Bayesian Hyperparameter Optimization for Machine Learning. Retrieved from Towards Data Science: https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f