In this post, we will discuss two popular and closely related algorithms:
Gradient Boosted Trees and XGBoost.
In Part 1, we will cover the following questions:
What is a gradient boosted tree?
What are the downsides of gradient boosted trees?
Gradient Boosting is a popular Machine Learning algorithm whose generalized form was first formulated by Jerome Friedman in 1999.
The crux of Gradient Boosting can be understood as follows:
Motivation: “If we can account for our model’s errors, we will be able to improve our model’s performance.”
Question: “But how do we know the error made by our model for any given input?”
Answer: “We can train a new predictor to predict the errors made by the original model.”
The following picture illustrates the process structure of Gradient Boost:
When Decision Trees, the most common Gradient Boost base learners, are used, each of the blue predictor boxes represents a tree. The Greek letter eta (η) indicates the small weight given to each tree, also known as the learning rate of the model. At each iteration, a new predictor (a.k.a. weak learner) is added to the ensemble to account for the previous model's errors.
As the name Gradient Boosted Trees suggests, this algorithm can be understood through its three key elements: Trees, Boosting, and Gradient.
Tree:
Gradient Boosted Trees uses Decision Trees as its base model. The basic idea of a Decision Tree is that it uses different features and feature values as thresholds to split the data. For each split, the Decision Tree chooses the feature and threshold value that reduce the error or misclassification rate the most (commonly measured with entropy or Gini impurity). The tree keeps splitting the data until no further split brings improvement, or until a stopping criterion is met.
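As a rough illustration of how a split is chosen, here is a minimal sketch that scans one numeric feature for the threshold with the lowest weighted Gini impurity (the helper names and toy data are made up for this example, not part of any library):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum over classes of p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Try every midpoint between consecutive sorted feature values and
    return the threshold whose weighted child impurity is lowest."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_impurity = None, float("inf")
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        t = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return float(best_t), float(best_impurity)

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # → (6.5, 0.0): a perfectly separating split
```

A real tree repeats this search over every feature at every node, which is exactly why the algorithm gets expensive as features and thresholds multiply.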
Boosting:
Boosting essentially means building a model by combining many small trees (an ensemble), where each subsequent tree aims to correct the errors of the previous ones. Each small tree is a weak learner, as it does not predict well on its own; but with an ensemble of them correcting errors one after another, the model gradually learns during training and becomes a strong predictor as a whole.
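The error-correcting idea can be shown with a deliberately simplified toy: the first "learner" just predicts the mean of the targets, and the second is assumed to fit the first one's residuals perfectly (real weak learners only fit them approximately):

```python
y = [3.0, 5.0, 9.0, 11.0]

# Learner 1: predict the mean of the targets for every sample.
f1 = sum(y) / len(y)                                   # 7.0
pred1 = [f1] * len(y)
residuals = [yi - p for yi, p in zip(y, pred1)]        # [-4, -2, 2, 4]

# Learner 2: fit the residuals of learner 1, then add its output on top.
pred2 = [p + r for p, r in zip(pred1, residuals)]

mae1 = sum(abs(yi - p) for yi, p in zip(y, pred1)) / len(y)
mae2 = sum(abs(yi - p) for yi, p in zip(y, pred2)) / len(y)
print(mae1, mae2)  # → 3.0 0.0: the second learner corrects the first one's errors
```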
Gradient:
The key characteristic of Gradient Boost is that it minimizes the model's loss function by performing gradient descent with its weak learners. Gradient descent is a first-order iterative optimization algorithm used extensively in Machine Learning. However, unlike many machine learning algorithms (e.g. Neural Networks or Regression), which perform gradient descent in parameter space to search for the parameters θ, Gradient Boost performs gradient descent in function space to search for a function: the weak learner to be added to the ensemble.
To briefly illustrate the similarity between Gradient Boost and gradient descent mathematically, let's consider a differentiable loss function L.
In gradient descent:
We start with a random θ, and optimize L with respect to θ by iteratively applying the update rule:

θ ← θ − η ∇_θ L(θ)
In Gradient Boost:
We start with an initial prediction F_0 (e.g. the mean of the targets), and at each iteration we add a new weak learner h_m, fitted to the negative gradient of the loss with respect to the current predictions, scaled by the learning rate η:

F_m(x) = F_{m-1}(x) + η·h_m(x),  where h_m(x) ≈ −∂L(y, F(x))/∂F(x) evaluated at F = F_{m-1}

This is how Gradient Boost operates gradient descent in a functional space: the negative gradient plays the role of the residuals that each new tree is trained on.
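Under squared loss, the negative gradient with respect to the current prediction is exactly the residual y − F(x), which makes the functional gradient step easy to sketch. Below is a minimal from-scratch illustration using depth-1 "stump" learners; the helper names and toy data are invented for this example, and this is a sketch of the idea, not a production implementation:

```python
import numpy as np

def fit_stump(x, residuals):
    """Depth-1 regression tree: pick the threshold minimizing squared error,
    predicting the mean residual on each side of the split."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = residuals[x <= t], residuals[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)

def gradient_boost(x, y, n_trees=50, eta=0.1):
    """For squared loss, the negative gradient of L w.r.t. the current
    prediction F(x) is the residual y - F(x), so each new stump is fit
    to the residuals -- a gradient step in function space."""
    f0 = y.mean()                                  # initial constant prediction F_0
    pred = np.full_like(y, f0, dtype=float)
    stumps = []
    for _ in range(n_trees):
        h = fit_stump(x, y - pred)                 # fit the negative gradient
        stumps.append(h)
        pred = pred + eta * h(x)                   # small functional step of size eta
    return lambda z: f0 + eta * sum(h(z) for h in stumps)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8])
model = gradient_boost(x, y)
print(np.round(model(x), 2))   # predictions approach y over the boosting rounds
```

Note how η plays the same role as the learning rate in ordinary gradient descent: each stump is only partially trusted, and many small steps are taken instead of one big one.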
The mechanism described above also differentiates Gradient Boost from another popular boosting algorithm, Adaptive Boosting (AdaBoost). AdaBoost learns through iterative sample reweighting: it raises the weights of misclassified instances and lowers the weights of correctly classified ones, so that each subsequent learner focuses on the hard examples. Gradient Boost, by contrast, does not reweight samples at all; it simply fits each new tree to the residual errors (the negative gradients) of the previous predictor.
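To make the contrast concrete, here is a minimal sketch of AdaBoost's classic weight update (the five samples and the misclassification pattern are invented for illustration):

```python
import math

# Hypothetical toy: 5 equally weighted samples; the weak learner
# misclassifies samples at index 1 and 3.
weights = [0.2] * 5
misclassified = [False, True, False, True, False]

err = sum(w for w, m in zip(weights, misclassified) if m)   # weighted error = 0.4
alpha = 0.5 * math.log((1 - err) / err)                     # the learner's "say"

# Raise the weights of mistakes, lower the weights of correct samples,
# then renormalize so the weights sum to 1.
weights = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]
total = sum(weights)
weights = [w / total for w in weights]
print([round(w, 3) for w in weights])  # → [0.167, 0.25, 0.167, 0.25, 0.167]
```

After the update, the misclassified samples carry half of the total weight, so the next weak learner is pushed toward exactly the examples the previous one got wrong; Gradient Boost achieves a similar focus through residuals instead of weights.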
Gradient Boosted Trees is a highly flexible and powerful algorithm for building predictive models. Unlike many other algorithms, it does not even require feature scaling.
However, the major downsides of Gradient Boosted Trees are:
1. Due to its tendency to keep minimizing errors, it can easily overfit by overemphasizing outliers.
2. With the huge number of features and thresholds to consider at each split, it is computationally expensive in both time and memory.