Following part 1, we will continue our discussion on Gradient Boosted Trees and XGBoost.
In part 2, we will discuss the following topics:
What is XGBoost?
What is so great about XGBoost?
Weighted Quantile Sketch - Reduce Computational Cost In Tree Split
Parallelization to accelerate split search
Regularization parameters built into Objective Function
2nd-order derivative and Taylor Expansion
Other features
XGBoost literally stands for “Extreme Gradient Boosting” and is an algorithm, released in 2014, built upon the foundation of Gradient Boosting.
As its API structure and settings are very similar to Gradient Boost, shifting from Gradient Boost to XGBoost usually requires only minor edits in the code.
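As a quick illustration, here is a minimal sketch (assuming scikit-learn and the xgboost Python package; the dataset and hyperparameter values are purely illustrative) of how similar the two APIs look:

```python
# A minimal sketch of switching from scikit-learn's Gradient Boosting to XGBoost.
# The shared hyperparameter names (n_estimators, learning_rate, max_depth) carry over.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=42)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X, y)

# Switching to XGBoost: essentially only the estimator class changes.
xgb_clf = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
xgb_clf.fit(X, y)
```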
XGBoost is known for its superb performance and efficiency. It has been the winning algorithm used in many Kaggle competitions and has a large community support.
For structured or tabular datasets, XGBoost has often been found to outperform even deep learning algorithms such as neural networks.
The “Extreme” in XGBoost actually means “pushing computational limits” to the extreme.
Here, we will briefly discuss the distinct design features that help XGBoost perform its magic, with a special focus on how they address Gradient Boosted Trees’ shortcomings in overfitting and computational efficiency.
Weighted Quantile Sketch - Reduce Computational Cost In Tree Split
Parallelization to accelerate split search
Regularization parameters built into Objective Function
2nd-order derivative and Taylor Expansion
Other features
Weighted Quantile Sketch - Reduce Computational Cost In Tree Split
To address the high computational cost of determining tree splits in Gradient Boosted Trees, XGBoost uses an approximate split-finding algorithm called Weighted Quantile Sketch for huge datasets and makes use of parallelization when searching for optimal splits.
The diagram above illustrates the difference between a typical Quantile split (Method 1 in the diagram) and the Weighted Quantile Sketch (Method 2 in the diagram).
The idea of Weighted Quantile Sketch is that it makes a histogram for each feature and uses the bins’ boundaries as candidate split points.
Therefore, the algorithm saves time by considering far fewer split candidates, as the data are now grouped into bins. In addition, to make sure that the candidate split points are well chosen and can help improve model performance (i.e., reduce loss), heavier weights are given to data points that are not yet well predicted/have higher loss.
Then, instead of having the same number of data points in each bin of the histogram, Weighted Quantile Sketch puts the same total weight in each bin. By doing so, the algorithm considers more split candidates and conducts a more detailed search in areas where the model needs improvement.
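Below is a simplified, illustrative sketch of the binning idea (not XGBoost’s actual implementation): candidate split points for one feature are chosen so that each bin carries roughly the same total weight, with each point’s weight standing in for its second-order gradient (hessian).

```python
import numpy as np

def weighted_quantile_candidates(feature_values, weights, n_bins=8):
    """Pick candidate split points so each bin holds roughly equal total weight."""
    order = np.argsort(feature_values)
    x, w = feature_values[order], weights[order]
    cum_w = np.cumsum(w)
    # Cut points at equally spaced fractions of the total weight
    targets = np.linspace(0, cum_w[-1], n_bins + 1)[1:-1]
    idx = np.searchsorted(cum_w, targets)
    return np.unique(x[idx])

rng = np.random.default_rng(0)
x = rng.normal(size=1000)             # one feature
h = rng.uniform(0.1, 1.0, size=1000)  # stand-in for per-sample hessians
print(weighted_quantile_candidates(x, h))
```

Because heavier points fill a bin faster, regions where the model still predicts poorly end up with more (and finer) candidate splits.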
Parallelization to accelerate split search
Other than Weighted Quantile Sketch, XGBoost makes use of parallelization to accelerate the split search.
In general, parallelization is hard to implement in boosting models because each weak learner added depends on the previous ensemble’s errors, making training an inherently serial operation.
Nonetheless, XGBoost makes parallelization happen – not on the ensemble/tree level, but on the feature level.
For example, when searching for the optimal split point for a leaf, each core can compute the losses of a different feature’s candidate split points. The final split point for that leaf is then decided by comparing all the calculated losses.
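The toy sketch below illustrates the idea (this is not XGBoost’s internal C++ implementation, which is controlled via the nthread/n_jobs setting): each worker evaluates candidate splits for one feature, and the best split across all features is selected afterwards.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def best_split_for_feature(x, residuals):
    """Return (gain, threshold) for the best variance-reduction split on one feature."""
    best = (-np.inf, None)
    for t in np.unique(x)[:-1]:
        left, right = residuals[x <= t], residuals[x > t]
        gain = len(residuals) * residuals.var() - (len(left) * left.var() + len(right) * right.var())
        best = max(best, (gain, t))
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))      # 4 features
residuals = rng.normal(size=300)   # pseudo-residuals from the previous round

# Evaluate each feature's candidate splits on a separate worker
with ThreadPoolExecutor() as pool:
    per_feature = list(pool.map(lambda j: best_split_for_feature(X[:, j], residuals),
                                range(X.shape[1])))

best_feature = int(np.argmax([gain for gain, _ in per_feature]))
print("best feature:", best_feature, "(gain, threshold):", per_feature[best_feature])
```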
Regularization parameters built into Objective Function
To address the tendency to overfit in Gradient Boosted Trees, XGBoost has many regularization hyperparameters that can help control the growth of a tree, but the key distinction between Gradient Boost and XGBoost is their objective function.
XGBoost has regularization parameters built into its objective function.
Please note that there is a difference between hyperparameters and parameters: the former are settings chosen by the user rather than learned by the algorithm, while the latter are values learned during training.
This is the objective function of XGBoost:
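Since the equation image may not render here, the standard form of the objective at round t (with l denoting the training loss and Ω the regularization term) can be written as:

```latex
\text{obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k)
```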
The first term is the training loss and the second term is the regularization. While minimizing the training loss encourages the tree to grow, the regularization term controls the complexity of the tree.
Now let's focus on the second term of the objective function, the regularization term Ω.
The complexity of a tree, Ω(f_t), is calculated in two parts.
The first part is a γ-weighted T, where T is the number of leaves in the tree.
The second part is a λ-weighted squared L2 norm of the leaf weights.
In the equation, gamma (γ) and lambda (λ) are user-defined hyperparameters.
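In the Python API these map to real hyperparameters (gamma and reg_lambda in the scikit-learn wrapper, a.k.a. min_split_loss and lambda in the native parameter names); a quick illustrative example:

```python
from xgboost import XGBRegressor

# gamma penalizes each additional leaf; reg_lambda is the L2 penalty on leaf weights.
model = XGBRegressor(n_estimators=100, max_depth=4, gamma=1.0, reg_lambda=1.0)
```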
The following illustrates the calculation of Ω(f_t):
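Written out, and worked through with hypothetical numbers for illustration, the complexity term is:

```latex
\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}
% Hypothetical example: a tree with T = 3 leaves, leaf weights w = (2, -1, 0.5),
% and hyperparameters gamma = 1, lambda = 1:
% Omega(f_t) = 1*3 + 0.5*1*(2^2 + (-1)^2 + 0.5^2) = 3 + 2.625 = 5.625
```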
Second-order derivative and Taylor Expansion
Other than being a regularized version of Gradient Boost, XGBoost employs functional gradient descent using derivatives up to second order.
This makes it more precise in finding the next best learner.
It also uses Taylor expansion to approximate the loss, which increases the flexibility of objective function choices and speeds up calculation.
We will look at the objective function again but focus on how Taylor expansion is applied to get the final form of XGBoost’s objective function.
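Reconstructed here in case the equation images do not render, the second-order Taylor expansion of the loss around the previous round's prediction (with constant terms dropped) gives:

```latex
\text{obj}^{(t)} \approx \sum_{i=1}^{n}\left[ g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t^{2}(x_i) \right] + \Omega(f_t),
\quad\text{where}\quad
g_i = \partial_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right),
\qquad
h_i = \partial^{2}_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right).
```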
(1) Note that the objective function now depends only on g_i and h_i. Therefore, XGBoost can support custom loss functions: it simply needs one solver that takes g_i and h_i as inputs to optimize any loss function. This is how XGBoost uses Taylor expansion to allow flexibility in loss function choices (a minimal sketch follows these notes).
(2) Note that for the current round (model t), the values of g_i and h_i stay fixed, as they are simply the first- and second-order derivatives of the loss at the previous iteration with respect to the predictions. This means g_i and h_i can be computed once before the split search starts in the current round and simply plugged into the objective function. This largely reduces computing cost while the algorithm evaluates different tree splits and is how XGBoost uses Taylor expansion to accelerate training.
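As a concrete sketch of point (1), here is a minimal custom objective (squared log error, with random illustrative data) passed to xgb.train; the function only has to return g_i and h_i for each sample:

```python
import numpy as np
import xgboost as xgb

def squared_log_error(preds, dtrain):
    """Return per-sample first (g_i) and second (h_i) derivatives of the squared-log-error loss."""
    labels = dtrain.get_label()
    preds = np.maximum(preds, -1 + 1e-6)  # keep log1p well-defined
    grad = (np.log1p(preds) - np.log1p(labels)) / (preds + 1)
    hess = (-np.log1p(preds) + np.log1p(labels) + 1) / (preds + 1) ** 2
    return grad, hess

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = rng.random(200)
dtrain = xgb.DMatrix(X, label=y)

# XGBoost's solver only consumes g_i and h_i, so any twice-differentiable loss works.
booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain, num_boost_round=20,
                    obj=squared_log_error)
```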
Other Features
Besides what has been mentioned above, XGBoost has other design features that make it significantly faster: sparsity-aware split finding, cache-aware access, and block compression and sharding.
These designs mostly improve data reading time, memory usage, and storage.
As these topics are mostly related to the field of computer science, they will not be discussed in detail here.
(Note: Sparsity-aware split finding is also related to how XGBoost handles missing data.)