XGBoost Math Intuition Summary

Summer Hu · Published in The Startup · Jan 12, 2021

A complete explanation of XGBoost, the most popular gradient boosting model

Cover photo: Puma Punku, Bolivia

XGBoost is derived from the Gradient Boosting Machine (GBM). Compared with GBM, XGBoost introduces a different way to train each weak learner in the ensemble, so let’s start from there.

Suppose we have the following boosting model consisting of K trees.
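
In the notation of the referenced XGBoost paper, the ensemble prediction is the sum of the K tree outputs:

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)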

where fₖ is the k-th tree in the ensemble and xᵢ is the i-th data point.

Similarly, the prediction at the t-th step can be defined as
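
With ŷᵢ⁽ᵗ⁾ denoting the prediction after t trees, the standard additive form is:

\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)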

Training Objective

Below is the objective function used to train the t-th weak learner; let’s call it the step-t objective function.
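
In this notation (following the referenced paper), the step-t objective is:

\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \sum_{i=1}^{t} \Omega(f_i)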

l is the loss function (e.g. MSE for regression, cross-entropy for classification)

Ω(fᵢ) is the regularization term of the i-th weak learner

Taylor Expansion on Training Objective

In the step-t objective function above, the previous prediction ŷ⁽ᵗ⁻¹⁾ can be treated as the variable and the t-th weak learner fₜ(x) as the delta change, so XGBoost uses a second-order Taylor expansion to approximate the step-t objective function.
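
Writing gᵢ and hᵢ for the first- and second-order derivatives of the loss with respect to the previous prediction, the standard second-order approximation is:

\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \sum_{i=1}^{t} \Omega(f_i)

g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big)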

In the above equation, at the current step t, the step t−1 prediction ŷ⁽ᵗ⁻¹⁾ and the regularization terms of all trees before t are known, so they are constants in the step-t objective function. Removing the constant terms (they do not affect the optimization), we get
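
Dropping l(yᵢ, ŷᵢ⁽ᵗ⁻¹⁾) and the regularization of the already-fixed trees f₁ … fₜ₋₁, the simplified step-t objective (as in the paper) is:

\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)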

Some Tree Mapping Function Definitions

j represents the j-th leaf of a tree (assume all leaves of a tree are indexed from 1)

Iⱼ represents the set of all data instances that fall into the tree’s j-th leaf, i.e. Iⱼ = { i | q(xᵢ) = j }

q(xᵢ) represents the function that maps data instance xᵢ to a tree leaf and returns that leaf’s index

wⱼ represents the j-th leaf’s score (weight or output)

So f(x) = w_q(x): the tree’s output for instance x is simply the score of the leaf that x falls into.

Objective Function Rewrite With Regularization
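
Using the leaf notation above, the standard tree-complexity penalty and the per-leaf regrouping from the XGBoost paper are:

\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \Big[ \big( \textstyle\sum_{i \in I_j} g_i \big)\, w_j + \tfrac{1}{2} \big( \textstyle\sum_{i \in I_j} h_i + \lambda \big)\, w_j^2 \Big] + \gamma T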

T is the number of leaves in the t-th weak learner tree fₜ(x)

γ, λ are regularization hyper-parameters

Optimize Objective Function

Now the step-t objective function is a function of the leaf weights wⱼ; the gᵢ and hᵢ values are known because they depend only on the loss function and the step t−1 predictions.

So we can use the equation below to find the best wⱼ that minimizes the objective function.
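
Taking the derivative of the per-leaf quadratic with respect to wⱼ and setting it to zero:

\frac{\partial \tilde{\mathcal{L}}^{(t)}}{\partial w_j} = \sum_{i \in I_j} g_i + \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j = 0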

The optimal wⱼ is
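
Writing Gⱼ = Σ_{i∈Iⱼ} gᵢ and Hⱼ = Σ_{i∈Iⱼ} hᵢ as shorthand (standard notation in the XGBoost paper):

w_j^{*} = -\frac{G_j}{H_j + \lambda}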

And the corresponding minimal objective value is
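
Substituting wⱼ* back in, which measures how good the current tree structure is:

\tilde{\mathcal{L}}^{(t)} = -\tfrac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T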

Weak Learner Tree Splitting Criteria

So far we have the step-t objective function. The next step is to build the t-th tree, and this tree should be constructed to reduce the objective value as much as possible.

To build such a tree, we only allow node splits that reduce the objective value, and at each node we look for the split that reduces it the most.

So for each candidate split we measure the reduction in objective value as: tree objective value (before the split) minus tree objective value (after the split).
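
In the same shorthand, the split gain from the XGBoost paper is:

Gain = \tfrac{1}{2} \Big[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \Big] - \gamma

where G_L, H_L and G_R, H_R sum gᵢ and hᵢ over the left and right child leaves, and γ penalizes adding one more leaf.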

Gain is how much the objective value is reduced by the split.

Iₗ is the instance set of the left child leaf after the split

Iᵣ is the instance set of the right child leaf after the split

I = Iₗ ∪ Iᵣ is the instance set of the parent leaf

For simplicity, each leaf can calculate its Similarity Score
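
A common convention (used in the StatQuest videos referenced below) drops the constant 1/2 factor and defines, for a leaf containing instance set I:

Similarity = \frac{\big( \sum_{i \in I} g_i \big)^2}{\sum_{i \in I} h_i + \lambda} = \frac{G^2}{H + \lambda}

The 1/2 factor does not change which split is best, and the γ penalty is applied separately when deciding whether to keep or prune a split.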

Splitting gain can be expressed as

Left(Similarity Score) + Right(Similarity Score) - Parent(Similarity Score)

Simplified Summary For Regression and Classification

We know that calculating a node’s similarity score and a leaf’s output wⱼ depends on the chosen loss function, because gᵢ and hᵢ are the first- and second-order derivatives of that loss function.

StatQuest with Josh Starmer gives a good simplified summary for quick reference.

  1. The Similarity Score applies to every node in the tree
  2. The Output Value is the leaf node output (wⱼ)

For regression, the chosen loss function is based on MSE
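
If we take the squared-error loss in the form l = ½(yᵢ − ŷᵢ)² (which makes hᵢ = 1 and gᵢ the negative residual), the summary formulas for a leaf with N instances reduce to:

Similarity = \frac{\big( \sum \text{residual}_i \big)^2}{N + \lambda} \qquad Output\ (w_j) = \frac{\sum \text{residual}_i}{N + \lambda}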

For classification, the loss function is the log-loss (binary cross-entropy)
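
For binary labels yᵢ ∈ {0, 1} and predicted probability pᵢ (defined just below), the log-loss and its derivatives are:

l = -\big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \big], \qquad g_i = p_i - y_i, \qquad h_i = p_i (1 - p_i)

so the leaf formulas become:

Similarity = \frac{\big( \sum (y_i - p_i) \big)^2}{\sum p_i (1 - p_i) + \lambda} \qquad Output\ (w_j) = \frac{\sum (y_i - p_i)}{\sum p_i (1 - p_i) + \lambda}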

where pᵢ = Sigmoid(ŷᵢ): the output class probability is obtained by applying the logistic (sigmoid) transformation to the raw tree-ensemble prediction ŷᵢ (the log-odds).

XGBoost Training Features

  1. When searching for the best feature value for a node split, XGBoost provides an option to search over the feature value’s quantiles or histogram instead of trying every feature value (see the sketch after this list).
  2. When building a feature histogram, XGBoost may distribute the feature data across multiple machines to compute partial histograms and then merge them into an aggregate histogram, similar to a Hadoop map-reduce operation; the generated histogram is cached for the next split.
  3. XGBoost can automatically handle missing feature values. At each node split, XGBoost assigns all missing-value instances to either the left or the right child, depending on which side yields the larger gain.
  4. XGBoost provides many hyper-parameters to deal with overfitting (e.g. max_depth, min_child_weight, gamma, lambda, subsample, colsample_bytree).
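
As a minimal, illustrative sketch of the histogram-based split search (point 1) and native missing-value handling (point 3), on synthetic data; tree_method='hist' selects histogram/quantile-based splitting and max_bin controls the number of bins:

import numpy as np
import xgboost as xgb

# Synthetic regression data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=1000)

# Inject ~10% missing values; at each split XGBoost learns a default
# direction (left or right child) for instances missing that feature
X[rng.random(X.shape) < 0.1] = np.nan

# tree_method='hist' searches splits over feature histograms instead of
# every distinct feature value; max_bin sets the histogram resolution
model = xgb.XGBRegressor(tree_method='hist', max_bin=64, n_estimators=50, max_depth=4)
model.fit(X, y)
print(model.predict(X[:3]))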

XGBoost Example with Python

Install Library

pip install xgboost

Example

import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older scikit-learn
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the Boston housing data and split into train/test sets
boston = load_boston()
data = pd.DataFrame(boston.data)
data['PRICE'] = boston.target
X, y = data.iloc[:, :-1], data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

# Train a 50-tree XGBoost regressor
xg_reg = xgb.XGBRegressor(
    n_estimators=50,
    max_depth=4,
    learning_rate=0.3)
xg_reg.fit(X_train, y_train)

# Evaluate on the test set
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % rmse)
# RMSE: 3.509045

Plot the last tree (of the 50 trees) in the ensemble

import os
# Make the Graphviz binaries visible to xgboost.plot_tree (Windows example path)
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz/bin'
plt.rcParams['figure.figsize'] = [50, 10]
xgb.plot_tree(xg_reg, num_trees=49)  # trees are 0-indexed, so 49 is the last of the 50
plt.show()

Plot feature importance

# Per-feature importance from the trained booster, measured by average split gain
feature_imp = xg_reg.get_booster().get_score(importance_type='gain')

# Sort features by importance (ascending, so the largest bar ends up on top)
sorted_feature_imp = np.array(sorted(feature_imp.items(), key=lambda kv: float(kv[1])))

# The keys are the DataFrame's integer column names, so map them back to feature names
plt.barh(
    boston.feature_names[sorted_feature_imp[:, 0].astype(int)],
    sorted_feature_imp[:, 1].astype(float))
plt.xlabel("Feature Importance")
plt.show()

The importance_type can be

  1. weight : The number of times a feature is used to split the data across all trees.
  2. gain : The average gain across all splits the feature is used in.
  3. cover : The average coverage across all splits the feature is used in.
  4. total_gain: The total gain across all splits the feature is used in.
  5. total_cover: The total coverage across all splits the feature is used in.

REFERENCE

  1. XGBoost: A Scalable Tree Boosting System
  2. Kaggle Winning Solution Xgboost Algorithm — Learn from Its Author, Tong He
  3. XGBoost Documentation
  4. StatQuest with Josh Starmer XGBoost Part 1: Regression
  5. StatQuest with Josh Starmer XGBoost Part 2: Classification
  6. StatQuest with Josh Starmer XGBoost Part 3: Mathematical Details
