- Part 1. Shapley Value
- Part 2. Shapley Value as Feature Contribution
- Part 3. KernelSHAP
- Part 4. TreeSHAP
- Part 5. Python Example
Calculating Shapley values to get feature contributions is computationally expensive. Why?
- From the explanations in Parts 1 and 2, we know that to calculate one feature's contribution (Shapley value) for one instance, the calculation must be repeated over all possible feature subsets that do not contain the target feature.
- For each feature subset, we need to fill in the missing features; if we sample based on the marginal feature distribution, then one subset may generate a large number of instances, as the sketch below illustrates.
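To make the explosion concrete, here is a tiny illustrative snippet (my own, not from the original article) that estimates how many model evaluations one exact Shapley value would need, assuming N background samples are used to fill in each subset's missing features:

```python
def exact_shapley_cost(M, N):
    """Rough cost of computing one feature's exact Shapley value:
    2^(M - 1) feature subsets not containing the target feature,
    each filled in with N background samples."""
    return 2 ** (M - 1) * N

# With only 4 features the cost is tiny...
print(exact_shapley_cost(M=4, N=100))   # 800 model evaluations
# ...but it explodes as the feature count grows.
print(exact_shapley_cost(M=50, N=100))  # ~5.6e16 model evaluations
```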
So, to improve the efficiency of Shapley value computation, a method called KernelSHAP was introduced in the paper A Unified Approach to Interpreting Model Predictions. In this part, KernelSHAP will be explored.
Step by Step KernelSHAP
The best way to learn an algorithm is to go through a detailed implementation and understand the intuition as much as possible, so please be patient and let's start.
Assume the trained model f(x1, x2, …, xM) has M features (in our example M = 4), and we investigate a data instance x, for example x = (10, 20, 30, 40).
Below are the steps of KernelSHAP.
1. Generate a vector with M elements, where each element is randomly chosen as 0 or 1. We use z′ to represent this vector, so one random example is z′ = (1, 0, 1, 0). Here 1 indicates that the feature is present, while 0 indicates that the feature is missing. In total we can generate 2^M different z′.
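As a minimal sketch of this step (the variable names are mine), all 2^M coalition vectors can be enumerated with itertools:

```python
from itertools import product

M = 4  # number of model features in our example

# Enumerate all 2^M coalition vectors z'
all_z = list(product([0, 1], repeat=M))
print(len(all_z))   # 16 = 2^4
print(all_z[10])    # (1, 0, 1, 0), the example above
```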
2. Map z′ back to the original feature space with the mapping function h(z′): each present feature (1 in z′) takes its value from instance x, and each missing feature (0 in z′) is filled with a value sampled from the dataset. For our example z′ = (1, 0, 1, 0), one mapping result is h(z′) = (10, 22, 30, 44), where 10 and 30 come from instance x, and 22 and 44 are randomly sampled values.
Note: the mapping h(z′) is one-to-many; the maximum number of mapped instances equals the size of the given dataset when missing-feature sampling is based on the marginal distribution.
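Here is a small sketch of the mapping h(z′) under marginal-distribution sampling; the helper name map_to_instances and the toy background data are my own illustration:

```python
import numpy as np

def map_to_instances(z, x, background):
    """Map a coalition vector z' to full instances:
    present features (1) take values from instance x,
    missing features (0) keep each background row's values."""
    z = np.asarray(z, dtype=bool)
    mapped = background.copy()
    mapped[:, z] = x[z]  # overwrite present features with x's values
    return mapped

x = np.array([10.0, 20.0, 30.0, 40.0])
background = np.array([[11.0, 22.0, 33.0, 44.0],
                       [12.0, 21.0, 31.0, 41.0]])
print(map_to_instances((1, 0, 1, 0), x, background))
# [[10. 22. 30. 44.]
#  [10. 21. 30. 41.]]
```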
3. Calculate the model prediction f(h(z′)).
Because h(z′) is a one-to-many mapping, one z′ can map to multiple different data instances, so f(h(z′)) is taken as the average of the prediction values over all the mapped instances: f(h(z′)) = E[f(h(z′))].
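Continuing the sketch, f(h(z′)) can be estimated by averaging the model's predictions over all mapped instances (model is assumed to be any fitted model with a scikit-learn-style predict method):

```python
def f_hz(model, z, x, background):
    """Estimate f(h(z')) = E[f(h(z'))] by averaging predictions
    over all instances the one-to-many mapping produces."""
    mapped = map_to_instances(z, x, background)
    return model.predict(mapped).mean()
```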
4. Compute the weight for each z′ with the SHAP kernel below (the intuition behind the kernel will be discussed in the next section; let's go ahead for now):
π(z′) = (M − 1) / (C(M, |z′|) · |z′| · (M − |z′|))
Here M is the number of model features, |z′| is the number of present features in z′, and C(M, |z′|) is the binomial coefficient.
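In code, the kernel follows directly from the definition (a sketch; note that the weight is infinite for the empty and full coalitions, which implementations handle separately):

```python
from math import comb

def shap_kernel(M, num_present):
    """SHAP kernel pi(z') with |z'| = num_present present features."""
    if num_present == 0 or num_present == M:
        return float("inf")  # empty/full coalitions get infinite weight
    return (M - 1) / (comb(M, num_present) * num_present * (M - num_present))

print(shap_kernel(4, 1))  # 0.25
print(shap_kernel(4, 2))  # 0.125
```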
5. Construct the linear regression model g(z′) below, where the coefficient φ0 represents the model's average prediction over all data instances, and φ1, φ2, …, φM represent the corresponding features' contributions:
g(z′) = φ0 + φ1·z′1 + φ2·z′2 + … + φM·z′M
Now comes the fancy part. Let's think about whether the equation below makes sense:
g(z′) = f(h(z′))
Yes, it makes sense to me. How about you? It took me some time to get it :)
g(z′) has M + 1 unknown coefficients (φ0, φ1, φ2, …, φM), and there are 2^M different z′ values. For each z′ we can calculate f(h(z′)), so we can use multivariable linear regression to find the values of (φ0, φ1, φ2, …, φM).
One more special thing: when solving the g(z′) linear regression, KernelSHAP introduces a custom loss function, which is actually the MSE loss weighted by the kernel π(z′):
L = Σ over all z′ of π(z′) · (f(h(z′)) − g(z′))²
6. To solve the regression, the training variables are the z′ vectors and the target is f(h(z′)). Solve the g(z′) linear regression to get (φ0, φ1, φ2, …, φM); problem solved.
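Putting steps 1 through 6 together, here is a minimal end-to-end sketch that reuses the helpers above and solves the weighted regression with scikit-learn's sample_weight argument; it approximates the two infinite-weight coalitions with a very large finite weight, whereas the paper treats them as exact constraints:

```python
import numpy as np
from itertools import product
from sklearn.linear_model import LinearRegression

def kernel_shap(model, x, background):
    M = len(x)
    Z = np.array(list(product([0, 1], repeat=M)))              # all 2^M coalitions
    y = np.array([f_hz(model, z, x, background) for z in Z])   # targets f(h(z'))
    w = np.array([shap_kernel(M, int(z.sum())) for z in Z])    # kernel weights
    w[np.isinf(w)] = 1e6  # approximate the infinite-weight constraints

    # Weighted least squares = MSE loss weighted by pi(z')
    reg = LinearRegression().fit(Z, y, sample_weight=w)
    # intercept_ ~ phi_0 (average prediction), coef_[i] ~ feature i's contribution
    return reg.intercept_, reg.coef_
```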
The above steps describe the intuition of KernelSHAP. Some descriptions may not be fully rigorous, but they are enough to convey the general idea of KernelSHAP. Please refer to the original paper (A Unified Approach to Interpreting Model Predictions) for more details.
SHAP Kernel Function π(z′) Intuition
Following the previous steps, the custom loss function is the MSE weighted by the kernel π(z′), which is defined as:
π(z′) = (M − 1) / (C(M, |z′|) · |z′| · (M − |z′|))
To understand the intuition of the kernel, let's plot π(z′) as a function of the coalition size |z′| based on its definition.
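As a sketch, the weights can be computed and plotted for, say, M = 10 (an arbitrary choice for illustration):

```python
import matplotlib.pyplot as plt
from math import comb

M = 10
sizes = list(range(1, M))  # |z'| = 0 and |z'| = M have infinite weight
weights = [(M - 1) / (comb(M, s) * s * (M - s)) for s in sizes]

plt.plot(sizes, weights, marker="o")
plt.xlabel("|z'| (number of present features)")
plt.ylabel("kernel weight pi(z')")
plt.title("SHAP kernel pi(z') for M = 10")
plt.show()
```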
The plot indicates that both small coalitions (few 1's in z′) and large coalitions (many 1's in z′) get the largest weights, while mid-size coalitions get small weights.
This weight distribution makes the regression focus more on fitting small coalitions and large coalitions, and pay less attention to the other coalitions. In other words, the regression will be more accurate on small and large coalitions.
Why focus on small and large coalitions? Below is a very good explanation from the book Interpretable Machine Learning:
“The intuition behind it is: We learn most about individual features if we can study their effects in isolation. If a coalition consists of a single feature, we can learn about the features’ isolated main effect on the prediction. If a coalition consists of all but one feature, we can learn about this features’ total effect (main effect plus feature interactions). If a coalition consists of half the features, we learn little about an individual features contribution, as there are many possible coalitions with half of the features.”
Conclusion
In this part, we explored the intuition behind KernelSHAP. KernelSHAP is a model-agnostic interpretation method, so it can be used to interpret any machine learning model, but it is still very slow due to the large amount of computation. In Part 4, we will explore TreeSHAP, which is faster but can only be used for tree-based machine learning models such as decision trees, random forests, and gradient boosted trees.
REFERENCES
- Interpretable Machine Learning: https://christophm.github.io/interpretable-ml-book/shap.html
- A Unified Approach to Interpreting Model Predictions: https://arxiv.org/abs/1705.07874
- Consistent Individualized Feature Attribution for Tree Ensembles: https://arxiv.org/abs/1802.03888
- SHAP Part 3: Tree SHAP: https://medium.com/analytics-vidhya/shap-part-3-tree-shap-3af9bcd7cd9b
- PyData Tel Aviv Meetup: SHAP Values for ML Explainability — Adi Watzman: https://www.youtube.com/watch?v=0yXtdkIL3Xk
- The Science Behind InterpretML- SHAP: https://www.youtube.com/watch?v=-taOhqkiuIo
- Game Theory (Stanford) — 7.3 — The Shapley Value : https://www.youtube.com/watch?v=P46RKjbO1nQ
- Understanding SHAP for Interpretable Machine Learning: https://medium.com/ai-in-plain-english/understanding-shap-for-interpretable-machine-learning-35e8639d03db
- Kernel SHAP: https://www.telesens.co/2020/09/17/kernel-shap/
- Understanding the SHAP interpretation method: Kernel SHAP: https://data4thought.com/kernel_shap.html