Complete SHAP tutorial for model explanation Part 2. Shapley Value as Feature Contribution

6 min read · Jan 2, 2021

Analyse why and how Shapley value can be used as Feature Contribution for a trained Model

Image: Nazca lines, downloaded from wallpaperflare

Part 1. Shapley Value
Part 2. Shapley Value as Feature Contribution
Part 3. KernelSHAP
Part 4. TreeSHAP
Part 5. Python Example

Following Part 1, which explained what the Shapley value is and how to calculate it, this story explores two questions:

1. Why can the Shapley value be used as the feature contribution of a trained machine learning model?
2. How do we calculate the Shapley value, or feature contribution, from a trained model?

Why can Shapley values be feature contributions?

The Shapley value setup is: a coalition of players cooperates and obtains a certain overall gain from that cooperation (source: Wikipedia).

For a trained machine learning model, we can change the setup as:

A coalition of features cooperates, and obtains a certain overall prediction from that trained model.

The basic idea is that when we apply the Shapley value to a trained ML model and a given dataset, the model’s features act as the players and the model is the cooperation that produces the gain (the prediction), so each player’s Shapley value is that feature’s contribution.

From Part 1, we know the Shapley value satisfies the Efficiency axiom, which means that all the players’ Shapley values must add up to the total gain.

So we can apply this axiom to a trained ML model f(x) and construct the equation below:

$$\sum_{i=1}^{p} \phi_i = f(x) - E(f(X))$$

From https://christophm.github.io/interpretable-ml-book/shapley.html

p is the total number of features in the model
x is one instance from a given dataset
X is the whole given dataset
φi is feature i’s contribution (Shapley value) for instance x
f(x) is the trained model’s prediction for instance x, obtained by feeding x into the model
E(f(X)) is the average prediction over all instances in the dataset. For a given model and dataset, E(f(X)) is a constant; later in this series we use φ0 to denote E(f(X))

From the above equation we get:
The sum of all feature contributions for one instance (or observation) explains the difference between that instance’s prediction and the model’s average prediction.
If we use the model’s average prediction as the baseline, we can explain quantitatively how much each feature of an instance contributes to the instance’s total output.

The following example shows an instance’s feature contributions for a model with 5 features.

How to apply the Shapley value formula to calculate feature contributions?

Let’s go back to the Shapley value formula explained in Part 1 and interpret it for a trained ML model.

$$\phi_i(v) = \sum_{S \in N} \frac{|S|!\,(n - |S| - 1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr)$$

From https://en.wikipedia.org/wiki/Shapley_value

n is the total number of features
N contains all the possible feature subsets that do not contain feature i
S is one feature subset from N
v is the value function; here v(S) is the trained model’s prediction f(x) when only the features in S are present, and x is a model input instance
|S| is the number of non-missing features in subset S

Note: this formula calculates the Shapley value (contribution) of feature i for one instance. The idea of a per-instance Shapley value for a feature can be a little confusing, but it is important to remember that the Shapley value measures instance-level, or local, feature contribution.

Let’s assume the trained model f(x) has three features (x1, x2, x3), the instance we investigate is (1, 2, 3), and we want to calculate the contribution of feature x2 (whose value is 2 in this instance) to the prediction for instance (1, 2, 3). In this case:

n = 3
N = {{missing, missing, missing}, {1, missing, missing}, {missing, missing, 3}, {1, missing, 3}} // Here "missing" indicates that the subset does not include that feature; according to the Shapley value formula, feature x2 is not included in any subset
v(x) is the trained model prediction function f(x)
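
As a small illustration of the setup above, the sketch below enumerates the subsets in N for feature x2 and computes the weight that the Shapley value formula assigns to each one. The 0-based feature indices and the variable names are choices made for this example only.

```python
from itertools import combinations
from math import factorial

n = 3   # total number of features
i = 1   # 0-based index of x2, the feature being explained
others = [j for j in range(n) if j != i]   # indices of x1 and x3

# Every subset S of the other features, paired with its Shapley weight
# |S|! * (n - |S| - 1)! / n!
coalitions = []
for size in range(len(others) + 1):
    for S in combinations(others, size):
        weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
        coalitions.append((S, weight))

for S, weight in coalitions:
    print(S, round(weight, 4))
# The four weights are 1/3, 1/6, 1/6, 1/3, and they sum to 1.
```

The empty tuple corresponds to (missing, missing, missing), (0,) to (1, missing, missing), (2,) to (missing, missing, 3), and (0, 2) to (1, missing, 3).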

Wow, so far it looks like we know all the parameters the Shapley value formula needs to calculate feature x2’s Shapley value, right?

Yes and no, because one problem remains: how do we calculate the model prediction f(x) for input instances like (missing, missing, missing) or (1, missing, missing)? In other words, how do we calculate f(missing, missing, missing), f(1, missing, missing), or f(1, missing, 3)?
The missing values arise because the Shapley value formula uses subsets of features.

The answer is to randomly sample the missing feature values from the dataset and use the sampled values to fill in the missing features, like below:

(missing, missing, missing) becomes (x1 sample, x2 sample, x3 sample)

(1, missing, missing) becomes (1, x2 sample, x3 sample)
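
A minimal sketch of this filling step, assuming a small made-up dataset X and a hypothetical helper named fill_missing (neither comes from a library or from the original article). Each missing feature is drawn independently from its own column of the dataset.

```python
import numpy as np

# Hypothetical dataset with three features (x1, x2, x3); values are made up
X = np.array([[0.0, 1.0, 5.0],
              [2.0, 3.0, 1.0],
              [4.0, 0.0, 2.0]])

def fill_missing(instance, present, dataset, rng):
    """Keep the features whose indices are in `present`;
    replace every other feature with a random value from its dataset column."""
    filled = instance.astype(float).copy()
    for j in range(len(instance)):
        if j not in present:
            filled[j] = rng.choice(dataset[:, j])
    return filled

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])

print(fill_missing(x, [0], X, rng))  # (1, x2 sample, x3 sample)
print(fill_missing(x, [], X, rng))   # (x1 sample, x2 sample, x3 sample)
```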

Different Ways of Sampling to Fill Missing Features

To fill in the missing feature values for the prediction calculation, there are two general ways of sampling:

1. Sampling values from the missing feature’s Marginal Distribution

Just randomly pick a value from all of the feature’s available values in the given dataset.

One big problem with this sampling is that we may get unrealistic feature combinations. For example, assume the model’s features include age and years of education, and our instance has age 8 with years of education missing. If we randomly sample years of education to fill in the missing value, we may get 10 years of education, which is unrealistic for an 8-year-old child.

2. Sampling values from the missing feature’s Conditional Distribution

This sampling only allows picking the missing feature’s value from values that co-occur in the dataset with the instance’s other known feature values. Generally speaking, this way of sampling takes into account the dependency and correlation between the missing features and the known features.
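
The difference between the two ways of sampling can be sketched on the age and years of education example. The tiny dataset below is made up for illustration, and conditioning by exact match on age is a simplifying assumption; with continuous features one would more likely need nearby rows or a model of the conditional distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: columns are (age, years_of_education)
data = np.array([[8, 2],
                 [9, 3],
                 [8, 3],
                 [30, 16],
                 [45, 12],
                 [60, 10]])

age = 8  # known feature of the instance; years_of_education is missing

# Marginal: any education value in the dataset may be drawn, ignoring age
marginal_pool = data[:, 1]

# Conditional: only education values that co-occur with age 8 may be drawn
conditional_pool = data[data[:, 0] == age, 1]

marginal_draw = rng.choice(marginal_pool)
conditional_draw = rng.choice(conditional_pool)

print("marginal pool:   ", marginal_pool.tolist())     # includes 10, 12, 16
print("conditional pool:", conditional_pool.tolist())  # only 2 and 3
```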

Prediction Calculation with Sampling for Missing

Now we know that to calculate f(1, missing, missing), we use f(1, x2 sample, x3 sample).

To estimate f(1, missing, missing) accurately, we need to sample x2 and x3 more than once and average all the sampled predictions f(1, x2 sample, x3 sample) as the final estimate of f(1, missing, missing).

Ideally, we would run the model prediction on every possible combination of values for the missing features and then average the predictions as the estimate.
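
The averaging can be sketched as follows. The function f is a hypothetical stand-in for a trained model, the background rows are made up, and expected_prediction is a helper name invented for this example. Instead of random draws, it fills the missing features from every row of the dataset once, which is one simple way to average over the available samples.

```python
import numpy as np

def f(X):
    """Hypothetical trained model; stands in for any predict function."""
    X = np.atleast_2d(X)
    return 2.0 * X[:, 0] + X[:, 1] * X[:, 2]

# Made-up dataset used to fill in missing features
background = np.array([[0.0, 1.0, 5.0],
                       [2.0, 3.0, 1.0],
                       [4.0, 0.0, 2.0],
                       [1.0, 2.0, 2.0]])

def expected_prediction(instance, present, model, dataset):
    """Fix the `present` features to the instance's values, take all other
    features from each dataset row in turn, and average the predictions."""
    filled = dataset.copy()
    idx = np.asarray(present, dtype=int)
    filled[:, idx] = instance[idx]
    return model(filled).mean()

x = np.array([1.0, 2.0, 3.0])

# Estimate of f(1, missing, missing): x1 is fixed to 1, x2 and x3 are filled in
estimate = expected_prediction(x, [0], f, background)
print(estimate)  # 5.0 for this toy model and dataset
```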

We can generalize the above sampling-and-averaging process with the mathematical formula below:

$$f(Z_S, \text{missing}) = E\left[f(Z_S, Z_C) \mid Z_S\right]$$

From the SHAP paper https://arxiv.org/abs/1705.07874

Zs is the instance’s subset of known features, like (1, missing, missing)
Zc is the set of sampled missing features, so Zs + Zc makes a full feature set
E is the expectation, which sums and then averages all the different outputs of f(Zs, Zc)
If we assume there is no dependency between Zs and Zc, that is, the features are all independent, we can sample Zc from the missing features’ Marginal Distribution
If we assume there is a strong dependency or correlation between Zs and Zc, we can sample Zc from the missing features’ Conditional Distribution (Zc|Zs)
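
Putting the pieces together, here is a minimal sketch of an exact Shapley value calculation for the three-feature instance (1, 2, 3), under the independence assumption (missing features are filled from the dataset regardless of the known ones). The model, the background data, and the helper names value and shapley are all hypothetical, and brute-force enumeration like this is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial
import numpy as np

def model(X):
    """Hypothetical trained model; stands in for any predict function."""
    return 2.0 * X[:, 0] + X[:, 1] * X[:, 2]

# Made-up dataset used to fill in missing features
background = np.array([[0.0, 1.0, 5.0],
                       [2.0, 3.0, 1.0],
                       [4.0, 0.0, 2.0],
                       [1.0, 2.0, 2.0]])
x = np.array([1.0, 2.0, 3.0])
n = len(x)

def value(S):
    """Average prediction with the features in S fixed to the instance's values
    and all other features taken from the dataset rows."""
    filled = background.copy()
    idx = np.array(sorted(S), dtype=int)
    filled[:, idx] = x[idx]
    return model(filled).mean()

def shapley(i):
    """Weighted sum of feature i's marginal contributions over all subsets without i."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for size in range(n):
        for S in combinations(others, size):
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            total += w * (value(S + (i,)) - value(S))
    return total

phi = np.array([shapley(i) for i in range(n)])
phi0 = value(())  # average prediction over the dataset, E(f(X))

print("contributions:", phi)
print("phi0 + sum(phi):", phi0 + phi.sum())   # matches f(x)
print("f(x):", model(x[None, :])[0])
```

The last two printed values agree, which is the Efficiency property from the first section: the contributions add up to the difference between the instance’s prediction and the average prediction.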

Conclusion

In this part, we explored the intuition behind using the Shapley value as feature contribution and how to calculate feature contributions in theory. In Part 3, we will see practical ways to run the feature contribution calculation for a machine learning model.

REFERENCES

Interpretable Machine Learning: https://christophm.github.io/interpretable-ml-book/shap.html
A Unified Approach to Interpreting Model Predictions: https://arxiv.org/abs/1705.07874
Consistent Individualized Feature Attribution for Tree Ensembles: https://arxiv.org/abs/1802.03888
SHAP Part 3: Tree SHAP: https://medium.com/analytics-vidhya/shap-part-3-tree-shap-3af9bcd7cd9b
PyData Tel Aviv Meetup: SHAP Values for ML Explainability — Adi Watzman: https://www.youtube.com/watch?v=0yXtdkIL3Xk
The Science Behind InterpretML- SHAP: https://www.youtube.com/watch?v=-taOhqkiuIo
Game Theory (Stanford) — 7.3 — The Shapley Value : https://www.youtube.com/watch?v=P46RKjbO1nQ
Understanding SHAP for Interpretable Machine Learning: https://medium.com/ai-in-plain-english/understanding-shap-for-interpretable-machine-learning-35e8639d03db
Kernel SHAP: https://www.telesens.co/2020/09/17/kernel-shap/
Understanding the SHAP interpretation method: Kernel SHAP: https://data4thought.com/kernel_shap.html