website/versioned_docs/version-1.0.15/Explore Algorithms/Responsible AI/Interpreting Model Predictions.md
Interpretable Machine Learning helps developers, data scientists and business stakeholders in the organization gain a comprehensive understanding of their machine learning models. It can also be used to debug models, explain predictions and enable auditing to meet compliance with regulatory requirements.
Model-agnostic interpretation methods can be computationally expensive due to the multiple evaluations needed to compute the explanations. Model interpretation on Spark enables users to interpret a black-box model at massive scales with the Apache Spark™ distributed computing ecosystem. Various components support local interpretation for tabular, vector, image and text classification models, with two popular model-agnostic interpretation methods: LIME and Kernel SHAP.
Both LIME and Kernel SHAP are local interpretation methods. Local interpretation explains why does the model predict certain outcome for a given observation.
Both explainers extends from org.apache.spark.ml.Transformer. After setting up the explainer parameters, simply call the transform function on a DataFrame of observations to interpret the model behavior on these observations.
To see examples of model interpretability on Spark in action, take a look at these sample notebooks:
| Tabular models | Vector models | Image models | Text models | |
|---|---|---|---|---|
| LIME explainers | TabularLIME | VectorLIME | ImageLIME | TextLIME |
| Kernel SHAP explainers | TabularSHAP | VectorSHAP | ImageSHAP | TextSHAP |
All local explainers support the following params:
| Param | Type | Default | Description |
|---|---|---|---|
| targetCol | String | "probability" | The column name of the prediction target to explain (i.e. the response variable). This is usually set to "prediction" for regression models and "probability" for probabilistic classification models. |
| targetClasses | Array[Int] | empty array | The indices of the classes for multinomial classification models. |
| targetClassesCol | String | The name of the column that specifies the indices of the classes for multinomial classification models. | |
| outputCol | String | The name of the output column for interpretation results. | |
| model | Transformer | The model to be explained. |
All LIME based explainers (TabularLIME, VectorLIME, ImageLIME, TextLIME) support the following params:
| Param | Type | Default | Description |
|---|---|---|---|
| regularization | Double | 0 | Regularization param for the underlying lasso regression. |
| kernelWidth | Double | sqrt(number of features) * 0.75 | Kernel width for the exponential kernel. |
| numSamples | Int | 1000 | Number of samples to generate. |
| metricsCol | String | "r2" | Column name for fitting metrics. |
All Kernel SHAP based explainers (TabularSHAP, VectorSHAP, ImageSHAP, TextSHAP) support the following params:
| Param | Type | Default | Description |
|---|---|---|---|
| infWeight | Double | 1E8 | The double value to represent infinite weight. |
| numSamples | Int | 2 * (number of features) + 2048 | Number of samples to generate. |
| metricsCol | String | "r2" | Column name for fitting metrics. |
All tabular model explainers (TabularLIME, TabularSHAP) support the following params:
| Param | Type | Default | Description |
|---|---|---|---|
| inputCols | Array[String] | The names of input columns to the black-box model. | |
| backgroundData | DataFrame | A dataframe containing background data. It must contain all the input columns needed by the black-box model. |
All vector model explainers (VectorLIME, VectorSHAP) support the following params:
| Param | Type | Default | Description |
|---|---|---|---|
| inputCol | String | The names of input vector column to the black-box model. | |
| backgroundData | DataFrame | A dataframe containing background data. It must contain the input vector column needed by the black-box model. |
All image model explainers (ImageLIME, ImageSHAP) support the following params:
| Param | Type | Default | Description |
|---|---|---|---|
| inputCol | String | The names of input image column to the black-box model. | |
| cellSize | Double | 16 | Number that controls the size of the super-pixels. |
| modifier | Double | 130 | Controls the trade-off spatial and color distance of super-pixels. |
| superpixelCol | String | "superpixels" | The column holding the super-pixel decompositions. |
All text model explainers (TextLIME, TextSHAP) support the following params:
| Param | Type | Default | Description |
|---|---|---|---|
| inputCol | String | The names of input text column to the black-box model. | |
| tokensCol | String | "tokens" | The column holding the text tokens. |
TabularLIME| Param | Type | Default | Description |
|---|---|---|---|
| categoricalFeatures | Array[String] | empty array | The name of columns that should be treated as categorical variables. |
For categorical features,
TabularLIMEcreates new samples by drawing samples based on the value distribution from the background dataset. For numerical features, it creates new samples by drawing from a normal distribution with mean taken from the target value to be explained, and standard deviation taken from the background dataset.
TabularSHAPNo additional params are supported.
VectorLIMENo additional params are supported.
VectorLIMEassumes all features are numerical, and categorical features are not supported inVectorLIME.
VectorSHAPNo additional params are supported.
ImageLIME| Param | Type | Default | Description |
|---|---|---|---|
| samplingFraction | Double | 0.7 | The fraction of super-pixels to keep on during sampling. |
ImageLIMEcreates new samples by randomly turning super-pixels on or off with probability of keeping on set toSamplingFraction.
ImageSHAPNo additional params are supported.
TextLIME| Param | Type | Default | Description |
|---|---|---|---|
| samplingFraction | Double | 0.7 | The fraction of word tokens to keep on during sampling. |
TextLIMEcreates new samples by randomly turning word tokens on or off with probability of keeping on set toSamplingFraction.
TextSHAPNo additional params are supported.
LIME explainers return an array of vectors, and each vector maps to a class being explained. Each component of the vector is the coefficient for the corresponding feature, super-pixel, or word token from the local surrogate model.
SHAP explainers return an array of vectors, and each vector maps to a class being explained. Each vector starts with the base value, and each following component of the vector is the Shapley value for each feature, super-pixel, or token.
The base value and Shapley values are additive, and they should add up to the model output for the target observation.