Chapter 2 Introduction to Partial Dependence Plots (PDP) and Individual Conditional Expectation (ICE) | Limitations of Interpretable Machine Learning Methods (2023)

Authors: Thommy Dassen, Naiwen Hou, Veronika Kronseder

Supervisor: Gunnar König

2.1 Partial Dependence Plots (PDP)

The Partial Dependence Plot (PDP) is an intuitive and easy-to-understand visualization of the features' impact on the predicted outcome. If the assumptions for the PDP are met, it can show how a feature affects the outcome variable. More precisely, mapping the marginal effect of the selected variable(s) uncovers the linear, monotonic or nonlinear relationship between the predicted response and the individual feature variable(s) (Molnar 2019).

The underlying function can be described as follows:

Let \(x_S\) be the set of features of interest for the PDP and \(x_C\) the complement set which contains all other features. While the general model function \(f(x) = f(x_S, x_C)\) depends on all input variables, the partial dependence function marginalizes over the feature distribution in set C (Hastie, Tibshirani, and Friedman 2013):

\[f_{x_S}(x_S) = \mathbb{E}_{x_C}[f(x_S, x_C)]\]

The partial dependence function can be estimated by averaging predictions with actual feature values of \(x_C\) in the training data at given values of \(x_S\) or, in other words, it computes the marginal effect of \(x_S\) on the prediction. In order to obtain realistic results, a major assumption of the PDP is that the features in \(x_S\) and \(x_C\) are independent and thus uncorrelated.


\[\hat{f}_{x_S}(x_S)=\frac{1}{n}\sum_{i=1}^{n}f(x_S, x^{(i)}_{C})\]
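This estimator can be sketched directly: replace the \(x_S\) column with each grid value for every observation and average the predictions. The function name, toy model and data below are illustrative assumptions, not part of the chapter:

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Estimate the partial dependence of `predict` on one feature.

    For each grid value, the feature column is replaced by that value
    for every row in X, and the predictions are averaged over the
    remaining (x_C) columns.
    """
    X = np.asarray(X, dtype=float)
    pdp = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value          # set x_S to the grid value
        pdp.append(predict(X_mod).mean())  # average over x_C
    return np.array(pdp)

# Toy model: the prediction is 2*x0 + x1
model = lambda X: 2.0 * X[:, 0] + X[:, 1]
X = np.array([[1.0, 0.0], [2.0, 4.0], [3.0, 8.0]])
grid = [0.0, 1.0, 2.0]
print(partial_dependence(model, X, 0, grid))  # prints [4. 6. 8.]
```

For this additive toy model the PDP is simply \(2 x_0\) shifted by the mean of the other column, as expected.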

An example of a PDP based on the 'Titanic' data set, which contains information on the fate of 2224 passengers and crew members during the Titanic's maiden voyage, is given in figure 2.1.


FIGURE 2.1: PDP for predicted survival probability and the numeric feature 'Age'. The survival probability drops sharply at a young age and more moderately afterwards. The rug on the x-axis illustrates the distribution of observed training data.

When a feature is categorical, rather than continuous, the partial dependence function is modeled separately for all of the K different classes of said feature. It maps the predictions for each respective class at given feature values of \(x_S\) (Hastie, Tibshirani, and Friedman 2013).

For such categorical features, the partial dependence function and the resulting plot are produced by replacing all observed \(x_S\)-values with the respective category and averaging the predictions. This procedure is repeated for each of the feature's categories (Molnar 2019). As an example, figure 2.2 shows the partial dependence for the survival probability prediction for passengers on the Titanic and the categorical feature 'passenger class'.
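The procedure for a categorical feature can be sketched in the same way; the toy model below, in which column 0 is assumed to encode the passenger class, is invented for illustration:

```python
import numpy as np

def categorical_pdp(predict, X, feature, categories):
    """Partial dependence for a categorical feature: replace the feature
    column by each category in turn and average the predictions."""
    pdp = {}
    for cat in categories:
        X_mod = X.copy()
        X_mod[:, feature] = cat           # every passenger 'moved' to this class
        pdp[cat] = predict(X_mod).mean()  # average prediction for the class
    return pdp

# Toy model: survival probability falls with passenger class (column 0)
model = lambda X: 0.9 - 0.2 * X[:, 0]
X = np.array([[1.0, 22.0], [3.0, 35.0], [2.0, 58.0]])  # [class, age]
print(categorical_pdp(model, X, 0, [1, 2, 3]))
```

Each dictionary entry corresponds to one bar or point in a plot like figure 2.2.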


FIGURE 2.2: The PDP for survival probability and the categorical feature 'passenger class' reveals that passengers in lower classes had a lower probability to survive than those in a higher class.


2.1.1 Advantages and Limitations of Partial Dependence Plots

Partial Dependence Plots are easy to compute and a popular way to explain insights from black box Machine Learning models. Their intuitive character makes PDPs well suited for communicating results to a non-technical audience. However, due to limited visualization techniques and the restriction of human perception to a maximum of three dimensions, only one or two features can reasonably be displayed in one PDP (Molnar 2019). Figure 2.3 shows that the combination of one numerical (Age) and one categorical (Sex) feature still allows rather precise interpretation. The combination of two numerical features (Age & Fare) still works, but its colour intensity scale already degrades interpretability, as shown in figure 2.4.


FIGURE 2.3: Two-dimensional PDP for predicted survival probability and the numerical feature 'Age', together with the categorical feature 'Sex'. The PDP shows that while the survival probability for both genders declines as age increases, the decrease is much steeper for males.


FIGURE 2.4: Two-dimensional PDP for predicted survival probability and numerical features 'Age' and 'Fare'. The PDP illustrates that the survival probability of younger passengers is fairly uniform for varying fares, while adults travelling at a lower fare also had a much lower probability to survive compared to those that paid a high fare.

Drawing a PDP with one or two feature variables allows a straightforward interpretation of the marginal effects. This holds true as long as the features are not correlated. Should this independence assumption be violated, the partial dependence function will produce unrealistic data points. For instance, a correlation between height and weight could lead to a data point for someone taller than 2 meters who weighs less than 50 kilograms. Furthermore, opposite effects of heterogeneous subgroups might remain hidden through averaging the marginal effects, which could lead to wrong conclusions (Molnar 2019).

2.2 Individual Conditional Expectation Curves

While partial dependence plots provide the average effect of a feature, Individual Conditional Expectation (ICE) plots are a method to disaggregate these averages. ICE plots visualize the functional relationship between the predicted response and the feature separately for each instance. In other words, a PDP averages the individual lines of an ICE plot (Molnar 2019).

More formally, ICE plots can be derived by considering the estimated response function \(\hat{f}\) and the observations \(\{(x^{(i)}_S, x^{(i)}_C)\}^N_{i=1}\). The curve \(\hat{f}_S^{(i)}\) is plotted against the observed values of \(x^{(i)}_S\) for each observed instance, while \(x^{(i)}_C\) remains fixed at each point on the x-axis (Molnar 2019; Goldstein et al. 2013).

As shown in figure 2.5, each line represents one instance and visualizes the effect of varying the feature value \(x^{(i)}_S\) (Age) of a particular instance on the model's prediction, given that all other features remain constant (ceteris paribus). An ICE plot can highlight the variation in the fitted values across the range of a feature. This suggests where and to what extent heterogeneities might exist.
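A minimal sketch of this construction (the model and data below are invented for illustration) also shows how averaging the ICE curves recovers the PDP and can hide heterogeneity:

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """One curve per instance: vary x_S over the grid, keep x_C fixed."""
    curves = np.empty((len(X), len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value
        curves[:, j] = predict(X_mod)
    return curves

# Toy model with an interaction: the effect of feature 0 flips with feature 1
model = lambda X: X[:, 0] * X[:, 1]
X = np.array([[0.0, 1.0], [0.0, -1.0]])
grid = np.array([0.0, 1.0, 2.0])

curves = ice_curves(model, X, 0, grid)
print(curves)               # one rising line, one falling line
print(curves.mean(axis=0))  # the PDP: flat at 0, hiding the heterogeneity
```

Here the two ICE curves run in opposite directions, yet their average (the PDP) is identically zero, which is exactly the averaging pitfall described in section 2.1.1.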


FIGURE 2.5: ICE plot of survival probability by Age. The yellow line represents the average of the individual lines and is thus equivalent to the respective PDP. The individual conditional relationships indicate that there might be underlying heterogeneity in the complement set.

2.2.1 Centered ICE Plot

If the curves of an ICE plot are stacked or have a wide range of intercepts, it can be difficult to observe heterogeneity in the model. The so-called centered ICE plot (c-ICE plot) is a simple solution to this problem. The curves are centered at a certain point in the feature range and display only the difference in prediction relative to this point (Molnar 2019). After anchoring a location \(x^a\) in the range of \(x_s\) and connecting all prediction lines at that point, the new curves are defined as:

\[\hat{f}^{(i)}_{cent} = \hat{f}^{(i)} - \mathbf{1}\,\hat{f}(x^a,x^{(i)}_C)\]

Experience has shown that the most interpretable plots result when the anchor point \(x^a\) is chosen as the minimum or maximum of the observed values. Figure 2.6 shows the effect of centering the ICE curves of survival probability by Age at the minimum of observed ages in the 'Titanic' data set.
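The centering step itself is a one-line operation once the ICE curves are available; the example curves below are invented for illustration:

```python
import numpy as np

def centered_ice(curves, anchor=0):
    """Center each ICE curve at the anchor grid index so all lines start at 0."""
    return curves - curves[:, [anchor]]

# Two invented ICE curves over a 3-point grid, anchored at the first point
curves = np.array([[0.5, 0.40, 0.2],
                   [0.3, 0.35, 0.4]])
print(centered_ice(curves))  # every curve is 0 at the anchor
```

After centering, both lines start at 0, so their diverging shapes (one falling, one rising) become directly comparable, which is the point of figure 2.6.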


FIGURE 2.6: Centered ICE plot of survival probability by Age. All lines are fixed to 0 at the minimum observed age of 0.42. The y-axis shows the difference in survival probability relative to age 0.42. The centered ICE plot shows that, compared to age 0.42, the predictions for most passengers decrease as age increases. However, quite a few passengers show the opposite pattern.


2.2.2 Derivative ICE Plot

Another way to explore heterogeneity is to plot the partial derivative of \(\hat{f}\) with respect to \(x_s\). Assuming that \(x_s\) does not interact with the other predictors in the fitted model, the prediction function can be written as:

\[\hat{f}(x) = \hat{f}(x_s,x_C) = g(x_s) + h(x_C),\]

so that \[\frac{\partial{\hat{f}(\mathbf{x})}}{\partial x_s} = g'(x_s)\]

When no interactions are present in the fitted model, all curves in the d-ICE plot are equivalent and the plot shows a single line. When interactions do exist, the derivative lines will be heterogeneous. As it can be difficult to visually assess derivatives from ICE plots, it is useful to plot an estimate of the partial derivative directly (Goldstein et al. 2013).
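A numerical sketch of the d-ICE idea, using a finite-difference gradient rather than an analytic derivative (the example curves are invented): for an additive model the derivative curves coincide.

```python
import numpy as np

def derivative_ice(curves, grid):
    """Finite-difference estimate of each ICE curve's partial derivative."""
    return np.gradient(curves, grid, axis=1)

# Additive toy model g(x_s) = x_s^2: curves differ only by an intercept h(x_C)
grid = np.array([0.0, 1.0, 2.0])
curves = np.vstack([grid**2 + 1.0, grid**2 + 5.0])
d = derivative_ice(curves, grid)
print(d)  # both rows identical: no interaction between x_s and x_C
```

Because the intercepts \(h(x_C)\) cancel under differentiation, both derivative lines collapse onto \(g'(x_s)\); heterogeneous derivative lines would instead indicate an interaction.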

2.2.3 Advantages and Limitations of ICE Plots

The major advantage of ICE plots is that they are even more intuitive than PDPs, which enables data scientists to drill much deeper and explore individual differences. This may help to identify subgroups and interactions between model inputs. However, ICE plots also have some disadvantages. Firstly, only one feature can be plotted meaningfully in an ICE plot; otherwise, overplotting would make it hard to distinguish anything in the plot. Secondly, just like PDPs, ICE plots for correlated features may produce invalid data points. Finally, without additionally plotting the PDP it might be difficult to see the average in ICE plots (Molnar 2019).


References

Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2013. “Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.” Journal of Computational and Graphical Statistics 24 (September). doi:10.1080/10618600.2014.907095.

Hastie, T., R. Tibshirani, and J. Friedman. 2013. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer New York. https://books.google.de/books?id=yPfZBwAAQBAJ.

Molnar, Christoph. 2019. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.
