EstimatorReport: Inspecting your models with the feature importance#

In this example, we tackle the California housing dataset, where the goal is to perform a regression task: predicting house prices based on features such as the number of bedrooms, the geolocation, etc. For that, we try out several families of models and evaluate them using skore’s EstimatorReport and its metrics reporting.

See also

As shown in EstimatorReport: Get insights from any scikit-learn estimator, the EstimatorReport has a metrics() accessor that enables you to evaluate your models and look at some scores that are automatically computed for you.

Here, we go beyond predictive performance and inspect these models to better interpret their behavior, using feature importance. Indeed, in practice, inspection can help spot flaws in models: it is always recommended to look “under the hood”. For that, we use the unified feature_importance() accessor of the EstimatorReport. For linear models, we look at their coefficients. For tree-based models, we inspect their mean decrease in impurity (MDI). We can also inspect the permutation feature importance, which is model-agnostic.

Loading the dataset and performing some exploratory data analysis (EDA)#

Let us load the California housing dataset, which will enable us to perform a regression task about predicting house prices:
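
For reference, the dataset can be loaded with scikit-learn's fetch_california_housing; a minimal sketch, using as_frame=True so that we work with pandas objects:

from sklearn.datasets import fetch_california_housing

# Fetch the dataset as pandas objects: `frame` holds features and target together,
# while `data` and `target` hold them separately.
california_housing = fetch_california_housing(as_frame=True)
X, y = california_housing.data, california_housing.target
california_housing.frame.head(2)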

MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
0 8.3252 41.0 6.984127 1.02381 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.97188 2401.0 2.109842 37.86 -122.22 3.585


The documentation of the California housing dataset explains that the dataset contains aggregated data regarding each district in California in 1990 and the target (MedHouseVal) is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). Note that there are some vacation resorts, with a large number of rooms and bedrooms.

See also

For more information about the California housing dataset, refer to scikit-learn MOOC’s page. Moreover, a more advanced modelling of this dataset is performed in this skops example.

Table report#

Let us perform some quick exploration on this dataset:
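
One way to generate such an interactive overview is skrub's TableReport; a minimal sketch, assuming skrub is installed:

from skrub import TableReport

# Interactive exploration report of the full dataframe (features and target).
TableReport(california_housing.frame)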




From the table report, we can draw some key observations:

  • Looking at the Stats tab, all features are numerical and there are no missing values.

  • Looking at the Distributions tab, we can notice that some features seem to have some outliers: MedInc, AveRooms, AveBedrms, Population, and AveOccup. The feature with the largest number of potential outliers is AveBedrms, probably corresponding to vacation resorts.

  • Looking at the Associations tab, we observe that:

    • The target feature MedHouseVal is mostly associated with MedInc, Longitude, and Latitude. Indeed, intuitively, people with a large income would live in areas where the house prices are high. Moreover, we can expect some of these expensive areas to be close to one another.

    • The strength of these associations is not very high, which indicates that no single feature can correctly predict the target on its own. Given that MedInc is associated with Longitude and also Latitude, it might make sense to include some interactions between these features in our modelling: linear combinations might not be enough.

Target feature#

The target distribution has a long tail:

import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(
    data=california_housing.frame, x=california_housing.target_names[0], bins=100
)
plt.show()
[Figure: histogram of the target MedHouseVal]

There seems to be a threshold effect for high-valued houses: all houses with a price above $500,000 are given the value $500,000. We keep these clipped values in our data and will inspect how our models deal with them.
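
As a quick sanity check, we can measure the share of districts whose target hits this cap, assuming the clip value is \(5\) (i.e., $500,000):

# Proportion of targets at or above the assumed clip value of 5.0 (in units of $100,000).
print(f"Share of clipped targets: {(california_housing.target >= 5).mean():.1%}")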

Now, as the median income MedInc is the feature with the highest association with our target, let us assess how MedInc relates to MedHouseVal:

import pandas as pd
import plotly.express as px

X_y_plot = california_housing.frame.copy()
X_y_plot["MedInc_bins"] = pd.qcut(X_y_plot["MedInc"], q=5)
bin_order = X_y_plot["MedInc_bins"].cat.categories.sort_values()
fig = px.histogram(
    X_y_plot,
    x=california_housing.target_names[0],
    color="MedInc_bins",
    category_orders={"MedInc_bins": bin_order},
)
fig


As could have been expected, a higher income often comes with a more expensive house. We can also notice the clipping effect of house prices for the highest incomes.

Geospatial features#

From the table report, we noticed that the geospatial features Latitude and Longitude were well associated with our target. Hence, let us look into the coordinates of the districts in California, with regards to the target feature, using a map:

def plot_map(df, color_feature):
    fig = px.scatter_mapbox(
        df, lat="Latitude", lon="Longitude", color=color_feature, zoom=5, height=600
    )
    fig.update_layout(
        mapbox_style="open-street-map",
        mapbox_center={"lat": df["Latitude"].mean(), "lon": df["Longitude"].mean()},
        margin={"r": 0, "t": 0, "l": 0, "b": 0},
    )
    return fig
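
The map described below is presumably produced by calling this helper on the full dataframe, colored by the target; a minimal call sketch:

# Color each district by its median house value (the target).
plot_map(california_housing.frame, california_housing.target_names[0])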


As could be expected, the price of the houses near the ocean is higher, especially around big cities like Los Angeles, San Francisco, and San Jose. Taking into account the coordinates in our modelling will be very important.

Splitting the data#

Just before diving into our first model, let us split our data into a train and a test split:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Linear models: coefficients#

For our regression task, we first use linear models. For feature importance, we inspect their coefficients.

Simple model#

Before trying any complex feature engineering, we start with a simple pipeline to have a baseline of what a “good score” is (remember that all scores are relative). Here, we use a Ridge regression along with some scaling and evaluate it using skore.EstimatorReport.metrics():

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from skore import EstimatorReport

ridge_report = EstimatorReport(
    make_pipeline(StandardScaler(), Ridge()),
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)
ridge_report.metrics.report_metrics()
Metric              Ridge
R²                  0.591163
RMSE                0.735134
Fit time (s)        0.003606
Predict time (s)    0.001287


From the report metrics, let us first explain the scores we have access to:

  • The coefficient of determination (r2_score()), denoted as \(R^2\), which is a score. The best possible score is \(1\) and a constant model that always predicts the average value of the target would get a score of \(0\). Note that the score can be negative, as a model can be arbitrarily worse than always predicting the average.

  • The root mean squared error (root_mean_squared_error()), abbreviated as RMSE, which is an error. It takes the square root of the mean squared error (MSE) so it is expressed in the same units as the target variable. The MSE measures the average squared difference between the predicted values and the actual values.

Here, the \(R^2\) seems quite poor, so some further preprocessing would be needed. This is done further down in this example.

Warning

Keep in mind that any observation drawn from inspecting the coefficients of this simple Ridge model is made on a model that performs quite poorly, hence must be treated with caution. Indeed, a poorly performing model does not capture the true underlying relationships in the data. A good practice would be to avoid inspecting models with poor performance. Here, we still inspect it, for demo purposes and because our model is not put into production!

Let us plot the prediction error:

ridge_report.metrics.prediction_error().plot(kind="actual_vs_predicted")
[Figure: actual vs. predicted values for the vanilla Ridge model]

We can observe that the model has issues predicting large house prices, due to the clipping effect of the actual values.

Now, to inspect our model, let us use the skore.EstimatorReport.feature_importance() accessor:

ridge_report.feature_importance.coefficients()
Coefficient
Intercept 2.074373
MedInc 0.831864
HouseAge 0.121025
AveRooms -0.261571
AveBedrms 0.303819
Population -0.008702
AveOccup -0.029855
Latitude -0.891545
Longitude -0.863022


Note

Beware that coefficients can be misleading when some features are correlated. For example, two coefficients can have large absolute values (so be considered important), but in the predictions, the sum of their contributions could cancel out (if they are highly correlated), so they would actually be unimportant.

We can plot this pandas dataframe:

ridge_report.feature_importance.coefficients().plot.barh(
    title="Model weights",
    xlabel="Coefficient",
    ylabel="Feature",
)
plt.tight_layout()
Model weights

Note

More generally, skore.EstimatorReport.feature_importance.coefficients() can help you inspect the coefficients of all linear models. We consider a linear model as defined in scikit-learn’s documentation. In short, we consider a “linear model” as a scikit-learn compatible estimator that holds a coef_ attribute (after being fitted).

Since we have included scaling in the pipeline, the resulting coefficients are all on the same scale, making them directly comparable to each other. Without this scaling step, the coefficients in a linear model would be influenced by the original scale of the feature values, which would prevent meaningful comparisons between them.

See also

For more information about the importance of scaling, see scikit-learn’s example on Common pitfalls in the interpretation of coefficients of linear models.

Here, it appears that the MedInc, Latitude, and Longitude features are the most important, judging by the absolute values of the coefficients. This finding is consistent with our previous observations from the Associations tab of the table report.

However, due to the scaling, we cannot interpret the coefficient values with regards to the original units of the features. Let us unscale the coefficients, without forgetting the intercept, so that they can be interpreted using the original units:

import numpy as np

# retrieve the mean and standard deviation used to standardize the feature values
feature_mean = ridge_report.estimator_[0].mean_
feature_std = ridge_report.estimator_[0].scale_


def unscale_coefficients(df, feature_mean, feature_std):
    mask_intercept_column = df.index == "Intercept"
    df.loc["Intercept"] = df.loc["Intercept"] - np.sum(
        df.loc[~mask_intercept_column, "Coefficient"] * feature_mean / feature_std
    )
    df.loc[~mask_intercept_column, "Coefficient"] = (
        df.loc[~mask_intercept_column, "Coefficient"] / feature_std
    )
    return df


df_ridge_report_coef_unscaled = unscale_coefficients(
    ridge_report.feature_importance.coefficients(), feature_mean, feature_std
)
df_ridge_report_coef_unscaled
Coefficient
Intercept -36.573930
MedInc 0.439072
HouseAge 0.009606
AveRooms -0.103240
AveBedrms 0.616257
Population -0.000008
AveOccup -0.004490
Latitude -0.416969
Longitude -0.430202


Now, we can interpret each coefficient value with regards to the original units. We can interpret a coefficient as follows: according to our model, on average, having one additional bedroom (an increase of \(1\) in AveBedrms), with all other features held constant, increases the predicted house value by \(0.62\) in units of $100,000, hence by $62,000. Note that we have not dealt with any potential outliers in this iteration.

Warning

Recall that we are inspecting a model with poor performance, which is bad practice. Moreover, we must be cautious when trying to induce any causation effect (remember that correlation is not causation).

More complex model#

As previously mentioned, our simple Ridge model, although easily interpretable with regards to the original units of the features, performs quite poorly. Now, we build a more complex model, with more feature engineering. We will see that this model has a better score, but that its coefficients are harder to interpret with regards to the original features, due to the complex feature engineering.

In our previous EDA, when plotting the geospatial data with regards to the house prices, we noticed that it is important to take into account the latitude and longitude features. Moreover, we also observed that the median income is well associated with the house prices. Hence, we will try some feature engineering that takes into account the interactions of the geospatial features with features such as the income, using polynomial features. The resulting model is no longer purely linear in the original features, as it was previously.

Let us build a model with some more complex feature engineering, and still use a Ridge regressor (linear model) at the end of the pipeline. In particular, we perform a K-means clustering on the geospatial features:

from sklearn.cluster import KMeans
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer

geo_columns = ["Latitude", "Longitude"]

preprocessor = make_column_transformer(
    (KMeans(n_clusters=10, random_state=0), geo_columns),
    remainder="passthrough",
)
engineered_ridge = make_pipeline(
    preprocessor,
    SplineTransformer(sparse_output=True),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    Ridge(),
)
engineered_ridge
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('kmeans',
                                                  KMeans(n_clusters=10,
                                                         random_state=0),
                                                  ['Latitude', 'Longitude'])])),
                ('splinetransformer', SplineTransformer(sparse_output=True)),
                ('polynomialfeatures',
                 PolynomialFeatures(include_bias=False, interaction_only=True)),
                ('ridge', Ridge())])


Now, let us compute the metrics and compare them to those of our previous model using a skore.ComparisonReport:
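
A sketch of how this comparison might be built follows; we assume here that ComparisonReport accepts a dictionary mapping display names to reports, which would explain the column names below:

from skore import ComparisonReport

# Wrap the engineered pipeline in its own report, fitted on the same split.
engineered_ridge_report = EstimatorReport(
    engineered_ridge,
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)

# Compare both reports side by side; the dictionary keys become the column names.
reports_to_compare = {
    "Vanilla Ridge": ridge_report,
    "Ridge w/ feature engineering": engineered_ridge_report,
}
comparator = ComparisonReport(reports=reports_to_compare)
comparator.metrics.report_metrics()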

Metric              Vanilla Ridge    Ridge w/ feature engineering
R²                  0.591163         0.726869
RMSE                0.735134         0.600865
Fit time (s)        0.003606         9.723627
Predict time (s)    0.001287         0.221695


We get a much better score! Let us plot the prediction error:

engineered_ridge_report.metrics.prediction_error().plot(kind="actual_vs_predicted")
[Figure: actual vs. predicted values for the engineered Ridge model]

Regarding the clipping issue: compared to the prediction error of our previous model (ridge_report), our engineered_ridge_report model seems to produce fewer overly large predictions, so the interactions between features seem to have helped alleviate the clipping issue.

However, interpreting the features is harder: indeed, our complex feature engineering introduced a lot of features:

print("Initial number of features:", X_train.shape[1])

# We slice the scikit-learn pipeline to extract the predictor, using -1 to access
# the last step:
n_features_engineered = engineered_ridge_report.estimator_[-1].n_features_in_
print("Number of features after feature engineering:", n_features_engineered)
Initial number of features: 8
Number of features after feature engineering: 6328

Let us display the 15 largest absolute coefficients:

engineered_ridge_report.feature_importance.coefficients().sort_values(
    by="Coefficient", key=abs, ascending=True
).tail(15).plot.barh(
    title="Model weights",
    xlabel="Coefficient",
    ylabel="Feature",
)
plt.tight_layout()
Model weights

We can observe that the most important features are interactions between features, mostly based on AveOccup, that a simple linear model without feature engineering could not have captured. Indeed, the vanilla Ridge model did not consider AveOccup to be important. As the engineered Ridge has a better score, perhaps the vanilla Ridge missed something about AveOccup that seems to be key to predicting house prices.

Let us visualize how AveOccup interacts with MedHouseVal:

X_y_plot = california_housing.frame.copy()
X_y_plot["AveOccup"] = pd.qcut(X_y_plot["AveOccup"], q=5)
bin_order = X_y_plot["AveOccup"].cat.categories.sort_values()
fig = px.histogram(
    X_y_plot,
    x=california_housing.target_names[0],
    color="AveOccup",
    category_orders={"AveOccup": bin_order},
)
fig


Finally, we can visualize the results of our K-means clustering (on the training set):

# getting the cluster labels
col_transformer = engineered_ridge_report.estimator_.named_steps["columntransformer"]
kmeans = col_transformer.named_transformers_["kmeans"]
clustering_labels = kmeans.labels_

# adding the cluster labels to our dataframe
X_train_plot = X_train.copy()
X_train_plot.insert(0, "clustering_labels", clustering_labels)

# plotting the map
plot_map(X_train_plot, "clustering_labels")


Compromising on complexity#

Now, let us build a model with a more interpretable feature engineering, although it might not perform as well. For that, after the complex feature engineering, we perform some feature selection using a SelectKBest, in order to reduce the number of features.

from sklearn.feature_selection import SelectKBest, VarianceThreshold
from sklearn.linear_model import RidgeCV

preprocessor = make_column_transformer(
    (KMeans(n_clusters=10, random_state=0), geo_columns),
    remainder="passthrough",
)
selectkbest_ridge = make_pipeline(
    preprocessor,
    SplineTransformer(sparse_output=True),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    VarianceThreshold(),
    SelectKBest(k=150),
    RidgeCV(np.logspace(-5, 5, num=100)),
)

Note

To keep the computation time of this example low, we did not tune the hyperparameters of the predictive model. However, on a real use case, it would be important to tune the model using RandomizedSearchCV and not just the RidgeCV.

Let us get the metrics for our model and compare it with our previous iterations:
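
A sketch of how this report and the extended comparison might be built, under the same assumptions as before:

# Report for the pipeline with feature selection, fitted on the same split.
selectk_ridge_report = EstimatorReport(
    selectkbest_ridge,
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)

# Extend the benchmark and report the metrics of all models so far.
reports_to_compare["Ridge w/ feature engineering and selection"] = selectk_ridge_report
ComparisonReport(reports=reports_to_compare).metrics.report_metrics()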

/opt/hostedtoolcache/Python/3.12.10/x64/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:112: RuntimeWarning:

divide by zero encountered in divide
Metric              Vanilla Ridge    Ridge w/ feature engineering    Ridge w/ feature engineering and selection
R²                  0.591163         0.726869                        0.689991
RMSE                0.735134         0.600865                        0.640145
Fit time (s)        0.003606         9.723627                        8.325749
Predict time (s)    0.001287         0.221695                        0.352263


We get a good score with far fewer features:

print("Initial number of features:", X_train.shape[1])
print("Number of features after feature engineering:", n_features_engineered)

n_features_selectk = selectk_ridge_report.estimator_[-1].n_features_in_
print(
    "Number of features after feature engineering using `SelectKBest`:",
    n_features_selectk,
)
Initial number of features: 8
Number of features after feature engineering: 6328
Number of features after feature engineering using `SelectKBest`: 150

According to the SelectKBest, the most important features are the following:

selectk_features = selectk_ridge_report.estimator_[:-1].get_feature_names_out()
print(selectk_features)
['remainder__MedInc_sp_1' 'remainder__MedInc_sp_2'
 'remainder__MedInc_sp_3' 'remainder__MedInc_sp_4'
 'kmeans__kmeans0_sp_0 remainder__MedInc_sp_3'
 'kmeans__kmeans0_sp_1 remainder__MedInc_sp_3'
 'kmeans__kmeans0_sp_2 remainder__MedInc_sp_0'
 'kmeans__kmeans0_sp_2 remainder__MedInc_sp_1'
 'kmeans__kmeans0_sp_2 remainder__MedInc_sp_4'
 'kmeans__kmeans0_sp_3 remainder__MedInc_sp_3'
 'kmeans__kmeans0_sp_3 remainder__AveRooms_sp_3'
 'kmeans__kmeans1_sp_0 remainder__MedInc_sp_3'
 'kmeans__kmeans1_sp_0 remainder__Population_sp_3'
 'kmeans__kmeans1_sp_1 remainder__MedInc_sp_3'
 'kmeans__kmeans1_sp_2 remainder__MedInc_sp_0'
 'kmeans__kmeans1_sp_2 remainder__MedInc_sp_1'
 'kmeans__kmeans1_sp_2 remainder__AveRooms_sp_3'
 'kmeans__kmeans1_sp_3 remainder__MedInc_sp_3'
 'kmeans__kmeans1_sp_4 remainder__MedInc_sp_3'
 'kmeans__kmeans1_sp_6 kmeans__kmeans4_sp_4'
 'kmeans__kmeans1_sp_6 remainder__MedInc_sp_5'
 'kmeans__kmeans2_sp_0 remainder__MedInc_sp_0'
 'kmeans__kmeans2_sp_0 remainder__AveRooms_sp_4'
 'kmeans__kmeans2_sp_0 remainder__AveBedrms_sp_4'
 'kmeans__kmeans2_sp_1 remainder__MedInc_sp_0'
 'kmeans__kmeans2_sp_1 remainder__MedInc_sp_1'
 'kmeans__kmeans2_sp_1 remainder__AveRooms_sp_3'
 'kmeans__kmeans2_sp_2 remainder__MedInc_sp_3'
 'kmeans__kmeans2_sp_2 remainder__MedInc_sp_4'
 'kmeans__kmeans2_sp_2 remainder__AveRooms_sp_3'
 'kmeans__kmeans2_sp_3 remainder__MedInc_sp_2'
 'kmeans__kmeans2_sp_3 remainder__MedInc_sp_3'
 'kmeans__kmeans2_sp_3 remainder__MedInc_sp_4'
 'kmeans__kmeans2_sp_6 kmeans__kmeans3_sp_0'
 'kmeans__kmeans2_sp_6 kmeans__kmeans6_sp_1'
 'kmeans__kmeans3_sp_2 remainder__MedInc_sp_3'
 'kmeans__kmeans3_sp_2 remainder__AveRooms_sp_3'
 'kmeans__kmeans3_sp_3 remainder__MedInc_sp_3'
 'kmeans__kmeans3_sp_4 remainder__MedInc_sp_3'
 'kmeans__kmeans3_sp_5 kmeans__kmeans4_sp_4'
 'kmeans__kmeans3_sp_6 kmeans__kmeans4_sp_4'
 'kmeans__kmeans4_sp_1 remainder__MedInc_sp_3'
 'kmeans__kmeans4_sp_2 remainder__MedInc_sp_3'
 'kmeans__kmeans4_sp_3 remainder__MedInc_sp_3'
 'kmeans__kmeans4_sp_3 remainder__AveRooms_sp_3'
 'kmeans__kmeans4_sp_4 kmeans__kmeans5_sp_6'
 'kmeans__kmeans4_sp_4 kmeans__kmeans7_sp_6'
 'kmeans__kmeans4_sp_4 kmeans__kmeans9_sp_5'
 'kmeans__kmeans4_sp_4 kmeans__kmeans9_sp_6'
 'kmeans__kmeans5_sp_0 remainder__Population_sp_3'
 'kmeans__kmeans5_sp_2 remainder__AveRooms_sp_3'
 'kmeans__kmeans5_sp_3 remainder__MedInc_sp_0'
 'kmeans__kmeans5_sp_3 remainder__MedInc_sp_1'
 'kmeans__kmeans5_sp_3 remainder__MedInc_sp_3'
 'kmeans__kmeans5_sp_3 remainder__MedInc_sp_4'
 'kmeans__kmeans5_sp_3 remainder__AveRooms_sp_3'
 'kmeans__kmeans5_sp_4 remainder__MedInc_sp_3'
 'kmeans__kmeans6_sp_0 remainder__Population_sp_3'
 'kmeans__kmeans6_sp_0 remainder__Population_sp_4'
 'kmeans__kmeans6_sp_1 remainder__MedInc_sp_3'
 'kmeans__kmeans6_sp_1 remainder__AveRooms_sp_4'
 'kmeans__kmeans6_sp_2 remainder__MedInc_sp_3'
 'kmeans__kmeans6_sp_2 remainder__MedInc_sp_4'
 'kmeans__kmeans6_sp_3 remainder__MedInc_sp_3'
 'kmeans__kmeans7_sp_1 remainder__AveRooms_sp_3'
 'kmeans__kmeans7_sp_2 remainder__MedInc_sp_0'
 'kmeans__kmeans7_sp_2 remainder__MedInc_sp_1'
 'kmeans__kmeans7_sp_2 remainder__MedInc_sp_3'
 'kmeans__kmeans7_sp_3 remainder__MedInc_sp_3'
 'kmeans__kmeans7_sp_4 remainder__MedInc_sp_3'
 'kmeans__kmeans8_sp_1 remainder__MedInc_sp_3'
 'kmeans__kmeans8_sp_2 remainder__MedInc_sp_3'
 'kmeans__kmeans8_sp_3 remainder__AveRooms_sp_3'
 'kmeans__kmeans9_sp_1 remainder__MedInc_sp_3'
 'kmeans__kmeans9_sp_2 remainder__MedInc_sp_0'
 'kmeans__kmeans9_sp_2 remainder__MedInc_sp_1'
 'kmeans__kmeans9_sp_3 remainder__MedInc_sp_3'
 'kmeans__kmeans9_sp_3 remainder__AveRooms_sp_3'
 'kmeans__kmeans9_sp_4 remainder__MedInc_sp_3'
 'remainder__MedInc_sp_0 remainder__MedInc_sp_2'
 'remainder__MedInc_sp_0 remainder__MedInc_sp_3'
 'remainder__MedInc_sp_0 remainder__AveRooms_sp_2'
 'remainder__MedInc_sp_1 remainder__MedInc_sp_2'
 'remainder__MedInc_sp_1 remainder__MedInc_sp_3'
 'remainder__MedInc_sp_1 remainder__MedInc_sp_4'
 'remainder__MedInc_sp_1 remainder__AveRooms_sp_0'
 'remainder__MedInc_sp_1 remainder__AveRooms_sp_1'
 'remainder__MedInc_sp_1 remainder__AveRooms_sp_2'
 'remainder__MedInc_sp_1 remainder__AveBedrms_sp_0'
 'remainder__MedInc_sp_1 remainder__AveBedrms_sp_1'
 'remainder__MedInc_sp_1 remainder__AveBedrms_sp_2'
 'remainder__MedInc_sp_1 remainder__Population_sp_0'
 'remainder__MedInc_sp_1 remainder__Population_sp_1'
 'remainder__MedInc_sp_1 remainder__Population_sp_2'
 'remainder__MedInc_sp_1 remainder__AveOccup_sp_0'
 'remainder__MedInc_sp_1 remainder__AveOccup_sp_1'
 'remainder__MedInc_sp_1 remainder__AveOccup_sp_2'
 'remainder__MedInc_sp_2 remainder__MedInc_sp_3'
 'remainder__MedInc_sp_2 remainder__MedInc_sp_4'
 'remainder__MedInc_sp_2 remainder__AveRooms_sp_0'
 'remainder__MedInc_sp_2 remainder__AveRooms_sp_1'
 'remainder__MedInc_sp_2 remainder__AveBedrms_sp_0'
 'remainder__MedInc_sp_2 remainder__AveBedrms_sp_1'
 'remainder__MedInc_sp_2 remainder__AveBedrms_sp_2'
 'remainder__MedInc_sp_2 remainder__Population_sp_1'
 'remainder__MedInc_sp_2 remainder__AveOccup_sp_0'
 'remainder__MedInc_sp_2 remainder__AveOccup_sp_1'
 'remainder__MedInc_sp_2 remainder__AveOccup_sp_2'
 'remainder__MedInc_sp_3 remainder__MedInc_sp_4'
 'remainder__MedInc_sp_3 remainder__HouseAge_sp_2'
 'remainder__MedInc_sp_3 remainder__HouseAge_sp_3'
 'remainder__MedInc_sp_3 remainder__HouseAge_sp_4'
 'remainder__MedInc_sp_3 remainder__AveRooms_sp_0'
 'remainder__MedInc_sp_3 remainder__AveRooms_sp_1'
 'remainder__MedInc_sp_3 remainder__AveRooms_sp_2'
 'remainder__MedInc_sp_3 remainder__AveRooms_sp_3'
 'remainder__MedInc_sp_3 remainder__AveBedrms_sp_0'
 'remainder__MedInc_sp_3 remainder__AveBedrms_sp_1'
 'remainder__MedInc_sp_3 remainder__AveBedrms_sp_2'
 'remainder__MedInc_sp_3 remainder__AveBedrms_sp_3'
 'remainder__MedInc_sp_3 remainder__Population_sp_0'
 'remainder__MedInc_sp_3 remainder__Population_sp_1'
 'remainder__MedInc_sp_3 remainder__Population_sp_2'
 'remainder__MedInc_sp_3 remainder__Population_sp_3'
 'remainder__MedInc_sp_3 remainder__AveOccup_sp_0'
 'remainder__MedInc_sp_3 remainder__AveOccup_sp_1'
 'remainder__MedInc_sp_3 remainder__AveOccup_sp_2'
 'remainder__MedInc_sp_4 remainder__AveRooms_sp_0'
 'remainder__MedInc_sp_4 remainder__AveRooms_sp_1'
 'remainder__MedInc_sp_4 remainder__AveRooms_sp_2'
 'remainder__MedInc_sp_4 remainder__AveBedrms_sp_0'
 'remainder__MedInc_sp_4 remainder__AveBedrms_sp_1'
 'remainder__MedInc_sp_4 remainder__AveBedrms_sp_2'
 'remainder__MedInc_sp_4 remainder__AveBedrms_sp_3'
 'remainder__MedInc_sp_4 remainder__Population_sp_0'
 'remainder__MedInc_sp_4 remainder__Population_sp_1'
 'remainder__MedInc_sp_4 remainder__Population_sp_2'
 'remainder__MedInc_sp_4 remainder__Population_sp_3'
 'remainder__MedInc_sp_4 remainder__Population_sp_4'
 'remainder__MedInc_sp_4 remainder__AveOccup_sp_0'
 'remainder__MedInc_sp_4 remainder__AveOccup_sp_1'
 'remainder__MedInc_sp_4 remainder__AveOccup_sp_2'
 'remainder__MedInc_sp_5 remainder__Population_sp_4'
 'remainder__AveRooms_sp_3 remainder__AveBedrms_sp_2'
 'remainder__AveRooms_sp_3 remainder__Population_sp_3'
 'remainder__AveRooms_sp_4 remainder__AveBedrms_sp_1'
 'remainder__AveRooms_sp_4 remainder__AveBedrms_sp_2'
 'remainder__AveRooms_sp_4 remainder__Population_sp_3'
 'remainder__AveBedrms_sp_4 remainder__Population_sp_3'
 'remainder__Population_sp_1 remainder__Population_sp_4']

We can see that, in the best features, according to statistical tests, there are many interactions between geospatial features (derived from the K-means clustering) and the median income. Note that these features are not sorted.

And here is the feature importance based on our model (sorted by absolute values):

selectk_ridge_report.feature_importance.coefficients().sort_values(
    by="Coefficient", key=abs, ascending=True
).tail(15).plot.barh(
    title="Model weights",
    xlabel="Coefficient",
    ylabel="Feature",
)
plt.tight_layout()
Model weights

Tree-based models: mean decrease in impurity (MDI)#

Now, let us look into tree-based models. For feature importance, we inspect their Mean Decrease in Impurity (MDI). The MDI of a feature is the normalized total reduction of the criterion (or loss) brought by that feature. The higher the MDI, the more important the feature.

Warning

The MDI is limited and can be misleading:

  • When features have large differences in cardinality, the MDI tends to favor those with higher cardinality. Fortunately, in this example, we have only numerical features that share similar cardinality, mitigating this concern.

  • Since the MDI is typically calculated on the training set, it can reflect biases from overfitting. When a model overfits, the tree may partition less relevant regions of the feature space, artificially inflating MDI values and distorting the perceived importance of certain features. Soon, scikit-learn will enable computing the MDI on the test set, and we will make it available in skore. Hence, we will be able to draw conclusions on how predictive a feature is, and not just how impactful it is on the training procedure.

See also

For more information about the MDI, see scikit-learn’s Permutation Importance vs Random Forest Feature Importance (MDI).

Decision trees#

Let us start with a simple decision tree.

See also

For more information about decision trees, see scikit-learn’s example on Understanding the decision tree structure.

We compare its performance with the models in our benchmark:
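
A sketch of how the decision tree report might be built and added to our benchmark, under the same assumptions as before (the random_state is our choice for reproducibility):

from sklearn.tree import DecisionTreeRegressor

# Report for a single decision tree, added to the benchmark.
tree_report = EstimatorReport(
    DecisionTreeRegressor(random_state=0),
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)
reports_to_compare["Decision tree"] = tree_report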

comparator = ComparisonReport(reports=reports_to_compare)
comparator.metrics.report_metrics()
Metric              Vanilla Ridge    Ridge w/ feature engineering    Ridge w/ feature engineering and selection    Decision tree
R²                  0.591163         0.726869                        0.689991                                      0.583785
RMSE                0.735134         0.600865                        0.640145                                      0.741737
Fit time (s)        0.003606         9.723627                        8.325749                                      0.186716
Predict time (s)    0.001287         0.221695                        0.352263                                      0.002108


We note that the performance is quite poor, so the derived feature importance must be treated with caution.

We display which accessors are available to us:
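
This display is presumably produced by the report's helper:

tree_report.help()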

╭───────────────── Tools to diagnose estimator DecisionTreeRegressor ──────────────────╮
│ EstimatorReport                                                                      │
│ ├── .metrics                                                                         │
│ │   ├── .prediction_error(...)         - Plot the prediction error of a regression   │
│ │   │   model.                                                                       │
│ │   ├── .r2(...)               (↗︎)     - Compute the R² score.                       │
│ │   ├── .rmse(...)             (↘︎)     - Compute the root mean squared error.        │
│ │   ├── .timings(...)                  - Get all measured processing times related   │
│ │   │   to the estimator.                                                            │
│ │   ├── .custom_metric(...)            - Compute a custom metric.                    │
│ │   └── .report_metrics(...)           - Report a set of metrics for our estimator.  │
│ ├── .feature_importance                                                              │
│ │   ├── .mean_decrease_impurity(...)   - Retrieve the mean decrease impurity (MDI)   │
│ │   │   of a tree-based model.                                                       │
│ │   └── .permutation(...)              - Report the permutation feature importance.  │
│ ├── .cache_predictions(...)            - Cache estimator's predictions.              │
│ ├── .clear_cache(...)                  - Clear the cache.                            │
│ ├── .get_predictions(...)              - Get estimator's predictions.                │
│ └── Attributes                                                                       │
│     ├── .X_test                        - Testing data                                │
│     ├── .X_train                       - Training data                               │
│     ├── .y_test                        - Testing target                              │
│     ├── .y_train                       - Training target                             │
│     ├── .estimator_                    - The cloned or copied estimator              │
│     ├── .estimator_name_               - The name of the estimator                   │
│     ├── .fit_time_                     - The time taken to fit the estimator, in     │
│     │   seconds                                                                      │
│     └── .ml_task                       - No description available                    │
│                                                                                      │
│                                                                                      │
│ Legend:                                                                              │
│ (↗︎) higher is better (↘︎) lower is better                                             │
╰──────────────────────────────────────────────────────────────────────────────────────╯

We have a mean_decrease_impurity() accessor.

First, let us interpret our model with regards to the original features. For the visualization, we fix a very low max_depth so that it will be easy for the human eye to visualize the tree using sklearn.tree.plot_tree():

from sklearn.tree import plot_tree

plot_tree(
    tree_report.estimator_,
    feature_names=tree_report.estimator_.feature_names_in_,
    max_depth=2,
)
plt.tight_layout()
[Figure: the decision tree structure, truncated at max_depth=2]

This tree explains how each sample is going to be predicted by our tree. A decision tree provides a decision path for each sample, where the sample traverses the tree based on feature thresholds, and the final prediction is made at the leaf node (not represented above for conciseness purposes). At each node:

  • samples is the number of samples that fall into that node,

  • value is the predicted output for the samples that fall into this particular node (it is the mean of the target values for the samples in that node). At the root node, the value is \(2.074\). This means that if you were to make a prediction for all \(15480\) samples at this node (without any further splits), the predicted value would be \(2.074\), which is the mean of the target variable for those samples.

  • squared_error is the mean squared error associated with the value, representing the average of the squared differences between the actual target values of the samples in the node and the node’s predicted value (the mean),

  • the first line of each node shows how the split is defined.

Let us explain how this works in practice. At each node, the tree splits the data based on a feature and a threshold. For the first node (the root node), MedInc <= 5.029 means that, for each sample, our decision tree first looks at the MedInc feature (which is thus the most important one): if the MedInc value is lower than \(5.029\) (the threshold), then the sample goes into the left node, otherwise it goes to the right, and so on for each node. As you move down the tree, the splits refine the predictions, leading to the leaf nodes where the final prediction for a sample is the value of the leaf it reaches. Note that for the second node layer, it is also the MedInc feature that is used for the threshold, indicating that our model heavily relies on MedInc.

See also

A richer display of decision trees is available in the dtreeviz python package. For example, it shows the distribution of feature values split at each node and tailors the visualization to the task at hand (whether classification or regression).

Now, let us look at the feature importance based on the MDI:

tree_report.feature_importance.mean_decrease_impurity().plot.barh(
    title=f"Feature importance of {tree_report.estimator_name_}",
    xlabel="MDI",
    ylabel="Feature",
)
plt.tight_layout()
Feature importance of DecisionTreeRegressor

For a decision tree, for each feature, the MDI is averaged across all splits in the tree. Here, the impurity is the mean squared error.

As expected, MedInc is of great importance for our decision tree. Indeed, in the above tree visualization, MedInc is used multiple times for splits and contributes greatly to reducing the squared error at multiple nodes. At the root, it reduces the error from \(1.335\) to \(0.832\) and \(0.546\) in the children.

Random forest#

Now, let us apply a more elaborate model: a random forest. A random forest is an ensemble method that builds multiple decision trees, each trained on a random subset of data and features. For regression, it averages the trees’ predictions. This reduces overfitting and improves accuracy compared to a single decision tree.
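
A sketch of how this report might be built and compared, assuming a forest of 100 trees (the scikit-learn default, which matches the count printed below):

from sklearn.ensemble import RandomForestRegressor

# Random forest report, added to the benchmark; n_estimators is reused further down.
n_estimators = 100
rf_report = EstimatorReport(
    RandomForestRegressor(n_estimators=n_estimators, random_state=0),
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
)
reports_to_compare["Random forest"] = rf_report
ComparisonReport(reports=reports_to_compare).metrics.report_metrics()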

Metric              Vanilla Ridge    Ridge w/ feature engineering    Ridge w/ feature engineering and selection    Decision tree    Random forest
R²                  0.591163         0.726869                        0.689991                                      0.583785         0.794168
RMSE                0.735134         0.600865                        0.640145                                      0.741737         0.521612
Fit time (s)        0.003606         9.723627                        8.325749                                      0.186716         11.792969
Predict time (s)    0.001287         0.221695                        0.352263                                      0.002108         0.124773


Without any feature engineering and any grid search, the random forest beats all linear models!

Let us recall the number of trees in our random forest:

print(f"Number of trees in the forest: {n_estimators}")
Number of trees in the forest: 100

Given that we have many trees, it is hard to use sklearn.tree.plot_tree() as we did for the single decision tree. As with linear models (and complex feature engineering), better performance often comes with less interpretability.

Let us look into the MDI of our random forest:

rf_report.feature_importance.mean_decrease_impurity().plot.barh(
    title=f"Feature importance of {rf_report.estimator_name_}",
    xlabel="MDI",
    ylabel="Feature",
)
plt.tight_layout()
Feature importance of RandomForestRegressor

In a random forest, the MDI is computed by averaging the MDI of each feature across all the decision trees in the forest.

As for the decision tree, MedInc is the most important feature. As with the linear models with some feature engineering, the random forest also attributes a high importance to Longitude, Latitude, and AveOccup.

Model-agnostic: permutation feature importance#

In the previous sections, we have inspected coefficients that are specific to linear models and the MDI that is specific to tree-based models. In this section, we look into the permutation importance which is model agnostic, meaning that it can be applied to any fitted estimator. In particular, it works for linear models and tree-based ones.

Permutation feature importance measures the contribution of each feature to a fitted model’s performance. It randomly shuffles the values of a single feature and observes the resulting degradation of the model’s score. Permuting a predictive feature makes the performance decrease, while permuting a non-predictive feature does not degrade the performance much. This permutation importance can be computed on the train and test sets, and by default skore computes it on the test set. Compared to the coefficients and the MDI, the permutation importance can be less misleading, but comes with a higher computation cost.
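
To make the computation concrete, here is a minimal sketch of the underlying idea using scikit-learn's permutation_importance directly on the fitted Ridge pipeline; this is an illustration of the principle, not necessarily skore's exact implementation:

from sklearn.inspection import permutation_importance

# Decrease in R² of the fitted pipeline when each feature is shuffled on the test set,
# averaged over the repeated shuffles.
perm = permutation_importance(
    ridge_report.estimator_, X_test, y_test, n_repeats=5, random_state=0
)
print(dict(zip(X_test.columns, perm.importances_mean.round(3))))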

Permutation feature importance can also help reduce overfitting. If a model overfits (high train score and low test score), and some features are important only on the train set and not on the test set, then these features might be the cause of the overfitting and it might be a good idea to drop them.

Warning

The permutation feature importance can be misleading on strongly correlated features. For more information, see scikit-learn’s user guide.

Now, let us look at our helper:
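
As before, this display is presumably produced by the report's helper:

ridge_report.help()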

╭───────────────────────── Tools to diagnose estimator Ridge ──────────────────────────╮
│ EstimatorReport                                                                      │
│ ├── .metrics                                                                         │
│ │   ├── .prediction_error(...)         - Plot the prediction error of a regression   │
│ │   │   model.                                                                       │
│ │   ├── .r2(...)               (↗︎)     - Compute the R² score.                       │
│ │   ├── .rmse(...)             (↘︎)     - Compute the root mean squared error.        │
│ │   ├── .timings(...)                  - Get all measured processing times related   │
│ │   │   to the estimator.                                                            │
│ │   ├── .custom_metric(...)            - Compute a custom metric.                    │
│ │   └── .report_metrics(...)           - Report a set of metrics for our estimator.  │
│ ├── .feature_importance                                                              │
│ │   ├── .coefficients(...)             - Retrieve the coefficients of a linear       │
│ │   │   model, including the intercept.                                              │
│ │   └── .permutation(...)              - Report the permutation feature importance.  │
│ ├── .cache_predictions(...)            - Cache estimator's predictions.              │
│ ├── .clear_cache(...)                  - Clear the cache.                            │
│ ├── .get_predictions(...)              - Get estimator's predictions.                │
│ └── Attributes                                                                       │
│     ├── .X_test                        - Testing data                                │
│     ├── .X_train                       - Training data                               │
│     ├── .y_test                        - Testing target                              │
│     ├── .y_train                       - Training target                             │
│     ├── .estimator_                    - The cloned or copied estimator              │
│     ├── .estimator_name_               - The name of the estimator                   │
│     ├── .fit_time_                     - The time taken to fit the estimator, in     │
│     │   seconds                                                                      │
│     └── .ml_task                       - No description available                    │
│                                                                                      │
│                                                                                      │
│ Legend:                                                                              │
│ (↗︎) higher is better (↘︎) lower is better                                             │
╰──────────────────────────────────────────────────────────────────────────────────────╯

We have a permutation() accessor:

ridge_report.feature_importance.permutation(seed=0)
Repeat Repeat #0 Repeat #1 Repeat #2 Repeat #3 Repeat #4
Metric Feature
r2 MedInc 1.015997 1.065468 1.022420 1.021447 1.011064
HouseAge 0.018828 0.024312 0.020991 0.020678 0.021016
AveRooms 0.090388 0.089230 0.085344 0.086729 0.083356
AveBedrms 0.099849 0.098794 0.102469 0.104776 0.107545
Population -0.000181 -0.000103 -0.000181 -0.000023 -0.000065
AveOccup 0.003671 0.006451 0.007217 0.005760 0.005956
Latitude 1.229644 1.155974 1.208524 1.193838 1.206576
Longitude 1.097392 1.103421 1.136658 1.137157 1.116989


The permutation importance is often calculated several times, each time with different permutations of the feature. Hence, we can measure its variance (or standard deviation). Now, we plot the permutation feature importance on the train and test sets using boxplots:

def plot_permutation_train_test(est_report):
    _, ax = plt.subplots(figsize=(8, 6))

    train_color = "blue"
    test_color = "orange"

    est_report.feature_importance.permutation(data_source="train", seed=0).T.boxplot(
        ax=ax,
        vert=False,
        widths=0.35,
        patch_artist=True,
        boxprops=dict(facecolor=train_color, alpha=0.7),
        medianprops=dict(color="black"),
        positions=[x + 0.4 for x in range(len(est_report.X_train.columns))],
    )
    est_report.feature_importance.permutation(data_source="test", seed=0).T.boxplot(
        ax=ax,
        vert=False,
        widths=0.35,
        patch_artist=True,
        boxprops=dict(facecolor=test_color, alpha=0.7),
        medianprops=dict(color="black"),
        positions=range(len(est_report.X_test.columns)),
    )

    ax.legend(
        handles=[
            plt.Line2D([0], [0], color=train_color, lw=5, label="Train"),
            plt.Line2D([0], [0], color=test_color, lw=5, label="Test"),
        ],
        loc="best",
        title="Dataset",
    )

    ax.set_title(
        f"Permutation feature importance of {est_report.estimator_name_} (Train vs Test)"
    )
    ax.set_xlabel("$R^2$")
    ax.set_yticks([x + 0.2 for x in range(len(est_report.X_train.columns))])
    ax.set_yticklabels(est_report.X_train.columns)

    plt.tight_layout()
    plt.show()
plot_permutation_train_test(ridge_report)
Permutation feature importance of Ridge (Train vs Test)

The standard deviation seems quite low. For both the train and test sets, the result of the inspection is the same as with the coefficients: the most important features are Latitude, Longitude, and MedInc.

For selectk_ridge_report, we have a large pipeline that is fed to an EstimatorReport. The pipeline contains a lot of preprocessing that creates many features. By default, the permutation importance is calculated at the entrance of the whole pipeline, that is, with regards to the original features:

plot_permutation_train_test(selectk_ridge_report)
Permutation feature importance of RidgeCV (Train vs Test)

Hence, contrary to the coefficients, the interpretation is easier, even though we created many features in our preprocessing. We notice that, due to our preprocessing using a clustering on the geospatial data, these features are of great importance to our model.

For our decision tree, here is our permutation importance on the train and test sets:

plot_permutation_train_test(tree_report)
Permutation feature importance of DecisionTreeRegressor (Train vs Test)

The result of the inspection is the same as with the MDI: the most important features are MedInc, Latitude, Longitude, and AveOccup.

Conclusion#

In this example, we used the California housing dataset to predict house prices with skore’s EstimatorReport. By employing the feature_importance accessor, we gained valuable insights into model behavior beyond mere predictive performance. For linear models like Ridge regression, we inspected coefficients to understand feature contributions, revealing the prominence of MedInc, Latitude, and Longitude. We explained the trade-off between performance (with complex feature engineering) and interpretability. Interactions between features highlighted the importance of AveOccup. With tree-based models such as decision trees and random forests, we used the mean decrease in impurity (MDI) to identify key features, notably AveOccup alongside MedInc, Latitude, and Longitude. The random forest obtained the best score, without any complex feature engineering, in contrast to the linear models. The model-agnostic permutation feature importance further enabled us to compare feature significance across diverse model types.

Total running time of the script: (1 minute 14.754 seconds)
