As previously mentioned, our simple Ridge model, although easy to interpret in terms of the original units of the features, performs quite poorly.
We now build a more complex model, with more feature engineering.
We will see that this model has a better score, but that its coefficients are harder to interpret with respect to the original features because of the complex feature engineering.
In our previous EDA, when plotting the geospatial data against the house prices, we noticed that it is important to take the latitude and longitude features into account.
Moreover, we also observed that the median income is strongly associated with the house prices.
Hence, we will try a feature engineering step that captures the interactions of the geospatial features with features such as the income, using polynomial features.
The relationships are no longer simply linear, as they were previously.
Let us build a model with some more complex feature engineering, and still use a
Ridge regressor (linear model) at the end of the pipeline.
In particular, we perform a K-means clustering on the geospatial features:
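The construction of this pipeline is not shown in this excerpt; here is a minimal sketch of how it could be assembled, assuming geo_columns holds the latitude/longitude column names (as in the SelectKBest pipeline further below):

from sklearn.cluster import KMeans
from sklearn.compose import make_column_transformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, SplineTransformer

geo_columns = ["Latitude", "Longitude"]

engineered_ridge = make_pipeline(
    # replace the raw coordinates with distances to 10 K-means cluster centers
    make_column_transformer(
        (KMeans(n_clusters=10, random_state=0), geo_columns),
        remainder="passthrough",
    ),
    # non-linear univariate expansions of every resulting column
    SplineTransformer(sparse_output=True),
    # pairwise interactions between all expanded features
    PolynomialFeatures(include_bias=False, interaction_only=True),
    Ridge(),
)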
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('kmeans',
                                                  KMeans(n_clusters=10,
                                                         random_state=0),
                                                  ['Latitude', 'Longitude'])])),
                ('splinetransformer', SplineTransformer(sparse_output=True)),
                ('polynomialfeatures',
                 PolynomialFeatures(include_bias=False, interaction_only=True)),
                ('ridge', Ridge())])
Now, let us compute the metrics and compare them to our previous model using a skore.ComparisonReport:
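The comparison code itself is not shown here; a minimal sketch, assuming ridge_report and engineered_ridge_report are the EstimatorReport objects built earlier (the exact ComparisonReport arguments may vary between skore versions):

import skore

comparator = skore.ComparisonReport([ridge_report, engineered_ridge_report])
comparator.metrics.summarize()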
skore.MetricsSummaryDisplay(...)
We get a much better score!
Let us plot the prediction error:
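A sketch of the call producing the figure below, using the metrics accessor listed in the report helpers later on this page (whether the returned display needs an explicit .plot() call may depend on the skore version):

display = engineered_ridge_report.metrics.prediction_error()
display.plot()
plt.tight_layout()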

Regarding the clipping issue, compared to the prediction error of our previous model (ridge_report), our engineered_ridge_report model produces less extreme predictions, so some interactions between features seem to have helped alleviate the clipping issue.
However, interpreting the features is harder: indeed, our complex feature engineering
introduced a lot of features:
Initial number of features: 8
Number of features after feature engineering: 6328
Let us display the 15 largest absolute coefficients:
engineered_ridge_report.feature_importance.coefficients().sort_values(
    by="Coefficient", key=abs, ascending=True
).tail(15).plot.barh(
    title="Model weights",
    xlabel="Coefficient",
    ylabel="Feature",
)
plt.tight_layout()

We can observe that the most important features are interactions between features, mostly involving AveOccup, that a simple linear model without feature engineering could not have captured.
Indeed, the vanilla Ridge model did not consider AveOccup to be important.
As the engineered Ridge has a better score, perhaps the vanilla Ridge missed something about AveOccup that is key to predicting house prices.
Let us visualize how AveOccup interacts with MedHouseVal:
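The corresponding plot is not reproduced here; a minimal sketch of such a visualization, assuming the training data stored on the report (the decile binning is an illustrative choice):

import pandas as pd

df_occup = pd.DataFrame(
    {
        "AveOccup": engineered_ridge_report.X_train["AveOccup"],
        "MedHouseVal": engineered_ridge_report.y_train,
    }
)
# bin the average occupancy into deciles to make the relationship readable
df_occup["AveOccup_bin"] = pd.qcut(df_occup["AveOccup"], q=10)
df_occup.groupby("AveOccup_bin", observed=True)["MedHouseVal"].mean().plot.bar(
    title="Mean house value per decile of AveOccup",
    xlabel="AveOccup (deciles)",
    ylabel="Mean MedHouseVal",
)
plt.tight_layout()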
Finally, we can visualize the results of our K-means clustering (on the training set):
Inspecting the prediction error at the sample level
After feature importance, we now try to understand why our model performs badly on some
samples, in order to iterate on our estimator pipeline and improve it.
We compute the prediction squared error at the sample level, named squared_error, on the train and test sets:
def add_y_true_pred(model_report, split):
    """
    Concatenate the design matrix (`X`) with the actual targets (`y`)
    and predicted ones (`y_pred`) from a fitted skore EstimatorReport,
    either on the train or the test set.
    """
    if split == "train":
        y_split_true = model_report.y_train
        X_split = model_report.X_train.copy()
    elif split == "test":
        y_split_true = model_report.y_test
        X_split = model_report.X_test.copy()
    else:
        raise ValueError("split must be either `train`, or `test`")
    # adding a `split` feature
    X_split.insert(0, "split", split)
    # retrieving the predictions
    y_split_pred = model_report.get_predictions(
        data_source=split, response_method="predict"
    )
    # computing the squared error at the sample level
    squared_error_split = (y_split_true - y_split_pred) ** 2
    # adding the squared error to our dataframes
    X_split.insert(X_split.shape[1], "squared_error", squared_error_split)
    # adding the true values and the predictions
    X_y_split = X_split.copy()
    X_y_split.insert(X_y_split.shape[1], "y_true", y_split_true)
    X_y_split.insert(X_y_split.shape[1], "y_pred", y_split_pred)
    return X_y_split
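A sketch of how this helper can be used to build the table below, assuming pandas is imported as pd; the variable name X_y and the sampling call are illustrative:

X_y = pd.concat(
    [
        add_y_true_pred(engineered_ridge_report, "train"),
        add_y_true_pred(engineered_ridge_report, "test"),
    ]
)
X_y.sample(10, random_state=0)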
|       | split | MedInc | HouseAge | AveRooms  | AveBedrms | Population | AveOccup | Latitude | Longitude | squared_error | y_true | y_pred   |
| 18421 | test  | 4.0547 | 14.0     | 5.450000  | 1.101613  | 1465.0     | 2.362903 | 37.26    | -121.81   | 0.010174      | 2.362  | 2.261134 |
| 15960 | train | 4.0234 | 52.0     | 4.960526  | 0.914474  | 1138.0     | 3.743421 | 37.71    | -122.43   | 0.524091      | 2.665  | 3.388942 |
| 11783 | test  | 2.0807 | 13.0     | 4.869732  | 1.112388  | 1513.0     | 1.932312 | 38.78    | -121.23   | 0.019955      | 1.426  | 1.284737 |
| 10079 | test  | 3.1842 | 9.0      | 24.900000 | 5.045455  | 294.0      | 2.672727 | 39.34    | -120.25   | 0.046789      | 1.625  | 1.408694 |
| 7854  | train | 3.2813 | 46.0     | 4.392523  | 1.023364  | 599.0      | 2.799065 | 33.89    | -118.16   | 0.018661      | 1.909  | 2.045606 |
| 2068  | train | 2.6146 | 43.0     | 5.634409  | 1.000000  | 302.0      | 3.247312 | 36.73    | -119.89   | 0.027483      | 0.813  | 0.647219 |
| 12673 | train | 3.5968 | 12.0     | 5.841346  | 1.028045  | 3960.0     | 3.173077 | 38.49    | -121.40   | 0.076579      | 1.063  | 1.339730 |
| 5990  | train | 5.8891 | 37.0     | 6.522642  | 1.026415  | 1344.0     | 2.535849 | 34.10    | -117.73   | 0.185262      | 2.260  | 2.690421 |
| 14240 | train | 1.4089 | 22.0     | 4.167820  | 1.006920  | 1887.0     | 3.264706 | 32.70    | -117.09   | 0.022263      | 0.942  | 1.091208 |
| 19306 | train | 4.5875 | 32.0     | 6.712418  | 1.058824  | 843.0      | 2.754902 | 38.33    | -122.77   | 0.123025      | 2.907  | 2.556251 |
We visualize the distributions of the prediction errors on both train and test sets:
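A sketch of one way to plot these distributions, reusing the hypothetical X_y dataframe built above:

fig, ax = plt.subplots()
for split_name, group in X_y.groupby("split"):
    # log scale on the counts, since most errors are small
    group["squared_error"].plot.hist(ax=ax, bins=50, alpha=0.5, label=split_name, logy=True)
ax.set_xlabel("Squared error")
ax.legend(title="Split")
plt.tight_layout()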

Now, in order to assess which features might drive the prediction error, let us look into the associations between the squared_error and the other features:
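Given the column names in the table below (cramer_v and pearson_corr), these associations were likely computed with skrub's column_associations helper; a sketch under that assumption, again using the hypothetical X_y dataframe (the filtering and sorting are illustrative):

from skrub import column_associations

associations = column_associations(X_y)
# keep only the pairs involving the squared error, most associated first
associations.query(
    "left_column_name == 'squared_error' or right_column_name == 'squared_error'"
).sort_values(by="cramer_v", ascending=False)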
|    | left_column_name | left_column_idx | right_column_name | right_column_idx | cramer_v | pearson_corr |
| 4  | squared_error    | 9               | y_pred            | 11               | 0.367433 | 0.161141     |
| 10 | squared_error    | 9               | y_true            | 10               | 0.167185 | 0.390676     |
| 13 | AveOccup         | 6               | squared_error     | 9                | 0.098006 | 0.057073     |
| 27 | MedInc           | 1               | squared_error     | 9                | 0.040225 | 0.035507     |
| 29 | Longitude        | 8               | squared_error     | 9                | 0.035152 | 0.017732     |
| 33 | HouseAge         | 2               | squared_error     | 9                | 0.032740 | 0.080408     |
| 37 | Latitude         | 7               | squared_error     | 9                | 0.028564 | -0.077758    |
| 51 | split            | 0               | squared_error     | 9                | 0.018031 | NaN          |
| 59 | Population       | 5               | squared_error     | 9                | 0.009687 | -0.073801    |
| 60 | AveRooms         | 3               | squared_error     | 9                | 0.008547 | -0.024258    |
| 62 | AveBedrms        | 4               | squared_error     | 9                | 0.004992 | 0.009436     |
We observe that the AveOccup feature is the one most associated with large prediction errors: our model does not handle that feature well.
Hence, it might be worth diving deeper into the AveOccup feature, for example into its outliers.
We also observe large prediction errors for districts near the coast and around big cities:
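The corresponding map is not reproduced here; a minimal sketch of such a view with plotly, assuming plotly.express is imported as px (as in the snippet below) and the hypothetical X_y dataframe:

fig = px.scatter(
    X_y,
    x="Longitude",
    y="Latitude",
    color="squared_error",
    opacity=0.4,
    title="Squared error per district",
)
fig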
Hence, it could make sense to engineer two new features: the distance to the coast
and the distance to big cities.
Most of our very bad predictions underpredict the true value (y_true is more often larger than y_pred):
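The snippet below relies on two objects that are not defined in this excerpt; a hypothetical setup consistent with the rest of the section (both the name X_y_plot and the threshold choice are assumptions):

# reuse the concatenated train/test dataframe built with `add_y_true_pred`
X_y_plot = X_y
# focus on the worst 5% of predictions (illustrative cut-off)
threshold = X_y_plot["squared_error"].quantile(0.95)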
# Create the scatter plot
fig = px.scatter(
    X_y_plot.query(f"squared_error > {threshold}"),
    x="y_pred",
    y="y_true",
    color="split",
)
# Add the diagonal line
fig.add_shape(
    type="line",
    x0=X_y_plot["y_pred"].min(),
    y0=X_y_plot["y_pred"].min(),
    x1=X_y_plot["y_pred"].max(),
    y1=X_y_plot["y_pred"].max(),
    line=dict(color="black", width=2),
)
fig
Compromising on complexity
Now, let us build a model with more interpretable feature engineering, although it might not perform as well.
For that, after the complex feature engineering, we perform some feature selection using a SelectKBest, in order to reduce the number of features.
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.linear_model import RidgeCV

preprocessor = make_column_transformer(
    (KMeans(n_clusters=10, random_state=0), geo_columns),
    remainder="passthrough",
)
selectkbest_ridge = make_pipeline(
    preprocessor,
    SplineTransformer(sparse_output=True),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    VarianceThreshold(1e-8),
    SelectKBest(score_func=lambda X, y: f_regression(X, y, center=False), k=150),
    RidgeCV(np.logspace(-5, 5, num=100)),
)
Note
To keep the computation time of this example low, we did not tune the hyperparameters of the predictive model. However, on a real use case, it would be important to tune the model using RandomizedSearchCV and not just RidgeCV.
Let us get the metrics for our model and compare it with our previous iterations:
skore.MetricsSummaryDisplay(...)
We get a good score and far fewer features:
Initial number of features: 8
Number of features after feature engineering: 6328
Number of features after feature engineering using `SelectKBest`: 150
According to the SelectKBest, the most important features are the following:
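A sketch of how these names can be retrieved, assuming selectk_ridge_report wraps the fitted selectkbest_ridge pipeline: slice off the final regressor and ask the remaining steps for their output feature names.

feature_names = selectk_ridge_report.estimator_[:-1].get_feature_names_out()
print(feature_names)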
['kmeans__kmeans2_sp_2' 'kmeans__kmeans2_sp_3' 'kmeans__kmeans6_sp_2'
'remainder__MedInc_sp_2' 'remainder__MedInc_sp_3'
'remainder__AveRooms_sp_0' 'remainder__AveRooms_sp_1'
'remainder__AveRooms_sp_2' 'remainder__AveBedrms_sp_0'
'remainder__AveBedrms_sp_1' 'remainder__AveBedrms_sp_2'
'remainder__Population_sp_0' 'remainder__Population_sp_1'
'remainder__Population_sp_2' 'remainder__AveOccup_sp_0'
'remainder__AveOccup_sp_1' 'remainder__AveOccup_sp_2'
'kmeans__kmeans2_sp_2 kmeans__kmeans2_sp_3'
'kmeans__kmeans2_sp_2 kmeans__kmeans2_sp_4'
'kmeans__kmeans2_sp_2 kmeans__kmeans6_sp_2'
'kmeans__kmeans2_sp_2 remainder__AveRooms_sp_0'
'kmeans__kmeans2_sp_2 remainder__AveRooms_sp_1'
'kmeans__kmeans2_sp_2 remainder__AveRooms_sp_2'
'kmeans__kmeans2_sp_2 remainder__AveBedrms_sp_0'
'kmeans__kmeans2_sp_2 remainder__AveBedrms_sp_1'
'kmeans__kmeans2_sp_2 remainder__AveBedrms_sp_2'
'kmeans__kmeans2_sp_2 remainder__Population_sp_1'
'kmeans__kmeans2_sp_2 remainder__AveOccup_sp_0'
'kmeans__kmeans2_sp_2 remainder__AveOccup_sp_1'
'kmeans__kmeans2_sp_2 remainder__AveOccup_sp_2'
'kmeans__kmeans2_sp_3 kmeans__kmeans5_sp_3'
'kmeans__kmeans2_sp_3 kmeans__kmeans6_sp_2'
'kmeans__kmeans2_sp_3 remainder__MedInc_sp_2'
'kmeans__kmeans2_sp_3 remainder__MedInc_sp_3'
'kmeans__kmeans2_sp_3 remainder__AveRooms_sp_0'
'kmeans__kmeans2_sp_3 remainder__AveRooms_sp_1'
'kmeans__kmeans2_sp_3 remainder__AveRooms_sp_2'
'kmeans__kmeans2_sp_3 remainder__AveBedrms_sp_0'
'kmeans__kmeans2_sp_3 remainder__AveBedrms_sp_1'
'kmeans__kmeans2_sp_3 remainder__AveBedrms_sp_2'
'kmeans__kmeans2_sp_3 remainder__Population_sp_0'
'kmeans__kmeans2_sp_3 remainder__Population_sp_1'
'kmeans__kmeans2_sp_3 remainder__Population_sp_2'
'kmeans__kmeans2_sp_3 remainder__AveOccup_sp_0'
'kmeans__kmeans2_sp_3 remainder__AveOccup_sp_1'
'kmeans__kmeans2_sp_3 remainder__AveOccup_sp_2'
'kmeans__kmeans3_sp_3 kmeans__kmeans6_sp_3'
'kmeans__kmeans6_sp_2 remainder__AveRooms_sp_0'
'kmeans__kmeans6_sp_2 remainder__AveRooms_sp_1'
'kmeans__kmeans6_sp_2 remainder__AveRooms_sp_2'
'kmeans__kmeans6_sp_2 remainder__AveBedrms_sp_0'
'kmeans__kmeans6_sp_2 remainder__AveBedrms_sp_1'
'kmeans__kmeans6_sp_2 remainder__AveBedrms_sp_2'
'kmeans__kmeans6_sp_2 remainder__Population_sp_0'
'kmeans__kmeans6_sp_2 remainder__Population_sp_1'
'kmeans__kmeans6_sp_2 remainder__Population_sp_2'
'kmeans__kmeans6_sp_2 remainder__AveOccup_sp_0'
'kmeans__kmeans6_sp_2 remainder__AveOccup_sp_1'
'kmeans__kmeans6_sp_2 remainder__AveOccup_sp_2'
'remainder__MedInc_sp_2 remainder__MedInc_sp_3'
'remainder__MedInc_sp_2 remainder__AveRooms_sp_0'
'remainder__MedInc_sp_2 remainder__AveRooms_sp_1'
'remainder__MedInc_sp_2 remainder__AveRooms_sp_2'
'remainder__MedInc_sp_2 remainder__AveBedrms_sp_0'
'remainder__MedInc_sp_2 remainder__AveBedrms_sp_1'
'remainder__MedInc_sp_2 remainder__AveBedrms_sp_2'
'remainder__MedInc_sp_2 remainder__Population_sp_0'
'remainder__MedInc_sp_2 remainder__Population_sp_1'
'remainder__MedInc_sp_2 remainder__Population_sp_2'
'remainder__MedInc_sp_2 remainder__AveOccup_sp_0'
'remainder__MedInc_sp_2 remainder__AveOccup_sp_1'
'remainder__MedInc_sp_2 remainder__AveOccup_sp_2'
'remainder__MedInc_sp_3 remainder__AveRooms_sp_0'
'remainder__MedInc_sp_3 remainder__AveRooms_sp_1'
'remainder__MedInc_sp_3 remainder__AveRooms_sp_2'
'remainder__MedInc_sp_3 remainder__AveBedrms_sp_0'
'remainder__MedInc_sp_3 remainder__AveBedrms_sp_1'
'remainder__MedInc_sp_3 remainder__AveBedrms_sp_2'
'remainder__MedInc_sp_3 remainder__Population_sp_0'
'remainder__MedInc_sp_3 remainder__Population_sp_1'
'remainder__MedInc_sp_3 remainder__Population_sp_2'
'remainder__MedInc_sp_3 remainder__AveOccup_sp_0'
'remainder__MedInc_sp_3 remainder__AveOccup_sp_1'
'remainder__MedInc_sp_3 remainder__AveOccup_sp_2'
'remainder__AveRooms_sp_0 remainder__AveRooms_sp_1'
'remainder__AveRooms_sp_0 remainder__AveRooms_sp_2'
'remainder__AveRooms_sp_0 remainder__AveBedrms_sp_0'
'remainder__AveRooms_sp_0 remainder__AveBedrms_sp_1'
'remainder__AveRooms_sp_0 remainder__AveBedrms_sp_2'
'remainder__AveRooms_sp_0 remainder__Population_sp_0'
'remainder__AveRooms_sp_0 remainder__Population_sp_1'
'remainder__AveRooms_sp_0 remainder__Population_sp_2'
'remainder__AveRooms_sp_0 remainder__AveOccup_sp_0'
'remainder__AveRooms_sp_0 remainder__AveOccup_sp_1'
'remainder__AveRooms_sp_0 remainder__AveOccup_sp_2'
'remainder__AveRooms_sp_1 remainder__AveRooms_sp_2'
'remainder__AveRooms_sp_1 remainder__AveBedrms_sp_0'
'remainder__AveRooms_sp_1 remainder__AveBedrms_sp_1'
'remainder__AveRooms_sp_1 remainder__AveBedrms_sp_2'
'remainder__AveRooms_sp_1 remainder__Population_sp_0'
'remainder__AveRooms_sp_1 remainder__Population_sp_1'
'remainder__AveRooms_sp_1 remainder__Population_sp_2'
'remainder__AveRooms_sp_1 remainder__AveOccup_sp_0'
'remainder__AveRooms_sp_1 remainder__AveOccup_sp_1'
'remainder__AveRooms_sp_1 remainder__AveOccup_sp_2'
'remainder__AveRooms_sp_2 remainder__AveBedrms_sp_0'
'remainder__AveRooms_sp_2 remainder__AveBedrms_sp_1'
'remainder__AveRooms_sp_2 remainder__AveBedrms_sp_2'
'remainder__AveRooms_sp_2 remainder__Population_sp_0'
'remainder__AveRooms_sp_2 remainder__Population_sp_1'
'remainder__AveRooms_sp_2 remainder__Population_sp_2'
'remainder__AveRooms_sp_2 remainder__AveOccup_sp_0'
'remainder__AveRooms_sp_2 remainder__AveOccup_sp_1'
'remainder__AveRooms_sp_2 remainder__AveOccup_sp_2'
'remainder__AveBedrms_sp_0 remainder__AveBedrms_sp_1'
'remainder__AveBedrms_sp_0 remainder__AveBedrms_sp_2'
'remainder__AveBedrms_sp_0 remainder__Population_sp_0'
'remainder__AveBedrms_sp_0 remainder__Population_sp_1'
'remainder__AveBedrms_sp_0 remainder__Population_sp_2'
'remainder__AveBedrms_sp_0 remainder__AveOccup_sp_0'
'remainder__AveBedrms_sp_0 remainder__AveOccup_sp_1'
'remainder__AveBedrms_sp_0 remainder__AveOccup_sp_2'
'remainder__AveBedrms_sp_1 remainder__AveBedrms_sp_2'
'remainder__AveBedrms_sp_1 remainder__Population_sp_0'
'remainder__AveBedrms_sp_1 remainder__Population_sp_1'
'remainder__AveBedrms_sp_1 remainder__Population_sp_2'
'remainder__AveBedrms_sp_1 remainder__AveOccup_sp_0'
'remainder__AveBedrms_sp_1 remainder__AveOccup_sp_1'
'remainder__AveBedrms_sp_1 remainder__AveOccup_sp_2'
'remainder__AveBedrms_sp_2 remainder__Population_sp_0'
'remainder__AveBedrms_sp_2 remainder__Population_sp_1'
'remainder__AveBedrms_sp_2 remainder__Population_sp_2'
'remainder__AveBedrms_sp_2 remainder__AveOccup_sp_0'
'remainder__AveBedrms_sp_2 remainder__AveOccup_sp_1'
'remainder__AveBedrms_sp_2 remainder__AveOccup_sp_2'
'remainder__Population_sp_0 remainder__Population_sp_1'
'remainder__Population_sp_0 remainder__Population_sp_2'
'remainder__Population_sp_0 remainder__AveOccup_sp_0'
'remainder__Population_sp_0 remainder__AveOccup_sp_1'
'remainder__Population_sp_0 remainder__AveOccup_sp_2'
'remainder__Population_sp_1 remainder__Population_sp_2'
'remainder__Population_sp_1 remainder__AveOccup_sp_0'
'remainder__Population_sp_1 remainder__AveOccup_sp_1'
'remainder__Population_sp_1 remainder__AveOccup_sp_2'
'remainder__Population_sp_2 remainder__AveOccup_sp_0'
'remainder__Population_sp_2 remainder__AveOccup_sp_1'
'remainder__Population_sp_2 remainder__AveOccup_sp_2'
'remainder__AveOccup_sp_0 remainder__AveOccup_sp_1'
'remainder__AveOccup_sp_0 remainder__AveOccup_sp_2'
'remainder__AveOccup_sp_1 remainder__AveOccup_sp_2']
We can see that, among the features selected by the statistical tests, there are many interactions between the geospatial features (derived from the K-means clustering) and the median income.
Note that these features are not sorted.
And here is the feature importance based on our model (sorted by absolute values):
selectk_ridge_report.feature_importance.coefficients().sort_values(
    by="Coefficient", key=abs, ascending=True
).tail(15).plot.barh(
    title="Model weights",
    xlabel="Coefficient",
    ylabel="Feature",
)
plt.tight_layout()
Tree-based models: mean decrease in impurity (MDI)
Now, let us look into tree-based models.
For feature importance, we inspect their Mean Decrease in Impurity (MDI).
The MDI of a feature is the normalized total reduction of the criterion (or loss)
brought by that feature.
The higher the MDI, the more important the feature.
Warning
The MDI is limited and can be misleading:

- When features have large differences in cardinality, the MDI tends to favor those with higher cardinality. Fortunately, in this example, we only have numerical features that share a similar cardinality, mitigating this concern.
- Since the MDI is typically calculated on the training set, it can reflect biases from overfitting: when a model overfits, the tree may partition less relevant regions of the feature space, artificially inflating MDI values and distorting the perceived importance of certain features. Soon, scikit-learn will enable computing the MDI on the test set, and we will make it available in skore. We will then be able to draw conclusions on how predictive a feature is, and not just how impactful it is on the training procedure.
Decision trees
Let us start with a simple decision tree.
We compare its performance with the models in our benchmark:
skore.MetricsSummaryDisplay(...)
We note that the performance is quite poor, so the derived feature importance should be treated with caution.
We display which accessors are available to us:
╭───────────────── Tools to diagnose estimator DecisionTreeRegressor ──────────────────╮
│ EstimatorReport │
│ ├── .metrics │
│ │ ├── .prediction_error(...) - Plot the prediction error of a regression │
│ │ │ model. │
│ │ ├── .r2(...) (↗︎) - Compute the R² score. │
│ │ ├── .rmse(...) (↘︎) - Compute the root mean squared error. │
│ │ ├── .timings(...) - Get all measured processing times related │
│ │ │ to the estimator. │
│ │ ├── .custom_metric(...) - Compute a custom metric. │
│ │ └── .summarize(...) - Report a set of metrics for our estimator. │
│ ├── .feature_importance │
│ │ ├── .mean_decrease_impurity(...) - Retrieve the mean decrease impurity (MDI) │
│ │ │ of a tree-based model. │
│ │ └── .permutation(...) - Report the permutation feature importance. │
│ ├── .cache_predictions(...) - Cache estimator's predictions. │
│ ├── .clear_cache(...) - Clear the cache. │
│ ├── .get_predictions(...) - Get estimator's predictions. │
│ └── Attributes │
│ ├── .X_test - Testing data │
│ ├── .X_train - Training data │
│ ├── .y_test - Testing target │
│ ├── .y_train - Training target │
│ ├── .estimator - Estimator to make the report from │
│ ├── .estimator_ - The cloned or copied estimator │
│ ├── .estimator_name_ - The name of the estimator │
│ ├── .fit - Whether to fit the estimator on the │
│ │ training data │
│ ├── .fit_time_ - The time taken to fit the estimator, in │
│ │ seconds │
│ ├── .ml_task - No description available │
│ └── .pos_label - For binary classification, the positive │
│ class │
│ │
│ │
│ Legend: │
│ (↗︎) higher is better (↘︎) lower is better │
╰──────────────────────────────────────────────────────────────────────────────────────╯
We have a mean_decrease_impurity() accessor.
First, let us interpret our model with regards to the original features.
For the visualization, we fix a very low max_depth so that it is easy for the human eye to visualize the tree using sklearn.tree.plot_tree():
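A sketch of the call behind the figure below, assuming tree_report is the (hypothetical) name of the EstimatorReport built on the decision tree; here max_depth only limits the rendering depth:

from sklearn.tree import plot_tree

plt.figure(figsize=(12, 6))
plot_tree(
    tree_report.estimator_,
    feature_names=tree_report.X_train.columns,
    max_depth=2,  # only draw the top of the tree
    filled=True,
)
plt.tight_layout()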

This tree explains how each sample is going to be predicted by our tree.
A decision tree provides a decision path for each sample, where the sample traverses the tree based on feature thresholds, and the final prediction is made at the leaf node (not represented above for conciseness).
At each node:

- samples is the number of samples that fall into that node;
- value is the predicted output for the samples that fall into this particular node: it is the mean of the target values for the samples in that node. At the root node, the value is 2.074. This means that if you were to make a prediction for all 15480 samples at this node (without any further splits), the predicted value would be 2.074, the mean of the target variable for those samples;
- squared_error is the mean squared error associated with the value, i.e. the average of the squared differences between the actual target values of the samples in the node and the node's predicted value (the mean);
- the first element is how the split is defined.
Let us explain how this works in practice.
At each node, the tree splits the data based on a feature and a threshold.
For the first node (the root node), MedInc <= 5.029 means that, for each sample, our decision tree first looks at the MedInc feature (which is thus the most important one): if the MedInc value is lower than 5.029 (the threshold), then the sample goes into the left node, otherwise it goes into the right one, and so on at each node.
As you move down the tree, the splits refine the predictions, leading to the leaf nodes where the final prediction for a sample is the value of the leaf it reaches.
Note that the second layer of nodes also splits on the MedInc feature, indicating that our model heavily relies on MedInc.
See also
A richer display of decision trees is available in the
dtreeviz python package.
For example, it shows the distribution of feature values split at each node and
tailors the visualization to the task at hand (whether classification
or regression).
Now, let us look at the feature importance based on the MDI:
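A sketch of the call behind the figure below, mirroring the random forest version shown later (tree_report is again the hypothetical report name):

tree_report.feature_importance.mean_decrease_impurity().plot.barh(
    title=f"Feature importance of {tree_report.estimator_name_}",
    xlabel="MDI",
    ylabel="Feature",
)
plt.tight_layout()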

For a decision tree, the MDI of each feature is averaged across all splits in the tree. Here, the impurity is the mean squared error.
As expected, MedInc is of great importance for our decision tree.
Indeed, in the above tree visualization, MedInc is used multiple times for splits and contributes greatly to reducing the squared error at multiple nodes.
At the root, it reduces the error from 1.335 to 0.832 and 0.546 in the children.
Random forest
Now, let us apply a more elaborate model: a random forest.
A random forest is an ensemble method that builds multiple decision trees, each
trained on a random subset of data and features.
For regression, it averages the trees’ predictions.
This reduces overfitting and improves accuracy compared to a single decision tree.
skore.MetricsSummaryDisplay(...)
Without any feature engineering or grid search, the random forest beats all the linear models!
Let us recall the number of trees in our random forest:
Number of trees in the forest: 100
Given that we have many trees, it is hard to use sklearn.tree.plot_tree() as we did for the single decision tree.
As for linear models (and the complex feature engineering), better performance often comes with less interpretability.
Let us look into the MDI of our random forest:
rf_report.feature_importance.mean_decrease_impurity().plot.barh(
    title=f"Feature importance of {rf_report.estimator_name_}",
    xlabel="MDI",
    ylabel="Feature",
)
plt.tight_layout()

In a random forest, the MDI is computed by averaging the MDI of each feature across all the decision trees in the forest.
As for the decision tree, MedInc is the most important feature.
As for the linear models with some feature engineering, the random forest also attributes a high importance to Longitude, Latitude, and AveOccup.
Model-agnostic: permutation feature importance
In the previous sections, we have inspected coefficients that are specific to linear
models and the MDI that is specific to tree-based models.
In this section, we look into the permutation importance, which is model-agnostic, meaning that it can be applied to any fitted estimator.
In particular, it works for linear models and tree-based ones.
Permutation feature importance measures the contribution of each feature to
a fitted model’s performance.
It randomly shuffles the values of a single feature and observes the resulting
degradation of the model’s score.
Permuting a predictive feature makes the performance decrease, while
permuting a non-predictive feature does not degrade the performance much.
This permutation importance can be computed on the train and test sets,
and by default skore computes it on the test set.
Compared to the coefficients and the MDI, the permutation importance can be
less misleading, but comes with a higher computation cost.
Permutation feature importance can also help reduce overfitting.
If a model overfits (high train score and low test score), and some
features are important only on the train set and not on the test set,
then these features might be the cause of the overfitting and it might be a good
idea to drop them.
Warning
The permutation feature importance can be misleading on strongly
correlated features. For more information, see
scikit-learn’s user guide.
Now, let us look at our helper:
╭───────────────────────── Tools to diagnose estimator Ridge ──────────────────────────╮
│ EstimatorReport │
│ ├── .metrics │
│ │ ├── .prediction_error(...) - Plot the prediction error of a regression │
│ │ │ model. │
│ │ ├── .r2(...) (↗︎) - Compute the R² score. │
│ │ ├── .rmse(...) (↘︎) - Compute the root mean squared error. │
│ │ ├── .timings(...) - Get all measured processing times related │
│ │ │ to the estimator. │
│ │ ├── .custom_metric(...) - Compute a custom metric. │
│ │ └── .summarize(...) - Report a set of metrics for our estimator. │
│ ├── .feature_importance │
│ │ ├── .coefficients(...) - Retrieve the coefficients of a linear │
│ │ │ model, including the intercept. │
│ │ └── .permutation(...) - Report the permutation feature importance. │
│ ├── .cache_predictions(...) - Cache estimator's predictions. │
│ ├── .clear_cache(...) - Clear the cache. │
│ ├── .get_predictions(...) - Get estimator's predictions. │
│ └── Attributes │
│ ├── .X_test - Testing data │
│ ├── .X_train - Training data │
│ ├── .y_test - Testing target │
│ ├── .y_train - Training target │
│ ├── .estimator - Estimator to make the report from │
│ ├── .estimator_ - The cloned or copied estimator │
│ ├── .estimator_name_ - The name of the estimator │
│ ├── .fit - Whether to fit the estimator on the │
│ │ training data │
│ ├── .fit_time_ - The time taken to fit the estimator, in │
│ │ seconds │
│ ├── .ml_task - No description available │
│ └── .pos_label - For binary classification, the positive │
│ class │
│ │
│ │
│ Legend: │
│ (↗︎) higher is better (↘︎) lower is better │
╰──────────────────────────────────────────────────────────────────────────────────────╯
We have a permutation() accessor:
| Metric | Feature    | Repeat #0 | Repeat #1 | Repeat #2 | Repeat #3 | Repeat #4 |
| r2     | MedInc     | 1.015997  | 1.065468  | 1.022420  | 1.021447  | 1.011064  |
|        | HouseAge   | 0.018828  | 0.024312  | 0.020991  | 0.020678  | 0.021016  |
|        | AveRooms   | 0.090388  | 0.089230  | 0.085344  | 0.086729  | 0.083356  |
|        | AveBedrms  | 0.099849  | 0.098794  | 0.102469  | 0.104776  | 0.107545  |
|        | Population | -0.000181 | -0.000103 | -0.000181 | -0.000023 | -0.000065 |
|        | AveOccup   | 0.003671  | 0.006451  | 0.007217  | 0.005760  | 0.005956  |
|        | Latitude   | 1.229644  | 1.155974  | 1.208524  | 1.193838  | 1.206576  |
|        | Longitude  | 1.097392  | 1.103421  | 1.136658  | 1.137157  | 1.116989  |
The permutation importance is often calculated several times, each time with a different permutation of the feature.
Hence, we can measure its variance (or standard deviation).
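For example, assuming permutation() returns a dataframe shaped like the table above (one column per repeat), the mean and standard deviation per feature can be obtained as follows (a sketch):

perm = ridge_report.feature_importance.permutation(data_source="test", seed=0)
# aggregate over the repeats (columns) to get one mean and std per feature
perm.agg(["mean", "std"], axis=1)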
Now, we plot the permutation feature importance on the train and test sets using boxplots:
def plot_permutation_train_test(est_report):
    _, ax = plt.subplots(figsize=(8, 6))
    train_color = "blue"
    test_color = "orange"
    est_report.feature_importance.permutation(data_source="train", seed=0).T.boxplot(
        ax=ax,
        vert=False,
        widths=0.35,
        patch_artist=True,
        boxprops=dict(facecolor=train_color, alpha=0.7),
        medianprops=dict(color="black"),
        positions=[x + 0.4 for x in range(len(est_report.X_train.columns))],
    )
    est_report.feature_importance.permutation(data_source="test", seed=0).T.boxplot(
        ax=ax,
        vert=False,
        widths=0.35,
        patch_artist=True,
        boxprops=dict(facecolor=test_color, alpha=0.7),
        medianprops=dict(color="black"),
        positions=range(len(est_report.X_test.columns)),
    )
    ax.legend(
        handles=[
            plt.Line2D([0], [0], color=train_color, lw=5, label="Train"),
            plt.Line2D([0], [0], color=test_color, lw=5, label="Test"),
        ],
        loc="best",
        title="Dataset",
    )
    ax.set_title(
        f"Permutation feature importance of {est_report.estimator_name_} (Train vs Test)"
    )
    ax.set_xlabel("$R^2$")
    ax.set_yticks([x + 0.2 for x in range(len(est_report.X_train.columns))])
    ax.set_yticklabels(est_report.X_train.columns)
    plt.tight_layout()
    plt.show()
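The helper is then applied to each report of interest, for instance (assuming the ridge_report object from earlier; the same call produces the later figures for selectk_ridge_report and the decision tree report):

plot_permutation_train_test(ridge_report)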

The standard deviation seems quite low.
For both the train and test sets, the result of the inspection is the same as with the coefficients: the most important features are Latitude, Longitude, and MedInc.
For selectk_ridge_report, we have a large pipeline that is fed to an EstimatorReport.
The pipeline contains a lot of preprocessing that creates many features.
By default, the permutation importance is computed at the entry of the whole pipeline, that is, with respect to the original features:

Hence, contrary to the coefficients, interpretation remains easy here even though our preprocessing created many features.
We notice that, because our preprocessing applies a clustering to the geospatial data, these cluster features are of great importance to our model.
For our decision tree, here is our permutation importance on the train and test sets:

The result of the inspection is the same as with the MDI: the most important features are MedInc, Latitude, Longitude, and AveOccup.
Conclusion
In this example, we used the California housing dataset to predict house prices with skore's EstimatorReport.
By employing the feature_importance accessor, we gained valuable insights into model behavior beyond mere predictive performance.
For linear models like Ridge regression, we inspected coefficients to understand feature contributions, revealing the prominence of MedInc, Latitude, and Longitude.
We explained the trade-off between performance (with complex feature engineering) and interpretability.
Interactions between features highlighted the importance of AveOccup.
With tree-based models such as decision trees, random forests, and gradient-boosted trees, we used the Mean Decrease in Impurity (MDI) to identify key features, notably AveOccup alongside MedInc, Latitude, and Longitude.
The random forest achieved the best score, without any complex feature engineering, in contrast to the linear models.
The model-agnostic permutation feature importance further enabled us to compare
feature significance across diverse model types.