.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/use_cases/plot_feature_importance.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_use_cases_plot_feature_importance.py: .. _example_feature_importance: ===================================================================== `EstimatorReport`: Inspecting your models with the feature importance ===================================================================== In this example, we tackle the California housing dataset where the goal is to perform a regression task: predicting house prices based on features such as the number of bedrooms, the geolocation, etc. For that, we try out several families of models. We evaluate these methods using skore's :class:`~skore.EstimatorReport` and its report on metrics. .. seealso:: As shown in :ref:`example_estimator_report`, the :class:`~skore.EstimatorReport` has a :meth:`~skore.EstimatorReport.metrics` accessor that enables you to evaluate your models and look at some scores that are automatically computed for you. Here, we go beyond predictive performance, and inspect these models to better interpret their behavior, by using feature importance. Indeed, in practice, inspection can help spot some flaws in models: it is always recommended to look "under the hood". For that, we use the unified :meth:`~skore.EstimatorReport.feature_importance` accessor of the :class:`~skore.EstimatorReport`. For linear models, we look at their coefficients. For tree-based models, we inspect their mean decrease in impurity (MDI). We can also inspect the permutation feature importance, that is model-agnostic. .. GENERATED FROM PYTHON SOURCE LINES 32-34 Loading the dataset and performing some exploratory data analysis (EDA) ======================================================================= .. GENERATED FROM PYTHON SOURCE LINES 36-38 Let us load the California housing dataset, which will enable us to perform a regression task about predicting house prices: .. GENERATED FROM PYTHON SOURCE LINES 40-46 .. code-block:: Python from sklearn.datasets import fetch_california_housing california_housing = fetch_california_housing(as_frame=True) X, y = california_housing.data, california_housing.target california_housing.frame.head(2) .. raw:: html
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252      41.0  6.984127    1.02381       322.0  2.555556     37.88    -122.23        4.526
1  8.3014      21.0  6.238137    0.97188      2401.0  2.109842     37.86    -122.22        3.585


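As a quick, hedged sanity check on what we just loaded (not part of the original
example), we can confirm the dataset size and the range of the target, which is
expressed in units of $100,000:

.. code-block:: Python

    # Minimal sanity check on the data loaded above.
    print("Shape of X:", X.shape)  # 20,640 districts, 8 numerical features
    print("Target range:", y.min(), "to", y.max())  # capped at about 5, i.e. ~$500,000
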
.. GENERATED FROM PYTHON SOURCE LINES 47-52 The documentation of the California housing dataset explains that the dataset contains aggregated data regarding each district in California in 1990 and the target (``MedHouseVal``) is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). Note that there are some vacation resorts, with a large number of rooms and bedrooms. .. GENERATED FROM PYTHON SOURCE LINES 54-59 .. seealso:: For more information about the California housing dataset, refer to `scikit-learn MOOC's page `_. Moreover, a more advanced modelling of this dataset is performed in `this skops example `_. .. GENERATED FROM PYTHON SOURCE LINES 61-63 Table report ------------ .. GENERATED FROM PYTHON SOURCE LINES 65-66 Let us perform some quick exploration on this dataset: .. GENERATED FROM PYTHON SOURCE LINES 68-72 .. code-block:: Python from skrub import TableReport TableReport(california_housing.frame) .. raw:: html




.. GENERATED FROM PYTHON SOURCE LINES 73-96 From the table report, we can draw some key observations: - Looking at the *Stats* tab, all features are numerical and there are no missing values. - Looking at the *Distributions* tab, we can notice that some features seem to have some outliers: ``MedInc``, ``AveRooms``, ``AveBedrms``, ``Population``, and ``AveOccup``. The feature with the largest number of potential outliers is ``AveBedrms``, probably corresponding to vacation resorts. - Looking at the *Associations* tab, we observe that: - The target feature ``MedHouseVal`` is mostly associated with ``MedInc``, ``Longitude``, and ``Latitude``. Indeed, intuitively, people with a large income would live in areas where the house prices are high. Moreover, we can expect some of these expensive areas to be close to one another. - The association power between the target and these features is not super high, which would indicate that each single feature can not correctly predict the target. Given that ``MedInc`` is associated with ``Longitude`` and also ``Latitude``, it might make sense to have some interactions between these features in our modelling: linear combinations might not be enough. .. GENERATED FROM PYTHON SOURCE LINES 98-102 Target feature -------------- The target distribution has a long tail: .. GENERATED FROM PYTHON SOURCE LINES 104-112 .. code-block:: Python import matplotlib.pyplot as plt import seaborn as sns sns.histplot( data=california_housing.frame, x=california_housing.target_names[0], bins=100 ) plt.show() .. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_001.png :alt: plot feature importance :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_001.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 113-117 There seems to be a threshold-effect for high-valued houses: all houses with a price above $500,000 are given the value $500,000. We keep these clipped values in our data and will inspect how our models deal with them. .. GENERATED FROM PYTHON SOURCE LINES 121-123 Now, as the median income ``MedInc`` is the feature with the highest association with our target, let us assess how ``MedInc`` relates to ``MedHouseVal``: .. GENERATED FROM PYTHON SOURCE LINES 125-139 .. code-block:: Python import pandas as pd import plotly.express as px X_y_plot = california_housing.frame.copy() X_y_plot["MedInc_bins"] = pd.qcut(X_y_plot["MedInc"], q=5) bin_order = X_y_plot["MedInc_bins"].cat.categories.sort_values() fig = px.histogram( X_y_plot, x=california_housing.target_names[0], color="MedInc_bins", category_orders={"MedInc_bins": bin_order}, ) fig .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 140-142 As could have been expected, a high salary often comes with a more expensive house. We can also notice the clipping effect of house prices for very high salaries. .. GENERATED FROM PYTHON SOURCE LINES 144-146 Geospatial features ------------------- .. GENERATED FROM PYTHON SOURCE LINES 148-152 From the table report, we noticed that the geospatial features ``Latitude`` and ``Longitude`` were well associated with our target. Hence, let us look into the coordinates of the districts in California, with regards to the target feature, using a map: .. GENERATED FROM PYTHON SOURCE LINES 155-167 .. code-block:: Python def plot_map(df, color_feature): fig = px.scatter_mapbox( df, lat="Latitude", lon="Longitude", color=color_feature, zoom=5, height=600 ) fig.update_layout( mapbox_style="open-street-map", mapbox_center={"lat": df["Latitude"].mean(), "lon": df["Longitude"].mean()}, margin={"r": 0, "t": 0, "l": 0, "b": 0}, ) return fig .. GENERATED FROM PYTHON SOURCE LINES 168-171 .. code-block:: Python fig = plot_map(california_housing.frame, california_housing.target_names[0]) fig .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 172-175 As could be expected, the price of the houses near the ocean is higher, especially around big cities like Los Angeles, San Francisco, and San Jose. Taking into account the coordinates in our modelling will be very important. .. GENERATED FROM PYTHON SOURCE LINES 177-179 Splitting the data ------------------ .. GENERATED FROM PYTHON SOURCE LINES 181-183 Just before diving into our first model, let us split our data into a train and a test split: .. GENERATED FROM PYTHON SOURCE LINES 185-189 .. code-block:: Python from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) .. GENERATED FROM PYTHON SOURCE LINES 190-192 Linear models: coefficients =========================== .. GENERATED FROM PYTHON SOURCE LINES 194-196 For our regression task, we first use linear models. For feature importance, we inspect their coefficients. .. GENERATED FROM PYTHON SOURCE LINES 198-200 Simple model ------------ .. GENERATED FROM PYTHON SOURCE LINES 202-206 Before trying any complex feature engineering, we start with a simple pipeline to have a baseline of what a "good score" is (remember that all scores are relative). Here, we use a Ridge regression along with some scaling and evaluate it using :meth:`skore.EstimatorReport.metrics`: .. GENERATED FROM PYTHON SOURCE LINES 208-222 .. code-block:: Python from sklearn.linear_model import Ridge from sklearn.pipeline import make_pipeline from sklearn.preprocessing import StandardScaler from skore import EstimatorReport ridge_report = EstimatorReport( make_pipeline(StandardScaler(), Ridge()), X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, ) ridge_report.metrics.report_metrics() .. raw:: html
                     Ridge
Metric
R²                0.591163
RMSE              0.735134
Fit time (s)      0.003606
Predict time (s)  0.001287


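The aggregated report above can also be queried metric by metric through the
:meth:`~skore.EstimatorReport.metrics` accessor (the available methods are listed by
``ridge_report.help()``, shown further down in this example). A minimal sketch:

.. code-block:: Python

    # Retrieve the two scores of the report individually; the values match the
    # table above.
    print(ridge_report.metrics.r2())
    print(ridge_report.metrics.rmse())
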
.. GENERATED FROM PYTHON SOURCE LINES 223-239 From the report metrics, let us first explain the scores we have access to: - The coefficient of determination (:func:`~sklearn.metrics.r2_score`), denoted as :math:`R^2`, which is a score. The best possible score is :math:`1` and a constant model that always predicts the average value of the target would get a score of :math:`0`. Note that the score can be negative, as it could be worse than the average. - The root mean squared error (:func:`~sklearn.metrics.root_mean_squared_error`), abbreviated as RMSE, which is an error. It takes the square root of the mean squared error (MSE) so it is expressed in the same units as the target variable. The MSE measures the average squared difference between the predicted values and the actual values. Here, the :math:`R^2` seems quite poor, so some further preprocessing would be needed. This is done further down in this example. .. GENERATED FROM PYTHON SOURCE LINES 241-250 .. warning:: Keep in mind that any observation drawn from inspecting the coefficients of this simple Ridge model is made on a model that performs quite poorly, hence must be treated with caution. Indeed, a poorly performing model does not capture the true underlying relationships in the data. A good practice would be to avoid inspecting models with poor performance. Here, we still inspect it, for demo purposes and because our model is not put into production! .. GENERATED FROM PYTHON SOURCE LINES 252-253 Let us plot the prediction error: .. GENERATED FROM PYTHON SOURCE LINES 255-257 .. code-block:: Python ridge_report.metrics.prediction_error().plot(kind="actual_vs_predicted") .. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_002.png :alt: plot feature importance :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_002.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 258-260 We can observe that the model has issues predicting large house prices, due to the clipping effect of the actual values. .. GENERATED FROM PYTHON SOURCE LINES 262-264 Now, to inspect our model, let us use the :meth:`skore.EstimatorReport.feature_importance` accessor: .. GENERATED FROM PYTHON SOURCE LINES 266-268 .. code-block:: Python ridge_report.feature_importance.coefficients() .. raw:: html
            Coefficient
Intercept      2.074373
MedInc         0.831864
HouseAge       0.121025
AveRooms      -0.261571
AveBedrms      0.303819
Population    -0.008702
AveOccup      -0.029855
Latitude      -0.891545
Longitude     -0.863022


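The coefficients of ``Latitude`` and ``Longitude`` are both large in absolute value;
as the note below recalls, coefficients of correlated features must be read with
care. A hedged sketch to check how correlated these two features are on the training
set:

.. code-block:: Python

    # Pearson correlation between the two geographic features; in California they
    # are expected to be strongly negatively correlated (roughly -0.9).
    print(X_train[["Latitude", "Longitude"]].corr())
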
.. GENERATED FROM PYTHON SOURCE LINES 269-274 .. note:: Beware that coefficients can be misleading when some features are correlated. For example, two coefficients can have large absolute values (so be considered important), but in the predictions, the sum of their contributions could cancel out (if they are highly correlated), so they would actually be unimportant. .. GENERATED FROM PYTHON SOURCE LINES 276-277 We can plot this pandas datafame: .. GENERATED FROM PYTHON SOURCE LINES 279-286 .. code-block:: Python ridge_report.feature_importance.coefficients().plot.barh( title="Model weights", xlabel="Coefficient", ylabel="Feature", ) plt.tight_layout() .. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_003.png :alt: Model weights :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_003.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 287-295 .. note:: More generally, :meth:`skore.EstimatorReport.feature_importance.coefficients` can help you inspect the coefficients of all linear models. We consider a linear model as defined in `scikit-learn's documentation `_. In short, we consider a "linear model" as a scikit-learn compatible estimator that holds a ``coef_`` attribute (after being fitted). .. GENERATED FROM PYTHON SOURCE LINES 297-319 Since we have included scaling in the pipeline, the resulting coefficients are all on the same scale, making them directly comparable to each other. Without this scaling step, the coefficients in a linear model would be influenced by the original scale of the feature values, which would prevent meaningful comparisons between them. .. seealso:: For more information about the importance of scaling, see scikit-learn's example on `Common pitfalls in the interpretation of coefficients of linear models `_. Here, it appears that the ``MedInc``, ``Latitude``, and ``Longitude`` features are the most important, with regards to the absolute value of other coefficients. This finding is consistent with our previous observations from the *Associations* tab of the table report. However, due to the scaling, we can not interpret the coefficient values with regards to the original unit of the feature. Let us unscale the coefficients, without forgetting the intercept, so that the coefficients can be interpreted using the original units: .. GENERATED FROM PYTHON SOURCE LINES 321-344 .. code-block:: Python import numpy as np # retrieve the mean and standard deviation used to standardize the feature values feature_mean = ridge_report.estimator_[0].mean_ feature_std = ridge_report.estimator_[0].scale_ def unscale_coefficients(df, feature_mean, feature_std): mask_intercept_column = df.index == "Intercept" df.loc["Intercept"] = df.loc["Intercept"] - np.sum( df.loc[~mask_intercept_column, "Coefficient"] * feature_mean / feature_std ) df.loc[~mask_intercept_column, "Coefficient"] = ( df.loc[~mask_intercept_column, "Coefficient"] / feature_std ) return df df_ridge_report_coef_unscaled = unscale_coefficients( ridge_report.feature_importance.coefficients(), feature_mean, feature_std ) df_ridge_report_coef_unscaled .. raw:: html
            Coefficient
Intercept    -36.573930
MedInc         0.439072
HouseAge       0.009606
AveRooms      -0.103240
AveBedrms      0.616257
Population    -0.000008
AveOccup      -0.004490
Latitude      -0.416969
Longitude     -0.430202


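Before reading these values in prose, here is a minimal, hedged sketch (the
``X_probe`` helpers below are purely illustrative) that probes the fitted pipeline
directly: since the scaler followed by the Ridge regressor is linear in the original
features, increasing ``AveBedrms`` by :math:`1` with everything else fixed should
shift the prediction by the corresponding unscaled coefficient:

.. code-block:: Python

    # Increase AveBedrms by 1 on a single test sample, all other features unchanged.
    X_probe = X_test.head(1).copy()
    X_probe_plus = X_probe.copy()
    X_probe_plus["AveBedrms"] += 1
    pred_base = ridge_report.estimator_.predict(X_probe)
    pred_plus = ridge_report.estimator_.predict(X_probe_plus)
    # Expected to be close to the unscaled AveBedrms coefficient, about 0.62
    # (i.e. roughly $62,000 on the predicted price).
    print(pred_plus[0] - pred_base[0])
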
.. GENERATED FROM PYTHON SOURCE LINES 345-351 Now, we can interpret each coefficient values with regards to the original units. We can interpret a coefficient as follows: according to our model, on average, having one additional bedroom (a increase of :math:`1` of ``AveBedrms``), with all other features being constant, increases the *predicted* house value of :math:`0.62` in $100,000, hence of $62,000. Note that we have not dealt with any potential outlier in this iteration. .. GENERATED FROM PYTHON SOURCE LINES 353-357 .. warning:: Recall that we are inspecting a model with poor performance, which is bad practice. Moreover, we must be cautious when trying to induce any causation effect (remember that correlation is not causation). .. GENERATED FROM PYTHON SOURCE LINES 359-361 More complex model ------------------ .. GENERATED FROM PYTHON SOURCE LINES 363-369 As previously mentioned, our simple Ridge model, although very easily interpretable with regards to the original units of the features, performs quite poorly. Now, we build a more complex model, with more feature engineering. We will see that this model will have a better score... but will be more difficult to interpret the coefficients with regards to the original features due to the complex feature engineering. .. GENERATED FROM PYTHON SOURCE LINES 371-380 In our previous EDA, when plotting the geospatial data with regards to the house prices, we noticed that it is important to take into account the latitude and longitude features. Moreover, we also observed that the median income is well associated with the house prices. Hence, we will try a feature engineering that takes into account the interactions of the geospatial features with features such as the income, using polynomial features. The interactions are no longer simply linear as previously. .. GENERATED FROM PYTHON SOURCE LINES 382-385 Let us build a model with some more complex feature engineering, and still use a Ridge regressor (linear model) at the end of the pipeline. In particular, we perform a K-means clustering on the geospatial features: .. GENERATED FROM PYTHON SOURCE LINES 387-405 .. code-block:: Python from sklearn.cluster import KMeans from sklearn.compose import make_column_transformer from sklearn.preprocessing import PolynomialFeatures, SplineTransformer geo_columns = ["Latitude", "Longitude"] preprocessor = make_column_transformer( (KMeans(n_clusters=10, random_state=0), geo_columns), remainder="passthrough", ) engineered_ridge = make_pipeline( preprocessor, SplineTransformer(sparse_output=True), PolynomialFeatures(degree=2, interaction_only=True, include_bias=False), Ridge(), ) engineered_ridge .. raw:: html
Pipeline(steps=[('columntransformer',
                     ColumnTransformer(remainder='passthrough',
                                       transformers=[('kmeans',
                                                      KMeans(n_clusters=10,
                                                             random_state=0),
                                                      ['Latitude', 'Longitude'])])),
                    ('splinetransformer', SplineTransformer(sparse_output=True)),
                    ('polynomialfeatures',
                     PolynomialFeatures(include_bias=False, interaction_only=True)),
                    ('ridge', Ridge())])


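A detail worth spelling out before scoring this pipeline (a hedged, standalone sketch
for illustration only): when :class:`~sklearn.cluster.KMeans` is used as a transformer
inside the column transformer, each sample is mapped to its distances to the 10
cluster centers, and the spline and polynomial steps are then built on top of these
distance features together with the passed-through columns:

.. code-block:: Python

    # Standalone illustration of what the KMeans step feeds downstream: one
    # distance per cluster center, i.e. 10 new features per sample.
    geo_kmeans = KMeans(n_clusters=10, random_state=0).fit(X_train[geo_columns])
    print(geo_kmeans.transform(X_train[geo_columns].head(3)).shape)  # (3, 10)
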
.. GENERATED FROM PYTHON SOURCE LINES 406-408 Now, let us compute the metrics and compare it to our previous model using a :class:`skore.ComparisonReport`: .. GENERATED FROM PYTHON SOURCE LINES 410-426 .. code-block:: Python from skore import ComparisonReport engineered_ridge_report = EstimatorReport( engineered_ridge, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, ) reports_to_compare = { "Vanilla Ridge": ridge_report, "Ridge w/ feature engineering": engineered_ridge_report, } comparator = ComparisonReport(reports=reports_to_compare) comparator.metrics.report_metrics() .. raw:: html
Estimator         Vanilla Ridge  Ridge w/ feature engineering
Metric
R²                     0.591163                      0.726869
RMSE                   0.735134                      0.600865
Fit time (s)           0.003606                      9.723627
Predict time (s)       0.001287                      0.221695


.. GENERATED FROM PYTHON SOURCE LINES 427-429 We get a much better score! Let us plot the prediction error: .. GENERATED FROM PYTHON SOURCE LINES 431-433 .. code-block:: Python engineered_ridge_report.metrics.prediction_error().plot(kind="actual_vs_predicted") .. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_004.png :alt: plot feature importance :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_004.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 434-438 About the clipping issue, compared to the prediction error of our previous model (``ridge_report``), our ``engineered_ridge_report`` model seems to produce predictions that are not as large, so it seems that some interactions between features have helped alleviate the clipping issue. .. GENERATED FROM PYTHON SOURCE LINES 440-442 However, interpreting the features is harder: indeed, our complex feature engineering introduced a *lot* of features: .. GENERATED FROM PYTHON SOURCE LINES 444-451 .. code-block:: Python print("Initial number of features:", X_train.shape[1]) # We slice the scikit-learn pipeline to extract the predictor, using -1 to access # the last step: n_features_engineered = engineered_ridge_report.estimator_[-1].n_features_in_ print("Number of features after feature engineering:", n_features_engineered) .. rst-class:: sphx-glr-script-out .. code-block:: none Initial number of features: 8 Number of features after feature engineering: 6328 .. GENERATED FROM PYTHON SOURCE LINES 452-453 Let us display the 15 largest absolute coefficients: .. GENERATED FROM PYTHON SOURCE LINES 455-464 .. code-block:: Python engineered_ridge_report.feature_importance.coefficients().sort_values( by="Coefficient", key=abs, ascending=True ).tail(15).plot.barh( title="Model weights", xlabel="Coefficient", ylabel="Feature", ) plt.tight_layout() .. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_005.png :alt: Model weights :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_005.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 465-471 We can observe that the most important features are interactions between features, mostly based on ``AveOccup``, that a simple linear model without feature engineering could not have captured. Indeed, the vanilla Ridge model did not consider ``AveOccup`` to be important. As the engineered Ridge has a better score, perhaps the vanilla Ridge missed something about ``AveOccup`` that seems to be key to predicting house prices. .. GENERATED FROM PYTHON SOURCE LINES 473-474 Let us visualize how ``AveOccup`` interacts with ``MedHouseVal``: .. GENERATED FROM PYTHON SOURCE LINES 476-487 .. code-block:: Python X_y_plot = california_housing.frame.copy() X_y_plot["AveOccup"] = pd.qcut(X_y_plot["AveOccup"], q=5) bin_order = X_y_plot["AveOccup"].cat.categories.sort_values() fig = px.histogram( X_y_plot, x=california_housing.target_names[0], color="AveOccup", category_orders={"AveOccup": bin_order}, ) fig .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 488-489 Finally, we can visualize the results of our K-means clustering (on the training set): .. GENERATED FROM PYTHON SOURCE LINES 491-504 .. code-block:: Python # getting the cluster labels col_transformer = engineered_ridge_report.estimator_.named_steps["columntransformer"] kmeans = col_transformer.named_transformers_["kmeans"] clustering_labels = kmeans.labels_ # adding the cluster labels to our dataframe X_train_plot = X_train.copy() X_train_plot.insert(0, "clustering_labels", clustering_labels) # plotting the map plot_map(X_train_plot, "clustering_labels") .. raw:: html


.. GENERATED FROM PYTHON SOURCE LINES 505-507 Compromising on complexity -------------------------- .. GENERATED FROM PYTHON SOURCE LINES 509-514 Now, let us build a model with a more interpretable feature engineering, although it might not perform as well. For that, after the complex feature engineering, we perform some feature selection using a :class:`~sklearn.feature_selection.SelectKBest`, in order to reduce the number of features. .. GENERATED FROM PYTHON SOURCE LINES 516-532 .. code-block:: Python from sklearn.feature_selection import SelectKBest, VarianceThreshold from sklearn.linear_model import RidgeCV preprocessor = make_column_transformer( (KMeans(n_clusters=10, random_state=0), geo_columns), remainder="passthrough", ) selectkbest_ridge = make_pipeline( preprocessor, SplineTransformer(sparse_output=True), PolynomialFeatures(degree=2, interaction_only=True, include_bias=False), VarianceThreshold(), SelectKBest(k=150), RidgeCV(np.logspace(-5, 5, num=100)), ) .. GENERATED FROM PYTHON SOURCE LINES 533-539 .. note:: To keep the computation time of this example low, we did not tune the hyperparameters of the predictive model. However, on a real use case, it would be important to tune the model using :class:`~sklearn.model_selection.RandomizedSearchCV` and not just the :class:`~sklearn.linear_model.RidgeCV`. .. GENERATED FROM PYTHON SOURCE LINES 541-542 Let us get the metrics for our model and compare it with our previous iterations: .. GENERATED FROM PYTHON SOURCE LINES 544-555 .. code-block:: Python selectk_ridge_report = EstimatorReport( selectkbest_ridge, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, ) reports_to_compare["Ridge w/ feature engineering and selection"] = selectk_ridge_report comparator = ComparisonReport(reports=reports_to_compare) comparator.metrics.report_metrics() .. rst-class:: sphx-glr-script-out .. code-block:: none /opt/hostedtoolcache/Python/3.12.10/x64/lib/python3.12/site-packages/sklearn/feature_selection/_univariate_selection.py:112: RuntimeWarning: divide by zero encountered in divide .. raw:: html
Estimator         Vanilla Ridge  Ridge w/ feature engineering  Ridge w/ feature engineering and selection
Metric
R²                     0.591163                      0.726869                                    0.689991
RMSE                   0.735134                      0.600865                                    0.640145
Fit time (s)           0.003606                      9.723627                                    8.325749
Predict time (s)       0.001287                      0.221695                                    0.352263


.. GENERATED FROM PYTHON SOURCE LINES 556-557 We get a good score and much less features: .. GENERATED FROM PYTHON SOURCE LINES 559-568 .. code-block:: Python print("Initial number of features:", X_train.shape[1]) print("Number of features after feature engineering:", n_features_engineered) n_features_selectk = selectk_ridge_report.estimator_[-1].n_features_in_ print( "Number of features after feature engineering using `SelectKBest`:", n_features_selectk, ) .. rst-class:: sphx-glr-script-out .. code-block:: none Initial number of features: 8 Number of features after feature engineering: 6328 Number of features after feature engineering using `SelectKBest`: 150 .. GENERATED FROM PYTHON SOURCE LINES 569-571 According to the :class:`~sklearn.feature_selection.SelectKBest`, the most important features are the following: .. GENERATED FROM PYTHON SOURCE LINES 573-576 .. code-block:: Python selectk_features = selectk_ridge_report.estimator_[:-1].get_feature_names_out() print(selectk_features) .. rst-class:: sphx-glr-script-out .. code-block:: none ['remainder__MedInc_sp_1' 'remainder__MedInc_sp_2' 'remainder__MedInc_sp_3' 'remainder__MedInc_sp_4' 'kmeans__kmeans0_sp_0 remainder__MedInc_sp_3' 'kmeans__kmeans0_sp_1 remainder__MedInc_sp_3' 'kmeans__kmeans0_sp_2 remainder__MedInc_sp_0' 'kmeans__kmeans0_sp_2 remainder__MedInc_sp_1' 'kmeans__kmeans0_sp_2 remainder__MedInc_sp_4' 'kmeans__kmeans0_sp_3 remainder__MedInc_sp_3' 'kmeans__kmeans0_sp_3 remainder__AveRooms_sp_3' 'kmeans__kmeans1_sp_0 remainder__MedInc_sp_3' 'kmeans__kmeans1_sp_0 remainder__Population_sp_3' 'kmeans__kmeans1_sp_1 remainder__MedInc_sp_3' 'kmeans__kmeans1_sp_2 remainder__MedInc_sp_0' 'kmeans__kmeans1_sp_2 remainder__MedInc_sp_1' 'kmeans__kmeans1_sp_2 remainder__AveRooms_sp_3' 'kmeans__kmeans1_sp_3 remainder__MedInc_sp_3' 'kmeans__kmeans1_sp_4 remainder__MedInc_sp_3' 'kmeans__kmeans1_sp_6 kmeans__kmeans4_sp_4' 'kmeans__kmeans1_sp_6 remainder__MedInc_sp_5' 'kmeans__kmeans2_sp_0 remainder__MedInc_sp_0' 'kmeans__kmeans2_sp_0 remainder__AveRooms_sp_4' 'kmeans__kmeans2_sp_0 remainder__AveBedrms_sp_4' 'kmeans__kmeans2_sp_1 remainder__MedInc_sp_0' 'kmeans__kmeans2_sp_1 remainder__MedInc_sp_1' 'kmeans__kmeans2_sp_1 remainder__AveRooms_sp_3' 'kmeans__kmeans2_sp_2 remainder__MedInc_sp_3' 'kmeans__kmeans2_sp_2 remainder__MedInc_sp_4' 'kmeans__kmeans2_sp_2 remainder__AveRooms_sp_3' 'kmeans__kmeans2_sp_3 remainder__MedInc_sp_2' 'kmeans__kmeans2_sp_3 remainder__MedInc_sp_3' 'kmeans__kmeans2_sp_3 remainder__MedInc_sp_4' 'kmeans__kmeans2_sp_6 kmeans__kmeans3_sp_0' 'kmeans__kmeans2_sp_6 kmeans__kmeans6_sp_1' 'kmeans__kmeans3_sp_2 remainder__MedInc_sp_3' 'kmeans__kmeans3_sp_2 remainder__AveRooms_sp_3' 'kmeans__kmeans3_sp_3 remainder__MedInc_sp_3' 'kmeans__kmeans3_sp_4 remainder__MedInc_sp_3' 'kmeans__kmeans3_sp_5 kmeans__kmeans4_sp_4' 'kmeans__kmeans3_sp_6 kmeans__kmeans4_sp_4' 'kmeans__kmeans4_sp_1 remainder__MedInc_sp_3' 'kmeans__kmeans4_sp_2 remainder__MedInc_sp_3' 'kmeans__kmeans4_sp_3 remainder__MedInc_sp_3' 'kmeans__kmeans4_sp_3 remainder__AveRooms_sp_3' 'kmeans__kmeans4_sp_4 kmeans__kmeans5_sp_6' 'kmeans__kmeans4_sp_4 kmeans__kmeans7_sp_6' 'kmeans__kmeans4_sp_4 kmeans__kmeans9_sp_5' 'kmeans__kmeans4_sp_4 kmeans__kmeans9_sp_6' 'kmeans__kmeans5_sp_0 remainder__Population_sp_3' 'kmeans__kmeans5_sp_2 remainder__AveRooms_sp_3' 'kmeans__kmeans5_sp_3 remainder__MedInc_sp_0' 'kmeans__kmeans5_sp_3 remainder__MedInc_sp_1' 'kmeans__kmeans5_sp_3 remainder__MedInc_sp_3' 'kmeans__kmeans5_sp_3 remainder__MedInc_sp_4' 'kmeans__kmeans5_sp_3 
remainder__AveRooms_sp_3' 'kmeans__kmeans5_sp_4 remainder__MedInc_sp_3' 'kmeans__kmeans6_sp_0 remainder__Population_sp_3' 'kmeans__kmeans6_sp_0 remainder__Population_sp_4' 'kmeans__kmeans6_sp_1 remainder__MedInc_sp_3' 'kmeans__kmeans6_sp_1 remainder__AveRooms_sp_4' 'kmeans__kmeans6_sp_2 remainder__MedInc_sp_3' 'kmeans__kmeans6_sp_2 remainder__MedInc_sp_4' 'kmeans__kmeans6_sp_3 remainder__MedInc_sp_3' 'kmeans__kmeans7_sp_1 remainder__AveRooms_sp_3' 'kmeans__kmeans7_sp_2 remainder__MedInc_sp_0' 'kmeans__kmeans7_sp_2 remainder__MedInc_sp_1' 'kmeans__kmeans7_sp_2 remainder__MedInc_sp_3' 'kmeans__kmeans7_sp_3 remainder__MedInc_sp_3' 'kmeans__kmeans7_sp_4 remainder__MedInc_sp_3' 'kmeans__kmeans8_sp_1 remainder__MedInc_sp_3' 'kmeans__kmeans8_sp_2 remainder__MedInc_sp_3' 'kmeans__kmeans8_sp_3 remainder__AveRooms_sp_3' 'kmeans__kmeans9_sp_1 remainder__MedInc_sp_3' 'kmeans__kmeans9_sp_2 remainder__MedInc_sp_0' 'kmeans__kmeans9_sp_2 remainder__MedInc_sp_1' 'kmeans__kmeans9_sp_3 remainder__MedInc_sp_3' 'kmeans__kmeans9_sp_3 remainder__AveRooms_sp_3' 'kmeans__kmeans9_sp_4 remainder__MedInc_sp_3' 'remainder__MedInc_sp_0 remainder__MedInc_sp_2' 'remainder__MedInc_sp_0 remainder__MedInc_sp_3' 'remainder__MedInc_sp_0 remainder__AveRooms_sp_2' 'remainder__MedInc_sp_1 remainder__MedInc_sp_2' 'remainder__MedInc_sp_1 remainder__MedInc_sp_3' 'remainder__MedInc_sp_1 remainder__MedInc_sp_4' 'remainder__MedInc_sp_1 remainder__AveRooms_sp_0' 'remainder__MedInc_sp_1 remainder__AveRooms_sp_1' 'remainder__MedInc_sp_1 remainder__AveRooms_sp_2' 'remainder__MedInc_sp_1 remainder__AveBedrms_sp_0' 'remainder__MedInc_sp_1 remainder__AveBedrms_sp_1' 'remainder__MedInc_sp_1 remainder__AveBedrms_sp_2' 'remainder__MedInc_sp_1 remainder__Population_sp_0' 'remainder__MedInc_sp_1 remainder__Population_sp_1' 'remainder__MedInc_sp_1 remainder__Population_sp_2' 'remainder__MedInc_sp_1 remainder__AveOccup_sp_0' 'remainder__MedInc_sp_1 remainder__AveOccup_sp_1' 'remainder__MedInc_sp_1 remainder__AveOccup_sp_2' 'remainder__MedInc_sp_2 remainder__MedInc_sp_3' 'remainder__MedInc_sp_2 remainder__MedInc_sp_4' 'remainder__MedInc_sp_2 remainder__AveRooms_sp_0' 'remainder__MedInc_sp_2 remainder__AveRooms_sp_1' 'remainder__MedInc_sp_2 remainder__AveBedrms_sp_0' 'remainder__MedInc_sp_2 remainder__AveBedrms_sp_1' 'remainder__MedInc_sp_2 remainder__AveBedrms_sp_2' 'remainder__MedInc_sp_2 remainder__Population_sp_1' 'remainder__MedInc_sp_2 remainder__AveOccup_sp_0' 'remainder__MedInc_sp_2 remainder__AveOccup_sp_1' 'remainder__MedInc_sp_2 remainder__AveOccup_sp_2' 'remainder__MedInc_sp_3 remainder__MedInc_sp_4' 'remainder__MedInc_sp_3 remainder__HouseAge_sp_2' 'remainder__MedInc_sp_3 remainder__HouseAge_sp_3' 'remainder__MedInc_sp_3 remainder__HouseAge_sp_4' 'remainder__MedInc_sp_3 remainder__AveRooms_sp_0' 'remainder__MedInc_sp_3 remainder__AveRooms_sp_1' 'remainder__MedInc_sp_3 remainder__AveRooms_sp_2' 'remainder__MedInc_sp_3 remainder__AveRooms_sp_3' 'remainder__MedInc_sp_3 remainder__AveBedrms_sp_0' 'remainder__MedInc_sp_3 remainder__AveBedrms_sp_1' 'remainder__MedInc_sp_3 remainder__AveBedrms_sp_2' 'remainder__MedInc_sp_3 remainder__AveBedrms_sp_3' 'remainder__MedInc_sp_3 remainder__Population_sp_0' 'remainder__MedInc_sp_3 remainder__Population_sp_1' 'remainder__MedInc_sp_3 remainder__Population_sp_2' 'remainder__MedInc_sp_3 remainder__Population_sp_3' 'remainder__MedInc_sp_3 remainder__AveOccup_sp_0' 'remainder__MedInc_sp_3 remainder__AveOccup_sp_1' 'remainder__MedInc_sp_3 remainder__AveOccup_sp_2' 'remainder__MedInc_sp_4 
remainder__AveRooms_sp_0' 'remainder__MedInc_sp_4 remainder__AveRooms_sp_1' 'remainder__MedInc_sp_4 remainder__AveRooms_sp_2' 'remainder__MedInc_sp_4 remainder__AveBedrms_sp_0' 'remainder__MedInc_sp_4 remainder__AveBedrms_sp_1' 'remainder__MedInc_sp_4 remainder__AveBedrms_sp_2' 'remainder__MedInc_sp_4 remainder__AveBedrms_sp_3' 'remainder__MedInc_sp_4 remainder__Population_sp_0' 'remainder__MedInc_sp_4 remainder__Population_sp_1' 'remainder__MedInc_sp_4 remainder__Population_sp_2' 'remainder__MedInc_sp_4 remainder__Population_sp_3' 'remainder__MedInc_sp_4 remainder__Population_sp_4' 'remainder__MedInc_sp_4 remainder__AveOccup_sp_0' 'remainder__MedInc_sp_4 remainder__AveOccup_sp_1' 'remainder__MedInc_sp_4 remainder__AveOccup_sp_2' 'remainder__MedInc_sp_5 remainder__Population_sp_4' 'remainder__AveRooms_sp_3 remainder__AveBedrms_sp_2' 'remainder__AveRooms_sp_3 remainder__Population_sp_3' 'remainder__AveRooms_sp_4 remainder__AveBedrms_sp_1' 'remainder__AveRooms_sp_4 remainder__AveBedrms_sp_2' 'remainder__AveRooms_sp_4 remainder__Population_sp_3' 'remainder__AveBedrms_sp_4 remainder__Population_sp_3' 'remainder__Population_sp_1 remainder__Population_sp_4'] .. GENERATED FROM PYTHON SOURCE LINES 577-581 We can see that, in the best features, according to statistical tests, there are many interactions between geospatial features (derived from the K-means clustering) and the median income. Note that these features are not sorted. .. GENERATED FROM PYTHON SOURCE LINES 583-584 And here is the feature importance based on our model (sorted by absolute values): .. GENERATED FROM PYTHON SOURCE LINES 586-595 .. code-block:: Python selectk_ridge_report.feature_importance.coefficients().sort_values( by="Coefficient", key=abs, ascending=True ).tail(15).plot.barh( title="Model weights", xlabel="Coefficient", ylabel="Feature", ) plt.tight_layout() .. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_006.png :alt: Model weights :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_006.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 596-598 Tree-based models: mean decrease in impurity (MDI) ================================================== .. GENERATED FROM PYTHON SOURCE LINES 600-605 Now, let us look into tree-based models. For feature importance, we inspect their Mean Decrease in Impurity (MDI). The MDI of a feature is the normalized total reduction of the criterion (or loss) brought by that feature. The higher the MDI, the more important the feature. .. GENERATED FROM PYTHON SOURCE LINES 607-623 .. warning:: The MDI is limited and can be misleading: - When features have large differences in cardinality, the MDI tends to favor those with higher cardinality. Fortunately, in this example, we have only numerical features that share similar cardinality, mitigating this concern. - Since the MDI is typically calculated on the training set, it can reflect biases from overfitting. When a model overfits, the tree may partition less relevant regions of the feature space, artificially inflating MDI values and distorting the perceived importance of certain features. Soon, scikit-learn will enable the computing of the MDI on the test set, and we will make it available in skore. Hence, we would be able to draw conclusions on how predictive a feature is and not just how impactful it is on the training procedure. .. GENERATED FROM PYTHON SOURCE LINES 625-630 .. 
seealso:: For more information about the MDI, see scikit-learn's `Permutation Importance vs Random Forest Feature Importance (MDI) `_. .. GENERATED FROM PYTHON SOURCE LINES 633-635 Decision trees -------------- .. GENERATED FROM PYTHON SOURCE LINES 637-638 Let us start with a simple decision tree. .. GENERATED FROM PYTHON SOURCE LINES 640-643 .. seealso:: For more information about decision trees, see scikit-learn's example on `Understanding the decision tree structure `_. .. GENERATED FROM PYTHON SOURCE LINES 645-656 .. code-block:: Python from sklearn.tree import DecisionTreeRegressor tree_report = EstimatorReport( DecisionTreeRegressor(random_state=0), X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, ) reports_to_compare["Decision tree"] = tree_report .. GENERATED FROM PYTHON SOURCE LINES 657-658 We compare its performance with the models in our benchmark: .. GENERATED FROM PYTHON SOURCE LINES 660-663 .. code-block:: Python comparator = ComparisonReport(reports=reports_to_compare) comparator.metrics.report_metrics() .. raw:: html
Estimator         Vanilla Ridge  Ridge w/ feature engineering  Ridge w/ feature engineering and selection  Decision tree
Metric
R²                     0.591163                      0.726869                                    0.689991       0.583785
RMSE                   0.735134                      0.600865                                    0.640145       0.741737
Fit time (s)           0.003606                      9.723627                                    8.325749       0.186716
Predict time (s)       0.001287                      0.221695                                    0.352263       0.002108


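Before inspecting this tree, a hedged sketch of its size helps put this score in
perspective: an unconstrained :class:`~sklearn.tree.DecisionTreeRegressor` typically
grows very deep on this dataset, which usually goes hand in hand with overfitting:

.. code-block:: Python

    # Size of the fully grown tree fitted by the report above.
    print("Tree depth:", tree_report.estimator_.get_depth())
    print("Number of leaves:", tree_report.estimator_.get_n_leaves())
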
.. GENERATED FROM PYTHON SOURCE LINES 664-666 We note that the performance is quite poor, so the derived feature importance is to be dealt with caution. .. GENERATED FROM PYTHON SOURCE LINES 668-669 We display which accessors are available to us: .. GENERATED FROM PYTHON SOURCE LINES 671-673 .. code-block:: Python tree_report.help() .. rst-class:: sphx-glr-script-out .. code-block:: none ╭───────────────── Tools to diagnose estimator DecisionTreeRegressor ──────────────────╮ │ EstimatorReport │ │ ├── .metrics │ │ │ ├── .prediction_error(...) - Plot the prediction error of a regression │ │ │ │ model. │ │ │ ├── .r2(...) (↗︎) - Compute the R² score. │ │ │ ├── .rmse(...) (↘︎) - Compute the root mean squared error. │ │ │ ├── .timings(...) - Get all measured processing times related │ │ │ │ to the estimator. │ │ │ ├── .custom_metric(...) - Compute a custom metric. │ │ │ └── .report_metrics(...) - Report a set of metrics for our estimator. │ │ ├── .feature_importance │ │ │ ├── .mean_decrease_impurity(...) - Retrieve the mean decrease impurity (MDI) │ │ │ │ of a tree-based model. │ │ │ └── .permutation(...) - Report the permutation feature importance. │ │ ├── .cache_predictions(...) - Cache estimator's predictions. │ │ ├── .clear_cache(...) - Clear the cache. │ │ ├── .get_predictions(...) - Get estimator's predictions. │ │ └── Attributes │ │ ├── .X_test - Testing data │ │ ├── .X_train - Training data │ │ ├── .y_test - Testing target │ │ ├── .y_train - Training target │ │ ├── .estimator_ - The cloned or copied estimator │ │ ├── .estimator_name_ - The name of the estimator │ │ ├── .fit_time_ - The time taken to fit the estimator, in │ │ │ seconds │ │ └── .ml_task - No description available │ │ │ │ │ │ Legend: │ │ (↗︎) higher is better (↘︎) lower is better │ ╰──────────────────────────────────────────────────────────────────────────────────────╯ .. GENERATED FROM PYTHON SOURCE LINES 674-677 We have a :meth:`~skore.EstimatorReport.feature_importance.mean_decrease_impurity` accessor. .. GENERATED FROM PYTHON SOURCE LINES 679-682 First, let us interpret our model with regards to the original features. For the visualization, we fix a very low ``max_depth`` so that it will be easy for the human eye to visualize the tree using :func:`sklearn.tree.plot_tree`: .. GENERATED FROM PYTHON SOURCE LINES 684-693 .. code-block:: Python from sklearn.tree import plot_tree plot_tree( tree_report.estimator_, feature_names=tree_report.estimator_.feature_names_in_, max_depth=2, ) plt.tight_layout() .. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_007.png :alt: plot feature importance :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_007.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 694-711 This tree explains how each sample is going to be predicted by our tree. A decision tree provides a decision path for each sample, where the sample traverses the tree based on feature thresholds, and the final prediction is made at the leaf node (not represented above for conciseness purposes). At each node: - ``samples`` is the number of samples that fall into that node, - ``value`` is the predicted output for the samples that fall into this particular node (it is the mean of the target values for the samples in that node). At the root node, the value is :math:`2.074`. 
This means that if you were to make a prediction for all :math:`15480` samples at this node (without any further splits), the predicted value would be :math:`2.074`, which is the mean of the target variable for those samples. - ``squared_error`` is the mean squared error associated with the ``value``, representing the average of the squared differences between the actual target values of the samples in the node and the node's predicted ``value`` (the mean), - the first element is how the split is defined. .. GENERATED FROM PYTHON SOURCE LINES 713-724 Let us explain how this works in practice. At each node, the tree splits the data based on a feature and a threshold. For the first node (the root node), ``MedInc <= 5.029`` means that, for each sample, our decision tree first looks at the ``MedInc`` feature (which is thus the most important one): if the ``MedInc`` value is lower than :math:`5.029` (the threshold), then the sample goes into the left node, otherwise it goes to the right, and so on for each node. As you move down the tree, the splits refine the predictions, leading to the leaf nodes where the final prediction for a sample is the ``value`` of the leaf it reaches. Note that for the second node layer, it is also the ``MedInc`` feature that is used for the threshold, indicating that our model heavily relies on ``MedInc``. .. GENERATED FROM PYTHON SOURCE LINES 726-732 .. seealso:: A richer display of decision trees is available in the `dtreeviz `_ python package. For example, it shows the distribution of feature values split at each node and tailors the visualization to the task at hand (whether classification or regression). .. GENERATED FROM PYTHON SOURCE LINES 734-735 Now, let us look at the feature importance based on the MDI: .. GENERATED FROM PYTHON SOURCE LINES 737-744 .. code-block:: Python tree_report.feature_importance.mean_decrease_impurity().plot.barh( title=f"Feature importance of {tree_report.estimator_name_}", xlabel="MDI", ylabel="Feature", ) plt.tight_layout() .. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_008.png :alt: Feature importance of DecisionTreeRegressor :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_008.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 745-753 For a decision tree, for each feature, the MDI is averaged across all splits in the tree. Here, the impurity is the mean squared error. As expected, ``MedInc`` is of great importance for our decision tree. Indeed, in the above tree visualization, ``MedInc`` is used multiple times for splits and contributes greatly to reducing the squared error at multiple nodes. At the root, it reduces the error from :math:`1.335` to :math:`0.832` and :math:`0.546` in the children. .. GENERATED FROM PYTHON SOURCE LINES 755-763 Random forest ------------- Now, let us apply a more elaborate model: a random forest. A random forest is an ensemble method that builds multiple decision trees, each trained on a random subset of data and features. For regression, it averages the trees' predictions. This reduces overfitting and improves accuracy compared to a single decision tree. .. GENERATED FROM PYTHON SOURCE LINES 765-780 .. 
code-block:: Python from sklearn.ensemble import RandomForestRegressor n_estimators = 100 rf_report = EstimatorReport( RandomForestRegressor(random_state=0, n_estimators=n_estimators), X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, ) reports_to_compare["Random forest"] = rf_report comparator = ComparisonReport(reports=reports_to_compare) comparator.metrics.report_metrics() .. raw:: html
Estimator         Vanilla Ridge  Ridge w/ feature engineering  Ridge w/ feature engineering and selection  Decision tree  Random forest
Metric
R²                     0.591163                      0.726869                                    0.689991       0.583785       0.794168
RMSE                   0.735134                      0.600865                                    0.640145       0.741737       0.521612
Fit time (s)           0.003606                      9.723627                                    8.325749       0.186716      11.792969
Predict time (s)       0.001287                      0.221695                                    0.352263       0.002108       0.124773


.. GENERATED FROM PYTHON SOURCE LINES 781-783 Without any feature engineering and any grid search, the random forest beats all linear models! .. GENERATED FROM PYTHON SOURCE LINES 785-786 Let us recall the number of trees in our random forest: .. GENERATED FROM PYTHON SOURCE LINES 788-790 .. code-block:: Python print(f"Number of trees in the forest: {n_estimators}") .. rst-class:: sphx-glr-script-out .. code-block:: none Number of trees in the forest: 100 .. GENERATED FROM PYTHON SOURCE LINES 791-795 Given that we have many trees, it is hard to use :func:`sklearn.tree.plot_tree` as for the single decision tree. As for linear models (and the complex feature engineering), better performance often comes with less interpretability. .. GENERATED FROM PYTHON SOURCE LINES 797-798 Let us look into the MDI of our random forest: .. GENERATED FROM PYTHON SOURCE LINES 800-807 .. code-block:: Python rf_report.feature_importance.mean_decrease_impurity().plot.barh( title=f"Feature importance of {rf_report.estimator_name_}", xlabel="MDI", ylabel="Feature", ) plt.tight_layout() .. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_009.png :alt: Feature importance of RandomForestRegressor :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_009.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 808-814 In a random forest, the MDI is computed by averaging the MDI of each feature across all the decision trees in the forest. As for the decision tree, ``MecInc`` is the most important feature. As for linear models with some feature engineering, the random forest also attributes a high importance to ``Longitude``, ``Latitude``, and ``AveOccup``. .. GENERATED FROM PYTHON SOURCE LINES 816-818 Model-agnostic: permutation feature importance ============================================== .. GENERATED FROM PYTHON SOURCE LINES 820-826 In the previous sections, we have inspected coefficients that are specific to linear models and the MDI that is specific to tree-based models. In this section, we look into the `permutation importance `_ which is model agnostic, meaning that it can be applied to any fitted estimator. In particular, it works for linear models and tree-based ones. .. GENERATED FROM PYTHON SOURCE LINES 828-838 Permutation feature importance measures the contribution of each feature to a fitted model's performance. It randomly shuffles the values of a single feature and observes the resulting degradation of the model's score. Permuting a predictive feature makes the performance decrease, while permuting a non-predictive feature does not degrade the performance much. This permutation importance can be computed on the train and test sets, and by default skore computes it on the test set. Compared to the coefficients and the MDI, the permutation importance can be less misleading, but comes with a higher computation cost. .. GENERATED FROM PYTHON SOURCE LINES 840-845 Permutation feature importance can also help reduce overfitting. If a model overfits (high train score and low test score), and some features are important only on the train set and not on the test set, then these features might be the cause of the overfitting and it might be a good idea to drop them. .. GENERATED FROM PYTHON SOURCE LINES 847-852 .. warning:: The permutation feature importance can be misleading on strongly correlated features. For more information, see `scikit-learn's user guide `_. .. GENERATED FROM PYTHON SOURCE LINES 854-855 Now, let us look at our helper: .. 
GENERATED FROM PYTHON SOURCE LINES 857-859 .. code-block:: Python ridge_report.help() .. rst-class:: sphx-glr-script-out .. code-block:: none ╭───────────────────────── Tools to diagnose estimator Ridge ──────────────────────────╮ │ EstimatorReport │ │ ├── .metrics │ │ │ ├── .prediction_error(...) - Plot the prediction error of a regression │ │ │ │ model. │ │ │ ├── .r2(...) (↗︎) - Compute the R² score. │ │ │ ├── .rmse(...) (↘︎) - Compute the root mean squared error. │ │ │ ├── .timings(...) - Get all measured processing times related │ │ │ │ to the estimator. │ │ │ ├── .custom_metric(...) - Compute a custom metric. │ │ │ └── .report_metrics(...) - Report a set of metrics for our estimator. │ │ ├── .feature_importance │ │ │ ├── .coefficients(...) - Retrieve the coefficients of a linear │ │ │ │ model, including the intercept. │ │ │ └── .permutation(...) - Report the permutation feature importance. │ │ ├── .cache_predictions(...) - Cache estimator's predictions. │ │ ├── .clear_cache(...) - Clear the cache. │ │ ├── .get_predictions(...) - Get estimator's predictions. │ │ └── Attributes │ │ ├── .X_test - Testing data │ │ ├── .X_train - Training data │ │ ├── .y_test - Testing target │ │ ├── .y_train - Training target │ │ ├── .estimator_ - The cloned or copied estimator │ │ ├── .estimator_name_ - The name of the estimator │ │ ├── .fit_time_ - The time taken to fit the estimator, in │ │ │ seconds │ │ └── .ml_task - No description available │ │ │ │ │ │ Legend: │ │ (↗︎) higher is better (↘︎) lower is better │ ╰──────────────────────────────────────────────────────────────────────────────────────╯ .. GENERATED FROM PYTHON SOURCE LINES 860-862 We have a :meth:`~skore.EstimatorReport.feature_importance.permutation` accessor: .. GENERATED FROM PYTHON SOURCE LINES 864-866 .. code-block:: Python ridge_report.feature_importance.permutation(seed=0) .. raw:: html
Repeat                 Repeat #0  Repeat #1  Repeat #2  Repeat #3  Repeat #4
Metric  Feature
r2      MedInc          1.015997   1.065468   1.022420   1.021447   1.011064
        HouseAge        0.018828   0.024312   0.020991   0.020678   0.021016
        AveRooms        0.090388   0.089230   0.085344   0.086729   0.083356
        AveBedrms       0.099849   0.098794   0.102469   0.104776   0.107545
        Population     -0.000181  -0.000103  -0.000181  -0.000023  -0.000065
        AveOccup        0.003671   0.006451   0.007217   0.005760   0.005956
        Latitude        1.229644   1.155974   1.208524   1.193838   1.206576
        Longitude       1.097392   1.103421   1.136658   1.137157   1.116989


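The table above holds one column per permutation repeat. A minimal sketch to
summarize it as a mean and a standard deviation per feature:

.. code-block:: Python

    # Aggregate the permutation repeats into a mean and a standard deviation.
    permutation_scores = ridge_report.feature_importance.permutation(seed=0)
    print(permutation_scores.agg(["mean", "std"], axis=1))
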
.. GENERATED FROM PYTHON SOURCE LINES 867-871 The permutation importance is often calculated several times, each time with different permutations of the feature. Hence, we can have measure its variance (or standard deviation). Now, we plot the permutation feature importance on the train and test sets using boxplots: .. GENERATED FROM PYTHON SOURCE LINES 874-921 .. code-block:: Python def plot_permutation_train_test(est_report): _, ax = plt.subplots(figsize=(8, 6)) train_color = "blue" test_color = "orange" est_report.feature_importance.permutation(data_source="train", seed=0).T.boxplot( ax=ax, vert=False, widths=0.35, patch_artist=True, boxprops=dict(facecolor=train_color, alpha=0.7), medianprops=dict(color="black"), positions=[x + 0.4 for x in range(len(est_report.X_train.columns))], ) est_report.feature_importance.permutation(data_source="test", seed=0).T.boxplot( ax=ax, vert=False, widths=0.35, patch_artist=True, boxprops=dict(facecolor=test_color, alpha=0.7), medianprops=dict(color="black"), positions=range(len(est_report.X_test.columns)), ) ax.legend( handles=[ plt.Line2D([0], [0], color=train_color, lw=5, label="Train"), plt.Line2D([0], [0], color=test_color, lw=5, label="Test"), ], loc="best", title="Dataset", ) ax.set_title( f"Permutation feature importance of {est_report.estimator_name_} (Train vs Test)" ) ax.set_xlabel("$R^2$") ax.set_yticks([x + 0.2 for x in range(len(est_report.X_train.columns))]) ax.set_yticklabels(est_report.X_train.columns) plt.tight_layout() plt.show() .. GENERATED FROM PYTHON SOURCE LINES 922-924 .. code-block:: Python plot_permutation_train_test(ridge_report) .. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_010.png :alt: Permutation feature importance of Ridge (Train vs Test) :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_010.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 925-929 The standard deviation seems quite low. For both the train and test sets, the result of the inspection is the same as with the coefficients: the most important features are ``Latitude``, ``Longitude``, and ``MedInc``. .. GENERATED FROM PYTHON SOURCE LINES 931-936 For ``selectk_ridge_report``, we have a large pipeline that is fed to a :class:`~skore.EstimatorReport`. The pipeline contains a lot a preprocessing that creates many features. By default, the permutation importance is calculated at the entrance of the whole pipeline (with regards to the original features): .. GENERATED FROM PYTHON SOURCE LINES 938-940 .. code-block:: Python plot_permutation_train_test(selectk_ridge_report) .. image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_011.png :alt: Permutation feature importance of RidgeCV (Train vs Test) :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_011.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 941-945 Hence, contrary to coefficients, although we have created many features in our preprocessing, the interpretability is easier. We notice that, due to our preprocessing using a clustering on the geospatial data, these features are of great importance to our model. .. GENERATED FROM PYTHON SOURCE LINES 947-948 For our decision tree, here is our permutation importance on the train and test sets: .. GENERATED FROM PYTHON SOURCE LINES 950-952 .. code-block:: Python plot_permutation_train_test(tree_report) .. 
image-sg:: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_012.png :alt: Permutation feature importance of DecisionTreeRegressor (Train vs Test) :srcset: /auto_examples/use_cases/images/sphx_glr_plot_feature_importance_012.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 953-956 The result of the inspection is the same as with the MDI: the most important features are ``MedInc``, ``Latitude``, ``Longitude``, and ``AveOccup``. .. GENERATED FROM PYTHON SOURCE LINES 958-979 Conclusion ========== In this example, we used the California housing dataset to predict house prices with skore's :class:`~skore.EstimatorReport`. By employing the :meth:`~skore.EstimatorReport.feature_importance` accessor, we gained valuable insights into model behavior beyond mere predictive performance. For linear models like Ridge regression, we inspected coefficients to understand feature contributions, revealing the prominence of ``MedInc``, ``Latitude``, and ``Longitude``. We explained the trade-off between performance (with complex feature engineering) and interpretability. Interactions between features highlighted the importance of ``AveOccup``. With tree-based models such as decision trees and random forests, we used the Mean Decrease in Impurity (MDI) to identify key features, notably ``AveOccup`` alongside ``MedInc``, ``Latitude``, and ``Longitude``. The random forest achieved the best score without any complex feature engineering, in contrast to the linear models. The model-agnostic permutation feature importance further enabled us to compare feature significance across diverse model types. .. rst-class:: sphx-glr-timing **Total running time of the script:** (1 minutes 14.754 seconds) .. _sphx_glr_download_auto_examples_use_cases_plot_feature_importance.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_feature_importance.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_feature_importance.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_feature_importance.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_