{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\n# `EstimatorReport`: Inspecting your models with feature importance\n\nIn this example, we tackle the California housing dataset, where the goal is to perform\na regression task: predicting house prices based on features such as the number of\nbedrooms, the geolocation, etc.\nFor that, we try out several families of models.\nWe evaluate these methods using :func:`~skore.evaluate` and the report's metrics.\n\n.. seealso::\n    As shown in `example_estimator_report`, the :class:`~skore.EstimatorReport` has\n    a :meth:`~skore.EstimatorReport.metrics` accessor that enables you to evaluate your\n    models and look at some scores that are automatically computed for you.\n\nHere, we go beyond predictive performance and inspect these models to better interpret\ntheir behavior, using feature importance.\nIndeed, in practice, inspection can help spot some flaws in models: it is always\nrecommended to look \"under the hood\".\nFor that, we use the unified :meth:`~skore.EstimatorReport.inspection` accessor\nof the :class:`~skore.EstimatorReport`.\nFor linear models, we look at their coefficients.\nFor tree-based models, we inspect their mean decrease in impurity (MDI).\nWe can also inspect the permutation feature importance, which is model-agnostic.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Loading the dataset and performing some exploratory data analysis (EDA)\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us load the California housing dataset, which will enable us to perform a\nregression task about predicting house prices:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from skrub.datasets import fetch_california_housing\n\ncalifornia_housing = fetch_california_housing()\ndataset = california_housing.california_housing\nX, y = california_housing.X, california_housing.y\n\ndataset.head(2)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The documentation of the California housing dataset explains that the dataset\ncontains aggregated data regarding each district in California in 1990 and the target\n(``MedHouseVal``) is the median house value for California districts,\nexpressed in hundreds of thousands of dollars ($100,000).\nNote that there are some vacation resorts, with a large number of rooms and bedrooms.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. seealso::\n  For more information about the California housing dataset, refer to\n  [scikit-learn MOOC's page](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html).\n  Moreover, a more advanced modelling of this dataset is performed in\n  [this skops example](https://skops.readthedocs.io/en/stable/auto_examples/plot_california_housing.html).\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Table report\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us perform some quick exploration on this dataset:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from skrub import TableReport\n\nTableReport(dataset)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From the table report, we can draw some key observations:\n\n- Looking at the *Stats* tab, all features are numerical and there are no\n  missing values.\n- Looking at the *Distributions* tab, we can notice that some features seem to have\n  some outliers:\n  ``MedInc``, ``AveRooms``, ``AveBedrms``, ``Population``, and ``AveOccup``.\n  The feature with the largest number of potential outliers is ``AveBedrms``, probably\n  corresponding to vacation resorts.\n- Looking at the *Associations* tab, we observe that:\n\n  -   The target feature ``MedHouseVal`` is mostly associated with ``MedInc``,\n      ``Longitude``, and ``Latitude``.\n      Indeed, intuitively, people with a large income would live in areas where the\n      house prices are high.\n      Moreover, we can expect some of these expensive areas to be close to one\n      another.\n  -   The strength of association between the target and these features is not\n      very high, which indicates that no single feature can correctly predict\n      the target on its own.\n      Given that ``MedInc`` is associated with ``Longitude`` and also ``Latitude``,\n      it might make sense to have some interactions between these features in our\n      modelling: linear combinations might not be enough.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Target feature\n\nThe target distribution has a long tail:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\nimport seaborn as sns\n\nsns.histplot(data=dataset, x=y.name, bins=100)\nplt.show(block=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "There seems to be a threshold-effect for high-valued houses: all houses with a price\nabove $500,000 are given the value $500,000.\nWe keep these clipped values in our data and will inspect how our models deal with\nthem.\n\n"
      ]
    },
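One way to make this threshold effect concrete is to measure the share of samples sitting exactly at the cap. Below is a minimal sketch on a synthetic clipped target (a simulated stand-in; in practice, the notebook's `y` would be used instead):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# simulate a long-tailed price target, clipped at 5.0 (i.e. $500,000)
y_sim = pd.Series(rng.lognormal(mean=0.5, sigma=0.6, size=10_000), name="MedHouseVal")
y_sim = y_sim.clip(upper=5.0)

# share of samples sitting exactly at the cap
share_at_cap = (y_sim == 5.0).mean()
print(f"share of clipped samples: {share_at_cap:.1%}")
```

A non-negligible share of samples at the exact maximum is a strong hint of clipping rather than a natural price ceiling.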
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, as the median income ``MedInc`` is the feature with the highest association with\nour target, let us assess how ``MedInc`` relates to ``MedHouseVal``:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\nimport plotly.express as px\n\nX_y_plot = dataset.copy()\nX_y_plot[\"MedInc_bins\"] = pd.qcut(X_y_plot[\"MedInc\"], q=5)\nbin_order = X_y_plot[\"MedInc_bins\"].cat.categories.sort_values()\nfig = px.histogram(\n    X_y_plot,\n    x=y.name,\n    color=\"MedInc_bins\",\n    category_orders={\"MedInc_bins\": bin_order},\n)\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As could have been expected, a high salary often comes with a more expensive house.\nWe can also notice the clipping effect of house prices for very high salaries.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Geospatial features\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From the table report, we noticed that the geospatial features ``Latitude`` and\n``Longitude`` were well associated with our target.\nHence, let us look into the coordinates of the districts in California, with regards\nto the target feature, using a map:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def plot_map(df, color_feature):\n    fig = px.scatter_map(\n        df, lat=\"Latitude\", lon=\"Longitude\", color=color_feature, zoom=5, height=600\n    )\n    fig.update_layout(\n        mapbox_style=\"open-street-map\",\n        mapbox_center={\"lat\": df[\"Latitude\"].mean(), \"lon\": df[\"Longitude\"].mean()},\n        margin={\"r\": 0, \"t\": 0, \"l\": 0, \"b\": 0},\n    )\n    return fig"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig = plot_map(dataset, y.name)\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As could be expected, the price of the houses near the ocean is higher, especially\naround big cities like Los Angeles, San Francisco, and San Jose.\nTaking into account the coordinates in our modelling will be very important.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Linear models: coefficients\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For our regression task, we first use linear models.\nFor feature importance, we inspect their coefficients.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Simple model\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before trying any complex feature engineering, we start with a simple pipeline to\nhave a baseline of what a \"good score\" is (remember that all scores are relative).\nWe use a Ridge regression with feature scaling; and evaluate its performance using\n:func:`~skore.evaluate` with ``splitter=0.2``. This will evaluate the model on 20% of\nthe data after training on the remaining 80%, and report the results in an\n:class:`~skore.EstimatorReport`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.linear_model import Ridge\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\nfrom skore import evaluate\n\nridge_report = evaluate(make_pipeline(StandardScaler(), Ridge()), X, y, splitter=0.2)\nridge_report.metrics.summarize().frame()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From the report metrics, let us first explain the scores we have access to:\n\n- The coefficient of determination (:func:`~sklearn.metrics.r2_score`), denoted as\n  $R^2$, which is a score.\n  The best possible score is $1$ and a constant model that always predicts the\n  average value of the target would get a score of $0$.\n  Note that the score can be negative, as a model can perform worse than always\n  predicting the average.\n- The root mean squared error (:func:`~sklearn.metrics.root_mean_squared_error`),\n  abbreviated as RMSE, which is an error.\n  It takes the square root of the mean squared error (MSE) so it is expressed in the\n  same units as the target variable.\n  The MSE measures the average squared difference between the predicted values and\n  the actual values.\n\nHere, the $R^2$ seems quite poor, so some further preprocessing would be needed.\nThis is done further down in this example.\n\n"
      ]
    },
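These definitions can be checked numerically; here is a small illustrative sketch using scikit-learn's metric functions on synthetic data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
y_true = rng.normal(loc=2.0, scale=1.0, size=500)
y_pred = y_true + rng.normal(scale=0.5, size=500)  # imperfect but informative predictions

# the RMSE is the square root of the MSE, so it is in the units of the target
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"MSE={mse:.3f}, RMSE={rmse:.3f}")

# a constant model always predicting the mean of the target gets an R^2 of 0
y_constant = np.full_like(y_true, y_true.mean())
print(f"R^2 of the constant model: {r2_score(y_true, y_constant):.3f}")

# predictions worse than that constant yield a negative R^2
print(f"R^2 of a very bad model: {r2_score(y_true, -y_true):.3f}")
```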
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-danger\"><h4>Warning</h4><p>Keep in mind that any observation drawn from inspecting the coefficients\n  of this simple Ridge model is made on a model that performs quite poorly, hence\n  must be treated with caution.\n  Indeed, a poorly performing model does not capture the true underlying\n  relationships in the data.\n  A good practice would be to avoid inspecting models with poor performance.\n  Here, we still inspect it, for demo purposes and because our model is not put into\n  production!</p></div>\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us plot the prediction error:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "ridge_report.metrics.prediction_error().plot(kind=\"actual_vs_predicted\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can observe that the model has issues predicting large house prices, due to the\nclipping effect of the actual values.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, to inspect our model, let us use the\n:meth:`skore.EstimatorReport.inspection` accessor:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "ridge_report.inspection.coefficients().frame()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>Beware that coefficients can be misleading when some features are correlated.\n  For example, two coefficients can have large absolute values (so be considered\n  important), but in the predictions, the sum of their contributions could cancel out\n  (if they are highly correlated), so they would actually be unimportant.</p></div>\n\n"
      ]
    },
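This pitfall is easy to reproduce on synthetic data. Below is a hypothetical sketch in which a second feature is an almost-exact copy of the first, so the fitted coefficients can take large opposite-signed values even though only their sum is meaningful:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
x_corr = x + rng.normal(scale=1e-4, size=1_000)  # nearly identical to `x`
X_demo = np.column_stack([x, x_corr])
y_demo = x + rng.normal(scale=0.1, size=1_000)  # the target only depends on `x`

model = LinearRegression().fit(X_demo, y_demo)
print(model.coef_)        # individually misleading (often large and opposite-signed)
print(model.coef_.sum())  # close to 1.0: only the sum is well determined
```

The model predicts well, yet neither coefficient alone reflects the true effect of the underlying feature.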
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can plot this pandas dataframe:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "ridge_report.inspection.coefficients().plot()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>More generally, :meth:`skore.EstimatorReport.inspection.coefficients` can\n  help you inspect the coefficients of all linear models.\n  We consider a linear model as defined in\n  [scikit-learn's documentation](https://scikit-learn.org/stable/modules/linear_model.html).\n  In short, we consider a \"linear model\" as a scikit-learn compatible estimator that\n  holds a ``coef_`` attribute (after being fitted).</p></div>\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Since we have included scaling in the pipeline, the resulting coefficients are all on\nthe same scale, making them directly comparable to each other.\nWithout this scaling step, the coefficients in a linear model would be influenced by\nthe original scale of the feature values, which would prevent meaningful comparisons\nbetween them.\n\n.. seealso::\n\n  For more information about the importance of scaling,\n  see scikit-learn's example on\n  [Common pitfalls in the interpretation of coefficients of linear models](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html).\n\nHere, it appears that the ``MedInc``, ``Latitude``, and ``Longitude`` features are\nthe most important, judging by the absolute values of the coefficients.\nThis finding is consistent with our previous observations from the *Associations*\ntab of the table report.\n\nHowever, due to the scaling, we cannot interpret the coefficient values\nwith regards to the original units of the features.\nLet us unscale the coefficients, without forgetting the intercept, so that the\ncoefficients can be interpreted using the original units:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\n# retrieve the mean and standard deviation used to standardize the feature values\nfeature_mean = ridge_report.estimator_[0].mean_\nfeature_std = ridge_report.estimator_[0].scale_\n\n\ndef unscale_coefficients(df, feature_mean, feature_std):\n    df = df.set_index(\"feature\")\n    mask_intercept_column = df.index == \"Intercept\"\n    # rescale the intercept\n    df.loc[mask_intercept_column] = df.loc[mask_intercept_column] - np.sum(\n        df.loc[~mask_intercept_column, \"coefficient\"] * feature_mean / feature_std\n    )\n    # rescale the other coefficients\n    df.loc[~mask_intercept_column, \"coefficient\"] = (\n        df.loc[~mask_intercept_column, \"coefficient\"] / feature_std\n    )\n    return df.reset_index()\n\n\ndf_ridge_report_coef_unscaled = unscale_coefficients(\n    ridge_report.inspection.coefficients().frame(), feature_mean, feature_std\n)\ndf_ridge_report_coef_unscaled"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, we can interpret each coefficient value with regards to the original units.\nWe can interpret a coefficient as follows: according to our model, on average,\nhaving one additional bedroom (an increase of $1$ in ``AveBedrms``),\nwith all other features being constant,\nincreases the *predicted* house value by $0.62$ in units of $100,000, hence by $62,000.\nNote that we have not dealt with any potential outliers in this iteration.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-danger\"><h4>Warning</h4><p>Recall that we are inspecting a model with poor performance, which is bad practice.\n  Moreover, we must be cautious when trying to induce any causation effect\n  (remember that correlation is not causation).</p></div>\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### More complex model\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As previously mentioned, our simple Ridge model, although easily interpretable\nwith regards to the original units of the features, performs quite poorly.\nNow, we build a more complex model, with more feature engineering.\nWe will see that this model achieves a better score... but that its coefficients are\nharder to interpret with regards to the original features, due to the complex\nfeature engineering.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In our previous EDA, when plotting the geospatial data with regards to the house\nprices, we noticed that it is important to take into account the latitude and\nlongitude features.\nMoreover, we also observed that the median income is well associated with the\nhouse prices.\nHence, we will try a feature engineering that takes into account the interactions\nof the geospatial features with features such as the income, using polynomial\nfeatures.\nThe interactions are no longer simply linear as previously.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us build a model with some more complex feature engineering, and still use a\nRidge regressor (linear model) at the end of the pipeline.\nIn particular, we perform a K-means clustering on the geospatial features:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.cluster import KMeans\nfrom sklearn.compose import make_column_transformer\nfrom sklearn.preprocessing import PolynomialFeatures, SplineTransformer\n\ngeo_columns = [\"Latitude\", \"Longitude\"]\n\npreprocessor = make_column_transformer(\n    (KMeans(n_clusters=10, random_state=0), geo_columns),\n    remainder=\"passthrough\",\n)\nengineered_ridge = make_pipeline(\n    preprocessor,\n    SplineTransformer(sparse_output=True),\n    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),\n    Ridge(),\n)\nengineered_ridge"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let us compute the metrics and compare it to our previous model using\nthe :func:`~skore.compare` function that returns a :class:`~skore.ComparisonReport`:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from skore import compare\n\nengineered_ridge_report = evaluate(engineered_ridge, X, y, splitter=0.2)\nreports_to_compare = {\n    \"Vanilla Ridge\": ridge_report,\n    \"Ridge w/ feature engineering\": engineered_ridge_report,\n}\ncomparator = compare(reports_to_compare)\ncomparator.metrics.summarize().frame()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We get a much better score!\nLet us plot the prediction error:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "engineered_ridge_report.metrics.prediction_error().plot(kind=\"actual_vs_predicted\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Regarding the clipping issue: compared to our previous model (``ridge_report``),\nour ``engineered_ridge_report`` model seems to produce prediction errors that are not\nas large, suggesting that some interactions between features have helped alleviate\nthe clipping issue.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "However, interpreting the features is harder: indeed, our complex feature engineering\nintroduced a *lot* of features:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "n_features_initial = ridge_report.X_train.shape[1]\nprint(\"Initial number of features:\", n_features_initial)\n\n# We slice the scikit-learn pipeline to extract the predictor, using -1 to access\n# the last step:\nn_features_engineered = engineered_ridge_report.estimator_[-1].n_features_in_\nprint(\"Number of features after feature engineering:\", n_features_engineered)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us display the 15 largest absolute coefficients:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "engineered_ridge_report_coefficients = (\n    engineered_ridge_report.inspection.coefficients()\n    .frame()\n    .set_index(\"feature\")\n    .sort_values(by=\"coefficient\", key=abs, ascending=True)\n    .tail(15)\n)\n\nengineered_ridge_report_coefficients.index = (\n    engineered_ridge_report_coefficients.index.str.replace(\"remainder__\", \"\")\n)\nengineered_ridge_report_coefficients.index = (\n    engineered_ridge_report_coefficients.index.str.replace(\"kmeans__\", \"geospatial__\")\n)\n\nengineered_ridge_report_coefficients.plot.barh(\n    title=\"Model weights\",\n    xlabel=\"Coefficient\",\n    ylabel=\"Feature\",\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can observe that the most important features are interactions between features,\nmostly based on ``AveOccup``, that a simple linear model without feature engineering\ncould not have captured.\nIndeed, the vanilla Ridge model did not consider ``AveOccup`` to be important.\nAs the engineered Ridge has a better score, perhaps the vanilla Ridge missed\nsomething about ``AveOccup`` that seems to be key to predicting house prices.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us visualize how ``AveOccup`` interacts with ``MedHouseVal``:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X_y_plot = dataset.copy()\nX_y_plot[\"AveOccup\"] = pd.qcut(X_y_plot[\"AveOccup\"], q=5)\nbin_order = X_y_plot[\"AveOccup\"].cat.categories.sort_values()\nfig = px.histogram(\n    X_y_plot,\n    x=y.name,\n    color=\"AveOccup\",\n    category_orders={\"AveOccup\": bin_order},\n)\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, we can visualize the results of our K-means clustering (on the training set):\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# getting the cluster labels\ncol_transformer = engineered_ridge_report.estimator_.named_steps[\"columntransformer\"]\nkmeans = col_transformer.named_transformers_[\"kmeans\"]\nclustering_labels = kmeans.labels_\n\n# adding the cluster labels to our dataframe\nX_train_plot = ridge_report.X_train.copy()\nX_train_plot.insert(n_features_initial, \"clustering_labels\", clustering_labels)\n\n# plotting the map\nplot_map(X_train_plot, \"clustering_labels\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Inspecting the prediction error at the sample level\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "After feature importance, we now try to understand why our model performs badly on some\nsamples, in order to iterate on our estimator pipeline and improve it.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We compute the prediction squared error at the sample level, named ``squared_error``,\non the train and test sets:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def add_y_true_pred(model_report, split):\n    \"\"\"\n    Concatenate the design matrix (`X`) with the actual targets (`y`)\n    and predicted ones (`y_pred`) from a fitted skore EstimatorReport,\n    either on the train or the test set.\n    \"\"\"\n\n    if split == \"train\":\n        y_split_true = model_report.y_train\n        X_split = model_report.X_train.copy()\n    elif split == \"test\":\n        y_split_true = model_report.y_test\n        X_split = model_report.X_test.copy()\n    else:\n        raise ValueError(\"split must be either `train`, or `test`\")\n\n    # adding a `split` feature\n    X_split.insert(0, \"split\", split)\n\n    # retrieving the predictions\n    y_split_pred = model_report.get_predictions(\n        data_source=split, response_method=\"predict\"\n    )\n\n    # computing the squared error at the sample level\n    squared_error_split = (y_split_true - y_split_pred) ** 2\n\n    # adding the squared error to our dataframes\n    X_split.insert(X_split.shape[1], \"squared_error\", squared_error_split)\n\n    # adding the true values and the predictions\n    X_y_split = X_split.copy()\n    X_y_split.insert(X_y_split.shape[1], \"y_true\", y_split_true)\n    X_y_split.insert(X_y_split.shape[1], \"y_pred\", y_split_pred)\n    return X_y_split"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X_y_train_plot = add_y_true_pred(engineered_ridge_report, \"train\")\nX_y_test_plot = add_y_true_pred(engineered_ridge_report, \"test\")\nX_y_plot = pd.concat([X_y_train_plot, X_y_test_plot])\nX_y_plot.sample(10)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We visualize the distributions of the prediction errors on both train and test sets:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "sns.histplot(data=X_y_plot, x=\"squared_error\", hue=\"split\", multiple=\"dodge\", bins=30)\nplt.title(\"Train and test sets\")\nplt.show(block=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, in order to assess which features might drive the prediction error, let us look\ninto the associations between the ``squared_error`` and the other features:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from skrub import column_associations\n\ncolumn_associations(X_y_plot).query(\n    \"left_column_name == 'squared_error' or right_column_name == 'squared_error'\"\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We observe that the ``AveOccup`` feature leads to large prediction errors: our model\nis not able to deal well with that feature.\nHence, it might be worth diving deeper into the ``AveOccup`` feature, for\nexample its outliers.\n\n"
      ]
    },
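A common first tool for such a dive is the interquartile-range (Tukey) rule for flagging high outliers; here is a sketch on a synthetic skewed feature standing in for ``AveOccup`` (the real column from `X` would be used in practice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic right-skewed feature standing in for ``AveOccup``
feature = pd.Series(rng.lognormal(mean=1.0, sigma=0.5, size=5_000), name="AveOccup")

q1, q3 = feature.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # Tukey's rule for high outliers
outliers = feature[feature > upper_fence]
print(f"{len(outliers)} potential outliers above {upper_fence:.2f}")
```

Flagged samples could then be inspected, winsorized, or handled with a robust transformation, depending on whether they are data errors or genuine districts.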
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We observe that we have large prediction errors for districts near the coast and big\ncities:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "threshold = X_y_plot[\"squared_error\"].quantile(0.95)  # out of the train and test sets\nplot_map(X_y_plot.query(f\"squared_error > {threshold}\"), \"split\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Hence, it could make sense to engineer two new features: the distance to the coast\nand the distance to big cities.\n\nMost of our very bad predictions underpredict the true value (``y_true`` is more often\nlarger than ``y_pred``):\n\n"
      ]
    },
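The distance-to-big-cities feature could be sketched with the haversine formula; below is a hypothetical helper (the city coordinates are approximate, and the distance to the coast would additionally require an actual coastline geometry):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# approximate coordinates of some big Californian cities
big_cities = {
    "Los Angeles": (34.05, -118.24),
    "San Francisco": (37.77, -122.42),
    "San Jose": (37.34, -121.89),
}

def distance_to_nearest_city_km(lat, lon):
    return min(
        haversine_km(lat, lon, c_lat, c_lon) for c_lat, c_lon in big_cities.values()
    )

# e.g. a district near Sacramento sits roughly 100-150 km from the nearest big city
print(round(distance_to_nearest_city_km(38.58, -121.49)))
```

Applied row-wise to ``Latitude`` and ``Longitude``, this would yield a new numerical column that the model could use directly.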
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Create the scatter plot\nfig = px.scatter(\n    X_y_plot.query(f\"squared_error > {threshold}\"),\n    x=\"y_pred\",\n    y=\"y_true\",\n    color=\"split\",\n)\n# Add the diagonal line\nfig.add_shape(\n    type=\"line\",\n    x0=X_y_plot[\"y_pred\"].min(),\n    y0=X_y_plot[\"y_pred\"].min(),\n    x1=X_y_plot[\"y_pred\"].max(),\n    y1=X_y_plot[\"y_pred\"].max(),\n    line={\"color\": \"black\", \"width\": 2},\n)\nfig"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Compromising on complexity\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let us build a model with a more interpretable feature engineering, although\nit might not perform as well.\nFor that, after the complex feature engineering, we perform some feature selection\nusing a :class:`~sklearn.feature_selection.SelectKBest`, in order to reduce the\nnumber of features.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression\nfrom sklearn.linear_model import RidgeCV\n\npreprocessor = make_column_transformer(\n    (KMeans(n_clusters=10, random_state=0), geo_columns),\n    remainder=\"passthrough\",\n)\nselectkbest_ridge = make_pipeline(\n    preprocessor,\n    SplineTransformer(sparse_output=True),\n    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),\n    VarianceThreshold(1e-8),\n    SelectKBest(score_func=lambda X, y: f_regression(X, y, center=False), k=150),\n    RidgeCV(np.logspace(-5, 5, num=100)),\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>To keep the computation time of this example low, we did not tune\n  the hyperparameters of the predictive model. However, on a real use\n  case, it would be important to tune the model using\n  :class:`~sklearn.model_selection.RandomizedSearchCV`\n  and not just the :class:`~sklearn.linear_model.RidgeCV`.</p></div>\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us get the metrics for our model and compare it with our previous iterations:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "selectk_ridge_report = evaluate(selectkbest_ridge, X, y, splitter=0.2)\nreports_to_compare[\"Ridge w/ feature engineering and selection\"] = selectk_ridge_report\ncomparator = compare(reports_to_compare)\ncomparator.metrics.summarize().frame()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We get a good score with far fewer features:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(\"Initial number of features:\", n_features_initial)\nprint(\"Number of features after feature engineering:\", n_features_engineered)\n\nn_features_selectk = selectk_ridge_report.estimator_[-1].n_features_in_\nprint(\n    \"Number of features after feature engineering using `SelectKBest`:\",\n    n_features_selectk,\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "According to the :class:`~sklearn.feature_selection.SelectKBest`, the most important\nfeatures are the following:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "selectk_features = selectk_ridge_report.estimator_[:-1].get_feature_names_out()\nprint(selectk_features)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can see that, among the best features according to the statistical tests, there are\nmany interactions between the geospatial features (derived from the K-means clustering)\nand the median income.\nNote that these features are not sorted.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "And here is the feature importance based on our model (sorted by absolute values):\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "(\n    selectk_ridge_report.inspection.coefficients()\n    .frame()\n    .set_index(\"feature\")\n    .sort_values(by=\"coefficient\", key=abs, ascending=True)\n    .tail(15)\n    .plot.barh(title=\"Model weights\", xlabel=\"Coefficient\", ylabel=\"Feature\")\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Tree-based models: mean decrease in impurity (MDI)\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let us look into tree-based models.\nFor feature importance, we inspect their Mean Decrease in Impurity (MDI).\nThe MDI of a feature is the normalized total reduction of the criterion (or loss)\nbrought by that feature.\nThe higher the MDI, the more important the feature.\n\n"
      ]
    },
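    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the MDI concrete, here is a minimal sketch on synthetic data (a toy example,\nindependent of the housing dataset and of the report's accessor): we read the\nnormalized MDI directly from a fitted tree's ``feature_importances_`` attribute:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.tree import DecisionTreeRegressor\n\nrng = np.random.default_rng(0)\nX_demo = rng.normal(size=(500, 3))\n# Only the first feature drives the target; the other two are noise.\ny_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=500)\n\ndemo_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)\n# `feature_importances_` holds the normalized MDI: it sums to 1,\n# and the informative feature should dominate.\nprint(demo_tree.feature_importances_)"
      ]
    },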
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-danger\"><h4>Warning</h4><p>The MDI is limited and can be misleading:\n\n  - When features have large differences in cardinality, the MDI tends to favor\n    those with higher cardinality.\n    Fortunately, in this example, we have only numerical features that share similar\n    cardinality, mitigating this concern.\n  - Since the MDI is typically calculated on the training set, it can reflect biases\n    from overfitting.\n    When a model overfits, the tree may partition less relevant regions of the\n    feature space, artificially inflating MDI values and distorting the perceived\n    importance of certain features.\n    Soon, scikit-learn will enable computing the MDI on the test set, and we\n    will make it available in skore.\n    We will then be able to draw conclusions about how predictive a feature is, and\n    not just how impactful it is on the training procedure.</p></div>\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. seealso::\n\n  For more information about the MDI, see scikit-learn's\n  [Permutation Importance vs Random Forest Feature Importance (MDI)](https://scikit-learn.org/dev/auto_examples/inspection/plot_permutation_importance.html).\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Decision trees\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us start with a simple decision tree.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. seealso::\n  For more information about decision trees, see scikit-learn's example on\n  [Understanding the decision tree structure](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html).\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.tree import DecisionTreeRegressor\n\ntree_report = evaluate(DecisionTreeRegressor(random_state=0), X, y, splitter=0.2)\nreports_to_compare[\"Decision tree\"] = tree_report"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We compare its performance with the models in our benchmark:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "comparator = compare(reports_to_compare)\ncomparator.metrics.summarize().frame()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We note that the performance is quite poor, so the derived feature importance should\nbe treated with caution.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We display which accessors are available to us:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tree_report.help()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We have a\n:meth:`~skore.EstimatorReport.inspection.impurity_decrease`\naccessor.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First, let us interpret our model with regard to the original features.\nFor the visualization, we fix a very low ``max_depth`` so that the tree is easy to\nvisualize using :func:`sklearn.tree.plot_tree`:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.tree import plot_tree\n\nplot_tree(\n    tree_report.estimator_,\n    feature_names=tree_report.estimator_.feature_names_in_,\n    max_depth=2,\n)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This diagram shows how our tree predicts each sample.\nA decision tree provides a decision path for each sample, where the sample traverses\nthe tree based on feature thresholds, and the final prediction is made at the leaf\nnode (not shown above for conciseness).\nAt each node:\n\n- ``samples`` is the number of samples that fall into that node,\n- ``value`` is the predicted output for the samples that fall into this particular\n  node (it is the mean of the target values for the samples in that node).\n  At the root node, the value is $2.074$. This means that if you were to make a\n  prediction for all $15480$ samples at this node (without any further splits),\n  the predicted value would be $2.074$, which is the mean of the target\n  variable for those samples.\n- ``squared_error`` is the mean squared error associated with the ``value``,\n  representing the average of the squared differences between the actual target values\n  of the samples in the node and the node's predicted ``value`` (the mean),\n- the first element is how the split is defined.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us explain how this works in practice.\nAt each node, the tree splits the data based on a feature and a threshold.\nFor the first node (the root node), ``MedInc <= 5.029`` means that, for each sample,\nour decision tree first looks at the ``MedInc`` feature (which is thus the most\nimportant one):\nif the ``MedInc`` value is lower than $5.029$ (the threshold), then the sample goes\ninto the left node, otherwise it goes to the right, and so on for each node.\nAs you move down the tree, the splits refine the predictions, leading to the leaf\nnodes where the final prediction for a sample is the ``value`` of the leaf it reaches.\nNote that for the second node layer, it is also the ``MedInc`` feature that is used\nfor the threshold, indicating that our model heavily relies on ``MedInc``.\n\n"
      ]
    },
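    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Concretely, the decision path of a sample can be obtained with scikit-learn's\n``decision_path`` method. Here is a tiny toy example (independent of the housing\ndata) with a depth-1 tree, so that there are only three nodes, the root and two\nleaves:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.tree import DecisionTreeRegressor\n\n# Toy data: one feature, a single split separates low from high targets.\nX_demo = np.array([[0.0], [1.0], [2.0], [3.0]])\ny_demo = np.array([0.0, 0.0, 1.0, 1.0])\nstump = DecisionTreeRegressor(max_depth=1).fit(X_demo, y_demo)\n\n# Each row indicates which nodes (root, left leaf, right leaf) a sample visits:\n# every sample traverses the root and exactly one leaf.\nprint(stump.decision_path(X_demo).toarray())"
      ]
    },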
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        ".. seealso::\n  A richer display of decision trees is available in the\n  [dtreeviz](https://github.com/parrt/dtreeviz) python package.\n  For example, it shows the distribution of feature values split at each node and\n  tailors the visualization to the task at hand (whether classification\n  or regression).\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let us look at the feature importance based on the MDI:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "tree_report.inspection.impurity_decrease().plot()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For a decision tree, the MDI of a feature accumulates the impurity reductions over\nall the splits that use this feature. Here, the impurity is the mean squared error.\n\nAs expected, ``MedInc`` is of great importance for our decision tree.\nIndeed, in the above tree visualization, ``MedInc`` is used multiple times for splits\nand contributes greatly to reducing the squared error at multiple nodes.\nAt the root, it reduces the error from $1.335$ to $0.832$ and $0.546$\nin the children.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Random forest\n\nNow, let us apply a more elaborate model: a random forest.\nA random forest is an ensemble method that builds multiple decision trees, each\ntrained on a random subset of data and features.\nFor regression, it averages the trees' predictions.\nThis reduces overfitting and improves accuracy compared to a single decision tree.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.ensemble import RandomForestRegressor\n\nn_estimators = 100\nrf_report = evaluate(\n    RandomForestRegressor(random_state=0, n_estimators=n_estimators), X, y, splitter=0.2\n)\nreports_to_compare[\"Random forest\"] = rf_report\n\ncomparator = compare(reports_to_compare)\ncomparator.metrics.summarize().frame()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Without any feature engineering and any grid search,\nthe random forest beats all linear models!\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us recall the number of trees in our random forest:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(f\"Number of trees in the forest: {n_estimators}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Given that we have many trees, it is hard to use :func:`sklearn.tree.plot_tree` as\nfor the single decision tree.\nAs for linear models (and the complex feature engineering), better performance often\ncomes with less interpretability.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us look into the MDI of our random forest:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "rf_report.inspection.impurity_decrease().plot()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In a random forest, the MDI is computed by averaging the MDI of each feature across\nall the decision trees in the forest.\n\nAs for the decision tree, ``MedInc`` is the most important feature.\nAs for linear models with some feature engineering, the random forest also attributes\na high importance to ``Longitude``, ``Latitude``, and ``AveOccup``.\n\n"
      ]
    },
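    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "This averaging can be checked directly with scikit-learn on a small synthetic\ndataset (a sketch, independent of the report): the forest's\n``feature_importances_`` matches the mean of the per-tree MDIs (the final\nnormalization is a no-op here, since each tree's MDI already sums to 1):\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.ensemble import RandomForestRegressor\n\nrng = np.random.default_rng(0)\nX_demo = rng.normal(size=(300, 3))\n# Features 0 and 1 are predictive; feature 2 is noise.\ny_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)\n\nforest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)\n# Average the per-tree MDIs and compare with the forest's attribute.\nmean_of_trees = np.mean(\n    [tree.feature_importances_ for tree in forest.estimators_], axis=0\n)\nprint(np.allclose(mean_of_trees, forest.feature_importances_))"
      ]
    },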
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Model-agnostic: permutation feature importance\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the previous sections, we have inspected coefficients that are specific to linear\nmodels and the MDI that is specific to tree-based models.\nIn this section, we look into the\n[permutation importance](https://scikit-learn.org/stable/modules/permutation_importance.html)\nwhich is model-agnostic, meaning that it can be applied to any fitted estimator.\nIn particular, it works for linear models and tree-based ones.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Permutation feature importance measures the contribution of each feature to\na fitted model's performance.\nIt randomly shuffles the values of a single feature and observes the resulting\ndegradation of the model's score.\nPermuting a predictive feature makes the performance decrease, while\npermuting a non-predictive feature does not degrade the performance much.\nThis permutation importance can be computed on the train and test sets,\nand by default skore computes it on the test set.\nCompared to the coefficients and the MDI, the permutation importance can be\nless misleading, but comes with a higher computation cost.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Permutation feature importance can also help reduce overfitting.\nIf a model overfits (high train score and low test score), and some\nfeatures are important only on the train set and not on the test set,\nthen these features might be the cause of the overfitting and it might be a good\nidea to drop them.\n\n"
      ]
    },
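    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The mechanism can be sketched with scikit-learn's\n:func:`~sklearn.inspection.permutation_importance` on synthetic data (a toy\nexample, independent of skore): shuffling the predictive feature collapses the\n$R^2$ score, while shuffling the noise feature barely changes it:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.inspection import permutation_importance\nfrom sklearn.linear_model import Ridge\nfrom sklearn.model_selection import train_test_split\n\nrng = np.random.default_rng(0)\nX_demo = rng.normal(size=(400, 2))\n# Feature 0 is predictive; feature 1 is pure noise.\ny_demo = 2 * X_demo[:, 0] + rng.normal(scale=0.1, size=400)\n\nX_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)\nmodel = Ridge().fit(X_train, y_train)\n\n# Shuffle each feature in turn on the test set and measure the R^2 drop.\nresult = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)\nprint(result.importances_mean)"
      ]
    },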
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-danger\"><h4>Warning</h4><p>The permutation feature importance can be misleading on strongly\n  correlated features. For more information, see\n  [scikit-learn's user guide](https://scikit-learn.org/stable/modules/permutation_importance.html#misleading-values-on-strongly-correlated-features).</p></div>\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let us look at our helper:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "ridge_report.help()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We have a :meth:`~skore.EstimatorReport.inspection.permutation_importance`\naccessor:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "ridge_report.inspection.permutation_importance(seed=0).frame()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Since the permutation importance involves permuting values of a feature at random,\nby default it is computed several times, each time with different permutations of\nthe feature. For this reason, if ``seed`` is not passed, skore does not cache the\npermutation importance results.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, we plot the permutation feature importance on the train and test sets using boxplots:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def plot_permutation_train_test(importances):\n    _, ax = plt.subplots(figsize=(8, 6))\n\n    sns.boxplot(\n        data=importances,\n        x=\"value\",\n        y=\"feature\",\n        hue=\"data_source\",\n        ax=ax,\n        whis=10_000,\n    )\n    ax.set_xlabel(\"Decrease of $R^2$ score\")\n    ax.set_title(\"Permutation feature importance (Train vs Test)\")\n    plt.show(block=True)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def compute_permutation_importances(report, at_step=0):\n    train_importance = report.inspection.permutation_importance(\n        data_source=\"train\", seed=0, at_step=at_step\n    ).frame(aggregate=None)\n    test_importance = report.inspection.permutation_importance(\n        data_source=\"test\", seed=0, at_step=at_step\n    ).frame(aggregate=None)\n\n    return pd.concat([train_importance, test_importance], axis=0)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_permutation_train_test(compute_permutation_importances(ridge_report))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The standard deviation seems quite low.\nFor both the train and test sets, the result of the inspection is the same as\nwith the coefficients:\nthe most important features are ``Latitude``, ``Longitude``, and ``MedInc``.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For ``selectk_ridge_report``, we have a large pipeline that is fed to a\n:class:`~skore.EstimatorReport`.\nThe pipeline contains a lot of preprocessing that creates many features.\nBy default, the permutation importance is calculated at the entry of the whole\npipeline (with regard to the original features):\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_permutation_train_test(compute_permutation_importances(selectk_ridge_report))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Since this estimator involves complex feature engineering, it is interesting to look\nat the impact of the *engineered* features rather than the original input features.\nFor instance, we can check whether features with a low importance rating have indeed\nbeen selected out of the engineered features.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "importances = compute_permutation_importances(selectk_ridge_report, at_step=-1)\n\n# Rename some features for clarity\nimportances[\"feature\"] = (\n    importances[\"feature\"]\n    .str.replace(\"remainder__\", \"\")\n    .str.replace(\"kmeans__\", \"geospatial__\")\n)\n\n# Take only the 15 most important features on the test set\nbest_15_features = (\n    importances.set_index(\"data_source\")\n    .loc[\"test\", :]\n    .drop(columns=[\"metric\", \"repetition\"])\n    .groupby(\"feature\")\n    .aggregate(\"mean\")\n    .sort_values(by=\"value\", key=abs)\n    .tail(15)\n    .index\n)\nimportances = importances.query(\"feature in @best_15_features\")\n\nplot_permutation_train_test(importances)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Hence, contrary to the coefficients, interpretation is easier here, even though our\npreprocessing created many features.\nWe notice that, because our preprocessing applies a clustering to the geospatial data,\nthese features are of great importance to our model.\n\nAlso, Average Bedrooms and Average Rooms appear often in the plot, whereas they were\nnot considered important when looking at the coefficients. This means that, once\ncombined with other features, they become more relevant.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For our decision tree, here is our permutation importance on the train and test sets:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plot_permutation_train_test(compute_permutation_importances(tree_report))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The result of the inspection is the same as with the MDI:\nthe most important features are ``MedInc``, ``Latitude``, ``Longitude``,\nand ``AveOccup``.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Conclusion\n\nIn this example, we used the California housing dataset to predict house prices with\nskore's :class:`~skore.EstimatorReport`.\nBy employing the :meth:`~skore.EstimatorReport.inspection` accessor,\nwe gained valuable insights into model behavior beyond mere predictive performance.\nFor linear models like Ridge regression, we inspected coefficients to understand\nfeature contributions, revealing the prominence of ``MedInc``, ``Latitude``,\nand ``Longitude``.\nWe explained the trade-off between performance (with complex feature engineering)\nand interpretability.\nInteractions between features highlighted the importance of ``AveOccup``.\nWith tree-based models, namely a decision tree and a random forest, we used the\nMean Decrease in Impurity (MDI) to identify key features,\nnotably ``AveOccup`` alongside ``MedInc``, ``Latitude``, and\n``Longitude``.\nThe random forest achieved the best score without any complex feature engineering,\nunlike the linear models.\nThe model-agnostic permutation feature importance further enabled us to compare\nfeature significance across diverse model types.\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.14.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}