{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\n# Simplified and structured experiment reporting\n\nThis example shows how to leverage `skore` to structure useful experiment information\nand get insights from machine learning experiments.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Loading a non-trivial dataset\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We use a skrub dataset that contains information about employees and their salaries.\nWe will see that this dataset is non-trivial.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from skrub.datasets import fetch_employee_salaries\n\ndatasets = fetch_employee_salaries()\ndf, y = datasets.X, datasets.y"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's first have a condensed summary of the input data using a\n:class:`skrub.TableReport`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from skrub import TableReport\n\nTableReport(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From the table report, we can make the following observations:\n\n* Looking at the *Table* tab, we observe that the year related to the\n  ``date_first_hired`` column is also present in the ``date`` column.\n  Hence, we should be careful not to create the same feature twice during the\n  feature engineering.\n\n* Looking at the *Stats* tab:\n\n  - The data types are heterogeneous: we mainly have categorical and date-related\n    features.\n\n  - The ``division`` and ``employee_position_title`` features contain a large number\n    of categories.\n    This is something we should account for in our feature engineering.\n\n* Looking at the *Associations* tab, we observe that two features hold exactly the\n  same information: ``department`` and ``department_name``.\n  Hence, during our feature engineering, we could potentially drop one of them if the\n  final predictive model is sensitive to collinearity.\n\nRegarding the target, we want to predict the salary of an employee from the features\nabove. We are therefore dealing with a regression task.\n\n"
      ]
    },
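    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick sanity check, here is a minimal sketch, on toy data rather than the\nactual dataset, of how one could verify that two columns such as ``department``\nand ``department_name`` hold the same information: the check passes when the\ncolumns map one-to-one onto each other.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n\n# Toy data standing in for the real columns; the check itself is generic.\ntoy = pd.DataFrame(\n    {\n        \"department\": [\"POL\", \"HHS\", \"POL\"],\n        \"department_name\": [\"Police\", \"Health and Human Services\", \"Police\"],\n    }\n)\n# If the columns are redundant, the number of distinct (code, name) pairs\n# equals the number of distinct codes and the number of distinct names.\npairs = toy.drop_duplicates()\none_to_one = (\n    len(pairs) == toy[\"department\"].nunique() == toy[\"department_name\"].nunique()\n)\nprint(one_to_one)"
      ]
    },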
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "TableReport(y)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Later in this example, we will show that `skore` stores similar information when a\nmodel is trained on a dataset, thus enabling us to get quick insights on the dataset\nused to train and test the model.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Tree-based model\n\nLet's start by creating a tree-based model using some out-of-the-box tools.\n\nFor feature engineering, we use skrub's :class:`~skrub.TableVectorizer`.\nTo deal with the high cardinality of the categorical features, we use a\n:class:`~skrub.StringEncoder` to encode them.\n\nFinally, we use a :class:`~sklearn.ensemble.HistGradientBoostingRegressor` as the\nbase estimator; it is a rather robust model.\n\n### Modelling\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.ensemble import HistGradientBoostingRegressor\nfrom sklearn.pipeline import make_pipeline\nfrom skrub import StringEncoder, TableVectorizer\n\nhgbt_model = make_pipeline(\n    TableVectorizer(high_cardinality=StringEncoder()),\n    HistGradientBoostingRegressor(),\n)\nhgbt_model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Evaluation\n\nLet us compute the 5-fold cross-validation report for this model using\n:func:`~skore.evaluate` with ``splitter=5``. This will return a\n:class:`~skore.CrossValidationReport` object.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from skore import evaluate\n\nhgbt_model_report = evaluate(hgbt_model, df, y, splitter=5, n_jobs=4)\nhgbt_model_report.help()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A report provides a collection of useful information. For instance, it allows us\nto compute, on demand, the predictions of the model and some performance metrics.\n\nLet's cache the predictions of the cross-validated models once and for all.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "hgbt_model_report.cache_predictions()"
      ]
    },
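    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Skore handles this caching internally in the report. As a rough analogy only\n(this is not skore's actual implementation), memoization in plain Python shows\nthe principle: the expensive computation runs once, and subsequent identical\nrequests are served from the cache almost instantly.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import functools\nimport time\n\n\n# Hypothetical stand-in for computing the predictions of one split.\n@functools.lru_cache(maxsize=None)\ndef expensive_predictions(split_index):\n    time.sleep(0.1)  # simulate a costly fit/predict\n    return split_index * 2\n\n\nexpensive_predictions(0)  # first call: computed, takes ~0.1 s\nstart = time.perf_counter()\nexpensive_predictions(0)  # second call: served from the cache\nelapsed = time.perf_counter() - start\nprint(elapsed < 0.05)"
      ]
    },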
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now that the predictions are cached, any request to compute a metric will be\nperformed using the cached predictions and will thus be fast.\n\nWe can now have a look at the performance of the model with some standard metrics.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "hgbt_model_report.metrics.summarize().frame()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Similarly to what we saw in the previous section, the\n:class:`skore.CrossValidationReport` also stores some information about the dataset\nused.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "data_display = hgbt_model_report.data.analyze()\ndata_display"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The resulting display offers a quick overview with the same HTML-based view\nas the :class:`skrub.TableReport` we have seen earlier. In addition, you can use\nthe :meth:`skore.TableReportDisplay.plot` method to focus on a particular\nanalysis. For instance, we can get a figure representing the correlation\nmatrix of the dataset.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "data_display.plot(kind=\"corr\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We get statistical metrics aggregated over the cross-validation splits, as well\nas timing metrics related to training and testing the model.\n\nThe :class:`skore.CrossValidationReport` also provides a way to inspect similar\ninformation at the level of each cross-validation split by accessing an\n:class:`skore.EstimatorReport` for each split.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "hgbt_split_1 = hgbt_model_report.estimator_reports_[0]\nhgbt_split_1.metrics.summarize().frame(favorability=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The favorability of each metric indicates whether the metric is better\nwhen higher or lower.\n\n"
      ]
    },
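    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make favorability concrete, here is a small standalone sketch using plain\nscikit-learn metrics on toy values: R² is better when higher, while an error\nmetric such as the mean absolute error is better when lower.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.metrics import mean_absolute_error, r2_score\n\n# Toy regression targets with a good and a bad set of predictions.\ny_true = np.array([10.0, 20.0, 30.0])\ngood = np.array([11.0, 19.0, 31.0])\nbad = np.array([15.0, 28.0, 20.0])\n\n# R2 is better when higher; MAE is better when lower.\nprint(r2_score(y_true, good) > r2_score(y_true, bad))\nprint(mean_absolute_error(y_true, good) < mean_absolute_error(y_true, bad))"
      ]
    },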
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Linear model\n\nNow that we have established a first model that serves as a baseline, let us\ndefine a more elaborate linear model: a pipeline with extensive feature\nengineering that uses a linear model as the final estimator.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Modelling\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.compose import make_column_transformer\nfrom sklearn.linear_model import RidgeCV\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import OneHotEncoder, SplineTransformer\nfrom skrub import DatetimeEncoder, DropCols, GapEncoder, ToDatetime\n\n\ndef periodic_spline_transformer(period, n_splines=None, degree=3):\n    if n_splines is None:\n        n_splines = period\n    n_knots = n_splines + 1  # periodic and include_bias is True\n    return SplineTransformer(\n        degree=degree,\n        n_knots=n_knots,\n        knots=np.linspace(0, period, n_knots).reshape(n_knots, 1),\n        extrapolation=\"periodic\",\n        include_bias=True,\n    )\n\n\none_hot_features = [\"gender\", \"department_name\", \"assignment_category\"]\ndatetime_features = \"date_first_hired\"\n\ndate_encoder = make_pipeline(\n    ToDatetime(),\n    DatetimeEncoder(resolution=\"day\", add_weekday=True, add_total_seconds=False),\n    DropCols(\"date_first_hired_year\"),\n)\n\ndate_engineering = make_column_transformer(\n    (periodic_spline_transformer(12, n_splines=6), [\"date_first_hired_month\"]),\n    (periodic_spline_transformer(31, n_splines=15), [\"date_first_hired_day\"]),\n    (periodic_spline_transformer(7, n_splines=3), [\"date_first_hired_weekday\"]),\n)\n\nfeature_engineering_date = make_pipeline(date_encoder, date_engineering)\n\npreprocessing = make_column_transformer(\n    (feature_engineering_date, datetime_features),\n    (OneHotEncoder(drop=\"if_binary\", handle_unknown=\"ignore\"), one_hot_features),\n    (GapEncoder(n_components=100), \"division\"),\n    (GapEncoder(n_components=100), \"employee_position_title\"),\n)\n\nlinear_model = make_pipeline(preprocessing, RidgeCV(alphas=np.logspace(-3, 3, 100)))\nlinear_model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In the diagram above, we can see how we performed our feature engineering:\n\n* For categorical features, we use two approaches. If the number of categories is\n  relatively small, we use a `OneHotEncoder`. If the number of categories is\n  large, we use a `GapEncoder` that is designed to deal with high cardinality\n  categorical features.\n\n* Then, we have another transformation to encode the date features. We first split the\n  date into multiple features (month, day, and weekday), dropping the year since it\n  duplicates information already present in the dataset. Then, we apply a periodic\n  spline transformation to each of these features in order to capture their\n  periodicity.\n\n* Finally, we fit a :class:`~sklearn.linear_model.RidgeCV` model.\n\n"
      ]
    },
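    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the periodic encoding above more concrete, here is a standalone sketch of\nthe same spline construction applied to the month number: with\n``extrapolation=\"periodic\"``, month 12 wraps around smoothly to month 1, and the\ntransformer outputs ``n_splines`` features per column.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.preprocessing import SplineTransformer\n\n# Same construction as periodic_spline_transformer(12, n_splines=6) above.\nmonths = np.arange(1, 13).reshape(-1, 1)\nspline = SplineTransformer(\n    degree=3,\n    n_knots=7,  # n_splines + 1 knots\n    knots=np.linspace(0, 12, 7).reshape(-1, 1),\n    extrapolation=\"periodic\",\n    include_bias=True,\n)\nencoded = spline.fit_transform(months)\nprint(encoded.shape)  # one row per month, 6 periodic spline features"
      ]
    },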
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Evaluation\n\nNow, we want to evaluate this linear model via cross-validation (with 5 folds).\nFor that, we again use :func:`~skore.evaluate` with ``splitter=5``.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "linear_model_report = evaluate(linear_model, df, y, splitter=5, n_jobs=4)\nlinear_model_report.help()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We observe that the cross-validation report has detected that we have a regression\ntask at hand and thus provides us with metrics and plots that are relevant to our\nspecific problem.\n\nTo accelerate any future computation (e.g. of a metric), we cache the predictions of\nour model once and for all.\nNote that we do not necessarily need to cache the predictions, as the report will\ncompute them on the fly (if not cached) and cache them for us.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import warnings\n\nwith warnings.catch_warnings():\n    warnings.simplefilter(action=\"ignore\", category=FutureWarning)\n    linear_model_report.cache_predictions()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now have a look at the performance of the model with some standard metrics.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "linear_model_report.metrics.summarize().frame(favorability=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Comparing the models\n\nNow that we have cross-validated our models, we can make some further comparisons\nusing the :func:`~skore.compare` function that returns a\n:class:`~skore.ComparisonReport`:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from skore import compare\n\ncomparator = compare([hgbt_model_report, linear_model_report])\ncomparator.metrics.summarize().frame(favorability=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In addition, if we forgot to compute a specific metric\n(e.g. :func:`~sklearn.metrics.mean_absolute_error`),\nwe can easily add it to the report, without re-training the model and even\nwithout re-computing the predictions, since they are cached internally in the report.\nThis can save a potentially significant amount of computation time.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "comparator.metrics.add(metric=\"neg_mean_absolute_error\", name=\"MAE\")\n\ncomparator.metrics.summarize().frame()"
      ]
    },
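    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As an aside, the ``neg_`` prefix comes from scikit-learn's scorer convention:\nall scorers follow a greater-is-better rule, so error metrics are exposed\nnegated. A minimal standalone sketch on toy data:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.dummy import DummyRegressor\nfrom sklearn.metrics import get_scorer, mean_absolute_error\n\n# Fit a trivial model on toy data.\nX_toy = np.arange(6, dtype=float).reshape(-1, 1)\ny_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])\nmodel = DummyRegressor(strategy=\"mean\").fit(X_toy, y_toy)\n\n# The scorer returns the negated MAE so that greater is always better.\nscorer = get_scorer(\"neg_mean_absolute_error\")\nprint(scorer(model, X_toy, y_toy) == -mean_absolute_error(y_toy, model.predict(X_toy)))"
      ]
    },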
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, we can even get a deeper understanding by analyzing each split in the\n:class:`~skore.CrossValidationReport`.\nHere, we plot the actual-vs-predicted values for each split.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "linear_model_report.metrics.prediction_error().plot(kind=\"actual_vs_predicted\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Conclusion\n\nThis example showcased `skore`'s integrated approach to the machine learning\nworkflow, from initial data exploration with `TableReport` through model development\nand evaluation with `CrossValidationReport`.\nWe demonstrated how `skore` automatically captures dataset information and provides\nefficient caching, enabling quick insights and flexible model comparison.\nThis workflow highlights `skore`'s ability to streamline the entire ML process while\nmaintaining computational efficiency.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.14.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}