{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\n# Using skore with scikit-learn compatible estimators\n\nThis example shows how to use skore with scikit-learn compatible estimators.\n\nAny model that can be used with the scikit-learn API can be used with skore.\nUse :func:`~skore.evaluate` to create a report from any estimator that has a\n``fit`` and ``predict`` method (or only ``predict`` if already fitted).\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>When computing the ROC AUC or ROC curve for a classification task, the estimator must\n  have a ``predict_proba`` method.</p></div>\n\nIn this example, we showcase a gradient boosting model\n([XGBoost](https://github.com/dmlc/xgboost)) and a custom estimator.\n\nNote that this example is not exhaustive; many other scikit-learn compatible models can\nbe used with skore:\n\n-   More gradient boosting libraries like\n    [LightGBM](https://github.com/microsoft/LightGBM), and\n    [CatBoost](https://github.com/catboost/catboost),\n\n-   Deep learning frameworks such as [Keras](https://github.com/keras-team/keras)\n    and [skorch](https://github.com/skorch-dev/skorch) (a wrapper for\n    [PyTorch](https://github.com/pytorch/pytorch)).\n\n-   Tabular foundation models such as\n    [TabICL](https://github.com/soda-inria/tabicl) and\n    [TabPFN](https://github.com/PriorLabs/TabPFN),\n\n-   etc.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Generate a classification dataset\n\nTo illustrate the compatibility with scikit-learn estimators, we first generate a\nsynthetic binary classification dataset with only 1,000 samples.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\nimport skrub\nfrom sklearn.datasets import make_classification\n\nX, y = make_classification(n_samples=1_000, random_state=42)\nX = pd.DataFrame(X, columns=[f\"Feature_{i}\" for i in range(X.shape[1])])\nskrub.TableReport(X)"
      ]
    },
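    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a point of comparison (this baseline is our addition, not part of the original\nworkflow), here is a minimal sketch of the same reporting call with a plain\nscikit-learn estimator; ``LogisticRegression`` is an illustrative choice:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# A minimal sketch (our addition): the same `evaluate` workflow with a plain\n# scikit-learn estimator. `LogisticRegression` is an illustrative choice.\nfrom sklearn.linear_model import LogisticRegression\n\nfrom skore import evaluate\n\nbaseline_report = evaluate(LogisticRegression(), X, y, splitter=0.2, pos_label=1)\nbaseline_report.metrics.summarize().frame()"
      ]
    },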
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Gradient-boosted decision trees with XGBoost\n\nWhile `skore` is designed to be fully compatible with classifiers and regressors from\nthe scikit-learn library, it is also compatible with any classifier or regressor that\nfollows the scikit-learn API as defined in the [scikit-learn documentation](https://scikit-learn.org/dev/developers/develop.html#rolling-your-own-estimatorl).\n\nHere, we showcase a gradient-boosted decision trees model from the\n[XGBoost](https://github.com/dmlc/xgboost) library that follows exactly this\nparadigm.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from skore import evaluate\nfrom xgboost import XGBClassifier\n\nxgb = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1, random_state=42)\n\nxgb_report = evaluate(xgb, X, y, splitter=0.2, pos_label=1)\nxgb_report"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We see that we get the same report as when using a scikit-learn classifier and we\ncan access the different elements.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "xgb_report.metrics.summarize().frame()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can easily get the summary of metrics, and also a ROC curve plot for example:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "_ = xgb_report.metrics.roc().plot()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also inspect our model:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "_ = xgb_report.inspection.permutation_importance().plot()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Custom model\n\nNow, we showcase how one could create a scikit-learn custom estimator that follows\nthe requirements of scikit-learn.\n\nHere, we create a nearest neighbor classifier:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.base import BaseEstimator, ClassifierMixin\nfrom sklearn.metrics import euclidean_distances\nfrom sklearn.utils.multiclass import unique_labels\nfrom sklearn.utils.validation import check_is_fitted, validate_data\n\n\nclass CustomClassifier(ClassifierMixin, BaseEstimator):\n    def __init__(self):\n        pass\n\n    def fit(self, X, y):\n        X, y = validate_data(self, X, y)\n        self.classes_ = unique_labels(y)\n        self.X_ = X\n        self.y_ = y\n        return self\n\n    def predict(self, X):\n        check_is_fitted(self)\n        X = validate_data(self, X, reset=False)\n        closest = np.argmin(euclidean_distances(X, self.X_), axis=1)\n        return self.y_[closest]"
      ]
    },
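    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before passing it to skore, we can optionally verify that our class complies with\nthe scikit-learn API using scikit-learn's own estimator checks (this sanity check\nis our addition and is not required by skore):\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Optional sanity check (our addition): scikit-learn's estimator checks raise\n# an error if the class violates the API contract described above.\nfrom sklearn.utils.estimator_checks import check_estimator\n\ncheck_estimator(CustomClassifier())"
      ]
    },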
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "custom_report = evaluate(CustomClassifier(), X, y, splitter=0.2)\ncustom_report"
      ]
    },
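    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As with the XGBoost model, the same reporting methods are available for the custom\nestimator, for instance the summary of metrics:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "custom_report.metrics.summarize().frame()"
      ]
    },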
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Conclusion\n\nThis example demonstrates how skore can be used with scikit-learn compatible\nestimators. This allows practitioners to use consistent reporting and\nvisualization tools across different estimators.\n\n.. seealso::\n\n  For an example of wrapping Large Language Models (LLMs) to be compatible with\n  scikit-learn APIs, see the tutorial on [Quantifying LLMs Uncertainty with Conformal\n  Predictions](https://medium.com/capgemini-invent-lab/quantifying-llms-uncertainty-with-conformal-predictions-567870e63e00).\n  The article demonstrates how to wrap models like Mistral-7B-Instruct in a\n  scikit-learn-compatible interface.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.14.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}