{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n\n# Using skore with scikit-learn compatible estimators\n\nThis example shows how to use skore with scikit-learn compatible estimators.\n\nAny model that can be used with the scikit-learn API can be used with skore.\nUse :func:`~skore.evaluate` to create a report from any estimator that has a\n``fit`` and ``predict`` method (or only ``predict`` if already fitted).\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>When computing the ROC AUC or ROC curve for a classification task, the estimator must\n  have a ``predict_proba`` method.</p></div>\n\nIn this example, we showcase a gradient boosting model\n([XGBoost](https://github.com/dmlc/xgboost)) and a custom estimator.\n\nThis example is not exhaustive; many other scikit-learn compatible models can\nbe used with skore:\n\n-   Other gradient boosting libraries such as\n    [LightGBM](https://github.com/microsoft/LightGBM) and\n    [CatBoost](https://github.com/catboost/catboost),\n\n-   Deep learning frameworks such as [Keras](https://github.com/keras-team/keras)\n    and [skorch](https://github.com/skorch-dev/skorch) (a wrapper for\n    [PyTorch](https://github.com/pytorch/pytorch)),\n\n-   Tabular foundation models such as\n    [TabICL](https://github.com/soda-inria/tabicl) and\n    [TabPFN](https://github.com/PriorLabs/TabPFN),\n\n-   and more.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Loading a binary classification dataset\n\nWe generate a synthetic binary classification dataset with only 1,000 samples to keep\nthe computation time reasonable:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import make_classification\n\nX, y = make_classification(n_samples=1_000, random_state=42)\nprint(f\"{X.shape = }\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Gradient-boosted decision trees with XGBoost\n\nFor this binary classification task, we consider a gradient-boosted decision tree model\nfrom a library external to scikit-learn.\nOne of the most popular such libraries is [XGBoost](https://github.com/dmlc/xgboost).\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from skore import evaluate\nfrom xgboost import XGBClassifier\n\nxgb = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1, random_state=42)\n\nxgb_report = evaluate(xgb, X, y, splitter=0.2, pos_label=1)\nxgb_report.metrics.summarize().frame()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In addition to the summary of metrics, we can also plot the ROC curve:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "xgb_report.metrics.roc().plot()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can also inspect our model:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "xgb_report.inspection.permutation_importance().frame()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Custom model\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let us use a custom estimator inspired by the\n[scikit-learn documentation](https://scikit-learn.org/dev/developers/develop.html#rolling-your-own-estimator),\na nearest-neighbor classifier:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\nfrom sklearn.base import BaseEstimator, ClassifierMixin\nfrom sklearn.metrics import euclidean_distances\nfrom sklearn.utils.multiclass import unique_labels\nfrom sklearn.utils.validation import check_is_fitted, validate_data\n\n\nclass CustomClassifier(ClassifierMixin, BaseEstimator):\n    def __init__(self):\n        pass\n\n    def fit(self, X, y):\n        X, y = validate_data(self, X, y)\n        self.classes_ = unique_labels(y)\n        self.X_ = X\n        self.y_ = y\n        return self\n\n    def predict(self, X):\n        check_is_fitted(self)\n        X = validate_data(self, X, reset=False)\n        closest = np.argmin(euclidean_distances(X, self.X_), axis=1)\n        return self.y_[closest]"
      ]
    },
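    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick sanity check (not part of the original recipe), this estimator should behave\nlike scikit-learn's ``KNeighborsClassifier`` with ``n_neighbors=1``, since both predict\nthe label of the closest training sample under the Euclidean distance:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.neighbors import KNeighborsClassifier\n\n# Both models predict the label of the nearest training sample, so their\n# predictions should agree (up to unlikely ties in distance).\nknn = KNeighborsClassifier(n_neighbors=1).fit(X, y)\ncustom = CustomClassifier().fit(X, y)\nbool((knn.predict(X) == custom.predict(X)).all())"
      ]
    },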
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-info\"><h4>Note</h4><p>The estimator above does not have a ``predict_proba`` method; therefore,\n  we cannot display its ROC curve as we did previously.</p></div>\n\n"
      ]
    },
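    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can confirm this with a simple attribute check on the ``CustomClassifier`` defined above:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# ROC-based metrics need probability estimates, which this estimator lacks.\nhasattr(CustomClassifier(), \"predict_proba\")"
      ]
    },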
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We can now use this model with skore:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "custom_report = evaluate(CustomClassifier(), X, y, splitter=0.2, pos_label=1)\ncustom_report.metrics.precision()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Conclusion\n\nThis example demonstrated how skore can be used with any scikit-learn compatible\nestimator, allowing practitioners to rely on consistent reporting and visualization\ntools across different models and libraries.\n\n.. seealso::\n\n  For a practical example of using language models within scikit-learn pipelines,\n  see `example_use_case_employee_salaries`, which demonstrates how to use\n  skrub's :class:`~skrub.TextEncoder` (a language model-based encoder) in a\n  scikit-learn pipeline for feature engineering.\n\n.. seealso::\n\n  For an example of wrapping Large Language Models (LLMs) to be compatible with\n  scikit-learn APIs, see the tutorial on [Quantifying LLMs Uncertainty with Conformal\n  Predictions](https://medium.com/capgemini-invent-lab/quantifying-llms-uncertainty-with-conformal-predictions-567870e63e00).\n  The article demonstrates how to wrap models like Mistral-7B-Instruct in a\n  scikit-learn-compatible interface.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.14.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}