{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Tracking all the data processing\n\nTo track all operations and be able to apply the fitted estimator to unseen\ndata, we need to include all the data wrangling in the estimator used for our\nskore report. In very simple cases this can be done with a scikit-learn\nPipeline. When we have transformations the Pipeline does not support (such as\ntransformations that change the number of rows, or that involve multiple\ntables, such as joins), skore allows us to use a skrub DataOp instead.\n\nIn this example we consider a dataset that is simple, but still requires some\ndata wrangling (encoding, aggregation and joining) that cannot be performed\nin a regular scikit-learn estimator. To track those operations, we use a skrub\nDataOp, which can express richer transformations than ordinary estimators and\nhas built-in support in skore.\n\nThe dataset contains a list of online transactions (each corresponding to a\ncart, or \"basket\"), each linked to one or more products for which we have a\ntext description. The task is to predict which transactions involve credit\nfraud.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We start by defining our data-processing pipeline. Note that it contains\noperations (vectorizing the text in the product table, then aggregating the\nresulting features and joining them onto the baskets) that would not be\npossible in a regular scikit-learn estimator.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import skore\nimport skrub\nfrom sklearn.ensemble import HistGradientBoostingClassifier\n\ndataset = skrub.datasets.fetch_credit_fraud(split=\"all\")\n\n# Declare the input tables as skrub variables so the DataOp can track them.\nproducts = skrub.var(\"products\", dataset.products)\nbaskets = skrub.var(\"baskets\", dataset.baskets)\n\n# Mark the design matrix (the basket IDs) and the prediction target.\nbasket_ids = baskets[[\"ID\"]].skb.mark_as_X()\nfraud_flags = baskets[\"fraud_flag\"].skb.mark_as_y()\n\n\n# Keep only the products that belong to a basket present in X.\ndef filter_products(products, basket_ids):\n    return products[products[\"basket_ID\"].isin(basket_ids[\"ID\"])]\n\n\n# Vectorize the text columns of the filtered product table.\nvectorized_products = products.skb.apply_func(filter_products, basket_ids).skb.apply(\n    skrub.TableVectorizer(), exclude_cols=\"basket_ID\"\n)\n\n\n# Average the vectorized product features per basket and join them onto X.\ndef join_product_info(basket_ids, vectorized_products):\n    return basket_ids.merge(\n        vectorized_products.groupby(\"basket_ID\").agg(\"mean\").reset_index(),\n        left_on=\"ID\",\n        right_on=\"basket_ID\",\n    ).drop(columns=[\"ID\", \"basket_ID\"])\n\n\n# Fit a gradient-boosted classifier on the joined per-basket features.\npred = basket_ids.skb.apply_func(join_product_info, vectorized_products).skb.apply(\n    HistGradientBoostingClassifier(), y=fraud_flags\n)\n\n# Uncommenting the following line would generate a full report with previews of\n# the intermediate results and the fitted estimators:\n#\n# pred.skb.full_report()\n\npred"
      ]
    },
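    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see concretely what the filter and the aggregate-and-join steps compute,\nhere is a pandas-only sketch of `filter_products` and `join_product_info` on\ntoy tables. The tiny frames below are made up for illustration, and the\nnumeric column `f0` stands in for one of the features produced by the\n`TableVectorizer`:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n\n# Made-up stand-ins for the real tables: two known baskets, one stray product.\ntoy_basket_ids = pd.DataFrame({\"ID\": [1, 2]})\ntoy_products = pd.DataFrame({\"basket_ID\": [1, 1, 2, 9], \"f0\": [0.0, 1.0, 4.0, 7.0]})\n\n# filter_products: keep only products whose basket appears in X.\nkept = toy_products[toy_products[\"basket_ID\"].isin(toy_basket_ids[\"ID\"])]\n\n# join_product_info: average the product features per basket, attach them to X.\njoined = toy_basket_ids.merge(\n    kept.groupby(\"basket_ID\").agg(\"mean\").reset_index(),\n    left_on=\"ID\",\n    right_on=\"basket_ID\",\n).drop(columns=[\"ID\", \"basket_ID\"])\n\njoined  # one row per basket: f0 is 0.5 for basket 1 and 4.0 for basket 2"
      ]
    },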
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Above we see a preview computed on the whole dataset. Click the \"show graph\"\ntoggle to see a drawing of the pipeline we have built.\n\nJust like a normal estimator, a skrub DataOp can be used with skore reports.\nWe can either pass a SkrubLearner together with separate training and testing\ndata, or, as we do here, pass our DataOp with the data it already contains and\nrely on the default train/test split:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "report = skore.EstimatorReport(pred, pos_label=1)\nreport.metrics.roc_auc()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "_ = report.metrics.precision_recall().plot()"
      ]
    },
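    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The `pos_label=1` argument passed to the report above tells skore which class\ncounts as the positive (\"fraud\") class when computing asymmetric metrics such\nas precision. As a quick reminder of what that means, here is a\nscikit-learn-only sketch on made-up labels, scores and predictions:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.metrics import precision_score, roc_auc_score\n\ny_true = [0, 0, 1, 1]\ny_scores = [0.1, 0.4, 0.35, 0.8]\ny_pred = [0, 1, 1, 1]\n\n# ROC AUC ranks the scores: 3 of the 4 positive/negative pairs are ordered\n# correctly, giving 0.75. It does not depend on the choice of pos_label.\nprint(roc_auc_score(y_true, y_scores))\n\n# Precision, in contrast, is computed for the class selected as pos_label.\nprint(precision_score(y_true, y_pred, pos_label=1))  # 2 of 3 predicted 1s are true\nprint(precision_score(y_true, y_pred, pos_label=0))  # the single predicted 0 is true"
      ]
    },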
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Note that the preprocessing operations are captured in the skrub DataOp, and\ntherefore in our report, so we can replay them later on unseen data. The\nfitted estimator stored on the report exposes the underlying DataOp:\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "report.estimator_.data_op"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.14.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}