Using skrub DataOp cross-validation

When a skrub DataOp defines a cross-validation splitter on mark_as_X(), evaluate() can reuse that configuration, including split_kwargs such as groups, instead of falling back to skore’s default 80/20 holdout.

This example builds a small grouped cross-validation setup with skrub and evaluates it with skore.

Configure cross-validation on the DataOp

We use the toy products dataset and group products by seller. The goal is to assess generalization to new sellers with LeaveOneGroupOut.

import skrub
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneGroupOut

df = skrub.datasets.toy_products()
# Symbolic input; the actual DataFrame is supplied later under the key "df".
data = skrub.var("df")
# One group label (the seller) per product row.
groups = data["seller"]
# Attach the splitter and its groups to X so that evaluate() can reuse them.
X = data[["description", "price"]].skb.mark_as_X(
    cv=LeaveOneGroupOut(), split_kwargs={"groups": groups}
)
y = data["category"].skb.mark_as_y()
pred = X.skb.apply(DummyClassifier(), y=y)
learner = pred.skb.make_learner()
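
To make the effect of LeaveOneGroupOut concrete, here is a minimal sketch using plain scikit-learn, independent of skrub and skore. The seller labels and the `X_demo` array are hypothetical stand-ins, not the contents of the toy products dataset. Each fold holds out every row of exactly one seller, so no seller appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical seller labels, one per product row (not the toy_products data).
sellers = np.array(["alice", "alice", "bob", "bob", "bob"])
X_demo = np.arange(len(sellers)).reshape(-1, 1)

logo = LeaveOneGroupOut()
# The number of folds equals the number of distinct groups.
n_folds = logo.get_n_splits(groups=sellers)

held_out = []
for train_idx, test_idx in logo.split(X_demo, groups=sellers):
    train_sellers = set(sellers[train_idx].tolist())
    test_sellers = set(sellers[test_idx].tolist())
    # A held-out seller never appears in the training rows of the same fold.
    assert train_sellers.isdisjoint(test_sellers)
    held_out.append(test_sellers)
```

With two distinct sellers this yields two folds, which is why the number of groups, rather than the number of rows, determines how many models are fitted.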

Evaluate with skore (no explicit splitter)

Because mark_as_X was called with an explicit cv argument, calling evaluate() without a splitter returns a CrossValidationReport that respects the DataOp grouping.

from skore import evaluate

report = evaluate(learner, data={"df": df})
report

There are two sellers, so cross-validation runs in two folds:

2

Inspect aggregated metrics with the same API as other skore reports:

report.metrics.summarize().frame()
                                 SkrubLearner
Metric            Label          mean        std
Score                            0.666667    0.000000e+00
Accuracy                         0.666667    0.000000e+00
Precision         electronics    0.666667    0.000000e+00
                  tools          0.000000    0.000000e+00
Recall            electronics    1.000000    0.000000e+00
                  tools          0.000000    0.000000e+00
ROC AUC                          0.500000    0.000000e+00
Log loss                         0.636514    0.000000e+00
Brier score                      0.222222    2.775558e-17
Fit time (s)                     0.003148    2.069348e-05
Predict time (s)                 0.001972    8.291251e-05
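
The reported accuracy, log loss, and Brier score are what a prior-predicting baseline would produce when the training and test folds both contain electronics and tools in a 2:1 ratio. The sketch below checks that arithmetic with scikit-learn alone. The 2:1 ratio and the dummy feature arrays are assumptions chosen to match the numbers above; the source does not state the actual composition of each fold.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, log_loss

# Assumed fold composition: two "electronics" rows per "tools" row.
y_train = np.array(["electronics", "electronics", "tools"])
y_test = np.array(["electronics", "electronics", "tools"])
# DummyClassifier ignores the features, so placeholders are enough.
X_train = np.zeros((3, 1))
X_test = np.zeros((3, 1))

# The default strategy predicts the training class prior for every row.
clf = DummyClassifier().fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # columns follow clf.classes_

accuracy = accuracy_score(y_test, clf.predict(X_test))
logloss = log_loss(y_test, proba, labels=clf.classes_)
# Binary Brier score computed by hand on the "tools" probability.
is_tools = (y_test == "tools").astype(float)
brier = float(np.mean((proba[:, 1] - is_tools) ** 2))
```

Under these assumptions the log loss is the entropy of the (2/3, 1/3) distribution and the Brier score is 2/9, which also explains the ROC AUC of 0.5: a constant score cannot rank any sample above another.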


Default behavior without an explicit DataOp cv

If mark_as_X is called without an explicit cv argument, evaluate() still defaults to a single 80/20 holdout and returns an EstimatorReport.

simple_learner = skrub.X().skb.apply(DummyClassifier(), y=skrub.y()).skb.make_learner()
holdout_report = evaluate(
    simple_learner,
    data={"X": df[["description", "price"]], "y": df["category"]},
)
holdout_report
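
For comparison, a single 80/20 holdout is one train/test partition rather than a set of folds. A rough scikit-learn equivalent is sketched below with hypothetical arrays. Whether skore shuffles or stratifies its default holdout is not stated here, so the `random_state` and default shuffling are assumptions of this sketch only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, binary target.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# One partition: 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
```

Because there is only one partition, such a report carries no fold-to-fold standard deviation, unlike the cross-validated summary shown earlier.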

Explicitly passing a splitter always overrides the DataOp configuration.

