Cross-validation#

This example illustrates the motivation for and the use of skore's skore.CrossValidationReporter, which assists you when developing ML/DS projects.

Warning

Deprecation Notice: skore.CrossValidationReporter is deprecated in favor of skore.CrossValidationReport.

Creating and loading the skore project#

We create and load the skore project from the current directory:

import skore

my_project = skore.open("my_project", create=True)
──────────────────────────────────────── skore ─────────────────────────────────────────
Project file 'my_project.skore' was successfully created.

Cross-validation in scikit-learn#

Scikit-learn provides two functions for cross-validation: sklearn.model_selection.cross_val_score() and sklearn.model_selection.cross_validate().

Essentially, sklearn.model_selection.cross_val_score() runs cross-validation for single-metric evaluation, while sklearn.model_selection.cross_validate() runs cross-validation with multiple metrics and can also return extra information such as train scores, fit times, and score times.
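As a point of comparison, here is a minimal, self-contained sketch of single-metric evaluation with sklearn.model_selection.cross_val_score(), using the same dataset and estimator as the rest of this example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="linear", C=1, random_state=0)

# cross_val_score() returns a single array: one test score
# (here, the accuracy) per cross-validation split
scores = cross_val_score(clf, X, y, cv=5)
print(scores)
```

With the default 5-fold splitter this produces the same per-split accuracies as the cross_validate() call shown below.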

Hence, in skore, we are more interested in sklearn.model_selection.cross_validate(), as it can do more than the historical sklearn.model_selection.cross_val_score().

Let us illustrate cross-validation on a multi-class classification task.

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="linear", C=1, random_state=0)

Single-metric evaluation using sklearn.model_selection.cross_validate():

from sklearn.model_selection import cross_validate as sklearn_cross_validate

cv_results = sklearn_cross_validate(clf, X, y, cv=5)
print(f"test_score: {cv_results['test_score']}")
test_score: [0.96666667 1.         0.96666667 0.96666667 1.        ]

Multiple-metric evaluation using sklearn.model_selection.cross_validate():

import pandas as pd

cv_results = sklearn_cross_validate(
    clf,
    X,
    y,
    cv=5,
    scoring=["accuracy", "precision_macro"],
)
test_scores = pd.DataFrame(cv_results)[["test_accuracy", "test_precision_macro"]]
test_scores
   test_accuracy  test_precision_macro
0       0.966667              0.969697
1       1.000000              1.000000
2       0.966667              0.969697
3       0.966667              0.969697
4       1.000000              1.000000


In scikit-learn, why do we recommend using cross_validate over cross_val_score?#

Here, for the SVC, the default score is the accuracy. If users want other scores to better understand their model, such as the precision and the recall, they can conveniently specify them all in a single call. Otherwise, they would have to run sklearn.model_selection.cross_val_score() several times, with a different scoring parameter each time, which leads to unnecessary extra compute.

Why do we recommend using skore’s CrossValidationReporter over scikit-learn’s cross_validate?#

In the example above, what if users ran scikit-learn's sklearn.model_selection.cross_validate() but forgot to include a score that is crucial for their use case, such as the recall? They would have to re-run the whole cross-validation experiment with this score added, which costs more compute.

Cross-validation in skore#

To assist its users when programming, skore implements a skore.CrossValidationReporter class that wraps scikit-learn's sklearn.model_selection.cross_validate() to provide more context and facilitate the analysis.

Classification task#

Let us continue with the same use case.



Skore’s CrossValidationReporter advantages are the following:

  • By default, it computes several useful scores without requiring them to be specified manually. For this classification task, it computes the accuracy, the precision, and the recall.

  • We automatically get some interactive Plotly graphs to better understand how our model behaves depending on the split. For example:

    • We can compare the fitting and scoring times together for each split.

    • Moreover, we can focus on the time per data point, as the train and test splits usually contain different numbers of samples.

    • We can compare the accuracy, precision, and recall scores together for each split.
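The timing information mentioned in the bullets above is also available from plain scikit-learn; here is a small sketch extracting the fit and score times per split with cross_validate() (fit_time and score_time are scikit-learn's standard result keys):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="linear", C=1, random_state=0)

cv_results = cross_validate(clf, X, y, cv=5)
# fit_time and score_time are reported in seconds, one value per split
times = pd.DataFrame(cv_results)[["fit_time", "score_time"]]
print(times)
```

The reporter's advantage is that it turns this raw timing data into interactive plots, including the per-data-point view.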

Regression task#

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
lasso = Lasso()

reporter = skore.CrossValidationReporter(lasso, X, y, cv=5)
reporter.plots.scores


We can put the reporter in the project, and retrieve it as is:

my_project.put("cross_validation_reporter", reporter)

reporter = my_project.get("cross_validation_reporter")
reporter.plots.scores


Cleanup the project#

Let’s clear the skore project (to avoid any conflict with other documentation examples).

Total running time of the script: (0 minutes 0.201 seconds)

Gallery generated by Sphinx-Gallery