Local skore Project#
This example shows how to use Project in local mode: store
reports on your machine and inspect them. A key point is that
summarize() returns a Summary,
which subclasses pandas.DataFrame. In Jupyter you get an interactive widget, but
you can always inspect and filter the summary as a plain DataFrame if you prefer.
Create a local project and store reports#
We use a temporary directory as the workspace so the example is self-contained.
In practice you can omit workspace to use the default (e.g. a skore/
directory in your user cache).
from pathlib import Path
from tempfile import TemporaryDirectory
from skore import Project
tmp_dir = TemporaryDirectory()
tmp_path = Path(tmp_dir.name)
project = Project("example-project", workspace=tmp_path)
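As an aside, if you want the temporary workspace cleaned up automatically, TemporaryDirectory can also be used as a context manager (a stdlib-only sketch, independent of skore itself):

```python
from pathlib import Path
from tempfile import TemporaryDirectory

# the directory is removed automatically on exiting the `with` block
with TemporaryDirectory() as name:
    tmp_path = Path(name)
    assert tmp_path.exists()
assert not tmp_path.exists()
```

Here we keep the explicit tmp_dir object instead, so the workspace survives until we call tmp_dir.cleanup() at the end of the example.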
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from skrub import tabular_pipeline
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
estimator = tabular_pipeline(LogisticRegression(max_iter=1_000))
import numpy as np
from sklearn.base import clone
from skore import evaluate
for regularization in np.logspace(-7, 7, 31):
    report = evaluate(
        clone(estimator).set_params(logisticregression__C=regularization),
        X,
        y,
        splitter=0.2,
        pos_label=1,
    )
    project.put(f"lr-regularization-{regularization:.1e}", report)
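For reference, np.logspace(-7, 7, 31) generates the 31 logarithmically spaced values of C being swept, from 1e-7 up to 1e7 (a quick sketch of the grid):

```python
import numpy as np

# 31 values spaced evenly on a log scale between 10**-7 and 10**7
grid = np.logspace(-7, 7, 31)
assert len(grid) == 31
assert np.isclose(grid[0], 1e-7) and np.isclose(grid[-1], 1e7)
```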
Summarize: you get a DataFrame#
summarize() returns a Summary,
which subclasses pandas.DataFrame. In a Jupyter environment it renders
an interactive parallel-coordinates widget by default.
summary = project.summarize()
To see the normal DataFrame table instead of the widget (e.g. in scripts or
when you prefer the table), wrap the summary in pandas.DataFrame:
import pandas as pd
pandas_summary = pd.DataFrame(summary)
pandas_summary
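To illustrate the subclass relationship, here is a minimal sketch using a hypothetical stand-in class (not skore's actual Summary): wrapping any DataFrame subclass in pd.DataFrame yields a plain frame holding the same data.

```python
import pandas as pd

# hypothetical stand-in for a DataFrame subclass such as Summary
class FancyFrame(pd.DataFrame):
    pass

fancy = FancyFrame({"key": ["a", "b"], "log_loss": [0.05, 0.3]})
plain = pd.DataFrame(fancy)  # plain DataFrame, same data

print(type(fancy).__name__)  # FancyFrame
print(type(plain).__name__)  # DataFrame
```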
In short, the summary contains metadata about each report, which lets us filter the reports quickly.
<class 'skore._project._summary.Summary'>
MultiIndex: 31 entries, (0, '24adad00003213def2b5a9e93562f857') to (30, '025ff8880ab31858b24252aeed9aee48')
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 key 31 non-null object
1 date 31 non-null object
2 learner 31 non-null category
3 ml_task 31 non-null object
4 report_type 31 non-null object
5 dataset 31 non-null object
6 rmse 0 non-null object
7 log_loss 31 non-null float64
8 roc_auc 31 non-null float64
9 fit_time 31 non-null float64
10 predict_time 31 non-null float64
11 rmse_mean 0 non-null object
12 log_loss_mean 0 non-null object
13 roc_auc_mean 0 non-null object
14 fit_time_mean 0 non-null object
15 predict_time_mean 0 non-null object
dtypes: category(1), float64(4), object(11)
memory usage: 5.3+ KB
Filter reports by metric (e.g. keep only those below a given log loss) and work with the result as a table.
summary.query("log_loss < 0.1")["key"].tolist()
['lr-regularization-1.2e-01', 'lr-regularization-3.4e-01', 'lr-regularization-1.0e+00', 'lr-regularization-2.9e+00']
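Because the summary behaves like a DataFrame, any pandas filtering applies; for instance, query conditions can be combined. A sketch on a toy frame with made-up values, mirroring the summary's columns:

```python
import pandas as pd

# toy frame with hypothetical values, mimicking the summary's columns
toy = pd.DataFrame({
    "key": ["model-a", "model-b", "model-c"],
    "log_loss": [0.05, 0.30, 0.08],
    "roc_auc": [0.99, 0.90, 0.98],
})

# keep reports with low log loss AND high ROC AUC
selected = toy.query("log_loss < 0.1 and roc_auc > 0.97")["key"].tolist()
print(selected)  # ['model-a', 'model-c']
```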
Use reports() to load the corresponding
reports from the project (optionally after filtering the summary).
reports = summary.query("log_loss < 0.1").reports(return_as="comparison")
len(reports.reports_)
4
Since we got a ComparisonReport, we can use the metrics accessor
to summarize the metrics across the reports.
reports.metrics.summarize().frame()
reports.metrics.roc().plot(subplot_by=None)

project.delete("example-project", workspace=tmp_path)
tmp_dir.cleanup()
Total running time of the script: (0 minutes 9.262 seconds)