EstimatorReport: Get insights from any scikit-learn estimator

This example shows how the skore.EstimatorReport class can be used to quickly get insights from any scikit-learn estimator.

Loading our dataset and defining our estimator

First, we load a dataset from skrub. Our goal is to predict whether a healthcare manufacturing company paid a medical doctor or hospital, in order to detect potential conflicts of interest.

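from skrub.datasets import fetch_open_payments

dataset = fetch_open_payments()
df = dataset.X
y = dataset.y
Downloading 'open_payments' from https://github.com/skrub-data/skrub-data-files/raw/refs/heads/main/open_payments.zip (attempt 1/3)
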
from skrub import TableReport

TableReport(df)

Looking at the distributions of the target, we observe that this classification task is quite imbalanced. It means that we have to be careful when selecting a set of statistical metrics to evaluate the classification performance of our predictive model. In addition, we see that the class labels are not specified by an integer 0 or 1 but instead by a string “allowed” or “disallowed”.

For our application, the label of interest is “allowed”.

pos_label, neg_label = "allowed", "disallowed"

Now, we need to define a predictive model. Thankfully, skrub provides a convenient function, skrub.tabular_pipeline(), that builds a strong baseline predictive model in a single line of code. Its feature engineering is generic rather than handcrafted for our data, but it still provides a good starting point.

So let’s create a classifier for our task.

from skrub import tabular_pipeline

estimator = tabular_pipeline("classifier")
estimator
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(low_cardinality=ToCategorical())),
                ('histgradientboostingclassifier',
                 HistGradientBoostingClassifier())])


Getting insights from our estimator

Introducing the skore.EstimatorReport class

Now, we would like to get some insights from our predictive model. One way is to use the skore.EstimatorReport class, which we construct using the evaluate function. This function detects that our estimator is unfitted, fits it for us on the training data, and returns an EstimatorReport object.

Specifying a splitter of 0.2 will perform an 80/20 train-test split.

from skore import evaluate

report = evaluate(estimator, X=df, y=y, pos_label=pos_label, splitter=0.2)
report
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(low_cardinality=ToCategorical())),
                ('histgradientboostingclassifier',
                 HistGradientBoostingClassifier())])

1 issue(s), 1 tip(s), 3 passed, 0 ignored.


Once the report is created, calling the help() method lists the tools available for getting insights from our specific model on our specific task.

report.help()



Be aware that we can access the help for each individual sub-accessor. For instance:

report.metrics.help()


Metrics computation with aggressive caching

At this point, we might be interested in having a first look at the statistical performance of our model on the test set that was held out by the split. We can access it by calling any of the metrics displayed above. Since we are greedy, we want to get several metrics at once, so we use the summarize() method.

import time

start = time.time()
metric_report = report.metrics.summarize().frame()
end = time.time()
metric_report
Metric            HistGradientBoostingClassifier
Score             0.951536
Accuracy          0.951536
Precision         0.736842
Recall            0.447552
ROC AUC           0.938427
Log loss          0.131869
Brier score       0.037143
Fit time (s)      10.358522
Predict time (s)  1.285694


print(f"Time taken to compute the metrics: {end - start:.2f} seconds")
Time taken to compute the metrics: 0.00 seconds

An interesting feature provided by the skore.EstimatorReport is its caching mechanism. Indeed, when we have a large enough dataset, computing the predictions for a model is not cheap anymore. For instance, on our smallish dataset, predicting on the test set alone took more than a second (see the predict time reported above). The report caches the predictions, so computing a metric again, or computing another metric that requires the same predictions, is faster. Let’s check by requesting the same metrics report again.

start = time.time()
metric_report = report.metrics.summarize().frame()
end = time.time()
metric_report
Metric            HistGradientBoostingClassifier
Score             0.951536
Accuracy          0.951536
Precision         0.736842
Recall            0.447552
ROC AUC           0.938427
Log loss          0.131869
Brier score       0.037143
Fit time (s)      10.358522
Predict time (s)  1.285694


print(f"Time taken to compute the metrics: {end - start:.2f} seconds")
Time taken to compute the metrics: 0.00 seconds
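
The caching idea can be illustrated with a minimal, pure-Python sketch. This is not skore's implementation: the TinyReport class, its predict_fn argument, and the cache key are hypothetical names chosen for illustration. The only point is that predictions are computed once per data source and then reused by every metric that needs them.

```python
class TinyReport:
    """Toy report that memoizes predictions per data source (hypothetical)."""

    def __init__(self, predict_fn, data):
        self._predict_fn = predict_fn  # stand-in for an expensive model
        self._data = data              # maps a data source name to (X, y)
        self._cache = {}               # data source name -> predictions
        self.n_predict_calls = 0       # counts real prediction passes

    def _predictions(self, data_source):
        # Compute predictions only on a cache miss.
        if data_source not in self._cache:
            X, _ = self._data[data_source]
            self._cache[data_source] = [self._predict_fn(x) for x in X]
            self.n_predict_calls += 1
        return self._cache[data_source]

    def accuracy(self, data_source="test"):
        _, y = self._data[data_source]
        y_pred = self._predictions(data_source)
        return sum(t == p for t, p in zip(y, y_pred)) / len(y)

    def error_rate(self, data_source="test"):
        # A different metric that reuses the same cached predictions.
        return 1.0 - self.accuracy(data_source)

    def clear_cache(self):
        self._cache.clear()


def is_large(x):
    # Hypothetical "model": predicts 1 when the single feature exceeds 0.5.
    return int(x > 0.5)


toy_report = TinyReport(is_large, {"test": ([0.1, 0.7, 0.9, 0.4], [0, 1, 0, 0])})
first = toy_report.accuracy()
second = toy_report.error_rate()
print(first, second, toy_report.n_predict_calls)
```

Both metrics trigger a single prediction pass in this sketch. Calling toy_report.clear_cache() forces the next metric to predict again.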

Note that when the model is fitted or the predictions are computed, we additionally store the time the operation took:

report.metrics.timings()
{'fit_time': 10.358521977999999, 'predict_time_train': 4.985065397999961, 'predict_time_test': 1.2856944669999848}

Since we obtain a pandas dataframe, we can also use the plotting interface of pandas.

ax = metric_report.plot.barh()
_ = ax.set_title("Metrics report")
[Figure: horizontal bar chart of the metrics, titled “Metrics report”.]

Whenever we compute a metric, we first check whether the predictions are available in the cache and reuse them if so. For instance, let’s compute the log loss.

start = time.time()
log_loss = report.metrics.log_loss()
end = time.time()
log_loss
0.13186882167335673
print(f"Time taken to compute the log loss: {end - start:.2f} seconds")
Time taken to compute the log loss: 0.00 seconds

We can show that without initial cache, it would have taken more time to compute the log loss.

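report.clear_cache()

start = time.time()
log_loss = report.metrics.log_loss()
end = time.time()
log_loss
0.13186882167335673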
print(f"Time taken to compute the log loss: {end - start:.2f} seconds")
Time taken to compute the log loss: 2.56 seconds

By default, the metrics are computed on the test set only. However, if a training set is provided, we can also compute the metrics on it by specifying the data_source parameter.

report.metrics.log_loss(data_source="train")
0.09572131415106928
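
As a reminder of what this number measures, binary log loss is the mean negative log-probability that the model assigns to the true class. The sketch below is a pure-Python restatement on three hand-made samples. The probabilities are hypothetical and the helper binary_log_loss is not part of skore.

```python
import math


def binary_log_loss(y_true, proba_pos, pos_label, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, p in zip(y_true, proba_pos):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p if label == pos_label else 1 - p)
    return total / len(y_true)


toy_y = ["allowed", "disallowed", "allowed"]
toy_proba_allowed = [0.9, 0.2, 0.6]  # hypothetical P(label == "allowed")
toy_loss = binary_log_loss(toy_y, toy_proba_allowed, pos_label="allowed")
print(f"{toy_loss:.4f}")
```

Confident, correct predictions contribute little to the loss, while the hesitant 0.6 on the last sample contributes the most.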

Be aware that we can also benefit from the caching mechanism with our own custom metrics. Skore only expects that we define our own metric function to take y_true and y_pred as the first two positional arguments. It can take any other arguments. Let’s see an example.

def operational_decision_cost(y_true, y_pred, amount):
    mask_true_positive = (y_true == pos_label) & (y_pred == pos_label)
    mask_true_negative = (y_true == neg_label) & (y_pred == neg_label)
    mask_false_positive = (y_true == neg_label) & (y_pred == pos_label)
    mask_false_negative = (y_true == pos_label) & (y_pred == neg_label)
    fraudulent_refuse = mask_true_positive.sum() * 50
    fraudulent_accept = -amount[mask_false_negative].sum()
    legitimate_refuse = mask_false_positive.sum() * -5
    legitimate_accept = (amount[mask_true_negative] * 0.02).sum()
    return fraudulent_refuse + fraudulent_accept + legitimate_refuse + legitimate_accept

In our use case, we have an operational decision to make that translates the classification outcome into a cost. It translates the confusion matrix into a cost matrix based on an amount linked to each sample in the dataset. Here, we randomly generate some amounts as an illustration.

import numpy as np
from sklearn.metrics import make_scorer

rng = np.random.default_rng(42)
amount = rng.integers(low=100, high=1000, size=len(report.y_test))

report.metrics.add(metric=make_scorer(operational_decision_cost, amount=amount))

cost = report.metrics.summarize(metric="operational_decision_cost")
cost.frame()
Metric                     HistGradientBoostingClassifier
Operational Decision Cost  -131798.62
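
To make the arithmetic of operational_decision_cost concrete, here is a loop-based restatement applied to four hand-made samples, one per cell of the confusion matrix. The helper cost_pure_python and the toy amounts are hypothetical. The gains and penalties (50, the full amount, 5, and 2% of the amount) are the ones used in the function defined earlier.

```python
def cost_pure_python(y_true, y_pred, amount, pos_label="allowed", neg_label="disallowed"):
    """Loop-based restatement of operational_decision_cost for small inputs."""
    total = 0.0
    for true, pred, amt in zip(y_true, y_pred, amount):
        if true == pos_label and pred == pos_label:
            total += 50          # true positive: fixed gain
        elif true == pos_label and pred == neg_label:
            total -= amt         # false negative: lose the full amount
        elif true == neg_label and pred == pos_label:
            total -= 5           # false positive: fixed penalty
        elif true == neg_label and pred == neg_label:
            total += 0.02 * amt  # true negative: 2% of the amount
    return total


toy_true = ["allowed", "allowed", "disallowed", "disallowed"]
toy_pred = ["allowed", "disallowed", "allowed", "disallowed"]
toy_amount = [100, 200, 300, 400]  # hypothetical amounts
toy_cost = cost_pure_python(toy_true, toy_pred, toy_amount)
print(toy_cost)
```

In this toy case the single false negative dominates the total, which suggests why a low recall can translate into a strongly negative cost.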


By the way, skore caches the model predictions. It is really handy because it means that we can compute some additional metrics without having to recompute the predictions.

report.metrics.summarize(
    metric=["precision", "recall", "operational_decision_cost"]
).frame()
Metric                     HistGradientBoostingClassifier
Precision                  0.736842
Recall                     0.447552
Operational Decision Cost  -131798.620000


Effortless one-liner plotting

The skore.EstimatorReport class also implements a number of the most common data science plots. As for the metrics, we only provide the meaningful set of plots for the provided estimator.

report.metrics.help()


Let’s start by plotting the ROC curve for our binary classification task.

display = report.metrics.roc()
display.plot()
[Figure: ROC curve for HistGradientBoostingClassifier. Positive label: allowed. Data source: test set.]
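
For readers who want to see what such a curve is made of, the sketch below builds ROC points by sweeping a decision threshold over predicted scores, then integrates them with the trapezoidal rule. It is a pure-Python illustration on hypothetical scores, not the code skore or scikit-learn runs, and the helper names roc_points and auc_trapezoid are made up for this example.

```python
def roc_points(y_true, scores, pos_label):
    """Return (fpr, tpr) pairs obtained by sweeping the decision threshold."""
    n_pos = sum(label == pos_label for label in y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Visit each distinct score from high to low and predict positive
    # whenever a sample's score is at or above the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and label == pos_label for s, label in zip(scores, y_true))
        fp = sum(s >= threshold and label != pos_label for s, label in zip(scores, y_true))
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


toy_labels = ["allowed", "allowed", "disallowed", "allowed", "disallowed", "disallowed"]
toy_scores = [0.9, 0.6, 0.5, 0.4, 0.2, 0.1]  # hypothetical P("allowed")
toy_curve = roc_points(toy_labels, toy_scores, pos_label="allowed")
toy_auc = auc_trapezoid(toy_curve)
print(toy_curve, toy_auc)
```

The area equals the fraction of (positive, negative) pairs that the scores rank correctly, here 8 out of 9.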

The plot functionality is built upon the scikit-learn display objects. We return those displays (slightly modified to improve the UI) in case we want to tweak some of the plot properties. We can have a quick look at the available attributes and methods by calling the help() method or simply by printing the display.

display.help()



fig = display.plot()
fig.axes[0].set_title("Example of a ROC curve")
fig
[Figure: ROC curve for HistGradientBoostingClassifier, retitled “Example of a ROC curve”. Positive label: allowed. Data source: test set.]

Similarly to the metrics, we aggressively use caching to avoid recomputing the predictions of the model. We also cache the plot display object by detecting whether the input parameters are the same as in the previous call. Let’s demonstrate the kind of performance gain we can get.

start = time.time()
# the predictions were already computed in a previous call
display = report.metrics.roc()
fig = display.plot()
end = time.time()
fig
[Figure: ROC curve for HistGradientBoostingClassifier. Positive label: allowed. Data source: test set.]
print(f"Time taken to compute the ROC curve: {end - start:.2f} seconds")
Time taken to compute the ROC curve: 0.11 seconds

Now, let’s clear the cache and check if we get a slowdown.

report.clear_cache()

start = time.time()
display = report.metrics.roc()
fig = display.plot()
end = time.time()
fig
[Figure: ROC curve for HistGradientBoostingClassifier. Positive label: allowed. Data source: test set.]
print(f"Time taken to compute the ROC curve: {end - start:.2f} seconds")
Time taken to compute the ROC curve: 2.71 seconds

As expected, since we need to recompute the predictions, it takes more time.

Visualizing the confusion matrix

Another useful visualization for classification tasks is the confusion matrix, which shows the counts of correct and incorrect predictions for each class.

Let’s first start with a basic confusion matrix:

cm_display = report.metrics.confusion_matrix()
cm_display.plot()
[Figure: confusion matrix. Data source: test set.]

In binary classification, a confusion matrix depends on the decision threshold used to convert predicted probabilities into class labels. By default, skore uses a threshold of 0.5, but confusion matrices are actually computed at every threshold internally.

# To visualize the confusion matrix at a different threshold, use the
# ``threshold_value`` parameter. For example, a threshold of 0.3 will classify
# more samples as positive:
cm_display.plot(threshold_value=0.3)
[Figure: confusion matrix. Decision threshold: 0.30. Positive label: allowed. Data source: test set.]
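
The effect of the threshold can be reproduced by hand. The following sketch counts the four confusion matrix cells for the same predicted probabilities at two thresholds. The probabilities are hypothetical, the helper confusion_counts is not a skore function, and treating a probability equal to the threshold as positive is an assumption made for this example.

```python
def confusion_counts(y_true, proba_pos, threshold, pos_label="allowed"):
    """Count TP, FN, FP, TN when predicting positive for proba >= threshold."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for label, p in zip(y_true, proba_pos):
        predicted_pos = p >= threshold
        if label == pos_label:
            counts["tp" if predicted_pos else "fn"] += 1
        else:
            counts["fp" if predicted_pos else "tn"] += 1
    return counts


toy_y_true = ["allowed", "allowed", "allowed", "disallowed", "disallowed", "disallowed"]
toy_proba = [0.9, 0.6, 0.4, 0.35, 0.2, 0.1]  # hypothetical P("allowed")
at_default = confusion_counts(toy_y_true, toy_proba, threshold=0.5)
at_lower = confusion_counts(toy_y_true, toy_proba, threshold=0.3)
print(at_default)
print(at_lower)
```

Lowering the threshold from 0.5 to 0.3 turns one false negative into a true positive here, at the price of one additional false positive.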

We can normalize the confusion matrix to get percentages instead of raw counts. Here we normalize by true labels (rows):

cm_display.plot(normalize="true")
[Figure: confusion matrix normalized by true labels. Data source: test set.]

More plotting options are available via heatmap_kwargs, which are passed to seaborn’s heatmap. For example, we can customize the colormap and number format:

cm_display.set_style(heatmap_kwargs={"cmap": "Greens", "fmt": ".2e"})
cm_display.plot()
[Figure: confusion matrix with the “Greens” colormap and scientific number format. Data source: test set.]

Finally, the confusion matrix can also be exported as a pandas DataFrame for further analysis:

cm_display.frame()

   true_label  predicted_label  value
0  allowed     allowed          448
1  allowed     disallowed       553
2  disallowed  allowed          160
3  disallowed  disallowed       13551
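
These four counts are enough to recover several of the metrics reported earlier. The short check below recomputes accuracy, precision and recall from the exported values, and compares accuracy with what a model that always predicts “disallowed” would achieve. It uses only plain Python, and the variable names are chosen for this example.

```python
# Counts copied from the exported confusion matrix above (test set).
tp = 448    # true "allowed", predicted "allowed"
fn = 553    # true "allowed", predicted "disallowed"
fp = 160    # true "disallowed", predicted "allowed"
tn = 13551  # true "disallowed", predicted "disallowed"

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Accuracy of always predicting the majority class "disallowed".
majority_baseline = (fp + tn) / total

print(f"accuracy={accuracy:.6f} precision={precision:.6f} recall={recall:.6f}")
print(f"majority-class baseline accuracy={majority_baseline:.4f}")
```

The recomputed values match the metrics table. The accuracy of about 0.952 sits close to the majority-class baseline of about 0.932, which is why precision and recall are more informative than accuracy on this imbalanced task.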


See also

For using the EstimatorReport to inspect your models, see EstimatorReport: Inspecting your models with the feature importance.

Total running time of the script: (1 minutes 5.916 seconds)
