Cache mechanism

This example shows how EstimatorReport and CrossValidationReport use caching to speed up computations.

Loading some data

First, we load a dataset from skrub. The task is to predict whether a company paid a physician. The broader motivation is to detect potential conflicts of interest.

from skrub.datasets import fetch_open_payments

dataset = fetch_open_payments()
df = dataset.X
y = dataset.y

from skrub import TableReport

TableReport(df)

import pandas as pd

TableReport(pd.DataFrame(y))

The dataset has over 70,000 records with only categorical features. Some categories are not well defined.

Caching with `EstimatorReport` and `CrossValidationReport`

We use skrub to create a simple predictive model that handles our dataset’s challenges.

from skrub import tabular_pipeline

model = tabular_pipeline("classifier")
model

Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(low_cardinality=ToCategorical())),
                ('histgradientboostingclassifier',
                 HistGradientBoostingClassifier())])

This model handles all types of data: numbers, categories, dates, and missing values. Let’s train it on part of our dataset.

from skore import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42)
# Let's keep a completely separate dataset
X_train, X_external, y_train, y_external = train_test_split(
    X_train, y_train, random_state=42
)

╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
│ It seems that you have a classification problem with a high class imbalance. In this │
│ case, using train_test_split may not be a good idea because of high variability in   │
│ the scores obtained on the test set. To tackle this challenge we suggest to use      │
│ skore's CrossValidationReport with the `splitter` parameter of your choice.          │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
│ It seems that you have a classification problem with a high class imbalance. In this │
│ case, using train_test_split may not be a good idea because of high variability in   │
│ the scores obtained on the test set. To tackle this challenge we suggest to use      │
│ skore's CrossValidationReport with the `splitter` parameter of your choice.          │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
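The two successive splits above can be followed with a little arithmetic. The sketch below is only an illustration: it assumes the dataset has 73,558 rows and that the split follows scikit-learn's default of holding out 25% of the rows, rounded up. The helper `split_sizes` is hypothetical and not part of skore.

```python
import math


def split_sizes(n_samples, test_fraction=0.25):
    """Return (n_train, n_test), assuming the held-out part is rounded up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


n_rows = 73_558  # assumption: total number of rows in the dataset

# First split: training data vs. test set.
n_train_full, n_test = split_sizes(n_rows)
# Second split: final training data vs. the separate "external" set.
n_train, n_external = split_sizes(n_train_full)

print(f"train={n_train}, external={n_external}, test={n_test}")
```

Under these assumptions the test set has 18,390 rows, which matches the shape of the prediction arrays shown in the cache later in this example.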

Caching the predictions for fast metric computation

First, we focus on EstimatorReport, as the same philosophy will apply to CrossValidationReport.

Let’s explore how EstimatorReport uses caching to speed up predictions. We start by training the model:

from skore import EstimatorReport

report = EstimatorReport(
    model,
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    pos_label="allowed",
)
report.help()

We compute the accuracy on our test set and measure how long it takes:

import time

start = time.time()
result = report.metrics.accuracy()
end = time.time()
result

0.9508972267536705

print(f"Time taken: {end - start:.2f} seconds")

Time taken: 6.11 seconds
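The start/end timing pattern used above can be wrapped in a small reusable helper. The following is a minimal sketch rather than anything provided by skore: the `timed` context manager and the `timings` dictionary are hypothetical names, and it uses `time.perf_counter`, a clock intended for measuring short durations.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("sum of squares", timings):
    # Stand-in for an expensive call such as a metric computation.
    total = sum(i * i for i in range(100_000))

print(f"Time taken: {timings['sum of squares']:.2f} seconds")
```

Any block of code, including a call such as `report.metrics.accuracy()`, could be placed inside the `with` statement.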

For comparison, here’s how scikit-learn computes the same accuracy score:

from sklearn.metrics import accuracy_score

start = time.time()
result = accuracy_score(report.y_test, report.estimator_.predict(report.X_test))
end = time.time()
result

0.9508972267536705

print(f"Time taken: {end - start:.2f} seconds")

Time taken: 1.82 seconds

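Both calls take a few seconds (6.11 s with skore and 1.82 s with scikit-learn in this run), because each has to compute predictions on the test set.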

Now, watch what happens when we compute the accuracy again with our skore estimator report:

start = time.time()
result = report.metrics.accuracy()
end = time.time()
result

0.9508972267536705

print(f"Time taken: {end - start:.2f} seconds")

Time taken: 0.00 seconds

The second calculation is instant! This happens because the report saves previous calculations in its cache. Let’s look inside the cache:

report._cache

{('test', 'decision_function', None): array([[ 0.2269056 , -0.2269056 ],
       [-3.65712076,  3.65712076],
       [-5.31562872,  5.31562872],
       ...,
       [-4.67733254,  4.67733254],
       [-4.46483791,  4.46483791],
       [-5.00752773,  5.00752773]], shape=(18390, 2)), ('test', 'predict', None): array(['allowed', 'disallowed', 'disallowed', ..., 'disallowed',
       'disallowed', 'disallowed'], shape=(18390,), dtype=object), ('test', 'predict_proba', None): array([[0.55648426, 0.44351574],
       [0.02515748, 0.97484252],
       [0.00489016, 0.99510984],
       ...,
       [0.00921804, 0.99078196],
       [0.01137567, 0.98862433],
       [0.00664299, 0.99335701]], shape=(18390, 2)), ('test', 'predict_log_proba', None): array([[-5.86116389e-01, -8.13021992e-01],
       [-3.68260010e+00, -2.54793368e-02],
       [-5.32053087e+00, -4.90215295e-03],
       ...,
       [-4.68659333e+00, -9.26078454e-03],
       [-4.47627878e+00, -1.14408637e-02],
       [-5.01419288e+00, -6.66515444e-03]], shape=(18390, 2)), ('test', 'accuracy', ('mapping', ())): 0.9508972267536705}

The cache stores predictions by type and data source. This means that computing metrics that use the same type of predictions will be faster. Let’s try the precision metric:

start = time.time()
result = report.metrics.precision()
end = time.time()
result

0.6537102473498233

print(f"Time taken: {end - start:.2f} seconds")

Time taken: 0.05 seconds

Computing the precision takes only a few tens of milliseconds, because the predictions are read from the cache and only the metric itself has to be computed. Since computing the predictions is the bottleneck, the speedup is substantial.
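To make the idea concrete, here is a toy sketch of a report that memoizes predictions under a `(data_source, method)` key, so that several metrics share one prediction pass. It is not skore's implementation: `CachedPredictor`, `threshold_model`, and the tiny dataset are all hypothetical.

```python
class CachedPredictor:
    """Toy report that memoizes predictions per (data_source, method) key."""

    def __init__(self, predict_fn, datasets):
        self._predict_fn = predict_fn  # expensive function: X -> predictions
        self._datasets = datasets      # {"test": (X, y), ...}
        self._cache = {}
        self.n_predict_calls = 0       # counts real (uncached) prediction passes

    def _get_predictions(self, data_source):
        key = (data_source, "predict")
        if key not in self._cache:
            X, _ = self._datasets[data_source]
            self.n_predict_calls += 1
            self._cache[key] = self._predict_fn(X)
        return self._cache[key]

    def accuracy(self, data_source="test"):
        _, y = self._datasets[data_source]
        y_pred = self._get_predictions(data_source)
        return sum(p == t for p, t in zip(y_pred, y)) / len(y)

    def precision(self, pos_label, data_source="test"):
        _, y = self._datasets[data_source]
        y_pred = self._get_predictions(data_source)
        true_pos = sum(p == pos_label and t == pos_label for p, t in zip(y_pred, y))
        predicted_pos = sum(p == pos_label for p in y_pred)
        return true_pos / predicted_pos if predicted_pos else 0.0


def threshold_model(X):
    """Stand-in for a fitted model's predict method."""
    return ["allowed" if x > 0 else "disallowed" for x in X]


report_sketch = CachedPredictor(
    threshold_model,
    {"test": ([1, -2, 3, -4], ["allowed", "disallowed", "disallowed", "disallowed"])},
)
acc = report_sketch.accuracy()                        # triggers the prediction pass
prec = report_sketch.precision(pos_label="allowed")   # reuses cached predictions
print(acc, prec, report_sketch.n_predict_calls)
```

The second metric reuses the predictions stored by the first, so the expensive prediction function runs only once.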

Caching all the possible predictions at once

We can pre-compute all predictions at once:

report.cache_predictions()

Now, all possible predictions are stored. Any metric calculation will be much faster, even on different data (like the training set):

start = time.time()
result = report.metrics.log_loss(data_source="train")
end = time.time()
result

0.09455280809007836

print(f"Time taken: {end - start:.2f} seconds")

Time taken: 0.08 seconds
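Conceptually, pre-computing everything amounts to looping over each combination of data source and prediction method and storing the result. The sketch below illustrates that idea only. `cache_all_predictions` and `ToyClassifier` are hypothetical, and skore's actual `cache_predictions` may work differently.

```python
from itertools import product


def cache_all_predictions(model, datasets, methods, cache):
    """Fill `cache` with model outputs for every (data_source, method) pair."""
    for data_source, method in product(datasets, methods):
        key = (data_source, method)
        if key not in cache:  # skip work that is already cached
            cache[key] = getattr(model, method)(datasets[data_source])
    return cache


class ToyClassifier:
    """Hypothetical stand-in for a fitted binary classifier."""

    def predict_proba(self, X):
        # Probability of the positive class, clipped to [0, 1].
        return [min(max(x, 0.0), 1.0) for x in X]

    def predict(self, X):
        return [int(p >= 0.5) for p in self.predict_proba(X)]


toy_cache = cache_all_predictions(
    ToyClassifier(),
    datasets={"train": [0.1, 0.9], "test": [0.4, 0.6, 0.8]},
    methods=["predict", "predict_proba"],
    cache={},
)
print(sorted(toy_cache))
```

After this call, a metric on either data source only needs a dictionary lookup to get its predictions.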

Caching for plotting

The cache also speeds up plots. Let’s create a ROC curve:

start = time.time()
display = report.metrics.roc()
display.plot()
end = time.time()

[Figure: ROC curve for HistGradientBoostingClassifier. Positive label: allowed. Data source: test set.]

print(f"Time taken: {end - start:.2f} seconds")

Time taken: 0.14 seconds

The second call reuses the cached display object, so only the drawing of the figure has to be redone:

start = time.time()
display = report.metrics.roc()
display.plot()
end = time.time()

print(f"Time taken: {end - start:.2f} seconds")

Time taken: 0.11 seconds

The cache stores the display object rather than the matplotlib figure. This means that we can still customize the cached plot before displaying it:

display.set_style(relplot_kwargs={"color": "tab:orange"})
_ = display.plot()
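The separation between cached data and rendering can be sketched as follows. This is a simplified illustration under assumed names (`CurveDisplay`, `get_curve_display`, `stats`), not skore's display API: the cache holds an object with the curve data, and the style is applied only when `plot` is called.

```python
from dataclasses import dataclass, field


@dataclass
class CurveDisplay:
    """Holds the data of a curve; rendering is deferred to `plot`."""

    x: list
    y: list
    style: dict = field(default_factory=lambda: {"color": "tab:blue"})

    def set_style(self, **kwargs):
        self.style.update(kwargs)
        return self

    def plot(self):
        # Stand-in for drawing: describe what would be rendered.
        return f"curve with {len(self.x)} points in {self.style['color']}"


display_cache = {}
stats = {"computations": 0}


def get_curve_display(key, compute):
    """Return the cached display for `key`, computing it only once."""
    if key not in display_cache:
        display_cache[key] = compute()
    return display_cache[key]


def compute_roc():
    stats["computations"] += 1  # counts expensive curve computations
    return CurveDisplay(x=[0.0, 0.5, 1.0], y=[0.0, 0.8, 1.0])


first = get_curve_display(("test", "roc"), compute_roc)
first_render = first.plot()

second = get_curve_display(("test", "roc"), compute_roc)  # served from the cache
second.set_style(color="tab:orange")
second_render = second.plot()
print(first_render, "|", second_render)
```

In this sketch the style change mutates the cached object, so later retrievals would keep the new style. Whether a real library copies the display on retrieval is a design choice that this example does not settle.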

We can also clear the cache at any time:

report.clear_cache()
report._cache

{}

The cache is now empty.

Caching with `CrossValidationReport`

CrossValidationReport uses the same caching system for each cross-validation split, building on the EstimatorReport presented above:

from skore import CrossValidationReport

report = CrossValidationReport(model, X=df, y=y, splitter=5, n_jobs=4)
report.help()

Since a CrossValidationReport uses several EstimatorReport instances internally, we observe the same behaviour as before. The first call is slow because it computes the predictions for each split.

start = time.time()
result = report.metrics.summarize().frame()
end = time.time()
result

		HistGradientBoostingClassifier
		mean	std
Metric	Label
Score		0.916569	0.036667
Accuracy		0.916569	0.036667
Precision	allowed	0.428331	0.121554
Precision	disallowed	0.959770	0.005506
Recall	allowed	0.420486	0.099510
Recall	disallowed	0.950805	0.043740
ROC AUC		0.875809	0.032261
Log loss		0.226492	0.121389
Brier score		0.063608	0.033675
Fit time (s)		28.954697	10.153457
Predict time (s)		4.892519	2.084602

print(f"Time taken: {end - start:.2f} seconds")

Time taken: 18.13 seconds

But the subsequent calls are fast because the predictions are cached.

start = time.time()
result = report.metrics.summarize().frame()
end = time.time()
result

		HistGradientBoostingClassifier
		mean	std
Metric	Label
Score		0.916569	0.036667
Accuracy		0.916569	0.036667
Precision	allowed	0.428331	0.121554
Precision	disallowed	0.959770	0.005506
Recall	allowed	0.420486	0.099510
Recall	disallowed	0.950805	0.043740
ROC AUC		0.875809	0.032261
Log loss		0.226492	0.121389
Brier score		0.063608	0.033675
Fit time (s)		28.954697	10.153457
Predict time (s)		4.892519	2.084602

print(f"Time taken: {end - start:.2f} seconds")

Time taken: 0.02 seconds

This is the same behaviour as we observed with EstimatorReport.
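The delegation described in this section can be sketched with toy classes. `SplitReport` and `CrossValidationSketch` are hypothetical names for illustration only. The sketch assumes that each split has its own report with its own cache and that the cross-validation object simply aggregates over them.

```python
from statistics import mean


class SplitReport:
    """Toy per-split report that caches its own accuracy."""

    def __init__(self, y_true, y_pred):
        self._y_true, self._y_pred = y_true, y_pred
        self._cache = {}
        self.n_computations = 0

    def accuracy(self):
        if "accuracy" not in self._cache:
            self.n_computations += 1
            hits = sum(t == p for t, p in zip(self._y_true, self._y_pred))
            self._cache["accuracy"] = hits / len(self._y_true)
        return self._cache["accuracy"]


class CrossValidationSketch:
    """Aggregates metrics over per-split reports, each with its own cache."""

    def __init__(self, split_reports):
        self.split_reports = split_reports

    def mean_accuracy(self):
        return mean(r.accuracy() for r in self.split_reports)

    def clear_cache(self):
        for r in self.split_reports:
            r._cache.clear()


cv_sketch = CrossValidationSketch(
    [
        SplitReport([1, 0, 1, 1], [1, 0, 0, 1]),  # 3 of 4 correct
        SplitReport([0, 0, 1, 1], [0, 0, 1, 1]),  # 4 of 4 correct
    ]
)
first_call = cv_sketch.mean_accuracy()   # computes the score of each split
second_call = cv_sketch.mean_accuracy()  # served from the per-split caches
print(first_call, second_call)
```

Clearing the cache of the aggregate object simply clears each per-split cache, after which the next call recomputes every split.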

Total running time of the script: (1 minutes 34.974 seconds)


Preview of `df` from the TableReport output:

	Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name	Dispute_Status_for_Publication	Name_of_Associated_Covered_Device_or_Medical_Supply1	Name_of_Associated_Covered_Drug_or_Biological1	Physician_Specialty
0	ELI LILLY AND COMPANY	No			Allopathic & Osteopathic Physicians|Pediatrics|Pediatric Rheumatology
1	ELI LILLY AND COMPANY	No			Allopathic & Osteopathic Physicians|Internal Medicine|Nephrology
2	ELI LILLY AND COMPANY	No			Allopathic & Osteopathic Physicians|Internal Medicine|Rheumatology
3	ELI LILLY AND COMPANY	No			Allopathic & Osteopathic Physicians|Internal Medicine|Endocrinology, Diabetes & Metabolism
4	ELI LILLY AND COMPANY	No		EFFIENT	Allopathic & Osteopathic Physicians|Pediatrics|Pediatric Hematology-Oncology

73,553	GlaxoSmithKline, LLC.	No		ZIAGEN
73,554	ALERE SCARBOROUGH, INC.	No	Alere PBP2a
73,555	NovoCure Limited	No
73,556	Wright Medical Technology, Inc.	No		HIPS
73,557	Alcon Research Ltd	No		Express

Column	Column name	dtype	Is sorted	Null values	Unique values
0	Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name	ObjectDType	False	0 (0.0%)	1466 (2.0%)
1	Dispute_Status_for_Publication	ObjectDType	False	0 (0.0%)	2 (< 0.1%)
2	Name_of_Associated_Covered_Device_or_Medical_Supply1	ObjectDType	False	43088 (58.6%)	4372 (5.9%)
3	Name_of_Associated_Covered_Drug_or_Biological1	ObjectDType	False	36233 (49.3%)	2262 (3.1%)
4	Physician_Specialty	ObjectDType	False	3996 (5.4%)	513 (0.7%)

Column 1	Column 2	Cramér's V
Name_of_Associated_Covered_Device_or_Medical_Supply1	Name_of_Associated_Covered_Drug_or_Biological1	0.263
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name	Name_of_Associated_Covered_Drug_or_Biological1	0.214
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name	Name_of_Associated_Covered_Device_or_Medical_Supply1	0.132
Name_of_Associated_Covered_Device_or_Medical_Supply1	Physician_Specialty	0.0962
Dispute_Status_for_Publication	Physician_Specialty	0.0960
Dispute_Status_for_Publication	Name_of_Associated_Covered_Drug_or_Biological1	0.0895
Name_of_Associated_Covered_Drug_or_Biological1	Physician_Specialty	0.0646
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name	Physician_Specialty	0.0510
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name	Dispute_Status_for_Publication	0.0308
Dispute_Status_for_Publication	Name_of_Associated_Covered_Device_or_Medical_Supply1	0.0284


