Cache mechanism#
This example shows how EstimatorReport and
CrossValidationReport use caching to speed up computations.
Loading some data#
First, we load a dataset from skrub. Our goal is to predict whether a company paid a
physician. In practice, such a model would help detect potential conflicts of
interest.
from skrub.datasets import fetch_open_payments
dataset = fetch_open_payments()
df = dataset.X
y = dataset.y
from skrub import TableReport
TableReport(df)
| | Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name | Dispute_Status_for_Publication | Name_of_Associated_Covered_Device_or_Medical_Supply1 | Name_of_Associated_Covered_Drug_or_Biological1 | Physician_Specialty |
|---|---|---|---|---|---|
| 0 | ELI LILLY AND COMPANY | No | | | Allopathic & Osteopathic Physicians\|Pediatrics\|Pediatric Rheumatology |
| 1 | ELI LILLY AND COMPANY | No | | | Allopathic & Osteopathic Physicians\|Internal Medicine\|Nephrology |
| 2 | ELI LILLY AND COMPANY | No | | | Allopathic & Osteopathic Physicians\|Internal Medicine\|Rheumatology |
| 3 | ELI LILLY AND COMPANY | No | | | Allopathic & Osteopathic Physicians\|Internal Medicine\|Endocrinology, Diabetes & Metabolism |
| 4 | ELI LILLY AND COMPANY | No | | EFFIENT | Allopathic & Osteopathic Physicians\|Pediatrics\|Pediatric Hematology-Oncology |
| 73,553 | GlaxoSmithKline, LLC. | No | | ZIAGEN | |
| 73,554 | ALERE SCARBOROUGH, INC. | No | Alere PBP2a | | |
| 73,555 | NovoCure Limited | No | | | |
| 73,556 | Wright Medical Technology, Inc. | No | HIPS | | |
| 73,557 | Alcon Research Ltd | No | Express | | |
Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name (ObjectDType)
- Null values: 0 (0.0%)
- Unique values: 1,466 (2.0%). This column has a high cardinality (> 40).
- Most frequent values: ['Merck Sharp & Dohme Corporation', 'Novartis Pharmaceuticals Corporation', 'Pfizer Inc.', 'Boston Scientific Corporation', 'Covidien Sales LLC', 'Stryker Corporation', 'SANOFI-AVENTIS U.S. LLC', 'AstraZeneca Pharmaceuticals LP', 'Genentech USA, Inc.', 'AbbVie, Inc.']
Dispute_Status_for_Publication (ObjectDType)
- Null values: 0 (0.0%)
- Unique values: 2 (< 0.1%)
- Most frequent values: ['No', 'Yes']
Name_of_Associated_Covered_Device_or_Medical_Supply1 (ObjectDType)
- Null values: 43,088 (58.6%)
- Unique values: 4,372 (5.9%). This column has a high cardinality (> 40).
- Most frequent values: ['Vascular', 'Spine', 'ARTHREX PRODUCT LINE DISTAL EXTREMITY ARTHROSCOPY', 'Surgical', 'ALL ARTHREX PRODUCT LINES', 'LifeVest', 'Spinal Cord Neurostimulation - Neuro', 'Da Vinci Surgical System', 'PAIN MANAGEMENT', 'Interventional Therapies']
Name_of_Associated_Covered_Drug_or_Biological1 (ObjectDType)
- Null values: 36,233 (49.3%)
- Unique values: 2,262 (3.1%). This column has a high cardinality (> 40).
- Most frequent values: ['Invokana', 'Xarelto', 'NON-PRODUCT', 'Prolia', 'BUTRANS', 'NON BRAND', 'No Product', 'Nesina', 'ELIQUIS', 'Zytiga']
Physician_Specialty (ObjectDType)
- Null values: 3,996 (5.4%)
- Unique values: 513 (0.7%). This column has a high cardinality (> 40).
- Most frequent values: ['Allopathic & Osteopathic Physicians|Internal Medicine', 'Other Service Providers|Specialist', 'Allopathic & Osteopathic Physicians|Surgery', 'Allopathic & Osteopathic Physicians|Family Medicine', 'Allopathic & Osteopathic Physicians|Orthopaedic Surgery', 'Allopathic & Osteopathic Physicians|Internal Medicine|Cardiovascular Disease', 'Allopathic & Osteopathic Physicians|Pediatrics', 'Allopathic & Osteopathic Physicians|Radiology|Diagnostic Radiology', 'Allopathic & Osteopathic Physicians|Obstetrics & Gynecology', 'Student, Health Care|Student in an Organized Health Care Education/Training Program']
| Column | Column name | dtype | Is sorted | Null values | Unique values | Mean | Std | Min | Median | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name | ObjectDType | False | 0 (0.0%) | 1466 (2.0%) | | | | | |
| 1 | Dispute_Status_for_Publication | ObjectDType | False | 0 (0.0%) | 2 (< 0.1%) | | | | | |
| 2 | Name_of_Associated_Covered_Device_or_Medical_Supply1 | ObjectDType | False | 43088 (58.6%) | 4372 (5.9%) | | | | | |
| 3 | Name_of_Associated_Covered_Drug_or_Biological1 | ObjectDType | False | 36233 (49.3%) | 2262 (3.1%) | | | | | |
| 4 | Physician_Specialty | ObjectDType | False | 3996 (5.4%) | 513 (0.7%) | | | | | |
| Column 1 | Column 2 | Cramér's V | Pearson's Correlation |
|---|---|---|---|
| Name_of_Associated_Covered_Device_or_Medical_Supply1 | Name_of_Associated_Covered_Drug_or_Biological1 | 0.263 | |
| Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name | Name_of_Associated_Covered_Drug_or_Biological1 | 0.214 | |
| Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name | Name_of_Associated_Covered_Device_or_Medical_Supply1 | 0.132 | |
| Name_of_Associated_Covered_Device_or_Medical_Supply1 | Physician_Specialty | 0.0962 | |
| Dispute_Status_for_Publication | Physician_Specialty | 0.0960 | |
| Dispute_Status_for_Publication | Name_of_Associated_Covered_Drug_or_Biological1 | 0.0895 | |
| Name_of_Associated_Covered_Drug_or_Biological1 | Physician_Specialty | 0.0646 | |
| Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name | Physician_Specialty | 0.0510 | |
| Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name | Dispute_Status_for_Publication | 0.0308 | |
| Dispute_Status_for_Publication | Name_of_Associated_Covered_Device_or_Medical_Supply1 | 0.0284 |
import pandas as pd
TableReport(pd.DataFrame(y))
| | status |
|---|---|
| 0 | disallowed |
| 1 | disallowed |
| 2 | disallowed |
| 3 | disallowed |
| 4 | disallowed |
| 73,553 | allowed |
| 73,554 | allowed |
| 73,555 | allowed |
| 73,556 | allowed |
| 73,557 | allowed |
status (ObjectDType)
- Null values: 0 (0.0%)
- Unique values: 2 (< 0.1%)
- Most frequent values: ['disallowed', 'allowed']
| Column | Column name | dtype | Is sorted | Null values | Unique values | Mean | Std | Min | Median | Max |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | status | ObjectDType | True | 0 (0.0%) | 2 (< 0.1%) | | | | | |
The dataset has over 70,000 records with only categorical features. Some categories are not well defined.
Caching with EstimatorReport and CrossValidationReport#
We use skrub to create a simple predictive model that handles our dataset's
challenges.
from skrub import tabular_pipeline
model = tabular_pipeline("classifier")
model
Pipeline(steps=[('tablevectorizer',
                 TableVectorizer(low_cardinality=ToCategorical())),
                ('histgradientboostingclassifier',
                 HistGradientBoostingClassifier())])
TableVectorizer parameters
| Parameter | Value |
|---|---|
| cardinality_threshold | 40 |
| low_cardinality | ToCategorical() |
| high_cardinality | StringEncoder() |
| numeric | PassThrough() |
| datetime | DatetimeEncoder() |
| specific_transformers | () |
| drop_null_fraction | 1.0 |
| drop_if_constant | False |
| drop_if_unique | False |
| datetime_format | None |
| n_jobs | None |

DatetimeEncoder parameters
| Parameter | Value |
|---|---|
| resolution | 'hour' |
| add_weekday | False |
| add_total_seconds | True |
| add_day_of_year | False |
| periodic_encoding | None |

StringEncoder parameters
| Parameter | Value |
|---|---|
| n_components | 30 |
| vectorizer | 'tfidf' |
| ngram_range | (3, ...) |
| analyzer | 'char_wb' |
| stop_words | None |
| random_state | None |
This model handles all types of data: numbers, categories, dates, and missing values. Let's train it on part of our dataset.
from skore import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42)
# Let's keep a completely separate dataset
X_train, X_external, y_train, y_external = train_test_split(
    X_train, y_train, random_state=42
)
HighClassImbalanceWarning
It seems that you have a classification problem with a high class imbalance. In this case, using train_test_split may not be a good idea because of high variability in the scores obtained on the test set. To tackle this challenge we suggest to use skore's CrossValidationReport with the `splitter` parameter of your choice.

ShuffleTrueWarning
We detected that the `shuffle` parameter is set to `True` either explicitly or from its default value. In case of time-ordered events (even if they are independent), this will result in inflated model performance evaluation because natural drift will not be taken into account. We recommend setting the shuffle parameter to `False` in order to ensure the evaluation process is really representative of your production release process.
Caching the predictions for fast metric computation#
First, we focus on EstimatorReport, as the same philosophy will
apply to CrossValidationReport.
Let's explore how EstimatorReport uses caching to speed up
predictions. We start by training the model:
from skore import EstimatorReport
report = EstimatorReport(
    model, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test
)
report.help()
Tools to diagnose estimator HistGradientBoostingClassifier
EstimatorReport
├── .data
│   └── .analyze(...) - Plot dataset statistics.
├── .metrics
│   ├── .accuracy(...) (↗) - Compute the accuracy score.
│   ├── .brier_score(...) (↘) - Compute the Brier score.
│   ├── .confusion_matrix(...) - Plot the confusion matrix.
│   ├── .log_loss(...) (↘) - Compute the log loss.
│   ├── .precision(...) (↗) - Compute the precision score.
│   ├── .precision_recall(...) - Plot the precision-recall curve.
│   ├── .recall(...) (↗) - Compute the recall score.
│   ├── .roc(...) - Plot the ROC curve.
│   ├── .roc_auc(...) (↗) - Compute the ROC AUC score.
│   ├── .timings(...) - Get all measured processing times related to the estimator.
│   ├── .custom_metric(...) - Compute a custom metric.
│   └── .summarize(...) - Report a set of metrics for our estimator.
├── .feature_importance
│   └── .permutation(...) - Report the permutation feature importance.
├── .cache_predictions(...) - Cache estimator's predictions.
├── .clear_cache(...) - Clear the cache.
├── .get_predictions(...) - Get estimator's predictions.
└── Attributes
    ├── .X_test - Testing data
    ├── .X_train - Training data
    ├── .y_test - Testing target
    ├── .y_train - Training target
    ├── .estimator - Estimator to make the report from
    ├── .estimator_ - The cloned or copied estimator
    ├── .estimator_name_ - The name of the estimator
    ├── .fit - Whether to fit the estimator on the training data
    ├── .fit_time_ - The time taken to fit the estimator, in seconds
    ├── .ml_task - No description available
    └── .pos_label - For binary classification, the positive class

Legend:
(↗) higher is better (↘) lower is better
We compute the accuracy on our test set and measure how long it takes:
0.9503534529635671
Time taken: 1.52 seconds
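The timing code for this cell does not appear above. A minimal sketch of how such a measurement could be taken is shown below; the `timed` helper and the `slow_square` stand-in are hypothetical names used for illustration and are not part of skore.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def slow_square(x):
    # Stand-in for an expensive call such as computing a metric.
    time.sleep(0.05)
    return x * x


result, elapsed = timed(slow_square, 3)
print(result)
print(f"Time taken: {elapsed:.2f} seconds")
```

Wrapping the call in a helper keeps the measurement identical across cells, which makes the cached and uncached timings directly comparable.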
For comparison, here's how scikit-learn computes the same accuracy score:
import time
from sklearn.metrics import accuracy_score
start = time.time()
result = accuracy_score(report.y_test, report.estimator_.predict(report.X_test))
end = time.time()
result
0.9503534529635671
Time taken: 1.54 seconds
Both approaches take similar time.
Now, watch what happens when we compute the accuracy again with our skore estimator report:
0.9503534529635671
Time taken: 0.00 seconds
The second calculation is instant! This happens because the report saves previous calculations in its cache. Let's look inside the cache:
report._cache
{(2641823030350441228, None, 'predict', 'test', None): array(['allowed', 'disallowed', 'disallowed', ..., 'disallowed',
'disallowed', 'disallowed'], shape=(18390,), dtype=object), (2641823030350441228, 'test', None, 'predict_time'): 1.504120739000001, (np.int64(2641823030350441228), 'accuracy_score', 'test'): 0.9503534529635671}
The cache stores predictions by type and data source. This means that computing metrics that use the same type of predictions will be faster. Let's try the precision metric:
{'allowed': 0.6510228640192539, 'disallowed': 0.9645196195683126}
Time taken: 0.07 seconds
Computing the precision takes only a few tens of milliseconds because the predictions are already cached and only the precision metric itself has to be computed. Since the predictions are the bottleneck in terms of computation time, the speedup is substantial.
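To make this reuse concrete, here is a minimal sketch of a dictionary cache shared between metrics. The `ToyReport` class, its key layout, and the call counter are assumptions made for illustration; skore's internal key format and logic may differ.

```python
class ToyReport:
    """Toy report that caches predictions and metric values in a dict."""

    def __init__(self, predict, X_test, y_test):
        self._predict = predict
        self.X_test = X_test
        self.y_test = y_test
        self._cache = {}
        self.n_predict_calls = 0  # how often predictions are really computed

    def _predictions(self):
        key = ("predict", "test")  # hypothetical key layout
        if key not in self._cache:
            self.n_predict_calls += 1
            self._cache[key] = [self._predict(x) for x in self.X_test]
        return self._cache[key]

    def accuracy(self):
        key = ("accuracy_score", "test")
        if key not in self._cache:
            y_pred = self._predictions()
            hits = sum(p == t for p, t in zip(y_pred, self.y_test))
            self._cache[key] = hits / len(self.y_test)
        return self._cache[key]

    def precision(self, pos_label):
        key = ("precision_score", "test", pos_label)
        if key not in self._cache:
            y_pred = self._predictions()  # reuses the cached predictions
            predicted_pos = [t for p, t in zip(y_pred, self.y_test) if p == pos_label]
            true_pos = sum(t == pos_label for t in predicted_pos)
            self._cache[key] = true_pos / len(predicted_pos)
        return self._cache[key]


toy = ToyReport(lambda x: x > 0, X_test=[-1, 2, 3, -4], y_test=[False, True, False, False])
acc_first = toy.accuracy()
acc_second = toy.accuracy()  # metric value served from the cache
prec = toy.precision(pos_label=True)  # new metric, cached predictions
print(acc_first, acc_second, prec, toy.n_predict_calls)
```

In this sketch the model is called once even though three metric calls are made: the repeated accuracy is a direct cache hit, and the precision only pays for the metric computation.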
Caching all the possible predictions at once#
We can pre-compute all predictions at once using parallel processing:
report.cache_predictions(n_jobs=4)
Now, all possible predictions are stored. Any metric calculation will be much faster, even on different data (like the training set):
0.09393801200203392
Time taken: 0.08 seconds
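As a rough illustration of what pre-computing predictions in parallel can look like, the sketch below fills a cache for every combination of prediction method and data source using a thread pool. The names `warm_cache`, `methods`, and `sources` are hypothetical, and this is not skore's implementation of `cache_predictions`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def warm_cache(cache, methods, sources, n_jobs=4):
    """Fill cache[(method_name, source_name)] for every combination."""
    tasks = list(product(methods.items(), sources.items()))

    def run(task):
        (method_name, method), (source_name, X) = task
        return (method_name, source_name), [method(x) for x in X]

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        for key, predictions in pool.map(run, tasks):
            cache[key] = predictions
    return cache


# Toy stand-ins for an estimator's prediction methods and data sources.
methods = {
    "predict": lambda x: int(x > 0),
    "predict_proba": lambda x: 1.0 if x > 0 else 0.0,
}
sources = {"train": [1, -2, 3], "test": [-1, 4]}
cache = warm_cache({}, methods, sources)
print(sorted(cache))
print(cache[("predict", "train")])
```

Once every combination is stored, any later metric on the train or test set only has to look up its predictions.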
Caching external data#
The report can also work with external data. We use data_source="X_y" to indicate
that we want to pass this external data explicitly.
start = time.time()
result = report.metrics.log_loss(data_source="X_y", X=X_external, y=y_external)
end = time.time()
result
0.12759029408505448
Time taken: 1.32 seconds
The first calculation in the above cell is slower than when using the internal train or test sets because, in addition to computing the predictions, it needs to compute a hash of the new data for later retrieval. Let's calculate it again:
start = time.time()
result = report.metrics.log_loss(data_source="X_y", X=X_external, y=y_external)
end = time.time()
result
0.12759029408505448
Time taken: 0.14 seconds
The second call is much faster because the predictions are cached! The remaining time corresponds to the hash computation. Let's compute the ROC AUC on the same data:
start = time.time()
result = report.metrics.roc_auc(data_source="X_y", X=X_external, y=y_external)
end = time.time()
result
0.9324490897420937
Time taken: 0.16 seconds
We observe that the computation is already efficient because it boils down to two computations: the hash of the data and the ROC-AUC metric. We save a lot of time because we don't need to re-compute the predictions.
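The cost split described here, a hash on every call plus predictions only on the first one, can be sketched as follows. The `data_key` function based on `pickle` and `hashlib`, and the `ExternalCache` class, are assumptions for illustration; skore may hash data differently.

```python
import hashlib
import pickle


def data_key(X):
    """Return a content hash identifying the data (recomputed on every call)."""
    return hashlib.sha256(pickle.dumps(X)).hexdigest()


class ExternalCache:
    """Cache predictions on external data, keyed by a hash of that data."""

    def __init__(self, predict):
        self._predict = predict
        self._cache = {}
        self.n_predict_calls = 0

    def predictions(self, X):
        key = (data_key(X), "predict")  # the hash is always paid
        if key not in self._cache:
            self.n_predict_calls += 1  # predictions only on a cache miss
            self._cache[key] = [self._predict(x) for x in X]
        return self._cache[key]


external = ExternalCache(lambda x: x >= 0)
X_new = [0.5, -1.0, 2.0]
first_preds = external.predictions(X_new)
second_preds = external.predictions(list(X_new))  # equal content, same key
print(first_preds, external.n_predict_calls)
```

Because the key is derived from the content of the data, the report cannot skip the hash even on a cache hit, which is why a small residual cost remains on repeated calls.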
Caching for plotting#
The cache also speeds up plots. Let's create a ROC curve:
Time taken: 0.07 seconds
The second plot is instant because it uses cached data:
Time taken: 0.05 seconds
The cache stores the display object, not the matplotlib figure itself. This means
that we can still customize the cached plot before displaying it:
display.plot(roc_curve_kwargs={"color": "tab:orange"})
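The distinction between caching the display object and caching the figure can be sketched without matplotlib. Below, a hypothetical `CurveDisplay` holds only the curve data and a hypothetical `ToyPlots` class caches it; styling is applied at `plot` time, so the same cached object can be rendered in different ways. The returned dictionary is a stand-in for a real figure.

```python
class CurveDisplay:
    """Holds the data needed to draw a curve, not the drawn figure."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def plot(self, **style):
        # Stand-in for real drawing: return a description of the rendering.
        return {"x": self.x, "y": self.y, "style": style}


class ToyPlots:
    """Caches display objects so the curve is computed only once."""

    def __init__(self):
        self._cache = {}
        self.n_computations = 0

    def roc(self):
        key = ("roc", "test")  # hypothetical key layout
        if key not in self._cache:
            self.n_computations += 1  # expensive curve computation
            self._cache[key] = CurveDisplay([0.0, 0.1, 1.0], [0.0, 0.8, 1.0])
        return self._cache[key]


plots = ToyPlots()
display_a = plots.roc()
display_b = plots.roc()  # same cached display object
default_fig = display_a.plot()
orange_fig = display_b.plot(color="tab:orange")
print(display_a is display_b, plots.n_computations)
print(orange_fig["style"])
```

Keeping the style out of the cache key means a change of colour never triggers a new curve computation.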

We can also clear the cache at any time:
report.clear_cache()
report._cache
{}
The cache is now empty.
Caching with CrossValidationReport#
CrossValidationReport uses the same caching system for each split
in cross-validation by leveraging the previous EstimatorReport:
from skore import CrossValidationReport
report = CrossValidationReport(model, X=df, y=y, splitter=5, n_jobs=4)
report.help()
Tools to diagnose estimator HistGradientBoostingClassifier
CrossValidationReport
├── .data
│   └── .analyze(...) - Plot dataset statistics.
├── .metrics
│   ├── .accuracy(...) (↗) - Compute the accuracy score.
│   ├── .brier_score(...) (↘) - Compute the Brier score.
│   ├── .confusion_matrix(...) - Plot the confusion matrix.
│   ├── .log_loss(...) (↘) - Compute the log loss.
│   ├── .precision(...) (↗) - Compute the precision score.
│   ├── .precision_recall(...) - Plot the precision-recall curve.
│   ├── .recall(...) (↗) - Compute the recall score.
│   ├── .roc(...) - Plot the ROC curve.
│   ├── .roc_auc(...) (↗) - Compute the ROC AUC score.
│   ├── .timings(...) - Get all measured processing times related to the estimator.
│   ├── .custom_metric(...) - Compute a custom metric.
│   └── .summarize(...) - Report a set of metrics for our estimator.
├── .feature_importance
├── .cache_predictions(...) - Cache the predictions for sub-estimators reports.
├── .clear_cache(...) - Clear the cache.
├── .get_predictions(...) - Get estimator's predictions.
└── Attributes
    ├── .X - The data to fit
    ├── .y - The target variable to try to predict in the case of supervised learning
    ├── .estimator - Estimator to make the cross-validation report from
    ├── .estimator_ - The cloned or copied estimator
    ├── .estimator_name_ - The name of the estimator
    ├── .estimator_reports_ - The estimator reports for each split
    ├── .ml_task - No description available
    ├── .n_jobs - Number of jobs to run in parallel
    ├── .pos_label - For binary classification, the positive class
    ├── .split_indices - No description available
    └── .splitter - Determines the cross-validation splitting strategy

Legend:
(↗) higher is better (↘) lower is better
Since a CrossValidationReport holds one EstimatorReport per split, we observe
the same behaviour as described above.
The first call will be slow because it computes the predictions for each split.
| Metric | Label / Average | HistGradientBoostingClassifier (mean) | HistGradientBoostingClassifier (std) |
|---|---|---|---|
| Accuracy | | 0.918581 | 0.036671 |
| Precision | allowed | 0.441963 | 0.125824 |
| | disallowed | 0.959807 | 0.005551 |
| Recall | allowed | 0.419643 | 0.100246 |
| | disallowed | 0.953014 | 0.043796 |
| ROC AUC | | 0.875464 | 0.029174 |
| Brier score | | 0.063184 | 0.033636 |
| Fit time (s) | | 15.371086 | 3.015804 |
| Predict time (s) | | 1.938918 | 0.423745 |
Time taken: 10.67 seconds
But the subsequent calls are fast because the predictions are cached.
| Metric | Label / Average | HistGradientBoostingClassifier (mean) | HistGradientBoostingClassifier (std) |
|---|---|---|---|
| Accuracy | | 0.918581 | 0.036671 |
| Precision | allowed | 0.441963 | 0.125824 |
| | disallowed | 0.959807 | 0.005551 |
| Recall | allowed | 0.419643 | 0.100246 |
| | disallowed | 0.953014 | 0.043796 |
| ROC AUC | | 0.875464 | 0.029174 |
| Brier score | | 0.063184 | 0.033636 |
| Fit time (s) | | 15.371086 | 3.015804 |
| Predict time (s) | | 1.938918 | 0.423745 |
Time taken: 0.00 seconds
Hence, CrossValidationReport shows the same caching behaviour as EstimatorReport.
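A minimal sketch of this delegation: a cross-validation report holds one report per split, each with its own cache, and aggregates their results. `SplitReport` and `ToyCrossValidationReport` are hypothetical names, and the aggregation shown (a plain mean over splits) is a simplification rather than skore's actual logic.

```python
from statistics import mean


class SplitReport:
    """Per-split report that caches its own accuracy."""

    def __init__(self, y_true, y_pred):
        self._y_true = y_true
        self._y_pred = y_pred
        self._cache = {}
        self.n_computations = 0

    def accuracy(self):
        if "accuracy" not in self._cache:
            self.n_computations += 1
            hits = sum(t == p for t, p in zip(self._y_true, self._y_pred))
            self._cache["accuracy"] = hits / len(self._y_true)
        return self._cache["accuracy"]

    def clear_cache(self):
        self._cache = {}


class ToyCrossValidationReport:
    """Aggregates per-split reports; caching lives in each split report."""

    def __init__(self, split_reports):
        self.estimator_reports_ = split_reports

    def accuracy(self):
        return mean(r.accuracy() for r in self.estimator_reports_)

    def clear_cache(self):
        for r in self.estimator_reports_:
            r.clear_cache()


splits = [
    SplitReport([1, 0, 1, 1], [1, 0, 0, 1]),
    SplitReport([0, 0, 1, 1], [0, 0, 1, 1]),
]
cv_report = ToyCrossValidationReport(splits)
first = cv_report.accuracy()
second = cv_report.accuracy()  # every split answers from its cache
print(first, second, [r.n_computations for r in splits])
```

Clearing the cache on the cross-validation report simply clears each per-split cache, after which the next call recomputes every split.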
Total running time of the script: (1 minutes 9.717 seconds)