CrossValidationReport
- class skore.CrossValidationReport(estimator, X=None, y=None, data=None, pos_label=None, splitter=None, n_jobs=None)
Report for cross-validation results.
Upon initialization, CrossValidationReport will clone estimator according to splitter and fit the generated estimators. The fitting is done in parallel. Refer to the Cross-validation estimator section of the user guide for more details.
- Parameters:
- estimator : estimator object
Estimator to make the cross-validation report from. An estimator can be one of the following:
- a scikit-learn compatible estimator as a BaseEstimator;
- a skrub DataOp to preprocess the data;
- a skrub SkrubLearner extracted from a DataOp by calling make_learner().
- X : {array-like, sparse matrix} of shape (n_samples, n_features) or None
The data to fit. Can be, for example, a list or an array.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs) or None
The target variable to try to predict in the case of supervised learning.
- data : dict or None
When estimator is a skrub SkrubLearner, bindings for variables contained in the DataOp that was used to create this learner (e.g. {"X": X_df, "other_table": df, ...}).
- pos_label : int, float, bool or str, default=None
For binary classification, the positive class to use for metrics and displays that need one. If None, skore does not infer a default positive class. Binary metrics and displays that support it will expose all classes instead. This parameter is rejected for non-binary tasks.
- splitter : int, cross-validation generator or an iterable, default=5
Determines the cross-validation splitting strategy. Possible inputs for splitter are:
- int, to specify the number of splits in a (Stratified)KFold,
- a scikit-learn CV splitter,
- an iterable yielding (train, test) splits as arrays of indices.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls. Refer to scikit-learn's User Guide for the various cross-validation strategies that can be used here.
- n_jobs : int, default=None
Number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the cross-validation splits. When accessing some methods of the CrossValidationReport, the n_jobs parameter is used to parallelize the computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
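As noted for splitter, an iterable yielding (train, test) index splits is also accepted. A minimal pure-Python sketch of such an iterable (a contiguous, unshuffled K-fold over n samples; the function name is illustrative, not part of skore) could look like:

```python
def contiguous_kfold(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Each fold's test block is a contiguous slice of the sample indices;
    the train set is everything else. This mirrors an unshuffled KFold
    expressed as plain index lists.
    """
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):  # spread the remainder over early folds
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

# Two folds of 5 test samples each; such a list could be passed as
# `splitter=splits` when creating the report.
splits = list(contiguous_kfold(10, n_splits=2))
```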
- Attributes:
- estimator_ : estimator object
The cloned or copied estimator.
- estimator_name_ : str
The name of the estimator.
- estimator_reports_ : list of EstimatorReport
The estimator reports for each split.
See also
skore.EstimatorReport : Report for a fitted estimator.
skore.ComparisonReport : Report of comparison between estimators.
Examples
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = make_classification(random_state=42)
>>> estimator = LogisticRegression()
>>> from skore import CrossValidationReport
>>> report = CrossValidationReport(estimator, X=X, y=y, splitter=2)
- add_checks(checks)
Register additional diagnostic checks for this report.
Checks are defined by implementing the Check protocol.
Appends the given checks to the registry used by diagnose(). The next call to diagnose() runs any newly added checks (along with checks that have not yet been cached). Already-run built-in checks are not re-executed.
- Parameters:
- checks : list of Check
Additional checks to register.
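The run-once semantics described above (newly added checks run on the next diagnose() call; already-run checks are served from cache) can be sketched with a plain registry. This is a conceptual illustration in standard Python, not skore's implementation, and all names below are hypothetical:

```python
class CheckRegistry:
    """Illustrative registry where each registered check runs at most once."""

    def __init__(self, checks=()):
        self._checks = list(checks)
        self._results = {}  # cache: check name -> result

    def add_checks(self, checks):
        # Newly appended checks will run on the next diagnose() call.
        self._checks.extend(checks)

    def diagnose(self):
        for check in self._checks:
            name = check.__name__
            if name not in self._results:  # already-run checks are skipped
                self._results[name] = check()
        return dict(self._results)

calls = []  # track how many times each check actually executes

def check_underfitting():
    calls.append("underfitting")
    return "passed"

def check_overfitting():
    calls.append("overfitting")
    return "passed"

registry = CheckRegistry([check_underfitting])
registry.diagnose()                       # runs check_underfitting once
registry.add_checks([check_overfitting])
results = registry.diagnose()             # runs only the newly added check
```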
- cache_predictions()
Cache the predictions for sub-estimators reports.
Examples
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.linear_model import LogisticRegression
>>> from skore import CrossValidationReport
>>> X, y = load_breast_cancer(return_X_y=True)
>>> classifier = LogisticRegression(max_iter=10_000)
>>> report = CrossValidationReport(classifier, X=X, y=y, splitter=2)
>>> report.cache_predictions()
>>> report.estimator_reports_[0]._cache
{...}
- clear_cache()
Clear the cache.
Examples
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.linear_model import LogisticRegression
>>> from skore import CrossValidationReport
>>> X, y = load_breast_cancer(return_X_y=True)
>>> classifier = LogisticRegression(max_iter=10_000)
>>> report = CrossValidationReport(classifier, X=X, y=y, splitter=2)
>>> report.cache_predictions()
>>> report.clear_cache()
>>> report.estimator_reports_[0]._cache
{}
- create_estimator_report(*, X_test=None, y_test=None, test_data=None)
Create an estimator report from the cross-validation report.
This method creates a new EstimatorReport with the same estimator and the same data as the cross-validation report. It is useful to evaluate and deploy a model that was deemed optimal with cross-validation. Provide a held-out test set to properly evaluate the performance of the model.
- Parameters:
- X_test : {array-like, sparse matrix} of shape (n_samples, n_features)
Testing data. It should have the same structure as the training data.
- y_test : array-like of shape (n_samples,) or (n_samples, n_outputs)
Testing target.
- test_data : dict or None
When estimator is a skrub SkrubLearner, bindings for variables contained in the DataOp that was used to create this learner (e.g. {"X": X_df, "other_table": df, ...}).
- Returns:
- report : EstimatorReport
The estimator report.
Examples
>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.linear_model import LogisticRegression
>>> from skore import train_test_split
>>> from skore import ComparisonReport, CrossValidationReport
>>> X, y = make_classification(random_state=42)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> linear_report = CrossValidationReport(
...     LogisticRegression(random_state=42), X_train, y_train
... )
>>> forest_report = CrossValidationReport(
...     RandomForestClassifier(random_state=42), X_train, y_train
... )
>>> comparison_report = ComparisonReport([linear_report, forest_report])
>>> summary = comparison_report.metrics.summarize().frame()
>>> # Notice that e.g. the RandomForestClassifier performs best
>>> final_report = forest_report.create_estimator_report(
...     X_test=X_test, y_test=y_test
... )
>>> final_report.metrics.summarize().frame()
- diagnose(*, ignore=None)
Run checks and return a diagnostic with detected issues.
Checks look for common modeling problems such as overfitting and underfitting. Check codes can be muted per-call via ignore or globally via configuration() with ignore_checks=....
- Parameters:
- ignore : list of str or tuple of str or None, default=None
Check codes to exclude from the results, e.g. ["SKD001"].
- Returns:
- DiagnosticDisplay
A display object with an HTML representation organized as three tabs (Issues, Tips, Passed). The full list of results is accessible via the frame() method.
Examples
>>> from skore import evaluate
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(random_state=42)
>>> report = evaluate(DummyClassifier(), X, y, splitter=0.2)
>>> report.diagnose()
Diagnostic: 1 issue(s), ...
Issues:
- [SKD002] Potential underfitting...
...
>>> report.diagnose(ignore=["SKD002"])
Diagnostic: 0 issue(s), ... 1 ignored.
...
- classmethod from_state(state)
Rebuild a report from get_state() output.
- get_predictions(*, data_source, response_method='predict')
Get estimator’s predictions.
This method reloads the predictions from the cache if they were already computed in a previous call.
- Parameters:
- data_source : {"test", "train"}, default="test"
The data source to use.
- "test" : use the test set provided when creating the report.
- "train" : use the train set provided when creating the report.
- response_method : {"predict", "predict_proba", "decision_function"}, default="predict"
The response method to use to get the predictions.
- Returns:
- list of np.ndarray of shape (n_samples,) or (n_samples, n_classes)
The predictions for each cross-validation split.
- Raises:
- ValueError
If the data source is invalid.
Examples
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = make_classification(random_state=42)
>>> estimator = LogisticRegression()
>>> from skore import CrossValidationReport
>>> report = CrossValidationReport(estimator, X=X, y=y, splitter=2)
>>> predictions = report.get_predictions(data_source="test")
>>> print([split_predictions.shape for split_predictions in predictions])
[(50,), (50,)]
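The cache reuse described above, where a (data_source, response_method) combination is computed once and served from memory afterwards, can be sketched with a plain dictionary. This is an illustrative pattern only, not skore's internals, and all names are hypothetical:

```python
class PredictionCache:
    """Illustrative per-(data_source, response_method) prediction cache."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self._cache = {}
        self.n_computations = 0  # counts actual (non-cached) computations

    def get_predictions(self, data_source="test", response_method="predict"):
        key = (data_source, response_method)
        if key not in self._cache:
            self.n_computations += 1
            self._cache[key] = self._predict_fn(data_source, response_method)
        return self._cache[key]

# A stand-in "predictor" that just records what was requested.
cache = PredictionCache(lambda ds, rm: f"{rm} on {ds}")
cache.get_predictions()                      # computed
cache.get_predictions()                      # served from cache
cache.get_predictions(data_source="train")   # computed for the new key
```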
- get_state()
Return a serializable representation of the report state.
This state is meant to ease serialization/deserialization of reports while preserving some backward compatibility across skore versions. In particular, this is more stable than pickling a report object directly, which can break when internal implementations change.
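The stability argument above can be made concrete with a small sketch. A versioned, plain-dict state lets a reader migrate old fields explicitly instead of depending on internal class layout the way a pickle does. This is a conceptual illustration in standard Python; the class, field names, and version scheme are assumptions, not skore's actual state format:

```python
import json

class Report:
    """Illustrative report with a versioned, JSON-serializable state."""

    STATE_VERSION = 2

    def __init__(self, estimator_name, n_splits):
        self.estimator_name = estimator_name
        self.n_splits = n_splits

    def get_state(self):
        # A plain dict of primitives: stable to serialize, easy to inspect.
        return {
            "version": self.STATE_VERSION,
            "estimator_name": self.estimator_name,
            "n_splits": self.n_splits,
        }

    @classmethod
    def from_state(cls, state):
        if state["version"] < 2:
            # Example migration hook for states written by older versions.
            state = {**state, "n_splits": state.get("n_splits", 5)}
        return cls(state["estimator_name"], state["n_splits"])

# Round-trip through JSON rather than pickle.
payload = json.dumps(Report("LogisticRegression", 2).get_state())
restored = Report.from_state(json.loads(payload))
```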