Automatic diagnostic#
skore provides automated checks for common model quality pitfalls.
Use diagnose() to run checks and get a diagnostic that
summarizes the findings. Findings come in two severities:
issues flag a concrete modeling problem to fix (e.g. overfitting);
tips do not signal a defect on their own but invite caution, typically on the interpretation of a result (e.g. feature importance).
Each finding has:
a short explanation,
a stable check code,
and a link to this page.
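For example, a minimal sketch assuming an EstimatorReport built from an estimator plus train and test data (the constructor arguments shown here follow skore's documented signature, which may vary across versions):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from skore import EstimatorReport

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

report = EstimatorReport(
    LogisticRegression(max_iter=1000),
    X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test,
)
diagnostic = report.diagnose()  # runs all applicable checks
print(diagnostic)               # findings with their explanation and check code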
Checks can be muted per call with ignore=...:
report.diagnose(ignore=["SKD001"])
You can also set a global ignore list with configuration.ignore_checks = ...:
from skore import configuration
configuration.ignore_checks = ["SKD001"]
For cross-validation reports, checks are run per split and then aggregated
at report level through CrossValidationReport.diagnose(). An issue is
reported only when it appears in a strict majority of evaluated splits.
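A minimal sketch, assuming the CrossValidationReport constructor takes the estimator followed by the full dataset (splitter-related keyword names differ between skore versions, so defaults are used here):
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from skore import CrossValidationReport

X, y = load_breast_cancer(return_X_y=True)
cv_report = CrossValidationReport(LogisticRegression(max_iter=1000), X, y)
# Checks run on each split, then findings that appear in a strict majority of
# splits are reported at the cross-validation level.
print(cv_report.diagnose())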
For comparison reports, ComparisonReport.diagnose() builds a global diagnostic
from each component report in the comparison. Issues are grouped by component
report and emitted as a single message.
SKD001 - Potential overfitting#
How it is detected#
skore compares train and test scores across the report’s default predictive metrics
(timing metrics are excluded). A metric votes for overfitting when the train-favored
gap exceeds an adaptive threshold:
higher-is-better metrics: train - test >= threshold
lower-is-better metrics: test - train >= threshold
The threshold adapts to the scale of the scores:
max(0.03, 0.10 * |reference|) where the reference is the train score for
higher-is-better metrics and the test score for lower-is-better metrics.
The floor of 0.03 prevents the threshold from vanishing on near-zero scores.
The check detects an issue when a strict majority of metrics vote for overfitting.
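A minimal sketch of this voting rule for two hypothetical metrics (illustrative only, not skore's actual implementation):
def votes_overfitting(train, test, greater_is_better):
    # Adaptive threshold: 10 % of the reference score, with a floor of 0.03.
    reference = train if greater_is_better else test
    threshold = max(0.03, 0.10 * abs(reference))
    gap = (train - test) if greater_is_better else (test - train)
    return gap >= threshold

# One vote per metric; the issue is raised on a strict majority of votes.
votes = [
    votes_overfitting(0.99, 0.82, greater_is_better=True),   # e.g. accuracy
    votes_overfitting(0.12, 0.35, greater_is_better=False),  # e.g. RMSE
]
overfitting_detected = sum(votes) > len(votes) / 2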
Why it matters#
A persistent train/test gap suggests the model has captured patterns specific to the training data and may generalize poorly.
How to reduce the risk#
simplify the model,
regularize more strongly,
improve feature engineering,
use better validation protocols or more data.
SKD002 - Potential underfitting#
How it is detected#
skore checks two conditions together across the report’s default predictive metrics.
A metric votes for underfitting when both hold:
Train and test scores are on par: the absolute difference is within
max(0.03, 0.05 * max(|train|, |test|)).
Neither score significantly outperforms a dummy baseline: a score is considered significantly better than the baseline only when it improves on the baseline by more than
max(0.01, 0.03 * |baseline|). The baseline is a DummyClassifier(strategy="prior") for classification and a DummyRegressor(strategy="mean") for regression.
The check detects an issue when a strict majority of comparable metrics (those present in both the estimator and dummy reports) vote for underfitting.
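An illustrative sketch of both conditions for a single higher-is-better metric such as accuracy (not skore's actual code; it assumes the significance margin is measured relative to the baseline score, as described above):
def votes_underfitting(train, test, baseline):
    # Condition 1: train and test scores are on par.
    on_par = abs(train - test) <= max(0.03, 0.05 * max(abs(train), abs(test)))
    # Condition 2: neither score significantly outperforms the dummy baseline.
    margin = max(0.01, 0.03 * abs(baseline))
    beats_baseline = max(train, test) > baseline + margin
    return on_par and not beats_baseline

# Accuracy barely above a DummyClassifier's prior baseline on both sets.
print(votes_underfitting(train=0.63, test=0.62, baseline=0.62))  # True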
Why it matters#
When model performance is close to a naive baseline, the model is likely too simple, under-trained, or using features that do not capture enough signal.
How to reduce the risk#
increase model capacity,
improve data representation and features,
tune hyperparameters,
collect richer data if possible.
SKD003 - Inconsistent performance across splits#
How it is detected#
This check applies only to CrossValidationReport.
skore examines each split’s test scores across the report’s default predictive metrics
(timing metrics are excluded). For every metric, a modified Z-score
based on the Median Absolute Deviation (MAD) is computed:
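A common definition of this statistic (assumed here; the exact constant used by skore's implementation may differ) is \(z_i = 0.6745 \, (x_i - \tilde{x}) / \mathrm{MAD}\), where \(x_i\) is the test score of split \(i\), \(\tilde{x}\) is the median score across splits, and \(\mathrm{MAD} = \mathrm{median}_i(|x_i - \tilde{x}|)\).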
It is a version of the Z-score that is more robust to extreme values and makes no assumption about the distribution of the data. A split is flagged as an outlier for a given metric when \(|z_i| > 3\), which is analogous to lying more than 3 standard deviations from the mean.
A split is considered inconsistent when a strict majority of metrics flag it as an outlier. The check reports an issue if at least one split is labeled inconsistent.
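As an illustration only (not skore's actual implementation), a minimal NumPy sketch of this outlier detection for a single metric, assuming the conventional 0.6745 consistency constant:
import numpy as np

def outlier_splits(scores, z_threshold=3.0):
    # MAD-based modified Z-score per split; robust to a single extreme split.
    scores = np.asarray(scores, dtype=float)
    median = np.median(scores)
    mad = np.median(np.abs(scores - median))
    if mad == 0:  # all splits score identically: nothing to flag
        return np.zeros_like(scores, dtype=bool)
    z = 0.6745 * (scores - median) / mad
    return np.abs(z) > z_threshold

# Five splits where the last one collapses: only that split is flagged.
print(outlier_splits([0.91, 0.90, 0.92, 0.89, 0.55]))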
Why it matters#
When one or more splits perform very differently from the rest, the cross-validation estimate becomes unreliable. The anomaly may reveal data leakage, uneven class distributions across splits, or a model that is sensitive to specific data splits.
How to reduce the risk#
use stratified or grouped cross-validation to ensure a more even split,
investigate whether the outlier split contains a different data distribution,
check for data leakage or temporal effects,
increase the size of the dataset to improve stability.
SKD004 - High class imbalance#
How it is detected#
This check applies only to binary classification tasks.
skore counts the occurrences of each class across the train and test sets. The check
detects an issue when the most frequent class represents more than 80 % of the
dataset.
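An illustrative sketch of the computation (not skore's actual code), counting classes over the combined train and test targets:
import numpy as np

y_train = np.array([0] * 85 + [1] * 15)
y_test = np.array([0] * 42 + [1] * 8)

y_all = np.concatenate([y_train, y_test])
_, counts = np.unique(y_all, return_counts=True)
majority_fraction = counts.max() / counts.sum()   # 127 / 150 ≈ 0.85
imbalance_detected = majority_fraction > 0.80     # True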
Why it matters#
When one class dominates the dataset, a model can achieve high accuracy simply by constantly predicting the majority class. Accuracy alone becomes a misleading performance indicator, and the model may fail to detect the minority class entirely.
How to reduce the risk#
use metrics that account for imbalance (precision, recall, F1, ROC AUC),
resample the dataset (oversampling the minority or undersampling the majority),
use class weights in the estimator,
collect more data for the minority class if possible.
SKD005 - Underrepresented classes#
How it is detected#
This check applies only to multiclass classification tasks.
skore counts the occurrences of each class across the train and test sets. The check
detects an issue when one or more classes each represent less than 10 % of the
dataset.
Why it matters#
When some classes are severely underrepresented, the model may never learn to distinguish them reliably. Overall accuracy can look acceptable while per-class performance on the rare classes remains poor.
How to reduce the risk#
use per-class metrics (precision, recall, F1 per class) to monitor all classes,
resample the dataset (oversampling rare classes or undersampling frequent ones),
use class weights in the estimator,
collect more data for the underrepresented classes if possible.
SKD006 - Coefficient interpretation#
How it is detected#
This check applies only to linear estimators that expose a coef_ attribute.
skore concatenates the train and test inputs and computes the per-feature
standard deviation. The check emits one of two tips depending on the result:
Features are on different scales (standard deviations are not close to each other): coefficient magnitudes are not directly comparable across features.
Features are on the same scale: coefficient magnitudes are comparable as feature importance, but they are no longer interpretable in the original feature units.
Why it matters#
The magnitude of a linear model’s coefficients depends on the scale of each input feature. When features live on different scales, comparing raw coefficients as a measure of feature importance is misleading: a large coefficient may only reflect a small-scale feature, not a strong effect. Standardizing the inputs puts all coefficients on a common footing and makes them directly comparable.
Conversely, when features have been standardized, coefficients express changes per standard deviation rather than per original unit. Statements like “an increase of 1 year in AGE means a decrease of 0.03 $/hour” lose their meaning because the natural units have been scaled away.
Read more about this in the scikit-learn documentation.
How to reduce the risk#
standardize the inputs (e.g. wrap the estimator in a pipeline with StandardScaler) to make coefficients comparable (see the sketch after this list),
when features are not standardized, multiply each coefficient by the feature's standard deviation to make them comparable,
otherwise, interpret coefficient magnitudes only relative to the feature’s own scale, or rely on scale-invariant feature-importance methods such as
PermutationImportanceDisplay.
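A hedged scikit-learn sketch of the first two options (illustrative data; not skore code):
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Option 1: standardize inside a pipeline; coefficients become comparable
# across features, expressed per standard deviation of each feature.
model = make_pipeline(StandardScaler(), Ridge()).fit(X, y)
scaled_coefs = model[-1].coef_

# Option 2: keep the original units and rescale the raw coefficients by each
# feature's standard deviation so their magnitudes can be compared.
raw_model = Ridge().fit(X, y)
rescaled_coefs = raw_model.coef_ * X.std(axis=0)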