Diagnostics#
skore diagnostics provide quick checks for common model quality pitfalls.
Use diagnose() to get concise findings about your model’s
quality.
Each finding has:
a short explanation,
a stable diagnostic code,
and a link to this page.
Diagnostics can be muted per call with ignore=...:
report.diagnose(ignore=["SKD001"])
You can also set a global ignore list with configuration.ignore_diagnostics = ...:
from skore import configuration
configuration.ignore_diagnostics = ["SKD001"]
For cross-validation reports, diagnostics are computed per split and then aggregated
at report level, through skore.CrossValidationReport.diagnose. A diagnostic is
reported as an issue only when it appears in a strict majority of evaluated splits.
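The strict-majority rule can be sketched in a few lines of plain Python. This is only an illustration of the aggregation described above, not skore's implementation: the function name aggregate_split_diagnostics and the input shape (one set of diagnostic codes per split) are assumptions made for the example.

```python
from collections import Counter


def aggregate_split_diagnostics(per_split_codes):
    """Return the codes raised in a strict majority of splits.

    `per_split_codes` is a hypothetical input shape: one set of
    diagnostic codes per evaluated split.
    """
    n_splits = len(per_split_codes)
    # Count each code at most once per split.
    counts = Counter(code for codes in per_split_codes for code in set(codes))
    # Strict majority: more than half of the splits, so an exact tie is not enough.
    return sorted(code for code, n in counts.items() if n > n_splits / 2)


# Five splits: SKD001 appears in 3 of them, SKD002 in only 2.
splits = [{"SKD001"}, {"SKD001", "SKD002"}, {"SKD001"}, {"SKD002"}, set()]
reported = aggregate_split_diagnostics(splits)
print(reported)  # ['SKD001']
```

With an even number of splits, a code found in exactly half of them is not reported, because the majority must be strict.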
For comparison reports, skore.ComparisonReport.diagnose builds a global diagnostic
from each component report in the comparison. Diagnostics are grouped by component
report and emitted as a single message.
SKD001 - Potential overfitting#
How it is detected#
skore compares train and test scores across the report’s default predictive metrics
(timing metrics are excluded). A metric votes for overfitting when the train-favored
gap exceeds an adaptive threshold:
higher-is-better metrics: train - test >= threshold
lower-is-better metrics: test - train >= threshold
The threshold adapts to the scale of the scores:
max(0.03, 0.10 * |reference|) where the reference is the train score for
higher-is-better metrics and the test score for lower-is-better metrics.
The floor of 0.03 prevents the threshold from vanishing on near-zero scores.
The diagnostic is raised when a strict majority of metrics vote for overfitting.
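As a rough illustration of the rule above, the following sketch re-implements the vote in plain Python. It does not call skore; the function names and the (train, test, higher_is_better) tuple layout are hypothetical, and the scores are made-up numbers chosen to show both outcomes.

```python
def overfit_threshold(train, test, higher_is_better):
    """Adaptive threshold: max(0.03, 0.10 * |reference|)."""
    # Reference is the train score for higher-is-better metrics,
    # the test score for lower-is-better metrics.
    reference = train if higher_is_better else test
    return max(0.03, 0.10 * abs(reference))


def votes_for_overfitting(train, test, higher_is_better):
    """True when the train-favored gap reaches the adaptive threshold."""
    gap = (train - test) if higher_is_better else (test - train)
    return gap >= overfit_threshold(train, test, higher_is_better)


def is_overfitting(metrics):
    """Raise the diagnostic when a strict majority of metrics vote for it."""
    votes = [votes_for_overfitting(*values) for values in metrics.values()]
    return sum(votes) > len(votes) / 2


# Illustrative scores only: metric -> (train, test, higher_is_better)
scores = {
    "accuracy": (0.99, 0.80, True),   # gap 0.19 >= max(0.03, 0.099): votes
    "roc_auc": (0.95, 0.93, True),    # gap 0.02 <  max(0.03, 0.095): no vote
    "log_loss": (0.05, 0.60, False),  # gap 0.55 >= max(0.03, 0.060): votes
}
print(is_overfitting(scores))  # True (2 of 3 metrics vote)
```

The 0.03 floor is what decides the outcome for small scores: with train and test scores near 0.1, the relative term would be about 0.01, so the threshold stays at 0.03.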
Why it matters#
A persistent train/test gap suggests the model has captured patterns specific to the training data and may generalize poorly.
How to reduce the risk#
simplify the model,
regularize more strongly,
improve feature engineering,
use better validation protocols or more data.
SKD002 - Potential underfitting#
How it is detected#
skore checks two conditions together across the report’s default predictive metrics.
A metric votes for underfitting when both hold:
Train and test scores are on par: the absolute difference is within max(0.03, 0.05 * max(|train|, |test|)).
Neither score significantly outperforms a dummy baseline: a score is considered significantly better than the baseline only when it exceeds max(0.01, 0.03 * |baseline|). The baseline is a DummyClassifier(strategy="prior") for classification and a DummyRegressor(strategy="mean") for regression.
The diagnostic is raised when a strict majority of comparable metrics (those present in both the estimator and dummy reports) vote for underfitting.
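The two conditions can be sketched in plain Python as follows. This is a simplified reading, not skore's code: it assumes that "significantly better" means the improvement over the baseline, measured in the metric's favorable direction, exceeds max(0.01, 0.03 * |baseline|), and that a single baseline value per metric is compared against both the train and the test score. All function names and numbers are hypothetical.

```python
def on_par(train, test):
    """Train and test differ by at most max(0.03, 0.05 * max(|train|, |test|))."""
    return abs(train - test) <= max(0.03, 0.05 * max(abs(train), abs(test)))


def beats_baseline(score, baseline, higher_is_better):
    """Assumed reading: improvement over the baseline exceeds the margin."""
    improvement = (score - baseline) if higher_is_better else (baseline - score)
    return improvement > max(0.01, 0.03 * abs(baseline))


def votes_for_underfitting(train, test, baseline, higher_is_better):
    """Both conditions must hold: scores on par, and neither beats the baseline."""
    neither_beats = not (
        beats_baseline(train, baseline, higher_is_better)
        or beats_baseline(test, baseline, higher_is_better)
    )
    return on_par(train, test) and neither_beats


def is_underfitting(metrics):
    """Raise the diagnostic when a strict majority of metrics vote for it."""
    votes = [votes_for_underfitting(*values) for values in metrics.values()]
    return sum(votes) > len(votes) / 2


# Illustrative scores only: metric -> (train, test, baseline, higher_is_better)
scores = {
    "accuracy": (0.61, 0.60, 0.60, True),   # on par, gain 0.01 <= 0.018: votes
    "log_loss": (0.68, 0.69, 0.69, False),  # on par, gain 0.01 <= 0.0207: votes
    "roc_auc": (0.75, 0.74, 0.50, True),    # on par, but gain 0.25 > 0.015: no vote
}
print(is_underfitting(scores))  # True (2 of 3 metrics vote)
```

Only metrics available for both the estimator and the dummy model would enter the dictionary, which mirrors the "comparable metrics" restriction above.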
Why it matters#
When model performance is close to a naive baseline, the model is likely too simple, under-trained, or using features that do not capture enough signal.
How to reduce the risk#
increase model capacity,
improve data representation and features,
tune hyperparameters,
collect richer data if possible.