train_test_split: get diagnostics when splitting your data
This example illustrates the motivation and the use of skore's
skore.train_test_split() to get assistance when developing ML/DS projects.
Train-test split in scikit-learn
Scikit-learn has a function for splitting the data into train and test
sets: sklearn.model_selection.train_test_split().
Its signature is the following:
sklearn.model_selection.train_test_split(
*arrays,
test_size=None,
train_size=None,
random_state=None,
shuffle=True,
stratify=None
)
where *arrays is a Python *args construct (it allows passing a variable number of
positional arguments), which the scikit-learn documentation describes as a sequence of
indexables with same length / shape[0].
Let us construct a design matrix X and target y to illustrate our point:
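Assuming numpy, these arrays can be built as follows (a sketch matching the values
printed below):

```python
import numpy as np

# design matrix with 5 samples and 2 features, and a matching target
X = np.arange(10).reshape((5, 2))
y = np.arange(5)
```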
X = array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
y = array([0, 1, 2, 3, 4])
In scikit-learn, the most common usage is the following:
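For instance, unpacking the four splits in the conventional order (random_state=0 is an
assumption made here so the example is reproducible):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# arrays are passed positionally; 20% of the samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```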
X_train = array([[0, 1],
[2, 3],
[6, 7],
[8, 9]])
y_train = array([0, 1, 3, 4])
X_test = array([[4, 5]])
y_test = array([2])
Notice the shuffling that is done by default.
In scikit-learn, the user cannot pass the design matrix X and the target y as
explicit keyword arguments. The following:
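For illustration, a call of this kind (wrapped in try/except only so that the snippet
runs to completion):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

try:
    # X and y are not accepted as keyword arguments by scikit-learn
    train_test_split(X=X, y=y, test_size=0.2, random_state=0)
except TypeError as exc:
    print(exc)
```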
would return:
TypeError: got an unexpected keyword argument 'X'
In general, in Python, keyword arguments are useful to prevent typos. For example,
in the following, X and y are reversed:
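A sketch of such a typo, where the two arrays are swapped in the call:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# typo: y and X are passed in the wrong order, so the variable names no
# longer match the contents of the returned splits
X_train, X_test, y_train, y_test = train_test_split(y, X, test_size=0.2, random_state=0)
```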
X_train = array([0, 1, 3, 4])
y_train = array([[0, 1],
[2, 3],
[6, 7],
[8, 9]])
X_test = array([2])
y_test = array([[4, 5]])
but Python will not catch this mistake for us. This is where skore comes in handy.
Train-test split in skore
Skore has its own skore.train_test_split() that wraps scikit-learn's
sklearn.model_selection.train_test_split().
Making the positional arguments for X and y explicit
First of all, it can naturally be used as a simple drop-in replacement for scikit-learn:
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Note
The outputs of skore.train_test_split() are intentionally exactly the same as
those of sklearn.model_selection.train_test_split(), so the user can use the
skore version as a drop-in replacement for scikit-learn.
Contrary to scikit-learn, skore allows users to pass X and y explicitly, making it
easier to detect potential issues:
X_train, X_test, y_train, y_test = skore.train_test_split(
X=X, y=y, test_size=0.2, random_state=0
)
X_train_explicit = X_train.copy()
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Moreover, when passing X and y explicitly, the X's are always returned before the
y's, even when they are passed in reverse order:
arr = X.copy()
arr_train, arr_test, X_train, X_test, y_train, y_test = skore.train_test_split(
arr, y=y, X=X, test_size=0.2, random_state=0
)
X_train_explicit_inverted = X_train.copy()
print("When expliciting, with the small typo, are the `X_train`'s still the same?")
print(np.allclose(X_train_explicit, X_train_explicit_inverted))
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
When expliciting, with the small typo, are the `X_train`'s still the same?
True
Returning a dictionary instead of positional arguments
The default behaviour of outputting a tuple of arrays can be cumbersome and
error-prone, in particular when passing them to an EstimatorReport.
The new as_dict parameter makes the output a dictionary, which makes this simpler:
from sklearn.linear_model import LogisticRegression
from skore import EstimatorReport
split_data = skore.train_test_split(X=X, y=y, random_state=42, as_dict=True)
split_data.keys()
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
dict_keys(['X_train', 'X_test', 'y_train', 'y_test'])
estimator = LogisticRegression(random_state=42)
estimator_report = EstimatorReport(estimator, **split_data)
Without the dictionary output, this would be written:
Automatic diagnostics: raising methodological warnings
In this section, we show how skore can provide methodological checks.
Class imbalance
In machine learning, class imbalance (when the classes in a dataset are not equally
represented) requires specific modelling choices.
For example, in a dataset with 95% majority class (class 1) and 5% minority class
(class 0), a dummy model that always predicts class 1 will have a 95%
accuracy, while it would be useless for identifying examples of class 0.
Hence, it is important to detect when we have class imbalance.
Suppose that we have imbalanced data:
In that case, skore.train_test_split() raises a HighClassImbalanceWarning:
╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
│ It seems that you have a classification problem with a high class imbalance. In this │
│ case, using train_test_split may not be a good idea because of high variability in   │
│ the scores obtained on the test set. To tackle this challenge we suggest to use      │
│ skore's CrossValidationReport with the `splitter` parameter of your choice.          │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Hence, skore alerts users to this class imbalance, which they might otherwise have missed, so that they can take it into account in their modelling strategy.
Moreover, skore also detects class imbalance where a class has too few examples,
raising a HighClassImbalanceTooFewExamplesWarning:
╭────────────────────── HighClassImbalanceTooFewExamplesWarning ───────────────────────╮
│ It seems that you have a classification problem with at least one class with fewer   │
│ than 100 examples in the test set. In this case, using train_test_split may not be a │
│ good idea because of high variability in the scores obtained on the test set. We     │
│ suggest three options to tackle this challenge: you can increase test_size, collect  │
│ more data, or use skore's CrossValidationReport with the `splitter` parameter of     │
│ your choice.                                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
│ It seems that you have a classification problem with a high class imbalance. In this │
│ case, using train_test_split may not be a good idea because of high variability in   │
│ the scores obtained on the test set. To tackle this challenge we suggest to use      │
│ skore's CrossValidationReport with the `splitter` parameter of your choice.          │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Shuffling without a random state
For reproducible results across executions,
skore recommends setting the random_state parameter when shuffling
(remember that shuffle=True by default); otherwise, it raises a RandomStateUnsetWarning:
╭────────────────────────────── RandomStateUnsetWarning ───────────────────────────────╮
│ We recommend setting the parameter `random_state`. This will ensure the              │
│ reproducibility of your work.                                                        │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Time series data
Now, let us assume that we have time series data, i.e. the data is time-ordered:
import pandas as pd
from skrub.datasets import fetch_employee_salaries
dataset = fetch_employee_salaries()
X, y = dataset.X, dataset.y
X["date_first_hired"] = pd.to_datetime(X["date_first_hired"], format="%m/%d/%Y")
X.head(2)
We can observe that there is a date_first_hired column which is time-based.
As one cannot shuffle time (time only moves in one direction: forward), skore
recommends using sklearn.model_selection.TimeSeriesSplit instead of
sklearn.model_selection.train_test_split() (or skore.train_test_split()),
and raises a TimeBasedColumnWarning:
╭─────────────────────────────── TimeBasedColumnWarning ───────────────────────────────╮
│ We detected some time-based columns (column "date_first_hired") in your data. We     │
│ recommend using scikit-learn's TimeSeriesSplit instead of train_test_split.          │
│ Otherwise you might train on future data to predict the past, or get inflated model  │
│ performance evaluation because natural drift will not be taken into account.         │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Total running time of the script: (0 minutes 0.100 seconds)