train_test_split: get diagnostics when splitting your data#

This example illustrates the motivation for and use of skore's skore.train_test_split() to get assistance when developing ML/DS projects.

Train-test split in scikit-learn#

Scikit-learn has a function for splitting the data into train and test sets: sklearn.model_selection.train_test_split(). Its signature is the following:

sklearn.model_selection.train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None
)

where *arrays is a Python *args parameter (it accepts a varying number of positional arguments); the scikit-learn documentation describes it as a sequence of indexables with the same length / shape[0].

Let us construct a design matrix X and target y to illustrate our point:

import numpy as np

X = np.arange(10).reshape((5, 2))
y = np.arange(5)
print(f"{X = }\n{y = }")
X = array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
y = array([0, 1, 2, 3, 4])
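Because *arrays is variadic, more than two indexables can be split in a single call; each input yields a (train, test) pair, returned in input order. A minimal sketch (the sample_weights array is illustrative, not part of the example above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = np.arange(5)
sample_weights = np.linspace(0.1, 0.5, 5)  # illustrative third array

# 3 inputs -> 6 outputs: (train, test) for each array, in input order
splits = train_test_split(X, y, sample_weights, test_size=0.4, random_state=0)
assert len(splits) == 6
```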

In scikit-learn, the most common usage is the following:

from sklearn.model_selection import train_test_split as sklearn_train_test_split

X_train, X_test, y_train, y_test = sklearn_train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(f"{X_train = }\n{y_train = }\n{X_test = }\n{y_test = }")
X_train = array([[0, 1],
       [2, 3],
       [6, 7],
       [8, 9]])
y_train = array([0, 1, 3, 4])
X_test = array([[4, 5]])
y_test = array([2])

Notice that shuffling is applied by default.
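To see the effect of the default shuffling, compare with shuffle=False, where the test set is simply the tail of the data; a quick sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# without shuffling, the last 20% of rows become the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
assert list(y_train) == [0, 1, 2, 3] and list(y_test) == [4]
```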

In scikit-learn, the user cannot pass the design matrix X and the target y as keyword arguments. The following:

X_train, X_test, y_train, y_test = sklearn_train_test_split(
    X=X, y=y, test_size=0.2, random_state=0)

would raise:

TypeError: got an unexpected keyword argument 'X'

In general, in Python, keyword arguments are useful for preventing typos and argument mix-ups. For example, in the following call, X and y are swapped:

X_train, X_test, y_train, y_test = sklearn_train_test_split(
    y, X, test_size=0.2, random_state=0
)
print(f"{X_train = }\n{y_train = }\n{X_test = }\n{y_test = }")
X_train = array([0, 1, 3, 4])
y_train = array([[0, 1],
       [2, 3],
       [6, 7],
       [8, 9]])
X_test = array([2])
y_test = array([[4, 5]])

but Python will not catch this mistake for us. This is where skore comes in handy.
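Short of that, one manual defence is to assert the expected shapes after splitting; a sketch of such a sanity check:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))  # 2-D design matrix
y = np.arange(5)                   # 1-D target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# a swapped call (y before X) would make these checks fail
assert X_train.ndim == 2 and y_train.ndim == 1
```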

Train-test split in skore#

Skore has its own skore.train_test_split() that wraps scikit-learn's sklearn.model_selection.train_test_split().

X = np.arange(10_000).reshape((5_000, 2))
y = [0] * 2_500 + [1] * 2_500

Making the positional arguments for X and y explicit#

First of all, naturally, it can be used as a simple drop-in replacement for scikit-learn:

import skore

X_train, X_test, y_train, y_test = skore.train_test_split(
    X, y, test_size=0.2, random_state=0
)
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ ShuffleTrueWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ We detected that the `shuffle` parameter is set to `True` either explicitly or from  โ”‚
โ”‚ its default value. In case of time-ordered events (even if they are independent),    โ”‚
โ”‚ this will result in inflated model performance evaluation because natural drift will โ”‚
โ”‚ not be taken into account. We recommend setting the shuffle parameter to `False` in  โ”‚
โ”‚ order to ensure the evaluation process is really representative of your production   โ”‚
โ”‚ release process.                                                                     โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Note

The outputs of skore.train_test_split() are intentionally exactly the same as sklearn.model_selection.train_test_split(), so the user can just use the skore version as a drop-in replacement of scikit-learn.

Contrary to scikit-learn, skore allows users to pass X and y explicitly as keyword arguments, making it easier to detect potential issues:

X_train, X_test, y_train, y_test = skore.train_test_split(
    X=X, y=y, test_size=0.2, random_state=0
)
X_train_explicit = X_train.copy()
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ ShuffleTrueWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ We detected that the `shuffle` parameter is set to `True` either explicitly or from  โ”‚
โ”‚ its default value. In case of time-ordered events (even if they are independent),    โ”‚
โ”‚ this will result in inflated model performance evaluation because natural drift will โ”‚
โ”‚ not be taken into account. We recommend setting the shuffle parameter to `False` in  โ”‚
โ”‚ order to ensure the evaluation process is really representative of your production   โ”‚
โ”‚ release process.                                                                     โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Moreover, when passing X and y explicitly, the X's are always returned before the y's, even when y is passed before X in the call:

arr = X.copy()
arr_train, arr_test, X_train, X_test, y_train, y_test = skore.train_test_split(
    arr, y=y, X=X, test_size=0.2, random_state=0
)
X_train_explicit_inverted = X_train.copy()

print("When expliciting, with the small typo, are the `X_train`'s still the same?")
print(np.allclose(X_train_explicit, X_train_explicit_inverted))
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ ShuffleTrueWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ We detected that the `shuffle` parameter is set to `True` either explicitly or from  โ”‚
โ”‚ its default value. In case of time-ordered events (even if they are independent),    โ”‚
โ”‚ this will result in inflated model performance evaluation because natural drift will โ”‚
โ”‚ not be taken into account. We recommend setting the shuffle parameter to `False` in  โ”‚
โ”‚ order to ensure the evaluation process is really representative of your production   โ”‚
โ”‚ release process.                                                                     โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
When expliciting, with the small typo, are the `X_train`'s still the same?
True

Returning a dictionary instead of positional arguments#

The default behaviour of returning a tuple of arrays can be cumbersome and error-prone, in particular when passing the results to an EstimatorReport. The as_dict parameter makes the output a dictionary instead, which simplifies this:

from sklearn.linear_model import LogisticRegression
from skore import EstimatorReport

split_data = skore.train_test_split(X=X, y=y, random_state=42, as_dict=True)
split_data.keys()
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ ShuffleTrueWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ We detected that the `shuffle` parameter is set to `True` either explicitly or from  โ”‚
โ”‚ its default value. In case of time-ordered events (even if they are independent),    โ”‚
โ”‚ this will result in inflated model performance evaluation because natural drift will โ”‚
โ”‚ not be taken into account. We recommend setting the shuffle parameter to `False` in  โ”‚
โ”‚ order to ensure the evaluation process is really representative of your production   โ”‚
โ”‚ release process.                                                                     โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

dict_keys(['X_train', 'X_test', 'y_train', 'y_test'])

Without the dictionary output, each array would have to be unpacked and passed individually.
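As a self-contained sketch using scikit-learn only, the two calling styles can be compared; the evaluate helper below is a hypothetical stand-in for EstimatorReport, which accepts the same X_train/X_test/y_train/y_test keywords:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.arange(10_000).reshape((5_000, 2))
y = np.array([0] * 2_500 + [1] * 2_500)

# build the same dictionary that as_dict=True returns
keys = ("X_train", "X_test", "y_train", "y_test")
split_data = dict(zip(keys, train_test_split(X, y, random_state=42)))

def evaluate(estimator, *, X_train, X_test, y_train, y_test):
    # hypothetical stand-in for EstimatorReport: fit on train, score on test
    estimator.fit(X_train, y_train)
    return estimator.score(X_test, y_test)

# compact: one dictionary unpacked in a single call
accuracy = evaluate(LogisticRegression(), **split_data)

# verbose: every array passed individually
accuracy_verbose = evaluate(
    LogisticRegression(),
    X_train=split_data["X_train"],
    X_test=split_data["X_test"],
    y_train=split_data["y_train"],
    y_test=split_data["y_test"],
)
assert accuracy == accuracy_verbose
```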

Automatic diagnostics: raising methodological warnings#

In this section, we show how skore can provide methodological checks.

Class imbalance#

In machine learning, class imbalance (when the classes in a dataset are not equally represented) requires specific modelling choices. For example, in a dataset with a 95% majority class (class 1) and a 5% minority class (class 0), a dummy model that always predicts class 1 achieves 95% accuracy while being useless for identifying examples of class 0. Hence, it is important to detect class imbalance.
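The 95% figure can be checked directly with scikit-learn's DummyClassifier; a small sketch:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# 95% majority class (1), 5% minority class (0)
y = np.array([1] * 950 + [0] * 50)
X = np.zeros((1000, 1))  # features carry no information at all

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
# high accuracy, yet the model never predicts class 0
assert abs(dummy.score(X, y) - 0.95) < 1e-12
```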

Suppose that we have imbalanced data:

X = np.arange(10_000).reshape((5_000, 2))
y = [0] * 4_000 + [1] * 1_000

In that case, skore.train_test_split() raises a HighClassImbalanceWarning:

X_train, X_test, y_train, y_test = skore.train_test_split(
    X=X, y=y, test_size=0.2, random_state=0
)
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ HighClassImbalanceWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ It seems that you have a classification problem with a high class imbalance. In this โ”‚
โ”‚ case, using train_test_split may not be a good idea because of high variability in   โ”‚
โ”‚ the scores obtained on the test set. To tackle this challenge we suggest to use      โ”‚
โ”‚ skore's CrossValidationReport with the `splitter` parameter of your choice.          โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ ShuffleTrueWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ We detected that the `shuffle` parameter is set to `True` either explicitly or from  โ”‚
โ”‚ its default value. In case of time-ordered events (even if they are independent),    โ”‚
โ”‚ this will result in inflated model performance evaluation because natural drift will โ”‚
โ”‚ not be taken into account. We recommend setting the shuffle parameter to `False` in  โ”‚
โ”‚ order to ensure the evaluation process is really representative of your production   โ”‚
โ”‚ release process.                                                                     โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Hence, skore alerts users to this class imbalance, which they might otherwise have missed, so that they can take it into account in their modelling strategy.

Moreover, when a class has too few examples, skore raises a HighClassImbalanceTooFewExamplesWarning:

X = np.arange(400).reshape((200, 2))
y = [0] * 150 + [1] * 50

X_train, X_test, y_train, y_test = skore.train_test_split(
    X=X, y=y, test_size=0.2, random_state=0
)
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ HighClassImbalanceTooFewExamplesWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ It seems that you have a classification problem with at least one class with fewer   โ”‚
โ”‚ than 100 examples in the test set. In this case, using train_test_split may not be a โ”‚
โ”‚ good idea because of high variability in the scores obtained on the test set. We     โ”‚
โ”‚ suggest three options to tackle this challenge: you can increase test_size, collect  โ”‚
โ”‚ more data, or use skore's CrossValidationReport with the `splitter` parameter of     โ”‚
โ”‚ your choice.                                                                         โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ HighClassImbalanceWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ It seems that you have a classification problem with a high class imbalance. In this โ”‚
โ”‚ case, using train_test_split may not be a good idea because of high variability in   โ”‚
โ”‚ the scores obtained on the test set. To tackle this challenge we suggest to use      โ”‚
โ”‚ skore's CrossValidationReport with the `splitter` parameter of your choice.          โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ ShuffleTrueWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ We detected that the `shuffle` parameter is set to `True` either explicitly or from  โ”‚
โ”‚ its default value. In case of time-ordered events (even if they are independent),    โ”‚
โ”‚ this will result in inflated model performance evaluation because natural drift will โ”‚
โ”‚ not be taken into account. We recommend setting the shuffle parameter to `False` in  โ”‚
โ”‚ order to ensure the evaluation process is really representative of your production   โ”‚
โ”‚ release process.                                                                     โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

Shuffling without a random state#

For reproducible results across executions, skore recommends setting the random_state parameter when shuffling (recall that shuffle=True by default), and raises a RandomStateUnsetWarning otherwise:

X = np.arange(10_000).reshape((5_000, 2))
y = [0] * 2_500 + [1] * 2_500

X_train, X_test, y_train, y_test = skore.train_test_split(X=X, y=y, test_size=0.2)
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ RandomStateUnsetWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ We recommend setting the parameter `random_state`. This will ensure the              โ”‚
โ”‚ reproducibility of your work.                                                        โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ ShuffleTrueWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ We detected that the `shuffle` parameter is set to `True` either explicitly or from  โ”‚
โ”‚ its default value. In case of time-ordered events (even if they are independent),    โ”‚
โ”‚ this will result in inflated model performance evaluation because natural drift will โ”‚
โ”‚ not be taken into account. We recommend setting the shuffle parameter to `False` in  โ”‚
โ”‚ order to ensure the evaluation process is really representative of your production   โ”‚
โ”‚ release process.                                                                     โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
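The effect of a fixed random_state can be verified directly with scikit-learn; a sketch showing that the same seed yields identical splits across calls:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape((50, 2))
y = np.arange(50)

# identical seeds produce identical splits across calls
first = train_test_split(X, y, test_size=0.2, random_state=0)
second = train_test_split(X, y, test_size=0.2, random_state=0)
assert all(np.array_equal(a, b) for a, b in zip(first, second))
```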

Time series data#

Now, let us assume that we have time series data, i.e. the data is time-ordered:

import pandas as pd
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
X, y = dataset.X, dataset.y
X["date_first_hired"] = pd.to_datetime(X["date_first_hired"], format="%m/%d/%Y")
X.head(2)
gender department department_name division assignment_category employee_position_title date_first_hired year_first_hired
0 F POL Department of Police MSB Information Mgmt and Tech Division Records... Fulltime-Regular Office Services Coordinator 1986-09-22 1986
1 M POL Department of Police ISB Major Crimes Division Fugitive Section Fulltime-Regular Master Police Officer 1988-09-12 1988


We can observe that there is a date_first_hired column, which is time-based.

As one cannot shuffle time (time only moves in one direction: forward), we recommend using sklearn.model_selection.TimeSeriesSplit instead of sklearn.model_selection.train_test_split() (or skore.train_test_split()); skore raises a TimeBasedColumnWarning:

X_train, X_test, y_train, y_test = skore.train_test_split(
    X, y, random_state=0, shuffle=False
)
โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ TimeBasedColumnWarning โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ We detected some time-based columns (column "date_first_hired") in your data. We     โ”‚
โ”‚ recommend using scikit-learn's TimeSeriesSplit instead of train_test_split.          โ”‚
โ”‚ Otherwise you might train on future data to predict the past, or get inflated model  โ”‚
โ”‚ performance evaluation because natural drift will not be taken into account.         โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
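A minimal sketch of TimeSeriesSplit on toy time-ordered data: in every fold, the training indices strictly precede the test indices, so no future information leaks into training:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# toy time-ordered data: row i corresponds to time step i
X = np.arange(20).reshape((10, 2))
y = np.arange(10)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # all training indices come strictly before all test indices
    assert train_idx.max() < test_idx.min()
```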
