train_test_split: get diagnostics when splitting your data
This example illustrates the motivation and the use of skore's
skore.train_test_split() to get assistance when developing ML/DS projects.
Train-test split in scikit-learn
Scikit-learn has a function for splitting the data into train and test
sets: sklearn.model_selection.train_test_split().
Its signature is the following:
sklearn.model_selection.train_test_split(
*arrays,
test_size=None,
train_size=None,
random_state=None,
shuffle=True,
stratify=None
)
where *arrays is a Python *args construct (it allows passing a variable number of
positional arguments), which the scikit-learn documentation describes as a sequence of
indexables with same length / shape[0].
Let us construct a design matrix X and target y to illustrate our point:
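Assuming numpy, these arrays can be built as follows (a sketch matching the values
printed below):

```python
import numpy as np

# design matrix with 5 samples and 2 features, and a matching target
X = np.arange(10).reshape((5, 2))
y = np.arange(5)
```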
X = array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
y = array([0, 1, 2, 3, 4])
In scikit-learn, the most common usage is the following:
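For instance, unpacking the four splits in the conventional order (random_state=0 is an
assumption made here so the example is reproducible):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# arrays are passed positionally; 20% of the samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```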
X_train = array([[0, 1],
[2, 3],
[6, 7],
[8, 9]])
y_train = array([0, 1, 3, 4])
X_test = array([[4, 5]])
y_test = array([2])
Notice the shuffling that is done by default.
In scikit-learn, the user cannot pass the design matrix X and the target y as
explicit keyword arguments. The following:
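For illustration, a call of this kind (wrapped in try/except only so that the snippet
runs to completion):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

try:
    # X and y are not accepted as keyword arguments by scikit-learn
    train_test_split(X=X, y=y, test_size=0.2, random_state=0)
except TypeError as exc:
    print(exc)
```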
would return:
TypeError: got an unexpected keyword argument 'X'
In general, in Python, keyword arguments are useful to prevent typos. For example,
in the following, X and y are reversed:
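A sketch of such a typo, where the two arrays are swapped in the call:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = np.arange(5)

# typo: y and X are passed in the wrong order, so the variable names no
# longer match the contents of the returned splits
X_train, X_test, y_train, y_test = train_test_split(y, X, test_size=0.2, random_state=0)
```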
X_train = array([0, 1, 3, 4])
y_train = array([[0, 1],
[2, 3],
[6, 7],
[8, 9]])
X_test = array([2])
y_test = array([[4, 5]])
but Python will not catch this mistake for us. This is where skore comes in handy.
Train-test split in skore
Skore has its own skore.train_test_split() that wraps scikit-learn's
sklearn.model_selection.train_test_split().
Making the positional arguments for X and y explicit
First of all, it can naturally be used as a simple drop-in replacement for scikit-learn:
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Note
The outputs of skore.train_test_split() are intentionally exactly the same as
those of sklearn.model_selection.train_test_split(), so the user can use the
skore version as a drop-in replacement for scikit-learn.
Contrary to scikit-learn, skore allows users to pass X and y explicitly, making it
easier to detect potential issues:
X_train, X_test, y_train, y_test = skore.train_test_split(
X=X, y=y, test_size=0.2, random_state=0
)
X_train_explicit = X_train.copy()
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Moreover, when passing X and y explicitly, the X's are always returned before the
y's, even when they are passed in reverse order:
arr = X.copy()
arr_train, arr_test, X_train, X_test, y_train, y_test = skore.train_test_split(
arr, y=y, X=X, test_size=0.2, random_state=0
)
X_train_explicit_inverted = X_train.copy()
print("When expliciting, with the small typo, are the `X_train`'s still the same?")
print(np.allclose(X_train_explicit, X_train_explicit_inverted))
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
When expliciting, with the small typo, are the `X_train`'s still the same?
True
Returning a dictionary instead of positional arguments
The default behaviour of outputting a tuple of arrays can be cumbersome and
error-prone, in particular when passing them to an EstimatorReport.
The new as_dict parameter makes the output a dictionary, which makes this simpler:
from sklearn.linear_model import LogisticRegression
from skore import EstimatorReport
split_data = skore.train_test_split(X=X, y=y, random_state=42, as_dict=True)
split_data.keys()
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
dict_keys(['X_train', 'X_test', 'y_train', 'y_test'])
estimator = LogisticRegression(random_state=42)
estimator_report = EstimatorReport(estimator, **split_data)
Without the dictionary output, this would be written:
Automatic diagnostics: raising methodological warnings
In this section, we show how skore can provide methodological checks.
Class imbalance
In machine learning, class imbalance (when the classes in a dataset are not equally
represented) requires specific modelling choices.
For example, in a dataset with 95% majority class (class 1) and 5% minority class
(class 0), a dummy model that always predicts class 1 will have a 95%
accuracy, while it would be useless for identifying examples of class 0.
Hence, it is important to detect when we have class imbalance.
Suppose that we have imbalanced data:
In that case, skore.train_test_split() raises a HighClassImbalanceWarning:
╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
│ It seems that you have a classification problem with a high class imbalance. In this │
│ case, using train_test_split may not be a good idea because of high variability in   │
│ the scores obtained on the test set. To tackle this challenge we suggest to use      │
│ skore's CrossValidationReport with the `splitter` parameter of your choice.          │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Hence, skore alerts users to this class imbalance, which they might otherwise have missed, so that they can take it into account in their modelling strategy.
Moreover, skore also detects class imbalance where a class has too few examples,
raising a HighClassImbalanceTooFewExamplesWarning:
╭────────────────────── HighClassImbalanceTooFewExamplesWarning ───────────────────────╮
│ It seems that you have a classification problem with at least one class with fewer   │
│ than 100 examples in the test set. In this case, using train_test_split may not be a │
│ good idea because of high variability in the scores obtained on the test set. We     │
│ suggest three options to tackle this challenge: you can increase test_size, collect  │
│ more data, or use skore's CrossValidationReport with the `splitter` parameter of     │
│ your choice.                                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────── HighClassImbalanceWarning ──────────────────────────────╮
│ It seems that you have a classification problem with a high class imbalance. In this │
│ case, using train_test_split may not be a good idea because of high variability in   │
│ the scores obtained on the test set. To tackle this challenge we suggest to use      │
│ skore's CrossValidationReport with the `splitter` parameter of your choice.          │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Shuffling without a random state
For reproducible results across executions,
skore recommends setting the random_state parameter when shuffling
(remember that shuffle=True by default); otherwise, it raises a RandomStateUnsetWarning:
╭────────────────────────────── RandomStateUnsetWarning ───────────────────────────────╮
│ We recommend setting the parameter `random_state`. This will ensure the              │
│ reproducibility of your work.                                                        │
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── ShuffleTrueWarning ─────────────────────────────────╮
│ We detected that the `shuffle` parameter is set to `True` either explicitly or from  │
│ its default value. In case of time-ordered events (even if they are independent),    │
│ this will result in inflated model performance evaluation because natural drift will │
│ not be taken into account. We recommend setting the shuffle parameter to `False` in  │
│ order to ensure the evaluation process is really representative of your production   │
│ release process.                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Time series data
Now, let us assume that we have time series data, i.e. the data is time-ordered:
import pandas as pd
from skrub.datasets import fetch_employee_salaries
dataset = fetch_employee_salaries()
X, y = dataset.X, dataset.y
X["date_first_hired"] = pd.to_datetime(X["date_first_hired"], format="%m/%d/%Y")
X.head(2)
We can observe that there is a date_first_hired column which is time-based.
As one cannot shuffle time (time only moves in one direction: forward), skore
recommends using sklearn.model_selection.TimeSeriesSplit instead of
sklearn.model_selection.train_test_split() (or skore.train_test_split()),
and raises a TimeBasedColumnWarning:
╭─────────────────────────────── TimeBasedColumnWarning ───────────────────────────────╮
│ We detected some time-based columns (column "date_first_hired") in your data. We     │
│ recommend using scikit-learn's TimeSeriesSplit instead of train_test_split.          │
│ Otherwise you might train on future data to predict the past, or get inflated model  │
│ performance evaluation because natural drift will not be taken into account.         │
╰──────────────────────────────────────────────────────────────────────────────────────╯
Total running time of the script: (0 minutes 0.100 seconds)