EstimatorReport.data.analyze

EstimatorReport.data.analyze(data_source='all', with_y=True, subsample=None, subsample_strategy='head', seed=None)

Plot dataset statistics.

Parameters:
data_source : {'train', 'test', 'all'}, default='all'

The dataset to analyze. If 'train', only the training set is used. If 'test', only the test set is used. If 'all', both sets are concatenated vertically.

with_y : bool, default=True

Whether to include the target variable in the analysis. If True, the target variable is concatenated horizontally to the features.

subsample : int, default=None

The number of points to keep when subsampling the dataframe held by the display, using the strategy set by subsample_strategy. It must be a strictly positive integer. If None, no subsampling is applied.

subsample_strategy : {'head', 'random'}, default='head'

The strategy used to subsample the dataframe held by the display. It only has an effect when subsample is not None.

  • If 'head': subsample by taking the first subsample points of the dataframe, similar to pandas' df.head(n).

  • If 'random': randomly subsample the dataframe by using a uniform distribution. The random seed is controlled by seed.

seed : int, default=None

The random seed to use when randomly subsampling. It only has an effect when subsample is not None and subsample_strategy='random'.
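The interaction between subsample, subsample_strategy and seed can be pictured with plain pandas. The sketch below is only an illustration of the documented semantics, not skore's implementation: subsample_frame and the toy dataframe are hypothetical names introduced here, and the use of DataFrame.sample for the 'random' strategy is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe held by the display.
df = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})


def subsample_frame(frame, subsample=None, subsample_strategy="head", seed=None):
    """Illustrative helper (not part of skore) mirroring the documented options."""
    if subsample is None:
        # No subsampling requested: the strategy and seed are ignored.
        return frame
    if not isinstance(subsample, int) or subsample <= 0:
        raise ValueError("subsample must be a strictly positive integer")
    if subsample_strategy == "head":
        # Keep the first `subsample` rows, like df.head(n).
        return frame.head(subsample)
    if subsample_strategy == "random":
        # Uniform sampling without replacement; `seed` makes it reproducible.
        return frame.sample(n=subsample, random_state=seed)
    raise ValueError("subsample_strategy must be 'head' or 'random'")


head_rows = subsample_frame(df, subsample=3)
random_rows = subsample_frame(df, subsample=3, subsample_strategy="random", seed=0)
```

With the 'head' strategy the seed plays no role; with the 'random' strategy, passing the same seed twice selects the same rows.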

Returns:
TableReportDisplay

A display object containing the dataset statistics and plots.

Examples

>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.linear_model import LogisticRegression
>>> from skore import train_test_split
>>> from skore import EstimatorReport
>>> X, y = load_breast_cancer(return_X_y=True)
>>> split_data = train_test_split(X=X, y=y, random_state=0, as_dict=True)
>>> classifier = LogisticRegression(max_iter=10_000)
>>> report = EstimatorReport(classifier, **split_data, pos_label=1)
>>> report.data.analyze().frame()
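
The data_source and with_y options described under Parameters amount to a vertical and a horizontal concatenation. The following pandas sketch shows that idea on toy data; assemble, the toy frames, and the choice to reset the index when stacking are assumptions made for illustration, not a description of how skore builds the table.

```python
import pandas as pd

# Toy train/test split (hypothetical data).
X_train = pd.DataFrame({"f0": [1.0, 2.0], "f1": [3.0, 4.0]})
X_test = pd.DataFrame({"f0": [5.0], "f1": [6.0]})
y_train = pd.Series([0, 1], name="target")
y_test = pd.Series([1], name="target")


def assemble(data_source="all", with_y=True):
    """Illustrative helper (not part of skore) mirroring data_source and with_y."""
    if data_source == "train":
        X, y = X_train, y_train
    elif data_source == "test":
        X, y = X_test, y_test
    elif data_source == "all":
        # Stack train and test rows vertically (index reset is an assumption).
        X = pd.concat([X_train, X_test], axis=0, ignore_index=True)
        y = pd.concat([y_train, y_test], axis=0, ignore_index=True)
    else:
        raise ValueError("data_source must be 'train', 'test' or 'all'")
    # with_y=True appends the target as an extra column next to the features.
    return pd.concat([X, y], axis=1) if with_y else X


all_with_target = assemble("all")
train_features_only = assemble("train", with_y=False)
```

Here all_with_target has one row per sample from both sets and one extra column for the target, while train_features_only keeps only the training features.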