EstimatorReport.data.analyze
- EstimatorReport.data.analyze(data_source='all', with_y=True, subsample=None, subsample_strategy='head', seed=None)
Plot dataset statistics.
- Parameters:
- data_source : {'train', 'test', 'all'}, default='all'
The dataset to analyze. If ‘train’, only the training set is used. If ‘test’, only the test set is used. If ‘all’, both sets are concatenated vertically.
- with_y : bool, default=True
Whether to include the target variable in the analysis. If True, the target variable is concatenated horizontally to the features.
- subsample : int, default=None
The number of points to subsample from the dataframe held by the display, using the strategy set by subsample_strategy. It must be a strictly positive integer. If None, no subsampling is applied.
- subsample_strategy : {'head', 'random'}, default='head'
The strategy used to subsample the dataframe held by the display. It only has an effect when subsample is not None.
If 'head': subsample by taking the first n = subsample points of the dataframe, similar to pandas df.head(n).
If 'random': randomly subsample the dataframe using a uniform distribution. The random seed is controlled by seed.
- seed : int, default=None
The random seed to use when randomly subsampling. It only has an effect when subsample is not None and subsample_strategy='random'.
- Returns:
- TableReportDisplay
A display object containing the dataset statistics and plots.
Examples
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.linear_model import LogisticRegression
>>> from skore import train_test_split
>>> from skore import EstimatorReport
>>> X, y = load_breast_cancer(return_X_y=True)
>>> split_data = train_test_split(X=X, y=y, random_state=0, as_dict=True)
>>> classifier = LogisticRegression(max_iter=10_000)
>>> report = EstimatorReport(classifier, **split_data, pos_label=1)
>>> report.data.analyze().frame()
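The parameter semantics described above can be sketched with plain pandas. This is only an illustration of the documented behaviour, not skore's implementation: the helpers build_frame and subsample_frame and the toy data are hypothetical, and the use of pandas.concat, DataFrame.head and DataFrame.sample is an assumption about how the described concatenation and subsampling could be reproduced.

```python
import pandas as pd

# Hypothetical toy data standing in for a report's train and test sets.
X_train = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
y_train = pd.Series([0, 1, 0], name="target")
X_test = pd.DataFrame({"a": [7, 8], "b": [9.0, 10.0]})
y_test = pd.Series([1, 1], name="target")


def build_frame(data_source="all", with_y=True):
    """Assemble the table selected by data_source and with_y (illustrative only)."""
    if data_source == "train":
        X, y = X_train, y_train
    elif data_source == "test":
        X, y = X_test, y_test
    elif data_source == "all":
        # 'all': stack the train and test sets vertically.
        X = pd.concat([X_train, X_test], axis=0, ignore_index=True)
        y = pd.concat([y_train, y_test], axis=0, ignore_index=True)
    else:
        raise ValueError(f"Unknown data_source: {data_source!r}")
    # with_y=True: append the target as an extra column.
    return pd.concat([X, y], axis=1) if with_y else X


def subsample_frame(df, subsample=None, subsample_strategy="head", seed=None):
    """Mimic the documented subsampling options (illustrative only)."""
    if subsample is None:
        return df  # no subsampling
    if not isinstance(subsample, int) or subsample <= 0:
        raise ValueError("subsample must be a strictly positive integer")
    if subsample_strategy == "head":
        return df.head(subsample)  # first `subsample` rows
    if subsample_strategy == "random":
        # Uniform random rows; `seed` makes the draw reproducible.
        return df.sample(n=subsample, random_state=seed)
    raise ValueError(f"Unknown subsample_strategy: {subsample_strategy!r}")


full = build_frame(data_source="all", with_y=True)
head_rows = subsample_frame(full, subsample=2)
random_rows = subsample_frame(full, subsample=2, subsample_strategy="random", seed=0)
```

In this sketch, seed is ignored under the 'head' strategy, which matches the note that it only has an effect when subsample is not None and subsample_strategy='random'.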