edgaro.data package

edgaro.data.dataset module

class edgaro.data.dataset.Dataset(name: str, dataframe: DataFrame | None, target: Series | None, verbose: bool = False)

Bases: object

Create a Dataset object

This class creates a unified representation of a dataset, which can be further processed by other package classes.

At least one of the dataframe and target parameters must be provided.

Parameters:

name (str) – Name of the dataset.
dataframe (pd.DataFrame, optional) – The variables (predictors) in a dataset.
target (pd.Series, optional) – The target in a dataset.
verbose (bool, default=False) – Print messages during calculations.

Variables:

name (str) – Name of the dataset.
verbose (bool) – Print messages during calculations.
majority_class_label (str, optional, default=None) – The label of the majority class. Setting this attribute is recommended; otherwise, the label is guessed during encoding in the model module. The guess can be wrong only if the dataset is balanced, in which case experiment results may be wrong as well.
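
The reason a guessed majority label is unreliable only for balanced data can be seen in a small sketch. This is not edgaro's encoding logic; guess_majority_label is a hypothetical helper that assumes the guess is based on class frequencies.

```python
from collections import Counter


def guess_majority_label(target):
    """Return the most frequent label, or None when the guess is ambiguous.

    Hypothetical helper for illustration; not part of edgaro.
    """
    counts = Counter(target).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        # Balanced classes: no label is the majority, so any guess is arbitrary.
        return None
    return counts[0][0]


imbalanced = ["neg"] * 8 + ["pos"] * 2
balanced = ["neg"] * 5 + ["pos"] * 5

print(guess_majority_label(imbalanced))  # neg
print(guess_majority_label(balanced))    # None
```

With imbalanced data the frequencies identify the majority class unambiguously, so setting majority_class_label explicitly matters most when the class counts are equal or nearly equal.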

property data: DataFrame | None

The variables (predictors) in a dataset.

Type:: pd.DataFrame, optional

property target: Series | None

The target in a dataset.

Type:: pd.Series, optional

property train: Dataset

The train dataset, available if the Dataset object has been split into train and test parts.

Type:: Dataset

property test: Dataset

The test dataset, available if the Dataset object has been split into train and test parts.

Type:: Dataset

property was_split: bool

Whether the Dataset object has been split into train and test parts.

Type:: bool

train_test_split(test_size: float = 0.2, random_state: int | None = None) → None

Split the object into train and test datasets.

Parameters:

test_size (float, default=0.2) – The size of the test dataset.
random_state (int, optional, default=None) – Random state seed.
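
The roles of test_size and random_state can be illustrated with a minimal sketch. It assumes test_size is a fraction of the rows and that the split is a seeded shuffle; split_indices is a hypothetical helper, and the library's actual splitting strategy (for example, whether it stratifies by target) is not stated here.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, random_state=None):
    """Shuffle row indices and reserve a test_size fraction for the test part.

    Hypothetical helper for illustration; not part of edgaro.
    """
    indices = list(range(n_rows))
    # A fixed seed makes the shuffle, and therefore the split, reproducible.
    random.Random(random_state).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_size=0.2, random_state=42)
print(len(train_idx), len(test_idx))  # 8 2
```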

custom_train_test_split(train: Dataset, test: Dataset) → None

Set custom train and test datasets.

Parameters:

train (Dataset) – The train dataset.
test (Dataset) – The test dataset.

check_binary_classification() → bool

Check whether the Dataset object contains binary classification data.

Returns:: bool
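
As a rough sketch of what such a check amounts to, the snippet below assumes that "binary classification data" means a target with exactly two distinct labels. is_binary_target is a hypothetical helper, not the library's implementation.

```python
def is_binary_target(target):
    """Return True when the target holds exactly two distinct labels.

    Hypothetical helper for illustration; not part of edgaro.
    """
    return len(set(target)) == 2


print(is_binary_target([0, 1, 1, 0]))       # True
print(is_binary_target(["a", "b", "c"]))    # False
print(is_binary_target([1, 1, 1]))          # False
```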

generate_report(output_path: str | None = None, show_jupyter: bool = False, minimal: bool = False) → None

Generate a report using the pandas_profiling tool.

Parameters:

output_path (str, optional, default=None) – The path to save the generated report.
show_jupyter (bool, default=False) – If set to True, the report will be displayed as a Jupyter Notebook IFrame.
minimal (bool, default=False) – Turn off the most expensive computations.

property imbalance_ratio: float

The Imbalance Ratio (IR) of the Dataset, that is, the ratio of the number of majority class observations to the number of minority class observations.

Type:: float
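
The definition above can be worked through on a plain list of labels. This is a minimal sketch of the formula only; the standalone imbalance_ratio function below is written for illustration and is not the property's source code.

```python
from collections import Counter


def imbalance_ratio(target):
    """Ratio of the majority class count to the minority class count."""
    counts = Counter(target)
    return max(counts.values()) / min(counts.values())


target = [0] * 90 + [1] * 10
print(imbalance_ratio(target))  # 9.0
```

A perfectly balanced target gives a ratio of 1.0, and the value grows as the minority class becomes rarer.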

remove_nans(col_thresh: float = 0.9) → None

Remove rows with NaN values and columns containing almost only NaN values.

Parameters:: col_thresh (float, default=0.9) – The threshold of NaN values in columns above which a column should be dropped
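
The effect of col_thresh can be sketched on a small table of rows. The snippet assumes that columns are dropped first (when their share of NaN values exceeds col_thresh) and that rows with any remaining NaN are dropped afterwards; that ordering and the drop_nans helper are assumptions made for illustration.

```python
import math

NAN = float("nan")


def is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def drop_nans(rows, col_thresh=0.9):
    """Drop columns whose NaN share exceeds col_thresh, then rows with any NaN.

    Hypothetical helper for illustration; not part of edgaro.
    """
    if not rows:
        return []
    kept = [
        col for col in rows[0]
        if sum(is_nan(row[col]) for row in rows) / len(rows) <= col_thresh
    ]
    reduced = [{col: row[col] for col in kept} for row in rows]
    return [row for row in reduced if not any(is_nan(v) for v in row.values())]


rows = [
    {"a": 1.0, "b": NAN, "c": 5.0},
    {"a": 2.0, "b": NAN, "c": NAN},
    {"a": 3.0, "b": NAN, "c": 7.0},
]
cleaned = drop_nans(rows, col_thresh=0.9)
print(cleaned)  # [{'a': 1.0, 'c': 5.0}, {'a': 3.0, 'c': 7.0}]
```

Column b is entirely NaN and is removed as a column, so it does not cause every row to be dropped; only the row with a NaN in column c is lost.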

remove_outliers(n_std: float | int = 3) → None

Remove outliers from the Dataset.

This applies only to continuous variables, that is, columns whose type is not category, ‘object’, or ‘int’.

Parameters:: n_std (float, int, default=3) – Number of standard deviations. Observations that lie outside the range column_mean +/- n_std*column_std will be removed.
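
The range column_mean +/- n_std*column_std can be illustrated on a single column of values. The sketch assumes the sample standard deviation is used and that the bounds are inclusive; drop_outliers is a hypothetical helper, not the library's implementation.

```python
import statistics


def drop_outliers(values, n_std=3):
    """Keep values within mean +/- n_std * std of the column.

    Hypothetical helper for illustration; not part of edgaro.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    lower, upper = mean - n_std * std, mean + n_std * std
    return [v for v in values if lower <= v <= upper]


values = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 100.0]

# With n_std=2 the value 100.0 falls outside the range and is dropped.
print(drop_outliers(values, n_std=2))

# With n_std=3 the range is wide enough that every value is kept.
print(drop_outliers(values, n_std=3))
```

The second call shows why the choice of n_std matters on small samples: an extreme value inflates the standard deviation and can keep itself inside the range.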

remove_categorical_and_ordinal_variables()

Remove categorical and ordinal variables.

head(n: int = 10)

Get the first n rows of the Dataset.

Parameters:: n (int, default=10) – Number of rows.
Returns:: A Dataset object with the first n rows.
Return type:: Dataset

class edgaro.data.dataset.DatasetFromCSV(path: str, target: str | None = None, name: str = 'dataset', verbose: bool = False, *args, **kwargs)

Bases: Dataset

Create a Dataset object from a *.csv file

If the `target` parameter is None, the last column in the file is used as the target.

Parameters:

name (str, default='dataset') – Name of the dataset.
path (str) – The path to the *.csv file.
target (str, optional, default=None) – The name of a column in the Dataset that is a target.
verbose (bool, default=False) – Print messages during calculations.
*args (tuple, optional) – Additional arguments to pd.read_csv function.
**kwargs (dict, optional) – Additional arguments to pd.read_csv function.
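
The rule for choosing the target column can be sketched with the standard csv module. The class itself is documented as forwarding extra arguments to pd.read_csv; the split_csv helper below is hypothetical and only mirrors the "last column by default" behaviour described above.

```python
import csv
import io


def split_csv(text, target=None):
    """Split CSV text into predictor rows and a target column.

    Hypothetical helper for illustration; not part of edgaro.
    If target is None, the last column is used as the target.
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames
    target_col = target if target is not None else columns[-1]
    rows = list(reader)
    predictors = [{c: r[c] for c in columns if c != target_col} for r in rows]
    labels = [r[target_col] for r in rows]
    return predictors, labels


csv_text = "x1,x2,label\n1,2,a\n3,4,b\n"

X, y = split_csv(csv_text)
print(y)  # ['a', 'b']

X_alt, y_alt = split_csv(csv_text, target="x1")
print(y_alt)  # ['1', '3']
```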

class edgaro.data.dataset.DatasetFromOpenML(task_id: int, apikey: str | None = None, verbose: bool = False)

Bases: Dataset

Create a Dataset object from an OpenML dataset.

Before using this class, configure authentication by following the procedure described in the OpenML documentation.

Alternatively, pass your own OpenML API key as the apikey parameter.

Parameters:

task_id (int) – A task ID for a dataset in OpenML.
apikey (str, optional, default=None) – An API key to OpenML (if you configured OpenML, you do not need to pass this parameter).
verbose (bool, default=False) – Print messages during calculations.

openml_description() → str | None

Description of the Dataset from OpenML.

Returns:: The Dataset description.
Return type:: str, optional

edgaro.data.dataset.load_mammography()

The function loads an example dataset ‘mammography’.

Return type:: Dataset

edgaro.data.dataset_array module

class edgaro.data.dataset_array.DatasetArray(datasets: List[Dataset | DatasetArray], name: str = 'dataset_array', verbose: bool = False)

Bases: object

Create a DatasetArray object

This class creates a unified representation of an array of the Dataset objects, which can be further processed by other package classes.

Parameters:

datasets (list[Dataset, DatasetArray]) – The list of Dataset and DatasetArray objects.
name (str, default='dataset_array') – Name of the dataset array.
verbose (bool, default=False) – Print messages during calculations.

Variables:

name (str) – Name of the dataset array.
datasets (list[Dataset, DatasetArray]) – The list of Dataset and DatasetArray objects.
verbose (bool) – Print messages during calculations.

train_test_split(test_size: float = 0.2, random_state: int | None = None) → None

Split each of the Dataset objects into train and test datasets.

Parameters:

test_size (float, default=0.2) – The size of the test dataset.
random_state (int, optional, default=None) – Random state seed.

property train: DatasetArray

The DatasetArray of the train parts of the Dataset objects, available if the Dataset objects have been split into train and test parts.

Type:: DatasetArray

property test: DatasetArray

The DatasetArray of the test parts of the Dataset objects, available if the Dataset objects have been split into train and test parts.

Type:: DatasetArray

remove_nans(col_thresh: float = 0.9) → None

Remove rows with NaN values and columns containing almost only NaN values.

Parameters:: col_thresh (float, default=0.9) – The threshold of NaN values in columns above which a column should be dropped

remove_outliers(n_std: float | int = 3) → None

Remove outliers from each of the Dataset objects.

This applies only to continuous variables, that is, columns whose type is not category, ‘object’, or ‘int’.

Parameters:: n_std (float, int, default=3) – Number of standard deviations. Observations that lie outside the range column_mean +/- n_std*column_std will be removed.

remove_non_binary_target_datasets() → None

Remove Dataset objects that do not represent a binary classification task.

remove_empty_datasets() → None

Remove empty Dataset objects.

remove_categorical_and_ordinal_variables()

Remove categorical and ordinal variables.

append(other: Dataset | DatasetArray | List[Dataset | DatasetArray]) → None

Append a new object to a DatasetArray.

Parameters:: other (Dataset, DatasetArray, list[Dataset, DatasetArray]) – The object to be appended to the array.

head(n: int = 10)

Get the first n rows of each of the Dataset objects.

Parameters:: n (int, default=10) – Number of rows.
Returns:: A DatasetArray object whose Dataset objects contain the first n rows.
Return type:: DatasetArray

class edgaro.data.dataset_array.DatasetArrayFromOpenMLSuite(suite_name: str = 'OpenML100', apikey: str | None = None, name: str = 'dataset_array', verbose: bool = False)

Bases: DatasetArray

Create a DatasetArray object from an OpenML suite.

Before using this class, configure authentication by following the procedure described in the OpenML documentation.

Alternatively, pass your own OpenML API key as the apikey parameter.

Parameters:

suite_name (str, default=‘OpenML100’) – The name of a benchmark suite in OpenML.
apikey (str, optional, default=None) – An API key to OpenML (if you configured OpenML, you do not need to pass this parameter).
name (str, default='dataset_array') – Name of the dataset array.
verbose (bool, default=False) – Print messages during calculations.

openml_description() → str

Description of the suite from OpenML.

Returns:: The suite description.
Return type:: str

class edgaro.data.dataset_array.DatasetArrayFromDirectory(path: str, name: str = 'dataset_array', verbose: bool = False)

Bases: DatasetArray

Create a DatasetArray object by loading *.csv and *.npy files.

The *.npy files contain numpy arrays or pickled objects. They are loaded using this function.

The class assumes that the last column in each file is a target column.

Parameters:

path (str) – A path of a directory to load files from.
name (str, default='dataset_array') – Name of the dataset array.
verbose (bool, default=False) – Print messages during calculations.

edgaro.data.dataset_array.load_benchmarking_set(apikey: str | None = None, keep_categorical: bool = False, minimal_IR: float = 1.5, minimal_n_rows: int = 1000, percent_categorical_to_remove: float = 0.75)

The function loads an example benchmarking set.

Parameters:

apikey (str, optional, default=None) – An API key to OpenML (if you configured OpenML, you do not need to pass this parameter). For details see DatasetArrayFromOpenMLSuite class documentation.
keep_categorical (bool, default=False) – If True, categorical variables are kept in the datasets.
minimal_IR (float, default=1.5) – Minimal Imbalance Ratio (IR) of a Dataset in the set.
minimal_n_rows (int, default=1000) – Minimal number of rows in a Dataset.
percent_categorical_to_remove (float, default=0.75) – Only applicable if keep_categorical=False; if the share of categorical and nominal variables among all variables exceeds this value, the Dataset is removed.

Return type:: DatasetArray
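
How the three thresholds interact can be sketched as a single predicate over summary statistics of one dataset. The comparison operators (inclusive or exclusive) and the keep_dataset helper are assumptions made for illustration; the sketch covers the keep_categorical=False case and is not the function's source code.

```python
def keep_dataset(n_rows, imbalance_ratio, n_categorical, n_variables,
                 minimal_IR=1.5, minimal_n_rows=1000,
                 percent_categorical_to_remove=0.75):
    """Decide whether a dataset passes the benchmarking filters.

    Hypothetical helper for illustration; not part of edgaro.
    """
    if n_variables == 0:
        return False
    if imbalance_ratio < minimal_IR or n_rows < minimal_n_rows:
        return False
    # Too large a share of categorical variables removes the dataset.
    return n_categorical / n_variables <= percent_categorical_to_remove


print(keep_dataset(n_rows=5000, imbalance_ratio=9.0, n_categorical=2, n_variables=10))   # True
print(keep_dataset(n_rows=500, imbalance_ratio=9.0, n_categorical=2, n_variables=10))    # False
print(keep_dataset(n_rows=5000, imbalance_ratio=1.1, n_categorical=2, n_variables=10))   # False
print(keep_dataset(n_rows=5000, imbalance_ratio=9.0, n_categorical=9, n_variables=10))   # False
```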