edgaro.data package
edgaro.data.dataset module
- class edgaro.data.dataset.Dataset(name: str, dataframe: DataFrame | None, target: Series | None, verbose: bool = False)
Bases:
object
Create a Dataset object
This class creates a unified representation of a dataset, which can be further processed by other package classes.
At least one of the dataframe and target parameters must be provided.
- Parameters:
name (str) – Name of the dataset.
dataframe (pd.DataFrame, optional) – The variables (predictors) in a dataset.
target (pd.Series, optional) – The target in a dataset.
verbose (bool, default=False) – Print messages during calculations.
- Variables:
name (str) – Name of the dataset.
verbose (bool) – Print messages during calculations.
majority_class_label (str, optional, default=None) – The label of the majority class. Setting this attribute is recommended; otherwise, the label is guessed during encoding in the model module. The guess can be wrong only if the dataset is balanced, in which case experiment results may be wrong.
- property data: DataFrame | None
The variables (predictors) in a dataset.
- Type:
pd.DataFrame, optional
- property target: Series | None
The target in a dataset.
- Type:
pd.Series, optional
- property was_split: bool
Whether the Dataset object has been train-test-split.
- Type:
bool
- train_test_split(test_size: float = 0.2, random_state: int | None = None) None
Split the object into train and test datasets.
- Parameters:
test_size (float, default=0.2) – The size of the test dataset.
random_state (int, optional, default=None) – Random state seed.
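As a rough illustration of how test_size and random_state interact, the sketch below splits a list of rows in plain Python. It is not edgaro's implementation: the helper name split_rows and the rule of rounding the test count up are assumptions made for this example.

```python
import math
import random


def split_rows(rows, test_size=0.2, random_state=None):
    """Return (train, test) lists; test holds about test_size of the rows."""
    indices = list(range(len(rows)))
    # A seeded generator makes the shuffle, and thus the split, reproducible.
    random.Random(random_state).shuffle(indices)
    n_test = math.ceil(test_size * len(rows))  # assumption: round up
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(10))
train, test = split_rows(rows, test_size=0.2, random_state=42)
print(len(train), len(test))
```

Calling the helper twice with the same random_state yields the same split, which is the purpose of the seed parameter.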
- custom_train_test_split(train: Dataset, test: Dataset) None
Set custom train and test datasets.
- Parameters:
train (Dataset) – The train dataset.
test (Dataset) – The test dataset.
- check_binary_classification() bool
Check whether the Dataset object contains binary classification data.
- Returns:
bool
- generate_report(output_path: str | None = None, show_jupyter: bool = False, minimal: bool = False) None
Generate a report using the pandas_profiling tool.
- Parameters:
output_path (str, optional, default=None) – The path to save the generated report.
show_jupyter (bool, default=False) – If set to True, the report will be displayed as a Jupyter Notebook IFrame.
minimal (bool, default=False) – Turn off the most expensive computations.
- property imbalance_ratio: float
Imbalance Ratio (IR) of the Dataset: the ratio of the size of the majority class to the size of the minority class.
- Type:
float
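The definition above can be sketched in a few lines of plain Python. This is an illustration of the ratio only, not edgaro's code; the function name imbalance_ratio is reused here for readability and the error handling is an assumption.

```python
from collections import Counter


def imbalance_ratio(target):
    """Majority class count divided by minority class count."""
    counts = Counter(target)
    if len(counts) < 2:
        raise ValueError("at least two classes are needed")
    return max(counts.values()) / min(counts.values())


target = [0, 0, 0, 0, 0, 0, 1, 1]
print(imbalance_ratio(target))  # 6 majority / 2 minority = 3.0
```

A perfectly balanced target gives a ratio of 1.0; larger values mean stronger imbalance.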
- remove_nans(col_thresh: float = 0.9) None
Remove rows with NaN values and columns containing almost only NaN values.
- Parameters:
col_thresh (float, default=0.9) – The fraction of NaN values in a column above which the column is dropped.
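The following dependency-free sketch shows one way the two steps could fit together. It assumes that columns are dropped first and rows second, and that col_thresh is compared against the fraction of missing values; neither detail is confirmed by the documentation above, and the dict-of-lists layout is purely illustrative.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def remove_nans(columns, col_thresh=0.9):
    """columns: dict mapping a column name to a list of values (equal lengths)."""
    n_rows = len(next(iter(columns.values()), []))
    # Step 1 (assumed order): drop columns that are almost entirely missing.
    kept = {
        name: values
        for name, values in columns.items()
        if n_rows == 0
        or sum(is_missing(v) for v in values) / n_rows <= col_thresh
    }
    # Step 2: drop rows that still contain a missing value.
    good_rows = [
        i for i in range(n_rows)
        if not any(is_missing(values[i]) for values in kept.values())
    ]
    return {name: [values[i] for i in good_rows] for name, values in kept.items()}


nan = float("nan")
data = {
    "a": [1.0, 2.0, nan, 4.0],
    "b": [nan, nan, nan, nan],
    "c": [5.0, 6.0, 7.0, 8.0],
}
cleaned = remove_nans(data, col_thresh=0.9)
print(cleaned)
```

Dropping the nearly empty column first matters: if rows were removed first, column b would cause every row to be discarded.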
- remove_outliers(n_std: float | int = 3) None
Remove outlier observations.
It applies only to continuous variables, that is, columns whose type is not category, 'object' or 'int'.
- Parameters:
n_std (float, int, default=3) – Number of standard deviations. Observations that lie outside the range column_mean +/- n_std*column_std are removed.
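The range rule for n_std can be illustrated on a single column of numbers. The sketch below is not edgaro's implementation; it assumes the sample standard deviation is used and that values exactly on the boundary are kept.

```python
import statistics


def remove_outliers(values, n_std=3):
    """Keep values inside mean +/- n_std * std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # assumption: sample standard deviation
    lower, upper = mean - n_std * std, mean + n_std * std
    return [v for v in values if lower <= v <= upper]


values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
print(remove_outliers(values, n_std=2))       # 100 falls outside the range
print(len(remove_outliers(values, n_std=3)))  # 100 stays inside the wider range
```

Because the mean and standard deviation are computed from data that includes the outlier, a wide threshold may remove nothing in a small sample, as the second call shows.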
- remove_categorical_and_ordinal_variables()
Remove categorical and ordinal variables.
- class edgaro.data.dataset.DatasetFromCSV(path: str, target: str | None = None, name: str = 'dataset', verbose: bool = False, *args, **kwargs)
Bases:
Dataset
Create a Dataset object from a *.csv file
If the `target` parameter is None, the target is the last column in the file.
- Parameters:
name (str, default='dataset') – Name of the dataset.
path (str) – The path to the *.csv file.
target (str, optional, default=None) – The name of a column in the Dataset that is a target.
verbose (bool, default=False) – Print messages during calculations.
*args (tuple, optional) – Additional arguments to pd.read_csv function.
**kwargs (dict, optional) – Additional arguments to pd.read_csv function.
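The default target rule can be sketched as follows. The real class forwards its extra arguments to pd.read_csv; this example uses the standard-library csv module instead so that it stays dependency-free, and the helper name split_target is hypothetical.

```python
import csv
import io


def split_target(csv_text, target=None):
    """Return (feature_rows, target_values) from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Default described above: fall back to the last column in the file.
    target_name = target if target is not None else reader.fieldnames[-1]
    features, targets = [], []
    for row in reader:
        targets.append(row.pop(target_name))
        features.append(row)
    return features, targets


csv_text = "age,income,label\n30,1000,0\n45,2500,1\n"
features, targets = split_target(csv_text)
print(targets)
```

Passing target="age" instead would select that column as the target and leave income and label as predictors.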
- class edgaro.data.dataset.DatasetFromOpenML(task_id: int, apikey: str | None = None, verbose: bool = False)
Bases:
Dataset
Create a Dataset object from an OpenML dataset.
Before using this class, follow the OpenML procedure for configuring authentication.
Alternatively, pass your own OpenML API key as the apikey parameter.
- Parameters:
task_id (int) – A task ID for a dataset in OpenML.
apikey (str, optional, default=None) – An API key to OpenML (if you configured OpenML, you do not need to pass this parameter).
verbose (bool, default=False) – Print messages during calculations.
- openml_description() str | None
Description of the Dataset from OpenML.
- Returns:
The Dataset description.
- Return type:
str
edgaro.data.dataset_array module
- class edgaro.data.dataset_array.DatasetArray(datasets: List[Dataset | DatasetArray], name: str = 'dataset_array', verbose: bool = False)
Bases:
object
Create a DatasetArray object
This class creates a unified representation of an array of the Dataset objects, which can be further processed by other package classes.
- Parameters:
datasets (list[Dataset, DatasetArray]) – The list of Dataset and DatasetArray objects.
name (str, default='dataset_array') – Name of the dataset array.
verbose (bool, default=False) – Print messages during calculations.
- Variables:
name (str) – Name of the dataset array.
datasets (list[Dataset, DatasetArray]) – The list of Dataset and DatasetArray objects.
verbose (bool) – Print messages during calculations.
- train_test_split(test_size: float = 0.2, random_state: int | None = None) None
Split each of the Dataset objects into train and test datasets.
- Parameters:
test_size (float, default=0.2) – The size of the test dataset.
random_state (int, optional, default=None) – Random state seed.
- property train: DatasetArray
The DatasetArray of the train parts of the Dataset objects, if the Dataset objects were train-test-split.
- Type:
DatasetArray
- property test: DatasetArray
The DatasetArray of the test parts of the Dataset objects, if the Dataset objects were train-test-split.
- Type:
DatasetArray
- remove_nans(col_thresh: float = 0.9) None
Remove rows with NaN values and columns containing almost only NaN values.
- Parameters:
col_thresh (float, default=0.9) – The fraction of NaN values in a column above which the column is dropped.
- remove_outliers(n_std: float | int = 3) None
Remove outlier observations.
It applies only to continuous variables, that is, columns whose type is not category, 'object' or 'int'.
- Parameters:
n_std (float, int, default=3) – Number of standard deviations. Observations that lie outside the range column_mean +/- n_std*column_std are removed.
- remove_non_binary_target_datasets() None
Remove Dataset objects that do not represent a binary classification task.
- remove_empty_datasets() None
Remove empty Dataset objects.
- remove_categorical_and_ordinal_variables()
Remove categorical and ordinal variables.
- append(other: Dataset | DatasetArray | List[Dataset | DatasetArray]) None
Append a new object to a DatasetArray.
- Parameters:
other (Dataset, DatasetArray, list[Dataset, DatasetArray]) – The object to be appended to the array.
- head(n: int = 10)
Get first n rows of each of the Dataset objects.
- Parameters:
n (int, default=10) – Number of rows.
- Returns:
A DatasetArray object whose Dataset objects contain the first n rows.
- Return type:
DatasetArray
- class edgaro.data.dataset_array.DatasetArrayFromOpenMLSuite(suite_name: str = 'OpenML100', apikey: str | None = None, name: str = 'dataset_array', verbose: bool = False)
Bases:
DatasetArray
Create a DatasetArray object from an OpenML suite.
Before using this class, follow the OpenML procedure for configuring authentication.
Alternatively, pass your own OpenML API key as the apikey parameter.
- Parameters:
suite_name (str, default='OpenML100') – The name of a suite in OpenML.
apikey (str, optional, default=None) – An API key to OpenML (if you configured OpenML, you do not need to pass this parameter).
name (str) – Name of the dataset array.
verbose (bool, default=False) – Print messages during calculations.
- openml_description() str
Description of the suite from OpenML.
- Returns:
The suite description.
- Return type:
str
- class edgaro.data.dataset_array.DatasetArrayFromDirectory(path: str, name: str = 'dataset_array', verbose: bool = False)
Bases:
DatasetArray
Create a DatasetArray object by loading *.csv and *.npy files.
The *.npy files contain numpy arrays or pickled objects. They are loaded using this function.
The class assumes that the last column in each file is a target column.
- Parameters:
path (str) – A path of a directory to load files from.
name (str, default='dataset_array') – Name of the dataset array.
verbose (bool, default=False) – Print messages during calculations.
- edgaro.data.dataset_array.load_benchmarking_set(apikey: str | None = None, keep_categorical: bool = False, minimal_IR: float = 1.5, minimal_n_rows: int = 1000, percent_categorical_to_remove: float = 0.75)
The function loads an example benchmarking set.
- Parameters:
apikey (str, optional, default=None) – An API key to OpenML (if you configured OpenML, you do not need to pass this parameter). For details see DatasetArrayFromOpenMLSuite class documentation.
keep_categorical (bool, default=False) – If True, categorical variables are kept in the datasets.
minimal_IR (float, default=1.5) – Minimal IR in the set.
minimal_n_rows (int, default=1000) – Minimal number of rows in a Dataset.
percent_categorical_to_remove (float, default=0.75) – Only applicable if keep_categorical=False; if the fraction of categorical and nominal variables among all variables exceeds this value, the Dataset is removed.
- Return type: