edgaro.model package

edgaro.model.model module

class edgaro.model.model.Model(name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False, majority_class_label: str | None = None)

Bases: BaseTransformer, ABC

The abstract class to define a Machine Learning model for a single Dataset.

Parameters:

name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when the Dataset object has not already been train-test-split. If the Dataset object was not split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.

Variables:

name (str) – A name of the Model.
test_size (float, optional) – Test size used when the Dataset object has not already been train-test-split. If the Dataset object was not split and this parameter is None, training is done on all the data.
random_state (int, optional) – Random state seed.
verbose (bool) – Print messages during calculations.
majority_class_label (str, optional, default=None) – The label of the majority class. Passing this argument is recommended; otherwise the label is guessed. The guess may be wrong if the dataset is balanced, in which case experiment results may be wrong as well. If None, an attempt is made to read the label from the majority_class_label attribute of the Dataset object.
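
The interaction between test_size and a Dataset that has not been split can be pictured with the minimal sketch below. It is plain Python, not edgaro code: the helper name split_rows and the shuffled-index approach are assumptions made purely for illustration, and the library's actual splitting strategy may differ.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_rows(rows: Sequence, test_size: Optional[float] = None,
               random_state: Optional[int] = None) -> Tuple[List, List]:
    """Hypothetical helper: return (train, test) lists of rows."""
    rows = list(rows)
    if test_size is None:
        # No split requested: every row is used for training.
        return rows, []
    rng = random.Random(random_state)  # seeded, so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


# test_size=None: all rows end up in the training part.
all_train, no_test = split_rows(range(10))

# The same seed yields the same split on repeated calls.
train_a, test_a = split_rows(range(10), test_size=0.3, random_state=42)
train_b, test_b = split_rows(range(10), test_size=0.3, random_state=42)
```

Fixing random_state is what makes two runs comparable; without it, each call may hold out different rows.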

fit(dataset: Dataset, print_scores: bool = False) → None

Fit the Model.

The fitting process includes encoding the categorical variables with OrdinalEncoder (from the scikit-learn library) and encoding the target (a custom encoding: the minority class is encoded as 1, the majority class as 0).

The method assumes that the categorical variables to be encoded have one of the dtypes ‘category’ or ‘object’.

Parameters:

dataset (Dataset) – The object to fit the Model on.
print_scores (bool, default=False) – Indicates whether model evaluation on a test dataset should be printed at the end of fitting.
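
The target encoding described above (minority class as 1, majority class as 0) and the reason for passing majority_class_label explicitly can be illustrated with a small sketch. The function encode_target is hypothetical and written in plain Python; it is not the library's implementation, and the count-based guess is only one plausible way of guessing the majority label.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def encode_target(target: Sequence[Hashable],
                  majority_class_label: Optional[Hashable] = None) -> List[int]:
    """Hypothetical helper: majority class -> 0, any other label -> 1."""
    if majority_class_label is None:
        # Guess the majority label from the counts. With a balanced target
        # the counts tie, so the guess depends only on label order.
        majority_class_label = Counter(target).most_common(1)[0][0]
    return [0 if label == majority_class_label else 1 for label in target]


imbalanced = ["no", "no", "no", "yes"]
balanced = ["yes", "no", "no", "yes"]

encoded = encode_target(imbalanced)                             # clear majority: "no"
guessed = encode_target(balanced)                               # tie: guess is arbitrary
explicit = encode_target(balanced, majority_class_label="no")   # unambiguous
```

On the balanced target the guessed and the explicit encodings are mirror images of each other, which is the failure mode the majority_class_label parameter is meant to avoid.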

property was_fitted: bool

Whether the Model has been fitted.

Return type: bool

transform_data(dataset: Dataset) → Dataset

Encode dataset.data with the rules generated after fitting this object.

Parameters: dataset (Dataset) – A Dataset object whose .data attribute will be encoded. The method returns a new object.
Return type: Dataset
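
The idea of learning encoding rules during fitting and reusing them later can be sketched in plain Python. The helpers fit_ordinal and apply_ordinal are hypothetical stand-ins for what an ordinal encoder does (each category is mapped to an integer code, here in sorted order); they are not edgaro or scikit-learn functions. How unseen categories are handled by the library is not stated in this reference, so the sketch simply raises KeyError for them.

```python
from typing import Dict, List, Sequence


def fit_ordinal(values: Sequence[str]) -> Dict[str, int]:
    """Learn a category -> integer code mapping from training values."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def apply_ordinal(values: Sequence[str], mapping: Dict[str, int]) -> List[int]:
    """Encode values with a previously learned mapping (KeyError if unseen)."""
    return [mapping[value] for value in values]


# Rules are learned once, on the training column ...
mapping = fit_ordinal(["red", "blue", "red", "green"])

# ... and reused unchanged on new data, so codes stay consistent.
encoded_new = apply_ordinal(["green", "red"], mapping)
```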

transform_target(dataset: Dataset) → Dataset

Encode dataset.target with the rules generated after fitting this object.

Parameters: dataset (Dataset) – A Dataset object whose .target attribute will be encoded. The method returns a new object.
Return type: Dataset

predict(dataset: Dataset) → Dataset

Predict the class for a Dataset object.

Parameters: dataset (Dataset) – A Dataset object to make predictions on.
Return type: Dataset

predict_proba(dataset: Dataset) → Dataset

Predict the probability of class 1 for a Dataset object.

Parameters: dataset (Dataset) – A Dataset object to make predictions on.
Return type: Dataset

set_params(**params) → None

Set parameters of the Model.

Parameters: params (dict) – The parameters to be set.

abstract get_params() → Dict

Get the parameters of the Model.

Returns: The parameters.
Return type: Dict, list

transform(dataset: Dataset) → Dataset

A function to make the Model compatible with BaseTransformer.

It returns either predicted classes or predicted probabilities; the behaviour is selected with the set_transform_to_probabilities and set_transform_to_classes functions.

Parameters: dataset (Dataset) – A Dataset object to be transformed.
Return type: Dataset

set_transform_to_probabilities() → None: Make transform function return probabilities.

set_transform_to_classes() → None: Make transform function return classes.
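
The switch between the two transform modes can be pictured with a toy class. ToyModel below is a hypothetical stand-in working on plain lists of scores, not the edgaro Model; the 0.5 threshold and the choice of classes as the initial mode are assumptions made for the sketch.

```python
from typing import List, Sequence


class ToyModel:
    """Hypothetical stand-in that mimics the transform switch."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self._to_probabilities = False  # assumed initial mode: classes

    def predict_proba(self, scores: Sequence[float]) -> List[float]:
        # The toy "model" treats its input as class-1 probabilities.
        return list(scores)

    def predict(self, scores: Sequence[float]) -> List[int]:
        return [1 if s >= self.threshold else 0 for s in scores]

    def set_transform_to_probabilities(self) -> None:
        self._to_probabilities = True

    def set_transform_to_classes(self) -> None:
        self._to_probabilities = False

    def transform(self, scores: Sequence[float]):
        # transform delegates to whichever output mode is currently selected.
        if self._to_probabilities:
            return self.predict_proba(scores)
        return self.predict(scores)


model = ToyModel()
as_classes = model.transform([0.2, 0.7])
model.set_transform_to_probabilities()
as_probabilities = model.transform([0.2, 0.7])
```

Keeping the mode as state on the object lets the same transform call be reused in a pipeline of transformers without changing the calling code.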

get_train_dataset() → Dataset | None

Get the Dataset used for training.

Return type: Dataset or None

get_test_dataset() → Dataset | None

Get the Dataset used for testing.

Return type: Dataset or None

get_category_colnames() → List[str]

Get the names of the categorical columns that were encoded during the fitting process.

Return type: list(str)

evaluate(metrics_output_class: List[Callable[[Series, Series], float]] | None = None, metrics_output_probabilities: List[Callable[[Series, Series], float]] | None = None, ds: Dataset | None = None) → DataFrame

Evaluate the model.

Parameters:

metrics_output_class (list[Callable[[pd.Series, pd.Series], float]], optional, default=None) – List of functions to calculate metrics on predicted classes. If None is passed, accuracy, balanced accuracy, precision, recall, specificity, f1, f1_weighted, geometric mean score are used.
metrics_output_probabilities (list[Callable[[pd.Series, pd.Series], float]], optional, default=None) – List of functions to calculate metrics on predicted probabilities. If None is passed, ROC AUC is used.
ds (Dataset, optional, default=None) – A Dataset object to calculate the metrics on. If None is passed, the test Dataset from fitting is used.

Return type: pd.DataFrame
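
A custom metric for metrics_output_class is any callable that takes two series and returns a float. The sketch below shows pure-Python versions of several of the default metrics named above. It assumes the callable receives the true labels first and the predicted labels second (the scikit-learn convention; this reference does not state the order) and that the minority class is encoded as 1. Plain lists are used in place of pd.Series to keep the example dependency-free.

```python
from math import sqrt
from typing import Sequence


def _hit_rate(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant) if relevant else 0.0


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return _hit_rate(y_true, y_pred, 1)  # true positive rate


def specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return _hit_rate(y_true, y_pred, 0)  # true negative rate


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return (recall(y_true, y_pred) + specificity(y_true, y_pred)) / 2


def geometric_mean_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sqrt(recall(y_true, y_pred) * specificity(y_true, y_pred))


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

scores = {
    metric.__name__: metric(y_true, y_pred)
    for metric in (recall, specificity, balanced_accuracy, geometric_mean_score)
}
```

Metrics built from per-class rates, such as balanced accuracy and the geometric mean, are less dominated by the majority class than plain accuracy, which is why they are useful on imbalanced data.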

class edgaro.model.model.SKLEARNModelProtocol(*args, **kwargs)

Bases: Protocol

A Protocol defining the expected structure of a model from the scikit-learn library.

fit(X: DataFrame, y: Series) → Any

predict(X: DataFrame) → ndarray

predict_proba(X: DataFrame) → ndarray

get_params() → Dict

set_params(**params) → Any
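
Because this is a Protocol, a class matches it structurally: it only needs to provide the five methods listed above and does not have to inherit from anything. The sketch below illustrates this with a hypothetical ClassifierLike protocol and a toy ConstantClassifier. Both names are invented for the example, Any is used where the real protocol uses DataFrame, Series and ndarray, and the runtime_checkable decorator is an assumption added so that isinstance can be demonstrated.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class ClassifierLike(Protocol):
    """Hypothetical protocol mirroring the five methods listed above."""

    def fit(self, X: Any, y: Any) -> Any: ...
    def predict(self, X: Any) -> Any: ...
    def predict_proba(self, X: Any) -> Any: ...
    def get_params(self) -> Dict: ...
    def set_params(self, **params: Any) -> Any: ...


class ConstantClassifier:
    """Toy classifier that always predicts one class; no base class required."""

    def __init__(self, constant: int = 0) -> None:
        self.constant = constant

    def fit(self, X: Any, y: Any) -> "ConstantClassifier":
        return self  # nothing to learn

    def predict(self, X: Any) -> List[int]:
        return [self.constant for _ in X]

    def predict_proba(self, X: Any) -> List[List[float]]:
        p1 = float(self.constant)
        return [[1.0 - p1, p1] for _ in X]  # columns: class 0, class 1

    def get_params(self) -> Dict:
        return {"constant": self.constant}

    def set_params(self, **params: Any) -> "ConstantClassifier":
        for key, value in params.items():
            setattr(self, key, value)
        return self


# isinstance on a runtime-checkable Protocol only checks that the methods exist.
matches = isinstance(ConstantClassifier(), ClassifierLike)
does_not_match = isinstance(object(), ClassifierLike)
```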

class edgaro.model.model.ModelFromSKLEARN(base_model: SKLEARNModelProtocol, name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False)

Bases: Model

Create a Model from a model in the scikit-learn library.

Parameters:

base_model (SKLEARNModelProtocol) – A model from the scikit-learn library. Note: this object has to be clean (not fitted).
name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when the Dataset object has not already been train-test-split. If the Dataset object was not split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.

property was_fitted: bool

Whether the Model has been fitted.

Return type: bool

get_params() → Dict

Get the parameters of the Model.

Returns: The parameters.
Return type: Dict, list

class edgaro.model.model.RandomForest(name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False, *args, **kwargs)

Bases: ModelFromSKLEARN

Create a RandomForest Model from the RandomForestClassifier implementation in the scikit-learn library.

Parameters:

name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when the Dataset object has not already been train-test-split. If the Dataset object was not split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.
*args (tuple, optional) – Additional positional parameters for RandomForestClassifier from the scikit-learn library.
**kwargs (dict, optional) – Additional keyword parameters for RandomForestClassifier from the scikit-learn library.

class edgaro.model.model.XGBoost(name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False, *args, **kwargs)

Bases: ModelFromSKLEARN

Create an XGBoost Model from the XGBClassifier implementation in the xgboost library.

Parameters:

name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when the Dataset object has not already been train-test-split. If the Dataset object was not split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.
*args (tuple, optional) – Additional positional parameters for XGBClassifier from the xgboost library.
**kwargs (dict, optional) – Additional keyword parameters for XGBClassifier from the xgboost library.

class edgaro.model.model.RandomSearchCV(base_model: ModelFromSKLEARN, param_grid: Dict, n_iter: int = 10, cv: int = 5, scoring: str = 'balanced_accuracy', name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False, *args, **kwargs)

Bases: ModelFromSKLEARN

Create a Model that performs Random Search on any model implementation matching SKLEARNModelProtocol.

Parameters:

base_model (SKLEARNModelProtocol) – A model from the scikit-learn library. Note: this object has to be clean (not fitted).
param_grid (Dict) – A parameter grid for searching.
n_iter (int) – Number of iterations to be performed.
cv (int) – Number of cross-validation folds.
scoring (str) – Name of a function to be used to choose the best model.
name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when the Dataset object has not already been train-test-split. If the Dataset object was not split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.

class edgaro.model.model.GridSearchCV(base_model: ModelFromSKLEARN, param_grid: Dict, cv: int = 5, scoring: str = 'balanced_accuracy', name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False, *args, **kwargs)

Bases: ModelFromSKLEARN

Create a Model that performs Grid Search on any model implementation matching SKLEARNModelProtocol.

Parameters:

base_model (SKLEARNModelProtocol) – A model from the scikit-learn library. Note: this object has to be clean (not fitted).
param_grid (Dict) – A parameter grid for searching.
cv (int) – Number of cross-validation folds.
scoring (str) – Name of a function to be used to choose the best model.
name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when the Dataset object has not already been train-test-split. If the Dataset object was not split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.
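
The difference between the two search classes comes down to which candidates from param_grid are tried: grid search evaluates every combination, while random search evaluates only n_iter of them. The sketch below enumerates candidates in plain Python. The helper names grid_candidates and random_candidates are hypothetical, the example grid is made up, and the sampling shown (without replacement from the full grid) is only an approximation of what a real randomized search does.

```python
import itertools
import random
from typing import Any, Dict, List, Optional, Sequence


def grid_candidates(param_grid: Dict[str, Sequence[Any]]) -> List[Dict[str, Any]]:
    """Every combination of the listed values (what grid search evaluates)."""
    keys = list(param_grid)
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


def random_candidates(param_grid: Dict[str, Sequence[Any]], n_iter: int,
                      random_state: Optional[int] = None) -> List[Dict[str, Any]]:
    """At most n_iter distinct combinations, drawn with a seeded generator."""
    rng = random.Random(random_state)
    candidates = grid_candidates(param_grid)
    return rng.sample(candidates, min(n_iter, len(candidates)))


param_grid = {"n_estimators": [10, 50, 100], "max_depth": [3, 5]}

all_candidates = grid_candidates(param_grid)                     # 3 * 2 combinations
sampled = random_candidates(param_grid, n_iter=4, random_state=0)
```

With cv folds, each candidate is fitted cv times, so the cost of grid search grows with the product of the value-list lengths, whereas random search is capped by n_iter.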

edgaro.model.model_array module

class edgaro.model.model_array.ModelArray(base_model: Model, parameters: List[List | Dict[str, Any]] | None = None, name: str = '', verbose: bool = False)

Bases: BaseTransformerArray

Create a class to train a Model for each Dataset in a DatasetArray.

Parameters:

base_model (Model) – The object defining the basic Model training procedure. The base_model object has to be clean, that is, it must not have been fitted earlier.
parameters (list[list | Dict[str, Any]], optional) – The list of parameters for base_model. If the object is used for a DatasetArray object, the parameter list should be nested. For details, see the Examples section.
name (str) – A name of the ModelArray.
verbose (bool, default=False) – Print messages during calculations.

Variables:

name (str) – A name of the ModelArray.
verbose (bool) – Print messages during calculations.

Examples

Example 1

>>> from test.resources.objects import *
>>> from edgaro.data.dataset import Dataset
>>> from edgaro.data.dataset_array import DatasetArray
>>> from edgaro.model.model import RandomForest
>>> from edgaro.model.model_array import ModelArray
>>> df = Dataset(name_1, df_1, target_1)
>>> params = [{'n_estimators': 20}]
>>> model = RandomForest()
>>> array = ModelArray(model, parameters=params)
>>> array.fit(df)
>>> array.predict(df)

Example 2

>>> from test.resources.objects import *
>>> from edgaro.data.dataset import Dataset
>>> from edgaro.data.dataset_array import DatasetArray
>>> from edgaro.model.model import RandomForest
>>> from edgaro.model.model_array import ModelArray
>>> df = DatasetArray([Dataset(name_2, df_1, target_1), Dataset(name_1, df_1, target_1)])
>>> params = [[{'n_estimators': 20}] for _ in range(len(df))]
>>> model = RandomForest()
>>> array = ModelArray(model, parameters=params)
>>> array.fit(df)
>>> array.predict(df)

Example 3

>>> from test.resources.objects import *
>>> from edgaro.data.dataset import Dataset
>>> from edgaro.data.dataset_array import DatasetArray
>>> from edgaro.model.model import RandomForest
>>> from edgaro.model.model_array import ModelArray
>>> df = DatasetArray([
...         Dataset(name_2, df_1, target_1),
...         DatasetArray([Dataset(name_2, df_1, target_1), Dataset(name_1, df_1, target_1)])
... ])
>>> params = [[{'n_estimators': 20}], [{'n_estimators': 10}, {'n_estimators': 30}]]
>>> model = RandomForest()
>>> array = ModelArray(model, parameters=params)
>>> array.fit(df)
>>> array.predict(df)

fit(dataset: Dataset | DatasetArray) → None

Fit the ModelArray.

The fitting process includes encoding the categorical variables with OrdinalEncoder (from the scikit-learn library) and encoding the target (a custom encoding: the minority class is encoded as 1, the majority class as 0).

The method assumes that the categorical variables to be encoded have one of the dtypes ‘category’ or ‘object’.

Parameters: dataset (Dataset, DatasetArray) – The object to fit the ModelArray on.

predict(dataset: Dataset | DatasetArray) → Dataset | DatasetArray

Predict the class for a Dataset/DatasetArray object.

Parameters: dataset (Dataset, DatasetArray) – A Dataset/DatasetArray object to make predictions on.
Return type: Dataset, DatasetArray

predict_proba(dataset: Dataset | DatasetArray) → Dataset | DatasetArray

Predict the probability of class 1 for a Dataset/DatasetArray object.

Parameters: dataset (Dataset, DatasetArray) – A Dataset/DatasetArray object to make predictions on.
Return type: Dataset, DatasetArray

get_models() → List[Model | ModelArray | List[Model | ModelArray]]

Get all the Model/ModelArray objects used by this object.

Return type: list[Model, ModelArray, list]

set_transform_to_probabilities() → None: Make transform function return probabilities.

set_transform_to_classes() → None: Make transform function return classes.

evaluate(metrics_output_class: List[Callable[[Series, Series], float]] | None = None, metrics_output_probabilities: List[Callable[[Series, Series], float]] | None = None, ds: Dataset | DatasetArray | None = None) → DataFrame

Evaluate the models.

Parameters:

metrics_output_class (list[Callable[[pd.Series, pd.Series], float]], optional, default=None) – List of functions to calculate metrics on predicted classes. If None is passed, accuracy, balanced accuracy, precision, recall, specificity, f1, f1_weighted, geometric mean score are used.
metrics_output_probabilities (list[Callable[[pd.Series, pd.Series], float]], optional, default=None) – List of functions to calculate metrics on predicted probabilities. If None is passed, ROC AUC is used.
ds (Dataset, DatasetArray, optional, default=None) – A Dataset/DatasetArray object to calculate the metrics on. If None is passed, the test Dataset/DatasetArray from fitting is used.

Return type: pd.DataFrame

property transformers: List[Model | ModelArray | List]

All the Model/ModelArray objects used by this object.

Return type: list[Model, ModelArray, list]

property base_transformer: Model

The base transformer used to create this object.

Return type: Model