edgaro.model package
edgaro.model.model module
- class edgaro.model.model.Model(name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False, majority_class_label: str | None = None)
Bases:
BaseTransformer, ABC
The abstract class to define a Machine Learning model for a single Dataset.
- Parameters:
name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when a Dataset object has not been train-test-split. If the Dataset object has not been split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.
- Variables:
name (str) – A name of the Model.
test_size (float, optional) – Test size used when a Dataset object has not been train-test-split. If the Dataset object has not been split and this parameter is None, training is done on all the data.
random_state (int, optional) – Random state seed.
verbose (bool) – Print messages during calculations.
majority_class_label (str, optional, default=None) – The label of the majority class. Passing this argument is recommended; otherwise the label is guessed. The guess may be wrong if the dataset is balanced, in which case experiment results may be wrong too. If None, the Model tries to read the label from the majority_class_label attribute of the Dataset object.
- fit(dataset: Dataset, print_scores: bool = False) None
Fit the Model.
The fitting process includes encoding the categorical variables with OrdinalEncoder (from the scikit-learn library) and encoding the target (custom encoding: the minority class is encoded as 1, the majority class as 0).
The method assumes that the categorical variables to be encoded have one of the types ‘category’ or ‘object’.
- Parameters:
dataset (Dataset) – The object to fit Model on.
print_scores (bool, default=False) – Indicates whether model evaluation on a test dataset should be printed at the end of fitting.
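The target encoding described above, and the reason a guessed majority class can be wrong on balanced data, can be sketched in plain Python. This is only an illustration and not edgaro's implementation; the helper name `encode_target` is hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def encode_target(
    target: Sequence[Hashable],
    majority_class_label: Optional[Hashable] = None,
) -> List[int]:
    """Encode a binary target: majority class -> 0, minority class -> 1."""
    counts = Counter(target)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    if majority_class_label is None:
        # Guess the majority class from label frequencies. With a perfectly
        # balanced target the counts tie, so the guess is arbitrary.
        majority_class_label = counts.most_common(1)[0][0]
    elif majority_class_label not in counts:
        raise ValueError("majority_class_label is not present in the target")
    return [0 if label == majority_class_label else 1 for label in target]


labels = ["no", "no", "yes", "no", "yes", "no"]
encoded = encode_target(labels)  # "no" is guessed as the majority class
explicit = encode_target(labels, majority_class_label="no")
balanced = encode_target(["a", "b"], majority_class_label="b")
```

With an explicit label the result does not depend on class frequencies, which is why passing `majority_class_label` is the safer choice for balanced datasets.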
- property was_fitted: bool
Whether the Model has been fitted.
- Return type:
bool
- transform_data(dataset: Dataset) Dataset
Encode dataset.data with the rules generated after fitting this object.
- Parameters:
dataset (Dataset) – A Dataset object, where .data attribute will be encoded. The method returns a new object.
- Return type:
Dataset
- transform_target(dataset: Dataset) Dataset
Encode dataset.target with the rules generated after fitting this object.
- Parameters:
dataset (Dataset) – A Dataset object, where .target attribute will be encoded. The method returns a new object.
- Return type:
Dataset
- predict(dataset: Dataset) Dataset
Predict the class for a Dataset object.
- Parameters:
dataset (Dataset) – A Dataset object to make predictions on.
- Return type:
Dataset
- predict_proba(dataset: Dataset) Dataset
Predict the probability of class 1 for a Dataset object.
- Parameters:
dataset (Dataset) – A Dataset object to make predictions on.
- Return type:
Dataset
- set_params(**params) None
Set params for Model.
- Parameters:
params (dict) – The parameters to be set.
- abstract get_params() Dict
Get parameters of Model.
- Returns:
The parameters.
- Return type:
Dict, list
- transform(dataset: Dataset) Dataset
A function to make the Model compatible with BaseTransformer.
It can either return predicted classes or predicted probabilities - it can be set using set_transform_to_probabilities and set_transform_to_classes functions.
- Parameters:
dataset (Dataset) – A Dataset object to be transformed.
- Return type:
Dataset
- set_transform_to_probabilities() None
Make transform function return probabilities.
- set_transform_to_classes() None
Make transform function return classes.
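The switch between class and probability output in transform can be pictured with a toy class. This is a sketch of the dispatch pattern only, under the assumption that classes are the initial output (the source does not state the default); `ToggleSketch` and its threshold are hypothetical and unrelated to edgaro's internals.

```python
from typing import List, Sequence, Union


class ToggleSketch:
    """Toy stand-in for a model whose transform() output is switchable."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        # Assumption: classes are the default output; the source does not say.
        self._to_probabilities = False

    def set_transform_to_probabilities(self) -> None:
        self._to_probabilities = True

    def set_transform_to_classes(self) -> None:
        self._to_probabilities = False

    def predict_proba(self, scores: Sequence[float]) -> List[float]:
        # The toy "model" treats its inputs as class-1 probabilities.
        return [float(s) for s in scores]

    def predict(self, scores: Sequence[float]) -> List[int]:
        return [1 if s >= self.threshold else 0 for s in scores]

    def transform(self, scores: Sequence[float]) -> Union[List[int], List[float]]:
        # Dispatch to whichever output the flag currently selects.
        if self._to_probabilities:
            return self.predict_proba(scores)
        return self.predict(scores)


model = ToggleSketch()
as_classes = model.transform([0.2, 0.7])
model.set_transform_to_probabilities()
as_probabilities = model.transform([0.2, 0.7])
```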
- get_category_colnames() List[str]
Get category column names, which were encoded during the fitting process.
- Return type:
list(str)
- evaluate(metrics_output_class: List[Callable[[Series, Series], float]] | None = None, metrics_output_probabilities: List[Callable[[Series, Series], float]] | None = None, ds: Dataset | None = None) DataFrame
Evaluate model.
- Parameters:
metrics_output_class (list[Callable[[pd.Series, pd.Series], float]], optional, default=None) – List of functions to calculate metrics on predicted classes. If None is passed, accuracy, balanced accuracy, precision, recall, specificity, f1, f1_weighted, geometric mean score are used.
metrics_output_probabilities (list[Callable[[pd.Series, pd.Series], float]], optional, default=None) – List of functions to calculate metrics on predicted probabilities. If None is passed, ROC AUC is used.
ds (Dataset, optional, default=None) – A Dataset object to calculate the metrics on. If None is passed, the test Dataset from fitting is used.
- Return type:
pd.DataFrame
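A custom metric for metrics_output_class only needs to accept two sequences of labels and return a float. The sketch below assumes the argument order (true labels, predicted labels), mirroring the scikit-learn convention; the source does not confirm the order. It uses plain sequences so that it runs without pandas.

```python
from typing import Sequence


def specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """True-negative rate TN / (TN + FP), treating 0 as the negative class."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tn + fp == 0:
        # No negative examples: return 0.0 rather than divide by zero.
        return 0.0
    return tn / (tn + fp)


y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0]
score = specificity(y_true, y_pred)  # 2 true negatives, 1 false positive
```

A function of this shape could presumably be passed as `metrics_output_class=[specificity]`.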
- class edgaro.model.model.SKLEARNModelProtocol(*args, **kwargs)
Bases:
Protocol
A Protocol to define the expected structure of a Model from scikit-learn library.
- fit(X: DataFrame, y: Series) Any
- predict(X: DataFrame) ndarray
- predict_proba(X: DataFrame) ndarray
- get_params() Dict
- set_params(**params) Any
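Because this is a Protocol, an object matches it by having the listed methods, not by inheriting from a particular base class. The sketch below re-declares an equivalent protocol with `Any` in place of the pandas and NumPy types so that it runs with the standard library only; `ClassifierLike` and `ConstantClassifier` are hypothetical names used for illustration.

```python
from typing import Any, Dict, Protocol, runtime_checkable


@runtime_checkable
class ClassifierLike(Protocol):
    """Stand-in for SKLEARNModelProtocol with simplified type hints."""

    def fit(self, X: Any, y: Any) -> Any: ...
    def predict(self, X: Any) -> Any: ...
    def predict_proba(self, X: Any) -> Any: ...
    def get_params(self) -> Dict: ...
    def set_params(self, **params: Any) -> Any: ...


class ConstantClassifier:
    """Toy estimator that always predicts one class; no inheritance needed."""

    def __init__(self, constant: int = 0) -> None:
        self.constant = constant

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [self.constant for _ in X]

    def predict_proba(self, X):
        p1 = float(self.constant)
        return [[1.0 - p1, p1] for _ in X]

    def get_params(self):
        return {"constant": self.constant}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self


clf = ConstantClassifier().set_params(constant=1)
# runtime_checkable protocols only check that the methods exist.
matches = isinstance(clf, ClassifierLike)
```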
- class edgaro.model.model.ModelFromSKLEARN(base_model: SKLEARNModelProtocol, name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False)
Bases:
Model
Create a Model from a model in the scikit-learn library.
- Parameters:
base_model (SKLEARNModelProtocol) – A model from scikit-learn library. Note: this object has to be clean (not fitted).
name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when a Dataset object has not been train-test-split. If the Dataset object has not been split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.
- property was_fitted: bool
Whether the Model has been fitted.
- Return type:
bool
- get_params() Dict
Get parameters of Model.
- Returns:
The parameters.
- Return type:
Dict, list
- class edgaro.model.model.RandomForest(name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False, *args, **kwargs)
Bases:
ModelFromSKLEARN
Create a RandomForest Model from the RandomForestClassifier implementation in the scikit-learn library.
- Parameters:
name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when a Dataset object has not been train-test-split. If the Dataset object has not been split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.
*args (tuple, optional) – Additional parameters for RandomForestClassifier from scikit-learn library.
**kwargs (dict, optional) – Additional parameters for RandomForestClassifier from scikit-learn library.
- class edgaro.model.model.XGBoost(name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False, *args, **kwargs)
Bases:
ModelFromSKLEARN
Create an XGBoost Model from the XGBClassifier implementation in the xgboost library.
- Parameters:
name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when a Dataset object has not been train-test-split. If the Dataset object has not been split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.
*args (tuple, optional) – Additional parameters for XGBClassifier from xgboost library.
**kwargs (dict, optional) – Additional parameters for XGBClassifier from xgboost library.
- class edgaro.model.model.RandomSearchCV(base_model: ModelFromSKLEARN, param_grid: Dict, n_iter: int = 10, cv: int = 5, scoring: str = 'balanced_accuracy', name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False, *args, **kwargs)
Bases:
ModelFromSKLEARN
Create a Model that performs Random Search on any model implementation matching SKLEARNModelProtocol.
- Parameters:
base_model (SKLEARNModelProtocol) – A model from scikit-learn library. Note: this object has to be clean (not fitted).
param_grid (Dict) – A parameter grid for searching.
n_iter (int) – Number of iterations to be performed.
cv (int) – Number of cross-validation folds.
scoring (str) – Name of a function to be used to choose the best model.
name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when a Dataset object has not been train-test-split. If the Dataset object has not been split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.
- class edgaro.model.model.GridSearchCV(base_model: ModelFromSKLEARN, param_grid: Dict, cv: int = 5, scoring: str = 'balanced_accuracy', name: str = '', test_size: float | None = None, random_state: int | None = None, verbose: bool = False, *args, **kwargs)
Bases:
ModelFromSKLEARN
Create a Model that performs Grid Search on any model implementation matching SKLEARNModelProtocol.
- Parameters:
base_model (SKLEARNModelProtocol) – A model from scikit-learn library. Note: this object has to be clean (not fitted).
param_grid (Dict) – A parameter grid for searching.
cv (int) – Number of cross-validation folds.
scoring (str) – Name of a function to be used to choose the best model.
name (str) – A name of the Model.
test_size (float, optional, default=None) – Test size used when a Dataset object has not been train-test-split. If the Dataset object has not been split and this parameter is None, training is done on all the data.
random_state (int, optional, default=None) – Random state seed.
verbose (bool, default=False) – Print messages during calculations.
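The difference between the two search strategies comes down to which candidates from param_grid get tried: grid search evaluates every combination, while random search evaluates only n_iter of them. The sketch below enumerates the candidates with the standard library only; it illustrates the idea and is not how edgaro or scikit-learn implement the search. The helper names are hypothetical, and the grid keys are chosen purely for illustration.

```python
import itertools
import random
from typing import Any, Dict, List, Optional


def grid_candidates(param_grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Every combination of the listed values (what a grid search would try)."""
    keys = sorted(param_grid)
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


def random_candidates(
    param_grid: Dict[str, List[Any]],
    n_iter: int,
    random_state: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """A random subset of at most n_iter combinations, drawn without replacement."""
    rng = random.Random(random_state)
    candidates = grid_candidates(param_grid)
    return rng.sample(candidates, min(n_iter, len(candidates)))


param_grid = {"n_estimators": [10, 50, 100], "max_depth": [3, 5]}
all_combos = grid_candidates(param_grid)  # 3 * 2 = 6 candidates
sampled = random_candidates(param_grid, n_iter=4, random_state=0)
```

Each candidate would then be scored with cv-fold cross-validation using the chosen scoring function, and the best one kept.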
edgaro.model.model_array module
- class edgaro.model.model_array.ModelArray(base_model: Model, parameters: List[List | Dict[str, Any]] | None = None, name: str = '', verbose: bool = False)
Bases:
BaseTransformerArray
Create a class to train Models for each of the Dataset in DatasetArray.
- Parameters:
base_model (Model) – The object defining the basic Model training procedure. The base_model object has to be clean - it cannot be fitted earlier.
parameters (list[list | Dict[str, Any]], optional) – The list of parameters for base_model. If the object is used for a DatasetArray object, the parameter list should be nested. For details, see the Examples section.
name (str) – A name of the ModelArray.
verbose (bool, default=False) – Print messages during calculations.
- Variables:
name (str) – A name of the ModelArray.
verbose (bool) – Print messages during calculations.
Examples
Example 1
>>> from test.resources.objects import *
>>> from edgaro.data.dataset import Dataset
>>> from edgaro.data.dataset_array import DatasetArray
>>> from edgaro.model.model import RandomForest
>>> from edgaro.model.model_array import ModelArray
>>> df = Dataset(name_1, df_1, target_1)
>>> params = [{'n_estimators': 20}]
>>> model = RandomForest()
>>> array = ModelArray(model, parameters=params)
>>> array.fit(df)
>>> array.predict(df)
Example 2
>>> from test.resources.objects import *
>>> from edgaro.data.dataset import Dataset
>>> from edgaro.data.dataset_array import DatasetArray
>>> from edgaro.model.model import RandomForest
>>> from edgaro.model.model_array import ModelArray
>>> df = DatasetArray([Dataset(name_2, df_1, target_1), Dataset(name_1, df_1, target_1)])
>>> params = [[{'n_estimators': 20}] for _ in range(len(df))]
>>> model = RandomForest()
>>> array = ModelArray(model, parameters=params)
>>> array.fit(df)
>>> array.predict(df)
Example 3
>>> from test.resources.objects import *
>>> from edgaro.data.dataset import Dataset
>>> from edgaro.data.dataset_array import DatasetArray
>>> from edgaro.model.model import RandomForest
>>> from edgaro.model.model_array import ModelArray
>>> df = DatasetArray([
...     Dataset(name_2, df_1, target_1),
...     DatasetArray([Dataset(name_2, df_1, target_1), Dataset(name_1, df_1, target_1)])
... ])
>>> params = [[{'n_estimators': 20}], [{'n_estimators': 10}, {'n_estimators': 30}]]
>>> model = RandomForest()
>>> array = ModelArray(model, parameters=params)
>>> array.fit(df)
>>> array.predict(df)
- fit(dataset: Dataset | DatasetArray) None
Fit the ModelArray.
The fitting process includes encoding the categorical variables with OrdinalEncoder (from the scikit-learn library) and encoding the target (custom encoding: the minority class is encoded as 1, the majority class as 0).
The method assumes that the categorical variables to be encoded have one of the types ‘category’ or ‘object’.
- Parameters:
dataset (Dataset, DatasetArray) – The object to fit Model on.
- predict(dataset: Dataset | DatasetArray) Dataset | DatasetArray
Predict the class for a Dataset/DatasetArray object.
- Parameters:
dataset (Dataset, DatasetArray) – A Dataset/DatasetArray object to make predictions on.
- Return type:
Dataset, DatasetArray
- predict_proba(dataset: Dataset | DatasetArray) Dataset | DatasetArray
Predict the probability of class 1 for a Dataset/DatasetArray object.
- Parameters:
dataset (Dataset, DatasetArray) – A Dataset/DatasetArray object to make predictions on.
- Return type:
Dataset, DatasetArray
- get_models() List[Model | ModelArray | List[Model | ModelArray]]
All the Model/ModelArray objects used by this object.
- Return type:
list[Model, ModelArray, list]
- set_transform_to_probabilities() None
Make transform function return probabilities.
- set_transform_to_classes() None
Make transform function return classes.
- evaluate(metrics_output_class: List[Callable[[Series, Series], float]] | None = None, metrics_output_probabilities: List[Callable[[Series, Series], float]] | None = None, ds: Dataset | DatasetArray | None = None) DataFrame
Evaluate model.
- Parameters:
metrics_output_class (list[Callable[[pd.Series, pd.Series], float]], optional, default=None) – List of functions to calculate metrics on predicted classes. If None is passed, accuracy, balanced accuracy, precision, recall, specificity, f1, f1_weighted, geometric mean score are used.
metrics_output_probabilities (list[Callable[[pd.Series, pd.Series], float]], optional, default=None) – List of functions to calculate metrics on predicted probabilities. If None is passed, ROC AUC is used.
ds (Dataset, DatasetArray, optional, default=None) – A Dataset/DatasetArray object to calculate the metrics on. If None is passed, the test Dataset/DatasetArray from fitting is used.
- Return type:
pd.DataFrame
- property transformers: List[Model | ModelArray | List]
All the Model objects used by this object.
- Return type:
list[Model, ModelArray, list]