User manual

This manual covers the most important use cases of the package. For details and additional input parameters, see the full documentation.

Defining a dataset

To start working with the package, you can use an example dataset, which can be loaded with the code below.

from edgaro.data.dataset import load_mammography

df = load_mammography()
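
For example, you can immediately check how imbalanced the loaded dataset is (the imbalance_ratio property is described later in this manual):

print(df.imbalance_ratio)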

It is also possible to create a Dataset object from a pandas.DataFrame object in your Python session:

# X - your dataframe with features, pd.DataFrame
# y - your target variable, pd.Series

from edgaro.data.dataset import Dataset

df = Dataset(X, y)
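
For instance, a minimal self-contained sketch with toy data (the values below are purely illustrative) could look as follows:

import pandas as pd

from edgaro.data.dataset import Dataset

X = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4.0, 3.5, 2.1, 0.3]})
y = pd.Series([0, 0, 0, 1], name='target')

df = Dataset(X, y)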

Apart from that, you can create a Dataset object from a *.csv file or from a dataset defined in OpenML:

from edgaro.data.dataset import DatasetFromCSV, DatasetFromOpenML

df = DatasetFromCSV('your_path')
df = DatasetFromOpenML(task_id=1)

The Dataset object offers many handy functionalities, for example splitting into train and test datasets (inside the object), removing missing values (NaNs) and calculating the Imbalance Ratio.

df.remove_nans()
df.train_test_split(test_size=0.2)
df.imbalance_ratio

A DatasetArray object is used to represent an array of Dataset objects; it is composed of a list of Datasets. The object offers vectorized versions of the functions described above, such as splitting into train and test datasets (inside the object) and removing missing values.

from edgaro.data.dataset_array import DatasetArray

df_array = DatasetArray([df1, df2, df3])
df_array.remove_nans()
df_array.train_test_split(test_size=0.2)

It is also possible to load an example DatasetArray, which can be used for benchmarking. It contains datasets selected for benchmarking purposes from three main sources (OpenML-100, OpenML-CC18 and the imbalanced-learn library). The details of the benchmarking set are available at https://github.com/adrianstando/imbalanced-benchmarking-set.

from edgaro.data.dataset_array import load_benchmarking_set

df_array = load_benchmarking_set()

It is also possible to load a benchmark suite from OpenML into a DatasetArray object.

from edgaro.data.dataset_array import DatasetArrayFromOpenMLSuite

df_array = DatasetArrayFromOpenMLSuite(suite_name='OpenML100')

Balancing datasets

To balance a dataset, a Transformer abstract class is defined. It is possible to create custom balancing methods by extending this class (see the sketch later in this section), or you can use the methods implemented in the imblearn library. First, a transformer has to be fitted to a dataset; then the dataset can be balanced by calling the transform method.

from edgaro.balancing.transformer import TransformerFromIMBLEARN
from imblearn.under_sampling import RandomUnderSampler

transformer = TransformerFromIMBLEARN(
    RandomUnderSampler(sampling_strategy=1, random_state=42)
)
transformer.fit(dataset)
dataset_transformed = transformer.transform(dataset)

You can also define a custom suffix that will be added to the balanced Dataset’s name.

from edgaro.balancing.transformer import TransformerFromIMBLEARN
from imblearn.under_sampling import RandomUnderSampler

transformer = TransformerFromIMBLEARN(
    RandomUnderSampler(sampling_strategy=1, random_state=42),
    name_sufix='_new_sufix123'
)
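
If you want to implement a custom balancing method, you can extend the Transformer class. The sketch below is illustrative only: the hook names _fit and _transform, as well as the dataset.data and dataset.target attributes, are assumptions, so check the Transformer and Dataset API references for the exact interface in your version.

from edgaro.balancing.transformer import Transformer
from edgaro.data.dataset import Dataset

class MyRandomUndersampler(Transformer):
    # NOTE: this is a sketch; _fit/_transform are assumed to be the abstract
    # hooks of Transformer, and dataset.data / dataset.target are assumed to
    # expose the features and the target. Depending on the version,
    # set_params/get_params may also need to be implemented.

    def _fit(self, dataset: Dataset) -> None:
        # Remember the majority class label.
        self.majority_class = dataset.target.mode()[0]

    def _transform(self, dataset: Dataset) -> Dataset:
        # Naively drop half of the majority-class rows.
        majority = dataset.target[dataset.target == self.majority_class]
        drop_index = majority.sample(frac=0.5, random_state=42).index
        return Dataset(dataset.data.drop(index=drop_index),
                       dataset.target.drop(index=drop_index))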

One extension worth mentioning is NestedAutomaticTransformer. It behaves like a Transformer object but wraps several balancing methods inside. Moreover, based on the n_per_method argument, it automatically sets intermediate Imbalance Ratios (to investigate how the resulting models change with changing IR). For example, let an original dataset have \(IR=10\) and \(n\_per\_method=3\). Then, the IR will be set to \([7, 4, 1]\). If you set keep_original_dataset to True, the original dataset will be preserved in the resulting DatasetArray object.

The BasicAutomaticTransformer class contains the three most popular balancing techniques (Random UnderSampling, Random OverSampling and SMOTE).

from edgaro.balancing.nested_transformer import BasicAutomaticTransformer

transformer = BasicAutomaticTransformer()
transformer.fit(dataset)
dataset_transformed = transformer.transform(dataset)
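
For example, to obtain three intermediate Imbalance Ratios per method and to keep the original dataset in the output, you can pass the arguments described above (assuming both are constructor parameters):

from edgaro.balancing.nested_transformer import BasicAutomaticTransformer

transformer = BasicAutomaticTransformer(n_per_method=3, keep_original_dataset=True)
transformer.fit(dataset)
dataset_array_transformed = transformer.transform(dataset)

Since several balanced variants are produced per method, the result is a DatasetArray rather than a single Dataset.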

The balancing submodule also offers an array interface. You should use this class if your input is a DatasetArray object, or if you want to balance the input with different sets of parameters (in that case, both DatasetArray and Dataset inputs are valid).

To apply the same balancing technique to a DatasetArray, you have to use the TransformerArray class:

from edgaro.balancing.transformer_array import TransformerArray
from edgaro.balancing.transformer import TransformerFromIMBLEARN
from imblearn.under_sampling import RandomUnderSampler

transformer = TransformerArray(TransformerFromIMBLEARN(
    RandomUnderSampler(sampling_strategy=1, random_state=42)
))
transformer.fit(dataset_array)
dataset_array_transformed = transformer.transform(dataset_array)

As in the Transformer class, you can set name suffixes:

from edgaro.balancing.transformer_array import TransformerArray
from edgaro.balancing.transformer import TransformerFromIMBLEARN
from imblearn.under_sampling import RandomUnderSampler

transformer = TransformerArray(
    TransformerFromIMBLEARN(
        RandomUnderSampler(sampling_strategy=1, random_state=42)
    ),
    dataset_suffixes=['_suffix1', '_suffix2']
)

You can also set parameters - their nested structure should match the DatasetArray structure.

from edgaro.balancing.transformer_array import TransformerArray
from edgaro.balancing.transformer import TransformerFromIMBLEARN
from imblearn.under_sampling import RandomUnderSampler
from edgaro.data.dataset_array import DatasetArray

dataset_array = DatasetArray([dataset1, dataset2])

transformer = TransformerArray(
    TransformerFromIMBLEARN(RandomUnderSampler()),
    parameters=[
        [
            {'sampling_strategy': 0.98},
            {'sampling_strategy': 1},
            {'sampling_strategy': 0.9, 'random_state': 42}
        ] for _ in range(2)
    ]
)
transformer.fit(dataset_array)
dataset_array_transformed = transformer.transform(dataset_array)

Note: if a Dataset object was train-test-split, the balancing methods are applied only on the training datasets and the test datasets remain untouched.

Training a model

The classes in the model module have an interface similar to those in the balancing module. There is a Model class, which is abstract and can be extended with any ML model implementation. One possible solution is to use scikit-learn models via the ModelFromSKLEARN class.

This module also contains a ModelArray class, which behaves very similarly to the TransformerArray class. However, instead of transforming a Dataset, the class makes predictions or returns probabilities. The returned objects are also Dataset objects.

from edgaro.model.model import ModelFromSKLEARN
from sklearn.ensemble import RandomForestClassifier

model = ModelFromSKLEARN(RandomForestClassifier())
model.fit(dataset)
predictions = model.predict(dataset)
predictions_probability = model.predict_proba(dataset)

An example of using ModelArray, i.e. when the input is a DatasetArray object:

from edgaro.model.model import ModelFromSKLEARN
from edgaro.model.model_array import ModelArray
from sklearn.ensemble import RandomForestClassifier

model = ModelArray(ModelFromSKLEARN(RandomForestClassifier()))
model.fit(dataset_array)
predictions = model.predict(dataset_array)
predictions_probability = model.predict_proba(dataset_array)

There is also a function to evaluate the model. If the input parameter is not provided and the object was train-test-split, the evaluation is performed on the test dataset.

from edgaro.model.model import ModelFromSKLEARN
from edgaro.model.model_array import ModelArray
from sklearn.ensemble import RandomForestClassifier

model = ModelArray(ModelFromSKLEARN(RandomForestClassifier()))
model.fit(dataset_array)
model.evaluate()

model = ModelFromSKLEARN(RandomForestClassifier())
model.fit(dataset)
model.evaluate()
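
The evaluate method returns the computed metric values, so you can store and inspect them; a minimal sketch (the exact return type, assumed here to be a pandas DataFrame, may differ):

metrics = model.evaluate()
print(metrics)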

Explaining and comparing explanations

To create explanations (PDP / ALE curves), the Explainer and ExplainerArray classes are provided. The former should be used when you have only one Model object, and the latter when you have a ModelArray. The interface of the explain module is similar to that of the model module:

from edgaro.explain.explainer import Explainer

exp = Explainer(model)
exp.fit()
explanation = exp.transform()
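
The manual mentions both PDP and ALE curves; the curve type can presumably be chosen when constructing the Explainer. The curve_type argument name below is an assumption, so verify it against the Explainer reference:

from edgaro.explain.explainer import Explainer

exp = Explainer(model, curve_type='ALE')
exp.fit()
explanation = exp.transform()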

In the case of a ModelArray input:

from edgaro.explain.explainer import Explainer
from edgaro.explain.explainer_array import ExplainerArray

exp = ExplainerArray(model_array)
exp.fit()
explanation = exp.transform()

These functions return ModelProfileExplanation and ModelProfileExplanationArray objects, respectively, which make it possible to compare explanations and visualise them.

In the case of a single Model:

from edgaro.explain.explainer import Explainer

exp = Explainer(model)
exp.fit()
explanation = exp.transform()
explanation.plot(variable='Col1')

In the case of a ModelArray:

from edgaro.explain.explainer import Explainer
from edgaro.explain.explainer_array import ExplainerArray

exp = ExplainerArray(model_array)
exp.fit()
explanation = exp.transform()
explanation.plot(variables=['Col1', 'Col2'])

In order to calculate the distance between the curves:

from edgaro.explain.explainer import Explainer
from edgaro.explain.explainer_array import ExplainerArray

exp = ExplainerArray(model_array)
exp.fit()
explanation = exp.transform()
explanation[0].compare(explanation[1], variable='Col1')

To create benchmarking summary plots, you can use, for example, the following code:

from edgaro.explain.explainer import Explainer
from edgaro.explain.explainer_array import ExplainerArray

exp = ExplainerArray(model_array)
exp.fit()
explanation = exp.transform()
explanation.plot_aggregate(['SMOTE', 'RandomOversample', 'RandomUndersample'])

The elements of the list passed to plot_aggregate() are regular expressions used to match and group the explanations. In this case, the output will be a boxplot with three boxes, one per method.

Note: if the input data objects were train-test-split, the explanations are calculated on the test dataset.