User manual
This manual covers the most important use cases of the package. For details and additional input parameters, see the full documentation.
Defining a dataset
To start working with the package, you can use an example dataset, which can be loaded using the code below.
from edgaro.data.dataset import load_mammography
df = load_mammography()
It is also possible to create a Dataset object from a pandas.DataFrame object in your Python session:
# X - your dataframe with features, pd.DataFrame
# y - your target variable, pd.Series
from edgaro.data.dataset import Dataset
df = Dataset(X, y)
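For instance, with a toy DataFrame (the column names and values below are purely illustrative), following the two-argument constructor shown above:
import pandas as pd
from edgaro.data.dataset import Dataset
# Purely illustrative data; any pd.DataFrame / pd.Series pair works here
X = pd.DataFrame({'Col1': [1, 2, 3, 4], 'Col2': [0.5, 0.1, 0.9, 0.7]})
y = pd.Series([0, 0, 0, 1], name='target')
df = Dataset(X, y)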
Apart from that, you can create a Dataset object from a *.csv file or from a dataset defined in OpenML:
from edgaro.data.dataset import DatasetFromCSV, DatasetFromOpenML
df = DatasetFromCSV('your_path')
df = DatasetFromOpenML(task_id=1)
The Dataset object offers many handy functionalities, for example splitting into train and test datasets (inside the object), removing None values and calculating the Imbalance Ratio.
df.remove_nans()
df.train_test_split(test_size=0.2)
df.imbalance_ratio
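For reference, the Imbalance Ratio (IR) is the ratio of the majority class size to the minority class size. A minimal pandas sketch of the same computation (assuming a binary target y, as the pd.Series above; not the package's internal code):
counts = y.value_counts()         # class sizes of the target variable
ir = counts.max() / counts.min()  # majority-to-minority ratio; 1 means balanced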
To represent an array of Dataset objects, the DatasetArray class is used. It is composed of a list of Dataset objects. It offers vectorized versions of the functions described above, such as splitting into train and test datasets (inside the object) and removing None values.
from edgaro.data.dataset_array import DatasetArray
df_array = DatasetArray([df1, df2, df3])
df_array.remove_nans()
df_array.train_test_split(test_size=0.2)
It is also possible to load an example DatasetArray, which can be used for benchmarking. It contains selected datasets for benchmarking purposes from three main sources (OpenML-100, OpenML-CC18 and the imbalanced-learn library). The details concerning the benchmarking set are available at https://github.com/adrianstando/imbalanced-benchmarking-set.
from edgaro.data.dataset_array import load_benchmarking_set
df_array = load_benchmarking_set()
It is also possible to load a benchmark suite from OpenML into a DatasetArray object.
from edgaro.data.dataset_array import DatasetArrayFromOpenMLSuite
df_array = DatasetArrayFromOpenMLSuite(suite_name='OpenML100')
Balancing datasets
To balance a dataset, the Transformer abstract class is defined. You can create custom balancing methods (by extending this class) or use the methods implemented in the imblearn library. First, a transformer has to be fitted on a dataset; then the dataset can be balanced by calling the transform method.
from edgaro.balancing.transformer import TransformerFromIMBLEARN
from imblearn.under_sampling import RandomUnderSampler
transformer = TransformerFromIMBLEARN(
    RandomUnderSampler(sampling_strategy=1, random_state=42)
)
transformer.fit(dataset)
dataset_transformed = transformer.transform(dataset)
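A quick way to see the effect is the imbalance_ratio property shown earlier (the exact printed values depend on your data):
print(dataset.imbalance_ratio)              # IR of the original dataset
print(dataset_transformed.imbalance_ratio)  # close to 1 after sampling_strategy=1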
You can also define a custom suffix that will be added to the balanced Dataset's name.
from edgaro.balancing.transformer import TransformerFromIMBLEARN
from imblearn.under_sampling import RandomUnderSampler
transformer = TransformerFromIMBLEARN(
    RandomUnderSampler(sampling_strategy=1, random_state=42),
    name_sufix='_new_sufix123'
)
One extension worth mentioning is NestedAutomaticTransformer. It behaves like a Transformer object, but it wraps several balancing methods inside. Moreover, based on the n_per_method argument, the object automatically sets intermediate Imbalance Ratios (to investigate how the resulting models change with changing IR). For example, let the original dataset have \(IR=10\) and \(n\_per\_method=3\); then the IR values will be set to \([7, 4, 1]\). If you set keep_original_dataset to True, the original dataset will be preserved in the resulting DatasetArray object.
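The intermediate values appear to be evenly spaced between the original IR and 1; a hypothetical sketch of that computation (not the package's actual code):
import numpy as np
targets = np.linspace(10, 1, 3 + 1)[1:]  # original IR = 10, n_per_method = 3
print(targets)  # [7. 4. 1.]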
The BasicAutomaticTransformer class contains the three most popular balancing techniques (Random UnderSampling, Random OverSampling, SMOTE).
from edgaro.balancing.nested_transformer import BasicAutomaticTransformer
transformer = BasicAutomaticTransformer()
transformer.fit(dataset)
dataset_transformed = transformer.transform(dataset)
The balancing submodule also offers an array interface. You should use it if your input is a DatasetArray object or if you want to balance the input with different sets of parameters (in that case both DatasetArray and Dataset are valid inputs).
To apply the same balancing technique to a whole DatasetArray, use the TransformerArray class:
from edgaro.balancing.transformer_array import TransformerArray
from edgaro.balancing.transformer import TransformerFromIMBLEARN
from imblearn.under_sampling import RandomUnderSampler
transformer = TransformerArray(TransformerFromIMBLEARN(
    RandomUnderSampler(sampling_strategy=1, random_state=42)
))
transformer.fit(dataset_array)
dataset_array_transformed = transformer.transform(dataset_array)
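The transformed object mirrors the structure of the input array; a hedged usage sketch, assuming DatasetArray is iterable and each Dataset exposes a name attribute (suffixed as described above):
for ds in dataset_array_transformed:
    print(ds.name, ds.imbalance_ratio)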
As in the Transformer class, you can set the suffixes:
from edgaro.balancing.transformer_array import TransformerArray
from edgaro.balancing.transformer import TransformerFromIMBLEARN
from imblearn.under_sampling import RandomUnderSampler
transformer = TransformerArray(
    TransformerFromIMBLEARN(
        RandomUnderSampler(sampling_strategy=1, random_state=42)
    ),
    dataset_suffixes=['_suffix1', '_suffix2']
)
You can also set parameters; their nested structure should match the DatasetArray structure.
from edgaro.balancing.transformer_array import TransformerArray
from edgaro.balancing.transformer import TransformerFromIMBLEARN
from imblearn.under_sampling import RandomUnderSampler
from edgaro.data.dataset_array import DatasetArray
dataset_array = DatasetArray([dataset1, dataset2])
transformer = TransformerArray(
    TransformerFromIMBLEARN(RandomUnderSampler()),
    parameters=[
        [
            {'sampling_strategy': 0.98},
            {'sampling_strategy': 1},
            {'sampling_strategy': 0.9, 'random_state': 42}
        ] for _ in range(2)
    ]
)
transformer.fit(dataset_array)
dataset_array_transformed = transformer.transform(dataset_array)
Note: if a Dataset object was train-test split, the balancing methods are applied only to the training datasets; the test datasets remain untouched.
Training a model
The classes in the model module have an interface similar to those in the balancing module. There is a Model abstract class, which can be extended with any ML model implementation. One possible solution is to use scikit-learn models through the ModelFromSKLEARN class.
This module also contains a ModelArray class, which behaves very similarly to the TransformerArray class. However, instead of transforming a Dataset, it makes class predictions or returns probabilities. The returned objects are also Dataset objects.
from edgaro.model.model import ModelFromSKLEARN
from sklearn.ensemble import RandomForestClassifier
model = ModelFromSKLEARN(RandomForestClassifier())
model.fit(dataset)
predictions = model.predict(dataset)
predictions_probability = model.predict_proba(dataset)
An example of using ModelArray, i.e. when the input is a DatasetArray object:
from edgaro.model.model import ModelFromSKLEARN
from edgaro.model.model_array import ModelArray
from sklearn.ensemble import RandomForestClassifier
model = ModelArray(ModelFromSKLEARN(RandomForestClassifier()))
model.fit(dataset_array)
predictions = model.predict(dataset_array)
predictions_probability = model.predict_proba(dataset_array)
There is also a function to evaluate the model. If the input parameter is not provided and the object was train-test split, the evaluation is performed on the test dataset.
from edgaro.model.model import ModelFromSKLEARN
from edgaro.model.model_array import ModelArray
from sklearn.ensemble import RandomForestClassifier
model = ModelArray(ModelFromSKLEARN(RandomForestClassifier()))
model.fit(dataset_array)
model.evaluate()
model = ModelFromSKLEARN(RandomForestClassifier())
model.fit(dataset)
model.evaluate()
Explaining and comparing explanations
To create explanations (PDP / ALE curves), the Explainer and ExplainerArray classes are provided. The former should be used when you have only one Model object, and the latter when you have a ModelArray. The interface of the explain module is similar to that of the model module:
from edgaro.explain.explainer import Explainer
exp = Explainer(model)
exp.fit()
explanation = exp.transform()
In the case of a ModelArray input:
from edgaro.explain.explainer import Explainer
from edgaro.explain.explainer_array import ExplainerArray
exp = ExplainerArray(model_array)
exp.fit()
explanation = exp.transform()
These functions return ModelProfileExplanation and ModelProfileExplanationArray objects, which make it possible to compare explanations and visualise them.
In the case of a single Model:
from edgaro.explain.explainer import Explainer
exp = Explainer(model)
exp.fit()
explanation = exp.transform()
explanation.plot(variable='Col1')
In the case of a ModelArray:
from edgaro.explain.explainer import Explainer
from edgaro.explain.explainer_array import ExplainerArray
exp = ExplainerArray(model_array)
exp.fit()
explanation = exp.transform()
explanation.plot(variables=['Col1', 'Col2'])
In order to calculate the distance between the curves:
from edgaro.explain.explainer import Explainer
from edgaro.explain.explainer_array import ExplainerArray
exp = ExplainerArray(model_array)
exp.fit()
explanation = exp.transform()
explanation[0].compare(explanation[1], variable='Col1')
To create benchmarking summary plots, use, for example, the following code:
from edgaro.explain.explainer import Explainer
from edgaro.explain.explainer_array import ExplainerArray
exp = ExplainerArray(model_array)
exp.fit()
explanation = exp.transform()
explanation.plot_aggregate(['SMOTE', 'RandomOversample', 'RandomUndersample'])
The elements of the list passed to plot_aggregate() are regular expressions used to match and group the explanations by name. In this case, the output will be a boxplot with three boxes, one per method.
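To illustrate the grouping (with hypothetical explanation names), the matching follows standard Python regular expressions:
import re
names = ['mammography_SMOTE', 'mammography_RandomUndersample', 'mammography_RandomOversample']
smote_group = [n for n in names if re.search('SMOTE', n)]
print(smote_group)  # ['mammography_SMOTE']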
Note: if the input data objects were train-test split, the explanations are calculated on the test datasets.