=============
User manual
=============

This manual covers the most important use cases of the package. For details
and additional input parameters, see the full documentation.

Defining a dataset
------------------

To start working with the package, you can use an example dataset. You can
load it using the code below.

::

    from edgaro.data.dataset import load_mammography

    df = load_mammography()

It is also possible to create a ``Dataset`` object from a ``pandas.DataFrame``
object in your Python session:

::

    # X - your dataframe with features, pd.DataFrame
    # y - your target variable, pd.Series

    from edgaro.data.dataset import Dataset

    df = Dataset(X, y)

Apart from that, you can create a ``Dataset`` object from a *\*.csv* file or
from a dataset defined in OpenML:

::

    from edgaro.data.dataset import DatasetFromCSV, DatasetFromOpenML

    df = DatasetFromCSV('your_path')
    df = DatasetFromOpenML(task_id=1)

The ``Dataset`` object offers many handy functionalities, for example
splitting into train and test datasets (inside the object), removing ``None``
values and calculating the Imbalance Ratio.

::

    df.remove_nans()
    df.train_test_split(test_size=0.2)
    df.imbalance_ratio
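The Imbalance Ratio (IR) is commonly defined as the ratio of the majority
class size to the minority class size. The snippet below is only a minimal
illustration of this definition on a toy target using plain *pandas*; it is
not the package's internal code.

::

    import pandas as pd

    # A toy binary target: 90 negatives and 10 positives.
    y = pd.Series([0] * 90 + [1] * 10)

    counts = y.value_counts()
    print(counts.max() / counts.min())  # 9.0 - the Imbalance Ratio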
To represent an array of ``Dataset`` objects, ``DatasetArray`` is used. It is
composed of a list of ``Dataset``\ s. The object offers vectorized versions of
the functions described above, like splitting into train and test datasets
(inside the object) and removing ``None`` values.

::

    from edgaro.data.dataset_array import DatasetArray

    df_array = DatasetArray([df1, df2, df3])

    df_array.remove_nans()
    df_array.train_test_split(test_size=0.2)

It is also possible to load an example ``DatasetArray``, which can be used for
benchmarking. It contains datasets selected for benchmarking purposes from
three main sources (*OpenML-100*, *OpenML-CC18* and the *imbalanced-learn*
library). The details concerning the benchmarking set are available at
https://github.com/adrianstando/imbalanced-benchmarking-set.

::

    from edgaro.data.dataset_array import load_benchmarking_set

    df_array = load_benchmarking_set()

It is also possible to load a benchmark suite from OpenML into a
``DatasetArray`` object.

::

    from edgaro.data.dataset_array import DatasetArrayFromOpenMLSuite

    df_array = DatasetArrayFromOpenMLSuite(suite_name='OpenML100')

Balancing datasets
------------------

To balance a dataset, a ``Transformer`` abstract class is defined. You can
create custom balancing methods (by extending this class) or use the methods
implemented in the ``imblearn`` library. First, a transformer has to be
fitted with a dataset; then the dataset can be balanced by calling the
*transform* method.

::

    from edgaro.balancing.transformer import TransformerFromIMBLEARN
    from imblearn.under_sampling import RandomUnderSampler

    transformer = TransformerFromIMBLEARN(
        RandomUnderSampler(sampling_strategy=1, random_state=42)
    )
    transformer.fit(dataset)
    dataset_transformed = transformer.transform(dataset)

You can also define a custom suffix that will be added to the balanced
``Dataset``\ ’s name.

::

    from edgaro.balancing.transformer import TransformerFromIMBLEARN
    from imblearn.under_sampling import RandomUnderSampler

    transformer = TransformerFromIMBLEARN(
        RandomUnderSampler(sampling_strategy=1, random_state=42),
        name_sufix='_new_sufix123'
    )

One extension worth mentioning is ``NestedAutomaticTransformer``. It behaves
like a ``Transformer`` object, but it wraps several balancing methods inside.
Moreover, based on the ``n_per_method`` argument, this object automatically
sets intermediate Imbalance Ratios (to investigate how the resulting models
change with a changing IR). For example, let an original dataset have
:math:`IR=10` and :math:`n\_per\_method=3`. Then, the IR of the created
datasets will be set to :math:`[7, 4, 1]`.
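Consistently with the example above, the intermediate values are spaced evenly
between the original IR and the fully balanced case (:math:`IR=1`). The
snippet below is a minimal sketch of this interpolation, with a hypothetical
helper name; it illustrates the idea and is not the package's internal
implementation.

::

    import numpy as np

    def intermediate_irs(original_ir, n_per_method):
        # Evenly spaced IR values from the original IR down to full
        # balance (IR = 1), excluding the original value itself.
        return list(np.linspace(original_ir, 1.0, n_per_method + 1)[1:])

    print(intermediate_irs(10, 3))  # [7.0, 4.0, 1.0]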
If you set *keep_original_dataset* to ``True``, the original dataset will be
preserved in the resulting ``DatasetArray`` object.

The ``BasicAutomaticTransformer`` class contains the three most popular
balancing techniques (Random UnderSampling, Random OverSampling, SMOTE).

::

    from edgaro.balancing.nested_transformer import BasicAutomaticTransformer

    transformer = BasicAutomaticTransformer()
    transformer.fit(dataset)
    dataset_transformed = transformer.transform(dataset)

The ``balancing`` submodule also offers an array interface. If your input is a
``DatasetArray`` object, or if you want to balance the input with different
sets of parameters (in that case both ``DatasetArray`` and ``Dataset`` inputs
are correct), you should use this class. To apply the same balancing technique
to a ``DatasetArray``, use the ``TransformerArray`` class:

::

    from edgaro.balancing.transformer_array import TransformerArray
    from edgaro.balancing.transformer import TransformerFromIMBLEARN
    from imblearn.under_sampling import RandomUnderSampler

    transformer = TransformerArray(TransformerFromIMBLEARN(
        RandomUnderSampler(sampling_strategy=1, random_state=42)
    ))
    transformer.fit(dataset_array)
    dataset_array_transformed = transformer.transform(dataset_array)

As in the ``Transformer`` class, you can set the suffixes:

::

    from edgaro.balancing.transformer_array import TransformerArray
    from edgaro.balancing.transformer import TransformerFromIMBLEARN
    from imblearn.under_sampling import RandomUnderSampler

    transformer = TransformerArray(
        TransformerFromIMBLEARN(
            RandomUnderSampler(sampling_strategy=1, random_state=42)
        ),
        dataset_suffixes=['_suffix1', '_suffix2']
    )

You can also set parameters - their nested structure should match the
``DatasetArray`` structure.

::

    from edgaro.balancing.transformer_array import TransformerArray
    from edgaro.balancing.transformer import TransformerFromIMBLEARN
    from imblearn.under_sampling import RandomUnderSampler
    from edgaro.data.dataset_array import DatasetArray

    dataset_array = DatasetArray([dataset1, dataset2])

    transformer = TransformerArray(
        TransformerFromIMBLEARN(RandomUnderSampler()),
        parameters=[
            [
                {'sampling_strategy': 0.98},
                {'sampling_strategy': 1},
                {'sampling_strategy': 0.9, 'random_state': 42}
            ] for _ in range(2)
        ]
    )
    transformer.fit(dataset_array)
    dataset_array_transformed = transformer.transform(dataset_array)

Note: if a ``Dataset`` object was train-test-split, the balancing methods are
applied only to the training datasets and the test datasets remain untouched.

Training a model
----------------

The classes in the ``model`` module have a similar interface to those in the
``balancing`` module. There is a *Model* class, which is an abstract class and
can be extended with any ML model implementation. One possible solution is to
use *scikit-learn* models through the ``ModelFromSKLEARN`` class.

In this module, there is also a ``ModelArray`` class, which behaves very
similarly to the ``TransformerArray`` class. However, instead of transforming
a ``Dataset``, the class makes predictions or returns probabilities. The
returned objects are also ``Dataset`` objects.

::

    from edgaro.model.model import ModelFromSKLEARN
    from sklearn.ensemble import RandomForestClassifier

    model = ModelFromSKLEARN(RandomForestClassifier())

    model.fit(dataset)
    predictions = model.predict(dataset)
    predictions_probability = model.predict_proba(dataset)

An example of using ``ModelArray`` - that is, the situation when the input is
a ``DatasetArray`` object:

::

    from edgaro.model.model import ModelFromSKLEARN
    from edgaro.model.model_array import ModelArray
    from sklearn.ensemble import RandomForestClassifier

    model = ModelArray(ModelFromSKLEARN(RandomForestClassifier()))

    model.fit(dataset_array)
    predictions = model.predict(dataset_array)
    predictions_probability = model.predict_proba(dataset_array)

There is also a function to evaluate the model. If the input parameter is not
provided and the object was train-test-split, the evaluation is made on the
test dataset.

::

    from edgaro.model.model import ModelFromSKLEARN
    from edgaro.model.model_array import ModelArray
    from sklearn.ensemble import RandomForestClassifier

    model = ModelArray(ModelFromSKLEARN(RandomForestClassifier()))
    model.fit(dataset_array)
    model.evaluate()

    model = ModelFromSKLEARN(RandomForestClassifier())
    model.fit(dataset)
    model.evaluate()

Explaining and comparing explanations
-------------------------------------

To create explanations (PDP / ALE curves), the ``Explainer`` and
``ExplainerArray`` classes are provided. The first one should be used when you
only have one ``Model`` object, and the latter when you have a ``ModelArray``.
The interface in the ``explain`` module is similar to that in ``model``:

::

    from edgaro.explain.explainer import Explainer

    exp = Explainer(model)
    exp.fit()
    explanation = exp.transform()

In the case of a ``ModelArray`` input:

::

    from edgaro.explain.explainer_array import ExplainerArray

    exp = ExplainerArray(model_array)
    exp.fit()
    explanation = exp.transform()

These functions return ``ModelProfileExplanation`` and
``ModelProfileExplanationArray`` objects, which make it possible to compare
explanations and visualise them.

In the case of a single ``Model``:

::

    from edgaro.explain.explainer import Explainer

    exp = Explainer(model)
    exp.fit()
    explanation = exp.transform()

    explanation.plot(variable='Col1')

In the case of a ``ModelArray``:

::

    from edgaro.explain.explainer_array import ExplainerArray

    exp = ExplainerArray(model_array)
    exp.fit()
    explanation = exp.transform()

    explanation.plot(variables=['Col1', 'Col2'])

In order to calculate the distance between the curves:

::

    from edgaro.explain.explainer_array import ExplainerArray

    exp = ExplainerArray(model_array)
    exp.fit()
    explanation = exp.transform()

    explanation[0].compare(explanation[1], variable='Col1')

To create benchmarking summary plots, use, for example, the following code:

::

    from edgaro.explain.explainer_array import ExplainerArray

    exp = ExplainerArray(model_array)
    exp.fit()
    explanation = exp.transform()

    explanation.plot_aggregate(['SMOTE', 'RandomOversample', 'RandomUndersample'])

The elements of the list passed to *plot_aggregate()* are regular expressions
used to match and group explanations. In this case, the output will be a
boxplot with three boxes - one per method.

Note: if the input data objects were train-test-split, the explanations are
calculated on the test dataset.
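To sum up the workflow, the sketch below chains the steps described in this
manual: loading an example dataset, balancing it with several methods,
training one model per balanced dataset and explaining the results. It is
built only from the calls shown above; ``'Col1'`` is a placeholder for a
column that exists in your data.

::

    from sklearn.ensemble import RandomForestClassifier

    from edgaro.data.dataset import load_mammography
    from edgaro.balancing.nested_transformer import BasicAutomaticTransformer
    from edgaro.model.model import ModelFromSKLEARN
    from edgaro.model.model_array import ModelArray
    from edgaro.explain.explainer_array import ExplainerArray

    # Load and prepare an example dataset.
    dataset = load_mammography()
    dataset.remove_nans()
    dataset.train_test_split(test_size=0.2)

    # Balance the training data with Random UnderSampling,
    # Random OverSampling and SMOTE.
    transformer = BasicAutomaticTransformer()
    transformer.fit(dataset)
    dataset_array = transformer.transform(dataset)

    # Train one Random Forest per balanced dataset and evaluate
    # on the test data.
    model_array = ModelArray(ModelFromSKLEARN(RandomForestClassifier()))
    model_array.fit(dataset_array)
    model_array.evaluate()

    # Explain the models and plot the curves for the chosen variable.
    exp = ExplainerArray(model_array)
    exp.fit()
    explanation = exp.transform()
    explanation.plot(variables=['Col1'])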