edgaro.explain package

edgaro.explain.explainer module

class edgaro.explain.explainer.Explainer(model: Model, N: int | None = None, explanation_type: Literal['PDP', 'ALE', 'VI'] = 'PDP', verbose: bool = False, processes: int = 1, random_state: int | None = None, B: int | None = 10, performance_metric_name: str = 'balanced_accuracy', performance_metric: Callable[[Series, Series], float] = <function balanced_accuracy_score>)

Bases: object

The class defines an Explainer for a Model object - it allows calculating PDP [1] curves, ALE [2] curves and Variable Importance.

Parameters:
  • model (Model) – A Model object to calculate explanations on.

  • N (int, optional, default=None) – Number of observations that will be sampled from the test Dataset before the calculation of profiles (PDP/ALE curves). None means all data.

  • explanation_type ({‘PDP’, ‘ALE’, ‘VI’}, default=’PDP’) – An explanation type to be calculated (PDP - Partial Dependence Profile, ALE - Accumulated Local Effects, VI - Variable Importance)

  • verbose (bool, default=False) – Print messages during calculations.

  • processes (int, default=1) – Number of processes for the calculation of explanations. If -1, it is replaced with the number of available CPU cores.

  • random_state (int, optional, default=None) – Random state seed.

  • B (int, optional, default=10) – Number of permutation rounds to perform on each variable - applicable only if explanation_type=’VI’.

  • performance_metric_name (str, default=’balanced_accuracy’) – Name of the performance metric.

  • performance_metric (callable, default=balanced_accuracy_score) – A callable that computes the performance metric from two pd.Series and returns a float.

Variables:
  • model (Model) – A Model object to calculate explanations on.

  • name (str) – The name of the Explainer; by default it is the Model name.

  • N (int, optional, default=None) – Number of observations that will be sampled from the test Dataset before the calculation of profiles (PDP/ALE curves). None means all data.

  • explanation_type ({'PDP', 'ALE', 'VI'}) – An explanation type to be calculated.

  • verbose (bool) – Print messages during calculations.

  • explainer (dx.Explainer, optional) – An explainer object from dalex package.

  • processes (int) – Number of processes for the calculation of explanations. If -1, it is replaced with the number of available CPU cores.

  • random_state (int, optional) – Random state seed.

  • B (int, optional) – Number of permutation rounds to perform on each variable - applicable only if explanation_type=’VI’.

  • performance_metric_name (str) – Name of the performance metric.

  • performance_metric (callable) – A callable that computes the performance metric.

References

fit() None

Fit the Explainer object and create an explainer attribute.

transform(variables: List[str] | None = None) Explanation

Calculate the explanation.

Parameters:

variables (list[str], optional) – List of variables for which the explanation should be calculated.

Return type:

Explanation
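
For illustration, a minimal workflow sketch follows. Here model is assumed to be an already fitted edgaro Model object (constructed with the edgaro.model package), and 'age' is a hypothetical column name:

    from edgaro.explain.explainer import Explainer

    # `model` is assumed to be a fitted edgaro Model object.
    explainer = Explainer(model, N=1000, explanation_type='PDP',
                          random_state=42, verbose=True)
    explainer.fit()                                       # creates the dalex explainer attribute
    explanation = explainer.transform(variables=['age'])  # hypothetical column name
    explanation.plot(variable='age')                      # draw the PDP curve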

edgaro.explain.explainer_array module

class edgaro.explain.explainer_array.ExplainerArray(models: Model | ModelArray, N: int | None = None, explanation_type: Literal['PDP', 'ALE', 'VI'] = 'PDP', verbose: bool = False, processes: int = 1, random_state: int | None = None, B: int | None = 10, performance_metric_name: str = 'balanced_accuracy', performance_metric: Callable[[Series, Series], float] = <function balanced_accuracy_score>)

Bases: object

The class calculates PDP [1] curves, ALE [2] curves or Variable Importance for Model and ModelArray objects.

Parameters:
  • models (Model, ModelArray) – A Model/ModelArray object to calculate the curves on.

  • N (int, optional, default=None) – Number of observations that will be sampled from the test Dataset before the calculation of profiles (PDP/ALE curves). None means all data.

  • explanation_type ({‘PDP’, ‘ALE’, ‘VI’}, default=’PDP’) – An explanation type to be calculated (PDP - Partial Dependence Profile, ALE - Accumulated Local Effects, VI - Variable Importance)

  • verbose (bool, default=False) – Print messages during calculations.

  • processes (int, default=1) – Number of processes for the calculation of explanations. If -1, it is replaced with the number of available CPU cores.

  • random_state (int, optional, default=None) – Random state seed.

  • B (int, optional, default=10) – Number of permutation rounds to perform on each variable - applicable only if explanation_type=’VI’.

  • performance_metric_name (str, default='balanced_accuracy') – Name of the performance metric.

  • performance_metric (callable, default=balanced_accuracy_score) – A callable that computes the performance metric.

Variables:
  • models (Model, ModelArray) – A Model/ModelArray object to calculate the curves for.

  • name (str) – The name of the ExplainerArray; by default it is the Model/ModelArray name.

  • sub_calculators (list[Explainer, ExplainerArray], optional) – A list of calculators for nested Datasets/DatasetArrays.

  • N (int, optional) – Number of observations that will be sampled from the test Dataset before the calculation of profiles (PDP/ALE curves). None means all data.

  • explanation_type ({'PDP', 'ALE', 'VI'}, default='PDP') – An explanation type to be calculated.

  • verbose (bool) – Print messages during calculations.

  • processes (int) – Number of processes for the calculation of explanations. If -1, it is replaced with the number of available CPU cores.

  • random_state (int, optional) – Random state seed.

  • B (int, optional) – Number of permutation rounds to perform on each variable - applicable only if explanation_type=’VI’.

  • performance_metric_name (str) – Name of the performance metric.

  • performance_metric (callable) – A callable that computes the performance metric.

fit() None

Fit the ExplainerArray object and create the sub_calculators attribute.

transform(variables=None) ExplanationArray

Calculate the explanation.

Parameters:

variables (list[str], optional) – List of variables for which the explanation should be calculated.

Return type:

ExplanationArray
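
A usage sketch; model_array is assumed to be a fitted edgaro ModelArray object:

    from edgaro.explain.explainer_array import ExplainerArray

    # `model_array` is assumed to be a fitted edgaro ModelArray object.
    calc = ExplainerArray(model_array, explanation_type='ALE',
                          processes=-1,    # use all available CPU cores
                          random_state=42)
    calc.fit()
    explanations = calc.transform()        # an ExplanationArray mirroring the array structure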

edgaro.explain.explainer_result module

class edgaro.explain.explainer_result.Curve(x: ndarray | pd.Series | None, y: ndarray | pd.Series | None)

Bases: object

The class which represents the PDP/ALE curve for one variable.

Parameters:
  • x (np.ndarray, pd.Series, optional) – Points on the 0X axis.

  • y (np.ndarray, pd.Series, optional) – Points on the 0Y axis.

Variables:
  • x (np.ndarray, pd.Series, optional) – Points on the 0X axis.

  • y (np.ndarray, pd.Series, optional) – Points on the 0Y axis.
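
A short construction sketch (in normal use, Curve objects are typically created by an Explainer rather than by hand; the values below are arbitrary):

    import numpy as np
    from edgaro.explain.explainer_result import Curve

    x = np.linspace(0.0, 1.0, 11)  # grid on the 0X axis
    y = x ** 2                     # curve values on the 0Y axis
    curve = Curve(x=x, y=y)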

class edgaro.explain.explainer_result.Explanation

Bases: ABC

abstract plot() None
abstract compare(other: List[Explanation]) List[float | List]
abstract compare_performance(other: List[Explanation], percent: bool = False) List[float]
class edgaro.explain.explainer_result.ModelProfileExplanation(results: Dict[str, Curve], name: str, categorical_columns: List[str], performance_metric_value: float, performance_metric_name: str, explanation_type: Literal['PDP', 'ALE'] = 'PDP')

Bases: Explanation

The class which represents the PDP/ALE curves for all variables in one Model.

Parameters:
  • results (Dict[str, Curve]) – A dictionary of pairs (column name, Curve object), which represents curves for all variables in one Model.

  • name (str) – The name of ModelProfileExplanation. It is best if it is a Model name.

  • categorical_columns (list[str]) – List of categorical variables.

  • explanation_type ({‘PDP’, ‘ALE’}, default=’PDP’) – A curve type.

  • performance_metric_value (float) – Value of the performance metric.

  • performance_metric_name (str) – Name of the performance metric.

Variables:
  • results (Dict[str, Curve]) – A dictionary of pairs (column name, Curve object), which represents curves for all variables in one Model.

  • name (str) – The name of ModelProfileExplanation. It is best if it is a Model name.

  • categorical_columns (list[str]) – List of categorical variables.

  • explanation_type ({'PDP', 'ALE'}) – A curve type.

  • performance_metric_value (float, optional) – Value of the performance metric.

  • performance_metric_name (str, optional) – Name of the performance metric.

plot(variable: str | None = None, figsize: Tuple[int, int] | None = (8, 8), add_plot: List[ModelProfileExplanation] | None = None, ax: Axes | None = None, show_legend: bool = True, y_lim: Tuple[float, float] | None = None, metric_precision: int = 2, centered: bool = False) None

The function plots the PDP/ALE curve.

Parameters:
  • variable (str, optional, default=None) – Variable for which the plot should be generated. If None, the first column is plotted.

  • figsize (tuple(int, int), optional, default=(8, 8)) – Size of a figure.

  • add_plot (list[ModelProfileExplanation], optional, default=None) – List of other ModelProfileExplanation objects that also contain the variable and should be plotted.

  • ax (matplotlib.axes.Axes, optional, default=None) – The parameter should be passed if the plot is to be created in a certain Axis. In that situation, figsize parameter is ignored.

  • show_legend (bool, default=True) – The parameter indicates whether the legend should be plotted.

  • y_lim (tuple(float, float), optional, default=None) – The limits of 0Y axis.

  • metric_precision (int, default=2) – Number of decimal places to which the performance metric value is rounded when displayed.

  • centered (bool, default=False) – If True, the plots will be centered to start at 0.
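
A sketch of overlaying the curves of two models on one Axes; expl_a, expl_b and the column name 'age' are illustrative assumptions:

    import matplotlib.pyplot as plt

    # `expl_a` and `expl_b` are assumed to be ModelProfileExplanation objects
    # returned by Explainer.transform() for two different models.
    fig, ax = plt.subplots()
    expl_a.plot(
        variable='age',      # hypothetical column name
        add_plot=[expl_b],   # overlay the curve of the second model
        ax=ax,               # draw on an existing Axes; figsize is ignored
        centered=True,       # shift both curves to start at 0
    )
    plt.show()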

compare_performance(other: List[ModelProfileExplanation], percent: bool = False) List[float]

The function returns the difference between performance metric values. This object's value is subtracted from the value of each object in other.

Parameters:
  • other (list[ModelProfileExplanation]) – List of ModelProfileExplanation objects to compare against.

  • percent (bool, default=False) – If True, the percentage change will be returned instead of difference.

Return type:

list[float]

compare(other: List[ModelProfileExplanation], variable: str | List[str] | None = None, return_raw_per_variable: bool = False) List[float | list]

The function calculates the metric to compare the curves for a given variable(s).

Currently, there is only one comparison metric, called SDD (Standard Deviation of Distances). It is the standard deviation of the distances, at intermediate points, between the curve in this object and the curves in other. If there is more than one object in other and return_raw_per_variable=False, the mean of the SDD values is returned (ASDD - Averaged SDD).

Parameters:
  • other (list[ModelProfileExplanation]) – List of ModelProfileExplanation objects to compare the curve against.

  • variable (str, list[str], optional, default=None) – List of variable names to calculate the metric distances. If None, the metrics are calculated for all the columns in this object.

  • return_raw_per_variable (bool, default=False) – If True, raw values for each variable are returned.

Return type:

list[float | list]
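
A comparison sketch, reusing the assumed expl_a and expl_b objects and hypothetical column names from above:

    # SDD values per variable against a single other explanation.
    sdd_per_variable = expl_a.compare(
        other=[expl_b],
        variable=['age', 'income'],    # hypothetical column names
        return_raw_per_variable=True,  # one value per variable
    )

    # A single value averaged over all variables (ASDD when `other`
    # contains more than one object).
    sdd = expl_a.compare(other=[expl_b])

    # Difference in the performance metric, expressed as a percentage.
    diff = expl_a.compare_performance([expl_b], percent=True)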

class edgaro.explain.explainer_result.ModelPartsExplanation(results: Dict[str, float], name: str, performance_metric_value: float, performance_metric_name: str, explanation_type: Literal['VI'] = 'VI')

Bases: Explanation

The class represents the Variable Importance for all variables in one Model.

Parameters:
  • results (Dict[str, float]) – A dictionary of pairs (column name, value), which represents Variable Importance for all variables in one Model.

  • name (str) – The name of ModelPartsExplanation. It is best if it is a Model name.

  • explanation_type ({‘VI’}, default=’VI’) – An explanation type.

  • performance_metric_value (float) – Value of the performance metric.

  • performance_metric_name (str) – Name of the performance metric.

Variables:
  • results (Dict[str, float]) – A dictionary of pairs (column name, value), which represents Variable Importance for all variables in one Model.

  • name (str) – The name of ModelPartsExplanation. It is best if it is a Model name.

  • explanation_type ({'VI'}, default='VI') – An explanation type.

  • performance_metric_value (float) – Value of the performance metric.

  • performance_metric_name (str) – Name of the performance metric.

plot(variable: str | List[str] | None = None, add_plot: List[ModelPartsExplanation] | None = None, max_variables: int | None = None, figsize: Tuple[int, int] | None = (8, 8), ax: Axes | None = None, show_legend: bool = False, x_lim: Tuple[float, float] | None = None, metric_precision: int = 3) None

The function plots the Variable Importance profile.

Parameters:
  • variable (str, list[str], optional, default=None) – Variable(s) for which the VI should be plotted. If None, all columns are plotted.

  • figsize (tuple(int, int), optional, default=(8, 8)) – Size of a figure.

  • add_plot (list[ModelPartsExplanation], optional, default=None) – List of other ModelPartsExplanation objects that also contain the variable and should be plotted.

  • max_variables (int, optional, default=None) – Maximal number of variables from the current object to be taken into account.

  • ax (matplotlib.axes.Axes, optional, default=None) – The parameter should be passed if the plot is to be created in a certain Axis. In that situation, figsize parameter is ignored.

  • show_legend (bool, default=False) – The parameter indicates whether the legend should be plotted.

  • x_lim (tuple(float, float), optional, default=None) – The limits of 0X axis.

  • metric_precision (int, default=3) – Number of decimal places to which the metric value is rounded.

compare_performance(other: List[ModelPartsExplanation], percent: bool = False) List[float]

The function returns the difference between performance metric values. This object's value is subtracted from the value of each object in other.

Parameters:
  • other (list[ModelPartsExplanation]) – List of ModelPartsExplanation objects to compare against.

  • percent (bool, default=False) – If True, the percentage change will be returned instead of difference.

Return type:

list[float]

compare(other: List[ModelPartsExplanation], variable: str | List[str] | None = None, max_variables: int | None = None, return_raw: bool = True) List[float | list]

The function calculates the metric to compare model parts of two or more models.

Currently, there is only one metric based on the Wilcoxon statistical test [3]. The metric value is the p-value of this test, where the inputs are variable importance values. The idea is based on the approach presented in the article [4].

Parameters:
  • other (list[ModelPartsExplanation]) – List of ModelPartsExplanation objects to compare the variable importance against.

  • variable (str, list[str], optional, default=None) – List of variable names to calculate the metric distances. If None, the metrics are calculated for all the columns in this object.

  • max_variables (int, optional, default=None) – Maximal number of variables from the current object to be taken into account.

  • return_raw (bool, default=True) – If True, the p-values are returned for each model. Otherwise, the mean value is returned.

Return type:

list[float | list]
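
A comparison sketch; vi_a and vi_b are assumed to be ModelPartsExplanation objects obtained from Explainer.transform() with explanation_type='VI':

    p_values = vi_a.compare(
        other=[vi_b],
        max_variables=10,  # use only the 10 most important variables
        return_raw=True,   # one Wilcoxon p-value per compared model
    )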

References

edgaro.explain.explainer_result_array module

class edgaro.explain.explainer_result_array.ExplanationArray

Bases: ABC

abstract plot() None
abstract compare() List[float | list]
abstract plot_summary() None
abstract compare_performance() List[float | List]
class edgaro.explain.explainer_result_array.ModelProfileExplanationArray(results: List[ModelProfileExplanation | ModelProfileExplanationArray], name: str, explanation_type: Literal['PDP', 'ALE'] = 'PDP')

Bases: ExplanationArray

The class which represents the PDP/ALE curves for all variables in a Model/ModelArray object.

Parameters:
  • results (list[ModelProfileExplanation, ModelProfileExplanationArray]) – A list of ModelProfileExplanation/ModelProfileExplanationArray with results.

  • name (str) – The name of ModelProfileExplanationArray. It is best if it is a Model/ModelArray name.

  • explanation_type ({‘PDP’, ‘ALE’}, default=’PDP’) – A curve type.

Variables:
  • results (list[ModelProfileExplanation, ModelProfileExplanationArray]) – A list of ModelProfileExplanation/ModelProfileExplanationArray with results.

  • name (str) – The name of ModelProfileExplanationArray. It is best if it is a Model/ModelArray name.

  • explanation_type ({'PDP', 'ALE'}) – A curve type.

plot(variables: List[str] | None = None, n_col: int = 3, figsize: Tuple[int, int] | None = None, model_filter: str | None = None, index_base: str | int = -1, centered: bool = False)

The function plots the PDP/ALE curves for given variables using all available Curves in the object.

Parameters:
  • index_base (int, str, default=-1) – Index of a curve to be a base for comparisons.

  • variables (list[str], optional, default=None) – Variables for which the plot should be generated. If None, plots for all variables are generated if all the available ModelProfileExplanation objects have exactly the same set of column names.

  • n_col (int, default=3) – Number of columns in the final plot.

  • figsize (tuple(int, int), optional, default=None) – The size of a figure. If None, the figure size is calculated as (8 * n_col, 8 * n_rows).

  • model_filter (str, optional, default=None) – A regex expression to filter the names of the ModelProfileExplanation objects for comparing.

  • centered (bool, default=False) – If True, the plots will be centered to start at 0.
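
A plotting sketch; expl_array is assumed to be a ModelProfileExplanationArray returned by ExplainerArray.transform(), and the column name and regex are illustrative:

    expl_array.plot(
        variables=['age'],         # hypothetical column name
        n_col=2,
        model_filter='.*SMOTE.*',  # hypothetical regex matching model names
        index_base=-1,             # the last curve is the base for comparisons
        centered=True,
    )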

compare_performance(index_base: str | int = -1, model_filter: str | None = None, percent: bool = False) List[float | List]

The function returns the difference between performance metric values. The value of the index_base object is subtracted from the values of the others.

Parameters:
  • index_base (int, str, default=-1) – Index of a curve to be a base for comparisons.

  • model_filter (str, optional, default=None) – A regex expression to filter the names of the ModelProfileExplanation objects for comparing.

  • percent (bool, default=False) – If True, the percentage change will be returned instead of the difference.

Return type:

list[float | list]

compare(variable: str | List[str] | None = None, index_base: str | int = -1, return_raw: bool = True, return_raw_per_variable: bool = True, model_filter: str | None = None) List[float | List]

The function compares the curves in the array.

Parameters:
  • variable (list[str], optional, default=None) – List of variable names to calculate the metric distances. If None, the metrics are calculated for all the columns in this object.

  • index_base (int, str, default=-1) – Index of a curve to be a base for comparisons.

  • return_raw (bool, default=True) – If True, the metrics for each of the model are returned. Otherwise, the mean of the values is returned.

  • return_raw_per_variable (bool, default=True) – If True, the metrics for each of the variables are returned. Otherwise, the mean of the values is returned.

  • model_filter (str, optional, default=None) – A regex expression to filter the names of the ModelProfileExplanation objects for comparing.

Return type:

list[float | list]

plot_summary(model_filters: List[str] | None = None, filter_labels: List[str] | None = None, variables: List[str] | None = None, figsize: Tuple[int, int] | None = None, index_base: str | int = -1, return_df: bool = False)

The function plots boxplots of comparison metrics of curves in the object.

Parameters:
  • variables (list[str], optional, default=None) – Variables for which the plot should be generated. If None, plots for all variables are generated if all the available ModelProfileExplanation objects have exactly the same set of column names.

  • figsize (tuple(int, int), optional, default=None) – The size of a figure.

  • model_filters (list[str], optional, default=None) – List of regex expressions to filter the names of the ModelProfileExplanation objects for comparing. Each element in the list creates a new boxplot. If None, one boxplot of all results is plotted.

  • filter_labels (list[str], optional, default=None) – Labels of model filters.

  • index_base (int, str, default=-1) – Index of a curve to be a base for comparisons.

  • return_df (bool, default=False) – If True, the method returns a dataframe on which a plot is created.
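
A sketch with two hypothetical filters; each regex produces one boxplot:

    df = expl_array.plot_summary(
        model_filters=['.*SMOTE.*', '.*undersampling.*'],  # hypothetical regexes
        filter_labels=['SMOTE', 'Undersampling'],
        index_base=-1,
        return_df=True,  # also return the dataframe behind the plot
    )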

plot_performance_gain_analysis(model_filters: List[str] | None = None, filter_labels: List[str] | None = None, variables: List[str] | None = None, figsize: Tuple[int, int] | None = None, index_base: str | int = -1, return_df: bool = False, percent: bool = False, ax: matplotlib.axes.Axes | None = None)

The function plots a performance gain analysis, which compares ASDD values with the differences in performance metric values.

Parameters:
  • variables (list[str], optional, default=None) – Variables for which the plot should be generated. If None, plots for all variables are generated if all the available ModelProfileExplanation objects have exactly the same set of column names.

  • figsize (tuple(int, int), optional, default=None) – The size of a figure.

  • model_filters (list[str], optional, default=None) – List of regex expressions to filter the names of the ModelProfileExplanation objects for comparing. Each element in the list creates a new group in the plot. If None, all results are plotted as one group.

  • filter_labels (list[str], optional, default=None) – Labels of model filters.

  • index_base (int, str, default=-1) – Index of a curve to be a base for comparisons.

  • return_df (bool, default=False) – If True, the method returns a dataframe on which a plot is created.

  • percent (bool, default=False) – If True, the percentage change will be plotted instead of difference.

  • ax (matplotlib.axes.Axes, optional, default=None) – The Axes to plot on.
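
A sketch, reusing the assumed expl_array object and hypothetical regex from above:

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    expl_array.plot_performance_gain_analysis(
        model_filters=['.*SMOTE.*'],  # hypothetical regex
        percent=True,                 # plot percentage change of the metric
        ax=ax,
    )
    plt.show()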

class edgaro.explain.explainer_result_array.ModelPartsExplanationArray(results: List[ModelPartsExplanation | ModelPartsExplanationArray], name: str, explanation_type: Literal['VI'] = 'VI')

Bases: ExplanationArray

The class which represents the Variable Importance for all variables in a Model/ModelArray object.

Parameters:
  • results (list[ModelPartsExplanation, ModelPartsExplanationArray]) – A list of ModelPartsExplanation/ModelPartsExplanationArray with results.

  • name (str) – The name of ModelPartsExplanationArray. It is best if it is a Model/ModelArray name.

  • explanation_type ({‘VI’}, default=’VI’) – An explanation type.

Variables:
  • results (list[ModelPartsExplanation, ModelPartsExplanationArray]) – A list of ModelPartsExplanation/ModelPartsExplanationArray with results.

  • name (str) – The name of ModelPartsExplanationArray. It is best if it is a Model/ModelArray name.

  • explanation_type ({'VI'}, default='VI') – An explanation type.

plot(variable: str | List[str] | None = None, max_variables: int | None = None, index_base: str | int = -1, figsize: Tuple[int, int] | None = (8, 8), ax: Axes | None = None, show_legend: bool = True, x_lim: Tuple[float, float] | None = None, metric_precision: int = 3) None

The function plots the Variable Importance profile using all ModelPartsExplanation objects.

Parameters:
  • variable (str, list[str], optional, default=None) – Variable(s) for which the VI should be plotted. If None, all columns are plotted.

  • figsize (tuple(int, int), optional, default=(8, 8)) – Size of a figure.

  • max_variables (int, optional, default=None) – Maximal number of variables from the current object to be taken into account.

  • ax (matplotlib.axes.Axes, optional, default=None) – The parameter should be passed if the plot is to be created in a certain Axis. In that situation, figsize parameter is ignored.

  • show_legend (bool, default=True) – The parameter indicates whether the legend should be plotted.

  • x_lim (tuple(float, float), optional, default=None) – The limits of 0X axis.

  • metric_precision (int, default=3) – Number of decimal places to which the metric value is rounded.

  • index_base (int, str, default=-1) – Index of an explanation to be a base for comparisons.

compare_performance(index_base: str | int = -1, model_filter: str | None = None, percent: bool = False) List[float | List]

The function returns the difference between performance metric values. The value of the index_base object is subtracted from the values of the others.

Parameters:
  • index_base (int, str, default=-1) – Index of an explanation to be a base for comparisons.

  • model_filter (str, optional, default=None) – A regex expression to filter the names of the ModelPartsExplanation objects for comparing.

  • percent (bool, default=False) – If True, the percentage change will be returned instead of the difference.

Return type:

list[float | list]

compare(variable: str | List[str] | None = None, max_variables: int | None = None, return_raw: bool = True, index_base: str | int = -1, model_filter: str | None = None) List[float | list]

The function compares variable importance in the array.

Parameters:
  • variable (str, list[str], optional, default=None) – List of variable names to calculate the metric distances. If None, the metrics are calculated for all the columns in this object.

  • max_variables (int, optional, default=None) – Maximal number of variables from the current object to be taken into account.

  • return_raw (bool, default=True) – If True, the p-values are returned for each model. Otherwise, the mean value is returned.

  • index_base (int, str, default=-1) – Index of an explanation to be a base for comparisons.

  • model_filter (str, optional, default=None) – A regex expression to filter the names of the ModelPartsExplanation objects for comparing.

Return type:

list[float | list]

plot_summary(model_filters: List[str] | None = None, filter_labels: List[str] | None = None, variables: List[str] | None = None, max_variables: int | None = None, figsize: Tuple[int, int] | None = None, index_base: str | int = -1, significance_level: float | None = None, fdr_correction: bool = True, return_df: bool = False) None

The function plots boxplots of the comparison metrics of the VI in the object if significance_level is not provided. Otherwise, the results of the statistical test are plotted as barplots according to significance_level.

Parameters:
  • variables (str, list[str], optional, default=None) – Variable(s) for which the VI should be plotted. If None, all columns are plotted.

  • figsize (tuple(int, int), optional, default=None) – Size of a figure.

  • model_filters (list[str], optional, default=None) – List of regex expressions to filter the names of the ModelPartsExplanation objects for comparing. Each element in the list creates a new boxplot. If None, one boxplot / barplot of all results is plotted.

  • filter_labels (list[str], optional, default=None) – Labels of model filters.

  • index_base (int, str, default=-1) – Index of an explanation to be a base for comparisons.

  • max_variables (int, optional, default=None) – Maximal number of variables from the current object to be taken into account.

  • significance_level (float, optional, default=None) – A significance level of the statistical test (metric).

  • fdr_correction (bool, default=True) – Add p-value correction for false discovery rate. Note that it is used only if significance_level is not None.

  • return_df (bool, default=False) – If True, the method returns a dataframe on which a plot is created.
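
A closing sketch; vi_array is assumed to be a ModelPartsExplanationArray returned by ExplainerArray.transform() with explanation_type='VI':

    vi_array.plot_summary(
        max_variables=10,
        significance_level=0.05,  # switch from boxplots to barplots of test results
        fdr_correction=True,      # correct p-values for false discovery rate
    )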