Learner

This class is a wrapper around the different libraries used to create the models (sklearn, xgboost, lightgbm, …).

It stores a dataset, evaluates it using different models and provides some tools to get instances, predictions and metrics.

    def __init__(self, data: str|pandas.DataFrame|NoneData = NoneData, *,
                    problem_type: str|ProblemType,
                    model_type: str|ModelType|None = None,
                    instances_type: str|InstancesType|None = None,
                    labels_type: str|LabelsType|None = None,
                    get_item_function: Callable|None = None,
                    instances_directory: str|None = None,
                    labels_directory: str|None = None):

In the case of a csv or excel file, the last column is considered as the label column and all other columns are considered as features.
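
The column convention can be illustrated without the library. The following is a minimal sketch using only Python's csv module; the sample data and the split_features_label helper are hypothetical and are not part of PyXAI.

```python
import csv
import io

# Hypothetical CSV content: the last column ("species") plays the role of the label.
SAMPLE_CSV = """sepal_length,sepal_width,species
5.1,3.5,setosa
6.2,2.9,versicolor
"""

def split_features_label(csv_text):
    """Split a CSV into feature names, label name, feature rows and labels."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    feature_names = header[:-1]            # every column except the last
    label_name = header[-1]                # the last column is the label
    features = [row[:-1] for row in body]
    labels = [row[-1] for row in body]
    return feature_names, label_name, features, labels

feature_names, label_name, features, labels = split_features_label(SAMPLE_CSV)
print(feature_names, label_name, labels)
```

If the label is not the last column of a file, this convention suggests reordering the columns before handing the file to the learner.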

In the case of a json file, the dataset should be in a specific format (see documentation).

Parameters

data : str | pandas.DataFrame | NoneData

The dataset to use, either as a path to a csv, json or excel file or as a pandas DataFrame.

problem_type : str | ProblemType

The type of problem (classification, regression, …)
Possible values are defined in the ProblemType enum.

model_type : str | ModelType (optional, default=None)

The type of model (linear, tree-based, neural network, …)
Can be left as None and given later to the evaluation method.
Possible values are defined in the ModelType enum.

instances_type : str | InstancesType (optional, default=None)

The type of instances (image, tabular, text, temporal, …)
Possible values are defined in the InstancesType enum.
When set to None, it is set to “tabular” if data is a csv file and get_item_function is None.

labels_type : str | LabelsType (optional, default=None)

The type of labels (class, text, mask, contours, …)
Possible values are defined in the LabelsType enum.
When set to None, it is set to “classes” if problem_type is “classification” and to “continuous-values” if problem_type is “regression”.
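
The default rule for labels_type can be written down as a small lookup. This is only a sketch of the documented behaviour: default_labels_type is a hypothetical helper, not a PyXAI function, and what happens for other problem types is not stated here, so the sketch returns None for them.

```python
def default_labels_type(problem_type, labels_type=None):
    """Hypothetical helper mirroring the documented defaults; not PyXAI code."""
    if labels_type is not None:
        return labels_type                 # an explicit value always wins
    defaults = {
        "classification": "classes",
        "regression": "continuous-values",
    }
    # Problem types without a documented default stay unresolved (None).
    return defaults.get(problem_type)

print(default_labels_type("classification"))
print(default_labels_type("regression"))
```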

get_item_function : Callable (optional, default=None)

A function to get an instance from the dataset. This function is used to get an instance in the right format for the model and the explainer.
If the dataset is a pandas DataFrame and the instances are tabular, this function is not necessary and can be set to None.
In other cases, this function should be defined by the user. It should take as input a row of the dataframe and return the corresponding instance in the right format for the model and the explainer.
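
As an illustration of the expected shape of such a function, here is a minimal sketch. The column names "width", "height" and "pixels" are assumptions made for the example, and the returned format (a grid of pixel values) is only one possibility; the real format depends on the model and the explainer.

```python
def get_item(row):
    """Hypothetical get_item_function: turn one row into an instance.

    `row` only needs to support lookup by column name, which holds for a
    pandas.Series as well as for the plain dict used below.
    """
    # Assumed column names; adapt them to the real dataset.
    width, height = row["width"], row["height"]
    pixels = row["pixels"]
    # Reshape the flat pixel list into a height x width grid (list of rows).
    return [pixels[r * width:(r + 1) * width] for r in range(height)]

row = {"width": 3, "height": 2, "pixels": [0, 1, 2, 3, 4, 5]}
instance = get_item(row)
print(instance)
```

A function like this would then be passed through the get_item_function parameter of the constructor.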

instances_directory : str (optional, default=None)

The directory where the instances are stored (only for a JSON dataset).
This parameter is used to extend the path of instances in the dataframe when the instances are of type image and the dataset is given as a json file.

labels_directory : str (optional, default=None)

The directory where the labels are stored (only for a JSON dataset).
This parameter is used to extend the paths of labels in the dataframe when the labels are of type masks and the dataset is given as a json file.
Warning: NOT YET IMPLEMENTED

Examples

Example 1

from pyxai import Learning, Tools
learner = Learning.Scikitlearn(Tools.Options.dataset, problem_type=Learning.CLASSIFICATION)
model = learner.evaluate(splitting_method=Learning.HOLD_OUT, model_type=Learning.RF)
instance, prediction = learner.get_instances(n=1)

Example 2

from pyxai import Learning, Tools
learner = Learning.Xgboost(Tools.Options.dataset, problem_type=Learning.CLASSIFICATION)
model = learner.evaluate(splitting_method=Learning.HOLD_OUT, model_type=Learning.BT, splitting_parameters={'test_size':0.2}, model_parameters={'max_depth':6, 'base_score':0.5})

Parameters

splitting_method : str | SplittingMethod

The splitting method used for the evaluation (hold-out, k-folds, …)
Possible values are defined in the SplittingMethod enum.

model_type : str | ModelType

The type of model (linear, tree-based, neural network, …)
Possible values are defined in the ModelType enum.

model_parameters : dict (optional, default={})

Parameters to pass to the learner, depending on the library used (sklearn, xgboost, …)
For example, for a RandomForestClassifier, we can set model_parameters to {"n_estimators":50, "max_depth":4}.

splitting_parameters : dict (optional, default={})

Parameters to pass to the splitting method (hold-out, k-folds, …)
For example, for the Learning.LEAVE_ONE_GROUP_OUT method, we can set splitting_parameters to {'n_models':2, 'random_state':0}.

Returns

DecisionTree | RandomForest | BoostedTrees | BoostedTreesRegression :

The PyXai model created.

list of (DecisionTree | RandomForest | BoostedTrees | BoostedTreesRegression) :

A list of PyXai models when several models were created with the splitting method.

Examples

from pyxai import Learning
learner = Learning.LightGBM("tests/datasets/dermatology.csv", problem_type=Learning.REGRESSION)
models = learner.evaluate(splitting_method=Learning.K_FOLDS, model_type=Learning.BT, splitting_parameters={'random_state':0}, model_parameters={'learning_rate':0.3, 'n_estimators':5, 'random_state':0})

The default values of model_parameters and splitting_parameters are empty Python dictionaries.
As a result, the random seeds of the model and of the splitting method are set to None by default.
This makes the evaluation non-deterministic by default.
To make it deterministic, add the key/value pair 'random_state':0 to model_parameters and/or splitting_parameters.
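
The effect of a fixed seed can be seen on a toy hold-out split. This is a minimal sketch built on Python's random module; hold_out_split is a hypothetical function written for the illustration and does not reproduce how PyXAI or the underlying libraries split the data.

```python
import random

def hold_out_split(n_instances, test_size=0.2, random_state=None):
    """Hypothetical hold-out split over the indexes 0..n_instances-1."""
    indexes = list(range(n_instances))
    # random.Random(None) seeds from system entropy, so the shuffle is not
    # reproducible across calls; an integer seed makes it repeatable.
    random.Random(random_state).shuffle(indexes)
    n_test = int(n_instances * test_size)
    return indexes[n_test:], indexes[:n_test]   # (train indexes, test indexes)

# With a fixed seed, two calls give the same split.
first = hold_out_split(10, random_state=0)
second = hold_out_split(10, random_state=0)
print(first == second)
```

With random_state left at None, the two calls would generally return different splits, which is the non-deterministic behaviour described above.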

    def get_instances(self,
                      model=None,
                      indexes=InstancesSet.All,
                      *,
                      dataset=None,
                      n=None,
                      is_correct=None,
                      subset_predicted_classes=None,
                      subset_true_classes=None,
                      save_directory=None,
                      instances_id=None,
                      seed=0,
                      train_indexes=None,
                      test_indexes=None,
                      details=False) -> list[tuple[pandas.Series, int|float|str]] | list[dict] | tuple[None, None] | tuple[pandas.Series, int|float|str] | dict :

Return (instance, prediction) pairs from the dataset, where each prediction is the one made by the given model.

Parameters

model : DecisionTree | RandomForest | BoostedTrees | BoostedTreesRegression (optional, default=None)

A PyXAI model.

indexes : str | InstancesSet (optional, default=InstancesSet.All)

The type of selected instances (all, from the training instances, from the testing instances, …)
Possible values are defined in the InstancesSet enum.
Can also be a str giving the path of a file containing specific indexes.

dataset : str | pandas.DataFrame | NoneData (optional, default=None)

The dataset to use, either as a path to a csv, json or excel file or as a pandas DataFrame.
This parameter is only needed when the dataset is not already loaded; otherwise it can be left as None.

n : int (optional, default=None)

The desired number of instances (None for all).

is_correct : True | False (optional, default=None)

Only available if a model is given.
    - None: All instances (no filter).
    - True: Only correctly classified instances by the model.
    - False: Only misclassified instances by the model.

subset_predicted_classes : list[int] (optional, default=None)

- None: All instances (no filter).
- list[int]: List of classes for the desired instances considering the model prediction.

subset_true_classes : list[int] (optional, default=None)

- None: All instances (no filter).
- list[int]: List of classes for the desired instances, taking into account the true labels.
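
The three filters (is_correct, subset_predicted_classes and subset_true_classes) can be pictured as successive tests on each instance. The sketch below is a hypothetical re-implementation, not PyXAI code: it works on dictionaries with the keys “instance”, “prediction”, “label” and “index” described for the details parameter, and it assumes that the filters are combined conjunctively.

```python
def filter_instances(details, is_correct=None,
                     subset_predicted_classes=None, subset_true_classes=None):
    """Hypothetical version of the three filters; None means 'no filter'."""
    selected = []
    for item in details:
        correct = item["prediction"] == item["label"]
        if is_correct is not None and correct != is_correct:
            continue
        if (subset_predicted_classes is not None
                and item["prediction"] not in subset_predicted_classes):
            continue
        if (subset_true_classes is not None
                and item["label"] not in subset_true_classes):
            continue
        selected.append(item)
    return selected

# Made-up instances for the illustration.
details = [
    {"index": 0, "instance": [5.1, 3.5], "prediction": 0, "label": 0},
    {"index": 1, "instance": [6.2, 2.9], "prediction": 1, "label": 2},
    {"index": 2, "instance": [5.9, 3.0], "prediction": 2, "label": 2},
]
misclassified = filter_instances(details, is_correct=False)
predicted_as_2 = filter_instances(details, subset_predicted_classes=[2])
truly_2 = filter_instances(details, subset_true_classes=[2])
print([item["index"] for item in misclassified])
```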

save_directory : str (optional, default=None)

The directory in which a file containing the instance indexes is saved.

instances_id : str (optional, default=None)

An identifier added to the name of the file saved with the save_directory parameter. It is also useful to load instances using the indexes parameter.

seed : int | None (optional, default=0)

The seed used to shuffle the result. Set it to an integer to shuffle the result with this seed (the default value is 0), or to None to obtain fully random instances.

train_indexes : list[int] (optional, default=None)

List of training indexes; the instances are selected from this subset of indexes.

test_indexes : list[int] (optional, default=None)

List of testing indexes; the instances are selected from this subset of indexes.

details : True | False (optional, default=False)

Set to True to obtain a Python list in which each instance is a Python dictionary with the keys “instance”, “prediction”, “label” and “index”.

Returns

list[tuple(pandas.Series, int | float | str)] | list[dict] :

The selected instances as (instance, prediction) pairs, where each prediction is the one made by the model.

tuple(pandas.Series, int | float | str) | dict :

Note that when only one instance is requested (n=1), the method returns a single tuple rather than a list.
Python dictionaries are returned when the details parameter is set to True.
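
Because the return shape changes with n and details, calling code may want to normalise it. The following sketch shows one way to do so; as_list is a hypothetical helper, and treating (None, None) as "no instance selected" is an assumption based on the tuple[None, None] entry in the signature.

```python
def as_list(result):
    """Hypothetical helper: always return a list of instances."""
    if isinstance(result, list):
        return result                      # several instances were returned
    if isinstance(result, tuple) and result[0] is None:
        return []                          # assumed meaning: nothing selected
    return [result]                        # a single tuple or a single dict

# Stand-in values; real instances would be pandas.Series objects.
print(as_list(("instance-a", 1)))
print(as_list([("instance-a", 1), ("instance-b", 0)]))
print(as_list((None, None)))
```

The check uses `result[0] is None` rather than comparing the whole tuple, since comparing a pandas.Series with `==` does not yield a single boolean.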

Examples

Example 1

from pyxai import Learning
learner = Learning.Scikitlearn(dataset, problem_type='regression')
model = learner.evaluate(splitting_method='hold-out', model_type='linear-ridge', splitting_parameters={'random_state':0}, model_parameters={})
instances = learner.get_instances(model, indexes='train-in-priority', n=5, seed=72)

Example 2

from pyxai import Learning
learner = Learning.Scikitlearn(dataset, problem_type='classification')
model = learner.evaluate(splitting_method='hold-out', model_type='linear-ridge', splitting_parameters={'random_state':0}, model_parameters={})
instances = learner.get_instances(model, indexes='test', n=5, subset_predicted_classes=['Iris-versicolor', 'Iris-virginica'])

Parameters

labels : int | float

The real values (labels) in the dataset.

predictions : int | float

The prediction of a model.

Returns

dict :

A Python dictionary where the keys depend on the ProblemType.
See the page of the Metrics class for more information.

Examples

from pyxai import Learning
learner = Learning.Scikitlearn("tests/datasets/dermatology.csv", problem_type=Learning.CLASSIFICATION)
labels = [1,1,1,1,1,0,0,0,0,0]
predictions = [1,1,1,1,1,0,0,0,0,0]
metrics = learner.compute_metrics(labels, predictions)
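
To give an idea of what such a dictionary may contain for a classification problem, here is a sketch that computes the accuracy by hand. classification_metrics is a hypothetical stand-in and the key name "accuracy" is an assumption; the actual keys are those documented for the Metrics class.

```python
def classification_metrics(labels, predictions):
    """Hypothetical stand-in returning a metrics dictionary (key name assumed)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one label is required")
    # Count the positions where the prediction matches the true label.
    n_correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return {"accuracy": n_correct / len(labels)}

labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]   # two mistakes out of ten
metrics = classification_metrics(labels, predictions)
print(metrics)
```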

def get_details(self):

Get some details about the learner and the models. This information is available in the LearnerInformation class.

Returns

list of dict :

A list containing one Python dictionary per model.

Examples

from pyxai import Learning
learner = Learning.Scikitlearn("tests/datasets/dermatology.csv", problem_type=Learning.CLASSIFICATION)
models = learner.evaluate(splitting_method=Learning.K_FOLDS, model_type=Learning.DT)
for id, model in enumerate(models):
    metrics = learner.get_details()[id]["metrics"]

def get_raw_models(self):

Get the raw models from the library used to create the models (sklearn, xgboost, lightgbm, …)

Returns

list of (DecisionTreeClassifier | RandomForestClassifier | XGBClassifier | XGBRegressor | LGBMRegressor) :

A list of raw models.

Examples

from pyxai import Learning
learner = Learning.Scikitlearn("tests/datasets/dermatology.csv", problem_type=Learning.CLASSIFICATION)
models = learner.evaluate(splitting_method=Learning.K_FOLDS, model_type=Learning.DT)
for id, model in enumerate(models):
    raw_model = learner.get_raw_models()[id]
