
Generating Models with PyXAI

The Learning module of PyXAI provides methods to:

  • create a Scikit-learn or an XGBoost ML classifier;
  • create an XGBoost or a LightGBM ML regressor;
  • carry out an experimental protocol using a train-test split technique (based on a cross-validation method);
  • get one or several ML models based on Decision Trees, Random Forests or Boosted Trees;
  • get, save and load some specific instances and models.

On this page, we detail the first four points. For the last one, please see the Saving/Loading Models page.

Loading Data

The first step is to create a Learner object that contains all methods needed to generate models. To this aim, you can use one of these methods depending on the chosen library:

  • Learning.Scikitlearn(dataset)
  • Learning.Xgboost(dataset)
  • Learning.LightGBM(dataset)
Learning.Scikitlearn|Learning.Xgboost|Learning.LightGBM(dataset, learner_type=None):
Returns a Learner object that contains all the methods needed to generate models of a given type (classification or regression).
dataset String pandas.DataFrame: Either the file path of the dataset in CSV or Excel format, or a pandas.DataFrame object representing the data.
learner_type Learning.CLASSIFICATION Learning.REGRESSION: The type of models that will be used for this dataset.


from pyxai import Learning
learner = Learning.Xgboost("../dataset/iris.csv", learner_type=Learning.CLASSIFICATION)
data:
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width         Species
0             5.1          3.5           1.4          0.2     Iris-setosa
1             4.9          3.0           1.4          0.2     Iris-setosa
2             4.7          3.2           1.3          0.2     Iris-setosa
3             4.6          3.1           1.5          0.2     Iris-setosa
4             5.0          3.6           1.4          0.2     Iris-setosa
..            ...          ...           ...          ...             ...
145           6.7          3.0           5.2          2.3  Iris-virginica
146           6.3          2.5           5.0          1.9  Iris-virginica
147           6.5          3.0           5.2          2.0  Iris-virginica
148           6.2          3.4           5.4          2.3  Iris-virginica
149           5.9          3.0           5.1          1.8  Iris-virginica

[150 rows x 5 columns]
--------------   Information   ---------------
Dataset name: ../dataset/iris.csv
nFeatures (nAttributes, with the labels): 5
nInstances (nObservations): 150
nLabels: 3

You can launch your program in command line with the -dataset option to specify the dataset filename:

python3 example.py -dataset="../dataset/iris.csv"

To get the value of the -dataset option in your program, you need to import the Tools module:

from pyxai import Learning, Tools
learner = Learning.Xgboost(Tools.Options.dataset)

The dataset must specify the feature names in the first row and the classes/values in the last column. If this is not the case, you must modify your data using the pandas library and provide a pandas.DataFrame to the functions of the Learning module. In this example, we add the missing feature names:

import pandas
data = pandas.read_csv("../dataset/iris.data", names=['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Class'])
learner = Learning.Xgboost(data)

You can also use the Preprocessor object of PyXAI, which helps you clean the dataset.
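
As a rough illustration only, here is a hedged sketch of dataset cleaning with the Preprocessor object; the constructor parameters and the process()/export() calls shown below are assumptions to be checked against the Preprocessor page, not verified signatures.

from pyxai import Learning

# Hedged sketch: clean a dataset with the Preprocessor object before learning.
# The target_feature parameter and the process()/export() calls are assumptions.
preprocessor = Learning.Preprocessor("../dataset/iris.csv",
                                     target_feature="Species",
                                     learner_type=Learning.CLASSIFICATION)
preprocessor.process()  # assumed: apply the declared transformations
preprocessor.export("iris", output_directory="../dataset")  # assumed: write the cleaned dataset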

Evaluation

The Learner object allows you to learn a classifier/regressor (with the evaluate method) in order to produce one or several models, according to the chosen cross-validation method and ML model.

<Learner Object>.evaluate(method, output, n_models=10, test_size=0.3, **learner_options):
Runs an experimental protocol using the train-test split technique based on cross-validation methods. It performs the train-test split according to the chosen cross-validation method using Scikit-learn, then executes the fit and predict operations of the chosen ML learner (Scikit-learn, XGBoost, LightGBM) as many times as necessary.
method Learning.HOLD_OUT Learning.K_FOLDS Learning.LEAVE_ONE_GROUP_OUT: The cross-validation method.
output Learning.DT Learning.RF Learning.BT: The desired model. Learning.DT and Learning.RF are available with the Scikit-learn library for classification while Learning.BT is compatible with the XGBoost library (classification and regression) and LightGBM (regression).
n_models Integer: The number of models desired. This corresponds to the number of folds of the cross-validator used for Learning.K_FOLDS and Learning.LEAVE_ONE_GROUP_OUT. Not used with Learning.HOLD_OUT, which only returns one model. Default value is 10.
test_size Float (between 0 and 1): Used only with Learning.HOLD_OUT to set the proportion of instances kept for the test set. Default value is 0.3.
learner_options Dict: options passed to the underlying learner as keyword arguments.

Information about cross-validators can be found in the Scikit-learn documentation.
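
For instance, a hold-out evaluation produces a single model. Below is a minimal sketch reusing the learner created above; forwarding seed=0 through learner_options is an assumption based on the options displayed in the trace further down.

# Hold-out sketch: one boosted-tree model trained on 70% of the data
# (test_size=0.3 is the default); seed=0 is assumed to be forwarded to
# the learner through **learner_options.
model = learner.evaluate(method=Learning.HOLD_OUT, output=Learning.BT, test_size=0.3, seed=0)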

In this example, we create 3 boosted trees (classifiers) thanks to the K-folds cross-validator.

models = learner.evaluate(method=Learning.K_FOLDS, output=Learning.BT, n_models=3)
---------------   Evaluation   ---------------
method: KFolds
output: BT
learner_type: Classification
learner_options: {'seed': 0, 'max_depth': None, 'eval_metric': 'mlogloss'}
---------   Evaluation Information   ---------
For the evaluation number 0:
metrics:
   micro_averaging_accuracy: 97.33333333333334
   micro_averaging_precision: 96.0
   micro_averaging_recall: 96.0
   macro_averaging_accuracy: 97.33333333333333
   macro_averaging_precision: 96.02339181286548
   macro_averaging_recall: 96.02339181286548
   true_positives: {'Iris-setosa': 16, 'Iris-versicolor': 18, 'Iris-virginica': 14}
   true_negatives: {'Iris-setosa': 34, 'Iris-versicolor': 30, 'Iris-virginica': 34}
   false_positives: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 1}
   false_negatives: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 1}
   accuracy: 96.0
   sklearn_confusion_matrix: [[16, 0, 0], [0, 18, 1], [0, 1, 14]]
nTraining instances: 100
nTest instances: 50

For the evaluation number 1:
metrics:
   micro_averaging_accuracy: 96.0
   micro_averaging_precision: 94.0
   micro_averaging_recall: 94.0
   macro_averaging_accuracy: 96.0
   macro_averaging_precision: 94.11764705882352
   macro_averaging_recall: 95.23809523809524
   true_positives: {'Iris-setosa': 15, 'Iris-versicolor': 14, 'Iris-virginica': 18}
   true_negatives: {'Iris-setosa': 35, 'Iris-versicolor': 33, 'Iris-virginica': 29}
   false_positives: {'Iris-setosa': 0, 'Iris-versicolor': 3, 'Iris-virginica': 0}
   false_negatives: {'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 3}
   accuracy: 94.0
   sklearn_confusion_matrix: [[15, 0, 0], [0, 14, 0], [0, 3, 18]]
nTraining instances: 100
nTest instances: 50

For the evaluation number 2:
metrics:
   micro_averaging_accuracy: 97.33333333333334
   micro_averaging_precision: 96.0
   micro_averaging_recall: 96.0
   macro_averaging_accuracy: 97.33333333333333
   macro_averaging_precision: 95.83333333333334
   macro_averaging_recall: 96.07843137254902
   true_positives: {'Iris-setosa': 19, 'Iris-versicolor': 15, 'Iris-virginica': 14}
   true_negatives: {'Iris-setosa': 31, 'Iris-versicolor': 33, 'Iris-virginica': 34}
   false_positives: {'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 2}
   false_negatives: {'Iris-setosa': 0, 'Iris-versicolor': 2, 'Iris-virginica': 0}
   accuracy: 96.0
   sklearn_confusion_matrix: [[19, 0, 0], [0, 15, 2], [0, 0, 14]]
nTraining instances: 100
nTest instances: 50

---------------   Explainer   ----------------
For the evaluation number 0:
**Boosted Tree model**
NClasses: 3
nTrees: 300
nVariables: 26

For the evaluation number 1:
**Boosted Tree model**
NClasses: 3
nTrees: 300
nVariables: 31

For the evaluation number 2:
**Boosted Tree model**
NClasses: 3
nTrees: 300
nVariables: 22

Beyond carrying out the experimental protocol, this method returns the models in a dedicated format suited to the computation of explanations.

However, this may not meet your requirements (you may need another ML classifier and/or another cross-validation method):

  • Other ML classifiers and cross-validation methods are under development (the objective is to offer all cross-validation methods of Scikit-learn);
  • You can code your own experimental protocol and then import your models (see the Importing Models page).

You can use learner.get_label_from_value(value) and learner.get_value_from_label(label) to convert between the original labels and the values coming from the encoding of labels. The Python dictionary learner.dict_labels contains the encoding performed.
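
For instance, with the iris dataset used above (the exact mapping shown in the comments is an assumption; it depends on the encoding actually performed):

print(learner.dict_labels)                          # e.g. {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
print(learner.get_label_from_value(0))              # e.g. 'Iris-setosa'
print(learner.get_value_from_label("Iris-setosa"))  # e.g. 0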

Selecting Instances

PyXAI can easily select specific instances thanks to the get_instances method.

<Learner Object>.get_instances(model=None, indexes=Indexes.All, *, dataset=None, n=None, correct=None, predictions=None, save_directory=None, instances_id=None, seed=0, details=False):
Returns the instances in a Tuple. Each instance comes with the prediction of the model, or alone when no model is given (model=None). An instance is represented by a numpy.array object. Note that when only one instance is requested (n=1), the method returns just the instance, not a Tuple of instances.
By default, the function returns the same instances when called with the same parameters. Set the seed parameter to None for a random selection. Set the details parameter to True to get more details on the instances (predictions, labels and indexes in the dataset).
model DecisionTree RandomForest BoostedTrees: The model computed by the evaluation method.
indexes Learning.TRAINING Learning.TEST Learning.MIXED Learning.ALL String: Selects instances from the training set (Learning.TRAINING), from the test set (Learning.TEST), or from both with priority given to training instances (Learning.MIXED). By default, set to Learning.ALL, which takes all instances into account. Finally, when the indexes parameter is a String, it is interpreted as a file containing indexes, and the method loads the associated instances.
dataset String pandas.DataFrame: In some situations, this method needs the dataset (Optional).
n Integer: The desired number of instances (None for all).
correct None True False: Only available if a model is given. Selects by default all instances (None), only the instances correctly classified by the model (True), or only the misclassified ones (False).
predictions None List of Integer: Only available if a model is given. Selects by default all instances (None) or only those whose predicted class/label belongs to the given List of Integer.
save_directory None String: Save the instance indexes in a file inside the directory given by this parameter.
instances_id None Integer: Adds an identifier to the name of the file saved with the save_directory parameter; also useful to load instances using the indexes parameter.
seed None Integer: Set to None to obtain fully random instances, or to an Integer to shuffle the result with this seed. Default value is 0.
details True False: Set to True to obtain a List of instances in the form of Python dictionaries with the keys “instance”, “prediction”, “label” and “index”. Default value is False.

More details on the indexes, save_directory, and instances_id parameters are given on the Saving/Loading Models page. Let us now look at some examples of use.

First, we select only one instance (we take the first model among the three models computed). We directly get the instance and the prediction.

instance, prediction = learner.get_instances(models[0], n=1)
print(instance, prediction)
---------------   Instances   ----------------
number of instances selected: 1
----------------------------------------------
[5.1 3.5 1.4 0.2] 0

Now, we take 3 instances. We obtain a list of instances.

instances = learner.get_instances(models[0], n=3)
print(instances)
---------------   Instances   ----------------
number of instances selected: 3
----------------------------------------------
[(array([5.1, 3.5, 1.4, 0.2]), 0), (array([4.9, 3. , 1.4, 0.2]), 0), (array([4.7, 3.2, 1.3, 0.2]), 0)]

The same invocation without the model parameter leads to a different output (the prediction of the model is not provided):

instances = learner.get_instances(n=3)
print(instances)
---------------   Instances   ----------------
number of instances selected: 3
----------------------------------------------
[(array([5.1, 3.5, 1.4, 0.2]), None), (array([4.9, 3. , 1.4, 0.2]), None), (array([4.7, 3.2, 1.3, 0.2]), None)]

Now, consider 3 instances for which the prediction given by the model is equal to 2.

instances = learner.get_instances(models[0], n=3, predictions=[2])
print(instances)
---------------   Instances   ----------------
number of instances selected: 3
----------------------------------------------
[(array([6. , 2.7, 5.1, 1.6]), 2), (array([6.3, 3.3, 6. , 2.5]), 2), (array([5.8, 2.7, 5.1, 1.9]), 2)]

Next, we focus on 3 instances for which the prediction given by the model is equal to 2 and wrong (i.e., the prediction returned by the model differs from the label contained in the dataset). Note that only one instance meets these criteria.

instances = learner.get_instances(models[0], n=3, predictions=[2], correct=False)
print(instances)
---------------   Instances   ----------------
number of instances selected: 1
----------------------------------------------
[(array([6. , 2.7, 5.1, 1.6]), 2)]

Now, we want to get random instances (two different calls provide different instances).

instance, prediction = learner.get_instances(models[0], n=1, seed=None)
print(instance, prediction)
instance, prediction = learner.get_instances(models[0], n=1, seed=None)
print(instance, prediction)
---------------   Instances   ----------------
number of instances selected: 1
----------------------------------------------
[6.7 3.3 5.7 2.1] 2
---------------   Instances   ----------------
number of instances selected: 1
----------------------------------------------
[5.7 2.8 4.5 1.3] 1

Here we show how the details parameter works to obtain the predictions and labels:

instances = learner.get_instances(models[2], n=3, details=True)
print(instances)
---------------   Instances   ----------------
number of instances selected: 3
----------------------------------------------
[{'instance': array([5.1, 3.5, 1.4, 0.2]), 'prediction': 0, 'label': 0, 'index': 0}, {'instance': array([4.9, 3. , 1.4, 0.2]), 'prediction': 0, 'label': 0, 'index': 1}, {'instance': array([4.7, 3.2, 1.3, 0.2]), 'prediction': 0, 'label': 0, 'index': 2}]

Finally, we want to pick 3 instances among the test instances:

instances = learner.get_instances(models[2], indexes=Learning.TEST, n=3)
print(instances)
---------------   Instances   ----------------
number of instances selected: 3
----------------------------------------------
[(array([5.1, 3.5, 1.4, 0.2]), 0), (array([5.4, 3.9, 1.7, 0.4]), 0), (array([4.9, 3.1, 1.5, 0.1]), 0)]

Saving or loading instances is presented in the Saving/Loading Models page.

A complete example

As you can see, carrying out an experimental protocol requires very few instructions. The Learning module allows us to easily obtain the models and instances that we want to explain.

from pyxai import Learning

learner = Learning.Xgboost("../dataset/iris.csv")
models = learner.evaluate(method=Learning.K_FOLDS, output=Learning.BT)
for id_model, model in enumerate(models):
    instances_with_prediction = learner.get_instances(model, n=10, indexes=Learning.TEST)
    for instance, prediction in instances_with_prediction:
        print("instance:", instance)
        print("prediction", prediction)