
Generating Models with PyXAI

The Learning module of PyXAI provides methods to:

  • create a Scikit-learn or an XGBoost ML classifier;
  • create an XGBoost or a LightGBM ML regressor;
  • carry out an experimental protocol using a train-test split technique (i.e. a cross-validation method);
  • get one or several ML models based on Decision Trees, Random Forests or Boosted Trees;
  • get, save and load specific instances and models.

This page details the first four points and the selection of instances. For saving and loading instances and models, please see the Saving/Loading Models page.

Loading Data

The first step is to create a Learner object, which contains all the methods needed to generate models. To do so, use one of the following constructors, depending on the chosen library:

  • Learning.Scikitlearn
  • Learning.Xgboost
  • Learning.LightGBM

from pyxai import Learning
learner = Learning.Xgboost("../dataset/iris.csv", problem_type=Learning.CLASSIFICATION)
--------------   Information   ---------------
Problem type: classification
Instances type: tabular
Labels type: classes

Dataset path: ../dataset/iris.csv
nFeatures (nAttributes, with the labels): 4
nInstances (nObservations): 150
nLabels: 3

You can launch your program from the command line with the -dataset option to specify the dataset filename:

python3 example.py -dataset="../dataset/iris.csv"

To get the value of the -dataset option in your program, you need to import the Tools module:

from pyxai import Learning, Tools
learner = Learning.Xgboost(Tools.Options.dataset)
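If you would rather not rely on Tools.Options, the same command line can be parsed with the standard library. This is a minimal sketch, not PyXAI's own option handling; build_parser is a hypothetical helper name.

```python
import argparse


def build_parser():
    # Hypothetical helper: declares a single-dash "-dataset" option,
    # mirroring the command line shown above.
    parser = argparse.ArgumentParser(description="Example launcher")
    parser.add_argument("-dataset", required=True, help="path to the CSV dataset")
    return parser


# An explicit argument list keeps the sketch independent of sys.argv.
args = build_parser().parse_args(["-dataset=../dataset/iris.csv"])
print(args.dataset)
```

In a real script, call parse_args() without arguments so that the options are read from the actual command line.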

The dataset must give the column names in the first row and the classes/values to predict in the last column. If this is not the case, modify your data using the pandas library and pass a pandas.DataFrame to the functions of the Learning module. In this example, we add the missing column names:

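import pandas
from pyxai import Learning

# iris.data has no header row, so the column names are supplied explicitly;
# the class column comes last, as required.
data = pandas.read_csv("../dataset/iris.data", names=['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Class'])
learner = Learning.Xgboost(data)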
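The expected layout can also be checked or repaired without pandas. The sketch below uses only the standard csv module to prepend a header row to headerless data and to move a chosen target column to the last position. The helper names add_header and move_column_last, and the inline sample rows, are illustrative assumptions.

```python
import csv
import io


def add_header(rows, names):
    """Return the rows with a header row prepended."""
    for row in rows:
        if len(row) != len(names):
            raise ValueError(f"expected {len(names)} fields, got {len(row)}")
    return [list(names)] + [list(row) for row in rows]


def move_column_last(table, target):
    """Reorder the columns so that `target` (a header name) comes last."""
    header = table[0]
    idx = header.index(target)
    order = [i for i in range(len(header)) if i != idx] + [idx]
    return [[row[i] for i in order] for row in table]


# Two headerless sample rows in the layout of iris.data.
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
rows = list(csv.reader(io.StringIO(raw)))
names = ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Class']
table = add_header(rows, names)

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(table)
print(out.getvalue())
```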

You can also use the Preprocessor object of PyXAI that helps you to clean the dataset.

Evaluation

The Learner object lets you learn a classifier or a regressor with the evaluate method, which produces one or several models depending on the chosen cross-validation method and ML model type.

Information about cross-validators can be found in the Scikit-learn documentation.
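As a reminder of what a K-folds cross-validator does, the sketch below splits instance indexes into k test folds; each model is trained on the remaining indexes. It is a plain-Python illustration, not the code of PyXAI or Scikit-learn, and k_fold_indices is a hypothetical name. With 150 instances and 3 folds, every split has 100 training and 50 testing instances.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train, test) index lists for k shuffled folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_instances, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test


splits = list(k_fold_indices(150, 3))
for number, (train, test) in enumerate(splits):
    print(number, len(train), len(test))
```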

In the following call, we create 3 boosted trees (classifiers) using the K-folds cross-validator.

models = learner.evaluate(splitting_method=Learning.K_FOLDS, model_type=Learning.BT, splitting_parameters={'n_models': 3, 'random_state': 0})
---------------   Model creation, fitting and evaluation  ---------------
Splitting method: k-folds
Problem type: classification
Models type: boosted-tree
model_parameters: {}
---------   Evaluation Information   ---------
For the evaluation number 0:
Metrics:
   micro_averaging_accuracy: 97.33333333333334
   micro_averaging_precision: 96.0
   micro_averaging_recall: 96.0
   macro_averaging_accuracy: 97.33333333333333
   macro_averaging_precision: 96.02339181286548
   macro_averaging_recall: 96.02339181286548
   true_positives: {'Iris-setosa': 16, 'Iris-versicolor': 18, 'Iris-virginica': 14}
   true_negatives: {'Iris-setosa': 34, 'Iris-versicolor': 30, 'Iris-virginica': 34}
   false_positives: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 1}
   false_negatives: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 1}
   accuracy: 96.0
   sklearn_confusion_matrix: [[16, 0, 0], [0, 18, 1], [0, 1, 14]]
Number of Training instances: 100
Number of Testing instances: 50

For the evaluation number 1:
Metrics:
   micro_averaging_accuracy: 94.66666666666667
   micro_averaging_precision: 92.0
   micro_averaging_recall: 92.0
   macro_averaging_accuracy: 94.66666666666667
   macro_averaging_precision: 91.66666666666666
   macro_averaging_recall: 92.85714285714285
   true_positives: {'Iris-setosa': 15, 'Iris-versicolor': 13, 'Iris-virginica': 18}
   true_negatives: {'Iris-setosa': 34, 'Iris-versicolor': 33, 'Iris-virginica': 29}
   false_positives: {'Iris-setosa': 1, 'Iris-versicolor': 3, 'Iris-virginica': 0}
   false_negatives: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 3}
   accuracy: 92.0
   sklearn_confusion_matrix: [[15, 0, 0], [1, 13, 0], [0, 3, 18]]
Number of Training instances: 100
Number of Testing instances: 50

For the evaluation number 2:
Metrics:
   micro_averaging_accuracy: 97.33333333333334
   micro_averaging_precision: 96.0
   micro_averaging_recall: 96.0
   macro_averaging_accuracy: 97.33333333333333
   macro_averaging_precision: 95.83333333333334
   macro_averaging_recall: 96.07843137254902
   true_positives: {'Iris-setosa': 19, 'Iris-versicolor': 15, 'Iris-virginica': 14}
   true_negatives: {'Iris-setosa': 31, 'Iris-versicolor': 33, 'Iris-virginica': 34}
   false_positives: {'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 2}
   false_negatives: {'Iris-setosa': 0, 'Iris-versicolor': 2, 'Iris-virginica': 0}
   accuracy: 96.0
   sklearn_confusion_matrix: [[19, 0, 0], [0, 15, 2], [0, 0, 14]]
Number of Training instances: 100
Number of Testing instances: 50

---------------   Explainer   ----------------


For the split number 0:
**Boosted Tree model**
NClasses: 3
nTrees: 300
nVariables: 23

For the split number 1:
**Boosted Tree model**
NClasses: 3
nTrees: 300
nVariables: 25

For the split number 2:
**Boosted Tree model**
NClasses: 3
nTrees: 300
nVariables: 20

Beyond carrying out the experimental protocol, this method returns the models in a dedicated format suited to the computation of explanations.

However, this may not meet your requirements (you may need another ML classifier and/or another cross-validation method):

  • Other ML classifiers and cross-validation methods are under development (the objective is to offer all cross-validation methods of Scikit-learn);
  • You can code your own experimental protocol and then import your models (see the Importing Models page).
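The per-class counts and averaged metrics printed by evaluate can be recomputed from the confusion matrix. The sketch below applies the standard definitions to the matrix of evaluation number 0, reading rows as true classes and columns as predicted classes (the Scikit-learn convention). Whether PyXAI uses exactly these formulas is an assumption, but they reproduce the values printed above for that evaluation.

```python
def per_class_counts(cm):
    """Return (tp, fp, fn, tn) lists from a square confusion matrix."""
    size = len(cm)
    total = sum(sum(row) for row in cm)
    tp = [cm[i][i] for i in range(size)]
    # Column sum minus diagonal: predicted as class i but belonging elsewhere.
    fp = [sum(cm[r][i] for r in range(size)) - cm[i][i] for i in range(size)]
    # Row sum minus diagonal: belonging to class i but predicted elsewhere.
    fn = [sum(cm[i]) - cm[i][i] for i in range(size)]
    tn = [total - tp[i] - fp[i] - fn[i] for i in range(size)]
    return tp, fp, fn, tn


cm = [[16, 0, 0], [0, 18, 1], [0, 1, 14]]  # evaluation number 0
tp, fp, fn, tn = per_class_counts(cm)
total = sum(sum(row) for row in cm)

accuracy = 100 * sum(tp) / total
micro_precision = 100 * sum(tp) / (sum(tp) + sum(fp))
# Assumes every class is predicted at least once (no division by zero).
macro_precision = 100 * sum(t / (t + f) for t, f in zip(tp, fp)) / len(cm)
micro_accuracy = 100 * sum(t + u for t, u in zip(tp, tn)) / (len(cm) * total)

print(tp, tn, fp, fn)
print(accuracy, micro_precision, macro_precision, micro_accuracy)
```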

Selecting Instances

PyXAI can easily select specific instances thanks to the get_instances method.

More details on the indexes, save_directory, and instances_id parameters are given on the Saving/Loading Models page. Let us now look at some examples of use.

First we select only one instance (we take the first model among the three models computed). We directly get the instance and the prediction.

instance, prediction = learner.get_instances(models[0], n=1)
print(instance, prediction)
---------------   Instances   ----------------
Number of instances selected: 1
----------------------------------------------
Sepal.Length    5.1
Sepal.Width     3.5
Petal.Length    1.4
Petal.Width     0.2
Name: 0, dtype: float64 Iris-setosa

Now, we take 3 instances. We obtain a list of (instance, prediction) pairs.

instances = learner.get_instances(models[0], n=3)
for instance in instances:
    print(instance)
---------------   Instances   ----------------
Number of instances selected: 3
----------------------------------------------
(Sepal.Length    5.1
Sepal.Width     3.5
Petal.Length    1.4
Petal.Width     0.2
Name: 0, dtype: float64, 'Iris-setosa')
(Sepal.Length    4.9
Sepal.Width     3.0
Petal.Length    1.4
Petal.Width     0.2
Name: 1, dtype: float64, 'Iris-setosa')
(Sepal.Length    4.7
Sepal.Width     3.2
Petal.Length    1.3
Petal.Width     0.2
Name: 2, dtype: float64, 'Iris-setosa')

The same invocation but without the model as a parameter leads to a different output: the prediction is not provided.

instances = learner.get_instances(n=3)
print(instances)
---------------   Instances   ----------------
Number of instances selected: 3
----------------------------------------------
[(Sepal.Length    5.1
Sepal.Width     3.5
Petal.Length    1.4
Petal.Width     0.2
Name: 0, dtype: float64, None), (Sepal.Length    4.9
Sepal.Width     3.0
Petal.Length    1.4
Petal.Width     0.2
Name: 1, dtype: float64, None), (Sepal.Length    4.7
Sepal.Width     3.2
Petal.Length    1.3
Petal.Width     0.2
Name: 2, dtype: float64, None)]

Now, consider 3 instances for which the prediction given by the model is equal to Iris-setosa.

instances = learner.get_instances(models[0], n=3, subset_predicted_classes=["Iris-setosa"])
print(instances)
---------------   Instances   ----------------
Number of instances selected: 3
----------------------------------------------
[(Sepal.Length    5.1
Sepal.Width     3.5
Petal.Length    1.4
Petal.Width     0.2
Name: 0, dtype: float64, 'Iris-setosa'), (Sepal.Length    4.9
Sepal.Width     3.0
Petal.Length    1.4
Petal.Width     0.2
Name: 1, dtype: float64, 'Iris-setosa'), (Sepal.Length    4.7
Sepal.Width     3.2
Petal.Length    1.3
Petal.Width     0.2
Name: 2, dtype: float64, 'Iris-setosa')]

Next, we ask for up to 3 instances that the model predicts as Iris-virginica and for which this prediction is wrong (i.e. the prediction returned by the model differs from the label in the dataset). Note that only one instance meets these criteria, so a single instance is returned.

instances = learner.get_instances(models[0], n=3, subset_predicted_classes=["Iris-virginica"], is_correct=False)
print(instances)
---------------   Instances   ----------------


Number of instances selected: 1
----------------------------------------------
[(Sepal.Length    6.0
Sepal.Width     2.7
Petal.Length    5.1
Petal.Width     1.6
Name: 83, dtype: float64, 'Iris-virginica')]
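The filters combine as a conjunction and n acts as an upper bound, which is why a single instance is returned here. The following sketch mimics that behaviour on hand-made records. It is not PyXAI's implementation: select, the record layout and the sample data are assumptions made for illustration.

```python
def select(records, n, predicted_classes=None, is_correct=None):
    """Return up to n (index, prediction) pairs matching every given filter."""
    selected = []
    for index, (label, prediction) in enumerate(records):
        if len(selected) >= n:
            break
        if predicted_classes is not None and prediction not in predicted_classes:
            continue
        if is_correct is not None and (prediction == label) != is_correct:
            continue
        selected.append((index, prediction))
    return selected


# Hypothetical (label, prediction) pairs; only the third one is a wrong
# "Iris-virginica" prediction.
records = [
    ("Iris-setosa", "Iris-setosa"),
    ("Iris-virginica", "Iris-virginica"),
    ("Iris-versicolor", "Iris-virginica"),
    ("Iris-versicolor", "Iris-versicolor"),
]
wrong = select(records, n=3, predicted_classes=["Iris-virginica"], is_correct=False)
print(wrong)
```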

Now, we select random instances by setting seed=None (two successive calls generally return different instances).

instance, prediction = learner.get_instances(models[0], n=1, seed=None)
print(instance, prediction)
instance, prediction = learner.get_instances(models[0], n=1, seed=None)
print(instance, prediction)
---------------   Instances   ----------------


Number of instances selected: 1
----------------------------------------------
Sepal.Length    6.3
Sepal.Width     2.5
Petal.Length    4.9
Petal.Width     1.5
Name: 72, dtype: float64 Iris-versicolor
---------------   Instances   ----------------
Number of instances selected: 1
----------------------------------------------
Sepal.Length    5.8
Sepal.Width     2.7
Petal.Length    5.1
Petal.Width     1.9
Name: 142, dtype: float64 Iris-virginica
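The difference between a fixed seed and seed=None can be illustrated with the standard random module. This sketch only shows the general principle (a fixed seed gives a reproducible draw, while None takes fresh entropy on each call). How PyXAI uses the seed internally is not described on this page, and pick is a hypothetical helper.

```python
import random


def pick(n_instances, n, seed):
    """Pick n distinct indexes; reproducible only when seed is not None."""
    rng = random.Random(seed)  # seed=None means seeding from system entropy
    return rng.sample(range(n_instances), n)


# Same seed, same draw.
print(pick(150, 1, seed=42) == pick(150, 1, seed=42))
# No seed: the two draws are independent and usually differ.
print(pick(150, 1, seed=None), pick(150, 1, seed=None))
```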

Here we show how the details parameter works to obtain the predictions and labels:

instances = learner.get_instances(models[2], n=3, details=True)
print(instances)
---------------   Instances   ----------------
Number of instances selected: 3
----------------------------------------------
[{'instance': Sepal.Length    5.1
Sepal.Width     3.5
Petal.Length    1.4
Petal.Width     0.2
Name: 0, dtype: float64, 'prediction': 'Iris-setosa', 'label': 'Iris-setosa', 'index': 0}, {'instance': Sepal.Length    4.9
Sepal.Width     3.0
Petal.Length    1.4
Petal.Width     0.2
Name: 1, dtype: float64, 'prediction': 'Iris-setosa', 'label': 'Iris-setosa', 'index': 1}, {'instance': Sepal.Length    4.7
Sepal.Width     3.2
Petal.Length    1.3
Petal.Width     0.2
Name: 2, dtype: float64, 'prediction': 'Iris-setosa', 'label': 'Iris-setosa', 'index': 2}]
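With details=True each entry is a dictionary, which makes post-processing straightforward. The sketch below works on a hand-written list with the same keys as the output above ('instance', 'prediction', 'label', 'index'). The values are made up, and each instance is a plain dict rather than the pandas Series shown in the output.

```python
from collections import Counter

# Hypothetical entries shaped like the output of get_instances(..., details=True).
detailed = [
    {"instance": {"Petal.Length": 1.4}, "prediction": "Iris-setosa",
     "label": "Iris-setosa", "index": 0},
    {"instance": {"Petal.Length": 5.1}, "prediction": "Iris-virginica",
     "label": "Iris-versicolor", "index": 83},
    {"instance": {"Petal.Length": 5.1}, "prediction": "Iris-virginica",
     "label": "Iris-virginica", "index": 142},
]

# Indexes of the misclassified instances.
misclassified = [d["index"] for d in detailed if d["prediction"] != d["label"]]
# Number of instances per predicted class.
per_class = Counter(d["prediction"] for d in detailed)

print(misclassified)
print(per_class)
```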

Finally, we want to select 3 instances from the test set:

instances = learner.get_instances(models[2], indexes=Learning.TEST, n=3)
print(instances)
---------------   Instances   ----------------
Number of instances selected: 3
----------------------------------------------
[(Sepal.Length    5.1
Sepal.Width     3.5
Petal.Length    1.4
Petal.Width     0.2
Name: 0, dtype: float64, 'Iris-setosa'), (Sepal.Length    5.4
Sepal.Width     3.9
Petal.Length    1.7
Petal.Width     0.4
Name: 5, dtype: float64, 'Iris-setosa'), (Sepal.Length    4.9
Sepal.Width     3.1
Petal.Length    1.5
Petal.Width     0.1
Name: 9, dtype: float64, 'Iris-setosa')]

Saving and loading instances are presented on the Saving/Loading Models page.

A complete example

As you can see, carrying out an experimental protocol requires very few instructions. The Learning module makes it easy to obtain the models and instances that we want to explain.

from pyxai import Learning

learner = Learning.Xgboost("../dataset/iris.csv", problem_type=Learning.CLASSIFICATION)
models = learner.evaluate(splitting_method=Learning.K_FOLDS, model_type=Learning.BT)
for model in models:
    instances_with_prediction = learner.get_instances(model, n=10, indexes=Learning.TEST)
    for instance, prediction in instances_with_prediction:
        print("instance:", instance)
        print("prediction:", prediction)