PyXAI


Learning Module

The Learning module of PyXAI provides methods to:

create a Scikit-learn or an XGBoost ML classifier;
create an XGBoost or a LightGBM ML regressor;
carry out an experimental protocol using a train-test split technique (e.g. a cross-validation method);
get one or several ML models based on Decision Trees, Random Forests or Boosted Trees;
simplify the preparation/cleaning of datasets using a preprocessor;
import, save and load some specific instances and models.

Methods

Generating Models with PyXAI

Learner.__init__: A learner object with a dataset and its characteristics. Allows running an experimental protocol and getting instances.
LearnerInformation.__init__: Class containing information about a learner, a dataset and a model.
Scikitlearn.__init__: A class that extends a Learner to use the Scikit-learn library.
Xgboost.__init__: A class that extends a Learner to use the XGBoost library.
LightGBM.__init__: A class that extends a Learner to use the LightGBM library.
Learner.evaluate: Runs an experimental protocol using the train-test split technique based on cross-validation methods.
Learner.get_instances: Returns pairs (instance, prediction) from the data and the classifier results.
Learner.compute_metrics: Returns the values of some metrics (accuracy, recall, f1_score, specificity, …).
Learner.get_details: Gets some details about the learner and the models. This information is available in the LearnerInformation class.
Learner.get_raw_models: Gets the raw models from the library used to create the models (sklearn, xgboost, lightgbm, …).

Preprocessing Data with PyXAI

TabularPreprocessor.__init__: Preprocessor to modify a tabular dataset (feature encoding, feature deletion, feature types (categorical, numerical or binary), …).
TabularPreprocessor.export: Exports the new dataset into CSV or XLS files and saves some information in a JSON file.
TabularPreprocessor.process: Applies all encodings previously defined by the other methods of this class.
TabularPreprocessor.unset_features: Removes some features.
TabularPreprocessor.set_categorical_features: Encodes the categorical features.
TabularPreprocessor.set_categorical_features_already_one_hot_encoded: Identifies a set of binary features coming from a one-hot encoding as a single categorical feature.
TabularPreprocessor.all_numerical_features: Identifies all features as numerical.
TabularPreprocessor.set_numerical_features: Select and encode the numerical features.
NonTabularPreprocessor.__init__: Preprocessor to create a non-tabular dataset (image, video, voice, …)
NonTabularPreprocessor.add_instance_image: Add an instance as an image.
NonTabularPreprocessor.add_label_class: Add a label as a class.

Import, Save and Load Models

ModelIO.import_models: Import existing ML models.
ModelIO.load: Load models from a directory.
ModelIO.save: Save models in a directory.

Enumerations

Each value of an enumeration can be passed as a parameter using the enumeration member, the corresponding constant of the Learning module, or a Python str shortcut.

ProblemType

Represents the type of problem.

ProblemType.Classification | Learning.CLASSIFICATION | classification:
A classification problem is a supervised learning task where the goal is to predict a categorical label (or class).
ProblemType.Regression | Learning.REGRESSION | regression:
A regression problem is a supervised learning task where the goal is to predict a continuous numerical output.
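
As a rough illustration of how an enumeration member, a module constant and a str shortcut can all designate the same value, the sketch below uses a stand-in enumeration. The class, the constants and the `resolve_problem_type` helper are hypothetical names chosen for this example; they are not PyXAI's implementation.

```python
from enum import Enum


class ProblemType(Enum):
    # Hypothetical stand-in for the enumeration; each value is the str shortcut.
    Classification = "classification"
    Regression = "regression"


# Hypothetical module-level constants, playing the role of
# Learning.CLASSIFICATION and Learning.REGRESSION.
CLASSIFICATION = ProblemType.Classification
REGRESSION = ProblemType.Regression


def resolve_problem_type(value):
    """Accept an enumeration member or its str shortcut and return the member."""
    if isinstance(value, ProblemType):
        return value
    # Enum lookup by value; raises ValueError for an unknown shortcut.
    return ProblemType(value)


same_value = (
    resolve_problem_type("classification")
    is resolve_problem_type(CLASSIFICATION)
    is ProblemType.Classification
)
```

Under these assumptions, the three spellings resolve to the same object, which is why they are interchangeable as parameters.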

ModelType

Represents the type of model.

ModelType.LinearSimple | Learning.LINEAR_SIMPLE | linear-simple:
Only for regression problems. Represents the LinearRegression of the Scikit-learn library.
ModelType.LinearRidge | Learning.LINEAR_RIDGE | linear-ridge:
For a classification problem, uses a RidgeClassifier of the Scikit-learn library.
For a regression problem, uses a Ridge of the Scikit-learn library.
ModelType.LinearLasso | Learning.LINEAR_LASSO | linear-lasso:
Not implemented at this time.
ModelType.LinearElastic | Learning.LINEAR_ELASTIC | linear-elastic:
Not implemented at this time.
ModelType.LinearLogistic | Learning.LINEAR_LOGISTIC | linear-logistic:
Only for classification problems. Represents the LogisticRegression of the Scikit-learn library.
ModelType.NeuralNetwork | Learning.NEURAL_NETWORK | neural-network:
Using the Scikit-learn library, represents either an MLPClassifier or an MLPRegressor according to the problem type (classification or regression).
Other libraries will be taken into account in the future (PyTorch and TensorFlow).
ModelType.DecisionTree | Learning.DT | decision-tree:
Using the Scikit-learn library, represents a DecisionTreeClassifier for classification problems.
ModelType.RandomForest | Learning.RF | random-forest:
Using the Scikit-learn library, represents a RandomForestClassifier for classification problems.
ModelType.BoostedTree | Learning.BT | boosted-tree:
Using the XGBoost library, represents either an XGBClassifier or an XGBRegressor according to the problem type (classification or regression).
Using the LightGBM library, represents an LGBMRegressor for a regression problem.

InstancesType

Represents the type of instances.

InstancesType.Tabular | Learning.TABULAR | tabular:
The dataset is presented as a table where the rows represent the instances and the columns represent the features.
InstancesType.Image | Learning.IMAGE | image:
The instances are in the form of images (png, jpg, …).

LabelsType

Represents the type of labels.

LabelsType.Classes | Learning.CLASSES | classes:
The labels are classes (str or int).
LabelsType.Masks | Learning.MASKS | masks:
The labels are in the form of masks (png, jpg, …).
LabelsType.Contours | Learning.CONTOURS | contours:
The labels are in the form of contours (i.e. lists of positions (x, y) representing the contour of an object in an image).

InstancesSet

Represents a set of instances.

InstancesSet.All | Learning.ALL | all:
All the instances.
InstancesSet.Train | Learning.TRAIN | train:
The training instances.
InstancesSet.Test | Learning.TEST | test:
The testing instances.
InstancesSet.TrainInPriority | Learning.TRAIN_IN_PRIORITY | train-in-priority:
Selects indexes first from the training set and then from the test set.

SplittingMethod

The splitting method used for an evaluation.

SplittingMethod.Predefined | Learning.PREDEFINED | predefined:
Allows the user to predefine the training instances and the test instances in a new column titled “subset”.
SplittingMethod.HoldOut | Learning.HOLD_OUT | hold-out:
A simple train_test_split with the Scikit-learn library.
SplittingMethod.KFolds | Learning.K_FOLDS | k-folds:
The K-Fold cross-validator with the Scikit-learn library.
SplittingMethod.LeaveOneGroupOut | Learning.LEAVE_ONE_GROUP_OUT | leave-one-group-out:
The Leave One Group Out cross-validator with the Scikit-learn library.
SplittingMethod.LoadModel | load-model:
To load a model.
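
To make the k-folds idea concrete, here is a minimal pure-Python sketch of how instance indexes can be partitioned into contiguous, unshuffled folds, each fold serving once as the test set. The function name `k_fold_indexes` is hypothetical; PyXAI delegates the actual splitting to Scikit-learn, so this is only an illustration of the principle.

```python
def k_fold_indexes(n_instances, n_folds):
    """Yield (train_index, test_index) pairs for contiguous, unshuffled folds."""
    if not 2 <= n_folds <= n_instances:
        raise ValueError("n_folds must be between 2 and n_instances")
    # The first `extra` folds receive one additional instance.
    base, extra = divmod(n_instances, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_index = list(range(start, start + size))
        train_index = list(range(0, start)) + list(range(start + size, n_instances))
        yield train_index, test_index
        start += size


# 10 instances split into 3 folds: every index is used exactly once for testing.
folds = list(k_fold_indexes(10, 3))
```

Each pair produced this way corresponds to one trained model, which is why a k-folds evaluation returns several models rather than one.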

ClassificationType

The type of classification.

ClassificationType.BinaryClass | Learning.BINARY_CLASS | binary-class:
A binary classification problem.
ClassificationType.MultiClass | Learning.MULTI_CLASS | multi-class:
A multi-class problem.

MultiClassToBinaryMethod

The method used to encode a multi-class dataset into binary-class datasets.

MultiClassToBinaryMethod.OneVsOne | Learning.ONE_VS_ONE | one-vs-one:
Create one classifier for every pair of classes. For example, with the classes A, B and C, we have: A vs B, A vs C and B vs C.
MultiClassToBinaryMethod.OneVsRest | Learning.ONE_VS_REST | one-vs-rest:
Create one classifier per class. For example, with the classes A, B and C, we have: A vs BC, B vs AC and C vs AB.
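
The two decompositions above can be enumerated with a few lines of Python. This is a sketch of the principle only; the helper names (`one_vs_one`, `one_vs_rest`, `binarize_labels`) are hypothetical and do not come from PyXAI.

```python
from itertools import combinations


def one_vs_one(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


def one_vs_rest(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]


def binarize_labels(labels, positive_class):
    """Relabel a multi-class label list for a one-vs-rest problem."""
    return [1 if label == positive_class else 0 for label in labels]


classes = ["A", "B", "C"]
ovo = one_vs_one(classes)    # pairs: A vs B, A vs C, B vs C
ovr = one_vs_rest(classes)   # A vs BC, B vs AC, C vs AB
```

With n classes, one-vs-one yields n(n-1)/2 classifiers while one-vs-rest yields n, so the two methods coincide in count only for n = 3.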

EncoderType

Represents a data encoder.

EncoderType.OrdinalEncoder | Learning.ORDINAL | ordinal-encoder:
Converts each categorical feature to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature. Uses the OrdinalEncoder of the Scikit-learn library.
EncoderType.OneHotEncoder | Learning.ONE_HOT | oneHot-encoder:
Creates a binary column for each category where the value 1 means that this category is present for an instance, otherwise the value is 0.
Uses the OneHotEncoder of the Scikit-learn library.
EncoderType.DiscretizerEncoder | Learning.DISCRETIZER | discretizer-encoder:
This discretization method uses a KBinsDiscretizer to transform numerical features into categorical features (with a direct encoding).
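
The difference between the ordinal and one-hot encodings can be seen on a single categorical column. The sketch below is written in plain Python for illustration and does not call Scikit-learn; the function names are hypothetical, and sorting the categories is a choice made here to obtain a deterministic mapping.

```python
def ordinal_encode(values):
    """Map each category to an integer in 0 .. n_categories - 1 (one column)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[value] for value in values], categories


def one_hot_encode(values):
    """Create one binary column per category (1 where the category is present)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return rows, categories


colors = ["red", "green", "blue", "green"]
ordinal_codes, ordinal_categories = ordinal_encode(colors)
one_hot_rows, one_hot_categories = one_hot_encode(colors)
```

The ordinal encoding keeps one column but introduces an arbitrary order between categories, whereas the one-hot encoding avoids that order at the cost of one column per category.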

Dictionaries

Metrics

Contains the values of the most popular metrics in machine learning.
The keys of the dictionary depend on the type of problem.

from pyxai import Learning

# Metrics of the models produced by an evaluation, read from get_details().
learner = Learning.Scikitlearn("tests/datasets/dermatology.csv", problem_type=Learning.CLASSIFICATION)
models = learner.evaluate(splitting_method=Learning.K_FOLDS, model_type=Learning.DT)
for model_id, model in enumerate(models):
    metrics = learner.get_details()[model_id]["metrics"]

from pyxai import Learning

# Metrics computed directly from lists of labels and predictions.
learner = Learning.Scikitlearn("tests/datasets/dermatology.csv", problem_type=Learning.CLASSIFICATION)
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
metrics = learner.compute_metrics(labels, predictions)

For a binary classification problem:

sklearn_confusion_matrix: list[int]
The confusion matrix of the Scikit-learn library.
true_positive: int
TP is the number of instances where the model correctly predicted the positive class.
true_negative: int
TN is the number of instances where the model correctly predicted the negative class.
false_positive: int
FP is the number of instances where the model incorrectly predicted the positive class when the actual value (label) is negative.
false_negative: int
FN is the number of instances where the model incorrectly predicted the negative class when the actual value (label) is positive.
accuracy: float
The accuracy: ((TP+TN)/(TP+TN+FP+FN))*100
precision: float
The precision: ((TP)/(TP+FP))*100
recall: float
The recall (also called the sensitivity): ((TP)/(TP+FN))*100
specificity: float
The specificity: ((TN)/(FP+TN))*100
f1_score: float
The F1-score: (2*precision*recall)/(precision+recall)
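
The formulas above can be checked by hand with a short pure-Python sketch. The function `binary_metrics` is a hypothetical helper written for this example, not PyXAI's `compute_metrics`; returning 0.0 when a denominator is zero is an assumption made here to keep the sketch total.

```python
def binary_metrics(labels, predictions):
    """Compute the binary-classification metrics from the formulas above."""
    tp = tn = fp = fn = 0
    for label, prediction in zip(labels, predictions):
        if prediction == 1 and label == 1:
            tp += 1
        elif prediction == 0 and label == 0:
            tn += 1
        elif prediction == 1 and label == 0:
            fp += 1
        else:  # prediction == 0 and label == 1
            fn += 1

    def percent(numerator, denominator):
        # Assumption: an undefined ratio is reported as 0.0.
        return 100.0 * numerator / denominator if denominator else 0.0

    precision = percent(tp, tp + fp)
    recall = percent(tp, tp + fn)
    f1_denominator = precision + recall
    return {
        "true_positive": tp,
        "true_negative": tn,
        "false_positive": fp,
        "false_negative": fn,
        "accuracy": percent(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": percent(tn, fp + tn),
        "f1_score": (2 * precision * recall) / f1_denominator if f1_denominator else 0.0,
    }


labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
example = binary_metrics(labels, predictions)
```

In this example there are 3 true positives, 4 true negatives, 1 false positive and 2 false negatives, so precision and recall differ and the F1-score falls between them.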

For a multi-class problem:

sklearn_confusion_matrix: list[list[int]]
The confusion matrix of the Scikit-learn library.
true_positives: list[int]
A list with the number of TP for each class.
true_negatives: list[int]
A list with the number of TN for each class.
false_positives: list[int]
A list with the number of FP for each class.
false_negatives: list[int]
A list with the number of FN for each class.
accuracy: float
The accuracy is the number of well classified instances divided by the total number of instances.
micro_averaging_accuracy: float
Micro-averaging accuracy is calculated by aggregating all true positives (TP) and all predictions (correct or incorrect) across all classes, then dividing the total number of correct predictions by the total number of predictions.
macro_averaging_accuracy: float
Macro-averaging accuracy is calculated by averaging the accuracy of each class individually, without considering the size of each class.
micro_averaging_precision: float
Micro-averaging of precisions.
macro_averaging_precision: float
Macro-averaging of precisions.
micro_averaging_recall: float
Micro-averaging of recalls.
macro_averaging_recall: float
Macro-averaging of recalls.

For a regression problem:

mean_squared_error: float
The mean squared error computed by the Scikit-learn library.
root_mean_squared_error: float
The root mean squared error computed by the Scikit-learn library.
mean_absolute_error: float
The mean absolute error computed by the Scikit-learn library.
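
For reference, the three regression metrics follow directly from their definitions. The pure-Python sketch below is only an illustration of those definitions (PyXAI relies on Scikit-learn for the actual computation), and `regression_metrics` is a hypothetical helper name.

```python
from math import sqrt


def regression_metrics(labels, predictions):
    """Compute MSE, RMSE and MAE from their textbook definitions."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and of equal length")
    errors = [prediction - label for label, prediction in zip(labels, predictions)]
    mse = sum(error * error for error in errors) / len(errors)
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": sqrt(mse),
        "mean_absolute_error": sum(abs(error) for error in errors) / len(errors),
    }


labels = [3.0, 5.0, 2.0, 7.0]
predictions = [2.0, 5.0, 4.0, 8.0]
regression_example = regression_metrics(labels, predictions)
```

Here the errors are -1, 0, 2 and 1, so the squared errors sum to 6 and the absolute errors sum to 4 over four instances.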

LearnerInformationDict

This Python dictionary contains information about a learner, a dataset and a model.

About a learner:

learner_name: str
The name of the library used for the evaluation.
problem_type: str | ProblemType
The type of problem (classification, regression, …). Possible values are defined in the ProblemType enum.
model_type: str | ModelType
The type of model (linear, tree-based, neural network, …). Possible values are defined in the ModelType enum.
instances_type: str | InstancesType
The type of instances (image, tabular, text, temporal, …). Possible values are defined in the InstancesType enum.
labels_type: str | LabelsType
The type of labels (class, text, mask, contours, …). Possible values are defined in the LabelsType enum.
splitting_method: str | SplittingMethod
The splitting method used for the evaluation (hold-out, k-folds, …).
Possible values are defined in the SplittingMethod enum.

About a dataset:

dataset_path: str
The path and the filename of the dataset.
n_features: int
The number of features.
n_labels: int
The number of labels.
feature_names: list[str]
The names of the features used in the model.
label_names: list[str]
The names of the labels (without duplicates).
instances_directory: str
The directory of instances.
labels_directory: str
The directory of labels.
get_item_function: Callable
A function to get an instance from the dataset. This function is used to get an instance in the right format for the model.
If the dataset is a pandas DataFrame and the instances are tabular, this function is not necessary and can be set to None.
In other cases, this function should be defined by the user. It should take as input a row of the dataframe and return the corresponding instance in the right format for the model.

About a model:

raw_model: DecisionTreeClassifier | RandomForestClassifier | XGBClassifier | XGBRegressor | LGBMRegressor
The raw model of the library used to create the model (sklearn, xgboost, lightgbm, …).
metrics: dict
A dictionary containing some metrics about the evaluation (precision, recall, f1_score, specificity, …).
More information is given in the [Metrics](/pyxai/documentation/api/modules/learning/#metrics) section.
extras: dict
Extra information from the raw model (the type, model parameters, base_score, …)
train_index: list[int]
A list of training indexes used during the evaluation (cross-validation)
test_index: list[int]
A list of test indexes used during the evaluation (cross-validation)
groups: list[int]
The groups used for the leave-one-group-out evaluation.
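
To show how `groups`, `train_index` and `test_index` relate to each other in a leave-one-group-out evaluation, here is a minimal pure-Python sketch. The function name and the group values are illustrative assumptions; the actual splitting in PyXAI is delegated to Scikit-learn.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_index, test_index), one split per distinct group."""
    for held_out in sorted(set(groups)):
        # The held-out group forms the test set; all other instances are used for training.
        test_index = [i for i, group in enumerate(groups) if group == held_out]
        train_index = [i for i, group in enumerate(groups) if group != held_out]
        yield held_out, train_index, test_index


# Illustrative group identifier of each instance (instance i belongs to groups[i]).
groups = [0, 0, 1, 1, 2]
splits = list(leave_one_group_out(groups))
```

One split, and therefore one model, is produced per distinct group, and instances of the same group never appear on both sides of a split.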