
Learning Module

The Learning module of PyXAI provides methods to:

  • create a Scikit-learn or an XGBoost ML classifier;
  • create an XGBoost or a LightGBM ML regressor;
  • carry out an experimental protocol using a train-test splitting technique (e.g. hold-out or a cross-validation method);
  • get one or several ML models based on Decision Trees, Random Forests or Boosted Trees;
  • simplify the preparation/cleaning of datasets using a preprocessor;
  • import, save and load some specific instances and models.

Methods

Generating Models with PyXAI

Preprocessing Data with PyXAI

Import, Save and Load Models

Enumerations

Each enumeration value can be passed as a parameter in one of three equivalent forms: the enum member, the corresponding constant of the Learning module, or a Python str shortcut.
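As a rough illustration of how an enum member and its str shortcut can designate the same value, the sketch below mirrors a few ModelType members with a standard Python Enum. The class ModelTypeSketch and the helper resolve_model_type are hypothetical names introduced for this example; they are not PyXAI's own code, and only the member names and shortcuts are taken from this page.

```python
from enum import Enum


class ModelTypeSketch(Enum):
    """Hypothetical mirror of a few ModelType members (not PyXAI's class)."""
    DecisionTree = "decision-tree"
    RandomForest = "random-forest"
    BoostedTree = "boosted-tree"


def resolve_model_type(value):
    """Accept either an enum member or its str shortcut (assumed behaviour)."""
    if isinstance(value, ModelTypeSketch):
        return value
    # Enum lookup by value raises ValueError for an unknown shortcut.
    return ModelTypeSketch(value)


# Both spellings resolve to the same member.
same = resolve_model_type("decision-tree") is resolve_model_type(ModelTypeSketch.DecisionTree)
```
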

ProblemType

Represents the type of problem.

  • ProblemType.Classification | Learning.CLASSIFICATION | classification:
    A classification problem is a supervised learning task where the goal is to predict a categorical label (or class).

  • ProblemType.Regression | Learning.REGRESSION | regression:
    A regression problem is a supervised learning task where the goal is to predict a continuous numerical output.

ModelType

Represents the type of model.

  • ModelType.LinearSimple | Learning.LINEAR_SIMPLE | linear-simple:
    Only for regression problems. Represents the LinearRegression of the Scikit-learn library.

  • ModelType.LinearRidge | Learning.LINEAR_RIDGE | linear-ridge:
    For a classification problem, uses the RidgeClassifier of the Scikit-learn library.
    For a regression problem, uses the Ridge of the Scikit-learn library.

  • ModelType.LinearLasso | Learning.LINEAR_LASSO | linear-lasso:
    Not implemented at this time.

  • ModelType.LinearElastic | Learning.LINEAR_ELASTIC | linear-elastic:
    Not implemented at this time.

  • ModelType.LinearLogistic | Learning.LINEAR_LOGISTIC | linear-logistic:
    Only for classification problems. Represents the LogisticRegression of the Scikit-learn library.

  • ModelType.NeuralNetwork | Learning.NEURAL_NETWORK | neural-network:
    Using the Scikit-learn library, represents either an MLPClassifier or an MLPRegressor, depending on the problem type (classification or regression).
    Other libraries (PyTorch and TensorFlow) will be taken into account in the future.

  • ModelType.DecisionTree | Learning.DT | decision-tree:
    Using the Scikit-learn library, represents a DecisionTreeClassifier for classification problems.

  • ModelType.RandomForest | Learning.RF | random-forest:
    Using the Scikit-learn library, represents a RandomForestClassifier for classification problems.

  • ModelType.BoostedTree | Learning.BT | boosted-tree:
    Using the XGBoost library, represents either an XGBClassifier or an XGBRegressor, depending on the problem type (classification or regression).
    Using the LightGBM library, represents an LGBMRegressor for a regression problem.

InstancesType

Represents the type of instances.

  • InstancesType.Tabular | Learning.TABULAR | tabular:
    The dataset is presented as a table where the rows represent the instances and the columns represent the features.

  • InstancesType.Image | Learning.IMAGE | image:
    The instances are in the form of images (png, jpg, …).

LabelsType

Represents the type of labels.

  • LabelsType.Classes | Learning.CLASSES | classes:
    The labels are classes (str or int).

  • LabelsType.Masks | Learning.MASKS | masks:
    The labels are in the form of masks (png, jpg, …).

  • LabelsType.Contours | Learning.CONTOURS | contours:
    The labels are in the form of contours (i.e. lists of positions (x, y) representing the contour of an object in an image).

InstancesSet

Represents a set of instances.

  • InstancesSet.All | Learning.ALL | all:
    All the instances.

  • InstancesSet.Train | Learning.TRAIN | train:
    The training instances.

  • InstancesSet.Test | Learning.TEST | test:
    The testing instances.

  • InstancesSet.TrainInPriority | Learning.TRAIN_IN_PRIORITY | train-in-priority:
    Selects indexes first from the training set, then from the test set.

SplittingMethod

The splitting method used for an evaluation.

  • SplittingMethod.Predefined | Learning.PREDEFINED | predefined:
    Allows the user to predefine the training instances and the test instances in a new column titled “subset”.

  • SplittingMethod.HoldOut | Learning.HOLD_OUT | hold-out:
    A simple train_test_split with the Scikit-learn library.

  • SplittingMethod.KFolds | Learning.K_FOLDS | k-folds:
    The K-Fold cross-validator of the Scikit-learn library.

  • SplittingMethod.LeaveOneGroupOut | Learning.LEAVE_ONE_GROUP_OUT | leave-one-group-out:
    The Leave One Group Out cross-validator of the Scikit-learn library.

  • SplittingMethod.LoadModel | load-model:
    To load a model.
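To make the difference between the k-folds and leave-one-group-out methods concrete, here is a minimal pure-Python sketch of how each one partitions instance indexes. PyXAI relies on the Scikit-learn cross-validators for this; the helpers k_fold_indices and leave_one_group_out_indices below are hypothetical names written for this example, and they assume contiguous, unshuffled folds.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_index, test_index) pairs for contiguous, unshuffled folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional instance.
        size = base + (1 if fold < extra else 0)
        test_index = list(range(start, start + size))
        train_index = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_index, test_index
        start += size


def leave_one_group_out_indices(groups):
    """Yield (train_index, test_index) pairs, holding out one group per split."""
    for held_out in sorted(set(groups)):
        test_index = [i for i, g in enumerate(groups) if g == held_out]
        train_index = [i for i, g in enumerate(groups) if g != held_out]
        yield train_index, test_index


folds = list(k_fold_indices(n_samples=5, n_splits=2))
group_splits = list(leave_one_group_out_indices([1, 1, 2, 3, 3]))
```

In both cases every instance appears in exactly one test set, so each model is evaluated on data it was not trained on.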

ClassificationType

The type of classification.

  • ClassificationType.BinaryClass | Learning.BINARY_CLASS | binary-class:
    A binary classification problem.

  • ClassificationType.MultiClass | Learning.MULTI_CLASS | multi-class:
    A multi-class problem.

MultiClassToBinaryMethod

The method used to encode a multi-class dataset into binary-class datasets.

  • MultiClassToBinaryMethod.OneVsOne | Learning.ONE_VS_ONE | one-vs-one:
    Create one classifier for every pair of classes. For example, with the classes A B C, we have: A vs B, A vs C and B vs C.

  • MultiClassToBinaryMethod.OneVsRest | Learning.ONE_VS_REST | one-vs-rest:
    Create one classifier per class. For example, with the classes A B C, we have: A vs BC, B vs AC and C vs AB.
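The two decompositions can be enumerated with a short sketch. The function names one_vs_one and one_vs_rest are hypothetical helpers for this example, not PyXAI functions; they only list the binary tasks, without training any classifier.

```python
from itertools import combinations


def one_vs_one(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


def one_vs_rest(classes):
    """One binary task per class: that class against all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]


ovo = one_vs_one(["A", "B", "C"])
ovr = one_vs_rest(["A", "B", "C"])
```

With n classes, one-vs-one yields n*(n-1)/2 tasks while one-vs-rest yields n tasks, which is why the two coincide in count for three classes but diverge quickly beyond that.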

EncoderType

Represents an encoder of data.

  • EncoderType.OrdinalEncoder | Learning.ORDINAL | ordinal-encoder:
    Converts each categorical feature to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature. Uses the OrdinalEncoder of the Scikit-learn library.

  • EncoderType.OneHotEncoder | Learning.ONE_HOT | oneHot-encoder:
    Creates a binary column for each category, where the value 1 means that this category is present for an instance; otherwise the value is 0.
    Uses the OneHotEncoder of the Scikit-learn library.

  • EncoderType.DiscretizerEncoder | Learning.DISCRETIZER | discretizer-encoder:
    This discretization method uses a KBinsDiscretizer to transform numerical features into categorical features (with a direct encoding).
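The ordinal and one-hot encodings described above can be sketched by hand on a single categorical feature. This is an illustrative pure-Python version, not the Scikit-learn encoders that PyXAI uses; the helper names are hypothetical, and the sketch assumes that categories are ordered by sorting their values.

```python
def ordinal_encode(values):
    """Map each category to an integer in 0 .. n_categories - 1."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], categories


def one_hot_encode(values):
    """Create one 0/1 column per category; exactly one 1 per instance."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories


colors = ["red", "green", "blue", "green"]
ordinal_codes, ordinal_categories = ordinal_encode(colors)
one_hot_rows, one_hot_categories = one_hot_encode(colors)
```

The ordinal encoding keeps one column per feature but introduces an artificial order between categories, whereas the one-hot encoding avoids that order at the cost of one column per category.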

Dictionaries

Metrics

Contains the values of the most popular metrics in machine learning.
The keys of the dictionary depend on the type of problem.

from pyxai import Learning

# After an evaluation, the metrics of each model are stored in its details.
learner = Learning.Scikitlearn("tests/datasets/dermatology.csv", problem_type=Learning.CLASSIFICATION)
models = learner.evaluate(splitting_method=Learning.K_FOLDS, model_type=Learning.DT)
for i, model in enumerate(models):
    metrics = learner.get_details()[i]["metrics"]

from pyxai import Learning

# Metrics can also be computed directly from labels and predictions.
learner = Learning.Scikitlearn("tests/datasets/dermatology.csv", problem_type=Learning.CLASSIFICATION)
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
metrics = learner.compute_metrics(labels, predictions)

For a binary classification problem:

  • sklearn_confusion_matrix: list[int]
    The confusion matrix of the Scikit-learn library.

  • true_positive: int
    TP is the number of instances where the model correctly predicted the positive class.

  • true_negative: int
    TN is the number of instances where the model correctly predicted the negative class.

  • false_positive: int
    FP is the number of instances where the model incorrectly predicted the positive class when the actual value (label) is negative.

  • false_negative: int
    FN is the number of instances where the model incorrectly predicted the negative class when the actual value (label) is positive.

  • accuracy: float
    The accuracy: ((TP+TN)/(TP+TN+FP+FN))*100

  • precision: float
    The precision: ((TP)/(TP+FP))*100

  • recall: float
    The recall (also called the sensitivity): ((TP)/(TP+FN))*100

  • specificity: float
    The specificity: ((TN)/(FP+TN))*100

  • f1_score: float
    The F1-score: (2*precision*recall)/(precision+recall)
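The formulas listed above can be checked with a small standalone sketch. The function binary_metrics is a hypothetical helper, not PyXAI's compute_metrics; it assumes that the positive class is 1 and that an undefined ratio (zero denominator) is reported as 0.0. The labels and predictions are illustrative values.

```python
def binary_metrics(labels, predictions, positive=1):
    """Compute the binary metrics listed above, in percent."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)

    def percent(numerator, denominator):
        # Assumption of this sketch: an undefined ratio is reported as 0.0.
        return 100.0 * numerator / denominator if denominator else 0.0

    precision = percent(tp, tp + fp)
    recall = percent(tp, tp + fn)
    f1 = (2 * precision * recall) / (precision + recall) if precision + recall else 0.0
    return {
        "true_positive": tp,
        "true_negative": tn,
        "false_positive": fp,
        "false_negative": fn,
        "accuracy": percent(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": percent(tn, fp + tn),
        "f1_score": f1,
    }


# Illustrative data: one false negative (index 2) and one false positive (index 5).
labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
predictions = [1, 1, 0, 0, 0, 1, 0, 0, 1, 1]
metrics = binary_metrics(labels, predictions)
```

Because precision and recall are expressed in percent, the F1-score computed from them is in percent as well.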

For a multi-class problem:

  • sklearn_confusion_matrix: list[list[int]]
    The confusion matrix of the Scikit-learn library.

  • true_positives: list[int]
    A list with the number of TP for each class.

  • true_negatives: list[int]
    A list with the number of TN for each class.

  • false_positives: list[int]
    A list with the number of FP for each class.

  • false_negatives: list[int]
    A list with the number of FN for each class.

  • accuracy: float
    The accuracy is the number of well classified instances divided by the total number of instances.

  • micro_averaging_accuracy: float
    Micro-averaging accuracy is calculated by aggregating all true positives (TP) and all predictions (correct or incorrect) across all classes, then dividing the total number of correct predictions by the total number of predictions.

  • macro_averaging_accuracy: float
    Macro-averaging accuracy is calculated by averaging the accuracy of each class individually, without considering the size of each class.

  • micro_averaging_precision: float
    Micro-averaging of precisions.

  • macro_averaging_precision: float
    Macro-averaging of precisions.

  • micro_averaging_recall: float
    Micro-averaging of recalls.

  • macro_averaging_recall: float
    Macro-averaging of recalls.
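The difference between micro- and macro-averaging can be illustrated from a confusion matrix. The sketch below is a hand-written approximation, not PyXAI's implementation: the helper names are hypothetical, rows are assumed to be true classes and columns predicted classes (the Scikit-learn convention), results are returned as fractions in [0, 1] (PyXAI may scale them differently), and an undefined ratio is reported as 0.0.

```python
def per_class_counts(confusion):
    """Return per-class TP, FP, FN and TN lists from a square confusion matrix."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tps = [confusion[k][k] for k in range(n)]
    # Column sum minus the diagonal: predicted as k but actually another class.
    fps = [sum(confusion[r][k] for r in range(n)) - tps[k] for k in range(n)]
    # Row sum minus the diagonal: actually k but predicted as another class.
    fns = [sum(confusion[k]) - tps[k] for k in range(n)]
    tns = [total - tps[k] - fps[k] - fns[k] for k in range(n)]
    return tps, fps, fns, tns


def ratio(numerator, denominator):
    # Assumption of this sketch: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def micro_macro(confusion):
    """Micro- and macro-averaged precision and recall, as fractions."""
    tps, fps, fns, _ = per_class_counts(confusion)
    n = len(confusion)
    return {
        # Micro: pool the counts of all classes, then take one ratio.
        "micro_averaging_precision": ratio(sum(tps), sum(tps) + sum(fps)),
        "micro_averaging_recall": ratio(sum(tps), sum(tps) + sum(fns)),
        # Macro: take one ratio per class, then average with equal weights.
        "macro_averaging_precision": sum(ratio(t, t + f) for t, f in zip(tps, fps)) / n,
        "macro_averaging_recall": sum(ratio(t, t + f) for t, f in zip(tps, fns)) / n,
    }


# Illustrative 3-class confusion matrix (rows: true class, columns: predicted class).
confusion = [
    [2, 0, 0],
    [0, 1, 1],
    [1, 0, 3],
]
counts = per_class_counts(confusion)
averages = micro_macro(confusion)
```

Micro-averaging weights every instance equally, so large classes dominate; macro-averaging weights every class equally, so a poorly predicted small class lowers the score as much as a large one.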

For a regression problem:

LearnerInformationDict

This Python dictionary contains information about a learner, a dataset and a model.

About a learner:

  • learner_name: str
    The name of the library used for the evaluation.

  • problem_type: str | ProblemType
    The type of problem (classification, regression, …). Possible values are defined in the ProblemType enum.

  • model_type: str | ModelType
    The type of model (linear, tree-based, neural network, …). Possible values are defined in the ModelType enum.

  • instances_type: str | InstancesType
    The type of instances (image, tabular, text, temporal, …). Possible values are defined in the InstancesType enum.

  • labels_type: str | LabelsType
    The type of labels (class, text, mask, contours, …). Possible values are defined in the LabelsType enum.

  • splitting_method: str | SplittingMethod
    The splitting method used for the evaluation (hold-out, k-folds, …).
    Possible values are defined in the SplittingMethod enum.

About a dataset:

  • dataset_path: str
    The path and the filename of the dataset.

  • n_features: int
    The number of features.

  • n_labels: int
    The number of labels.

  • feature_names: list[str]
    The names of the features used in the model.

  • label_names: list[str]
    The names of the labels (without duplicates).

  • instances_directory: str
    The directory of instances.

  • labels_directory: str
    The directory of labels.

  • get_item_function: Callable
    A function to get an instance from the dataset. This function is used to get an instance in the right format for the model.
    If the dataset is a pandas DataFrame and the instances are tabular, this function is not necessary and can be set to None.
    In other cases, this function should be defined by the user. It should take as input a row of the dataframe and return the corresponding instance in the right format for the model.

About a model:

  • raw_model: DecisionTreeClassifier | RandomForestClassifier | XGBClassifier | XGBRegressor | LGBMRegressor
    The raw model of the library used to create the model (sklearn, xgboost, lightgbm, …).

  • metrics: dict
    A dictionary containing some metrics about the evaluation (precision, recall, f1_score, specificity, …).
    More information is given in the [Metrics](/pyxai/documentation/api/modules/learning/#metrics) section.

  • extras: dict
    Extra information from the raw model (the type, model parameters, base_score, …).

  • train_index: list[int]
    A list of training indexes used during the evaluation (cross-validation).

  • test_index: list[int]
    A list of test indexes used during the evaluation (cross-validation).

  • groups: list[int]
    The groups used for the leave-one-group-out evaluation.
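The keys documented above can be summarised as a typed dictionary. The sketch below is a hypothetical typing written for this page, not a class exported by PyXAI: enum-typed fields are simplified to str, raw_model is typed as Any, and the unknown_keys helper and the example values are purely illustrative.

```python
from typing import Any, Callable, List, Optional, TypedDict


class LearnerInformationSketch(TypedDict, total=False):
    """Hypothetical typing of the documented keys (not a PyXAI class)."""

    # About a learner
    learner_name: str
    problem_type: str
    model_type: str
    instances_type: str
    labels_type: str
    splitting_method: str
    # About a dataset
    dataset_path: str
    n_features: int
    n_labels: int
    feature_names: List[str]
    label_names: List[str]
    instances_directory: str
    labels_directory: str
    get_item_function: Optional[Callable[..., Any]]
    # About a model
    raw_model: Any
    metrics: dict
    extras: dict
    train_index: List[int]
    test_index: List[int]
    groups: List[int]


def unknown_keys(info):
    """Return the keys of `info` that are not documented above."""
    return set(info) - set(LearnerInformationSketch.__annotations__)


# Illustrative values only; they do not come from a real evaluation.
example = {
    "problem_type": "classification",
    "model_type": "decision-tree",
    "splitting_method": "k-folds",
    "n_features": 3,
    "feature_names": ["f1", "f2", "f3"],
    "train_index": [0, 1, 2],
    "test_index": [3, 4],
}
typos = unknown_keys({**example, "test_indexes": [3, 4]})
```

A check of this kind can catch a misspelled key before the dictionary is saved or passed on.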

Symbols