Learning Module
The Learning module of PyXAI provides methods to:
- create a Scikit-learn or an XGBoost ML classifier;
- create an XGBoost or a LightGBM ML regressor;
- carry out an experimental protocol using a train-test split technique (i.e. a cross-validation method);
- get one or several ML models based on Decision Trees, Random Forests or Boosted Trees;
- simplify the preparation/cleaning of datasets using a preprocessor;
- import, save and load some specific instances and models.
Methods
Generating Models with PyXAI
- Learner.__init__: A learner object with a dataset and its characteristics. Allows running an experimental protocol and getting instances.
- LearnerInformation.__init__: Class containing information about a learner, a dataset and a model.
- Scikitlearn.__init__: A class that extends a Learner to use the Scikit-learn library.
- Xgboost.__init__: A class that extends a Learner to use the XGBoost library.
- LightGBM.__init__: A class that extends a Learner to use the LightGBM library.
- Learner.evaluate: Runs an experimental protocol using the train-test split technique based on cross-validation methods.
- Learner.get_instances: Returns pairs (instance, prediction) from the data and the classifier results.
- Learner.compute_metrics: Returns the values of some metrics (accuracy, recall, f1_score, specificity, …).
- Learner.get_details: Gets some details about the learner and the models. This information is available in the LearnerInformation class.
- Learner.get_raw_models: Gets the raw models from the library used to create the models (sklearn, xgboost, lightgbm, …).
Preprocessing Data with PyXAI
- TabularPreprocessor.__init__: Preprocessor to modify a tabular dataset (feature encoding, feature deletion, feature types (categorical, numerical or binary), …).
- TabularPreprocessor.export: Exports the new dataset into CSV or XLS files and saves some information in a JSON file.
- TabularPreprocessor.process: Applies all encodings previously defined by the other methods of this class.
- TabularPreprocessor.unset_features: Removes some features.
- TabularPreprocessor.set_categorical_features: Encodes the categorical features.
- TabularPreprocessor.set_categorical_features_already_one_hot_encoded: Identifies a set of binary features coming from a one-hot encoding as a categorical feature.
- TabularPreprocessor.all_numerical_features: Identifies all features as numerical.
- TabularPreprocessor.set_numerical_features: Selects and encodes the numerical features.
- NonTabularPreprocessor.__init__: Preprocessor to create a non-tabular dataset (image, video, voice, …).
- NonTabularPreprocessor.add_instance_image: Adds an instance as an image.
- NonTabularPreprocessor.add_label_class: Adds a label as a class.
Import, Save and Load Models
- ModelIO.import_models: Imports existing ML models.
- ModelIO.load: Loads models from a directory.
- ModelIO.save: Saves models in a directory.
Enumerations
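Each enumeration value can be passed as a parameter either directly, through the corresponding constant of the Learning module, or through its Python str shortcut. Each entry below lists these forms.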
ProblemType
Represents the type of problem.
- ProblemType.Classification | Learning.CLASSIFICATION | classification:
  A classification problem is a supervised learning task where the goal is to predict a categorical label (or class).
- ProblemType.Regression | Learning.REGRESSION | regression:
  A regression problem is a supervised learning task where the goal is to predict a continuous numerical output.
ModelType
Represents the type of model.
- ModelType.LinearSimple | Learning.LINEAR_SIMPLE | linear-simple:
  Only for regression problems. Represents the LinearRegression of the Scikit-learn library.
- ModelType.LinearRidge | Learning.LINEAR_RIDGE | linear-ridge:
  For a classification problem, uses a RidgeClassifier of the Scikit-learn library.
  For a regression problem, uses a Ridge of the Scikit-learn library.
- ModelType.LinearLasso | Learning.LINEAR_LASSO | linear-lasso:
  Not implemented at this time.
- ModelType.LinearElastic | Learning.LINEAR_ELASTIC | linear-elastic:
  Not implemented at this time.
- ModelType.LinearLogistic | Learning.LINEAR_LOGISTIC | linear-logistic:
  Only for classification problems. Represents the LogisticRegression of the Scikit-learn library.
- ModelType.NeuralNetwork | Learning.NEURAL_NETWORK | neural-network:
  Using the Scikit-learn library, represents either an MLPClassifier or an MLPRegressor according to the problem type (classification or regression).
  Other libraries will be taken into account in the future (PyTorch and TensorFlow).
- ModelType.DecisionTree | Learning.DT | decision-tree:
  Using the Scikit-learn library, represents a DecisionTreeClassifier for classification problems.
- ModelType.RandomForest | Learning.RF | random-forest:
  Using the Scikit-learn library, represents a RandomForestClassifier for classification problems.
- ModelType.BoostedTree | Learning.BT | boosted-tree:
  Using the XGBoost library, represents either an XGBClassifier or an XGBRegressor according to the problem type (classification or regression).
  Using the LightGBM library, represents an LGBMRegressor for a regression problem.
InstancesType
Represents the type of instances.
- InstancesType.Tabular | Learning.TABULAR | tabular:
  The dataset is presented as a table where the rows represent the instances and the columns represent the features.
- InstancesType.Image | Learning.IMAGE | image:
  The instances are images (png, jpg, …).
LabelsType
Represents the type of labels.
- LabelsType.Classes | Learning.CLASSES | classes:
  The labels are classes (str or int).
- LabelsType.Masks | Learning.MASKS | masks:
  The labels are masks (png, jpg, …).
- LabelsType.Contours | Learning.CONTOURS | contours:
  The labels are contours (i.e. lists of positions (x, y) representing the contour of an object in an image).
InstancesSet
Represents a set of instances.
- InstancesSet.All | Learning.ALL | all:
  All the instances.
- InstancesSet.Train | Learning.TRAIN | train:
  The training instances.
- InstancesSet.Test | Learning.TEST | test:
  The testing instances.
- InstancesSet.TrainInPriority | Learning.TRAIN_IN_PRIORITY | train-in-priority:
  Selects indexes first from the training set and then from the test set.
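One possible reading of the train-in-priority selection is sketched below in plain Python. The helper `select_train_in_priority` and its parameter `n` are hypothetical names introduced for illustration; they are not part of the PyXAI API, and the library's actual selection logic may differ.

```python
def select_train_in_priority(train_index, test_index, n):
    """Return up to n indexes, taking training indexes first, then test indexes.

    Hypothetical helper for illustration only (not a PyXAI function).
    """
    ordered = list(train_index) + list(test_index)
    return ordered[:n]


train_index = [0, 2, 4, 6]
test_index = [1, 3, 5]

# Asking for 5 indexes exhausts the training set, then takes one test index.
selected = select_train_in_priority(train_index, test_index, 5)
print(selected)
```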
SplittingMethod
The splitting method used for an evaluation.
- SplittingMethod.Predefined | Learning.PREDEFINED | predefined:
  Allows the user to predefine the training instances and the test instances in a new column titled “subset”.
- SplittingMethod.HoldOut | Learning.HOLD_OUT | hold-out:
  A simple train_test_split with the Scikit-learn library.
- SplittingMethod.KFolds | Learning.K_FOLDS | k-folds:
  The K-Fold cross-validator of the Scikit-learn library.
- SplittingMethod.LeaveOneGroupOut | Learning.LEAVE_ONE_GROUP_OUT | leave-one-group-out:
  The Leave One Group Out cross-validator of the Scikit-learn library.
- SplittingMethod.LoadModel | load-model:
  To load a model.
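To make the k-folds and leave-one-group-out methods concrete, here is a minimal plain-Python sketch of how such splits partition instance indexes. PyXAI delegates these splits to Scikit-learn; the functions `k_fold_indexes` and `leave_one_group_out_indexes` below are illustrative stand-ins written for this page, and they assume consecutive folds without shuffling.

```python
def k_fold_indexes(n_instances, k):
    """Yield (train_index, test_index) pairs for k consecutive folds, no shuffling."""
    # The first n_instances % k folds receive one extra instance.
    fold_sizes = [n_instances // k + (1 if fold < n_instances % k else 0)
                  for fold in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_index = list(range(start, stop))
        train_index = [i for i in range(n_instances) if i < start or i >= stop]
        yield train_index, test_index
        start = stop


def leave_one_group_out_indexes(groups):
    """Yield (train_index, test_index) pairs, holding out one distinct group per split."""
    for held_out in sorted(set(groups)):
        test_index = [i for i, g in enumerate(groups) if g == held_out]
        train_index = [i for i, g in enumerate(groups) if g != held_out]
        yield train_index, test_index


folds = list(k_fold_indexes(10, 3))
group_splits = list(leave_one_group_out_indexes([1, 1, 2, 2, 3]))

for train_index, test_index in folds:
    print("k-folds test indexes:", test_index)
for train_index, test_index in group_splits:
    print("leave-one-group-out test indexes:", test_index)
```

Every instance appears in exactly one test set with both methods; they differ only in how the test sets are formed (equal-sized folds versus user-defined groups).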
ClassificationType
The type of classification.
- ClassificationType.BinaryClass | Learning.BINARY_CLASS | binary-class:
  A binary classification problem.
- ClassificationType.MultiClass | Learning.MULTI_CLASS | multi-class:
  A multi-class problem.
MultiClassToBinaryMethod
The method used to encode a multi-class dataset into binary-class datasets.
- MultiClassToBinaryMethod.OneVsOne | Learning.ONE_VS_ONE | one-vs-one:
  Creates one classifier for every pair of classes. For example, with the classes A, B and C, we have: A vs B, A vs C and B vs C.
- MultiClassToBinaryMethod.OneVsRest | Learning.ONE_VS_REST | one-vs-rest:
  Creates one classifier per class. For example, with the classes A, B and C, we have: A vs BC, B vs AC and C vs AB.
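The two decompositions above can be enumerated with a few lines of standard-library Python. This is only a sketch of which binary problems each method produces; the function names `one_vs_one` and `one_vs_rest` are chosen for illustration and are not PyXAI functions.

```python
from itertools import combinations


def one_vs_one(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


def one_vs_rest(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]


classes = ["A", "B", "C"]
ovo = one_vs_one(classes)
ovr = one_vs_rest(classes)

print(ovo)  # pairs such as ('A', 'B')
print(ovr)  # entries such as ('A', ['B', 'C'])
```

With n classes, one-vs-one yields n(n-1)/2 classifiers while one-vs-rest yields n, so the two counts coincide only for n = 3.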
EncoderType
Represents a data encoder.
- EncoderType.OrdinalEncoder | Learning.ORDINAL | ordinal-encoder:
  Converts each categorical feature to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature. Uses the OrdinalEncoder of the Scikit-learn library.
- EncoderType.OneHotEncoder | Learning.ONE_HOT | oneHot-encoder:
  Creates a binary column for each category, where the value 1 means that this category is present for an instance; otherwise the value is 0.
  Uses the OneHotEncoder of the Scikit-learn library.
- EncoderType.DiscretizerEncoder | Learning.DISCRETIZER | discretizer-encoder:
  This discretization method uses a KBinsDiscretizer to transform numerical features into categorical features (with a direct encoding).
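The difference between the ordinal and one-hot encodings can be seen on a single categorical feature. The sketch below is a plain-Python illustration, not the Scikit-learn encoders that PyXAI uses; the helper names are hypothetical, and sorting the categories before assigning codes is an assumption made here for determinism.

```python
def ordinal_encode(values):
    """Map each category to an integer in 0..n_categories - 1 (single column)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], categories


def one_hot_encode(values):
    """Create one binary column per category (1 if present, otherwise 0)."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories


colors = ["red", "green", "blue", "green"]
ordinal_codes, ordinal_categories = ordinal_encode(colors)
one_hot_rows, one_hot_categories = one_hot_encode(colors)

print(ordinal_categories, ordinal_codes)
print(one_hot_categories, one_hot_rows)
```

The ordinal encoding keeps one column but introduces an arbitrary order between categories, whereas the one-hot encoding avoids that order at the cost of one column per category.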
Dictionaries
Metrics
Contains the values of the most popular metrics in machine learning.
The keys of the dictionary depend on the type of problem.
from pyxai import Learning
# Metrics computed during an evaluation: one dictionary per trained model.
learner = Learning.Scikitlearn("tests/datasets/dermatology.csv", problem_type=Learning.CLASSIFICATION)
models = learner.evaluate(splitting_method=Learning.K_FOLDS, model_type=Learning.DT)
for index, model in enumerate(models):
    metrics = learner.get_details()[index]["metrics"]

from pyxai import Learning
# Metrics computed directly from a list of labels and a list of predictions.
learner = Learning.Scikitlearn("tests/datasets/dermatology.csv", problem_type=Learning.CLASSIFICATION)
labels = [1,1,1,1,1,0,0,0,0,0]
predictions = [1,1,1,1,1,0,0,0,0,0]
metrics = learner.compute_metrics(labels, predictions)
For a binary classification problem:
- sklearn_confusion_matrix: list[int]
  The confusion matrix of the Scikit-learn library.
- true_positive: int
  TP is the number of instances where the model correctly predicted the positive class.
- true_negative: int
  TN is the number of instances where the model correctly predicted the negative class.
- false_positive: int
  FP is the number of instances where the model incorrectly predicted the positive class when the actual value (label) is negative.
- false_negative: int
  FN is the number of instances where the model incorrectly predicted the negative class when the actual value (label) is positive.
- accuracy: float
  The accuracy: ((TP+TN)/(TP+TN+FP+FN))*100
- precision: float
  The precision: ((TP)/(TP+FP))*100
- recall: float
  The recall (also called the sensitivity): ((TP)/(TP+FN))*100
- specificity: float
  The specificity: ((TN)/(FP+TN))*100
- f1_score: float
  The F1-score: (2*precision*recall)/(precision+recall)
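The formulas above can be checked by hand with a short plain-Python sketch. The function `binary_metrics` is a hypothetical re-implementation written for this page, not PyXAI's `compute_metrics`; it assumes labels encoded as 0 and 1 and non-zero denominators, and it omits the confusion-matrix key.

```python
def binary_metrics(labels, predictions):
    """Compute the binary metrics listed above from 0/1 labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for label, pred in pairs if label == 1 and pred == 1)
    tn = sum(1 for label, pred in pairs if label == 0 and pred == 0)
    fp = sum(1 for label, pred in pairs if label == 0 and pred == 1)
    fn = sum(1 for label, pred in pairs if label == 1 and pred == 0)
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    return {
        "true_positive": tp,
        "true_negative": tn,
        "false_positive": fp,
        "false_negative": fn,
        "accuracy": (tp + tn) / (tp + tn + fp + fn) * 100,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (fp + tn) * 100,
        # Precision and recall are percentages, so the F1-score is one too.
        "f1_score": (2 * precision * recall) / (precision + recall),
    }


labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
result = binary_metrics(labels, predictions)
for name, value in result.items():
    print(name, value)
```

Here TP = 3, TN = 4, FP = 2 and FN = 1, which gives an accuracy of 70, a precision of 60 and a recall of 75.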
For a multi-class problem:
- sklearn_confusion_matrix: list[list[int]]
  The confusion matrix of the Scikit-learn library.
- true_positives: list[int]
  A list with the number of TP for each class.
- true_negatives: list[int]
  A list with the number of TN for each class.
- false_positives: list[int]
  A list with the number of FP for each class.
- false_negatives: list[int]
  A list with the number of FN for each class.
- accuracy: float
  The accuracy is the number of correctly classified instances divided by the total number of instances.
- micro_averaging_accuracy: float
  Micro-averaging accuracy is calculated by aggregating all true positives (TP) and all predictions (correct or incorrect) across all classes, then dividing the total number of correct predictions by the total number of predictions.
- macro_averaging_accuracy: float
  Macro-averaging accuracy is calculated by averaging the accuracy of each class individually, without considering the size of each class.
- micro_averaging_precision: float
  Micro-averaging of precisions.
- macro_averaging_precision: float
  Macro-averaging of precisions.
- micro_averaging_recall: float
  Micro-averaging of recalls.
- macro_averaging_recall: float
  Macro-averaging of recalls.
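As an illustration of the per-class counts and of the micro/macro distinction, the sketch below derives them from a confusion matrix in plain Python. It assumes the Scikit-learn convention (rows are true labels, columns are predicted labels), returns fractions rather than percentages, and uses the usual textbook definitions; the helper names are hypothetical, and PyXAI's own computation or scaling may differ.

```python
def per_class_counts(matrix):
    """Return per-class TP, TN, FP and FN lists from a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = [matrix[k][k] for k in range(n)]
    # Column sum minus the diagonal: predicted as class k but actually another class.
    fp = [sum(matrix[r][k] for r in range(n)) - tp[k] for k in range(n)]
    # Row sum minus the diagonal: actually class k but predicted as another class.
    fn = [sum(matrix[k]) - tp[k] for k in range(n)]
    tn = [total - tp[k] - fp[k] - fn[k] for k in range(n)]
    return tp, tn, fp, fn


def micro_macro(tp, fp, fn):
    """Micro-averaging pools the counts; macro-averaging averages per-class scores."""
    n = len(tp)
    return {
        "micro_averaging_precision": sum(tp) / (sum(tp) + sum(fp)),
        "macro_averaging_precision": sum(tp[k] / (tp[k] + fp[k]) for k in range(n)) / n,
        "micro_averaging_recall": sum(tp) / (sum(tp) + sum(fn)),
        "macro_averaging_recall": sum(tp[k] / (tp[k] + fn[k]) for k in range(n)) / n,
    }


confusion = [
    [5, 1, 0],
    [2, 3, 1],
    [0, 0, 4],
]
tp, tn, fp, fn = per_class_counts(confusion)
averages = micro_macro(tp, fp, fn)
print(tp, tn, fp, fn)
print(averages)
```

Micro-averaging weights every instance equally, so large classes dominate; macro-averaging weights every class equally, which makes weak performance on a small class more visible.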
For a regression problem:
- mean_squared_error: float
  The mean squared error computed by the Scikit-learn library.
- root_mean_squared_error: float
  The root mean squared error computed by the Scikit-learn library.
- mean_absolute_error: float
  The mean absolute error computed by the Scikit-learn library.
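For reference, the three regression metrics follow their standard definitions, sketched here in plain Python. PyXAI obtains them from Scikit-learn; the function `regression_metrics` below is a hypothetical stand-in that only mirrors the key names listed above.

```python
import math


def regression_metrics(labels, predictions):
    """Compute MSE, RMSE and MAE from true values and predicted values."""
    errors = [label - pred for label, pred in zip(labels, predictions)]
    mse = sum(e * e for e in errors) / len(errors)
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "mean_absolute_error": sum(abs(e) for e in errors) / len(errors),
    }


labels = [3.0, -0.5, 2.0, 7.0]
predictions = [2.5, 0.0, 2.0, 8.0]
result = regression_metrics(labels, predictions)
print(result)
```

The squared errors are 0.25, 0.25, 0 and 1, so the mean squared error is 0.375 and the mean absolute error is 0.5.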
LearnerInformationDict
This Python dictionary contains information about a learner, a dataset and a model.
About a learner:
- learner_name: str
  The name of the library used for the evaluation.
- problem_type: str|ProblemType
  The type of problem (classification, regression, …). Possible values are defined in the ProblemType enum.
- model_type: str|ModelType
  The type of model (linear, tree-based, neural network, …). Possible values are defined in the ModelType enum.
- instances_type: str|InstancesType
  The type of instances (image, tabular, text, temporal, …). Possible values are defined in the InstancesType enum.
- labels_type: str|LabelsType
  The type of labels (class, text, mask, contours, …). Possible values are defined in the LabelsType enum.
- splitting_method: str|SplittingMethod
  The splitting method used for the evaluation (hold-out, k-folds, …).
  Possible values are defined in the SplittingMethod enum.
About a dataset:
- dataset_path: str
  The path and the filename of the dataset.
- n_features: int
  The number of features.
- n_labels: int
  The number of labels.
- feature_names: list[str]
  The names of the features used in the model.
- label_names: list[str]
  The names of the labels (without duplicates).
- instances_directory: str
  The directory of instances.
- labels_directory: str
  The directory of labels.
- get_item_function: Callable
  A function that returns an instance from the dataset in the format expected by the model.
  If the dataset is a pandas DataFrame and the instances are tabular, this function is not necessary and can be set to None.
  In other cases, this function should be defined by the user. It should take a row of the DataFrame as input and return the corresponding instance in the format expected by the model.
About a model:
- raw_model: DecisionTreeClassifier|RandomForestClassifier|XGBClassifier|XGBRegressor|LGBMRegressor
  The raw model of the library used to create the model (sklearn, xgboost, lightgbm, …).
- metrics: dict
  A dictionary containing some metrics about the evaluation (precision, recall, f1_score, specificity, …).
  More information is given in the [Metrics](/pyxai/documentation/api/modules/learning/#metrics) section.
- extras: dict
  Extra information from the raw model (the type, model parameters, base_score, …).
- train_index: list[int]
  A list of training indexes used during the evaluation (cross-validation).
- test_index: list[int]
  A list of test indexes used during the evaluation (cross-validation).
- groups: list[int]
  The groups used for the leave-one-group-out evaluation.