Class Learner
This class is a wrapper around the different libraries used to create the models (sklearn, xgboost, lightgbm, …).
It stores a dataset, evaluates it using different models and provides some tools to get instances, predictions and metrics.
def __init__(self, data: str|pandas.DataFrame|NoneData = NoneData, *,
problem_type: str|ProblemType,
model_type: str|ModelType|None = None,
instances_type: str|InstancesType|None = None,
labels_type: str|LabelsType|None = None,
get_item_function: Callable|None = None,
instances_directory: str|None = None,
labels_directory: str|None = None):
In the case of a csv or excel file, the last column is considered as the label column and all other columns are considered as features.
In the case of a json file, the dataset should be in a specific format (see documentation).
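To illustrate the csv convention described above (last column is the label, all other columns are features), here is a minimal sketch that uses only the Python standard library. The inline data and the `split_features_and_label` helper are hypothetical and are not part of PyXAI, which works on a pandas DataFrame.

```python
import csv
import io

# Hypothetical inline dataset; the last column ("species") plays the role of the label.
CSV_TEXT = """sepal_length,sepal_width,species
5.1,3.5,setosa
6.2,2.9,versicolor
"""

def split_features_and_label(csv_text):
    """Split csv rows into (feature_names, features, labels); last column = label."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    feature_names = header[:-1]                # every column except the last one
    features = [row[:-1] for row in body]
    labels = [row[-1] for row in body]         # the last column only
    return feature_names, features, labels

feature_names, features, labels = split_features_and_label(CSV_TEXT)
print(feature_names)
print(labels)
```

If the label of your own file is not in the last position, reordering the columns before passing the data to the learner would be the natural way to match this convention.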
Parameters
data : str | pandas.DataFrame | NoneData
The dataset to use, either as a path to a csv, json or excel file or as a pandas DataFrame.
problem_type : str | ProblemType
The type of problem (classification, regression, …)
Possible values are defined in the ProblemType enum.
model_type : str | ModelType (optional, default=None)
The type of model (linear, tree-based, neural network, …)
Can be left as None and specified later in the evaluate method.
Possible values are defined in the ModelType enum.
instances_type : str | InstancesType (optional, default=None)
The type of instances (image, tabular, text, temporal, …)
Possible values are defined in the InstancesType enum.
When set to None, it defaults to “tabular” if data is a csv file and get_item_function is None.
labels_type : str | LabelsType (optional, default=None)
The type of labels (class, text, mask, contours, …)
Possible values are defined in the LabelsType enum.
When set to None, it defaults to “classes” if problem_type is “classification” and to “continuous-values” if problem_type is “regression”.
get_item_function : Callable (optional, default=None)
A function to get an instance from the dataset. This function is used to get an instance in the right format for the model and the explainer.
If the dataset is a pandas DataFrame and the instances are tabular, this function is not necessary and can be set to None.
In other cases, this function should be defined by the user. It should take as input a row of the dataframe and return the corresponding instance in the right format for the model and the explainer.
instances_directory : str (optional, default=None)
The directory where the instances are stored (only for a JSON dataset).
This parameter is used to extend the path of instances in the dataframe when the instances are of type image and the dataset is given as a json file.
labels_directory : str (optional, default=None)
The directory where the labels are stored (only for a JSON dataset).
This parameter is used to extend the paths of labels in the dataframe when the labels are of type masks and the dataset is given as a json file.
Warning: NOT YET IMPLEMENTED
Examples
from pyxai import Learning, Tools
learner = Learning.Scikitlearn(Tools.Options.dataset, problem_type=Learning.CLASSIFICATION)
model = learner.evaluate(splitting_method=Learning.HOLD_OUT, model_type=Learning.RF)
instance, prediction = learner.get_instances(n=1)
from pyxai import Learning, Tools
learner = Learning.Xgboost(Tools.Options.dataset, problem_type=Learning.CLASSIFICATION)
model = learner.evaluate(splitting_method=Learning.HOLD_OUT, model_type=Learning.BT, splitting_parameters={'test_size':0.2}, model_parameters={'max_depth':6, 'base_score':0.5})
Main Methods
def evaluate(self, *, splitting_method, model_type, model_parameters={}, splitting_parameters={}):
Runs an experimental protocol based on train-test splits produced by a cross-validation method.
The splits are made with Scikit-learn according to the chosen cross-validation method; the fit and predict operations of the chosen ML model (Scikit-learn, XGBoost, LightGBM) are then executed as many times as necessary.
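The protocol described above can be pictured as a loop over train/test splits. The sketch below is a pure-Python illustration of a k-folds loop, not PyXAI's code (which relies on Scikit-learn for the splitting); `k_fold_indexes` and the majority-class "model" are hypothetical stand-ins.

```python
from collections import Counter

def k_fold_indexes(n_samples, n_folds):
    """Yield (train_indexes, test_indexes) for contiguous, unshuffled folds."""
    # The first (n_samples % n_folds) folds receive one extra sample.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

labels = [0, 0, 1, 1, 1, 0, 1, 1, 0, 1]

accuracies = []
for train, test in k_fold_indexes(len(labels), n_folds=5):
    # "fit": a stand-in model that memorises the majority class of the training part
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    # "predict": score the stand-in on the held-out part
    correct = sum(1 for i in test if labels[i] == majority)
    accuracies.append(correct / len(test))

print(len(accuracies))  # one score per fold
```

One model is fitted per fold, which is why a splitting method that produces several splits leads to several models.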
Parameters
splitting_method : str | SplittingMethod
The splitting method used for the evaluation (hold-out, k-folds, …)
Possible values are defined in the SplittingMethod enum.
model_type : str | ModelType
The type of model (linear, tree-based, neural network, …)
Possible values are defined in the ModelType enum.
model_parameters : dict (optional, default={})
Parameters to pass to the learner according to the library used (sklearn, xgboost, …)
For example, for a RandomForestClassifier, we can set model_parameters to {"n_estimators":50, "max_depth":4}.
splitting_parameters : dict (optional, default={})
Parameters to pass to the splitting method (hold-out, k-folds, …)
For example, for the Learning.LEAVE_ONE_GROUP_OUT method, we can set splitting_parameters to {'n_models':2, 'random_state':0}.
Returns
DecisionTree | RandomForest | BoostedTrees | BoostedTreesRegression :
The PyXai model created.
list of (DecisionTree | RandomForest | BoostedTrees | BoostedTreesRegression) :
A list of PyXai models when several models were created with the splitting method.
Examples
from pyxai import Learning
learner = Learning.LightGBM("tests/datasets/dermatology.csv", problem_type=Learning.REGRESSION)
models = learner.evaluate(splitting_method=Learning.K_FOLDS, model_type=Learning.BT, splitting_parameters={'random_state':0}, model_parameters={'learning_rate':0.3, 'n_estimators':5, 'random_state':0})
The default values of model_parameters and splitting_parameters are empty Python dictionaries.
As a result, the random seeds are set to None by default for the model and the splitting method.
This makes the evaluation non-deterministic by default.
To make it deterministic, add the key/value pair 'random_state':0 to model_parameters and/or splitting_parameters.
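The effect of a fixed seed can be illustrated without any ML library. In this sketch, the hypothetical `hold_out_split` helper mimics a hold-out split; it is only an analogy for what 'random_state' does in the underlying libraries, not PyXAI's implementation.

```python
import random

def hold_out_split(n_samples, test_size=0.2, random_state=None):
    """Return (train_indexes, test_indexes); random_state=None means unseeded."""
    rng = random.Random(random_state)  # a None seed falls back to system entropy
    indexes = list(range(n_samples))
    rng.shuffle(indexes)
    n_test = max(1, round(n_samples * test_size))
    return sorted(indexes[n_test:]), sorted(indexes[:n_test])

# Same seed, same split: the evaluation is reproducible.
first = hold_out_split(100, random_state=0)
second = hold_out_split(100, random_state=0)
print(first == second)

# No seed: two calls will usually produce different splits.
unseeded = hold_out_split(100)
```

The same reasoning applies to the model itself: a learner that draws random numbers during training needs its own seed, which is why the key may have to appear in both dictionaries.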
def get_instances(self,
model=None,
indexes=InstancesSet.All,
*,
dataset=None,
n=None,
is_correct=None,
subset_predicted_classes=None,
subset_true_classes=None,
save_directory=None,
instances_id=None,
seed=0,
train_indexes=None,
test_indexes=None,
details=False) -> list[tuple[pandas.Series, int|float|str]] | list[dict] | tuple[None, None] | tuple[pandas.Series, int|float|str] | dict:
Returns (instance, prediction) pairs from the dataset, where each prediction is the one made by the given model.
Parameters
model : DecisionTree | RandomForest | BoostedTrees | BoostedTreesRegression (optional, default=None):
A PyXAI model.
indexes : str | InstancesSet (optional, default=InstancesSet.All)
The type of selected instances (all, from the training instances, from the testing instances, …)
Possible values are defined in the InstancesSet enum.
Can also be a str representing a file containing specific indexes.
dataset : str | pandas.DataFrame | NoneData (optional, default=None)
The dataset to use, either as a path to a csv, json or excel file or as a pandas DataFrame.
Can be None if the dataset is already loaded. This parameter is useful only if the dataset is not already loaded.
n : int (optional, default=None)
The desired number of instances (None for all).
is_correct : True | False (optional, default=None)
Only available if a model is given
- None: All instances (no filter).
- True: Only correctly classified instances by the model.
- False: Only misclassified instances by the model.
subset_predicted_classes : list[int] (optional, default=None)
- None: All instances (no filter).
- list[int]: List of classes for the desired instances considering the model prediction.
subset_true_classes : list[int] (optional, default=None)
- None: All instances (no filter).
- list[int]: List of classes for the desired instances, taking into account the true labels.
save_directory : str (optional, default=None)
Save the instance indexes in a file in the directory given by this parameter
instances_id : str (optional, default=None)
An identifier added to the name of the file saved with the save_directory parameter; also useful to load instances using the indexes parameter.
seed : int | None (optional, default=0)
The seed used to shuffle the result (an integer, 0 by default). Set it to None to obtain fully random instances.
train_indexes : list[int] (optional, default=None)
List of training indexes; the instances are selected from this subset of indexes.
test_indexes : list[int] (optional, default=None)
List of testing indexes; the instances are selected from this subset of indexes.
details : True | False (optional, default=False)
Set to True to obtain a Python list of instances where each instance is a Python dictionary with the keys: “instance”, “prediction”, “label” and “index”.
Returns
list[tuple(pandas.Series, int | float | str)] | list[dict] :
The (instance, prediction) pairs for the selected instances, with the predictions made by the given model.
tuple(pandas.Series, int | float | str) | dict :
Note that when only one instance is requested (n=1), the method returns a single tuple rather than a list.
Python dictionaries are returned when the parameter details is set to True.
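The filters and return shapes documented above can be summarised with a small stand-alone sketch. `select_instances` and the `records` list are hypothetical and only mimic the documented behaviour on plain Python data; this is not PyXAI's implementation, and the `(None, None)` fallback is an assumption suggested by the signature.

```python
def select_instances(records, n=None, is_correct=None,
                     subset_predicted_classes=None, subset_true_classes=None,
                     details=False):
    """Hypothetical re-creation of the documented filters and return shapes."""
    selected = []
    for record in records:
        correct = record["prediction"] == record["label"]
        if is_correct is not None and correct != is_correct:
            continue  # keep only correctly (or only wrongly) predicted instances
        if (subset_predicted_classes is not None
                and record["prediction"] not in subset_predicted_classes):
            continue  # filter on the class predicted by the model
        if (subset_true_classes is not None
                and record["label"] not in subset_true_classes):
            continue  # filter on the true label
        selected.append(record)
        if n is not None and len(selected) == n:
            break
    if details:
        results = [dict(r) for r in selected]
    else:
        results = [(r["instance"], r["prediction"]) for r in selected]
    if n == 1:
        # Assumption: a single result (or (None, None) when nothing matches).
        return results[0] if results else (None, None)
    return results

records = [
    {"index": 0, "instance": [5.1, 3.5], "label": 0, "prediction": 0},
    {"index": 1, "instance": [6.2, 2.9], "label": 1, "prediction": 0},
    {"index": 2, "instance": [6.9, 3.1], "label": 1, "prediction": 1},
]

misclassified = select_instances(records, is_correct=False)
instance, prediction = select_instances(records, n=1)  # unpacks a single tuple
detailed = select_instances(records, subset_true_classes=[1], details=True)
```

Because n=1 yields a single tuple, it can be unpacked directly, whereas any other value of n requires iterating over a list.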
Examples
from pyxai import Learning
learner = Learning.Scikitlearn(dataset, problem_type='regression')
model = learner.evaluate(splitting_method='hold-out', model_type='linear-ridge', splitting_parameters={'random_state':0}, model_parameters={})
instances = learner.get_instances(model, indexes='train-in-priority', n=5, seed=72)
from pyxai import Learning
learner = Learning.Scikitlearn(dataset, problem_type='classification')
model = learner.evaluate(splitting_method='hold-out', model_type='linear-ridge', splitting_parameters={'random_state':0}, model_parameters={})
instances = learner.get_instances(model, indexes='test', n=5, subset_predicted_classes=['Iris-versicolor', 'Iris-virginica'])
Auxiliary Methods
def compute_metrics(self, labels, predictions):
Returns the values of some metrics (accuracy, recall, f1_score, specificity, …) according to the evaluation method.
Parameters
labels : int | float
The real values (labels) in the dataset.
predictions : int | float
The prediction of a model.
Returns
dict :
A Python dictionary where the keys depend on the ProblemType.
See the page of the Metrics class for more information.
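As a reminder of what the metrics named above measure, here is a minimal sketch for the binary classification case. `binary_metrics` and its dictionary keys are hypothetical; the keys actually returned by compute_metrics are documented on the Metrics page and may differ.

```python
def binary_metrics(labels, predictions):
    """Compute a few binary-classification metrics from 0/1 labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1_score": ratio(2 * precision * recall, precision + recall),
    }

metrics = binary_metrics(labels=[1, 0, 1, 1, 0, 0], predictions=[1, 0, 0, 1, 1, 0])
print(metrics)
```

For a regression problem these quantities do not apply, which is consistent with the returned keys depending on the ProblemType.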
def get_details(self):
Get some details about the learner and the models. This information is available in the LearnerInformation class.
Examples
from pyxai import Learning
learner = Learning.Scikitlearn("tests/datasets/dermatology.csv", problem_type=Learning.CLASSIFICATION)
models = learner.evaluate(splitting_method=Learning.K_FOLDS, model_type=Learning.DT)
for index, model in enumerate(models):
    # get_details() returns one entry per model, in the same order as `models`
    metrics = learner.get_details()[index]["metrics"]
def get_raw_models(self):
Get the raw models from the library used to create the models (sklearn, xgboost, lightgbm, …)
Returns
list of (DecisionTreeClassifier | RandomForestClassifier | XGBClassifier | XGBRegressor | LGBMRegressor) :
A list of raw models.
Examples
from pyxai import Learning
learner = Learning.Scikitlearn("tests/datasets/dermatology.csv", problem_type=Learning.CLASSIFICATION)
models = learner.evaluate(splitting_method=Learning.K_FOLDS, model_type=Learning.DT)
for index, model in enumerate(models):
    # get_raw_models() returns one raw model per PyXAI model, in the same order
    raw_model = learner.get_raw_models()[index]