{ "cells": [ { "cell_type": "markdown", "id": "b1db8c9a", "metadata": {}, "source": [ "# Generating Models with PyXAI" ] }, { "cell_type": "markdown", "id": "514a5144", "metadata": {}, "source": [ "\n", "\n", "The ```Learning``` module of PyXAI provides methods to: \n", "* create a [Scikit-learn](https://scikit-learn.org/stable/) or an [XGBoost](https://xgboost.readthedocs.io/en/stable/) ML classifier;\n", "* create an [XGBoost](https://xgboost.readthedocs.io/en/stable/) or a [LightGBM](https://github.com/microsoft/LightGBM) ML regressor;\n", "* carry out an experimental protocol using a train-test split technique (i.e. a cross-validation method); \n", "* get one or several ML models based on Decision Trees, Random Forests or Boosted Trees;\n", "* get, save and load some specific instances and models.\n", "\n", "On this page, we detail all of these points except saving and loading, for which please see the [Saving/Loading Models](/documentation/saving) page. \n" ] }, { "cell_type": "markdown", "id": "bfb6ec7d", "metadata": {}, "source": [ "## Loading Data" ] }, { "cell_type": "markdown", "id": "ed95f878", "metadata": {}, "source": [ "The first step is to create a ```Learner``` object that contains all the methods needed to generate models. To this end, use one of the following methods, depending on the chosen library:\n", " - ```Learning.Scikitlearn(dataset)```\n", " - ```Learning.Xgboost(dataset)```\n", " - ```Learning.lightGBM(dataset)```" ] }, { "cell_type": "markdown", "id": "247f2cdf", "metadata": {}, "source": [ "| Learning.Scikitlearn\\|Learning.Xgboost\\|Learning.lightGBM(dataset, learner_type=None): | \n", "| :----------- | \n", "| Returns a ```Learner``` object that contains all the methods needed to generate models of a given type (classification or regression). 
|\n", "| dataset ```String``` ```pandas.DataFrame```: Either the file path of the dataset in CSV or Excel format or a ```pandas.DataFrame``` object representing the data.|\n", "| learner_type ```Learning.CLASSIFICATION``` ```Learning.REGRESSION```: The type of models that will be used for this dataset.| \n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "id": "3a86dda4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data:\n", " Sepal.Length Sepal.Width Petal.Length Petal.Width Species\n", "0 5.1 3.5 1.4 0.2 Iris-setosa\n", "1 4.9 3.0 1.4 0.2 Iris-setosa\n", "2 4.7 3.2 1.3 0.2 Iris-setosa\n", "3 4.6 3.1 1.5 0.2 Iris-setosa\n", "4 5.0 3.6 1.4 0.2 Iris-setosa\n", ".. ... ... ... ... ...\n", "145 6.7 3.0 5.2 2.3 Iris-virginica\n", "146 6.3 2.5 5.0 1.9 Iris-virginica\n", "147 6.5 3.0 5.2 2.0 Iris-virginica\n", "148 6.2 3.4 5.4 2.3 Iris-virginica\n", "149 5.9 3.0 5.1 1.8 Iris-virginica\n", "\n", "[150 rows x 5 columns]\n", "-------------- Information ---------------\n", "Dataset name: ../dataset/iris.csv\n", "nFeatures (nAttributes, with the labels): 5\n", "nInstances (nObservations): 150\n", "nLabels: 3\n" ] } ], "source": [ "from pyxai import Learning\n", "learner = Learning.Xgboost(\"../dataset/iris.csv\", learner_type=Learning.CLASSIFICATION)" ] }, { "cell_type": "markdown", "id": "32adc48c", "metadata": {}, "source": [ "{: .attention }\n", "> You can launch your program from the command line with the ```-dataset``` option to specify the dataset filename: \n", "> ```console\n", "> python3 example.py -dataset=\"../dataset/iris.csv\"\n", "> ```\n", "> To get the value of the ```-dataset``` option in your program, you need to import the ```Tools``` module:\n", "> ```python\n", "> from pyxai import Learning, Tools\n", "> learner = Learning.Xgboost(Tools.Options.dataset)\n", "> ```" ] }, { "cell_type": "markdown", "id": "901275d9", "metadata": {}, "source": [ "{: .warning }\n", "> The dataset must specify the labels in the first row and the classes/values in the last column. \n", "> If this is not the case, you must modify your data using the [pandas](https://pandas.pydata.org/docs/index.html) library and pass a ```pandas.DataFrame``` to the functions of the ```Learning``` module. 
In this example, we add the missing labels:\n", "> ```python\n", "> import pandas\n", "> data = pandas.read_csv(\"../dataset/iris.data\", names=['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Class'])\n", "> learner = Learning.Xgboost(data)\n", "> ```\n", "> \n", "> You can also use the [Preprocessor]({{ site.baseurl }}/documentation/preprocessor/) object of PyXAI, which helps you clean the dataset." ] }, { "cell_type": "markdown", "id": "dbfc19ac", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "markdown", "id": "5e34e2d2", "metadata": {}, "source": [ "The ```Learner``` object learns a classifier/regressor (with the ```evaluate``` method) and produces one or several models according to the chosen cross-validation method and ML model. " ] }, { "cell_type": "markdown", "id": "1c675a05", "metadata": {}, "source": [ "| <Learner Object>.evaluate(method, output, n_models=10, test_size=0.3, **learner_options): | \n", "| :----------- | \n", "| Runs an experimental protocol using the train-test split technique based on cross-validation methods. It makes the train-test split according to the cross-validation method using Scikit-learn, then executes the fit and predict operations of the chosen ML library (Scikit-learn, XGBoost, LightGBM) as many times as necessary. |\n", "| method ```Learning.HOLD_OUT``` ```Learning.K_FOLDS``` ```Learning.LEAVE_ONE_GROUP_OUT```: The cross-validation method.|\n", "| output ```Learning.DT``` ```Learning.RF``` ```Learning.BT```: The desired model. ```Learning.DT``` and ```Learning.RF``` are available with the Scikit-learn library for classification while ```Learning.BT``` is compatible with the XGBoost library (classification and regression) and LightGBM (regression).|\n", "| n_models ```Integer```: The number of models desired. This is equivalent to the number of parts of the cross-validator used for ```Learning.K_FOLDS``` and ```Learning.LEAVE_ONE_GROUP_OUT```. 
Not used for method ```Learning.HOLD_OUT``` because it only returns one model. Default value is 10.|\n", "| test_size ```Float``` (between 0 and 1): Used only for ```Learning.HOLD_OUT``` to set the percentage of the test set. Default value is 0.3.|\n", "| learner_options ```Dict```: possible options provided to the learner via kwargs arguments.|\n" ] }, { "cell_type": "markdown", "id": "a35011c2", "metadata": {}, "source": [ "Information about cross-validators can be found in the [Scikit-learn](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) page. " ] }, { "cell_type": "markdown", "id": "322e27da", "metadata": {}, "source": [ "In this example, we create 3 boosted trees (classifiers) thanks to the K-folds cross-validator. " ] }, { "cell_type": "code", "execution_count": 2, "id": "52a02932", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------- Evaluation ---------------\n", "method: KFolds\n", "output: BT\n", "learner_type: Classification\n", "learner_options: {'seed': 0, 'max_depth': None, 'eval_metric': 'mlogloss'}\n", "--------- Evaluation Information ---------\n", "For the evaluation number 0:\n", "metrics:\n", " micro_averaging_accuracy: 97.33333333333334\n", " micro_averaging_precision: 96.0\n", " micro_averaging_recall: 96.0\n", " macro_averaging_accuracy: 97.33333333333333\n", " macro_averaging_precision: 96.02339181286548\n", " macro_averaging_recall: 96.02339181286548\n", " true_positives: {'Iris-setosa': 16, 'Iris-versicolor': 18, 'Iris-virginica': 14}\n", " true_negatives: {'Iris-setosa': 34, 'Iris-versicolor': 30, 'Iris-virginica': 34}\n", " false_positives: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 1}\n", " false_negatives: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 1}\n", " accuracy: 96.0\n", " sklearn_confusion_matrix: [[16, 0, 0], [0, 18, 1], [0, 1, 14]]\n", "nTraining instances: 100\n", "nTest instances: 50\n", "\n", "For the 
evaluation number 1:\n", "metrics:\n", " micro_averaging_accuracy: 96.0\n", " micro_averaging_precision: 94.0\n", " micro_averaging_recall: 94.0\n", " macro_averaging_accuracy: 96.0\n", " macro_averaging_precision: 94.11764705882352\n", " macro_averaging_recall: 95.23809523809524\n", " true_positives: {'Iris-setosa': 15, 'Iris-versicolor': 14, 'Iris-virginica': 18}\n", " true_negatives: {'Iris-setosa': 35, 'Iris-versicolor': 33, 'Iris-virginica': 29}\n", " false_positives: {'Iris-setosa': 0, 'Iris-versicolor': 3, 'Iris-virginica': 0}\n", " false_negatives: {'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 3}\n", " accuracy: 94.0\n", " sklearn_confusion_matrix: [[15, 0, 0], [0, 14, 0], [0, 3, 18]]\n", "nTraining instances: 100\n", "nTest instances: 50\n", "\n", "For the evaluation number 2:\n", "metrics:\n", " micro_averaging_accuracy: 97.33333333333334\n", " micro_averaging_precision: 96.0\n", " micro_averaging_recall: 96.0\n", " macro_averaging_accuracy: 97.33333333333333\n", " macro_averaging_precision: 95.83333333333334\n", " macro_averaging_recall: 96.07843137254902\n", " true_positives: {'Iris-setosa': 19, 'Iris-versicolor': 15, 'Iris-virginica': 14}\n", " true_negatives: {'Iris-setosa': 31, 'Iris-versicolor': 33, 'Iris-virginica': 34}\n", " false_positives: {'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 2}\n", " false_negatives: {'Iris-setosa': 0, 'Iris-versicolor': 2, 'Iris-virginica': 0}\n", " accuracy: 96.0\n", " sklearn_confusion_matrix: [[19, 0, 0], [0, 15, 2], [0, 0, 14]]\n", "nTraining instances: 100\n", "nTest instances: 50\n", "\n", "--------------- Explainer ----------------\n", "For the evaluation number 0:\n", "**Boosted Tree model**\n", "NClasses: 3\n", "nTrees: 300\n", "nVariables: 26\n", "\n", "For the evaluation number 1:\n", "**Boosted Tree model**\n", "NClasses: 3\n", "nTrees: 300\n", "nVariables: 31\n", "\n", "For the evaluation number 2:\n", "**Boosted Tree model**\n", "NClasses: 3\n", "nTrees: 300\n", "nVariables: 
22\n", "\n" ] } ], "source": [ "models = learner.evaluate(method=Learning.K_FOLDS, output=Learning.BT, n_models=3)" ] }, { "cell_type": "markdown", "id": "0804030f", "metadata": {}, "source": [ "Beyond carrying out the experimental protocol, this method returns the models in a dedicated format for the calculation of explanations. " ] }, { "cell_type": "markdown", "id": "03806b56", "metadata": {}, "source": [ "However, this may not meet your requirements (you may need another ML classifier and/or another cross-validation method):\n", "* Other ML classifiers and cross-validation methods are under development (the objective is to offer all cross-validation methods of Scikit-learn); \n", "* You can code your own experimental protocol and then import your models (see the [Importing Models](/documentation/importing) page)." ] }, { "cell_type": "markdown", "id": "3502f2aa", "metadata": {}, "source": [ "{: .attention }\n", "> You can use ```learner.get_label_from_value(value)``` and ```learner.get_value_from_label(label)``` to get the right values coming from the encoding of labels. The Python dictionary variable ```learner.dict_labels``` contains the encoding performed." ] }, { "cell_type": "markdown", "id": "68ec59f2", "metadata": {}, "source": [ "## Selecting Instances" ] }, { "cell_type": "markdown", "id": "b37a5641", "metadata": {}, "source": [ "PyXAI can easily select specific instances thanks to the ```get_instances``` method. " ] }, { "cell_type": "markdown", "id": "e4a923f8", "metadata": {}, "source": [ "| <Learner Object>.get_instances(model=None, indexes=Indexes.All, *, dataset=None, n=None, correct=None, predictions=None, save_directory=None, instances_id=None, seed=0, details=False): | \n", "| :----------- | \n", "| Returns the instances in a ```Tuple```. Each instance comes with the prediction of the model or alone depending on whether the model is given or not (model=None). An instance is represented by a ```numpy.array``` object. 
Note that when the number of instances requested is only 1 (n=1), the method just returns the instance and not a ```Tuple``` of instances.
By default, the function returns the same instances if you provide the same parameters. Set the ```seed``` parameter to ```None``` for a random selection. Set the ```details``` parameter to ```True``` to get more details on the instances (predictions, labels and indexes in the dataset).|\n", "| model ```DecisionTree``` ```RandomForest``` ```BoostedTrees```: The model computed by the ```evaluate``` method. |\n", "| indexes ```Learning.TRAINING``` ```Learning.TEST``` ```Learning.MIXED``` ```Learning.ALL``` ```String```: Returns instances from the training instances (```Learning.TRAINING```) or the test instances (```Learning.TEST```) or from both by giving priority to training instances (```Learning.MIXED```). By default set to ```Learning.ALL``` that takes into account all instances. Finally, when the indexes parameter is a ```String```, it represents a file containing indexes and the method loads the associated instances.|\n", "| dataset ```String``` ```pandas.DataFrame```: In some situations, this method needs the dataset (Optional).|\n", "| n ```Integer```: The wanted number of instances (None for all).|\n", "| correct ```None``` ```True``` ```False```: Only available if a model is given. Selects by default all instances (```None```) or only the instances correctly classified by the model (```True```) or only the instances misclassified by the model (```False```).|\n", "| predictions ```None``` ```List of Integer```: Only available if a model is given. 
Selects by default all instances (```None```) or, given a ```List of Integer``` representing the desired classes/labels, only the instances predicted with one of them.|\n", "| save_directory ```None``` ```String```: Save the instance indexes in a file inside the directory given by this parameter.|\n", "| instances_id ```None``` ```Integer```: Adds an identifier to the name of the file saved with the ```save_directory``` parameter; also useful to load instances using the ```indexes``` parameter.|\n", "| seed ```None``` ```Integer```: Set to ```None``` to obtain fully random instances. Default value is ```0```. Set the seed to an ```Integer``` to shuffle the result with this seed.|\n", "| details ```True``` ```False```: Set to ```True``` to obtain a List of instances in the form of Python dictionaries with the keys: \"instance\", \"prediction\", \"label\" and \"index\". Default value is ```False```. |" ] }, { "cell_type": "markdown", "id": "96947e99", "metadata": {}, "source": [ "More details on the indexes, save_directory, and instances_id parameters are given on the [Saving/Loading Models](/documentation/saving) page. Let us now look at some examples of use. \n", "\n", "First, we select only one instance (we take the first model among the three models computed). We directly get the instance and the prediction." ] }, { "cell_type": "code", "execution_count": 3, "id": "6061cdfa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------- Instances ----------------\n", "number of instances selected: 1\n", "----------------------------------------------\n", "[5.1 3.5 1.4 0.2] 0\n" ] } ], "source": [ "instance, prediction = learner.get_instances(models[0], n=1)\n", "print(instance, prediction)" ] }, { "cell_type": "markdown", "id": "ba6142ad", "metadata": {}, "source": [ "Now, we take 3 instances. We obtain a list of instances. 
" ] }, { "cell_type": "code", "execution_count": 4, "id": "2cfea8dc", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------- Instances ----------------\n", "number of instances selected: 3\n", "----------------------------------------------\n", "[(array([5.1, 3.5, 1.4, 0.2]), 0), (array([4.9, 3. , 1.4, 0.2]), 0), (array([4.7, 3.2, 1.3, 0.2]), 0)]\n" ] } ], "source": [ "instances = learner.get_instances(models[0], n=3)\n", "print(instances)" ] }, { "cell_type": "markdown", "id": "192c5161", "metadata": {}, "source": [ "The same invocation without the model as a parameter leads to a different output (no prediction is available, so each instance is paired with ```None```):" ] }, { "cell_type": "code", "execution_count": 5, "id": "8dc00ef4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------- Instances ----------------\n", "number of instances selected: 3\n", "----------------------------------------------\n", "[(array([5.1, 3.5, 1.4, 0.2]), None), (array([4.9, 3. , 1.4, 0.2]), None), (array([4.7, 3.2, 1.3, 0.2]), None)]\n" ] } ], "source": [ "instances = learner.get_instances(n=3)\n", "print(instances)" ] }, { "cell_type": "markdown", "id": "dab333ec", "metadata": {}, "source": [ "Now, consider 3 instances for which the prediction given by the model is equal to 2. " ] }, { "cell_type": "code", "execution_count": 6, "id": "ae29f933", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------- Instances ----------------\n", "number of instances selected: 3\n", "----------------------------------------------\n", "[(array([6. , 2.7, 5.1, 1.6]), 2), (array([6.3, 3.3, 6. 
, 2.5]), 2), (array([5.8, 2.7, 5.1, 1.9]), 2)]\n" ] } ], "source": [ "instances = learner.get_instances(models[0], n=3, predictions=[2])\n", "print(instances)" ] }, { "cell_type": "markdown", "id": "91a88f29", "metadata": {}, "source": [ "Next, we focus on 3 instances that have a prediction given by the model equal to 2 and for which the prediction is wrong (i.e. the prediction returned by the model is not the same as the one contained in the dataset). Note that only one instance meets these criteria. " ] }, { "cell_type": "code", "execution_count": 7, "id": "48ae777f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------- Instances ----------------\n", "number of instances selected: 1\n", "----------------------------------------------\n", "[(array([6. , 2.7, 5.1, 1.6]), 2)]\n" ] } ], "source": [ "instances = learner.get_instances(models[0], n=3, predictions=[2], correct=False)\n", "print(instances)" ] }, { "cell_type": "markdown", "id": "bf0c55fe", "metadata": {}, "source": [ "Now, we want to get random instances (2 different calls provide different instances)." 
] }, { "cell_type": "code", "execution_count": 8, "id": "58b9cd61", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------- Instances ----------------\n", "number of instances selected: 1\n", "----------------------------------------------\n", "[6.7 3.3 5.7 2.1] 2\n", "--------------- Instances ----------------\n", "number of instances selected: 1\n", "----------------------------------------------\n", "[5.7 2.8 4.5 1.3] 1\n" ] } ], "source": [ "instance, prediction = learner.get_instances(models[0], n=1, seed=None)\n", "print(instance, prediction)\n", "instance, prediction = learner.get_instances(models[0], n=1, seed=None)\n", "print(instance, prediction)" ] }, { "cell_type": "markdown", "id": "a9fc8445", "metadata": {}, "source": [ "Here we show how the ```details``` parameter works to obtain the predictions and labels:" ] }, { "cell_type": "code", "execution_count": 3, "id": "5dd67e34", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------- Instances ----------------\n", "number of instances selected: 3\n", "----------------------------------------------\n", "[{'instance': array([5.1, 3.5, 1.4, 0.2]), 'prediction': 0, 'label': 0, 'index': 0}, {'instance': array([4.9, 3. 
, 1.4, 0.2]), 'prediction': 0, 'label': 0, 'index': 1}, {'instance': array([4.7, 3.2, 1.3, 0.2]), 'prediction': 0, 'label': 0, 'index': 2}]\n" ] } ], "source": [ "instances = learner.get_instances(models[2], n=3, details=True)\n", "print(instances)" ] }, { "cell_type": "markdown", "id": "9ff0756c", "metadata": {}, "source": [ "Finally, we want to pick 3 instances among the test instances:" ] }, { "cell_type": "code", "execution_count": 9, "id": "58e33698", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------- Instances ----------------\n", "number of instances selected: 3\n", "----------------------------------------------\n", "[(array([5.1, 3.5, 1.4, 0.2]), 0), (array([5.4, 3.9, 1.7, 0.4]), 0), (array([4.9, 3.1, 1.5, 0.1]), 0)]\n" ] } ], "source": [ "instances = learner.get_instances(models[2], indexes=Learning.TEST, n=3)\n", "print(instances)" ] }, { "cell_type": "markdown", "id": "4812f67a", "metadata": {}, "source": [ "Saving or loading instances is presented on the [Saving/Loading Models](/documentation/saving) page. " ] }, { "cell_type": "markdown", "id": "f253addd", "metadata": {}, "source": [ "## A complete example" ] }, { "cell_type": "markdown", "id": "b502929b", "metadata": {}, "source": [ "As you can see, carrying out an empirical protocol requires the execution of very few instructions. The ```Learning``` module allows us to easily obtain the models and instances that we want to explain. 
" ] }, { "cell_type": "markdown", "id": "49fef8d0", "metadata": {}, "source": [ "```python\n", "from pyxai import Learning\n", "\n", "# Load the dataset and build a learner for XGBoost classification\n", "learner = Learning.Xgboost(\"../dataset/iris.csv\", learner_type=Learning.CLASSIFICATION)\n", "# Learn boosted trees with K-folds cross-validation (n_models defaults to 10)\n", "models = learner.evaluate(method=Learning.K_FOLDS, output=Learning.BT)\n", "for id_models, model in enumerate(models):\n", "    # Take up to 10 test instances together with the prediction of this model\n", "    instances_with_prediction = learner.get_instances(model, n=10, indexes=Learning.TEST)\n", "    for instance, prediction in instances_with_prediction:\n", "        print(\"instance:\", instance)\n", "        print(\"prediction:\", prediction)\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }