{ "cells": [ { "cell_type": "markdown", "id": "05ff760c", "metadata": {}, "source": [ "# Theories" ] }, { "cell_type": "markdown", "id": "0228cf03", "metadata": {}, "source": [ "Theories are representation of pieces of knowledge about the dataset. They may be furnished either by experts or derived directly from the nature of the data. This last case is handled by PyXAI via the encoding of domain theories during the explanation calculation. Domain theories are used to refrain from inferring impossible explanations. The way of dealing with them is different according to the kind of explanations that we lookd for: contrastive or abductive. More details about theories can be found in our [IJCAI'23 paper](/papers/). " ] }, { "cell_type": "markdown", "id": "afd56275", "metadata": {}, "source": [ "In PyXAI, through the explainer initialization method, there are two ways to activate domain theories.\n", "\n", "Either by specifying the type of features in the ```features_type``` parameter through a python dictionary with the following keys: ```\"numerical\"```, ```\"categorical\"``` and ```\"binary\"```. To avoid having to enter all the features, you can choose a default type using the ```Learning.DEFAULT``` constant. For each type that is not equal to this constant, you need to set a list of feature names as value. However, the ```\"categorical\"``` key requires a Python dictionnary where the keys are the feature names with the wildcard characters ```*```, ```{```, ```}``` or ```,``` inside names. This indicates that a set of feature names beginning with the same characters actually represents a single categorical feature that has been one-hot encoded. For example, ```\"A4*\"``` represents the categorical feature encoded through the features ```\"A4_1\"```, ```\"A4_2\"``` et ```\"A4_3\"```. The values for each key represent the possible values of the associated categorical feature (```(1, 2, 3)``` in this case). " ] }, { "cell_type": "raw", "id": "e97a7884", "metadata": {}, "source": [ "australian_types = {\n", " \"numerical\": Learning.DEFAULT,\n", " \"categorical\": {\"A4*\": (1, 2, 3), \n", " \"A5*\": tuple(range(1, 15)),\n", " \"A6*\": (1, 2, 3, 4, 5, 7, 8, 9), \n", " \"A12*\": tuple(range(1, 4))},\n", " \"binary\": [\"A1\", \"A8\", \"A9\", \"A11\"],\n", "}\n", "\n", "explainer = Explainer.initialize(model, instance=instance, features_types=australian_types)" ] }, { "cell_type": "markdown", "id": "6a112dec", "metadata": {}, "source": [ "Or by specifying in this parameter the path and name of a file containing the type of features. Such a file can be generated using the preprocessor of PyXAI (please see the [Preprocessing Data](/documentation/preprocessor/) page)." 
] }, { "cell_type": "raw", "id": "702f1f5b", "metadata": {}, "source": [ "explainer = Explainer.initialize(model, instance=instance, features_type=\"../australian.types\")" ] }, { "cell_type": "markdown", "id": "d25512bb", "metadata": {}, "source": [ "{: .attention }\n", "> There is another way to specify categorical features.\n", "> For example, if we have in our dataset three binary features named ```\"Red\"```, ```\"Green\"``` and ```\"Blue\"``` that come from a one-hot encoded feature named ```\"Color\"```, we can declare the following types: \n", "```python\n", "types = {\n", " \"categorical\": {\"{Red,Green,Blue}\": (\"Red\", \"Green\", \"Blue\")}\n", "}\n", "```\n" ] }, { "cell_type": "markdown", "id": "e5de5de2", "metadata": {}, "source": [ "| Explainer.initialize(model, instance=None, features_type=None):|\n", "|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| \n", "| Depending on the model given in the first argument, this method creates an ```ExplainerDT```, an ```ExplainerRF``` or an ```ExplainerBT```. This object is able to give explanations about the instance given as a second parameter. This last parameter is optional because you can set the instance later using the ```set_instance``` function. |\n", "| model ```DecisionTree``` ```RandomForest``` ```BoostedTree```: The model for which explanations will be calculated.|\n", "| instance ```Numpy Array``` of ```Float```: The instance to be explained. Default value is ```None```.|\n", "| features_type ```String``` ```Dict``` ```None```: Either a dictionary indicating the type of features or the path to a ```.types``` file containing this information. Activate domain theories. |" ] }, { "cell_type": "markdown", "id": "f7cbd199", "metadata": {}, "source": [ "## For contrastive reasons" ] }, { "cell_type": "markdown", "id": "be9647e9", "metadata": {}, "source": [ "To understand the principles of domain theory, we first build a small example with the builder of PyXAI ([Building Models](/documentation/learning/builder/DTbuilder/)). This exmaple is based on one numerical feature ($f_1$: the annual incomes of the applicant) and one categorical (and binary) feature ($f_2$: whether or not the applicant has already reimbursed a previous loan). The model is used to determine whether a loan must be granted or not to an applicant." ] }, { "cell_type": "code", "execution_count": 1, "id": "ac134a0f", "metadata": {}, "outputs": [], "source": [ "from pyxai import Builder, Explainer\n", "\n", "node1 = Builder.DecisionNode(2, operator=Builder.EQ, threshold=1, left=0, right=1)\n", "node2 = Builder.DecisionNode(1, operator=Builder.GE, threshold=20, left=0, right=node1)\n", "node3 = Builder.DecisionNode(1, operator=Builder.GE, threshold=30, left=node2, right=1)\n", "\n", "tree1 = Builder.DecisionTree(2, node3)\n", "tree2 = Builder.DecisionTree(2, Builder.LeafNode(1))\n", "\n", "forest = Builder.RandomForest([tree1, tree2], n_classes=2)" ] }, { "cell_type": "markdown", "id": "563cf809", "metadata": {}, "source": [ "Let's suppose Alice wants to get a loan. We know that Alice’s annual incomes are equal to $18k$ and Alice has not reimbursed yet a previous loan. Thus, Alice corresponds to an instance $Alice = (18, 0)$. 
This instance is represented by the explainer through binary variables encoding the conditions of the nodes: ```(-1, -2, -3)```. \n", "This is equivalent to $\{\overline{(f_1 \geq 20)}, \overline{(f_1 \geq 30)}, \overline{(f_2 = 1)}\}$ (or\n", "equivalently to $\{(f_1 \lt 20), (f_1 \lt 30), (f_2 \neq 1)\}$).\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "0fe87c60", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "binary representation: (-1, -2, -3)\n", "binary representation features: ('f1 < 30', 'f1 < 20', 'f2 != 1')\n", "target_prediction: 0\n" ] } ], "source": [ "alice = (18, 0)\n", "explainer = Explainer.initialize(forest, instance=alice)\n", "print(\"binary representation: \", explainer.binary_representation)\n", "print(\"binary representation features:\", explainer.to_features(explainer.binary_representation, eliminate_redundant_features=False))\n", "print(\"target_prediction:\", explainer.target_prediction)\n" ] }, { "cell_type": "markdown", "id": "1aec1a94", "metadata": {}, "source": [ "Alice does not get the loan (```target_prediction: 0```) and would like to know what to change to get it: we need a contrastive explanation." ] }, { "cell_type": "code", "execution_count": 3, "id": "c5de8501", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "contrastives: ((-1,),)\n", "contrastives (to_features): ('f1 < 30',)\n" ] } ], "source": [ "contrastives = explainer.minimal_contrastive_reason(n=Explainer.ALL)\n", "print(\"contrastives:\", contrastives)\n", "\n", "print(\"contrastives (to_features):\", explainer.to_features(contrastives[0], contrastive=True))" ] }, { "cell_type": "markdown", "id": "5703a8d6", "metadata": {}, "source": [ "Without the theory, the binary variable ```(-1)``` (representing the condition $\overline{(f_1 \geq 30)}$) is a (subset-minimal) contrastive explanation for Alice's instance. However, no instance matches this representation: the instance extended from this contrastive explanation, $\{(f_1 \geq 30), \overline{(f_1 \geq 20)}, \overline{(f_2 = 1)}\}$, conflicts with an indisputable piece of theory: $\overline{(f_1 \geq 20)} \Rightarrow \overline{(f_1 \geq 30)}$. To refrain from deriving such incorrect explanations, we must take into account propositional constraints, forming a domain theory, that indicate how the Boolean conditions are logically connected.\n", "To accomplish this, you just need to specify which features are numerical, categorical or binary in the ```features_type``` parameter of the ```Explainer.initialize()``` method. 
\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "62b55c6d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------- Theory Feature Types -----------\n", "Before the encoding (without one hot encoded features), we have:\n", "Numerical features: 1\n", "Categorical features: 0\n", "Binary features: 1\n", "Number of features: 2\n", "Values of categorical features: {}\n", "\n", "Number of used features in the model (before the encoding): 2\n", "Number of used features in the model (after the encoding): 2\n", "----------------------------------------------\n", "contrastives: ((-1, -2), (-2, -3))\n", "contrastives (to_features): ('f1 < 30',)\n", "contrastives (to_features): ('f1 < 20', 'f2 != 1')\n" ] } ], "source": [ "explainer = Explainer.initialize(forest, instance=alice, features_type={\"numerical\": [\"f1\"], \"binary\": [\"f2\"]})\n", "\n", "contrastives = explainer.minimal_contrastive_reason(n=Explainer.ALL)\n", "print(\"contrastives:\", contrastives)\n", "print(\"contrastives (to_features):\", explainer.to_features(contrastives[0], contrastive=True))\n", "print(\"contrastives (to_features):\", explainer.to_features(contrastives[1], contrastive=True))" ] }, { "cell_type": "markdown", "id": "0fe082f2", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "id": "8bf2d850", "metadata": {}, "source": [ "By taking the theory into account, we now get two different contrastive explanations: $$c_1 = \\{\\overline{(f_1 \\geq 20)}, \\overline{(f_1 \\geq 30)}\\} \\mbox{ and } c_2 = \\{\\overline{(f_1 \\geq 20)}, \\overline{(f_2 = 1)}\\}.$$ " ] }, { "cell_type": "markdown", "id": "052537bf", "metadata": {}, "source": [ "Eliminating redundant features gives us: $$c_1 = \\{\\overline{(f_1 \\geq 30)}\\} \\mbox{ and } c_2 = \\{\\overline{(f_1 \\geq 20)}, \\overline{(f_2 = 1)}\\}.$$ " ] }, { "cell_type": "markdown", "id": "b128555b", "metadata": {}, "source": [ "And now, the instances we derive from contrastive explanations no longer conflict with domain theory: \n", "$$\\{(A_1 \\geq 30), (A_1 \\geq 20), \\overline{(A_2 = 1)}\\} \\mbox{and} \\{\\overline{(A_1 \\geq 30)}, (A_1 \\geq 20), (A_2 = 1)\\}.$$" ] }, { "cell_type": "markdown", "id": "c14194e9", "metadata": {}, "source": [ "## For abductive reasons" ] }, { "cell_type": "markdown", "id": "b6199d9d", "metadata": {}, "source": [ "The Australian Credit Approval dataset is a credit card application and this [link](https://www.openml.org/search?type=data&sort=runs&id=40981&status=active) allows to get the type of features. Thanks to the preprocessor of PyXAI, we generate a ```australian.csv``` dataset and a ```australian.types``` file: " ] }, { "cell_type": "code", "execution_count": 5, "id": "d67a1d5c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11',\n", " 'A12', 'A13', 'A14', 'A15'],\n", " dtype='object')\n", "--------------- Converter ---------------\n", "-> The feature A1 is boolean! No One Hot Encoding for this features.\n", "One hot encoding new features for A4: 3\n", "One hot encoding new features for A5: 14\n", "One hot encoding new features for A6: 8\n", "-> The feature A8 is boolean! No One Hot Encoding for this features.\n", "-> The feature A9 is boolean! No One Hot Encoding for this features.\n", "-> The feature A11 is boolean! 
No One Hot Encoding for this features.\n", "One hot encoding new features for A12: 3\n", "Numbers of classes: 2\n", "Number of boolean features: 4\n", "Dataset saved: ../../dataset/australian_0.csv\n", "Types saved: ../../dataset/australian_0.types\n", "-----------------------------------------------\n" ] } ], "source": [ "from pyxai import Learning, Explainer, Tools\n", "\n", "import datetime\n", "\n", "preprocessor = Learning.Preprocessor(\"../../dataset/australian.csv\", target_feature=\"A15\", learner_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS)\n", "\n", "preprocessor.set_categorical_features(columns=[\"A1\", \"A4\", \"A5\", \"A6\", \"A8\", \"A9\", \"A11\", \"A12\"])\n", "preprocessor.set_numerical_features({\n", " \"A2\": None,\n", " \"A3\": None,\n", " \"A7\": None,\n", " \"A10\": None,\n", " \"A13\": None,\n", " \"A14\": None,\n", " })\n", "\n", "preprocessor.process()\n", "preprocessor.export(\"australian\", output_directory=\"../../dataset\")\n" ] }, { "cell_type": "markdown", "id": "8bb09804", "metadata": {}, "source": [ "We create a random forest and calculate a [majoritary reason](/documentation/classification/RFexplanations/majoritary/) by activating domain theory:" ] }, { "cell_type": "code", "execution_count": 6, "id": "eb27d4b7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data:\n", " A1 A2 A3 A4_1 A4_2 A4_3 A5_1 A5_2 A5_3 A5_4 ... A8 A9 A10 \n", "0 1 65 168 0 1 0 0 0 0 1 ... 0 0 1 \\\n", "1 0 72 123 0 1 0 0 0 0 0 ... 0 0 1 \n", "2 0 142 52 1 0 0 0 0 0 1 ... 0 0 1 \n", "3 0 60 169 1 0 0 0 0 0 0 ... 1 1 12 \n", "4 1 44 134 0 1 0 0 0 0 0 ... 1 1 15 \n", ".. .. ... ... ... ... ... ... ... ... ... ... .. .. ... \n", "685 1 163 160 0 1 0 0 0 0 0 ... 1 0 1 \n", "686 1 49 14 0 1 0 0 0 0 0 ... 0 0 1 \n", "687 0 32 145 0 1 0 0 0 0 0 ... 1 0 1 \n", "688 0 122 193 0 1 0 0 0 0 0 ... 1 1 2 \n", "689 1 245 2 0 1 0 0 0 0 0 ... 0 1 2 \n", "\n", " A11 A12_1 A12_2 A12_3 A13 A14 A15 \n", "0 1 0 1 0 32 161 0 \n", "1 0 0 1 0 53 1 0 \n", "2 1 0 1 0 98 1 0 \n", "3 1 0 1 0 1 1 1 \n", "4 0 0 1 0 18 68 1 \n", ".. ... ... ... ... ... ... ... 
\n", "685 0 0 1 0 1 1 1 \n", "686 0 0 1 0 1 35 0 \n", "687 0 0 1 0 32 1 1 \n", "688 0 0 1 0 38 12 1 \n", "689 0 1 0 0 159 1 1 \n", "\n", "[690 rows x 39 columns]\n", "-------------- Information ---------------\n", "Dataset name: ../../dataset/australian_0.csv\n", "nFeatures (nAttributes, with the labels): 39\n", "nInstances (nObservations): 690\n", "nLabels: 2\n", "--------------- Evaluation ---------------\n", "method: HoldOut\n", "output: RF\n", "learner_type: Classification\n", "learner_options: {'max_depth': None, 'random_state': 0}\n", "--------- Evaluation Information ---------\n", "For the evaluation number 0:\n", "metrics:\n", " accuracy: 85.5072463768116\n", "nTraining instances: 483\n", "nTest instances: 207\n", "\n", "--------------- Explainer ----------------\n", "For the evaluation number 0:\n", "**Random Forest Model**\n", "nClasses: 2\n", "nTrees: 100\n", "nVariables: 1361\n", "\n", "--------------- Instances ----------------\n", "number of instances selected: 1\n", "----------------------------------------------\n", "--------- Theory Feature Types -----------\n", "Before the encoding (without one hot encoded features), we have:\n", "Numerical features: 6\n", "Categorical features: 4\n", "Binary features: 4\n", "Number of features: 14\n", "Values of categorical features: {'A4_1': ['A4', 1, [1, 2, 3]], 'A4_2': ['A4', 2, [1, 2, 3]], 'A4_3': ['A4', 3, [1, 2, 3]], 'A5_1': ['A5', 1, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_2': ['A5', 2, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_3': ['A5', 3, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_4': ['A5', 4, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_5': ['A5', 5, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_6': ['A5', 6, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_7': ['A5', 7, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_8': ['A5', 8, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_9': ['A5', 9, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_10': ['A5', 10, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_11': ['A5', 11, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_12': ['A5', 12, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_13': ['A5', 13, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_14': ['A5', 14, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A6_1': ['A6', 1, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_2': ['A6', 2, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_3': ['A6', 3, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_4': ['A6', 4, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_5': ['A6', 5, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_7': ['A6', 7, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_8': ['A6', 8, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_9': ['A6', 9, [1, 2, 3, 4, 5, 7, 8, 9]], 'A12_1': ['A12', 1, [1, 2, 3]], 'A12_2': ['A12', 2, [1, 2, 3]], 'A12_3': ['A12', 3, [1, 2, 3]]}\n", "\n", "Number of used features in the model (before the encoding): 14\n", "Number of used features in the model (after the encoding): 38\n", "----------------------------------------------\n", "\n", "len tree_specific: 12\n", "\n", "tree_specific: ('A2 > 194.5', 'A3 in ]43.0, 53.0]', 'A5 = 3', 'A6 = 5', 'A7 in ]66.5, 93.0]', 'A8 = 0', 'A10 <= 2.5', 'A13 in ]63.5, 79.0]', 'A14 <= 5.5')\n", "is majoritary: True\n" ] } ], "source": [ "# Machine learning part\n", "learner = Learning.Scikitlearn(\"../../dataset/australian_0.csv\", learner_type=Learning.CLASSIFICATION)\n", "model = learner.evaluate(method=Learning.HOLD_OUT, output=Learning.RF)\n", "instance, prediction = learner.get_instances(model, n=1, seed=11200, 
correct=False)\n", "\n", "# Explainer part\n", "explainer = Explainer.initialize(model, instance=instance, features_type=\"../../dataset/australian_0.types\")\n", "majoritary_reason = explainer.majoritary_reason(n_iterations=10)\n", "print(\"\\nlen tree_specific: \", len(majoritary_reason))\n", "print(\"\\ntree_specific: \", explainer.to_features(majoritary_reason))\n", "print(\"is majoritary:\", explainer.is_majoritary_reason(majoritary_reason))" ] }, { "cell_type": "markdown", "id": "de87a0c8", "metadata": {}, "source": [ "Thanks to the support of domain theories for categorical features, each one-hot encoded categorical feature ```AX``` occurs at most once in the derived explanation, as a single condition (such as ```A5 = 3```), rather than through several of the binary features ```AX_Y``` resulting from its encoding. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 5 }