{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Theories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Theories are representations of pieces of knowledge about the dataset. They may either be supplied by experts or be derived directly from the nature of the data. PyXAI handles the latter case by encoding domain theories during the computation of explanations. Domain theories are used to prevent impossible explanations from being inferred. The way of dealing with them differs according to the kind of explanation sought: contrastive or abductive. More details about theories can be found in our [IJCAI'23 paper](/pyxai/papers/).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": "In PyXAI, there are two ways to activate domain theories when initializing the explainer.\n\nEither by specifying the types of the features in the ```features_type``` parameter through a Python dictionary with the following keys: ```\"numerical\"```, ```\"categorical\"``` and ```\"binary\"```. To avoid having to list all the features, you can choose a default type using the ```Learning.DEFAULT``` constant. For each type that is not set to this constant, you must provide a list of feature names as the value. However, the ```\"categorical\"``` key requires a Python dictionary whose keys are feature names containing the wildcard characters ```*```, ```{```, ```}``` or ```,```. Such a key indicates that a set of feature names (for instance, names beginning with the same characters) actually represents a single categorical feature that has been one-hot encoded. For example, ```\"A4*\"``` represents the categorical feature encoded through the features ```\"A4_1\"```, ```\"A4_2\"``` and ```\"A4_3\"```. The value associated with each key lists the possible values of the corresponding categorical feature (```(1, 2, 3)``` in this case). 
" }, { "cell_type": "raw", "metadata": {}, "source": [ "australian_types = {\n \"numerical\": Learning.DEFAULT,\n \"categorical\": {\"A4*\": (1, 2, 3), \n \"A5*\": tuple(range(1, 15)),\n \"A6*\": (1, 2, 3, 4, 5, 7, 8, 9), \n \"A12*\": tuple(range(1, 4))},\n \"binary\": [\"A1\", \"A8\", \"A9\", \"A11\"],\n}\n\nexplainer = Explaining.initialize(model, instance=instance, features_type=australian_types)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or by specifying in this parameter the path and name of a file containing the type of features. Such a file can be generated using the preprocessor of PyXAI (please see the [Preprocessing Data](/documentation/preprocessor/) page)." ] }, { "cell_type": "raw", "metadata": {}, "source": [ "explainer = Explaining.initialize(model, instance=instance, features_type=\"../australian.types\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "{: .attention }\n", "> There is another way to specify categorical features.\n", "> For example, if we have in our dataset three binary features named ```\"Red\"```, ```\"Green\"``` and ```\"Blue\"``` that come from a one-hot encoded feature named ```\"Color\"```, we can declare the following types: \n", "```python\n", "types = {\n", " \"categorical\": {\"{Red,Green,Blue}\": (\"Red\", \"Green\", \"Blue\")}\n", "}\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## For contrastive reasons" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To understand the principles of domain theory, we first build a small example with the builder of PyXAI ([Building Models](/pyxai/documentation/learning/builder/DTbuilder/)). This example is based on one numerical feature ($f_1$: the annual income of the applicant) and one categorical (and binary) feature ($f_2$: whether or not the applicant has already reimbursed a previous loan). 
The model is used to determine whether or not a loan must be granted to an applicant.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pyxai import Builder, Explaining\n", "\n", "# Decision nodes of the first tree: each one tests a feature (1 for f1, 2 for f2) against a threshold\n", "node1 = Builder.DecisionNode(2, operator=Builder.EQ, threshold=1, left=0, right=1)\n", "node2 = Builder.DecisionNode(1, operator=Builder.GE, threshold=20, left=0, right=node1)\n", "node3 = Builder.DecisionNode(1, operator=Builder.GE, threshold=30, left=node2, right=1)\n", "\n", "# The first tree is rooted at node3; the second tree is a single leaf predicting class 1\n", "tree1 = Builder.DecisionTree(2, node3)\n", "tree2 = Builder.DecisionTree(2, Builder.LeafNode(1))\n", "\n", "forest = Builder.RandomForest([tree1, tree2], n_classes=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose Alice wants to get a loan. We know that Alice's annual income is $18k$ and that she has not yet reimbursed a previous loan. Thus, Alice corresponds to the instance $Alice = (18, 0)$. The explainer represents this instance with binary variables encoding the conditions of the nodes: ```(-1, -2, -3)```. 
\n", "This is equivalent to $\\{\\overline{(f_1 \\geq 30)}, \\overline{(f_1 \\geq 20)}, \\overline{(f_2 = 1)}\\}$ (or\n", "equivalently to $\\{(f_1 \\lt 30), (f_1 \\lt 20), (f_2 \\neq 1)\\}$).\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "binary representation: (-1, -2, -3)\n", "binary representation features: ['f1 < 30', 'f1 < 20', 'f2 != 1']\n", "target_prediction: 0\n" ] } ], "source": [ "alice = (18, 0)\n", "explainer = Explaining.initialize(forest, instance=alice)\n", "print(\"binary representation: \", explainer.binary_representation)\n", "print(\"binary representation features:\", explainer.to_features(explainer.binary_representation, eliminate_redundant_features=False))\n", "print(\"target_prediction:\", explainer.target_prediction)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alice does not get the loan (```target_prediction: 0```) and would like to know what to change to get it: we need a contrastive explanation." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "contrastives: ((-1,),)\n", "contrastives (to_features): ['f1 < 30']\n" ] } ], "source": [ "contrastives = explainer.minimal_contrastive_reason(n=Explaining.ALL)\n", "print(\"contrastives:\", contrastives)\n", "\n", "print(\"contrastives (to_features):\", explainer.to_features(contrastives[0], contrastive=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without the theory, the binary variable ```(-1)``` (representing the condition $\\overline{(f_1 \\geq 30)}$) is a (subset-minimal) contrastive explanation for Alice's instance. 
However, no instance matches this representation: the instance obtained by flipping this contrastive explanation, $\\{(f_1 \\geq 30), \\overline{(f_1 \\geq 20)}, \\overline{(f_2 = 1)}\\}$, conflicts with an indisputable piece of domain knowledge: $\\overline{(f_1 \\geq 20)} \\Rightarrow \\overline{(f_1 \\geq 30)}$. To avoid deriving such incorrect explanations, propositional constraints that form a domain theory, indicating how the Boolean conditions are logically connected, must be taken into account.\n", "To accomplish this, you just need to specify which features are numerical, categorical and binary in the ```features_type``` parameter of the ```Explaining.initialize()``` function. \n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "feature_names: ['f1', 'f2', 'p']\n", "--------- Theory Feature Types -----------\n", "Before the one-hot encoding of categorical features:\n", "Numerical features: 1\n", "Categorical features: 0\n", "Binary features: 1\n", "Number of features: 2\n", "Characteristics of categorical features: {}\n", "\n", "Number of used features in the model (before the encoding of categorical features): 2\n", "Number of used features in the model (after the encoding of categorical features): 2\n", "----------------------------------------------\n", "contrastives: ((-1, -2), (-2, -3))\n", "contrastives (to_features): ['f1 < 30']\n", "contrastives (to_features): ['f1 < 20', 'f2 != 1']\n" ] } ], "source": [ "explainer = Explaining.initialize(forest, instance=alice, features_type={\"numerical\": [\"f1\"], \"binary\": [\"f2\"]})\n", "\n", "contrastives = explainer.minimal_contrastive_reason(n=Explaining.ALL)\n", "print(\"contrastives:\", contrastives)\n", "print(\"contrastives (to_features):\", explainer.to_features(contrastives[0], contrastive=True))\n", "print(\"contrastives (to_features):\", explainer.to_features(contrastives[1], contrastive=True))" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "By taking the theory into account, we now get two different contrastive explanations: $$c_1 = \\{\\overline{(f_1 \\geq 20)}, \\overline{(f_1 \\geq 30)}\\} \\mbox{ and } c_2 = \\{\\overline{(f_1 \\geq 20)}, \\overline{(f_2 = 1)}\\}.$$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Eliminating redundant features gives us: $$c_1 = \\{\\overline{(f_1 \\geq 30)}\\} \\mbox{ and } c_2 = \\{\\overline{(f_1 \\geq 20)}, \\overline{(f_2 = 1)}\\}.$$ " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now, the instances we derive from contrastive explanations no longer conflict with domain theory: $$\\{(f_1 \\geq 30), (f_1 \\geq 20), \\overline{(f_2 = 1)}\\} \\mbox{ and } \\{\\overline{(f_1 \\geq 30)}, (f_1 \\geq 20), (f_2 = 1)\\}.$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## For abductive reasons" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [Australian Credit Approval dataset](https://www.openml.org/search?type=data&sort=runs&id=40981&status=active) is a credit card application dataset. 
Using the preprocessor of PyXAI, we generate an ```australian_0.csv``` dataset and an ```australian_0.types``` file:\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from pyxai import Learning, Explaining\n\npreprocessor = Learning.TabularPreprocessor(\n \"../../dataset/australian.csv\",\n target_feature=\"A15\",\n problem_type=Learning.CLASSIFICATION,\n classification_type=Learning.BINARY_CLASS,\n)\n\npreprocessor.set_categorical_features(features=[\"A1\", \"A4\", \"A5\", \"A6\", \"A8\", \"A9\", \"A11\", \"A12\"])\npreprocessor.set_numerical_features({\n \"A2\": None,\n \"A3\": None,\n \"A7\": None,\n \"A10\": None,\n \"A13\": None,\n \"A14\": None,\n})\n\npreprocessor.process()\npreprocessor.export(\"australian\", output_directory=\"../../dataset\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a random forest and compute a [majoritary reason](/documentation/classification/RFexplanations/majoritary/) with the domain theory activated:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-------------- Information ---------------\n", "Problem type: classification\n", "Instances type: tabular\n", "Labels type: classes\n", "\n", "Dataset path: ../../dataset/australian_0.csv\n", "nFeatures (nAttributes, with the labels): 38\n", "nInstances (nObservations): 690\n", "nLabels: 2\n", "--------------- Model creation, fitting and evaluation ---------------\n", "Splitting method: hold-out\n", "Problem type: classification\n", "Models type: random-forest\n", "model_parameters: {}\n", "--------- Evaluation Information ---------\n", "For the evaluation number 0:\n", "Metrics:\n", " sklearn_confusion_matrix: [[74, 13], [13, 73]]\n", " precision: 84.88372093023256\n", " recall: 84.88372093023256\n", " f1_score: 84.88372093023256\n", " specificity: 85.0574712643678\n", " true_positive: 73\n", " true_negative: 74\n", " 
false_positive: 13\n", " false_negative: 13\n", " accuracy: 84.97109826589595\n", "Number of Training instances: 517\n", "Number of Testing instances: 173\n", "\n", "--------------- Explainer ----------------\n", "For the split number 0:\n", "**Random Forest Model**\n", "nClasses: 2\n", "nTrees: 100\n", "nVariables: 1378\n", "\n", "--------------- Instances ----------------\n", "Number of instances selected: 1\n", "----------------------------------------------\n", "feature_names: ['A1', 'A2', 'A3', 'A4_1', 'A4_2', 'A4_3', 'A5_1', 'A5_2', 'A5_3', 'A5_4', 'A5_5', 'A5_6', 'A5_7', 'A5_8', 'A5_9', 'A5_10', 'A5_11', 'A5_12', 'A5_13', 'A5_14', 'A6_1', 'A6_2', 'A6_3', 'A6_4', 'A6_5', 'A6_7', 'A6_8', 'A6_9', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12_1', 'A12_2', 'A12_3', 'A13', 'A14']\n", "--------- Theory Feature Types -----------\n", "Before the one-hot encoding of categorical features:\n", "Numerical features: 6\n", "Categorical features: 4\n", "Binary features: 4\n", "Number of features: 14\n", "Characteristics of categorical features: {'A4_1': ['A4', 1, [1, 2, 3]], 'A4_2': ['A4', 2, [1, 2, 3]], 'A4_3': ['A4', 3, [1, 2, 3]], 'A5_1': ['A5', 1, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_2': ['A5', 2, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_3': ['A5', 3, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_4': ['A5', 4, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_5': ['A5', 5, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_6': ['A5', 6, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_7': ['A5', 7, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_8': ['A5', 8, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_9': ['A5', 9, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_10': ['A5', 10, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_11': ['A5', 11, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_12': ['A5', 12, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_13': ['A5', 13, [1, 2, 3, 4, 5, 6, 7, 
8, 9, 10, 11, 12, 13, 14]], 'A5_14': ['A5', 14, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A6_1': ['A6', 1, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_2': ['A6', 2, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_3': ['A6', 3, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_4': ['A6', 4, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_5': ['A6', 5, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_7': ['A6', 7, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_8': ['A6', 8, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_9': ['A6', 9, [1, 2, 3, 4, 5, 7, 8, 9]], 'A12_1': ['A12', 1, [1, 2, 3]], 'A12_2': ['A12', 2, [1, 2, 3]], 'A12_3': ['A12', 3, [1, 2, 3]]}\n", "\n", "Number of used features in the model (before the encoding of categorical features): 14\n", "Number of used features in the model (after the encoding of categorical features): 38\n", "----------------------------------------------\n", "\n", "len tree_specific: 16\n", "\n", "tree_specific: ['A1 = 0', 'A2 > 279.5', 'A3 in ]44.0, 50.0]', 'A4 = 2', 'A5 = 5', 'A6 = 3', 'A7 <= 2.0', 'A8 = 1', 'A9 = 0', 'A10 <= 1.5', 'A11 = 1', 'A13 in ]27.0, 32.0]', 'A14 in ]18.5, 34.0]']\n", "is majoritary: True\n" ] } ], "source": [ "# Machine learning part\nlearner = Learning.Scikitlearn(\"../../dataset/australian_0.csv\", problem_type=Learning.CLASSIFICATION)\nmodel = learner.evaluate(splitting_method=Learning.HOLD_OUT, model_type=Learning.RF)\ninstance, prediction = learner.get_instances(model, n=1, seed=11200, is_correct=False)\n\n# Explainer part\nexplainer = Explaining.initialize(model, instance=instance, features_type=\"../../dataset/australian_0.types\")\nmajoritary_reason = explainer.majoritary_reason(n_iterations=10)\nprint(\"\\nlen majoritary_reason:\", len(majoritary_reason))\nprint(\"\\nmajoritary_reason:\", explainer.to_features(majoritary_reason))\nprint(\"is majoritary:\", explainer.is_majoritary_reason(majoritary_reason))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thanks to our support for domain theories of categorical features, only one feature AX_Y among the Y available is part of the 
explanation that has been derived. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }