{ "cells": [ { "cell_type": "markdown", "id": "c1db86d8", "metadata": {}, "source": [ "# Coverage Reasons" ] }, { "cell_type": "markdown", "id": "f8036c42", "metadata": {}, "source": [ "Let $f$ be a Boolean function represented by a random forest $RF$, $x$ be an instance and $1$ (resp. $0$) the prediction of $RF$ on $x$ ($f(x) = 1$ (resp. $f(x)=0$)).\n", "The Boolean conditions occurring in the trees are often logically connected: such constraints are gathered in a **domain theory** $\\Sigma^f$ (for instance, the condition $A_1 > 30$ entails the condition $A_1 > 20$). Domain theories are described in the [Theories](/documentation/explainer/theories/) page." ] }, { "cell_type": "markdown", "id": "26916328", "metadata": {}, "source": [ "A **coverage-based prime implicant explanation** (CPI-Xp), or **coverage reason**, for $x$ is an **abductive** explanation for $x$ that is **maximally general** with respect to the domain theory $\\Sigma^f$. An abductive explanation is a subset $t$ of the characteristics (Boolean conditions) of $x$ such that every instance sharing $t$ and satisfying $\\Sigma^f$ is classified by $f$ in the same way as $x$ (in other words, $t$ is an implicant of $f$ *modulo* $\\Sigma^f$). Among all the abductive explanations of $x$, a coverage reason is one that covers as many instances satisfying $\\Sigma^f$ as possible: no other abductive explanation of $x$ is strictly more general modulo $\\Sigma^f$.\n", "\n", "A coverage reason is **not** necessarily a sufficient reason. A sufficient reason is a *prime* (i.e. subset-minimal) implicant of $f$ covering $x$, whereas a coverage reason is selected for its generality and not for its minimality. Because of the constraints in $\\Sigma^f$, maximizing generality may require keeping many conditions, so a coverage reason can involve far more Boolean conditions than a sufficient reason (it need not be subset-minimal). 
What a coverage reason does guarantee is that it never retains a condition subsumed by a more general one under the theory (for example, it keeps $A_1 > 20$ rather than the less general $A_1 > 30$). A coverage reason that is in addition subset-minimal is called a **minimal coverage reason** (mCPI-Xp): no literal can be removed from it while it remains a valid implicant modulo $\\Sigma^f$. More information can be found in the article [Computing Coverage-Based Prime Implicant Explanations for Tree-Based Models](/pyxai/papers/).\n", "\n", "The function ```ExplainerRF.coverage_reason``` computes a coverage reason. The function ```ExplainerRF.minimal_coverage_reason``` computes a minimal coverage reason: starting from a coverage reason, it tentatively removes each literal in turn, keeps the removal whenever the remaining set is still a valid implicant modulo $\\Sigma^f$, and repeats until no literal can be discarded. Both functions rely on the domain theory, so the feature types must be provided when initializing the explainer (through the ```features_type``` parameter, see the [Theories](/documentation/explainer/theories/) page); otherwise a ```ValueError``` is raised." ] }, { "cell_type": "markdown", "id": "736eb9d6", "metadata": {}, "source": [ "### Feature order\n", "\n", "Several coverage reasons may exist for the same instance. The ```ExplainerRF.coverage_reason``` function returns one of them, computed by a greedy algorithm that processes the features in a given priority order. This order can be controlled through the ```ordre_features``` parameter (a list of feature names); when it is not provided, a default order derived from the domain theory is used." ] }, { "cell_type": "markdown", "id": "9c16fc03", "metadata": {}, "source": [ "## Example from Hand-Crafted Trees" ] }, { "cell_type": "markdown", "id": "96c16c54", "metadata": {}, "source": [ "For this example, we take the random forest below (Figure 1 of the article). 
It is built over two features: a numerical one $A_1$ (the annual income of a loan applicant, in thousands of dollars) and a binary one $A_2$ (whether or not the applicant holds a permanent position). In PyXAI these features are named ```f1``` and ```f2```.\n", "\n", "*Figure: the random forest (RFcoverage1).*\n", "\n", "Three Boolean conditions occur in the forest: $B_1 = (A_1 > 20)$, $B_2 = (A_1 > 30)$ and $B_3 = (A_2 = 1)$. The conditions $B_1$ and $B_2$ are not independent: they are logically connected by the domain theory $\\Sigma^f$ stating that $B_2 \\Rightarrow B_1$ (an income above $30$ is also above $20$). Under this theory, the forest is equivalent to $f \\equiv B_1 \\vee B_2 \\vee B_3$, which simplifies to $B_1 \\vee B_3$ since $B_2$ entails $B_1$.\n", "\n", "We consider the instance $x = (33, 1)$ (an applicant earning 33 thousand dollars and holding a permanent position), for which $f(x) = 1$. We start by building the forest and activating the domain theory:" ] }, { "cell_type": "code", "execution_count": 1, "id": "5d7d2782", "metadata": { "execution": { "iopub.execute_input": "2026-06-08T08:19:48.399267Z", "iopub.status.busy": "2026-06-08T08:19:48.399096Z", "iopub.status.idle": "2026-06-08T08:19:49.864404Z", "shell.execute_reply": "2026-06-08T08:19:49.863949Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------- Theory Feature Types -----------\n", "Before the one-hot encoding of categorical features:\n", "Numerical features: 1\n", "Categorical features: 0\n", "Binary features: 1\n", "Number of features: 2\n", "Characteristics of categorical features: {}\n", "\n", "Number of used features in the model (before the encoding of categorical features): 2\n", "Number of used features in the model (after the encoding of categorical features): 2\n", "----------------------------------------------\n", "target_prediction: 1\n" ] } ], "source": [ "from pyxai import Builder, Explaining\n", "\n", "# T1: A1 > 20 ? 1 : (A2 = 1 ? 
1 : 0)\n", "nodeT1_A2 = Builder.DecisionNode(2, operator=Builder.EQ, threshold=1, left=0, right=1)\n", "nodeT1_A1 = Builder.DecisionNode(1, operator=Builder.GE, threshold=20, left=nodeT1_A2, right=1)\n", "tree1 = Builder.DecisionTree(2, nodeT1_A1)\n", "\n", "# T2: A2 = 1 ? 1 : (A1 > 30 ? 1 : 0)\n", "nodeT2_A1 = Builder.DecisionNode(1, operator=Builder.GE, threshold=30, left=0, right=1)\n", "nodeT2_A2 = Builder.DecisionNode(2, operator=Builder.EQ, threshold=1, left=nodeT2_A1, right=1)\n", "tree2 = Builder.DecisionTree(2, nodeT2_A2)\n", "\n", "# T3: A1 > 30 ? 1 : (A1 > 20 ? 1 : 0)\n", "nodeT3_A1_20 = Builder.DecisionNode(1, operator=Builder.GE, threshold=20, left=0, right=1)\n", "nodeT3_A1_30 = Builder.DecisionNode(1, operator=Builder.GE, threshold=30, left=nodeT3_A1_20, right=1)\n", "tree3 = Builder.DecisionTree(2, nodeT3_A1_30)\n", "\n", "forest = Builder.RandomForest([tree1, tree2, tree3], n_classes=2)\n", "\n", "instance = (33, 1)\n", "explainer = Explaining.initialize(forest, instance=instance, features_type={\"numerical\": [\"f1\"], \"binary\": [\"f2\"]})\n", "print(\"target_prediction:\", explainer.target_prediction)" ] }, { "cell_type": "markdown", "id": "12c30914", "metadata": {}, "source": [ "For this instance, the subset-minimal abductive explanations are $\\{A_1 > 20\\}$, $\\{A_1 > 30\\}$ and $\\{A_2 = 1\\}$ (here they are also sufficient reasons, since each is a single condition). Among them, $\\{A_1 > 30\\}$ is *less general* than $\\{A_1 > 20\\}$ under the theory $B_2 \\Rightarrow B_1$ (every instance covered by $A_1 > 30$ is also covered by $A_1 > 20$, but not the other way around). 
Hence the coverage reasons are only the two maximally general ones: $\\{A_1 > 20\\}$ and $\\{A_2 = 1\\}$.\n", "\n", "We first compute a sufficient reason, then a coverage reason:" ] }, { "cell_type": "code", "execution_count": 2, "id": "6c2dc3d5", "metadata": { "execution": { "iopub.execute_input": "2026-06-08T08:19:49.865625Z", "iopub.status.busy": "2026-06-08T08:19:49.865434Z", "iopub.status.idle": "2026-06-08T08:19:49.874647Z", "shell.execute_reply": "2026-06-08T08:19:49.874336Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sufficient: ['f2 == 1']\n", "coverage: ['f2 == 1']\n" ] } ], "source": [ "sufficient = explainer.sufficient_reason()\n", "print(\"sufficient:\", explainer.to_features(sufficient))\n", "\n", "coverage = explainer.coverage_reason()\n", "print(\"coverage:\", explainer.to_features(coverage))" ] }, { "cell_type": "markdown", "id": "bd7f49f6", "metadata": {}, "source": [ "By default the greedy algorithm returns the coverage reason $\\{A_2 = 1\\}$. 
Using the ```ordre_features``` parameter, we can change the priority order over the features and obtain the other coverage reason $\\{A_1 > 20\\}$:" ] }, { "cell_type": "code", "execution_count": 3, "id": "d028a56a", "metadata": { "execution": { "iopub.execute_input": "2026-06-08T08:19:49.875799Z", "iopub.status.busy": "2026-06-08T08:19:49.875696Z", "iopub.status.idle": "2026-06-08T08:19:49.878854Z", "shell.execute_reply": "2026-06-08T08:19:49.878442Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "coverage (f2 first): ['f1 >= 20']\n", "coverage (f1 first): ['f2 == 1']\n" ] } ], "source": [ "coverage_f2_first = explainer.coverage_reason(ordre_features=[\"f2\", \"f1\"])\n", "print(\"coverage (f2 first):\", explainer.to_features(coverage_f2_first))\n", "\n", "coverage_f1_first = explainer.coverage_reason(ordre_features=[\"f1\", \"f2\"])\n", "print(\"coverage (f1 first):\", explainer.to_features(coverage_f1_first))" ] }, { "cell_type": "markdown", "id": "10bf6318", "metadata": {}, "source": [ "In both cases, the condition $A_1 > 30$ is never selected: a coverage reason always prefers the more general condition $A_1 > 20$ (displayed as ```f1 >= 20``` in the output, since the nodes were built with ```Builder.GE```)." ] }, { "cell_type": "markdown", "id": "a6f2f988", "metadata": {}, "source": [ "## Example from a Real Dataset" ] }, { "cell_type": "markdown", "id": "86881dee", "metadata": {}, "source": [ "For this example, we take the [australian](/assets/notebooks/dataset/australian_0.csv) credit approval dataset, together with the [australian_0.types](/assets/notebooks/dataset/australian_0.types) file describing the feature types (this file activates the domain theory; see the [Theories](/documentation/explainer/theories/) page). We create a model using the hold-out approach and select a correctly classified instance." 
] }, { "cell_type": "code", "execution_count": 4, "id": "831f2088", "metadata": { "execution": { "iopub.execute_input": "2026-06-08T08:19:49.880112Z", "iopub.status.busy": "2026-06-08T08:19:49.880003Z", "iopub.status.idle": "2026-06-08T08:19:50.195015Z", "shell.execute_reply": "2026-06-08T08:19:50.194614Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-------------- Information ---------------\n", "Problem type: classification\n", "Instances type: tabular\n", "Labels type: classes\n", "\n", "Dataset path: ../../dataset/australian_0.csv\n", "nFeatures (nAttributes, with the labels): 38\n", "nInstances (nObservations): 690\n", "nLabels: 2\n", "--------------- Model creation, fitting and evaluation ---------------\n", "Splitting method: hold-out\n", "Problem type: classification\n", "Models type: random-forest\n", "model_parameters: {}\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "--------- Evaluation Information ---------\n", "For the evaluation number 0:\n", "Metrics:\n", " sklearn_confusion_matrix: [[91, 8], [11, 63]]\n", " precision: 88.73239436619718\n", " recall: 85.13513513513513\n", " f1_score: 86.89655172413794\n", " specificity: 91.91919191919192\n", " true_positive: 63\n", " true_negative: 91\n", " false_positive: 8\n", " false_negative: 11\n", " accuracy: 89.01734104046243\n", "Number of Training instances: 517\n", "Number of Testing instances: 173\n", "\n", "--------------- Explainer ----------------\n", "For the split number 0:\n", "**Random Forest Model**\n", "nClasses: 2\n", "nTrees: 100\n", "nVariables: 1473\n", "\n", "--------------- Instances ----------------\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Number of instances selected: 1\n", "----------------------------------------------\n" ] } ], "source": [ "from pyxai import Learning, Explaining\n", "\n", "learner = Learning.Scikitlearn(\"../../dataset/australian_0.csv\", problem_type=Learning.CLASSIFICATION)\n", "model = 
learner.evaluate(splitting_method=Learning.HOLD_OUT, model_type=Learning.RF)\n", "instance, prediction = learner.get_instances(model, n=1, seed=11200, is_correct=True)" ] }, { "cell_type": "markdown", "id": "94830135", "metadata": {}, "source": [ "We initialize the explainer with the domain theory and compute a coverage reason. The raw explanation is a term over the Boolean conditions of the forest (it is an implicant *modulo the theory*, so it may gather many literals); the ```to_features``` method gives a compact and human-readable form." ] }, { "cell_type": "code", "execution_count": 5, "id": "71948a7a", "metadata": { "execution": { "iopub.execute_input": "2026-06-08T08:19:50.196576Z", "iopub.status.busy": "2026-06-08T08:19:50.196455Z", "iopub.status.idle": "2026-06-08T08:19:51.579766Z", "shell.execute_reply": "2026-06-08T08:19:51.579333Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------- Theory Feature Types -----------\n", "Before the one-hot encoding of categorical features:\n", "Numerical features: 6\n", "Categorical features: 4\n", "Binary features: 4\n", "Number of features: 14\n", "Characteristics of categorical features: {'A4_1': ['A4', 1, [1, 2, 3]], 'A4_2': ['A4', 2, [1, 2, 3]], 'A4_3': ['A4', 3, [1, 2, 3]], 'A5_1': ['A5', 1, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_2': ['A5', 2, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_3': ['A5', 3, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_4': ['A5', 4, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_5': ['A5', 5, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_6': ['A5', 6, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_7': ['A5', 7, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_8': ['A5', 8, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_9': ['A5', 9, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_10': ['A5', 10, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_11': ['A5', 11, [1, 2, 3, 4, 5, 6, 7, 
8, 9, 10, 11, 12, 13, 14]], 'A5_12': ['A5', 12, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_13': ['A5', 13, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_14': ['A5', 14, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A6_1': ['A6', 1, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_2': ['A6', 2, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_3': ['A6', 3, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_4': ['A6', 4, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_5': ['A6', 5, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_7': ['A6', 7, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_8': ['A6', 8, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_9': ['A6', 9, [1, 2, 3, 4, 5, 7, 8, 9]], 'A12_1': ['A12', 1, [1, 2, 3]], 'A12_2': ['A12', 2, [1, 2, 3]], 'A12_3': ['A12', 3, [1, 2, 3]]}\n", "\n", "Number of used features in the model (before the encoding of categorical features): 14\n", "Number of used features in the model (after the encoding of categorical features): 37\n", "----------------------------------------------\n", "prediction: 1\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "len coverage (literals): 473\n", "\n", "coverage (to_features): ['A4 = 2', 'A5 = 9', 'A6 = 4', 'A8 = 1', 'A9 = 1', 'A10 in ]4.5, 18.5]', 'A12 = 2', 'A13 in ]12.5, 33.5]', 'A14 <= 5.0']\n" ] } ], "source": [ "explainer = Explaining.initialize(model, instance=instance, features_type=\"../../dataset/australian_0.types\")\n", "print(\"prediction:\", prediction)\n", "\n", "coverage = explainer.coverage_reason()\n", "print(\"\\nlen coverage (literals):\", len(coverage))\n", "print(\"\\ncoverage (to_features):\", explainer.to_features(coverage))" ] }, { "cell_type": "markdown", "id": "2c60edb7", "metadata": {}, "source": [ "Each categorical feature (here $A_4$, $A_5$, $A_6$ and $A_{12}$) is one-hot encoded into several binary columns, but thanks to the domain theory the explanation reports a single equality condition per categorical feature (for example ```A4 = 2```) instead of a scattered set of literals over its binary columns. 
For the numerical features, the widest thresholds compatible with the prediction are kept, so that the explanation covers as many instances as possible (for example ```A14 <= 5.0``` or ```A10 in ]4.5, 18.5]```).\n", "\n", "Other types of explanations are presented in the [Explanations Computation](/documentation/explanations/RFexplanations/) page." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 5 }