{ "cells": [ { "cell_type": "markdown", "id": "ac658db4", "metadata": {}, "source": [ "# Building Boosted Trees" ] }, { "cell_type": "markdown", "id": "708de8c5", "metadata": {}, "source": [ "## Classification" ] }, { "cell_type": "markdown", "id": "5c2e8e81", "metadata": {}, "source": [ "### Building the Model" ] }, { "cell_type": "markdown", "id": "84d00330", "metadata": {}, "source": [ "This page explains how to build a Boosted Tree from tree elements (nodes and leaves). To illustrate the process, we take an example from the [Computing Abductive Explanations for Boosted Trees](https://arxiv.org/abs/2209.07740) paper.\n", "\n", "\"BTbase\"\n", "\n", "As an example of binary classification, consider four attributes: $A_1$ and $A_2$ are numerical,\n", "$A_3$ is categorical, and $A_4$ is Boolean. The Boosted Tree is composed of a single forest $F$, which consists of three regression trees $T_1$, $T_2$ and $T_3$. " ] }, { "cell_type": "markdown", "id": "0b893f0d", "metadata": {}, "source": [ "First, we need to import some modules. Recall that the ```Builder``` module contains methods to build the model, while the ```Explainer``` module provides methods to explain it. " ] }, { "cell_type": "code", "execution_count": 1, "id": "d3a4df8c", "metadata": {}, "outputs": [], "source": [ "from pyxai import Builder, Explainer, Learning" ] }, { "cell_type": "markdown", "id": "13445e1a", "metadata": {}, "source": [ "Next, we build the trees in a bottom-up way, that is, from the leaves to the root. So we start with the $A_1 > 2$ node of the tree $T_1$."
] }, { "cell_type": "code", "execution_count": 2, "id": "455b1d61", "metadata": {}, "outputs": [], "source": [ "node1_1 = Builder.DecisionNode(1, operator=Builder.GT, threshold=2, left=-0.2, right=0.3)\n", "node1_2 = Builder.DecisionNode(3, operator=Builder.EQ, threshold=1, left=-0.3, right=node1_1)\n", "node1_3 = Builder.DecisionNode(2, operator=Builder.GT, threshold=1, left=0.4, right=node1_2)\n", "node1_4 = Builder.DecisionNode(4, operator=Builder.EQ, threshold=1, left=-0.5, right=node1_3)\n", "tree1 = Builder.DecisionTree(4, node1_4)" ] }, { "cell_type": "markdown", "id": "293a9287", "metadata": {}, "source": [ "{: .attention }\n", "> We consider that the features $A_3$ and $A_4$ are numerical. Native categorical and Boolean features will be implemented in future versions of PyXAI. " ] }, { "cell_type": "markdown", "id": "e4dfc989", "metadata": {}, "source": [ "Next, we build the tree $T_2$:" ] }, { "cell_type": "code", "execution_count": 3, "id": "b3c2aff2", "metadata": {}, "outputs": [], "source": [ "node2_1 = Builder.DecisionNode(4, operator=Builder.EQ, threshold=1, left=-0.4, right=0.3)\n", "node2_2 = Builder.DecisionNode(1, operator=Builder.GT, threshold=2, left=-0.2, right=node2_1)\n", "node2_3 = Builder.DecisionNode(2, operator=Builder.GT, threshold=1, left=node2_2, right=0.5)\n", "tree2 = Builder.DecisionTree(4, node2_3)" ] }, { "cell_type": "markdown", "id": "d53b544d", "metadata": {}, "source": [ "And the tree $T_3$:" ] }, { "cell_type": "code", "execution_count": 4, "id": "354aa274", "metadata": {}, "outputs": [], "source": [ "node3_1 = Builder.DecisionNode(1, operator=Builder.GT, threshold=2, left=0.2, right=0.3)\n", "\n", "node3_2_1 = Builder.DecisionNode(1, operator=Builder.GT, threshold=2, left=-0.2, right=0.2)\n", "\n", "node3_2_2 = Builder.DecisionNode(4, operator=Builder.EQ, threshold=1, left=-0.1, right=node3_1)\n", "node3_2_3 = Builder.DecisionNode(4, operator=Builder.EQ, threshold=1, left=-0.5, right=0.1)\n", "\n", "node3_3_1 = 
Builder.DecisionNode(2, operator=Builder.GT, threshold=1, left=node3_2_1, right=node3_2_2)\n", "node3_3_2 = Builder.DecisionNode(2, operator=Builder.GT, threshold=1, left=-0.4, right=node3_2_3)\n", "\n", "node3_4 = Builder.DecisionNode(3, operator=Builder.EQ, threshold=1, left=node3_3_1, right=node3_3_2)\n", "\n", "tree3 = Builder.DecisionTree(4, node3_4)" ] }, { "cell_type": "markdown", "id": "9f84e1e9", "metadata": {}, "source": [ "We can now define the Boosted Tree: " ] }, { "cell_type": "code", "execution_count": 5, "id": "7270ddec", "metadata": {}, "outputs": [], "source": [ "BTs = Builder.BoostedTrees([tree1, tree2, tree3], n_classes=2)" ] }, { "cell_type": "markdown", "id": "227b595d", "metadata": {}, "source": [ "More details about the ```DecisionNode``` and ```BoostedTrees``` classes are given in the [Building Models](/documentation/learning/builder/) page. " ] }, { "cell_type": "markdown", "id": "e6af9c2e", "metadata": {}, "source": [ "### Explaining the Model" ] }, { "cell_type": "markdown", "id": "e101d763", "metadata": {}, "source": [ "Let us compute explanations. We take the same instance as in the paper, namely, ($A_1 = 4$, $A_2 = 3$, $A_3 = 1$, $A_4 = 1$):" ] }, { "cell_type": "code", "execution_count": 6, "id": "601be43f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "target_prediction: 1\n", "binary_representation: (1, 2, 3, 4)\n", "to_features: ('f1 > 2', 'f2 > 1', 'f3 == 1', 'f4 == 1')\n" ] } ], "source": [ "instance = (4,3,1,1)\n", "\n", "explainer = Explainer.initialize(BTs, instance)\n", "\n", "print(\"target_prediction:\",explainer.target_prediction)\n", "print(\"binary_representation:\", explainer.binary_representation)\n", "print(\"to_features:\", explainer.to_features(explainer.binary_representation))" ] }, { "cell_type": "markdown", "id": "fd4e69ef", "metadata": {}, "source": [ "{: .warning }\n", "> We can see that the values of binary variables are not the same as feature identifiers. 
Indeed, by default, binary variables are not numbered after the features: their identifiers depend on the order in which the trees are traversed. Here, the binary variable $1$ (resp. $2$, $3$ and $4$) of this binary representation represents the condition $A_4 = 1$ (resp. $A_2 > 1$, $A_3 = 1$ and $A_1 > 2$).\n" ] }, { "cell_type": "markdown", "id": "5a32dfd0", "metadata": {}, "source": [ "We compute the [direct reason](/documentation/classification/BTexplanations/direct/):" ] }, { "cell_type": "code", "execution_count": 7, "id": "92565362", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "direct reason: (1, 2, 3, 4)\n", "to_features: ('f1 > 2', 'f2 > 1', 'f3 == 1', 'f4 == 1')\n" ] } ], "source": [ "direct = explainer.direct_reason()\n", "print(\"direct reason:\", direct)\n", "direct_features = explainer.to_features(direct)\n", "print(\"to_features:\", direct_features)\n", "assert direct_features == ('f1 > 2', 'f2 > 1', 'f3 == 1', 'f4 == 1'), \"The direct reason is not correct.\" " ] }, { "cell_type": "markdown", "id": "e5f3ac25", "metadata": {}, "source": [ "Now we compute a [tree-specific](/documentation/classification/BTexplanations/treespecific/) reason:" ] }, { "cell_type": "code", "execution_count": 8, "id": "cf689155", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tree specific reason: (1, 2)\n", "to_features: ('f2 > 1', 'f4 == 1')\n", "is_tree_specific: True\n", "is_sufficient_reason: True\n" ] } ], "source": [ "tree_specific = explainer.tree_specific_reason()\n", "print(\"tree specific reason:\", tree_specific)\n", "tree_specific_feature = explainer.to_features(tree_specific)\n", "print(\"to_features:\", tree_specific_feature)\n", "print(\"is_tree_specific:\", explainer.is_tree_specific_reason(tree_specific))\n", "print(\"is_sufficient_reason:\", explainer.is_sufficient_reason(tree_specific))" ] }, { "cell_type": "markdown", "id": "24054329", "metadata": {}, "source": [ "And now we check that the reason ($A_1 = 4$, $A_4 
= 1$) is a sufficient reason but not a tree-specific explanation:" ] }, { "cell_type": "code", "execution_count": 9, "id": "66331663", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "features: ('f1 > 2', 'f4 == 1')\n", "is_sufficient_reason: True\n", "is_tree_specific: False\n" ] } ], "source": [ "reason = (1, 4)\n", "features = explainer.to_features(reason)\n", "print(\"features:\", features)\n", "print(\"is_sufficient_reason:\", explainer.is_sufficient_reason(reason))\n", "print(\"is_tree_specific:\", explainer.is_tree_specific_reason(reason))" ] }, { "cell_type": "markdown", "id": "661c8857", "metadata": {}, "source": [ "{: .attention }\n", "> The given reason consists of binary variables. Here, the binary variable $1$ (resp. $4$) represents the condition $A_4 = 1$ (resp. $A_1 > 2$, which holds for this instance since $A_1 = 4$)." ] }, { "cell_type": "markdown", "id": "3d71bd13", "metadata": {}, "source": [ "Details on explanations are given in the [Explanations Computation](/documentation/explanations/BTexplanations/) page. " ] }, { "cell_type": "markdown", "id": "5922bb76", "metadata": {}, "source": [ "## Regression" ] }, { "cell_type": "markdown", "id": "bc5cf46a", "metadata": {}, "source": [ "We take an example from the [Computing Abductive Explanations for Boosted Regression Trees](https://www.ijcai.org/proceedings/2023/382) paper.\n", "\n", "Let us consider a loan application scenario. The goal is to predict\n", "the amount of money that can be granted to an applicant described using three attributes $A = \\{A_1, A_2, A_3\\}$. \n", "- $A_1$ is a numerical attribute giving the applicant's monthly income\n", "- $A_2$ is a categorical feature giving the applicant's employment status: \"employed\", \"unemployed\" or \"self-employed\"\n", "- $A_3$ is a Boolean feature set to true if the applicant is married, false otherwise. 
\n", "\n", "\"BTbase\"\n", "\n", "In this example:\n", "\n", "- $A_1$ is represented by the feature identifier $F_1$\n", "- $A_2$ has been one-hot encoded and is represented by the feature identifiers $F_2$, $F_3$ and $F_4$, which respectively represent the conditions $A_2 = \\text{employed}$, $A_2 = \\text{unemployed}$ and $A_2 = \\text{self-employed}$\n", "- $A_3$ is represented by the feature identifier $F_5$ and the condition $(A_3 = 1)$ (\"the applicant is married\")" ] }, { "cell_type": "markdown", "id": "256a1d09", "metadata": {}, "source": [ "### Building the Model" ] }, { "cell_type": "markdown", "id": "0dbbd71b", "metadata": {}, "source": [ "The process is the same as for classification, except for the last instruction, which constructs the model. For regression, the instruction\n", "\n", "```BTs = Builder.BoostedTreesRegression([tree1, tree2, tree3])```\n", "\n", "has to be used to create the model. Here is the complete procedure: " ] }, { "cell_type": "code", "execution_count": 10, "id": "8d310b8e", "metadata": {}, "outputs": [], "source": [ "from pyxai import Builder, Explainer, Learning\n", "\n", "node1_1 = Builder.DecisionNode(1, operator=Builder.GT, threshold=3000, left=1500, right=1750)\n", "node1_2 = Builder.DecisionNode(1, operator=Builder.GT, threshold=2000, left=1000, right=node1_1)\n", "node1_3 = Builder.DecisionNode(1, operator=Builder.GT, threshold=1000, left=0, right=node1_2)\n", "tree1 = Builder.DecisionTree(5, node1_3)\n", "\n", "\n", "node2_1 = Builder.DecisionNode(5, operator=Builder.EQ, threshold=1, left=100, right=250)\n", "node2_2 = Builder.DecisionNode(4, operator=Builder.EQ, threshold=1, left=-100, right=node2_1)\n", "node2_3 = Builder.DecisionNode(2, operator=Builder.EQ, threshold=1, left=node2_2, right=250)\n", "tree2 = Builder.DecisionTree(5, node2_3)\n", "\n", "node3_1 = Builder.DecisionNode(3, operator=Builder.EQ, threshold=1, left=500, right=250)\n", "node3_2 = Builder.DecisionNode(3, operator=Builder.EQ, threshold=1, left=250, right=100)\n", "node3_3 = 
Builder.DecisionNode(1, operator=Builder.GT, threshold=2000, left=0, right=node3_1)\n", "node3_4 = Builder.DecisionNode(4, operator=Builder.EQ, threshold=1, left=node3_3, right=node3_2)\n", "tree3 = Builder.DecisionTree(5, node3_4)\n", "\n", "\n", "BTs = Builder.BoostedTreesRegression([tree1, tree2, tree3])" ] }, { "cell_type": "markdown", "id": "aa8c1322", "metadata": {}, "source": [ "\n", "More details about the ```DecisionNode``` and ```BoostedTreesRegression``` classes are given in the [Building Models](/documentation/learning/builder/) page. " ] }, { "cell_type": "markdown", "id": "7ae12ecb", "metadata": {}, "source": [ "### Explaining the Model" ] }, { "cell_type": "markdown", "id": "a24d9a65", "metadata": {}, "source": [ "We can then compute explanations. We take the same instance as in the paper, namely, \n", "$(2200, 0, 0, 1, 1)$ for ($F_1 = 2200$, $F_2 = 0$, $F_3 = 0$, $F_4 = 1$, $F_5 = 1$).\n", "\n", "In addition, we have chosen to take a [theory](/documentation/explainer/theories/) into account. The line ```\"categorical\": {\"f{2,3,4}\": (1, 2, 3)}``` means that the features $F_2$, $F_3$ and $F_4$ are in fact a single feature named $F\\{2,3,4\\}$ (representing $A_2$) that can take the value $1$, $2$ or $3$ (for, respectively, $A_2 = \\text{employed}$, $A_2 = \\text{unemployed}$ and $A_2 = \\text{self-employed}$). 
" ] }, { "cell_type": "code", "execution_count": 11, "id": "7dd41cc8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "instance: (2200, 0, 0, 1, 1)\n", "--------- Theory Feature Types -----------\n", "Before the encoding (without one hot encoded features), we have:\n", "Numerical features: 1\n", "Categorical features: 1\n", "Binary features: 1\n", "Number of features: 3\n", "Values of categorical features: {'f2': ['f{2,3,4}', 1, (1, 2, 3)], 'f3': ['f{2,3,4}', 2, (1, 2, 3)], 'f4': ['f{2,3,4}', 3, (1, 2, 3)]}\n", "\n", "Number of used features in the model (before the encoding): 3\n", "Number of used features in the model (after the encoding): 5\n", "----------------------------------------------\n", "prediction: 2000\n", "condition direct: (1, 2, -3, -4, 5, 6, -7)\n", "direct: ('f1 in ]2000, 3000]', 'f{2,3,4} = 3', 'f5 == 1')\n" ] } ], "source": [ "instance = (2200, 0, 0, 1, 1) # 2200$, self employed (one hot encoded), married\n", "print(\"instance:\", instance)\n", "\n", "loan_types = {\n", " \"numerical\": Learning.DEFAULT,\n", " \"categorical\": {\"f{2,3,4}\": (1, 2, 3)},\n", " \"binary\": [\"f5\"],\n", "}\n", "\n", "explainer = Explainer.initialize(BTs, instance, features_type=loan_types)\n", "\n", "print(\"prediction:\", explainer.predict(instance))\n", "print(\"condition direct:\", explainer.direct_reason())\n", "print(\"direct:\", explainer.to_features(explainer.direct_reason()))" ] }, { "cell_type": "markdown", "id": "0f408ddb", "metadata": {}, "source": [ "Here, with this instance ($2200$, ”self-employed”, 1), the regression value is $F(x) = 1500 + 250 + 250 = 2000$. The direct reason can also be represented by $\\{A_1{>}2000, \\overline{A_1{>}3000}, A_2^3, A_3\\}$." 
] }, { "cell_type": "code", "execution_count": 12, "id": "e0f551cb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tree specific: ('f1 > 2000',)\n", "is tree : True\n" ] } ], "source": [ "explainer.set_interval(1500, 2500)\n", "\n", "tree_specific = explainer.tree_specific_reason()\n", "print(\"tree specific:\", explainer.to_features(tree_specific))\n", "print(\"is tree : \", explainer.is_tree_specific_reason(tree_specific))" ] }, { "cell_type": "markdown", "id": "382bbe4e", "metadata": {}, "source": [ "We can display this reason thanks to the PyXAI GUI with:\n", "\n", "```\n", "explainer.show()\n", "```\n", "\n", "\"BTdirect\"" ] }, { "cell_type": "markdown", "id": "89212151", "metadata": {}, "source": [ "Details on explanations with regression models are given in the [Explaining Regression](/documentation/regression/) page. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }