{ "cells": [ { "cell_type": "markdown", "id": "b1db8c9a", "metadata": {}, "source": [ "# Direct Reason" ] }, { "cell_type": "markdown", "id": "514a5144", "metadata": {}, "source": [ "Let $BT$ be a boosted tree composed of regression trees $\\{T_1, \\ldots, T_n\\}$ and let $x$ be an instance. The **direct reason** for $x$ is the subset of $t_{\\vec x}$ (the binary form of the instance) obtained by taking, for each $T_i$, the term associated with the unique root-to-leaf path of $T_i$ that is compatible with $x$, and forming the conjunction of these terms. Thanks to its simplicity, it is one of the easiest abductive explanations to compute, but it can be highly redundant. More information about the direct reason can be found in the article [Computing Abductive Explanations for Boosted Regression Trees](https://www.ijcai.org/proceedings/2023/382)." ] }, { "cell_type": "markdown", "id": "d88fcc64", "metadata": {}, "source": [ "| <Explainer Object>.direct_reason(): | \n", "| :----------- | \n", "| Returns the direct reason for the current instance. Returns ```None``` if this reason contains some excluded features. All kinds of operators in the conditions are supported. This reason is expressed in terms of binary variables; use the ```to_features``` method to obtain a representation based on the original features. |" ] }, { "cell_type": "markdown", "id": "e4432d14", "metadata": {}, "source": [ "The basic methods (```initialize```, ```set_instance```, ```to_features```, ```is_reason```, ...) of the ```explainer``` module used in the next examples are described in the [Explainer Principles](/documentation/explainer/) page. " ] }, { "cell_type": "markdown", "id": "d70f9a73", "metadata": {}, "source": [ "## Example from Hand-Crafted Trees" ] }, { "cell_type": "markdown", "id": "3df1afd3", "metadata": {}, "source": [ "Let us consider a loan application scenario that will be used as a running example. 
The goal is to predict\n", "the amount of money that can be granted to an applicant described using three attributes ($A = \\{A_1, A_2, A_3\\}$). \n", "- $A_1$ is a numerical attribute giving the monthly income of the applicant\n", "- $A_2$ is a categorical feature giving the applicant's employment status as “employed”, “unemployed” or “self-employed”\n", "- $A_3$ is a Boolean feature set to true if the applicant is married, false otherwise. \n", "\n", "*Figure (BTdirect): the boosted tree used in this example.*\n", "\n", "In this example:\n", "\n", "- $A_1$ is represented by the feature identifier $F_1$\n", "- $A_2$ has been one-hot encoded and is represented by the feature identifiers $F_2$, $F_3$ and $F_4$, which respectively represent the conditions $A_2^{1} = employed$, $A_2^{2} = unemployed$ and $A_2^{3} = self-employed$\n", "- $A_3$ is represented by the feature identifier $F_5$ and the condition $(A_3 = 1)$ (“the applicant is married”)\n", "\n", "We consider the instance $x=(2200, 0, 0, 1, 1)$, corresponding to a married, self-employed applicant (employment status one-hot encoded) with an income of 2200 per month. 
Then, $F(x) = 1500 + 250 + 250 = 2000$.\n", "\n", "The direct reason for the instance $x = (2200, 0, 0, 1, 1)$ is shown in red in the figure and can be represented by $\\{A_1{>}2000, \\overline{A_1{>}3000}, A_2^3, A_3\\}$.\n", "\n", "We first build this boosted tree with PyXAI: " ] }, { "cell_type": "code", "execution_count": 1, "id": "411398a5", "metadata": {}, "outputs": [], "source": [ "from pyxai import Builder, Explainer\n", "\n", "# Tree 1: successive thresholds on feature 1 (income)\n", "node1_1 = Builder.DecisionNode(1, operator=Builder.GT, threshold=3000, left=1500, right=1750)\n", "node1_2 = Builder.DecisionNode(1, operator=Builder.GT, threshold=2000, left=1000, right=node1_1)\n", "node1_3 = Builder.DecisionNode(1, operator=Builder.GT, threshold=1000, left=0, right=node1_2)\n", "tree1 = Builder.DecisionTree(5, node1_3)\n", "\n", "\n", "# Tree 2: tests on features 2, 4 and 5\n", "node2_1 = Builder.DecisionNode(5, operator=Builder.EQ, threshold=1, left=100, right=250)\n", "node2_2 = Builder.DecisionNode(4, operator=Builder.EQ, threshold=1, left=-100, right=node2_1)\n", "node2_3 = Builder.DecisionNode(2, operator=Builder.EQ, threshold=1, left=node2_2, right=250)\n", "tree2 = Builder.DecisionTree(5, node2_3)\n", "\n", "# Tree 3: tests on features 4, 1 and 3\n", "node3_1 = Builder.DecisionNode(3, operator=Builder.EQ, threshold=1, left=500, right=250)\n", "node3_2 = Builder.DecisionNode(3, operator=Builder.EQ, threshold=1, left=250, right=100)\n", "node3_3 = Builder.DecisionNode(1, operator=Builder.GE, threshold=2000, left=0, right=node3_1)\n", "node3_4 = Builder.DecisionNode(4, operator=Builder.EQ, threshold=1, left=node3_3, right=node3_2)\n", "tree3 = Builder.DecisionTree(5, node3_4)\n", "\n", "\n", "BT = Builder.BoostedTreesRegression([tree1, tree2, tree3])\n" ] }, { "cell_type": "markdown", "id": "a2263e89", "metadata": {}, "source": [ "We now compute the direct reason for this instance: " ] }, { "cell_type": "code", "execution_count": 2, "id": "935837ea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "instance: (2200, 0, 0, 1, 1)\n", "binary_representation: (1, 2, -3, -4, 5, 6, 7, -8)\n", 
"target_prediction: 2000\n", "direct: (1, 2, -3, -4, 5, 6, -8)\n", "to_features: ('f1 in ]2000, 3000]', 'f2 != 1', 'f3 != 1', 'f4 == 1', 'f5 == 1')\n" ] } ], "source": [ "instance = (2200, 0, 0, 1, 1)\n", "explainer = Explainer.initialize(BT)\n", "explainer.set_instance(instance)\n", "direct = explainer.direct_reason()\n", "print(\"instance:\", instance)\n", "print(\"binary_representation:\", explainer.binary_representation)\n", "print(\"target_prediction:\", explainer.target_prediction)\n", "print(\"direct:\", direct)\n", "print(\"to_features:\", explainer.to_features(direct))\n" ] }, { "cell_type": "markdown", "id": "fc75b1a7", "metadata": {}, "source": [ "As you can see, in this case the direct reason involves all five features of the instance (7 of the 8 binary variables)." ] }, { "cell_type": "markdown", "id": "4061b821", "metadata": {}, "source": [ "## Example from a Real Dataset" ] }, { "cell_type": "markdown", "id": "7c187df9", "metadata": {}, "source": [ "For this example, we use the [Houses-prices](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) dataset (available [here](/assets/notebooks/dataset/houses-prices.csv)). We create a model using the hold-out approach (by default, the test size is set to 30%) and select an instance.
from pyxai import Learning

preprocessor = Learning.Preprocessor("../../dataset/houses-prices.csv", target_feature="SalePrice", learner_type=Learning.REGRESSION)

preprocessor.unset_features(["Id"])

preprocessor.set_categorical_features(columns=[
    "MSSubClass",
    "Street",
    "LotShape", 
    "LandContour", 
    "LotConfig", 
    "LandSlope", 
    "Neighborhood", 
    "Condition1", 
    "Condition2", 
    "BldgType", 
    "HouseStyle", 
    "OverallQual", 
    "OverallCond", 
    "RoofStyle", 
    "RoofMatl", 
    "ExterQual", 
    "ExterCond", 
    "Foundation", 
    "Heating", 
    "HeatingQC", 
    "CentralAir", 
    "PavedDrive", 
    "SaleCondition"])

preprocessor.set_numerical_features({
    "LotArea": None,
    "YearBuilt": None, 
    "YearRemodAdd": None, 
    "1stFlrSF": None,
    "2ndFlrSF": None,
    "LowQualFinSF": None,
    "GrLivArea": None,
    "FullBath": None,
    "HalfBath": None,
    "BedroomAbvGr": None,
    "KitchenAbvGr": None,
    "TotRmsAbvGrd": None,
    "Fireplaces": None,
    "WoodDeckSF": None,
    "OpenPorchSF": None,
    "EnclosedPorch": None,
    "3SsnPorch": None,
    "ScreenPorch": None,
    "PoolArea": None,
    "MiscVal": None,
    "MoSold": None,
    "YrSold": None
    })


preprocessor.process()
dataset_name = "../../dataset/houses-prices.csv".split("/")[-1].split(".")[0]+"-converted" 
preprocessor.export(dataset_name, output_directory="../../dataset")
"[2919 rows x 180 columns]\n", "-------------- Information ---------------\n", "Dataset name: ../../dataset/houses-prices-converted_0.csv\n", "nFeatures (nAttributes, with the labels): 180\n", "nInstances (nObservations): 2919\n", "nLabels: None\n", "--------------- Evaluation ---------------\n", "method: HoldOut\n", "output: BT\n", "learner_type: Regression\n", "learner_options: {'seed': 0, 'max_depth': None}\n", "--------- Evaluation Information ---------\n", "For the evaluation number 0:\n", "metrics:\n", " mean_squared_error: 1997310553.8387074\n", " root_mean_squared_error: 44691.28051240765\n", " mean_absolute_error: 29588.51328599622\n", "nTraining instances: 2043\n", "nTest instances: 876\n", "\n", "--------------- Explainer ----------------\n", "For the evaluation number 0:\n", "**Boosted Tree model**\n", "NClasses: None\n", "nTrees: 100\n", "nVariables: 1696\n", "\n", "--------------- Instances ----------------\n", "number of instances selected: 1\n", "----------------------------------------------\n" ] } ], "source": [ "from pyxai import Learning, Explainer\n", "\n", "learner = Learning.Xgboost(\"../../dataset/houses-prices-converted_0.csv\", learner_type=Learning.REGRESSION)\n", "model = learner.evaluate(method=Learning.HOLD_OUT, output=Learning.BT)\n", "instance, prediction = learner.get_instances(model, n=1)" ] }, { "cell_type": "markdown", "id": "bcd926f6", "metadata": {}, "source": [ "Finally, we display the direct reason for this instance. Note that the theory produced by PyXAI's Preprocessor is taken into account by passing the parameter ```features_type=\"../../dataset/houses-prices-converted_0.types\"``` to the ```initialize``` method. 
"print(\"instance:\", instance)\n", "print(\"prediction:\", prediction)\n", "print()\n", "direct_reason = explainer.direct_reason()\n", "print(\"len binary representation:\", len(explainer.binary_representation))\n", "print(\"len direct:\", len(direct_reason))\n", "print(\"is_reason:\", explainer.is_reason(direct_reason))\n", "print(\"to_features:\", explainer.to_features(direct_reason))" ] }, { "cell_type": "markdown", "id": "663667d5", "metadata": {}, "source": [ "We can remark that the direct reason for this instance $x$ contains 413 binary variables of $t_{\\vec x}$ out of 1696. This reason explains why the model predicts the regression value for this instance. But it is probably not the most compact reason for this instance, we invite you to look at the other types of reasons presented on the [Boosted Tree Explanations]({{ site.baseurl }}/documentation/regression/BTregression/) page. More precisely, the [Tree-Specific]({{ site.baseurl }}/documentation/regression/BTregression/treespecific/) reasons are often more compact and therefore more interpretable reasons. 