
Theories

Theories are representations of pieces of knowledge about the dataset. They may be furnished either by experts or derived directly from the nature of the data. The latter case is handled by PyXAI via the encoding of domain theories during the computation of explanations. Domain theories are used to refrain from inferring impossible explanations. The way of dealing with them differs according to the kind of explanations we look for: contrastive or abductive. More details about theories can be found in our IJCAI’23 paper.

In PyXAI, domain theories can be activated in two ways through the explainer initialization method.

Either by specifying the types of the features in the features_type parameter through a Python dictionary with the following keys: "numerical", "categorical" and "binary". To avoid having to enter all the features, you can choose a default type using the Learning.DEFAULT constant. For each type that is not equal to this constant, you need to set a list of feature names as value. However, the "categorical" key requires a Python dictionary whose keys are feature names possibly containing the wildcard characters *, {, } or ,. Such a key indicates that a set of feature names beginning with the same characters actually represents a single categorical feature that has been one-hot encoded. For example, "A4*" represents the categorical feature encoded through the features "A4_1", "A4_2" and "A4_3". The value associated with each key gives the possible values of the corresponding categorical feature ((1, 2, 3) in this case).

australian_types = {
    "numerical": Learning.DEFAULT,
    "categorical": {"A4*": (1, 2, 3), 
                    "A5*": tuple(range(1, 15)),
                    "A6*": (1, 2, 3, 4, 5, 7, 8, 9), 
                    "A12*": tuple(range(1, 4))},
    "binary": ["A1", "A8", "A9", "A11"],
}

explainer = Explainer.initialize(model, instance=instance, features_type=australian_types)

Or by specifying in this parameter the path and name of a file containing the types of the features. Such a file can be generated using the preprocessor of PyXAI (see the Preprocessing Data page).

explainer = Explainer.initialize(model, instance=instance, features_type="../australian.types")

There is another way to specify categorical features. For example, if we have in our dataset three binary features named "Red", "Green" and "Blue" that come from a one-hot encoded feature named "Color", we can declare the following types:

types = {
    "categorical": {"{Red,Green,Blue}": ("Red", "Green", "Blue")}
}
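This dictionary can then be passed to the explainer exactly as before (a sketch assuming model and instance come from one of the previous examples):

explainer = Explainer.initialize(model, instance=instance, features_type=types)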
Explainer.initialize(model, instance=None, features_type=None):
Depending on the model given as first argument, this method creates an ExplainerDT, an ExplainerRF or an ExplainerBT. This object is able to give explanations about the instance given as second parameter. This last parameter is optional because you can set the instance later using the set_instance function.
model DecisionTree | RandomForest | BoostedTree: The model for which explanations will be calculated.
instance Numpy Array of Float: The instance to be explained. Default value is None.
features_type String | Dict | None: Either a dictionary indicating the types of the features or the path to a .types file containing this information. Activates the domain theory. Default value is None.
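Since the instance is optional, the theory can be activated at initialization time and the instance provided afterwards. A minimal sketch of this pattern, reusing the australian_types dictionary defined above (and assuming model and instance come from the previous examples):

explainer = Explainer.initialize(model, features_type=australian_types)
explainer.set_instance(instance)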

For contrastive reasons

To understand the principles of domain theories, we first build a small example with the builder of PyXAI (Building Models). This example is based on one numerical feature ($f_1$: the annual income of the applicant) and one categorical (and binary) feature ($f_2$: whether or not the applicant has already reimbursed a previous loan). The model is used to determine whether or not a loan should be granted to an applicant.

from pyxai import Builder, Explainer

# node1: f2 == 1 ? (has the applicant already reimbursed a previous loan?)
node1 = Builder.DecisionNode(2, operator=Builder.EQ, threshold=1, left=0, right=1)
# node2: f1 >= 20 ? (annual income of at least 20k)
node2 = Builder.DecisionNode(1, operator=Builder.GE, threshold=20, left=0, right=node1)
# node3: f1 >= 30 ? (annual income of at least 30k)
node3 = Builder.DecisionNode(1, operator=Builder.GE, threshold=30, left=node2, right=1)

tree1 = Builder.DecisionTree(2, node3)
tree2 = Builder.DecisionTree(2, Builder.LeafNode(1))  # always votes for class 1

forest = Builder.RandomForest([tree1, tree2], n_classes=2)

Let’s suppose Alice wants to get a loan. We know that Alice’s annual income is equal to $18k$ and that Alice has not yet reimbursed a previous loan. Thus, Alice corresponds to the instance $Alice = (18, 0)$. This instance is represented by the explainer with binary variables standing for the conditions of the nodes: (-1, -2, -3). This is equivalent to $\{\overline{(f_1 \geq 20)}, \overline{(f_1 \geq 30)}, \overline{(f_2 = 1)}\}$ (or equivalently to $\{(f_1 \lt 20), (f_1 \lt 30), (f_2 \neq 1)\}$).

alice = (18, 0)
explainer = Explainer.initialize(forest, instance=alice)
print("binary representation: ", explainer.binary_representation)
print("binary representation features:", explainer.to_features(explainer.binary_representation, eliminate_redundant_features=False))
print("target_prediction:", explainer.target_prediction)

binary representation:  (-1, -2, -3)
binary representation features: ('f1 < 30', 'f1 < 20', 'f2 != 1')
target_prediction: 0
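The same mapping can be inspected literal by literal. For instance, passing a sub-tuple of the binary representation to to_features (assuming, as the calls above suggest, that any subset of literals is accepted):

print(explainer.to_features((-2,), eliminate_redundant_features=False))
# expected output: ('f1 < 20',)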

Alice does not get the loan (target_prediction: 0) and would like to know what to change to get it: we need a contrastive explanation.

contrastives = explainer.minimal_contrastive_reason(n=Explainer.ALL)
print("contrastives:", contrastives)

print("contrastives (to_features):", explainer.to_features(contrastives[0], contrastive=True))
contrastives: ((-1,),)
contrastives (to_features): ('f1 < 30',)

Without the theory, the binary variable (-1) (representing the condition $\overline{(f_1 \geq 30)}$) is a (subset-minimal) contrastive explanation for Alice’s instance. However, no instance matches this representation: the instance derived from this contrastive explanation, $\{(f_1 \geq 30), \overline{(f_1 \geq 20)}, \overline{(f_2 = 1)}\}$, conflicts with an indisputable theory: $\overline{(f_1 \geq 20)} \Rightarrow \overline{(f_1 \geq 30)}$. To refrain from deriving such incorrect explanations, some propositional constraints forming a domain theory, which indicate how the Boolean conditions are logically connected, must be taken into account. To accomplish this, you just need to specify which features are numerical, categorical and binary in the features_type parameter of the Explainer.initialize() method.
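Before doing so, here is a minimal sketch (plain Python, not PyXAI’s internal encoding) of why the theory rules out the contrastive (-1,). With the numbering above, variable 1 stands for (f1 >= 30) and variable 2 for (f1 >= 20), so the theory contains the clause (f1 >= 30) => (f1 >= 20), written (-1 v 2):

theory = [(-1, 2)]  # not (f1 >= 30) or (f1 >= 20)

def satisfies(assignment, clauses):
    # True if the set of literals satisfies every clause
    return all(any(lit in assignment for lit in clause) for clause in clauses)

alice = {-1, -2, -3}                  # f1 < 30, f1 < 20, f2 != 1
bogus = (alice - {-1}) | {1}          # flip only variable 1
print(satisfies(bogus, theory))       # False: no real instance matches
print(satisfies({1, 2, -3}, theory))  # True: flipping variables 1 and 2 together is consistent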

explainer = Explainer.initialize(forest, instance=alice, features_type={"numerical": ["f1"], "binary": ["f2"]})

contrastives = explainer.minimal_contrastive_reason(n=Explainer.ALL)
print("contrastives:", contrastives)
print("contrastives (to_features):", explainer.to_features(contrastives[0], contrastive=True))
print("contrastives (to_features):", explainer.to_features(contrastives[1], contrastive=True))
---------   Theory Feature Types   -----------
Before the encoding (without one hot encoded features), we have:
Numerical features: 1
Categorical features: 0
Binary features: 1
Number of features: 2
Values of categorical features: {}

Number of used features in the model (before the encoding): 2
Number of used features in the model (after the encoding): 2
----------------------------------------------
contrastives: ((-1, -2), (-2, -3))
contrastives (to_features): ('f1 < 30',)
contrastives (to_features): ('f1 < 20', 'f2 != 1')

By taking the theory into account, we now get two different contrastive explanations: \(c_1 = \{\overline{(f_1 \geq 20)}, \overline{(f_1 \geq 30)}\} \mbox{ and } c_2 = \{\overline{(f_1 \geq 20)}, \overline{(f_2 = 1)}\}.\)

Eliminating redundant features gives us: \(c_1 = \{\overline{(f_1 \geq 30)}\} \mbox{ and } c_2 = \{\overline{(f_1 \geq 20)}, \overline{(f_2 = 1)}\}.\)
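Note that to_features eliminates redundant conditions by default, which is why \(c_1\) is displayed as a single condition above. The full reason can be recovered by disabling this elimination (assuming the flag can be combined with contrastive=True, as for the earlier call):

print(explainer.to_features(contrastives[0], contrastive=True, eliminate_redundant_features=False))
# expected output: ('f1 < 30', 'f1 < 20')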

And now, the instances we derive from the contrastive explanations no longer conflict with the domain theory: \(\{(f_1 \geq 30), (f_1 \geq 20), \overline{(f_2 = 1)}\} \mbox{ and } \{\overline{(f_1 \geq 30)}, (f_1 \geq 20), (f_2 = 1)\}.\)
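As a sanity check, we can re-implement the toy forest in plain Python (this is not PyXAI code, and we assume that a 1-1 vote tie is broken in favor of the smaller class index, which matches Alice’s prediction of 0) and verify that instances satisfying the corrected contrastive explanations flip the prediction:

def tree1(f1, f2):
    if f1 >= 30:                        # node3
        return 1
    if f1 >= 20:                        # node2
        return 1 if f2 == 1 else 0      # node1
    return 0

def tree2(f1, f2):
    return 1                            # constant leaf

def forest(f1, f2):
    votes = [0, 0]
    for tree in (tree1, tree2):
        votes[tree(f1, f2)] += 1
    return votes.index(max(votes))      # a 1-1 tie goes to class 0

print(forest(18, 0))  # Alice: 0 (loan refused)
print(forest(35, 0))  # c1 applied (f1 >= 30): 1
print(forest(25, 1))  # c2 applied (20 <= f1 < 30 and f2 = 1): 1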

For abductive reasons

The Australian Credit Approval dataset concerns credit card applications, and this link gives the types of its features. Thanks to the preprocessor of PyXAI, we generate an australian.csv dataset and an australian.types file:

from pyxai import Learning, Explainer, Tools

preprocessor = Learning.Preprocessor("../../dataset/australian.csv", target_feature="A15", learner_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS)

preprocessor.set_categorical_features(columns=["A1", "A4", "A5", "A6", "A8", "A9", "A11", "A12"])
preprocessor.set_numerical_features({
  "A2": None,
  "A3": None,
  "A7": None,
  "A10": None,
  "A13": None,
  "A14": None,
  })

preprocessor.process()
preprocessor.export("australian", output_directory="../../dataset")

Index(['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11',
       'A12', 'A13', 'A14', 'A15'],
      dtype='object')
---------------    Converter    ---------------
-> The feature A1 is boolean! No One Hot Encoding for this features.
One hot encoding new features for A4: 3
One hot encoding new features for A5: 14
One hot encoding new features for A6: 8
-> The feature A8 is boolean! No One Hot Encoding for this features.
-> The feature A9 is boolean! No One Hot Encoding for this features.
-> The feature A11 is boolean! No One Hot Encoding for this features.
One hot encoding new features for A12: 3
Numbers of classes: 2
Number of boolean features: 4
Dataset saved: ../../dataset/australian_0.csv
Types saved: ../../dataset/australian_0.types
-----------------------------------------------

We create a random forest and compute a majoritary reason with the domain theory activated:

# Machine learning part
learner = Learning.Scikitlearn("../../dataset/australian_0.csv", learner_type=Learning.CLASSIFICATION)
model = learner.evaluate(method=Learning.HOLD_OUT, output=Learning.RF)
instance, prediction = learner.get_instances(model, n=1, seed=11200, correct=False)

# Explainer part
explainer = Explainer.initialize(model, instance=instance, features_type="../../dataset/australian_0.types")
majoritary_reason = explainer.majoritary_reason(n_iterations=10)
print("\nlen tree_specific: ", len(majoritary_reason))
print("\ntree_specific: ", explainer.to_features(majoritary_reason))
print("is majoritary:", explainer.is_majoritary_reason(majoritary_reason))
data:
     A1   A2   A3  A4_1  A4_2  A4_3  A5_1  A5_2  A5_3  A5_4  ...  A8  A9  A10   
0     1   65  168     0     1     0     0     0     0     1  ...   0   0    1  \
1     0   72  123     0     1     0     0     0     0     0  ...   0   0    1   
2     0  142   52     1     0     0     0     0     0     1  ...   0   0    1   
3     0   60  169     1     0     0     0     0     0     0  ...   1   1   12   
4     1   44  134     0     1     0     0     0     0     0  ...   1   1   15   
..   ..  ...  ...   ...   ...   ...   ...   ...   ...   ...  ...  ..  ..  ...   
685   1  163  160     0     1     0     0     0     0     0  ...   1   0    1   
686   1   49   14     0     1     0     0     0     0     0  ...   0   0    1   
687   0   32  145     0     1     0     0     0     0     0  ...   1   0    1   
688   0  122  193     0     1     0     0     0     0     0  ...   1   1    2   
689   1  245    2     0     1     0     0     0     0     0  ...   0   1    2   

     A11  A12_1  A12_2  A12_3  A13  A14  A15  
0      1      0      1      0   32  161    0  
1      0      0      1      0   53    1    0  
2      1      0      1      0   98    1    0  
3      1      0      1      0    1    1    1  
4      0      0      1      0   18   68    1  
..   ...    ...    ...    ...  ...  ...  ...  
685    0      0      1      0    1    1    1  
686    0      0      1      0    1   35    0  
687    0      0      1      0   32    1    1  
688    0      0      1      0   38   12    1  
689    0      1      0      0  159    1    1  

[690 rows x 39 columns]
--------------   Information   ---------------
Dataset name: ../../dataset/australian_0.csv
nFeatures (nAttributes, with the labels): 39
nInstances (nObservations): 690
nLabels: 2
---------------   Evaluation   ---------------
method: HoldOut
output: RF
learner_type: Classification
learner_options: {'max_depth': None, 'random_state': 0}
---------   Evaluation Information   ---------
For the evaluation number 0:
metrics:
   accuracy: 85.5072463768116
nTraining instances: 483
nTest instances: 207

---------------   Explainer   ----------------
For the evaluation number 0:
**Random Forest Model**
nClasses: 2
nTrees: 100
nVariables: 1361

---------------   Instances   ----------------
number of instances selected: 1
----------------------------------------------
---------   Theory Feature Types   -----------
Before the encoding (without one hot encoded features), we have:
Numerical features: 6
Categorical features: 4
Binary features: 4
Number of features: 14
Values of categorical features: {'A4_1': ['A4', 1, [1, 2, 3]], 'A4_2': ['A4', 2, [1, 2, 3]], 'A4_3': ['A4', 3, [1, 2, 3]], 'A5_1': ['A5', 1, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_2': ['A5', 2, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_3': ['A5', 3, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_4': ['A5', 4, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_5': ['A5', 5, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_6': ['A5', 6, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_7': ['A5', 7, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_8': ['A5', 8, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_9': ['A5', 9, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_10': ['A5', 10, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_11': ['A5', 11, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_12': ['A5', 12, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_13': ['A5', 13, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A5_14': ['A5', 14, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]], 'A6_1': ['A6', 1, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_2': ['A6', 2, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_3': ['A6', 3, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_4': ['A6', 4, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_5': ['A6', 5, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_7': ['A6', 7, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_8': ['A6', 8, [1, 2, 3, 4, 5, 7, 8, 9]], 'A6_9': ['A6', 9, [1, 2, 3, 4, 5, 7, 8, 9]], 'A12_1': ['A12', 1, [1, 2, 3]], 'A12_2': ['A12', 2, [1, 2, 3]], 'A12_3': ['A12', 3, [1, 2, 3]]}

Number of used features in the model (before the encoding): 14
Number of used features in the model (after the encoding): 38
----------------------------------------------

len tree_specific:  12

tree_specific:  ('A2 > 194.5', 'A3 in ]43.0, 53.0]', 'A5 = 3', 'A6 = 5', 'A7 in ]66.5, 93.0]', 'A8 = 0', 'A10 <= 2.5', 'A13 in ]63.5, 79.0]', 'A14 <= 5.5')
is majoritary: True

Thanks to the support of domain theories for categorical features, each one-hot encoded categorical feature gives rise to at most one condition in the derived explanation (e.g. 'A5 = 3' instead of a set of conditions on the binary features A5_1, …, A5_14).
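As an illustrative check (a hypothetical snippet that simply parses the printed conditions), grouping the conditions of the explanation above by their original feature shows that each feature, categorical ones included, contributes exactly one condition:

from collections import Counter

conditions = ('A2 > 194.5', 'A3 in ]43.0, 53.0]', 'A5 = 3', 'A6 = 5',
              'A7 in ]66.5, 93.0]', 'A8 = 0', 'A10 <= 2.5',
              'A13 in ]63.5, 79.0]', 'A14 <= 5.5')
print(Counter(c.split()[0] for c in conditions))
# each original feature appears exactly once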