Supplementary Material for Paper #9380
Contents
This archive contains several folders and files:
- bin contains binary/executable pieces of code required by our scripts
- dataset contains examples of datasets that can be used to test our scripts and to reproduce our results
- cnf_files contains files that are generated when the script "generate_data_RF.py" is launched.
The folder contains *.gcnf files used by the MUS extractor and *.wcnf files used by the Partial MaxSAT solver (you can already see what they look like for the "placement" dataset)
- plot_rf contains the plot produced by running the "to_plot.py" script (you can already see what it looks like for the "placement" dataset)
- all_plot contains plots (similar to those given in the paper) for each of the 15 datasets
- result_RF is a folder containing .json files that can be generated by launching the command
python generate_data_RF.py {DatasetName}
(see below). As an example, the .json file that has been produced for the "placement" dataset is provided in the folder. In these .json files, the following keys correspond to:
- acc: The average accuracy over the 10 random forests produced by the cross validation process
- instance: The list of binarized instances that have been considered in the experiments for the corresponding dataset and the random forests that have been learned. Instances 1 to 25 are those picked from the test set of the first random forest, instances 26 to 50 are associated with the second random forest, and so on
- classified: A list of tuples containing a Boolean value indicating whether the classifier succeeded in determining the right class of the corresponding instance, and a number (1 or 0) giving this class
- len_bin: A list providing, for each random forest, two successive numbers: the first is the number of Boolean features used in the forest, and the second is the number of original features in the dataset
- lime: A list of tuples indicating for each instance the size of the LIME explanation that has been computed and the computation time needed to get it
- sufficient: Same as lime but for sufficient reasons (using the MUS extractor)
- direct: Same as LIME but for direct reasons
- majoritary: Same as LIME but for majoritary reasons (using the greedy algorithm)
- 10s: A list containing the size of an explanation produced by the LHMS solver (partial MaxSAT approach) run with a time limit of 10 seconds
- 60s: Same as above but with a time limit of 60 seconds
- 600s: Same as above but with a time limit of 600 seconds
- reason: A list of lists. Each inner list contains the explanations that have been computed for the corresponding instance, in the following order: direct, sufficient, majoritary, and the approximation of a minimal majoritary reason obtained with a timeout of 10 seconds
- hashmap: A list of 10 hashmaps, one per random forest. Each hashmap contains pairs of the form: (index of the original feature, threshold) : [index of the corresponding Boolean feature in the forest, number of occurrences of this Boolean feature in the forest]. The literal associated with the Boolean feature is positive in an explanation if and only if the corresponding original feature is strictly greater than the corresponding threshold
- script contains the scripts that have been written
- majoritary-C++ contains C++ code to be compiled in order to execute our Python script (not detailed here)
- environment.yml This .yml file is here to help you reproduce our Python environment, which is mandatory to run our scripts
- data_RF.ods is a spreadsheet reporting, for each dataset, some statistical information about the computations achieved over the 250 instances that have been considered.
Here is a glossary to help you read it:
- acc: The average accuracy over the 10 random forests that have been generated
- nb_instances/nb_attributes : The number of instances/attributes in the dataset
- nb_tree: The number of trees in each random forest generated for this dataset
- avg_nb_bin: The average number of Boolean features used in the 10 random forests that have been generated
- std_nb_bin: The standard deviation corresponding to avg_nb_bin
- med_*: The median value (over the 250 instances) of the size of an explanation computed using the * approach
- max_*: The maximum value (over the 250 instances) of the size of an explanation computed using the * approach
- nb_opt_approx_**s: The number of "truly" minimal majoritary reasons (over the 250 instances) discovered by the approximation algorithm in at most ** seconds (obviously enough, minimal majoritary reasons are discovered whenever no size reduction results from one step to the next)
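To illustrate the hashmap convention described above (a positive literal in an explanation means that the corresponding original feature is strictly greater than the threshold), here is a small decoding sketch. The helper name decode_explanation and the toy values are ours, not part of the archive:

```python
# Sketch: turning the literals of an explanation back into threshold
# conditions over the original features, following the convention stated
# above: a positive literal means "original feature > threshold".
# The toy hashmap and explanation below are illustrative, not taken
# from an actual result_RF file.

def decode_explanation(explanation, hashmap):
    """explanation: list of signed Boolean-feature indices.
    hashmap: {(orig_feature_index, threshold): [bool_feature_index, n_occurrences]}"""
    # Invert the map: Boolean feature index -> (original feature, threshold)
    bool_to_cond = {v[0]: k for k, v in hashmap.items()}
    conditions = []
    for lit in explanation:
        feat, thr = bool_to_cond[abs(lit)]
        op = ">" if lit > 0 else "<="
        conditions.append(f"x{feat} {op} {thr}")
    return conditions

# Toy data: Boolean feature 1 encodes "x0 > 2.5", feature 2 encodes "x3 > 0.5"
hashmap = {(0, 2.5): [1, 4], (3, 0.5): [2, 7]}
print(decode_explanation([1, -2], hashmap))  # ['x0 > 2.5', 'x3 <= 0.5']
```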
Software
How to set up our Python environment before running our scripts
- Be sure to use a Linux OS and Python 3.x
- Install Anaconda
- Open a terminal in this directory with Anaconda activated (if conda is activated, "(base)" is displayed in your terminal prompt)
- Execute the command
conda env create --file environment.yml
to clone our Python environment in your system
- Execute the command
./build.sh
in majoritary-C++
WARNING: If this does not work, try changing "python3" to "python" at line 41 of majoritary-c++/CMakeLists.txt, then execute ./clean.sh and ./build.sh again
- Move the new folder "build" (currently in majoritary-C++) to the "script" folder
How to use our scripts
- Prepare a dataset and store it in the dataset directory (or use one of the datasets available). Your dataset must be a .csv file where the last column gives the label (class) of the instance corresponding to the line, and all values must be numerical (values of categorical features must be turned into numbers). Only two classes are allowed
- Set the number of trees per random forest for this dataset by updating
script/info_data_RF.json
- Execute the command
python generate_data_RF.py {DatasetName}
to generate a .json file in result_RF containing the results
- Execute the command
python to_plot.py
to generate .pdf files containing plots in plot_rf for each .json file in result_RF
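The dataset constraints listed in the first step (all-numerical .csv, label in the last column, exactly two classes) can be checked with a small sketch like the following. The function name, the error messages, and the assumption that there is no header row are ours; generate_data_RF.py may report problems differently:

```python
# Sketch: sanity-check that a dataset matches the expected format
# (all-numerical .csv, last column = binary class label).
# Assumption (ours): the file has no header row.
import csv

def check_dataset(path):
    labels = set()
    with open(path, newline="") as f:
        for row in csv.reader(f):
            for value in row:
                float(value)  # raises ValueError on non-numerical entries
            labels.add(row[-1])  # last column is the class label
    if len(labels) != 2:
        raise ValueError(f"expected exactly 2 classes, found {len(labels)}")
```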
Description of our scripts
generate_data_RF.py
This script generates results following a 10-fold cross-validation process, as explained in the experimental section of the paper.
To run it, enter the command python generate_data_RF.py {DatasetName}
(the dataset is selected as an argument of the script). Results are saved in a .json file located in the result_RF directory. You can select what to compute by commenting out the corresponding instructions (lines 180 to 254).
By default, the command computes, for each instance, a sufficient reason, a majoritary reason, an approximation of a minimal majoritary reason (obtained after 10s), and a LIME explanation. To avoid memory outs or unkilled jobs in case of a brutal stop, the code is currently sequential. If desired, you can parallelize it by increasing the number of workers (the value of the variable "nb_w" at line 323).
WARNING: Make sure that cnf_files does not already contain a directory corresponding to the one you are about to generate. If it does, delete the former files first (this is a safeguard preventing everything from being overwritten at each run of the script).
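As a reminder of the layout used for the "instance" key (25 test instances per fold, 10 folds over 250 instances), here is a sketch mapping a 1-based instance index to its random forest. The helper name is ours:

```python
# Sketch of the instance/forest association used in the result files:
# with 10-fold cross validation, 25 test instances are sampled per fold,
# so instances 1-25 belong to the first forest, 26-50 to the second,
# and so on. The function name is ours.

def forest_of_instance(instance_index, per_fold=25):
    """Return the 1-based index of the random forest associated with a
    1-based instance index in the result files."""
    return (instance_index - 1) // per_fold + 1

print(forest_of_instance(1))    # 1
print(forest_of_instance(26))   # 2
print(forest_of_instance(250))  # 10
```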
to_plot.py
This script creates four plots for each .json file in result_RF and stores them in plot_rf. You can draw plots for specific datasets by changing lines 8/9
my_tree.py
This script contains pieces of code to encode decision trees and to analyze them
my_forest.py
This script contains pieces of code to encode forests of decision trees (in particular, random forests) and to analyze them
Other scripts
- encodage_CNF.py contains pieces of code to encode propositional formulae into CNF formulae using the Tseitin technique
- timeout.py contains pieces of code to trigger a time-out exception
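For readers unfamiliar with the Tseitin technique mentioned for encodage_CNF.py, here is a minimal, simplified sketch. It is our own reconstruction, limited to binary AND/OR gates, and is not taken from the actual script:

```python
# Minimal sketch of the Tseitin transformation: each gate of the formula
# gets a fresh variable y, and clauses enforcing y <-> (a op b) are
# emitted, so the resulting CNF is equisatisfiable with the original
# formula (our simplified version, binary AND/OR gates only).

def tseitin_and(y, a, b):
    # Clauses for y <-> (a AND b)
    return [[-y, a], [-y, b], [-a, -b, y]]

def tseitin_or(y, a, b):
    # Clauses for y <-> (a OR b)
    return [[-y, a, b], [-a, y], [-b, y]]

# Encode (x1 AND x2) OR x3 with auxiliaries y1 (AND gate) and y2 (root):
x1, x2, x3, y1, y2 = 1, 2, 3, 4, 5
cnf = tseitin_and(y1, x1, x2) + tseitin_or(y2, y1, x3) + [[y2]]  # assert root
print(cnf)
```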