
Preprocessing Data

PyXAI simplifies the preparation and cleaning of datasets using a preprocessor. It makes it possible to:

  • Modify the dataset:
    • feature deletion
    • feature encoding (ordinal, one-hot, label)
    • selection of the target feature
    • possible conversion of a multi-class classification problem into a binary classification one
    • distinction between numerical and categorical features
  • Export the dataset with these modifications and create a new JSON file reflecting the modifications made. In particular, this makes it possible to transmit to PyXAI the information required to compute explanations using theories.

To create a preprocessor object, you just have to call:

Learning.Preprocessor(dataset, target_feature, learner_type, classification_type=None, to_binary_classification=Learning.ONE_VS_REST):
Create a Preprocessor object giving access to methods that make it possible to modify the dataset: delete features, encode features using an encoder (OrdinalEncoder, OneHotEncoder, LabelEncoder), and put the target feature in the last column. Moreover, this object also makes it possible to rewrite the dataset in a new file and to save the type of each feature (numerical or categorical) in a JSON file.
dataset String pandas.DataFrame: Either the file path of the dataset in CSV or EXCEL format or a pandas.DataFrame object representing the data.
target_feature String: The feature name of the target feature.
learner_type Learning.CLASSIFICATION Learning.REGRESSION: The type of task performed by the ML models.
classification_type None Learning.BINARY_CLASS Learning.MULTI_CLASS: This parameter is only useful if the parameter learner_type is set to Learning.CLASSIFICATION. In this case, the target feature will be encoded with a LabelEncoder, and a dataset conversion method will be used if the labels need to be converted to a binary class dataset. The next parameter allows choosing which conversion to use.
to_binary_classification Learning.ONE_VS_REST Learning.ONE_VS_ONE: The conversion used to transform a multi-class dataset into binary class datasets. The ONE_VS_REST method consists in fitting one classifier per class: for each classifier, the class is fitted against all the other classes. Unlike ONE_VS_REST, which creates one binary dataset per class, the ONE_VS_ONE approach splits the dataset so that one dataset is created for each pair of classes. Depending on the number of classes and the chosen method, several datasets are generated.
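To make the ONE_VS_REST relabeling concrete, here is a minimal pure-Python sketch (independent of PyXAI's actual implementation) showing how a 3-class label column is turned into three binary label columns, one per positive class:

```python
# Sketch of the ONE_VS_REST relabeling (not PyXAI's actual code):
# each class in turn becomes the positive class (1), all others become 0.
labels = ["setosa", "versicolor", "virginica", "setosa", "virginica"]
classes = sorted(set(labels))

one_vs_rest = {
    positive: [1 if label == positive else 0 for label in labels]
    for positive in classes
}

for positive, binary_labels in one_vs_rest.items():
    print(positive, binary_labels)
```

Each entry of `one_vs_rest` corresponds to one of the binary datasets PyXAI exports (here, three datasets for three classes).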

In the following code, we illustrate the ONE_VS_REST strategy on the iris.csv dataset. This strategy transforms a multi-class classification problem into several binary classification problems.

from pyxai import Learning

preprocessor = Learning.Preprocessor("../dataset/iris.csv", 
                                     target_feature="Species", 
                                     learner_type=Learning.CLASSIFICATION, 
                                     classification_type=Learning.BINARY_CLASS)

preprocessor.all_numerical_features()

preprocessor.process() 
preprocessor.export("iris", output_directory="../dataset")
Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width',
       'Species'],
      dtype='object')
---------------    Converter    ---------------
Numbers of classes: 3
Number of boolean features: 0
Warning: conversion from MultiClass to BinaryClass: the current dataset will be convert into several new datasets with the OneVsRest method.
Dataset saved: ../dataset/iris_0.csv
Types saved: ../dataset/iris_0.types
Dataset saved: ../dataset/iris_1.csv
Types saved: ../dataset/iris_1.types
Dataset saved: ../dataset/iris_2.csv
Types saved: ../dataset/iris_2.types
-----------------------------------------------

Afterwards, we specify, thanks to the all_numerical_features() method, that every feature is numerical, and we call the process() method to apply the binary conversion and modify the data.

<Preprocessor Object>.process():
Transform the dataset.

To finish, we call the export() method to save the transformed data in new files.

<Preprocessor Object>.export(filename, type="csv", output_directory=None):
Create two files, the first one represents the new dataset and the second one is a JSON file containing information about the transformations made on the data.
filename String: The filename of the two new files.
type String: The type of the new dataset (“csv” or “xls”).
output_directory String: The target directory for the two files (this directory must exist).

Two kinds of files are generated. The first ones are the new binary classification datasets (iris_0.csv, iris_1.csv, iris_2.csv). We display here the first five lines of iris_0.csv:

with open("../dataset/iris_0.csv", 'r') as f:
    print(f.read().splitlines()[0:5])
['Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species', '5.1,3.5,1.4,0.2,1', '4.9,3.0,1.4,0.2,1', '4.7,3.2,1.3,0.2,1', '4.6,3.1,1.5,0.2,1']

The second type of files that are generated are JSON files containing information about the transformations performed on the initial data:

with open("../dataset/iris_0.types", 'r') as f:
    print(f.read())
{
  "Sepal.Length": {
    "type:": "NUMERICAL",
    "encoder:": "None"
  },
  "Sepal.Width": {
    "type:": "NUMERICAL",
    "encoder:": "None"
  },
  "Petal.Length": {
    "type:": "NUMERICAL",
    "encoder:": "None"
  },
  "Petal.Width": {
    "type:": "NUMERICAL",
    "encoder:": "None"
  },
  "Species": {
    "type:": "Classification",
    "encoder:": "LabelEncoder",
    "classes:": [
      "Iris-setosa",
      "Iris-versicolor",
      "Iris-virginica"
    ],
    "binary_conversion:": {
      "Method": "OneVsRest",
      "0": [
        1,
        2
      ],
      "1": [
        0
      ]
    }
  }
}

After this short introduction, we now present in detail the different methods related to the characteristics of features. On the one hand, numerical data refers to numbers. For example, a numerical feature can represent the values of a given probe or the ages of a set of individuals. On the other hand, categorical data refers to non-numerical information divided into groups: it describes categories, such as the color of someone’s hair. Finally, PyXAI can also identify and encode categorical features for which only two groups are considered, also known as binary features.
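As an illustration of this distinction (this is a simple heuristic, not PyXAI's internal detection logic), one way to tell a numerical column from a categorical one is to check whether every value parses as a number:

```python
def looks_numerical(values):
    """Return True if every value can be parsed as a float.
    A simple heuristic for illustration, not PyXAI's internal logic."""
    try:
        for v in values:
            float(v)
        return True
    except (TypeError, ValueError):
        return False

print(looks_numerical(["5.1", "4.9", "4.7"]))      # a numerical column
print(looks_numerical(["red", "blond", "brown"]))  # a categorical column
```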

Numerical Features

In this example, we analyse and encode some features of the Melbourne Housing Market dataset (melb.csv):

import datetime
from pyxai import Learning, Explainer, Tools

preprocessor = Learning.Preprocessor("../dataset/melb.csv", 
                                     target_feature="Type", 
                                     learner_type=Learning.CLASSIFICATION, 
                                     classification_type=Learning.MULTI_CLASS)
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

First, we delete some irrelevant or redundant features thanks to the unset_features() method:

<Preprocessor Object>.unset_features(features):
Removes the features given in the features parameter from the encoding of the dataset. These features are deleted.
features List of String: Features to delete.
preprocessor.unset_features(["Address", "Suburb", "SellerG"])

We select the categorical features using the set_categorical_features() method (more details about this method are given in the next section):

preprocessor.set_categorical_features(columns=["Method", "CouncilArea", "Regionname"])

Conversely, the set_numerical_features() method makes it possible to select and encode the numerical features:

<Preprocessor Object>.set_numerical_features(features_dict):
Select and encode the numerical features present in the Python dictionary features_dict.
features_dict Dict:(String Int)->(Lambda None): Python dictionary where keys are numerical features and values are lambda functions representing data encodings. The keys must be represented either by strings or integers. Note that a lambda function set to None means that no encoding must be done.

In the next few lines of code, each key of the dictionary given in parameter represents a feature (“Postcode”, “Date”, “Distance”, …). For the first feature (“Postcode”), an encoding is performed in order to convert each data value (string) into an integer. The values of the second feature (“Date”) are converted into ordinal values thanks to the datetime module. All other features are kept as is, since they already contain standard numerical values.

preprocessor.set_numerical_features({
  "Postcode": lambda d: int(d),
  "Date": lambda d: datetime.date(int(d.split("/")[2]), int(d.split("/")[1]), int(d.split("/")[0])).toordinal(), 
  "Distance": None, "Bedroom2": None, "Bathroom": None,
  "Car": None, "Landsize": None, "BuildingArea": None, "YearBuilt": None,
  "Lattitude": None, "Longtitude": None, "Propertycount": None,
  "Rooms": None, "Price": None
  })
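To see what the “Date” encoding above produces, one can apply the same lambda to a sample value (assuming the dataset's dates use the day/month/year format expected by the lambda):

```python
import datetime

# Same encoding as used for the "Date" feature above
encode_date = lambda d: datetime.date(
    int(d.split("/")[2]), int(d.split("/")[1]), int(d.split("/")[0])
).toordinal()

# "3/12/2016" is read as day=3, month=12, year=2016,
# and mapped to the proleptic Gregorian ordinal of that date
print(encode_date("3/12/2016"))
```

Encoding dates as ordinals preserves their chronological order, which is what tree-based models need to split on them.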

Finally, we execute the process() method to apply all modifications and we export the new dataset with the export() method.

preprocessor.process()
preprocessor.export("melb", output_directory="../dataset")
---------------    Converter    ---------------
Feature deleted:  Suburb
Feature deleted:  Address
Feature deleted:  SellerG
One hot encoding new features for Method: 5
One hot encoding new features for CouncilArea: 34
One hot encoding new features for Regionname: 8
Numbers of classes: 3
Number of boolean features: 0
Dataset saved: ../dataset/melb_0.csv
Types saved: ../dataset/melb_0.types
-----------------------------------------------

The melb_0.types file contains all information about the conversions made. We display only the first twenty lines here.

with open("../dataset/melb_0.types", 'r') as f:
    for l in f.read().splitlines()[0:20]:
        print(l)
{
  "Rooms": {
    "type:": "NUMERICAL",
    "encoder:": "None"
  },
  "Propertycount": {
    "type:": "NUMERICAL",
    "encoder:": "None"
  },
  "Price": {
    "type:": "NUMERICAL",
    "encoder:": "None"
  },
  "Method_PI": {
    "type:": "CATEGORICAL",
    "encoder:": "OneHotEncoder",
    "original_feature:": "Method",
    "original_values:": [
      "PI",
      [

We now introduce some methods about categorical features.

Categorical Features

The preprocessor deals with categorical features. It identifies them and can perform one-hot or ordinal encoding. The method set_categorical_features() performs this step:

<Preprocessor Object>.set_categorical_features(columns=None, encoder=Learning.ONE_HOT):
Encode the categorical features given in the columns parameter with the encoder given in the encoder parameter.
columns List of String or List of Integer: The features to encode. These features can be given either as numbers (respecting the order of the given dataset) or directly via strings corresponding to their names.
encoder Learning.ONE_HOT Learning.ORDINAL : The encoder used. Learning.ONE_HOT creates a binary column for each category, where the value 1 means that this category is present for an instance, otherwise the value is 0 (see this page for more details). In contrast, the Learning.ORDINAL encoder converts each categorical feature to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature (see this page for more details). Of course, you can call this method twice to combine the encoders.
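To contrast the two encoders, here is a minimal pure-Python sketch (independent of PyXAI) of what ordinal and one-hot encodings produce for a small categorical column:

```python
values = ["PI", "S", "SP", "S"]
categories = sorted(set(values))  # ['PI', 'S', 'SP']

# Ordinal encoding: a single integer column, 0 to n_categories - 1
ordinal = [categories.index(v) for v in values]
print(ordinal)  # [0, 1, 2, 1]

# One-hot encoding: one binary column per category
one_hot = {c: [1 if v == c else 0 for v in values] for c in categories}
for c, column in one_hot.items():
    print(c, column)
```

The one-hot output mirrors the new columns created in the melb example above (e.g. Method_PI), while the ordinal output keeps a single column per feature.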

For the categorical features “Method”, “CouncilArea” and “Regionname”, the following line of code creates one binary column for each category:

preprocessor.set_categorical_features(columns=["Method", "CouncilArea", "Regionname"])

If categorical features have already been one-hot encoded, you can use the set_categorical_features_already_one_hot_encoded() method to identify them as categorical features.

<Preprocessor Object>.set_categorical_features_already_one_hot_encoded(name, features):
Identify a set of binary features coming from a one-hot encoding as a single categorical feature.
name String: The name of the resulting categorical feature.
features List of String : Feature names in the dataset corresponding to binary features where each feature represents a category.

Here is an example where this method is used:

from pyxai import Learning, Explainer, Tools

preprocessor = Learning.Preprocessor(
    "../dataset/compas.csv", 
    target_feature="Two_yr_Recidivism", 
    learner_type=Learning.CLASSIFICATION, 
    classification_type=Learning.BINARY_CLASS)

preprocessor.set_categorical_features_already_one_hot_encoded("score_factor", ["score_factor"])
preprocessor.set_categorical_features_already_one_hot_encoded("Age_Above_FourtyFive", ["Age_Above_FourtyFive"])
preprocessor.set_categorical_features_already_one_hot_encoded("Age_Below_TwentyFive", ["Age_Below_TwentyFive"])
preprocessor.set_categorical_features_already_one_hot_encoded("Ethnic", ["African_American", "Asian", "Hispanic", "Native_American", "Other"])
preprocessor.set_categorical_features_already_one_hot_encoded("Female", ["Female"])
preprocessor.set_categorical_features_already_one_hot_encoded("Misdemeanor", ["Misdemeanor"])

preprocessor.set_numerical_features({
  "Number_of_Priors": None
})

preprocessor.process()
preprocessor.export("compas", output_directory="../dataset")
Index(['Number_of_Priors', 'score_factor', 'Age_Above_FourtyFive',
       'Age_Below_TwentyFive', 'African_American', 'Asian', 'Hispanic',
       'Native_American', 'Other', 'Female', 'Misdemeanor',
       'Two_yr_Recidivism'],
      dtype='object')
The feature score_factor is boolean! No One Hot Encoding for this features.
The feature Age_Above_FourtyFive is boolean! No One Hot Encoding for this features.
The feature Age_Below_TwentyFive is boolean! No One Hot Encoding for this features.
The feature Female is boolean! No One Hot Encoding for this features.
The feature Misdemeanor is boolean! No One Hot Encoding for this features.
---------------    Converter    ---------------
Numbers of classes: 2
Number of boolean features: 5
Dataset saved: ../dataset/compas_0.csv
Types saved: ../dataset/compas_0.types
-----------------------------------------------

All these examples show how to define categorical, numerical and binary features in order to use them in theories. See this page for more details about theories.