
Preprocessing Data

PyXAI simplifies the preparation/cleaning of datasets using a preprocessor. It allows you to:

Modify the dataset:
- feature deletion
- feature encoding (ordinal, one-hot, label)
- selection of the target feature
- possible conversion of a multi-class classification problem into a binary classification one
- distinction between numerical and categorical features
Export the dataset with these modifications and create a new JSON file that records them. In particular, this file passes to PyXAI the information required to compute explanations using theories.

To create a preprocessor object, instantiate the TabularPreprocessor class, which preprocesses tabular datasets. Its methods allow you to delete features, encode features using an encoder (OrdinalEncoder, OneHotEncoder, LabelEncoder), put the target feature in the last column, write the dataset to a new file, and save the type of each feature (numerical or categorical) in a JSON file. More details about TabularPreprocessor are given in the API page.

In the following code, we illustrate the ONE_VS_REST strategy on the iris.csv dataset. This strategy transforms a multi-class classification problem into several binary classification problems, one per class, each separating that class from all the others.

from pyxai import Learning

preprocessor = Learning.TabularPreprocessor("../dataset/iris.csv", 
                                            target_feature="Species", 
                                            problem_type=Learning.CLASSIFICATION, 
                                            classification_type=Learning.BINARY_CLASS)

preprocessor.all_numerical_features()

preprocessor.process() 
preprocessor.export("iris", output_directory="../dataset")

--------------   Information   ---------------
Problem type: classification
Instances type: tabular
Labels type: None

Dataset path: None
---------------    Converter    ---------------
Numbers of classes: 3
Number of boolean features: 0
MethodToBinaryClassification.OneVsOne:  50 - 50
MethodToBinaryClassification.OneVsOne:  50 - 50
MethodToBinaryClassification.OneVsOne:  50 - 50
Dataset saved: ../dataset/iris_0_vs_1.csv
Types saved: ../dataset/iris_0_vs_1.types
Dataset saved: ../dataset/iris_0_vs_2.csv
Types saved: ../dataset/iris_0_vs_2.types
Dataset saved: ../dataset/iris_1_vs_2.csv
Types saved: ../dataset/iris_1_vs_2.types
-----------------------------------------------



Next, we use the all_numerical_features method to declare that every feature is numerical, and we call the process method to apply the binary conversion and modify the data accordingly.

The process method applies all the transformations defined on the preprocessor object to the dataset.

Finally, we call the export method to save the transformed data in new files.

The export method saves the transformed dataset and a JSON file containing information about the transformations (feature types, encoders used) to disk.

Two kinds of files are generated. The first are the new binary classification datasets (iris_0.csv, iris_1.csv, iris_2.csv). Here are the first five lines of iris_0.csv:

with open("../dataset/iris_0.csv", 'r') as f:
    print(f.read().splitlines()[0:5])

['Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species', '5.1,3.5,1.4,0.2,1', '4.9,3.0,1.4,0.2,1', '4.7,3.2,1.3,0.2,1', '4.6,3.1,1.5,0.2,1']
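To make the conversion concrete, here is a minimal sketch of one-vs-rest relabeling in plain Python. It is an illustration only, not PyXAI's implementation: the function name one_vs_rest and the six sample labels are hypothetical. For each class, the instances of that class receive the label 1 and all others receive 0, which matches the rows above, where the first instances (Iris-setosa) have Species equal to 1 in iris_0.csv.

```python
def one_vs_rest(labels):
    """Return {positive_class: binary_labels}, one binary problem per class."""
    classes = sorted(set(labels))
    return {
        positive: [1 if label == positive else 0 for label in labels]
        for positive in classes
    }


# Label-encoded targets for six hypothetical instances (two per class).
labels = [0, 0, 1, 1, 2, 2]
problems = one_vs_rest(labels)

for positive, binary in problems.items():
    print(f"class {positive} vs rest: {binary}")
```

A three-class problem therefore yields three binary label vectors, one per exported dataset.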

The second kind are JSON files describing the transformations performed on the initial data:

with open("../dataset/iris_0.types", 'r') as f:
    print(f.read())

{
  "Sepal.Length": {
    "type:": "NUMERICAL",
    "encoder:": "None"
  },
  "Sepal.Width": {
    "type:": "NUMERICAL",
    "encoder:": "None"
  },
  "Petal.Length": {
    "type:": "NUMERICAL",
    "encoder:": "None"
  },
  "Petal.Width": {
    "type:": "NUMERICAL",
    "encoder:": "None"
  },
  "Species": {
    "type:": "Classification",
    "encoder:": "LabelEncoder",
    "classes:": [
      "Iris-setosa",
      "Iris-versicolor",
      "Iris-virginica"
    ],
    "binary_conversion:": {
      "Method": "OneVsRest",
      "0": [
        1,
        2
      ],
      "1": [
        0
      ]
    }
  }
}
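Since the .types file is plain JSON, it can be inspected with the standard json module. The sketch below parses an excerpt of the listing above from a string, so it runs without the file. It assumes that the lists under "binary_conversion:" are indices into "classes:", which is our reading of the listing rather than documented behavior. Note that the keys inside each entry end with a colon ("type:", "encoder:").

```python
import json

# Excerpt copied from the iris_0.types listing (two features and the target).
types_text = """
{
  "Sepal.Length": {"type:": "NUMERICAL", "encoder:": "None"},
  "Petal.Width": {"type:": "NUMERICAL", "encoder:": "None"},
  "Species": {
    "type:": "Classification",
    "encoder:": "LabelEncoder",
    "classes:": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    "binary_conversion:": {"Method": "OneVsRest", "0": [1, 2], "1": [0]}
  }
}
"""

types = json.loads(types_text)

# Names of the features declared as numerical.
numerical = [name for name, info in types.items() if info["type:"] == "NUMERICAL"]

# Assumption: each binary label maps to a list of indices into "classes:".
target = types["Species"]
groups = {
    binary_label: [target["classes:"][i] for i in original]
    for binary_label, original in target["binary_conversion:"].items()
    if binary_label != "Method"
}

print(numerical)
print(groups)
```

Under this reading, the binary label 1 in iris_0.csv stands for Iris-setosa and the label 0 groups the two other species.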

After this short introduction, we now present in detail the methods related to feature types. Numerical data are numbers: for example, a numerical feature can represent the values of a given probe or the ages of a set of individuals. Categorical data are non-numerical information divided into groups, such as the color of someone’s hair. Finally, PyXAI can also identify and encode categorical features that have only two groups, i.e., binary features.
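The three kinds of features can be pictured with the naive heuristic below. It is purely illustrative: the function infer_kind and the sample columns are hypothetical, and PyXAI relies on the explicit declarations presented in the following sections rather than on such a guess.

```python
def infer_kind(values):
    """Classify a column as 'binary', 'numerical' or 'categorical' (naive heuristic)."""
    distinct = set(values)
    if len(distinct) == 2:
        # Exactly two groups: treated as binary, whatever the values are.
        return "binary"
    try:
        for value in distinct:
            float(value)  # raises ValueError on non-numeric strings
    except (TypeError, ValueError):
        return "categorical"
    return "numerical"


ages = ["23", "41", "35"]                 # numbers
hair_colors = ["brown", "blond", "black"]  # groups
is_member = ["0", "1", "1"]                # only two groups

print(infer_kind(ages), infer_kind(hair_colors), infer_kind(is_member))
```

The heuristic is deliberately crude: a numerical column that happens to contain only two distinct values is reported as binary, which is one reason to declare feature types explicitly.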

Numerical Features

In this example, we analyse and encode some features of the Melbourne Housing Market dataset (melb.csv):

import datetime
from pyxai import Learning, Tools

preprocessor = Learning.TabularPreprocessor("../dataset/melb.csv", 
                                            target_feature="Type", 
                                            problem_type=Learning.CLASSIFICATION, 
                                            classification_type=Learning.MULTI_CLASS)

--------------   Information   ---------------
Problem type: classification
Instances type: tabular
Labels type: None

Dataset path: None

First, we delete some irrelevant or redundant features thanks to the unset_features method:

The unset_features method removes one or more features from the dataset so they are not taken into account during preprocessing.

preprocessor.unset_features(["Address", "Suburb", "SellerG"])
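The effect of deleting features can be sketched with the standard csv module. This is not how PyXAI implements unset_features. The column names come from the example, while the two rows of values are made up for illustration.

```python
import csv
import io

# Hypothetical two-row excerpt: real column names, invented values.
raw = io.StringIO(
    "Suburb,Address,Rooms,SellerG,Price\n"
    "A,1 Example St,2,X,1000000\n"
    "B,2 Example Rd,3,Y,1500000\n"
)

to_delete = {"Address", "Suburb", "SellerG"}

# Keep every column except the deleted ones, row by row.
rows = [
    {name: value for name, value in row.items() if name not in to_delete}
    for row in csv.DictReader(raw)
]

print(rows)
```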

We select the categorical features using the set_categorical_features method (more details about this method are given in the next section):

preprocessor.set_categorical_features(features=["Method", "CouncilArea", "Regionname"])

Conversely, the set_numerical_features method allows you to select and encode the numerical features:

The set_numerical_features method selects and optionally encodes numerical features via a dictionary mapping feature names to lambda functions (use None for no encoding).

In the next few lines of code, each key of the dictionary passed as a parameter represents a feature (“Postcode”, “Date”, “Distance”, …). For the first feature (“Postcode”), an encoding converts each value (a string) into an integer. The values of the second feature (“Date”) are converted into ordinal values thanks to the datetime module. All other features are kept as they are, since they are already standard numerical values.

preprocessor.set_numerical_features({
  "Postcode": lambda d: int(d),
  "Date": lambda d: datetime.date(int(d.split("/")[2]), int(d.split("/")[1]), int(d.split("/")[0])).toordinal(), 
  "Distance": None, "Bedroom2": None, "Bathroom": None,
  "Car": None, "Landsize": None, "BuildingArea": None, "YearBuilt": None,
  "Lattitude": None, "Longtitude": None, "Propertycount": None,
  "Rooms": None, "Price": None
  })
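To see what the “Date” encoder computes, the sketch below applies the same conversion to a single value and compares it with an equivalent formulation based on strptime. The sample date is hypothetical and assumes the day/month/year layout implied by the lambda above.

```python
import datetime


def encode_date(d):
    """Same logic as the lambda above: 'day/month/year' -> proleptic Gregorian ordinal."""
    day, month, year = (int(part) for part in d.split("/"))
    return datetime.date(year, month, day).toordinal()


def encode_date_strptime(d):
    """Equivalent conversion with an explicit format string."""
    return datetime.datetime.strptime(d, "%d/%m/%Y").date().toordinal()


sample = "3/12/2016"  # hypothetical value: 3 December 2016
ordinal = encode_date(sample)

# The ordinal can be turned back into the original date.
print(ordinal, datetime.date.fromordinal(ordinal))
```

Because ordinals are consecutive integers, two dates one day apart differ by exactly 1, so the encoded feature preserves chronological order and distances in days.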

Finally, we execute the process method to apply all modifications and we export the new dataset with the export method.

preprocessor.process()
preprocessor.export("melb_out", output_directory="../dataset")

---------------    Converter    ---------------
Feature deleted:  Suburb
Feature deleted:  Address
Feature deleted:  SellerG
One hot encoding new features for Method: 5
One hot encoding new features for CouncilArea: 34
One hot encoding new features for Regionname: 8
Numbers of classes: 3
Number of boolean features: 0


Dataset saved: ../dataset/melb_out.csv
Types saved: ../dataset/melb_out.types
-----------------------------------------------

The melb_out.types file contains all the information about the conversions made. We display only its first twenty lines.

# Use a context manager so that the file is closed after reading.
with open("../dataset/melb_out.types", 'r') as f:
    for line in f.read().splitlines()[0:20]:
        print(line)

{
  "Rooms": {
    "type:": "TypeFeature.Numerical",
    "encoder:": "None"
  },
  "Propertycount": {
    "type:": "TypeFeature.Numerical",
    "encoder:": "LabelEncoder"
  },
  "Price": {
    "type:": "TypeFeature.Numerical",
    "encoder:": "None"
  },
  "Method_PI": {
    "type:": "TypeFeature.Categorical",
    "encoder:": "EncoderType.OneHotEncoder",
    "original_feature:": "Method",
    "original_values:": [
      "PI",
      [

We now introduce some methods about categorical features.

Categorical Features

The preprocessor deals with categorical features. It identifies them and can perform one-hot or ordinal encoding. The method set_categorical_features performs this step:

The set_categorical_features method encodes categorical features using either one-hot encoding (Learning.ONE_HOT) or ordinal encoding (Learning.ORDINAL).

For the categorical features “Method”, “CouncilArea” and “Regionname”, the following line of code creates one binary column for each category:

preprocessor.set_categorical_features(features=["Method", "CouncilArea", "Regionname"])
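The difference between the two encodings can be sketched in plain Python. This is an illustration under stated assumptions, not PyXAI's implementation: the helper names and the sample values are hypothetical (only “PI” appears in the output shown earlier), and the column naming Method_PI follows the pattern visible in melb_out.types.

```python
def one_hot(feature, values):
    """Return (column_names, rows) with one 0/1 column per category."""
    categories = sorted(set(values))
    names = [f"{feature}_{category}" for category in categories]
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return names, rows


def ordinal(values):
    """Replace each category by its rank in the sorted list of categories."""
    index = {category: i for i, category in enumerate(sorted(set(values)))}
    return [index[value] for value in values]


method = ["S", "PI", "S", "VB"]  # hypothetical sample of the "Method" column

names, rows = one_hot("Method", method)
codes = ordinal(method)

print(names)
print(rows)
print(codes)
```

One-hot encoding adds one column per category and implies no order between categories, whereas ordinal encoding keeps a single column but introduces an order that may or may not be meaningful.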

If categorical features have already been one-hot encoded in the dataset, the set_categorical_features_already_one_hot_encoded method lets you declare them as categorical and group their columns under a single categorical feature name.

Here is an example where this method is used:

from pyxai import Learning, Tools

preprocessor = Learning.TabularPreprocessor(
    "../dataset/compas.csv", 
    target_feature="Two_yr_Recidivism", 
    problem_type=Learning.CLASSIFICATION, 
    classification_type=Learning.BINARY_CLASS)

preprocessor.set_categorical_features_already_one_hot_encoded("score_factor", ["score_factor"])
preprocessor.set_categorical_features_already_one_hot_encoded("Age_Above_FourtyFive", ["Age_Above_FourtyFive"])
preprocessor.set_categorical_features_already_one_hot_encoded("Age_Below_TwentyFive", ["Age_Below_TwentyFive"])
preprocessor.set_categorical_features_already_one_hot_encoded("Ethnic", ["African_American", "Asian", "Hispanic", "Native_American", "Other"])
preprocessor.set_categorical_features_already_one_hot_encoded("Female", ["Female"])
preprocessor.set_categorical_features_already_one_hot_encoded("Misdemeanor", ["Misdemeanor"])

preprocessor.set_numerical_features({
  "Number_of_Priors": None
})

preprocessor.process()
preprocessor.export("compas", output_directory="../dataset")

--------------   Information   ---------------
Problem type: classification
Instances type: tabular
Labels type: None

Dataset path: None
The feature score_factor is boolean! No One Hot Encoding for this features.
The feature Age_Above_FourtyFive is boolean! No One Hot Encoding for this features.
The feature Age_Below_TwentyFive is boolean! No One Hot Encoding for this features.
The feature Female is boolean! No One Hot Encoding for this features.
The feature Misdemeanor is boolean! No One Hot Encoding for this features.
---------------    Converter    ---------------
Numbers of classes: 2
Number of boolean features: 5
Dataset saved: ../dataset/compas.csv
Types saved: ../dataset/compas.types
-----------------------------------------------
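Grouping several one-hot columns under one name, as done for “Ethnic” above, amounts to reading them as a single categorical value. The sketch below illustrates the idea in plain Python. The helper decode_one_hot and the sample row are hypothetical, and returning None when no column is active is our choice for the illustration, not documented PyXAI behavior.

```python
def decode_one_hot(row, columns):
    """Return the name of the single active column, or None if all are 0."""
    active = [name for name in columns if row[name] == 1]
    if len(active) > 1:
        raise ValueError(f"several active columns: {active}")
    return active[0] if active else None


ethnic_columns = ["African_American", "Asian", "Hispanic", "Native_American", "Other"]

# Hypothetical instance: exactly one of the grouped columns is set.
row = {"African_American": 0, "Asian": 0, "Hispanic": 1,
       "Native_American": 0, "Other": 0, "Female": 1}

print(decode_one_hot(row, ethnic_columns))
```

Declaring the group makes the mutual exclusivity explicit: at most one of the grouped columns can be 1 for a given instance.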

These examples show how to declare categorical, numerical and binary features so that they can be used in theories. See this page for more details about theories.