PyXAI

Class TabularPreprocessor

The TabularPreprocessor class is used to preprocess tabular datasets.

A TabularPreprocessor object provides methods to:

  • remove features from the dataset
  • encode features with an encoder (OrdinalEncoder, OneHotEncoder, LabelEncoder) and move the target feature to the last column
  • write the dataset to a new CSV file and save the feature types (numerical or categorical) in a JSON file

    def __init__(self, 
                 dataset, 
                 target_feature, 
                 problem_type, 
                 classification_type=None, 
                 to_binary_classification=MultiClassToBinaryMethod.OneVsOne, 
                 discretization=None):

Initialise a TabularPreprocessor object with a dataset.

Parameters

dataset : str | pandas.DataFrame | NoneData

The dataset to use, given either as a path to a CSV, JSON or Excel file or as a pandas DataFrame.

target_feature : str

The name of the target feature.

problem_type : str | ProblemType

The type of problem (classification, regression, …)
Possible values are defined in the ProblemType enum.

classification_type : str | ClassificationType (optional, default=None)

The type of classification (BinaryClass or MultiClass) to produce.  
Possible values are defined in the ClassificationType enum.

to_binary_classification : str | MultiClassToBinaryMethod (optional, default=MultiClassToBinaryMethod.OneVsOne)

The method used to convert a multi-class dataset into binary-class datasets (OneVsOne or OneVsRest).
Possible values are defined in the MultiClassToBinaryMethod enum.
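
To make the two conversion strategies concrete, here is a minimal sketch in plain Python. It does not use PyXAI's code: the helper names `one_vs_rest` and `one_vs_one` are hypothetical, the labels are made up, and the exact binary encoding PyXAI produces may differ.

```python
from itertools import combinations


def one_vs_rest(labels):
    """Return one binary label list per class: 1 for that class, 0 otherwise."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def one_vs_one(labels):
    """Return, per unordered pair of classes, the kept row indexes and their binary labels."""
    classes = sorted(set(labels))
    result = {}
    for a, b in combinations(classes, 2):
        # Only rows belonging to one of the two classes are kept.
        rows = [i for i, y in enumerate(labels) if y in (a, b)]
        result[(a, b)] = (rows, [1 if labels[i] == a else 0 for i in rows])
    return result


# Hypothetical target column with three classes.
labels = ["setosa", "versicolor", "virginica", "setosa"]
ovr = one_vs_rest(labels)
ovo = one_vs_one(labels)
print(len(ovr), len(ovo))
```

With k classes, the one-vs-rest sketch yields k binary problems and the one-vs-one sketch yields k(k-1)/2, which is why the two counts coincide for three classes.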

discretization : int (optional, default=None)

If this parameter is None, no discretization is applied. 
Otherwise, it is the number of bins to produce. 
This discretization method uses a KBinsDiscretizer to transform numerical features into categorical features (with a direct encoding).
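
As an illustration of what binning does to a numerical feature, the sketch below implements equal-width binning in plain Python. This is an assumption made for the example: scikit-learn's KBinsDiscretizer supports several strategies (uniform, quantile, kmeans) and this page does not say which one PyXAI configures. The helper `uniform_bins` and the sample values are hypothetical.

```python
def uniform_bins(values, n_bins):
    """Map each numeric value to a bin index in [0, n_bins - 1] using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature ends up in a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value falls on the last edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numerical feature, discretized into 3 bins.
petal_lengths = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]
bins = uniform_bins(petal_lengths, 3)
print(bins)
```

Each value is replaced by the index of the interval it falls into, so the feature can then be handled as a categorical one.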

Examples

Example 1
import pandas
from pyxai import Learning, Tools

data = pandas.read_csv(Tools.Options.dataset, names=["sepal length", "sepal width", "petal length", "petal width", "Iris Plants"])
preprocessor = Learning.TabularPreprocessor(data, target_feature="Iris Plants", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS)

preprocessor.all_numerical_features()
preprocessor.process()

dataset_name = Tools.Options.dataset.split("/")[-1].split(".")[0] 
preprocessor.export(dataset_name, output_directory=Tools.Options.output)
Example 2
import datetime
from pyxai import Learning, Tools

preprocessor = Learning.TabularPreprocessor(Tools.Options.dataset, target_feature="Type", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS, to_binary_classification=Learning.ONE_VS_REST)

preprocessor.unset_features(["Address", "Suburb", "SellerG"])

preprocessor.set_categorical_features(features=["Method", "CouncilArea", "Regionname"])

preprocessor.set_numerical_features({
  "Postcode": lambda d: int(d),
  "Rooms": None, 
  "Price": None,
  "Date": lambda d: datetime.date(int(d.split("/")[2]), int(d.split("/")[1]), int(d.split("/")[0])).toordinal(), 
  "Distance": None,
  "Bedroom2": None,
  "Bathroom": None,
  "Car": None,
  "Landsize": None,
  "BuildingArea": None,
  "YearBuilt": None,
  "Lattitude": None,
  "Longtitude": None,
  "Propertycount": None
  })

preprocessor.process()

dataset_name = Tools.Options.dataset.split("/")[-1].split(".")[0] 
preprocessor.export(dataset_name, output_directory=Tools.Options.output)
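
Both examples derive `dataset_name` by splitting the path on "/" and ".". The sketch below compares that expression with `pathlib`'s `stem`, as a possible alternative rather than what PyXAI requires. The helper names and the paths are hypothetical.

```python
from pathlib import PurePosixPath


def name_by_split(path):
    # Same expression as in the examples: keep the text before the first dot.
    return path.split("/")[-1].split(".")[0]


def name_by_stem(path):
    # pathlib alternative: drops only the final extension.
    return PurePosixPath(path).stem


simple = "../dataset/melb.csv"      # hypothetical path
dotted = "../dataset/melb.v2.csv"   # hypothetical path with two dots
print(name_by_split(simple), name_by_stem(simple))
print(name_by_split(dotted), name_by_stem(dotted))
```

The two approaches agree on file names with a single extension and differ only when the file name itself contains additional dots.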

See also

Documentation page
- More examples are in the pyxai/examples/Converters directory in the source code.


Main Methods

    def export(self, filename, type="csv", output_directory=None):

Export the dataset that has been transformed.

This function creates two files: the first contains the new dataset and the second is a JSON file describing the transformations applied to the data. When a multi-class classification problem is converted into a set of binary classification problems, several pairs of files are produced.

Parameters

filename : str

The base name of the new files.

type : str (optional, default="csv")

The type of the new dataset (“csv” or “xls”).

output_directory : str (optional, default=None)

The target directory for the new files (this directory must exist).
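
Since the target directory must already exist, one option is to create it before calling export. The sketch below uses only the standard library; the helper `ensure_directory` and the directory layout are hypothetical, and the export call is left as a comment because it needs a configured preprocessor.

```python
import os
import tempfile


def ensure_directory(path):
    """Create the directory (and any missing parents) and return its path."""
    os.makedirs(path, exist_ok=True)
    return path


# A temporary location stands in for a real project directory.
base = tempfile.mkdtemp()
output_directory = ensure_directory(os.path.join(base, "dataset", "converted"))
print(os.path.isdir(output_directory))

# The directory can then be passed to export, as in the examples on this page:
# preprocessor.export("iris", output_directory=output_directory)
```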

Examples

from pyxai import Learning

preprocessor = Learning.TabularPreprocessor("../dataset/iris.csv", 
                                    target_feature="Species", 
                                    problem_type=Learning.CLASSIFICATION, 
                                    classification_type=Learning.BINARY_CLASS)

preprocessor.all_numerical_features()

preprocessor.process() 
preprocessor.export("iris", output_directory="../dataset")

---------------    Converter    ---------------
Numbers of classes: 3
Number of boolean features: 0
Warning: conversion from MultiClass to BinaryClass: the current dataset will be convert into several new datasets with the OneVsRest method.
Dataset saved: ../dataset/iris_0.csv
Types saved: ../dataset/iris_0.types
Dataset saved: ../dataset/iris_1.csv
Types saved: ../dataset/iris_1.types
Dataset saved: ../dataset/iris_2.csv
Types saved: ../dataset/iris_2.types
-----------------------------------------------
    def process(self):

Transform the dataset.

Applies all encodings previously defined by other methods of this class.

  • Move the target feature to the last column
  • Delete the features marked for removal
  • Encode categorical and numerical features
  • Convert a multi-class dataset into binary-class datasets if needed
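
The first two steps listed above can be pictured with a small sketch in plain Python. It is not PyXAI's implementation: the helper `reorder_and_drop` is hypothetical and the row values are invented for illustration.

```python
def reorder_and_drop(columns, rows, target, to_remove):
    """Drop the given columns and move the target column to the end."""
    kept = [c for c in columns if c != target and c not in to_remove]
    new_columns = kept + [target]
    index = {c: i for i, c in enumerate(columns)}
    new_rows = [[row[index[c]] for c in new_columns] for row in rows]
    return new_columns, new_rows


# Hypothetical rows; the column names come from the examples on this page.
columns = ["Type", "Address", "Rooms", "Price"]
rows = [["h", "1 Main St", 3, 850000], ["u", "2 High St", 2, 420000]]
new_columns, new_rows = reorder_and_drop(columns, rows, target="Type", to_remove=["Address"])
print(new_columns)
print(new_rows)
```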

Returns

pandas.DataFrame :

The new dataset.

Examples

from pyxai import Learning, Tools

preprocessor = Learning.TabularPreprocessor(Tools.Options.dataset, target_feature="Type", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS, to_binary_classification=Learning.ONE_VS_REST)
preprocessor.set_categorical_features(features=["Method", "CouncilArea", "Regionname"])
preprocessor.process()
    def unset_features(self, features):

Delete the given features.

Parameters

features : list[str | int]

List of features to remove. 
In this list, a feature can be given either as a str (a feature name) or as an int (a feature index).
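
A mixed list of names and indexes can be normalized to names as sketched below. The helper `resolve_features` is hypothetical, and treating indexes as 0-based positions in the original column order is an assumption of this sketch, not something this page states.

```python
def resolve_features(columns, features):
    """Turn a mixed list of names and indexes into a list of column names."""
    names = []
    for f in features:
        if isinstance(f, int):
            # Assumption: indexes are 0-based positions in the original dataset.
            names.append(columns[f])
        elif f in columns:
            names.append(f)
        else:
            raise KeyError(f"unknown feature: {f!r}")
    return names


columns = ["Suburb", "Address", "Rooms", "Type"]
resolved = resolve_features(columns, ["Address", 0])
print(resolved)
```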

Examples

from pyxai import Learning, Tools

preprocessor = Learning.TabularPreprocessor(Tools.Options.dataset, target_feature="Type", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS, to_binary_classification=Learning.ONE_VS_REST)
preprocessor.unset_features(["Address", "Suburb", "SellerG"])
preprocessor.process()

Categorical Features

    def set_categorical_features(self, features=None, encoder=EncoderType.OneHotEncoder):

Encode the categorical features given in the features parameter with the encoder given in the encoder parameter.

Parameters

features : list[str | int]

The features to encode.  
These features can be given either as indexes (respecting the order of the given dataset) or directly via strings corresponding to their names.

encoder : str | EncoderType (optional, default=EncoderType.OneHotEncoder)

The type of encoder (one-hot or ordinal)
Possible values are defined in the EncoderType enum.
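
The difference between the two encodings can be seen in this plain-Python sketch. It does not call PyXAI or scikit-learn: the helpers are hypothetical, the category values are made up, and ordering categories alphabetically is an assumption of the sketch.

```python
def ordinal_encode(values):
    """Map each category to an integer, following sorted category order."""
    categories = sorted(set(values))
    mapping = {c: i for i, c in enumerate(categories)}
    return [mapping[v] for v in values], categories


def one_hot_encode(values):
    """Map each category to a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


# Hypothetical values for a categorical feature such as "Method".
methods = ["S", "PI", "S", "VB"]
ordinal, categories = ordinal_encode(methods)
one_hot, _ = one_hot_encode(methods)
print(categories)
print(ordinal)
print(one_hot)
```

An ordinal encoding keeps one column but introduces an order between categories, whereas a one-hot encoding produces one binary column per category.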

Examples

from pyxai import Learning, Tools

preprocessor = Learning.TabularPreprocessor(Tools.Options.dataset, target_feature="Type", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS, to_binary_classification=Learning.ONE_VS_REST)
preprocessor.set_categorical_features(features=["Method", "CouncilArea", "Regionname"])
preprocessor.process()
    def set_categorical_features_already_one_hot_encoded(self, name, features):

Declare that a set of binary features produced by a one-hot encoding forms a single categorical feature.

Parameters

name : str

The name of the resulting categorical feature.

features : list[str]

Feature names in the dataset corresponding to binary features where each feature represents a category.
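
Conceptually, grouping one-hot columns means that each row selects at most one category among them. The sketch below shows that reading in plain Python; the helper `collapse_one_hot` and the sample row are hypothetical, and returning None when no column is set is a choice made for this sketch.

```python
def collapse_one_hot(row, features):
    """Return the name of the single feature set to 1 in row, or None if none is set."""
    active = [f for f in features if row[f] == 1]
    if len(active) > 1:
        raise ValueError(f"more than one category set: {active}")
    return active[0] if active else None


# Column names taken from the "Ethnic" group in the example below; the row is invented.
ethnic_features = ["African_American", "Asian", "Hispanic", "Native_American", "Other"]
row = {"African_American": 0, "Asian": 0, "Hispanic": 1, "Native_American": 0, "Other": 0}
category = collapse_one_hot(row, ethnic_features)
print(category)
```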

Examples

from pyxai import Learning, Tools

preprocessor = Learning.TabularPreprocessor(
    "../dataset/compas.csv", 
    target_feature="Two_yr_Recidivism", 
    problem_type=Learning.CLASSIFICATION, 
    classification_type=Learning.BINARY_CLASS)

preprocessor.set_categorical_features_already_one_hot_encoded("score_factor", ["score_factor"])
preprocessor.set_categorical_features_already_one_hot_encoded("Age_Above_FourtyFive", ["Age_Above_FourtyFive"])
preprocessor.set_categorical_features_already_one_hot_encoded("Age_Below_TwentyFive", ["Age_Below_TwentyFive"])
preprocessor.set_categorical_features_already_one_hot_encoded("Ethnic", ["African_American", "Asian", "Hispanic", "Native_American", "Other"])
preprocessor.set_categorical_features_already_one_hot_encoded("Female", ["Female"])
preprocessor.set_categorical_features_already_one_hot_encoded("Misdemeanor", ["Misdemeanor"])

preprocessor.set_numerical_features({
    "Number_of_Priors": None
})

preprocessor.process()
preprocessor.export("compas", output_directory="../dataset")

Numerical Features

    def all_numerical_features(self):

Mark all features as numerical.

Examples

from pyxai import Learning, Tools

preprocessor = Learning.TabularPreprocessor(Tools.Options.dataset, target_feature="1636", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS)
preprocessor.all_numerical_features()

preprocessor.process()
dataset_name = Tools.Options.dataset.split("/")[-1].split(".")[0] 
preprocessor.export(dataset_name, output_directory=Tools.Options.output)
    def set_numerical_features(self, features_dict):

Select and encode the numerical features present in the dictionary features_dict.

Parameters

features_dict : dict[str | int] -> Lambda | None

Python dictionary whose keys are numerical features and whose values are functions (typically lambdas) that encode the data. 
Keys are given either as str (feature names) or int (feature indexes). 
A value of None means that no encoding is applied to that feature.
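
To see what such a dictionary does to individual values, the sketch below applies the "Postcode" and "Date" lambdas used on this page to sample inputs. The helper `apply_encoder` and the raw values are hypothetical; the day/month/year order follows from the indexes used in the "Date" lambda.

```python
import datetime

encoders = {
    "Postcode": lambda d: int(d),
    # Split "day/month/year" and convert the date to its proleptic Gregorian ordinal.
    "Date": lambda d: datetime.date(
        int(d.split("/")[2]), int(d.split("/")[1]), int(d.split("/")[0])
    ).toordinal(),
    "Rooms": None,  # None: keep the raw value unchanged
}


def apply_encoder(feature, raw):
    """Apply the encoder registered for a feature, or return the raw value if it is None."""
    encoder = encoders[feature]
    return raw if encoder is None else encoder(raw)


postcode = apply_encoder("Postcode", "3067")   # hypothetical raw value
date = apply_encoder("Date", "3/12/2016")      # hypothetical raw value
next_day = apply_encoder("Date", "4/12/2016")
rooms = apply_encoder("Rooms", 2)
print(postcode, date, next_day - date, rooms)
```

Converting dates to ordinals keeps their chronological order, so consecutive days map to consecutive integers.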

Examples

import datetime
from pyxai import Learning, Tools

preprocessor = Learning.TabularPreprocessor("../dataset/melb.csv", 
                                    target_feature="Type", 
                                    problem_type=Learning.CLASSIFICATION, 
                                    classification_type=Learning.MULTI_CLASS)
preprocessor.unset_features(["Address", "Suburb", "SellerG"])
preprocessor.set_categorical_features(features=["Method", "CouncilArea", "Regionname"])
preprocessor.set_numerical_features({
    "Postcode": lambda d: int(d),
    "Date": lambda d: datetime.date(int(d.split("/")[2]), int(d.split("/")[1]), int(d.split("/")[0])).toordinal(), 
    "Distance": None, "Bedroom2": None, "Bathroom": None,
    "Car": None, "Landsize": None, "BuildingArea": None, "YearBuilt": None,
    "Lattitude": None, "Longtitude": None, "Propertycount": None,
    "Rooms": None, "Price": None
})
preprocessor.process()
preprocessor.export("melb", output_directory="../dataset")

Symbols