Class TabularPreprocessor
The TabularPreprocessor class is used to preprocess tabular datasets.
A TabularPreprocessor object provides methods to:
- modify the dataset to delete features
- encode features using an encoder (OrdinalEncoder, OneHotEncoder, LabelEncoder) and put the target feature in the last column
- rewrite the dataset in a new CSV file and save the feature types (numerical or categorical) in a JSON file
def __init__(self,
dataset,
target_feature,
problem_type,
classification_type=None,
to_binary_classification=MultiClassToBinaryMethod.OneVsOne,
discretization=None):
Initialise a TabularPreprocessor object with a dataset.
Parameters
dataset : str | pandas.DataFrame | NoneData
The dataset to use, either as a path to a CSV, JSON or Excel file, or as a pandas DataFrame.
target_feature : str
The feature name of the target feature.
problem_type : str | ProblemType
The type of problem (classification, regression, …)
Possible values are defined in the ProblemType enum.
classification_type : str | ClassificationType (optional, default=None)
The type of classification (BinaryClass or MultiClass) to produce.
Possible values are defined in the ClassificationType enum.
to_binary_classification : str | MultiClassToBinaryMethod (optional, default=MultiClassToBinaryMethod.OneVsOne)
The method used to convert a multi-class dataset into binary-class datasets (OneVsOne or OneVsRest).
Possible values are defined in the MultiClassToBinaryMethod enum.
discretization : int (optional, default=None)
If this parameter is None, no discretization is used.
Else, it is the number of bins to produce.
This discretization method uses a KBinsDiscretizer to transform numerical features into categorical features (with a direct encoding).
Returns
A TabularPreprocessor object
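To give a feel for what the discretization parameter does, here is a minimal pure-Python sketch of binning a numerical feature into a fixed number of equal-width bins. It is an illustration only: the class itself relies on a KBinsDiscretizer, whose binning strategy is not stated here, and the helper name equal_width_bins is hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numerical value to a bin index in [0, n_bins - 1].

    Hypothetical helper for illustration; not part of pyxai.
    Equal-width binning is an assumption made for this sketch.
    """
    low, high = min(values), max(values)
    if high == low:
        # All values are identical: everything falls in the first bin.
        return [0 for _ in values]
    width = (high - low) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


# Toy values, not taken from a real dataset.
feature = [1.0, 2.0, 4.0, 7.0, 9.0]
bins = equal_width_bins(feature, n_bins=4)
print(bins)
```

Each value is replaced by the index of the interval it falls into, so a numerical feature becomes a categorical one with at most n_bins distinct values.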
Examples
import pandas
from pyxai import Learning, Tools
data = pandas.read_csv(Tools.Options.dataset, names=["sepal length", "sepal width", "petal length", "petal width", "Iris Plants"])
preprocessor = Learning.TabularPreprocessor(data, target_feature="Iris Plants", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS)
preprocessor.all_numerical_features()
preprocessor.process()
dataset_name = Tools.Options.dataset.split("/")[-1].split(".")[0]
preprocessor.export(dataset_name, output_directory=Tools.Options.output)
import datetime
from pyxai import Learning, Tools
preprocessor = Learning.TabularPreprocessor(Tools.Options.dataset, target_feature="Type", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS, to_binary_classification=Learning.ONE_VS_REST)
preprocessor.unset_features(["Address", "Suburb", "SellerG"])
preprocessor.set_categorical_features(features=["Method", "CouncilArea", "Regionname"])
preprocessor.set_numerical_features({
"Postcode": lambda d: int(d),
"Rooms": None,
"Price": None,
"Date": lambda d: datetime.date(int(d.split("/")[2]), int(d.split("/")[1]), int(d.split("/")[0])).toordinal(),
"Distance": None,
"Bedroom2": None,
"Bathroom": None,
"Car": None,
"Landsize": None,
"BuildingArea": None,
"YearBuilt": None,
"Lattitude": None,
"Longtitude": None,
"Propertycount": None
})
preprocessor.process()
dataset_name = Tools.Options.dataset.split("/")[-1].split(".")[0]
preprocessor.export(dataset_name, output_directory=Tools.Options.output)
See also
- Documentation page
- More examples are in the pyxai/examples/Converters directory in the source code.
Main Methods
def export(self, filename, type="csv", output_directory=None):
Export the dataset that has been transformed.
This function creates two files: the first contains the new dataset and the second is a JSON file describing the transformations applied to the data. When a multiclass classification problem is converted to a set of binary classification problems, several pairs of files are produced.
Parameters
filename : str
The base name of the new files.
type : str (optional, default=csv)
The type of the new dataset (“csv” or “xls”).
output_directory : str (optional, default=None)
The target directory for the new files (this directory must exist).
Examples
from pyxai import Learning
preprocessor = Learning.TabularPreprocessor("../dataset/iris.csv",
target_feature="Species",
problem_type=Learning.CLASSIFICATION,
classification_type=Learning.BINARY_CLASS)
preprocessor.all_numerical_features()
preprocessor.process()
preprocessor.export("iris", output_directory="../dataset")
--------------- Converter ---------------
Numbers of classes: 3
Number of boolean features: 0
Warning: conversion from MultiClass to BinaryClass: the current dataset will be convert into several new datasets with the OneVsRest method.
Dataset saved: ../dataset/iris_0.csv
Types saved: ../dataset/iris_0.types
Dataset saved: ../dataset/iris_1.csv
Types saved: ../dataset/iris_1.types
Dataset saved: ../dataset/iris_2.csv
Types saved: ../dataset/iris_2.types
-----------------------------------------------
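The log above reports one file pair per class. As a rough illustration of the idea behind a one-vs-rest conversion, the sketch below relabels a multi-class target into one binary target per class. The function name and the order in which classes are numbered are assumptions made for the example, not a description of the library's internals.

```python
def one_vs_rest_targets(labels):
    """Return {class_value: binary_labels}, with 1 for that class and 0 otherwise.

    Hypothetical helper for illustration; not part of pyxai.
    """
    classes = sorted(set(labels))
    return {c: [1 if label == c else 0 for label in labels] for c in classes}


# Toy target column with three classes.
species = ["setosa", "versicolor", "virginica", "setosa", "virginica"]
binary_targets = one_vs_rest_targets(species)

# One binary dataset per class; the numbering is assumed for the example.
for index, (cls, target) in enumerate(binary_targets.items()):
    print(f"dataset {index}: '{cls}' vs rest -> {target}")
```

With three classes this yields three binary targets, which matches the number of file pairs in the log.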
def process(self):
Transform the dataset.
Applies all encodings previously defined by other methods of this class.
- Put the target feature in the last column
- Delete the features marked for removal
- Encode categorical and numerical features
- Convert a multi-class dataset into binary-class datasets if needed
Examples
from pyxai import Learning, Tools
preprocessor = Learning.TabularPreprocessor(Tools.Options.dataset, target_feature="Type", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS, to_binary_classification=Learning.ONE_VS_REST)
preprocessor.set_categorical_features(features=["Method", "CouncilArea", "Regionname"])
preprocessor.process()
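Two of the steps performed by process(), removing unwanted features and moving the target to the last column, can be pictured with plain Python lists. This is only a sketch under assumed data structures (a header list plus row lists); the class works on its own internal representation, and reorder_columns is a hypothetical name.

```python
def reorder_columns(header, rows, target, removed=()):
    """Drop the `removed` columns and move `target` to the last position.

    Hypothetical helper for illustration; not part of pyxai.
    """
    kept = [name for name in header if name != target and name not in removed]
    new_header = kept + [target]
    # Positions of the kept columns in the original column order.
    positions = [header.index(name) for name in new_header]
    new_rows = [[row[i] for i in positions] for row in rows]
    return new_header, new_rows


# Toy rows, not taken from a real dataset.
header = ["Type", "Address", "Rooms", "Price"]
rows = [["h", "1 Example St", 2, 100],
        ["u", "2 Example St", 3, 200]]

new_header, new_rows = reorder_columns(header, rows, target="Type", removed=["Address"])
print(new_header)
print(new_rows)
```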
def unset_features(self, features):
Remove the features given as parameter.
Parameters
features : list[str | int]
List of features to remove.
In this list, a feature can be represented either by a str (a feature name) or by an int (a feature index).
Examples
from pyxai import Learning, Tools
preprocessor = Learning.TabularPreprocessor(Tools.Options.dataset, target_feature="Type", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS, to_binary_classification=Learning.ONE_VS_REST)
preprocessor.unset_features(["Address", "Suburb", "SellerG"])
preprocessor.process()
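Since features may be given by name or by index, it can help to see how a mixed list maps back to column names. The sketch below is a hypothetical helper, not the library's code, and it assumes zero-based indexes that follow the column order of the original dataset.

```python
def resolve_features(header, features):
    """Translate a mixed list of names (str) and indexes (int) into names.

    Hypothetical helper for illustration; not part of pyxai.
    Zero-based indexing is an assumption of this sketch.
    """
    resolved = []
    for feature in features:
        if isinstance(feature, int):
            resolved.append(header[feature])  # index into the original column order
        elif feature in header:
            resolved.append(feature)
        else:
            raise ValueError(f"Unknown feature: {feature!r}")
    return resolved


header = ["Suburb", "Address", "Rooms", "Type", "SellerG"]
to_remove = resolve_features(header, ["Address", 0, "SellerG"])
remaining = [name for name in header if name not in to_remove]
print(to_remove)
print(remaining)
```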
Categorical Features
def set_categorical_features(self, features=None, encoder=EncoderType.OneHotEncoder):
Encode the categorical features given in the features parameter with the encoder given in the encoder parameter.
Parameters
features : list[str | int]
The features to encode.
These features can be given either as indexes (respecting the order of the given dataset) or directly via strings corresponding to their names.
encoder : str | EncoderType (optional, default=EncoderType.OneHotEncoder)
The type of encoder (one-hot or ordinal)
Possible values are defined in the EncoderType enum.
Examples
from pyxai import Learning, Tools
preprocessor = Learning.TabularPreprocessor(Tools.Options.dataset, target_feature="Type", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS, to_binary_classification=Learning.ONE_VS_REST)
preprocessor.set_categorical_features(features=["Method", "CouncilArea", "Regionname"])
preprocessor.process()
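The difference between the two encoder types can be shown on a single toy column. The following pure-Python sketch is an illustration of ordinal and one-hot encoding in general; the function names are hypothetical, and the sorted category order is an assumption rather than the order the library is known to use.

```python
def ordinal_encode(values):
    """Replace each category by an integer code (sorted order, assumed here)."""
    codes = {category: code for code, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def one_hot_encode(values):
    """Replace each category by a 0/1 vector containing a single 1."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


# Toy values for a categorical column.
method = ["S", "SP", "PI", "S"]
ordinal = ordinal_encode(method)
categories, one_hot = one_hot_encode(method)
print(ordinal)
print(categories, one_hot)
```

Ordinal encoding keeps one column but introduces an arbitrary order between categories, whereas one-hot encoding creates one binary column per category.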
def set_categorical_features_already_one_hot_encoded(self, name, features):
Identify a set of binary features coming from a one hot encoding as a categorical feature.
Parameters
name : str
The name of the resulting categorical feature.
features : list[str]
Feature names in the dataset corresponding to binary features where each feature represents a category.
Examples
from pyxai import Learning, Tools
preprocessor = Learning.TabularPreprocessor(
"../dataset/compas.csv",
target_feature="Two_yr_Recidivism",
problem_type=Learning.CLASSIFICATION,
classification_type=Learning.BINARY_CLASS)
preprocessor.set_categorical_features_already_one_hot_encoded("score_factor", ["score_factor"])
preprocessor.set_categorical_features_already_one_hot_encoded("Age_Above_FourtyFive", ["Age_Above_FourtyFive"])
preprocessor.set_categorical_features_already_one_hot_encoded("Age_Below_TwentyFive", ["Age_Below_TwentyFive"])
preprocessor.set_categorical_features_already_one_hot_encoded("Ethnic", ["African_American", "Asian", "Hispanic", "Native_American", "Other"])
preprocessor.set_categorical_features_already_one_hot_encoded("Female", ["Female"])
preprocessor.set_categorical_features_already_one_hot_encoded("Misdemeanor", ["Misdemeanor"])
preprocessor.set_numerical_features({
"Number_of_Priors": None
})
preprocessor.process()
preprocessor.export("compas", output_directory="../dataset")
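Conceptually, this method tells the preprocessor that several binary columns are one categorical feature. The sketch below shows that grouping on a single row represented as a dictionary. It is a hypothetical helper written for illustration, and the way an all-zero row is handled (value None) is an assumption.

```python
def collapse_one_hot(row, name, features):
    """Return a copy of `row` where the binary `features` are replaced by a
    single categorical column `name`.

    The new value is the name of the active feature, or None when no feature
    is set. Hypothetical helper for illustration; not part of pyxai.
    """
    active = [f for f in features if row[f] == 1]
    if len(active) > 1:
        raise ValueError(f"More than one active category for {name!r}: {active}")
    collapsed = {k: v for k, v in row.items() if k not in features}
    collapsed[name] = active[0] if active else None
    return collapsed


ethnic_columns = ["African_American", "Asian", "Hispanic", "Native_American", "Other"]
# Toy row, not taken from the real dataset.
row = {"Number_of_Priors": 2, "African_American": 0, "Asian": 0,
       "Hispanic": 1, "Native_American": 0, "Other": 0}

collapsed = collapse_one_hot(row, "Ethnic", ethnic_columns)
print(collapsed)
```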
Numerical Features
def all_numerical_features(self):
Identify all features as numerical.
Examples
from pyxai import Learning, Tools
preprocessor = Learning.TabularPreprocessor(Tools.Options.dataset, target_feature="1636", problem_type=Learning.CLASSIFICATION, classification_type=Learning.BINARY_CLASS)
preprocessor.all_numerical_features()
preprocessor.process()
dataset_name = Tools.Options.dataset.split("/")[-1].split(".")[0]
preprocessor.export(dataset_name, output_directory=Tools.Options.output)
def set_numerical_features(self, features_dict):
Select and encode the numerical features present in the dictionary features_dict.
Parameters
features_dict : dict[str | int] -> Lambda | None
Python dictionary where keys are numerical features and values are lambda functions representing data encodings.
The keys must be represented either by str (feature names) or int (feature indexes).
A value of None means that no encoding is applied to the feature.
Examples
import datetime
from pyxai import Learning, Tools
preprocessor = Learning.TabularPreprocessor("../dataset/melb.csv",
target_feature="Type",
problem_type=Learning.CLASSIFICATION,
classification_type=Learning.MULTI_CLASS)
preprocessor.unset_features(["Address", "Suburb", "SellerG"])
preprocessor.set_categorical_features(features=["Method", "CouncilArea", "Regionname"])
preprocessor.set_numerical_features({
"Postcode": lambda d: int(d),
"Date": lambda d: datetime.date(int(d.split("/")[2]), int(d.split("/")[1]), int(d.split("/")[0])).toordinal(),
"Distance": None, "Bedroom2": None, "Bathroom": None,
"Car": None, "Landsize": None, "BuildingArea": None, "YearBuilt": None,
"Lattitude": None, "Longtitude": None, "Propertycount": None,
"Rooms": None, "Price": None
})
preprocessor.process()
preprocessor.export("melb", output_directory="../dataset")
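To make the role of the dictionary values concrete, the sketch below applies a set of converters to one row, leaving a feature untouched when its converter is None. The helper names are hypothetical and the row is toy data. The date converter follows the same day/month/year convention as the lambda in the example above.

```python
import datetime


def apply_converters(row, converters):
    """Apply each converter to its feature; None leaves the value untouched.

    Hypothetical helper for illustration; not part of pyxai.
    """
    converted = dict(row)
    for feature, convert in converters.items():
        if convert is not None:
            converted[feature] = convert(row[feature])
    return converted


def date_to_ordinal(text):
    """Parse a day/month/year string and return its proleptic Gregorian ordinal."""
    day, month, year = (int(part) for part in text.split("/"))
    return datetime.date(year, month, day).toordinal()


converters = {"Postcode": int, "Date": date_to_ordinal, "Rooms": None}
row = {"Postcode": "3067", "Date": "3/12/2016", "Rooms": 2}
converted = apply_converters(row, converters)
print(converted)
```

Using a named function instead of a long lambda keeps the date parsing readable and avoids splitting the string three times.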