{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PyXAI simplifies the preparation/cleaning of datasets using a preprocessor. It allows you to: \n", "- Modify the dataset:\n", " - feature deletion \n", " - feature encoding (ordinal, one-hot, label) \n", " - selection of the target feature \n", " - possible conversion of a multi-class classification problem into a binary classification one\n", " - distinction between numerical and categorical features\n", "\n", "\n", "- Export the dataset with these modifications and create a new JSON file reflecting the modifications made. In particular, this file gives PyXAI the information required to compute explanations using [theories]({{ site.baseurl }}/documentation/explainer/theories/).\n", "\n", "To create a preprocessor object, you just have to call:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ```TabularPreprocessor``` class preprocesses tabular datasets. It gives access to methods that allow you to delete features, encode features using an encoder (```OrdinalEncoder```, ```OneHotEncoder```, ```LabelEncoder```), put the target feature in the last column, rewrite the dataset in a new file, and save the feature types (numerical or categorical) in a JSON file. More details about ```TabularPreprocessor``` are given in the [API](/pyxai/documentation/api/classes/tabularPreprocessor/) page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following code, we illustrate the ```ONE_VS_REST``` strategy on the [iris.csv]({{ site.baseurl }}/assets/notebooks/dataset/iris.csv) dataset. This strategy transforms a multi-class classification problem into several binary classification problems. 
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2026-05-15T14:33:33.499041Z", "iopub.status.busy": "2026-05-15T14:33:33.498922Z", "iopub.status.idle": "2026-05-15T14:33:35.672144Z", "shell.execute_reply": "2026-05-15T14:33:35.671858Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-------------- Information ---------------\n", "Problem type: classification\n", "Instances type: tabular\n", "Labels type: None\n", "\n", "Dataset path: None\n", "--------------- Converter ---------------\n", "Numbers of classes: 3\n", "Number of boolean features: 0\n", "Warning: conversion from MultiClass to BinaryClass: the current dataset will be convert into several new datasets with the MultiClassToBinaryMethod.OneVsOne method.\n", "MethodToBinaryClassification.OneVsOne: 50 - 50\n", "MethodToBinaryClassification.OneVsOne: 50 - 50\n", "MethodToBinaryClassification.OneVsOne: 50 - 50\n", "Dataset saved: ../dataset/iris_0_vs_1.csv\n", "Types saved: ../dataset/iris_0_vs_1.types\n", "Dataset saved: ../dataset/iris_0_vs_2.csv\n", "Types saved: ../dataset/iris_0_vs_2.types\n", "Dataset saved: ../dataset/iris_1_vs_2.csv\n", "Types saved: ../dataset/iris_1_vs_2.types\n", "-----------------------------------------------\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/nicolas/Bureau/PyXAI/pyxai-mlp/pyxai/sources/learners/preprocessor/tabular_preprocessor.py:640: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. 
To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n", " data[self.target_features_name] = data[self.target_features_name].replace(\"v0\", 0)\n" ] } ], "source": [ "from pyxai import Learning\n", "\n", "preprocessor = Learning.TabularPreprocessor(\"../dataset/iris.csv\", \n", " target_feature=\"Species\", \n", " problem_type=Learning.CLASSIFICATION, \n", " classification_type=Learning.BINARY_CLASS)\n", "\n", "preprocessor.all_numerical_features()\n", "\n", "preprocessor.process() \n", "preprocessor.export(\"iris\", output_directory=\"../dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After creating the preprocessor, we use the ```all_numerical_features``` method to declare that every feature is numerical, and we call the ```process``` method to apply the binary conversion and modify the data accordingly. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ```process``` method applies all the transformations defined on the preprocessor object to the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To finish, we call the ```export``` method to save the transformed data in new files. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ```export``` method saves to disk the transformed dataset and a JSON file containing information about the transformations (feature types, encoders used)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two kinds of files are generated. The first kind consists of the new binary classification datasets (```iris_0.csv```, ```iris_1.csv```, ```iris_2.csv```). 
We display here the first five lines of ```iris_0.csv```:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2026-05-15T14:33:35.691315Z", "iopub.status.busy": "2026-05-15T14:33:35.691131Z", "iopub.status.idle": "2026-05-15T14:33:35.693996Z", "shell.execute_reply": "2026-05-15T14:33:35.693345Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species', '5.1,3.5,1.4,0.2,1', '4.9,3.0,1.4,0.2,1', '4.7,3.2,1.3,0.2,1', '4.6,3.1,1.5,0.2,1']\n" ] } ], "source": [ "with open(\"../dataset/iris_0.csv\", 'r') as f:\n", " print(f.read().splitlines()[0:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The second kind consists of JSON files containing information about the transformations performed on the initial data:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2026-05-15T14:33:35.695321Z", "iopub.status.busy": "2026-05-15T14:33:35.695200Z", "iopub.status.idle": "2026-05-15T14:33:35.697184Z", "shell.execute_reply": "2026-05-15T14:33:35.696790Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"Sepal.Length\": {\n", " \"type:\": \"NUMERICAL\",\n", " \"encoder:\": \"None\"\n", " },\n", " \"Sepal.Width\": {\n", " \"type:\": \"NUMERICAL\",\n", " \"encoder:\": \"None\"\n", " },\n", " \"Petal.Length\": {\n", " \"type:\": \"NUMERICAL\",\n", " \"encoder:\": \"None\"\n", " },\n", " \"Petal.Width\": {\n", " \"type:\": \"NUMERICAL\",\n", " \"encoder:\": \"None\"\n", " },\n", " \"Species\": {\n", " \"type:\": \"Classification\",\n", " \"encoder:\": \"LabelEncoder\",\n", " \"classes:\": [\n", " \"Iris-setosa\",\n", " \"Iris-versicolor\",\n", " \"Iris-virginica\"\n", " ],\n", " \"binary_conversion:\": {\n", " \"Method\": \"OneVsRest\",\n", " \"0\": [\n", " 1,\n", " 2\n", " ],\n", " \"1\": [\n", " 
0\n", " ]\n", " }\n", " }\n", "}\n" ] } ], "source": [ "with open(\"../dataset/iris_0.types\", 'r') as f:\n", " print(f.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After this short introduction, we now present in detail the different methods related to the characteristics of features. On the one hand, numerical data refers to numbers. For example, a numerical feature can represent the values of a given probe or the ages of a set of individuals. On the other hand, categorical data refer to non-numerical information divided into groups. Categorical data describes categories or groups, such as, for example, the color of someone's hair. Finally, PyXAI can also identify and encode categorical features for which only two groups are considered, i.e., binary features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numerical Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we analyse and encode some features of the Melbourne Housing Market dataset ([melb.csv]({{ site.baseurl }}/assets/notebooks/dataset/melb.csv)): " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2026-05-15T14:33:35.698487Z", "iopub.status.busy": "2026-05-15T14:33:35.698406Z", "iopub.status.idle": "2026-05-15T14:33:35.725012Z", "shell.execute_reply": "2026-05-15T14:33:35.724542Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-------------- Information ---------------\n", "Problem type: classification\n", "Instances type: tabular\n", "Labels type: None\n", "\n", "Dataset path: None\n" ] } ], "source": [ "import datetime\n", "from pyxai import Learning, Tools\n", "\n", "preprocessor = Learning.TabularPreprocessor(\"../dataset/melb.csv\", \n", " target_feature=\"Type\", \n", " problem_type=Learning.CLASSIFICATION, \n", " classification_type=Learning.MULTI_CLASS)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, 
we delete some irrelevant or redundant features with the ```unset_features``` method:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ```unset_features``` method removes one or more features from the dataset so they are not taken into account during preprocessing." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2026-05-15T14:33:35.726227Z", "iopub.status.busy": "2026-05-15T14:33:35.726102Z", "iopub.status.idle": "2026-05-15T14:33:35.728229Z", "shell.execute_reply": "2026-05-15T14:33:35.727888Z" }, "scrolled": true }, "outputs": [], "source": [ "preprocessor.unset_features([\"Address\", \"Suburb\", \"SellerG\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We select the categorical features using the ```set_categorical_features``` method (more details about this method are given in the next section):" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2026-05-15T14:33:35.729358Z", "iopub.status.busy": "2026-05-15T14:33:35.729250Z", "iopub.status.idle": "2026-05-15T14:33:35.730989Z", "shell.execute_reply": "2026-05-15T14:33:35.730652Z" }, "scrolled": true }, "outputs": [], "source": [ "preprocessor.set_categorical_features(features=[\"Method\", \"CouncilArea\", \"Regionname\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Conversely, the ```set_numerical_features``` method allows you to select and encode the numerical features: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ```set_numerical_features``` method selects and optionally encodes numerical features via a dictionary mapping feature names to lambda functions (use ```None``` for no encoding)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next few lines of code, each key of the dictionary passed as a parameter represents a feature (\"Postcode\", \"Date\", \"Distance\", ...). 
For the first feature (\"Postcode\"), an encoding is performed in order to convert each data value (string) into an integer. The values of the second feature (\"Date\"), given as day/month/year strings, are converted into ordinal values with the ```datetime``` module. All other features are kept as they are, since they are standard numerical values. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2026-05-15T14:33:35.731930Z", "iopub.status.busy": "2026-05-15T14:33:35.731846Z", "iopub.status.idle": "2026-05-15T14:33:35.733920Z", "shell.execute_reply": "2026-05-15T14:33:35.733598Z" }, "scrolled": true }, "outputs": [], "source": [ "preprocessor.set_numerical_features({\n", " # Convert each postcode string into an integer.\n", " \"Postcode\": lambda d: int(d),\n", " # Convert each day/month/year string into a date ordinal.\n", " \"Date\": lambda d: datetime.date(int(d.split(\"/\")[2]), int(d.split(\"/\")[1]), int(d.split(\"/\")[0])).toordinal(), \n", " # None keeps the values unchanged.\n", " \"Distance\": None, \"Bedroom2\": None, \"Bathroom\": None,\n", " \"Car\": None, \"Landsize\": None, \"BuildingArea\": None, \"YearBuilt\": None,\n", " \"Lattitude\": None, \"Longtitude\": None, \"Propertycount\": None,\n", " \"Rooms\": None, \"Price\": None\n", " })" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we execute the ```process``` method to apply all modifications and we export the new dataset with the ```export``` method. 
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2026-05-15T14:33:35.734876Z", "iopub.status.busy": "2026-05-15T14:33:35.734794Z", "iopub.status.idle": "2026-05-15T14:33:35.905280Z", "shell.execute_reply": "2026-05-15T14:33:35.904661Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------- Converter ---------------\n", "Feature deleted: Suburb\n", "Feature deleted: Address\n", "Feature deleted: SellerG\n", "One hot encoding new features for Method: 5\n", "One hot encoding new features for CouncilArea: 34\n", "One hot encoding new features for Regionname: 8\n", "Numbers of classes: 3\n", "Number of boolean features: 0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Dataset saved: ../dataset/melb_out.csv\n", "Types saved: ../dataset/melb_out.types\n", "-----------------------------------------------\n" ] } ], "source": [ "preprocessor.process()\n", "preprocessor.export(\"melb_out\", output_directory=\"../dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ```melb_out.types``` file contains all information about the conversions made. We display only the first twenty lines." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2026-05-15T14:33:35.906647Z", "iopub.status.busy": "2026-05-15T14:33:35.906528Z", "iopub.status.idle": "2026-05-15T14:33:35.908785Z", "shell.execute_reply": "2026-05-15T14:33:35.908480Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"Rooms\": {\n", " \"type:\": \"TypeFeature.Numerical\",\n", " \"encoder:\": \"None\"\n", " },\n", " \"Propertycount\": {\n", " \"type:\": \"TypeFeature.Numerical\",\n", " \"encoder:\": \"LabelEncoder\"\n", " },\n", " \"Price\": {\n", " \"type:\": \"TypeFeature.Numerical\",\n", " \"encoder:\": \"None\"\n", " },\n", " \"Method_PI\": {\n", " \"type:\": \"TypeFeature.Categorical\",\n", " \"encoder:\": \"EncoderType.OneHotEncoder\",\n", " \"original_feature:\": \"Method\",\n", " \"original_values:\": [\n", " \"PI\",\n", " [\n" ] } ], "source": [ "# Print the first twenty lines of the generated types file.\n", "with open(\"../dataset/melb_out.types\", 'r') as f:\n", "    for line in f.read().splitlines()[0:20]:\n", "        print(line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now introduce the methods that handle categorical features. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The preprocessor deals with categorical features. It identifies them and can perform one-hot or ordinal encoding. The method ```set_categorical_features``` performs this step: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ```set_categorical_features``` method encodes categorical features using either one-hot encoding (```Learning.ONE_HOT```) or ordinal encoding (```Learning.ORDINAL```)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the categorical features \"Method\", \"CouncilArea\" and \"Regionname\", the following line of code creates one binary column for each category: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "preprocessor.set_categorical_features(features=[\"Method\", \"CouncilArea\", \"Regionname\"])\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If categorical features have already been encoded, you can use the method ```set_categorical_features_already_one_hot_encoded``` to identify them as categorical features. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If categorical features have already been one-hot encoded in the dataset, the ```set_categorical_features_already_one_hot_encoded``` method allows you to group them under a single categorical feature name." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is an example where this method is used:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2026-05-15T14:33:35.910209Z", "iopub.status.busy": "2026-05-15T14:33:35.910113Z", "iopub.status.idle": "2026-05-15T14:33:35.923933Z", "shell.execute_reply": "2026-05-15T14:33:35.923598Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-------------- Information ---------------\n", "Problem type: classification\n", "Instances type: tabular\n", "Labels type: None\n", "\n", "Dataset path: None\n", "The feature score_factor is boolean! No One Hot Encoding for this features.\n", "The feature Age_Above_FourtyFive is boolean! No One Hot Encoding for this features.\n", "The feature Age_Below_TwentyFive is boolean! No One Hot Encoding for this features.\n", "The feature Female is boolean! No One Hot Encoding for this features.\n", "The feature Misdemeanor is boolean! 
No One Hot Encoding for this features.\n", "--------------- Converter ---------------\n", "Numbers of classes: 2\n", "Number of boolean features: 5\n", "Dataset saved: ../dataset/compas.csv\n", "Types saved: ../dataset/compas.types\n", "-----------------------------------------------\n" ] } ], "source": [ "from pyxai import Learning, Tools\n", "\n", "preprocessor = Learning.TabularPreprocessor(\n", " \"../dataset/compas.csv\", \n", " target_feature=\"Two_yr_Recidivism\", \n", " problem_type=Learning.CLASSIFICATION, \n", " classification_type=Learning.BINARY_CLASS)\n", "\n", "preprocessor.set_categorical_features_already_one_hot_encoded(\"score_factor\", [\"score_factor\"])\n", "preprocessor.set_categorical_features_already_one_hot_encoded(\"Age_Above_FourtyFive\", [\"Age_Above_FourtyFive\"])\n", "preprocessor.set_categorical_features_already_one_hot_encoded(\"Age_Below_TwentyFive\", [\"Age_Below_TwentyFive\"])\n", "preprocessor.set_categorical_features_already_one_hot_encoded(\"Ethnic\", [\"African_American\", \"Asian\", \"Hispanic\", \"Native_American\", \"Other\"])\n", "preprocessor.set_categorical_features_already_one_hot_encoded(\"Female\", [\"Female\"])\n", "preprocessor.set_categorical_features_already_one_hot_encoded(\"Misdemeanor\", [\"Misdemeanor\"])\n", "\n", "preprocessor.set_numerical_features({\n", " \"Number_of_Priors\": None\n", "})\n", "\n", "preprocessor.process()\n", "preprocessor.export(\"compas\", output_directory=\"../dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "{: .attention }\n", "\n", "> All these examples can be used to define categorical, numerical and binary features in order to use them in some theories. See this page for more details about [theories]({{ site.baseurl }}/documentation/explainer/theories/). 
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 4 }