
Non-Tabular Data

The NonTabularPreprocessor class is used to build a JSON dataset from non-tabular data such as images. This JSON file can then be used by PyXAI to load instances and compute explanations.

More details about NonTabularPreprocessor are given on the API page.

To create a NonTabularPreprocessor object, call:

preprocessor = Learning.NonTabularPreprocessor(problem_type, instances_type, labels_type)

where:

problem_type is Learning.CLASSIFICATION or Learning.REGRESSION
instances_type is the type of instances, e.g. Learning.IMAGE
labels_type is the type of labels, e.g. Learning.CLASSES

Building a Dataset from MNIST Images

In the following example, we build a binary classification dataset from MNIST that distinguishes digit 0 from digit 8. We use 10 pre-saved images per class, stored in the mnist_images/ folder and named mnist_{label}_{index}.png. For each class, the first 8 images (indices 0 to 7) are assigned to the training set and the last 2 (indices 8 and 9) to the test set.

import os
from pyxai import Learning

img_dir = '../dataset/mnist_images'

preprocessor = Learning.NonTabularPreprocessor(
    problem_type=Learning.CLASSIFICATION,
    instances_type=Learning.IMAGE,
    labels_type=Learning.CLASSES
)

files = sorted(f for f in os.listdir(img_dir) if f.endswith('.png'))

for instance_id, file_name in enumerate(files):
    # File name format: mnist_{label}_{index:02d}.png
    parts = os.path.splitext(file_name)[0].split('_')
    label = int(parts[1])
    index = int(parts[2])
    # Indices 0-7 go to the training set, 8-9 to the test set
    subset = Learning.TRAIN if index < 8 else Learning.TEST
    preprocessor.add_instance_image(
        instance_id=instance_id,
        # Path is given relative to the JSON file location, not to the working directory
        file_path=os.path.join('mnist_images', file_name),
        instances_set=subset
    )
    preprocessor.add_label_class(instance_id=instance_id, label=label)

n_train = sum(1 for v in preprocessor.instances.values() if v['subset'] == str(Learning.TRAIN))
n_test  = sum(1 for v in preprocessor.instances.values() if v['subset'] == str(Learning.TEST))
print(f"Train instances: {n_train}")
print(f"Test  instances: {n_test}")
print(f"Total: {len(files)}")

Train instances: 16
Test  instances: 4
Total: 20

The add_instance_image method registers an image file as an instance and assigns it to a subset (training or test set). The add_label_class method associates a class label with each instance.
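The loop above relies on every file name following the mnist_{label}_{index}.png convention, and a stray file in the folder would make int(parts[1]) fail. As a minimal sketch, the parsing step could be isolated in a small helper that validates the name first. The helper parse_mnist_filename, its n_train parameter and the regular expression are hypothetical names introduced here for illustration; they are not part of PyXAI.

```python
import re

# Hypothetical pattern for the naming convention used in this example:
# mnist_{label}_{index}.png, where label and index are non-negative integers.
FILENAME_PATTERN = re.compile(r"^mnist_(\d+)_(\d+)\.png$")


def parse_mnist_filename(file_name, n_train=8):
    """Return (label, index, is_train) for a file named mnist_{label}_{index}.png.

    is_train is True when index < n_train, mirroring the split rule of the example.
    Raises ValueError if the name does not follow the convention.
    """
    match = FILENAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name!r}")
    label = int(match.group(1))
    index = int(match.group(2))
    return label, index, index < n_train


print(parse_mnist_filename("mnist_0_00.png"))  # → (0, 0, True)
print(parse_mnist_filename("mnist_8_09.png"))  # → (8, 9, False)
```

In the loop, the boolean could then be mapped to Learning.TRAIN or Learning.TEST before calling add_instance_image.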

Once all instances and labels are registered, call to_json to export the dataset:

json_path = '../dataset/mnist_0_vs_8.json'
preprocessor.to_json(json_path)
print(f"Dataset saved to {json_path}")

Dataset saved to ../dataset/mnist_0_vs_8.json

The generated JSON file contains general information about the dataset (problem type, instance type, label type), the list of instances with their file paths and subsets, and the corresponding labels:

import json
with open(json_path) as f:
    data = json.load(f)

print("Specificities:", data["specificities"])
print("First instance:", list(data["instances"].items())[0])
print("First label:", list(data["labels"].items())[0])

Specificities: {'problem_type': 'classification', 'instances_type': 'image', 'labels_type': 'classes'}
First instance: ('0', {'file_name': 'mnist_images/mnist_0_00.png', 'subset': 'InstancesSet.Train'})
First label: ('0', {'instance_id': 0, 'label': 0})
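To check the exported file without going through PyXAI, the structure printed above can be read with the standard library alone. The following is a rough sketch, assuming only the keys visible in that output ("instances", "labels", "file_name", "subset", "instance_id", "label"). The function group_by_subset and the sample dictionary are illustrative, and the 'InstancesSet.Test' string in the sample is an assumption by analogy with 'InstancesSet.Train'.

```python
from collections import defaultdict


def group_by_subset(data):
    """Return {subset: [(instance_id, file_name, label), ...]} from the loaded JSON."""
    # Index labels by the instance they refer to, using the "instance_id" field
    # rather than assuming the keys of "labels" match the keys of "instances".
    label_of = {
        str(entry["instance_id"]): entry["label"]
        for entry in data["labels"].values()
    }
    groups = defaultdict(list)
    for instance_id, info in data["instances"].items():
        # label_of.get returns None for an instance without a registered label
        groups[info["subset"]].append(
            (instance_id, info["file_name"], label_of.get(instance_id))
        )
    return dict(groups)


# Hand-written sample mirroring the printed structure (not real exported data).
sample = {
    "instances": {
        "0": {"file_name": "mnist_images/mnist_0_00.png", "subset": "InstancesSet.Train"},
        "1": {"file_name": "mnist_images/mnist_0_08.png", "subset": "InstancesSet.Test"},
    },
    "labels": {
        "0": {"instance_id": 0, "label": 0},
        "1": {"instance_id": 1, "label": 0},
    },
}

groups = group_by_subset(sample)
for subset, rows in groups.items():
    print(subset, rows)
```

With a real export, data would come from json.load as in the snippet above, and the length of each group should match the train and test counts printed earlier.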

The file paths stored in the JSON file are relative to the location of that file. If you move the JSON file, move the image folder with it so that the relative paths remain valid.
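One way to catch a broken directory layout early is to resolve each stored path against the folder containing the JSON file and check that the image exists. The sketch below uses only the standard library and assumes the "instances" and "file_name" keys shown in the output above. The helpers resolve_image_paths and missing_images are hypothetical, and the demo builds a throwaway folder with an empty placeholder file instead of real images.

```python
import json
import tempfile
from pathlib import Path


def resolve_image_paths(json_path):
    """Map each instance id to its image path, resolved against the JSON file's folder."""
    json_path = Path(json_path)
    base_dir = json_path.resolve().parent
    with open(json_path) as f:
        data = json.load(f)
    return {
        instance_id: base_dir / info["file_name"]
        for instance_id, info in data["instances"].items()
    }


def missing_images(json_path):
    """Return the ids of instances whose image file cannot be found."""
    return [
        instance_id
        for instance_id, path in resolve_image_paths(json_path).items()
        if not path.is_file()
    ]


# Demo in a temporary folder: one placeholder image is present, one is absent.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "mnist_images").mkdir()
    (root / "mnist_images" / "mnist_0_00.png").write_bytes(b"")  # empty placeholder, not a real image
    demo_json = root / "demo.json"
    demo_json.write_text(json.dumps({
        "instances": {
            "0": {"file_name": "mnist_images/mnist_0_00.png", "subset": "InstancesSet.Train"},
            "1": {"file_name": "mnist_images/mnist_8_00.png", "subset": "InstancesSet.Train"},
        }
    }))
    demo_missing = missing_images(demo_json)
    demo_first_name = resolve_image_paths(demo_json)["0"].name

print(demo_missing)  # → ['1']
```

Running missing_images on the exported file after moving or copying the dataset gives a quick indication of whether the images moved along with it.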