{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fbae8b97",
   "metadata": {},
   "source": [
    "# Non-Tabular Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbb2b986",
   "metadata": {},
   "source": [
    "The ```NonTabularPreprocessor``` class builds a JSON dataset from non-tabular data such as images. PyXAI can then use this JSON file to load instances and compute explanations.\n",
    "\n",
    "More details about ```NonTabularPreprocessor``` are given in the [API](/pyxai/documentation/api/classes/nonTabularPreprocessor/) page."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72806adb",
   "metadata": {},
   "source": [
    "To create a ```NonTabularPreprocessor``` object, call:\n",
    "\n",
    "```python\n",
    "preprocessor = Learning.NonTabularPreprocessor(problem_type, instances_type, labels_type)\n",
    "```\n",
    "\n",
    "where:\n",
    "- `problem_type` is ```Learning.CLASSIFICATION``` or ```Learning.REGRESSION```\n",
    "- `instances_type` is the type of instances, e.g. ```Learning.IMAGE```\n",
    "- `labels_type` is the type of labels, e.g. ```Learning.CLASSES```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e0b4acd",
   "metadata": {},
   "source": [
    "## Building a Dataset from MNIST Images\n",
    "\n",
    "In the following example, we build a binary classification dataset from [MNIST](http://yann.lecun.com/exdb/mnist/) that distinguishes digit **0** from digit **8**. We use 10 pre-saved images per class, stored in the `mnist_images/` folder and named `mnist_{label}_{index}.png`, where `{index}` is a zero-padded two-digit number from `00` to `09`. For each class, the first 8 images (indices 0 to 7) go to the training set and the last 2 (indices 8 and 9) go to the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "0c741a81",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-16T08:07:52.371258Z",
     "iopub.status.busy": "2026-05-16T08:07:52.371163Z",
     "iopub.status.idle": "2026-05-16T08:07:55.368006Z",
     "shell.execute_reply": "2026-05-16T08:07:55.367490Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train instances: 16\n",
      "Test  instances: 4\n",
      "Total: 20\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "from pyxai import Learning\n",
    "\n",
    "img_dir = '../dataset/mnist_images'\n",
    "\n",
    "preprocessor = Learning.NonTabularPreprocessor(\n",
    "    problem_type=Learning.CLASSIFICATION,\n",
    "    instances_type=Learning.IMAGE,\n",
    "    labels_type=Learning.CLASSES\n",
    ")\n",
    "\n",
    "files = sorted(f for f in os.listdir(img_dir) if f.endswith('.png'))\n",
    "\n",
    "for instance_id, file_name in enumerate(files):\n",
    "    # File name format: mnist_{label}_{index:02d}.png, e.g. mnist_0_00.png\n",
    "    parts = os.path.splitext(file_name)[0].split('_')\n",
    "    label = int(parts[1])\n",
    "    index = int(parts[2])\n",
    "    # Indices 0-7 go to the training set, indices 8-9 to the test set\n",
    "    subset = Learning.TRAIN if index < 8 else Learning.TEST\n",
    "    preprocessor.add_instance_image(\n",
    "        instance_id=instance_id,\n",
    "        # Stored path is relative to the folder of the JSON file (../dataset), not to this notebook\n",
    "        file_path=os.path.join('mnist_images', file_name),\n",
    "        instances_set=subset\n",
    "    )\n",
    "    preprocessor.add_label_class(instance_id=instance_id, label=label)\n",
    "\n",
    "n_train = sum(1 for v in preprocessor.instances.values() if v['subset'] == str(Learning.TRAIN))\n",
    "n_test  = sum(1 for v in preprocessor.instances.values() if v['subset'] == str(Learning.TEST))\n",
    "print(f\"Train instances: {n_train}\")\n",
    "print(f\"Test  instances: {n_test}\")\n",
    "print(f\"Total: {len(files)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ad672c3",
   "metadata": {},
   "source": [
    "The ```add_instance_image``` method registers an image file as an instance and assigns it to a subset (training or test set). The ```add_label_class``` method associates a class label with each instance."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06790d5c",
   "metadata": {},
   "source": [
    "Once all instances and labels are registered, call ```to_json``` to export the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "734f0f14",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-16T08:07:55.369730Z",
     "iopub.status.busy": "2026-05-16T08:07:55.369207Z",
     "iopub.status.idle": "2026-05-16T08:07:55.371986Z",
     "shell.execute_reply": "2026-05-16T08:07:55.371634Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset saved to ../dataset/mnist_0_vs_8.json\n"
     ]
    }
   ],
   "source": [
    "json_path = '../dataset/mnist_0_vs_8.json'\n",
    "preprocessor.to_json(json_path)\n",
    "print(f\"Dataset saved to {json_path}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f83f28b4",
   "metadata": {},
   "source": [
    "The generated JSON file has three top-level keys: `specificities` (problem type, instance type, and label type), `instances` (the file path and subset of each instance), and `labels` (the label attached to each instance):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "b8561f1e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-16T08:07:55.373027Z",
     "iopub.status.busy": "2026-05-16T08:07:55.372924Z",
     "iopub.status.idle": "2026-05-16T08:07:55.375075Z",
     "shell.execute_reply": "2026-05-16T08:07:55.374746Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Specificities: {'problem_type': 'classification', 'instances_type': 'image', 'labels_type': 'classes'}\n",
      "First instance: ('0', {'file_name': 'mnist_images/mnist_0_00.png', 'subset': 'InstancesSet.Train'})\n",
      "First label: ('0', {'instance_id': 0, 'label': 0})\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "with open(json_path) as f:\n",
    "    data = json.load(f)\n",
    "\n",
    "print(\"Specificities:\", data[\"specificities\"])\n",
    "print(\"First instance:\", list(data[\"instances\"].items())[0])\n",
    "print(\"First label:\", list(data[\"labels\"].items())[0])"
   ]
  },
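  {
   "cell_type": "markdown",
   "id": "c3d91f4e",
   "metadata": {},
   "source": [
    "The structure printed above can be used to summarize a dataset without going through the preprocessor again. The following is a minimal sketch, not part of PyXAI: `count_by_subset_and_label` is a hypothetical helper, and it assumes that every entry of `labels` carries an integer `instance_id` matching a string key of `instances`, as the first entries shown above suggest. The `sample` dictionary is hand-written to mirror that structure.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "\n",
    "def count_by_subset_and_label(data):\n",
    "    # Hypothetical helper (not a PyXAI function).\n",
    "    # Assumes labels[*]['instance_id'] is an int whose string form is a key of data['instances'].\n",
    "    counts = Counter()\n",
    "    for entry in data['labels'].values():\n",
    "        instance = data['instances'][str(entry['instance_id'])]\n",
    "        counts[(instance['subset'], entry['label'])] += 1\n",
    "    return counts\n",
    "\n",
    "\n",
    "# Hand-written sample mirroring the printed structure (not the real file).\n",
    "sample = {\n",
    "    'instances': {\n",
    "        '0': {'file_name': 'mnist_images/mnist_0_00.png', 'subset': 'InstancesSet.Train'},\n",
    "        '1': {'file_name': 'mnist_images/mnist_0_01.png', 'subset': 'InstancesSet.Train'},\n",
    "    },\n",
    "    'labels': {\n",
    "        '0': {'instance_id': 0, 'label': 0},\n",
    "        '1': {'instance_id': 1, 'label': 0},\n",
    "    },\n",
    "}\n",
    "\n",
    "for (subset, label), n in sorted(count_by_subset_and_label(sample).items()):\n",
    "    print(subset, label, n)\n",
    "```\n",
    "\n",
    "Passing the `data` dictionary loaded above instead of `sample` should give the per-subset, per-class counts of the exported dataset, provided the assumption on `instance_id` holds for every entry."
   ]
  },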
  {
   "cell_type": "markdown",
   "id": "388c036d",
   "metadata": {},
   "source": [
    "{: .attention }\n",
    "> The file paths stored in the JSON are relative to the JSON file location. If you move the JSON file, move the images with it so that these relative paths still resolve."
   ]
  }
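  ,
  {
   "cell_type": "markdown",
   "id": "a41c9e07",
   "metadata": {},
   "source": [
    "To check that the relative paths mentioned in the note above still resolve after moving files, one option is to join each stored path with the folder that contains the JSON file. This is a minimal sketch using only the standard library: `resolve_image_paths` and `find_missing_images` are hypothetical helpers, not PyXAI functions, and they assume the `instances` / `file_name` layout printed earlier.\n",
    "\n",
    "```python\n",
    "import json\n",
    "import os\n",
    "\n",
    "\n",
    "def resolve_image_paths(json_path):\n",
    "    # Hypothetical helper: map each instance id to its image path,\n",
    "    # resolved against the folder that contains the JSON file.\n",
    "    base_dir = os.path.dirname(os.path.abspath(json_path))\n",
    "    with open(json_path) as f:\n",
    "        data = json.load(f)\n",
    "    return {\n",
    "        instance_id: os.path.normpath(os.path.join(base_dir, info['file_name']))\n",
    "        for instance_id, info in data['instances'].items()\n",
    "    }\n",
    "\n",
    "\n",
    "def find_missing_images(json_path):\n",
    "    # Hypothetical helper: list the resolved paths that do not exist on disk.\n",
    "    return [p for p in resolve_image_paths(json_path).values() if not os.path.exists(p)]\n",
    "```\n",
    "\n",
    "For the dataset built above, `find_missing_images('../dataset/mnist_0_vs_8.json')` should return an empty list as long as the `mnist_images/` folder sits next to the JSON file."
   ]
  }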
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}