From Machine Learning to Digital Humanities
- PhD Student: David Ing
- Co-Advisors: Lakhdar Saïs, Saïd Jabbour
- Co-Supervisors: Fabien Delorme, Nelly Robin
- Funding: Artois, Région HdF
- PhD defended on: Dec 13, 2024
Dimensionality reduction, or the selection of relevant features, is an important step in designing machine learning models, as well as in generating concise and understandable explanations in many sensitive applications. For classification algorithms such as decision trees and random forests, many studies have focused on constructing trees that are optimal in terms of the features involved, and on generating the most relevant and accurate explanations.
This thesis, entitled “From Machine Learning to Digital Humanities”, proposes significant improvements to classification algorithms, particularly in feature selection via Logical Analysis of Data (LAD). LAD is a pattern learning framework that combines optimization, Boolean functions, and combinatorics; it derives from discrete mathematics and is often overlooked by the AI community. We also propose a new measure of decision tree optimality, defined as the minimum number of features necessary to construct the tree. Experiments on large benchmark datasets show that our approach significantly reduces the number of features required to build decision trees and random forests, which in turn improves the explanation generation phase.
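As a rough illustration of feature selection in the LAD spirit, the sketch below looks for a small "support set": a subset of binary features on which every positive observation still differs from every negative one. It uses a plain greedy set-cover heuristic, which is an assumption made for brevity and not the method developed in the thesis; the function name `greedy_support_set` and the toy data are hypothetical.

```python
from itertools import product


def greedy_support_set(positives, negatives):
    """Return sorted indices of a small feature subset that separates
    every positive observation from every negative observation.

    Illustrative greedy heuristic (set cover); the result is not
    guaranteed to be of minimum size. Inputs are assumed to be
    non-empty lists of equal-length 0/1 tuples.
    """
    n_features = len(positives[0])
    # Every (positive, negative) pair must differ on a chosen feature.
    uncovered = set(product(range(len(positives)), range(len(negatives))))
    for i, j in uncovered:
        if positives[i] == negatives[j]:
            raise ValueError("a positive and a negative observation are identical")

    chosen = []
    while uncovered:
        # Number of still-unseparated pairs that feature f would separate.
        def gain(f):
            return sum(1 for i, j in uncovered if positives[i][f] != negatives[j][f])

        best = max(range(n_features), key=gain)
        chosen.append(best)
        # Keep only the pairs that the new feature fails to separate.
        uncovered = {(i, j) for i, j in uncovered
                     if positives[i][best] == negatives[j][best]}
    return sorted(chosen)


# Toy data (hypothetical): feature 0 alone separates the two classes.
toy_positives = [(1, 0, 1), (1, 1, 0)]
toy_negatives = [(0, 0, 1), (0, 1, 0)]
selected = greedy_support_set(toy_positives, toy_negatives)
```

A tree or forest trained only on the selected columns would then involve at most that many features, which is the quantity the optimality measure above counts.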
The second part of this thesis aligns with an emerging research trend that combines Artificial Intelligence (AI) and Digital Humanities (DH). Our first contribution makes original use of legal databases to identify Human Trafficking Networks (HTNs), involving both sexual abuse victims and exploiters. Several well-known classification models are used not only to determine the class of a given network (Not suspicious, Suspicious, or Probably suspicious) but also to provide explanations that could help the judge make the right decision. Our final contribution presents a first step towards a text mining approach designed to harness migrant narratives, collected in English and French during interviews with migrants along their journeys. After in-depth discussions with experts, we identified essential domain concepts, including the places, cities, and villages crossed by migrants. Our approach, based on text mining and Natural Language Processing (NLP), automatically extracts such location-related terms from these narratives, using an adaptation of a set expansion algorithm applied in a weakly supervised manner with a small set of annotated terms. Finally, we designed a tool that displays the extracted locations on a map, making the migration routes observable.
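To give a feel for set expansion from a few annotated terms, here is a minimal sketch that ranks unseen tokens by how often they occur in the same left/right word contexts as the seed terms. It is a generic context-overlap heuristic written under simplifying assumptions (whitespace tokenization, single-word terms) and is not the adapted algorithm from the thesis; `expand_seed_set` and the toy sentences are hypothetical and do not come from the interview corpus.

```python
from collections import Counter


def expand_seed_set(sentences, seeds, top_k=3):
    """Rank non-seed tokens by the seed contexts they share.

    A context is the pair (previous token, next token). Illustrative
    heuristic only; assumes single-word terms and whitespace tokens.
    """
    seeds = {s.lower() for s in seeds}
    tokenized = [s.lower().split() for s in sentences]

    def contexts(tokens):
        # Pad so that the first and last tokens also get a context.
        padded = ["<s>"] + tokens + ["</s>"]
        for k in range(1, len(padded) - 1):
            yield padded[k], (padded[k - 1], padded[k + 1])

    # Pass 1: count the contexts in which seed terms appear.
    seed_contexts = Counter()
    for tokens in tokenized:
        for term, ctx in contexts(tokens):
            if term in seeds:
                seed_contexts[ctx] += 1

    # Pass 2: score other tokens that appear in those same contexts.
    scores = Counter()
    for tokens in tokenized:
        for term, ctx in contexts(tokens):
            if term not in seeds and ctx in seed_contexts:
                scores[term] += seed_contexts[ctx]
    return [term for term, _ in scores.most_common(top_k)]


# Toy sentences (invented for illustration), with one seed location.
toy_sentences = [
    "we left agadez by truck",
    "we left tripoli by boat",
    "we left quickly that night",
    "they reached niamey by bus",
    "they reached agadez by foot",
]
candidates = expand_seed_set(toy_sentences, ["agadez"], top_k=2)
```

In practice the top-ranked candidates would be reviewed and added to the seed set before the next round, which is what keeps the supervision weak.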
Committee
Supervisors
- M. Lakhdar SAÏS, CRIL, CNRS - Artois University
- M. Saïd JABBOUR, CRIL, CNRS - Artois University
- M. Fabien DELORME, CRIL, CNRS - Artois University
Reviewers
- M. Nadjib LAZAAR, LISN, CNRS - University of Paris Saclay
- M. Engelbert MEPHU NGUIFO, LIMOS, CNRS - University of Clermont Auvergne
Examiners
- M. Frédéric LARDEUX, LERIA, CNRS - University of Angers
- Mme. Nelly ROBIN, Paris Cité University - IRD