Extreme Multi-label Text Classification with Heterogeneous Network Data
- PhD Student: Imane Jebbar
- Co-Advisors: Zied Bouraoui, Khalid Minaoui
- Funding: Artois, Université Mohammed V de Rabat
- Start year: 2024
Joint work with Mohammed V University of Rabat.
This thesis develops new approaches for multi-label text classification (MLTC), the task of assigning multiple relevant labels to a given text from a very large label set. MLTC has many important applications in natural language processing, such as document categorization, sentiment analysis, and information retrieval. The task is particularly challenging for several reasons: the label space is often extremely large, the label distribution is frequently imbalanced, and the dependencies between labels and words can be rich and complex. Most existing methods either fail to account adequately for these interdependencies or rely on linear models that cannot capture the nonlinear and contextual relationships inherent in the data.

Our research focuses on a specific type of extreme multi-label classification (XMC) applied to network data enriched with textual content. Such networks, which encode rich and complex relationships between various elements, are common in contexts such as social networks, citation graphs, and web structures. The textual content is the information associated with the nodes or edges of these networks, such as tweets, user profiles, or descriptions linked to hyperlinks. Network data enriched with textual content is ubiquitous and plays a key role in numerous fields, including social media analysis, information retrieval, and natural language processing.

The goal of this thesis is to push the boundaries of current approaches by fully leveraging the richness of network structures and textual content to better understand and model the complex relationships between labels and data.