• Funding: PHC Toubkal
• Start year: 2026

Human emotions play a central role in communication, decision-making, social interactions, and psychological well-being. The development of emotional AI systems capable of recognizing and interpreting affective states nevertheless remains a major challenge, due to the variability of emotional expressions, cultural differences, and the strong dependence on context. The same expression may convey different emotions depending on the situation, which severely limits approaches based on a single source of information.

Emotion Recognition (ER) has thus undergone a significant evolution, shifting from unimodal systems relying on an isolated modality (text, speech, or image) to multimodal approaches that integrate multiple sources of information, such as text, facial expressions, physiological signals (ECG, EEG, GSR), and body movements. This evolution has demonstrated the potential of multimodal approaches to better capture the complexity of human affective states and to improve the performance of emotion recognition systems. However, despite these advances, the effective fusion of multimodal data remains an open challenge. Early, late, and hybrid fusion strategies struggle to cope with signal heterogeneity, differing temporal dynamics, and the loss of contextual information. Even more sophisticated methods, based on cross-attention mechanisms or specialized architectures, often remain rigid, poorly interpretable, and difficult to adapt to new contexts, particularly in real-time scenarios or for low-resource languages such as Arabic and the Moroccan dialect.

In parallel, the recent emergence of Large Language Models (LLMs) has profoundly transformed natural language processing through their strong capabilities in contextual understanding, generalization, and transfer learning. Their growing integration into multimodal architectures opens new perspectives for rethinking multimodal fusion, no longer as a purely mechanical combination of signals, but as a process guided by reasoning and global contextual understanding. Nevertheless, challenges remain regarding the effective integration of heterogeneous and temporal modalities, multilingual adaptation, and computational complexity constraints.

The objective of this PhD is to investigate the use of LLMs as a central component of multimodal fusion for emotion recognition and sentiment analysis. The work will focus on designing hybrid architectures that integrate representations from textual, vocal, and physiological modalities, as well as exploring multimodal prompting strategies, specialized fine-tuning, and LLM-guided hierarchical fusion mechanisms. Finally, this research aims to contribute to the enrichment of multilingual multimodal corpora, particularly for low-resource languages, and to evaluate the proposed models in practical applications such as early detection and monitoring of emotional disorders, with potential impact in the fields of mental health, adaptive education, and human–machine interaction.
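As a purely illustrative complement to these research directions, the minimal PyTorch sketch below shows one plausible shape that LLM-guided cross-attention fusion could take: projected speech and physiological embeddings act as keys and values for a cross-attention layer whose queries are the hidden states of a text LLM, followed by a pooled emotion classifier. All module names, feature dimensions, and tensor shapes are hypothetical placeholders chosen for the example, not a committed design of the thesis.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Toy LLM-guided fusion block: text hidden states (queries) attend to
    projected audio and physiological tokens (keys/values)."""

    def __init__(self, d_model=256, n_heads=4, n_classes=7,
                 d_audio=128, d_physio=32):
        super().__init__()
        # Project each non-text modality into the text model's hidden space.
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.physio_proj = nn.Linear(d_physio, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, text_hidden, audio_feats, physio_feats):
        # text_hidden:  (B, T_text, d_model), e.g. last hidden states of an LLM
        # audio_feats:  (B, T_audio, d_audio) speech embeddings
        # physio_feats: (B, T_physio, d_physio) ECG/EEG/GSR features
        ctx = torch.cat([self.audio_proj(audio_feats),
                         self.physio_proj(physio_feats)], dim=1)
        fused, _ = self.cross_attn(query=text_hidden, key=ctx, value=ctx)
        fused = self.norm(text_hidden + fused)      # residual connection
        return self.classifier(fused.mean(dim=1))   # pooled emotion logits


if __name__ == "__main__":
    # Smoke test with random tensors standing in for real encoder outputs.
    model = CrossModalFusion()
    logits = model(torch.randn(2, 16, 256),   # text LLM hidden states
                   torch.randn(2, 50, 128),   # speech embeddings
                   torch.randn(2, 30, 32))    # physiological features
    print(logits.shape)                       # torch.Size([2, 7])
```

In an actual system, the random tensors would be replaced by outputs of pretrained modality encoders and the hidden states of a frozen or fine-tuned LLM; the sketch only serves to make concrete the idea of fusion that is guided by the text model's contextual representation rather than by a fixed early or late combination of signals.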