This presentation reports on exploratory research conducted in partnership between IRT SystemX and CRIL. The work focuses on the alignment and interpretability of large language models (LLMs) through activation engineering. The talk will provide an overview of the state of the art, outline the research directions explored and their outcomes, and highlight promising avenues for future work. It offers an opportunity to dive into the inner workings of LLMs, showing how they can be aligned using very limited data and how their internal activations can be leveraged to detect and better understand undesirable behaviors.
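For readers unfamiliar with the technique, the sketch below illustrates one common form of activation engineering: adding a "steering vector", built from a pair of contrastive prompts, to the residual stream of a transformer layer at inference time. It is a minimal illustration only; the model, layer index, prompts, and scaling factor are assumptions made for the example and do not describe the specific setup presented in the talk.

```python
# Minimal sketch of activation steering, assuming a GPT-2 model from
# Hugging Face transformers. Steering vector = difference of mean hidden
# activations on two contrastive prompts; it is then added to one layer's
# residual stream during generation. All concrete choices are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical choice for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # which transformer block's residual stream to steer (assumption)

def hidden_at_layer(prompt: str) -> torch.Tensor:
    """Mean hidden state of `prompt` at LAYER (shape: [hidden_dim])."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

# Contrastive pair: the steering direction is the difference of activations.
steering_vec = hidden_at_layer("I love talking about weddings") - \
               hidden_at_layer("I hate talking about weddings")

def make_hook(vec: torch.Tensor, scale: float = 4.0):
    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple; output[0] is the residual-stream tensor.
        return (output[0] + scale * vec,) + output[1:]
    return hook

handle = model.transformer.h[LAYER].register_forward_hook(make_hook(steering_vec))
try:
    ids = tok("I think that", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=30, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls are unsteered
```

The same internal activations that are perturbed here can also be read rather than written: probing or monitoring them is one way such methods attempt to detect undesirable behaviors, which is the interpretability angle mentioned above.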