Contributions to Cheminformatics: Stereo2vec Molecule Embeddings and the Open Cyclodextrin DataBase
- PhD Student:
- Gökhan Tahil
- Co-Advisors:
- Daniel Le Berre
- Sébastien Tilloy (UCCS)
- Co-Supervisor:
- Fabien Delorme
- Funding: Artois
- PhD defended on:
- Dec 17, 2024
Machine learning is being used in a growing number of areas. This thesis addresses a specific task in chemistry: predicting the association constant between a cyclodextrin and a guest molecule. To do so, we first collected the data available in the literature and curated it. One of the challenges was to represent each molecule in a unique, unambiguous way; we used kekulized isomeric SMILES for that purpose. The resulting dataset, OpenCycloDB, has been made available to the research community. We noticed that some stereoisomers can share the same molecule embedding when the usual approaches are used. For that reason, we proposed a family of molecule embeddings called Stereo2vec, designed to ensure that different molecules are associated with different embeddings. The proposed approaches have been evaluated on our target prediction task.
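The stereoisomer collision mentioned above can be illustrated with a toy sketch. This is not Stereo2vec and not the method used in the thesis: `toy_embedding` and `strip_stereo` are hypothetical helpers, and a bag-of-characters count stands in for a real embedding. The only point is that a featurization which ignores the stereo markers of isomeric SMILES cannot tell two enantiomers apart.

```python
import re
from collections import Counter

# Isomeric SMILES for the two enantiomers of alanine.
# They differ only in the chirality marker: @ versus @@.
ENANTIOMER_A = "C[C@H](N)C(=O)O"
ENANTIOMER_B = "C[C@@H](N)C(=O)O"


def strip_stereo(smiles: str) -> str:
    """Remove tetrahedral (@, @@) and double-bond (slash, backslash) stereo markers."""
    return re.sub(r"[@/\\]", "", smiles)


def toy_embedding(smiles: str, stereo_aware: bool) -> Counter:
    """Hypothetical stand-in for an embedding: a count of SMILES characters.

    With stereo_aware=False the stereo markers are dropped first, which
    mimics a featurization that does not encode stereochemistry.
    """
    text = smiles if stereo_aware else strip_stereo(smiles)
    return Counter(text)


# Stereo-blind: both enantiomers collapse onto the same representation.
blind_collision = (
    toy_embedding(ENANTIOMER_A, stereo_aware=False)
    == toy_embedding(ENANTIOMER_B, stereo_aware=False)
)

# Stereo-aware: the two representations differ.
aware_collision = (
    toy_embedding(ENANTIOMER_A, stereo_aware=True)
    == toy_embedding(ENANTIOMER_B, stereo_aware=True)
)

print("stereo-blind collision:", blind_collision)
print("stereo-aware collision:", aware_collision)
```

In this sketch the stereo-blind variant maps both enantiomers to the same counts, while the stereo-aware variant separates them. A real embedding would need to encode the stereo information in a chemically meaningful way rather than by counting characters.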