CECAM - From Curse to Cure: Mastering Dimensionality Reduction with Autoencoders

Objective

We propose a three-day tutorial on high-dimensional data analysis, focusing on techniques such as compressive sensing, autoencoders, and regression. Building on the success of our previous six-month node extended software development workshop (ESDW), which integrated student research and training, we plan to repeat the ESDW and, this time, to share the experience through a dedicated follow-up tutorial open to a larger number of participants. The program will emphasize practical applications, including the representation of perturbations in latent space. Core concepts will be introduced using simple examples, and real-world applications from materials science, nanotoxicology, and computational biology will be highlighted.

Context

High-dimensional data analysis presents significant challenges due to the “curse of dimensionality,” a term coined by Richard E. Bellman. As the number of variables increases, computational costs escalate, and data becomes increasingly sparse, complicating the identification of underlying patterns. Despite these challenges, many complex systems exhibit low-dimensional structures.
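One symptom of this sparsity can be illustrated with a small numerical experiment. The sketch below is an illustration only, not part of the tutorial material: it draws uniformly random points (an assumption made purely for demonstration) and measures how the relative gap between the nearest and farthest neighbour of a query point shrinks as the dimension grows. The function name `relative_contrast` and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=500):
    """Return (d_max - d_min) / d_min for distances from one random query
    point to n_points uniformly random points in the unit hypercube."""
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Compare low- and high-dimensional settings.
contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, value in contrasts.items():
    print(f"dim={dim:5d}  relative contrast={value:.2f}")
```

In low dimensions the nearest neighbour is much closer than the farthest one. In high dimensions all distances become comparable, which is one reason pattern detection gets harder.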

To handle high-dimensional data, methods such as compressive sensing and autoencoders reduce data complexity while preserving key information. Variational autoencoders (VAEs) learn a probabilistic latent representation from which new data similar to the training data can be generated, enabling applications such as anomaly detection. By combining VAEs with regression, we can map perturbation effects to the latent space. Crucially, systems with similar functions often share a common latent representation despite differences at the molecular level, so the impact of a perturbation on one system can be predicted from knowledge of another. This approach, pioneered by Lotfollahi et al., has shown promise in predicting cellular responses to various perturbations across different cell types and species, and it has broad applications across fields such as biology, toxicology, and materials science.
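The idea of transferring a perturbation through a shared latent space can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the method of Lotfollahi et al.: a linear autoencoder obtained from an SVD stands in for a trained VAE, the data are synthetic, and every name (`simulate`, `encode`, `decode`, systems "A" and "B") is hypothetical. The perturbation is estimated as a latent-space shift in system A and then applied to unperturbed samples of system B.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup (assumption): both systems share a 3-dimensional latent
# space that is linearly embedded in 50 observed dimensions.
n_features, n_latent, n_samples = 50, 3, 200
mixing = rng.normal(size=(n_latent, n_features))  # latent -> observed map

def simulate(center, shift=0.0):
    """Draw samples around a latent center; 'shift' mimics a perturbation
    acting along the first latent direction."""
    z = center + 0.1 * rng.normal(size=(n_samples, n_latent))
    z[:, 0] += shift
    return z @ mixing

center_a = np.array([1.0, 0.0, 0.0])
center_b = np.array([0.0, 1.0, 0.0])
a_control = simulate(center_a)
a_perturbed = simulate(center_a, shift=2.0)
b_control = simulate(center_b)
b_perturbed = simulate(center_b, shift=2.0)  # held out, used only for checking

# Linear autoencoder via SVD (stand-in for a trained VAE encoder/decoder).
train = np.vstack([a_control, a_perturbed, b_control])
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:n_latent]

def encode(x):
    return (x - mean) @ components.T

def decode(z):
    return z @ components + mean

# Estimate the perturbation as a latent shift in system A ...
delta = encode(a_perturbed).mean(axis=0) - encode(a_control).mean(axis=0)
# ... and apply it to unperturbed samples of system B.
b_predicted = decode(encode(b_control) + delta)

error = np.linalg.norm(b_predicted.mean(axis=0) - b_perturbed.mean(axis=0))
baseline = np.linalg.norm(b_control.mean(axis=0) - b_perturbed.mean(axis=0))
print(f"prediction error: {error:.3f}  (no-transfer baseline: {baseline:.3f})")
```

With a linear encoder and a perturbation that is identical in both systems, the transfer is nearly exact by construction. A VAE is needed when the map between data and latent space is nonlinear, which is the regime the tutorial targets.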

Each day of the tutorial will consist of four sessions: the first three will be purely didactic, and the last will address a research-level problem through a case study.

Case Studies

One case study will focus on predicting the adsorption affinity of a small molecule to a target surface, which is crucial in fields such as catalysis, nanomedicine, and human safety. This is a complex task once the effects of the surrounding medium are taken into account. The potentials of mean force (PMFs) for chemical–surface pairs are typically calculated using advanced sampling methods such as metadynamics or umbrella sampling. Once an extensive set of potentials is available, a model can be trained to predict interactions for new molecules and adsorbent materials. We will apply VAEs to predict PMFs for different materials and molecules. The model will be trained on an extensive dataset of PMFs previously obtained from atomistic simulations of small-molecule adsorption on inorganic surfaces, including metals, oxides, and carbon materials. We will draw the training and validation sets from the PMF library, construct a latent space of molecular parameters that is predictive of the overall interaction PMF, and employ the method of Lotfollahi et al. to predict the effects of perturbations.

A case study (and student thesis project) in our previous ESDW examined the significance of mRNA-protein discordance for the cause, prevention, and progression of neurodegenerative diseases, in particular Alzheimer’s disease and related dementias (ADRD); this work is ongoing. This line of investigation led us to clei2block, a deep learning model composed of a variational autoencoder and a linear regression module, designed to predict protein abundance in cases where mRNA-protein correlations are low. We plan to explore a similar biological problem with one or more students in the ESDW and to use the results as a second case study in the tutorial.

A third case study will address a practical problem: taking high-quality pictures with an electron microscope often damages the sample. We will show how VAEs can be used to improve the quality of images taken at lower power, so that high-quality pictures can be obtained without damaging the sample.

Location: CECAM-IRL

Organisers

References