Challenges in Large Scale Biomolecular Simulations 2019: Bridging Theory and Experiments
Institut d'Études Scientifiques de Cargèse, France
Motivation and novelty of the proposal
Deciphering the molecular mechanisms that govern the progression of common diseases, such as human cancers, represents a grand challenge for science. Understanding these mechanisms also holds great promise for improving our quality of life and for curing or alleviating such diseases. They involve proteins, nucleic acids, and other biological molecules that are the cell's workhorses. This extremely complex problem presents characteristic length scales spanning several orders of magnitude, from small molecules regulating cell functions to cell ensembles that form the tissues and organs working together as an organism [Lodish 2010, Wright 2015, Zhang 2017]. Moreover, essential processes in biology are carried out by large macromolecular assemblies (such as protein-nucleic acid complexes) whose structures are often difficult to determine by traditional methods (X-ray crystallography and NMR). To overcome these limitations, more advanced theoretical and experimental approaches are needed. In the last decade, researchers have increasingly developed integrative approaches combining information from different types of experiments, physical theories, and statistical analyses to compute structural models of large biological macromolecules and their assemblies [Albert 2007, Wang 2009, Kim 2018]. All this suggests that a stronger link between theoreticians, computational chemical physicists, bioinformaticians and experimentalists is highly desirable to build robust methods capable of enhancing our understanding of how the cell functions. This workshop aims to gather experts from these fields to provide a broad and up-to-date overview of current bioinformatics tools and simulation techniques, and to present the most recent experimental methods and results.
The meeting will be an opportunity to build an interdisciplinary community to bring new insights into complex biological systems and to boost the development of an exchange program between Europe and the USA in the framework of this consortium.
State of the art
The essential challenge posed by human health requires an understanding of the cell’s machinery at the molecular level. The interplay among proteins, DNA and RNA is key to vital functions such as DNA transcription, translation and epigenetics. To understand these processes, many experimental techniques are employed, spanning a very wide range of spatial resolution, temporal resolution and level of detail with which they can observe the macromolecules. This is necessary if we consider that DNA alone spans 9 orders of magnitude in space, with regulating mechanisms occurring from the level of single base pairs all the way up to chromosomes, and with times ranging from picoseconds for base-pair formation to hours for large structural rearrangements such as those of G-quadruplexes on the telomeres of chromosomes.
As in all branches of science, theoretical (modeling) and experimental approaches have been developed over the years to study these systems and, unsurprisingly, the most successful strategies are those in which the two approaches come together to give a full picture of the system [Lasker 2012, Pérard 2013]. Indeed, because of the diversity and complementarity of the experimental techniques, molecular modeling becomes a necessary tool to decode experimental data, bridging different sources of information and building a coherent structural model compatible with experiments.
From the modeling perspective, the starting point for understanding a molecular structure, and for obtaining hints about its function, is the molecule's sequence. Over the last 30 years a multitude of bioinformatic tools have been developed to exploit this information to infer protein and nucleic acid structures [Rother 2011, Webb 2016]. These methods, however, are based on relatively simple and empirical scoring functions and reach their limits for large and complex molecules [Miao 2017, Lensink 2018]. Physical models, on the other hand, provide a more realistic picture of the molecule and, despite being more computationally expensive, are better suited for the study of large, complex systems. Once more, the combination of the two approaches is often beneficial [Lasker 2012, Olsson 2017], especially if the bioinformatics model, the physical model, or both are able to incorporate experimental data from the start.
Current bioinformatics methods analyse the large amount of protein and nucleic acid sequence evolution data, searching for conservation or correlation patterns [Berman 2000, Cheng 2015, Finn 2016, Ho 2012, Lever 2010, McGinnis 2004, Marks 2012]. The significance of amino acid and nucleotide covariation rests on the hypothesis that mutations of interacting residues are correlated: a single point mutation at such a position would not preserve the molecule's stability, so compensating changes must occur together among the interacting residues [Kortemme 2004, Pires 2017]. Co-evolution events can involve residues that are crucial for the activity of a protein (e.g. catalytic site residues), for the stability of the native structure (e.g. hydrophobic core residues) or in some cases for both. For single-stranded nucleic acids, co-evolution has been the basis for inferring the secondary structures of large ribosomal RNAs and is commonly used to propose possible RNA secondary structures.
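As an illustration of the covariation idea, the following minimal sketch computes the mutual information between two columns of a sequence alignment. It is not the method of any of the cited tools: the function name and the four-sequence toy alignment are hypothetical, and real pipelines typically add sequence weighting, pseudocounts and corrections for phylogenetic bias.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length sequences (hypothetical input format).
    """
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    count_i = Counter(col_i)               # single-column counts
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))  # joint counts of residue pairs
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with raw counts
        ratio = n_ab * n / (count_i[a] * count_j[b])
        mi += p_ab * math.log2(ratio)
    return mi


# Hypothetical toy RNA alignment: columns 0 and 3 always form a
# Watson-Crick pair, while columns 1 and 2 are fully conserved.
toy_alignment = ["GAAC", "CAAG", "AAAU", "UAAA"]

paired = column_mutual_information(toy_alignment, 0, 3)
unrelated = column_mutual_information(toy_alignment, 0, 1)
```

In this toy case the compensating changes in columns 0 and 3 give a high score, whereas a variable column compared with a conserved one carries no covariation signal.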
For nucleic acids, other bioinformatic methods based on the nearest-neighbor thermodynamic model [SantaLucia 1998] are used to propose secondary structures for smaller systems [Turner 2010, Chou 2016], nowadays accounting for chemical probing reactivity maps [Low 2010] that test experimentally whether a nucleotide is involved in a base pair or not.
Bioinformatic methods therefore contribute to the understanding of a biomolecule by substantially reducing the conformational space to be explored on the basis of experimental data, obtained either from sequence analysis or from direct structural probing. This greatly simplifies the task of physical modeling, whose main drawback is the extremely large conformational space to be explored.
From the early days, physical modeling has played an important role in the two principal experimental methods for high-resolution structure determination: X-ray crystallography, which requires an initial model for phasing, and nuclear magnetic resonance (NMR), which requires a multi-dimensional minimization process on a model to infer possible structures.
With the current capabilities of molecular dynamics (MD) simulations, the contribution of modeling can now go much further than structure refinement. MD simulations have evolved from the first 1-microsecond simulation of the villin headpiece in 1998 [Duan 1998] to current simulations of much larger biomolecular systems (e.g., an entire satellite tobacco mosaic virus with one million atoms [Freddolino 2006]) and of longer time frames: over 1 microsecond for systems such as a B-DNA dodecamer [Pérez 2007], ubiquitin [Maragakis 2008], and the beta2 AR receptor [Dror 2009], and 1 millisecond for small proteins with specialized MD programs and dedicated supercomputers [Shaw 2010]. For some proteins, fully atomistic folding simulations can be very successful [Day 2010, Freddolino 2009, Voelz 2010], and similarly for nucleic acids, double helical DNA in particular [Schlick 2009, Clauvelin 2015, Collepardo 2015]. At the same time, coarse-grained models and combinations of enhanced sampling methods are emerging as viable alternatives for simulating complex biomolecular systems [Coluzza 2014, Lei 2007, Maisuradze 2010, Klein 2008, Schlick 2009]. Coarse-graining at various scales has made it possible to address fundamental questions in protein folding with applications to diseases such as Alzheimer's disease [Sterpone 2014], RNA folding [Yasselman 2016, Denesyuk 2013, Cragnolini 2015], DNA assemblies and topologies [Ouldridge 2010], protein-protein interactions [Baaden 2013], DNA chromatin structure and condensation [Grigoryev 2016, Collepardo 2015, Bascom 2016, Bascom 2018], and many others. The quantitative accuracy reached at each level of description allows information to flow from atomistic detail up to complex simulations of cellular mechanisms performed with event-driven algorithms, and on to simulations of whole cells [Dans 2016].
In recent years all these modeling and simulation techniques have begun to be coupled to experimental data in order to obtain an understanding of biomolecular systems from an atomistic description all the way up to the meso-scale. For example, simulations have been used to obtain atomic-resolution structures from data coming from low-resolution techniques such as Small Angle X-ray Scattering (SAXS) or Cryo-Electron Microscopy (Cryo-EM) [Lasker 2012, Kim 2018], or to make sense of reactivity maps from SHAPE and other chemical probing data for single-stranded RNA molecules [Pinamonti 2015, Kirmizialtin 2015].
Previous workshops have highlighted three main areas of research in relation to the simulation of large biomolecules:
· Model building
This area comprises the development of models at different scales, from atomistic to mesoscopic.
At present, atomistic force fields for proteins appear to have reached a satisfactory level and are indeed used for long simulations of large systems, while nucleic acid force fields are still an active area of development [Bergonzo 2015, Ivani 2016, Šponer 2018], in particular for the study of systems departing from double helical DNA. A variety of coarse-grained models of different resolutions have been developed for both proteins and nucleic acids, for folding and rational design [Coluzza 2014, Collepardo 2015, Rao 2017, Ozer 2015]. Similarly, mesoscopic models are able to address the dynamics of proteins such as molecular motors as a whole, adopting a continuum description of the system, or to study the properties of long stretches of DNA [Hanson 2015]. Successful strategies in model building integrate different levels of description of the system into multi-scale simulations [Sterpone 2018].
· Enhanced sampling
Simulations of large macromolecular objects require the use and further development of enhanced sampling techniques [Laio 2002, Nguyen 2013, Sugita 1999]. When the initial and final states are known, path sampling and biased dynamics are efficient tools to study the transition and unveil transition pathways, kinetic barriers and metastable states [Cazals 2015, Joseph 2017]. Experimental information can also be integrated into simulations, in particular with coarse-grained and mesoscopic models, limiting the space to be explored to experimentally compatible conformations [Pitera 2012, White 2014]. More recently, simulations have focused on generating ensembles of conformations rather than on obtaining a single structural prediction, exploiting their ability to generate a multitude of possible states for a given system that can be compared with the different experimental data.
· Analysis tools
All the non-standard models and methods described above require a specific treatment of the data they generate, through trajectory descriptors, order parameters, and topology and architecture descriptors [Humphrey 1996]. New technologies open the way to innovative tools for analyzing simulation data, combining state-of-the-art visualization tools (3D, virtual reality, ...) [Doutreligne 2015, Mazzanti 2017] with embedded analysis, and making it possible to integrate simulation and experimental data on a single platform [http://www.baaden.ibpc.fr/umol/].
At this stage, the interplay between experiments and simulation opens up more opportunities than ever before. As both experimental and simulation methods are dramatically increasing the amount of data that they can generate in a short time, it is necessary to build robust methods capable of exploiting this information, methods that go beyond proof of principle and ad hoc developments for specific systems or techniques.
Concluding discussions of simulation meetings often highlight the need to tighten the links between experiments and simulations. Theoreticians and experimentalists rarely have the chance to come together and exchange their points of view on the common problem of large molecular systems. With this workshop, we intend to create a long-lasting forum for discussion to make simulations a tool available to experimentalists, both through collaborative efforts and through the development of new integrative software.
Yassmine Chebaro (CNRS, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Strasbourg) - Organiser
Elisa Frezza (Université Paris Cité, CiTCoM CNRS) - Organiser
Nicolas Leulliot (Université Paris Descartes) - Organiser
Samuela Pasquali (U.Paris) - Organiser
Fabio Sterpone (IBPC and University Paris) - Organiser
Ivan Coluzza (CIC biomaGUNE) - Organiser
Tamar Schlick (New York University) - Organiser