Fault Tolerant and Energy Efficient Algorithms in Molecular Simulations
- Teodoro Laino (IBM Research - Zurich, Switzerland)
- Alessandro Curioni (IBM Research - Zurich, Switzerland)
- Costas Bekas (IBM Research - Zurich, Switzerland)
Molecular simulations have a growing impact on scientific developments because of the exponential increase in computational resources over the past 20 years. In the past 10 years, this exponential increase can be entirely traced back to the advent of massive parallelism. Massive parallelism naturally leads to a higher probability of failure and to higher energy consumption, assuming that the failure rate and the energy efficiency of the individual components both stay constant. So far, these issues have not been a showstopper for any of the numerous computational fields. However, this is likely to change dramatically in the next 5 years as we move to exaflop machines, where both energy consumption and fault tolerance will become fundamental issues. In fact, with the lowest current failure rates (0.02 failures/TF/month on IBM BG/P), software running on a hypothetical exaflop architecture will have to handle approximately 1 failure per minute of elapsed time.
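The estimate above can be checked with a short back-of-the-envelope calculation. The sketch below assumes a 30-day month and that failures scale linearly with the machine's peak performance; the function and constant names are illustrative only. Under these assumptions the rate comes out at roughly one failure every two minutes, which is the same order of magnitude as the figure quoted above.

```python
# Back-of-the-envelope failure-rate estimate for a hypothetical exaflop machine.
FAILURES_PER_TF_PER_MONTH = 0.02   # value quoted in the text (IBM BG/P)
EXAFLOP_IN_TF = 1_000_000          # 1 exaflop = 10^6 teraflops
MINUTES_PER_MONTH = 30 * 24 * 60   # assumption: 30-day month


def failures_per_minute(rate_per_tf_month, machine_tf,
                        minutes_per_month=MINUTES_PER_MONTH):
    """Scale a per-teraflop monthly failure rate linearly to a whole machine."""
    return rate_per_tf_month * machine_tf / minutes_per_month


rate = failures_per_minute(FAILURES_PER_TF_PER_MONTH, EXAFLOP_IN_TF)
mean_minutes_between_failures = 1.0 / rate

print(f"failures per minute: {rate:.2f}")
print(f"mean time between failures: {mean_minutes_between_failures:.1f} min")
```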
Despite the large computational power available, molecular simulations are still far from being able to model the real world in terms of both complexity and reachable time scales. More than in other computational fields, it is therefore clear that molecular simulation frameworks must be able to exploit current and future computer architectures to achieve realistic model descriptions. It is thus necessary to dedicate time and attention to these two fundamental issues today, in order to exploit the full capabilities of the computational architectures that will appear in 8 to 10 years.
The importance of energy-efficient algorithms is also underlined by US and EU programs that provide incentives to save energy and promote the use of renewable energy resources. Individuals, companies, and organizations increasingly seek energy-efficient products, as the cost of the energy needed to run equipment is rapidly becoming a major factor. Electricity costs impose a substantial strain on the budgets of data and computing centers. Moreover, energy dissipation causes thermal problems: most of the energy consumed by a system is converted into heat, resulting in wear and reduced reliability of hardware components. For these reasons, energy has become a leading design constraint for computing devices in recent years, and its importance will become even more pronounced in the years to come. Hardware engineers and system designers are exploring new directions to reduce the energy consumption of computing systems. Recent years have also witnessed considerable research interest in algorithmic techniques to save energy, mostly restricted to high-end consumer markets. Nonetheless, this topic is also of great importance to HPC, and especially to molecular simulations.
Although these issues are highly topical, very few groups have recognized their importance and focused on developing algorithms to address energy management and fault tolerance.
With the advent of massively parallel machines, fault-tolerant algorithms have attracted increasing interest. Since the very beginning of molecular simulations, applications have typically dealt with faults by periodically writing out checkpoints. If a fault occurred, all processes were stopped and the job was restarted from the last checkpoint. On a massively parallel machine, such a checkpoint/restart scheme may not be an effective use of resources: does it make sense to kill 99,999 processes just because one has failed?
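The traditional checkpoint/restart approach described above can be sketched as follows. This is a minimal single-process toy, not a scheme taken from any particular simulation package: the names `run` and `SimulatedFault`, the toy update rule, and the checkpoint interval are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


class SimulatedFault(Exception):
    """Stands in for a hardware failure (hypothetical, for illustration)."""


def run(n_steps, checkpoint_every, checkpoint_path, fail_at=None):
    """Advance a toy state, resuming from checkpoint_path if it exists."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            step, state = pickle.load(f)
    else:
        step, state = 0, 0.0
    while step < n_steps:
        if fail_at is not None and step == fail_at:
            raise SimulatedFault(f"fault injected at step {step}")
        state += 0.5 * step          # toy stand-in for one integration step
        step += 1
        if step % checkpoint_every == 0:
            # Write to a temporary file first so that a fault during the
            # write cannot destroy the previous good checkpoint.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump((step, state), f)
            os.replace(tmp_path, checkpoint_path)
    return step, state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.pkl")
    try:
        run(100, 10, ckpt, fail_at=37)   # job dies at step 37
    except SimulatedFault:
        pass
    resumed = run(100, 10, ckpt)         # restarts from the step-30 checkpoint
    reference = run(100, 10, os.path.join(workdir, "ref.pkl"))

print(resumed == reference)
```

In this toy run the seven steps computed between the last checkpoint and the fault are lost and recomputed. On a real parallel machine that loss is paid by every process, which is the inefficiency the paragraph above points to.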
So far, few algorithms have been developed to handle faults in machine hardware; examples include the implementation of multiple walkers in Metadynamics [1], parallel schemes in Monte Carlo sampling [2], and finite-difference schemes [3]. All these implementations rely on each task requiring information only from its own memory (independent tasks), delegating the communication between the different tasks to socket-like structures.
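The idea of independent tasks can be illustrated with a toy Monte Carlo estimate of pi. This is a generic sketch and not the method of any of the cited works: each hypothetical `walker` uses only its own random number generator and memory, a thread pool stands in for the socket-like communication layer, and the sample count is arbitrary. When one walker is lost, the result is simply assembled from the survivors instead of aborting the whole job.

```python
import random
from concurrent.futures import ThreadPoolExecutor

SAMPLES_PER_WALKER = 20_000  # illustrative value, not taken from the text


class WalkerFault(Exception):
    """Stands in for a hardware failure that kills one task."""


def walker(seed, fails=False):
    """Independent task: relies only on its own RNG and its own memory."""
    if fails:
        raise WalkerFault(f"walker {seed} lost")
    rng = random.Random(seed)
    return sum(1 for _ in range(SAMPLES_PER_WALKER)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)


def run_walkers(n_walkers, failing=()):
    """Run all walkers and combine the results of those that survive."""
    hits, survivors = 0, 0
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(walker, seed, seed in failing)
                   for seed in range(n_walkers)]
        for future in futures:
            try:
                hits += future.result()
                survivors += 1
            except WalkerFault:
                continue  # drop the failed walker, keep the others
    return 4.0 * hits / (survivors * SAMPLES_PER_WALKER), survivors


estimate, survivors = run_walkers(8, failing={3})
print(f"pi is approximately {estimate:.3f} from {survivors} surviving walkers")
```

The price of losing a walker here is only a slightly larger statistical error, since fewer samples contribute to the estimate.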
Another important way of achieving fault tolerance is to use iterative rather than direct algorithms, because iterative algorithms can recover from hardware failures more easily. At the same time, exploiting data locality will become an even more important issue.
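A small example may help show why iterative algorithms cope with faults more gracefully. The sketch below uses Jacobi iteration on a 3x3 diagonally dominant system and overwrites one entry of the current iterate with garbage to mimic a soft fault. The choice of Jacobi, the matrix, and the assumption that the fault corrupts only the iterate (not the matrix or right-hand side) are illustrative. Because each sweep contracts the error, the corrupted iterate is just a poor starting guess and the method still converges; a direct factorization would carry a corrupted intermediate value through to the final answer.

```python
def jacobi(A, b, n_iter, corrupt_at=None):
    """Jacobi iteration for A x = b, with an optional injected soft fault."""
    n = len(b)
    x = [0.0] * n
    for it in range(n_iter):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
        if it == corrupt_at:
            x[0] = 1.0e6  # simulated fault: one entry overwritten with garbage
    return x


# Diagonally dominant test system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]

clean = jacobi(A, b, n_iter=60)
faulty = jacobi(A, b, n_iter=60, corrupt_at=5)

print(max(abs(xi - 1.0) for xi in clean))
print(max(abs(xi - 1.0) for xi in faulty))
```

The only cost of the fault in this sketch is a number of extra sweeps to damp the injected error; no restart is needed.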
Fault tolerance is tightly connected to the energy usage of the algorithm itself. Using a lot of energy means producing a lot of heat, which results in wear and reduced reliability of hardware components. In addition, heavy use of the I/O system for checkpoints significantly increases energy consumption.
Scientists have become aware of the issue of energy-efficient algorithms only very recently [4,5,6]. The development of energy-efficient algorithms in the field of molecular
[1] P. Raiteri et al., J. Phys. Chem. B, 2006, 110 (8), pp. 3533-3539
[2] C. Lindemann et al., IEEE Transactions on Reliability, 1995, 44 (4), pp. 694-704
[3] Chen et al., Information Processing Letters, 2001, 79 (1), pp. 11-16; A. Geist, 2002, http://www.csm.ornl.gov/~geist/Lyon2002-geist.pdf
[4] G. Fagg et al., Parallel Computing, 2007, 27 (11), pp. 1479-1495
[5] Bekas et al., A New Energy Aware Performance Metric, 2010, to appear in CSRD.
[6] S. Harizopoulos et al., in Proceedings of the Fourth Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, January 2009.