Organisers
- Teodoro Laino (IBM Research, Zurich Research Laboratory)
- Alessandro Curioni (IBM Research, Zurich Research Laboratory)
Supports
CSCS
IBM Zurich Research Laboratory
CECAM
Description
Similar to the 1990s, when the revolution in mainstream scientific software development, viz. going from structured programming to object-oriented programming, was the greatest change in the past 3 decades, we are at the beginning of a totally new revolution in terms of algorithmic engineering. We are nowadays at a hardware/software technology inflection point due to large-scale parallelism, including parallel operations on the contents of a single register, pipelining, memory pre-fetch, single-core simultaneous multithreading (”hyper-threading”) and superscalar instruction issue. Some new processor options have emerged, such as the Cell BE processor and GPUs, which are extremely aggressive in their use of parallelism, while keeping, on the other hand, general-purpose programmability. Other processors, like FPGAs and special purpose hardware, still based on chip parallelism, are emerging for being extremely and efficiently specialized for unique tasks. A natural question is whether a non-conventional processing unit can gain a significant performance advantage over commodity processing units. Historically, conventional microprocessors outpaced the non-conventional solutions. In fact, every plan to build non-conventional hardware must carefully account for the expected exponential growth in the capabilities of conventional hardware and, most important, for the conspicuous time investments in the algorithm re-engineering, which is a mandatory task in order to exploit the full capabilities of the new non-conventional microprocessors. Notwithstanding, non-conventional processing units lead to a much greater improvement in absolute performance than the expected speedup predicted by Moore’s law over the development time period. In fact, if a processing unit is expected to run 1000 times faster than the state of the art microprocessor at the conceptualization time, it is evident that during the 5-7 years of development time a commodity solution will approximately show a tenfold improvement (the performance doubling approximately every 18 months). Therefore, the non-conventional solution will outperform of at roughly two orders of magnitudes the conventional microprocessors at bring-up time. This leads to the importance of re-engineering the algorithms in a short time frame, in order to fully exploit the performance advantage. It is evident that the most important issue facing the software community is how to program these classes of processing units in the most productive way. FPGAs, invented around 1984 by Ross Freeman, Xilinx cofounder, have a relatively large number of programmable logic components with programmable interconnects. They are increasingly used, alongside traditional processors, in a variety of hybrid parallel computing systems. In systems such as the Cray XD1 supercomputer, FPGAs play a supporting role by providing massive instruction-level parallelism and by fundamentally changing the way that key algorithms are processed. Important algorithmic re-engineering already begun in the FPGAs community spans purely algebraic [1], physical [2,3] and aero-spatial applications [4]. Regarding modern GPUs, they are not only a powerful graphical engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpace its CPUs counterpart. Due to the rapid increase in both programmability and capability, an extensive research community has successfully mapped a broad range of computationally demanding, complex problems to GPUs. In particular, several algorithmic re-engineering activities have been performed during the last years in the fields of classical molecular simulations [5], computational chemistry [6,7] and physics [8,9,10]. Cell BE processors are the latest entry in the class of non-conventional modern processing units. The Cell project was started in 2001 keeping in mind the idea of a processing unit able to combine considerable floating points resources, required for demanding numerical algorithm, with a power-efficient software-controlled memory hierarchy. The result is that Cell BE processors make a radical departure from conventional multiprocessor or multi-core architectures. Although relatively new in the field of computational science, already a considerable number of numerical algorithms have been ported on this architecture [11,12,13,14]. Last but not least, special-purpose hardware, which is a category on its own where the processing units are specifically designed on the basis of the computational problem. With programmable chips or specifically designed processing units, the trick is to find the best map between the scientific problem and the layout of the computational hardware. In this class of non-conventional modern hardware should be mentioned the FASTRUN[15], MD Engine [16], APENext [17], GRAPE [18] and the Anton [19] projects, both aimed to perform classical molecular dynamics simulations efficiently.
Scientific Objectives
Developing and making modern non-conventional hardware publicly available renders the performance lunch not free any more. Indeed, there will continue to be generally applicable performance gains for every algorithm, thanks mainly to cache size improvements. But in order to benefit from the continuing exponential throughput advances in new processors, we will need to have well-written concurrent (usually multithreaded or specialized in order to use accelerated cards) algorithms. This puts the whole scientific software development community at a fundamental turning point in algorithmic engineering. The concurrency revolution became important once before a few years ago, when massively parallel machines, like the IBM BlueGene/L, were made available to the research community and it even changed the way people were writing software. The next revolution will have an even higher impact and arguably, the free lunch has already been over for a year or two, we are only just becoming aware of it. The stake in this revolution is high: it is the ability to perform computational tasks with a performance of 1-2 orders of magnitudes larger than the available commodity clusters. This can be translated for the molecular dynamics community with the ability to perform extensive sampling in the milli-second range or, alternatively, with the ability of treating routinely billions of particles. In this 3-day workshop we will explore the use of GPUs, Cell BE processors, FPGAs and special-purpose hardware for large-scale scientific computing. The main objective will be to demonstrate how some of the most challenging problems in computational sciences have already been ported to modern non-conventional computing platforms, presented by speakers coming from a wide computational community (physicists, chemists, engineers, computer scientists, biologists) active in the fields of algorithm re-engineering for the new architectures. In fact, while traditional uses of high-performance computing (HPC) systems in physics and engineering involve problems that often have well defined and regular structure, the new processing units pose a novel challenge: re-thinking algorithms in terms of massively, probably asynchronous parallelization schemes, which can take full advantage of different kind of modern architectures: GPUs, Cell BE, FPGAs or special-purpose hardware. The first objective of the workshop is to increase the dialogue between computational scientists and computer scientists to examine the performance of existing algorithms, identify critical/bottleneck sections, scalability, and discuss how and why these codes will perform differently on the new architectures: GPUs, multi-core and Cell processors. The second objective of the workshop is to promote some early collaborations and pilot studies in which computational/computer scientists can work together on problems relevant to peta-scale deployment. The workshop's outcome will be a greater insight into the algorithmic re-engineering field, with a wide view on the most intriguing computational problems and their deeply algorithmic issues. This interdisciplinary workshop will facilitate an unprecedented exploration of the algorithmic problems on the new hardware available and will contribute in paving the way to the peta-scale computing for a wide range of scientific applications.
References
[1] Mapping sparse matrix scientific applications onto FPGA-augmented reconfigurable supercomputers; Morris, G.R.; PhD Thesis, University of Southern California, Los Angeles (2007)
[2] High Performance Scientific Computing Using FPGAs with IEEE Floating Point and Logarithmic Arithmetic for Lattice QCD; Callanan, O.; Gregg, D.; Nisbet, A.; Peardon, M.; FPL '06. International Conference on Field Programmable Logic and Applications, 1-6 (2006)
[3] Using Floating-Point Arithmetic on FPGAs to Accelerate Scientific N-Body Simulations; Lienhart, G.; Kugel, A.; Männer, R.; 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'02), 182 (2002)
[4] Engineering Applications on NASA’s FPGA-based Hypercomputer; Storaasli, O.O.; 5th MAPLD Conference Proceedings (2002)
[5] GPU acceleration of cutoff pair potentials for molecular modeling applications; Rodriges, C.I.; Hardy, D.J.; Stone, J.E.; Schulten, K.; Hwu, W.-M.W.; Proceedings of the 2008 conference on Computing frontiers, 273-282 (2008)
[6] Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation; Ufimtsev, I.S.; Martinez, T.J.; J. Chem. Theory Comput., 4, 222-231 (2008)
[7] Accelerating Resolution-of-the-Identity Second-Order Mller-Plesset Quantum Chemistry Calculations with Graphical Processing Units; Vogt, L.; Olivares-Amaya, R.; Kermes, S.; Shao, Y.; Amador-Bedolla, C.; Aspuru-Guzik, A.; J. Phys. Chem. A, 112, 2049-2057 (2008)
[8] Using GPUs to improve multigrid solver performance on a cluster; Göddeke, D.; Strzodka, R.; Mohd-yusof, J.; Mccormick, P.; Wobker, H.; Becker, C.; Turek. S.; International Journal of Computational Science and Engineering (2008)
[9] Toward efficient GPU-accelerated N-body simulations; Stock, M.J.; Gharakani, A.; Proceedings of the 46th AIAA Aerospace Sciences Meeting and Exhibit, 1-13 (2008)
[10] Acceleration of a two-dimensional euler flow solver using commodity graphics hardware.; Brandvik, T.; Pullan, G.; Proceedings of the Institution of Mechani-
cal Engineers, Part C: Journal of Mechanical Engineering Science, 221, 1745–1748, (2007)
[11] Towards Personalized Medicine: High-Performance Computing in the Life Sciences; Schenk, O.; Burkhart, J.; Reddy, H.; ERCIM NEWS, Supercomputing at work, 29-31 (2008)
[12] Exploitation of Cell Multi-Processor Array in Solution of Spatio-Temporal Dynamics; Nagy, Z.; Kék, L.; Kincses, Z.; Kiss, A.; Szolgay, P.; ERCIM NEWS, Supercomputing at work, 26-27 (2008)
[13] Efficient Breadth-First Search on the Cell/BE Processor; Scarpazza, D.P.; Villa, O.; Petrini, F.; Parallel and Distributed Systems, IEEE Transactions on, 19, 1381 – 1395 (2008)
[14] Accelerating DFM Electronic Data Process using the Cell BE Microprocessor Architecture; Schellenberg, F.M.; Kingsley, T.; Cobb, N.; Dudau, D.; Chalisani, R.; McKibben, J.; McPherson, S.; Proceedings of the IEEE Electronic Data Process (EDP) Workshop, (2007)
[15] FASTRUN: A special purpose, hardwired computer for molecular simulation; Fine, R.D.; Dimmler, G.; Levinthal, C.; Proteins: Structure, Function and Genetics, 11, 242-253 (1991)
[16] Development of MD Engine: High-speed accelerator with parallel processor design for molecular dynamics simulations; Toyoda, S.; Miyagawa, H.; Kitamura, K.; Amisaki, T.; Hashimoto, E.; Ikeda, H.; Kusuni, A.; Miyakawa, N.; J. Comp. Chem., 20, 185-199 (1999)
[17] Computing for LQCD: apeNEXT; Belletti,F.; Schifano, F.S.; Tripiccione, R.; Bodin, F.; Boucaud. P.; Micheli, J.; Pine, O.; Cabibbo, N.; de Luca, S.; Lonardo, A.; Rossetti, D.; Vicini, P.; Lukyanov, M.; Morin, L.; Paschedag, N.; Simma, H.; Morenas, V.; Pleiter, D.; Rapuano, F.; Computing in Science and Engineering, 8, 18-29 (2006)
[18] Special Purpose Computers for Scientific Simulations -- The GRAPE systems.; Makino J. and Taiji M. (1998) (John Wiley and Sons, Chichester)
[19] Anton, a Special-Purpose Machine for Molecular Dynamics Simulations; Shaw, D.E.; Deneroff, M.M.; Dror, R.O.; Kuskin, J.S.; Larson, R.H.; Salmon, J.K.; Young, C.; Batson, B.; Bowers, K.J.; Chao, J.C.; Eastwood, M.P.; Gagliardo, J.; Grossman, J.P.; Ho, R.; Ierardi, D.J.; Kolossvary, I.; Kepleis, J.L.; Layman, T.; McLeavey, C.; Moraes, M.A.; Mueller, R.; Priest, C.E.; Shan. Y.; Spengler, J.; Theobald, M.; Towles, B.; Wang, S.C.; Communications of the ACM, 51, 91-97 (2008)