Algorithmic Re-Engineering for Modern Non-Conventional Processing Units

September 30, 2009 to October 2, 2009
Location: CECAM-USI, Lugano, Switzerland


  • Teodoro Laino (IBM Research - Zurich, Switzerland)
  • Alessandro Curioni (IBM Research - Zurich, Switzerland)



IBM Zurich Research Laboratory



Much as the move from structured to object-oriented programming reshaped mainstream scientific software development in the 1990s, we are now at the beginning of a comparable revolution in algorithmic engineering. We find ourselves at a hardware/software technology inflection point driven by large-scale parallelism, including parallel operations on the contents of a single register, pipelining, memory pre-fetch, single-core simultaneous multithreading ("hyper-threading") and superscalar instruction issue. New processor options have emerged, such as the Cell BE processor and GPUs, which are extremely aggressive in their use of parallelism while retaining general-purpose programmability. Other processors, such as FPGAs and special-purpose hardware, also based on on-chip parallelism, are emerging as extremely efficient solutions specialized for unique tasks. A natural question is whether a non-conventional processing unit can gain a significant performance advantage over commodity processing units. Historically, conventional microprocessors have outpaced non-conventional solutions. Every plan to build non-conventional hardware must therefore carefully account for the expected exponential growth in the capabilities of conventional hardware and, most importantly, for the considerable time investment in algorithm re-engineering, a mandatory task for exploiting the full capabilities of new non-conventional microprocessors. Nonetheless, non-conventional processing units can deliver a much greater improvement in absolute performance than the speedup predicted by Moore's law over the development period.

Indeed, if a processing unit is designed to run 1000 times faster than the state-of-the-art microprocessor at conceptualization time, then during its 5-7 years of development a commodity solution will show roughly a tenfold improvement (performance doubling approximately every 18 months). The non-conventional solution will therefore outperform conventional microprocessors by roughly two orders of magnitude at bring-up time. This makes it crucial to re-engineer algorithms within a short time frame, so that the performance advantage is fully exploited. The most important issue facing the software community is thus how to program these classes of processing units in the most productive way. FPGAs, invented around 1984 by Ross Freeman, co-founder of Xilinx, contain a relatively large number of programmable logic components with programmable interconnects. They are increasingly used alongside traditional processors in a variety of hybrid parallel computing systems. In systems such as the Cray XD1 supercomputer, FPGAs play a supporting role by providing massive instruction-level parallelism and by fundamentally changing the way key algorithms are processed. Important algorithmic re-engineering has already begun in the FPGA community, spanning purely algebraic [1], physical [2,3] and aerospace applications [4]. Modern GPUs, in turn, are not only powerful graphics engines but also highly parallel programmable processors whose peak arithmetic and memory bandwidth substantially outpace their CPU counterparts. Owing to the rapid increase in both programmability and capability, an extensive research community has successfully mapped a broad range of computationally demanding, complex problems onto GPUs.

In particular, several algorithmic re-engineering activities have been carried out in recent years in the fields of classical molecular simulation [5], computational chemistry [6,7] and physics [8,9,10]. Cell BE processors are the latest entry in the class of non-conventional modern processing units. The Cell project was started in 2001 with the goal of a processing unit that combines the considerable floating-point resources required by demanding numerical algorithms with a power-efficient, software-controlled memory hierarchy. As a result, Cell BE processors represent a radical departure from conventional multiprocessor or multi-core architectures. Although this architecture is relatively new in the field of computational science, a considerable number of numerical algorithms have already been ported to it [11,12,13,14]. Last but not least, special-purpose hardware forms a category of its own, in which the processing units are designed specifically around the computational problem. With programmable chips or specifically designed processing units, the challenge is to find the best mapping between the scientific problem and the layout of the computational hardware. This class of non-conventional modern hardware includes the FASTRUN [15], MD Engine [16], APENext [17], GRAPE [18] and Anton [19] projects, several of which aim at performing classical molecular dynamics simulations efficiently.
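The speedup arithmetic above (a 1000x design target eroded by commodity hardware that doubles in performance every 18 months over a 5-7 year development cycle) can be sketched as follows; the numbers are the illustrative ones from the text, not measurements:

```python
def commodity_gain(years, doubling_months=18.0):
    """Moore's-law-style growth of commodity processor performance:
    one doubling every `doubling_months` months."""
    return 2.0 ** (years * 12.0 / doubling_months)

# Hypothetical device conceived with a 1000x advantage over the
# state-of-the-art microprocessor at conceptualization time.
initial_advantage = 1000.0

for dev_years in (5, 6, 7):
    gain = commodity_gain(dev_years)
    residual = initial_advantage / gain
    print(f"{dev_years} years: commodity gains {gain:5.1f}x, "
          f"residual advantage ~{residual:5.1f}x")
```

At five years of development the commodity side has gained roughly 10x, leaving the special-purpose device with an advantage on the order of two orders of magnitude, consistent with the estimate in the text; the longer the development cycle, the more of that advantage is eroded.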


[1] Mapping sparse matrix scientific applications onto FPGA-augmented reconfigurable supercomputers; Morris, G.R.; PhD Thesis, University of Southern California, Los Angeles (2007)

[2] High Performance Scientific Computing Using FPGAs with IEEE Floating Point and Logarithmic Arithmetic for Lattice QCD; Callanan, O.; Gregg, D.; Nisbet, A.; Peardon, M.; FPL '06. International Conference on Field Programmable Logic and Applications, 1-6 (2006)

[3] Using Floating-Point Arithmetic on FPGAs to Accelerate Scientific N-Body Simulations; Lienhart, G.; Kugel, A.; Männer, R.; 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'02), 182 (2002)

[4] Engineering Applications on NASA’s FPGA-based Hypercomputer; Storaasli, O.O.; 5th MAPLD Conference Proceedings (2002)

[5] GPU acceleration of cutoff pair potentials for molecular modeling applications; Rodrigues, C.I.; Hardy, D.J.; Stone, J.E.; Schulten, K.; Hwu, W.-M.W.; Proceedings of the 2008 conference on Computing frontiers, 273-282 (2008)

[6] Quantum Chemistry on Graphical Processing Units. 1. Strategies for Two-Electron Integral Evaluation; Ufimtsev, I.S.; Martinez, T.J.; J. Chem. Theory Comput., 4, 222-231 (2008)

[7] Accelerating Resolution-of-the-Identity Second-Order Møller-Plesset Quantum Chemistry Calculations with Graphical Processing Units; Vogt, L.; Olivares-Amaya, R.; Kermes, S.; Shao, Y.; Amador-Bedolla, C.; Aspuru-Guzik, A.; J. Phys. Chem. A, 112, 2049-2057 (2008)

[8] Using GPUs to improve multigrid solver performance on a cluster; Göddeke, D.; Strzodka, R.; Mohd-yusof, J.; Mccormick, P.; Wobker, H.; Becker, C.; Turek, S.; International Journal of Computational Science and Engineering (2008)

[9] Toward efficient GPU-accelerated N-body simulations; Stock, M.J.; Gharakhani, A.; Proceedings of the 46th AIAA Aerospace Sciences Meeting and Exhibit, 1-13 (2008)

[10] Acceleration of a two-dimensional Euler flow solver using commodity graphics hardware; Brandvik, T.; Pullan, G.; Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 221, 1745-1748 (2007)

[11] Towards Personalized Medicine: High-Performance Computing in the Life Sciences; Schenk, O.; Burkhart, J.; Reddy, H.; ERCIM NEWS, Supercomputing at work, 29-31 (2008)

[12] Exploitation of Cell Multi-Processor Array in Solution of Spatio-Temporal Dynamics; Nagy, Z.; Kék, L.; Kincses, Z.; Kiss, A.; Szolgay, P.; ERCIM NEWS, Supercomputing at work, 26-27 (2008)

[13] Efficient Breadth-First Search on the Cell/BE Processor; Scarpazza, D.P.; Villa, O.; Petrini, F.; Parallel and Distributed Systems, IEEE Transactions on, 19, 1381 – 1395 (2008)

[14] Accelerating DFM Electronic Data Process using the Cell BE Microprocessor Architecture; Schellenberg, F.M.; Kingsley, T.; Cobb, N.; Dudau, D.; Chalisani, R.; McKibben, J.; McPherson, S.; Proceedings of the IEEE Electronic Data Process (EDP) Workshop, (2007)

[15] FASTRUN: A special purpose, hardwired computer for molecular simulation; Fine, R.D.; Dimmler, G.; Levinthal, C.; Proteins: Structure, Function and Genetics, 11, 242-253 (1991)

[16] Development of MD Engine: High-speed accelerator with parallel processor design for molecular dynamics simulations; Toyoda, S.; Miyagawa, H.; Kitamura, K.; Amisaki, T.; Hashimoto, E.; Ikeda, H.; Kusuni, A.; Miyakawa, N.; J. Comp. Chem., 20, 185-199 (1999)

[17] Computing for LQCD: apeNEXT; Belletti, F.; Schifano, F.S.; Tripiccione, R.; Bodin, F.; Boucaud, P.; Micheli, J.; Pine, O.; Cabibbo, N.; de Luca, S.; Lonardo, A.; Rossetti, D.; Vicini, P.; Lukyanov, M.; Morin, L.; Paschedag, N.; Simma, H.; Morenas, V.; Pleiter, D.; Rapuano, F.; Computing in Science and Engineering, 8, 18-29 (2006)

[18] Special Purpose Computers for Scientific Simulations -- The GRAPE systems; Makino, J.; Taiji, M.; John Wiley and Sons, Chichester (1998)

[19] Anton, a Special-Purpose Machine for Molecular Dynamics Simulations; Shaw, D.E.; Deneroff, M.M.; Dror, R.O.; Kuskin, J.S.; Larson, R.H.; Salmon, J.K.; Young, C.; Batson, B.; Bowers, K.J.; Chao, J.C.; Eastwood, M.P.; Gagliardo, J.; Grossman, J.P.; Ho, R.; Ierardi, D.J.; Kolossvary, I.; Klepeis, J.L.; Layman, T.; McLeavey, C.; Moraes, M.A.; Mueller, R.; Priest, C.E.; Shan, Y.; Spengler, J.; Theobald, M.; Towles, B.; Wang, S.C.; Communications of the ACM, 51, 91-97 (2008)