Algorithmic Re-Engineering for Modern Non-Conventional Processing Units
CECAM-USI, Lugano, Switzerland
In the 1990s, the move from structured programming to object-oriented programming was the greatest change that mainstream scientific software development had seen in three decades. We are now at the beginning of a comparable revolution in algorithmic engineering. We are at a hardware/software technology inflection point due to large-scale parallelism, including parallel operations on the contents of a single register, pipelining, memory pre-fetch, single-core simultaneous multithreading ("hyper-threading") and superscalar instruction issue. Some new processor options have emerged, such as the Cell BE processor and GPUs, which are extremely aggressive in their use of parallelism while retaining general-purpose programmability. Other processors, such as FPGAs and special-purpose hardware, also rely on on-chip parallelism but are emerging as extremely efficient solutions specialized for specific tasks.
A natural question is whether a non-conventional processing unit can gain a significant performance advantage over commodity processing units. Historically, conventional microprocessors have outpaced non-conventional solutions. In fact, every plan to build non-conventional hardware must carefully account for the expected exponential growth in the capabilities of conventional hardware and, most importantly, for the considerable time investment in algorithm re-engineering, which is mandatory in order to exploit the full capabilities of the new non-conventional microprocessors. Nevertheless, non-conventional processing units can deliver a much greater improvement in absolute performance than the speedup predicted by Moore's law over the development period.
In fact, if a processing unit is expected to run 1000 times faster than the state-of-the-art microprocessor at conceptualization time, then during the 5-7 years of development a commodity solution will improve by a factor of roughly 10 to 25 (performance doubling approximately every 18 months). The non-conventional solution will therefore still outperform conventional microprocessors by a factor of roughly 40 to 100 at bring-up time. This makes it important to re-engineer the algorithms in a short time frame, in order to fully exploit the performance advantage. The most important issue facing the software community is therefore how to program these classes of processing units in the most productive way.
FPGAs, invented around 1984 by Ross Freeman, a cofounder of Xilinx, have a relatively large number of programmable logic components with programmable interconnects. They are increasingly used, alongside traditional processors, in a variety of hybrid parallel computing systems. In systems such as the Cray XD1 supercomputer, FPGAs play a supporting role by providing massive instruction-level parallelism and by fundamentally changing the way that key algorithms are processed. Important algorithmic re-engineering work has already begun in the FPGA community, spanning purely algebraic, physical [2,3] and aerospace applications.
Modern GPUs are not only powerful graphics engines but also highly parallel programmable processors whose peak arithmetic throughput and memory bandwidth substantially outpace those of their CPU counterparts. Due to the rapid increase in both programmability and capability, an extensive research community has successfully mapped a broad range of computationally demanding, complex problems to GPUs. In particular, several algorithmic re-engineering activities have been carried out in recent years in the fields of classical molecular simulations, computational chemistry [6,7] and physics [8,9,10].
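The Moore's-law estimate given earlier, for a unit conceived with a 1000-fold advantage, can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming only the figures quoted in this description (a 1000-fold advantage at conceptualization time, commodity performance doubling every 18 months, and 5 to 7 years of development); the function names are illustrative and not taken from any library.

```python
def commodity_gain(years, doubling_months=18.0):
    """Growth factor of commodity performance, assuming steady doubling
    every `doubling_months` months (the 18-month figure is an assumption
    taken from the text)."""
    return 2.0 ** (years * 12.0 / doubling_months)


def residual_advantage(initial_speedup, years, doubling_months=18.0):
    """Speedup left at bring-up time, once commodity hardware has caught up
    for `years` years of development."""
    return initial_speedup / commodity_gain(years, doubling_months)


if __name__ == "__main__":
    for years in (5, 6, 7):
        gain = commodity_gain(years)
        advantage = residual_advantage(1000.0, years)
        print(f"{years} years: commodity gain x{gain:.1f}, "
              f"remaining advantage x{advantage:.0f}")
```

Under these assumptions the commodity gain is about 10x after 5 years and about 25x after 7 years, so the remaining advantage lies between roughly 40x and 100x. The result is sensitive to the assumed doubling period, which is why the time spent on algorithm re-engineering matters so much.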
Cell BE processors are the latest entry in the class of non-conventional modern processing units. The Cell project was started in 2001 with the idea of a processing unit able to combine the considerable floating-point resources required by demanding numerical algorithms with a power-efficient, software-controlled memory hierarchy. As a result, Cell BE processors make a radical departure from conventional multiprocessor or multi-core architectures. Although the architecture is relatively new in the field of computational science, a considerable number of numerical algorithms have already been ported to it [11,12,13,14].
Last but not least, special-purpose hardware is a category of its own, in which the processing units are designed specifically around the computational problem. With programmable chips or specifically designed processing units, the trick is to find the best mapping between the scientific problem and the layout of the computational hardware. Notable projects in this class of non-conventional modern hardware include FASTRUN, MD Engine, APENext, GRAPE and Anton, several of which are aimed at performing classical molecular dynamics simulations efficiently.
Alessandro Curioni (IBM Research - Zurich) - Organiser
Teodoro Laino (IBM Research - Zurich) - Organiser