Data-intensive computing in biology
Location: Daresbury Laboratory, United Kingdom
Organisers
Mario Caccamo (The Genome Analysis Centre)
Paul Flicek (European Bioinformatics Institute)
Jonathan Follows (STFC Daresbury Laboratory)
Chris Rawlings (Rothamsted Research)
Martyn Winn (STFC)
A major theme of modern biology is the quantity and wealth of data available: the data deluge. Computational biologists need to be able to handle very large datasets, and to extract useful information and derive knowledge. Example applications include sequencing, metagenomics, proteomics, imaging and neuroscience. Recently, a new journal has been launched specifically aimed at "big-data" studies (www.gigasciencejournal.com).
It is clear that traditional high performance computing is not well suited to situations where calculations are data-bound rather than CPU-bound. Data storage and management are crucial, individual calculations depend on efficient I/O, and transmitting large datasets across a network may be problematic. Typically, the calculation is brought to the data, rather than the other way round.
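As a concrete illustration of the I/O-bound regime, the minimal sketch below (Python, with a hypothetical input file name) processes a large file in fixed-size chunks rather than loading it into memory, so throughput is limited by the storage system rather than the CPU. It is an illustration of the principle, not a prescription for any particular application.

```python
# Minimal sketch of streaming, I/O-bound processing of a large file.
# The file name and record format are hypothetical placeholders.

def count_records(path, chunk_size=64 * 1024 * 1024):
    """Stream a large file in 64 MB chunks and count newline-delimited records."""
    records = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            records += chunk.count(b"\n")   # cheap per-chunk work: the run is I/O-bound
    return records

if __name__ == "__main__":
    print(count_records("reads.fastq"))     # hypothetical input file
```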
We are proposing to host a CECAM workshop on the computational challenges in data-driven biology. While we recognise that this differs somewhat from the traditional areas for CECAM, it is an area of strategic importance for science and connects closely with the CECAM themes "Computational Biology" and "Algorithms for Innovative Computing Architectures". The unifying theme of the workshop will be the innovations in computing needed to support big-data science. The motivation will come from leading biological applications. We will look both at the initial experiment producing the large datasets, and also at downstream analyses of the data.
The workshop will bring together hardware experts and scientists involved in challenging data-driven disciplines. We have good relations with many computer vendors (for example, through the annual Machine Evaluation Workshop hosted by the Daresbury node of CECAM) and would invite representatives. It may also be appropriate to invite instrument manufacturers to provide a vision for future requirements.
The workshop will cover the following areas:
1. Data storage options
2. Standards for data curation (what is kept, how is it retrieved)
3. Database technologies
4. Available networks and data transfer
5. Scalability of current applications (e.g. bioinformatics tools)
6. Options for local compute facilities
7. Options for remote data processing and use of HPC centres
In the field of DNA sequencing, there has been a rapid spread of Next Generation Sequencing (NGS) machines produced by companies such as Illumina, Roche and Applied Biosystems [1]. Third generation technologies (e.g. Helicos HeliScope, Oxford Nanopore, Pacific Biosciences) are just coming to market. NGS technologies can typically produce over 1 TB of raw data (images) in one run. Processing the raw data involves a number of computational challenges, such as sequence assembly from the short reads of NGS machines. Downstream analyses of sequence data require manipulation of large datasets.
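To make the assembly challenge concrete, the toy sketch below decomposes short reads into k-mers, the first step taken by de Bruijn graph assemblers. The reads and k value are invented for illustration; real assemblers (e.g. Velvet, ABySS) add error correction, graph simplification and far more careful memory management to cope with hundreds of millions of reads.

```python
from collections import defaultdict

def kmer_graph(reads, k=5):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    edges = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].append(kmer[1:])   # prefix (k-1)-mer -> suffix (k-1)-mer
    return edges

# Illustrative short reads; a real NGS run produces vastly more.
reads = ["ACGTACGT", "CGTACGTT", "GTACGTTA"]
graph = kmer_graph(reads, k=5)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```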
Proteomics is also beginning to generate large quantities of data. Peptide Mass Fingerprinting (PMF) produces mass spectra which need to be compared against large databases of known proteins. A range of imaging techniques, from cellular up to whole-body scans, is producing increasing amounts of data. Models for neuroscience are becoming increasingly large and sophisticated; see for example the Human Connectome Project [2] and the Human Brain Project.
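A much simplified view of the PMF matching step is sketched below: observed peptide masses are compared, within a tolerance, against the theoretical masses derived from database proteins, and candidates are ranked by the number of matches. The masses, tolerance and protein names are invented for illustration; production search engines use far more sophisticated scoring and databases of hundreds of thousands of proteins.

```python
def pmf_score(observed, theoretical, tolerance=0.2):
    """Count observed peptide masses matching a theoretical mass within tolerance (Da)."""
    return sum(
        any(abs(mass - ref) <= tolerance for ref in theoretical)
        for mass in observed
    )

# Hypothetical observed spectrum and a tiny in-memory "database".
observed = [842.5, 1045.6, 1179.6, 2211.1]
database = {
    "protein_A": [842.51, 1045.55, 1300.7],
    "protein_B": [842.52, 1179.62, 2211.08],
}
ranked = sorted(database, key=lambda name: pmf_score(observed, database[name]), reverse=True)
print(ranked)   # best-matching candidate first
```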
The ESFRI project ELIXIR, coordinated by the European Bioinformatics Institute (EBI), is building an infrastructure for biological information. ELIXIR, and in particular the EBI, provides the long-term repository for much of the data considered here. ELIXIR has undertaken a number of technical feasibility studies (www.elixir-europe.org/page.php?page=reports), including one on the use of supercomputing facilities. Another ESFRI project, EuroBioImaging, is concerned with a range of biological and medical imaging. The 1000 Genomes Project [3] is the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. While the cost of the sequencing has come down, the data challenge is still formidable.
Data storage has previously received relatively little attention in high performance computing systems; the main focus has been performance, simply in terms of the speed of reading and writing data. Computer usage in biology is already exposing the shortcomings of the data storage elements of today's high performance computing systems, shortcomings which will become increasingly apparent in other scientific domains in the near future.
Some of the technology developments have the potential for making the problem worse: data storage densities on rotating disks continue to increase dramatically, but the negative implication of this is that a single disk failure can lead to significant recovery times and degraded performance during the recovery period.
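As a rough, back-of-envelope illustration (the figures are assumptions, not vendor specifications), rebuilding a single failed high-capacity drive at a sustained rate of around 100 MB/s already takes the best part of a working day, during which the array runs degraded:

```python
# Back-of-envelope RAID rebuild time; capacity and rate are illustrative assumptions.
capacity_tb = 4                       # failed disk capacity in TB
rebuild_rate_mb_s = 100               # assumed sustained rebuild throughput in MB/s

seconds = capacity_tb * 1e12 / (rebuild_rate_mb_s * 1e6)
print(f"Rebuild time: {seconds / 3600:.1f} hours")   # ~11 hours for these figures
```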
The good news is that suppliers of data storage solutions are increasingly offering more than just hardware. The threads being developed in parallel are:
Technology trends to increase the density and reduce the cost of data storage solutions
New technologies such as solid-state storage to address high-speed data storage requirements [5], [6]
Easier means of associating metadata with data, to allow long-term storage of research data and, most importantly, retrieval of useful data at a later date (a minimal sketch follows below). Today we are mainly in the realm of "write only" data.
Developments in network technology also play a part here; however, the real limit is usually cost, and it will always be possible to identify uses for transmitting data over networks that exceed the available bandwidth.
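Picking up the metadata point above, one lightweight approach is to keep a machine-readable "sidecar" record next to each dataset so that it can be found and interpreted later. The field names below are hypothetical, not a proposed standard; community standards such as MIAME (discussed later) address this properly.

```python
import json
from datetime import date

# Hypothetical sidecar metadata written alongside a results file; the field
# names are illustrative only, not a proposed standard.
metadata = {
    "dataset": "run_0421_lane3.fastq.gz",
    "instrument": "sequencer-01",
    "organism": "Escherichia coli",
    "created": date.today().isoformat(),
    "protocol": "paired-end, 100 bp reads",
}

with open("run_0421_lane3.metadata.json", "w") as handle:
    json.dump(metadata, handle, indent=2)
```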
Data compression techniques are increasingly vital for users of data-intensive computing. They rely mainly on application-level programming informed by an understanding of the structure of the data. Increasingly important is an awareness of the trade-off between storing results and recalculating them from much smaller input data sets.
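The sketch below illustrates why knowledge of the data's structure pays off: a DNA sequence over the four-letter alphabet can be packed into two bits per base, independently of what a generic compressor such as gzip achieves. The sequence is randomly generated for illustration; real domain-aware formats (e.g. CRAM for aligned reads) go much further.

```python
import gzip
import random

BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_2bit(seq):
    """Pack a DNA string (A/C/G/T only) into 2 bits per base."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(4000))     # invented sequence
print("raw bytes:   ", len(seq))                              # one byte per base
print("gzip bytes:  ", len(gzip.compress(seq.encode())))      # generic, structure-blind
print("2-bit bytes: ", len(pack_2bit(seq)))                   # domain-aware packing
```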
A significant change has recently taken place in the evolution of high performance computer processing power. CMOS technology became predominant and developed over the years to deliver faster and cheaper processors, so computer performance increased automatically with each new generation of processor. That trend has ended: new generations of processors provide larger numbers of processing engines, but these run at the same speeds as the previous generation.
So although CMOS technology continues to develop, the ways in which it is used need to change in order to extract the greatest performance from systems built with the latest technology [9], [10].
The reason for this technology change is simple: power consumption. Computer processors now run extremely hot and cannot run any hotter, and therefore cannot run any faster. A completely different technology would be required, and no likely alternative to CMOS is foreseen in the next 10 years.
In addition to greater numbers of processor cores in a system, alternative CMOS solutions offering extremely low-power, compute-intensive dedicated processors are available today, exemplified by graphics processors, initially developed for computer games but now enabled for use in high performance computing systems [7], [8].
A major effort is taking place across the entire high performance computing community to exploit the capabilities offered by these new developments, which have great significance for, and impact on, applications and algorithms. An inability to exploit these new technologies may limit the scalability of any application.
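As a minimal illustration of the "more cores, not faster cores" point, the sketch below spreads an embarrassingly parallel, CPU-bound task over the available cores using Python's multiprocessing module; the workload itself is a placeholder. Exploiting GPUs or distributed systems requires correspondingly larger changes to algorithms and data layout.

```python
from multiprocessing import Pool, cpu_count

def cpu_bound_task(chunk):
    """Placeholder for per-chunk analysis work (e.g. scoring one block of reads)."""
    return sum(i * i for i in chunk)

if __name__ == "__main__":
    # Split the work into independent chunks, one task per chunk.
    chunks = [range(n * 1_000_000, (n + 1) * 1_000_000) for n in range(8)]
    with Pool(processes=cpu_count()) as pool:        # one worker per available core
        results = pool.map(cpu_bound_task, chunks)   # chunks processed in parallel
    print(sum(results))
```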
All these technology trends apply to "commodity" computer equipment as much as to purpose-built high performance computer systems; indeed, many HPC systems today, up to the size of university HPC facilities, are effectively built from commodity components. So there is no escaping the implications, nor is there an easy solution for those who wish to reduce the "time to solution" by making greater use of HPC.
Other technologies being considered include cloud computing, although there are risks for data-critical public projects in relying on a commercial solution. Crowd-sourcing has also been used; see e.g. www.bbsrc.ac.uk/news/research-technologies/2011/110608-pr-tgac-helps-analysis-e-coli.aspx
The analysis of data at remote institutions relies on the network. At the European level, the academic community is served by the Geant2 network, managed by DANTE. The e-infrastructure is provided by EGI. New technologies for file transfer are being developed, for example the international Sequence Read Archive uses the fasp protocol.
Downstream analysis of data also relies on being able to identify and retrieve the required data. There are a number of relevant data standards being developed, for example MIAPE for proteomics data and MIAME [4] for microarray expression data.