Analyzing Large Genomic Data Collections
Modern biology has developed a wealth of high-throughput experimental techniques. Many of these, particularly microarrays and deep sequencing, produce measurements simultaneously for every gene in an organism's genome. As the number of such genome-scale datasets reaches the thousands for many organisms, new opportunities arise to understand systems-level biology and human disease by means of very large-scale data integration and analysis.
This diversity of genomic data presents excellent opportunities for the development of machine learning methods and for the discovery of new biology. Previous work has been particularly successful in integrating relatively small numbers of datasets to predict the roles and interactions of proteins in unicellular organisms. With large genomic data collections, new data mining techniques can extend the depth of this process; this allows the analysis of biological activity induced by specific environmental conditions and of associations and regulatory cross-talk between entire cellular pathways and processes.
Large-scale data integration also offers the breadth to analyze systems biology in complex higher organisms. This allows important areas such as human disease to be explored from the perspective of biomolecular interaction networks. It also raises new challenges, such as the need to incorporate knowledge of tissue types and of developmental stages into integrative models of metazoan biology. Finally, heterogeneous data integration can be applied to predict specific genetic interaction types such as transcriptional regulation, a precursor to the inference of detailed and accurate pathway models from large collections of diverse genomic data.