Tuesday, February 8, 2005
B17 Upson Hall

Computer Science
Spring 2005

Cosponsored by New Life Sciences Initiative and
Department of Biological Statistics & Computational Biology

Adam Siepel
University of California, Santa Cruz

Comparative Mammalian Genomics: Models of Evolution and Detection of Functional Elements

The availability of complete genomes for multiple species is causing sweeping changes in biology. Comparative sequence analysis is leading to new insights about the evolutionary forces that have shaped present-day genomes and is enabling previously unknown functional sequences to be identified and characterized. Comparative methods hold particular promise for mammalian and other vertebrate genomes, which, because of their size and complexity and other obstacles to experimental study, have been more difficult to characterize than the genomes of simpler organisms such as flies and nematodes.

In this talk, I will discuss both recent methodological advances in comparative sequence analysis and scientific insights gained from genome-wide surveys conducted with these methods. The main theme of the talk will be the use of evolutionary models to shed light on sequence function. Three particular problems will be discussed: the identification of evolutionarily conserved elements, the modeling of context- or neighbor-dependent substitution, and the identification of evolutionarily conserved protein-coding exons. These problems have been addressed using phylogenetic hidden Markov models (phylo-HMMs), statistical models that describe both the process of nucleotide substitution at individual sites in a genome and how this process changes from one site to the next.

Using a phylo-HMM-based program called phastCons, we have conducted a comprehensive search for conserved elements in vertebrate genomes. I will discuss the results of this search and of parallel searches in Drosophila, Caenorhabditis, and Saccharomyces genomes. Particular attention will be devoted to the most highly conserved of the elements identified in vertebrates, which appear to be associated with both transcriptional and post-transcriptional regulation and which show statistically significant enrichment for RNA secondary structure. In separate work, another phylo-HMM-based program called ExoniPhy has been used to predict about 170,000 protein-coding exons conserved in the human, mouse, and rat genomes, corresponding to an expected 20,400 genes. Of these, about 23,000 predicted exons (2,800 genes) are not represented in sets of known genes. Preliminary experimental (RT-PCR) results indicate that the false positive rate of these predictions is quite low (<30%).