Optimal Spaced Seeds for Hidden Markov Models, with Application to Homologous Coding Regions

Broncarona Brejová, Daniel G. Brown and Tomáscaron Vinarcaron

Lecture Notes in Computer Science,  Volume 2676 / 2003, CPM 2003


We study the problem of computing optimal spaced seeds for detecting sequences generated by a Hidden Markov model. Inspired by recent work in DNA sequence alignment, we have developed such a model for representing the conservation between related DNA coding sequences. Our model includes positional dependencies and periodic rates of conservation, as well as regional deviations in overall conservation rate. We show that, for hidden Markov models in general, the probability that a seed is matched in a region can be computed efficiently, and use these methods to compute the optimal seed for our models. Our experiments on real data show that the optimal seeds are substantially more sensitive than the seeds used in the standard alignment program BLAST, and also substantially better than those of PatternHunter or WABA, both of which use spaced seeds. Our results offer the hope of improved gene finding due to fewer missed exons in DNA/DNA comparison, and more effective homology search in general, and may have applications outside of bioinformatics.



Vector Seeds An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity

Broncarona Brejová, Daniel G. Brown and Tomáscaron Vinarcaron

Lecture Notes in Computer Science,  Volume 2812 / 2003, WABI 2003

We present improved techniques for finding homologous regions in DNA and protein sequences. Our approach focuses on the core region of a local pairwise alignment; we suggest new ways to characterize these regions that allow marked improvements in both specificity and sensitivity over existing techniques for sequence alignment. For any such characterization, which we call a vector seed, we give an efficient algorithm that estimates the specificity and sensitivity of that seed under reasonable probabilistic models of sequence. We also characterize the probability of a match when an alignment is required to have multiple hits before it is detected. Our extensions fit well with existing approaches to sequence alignment, while still offering substantial improvement in runtime and sensitivity, particularly for the important problem of identifying matches between homologous coding DNA sequences.



Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models.

Jansen R, Bussemaker HJ, Gerstein M.
Nucleic Acids Res. 2003 Apr 15; 31(8): 2242-51.

Highly expressed genes in many bacteria and small eukaryotes often have a strong compositional bias, in terms of codon usage. Two widely used numerical indices, the codon adaptation index (CAI) and the codon usage, use this bias to predict the expression level of genes. When these indices were first introduced, they were based on fairly simple assumptions about which genes are most highly expressed: the CAI was originally based on the codon composition of a set of only 24 highly expressed genes, and the codon usage on assumptions about which functional classes of genes are highly expressed in fast-growing bacteria. Given the recent advent of genome-wide expression data, we should be able to improve on these assumptions. Here, we measure, in yeast, the degree to which consideration of the current genome-wide expression data sets improves the performance of both numerical indices. Indeed, we find that by changing the parameterization of each model its correlation with actual expression levels can be somewhat improved, although both indices are fairly insensitive to the exact way they are parameterized. This insensitivity indicates a consistent codon bias amongst highly expressed genes. We also attempt direct linear regression of codon composition against genome-wide expression levels (and protein abundance data). This has some similarity with the CAI formalism and yields an alternative model for the prediction of expression levels based on the coding sequences of genes. More information is available at http://bioinfo.mbb.yale.edu/expression/codons.



Estimating the repeat structure and length of DNA sequences using L-tuples.

Li X, Waterman MS.

Genome Res. 2003 Aug; 13(8): 1916-22

In shotgun sequencing projects, the genome or BAC length is not always known. We approach estimating genome length by first estimating the repeat structure of the genome or BAC, sometimes of interest in its own right, on the basis of a set of random reads from a genome project. Moreover, we can find the consensus for repeat families before assembly. Our methods are based on the l-tuple content of the reads.


Tracing the evolutionary history of Drosophila regulatory regions with models that identify transcription factor binding sites.

Dermitzakis ET, Bergman CM, Clark AG.

Mol Biol Evol. 2003 May; 20(5): 703-14. Epub 2003 Apr

Much of evolutionary change is mediated at the level of gene expression, yet our understanding of regulatory evolution remains unsatisfying. In light of recent data indicating that transcription factor binding sites undergo substantial turnover between species, we attempt to quantify the process of binding site turnover in regulatory regions of well-studied genes controlling embryonic patterning in Drosophila. We examine polymorphism and divergence data in Drosophila melanogaster and four related species from regulatory regions of five early development genes for which functional binding sites have been identified. This analysis reveals that Drosophila regulatory regions exhibit patterns of variation consistent with functional constraint. We develop a novel approach to binding site prediction which we use to characterize the process of binding site divergence in regulatory regions. This method uses sets of known binding sites to construct a model that predicts transcription factor specificity and bootstrap sampling to derive significance levels. This approach allows appropriate significance levels to be determined even in the face of skewed base composition in the background sequence. Using this approach, we show that, although functional elements exhibit conservation of sequence, there is substantial potential to gain new functional elements within the regulatory regions. Our results show that application of models that predict transcription factor binding sites can yield insights into the process and dynamics of binding site evolution within regulatory regions.


Title: Genomewide view of gene silencing by small interfering RNAs

Authors: Chi JT, Chang HY, Wang NN, Chang DS, Dunphy N, Brown PO

Ref: Proc Natl Acad Sci USA 2003 May 27;100(11):6343-6

Abstract: RNA interference (RNAi) is an evolutionarily conserved mechanism in plant and animal cells that directs the degradation of messenger RNAs homologous to short double-stranded RNAs termed small interfering RNA (siRNA). The ability of siRNA to direct gene silencing in mammalian cells has raised the possibility that siRNA might be used to investigate gene function in a high throughput fashion or to modulate gene expression in human diseases. The specificity of siRNA-mediated silencing, a critical consideration in these applications, has not been addressed on a genomewide scale. Here we show that siRNA-induced gene silencing of transient or stably expressed mRNA is highly gene-specific and does not produce secondary effects detectable by genomewide expression profiling. A test for transitive RNAi, extension of the RNAi effect to sequences 5' of the target region that has been observed in Caenorhabditis elegans, was unable to detect this phenomenon in human cells.



Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model

Mayetri Gupta ; Jun S. Liu

Journal of the American Statistical Association, Volume: 98 Number: 461 Page: 55 -- 66

Abstract: Detection of unknown patterns from a randomly generated sequence of observations is a problem arising in fields ranging from signal processing to computational biology. Here we focus on the discovery of short recurring patterns (called motifs) in DNA sequences that represent binding sites for certain proteins in the process of gene regulation. What makes this a difficult problem is that these patterns can vary stochastically. We describe a novel data augmentation strategy for detecting such patterns in biological sequences based on an extension of a "dictionary" model. In this approach, we treat conserved patterns and individual nucleotides as stochastic words generated according to probability weight matrices and the observed sequences generated by concatenations of these words. By using a missingdata approach to find these patterns, we also address other related problems, including determining widths of patterns, finding multiple motifs, handling low-complexity regions, and finding patterns with insertions and deletions. The issue of selecting appropriate models is also discussed. However, the flexibility of this model is also accompanied by a high degree of computational complexity. We demonstrate how dynamic programming-like recursions can be used to improve computational efficiency.

http://www.people.fas.harvard.edu/~gupta6/papers/sdict.pdf




Title: Genomewide demarcation of RNA polymerase II transcription units revealed by physical fractionation of chromatin

Authors: Nagy PL, Cleary ML, Brown PO, Lieb JD

Ref: Proc Natl Acad Sci USA 2003 May 27;100(11):6364-9

Abstract: Epigenetic modifications of chromatin serve an important role in regulating the expression and accessibility of genomic DNA. We report here a genomewide approach for fractionating yeast chromatin into two functionally distinct parts, one containing RNA polymerase II transcribed sequences, and the other comprising noncoding sequences and genes transcribed by RNA polymerases I and III. Noncoding regions could be further fractionated into promoters and segments lacking promoters. The observed separations were apparently based on differential crosslinking efficiency of chromatin in different genomic regions. The results reveal a genomewide molecular mechanism for marking promoters and genomic regions that have a license to be transcribed by RNA polymerase II, a previously unrecognized level of genomic complexity that may exist in all eukaryotes. Our approach has broad potential use as a tool for genome annotation and for the characterization of global changes in chromatin structure that accompany different genetic, environmental, and disease states.


Title: Comparative analyses of multi-species sequences from targeted genomic regions

Authors: Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, Maskeri B, Hansen NF, Schwartz MS, Weber RJ, Kent WJ, Karolchik D, Bruen TC, Bevan R, Cutler DJ, Schwartz S, Elnitski L, Idol JR, Prasad AB, Lee-Lin SQ, Maduro VV, Summers TJ, Portnoy ME, Dietrich NL, Akhter N, Ayele K, Benjamin B, Cariaga K, Brinkley CP, Brooks SY, Granite S, Guan X, Gupta J, Haghighi P, Ho SL, Huang MC, Karlins E, Laric PL, Legaspi R, Lim MJ, Maduro QL, Masiello CA, Mastrian SD, McCloskey JC, Pearson R, Stantripop S, Tiongson EE, Tran JT, Tsurgeon C, Vogt JL, Walker MA, Wetherby KD, Wiggins LS, Young AC, Zhang LH, Osoegawa K, Zhu B, Zhao B, Shu CL, De Jong PJ, Lawrence CE, Smit AF, Chakravarti A, Haussler D, Green P, Miller W, Green ED

Ref: Nature 2003 Aug 14;424(6950):788-93

Abstract: The systematic comparison of genomic sequences from different organisms represents a central focus of contemporary genome analysis. Comparative analyses of vertebrate sequences can identify coding and conserved non-coding regions, including regulatory elements, and provide insight into the forces that have rendered modern-day genomes. As a complement to whole-genome sequencing efforts, we are sequencing and comparing targeted genomic regions in multiple, evolutionarily diverse vertebrates. Here we report the generation and analysis of over 12 megabases (Mb) of sequence from 12 species, all derived from the genomic region orthologous to a segment of about 1.8 Mb on human chromosome 7 containing ten genes, including the gene mutated in cystic fibrosis. These sequences show conservation reflecting both functional constraints and the neutral mutational events that shaped this genomic region. In particular, we identify substantial numbers of conserved non-coding segments beyond those previously identified experimentally, most of which are not detectable by pair-wise sequence comparisons alone. Analysis of transposable element insertions highlights the variation in genome dynamics among these species and confirms the placement of rodents as a sister group to the primates.


Title: Cross-species sequence comparisons: a review of methods and available resources

Authors: Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC

Ref: Genome Res 2003 Jan;13(1):1-12

Abstract: With the availability of whole-genome sequences for an increasing number of species, we are now faced with the challenge of decoding the information contained within these DNA sequences. Comparative analysis of DNA sequences from multiple species at varying evolutionary distances is a powerful approach for identifying coding and functional noncoding sequences, as well as sequences that are unique for a given organism. In this review, we outline the strategy for choosing DNA sequences from different species for comparative analyses and describe the methods used and the resources publicly available for these studies.


Title: Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution.

Authors: Hardison RC, Roskin KM, Yang S, Diekhans M, Kent WJ, Weber R, Elnitski L, Li J, O'Connor M, Kolbe D, Schwartz S, Furey TS, Whelan S, Goldman N, Smit A, Miller W, Chiaromonte F, Haussler D

Ref: Genome Res 2003 Jan;13(1):13-26

Abstract: Six measures of evolutionary change in the human genome were studied, three derived from the aligned human and mouse genomes in conjunction with the Mouse Genome Sequencing Consortium, consisting of (1) nucleotide substitution per fourfold degenerate site in coding regions, (2) nucleotide substitution per site in relics of transposable elements active only before the human-mouse speciation, and (3) the nonaligning fraction of human DNA that is nonrepetitive or in ancestral repeats; and three derived from human genome data alone, consisting of (4) SNP density, (5) frequency of insertion of transposable elements, and (6) rate of recombination. Features 1 and 2 are measures of nucleotide substitutions at two classes of "neutral" sites, whereas 4 is a measure of recent mutations. Feature 3 is a measure dominated by deletions in mouse, whereas 5 represents insertions in human. It was found that all six vary significantly in megabase-sized regions genome-wide, and many vary together. This indicates that some regions of a genome change slowly by all processes that alter DNA, and others change faster. Regional variation in all processes is correlated with, but not completely accounted for, by GC content in human and the difference between GC content in human and mouse.


Scoring two-species local alignments to try to statistically separate neutrally evolving from selected DNA segments

Krishna M. Roskin, Mark Diekhans, David Haussler

Proceedings of the seventh annual international conference on Computational molecular biology (RECOMB), Berlin, Pages: 257 – 266.


We construct several score functions for use in locating unusually conserved regions in a genome-wide search of aligned DNA from two species. We test these functions on regions of the human genome aligned to the mouse genome. These score functions are derived from properties of neutrally evolving sites on the mouse and human genome, and can be adjusted to the local background rate of conservation. The aim of these functions is to try to identify regions of the human genome that are conserved by evolutionary selection, because they have an important function, rather than by chance. We use them to get a very rough estimate of the amount of DNA in the human genome that is under selection.

http://portal.acm.org/citation.cfm?id=640109&jmp=cit&dl=GUIDE&dl=ACM&CFID=11891234&CFTOKEN=11681079#CIT



Detecting protein sequence conservation via metric embeddings

E. Halperin, J. Buhler, R. Karp, R. Krauthgamer and B. Westover

Motivation: Comparing two protein databases is a fundamental task in biosequence annotation. Given two databases, one must find all pairs of proteins that align with high score under a biologically meaningful substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992). Distance-based approaches to this problem map each peptide in the database to a point in a metric space, such that peptides aligning with higher scores are mapped to closer points. Many techniques exist to discover close pairs of points in a metric space efficiently, but the challenge in applying this work to proteomic comparison is to find a distance mapping that accurately encodes all the distinctions among residue pairs made by a proteomic score matrix. Buhler (2002) proposed one such mapping but found that it led to a relatively inefficient algorithm for protein-protein comparison.

Results: This work proposes a new distance mapping for peptides under the BLOSUM matrices that permits more efficient similarity search. We first propose a new distance function on peptides derived from a given score matrix. We then show how to map peptides to bit vectors such that the distance between any two peptides is closely approximated by the Hamming distance (i.e. number of mismatches) between their corresponding bit vectors. We combine these two results with the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an improved distance-based algorithm for proteomic comparison. An initial implementation of the improved algorithm exhibits sensitivity within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up to eight times faster.

http://bioinformatics.oupjournals.org/cgi/content/abstract/19/suppl_1/i122



Bioinformatics Vol. 19 Suppl. 1 2003, Pages i292-i301

A probabilistic method to detect regulatory modules

Saurabh Sinha , Erik van Nimwegen and Eric D. Siggia

Motivation: The discovery of cis-regulatory modules in metazoan genomes is crucial for understanding the connection between genes and organism diversity.

Results: We develop a computational method that uses Hidden Markov Models and an Expectation Maximization algorithm to detect such modules, given the weight matrices of a set of transcription factors known to work together. Two novel features of our probabilistic model are: (i) correlations between binding sites, known to be required for module activity, are exploited, and (ii) phylogenetic comparisons among sequences from multiple species are made to highlight a regulatory module. The novel features are shown to improve detection of modules, in experiments on synthetic as well as biological data.

http://bioinformatics.oupjournals.org/cgi/content/abstract/19/suppl_1/i292

CREME: a framework for identifying cis-regulatory modules in human-mouse conserved segments

Roded Sharan, Ivan Ovcharenko, Asa Ben-Hur and Richard M. Karp

Motivation: The binding of transcription factors to specific regulatory sequence elements is a primary mechanism for controlling gene transcription. Recent findings suggest a modular organization of binding sites for transcription factors that cooperate in the regulation of genes. In this work we establish a framework for finding recurrent cis-regulatory modules in the promoters of a selected set of genes and scoring their statistical significance.

Results: Proceeding from a database of identified binding site motifs and their genomic locations we seek motifs whose frequency in the selected promoters is different than in a background promoter set. We present several statistical tests designed for this purpose. We provide a hashing algorithm for detecting combinations of these motifs that co-occur in clusters within the selected promoters. The significance of such co-occurrences is evaluated using novel statistical scores. Our methods are combined in CREME, a suite of software which includes a browser for viewing the pattern of occurrence of selected cis-regulatory modules. We applied our methodology to find modules within human-mouse conserved promoter segments, focusing on cell cycle regulated genes and stress response related genes. To validate the biological significance of the identified modules we tested whether the associated genes tended to be co-expressed or share similar function. In the cell cycle set five of the seven identified sets of genes were coherently expressed. On the stress response data four of the six detected sets fell predominantly into well-defined functional sub-categories.