Title: The microRNAs of Caenorhabditis elegans
Authors: Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB, Bartel DP
Ref: Genes Dev 2003 Apr 15;17(8):991-1008
Abstract: MicroRNAs (miRNAs) are an abundant class of tiny RNAs thought to regulate the expression of protein-coding genes in plants and animals. In the present study, we describe a computational procedure to identify miRNA genes conserved in more than one genome. Applying this program, known as MiRscan, together with molecular identification and validation methods, we have identified most of the miRNA genes in the nematode Caenorhabditis elegans. The total number of validated miRNA genes stands at 88, with no more than 35 genes remaining to be detected or validated. These 88 miRNA genes represent 48 gene families; 46 of these families (comprising 86 of the 88 genes) are conserved in Caenorhabditis briggsae, and 22 families are conserved in humans. More than a third of the worm miRNAs, including newly identified members of the lin-4 and let-7 gene families, are differentially expressed during larval development, suggesting a role for these miRNAs in mediating larval developmental transitions. Most are present at very high steady-state levels: more than 1000 molecules per cell, with some exceeding 50,000 molecules per cell. Our census of the worm miRNAs and their expression patterns helps define this class of noncoding RNAs, lays the groundwork for functional studies, and provides the tools for more comprehensive analyses of miRNA genes in other species.
Title: Genomewide view of gene silencing by small interfering RNAs
Authors: Chi JT, Chang HY, Wang NN, Chang DS, Dunphy N, Brown PO
Ref: Proc Natl Acad Sci USA 2003 May 27;100(11):6343-6
Abstract: RNA interference (RNAi) is an evolutionarily conserved mechanism in plant and animal cells that directs the degradation of messenger RNAs homologous to short double-stranded RNAs termed small interfering RNA (siRNA). The ability of siRNA to direct gene silencing in mammalian cells has raised the possibility that siRNA might be used to investigate gene function in a high throughput fashion or to modulate gene expression in human diseases. The specificity of siRNA-mediated silencing, a critical consideration in these applications, has not been addressed on a genomewide scale. Here we show that siRNA-induced gene silencing of transient or stably expressed mRNA is highly gene-specific and does not produce secondary effects detectable by genomewide expression profiling. A test for transitive RNAi, extension of the RNAi effect to sequences 5' of the target region that has been observed in Caenorhabditis elegans, was unable to detect this phenomenon in human cells.
Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu
Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.
http://www.pnas.org/cgi/content/abstract/98/9/5116
PNAS | April 24, 2001 | vol. 98 | no. 9 | 5116-5121
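The scoring and permutation steps described in the SAM abstract above can be sketched as follows. This is a simplified illustration, not the published procedure: the function names, the fudge constant `s0`, and the single fixed `threshold` on the absolute score are assumptions made for the example, and the FDR is estimated simply as the mean number of genes called in label-permuted data divided by the number called in the real data.

```python
import random
from statistics import mean

def relative_difference(x, y, s0=0.1):
    """Difference in group means divided by (pooled standard error + s0)."""
    n1, n2 = len(x), len(y)
    mx, my = mean(x), mean(y)
    pooled_var = (sum((v - mx) ** 2 for v in x) +
                  sum((v - my) ** 2 for v in y)) / (n1 + n2 - 2)
    se = (pooled_var * (1.0 / n1 + 1.0 / n2)) ** 0.5
    return (my - mx) / (se + s0)

def score_all(data, control_idx, treated_idx, s0):
    """Score every gene (row) for one assignment of columns to groups."""
    return [relative_difference([row[i] for i in control_idx],
                                [row[i] for i in treated_idx], s0)
            for row in data]

def estimate_fdr(data, n_control, threshold, s0=0.1, n_perm=200, seed=0):
    """Return (genes called, estimated FDR) for a fixed score threshold.

    data: one list of measurements per gene; the first n_control
    columns are controls, the remaining columns are treated samples.
    """
    idx = list(range(len(data[0])))
    observed = score_all(data, idx[:n_control], idx[n_control:], s0)
    called = sum(abs(d) > threshold for d in observed)
    if called == 0:
        return 0, 0.0
    rng = random.Random(seed)
    false_counts = []
    for _ in range(n_perm):
        perm = idx[:]
        rng.shuffle(perm)  # shuffle sample labels, same shuffle for all genes
        permuted = score_all(data, perm[:n_control], perm[n_control:], s0)
        false_counts.append(sum(abs(d) > threshold for d in permuted))
    return called, min(1.0, mean(false_counts) / called)
```

Raising `threshold` calls fewer genes and usually lowers the estimated FDR, which is the trade-off the adjustable threshold in the abstract refers to.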
Statistical significance for genomewide studies
John D. Storey and Robert Tibshirani
With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genomewide data set are tested against some null hypothesis, where a number of features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well known p value, except it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.
http://www.pnas.org/cgi/content/full/100/16/9440
PNAS, August 5, 2003; 100(16): 9440 - 9445
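A minimal sketch of how a q value can be attached to each tested feature, in the spirit of the Storey and Tibshirani abstract above. It makes a simplifying assumption: the proportion of true nulls (`pi0`) is estimated at a single fixed tuning value `lam`, rather than by smoothing over a range of values. The function and parameter names are illustrative, and the sketch assumes at least one p value exceeds `lam` so that `pi0` is positive.

```python
def q_values(pvalues, lam=0.5):
    """Return one q value per p value, in the input order."""
    m = len(pvalues)
    # p values above lam are assumed to come mostly from true nulls,
    # which are uniform on [0, 1]; rescale their count to estimate pi0.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so each q value is the minimum
    # estimated FDR over all thresholds that still include that feature.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q
```

With `lam=0` the estimate of `pi0` is 1 whenever all p values are positive, and the result reduces to the familiar Benjamini-Hochberg adjusted p values.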
Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model
Mayetri Gupta; Jun S. Liu
Journal of the American Statistical Association, Volume: 98 Number: 461 Page: 55 -- 66
Abstract: Detection of unknown patterns from a randomly generated sequence of observations is a problem arising in fields ranging from signal processing to computational biology. Here we focus on the discovery of short recurring patterns (called motifs) in DNA sequences that represent binding sites for certain proteins in the process of gene regulation. What makes this a difficult problem is that these patterns can vary stochastically. We describe a novel data augmentation strategy for detecting such patterns in biological sequences based on an extension of a "dictionary" model. In this approach, we treat conserved patterns and individual nucleotides as stochastic words generated according to probability weight matrices and the observed sequences generated by concatenations of these words. By using a missing-data approach to find these patterns, we also address other related problems, including determining widths of patterns, finding multiple motifs, handling low-complexity regions, and finding patterns with insertions and deletions. The issue of selecting appropriate models is also discussed. However, the flexibility of this model is also accompanied by a high degree of computational complexity. We demonstrate how dynamic programming-like recursions can be used to improve computational efficiency.
http://www.people.fas.harvard.edu/~gupta6/papers/sdict.pdf
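To make the notion of a "stochastic word generated according to a probability weight matrix" concrete, the sketch below scores every window of a DNA sequence by its log-odds under a weight matrix versus a background model. It illustrates only this one ingredient, not the stochastic dictionary model or its data augmentation scheme; the function name and the dictionary-per-position matrix layout are assumptions made for the example.

```python
import math

def best_site(seq, pwm, background):
    """Return (start, log_odds) of the highest-scoring window.

    pwm: list of dicts, one per motif position, mapping base -> probability.
    background: dict mapping base -> background probability.
    """
    width = len(pwm)
    best = None
    for start in range(len(seq) - width + 1):
        # Log-likelihood ratio: motif model versus background model.
        score = sum(math.log(pwm[j][seq[start + j]] / background[seq[start + j]])
                    for j in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Hypothetical 4-column matrix favoring the word TATA.
def column(consensus):
    return {b: (0.85 if b == consensus else 0.05) for b in "ACGT"}

toy_pwm = [column(b) for b in "TATA"]
uniform = {b: 0.25 for b in "ACGT"}
```

Calling `best_site("GGCTATAGC", toy_pwm, uniform)` picks out the window that starts at index 3, where the sequence matches the consensus exactly.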
Transcriptional regulatory cascades in development: Initial rates, not steady state, determine network kinetics
Hamid Bolouri and Eric H. Davidson
A model was built to examine the kinetics of regulatory cascades such as occur in developmental gene networks. The model relates occupancy of cis-regulatory target sites to transcriptional initiation rate, and thence to RNA and protein output. The model was used to simulate regulatory cascades in which genes encoding transcription factors are successively activated. Using realistic parameter ranges based on extensive earlier measurements in sea urchin embryos, we find that transitions of regulatory states occur sharply in these simulations, with respect to time or changing transcription factor concentrations. As is often observed in developing systems, the simulated regulatory cascades display a succession of gene activations separated by delays of some hours. The most important causes of this behavior are cooperativity in the assembly of cis-regulatory complexes and the high specificity of transcription factors for their target sites. Successive transitions in state occur long in advance of the approach to steady-state levels of the molecules that drive the process. The kinetics of such developmental systems thus depend mainly on the initial output rates of genes activated in response to the advent of new transcription factors.
http://www.pnas.org/cgi/content/full/100/16/9371
Title: Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution
Authors: Pevzner P, Tesler G
Ref: Proc Natl Acad Sci USA 2003 Jun 24;100(13):7672-7
Abstract: The human and mouse genomic sequences provide evidence for a larger number of rearrangements than previously thought and reveal extensive reuse of breakpoints from the same short fragile regions. Breakpoint clustering in regions implicated in cancer and infertility has been reported in previous studies; we report here on breakpoint clustering in chromosome evolution. This clustering reveals limitations of the widely accepted random breakage theory that has remained unchallenged since the mid-1980s. The genome rearrangement analysis of the human and mouse genomes implies the existence of a large number of very short "hidden" synteny blocks that were invisible in the comparative mapping data and ignored in the random breakage model. These blocks are defined by closely located breakpoints and are often hard to detect. Our results suggest a model of chromosome evolution that postulates that mammalian genomes are mosaics of fragile regions with high propensity for rearrangements and solid regions with low propensity for rearrangements.
Authors: Nagy PL, Cleary ML, Brown PO, Lieb JD
Ref: Proc Natl Acad Sci USA 2003 May 27;100(11):6364-9
Abstract: Epigenetic modifications of chromatin serve an important role in regulating the expression and accessibility of genomic DNA. We report here a genomewide approach for fractionating yeast chromatin into two functionally distinct parts, one containing RNA polymerase II transcribed sequences, and the other comprising noncoding sequences and genes transcribed by RNA polymerases I and III. Noncoding regions could be further fractionated into promoters and segments lacking promoters. The observed separations were apparently based on differential crosslinking efficiency of chromatin in different genomic regions. The results reveal a genomewide molecular mechanism for marking promoters and genomic regions that have a license to be transcribed by RNA polymerase II, a previously unrecognized level of genomic complexity that may exist in all eukaryotes. Our approach has broad potential use as a tool for genome annotation and for the characterization of global changes in chromatin structure that accompany different genetic, environmental, and disease states.
Title: Comparative analyses of multi-species sequences from targeted genomic regions
Authors: Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, Maskeri B, Hansen NF, Schwartz MS, Weber RJ, Kent WJ, Karolchik D, Bruen TC, Bevan R, Cutler DJ, Schwartz S, Elnitski L, Idol JR, Prasad AB, Lee-Lin SQ, Maduro VV, Summers TJ, Portnoy ME, Dietrich NL, Akhter N, Ayele K, Benjamin B, Cariaga K, Brinkley CP, Brooks SY, Granite S, Guan X, Gupta J, Haghighi P, Ho SL, Huang MC, Karlins E, Laric PL, Legaspi R, Lim MJ, Maduro QL, Masiello CA, Mastrian SD, McCloskey JC, Pearson R, Stantripop S, Tiongson EE, Tran JT, Tsurgeon C, Vogt JL, Walker MA, Wetherby KD, Wiggins LS, Young AC, Zhang LH, Osoegawa K, Zhu B, Zhao B, Shu CL, De Jong PJ, Lawrence CE, Smit AF, Chakravarti A, Haussler D, Green P, Miller W, Green ED
Ref: Nature 2003 Aug 14;424(6950):788-93
Abstract: The systematic comparison of genomic sequences from different organisms represents a central focus of contemporary genome analysis. Comparative analyses of vertebrate sequences can identify coding and conserved non-coding regions, including regulatory elements, and provide insight into the forces that have rendered modern-day genomes. As a complement to whole-genome sequencing efforts, we are sequencing and comparing targeted genomic regions in multiple, evolutionarily diverse vertebrates. Here we report the generation and analysis of over 12 megabases (Mb) of sequence from 12 species, all derived from the genomic region orthologous to a segment of about 1.8 Mb on human chromosome 7 containing ten genes, including the gene mutated in cystic fibrosis. These sequences show conservation reflecting both functional constraints and the neutral mutational events that shaped this genomic region. In particular, we identify substantial numbers of conserved non-coding segments beyond those previously identified experimentally, most of which are not detectable by pair-wise sequence comparisons alone. Analysis of transposable element insertions highlights the variation in genome dynamics among these species and confirms the placement of rodents as a sister group to the primates.
Title: Cross-species sequence comparisons: a review of methods and available resources
Authors: Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC
Ref: Genome Res 2003 Jan;13(1):1-12
Abstract: With the availability of whole-genome sequences for an increasing number of species, we are now faced with the challenge of decoding the information contained within these DNA sequences. Comparative analysis of DNA sequences from multiple species at varying evolutionary distances is a powerful approach for identifying coding and functional noncoding sequences, as well as sequences that are unique for a given organism. In this review, we outline the strategy for choosing DNA sequences from different species for comparative analyses and describe the methods used and the resources publicly available for these studies.
Authors: Hardison RC, Roskin KM, Yang S, Diekhans M, Kent WJ, Weber R, Elnitski L, Li J, O'Connor M, Kolbe D, Schwartz S, Furey TS, Whelan S, Goldman N, Smit A, Miller W, Chiaromonte F, Haussler D
Ref: Genome Res 2003 Jan;13(1):13-26
Abstract: Six measures of evolutionary change in the human genome were studied, three derived from the aligned human and mouse genomes in conjunction with the Mouse Genome Sequencing Consortium, consisting of (1) nucleotide substitution per fourfold degenerate site in coding regions, (2) nucleotide substitution per site in relics of transposable elements active only before the human-mouse speciation, and (3) the nonaligning fraction of human DNA that is nonrepetitive or in ancestral repeats; and three derived from human genome data alone, consisting of (4) SNP density, (5) frequency of insertion of transposable elements, and (6) rate of recombination. Features 1 and 2 are measures of nucleotide substitutions at two classes of "neutral" sites, whereas 4 is a measure of recent mutations. Feature 3 is a measure dominated by deletions in mouse, whereas 5 represents insertions in human. It was found that all six vary significantly in megabase-sized regions genome-wide, and many vary together. This indicates that some regions of a genome change slowly by all processes that alter DNA, and others change faster. Regional variation in all processes is correlated with, but not completely accounted for by, GC content in human and the difference between GC content in human and mouse.
Scoring two-species local alignments to try to statistically separate neutrally evolving from selected DNA segments
Krishna M. Roskin, Mark Diekhans, David Haussler
Proceedings of the seventh annual international conference on Computational molecular biology (RECOMB), Berlin, Pages: 257 – 266.
We construct several score functions for use in locating unusually conserved regions in a genome-wide search of aligned DNA from two species. We test these functions on regions of the human genome aligned to the mouse genome. These score functions are derived from properties of neutrally evolving sites on the mouse and human genome, and can be adjusted to the local background rate of conservation. The aim of these functions is to try to identify regions of the human genome that are conserved by evolutionary selection, because they have an important function, rather than by chance. We use them to get a very rough estimate of the amount of DNA in the human genome that is under selection.
Title: Sequencing and comparison of yeast species to identify genes and regulatory elements
Authors: Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES
Ref: Nature 423, 241 - 254 (2003)
Abstract: Identifying the functional elements encoded in a genome is one of the principal challenges in modern biology. Comparative genomics should offer a powerful, general approach. Here, we present a comparative analysis of the yeast Saccharomyces cerevisiae based on high-quality draft sequences of three related species (S. paradoxus, S. mikatae and S. bayanus). We first aligned the genomes and characterized their evolution, defining the regions and mechanisms of change. We then developed methods for direct identification of genes and regulatory motifs. The gene analysis yielded a major revision to the yeast gene catalogue, affecting approximately 15% of all genes and reducing the total count by about 500 genes. The motif analysis automatically identified 72 genome-wide elements, including most known regulatory motifs and numerous new motifs. We inferred a putative function for most of these motifs, and provided insights into their combinatorial interactions. The results have implications for genome analysis of diverse organisms, including the human.
E. Halperin, J. Buhler, R. Karp, R. Krauthgamer and B. Westover
Motivation: Comparing two protein databases is a fundamental task in biosequence annotation. Given two databases, one must find all pairs of proteins that align with high score under a biologically meaningful substitution score matrix, such as a BLOSUM matrix (Henikoff and Henikoff, 1992). Distance-based approaches to this problem map each peptide in the database to a point in a metric space, such that peptides aligning with higher scores are mapped to closer points. Many techniques exist to discover close pairs of points in a metric space efficiently, but the challenge in applying this work to proteomic comparison is to find a distance mapping that accurately encodes all the distinctions among residue pairs made by a proteomic score matrix. Buhler (2002) proposed one such mapping but found that it led to a relatively inefficient algorithm for protein-protein comparison.
Results: This work proposes a new distance mapping for peptides under the BLOSUM matrices that permits more efficient similarity search. We first propose a new distance function on peptides derived from a given score matrix. We then show how to map peptides to bit vectors such that the distance between any two peptides is closely approximated by the Hamming distance (i.e. number of mismatches) between their corresponding bit vectors. We combine these two results with the LSH-ALL-PAIRS-SIM algorithm of Buhler (2002) to produce an improved distance-based algorithm for proteomic comparison. An initial implementation of the improved algorithm exhibits sensitivity within 5% of that of the original LSH-ALL-PAIRS-SIM, while running up to eight times faster.
http://bioinformatics.oupjournals.org/cgi/content/abstract/19/suppl_1/i122
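The Hamming distance mentioned in the Results above is simply the number of positions at which two bit vectors differ. The sketch below stores each bit vector as a Python integer and finds close pairs by brute force. It stands in for neither the paper's peptide-to-bit-vector mapping nor the LSH-ALL-PAIRS-SIM algorithm; the function names and the all-pairs loop are assumptions chosen to keep the example short.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions at which a and b differ."""
    return bin(a ^ b).count("1")

def close_pairs(vectors_a, vectors_b, max_dist):
    """Index pairs (i, j) whose bit vectors differ in at most max_dist bits.

    Brute force over all pairs, so it costs len(vectors_a) * len(vectors_b)
    distance computations; a locality-sensitive hashing scheme would aim to
    avoid examining most of these pairs.
    """
    return [(i, j)
            for i, a in enumerate(vectors_a)
            for j, b in enumerate(vectors_b)
            if hamming(a, b) <= max_dist]
```

For instance, `hamming(0b1011, 0b0010)` counts the two differing positions in the XOR `0b1001`.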
Title: Transcriptional Regulatory Networks in Saccharomyces cerevisiae
Authors: Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA
Ref: Science 2002 Oct 25;298(5594):799-804
Abstract: We have determined how most of the transcriptional regulators encoded in the eukaryote Saccharomyces cerevisiae associate with genes across the genome in living cells. Just as maps of metabolic networks describe the potential pathways that may be used by a cell to accomplish metabolic processes, this network of regulator-gene interactions describes potential pathways yeast cells can use to regulate global gene expression programs. We use this information to identify network motifs, the simplest units of network architecture, and demonstrate that an automated process can use motifs to assemble a transcriptional regulatory network structure. Our results reveal that eukaryotic cellular functions are highly connected through networks of transcriptional regulators that regulate other transcriptional regulators.
Authors: Zeitlinger J, Simon I, Harbison CT, Hannett NM, Volkert TL, Fink GR, Young RA
Ref: Cell 2003 May 2;113(3):395-404
Abstract: Specialized gene expression programs are induced by signaling pathways that act on transcription factors. Whether these transcription factors can function in multiple developmental programs through a global switch in promoter selection is not known. We have used genome-wide location analysis to show that the yeast Ste12 transcription factor, which regulates mating and filamentous growth, is bound to distinct program-specific target genes dependent on the developmental condition. This condition-dependent distribution of Ste12 requires concurrent binding of the transcription factor Tec1 during filamentation and is differentially regulated by the MAP kinases Fus3 and Kss1. Program-specific distribution across the genome may be a general mechanism by which transcription factors regulate distinct gene expression programs in response to signaling.
Title: Untangling the wires: A strategy to trace functional interactions in signaling and gene networks
Authors: Kholodenko BN, Kiyatkin A, Bruggeman FJ, Sontag E, Westerhoff HV, Hoek JB
Ref: Proc Natl Acad Sci USA 2002 Oct 1;99(20):12841-6
Abstract: Emerging technologies have enabled the acquisition of large genomics and proteomics data sets. However, current methodologies for analysis do not permit interpretation of the data in ways that unravel cellular networking. We propose a quantitative method for determining functional interactions in cellular signaling and gene networks. It can be used to explore cell systems at a mechanistic level or applied within a "modular" framework, which dramatically decreases the number of variables to be assayed. This method is based on a mathematical derivation that demonstrates how the topology and strength of network connections can be retrieved from experimentally measured network responses to successive perturbations of all modules. Importantly, our analysis can reveal functional interactions even when the components of the system are not all known. Under these circumstances, some connections retrieved by the analysis will not be direct but correspond to the interaction routes through unidentified elements. The method is tested and illustrated by using computer-generated responses of a modeled mitogen-activated protein kinase cascade and gene network.
Trends Genet. 2002 Aug;18(8):395-8.
Linking the genes: inferring quantitative gene networks from microarray data.
de la Fuente A, Brazhnik P, Mendes P.
Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, 1880 Pratt Drive, Blacksburg, VA 24061, USA.
Modern microarray technology is capable of providing data about the expression of thousands of genes, and even of whole genomes. An important question is how this technology can be used most effectively to unravel the workings of cellular machinery. Here, we propose a method to infer genetic networks on the basis of data from appropriately designed microarray experiments. In addition to identifying the genes that affect a specific other gene directly, this method also estimates the strength of such effects. We will discuss both the experimental setup and the theoretical background.
Authors: Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N
Ref: Nat Genet 2003 Jun;34(2):166-76
Abstract: Much of a cell's activity is organized as a network of interacting modules: sets of genes coregulated to respond to different conditions. We present a probabilistic method for identifying regulatory modules from gene expression data. Our procedure identifies modules of coregulated genes, their regulators and the conditions under which regulation occurs, generating testable hypotheses in the form 'regulator X regulates module Y under conditions W'. We applied the method to a Saccharomyces cerevisiae expression data set, showing its ability to identify functionally coherent modules and their correct regulators. We present microarray experiments supporting three novel predictions, suggesting regulatory roles for previously uncharacterized proteins.
Genome Res. 2002 Mar;12(3):470-81.
Genome Res. 2003 Apr;13(4):579-88.
Bioinformatics Vol. 19 Suppl. 1 2003, Pages i292-i301
A probabilistic method to detect regulatory modules
Saurabh Sinha, Erik van Nimwegen and Eric D. Siggia
Motivation: The discovery of cis-regulatory modules in metazoan genomes is crucial for understanding the connection between genes and organism diversity.
Results: We develop a computational method that uses Hidden Markov Models and an Expectation Maximization algorithm to detect such modules, given the weight matrices of a set of transcription factors known to work together. Two novel features of our probabilistic model are: (i) correlations between binding sites, known to be required for module activity, are exploited, and (ii) phylogenetic comparisons among sequences from multiple species are made to highlight a regulatory module. The novel features are shown to improve detection of modules, in experiments on synthetic as well as biological data.
Roded Sharan, Ivan Ovcharenko, Asa Ben-Hur and Richard M. Karp
Motivation: The binding of transcription factors to specific regulatory sequence elements is a primary mechanism for controlling gene transcription. Recent findings suggest a modular organization of binding sites for transcription factors that cooperate in the regulation of genes. In this work we establish a framework for finding recurrent cis-regulatory modules in the promoters of a selected set of genes and scoring their statistical significance.
Results: Proceeding from a database of identified binding site motifs and their genomic locations we seek motifs whose frequency in the selected promoters is different than in a background promoter set. We present several statistical tests designed for this purpose. We provide a hashing algorithm for detecting combinations of these motifs that co-occur in clusters within the selected promoters. The significance of such co-occurrences is evaluated using novel statistical scores. Our methods are combined in CREME, a suite of software which includes a browser for viewing the pattern of occurrence of selected cis-regulatory modules. We applied our methodology to find modules within human-mouse conserved promoter segments, focusing on cell cycle regulated genes and stress response related genes. To validate the biological significance of the identified modules we tested whether the associated genes tended to be co-expressed or share similar function. In the cell cycle set five of the seven identified sets of genes were coherently expressed. On the stress response data four of the six detected sets fell predominantly into well-defined functional sub-categories.
Anal Chem. 2003 Feb 1;75(3):435-44.
Eng, Jimmy K.; McCormack, Ashley L.; Yates, John R., III
Journal of the American Society for Mass Spectrometry (1994), 5(11), 976-89
A method to correlate the uninterpreted tandem mass spectra of peptides produced under low energy (10-50 eV) collision conditions with amino acid sequences in the Genpept database has been developed. In this method the protein database is searched to identify linear amino acid sequences within a mass tolerance of ±1 u of the precursor ion molecular weight. A cross-correlation function is then used to provide a measurement of similarity between the mass-to-charge ratios for the fragment ions predicted from amino acid sequences obtained from the database and the fragment ions observed in the tandem mass spectrum. In general, a difference >0.1 between the normalized cross-correlation functions of the first- and second-ranked search results indicates a successful match between sequence and spectrum. Searches of species-specific protein databases with tandem mass spectra acquired from peptides obtained from the enzymically digested total proteins of E. coli and S. cerevisiae cells allowed matching of the spectra to amino acid sequences within proteins of these organisms. The approach described in this manuscript provides a convenient method to interpret tandem mass spectra with known sequences in a protein database.
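The ranking rule in the abstract above (accept the top hit when its normalized score exceeds the runner-up's by more than 0.1) can be sketched as follows. This is a deliberately reduced stand-in: similarity here is the zero-offset correlation of unit-normalized, binned intensity vectors, which is simpler than the cross-correlation function the paper describes. All function names, the bin width, and the bin count are assumptions for illustration.

```python
def bin_spectrum(peaks, bin_width=1.0, n_bins=2000):
    """Turn (m/z, intensity) peaks into a unit-length intensity vector."""
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm else vec

def similarity(observed_peaks, predicted_peaks):
    """Zero-offset correlation of the two normalized, binned spectra."""
    obs = bin_spectrum(observed_peaks)
    pred = bin_spectrum(predicted_peaks)
    return sum(o * p for o, p in zip(obs, pred))

def rank_candidates(observed_peaks, candidates):
    """Rank candidate peptides by similarity to the observed spectrum.

    candidates: dict mapping a peptide name to its predicted peak list.
    Returns (ranked list of (name, score), score gap between top two).
    """
    ranked = sorted(((name, similarity(observed_peaks, peaks))
                     for name, peaks in candidates.items()),
                    key=lambda item: item[1], reverse=True)
    delta = ranked[0][1] - ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return ranked, delta
```

A caller would then treat the top-ranked candidate as a match only when `delta > 0.1`, mirroring the criterion quoted in the abstract.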
Bioinformatics. 2001;17 Suppl 1:S13-21
SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database.
Bafna V, Edwards N.
Proteomics, or the direct analysis of the expressed protein components of a cell, is critical to our understanding of cellular biological processes in normal and diseased tissue. A key requirement for its success is the ability to identify proteins in complex mixtures. Recent technological advances in tandem mass spectrometry have made it the method of choice for high-throughput identification of proteins. Unfortunately, the software for unambiguously identifying peptide sequences has not kept pace with the recent hardware improvements in mass spectrometry instruments. Critical for reliable high-throughput protein identification, scoring functions evaluate the quality of a match between experimental spectra and a database peptide. Current scoring function technology relies heavily on ad-hoc parameterization and manual curation by experienced mass spectrometrists. In this work, we propose a two-stage stochastic model for the observed MS/MS spectrum, given a peptide. Our model explicitly incorporates fragment ion probabilities, noisy spectra, and instrument measurement error. We describe how to compute this probability-based score efficiently, using a dynamic programming technique. A prototype implementation demonstrates the effectiveness of the model.
http://bioinformatics.oupjournals.org/cgi/content/abstract/17/suppl_1/S13
On de novo interpretation of tandem mass spectra for peptide identification
Vineet Bafna, Nathan Edwards
The correct interpretation of tandem mass spectra is a difficult problem, even when it is limited to scoring peptides against a database. De novo sequencing is considerably harder, but critical when sequence databases are incomplete or not available. In this paper we build upon earlier work due to Dancik et al. and Chen et al. to provide a dynamic programming algorithm for interpreting de novo spectra. Our method can handle most of the commonly occurring ions, including a, b, y, and their neutral losses. Additionally, we shift the emphasis away from sequencing to assigning ion types to peaks. In particular, we introduce the notion of core interpretations, which allow us to give confidence values to individual peak assignments, even in the absence of a strong interpretation. Finally, we introduce a systematic approach to evaluating de novo algorithms as a function of spectral quality. We show that our algorithm, in particular the core-interpretation, is robust in the presence of measurement error and low fragmentation probability.