Nucleic Acids Res. 2004 Mar 19;32(5):1792-7. Print 2004. |
BMC Bioinformatics. 2004 Aug 19;5(1):113. |
MUSCLE: a multiple sequence alignment method with
reduced time and space complexity.
Edgar
RC.
Department of Plant and Microbial Biology, 461
Koshland Hall, University of California, Berkeley, CA 94720-3102,
USA. bob@drive5.com
BACKGROUND: In a previous paper, we
introduced MUSCLE, a new program for creating multiple alignments of
protein sequences, giving a brief summary of the algorithm and
showing MUSCLE to achieve the highest scores reported to date on four
alignment accuracy benchmarks. Here we present a more complete
discussion of the algorithm, describing several previously
unpublished techniques that improve biological accuracy and/or
computational complexity. We introduce a new option, MUSCLE-fast,
designed for high-throughput applications. We also describe a new
protocol for evaluating objective functions that align two profiles.
RESULTS: We compare the speed and accuracy of MUSCLE with CLUSTALW,
Progressive POA and the MAFFT script FFTNS1, the fastest previously
published program known to the author. Accuracy is measured using
four benchmarks: BAliBASE, PREFAB, SABmark and SMART. We test three
variants that offer highest accuracy (MUSCLE with default settings),
highest speed (MUSCLE-fast), and a carefully chosen compromise
between the two (MUSCLE-prog). We find MUSCLE-fast to be the fastest
algorithm on all test sets, achieving average alignment accuracy
similar to CLUSTALW in times that are typically two to three orders
of magnitude less. MUSCLE-fast is able to align 1,000 sequences of
average length 282 in 21 seconds on a current desktop computer.
CONCLUSIONS: MUSCLE offers a range of options that provide improved
speed and/or alignment accuracy compared with currently available
programs. MUSCLE is freely available at
http://www.drive5.com/muscle.
Nucleic Acids Res. 2005 Jan 20;33(2):511-518. Print 2005. |
MAFFT version 5: improvement in accuracy of
multiple sequence alignment.
Katoh K, Kuma KI,
Toh H, Miyata T.
Bioinformatics Center, Institute for
Chemical Research, Kyoto University Uji, Kyoto 611-0011, Japan.
The
accuracy of multiple sequence alignment program MAFFT has been
improved. The new version (5.3) of MAFFT offers new iterative
refinement options, H-INS-i, F-INS-i and G-INS-i, in which pairwise
alignment information is incorporated into the objective function. These
new options of MAFFT showed higher accuracy than currently available
methods including TCoffee version 2 and CLUSTAL W in benchmark tests
consisting of alignments of >50 sequences. Like the previously
available options, the new options of MAFFT can handle hundreds of
sequences on a standard desktop computer. We also examined the effect
of the number of homologues included in an alignment. For a multiple
alignment consisting of approximately 8 sequences with low
similarity, the accuracy was improved (2-10 percentage points) when
the sequences were aligned together with dozens of their close
homologues (E-value < 10^-5 to 10^-20) collected from a database.
Such improvement was generally observed for most methods, but
remarkably large for the new options of MAFFT proposed here. Thus, we
made a Ruby script, mafftE.rb, which aligns the input sequences
together with their close homologues collected from SwissProt using
NCBI-BLAST.
Nucleic Acids Res. 2002 Jul 15;30(14):3059-66. |
MAFFT: a novel method for rapid multiple sequence
alignment based on fast Fourier transform.
Katoh
K, Misawa K, Kuma K, Miyata T.
Department of Biophysics,
Graduate School of Science, Kyoto University, Kyoto 606-8502,
Japan.
A multiple sequence alignment program, MAFFT, has been
developed. The CPU time is drastically reduced as compared with
existing methods. MAFFT includes two novel techniques. (i) Homologous
regions are rapidly identified by the fast Fourier transform
(FFT), in which an amino acid sequence is converted to a sequence
composed of volume and polarity values of each amino acid residue.
(ii) We propose a simplified scoring system that performs well for
reducing CPU time and increasing the accuracy of alignments even for
sequences having large insertions or extensions as well as distantly
related sequences of similar length. Two different heuristics, the
progressive method (FFT-NS-2) and the iterative refinement method
(FFT-NS-i), are implemented in MAFFT. The performances of FFT-NS-2
and FFT-NS-i were compared with other methods by computer simulations
and benchmark tests; the CPU time of FFT-NS-2 is drastically reduced
as compared with CLUSTALW with comparable accuracy. FFT-NS-i is over
100 times faster than T-COFFEE, when the number of input sequences
exceeds 60, without sacrificing the accuracy.
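The FFT step described in (i) can be illustrated with a minimal sketch. It assumes each sequence has already been turned into a list of numbers (for example one physicochemical value per residue) and finds the offset at which the two signals correlate best. The function names, the single-property encoding and the pure-Python radix-2 FFT are illustrative assumptions, not MAFFT's actual code; the abstract states only that volume and polarity values are used.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def encode(seq, scale):
    """Map residues to numbers using a caller-supplied property table."""
    return [float(scale[res]) for res in seq]

def correlation_by_lag(a, b):
    """Return {lag: sum over n of a[n] * b[n + lag]} for every overlapping lag."""
    size = 1
    while size < len(a) + len(b):  # zero-pad so circular wrap-around cannot mix lags
        size *= 2
    fa = fft([complex(v) for v in a] + [0j] * (size - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (size - len(b)))
    prod = [x.conjugate() * y for x, y in zip(fa, fb)]
    corr = [v.real / size for v in fft(prod, inverse=True)]
    lags = {k: corr[k] for k in range(len(b))}   # b shifted right of a
    for k in range(1, len(a)):                   # b shifted left of a
        lags[-k] = corr[size - k]
    return lags

def best_lag(a, b):
    """Lag with the highest correlation, a candidate offset for a homologous region."""
    lags = correlation_by_lag(a, b)
    return max(lags, key=lags.get)
```

A real property table (for instance volume or polarity per amino acid) would be passed to `encode`; none is hard-coded here because the abstract does not list the values. The two transforms cost O(L log L), which is the complexity quoted for the FFT approach elsewhere in this list.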
J Mol Biol. 2003 Feb 7;326(1):317-36. |
COMPASS: a tool for comparison of multiple
protein alignments with assessment of statistical
significance.
Sadreyev R, Grishin N.
Howard
Hughes Medical Institute, and Department of Biochemistry, University
of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas,
TX 75390-9050, USA.
We present a novel method for the
comparison of multiple protein alignments with assessment of
statistical significance (COMPASS). The method derives numerical
profiles from alignments, constructs optimal local profile-profile
alignments and analytically estimates E-values for the detected
similarities. The scoring system and E-value calculation are based on
a generalization of the PSI-BLAST approach to profile-sequence
comparison, which is adapted for the profile-profile case. Tested
along with existing methods for profile-sequence (PSI-BLAST) and
profile-profile (prof_sim) comparison, COMPASS shows increased
abilities for sensitive and selective detection of remote sequence
similarities, as well as improved quality of local alignments. The
method allows prediction of relationships between protein families in
the PFAM database beyond the range of conventional methods. Two
predicted relations with high significance are similarities between
various Rossmann-type folds and between various
helix-turn-helix-containing families. The potential value of COMPASS
for structure/function predictions is illustrated by the detection of
an intricate homology between the DNA-binding domain of the CTF/NFI
family and the MH1 domain of the Smad family.
Bioinformatics. 2003 Aug 12;19(12):1531-9. |
Probabilistic scoring measures for
profile-profile comparison yield more accurate short seed
alignments.
Mittelman D, Sadreyev R, Grishin
N.
Howard Hughes Medical Institute and Department of
Biochemistry, University of Texas Southwestern Medical Center, 5323
Harry Hines Blvd, Dallas, TX 75390-9050, USA.
MOTIVATION: The
development of powerful automatic methods for the comparison of
protein sequences has become increasingly important.
Profile-to-profile comparisons allow for the use of broader
information about protein families, resulting in more sensitive and
accurate comparisons of distantly related sequences. A key part in
the comparison of two profiles is the method for the calculation of
scores for the position matches. A number of methods based on various
theoretical considerations have been proposed. We implemented several
previously reported scoring functions as well as our own functions,
and compared them on the basis of their ability to produce accurate
short ungapped alignments of a given length. RESULTS: Our results
suggest that the family of the probabilistic methods (log-odds based
methods and prof_sim) may be the more appropriate choice for the
generation of initial 'seeds' as the first step to produce local
profile-profile alignments. The most effective scoring systems were
the closely related modifications of functions previously implemented
in the COMPASS and Picasso methods.
Bioinformatics. 2004 May 22;20(8):1301-8. Epub 2004 Feb 12. |
Nucleic Acids Res. 2004 Jan 16;32(1):380-5. Print 2004. |
Local homology recognition and distance measures
in linear time using compressed amino acid alphabets.
Edgar
RC.
bob@drive5.com
Methods for discovery of local
similarities and estimation of evolutionary distance by identifying
k-mers (contiguous subsequences of length k) common to two sequences
are described. Given unaligned sequences of length L, these methods
have O(L) time complexity. The ability of compressed amino acid
alphabets to extend these techniques to distantly related proteins
was investigated. The performance of these algorithms was evaluated
for different alphabets and choices of k using a test set of 1848
pairs of structurally alignable sequences selected from the FSSP
database. Distance measures derived from k-mer counting were found to
correlate well with percentage identity derived from sequence
alignments. Compressed alphabets were seen to improve performance in
local similarity discovery, but no evidence was found of improvements
when applied to distance estimates. The performance of our local
similarity discovery method was compared with the fast Fourier
transform (FFT) used in MAFFT, which has O(L log L) time complexity.
Our method was found to achieve coverage comparable to FFT while being
more than an order of magnitude faster. We suggest using k-mer
distance for fast, approximate phylogenetic tree construction, and
show that a speed improvement of more than three orders of magnitude
can be achieved relative to standard distance methods, which require
alignments.
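As a rough illustration of k-mer counting with a compressed alphabet, the sketch below scores two unaligned sequences by the fraction of k-mers they share. The residue grouping in `GROUPS`, the function names and the particular normalisation are assumptions chosen for the example; the abstract does not specify which alphabets or distance formulas were evaluated.

```python
from collections import Counter

# Hypothetical compressed alphabet: residues in the same group are treated
# as identical. This grouping is illustrative only, not taken from the paper.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
COMPRESS = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}

def compress(seq):
    """Rewrite a protein sequence in the reduced alphabet ('X' for unknowns)."""
    return "".join(COMPRESS.get(res, "X") for res in seq.upper())

def kmer_counts(seq, k):
    """Count every contiguous subsequence of length k (one pass, O(L))."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_similarity(x, y, k=3, compressed=False):
    """Fraction of k-mers shared by x and y, normalised by the shorter sequence."""
    if compressed:
        x, y = compress(x), compress(y)
    possible = min(len(x), len(y)) - k + 1
    if possible <= 0:
        return 0.0
    shared = kmer_counts(x, k) & kmer_counts(y, k)  # per-k-mer minimum count
    return sum(shared.values()) / possible

def kmer_distance(x, y, k=3, compressed=False):
    """A simple distance in [0, 1]: 0 when all k-mers are shared."""
    return 1.0 - kmer_similarity(x, y, k, compressed)
```

Because no alignment is computed, a full distance matrix for tree building needs only one counting pass per sequence plus a cheap comparison per pair, which is where the speed advantage over alignment-based distances comes from.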
Bioinformatics. 2003 Mar 1;19(4):513-23. |
Alignment-free sequence comparison-a
review.
Vinga S, Almeida J.
Department
of Biometry & Epidemiology, Medical University of South Carolina,
135 Cannon Street, Suite 303, PO Box 250835, Charleston, SC 29425,
USA.
MOTIVATION: Genetic recombination and, in particular,
genetic shuffling are at odds with sequence comparison by alignment,
which assumes conservation of contiguity between homologous segments.
A variety of theoretical foundations are being used to derive
alignment-free methods that overcome this limitation. The formulation
of alternative metrics for dissimilarity between sequences and their
algorithmic implementations are reviewed. RESULTS: The overwhelming
majority of work on alignment-free sequence comparison has taken place in the
past two decades, with most reports published in the past 5 years.
Two main categories of methods have been proposed: methods based on
word (oligomer) frequency, and methods that do not require resolving
the sequence with fixed word length segments. The first category is
based on the statistics of word frequency, on the distances defined
in a Cartesian space defined by the frequency vectors, and on the
information content of frequency distribution. The second category
includes the use of Kolmogorov complexity and Chaos Theory. Despite
their low visibility, alignment-free metrics are in fact already
widely used as pre-selection filters for alignment-based querying of
large applications. Recent work is furthering their usage as a
scale-independent methodology that is capable of recognizing homology
when loss of contiguity is beyond the possibility of alignment.
Availability: Most of the alignment-free algorithms reviewed were
implemented in MATLAB code and are available at
http://bioinformatics.musc.edu/resources.html
Science. 2004 May 28;304(5675):1321-5. Epub 2004 May 06. |
Bioinformatics. 2004 Aug 4;20 Suppl 1:I40-I48. |
Genome Res. 2004 Apr;14(4):528-38. |
Nature. 2004 Dec 9;432(7018):695-716. |
Sequence and comparative analysis of the chicken
genome provide unique perspectives on vertebrate evolution.
Hillier
LW, Miller W, Birney E, Warren W, Hardison RC, Ponting CP, Bork P,
Burt DW, Groenen MA, Delany ME, Dodgson JB, Chinwalla AT, Cliften PF,
Clifton SW, Delehaunty KD, Fronick C, Fulton RS, Graves TA, Kremitzki
C, Layman D, Magrini V, McPherson JD, Miner TL, Minx P, Nash WE, Nhan
MN, Nelson JO, Oddy LG, Pohl CS, Randall-Maher J, Smith SM, Wallis
JW, Yang SP, Romanov MN, Rondelli CM, Paton B, Smith J, Morrice D,
Daniels L, Tempest HG, Robertson L, Masabanda JS, Griffin DK, Vignal
A, Fillon V, Jacobbson L, Kerje S, Andersson L, Crooijmans RP, Aerts
J, van der Poel JJ, Ellegren H, Caldwell RB, Hubbard SJ, Grafham DV,
Kierzek AM, McLaren SR, Overton IM, Arakawa H, Beattie KJ, Bezzubov
Y, Boardman PE, Bonfield JK, Croning MD, Davies RM, Francis MD,
Humphray SJ, Scott CE, Taylor RG, Tickle C, Brown WR, Rogers J,
Buerstedde JM, Wilson SA, Stubbs L, Ovcharenko I, Gordon L, Lucas S,
Miller MM, Inoko H, Shiina T, Kaufman J, Salomonsen J, Skjoedt K,
Wong GK, Wang J, Liu B, Wang J, Yu J, Yang H, Nefedov M, Koriabine M,
Dejong PJ, Goodstadt L, Webber C, Dickens NJ, Letunic I, Suyama M,
Torrents D, von Mering C, Zdobnov EM, Makova K, Nekrutenko A,
Elnitski L, Eswara P, King DC, Yang S, Tyekucheva S, Radakrishnan A,
Harris RS, Chiaromonte F, Taylor J, He J, Rijnkels M, Griffiths-Jones
S, Ureta-Vidal A, Hoffman MM, Severin J, Searle SM, Law AS, Speed D,
Waddington D, Cheng Z, Tuzun E, Eichler E, Bao Z, Flicek P,
Shteynberg DD, Brent MR, Bye JM, Huckle EJ, Chatterji S, Dewey C,
Pachter L, Kouranov A, Mourelatos Z, Hatzigeorgiou AG, Paterson AH,
Ivarie R, Brandstrom M, Axelsson E, Backstrom N, Berlin S, Webster
MT, Pourquie O, Reymond A, Ucla C, Antonarakis SE, Long M, Emerson
JJ, Betran E, Dupanloup I, Kaessmann H, Hinrichs AS, Bejerano G,
Furey TS, Harte RA, Raney B, Siepel A, Kent WJ, Haussler D, Eyras E,
Castelo R, Abril JF, Castellano S, Camara F, Parra G, Guigo R,
Bourque G, Tesler G, Pevzner PA, Smit A, Fulton LA, Mardis ER, Wilson
RK; International Chicken Genome Sequencing Consortium.
Genome
Sequencing Center, Washington University School of Medicine, Campus
Box 8501, 4444 Forest Park Avenue, St Louis, Missouri 63108, USA.
We
present here a draft genome sequence of the red jungle fowl, Gallus
gallus. Because the chicken is a modern descendant of the dinosaurs
and the first non-mammalian amniote to have its genome sequenced, the
draft sequence of its genome--composed of approximately one billion
base pairs of sequence and an estimated 20,000-23,000 genes--provides
a new perspective on vertebrate genome evolution, while also
improving the annotation of mammalian genomes. For example, the
evolutionary distance between chicken and human provides high
specificity in detecting functional elements, both non-coding and
coding. Notably, many conserved non-coding sequences are far from
genes and cannot be assigned to defined functional classes. In coding
regions the evolutionary dynamics of protein domains and orthologous
groups illustrate processes that distinguish the lineages leading to
birds and mammals. The distinctive properties of avian
microchromosomes, together with the inferred patterns of conserved
synteny, provide additional insights into vertebrate chromosome
architecture.
Nature. 2004 Apr 1;428(6982):493-521. |
Genome sequence of the Brown Norway rat yields
insights into mammalian evolution.
Gibbs RA,
Weinstock GM, Metzker ML, Muzny DM, Sodergren EJ, Scherer S, Scott G,
Steffen D, Worley KC, Burch PE, Okwuonu G, Hines S, Lewis L, DeRamo
C, Delgado O, Dugan-Rocha S, Miner G, Morgan M, Hawes A, Gill R,
Celera, Holt RA, Adams MD, Amanatides PG, Baden-Tillson H, Barnstead
M, Chin S, Evans CA, Ferriera S, Fosler C, Glodek A, Gu Z, Jennings
D, Kraft CL, Nguyen T, Pfannkoch CM, Sitter C, Sutton GG, Venter JC,
Woodage T, Smith D, Lee HM, Gustafson E, Cahill P, Kana A,
Doucette-Stamm L, Weinstock K, Fechtel K, Weiss RB, Dunn DM, Green
ED, Blakesley RW, Bouffard GG, De Jong PJ, Osoegawa K, Zhu B, Marra
M, Schein J, Bosdet I, Fjell C, Jones S, Krzywinski M, Mathewson C,
Siddiqui A, Wye N, McPherson J, Zhao S, Fraser CM, Shetty J, Shatsman
S, Geer K, Chen Y, Abramzon S, Nierman WC, Havlak PH, Chen R, Durbin
KJ, Egan A, Ren Y, Song XZ, Li B, Liu Y, Qin X, Cawley S, Worley KC,
Cooney AJ, D'Souza LM, Martin K, Wu JQ, Gonzalez-Garay ML, Jackson
AR, Kalafus KJ, McLeod MP, Milosavljevic A, Virk D, Volkov A, Wheeler
DA, Zhang Z, Bailey JA, Eichler EE, Tuzun E, Birney E, Mongin E,
Ureta-Vidal A, Woodwark C, Zdobnov E, Bork P, Suyama M, Torrents D,
Alexandersson M, Trask BJ, Young JM, Huang H, Wang H, Xing H, Daniels
S, Gietzen D, Schmidt J, Stevens K, Vitt U, Wingrove J, Camara F, Mar
Alba M, Abril JF, Guigo R, Smit A, Dubchak I, Rubin EM, Couronne O,
Poliakov A, Hubner N, Ganten D, Goesele C, Hummel O, Kreitler T, Lee
YA, Monti J, Schulz H, Zimdahl H, Himmelbauer H, Lehrach H, Jacob HJ,
Bromberg S, Gullings-Handley J, Jensen-Seaman MI, Kwitek AE, Lazar J,
Pasko D, Tonellato PJ, Twigger S, Ponting CP, Duarte JM, Rice S,
Goodstadt L, Beatson SA, Emes RD, Winter EE, Webber C, Brandt P,
Nyakatura G, Adetobi M, Chiaromonte F, Elnitski L, Eswara P, Hardison
RC, Hou M, Kolbe D, Makova K, Miller W, Nekrutenko A, Riemer C,
Schwartz S, Taylor J, Yang S, Zhang Y, Lindpaintner K, Andrews TD,
Caccamo M, Clamp M, Clarke L, Curwen V, Durbin R, Eyras E, Searle SM,
Cooper GM, Batzoglou S, Brudno M, Sidow A, Stone EA, Venter JC,
Payseur BA, Bourque G, Lopez-Otin C, Puente XS, Chakrabarti K,
Chatterji S, Dewey C, Pachter L, Bray N, Yap VB, Caspi A, Tesler G,
Pevzner PA, Haussler D, Roskin KM, Baertsch R, Clawson H, Furey TS,
Hinrichs AS, Karolchik D, Kent WJ, Rosenbloom KR, Trumbower H,
Weirauch M, Cooper DN, Stenson PD, Ma B, Brent M, Arumugam M,
Shteynberg D, Copley RR, Taylor MS, Riethman H, Mudunuri U, Peterson
J, Guyer M, Felsenfeld A, Old S, Mockrin S, Collins F; Rat Genome
Sequencing Project Consortium.
Human Genome Sequencing
Center, Department of Molecular and Human Genetics, Baylor College of
Medicine, MS BCM226, One Baylor Plaza, Houston, Texas 77030, USA
<http://www.hgsc.bcm.tmc.edu>.
The laboratory rat
(Rattus norvegicus) is an indispensable tool in experimental medicine
and drug development, having made inestimable contributions to human
health. We report here the genome sequence of the Brown Norway (BN)
rat strain. The sequence represents a high-quality 'draft' covering
over 90% of the genome. The BN rat sequence is the third complete
mammalian genome to be deciphered, and three-way comparisons with the
human and mouse genomes resolve details of mammalian evolution. This
first comprehensive analysis includes genes and proteins and their
relation to human disease, repeated sequences, comparative
genome-wide studies of mammalian orthologous chromosomal regions and
rearrangement breakpoints, reconstruction of ancestral karyotypes and
the events leading to existing species, rates of variation, and
lineage-specific and lineage-independent evolutionary events such as
expansion of gene families, orthology relations and protein
evolution.
J Comput Biol. 2004;11(2-3):413-28. |
Combining phylogenetic and hidden Markov
models in biosequence analysis.
Siepel A,
Haussler D.
Center for Biomolecular Science and
Engineering, University of California, 1156 High Street, Santa Cruz,
CA 95064, USA. acs@soe.ucsc.edu
A few models have appeared in
recent years that consider not only the way substitutions occur
through evolutionary history at each site of a genome, but also the
way the process changes from one site to the next. These models
combine phylogenetic models of molecular evolution, which apply to
individual sites, and hidden Markov models, which allow for changes
from site to site. Besides improving the realism of ordinary
phylogenetic models, they are potentially very powerful tools for
inference and prediction--for example, for gene finding or prediction
of secondary structure. In this paper, we review progress on combined
phylogenetic and hidden Markov models and present some extensions to
previous work. Our main result is a simple and efficient method for
accommodating higher-order states in the HMM, which allows for
context-dependent models of substitution--that is, models that
consider the effects of neighboring bases on the pattern of
substitution. We present experimental results indicating that
higher-order states, autocorrelated rates, and multiple functional
categories all lead to significant improvements in the fit of a
combined phylogenetic and hidden Markov model, with the effect of
higher-order states being particularly pronounced.
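The combination of a per-site phylogenetic model with an HMM over sites can be sketched in a deliberately small form. The example below assumes a two-sequence "tree", the Jukes-Cantor substitution model, and hidden states that differ only in a rate multiplier; all names and parameters are hypothetical, and the higher-order (context-dependent) states discussed in the abstract are not modelled.

```python
import math

def jc_column_likelihood(a, b, distance):
    """Joint probability of bases a and b separated by `distance`
    expected substitutions per site under Jukes-Cantor."""
    decay = math.exp(-4.0 * distance / 3.0)
    if a == b:
        transition = 0.25 + 0.75 * decay
    else:
        transition = 0.25 - 0.25 * decay
    return 0.25 * transition  # uniform base frequency times P(a -> b)

def phylo_hmm_likelihood(columns, rates, initial, transitions, distance=0.5):
    """Forward algorithm: hidden state s rescales the branch length by rates[s],
    and each alignment column is emitted with its phylogenetic likelihood."""
    n_states = len(rates)
    first_a, first_b = columns[0]
    forward = [
        initial[s] * jc_column_likelihood(first_a, first_b, distance * rates[s])
        for s in range(n_states)
    ]
    for a, b in columns[1:]:
        forward = [
            sum(forward[r] * transitions[r][s] for r in range(n_states))
            * jc_column_likelihood(a, b, distance * rates[s])
            for s in range(n_states)
        ]
    return sum(forward)  # no scaling: long alignments would need log-space
```

With a single state this reduces to the usual product of independent column likelihoods; adding slow and fast states with "sticky" transitions is one way to express autocorrelated rates along the sequence.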
Proc Natl Acad Sci U S A. 2004 Nov 16;101(46):16138-43. Epub 2004 Nov 08. |
Bioinformatics. 2004 Aug 4;20 Suppl 1:I334-I341. |
Genome Res. 2004 Jun;14(6):1170-5. Epub 2004 May 12. |
Bioinformatics. 2004 Aug 4;20 Suppl 1:I178-I185. |
Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach
Authors: Storey J.D.; Taylor J.E.; Siegmund D.
Source: Journal of the Royal Statistical Society: Series B (Statistical Methodology), February 2004, vol. 66, no. 1, pp. 187-205(19)
Publisher: Blackwell Publishing
Abstract:
Summary.
The false discovery rate (FDR) is a multiple hypothesis testing quantity that describes the expected proportion of false positive results among all rejected null hypotheses. Benjamini and Hochberg introduced this quantity and proved that a particular step-up p-value method controls the FDR. Storey introduced a point estimate of the FDR for fixed significance regions. The former approach conservatively controls the FDR at a fixed predetermined level, and the latter provides a conservatively biased estimate of the FDR for a fixed predetermined significance region. In this work, we show in both finite sample and asymptotic settings that the goals of the two approaches are essentially equivalent. In particular, the FDR point estimates can be used to define valid FDR controlling procedures. In the asymptotic setting, we also show that the point estimates can be used to estimate the FDR conservatively over all significance regions simultaneously, which is equivalent to controlling the FDR at all levels simultaneously. The main tool that we use is to translate existing FDR methods into procedures involving empirical processes. This simplifies finite sample proofs, provides a framework for asymptotic results and proves that these procedures are valid even under certain forms of dependence.
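The two approaches contrasted in the abstract can be written down in a few lines each. The sketch below gives the Benjamini-Hochberg step-up procedure and a Storey-style point estimate in their commonly stated forms; the function names and the tuning parameter `lam` (with its default of 0.5) are assumptions for illustration and are not taken from the paper.

```python
def benjamini_hochberg(pvalues, level=0.05):
    """Step-up procedure: return indices of hypotheses rejected at the given FDR level."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the LARGEST rank whose p-value is under its threshold.
        if pvalues[idx] <= rank * level / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

def storey_fdr_estimate(pvalues, threshold, lam=0.5):
    """Point estimate of the FDR for the fixed rejection region p <= threshold."""
    m = len(pvalues)
    # Estimated proportion of true nulls: p-values above lam are mostly null.
    pi0 = sum(1 for p in pvalues if p > lam) / (m * (1.0 - lam))
    rejected = sum(1 for p in pvalues if p <= threshold)
    return pi0 * m * threshold / max(rejected, 1)
```

The first function fixes the error level and returns a rejection set; the second fixes the rejection region and returns an estimated error rate, which is the pairing the abstract describes as essentially equivalent.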
Proc Natl Acad Sci U S A. 2002 Oct 29;99(22):13980-9. Epub 2002 Oct 08. |
Proteins. 2004 Oct 1;57(1):188-97. |
Proteins. 2004 Feb 1;54(2):351-60. |
BMC Bioinformatics. 2004 Sep 09;5(1):129. |
Bioinformatics. 2003;19 Suppl 1:i292-301. |
Bioinformatics. 2003 Dec 12;19(18):2369-80. |
Genome Res. 2005 Jan 14; [Epub ahead of print] |
J Comput Biol. 2004;11(2-3):377-94. |