Benjamin Redelings

North Carolina State University

Estimating Evolutionary Trees from Uncertain Molecular Sequence Alignments

 

Conventional methods of estimating evolutionary trees from DNA sequence data assume that the researcher knows which letters in the different DNA sequences descend from the same letter in the common ancestor. This information is commonly specified by arranging the letters in a matrix, called an alignment, so that letters in each row make up one DNA sequence and letters in the same column share a common origin. However, this alignment is not directly observed. Traditionally, a single estimate of the alignment is first constructed without knowledge of the tree and other model parameters, and then the evolutionary tree is estimated from this alignment. Unfortunately, such methods of evolutionary tree reconstruction may be undermined by the unreliability of the alignment estimates on which they depend, and may also exaggerate the degree of confidence in the resulting tree estimates.
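As a small illustration of the alignment matrix described above, the following Python sketch stores a toy alignment with one row per sequence and reads off its columns. The sequence names, the letters, and the use of "-" as a gap character are assumptions made up for this example; they do not come from the talk. The point it shows is that only the ungapped sequences are observed, while the placement of the gaps is an estimate.

```python
# Toy alignment: each row is one DNA sequence, and letters in the same
# column are hypothesized to descend from the same ancestral letter.
# Names and sequences are invented for illustration only.
alignment = {
    "seqA": "ACG-TA",
    "seqB": "ACGGTA",
    "seqC": "AC--TA",
}

def columns(aln):
    """Return the alignment columns as a list of strings, one per column."""
    rows = list(aln.values())
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have equal length")
    return ["".join(col) for col in zip(*rows)]

def ungapped(row):
    """Recover the observed sequence by dropping gap characters."""
    return row.replace("-", "")

cols = columns(alignment)

# The observed data are only the ungapped sequences; many different gap
# placements (alignments) are compatible with the same observed data.
observed = {name: ungapped(row) for name, row in alignment.items()}
```

Here the fourth column contains a letter only in seqB, which could be explained by an insertion in seqB or by deletions in the other two sequences, depending on the tree.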

 

To avoid these problems, I describe a method for simultaneously estimating multiple sequence alignments and the evolutionary trees that relate the sequences. This approach takes into account the mutual dependence of alignment and tree by treating the alignment as a random variable. To do this, I construct a probabilistic model of insertion and deletion mutations, which add letters to and remove letters from DNA sequences. Inference is conducted using Markov chain Monte Carlo (MCMC). I describe new transition kernels that are necessary to efficiently propose new alignments and trees. Estimating the alignment and the evolutionary tree simultaneously also avoids the trap of conditioning on an inaccurate external guide tree when constructing the alignment. Furthermore, the availability of an internal tree estimate during alignment construction allows shared insertions or deletions to count as evidence for grouping taxa in the tree.

 

I will also describe progress on extending the probabilistic model of the insertion-deletion process to account for insertion-deletion hotspots. The current model assumes that the rates of insertion and deletion are spatially uniform and do not vary across the length of the DNA sequence. However, insertion and deletion mutations often cluster in certain "hotspot" regions of the DNA sequence. This is an important biological feature for statistical models to capture because the local insertion-deletion rate affects both the number of insertions and deletions that are inferred and the strength of evidence for common ancestry associated with a shared insertion or deletion. I propose a model that assigns each DNA sequence letter an unobserved "fast" or "slow" label that determines the local insertion-deletion rate. I also describe an algorithm that can sample from the marginal posterior distribution of these labels in computation time that is linear in the number of DNA sequences. This is an important step towards joint estimation of alignment, tree, and local insertion-deletion rates.

 

Finally, I will briefly describe an improved method of summarizing posterior distributions of evolutionary tree topologies.

 

 

*****

 

 

4:15pm

B17 Upson Hall

Tuesday, February 10, 2009

Refreshments at 3:45pm in the Upson 4th Floor Atrium

Computer Science & Information Science Colloquium

Spring 2009

www.cs.cornell.edu/events/colloquium
