Thursday, November 17, 2005
4:15 pm
B17 Upson Hall

Computer Science
Colloquium
Fall 2005


Serafim Batzoglou
Stanford University

Aligning Protein Sequences and Interaction Networks


Proteins have evolved for more than a billion years and in today's organisms we can clearly group them into families with common sequence, structure, or function. Arranging the letters of each string into columns of similarities, dissimilarities, and gaps is perhaps the oldest and most basic problem in bioinformatics. In the first part of the talk I will talk about a method we have developed for multiple sequence alignment, called ProbCons, which combines a hidden Markov model (HMM) and novel inference techniques. ProbCons outperforms existing methods in benchmark data sets of real proteins, even though it does not incorporate domain-specific knowledge. Then, I will describe a conditional random field (CRF) for sequence alignment, which can be used to derive superior alignment parameters from a rich set of biological features such as hydrophobicity, using training sets of very limited size.

Given multiple sources of functional genomics data, graphs of interactions between all proteins of an organism can be constructed. We are constructing such networks for the 230 sequenced bacteria, and we search for modules (subgraphs) that are statistically significantly conserved across several of the species. Past approaches to network alignment restrict the types of alignments that can be performed to simple pathways or cliques, and moreover cannot perform alignment of more than 3 networks. I will present a new formulation of alignment that is sufficiently general to remove arbitrary assumptions on the nature of alignment. Our algorithm, implemented in a tool called Nuke, scales efficiently to large databases and to multiple alignment of an arbitrary number of networks. Test results on bacterial networks show that our algorithm can uncover core conserved modules and hint to novel biological hypotheses.