Mapping the Protein Universe

Introduction

Since the early days of genomic research, molecular biologists have tried to make sense of the accumulated information on protein sequences and structures by organizing proteins into relational classes such as families, superfamilies, and fold families. Defining clusters of similar proteins in the protein space is one of the most important problems in computational molecular biology. Such an organization is useful for the functional and structural categorization of proteins and, by pointing to uncharacterized families, may direct experiments that further explore protein function.

To classify proteins into relational classes, we need a well-defined notion of similarity or distance between proteins. There is no obvious or unique way to measure this similarity, because proteins have many different facets and attributes. The most common way to compare proteins is by their sequence similarity. In many cases, however, sequences have diverged to such an extent that their similarity is undetectable by current means of sequence comparison.

Structure is often conserved more than sequence, and detecting structural similarity can help infer function, allowing more accurate predictions of the functional roles of proteins. There are many examples of proteins whose structures and functions are similar, while no such similarity can be detected at the level of the sequence. Therefore, integrating structural information with sequence information in protein classification is extremely important. However, structural information is available for less than 5% of the known proteins. This project attempts to bridge this gap between the sequence space and the structure space.

Goals

The goal of this project is to develop novel tools for mapping the protein universe. Specifically, we are working on two new tools that attempt to bridge the gap between the protein sequence space and the protein structure space.

Methods

In an attempt to bridge the large gap between the sequence space and the structure space we developed tools that use predicted information about the topology of proteins and their secondary structure content. This information is integrated with other sequence-based information to improve the signal-to-noise ratio.

The first method is an information-theoretic approach for comparing protein families. Given the statistical models (such as profiles or HMMs) of two protein families, our algorithm compares the two models (sources) using dynamic programming with an information-theoretic scoring function. The function accounts for two quantities: the statistical similarity of the two sources (their divergence), and the significance of that similarity, estimated by the distance (divergence) of the most likely common source from the background distribution (the null hypothesis). Information from secondary structure prediction algorithms is integrated into the representation of the source, so that common sequence and structure patterns are better detected.
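As a rough illustration of the scoring idea described above, the following Python sketch scores pairs of profile columns with the Jensen-Shannon divergence and feeds those scores into a simple local alignment. The particular combination 0.5 * (1 - D) * (1 + S), the constant shift, the linear gap penalty, and all function names are assumptions made for this example, not the project's actual implementation. Secondary structure information is left out to keep the sketch short.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; zero-probability terms of p are skipped."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js(p, q):
    """Equal-weight Jensen-Shannon divergence; lies in [0, 1] when measured in bits."""
    r = 0.5 * (p + q)
    return 0.5 * kl(p, r) + 0.5 * kl(q, r)

def column_score(p, q, background, shift=0.5):
    """Score two profile columns (probability vectors over the same alphabet).

    d: divergence between the two sources (small means similar).
    s: divergence of their average, taken here as the common source, from the
       background distribution (large means the similarity is significant).
    The product form and the shift are assumptions of this sketch.
    """
    d = js(p, q)
    common = 0.5 * (p + q)
    s = js(common, background)
    return 0.5 * (1.0 - d) * (1.0 + s) - shift

def local_align_score(profile_a, profile_b, background, gap=0.3, shift=0.5):
    """Best local alignment score of two profiles (rows are columns of the alignment model).

    Uses a Smith-Waterman style recurrence with a linear gap penalty (an assumption).
    """
    n, m = len(profile_a), len(profile_b)
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = H[i - 1, j - 1] + column_score(
                profile_a[i - 1], profile_b[j - 1], background, shift
            )
            H[i, j] = max(0.0, match, H[i - 1, j] - gap, H[i, j - 1] - gap)
    return float(H.max())

# Toy usage with a 4-letter alphabet instead of the 20 amino acids.
background = np.full(4, 0.25)
profile_a = np.eye(4)[[0, 1, 2]]   # three sharply peaked columns
profile_b = np.eye(4)[[3, 3, 3]]   # columns that share nothing with profile_a
self_score = local_align_score(profile_a, profile_a, background)
cross_score = local_align_score(profile_a, profile_b, background)
```

The shift matters because the unshifted column score is never negative, and a local alignment needs dissimilar columns to score below zero so that it can terminate.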

The second method is based on an alternative representation of proteins called the canonical representation. The representation draws on existing measures of similarity (even if limited to the subset of proteins to which a measure can be applied) and induces a new measure of similarity for all protein pairs by mapping each protein to a high-dimensional proximity space. Specifically, we make extensive use of threading algorithms and map each protein sequence to a point in the structure space. The distance between any two proteins is then defined as their distance in the structure space.
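One way to picture this mapping is sketched below in Python. Each protein is described by a vector of scores against a fixed set of reference structures, and two proteins are compared through the distance between their vectors. The score matrix is made-up toy data standing in for threading output, and the per-template normalization, the Euclidean metric, and the function names are assumptions of this example rather than details taken from the project.

```python
import numpy as np

def canonical_vectors(scores):
    """Turn a (proteins x templates) score matrix into comparable coordinates.

    Each column (one reference structure) is standardized to zero mean and unit
    variance so that no single template dominates the distance. This
    normalization is an assumption of the sketch.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)
    std = scores.std(axis=0)
    std[std == 0] = 1.0          # leave constant columns unscaled
    return (scores - mean) / std

def pairwise_distances(vectors):
    """Euclidean distance between every pair of rows."""
    diff = vectors[:, None, :] - vectors[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy data: 3 proteins scored against 3 hypothetical reference structures.
toy_scores = [
    [10.0, 1.0, 0.0],   # protein 0 fits template 0 best
    [9.0, 2.0, 0.0],    # protein 1 behaves much like protein 0
    [0.0, 1.0, 10.0],   # protein 2 fits template 2 best
]
distances = pairwise_distances(canonical_vectors(toy_scores))
```

In this toy example, proteins 0 and 1 end up closer to each other than either is to protein 2, even though no direct comparison between the two sequences was made.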

The two new measures of similarity will be integrated with other similarity measures that are being used and developed by our group (based on sequence, structure, predicted function and protein-protein interactions), to create a new global map of the protein space.

Results

Preliminary studies of both representations indicate that these approaches can enhance our ability to detect remote homologies between protein families.

So far we have focused primarily on developing the first method, profile comparison. Adding secondary structure information increased the sensitivity of the comparison algorithm by approximately 50%.

We continue to develop and test these models, and we plan to apply them to all known proteins to create a new map of the protein space. In this way we hope to extend our ability to predict the biological function of proteins from their location in the map and to infer new functional links between protein families.

The application of these algorithms to the full data set of over 1,000,000 proteins is a challenging task. The results of this project will be made accessible to the whole scientific community through Biozon (a unified knowledge resource on genes, domains, protein families, protein-protein interactions and pathways).

References

Golan Yona and Michael Levitt. (2002). Within the twilight zone: A sensitive profile-profile comparison tool based on information theory. Journal of Molecular Biology 315 1257-1275.

Shlomo Dubnov, Ran El-Yaniv, Yoram Gdalyahu, Elad Schneidman, Naftali Tishby, Golan Yona. (2002). A new nonparametric pairwise clustering algorithm based on iterative estimation of distance profiles. Machine Learning 47 35-61.

Research Advisor: Golan Yona

This project was funded in part by the Learning Initiatives for Future Engineers (LIFE).

© 2004 Richard Chung. Contact: rc238 @ cs . cornell . edu