Testing LOOPP

Kristin Snopkowski

With Research Professor: Ron Elber

Abstract:

I am currently doing research on the Learning, Observing and Outputting Protein Patterns (LOOPP) project. The LOOPP project examines different protein sequences and structures and tries to determine, through several different measures of compatibility, how well particular protein structures fit particular protein sequences. The tests I am working on evaluate the reliability of LOOPP's results. I am creating scripts to determine which methods of comparing structures and sequences are the most reliable and accurate.

What is LOOPP?

A current problem in molecular biology is determining the structure of a protein given its sequence, and determining the degree of similarity between two sequences or between two structures. LOOPP performs sequence-to-sequence comparisons, sequence-to-structure alignments and structure-to-structure comparisons of different protein combinations. LOOPP also designs protein-folding potentials using linear programming and statistical analysis of protein patterns, and is capable of building non-redundant libraries of protein folds. The testing that I am involved with works directly with the comparisons of protein structures and sequences.

What is involved in testing?

For testing, all the different ways of comparing sequences to structures, sequences to sequences and structures to structures must be computed. When comparing sequences to structures, also known as threading, we can use global alignment, where the whole sequence is aligned to the entire structure, or local alignment, where the best-matching fragments of the sequence and the structure are aligned. There are many ways of calculating the energy within threading. Some of these include off-lattice comparisons, contact pairwise potentials, continuous pairwise potentials, and profile onion models (THOM2). Looking across all of these tests, one needs to determine which methods of calculating the Z-score and the energy are the best. At present, we believe that THOM2 is the best method we have, but we are trying to see whether a linear combination of the different methods gives better results than any single method. We also need to compare results between sequence-to-structure alignment and sequence-to-sequence alignment. The sequence-to-structure alignment may give us the best energy for a particular sequence, but a similar sequence with a known structure may also be a good approximation of what the real structure should look like.
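The difference between the global and local alignment modes described above can be illustrated with a small sketch. This is not LOOPP code: it aligns two plain strings with the textbook Needleman-Wunsch (global) and Smith-Waterman (local) recurrences, and the function names and the match, mismatch and gap values are assumptions chosen only for illustration. Real threading would replace the simple match/mismatch score with an energy function for placing a residue in a structural site.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: every position of a and b is aligned or gapped."""
    # Row 0: aligning the empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of fragments, never below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment start fresh at any cell.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    long_seq, fragment = "ACDEFGHIK", "DEFG"
    print("global:", global_score(long_seq, fragment))
    print("local: ", local_score(long_seq, fragment))
```

With these toy parameters the local score is 4, because the fragment DEFG matches exactly, while the global score is -6, because the five unmatched residues of the longer string must each be paid for with a gap. This is why local alignment is the natural choice when only part of a sequence is expected to fit a structure.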

How to interpret test results?

Some Sample Results are:

This is the output that I got by running the threading option on a list of sequences and structures, using the THOM2 (threading onion models) option for the comparison. The first column is the sequence. The second column is the structure that the sequence should be best aligned with (this must be defined in advance if we are going to check whether the LOOPP code is working correctly). The third column is the rank of the correct protein structure in the list of protein structures returned. The fourth column is the Z-score, a measure of compatibility: the higher the Z-score, the better the fit between a sequence and a structure. If, for example, the Z-score is 4.3, the energy is 4.3 standard deviations better than the average energy of randomly selected alignments. The fifth column is the energy of the sequence fitted into the structure. Here, a low energy represents a good match between a sequence and a structure.
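The rank and Z-score columns can be made concrete with a short sketch. It assumes the Z-score is computed as the number of standard deviations by which an energy lies below the mean energy of a set of random alignments, which matches the description above but is not taken from the LOOPP source. The function names, the structure labels and the numbers in the usage lines are all hypothetical.

```python
from statistics import mean, pstdev


def z_score(energy, random_energies):
    """Standard deviations by which `energy` lies below the random mean.

    Lower energy is better, so a strongly negative energy relative to the
    random background gives a large positive Z-score.
    """
    mu = mean(random_energies)
    sigma = pstdev(random_energies)
    if sigma == 0:
        raise ValueError("random energies have zero spread; Z-score undefined")
    return (mu - energy) / sigma


def rank_of_correct(energies_by_structure, correct):
    """1-based rank of the known-correct structure, lowest energy first."""
    ordered = sorted(energies_by_structure, key=energies_by_structure.get)
    return ordered.index(correct) + 1


if __name__ == "__main__":
    # Hypothetical background energies and candidate structures.
    print(z_score(-4.3, [-1.0, 1.0]))
    energies = {"struct_a": -12.5, "struct_b": -3.0, "struct_c": -7.1}
    print(rank_of_correct(energies, "struct_c"))
```

A rank of 1 means the pre-defined correct structure had the lowest energy of all the structures tried, which is the outcome a test run is looking for.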

What are some problems with testing?

One problem with running tests and interpreting their results is the sheer amount of information. There are so many combinations of protein alignments, each with different measures of energy and Z-score, that one could study the results indefinitely trying to determine what weight each result should be given to get the best overall results. Another problem is the number of proteins and possible protein structures. The lists we are working with contain upwards of 200 sequences and 400 structures, which is about 80,000 calculations. Each one takes about 9 seconds, so the total computation time is on the order of 8 to 10 days. If we start trying all the different types of threading options, the computation time grows further.
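The run-time estimate above is simple arithmetic, and a small helper makes it easy to see how it scales. This is only a sketch: the function name and the `n_options` parameter (the number of threading options tried on the same lists) are assumptions for illustration, and it assumes the comparisons run one after another on a single machine.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimated_days(n_sequences, n_structures, seconds_per_pair, n_options=1):
    """Estimate serial run time, in days, for all-against-all threading."""
    pairs = n_sequences * n_structures          # one calculation per pair
    total_seconds = pairs * seconds_per_pair * n_options
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"one option:   {estimated_days(200, 400, 9):.1f} days")
    print(f"four options: {estimated_days(200, 400, 9, n_options=4):.1f} days")
```

With exactly 200 sequences, 400 structures and 9 seconds per comparison, a single threading option takes a little over 8 days, and larger lists take proportionally longer. The time grows linearly with the number of options tried, so four options would take more than a month.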

Competition?

The LOOPP team participates in the CASP (Critical Assessment of Techniques for Protein Structure Prediction) competition. In this competition, the LOOPP program takes a sequence and tries to predict the structure of that sequence. This competition occurs every 2 years and will start up again this summer. The team hopes to improve their performance from 2 years ago.