CS321 Project 1: Due in-class, October 25

Please read this assignment carefully. It consists of several parts. A detailed description, intended as a guide to the various aspects of this assignment, can be found here (or here, in PostScript format).

Corrections and Addenda

  1. Important clarification: The svd() function returns the square roots of the eigenvalues of RR^T and R^TR, rather than the eigenvalues themselves. See the "Comparing Protein Structures" section of the updated project description for more details.
  2. The pickCA() function will not retrieve the CA coordinates correctly for several assigned proteins. Instead, download pickAtom.m. It works just like pickCA(), but is more general in that it can retrieve the coordinates of any specified backbone atom. Type help pickAtom for more information.
  3. If you downloaded the file alignments.txt before Wednesday, October 10, please re-download it here. The new version fixes a problem with the sequence of the protein 1DO1.
  4. If you downloaded drawCluster.m before Friday, October 12, you can re-download it here. This is not necessary, though: all it will do is draw a nicer graphical representation of your clustering output than the previous version.
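
The clarification in item 1 can be checked numerically. The sketch below compares the output of svd() with the eigenvalues of R*R' and R'*R. The matrix R used here is an arbitrary 3x3 example chosen for illustration, not data from the assignment.

```matlab
% Arbitrary 3x3 example matrix (an assumption, not assignment data).
R = [2 0 1; -1 3 0; 0 1 4];

s = svd(R);                                   % singular values, largest first
lambdaLeft  = sort(eig(R * R'), 'descend');   % eigenvalues of R*R'
lambdaRight = sort(eig(R' * R), 'descend');   % eigenvalues of R'*R

% Squaring the singular values recovers the eigenvalues, up to round-off.
errLeft  = max(abs(s.^2 - lambdaLeft));
errRight = max(abs(s.^2 - lambdaRight));
disp([errLeft errRight]);
```

Both differences should be at the level of floating-point round-off, so any formula in the project description that is stated in terms of eigenvalues needs s.^2 rather than s.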

General Information

Part 1: Determining Distances Between Protein Structures

  1. Read the project description.
  2. Download the following structures from the Protein Data Bank (PDB):
       1) 1MBC         5) 1BZ6         9) 1YOI
       2) 1DO1         6) 1MBO        10) 4MBN
       3) 1A6G         7) 1BZR        11) 1LH1
       4) 1VXH         8) 1YMB
    
    Throughout the assignment, these proteins will be numbered as above. When you refer to protein #1, for example, it will be clear you are referring to 1MBC. If you plan to change this convention, please make it clear in your code.

  3. Download the following three files into your project directory:
    alignments.txt
    File of all pairwise sequence alignments
    loadAlignments.m
    Matlab function to parse the alignments.txt file and extract and store all pairwise sequence alignments.
    getAlignment.m
    Matlab function to retrieve a specific pairwise alignment from the structure returned by loadAlignments().

    You are encouraged to use the provided Matlab functions, although you are free to write your own Matlab code or modify the provided code in order to retrieve the alignments. Once you've downloaded the functions, you can use the help command in order to find out how to use them.

  4. Use the pickAtom() function to extract the CA coordinates for each protein structure.

    The following three steps can be integrated together, though it is better style to write a separate function for each:
  5. Write the code to extract the relevant CA coordinate vectors to be compared, based on the sequence alignment.
  6. Write the code to compute the distance between a pair of coordinate matrices.
  7. Using the code from (5) and (6) above, generate an 11x11 matrix of distances between each possible pair of structures. Print out this matrix to complete this part of the assignment.
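
Steps 5 through 7 can be sketched roughly as follows. This is only an illustration under several assumptions, and the project description takes precedence wherever it differs: the alignment is assumed to be two equal-length strings with '-' marking gaps (the structure returned by getAlignment() may look different), coordinates are assumed to be N-by-3 matrices with one row per CA atom, and the distance is assumed to be the RMSD after optimal superposition, computed from the singular values of the 3x3 matrix R. The three toy coordinate sets stand in for the eleven real structures.

```matlab
% Protein numbering from step 2: index k refers to ids{k}.
ids = {'1MBC','1DO1','1A6G','1VXH','1BZ6','1MBO','1BZR','1YMB','1YOI','4MBN','1LH1'};

% Step 5 (sketch): find the residues that are aligned in both sequences.
% Hypothetical alignment format: two equal-length strings, '-' marks a gap.
alnA = 'MK-LVE';
alnB = 'MKTL-E';
isResA = (alnA ~= '-');
isResB = (alnB ~= '-');
posA = cumsum(isResA);      % residue number in A at each alignment column
posB = cumsum(isResB);      % residue number in B at each alignment column
both = isResA & isResB;     % columns where neither sequence has a gap
idxA = posA(both);          % rows of A's CA matrix to keep: [1 2 3 5]
idxB = posB(both);          % rows of B's CA matrix to keep: [1 2 4 5]

% Toy coordinate sets (one row per CA atom) standing in for real structures.
base  = [0 0 0; 1.5 0 0; 1.5 1.5 0; 0 1.5 1; 0.5 0 2.5];
nRes  = size(base, 1);
theta = pi / 6;
Q = [cos(theta) -sin(theta) 0; sin(theta) cos(theta) 0; 0 0 1];
coords = cell(1, 3);
coords{1} = base;
coords{2} = base * Q' + repmat([5 -2 1], nRes, 1);   % rotated, shifted copy
coords{3} = base .* repmat([1 1 2], nRes, 1);        % stretched along z

% Steps 6 and 7 (sketch): symmetric matrix of pairwise distances.
n = numel(coords);
D = zeros(n);
for i = 1:n
    for j = i+1:n
        % In the assignment, restrict each pair to its aligned rows first,
        % e.g. X = coords{i}(idxA, :) and Y = coords{j}(idxB, :).
        X = coords{i};
        Y = coords{j};
        X = X - repmat(mean(X, 1), size(X, 1), 1);   % move centroid to origin
        Y = Y - repmat(mean(Y, 1), size(Y, 1), 1);
        R = X' * Y;                                  % 3x3 matrix
        s = svd(R);                                  % singular values of R
        if det(R) < 0
            s(end) = -s(end);                        % rule out reflections
        end
        E = sum(X(:).^2) + sum(Y(:).^2) - 2 * sum(s);
        D(i, j) = sqrt(max(E, 0) / size(X, 1));      % clamp round-off below 0
        D(j, i) = D(i, j);
    end
end
disp(D);
```

Because the second toy structure is just a rotated and translated copy of the first, D(1,2) should come out as zero up to round-off, which makes a convenient sanity check before running the loop on the real proteins.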

Part 2: Building Similarity Clusters

  1. Read the project description.
  2. Download the cluster library below. The cluster library has been provided in order to make it easy to manage clusters. It consists of the following files:
    makeCluster.m
    Create a leaf cluster containing a specified value. Can be used to associate a leaf cluster with a protein.
    combineClusters.m
    Create a compound cluster containing two specified clusters.
    isLeafCluster.m
    Return true if a cluster is a leaf cluster, false if it is a compound.
    getClusterValue.m
    Return a value stored in a leaf cluster. Could be used to retrieve an associated protein from a leaf cluster.
    getSubClusters.m
    Get the sub-clusters of a compound cluster.
    drawCluster.m
    Draws a nice graphical representation of a cluster and all its nested subclusters. Useful to display the entire cluster structure once the clustering has been completed.

    You are encouraged to use the provided Matlab functions, although you are free to write your own Matlab code or modify the provided code in order to handle clusters. Once you've downloaded the functions, you can use the help command in order to find out how to use them.

  3. Implement the clustering algorithm.

  4. Using your clustering algorithm and the distance matrix generated in the last step of Part 1, perform clustering on the proteins. Output the resulting nested cluster structure. You can either use the drawCluster() function of the cluster library, write your own code for outputting the clusters, or draw the clusters by hand.

  5. What conclusions can you draw about the evolutionary relationship of the proteins based on your clustering output?
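
The clustering in steps 3 and 4 can be sketched along the following lines. This is an illustration only: it assumes an agglomerative scheme that repeatedly merges the two closest clusters, and it assumes single linkage (the distance between two clusters is the smallest distance between their members). The project description defines the actual algorithm. Nested cell arrays are used as a stand-in for the cluster library so that the example does not guess at the signatures of makeCluster() or combineClusters(), and the 4x4 matrix D is toy data rather than real protein distances.

```matlab
% Toy symmetric distance matrix for four items (stand-in for the 11x11 matrix).
D = [0 1 6 7;
     1 0 5 8;
     6 5 0 2;
     7 8 2 0];

n = size(D, 1);
clusters = num2cell(1:n);   % nested cell arrays stand in for cluster objects
members  = num2cell(1:n);   % item indices contained in each current cluster

while numel(clusters) > 1
    % Find the pair of current clusters with the smallest linkage distance.
    best = Inf;
    bi = 0;
    bj = 0;
    for i = 1:numel(clusters)
        for j = i+1:numel(clusters)
            block = D(members{i}, members{j});
            d = min(block(:));               % single linkage (assumed)
            if d < best
                best = d;
                bi = i;
                bj = j;
            end
        end
    end

    % Merge the pair into a compound cluster and replace the two originals.
    merged        = {clusters{bi}, clusters{bj}};
    mergedMembers = [members{bi}, members{bj}];
    clusters([bi bj]) = [];
    members([bi bj])  = [];
    clusters{end+1} = merged;
    members{end+1}  = mergedMembers;
end

tree = clusters{1};         % nested structure: {{1, 2}, {3, 4}}
```

With the cluster library, the leaves would presumably be created with makeCluster() and each merge performed with combineClusters(); type help on each function for its actual usage.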

Bonus: Explaining the Clustering Output

Go to the PDB website and try to retrieve specific information about each protein. Use this information to provide an explanation for the outcome of your clustering procedure. Write a few paragraphs explaining your results.

Deliverables

The project should be handed in both electronically and on paper. The paper version should be handed in during class on Thursday, October 25. The electronic version should be emailed to leonidm@cs.cornell.edu by 5:00 p.m. on Thursday, October 25. You will need to hand in the following materials: