Solutions to Review Questions for Midterm

CS 426 - Introduction to Computational Biology - Fall 2003

Please send email to chew@cs.cornell.edu if you find a mistake in the solutions.  Past experience indicates that I may have made some.

Some of the following questions are from http://www.roselab.jhu.edu/~przytyck/Exam2001.doc and from http://www.cs.princeton.edu/courses/archive/fall00/cs597c/ps1.htm.

  1. Fill in the cost-table below assuming gap penalty = -2, match = +1, mismatch = -1. Fill in the other table to show the path-costs determined during the dynamic programming.  Write down the optimal alignment and show the path in the matrix that corresponds to this alignment. If there is more than one alignment with the maximum score write all these alignments.

    cost A G C G
    A +1 -1 -1 -1
    A +1 -1 -1 -1
    A +1 -1 -1 -1
    C -1 -1 +1 -1
    G -1 +1 -1 +1

    DP   A G C G
      0 -2 -4 -6 -8
    A -2 1 -1 -3 -5
    A -4 -1 0 -2 -4
    A -6 -3 -2 -1 -3
    C -8 -5 -4 -1 -2
    G -10 -7 -4 -3 0

    There are three alignments that achieve the maximum score of 0.
    AG-CG     A-GCG     -AGCG
    AAACG     AAACG     AAACG

     
  2. What is an amino-acid substitution matrix? What is the difference between PAM1 and PAM250?
    Such a matrix indicates how likely it is that a particular amino acid is substituted for another.  PAM1 is suitable for comparing sequences that are on unit of evolution apart; PAM250 is suitable for comparing sequences that are 250 such units apart.  One unit corresponds to about 1 mutation per hundred amino acids.
     
  3. Given is a set of multiple aligned sequences. Compute the sequence profile for this set.
    ATAATAC
    ATAATAG
    ATAATTC
    ATATTAC
    ATAATAA
      1 2 3 4 5 6 7
    A 1.0 0.0 1.0 0.8 0.0 0.8 0.2
    C 0.0 0.0 0.0 0.0 0.0 0.0 0.6
    T 0.0 1.0 0.0 0.2 1.0 0.2 0.0
    G 0.0 0.0 0.0 0.0 0.0 0.0 0.2

     

  4. Assume scoring s(a,b) such that
    s(a,b) =        2 if a=b
    0 if a and b are different and are not gaps
    -1 if one of a or b is a gap
    0 if both a and b are gaps

    Compute the SP (Sum of Pairs) score for the following multiple alignment:
    AATT-A
    AA-TAA
    AC-TAA
    .
    Computing a value for each column, we get: 6+2+(-2)+6+0+6 = 18.
     

  5. Multiple Alignments:
    For this problem, use the sum-of-pairs metric and follow the "once a gap, always a gap" rule. Consider the following three sequences.

    (1) ACGTC
    (2) TCCT
    (3) ACGTCCT
     

    1. Compute all three optimal pairwise alignments assuming a cost of 2 for each deletion and 3 for each substitution. Give the cost of each alignment.
      Using Dynamic Programming, we get the following 3 alignments:
      ACGTC     ---TCCT     ACGTC--
      TCCT-     ACGTCCT     ACGTCCT
      Costs are 8, 6, and 4, respectively.  The last alignment is not unique (i.e., there is another alignment that also achieves a cost of 4).
    2. Compute a progressive multiple alignment starting with the pairwise alignment (1,3). Now use the pairwise alignment (2,3) to merge sequence 2 into the multiple alignment. Show the resulting alignment and give its cost.
      ACGTC--
      ---TCCT
      ACGTCCT
      Computing a cost for each column, we get: 4+4+4+0+0+4+4 = 20.
    3. Repeat problem 2, but this time use the pairwise alignment (1,2) to merge sequence 2 into the multiple alignment. Show the resulting alignment and give its cost. Are the two alignments the same? Which has a lower cost?
      ACGTC--
      TCCT---
      ACGTCCT
      Computing a cost for each column, we get: 6+0+6+0+4+4+4 = 24.  Clearly the alignments are not the same.  The first one had a lower cost.
    4. What is the optimal multiple alignment?
      The alignment in (b) is optimal.  (This problem is simple enough that it is possible to recognize that this is optimal.)
    5. Suppose you charge a cost of 1 for each deletion and 1 for each substitution. What is the optimal alignment? Is it unique?
      The alignment in (b) is still optimal.  It is not unique since, for instance, the last C in the top sequence can be moved to align with the final C in each of the other sequences without affecting the total cost.
       
  6. Consider threading algorithm as applied to RNA (rather than proteins that we discussed in class). It turns out that Z scores computed for RNA are usually much higher than Z scores computed for proteins. Can you explain?
    There are only four types of monomers for RNA (proteins have 20). Therefore the variance of the score is smaller and the Z score (deltaE/sigma) is higher.
     
  7. Consider a refinement to HP model: An H amino acid with a missing neighbor (it is on the surface) is penalized by +1 (H-H remains –1 , and H-P/P-P remain 0). For the following polymer below suggest an optimal HP sequence without a constraint on sequence composition.


    There are 6 positions without surface exposure.  These should all be H, creating four H-H contacts.  Any further H contacts require an H on the surface; there are some positions where this can be done without affecting the total since the cost of exposing an H is exactly equal to the benefit of establishing an additional H-H contact.  In any case, the solutions with H appearing on the exposed surface are no better than the ones without such exposure; thus, one optimal HP sequence (starting at the leftmost end) is PPPHHPPPPHHPPHHPPPPP.