Solutions to Review Questions for Midterm
CS 426 - Introduction to Computational Biology - Fall 2003
Please send email to chew@cs.cornell.edu
if you find a mistake in the solutions. Past experience indicates that I may have made
some.
- Some mistakes in the second table of problem 1 have been
corrected. These mistakes did not affect the final score (or the
corresponding best alignments).
- In problem 4, the fifth column has been corrected to have a score of 0
rather than -2.
Some of the following questions are from http://www.roselab.jhu.edu/~przytyck/Exam2001.doc
and from http://www.cs.princeton.edu/courses/archive/fall00/cs597c/ps1.htm.
- Fill in the cost-table below assuming gap penalty =
-2, match = +1, mismatch = -1. Fill in the other table to show the path-costs
determined during the dynamic programming. Write down the optimal alignment and show the
path in the matrix that corresponds to this alignment. If there is more than
one alignment with the maximum score write all these alignments.
| cost |
A |
G |
C |
G |
| A |
+1 |
-1 |
-1 |
-1 |
| A |
+1 |
-1 |
-1 |
-1 |
| A |
+1 |
-1 |
-1 |
-1 |
| C |
-1 |
-1 |
+1 |
-1 |
| G |
-1 |
+1 |
-1 |
+1 |
| DP |
|
A |
G |
C |
G |
| |
0 |
-2 |
-4 |
-6 |
-8 |
| A |
-2 |
1 |
-1 |
-3 |
-5 |
| A |
-4 |
-1 |
0 |
-2 |
-4 |
| A |
-6 |
-3 |
-2 |
-1 |
-3 |
| C |
-8 |
-5 |
-4 |
-1 |
-2 |
| G |
-10 |
-7 |
-4 |
-3 |
0 |
There are three alignments that achieve the maximum score of 0.
AG-CG A-GCG -AGCG
AAACG AAACG AAACG
- What is an amino-acid substitution matrix? What is the difference between
PAM1 and PAM250?
Such a matrix indicates how likely it is that a particular amino acid is
substituted for another. PAM1 is suitable for comparing sequences that
are on unit of evolution apart; PAM250 is suitable for comparing
sequences that are 250 such units apart. One unit corresponds
to about 1 mutation per hundred amino acids.
- Given is a set of multiple aligned sequences. Compute the sequence profile for
this set.
ATAATAC
ATAATAG
ATAATTC
ATATTAC
ATAATAA
| |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
| A |
1.0 |
0.0 |
1.0 |
0.8 |
0.0 |
0.8 |
0.2 |
| C |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.6 |
| T |
0.0 |
1.0 |
0.0 |
0.2 |
1.0 |
0.2 |
0.0 |
| G |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.2 |
- Assume scoring s(a,b) such that
| s(a,b)
= |
|
2 if a=b |
| 0 if a and b are different and are not gaps |
| -1 if one of a or b is a gap |
| 0 if both a and b are gaps |
Compute the SP (Sum of Pairs) score for the following multiple alignment:
AATT-A
AA-TAA
AC-TAA.
Computing a value for each column, we get: 6+2+(-2)+6+0+6 = 18.
- Multiple Alignments:
For this problem, use the sum-of-pairs metric and follow the "once a gap, always a gap" rule. Consider the following three sequences.
(1) ACGTC
(2) TCCT
(3) ACGTCCT
- Compute all three optimal pairwise alignments assuming a cost of 2 for
each deletion and 3 for each substitution. Give the cost of each alignment.
Using Dynamic Programming, we get the following 3 alignments:
ACGTC ---TCCT ACGTC--
TCCT- ACGTCCT ACGTCCT
Costs are 8, 6, and 4, respectively. The last alignment is not
unique (i.e., there is another alignment that also achieves a cost of 4).
- Compute a progressive multiple alignment starting with the pairwise
alignment (1,3). Now use the pairwise alignment (2,3) to merge sequence 2
into the multiple alignment. Show the resulting alignment and give its cost.
ACGTC--
---TCCT
ACGTCCT
Computing a cost for each column, we get: 4+4+4+0+0+4+4 = 20.
- Repeat problem 2, but this time use the pairwise alignment (1,2) to merge
sequence 2 into the multiple alignment. Show the resulting alignment and
give its cost. Are the two alignments the same? Which has a lower cost?
ACGTC--
TCCT---
ACGTCCT
Computing a cost for each column, we get: 6+0+6+0+4+4+4 = 24.
Clearly the alignments are not the same. The first one had a lower
cost.
- What is the optimal multiple alignment?
The alignment in (b) is optimal. (This problem is simple enough
that it is possible to recognize that this is optimal.)
- Suppose you charge a cost of 1 for each deletion and 1 for each
substitution. What is the optimal alignment? Is it unique?
The alignment in (b) is still optimal. It is not unique since, for
instance, the last C in the top sequence can be moved to align with the
final C in each of the other sequences without affecting the total cost.
- Consider threading algorithm as applied to RNA (rather than proteins that
we discussed in class). It turns out that Z scores computed for RNA are
usually much higher than Z scores computed for proteins. Can you explain?
There are only four types of monomers for RNA (proteins have 20). Therefore the variance of the score is smaller and the Z score
(deltaE/sigma) is higher.
- Consider a refinement to HP model: An H amino
acid with a missing neighbor (it is on the surface) is penalized by +1 (H-H
remains –1 , and H-P/P-P remain 0). For the following polymer below
suggest an optimal HP sequence without a constraint on sequence composition.

There are 6 positions without surface exposure. These should all be
H, creating four H-H contacts. Any further H contacts require an H on
the surface; there are some positions where this can be done without
affecting the total since the cost of exposing an H is exactly equal to the
benefit of establishing an additional H-H contact. In any case, the
solutions with H appearing on the exposed surface are no better than the
ones without such exposure; thus, one optimal HP sequence (starting at the
leftmost end) is PPPHHPPPPHHPPHHPPPPP.