Chapter 1: Sequence alignment

CS 321

Exact sequence comparison.

Sequence alignment is a tool to test for similarities between a sequence of an unknown target protein, and a single (or a family of) well-characterized protein(s). It is assumed that sequence similarity implies similarity in function.

Figure 1: A cartoon plot illustrating the use of a database of annotated proteins to identify a novel sequence

We denote the sequence of the unknown protein by $A$. It is an abbreviation of the explicit sequence $a_1 a_2 \ldots a_n$, where $a_i$ is an amino acid placed at position $i$ along the sequence. The sequences of the known proteins that make up the comparison database are called $B_k$, where $k = 1, \ldots, K$ and $K$ is the total number of proteins in the database. For sequences only, the database may include hundreds of thousands of entries [http://www.expasy.ch/srs5/]. Databases that include both sequences and structures are limited to about ten thousand entries [http://www.rcsb.org/pdb/]. For brevity we omit the index $k$ from $B_k$ and use $B$ to represent a member of the comparison set.

Table 1: A sample of protein sequences. Typical lengths of protein sequences vary from a few tens to a few hundreds. The single letter notation (one character for an amino acid) is used.

Serine Protease inhibitor

EDICSLPPEV GPCRAGFLKF AYYSELNKCK LFTYGGCQGN ENNFETLQAC XQA

Amicyanin alpha

MRALAFAAALAAFSATAALAAGALEAVQEAPAGSTEVKIAKMKFQTPEVR IKAGSAVTWTNTEALPHNVHFKSGPGVEKDVEGPMLRSNQTYSVKFNAPG TYDYICTPHPFMKGKVVVE

Major Histocompatibility Complex (class I)

MAVMAPRTLV LLLSGALALT QTWAGSHSMR YFSTSVSRPG RGEPRFIAVG YVDDTQFVRF

DSDAASQRME PRAPWIEQEG PEYWDRNTRN VKAHSQTDRV DLGTLRGYYN QSEDGSHTIQ

RMYGCDVGSD GRFLRGYQQD AYDGKDYIAL NEDLRSWTAA DMAAEITKRK WEAAHFAEQL

RAYLEGTCVE WLRRHLENGK ETLQRTDAPK THMTHHAVSD HEAILRCWAL SFYPAEITLT

WQRDGEDQTQ DTELVETRPA GDGTFQKWAA VVVPSGQEQR YTCHVQHEGL PEPLTLRWEP

SSQPTIPIVG IIAGLVLFGA VIAGAVVAAV RWRRKSSDRK GGSYSQAASS DSAQGSDVSL

TACKV

b.2 An introduction to the alignment score

A more precise definition of “similar” is needed. The simplest definition (one) is the fraction of amino acids in $A$ that are identical to the corresponding amino acids in $B$. In proteins, high sequence identity (roughly above 35 percent) immediately implies a close relationship. A more subtle definition of similarity (two) takes into account amino acids with similar physical or biochemical properties. For example, replacing a hydrophobic amino acid like leucine (L) by valine (V) is (usually) an acceptable mutation; the effect on the hydrophobic core of the protein is expected to be small. However, changing leucine (L) to a charged residue like arginine (R) is likely to be damaging and to affect protein stability significantly.
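Definition (one) can be sketched in a few lines. The function name and the choice of denominator (the number of aligned columns) are assumptions made here for illustration; the text does not fix how gaps enter the fraction.

```python
def fraction_identical(row_a: str, row_b: str) -> float:
    """Fraction of aligned columns that hold the same amino acid in both rows.

    Assumption: the two rows are already aligned (equal length, '-' for a gap)
    and the denominator is the number of columns.
    """
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    # A column counts as a match only if both rows carry the same residue.
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return matches / len(row_a)


# First six columns of two related sequences: E/E and G/G match, 2 of 6.
print(fraction_identical("EKLGKL", "ERRGRI"))
```

Other conventions (dividing by the length of the shorter sequence, or skipping gapped columns) give somewhat different percentages, so a quoted identity value is only meaningful together with its definition.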

Hence, there is a gray area between an exact match (two identical amino acids) and a complete mismatch. This gray area is better described by a scoring function with a wide range of values. The scoring function gives the highest score to an identical pair and the lowest possible score to substitutions that do not make chemical, physical or biological sense.

To make sequence comparison compact and easily extendable to a wide range of sequences, we simplify the process. We assume that the total score of comparing sequences $A$ and $B$ is given by a sum of terms, each term comparing two elements, an element from $A$ and an element from $B$. An element is either an amino acid or a “space” (gap). For example, consider the case in which $A$ is significantly longer than $B$. In this case it is obvious that some amino acids of $A$ will be aligned against “spaces” (gaps). The choice of the pairs of elements for the similarity test is called an alignment and is the focus of the present section.

Table 2. An example of an alignment of the two sequences and the selection of pairs of elements for comparison. The sequences of myoglobin and hemoglobin are placed on top of each other and pairs are compared horizontally.


There are two different problems that we need to address: the selection of the alignment and the determination of the scoring function. Let us start with the score. Since we simplified the calculation of the total score, making it a sum of element comparisons, we need to consider only comparisons of individual amino acids and comparisons of an amino acid against a gap. Moreover, we do not need to consider the two orderings of a pair of amino acids separately, since $S(a_i, b_j) = S(b_j, a_i)$.

The scores of pairs of amino acids, which measure the similarity of the $a_i$-s and the $b_j$-s, form the so-called substitution matrix $S$. The matrix is symmetric and of size $20 \times 20$ (20 is the number of amino acid types). Note that the term “substitution” may be misleading since it suggests a direction (substituting $a_i$ by $b_j$ implies that $a_i$ was there first). A symmetric matrix does not have a sense of time, and the order of substitutions does not change the score or the degree of similarity.

Nevertheless, a non-symmetric variation of the sequence as a function of time is of considerable interest. A direction in sequence-similarity-scores is potentially informative, providing molecular fingerprints for evolutionary time scales. However, it is hard to estimate it without a specific model in mind. At present, for the purpose of finding protein relatives, we shall use the simplest way out and ignore the time arrow of evolution. A widely used (symmetric) substitution matrix is given below

The BLOSUM 50 matrix [x]


	A	R	N	D	C	Q	E	G	H	I	L	K	M	F	P	S	T	W	Y	V
:A	5	-2	-1	-2	-1	-1	-1	0	-2	-1	-2	-1	-1	-3	-1	1	0	-3	-2	0	:A
:R	-2	7	-1	-2	-4	1	0	-3	0	-4	-3	3	-2	-3	-3	-1	-1	-3	-1	-3	:R
:N	-1	-1	7	2	-2	0	0	0	1	-3	-4	0	-2	-4	-2	1	0	-4	-2	-3	:N
:D	-2	-2	2	8	-4	0	2	-1	-1	-4	-4	-1	-4	-5	-1	0	-1	-5	-3	-4	:D
:C	-1	-4	-2	-4	13	-3	-3	-3	-3	-2	-2	-3	-2	-2	-4	-1	-1	-5	-3	-1	:C
:Q	-1	1	0	0	-3	7	2	-2	1	-3	-2	2	0	-4	-1	0	-1	-1	-1	-3	:Q
:E	-1	0	0	2	-3	2	6	-3	0	-4	-3	1	-2	-3	-1	-1	-1	-3	-2	-3	:E
:G	0	-3	0	-1	-3	-2	-3	8	-2	-4	-4	-2	-3	-4	-2	0	-2	-3	-3	-4	:G
:H	-2	0	1	-1	-3	1	0	-2	10	-4	-3	0	-1	-1	-2	-1	-2	-3	2	-4	:H
:I	-1	-4	-3	-4	-2	-3	-4	-4	-4	5	2	-3	2	0	-3	-3	-1	-3	-1	4	:I
:L	-2	-3	-4	-4	-2	-2	-3	-4	-3	2	5	-3	3	1	-4	-3	-1	-2	-1	1	:L
:K	-1	3	0	-1	-3	2	1	-2	0	-3	-3	6	-2	-4	-1	0	-1	-3	-2	-3	:K
:M	-1	-2	-2	-4	-2	0	-2	-3	-1	2	3	-2	7	0	-3	-2	-1	-1	0	1	:M
:F	-3	-3	-4	-5	-2	-4	-3	-4	-1	0	1	-4	0	8	-4	-3	-2	1	4	-1	:F
:P	-1	-3	-2	-1	-4	-1	-1	-2	-2	-3	-4	-1	-3	-4	10	-1	-1	-4	-3	-3	:P
:S	1	-1	1	0	-1	0	-1	0	-1	-3	-3	0	-2	-3	-1	5	2	-4	-2	-2	:S
:T	0	-1	0	-1	-1	-1	-1	-2	-2	-1	-1	-1	-1	-2	-1	2	5	-3	-2	0	:T
:W	-3	-3	-4	-5	-5	-1	-3	-3	-3	-3	-2	-3	-1	1	-4	-4	-3	15	2	-3	:W
:Y	-2	-1	-2	-3	-3	-1	-2	-3	2	-1	-1	-2	0	4	-3	-2	-2	2	8	-1	:Y
:V	0	-3	-3	-4	-1	-3	-3	-4	-4	4	1	-3	1	-1	-3	-2	0	-3	-1	5	:V
*	* A	R	N	D	C	Q	E	G	H	I	L	K	M	F	P	S	T	W	Y	V	*

Note the high scores for a substitution of an amino acid by itself (the diagonal elements of the matrix). Note also that certain amino acids have a higher tendency for self-preservation (e.g. cysteine), while a small and polar amino acid like threonine can be substituted by other amino acids more easily.
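Because the matrix is symmetric, a program needs to store only one orientation of each pair. The sketch below copies a handful of BLOSUM 50 entries from the table above; the dictionary and function names are hypothetical, and a real program would load all 210 distinct pairs.

```python
# A few BLOSUM 50 entries copied from the table above. Only one orientation
# of each pair is stored because the matrix is symmetric.
BLOSUM50_SUBSET = {
    ("L", "L"): 5,
    ("L", "V"): 1,
    ("L", "R"): -3,
    ("R", "R"): 7,
    ("R", "V"): -3,
    ("V", "V"): 5,
}


def substitution_score(x: str, y: str, table=BLOSUM50_SUBSET) -> int:
    """Return S(x, y), trying both orientations of the pair."""
    if (x, y) in table:
        return table[(x, y)]
    return table[(y, x)]  # raises KeyError if the pair is not in the subset


# Leucine to valine is mildly favorable, leucine to arginine is penalized.
print(substitution_score("V", "L"), substitution_score("L", "R"))
```

The two printed values mirror the earlier discussion: a conservative hydrophobic replacement scores above a replacement by a charged residue.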

Sequence alignment: definitions

The choice of pairs of elements (amino acids or gaps) from $A$ and $B$ is not completely free. The order of the amino acids in the sequence (the primary “structure” of the protein) counts, and our comparison should maintain that order. This is a significant restriction on the comparisons that we may make and limits our choices considerably. We are seeking the alignment of the two sequences that will give us the optimal (highest) score. Adding further to the complexity is the observation that not all proteins are of the same length; we need to pay attention to the variation in the lengths of the proteins and to the fact that some amino acids will not have corresponding amino acids to match. Two closely related proteins may differ at a specific site at which an amino acid was added to one protein but not to the other. To describe such cases we introduce a “gap” residue. A gap residue is denoted by “-” and is used to indicate an empty space along the sequence. For example, the comparison of the two proteins below (PDB codes 1rsy and 1a25_A; the “_A” denotes the first chain in the data entry) yields the following optimal arrangement of the two sequences with gaps:

TABLE: Optimal alignment of two sequences (top is 1rsy, Synaptotagmin I (First C2 Domain) , lower 1a25, Protein Kinase C (); Chain: A). The percentage of sequence identity is 33 percent.

EKLGKLQYSLDYDFQNNQLLVGIIQ-AAEL-PALDMGGTSDPYVKVFLLPD-K-KKKFE

ERRGRIYIQAHID-R--EVLIVVVRDAKNLVP-MDPNGLSDPYVKLKLIPDPKSESKQK

TKVHRKTLNPVFNEQFTFKVPYSELGGKTLVMAVYDFDRFSKHDIIGEFKVPMNTVD-F

TKTIKCSLNPEWNETFRFQLKESDKD-RRLSVEIWDWDLTSRNDFMGSLSFGISELQKA

GHVTEEWRDLQS

G-V-DGWFKLLS

A gap is placed in one sequence (say $A$) to match an amino acid in the second sequence (say $B$). It indicates a missing amino acid in the $A$ sequence when comparing it to the $B$ sequence. Note that it is not possible for us to decide whether an amino acid was deleted from $A$ or added to $B$ without a detailed evolutionary mechanism linking the two proteins to a common ancestor. At the level of alignment of a pair of sequences the two mechanisms are equivalent. It is therefore common to use the term indel to describe a gap (indel = an INsertion or a DELetion), a name that remains undecided about the mechanism of gap formation.

Our goal is to find the best match between two sequences, including the possibility of gaps. The score of a match is given (as argued above) by a sum of scores of matching elements.

What is the score of the gap, i.e. the alignment of an indel against an amino acid? It is philosophically useful to think of a gap as an ordinary amino acid and to ask what the score is for substituting an amino acid of type “$a$” with an amino acid of type “$-$”. Such an approach is good in theory and there are some studies following this line of investigation. Unfortunately, determining gap scores (also called gap penalties) is so far still an open question. Formulating statistical theories of gaps is difficult since the number of gaps in aligned sequences is not known to begin with. This difficulty in determining scores for gaps is in contrast to the substitution matrices of amino acids, which are pretty well established.

A common practice in alignment programs is to leave the gap score as a parameter to be determined by the user, setting a default value of (for example) zero. This is not to say that the score of the gap is insignificant; it clearly influences the alignment. If it is set highly negative, then it is difficult to match related proteins with considerable variation in length (remote homologous proteins). If it is set too high, two vastly different proteins may get fragmented into many small overlapping pieces with a large number of gaps. The fragmented alignment of two unrelated proteins (with a high score for gaps) may still maintain a good overall score. This is not what we expect, since proteins are related only if a substantial continuous fraction of the two sequences is found to be similar.

High-scoring short-sequence segments can be analyzed statistically to determine their significance. In fact, such an analysis is behind the popular BLAST algorithm [x]. BLAST does not consider gaps at all. A statistical test determines if short compatible segments of aligned sequences are significant. In some sense, BLAST solves the gap problem by avoiding it altogether. The BLAST algorithm will be discussed in the section: Approximate alignments.

A typical practical estimate of the indel/gap penalty is to give it a single negative value, say $g$, regardless of the amino acid it is paired with. The gap penalty is thus set to be independent of the type of amino acid it is aligned to (e.g. the pair $(-, W)$ scores the same as the pair $(-, G)$). This choice is not intuitive since W (tryptophan) is much larger than G (glycine). The space left for the “-” residue by the removal of W or G will therefore be different, and the cost is likely to be dissimilar. The less detailed treatment of gaps (compared to the usual amino acids) may lead to poor alignments. Therefore a few extensions of the simple model of gap penalty were proposed.
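With a single gap value, the score of a given alignment is simply a sum over its columns. The sketch below assumes a toy substitution score (+2 for identical residues, -1 otherwise) and a gap value of -3; both numbers and all function names are illustrative assumptions, not values taken from the text.

```python
def alignment_score(row_a, row_b, pair_score, g):
    """Sum the column scores of a gapped alignment with a single gap value g."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a gap aligned against a gap is not allowed")
        if x == "-" or y == "-":
            total += g  # same penalty whatever the amino acid opposite the gap
        else:
            total += pair_score(x, y)
    return total


def toy_score(x, y):
    """Hypothetical substitution score: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1


# Columns: G/G (+2), W/- (-3), -/W (-3), A/A (+2).
print(alignment_score("GW-A", "G-WA", toy_score, -3))
```

A full substitution matrix such as BLOSUM 50 could be supplied in place of `toy_score` without changing the summation.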

One popular model is to differentiate between opening and extension of gaps. It is based on the argument that gaps should aggregate together. Claims were made that insertions and deletions are likely to appear at certain structural domains of proteins (mostly loops). The concentration of gaps at a few structural sites makes the indels appear together and aggregate. To maximize the size of groups of gaps (and minimize the number of gap clusters), two gap penalties are assigned. One penalty is for initiating a gap group. A second (lower) penalty is for growing an existing cluster of gaps. For example (we construct our alignment from left to right), extending an alignment that ends with the pair $(a_i, b_j)$ by the pair $(a_{i+1}, -)$ is less favorable than extending an alignment that ends with the pair $(a_i, -)$ by the same pair $(a_{i+1}, -)$. This is regardless of the fact that in both cases the same pair was added to an existing alignment.
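The opening/extension model can be illustrated by summing only the gap contribution of a given alignment. The penalty values (-10 to open, -2 to extend) and the convention that a gap switching from one sequence to the other opens a new group are assumptions of this sketch.

```python
def affine_gap_cost(row_a, row_b, g_open, g_extend):
    """Total gap penalty of an alignment under an opening/extension model."""
    total = 0
    prev = None  # row that held a gap in the previous column: "a", "b" or None
    for x, y in zip(row_a, row_b):
        if x == "-":
            cur = "a"
        elif y == "-":
            cur = "b"
        else:
            cur = None
        if cur is not None:
            # Growing an existing gap group is cheaper than starting a new one.
            total += g_extend if cur == prev else g_open
        prev = cur
    return total


# Two adjacent gaps: one opening plus one extension.
print(affine_gap_cost("AC--GT", "ACTTGT", -10, -2))
# Two separated gaps: two openings.
print(affine_gap_cost("A-C-GT", "ATCTGT", -10, -2))
```

Both alignments contain two gap columns, yet the clustered arrangement is penalized less, which is exactly the bias toward aggregated gaps described above.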

This model improves the overall appearance of the alignments. However, the present author is not enthusiastic about it. The model is highly asymmetric with respect to the other “amino acids” (remember, we would like to consider a gap to be an amino acid), and it is not obvious that the asymmetry is indeed required. It also makes the identification of the optimal alignment messier. Alternative models of gaps (so far less popular) will be discussed later and include structure-dependent gap penalties. For the moment we shall concentrate on the simplest model of gaps (one value does it all), and after solving that problem, the effect of the extensions will be examined.

We wish to place the gaps in such a way that the alignment or the comparison of the two sequences will be “optimal”, that is that the two sequences will be as similar to each other as possible. Early studies of sequence comparisons found optimal alignments manually. Gaps were inserted “by hand” into positions that made biochemical sense and increased the number of good-looking pairs [x]. For an implementation on the computer we consider an optimal alignment to be an arrangement of the two sequences with respect to each other in such a way that the total score of the alignment is as high as possible.

Let us examine in more detail the problem of optimal alignments and the possible arrangements of sequences. There are many ways of aligning a sequence $A$ against a sequence $B$ once gaps are introduced (even if the original order of the amino acids is maintained). Gaps can enter anywhere in the alignment and in a wide range of numbers, creating many alternative arrangements. We denote the extended sequences of $A$ and $B$ (with gaps) as $\bar{A}$ and $\bar{B}$. For example, $\bar{A}$ might be $a_1\, -\, a_2\, a_3 \ldots a_n$. The introduction of the extended sequences raises another intriguing question: what is the maximum length of the sequence $\bar{A}$ (or $\bar{B}$)?

Of course, the lengths may vary, but they still have upper bounds. Consider two sequences $A$ and $B$ of the same length, $n$. The maximum length of the extended sequences is $2n$. In an alignment of extended sequences that are maximally long, every amino acid of $A$ or $B$ is aligned against a gap, forming the pairs $(a_i, -)$ and $(-, b_j)$. An increase of the length beyond $2n$ will necessarily include an alignment of an indel against another indel, i.e. the pair $(-, -)$. Such an alignment does not make sense from a scientific point of view: we have no way to determine the number of “double gaps” or their locations. Moreover, there is also a technical problem if the pair $(-, -)$ generates a favorable score. Consider an arbitrary alignment of the two extended sequences.

Let its total score be $T$. Extending the alignment by one more pair of indels, $(-, -)$, gives a new alignment.

The new score is $T + S(-,-)$, where $S(-,-)$ is the element of the substitution matrix for replacing an indel by an indel. If the new score is better than $T$, it is trivial to construct an even better alignment (and score) by adding yet another pair of indels, with the still better score of $T + 2S(-,-)$. The favorable extension with indels can proceed indefinitely and is unbounded. We therefore eliminate from our sequence-to-sequence alignments the possibility of $(-,-)$. Note, however, that if we wish to compare more than two sequences simultaneously (multiple sequence alignment), then the possibility of matching a gap against a gap exists. For example, two gaps may appear in the element comparison when matching three sequences.

It is settled, then: the maximum length of an extended sequence is $2n$. How many possible alignments (with gaps) do we have? In a naïve approach, this number is related to the number of scores we need to compute before deciding on the optimal alignment.

1.2 Counting alignments

To count the number of possible alignments, and as a starting point of the discussion on optimal alignments, we consider the dynamic matrix. The dynamic matrix is a table used for the alignment of two sequences. Below we provide one example for two sequences with the same length ($n = 5$). The rows are associated with the sequence $A$ and the columns with the sequence $B$. The numbers at different matrix entries will be explained below.

The hairy picture with the numerous arrows is actually telling. Paths in this table, which start at the upper left corner and end at the lower right corner, represent all the possible alignments of the two whole sequences. From each entry in the dynamic matrix, there are three alternative moves: going down along the diagonal, going straight down, or moving to the right. A step in the matrix that is part of a legitimate alignment never proceeds to the left or up. For example, the thick line in the above matrix corresponds to the alignment:

A move along a diagonal aligns an amino acid against another amino acid. A vertical step in the matrix aligns an “$a$” amino acid against an indel, and a horizontal step puts a gap against a “$b$” amino acid. Note that we use “$-$” for the gap “residue” (or an indel).
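The correspondence between paths and alignments can be made concrete with a small conversion routine. The move letters "D", "V" and "H" (diagonal, vertical, horizontal) and the function name are hypothetical; the sketch also assumes that rows follow $A$ and columns follow $B$.

```python
def path_to_alignment(a, b, moves):
    """Turn a path of moves in the dynamic matrix into two gapped rows.

    "D": diagonal step, pairs the next residue of a with the next residue of b.
    "V": vertical step, pairs the next residue of a with a gap.
    "H": horizontal step, pairs a gap with the next residue of b.
    """
    i = j = 0
    row_a, row_b = [], []
    for move in moves:
        if move == "D":
            row_a.append(a[i])
            row_b.append(b[j])
            i += 1
            j += 1
        elif move == "V":
            row_a.append(a[i])
            row_b.append("-")
            i += 1
        elif move == "H":
            row_a.append("-")
            row_b.append(b[j])
            j += 1
        else:
            raise ValueError(f"unknown move {move!r}")
    if i != len(a) or j != len(b):
        raise ValueError("path does not end at the lower right corner")
    return "".join(row_a), "".join(row_b)


print(path_to_alignment("GW", "GA", "DVH"))
```

Since no move decreases `i` or `j`, the order of the amino acids in both sequences is preserved automatically.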

The numbers at the different entries of the table denote the number of paths (alignments) that can reach this point. For example, there are 5 possible alignments of $a_1$ against $b_1 b_2$ (check the table element at the cross between $a_1$ and $b_2$). Written as lists of aligned pairs, they are:

$(a_1, b_1)\,(-, b_2)$
$(-, b_1)\,(a_1, b_2)$
$(a_1, -)\,(-, b_1)\,(-, b_2)$
$(-, b_1)\,(a_1, -)\,(-, b_2)$
$(-, b_1)\,(-, b_2)\,(a_1, -)$

The last three alignments include the same elements for comparison and their scores are therefore degenerate. It is therefore not possible to decide on a “best” alignment from the group of three. There is more than one optimal alignment. At present, we consider all paths, including the degenerate ones. In the Appendix a clever counting protocol (by Dr. Jaroslaw Meller) is outlined that counts the number of non-degenerate paths. Interestingly, the exact number of non-degenerate paths is $\frac{(n+m)!}{n!\,m!}$. It has the same asymptotic behavior as the approximate lower-bound expression we sketch below for the number of all paths.

Another interesting property of this table is a summation rule and the possibility of constructing a recursion formula for the number of alignments. There are three ways (“sources”) to extend a shorter alignment to obtain one of the five longer alignments listed above. (a) Extend an earlier alignment by the pair $(a_1, b_2)$, a diagonal move in the above table. Alternatively, (b) the pair $(-, b_2)$ or (c) the pair $(a_1, -)$ (horizontal or vertical moves in the above table) can be used to extend a shorter alignment and to obtain a member of the above group.

Each of the three “sources” has a corresponding position in the matrix. The number at that entry of the matrix is the number of paths aligning a segment of $A$ against a segment of $B$. For example, before adding the pair $(a_1, b_2)$ we were at the table entry aligning nothing from $A$ against $b_1$. There is only one way of aligning $b_1$ against a gap, and “1” is indeed the corresponding entry.

Another example of extending the alignment is to add a gap against $b_2$, which means that our earlier position in the table was the alignment of $a_1$ and $b_1$. This alignment can be done in three different ways and therefore the table entry is “3” (the three ways are $(a_1, b_1)$; $(a_1, -)\,(-, b_1)$; and $(-, b_1)\,(a_1, -)$).

If we add the numbers of paths at the three “sources”, we obtain $1 + 3 + 1 = 5$, exactly the number of alignments of our target.

To summarize the above empirical observations more precisely:

The number of possible alignments of $n$ $a$-s against $m$ $b$-s is defined as $N(n,m)$. This number can be determined using the recursive formula $N(n,m) = N(n-1,m) + N(n,m-1) + N(n-1,m-1)$ and the initial conditions $N(0,0) = N(n,0) = N(0,m) = 1$. Note that the definition $N(0,0) = 1$ was made for computational convenience; it does not imply that “nothing” against “nothing” can be aligned in exactly one way.
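The summation rule translates directly into a short table-filling routine. The sketch below assumes the three-source recursion with every entry of the first row and first column set to 1; the function name is hypothetical.

```python
def count_alignments(n, m):
    """Number of alignments (paths) of n residues against m residues."""
    # First row and first column: a sequence against all gaps, one way each.
    N = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Sum of the three "sources": vertical, horizontal and diagonal.
            N[i][j] = N[i - 1][j] + N[i][j - 1] + N[i - 1][j - 1]
    return N[n][m]


for n in range(1, 6):
    print(n, count_alignments(n, n))
```

For one residue against two the routine returns 5, and for two sequences of length 5 it returns 1683, the values quoted in the text.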

While this formula can be used directly, it is useful to have a quick, order-of-magnitude estimate of the number of alignments. This estimate is especially useful if we are planning a computation that will enumerate all of the alignments. If a calculation is not feasible with existing computer resources, it is better to know that in advance, and not after a few weeks of futile attempts to execute the desired computation.

A simple lower bound for $N(n,m)$ can be obtained quickly. $N(n-1,m-1)$ is a positive number, so we can write $N(n,m) > N(n-1,m) + N(n,m-1)$ (we “forgot” for convenience the term $N(n-1,m-1)$ in the original equality). So, the alternative recursion formula $M(n,m) = M(n-1,m) + M(n,m-1)$ (with the same initial conditions as for $N(n,m)$) always yields lower numbers than $N(n,m)$. For $M(n,m)$ we have a closed expression: $M(n,m) = \frac{(n+m)!}{n!\,m!}$. This formula is easy to verify by direct substitution of the closed expression into the recursion. Note that this is the same as the (exact) result derived by Meller for non-degenerate paths (Appendix).

We now make use of the Stirling formula, $\log n! \approx n \log n - n$ [x], valid for large $n$-s.

For two sequences of the same length ($m = n$), the logarithm of the number of alignments is estimated as

$$\log M(n,n) = \log (2n)! - 2 \log n! \approx 2n \log 2n - 2n - 2\,(n \log n - n) = 2n \log 2$$

and the lower bound for the number of alignments is approximately $2^{2n}$. For a short protein this number is already substantial. Even if the computation of a single alignment required only a nanosecond ($10^{-9}$ second), which is unrealistically fast, the time needed to examine all possible alignments would be prohibitive.

If this is not impressive enough, remember that this is a lower bound, and the precise counting of all paths will yield a number significantly larger than this one. For example, for $n = 5$ the estimate $2^{2n}$ gives 1024, already a number significantly lower than the exact number in the table (1683). Hence, this estimate underlines the claim that it is impossible to examine all alignments one by one in order to find the alignment with the highest score.
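To get a feeling for the magnitudes, the binomial expression and the crude Stirling estimate can be evaluated directly. The sequence length $n = 100$ and the function names below are assumptions chosen for illustration; `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_bound(n, m):
    """(n+m)! / (n! m!), the closed expression for the two-term recursion."""
    return math.comb(n + m, n)


def stirling_estimate(n):
    """Estimate 2**(2n), obtained from log n! ~ n log n - n with m = n."""
    return 2 ** (2 * n)


def years_to_enumerate(count, seconds_per_alignment=1e-9):
    """Time to visit `count` alignments at one alignment per nanosecond."""
    seconds = float(count) * seconds_per_alignment
    return seconds / (3600 * 24 * 365)


print(binomial_bound(5, 5), stirling_estimate(5))  # compare with 1683

n = 100  # assumed length of a short protein
print(f"{years_to_enumerate(stirling_estimate(n)):.1e} years")
```

For $n = 100$ the estimate $2^{200}$ is about $1.6 \times 10^{60}$, so even at one alignment per nanosecond the enumeration would take on the order of $10^{43}$ years.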

Readers with a background in structural biology may recall the Levinthal paradox in protein folding. The paradox contrasts the huge number of plausible protein conformations with the efficiency with which proteins fold in nature. Nature solves the protein folding problem, so at least the existence of a solution to that problem is confirmed. For sequence alignment we cannot be sure (until the next section) that an efficient solution exists; in contrast to protein folding, however, it turns out that one does. As we see below, a large space to search does not necessarily mean that optimization in that space is difficult. It is not obvious that the optimization must be performed at a cost proportional to the volume of that space. In fact, it can be profoundly cheaper.

1.3 Dynamic programming and optimal alignments

After this long detour, it is about time to return to the basic question: what is the optimal alignment of $A$ and $B$? The score matrix and the gap penalty are provided, and all we need is to fish out the alignment(s) with the highest overall score. This is a point at which algorithms developed by computer scientists can be extremely useful. The fact that the number of possible alignments is exponentially large in the sequence length does not mean that the search for the optimal alignment needs to be done in exponential time. Sequence alignments can be done with dynamic programming, an algorithm that requires only of the order of $n \times m$ operations to find the alignment with the best score, a remarkable saving compared to the number of all possible alignments.

The efficient search for the optimal alignments consists of two steps. In the first step a dynamic matrix is constructed (with different entries from those we have seen before), and in the second step an optimal path is found in the newly constructed table. The first step is similar to the counting of possible alignments and the recursive expression we used to compute the total number of alignments. We denote the optimal score of the alignment of a sequence of length $n$ against a sequence of length $m$ by $T(n,m)$, and consider the following question:

Assuming that a very kind fellow gave us the optimal scores for the following alignments: $T(n-1,m-1)$, $T(n-1,m)$, $T(n,m-1)$, can we construct the score $T(n,m)$?

The answer is yes. Since the total score is given by a sum of the scores of the individual aligned pairs, we construct the score $T(n,m)$ from the three alignments leading to it. We consider three possibilities to obtain an alignment of $n$ against $m$ amino acids.

Option a: Align $n-1$ against $m-1$ amino acids (this alignment has the known optimal score of $T(n-1,m-1)$). Extend it by the alignment of $a_n$ against $b_m$ with a score of $S(a_n, b_m)$, which is the substitution score according to the types of the amino acids $a_n$ and $b_m$ (e.g., from the BLOSUM matrix). Hence the first suggestion for an optimal score is $T(n-1,m-1) + S(a_n, b_m)$.

Option b: Align $n-1$ amino acids against $m$ amino acids. This alignment also has a known score, which is $T(n-1,m)$. Extend this alignment by the pair $(a_n, -)$ with a corresponding score $g$ for a gap. The second possibility for an optimal alignment is therefore $T(n-1,m) + g$.

Option c: Align $n$ amino acids against $m-1$ amino acids with the known (optimal) score of $T(n,m-1)$. Extend the alignment by adding the pair $(-, b_m)$ with a corresponding score of $g$. The third suggestion is therefore $T(n,m-1) + g$.

Note that we use the simple model of gaps in which only a single score is associated with an indel, regardless of the amino acid it is aligned against.

Our final task is to select the highest score from the three alternatives:

$T(n-1,m-1) + S(a_n, b_m)$, $T(n-1,m) + g$, $T(n,m-1) + g$; the highest of the three is the optimal score $T(n,m)$.

More compactly, we write:

$$T(n,m) = \max\left[\, T(n-1,m-1) + S(a_n, b_m),\; T(n-1,m) + g,\; T(n,m-1) + g \,\right]$$

The above recursion can be used to fill the complete dynamic matrix with optimal scores. We start with the condition $T(0,0) = 0$ and use the initial values to grow the matrix. For example,

$$T(1,0) = T(0,0) + g = g$$

where $T(0,0)$ denotes the alignment of nothing, which (not surprisingly) scores zero. Another simple example is $T(n,0) = n\,g$: there is only one way to align a sequence against an “all gaps” arrangement, providing us with immediate optimal scores for the first row and column of the dynamic matrix. It is also clear that, similarly to the direct counting of the number of paths, we can construct all the optimal scores.

A simple pseudocode to create the dynamic matrix is given below:

/* fill the first (zero) column and the first (zero) row */
T(0,0) = 0
Do I = 1:n
   T(I,0) = I*g
End do
Do J = 1:m
   T(0,J) = J*g
End do
/* now fill the rest of the matrix, picking the maximum value
   from the three possibilities */
Do I = 1:n
   Do J = 1:m
      T(I,J) = max[ T(I-1,J) + g,
                    T(I,J-1) + g,
                    T(I-1,J-1) + S(a(I),b(J)) ]
   End do
End do
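The pseudocode above can be sketched as runnable Python. This global alignment procedure is commonly known as the Needleman-Wunsch algorithm. The toy substitution score (+2 for identity, -1 otherwise) and the gap value -3 are illustrative assumptions; a full substitution matrix could be passed in instead.

```python
def fill_dynamic_matrix(a, b, score, g):
    """Return the (n+1) x (m+1) table T of optimal prefix alignment scores."""
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        T[i][0] = i * g
    for j in range(1, m + 1):
        T[0][j] = j * g
    # Every other entry is the best of the three possible extensions.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(
                T[i - 1][j] + g,                               # a[i-1] against a gap
                T[i][j - 1] + g,                               # gap against b[j-1]
                T[i - 1][j - 1] + score(a[i - 1], b[j - 1]),   # a[i-1] against b[j-1]
            )
    return T


def toy_score(x, y):
    """Hypothetical substitution score: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1


T = fill_dynamic_matrix("GW", "G", toy_score, -3)
for row in T:
    print(row)
```

Python strings are indexed from zero, so the residue called a(I) in the pseudocode is `a[i - 1]` here. The optimal score of the full alignment is the last entry, `T[-1][-1]`.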

Actually, if we are primarily interested in the score of the alignment of $A$ with respect to $B$, only a single element of the dynamic matrix, $T(n,m)$, is of interest. Of course, in order to get at it we need to compute the whole matrix first. However, this score, which is recorded at the lower right corner of the dynamic matrix, is only part of the story.

Besides the score, in many cases it is important to know the alignment itself. The path in the dynamic matrix that corresponds to the optimal alignment can be found by a trace-back procedure. Starting from $T(n,m)$ we ask: “which of the three possible steps could have generated the final optimal score?” That is, we examine the three possibilities

$$T(n,m) = T(n-1,m-1) + S(a_n, b_m)\,? \qquad T(n,m) = T(n-1,m) + g\,? \qquad T(n,m) = T(n,m-1) + g\,?$$

For at least one of the tests above the equality will hold. It is possible that more than one test is satisfied, and in that case the optimal alignment is degenerate: there is more than one alignment that provides the optimal score. Usually we consider only one path. For example, if the first test is satisfied, the alignment ends with the pair $(a_n, b_m)$ and we repeat the three tests to find the next path segment, this time starting from $T(n-1,m-1)$. The process is repeated until it reaches the upper left corner and provides the desired optimal alignment. The maximum number of times that the process is repeated is $n + m$ (all the steps are horizontal or vertical) and the smallest number of steps is either $n$ or $m$, the larger of the two.
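The trace-back can be sketched as follows. The block fills the matrix and then walks back from the lower right corner, following a single path even when the optimum is degenerate. Integer scores are assumed so that the equality tests are exact; the toy score and the gap value are illustrative assumptions.

```python
def align(a, b, score, g):
    """Return (optimal score, gapped row for a, gapped row for b)."""
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = i * g
    for j in range(1, m + 1):
        T[0][j] = j * g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(T[i - 1][j] + g, T[i][j - 1] + g,
                          T[i - 1][j - 1] + score(a[i - 1], b[j - 1]))

    # Trace back: at each entry, test which step could have produced it.
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and (j == 0 or T[i][j] == T[i - 1][j] + g):
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    # The pairs were collected from the end of the alignment to its start.
    return T[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


def toy_score(x, y):
    """Hypothetical substitution score: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1


print(align("GWA", "GA", toy_score, -3))
```

Testing the diagonal step first is an arbitrary tie-breaking choice; a different order of tests may return a different, equally optimal alignment.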

We are also ready for a rough estimate of the computational effort. As we discussed earlier, a lower bound on the number of possible alignments is approximately $2^{2n}$. Examining all possible alignments will require a computational effort that grows exponentially with $n$. The computations that we just described, using dynamic programming, are much cheaper. Let us consider the two steps separately. In the first step we create the dynamic matrix $T$. To generate a single element we need (about) four operations: evaluating three expressions, and deciding which of the three is the largest. The cost of calculating a single element is therefore independent of $n$ or $m$, and the cost associated with the computation of the matrix is proportional to $n \times m$ (the number of matrix elements). To trace the path of the alignment (once the matrix is known) takes a maximum of $n + m$ operations.

