CS 321

Exact sequence comparison.

 

 

Sequence alignment is a tool to test for similarities between a sequence of an unknown target protein, and a single (or a family of) well-characterized protein(s). It is assumed that sequence similarity implies similarity in function.

 

Figure 1: A cartoon plot illustrating the use of a database of annotated proteins to identify a novel sequence

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


We denote the sequence of the unknown protein by .  It is a short cut for an abbreviation of the explicit sequence , where  is an amino acid placed at the  position along the sequence. The sequences of the known proteins, that make the comparison database, are called  where , and  is the total number of proteins in the database. For sequences only, the database may include hundreds of thousands of entries [http://www.expasy.ch/srs5/]. Databases that include both sequences and structures are limited to about ten thousand entries [http://www.rcsb.org/pdb/]. For brevity we omit the index from  and use  to represent a member of the comparison set.

 

Table 1: A sample of protein sequences. Typical lengths of protein sequences vary from a few tens to a few hundreds. The single letter notation (one character for an amino acid) is used.

 

Serine Protease inhibitor

EDICSLPPEV GPCRAGFLKF AYYSELNKCK LFTYGGCQGN ENNFETLQAC XQA

 

Amicyanin alpha

MRALAFAAALAAFSATAALAAGALEAVQEAPAGSTEVKIAKMKFQTPEVR IKAGSAVTWTNTEALPHNVHFKSGPGVEKDVEGPMLRSNQTYSVKFNAPG TYDYICTPHPFMKGKVVVE

 

Major Histocompatability Complex (class I)

MAVMAPRTLV LLLSGALALT QTWAGSHSMR YFSTSVSRPG RGEPRFIAVG YVDDTQFVRF

DSDAASQRME PRAPWIEQEG PEYWDRNTRN VKAHSQTDRV DLGTLRGYYN QSEDGSHTIQ

RMYGCDVGSD GRFLRGYQQD AYDGKDYIAL NEDLRSWTAA DMAAEITKRK WEAAHFAEQL

RAYLEGTCVE WLRRHLENGK ETLQRTDAPK THMTHHAVSD HEAILRCWAL SFYPAEITLT

WQRDGEDQTQ DTELVETRPA GDGTFQKWAA VVVPSGQEQR YTCHVQHEGL PEPLTLRWEP

SSQPTIPIVG IIAGLVLFGA VIAGAVVAAV RWRRKSSDRK GGSYSQAASS DSAQGSDVSL

TACKV

 

 

b.2 An introduction definition ofto the alignment score

We need a more precise definition of similarA more precise definition of similar is needed. The simplest definition (one) is the fraction of amino acids in  that are identical to . In proteins, high sequence identity (roughly above 35 percent) immediately implies close relationship. A more subtle definition of similarity (two) uses amino acids with similar physical or biochemical properties. For example, replacing a hydrophobic amino acid like leucine (L) by valine (V) is (usually) an acceptable mutation. The effect on the hydrophobic core of the protein is expected to be small. However, changing (L) to a charged residue like arginine (R) is likely to be damaging and significantly affect protein stability.

Hence, there is a gray area between an exact matching (two identical amino acids) and a complete mismatch. This gray area is better described by a scoring function with a wide spectrum. The scoring function gives the highest score to an identical pair and the lowest possible score for (to) substitutions that do not make chemical, physical or biological sense.

To make sequence comparison compact and easily extendable to a wide range of sequences, we simplify the process. We assume that the total score  of comparing sequences and  will be given by a sum of terms, each term comparing two elements, an element from  and an element from . An element is either an amino acid or a “space” (gap). For example, consider the case in which  is significantly longer than . In that (this) case it is obvious that some amino acids of  will be aligned against “spaces” (gaps).  The choice of the pairs of elements for the similarity test is called an alignment and is the focus of the present section.

 

Table 2. An example of an alignment of the two sequences and the selection of pairs of elements for comparison. The sequences of myoglobin and hemoglobin are placed on top of each other and pairs are compared horizontally.

PLACE TABLE 2 HERE

 

There are two different problems that we need to address: The selection of the alignment and the determination of the scoring function. Let us start with the score. Since we simplified the calculation of the total score, making it a sum of element comparisons, we need to consider only comparison of individual amino acids and comparisons of an amino acid against a gap. However, we do not need to consider separately scores of pairs of amino acids since .

The scores of pairs of amino acids, which measure the similarity of the -s and the -s, form the so-called substitution matrix . The matrix is symmetric, and of size  (the number of amino acid types). Note that the term “substitution” may be misleading since it suggests a direction (substituting  by  implies that  was there first). A symmetric matrix does not have a sense of time, and the order of substitutions does not change the score, or the degree of similarity.

Nevertheless, a non-symmetric variation of the sequence as a function of time is of considerable interest. A direction in sequence-similarity-scores is potentially informative, providing molecular fingerprints for evolutionary time scales. However, it is hard to estimate it without a specific model in mind. At present, for the purpose of finding protein relatives, we shall use the simplest way out and ignore the time arrow of evolution. A widely used (symmetric) substitution matrix is given below


 

The BLOSUM 50 matrix [x]

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

A

R

N

D

C

Q

E

G

H

I

L

K

M

F

P

S

T

W

Y

V

 

:A

5

-2

-1

-2

-1

-1

-1

0

-2

-1

-2

-1

-1

-3

-1

1

0

-3

-2

0

:A

:R

-2

7

-1

-2

-4

1

0

-3

0

-4

-3

3

-2

-3

-3

-1

-1

-3

-1

-3

:R

:N

-1

-1

7

2

-2

0

0

0

1

-3

-4

0

-2

-4

-2

1

0

-4

-2

-3

:N

:D

-2

-2

2

8

-4

0

2

-1

-1

-4

-4

-1

-4

-5

-1

0

-1

-5

-3

-4

:D

:C

-1

-4

-2

-4

13

-3

-3

-3

-3

-2

-2

-3

-2

-2

-4

-1

-1

-5

-3

-1

:C

:Q

-1

1

0

0

-3

7

2

-2

1

-3

-2

2

0

-4

-1

0

-1

-1

-1

-3

:Q

:E

-1

0

0

2

-3

2

6

-3

0

-4

-3

1

-2

-3

-1

-1

-1

-3

-2

-3

:E

:G

0

-3

0

-1

-3

-2

-3

8

-2

-4

-4

-2

-3

-4

-2

0

-2

-3

-3

-4

:G

:H

-2

0

1

-1

-3

1

0

-2

10

-4

-3

0

-1

-1

-2

-1

-2

-3

2

-4

:H

:I

-1

-4

-3

-4

-2

-3

-4

-4

-4

5

2

-3

2

0

-3

-3

-1

-3

-1

4

:I

:L

-2

-3

-4

-4

-2

-2

-3

-4

-3

2

5

-3

3

1

-4

-3

-1

-2

-1

1

:L

:K

-1

3

0

-1

-3

2

1

-2

0

-3

-3

6

-2

-4

-1

0

-1

-3

-2

-3

:K

:M

-1

-2

-2

-4

-2

0

-2

-3

-1

2

3

-2

7

0

-3

-2

-1

-1

0

1

:M

:F

-3

-3

-4

-5

-2

-4

-3

-4

-1

0

1

-4

0

8

-4

-3

-2

1

4

-1

:F

:P

-1

-3

-2

-1

-4

-1

-1

-2

-2

-3

-4

-1

-3

-4

10

-1

-1

-4

-3

-3

:P

:S

1

-1

1

0

-1

0

-1

0

-1

-3

-3

0

-2

-3

-1

5

2

-4

-2

-2

:S

:T

0

-1

0

-1

-1

-1

-1

-2

-2

-1

-1

-1

-1

-2

-1

2

5

-3

-2

0

:T

:W

-3

-3

-4

-5

-5

-1

-3

-3

-3

-3

-2

-3

-1

1

-4

-4

-3

15

2

-3

:W

:Y

-2

-1

-2

-3

-3

-1

-2

-3

2

-1

-1

-2

0

4

-3

-2

-2

2

8

-1

:Y

:V

0

-3

-3

-4

-1

-3

-3

-4

-4

4

1

-3

1

-1

-3

-2

0

-3

-1

5

:V

*

* A

R

N

D

C

Q

E

G

H

I

L

K

M

F

P

S

T

W

Y

V

*

 

 

Note the high scores for a substitution of an amino acid to self (the diagonal elements of the matrix). Note also that certain amino acids have higher tendency for self-preservation (e.g. Cysteine) while a small and polar amino acid like threonine can be substituted to other amino acids more easily.

 

sequence alignment: definitions

Of course, theThe choice of pairs of amino acidselements (amino acids or gaps) from  and  is not completely free. The order of the amino acids in the sequence (the primary “structure” of the protein) counts, and our comparison should maintain that order. This is a significant restriction on the comparisons that we may make and limits our choices considerably. We are seeking the alignment of the two sequences that will give us the optimal (highest) score. Adding further to the complexity of the problem is the observation that not all proteins are of the same length; we need to account forpay attention to the variation in the lengths of the proteins and for the fact that some amino acids will not have corresponding amino acids to match. Two closely related proteins may be different at a specific site in which an amino acid was added to one protein but not to the second. To describe such cases we introduce a “gap” residue. A gap residue is denoted by “-“ and is used to indicate an empty space along the sequence. For example, the comparison of two proteins below (PDB codes 1rsy and 1a25_A, the _A denotes the first chain in the data entry) yields the following optimal arrangement of the two sequences with gaps:

 

TABLE: Optimal alignment of two sequences (top is 1rsy, Synaptotagmin I (First C2 Domain) , lower 1a25, Protein Kinase C (beta); Chain: A). The percentage of sequence identity is 33 percent.

 

EKLGKLQYSLDYDFQNNQLLVGIIQ-AAEL-PALDMGGTSDPYVKVFLLPD-K-KKKFE

ERRGRIYIQAHID-R--EVLIVVVRDAKNLVP-MDPNGLSDPYVKLKLIPDPKSESKQK

 

TKVHRKTLNPVFNEQFTFKVPYSELGGKTLVMAVYDFDRFSKHDIIGEFKVPMNTVD-F

TKTIKCSLNPEWNETFRFQLKESDKD-RRLSVEIWDWDLTSRNDFMGSLSFGISELQKA

 

GHVTEEWRDLQS

G-V-DGWFKLLS

 

A gap is placed in one sequence (say ) to match an amino acid in the second sequence (say ). It indicates a missing amino acid in the  sequence when comparing it to the  sequence. Note that it is not possible for us to decide if an amino acid was deleted from  or added to , without a detailed evolutionary mechanism linking the two proteins to a common ancestor. At the level of alignment of a pair of sequences the two mechanisms are equivalent. It is therefore common to use the term indel to describe a gap (indel = an INsertion or a DELetion), a name that remains undecided about the mechanism of gap formation.

 

Our goal is to find the best match between two sequences including the possibilities of gaps. A score of a match is given (as argued above) by a sum of scores of matching elements. In order to score a given alignment and decide if it is good or bad we need a score for an indel.

 

What is the score of the gap, i.e. the alignment of an indel against another amino acid? It is philosophically useful to think ofn a gap as an ordinary amino acid and ask what is the score for substituting an amino acid type “” with an amino acid type “”. Such an approach is good in theory and there are some studies following this line of investigation. Unfortunately so far, determining gap scores (also called gap penalties) is still an open question. Formulation of statistical theories of gaps is difficult since the number of gaps in aligned sequences is not known to begin with. This difficulty in determining scores for gaps is in contrast with to the substitution matrices of amino acids that are pretty well established.

 

 A common practice in alignment programs is to leave the gap energy as a parameter to be determined by the user, setting a default value of (for example) zero. This is not to say that the score of the gap is insignificant. The score of a gapIt clearly influences the alignment. If it is set too lowhighly negative, then it is difficult to match related proteins of considerable variation in lengths (remote homologous proteins). If it is set too high, two vastly different proteins may get fragmented to many small overlapping pieces and a large number of gaps. The fragmented alignment of two unrelated proteins (with high score for gaps scoring) may still maintain a good overall score. This is not what we expect, since We expect that proteins are related only if a substantial continuous fraction of the two sequences is similarfound to be similar.

 

 

High-scoring short-sequence segments can be analyzed statistically to determine their significance. In fact such an analysis is behind the popular BLAST algorithm [x]. To the first order BLAST does not consider gaps at all. A statistical test determines if short compatible segments of aligned sequences are significant. In some sense, BLAST solves the gap problem by avoiding it all together. The BLAST algorithm will be discussed in the section: Approximate alignments.

 

A typical practical solution estimate for of the indel/gap penalty is to give it a single negative value, say , regardless of the pairing amino. Hence, tThe gap penalty is set to be independent of the type of the amino acid it is aligned to (e.g. the pair  score the same as the pair ). This choice is non intuitive since  (tryptophan) is much larger than  (glycine). The space left for the “-“ residue by the removal of  or  will be therefore different and the cost is likely to be dissimilar. The less detailed treatment of gaps, (compared to usual amino acids), can may lead to poor alignments. Therefore a few suggestions of extending the simple model of gap penalty were proposed.

 

One popular model is to differentiate between opening and extension of gaps. It is based on the argument that gaps should aggregate together. Claims were made that insertions and deletions are likely to appear at certain structural domains of proteins (mostly loops). The concentration of gaps at a few structural sites makes the indels appear together, and aggregate. To maximize the size of groups of gap (and minimize the number of gap clusters), two gap penalties are assigned:. One penalty is for initiating a gap group. A second (lower penalty) is for growing an existing gap cluster of gaps. For example, (we construct our alignment from left to right), extending an alignment from  to  is less favorable than the extension of the pair  to . This is regardless of the fact that in both cases the same pair  was added to an existing alignment.

 

This model improves the overall appearance of the alignments. However, the present author is not enthusiastic about it. It This model is highly asymmetric with respect to other “amino acids” (remember, we would like to consider a gap to be an amino acid), and it is not obvious that the asymmetry is indeed required. It is also making the identification of optimal alignment messier. Alternative modeling of gaps (so far less popular) will be discussed later and include structural dependent gap. For the moment we shall concentrate on the simplest model of gaps (one value does it all), and after solving that problem, the effect of the extensions will be examined.

 

We wish to place the gaps in such a way that the alignment or the comparison of the two sequences will be “optimal”, that is that the two sequences will be as similar to each other as possible. Early studies of sequence comparisons found optimal alignments manually. Gaps were inserted “by hand” into positions that made biochemical sense and increased the number of good-looking pairs [x]. For an implementation on the computer we consider an optimal alignment to be an arrangement of the two sequences with respect to each other in such a way that the total score of the alignment is as high as possible.

 

Let us examine in more details the problem of optimal alignments and possible arrangements of sequences. There are many ways of aligning a sequence  against a sequence  once gaps are introduced (even if the original order of the amino acids is maintained). Gaps can enter anywhere and in wide range of numbers in the alignment, creating many alternative arrangements. We denote the extended sequences of  and  (with gaps) as  and . For example,  might be . The introduction of the extended sequences raises another intriguing question, what is the maximum length of the sequences  (or )?

 

Of course, the lengths may vary, but they still have upper bounds. Consider two sequences and  of the same length -- . The maximum length of  and  should be . In the alignment of the extended-sequences that are also maximally long, every amino acid of  or  is aligned against a gap --  and . An increase of the length beyond the  limit will necessarily include an alignment of an indel with respect to another indel, i.e. . Such an alignment does not make sense from a scientific point of view. We have no way to determine the number of “double gaps” or their locations. Moreover, also from a technical view point there is a problem if the  pair generates a favorable plausible score. Consider the alignment

Let the total score be . Extending the above alignment by one more pair of indels, we have

The new score is  where  is the element of the substitution matrix replacing an indel by an indel. If the new score is better than , it is trivial to construct even a better alignment (and score) by adding yet another pair of indels with yet a better score of . The favorable extension with indels can proceed to infinite and is unbound. We therefore eliminate in our sequence-to-sequence alignments the possibility of .  Note however, that if we wish to compare more than two sequences simultaneously (multiple sequence alignment), then the possibility of matching a gap against a gap exists. For example, two gaps may appear in the element comparison  when matching three sequences.

 

Ok. It is settled then, the maximum length of  is . How many possible alignments (with gaps) do we have? In a naïve approach, this number may be related to the number of scores we need to compute before deciding on the optimal alignment. Or how many we need to examine before deciding on an optimal alignment?

 

1.2 Counting alignments

To count the number of possible alignments, and as a starting point of the discussion on optimal alignments, we consider the dynamic matrix. The dynamic matrix is a table used for the alignment of two sequences. Below we provide one example for two sequences with the same length (). The rows are associated with the  sequence and the columns with the . The numbers at different matrix entries will be explained below.

 

 

The hairy picture with the numerous arrows is actually telling. Paths in this table, which start at the upper left corner and end at the right lower corner, present all the possible alignments of the whole two sequences. From each entry in the dynamic matrix, there are three alternative moves: following Going down along the diagonal, going straight down, or moving to the right. A step in the matrix, which is a part of a legitimate alignment, never proceeds from to the left to right or up. For example, the thick line in the above matrix corresponds to the alignment:

 

A move along a diagonal aligns an amino acid against another amino acid. A vertical step in the matrix aligns a “” amino acid against an indel and a horizontal step puts a gap against an “” amino acid. Note that we use “” for the gap “residue” (or an indel).

 

The numbers at the different entries of the table denote the number of paths (alignments) that can reach this point. For example, there are 5 possible alignments of  against  (check the table element at the cross between  and ). They are:

 

 

The last three alignments are essentially the sameinclude the same elements for comparison and are their scores are therefore degenerate. It is therefore not possible to decide on a “best” alignment from the group of three. There is more than one optimal alignment. At present, we consider all paths, including the degenerate pathsones. In the Appendix a clever counting protocol (by Dr. Jaroslaw Meller) is outlined that estimates the number of non-degenerate paths. Interestingly, the exact number of non-degenerate paths is . It has the same asymptotic behavior as the approximate lower-bound expression we sketched below for the number of all paths.

 

Another interesting property of this table is a summation rule and the possibility of constructing a recursion formula for the number of alignments. There are three ways (“sources”) to extend a shorter alignment to obtain one of the five longer alignments listed above. (a) Extend an earlier alignment by the pair , a diagonal move in the above table. Alternatively, (b) the pairs , or (c)  (horizontal or perpendicular moves in the above table) can be used to extend a shorter alignment and to obtain a member of the above group.

 

Each of the three “sources”  has its owna corresponding position in the matrix. The number at the entry to the matrix is the with a corresponding number of alignmentspaths aligning a segment of A against a segment of B. For example, before adding the pair  we were at a table entry aligning  against . We find that there is only one-way of aligning  against a gap and “1” is indeed the corresponding entry.

Another example of extending the alignment is to add a gap against , which means that our earlier position in the table was the alignment of  and . The last alignment can be done in three different ways and therefore the table entry is “3”  ().

 

If we add the number of paths starting at the previous three “sources”, and leading to the number of alignments of  with respect to  we obtainsummed up to , exactly the number of alignments of our target.

 

To summarize the above empirical observations more precisely:

The number of possible alignments of  -s against  -s is defined as , this number can be determined using the recursive formula , and the initial conditions . Note that the definition of  was done for computational convenience and it does not imply that “nothing” against “nothing” can be aligned in exactly one way.

While this formula can be used directly, it is useful to have a quick,n order-of-magnitude estimate of the number of alignments. This estimate is especially useful if we are planning a computation that will enumerate all of the alignments. If a calculation is not feasible with existing computer resources it is better knowing that it is not feasible in advance, and not after a few weeks of futile attempts to execute the desired computation.

 

A simple lower bound for  can be obtained quickly.  is a positive number, so we can write  (we “forgot” for convenience the term  in the original equality). So, the alternative recursion formula  (with the same initial conditions as for ) always yields lower numbers than . For  we have a close expression: . This formula is easy to verify by direct substitution of the closed expression to the recursion. Note that this I stheis the same as the (exact) results derived by Meller for non-degenerate paths (Appendix).

 

We now make use of the Stirling formula:   [x], valid for large -s.

The logarithm of the number of alignments  is estimated as

 

 

And the lower bound for the number of alignments is . For a short protein (; ) the number is substantial. and is beyond what we can do today in a systematic search. Even if the computation of a single alignment requires a nanosecond ( second), which is unrealistically fast, it still necessary to use  to examine all possible alignments.

If this is not impressive enough, remember that this is a lower bound and the precise counting of all paths will yield a number significantly larger than this one. For example, for  we have  already a number significantly lower than the exact number in the table (1683). Hence, this estimate underlines the statement claim that it is impossible to examine all alignments one by one in order to find the alignment with the highest score.

Readers with a background in structural biology may recall the Levinthal paradox in protein folding. The paradox puts in contrasts the huge number of plausible protein conformations and the efficiency in which proteins fold in nature. We cannot be sure (until the next section) that there is a solution for the sequence alignment problem. However, nature solves the protein folding problem. So at least the existence of the solution for the protein folding problem is confirmed. This is (again) in contrast to sequence alignment for which an efficient solution exists.However, as As we see below a large space to search does not necessarily mean that optimization in that space is difficult. It is not obvious that the optimization must be performed at a cost proportional to the volume of that space. In fact it can be profoundly cheaper.

 

1.3 Dynamic programming and optimal alignments

After this long detour, it is about the time to return to the basic question: What is the optimal alignment of  and ? The score matrix and the gap penalty are provided and we need is to fish out the alignment(s) with the highest overall score(s). This is a point in which algorithms developed by computer scientists can be extremely useful. The fact that the number of possible alignments is exponentially large in the sequence length does not mean that the search for the optimal alignment needs to be done in exponential time. Sequence alignments can be done with dynamic programming, an algorithm that requires only order of   operations to find the alignment with the best score, a remarkable saving compared to  operations- the number of all possible alignments.

 

The efficient search for the optimal alignments consists of two steps. In the first step a dynamic matrix is constructed (with different entries than what we have seen before) is constructed and in the second step an optimal path is found in the newly constructed table just constructed. The first step is similar to the counting of possible alignments and the recursive expression we wrote use to compute the total number of alignments. We denote the optimal score of the alignment of a sequence length  against a sequence with length  by , and consider the following question:

 

 Assuming that a very kind fellow gave us the optimal scores for the following alignment: , , , can we construct the score ?

 

The answer is yes(!). Since the total score is given by a sum of the scores of the individual aligned pairs, we construct the score  from the three alignments leading to it. We consider three possibilities to obtain an alignment of  against  amino acids.

Option a: Align  against  amino acids (this alignment has the known optimal score of ). Extend it by the alignment of  with a score of  which is the substitution score according to the types of the amino acids at  and  (e.g., the BLOSUM matrix). Hence the first suggestion for an optimal score is .

Option b: Align  amino acids against  amino acids. This alignment also has a known score, which is . Extend this alignment by  with a corresponding score  for a gap. The second possibility for an optimal alignment is therefore .

Option c: Align  amino acids against  amino acids with the known (optimal) score of . Extend the alignment, by adding  with a corresponding score of . The third suggestion is therefore .

 

Note that we use the simple model of gaps in which only a single score is associated with an indel, regardless of the amino acid it is aligned against.

 

Our final task is to select the highest score from one of the three alternatives:

, , , which is the optimal score of .

 

More compactly, we write:

The above recursion can be used to fill with optimal scores the complete dynamic matrix. We start with the condition  and use the initial values to grow the matrix. For example,

Where  denotes alignment of nothing that (not surprisingly) scores zero. Another simple example is of .  Hence, there is only one alignment against “all gaps” arrangement, providing us with immediate optimal scores for the first row and column of the dynamic matrix. It is also clear that similarly to the direct counting of the number of paths we can also construct all the optimal scores.

 

 

 

 

A simple pseudo code to create the dynamic matrix is given below

 

 

/* fill the first (zero) column and the first (zero) row */

T(0,0) = 0

Do I=1:n

            T(I,0) = I*g

End do

 

Do I=1:m

            T(0,I) = I*g

End do

 

/* Now fill the rest of the matrix picking the maximum value

            from the three possibilities */

Do I = 1:n

 Do J = 1:m

  T(I,J) = max[ T(I-1,J)+g,

            T(I,J-1)+g,

            T(I-1,J-1)+S(a(i),b(j))]

 End do

End do

 

Actually, if we are primarily interested in the score of the alignment of  with respect to  only a single element of the dynamic matrix, , is of interest. Of course, in order to get at it we need to compute first the whole matrix. However this score, which is recorded at the lower right side of the dynamic matrix, is only part of the story.

 

Besides the score, in many cases, it is important to know the alignment itself. The path in the dynamic matrix that corresponds to the optimal alignment can be found by a trace-back procedure. Starting from  we ask “which of the three possible steps could have generated the final optimal score? That is, we examine the three possibilities

For at least one of the tests above the equality will hold. It is possible that more than one test is correct and in that case the optimal alignment is degenerate. Hence, there is more than one alignment that provides an optimal score. Usually we consider only one path. For example, if the first test is correct the alignment ends with the pair:  and we repeat the three tests to find the next path segment, this time starting from . The process is repeated until it reaches the upper left corner and provides the desired optimal alignment. The maximum number of times that the process is repeated is  (all the steps are horizontal or vertical) and the smallest number of steps is either  or , the larger number of the two.

 

We are also ready for a rough estimate of the computational effort. As we discussed earlier a lower bound on the number of possible alignments is . Examining all possible alignments will require computational effort that is growing exponentially with . The computations that we just described, using dynamics programming, are much cheaper. Let us consider the two steps separately. In the first step we create the dynamic matrix . To generate a single element we need (about) four operations: Evaluating three expressions, and a decision which of the three is the largest. The calculation of a single element is therefore independent of or , and the cost associated with the computation of the matrix is proportional to  (the number of matrix elements). To trace the path of alignment (once the matrix is known) takes a maximum of  operations.