
Dynamic Programming i

Dynamic Programming: For String and Biological Sequence Alignment with Traceback
Trinidad Estrada
Dr. John Tsiligaridis
Heritage University


Table Of Contents
List Of Figures
Chapter 1 - Introduction
    What is Dynamic Programming?
    Types of Dynamic Programming
    Alignment of Sequences
    Pros and Cons to Dynamic Programming
Chapter 2 - The Construction of Dynamic Programming
Chapter 3 - Traceback
    Priority
Chapter 4 - Results
Chapter 5 - Conclusion
References


List Of Figures
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6


Chapter 1 - Introduction
What is Dynamic Programming?
Dynamic Programming (DP) is a very general programming technique. Here it is used for approximate matching of two sequences: a scoring matrix is built first, and traceback is then used for the alignment. DP can be used in areas such as bioinformatics, where biological sequences are compared for similarities against DNA databases. It can also be used for text mining, where one looks in documents for words that match one another. Speech recognition also uses the DP technique, to look for matches in how one speaks. These are only a few of the many uses of DP.

Three steps are used in Dynamic Programming:
1. Using a recursive relationship and considering the base condition.
2. Tabular computation (matrix fill).
3. Determining the optimal alignment through matrix traceback.

With these steps in mind, Dynamic Programming requires a pre-processing phase in order to get the matrix fill.

Types of Dynamic Programming


There are two types of Dynamic Programming: minimization and maximization. Minimization is the classical form of DP, and it uses two base conditions: D(i,0) = i and D(0,j) = j, where i is the row index and j is the column index. Minimization fills its table from the top to the bottom, while maximization fills its table from the bottom to the top. The Needleman-Wunsch algorithm uses maximization when making a scoring matrix for the alignment of sequences.
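The two base conditions can be seen at work in a small table fill. The sketch below assumes that the minimization variant is the usual edit-distance table with a cost of 1 for each insertion, deletion, or substitution; the function name `edit_distance`, the helper `min3`, and the bound `MAX_LEN` are illustrative and do not come from the program described in this paper.

```c
#include <string.h>

#define MAX_LEN 64  /* assumed upper bound on sequence length */

/* Smallest of three integers. */
static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Fill D from the top row to the bottom row and return the edit
   distance between a and b, or -1 if a string exceeds MAX_LEN. */
int edit_distance(const char *a, const char *b)
{
    static int D[MAX_LEN + 1][MAX_LEN + 1];
    int n = (int)strlen(a);
    int m = (int)strlen(b);

    if (n > MAX_LEN || m > MAX_LEN)
        return -1;

    for (int i = 0; i <= n; i++) D[i][0] = i;   /* base condition D(i,0) = i */
    for (int j = 0; j <= m; j++) D[0][j] = j;   /* base condition D(0,j) = j */

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;  /* assumed unit cost */
            D[i][j] = min3(D[i - 1][j] + 1,             /* deletion */
                           D[i][j - 1] + 1,             /* insertion */
                           D[i - 1][j - 1] + cost);     /* match or substitution */
        }
    }
    return D[n][m];
}
```

Under these assumptions, `edit_distance("kitten", "sitting")` returns 3, and comparing a string with itself returns 0.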

Alignment of Sequences
There are two types of alignment in DP: global and local. Global alignment starts at the beginning of both sequences and adds gaps to each of them until the end of one has been reached. Local alignment finds the region, or regions, that have the most similarities between the sequences, and builds the alignment outward from there. The type of alignment that I used for my program is global alignment. This type of alignment is useful when one wants to force two sequences to align over their entire length.

Traceback is a critical part of finding the optimal alignment of sequences in DP. In minimization, the alignment goes from the bottom-right to the top-left of the scoring matrix. For the Needleman-Wunsch algorithm, the alignment goes from the upper-left to the lower-right corner of the scoring matrix. Because there can be many insertions and mismatches, the largest number in the scoring matrix determines whether the alignment given by traceback is optimal.

Pros and Cons to Dynamic Programming


The disadvantage of Dynamic Programming can be explained with one word: time. Compared with other string matching algorithms, Dynamic Programming is slow and time consuming, because the whole scoring matrix has to be built in a preprocessing stage before the traceback alignment can be made. The matrix holds one cell for every pair of characters, so the work grows with the product of the two sequence lengths. It is still used, however, because the DP algorithm is easier to program than the other algorithms for sequence matching.


Chapter 2 - The Construction of Dynamic Programming


The algorithm I used for the program is the Needleman-Wunsch algorithm. This algorithm begins by making a similarity matrix, which looks much like a dot plot. Figure 1 shows the flowchart of the program I made in the C language. The flowchart explains how the program distinguishes a matching character from a non-matching character while building the similarity matrix. The first piece, which states "load both arrays", refers to the two sequences that will be compared for similarities, which in the program's case are two one-dimensional arrays: the text and the pattern. This piece of the program follows two rules: if a character match is found between the sequences, store a 1; if no match is found, store an underscore (_). The 1s and underscores are stored in a new two-dimensional array, which is used later for the sum matrix. When the program has compared all values of the two arrays, the Similarity Matrix looks like Figure 2 below.
        A Y C Y N R C K C R B P
    A   1 _ _ _ _ _ _ _ _ _ _ _
    B   _ _ _ _ _ _ _ _ _ _ 1 _
    C   _ _ 1 _ _ _ 1 _ 1 _ _ _
    N   _ _ _ _ 1 _ _ _ _ _ _ _
    Y   _ 1 _ 1 _ _ _ _ _ _ _ _
    R   _ _ _ _ _ 1 _ _ _ 1 _ _
    Q   _ _ _ _ _ _ _ _ _ _ _ _
    C   _ _ 1 _ _ _ 1 _ 1 _ _ _
    L   _ _ _ _ _ _ _ _ _ _ _ _
    C   _ _ 1 _ _ _ 1 _ 1 _ _ _
    R   _ _ _ _ _ 1 _ _ _ 1 _ _
    P   _ _ _ _ _ _ _ _ _ _ _ 1
    M   _ _ _ _ _ _ _ _ _ _ _ _

Figure 2
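The two rules behind Figure 2 can be sketched in a few lines of C. This is a minimal illustration, not the original program: the names `build_similarity`, `print_similarity`, `ROWS`, and `COLS` are assumptions, and the sizes are fixed to the example sequences (pattern ABCNYRQCLCRPM down the side, text AYCYNRCKCRBP across the top).

```c
#include <stdio.h>

#define ROWS 13  /* length of the pattern ABCNYRQCLCRPM */
#define COLS 12  /* length of the text AYCYNRCKCRBP */

/* Store '1' where pattern[i] equals text[j], and '_' everywhere else. */
void build_similarity(const char *pattern, const char *text,
                      char sim[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sim[i][j] = (pattern[i] == text[j]) ? '1' : '_';
}

/* Print the matrix with the text across the top and the pattern
   down the left side, in the layout of Figure 2. */
void print_similarity(const char *pattern, const char *text,
                      char sim[ROWS][COLS])
{
    printf("  ");
    for (int j = 0; j < COLS; j++)
        printf(" %c", text[j]);
    printf("\n");
    for (int i = 0; i < ROWS; i++) {
        printf("%c ", pattern[i]);
        for (int j = 0; j < COLS; j++)
            printf(" %c", sim[i][j]);
        printf("\n");
    }
}
```

Calling `build_similarity("ABCNYRQCLCRPM", "AYCYNRCKCRBP", sim)` fills `sim` with 19 matches, the same cells that hold a 1 in Figure 2.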


After the Similarity Matrix has been built, the next step is to compute the Sum Matrix by means of Dynamic Programming. Since I used the Needleman-Wunsch algorithm for the program, the program begins its summing at the bottom of the two-dimensional array and works toward the top. During this computation there is a starting point in the 2-d array. From the starting point, the program moves down one row and scans right through the columns of that row to find the maximum value. After this step has been completed, the program moves right one column from the original starting point and scans down through the rows of that column to look for the maximum value.

The flowchart in Figure 3 shows how the program scans through the columns to find the greatest value in the specified row, with incrementing column values. The name of the two-dimensional array is array B. Notice that the specified row value is +1; that is because the scan moves down one row for this process. Once this piece of the program has finished running, the variable compareRow holds the greatest value found in the scanned columns of array B. The row scan works the same way as the column scan in Figure 3. The variable compareRow is also used for this stage, in case the row scan finds a greater value than the column scan did.

Once both scans have been completed, the program adds the original point of array B, which was kept out of the scan, to the variable compareRow. This process keeps running until there are no more original points to scan in the array. When that is complete, the Sum Matrix looks as shown below in Figure 4.
        A Y C Y N R C K C R B P
    A   8 7 6 6 5 4 3 3 2 2 1 0
    B   7 7 6 6 5 4 3 3 2 1 2 0
    C   6 6 7 6 5 4 4 3 3 1 1 0
    N   6 6 6 5 6 4 3 3 2 1 1 0
    Y   5 6 5 6 5 4 3 3 2 1 1 0
    R   4 4 4 4 4 5 3 3 2 2 1 0
    Q   4 4 4 4 4 4 3 3 2 1 1 0
    C   3 3 4 3 3 3 4 3 3 1 1 0
    L   3 3 3 3 3 3 3 3 2 1 1 0
    C   2 2 3 2 2 2 3 2 3 1 1 0
    R   1 1 1 1 1 2 1 1 1 2 1 0
    P   0 0 0 0 0 0 0 0 0 0 0 1
    M   0 0 0 0 0 0 0 0 0 0 0 0

Figure 4
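The fill that leads to Figure 4 can be sketched as follows. This is one possible reading of the description, not the original code: the function name `build_sum` and the local variable `best` (standing in for compareRow) are assumptions, and both scans are assumed to start at the diagonal neighbor, one row down and one column to the right.

```c
#define ROWS 13  /* length of the pattern ABCNYRQCLCRPM */
#define COLS 12  /* length of the text AYCYNRCKCRBP */

/* Fill sum[][] from the bottom-right corner toward the top-left.
   Each cell is its own match value (1 or 0) plus the greatest value
   found in the next row (scanning right) and the next column
   (scanning down). */
void build_sum(const char *pattern, const char *text, int sum[ROWS][COLS])
{
    for (int i = ROWS - 1; i >= 0; i--) {
        for (int j = COLS - 1; j >= 0; j--) {
            int best = 0;  /* plays the role of compareRow */

            /* Cells in the last row or last column have nothing to scan. */
            if (i + 1 < ROWS && j + 1 < COLS) {
                /* move down one row, scan right through the columns */
                for (int k = j + 1; k < COLS; k++)
                    if (sum[i + 1][k] > best)
                        best = sum[i + 1][k];
                /* move right one column, scan down through the rows */
                for (int k = i + 1; k < ROWS; k++)
                    if (sum[k][j + 1] > best)
                        best = sum[k][j + 1];
            }
            sum[i][j] = (pattern[i] == text[j] ? 1 : 0) + best;
        }
    }
}
```

With the example sequences, this sketch gives 8 in the top-left cell and 7 at the first C/C match, in agreement with Figure 4.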

The values get greater as you move from the bottom of the chart to the top. That is how the Needleman-Wunsch algorithm differs from minimization. The next step is to find the greatest score on this chart, which in this case is 8. That number represents the best match for the sequence alignment. Once the greatest score has been found, the program can move on to the next step, which is using traceback for alignment.


Chapter 3 - Traceback
During traceback, the program makes diagonal moves whenever possible, because those moves give a better alignment. In some cases the alignment cannot move diagonally, and the alternatives are to move right or down. A down move puts a gap in the text, and a right move puts a gap in the pattern. Before making a down or right move, the program has to consider which area has the greatest score and which area has a character match.

Priority
In my program I implemented two functions: one that gives priority to right moves and another that gives priority to down moves. Prioritization takes place when the down and right areas of the scoring matrix have equal scores that are both greater than the diagonal one, and either both or neither of them has a character match. In Figure 5 and Figure 6 below, the underlined areas indicate where the program uses its prioritization. Since the down value 6 and the right value 6 are equal and both greater than the diagonal value 5 (row N, second Y column), and the two areas are alike with respect to character matching, one function must make the right move in that area while the other makes the down move.
        A Y C Y N R C K C R B P
    A   8 7 6 6 5 4 3 3 2 2 1 0
    B   7 7 6 6 5 4 3 3 2 1 2 0
    C   6 6 7 6 5 4 4 3 3 1 1 0
    N   6 6 6 5 6 4 3 3 2 1 1 0
    Y   5 6 5 6 5 4 3 3 2 1 1 0
    R   4 4 4 4 4 5 3 3 2 2 1 0
    Q   4 4 4 4 4 4 3 3 2 1 1 0
    C   3 3 4 3 3 3 4 3 3 1 1 0
    L   3 3 3 3 3 3 3 3 2 1 1 0
    C   2 2 3 2 2 2 3 2 3 1 1 0
    R   1 1 1 1 1 2 1 1 1 2 1 0
    P   0 0 0 0 0 0 0 0 0 0 0 1
    M   0 0 0 0 0 0 0 0 0 0 0 0

        A Y C Y N R C K C R B P
    A   8 7 6 6 5 4 3 3 2 2 1 0
    B   7 7 6 6 5 4 3 3 2 1 2 0
    C   6 6 7 6 5 4 4 3 3 1 1 0
    N   6 6 6 5 6 4 3 3 2 1 1 0
    Y   5 6 5 6 5 4 3 3 2 1 1 0
    R   4 4 4 4 4 5 3 3 2 2 1 0
    Q   4 4 4 4 4 4 3 3 2 1 1 0
    C   3 3 4 3 3 3 4 3 3 1 1 0
    L   3 3 3 3 3 3 3 3 2 1 1 0
    C   2 2 3 2 2 2 3 2 3 1 1 0
    R   1 1 1 1 1 2 1 1 1 2 1 0
    P   0 0 0 0 0 0 0 0 0 0 0 1
    M   0 0 0 0 0 0 0 0 0 0 0 0

Figure 5

Figure 6

Having these two prioritizing functions is critical, because there can be two alignments that are both accurate according to the Needleman-Wunsch algorithm.
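The traceback with a tie-breaking priority can be sketched as below. Several points are assumptions rather than details of the original program: the names `traceback`, `build_sum`, `RIGHT_FIRST`, and `DOWN_FIRST` are illustrative; the walk starts in the top-left cell because that is where the greatest score sits in the example; candidates are compared by score only; and a single function with a flag stands in for the two separate functions. A right move here puts a gap in the pattern and a down move puts a gap in the text, as defined at the start of this chapter.

```c
#define ROWS 13                      /* pattern length */
#define COLS 12                      /* text length */
#define MAX_ALIGN (ROWS + COLS + 1)  /* longest alignment plus '\0' */

enum priority { RIGHT_FIRST, DOWN_FIRST };

/* Sum matrix as described in Chapter 2, filled from the bottom-right. */
static void build_sum(const char *pattern, const char *text,
                      int sum[ROWS][COLS])
{
    for (int i = ROWS - 1; i >= 0; i--)
        for (int j = COLS - 1; j >= 0; j--) {
            int best = 0;
            if (i + 1 < ROWS && j + 1 < COLS) {
                for (int k = j + 1; k < COLS; k++)
                    if (sum[i + 1][k] > best) best = sum[i + 1][k];
                for (int k = i + 1; k < ROWS; k++)
                    if (sum[k][j + 1] > best) best = sum[k][j + 1];
            }
            sum[i][j] = (pattern[i] == text[j] ? 1 : 0) + best;
        }
}

/* Walk from the top-left cell toward the bottom-right and write the
   gapped pattern and text into out_p and out_t (MAX_ALIGN bytes each). */
void traceback(const char *pattern, const char *text, enum priority pr,
               char *out_p, char *out_t)
{
    int sum[ROWS][COLS];
    int i = 0, j = 0, n = 0;

    build_sum(pattern, text, sum);
    for (;;) {
        out_p[n] = pattern[i];            /* align the current pair */
        out_t[n++] = text[j];
        if (i + 1 >= ROWS || j + 1 >= COLS)
            break;

        /* Both scans start at the diagonal cell; the strict '>' keeps
           the diagonal whenever nothing further along scores higher. */
        int rj = j + 1, di = i + 1;
        for (int k = j + 2; k < COLS; k++)
            if (sum[i + 1][k] > sum[i + 1][rj]) rj = k;
        for (int k = i + 2; k < ROWS; k++)
            if (sum[k][j + 1] > sum[di][j + 1]) di = k;

        int right = sum[i + 1][rj], down = sum[di][j + 1];
        int go_right = (right != down) ? (right > down)
                                       : (pr == RIGHT_FIRST);
        if (go_right) {                   /* skipped text: gaps in pattern */
            for (int k = j + 1; k < rj; k++) {
                out_p[n] = '-';
                out_t[n++] = text[k];
            }
            i = i + 1;
            j = rj;
        } else {                          /* skipped pattern: gaps in text */
            for (int k = i + 1; k < di; k++) {
                out_p[n] = pattern[k];
                out_t[n++] = '-';
            }
            i = di;
            j = j + 1;
        }
    }
    /* Pad whatever is left of either sequence with gaps. */
    for (int k = i + 1; k < ROWS; k++) { out_p[n] = pattern[k]; out_t[n++] = '-'; }
    for (int k = j + 1; k < COLS; k++) { out_p[n] = '-'; out_t[n++] = text[k]; }
    out_p[n] = '\0';
    out_t[n] = '\0';
}
```

On the example sequences the two settings differ only at the tied cells and both produce an alignment with eight exact matches. How the flag values map onto the two functions of the original program is an assumption of this sketch.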


Chapter 4 - Results
After the program has finished running, it displays the results. In the case of our example, we receive two results, because the program had to use the priority movement to look for the matching. The results are as follows:
Alignment prioritizing with right moves:
    A B C N Y - R Q C L C R - P M
    A Y C - Y N R - C K C R B P

Alignment prioritizing with bottom moves:
    A B C - N Y R Q C L C R - P M
    A Y C Y N - R - C K C R B P

The two results that the program gave us are both accurate. One is not more accurate than the other. We can check how accurate they are by using the greatest score, which we previously found in the scoring matrix to be 8. We examine both alignments, look for exact matches, and count 1 for each of them. This is done as follows:
Alignment prioritizing with right moves:
    A B C N Y - R Q C L C R - P M
    A Y C - Y N R - C K C R B P
    1   1   1   1   1   1 1   1
    1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 8

Alignment prioritizing with bottom moves:
    A B C - N Y R Q C L C R - P M
    A Y C Y N - R - C K C R B P
    1   1   1   1   1   1 1   1
    1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 8

The yellow highlighting indicates exact matches, that is, positions where both lines hold the same character. If an alignment is a good approximation, its exact matches should add up to the greatest score. In our case both alignments total the greatest score, 8, which is why I state that the two alignments we got from prioritizing are equally good according to the Needleman-Wunsch algorithm.
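The check described above can be written as a short counting function. The name `count_matches` is hypothetical, and the sketch assumes that a gap is written as '-' and that the two aligned lines are passed as strings without spaces.

```c
/* Count the columns where both aligned strings hold the same
   character and that character is not a gap ('-'). Counting stops
   at the end of the shorter string. */
int count_matches(const char *a, const char *b)
{
    int matches = 0;
    for (int k = 0; a[k] != '\0' && b[k] != '\0'; k++)
        if (a[k] == b[k] && a[k] != '-')
            matches++;
    return matches;
}
```

For the two alignments listed in this chapter, `count_matches("ABCNY-RQCLCR-PM", "AYC-YNR-CKCRBP")` and `count_matches("ABC-NYRQCLCR-PM", "AYCYN-R-CKCRBP")` both return 8, the greatest score of the scoring matrix.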


Chapter 5 - Conclusion
Using the Needleman-Wunsch algorithm in the C language worked successfully, especially for the traceback alignment. However, having researched other methods such as PMM and suffix trees, I am convinced that they are more productive and should be closely examined for programming purposes. The Needleman-Wunsch algorithm was limited in the program I made, because the approximation for sequences should be endless; however, it did serve its purpose by finding an approximation.

References
[I] Mark Gerstein, 1999, Yale, bioinfo.mbb.yale.edu