
CHAPTER 1 INTRODUCTION

1.1 GENERAL

The Internet is potentially the world's largest knowledge base, and finding information in this knowledge base merely by surfing is very difficult. Search engines emerged and developed quickly against this background. A search engine is a very important tool for people to obtain information on the Internet, where the quantity of information is increasing exponentially day by day. According to one web survey, the web had crossed 110 million sites by March 2010, and as of November 2012 about 20,340,000,000 web pages had been crawled. With this exponentially growing information size on the Internet, effective utilization of information has become a major focus, and so search engine developers began to pay attention to the quality and relevance of search results. There are three types of search engines, distinguished by their searching techniques. Search terms are represented as arbitrary sequences of characters, and information retrieval is performed through the computation of string similarity. The search procedure used by these search engines is principally based on the syntactic matching of document and query representations; such engines are known to suffer in general from low precision. In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match the search words supplied by the user. Full text searching techniques became common in online bibliographic databases, and most web sites and application programs provide full text search capabilities. Some web search engines, such as AltaVista, employ full text search techniques, while others index only a portion of the web pages examined by their indexing systems. Text search functionality takes a set of keywords and locates them in a document or a database containing words. Search strings provided by users are usually context based. Semantic technologies, by contrast, represent meaning using ontologies and provide reasoning through the relationships, rules, logic, and conditions represented in those ontologies.

1.1.1 Global Scenario

A web search engine is a tool designed to search for information on the World Wide Web. The search results are usually presented in a list and are commonly called hits. A search engine operates in the following order: web crawling, indexing and searching. When a user enters a query into a search engine, the engine examines its index and provides a listing of best-matched web pages according to some criteria, usually with a short summary containing the document's title and sometimes parts of the text.

Web Crawling: Web search engines work by storing information about many web pages, which they retrieve from the WWW. These pages are retrieved by a web crawler, an automated web browser which follows every link it sees. The contents of each page are then analysed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Metadata about web pages is stored in an index database for use in later queries.

Indexing: Some search engines, such as Google, store all or part of the source page as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. The cached page always holds the actual search text, since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that may no longer be available elsewhere.

Searching: When a user enters a query into a search engine, the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results so as to provide the best results first. How a search engine decides which pages are the best matches, and in what order the results should be shown, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve.

1.1.2 Searching Technique

At the front end, the browser connects to the web server when a user submits a search request. The web server finds the matched documents in a large-scale indexed database, lists the indexes of these documents, and returns the result to the user. At the back end of the search engine, the administrator extracts the web pages' index information and saves it, together with the web pages' URLs, in the indexed database. The indexed database must be updated continuously to keep up with the dynamic changes of the Internet. The workflow of the searching mechanism is shown in the diagram below.

Fig. 1.1: Structure of a Search Engine

A typical search engine employs special software robots, called spiders, to build lists of the words found on web sites. When a spider is building its lists, the process is called web crawling. A web crawler is a program which automatically traverses the web by downloading documents and following links from page to page. Crawlers are mainly used by web search engines to gather data for indexing; other possible applications include page validation, structural analysis and visualization, update notification, mirroring and personal web assistants/agents. Web crawlers are also known as spiders, robots or worms. Crawlers are automated programs that follow the links found on web pages. A URL server sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the store server, which compresses and stores the web pages in a repository. Every web page has an associated ID number called a doc ID, which is assigned whenever a new URL is parsed out of a web page. The indexer and the sorter perform the indexing function.

The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. The URL resolver reads the anchors file and converts relative URLs into absolute URLs and in turn into doc IDs. It puts the anchor text into the forward index, associated with the doc ID that the anchor points to. It also generates a database of links, which are pairs of doc IDs; the links database is used to compute PageRanks for all the documents. The sorter takes the barrels, which are sorted by doc ID, and resorts them by word ID to generate the inverted index. This is done in place so that little temporary space is needed for the operation. The sorter also produces a list of word IDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. A lexicon lists all the terms occurring in the index along with some term-level statistics (e.g., the total number of documents in which a term occurs) that are used by the ranking algorithms. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
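To make the forward and inverted index concrete, the following is a minimal Python sketch (an illustration only; the tokenizer and the tiny document set are hypothetical, and a real indexer records far richer hit information such as position, font size and capitalization):

    def build_forward_index(docs):
        # forward index: doc ID -> list of the words occurring in the document
        return {doc_id: text.lower().split() for doc_id, text in docs.items()}

    def invert(forward_index):
        # inverted index: word -> sorted list of (doc ID, position) hits
        inverted = {}
        for doc_id, words in forward_index.items():
            for pos, word in enumerate(words):
                inverted.setdefault(word, []).append((doc_id, pos))
        for word in inverted:
            inverted[word].sort()   # resort the hits, as the sorter does by word ID
        return inverted

    docs = {1: "web search engines index web pages", 2: "crawlers fetch web pages"}
    print(invert(build_forward_index(docs))["web"])   # [(1, 0), (1, 4), (2, 2)]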

The string matching problem is a very important and classical problem in computer science and can be applied to many domains, such as encryption, compression, DNA sequence analysis, image processing and natural language searching. There are many algorithms to solve the string matching problem, such as the MP algorithm, the KMP algorithm, the Boyer-Moore algorithm, the Horspool algorithm, the Quick Search algorithm, the Smith algorithm, the Skip Search algorithm and the Levenshtein algorithm. There are two different problems in string matching, namely the exact string matching problem and the approximate string matching problem. In this thesis, we concentrate on the approximate string matching problem. But, as we shall show later, we shall also use an exact matching algorithm to solve our approximate string matching problem.

1.1.3 String Matching Algorithms

The string matching algorithm is used to find the occurrences of a pattern P of length m in a large text T. Many techniques and algorithms have been designed to solve this problem. These algorithms have many applications, including information retrieval, database query, and DNA and protein sequence analysis; the efficiency of string matching therefore has great impact. Basically, a string matching algorithm uses the pattern to scan the text. It takes a piece of the text equal in length to the pattern and aligns the left ends of the pattern and the text. Then it checks whether the pattern occurs in the text and shifts the pattern to the right. It repeats the same procedure until the right end of the pattern passes the right end of the text. Two types of matching are currently in use. Exact string matching finds all occurrences of the pattern in the text. Approximate string matching finds approximate matches of a pattern in a text, where the closeness of a match is measured. The most obvious approach to the string matching problem is the Brute Force algorithm, which is also called the naive algorithm. This algorithm is simple and easy to implement and works well mostly for binary string search. It compares the pattern P with the text T; if a mismatch occurs, the pattern is shifted one position to the right and checked against the text T again. The algorithm performs too many comparisons over the same text, hence it takes more searching time and is unacceptably slow. The Rabin-Karp string searching algorithm calculates a hash value for the pattern and for each m-character subsequence of the text to be compared. If the hash values are unequal, the algorithm calculates the hash value for the next m-character sequence. If the hash values are equal, the algorithm does a brute force comparison between the m-character sequences. In this way, there is only one comparison per text subsequence, and character matching is only needed when the hash values match. Since there are many different strings, to keep the hash values small we have to assign some strings the same number. This means that if the hash values match, the strings might not match; we have to verify that they do, which can take a long time for long substrings. Knuth, Morris and Pratt discovered the first linear-time string matching algorithm by following a tight analysis of the naive algorithm. The Knuth-Morris-Pratt algorithm keeps the information that the naive approach wastes during the scan of the text. The key quantity turns out to be the length of the longest prefix of fewer than x characters of the pattern that also appears as a suffix of the first x characters of the pattern; if there is no such prefix, then we know that we can skip over x characters. It is extremely important to note that the text itself is not reconsidered: if we have already matched x characters in the text, then we know what that text is in terms of the pattern. In the worst case of the Knuth-Morris-Pratt algorithm we have to examine all the characters in the text and pattern at least once. The Boyer-Moore algorithm scans the characters of the pattern from right to left, beginning with the rightmost character. During the testing of a possible placement of pattern P against text T, a mismatch of text character T[i] = c with the corresponding pattern character P[j] is handled as follows: if c is not contained anywhere in P, then shift the pattern P completely past T[i]; otherwise, shift P until an occurrence of character c in P gets aligned with T[i]. The major drawback is the significant pre-processing time to set up the tables. Some of the exact string matching algorithms used to solve the problem of searching for a pattern in text are Boyer-Moore, Horspool, Zhu-Takaoka, Naive, Naive FLC and Naive FML. Although the Horspool algorithm has a better worst-case running time than Boyer-Moore, the latter is known to be extremely efficient in practice. There have been many papers published that deal with exact pattern matching and introduce variants of the Boyer-Moore algorithm. Many techniques and algorithms have been designed to solve the problem of finding the occurrences of a pattern P of length m in a large text; edit distance is discussed here.
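As an illustration of the Rabin-Karp idea just described, here is a small Python sketch using a polynomial rolling hash; the base and modulus are arbitrary illustrative choices:

    def rabin_karp(text, pattern, base=256, mod=1_000_003):
        n, m = len(text), len(pattern)
        if m == 0 or m > n:
            return []
        high = pow(base, m - 1, mod)          # weight of the window's leading character
        p_hash = t_hash = 0
        for i in range(m):
            p_hash = (p_hash * base + ord(pattern[i])) % mod
            t_hash = (t_hash * base + ord(text[i])) % mod
        matches = []
        for i in range(n - m + 1):
            # character-by-character verification only when the hash values agree
            if t_hash == p_hash and text[i:i + m] == pattern:
                matches.append(i)
            if i < n - m:                     # roll the hash to the next window
                t_hash = ((t_hash - ord(text[i]) * high) * base
                          + ord(text[i + m])) % mod
        return matches

    print(rabin_karp("abacbabac", "bac"))     # [1, 6]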

1.2 MOTIVATION

The motive of a string matching algorithm is to find the occurrences of a pattern of some length in a large text T. Many techniques and algorithms have been designed to solve this problem [2]. These algorithms have many applications, including information retrieval, database query, and DNA and protein sequence analysis; the efficiency of string matching therefore has great impact. Basically, a string matching algorithm uses the pattern to scan the text. It takes a piece of the text equal in length to the pattern, aligns the left ends of the pattern and the text, checks whether the pattern occurs in the text, and shifts the pattern to the right, repeating the same procedure until the right end of the pattern passes the right end of the text. Two types of matching are currently in use. Exact string matching [12] [15] finds all occurrences of the pattern in the text. Approximate string matching finds approximate matches of a pattern in a text, where the closeness of a match is measured.

We first need to discuss a method to measure the difference between two strings. This method is called edit distance. In edit distance, there are three operations: insertion, deletion and substitution.

1.3 Terminology

1.3.1 Insertion: A character of X is missing in Y at the corresponding position. We insert a character at this missing position so that we can transform string Y into string X.

Fig. 1.2-1 Insertion.

1.3.2 Substitution: The symbols at corresponding positions are different. We substitute the character d with c in string Y, so that we can transform string Y into string X.

Fig. 1.2-2 Substitution.

1.3.3 Deletion: A character of Y is missing in X at the corresponding position. We delete the character c from string Y so that string Y is transformed into string X.

Fig. 1.2-3 Deletion.

Example 1.2-1: Given a string A = "fghef" and a string B = "fghk", we want to transform B into A. We substitute the character k with e and then insert f at the end, so ED(A, B) = 2.

Fig. 1.2-4 An Example of Substitution.

Fig. 1.2-5 A Successful Transformation.

We are given a string S = s1 s2 ... sn, where Si,j denotes si si+1 ... sj. The approximate string matching problem is defined as follows: given a pattern string P = p0 p1 ... pm-1, a text string T = t0 t1 ... tn-1 with n >= m, and a maximal number k of errors allowed, find all locations i of T such that there exists a suffix A of T0,i with ED(A, P) <= k. The exact string matching problem is the special case of the approximate string matching problem in which k = 0.
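As an illustration of this definition only (not the algorithm developed later in this thesis), the following Python sketch uses the standard dynamic programming formulation of approximate matching, in which the first row of the table is held at zero so that an occurrence may begin at any position of T:

    def approximate_matches(text, pattern, k):
        # col[j] = minimum ED between pattern[0:j] and some suffix of the text read so far
        m = len(pattern)
        col = list(range(m + 1))
        ends = []
        for i, t in enumerate(text):
            diag = col[0]                     # old col[j-1], starting from old col[0] = 0
            for j in range(1, m + 1):
                old = col[j]
                col[j] = min(old + 1,                         # insertion into the pattern
                             col[j - 1] + 1,                  # deletion from the pattern
                             diag + (pattern[j - 1] != t))    # substitution or match
                diag = old
            if col[m] <= k:
                ends.append(i)                # an approximate occurrence of P ends at i
        return ends

    print(approximate_matches("abacbabac", "bac", 1))   # end positions with at most 1 error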

1.4 Problem Identification

A critical look at the available string matching algorithms indicates shortcomings that need to be addressed while designing a good algorithm, taking running time into consideration.

1.5 Thesis Organization

In Chapter 2 of this thesis, we introduce a brute force algorithm and the Quick Search Algorithm to solve the exact string matching problem. We need to introduce the exact string matching problem because our approximate string matching algorithm uses an exact string matching algorithm as a subroutine. In Chapter 3, we introduce the Edit Distance Algorithm, which rests on an important lemma. Edit distance, also known as Levenshtein distance or evolutionary distance, is a concept from information retrieval that describes the number of edits (insertions, deletions and substitutions) that have to be made in order to change one string into another; it is the most common measure of the dissimilarity between two strings. At every location where the input pattern exactly appears in the input text, we have to verify whether there exists an approximate match. The computation is done according to base-case rules using the cost of deleting, the cost of inserting and the cost of substitution. The most basic problem is to compute the edit distance between two strings: the standard dynamic programming algorithm needs O(nm) time to compute the value of the edit distance between two strings of length n and m respectively. To lower the overall time complexity, our goal is to maximize the number of free substitutions, and to find the longest common subsequence of the two input strings. We call an index pair where the two strings agree a match point. In some sense, match points are the only interesting locations in the table; given a list of the match points, we can easily reconstruct the entire table, so we can compute the longest common subsequence function directly from the list of match points. The overall running time of this algorithm is O(m log m + n log n + K), where K is the number of match points. Finally, the conclusion and future scope are given in Chapter 4.

CHAPTER 2 STRING MATCHING ALGORITHMS

2.1 Background

In this chapter, we introduce two approaches, a brute force algorithm and the Quick Search Algorithm, which solve the exact string matching problem.

2.1.1 Brute Force Approach

A brute force algorithm is a trivial approach to the exact string matching problem. We open a window of size m in text T. We slide the window to position 0 of T and compare the window Ti,i+m-1 (initially i = 0) with pattern P from left to right, as shown in Fig. 2.1-1. If no mismatch occurs, we say that there is a match at position i of T. After the comparison, we slide the window one step to the right, as shown in Fig. 2.1-2. We keep comparing the window Ti,i+m-1 with P until position i is larger than n - m.

Fig. 2.1-1 Brute Force Algorithm Comparing.

Fig. 2.1-2 A Brute Force Algorithm Moving the Window One Step.

Example 2.1-1

We are given a text T = abacbabac and a pattern P = bac. We open a window and slide it to the first location of T. We compare T0,2 with P, and there is no match. We shift the window one step to the right.

Fig. 2.1-3 Example 2.1-1.

We compare T1,3 with P. There is an exact match, and we report that P appears at location 1 in T. We shift the window one step to the right.

Fig. 2.1-4 The First Step of Moving in Example 2.1-1.

We compare T2,4 with P, and there is no match. We shift the window one step to the right.

Fig. 2.1-5 The Second Step of Moving in Example 2.1-1.

We compare T3,5 with P, and there is no match. We shift the window one step to the right.

Fig. 2.1-6 The Third Step of Moving in Example 2.1-1.

We compare T4,6 with P, and there is no match. We shift the window one step to the right.

Fig. 2.1-7 The Fourth Step of Moving in Example 2.1-1.

We compare T5,7 with P, and there is no match. We shift the window one step to the right.

Fig. 2.1-8 The Fifth Step of Moving in Example 2.1-1.

We compare T6,8 with P. There is an exact match, and we report that P appears at location 6 in T. We shift the window one step to the right and find that location i = 7 > n - m = 9 - 3 = 6, so we terminate this approach.

Fig. 2.1-9 A Match.

In the above example, we find that if we use a brute force approach to solve the string matching problem, many characters are compared again and again, because each time a mismatch occurs the pattern is moved only one step. Therefore, the brute force algorithm is not efficient.
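The brute force procedure can be written in a few lines of Python (a direct transcription of the window-sliding description above, for illustration):

    def brute_force_search(text, pattern):
        n, m = len(text), len(pattern)
        matches = []
        for i in range(n - m + 1):            # slide the window one step at a time
            j = 0
            while j < m and text[i + j] == pattern[j]:
                j += 1                        # compare the window with P from left to right
            if j == m:                        # no mismatch: P occurs at location i
                matches.append(i)
        return matches

    print(brute_force_search("abacbabac", "bac"))   # [1, 6]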

2.2 Quick Search Algorithm

The performance of the Quick Search Algorithm is better than that of the Brute Force Algorithm. The Quick Search Algorithm uses a simple rule, called the One Suffix Rule, described below.

2.2.1 One Suffix Rule

For any character u in T, we find the nearest occurrence of u in P to the left of the position aligned with u in T. If such a u exists in P, as shown in Fig. 2.2-1(a), we move P so that the u in P aligns with the u in T, as shown in Fig. 2.2-1(b). If no such u exists in P, we move the whole of P to the right side of the u in T, as shown in Fig. 2.2-2.

Fig. 2.2-1(a) A Nearest u.

Fig. 2.2-1(b) Aligning the Character u.

Fig. 2.2-2 Shifting the Pattern Further.

Many algorithms use this rule, such as the Tuned Boyer-Moore Algorithm, the Raita Algorithm, the Horspool Algorithm and the Smith Algorithm.

2.3 Main Idea

The Quick Search Algorithm is very similar to the Horspool Algorithm, because both of them use the One Suffix Rule, and the Quick Search Algorithm is easy to implement. We open a window of size m in text T. We compare the window Ti,i+m-1 (0 <= i <= n - m) with pattern P from left to right. If a mismatch occurs, we consider the character ti+m, the first character to the right of the window, which we call the critical character, and determine whether it exists in P. If the critical character x exists in P, we shift P so that the rightmost x in P aligns with the critical character, as shown in Fig. 2.2-3. If no such x exists in P, we shift the whole of P to the right side of the critical character, as shown in Fig. 2.2-4.

Fig. 2.2-3 Shifting the Pattern to Align x.

Fig. 2.2-4 Shifting the Pattern Further.

If there is no mismatch between the window and P, we still use the above method to shift P.

2.3.1 Preprocessing Phase of Quick Search Algorithm

In the pre-processing phase of the Quick Search Algorithm, we build a QS table to record the shift information. Let x be a character of the alphabet. If x occurs in P, we find the location i of the rightmost x in P and record QS(x) = m - i. This means that when we compare a window Ti,i+m-1 with P and then move P by QS(x) steps, the rightmost x in P becomes aligned with ti+m. We use the symbol * to denote any character that does not occur in pattern P, and set QS(*) = m + 1.

Example 2.2-1 Given a pattern P = abcadaba. The rightmost a is at position 7 of P; therefore QS(a) = m - i = 8 - 7 = 1. Similarly, QS(b) = 8 - 6 = 2, QS(c) = 8 - 2 = 6, QS(d) = 8 - 4 = 4 and QS(*) = m + 1 = 8 + 1 = 9.

Table 2.2-1 The QS Table of Example 2.2-1.

2.3.2 Searching Phase of Quick Search Algorithm

We open a window of size m. We slide the window to the first location i = 0 of T and compare the window Ti,i+m-1 with P from left to right. If there is no mismatch between the window and P, we report that P appears at location i in T. We then move the window QS(ti+m) steps to the right (i = i + QS(ti+m)), and we keep comparing the window with P until i > n - m.
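Both phases can be sketched in Python as follows; this is an illustrative transcription of the descriptions above, not code taken from an existing implementation:

    def qs_table(pattern):
        m = len(pattern)
        table = {}                    # characters absent from P implicitly get m + 1
        for i, ch in enumerate(pattern):
            table[ch] = m - i         # later (more rightward) occurrences overwrite earlier ones
        return table

    def quick_search(text, pattern):
        n, m = len(text), len(pattern)
        table = qs_table(pattern)
        matches = []
        i = 0
        while i <= n - m:
            if text[i:i + m] == pattern:
                matches.append(i)     # report that P appears at location i
            if i + m >= n:
                break                 # no critical character beyond the last window
            i += table.get(text[i + m], m + 1)   # shift by QS(t[i+m])
        return matches

    print(quick_search("GCGCAGAGAGTAGAGADD", "CAGAGAG"))   # [3]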

Example 2.2-2 Given a text T = GCGCAGAGAGTAGAGADD and a pattern P = CAGAGAG.

Preprocessing phase:

Table 2.2-2 The QS Table of Example 2.2-2.

Searching phase:

We shift the window one position after the first mismatch, then two positions after the next mismatch. The window then matches P exactly, and we shift eight positions; the final window produces a mismatch and the search terminates.

Fig. 2.2-5 Example 2.2-2.

CHAPTER 3 LEVENSHTEIN DISTANCE ALGORITHM AND ANALYSIS

3.1 Edit Distance (Levenshtein Distance) Concept Using Dynamic Programming

Edit distance, also known as Levenshtein distance or evolutionary distance, is a concept from information retrieval and describes the number of edits (insertions, deletions and substitutions) that have to be made in order to change one string into another. It is the most common measure of the dissimilarity between two strings. The edit distance ed(x, y) between strings x = x1 ... xm and y = y1 ... yn is the minimum cost of a sequence of editing steps required to convert x into y. The alphabet of possible characters ch gives ch*, the set of all possible sequences of ch. Edit distance can be calculated using dynamic programming. Dynamic programming is a method of solving a large problem by regarding the problem as the sum of the solutions to its recursively solved subproblems. Dynamic programming is different from plain recursion, however: in order to avoid recalculating the solutions to subproblems, dynamic programming makes use of a technique called memoization, whereby the solutions to subproblems are stored once calculated, to save recalculation. Computing the edit distance between two strings is one of the most fundamental problems in computer science. Assume we have two strings, S1 = [a1, a2, a3, ..., an] and S2 = [b1, b2, b3, ..., bm], and a set of operations {Insert (I), Delete (D), Change (C)}; each of the operations I, D, C can be applied to the characters in the strings at a given position. For example, if S1 = [aaaabcda] and S2 = [aaabcada], applying an operation D to the string S1 at position 8 (the rightmost character) changes it to [aaaabcd], applying an operation I(x) to S2 at position 8 inserts a character x into S2 and changes it to [aaabcadxa], and applying an operation C(b) to S1 at position 4, which has the character a, makes it [aaabbcda] by changing the character from a to b. Note that all the operations need the position specified, and I and C need an additional character for replacement.
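As a small illustration of memoization applied to edit distance (a sketch, not the algorithm analysed later in this chapter), Python's functools.lru_cache stores each subproblem's answer so that it is computed only once:

    from functools import lru_cache

    def ed_memo(x, y):
        @lru_cache(maxsize=None)              # memoization: each (i, j) solved once
        def d(i, j):
            if i == 0:
                return j                      # insert the remaining j characters
            if j == 0:
                return i                      # delete the remaining i characters
            cost = 0 if x[i - 1] == y[j - 1] else 1
            return min(d(i - 1, j - 1) + cost,    # change (free if the characters match)
                       d(i - 1, j) + 1,           # delete from x
                       d(i, j - 1) + 1)           # insert into x
        return d(len(x), len(y))

    print(ed_memo("AACGC", "AGATC"))          # 3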

In this chapter, we have algorithms to solve the exact string matching and approximate string matching problems. The definitions of the problems are as follows. The exact string matching problem: given a text T = t1 t2 t3 ... tn and a pattern P = p1 p2 p3 ... pm of length m, find every i such that ti ti+1 ... ti+m-1 is equal to p1 p2 p3 ... pm.

For example, suppose T =AGCTACATAGAGCTACAGT and P = GCTACA.

T = A G C T A C A T A G A G C T A C A G T
    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

P = G C T A C A, aligned with positions 2 to 7 of T.

Fig 3.1.1 P has a match at location 2 in T

P = G C T A C A, aligned with positions 12 to 17 of T.

Fig 3.1.2 P has a match at location 12 in T

The edit distance between two strings is defined as the minimum number of character insertions, deletions and substitutions needed to make them equal. For example, for x = AACGC and y = AGATC, the edit distance between x and y is three operations (one insertion, one deletion and one substitution), which is denoted ED(x, y) = 3.

3.2 Problem Definition

The edit distance problem asks for a set of operations with minimum cost required to transform string S1 to S2 (or S2 to S1), with each of the operations associated with a cost. In this thesis, we consider a simplified problem, where each of these operations (I, D, C) has a unit cost, and hence minimizing the number of operations is equivalent to minimizing the cost. We consider an example with S1 = [aaaabcda] and S2 = [aaabcada]; we can transform S1 to S2 by a sequence of operations as

follows: change the characters a at position 4, b at position 5, and c at position 6 of S1 to b, c and a, respectively, by the operations C(b), C(c) and C(a). We can describe this series of operations as a transcript T = {ε, ε, ε, C(b), C(c), C(a), ε, ε}, where ε at positions 1, 2, 3, 7 and 8 denotes that the corresponding character is copied unchanged.

As noted above, edit distance is the most common measure of the dissimilarity between two strings [36] (Levenshtein 1966; Navarro & Raffinot 2002), and it can be calculated using dynamic programming [45]. In edit distance, there are three types of differences between two strings X and Y:

3.2.1 Insertion: a symbol of Y is missing in X at a corresponding position, with its cost being 1. X:AT Y: AGT

3.2.2 Substitution: symbols at corresponding positions are distinct, with its cost being 1. X:ACC Y: TCC

3.2.3 Deletion: a symbol of X is missing in Y at a corresponding position, with its cost being 1. X: G C A, Y: G A.

Example 1.

Given two strings X and Y, the edit distance between X and Y is the minimum number of insertions, deletions and substitutions needed to transform Y into X.

String X: ATGAATCTTACCGCCTCG
String Y: ATGAGGCTCTGGCCCCTG

Transformation (from string Y to string X):
String X: A T G A A T C T T A C C G C C T C G
String Y: A T G A G G C T C T G G C C C C T G

EDIT(X, Y)=7 (2 insertions, 2 deletions and 3 changes).

3.3 Dynamic Table

A dynamic programming method can be used to compute the edit distance between two strings X and Y. For example, given X = abcabba and Y = cbabac:

       c  b  a  b  a  c
    0  1  2  3  4  5  6
 a  1  1  2  2  3  4  5
 b  2  2  1  2  2  3  4
 c  3  2  2  2  3  3  3
 a  4  3  3  2  3  3  4
 b  5  4  3  3  2  3  4
 b  6  5  4  4  3  3  4
 a  7  6  5  4  4  3  4

Table 3.3.1 The dynamic table for X = abcabba and Y = cbabac; EDIT(X, Y) = 4, realized, for example, by the alignment abcabba- / cb-a-bac (one substitution, two deletions and one insertion).

Table 3.3.2 The same dynamic table annotated with the move (match, substitution, insertion or deletion) that produces each cell's value.

Example 2.

Given strings s and t, the distance is the shortest sequence of edit commands that transforms s into t. The simple set of operations is: copy a character from s over to t (cost 0); delete a character in s (cost 1); insert a character in t (cost 1); substitute one character for another (cost 1). This is Levenshtein distance.

Table for edit distance

Cij is the number of edit operations needed to align the first i characters of s with the first j characters of t; for example, one cell records the number of edit operations needed to align PA to SPA.

3.3.1 Dynamic Program Table for String Edit

[Table: the dynamic program table with rows S, P, A, K, E and columns P, A, R, K; cell C11 is reached from C00 by a substitution, C12 from C11 by a deletion, and C21 from C11 by an insertion.]

D(i, j) = score of the best alignment of s1..si with t1..tj:

D(i, j) = min of
    D(i-1, j-1)         if si = tj    // copy
    D(i-1, j-1) + 1     if si != tj   // substitute
    D(i-1, j) + 1                     // insert
    D(i, j-1) + 1                     // delete

3.3.2 Computing Levenshtein distance

D(i, j) = score of the best alignment of s1..si with t1..tj
        = min of
    D(i-1, j-1) + d(si, tj)    // substitute/copy, where d(si, tj) = 0 if si = tj and 1 otherwise
    D(i-1, j) + 1              // insert
    D(i, j-1) + 1              // delete

The dynamic table is filled in row by row; the first row and the first column are initialized to 0, 1, 2, ..., the cost of aligning a prefix with the empty string (here the strings are s = SPAKE and t = PARK).

3.3.3 Filling the Dynamic Program Table

[Table: the filled dynamic program table for s = SPAKE and t = PARK; the bottom-right cell holds the final cost of aligning all of both strings, which is 3.]

The edit distance problem asks us to find the minimum number of operations which can transform S1 to S2. The general edit distance problem can have different costs for each of the operations; all our algorithms can be directly extended to the general model, although we work with the unit cost model.

The standard dynamic programming solution for computing the edit distance between a pair of strings A = a1 a2 ... aN and B = b1 b2 ... bN involves filling in an (N+1) x (N+1) table T, with T[i, j] storing the edit distance between a1 a2 ... ai and b1 b2 ... bj. The computation is done according to the base-case rules T[0, 0] = 0, T[i, 0] = T[i-1, 0] + the cost of deleting ai, and T[0, j] = T[0, j-1] + the cost of inserting bj, and according to the following dynamic programming step:

T[i, j] = min of
    T[i-1, j] + the cost of deleting ai
    T[i, j-1] + the cost of inserting bj
    T[i-1, j-1] + the cost of replacing ai with bj (zero if ai = bj)

3.4 Edit Distance (Levenshtein) Algorithm

Dynamic Programming is a general algorithmic framework which can be applied to solve optimization problems that have independent subproblems. The edit distance problem is widely used to illustrate the ideas behind dynamic programming. Any algorithm based on dynamic programming needs to define a subproblem with variables capturing all the details of the optimization problem. The following dynamic programming formulation is the key to the edit distance algorithm.

S1 = [a1, a2, a3, ..., an],  S1,i = [a1, a2, ..., ai]
S2 = [b1, b2, b3, ..., bm],  S2,j = [b1, b2, ..., bj]

S1,i and S2,j are prefixes of the strings S1 and S2, and D(i, j) is the optimal cost of transforming S1,i into S2,j.

Initialization of the dynamic programming:

D(i, 0) = i    (cost of aligning [a1, a2, ..., ai] with the empty string)
D(0, j) = j    (cost of aligning [b1, b2, ..., bj] with the empty string)

For 1 <= i <= n and 1 <= j <= m, D(i, j) can be computed as follows:

D(i, j) = D(i-1, j-1)   if ai = bj; otherwise
D(i, j) = min of
    D(i-1, j-1) + chg   (change ai to bj)
    D(i-1, j) + del     (delete ai and align [a1, ..., ai-1] with [b1, ..., bj])
    D(i, j-1) + ins     (insert bj and align [a1, ..., ai] with [b1, ..., bj-1])

We can think of the subproblem D(i, j) as a cell in an n x m table Dn,m: D(i, j) is the edit distance from the first i characters of S1 to the first j characters of S2, and D(n, m) is the minimum number of operations to transform S1 to S2. The first step is initialization, giving the edit distance between a NULL string and a prefix of S2 (D(0, j) = j), and also between a prefix of S1 and a NULL string (D(i, 0) = i). The computation of the other table elements is performed by two loops which output the value of D(i, j) (1 <= i <= n, 1 <= j <= m) for the two substrings S1,i and S2,j. D(i, j) is the minimum of three costs: D(i-1, j-1) + change cost, for changing the ith character of S1,i when the first i-1 characters of S1,i have been successfully transformed to the first j-1 characters of S2,j; D(i, j-1) + insert cost, for inserting the last character when the first i characters of S1,i have been transformed to the first j-1 characters of S2,j; and D(i-1, j) + delete cost, for deleting the ith character of S1,i when the first i-1 characters of S1,i have been transformed to the first j characters of S2,j. Note that the insert cost and delete cost are both constant unit costs, and the change cost is conditionally determined by the last characters of the two substrings: only when they are different is a change operation needed.

3.4.1 Algorithm

Pseudocode for the algorithm EditDistance:

INPUT: Strings S1 of length n and S2 of length m
OUTPUT: Minimum number of operations to transform S1 to S2

/* Initialization */
for i = 0 to n do D(i, 0) = i
for j = 0 to m do D(0, j) = j
/* Recursive computation of the distance table D */
for i = 1 to n do
    for j = 1 to m do
        change_cost = 0
        if S1[i] != S2[j] then change_cost = 1
        D(i, j) = MIN(D(i-1, j-1) + change_cost, D(i, j-1) + 1, D(i-1, j) + 1)
return D(n, m)
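For concreteness, the pseudocode translates directly into the following Python sketch (illustrative, using 0-based string indexing):

    def edit_distance(s1, s2):
        n, m = len(s1), len(s2)
        # (n+1) x (m+1) distance table; D[i][j] = edit distance between s1[:i] and s2[:j]
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            D[i][0] = i                       # delete all of s1[:i]
        for j in range(m + 1):
            D[0][j] = j                       # insert all of s2[:j]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                change_cost = 0 if s1[i - 1] == s2[j - 1] else 1
                D[i][j] = min(D[i - 1][j - 1] + change_cost,   # change
                              D[i][j - 1] + 1,                 # insert
                              D[i - 1][j] + 1)                 # delete
        return D[n][m]

    print(edit_distance("abcabba", "cbabac"))   # 4, matching Table 3.3.1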

3.4.2 Analysis of the standard algorithm for edit distance

We can clearly see from Algorithm 1 that this standard dynamic programming algorithm needs O(nm) time to compute the value of the edit distance between two strings of length n and m respectively.

The time complexity of the algorithm above is O(nm). Compression is traditionally used to store data efficiently, but it can also be used to accelerate the dynamic programming solution for the edit distance problem described above. The basic idea is to first compress the strings, and then compute the edit distance between the compressed strings. The "acceleration via compression" approach has also been successfully applied to other classical problems on strings: various compression schemes, such as Huffman coding, Byte-Pair Encoding (BPE) and Run-Length Encoding (RLE), have been employed to accelerate exact string matching, subsequence matching, approximate pattern matching, and more. Regarding edit distance computation, Bunke and Csirik presented a simple algorithm for computing the edit distance of strings that compress well under RLE. This algorithm was later improved in a sequence of papers to an algorithm running in time O(nN), for strings of total length N that encode into run-length strings of total length n. An algorithm with the same time complexity was later given for strings compressed under LZ-style schemes, where n again is the length of the compressed strings; that algorithm is also O(N²/ lg N) in the worst case for any strings over constant-size alphabets. The first paper to break the quadratic time barrier of edit distance computation was the seminal paper of Masek and Paterson, who applied the "Four Russians technique" to obtain a running time of O(N²/ lg N) for any pair of strings, and of O(N²/ lg² N) assuming a unit-cost RAM model. Their algorithm essentially exploits repetitions in the strings to obtain the speed-up, and so in many ways it can also be viewed as compression based; in fact, one can say that their algorithm works on the "naive compression" that all strings over constant-sized alphabets have. A drawback of the Masek and Paterson algorithm is that it can only be applied when the given scoring function is rational, that is, when all costs of editing operations are rational numbers. This restriction is indeed a limitation in biological applications, where PAM and evolutionary-distance similarity matrices are used for scoring; for this reason, the algorithm mentioned above was designed specifically to work for arbitrary scoring functions. We mention also Bille and Farach-Colton, who extend the Masek and Paterson algorithm to general alphabets. There are two important things to observe from the above. First, all known techniques for improving on the O(nm) time bound of edit distance computation essentially apply acceleration via compression. Second, apart from RLE, LZ-style schemes and the naive compression of the Four Russians technique, we do not know how to efficiently compute edit distance under other compression schemes, even though various types of strings compress much better under them; an algorithm that substantially improves on O(N²) for such strings would be interesting. This motivates, both practically and theoretically, the search for substantial improvements on the quadratic bound of string edit distance computation.

3.4.3 Analysis of time complexities of Different Algorithms

In many contexts we need the actual edit script (the sequence of operations which transforms one string into the other) rather than just the edit distance value. A straightforward algorithm to get the edit script keeps the entire table Dn,n in memory and traces back the path from D(n, n) to D(0, 0), which gives the required edit script; since the entire table is kept in memory, the space complexity of the algorithm is O(n²). As demonstrated above, the standard dynamic programming based algorithm takes O(n²) time to compute the edit distance of two strings of length n, and O(n²) space to compute the actual edit script (i.e., a sequence of inserts, deletes and changes that transforms S1 to S2). For several problems (such as sequence alignment) the edit script is often more important than the value of the edit distance. The first major improvement in the asymptotic runtime for computing the value of the edit distance is widely known as the Four Russians Algorithm [30] [33]; it improves the running time by a factor of O(log n) (giving a run time of O(n²/ log n)) to compute just the value of the edit distance, but it does not address the problem of computing the actual edit script, which is of wider interest than just the value. Hirschberg [45] has given an algorithm that computes the actual script in O(n²) time using only linear space. Linear-space parallel algorithms for the sequence alignment problem have also been given, although they assume that O(n²) is the optimal asymptotic complexity of the sequential algorithm. In this chapter we present algorithms that compute both the edit script and the value in O(n²/ log n) time using O(n) space.
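As a sketch of the space observation above (an illustration, not the thesis's algorithm), the value of the edit distance can be computed while keeping only two rows of the table; recovering the actual edit script in linear space additionally requires Hirschberg's divide-and-conquer refinement:

    def edit_distance_linear_space(s1, s2):
        # keep only the previous and current rows of the DP table: O(m) space
        n, m = len(s1), len(s2)
        prev = list(range(m + 1))             # row 0: 0, 1, ..., m
        for i in range(1, n + 1):
            curr = [i] + [0] * m              # D(i, 0) = i
            for j in range(1, m + 1):
                change_cost = 0 if s1[i - 1] == s2[j - 1] else 1
                curr[j] = min(prev[j - 1] + change_cost,   # change
                              curr[j - 1] + 1,             # insert
                              prev[j] + 1)                 # delete
            prev = curr
        return prev[m]

    print(edit_distance_linear_space("abcabba", "cbabac"))   # 4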

3.5 Modification for Lowering Running Time

In many applications of dynamic programming, we are faced with instances where almost every recursive sub problem will be resolved exactly the same way. We call such instances sparse. For example, we might want to compute the edit distance between two strings that have few characters in common, which means there are few free substitutions anywhere in the table. Most of the table has exactly the same structure. If we can reconstruct the entire table from just a few key entries, then why compute the entire table?

To better illustrate how to exploit sparseness, let us consider a simplification of the edit distance problem in which substitutions are not allowed. Now our goal is to maximize the number of free substitutions, or equivalently, to find the longest common subsequence of the two input strings.

3.5.1 Computing the Common Sequence

Fix the two input strings A[1..m] and B[1..n]. For any indices i and j, let CS(i, j) denote the length of the longest common subsequence of the prefixes A[1..i] and B[1..j]. This function can be defined recursively as follows:

CS(i, j) = 0                                if i = 0 or j = 0
CS(i, j) = CS(i-1, j-1) + 1                 if A[i] = B[j]
CS(i, j) = max( CS(i, j-1), CS(i-1, j) )    otherwise

This recursive definition directly translates into an O(mn)-time dynamic programming algorithm, sketched below. Call an index pair (i, j) a match point if A[i] = B[j]. In some sense, match points are the only interesting locations in the memoization table; given a list of the match points, we could easily reconstruct the entire table:
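This recursive definition translates into the following O(mn) Python sketch (illustrative):

    def lcs_length(A, B):
        m, n = len(A), len(B)
        CS = [[0] * (n + 1) for _ in range(m + 1)]   # CS[i][j] = LCS of A[:i] and B[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if A[i - 1] == B[j - 1]:
                    CS[i][j] = CS[i - 1][j - 1] + 1  # a match point extends the subsequence
                else:
                    CS[i][j] = max(CS[i][j - 1], CS[i - 1][j])
        return CS[m][n]

    print(lcs_length("ALGORITHMS", "ALTRUISTIC"))    # 5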

3.5.2 Common Sequence Memoization Table

[Table: the CS memoization table, with entries growing from 0 along the boundary up to 5, the length of the longest common subsequence.]

Fig 3.5.1 The CS memoization table for ALGORITHMS and ALTRUISTIC; the brackets << and >> are sentinel characters.

3.5.3 Common Sequence Recursion Function

More importantly, we can compute the CS function directly from the list of match points using the following recurrence:

CS(i, j) = 0                                                            if i = j = 0
CS(i, j) = 1 + max{ CS(i', j') : A[i'] = B[j'], i' < i and j' < j }     if A[i] = B[j]
CS(i, j) = max{ CS(i', j') : A[i'] = B[j'], i' <= i and j' <= j }       otherwise

Note that the inequalities are strict in the second case, but not in the third. To simplify boundary issues, we add unique sentinel characters A[0] = B[0] and A[m+1] = B[n+1] to both strings. This ensures that the sets on the right side of the recurrence equation are non-empty, and that we only have to consider match points; the length of the longest common subsequence is then CS(m+1, n+1) - 1. If there are K match points, we can find them all in O(m log m + n log n + K) time: sort the characters in each input string, remembering the original index of each character, and then essentially merge the two sorted arrays. We consider the match points in lexicographic order, the order in which they would be encountered in a standard row-major traversal of the m x n table, so that when we need to evaluate CS(i, j), all match points (i', j') with i' < i and j' < j have already been evaluated.

3.5.4 Modified Algorithm

CS(A[1..m], B[1..n]):
    Match[1..K] <- FINDMATCHES(A, B)
    Match[K+1] <- (m+1, n+1)                // add the end sentinel
    sort Match lexicographically
    for k <- 1 to K+1
        (i, j) <- Match[k]
        CS[k] <- 1                          // chain from the start sentinel
        for l <- 1 to k-1
            (i', j') <- Match[l]
            if i' < i and j' < j
                CS[k] <- max(CS[k], 1 + CS[l])
    return CS[K+1] - 1

Finding and sorting the match points takes O(m log m + n log n + K) time, after which the computation touches only the K match points instead of all mn table entries; therefore, the overall running time of this algorithm is O(m log m + n log n + K). (As written, the double loop over match points can cost O(K²) in the worst case, so the stated bound presumes a more careful evaluation of the inner maximization; for sparse instances K is small and either variant is far faster than filling the full O(mn) table.)
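A direct Python sketch of this procedure (illustrative; find_matches stands in for the FINDMATCHES routine, and the double loop mirrors the pseudocode above, so it costs O(K²) in the worst case):

    def find_matches(A, B):
        # all index pairs (i, j) with A[i] == B[j], found by bucketing B's positions
        positions = {}
        for j, ch in enumerate(B):
            positions.setdefault(ch, []).append(j)
        return [(i, j) for i, ch in enumerate(A) for j in positions.get(ch, [])]

    def sparse_lcs(A, B):
        m, n = len(A), len(B)
        match = sorted(find_matches(A, B))
        match.append((m, n))                  # end sentinel, past both strings
        CS = [1] * len(match)                 # 1 = a chain starting at the start sentinel
        for k, (i, j) in enumerate(match):
            for l in range(k):
                i2, j2 = match[l]
                if i2 < i and j2 < j:         # strictly dominated match points chain together
                    CS[k] = max(CS[k], 1 + CS[l])
        return CS[-1] - 1                     # subtract the end sentinel itself

    print(sparse_lcs("ALGORITHMS", "ALTRUISTIC"))   # 5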

3.6 Applications of the edit distance based algorithms

Algorithms based on edit distance have several applications. A practical example of edit distance in our day-to-day work is the UNIX diff utility, which takes two text files and tells us how to convert one file into the other with a minimum number of edit operations (inserting a line, deleting a line, changing a line). Edit distance based algorithms are also used extensively in bioinformatics and computational genomics. The first application of the edit distance algorithm to protein sequence alignment was studied by Needleman and Wunsch [30]. Sequence alignment is used extensively by biologists to identify similarities between genes of different species, with genes being characterized by a DNA sequence, e.g., Sdna = [c1, c2, c3, ...], ci in {A, T, G, C}, where A, T, G and C are symbols for the nucleotide bases (also called base pairs). Biologists often analyse the functionality of newly discovered genes by comparing them to genes which were already discovered and whose function is fully known. Given two DNA sequences, Snewdna and Sknowndna, biologists perform a sequence alignment (edit distance computation) between the two sequences to find whether they have the same functionality. If the edit distance value between two genes is below a threshold, the two DNA sequences have some properties in common; otherwise they differ. These DNA sequences are very long, typically running into millions of base pairs. The ever increasing volume of genomic sequences demands highly efficient algorithms to help biologists perform faster sequence alignments. The first algorithm to perform sequence alignment was given by Needleman and Wunsch [30], and it is directly based on edit distance computation. Later, algorithms for several variations of the problem (such as local alignment and affine gap costs) were developed.

CHAPTER 4

CONCLUSION AND FUTURE WORK


The idea of this thesis is based upon computing the edit distance between two strings, which is one of the most fundamental concepts in computer science. Edit distance describes the number of edits (insertions, deletions and substitutions) that have to be made in order to change one string into another, and it is the most common measure of the dissimilarity between two strings. At every location where the input pattern exactly appears in the input text, we have to verify whether there exists an approximate match. The computation is done according to base-case rules using the cost of deleting, the cost of inserting and the cost of substitution. The most basic problem is to compute the edit distance between two strings: the standard dynamic programming algorithm needs O(nm) time to compute the value of the edit distance between two strings of length n and m respectively. To lower this time complexity, a simplification of the edit distance problem is considered in which substitutions are not allowed. The number of free substitutions is maximized and the longest common subsequence of the two input strings is computed. Match points are the interesting locations in the table, from which the entire table can easily be reconstructed; hence the CS function is computed directly from the list of match points. Therefore the overall running time of this algorithm is reduced from O(mn) to O(m log m + n log n + K), where K is the number of match points and m and n are the lengths of the two input strings.
