
COMP 460: Design and Analysis of Algorithms SSU
Topic 6: Computational Complexity

Searching

The Searching problem can be specified as follows. Given a list of n distinct keys, L, and a target key, x, determine whether or not x is in L and, if it is, return its location.

We have seen the SequentialSearch, also known as LinearSearch, and we have seen the BinarySearch. The SequentialSearch has a worst-case time complexity of Θ(n) for searching a list of n items. The BinarySearch has a worst-case time complexity of Θ(lg n) for searching a list of n items. However, the BinarySearch has two somewhat limiting constraints.

One constraint on the BinarySearch is that the list in question must be sorted. This is not a seriously limiting constraint, since we have seen that a list can be sorted in Θ(n lg n) time and, if no new elements are added to the list, the sort only needs to be done once.

Another, often overlooked, constraint is that the keys being searched must be in an array. If the list is implemented as a linked list, then the Θ(lg n) complexity cannot be achieved. This is by far the more limiting constraint. If the records are large, or if the list is changed frequently, then the time complexity savings gained in the searching are offset by the inefficient use of space and/or the complexity of inserting into or deleting from the list.

The limitations caused by these constraints are serious enough to warrant looking for alternatives to the binary search. So the logical questions to ask are: Can we do better than Θ(lg n)? If not, how do we handle these limitations?

Determining a Lower Bound on the Complexity of the Searching Problem

We can use basically the same technique to determine the complexity of the Searching problem as we used for the Sorting problem.
1) We can show that every searching algorithm that relies on comparison of keys can be represented by a decision tree.
2) We can then show that the application of the searching algorithm to an instance is equivalent to traversing a path from the root of the tree to a leaf, performing one comparison at each node.
3) Finally, we show that any decision tree that represents a search algorithm has depth Ω(lg n), and therefore any search algorithm that relies on comparison of keys must be Ω(lg n).

See section 8.1 for all the details.
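To make the lower bound concrete, here is a minimal Python sketch of BinarySearch that counts comparisons. The function names, the 0-based indexing, and the convention of counting one three-way comparison per loop iteration are assumptions made for illustration, not part of the notes. Under those assumptions, the worst case over successful searches comes out to floor(lg n) + 1 comparisons, which is consistent with a logarithmic lower bound.

```python
def binary_search(keys, x):
    """Return (index, comparisons); index is None when x is absent.

    `keys` must be a sorted Python list (an array, not a linked list).
    Each loop iteration is counted as one three-way comparison of keys.
    """
    low, high = 0, len(keys) - 1
    comparisons = 0
    while low <= high:
        mid = (low + high) // 2
        comparisons += 1
        if x == keys[mid]:
            return mid, comparisons
        if x < keys[mid]:
            high = mid - 1
        else:
            low = mid + 1
    return None, comparisons


def worst_case_comparisons(n):
    """Largest comparison count over all successful searches of n keys."""
    keys = list(range(n))
    return max(binary_search(keys, x)[1] for x in keys)
```

For n = 1000 this sketch needs at most 10 comparisons for a successful search, whereas SequentialSearch may need up to 1000.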

Assignment: #2; p.358

Interpolation Search

If we know something about the distribution of the values in the list in which we are searching, we can use that to try to improve the performance of the BinarySearch. Instead of simply choosing the middle element of the array as our guess to compare to the key, we can use our knowledge of how the array values are distributed to make a more informed choice for our guess.

For example, if we know or suspect that the values in the array are evenly distributed between the high and the low elements, we can use a linear function to make our initial guess. We assume that the array values lie on a line between the points (low, A[low]) and (high, A[high]). If we let the x-axis represent the array indices and the y-axis represent the array values, then the line would have the following equation.
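\[ y - A[\mathit{low}] = \frac{A[\mathit{high}] - A[\mathit{low}]}{\mathit{high} - \mathit{low}} \, (x - \mathit{low}) \]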

If the values were truly linear, then the searching problem could be solved by simply setting y to the key being searched and solving for x. Thus, we use this value of x as our guess for the index of the key.
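\[ \mathit{mid} := \mathit{low} + \left\lfloor \frac{(\mathit{key} - A[\mathit{low}]) \, (\mathit{high} - \mathit{low})}{A[\mathit{high}] - A[\mathit{low}]} \right\rfloor \]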

Thus we get the InterpolationSearch algorithm simply by replacing one line of the BinarySearch algorithm. (OH)

As would be expected, the more uniformly the values of the array are distributed, the better the performance of the interpolation search. InterpolationSearch has an average time complexity of A(n) ∈ Θ(lg(lg n)). If the values are not very uniformly distributed, then the worst case of InterpolationSearch degenerates to SequentialSearch. E.g., for A = [1,2,3,4,5,6,7,8,9,100], mid is always reduced to low, so each item must be checked.

We can force the split of the array to be larger by forcing the gap between mid and high or low to be at least a certain amount. I.e., after the computation of mid, we can perform the following adjustment.

mid := min(high - gap, max(mid, low + gap))

When the search proceeds to the larger subarray, the value of gap is doubled unless that would exceed half the number of elements. When it proceeds to the smaller subarray, gap is reset to its original value. When this technique is employed, the algorithm is called the RobustInterpolationSearch (OH). The RobustInterpolationSearch has an average time complexity of A(n) ∈ Θ(lg(lg n)) and a worst-case complexity of W(n) ∈ Θ((lg n)^2).
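The basic (non-robust) interpolation step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: integer keys in a sorted Python list, 0-based indices, and a probe counter added for demonstration. The name `interpolation_search` is hypothetical, and the gap adjustment of the robust variant is deliberately left out.

```python
def interpolation_search(a, key):
    """Return (index, probes); index is None when key is absent.

    Assumes `a` is a sorted list of distinct integers (0-based indices).
    The only change from binary search is how `mid` is chosen.
    """
    low, high = 0, len(a) - 1
    probes = 0
    # Stop as soon as key falls outside the value range [a[low], a[high]].
    while low <= high and a[low] <= key <= a[high]:
        if a[high] == a[low]:
            mid = low  # single remaining value; avoids division by zero
        else:
            # Linear interpolation between (low, a[low]) and (high, a[high]).
            mid = low + (key - a[low]) * (high - low) // (a[high] - a[low])
        probes += 1
        if a[mid] == key:
            return mid, probes
        if a[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    return None, probes


uniform = list(range(0, 1000, 10))           # evenly spaced keys
skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]    # the degenerate example
```

With these inputs, `interpolation_search(uniform, 500)` finds the key on the first probe, while `interpolation_search(skewed, 9)` probes indices 0 through 8 one at a time, which is the sequential behaviour the example warns about.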

Search Trees

We have seen that arrays allow efficient searching but inefficient insert and delete. Linked lists allow efficient insert and delete but inefficient searching. Is there a way to get the best of both? Yes! Use a tree data structure.

A binary search tree is a binary tree whose nodes contain key values, with the property that the key of any node is larger than the keys of all the nodes in that node's left subtree and smaller than the keys of all the nodes in its right subtree. E.g. A binary search tree of size 10.
[Figure: binary search tree. Keys recoverable from the figure, in order: 8, 5, 15, 3, 7, 12, 20.]

How would we insert a key? Complexity? lg n.

How would we search for a key? Complexity? lg n.

How would we delete a key?
1) Find the key.
2) Swap it down to a leaf.
3) Delete it.
Complexity: 2 * lg n.

If the tree is balanced, we get good performance for all three operations. If the tree is not balanced, it can degenerate into a linked list.

Is there a way to ensure that the tree is balanced? This was a hot area of research in the 60's and 70's. Many methods were developed. Some methods were for balancing a given tree; others were for maintaining a tree in a balanced state. One of the simplest of the latter type uses a class of data structures called B-Trees.
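The insert and search operations, and the effect of balance on their cost, can be sketched in Python as follows. The class and function names are hypothetical, deletion is omitted, and no rebalancing is attempted; the sketch only shows that insertion order decides whether the tree stays shallow or collapses into a chain.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into the tree rooted at root and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right
        else:
            return root  # keys are distinct; ignore a duplicate


def search(root, key):
    """Return the node holding key, or None if key is absent."""
    node = root
    while node is not None and node.key != key:
        node = node.left if key < node.key else node.right
    return node


def height(root):
    """Number of levels in the tree (0 for an empty tree)."""
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))


def build(keys):
    root = None
    for k in keys:
        root = insert(root, k)
    return root


balanced = build([8, 5, 15, 3, 7, 12, 20])   # 3 levels
chain = build([3, 5, 7, 8, 12, 15, 20])      # sorted input: 7 levels
```

The same seven keys give a tree of height 3 in the first order and height 7 in sorted order, where every node has only a right child and a search behaves like a scan of a linked list.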

2-3 Trees

A 2-3 tree has the following characteristics.
1) A node may have 1 or 2 keys.
   a) If a node has one key, then that key is larger than all the keys of nodes in the left subtree and smaller than all the keys of nodes in the right subtree.
   b) If a node has 2 keys and is not a leaf, then it has 3 children.
      i) The smaller key lies between the keys of the left subtree and the keys of the middle subtree.
      ii) The larger key lies between the keys of the middle subtree and the keys of the right subtree.
2) All leaves are at the same level.

When a key is added to the tree, it comes in at the appropriate leaf. Whenever a node that is not the root contains three keys, the middle key moves up to the parent and the remaining keys split into two children. Whenever the root contains three keys, the middle key becomes a new root node and the other keys become its two children. See example in Fig. 8.7, p. 327.
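The insertion rule above (add at a leaf, push the middle key up whenever a node holds three keys) can be sketched in Python as follows. This is an illustrative sketch, not the textbook's algorithm: the names `Node23`, `insert`, `inorder`, and `leaf_depths` are hypothetical, keys are assumed distinct, and deletion is not covered.

```python
class Node23:
    def __init__(self, keys, children=None):
        self.keys = keys                # 1 or 2 sorted keys (3 only briefly)
        self.children = children or []  # empty list for a leaf


def _insert(node, key):
    """Insert below node; return None, or (middle, left, right) on a split."""
    if key in node.keys:
        return None                     # distinct keys assumed
    if not node.children:               # leaf: the key comes in here
        node.keys.append(key)
        node.keys.sort()
    else:
        i = sum(k < key for k in node.keys)   # which child to descend into
        split = _insert(node.children[i], key)
        if split is None:
            return None
        middle, left, right = split
        node.keys.insert(i, middle)           # middle key moves up
        node.children[i:i + 1] = [left, right]
    if len(node.keys) < 3:
        return None
    # Three keys: the middle goes up, the others become two nodes.
    left = Node23(node.keys[:1], node.children[:2])
    right = Node23(node.keys[2:], node.children[2:])
    return node.keys[1], left, right


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    if root is None:
        return Node23([key])
    split = _insert(root, key)
    if split is None:
        return root
    middle, left, right = split
    return Node23([middle], [left, right])    # the tree grows at the root


def inorder(node):
    """Keys in ascending order."""
    if not node.children:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.children):
        out += inorder(child)
        if i < len(node.keys):
            out.append(node.keys[i])
    return out


def leaf_depths(node, depth=1):
    """Set of levels at which leaves occur (a single value if balanced)."""
    if not node.children:
        return {depth}
    return set().union(*(leaf_depths(c, depth + 1) for c in node.children))


root = None
for k in range(1, 11):      # sorted input, the bad case for a plain BST
    root = insert(root, k)
```

Even with sorted input, all leaves end up on the same level, because the tree only grows in height when the root itself splits.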

Hashing

We showed that searching algorithms that rely on comparison of keys are Ω(lg n). Thus, to do better we need to use some method of processing the keys other than straight comparison. We need a way to assign a position to an item in the list that is not based on comparison. We need a function that, given a key, determines where it should go in the list. One of the very first exercises we looked at used an idea like this for sorting.

What if we have a product database? Each item in the database is uniquely identified by a 3-digit product ID. There are at most 1000 different products. We can just use a 1000-element array that stores an item at the index equal to its product ID. Searching, inserting and deleting can all be done in constant time. The drawback is the potential for much wasted space. If the keys are SSNs, for example, then the array would need to be of size 1,000,000,000.

The idea of hashing breaks the list of keys into m sublists. Each key is assigned to a sublist. The assignment function is called a hash function. The process of determining the sublist for a key is called hashing. The search for "good" hash functions is ongoing. An easy hash function is to "mod" the key by some value. E.g., SSN % 10000 would partition the SSNs by their last four digits.

If two or more keys hash to the same value, it is called a collision. We would like to minimize the probability of collisions.
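If there are m sublists, the probability of k keys hashing to unique sublists is

\[ \frac{P(m, k)}{m^k}, \]

where P(m, k) = m!/(m - k)! is the number of ways to assign k keys to k different sublists.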
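As a quick numerical check of the collision-free probability P(m, k) / m^k, the following Python sketch evaluates it directly. It assumes each key is equally likely to hash to any of the m sublists; the function name `prob_no_collision` is hypothetical.

```python
import math


def prob_no_collision(m, k):
    """Probability that k keys land in k different sublists out of m,
    assuming every sublist is equally likely for every key."""
    if k > m:
        return 0.0          # more keys than sublists: a collision is certain
    return math.perm(m, k) / m ** k
```

For example, with m = 365 sublists and only k = 23 keys, the probability of no collision is already below one half, so collisions must be expected even when the table is mostly empty.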

If there are n keys in the m sublists, how long does a search take? If the keys are uniformly distributed, then each sublist contains n/m keys. Suppose that the sublists are implemented as linked lists. An unsuccessful search must compare the target key to each of the keys in the sublist, and thus an unsuccessful search requires n/m comparisons. A successful search has at most n/m comparisons and on average

\[ \frac{n/m + 1}{2} = \frac{n}{2m} + \frac{1}{2} \]

comparisons.

Of course, the sublists may be implemented in any way. If they are implemented as balanced binary trees, then the search can be done in lg(n/m) time.
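The comparison counts for hashing into m sublists can be checked with a small Python sketch. The assumptions here are illustrative only: integer keys, the "mod" hash function key % m, and Python lists standing in for the linked-list sublists. The class name `ChainedHashTable` is hypothetical.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.sublists = [[] for _ in range(m)]   # one sublist per hash value

    def _hash(self, key):
        return key % self.m                      # the simple "mod" hash function

    def insert(self, key):
        sublist = self.sublists[self._hash(key)]
        if key not in sublist:                   # keys are distinct
            sublist.append(key)

    def search(self, key):
        """Return (found, comparisons) for a scan of the key's sublist."""
        comparisons = 0
        for k in self.sublists[self._hash(key)]:
            comparisons += 1
            if k == key:
                return True, comparisons
        return False, comparisons


table = ChainedHashTable(10)
for key in range(100):          # n = 100 keys, m = 10, so n/m = 10 per sublist
    table.insert(key)

unsuccessful = table.search(105)[1]
average_successful = sum(table.search(k)[1] for k in range(100)) / 100
```

With n = 100 and m = 10, an unsuccessful search scans all n/m = 10 keys of one sublist, and the average successful search takes (n/m + 1)/2 = 5.5 comparisons, matching the formulas above.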