
HASH TABLES

OBJECTIVES
In this chapter, you will learn:
- The concept of hashing
- Hashing terminology
- Hashing functions
- Collision resolution strategies
  o Chaining
  o Open addressing
- Table overflow
  o Expansion and extendible hashing

CONCEPT OF HASHING
In all the methods seen so far, the search for a record is carried out via a sequence of comparisons. The organization of the file (sequential, indexed, binary tree, etc.) and the order in which the values are inserted affect the number of comparisons. A table organization that lets us retrieve a key in a single access would be far more efficient. To achieve this, the position of a key in the table should not depend upon the other keys; instead, the location should be calculated from the key itself. Such an organization and search technique is called hashing.

Consider a simple example. If you wanted a particular book at home and did not know where you had kept it, you would have to search all the places where the book could be. If, however, there is one fixed place where you always keep your books, you would only have to look there. This is exactly what is done in hashing. To search for a specific element, we look at just one computed position; the element will be found there if it was kept there in the first place. Likewise, to store an element, we first calculate its location and place it there, so that it can later be retrieved from that position.

In hashing, the address or location of an identifier X is obtained by using some function f(X), which gives the address of X in a table.

HASHING TERMINOLOGY


Hash Function: A function that transforms a key X into a table index is called a hash function.

Hash Address: The address of X computed by the hash function is called the hash address or home address of X.

Hash Table: The hash table is the sequential memory used to store identifiers.

Bucket: Each hash table is partitioned into b buckets ht[0] to ht[b-1]. Each bucket is capable of holding s records; thus, a bucket consists of s slots. When s = 1, each bucket can hold one record. The function f(X) maps an identifier X into one of the b buckets, i.e. from 0 to b-1.

Synonyms: Usually, the total number of possible values for identifier X is much larger than the hash table size, so the hash function f(X) must map several identifiers into the same bucket. Two identifiers I1 and I2 are synonyms if f(I1) = f(I2).

Collision: When two non-identical identifiers are mapped into the same bucket, a collision is said to occur, i.e. f(I1) = f(I2). Hence all synonyms occupy the same bucket.

Overflow: An overflow is said to occur when an identifier gets mapped onto a full bucket. When s = 1, i.e. a bucket contains only one record, collision and overflow occur simultaneously. Such a situation is called a hash clash.

Load Factor: If n is the total number of identifiers in the table and t is the table size, the load factor is lf = n/t. It represents the fraction of the table that is occupied.

Perfect Hash Function: Given a set of keys k1, k2, ..., kn, a perfect hash function f is such that f(ki) != f(kj) for all distinct i and j.

Example: Consider a hash table with b = 26 buckets (numbered 1 to 26), each having 2 slots, i.e. s = 2. Assume that there are 10 distinct identifiers, each beginning with a letter. Let the hash function be f(X) = the number corresponding to the first letter of the identifier, i.e. if X = A, then f(X) = 1. If the identifiers are GA, D, A, G, L, A2, A1, A3, A4, E, their hash addresses will be 7, 4, 1, 7, 12, 1, 1, 1, 1, 5 respectively. GA and G are synonyms; A, A2, A1, A3 and A4 are synonyms. When G gets hashed into bucket 7, which already contains GA, a collision occurs. When A1 gets hashed into bucket 1, which already contains A and A2, an overflow occurs.
Bucket  Slot 1  Slot 2
1       A       A2
4       D
5       E
7       GA      G
12      L
...
26
(all other buckets are empty)
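The first-letter hash function used in this example can be sketched in C++ as follows (the function name is ours, and upper-case identifiers are assumed):

```cpp
#include <string>

// Hash function from the example above: the bucket number is the
// alphabetic position of the identifier's first letter (A -> 1 ... Z -> 26).
int first_letter_hash(const std::string& id) {
    return id[0] - 'A' + 1;
}
```

For instance, first_letter_hash("GA") gives 7 and first_letter_hash("L") gives 12, matching the addresses listed above.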

HASH FUNCTION
A hashing function f transforms an identifier X into a bucket address in the hash table. The efficiency of the entire hashing technique depends on the hash function, so it is extremely important. If the hashing function gives the same address for several identifiers, there will be many collisions leading to overflow. It should therefore give distinct addresses for non-identical identifiers as far as possible. Before we study some commonly used hashing functions, we will look at the desirable characteristics of a good hashing function.
Desirable Characteristics of a Hashing Function

i. It should be easily computable.
ii. It should minimize the number of collisions.
iii. It should compute an address that depends on all or most of the characters in the identifier.
iv. It should yield uniform bucket addresses for random inputs. Such a function is called a uniform hash function.

Several uniform hash functions are in use. Some of them are:

1. Mid-Square: This is a very widely used function in symbol table applications. It is computed by squaring the identifier and using an appropriate number of bits from the middle of the square to obtain the bucket address. Since the middle bits of the square depend on all the bits in the identifier, different identifiers are likely to produce different hash addresses, minimizing collisions.

Example: We will choose the middle two digits (size of the hash table = 100).
If X = 225, X² = 050625, hash address = 06.


If X = 3205, X² = 10272025, hash address = 72.
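A mid-square sketch in C++, using the middle two decimal digits of the square (the function name and the decimal-digit formulation are ours; implementations often use middle bits instead):

```cpp
#include <string>

// Mid-square hash: square the key and take the middle two decimal
// digits, giving a bucket address in 0..99 for a table of size 100.
int mid_square(long key) {
    std::string digits = std::to_string(key * key);
    if (digits.size() % 2 != 0)
        digits = "0" + digits;   // pad so the "middle" is well defined
    std::size_t mid = digits.size() / 2;
    return std::stoi(digits.substr(mid - 1, 2));
}
```

With the values from the example, mid_square(225) yields 6 and mid_square(3205) yields 72.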

2. Division: The modulus operator (modulo division) can be used as a hash function. The identifier X is divided by some number M and the remainder is used as the hash address of X:

f(X) = X mod M

This gives addresses in the range 0 to M-1, so the table should be of size M. The choice of M is critical. The following should not be used for M:
i. M a power of 2,
ii. M divisible by 2.
The best choice for M is a prime number. In practical applications, M is chosen such that it has no prime divisors less than 20.

Example: X = 134, M = 31. f(X) = X mod M = 134 mod 31 = 10.

3. Folding: In this method, the identifier X is partitioned into several parts, all of the same length except possibly the last. These parts are added to obtain the hash address. Addition is done in one of two ways:
i. Shift folding: All parts except the last are shifted so that their least significant digits correspond to each other, and then added.
ii. Folding at the boundaries: The identifier is folded at the part boundaries (reversing alternate parts) and the digits falling together are added.

Example: X is split into parts P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20.

Shift folding:           123 + 203 + 241 + 112 + 20 = 699
Folding at boundaries:   123 + 302 + 241 + 211 + 20 = 897

In folding at the boundaries, the alternate parts P2 and P4 are reversed (203 becomes 302, and 112 becomes 211) before the addition.
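The division and folding methods can be sketched in C++ as follows (function names are ours; the parts of the identifier are supplied directly, and boundary folding is modeled by reversing every alternate part, which reproduces the worked example):

```cpp
#include <vector>

// Division method: f(X) = X mod M, for a (preferably prime) table size M.
int division_hash(int key, int M) {
    return key % M;                         // address in 0..M-1
}

// Reverse the decimal digits of a number (used for boundary folding).
int reverse_digits(int n) {
    int r = 0;
    while (n > 0) { r = r * 10 + n % 10; n /= 10; }
    return r;
}

// Shift folding: right-align all parts and add them.
int shift_fold(const std::vector<int>& parts) {
    int sum = 0;
    for (int p : parts) sum += p;
    return sum;
}

// Folding at the boundaries: reverse every alternate part, then add.
int boundary_fold(const std::vector<int>& parts) {
    int sum = 0;
    for (std::size_t i = 0; i < parts.size(); ++i)
        sum += (i % 2 == 1) ? reverse_digits(parts[i]) : parts[i];
    return sum;
}
```

With the parts 123, 203, 241, 112, 20 these give 699 (shift folding) and 897 (boundary folding), as in the example.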

4. Digit Analysis: This method is used for static tables where the identifiers are known in advance. Each identifier X is interpreted as a number using some radix r; the same radix is used for all identifiers. The distribution of values in each digit position is examined, the digit positions with the most skewed distributions are deleted, and enough of the remaining digits are retained to form the hash address.

COLLISION RESOLUTION STRATEGIES

However good the hash function may be, there are bound to be collisions and overflows. Once an overflow occurs, we cannot simply discard the identifier; it has to be stored somewhere so that we can retrieve it later. Thus, even though a proper hash function is chosen, some overflow handling techniques have to be implemented.

Some commonly implemented techniques are:

1. Open Addressing Mechanisms
   - Linear Probing
   - Quadratic Probing
   - Rehashing
2. Chaining
3. Extendible Hashing

Open Addressing

In an open addressing hashing system, if a collision occurs, alternative locations are tried until an empty location is found. The process starts by examining the hash location of the identifier; if it is occupied, other calculated slots are examined in succession until an empty slot is found. The same process is carried out for retrieval.

Features:
i. All the identifiers are stored in the hash table itself.
ii. Each slot contains an identifier or is empty.
iii. A bigger table is required than for chaining.
iv. Three techniques are commonly used for open addressing: linear probing, quadratic probing, and rehashing.
v. There is a possibility of the table becoming full.
vi. The load factor can never exceed 1.
vii. Insertion proceeds by probing for an empty slot.

Linear Probing or Linear Open Addressing

This is a very simple method of handling overflow. The hash address of identifier X is obtained; if the slot in its bucket is empty, the identifier is placed there. If an overflow occurs, the identifier is placed in the next empty slot after its bucket: we look for an empty slot in successive buckets and place the identifier in the first empty slot found.

Example: The identifiers are a series of letters; hash function f(X) = number of the first letter of the identifier; size of hash table = 26. The identifier list is C, DA, A1, E, A2, GA, B1, D2, FA.
Bucket:      1    2    3    4    5    6    7    8    9    10 ... 26
Identifier:  A1   A2   C    DA   E    B1   GA   D2   FA   (empty)

The hash table shows the positions of all the identifiers. For identifier A2 (bucket 1), an overflow occurs; hence it is put in position 2. The bucket of identifier B1 (bucket 2) is occupied by A2, and positions 3, 4 and 5 are also full; hence B1 is put in the next vacant slot, i.e. 6.
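Linear probing insertion can be sketched in C++ as follows (a sketch assuming integer keys and f(X) = X mod b, as in the solved problems later in this chapter; function names are ours):

```cpp
#include <vector>

const int EMPTY = -1;   // marker for a vacant slot

// Linear probing insert: if the home bucket is occupied, scan forward
// (wrapping around) and place the key in the first empty slot found.
// Returns the slot used, or -1 if the table is full.
int lp_insert(std::vector<int>& table, int key) {
    int b = static_cast<int>(table.size());
    int home = key % b;                    // hash function f(X) = X mod b
    for (int i = 0; i < b; ++i) {
        int slot = (home + i) % b;
        if (table[slot] == EMPTY) { table[slot] = key; return slot; }
    }
    return -1;                             // table full
}

// Build a table of b buckets by inserting the keys in order.
std::vector<int> lp_build(const std::vector<int>& keys, int b) {
    std::vector<int> table(b, EMPTY);
    for (int k : keys) lp_insert(table, k);
    return table;
}
```

For the key sequence 19, 34, 23, 29, 100, 53, 191 and b = 10, this reproduces the table shown in solved problem 1.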


Advantages:
1. It is a simple method of overflow handling.
2. It can be easily implemented using simple data structures like arrays.

Disadvantages:
1. An identifier whose bucket is full will occupy the valid hash address of another identifier; thus, one overflow can lead to many more.
2. If an identifier is not found at its computed address, a series of comparisons has to be made in order to retrieve it.
3. If an identifier is not present in the table, an unsuccessful search terminates only after the entire table, from ht(f(X)+1) around to ht(f(X)-1), has been examined.
4. It creates clusters of identifiers, i.e. identifiers are concentrated in some parts of the table.
5. There is a possibility of the table becoming full.

Quadratic Probing:

This method is used to avoid the clustering problem of linear probing. In linear probing, the search examines buckets (f(X) + i) mod b for i = 0, 1, 2, ...; when the identifier is not found in its bucket, successive buckets are searched with an increment of 1. In quadratic probing, a quadratic function of i is used as the increment: instead of checking the (i+1)-th bucket, the i-th probe checks bucket (f(X) + i²) mod b. This ensures that colliding identifiers are spread out more evenly in the table.

Advantages:
i. Avoids the clustering problem of linear probing to some extent.
ii. Faster insertion and deletion than linear probing when clusters form.

Disadvantages:
i. There is no guarantee of finding an empty slot once the table becomes more than half full if the table size is not prime.
ii. Although this method eliminates primary clustering, it does not eliminate another phenomenon called secondary clustering, in which different keys that hash to the same value follow the same probe path.
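The probe sequence above can be sketched as a one-line function (the function name is ours):

```cpp
// Quadratic probing: the i-th probe for a key with home bucket `home`
// examines bucket (home + i*i) mod b instead of (home + i) mod b.
int qp_probe(int home, int i, int b) {
    return (home + i * i) % b;
}
```

For example, with home bucket 3 and b = 11, successive probes visit buckets 3, 4, 7, 1, ... rather than the consecutive run 3, 4, 5, 6, ... of linear probing.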

Rehashing
In this method, if an overflow occurs, a new address is computed by using another hash function. A series of hash functions f1, f2, ..., fn is used. To place an identifier X, f1(X), f2(X), ..., fm(X) are computed until an empty slot is found to place


the identifier. To search for identifier X, the functions are used in the same order. If all the hash functions have been used and the identifier is not found, it is not in the table. For rehashing, all the hash functions should yield different hash addresses for the same identifier X. Finding such a series of hash functions is difficult.

Advantages:
i. Does not lead to clustering.

Disadvantages:
i. The process may loop forever; this can happen when some rehash function results in the same hash address for an identifier.
ii. It is difficult to find a series of hash functions yielding different addresses for the same identifier.
iii. Rehashing is costly, as each rehash function takes a finite amount of time.
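Rehashing can be sketched as follows (the two hash functions in the demo are illustrative, not from the text; function names are ours):

```cpp
#include <functional>
#include <vector>

// Rehashing: try a series of hash functions f1, f2, ... in order until
// one yields an unoccupied address; -1 means every function failed.
int rehash_address(const std::vector<bool>& occupied,
                   const std::vector<std::function<int(int)>>& fs,
                   int key) {
    for (const auto& f : fs) {
        int addr = f(key);
        if (!occupied[addr]) return addr;
    }
    return -1;
}

// Demo: table of size 10 with slot 4 already full;
// f1(k) = k mod 10, f2(k) = (k / 10) mod 10.
int rehash_demo(int key) {
    std::vector<bool> occupied(10, false);
    occupied[4] = true;
    std::vector<std::function<int(int)>> fs = {
        [](int k) { return k % 10; },
        [](int k) { return (k / 10) % 10; },
    };
    return rehash_address(occupied, fs, key);
}
```

In the demo, key 34 hashes first to the full slot 4, so f2 is tried and yields slot 3; key 7 is placed directly at slot 7.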

Chaining
The main problem with linear probing and its variants is that they result in a series of comparisons, which is not a great improvement over linear search. In chaining, when an overflow occurs, the identifier X is placed in the next vacant slot (as in linear probing) but is chained to the identifier occupying its bucket, so that X can be easily located. Chaining requires another field in the hash table, which stores the address of the next identifier having the same hash address (its synonym). Chaining has three variants:
1. Chaining without replacement
2. Chaining with replacement
3. Chaining using linked lists

1. Chaining without Replacement

This method is as follows: the hash address of an identifier X is computed. If that position is vacant, X is placed there. If the position is occupied by another identifier Y, X is put in the next vacant position and a chain is formed from Y to X's new position.

Example: The identifiers are 11, 32, 41, 54, 33. Hash function: f(X) = X mod 10. Table size = 10. Identifiers 11 and 32 get hashed into positions 1 and 2 respectively.


The hash address of 41 is 1, but that position is occupied; hence 41 is put at position 3 and a chain is formed from position 1 to 3. 54 is stored at position 4. The hash address of 33 is 3, but it is occupied, so 33 is placed at position 5 and a chain is formed from position 3 to 5, as shown.
Position:   0    1    2    3    4    5   ...   9
X:               11   32   41   54   33
Chain:     -1    3   -1    5   -1   -1

Advantages:
1. Since identifiers are chained, we only have to follow the chains to locate an identifier.
2. It is more efficient than the previous methods.

Disadvantages:
The main idea is to chain all identifiers having the same hash address (synonyms). However, when an overflow occurs, an identifier occupies the position of another identifier; hence even non-synonyms get chained together, which increases the search length.

2. Chaining with Replacement

This is an improvement over the above method. As we saw, even non-synonyms get chained. To avoid this, if we find that a non-synonym Y is occupying the position of an identifier X, X replaces Y and Y is relocated to a new position. This makes no difference to Y, since Y was not in its own bucket anyway, but placing X in its own bucket improves efficiency.

Example: Consider the same example as earlier. Here, 11, 32, 41 and 54 are placed as before. However, when 33 has to be placed, its position is occupied by 41. Thus, 33 replaces 41, 41 is put into the next empty slot, i.e. 5, and the chain from element 11 at position 1 is modified to point to 5.
Position:   0    1    2    3    4    5   ...   9
X:               11   32   33   54   41
Chain:     -1    5   -1   -1   -1   -1

Advantages:
1. Most of the identifiers occupy their valid positions.
2. Searching becomes easier since only the synonyms are chained.
Disadvantages:
1. Insertions and deletions take more time.

3. Chaining using linked lists

All of the above methods suffer from a variety of problems. One problem is the limited amount of space, due to which the table can become full. Another is the complexity of chaining within the table, which hurts efficiency. The best method would be to have an unlimited amount of space for each bucket, so that we could store as many identifiers as we want without any overflow taking place. This is possible if we allocate memory dynamically: to store an identifier in a bucket, we first allocate memory for that identifier and then store it in the allocated memory.

Chaining using linked lists is a dynamic representation of a hash table, used especially when the number of identifiers varies dynamically. In this representation, we do not use an array of records for the hash table. Instead, whenever we add an identifier, we allocate the required memory for it, so there is no limit on the number of identifiers in a bucket. Memory is allocated and de-allocated for each identifier dynamically. Each bucket is a linked list in which all identifiers of that bucket are stored, linked to each other by pointers; we therefore need a structure to store an identifier and the address of the next identifier. The hash table is maintained as an array of pointers, each pointing to a linked list, and each list contains all identifiers having the same bucket address (synonyms). When the address of a new identifier is computed, the identifier is added to the list corresponding to its bucket.

To search for a specific identifier, we compute its hash address and traverse the linked list for that bucket from start to end. If the identifier is not found in this list, it is not present in the hash table.


As shown below, the hash table consists of several linked lists; one per bucket. All the synonyms are stored in the same list.
    HT
    [A] -> A  -> A2 -> A1 -> NULL
    [B] -> B1 -> B  -> NULL
    [D] -> D  -> NULL

Chaining using linked lists

Advantages:
1. It is the most efficient method to resolve collisions.
2. There is no limit on the number of identifiers.
3. Searching, insertion and deletion are done efficiently.
4. The hash table will never be full.

Disadvantages:
1. If many identifiers are put into a single list, the searching time within that bucket will increase.
2. Handling the pointer array and linked lists is more complex than handling simple arrays.
Implementation

To implement the hash table as a set of lists, we have to define a structure for the list. As seen in the diagram above, each list stores identifiers along with pointers to the next identifiers. To achieve this, we can define a node structure as follows:

class node
{
    int ident;
    node *next;
};

The hash table will now be an array of pointers, each pointing to the first node in a list:

node *ht[SIZE];

Whenever a new identifier has to be added, we allocate memory for a node, put the identifier in the node and link the node to the corresponding list using the pointer field in the node. To delete an identifier, we find the node containing that identifier and remove the node from its list.
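A fuller sketch of this idea is shown below (a struct is used so the members are accessible, and the hash function and function names are illustrative assumptions, not from the text):

```cpp
const int SIZE = 26;

struct node {
    int ident;
    node *next;
};

node *ht[SIZE] = {};                         // every bucket starts empty

int bucket_of(int x) { return x % SIZE; }    // illustrative hash function

// Insert at the head of the bucket's list: O(1), no overflow possible.
void insert(int x) {
    ht[bucket_of(x)] = new node{x, ht[bucket_of(x)]};
}

// Traverse the bucket's list; true if the identifier is present.
bool search(int x) {
    for (node *p = ht[bucket_of(x)]; p != nullptr; p = p->next)
        if (p->ident == x) return true;
    return false;
}
```

For example, 5, 31 and 57 all hash to bucket 5 and end up chained in that bucket's list, yet each can still be found.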


EXTENDIBLE HASHING
All the methods studied above are useful for small amounts of data, i.e. when the table is small enough to fit in the main memory. But what if the table is so large that it cannot be stored in the main memory? In such a case, the hash table has to be stored in secondary memory, and the main consideration becomes the number of disk accesses required to retrieve data. Assume that initially n records are stored, that the value of n varies over time, and that only m records fit in one disk block. If open addressing or chaining schemes are used, several blocks may have to be examined in order to perform insertion and deletion; if rehashing is used, the process may require a disk access for every rehash function examined.
Reasons for Extendible Hashing:

- Most records have to be stored on disk.
- Disk read/write operations are much more expensive than operations in main memory.
- Regular hash tables may need to examine several disk blocks when collisions occur.
- We want to limit the number of disk accesses for find/insert operations.

Concept
The main idea behind extendible hashing is to allow a retrieval to be performed in two disk accesses, and an insertion in a few more. This can be achieved by using a tree structure. The tree is an M-ary tree (degree M), which allows M-way branching; as the branching increases, the depth of the tree decreases. A complete binary tree has height log2 N, whereas a complete M-ary tree has height log(M/2) N. As M increases, the depth decreases.

The procedure is as follows:
- Hash the key of each record into a reasonably long integer (to avoid collisions), adding 0s on the left so all hashed keys have the same length.
- Build a directory, stored in primary memory.
- Each entry in the directory points to a leaf; the directory is extensible.
- Each leaf holds up to M records, stored in one disk block; the records in a leaf share the same leading digits.
Directories and Leaves

5 - 12 Hash Table

Directory: also called the root; stored in the main memory. D is the number of bits used to index the directory, so the size of the directory is 2^D entries.
Leaf: each leaf stores up to M elements, where M = block size / record size. dL is the number of leading digits in common for all elements in leaf L, with dL <= D.
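The directory lookup can be sketched as follows (a sketch: keys are assumed already hashed to fixed-width W-bit integers, and the function name is ours):

```cpp
// Directory lookup in extendible hashing: the top D bits of the
// W-bit hashed key index a directory of 2^D entries, each of which
// points to a leaf block on disk.
int directory_index(unsigned key, int D, int W) {
    return static_cast<int>(key >> (W - D));
}
```

For example, with 4-bit hashed keys, the key 1011 (decimal 11) falls in directory entry 10 (decimal 2) when D = 2, and in entry 101 (decimal 5) when D = 3.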

SOLVED PROBLEMS

1. Assuming integer identifiers, a hash function of X % 10 and one slot per bucket, show the hash table after the following identifiers have been added, if the linear probing method is used: 19 34 23 29 100 53 191
Soln: The hash table will have 10 buckets since the hash function is X % 10; the bucket addresses are 0 to 9. Since linear probing is used, when an overflow occurs the identifier is placed in the next vacant slot. The hash table will be as follows:
Bucket:      0    1    2    3    4    5    6    7    8    9
Identifier:  29   100  191  23   34   53                  19

2. Write an algorithm to list all the identifiers in a hash table in alphabetic order. Assume string identifiers, the hash function f(X) = first character of X, and linear probing. What is the time taken to perform the entire operation?
Soln: To list all the identifiers in alphabetic order, we search the entire table for all identifiers beginning with the character 'a', then 'b', and so on up to 'z'. The hash table has 26 buckets; we assume that each bucket contains only one slot.

Algorithm:
1. Start.
2. H is the hash table, which has 26 buckets.
3. ch = 'a'.
4. Bucket b = 0, i.e. start from bucket 0.
5. If f(H[b]) equals ch, display H[b].
6. b = b + 1; repeat from step 5 until all buckets are searched.
7. ch = ch + 1.
8. If ch <= 'z', go to step 4.
9. Stop.

In this case, the number of buckets n = 26. The time taken to search the entire table for identifiers beginning with a given character is O(n), since linear probing is used. Since the search is repeated 26 times, i.e. n times, the time complexity is O(n * n), which is of the order O(n²).

3. What are the key ideas behind hashing?
Soln: The main idea behind hashing is to avoid a series of comparisons when retrieving a stored identifier. In hashing, an identifier is stored at a pre-calculated location given by a hash function, which computes the address of the identifier in a table called the hash table. To search for an identifier, the address is calculated using the same hash function, which allows for faster retrieval. The efficiency of the entire system depends upon how good the hashing function is. A perfect hashing function would never generate the same hash address for distinct identifiers; however, in practice a perfect hashing function is rarely possible. Since there will be collisions (two non-identical identifiers having the same hash address), leading to overflow, overflow handling techniques like linear probing or chaining have to be used to place the identifiers in the hash table.

4. Compare the linear probing and chaining methods of collision resolution.
Soln: In order to resolve collisions, some collision resolution techniques have to be used. Two important methods are linear probing and chaining.

Criteria              | Linear Probing                   | Chaining
----------------------|----------------------------------|------------------------------------------
Data structures used  | Array                            | Array, linked list
Algorithm complexity  | Simple method to implement       | More complex than linear probing
Variations            | No variations                    | Three types: chaining without replacement,
                      |                                  | with replacement, and using linked lists
Space complexity      | Requires no additional memory    | Requires additional memory for the chain
                      |                                  | field and pointers
Time complexity       | Best case O(1), worst case O(n)  | Best case O(1); worst case depends on
                      |                                  | the method used
Search efficiency     | Poor                             | Better than linear probing, since
                      |                                  | synonyms are chained together

Exercises
1. Define: hash table, hash address, collision, overflow, synonym.
2. How does hashing differ from other searching methods?
3. What are the characteristics of a good hash function?
4. Discuss various hash functions.
5. Which are the different methods of overflow handling?
6. Compare chaining with replacement and chaining without replacement.
7. Compare the linear probing method and chaining.
8. Explain the method of extendible hashing with the help of an example.

Frequently Asked Questions ( Univ )

1. Explain hashing and hashing function. Give two examples of a hashing function. What are the characteristics of a good hashing function? ( Dec 02, May 01, Dec 00, Dec 98; 6 marks )
2. What is meant by collision? Explain the following collision avoidance techniques with an example: linear probing, chaining without replacement, chaining with replacement. ( Dec 03, May 02; 10 marks )
3. Explain with an example the difference between chaining without replacement and chaining with replacement. Which method is better? ( May 01; 4 marks )
4. Write an algorithm for chaining with replacement as a technique for synonym resolution. ( May 00; 8 marks )
5. What is linear probing? Explain. ( May 99; 2 marks )
6. Discuss various applications of the hash table data structure. ( Dec 97; 2 marks )


7. Name and describe two hashing functions you can use for alphanumeric keys. Do they satisfy the characteristics of a good hashing function? Justify your answer considering each of the characteristics independently. ( May 97; 8 marks )
8. What is an AVL tree? Explain in detail the rotations necessary in each situation. ( Dec 98, May 98, May 95; 10 marks )
9. Obtain the AVL tree for the following data:
   i) MAR MAY NOV AUG APR JAN DEC JUL FEB JUN OCT SEP ( May 98 )
   ii) 50 55 60 15 10 40 20 45 30 47 70 80 ( Dec 95 )
   iii) A Z B Y C X D W E V F M R ( May 97; 10 marks )
10. Compare binary search tree, optimal binary tree and height balanced trees. ( Dec 98, May 98; 6 marks )
11. What are the advantages and disadvantages of an optimal BST as compared to AVL and binary search trees? ( Dec 96, May 95; 6 marks )
12. Compare an AVL tree with a binary search tree. ( Dec 94, May 96; 8 marks )
