
Hashing

Hashing means storing a value together with a key, so that whenever we want to retrieve the value we can look it up through that key. The key is generated from the given value by some fixed logic. The place where we store these key-value pairs is called a hashtable/hashmap (it can be an array or a special C++ container, which I will describe later in this module).
For example, let us choose the mod (%) function to generate the key from the given values, and store the values in an array of size 13 like this:
key = value % 13
Suppose we have the values 5, 23, 12, 98. Then
5 % 13 = 5 (we store it at index 5)
23 % 13 = 10 (we store it at index 10)
12 % 13 = 12 (we store it at index 12)
98 % 13 = 7 (we store it at index 7)
Hence we also say that the keys of 5, 23, 12, 98 are 5, 10, 12, 7 respectively.
See the animation at https://www.cs.usfca.edu/~galles/visualization/OpenHash.html : insert 5, 23, 12, 98 and watch how the elements get stored.
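The key computation above can be sketched in C++ as follows. This is only an illustration: the names TABLE_SIZE, hash_key and build_table are made up for this sketch, it assumes non-negative values, and it does not handle two values that get the same key.

```cpp
#include <vector>

// Hypothetical names for this sketch; the size 13 comes from the example above.
const int TABLE_SIZE = 13;

// Key of a value: the remainder after dividing by the table size.
int hash_key(int value) {
    return value % TABLE_SIZE;   // assumes value >= 0
}

// Stores each value at the index given by its key. -1 marks an empty slot.
// Values with the same key are not handled: a later value overwrites an earlier one.
std::vector<int> build_table(const std::vector<int>& values) {
    std::vector<int> table(TABLE_SIZE, -1);
    for (int v : values) {
        table[hash_key(v)] = v;
    }
    return table;
}
```

Calling build_table({5, 23, 12, 98}) places 5, 23, 12 and 98 at indices 5, 10, 12 and 7, and leaves every other slot at -1.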
Suppose we now insert 33 after inserting the above numbers. The key of 33 is 33 % 13 = 7, so 33 should also be stored at index 7 of our array (hashtable). But there is already an element, 98, at index 7, so we say that we have a collision. How can we resolve the collision (or, in simpler words, how can we store 33)?
[Note: Just go through these two methods theoretically. There is no need to code them, as they are not needed while solving questions.]
There are two methods for resolving collisions:
1) Open Hashing (https://www.cs.usfca.edu/~galles/visualization/OpenHash.html)
In this method the hashtable is an array of linked lists. Hence we just insert 33 as a new node after the node 98 (which is already inserted). Please try the animation at the above link: insert 33 after adding 98 and see how this happens.
2) Closed Hashing
In this method, whenever there is a collision we look for the next empty place in the hashtable (here the hashtable is a plain array). This method is less popular than the first one.
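Purely for illustration (you do not need to code these for the questions), here is a minimal sketch of both methods. The names SLOTS, ChainedTable and ProbingTable are made up for this sketch. For closed hashing it assumes linear probing, i.e. trying index + 1, index + 2, and so on, which is only one possible way of looking for the next empty place. Values are assumed to be non-negative.

```cpp
#include <list>
#include <vector>

const int SLOTS = 13;  // hypothetical name; same table size as the example above

// Open hashing: every slot holds a linked list of the values with that key.
struct ChainedTable {
    std::vector<std::list<int>> slots = std::vector<std::list<int>>(SLOTS);

    void insert(int value) {                    // assumes value >= 0
        slots[value % SLOTS].push_back(value);  // new node goes after earlier ones
    }
};

// Closed hashing with linear probing (an assumption of this sketch):
// on a collision, try the following indices until an empty slot is found.
struct ProbingTable {
    std::vector<int> slots = std::vector<int>(SLOTS, -1);  // -1 marks an empty slot

    // Returns the index where the value was stored, or -1 if the table is full.
    int insert(int value) {                     // assumes value >= 0
        for (int step = 0; step < SLOTS; ++step) {
            int index = (value % SLOTS + step) % SLOTS;
            if (slots[index] == -1) {
                slots[index] = value;
                return index;
            }
        }
        return -1;
    }
};
```

With 98 inserted first, inserting 33 into a ChainedTable puts it in the list at index 7 after 98, while a ProbingTable stores it at index 8, the next empty place.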
Now we come to the main part of hashing, i.e. how we solve interview questions using hashing. As we can see from the above introduction, hashing is nothing but storing a number at the position given by its key in a hashtable.


There are two data structures with which we can do hashing (we call these data structures hashtables):
1) Using an array
2) Using C++ <set> (which we will discuss later in this document)
We will cover a few questions given on G4G.
First problem: http://www.geeksforgeeks.org/find-first-repeating-element-array-integers/

A) Approach #1 -> Using array
The post mentions other methods as well, such as sorting, but here we will consider how to solve it by hashing with an array.
Let us take the sample input and output as
Input: arr[ ] = {10, 5, 3, 4, 3, 5, 6}
Output: 5 [5 is the first element that repeats]

Now we create an array (hashtable) hash[]. The size of the hashtable should be equal to (largest element in the given input array arr[] + 1). Here the maximum element in arr[] is 10, hence we create an array hash[] of size 11.

Then we initialise all the elements in hash[] as (-1). This -1 marks that no element in arr[] has been
visited yet.

Then we iterate over arr[] from right to left and mark the numbers as visited. We do this by taking the current element of arr[] as an index into hash[] and changing the value at that index from -1 to +1. (This looks a bit weird and complicated at first sight, but do learn it, as it is the most important idea in hashing.)
For example, the first element of arr[] from the right is 6, so we set hash[6] to +1. This shows that the element 6 has now been visited, while every other entry of hash[] is still -1.

Similarly, we now go to the next element of arr[] from the right, which is 5, and change hash[5] from -1 to +1.

In the next two steps we set hash[3] = +1 and then hash[4] = +1.

So the numbers 3, 4, 5, 6 from arr[] are visited, and hence their entries in hash[] are +1.
Now we are at index 2 of arr[]. We see that arr[2] is 3, which has already been visited (because hash[3] = +1 and not -1). Whenever a number is already +1 in hash[], it is repeated, because it was marked as visited at some earlier step. Hence 3 is a repeated number and could be our answer, so we store 3 as our temporary answer (in a variable tempans). But we have to keep looking for a repeated element at a lower index than 2, so we continue to iterate towards the left.
We move left to index 1 of arr[]. We see that arr[1] = 5 is also a repeated number (as hash[5] is already +1 instead of -1). Since 5 is at a lower index than our temporary answer 3, we update tempans to 5 and go one step left to index 0 of arr[]. We see that 10 is not a repeated number so far (as hash[10] = -1 and not +1). So we set hash[10] = +1 and end our iteration.

In our final hash[] array, we see that hash[3], hash[4], hash[5], hash[6] and hash[10] are +1 which
signifies that 3,4,5,6,10 are present in arr[]. We do not need the hash[] array anymore. And our
temporary answer (tempans) is 5 which means that 5 is the first repeated element in arr[].
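The walkthrough above can be sketched in C++ like this. It is a minimal sketch and not necessarily the same as the linked code: the function name first_repeating_with_array is made up here, it assumes every element is non-negative, and it returns -1 when no element repeats.

```cpp
#include <algorithm>
#include <vector>

// Returns the first repeating element of arr, or -1 if nothing repeats.
// Sketch only: assumes every element of arr is >= 0.
int first_repeating_with_array(const std::vector<int>& arr) {
    if (arr.empty()) return -1;

    // hash[] needs one slot per possible value: 0 .. largest element.
    int largest = *std::max_element(arr.begin(), arr.end());
    std::vector<int> hash(largest + 1, -1);   // -1 means "not visited yet"

    int tempans = -1;                          // temporary answer
    for (int i = static_cast<int>(arr.size()) - 1; i >= 0; --i) {
        if (hash[arr[i]] == 1) {
            tempans = arr[i];                  // already seen further right: it repeats
        } else {
            hash[arr[i]] = 1;                  // mark as visited
        }
    }
    return tempans;                            // the last update came from the lowest index
}
```

For arr[] = {10, 5, 3, 4, 3, 5, 6} this returns 5, matching the walkthrough.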
Points to note
1) It is not mandatory to initialise all the elements of hash[] to -1. We can initialise them to any other number. When we visit a number from arr[] we assign a second value (here +1), which must be different from the initial value so that we can recognise that the element has already been visited. We generally choose +1 and -1 because they represent on and off (here, marked and not marked).
2) The time complexity of this method is O(N + max_element_in_arr[]), where N = number of elements in arr[]: O(N) for the scan plus the time needed to initialise hash[]. The space complexity is O(max_element_in_arr[]). See whether this matches your answer or not.
3) In the above example arr[] contains only positive numbers. What if arr[] has negative numbers? Then we cannot use this method, as there are no negative indices in an array. For example, we cannot mark the number -8 if arr[] = {10, 5, -8, 3, 7, 5}. This is a big limitation of using an array as a hashtable, so we will look at another simple data structure which C++ provides in its library.
Code
The code for the above method is at http://code.geeksforgeeks.org/LnTaeT
Go through the code many times and see what is happening.

B) Using C++'s <set>
C++ provides ready-made built-in data structures for trees and hash tables, which we will discuss below.
To use them, I will first introduce C++'s <set>, which is easy.

C++ <set>
Note that throughout this document we write <set>, with angle brackets, instead of just set to indicate that we mean the container and nothing else. In code we never write set inside angle brackets. We always write it like this: set<int> name; as we will see shortly.
A set is a container/data structure/collection in which elements are sorted according to their own values. Each element may occur only once, so duplicates are not allowed.

There are two types of <set>:
1) Ordered Set (a.k.a. Binary Search Tree): The elements inserted in it are always kept sorted. It is used as a shortcut for a binary search tree, since with this container you can search in O(log N) time, insert in O(log N) time, and do all the other things you would expect from a tree.
2) Unordered Set (a.k.a. HashTable): The elements inserted in it are not sorted and are stored in no particular order. It is used as a shortcut for a hashtable.
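Coding Details
a) To use both the ordered and the unordered set you must write
#include<bits/stdc++.h>
using namespace std;
This is because both containers are part of the C++ standard library and live in namespace std. Note that <bits/stdc++.h> is a GCC-specific header; the standard headers are #include <set> and #include <unordered_set>.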
b) To create an ordered set of, say, integer type, use
set<int> container_name; // Do not add ordered_ at the front
and to create an unordered set use
unordered_set<int> container_name;
Note that we can use other datatypes, like float, double etc., instead of int.
c) To insert elements in them, we use
container_name.insert(65); // To insert the number 65
Note that the method to insert is the same for both the ordered and the unordered set, i.e. the insert() method, but the way the elements are stored is different. As told earlier, an ordered set is basically a balanced BST, so a number gets inserted just the way an element gets inserted in a tree. In an unordered set the position is decided by the hash of the value, so the elements end up in no particular order.
Imp: Insertion takes O(log N) time in an ordered set. In an unordered set it takes O(1) time on average, but O(N) in the worst case. Hence remember that for an unordered set insertion and searching take O(1) time on average, but in the worst case they can take O(N) time.
d) If at any point we want the number of elements in the set (the size of the set), we use
int size = container_name.size(); // To get the size/number of elements
The .size() method returns the number of elements in the set (as an unsigned integer type).
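Steps b) to d) can be put together in a small sketch. The function names distinct_count_ordered and distinct_count_unordered are made up for this illustration; the sketch uses the standard headers and writes std:: explicitly instead of using namespace std.

```cpp
#include <set>
#include <unordered_set>
#include <vector>

// Inserts every value into an ordered set and returns the resulting size.
// Duplicates are ignored by insert(), so the size counts distinct values only.
int distinct_count_ordered(const std::vector<int>& values) {
    std::set<int> container;               // O(log N) per insert
    for (int v : values) container.insert(v);
    return static_cast<int>(container.size());
}

// Same idea with an unordered set (hash table): O(1) per insert on average.
int distinct_count_unordered(const std::vector<int>& values) {
    std::unordered_set<int> container;
    for (int v : values) container.insert(v);
    return static_cast<int>(container.size());
}
```

Both functions return 3 for the values {65, 10, 65, 3}, because the second 65 is not stored again.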

e) Now that we know how to insert elements in a set, we will iterate through all the elements in the set. We cannot do it like this:
for(int i=0 ; i<=container_name.size(); i++)
{
    printf("%d", container_name[i]);
}
The above is a wrong method, because a set cannot be indexed with i. What we do instead is declare an iterator, which will iterate through all the elements:
set<int>::iterator it; // Iterator for set
unordered_set<int>::iterator it; // Iterator for unordered_set
You can give the iterator whatever name you want instead of it.
And then we iterate like this:
for(it=container_name.begin() ; it!=container_name.end(); it++)
{
    printf("%d ", *it);
}
Note that this iterator it is not a number; it behaves like a pointer.
container_name.begin() is also an iterator, which points to the first element of the container.
When using a set, the first element is the smallest element (just like in a tree) and the last element is the largest.
Hence the above for loop is just like the inorder traversal of a tree and always prints sorted output if the container is a set. If it is an unordered set, it prints the elements in no particular order.
Hence in short,
printf("%d", *container_name.begin()) prints the smallest number if the container is a set; if it is an unordered set, it prints some arbitrary element.
container_name.end() does not point to the last element. It points one position past the last element and must never be dereferenced. To print the largest number of a set, use printf("%d", *container_name.rbegin()) instead (an unordered set has no rbegin()).
Note that in the above printf statements we have used an asterisk (*) because it, container_name.begin() and container_name.rbegin() are all iterators, which are dereferenced like pointers.

One important thing to note is that in these for loops the condition uses the not-equal sign (!=) instead of the less-than sign (<):
for(it=container_name.begin() ; it!=container_name.end(); it++)
    printf("%d ", *it);
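The iteration described above can be sketched as follows. The function names elements_in_order, smallest and largest are made up for this illustration. Note that end() is only compared against and never dereferenced; the largest element is read through rbegin().

```cpp
#include <set>
#include <vector>

// Copies the elements of an ordered set into a vector using an iterator.
// The result is always in ascending order.
std::vector<int> elements_in_order(const std::set<int>& container) {
    std::vector<int> out;
    for (std::set<int>::const_iterator it = container.begin();
         it != container.end(); ++it) {
        out.push_back(*it);                // *it is the element the iterator refers to
    }
    return out;
}

// Smallest and largest elements; both assume the set is not empty.
int smallest(const std::set<int>& container) { return *container.begin(); }
int largest(const std::set<int>& container)  { return *container.rbegin(); }
```

For a set built from 40, 10, 30, 20 the loop visits 10, 20, 30, 40 in that order, whatever the order of insertion was.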

Exercise
Now that we know all the above things, run the following code in your compiler and see what is happening: http://code.geeksforgeeks.org/R57XQD
f) To search/find an element, we use
container_name.find(65); // To search for the number 65
Note: The .find() method returns an iterator which points to the element 65 if 65 is inside the container; otherwise it returns container_name.end().
Hence if we want to print the number 65, we do it like this:
printf("%d", *container_name.find(65));
or
set<int>::iterator it; /* Note that if the container is an unordered set then add unordered_ in the front */
it = container_name.find(65);
printf("%d", *it);
Both of the above print 65 if 65 is inside the container. If 65 is not present, the .find() method returns container_name.end(), which must not be dereferenced, so check for it first.
To check whether 65 is inside or not we do the following:
set<int>::iterator it; /* Note that if the container is an unordered set then add unordered_ in the front */
it = container_name.find(65);
if(it != container_name.end())
{
    printf("65 is present\n");
}
else
{
    printf("65 is not present\n");
}
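The presence check can be wrapped in a small helper, sketched here for both container types. The name is_present is made up for this illustration.

```cpp
#include <set>
#include <unordered_set>

// True if value is stored in the ordered set.
// find() returns end() when the value is missing.
bool is_present(const std::set<int>& container, int value) {
    std::set<int>::const_iterator it = container.find(value);
    return it != container.end();
}

// The same check for an unordered set; only the container type changes.
bool is_present(const std::unordered_set<int>& container, int value) {
    return container.find(value) != container.end();
}
```

The helper compares the iterator with end() and never dereferences it, so it is safe to call for values that are not in the container.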
Exercise
See the below code for reference: http://code.geeksforgeeks.org/WqD1yJ

Imp: Searching takes O(log N) time in an ordered set (obviously, as an ordered set is a tree). In an unordered set it takes O(1) time on average, but O(N) in the worst case. Hence remember that for an unordered set insertion and searching take O(1) time on average, but in the worst case they can take O(N) time.

That completes our <set>.


Now we will briefly get introduced to <multiset>. As told earlier, a <set> cannot contain duplicate values, but a <multiset> can have duplicate elements.
<multiset> is very easy if <set> is clear as a concept. Just like <set>, <multiset> is also of two types:
1) Ordered Multiset, a.k.a. a BST that can hold duplicate elements.
2) Unordered Multiset, a.k.a. a hashtable that can hold duplicate elements.
We do just two things differently from <set>:
1) Creating the <multiset> is different from <set>
Instead of
set<int> container_name; // Do not add ordered_ at the front
and
unordered_set<int> container_name;
we just add the prefix multi in front of set to make it multiset, as shown below:
multiset<int> container_name; // Do not add ordered_ at the front
and to create an unordered multiset use
unordered_multiset<int> container_name;
Everything else remains the same. Methods like .find(), .insert() etc. are the same.
2) Defining the <multiset> iterator is different from <set>
Instead of this:
set<int>::iterator it; // Iterator for set
unordered_set<int>::iterator it; // Iterator for unordered_set
we do this to define iterators for multisets:
multiset<int>::iterator it; // Iterator for multiset
unordered_multiset<int>::iterator it; // Iterator for unordered_multiset

Exercise
See this code for the difference between set and multiset: http://code.geeksforgeeks.org/xFwQuB

Notice from the code how, even after inserting the number 10 into container1 and container2 three times, we were still left with only one 10 inside each of them, whereas container3 and container4 can hold as many copies as you want.
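The difference can also be seen in a minimal sketch that inserts the same value several times and reports the final size. The function names are made up for this illustration and are not taken from the linked code.

```cpp
#include <set>
#include <unordered_set>

// Number of elements kept after inserting the same value `times` times.
int size_after_repeats_set(int value, int times) {
    std::set<int> container;
    for (int i = 0; i < times; ++i) container.insert(value);
    return static_cast<int>(container.size());   // duplicates are dropped
}

int size_after_repeats_multiset(int value, int times) {
    std::multiset<int> container;
    for (int i = 0; i < times; ++i) container.insert(value);
    return static_cast<int>(container.size());   // duplicates are kept
}

int size_after_repeats_unordered_multiset(int value, int times) {
    std::unordered_multiset<int> container;
    for (int i = 0; i < times; ++i) container.insert(value);
    return static_cast<int>(container.size());   // duplicates are kept
}
```

Inserting 10 three times leaves a size of 1 in the set and a size of 3 in both multisets.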
Note
Just remember that the internal structure of a <set> and a <multiset> is a tree, and for <unordered_set> and <unordered_multiset> it is an array of linked lists that uses the open hashing technique (we discussed that at the beginning of this document).

Now that you know C++'s set and multiset, let us come back to the previous problem on G4G: http://www.geeksforgeeks.org/find-first-repeating-element-array-integers/
We have solved this problem using an array (see Approach #1 earlier in this document). But as we know, that method had two limitations:
1) What if arr[] has negative numbers?
Then we cannot use this method, as there are no negative indices in an array. For example, we cannot mark the number -8 if arr[] = {10, 5, -8, 3, 7, 5}. This is a big limitation of using an array as a hashtable. The solution is <set>.
2) What if the maximum element in arr[] is very large? We have to declare a hash array hash[] of that size. Even if arr[] has only one element, like arr[] = {999999999}, we still have to create hash[] of that size. This is a complete waste of space. We can fix this by using <set>.

B) Approach #2 -> Using <set>
In this method we don't create a hash[] array. We create an <unordered_set>. [Note that we can also use <set>, but as we know <set> is a tree with an insertion time of O(log N), whereas <unordered_set> is a hash table with an insertion time of O(1) on average that rarely hits the worst case O(N); so for efficiency we use the hash table.]
We iterate from right to left and insert the elements of arr[] into our <unordered_set>. Let us name our <unordered_set> container1.
We start from the right. The first element of arr[] from the right is 6. 6 is not already present in container1, which means that 6 has not repeated yet. So we simply add it to container1.

Then the same happens with 5, then 3 and then 4. None of them is present in container1 before it is added, so we simply insert them into container1.

[Note that since container1 is an <unordered_set>, we can picture it as a bag containing numbers, as they are stored in no particular order. There is no sorted order like in an ordered set <set>.]

Now we come to arr[2], which is 3. We discover that 3 is already inside container1. Hence it must occur in the array more than once, so 3 is a repeated number and could be our answer. We store 3 in a variable tempans. But we have to keep looking for a repeated element at a lower index than 2, so we continue to iterate towards the left.
When we go to index 1, i.e. the element 5 (arr[1] = 5), we discover that 5 is also in container1, which means that 5 is also a repeated element. Since 5 is at a lower index than 3 (5 is at index 1 and 3 at index 2), we update tempans to store 5.
Then we go to index 0. We see that the element is 10 and it is not present in container1. So 10 is out of the race, as it is not a repeated element, and we simply add 10 to container1.
At the end container1 has 10, 5, 4, 3 and 6 (in no specific order) and tempans holds the value 5, which is our required answer.
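The steps above can be sketched in C++ like this. It is a minimal sketch and not necessarily the same as the linked code: the function name first_repeating_with_set and the bool-plus-reference return style are choices made for this illustration, so that negative answers remain unambiguous.

```cpp
#include <unordered_set>
#include <vector>

// Returns true and writes the first repeating element to `result`
// if some element of arr repeats; returns false otherwise.
bool first_repeating_with_set(const std::vector<int>& arr, int& result) {
    std::unordered_set<int> container1;    // values seen so far, scanning right to left
    bool found = false;
    for (int i = static_cast<int>(arr.size()) - 1; i >= 0; --i) {
        if (container1.find(arr[i]) != container1.end()) {
            result = arr[i];               // plays the role of tempans in the text
            found = true;
        } else {
            container1.insert(arr[i]);
        }
    }
    return found;
}
```

For arr[] = {10, 5, 3, 4, 3, 5, 6} the function returns true with result equal to 5. It also works for negative elements: for {-8, 5, -8} the result is -8.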
Code
The code for the above method is at http://code.geeksforgeeks.org/oqTV1t
Go through the code many times and see what is happening.

Are the limitations that were there in Approach #1 (Using Arrays) solved?
1) Yes, now we can use this method even if there is a negative number in arr[], because we can store negative numbers in container1 as well.
2) Yes, now even if the array has only one element, like arr[] = {999999999}, we do not have to allocate that much space. We just have to store the integer 999999999 inside container1.
Is there even a single benefit in using an array as a hashtable instead of <unordered_set>/<unordered_multiset> as a hashtable?
1) Yes, there is a benefit. If you look at the codes:
Approach #1 Using Arrays - http://code.geeksforgeeks.org/LnTaeT
Time Complexity: O(N + max_element_of_arr[])
Space Complexity: O(max_element_of_arr[])
Approach #2 Using Hashtable - http://code.geeksforgeeks.org/oqTV1t
Time Complexity: O(N^2) in the worst case, but generally O(N)
Space Complexity: O(N)
Here, N = size of array arr[]
The worst-case time complexity of the second approach is O(N^2) because, as discussed earlier, the .find() method can take O(N) time in the worst case. But this happens rarely. On average it takes O(1) time, and hence this approach has O(N) complexity on average. So if we are extremely unlucky the hash table will take more time (although this rarely happens, which is why hash tables are used so much).
It is also possible that max_element_of_arr[] is much lower than N; in such cases the space complexity of Approach 1 is better than that of hashing.

Exercise
Note that an <unordered_multiset> would also have worked instead of <unordered_set>. Think why?
Summary
Hence hashing is nothing but storing values in a hash table for various purposes. This hash table can be an array or any other data structure, like <set> or <unordered_set>.

Note
Whenever you get stuck on a problem, tell me and I will try to solve it.
