
Hashing & Hash Tables

Cpt S 223. School of EECS, WSU

Overview

Hash Table Data Structure: Purpose

To support insertion, deletion, and search in average-case constant time

Assumption: Order of elements is irrelevant
==> this data structure is *not* useful if you want to maintain and retrieve some kind of order of the elements

Hash function
Hash[ string key ] ==> integer value

Hash table ADT
Implementations, Analysis, Applications

Hash table: Main components


[Figure: a key (e.g., "john") is fed into a hash function, which produces a hash index h(john) into a hash table (implemented as a vector) of TableSize slots; each slot stores a key/value pair.]

How to determine the hash function?


Hash Table

A hash table is an array of fixed size TableSize; each array element holds a key and a value

Array elements are indexed by a key, which is mapped to an array index in [0, TableSize-1]

The mapping (hash function) h goes from key to index
E.g., h(john) = 3

Hash Table Operations

Insert
T[h(john)] = <john, 25000>   (the key is run through the hash function to locate the slot for the data record)

Delete
T[h(john)] = NULL

Search
T[h(john)] returns the element hashed for john

What happens if h(john) == h(joe)?
==> collision!

Factors affecting Hash Table Design

Hash function

Table size
Usually fixed at the start

Collision handling scheme


Hash Function

A hash function is one which maps an element's key into a valid hash table index

h(key) => hash table index

Note that this is (slightly) different from saying h(string) => int, because the key can be of any type

E.g., h(int) => int is also a hash function!

But also note that any type can be converted into an equivalent string form

h(key) ==> hash table index

Hash Function Properties

A hash function maps a key to an integer
Constraint: the integer should be in [0, TableSize-1]

A hash function can result in a many-to-one mapping (causing collisions)
A collision occurs when the hash function maps two or more keys to the same array index

Collisions cannot be avoided, but their chances can be reduced using a good hash function


h(key) ==> hash table index

Hash Function Properties

A good hash function should have these properties:

1. Reduced chance of collision
   Different keys should ideally map to different indices
   Distribute keys uniformly over the table

2. Should be fast to compute



Hash Function - Effective use of table size

Simple hash function (assume integer keys):
h(Key) = Key mod TableSize

For random keys, h() distributes keys evenly over the table

What if TableSize = 100 and the keys are ALL multiples of 10?
Then every key maps to a slot that is a multiple of 10, and 90% of the table is never used.
Better if TableSize is a prime number
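As a quick illustration, here is a minimal sketch of this modular hash for integer keys (my own code, not from the slides; the function name and the example table size are assumptions):

#include <cstddef>

// Sketch: h(Key) = Key mod TableSize, with TableSize chosen prime (e.g., 101).
std::size_t hashInt(unsigned int key, std::size_t tableSize)
{
  return key % tableSize;   // always lands in [0, tableSize-1]
}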


Different Ways to Design a Hash Function for String Keys

Approach 1
A very simple function to map strings to integers: add up the character ASCII values (0-255) to produce an integer key

E.g., "abcd" = 97+98+99+100 = 394
==> h("abcd") = 394 % TableSize

Potential problems:
Anagrams will map to the same index: h("abcd") == h("dbac")
Small strings may not use all of the table (when strlen(S) * 255 < TableSize)
Time proportional to the length of the string
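A minimal sketch of this additive hash (my own code; the name additiveHash is an assumption), which also makes the anagram problem easy to see:

#include <cstddef>
#include <string>

// Approach 1 (sketch): sum the character values, then reduce mod TableSize.
std::size_t additiveHash(const std::string& s, std::size_t tableSize)
{
  std::size_t sum = 0;
  for (unsigned char c : s)
    sum += c;
  return sum % tableSize;
}
// Note: additiveHash("abcd", M) == additiveHash("dbac", M) for any M -- anagrams collide.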



Different Ways to Design a Hash Function for String Keys

Approach 2
Treat the first 3 characters of the string as a base-27 integer (26 letters plus space)

Key = S[0] + (27 * S[1]) + (27^2 * S[2])

Better than approach 1 because? It spreads keys over a larger range (27^3 = 19,683 values) instead of a small character sum.

Potential problems:
Assumes the first 3 characters are randomly distributed
Not true of English: "Apple", "Apply", "Appointment", "Apricot" all start with the same letters ==> collision


Different Ways to Design a Hash Function for String Keys

Approach 3
Use all N characters of the string as an N-digit base-K number

Choose K to be a prime number larger than the number of different digits (characters)
I.e., K = 29, 31, 37

If L = length of string S, then

h(S) = ( sum over i = 0..L-1 of S[L-i-1] * 37^i ) mod TableSize

Use Horner's rule to compute h(S) (see the sketch below)

Problems:
Potential overflow of intermediate values
Larger runtime for long strings ==> limit L for long strings
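Below is a minimal sketch of this base-37 hash computed with Horner's rule (my own code in the spirit of the textbook's version; the name hornerHash is an assumption). Unsigned overflow simply wraps around, which is harmless for hashing:

#include <cstddef>
#include <string>

// Approach 3 (sketch): treat the whole string as a base-37 number, evaluated
// with Horner's rule: h = (...((S[0]*37 + S[1])*37 + S[2])*37 + ...) + S[L-1].
std::size_t hornerHash(const std::string& key, std::size_t tableSize)
{
  std::size_t hashVal = 0;
  for (unsigned char c : key)
    hashVal = 37 * hashVal + c;
  return hashVal % tableSize;
}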


Collision resolution techniques

Techniques to deal with collisions:
Chaining
Open addressing
Double hashing
Etc.

Resolving Collisions

What happens when h(k1) = h(k2)?
==> collision!

Collision resolution strategies:

Chaining
Store colliding keys in a linked list at the same hash table index

Open addressing
Store colliding keys elsewhere in the table

Chaining
Collision resolution technique #1


Chaining strategy: maintain a linked list at every hash index for collided elements

Example insertion sequence: { 0 1 4 9 16 25 36 49 64 81 } (the first 10 perfect squares)

Hash table T is a vector of linked lists
Insert each element at the head of its list (as shown on the slide) or at the tail
Key k is stored in the list at T[h(k)]

E.g., TableSize = 10, h(k) = k mod 10

Implementation of Chaining Hash Table

(The slides show a code listing; its main points:)
The main hash table is a vector of linked lists
The class tracks the current number of elements in the hash table
Hash functions are provided for integer and string keys
The vector's size is the hash table's current capacity (aka the table size)
A helper function computes the hash table index for an element x
insert does a duplicate check and -- covered later -- essentially resizes the hash table if it is getting crowded
Each of these operations takes time linear in the length of the list at the hashed index location
All hash object types must define the == and != operators
A hash function can be supplied to handle the Employee object type
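Since the listing itself is not reproduced here, below is a minimal sketch of a separate-chaining hash table in the same spirit (my own code; class and member names such as ChainedHashTable and myhash are assumptions, not necessarily those used in the course code):

#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Sketch of a separate-chaining hash table for string keys.
class ChainedHashTable
{
public:
  explicit ChainedHashTable(std::size_t tableSize = 101)
    : lists(tableSize), currentSize(0) {}

  bool contains(const std::string& x) const
  {
    const auto& whichList = lists[myhash(x)];
    return std::find(whichList.begin(), whichList.end(), x) != whichList.end();
  }

  bool insert(const std::string& x)
  {
    auto& whichList = lists[myhash(x)];
    if (std::find(whichList.begin(), whichList.end(), x) != whichList.end())
      return false;                 // duplicate check
    whichList.push_back(x);
    ++currentSize;                  // a real table would rehash here if crowded
    return true;
  }

  bool remove(const std::string& x)
  {
    auto& whichList = lists[myhash(x)];
    auto it = std::find(whichList.begin(), whichList.end(), x);
    if (it == whichList.end())
      return false;
    whichList.erase(it);
    --currentSize;
    return true;
  }

private:
  std::vector<std::list<std::string>> lists;   // the main hash table
  std::size_t currentSize;                      // current #elements

  std::size_t myhash(const std::string& x) const   // Horner-style string hash
  {
    std::size_t h = 0;
    for (unsigned char c : x) h = 37 * h + c;
    return h % lists.size();
  }
};

Each operation takes time linear in the length of the list at the hashed index, as the slide notes.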

Collision Resolution by Chaining: Analysis

The load factor λ of a hash table T is defined as follows:
N = number of elements in T   (current size)
M = size of T                 (table size)
λ = N/M                       (load factor)

i.e., λ is the average length of a chain

Unsuccessful search time: O(λ)
Same for insert time

Successful search time: O(λ/2)

Ideally, want λ close to 1 (not a function of N)

Potential disadvantages of Chaining

Linked lists could get long
Especially when N approaches M
Longer linked lists could negatively impact performance

More memory because of pointers

Absolute worst case (even if N << M):
All N elements in one linked list!
Typically the result of a bad hash function

Open Addressing
Collision resolution technique #2


Collision Resolution by Open Addressing

An in-place approach
When a collision occurs, look elsewhere in the table for an empty slot

Advantages over chaining
No need for list structures
No need to allocate/deallocate memory during insertion/deletion (slow)

Disadvantages
Slower insertion -- may need several attempts to find an empty slot
Table needs to be bigger (than a chaining-based table) to achieve average-case constant-time performance
Load factor λ ≤ 0.5



Collision Resolution by Open Addressing

A probe sequence is the sequence of slots examined in the hash table while searching for an element x:
h0(x), h1(x), h2(x), ...
Needs to visit each slot exactly once
Needs to be repeatable (so we can find/delete what we've inserted)

Hash function:
hi(x) = (h(x) + f(i)) mod TableSize
f(0) = 0 ==> position for the 0th probe
f(i) is the distance to be traveled relative to the 0th probe position, during the ith probe (a generic sketch follows)
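As a generic illustration of this definition (my own sketch; the function probeSequence and its parameters are assumptions), the first few probe slots can be generated for any increment function f:

#include <cstddef>
#include <functional>
#include <vector>

// Sketch: generate the first maxProbes slots of the probe sequence
// h_i(x) = (h(x) + f(i)) mod tableSize, where f is any increment function
// (f(i) = i gives linear probing, f(i) = i*i gives quadratic probing).
std::vector<std::size_t> probeSequence(std::size_t hx,
                                       std::size_t tableSize,
                                       std::size_t maxProbes,
                                       const std::function<std::size_t(std::size_t)>& f)
{
  std::vector<std::size_t> slots;
  for (std::size_t i = 0; i < maxProbes; ++i)
    slots.push_back((hx + f(i)) % tableSize);
  return slots;
}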

Linear Probing

f(i) is a linear function of i, e.g., f(i) = i
hi(x) = (h(x) + i) mod TableSize
ith probe index = (0th probe index + i) mod TableSize

Probe sequence: +0, +1, +2, +3, +4, ...

[Figure: the 0th, 1st, 2nd, and 3rd probes land on occupied slots; x is populated in the first unoccupied slot.]

Continue until an empty slot is found
#failed probes is a measure of performance

Linear Probing

ith probe index = (0th probe index + i) mod TableSize

f(i) is a linear function of i, e.g., f(i) = i
hi(x) = (h(x) + f(i)) mod TableSize
Probe sequence: +0, +1, +2, +3, +4, ...

Example: h(x) = x mod TableSize, TableSize = 10
h0(89) = (h(89)+f(0)) mod 10 = 9
h0(18) = (h(18)+f(0)) mod 10 = 8
h0(49) = (h(49)+f(0)) mod 10 = 9 (X, collision)
h1(49) = (h(49)+f(1)) mod 10 = (h(49)+1) mod 10 = 0
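A minimal sketch of insertion with linear probing matching this example (my own code; it assumes nonnegative integer keys and that the table always has an empty slot, since a real implementation would rehash before it fills up):

#include <cstddef>
#include <optional>
#include <vector>

// Sketch: insert x into a table of optional<int> slots using linear probing.
void linearProbeInsert(std::vector<std::optional<int>>& table, int x)
{
  std::size_t tableSize = table.size();
  std::size_t h = static_cast<std::size_t>(x) % tableSize;   // h(x) = x mod TableSize
  for (std::size_t i = 0; i < tableSize; ++i)
  {
    std::size_t slot = (h + i) % tableSize;                  // h_i(x) = (h(x) + i) mod TableSize
    if (!table[slot].has_value())                            // first empty slot found
    {
      table[slot] = x;
      return;
    }
  }
}
// E.g., with TableSize = 10 and inserts 89, 18, 49: 49 collides at slot 9 and lands in slot 0.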

Linear Probing Example

Insert sequence: 89, 18, 49, 58, 69 (TableSize = 10, h(x) = x mod 10)

[Figure: the table contents after each insertion over time.]

#unsuccessful probes per insertion: 89 -> 0, 18 -> 0, 49 -> 1, 58 -> 3, 69 -> 3
7 total


Linear Probing: Issues

Probe sequences can get longer with time

Primary clustering
Keys tend to cluster in one part of the table
Keys that hash into a cluster will be added to the end of the cluster (making it even bigger)
Side effect: other keys can also be affected if they map to a crowded neighborhood

Linear Probing: Analysis

Expected number of probes for insertion or unsuccessful search:
(1/2) * ( 1 + 1/(1-λ)^2 )

Expected number of probes for successful search:
(1/2) * ( 1 + 1/(1-λ) )

Example (λ = 0.5)
Insert / unsuccessful search: 2.5 probes
Successful search: 1.5 probes

Example (λ = 0.9)
Insert / unsuccessful search: 50.5 probes
Successful search: 5.5 probes

Random Probing: Analysis

Random probing does not suffer from clustering

Expected number of probes for insertion or unsuccessful search:
(1/λ) * ln( 1/(1-λ) )

Example
λ = 0.5: 1.4 probes
λ = 0.9: 2.6 probes

Linear vs. Random Probing

[Figure: #probes vs. load factor λ for linear and random probing, with curves for U - unsuccessful search, S - successful search, I - insert; linear probing degrades badly as λ grows, while random probing stays good longer.]

Quadratic Probing

Avoids primary clustering
f(i) is quadratic in i, e.g., f(i) = i^2
hi(x) = (h(x) + i^2) mod TableSize

Probe sequence: +0, +1, +4, +9, +16, ...
[Figure: the 0th, 1st, 2nd, and 3rd probes land on occupied slots at quadratically increasing offsets.]

Continue until an empty slot is found
#failed probes is a measure of performance

Quadratic Probing

Avoids primary clustering
f(i) is quadratic in i, e.g., f(i) = i^2
hi(x) = (h(x) + i^2) mod TableSize

Probe sequence: +0, +1, +4, +9, +16, ...

Example (continuing the earlier insertions, TableSize = 10):
h0(58) = (h(58)+f(0)) mod 10 = 8 (X)
h1(58) = (h(58)+f(1)) mod 10 = 9 (X)
h2(58) = (h(58)+f(2)) mod 10 = 2

Q) Delete(49), then Find(69) - is there a problem? (See the deletion discussion two slides ahead.)

Quadratic Probing Example

Insert sequence: 89, 18, 49, 58, 69 (TableSize = 10, h(x) = x mod 10)

[Figure: the table after each insertion; the offsets tried are +0^2, +1^2, +2^2, ...]

#unsuccessful probes per insertion: 89 -> 0, 18 -> 0, 49 -> 1, 58 -> 2, 69 -> 2
5 total


Quadratic Probing: Analysis

Difficult to analyze

Theorem 5.1
A new element can always be inserted into a table that is at least half empty, provided TableSize is prime

Otherwise, we may never find an empty slot, even if one exists
Ensure the table never gets half full
If it gets close, then expand it



Quadratic Probing

May cause secondary clustering

Deletion
Emptying slots can break the probe sequence and could cause find to stop prematurely

Lazy deletion
Differentiate between an empty slot and a deleted slot
When finding, skip and continue beyond deleted slots
If you hit a non-deleted empty slot, then stop the find procedure and return "not found"
May need compaction at some point

Quadratic Probing: Implementation

(The slides show a code listing; its main points:)
Lazy deletion: each slot records whether it is empty, occupied, or DELETED
The constructor ensures the table size is prime
find skips DELETED slots; no duplicates are stored
The probe positions really do follow the quadratic probe sequence, even though they are computed incrementally (see the sketch below)
insert stores no duplicates; remove just marks the slot DELETED, so no deallocation is needed
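Since the listing is not reproduced here, below is a minimal sketch of open addressing with quadratic probing and lazy deletion in this spirit (my own code; the class name, member names, and the prime default size are assumptions). The incremental step relies on i^2 - (i-1)^2 = 2i - 1, so consecutive offsets are 1, 3, 5, ...:

#include <cstddef>
#include <vector>

// Sketch: open-addressing hash table with quadratic probing and lazy deletion.
// Keys are assumed nonnegative ints; the table size should be prime and the
// load factor kept below 0.5 (Theorem 5.1) -- a real table would rehash.
class QuadraticHashTable
{
public:
  explicit QuadraticHashTable(std::size_t primeSize = 101) : slots(primeSize) {}

  bool contains(int x) const { return slots[findPos(x)].state == ACTIVE; }

  bool insert(int x)
  {
    std::size_t pos = findPos(x);
    if (slots[pos].state == ACTIVE)
      return false;                       // no duplicates
    slots[pos] = { x, ACTIVE };
    return true;
  }

  bool remove(int x)
  {
    std::size_t pos = findPos(x);
    if (slots[pos].state != ACTIVE)
      return false;
    slots[pos].state = DELETED;           // lazy deletion: just mark the slot
    return true;
  }

private:
  enum State { EMPTY, ACTIVE, DELETED };
  struct Slot { int key; State state; };
  std::vector<Slot> slots;                // value-initialized: all EMPTY

  // Follow the quadratic probe sequence h(x)+0, h(x)+1, h(x)+4, h(x)+9, ...
  // using consecutive offsets 1, 3, 5, ... (since i^2 - (i-1)^2 = 2i - 1).
  // Stops at x itself or at the first EMPTY slot; DELETED slots are skipped.
  std::size_t findPos(int x) const
  {
    std::size_t offset = 1;
    std::size_t pos = static_cast<std::size_t>(x) % slots.size();
    while (slots[pos].state != EMPTY && slots[pos].key != x)
    {
      pos += offset;                      // next quadratic probe position
      offset += 2;
      if (pos >= slots.size())            // a single wrap suffices while offset < size
        pos -= slots.size();
    }
    return pos;
  }
};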

Double Hashing: keep two hash functions h1 and h2

Use a second hash function for all probes i other than 0:
f(i) = i * h2(x)

Good choices for h2(x)?
Should never evaluate to 0
h2(x) = R - (x mod R), where R is a prime number less than TableSize

Previous example with R = 7:
h0(49) = (h(49)+f(0)) mod 10 = 9 (X)
h1(49) = (h(49)+1*(7 - 49 mod 7)) mod 10 = (9 + 7) mod 10 = 6     [f(1) = 1 * h2(49) = 7]
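As a small illustration (my own code; the function name and parameters are assumptions), the ith probe position under double hashing can be computed as:

#include <cstddef>

// Sketch of double hashing: f(i) = i * h2(x), with h2(x) = R - (x mod R).
// R must be a prime smaller than tableSize, so h2(x) is in [1, R] and never 0.
std::size_t doubleHashProbe(std::size_t x, std::size_t i,
                            std::size_t tableSize, std::size_t R)
{
  std::size_t h  = x % tableSize;   // primary hash h(x)
  std::size_t h2 = R - (x % R);     // secondary hash
  return (h + i * h2) % tableSize;  // h_i(x) = (h(x) + i*h2(x)) mod TableSize
}
// E.g., doubleHashProbe(49, 1, 10, 7) == 6, matching the example above.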


Double Hashing Example

[Figure: an example hash table built with double hashing.]


Double Hashing: Analysis

It is imperative that TableSize is prime
E.g., try inserting 23 into the previous table: with a non-prime table size the probe sequence can cycle through only a few slots and never find an empty one

Empirical tests show double hashing is close to random hashing

The extra hash function takes extra time to compute


Probing Techniques - review

[Figure: three diagrams showing the 0th through 3rd tries against a hash table for each technique.]
Linear probing: tries at offsets +0, +1, +2, +3, ...
Quadratic probing: tries at offsets +0, +1, +4, +9, ...
Double hashing: tries at offsets +0, +h2(x), +2*h2(x), +3*h2(x), ... (determined by a second hash function)


Rehashing

Increase the size of the hash table when the load factor becomes too high (defined by a cutoff), anticipating that the probability of collisions would otherwise become higher

Typically expand the table to twice its size (but still prime)

Need to reinsert all existing elements into the new hash table (a sketch follows)
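A minimal sketch of rehashing for the chaining table sketched earlier (my own code; nextPrime is a helper I am assuming here, shown with a simple trial-division implementation):

#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Assumed helper: smallest prime >= n (trial division is fine for table sizes).
std::size_t nextPrime(std::size_t n)
{
  auto isPrime = [](std::size_t m) {
    if (m < 2) return false;
    for (std::size_t d = 2; d * d <= m; ++d)
      if (m % d == 0) return false;
    return true;
  };
  while (!isPrime(n)) ++n;
  return n;
}

// Rehash a chaining table: roughly double the size (keeping it prime),
// then reinsert every existing element using the new table size.
void rehash(std::vector<std::list<std::string>>& lists)
{
  std::vector<std::list<std::string>> oldLists = std::move(lists);
  lists = std::vector<std::list<std::string>>(nextPrime(2 * oldLists.size()));

  for (const auto& chain : oldLists)
    for (const auto& key : chain)
    {
      std::size_t h = 0;
      for (unsigned char c : key) h = 37 * h + c;   // same string hash as before
      lists[h % lists.size()].push_back(key);
    }
}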


Rehashing Example

[Figure: a table with h(x) = x mod 7 at λ = 0.57; inserting 23 raises λ to 0.71, which triggers rehashing into a new table with h(x) = x mod 17 at λ = 0.29.]


Rehashing Analysis

Rehashing takes O(N) time, since it must reinsert all N elements into the new table
Therefore we should do it infrequently

Specifically:
There must have been at least N/2 insertions since the last rehash
Amortizing the O(N) cost over those N/2 prior insertions yields only constant additional time per insertion

Rehashing Implementation

When to rehash:
When the load factor reaches some threshold (e.g., 0.5), OR
When an insertion fails

Applies across collision handling schemes

Rehashing for Chaining

[Figure: code listing for rehashing a chaining hash table.]


Rehashing for Quadratic Probing

[Figure: code listing for rehashing a quadratic probing hash table.]


Hash Tables in C++ STL

Hash tables are not part of the C++ Standard Library
Some implementations of the STL have hash tables (e.g., SGI's STL):
hash_set
hash_map


Hash Set in STL

#include <hash_set>

struct eqstr
{
  bool operator()(const char* s1, const char* s2) const
  {
    return strcmp(s1, s2) == 0;
  }
};

// Template parameters: Key, Hash fn, Key equality test
void lookup(const hash_set<const char*, hash<const char*>, eqstr>& Set,
            const char* word)
{
  hash_set<const char*, hash<const char*>, eqstr>::const_iterator it
      = Set.find(word);
  cout << word << ": "
       << (it != Set.end() ? "present" : "not present")
       << endl;
}

int main()
{
  hash_set<const char*, hash<const char*>, eqstr> Set;
  Set.insert("kiwi");
  lookup(Set, "kiwi");
}

Hash Map in STL

#include <hash_map>

struct eqstr
{
  bool operator()(const char* s1, const char* s2) const
  {
    return strcmp(s1, s2) == 0;
  }
};

// Template parameters: Key, Data, Hash fn, Key equality test
int main()
{
  hash_map<const char*, int, hash<const char*>, eqstr> months;
  months["january"] = 31;     // operator[] is internally treated like insert
  months["february"] = 28;    // (or overwrite, if the key is already present)
  // ...
  months["december"] = 31;
  cout << "january -> " << months["january"] << endl;
}

Problem with Large Tables

What if the hash table is too large to store in main memory?

Solution: store the hash table on disk
Minimize disk accesses

But:
Collisions require disk accesses
Rehashing requires a lot of disk accesses

Solution: Extendible Hashing

Hash Table Applications

Symbol table in compilers

Accessing tree or graph nodes by name
E.g., city names in Google maps

Maintaining a transposition table in games
Remember previous game situations and the move taken (avoid re-computation)

Dictionary lookups
Spelling checkers
Natural language understanding (word sense)

Heavily used in text processing languages
E.g., Perl, Python, etc.

Summary

Hash tables support fast insert and search
O(1) average-case performance
Deletion is possible, but degrades performance

Not suited if ordering of elements is important

Many applications

Points to remember - Hash tables

Table size should be prime
Table size should be much larger than the number of inputs (to keep λ closer to 0, or at least < 0.5)

Tradeoffs between chaining vs. probing

Collision chances decrease in this order:
linear probing => quadratic probing => {random probing, double hashing}

Rehashing is required to resize the hash table when λ exceeds 0.5

Good for searching. Not good if there is some order implied by the data.
