
Huffman Coding
An Application of Binary Trees and Priority Queues
(Chapter 10.2, page 701)

CSc 28

Encoding and Compression of Data

Fax machines
ASCII
Variations on ASCII
Minimum number of bits needed
Cost of savings
Patterns
Modifications

Purpose of Huffman Coding

Developed by Dr. David A. Huffman, 1952, at MIT
"A Method for the Construction of Minimum-Redundancy Codes"
Applicable to many forms of data transmission

The Basic Algorithm

Huffman coding is a form of statistical coding
For a given text, not all characters occur with the same frequency!
Yet all characters are allocated the same amount of space:
1 char = 1 byte = 8 bits, be it E or Q

The Basic Algorithm

Savings come from tailoring code words to the frequency of occurrence of characters
Character code lengths are no longer fixed (as in ASCII)
Code word lengths vary and are shorter for the more frequently used characters

The (Real) Basic Algorithm

1. Scan the text (e.g. a message) to be compressed and tally the occurrences of each character.
2. Sort or prioritize the characters based on their number of occurrences in the text (a queue of characters, low to high).
3. Build the Huffman code tree based on this ordered queue.
4. Traverse this tree to determine all code words.

Building a Tree
Scan the original text

For the following short text:

Eerie eyes seen near lake


Count the occurrences of all characters in the text


Building a Tree
Scan the original text

Eerie eyes seen near lake


What characters are present?

E, e, r, i, space, y, s, n, a, l, k
11 characters

Building a Tree
Scan the original text

Frequency count for: Eerie eyes seen near lake

ch      freq
E       1
i       1
y       1
l       1
k       1
.       1
space   4
e       8
r       2
s       2
n       2
a       2

26 characters in total
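As a quick cross-check of this table, here is a minimal Python sketch (not part of the slides; the names are made up) that tallies the message, with the trailing period included so the counts match the table above:

from collections import Counter

# Tally the occurrences of each character in the message.
# The trailing period is included so the total comes to 26.
message = "Eerie eyes seen near lake."
freq = Counter(message)

for ch, count in freq.most_common():
    print(repr(ch), count)
# Prints e:8, space:4, then r, s, n, a at 2 each, and E, i, y, l, k, '.' at 1 each.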

Building a Tree
Prioritize characters

Create binary tree nodes with the character and frequency of each character: one node per character
Arrange in ascending order, low frequency to high frequency
NOTE: the lower the occurrence, the lower the priority in the queue

Steps
1. Select the two lowest-valued elements.
2. Combine these into a tree whose root contains the sum of the two frequencies.
3. Reorder the list, low to high.
4. Repeat the combining process until there is only one element (tree) left in the list.

Building a Tree

The queue after inserting all nodes (character & frequency count), low freq. to high freq.:

[Diagram: the queue of leaf nodes E:1, i:1, y:1, l:1, k:1, .:1, r:2, s:2, n:2, a:2, sp:4, e:8. Null pointers not shown.]

Building a Tree

While the priority queue contains two or more nodes:
Create a new node, starting with the two lowest nodes (on the left)
Dequeue the first node and make it the left subtree
Dequeue the next node and make it the right subtree
The frequency of the new root equals the sum of the frequencies of its left and right children
Enqueue the new node (with its children) back into the queue, placed to the right of the nodes with the same or lower frequency
(A code sketch of this loop follows.)
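A minimal sketch of this loop in Python, using heapq as the priority queue (the Node class and build_tree name are invented for this illustration, not the course's required implementation):

import heapq
import itertools

class Node:
    # A Huffman tree node: leaves carry a character, internal nodes carry None.
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq, self.char, self.left, self.right = freq, char, left, right

def build_tree(frequencies):
    # Build a Huffman tree from a {character: count} mapping.
    order = itertools.count()        # tie-breaker so equal frequencies never compare Nodes
    heap = [(f, next(order), Node(f, ch)) for ch, f in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # two lowest-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(order), Node(f1 + f2, left=left, right=right)))
    return heap[0][2]                         # the single remaining node is the root

Note that heapq reorders purely by frequency, so tie-breaking may differ from the left-to-right convention shown in the diagrams; any tree built this way is still optimal.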

Building a Tree

[Diagrams: one slide per combining step. The two lowest nodes are dequeued, combined under a new root, and the new subtree is enqueued back in frequency order: E(1)+i(1) → 2, y(1)+l(1) → 2, k(1)+.(1) → 2, r(2)+s(2) → 4, n(2)+a(2) → 4, then (E,i)+(y,l) → 4 and (k,.)+sp(4) → 6. When frequencies tie, nodes are taken left to right.]

What is happening to the characters with a low number of occurrences?

Building a Tree

[Diagrams: the remaining combining steps. (r,s)(4) + (n,a)(4) → 8, the weight-4 subtree (E,i,y,l) + the weight-6 subtree (k,.,sp) → 10, e(8) + the weight-8 subtree (r,s,n,a) → 16, and finally 10 + 16 → 26.]
Building a Tree

After enqueueing this node there is only one node left in the priority queue.

[Diagram: the full tree of weight 26 as the single remaining queue entry.]
Building a Tree

Dequeue the single node left in the queue.
This tree contains the new code words for each character.
The frequency of the root node should equal the number of characters in the text:
Eerie eyes seen near lake. (26 characters)

[Diagram: the final Huffman tree. The left subtree (weight 10) holds E, i, y, l, k, . (1 each) and sp (4); the right subtree (weight 16) holds e (8) and r, s, n, a (2 each).]
Encoding the File
Traverse the tree for character codes

Perform a traversal of the tree to obtain each character's code
Going left is a 0, going right is a 1
A character's Huffman code is complete only when a leaf node is reached
(See the traversal sketch below.)

[Diagram: the Huffman tree of weight 26, as above.]
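The traversal can be sketched as a recursive walk (a hypothetical assign_codes helper, assuming the Node class from the earlier build_tree sketch):

def assign_codes(node, prefix="", table=None):
    # Walk the tree: append '0' going left and '1' going right.
    # A code word is recorded only when a leaf (a node holding a character) is reached.
    if table is None:
        table = {}
    if node.char is not None:
        table[node.char] = prefix or "0"      # guard for a one-symbol alphabet
    else:
        assign_codes(node.left, prefix + "0", table)
        assign_codes(node.right, prefix + "1", table)
    return table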

Encoding the File
Traverse the tree for codes

ch      freq   Huffman code
E       1      0000
i       1      0001
y       1      0010
l       1      0011
k       1      0100
.       1      0101
space   4      011
e       8      10
r       2      1100
s       2      1101
n       2      1110
a       2      1111

[Diagram: the Huffman tree of weight 26, as above.]

Encoding the File

Rescan the text and encode the file using the new code words:

Eerie eyes seen near lake.

000010110000011001110001010110101111011010
111001111101011111100011001111110100100101

Why is there no need for a separator character?


Encoding the File
Results

000010110000011001110001010110101111011010
111001111101011111100011001111110100100101

Have we made things any better?
84 bits to encode the text
ASCII would take 8 * 26 = 208 bits
If a modified fixed-length code is used, 4 bits per character are needed: 4 * 26 = 104 bits
Against that, the savings are not as great.
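Putting the pieces together, a short sketch (again hypothetical, reusing build_tree and assign_codes from the earlier sketches) that encodes the message and compares the totals:

from collections import Counter

message = "Eerie eyes seen near lake."
codes = assign_codes(build_tree(Counter(message)))

encoded = "".join(codes[ch] for ch in message)
print(len(encoded))        # 84 bits: the total is the same for any optimal tree here
print(8 * len(message))    # 208 bits in 8-bit ASCII
print(4 * len(message))    # 104 bits with a fixed 4-bit code for the 12 symbols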

Decoding the File

How does the receiver of the coded message know how to do the translation?
Either a Huffman tree is constructed for each text file, considering the frequencies of the characters in that file,
or the tree is predetermined, based on statistical analysis of text files or file types.
Data transmission is bit-based rather than byte-based.


Decoding the File

Once the receiver has the tree, it scans the incoming bit stream:
0 means go left
1 means go right

101000110111101111011
11110000110101

[Diagram: the Huffman tree of weight 26, as above.]
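The receiver's walk can be sketched like this (a hypothetical decode helper, assuming the same Node class as before):

def decode(bits, root):
    # Follow the tree from the root: '0' goes left, '1' goes right.
    # Emit the character at each leaf, then restart from the root.
    out, node = [], root
    for bit in bits:
        node = node.left if bit == "0" else node.right
        if node.char is not None:
            out.append(node.char)
            node = root
    return "".join(out)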

Summary

Huffman coding is a technique used to compress files for transmission
Uses statistical coding: more frequently used symbols have shorter codes
Works well for text and fax transmissions
An application that uses several data structures

Practice
Given the coding scheme:
a: 001
b: 0001
e: 1
r: 0000
s: 0100
t: 011
x: 01010

What are the words represented by:
a) 01110100011
b) 0001110000
c) 0100101010
d) 0110010101010100

Rosen text: Chapter 10.2, page 701

Practice
Given the coding scheme:
a: 001
b: 0001
e: 1
r: 0000
s: 0100
t: 011
x: 01010

The words represented by:
a) 01110100011 = test
b) 0001110000 = beer
c) 0100101010 = sex
d) 0110010101010100 = taxes
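These answers can be double-checked with a throwaway decoder that peels codewords straight off the table (possible because the code is prefix-free):

# The practice code from the slide, inverted to map codeword -> character.
code = {"a": "001", "b": "0001", "e": "1", "r": "0000",
        "s": "0100", "t": "011", "x": "01010"}
decode_table = {v: k for k, v in code.items()}

def decode_word(bits):
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in decode_table:      # a full codeword has been read
            out.append(decode_table[current])
            current = ""
    return "".join(out)

for bits in ["01110100011", "0001110000", "0100101010", "0110010101010100"]:
    print(bits, "->", decode_word(bits))   # test, beer, sex, taxes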

Practice: one more

Given the characters and their frequencies, create the Huffman tree and derive the codes for each character:

a: 0.20
b: 0.10
c: 0.15
d: 0.25
e: 0.30

CODES:
a: 00
b: 100
c: 101
d: 01
e: 11

[Diagram: root 100 splits into 45 (a: 20 and d: 25) and 55 (e: 30 and 25, where 25 = b: 10 + c: 15).]
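The same build_tree and assign_codes sketches from earlier can be applied to these fractional weights as a check; the exact 0/1 labels may differ from the slide depending on tie-breaking, but the code lengths agree:

weights = {"a": 0.20, "b": 0.10, "c": 0.15, "d": 0.25, "e": 0.30}
codes = assign_codes(build_tree(weights))
for ch in sorted(codes):
    print(ch, codes[ch])
# Expected code lengths: a, d and e get 2 bits; b and c get 3 bits.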
