Review
Hash Functions
Spreading data evenly through a table avoiding collisions while maintaining a simple efficient function.
This week
Data Compression Techniques
We can speed up the efficiency of many algorithms by compressing data. Obviously storing M or F, rather than Male or Female, takes less space, AND less time to transfer between bits of a program. Lets investigate how we can efficiently encode data.
Encoding 128
128 80 1000000 CXXVIII ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||||||||||||||||||||||||
Assumptions
We have n different symbols we can use to encode a message.
In Morse code n=3. In Binary n=2.
All symbols mi forming the set M, have probabilities of occurrence P(mi) such that P(mi) + + P(mn) =1
Infrequently occurring symbols can be assigned a long code word, while short code words are reserved for frequent symbols.
Coding Objectives
Each codeword corresponds to exactly one symbol. Decoding should not require any look ahead.
This is known as the prefix property.
Prefix Property
Symbols: A, B, C Codes: 1, 2, 12 Message: 12 Is it AB? Is it C?
Prefix Property
Symbols: A, B, C, Codes: 1, 22, 12 Message: 1222 Read in 1, is it an A? Read in 2, was it a C? Read in 2, Should it be AB? Read in 2, Ah, finally we can assume it was CB.
Prefix Property
Symbols: A, B, C Codes: 11, 12, 21 No Ambiguity here.
Code Optimisation
The length of a code for one symbol should not exceed the length of a less likely symbol;
if P(mi) P(mj) then L(mi) L(mj)
There should be no unused short codes, either as stand alone encodings or as prefixs for longer codes.
01, 000, 001, 100, 101 is not ideal as 11 is not used.
Huffman Coding
Huffman coding is a method for choosing a representation for each symbol, resulting in a prefix-free code
The bit string representing some particular symbol is never a prefix of the bit string representing any other symbol
The most common characters are expressed using shorter strings of bits than are used for less common symbols.
Huffman Coding
Huffman coding starts by creating a heap, based on the frequencies of the symbols to be encoded.
The root of the heap should be 1.0 the sum of both child nodes. Each data node appears in a leaf node.
Creating a Heap
There are several ways a heap can be created from the given data, here is one option;
120 55 25 30 65
30
H
15
35
A 10 C 5
5
15
15
10
B D
15
20
Huffman Coding
Each node is then given a code based on their position in the tree 0 for left node, 1 for right node;
120 55 25 30 65
A = 000 B = 010
35
30 10
H
15
C = 0010 D = 011
A C 5
5
15
15
10
B D
15
20
H = 10
5
I = 00111
Huffman Decoding
A = 000 B = 010 C = 0010 D = 011 E = 111 F = 00110 G = 110 H = 10 I = 00111
Adaptive Huffman
Huffman coding was invented back in the 1950s, and has a key assumption that we know the expected frequency of each symbol. We could get this frequency from a large corpus, such as the BNC, but then the resulting huffman tree might not be ideal for all circumstances.
A technical computing paper is likely to contain more non alphabet symbols than war and peace.
Adaptive Huffman is usable in this situation, as it reconstructs the tree as data is read in.
Adaptive Huffman
Initially all symbols are unused and so stored in a leaf with value 0. As symbols are used they are moved up the tree. As symbols are used more frequently, they are compared and if necessary swapped with their siblings. Sibling comparison is easy if the heap is stored as a doubly linked list.
30
H
15
35
A 10 C 5
5
15
15
10
B D
15
20
add c
1 0 1
add a
2 1 1
c
add b
3 2 1 1 1
b a
1
c
abdegrs c
bdegrs a
3
c
0
transpose
degrs
1 1
2 1 1
b a
0
degrs
Adaptive Huffman
Note that Adaptive Huffman requires just one pass through the message to encode it;
The first time the b is transmitted it has the code 101. The second time the b is transmitted it has the code 0.
As symbols rise up the tree their codes become shorter. So long as the decoder uses the same formula it can decode the same message.
Run-Length Encoding
The word cabbage has a repeated b, so run length encoding might be able to shorten the code, by indicating that there are 2 bs.
This is similar to the symbol in Thai.
Homework
Taking the following phrase;
Data Structures And Algorithms Analysis
Use the Adaptive Huffman Coding method to generate a code for it.