An Example: Huffman codes

We are used to character encodings in which every character takes the same number of bits, e.g., the 7-bit ASCII code. However, some characters tend to occur more frequently than others in English (or in any language with an alphabet). If we use a variable number of bits per character, so that frequent characters use fewer bits and infrequent characters use more, we can decrease the space needed to store the same information. For example, consider the following sentence:
dead beef cafe deeded dad.  dad faced a faded cab.  dad acceded.  dad be bad.

There are 12 a's, 4 b's, 5 c's, 19 d's, 12 e's, 4 f's, 17 spaces (two after each of the first three periods), and 4 periods, for a total of 77 characters. If we use a fixed-length code like this:
    character:  (space)  a    b    c    d    e    f    .
    code:       000      001  010  011  100  101  110  111

Then the sentence, which is of length 77, consumes 77 * 3 = 231 bits. But if we use a variable-length code like this:

    character:  (space)  a    b      c     d  e     f      .
    code:       100      110  11110  1110  0  1010  11111  1011

Then we can encode the text in 12 * 3 + 4 * 5 + 5 * 4 + 19 * 1 + 12 * 4 + 4 * 5 + 17 * 3 + 4 * 4 = 230 bits, where each term is a character's count times the length of its code, in the order a, b, c, d, e, f, space, period. That's a savings of 1 bit. It doesn't seem like much, but it's a start. (Note that such a code must be a prefix code, so that we can tell where one code stops and the next starts: no code may be a prefix of another code, or decoding would be ambiguous.)

If the characters have non-uniform frequency distributions, then finding such a code can lead to great savings in storage space. A process with this effect is called data compression. It can be applied to any data where the frequency distribution is known or can be computed, not just sentences in natural languages. Examples are computer graphics, digitized sound, and binary executables.
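As a quick check of the arithmetic, the following Python sketch counts the characters of the sentence and totals the bits under both codes. The names TEXT, FIXED, VARIABLE, encoded_bits, and is_prefix_code are hypothetical. The sentence is typed with two spaces after each of the first three periods, an assumption that matches the stated count of 17 spaces.

```python
from collections import Counter

# Assumption: two spaces follow each of the first three periods,
# which is what the stated count of 17 spaces implies.
TEXT = "dead beef cafe deeded dad.  dad faced a faded cab.  dad acceded.  dad be bad."

FIXED = {" ": "000", "a": "001", "b": "010", "c": "011",
         "d": "100", "e": "101", "f": "110", ".": "111"}
VARIABLE = {" ": "100", "a": "110", "b": "11110", "c": "1110",
            "d": "0", "e": "1010", "f": "11111", ".": "1011"}


def encoded_bits(text, code):
    """Total number of bits needed to encode text with the given code table."""
    freq = Counter(text)
    return sum(n * len(code[ch]) for ch, n in freq.items())


def is_prefix_code(code):
    """True if no codeword is a prefix of a different codeword."""
    words = sorted(code.values())
    # In sorted order, every word that starts with `a` is grouped directly
    # after `a`, so comparing each word with its successor is sufficient.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))


freq = Counter(TEXT)
fixed_bits = encoded_bits(TEXT, FIXED)
variable_bits = encoded_bits(TEXT, VARIABLE)
print(len(TEXT), fixed_bits, variable_bits, is_prefix_code(VARIABLE))
```

Under these assumptions the script should report 77 characters, 231 bits for the fixed-length code, and 230 bits for the variable-length code, and confirm that the variable-length table is a prefix code.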

A prefix code can be represented as a binary tree, where the leaves are the characters and the codes are derived by tracing a path from root to leaf, using 0 when we go left and 1 when we go right. For example, the code above would be represented by this tree:

              @
             / \
            d   @
               / \
              /   \
             /     \
            @       @
           / \     / \
    (space)   @   a   @
             / \     / \
            e  "."  c   @
                       / \
                      b   f
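The path-tracing rule can be sketched in Python as follows. The nested-tuple representation of the tree and the names TREE, codes_from_tree, and decode are assumptions made for this example: a leaf is a one-character string, and an internal node is a (left, right) pair.

```python
# Assumed layout: a leaf is a one-character string, an internal node
# is a (left, right) tuple. This mirrors the example tree for the
# variable-length code.
TREE = ("d", ((" ", ("e", ".")), ("a", ("c", ("b", "f")))))


def codes_from_tree(node, path=""):
    """Map each leaf character to its root-to-leaf path (0 = left, 1 = right)."""
    if isinstance(node, str):
        return {node: path}
    left, right = node
    table = codes_from_tree(left, path + "0")
    table.update(codes_from_tree(right, path + "1"))
    return table


def decode(bits, tree):
    """Walk the tree one bit at a time, emitting a character at each leaf."""
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):  # reached a leaf: emit it, restart at the root
            out.append(node)
            node = tree
    return "".join(out)


CODES = codes_from_tree(TREE)
encoded = "".join(CODES[ch] for ch in "dad faced")
print(CODES["e"], decode(encoded, TREE))
```

Because no codeword is a prefix of another, the decoder never needs to look ahead: reaching a leaf always marks the end of exactly one character.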

In this tree, the code for e is found by going right, left, right, left, i.e., 1010.

How can we find such a code? There are many codes, but we would like to find one that is optimal with respect to the number of bits needed to represent the data. Huffman's algorithm is a greedy algorithm that does just this. We can label each leaf of the tree with the frequency of the letter in the text to be compressed. This quantity will be called the "value" of the leaf. The frequencies may be known beforehand from studies of the language or data, or can be computed by just counting characters the way counting sort does.

We then label each internal node recursively with the sum of the values of its children, starting at the leaves. So the tree in our example looks like this:
              77
             /  \
            d    58
            19  /  \
               /    \
              /      \
            33        25
           /  \      /  \
    (space)    16   a    13
    17        /  \  12  /  \
             e   "."   c    8
             12   4    5   / \
                          b   f
                          4   4

The root node has value 77, which is just the number of characters.
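The recursive labeling can be sketched in a few lines of Python. The nested-tuple tree layout and the names FREQ, TREE, label, and internal_values are assumptions made for this example.

```python
FREQ = {" ": 17, "a": 12, "b": 4, "c": 5, "d": 19, "e": 12, "f": 4, ".": 4}

# Assumed layout: a leaf is a character, an internal node is a
# (left, right) tuple, matching the shape of the example tree.
TREE = ("d", ((" ", ("e", ".")), ("a", ("c", ("b", "f")))))


def label(node, freq):
    """Return (value, labeled_node); an internal value is the sum of its children."""
    if isinstance(node, str):
        return freq[node], (freq[node], node)
    left_value, left = label(node[0], freq)
    right_value, right = label(node[1], freq)
    value = left_value + right_value
    return value, (value, left, right)


def internal_values(labeled):
    """Collect the values of all internal nodes in preorder (root first)."""
    if len(labeled) == 2:  # leaf: (value, character)
        return []
    value, left, right = labeled
    return [value] + internal_values(left) + internal_values(right)


root_value, labeled_tree = label(TREE, FREQ)
print(root_value, internal_values(labeled_tree))
```

The internal values come out as 77, 58, 33, 16, 25, 13, and 8. Their sum, 230, equals the encoded length of the text under this code, because each character's frequency is counted once for every internal node above it, that is, once per bit of its code.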

Let's go through the above example using Huffman's algorithm. Here are the contents of Q after each step through the for loop:

1. Initially, all nodes are leaf nodes. We stick all 8 in Q:

    (space)  a   b  c  d   e   f  .
    17       12  4  5  19  12  4  4

2. We join two of the nodes with least value; now there are 7 things in Q:

      8
     / \
    f   .   (space)  a   b  c  d   e
    4   4   17       12  4  5  19  12

3. Then the next two with least value, Q has 6 elements:

      8                   9
     / \                 / \
    f   .   (space)  a  b   c   d   e
    4   4   17       12 4   5   19  12

4. Now the two nodes with least values are the two trees we just made, and Q has 5 elements:

         17
        /  \
       /    \
      8      9
     / \    / \
    f   .  b   c   d   e   (space)  a
    4   4  4   5   19  12  17       12

5. Q has 4 elements:

         17
        /  \
       /    \
      8      9            24
     / \    / \          /  \
    f   .  b   c   d    e    a   (space)
    4   4  4   5   19   12   12  17

6. Three items left:

              34
             /  \
            /    \
         17       (space)
        /  \      17
       /    \
      8      9            24
     / \    / \          /  \
    f   .  b   c        e    a   d
    4   4  4   5        12   12  19

7. Two big trees left:

              34                    43
             /  \                  /  \
            /    \                /    \
         17       (space)       24      d
        /  \      17           /  \     19
       /    \                 e    a
      8      9                12   12
     / \    / \
    f   .  b   c
    4   4  4   5

8. Finally, we join the whole thing up:

                         77
                       /    \
                    /          \
                 /                \
              34                    43
             /  \                  /  \
            /    \                /    \
         17       (space)       24      d
        /  \      17           /  \     19
       /    \                 e    a
      8      9                12   12
     / \    / \
    f   .  b   c
    4   4  4   5

At each point, we chose the joining that would force less frequent characters to be deeper in the tree. So an optimal prefix code is:
    character:  (space)  a    b     c     d   e    f     .
    code:       01       101  0010  0011  11  100  0000  0001

And B(T) = 17 * 2 + 12 * 3 + 4 * 4 + 5 * 4 + 19 * 2 + 12 * 3 + 4 * 4 + 4 * 4 = 212 bits, a savings of about 8% relative to the 231 bits of the fixed-length code.
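Huffman's algorithm itself can be sketched with a binary heap standing in for Q. This is a minimal Python version under stated assumptions: the names huffman and code_lengths, the nested-tuple tree layout, and the sequence-number tie-breaker are all choices made for this example. Because ties between equal values may be broken differently than in the walk-through, individual codewords can differ from those listed, but the total length of an optimal code does not depend on how ties are broken.

```python
import heapq
from itertools import count

FREQ = {" ": 17, "a": 12, "b": 4, "c": 5, "d": 19, "e": 12, "f": 4, ".": 4}


def huffman(freq):
    """Build a Huffman tree: a leaf is a character, an internal node a (left, right) tuple."""
    seq = count()  # unique sequence numbers so the heap never compares two trees
    queue = [(value, next(seq), ch) for ch, value in freq.items()]
    heapq.heapify(queue)
    for _ in range(len(freq) - 1):
        x_value, _x_seq, x = heapq.heappop(queue)  # node with least value
        y_value, _y_seq, y = heapq.heappop(queue)  # node with next least value
        heapq.heappush(queue, (x_value + y_value, next(seq), (x, y)))
    root_value, _root_seq, tree = queue[0]
    return root_value, tree


def code_lengths(node, depth=0):
    """Map each character to the depth of its leaf, i.e. the length of its code."""
    if isinstance(node, str):
        return {node: depth}
    lengths = code_lengths(node[0], depth + 1)
    lengths.update(code_lengths(node[1], depth + 1))
    return lengths


root_value, tree = huffman(FREQ)
lengths = code_lengths(tree)
# B(T): each character's frequency times the length of its code, over all 8 symbols
total_bits = sum(FREQ[ch] * lengths[ch] for ch in FREQ)
print(root_value, lengths, total_bits)
```

The loop runs n - 1 times for n characters, since every join reduces the number of items in the queue by one until a single tree remains.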