
Konstantinos D. Pechlivanis

# Huffman Coding: A Case Study of a Comparison Between Three Different Types of Documents

Abstract—We examine the results of applying Huffman coding to three documents of different styles: a novel from the 19th century, an HTML document from a modern news website, and a C language source code file.

Index Terms—Data compression, lossless source coding, entropy rate of a source, information theory

I. INTRODUCTION

Ever since the development of electronic means for the transmission and processing of information, there has been a need to reduce the volume of data transmitted. Data compression theory was formulated by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication" [1]. Shannon proved that there is a fundamental limit to lossless data compression. This limit, called the entropy rate, is denoted by H. It is possible to compress the source in a lossless manner at a rate close to H, and it can be mathematically proven that no better compression rate than H is achievable.

One important method of transmitting messages is to transmit sequences of symbols in their place [2]. For the best performance of communication systems, the most compact possible representation of messages is sought, which is achieved by removing the redundancy inherent in them. This process is called source coding. More specifically, source coding is the process of converting the sequence of symbols generated by a source into symbol sequences of the code (usually binary sequences) so as to remove the redundancy, yielding a compressed representation of the messages. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost. Lossy compression reduces bits by identifying marginally important information and removing it [5]. Examples of lossless source coding methods are Shannon coding [1], Shannon–Fano coding [3], Shannon–Fano–Elias coding [4], and Huffman coding [2], the last of which can be proven to be optimal for a specific set of symbols with specific probabilities.

II. HUFFMAN CODING

A. Description and Prerequisites

Huffman coding is the optimal code for a given probability of occurrence of the source symbols, and it can be obtained by means of a simple coding algorithm. It can be shown that no other algorithm leads to the construction of a code with a smaller average codeword length for a given source alphabet. The outcome of this coding is a variable-length code table for encoding the source, where the table has been derived in a particular way from the estimated probability of occurrence of each possible value of the source symbol [6].

K. D. Pechlivanis is head of the department of Informatics and Organization of the Mental Hospital of Corfu, Greece (e-mail: konpexl@gmail.com).

B. The Algorithm

According to Huffman's algorithm, the binary encoding of the source symbols follows the steps below:

1. The source symbols are arranged in order of decreasing transmission probability.

2. The two symbols with the lowest probabilities are combined into one, with probability equal to the sum of the probabilities of the two symbols, so that the size of the source alphabet is reduced by one.

3. Steps 1 and 2 are repeated until the source alphabet consists of only two symbols. These two symbols are assigned the digits 0 and 1 of the binary code.

4. A "0" and a "1" are assigned to the one and the other symbol, respectively, of every pair that was merged into one in step 2. This step applies to all mergers.

5. The codeword of each symbol is formed from all the bits "0" and "1" associated with it, read from bottom to top, i.e. the digits assigned directly to it or to merged symbols in which it participates.

C. Example

The frequency of the letters in the English language (according to Wikipedia) is the following:


Assume that we must use the Huffman algorithm to encode the following subset of the English letters, with the above probabilities: X = {a, b, c, d, e} and P(X) = {0.082, 0.015, 0.028, 0.043, 0.127}. According to the algorithm, we first arrange the symbols in descending order of transmission probability; the first column of the table contains the symbols and the second column their probabilities. In the next step, the two symbols with the smallest probabilities are combined into one whose probability equals the sum of theirs, and the symbols are rearranged to account for this merger. The merging and rearranging steps are repeated until only two symbols remain, the probability of the final merger being the sum of its two parts. Starting from the last probability column, we assign the digits "0" and "1" to these two symbols, respectively; in each preceding column we likewise assign "0" and "1" to the pair that was combined there, and so on. Finally, reading the assigned digits back from the last merger to each symbol (bottom to top) yields the codeword for each letter of the chosen subset.

D. Limitations

Although Huffman's algorithm is optimal for a symbol-by- symbol coding with a known input probability distribution, it is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass functions are unknown, not identically distributed, or not independent. Other methods such as arithmetic coding and LZW coding often have better compression capability: both of these methods can combine an arbitrary number of symbols for more efficient coding, and generally adapt to the actual input statistics, the latter of which is useful when input probabilities are not precisely known or vary significantly within the stream [6].

III. CASE STUDY

A. Description

In order to compare the results of applying Huffman coding to different types of human-produced text, we chose three representative documents of human activity. The first, from the field of literature, is the 1876 novel "The Adventures of Tom Sawyer" by Mark Twain [7]. The novel is clearly indicative of the folklore surrounding life on the Mississippi River around the time it was written. The second is the plain HTML document of the front page of the BBC news website on 08-11-2012 [8]. It represents the modern English spoken in the western world. Finally, the last document is the C language source code of the Huffman algorithm implementation itself. It represents a strictly technical document with a limited set of words.

B. Implementation

Two programs were used to apply Huffman coding to the three documents. The first, "letter_count.cpp", accepts a file in text format as input and counts the exact number of letters and spaces present in the whole document. It also calculates the frequency of each symbol and writes these values to a text file named "freqs.txt". The second, "huffman.c", is an implementation of the Huffman algorithm; it outputs the compressed binary code to a text file and also reports a compression percentage, computed as the ratio of compressed to original bits.

C. Results

The first document, "The Adventures of Tom Sawyer", was found to have a total of 379,055 letters and spaces altogether. The uncompressed text was encoded using 552 original bits, and after the application of Huffman coding the total bits used are 322; the compressed size is thus 58.33% of the original, i.e. 41.67% of the memory is saved. The second document, from www.bbc.com, was found to have a total of 88,687 letters and spaces altogether. The uncompressed text was encoded using 824 original bits, and after Huffman coding the total bits used are 623; the compressed size is 75.61% of the original, i.e. 24.39% saved. The third document, "huffman.c", was found to have a total of 3,367 letters and spaces altogether. The uncompressed text was encoded using 144 original bits, and after Huffman coding the total bits used are 78; the compressed size is 54.17% of the original, i.e. 45.83% saved. (The bit counts are small relative to the document sizes because "huffman.c" reads input only up to the first newline; and the percentage it prints as "saved memory" is in fact the ratio of compressed to original bits.)

IV. CONCLUSION

According to the results of applying Huffman coding to the three documents, and noting that the reported percentage is the ratio of compressed to original bits, the gain was greatest for the "huffman.c" source code (ratio 54.17%, i.e. 45.83% saved) and smallest for the bbc.com HTML document (ratio 75.61%, i.e. 24.39% saved). The frequent appearance in the HTML document of certain tags like <div>, <p>, etc. therefore did not translate into the greatest gain; this may be because the programs count only the letters a-z and the space, so the tag delimiters themselves do not enter the statistics. Also interesting is the fact that the compression ratios of the other two documents are relatively close. Further research with more documents is needed in order to reach a safe conclusion.

APPENDIX

Source code of the first C program: letter_count.cpp

```c
/* The program uses only the C standard library, so the stdio and ctype
   headers are included here (the original listing named only <iostream>,
   which does not declare printf, scanf, FILE, or tolower). */
#include <stdio.h>
#include <ctype.h>

int main()
{
    FILE *input, *output;
    char c;
    char letters[27] = {'a','b','c','d','e','f','g','h','i','j','k','l','m',
                        'n','o','p','q','r','s','t','u','v','w','x','y','z',' '};
    int count[27];
    int i = 0, lettercount = 0;
    char filename[20];

    /* get input details from user */
    printf("Type the name of the file to process: ");
    scanf("%s", filename);
    input = fopen(filename, "r");
    output = fopen("freqs.txt", "w");

    for (i = 0; i <= 26; i++) {
        count[i] = 0;
    }

    if (input == NULL)
        printf("File doesn't exist\n");
    else {
        do {
            /* get one character from the file */
            c = getc(input);
            /* all characters to lowercase */
            c = tolower(c);
            switch (c) {
                case 'a': count[0]++;  lettercount++; break;
                case 'b': count[1]++;  lettercount++; break;
                case 'c': count[2]++;  lettercount++; break;
                case 'd': count[3]++;  lettercount++; break;
                case 'e': count[4]++;  lettercount++; break;
                case 'f': count[5]++;  lettercount++; break;
                case 'g': count[6]++;  lettercount++; break;
                case 'h': count[7]++;  lettercount++; break;
                case 'i': count[8]++;  lettercount++; break;
                case 'j': count[9]++;  lettercount++; break;
                case 'k': count[10]++; lettercount++; break;
                case 'l': count[11]++; lettercount++; break;
                case 'm': count[12]++; lettercount++; break;
                case 'n': count[13]++; lettercount++; break;
                case 'o': count[14]++; lettercount++; break;
                case 'p': count[15]++; lettercount++; break;
                case 'q': count[16]++; lettercount++; break;
                case 'r': count[17]++; lettercount++; break;
                case 's': count[18]++; lettercount++; break;
                case 't': count[19]++; lettercount++; break;
                case 'u': count[20]++; lettercount++; break;
                case 'v': count[21]++; lettercount++; break;
                case 'w': count[22]++; lettercount++; break;
                case 'x': count[23]++; lettercount++; break;
                case 'y': count[24]++; lettercount++; break;
                case 'z': count[25]++; lettercount++; break;
                case ' ': count[26]++; lettercount++; break;
                default: break;
            }
        } while (c != EOF); /* repeat until EOF (end of file) */
        fclose(input);
    }

    for (i = 0; i <= 26; i++) {
        printf("%c: %d\n", letters[i], count[i]);
        fprintf(output, "%d\n", count[i]);
    }

    fclose(output);
    printf("There are %d letters in the text\n", lettercount);
    return 0;
}
```

Source code of the second C program: huffman.c

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define len(x) ((int)log10(x)+1)

int frequencies[27];

/* Node of the huffman tree */
struct node {
    int value;
    char letter;
    struct node *left, *right;
};

typedef struct node Node;

/* finds and returns the smallest sub-tree in the forest */
int findSmaller(Node *array[], int differentFrom)
{
    int smaller;
    int i = 0;

    while (array[i]->value == -1)
        i++;
    smaller = i;
    if (i == differentFrom) {
        i++;
        while (array[i]->value == -1)
            i++;
        smaller = i;
    }

    for (i = 1; i < 27; i++) {
        if (array[i]->value == -1)
            continue;
        if (i == differentFrom)
            continue;
        if (array[i]->value < array[smaller]->value)
            smaller = i;
    }

    return smaller;
}

/* builds the huffman tree and returns its address by reference */
void buildHuffmanTree(Node **tree)
{
    Node *temp;
    Node *array[27];
    int i, subTrees = 27;
    int smallOne, smallTwo;

    for (i = 0; i < 27; i++) {
        array[i] = malloc(sizeof(Node));
        array[i]->value = frequencies[i];
        array[i]->letter = i;
        array[i]->left = NULL;
        array[i]->right = NULL;
    }

    while (subTrees > 1) {
        smallOne = findSmaller(array, -1);
        smallTwo = findSmaller(array, smallOne);
        temp = array[smallOne];
        array[smallOne] = malloc(sizeof(Node));
        array[smallOne]->value = temp->value + array[smallTwo]->value;
        array[smallOne]->letter = 127;
        array[smallOne]->left = array[smallTwo];
        array[smallOne]->right = temp;
        array[smallTwo]->value = -1;
        subTrees--;
    }

    *tree = array[smallOne];

    return;
}

/* builds the table with the bits for each letter. 1 stands for
   binary 0 and 2 for binary 1 (used to facilitate arithmetic) */
void fillTable(int codeTable[], Node *tree, int Code)
{
    if (tree->letter < 27)
        codeTable[(int)tree->letter] = Code;
    else {
        fillTable(codeTable, tree->left, Code * 10 + 1);
        fillTable(codeTable, tree->right, Code * 10 + 2);
    }

    return;
}

/* invert the codes in codeTable2 so they can be used with the
   mod operator by the compressFile function */
void invertCodes(int codeTable[], int codeTable2[])
{
    int i, n, copy;

    for (i = 0; i < 27; i++) {
        n = codeTable[i];
        copy = 0;
        while (n > 0) {
            copy = copy * 10 + n % 10;
            n /= 10;
        }
        codeTable2[i] = copy;
    }

    return;
}

/* function to compress the input */
void compressFile(FILE *input, FILE *output, int codeTable[])
{
    char bit, c, x = 0;
    int n, length, bitsLeft = 8;
    int originalBits = 0, compressedBits = 0;

    /* reads input only up to the first newline (ASCII 10) */
    while ((c = fgetc(input)) != 10) {
        originalBits++;
        if (c == 32) {             /* space */
            length = len(codeTable[26]);
            n = codeTable[26];
        }
        else {
            length = len(codeTable[c - 97]);
            n = codeTable[c - 97];
        }

        while (length > 0) {
            compressedBits++;
            bit = n % 10 - 1;
            n /= 10;
            x = x | bit;
            bitsLeft--;
            length--;
            if (bitsLeft == 0) {
                fputc(x, output);
                x = 0;
                bitsLeft = 8;
            }
            x = x << 1;
        }
    }

    if (bitsLeft != 8) {
        x = x << (bitsLeft - 1);
        fputc(x, output);
    }

    /* print details of compression on the screen; note that the
       percentage printed is compressedBits as a share of originalBits*8,
       i.e. the compression ratio, not the memory saved */
    fprintf(stderr, "Original bits = %d\n", originalBits * 8);
    fprintf(stderr, "Compressed bits = %d\n", compressedBits);
    fprintf(stderr, "Saved %.2f%% of memory\n",
            ((float)compressedBits / (originalBits * 8)) * 100);

    return;
}

int main()
{
    Node *tree;
    int codeTable[27], codeTable2[27];
    int i, n;
    char filename[20];
    FILE *input, *freqsin, *output;

    /* get input details from user */
    printf("Type the name of the file to process: ");
    scanf("%s", filename);
    input = fopen(filename, "r");
    freqsin = fopen("freqs.txt", "r");
    output = fopen("output.txt", "w");

    for (i = 0; i <= 26; i++) {
        fscanf(freqsin, "%d", &n);
        frequencies[i] = n;
        printf("frequencies[%d]=%d\n", i, frequencies[i]);
    }

    buildHuffmanTree(&tree);
    fillTable(codeTable, tree, 0);
    invertCodes(codeTable, codeTable2);
    compressFile(input, output, codeTable2);

    return 0;
}
```

REFERENCES

[1] C. E. Shannon, "A Mathematical Theory of Communication," The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, July, October 1948.
[2] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the I.R.E., September 1952.
[3] R. M. Fano, "The Transmission of Information," Technical Report No. 65, Research Laboratory of Electronics at MIT, Cambridge, Mass., USA, 1949.
[4] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. John Wiley and Sons, 2006, pp. 127–128, ISBN 978-0-471-24195-9.
[5] Data Compression, Wikipedia. Available: http://en.wikipedia.org/wiki/Data_compression
[6] Huffman Coding, Wikipedia. Available: http://en.wikipedia.org/wiki/Huffman_coding
[7] The Project Gutenberg. Available: http://www.gutenberg.org/files/74/74-
[8] BBC – Homepage. Available: http://www.bbc.com

Konstantinos D. Pechlivanis was born in Thessaloniki in 1967. He studied mathematics at the Aristotle University of Thessaloniki, Greece (1992), and then informatics at the Alexander Technological Educational Institute of Thessaloniki, Greece (2006). He worked as a teacher of mathematics in private and public schools from 1993 until 2007. During his studies in informatics he worked for the Center for Software Innovation, Sonderborg, Denmark, for vocational training, under a scholarship from the Leonardo Programme of the European Union, from 2005 until 2006. He also worked as a teacher of informatics in primary and high school from 2006 to 2007. Since 2007 he has been head of the department of Informatics and Organization of the Mental Hospital of Corfu, Greece. His research interests are in the field of formal methods for requirements specification.