
2014 Sixth International Conference on Computational Intelligence and Communication Networks

A New Approach for Compression on Textual Data

Amit Jain, Avinash Panwar, Divya Bhatnagar, Akhilesh Sharma


Computer Science and Engineering Department
Sir Padampat Singhania University
Udaipur, India
Most of today's familiar lossless compression algorithms
operate in streaming mode, reading a single byte or a few bytes
at a time. With the BWT, however, we want to operate on the
largest chunks of data possible. Since the BWT operates on
data in memory, a file may be too big to process in one pass;
in that case, the file should be split and processed one block
at a time [1][14]. The output of the BWT is usually piped
through a move-to-front stage, then a run-length encoder stage,
and finally an entropy encoder, normally arithmetic or Huffman
coding. The actual command line to perform this sequence looks
like this:

Abstract— Data compression algorithms are used to reduce
redundancy and storage requirements and to cut communication
costs, while data encryption is used to protect data from
unauthorized users. Due to the unprecedented explosion in
the amount of digital data transmitted via the Internet, it has
become necessary to develop compression algorithms that use
the available network bandwidth effectively while also taking
into account the security of the compressed data transmitted
over the Internet. This paper presents an encoding technique
that offers high compression ratios. An intelligent and reversible
transformation is applied to the source text to improve the
ability of subsequent algorithms to compress the transmitted
data. The results show that the proposed method outperforms
many other popular techniques with respect to compression
ratio and speed of compression, at the cost of some additional
processing on the server/nodes.

BWT < input-file | MTF | RLE | ARI > output-file
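The stages of this pipeline can be sketched in a few lines of Python. This is an illustrative toy, not the actual BWT/MTF/RLE/ARI tools named above: the BWT here is a naive O(n² log n) rotation sort that appends an assumed 0x00 sentinel, and the final entropy-coding stage is omitted.

```python
# Toy sketch of the BWT -> MTF -> RLE pipeline (entropy coder omitted).

def bwt(data: bytes) -> bytes:
    block = data + b"\x00"  # sentinel byte, assumed absent from the input
    # Sort all rotations of the block; the transform is the last column.
    rots = sorted(block[i:] + block[:i] for i in range(len(block)))
    return bytes(r[-1] for r in rots)

def mtf(data: bytes) -> bytes:
    table = list(range(256))
    out = bytearray()
    for b in data:
        i = table.index(b)
        out.append(i)                   # emit current position of the symbol
        table.insert(0, table.pop(i))   # then move the symbol to the front
    return bytes(out)

def rle(data: bytes) -> bytes:
    # Encode runs as (count, symbol) pairs, count capped at 255.
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and j - i < 255 and data[j] == data[i]:
            j += 1
        out += bytes([j - i, data[i]])
        i = j
    return bytes(out)

compressed = rle(mtf(bwt(b"banana_banana_banana")))
```

The clustering produced by the BWT is what makes the later move-to-front and run-length stages effective: a run of identical bytes becomes a run of zeros after MTF, which RLE then collapses.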


The decompression is just the reverse process and looks like this:

UNARI < input-file | UNRLE | UNMTF | UNBWT > output-file

Keywords— Data Compression, Dictionary Based Encoding, Lossless Compression

I. INTRODUCTION

An alternate approach to this is to perform a lossless,


reversible transformation to a source file prior to applying an
existing compression algorithm. The transformation is
designed to make it easier to compress the source file. The star
encoding is generally used for this type of preprocessing
transformation of the source text. Star-encoding works by
creating a large dictionary of commonly used words expected
in the input files. The dictionary must be prepared in advance,
and must be known to the compressor and decompressor.

Over the last decade we have seen an unprecedented
explosion in the amount of text data transmitted via the Internet
in the form of digital libraries, search engines, etc. Text data
accounts for about 45% of total Internet traffic, so data need to
be compressed to reduce that traffic and allow larger amounts
of information to be transmitted. A number of sophisticated
algorithms have been proposed for lossless text compression,
such as Huffman encoding, arithmetic encoding, the Lempel-Ziv
(LZ) family, Dynamic Markov Compression (DMC), Prediction
by Partial Matching (PPM), and Burrows-Wheeler Transform
(BWT) based algorithms [8]. However, none of these algorithms
achieves the best-case compression ratio.

Each word in the dictionary has a star-encoded equivalent,
in which as many letters as possible are replaced by the '*'
character. For example, a commonly used word such as the might
be replaced by the string t**. The star-encoding transform
simply replaces every occurrence of the word the in the input
file with t**.
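The transform described above can be sketched as a simple word-for-word substitution. The dictionary below is a tiny stand-in (only the/t** comes from the text; the other entries are invented for illustration), whereas the real scheme assumes a large dictionary agreed in advance by compressor and decompressor:

```python
# Toy star-encoding sketch; entries other than "the" are hypothetical.
# Encoded forms must be distinct, or decoding becomes ambiguous.
STAR_DICT = {"the": "t**", "what": "*h**", "light": "l****"}
REVERSE = {v: k for k, v in STAR_DICT.items()}

def star_encode(text: str) -> str:
    # Replace each known word by its starred form; leave others untouched.
    return " ".join(STAR_DICT.get(w, w) for w in text.split())

def star_decode(text: str) -> str:
    return " ".join(REVERSE.get(w, w) for w in text.split())
```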

Michael Burrows and David Wheeler introduced the BWT, a
transformation function that opens the door to some
revolutionary new data compression techniques. The BWT is
performed on an entire block of data at once and transforms it
into a format that is extremely well suited for compression. The
block-sorting algorithm they developed works by applying a
reversible transformation to a block of input text. The
transformation does not itself compress the data, but reorders
it to make it easy to compress with simple algorithms such as
move-to-front encoding. The philosophy of secure compression
is to preprocess the text, transforming it into some intermediate
form that can be compressed with better efficiency, and to
exploit the natural redundancy of the language in making the
transformation [7][10].
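To see why the transform is reversible, the classic textbook inversion rebuilds the sorted rotation table by repeatedly prepending the transformed column and re-sorting. This is a naive sketch for intuition only; it assumes the block was terminated with a 0x00 sentinel byte before transforming, so exactly one reconstructed row ends in it:

```python
# Naive inverse BWT, O(n^2 log n), for intuition only.
def inverse_bwt(last: bytes) -> bytes:
    table = [b""] * len(last)
    for _ in range(len(last)):
        # After k rounds each row holds the first k bytes of one rotation,
        # in sorted order; prepending the last column and re-sorting
        # extends every row by one correct byte.
        table = sorted(bytes([c]) + row for c, row in zip(last, table))
    original = next(row for row in table if row.endswith(b"\x00"))
    return original[:-1]  # strip the sentinel
```

Production implementations invert the transform in linear time with a counting argument rather than rebuilding the table, but the result is the same.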
978-1-4799-6929-6/14 $31.00 2014 IEEE
DOI 10.1109/CICN.2014.41

Ideally, the most common words will have the highest
percentage of '*' characters in their encoding. When done
properly, the transformed file will have a huge number of '*'
characters, which ought to make it more compressible than the
original plain text. The existing star encoding does not provide
any compression as such, but gives the input text a form that is
more compressible by a later-stage compressor. This encoding
is, however, weak and vulnerable to attacks [14].
As an example, a section of text from Project Gutenberg's
version of Romeo and Juliet looks like this in the original text:

But soft, what light through yonder window breaks?
It is the East, and Iuliet is the Sunne,
Arise faire Sun and kill the enuious Moone,
Who is already sicke and pale with griefe,
That thou her Maid art far more faire then she

Running this text through the star-encoder yields the
following text:

B** *of*, **a* **g** *****g* ***d*r ***do* b*e***?
It *s *** E**t, **d ***i** *s *** *u**e,
A***e **i** *un **d k*** *** e****** M****,
*ho *s a****** **c*e **d **le ***h ****fe,
***t ***u *e* *ai* *r* f*r **r* **i** ***n s**

It is clearly seen that the encoded data has exactly the same
number of characters, but is dominated by stars [13]. It appears
more compressible, but it does not offer any serious challenge
to an attacker.

II. PROPOSED WORK

Here we present an encoding technique that offers high
compression ratios. The objective is to develop a better
transformation yielding greater compression. The basic
philosophy is to transform the text into some intermediate form
that can be compressed with better efficiency, exploiting the
natural features of the language in making this transformation.
The algorithm we developed is a two-step process:

Step 1: Make an intelligent dictionary.
Step 2: Encode the input text data.

The entire process can be summarized as follows.

A. Dictionary Making Algorithm

Start making the dictionary with the text source file as input.
1. Extract all words from the input file.
2. If a word is already in the table, increment its number of
occurrences by 1; otherwise add it to the table and set its
number of occurrences to 1.
3. Sort the table by frequency of occurrence in descending
order.
4. Start giving codes using the following method:
i) Give the first 218 words the ASCII characters from 33 to
250 as their codes.
ii) For the remaining words, give each one a permutation of
two of the ASCII characters (in the range 33 to 250), taken in
order. If there are any remaining words, give them each a
permutation of three of the ASCII characters (in the range 33
to 250), and finally, if required, a permutation of four
characters, again from the same range.
5. Create a new table having only the words and their codes,
with the words in sorted order. Store this table as the
dictionary in a file.
6. Stop.

B. Encoding Algorithm

Start encoding with the input file ipf as argument.
1. Read the dictionary and store all words and their codes in a
table.
2. While ipf is not empty:
i) Read the characters from ipf and form tokens.
a. Search the table for the code of the token.
b. Find the length of the code for the word.
ii) The actual code consists of the length concatenated with
the code from the table. The length serves as a marker while
decoding and is represented by the ASCII characters 251 to 254,
with 251 representing a code of length 1, 252 representing a
code of length 2, etc.
iii) Write the actual code into the output file.
iv) Read the next character from the input file and neglect it
if it is a space; go back to step i) after inserting a marker
character (ASCII 255) to indicate the presence of the space.
End (While)
3. Stop.

C. Decoding Algorithm

Start decoding with the input file ipf as argument.
1. While ipf is not empty:
i) Read the actual code from the input file.
ii) Read the length of the code and search the dictionary for
the code among the codes of that particular length.
iii) If the code is found, write the associated word into the
output file.
iv) Read the next character from the input file and, if it is
the marker character (ASCII 255), go back to step i) after
inserting a space character into the output file.
End (While)
2. Stop.
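The three algorithms above can be sketched as follows. This is a simplified model under stated assumptions, not the paper's implementation: tokens are naive space-separated words, codes are byte strings over the 218 characters 33-250, a length marker 251-254 precedes each code, and byte 255 marks a space.

```python
from itertools import product

ALPHABET = list(range(33, 251))            # the 218 usable code characters

def make_dictionary(words):
    # Steps 1-3: count distinct words and rank by descending frequency.
    freq = {}
    for w in words:
        freq[w] = freq.get(w, 0) + 1
    ranked = sorted(freq, key=lambda w: -freq[w])
    # Step 4: assign codes shortest-first (1-byte, then 2-, 3-, 4-byte).
    gen = (bytes(p) for n in range(1, 5) for p in product(ALPHABET, repeat=n))
    return {w: next(gen) for w in ranked}

def encode(text, codes):
    out = bytearray()
    for w in text.split(" "):
        c = codes[w]
        out.append(250 + len(c))           # length marker: 251..254
        out += c
        out.append(255)                    # space marker between words
    return bytes(out[:-1])                 # no marker after the last word

def decode(data, codes):
    rev = {v: k for k, v in codes.items()}
    words, i = [], 0
    while i < len(data):
        n = data[i] - 250                  # recover the code length
        words.append(rev[bytes(data[i + 1:i + 1 + n])])
        i += n + 1
        if i < len(data) and data[i] == 255:
            i += 1                         # skip the space marker
    return " ".join(words)
```

Because the length marker is read before the code itself, the decoder never has to guess where one code ends and the next begins, which is what makes the variable-length codes unambiguous.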


As an example, the same section of text from Project
Gutenberg's version of Romeo and Juliet in the original text:

But soft, what light through yonder window breaks?
It is the East, and Iuliet is the Sunne,
Arise faire Sun and kill the enuious Moone,
Who is already sicke and pale with griefe,
That thou her Maid art far more faire then she

Running this text through the proposed approach yields the
following text:

!  # $ % &  (
) * + , - . * + /
0 1 2 . 3 + 4 5
6 * 7 8 - 9 : ;
< = > ? @ A B C D E

You can clearly see that the encoded data is completely
represented in ASCII character form. It is also clear from the
above algorithm that the encoded text provides better
compression and a stiff challenge to an attacker. It may look
as if the encoded text can be attacked using a conventional
frequency analysis of the words in the encoded text, but a
detailed inspection of the dictionary making algorithm reveals
that this is not so. An attacker can decode the encoded text
only if he knows the dictionary. The dictionary, on the other
hand, is created dynamically; it depends on the nature of the
text being encoded, and the nature of the text differs between
different sessions of communication between a server and a
client. In addition, we suggest a stronger encryption strategy
for the dictionary transfer. A proper dictionary management and
transfer protocol can be adopted for more secure data transfer.

III. PERFORMANCE ANALYSIS

The performance measures Bits Per Character (BPC) and
conversion time (in seconds) are compared for three cases,
i.e., simple BWT, star encoding, and the proposed method. The
results are shown graphically and indicate that the proposed
method performs better than the other techniques in both
compression ratio and speed of compression (conversion time).

Fig. 1. BPC comparison of BWT, *Encoding and the proposed method.
[chart omitted; x-axis: file name, y-axis: BPC]

Fig. 2. Conversion time comparison of BWT, *Encoding and the proposed
method. [chart omitted; x-axis: file name, y-axis: conversion time]

TABLE I. BPC AND TIME COMPARISON OF SIMPLE BWT, *ENCODE AND
PROPOSED METHOD

File Name | File Size (Kb) | BWT BPC | BWT Time | *Encoding BPC | *Encoding Time | Proposed BPC | Proposed Time
Bib       | 108.7 | 2.11 |  1 | 1.93 |  6 | 1.69 |  4
Book1     | 750.8 | 2.85 | 11 | 2.74 | 18 | 2.36 | 11
Book2     | 596.5 | 2.43 |  9 | 2.33 | 14 | 2.02 | 10
Geo       | 100.0 | 4.84 |  2 | 4.84 |  6 | 5.18 |  5
News      | 368.3 | 2.83 |  6 | 2.65 | 10 | 2.37 |  7
Paper1    |  51.9 | 2.65 |  1 | 1.59 |  5 | 2.26 |  3
Paper2    |  80.3 | 2.61 |  2 | 2.45 |  5 | 2.14 |  4
Paper3    |  45.4 | 2.91 |  2 | 2.60 |  6 | 2.27 |  3
Paper4    |  13.0 | 3.32 |  2 | 2.79 |  5 | 2.52 |  3
Paper5    |  11.7 | 3.41 |  1 | 3.00 |  4 | 2.8  |  2
Paper6    |  37.2 | 2.73 |  1 | 2.54 |  5 | 2.38 |  3
Progc     |  38.7 | 2.67 |  2 | 2.54 |  5 | 2.44 |  3
Prog1     |  70.0 | 1.88 |  1 | 1.78 |  5 | 1.70 |  3
Trans     |  91.5 | 1.63 |    | 1.53 |    | 1.46 |

IV. CONCLUSION

In an ideal channel, the reduction of transmission time is
directly proportional to the amount of compression, but in a
typical Internet scenario with fluctuating bandwidth, congestion
and packet-switching protocols, this does not hold true. Our
results have shown excellent improvement in text data
compression and added levels of security over the existing
methods. With our algorithm we are able to encode a file with
up to 2,26,89,38,550 words; if a single page contains 35 lines
of 15 words each, we can process a file of 43,21,788 pages.
If the plain text contains more repeated words, this affects
only the size of the dictionary, not the encrypted text. The
compression ratio is affected by the size of the words. These
improvements come at the cost of additional processing on the
server/nodes.
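The capacity figure above can be checked directly: with the 218 usable ASCII characters and code lengths one through four, the code space works out exactly to the number quoted (shown here in plain digits rather than the Indian grouping used above).

```python
# 218 one-character codes plus all two-, three- and four-character
# code strings (with repetition) over the same 218-character range.
capacity = sum(218 ** n for n in range(1, 5))
words_per_page = 35 * 15            # 35 lines per page, 15 words per line
pages = capacity / words_per_page
print(capacity, round(pages))       # 2268938550 words, ~4321788 pages
```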
V. FUTURE ASPECTS

The method can be enhanced by devising a suitable
dictionary management and transfer protocol to make the
system least vulnerable to possible attacks by hackers. One
suggested method for dictionary transfer between server and
client is the SSL (Secure Socket Layer) Record Protocol, which
provides basic security services to various higher-level
protocols such as HTTP.

REFERENCES
[1] M. Burrows and D. J. Wheeler, "A Block-sorting Lossless Data
Compression Algorithm," SRC Research Report 124, Digital Systems
Research Center.
[2] N. J. Larsson, "The Context Trees of Block Sorting Compression,"
Proceedings of the IEEE Data Compression Conference, March 1998,
pp. 189-198.
[3] A. Moffat, "Implementing the PPM Data Compression Scheme," IEEE
Transactions on Communications, COM-38, 1990, pp. 1917-1921.
[4] T. Welch, "A Technique for High-Performance Data Compression,"
IEEE Computer, Vol. 17, No. 6, 1984.
[5] M. Nelson, The Data Compression Book, BPB Publications.
[6] R. Franceschini, H. Kurse, N. Zhang, R. Iqbal and A. Mukherjee,
"Lossless, Reversible Transformations that Improve Text Compression
Ratios," submitted to IEEE Transactions on Multimedia Systems, June
2000.
[7] F. Awan and A. Mukherjee, "LIPT: A Lossless Text Transform to
Improve Compression," Proceedings of the International Conference on
Information Technology: Coding and Computing, IEEE Computer Society,
Las Vegas, Nevada, April 2001.
[8] N. Motgi and A. Mukherjee, "Network Conscious Text Compression
Systems (NCTCSys)," Proceedings of the International Conference on
Information Technology: Coding and Computing, IEEE Computer Society,
Las Vegas, Nevada, April 2001.
[9] R. Franceschini and A. Mukherjee, "Data Compression Using
Encrypted Text," Proceedings of the Third Forum on Research and
Technology, Advances in Digital Libraries, 1996.
[10] F. Awan, N. Zhang, N. Motgi, R. Iqbal and A. Mukherjee, "LIPT: A
Reversible Lossless Text Transformation to Improve Compression
Performance," Proceedings of the Data Compression Conference,
Snowbird, Utah, March 2001.
[11] A. Jain and K. I. Lakhtaria, "Comparative Study of Dictionary
Based Compression Algorithms on Text Data," International Journal of
Computer Engineering and Applications, Vol. VI, Issue II, pp. 55-65,
April 2014.
[12] A. Jain, K. I. Lakhtaria and P. Srivastava, "A Comparative Study
of Lossless Compression Algorithms on Text Data," International
Conference on Advances in Computer Science, AETACS-2013, Elsevier
Digital Library, pp. 536-543.
[13] A. C. Chandrathil, "Intelligent Dictionary Based Encryption and
Compression Algorithm,"
http://www.scribd.com/doc/19800200/Intelligent-Dictionary-Based-Encryption-And-Compression-Algorithm
[14] "Techniques For Fast And Secure Data Transmission," Computer
Science Essay,
http://www.ukessays.com/essays/computer-science/techniques-for-fast-and-secure-data-transmission-computer-science-essay.php