
2011 International Conference on Electrical Engineering and Informatics

17-19 July 2011, Bandung, Indonesia



An FPGA Implementation of a Simple Lossless Data
Compression Coprocessor
Armein Z. R. Langi
ITB Research Center on Information and Communication Technology
DSP-RTG, Information Technology Research Division
School of Electrical Engineering and Informatics
ITB Microelectronics Center
Institut Teknologi Bandung, Jalan Ganeca 10 Bandung, Indonesia 40132
langi@dsp.itb.ac.id

Abstract: The paper describes a Field Programmable Gate Array (FPGA)-based lossless data compression coprocessor implementing a compression method developed by Rice. We have implemented the Rice code (both encoder and decoder) for 8 bit/sample data on an FPGA Xilinx XC4005. The code has been designed to be optimal on 1.5 < H < 7.5 bits/sample, the range usually required in lossless image compression. The encoder and decoder can achieve 11.6 MHz and 19.4 MHz clocks, respectively, where a 10 MHz clock corresponds to a 1.5 Mbit/s throughput. An XC4005 contains 196 combinatorial logic units (CLU) and 112 user I/O pins. The Rice encoder uses 30% of CLB F&G, 15% of CLB H, 16% of CLB FF, and 34% of I/O pins. The Rice decoder uses 31% of CLB F&G, 19% of CLB H, 16% of CLB FF, and 34% of I/O pins. Hence, an XC4005 is sufficient to implement both encoder and decoder.

Keywords: Rice coder, Lossless Compression, FPGA Implementation.
I. INTRODUCTION
The paper describes a Field Programmable Gate Array
(FPGA)-based integrated circuit (IC) of a multipurpose
lossless data compression coprocessor that uses simple counters
to implement a compression method developed by R. F. Rice
[1]. Data compression reduces the number of bits normally
required for representing digital data [2]. Hence, data
compression allows more bits to be transmitted through rate
limited channels or to be stored in limited storage space [3].
We have shown in lossless and near-lossless image
coding that the Rice coder performs well (comparably to an
arithmetic coder, AC), and in some cases of lossless wavelet
image coding it outperforms a Huffman coder [4]. In this
paper, we show that such a high-performance coder can be
implemented in hardware with much lower complexity (i.e.,
using simple counters) than that of AC or Huffman coders.
A hardware implementation allows a dedicated
compression subsystem to be integrated into a system without
putting any computation burden on the host processor. In
anticipation of the need for compression core modules, such as
intellectual property (IP) modules, we have designed the coder
hardware to be reusable for other hardware designs [5]. The
FPGA platform is selected as the target platform to verify the
design.
We have implemented the Rice coder (both encoder and
decoder) for 8 bit/sample data on an FPGA Xilinx XC4005.
The encoder and decoder can achieve more than 1.74 Mbit/s
throughput, while using less than 68% of the XC4005 resources.
An XC4005 is sufficient to implement both encoder and decoder.
The paper is organized as follows. Section II describes the
Rice coder, showing that it uses up-counting and down-
counting processes. Section III describes a computing model
and an architecture of Rice coding, especially its data path
unit (DPU). Section IV describes the FPGA design of a
coprocessor implementing Rice coding scheme. Section V
discusses its performance. Finally, Section VI provides
concluding remarks.
II. THE RICE CODER ALGORITHM
Following more detailed explanations in [4] and [5], a basic
counter code PSI1 [1], shown in Table 1, works as follows:
1. Given a block of data samples (for j = 1, ..., J), PSI1
assumes that each sample takes a symbol s_i, for i = 1, ..., 256.
2. Coder PSI1 treats each sample symbol s_i having sample
data d_i as a non-negative integer number x_i. The average
length of sample data is R = 8 bits per sample.
3. For every sample in the block (having a symbol s_i, thus
having sample data d_i), a PSI1 encoder converts d_i into a
codeword w_i of a length l_i = x_i + 1, consisting of x_i
consecutive zero bits 0 followed by a closing one bit 1
(see again Table 1). For example, if a sample happens to be d_i
= 0000 0011, it must have x_i = 3, and the PSI1 then encodes
it using 4 bits, i.e., 3 zeros followed by a one. Hence, the
encoding algorithm (converting d_i into w_i) is simply down
counting, which is summarized in Table 1.
4. The reconstruction is obviously simple counting too.
Given a codeword w_i, a PSI1 decoder just counts the number
of zero bits until a one appears. The counting result is the
sample value x_i. Using Table 1, it determines that the
codeword belongs to a symbol s_i, hence it produces the
sample data d_i as its output.
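The counting code above can be sketched in a few lines of Python (an illustrative model only, not the hardware design; it follows the zeros-then-closing-one polarity of Table 1):

```python
def psi1_encode(x: int) -> str:
    """Encode a non-negative sample value x as x zero bits plus a closing 1."""
    return "0" * x + "1"

def psi1_decode(bits: str) -> tuple[int, str]:
    """Count zeros until a 1 appears; return the sample value and the rest."""
    x = 0
    while bits[x] == "0":
        x += 1
    return x, bits[x + 1:]

# Example from step 3: sample data 0000 0011 has x = 3,
# so PSI1 encodes it in 4 bits: three zeros followed by a one.
print(psi1_encode(3))          # 0001
print(psi1_decode("0001")[0])  # 3
```

Note that encoding is down counting (emit a zero, decrement x until it reaches 0) and decoding is up counting, which is what makes the hardware so simple.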

H15 - 7
978-1-4577-0752-0/11/$26.00 2011 IEEE
TABLE I
A CODEWORD TABLE OF THE PSI1 CODE (SYMBOLS s_i, SAMPLE DATA d_i,
SAMPLE VALUES x_i, CODEWORDS w_i, LENGTHS l_i).

i     s_i     d_i          x_i    w_i           l_i
1     s_1     0000 0000    0      1             1
2     s_2     0000 0001    1      01            2
3     s_3     0000 0010    2      001           3
4     s_4     0000 0011    3      0001          4
...
256   s_256   1111 1111    255    000...0001    256

III. A DATA PATH ARCHITECTURE
It has been studied elsewhere [1] that this PSI1 code in
Table 1 is optimal for a monotonically decreasing distribution
source with a first-order entropy around 2, i.e., 1.5 < H < 2.5.
For input samples with entropies outside that range, Rice
introduced a concept of word splitting.
If H > 2.5, it is safe to assume that the k least significant
bits (LSBs) of sample data d^(j) are completely random. In
this case there is no need to perform any compression on those
LSBs. An encoder can then split d^(j) into two portions: k
LSBs and (8-k) most significant bits (MSBs), as shown in Fig.
1. The MSBs are coded using a PSI1 encoder before being sent
to the bitstream, while the LSBs are sent uncoded (or said to be
coded using PSI3). A decoder must first recover the MSBs
using a PSI1 decoder, and then concatenate the results with
the uncoded LSBs, resulting in the desired d^(j).

Fig. 1 Data structure of input and output of encoder.

This counter compression with word splitting code (into k
LSBs and (8-k) MSBs) is called PSI1,k (see Fig. 2). It has
been shown in [1] that PSI1,k has a natural entropy range of
1.5 + k < H < 2.5 + k. Given a block of data samples (for j
= 1, ..., J), the PSI1,k coder must then have a mechanism to
estimate the entropy of the block to ensure it uses the optimal
k. Rice has come up with an estimation rule of thumb based on
the sum of x_i values in the block [1].
Fig. 2 A PSI1,k Rice encoder.
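The word-splitting code can be sketched per sample as follows (a Python model under the assumption of 8-bit samples; function names are illustrative, and the PSI1 part again uses the zeros-then-one polarity of Table 1):

```python
def psi1k_encode_sample(d: int, k: int) -> str:
    """Split an 8-bit sample into k LSBs (sent raw) and 8-k MSBs (PSI1 coded)."""
    lsb = format(d & ((1 << k) - 1), f"0{k}b") if k else ""
    msb = d >> k
    return lsb + "0" * msb + "1"   # k raw bits, then msb zeros and a closing 1

def psi1k_decode_sample(bits: str, k: int) -> tuple[int, str]:
    """Read k raw LSBs, count zeros to recover the MSBs, then concatenate."""
    lsb = int(bits[:k], 2) if k else 0
    i = k
    while bits[i] == "0":
        i += 1
    msb = i - k
    return (msb << k) | lsb, bits[i + 1:]

print(psi1k_encode_sample(5, 2))  # 0101
```

With k = 0 this reduces to plain PSI1; larger k trades raw bits for shorter unary runs, which is why each PSI1,k variant covers a shifted entropy range 1.5 + k < H < 2.5 + k.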
IV. COPROCESSOR DESIGN
Having defined the algorithm of the Rice coder, we can now
design a coprocessor to implement the coder. First we define
design assumptions, especially its external operating
environment. We then choose a basic coprocessor computing
model. The model consists of a control unit (CU) and a data
path unit (DPU). We then describe encoder and decoder DPUs.
Later we describe input-output assignments to facilitate
interactions of the coprocessor with its external environment.
A. Design Assumptions
We assume in a real environment, the coder consists of an
encoder and a decoder separated physically by a long distance
communication channel (see Fig. 3). The channel is bit
oriented, meaning data are transferred one way from encoder
to decoder bit-by-bit serially. Data input of the encoder are
available locally, stored in memory. The encoder can access the
memory through a bus oriented channel. Similarly, output data
of the decoder must be stored to memory for further use.

Fig. 3 The encoder accepts input data from parallel bus memory and
produces bitstream to the serial channel, and the decoder does the other way around.

B. Basic Coprocessor Computing Model
We propose to use a basic coprocessor computing model,
shown in Fig. 4, to satisfy the above environmental
requirements. The coprocessor interacts with three major
external subsystems: Host Processor, Memory, and Channel
I/O. The coprocessor consists of a CU and a DPU. A host
processor ultimately controls the coprocessor, giving
commands to the coprocessor to perform its functions.
Both the encoder and decoder use the same computing
model. An encoder DPU gets input data from memory,
performs the actual data compression, and sends data to
channel I/O. Conversely, a decoder DPU gets input from the
channel, performs decoding processes, and stores the results
into memory.

Fig. 4 The coprocessor interacts with a control oriented host processor, a bus
oriented memory, and a bit oriented channel I/O.

The CU manages and controls the DPU to ensure
synchronized interactions with the external subsystems. The
CU accepts commands from and gives status signals to the host
processor through Host I/O. Interactions with memory are
controlled through bus control signals. Channel control
signals manage interactions with the transmission channel.
Optionally, we provide test signals for testability purposes.
C. Encoder DPU
A DPU of the encoder is shown in Fig. 5. Data samples, 8
bits each, enter the DPU and populate a first-in first-out (FIFO)
buffer of size J. Simultaneously, an accumulator adds up all
J sample values to estimate the block entropy and then
determines the split size k for the block in the buffer. The DPU
sends the split-size code k to the output bitstream to indicate
the selected Rice internal coder. One by one the samples are
then loaded into a parallel-to-serial converter. The k LSBs are
shifted through a pass-through channel to the output bitstream.
The remaining 8-k MSBs are loaded into a down counter, and
the down counter starts down counting while sending 1 bits
to the output bitstream, until the down counter reaches 0. The
encoder then sends a closing bit 0. The DPU repeats the
process until all J samples in the buffer have been encoded.

Fig. 5 An encoder DPU.

D. Decoder DPU
A decoder DPU is shown in Fig. 6. It first receives the
information of the split size k. Using the information, it accepts
the incoming k bits as sample LSBs, and stores them in a
concatenating register. The DPU then empties an upcounter
and uses the next incoming 1 bits to upcount the counter.
The upcounting stops when a closing 0 is received. The
contents of the counter are the MSBs. The concatenating register
combines the LSBs and MSBs to recover the complete sample.
The DPU then stores the sample into output memory. The
process is repeated until J samples have been decoded.

Fig. 6 A decoder DPU.

E. Encoder Input/Output Assignment
To ensure synchronized interactions with external
subsystems, we design the encoder coprocessor to interact
with various signals (see Table II). Through the signals the
coprocessor interacts with the host processor, memory, and
channel I/O. Optionally, we can observe the internal working of
the coprocessor through test pins C_T.

TABLE II
PIN ASSIGNMENT FOR THE ENCODER.

No     Pin        I/O    #Bits   Usage
1      CLK        IN     1       Clock signal
2      RST        IN     1       Reset signal
3      START      IN     1       Start processing
4-11   ENC_IN     IN     8       Samples in
12     ENC_OUT    OUT    1       Stream out
13     ENC_RDY    OUT    1       Encoder ready
14     EMEM_RDY   IN     1       Memory ready
15-17  ENC_ADDR   OUT    3       Memory address
18     RD         OUT    1       Ready to read data
19     SENDRQ     OUT    1       Request to send results
20     SYNCH      OUT    1       Serial synchronization
21     SENDACK    IN     1       Acknowledge to receive results
22-38  C_T        OUT    17      Test pins

When the coprocessor is ready to receive commands from the
host processor to process a block of data, the coprocessor
issues ENC_RDY to the host. It is the responsibility of the host to
put a block of input data into memory. The host then issues
START to trigger the coprocessor. The coprocessor turns off
ENC_RDY and starts processing the input data. When the whole
block has been processed, the coprocessor activates ENC_RDY
to tell the host that it is ready to process the next block.
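Section III notes that the encoder DPU must pick a split size k per block, but the paper only cites Rice's sum-of-x rule of thumb without reproducing it. The sketch below therefore uses a hypothetical stand-in: it exhaustively picks the k that minimizes the coded block length (illustrative Python, not Rice's actual rule or the hardware accumulator):

```python
def block_code_length(xs, k):
    """Coded bits for a block at split k: per sample, k raw LSB bits
    plus (msb + 1) bits for the counter-coded MSB part."""
    return sum(k + (x >> k) + 1 for x in xs)

def choose_k(xs, max_k=7):
    """Stand-in for Rice's sum-of-x rule of thumb: try every split size
    and keep the one giving the shortest block."""
    return min(range(max_k + 1), key=lambda k: block_code_length(xs, k))

print(block_code_length([3], 0))          # 4 (matches the 4-bit example)
print(choose_k([0, 1, 0, 2]))             # 0 (small values: no split needed)
print(choose_k([255, 254, 253, 252]))     # 7 (near-random bytes: mostly raw)
```

An exhaustive search like this is trivially correct but needs all J lengths; Rice's rule of thumb achieves nearly the same choice from the single accumulated sum of x values, which is why the hardware only needs an accumulator.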

To interact with memory through the memory bus, the
coprocessor issues ENC_ADDR and waits for the bus to be
ready by monitoring EMEM_RDY. When the signal indicates
the bus is not occupied, the coprocessor issues signal RD,
causing the memory to provide 8 bit data to ENC_IN through
the bus. At the next clock, data are pushed into the internal buffer.
The signals RD, ENC_ADDR, and ENC_IN then go back to
tri-state conditions.
To interact with channel I/O, the coprocessor issues
SENDRQ and then monitors SENDACK to ensure the channel
is ready. When the channel is ready, the coprocessor sends
output bits through ENC_OUT. The bits are clocked through
channel serial synchronization SYNCH.
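The SENDRQ/SENDACK/SYNCH exchange can be modeled with a toy simulation (purely illustrative Python; the class and method names are assumptions, not part of the design):

```python
class Channel:
    """Toy model of the channel I/O side of the encoder handshake."""
    def __init__(self):
        self.ready = False      # models SENDACK
        self.received = []

    def see_sendrq(self):
        # The channel observes SENDRQ and raises SENDACK when it can accept bits.
        self.ready = True

    def clock_bit(self, bit):
        # Each SYNCH clock transfers one ENC_OUT bit.
        assert self.ready, "bits must not be sent before SENDACK"
        self.received.append(bit)

def send_bitstream(channel, bits):
    channel.see_sendrq()        # coprocessor issues SENDRQ, waits for SENDACK
    for b in bits:              # then shifts bits out on ENC_OUT under SYNCH
        channel.clock_bit(b)

ch = Channel()
send_bitstream(ch, "0001")
print("".join(ch.received))     # 0001
```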
F. Decoder Input/Output Assignments
To have similar synchronized interactions with external
subsystems, we also design the decoder coprocessor to
interact through various signals (see Table III). Through the
signals the coprocessor interacts with host processor, memory,
and channel I/O. Optionally, we can observe internal working
of the coprocessor through test pins D_T.
TABLE III
PIN ASSIGNMENT FOR THE DECODER.
No Pin I/O #Bits Usage
1 CLK IN 1 Clock signal
2 RST IN 1 Reset signal
3 START IN 1 Start processing
4 SENDRQ IN 1 Request by external
device to send stream
5 SYNCH IN 1 Serial synchronization
6 DEC_IN IN 1 Decoder input
7 RECACK OUT 1 Acknowledge
receiving stream
8 DEC_RDY OUT 1 Decoder is ready
9 DMEM_RDY IN 1 Memory is ready
10-12 DEC_ADDR OUT 3 Memory address
13 WR OUT 1 Writing to memory
14-21 DEC_OUT OUT 8 Output data
22-36 D_T OUT 15 Test pins
37 CLK IN 1 Clock signal

When the coprocessor is ready to receive commands from the
host processor to process a block of data, the coprocessor
issues DEC_RDY to the host. The host then issues START to trigger
the coprocessor. The coprocessor turns off DEC_RDY and
starts processing the input data from the channel. When the whole
block has been processed, the coprocessor activates DEC_RDY to
tell the host that data are available in the memory and it is
ready to process the next block from the channel.
To interact with channel I/O, the coprocessor issues
RECACK to tell the channel that the processor is ready and
then monitors SENDRQ to wait until the channel is ready to
provide bitstream. When the channel is ready, the coprocessor
receives input bits through DEC_IN. The bits are clocked by
channel through channel serial synchronization SYNCH.
To interact with memory through the memory bus, the
coprocessor issues DEC_ADDR and waits for the bus to be
ready by monitoring DMEM_RDY. When the signal indicates
the bus is not occupied, the coprocessor issues data for the
memory from DEC_OUT through the bus. Signal WR is then
issued to write data into memory at the specified address. After
that, the signals WR, DEC_ADDR, and DEC_OUT go back to
tri-state conditions. It is the responsibility of the host to make
use of the data block in memory.
V. RESULTS OF AN FPGA IMPLEMENTATION
After validating the architecture with C++ and VHDL
simulations, we further implemented the Rice coder (both
encoder and decoder) for 8 bit/sample data on an FPGA
Xilinx XC4005 [8]. One XC4005 contains 196 combinatorial
logic units (CLU) and 112 user I/O pins. In our
implementation, the encoder uses 30% CLB F&G, 15% CLB
H, 16% CLB FF, and 34% I/O pins. The decoder uses 31%
CLB F&G, 19% CLB H, 16% CLB FF, and 34% I/O pin.
Hence, an X4005 is sufficient to implement both encoder and
decoder. Furthermore in this particular implementation, the
encoder and decoder can achieve 11.6 MHz and 19.4 MHz
clock rates, respectively. Since a 10 MHz clock rate
corresponds to a 1.5 Mbits/s throughput, the FPGA
implementation achieves 1.74 Mbit/s and 2.91 Mbits/s for the
encoder and the decoder, respectively.
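The throughput figures follow from a simple proportion: if a 10 MHz clock corresponds to 1.5 Mbit/s, each MHz of clock yields 0.15 Mbit/s. A quick check of the reported numbers:

```python
MBPS_PER_MHZ = 1.5 / 10.0   # 10 MHz clock <-> 1.5 Mbit/s throughput

def throughput_mbps(clock_mhz: float) -> float:
    """Scale the achieved clock rate by the stated throughput ratio."""
    return clock_mhz * MBPS_PER_MHZ

print(round(throughput_mbps(11.6), 2))  # encoder: 1.74 Mbit/s
print(round(throughput_mbps(19.4), 2))  # decoder: 2.91 Mbit/s
```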
VI. CONCLUDING REMARKS
In anticipating the need for compression core modules,
such as intellectual property (IP) modules, we have
designed the coder hardware to be reusable for other hardware
designs. The FPGA platform is selected as the target platform
to verify the design. We have implemented the Rice code
(both encoder and decoder) for 8 bit/sample data on an FPGA
Xilinx XC4005, achieving a throughput of at least 1.74 Mbit/s.
We also show that an XC4005 is sufficient to implement both
encoder and decoder. The performance is optimal, comparable
to that of the computationally more expensive Huffman and
arithmetic coding.
ACKNOWLEDGMENT
This work and paper were supported in part by Riset
Unggulan ITB at the ITB Research Center on ICT.
REFERENCES
[1] R. F. Rice, "Some practical universal noiseless coding techniques,
Part III, Module PSI14,K+," JPL Publication 91-3, NASA, JPL,
California Institute of Technology, 124 pp., Nov. 1991.
[2] A. Langi, "Review of data compression methods and algorithms,"
Technical Report DSP-RTG 2010-9, Institut Teknologi Bandung, Sep.
2010.
[3] A. Langi and W. Kinsner, "Wavelet compression for image transmission
through bandlimited channels," ARRL QEX Experimenters' Exchange
(ISSN: 0886-8093, USPS 011-424), No. 151, pp. 12-21, Sep. 1994.
[4] A. Langi, "Lossless compression performance of a simple counter-based
entropy coder," ITB Journal of Information and Communication
Technology, submitted Dec. 2010.
[5] A. Langi, "A VLSI architecture of a counter-based entropy coder," ITB
Journal of Engineering Science, submitted Feb. 2011.
