Low-Power Multiplierless 2-D DWT and IDWT Architectures Using 4-tap
Daubechies Filters
Tze-Yun Sung, Yaw-Shih Shieh and Chun-Wang Yu
Department of Microelectronics Engineering
Chung Hua University
Hsinchu, Taiwan 30012
E-mail: bobsung@chu.edu.tw

Hsi-Chin Hsin
Department of Computer Science and Information Engineering
National Formosa University
Hu-Wei, Taiwan 63208
E-mail: hsin@nfu.edu.tw

Abstract

This paper proposes two architectures for the 2-D discrete wavelet transform (DWT) and inverse DWT (IDWT). The first high-efficiency architecture comprises a transform module, an address sequencer, and a RAM module. The transform module has a uniform and regular structure, simple control flow, and local communication. The significant advantages of the single transform module are full hardware utilization and low power. The second architecture features parallel and pipelined computation and high throughput. Both architectures are very suitable for VLSI implementation of new-generation image coding/decoding systems, such as JPEG-2000. In the realization of the 2-D DWT/IDWT, we focus on an FPGA and VLSI implementation using 4-tap Daubechies filters, which saves power and reduces chip area.

Keywords: DWT/IDWT, low-power, image coding/decoding system, JPEG-2000, 4-tap Daubechies filters, multiplierless.

1. Introduction

In the field of digital image processing, the JPEG-2000 standard uses the scalar wavelet transform for image compression [1]; hence, the two-dimensional (2-D) discrete wavelet transform (DWT)/inverse DWT (IDWT) has recently been used as a powerful tool for image coding/decoding systems. The two-dimensional DWT/IDWT demands massive computation; hence, it requires a parallel and pipelined architecture to perform real-time or on-line video and image coding and decoding, and to implement high-efficiency application-specific integrated circuits (ASIC) or field programmable gate arrays (FPGA). At the heart of the analysis stage of the system is the DWT. In the synthesis stage, the inverse DWT recovers the original image from the coefficients of the DWT.

Cohen, Daubechies and Feauveau proposed using 4-tap Daubechies coefficients for lossy analysis [1]. The symmetry of 4-tap Daubechies filters and the fact that they are almost orthogonal [2] make them good candidates for image analysis applications. The coefficients of the filter are quantized before hardware implementation; hence, the multiplier can be replaced by a limited number of shift registers and adders. Thus, system hardware is saved, and the system throughput is improved significantly.

In this paper, we propose two high-efficiency architectures for the even and odd parts of 1-D decimated convolution. The advantages of the proposed architectures are 100% hardware utilization, multiplierless operation, regular structure, simple control flow and high scalability.

The remainder of the paper is organized as follows. Section 2 presents the 2-D discrete wavelet transform algorithm and derives new mathematical formulas. In Section 3, the high-efficiency architecture for the 2-D DWT is proposed. In Section 4, the high-efficiency and low-power architecture for the 2-D IDWT is proposed. Section 5 applies the two proposed 2-D DWT/IDWT architectures to the coefficient quantization scheme for FPGA and VLSI implementation, and analyzes their performance. Finally, a comparison of performance between the proposed architectures and previous works is made, with conclusions given in Section 6.

2. 2-D Discrete Wavelet Transform Algorithm

Proceedings of the Seventh International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'06)
0-7695-2736-1/06 $20.00 © 2006
The 2-D DWT is a multilevel decomposition technique. The mathematical formulas of the 2-D DWT are defined as follows:

$$LL^{j}(m,n) = \sum_{i=0}^{K-1}\sum_{k=0}^{K-1} l(i)\,l(k)\,LL^{j-1}(2m-i,\,2n-k) \tag{1}$$

$$LH^{j}(m,n) = \sum_{i=0}^{K-1}\sum_{k=0}^{K-1} l(i)\,h(k)\,LL^{j-1}(2m-i,\,2n-k) \tag{2}$$

$$HL^{j}(m,n) = \sum_{i=0}^{K-1}\sum_{k=0}^{K-1} h(i)\,l(k)\,LL^{j-1}(2m-i,\,2n-k) \tag{3}$$

$$HH^{j}(m,n) = \sum_{i=0}^{K-1}\sum_{k=0}^{K-1} h(i)\,h(k)\,LL^{j-1}(2m-i,\,2n-k) \tag{4}$$

where $0 \le n, m < N_j$, $LL^{0}(m,n)$ is the input image, $K$ denotes the length of the filter, $l(i)$ denotes the impulse response of the low-pass filter, and $h(k)$ denotes the impulse response of the high-pass filter, which is developed from $(K \times K)$-tap filters. $LL^{j}(m,n)$, $LH^{j}(m,n)$, $HL^{j}(m,n)$ and $HH^{j}(m,n)$ denote respectively the coefficients of the low-low, low-high, high-low and high-high subbands produced at decomposition level $j$ (also represented by $LL^{j}$, $LH^{j}$, $HL^{j}$ and $HH^{j}$). $N_j \times N_j$ denotes the number of samples of $LL^{j}$.

According to formulas (1), (2), (3) and (4), the decomposition is produced by four 2-D convolutions followed by decimation in both the row and the column dimension for each level. In the three-level analysis for the 2-D DWT, the data set $LL^{j-1}$ having $N_{j-1} \times N_{j-1}$ samples is decomposed into four subbands $LL^{j}$, $LH^{j}$, $HL^{j}$ and $HH^{j}$, each having $N_j \times N_j$ (equal to $(N_{j-1}/2) \times (N_{j-1}/2)$) samples.

Let $LL_e(2n)$, $l(i)l(2k)$, $l(i)h(2k)$, $h(i)l(2k)$ and $h(i)h(2k)$ be the 1-D DWT terms consisting of the even-numbered samples, with $0 \le n < N_j$ and $0 \le k < K/2$. Moreover, let $LL_o(2n+1)$, $l(i)l(2k+1)$, $l(i)h(2k+1)$, $h(i)l(2k+1)$ and $h(i)h(2k+1)$ be the 1-D DWT terms consisting of the odd-numbered samples, with $0 \le n < N_j$ and $0 \le k < K/2$. $LL_i^{j}(n)$, $LH_i^{j}(n)$, $HL_i^{j}(n)$ and $HH_i^{j}(n)$ can be expressed as follows:

$$LL_i^{j}(n) = \sum_{k=0}^{\lceil K/2\rceil-1} l(i)\,l(2k)\,LL_e^{j-1}(2n-2k) + \sum_{k=0}^{\lfloor K/2\rfloor-1} l(i)\,l(2k+1)\,LL_o^{j-1}(2n-2k-1) \tag{5}$$

$$LH_i^{j}(n) = \sum_{k=0}^{\lceil K/2\rceil-1} l(i)\,h(2k)\,LL_e^{j-1}(2n-2k) + \sum_{k=0}^{\lfloor K/2\rfloor-1} l(i)\,h(2k+1)\,LL_o^{j-1}(2n-2k-1) \tag{6}$$

$$HL_i^{j}(n) = \sum_{k=0}^{\lceil K/2\rceil-1} h(i)\,l(2k)\,LL_e^{j-1}(2n-2k) + \sum_{k=0}^{\lfloor K/2\rfloor-1} h(i)\,l(2k+1)\,LL_o^{j-1}(2n-2k-1) \tag{7}$$

$$HH_i^{j}(n) = \sum_{k=0}^{\lceil K/2\rceil-1} h(i)\,h(2k)\,LL_e^{j-1}(2n-2k) + \sum_{k=0}^{\lfloor K/2\rfloor-1} h(i)\,h(2k+1)\,LL_o^{j-1}(2n-2k-1) \tag{8}$$

The above equations imply that $LL_i^{j}(n)$, $LH_i^{j}(n)$, $HL_i^{j}(n)$ and $HH_i^{j}(n)$ can be computed as the sum of two 1-D convolutions performed independently on the even part $LL_e^{j-1}(2n-2k)$ and the odd part $LL_o^{j-1}(2n-2k-1)$.

3. The Proposed 2-D DWT Architecture

The proposed architecture performs parallel and pipelined processing. Each analysis level involves two stages: stage 1 performs row filtering, and stage 2 performs column filtering. In the filter bank for 2-D DWT computation, at the first level the size of the input image is $N \times N$, and the size of the output of each of the three subbands LH, HL and HH is $(N/2) \times (N/2)$. At the second level, the input is the LL subband whose size is $(N/2) \times (N/2)$, and the size of the output of each of the three subbands LLLH, LLHL and LLHH is $(N/4) \times (N/4)$. At the third level, the input is the LLLL subband whose size is $(N/4) \times (N/4)$, and the size of the output of each of the four subbands LLLLLL, LLLLLH, LLLLHL and LLLLHH is $(N/8) \times (N/8)$.

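The two-stage row and column filtering and the halving of the subband size at each level can be illustrated with a minimal NumPy sketch. It is a software model, not the proposed hardware. It assumes the standard orthonormal 4-tap Daubechies coefficients in floating point (the quantized values used in the paper are not reproduced here), periodic extension at the image borders, and hypothetical helper names (`d4_filters`, `decimated_conv`, `dwt2_level`).

```python
import numpy as np


def d4_filters():
    """Standard orthonormal 4-tap Daubechies pair (assumed, unquantized)."""
    s3 = np.sqrt(3.0)
    low = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
    # Quadrature-mirror high-pass filter: h(k) = (-1)^k * l(3 - k).
    high = np.array([low[3], -low[2], low[1], -low[0]])
    return low, high


def decimated_conv(x, taps, axis):
    """y(n) = sum_k taps[k] * x(2n - k) along `axis`, periodic extension."""
    x = np.moveaxis(x, axis, 0)
    n_in = x.shape[0]
    idx = 2 * np.arange(n_in // 2)
    y = sum(t * x[(idx - k) % n_in] for k, t in enumerate(taps))
    return np.moveaxis(y, 0, axis)


def dwt2_level(ll_prev, low, high):
    """One analysis level following Eqs. (1)-(4)."""
    # Stage 1: row filtering (index n, taps indexed by k).
    row_low = decimated_conv(ll_prev, low, axis=1)
    row_high = decimated_conv(ll_prev, high, axis=1)
    # Stage 2: column filtering (index m, taps indexed by i).
    return {
        "LL": decimated_conv(row_low, low, axis=0),
        "LH": decimated_conv(row_high, low, axis=0),   # l(i) h(k)
        "HL": decimated_conv(row_low, high, axis=0),   # h(i) l(k)
        "HH": decimated_conv(row_high, high, axis=0),
    }


if __name__ == "__main__":
    low, high = d4_filters()
    ll = np.arange(64, dtype=float).reshape(8, 8)  # toy 8 x 8 "image"
    for level in (1, 2, 3):
        subbands = dwt2_level(ll, low, high)
        ll = subbands["LL"]
        print(level, {name: band.shape for name, band in subbands.items()})
```

For an 8 × 8 input, each level produces four subbands of half the side length, so the LL subband shrinks to 4 × 4, 2 × 2 and 1 × 1, matching the (N/2), (N/4) and (N/8) sizes stated above.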
The coefficients of the low-pass filter and the high-pass filter have been derived in the biorthogonal wavelet [3]. The coefficients are quantized before

hardware implementation. We assume that the low-pass filter has four taps: $a_0$, $a_1$, $a_2$ and $a_3$, and the high-pass filter also has four taps: $b_0$, $b_1$, $b_2$ and $b_3$.

The vertical filter and the horizontal filter for the 1-D DWT are shown in Figure 1 and Figure 2, respectively. According to Eqs. (5), (6), (7) and (8), each 1-D decimated convolution can be computed as the point-wise sum of two 1-D convolutions performed independently. Figure 3 illustrates the transform module for the 2-D DWT; the splitter arranges the data of the even and odd parts for the processing elements (PE). $f$ represents the input frequency, $f/2$ denotes that the output frequency of L and H is half of the input frequency, and $f/4$ denotes that the output frequency of LL, LH, HL and HH is a quarter of the input frequency. The single transform module can perform the 2-D DWT. It comprises an $(N/2 \times N/2)$ RAM, a transform module, a multiplexer, a splitter and an address sequencer. It requires 42 clock cycles to perform the 2-D DWT. Clock cycles 0 to 31 perform the level-1 analysis, clock cycles 32 to 39 perform the level-2 analysis, and clock cycles 40 to 41 perform the level-3 analysis. Three transform modules can be cascaded to perform parallel and pipelined processing.

4. The Proposed 2-D IDWT Architecture

In the proposed architecture of three-level synthesis for the 2-D IDWT, each synthesis level involves two stages: stage 1 performs column filtering, and stage 2 performs row filtering. The horizontal filter and the vertical filter for the 1-D IDWT are shown in Figure 4 and Figure 5, respectively. Figure 6 illustrates the transform module for the 2-D IDWT. $2f$ denotes that the frequency of L and H is double the input frequency, and $4f$ denotes that the output frequency is four times the input frequency of LL, LH, HL and HH.

The single transform module comprises an inverse transform module, a RAM module $(N/2 \times N/2)$ and a multiplexer. At the first-level composition, the inverse transform module comprises the $(LL)^{j-2}LL$ subband selected from the $(LL)^{j-1}LL$ subband by the multiplexer, as well as the $(LL)^{j-1}LH$, $(LL)^{j-1}HL$ and $(LL)^{j-1}HH$ subbands obtained by the entropy decoder. The $(LL)^{j-1}LL$ subband is stored in the RAM module. At the second-level composition, the inverse transform module comprises the $(LL)^{j-3}LL$ subband selected from the $(LL)^{j-2}LL$ subband by the multiplexer of the RAM module, as well as the $(LL)^{j-2}LH$, $(LL)^{j-2}HL$ and $(LL)^{j-2}HH$ subbands obtained by the entropy decoder. At the last-level composition, the inverse transform module comprises the original image selected from the LL subband by the multiplexer, as well as the LH, HL and HH subbands obtained by the entropy decoder. It requires 42 clock cycles to perform the 2-D IDWT. Clock cycles 0 to 1 perform the level-1 synthesis, clock cycles 2 to 9 perform the level-2 synthesis, and clock cycles 10 to 41 perform the level-3 synthesis.

5. Hardware Implementation and Performance Analysis of the Proposed 2-D DWT/IDWT Architectures

Filter coefficients of the 4-tap Daubechies low-pass filter are quantized before implementation in the high-speed computation hardware [17]-[18]. In the proposed architectures, all multiplications are performed using shifts and additions after approximating the coefficients in a Booth binary recoded format. The multiplier is replaced by a carry-save adder (CSA), an adder and three hardwired shifters in each processing element (PE).

The hardware code was written in Verilog hardware description language (HDL) [5] and evaluated with the ModelSim simulation tool [6]. The architectures were synthesized with Xilinx FPGA Express tools [7] and evaluated on the Xilinx XC2V4000 FPGA platform [8]. They were designed to evaluate the hardware and to provide an embedded core for digital image data analysis and synthesis [9].

The decimation filter for the 1-D DWT requires eight adders, thirteen shifters and three registers for each PE, while that for the 1-D IDWT requires seven adders, thirty shifters and three registers for each PE. Both hardware designs are very cost-effective. They require three and two additions to perform each PE computation in the DWT and IDWT, respectively. The two architectures reduce power dissipation by m compared with conventional architectures in m-bit operand (low-power utilization).

The two proposed DWT and IDWT architectures have regular structure, local communication and simple control flow, so they are very suitable for VLSI implementation and scalable filter length. In both of the single transform modules, the hardware utilization is 100%, so the systems consume ultra-low power. The total data processing time of the 2-D DWT can be calculated as follows:

$$T = \frac{2}{3}\,\bigl(1 - 2^{-2j}\bigr)\,N^{2}$$

where $j = \log_2 N$.

In the single transform module of the 2-D IDWT, the total data processing time is the same as that of the 2-D DWT.

When the two architectures with fixed-point operation are applied to 3-level image compression and decompression by the periodic extension method [17], the peak signal-to-noise ratio (PSNR) of the reconstructed image is 255.5 dB. The 3-level compressed image using the periodic extension method is shown in Figure 7. The original image and the reconstructed image are shown in Figure 8. Hence, the performance of the proposed 2-D DWT/IDWT architectures is suited to JPEG-2000 and motion-JPEG-2000.

6. Discussion and Conclusion

In this paper, two high-speed and ultra low-power architectures for the 2-D DWT and IDWT with single transform modules have been proposed. Both architectures perform analysis in $2 \cdot (1 - 2^{-2j}) \cdot N^{2}/3$ processing time and synthesis in $2 \cdot (1 - 2^{-2j}) \cdot N^{2}/3$. They are significantly faster than the conventional architectures proposed by Wu and Chen [10]-[11], and Marino [12]. Table 1 depicts the comparison with previous works. In this table, $AT^{2}$ represents the system performance [11]-[15], where A denotes area and T denotes time or latency (clock cycles). As can be seen, the system performance of the two proposed architectures is significantly better than that of previous works.

Filter coefficients are quantized before implementation using 4-tap Daubechies filters. Both hardware designs are cost-effective and the systems are high-speed. The architectures reduce power dissipation by m compared with conventional architectures in m-bit operand (low-power utilization).

The proposed architectures have been verified in Verilog HDL and implemented on FPGA. The advantages of the proposed architectures are 100% hardware utilization and ultra low power. The architectures have regular structure, simple control flow, high throughput and high scalability. Thus, they are very suitable for new-generation image coding/decoding systems, such as JPEG-2000.

References

[1] ITU-T Recommendation T.800. JPEG2000 image coding system - Part 1, ITU Std., July 2002. http://www.itu.int/ITU-T/.

[2] A. S. Lewis and G. Knowles, "VLSI architecture for 2-D Daubechies wavelet transform without multipliers," Electron. Lett., vol. 27, pp. 171-173, Jan. 1991.

[3] M. Antonini, M. Barlaud, P. Mathieu, I. Daubechies, "Image Coding Using Wavelet Transform," IEEE Transactions on Image Processing, Vol. 1, No. 2, April 1992, pp. 205-220.

[4] K. A. Kotteri, A. E. Bell, J. E. Carletta, "Design of Multiplierless, High-Performance, Wavelet Filter Banks with Image Compression Applications," IEEE Transactions on Circuits and Systems-I, Vol. 51, No. 3, March 2004, pp. 483-494.

[5] D. E. Thomas, P. H. Moorby, The Verilog Hardware Description Language, Fifth Edition, Kluwer Academic Pub., 2002.

[6] Model ModelSim Products: http://www.model.com/products.

[7] Synopsys FPGA Express, http://www.synopsys.com/products.

[8] Xilinx FPGA products, http://www.xilinx.com/products.

[9] S. Masud, J. V. McCanny, "Reusable Silicon IP Cores for Discrete Wavelet Transform Application," IEEE Transactions on Circuits and Systems-I, Vol. 51, No. 6, June 2004, pp. 1114-1124.

[10] P.-C. Wu, L.-G. Chen, "An Efficient Architecture for Two-Dimensional Discrete Wavelet Transform," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 4, April 2001, pp. 536-545.

[11] P.-C. Wu, C.-T. Liu, L.-G. Chen, "An Efficient Architecture for Two-Dimensional Inverse Discrete Wavelet Transform," IEEE International Symposium on Circuits and Systems, Vol. 2, May 2002, pp. II-312-II-315.

[12] F. Marino, "Two Fast Architectures for the Direct 2-D Discrete Wavelet Transform," IEEE Transactions on Signal Processing, Vol. 49, No. 6, June 2001, pp. 1248-1259.

[13] S. Y. Kung, VLSI Array Processors, Prentice-Hall, New Jersey, USA, 1989.

[14] T. Y. Sung, C. S. Chen, "A Parallel-Pipelined Processor for Fast Fourier Transform," The Fourth IEEE Asia-Pacific Conference on Advanced System Integrated Circuits (AP-ASIC-2004), Fukuoka, Japan, August 3-5, 2004, pp. 194-197.

[15] T. Y. Sung, "A Memory-Efficient and High-Speed Split-Radix FFT/IFFT Processor Based on Pipelined CORDIC Rotations," to appear in IEE Proceedings - Vision, Image and Signal Processing.

[16] T. Y. Sung, Y. S. Shieh, C. L. Chiu, "High-Efficient Architectures for Forward and Inverse Discrete Wavelet Transform Using 4-tap Daubechies Filters," 2006 Conference on Microelectronics Technology and Applications (2006-CMTA), Kaoshiung, Taiwan, May 19, 2006, D-3.

[17] T. Y. Sung, Y. S. Shieh, C. L. Chiu, C. W. Yu, "VLSI Architectures for 2-D Forward and Inverse Discrete Wavelet Transform Using 4-tap Daubechies Filters," 2006 Conference on Electronic Communication and Application (2006-CECA), Kaoshiung, Taiwan, July 06, 2006.

Figure 1. The vertical filter for 1-D DWT
Figure 2. The horizontal filter for 1-D DWT
L: line delay register
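The filters of Figures 1 and 2 rely on the multiplierless principle of Section 5: each quantized coefficient is recoded into signed digits so that a multiplication becomes a few shifts, additions and subtractions. The sketch below illustrates that principle only. It assumes a non-adjacent-form recoding (one Booth-like signed-digit format), an 8-bit fractional word length, and the unquantized 4-tap Daubechies tap (1 + sqrt(3)) / (4 sqrt(2)); the paper does not list its actual quantized values, and the function names are hypothetical.

```python
import math


def to_signed_digits(coefficient, frac_bits):
    """Recode round(coefficient * 2**frac_bits) into digits in {-1, 0, 1}.

    Digits are returned least-significant first, in non-adjacent form
    (no two neighbouring digits are both non-zero).
    """
    scaled = round(coefficient * (1 << frac_bits))
    digits = []
    while scaled != 0:
        if scaled & 1:
            digit = 2 - (scaled & 3)  # +1 if scaled % 4 == 1, -1 if == 3
            scaled -= digit
        else:
            digit = 0
        digits.append(digit)
        scaled >>= 1
    return digits


def shift_add_multiply(sample, digits, frac_bits):
    """Multiply an integer sample by the recoded coefficient without '*'."""
    acc = 0
    for shift, digit in enumerate(digits):
        if digit == 1:
            acc += sample << shift
        elif digit == -1:
            acc -= sample << shift
    return acc >> frac_bits  # drop the fractional bits (floor)


if __name__ == "__main__":
    a0 = (1 + math.sqrt(3)) / (4 * math.sqrt(2))  # assumed low-pass tap
    digits = to_signed_digits(a0, frac_bits=8)
    print(digits, shift_add_multiply(100, digits, frac_bits=8))
```

Each non-zero digit costs one hardwired shift and one adder input, which is why a coarser quantization of the taps directly reduces the adder and shifter counts reported in Section 5.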
Figure 4. The horizontal filter for 1-D IDWT
Figure 5. The vertical filter for 1-D IDWT
Figure 7. The 3-level compressed image
Figure 8. (a) Original image and (b) Reconstructed image

Figure 6. The transform module for 1-D IDWT
Figure 3. The transform module for 1-D DWT
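The splitter in the transform module feeds the even and odd samples to separate processing elements, following Eqs. (5)-(8). A minimal sketch of that identity for one 1-D decimated convolution is given below: the direct form and the even/odd form yield the same output. Periodic extension and the function names are assumptions made for illustration, and integer taps are used so that the comparison is exact.

```python
def direct_decimated_conv(x, taps):
    """y(n) = sum_k taps[k] * x(2n - k), periodic extension."""
    n_len = len(x)
    return [
        sum(t * x[(2 * n - k) % n_len] for k, t in enumerate(taps))
        for n in range(n_len // 2)
    ]


def even_odd_decimated_conv(x, taps):
    """Same output, computed as two independent convolutions."""
    n_len = len(x)
    even_taps = taps[0::2]  # taps[2k], applied to even-indexed samples
    odd_taps = taps[1::2]   # taps[2k + 1], applied to odd-indexed samples
    out = []
    for n in range(n_len // 2):
        even_part = sum(
            t * x[(2 * n - 2 * k) % n_len] for k, t in enumerate(even_taps)
        )
        odd_part = sum(
            t * x[(2 * n - 2 * k - 1) % n_len] for k, t in enumerate(odd_taps)
        )
        out.append(even_part + odd_part)  # point-wise sum of the two parts
    return out


if __name__ == "__main__":
    samples = [3, 1, 4, 1, 5, 9, 2, 6]
    taps = [2, -3, 5, 7]  # arbitrary integer taps for the demonstration
    print(direct_decimated_conv(samples, taps))
    print(even_odd_decimated_conv(samples, taps))
```

Because the even part only reads even-indexed samples and the odd part only reads odd-indexed samples (for an even-length input), the two sums can run in parallel on half-rate data streams.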
Table 1. Comparison with the previous works (j = log2 N)
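The processing time used in the comparison, 2(1 - 2^(-2j))N^2/3, can be checked against the clock-cycle schedule of the single transform module. The sketch below assumes that each level spends half as many cycles as it has input samples, which is inferred from the 32, 8 and 2 cycle windows quoted for the three levels (consistent with N = 8) and is not stated explicitly in the text. Function names are hypothetical.

```python
def level_cycles(n, levels):
    """Cycles per analysis level for an n x n image (n a power of two)."""
    # Level l works on an (n / 2**(l - 1)) square block; assumed cost is
    # one cycle per two input samples.
    return [((n >> (level - 1)) ** 2) // 2 for level in range(1, levels + 1)]


def total_cycles(n, levels):
    """Closed form 2 * (1 - 2**(-2 * levels)) * n**2 / 3 in integers."""
    return 2 * (n * n - (n * n >> (2 * levels))) // 3


def schedule(n, levels, synthesis=False):
    """Inclusive (first, last) clock-cycle window of each level."""
    cycles = level_cycles(n, levels)
    if synthesis:
        cycles.reverse()  # the IDWT starts from the smallest subbands
    windows, start = [], 0
    for count in cycles:
        windows.append((start, start + count - 1))
        start += count
    return windows


if __name__ == "__main__":
    print(schedule(8, 3), total_cycles(8, 3))
    print(schedule(8, 3, synthesis=True))
```

Under this assumption the per-level counts form a geometric series with ratio 1/4, which is where the factor (1 - 2^(-2j)) / 3 in the closed form comes from.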
