
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 6, JUNE 2007, p. 725

Shift-Register-Based Data Transposition for Cost-Effective Discrete Cosine Transform


Shih-Chang Hsia and Szu-Hong Wang

Abstract: This paper presents a cost-effective 2-D discrete cosine transform (DCT) architecture based on the fast row/column decomposition algorithm. We propose a new schedule for 2-D-DCT computation to reduce the hardware cost. With this approach, the transposed memory can be simplified to shift registers for the data transposition between two 1-D-DCT units. A special shift cell is designed with MOS circuits using the energy-transfer methodology. The memory size is greatly reduced, and the address generator and its READ/WRITE control are eliminated. For an 8 × 8 block transformation, the shift-register array needs only 4 k transistors. The maximum shift frequency is about 120 MHz when implemented in 0.35-µm technology.

Index Terms: Discrete cosine transform (DCT), pseudo capacitor, row/column decomposition, shift register, video coding.

I. INTRODUCTION

Video coding systems widely use the discrete cosine transform (DCT) to remove redundant data [1], [2]. Many fast DCT algorithms have been presented [3], [4] to reduce the computational complexity, and VLSI architectures have been designed for dedicated DCT processors [5], [6]. The row/column decomposition approach is popular due to its regularity and simplicity, but it needs a transposed memory for 2-D-DCT processing. Either flip-flop cells or an embedded RAM can be used for the data transposition. If flip-flops are used, the chip complexity becomes high, since one flip-flop cell requires about 16 transistors in a typical CMOS cell library. If an embedded RAM is used, a memory compiler must be employed to generate the expected RAM size; although the layout density is high, the VLSI implementation becomes more complex. Generally, the transpose memory occupies about 1/2 to 1/5 of the full 2-D-DCT core. To reduce the complexity, we present a simple shift register in place of the transpose memory. With a particular access schedule for the 2-D-DCT, a simple shift-register array can be used for data transposition. The proposed DCT decreases both the control overhead and the transpose memory size.

II. PROPOSED 2-D DCT ARCHITECTURE

Based on the fast row/column algorithm [5], one can use a time-sharing method to perform the 2-D-DCT with a single 1-D-DCT core for a cost-effective design. The timing schedule and the VLSI architecture for the DCT computations are illustrated in Figs. 1 and 2, respectively. The 2-D-DCT architecture consists of the 1-D-DCT core and the shift-register array. For the processing of the Nth block, the first-row pixels f00–f07 are sequentially loaded into R0–R7 during cycles 0–7. R0–R7 are then selected to the computation kernel by the 2 × 1 multiplexers for the first-row coefficient transformation during cycles 8–15. The resulting coefficients are sent to the shift register sequentially, one per cycle.
Manuscript received April 1, 2005. This work was supported by the National Science Council, Taiwan, R.O.C., under Contract NSC92-2213-E-327-010. The authors are with the Department of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology, Taiwan 824, R.O.C. (e-mail: hsia@ccms.nkfust.edu.tw). Digital Object Identifier 10.1109/TVLSI.2007.898780

Meanwhile, the second-row pixels f10–f17 are loaded into R8–R15. In cycles 16–23, R8–R15 are selected to the computation kernel to compute the second-row coefficients. At the same time, the third-row pixels are loaded into R0–R7. By repeating this schedule, the pixels of one block are transformed into 1-D coefficients row by row within 64 cycles. The register banks R0–R7 and R8–R15 are chosen by the multiplexers when the Clk_Enable signal is 0 and 1, respectively.

The addition or subtraction of two pixels is performed beforehand for the even or odd coefficient computation, and can be implemented using two's-complement control with XOR gates. The weights 1–4 are cosine coefficients, whose values are listed in Table I. The cosine coefficients can be easily implemented using a fixed state machine. The computational order in the 1-D-DCT is regular, running through the coefficients F0, F1, ..., F7 [5]. In the first cycle, to compute the F0 coefficient, weights 1–4 all use the cosine value a. In the second cycle, weights 1–4 use b, d, e, and g, respectively, to compute the F1 coefficient. The other coefficients are calculated with the corresponding weights from the state machine.

The last-row pixels f70–f77 have been loaded into R8–R15 by the 63rd cycle, and are completely transformed into 1-D-DCT coefficients by the 72nd cycle. The first 1-D-DCT results are sequentially input to the shift register for the second 1-D-DCT computation. The access schedule of the shift register at the 71st cycle is shown in Fig. 3. The shift-register array is designed with a serial-in/parallel-out structure. The first 1-D-DCT results m[00], m[10], ..., m[70] are loaded into R0–R7 in parallel at the 71st cycle for the 2-D-DCT computation. Owing to the one-stage pipeline delay and the output latch, the first 2-D-DCT coefficient F[00] is obtained at the 74th cycle. The 2-D-DCT coefficients F[10], F[20], ... are then output sequentially during cycles 75–81. For the next column, one clock is sent to the shift-register array, whose output becomes m[01], m[11], ..., m[71]. These 1-D-DCT coefficients are loaded into R8–R15 in parallel at the 79th cycle.
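The data flow described so far can be illustrated in software. The following Python sketch is not the authors' implementation: it assumes an orthonormal DCT-II scaling, computes the cosine weights directly instead of reading them from Table I, and uses hypothetical names (`dct_1d`, `dct_2d_shift_register`, `dct_2d_direct`). It models the pre-addition/pre-subtraction with four products per coefficient, and a 64-word serial-in/parallel-out register whose fixed taps, eight words apart, deliver one column of intermediate results after each shift.

```python
import math
from collections import deque

N = 8  # block size


def scale(k):
    """Orthonormal DCT-II scale factor (an assumed normalization)."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def weight(i, k):
    """Cosine weight applied to the i-th input (or input pair) for coefficient k."""
    return math.cos((2 * i + 1) * k * math.pi / (2 * N))


def dct_1d(x):
    """1-D DCT using pre-addition (even k) or pre-subtraction (odd k).

    Each coefficient then needs only N/2 = 4 products.
    """
    out = []
    for k in range(N):
        sign = 1 if k % 2 == 0 else -1
        pairs = [x[i] + sign * x[N - 1 - i] for i in range(N // 2)]
        out.append(scale(k) * sum(p * weight(i, k) for i, p in enumerate(pairs)))
    return out


def dct_2d_shift_register(block):
    """Row/column 2-D DCT with a serial-in/parallel-out shift register."""
    # 64-word register; with maxlen set, appendleft drops the oldest word.
    reg = deque([0.0] * (N * N), maxlen=N * N)
    for row in block:                      # first 1-D pass, row by row
        for coeff in dct_1d(row):
            reg.appendleft(coeff)          # serial input, one word per cycle
    taps = [N * N - 1 - N * i for i in range(N)]   # fixed parallel outputs
    result = [[0.0] * N for _ in range(N)]
    for v in range(N):                     # second 1-D pass, column by column
        column = [reg[t] for t in taps]    # m[0][v], m[1][v], ..., m[7][v]
        for u, value in enumerate(dct_1d(column)):
            result[u][v] = value
        reg.appendleft(0.0)                # one clock exposes the next column
    return result


def dct_2d_direct(block):
    """Direct double-sum 2-D DCT, used only as a reference for checking."""
    return [[scale(u) * scale(v) * sum(block[x][y] * weight(x, u) * weight(y, v)
                                       for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]
```

Comparing `dct_2d_shift_register(block)` with `dct_2d_direct(block)` on any 8 × 8 block should give the same coefficients up to rounding, which shows that no addressable transpose memory is needed in this model.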
The second-column 2-D-DCT coefficients are obtained during cycles 82–89. By repeating this computation schedule, the last-column 1-D-DCT coefficients m[07], m[17], ..., m[77] are loaded into R8–R15 at the 116th cycle, and the 2-D coefficients F[70]–F[77] are obtained sequentially. For the next block, the new pixels are sequentially written into R0–R7 from the 117th to the 125th cycle. The same computation schedule is then employed for the new block transformation.

For a cost-effective design, a special shift-register cell is designed with MOS circuits to reduce the memory size. The shift operation is based on capacitor energy transfer. The shift cell and its timing control are shown in Fig. 4. A two-phase clock controls the nMOS switches. In the first half cycle, φ1 is high and φ2 is low, so Q1 is on and Q2 is off. The data D1 is stored on capacitor c1 through Q1, where the input data stream is Din = ..., D4, D3, D2, D1. In the next half cycle, φ1 and φ2 are inverted: Q1 turns off and Q2 turns on, so the data on c1 shifts to capacitor c2. The inverter at the end of the shift cell is used to maintain the logic level. In the second cycle, Q1 turns on and D2 is loaded onto c1 in the first half cycle. Meanwhile, capacitor c2 still holds D1, since Q2 is off; D1 passes through the inverter and transfers to c3, since Q3 is on. In the next half cycle, φ2 goes high, and D2 and D1 are shifted to capacitors c2 and c4, respectively. Repeating this, the shift function is performed with the energy-transfer technique.

The capacitors are designed to satisfy c1 > c2, to make sure the energy transfers properly. The ratio of channel width to length of Q1, Q2, and the inverter can be adjusted to set the c1 and c2 capacitances. Capacitor c1 is dominated by the Q1 source capacitance and the Q2 drain capacitance. Capacitor c2 is dominated by the Q2 drain and inverter gate capacitances.
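The role of the capacitor ratio can be illustrated with an idealized behavioral model. The Python sketch below is not taken from the paper: it assumes ideal switches, a 3.3 V supply, an inverter threshold at half the supply, and no leakage, and the names `shift_chain` and `inverter` are hypothetical. In phase 1 each c1 samples its cell input; in phase 2, c1 and c2 share charge, so both nodes settle at (c1·V1 + c2·V2)/(c1 + c2). Because every cell inverts once, the model undoes the net inversion for odd chain lengths; how the actual chip handles polarity is not stated in the text.

```python
VDD = 3.3  # assumed supply voltage; not stated in the text


def inverter(v):
    """Ideal inverter with its switching threshold at VDD/2 (an assumption)."""
    return 0.0 if v > VDD / 2 else VDD


def shift_chain(bits, n_cells, c1, c2):
    """Shift `bits` through `n_cells` cells; return the bits seen at the output."""
    v1 = [0.0] * n_cells  # voltage on each cell's c1
    v2 = [0.0] * n_cells  # voltage on each cell's c2
    stream = list(bits) + [0] * (n_cells - 1)  # padding flushes the last bits
    recovered = []
    for cycle, bit in enumerate(stream, start=1):
        # Phase 1 high: Q1 on, Q2 off. Every c1 samples its cell input,
        # which is Din for the first cell and the previous inverter otherwise.
        v1 = [VDD * bit] + [inverter(v2[k - 1]) for k in range(1, n_cells)]
        # Phase 2 high: Q1 off, Q2 on. c1 and c2 share their charge.
        for k in range(n_cells):
            shared = (c1 * v1[k] + c2 * v2[k]) / (c1 + c2)
            v1[k] = v2[k] = shared
        if cycle >= n_cells:  # the first input bit reaches the output now
            out_level = 1 if inverter(v2[-1]) > VDD / 2 else 0
            # Each cell inverts once; undo the net inversion for odd n_cells.
            recovered.append(out_level ^ (n_cells % 2))
    return recovered


pattern = [1, 0, 1, 1, 0, 0, 1, 0]
print(shift_chain(pattern, n_cells=4, c1=4.0, c2=1.0) == pattern)
print(shift_chain([1, 0, 1, 0], n_cells=1, c1=1.0, c2=4.0))
```

Under these assumptions the pattern survives the four-cell chain when c1 is larger than c2, while in the second call, where c2 dominates, the first bit is already read back incorrectly. A real design would need additional margin for threshold drops and leakage.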
Since the size of Q2 influences both c1 and c2, the capacitances cannot be adjusted through Q2. To satisfy c1 > c2, one can increase the c1 capacitance with a large width-to-length ratio for Q1, and use a uniform ratio for Q2 and the inverter to minimize the memory size. The shift-register cell is implemented with two nMOS transistors and one inverter, so a one-bit cell uses only four transistors. The circuit complexity of the transpose memory is therefore much lower than that of a conventional SRAM or flip-flop. Moreover, no extra controller, such as READ/WRITE access control or an address decoder, is needed.

1063-8210/$25.00 © 2007 IEEE

Fig. 1. Timing schedule for DCT computations.

Fig. 2. Proposed 2-D-DCT architecture with one 1-D DCT core.

Fig. 3. Shift-register cell and its control timing.

Fig. 4. Serial-in/parallel-out shift register array for the second 1-D-DCT processing.

TABLE I STATE MACHINE FOR COSINE WEIGHTS 1–4

III. CHIP IMPLEMENTATION AND COMPARISONS

First, the shift-register array is realized with a full-custom design. The SPICE simulator is used to verify the circuit function. The maximum shifting rate is about 120 MHz when implemented in the TSMC 0.35-µm process. For the cell layout, we first design a 1-bit cell and repeat it to reach the required word size. The word length for the DCT computation is 16 bits, so the cell layout is copied 16 times for one word. The transformation of one 8 × 8 block needs a 64-word shift register. The word is repeated eight times in the row direction, and the resulting one-row layout is then duplicated eight times for the column cells. The shift-register array requires only 4 k transistors and uses simple two-phase control rather than a memory controller and address generator.

TABLE II CHIP FEATURE OF THE PROPOSED 2-D-DCT PROCESSOR

To design the whole DCT chip, the shift register is modeled as a function block for full-system simulations. The DCT chip is implemented in Verilog HDL. First, the preprocessing and computational core is realized according to Fig. 2. Then, the 2-D-DCT chip is integrated from one 1-D-DCT core and the shift-register array and verified with logic simulations. To check the function of the 2-D-DCT, 100 random blocks are used for software and hardware cosimulation. The results show that the chip performs the 2-D-DCT in real-time operation. The maximum operating frequency of the 2-D-DCT chip is about 55 MHz according to the post-simulation. Table II shows the features of the proposed 2-D-DCT chip.
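The 4 k transistor figure quoted for the shift-register array follows directly from the numbers given in the text. The short Python sketch below reproduces that arithmetic and, for comparison only, counts the storage cells of a flip-flop array at the "about 16 transistors" per cell mentioned in the Introduction. The constant and function names are illustrative, and control logic is deliberately left out of both counts.

```python
WORDS_PER_BLOCK = 8 * 8          # one 8 x 8 block of intermediate coefficients
BITS_PER_WORD = 16               # word length used for the DCT computation
TRANSISTORS_PER_SHIFT_CELL = 4   # two nMOS switches plus one inverter
TRANSISTORS_PER_FLIP_FLOP = 16   # approximate figure from the Introduction


def storage_transistors(transistors_per_bit):
    """Transistors needed to store one block at the given cost per bit."""
    return WORDS_PER_BLOCK * BITS_PER_WORD * transistors_per_bit


shift_array = storage_transistors(TRANSISTORS_PER_SHIFT_CELL)
flip_flop_array = storage_transistors(TRANSISTORS_PER_FLIP_FLOP)
print(shift_array, flip_flop_array)  # 4096 16384
```

On this count the shift-cell array needs a quarter of the storage transistors of a flip-flop array of the same capacity, before any savings from removing the address generator and READ/WRITE control are considered.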

Table III lists the comparison with other DCT architectures. Aggoun [5] proposed a shift-based row/column transposition; its shift operation uses skewed registers and requires special control to access the shift cells through multiplexers. Gong [6] uses the submatrix method to implement the DCT/IDCT without a transpose memory, but needs a three-port memory module for column-based processing. The proposed DCT adopts a simple shift register as the transpose memory. The 64 × 16 shift-register array is implemented with MOS circuits to reduce the circuit complexity. The address generator and the WRITE/READ control circuit are eliminated, and the address decoder for memory cell access is no longer needed. Compared with the conventional architectures, the transposed memory size is reduced to about 1/8–1/10. The computational core employs only four multipliers and seven adders. The chip requires about 29 k transistors; hence, its circuit complexity is lower than that of the others. The latency of only 74 cycles is lower than that of the earlier transposition-based chips. The throughput is 0.5 2-D coefficients and one 1-D coefficient per cycle.

TABLE III COMPARISONS WITH OTHER DCT PROCESSORS

IV. CONCLUSION

This paper presents a cost-effective DCT architecture for video coding applications. The 2-D-DCT processor is realized with a particular schedule and consists of a 1-D-DCT core and a shift-register array. The shift-register array performs data transposition with a serial-in/parallel-out structure based on the pseudo-capacitor technique. The shift-register-based transposition reduces the control overhead, since the address generator and decoder for memory access are removed. Compared with transposition-based DCT chips, the memory size and the full 2-D-DCT complexity are reduced to about 1/8 and 1/2, respectively. With its low circuit complexity and control overhead, the proposed DCT IP provides a cost-effective solution for video encoders.

REFERENCES

[1] MPEG-2 video coding, ISO/IEC DIS 13818-2, 1995.
[2] G. Cote, B. Erol, and F. Kossentini, "H.263+: Video coding at low bit rate," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 7, pp. 849–866, Nov. 1998.
[3] E. Feig and S. Winograd, "Fast algorithm for the discrete cosine transform," IEEE Trans. Signal Process., vol. 40, no. 9, pp. 2174–2193, Sep. 1992.
[4] N. I. Cho and S. U. Lee, "Fast algorithm and implementation of 2-D discrete cosine transform," IEEE Trans. Circuits Syst., vol. 38, no. 3, pp. 297–305, Mar. 1991.
[5] A. Aggoun and I. Jalloh, "Two-dimensional DCT/IDCT architecture," Proc. IEE Comput. Digit. Tech., vol. 150, no. 1, pp. 2–10, 2003.
[6] D. Gong, Y. He, and Z. Cao, "New cost-effective VLSI implementation of a 2-D discrete cosine transform and its inverse," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, pp. 405–415, Apr. 2004.
