
Design and implementation of 16-bit fixed point

digital signal processor

Donghoon Lee, Chanwon Ryu, Jusung Park
School of Electronic Engineering
Pusan National University
Busan, Korea
minuet21@pusan.ac.kr

Kyunsoo Kwon, Wontae Choi
Samsung Electro-Mechanics
Suwon, Korea

Abstract—This paper describes the design and implementation of a 16-bit fixed-point digital signal processor (DSP). The designed DSP has 211 instructions and consists of a 40-bit ALU, a six-level pipeline, a 17-bit x 17-bit parallel multiplier for single-cycle MAC operation, 8 addressing modes, 8 auxiliary registers, 2 auxiliary register arithmetic units, two 40-bit accumulators and 2 address generators. The synthesizable RTL code of the DSP core, written in Verilog HDL, has a complexity of 69,860 gates in terms of two-input NAND gates. We first verified the functions of the DSP by simulation with single-instruction tests, and then implemented the DSP on an FPGA. The test vectors comprise single-instruction tests, combinations of single instructions, and application algorithms (an ADPCM vocoder and an MP3 decoder). After FPGA verification, the DSP core was fabricated in a 0.25 µm CMOS technology. The DSP core carried out the three test vector sets that were tested on the FPGA, at a 106 MHz clock rate.

Keywords - DSP; CPU; SoC; design; processor; chip

I. INTRODUCTION
As technology has advanced, processors have become smaller and many portable devices are now manufactured. The DSP (digital signal processor), which processes digital signals quickly and accurately for audio/image processing or data communication, has therefore become increasingly important. This paper presents the design and implementation of a DSP core. The proposed DSP has a hardware architecture and an instruction set suited to processing digital signals. Because portable devices such as cell phones and MP3 players need a low-cost, low-power processor, a fixed-point DSP is important to them, although it is less accurate than a floating-point DSP.

II. DESIGN
The proposed DSP has several architectural features, because a DSP has to carry out many digital signal algorithms in real time. First, it has a fast, optimized multiplier that executes the MAC instruction in one cycle. It also has an advanced Harvard architecture to improve operating speed. For the multi-bit data shifts used frequently in digital signal algorithms, it has a barrel shifter and a repeater.

The overall architecture of the proposed DSP consists of data and address buses, a central processing unit, a control unit and a memory interface unit [3], as shown in Figure 1.

Figure 1. Architecture of the DSP

A. Architecture and instruction
The proposed 16-bit fixed-point DSP has a bus architecture divided into two types, program and data. The PAGEN and DAGEN generate the program and data addresses, respectively. Through the memory interface units (MMU and EXTMMU), the DSP controls the internal or external memory. The CPU blocks, such as the ALU and MAU, carry out the data operations.

1) Pipeline and Bus : For data processing, the DSP has an advanced Harvard architecture, which reduces the waiting time of the buses. It can access memory through one program bus (PB), three data buses (CB, DB, EB) and their corresponding address buses (PAB, CAB, DAB, EAB). The program bus reads opcodes and operands from program memory. The three data buses are divided into two types: the data-read buses (CB, DB) read operands from data memory, and the data-write bus (EB)

978-1-4244-2599-0/08/$25.00 ©2008 IEEE II-61 2008 International SoC Design Conference

Authorized licensed use limited to: Gandhi Institute of Technology & Management. Downloaded on June 22,2010 at 09:10:42 UTC from IEEE Xplore. Restrictions apply.
writes data to data memory. The DSP also has a six-level pipeline: Prefetch, Fetch, Decode, Access, Read and Execute. As instructions execute, the DSP allows them to overlap, and each pipeline level performs an independent function. The function of each level is shown in Figure 2.

Prefetch: loads PAB with the PC's contents.
Fetch: loads PB with the fetched instruction word.
Decode: loads IR with the contents of PB; decodes the IR's contents.
Access: loads DAB with the data1 read address, if required; loads CAB with the data2 read address, if required; updates auxiliary registers and stack pointer.
Read: loads DB with the data1 read operand; loads CB with the data2 read operand; loads EAB with the data3 write address, if required.
Execute/Write: executes the instruction and loads EB with write data.

Figure 2. Function of pipeline cycles

2) CPU : The CPU (Central Processing Unit) carries out operations on data received from the buses. It consists of specific functional blocks, such as the ALU (Arithmetic Logic Unit), MAU (Multiply and Adder Unit), CSSU (Compare, Select and Store Unit), barrel shifter and exponent encoder [2], as shown in Figure 3. Most arithmetic and logical instructions are processed in this block in one or two cycles.

Figure 3. Architecture of the CPU

a) ALU : The ALU consists of a 40-bit adder and logical operation units. As shown in Figure 3, the output of the ALU is fed to the barrel shifter and, after being shifted, is stored to memory or to an accumulator. When results are shifted before being stored, the whole operation has to complete within one cycle. Using the EXP encoder block, the CPU can also represent and operate on floating-point data.

b) Accumulator : The proposed DSP has two 40-bit accumulators (ACCA and ACCB). They store the results produced by the ALU and MAU and can send those results back to the ALU. The output of ACCA can also be an input to the MAU. The accumulators are also used by compare instructions such as MIN and MAX, and by parallel instructions such as LD||MAC, which load data into an accumulator and operate on them. The 40-bit accumulator data consist of three parts: 8-bit guard data, 16-bit high data and 16-bit low data.

c) MAU : Digital signal processing algorithms such as the FFT, FIR filtering and Huffman decoding involve repetitive and complicated operations, so the performance of a DSP depends on how it executes these algorithms [1]. The proposed DSP has an optimized multiplier based on the modified radix-4 Booth algorithm, shown in Figure 4, and it carries out an operation like equation (1) in one cycle. The MAU executes MPY, MAC, SQURA and so on.

(1)

Figure 4. Architecture of the MAU

d) Barrel Shifter : In signal processing algorithms, the DSP often has to shift data by more than one bit when moving or operating on them. To reduce the operating time, the CPU includes a barrel shifter. The shift range is -16 to 31 bits, and a control signal selects an arithmetic or a logical shift.

3) Memory : The proposed DSP has a 2k-word x 16-bit internal ROM and a 10k-word x 16-bit internal RAM, which allow two data accesses within one cycle. It also has 64k-word x 16-bit memory spaces for each of data, program and I/O.

4) Addressing mode : Two ARAUs (Auxiliary Register Arithmetic Units) generate two data memory addresses in one cycle. The ARAUs work in parallel with the ALU and support 8 addressing modes: short immediate addressing, long immediate addressing, absolute addressing, direct addressing, indirect addressing, bit-reversed index addressing, memory-mapped register addressing and stack addressing.

5) Instruction set : The DSP has 211 instructions in total, and the instruction size is 16 bits. They are classified into four types, shown in Table 1: arithmetic operations, logical operations, program control operations, and load and store operations.
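As an illustration of the datapath described in this section, the following Python sketch models a multiply-accumulate into a 40-bit accumulator, the guard/high/low field split, and a barrel shift over the stated -16 to 31 range. It is a behavioural sketch under stated assumptions, not the authors' RTL: the function names are hypothetical, operands are treated as signed 16-bit values, overflow wraps rather than saturates, and positive shift amounts are assumed to mean a left shift.

```python
ACC_BITS = 40                      # accumulator width stated in the paper
ACC_MASK = (1 << ACC_BITS) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def mac(acc, x, y):
    """Multiply-accumulate: return acc + x*y, wrapped to 40 bits.

    Assumption: x and y are signed 16-bit operands and overflow wraps.
    """
    product = to_signed(x, 16) * to_signed(y, 16)
    return to_signed(acc + product, ACC_BITS)


def split_acc(acc):
    """Split a 40-bit accumulator value into (guard, high, low) fields."""
    raw = acc & ACC_MASK
    return (raw >> 32) & 0xFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF


def barrel_shift(acc, amount, arithmetic=True):
    """Shift a 40-bit value by -16..31 bits.

    Assumption: positive amounts shift left, negative amounts shift right.
    """
    if not -16 <= amount <= 31:
        raise ValueError("shift amount must be in -16..31")
    if amount >= 0:
        return to_signed(acc << amount, ACC_BITS)
    if arithmetic:
        return to_signed(acc, ACC_BITS) >> -amount   # sign bits shifted in
    return (acc & ACC_MASK) >> -amount                # zeros shifted in


# Usage: five full-scale products no longer fit in 32 bits,
# so the sum spills into the guard bits instead of overflowing.
acc = 0
for _ in range(5):
    acc = mac(acc, 0x7FFF, 0x7FFF)
guard, high, low = split_acc(acc)
```

The usage lines show why the guard field matters: a run of full-scale products exceeds the 32-bit high/low range, and the extra 8 bits absorb the carry so intermediate sums stay exact.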

TABLE 1. Functional classification of instructions

III. VERIFICATION
To verify the designed DSP core, we went through three processes: HDL code simulation, FPGA implementation, and chip fabrication and verification.

A. Functional simulation
When the 211 instructions were verified using a logic simulation tool, the states of the internal registers, the buses and the status flags were checked as well. Figure 5 shows the functional simulation result for an ADD instruction. After the tests of each single instruction were finished, many combinations of single instructions were verified, and they worked correctly.

Figure 5. Functional simulation of ADD #4568, 8, A, B

B. Application algorithm simulation
1) ADPCM : The G.726 ADPCM (Adaptive Differential Pulse Code Modulation) algorithm, which suits the proposed DSP core, was coded and verified by comparing the results of the C code with the results of the HDL simulation. An input of 174,950 data at a sampling rate of 8 kHz is encoded with ADPCM, using a total of 60 instructions and 1,700 words of program memory space.

2) MP3 decode : For more complex verification, many instruction combinations are exercised by the MP3 (MPEG-1 Layer 3) decoding algorithm. As in the ADPCM test, this algorithm is verified by comparing the results of the C code with the results of the HDL simulation, shown in Figure 6. The input data are stereo, sampled at 44.1 kHz. The algorithm executes at an average of 60 MIPS and uses 12k words of program memory space and 27k words of data memory space. In testing the MP3 decoding algorithm, we confirmed that addition and multiplication instructions account for a large share of the workload, so a high-performance DSP must have an efficient, optimized adder and multiplier. Table 2 shows the computational weight of each routine of the MP3 decoder.

TABLE 2. Weight of the MP3 decoder routines
Name                                      MIPS    %
III_hufman_decode (Huffman decoder)       6.5     10.1
III_hybrid (IMDCT)                        16.3    25.5
SubbandSynthesis                          16.9    26.4
III_dequantize_sample (Dequantization)    9.4     14.7
File read/write                           2.5     4
Total                                     64.0    100

Figure 6. Results of the MP3 decoding algorithm in C code and HDL

C. Hardware verification
1) FPGA implementation : The HDL code of the DSP core was synthesized and downloaded to an Altera Excalibur FPGA. On the FPGA, four types of data movement between internal and external memory, tests of single instructions, and tests of combinations of single instructions were verified. The DSP core on the FPGA carried out these tests.
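The C-versus-HDL comparison used for the application algorithm tests in this section can be sketched as follows. This is a minimal illustration, not the authors' test bench: the paper does not describe its dump format, so the whitespace-separated hexadecimal 16-bit words, the function names and the sample data below are all assumptions.

```python
def parse_hex_words(text):
    """Parse whitespace-separated hexadecimal words into 16-bit integers."""
    return [int(token, 16) & 0xFFFF for token in text.split()]


def find_mismatches(reference, simulated):
    """Return a list of (index, reference_word, simulated_word) differences.

    Words present in only one of the two lists are reported with None
    on the other side.
    """
    mismatches = [
        (i, ref, sim)
        for i, (ref, sim) in enumerate(zip(reference, simulated))
        if ref != sim
    ]
    shorter = min(len(reference), len(simulated))
    for i in range(shorter, len(reference)):
        mismatches.append((i, reference[i], None))
    for i in range(shorter, len(simulated)):
        mismatches.append((i, None, simulated[i]))
    return mismatches


# Usage with made-up sample data (not taken from the paper).
c_model_output = parse_hex_words("0001 7fff 8000 00a5")
hdl_output = parse_hex_words("0001 7FFF 8001 00a5")
report = find_mismatches(c_model_output, hdl_output)
```

An empty report means the HDL simulation reproduced the C reference word for word. Reporting the index of each difference makes it easier to trace a failure back to the instruction sequence that produced it.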

2) Chip fabrication : After FPGA verification, the DSP core was fabricated in a 3.3 V, 0.25 µm technology. Figure 7 shows the layout of the DSP core and memory cells. The DSP core can operate at a 106 MHz clock rate in post-simulation. The logic area of the DSP is 7,159,500 and the logic area of the memory cells is 6,321,161. The DSP core has 69,860 gates in terms of two-input NAND gates.
Figure 7. Chip layout of DSP core and memory

3) Chip verification : To verify the fabricated DSP chip, a PCB test board was designed, as shown in Figure 8. Tests of single instructions, the memory interface and the ADPCM algorithm were verified in the same way as in the functional simulation and the FPGA implementation, and the chip verification confirmed the correct results.

Figure 8. Test environment for DSP chip

IV. ADDITIONAL DESIGN
To increase the usefulness of the proposed DSP core, an AMBA 2.0 wrapper for the DSP core was designed, as shown in Figure 9. The DSP core can thus interface to AMBA and operate as a master or a slave. The wrapper connects the external memory control signals and the data/address buses of the DSP core to the AMBA bus, so the DSP core and an ARM7TDMI core are interfaced through AMBA. The overall architecture, including the DSP core, the ARM7 core and external memory blocks, has been designed, and tests in the interfaced environment will be carried out.

Figure 9. Architecture of a wrapper and DSP core

V. CONCLUSION
This paper describes the design and implementation of a 16-bit fixed-point pipelined processor. The processor core was designed in Verilog HDL, and its instruction set and application algorithms, such as ADPCM and MP3 decoding, were verified through functional simulation. After this verification, the DSP core was implemented on an FPGA. The DSP core was then synthesized with a 3.3 V, 0.25 µm CMOS library and fabricated after layout.

The fabricated DSP core has 69,860 gates in terms of two-input NAND gates and can operate at a maximum clock rate of 106 MHz. To test the compatibility of the DSP, single-instruction tests and the ADPCM algorithm test were carried out in the PCB environment, and they worked correctly. For an easy and accurate verification procedure, a suitable debugger system will be designed, and the designed DSP core will be applied to SoC technology.

ACKNOWLEDGMENT
This project is supported by Samsung Electro-Mechanics and IDEC.

REFERENCES
[1] C. S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Trans. on Electronic Computers, vol. EC-13, no. 1, pp. 14-17, Feb. 1964.
[2] Avtar Singh and S. Srinivasan, Digital Signal Processing Implementations: Using DSP Microprocessors with Examples from TMS320C54xx, Thomson Brooks/Cole, 2004.
[3] Texas Instruments, TMS320C54x Reference Set, 1999.
