

CHAPTER VIII DSP PROCESSORS




Digital Signal Processing is carried out by mathematical operations. In comparison, word processing and
similar programs merely rearrange stored data. This means that computers designed for business and other
general applications are not optimized for algorithms such as digital filtering and Fourier analysis. Digital
Signal Processors are microprocessors specifically designed to handle Digital Signal Processing tasks.
These devices have seen tremendous growth in the last decade, finding use in everything from cellular
telephones to advanced scientific instruments. In fact, hardware engineers use "DSP" to mean Digital
Signal Processor, just as algorithm developers use "DSP" to mean Digital Signal Processing.

The last forty years have shown that computers are extremely capable in two broad areas:
Data manipulation, such as word processing and database management,
Mathematical calculation, used in science, engineering, and Digital Signal Processing.
All microprocessors can perform both tasks; however, it is difficult (expensive) to make a device that is
optimized for both. There are technical tradeoffs in the hardware design, such as the size of the instruction
set and how interrupts are handled. Even more important, there are marketing issues involved:
Development and manufacturing cost,
Competitive position,
Product lifetime.
As a broad generalization, these factors have made traditional microprocessors, such as the Pentium,
primarily directed at data manipulation. Similarly, DSPs are designed to perform the mathematical
calculations needed in Digital Signal Processing.


Data manipulation versus mathematical calculation.

In addition to performing mathematical calculations very rapidly, DSPs must also have a predictable
execution time. Most DSPs are used in applications where the processing is continuous, not having a
defined start or end. For instance, consider an engineer designing a DSP system for an audio signal, such as
a hearing aid. If the digital signal is being received at 20,000 samples per second, the DSP must be able to
maintain a sustained throughput of 20,000 samples per second. However, there are important reasons not to
make it any faster than necessary. As the speed increases, so does the cost, the power consumption, the
design difficulty, and so on. This makes an accurate knowledge of the execution time critical for selecting
the proper device, as well as the algorithms that can be applied.

Circular Buffering

In real-time processing, the output signal is produced at the same time that the input signal is being
acquired. For example, this is needed in telephone communication, hearing aids, and radar. These
applications must have the information immediately available, although it can be delayed by a short
amount. For instance, a 10 millisecond delay in a telephone call cannot be detected by the speaker or
listener. Likewise, it makes no difference if a radar signal is delayed by a few seconds before being
displayed to the operator. Real-time applications input a sample, perform the algorithm, and output a
sample, over-and-over. Alternatively, they may input a group of samples, perform the algorithm, and output
a group of samples. This is the world of Digital Signal Processors.


Circular buffer operation.


FIR digital filter. In FIR filtering, each sample in the output signal, y[n], is found by multiplying samples
from the input signal, x[n], x[n-1], x[n-2], ..., by the filter kernel coefficients, a0, a1, a2, a3 ..., and
summing the products.

Figure above shows an FIR filter being implemented in real-time. To calculate the output sample, we must
have access to a certain number of the most recent samples from the input. For example, suppose we use
eight coefficients in this filter, a0, a1, ..., a7. This means we must know the value of the eight most recent
samples from the input signal, x[n], x[n-1], ..., x[n-7]. These eight samples must be stored in memory and
continually updated as new samples are acquired. What is the best way to manage these stored samples?
The answer is circular buffering.

Figure above illustrates an eight sample circular buffer. We have placed this circular buffer in eight
consecutive memory locations, 20041 to 20048. Figure (a) shows how the eight samples from the input
might be stored at one particular instant in time, while (b) shows the changes after the next sample is
acquired. The idea of circular buffering is that the end of this linear array is connected to its beginning;
memory location 20041 is viewed as being next to 20048, just as 20044 is next to 20045. You keep track of
the array by a pointer (a variable whose value is an address) that indicates where the most recent sample
resides. For instance, in (a) the pointer contains the address 20044, while in (b) it contains 20045. When a
new sample is acquired, it replaces the oldest sample in the array, and the pointer is moved one address
ahead. Circular buffers are efficient because only one value needs to be changed when a new sample is
acquired.

Four parameters are needed to manage a circular buffer. First, there must be a pointer that indicates the start
of the circular buffer in memory (in this example, 20041). Second, there must be a pointer indicating the
end of the array (e.g., 20048), or a variable that holds its length (e.g., 8). Third, the step size of the memory
addressing must be specified. In the figure above the step size is one; for example, address 20043 contains one
sample, address 20044 contains the next sample, and so on. This is frequently not the case. For instance, the
addressing may refer to bytes, and each sample may require two or four bytes to hold its value. In these
cases, the step size would need to be two or four, respectively.

These three values define the size and configuration of the circular buffer, and will not change during the
program operation. The fourth value, the pointer to the most recent sample, must be modified as each new
sample is acquired. In other words, there must be program logic that controls how this fourth value is
updated based on the first three. While this logic is quite simple, it must be very fast. This is the whole
point of this discussion: DSPs are optimized for managing circular buffers to achieve the highest possible
execution speed.

Below are the steps needed to implement an FIR filter using circular buffers for both the input signal and
the coefficients. The efficient handling of these individual tasks is what separates a DSP from a traditional
microprocessor. For each new sample, all the following steps need to be taken:

1. Obtain a sample with the ADC; generate an interrupt
2. Detect and manage the interrupt
3. Move the sample into the input signal's circular buffer
4. Update the pointer for the input signal's circular buffer
5. Zero the accumulator
6. Control the loop through each of the coefficients
7. Fetch the coefficient from the coefficient's circular buffer
8. Update the pointer for the coefficient's circular buffer
9. Fetch the sample from the input signal's circular buffer
10. Update the pointer for the input signal's circular buffer
11. Multiply the coefficient by the sample
12. Add the product to the accumulator
13. Move the output sample (accumulator) to a holding buffer
14. Move the output sample from the holding buffer to the DAC

The goal is to make these steps execute quickly. Since steps 6-12 will be repeated many times (once for
each coefficient in the filter), special attention must be given to these operations. Traditional
microprocessors must generally carry out these 14 steps in serial (one after another), while DSPs are
designed to perform them in parallel. In some cases, all of the operations within the loop (steps 6-12) can
be completed in a single clock cycle. Let's look at the internal architecture that allows this magnificent
performance.

Architecture of the Digital Signal Processor

One of the biggest bottlenecks in executing DSP algorithms is transferring information to and from
memory. This includes data, such as samples from the input signal and the filter coefficients, as well as
program instructions, the binary codes that go into the program sequencer. Typical DSP operations require
many simple additions and multiplications. Each addition or multiplication requires us to:
Fetch two operands
Perform the addition or multiplication (usually both)
Store the result or hold it for a repetition
To fetch the two operands in a single instruction cycle, we need to be able to make two memory accesses
simultaneously. Actually, a little thought will show that since we also need to store the result - and to read
the instruction itself - we really need more than two memory accesses per instruction cycle. For this reason
DSP processors usually support multiple memory accesses in the same instruction cycle. It is not possible
to access two different memory addresses simultaneously over a single memory bus. There are three
common methods to achieve multiple memory accesses per instruction cycle:
Harvard architecture
Modified von Neumann architecture
Super Harvard Architecture
The Harvard architecture has two separate physical memory buses. This allows two simultaneous memory
accesses:

The Harvard architecture
The Harvard architecture requires two memory buses. This makes it expensive to bring off the chip - for
example a DSP using 32 bit words and with a 32 bit address space requires at least 64 pins for each
memory bus - a total of 128 pins if the Harvard architecture is brought off the chip. This results in very
large chips, which are difficult to design into a circuit. Even the simplest DSP operation - an addition
involving two operands and a store of the result to memory - requires four memory accesses (three to fetch
the two operands and the instruction, plus a fourth to write the result). This exceeds the capabilities of a
Harvard architecture. Some processors get around this by using a modified von Neumann architecture.

Figure below shows how it is implemented in a traditional microprocessor. This is often called von
Neumann architecture, after the brilliant Hungarian-American mathematician John von Neumann (1903-1957).
The von Neumann architecture uses only a single memory bus. The von Neumann design is quite satisfactory
when you are content to execute all of the required tasks in serial. In fact, most computers today are of the
von Neumann design. We only need other architectures when very fast processing is required, and we are
willing to pay the price of increased complexity.

The von Neumann architecture

This is cheap, requires fewer pins than the Harvard architecture, and is simple to use because the programmer
can place instructions or data anywhere in the available memory. But it does not permit multiple
memory accesses. The modified von Neumann architecture allows multiple memory accesses per instruction
cycle by the simple trick of running the memory clock faster than the instruction cycle. For example the
Lucent DSP32C runs with an 80 MHz clock: this is divided by four to give 20 million instructions per
second (MIPS), but the memory clock runs at the full 80 MHz - each instruction cycle is divided into four
'machine states' and a memory access can be made in each machine state, permitting a total of four memory
accesses per instruction cycle:

In this case the modified von Neumann architecture permits all the memory accesses needed to support an
addition or multiplication:
fetch of the instruction
fetch of the two operands
storage of the result.
Both Harvard and von Neumann architectures require the programmer to be careful of where in memory
data is placed, for example, with the Harvard architecture, if both needed operands are in the same memory
bank then they cannot be accessed simultaneously.

The true Harvard architecture dedicates one bus for fetching instructions, with the other available to fetch
operands. This is inadequate for DSP operations, which usually involve at least two operands. So DSP
Harvard architectures usually permit the 'program' bus to be used also for access of operands. Note that it is
often necessary to fetch three things - the instruction plus two operands - and the Harvard architecture is
inadequate to support this: so DSP Harvard architectures often also include a cache memory which can be
used to store instructions which will be reused, leaving both Harvard buses free for fetching operands. This
extension - Harvard architecture plus cache - is sometimes called an extended Harvard architecture or
Super Harvard ARChitecture (SHARC) e.g. ADSP-2106x and new ADSP-211xx.


The super Harvard architecture

The idea is to build upon the Harvard architecture by adding features to improve the throughput. While the
SHARC DSPs are optimized in dozens of ways, two areas are important enough to be included in the figure
above:
An instruction cache, and
An I/O controller.
A handicap of the basic Harvard design is that the data memory bus is busier than the program memory
bus. When two numbers are multiplied, two binary values (the numbers) must be passed over the data
memory bus, while only one binary value (the program instruction) is passed over the program memory
bus. To improve upon this situation, we start by relocating part of the "data" to program memory. For
instance, we might place the filter coefficients in program memory, while keeping the input signal in data
memory. (This relocated data is called "secondary data" in the illustration). At first glance, this doesn't
seem to help the situation; now we must transfer one value over the data memory bus (the input signal
sample), but two values over the program memory bus (the program instruction and the coefficient). In fact,
if we were executing random instructions, this situation would be no better at all.

However, DSP algorithms generally spend most of their execution time in loops, such as instructions 6-12
above. This means that the same set of program instructions will continually pass from program memory to
the CPU. The Super Harvard architecture takes advantage of this situation by including an instruction
cache in the CPU. This is a small memory that contains about 32 of the most recent program instructions.
The first time through a loop, the program instructions must be passed over the program memory bus. This
results in slower operation because of the conflict with the coefficients that must also be fetched along this
path. However, on additional executions of the loop, the program instructions can be pulled from the
instruction cache. This means that all of the memory to CPU information transfers can be accomplished in
a single cycle:
The sample from the input signal comes over the data memory bus,
The coefficient comes over the program memory bus, and
The program instruction comes from the instruction cache.
In the jargon of the field, this efficient transfer of data is called a high memory access bandwidth.


The super Harvard architecture detailed view

Figure below presents a more detailed view of the SHARC architecture, showing the I/O controller
connected to data memory. The SHARC DSPs provide both serial and parallel communication ports.
These are extremely high speed connections. For example, at a 40 MHz clock speed, there are two serial
ports that operate at 40 Mbits/second each, while six parallel ports each provide a 40 Mbytes/second data
transfer. When all six parallel ports are used together, the data transfer rate is an incredible 240
Mbytes/second. Just as important, dedicated hardware allows these data streams to be transferred directly
into memory (Direct Memory Access, or DMA), without having to pass through the CPU's registers. In
other words, tasks 1 & 14 on our list happen independently and simultaneously with the other tasks; no
cycles are stolen from the CPU. The main buses (program memory bus and data memory bus) are also
accessible from outside the chip, providing an additional interface to off-chip memory and peripherals. This
allows the SHARC DSPs to use a four Gigaword (16 Gbyte) memory, accessible at 40 Mwords/second
(160 Mbytes/second), for 32 bit data. This type of high speed I/O is a key characteristic of DSPs. The
overriding goal is to move the data in, perform the math, and move the data out before the next sample is
available. Everything else is secondary. Some DSPs have onboard analog-to-digital and digital-to-analog
converters, a feature called mixed signal. However, all DSPs can interface with external converters through
serial or parallel ports.

At the top of the diagram are two blocks labeled Data Address Generator (DAG), one for each of the two
memories. These control the addresses sent to the program and data memories, specifying where the
information is to be read from or written to. In simpler microprocessors this task is handled as an inherent
part of the program sequencer, and is quite transparent to the programmer. However, DSPs are designed to
operate with circular buffers, and benefit from the extra hardware to manage them efficiently. This avoids
needing to use precious CPU clock cycles to keep track of how the data are stored. For instance, in the
SHARC DSPs, each of the two DAGs can control eight circular buffers. This means that each DAG holds
32 variables (4 per buffer), plus the required logic.

Why so many circular buffers? Some DSP algorithms are best carried out in stages. For instance, IIR filters
are more stable if implemented as a cascade of biquads (a stage containing two poles and up to two zeros).
Multiple stages require multiple circular buffers for the fastest operation. The DAGs in the SHARC DSPs
are also designed to efficiently carry out the Fast Fourier transform. In this mode, the DAGs are
configured to generate bit-reversed addresses into the circular buffers, a necessary part of the FFT
algorithm. In addition, an abundance of circular buffers greatly simplifies DSP code generation - both for
the human programmer as well as high-level language compilers, such as C. The data register section of the
CPU is used in the same way as in traditional microprocessors. In the ADSP-2106x SHARC DSPs, there
are 16 general purpose registers of 40 bits each. These can hold intermediate calculations, prepare data for
the math processor, serve as a buffer for data transfer, hold flags for program control, and so on. If needed,
these registers can also be used to control loops and counters; however, the SHARC DSPs have extra
hardware registers to carry out many of these functions.

The math processing is broken into three sections, a multiplier, an arithmetic logic unit (ALU), and a
barrel shifter. The multiplier takes the values from two registers, multiplies them, and places the result
into another register. The ALU performs addition, subtraction, absolute value, logical operations (AND,
OR, XOR, NOT), conversion between fixed and floating point formats, and similar functions. Elementary
binary operations are carried out by the barrel shifter, such as shifting, rotating, extracting and depositing
segments, and so on. A powerful feature of the SHARC family is that the multiplier and the ALU can be
accessed in parallel. In a single clock cycle, data from registers 0-7 can be passed to the multiplier, data
from registers 8-15 can be passed to the ALU, and the two results returned to any of the 16 registers. There
are also many important features of the SHARC family architecture that aren't shown in this simplified
illustration. For instance, an 80 bit accumulator is built into the multiplier to reduce the round-off error
associated with multiple fixed-point math operations.

Another interesting feature is the use of shadow registers for all the CPU's key registers. These are
duplicate registers that can be switched with their counterparts in a single clock cycle. They are used for
fast context switching, the ability to handle interrupts quickly. When an interrupt occurs in traditional
microprocessors, all the internal data must be saved before the interrupt can be handled. This usually
involves pushing all of the occupied registers onto the stack, one at a time. In comparison, an interrupt in
the SHARC family is handled by moving the internal data into the shadow registers in a single clock cycle.
When the interrupt routine is completed, the registers are just as quickly restored. This feature allows step 2
on our list (detecting and managing the sample-ready interrupt) to be handled very quickly and efficiently.

Now we come to the critical performance of the architecture, how many of the operations within the loop
(steps 6-12 above) can be carried out at the same time. Because of its highly parallel nature, the SHARC
DSP can simultaneously carry out all of these tasks. Specifically, within a single clock cycle, it can perform
a multiply (step 11), an addition (step 12), two data moves (steps 7 and 9), update two circular buffer
pointers (steps 8 and 10), and control the loop (step 6). There will be extra clock cycles associated with
beginning and ending the loop (steps 3, 4, 5 and 13, plus moving initial values into place); however, these
tasks are also handled very efficiently. If the loop is executed more than a few times, this overhead will be
negligible. As an example, suppose you write an efficient FIR filter program using 100 coefficients. You
can expect it to require about 105 to 110 clock cycles per sample to execute (i.e., 100 coefficient loops plus
overhead). This is very impressive; a traditional microprocessor requires many thousands of clock cycles
for this algorithm.

Although there are many DSP processors, they are mostly designed with the same few basic operations in
mind, so they share the same set of basic characteristics. This enables us to draw the processor diagrams in
a similar way, to bring out the similarities and allow us to concentrate on the differences:

The diagram above shows a generalised DSP processor, with the basic features that are common. The
Analog Devices ADSP21060 shows how similar the basic architectures are:

The ADSP21060 has a Harvard architecture - shown by the two memory buses. This is extended by a
cache, making it a Super Harvard ARChitecture (SHARC). Note, however, that the Harvard architecture is
not fully brought off chip - there is a special bus switch arrangement which is not shown on the diagram.
The 21060 has two serial ports in place of the Lucent DSP32C's one. Its host port implements a PCI bus
rather than the older ISA bus. Apart from this, the 21060 introduces four features not found on the Lucent
DSP32C:
There are two sets of address generation registers. DSP processors commonly have to react to
interrupts quickly - the two sets of address generation registers allow for swapping between
register sets when an interrupt occurs, instead of having to save and restore the complete set of
registers.
There are six link ports, used to connect with up to six other 21060 processors: showing that this
processor is intended for use in multiprocessor designs.
There is a timer - useful to implement DSP multitasking operating system features using time
slicing.
There is a debug port - allowing direct non-intrusive debugging of the processor internals.

To suit these fundamental operations DSP processors often have:
Parallel multiply and add
Multiple memory accesses (to fetch two operands and store the result)
Lots of registers to hold data temporarily
Efficient address generation for array handling
Special features such as delays or circular addressing
To perform the simple arithmetic required, DSP processors need special high speed arithmetic units. Most
DSP operations require additions and multiplications together. So DSP processors usually have hardware
adders and multipliers which can be used in parallel within a single instruction:

The diagram shows the data path for the Lucent DSP32C processor. The hardware multiply and add work
in parallel so that in the space of a single instruction, both an add and a multiply can be completed. Delays
require that intermediate values be held for later use. This may also be a requirement, for example, when
keeping a running total - the total can be kept within the processor to avoid wasting repeated reads from
and writes to memory. For this reason DSP processors have lots of registers which can be used to hold
intermediate values:

Registers may be fixed point or floating point format. Array handling requires that data can be fetched
efficiently from consecutive memory locations. This involves generating the next required memory
address. For this reason DSP processors have address registers which are used to hold addresses and can be
used to generate the next needed address efficiently:

The ability to generate new addresses efficiently is a characteristic feature of DSP processors. Usually, the
next needed address can be generated during the data fetch or store operation, and with no overhead. DSP
processors have rich sets of address generation operations:
*rP        register indirect: read the data pointed to by the address in register rP
*rP++      postincrement: having read the data, postincrement the address pointer to point to the
           next value in the array
*rP--      postdecrement: having read the data, postdecrement the address pointer to point to the
           previous value in the array
*rP++rI    register postincrement: having read the data, postincrement the address pointer by the
           amount held in register rI to point to rI values further down the array
*rP++rIr   bit reversed: having read the data, postincrement the address pointer to point to the
           next value in the array, as if the address bits were in bit reversed order
The table shows some addressing modes for the Lucent DSP32C processor. The assembler syntax is very
similar to C language. Whenever an operand is fetched from memory using register indirect addressing, the
address register can be incremented to point to the next needed value in the array. This address increment is
free - there is no overhead involved in the address calculation - and in the case of the Lucent DSP32C
processor up to three such addresses may be generated in each single instruction. Address generation is an
important factor in the speed of DSP processors at their specialised operations. The last addressing mode -
bit reversed - shows how specialised DSP processors can be. Bit reversed addressing arises when a table of
values has to be reordered by reversing the order of the address bits:
- Reverse the order of the bits in each address
- Shuffle the data so that the new, bit reversed, addresses are in ascending order
This operation is required in the Fast Fourier Transform - and just about nowhere else. So one can see that
DSP processors are designed specifically to calculate the Fast Fourier Transform efficiently.

DSP processors: input and output interfaces
In addition to the mathematics, in practice DSP is mostly dealing with the real world. Although this aspect
is often forgotten, it is of great importance and marks some of the greatest distinctions between DSP
processors and general purpose microprocessors:

In a typical DSP application, the processor will have to deal with multiple sources of data from the real
world. In each case, the processor may have to be able to receive and transmit data in real time, without
interrupting its internal mathematical operations. There are three sources of data from the real world:
Signals coming in and going out
Communication with an overall system controller of a different type
Communication with other DSP processors of the same type
These multiple communications routes mark the most important distinctions between DSP processors and
general purpose processors.

When DSP processors first came out, they were rather fast processors: for example the first floating point
DSP - the AT&T DSP32 - ran at 16 MHz at a time when PC computer clocks were 5 MHz. This meant that
we had very fast floating point processors: a fashionable demonstration at the time was to plug a DSP board
into a PC and run a fractal (Mandelbrot) calculation on the DSP and on a PC side by side. The DSP fractal
was of course faster. Today, however, the fastest DSP processor is the Texas TMS320C6201 which runs at
200 MHz. This is no longer very fast compared with an entry level PC. And the same fractal today will
actually run faster on the PC than on the DSP. But DSP processors are still used - why? The answer lies
only partly in that the DSP can run several operations in parallel: a far more basic answer is that the DSP
can handle signals very much better than a Pentium. Try feeding eight channels of high quality audio data
in and out of a Pentium simultaneously in real time, without impacting processor performance, if
you want to see a real difference. The need to deal with these different sources of data efficiently leads to
special communication features on DSP processors:


Signals tend to be fairly continuous, but at audio rates or not much higher. They are usually handled by
high speed synchronous serial ports. Serial ports are inexpensive - having only two or three wires - and are
well suited to audio or telecommunications data rates up to 10 Mbit/s. Most modern speech and audio
analogue to digital converters interface to DSP serial ports with no intervening logic. A synchronous serial
port requires only three wires: clock, data, and word sync. The addition of a fourth wire (frame sync) and a
high impedance state when not transmitting makes the port capable of Time Division Multiplex (TDM)
data handling, which is ideal for telecommunications:

DSP processors usually have synchronous serial ports - transmitting clock and data separately - although
some, such as the Motorola DSP56000 family, have asynchronous serial ports as well (where the clock is
recovered from the data). Timing is versatile, with options to generate the serial clock from the DSP chip
clock or from an external source. The serial ports may also be able to support separate clocks for receive
and transmit - a useful feature, for example, in satellite modems where the clocks are affected by Doppler
shifts. Most DSP processors also support companding to A-law or mu-law in serial port hardware with no
overhead - the Analog Devices ADSP2181 and the Motorola DSP56000 family do this in the serial port,
whereas the Lucent DSP32C has a hardware compander in its data path instead. The serial port will usually
operate under DMA - data presented at the port is automatically written into DSP memory without stopping
the DSP - with or without interrupts. It is usually possible to receive and transmit data simultaneously. The
serial port has dedicated instructions which make it simple to handle. Because it is standard to the chip, this
means that many types of actual I/O hardware can be supported with little or no change to code - the DSP
program simply deals with the serial port, no matter to what I/O hardware this is attached.

Host communications is an element of many, though not all, DSP systems. Many systems will have
another, general purpose, processor to supervise the DSP, for example, the DSP might be on a PC plug-in
card or a VME card - simpler systems might have a microcontroller to perform a 'watchdog' function or to
initialise the DSP on power up. Whereas signals tend to be continuous, host communication tends to
require data transfer in batches - for instance to download a new program or to update filter coefficients.
Some DSP processors have dedicated host ports which are designed to communicate with another
processor of a different type, or with a standard bus. For instance the Lucent DSP32C has a host port which
is effectively an 8 bit or 16 bit ISA bus: the Motorola DSP56301 and the Analog Devices ADSP21060 have
host ports which implement the PCI bus.

The host port will usually operate under DMA - data presented at the port is automatically written into DSP
memory without stopping the DSP - with or without interrupts. It is usually possible to receive and transmit
data simultaneously. The host port has dedicated instructions which make it simple to handle. The host port
imposes a welcome element of standardisation on plug-in DSP boards - because it is standard to the chip, it
is relatively difficult for individual designers to make the bus interface different. For example, of the 22
different manufacturers of PC plug-in cards using the Lucent DSP32C, 21 are supported by the same
PC interface code: this means it is possible to swap between different cards for different purposes, or to
change to a cheaper manufacturer, without changing the PC side of the code. Of course this is not foolproof
- some engineers will always 'improve upon' a standard by making something incompatible if they can - but
at least it limits unwanted creativity.

Interprocessor communications is needed when a DSP application is too much for a single processor - or
where many processors are needed to handle multiple but connected data streams. Link ports provide a
simple means to connect several DSP processors of the same type. The Texas TMS320C40 and the Analog
Devices ADSP21060 both have six link ports (called 'comm ports' for the 'C40). These would ideally be
parallel ports at the word length of the processor, but this would use up too many pins (six ports each 32
bits wide = 192 pins, which is a lot even if we neglect grounds). So a hybrid called serial/parallel is used:
in the 'C40, comm ports are 8 bits wide and it takes four transfers to move one 32 bit word - in the 21060,
link ports are 4 bits wide and it takes 8 transfers to move one 32 bit word.
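The byte-wide packing just described can be sketched in C. The port hardware does this itself; these helper functions are purely illustrative, showing the arithmetic for the 'C40 case of four 8 bit transfers per 32 bit word (least significant byte first is an assumption here, not a documented detail).

```c
#include <stdint.h>

/* Illustrative sketch of the serial/parallel link-port idea: one
   32-bit word moves as four byte-wide transfers (as on 'C40 comm
   ports). The port hardware performs this packing itself; these
   helpers just show the arithmetic, least significant byte first
   (the byte order is an assumption for illustration). */
void word_to_bytes(uint32_t w, uint8_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(w >> (8 * i));
}

uint32_t bytes_to_word(const uint8_t in[4])
{
    uint32_t w = 0;
    for (int i = 0; i < 4; i++)
        w |= (uint32_t)in[i] << (8 * i);
    return w;
}
```

The nibble-wide 21060 link ports work the same way, with eight 4 bit transfers in place of four 8 bit ones.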

The link port will usually operate under DMA - data presented at the port is automatically written into DSP
memory without stopping the DSP - with or without interrupts. It is usually possible to receive and transmit
data simultaneously. This is a lot of data movement - for example the Texas TMS320C40 could in principle
use all its six comm ports at their full rate of 20 Mbyte/s to achieve data transfer rates of 120 Mbyte/s. In
practice, of course, such rates exist only in the dreams of marketing men since other factors such as internal
bus bandwidth come into play. The link ports have dedicated instructions which make them simple to
handle. Although they are sometimes used for signal I/O, this is not always a good idea since it involves
very high speed signals over many pins and it can be hard for external hardware to exactly meet the timing
requirements.

DSP processors: data formats

DSP processors store data in fixed or floating point formats. It is worth noting that fixed point format is not
quite the same as integer:

The integer format is straightforward: representing whole numbers from 0 up to the largest whole number
that can be represented with the available number of bits. Fixed point format is used to represent numbers
that lie between 0 and 1: with a 'binary point' assumed to lie just after the most significant bit. The most
significant bit in both cases carries the sign of the number.
The size of the fraction represented by the smallest bit is the precision of the fixed point format.
The size of the largest number that can be represented in the available word length is the dynamic
range of the fixed point format.
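The distinction between integer and fixed point can be seen by reading the same 16 bit pattern both ways. This sketch uses the common Q15 convention (binary point just after the sign bit); the helper names are my own, not part of any device's conventions.

```c
#include <stdint.h>

/* The same 16-bit pattern read as a signed integer and as a fixed
   point fraction with the binary point just after the sign bit
   (the common Q15 convention). Helper names are illustrative. */
int16_t as_integer(uint16_t bits)  { return (int16_t)bits; }
double  as_fraction(uint16_t bits) { return (int16_t)bits / 32768.0; }

/* as_integer(0x4000) gives 16384; as_fraction(0x4000) gives 0.5 */
```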
To make the best use of the full available word length in the fixed point format, the programmer has to
make some decisions:
If a fixed point number becomes too large for the available word length, the programmer has to
scale the number down, by shifting it to the right: in the process lower bits may drop off the end
and be lost
If a fixed point number is small, the number of bits actually used to represent it is small. The
programmer may decide to scale the number up, in order to use more of the available word length
In both cases the programmer has to keep track of how much the binary point has been shifted, in
order to restore all numbers to the same scale at some later stage. Floating point format has the remarkable
property of automatically scaling all numbers by moving, and keeping track of, the binary point so that all
numbers use the full word length available but never overflow:
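The manual scaling burden described for fixed point can be illustrated with a Q15 multiply in C. This is a sketch using the common Q15 convention, not any particular device's instruction set; the function name is hypothetical.

```c
#include <stdint.h>

/* Sketch of fixed point arithmetic in the Q15 convention (16-bit
   word, values in [-1, 1)). The 32-bit product of two Q15 numbers
   has 30 fraction bits, so it must be shifted right by 15 to return
   to Q15 - the low bits dropped in that shift are exactly the
   precision loss described in the text. Illustrative code only. */
int16_t q15_mul(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b) >> 15);
}

/* q15_mul(0x4000, 0x4000) - i.e. 0.5 * 0.5 - gives 0x2000, i.e. 0.25 */
```

Note that the programmer, not the hardware, decides where the binary point sits and when to shift: the shift count must be carried alongside the data if numbers are to be restored to a common scale later.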

Floating point numbers have two parts: the mantissa, which is similar to the fixed point part of the number,
and an exponent which is used to keep track of how the binary point is shifted. Every number is scaled by
the floating point hardware:
If a number becomes too large for the available word length, the hardware automatically scales it
down, by shifting it to the right
If a number is small, the hardware automatically scales it up, in order to use the full available word
length of the mantissa
In both cases the exponent is used to count how many times the number has been shifted. In floating point
numbers the binary point comes after the second most significant bit in the mantissa. The block floating
point format provides some of the benefits of floating point, but by scaling blocks of numbers rather than
each individual number:

Block floating point numbers are actually represented by the full word length of a fixed point format.
If any one of a block of numbers becomes too large for the available word length, the programmer
scales down all the numbers in the block, by shifting them to the right
If the largest of a block of numbers is small, the programmer scales up all numbers in the block, in
order to use the full available word length of the mantissa
In both cases the exponent is used to count how many times the numbers in the block have been shifted.
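The test-and-shift procedure a programmer performs for block floating point can be sketched in C. This is an illustrative sketch only; the function name and the choice of a 16 bit word are assumptions for the example.

```c
#include <stdint.h>

/* Sketch of block floating point: scale a whole block of 16-bit
   fixed point numbers so the largest uses the full word length,
   returning one shared exponent for the block. Illustrative code -
   on most processors this test-and-shift is the programmer's job. */
int block_normalise(int16_t *x, int n)
{
    int peak = 0;
    for (int i = 0; i < n; i++) {
        int m = (x[i] < 0) ? -(int)x[i] : (int)x[i];
        if (m > peak) peak = m;
    }
    if (peak == 0) return 0;
    int shift = 0;
    while (peak < 0x4000) {        /* until the top magnitude bit is used */
        peak <<= 1;
        shift++;
    }
    for (int i = 0; i < n; i++)
        x[i] = (int16_t)(x[i] << shift);
    return shift;                  /* block exponent: values scaled by 2^shift */
}
```

The single returned exponent plays the role described in the text: it counts how many times every number in the block has been shifted, so the block can be restored to its true scale later.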
Some specialised processors, such as those from Zilog, have special features to support the use of block
floating point format: more usually, it is up to the programmer to test each block of numbers and carry out
the necessary scaling. The floating point format has one further advantage over fixed point: it is faster.
To avoid overflow, a basic direct form 1 IIR filter second order section requires an extra multiplier, to
scale numbers down. But the floating point hardware automatically scales every number to avoid overflow,
so this extra multiplier is not required:


DSP processors: precision and dynamic range

The precision with which numbers can be represented is determined by the word length in the fixed point
format, and by the number of bits in the mantissa in the floating point format. In a 32 bit DSP processor the
mantissa is usually 24 bits: so the precision of a floating point DSP is the same as that of a 24 bit fixed
point processor. But floating point has one further advantage over fixed point: because the hardware
automatically scales each number to use the full word length of the mantissa, the full precision is
maintained even for small numbers:

There is a potential disadvantage to the way floating point works. Because the hardware automatically
scales and normalises every number, the errors due to truncation and rounding depend on the size of the
number. If we regard these errors as a source of quantisation noise, then the noise floor is modulated by the
size of the signal. Although the modulation can be shown to be always downwards (that is, a 32 bit floating
point format always has noise which is less than that of a 24 bit fixed point format), the signal dependent
modulation of the noise may be undesirable: notably, the audio industry prefers to use 24 bit fixed point
DSP processors over floating point because it is thought by some that the floating point noise floor
modulation is audible.
The precision directly affects quantisation error.
The largest number which can be represented determines the dynamic range of the data format. In fixed
point format this is straightforward: the dynamic range is the range of numbers that can be represented in
the available word length. For floating point format, though, the binary point is moved automatically to
accommodate larger numbers: so the dynamic range is determined by the size of the exponent. For an 8 bit
exponent, the dynamic range is close to 1,500 dB:

So the dynamic range of a floating point format is enormously larger than for a fixed point format:

While the dynamic range of a 32 bit floating point format is large, it is not infinite: so it is possible to suffer
overflow and underflow even with a 32 bit floating point format. A classic example of this can be seen by
running fractal (Mandelbrot) calculations on a 32 bit DSP processor: after quite a long time, the fractal
pattern ceases to change because the increment size has become too small for a 32 bit floating point format
to represent. Most DSP processors have extended precision registers within the processor:

The diagram shows the data path of the Lucent DSP32C processor. Although this is a 32 bit floating point
processor, it uses 40 and 45 bit registers internally: so results can be held to a wider dynamic range
internally than when written to memory.
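The underflow behaviour described for the Mandelbrot example can be demonstrated in a few lines of C: once an increment is smaller than the resolution of a 32 bit float at the current magnitude, adding it changes nothing. The values and function name here are chosen for illustration.

```c
/* Demonstrates the underflow effect described for the Mandelbrot
   example: when an increment falls below the resolution of a 32-bit
   float at the current magnitude, the addition is simply lost.
   Values and the function name are illustrative. */
int increment_is_lost(float x, float dx)
{
    volatile float sum = x + dx;   /* force rounding to float precision */
    return sum == x;
}

/* increment_is_lost(1.0f, 1e-8f) is 1: 1e-8 is well below the float
   spacing near 1.0 (about 1.2e-7), so the increment vanishes.
   increment_is_lost(1.0f, 1e-6f) is 0. */
```

This is exactly why the fractal pattern stops changing: the increment has become too small for the format to represent relative to the accumulating sum.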

DSP processors: ADSP2181
DSP often involves a need to switch rapidly between one task and another: for example, on the occurrence
of an interrupt. This would usually require all registers currently in use to be saved, and then restored after
servicing the interrupt. The DSP16A and the Analog Devices ADSP2181 use two sets of address
generation registers:

The two sets of address generation registers can be swapped as a fast alternative to saving and restoring
registers when switching between tasks. The ADSP2181 also has a timer: useful for implementing 'time
sliced' task switching, such as in a real time operating system - and another indication that this processor
was designed with task switching in mind.
It is interesting to see how far a manufacturer carries the same basic processor model into their different
designs. Texas Instruments, Analog Devices and Motorola all started with fixed point devices, and have
carried forward those designs into their floating point processors. AT&T (now Lucent) started with floating
point, then brought out fixed point devices later. The Analog Devices ADSP21060 looks like a floating
point version of the integer ADSP2181:


The 21060 also has six high speed link ports which allow it to connect with up to six other processors of the
same type. One way to support multiprocessing is to have many fast inter-processor communications ports:
another is to have shared memory. The ADSP21060 supports both methods. Shared memory is supported in
a very clever way: each processor can directly access a small area of the internal memory of up to four
other processors. As with the ADSP2181, the 21060 has a large amount of internal memory, the idea being
that most applications can work without added external memory. Note, though, that the full Harvard
architecture is not brought off chip, so the on-chip memory really does need to be big enough for most
applications.
The problem with fixed point processors is quantisation error, caused by the limited fixed point precision.
Motorola reduce this problem in the DSP56002 by using a 24 bit integer word length:

They also use three internal buses - one for program, two for data (two operands). This is an extension of
the standard Harvard architecture which goes beyond the usual trick of simply adding a cache, to allow
access to two operands and the instruction at the same time. Of course, the problem of 24 bit fixed point is
its expense, which probably explains why Motorola later produced the cheap, 16 bit DSP56156 - although
this looks like a 16 bit variant of the DSP56002:

And of course there has to be a floating point variant - the DSP96002 looks like a floating point version of
the DSP56002:

The DSP96002 supports multiprocessing with an additional 'global bus' which can connect to other
DSP96002 processors: it also has a new DMA controller with its own bus.
The Texas TMS320C25 is quite an early design. It does not have a parallel multiply/add: the multiply is
done in one cycle, the add in the next and the DSP has to address the data for both operations. It has a
modified Harvard bus with only one data bus, which sometimes restricts data memory accesses to one per
cycle, but it does have a special 'repeat' instruction to repeat an instruction without writing code loops.

The Texas TMS320C50 is the C25 brought up to date: the multiply/add can now achieve single cycle
execution if it is done in a hardware repeat loop. It also uses shadow registers as a fast way to preserve
registers when context switching. It has automatic saturation or rounding (but it needs it, since the
accumulator has no guard bits to prevent overflow), and it has parallel bit manipulation which is useful in
control applications.

The Texas TMS320C30 carries on some of the features of the integer C25, but introduces some new ideas.
It has a von Neumann architecture with multiple memory accesses in one cycle, but there are still separate
internal buses which are multiplexed onto the CPU. It also has a dedicated DMA controller.

The Texas TMS320C40 is similar to the C30, but with high speed communications ports
for multiprocessing. It has six high speed parallel comm ports which connect with other
C40 processors: these are 8 bits wide, but carry 32 bit data in four successive cycles.

The Texas TMS320C60 is radically different from other DSP processors, in using a Very Long Instruction
Word (VLIW) format. It issues a 256 bit instruction, containing up to 8 separate 'mini instructions' to each
of 8 functional units. Because of the radically different concept, it can be hard to compare this processor
with other, more traditional, DSP processors. But despite this, the 'C60 can still be viewed in a similar way
to other DSP processors, which makes it apparent that it basically has two data paths, each capable of a
multiply/accumulate.

Note that this diagram is very different from the way Texas Instruments draw it. This is for several reasons:
Texas Instruments tend to draw their processors as a set of subsystems, each with a separate block
diagram
my diagram does not show some of the more esoteric features such as the bus switching
arrangement to address the multiple external memory accesses that are required to load the eight
'mini-instructions'
An important point is raised by the placing of the address generation unit on the diagram. Texas
Instruments draw the C60 block diagram as having four arithmetic units in each data path - whereas my
diagram shows only three. The fourth unit is in fact the address generation calculation. Following my
practice for all other DSP processors, I show address generation as separate from the arithmetic units -
address calculation being implied by the presence of address registers, as is the case in all DSP processors.
The C60 can in fact choose to use the address generation unit for general purpose calculations if it is not
calculating addresses - this is similar, for example, to the Lucent DSP32C: so Texas Instruments' approach
is also valid - but for most classic DSP operations address generation would be required and so the unit
would not be available for general purpose use.

There is an interesting side effect of this. Texas Instruments rate the C60 as a 1600 MIPS device - on the
basis that it runs at 200 MHz, and has two data paths each with four execution units: 200 MHz x 2 x
4 = 1600 MIPS. But from my diagram, treating the address generation separately, we see only three
execution units per data path: 200 MHz x 2 x 3 = 1200 MIPS. The latter figure is that actually achieved in
quoted benchmarks for an FIR filter, and reflects the device's ability to perform arithmetic.

This illustrates a problem in evaluating DSP processors. It is very hard to compare like with like - not least,
because all manufacturers present their designs in such a way that they show their best performance. The
lesson to draw is that one cannot rely on MIPS, MOPS or Mflops ratings but must carefully try to
understand the features of each candidate processor and how they differ from each other - then make a
choice based on the best match to the particular application. It is very important to note that a DSP
processor's specialised design means it will achieve any quoted MIPS, MOPS or Mflops rating only if
programmed to take advantage of all the parallel features it offers.
