
Accelerating DSP Algorithms Using FPGAs

Sean Gallagher, DSP Specialist, Xilinx Inc.

Gallagher

P188/MAPLD2004

Why DSP in FPGAs


Availability of fast analog-to-digital converters (ADCs)
Enables digital methods for functions traditionally done in RF components

Massive parallel processing


FPGAs may have several hundred embedded multipliers on-chip
One FPGA can replace many DSP processors


Architectural Considerations
FPGA architectures are vendor specific
Unlike ASICs, no two are alike

FPGA vendors develop distinct competencies


In device architecture design
In intellectual property (DSP functions, bus controllers, etc.)
In design tool flows

Vendor-independent HDL can be written, but it usually achieves mediocre results in both clock speed and design size


FPGAs Are Massive Parallel Computing Machines


[Figure: four channels (ch1 to ch4) of 20 MHz samples, each with its own LPF, alongside a single multi-channel filter operating on 80 MHz samples]

FPGAs are ideally suited for multi-channel DSP designs


Many low sample rate channels can be multiplexed (e.g. TDM) and processed in the FPGA at a high rate
Interpolation (using zeros) can also drive sample rates higher


FPGAs Allow Space/Speed Trade-offs


Q = (A x B) + (C x D) + (E x F) + (G x H)

can be implemented in parallel

[Figure: multiply-add structure with inputs A through H]

But is this the only way in the FPGA?



Customize Architectures to Suit your Ideal Algorithms


FPGAs allow Area (cost) / Performance tradeoffs
[Figure: parallel, semi-parallel, and serial implementations of the sum of products, ranging from optimized for speed to optimized for area]
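The trade-off can be illustrated with a small behavioural model. The sketch below is not taken from the slides: the function name `sum_of_products` and the rule of one batch of multiplies per clock cycle are illustrative assumptions, and pipeline latency and the adder tree are ignored.

```python
def sum_of_products(pairs, multipliers):
    """Compute sum(a*b) using at most `multipliers` multiplies per clock cycle.

    Returns (result, cycles). Hypothetical model: one batch per cycle,
    no pipeline latency, adders assumed to keep up with the multipliers.
    """
    if multipliers < 1:
        raise ValueError("need at least one multiplier")
    accumulator = 0
    cycles = 0
    for start in range(0, len(pairs), multipliers):
        batch = pairs[start:start + multipliers]
        accumulator += sum(a * b for a, b in batch)  # products formed this cycle
        cycles += 1
    return accumulator, cycles


# Q = (A x B) + (C x D) + (E x F) + (G x H) with example values
pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]
for label, multipliers in (("parallel", 4), ("semi-parallel", 2), ("serial", 1)):
    result, cycles = sum_of_products(pairs, multipliers)
    print(f"{label}: {multipliers} multipliers, {cycles} cycles, Q = {result}")
```

In this model the result is identical in all three cases; only the number of multipliers (area) and the number of cycles (speed) change.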


Exploiting The Xilinx Architecture For DSP Functions


Memory blocks that can be configured as ROMs, dual-port RAMs, or FIFOs
Embedded 18x18 multipliers that can be ganged to form a 35x35-bit multiply
SRL16 shift registers
A patented technique for turning the 4-input lookup table (2 per slice) into an addressable shift register

Using SRL16E to increase Compute Density


[Figure: filter with coefficients k0 to k3 and 20 MHz input samples, 4 channels]

SRL16E takes the same area as one LUT.


[Figure: the same filter structure with 20 MHz input samples, 9 channels]

It can be used for up to 16 channels.
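A behavioural sketch of the idea, under stated assumptions: the `SRL16` class below is a hypothetical software model of an addressable 16-deep shift register (not the Xilinx primitive itself), and `tdm_fir` assumes the channels arrive interleaved sample by sample. Reading each delay stage at address `channels - 1` gives a delay of `channels` samples between taps, so one filter structure serves every channel.

```python
class SRL16:
    """Hypothetical model of a 16-deep shift register with an addressable output."""

    def __init__(self):
        self.cells = [0] * 16

    def read(self, address):
        if not 0 <= address < 16:
            raise ValueError("address must be in 0..15")
        return self.cells[address]

    def shift(self, value):
        self.cells = [value] + self.cells[:-1]


def tdm_fir(interleaved, coeffs, channels):
    """FIR over a time-multiplexed stream; tap spacing equals the channel count."""
    if not 1 <= channels <= 16:
        raise ValueError("one SRL16 per tap supports 1 to 16 channels")
    delays = [SRL16() for _ in coeffs[1:]]
    outputs = []
    for sample in interleaved:
        taps = [sample] + [d.read(channels - 1) for d in delays]
        outputs.append(sum(c * t for c, t in zip(coeffs, taps)))
        # After reading, each stage shifts in the value seen at the previous tap.
        for delay, value in zip(delays, taps[:-1]):
            delay.shift(value)
    return outputs


def fir(samples, coeffs):
    """Reference single-channel FIR with zero initial state."""
    return [
        sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


channel_data = [[1, 2, 3], [10, 20, 30], [0, 0, 1], [5, 0, 5]]
interleaved = [ch[n] for n in range(3) for ch in channel_data]
filtered = tdm_fir(interleaved, [1, 2, 3], channels=4)
for index, samples in enumerate(channel_data):
    print(index, filtered[index::4], fir(samples, [1, 2, 3]))
```

Changing the channel count only changes the read address, which is why the area stays at one LUT per bit of delay for anything up to 16 channels.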


Xilinx System Generator For DSP


System Generator is a blockset that resides in the Simulink/MATLAB environment.
System Generator blocks are bit-true and cycle-true models of Xilinx's DSP intellectual property (IP) cores.
Hardware DSP design capture is significantly accelerated by automatic code generation from Simulink


Algorithm Instantiation Considerations


There are cases where following a textbook approach does not translate into an efficient instantiation
Manipulating the algorithm to exploit features of the architecture can lead to much more efficient instantiations
Modifying a textbook algorithm includes changing how the math is executed as well as over-clocking structures so that they can be time-division multiplexed


Example 1: Digital Down Conversion


In digital down conversion we need to filter before we decimate to prevent aliasing
These filters can get rather large because the transition band is narrow relative to the sample rate
A textbook solution is to reduce the sample rate in several steps


Digital Down Conversion


The following three slides show three different filter designs for the down conversion of a 0.625 MHz band of interest that is centered at 20 MHz and sampled at 61.44 MHz.
The decimation rate is 25
The final sample rate will be 61.44/25 = 2.4576 MHz

The next slide shows the filter design needed if decimating by 25 in one step
The total coefficient count is 184

The two slides after the next show the two filters necessary to decimate in steps, decimating by 5 in each step
The total coefficient count is 11+43=54
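The rates involved can be checked with a few lines of arithmetic. The sketch below rests on assumptions not stated in the slides: that the 11-coefficient filter is the first stage and the 43-coefficient filter the second, and that each stage computes only the output samples it keeps, so its multiply-accumulate (MAC) rate is the coefficient count times its output rate. The function name `decimation_chain` is illustrative.

```python
def decimation_chain(fs_in_mhz, stages):
    """Return (output rate in MHz, total MAC rate in millions per second).

    `stages` is a list of (coefficient_count, decimation_factor) tuples.
    Assumption: each stage only computes the samples that survive decimation.
    """
    rate = fs_in_mhz
    mega_macs = 0.0
    for coefficient_count, decimation_factor in stages:
        rate = rate / decimation_factor          # sample rate after this stage
        mega_macs += coefficient_count * rate    # MACs needed at that output rate
    return rate, mega_macs


one_step = decimation_chain(61.44, [(184, 25)])
two_step = decimation_chain(61.44, [(11, 5), (43, 5)])
print("one step :", one_step)
print("two steps:", two_step)
```

Both chains end at 2.4576 MHz; under these assumptions the two-step chain also needs fewer multiply-accumulates per second, in addition to the smaller coefficient count.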


Digital Down Conversion (DDC) Implementation


The following design shows how the DDC function would be implemented using the FIR filter core from the Xilinx library
The coefficients are automatically loaded into the filter cores
The design has been compiled and was found to use about 6000 logic slices
The FIR filter core is a legacy core and is built as an optimized lookup table of coefficients


Digital Down Conversion Implementation


DDC Another Way


While we were able to exploit the math of DSP to reduce our coefficient count, we did not necessarily exploit the Xilinx architecture.
The next design implements the 184-coefficient filter but is significantly smaller in instantiation size than the previous design
This design exploits the memory, embedded multipliers, and SRL16s


Time Division Multiplexed Input


Multiplexing I&Q multiplication so that just one filter is needed instead of two
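One way to see why a single filter suffices is the sketch below. It is an illustration under assumptions, not the design on the slide: I and Q are interleaved sample by sample, the filter runs at twice the sample rate, and the delay between taps is doubled, which is modelled here by inserting a zero between coefficients. The function names are hypothetical.

```python
def fir(samples, coeffs):
    """Direct-form FIR with zero initial state."""
    return [
        sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


def filter_iq_with_one_fir(i_samples, q_samples, coeffs):
    """Filter I and Q through one FIR running on the interleaved stream."""
    if len(i_samples) != len(q_samples):
        raise ValueError("I and Q must have the same length")
    interleaved = []
    for i_value, q_value in zip(i_samples, q_samples):
        interleaved.extend((i_value, q_value))
    # Doubling the delay between taps is equivalent to zero-stuffing the coefficients.
    stretched = []
    for c in coeffs:
        stretched.extend((c, 0))
    stretched = stretched[:-1]  # drop the trailing zero, it contributes nothing
    shared_output = fir(interleaved, stretched)
    return shared_output[0::2], shared_output[1::2]  # demultiplex I and Q


i_in = [1, 2, 3, 4]
q_in = [10, 0, -10, 5]
taps = [1, 2, 1]
print(filter_iq_with_one_fir(i_in, q_in, taps))
print((fir(i_in, taps), fir(q_in, taps)))
```

Because even-indexed outputs only ever see I samples and odd-indexed outputs only Q samples, the demultiplexed streams match what two separate filters would produce.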


Efficient Shift Registers via SRL16s


Delay line would require 16x50x7=5600 registers, which would be 2800 logic slices.
Use of SRL16s reduces the slice count to fewer than 700
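The slice arithmetic can be reproduced as follows. This is a rough estimate under assumptions: two flip-flops per slice (implied by the register-to-slice ratio on the slide), two 4-input LUTs per slice, and every SRL16 filled to its full depth of 16, which is a best case for the delay line alone. The helper names are hypothetical.

```python
import math

FLIP_FLOPS_PER_SLICE = 2  # assumption implied by the slide's 2:1 register-to-slice ratio
LUTS_PER_SLICE = 2        # two 4-input LUTs per slice
SRL16_DEPTH = 16          # bits of delay held by one LUT used as a shift register


def slices_using_flip_flops(register_bits):
    """Slices needed when every bit of delay occupies one flip-flop."""
    return math.ceil(register_bits / FLIP_FLOPS_PER_SLICE)


def slices_using_srl16(register_bits):
    """Best-case slices when each LUT stores a full 16 bits of delay (assumption)."""
    luts = math.ceil(register_bits / SRL16_DEPTH)
    return math.ceil(luts / LUTS_PER_SLICE)


register_bits = 16 * 50 * 7
print("register bits:", register_bits)
print("slices with flip-flops:", slices_using_flip_flops(register_bits))
print("best-case slices with SRL16s:", slices_using_srl16(register_bits))
```

The SRL16 figure covers only the delay line in the best case; a complete design also spends slices on control, addressing, and arithmetic.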


Clock Based Demuxing And Automatic Pipeline Balancing

Down sample block grabs last sample in a frame

Delay block slides the frame

Down sample block grabs next sample in a frame

Balancing latencies is a common requirement in DSP designs.
The Sync block uses SRL16s (very efficient) to automatically balance pipeline delays


Notes on Previous Design


One filter structure is used by clocking the filter at twice the rate of the incoming data
The coefficients are stored in memory, 25 per ROM. There are 200 coefficients, but this approach allows storage of many more
The delay between taps is built using SRL16s. This would have taken 2800 slices alone without SRL16s, but instead the entire design is fewer than 700 slices


Channelizer Design
The following design is a 64-channel channelizer based on the technique known as a polyphase decimation filter with a DFT bank
The design basebands and decimates 64 channels simultaneously
The polyphase decimation filter has the same structure as the previous design, hence very efficient device utilization.
This filter structure uses the on-chip RAM blocks of the Xilinx device to store the coefficients
This technique requires a tapped shift register of 6272 registers (3136 slices). However, Xilinx's patented ability to turn the logic lookup table into a 16-bit shift register reduces this requirement by more than an order of magnitude. The whole design is less than 1700 slices.
The DFT is implemented with a streaming FFT core. The streaming mode allows the FFT to keep up with the data rate
Individual channels out of the FFT are demuxed using the implied clocking technique seen in the previous design


512 coefficients are stored in on-chip block RAMs

64-point FFT set to streaming mode


Filter coefficients are stored in on-chip block RAMs.
A new phase of the 64-phase polyphase filter is rotated into the multipliers on every clock cycle.
There are 64 phases x 8 taps = 512 coefficients
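The polyphase structure can be sketched in software as follows. This is a minimal model under assumptions, not the design on the slides: it uses a small bank (8 channels, 4 taps per phase) instead of 64 x 8, a plain transform loop instead of an FFT core, and one particular sign and indexing convention under which the transform across the branch outputs has the inverse-DFT kernel. Other conventions lead to a forward DFT with the branches reordered. The function names are hypothetical.

```python
import cmath
import math


def direct_channel(x, h, num_channels, channel, n):
    """Reference: mix `channel` to baseband, filter with h, take decimated output n."""
    total = 0j
    for m, coefficient in enumerate(h):
        t = n * num_channels - m
        if 0 <= t < len(x):
            mixer = cmath.exp(-2j * math.pi * channel * t / num_channels)
            total += coefficient * x[t] * mixer
    return total


def polyphase_channelizer(x, h, num_channels, n):
    """All channel outputs at decimated time n: polyphase branches plus one transform."""
    if len(h) % num_channels:
        raise ValueError("filter length must be a multiple of the channel count")
    taps_per_phase = len(h) // num_channels
    branches = []
    for phase in range(num_channels):
        total = 0j
        for q in range(taps_per_phase):
            t = n * num_channels - phase - q * num_channels
            if 0 <= t < len(x):
                total += h[phase + q * num_channels] * x[t]
        branches.append(total)
    # Transform across the branch outputs (inverse-DFT kernel in this convention).
    return [
        sum(branches[p] * cmath.exp(2j * math.pi * k * p / num_channels)
            for p in range(num_channels))
        for k in range(num_channels)
    ]


M, TAPS_PER_PHASE = 8, 4
prototype = [1.0 / (M * TAPS_PER_PHASE)] * (M * TAPS_PER_PHASE)  # simple boxcar low-pass
tone = [cmath.exp(2j * math.pi * 3 * t / M) for t in range(64)]  # centred on channel 3
for k, value in enumerate(polyphase_channelizer(tone, prototype, M, 5)):
    print(k, round(abs(value), 6))
```

Each branch uses only every M-th coefficient, so with 64 channels and 8 taps per phase the multipliers see a new set of 8 coefficients each cycle, matching the 512-coefficient storage described above.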


Conclusion
Efficient FPGA instantiation of DSP algorithms requires exploitation of the FPGA vendor's architecture.
Xilinx's Virtex-II architecture is especially amenable to systolic computation structures
FPGA architectures may present non-obvious instantiation choices that are more efficient than a typical textbook approach
Algorithms can and should be modified for parallelized data flow instantiation.
