
Hybrid Floating Point Technique Yields 1.2 Giga-sample per Second, 32 to 2048 Point Floating Point FFT in a Single FPGA
HPEC 2006 Poster Session B.4, 20 September 2006

Ray Andraka, P.E., President, Andraka Consulting Group, Inc. (ray@andraka.com)


The Andraka Consulting Group, Inc.
Copyright 2006 Andraka Consulting Group, Inc. All Rights Reserved

Floating point addition & subtraction is resource intensive


[Block diagram: floating point adder/subtractor. Mantissa A and mantissa B pass through an exchange network and a barrel shift (denormalize), then the mantissa add/subtract, leading zeros detect, barrel shift (renormalize), and rounding produce the result mantissa. Exponent A and exponent B feed an exponent difference and an exponent adder to produce the result exponent.]


Apply floating point to larger functions


- Floating point is typically applied at the add and multiply level
- Instead, construct higher order operations from fixed point operators
  - Phase rotator
  - FFT
- Apply floating point to those more complicated operators
  - Denormalize to convert mantissas to fixed point plus a common scale
  - Pass the exponent around a series of fixed point operations
  - Renormalize after several operations rather than after each one
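The denormalize / fixed point / renormalize flow can be sketched in software. This is an illustrative model only (the `denormalize` and `renormalize` helpers are hypothetical names, not the FPGA implementation):

```python
import math

def denormalize(values, mant_bits=24):
    """Convert floats to fixed-point mantissas sharing one common exponent."""
    # The common exponent is set by the largest magnitude in the block.
    common_exp = max(math.frexp(v)[1] for v in values)
    # Each mantissa is scaled so all values share common_exp.
    mants = [round(v * 2 ** (mant_bits - common_exp)) for v in values]
    return mants, common_exp

def renormalize(mant, common_exp, mant_bits=24):
    """Convert a fixed-point result back to a float."""
    return mant * 2.0 ** (common_exp - mant_bits)

# Several fixed point adds between one denormalize/renormalize pair:
mants, e = denormalize([1.5, 0.375, -0.25])
total = sum(mants)              # pure integer arithmetic
print(renormalize(total, e))    # 1.625
```

One renormalize at the end replaces a barrel shifter and leading-zeros detector per operation, which is the source of the hardware savings described above.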


Apply floating point to larger functions


[Block diagram: mantissas pass through a barrel shift (denormalize), the fixed point function, leading zeros detect, barrel shift (renormalize), and rounding to produce the result mantissa. Exponents feed an exponent difference and a max exponent, and an exponent adder produces the result exponent.]



Floating point sum has only as much precision as larger addend


An add requires both addends to have the same scale:
- Radix points must align
- Addition is inherently fixed point

Examples:

Different exponents (LSBs of B are lost):
  A = 1.101 x 2^5
  B = 1.101 x 2^3 = 0.01101 x 2^5
  A + B = (1.101 + 0.011) x 2^5 = 10.000 x 2^5

Renormalizing (sum LSBs are filled with 0s):
  A = 1.101 x 2^5
  B = 1.011 x 2^5
  A - B = (1.101 - 1.011) x 2^5 = 0.010 x 2^5 = 1.000 x 2^3

The smaller addend's mantissa is right shifted until its exponent matches the larger:
- Exponent increments with each shift
- Right shift truncates LSBs
- Truncated LSBs are lost

The sum is left shifted to left justify:
- LSBs are zero filled
- No improvement to precision
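Both effects are easy to observe in ordinary IEEE arithmetic. This uses Python doubles (53-bit significand) rather than single precision, but the mechanism is identical:

```python
# Adding a value whose bits fall entirely below the larger addend's
# LSB changes nothing: the small addend is truncated to zero during
# denormalization.
a = 2.0 ** 53
b = 1.0
print(a + b == a)        # True: all of b's bits are lost

# Cancellation: the renormalized difference zero-fills its low bits,
# so the result carries only one significant bit.
c = (1.0 + 2.0 ** -52) - 1.0
print(c == 2.0 ** -52)   # True
```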


Phase rotation does not change amplitude


re(y) = re(x) * cos(w) - im(x) * sin(w)
im(y) = re(x) * sin(w) + im(x) * cos(w)

- Magnitudes of the individual I and Q components change, but the complex magnitude is not altered
- No loss of precision from treating I and Q with a common exponent
  - The complex operation is limited to the precision of the larger component
- Using a common exponent for I and Q reduces hardware
  - Single copy of the exponent logic
  - No rescaling of I with respect to Q
- Simplifies the rotator
  - Fixed point complex multiply (the smaller of I or Q is denormalized)
  - Fixed point sines and cosines
  - Output renormalize is a +/-1 bit shift
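A quick numerical check that phase rotation leaves the complex magnitude unchanged, whatever the rotation angle:

```python
import cmath

x = 3.0 + 4.0j                    # |x| = 5
for w in (0.1, 1.0, 2.5):         # arbitrary rotation angles (radians)
    y = x * cmath.exp(1j * w)     # the individual re/im parts change...
    print(abs(y))                 # ...but |y| stays 5 (to rounding)
```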


FFT butterflies are only as precise as largest input


Cooley-Tukey FFT butterfly:
- Sum and difference of a pair of complex inputs
- One input is rotated by the twiddle factor phasor w^k = cos(w) + j sin(w)

[Diagram: FFT butterfly. Complex inputs; one is multiplied by the twiddle factor w^k, then the pair is summed and differenced to produce the complex outputs.]

- Rotation does not affect scale
- Smaller input is right shifted
  - Shift to match scale
  - LSBs are lost
- Both outputs have the same LSB weight before renormalizing
- Renormalizing does not add precision (zero fills LSBs)
- Output is 1 bit wider than input
  - Sum of similar sized addends
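A minimal software model of the radix-2 butterfly (the `butterfly` function is illustrative, not the hardware design; it uses the conventional twiddle W_N^k = e^(-j2*pi*k/N)):

```python
import cmath

def butterfly(a, b, k, N):
    """Radix-2 decimation-in-time butterfly: rotate b by the twiddle
    factor W_N^k, then form the sum and difference with a."""
    w = cmath.exp(-2j * cmath.pi * k / N)
    t = w * b                 # rotation: does not change |b|
    return a + t, a - t       # each output is 1 bit wider than the inputs

# A 2-point FFT is a single butterfly with k = 0:
print(butterfly(1 + 0j, 2 + 0j, 0, 2))   # ((3+0j), (-1+0j))
```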

FFT output is only as precise as largest input


- Cascade of butterfly elements
- Each output is essentially an adder tree with phase rotators
  - Rotators don't change scale
  - Inputs are right shifted to match the scale of the largest input
  - Intermediate renormalizing is not effective
  - A term from every FFT input contributes

[Diagram: cascade of butterflies with twiddle factor (w^k) rotators between stages.]

- 1 bit growth per stage
  - Renormalize maintains width
  - Alternative: grow word width
- Similar effect in other FFTs (Winograd, Sande-Tukey, Singleton, etc.)
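The 1 bit per radix-2 stage growth can be checked numerically; an all-ones input concentrates all the energy in bin 0, giving the worst-case output magnitude of N (a quick NumPy illustration):

```python
import numpy as np

# A length-N DFT of values bounded by 1 can produce outputs as large
# as N, i.e. log2(N) bits of growth: 1 bit per radix-2 stage.
N = 8
x = np.ones(N)                    # all-ones input maximizes bin 0
X = np.fft.fft(x)
print(max(abs(X)))                # 8.0: 3 radix-2 stages, 3 bits of growth
```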

Fixed Point FFT Replaces Floating Point FFT


- Denormalize inputs
  - Shift each input right to match the scale of the largest
- Perform a fixed point FFT
  - Pass the common exponent around it
  - Input width = mantissa bits
  - Maximum 1 bit growth per equivalent radix 2 stage
- Renormalize outputs
  - Add the common exponent to the delta exponent from the renormalize

[Block diagram: input mantissas are denormalized (right shift by n), pass through the fixed point FFT, and are renormalized (left shift). The max exponent is carried around the FFT and added to the renormalize shift to form the output exponent.]
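A hypothetical software model of this wrapper, using NumPy's floating point FFT as a stand-in for the fixed point core (the `block_float_fft` name and structure are illustrative assumptions):

```python
import numpy as np

def block_float_fft(x, mant_bits=24):
    """Denormalize to a common exponent, run an integer-mantissa FFT,
    renormalize once at the end."""
    x = np.asarray(x, dtype=complex)
    # Denormalize: common scale set by the largest I or Q component.
    peak = max(np.max(np.abs(x.real)), np.max(np.abs(x.imag)))
    scale = 2.0 ** (mant_bits - 1) / peak
    mants = np.round(x * scale)    # integer mantissas, shared exponent
    X = np.fft.fft(mants)          # stand-in for the fixed point FFT core
    return X / scale               # renormalize: fold the exponent back in

rng = np.random.default_rng(0)
x = rng.standard_normal(32) + 1j * rng.standard_normal(32)
err = np.max(np.abs(block_float_fft(x) - np.fft.fft(x)))
print(err)   # small: limited by input quantization, not the algorithm
```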


Advantages and Limitations


Advantages:
- Large reduction in required hardware
- Less complexity means higher clock rates and smaller parts

Limitations:
- Word width grows for each radix 2 stage
  - Becomes excessive for large FFTs
- Max exponent is needed at the beginning of the set
  - A problem for large sequential FFTs
- Use periodic renormalization to manage word widths
  - A few bits of growth don't significantly affect timing
  - Words are not limited to specific widths in an FPGA
  - Fixed width assets like DSP48s limit practical word sizes
  - Find a balance between precision, growth, and renormalizing stages

Small FFTs as building blocks


- Larger FFT constructed from small FFTs with a mixed radix algorithm
  - Similar to a Cooley-Tukey decomposition
  - Arbitrarily large FFTs using small off-the-shelf kernels
- Combination uses an FFT plus a phase rotator and reorder memory
- In-place operation (results written to the same memory locations)

[Diagram: fill along rows, FFT down columns, multiply by e^(-j2*pi*k*n/N), FFT along rows, read down columns.]
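The row/column flow above can be modeled in a few lines. This is a NumPy sketch of the standard mixed-radix "four-step" decomposition, not the hardware pipeline; `four_step_fft` is an illustrative name:

```python
import numpy as np

def four_step_fft(x, R, C):
    """N = R*C point FFT built from R- and C-point FFTs plus a
    twiddle (phase rotator) stage between them."""
    N = R * C
    a = np.asarray(x, dtype=complex).reshape(R, C)  # fill along rows
    a = np.fft.fft(a, axis=0)                       # FFT down columns
    k = np.arange(R).reshape(R, 1) * np.arange(C)   # twiddle indices k*n
    a = a * np.exp(-2j * np.pi * k / N)             # mult by e^(-j2*pi*k*n/N)
    a = np.fft.fft(a, axis=1)                       # FFT along rows
    return a.T.reshape(N)                           # read down columns

x = np.random.default_rng(1).standard_normal(32)
print(np.allclose(four_step_fft(x, 4, 8), np.fft.fft(x)))   # True
```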


Winograd FFT

- Different factorization that minimizes multiplies
  - Advantageous for hardware implementation
  - 74 adds and 18 real multiplies for a 16 pt Winograd
  - 176 adds and 72 real multiplies for a 16 pt Cooley-Tukey
- Irregular data sequence
  - Difficult for shared memory
  - Easy when reorder memory is distributed

[Diagram: Winograd FFT structure with reorder stages surrounding a multiply-by-weights stage.]


32 to 2048 point mixed radix FFT


- 2K FFT is 8 x 256 mixed radix
- 256 point is 16 x 16 mixed radix
- Combined algorithms: 2K = 8 x 16 x 16
  - Data arranged in a cube, FFT along each dimension
  - Reorder at input and output (not shown)
- Kernel is a proprietary 1/4/8/16 Winograd kernel
- Each kernel has a floating point wrapper
- 32/64/128/256 point FFT

[Pipeline diagram: 1/8 point FFT -> phase rotator -> data reorder (4K sample BRAM) -> 4/8/16 point FFT -> phase rotator -> data reorder (512 sample BRAM) -> 8/16 point FFT.]


32-2K point FFT statistics


Speed:
- 400 MS/sec per FFT engine (3 in FPGA)
- 400 MHz clock in XC4VSX55-10 (slowest speed grade)
- 1 complex sample per clock in and out, continuous

Latency: ~430 + 3 x FFT length + (32, 64, 128, or 256) clocks

Utilization: less than 30% of an XC4VSX55
- DSP48s: 151
- Slice flip-flops: 9707
- RAMB16s: 69
- LUTs: 7736 (4975 are SRL16)

Precision:
- 30-35 bit mantissa internal, 8 bit exponents
- IEEE single precision input and output
- Matches MATLAB FFT to +/-1 LSB of the output mantissa

1.2 GSample/sec IEEE floating point FFT

[Block diagram: an input buffer feeds three parallel 32 to 2K pt floating pt FFT engines, which feed an output buffer.]


Who is Andraka Consulting Group?


- Exclusively FPGAs since 1994
- Leading industry expert on DSP in FPGAs
- Charter Xilinx Xperts partner
- First published FIR filter in FPGAs (1992)
- Fastest single threaded FFT kernel for FPGA

Other current projects:
- Beamforming digital receiver: 10 25 MHz channels, 260 antennas, 500 MS/sec input sample rate
- Cylindrical sonar array processor
- Other digital receiver and radar projects


Floating Point Format


- Floating point dedicates part of the word to indicate scale (the exponent)
  - Tracks the radix point position as part of the data
  - Compare to fixed point, where the radix point is at an implied fixed location
- Trades precision for dynamic range
- Useful when the data range is unknown or spans a large range

The IEEE single precision floating point standard is a 32 bit word:
- The leftmost bit is the sign bit, S: 1 is negative, 0 is positive
- The next 8 bits are the exponent E, in excess-127 format
- The rightmost 23 bits are the fraction F
  - There is an implicit 1 bit to the left of the fraction except in special cases
  - The fraction's radix point is between the implied 1 and the leftmost fraction bit

S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF

Number = (-1)^S x 2^(E-127) x 1.F
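The decoding rule can be verified with a few lines of Python (`decode_ieee754` is an illustrative helper; it handles normal numbers only, not the special cases):

```python
import struct

def decode_ieee754(x):
    """Unpack an IEEE single precision float into sign, exponent, fraction,
    and the value reconstructed from them."""
    bits, = struct.unpack('>I', struct.pack('>f', x))
    s = bits >> 31                 # sign bit
    e = (bits >> 23) & 0xFF        # excess-127 exponent
    f = bits & 0x7FFFFF            # 23-bit fraction
    # Normal numbers: value = (-1)^S * 2^(E-127) * (1 + F/2^23)
    return s, e, f, (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2 ** 23)

# -6.5 = -1.101b * 2^2, so S=1, E=127+2=129, F=.101b << 20
print(decode_ieee754(-6.5))   # (1, 129, 5242880, -6.5)
```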


