Abstract
cessor comes from the memory requirement to buffer the input data and the intermediate result for the computation. For
large size transform, this turns out to be dominating [3, 4].
Although there is no formal proof, the area lower bound indicates that the the lower bound for the number of registers is
likely to be N . This is obviously true for any architecture
implementing FFT based algorithm, since the butterfly at first
stage has to take data elements separated N=r distance away
from the input sequence, where r is a small constant integer,
or the radix.
Putting above arguments together, a pipeline FFT processor has necessarily logr N AEs and N complex word
registers. The optimal architecture has to be the one that
reduces the constant factor, or the absolute number of AEs
(multipliers and adders) and memory size, to the minimum.
In this paper a new approach for real-time pipeline FFT processor, the Radix-22 Single-path Delay Feedback, or R22SDF
architecture will be presented. We will begin with a brief review of previous approaches. A hardware oriented radix-22
algorithm is then developed by integrating a twiddle factor
decomposition technique in divide and conquer approach to
form a spatially regular signal flow graph (SFG). Mapping
the algorithm to the cascading delay feedback structure leads
to the the proposed architecture. Finally we conclude with a
comparison of hardware requirement of R22 SDF and several
other popular pipeline architectures.
II. PIPELINE FFT PROCESSOR ARCHITECTURES
Before going into details of the new approach, it is beneficial
to have a brief review of the various architectures for pipeline
FFT processors. To avoid being influenced by the sequence
order, we assume that the real-time processing task only requires the input sequence to be in normal order, and the output
is allowed to be in digit-reversed (radix-2 or radix-4) order,
which is permissible in such applications such as DFT based
communication system [5]. We also stick to the DecimationIn-Frequency (DIF) type of decomposition throughout the
discussion.
The architecture design for pipeline FFT processor had
been the subject of intensive research as early as in 70s when
4
BF2
C2
BF2
C2
C2
BF2
(i). R2MDC(N=16)
1
j
BF2
BF2
BF2
BF2
(ii). R2SDF(N=16)
3x64
3x16
3x4
3x1
BF4
BF4
BF4
BF4
(iii). R4SDF(N=256)
192
C4
128
64
48
BF4
16
32
C4
32
16
12
BF4
48
4
8
C4
8
4
BF4
12
1
2
C4
2
1
BF4
(iv). R4MDC(N=256)
DC6x64
BF4
DC6x16
BF4
DC6x4
R4SDF: Radix-4 Single-path Delay Feedback [8] was proposed as a radix-4 version of R2SDF, employing CORDIC1
iterations. The utilization of multipliers has been increased
to 75% due to the storage of 3 out of radix-4 butterfly outputs. However, the utilization of the radix-4 butterfly,
which is fairly complicated and contains at least 8 complex adders, is dropped to only 25%. It requires log4 N 1
multipliers, log4 N full radix-4 butterflies and storage of
size N 1.
1
BF2
BF4
DC6x1
BF4
(v). R4SDC(N=256)
X k =
Xx n W
N 1
n=0
=
=
X k1
+ 2k2 + 4k3 =
n3 =0
0k<N
(1)
n3 k3
N4
x(0)
(5)
X(0)
N/4 DFT
x(1)
x(2)
(k1=0, k2=0)
x(3)
X(8)
X(4)
X(12)
W0
x(4)
W2
x(5)
X(2)
N/4 DFT
W4
W6
(k1=0, k2=1)
W1
W3
x(11)
-j
W0
-j
W3
-j
W6
-j
W9
x(12)
x(13)
x(14)
x(15)
BF I
X(6)
X(1)
N/4 DFT
W2
x(10)
X(10)
X(14)
W0
x(8)
x(9)
< N2 n1 + N4 n2 + n3 >N
< k1 + 2k2 + 4k3 >N
n3 k1 +2k2
N
H k1 ; k2 ; n3 W
x(7)
n
k
Xh
N4
x(6)
nk
N
(k1=1, k2=0)
X(9)
X(5)
X(13)
X(3)
N/4 DFT
(k1=1, k2=1)
X(11)
X(7)
X(15)
BF II
(2)
Figure 2: Butterfly with decomposed twiddle factors.
XXX
X Xf
X k1
N4
+ 2k2 + 4k3
1
n3 =0 n2 =0n1 =0
N4
n3 =0 n2 =0
x
k1 N n2 + n3 = x N n2 + n3 +
N2 4
1k1 x
n2
n3
If the expression within the braces of eqn. (3) is to be computed before further decomposition, an ordinary radix-2 DIF
FFT results. The key idea of the new algorithm is to proceed the second step decomposition to the remaining DFT
N n +n k
WN
WN WN
= j n k +2k W
=
n3 k1 +2k2
N
n3 k1 +2k2 4n3 k3
N
N
WN
4n3 k3
(4)
H k1 ; k2 ; n3
hz
|
}|
i{
BF I
xn3
+
N
1k1 xn3 +
+
X(0)
x(1)
X(8)
x(2)
X(4)
-j
x(3)
X(12)
x(4)
X(2)
W2
X(10)
W4
x(6)
X(6)
-j
W6
x(7)
X(14)
x(8)
X(1)
W1
x(9)
X(9)
W2
x(10)
X(5)
-j
W3
x(11)
X(13)
-j
x(12)
X(3)
-j
W3
-j
W6
-j
W9
x(13)
x(14)
x(15)
{z
}|
BF I
xn3
+
X(11)
X(7)
-j
X(15)
BF II
BF III
xi(n)
xr(n+N/2)
1n
1n + N=2
xn
xn
+ xn + N=2
xn
+ N=2
,0
n < N=
(7)
The butterfly output Z 1n is sent to apply the twiddle factor, and Z 1n + N=2 is sent back to the shift registers to be
multiplied in still next N=2 cycles when the first half of the
next frame of time sequence is loaded in. The operation of the
second butterfly is similar to that of the first one, except the
distance of butterfly input sequence are just N=4 and the
trivial twiddle factor multiplication has been implemented by
real-imaginary swapping with a commutator and controlled
Zr(n)
0
1
xi(n+N/2)
Zi(n+N/2)
16
Fig. 5 outlines an implementation of the R22SDF architecture for N = 256, note the similarity of the data-path to
R2SDF and the reduced number of multipliers. The implementation uses two types of butterflies, one identical to that
in R2SDF, the other contains also the logic to implement the
trivial twiddle factor multiplication, as shown in Fig. 4-(i)(ii)
respectively. Due to the spatial regularity of Radix-22 algorithm, the synchronization control of the processor is very
simple. A log2 N -bit binary counter serves two purposes:
synchronization controller and address counter for twiddle
factor reading in each stages.
With the help of the butterfly structures shown in Fig. 4,
the scheduled operation of the R22 SDF processor in Fig. 5
is as follows. On first N=2 cycles, the 2-to-1 multiplexors
in the first butterfly module switch to position 0, and the
butterfly is idle. The input data from left is directed to the
shift registers until they are filled. On next N=2 cycles, the
multiplexors turn to position 1, the butterfly computes a
2-point DFT with incoming data and the data stored in the
shift registers.
Zr(n+N/2)
1
0
BF IV
(6)
BF I
i{
}
3
1k1 xn3 + N
BF II
x(0)
x(5)
k
j
1 +2 2
zh
Zi(n)
(i). BF2I
0
xr(n)
1
0
xi(n)
1
1
xr(n+N/2)
Zr(n+N/2)
Zi(n+N/2)
Zr(n)
xi(n+N/2)
Zi(n)
t s
(ii). BF2II
Figure 4: Butterfly structure for R22 SDF FFT processor
In practical implementation, pipeline register should be inserted between each multiplier and butterfly stage to improve
the performance. Shimming registers are also needed for
control signals to comply with thus revised timing. The latency of the output is then increased to N 1 + 3log4 N 1
without affecting the throughput rate.
V. CONCLUSION
In this paper, a hardware-oriented radix-22 algorithm is derived which has the radix-4 multiplicative complexity but retains radix-2 butterfly structure in the SFG. Based on this algorithm, a new, efficient pipeline FFT architecture, the R22SDF
architecture, is put forward. The hardware requirement of
proposed architecture as compared with various approaches
is shown in Table 1, where not only the number of complex
x(n)
128
64
32
16
BF2I
s
BF2II
t s
BF2I
s
BF2II
t s
BF2I
s
BF2II
t s
BF2I
s
BF2II
t s
W2(n)
W1(n)
clk
W3(n)
3
multiplier #
2log4 N 1
2log4 N 1
log4 N 1
3log4 N 1
log4 N 1
log4 N 1
adder #
4 log4 N
4 log4 N
8 log4 N
8 log4 N
3 log4 N
4 log4 N
memory size
3N=2 2
N 1
N 1
5N=2 4
2N 2
N 1
control
simple
simple
medium
simple
complex
simple
The architecture has been modeled with hardware description language VHDL with generic parameters for transform
size and word-length, using fixed point arithmetic and a complex array multiplier implemented with distributed arithmetic.
The validity and efficiency of the proposed architecture has
been verified by extensive simulation.
REFERENCES
[1] C. D. Thompson. Fourier transform in VLSI. IEEE Trans.
Comput., C-32(11):10471057, Nov. 1983.
[2] S. He and M. Torkelson. A new expandable 2D systolic array
for DFT computation based on symbiosis of 1D arrays. In Proc.
ICA3 PP95, pages 1219, Brisbane, Australia, Apr. 1995.
[3] E. E. Swartzlander, W. K. W. Young, and S. J. Joseph. A
radix 4 delay commutator for fast Fourier transform processor
implementation. IEEE J. Solid-State Circuits, SC-19(5):702
709, Oct. 1984.
[4] E. Bidet, D. Castelain, C. Joanblanq, and P. Stenn. A fast
single-chip implementation of 8192 complex point FFT. IEEE
J. Solid-State Circuits, 30(3):300305, Mar. 1995.
[5] M. Alard and R. Lassalle. Principles of modulation and channel
coding for digital broadcasting for mobile receivers. EBU
Review, (224):4769, Aug. 1987.
[6] L.R. Rabiner and B. Gold. Theory and Application of Digital
Signal Processing. Prentice-Hall, Inc., 1975.
X(k)
256