IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 4, APRIL 2013
I. I NTRODUCTION
AST Fourier transform (FFT) is a crucial block in orthogonal frequency division multiplexing (OFDM) systems.
OFDM has been adopted in a wide range of applications from
wired-communication modems, such as digital subscriber lines
(xDSL) [1], [2], to wireless-communication modems, such as
YANG et al.: MDC FFT/IFFT PROCESSOR WITH VARIABLE LENGTH FOR MIMO-OFDM SYSTEMS
Fig. 1.
721
N1
x[n]W Nnk
(1)
n=0
and
x[n] = I F F T {X[k]} =
N1
1
X[k]W Nnk
N
(2)
k=0
where
W Nnk
2nk
= cos
N
2nk
j sin
N
.
(3)
7
3
3
3
3
n 1 k1
N1 k1
x[N]W4 W2048
X[K ] =
n 5 =0 n 4 =0 n 3 =0 n 2 =0 n 1 =0
n 5 k4
N2 k2
N3 k3
W4n3 k3 W128
W4n4 k4 W32
W8n5 k5 (4)
W4n2 k2 W512
x[n] =
722
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 4, APRIL 2013
x0
X 0
x1
X 4
x2
x 3
X 1
1
8
X2
x4
x 5
2
8
X6
X3
x6
x 7
X5
3
8
X7
YANG et al.: MDC FFT/IFFT PROCESSOR WITH VARIABLE LENGTH FOR MIMO-OFDM SYSTEMS
723
...
2047 ...
511 ...
1279 255
... 1088
...
2047 ...
512
1535 511
...
2047 ...
1791 767
...
2047 ...
2047 1023 ... 1856 832 1792 768 2047 1023 ... 1856 832 1792 768
Symbol j+1
Symbol j
n
n
n
n
Input
Buffer
Stream A
Time index
64
Stage 1
Radix-4
1279 255
... 1088
64
1024
Stream A
Time index
(c)
Stage 2
Radix-4
Stage 3
Radix-4
Stage 4
Radix-4
Stage S=1
TF
MUL
TF
MUL
TF
MUL
1024
Stream B
FIFO
N/4S+1
FIFO
2N/4S+1
FIFO
3N/4S+1
Phase 1
FIFO
3N/4S+1
FIFO
2N/4S+1
FIFO
N/4S+1
Switch Box
Address generator
511 ...
(b)
Radix-4 Butterfly
Stream B
Time index
(a)
x Aj
x Bj
x Cj
x Dj
X ij k 0
Stage 5
Radix4/8
X ij k 1
X ij k 2
X ij k 3
Output
Sorting
2044 ...
2044 ...
2045 ...
2045 ...
2046 ...
2046 ...
2047 ...
2047 ...
Stream B Output
Stream A Output
Time index
(d)
Fig. 3. Block diagram of the proposed MIMO MDC FFT/IFFT processor. The routing rule updates every N/4s+1 clock cycles. (a) Initial input order.
(b) Sorted input order at the output of input buffer. (c) Computed output order without sorting. (d) Output order after output sorting.
724
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 4, APRIL 2013
6WUHDP $%&'
LQSXW V\PERO
D DFFHVV HQDEOH
D DFFHVV HQDEOH
D DFFHVV HQDEOH
E DFFHVV HQDEOH
E DFFHVV HQDEOH
E DFFHVV HQDEOH
F DFFHVV HQDEOH
F DFFHVV HQDEOH
F DFFHVV HQDEOH
G DFFHVV HQDEOH
G DFFHVV HQDEOH
G DFFHVV HQDEOH
M
x [i ]
x[i 2 ]
x[i 4 ]
x[i 6 ]
Radix-4
Butterfly
Register
Register
Radix-2
Butterfly
X [ 4i ]
Radix-2
Butterfly
X [4i 1]
Radix-2
Butterfly
X [4i 2]
Radix-2
Butterfly
X [4i 3]
1
Const. MUL
iLSB
radix_8
enable
1
8
Const. MUL
W 83
bypass_en
Fig. 4. Illustration of the proposed input scheduling. The cubicles are the
physical memory and radix-4 butterfly in active mode, while the rectangles
are the modules in dormant mode. (a) Logical groups of initial memory banks.
(b) (3N/4 + 1)th to the N th cycle. (c) (N + 1)th to the (N + N/4)th cycle.
(d) (N + N/4 + 1)th to the (N + N/2)th cycle. (e) (N + N/2 + 1)th to the
(N + 3N/4)th cycle.
YANG et al.: MDC FFT/IFFT PROCESSOR WITH VARIABLE LENGTH FOR MIMO-OFDM SYSTEMS
Push / Pop
FIFOs
N/16
N/8
Words Words
N/16
N/8
Words Words
N/16
N/8
Words Words
N/16
N/8
Words Words
Ctrl.
512-FFT/IFFT sel.
Source
Ctrl.
FIFO
N/32
Words
FIFO
N/16 Words
128-FFT/IFFT sel.
FIFO & Output
Switch Box
FIFO
N/16 Words
FIFO
N/32
Words
FIFO
3N/32 Words
Switch Box
FFT/IFFT
Core
725
FIFO
3N/32 Words
0
Sorted
Outputs
Stage A
Stage B
Stage C
512-FFT/IFFT sel.
Cycle 1~4
12 8 4 0
Write
Cycle 9~11
12 8 4 0
13 9 5 1
14 10 6 2
Write
Cycle 13~16
12 8 4 0
13 9 5 1 Read
14 10 6 2
15 11 7 3
Write
1
0
FIFO
4 W ords
FIFO
8 W ords
FIFO
12 W ords
Switch Box
Write
Cycle 5~8
12 8 4 0
13 9 5 1
1
0
FIFO
12 W ords
FIFO
8 W ords
FIFO
4 W ords
1
0
1
0
512-FFT/IFFT sel.
Fig. 7. Proposed output sorting for set-reversed FFT/IFFT processors. Stage A percolates the first half symbol from interlaced sequence. Stages B and C
contain switch-boxes for reordering. The annotated write/read-access indices in stage C is used for 512-FFT/IFFT.
726
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 4, APRIL 2013
(6)
Note that in the last folding stage, all the twiddle factors are
one. Therefore, multipliers are not required in the last pipe.
For the proposed FFT/IFFT scheme, the discussion in previous
sections are dedicated for practical MIMO systems with four
data streams. Thus radix-4 butterflies are used. It is worthwhile
to emphasize, however, that similar concept can be used for
MIMO systems with arbitrary number of data streams. That is,
for a system with Ns data stream, we propose to use radix-Ns
butterfly with MDC architecture. In this case, there are log Ns N
stages. Also, thanks to the 100% utilization rate of MDC,
log2 Ns 1 k
each stage requires Ns 1 multipliers, and
2
k=0
complex adders if binary-tree addition is used in each radix-r
branch. As a result, the number of complex adders Na and the
number of complex multipliers Nm are given respectively by
Na
log2 Ns 1
= log Ns N r
2k
(7)
(8)
k=0
and
YANG et al.: MDC FFT/IFFT PROCESSOR WITH VARIABLE LENGTH FOR MIMO-OFDM SYSTEMS
TABLE I
H ARDWARE C OMPLEXITY OF D IFFERENT FFT/IFFT A RCHITECTURES
SISO
Scheme
CPLX. MUL #
Radix-r
BF #
Memory size
R2MDC
R2SDF
R4SDF
R4MDC
R22 SDF
(log2 N 1)
(log2 N 1)
(log4 N 1)
3(log4 N 1)
(log4 N 1)
log2
log2
log4
log4
log4
3N/2 2
N 1
N 1
5N/2 4
N 1
(r 1)(log Ns 1)
log Ns N
MIMO Prop.
MIMO
MDC
N
N
N
N
N
FFT butterflies
Memory macros
(Ns 1)N +
log Ns N 1
(Ns 1)N/Nss
4
TABLE II
E LEMENTS IN P ROPOSED MDC FFT/IFFT P ROCESSOR
Components
Complex number
multipliers
s=1
10
727
Purpose
Twiddle factors
multiplication with
radix-4/8 outputs
Radix-4
Radix-4/8
Dual-Port SRAM (words)
Switch-box
Input memory addressing
Output FIFO control
FIFO registers (words)
Quadrant conversion of
twiddle factor
Twiddle factor generator
Number
12
4
1
10224
7
1
1
456
12
4
10
Ns=1; 1 rdx2 BF
10
10
10
10
10
(a)
10
10
10
5
4
2
3
10
10
10 1
10
10
10
(b)
10
10
Fig. 8. Required number of (a) complex adders and (b) complex multipliers,
as functions of FFT/IFFT size N for various numbers of MIMO data
streams.
IV. I MPLEMENTATION
A. Hardware Specification and Synthesis Report
The required components of the proposed MDC MIMO
FFT/IFFT processor are summarized in Table II. Lin et al.
showed that using 12-bit internal word length can provide
an output signal-to-noise ratio of 40 dB, capable of meeting
the IEEE 802.16e WiMAX standard [31]. Based on out fixpoint simulation, the input word length was fine tuned to
8 bits and the output word length was 12 bits. As for the
internal word lengths, all the computations were rounded to
10 bits. The 12 memory banks, which store the input data
were implemented by dual-port synchronous dynamic random
access memory (SDRAM). The intermediate FIFOs whose
depths exceed 8 were also implemented by dual-port SDRAM.
The total SDRAM size in the proposed design was 23.56 KB.
Fig. 9. APR of the proposed MDC MIMO FFT/IFFT processor. The pipeline
stages of FFT/IFFT computing core are numerated from one to five. S is the
output sorting stage. The memory control function is distributed in adjacent
memory macros.
728
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 4, APRIL 2013
Switch-box &
where Tech is the process in micrometers, Width is the bitwidth of data-path in bits, and ExecTime is the calculation time
in microseconds. Baass comparisons were used for FFT/IFFT
designs that have the same N, and similar architecture with
coarse adjustment corresponds to different CMOS process. In
addition to the architecture, different FFT length N, system
frequency, and applied CMOS processes fundamentally affect
the area and power consumption. Thus Peng in [27] considered
different FFT/IFFT length N and proposed the normalized area
Apeng and normalized power Ppeng as
(a)
Ppeng
Switch-box &
(b)
Fig. 10.
The power consumption was analyzed primarily by Synopsys PrimePower with the net-list extracted from actual APR.
We also use the power analysis function in SoC Encounter.
The measured results using these two tools only have a small
mismatch within 5 mW. Fig. 10 lists the pie diagram of
area and power consumption. Excluding the testing function,
the memory occupied 85.95% of the total area. The ratio of
standard cells versus SDRAM macros was close to 1/5. This
means if more advanced (that is, high density) memory macros
were applied, further reduction could be achieved. The power
consumption at 40 MHz system clock were 63.72 mW for
2048-FFT, 62.92 mW for 1024-FFT, 57.51 mW for 512-FFT,
and 51.69 mW for 128-FFT computations.
B. Performance Analysis and Comparison
Throughput, signal to noise quantization ratio, and normalized area/power consumption are the major indices used to
evaluate the performance of FFT/IFFT processors [22]. Let
us compare the proposed design with other existing designs.
Most of the previous works were based on SDF or memorybase radix-2n algorithm. Bass proposed methods to evaluate
the normalized area Abass and normalized power consumption
Pbass among different kinds of FFT processor [22] as follows:
Abass =
Pbass =
Area
(Tech/0.5 m)2
2 Width 1 Width 2
+
Tech
3 20
3
20
Power Exec Time 106
Area
N (Tech/0.18)2
Power ExecTime
=
.
N (V D D /1.8)2
Apeng =
Area 103
(Tech/0.09 m)2 M log2 N
(9)
(10)
YANG et al.: MDC FFT/IFFT PROCESSOR WITH VARIABLE LENGTH FOR MIMO-OFDM SYSTEMS
729
TABLE III
C OMPARISON A MONG D IFFERENT FFT/IFFT P ROCESSORS
Proposed
[26]
MDC
Architecture
[27]
SDF
MDF
1282048
1282048/1536
[29]
Memory
base
[31]
Memory
base
[30]
[24]
[28]
[25]
[20]
[14]
[12]
SDF
SDF
Memory
base
SDF
MDC
SDF
SDF
2048
2048
2048
2048
512
256
128
128
128
Clock rate
(MHz)
40
40
35
20
22.86
45
300
324
300
75
40
250
Stream
no.
Process
(um)
0.09
0.18
0.18
0.18
0.13
0.35
0.09
0.09
0.09
0.18
0.13
0.18
Voltage
(V)
1.8
1.8
1.8
1.2
3.3
0.85
1.8
1.2
1.8
Area
(mm2 )
3.1
4.53
1.932
4.96
2.12
13.05
1.16
2.46
3.53
2.1
1.41
3.1
Output
sorting
Yes
No
No
Yes
Yes
No
No
Yes
No
Yes
No
No
FFT size
2048
1024
512
128
Power
(mW)
63.72
62.92
57.51
51.69
55.64
11.29
52.02
17.26
640.00
159.00
103.50
119.70
87.00
5.20
175.00
Execute
time (us)
51.20
25.60
12.80
3.20
51.25
234.00
205.2
89.76
22.36
0.85
0.22
0.11
0.43
0.8
0.13
Normalized
energy
36.20
39.33
39.94
46.15
39.07
36.19
36.56
23.88
58.33
6.02
4.92
1.08
3.20
0.81
1.93
102.95
43.91
28.18
46.19
78.45
105.45
273.33
55.23
18.75
24.14
27.68
Normalized
area
70.45
the die area. Although the normalized area in the memorybase processor is smaller, the normalized energy is not reduced
proportionally [29], [31]. This is because the memory-based
scheme uses twice amount of memory than those in pipeline
schemes with continuous output. As long as the memories are
accessed by computing elements, the macros consumes power.
Consequently, to effectively reduce the area and power consumption in large N-FFT/IFFT processor, decreasing memory
usage may be a key solution. Moreover, the output latency
of the memory-based FFT/IFFT processors can be as long as
one OFDM symbol. Thus it is not able to handle successive
OFDM symbols unless extremely high clock is used to handle
relatively slow data.
Trying to seek a good trade-off between these conventional
schemes, the proposed FFT/IFFT processor adopts simple
memory scheduling methods for both input and output data,
this enables the processor to use a relatively small amount
of memory to handle successive and multiple data streams.
730
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO. 4, APRIL 2013
[13] S.-H. Hsiao and W.-R. Shiue, Design of low-cost and high-throughput
linear arrays for DFT computations: Algorithms, architectures, and
implementations, IEEE Trans. Circuits Syst. II, Analog Digital Signal
Process., vol. 47, no. 11, pp. 11881203, Nov. 2000.
[14] Y.-W. Lin and C.-Y. Lee, Design of an FFT/IFFT processor for MIMO
OFDM systems, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54,
no. 4, pp. 807815, Apr. 2007.
[15] Y. Jung, H. Yoon, and J. Kim, New efficient FFT algorithm and
pipeline implementation results for OFDM/DMT applications, IEEE
Trans. Consumer Electron., vol. 49, no. 1, pp. 1420, Feb. 2003.
[16] T. Sansaloni, A. Perex-Pascual, V. Torres, and J. Valls, Efficient pipeline
FFT processors for WLAN MIMO-OFDM systems, Electron. Lett.,
vol. 41, no. 19, pp. 10431044, Sep. 2005.
[17] K. K. Parhi, Systematic synthesis of DSP data format converters
using life-time analysis and forward-backward register allocation, IEEE
Trans. Circuits Syst. II, Analog Digital Process., vol. 39, no. 7, pp. 423
440, Jul. 1992.
[18] T. Jarvinen, P. Salmela, H. Sorokin, and J. Takala, Stride permutation
networks for array processors, in Proc. 15th IEEE Int. Conf. Appl.-Spec.
Syst., Archit. Process., Sep. 2004, pp. 376386.
[19] M. Pschel, P. A. Milder, and J. C. Hoe, Permuting streaming data
using RAMs, J. ACM, vol. 56, no. 2, pp. 10:110:34, 2009.
[20] B. Fu and P. Ampadu, An area efficient FFT/IFFT processor for MIMOOFDM WLAN 802.11n, J. Signal Process. Syst., vol. 56, no. 1, pp.
5968, Jul. 2009.
[21] W. Fan and C.-S. Choy, Robust, low-complexity, and energy efficient
downlink baseband receiver design for MB-OFDM UWB system, IEEE
Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 2, pp. 399408, Feb.
2012.
[22] B. M. Baas, A low-power, high-performance, 1024-point FFT processor, IEEE J. Solid-State Circuits, vol. 34, no. 3, pp. 380387, Mar.
1999.
[23] E. E. Swartzlander, W. K. W. Young, and S. J. Joseph, A radix 4 delay
commutator for fast Fourier transform processor implementation, IEEE
J. Solid-State Circuits, vol. 19, no. 5, pp. 702709, Oct. 1984.
[24] S.-N. Tang, J.-W. Tsai, and T.-Y. Chang, A 2.4-GS/s FFT processor for
OFDM-based WPAN applications, IEEE Trans. Circuits Syst. II, Exp.
Briefs, vol. 57, no. 6, pp. 451455, Jun. 2010.
[25] Y. Chen, Y.-W. Lin, Y.-C. Tsao, and C.-Y. Lee, A 2.4-Gsample/s DVFS
FFT processor for MIMO OFDM communication systems, IEEE J.
Solid-State Circuits, vol. 43, no. 5, pp. 12601273, May 2008.
[26] M. S. Patil, T. D. Chhatbar, and A. D. Darji, An area efficient and low
power implementation of 2048 point FFT/IFFT processor for mobile
WiMAX, in Proc. Int. Conf. Signal Process. Commun., 2010, pp. 14.
[27] S.-Y. Peng, K.-T. Shr, C.-M. Chen, and Y.-H. Huang, Energy-efficient
1282048/1536-point FFT processor with resource block mapping for
3GPP-LTE system, in Proc. Int. Conf. Green Circuits Syst., 2010, pp.
1417.
[28] S.-J. Huang and S.-G. Chen, A green FFT processor with 2.5-GS/s
for IEEE 802.15.3c (WPANs), in Proc. Int. Conf. Green Circuits Syst.,
2010, pp. 913.
[29] C.-L. Hung, S.-S. Long, and M.-T. Shiue, A low power and variablelength FFT processor design for flexible MIMO OFDM systems, in
Proc. IEEE Int. Symp. Circuits Syst., May 2009, pp. 705708.
[30] Y.-T. Lin, P.-Y. Tsai, and T.-D. Chiueh, Low-power variable-length fast
Fourier transform processor, IEE Proc. Comput. Digital Tech., vol. 152,
no. 4, pp. 499506, Jul. 2005.
[31] Y. Chen, Y.-W. Lin, and C.-Y. Lee, A block scaling FFT/IFFT processor
for WiMAX applications, in Proc. IEEE Asian Solid-State Circuits
Conf., Nov. 2006, pp. 203206.
Kai-Jiun Yang received the B.S. degree in electrical engineering from Tamkang University, Taipei,
Taiwan, in 1999, and the M.S. degree in electrical engineering from the University of Southern
California, Los Angeles, in 2001. He is currently
pursuing the Ph.D. degree in electrical and control
Engineering with National Chiao-Tung University,
Hsinchu, Taiwan.
He was with Trendchip Technologies (merged into
MediaTek Inc.), Hsinchu, from 2001 to 2009, and
developed DMT-ADSL chip-set. He is currently with
the Industrial Technology Research Institute, Hsinchu, where he participates in
low-power wireless SoC implementation. His current research interests include
baseband signal processing in orthogonal frequency division multiplexing
systems, VLSI design, and verification.
YANG et al.: MDC FFT/IFFT PROCESSOR WITH VARIABLE LENGTH FOR MIMO-OFDM SYSTEMS
731