IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 50, NO. 1, JANUARY 2015
A Multi-Granularity FPGA With Hierarchical Interconnects for Efficient and Flexible Mobile Computing
Fang-Li Yuan, Student Member, IEEE, Cheng C. Wang, Student Member, IEEE, Tsung-Han Yu, Student Member, IEEE, and Dejan Marković, Member, IEEE
Abstract: Following the rapid expansion of mobile computing, mobile system-on-a-chip (SoC) designs have off-loaded most compute-intensive tasks to dedicated accelerators to improve energy and area efficiency. An increasing number of accelerators in power-limited SoCs results in large regions of dark silicon. Unlike processors, dedicated hardware is inflexible, so any changes would require a chip re-design, which significantly impacts cost and timeline. To address the need for efficiency and flexibility, this work presents a multi-granularity FPGA suitable for mobile computing. Occupying 20.5 mm² in 40 nm CMOS, the chip incorporates fine-grained configurable logic blocks, medium-grained DSP processors and reconfigurable block RAMs, and two coarse-grained kernels: a 64- to 8192-point FFT processor and a 16-core universal DSP for software-defined radios. Using a mixed-radix hierarchical interconnect, the chip achieves a 3–4x interconnect area reduction over commercial FPGAs for comparable connectivity, reducing overall area and leakage by 2–2.5x, and delivering up to 50% lower active power. With coarse-grained kernels, the energy efficiency reaches within 4–5x of ASIC designs.

Index Terms: Digital integrated circuits, digital signal processors (DSPs), fast Fourier transform (FFT), field programmable gate arrays (FPGAs), interconnect networks, low-power design.

Manuscript received June 03, 2014; revised November 05, 2014; accepted November 05, 2014. Date of publication December 11, 2014; date of current version December 24, 2014. This paper was approved by Guest Editor Stephen Kosonocky. This work was supported in part by the DARPA HEALICs program.
F.-L. Yuan and C. C. Wang were with the University of California, Los Angeles, CA 90095 USA. They are now with Flex Logix Technologies, Inc., Mountain View, CA 94040 USA.
T.-H. Yu was with the University of California, Los Angeles, CA 90095 USA. He is now with Qualcomm, Irvine, CA 92618 USA.
D. Marković is with the University of California, Los Angeles, CA 90095 USA, and also with Flex Logix Technologies, Inc., Mountain View, CA 94040 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2014.2372034
0018-9200 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

The rapid expansion of mobile computing has driven the need to integrate more functionality and signal processing on a single chip, while maintaining a strict power and cost budget. As a result, flexibility and energy/area efficiency have become major criteria for mobile designs. Today's mobile system-on-a-chip (SoC) generally embeds microprocessor cores and several dedicated accelerators to improve efficiency while providing flexibility. Such heuristic migration from a purely processor-based SoC does not truly address the problem of efficient and flexible hardware, because the efficient blocks and the flexible blocks are merely integrated next to each other.

Unlike traditional power and performance criteria, efficiency requires a combination of metrics. Energy efficiency quantifies work per unit energy, and is generally measured in billions of operations per second per milliwatt (GOPS/mW). In VLSI circuits, this translates to battery life and thermal budget. Area efficiency quantifies work per unit area, and is generally measured in GOPS/mm². This translates to die size and cost. Rather than focusing solely on power, area or performance, targeting efficiency balances these tradeoffs.

To look at the efficiency of today's chips, we normalize the available data from the International Solid-State Circuits Conference and the Symposium on VLSI Circuits between 1999 and 2011 to 16 b arithmetic in 65 nm technology, categorize them and take the average (Fig. 1). We see a 1000x gap in energy efficiency between microprocessors and dedicated hardware [1], [2]. Programmable digital signal processors (DSPs) have a wide range of efficiency depending on architectures and applications, but they generally fall in between microprocessors and dedicated hardware. Field-programmable gate array (FPGA) data were gathered from [3], [4].

Fig. 1. Efficiency vs. flexibility for microprocessors, FPGAs, programmable DSPs, and dedicated hardware.

The idea for an accelerator-based SoC design becomes clear: dedicated blocks provide much higher efficiency than processors, meeting the strict energy and cost requirements. Meanwhile, a microprocessor is used as an arbiter and/or a system controller to compensate for the flexibility loss from the accelerators. As today's SoCs require more functions in every generation for the same power budget, adding accelerators has resulted in large portions of dark silicon [5], [6] and increased chip size. Using many dedicated blocks also requires frequent design changes, due to new standards, new features,
or re-spins. Under the frequent changes and rising design costs, the accelerator-based SoC becomes less of a scalable solution. Instead, there is a greater need today for flexible designs that are re-usable, but also efficient enough for the strict energy requirements.

Another issue with dedicated blocks is compatibility with non-standard interfaces. We are facing an exponential growth in the number of connected devices with a wide range of requirements for the number of antennas, modulation classes (e.g., PSK, QAM) and system types (e.g., OFDM, CDMA). Some call this the Internet of Things. In addition, each device needs functionality adjustment to respond to real-time changes in channel condition, such as using a low-energy zero-forcing or a high-accuracy sphere decoder, to balance the link throughput, bit-error rate and power consumption. Flexible hardware would be highly beneficial here: besides providing flexible I/Os, it can also perform data processing for each of the connected devices.

Although Moore's Law lowers the cost of transistors, chips are becoming increasingly more expensive to design. At 28 nm, the nonrecurring engineering (NRE) cost of a custom-made IC is more than $50 million [7]. For chip makers to afford the newest technology, amortizing the design cost over a large volume is no longer enough. Chips need to work the first time, fit in a range of end-products, and remain competitive and up-to-date for a longer product cycle. We now need hardware re-use. To achieve such flexibility, we need flexible hardware. But to maintain the efficiency of current SoCs, we need efficient, flexible hardware.

Among the candidates for flexible hardware, microprocessors are least efficient. They will not replace ASICs. Software-controlled programmable DSPs are unlikely to deliver the bit/cycle-true behavior of ASICs. DSPs also have lower throughput unless massively-parallel single-instruction-multiple-data (SIMD) architectures [8]–[10] are used. FPGAs, designed to emulate dedicated hardware in a bit/cycle-true manner, come closest to mimicking ASICs. FPGAs have ASIC-like throughput due to parallel logic blocks. But to make FPGAs real candidates in the mobile space, we need to narrow the large efficiency gap between FPGAs and ASICs.

We present a multi-granularity FPGA aimed at mobile computing and focus on two key steps towards higher efficiency: interconnect networks, and coarse-grained reconfigurable DSP. Section II gives a brief background of FPGA interconnect networks. Section III presents our hierarchical interconnects for a more efficient FPGA. The multi-granularity FPGA with various configurable logic blocks is described in Section IV, and Section V adds the coarse-grained DSP as the key module for mobile computing. Chip software flow is presented in Section VI, followed by measured results and comparison to prior art in Section VII. Section VIII concludes the paper.

II. BACKGROUND OF FPGA INTERCONNECT NETWORKS

It is well known that FPGAs incur penalties in area (17–70x), speed (3–6x) and power (5–52x) compared to standard-cell ASICs in the same process technology, but these numbers do not represent efficiencies [3]. The actual energy efficiency gap is the product of the speed and power gaps, and the area efficiency gap is the product of the area and speed gaps.

It is natural to attribute the FPGA inefficiency to the sea-of-lookup-tables (LUTs) architecture. Although an FPGA is generally called a gate array, it is the interconnect network that accounts for the majority of the FPGA's area (75–80%), delay (75%), and power (60%) today [3], [11], [45].

A. Traditional 2D-Mesh Networks

Although FPGAs have grown enormously in size since their inception (64 LUTs in XC2000), the fundamental interconnect architecture still remains (Fig. 2). In 2D-mesh interconnects, LUTs are placed in configurable logic blocks (CLBs), and interconnects run in the x- and y-direction surrounding the CLBs. I/O connection switches tie the CLB I/O to the interconnect network. Arrays of switch boxes are placed at interconnect crossings to select and buffer the programmed path, each containing programmable pass-transistors. Since a full switch-box array at every interconnect crossing requires too much area, various heuristics are used to simplify the arrays at the cost of lower connectivity [12]–[14]. In Fig. 2, the example network only implements switch boxes along one main diagonal and two sub-diagonals of the switch-box array. Modern FPGAs have also migrated toward unidirectional routing to reduce interconnect loading [15], [16].

Fig. 2. A sample 2D-mesh architecture with I/O connections (top-left) and switch boxes (bottom-left).
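The relation stated in Section II between the raw FPGA-versus-ASIC penalties and the efficiency gaps can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' tooling: the function name `efficiency_gaps` and the input values are illustrative assumptions, chosen only to show how the two products are formed.

```python
def efficiency_gaps(area_gap: float, speed_gap: float, power_gap: float) -> dict:
    """Combine raw FPGA-vs-ASIC penalties into efficiency gaps.

    energy-efficiency gap = speed gap * power gap
    area-efficiency gap   = area gap  * speed gap
    """
    return {
        "energy_efficiency_gap": speed_gap * power_gap,
        "area_efficiency_gap": area_gap * speed_gap,
    }


# Illustrative penalties only (assumed values, not measurements from the paper).
gaps = efficiency_gaps(area_gap=20.0, speed_gap=4.0, power_gap=10.0)
print(gaps)
```

The speed penalty appears in both products because a slower design delivers fewer operations per second for the same power and for the same area, so it multiplies the power penalty and the area penalty alike.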
The root cause for interconnect overhead is the scalability of 2D-mesh interconnects. In the worst case, the number of switch boxes grows as $O(N^2)$ with the number of LUTs $N$. Although heuristics can reduce the number of switches, there is a limit. Rent's rule can be used to model interconnects, $T = kN^{p}$, where $T$ is the number of I/Os, $N$ is the number of gates, $p$ is the Rent's exponent for modeling the number of I/Os, and $k$ is a constant of proportionality. In typical cases, the interconnect complexity per logic block is $O(N^{0.75})$ for random logic, which is still $O(N^{1.75})$ for a chip of $N$ logic blocks [17]. For highly regular designs, such as memory banks, the complexity per logic block is $O(N^{0.5})$. Since FPGA mapping software employs intelligent gate placements, the logic is not completely random, but it is certainly not as regular as memory banks. We therefore expect the actual Rent's exponent to be between 0.5 and 0.75 [13], which gives us lower and upper bounds on interconnect complexity. Clearly, FPGA interconnect needs to scale much faster than the number of logic elements $N$. With $N$ scaling from 64 in XC2000 to 1,000,000 in modern FPGAs, it becomes clear why interconnect area is a key concern today.

B. Previous Hierarchical Networks

To address the non-scalability of 2D-mesh, some designs have adopted hierarchical interconnects based on a Beneš network that scales as $O(N \log N)$. The Clos, Beneš, and similar hierarchical networks are well known to be rearrangeably non-blocking networks for point-to-point connections, and are commonly used in telecommunication [18]–[20]. Numerous publications have also discussed hierarchical FPGAs with a tree-of-meshes topology [21]–[24]. It is a limited-bisection network, where the mesh connectivity decreases for upper hierarchies. A centralized routing network is required at every hierarchy, which increases routing congestion. The scalability issue is only partially resolved, as the central switches are still based on 2D-mesh.

A multilevel hierarchical FPGA was published in [25], although no silicon realization was attempted. The architecture uses a radix-4 topology with a Rent's exponent of 1, but only on the downward paths. The upward path, on the other hand, provides no path diversity. Therefore, the overall path diversity of the architecture is very limited, and the interconnect connectivity when mapping designs is about 30–50%, often requiring a 2K-LUT FPGA to map 1K-LUT designs.

C. Our Challenges

Although the hierarchical FPGA has great appeal on paper, it has not had much success in practice. The main reason is that it has yet to demonstrate any advantage over 2D-mesh. On the other hand, 2D-mesh FPGAs today are already very mature, well-optimized products.

For our work to be considered worthwhile, we need to demonstrate a hierarchical FPGA with significant benefits over state-of-the-art commercial FPGAs. For practicality, an automated tool is also needed to map user designs. Overall, this project requires us to first create an efficient interconnect architecture and reconfigurable logic, realize it in silicon, then develop software tools to demonstrate its advantages and usability. These steps are further described below.

III. OUR HIERARCHICAL INTERCONNECT NETWORK

We adopted a hierarchical interconnect architecture based on a folded-Beneš network. An example network is shown in Fig. 3 to illustrate the concept. It is radix-2, so each switch matrix (SM) accepts inputs from two SMs, and sends its outputs to two other SMs, both upstream and downstream. Signals will come from the LUT output, traverse up to the required hierarchy, and traverse back down to the LUT input. To ease routing congestion, our architecture evenly distributes routing across all LUTs instead of crowding them into centralized hubs [23], [27].

Fig. 3. An example folded-Beneš network with radix-2 switches, and a data width of 2.

A. Radix-3 Boundary-Less Interconnect

Although hierarchical routing's complexity of $O(N \log N)$ is simpler than the $O(N^2)$ of 2D-mesh, it can be inefficient for local routing if the leaves are crossing a high-radix boundary, resulting in performance loss. For example, in Fig. 4(a), LUTs 8 and 9 are neighbors, but signals have to traverse up 4 stages of network, and then zig-zag their way down the hierarchy. Such lack of spatial locality is not desirable. One method to shorten the nearest-neighbor routing lengths is an isomorphic transformation, as shown in Fig. 4(b), which maintains the same logic connectivity [28]. Although the wire length travelled has been reduced, the number of switches has not: the signal still needs to traverse up and down 4 hierarchies for communication between LUTs 8 and 9.

We propose a method of applying higher-radix switches on the lower SM levels to utilize spatial locality in routing, allowing efficient interconnect routing for direct neighbors. We call such a network a boundary-less radix-3 (BLR3) network [29]. To convert a radix-2 network to a BLR3 network, we first identify the center 2×2 routing of each stage, shown in the dashed circle in Fig. 4(b). They only connect across an interconnect length of 1. All circled 2×2 routings are then moved to stage 1 (Fig. 4(c)),
transforming to a BLR3 stage. All stage-1 switches are now capable of communicating with their immediate neighbors.

Fig. 4. (a) An original 16-LUT Beneš network, (b) with isomorphic transformation to shorten nearest-neighbor lengths, and (c) with boundary-less radix-3 switches in stage 1.

After stage 1 is converted into BLR3 (Fig. 4(c)), the stage routings only connect across an interconnect length of 3. The 2×2 routes with a length of 3 are then moved down to stage 2 (Fig. 5(a)), converting the second stage to BLR3. This stage-by-stage transformation can be continued to the top of the hierarchy (Fig. 5(b)). Alternatively, the designer may also choose to stop at any hierarchy, and preserve the remaining upper hierarchies as a traditional radix-2 network.

In Fig. 5(b), we see that all stages above stage 1 have unevenly distributed routing: some switches have to connect more routing than others. This scenario occurs because the number of wires above stage 1 has been reduced by 50%. To form a regular routing pattern, one method is to evenly re-distribute the interconnect routing: the dual routes branching out of stages 1-4 are re-distributed across all switches, resulting in the final routing architecture shown in Fig. 5(c).

Fig. 5. A 16-LUT Beneš network, (a) with boundary-less radix-3 switches in stages 1 and 2, (b) with boundary-less radix-3 switches in stages 1-3, and (c) rearranged for distributed routing.

The BLR3 network is one of the many techniques we used to optimize the interconnects. It is difficult to model and quantify what impacts these changes have on the overall quality of the network. We can evaluate it by mapping benchmark designs through our CAD tool, which reports routing congestion and critical-path delay. We then tune the architecture and re-evaluate.

The final interconnect architecture (Fig. 6) has 40 tiles in seven different types. Each tile has 512 SM macros, ranging
from 9 to 14 levels of hierarchy. The upper hierarchies have fewer available wires as a result of interconnect pruning from our closed-loop design process.

Fig. 6. The final interconnect architecture divided into 40 tiles, each with 512 switch-matrix macros, ranging from 9 to 14 levels of hierarchy.

B. Fine-Grained Interconnect Power Gating

The interconnect network in an FPGA can involve long wires and large output drivers. To reduce leakage, power gating (PG) is employed to power off the output drivers when unused. Footer transistors work well for power gating an entire block, but interconnect signals are long and thus lack spatial locality. It is therefore ineffective to power-gate interconnect at the block level, because just one active interconnect buffer would prevent an entire region from being power-gated. Therefore, fine-grained PG circuitry is needed, such as adding footer transistors and PG control to individual output drivers.

In interconnect switches, the output drivers are already large. Because the PG footer transistors are in series with the output transistors, the footer needs to be made wide to maintain the on current. So, in traditional PG designs, there is either a large area penalty from a larger footer, or a large performance penalty from a smaller footer.

Our power-gating method for interconnects requires minimal area overhead and has no significant performance impact [30]. A pass-transistor input is added for PG, and the inputs to the output inverter are separated into pmos and nmos inputs, joined by a minimum-size, high-$V_T$ transmission gate as keeper (Fig. 7). During PG, the PG control signal is 1, the keeper is off, and the pmos and nmos nodes are driven to opposite polarities, 1 and 0, respectively, turning off both output transistors. This tri-state not only reduces leakage, but also drastically reduces the coupling capacitance experienced by neighboring wires by forming a capacitive divider.

Fig. 7. A 4-input static mux with output inverter and our proposed tri-state PG.

When one of the select bits is on, power gating is disabled, and the keeper is enabled to bring the pmos and nmos nodes together. The static MUX must quickly transmit the selected input to the output inverter. The MUX contains
a full transmission gate, enabling a good drive capability for both 0 and 1 at the input. The NMOS pass gate is connected to the node pmos to rapidly turn on the PMOS transistor, and the PMOS pass gate is connected to the node nmos to rapidly turn on the NMOS transistor in the output inverter. When one transistor in the output inverter is turning on, the other is not yet fully off (its gate voltage hovers at an intermediate level until the keeper completes the transition), but the current difference between the output transistors is large enough to not impact the performance by more than 10%. Eventually, the keeper connects the pmos and nmos nets to the same voltage, and the static leakage is no different from a traditional CMOS inverter.

The design in Fig. 7 cannot directly drive CMOS gates, because it leads to floating gates during PG. The outputs can only drive the input of another static MUX. All nets driven by tri-state buffers must be set as dont_touch during chip synthesis to avoid automated buffer insertion.

Utilizing our closed-loop interconnect optimization and our area-efficient PG scheme, the overall interconnect area is reduced to 52% of the total core area, maintaining a logic-to-interconnect ratio of 1:1. Compared to 2D-mesh FPGAs that use 75–80% of the area for interconnects, this chip achieves a 3–4x reduction for a fixed logic area. Compared to our previous work [26], we have scaled the logic complexity by more than 10 times while maintaining the same logic-to-interconnect ratio. Even though this scaling should increase the relative interconnect area, the increase is mitigated by our closed-loop interconnect optimization.

IV. MULTI-GRANULARITY FPGA ARCHITECTURE

An improved interconnect network does not address the inefficiency of the CLB itself, since fine-grained CLBs are more general purpose but less efficient. The second step towards better efficiency is to realize coarse-grained kernels. This chip is built with three granularities of CLBs.

A. Fine-Grained and Medium-Grained Configurable Logic Block

The fine-grained CLBs mainly consist of LUTs and their surrounding logic, such as basic arithmetic and flip-flops. Some CLBs also support modes for small memories and shift registers. The CLBs are made to be logically compatible with the CLB designs from Xilinx Virtex-6 and 7. This not only allows our FPGA to be synthesized with commercial synthesis tools (e.g., Synplify Pro), but it also enables a direct comparison of performance and power between our FPGA and commercial FPGAs while mapping identical designs.

More intensive arithmetic such as wide multiplications and multi-input additions requires a more dedicated DSP accelerator, and larger memories require a more dedicated memory IP. We designed the DSP accelerators to be compatible with the DSP48E1 accelerators from Xilinx. Various pipeline modes are implemented to support a maximum operation frequency of 800 MHz. Our BRAM CLBs are 36 Kb each, also compatible with the RAMB36 designs from Xilinx.

B. Coarse-Grained Configurable Logic Block for Communication Signal Processing

Since this chip primarily targets high-throughput communication applications, we integrated two coarse-grained accelerators to achieve higher efficiency for domain-specific computations.

The first block is a 64- to 8192-point reconfigurable Fast Fourier Transform (FFT) processor (Fig. 8). The FFT is designed with 16 parallel, 4- to 512-point cores, followed by a 16-point final stage for data combining. To save area, BRAM and CLB memories on the FPGA are utilized to form the delay lines (DLs) of size above 32, while pointer-based shift registers are employed for smaller DLs. Each of the 16 cores is implemented as a 3-stage mixed-radix pipelined FFT to ease the timing requirements and leverage voltage scaling for high energy efficiency. Extensive analysis of radix factorization is performed in [31] to determine the optimal radix per pipeline stage that minimizes the energy and area cost.

Fig. 8. The FFT architecture and radix factorizations of different FFT resolutions.

The second coarse-grained block is a 16-core, highly-efficient universal DSP accelerator, reconfigurable to perform tasks for software-defined radio (SDR).

V. DOMAIN-SPECIFIC COARSE-GRAINED UDSP PROCESSOR

The growing number of radio standards has led to the concept of SDRs, where hard-wired computing blocks (e.g., channel
estimation) are replaced by flexible hardware to adapt to evolving protocols. The domain-specific reconfigurable processor (DSRP) is a promising architecture for SDRs [32]. It uses an array of compute elements linked by 2D-mesh interconnects to reduce the control overhead associated with traditional processors. Architectures of DSRPs have been extensively studied with emphasis on the core granularity [33], dynamic voltage and frequency scaling [34] and globally-asynchronous-locally-synchronous clocking [36]. However, the benefits of control circuit simplifications for instruction deployment are not thoroughly studied. Prior DSRPs exploited SIMD/MIMD operations to reduce the control overhead, but their instruction set architecture (ISA) was predefined in hardware. Such an inflexible ISA had to be complex to make it universal enough for the required flexibility, increasing the area and energy overhead. In addition, any design change may create a need for new instructions, but a non-expandable ISA might fail to support the new features or have to use existing instructions as a work-around. The computations of today's DSRPs can be made efficient with existing architectural and circuit techniques, but the inflexible control architecture still limits the achievable efficiency and the hardware reusability.

We propose a flexible instruction-set architecture (ISA) aimed at simplified control and enhanced flexibility of DSRPs. We observe that for each domain-specific task, only a subset of the entire instruction set is required. If we can flexibly define the task-specific instructions prior to program execution, we no longer need a fixed and complex ISA that has high code coverage but low utilization. Adaptation problems with design changes can also be resolved by simple hardware reconfigurations. To realize this concept, the traditional ISA decoder is replaced by a register bank to deploy the task-dependent control signals. Real-time instruction definition is done by configuring the bank through specialized configuration commands.

In addition to control circuits, we investigate the core granularity and connectivity for highly efficient and flexible implementations. The butterfly compute element (BCE) supports arbitrary 2×2 complex-valued matrix operations and is found to be the proper granularity for SDR workloads. The multi-scale interconnects comprise the two-dimensional fast-paths and the hierarchical network for efficient data arbitration between BCEs and other FPGA fabrics.

Fig. 9. For SDR tasks, the (a) generic 2×2 dataflow structure is considered as the proper granularity. More detailed implementation of the SIMD-style BCE is illustrated in (b).

A. 16-Core UDSP Architecture

The proposed UDSP supports filtering, channel estimation, blind classification, and signal detection for 1×1 to 4×4 multi-antenna (MIMO) systems with modulation order from BPSK to 64QAM. It comprises 16 homogeneous cores in a 4×4 2D array [37]. Each core embeds one BCE and four interconnect switches (ISWs) for programmable computing and networking. Local connection between adjacent cores is supported by a 128-bit uni-directional, 2D fast-path that allows horizontal transmission from left to right, and vertical links in a zigzag fashion. This approach saves 50% of the circuit-switched MUXes compared with a bi-directional scheme [34], yet it preserves enough connectivity for stream-oriented, multistage mappings. The ISWs form a hierarchical network, compatible with that on the FPGA, to allow data exchange and multicasting between cores and other on-chip CLBs.
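To make the register-bank idea from the introduction of this section more concrete, the following is a minimal behavioral sketch, assuming a toy decoder whose opcodes simply index a small bank of control words. The class name `FlexibleDecoder`, the bank size, and the control-word values are hypothetical and are not taken from the chip; the sketch only illustrates that reconfiguring the bank before execution lets the same opcode produce different control patterns for different tasks.

```python
class FlexibleDecoder:
    """Toy register-bank decoder: an opcode indexes a task-specific control word."""

    def __init__(self, num_entries: int = 4):
        # One control word per opcode; loaded per task before execution.
        self.bank = [0] * num_entries

    def configure(self, index: int, control_word: int) -> None:
        """Configuration-type command: define what an opcode means."""
        self.bank[index] = control_word

    def decode(self, opcode: int) -> int:
        """Operation-type command: translate an opcode to its control word."""
        return self.bank[opcode]


decoder = FlexibleDecoder()

# Hypothetical control word for a first task.
decoder.configure(0, 0b1010)
task_a_ctrl = decoder.decode(0)

# Reconfigure the same opcode for a second task.
decoder.configure(0, 0b0111)
task_b_ctrl = decoder.decode(0)

print(bin(task_a_ctrl), bin(task_b_ctrl))
```

In this sketch decoding is a plain table lookup, which is the reason a configurable bank can be simpler than a fixed decoder that must cover every instruction any task might need.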
Fig. 10. Flexible ISA control mechanism.
Fig. 11. Energy breakdown of 4×4 QRD shows an overall 1.8x energy saving from the flexible ISA over the traditional fixed-ISA scheme.

B. Butterfly Compute Element

The datapath structure directly impacts the energy efficiency of the core. An efficient dataflow minimizes data movement, reducing the program runtime and power. Modeling the energy cost of each component clarifies the approach. For example, a 16-bit add takes around 0.035 µW/MHz at nominal supply voltage in 40 nm CMOS, while a 16-bit read/write access from the register file costs around 1.7 µW/MHz, about 50x higher than addition. From our examination of common SDR algorithms, we propose a generic 2×2 butterfly dataflow structure as the proper granularity (Fig. 9(a)). This 2×2 structure can map functions such as filters (for spectrum shaping), QR decomposition (QRD) (for channel factorization), linear equalizer and sphere decoder (SD) (for signal detection).

The SIMD-style BCE, optimized by modeling the energy cost of elementary components, is designed to process 16-bit complex-valued data in fixed point (Fig. 9(b)). Three major components, 16-bit multimode multipliers, 32-bit shifters, and 40-bit adders, are flexibly concatenated to perform MAC, normalization, Euclidean distance, 1–16x-iterated CORDIC, etc. Auxiliary components, such as sinusoid synthesis (SIN), CORDIC pre-normalization, and the metric-enumeration unit (ME), accelerate the miscellaneous SDR operations. Two pipeline stages, plus the multiplier output registers, are inserted for a 500 MHz peak clock rate, while the functional registers are used for task-dependent retiming and data interleaving. All of the above components and memories are aggressively clock-/input-gated to save unnecessary switching energy. The 2R2W 64×32-bit register file (RF) is employed for local data access and FIFO. The RF is composed of two independent sub-banks, M1 and M0. The two write accesses must be issued to different sub-banks, but the read accesses can come from the same sub-bank. This feature enables efficient data combining among M0 and M1 to lower the memory requirement for various SDR tasks, such as the candidate-symbol generation in the MIMO SD.

The BCE datapath consumes about 6 µW/MHz per core at nominal supply voltage, equivalent to an energy overhead from 50% to 210% as compared to the cost of active operations from heavy- to light-loaded tasks. This overhead, however, yields great reduction in memory access and program latency. For example, a 12x-iterated CORDIC square root takes 12 cycles with no intermediate RF access, 18x faster than the 216 cycles in the state-of-the-art communication processor [36]. From an efficiency perspective, the 18x saving in instruction memory (4 µW/MHz) already outweighs the datapath overhead by 12x, even without considering the overhead of RF and datapath in [36].

C. Flexible ISA

The flexible control structure for each BCE comprises a 128×43-bit instruction memory (IM), a programmable SM and a flexible ISA decoder (Fig. 10). In each 43-bit instruction, the 3-bit instruction header defines whether it belongs to the configuration or the operation type. The former configures the ISA decoder based on the remaining 40-bit content, while the latter uses a 10-bit opcode to access up to eight branching states, two network and 64 datapath configurations from the state, network and datapath register banks
(SRB/NRB/DRB). The contents of the RBs are then selectively accessed to translate the opcodes to the physical control patterns. To reduce the complexity of the ISA decoder, we partition the DRB into three sub-banks, each respectively controlling the input, arithmetic, and output part of the BCE. Compared with a heuristic approach that requires a 64-row DRB to memorize 64 BCE configurations, our approach can save the DRB area by 16x (4 vs. 64) without significant loss in controllability. For each task mapping, the BCE configures the instruction decoder prior to execution. The configuration time is task-dependent, typically no more than 12 clock cycles. Due to the reconfiguration feature, the same opcode can result in totally different physical microcode information for distinct SDR tasks.

Fig. 12. Our 2-mode place-and-route flow, from user input to bitstream generation.
Fig. 13. Die micrograph and summary.

The programmable SM directs the value and accumulation mode of the program counter (PC) to access the IM. Each of the eight available states, as configured in the SRB, contains a 7-bit inner-loop bound (ILB), 7-bit outer-loop bound (OLB), 1-bit PC accumulation enable, 7-bit PC start value, 1-bit state-counter accumulation enable, 5-bit state-counter start value, and a 5-bit state-counter end value to establish state transition and counting rules for a variety of pointers in the SM. Overall, the SM enables zero-overhead looping, branching and subroutine calls to map any SDR task
Fig. 14. Benchmark mapping of UDSP for performance measurements.
Fig. 15. Comparison of the UDSP with state-of-the-art communication multiprocessors and functionally-equivalent ASICs, showing its good compromise between programmable and dedicated solutions.
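The state-entry fields and the DRB partitioning described in the text can be sketched in Python as follows. The field names, the packing order (first listed field in the most significant bits), and the reading of "4 vs. 64" as four rows per sub-bank are assumptions for illustration only; the text gives the field widths but not the physical layout.

```python
from typing import Dict

# (field name, width in bits), in the order the text lists them.
# The packing order is an assumption, not taken from the paper.
SRB_FIELDS = [
    ("ilb", 7),        # inner-loop bound
    ("olb", 7),        # outer-loop bound
    ("pc_acc_en", 1),  # PC accumulation enable
    ("pc_start", 7),   # PC start value (7 bits address a 128-entry IM)
    ("sc_acc_en", 1),  # state-counter accumulation enable
    ("sc_start", 5),   # state-counter start value
    ("sc_end", 5),     # state-counter end value
]
SRB_ENTRY_BITS = sum(width for _, width in SRB_FIELDS)  # 33 bits per state


def pack_state(fields: Dict[str, int]) -> int:
    """Pack one state entry into an integer, first field in the top bits."""
    word = 0
    for name, width in SRB_FIELDS:
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_state(word: int) -> Dict[str, int]:
    """Inverse of pack_state."""
    fields = {}
    for name, width in reversed(SRB_FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


def drb_configurations(rows_per_sub_bank: int = 4, sub_banks: int = 3) -> int:
    """Distinct datapath settings reachable by picking one row per sub-bank."""
    return rows_per_sub_bank ** sub_banks


if __name__ == "__main__":
    entry = {"ilb": 15, "olb": 3, "pc_acc_en": 1, "pc_start": 64,
             "sc_acc_en": 0, "sc_start": 2, "sc_end": 9}
    packed = pack_state(entry)
    print(SRB_ENTRY_BITS, hex(packed), unpack_state(packed) == entry)
    print(drb_configurations())
```

Under these assumptions, three sub-banks of four rows each still reach 4 × 4 × 4 = 64 combined settings, which is consistent with the quoted 16x saving relative to a 64-row bank.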
with high efficiency and flexibility. Since the behavior of the SM is solely controlled by the state information in the SRB, the IM can fully focus on data manipulation without wasting time and energy on managing the PC value and the program flow. As a result, the SM can effectively lower the program runtime and improve the IM efficiency as compared to traditional processors.

The energy breakdown of the 4×4 QRD (memory- and control-dominant task) shows the benefits of flexible control (Fig. 11). Compared with the traditional control, the flexible decoder dissipates 6.4x less energy due to simpler selection logic. The IM energy is also reduced by 3.3x due to the higher code efficiency and the SM. The overall energy saving is 1.8x, making the efficiency of QRD around 2.34 GOPS/mW at 0.5 V, only 2.3x lower than a dedicated QRD [40].

VI. CHIP PROGRAMMING AND SOFTWARE FLOW

This chip is built to be flexible hardware, but it is not useful unless it is easy to program. With over 9 million configuration bits, we need an automated mapping flow. In a commercial FPGA, the user design is synthesized into a LUT-level netlist, and then sent into a vendor-specific tool, which then packs the LUTs into their CLBs, followed by placement and routing.

Our mapping flow supports two modes (Fig. 12). Mode 1 executes on a packed netlist compatible with commercial FPGAs, and thus only utilizes the fine- and medium-grained CLBs. Our tool parses the packed netlist and performs floorplanning, placement, routing, static timing analysis, and bitstream generation. This mode creates a design that is functionally identical to existing FPGAs for a bit-accurate, cycle-accurate comparison.

Mode 2 adds support for our coarse-grained FFT and UDSP. As of now, the user has to manually define the functions to execute on the coarse-grained blocks, and generate the assembly code to configure the UDSP. The rest of the logic goes through the commercial synthesis flow.

VII. CHIP MEASUREMENT AND RESULTS

The die photo and summary are shown in Fig. 13. Due to the high pin count and the performance requirement, a chip-on-board is used to interface with a Kintex-7 testbench FPGA.

We consider five SDR benchmarks for standalone UDSP measurements: 1) 64-tap raised-cosine FIR [38]; 2) 8th-order Chebyshev Type-II IIR; 3) 4th-order cyclic-autocorrelation (CAC) [39]; 4) 4×4 QRD [40]; 5) 4×4 MIMO SD [41]. Fig. 14 illustrates the energy efficiency vs. supply voltage. A 5–5.6x efficiency gap is observed between the compute-centric SD and the memory-/control-dominant QRD. Robust chip functionality is measured down to 415 mV with minimum power of 275 μW/core at 25 MHz, 25 °C for the SD. At this operating point, the leakage and the active power are roughly
Fig. 16. Measured results showing the impact of multi-grained blocks: (Inset) A 32-tap FIR filter mapped with Virtex-6, our chip with fine- and medium-grained
CLBs, and our chip with coarse-grained UDSP CLBs; (Main plot) A 1024-point FFT mapped on microprocessor, FPGA, DSP processor, this chip, and ASICs.
equal. This point also translates to a peak energy efficiency of 13.1 GOPS/mW (76 fJ/OP). Further, the performance of the chip can scale up to 1.17 TOPS at 500 MHz, 1.0 V, achieving 4.2 GOPS/mW.

To validate the efficiency benefits over state-of-the-art communication processors, we normalize the energy per operation of UDSP to 65 nm for a fair comparison (Fig. 15). The normalization is performed by matching the gate delay vs. supply voltage, and scaling the leakage and active power separately across technology nodes based on SPICE simulations. A normalized energy of 9.2 pJ/16-bit MAC shows 2.4–4.8x higher efficiency than [34], [35], [42]. The improvement scales up to 22x for low-voltage operations. Since energy per operation alone does not guarantee real-time performance, we also normalize the energy of UDSP and functionally equivalent ASICs [38]–[41] to the same throughput for a fair comparison. As shown in Fig. 15, the UDSP bridges the energy efficiency gap with ASICs to within 2.6x.

Fig. 17. Summary of our contributions. Efficiency and flexibility require the use of 1) compact interconnect networks and 2) coarse-grained reconfigurable blocks. Efficiency of UDSP alone is shown for comparison. Compiler support for UDSP-type blocks is required for user adoption.

To consider the entire chip operations, a 32-tap FIR filter implemented on Virtex-6, our FPGA with CLBs only, and our FPGA with coarse-grained UDSP kernels is shown in Fig. 16. Our measured results are compared with Virtex-6 production data generated from the ISE tool. Because Virtex-6 is larger than our chip, we normalize the leakage area to only the utilized area after place and route. We measured up to 0.86 GOPS/mW using only fine-grained CLBs, a 4x efficiency gain over Virtex-6 due to interconnect area, power-gating, and voltage scaling. The coarse-grained UDSP kernel provides another 8x efficiency gain and up to 400 MHz performance.

To illustrate how this chip compares with other devices, and to illustrate the benefit of coarse-grained blocks, Fig. 16 also shows the measurement results of mapping a 1024-point FFT. In comparison, our chip is 10–20x more efficient than DSP solutions and microprocessors, while maintaining much higher throughput (1 clock cycle per FFT sample vs. 17 to 100 cycles, respectively). The DSP with a dedicated FFT core achieves higher efficiency and throughput by averaging only 2.8 cycles per FFT sample, but is still 1.5x less efficient than our flexible hardware. Another 13x gain can be achieved by enabling our FFT kernel.

For this example, we see that our chip runs about 30% slower than Virtex-6. This is mainly due to our software place and route, which is still rudimentary. From our experience, complex designs tend to see 10–30% performance degradation from our router. Our design is standard-cell based, which also affects the speed when compared to custom-designed commercial parts.

More examples are shown in Table I. These are designs for which we have the RTL available in-house, synthesized onto our FPGA using only fine-grained and medium-grained CLBs. These RTL designs were previously used to build our ASIC chips, measured under similar ambient conditions. With the improved energy efficiency from interconnects, we are within 12–15x the energy efficiency of ASIC designs when normalized for the ASIC technology. The area penalty is around 30–60x. These are amongst the best efficiency numbers for FPGAs
TABLE I
MEASUREMENT RESULTS USING ONLY FINE- AND MEDIUM-GRAINED CLBS
TABLE II
MEASUREMENT RESULTS USING FINE-, MEDIUM-, AND COARSE-GRAINED CLBS
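The UDSP operating points quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below converts GOPS/mW to energy per operation and estimates power from throughput and efficiency. It assumes that throughput scales linearly with clock frequency and that the 1.17 TOPS figure covers all 16 cores; both are simplifying assumptions made for this check and are not stated in the text. The function names are hypothetical.

```python
PEAK_THROUGHPUT_GOPS = 1170.0  # 1.17 TOPS at 500 MHz, from the text
PEAK_CLOCK_MHZ = 500.0
CORES = 16                     # 16-core UDSP


def fj_per_op(gops_per_mw: float) -> float:
    """1 GOPS/mW equals 1e12 op/J, i.e. 1 pJ/op or 1000 fJ/op."""
    return 1000.0 / gops_per_mw


def power_mw(throughput_gops: float, gops_per_mw: float) -> float:
    """Power implied by a throughput and an energy efficiency."""
    return throughput_gops / gops_per_mw


def throughput_at(clock_mhz: float) -> float:
    """Throughput at a lower clock, assuming linear scaling with frequency."""
    return PEAK_THROUGHPUT_GOPS * clock_mhz / PEAK_CLOCK_MHZ


if __name__ == "__main__":
    # Energy per operation at the peak-efficiency point (13.1 GOPS/mW).
    print(f"{fj_per_op(13.1):.1f} fJ/op")

    # Total power at the peak-performance point (1.17 TOPS, 4.2 GOPS/mW).
    print(f"{power_mw(PEAK_THROUGHPUT_GOPS, 4.2):.1f} mW")

    # Per-core power at 25 MHz and 13.1 GOPS/mW, in microwatts.
    per_core_uw = power_mw(throughput_at(25.0), 13.1) / CORES * 1000.0
    print(f"{per_core_uw:.0f} uW/core")
```

Under these assumptions the three results come to roughly 76 fJ/op, 279 mW, and 279 μW per core; the first matches the quoted energy per operation and the last is close to the per-core minimum power reported for the SD benchmark.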
today. For many smaller designs, we believe flexible logic is a good solution. For larger blocks, the efficiency gap may still be too large for many ASIC designers to adopt, but this is where the coarse-grained kernels can contribute.

By offloading computationally-intensive blocks to our FFT and UDSP kernels, the energy efficiency gap can be reduced to less than 5x (Table II), and the area penalty to 12–30x (Fig. 17). We believe this is the key step to efficiency, which is to first identify reconfigurable kernels that can be used effectively, and use flexible logic to do the rest.

VIII. CONCLUSION

In this work, we have made two steps towards more efficient, flexible hardware. Our first step is hardware with a more efficient interconnect. It is supported by a suite of design automation tools, leveraging commercial synthesis and custom P&R. For some applications, the efficiency offered by this flexible logic is sufficient. But to move even closer to ASIC efficiency, domain-specific flexible kernels are the key addition. Such flexible hardware with higher efficiency and proper tool support can fill the need for many mobile applications. We believe the future of mobile computing is much more than large SoCs for phones and tablets; it would be filled with many sensors and wireless devices in the Internet of Things. For these smaller wireless nodes, this type of efficient, flexible hardware will serve as a key component for both the interface and signal processing needs.

REFERENCES

[1] T. A. C. M. Claasen, "High speed: Not the only way to exploit the intrinsic computational power of silicon," in IEEE ISSCC Dig., 1999, pp. 22–25.
[2] R. W. Brodersen, "Technology, architecture, and applications," in IEEE ISSCC Dig., Special Topic Evening Session: Low Voltage Design for Portable Syst., Feb. 2002.
[3] I. Kuon and J. Rose, "Measuring the gap between FPGAs and ASICs," IEEE Trans. Computer-Aided Design Integr. Circuits Syst., vol. 26, no. 2, pp. 203–215, Feb. 2007.
[4] A. George, H. Lam, and G. Stitt, "Novo-G: At the forefront of scalable reconfigurable supercomputing," IEEE Comput. Sci. Eng., vol. 13, no. 1, pp. 82–86, Jan. 2011.
[5] N. Goulding-Hotta et al., "The GreenDroid mobile application processor: An architecture for silicon's dark future," IEEE Micro, vol. 31, no. 2, Mar. 2011.
[6] M. B. Taylor, "Is dark silicon useful?," in Proc. Design Automation Conf., 2012, pp. 1131–1136.
[7] E. Sperling, "Designing at 28 nm and beyond," Chip Design Mag., Mar. 2012.
[8] M. Nakajima et al., "A 40 GOPS 250 mW massively parallel processor based on matrix architecture," in IEEE ISSCC Dig., 2006, pp. 1616–1617.
[9] H. Noda et al., "The design and implementation of the massively parallel processor based on the matrix architecture," IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 183–192, Jan. 2007.
[10] T. Kurafuji et al., "A scalable massively parallel processor for real-time image processing," IEEE J. Solid-State Circuits, vol. 46, no. 10, pp. 2363–2373, Oct. 2011.
[11] I. Bolsens, "Programming modern FPGAs," Int. Forum on Embedded Multiprocessor SoC, Keynote, Aug. 2006.
[12] A. DeHon, "Balancing interconnect and computation in a reconfigurable computing array," in ACM Int. Symp. FPGA, 1999, pp. 69–78.
[13] R. Tessier and H. Giza, "Balancing logic utilization and area efficiency in FPGAs," in Int. Workshop on Field-Programmable Logic and Applications, 2000, pp. 535–544.
[14] M. Lin and A. E. Gamal, "A low-power field-programmable gate array routing fabric," IEEE Trans. Very Large Scale Integrat. Syst., vol. 17, no. 10, pp. 1481–1494, Oct. 2009.
[15] G. Lemieux, E. Lee, M. Tom, and A. Yu, "Directional single-driver wires in FPGA interconnect," in Int. Conf. Field Programmable Tech., 2004, pp. 41–48.
[16] E. Lee, "Interconnect driver design for long wires in field-programmable gate arrays," M.S. thesis, Univ. British Columbia, Vancouver, BC, Canada, 2006.
[17] B. S. Landman and R. L. Russo, "On a pin versus block relationship for partitions of logic graphs," IEEE Trans. Comput., vol. C-20, no. 12, pp. 1469–1479, Dec. 1971.
[18] C. Clos, "A study of non-blocking switching networks," Bell Syst. Tech. J., vol. 32, pp. 406–424, 1953.
[19] V. E. Benes, "Heuristic remarks and mathematical problems regarding the theory of switching systems," Bell Syst. Tech. J., vol. 41, pp. 1201–1247, 1962.
[20] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks. Boston, MA, USA: Morgan Kaufmann, 2004.
[21] Y.-T. Lai and P.-T. Wang, "Hierarchical interconnection structures for field programmable gate arrays," IEEE Trans. Very Large Scale Integrat. Syst., vol. 5, no. 2, pp. 186–196, Jun. 1997.
[22] W. Tsu et al., "HSRA: High-speed, hierarchical synchronous reconfigurable array," in Int. Symp. FPGA, 1999, pp. 125–134.
[23] D. Wong, "Interconnection network for a field programmable gate array," U.S. 6,693,456 B2, Feb. 2004.
[24] A. DeHon, "Unifying mesh- and tree-based programmable interconnect," IEEE Trans. Very Large Scale Integrat. Syst., vol. 12, no. 10, pp. 1051–1065, Oct. 2004.
[25] H. Mrabet, Z. Marrakchi, P. Souillot, and H. Mehrez, "Performances improvement of FPGA using novel multilevel hierarchical interconnection structure," in IEEE/ACM Int. Conf. Computer-Aided Design, 2006, pp. 675–679.
[26] C. C. Wang et al., "A 1.1 GOPS/mW FPGA chip with hierarchical interconnect fabric," VLSI'11, pp. 136–137.
[27] A. DeHon, "Compact, multilayer layout for butterfly fat-tree," in Proc. ACM Symp. Parallel Algorithms and Architectures, 2000, pp. 206–215.
[28] C.-L. Wu and T. Y. Feng, "On a class of multistage interconnection networks," IEEE Trans. Comput., vol. C-29, no. 8, pp. 694–702, Aug. 1980.
[29] C. C. Wang and D. Marković, "Network architectures for boundary-less hierarchical interconnects," Int. Patent Applicant. PCT/US2014/029407, Mar. 2014.
[30] C. C. Wang and D. Marković, "Fine-grained power gating in FPGA interconnects," Int. Patent Applicant. PCT/US2014/029404, Mar. 2014.
[31] C.-H. Yang, T.-H. Yu, and D. Marković, "Power and Area Minimization of Reconfigurable FFT Processors: A 3GPP-LTE Example," IEEE J. Solid-State Circuits, vol. 47, no. 3, pp. 757–768, Mar. 2012.
[32] J. Cong, G. Reinman, A. Bui, and V. Sarkar, "Customizable domain-specific computing," IEEE Design and Test of Computers, vol. 28, no. 2, pp. 6–15, Mar. 2011.
[33] A. Poon, "An energy-efficient reconfigurable baseband processor for wireless communications," IEEE Trans. Very Large Scale Integrat. Syst., vol. 15, no. 3, pp. 319–327, Jan. 2007.
[34] D. Truong et al., "A 167-Processor computational platform in 65 nm CMOS," IEEE J. Solid-State Circuits, vol. 44, no. 4, pp. 1130–1144, Apr. 2009.
[35] P. Ou et al., "A 65 nm 39 GOPS/W 24-core processor with 11 Tb/s/W packet-controlled circuit-switched double-layer network-on-chip and heterogeneous execution array," in IEEE ISSCC Dig. Tech. Papers, 2013, pp. 56–57.
[36] Z. Yu et al., "ASAP: An asynchronous array of simple processors," IEEE J. Solid-State Circuits, vol. 43, no. 3, pp. 695–705, Mar. 2008.
[37] F.-L. Yuan and D. Marković, "A 13.1 GOPS/mW 16-Core processor for software-defined radios in 40 nm CMOS," in IEEE Int. Symp. VLSI Circuits (VLSI) Dig., Jun. 2014, paper 8.4.
[38] F. Sheikh et al., "A 1-190 MSample/s 8-64 tap energy-efficient reconfigurable FIR filter for multi-mode wireless communication," in Proc. Symp. VLSI Circuits Dig., 2010, pp. 207–208.
[39] F.-L. Yuan, T.-H. Yu, and D. Marković, "A 500 MHz blind classification processor for cognitive radios in 40 nm CMOS," in IEEE Int. Symp. VLSI Circuits (VLSI) Dig., 2014, paper 8.3.
[40] M. Shabany et al., "A 0.13 μm CMOS 655 Mb/s 4×4 64-QAM K-best MIMO detector," in IEEE ISSCC Dig. Tech. Papers, 2009, pp. 256–257.
[41] M. Shabany et al., "A low-latency low-power QR-decomposition ASIC implementation in 0.13 μm CMOS," IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 60, no. 2, pp. 327–340, Feb. 2013.
[42] Z. Yu et al., "An 800 MHz 320 mW 16-Core processor with message-passing and shared-memory inter-core communication mechanisms," in IEEE ISSCC Dig. Tech. Papers, 2012, pp. 64–65.
[43] F.-L. Yuan, Y.-H. Lin, C.-F. Wu, M.-T. Shiue, and C.-K. Wang, "A 256-point dataflow scheduling 2×2 MIMO FFT/IFFT processor for IEEE 802.16 WMAN," in Proc. Asian Solid-State Circuits Conf., 2008, pp. 309–312.
[44] J. Thompson et al., "An integrated 802.11a baseband and MAC processor," in IEEE ISSCC Dig. Tech. Papers, 2002, pp. 126–127.
[45] R. Merritt, "FPGA add comms cores amid ASIC debate," EE Times, Mar. 2013.
[46] "FFT Implementation on the TMS320C5535 DSP," TI Tech. Reference Manual, pp. 111–134, 2012.
[47] T.-H. Yu et al., "A 7.4 mW 200 MS/s wideband spectrum sensing digital baseband processor for cognitive radios," IEEE J. Solid-State Circuits, vol. 47, no. 9, pp. 2235–2245, Sep. 2012.

Fang-Li Yuan (S'10) received the B.S. and M.S. degrees from National Taiwan University, Taipei, Taiwan, in 2006 and 2008, respectively, and the Ph.D. degree from the University of California, Los Angeles (UCLA), CA, USA, in 2014, all in electrical engineering.
He joined Flex Logix Technologies, a semiconductor IP startup, in 2014. His research interests include the flexible DSP architectures and VLSI circuits for communication signal processing, with particular focus on the software-defined and cognitive radios.
Dr. Yuan was awarded the Broadcom Fellowship in 2012 for his research on the multi-core processors for software-defined radios. In 2014, he received the Distinguished Ph.D. Dissertation Award in Circuits and Systems from UCLA.

Cheng C. Wang (S'06) received the B.S. degree in electrical engineering and computer sciences from the University of California, Berkeley, CA, USA, in 2005, and the M.S. and Ph.D. degrees in electrical engineering from the University of California, Los Angeles, CA, USA, in 2009 and 2013, respectively.
In 2014, he co-founded Flex Logix Technologies, Inc. His research interests include the design automation flow for reconfigurable hardware, as well as energy-efficient circuits and systems.
Dr. Wang received the Distinguished Ph.D. Dissertation Award in Circuits and Systems from UCLA in 2013.

Tsung-Han Yu (S'10) received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2005 and 2007, respectively, and the Ph.D. degree from the Department of Electrical Engineering at the University of California, Los Angeles, CA, USA, in 2013.
His research is focused on digital integrated circuits and architectures for communication signal processing, such as wideband spectrum sensing in cognitive radios. In 2013, he joined Qualcomm, Inc. for digital modem designs.

Dejan Marković (S'96–M'06) received the Ph.D. degree at the University of California, Berkeley, CA, USA, in 2006, for which he was awarded the 2007 David J. Sakrison Memorial Prize.
He is a Professor of Electrical Engineering at the University of California, Los Angeles (UCLA), CA, USA. He is also affiliated with UCLA Bioengineering Department as a co-chair of the Neuroengineering field. His current research is focused on low-power embedded systems for basic neuroscience and clinical neurophysiology, with emphasis on memory and learning. His research also includes energy-efficient VLSI architectures, design with post-CMOS devices, optimization methods and CAD flows. He co-founded Flex Logix Technologies, a semiconductor IP startup, in 2014.
Dr. Marković received an NSF CAREER Award in 2009. In 2010, he was a co-recipient of the IEEE ISSCC Jack Raper Award for Outstanding Technology Directions.
