Abstract—For multicore processors with a private vector coprocessor (VP) per core, VP resources may not be highly utilized due to limited data-level parallelism (DLP) in applications. Also, under low VP utilization, static power dominates the total energy consumption. We enhance here our previously proposed VP sharing framework for multicores in order to increase VP utilization while reducing the static energy. We describe two power-gating (PG) techniques that dynamically control the VP's width based on utilization figures. Floating-point results on an FPGA prototype show that the PG techniques reduce the energy needs by 30-35 percent with negligible performance reduction, as compared to a multicore with the same amount of hardware resources where, however, each core is attached to a private VP.
1 INTRODUCTION
Fig. 1. A representative architecture containing two scalar cores and M vector lanes in a shared VP.
not fully utilize lane resources. Fine-grain Temporal Sharing (FTS) multiplexes spatially in each lane instructions from different vector threads in order to increase the utilization of all vector units. It resembles simultaneous multithreading for scalar codes. FTS achieves better performance and energy savings [10], [14].

Fig. 1 shows a representative dual-core for VP sharing. Each lane contains a subset of elements from the vector register file, an FPU and a memory load/store (LDST) unit. A vector controller (VC) receives instructions of two types from its attached core: instructions to move and process vector data (forwarded to lanes) and control instructions (forwarded to the Scheduler). Control instructions facilitate communications between cores and the scheduler for acquiring VP resources, getting the VP status and changing the vector length. The VC has a two-stage pipeline for instruction decoding, and hazard detection and vector register renaming, respectively. The VC broadcasts a vector instruction to lanes by pushing it along with vector element ranges into small instruction FIFOs.

Each lane in our proof-of-concept implementation has a vector flag register file (VFRF), and separate ALU and LDST instruction FIFOs for each VC. Logic in the ALU and LDST units arbitrates instruction execution. The LDST unit interfaces a memory crossbar (MC). LDST instructions use non-unit or indexed vector strides. Shuffle instructions interchange elements between lanes using patterns stored in vector registers. LDST instructions are executed and committed in order. ALU instructions may commit out of order. The ALU contains a write-back (WB) arbiter to resolve simultaneous stores. Vector register elements are distributed across lanes using low-order interleaving; the number of elements in a lane is configurable. Scratchpad memory is chosen to drastically reduce the energy and area requirements. Scratchpad memory can yield, with code optimization, much better performance than cache for regular memory accesses [25]. Also, cacheless vector processors often outperform cache-based systems [20]. Cores run (un)lock routines for exclusive DMA access. For a VP with M lanes, K vector elements per lane and a vector length VL for an application, up to K·M/VL vector registers are available. The scheduler makes lane assignment and configuration decisions.

We prototyped in VHDL a dual-core on a Xilinx Virtex-6 XC6VLX130T FPGA. The VP has M = 2, 4, 8, 16 or 32 lanes and an M-bank memory. The soft core is MicroBlaze, a 32-bit RISC by Xilinx that employs the Harvard architecture and uses the 32-bit fast simplex link (FSL) bus for coprocessors. The VP is scalable with the lane population, except for the MUXF8 FPGA primitive that relates to the crossbar. Without loss of generality, the crossbar could be replaced with an alternative interconnect for larger multicores. The LUT counts for a VP lane and the core are 1,100 and 750, respectively. The flip-flop counts are 3,642 and 1,050, respectively. For a TSMC 40 nm HP ASIC implementation, the lane and core gate counts are 453,000 and 212,000, respectively. These resource numbers further boost our claim that coprocessor sharing is critical.
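The element distribution and register-capacity bound above can be sketched in a few lines. This is an illustrative model, not the authors' RTL; the helper names are ours.

```python
# Illustrative sketch: low-order interleaving of vector register elements
# across M lanes, and the K*M/VL bound on available vector registers.

def lane_of(element_index: int, M: int) -> int:
    """Low-order interleaving: element i lives in lane i mod M."""
    return element_index % M

def local_slot(element_index: int, M: int) -> int:
    """Position of element i within its lane's element storage."""
    return element_index // M

def max_vector_registers(K: int, M: int, VL: int) -> int:
    """With K elements per lane and vector length VL, up to K*M/VL
    vector registers fit in the distributed register file."""
    return (K * M) // VL

# Example: M = 8 lanes, VL = 32 -> each register uses 4 slots per lane.
M, VL, K = 8, 32, 64
assert [lane_of(i, M) for i in range(10)] == [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
assert local_slot(9, M) == 1
assert max_vector_registers(K, M, VL) == 16  # 64*8/32
```

Low-order interleaving gives consecutive elements to consecutive lanes, so unit-stride accesses keep all lanes busy; the trade-off is that non-unit strides may concentrate traffic, which is why the LDST units and memory crossbar support strided and indexed accesses.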
808 IEEE TRANSACTIONS ON COMPUTERS, VOL. 64, NO. 3, MARCH 2015
TABLE 1
Benchmarks
TABLE 2
Dynamic Power Model Equations
measured in mW per percent of utilization (mW/percent), and are found by experimentation followed by linear approximation. They are shown in Table 3 along with their standard deviation and the mean absolute dynamic power estimation error of VP components. The estimation of dynamic power using unit utilizations results in a 13 percent confidence interval.

The total dynamic power dissipation for M lanes and L memory banks is:

P^D_TOTAL = 2 P_VC + M P_LANE + L P_MEM_BANK

For a vector kernel, the ratio U_LDST / U_ALU = a is assumed constant (it is known in advance or the number of memory accesses depends on some computed values). The total dynamic power is:

P^D_TOTAL = M U_ALU [ 2 K_VC (1 + a) / VL + K^CTRL + Σ_i K_exe,i w_i + K^DATA_ALU + a K^DATA_LDST + K_VRF (1 + 2a) + a K_MEM_BANK ]
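The component-based model above can be evaluated with a short sketch. The K coefficients are the mW-per-percent-utilization slopes of Table 3; the numbers below are placeholders of our own, NOT the paper's fitted values, and the per-component utilization scaling is our simplified reading.

```python
# Hedged sketch of the dynamic power model P^D_TOTAL = 2*P_VC + M*P_LANE
# + L*P_MEM_BANK. Coefficients (mW/percent) are placeholders, not the
# paper's Table 3 values; u_alu is ALU utilization in percent and
# a = U_LDST / U_ALU is the kernel's memory-to-ALU ratio.

def lane_dynamic_power(u_alu: float, a: float, K: dict) -> float:
    """One lane: ALU path at u_alu plus LDST path at a * u_alu."""
    return K["alu"] * u_alu + K["ldst"] * (a * u_alu)

def vp_dynamic_power(M: int, L: int, u_alu: float, a: float, K: dict) -> float:
    """Total dynamic power for M lanes and L memory banks."""
    p_vc = K["vc"] * u_alu * (1 + a)   # each VC sees ALU and LDST streams
    p_lane = lane_dynamic_power(u_alu, a, K)
    p_mem = K["mem"] * (a * u_alu)     # banks are driven by LDST traffic
    return 2 * p_vc + M * p_lane + L * p_mem

K = {"vc": 0.05, "alu": 0.8, "ldst": 0.5, "mem": 0.3}  # placeholder slopes
p = vp_dynamic_power(M=8, L=8, u_alu=60.0, a=0.5, K=K)
```

Note the structural point the paper exploits: dynamic power scales with M·U_ALU, so for a fixed amount of work the dynamic *energy* varies little with the lane count, while static energy grows with the number of powered lanes.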
TABLE 3
Mean Absolute Error for Dynamic Power Estimation
Fig. 2. Normalized energy consumption (bars, values on left axis) for a workload with 10 K floating-point operations and various kernels. Normalization is in reference to 2×16-VP; nu: no loop unrolling; u1: loop unrolled once. FTS is applied to two threads. The normalized speed-up (values on right axis) is shown as a line.
interrupt routines for power management run on a core or a power control unit (e.g., similar to the Intel i7 Nehalem). However, obtaining offline the combined utilization of VP units for all pairs of vector kernels is impractical since kernels may start executing randomly. Also, some kernels may not be known a priori.

6.3 Adaptive PG with Profiled Information (APGP)

For effective PG-driven energy minimization, we use embedded hardware profilers to dynamically monitor lane unit utilizations. The decision to adjust the population of active lanes based on a core's request/release involves specialized hardware. To find the optimal number of lanes that minimizes the energy consumption at runtime, we use the next theorem.

Theorem 1. If the total energy consumption of a kernel for the M-lane VP configuration is smaller than the total energy consumption for the N-lane configuration, then:

U^M_ALU / U^N_ALU > RTh_{M/N},  (9)

where RTh_{M/N} is a constant depending on M and N. Additionally, RTh_{M/N} < 1 for M > N.

Proof. From E^M_TOTAL < E^N_TOTAL, a conclusion in Section 5.2 (E^M_D ≈ E^N_D), and Equation (7):

U^M_ALU / U^N_ALU > [N (M P^LANE_ST + (L − M) P^LANE_OFF)] / [M (N P^LANE_ST + (L − N) P^LANE_OFF)],  (10)

where the right-hand term is RTh_{M/N}. For M > N, U^M_ALU ≤ U^N_ALU since a lane's ALU utilization decreases, or stays constant, when the number of lanes increases. Thus, RTh_{M/N} < 1. □

Ideal scalability implies U^M_ALU = U^N_ALU. To evaluate Equation (9) after a VP event, we profile unit utilizations for various VP configurations. We propose a dynamic process where the VP state is changed successively in the right direction (i.e., increasing or decreasing the number of active lanes) until optimality is reached. Since for most of our scenarios minimum energy is reached for M ∈ {4, 8, 16}, our runtime framework assumes four possible VP states with 0, 4, 8 and 16 active lanes. Fig. 3b shows hardware extensions for APGP. Each profiler, attached to a VC, monitors ALU and LDST utilizations for the kernel. It captures the average ALU utilization based on the instruction stream flowing through the VC
Fig. 3. Hardware for (a) DPGS and (b) APGP. In DPGS or APGP, software or the PG controller, respectively, configures the PG Register. VP Profiler:
aggregates utilizations from VCs. ST: Sleep Transistor (Header or Footer).
during a chosen time window, as per Equation (1). Simulations show that a window of 1,024 clock cycles gives highly accurate results.

The PG controller (PGC) aggregates utilizations from both threads (using the profilers) and implements the PGC state machine of Fig. 4. We use two types of thresholds: (i) the absolute threshold ATh_{M>N}, which is used if the ratio U^M_ALU / U^N_ALU is not available for the current kernel combination and M → N represents a transition from the M- to the N-lane VP, and (ii) the relative threshold RTh_{M/N} of Equation (9). RTh_{M/N} is used to compare utilizations with profiled M and N lanes. Absolute thresholds are chosen empirically such that, for a given ALU utilization, the probability that the current configuration will be kept is minimum if a configuration of lower energy consumption exists. Absolute thresholds enable the PGC to initiate a state transition if there is a probability that the current state is not optimal. For example, ATh_{8>16} is such that the probability P(U^8_ALU < ATh_{8>16} | 16 L is the minimum-energy state) ≈ 0. RTh_{M/N} < 1 for M > N, as per the theorem. Besides these thresholds, the PG controller contains the profiled utilization registers U^M_ALU, M ∈ {4, 8, 16} (one for each VP configuration).

As per APGP, after a VP request/release event that may change the utilization figures, and thus the optimal configuration, the utilization registers are reinitialized. Bit Vld in Fig. 4 shows if the ALU utilization register U^M for the M-lane VP contains an updated value (the ALU subscript in the utilization variable is dropped to simplify the discussion). If the VP is initially idle (0 L), the PGC will power up eight lanes to enter the 8 L state. We bypass the 4 L configuration since 8 L has the highest probability to be the optimal energy state for our scenarios. The VP uses data from at least one profile window in order to update utilization figures. If one of the inequalities based on the absolute threshold is met, the PGC initiates a transition to another state. After each
Fig. 4. PG controller state machine and PGC registers for state transitions under APGP. INT, PW and CFG are transitional VP (i.e., non-operating) states. 4 L, 8 L and 16 L are stable VP operating states that represent the 4-, 8- and 16-lane VP configurations. M L is a PGC state with M active lanes, M ∈ {0, 4, 8, 16}; INT is a PGC state where the PGC asserts an interrupt and waits for an Interrupt Acknowledge (INT_ACK); PW is a PGC state where some of the VP lanes are powered up/down; CFG is a PGC state where the Scheduler is reconfigured to a new VP state. Threshold registers are fixed during runs and utilization registers are updated for every profile window. The registers store 8-bit integers. The Vld bit is used to show whether the ALU utilization register U^M, with M = 4, 8 or 16, for the M-lane VP configuration contains an updated value.
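The PGC decision logic can be sketched in software. This is a hedged model, not the RTL: the closed form for RTh_{M/N} below is our reading of Equation (10), assuming per-lane static power P_ST, a gated-lane residual P_OFF, and L = 16 total lanes; choosing P_OFF ≈ 0.32·P_ST reproduces the RTh constants reported in Section 7, and the absolute thresholds are the Section 7 values.

```python
# Hedged sketch of the APGP decision step (Fig. 4), not the hardware. We
# assume RTh_{M/N} = N*(M*P_ST + (L-M)*P_OFF) / (M*(N*P_ST + (L-N)*P_OFF))
# per our reading of Equation (10); P_OFF ~= 0.32*P_ST matches the paper's
# reported constants RTh_{8/4} = 0.6739 and RTh_{16/8} = 0.7581.

L_TOTAL, P_ST, P_OFF = 16, 1.0, 0.319

def rth(M: int, N: int) -> float:
    """Relative threshold RTh_{M/N} of Equations (9)/(10)."""
    num = N * (M * P_ST + (L_TOTAL - M) * P_OFF)
    den = M * (N * P_ST + (L_TOTAL - N) * P_OFF)
    return num / den

# Absolute thresholds ATh_{M>N} from Section 7 (percent ALU utilization).
ATH = {(4, 8): 50.0, (8, 16): 60.0, (8, 4): 50.0, (16, 8): 72.0}

def next_state(m: int, u: dict) -> int:
    """One APGP step from stable state m (4, 8 or 16 active lanes).
    u maps lane count -> profiled ALU utilization, or None if not yet
    valid (the Vld bit of Fig. 4)."""
    up = {4: 8, 8: 16, 16: None}[m]
    down = {4: None, 8: 4, 16: 8}[m]
    if up is not None:
        if u.get(up) is not None:            # both utilizations profiled
            if u[up] / u[m] > rth(up, m):    # Eq. (9): wider VP saves energy
                return up
        elif u[m] > ATH[(m, up)]:            # fall back to absolute threshold
            return up
    if down is not None:
        if u.get(down) is not None:
            if u[m] / u[down] <= rth(m, down):   # current width unjustified
                return down
        elif u[m] < ATH[(m, down)]:
            return down
    return m  # keep the current configuration

assert abs(rth(8, 4) - 0.6739) < 1e-3 and abs(rth(16, 8) - 0.7581) < 1e-3
assert next_state(8, {8: 90.0, 16: None, 4: None}) == 16  # high util: grow
assert next_state(8, {8: 30.0, 16: None, 4: None}) == 4   # low util: shrink
```

In hardware each such step also traverses the INT, PW and CFG transitional states; the sketch only captures the threshold comparisons that pick the target stable state.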
BELDIANU AND ZIAVRAS: PERFORMANCE-ENERGY OPTIMIZATIONS FOR SHARED VECTOR ACCELERATORS IN MULTICORES 813
TABLE 5
Time and Energy Overheads for PGC State Transition
profile window, the utilization register for the current state is updated. A transition between two stable VP operating states involves the following steps and three transitional VP non-operating states:

1. INT state: Stop the Scheduler from acknowledging new VP acquire requests and send a hardware interrupt to core(s) that have acquired VP resources.
2. PW state: After ACKs from all cores, reconfigure the PG Sequencer for a new VP state.
3. CFG state: Reconfigure the Scheduler with the chosen number of active lanes and enable it to acknowledge new VP acquire requests.

In a new state, the utilization register is updated after a full profile window. If one of the inequalities is met, a transition occurs. Up to three transitions are allowed after a VP event in order to avoid repetitive transitions that increase time/energy overheads. The resources consumed by the profilers and the PGC are less than 1 percent of the VP's resources. As PGC events are rare, the PGC's dynamic power consumption is insignificant compared to that of the VP.

7 SIMULATION MODEL AND EXPERIMENTAL SETUP

Our simulator models vector-thread executions for various VP configurations. The model is based on performance and power figures gathered from RTL and netlist simulations, as described in Section 5. It contains information necessary to compute the execution time and energy consumption for any combination of kernels (ui, uj) running in any VP state. Each combination of kernels (ui, uj) is represented by the utilization(s) U^M(ui) and U^M(uj) and the total power P^M(ui, uj) for kernels running on the M-lane VP, for all values of M ∈ {4, 8, 16}.

The model accounts for all time and energy overheads due to state transitions, as shown in Table 5. Since our lane implementation is almost eight times larger in area than a floating-point multiply unit in [24], which is PGed in one clock cycle, we assume that a VP lane wakes up in eight clock cycles. One lane is powered up/down at a time by the PG Sequencer to avoid excessive currents in the power net. VP components that are not woken up or gated during state transitions consume static energy as usual.

For diverse workloads, we created benchmarks composed of random threads running on cores. Each thread has VP busy and idle periods. During idle periods the core is often busy either with memory transfers or scalar code. A thread busy period is denoted by a vector kernel ui and a workload expressed in a random number of floating-point operations; a thread idle period is denoted by a random number of VP clock cycles. Ten fundamental vector kernels were used to create execution scenarios. Two versions of each kernel in Section 4 were first produced with low and high ALU utilization, respectively. The kernel workload is uniformly distributed so that enough data exists in the vector memory for processing without the need for DMA. By adding an idle kernel, 55 unique kernel pairs were produced plus 10 scenarios with a single active kernel on a core. Based on Section 6.3, we get the values: ATh_{4>8} = 50%; ATh_{8>16} = 60%; ATh_{8>4} = 50%; ATh_{16>8} = 72%; RTh_{8/4} = 0.6739, and RTh_{16/8} = 0.7581.

8 EXPERIMENTAL RESULTS

Fig. 5 shows the breakdown of VP normalized execution time and energy consumption when the majority of kernels have low ALU utilization. The ratio of low to high utilization kernels in a thread is 4:1. The idle periods between consecutive kernels in a thread are uniformly distributed in the ranges: [1,000, 4,000], [5,000, 10,000] and [10,000, 30,000] clock cycles.

Our conclusions are: (i) FTS generally produces the lowest energy consumption. For either CTS or FTS, DPGS or APGP minimizes the energy consumption compared to scenarios without PG. (ii) Except for two scenarios, 2x(1cpu_8 L) and 2cpu_16 L_FTS, FTS with DPGS or APGP also minimizes the execution time. These PG schemes also yield 30-35 percent and 18-25 percent less energy as compared to 2x(1cpu_8 L) and 2cpu_16 L_FTS, respectively. (iii) Scenarios with two cores and a private per-core VP yield lower execution time than CTS because CTS does not sustain high utilization across all lane units. (iv) DPGS or APGP applied to CTS reduces the energy compared to 2x scenarios. As idle periods decrease, CTS becomes less effective; e.g., a 5 percent gain in consumption for DPGS-driven CTS comes with a slowdown of 70 percent compared to 2x(1cpu_4 L). Finally,
Fig. 5. Normalized execution time (a, c, e) and normalized energy consumption (b, d, f) where the majority of kernels in a thread have low ALU utilization, for various idle periods. The ratio of low to high utilization kernels in a thread is 4:1. E_st and E_dyn are the energy consumptions due to static and dynamic activities, respectively. 2x means two cores/CPUs of the type that follows in parentheses, such as (1cpu_4 L), which means one core having a private VP with four lanes. Whenever CTS or FTS appears, it implies two cores with VP sharing.
(v) time-energy overheads due to state transitions are negligible; they are not shown in Figs. 5, 6, and 7. The total time overhead is upper-bounded by 0.3 and 0.7 percent of the total execution time for DPGS and APGP, respectively. The total energy overhead is upper-bounded by 0.23 and 0.57 percent of the total energy consumption for DPGS and APGP, respectively.

Fig. 6 shows the normalized execution time and energy consumption for threads containing kernels with balanced ALU utilization figures; the ratio between low and high utilization kernels in a thread is 1:1. FTS under DPGS or APGP yields the minimum energy while the performance is better than FTS with eight fixed lanes.

Fig. 7 shows the normalized execution time and energy consumption for threads dominated by high ALU utilization kernels; the ratio between low and high utilization kernels is 1:4. As the number of thread kernels with high ALU utilization increases, the portion of time spent in the 16 L state increases for FTS under DPGS or APGP. The performance of the PG schemes is better than that of
Fig. 6. Normalized execution time (a, c, e) and normalized energy consumption (b, d, f) for threads with balanced utilization kernels, for various idle
periods. The ratio of low to high utilization kernels in a thread is 1:1.
a fixed VP with eight lanes, and approaches the performance of the 16 L FTS-driven configuration. As expected, the energy is reduced drastically with FTS and DPGS or APGP compared to all other scenarios.

9 CONCLUSIONS

We proposed two energy reduction techniques to dynamically control the width of shared VPs in multicores. We first introduced an energy estimation model based on theory and observations deduced from experimental results. Although we presented detailed results for an FPGA prototype, ASIC simulations show that this model is also valid for ASICs; only the values of some model coefficients, which depend on the chosen hardware platform anyway, must change. For given vector kernels, the VP's dynamic energy does not vary substantially with the number of vector lanes. Consequently, we proposed two PG techniques to dynamically control the number of lanes in order to minimize the VP's static energy. DPGS uses a priori information of lane utilizations to choose the optimal number of lanes. APGP uses embedded hardware utilization profilers for runtime decisions. To find each time the optimal number of lanes that minimizes the static energy, the VP state is
Fig. 7. Normalized execution time (a, c, e) and normalized energy consumption (b, d, f) for threads dominated by high utilization kernels, for various
idle periods. The ratio of low to high utilization kernels in a thread is 1:4.
changed to reach optimality for the given workload. Benchmarking shows that PG reduces the total energy by 30-35 percent while maintaining performance comparable to a multicore with the same amount of VP resources and per-core VPs.

REFERENCES

[1] T. Hiramoto and M. Takamiya, "Low Power and Low Voltage MOSFETs with Variable Threshold Voltage Controlled by Back-Bias," IEICE Trans. Electronics, vol. E83, no. 2, pp. 161-169, 2000.
[2] M. Woh et al., "Analyzing the Scalability of SIMD for the Next Generation Software Defined Radio," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 5388-5391, Mar./Apr. 2008.
[3] G. Chrysos, "Intel Xeon Phi Coprocessor," Proc. Hot Chips Symp., Aug. 2012.
[4] S. Ishihara, M. Hariyama, and M. Kameyama, "A Low-Power FPGA Based on Autonomous Fine-Grain Power Gating," IEEE Trans. Very Large Scale Integration Systems, vol. 19, no. 8, pp. 1394-1406, Aug. 2011.
[5] M. Keating, D. Flynn, R. Aitken, A. Gibsons, and K. Shi, Low Power Methodology Manual for System on Chip Design. Springer, 2008.
[6] H. Esmaeilzadeh, E. Blem, R.S. Amant, K. Sankaralingam, and D. Burger, "Dark Silicon and the End of Multicore Scaling," Proc. 38th Ann. Int'l Symp. Computer Architecture, pp. 365-376, 2011.
[7] C. Kozyrakis and D. Patterson, "Scalable Vector Processors for Embedded Systems," IEEE Micro, vol. 23, no. 6, pp. 36-45, Nov./Dec. 2003.
[8] Y. Lin et al., "SODA: A Low-Power Architecture for Software Radio," Proc. 33rd IEEE Ann. Int'l Symp. Computer Architecture, pp. 89-101, 2006.
[9] M. Woh et al., "AnySP: Anytime Anywhere Anyway Signal Processing," IEEE Micro, vol. 30, no. 1, pp. 81-91, Jan./Feb. 2010.
[10] S.F. Beldianu and S.G. Ziavras, "On-Chip Vector Coprocessor Sharing for Multicores," Proc. 19th Euromicro Int'l Conf. Parallel, Distributed and Network-Based Processing, pp. 431-438, Feb. 2011.
[11] M. Anis, S. Areibi, and M. Elmasry, "Design and Optimization of Multithreshold CMOS (MTCMOS) Circuits," IEEE Trans. Computer-Aided Design, vol. 22, no. 10, pp. 1324-1342, Oct. 2003.
[12] H. Yang and S.G. Ziavras, "FPGA-Based Vector Processor for Algebraic Equation Solvers," Proc. IEEE Int'l Systems-on-Chip Conf., pp. 115-116, 2005.
[13] J. Li and J.F. Martinez, "Power-Performance Considerations of Parallel Computing on Chip Multiprocessors," ACM Trans. Architecture and Code Optimization, vol. 2, pp. 397-422, Dec. 2005.
[14] S.F. Beldianu and S.G. Ziavras, "Multicore-Based Vector Coprocessor Sharing for Performance and Energy Gains," ACM Trans. Embedded Computing Systems, vol. 13, no. 2, article 17, Sept. 2013.
[15] J. Yu, C. Eagleston, C.H.-Y. Chou, M. Perreault, and G. Lemieux, "Vector Processing as a Soft Processor Accelerator," ACM Trans. Reconfigurable Technology and Systems, vol. 2, no. 2, pp. 1-34, June 2009.
[16] A. Bakhoda et al., "Analyzing CUDA Workloads Using a Detailed GPU Simulator," Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software, pp. 163-174, Apr. 2009.
[17] V.A. Korthikanti and G. Agha, "Towards Optimizing Energy Costs of Algorithms for Shared Memory Architectures," Proc. ACM Symp. Parallelism in Algorithms and Architectures, pp. 157-165, 2010.
[18] S. Powell and P. Chau, "Estimating Power Dissipation of VLSI Signal Processing Chips: The PFA Techniques," Proc. IEEE Workshop VLSI Signal Processing, pp. 250-259, 1990.
[19] S. Rivoire, R. Schultz, T. Okuda, and C. Kozyrakis, "Vector Lane Threading," Proc. Int'l Conf. Parallel Processing, pp. 55-64, Aug. 2006.
[20] L. Oliker et al., "Evaluation of Cache-Based Superscalar and Cacheless Vector Architectures for Scientific Computations," Proc. 18th Ann. Int'l Conf. Supercomputing, Nov. 2003.
[21] J. Leverich et al., "Power Management of Datacenter Workloads Using Per-Core Power Gating," IEEE Computer Architecture Letters, vol. 8, no. 2, pp. 48-51, July-Dec. 2009.
[22] Y. Wang and N. Ranganathan, "An Instruction-Level Energy Estimation and Optimization Methodology for GPU," Proc. IEEE 11th Int'l Conf. Computer and Information Technology, pp. 621-628, Aug./Sept. 2011.
[23] W. Wang and P. Mishra, "System-Wide Leakage-Aware Energy Minimization Using Dynamic Voltage Scaling and Cache Reconfiguration in Multitasking Systems," IEEE Trans. Very Large Scale Integration Systems, vol. 20, no. 5, pp. 902-910, May 2012.
[24] S. Roy, N. Ranganathan, and S. Katkoori, "A Framework for Power-Gating Functional Units in Embedded Microprocessors," IEEE Trans. Very Large Scale Integration Systems, vol. 17, no. 11, pp. 1640-1649, Nov. 2009.
[25] O. Avissar, R. Barua, and D. Stewart, "An Optimal Memory Allocation Scheme for Scratch-Pad-Based Embedded Systems," ACM Trans. Embedded Computing Systems, vol. 1, no. 1, pp. 6-26, 2002.
[26] "NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110: The Fastest, Most Efficient HPC Architecture Ever Built," White Paper, NVIDIA Corp., 2012.
[27] "Reducing System Power and Cost with Artix-7 FPGAs," Xilinx, 2013, http://www.xilinx.com/support/documentation/white_papers/wp423-Reducing-Sys-Power-Cost-28nm.pdf, 2014.
[28] "28 Nanometer Process Technology," http://www.nu-vista.com:8080/download/brochures/2011_28 Nanometer Process Technology.pdf, 2012.
[29] J. Fowers, G. Brown, P. Cooke, and G. Stitt, "A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications," Proc. 20th ACM/SIGDA Int'l Symp. FPGAs, pp. 47-56, Feb. 2012.
[30] P.-H. Wang, C.-L. Yang, Y.-M. Chen, and Y.-J. Cheng, "Power Gating Strategies on GPUs," ACM Trans. Architecture and Code Optimization, vol. 8, no. 3, article 3, Oct. 2011.
[31] Y. Wang, S. Roy, and N. Ranganathan, "Run-Time Power-Gating in Caches of GPUs for Leakage Energy Savings," Proc. Design, Automation and Test in Europe Conf. & Exhibition (DATE), pp. 300-303, Mar. 2012.

Spiridon F. Beldianu received the BS degree in electrical engineering and the MS degree in signal processing from Technical University, Iasi, Romania, in 2001 and 2002, respectively, and the PhD degree in computer engineering from the New Jersey Institute of Technology in 2012. His research interests are high-performance computing, on-chip vector coprocessor sharing, reconfigurable computing and scheduling for input-queued packet switches. He is a senior scientist at Broadcom, San Jose, California. He is a member of the IEEE.

Sotirios G. Ziavras received the diploma in electrical engineering from the National Technical University of Athens (NTUA), Greece, and the DSc degree in computer science from George Washington University (GWU). He is a professor of electrical and computer engineering, the director of the Computer Architecture and Parallel Processing Laboratory (CAPPL), and the associate provost for Graduate Studies at the New Jersey Institute of Technology (NJIT). He was with the Center for Automation Research at the University of Maryland, College Park from 1988 to 1989. He was a visiting professor of electrical and computer engineering at George Mason University in Spring 1990. He joined NJIT in Fall 1990 as an assistant professor. He has received several honors, such as an award from the Hellenic Republic for his academic performance at NTUA, an industry-funded Distinguished assistantship at GWU, the NJIT Excellence in Teaching Award for Graduate Education, and the Richard E. Merwin fellowship at GWU. He has published 170 papers, and did early work on chip multiprocessors embedded in FPGAs for data-intensive computing and energy-grid applications. His main research interests are multicore processors, reconfigurable computing, accelerators, parallel processing, and embedded computing. He is a senior member of the IEEE.