The feature size is decreasing every year (Moore's law): transistors become smaller, faster, consume less power, and more of them fit on a chip.
As feature sizes shrank, designers ran processors faster and faster by raising the clock speed. For many years the clock speed went up; over the last decade, however, it has remained roughly constant.
Why can we no longer increase the clock speed at the previous rate? Transistor sizes keep decreasing, but switching a billion transistors generates so much heat that we cannot keep the processor cool. What matters today is power.
A historical example: Intel's fourth-generation Pentium was designed for a high operating clock frequency of around 10 GHz, but researchers could run it at such speeds only for around 30 seconds to a minute because of the enormous power dissipation at that frequency. It was therefore restricted to around 3-4 GHz.
Figure 1.3: Three different Intel processors vary widely. Although the Itanium processor has two cores and the i7 four, only one core is used in the benchmarks. [1]
The main limitation in exploiting ILP often turned out to be the memory system. The result was
that these designs never came close to achieving the peak instruction throughput despite the large
transistor counts and extremely sophisticated and clever techniques.
1.4 HOW COMPUTATIONAL POWER CAN BE IMPROVED
We can solve large problems by breaking them into small pieces, expressed as kernels, and launching those pieces at the same time across many simple processors.
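In CUDA, for instance, such a piece is written as a kernel that thousands of lightweight threads execute in parallel. The sketch below is a minimal illustration of the idea (our example, not taken from the benchmarks used later in this report): each thread handles one array element.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread computes one element: the "small piece" of the problem.
    __global__ void square(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard the array bound
            data[i] = data[i] * data[i];
    }

    int main() {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        // Launch enough 256-thread blocks to cover all n elements at once.
        square<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }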
Modern GPUs:
Thousands of ALUs
Hundreds of processors
Tens of thousands of Concurrent threads
If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?
- Seymour Cray, father of the supercomputer
What kind of processors should we build? Why are CPUs not energy efficient?
A CPU devotes much of its die to complex control hardware, which buys flexibility and single-thread performance but is expensive in terms of power.
[Figure: A CPU is optimized to minimize latency (time per task, in seconds), while a GPU is optimized to maximize throughput (work per unit time, e.g., jobs/hour or pixels/sec).]
Figure 1.4: Intel Core i7-960, NVIDIA GTX 280, and GTX 480 specifications [1]
The rightmost columns of Figure 1.4 show the ratios of the GTX 280 and GTX 480 to the Core i7. For SP SIMD FLOPS on the GTX 280, the higher figure (933) comes from the very rare case of dual-issuing a fused multiply-add and a multiply; 622, for fused multiply-adds alone, is more representative.
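As a quick check of where these figures come from (assuming the GTX 280's published 240 stream processors and 1.296 GHz shader clock, numbers we supply rather than the report):

    240 SPs × 1.296 GHz × 2 FLOPs/cycle (one FMA)             ≈ 622 GFLOPS
    240 SPs × 1.296 GHz × 3 FLOPs/cycle (dual-issued FMA+MUL) ≈ 933 GFLOPS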
1.5 MODERN COMPUTING ERA
While the two techniques above have reached their limits, that is not yet the case for the graphics processing unit (GPU). Moore's law states that the density of transistors at a given die size doubles every 12 months; this has since slowed to every 18 months, loosely translated as processor performance doubling every 18 months. The growth in the number of transistors that can be put on a single chip has now nearly stalled, both because die sizes have reached their practical limits and because the power dissipated by the transistors constrains how many can be fabricated on a single die.
2. GPU ARCHITECTURE
2.1 INTRODUCTION
Before looking at how the GPU architecture is implemented, we first need to know what kinds of parallelism are present in the GPU. GPUs exploit virtually every type of parallelism that can be captured by the programming environment: multithreading, MIMD, SIMD, and even instruction-level parallelism.
2.2 MULTITHREADING
Multithreading allows multiple threads to share the functional units of a single processor
in an overlapping fashion. A general method to exploit thread-level parallelism (TLP) is with a
multiprocessor that has multiple independent threads operating at once and in parallel.
Multithreading, however, does not duplicate the entire processor as a multiprocessor does.
Instead, multithreading shares most of the processor core among a set of threads, duplicating
only private state, such as the registers and program counter. There are three main hardware
approaches to multithreading.
2.2.1 FINE-GRAINED MULTITHREADING
Fine-grained multithreading switches between threads on each clock, causing the
execution of instructions from multiple threads to be interleaved. This interleaving is often done
in a round-robin fashion, skipping any threads that are stalled at that time. One key advantage of
fine-grained multithreading is that it can hide the throughput losses that arise from both short and
long stalls, since instructions from other threads can be executed when one thread stalls, even if
the stall is only for a few cycles. The primary disadvantage of fine-grained multithreading is that
it slows down the execution of an individual thread, since a thread that is ready to execute
without stalls will be delayed by instructions from other threads. It trades an increase in
multithreaded throughput for a loss in the performance (as measured by latency) of a single
thread.
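A toy software model makes the round-robin policy concrete (a hypothetical sketch of the scheduling decision only, not how real hardware is built): each cycle, the scheduler scans from the thread after the last one that issued and picks the first thread that is not stalled.

    #include <cstdio>

    // Toy model: "thread" t is stalled until cycle ready_at[t].
    int main() {
        const int NTHREADS = 4;
        int ready_at[NTHREADS] = {0, 0, 3, 0};  // thread 2 stalls until cycle 3
        int last = NTHREADS - 1;                // last thread that issued

        for (int cycle = 0; cycle < 8; ++cycle) {
            // Round-robin: scan from the next thread, skipping stalled ones.
            for (int k = 1; k <= NTHREADS; ++k) {
                int t = (last + k) % NTHREADS;
                if (ready_at[t] <= cycle) {     // thread t is not stalled
                    printf("cycle %d: issue from thread %d\n", cycle, t);
                    last = t;
                    break;
                }
            }
        }
        return 0;
    }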
2.2.2 COARSE-GRAINED MULTITHREADING
Coarse-grained multithreading switches threads only on costly stalls, such as level-two or level-three cache misses. This relieves the need for thread switching to be essentially free, and it is much less likely to slow down the execution of any one thread, since instructions from other threads are issued only when a thread encounters a costly stall. It is, however, limited in its ability to overcome throughput losses, especially from shorter stalls. This limitation arises from the pipeline start-up costs of coarse-grained multithreading: because a CPU with coarse-grained multithreading issues instructions from a single thread, when a stall occurs the pipeline will see a bubble before the new thread begins executing. Because of this start-up overhead, coarse-grained multithreading is more useful for reducing the penalty of high-cost stalls, where pipeline refill time is negligible compared to the stall time.
Figure 2.1: How four different approaches use the functional-unit execution slots of a superscalar processor. The horizontal dimension represents the instruction execution capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding execution slot is unused in that clock cycle. The shades of gray and black correspond to four different threads in the multithreading processors. Black is also used to indicate the occupied issue slots in the case of the superscalar without multithreading support. The Sun T1 and T2 (aka Niagara) processors are fine-grained multithreaded processors, while the Intel Core i7 and IBM Power7 processors use SMT. The T2 has eight threads, the Power7 has four, and the Intel i7 has two. In all existing SMTs, instructions issue from only one thread at a time. The difference in SMT is that the subsequent decision to execute an instruction is decoupled and could execute the operations coming from several different instructions in the same clock cycle. [1]
The potential speedup from SIMD parallelism is twice that of MIMD parallelism, doubling every four years. Hence, it is at least as important to understand SIMD parallelism as MIMD parallelism, although the latter has received much more fanfare recently. For applications with both data-level parallelism and thread-level parallelism, the potential speedup in 2020 will be an order of magnitude higher than today.
2.4 MULTIPLE INSTRUCTION MULTIPLE DATA STREAMS
MIMD is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD architectures may be built around either shared memory or distributed memory.
In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended-precision operations [7].
[Figure 3.1: Example control-flow graph with per-block active masks: A = 11111111, B = 11000000, C = 00111111, D = 00110000, E = 00001111, F = 00111111, G = 11111111.]
The hardware reconvergence stack tracks the program counter (PC) associated with each control-flow path, which threads are active on each path (the path's active mask), and the PC at which a path reconverges with its predecessor in the control-flow graph (the RPC) [12]. The stack contains the information on the control flow of all threads within a warp. Figure 3.2 depicts the reconvergence stack and its operation on the example control flow shown in Figure 3.1. We describe this example in detail below.
When a warp first starts executing, the stack is initialized with a single entry: the PC points to the first instruction of the kernel (the first instruction of block A), the active mask is full, and the RPC (reconvergence PC) is set to the end of the kernel. When a warp executes a conditional branch, the predicate values for both the taken and not-taken paths (left and right paths) are computed. If control diverges, with some threads following the taken path and others the not-taken path, the stack is updated to include the newly formed paths (Figure 3.2(b)). First, the PC field of the current top of stack (TOS) is modified to the PC value of the reconvergence point, because when execution returns to this path it will be at the point where execution reconverges (the start of block G in the example). The RPC value is explicitly communicated from software and is computed with a straightforward compiler analysis [3]. Second, the PC of the right path (block C), the corresponding active mask, and the RPC (block G) are pushed onto the stack. Third, the information for the left path (block B) is similarly pushed onto the stack. Finally, execution moves to the left path, which is now at the TOS. Note that only a single path per warp, the one at the TOS, can be scheduled for execution. For this reason we refer to this baseline architecture as the single-path execution (SPE) model.
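The stack operations described above can be sketched in a few lines of host code (our illustration with made-up field names; the real structure is hardware state): divergence overwrites the parent entry's PC with the reconvergence point and pushes the two paths, and an entry is popped when its PC reaches its RPC.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct StackEntry {
        int     pc;    // next PC of this path
        uint8_t mask;  // active mask, one bit per thread in the warp
        int     rpc;   // PC at which this path reconverges with its parent
    };

    int main() {
        enum { A, B, C, D, E, F, G, END };   // block "PCs" from Figure 3.1
        std::vector<StackEntry> stack;
        stack.push_back({A, 0xFF, END});     // warp start: full mask

        // Divergent branch at the end of A: left -> B (11000000),
        // right -> C (00111111), both reconverging at G.
        stack.back().pc = G;                 // parent resumes at the RPC
        stack.push_back({C, 0x3F, G});       // right path
        stack.push_back({B, 0xC0, G});       // left path becomes the TOS

        // Only the TOS runs; pop when its PC reaches its RPC. (The nested
        // branch at C would push D and E with RPC F in exactly the same way.)
        while (!stack.empty()) {
            StackEntry &tos = stack.back();
            if (tos.pc == tos.rpc) { stack.pop_back(); continue; }
            printf("execute block %d, mask %02x\n", tos.pc, tos.mask);
            tos.pc = tos.rpc;                // toy: a path runs straight to its RPC
        }
        return 0;
    }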
When the current PC of a warp matches the RPC field value at that warp's TOS, the entry at the TOS is popped off (Figure 3.2(c)). At this point, the new TOS corresponds to the right path of the branch and the warp starts executing block C. As the warp encounters another divergent branch at the end of block C, the stack is once again updated with the left and right paths through blocks D and E (Figure 3.2(d)). Note how the stack elegantly handles the nested branch, and how the active masks for the paths through blocks D and E are each a subset of the active mask of block C. When both the left and right paths through blocks D and E finish execution and the corresponding stack entries are popped off, the TOS points to block F and control flow reconverges back to the path that started at block C (Figure 3.2(e)); the active mask is set correctly now that the nested branch has reconverged. Similarly, when block F finishes execution and the PC equals the reconvergence PC (block G), the stack is again popped and execution continues along a single path with a full active mask (Figure 3.2(f)).
[Figure 3.2: Operation of the single-path reconvergence stack (PC, active mask, RPC fields; TOS marked) on the control flow of Figure 3.1. (b) Entries for blocks B and C are pushed onto the stack when the divergent branch at the end of A executes; the RPC is updated to block G. (c) The stack entry corresponding to B at the TOS is popped off when its PC matches the RPC of G. Later panels show the nested divergence into D and E with RPC F, and the final panel (g) shows the resulting serialized execution timeline across SIMD lanes L0-L7.]
This example also points out the two main deficiencies of the SPE model. First, SIMD utilization decreases every time control flow diverges. SIMD utilization has been the focus of active research (e.g., [3, 16]) and we do not discuss it further here. Second, execution is serialized such that only a single path is followed until it completes and reconverges (Figure 3.2(g)). The SPE model works well for most applications because of the abundant parallelism exposed through multiple warps within cooperative thread arrays. For some applications, however, the restriction of following only a single path does degrade performance. Meng et al. proposed dynamic warp subdivision (DWS) [15], which selectively deviates from the reconvergence-stack execution model, to overcome this serialization issue.
3.4 LIMITATIONS OF THE PREVIOUS MODEL
As discussed in the previous subsection, SPE addresses only one aspect of the control-divergence problem while overlooking the other. SPE uses simple hardware and an elegant execution model to maximize SIMD utilization with structured control flow, but it always serializes execution, with only a single path schedulable at any given time. DWS [15] can interleave the scheduling of multiple paths and increase TLP, but it sacrifices SIMD lane utilization. Our proposed model, on the other hand, always matches the utilization and SIMD efficiency of the baseline SPE while still enhancing TLP in some cases. Our approach keeps the elegant reconvergence-stack model, and the hardware requires only small modifications to utilize up to two interleaved paths. Our technique touches only a small number of components within the GPU microarchitecture and requires no support from software. Specifically, the stack itself is enhanced to provide up to two concurrent paths for execution, the scoreboard is modified to track the dependencies of two concurrent paths and to correctly handle divergence and reconvergence, and the warp scheduler is extended to handle up to two schedulable objects per warp.
3.5 WARP SIZE IMPACT
Small warps, i.e., warps as wide as SIMD width, reduce the likelihood of branch divergence
occurrence. Reducing the branch divergence improves SIMD efficiency by increasing the
number of active lanes. At the same time, a small size warp reduces memory coalescing,
effectively increasing memory stalls. This can lead to redundant memory accesses and
increase pressure on the memory subsystem. Large warps, on the other hand, exploit
potentially existing memory access localities among neighbor threads and coalesce them to
a few off-core requests. On the negative side, bigger warp size can increase serialization and
the branch divergence impact.
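The coalescing effect is easy to see in CUDA. In the sketch below (our minimal example, not code from the study in [14]), the first kernel lets each warp touch 32 consecutive floats, which the hardware merges into a few wide transactions, while the second strides the accesses so that every lane falls in a different memory segment:

    #include <cuda_runtime.h>

    // Consecutive addresses across the warp: the hardware merges the 32
    // 4-byte loads into a few 128-byte transactions (coalesced).
    __global__ void coalesced(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // A stride of 32 floats puts every lane in a different 128-byte
    // segment: up to 32 separate transactions per warp (uncoalesced).
    __global__ void strided(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(i * 32) % n];
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        coalesced<<<n / 256, 256>>>(out, in, n);
        strided<<<n / 256, 256>>>(out, in, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }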
Insensitive workloads: Warp size affects performance in SIMT cores only for workloads that suffer from branch or memory divergence, or that show potential benefits from memory-access coalescing. Benchmarks lacking these characteristics are therefore insensitive to warp size.
Ideal coalescing and write accesses: The small-warp machine's coalescing rate is far higher than that of the other machines because of its ideal coalescing hardware, which merges many read accesses across warps. However, the ideal coalescing hardware captures only read accesses and does not compensate for uncoalesced writes. The small-warp machine may therefore still suffer from uncoalesced write accesses, which degrade its overall performance.
Practical issues with small warps: The pipeline front end includes the warp scheduler, the fetch engine, and the instruction decode and register read stages. Using fewer threads per warp affects the front end because it must run at a faster clock rate to deliver the same workload in the same time. An increased clock rate raises power dissipation in the front end and imposes bandwidth limitations on the fetch stage. Moreover, short warps impose extra area overhead, as the warp scheduler has to select from a larger number of warps. In this study we focus on how warp size impacts performance, leaving area and power evaluations to future work.
Register file: Warp size affects register file design and allocation. GPUs allocate all of a warp's registers in a single row. Such an allocation allows the read stage to fetch one operand for all threads of a warp by accessing a single register file row. For different warp sizes, the number of registers in a row (the row size) varies with the warp size to preserve this accessibility: the row must be wide enough for a large warp to read the operands of all its threads in a single row access, and narrower for small warps to prevent unnecessary reads.
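For instance, assuming 4-byte registers (an illustrative figure), a 32-thread warp needs a row of 32 × 4 B = 128 B so that one row access supplies an operand to every lane, while an 8-thread warp needs only a 32 B row; reading a 128 B row for an 8-thread warp would waste three quarters of each access.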
The figure shows that adding more warps does exacerbate L1 contention. This is a capacity limitation, not just an associativity problem: as shown in Figure 1b, time spent waiting on memory remains significant even with full associativity. These results are averages across the benchmarks described in Section 3.2, obtained with the configuration described in Section 3. Intra-warp latency tolerance hides latencies without requiring extra threads; however, it is only beneficial when threads within the same warp exhibit divergent behavior. Table 1 shows that many benchmarks exhibit frequent memory divergence. A further advantage of intra-warp latency tolerance is that the same mechanisms also improve throughput in the presence of branch divergence.
Handling divergence: A warp starts executing on one of the paths, for example the left path, with a full active mask. The PC is set to the first instruction in the kernel and the RPC to the last instruction (PathL in Figure 3.6(a)). The warp then executes in an identical way to the baseline single-path stack until a divergent branch executes. When the warp executes a divergent branch, we push a single entry onto the stack that represents both sides of the branch, rather than pushing two distinct entries as in the baseline SPE. The PC field of the block that diverged is set to the RPC of both the left and right paths (block G in Figure 3.6(b)), because this is the instruction that should execute when control returns to this path. Then the active mask and PC of PathL, as well as the same information for PathR, are pushed onto the stack along with their common RPC, and the TOS is updated (Figure 3.6(b)). Because it contains the information for both paths, the single TOS entry enables the warp scheduler to interleave the scheduling of active threads on both paths, as depicted in Figure 3.6(g). If both paths are active at the time of divergence, the one that diverges first (block C in Figure 3.6(b)) pushes an entry onto the stack and, in effect, suspends the other path (block B in Figure 3.6(c)) until control returns to this stack entry (Figure 3.6(e)). Note that the runtime information required to update the stack entries is exactly the same as in the baseline single-path stack model.
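A sketch of the widened entry (our field names; the paper [3] does not prescribe a layout) shows why the modifications are small: one entry now carries a PC and an active mask for each of the two paths plus their common RPC, and a path remains schedulable while it has active threads and has not yet reached the RPC.

    #include <cstdint>
    #include <cstdio>

    // Dual-path stack entry: one entry holds BOTH sides of a branch, so the
    // scheduler can interleave PathL and PathR instead of serializing them.
    struct DualPathEntry {
        int     pc_l, pc_r;      // next PC of the left and right paths
        uint8_t mask_l, mask_r;  // active masks of the two paths
        int     rpc;             // common reconvergence PC
    };

    // A path is schedulable while it has active threads and has not yet
    // reached the reconvergence point.
    inline bool schedulable_l(const DualPathEntry &e) {
        return e.mask_l != 0 && e.pc_l != e.rpc;
    }
    inline bool schedulable_r(const DualPathEntry &e) {
        return e.mask_r != 0 && e.pc_r != e.rpc;
    }

    int main() {
        enum { A, B, C, G };
        DualPathEntry e{B, C, 0xC0, 0x3F, G};  // divergence at the end of A
        printf("PathL schedulable: %d, PathR schedulable: %d\n",
               schedulable_l(e), schedulable_r(e));
        return 0;
    }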
[Figure 3.5: Exploiting the parallelism with our model, assuming the same control-flow graph as in Figure 3.1. Panels show the dual-path stack (PCL, MaskL, PCR, MaskR, RPC fields; TOS marked): the divergence at A yields PathL = B (mask 11000000) and PathR = C (mask 00111111) with RPC G; the nested divergence at C yields PathL = D (mask 00110000) and PathR = E (mask 00001111) with RPC F while the B entry is saved; the final panel shows the interleaved execution timeline across SIMD lanes L0-L7.]
The major stages in the front end include the instruction cache access and instruction buffering logic, the scoreboard and scheduling logic, and the SIMT stack.
4.3 Benchmarks
4.3.1 Histogram (Histo) [18]
The Parboil histogram benchmark is a straightforward histogramming operation that accumulates the number of occurrences of each output value in the input data set. The output histogram is a two-dimensional matrix of char-type bins that saturate at 255. The Parboil input sets, exemplary of a particular application setting in silicon wafer verification, define the optimizations appropriate for the benchmark. The dimensions of the histogram (256 wide by 8192 high) are very large, yet the input set follows a roughly Gaussian distribution centered in the output histogram. Recognizing this high concentration of contributions to the histogram's central region (referred to as the "eye"), the benchmark optimizations mainly focus on improving the throughput of contributions to this area. Prior to performing the histogramming, the optimized implementations for scratchpad memory run a kernel that determines the size of the eye by sampling the input data. Architectures with an implicit cache can forgo such analysis, since the hardware cache will automatically prioritize the heavily accessed region wherever it may be.
Overall, the histogram benchmark demonstrates the high cost of random atomic updates to a large data set; the global atomic update penalty can sometimes outweigh a fixed-factor cost of redundantly reading input data.
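The heart of such a histogram is one global atomic per input element. The minimal CUDA sketch below (ours; it omits the Parboil code's 255-saturation and all of its optimizations) makes the cost visible: contributions that land in the same hot bin of the eye serialize on the same address.

    #include <cuda_runtime.h>

    // Each thread reads one input byte and atomically bumps its bin.
    // Threads hitting the same bin serialize on that address.
    __global__ void histo(unsigned int *bins, const unsigned char *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(&bins[in[i]], 1u);
    }

    int main() {
        const int n = 1 << 20, nbins = 256;
        unsigned char *in; unsigned int *bins;
        cudaMalloc(&in, n);
        cudaMalloc(&bins, nbins * sizeof(unsigned int));
        cudaMemset(in, 0, n);                           // placeholder input
        cudaMemset(bins, 0, nbins * sizeof(unsigned int));
        histo<<<(n + 255) / 256, 256>>>(bins, in, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(bins);
        return 0;
    }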
4.3.2 Stencil [18]
The importance of numerically solving partial differential equations (PDEs), together with the computationally intensive nature of this class of application, has made PDE solvers an interesting candidate for accelerators. The benchmark includes a stencil code representing an iterative Jacobi solver of the heat equation on a 3-D structured grid, which can also serve as a building block for more advanced multigrid PDE solvers.
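The core update of such a solver is a 7-point stencil. A simplified single-iteration CUDA kernel (our sketch of the generic scheme, not the tuned benchmark code) is shown below; c0 and c1 stand for the discretization coefficients.

    #include <cuda_runtime.h>

    // One Jacobi step of the 7-point heat-equation stencil on an
    // nx x ny x nz grid: each interior point becomes a weighted sum of
    // itself and its six face neighbors.
    #define IDX(x, y, z) ((z) * ny * nx + (y) * nx + (x))

    __global__ void jacobi7(float *out, const float *in,
                            int nx, int ny, int nz, float c0, float c1) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z * blockDim.z + threadIdx.z;
        if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1 && z > 0 && z < nz - 1)
            out[IDX(x, y, z)] =
                c0 * in[IDX(x, y, z)] +
                c1 * (in[IDX(x - 1, y, z)] + in[IDX(x + 1, y, z)] +
                      in[IDX(x, y - 1, z)] + in[IDX(x, y + 1, z)] +
                      in[IDX(x, y, z - 1)] + in[IDX(x, y, z + 1)]);
    }

    int main() {
        const int nx = 128, ny = 128, nz = 128;
        size_t bytes = (size_t)nx * ny * nz * sizeof(float);
        float *a, *b;
        cudaMalloc(&a, bytes); cudaMalloc(&b, bytes);
        cudaMemset(a, 0, bytes);
        dim3 block(8, 8, 8), grid(nx / 8, ny / 8, nz / 8);
        jacobi7<<<grid, block>>>(b, a, nx, ny, nz, 0.5f, 0.5f / 6.0f);
        cudaDeviceSynchronize();
        cudaFree(a); cudaFree(b);
        return 0;
    }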
Simulated GPGPU-Sim configuration (GTX 480-like):

Number of SIMT cores: 15
Maximum threads per core: 1536
Warp size: 32
Registers per core: 32768
Shared memory per core: 48KB
Warp schedulers per core: 2
Warp scheduling policy: Round-Robin
L1 data cache (size/assoc/line): 16KB/4/128B
L2 cache (size/assoc/line): 768KB/8/256B
Number of memory channels: 6
Clocks (core:interconnect:L2:DRAM): 700:1400:700:924 MHz
DRAM scheduler: FR-FCFS
DRAM scheduler queue size: 100
Benchmark | Description/Area | Number of Kernels | Number of Instructions | % Memory Instructions
LUD | LU decomposition | 46 | 40M | 7.1665
LBM | Lattice-Boltzmann method simulation | 100 | 55936M | 0.772
LPS | 3D Laplace solver | 1 | 72M | 7.065
HEARTWALL | Shapes of heart walls over ultrasonic images | 5 | 35236M | 2.619
HISTO | Histogram operation | 80 | 2348M | 16.79
RAY | Ray tracing | 1 | 62M | 21.2
STENCIL | PDE solvers | 100 | 2775M | 8.954
Benchmark | IPC with Baseline Architecture | IPC with Enhanced Reconvergence Stack | % Increase in IPC
LUD | 20.59805 | 24.4674745 | 18.78539
LBM | 54.52345 | 66.28455 | 21.57072
LPS | 222.43355 | 252.359 | 13.45366
HEARTWALL | 209.9862 | 227.40495 | 8.295188
HISTO | 116.18875 | 125.523765 | 8.034354
RAY | 217.10065 | 231.50103 | 6.633043
STENCIL | 264.92875 | 279.685 | 5.569894
Average percentage increase in IPC: 11.7631
[Figure: IPC with the baseline architecture and IPC with the enhanced reconvergence stack, normalized, for each benchmark.]
Benchmark | Total Stalls (Baseline Architecture) | Total Stalls (Enhanced Reconvergence Stack) | % Decrease in Stalls
LUD | 9934860 | 8845544 | 10.965
LBM | 3256573992 | 2355437042 | 27.671
LPS | 1466837 | 1133817 | 22.703
HEARTWALL | 882978728 | 764754980 | 13.389
HISTO | 219281118 | 191740591 | 12.559
RAY | 2070150 | 1772489 | 14.379
STENCIL | 56491686 | 52808554 | 6.5198
Average percentage decrease in stalls: 15.455
The average decrease in the total number of stalls is approximately 15.5%. The maximum improvement, around 27.7%, is seen for the Lattice-Boltzmann method (LBM) benchmark.
Future scope: Dynamic warp formation could be employed to further improve SIMD utilization; we have not focused on it in this work.
5. REFERENCES
1. John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, 5th Edition.
2. W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In 40th International Symposium on Microarchitecture (MICRO-40), December 2007.
3. Minsoo Rhu and Mattan Erez. The Dual-Path Execution Model for Efficient GPU Control Flow. In 19th IEEE International Symposium on High-Performance Computer Architecture (HPCA-19), Shenzhen, China, February 2013.
4. A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2009), April 2009.
5. S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In IEEE International Symposium on Workload Characterization (IISWC-2009), October 2009.
6. IMPACT Research Group. The Parboil Benchmark Suite. http://www.crhc.uiuc.edu/IMPACT/parboil.php
7. NVIDIA Corporation. NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009.
8. NVIDIA Corporation. CUDA Toolkit, C/C++ SDK Code Samples.
9. Thomas Scott Crow. Evolution of the Graphical Processing Unit.
10. http://cpudb.stanford.edu/ - a database of processors built by Stanford University's VLSI Research Group.
11. https://www.udacity.com/course/cs344 - a lecture series by David Luebke of NVIDIA Research and John Owens of the University of California, Davis.
12. GPGPU-Sim. http://www.gpgpu-sim.org
13. Jonathan Palacios and Josh Triska. A Comparison of Modern GPU and CPU Architectures: And the Common Convergence of Both.
14. Ahmad Lashgar, Amirali Baniasadi, and Ahmad Khonsari. Warp Size Impact in GPUs: Large or Small?
15. J. Meng, D. Tarjan, and K. Skadron. Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. In 37th International Symposium on Computer Architecture (ISCA-37), 2010.
16. V. Narasiman, C. Lee, M. Shebanow, R. Miftakhutdinov, O. Mutlu, and Y. Patt. Improving GPU Performance via Large Warps and Two-Level Warp Scheduling. In 44th International Symposium on Microarchitecture (MICRO-44), December 2011.
17. Intel Corporation. Intel HD Graphics Open Source Programmer Reference Manual, June 2011.
18. John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. IMPACT Technical Report IMPACT-12-01.
19. Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator.