The feature size is decreasing every year (Moore's law): transistors become smaller, faster, consume less power, and more of them fit on a chip.
As feature sizes shrank, designers ran processors faster and faster by raising the clock speed. For many years the clock speed went up; over the last decade, however, it has remained roughly constant.
Why can we no longer increase the clock speed at the previous rate? Transistor sizes keep decreasing, but switching a billion transistors generates so much heat that we cannot keep the processor cool. What matters today is power.
A historical example: Intel's fourth-generation Pentium was designed for a high operating clock frequency of around 10 GHz, but researchers could run it at such speeds only for around 30 seconds to a minute because of the enormous power dissipation at that frequency. It was therefore restricted to around 3-4 GHz.
Figure 1.3: Three different Intel processors vary widely. Although the Itanium processor has two cores and the i7 four, only one core is used in the benchmarks. [1]
The main limitation in exploiting ILP often turned out to be the memory system. The result was
that these designs never came close to achieving the peak instruction throughput despite the large
transistor counts and extremely sophisticated and clever techniques.
1.4 HOW COMPUTATIONAL POWER CAN BE IMPROVED
We can solve large problems by breaking them into small pieces, expressed as kernels, and launching those pieces at the same time across many simple processors.
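In CUDA, for instance, such a piece is written as a kernel that thousands of lightweight threads execute in parallel. The sketch below is a minimal illustration of the idea (our example, not taken from the benchmarks used later in this report): each thread handles one array element.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread computes one element: the "small piece" of the problem.
    __global__ void square(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard the array bound
            data[i] = data[i] * data[i];
    }

    int main() {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        // Launch enough 256-thread blocks to cover all n elements at once.
        square<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }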
Modern GPUs:
Thousands of ALUs
Hundreds of processors
Tens of thousands of Concurrent threads
If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?
- Seymour Cray, father of the supercomputer
What kind of processors should we build? Why are CPUs not energy efficient?
A CPU devotes much of its die to complex control hardware, which buys flexibility and single-thread performance but is expensive in terms of power.
[Figure: A CPU is optimized to minimize latency (time per task, in seconds), while a GPU is optimized to maximize throughput (work per unit time, e.g., jobs/hour or pixels/sec).]
Figure 1.4: Intel Core i7-960, NVIDIA GTX 280, and GTX 480 specifications [1]
The rightmost columns of Figure 1.4 show the ratios of the GTX 280 and GTX 480 to the Core i7. For SP SIMD FLOPS on the GTX 280, the higher figure (933) comes from the very rare case of dual-issuing a fused multiply-add and a multiply; 622, for fused multiply-adds alone, is more representative.
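As a quick check of where these figures come from (assuming the GTX 280's published 240 stream processors and 1.296 GHz shader clock, numbers we supply rather than the report):

    240 SPs × 1.296 GHz × 2 FLOPs/cycle (one FMA)             ≈ 622 GFLOPS
    240 SPs × 1.296 GHz × 3 FLOPs/cycle (dual-issued FMA+MUL) ≈ 933 GFLOPS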
1.5 MODERN COMPUTING ERA
While the two techniques above have reached their limits, that is not yet the case for the graphics processing unit (GPU). Moore's law states that the density of transistors at a given die size doubles every 12 months; this has since slowed to every 18 months, loosely translated as processor performance doubling every 18 months. The growth in the number of transistors that can be put on a single chip has now nearly stalled, both because die sizes have reached their practical limits and because the power dissipated by the transistors constrains how many can be fabricated on a single die.
2. GPU ARCHITECTURE
2.1 INTRODUCTION
Before looking at how the GPU architecture is implemented, we first need to know what kinds of parallelism are present in the GPU. GPUs exploit virtually every type of parallelism that can be captured by the programming environment: multithreading, MIMD, SIMD, and even instruction-level parallelism.
2.2 MULTITHREADING
Multithreading allows multiple threads to share the functional units of a single processor
in an overlapping fashion. A general method to exploit thread-level parallelism (TLP) is with a
multiprocessor that has multiple independent threads operating at once and in parallel.
Multithreading, however, does not duplicate the entire processor as a multiprocessor does.
Instead, multithreading shares most of the processor core among a set of threads, duplicating
only private state, such as the registers and program counter. There are three main hardware
approaches to multithreading.
2.2.1 FINE-GRAINED MULTITHREADING
Fine-grained multithreading switches between threads on each clock, causing the
execution of instructions from multiple threads to be interleaved. This interleaving is often done
in a round-robin fashion, skipping any threads that are stalled at that time. One key advantage of
fine-grained multithreading is that it can hide the throughput losses that arise from both short and
long stalls, since instructions from other threads can be executed when one thread stalls, even if
the stall is only for a few cycles. The primary disadvantage of fine-grained multithreading is that
it slows down the execution of an individual thread, since a thread that is ready to execute
without stalls will be delayed by instructions from other threads. It trades an increase in
multithreaded throughput for a loss in the performance (as measured by latency) of a single
thread.
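A toy software model makes the round-robin policy concrete (a hypothetical sketch of the scheduling decision only, not how real hardware is built): each cycle, the scheduler scans from the thread after the last one that issued and picks the first thread that is not stalled.

    #include <cstdio>

    // Toy model: "thread" t is stalled until cycle ready_at[t].
    int main() {
        const int NTHREADS = 4;
        int ready_at[NTHREADS] = {0, 0, 3, 0};  // thread 2 stalls until cycle 3
        int last = NTHREADS - 1;                // last thread that issued

        for (int cycle = 0; cycle < 8; ++cycle) {
            // Round-robin: scan from the next thread, skipping stalled ones.
            for (int k = 1; k <= NTHREADS; ++k) {
                int t = (last + k) % NTHREADS;
                if (ready_at[t] <= cycle) {     // thread t is not stalled
                    printf("cycle %d: issue from thread %d\n", cycle, t);
                    last = t;
                    break;
                }
            }
        }
        return 0;
    }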
2.2.2 COARSE-GRAINED MULTITHREADING
Coarse-grained multithreading switches threads only on costly stalls, such as level-two or level-three cache misses. This relieves the need for thread switching to be essentially free, and it is much less likely to slow down the execution of any one thread, since instructions from other threads are issued only when a thread encounters a costly stall. It is, however, limited in its ability to overcome throughput losses, especially from shorter stalls. This limitation arises from the pipeline start-up costs of coarse-grained multithreading: because a CPU with coarse-grained multithreading issues instructions from a single thread, when a stall occurs the pipeline will see a bubble before the new thread begins executing. Because of this start-up overhead, coarse-grained multithreading is more useful for reducing the penalty of high-cost stalls, where pipeline refill time is negligible compared to the stall time.
Figure 2.1: How four different approaches use the functional-unit execution slots of a superscalar processor. The horizontal dimension represents the instruction execution capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding execution slot is unused in that clock cycle. The shades of gray and black correspond to four different threads in the multithreading processors. Black is also used to indicate the occupied issue slots in the case of the superscalar without multithreading support. The Sun T1 and T2 (aka Niagara) processors are fine-grained multithreaded processors, while the Intel Core i7 and IBM Power7 processors use SMT. The T2 has eight threads, the Power7 has four, and the Intel i7 has two. In all existing SMTs, instructions issue from only one thread at a time. The difference in SMT is that the subsequent decision to execute an instruction is decoupled and could execute the operations coming from several different instructions in the same clock cycle. [1]
The potential speedup from SIMD parallelism is twice that of MIMD parallelism, doubling every four years. Hence, it is at least as important to understand SIMD parallelism as MIMD parallelism, although the latter has received much more fanfare recently. For applications with both data-level parallelism and thread-level parallelism, the potential speedup in 2020 will be an order of magnitude higher than today.
2.4 MULTIPLE INSTRUCTION MULTIPLE DATA STREAMS
MIMD is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD architectures may be built around either shared memory or distributed memory.
In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended-precision operations [7].
[Figure 3.1: Example control-flow graph with per-block active masks: A = 11111111, B = 11000000, C = 00111111, D = 00110000, E = 00001111, F = 00111111, G = 11111111.]
The hardware reconvergence stack tracks the program counter (PC) associated with each control-flow path, which threads are active on each path (the path's active mask), and the PC at which a path reconverges with its predecessor in the control-flow graph (the RPC) [12]. The stack contains the information on the control flow of all threads within a warp. Figure 3.2 depicts the reconvergence stack and its operation on the example control flow shown in Figure 3.1. We describe this example in detail below.
When a warp first starts executing, the stack is initialized with a single entry: the PC points to the first instruction of the kernel (the first instruction of block A), the active mask is full, and the RPC (reconvergence PC) is set to the end of the kernel. When a warp executes a conditional branch, the predicate values for both the taken and not-taken paths (left and right paths) are computed. If control diverges, with some threads following the taken path and others the not-taken path, the stack is updated to include the newly formed paths (Figure 3.2(b)). First, the PC field of the current top of stack (TOS) is modified to the PC value of the reconvergence point, because when execution returns to this path it will be at the point where execution reconverges (the start of block G in the example). The RPC value is explicitly communicated from software and is computed with a straightforward compiler analysis [3]. Second, the PC of the right path (block C), the corresponding active mask, and the RPC (block G) are pushed onto the stack. Third, the information for the left path (block B) is similarly pushed onto the stack. Finally, execution moves to the left path, which is now at the TOS. Note that only a single path per warp, the one at the TOS, can be scheduled for execution. For this reason we refer to this baseline architecture as the single-path execution (SPE) model.
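The stack operations described above can be sketched in a few lines of host code (our illustration with made-up field names; the real structure is hardware state): divergence overwrites the parent entry's PC with the reconvergence point and pushes the two paths, and an entry is popped when its PC reaches its RPC.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct StackEntry {
        int     pc;    // next PC of this path
        uint8_t mask;  // active mask, one bit per thread in the warp
        int     rpc;   // PC at which this path reconverges with its parent
    };

    int main() {
        enum { A, B, C, D, E, F, G, END };   // block "PCs" from Figure 3.1
        std::vector<StackEntry> stack;
        stack.push_back({A, 0xFF, END});     // warp start: full mask

        // Divergent branch at the end of A: left -> B (11000000),
        // right -> C (00111111), both reconverging at G.
        stack.back().pc = G;                 // parent resumes at the RPC
        stack.push_back({C, 0x3F, G});       // right path
        stack.push_back({B, 0xC0, G});       // left path becomes the TOS

        // Only the TOS runs; pop when its PC reaches its RPC. (The nested
        // branch at C would push D and E with RPC F in exactly the same way.)
        while (!stack.empty()) {
            StackEntry &tos = stack.back();
            if (tos.pc == tos.rpc) { stack.pop_back(); continue; }
            printf("execute block %d, mask %02x\n", tos.pc, tos.mask);
            tos.pc = tos.rpc;                // toy: a path runs straight to its RPC
        }
        return 0;
    }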
When the current PC of a warp matches the RPC field value at that warp's TOS, the entry at the TOS is popped off (Figure 3.2(c)). At this point, the new TOS corresponds to the right path of the branch and the warp starts executing block C. As the warp encounters another divergent branch at the end of block C, the stack is once again updated with the left and right paths through blocks D and E (Figure 3.2(d)). Note how the stack elegantly handles the nested branch, and how the active masks for the paths through blocks D and E are each a subset of the active mask of block C. When both the left and right paths through blocks D and E finish execution and the corresponding stack entries are popped off, the TOS points to block F and control flow reconverges back to the path that started at block C (Figure 3.2(e)); the active mask is set correctly now that the nested branch has reconverged. Similarly, when block F finishes execution and the PC equals the reconvergence PC (block G), the stack is again popped and execution continues along a single path with a full active mask (Figure 3.2(f)).
[Figure 3.2: Operation of the single-path reconvergence stack (PC, active mask, RPC fields; TOS marked) on the control flow of Figure 3.1. (b) Entries for blocks B and C are pushed onto the stack when the divergent branch at the end of A executes; the RPC is updated to block G. (c) The stack entry corresponding to B at the TOS is popped off when its PC matches the RPC of G. Later panels show the nested divergence into D and E with RPC F, and the final panel (g) shows the resulting serialized execution timeline across SIMD lanes L0-L7.]
This example also points out the two main deficiencies of the SPE model. First, SIMD utilization decreases every time control flow diverges. SIMD utilization has been the focus of active research (e.g., [3, 16]) and we do not discuss it further here. Second, execution is serialized such that only a single path is followed until it completes and reconverges (Figure 3.2(g)). The SPE model works well for most applications because of the abundant parallelism exposed through multiple warps within cooperative thread arrays. For some applications, however, the restriction of following only a single path does degrade performance. Meng et al. proposed dynamic warp subdivision (DWS) [15], which selectively deviates from the reconvergence-stack execution model, to overcome this serialization issue.
3.4 LIMITATIONS OF THE PREVIOUS MODEL
As discussed in the previous subsection, SPE addresses only one aspect of the control-divergence problem while overlooking the other. SPE uses simple hardware and an elegant execution model to maximize SIMD utilization with structured control flow, but it always serializes execution, with only a single path schedulable at any given time. DWS [15] can interleave the scheduling of multiple paths and increase TLP, but it sacrifices SIMD lane utilization. Our proposed model, on the other hand, always matches the utilization and SIMD efficiency of the baseline SPE while still enhancing TLP in some cases. Our approach keeps the elegant reconvergence-stack model, and the hardware requires only small modifications to utilize up to two interleaved paths. Our technique touches only a small number of components within the GPU microarchitecture and requires no support from software. Specifically, the stack itself is enhanced to provide up to two concurrent paths for execution, the scoreboard is modified to track the dependencies of two concurrent paths and to correctly handle divergence and reconvergence, and the warp scheduler is extended to handle up to two schedulable objects per warp.
3.5 WARP SIZE IMPACT
Small warps, i.e., warps as wide as SIMD width, reduce the likelihood of branch divergence
occurrence. Reducing the branch divergence improves SIMD efficiency by increasing the
number of active lanes. At the same time, a small size warp reduces memory coalescing,
effectively increasing memory stalls. This can lead to redundant memory accesses and
increase pressure on the memory subsystem. Large warps, on the other hand, exploit
potentially existing memory access localities among neighbor threads and coalesce them to
a few off-core requests. On the negative side, bigger warp size can increase serialization and
the branch divergence impact.
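The coalescing effect is easy to see in CUDA. In the sketch below (our minimal example, not code from the study in [14]), the first kernel lets each warp touch 32 consecutive floats, which the hardware merges into a few wide transactions, while the second strides the accesses so that every lane falls in a different memory segment:

    #include <cuda_runtime.h>

    // Consecutive addresses across the warp: the hardware merges the 32
    // 4-byte loads into a few 128-byte transactions (coalesced).
    __global__ void coalesced(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // A stride of 32 floats puts every lane in a different 128-byte
    // segment: up to 32 separate transactions per warp (uncoalesced).
    __global__ void strided(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(i * 32) % n];
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        coalesced<<<n / 256, 256>>>(out, in, n);
        strided<<<n / 256, 256>>>(out, in, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }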
Insensitive workloads: Warp size affects performance in SIMT cores only for workloads that suffer from branch or memory divergence, or that show potential benefits from memory-access coalescing. Benchmarks lacking these characteristics are therefore insensitive to warp size.
Ideal coalescing and write accesses: The small-warp machine's coalescing rate is far higher than that of the other machines because of its ideal coalescing hardware, which merges many read accesses across warps. However, the ideal coalescing hardware captures only read accesses and does not compensate for uncoalesced writes. The small-warp machine may therefore still suffer from uncoalesced write accesses, which degrade its overall performance.
Practical issues with small warps: The pipeline front end includes the warp scheduler, the fetch engine, and the instruction decode and register read stages. Using fewer threads per warp affects the front end because it must run at a faster clock rate to deliver the same workload in the same time. An increased clock rate raises power dissipation in the front end and imposes bandwidth limitations on the fetch stage. Moreover, short warps impose extra area overhead, as the warp scheduler has to select from a larger number of warps. In this study we focus on how warp size impacts performance, leaving area and power evaluations to future work.
Register file: Warp size affects register file design and allocation. GPUs allocate all of a warp's registers in a single row. Such an allocation allows the read stage to fetch one operand for all threads of a warp by accessing a single register file row. For different warp sizes, the number of registers in a row (the row size) varies with the warp size to preserve this accessibility: the row must be wide enough for a large warp to read the operands of all its threads in a single row access, and narrower for small warps to prevent unnecessary reads.
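For instance, assuming 4-byte registers (an illustrative figure), a 32-thread warp needs a row of 32 × 4 B = 128 B so that one row access supplies an operand to every lane, while an 8-thread warp needs only a 32 B row; reading a 128 B row for an 8-thread warp would waste three quarters of each access.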
The figure shows that adding more warps does exacerbate L1 contention. This is a capacity limitation, not just an associativity problem: as shown in Figure 1b, time spent waiting on memory remains significant even with full associativity. These results are averages across the benchmarks described in Section 3.2, obtained with the configuration described in Section 3. Intra-warp latency tolerance hides latencies without requiring extra threads; however, it is only beneficial when threads within the same warp exhibit divergent behavior. Table 1 shows that many benchmarks exhibit frequent memory divergence. A further advantage of intra-warp latency tolerance is that the same mechanisms also improve throughput in the presence of branch divergence.
Handling divergence: A warp starts executing on one of the paths, for example the left path, with a full active mask. The PC is set to the first instruction in the kernel and the RPC to the last instruction (PathL in Figure 3.6(a)). The warp then executes in an identical way to the baseline single-path stack until a divergent branch executes. When the warp executes a divergent branch, we push a single entry onto the stack that represents both sides of the branch, rather than pushing two distinct entries as in the baseline SPE. The PC field of the block that diverged is set to the RPC of both the left and right paths (block G in Figure 3.6(b)), because this is the instruction that should execute when control returns to this path. Then the active mask and PC of PathL, as well as the same information for PathR, are pushed onto the stack along with their common RPC, and the TOS is updated (Figure 3.6(b)). Because it contains the information for both paths, the single TOS entry enables the warp scheduler to interleave the scheduling of active threads on both paths, as depicted in Figure 3.6(g). If both paths are active at the time of divergence, the one that diverges first (block C in Figure 3.6(b)) pushes an entry onto the stack and, in effect, suspends the other path (block B in Figure 3.6(c)) until control returns to this stack entry (Figure 3.6(e)). Note that the runtime information required to update the stack entries is exactly the same as in the baseline single-path stack model.
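A sketch of the widened entry (our field names; the paper [3] does not prescribe a layout) shows why the modifications are small: one entry now carries a PC and an active mask for each of the two paths plus their common RPC, and a path remains schedulable while it has active threads and has not yet reached the RPC.

    #include <cstdint>
    #include <cstdio>

    // Dual-path stack entry: one entry holds BOTH sides of a branch, so the
    // scheduler can interleave PathL and PathR instead of serializing them.
    struct DualPathEntry {
        int     pc_l, pc_r;      // next PC of the left and right paths
        uint8_t mask_l, mask_r;  // active masks of the two paths
        int     rpc;             // common reconvergence PC
    };

    // A path is schedulable while it has active threads and has not yet
    // reached the reconvergence point.
    inline bool schedulable_l(const DualPathEntry &e) {
        return e.mask_l != 0 && e.pc_l != e.rpc;
    }
    inline bool schedulable_r(const DualPathEntry &e) {
        return e.mask_r != 0 && e.pc_r != e.rpc;
    }

    int main() {
        enum { A, B, C, G };
        DualPathEntry e{B, C, 0xC0, 0x3F, G};  // divergence at the end of A
        printf("PathL schedulable: %d, PathR schedulable: %d\n",
               schedulable_l(e), schedulable_r(e));
        return 0;
    }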
[Figure 3.5: Exploiting the parallelism with our model, assuming the same control-flow graph as in Figure 3.1. Panels show the dual-path stack (PCL, MaskL, PCR, MaskR, RPC fields; TOS marked): the divergence at A yields PathL = B (mask 11000000) and PathR = C (mask 00111111) with RPC G; the nested divergence at C yields PathL = D (mask 00110000) and PathR = E (mask 00001111) with RPC F while the B entry is saved; the final panel shows the interleaved execution timeline across SIMD lanes L0-L7.]
The major stages in the front end include the instruction cache access and instruction buffering logic, the scoreboard and scheduling logic, and the SIMT stack.
4.3 Benchmarks
4.3.1 Histogram (Histo) [18]
The Parboil histogram benchmark is a straightforward histogramming operation that accumulates the number of occurrences of each output value in the input data set. The output histogram is a two-dimensional matrix of char-type bins that saturate at 255. The Parboil input sets, exemplary of a particular application setting in silicon wafer verification, define the optimizations appropriate for the benchmark. The dimensions of the histogram (256 wide by 8192 high) are very large, yet the input set follows a roughly Gaussian distribution centered in the output histogram. Recognizing this high concentration of contributions to the histogram's central region (referred to as the "eye"), the benchmark optimizations mainly focus on improving the throughput of contributions to this area. Prior to performing the histogramming, the optimized implementations for scratchpad memory run a kernel that determines the size of the eye by sampling the input data. Architectures with an implicit cache can forgo such analysis, since the hardware cache will automatically prioritize the heavily accessed region wherever it may be.
Overall, the histogram benchmark demonstrates the high cost of random atomic updates to a large data set; the global atomic update penalty can sometimes outweigh a fixed-factor cost of redundantly reading input data.
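The heart of such a histogram is one global atomic per input element. The minimal CUDA sketch below (ours; it omits the Parboil code's 255-saturation and all of its optimizations) makes the cost visible: contributions that land in the same hot bin of the eye serialize on the same address.

    #include <cuda_runtime.h>

    // Each thread reads one input byte and atomically bumps its bin.
    // Threads hitting the same bin serialize on that address.
    __global__ void histo(unsigned int *bins, const unsigned char *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(&bins[in[i]], 1u);
    }

    int main() {
        const int n = 1 << 20, nbins = 256;
        unsigned char *in; unsigned int *bins;
        cudaMalloc(&in, n);
        cudaMalloc(&bins, nbins * sizeof(unsigned int));
        cudaMemset(in, 0, n);                           // placeholder input
        cudaMemset(bins, 0, nbins * sizeof(unsigned int));
        histo<<<(n + 255) / 256, 256>>>(bins, in, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(bins);
        return 0;
    }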
4.3.2 Stencil [18]
The importance of numerically solving partial differential equations (PDEs), together with the computationally intensive nature of this class of application, has made PDE solvers an interesting candidate for accelerators. The benchmark includes a stencil code representing an iterative Jacobi solver of the heat equation on a 3-D structured grid, which can also serve as a building block for more advanced multigrid PDE solvers.
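The core update of such a solver is a 7-point stencil. A simplified single-iteration CUDA kernel (our sketch of the generic scheme, not the tuned benchmark code) is shown below; c0 and c1 stand for the discretization coefficients.

    #include <cuda_runtime.h>

    // One Jacobi step of the 7-point heat-equation stencil on an
    // nx x ny x nz grid: each interior point becomes a weighted sum of
    // itself and its six face neighbors.
    #define IDX(x, y, z) ((z) * ny * nx + (y) * nx + (x))

    __global__ void jacobi7(float *out, const float *in,
                            int nx, int ny, int nz, float c0, float c1) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z * blockDim.z + threadIdx.z;
        if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1 && z > 0 && z < nz - 1)
            out[IDX(x, y, z)] =
                c0 * in[IDX(x, y, z)] +
                c1 * (in[IDX(x - 1, y, z)] + in[IDX(x + 1, y, z)] +
                      in[IDX(x, y - 1, z)] + in[IDX(x, y + 1, z)] +
                      in[IDX(x, y, z - 1)] + in[IDX(x, y, z + 1)]);
    }

    int main() {
        const int nx = 128, ny = 128, nz = 128;
        size_t bytes = (size_t)nx * ny * nz * sizeof(float);
        float *a, *b;
        cudaMalloc(&a, bytes); cudaMalloc(&b, bytes);
        cudaMemset(a, 0, bytes);
        dim3 block(8, 8, 8), grid(nx / 8, ny / 8, nz / 8);
        jacobi7<<<grid, block>>>(b, a, nx, ny, nz, 0.5f, 0.5f / 6.0f);
        cudaDeviceSynchronize();
        cudaFree(a); cudaFree(b);
        return 0;
    }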
Simulated GPGPU-Sim configuration (GTX 480-like):

Number of SIMT cores: 15
Maximum threads per core: 1536
Warp size: 32
Registers per core: 32768
Shared memory per core: 48KB
Warp schedulers per core: 2
Warp scheduling policy: Round-Robin
L1 data cache (size/assoc/line): 16KB/4/128B
L2 cache (size/assoc/line): 768KB/8/256B
Number of memory channels: 6
Clocks (core:interconnect:L2:DRAM): 700:1400:700:924 MHz
DRAM scheduler: FR-FCFS
DRAM scheduler queue size: 100
Benchmark | Description/Area | Number of Kernels | Number of Instructions | % Memory Instructions
LUD | LU decomposition | 46 | 40M | 7.1665
LBM | Lattice-Boltzmann method simulation | 100 | 55936M | 0.772
LPS | 3D Laplace solver | 1 | 72M | 7.065
HEARTWALL | Shapes of heart walls over ultrasonic images | 5 | 35236M | 2.619
HISTO | Histogram operation | 80 | 2348M | 16.79
RAY | Ray tracing | 1 | 62M | 21.2
STENCIL | PDE solvers | 100 | 2775M | 8.954
Benchmark | IPC with Baseline Architecture | IPC with Enhanced Reconvergence Stack | % Increase in IPC
LUD | 20.59805 | 24.4674745 | 18.78539
LBM | 54.52345 | 66.28455 | 21.57072
LPS | 222.43355 | 252.359 | 13.45366
HEARTWALL | 209.9862 | 227.40495 | 8.295188
HISTO | 116.18875 | 125.523765 | 8.034354
RAY | 217.10065 | 231.50103 | 6.633043
STENCIL | 264.92875 | 279.685 | 5.569894
Average percentage increase in IPC: 11.7631
[Figure: IPC with the baseline architecture and IPC with the enhanced reconvergence stack, normalized, for each benchmark.]
Benchmark | Total Stalls (Baseline Architecture) | Total Stalls (Enhanced Reconvergence Stack) | % Decrease in Stalls
LUD | 9934860 | 8845544 | 10.965
LBM | 3256573992 | 2355437042 | 27.671
LPS | 1466837 | 1133817 | 22.703
HEARTWALL | 882978728 | 764754980 | 13.389
HISTO | 219281118 | 191740591 | 12.559
RAY | 2070150 | 1772489 | 14.379
STENCIL | 56491686 | 52808554 | 6.5198
Average percentage decrease in stalls: 15.455
The average decrease in the total number of stalls is approximately 15.5%. The maximum improvement, around 27.7%, is seen for the Lattice-Boltzmann method (LBM) benchmark.
Future scope: Dynamic warp formation could be employed to further improve SIMD utilization; we have not focused on it in this work.
5. REFERENCES
1. John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, 5th Edition.
2. W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In 40th International Symposium on Microarchitecture (MICRO-40), December 2007.
3. Minsoo Rhu and Mattan Erez. The Dual-Path Execution Model for Efficient GPU Control Flow. In 19th IEEE International Symposium on High-Performance Computer Architecture (HPCA-19), Shenzhen, China, February 2013.
4. A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2009), April 2009.
5. S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In IEEE International Symposium on Workload Characterization (IISWC-2009), October 2009.
6. IMPACT Research Group. The Parboil Benchmark Suite. http://www.crhc.uiuc.edu/IMPACT/parboil.php
7. NVIDIA Corporation. NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009.
8. NVIDIA Corporation. CUDA Toolkit, C/C++ SDK Code Samples.
9. Thomas Scott Crow. Evolution of the Graphical Processing Unit.
10. http://cpudb.stanford.edu/ - a database of processors built by Stanford University's VLSI Research Group.
11. https://www.udacity.com/course/cs344 - a lecture series by David Luebke of NVIDIA Research and John Owens of the University of California, Davis.
12. GPGPU-Sim. http://www.gpgpu-sim.org
13. Jonathan Palacios and Josh Triska. A Comparison of Modern GPU and CPU Architectures: And the Common Convergence of Both.
14. Ahmad Lashgar, Amirali Baniasadi, and Ahmad Khonsari. Warp Size Impact in GPUs: Large or Small?
15. J. Meng, D. Tarjan, and K. Skadron. Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. In 37th International Symposium on Computer Architecture (ISCA-37), 2010.
16. V. Narasiman, C. Lee, M. Shebanow, R. Miftakhutdinov, O. Mutlu, and Y. Patt. Improving GPU Performance via Large Warps and Two-Level Warp Scheduling. In 44th International Symposium on Microarchitecture (MICRO-44), December 2011.
17. Intel Corporation. Intel HD Graphics Open Source Programmer Reference Manual, June 2011.
18. John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwu. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. IMPACT Technical Report IMPACT-12-01.
19. Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator.