
FUSIONSIM:

A Cycle-Accurate CPU + GPU System Simulator


Vitaly Zakharenko, Andreas Moshovos (University of Toronto); Tor Aamodt (University of British Columbia). With support from AMD Canada, Ontario Centres of Excellence, and the Natural Sciences and Engineering Research Council of Canada.

FUSIONSIM: A CYCLE-ACCURATE CPU + GPU SYSTEM SIMULATOR

| 3 FusionSim: A Cycle Accurate CPU + GPU Simulator | June 13th, 2012

WHAT IS FUSIONSIM?

Detailed timing simulator of a complete system with an x86 CPU and a GPU
Models fused or discrete systems
FusionSim's features:
x86 out-of-order CPU + CUDA-capable GPU
CPU and GPU operate concurrently

Detailed timing models for all components


Models reflect modern hardware

Enables performance modeling:


Fused vs. discrete comparisons
What-if scenarios


AGENDA: TWO FLAVOURS OF FUSIONSIM


Structure & functionality of Discrete FusionSim
Models a discrete system:
Distinct CPU and GPU chips
Separate CPU and GPU DRAM

Structure & functionality of Fused FusionSim
Models a fused system:
CPU and GPU on the same chip
Shared CPU and GPU DRAM
Partly shared memory hierarchy


AGENDA: FUSION: WHICH BENCHMARK BENEFITS?


Analytical speed-up model
Greater speed-up for:
Small benchmark input data size data_TOTAL
Many kernel invocations (large cumulative latency overhead Λ_TOTAL)
High benchmark kernel throughput θ_KERNEL
Long time spent in the GPU code relative to the x86 code

G_GPU = 1 + (Λ_TOTAL · θ_KERNEL) / data_TOTAL + θ_KERNEL / θ_COPY

Simulation speed-up results on Rodinia
Range: 1.05x to 9.72x
A closer look at a fusion-friendly benchmark:
Large speed-up (up to 9.72x) for small problem sizes
Smaller speed-up (1.8x) for medium problem sizes
Dependence on latency overhead Λ_TOTAL and kernel throughput θ_KERNEL


AGENDA: FUSION: WHICH SYSTEM FACTORS AFFECT SPEED-UP?

Kernel spawn latency: from the GPU API kernel launch request until actual kernel execution
Simulation: an order-of-magnitude reduction is important
CPU/GPU memory coherence
Simulation: performance loss is minor
Less than 2% for most Rodinia benchmarks


DISCRETE FUSIONSIM: STRUCTURE

CPU from PTLsim: www.ptlsim.org
GPU from GPGPU-Sim: www.gpgpu-sim.org
CPU caches from MARSSx86: www.marss86.org


DISCRETE FUSIONSIM: COMPONENT FEATURES

CPU: PTLSIM
Fast x86 simulation: ~200 KIPS (isolated)
Out-of-order, micro-op based architecture
Cycle-accurate
Modular & detailed memory hierarchy model

GPU: GPGPU-SIM
OpenCL/CUDA capable (currently only CUDA in FusionSim)
High correlation vs. Nvidia GT200 and Fermi
Detailed NoC
Detailed & configurable DRAM


DISCRETE FUSIONSIM: START-UP AND MEMORY LAYOUT


Input: standard Linux CUDA benchmark executable
The benchmark's process is created:
Private stack
Private heap & heap management, invisible to the benchmark process
The simulator is injected into the virtual memory space by replacing the standard dynamic library
The simulator executes the benchmark's code:
x86 code on PTLsim
PTX code on GPGPU-Sim
The benchmark's process communicates with FusionSim via a single page accessible by both


DISCRETE FUSIONSIM: MAIN SIMULATION LOOP


Single simulation loop
Each loop iteration == one tick of a virtual common clock:
common clock frequency × GPU_MULTIPLIER = GPU_FREQ
common clock frequency × CPU_MULTIPLIER = CPU_FREQ

WHILE (1) {
    FOR GPU_MULTIPLIER ITERATIONS DO { GPU_CYCLE() }
    FOR CPU_MULTIPLIER ITERATIONS DO { CPU_CYCLE() }
}
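The loop above can be sketched in a few lines. This is an illustrative model of the interleaving only, not FusionSim's source; the function and variable names are assumptions.

```python
# Sketch of a common-clock simulation loop: each common-clock tick runs
# GPU_MULTIPLIER GPU cycles and CPU_MULTIPLIER CPU cycles, so each
# component's effective frequency is the common frequency times its
# multiplier. Names are illustrative, not from the FusionSim code base.

def run(common_ticks, gpu_multiplier, cpu_multiplier):
    """Interleave GPU and CPU cycles in lockstep with the common clock."""
    trace = []
    for _ in range(common_ticks):
        for _ in range(gpu_multiplier):
            trace.append("GPU")   # stands in for GPU_CYCLE()
        for _ in range(cpu_multiplier):
            trace.append("CPU")   # stands in for CPU_CYCLE()
    return trace

# e.g. a GPU at 2x and a CPU at 3x the common clock, for two ticks:
print(run(2, 2, 3))
```

Per tick, the GPU model advances `gpu_multiplier` cycles before the CPU model advances `cpu_multiplier` cycles, which keeps both clocks aligned at every common-clock boundary.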

DISCRETE FUSIONSIM: EXAMPLE GPU API CALL


The virtual PTLsim CPU executes x86 code until a call to the GPU API, e.g. cudaMemcpyAsync(a, b, c), is reached
On the next GPU cycle, FusionSim:
Identifies the pending API call
Enqueues the task for the GPU
Decides whether to block the CPU (synchronous call) or to let the CPU proceed (asynchronous call)
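A minimal sketch of that dispatch decision, assuming a simple task queue between the CPU and GPU models; the class and method names are illustrative, not FusionSim's real interface.

```python
# Sketch: an intercepted GPU API call is enqueued for the GPU model, and
# the CPU model is stalled only for synchronous calls. The GPU retires one
# task per cycle here, which is a deliberate simplification.
from collections import deque

class GpuTaskQueue:
    def __init__(self):
        self.tasks = deque()
        self.cpu_blocked = False

    def api_call(self, name, is_async):
        self.tasks.append(name)          # hand the task to the GPU model
        self.cpu_blocked = not is_async  # synchronous calls stall the CPU

    def gpu_cycle(self):
        if self.tasks:
            done = self.tasks.popleft()  # retire one queued task
            self.cpu_blocked = False     # completion unblocks a waiting CPU
            return done
        return None

q = GpuTaskQueue()
q.api_call("cudaMemcpy", is_async=False)   # synchronous: CPU must wait
print(q.cpu_blocked)                       # True
q.gpu_cycle()
print(q.cpu_blocked)                       # False
```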


DISCRETE FUSIONSIM: SIMULATOR FEATURES


Correctly models the ordering and overlap in time of asynchronous & synchronous operations:
Memory transfers
CUDA events
Kernel computations
CPU processing
Models the duration of all CUDA stream operations
Simple and powerful mechanism for managing configuration and simulation output files


FUSED FUSIONSIM: STRUCTURE


A Processing Cluster is replaced by a CPU
CUDA global memory address space is shared: no more memory transfers from/to device DRAM
Last Level Cache size is adjusted (increased): the GPU's L2 is also the CPU's L3
CPU: private L1 and L2 caches


FUSED FUSIONSIM: A CHALLENGE WITH EXISTING CPU + GPU MEMORY SPACES


CUDA global memory space:
Shared between CPU & GPU
Accessible by both using the same virtual address
Cached in the LLC and mapped to DRAM
CUDA local memory space:
Private to the GPU, inaccessible by the CPU
Cached in the LLC and mapped to DRAM

How do we model these?


FUSED FUSIONSIM: SIMULATING THE CPU AND GPU MEMORY SPACES


Common virtual memory, used by both the CPU and the GPU
Slightly different virtual address spaces:
Generic virtual address, used by the GPU
For the same location X, accessed by the CPU at virt_addr:
generic_virt_addr = virt_addr + 0x40000000

32-bit virtual address space (4 GBytes)
FusionSim does not simulate OS kernel code => the top-most 1 GByte of addresses is unused
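The mapping above is a fixed 1 GByte offset within the 32-bit address space; a small sketch with illustrative function names (the conversion helpers are assumptions, only the offset and address-space size come from the slide):

```python
# Sketch of the generic-address mapping: the GPU's generic virtual address
# for a location is the CPU's virtual address plus a fixed 1 GByte offset,
# inside a 32-bit (4 GByte) virtual address space.

OFFSET = 0x40000000          # 1 GByte shift between the two views
ADDR_SPACE = 1 << 32         # 32-bit virtual address space

def cpu_to_generic(virt_addr):
    generic = virt_addr + OFFSET
    assert generic < ADDR_SPACE, "address must stay within 32 bits"
    return generic

def generic_to_cpu(generic_addr):
    return generic_addr - OFFSET

a = cpu_to_generic(0x08048000)     # a typical Linux text-segment address
print(hex(a))                      # 0x48048000
print(hex(generic_to_cpu(a)))      # 0x8048000
```

This works without collisions because the top-most 1 GByte of the space is never used by the (unsimulated) OS kernel, leaving room for the shifted view.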


FUSED FUSIONSIM: MEMORY SPACE: WHERE AND WHAT


CPU: uses CPU virtual addresses
GPU: uses generic virtual addresses
Caches: physically addressed
The CPU adjusts its virtual address to a generic address and translates it to a physical address
The GPU translates generic addresses directly to physical addresses
MMU: the same MMU serves both the CPU and the GPU


FUSED FUSIONSIM MEMORY COHERENCE

Shared CUDA global address space
The same block from the global space can be:
Cached in the private CPU L1 cache
Cached in the private GPU L1 cache
Potential coherence problem
First-cut solution: flushing caches to the LLC
Interesting area for exploration


FUSED FUSIONSIM MEMORY COHERENCE: IMPLEMENTATION

CPU side: selective flushing of private caches
cudaSelectivelyFlush(address, size)
Called prior to every kernel invocation, for every region of memory accessed by the kernel

GPU side: GPGPU-Sim already flushes the caches
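The selective-flush idea can be illustrated with the range-to-cache-line computation it implies. cudaSelectivelyFlush's actual semantics are FusionSim-internal; the line size and all names below are assumptions for illustration only.

```python
# Sketch: before a kernel launch, every dirty cache line overlapping a
# memory region the kernel will touch is written back to the LLC.

LINE = 64  # assumed cache line size in bytes

def lines_to_flush(address, size):
    """Return the line-aligned addresses covering [address, address+size)."""
    first = address - (address % LINE)
    last = (address + size - 1) - ((address + size - 1) % LINE)
    return list(range(first, last + LINE, LINE))

def selectively_flush(dirty_lines, address, size):
    """Write back and drop the dirty lines that overlap the region."""
    flushed = [a for a in lines_to_flush(address, size) if a in dirty_lines]
    for a in flushed:
        dirty_lines.discard(a)      # model: line written back to the LLC
    return flushed

dirty = {0, 64, 256}
print(selectively_flush(dirty, 60, 8))   # region 60..67 spans lines 0 and 64
print(dirty)                             # only line 256 stays dirty
```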


FUSED FUSIONSIM CHANGES TO GPU API


FUSED VS DISCRETE: EXPERIMENTAL METHODOLOGY


Rodinia benchmark suite for heterogeneous computing
Discrete system modeled by Discrete FusionSim: unmodified Rodinia
Fused system modeled by Fused FusionSim: modified Rodinia:
No cudaMalloc()/cudaFree()
No memory transfers
Added cudaSelectivelyFlush()

Data input generation is excluded from the time measurements


FUSED VS DISCRETE: RELATIVE PERFORMANCE


FUSED is better

Rodinia benchmarks
Two baseline discrete systems:
10 usec kernel spawn latency
100 usec kernel spawn latency
Speed-up varies:
from 1.05x (nn, 10 usec)
up to 9.72x (gaus_4, 10 usec)


FUSED SYSTEM: KERNEL SPAWN LATENCY


FUSED is better

One baseline discrete system:
10 usec kernel spawn latency
Different fused systems:
0.1 usec kernel spawn latency
1 usec kernel spawn latency
10 usec kernel spawn latency
Simulations show:
Reduction of the latency to 1 usec is important
Further reduction below 1 usec is NOT important

FUSED SYSTEM: COHERENCE OVERHEAD


SMALLER is better

Two fused systems: incoherent vs. coherent
Kernel spawn latency is 0.1 usec in both systems
Simulations show: minor performance loss
Less than 2% for most benchmarks
5% for bfs_small


FUSION: WHICH BENCHMARK BENEFITS? ANALYTICAL MODEL


Semantics:
Λ_TOTAL: total cumulative latency
θ_KERNEL: kernel throughput
data_TOTAL: benchmark data input size

G_GPU = 1 + (Λ_TOTAL · θ_KERNEL) / data_TOTAL + θ_KERNEL / θ_COPY

Greater speed-up for:
Small benchmark input data size (small data_TOTAL)
Many kernel invocations and memory transfers (large Λ_TOTAL)
High benchmark kernel throughput (large θ_KERNEL)
Long time spent in the GPU code relative to the CPU code
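The model above can be exercised numerically. The function follows the slide's symbols; the input values are made-up illustrations, not measured FusionSim numbers.

```python
# The analytical speed-up model as a small calculator. lat_total is the
# cumulative latency Λ_TOTAL (seconds), theta_kernel and theta_copy are
# kernel and copy throughput (bytes/s), data_total is input size (bytes).

def g_gpu(lat_total, theta_kernel, theta_copy, data_total):
    """G_GPU = 1 + Λ_TOTAL·θ_KERNEL/data_TOTAL + θ_KERNEL/θ_COPY."""
    return 1.0 + lat_total * theta_kernel / data_total + theta_kernel / theta_copy

# A latency-bound benchmark: tiny input, many spawns -> large speed-up.
small = g_gpu(lat_total=1e-3, theta_kernel=1e9, theta_copy=5e9, data_total=1e6)
# A bandwidth-bound benchmark: huge input amortizes the latency term.
large = g_gpu(lat_total=1e-3, theta_kernel=1e9, theta_copy=5e9, data_total=1e9)
print(round(small, 2), round(large, 2))   # 2.2 1.2
```

Shrinking data_TOTAL by three orders of magnitude turns the latency term from negligible into the dominant contribution, matching the "small input, many invocations" intuition on this slide.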

FUSION: WHICH BENCHMARK BENEFITS? TWO SCENARIOS

G_GPU = 1 + (λ_CHUNK · θ_KERNEL(data_CHUNK)) / data_CHUNK + θ_KERNEL(data_CHUNK) / θ_COPY

Large data_CHUNK:
θ_KERNEL / θ_COPY is significant
λ_CHUNK · θ_KERNEL / data_CHUNK is insignificant

Small data_CHUNK:
θ_KERNEL / θ_COPY is insignificant
λ_CHUNK · θ_KERNEL / data_CHUNK is significant

FUSION: WHICH BENCHMARK BENEFITS? INPUT DATA SIZE

Rodinia BFS and Rodinia Gaussian (plots; input size increases to the right)

FUSED is better

Greater problem size => smaller benefit from fusion



FUSION: WHICH BENCHMARK BENEFITS? LATENCY OVERHEAD


Comparison between two benchmarks:
Rodinia Gaussian: speed-up 9.72x
Rodinia NN: speed-up 1.05x
Metric: normalized latency overhead Λ_TOTAL / data_TOTAL
Why?
100 times more kernel spawns for Gaussian
10 times more memory copies for Gaussian

G_GPU = 1 + (Λ_TOTAL · θ_KERNEL) / data_TOTAL + θ_KERNEL / θ_COPY
Λ_TOTAL / data_TOTAL = (n_KERNEL · λ_KS + n_MEM_COPY · λ_COPY) / data_TOTAL
(λ_KS: per-spawn latency; λ_COPY: per-transfer latency)

FUSION: WHICH BENCHMARK BENEFITS? KERNEL THROUGHPUT


Comparison between two benchmarks:
Rodinia BFS: speed-up 4.28x
Rodinia NN: speed-up 1.05x
Metric: kernel throughput θ_KERNEL
Why?
100 times greater kernel throughput θ_KERNEL for BFS

G_GPU = 1 + (Λ_TOTAL · θ_KERNEL) / data_TOTAL + θ_KERNEL / θ_COPY


FUSIONSIM WEBSITE: DOCUMENTATION & SOURCE CODE

www.fusionsim.ca
Discrete FusionSim & Fused FusionSim
Source code
Documentation
Google group for collaborators


DERIVATION OF THE ANALYTICAL SPEED-UP MODEL SYMBOL MEANINGS
data_KER: input data size per kernel invocation
θ_KER(data_KER): kernel throughput
n_KER: number of kernel invocations
λ_KS: kernel spawn latency
θ_COPY: memory-copy throughput
λ_TOT: total latency per iteration
Λ_TOTAL: cumulative latency over the whole benchmark
G_GPU, G_TOT: GPU-portion and total speed-up
%CPU, %GPU: fractions of time spent in CPU and GPU code


DERIVATION OF THE ANALYTICAL SPEED-UP MODEL PART 1

The kernel can be modeled as a channel of throughput θ_KER(data_KER), since the actual throughput varies with data_KER; as data_KER increases, θ_KER saturates. Most existing CUDA applications (including all the considered Rodinia benchmarks) exhibit the following computation pattern:

For i = 1 ... n_KER iterations do:
Copy the input data from the host to the device
Launch the kernel on the data
Copy the results from the device to the host

For such applications the GPU time is described by

t_GPU = n_KER · (λ_TOT + data_KER / θ_KER),

where θ_KER is the kernel data throughput and λ_TOT is the total latency per iteration, resulting from both the memory transfers and the kernel spawn. For CUDA applications that do not utilize multiple concurrent CUDA streams, the total latency per single computation iteration λ_TOT comprises the time spent transferring the data to and from the device plus the kernel spawn latency:

λ_TOT = data_KER · (1/θ_COPY_to + 1/θ_COPY_from) + λ_KS

The above expression holds true for all the considered Rodinia benchmarks.

DERIVATION OF THE ANALYTICAL SPEED-UP MODEL PART 2


Since on fused systems this latency reduces to λ'_TOT = λ_KS, the time t'_GPU of executing the CUDA code on the fused system is given by

t'_GPU = n_KER · (λ_KS + data_KER / θ_KER) ≈ n_KER · data_KER / θ_KER

The speed-up of the CUDA code is given by

G_GPU = t_GPU / t'_GPU = (λ_TOT · θ_KER) / data_KER + 1

Since data_KER = data_TOT / n_KER, we obtain

G_GPU = (λ_TOT · n_KER · θ_KER) / data_TOT + 1 = 1 + (Λ_TOTAL · θ_KERNEL) / data_TOTAL + θ_KERNEL / θ_COPY

Here Λ_TOTAL is the total latency accumulated during the benchmark execution, comprising all the kernel spawn and memory transfer latencies. Note also that the throughput θ_KER of a benchmark kernel increases with data_KER for small data_KER values and saturates to a constant for large data_KER values; it saturates when the input data size is sufficient for the maximum possible warp-scheduler occupancy for the given benchmark kernel. For benchmarks utilizing CUDA streams and overlapping kernel execution with data transfers, the latency is bounded from above:

λ_TOT ≤ data_KER · (1/θ_COPY_to + 1/θ_COPY_from) + λ_KS

This results in a smaller speed-up G_GPU for such benchmarks. Applying Amdahl's law, we get an expression for the total benchmark speed-up G_TOT:

G_TOT = 1 / (%CPU + %GPU / G_GPU)
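The final two formulas compose as follows; the numbers are illustrative, not FusionSim measurements, and the function names are mine.

```python
# Worked numeric check: compute G_GPU in its per-iteration form and fold
# it into Amdahl's law for the total speed-up G_TOT.

def g_gpu_per_iter(lam_tot, theta_ker, data_ker):
    """G_GPU = λ_TOT·θ_KER/data_KER + 1 (per-iteration form)."""
    return lam_tot * theta_ker / data_ker + 1.0

def g_tot(pct_cpu, pct_gpu, g):
    """Amdahl's law: G_TOT = 1 / (%CPU + %GPU / G_GPU)."""
    return 1.0 / (pct_cpu + pct_gpu / g)

g = g_gpu_per_iter(lam_tot=1e-4, theta_ker=1e9, data_ker=1e5)  # -> 2.0
print(g_tot(pct_cpu=0.2, pct_gpu=0.8, g=g))   # 1 / (0.2 + 0.4) ≈ 1.667
```

Even a 2x speed-up of the GPU portion yields only about 1.67x overall when 20% of the time is spent in CPU code, which is why the slides report total benchmark speed-ups rather than GPU-only figures.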


Disclaimer & Attribution


The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.

