
FUSIONSIM:

A Cycle-Accurate CPU + GPU System Simulator


Vitaly Zakharenko, Andreas Moshovos (University of Toronto); Tor Aamodt (University of British Columbia). With support from AMD Canada, Ontario Centres of Excellence, and the Natural Sciences and Engineering Research Council of Canada.

FUSIONSIM: A CYCLE-ACCURATE CPU + GPU SYSTEM SIMULATOR

| 3 FusionSim: A Cycle Accurate CPU + GPU Simulator | June 13th, 2012

WHAT IS FUSIONSIM?

Detailed timing simulator of a complete system with an x86 CPU and a GPU
Models fused or discrete systems
FusionSim's features:
x86 out-of-order CPU + CUDA-capable GPU
CPU and GPU operate concurrently

Detailed timing models for all components


Models reflect modern hardware

Enables performance modeling:


Fused vs. discrete comparisons
What-if scenarios


AGENDA: TWO FLAVOURS OF FUSIONSIM


Structure & functionality of Discrete FusionSim
Models a discrete system:
Distinct CPU and GPU chips
Separate CPU and GPU DRAM

Structure & functionality of Fused FusionSim
Models a fused system:
CPU and GPU on the same chip
Shared CPU and GPU DRAM
Partly shared memory hierarchy


AGENDA: FUSION: WHICH BENCHMARK BENEFITS?


Analytical speed-up model
Greater speed-up for:
Small benchmark input data size data_TOTAL
Many kernel invocations (large cumulative latency overhead Λ_TOTAL)
High benchmark kernel throughput θ_KERNEL
Long time spent in the GPU code relative to the x86 code

G_GPU = 1 + (Λ_TOTAL · θ_KERNEL) / data_TOTAL + θ_KERNEL / θ_COPY

Simulation speed-up results on Rodinia
Range: 1.05x to 9.72x
A closer look at a fusion-friendly benchmark:
Large speed-up (up to 9.72x) for small problem sizes
Smaller speed-up (1.8x) for medium problem sizes
Dependence on latency overhead Λ_TOTAL and kernel throughput θ_KERNEL


AGENDA: FUSION: WHICH SYSTEM FACTORS AFFECT SPEED-UP?

Kernel spawn latency: from the GPU API kernel launch request until actual kernel execution
Simulation: an order-of-magnitude reduction is important
CPU/GPU memory coherence
Simulation: performance loss is minor
Less than 2% for most Rodinia benchmarks


DISCRETE FUSIONSIM: STRUCTURE

CPU from PTLsim: www.ptlsim.org
GPU from GPGPU-Sim: www.gpgpu-sim.org
CPU caches from MARSSx86: www.marss86.org


DISCRETE FUSIONSIM: COMPONENT FEATURES

CPU: PTLSIM
Fast x86 simulation: ~200 KIPS (isolated)
Out-of-order, micro-op based architecture
Cycle-accurate
Modular & detailed memory hierarchy model

GPU: GPGPU-SIM
OpenCL/CUDA capable (currently only CUDA in FusionSim)
High correlation vs. Nvidia GT200 and Fermi
Detailed NoC
Detailed & configurable DRAM


DISCRETE FUSIONSIM: START-UP AND MEMORY LAYOUT


Input: standard Linux CUDA benchmark executable
The benchmark's process is created:
Private stack
Private heap & heap management, invisible to the benchmark process
The simulator is injected into the virtual memory space by replacing the standard dynamic library
The simulator executes the benchmark's code:
x86 code on PTLsim
PTX code on GPGPU-Sim
The benchmark's process communicates with FusionSim via a single page accessible by both


DISCRETE FUSIONSIM: MAIN SIMULATION LOOP


Single simulation loop
Each loop iteration == one tick of a virtual common clock:
common clock frequency × GPU_MULTIPLIER = GPU_FREQ
common clock frequency × CPU_MULTIPLIER = CPU_FREQ

WHILE (1) {
    FOR GPU_MULTIPLIER ITERATIONS DO { GPU_CYCLE() }
    FOR CPU_MULTIPLIER ITERATIONS DO { CPU_CYCLE() }
}
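The loop above can be sketched in a few lines. This is an illustrative model of the interleaving only, not FusionSim's source; the function and variable names are assumptions.

```python
# Sketch of a common-clock simulation loop: each common-clock tick runs
# GPU_MULTIPLIER GPU cycles and CPU_MULTIPLIER CPU cycles, so each
# component's effective frequency is the common frequency times its
# multiplier. Names are illustrative, not from the FusionSim code base.

def run(common_ticks, gpu_multiplier, cpu_multiplier):
    """Interleave GPU and CPU cycles in lockstep with the common clock."""
    trace = []
    for _ in range(common_ticks):
        for _ in range(gpu_multiplier):
            trace.append("GPU")   # stands in for GPU_CYCLE()
        for _ in range(cpu_multiplier):
            trace.append("CPU")   # stands in for CPU_CYCLE()
    return trace

# e.g. a GPU at 2x and a CPU at 3x the common clock, for two ticks:
print(run(2, 2, 3))
```

Per tick, the GPU model advances `gpu_multiplier` cycles before the CPU model advances `cpu_multiplier` cycles, which keeps both clocks aligned at every common-clock boundary.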

DISCRETE FUSIONSIM: EXAMPLE GPU API CALL


The virtual PTLsim CPU executes x86 code until a call to the GPU API, e.g. cudaMemcpyAsync(a, b, c), is reached
On the next GPU cycle, FusionSim:
Identifies the pending API call
Enqueues the task for the GPU
Decides whether to block the CPU (synchronous call) or to let the CPU proceed (asynchronous call)
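A minimal sketch of that dispatch decision, assuming a simple task queue between the CPU and GPU models; the class and method names are illustrative, not FusionSim's real interface.

```python
# Sketch: an intercepted GPU API call is enqueued for the GPU model, and
# the CPU model is stalled only for synchronous calls. The GPU retires one
# task per cycle here, which is a deliberate simplification.
from collections import deque

class GpuTaskQueue:
    def __init__(self):
        self.tasks = deque()
        self.cpu_blocked = False

    def api_call(self, name, is_async):
        self.tasks.append(name)          # hand the task to the GPU model
        self.cpu_blocked = not is_async  # synchronous calls stall the CPU

    def gpu_cycle(self):
        if self.tasks:
            done = self.tasks.popleft()  # retire one queued task
            self.cpu_blocked = False     # completion unblocks a waiting CPU
            return done
        return None

q = GpuTaskQueue()
q.api_call("cudaMemcpy", is_async=False)   # synchronous: CPU must wait
print(q.cpu_blocked)                       # True
q.gpu_cycle()
print(q.cpu_blocked)                       # False
```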


DISCRETE FUSIONSIM: SIMULATOR FEATURES


Correctly models the ordering and overlap in time of asynchronous & synchronous operations:
Memory transfers
CUDA events
Kernel computations
CPU processing
Models the duration of all CUDA stream operations
Simple and powerful mechanism for managing configuration and simulation output files


FUSED FUSIONSIM: STRUCTURE


A Processing Cluster is replaced by a CPU
CUDA global memory address space is shared: no more memory transfers from/to device DRAM
Last Level Cache size is adjusted (increased): the GPU's L2 is also the CPU's L3
CPU: private L1 and L2 caches


FUSED FUSIONSIM: A CHALLENGE WITH EXISTING CPU + GPU MEMORY SPACES


CUDA global memory space:
Shared between CPU & GPU
Accessible by both using the same virtual address
Cached in the LLC and mapped to DRAM
CUDA local memory space:
Private to the GPU, inaccessible by the CPU
Cached in the LLC and mapped to DRAM

How do we model these?


FUSED FUSIONSIM: SIMULATING THE CPU AND GPU MEMORY SPACES


Common virtual memory, used by both the CPU and the GPU
Slightly different virtual address spaces:
Generic virtual address, used by the GPU
For the same location X, accessed by the CPU at virt_addr:
generic_virt_addr = virt_addr + 0x40000000

32-bit virtual address space (4 GBytes)
FusionSim does not simulate OS kernel code => the top-most 1 GByte of addresses is unused
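The mapping above is a fixed 1 GByte offset within the 32-bit address space; a small sketch with illustrative function names (the conversion helpers are assumptions, only the offset and address-space size come from the slide):

```python
# Sketch of the generic-address mapping: the GPU's generic virtual address
# for a location is the CPU's virtual address plus a fixed 1 GByte offset,
# inside a 32-bit (4 GByte) virtual address space.

OFFSET = 0x40000000          # 1 GByte shift between the two views
ADDR_SPACE = 1 << 32         # 32-bit virtual address space

def cpu_to_generic(virt_addr):
    generic = virt_addr + OFFSET
    assert generic < ADDR_SPACE, "address must stay within 32 bits"
    return generic

def generic_to_cpu(generic_addr):
    return generic_addr - OFFSET

a = cpu_to_generic(0x08048000)     # a typical Linux text-segment address
print(hex(a))                      # 0x48048000
print(hex(generic_to_cpu(a)))      # 0x8048000
```

This works without collisions because the top-most 1 GByte of the space is never used by the (unsimulated) OS kernel, leaving room for the shifted view.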


FUSED FUSIONSIM: MEMORY SPACE: WHERE AND WHAT


CPU: uses CPU virtual addresses
GPU: uses generic virtual addresses
Caches: physically addressed
The CPU adjusts its virtual address to a generic address and translates it to a physical address
The GPU translates generic addresses directly to physical addresses
MMU: the same MMU serves both the CPU and the GPU


FUSED FUSIONSIM MEMORY COHERENCE

Shared CUDA global address space
The same block from the global space can be:
Cached in the private CPU L1 cache
Cached in the private GPU L1 cache
Potential coherence problem
First-cut solution: flushing caches to the LLC
Interesting area for exploration


FUSED FUSIONSIM MEMORY COHERENCE: IMPLEMENTATION

CPU side: selective flushing of private caches
cudaSelectivelyFlush(address, size)
Called prior to every kernel invocation, for every region of memory accessed by the kernel

GPU side: GPGPU-Sim already flushes the caches
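The selective-flush idea can be illustrated with the range-to-cache-line computation it implies. cudaSelectivelyFlush's actual semantics are FusionSim-internal; the line size and all names below are assumptions for illustration only.

```python
# Sketch: before a kernel launch, every dirty cache line overlapping a
# memory region the kernel will touch is written back to the LLC.

LINE = 64  # assumed cache line size in bytes

def lines_to_flush(address, size):
    """Return the line-aligned addresses covering [address, address+size)."""
    first = address - (address % LINE)
    last = (address + size - 1) - ((address + size - 1) % LINE)
    return list(range(first, last + LINE, LINE))

def selectively_flush(dirty_lines, address, size):
    """Write back and drop the dirty lines that overlap the region."""
    flushed = [a for a in lines_to_flush(address, size) if a in dirty_lines]
    for a in flushed:
        dirty_lines.discard(a)      # model: line written back to the LLC
    return flushed

dirty = {0, 64, 256}
print(selectively_flush(dirty, 60, 8))   # region 60..67 spans lines 0 and 64
print(dirty)                             # only line 256 stays dirty
```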


FUSED FUSIONSIM CHANGES TO GPU API


FUSED VS DISCRETE: EXPERIMENTAL METHODOLOGY


Rodinia benchmark suite for heterogeneous computing
Discrete system modeled by Discrete FusionSim: unmodified Rodinia
Fused system modeled by Fused FusionSim: modified Rodinia:
No cudaMalloc()/cudaFree()
No memory transfers
Added cudaSelectivelyFlush()

Data input generation is excluded from the time measurements


FUSED VS DISCRETE: RELATIVE PERFORMANCE


FUSED is better

Rodinia benchmarks
Two baseline discrete systems:
10 usec kernel spawn latency
100 usec kernel spawn latency
Speed-up varies:
from 1.05x (nn, 10 usec)
up to 9.72x (gaus_4, 10 usec)


FUSED SYSTEM: KERNEL SPAWN LATENCY


FUSED is better

One baseline discrete system:
10 usec kernel spawn latency
Different fused systems:
0.1 usec kernel spawn latency
1 usec kernel spawn latency
10 usec kernel spawn latency
Simulations show:
Reduction of the latency to 1 usec is important
Further reduction below 1 usec is NOT important

FUSED SYSTEM: COHERENCE OVERHEAD


SMALLER is better

Two fused systems: incoherent vs. coherent
Kernel spawn latency is 0.1 usec in both systems
Simulations show: minor performance loss
Less than 2% for most benchmarks
5% for bfs_small


FUSION: WHICH BENCHMARK BENEFITS? ANALYTICAL MODEL


Semantics:
Λ_TOTAL: total cumulative latency
θ_KERNEL: kernel throughput
data_TOTAL: benchmark data input size

G_GPU = 1 + (Λ_TOTAL · θ_KERNEL) / data_TOTAL + θ_KERNEL / θ_COPY

Greater speed-up for:
Small benchmark input data size (small data_TOTAL)
Many kernel invocations and memory transfers (large Λ_TOTAL)
High benchmark kernel throughput (large θ_KERNEL)
Long time spent in the GPU code relative to the CPU code
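The model above can be exercised numerically. The function follows the slide's symbols; the input values are made-up illustrations, not measured FusionSim numbers.

```python
# The analytical speed-up model as a small calculator. lat_total is the
# cumulative latency Λ_TOTAL (seconds), theta_kernel and theta_copy are
# kernel and copy throughput (bytes/s), data_total is input size (bytes).

def g_gpu(lat_total, theta_kernel, theta_copy, data_total):
    """G_GPU = 1 + Λ_TOTAL·θ_KERNEL/data_TOTAL + θ_KERNEL/θ_COPY."""
    return 1.0 + lat_total * theta_kernel / data_total + theta_kernel / theta_copy

# A latency-bound benchmark: tiny input, many spawns -> large speed-up.
small = g_gpu(lat_total=1e-3, theta_kernel=1e9, theta_copy=5e9, data_total=1e6)
# A bandwidth-bound benchmark: huge input amortizes the latency term.
large = g_gpu(lat_total=1e-3, theta_kernel=1e9, theta_copy=5e9, data_total=1e9)
print(round(small, 2), round(large, 2))   # 2.2 1.2
```

Shrinking data_TOTAL by three orders of magnitude turns the latency term from negligible into the dominant contribution, matching the "small input, many invocations" intuition on this slide.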

FUSION: WHICH BENCHMARK BENEFITS? TWO SCENARIOS

G_GPU = 1 + (λ_CHUNK · θ_KERNEL(data_CHUNK)) / data_CHUNK + θ_KERNEL(data_CHUNK) / θ_COPY

Large data_CHUNK:
θ_KERNEL / θ_COPY is significant
λ_CHUNK · θ_KERNEL / data_CHUNK is insignificant

Small data_CHUNK:
θ_KERNEL / θ_COPY is insignificant
λ_CHUNK · θ_KERNEL / data_CHUNK is significant

FUSION: WHICH BENCHMARK BENEFITS? INPUT DATA SIZE

Rodinia BFS and Rodinia Gaussian (plots; input size increases to the right)

FUSED is better

Greater problem size => smaller benefit from fusion



FUSION: WHICH BENCHMARK BENEFITS? LATENCY OVERHEAD


Comparison between two benchmarks:
Rodinia Gaussian: speed-up 9.72x
Rodinia NN: speed-up 1.05x
Metric: normalized latency overhead Λ_TOTAL / data_TOTAL
Why?
100 times more kernel spawns for Gaussian
10 times more memory copies for Gaussian

G_GPU = 1 + (Λ_TOTAL · θ_KERNEL) / data_TOTAL + θ_KERNEL / θ_COPY
Λ_TOTAL / data_TOTAL = (n_KERNEL · λ_KS + n_MEM_COPY · λ_COPY) / data_TOTAL
(λ_KS: per-spawn latency; λ_COPY: per-transfer latency)

FUSION: WHICH BENCHMARK BENEFITS? KERNEL THROUGHPUT


Comparison between two benchmarks:
Rodinia BFS: speed-up 4.28x
Rodinia NN: speed-up 1.05x
Metric: kernel throughput θ_KERNEL
Why?
100 times greater kernel throughput θ_KERNEL for BFS

G_GPU = 1 + (Λ_TOTAL · θ_KERNEL) / data_TOTAL + θ_KERNEL / θ_COPY


FUSIONSIM WEBSITE: DOCUMENTATION & SOURCE CODE

www.fusionsim.ca
Discrete FusionSim & Fused FusionSim
Source code
Documentation
Google group for collaborators


DERIVATION OF THE ANALYTICAL SPEED-UP MODEL SYMBOL MEANINGS
data_KER: input data size per kernel invocation
θ_KER(data_KER): kernel throughput
n_KER: number of kernel invocations
λ_KS: kernel spawn latency
θ_COPY: memory-copy throughput
λ_TOT: total latency per iteration
Λ_TOTAL: cumulative latency over the whole benchmark
G_GPU, G_TOT: GPU-portion and total speed-up
%CPU, %GPU: fractions of time spent in CPU and GPU code


DERIVATION OF THE ANALYTICAL SPEED-UP MODEL PART 1

The kernel can be modeled as a channel of throughput θ_KER(data_KER), since the actual throughput varies with data_KER; as data_KER increases, θ_KER saturates. Most existing CUDA applications (including all the considered Rodinia benchmarks) exhibit the following computation pattern:

For i = 1 ... n_KER iterations do:
Copy the input data from the host to the device
Launch the kernel on the data
Copy the results from the device to the host

For such applications the GPU time is described by

t_GPU = n_KER · (λ_TOT + data_KER / θ_KER),

where θ_KER is the kernel data throughput and λ_TOT is the total latency per iteration, resulting from both the memory transfers and the kernel spawn. For CUDA applications that do not utilize multiple concurrent CUDA streams, the total latency per single computation iteration λ_TOT comprises the time spent transferring the data to and from the device plus the kernel spawn latency:

λ_TOT = data_KER · (1/θ_COPY_to + 1/θ_COPY_from) + λ_KS

The above expression holds true for all the considered Rodinia benchmarks.

DERIVATION OF THE ANALYTICAL SPEED-UP MODEL PART 2


Since on fused systems this latency reduces to λ'_TOT = λ_KS, the time t'_GPU of executing the CUDA code on the fused system is given by

t'_GPU = n_KER · (λ_KS + data_KER / θ_KER) ≈ n_KER · data_KER / θ_KER

The speed-up of the CUDA code is given by

G_GPU = t_GPU / t'_GPU = (λ_TOT · θ_KER) / data_KER + 1

Since data_KER = data_TOT / n_KER, we obtain

G_GPU = (λ_TOT · n_KER · θ_KER) / data_TOT + 1 = 1 + (Λ_TOTAL · θ_KERNEL) / data_TOTAL + θ_KERNEL / θ_COPY

Here Λ_TOTAL is the total latency accumulated during the benchmark execution, comprising all the kernel spawn and memory transfer latencies. Note also that the throughput θ_KER of a benchmark kernel increases with data_KER for small data_KER values and saturates to a constant for large data_KER values; it saturates when the input data size is sufficient for the maximum possible warp-scheduler occupancy for the given benchmark kernel. For benchmarks utilizing CUDA streams and overlapping kernel execution with data transfers, the latency is bounded from above:

λ_TOT ≤ data_KER · (1/θ_COPY_to + 1/θ_COPY_from) + λ_KS

This results in a smaller speed-up G_GPU for such benchmarks. Applying Amdahl's law, we get an expression for the total benchmark speed-up G_TOT:

G_TOT = 1 / (%CPU + %GPU / G_GPU)
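The final two formulas compose as follows; the numbers are illustrative, not FusionSim measurements, and the function names are mine.

```python
# Worked numeric check: compute G_GPU in its per-iteration form and fold
# it into Amdahl's law for the total speed-up G_TOT.

def g_gpu_per_iter(lam_tot, theta_ker, data_ker):
    """G_GPU = λ_TOT·θ_KER/data_KER + 1 (per-iteration form)."""
    return lam_tot * theta_ker / data_ker + 1.0

def g_tot(pct_cpu, pct_gpu, g):
    """Amdahl's law: G_TOT = 1 / (%CPU + %GPU / G_GPU)."""
    return 1.0 / (pct_cpu + pct_gpu / g)

g = g_gpu_per_iter(lam_tot=1e-4, theta_ker=1e9, data_ker=1e5)  # -> 2.0
print(g_tot(pct_cpu=0.2, pct_gpu=0.8, g=g))   # 1 / (0.2 + 0.4) ≈ 1.667
```

Even a 2x speed-up of the GPU portion yields only about 1.67x overall when 20% of the time is spent in CPU code, which is why the slides report total benchmark speed-ups rather than GPU-only figures.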


Disclaimer & Attribution


The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.

