WHAT IS FUSIONSIM?
Detailed timing simulator of a complete system with an x86 CPU and a GPU
Fused or discrete systems
FusionSim features:
  x86 out-of-order CPU + CUDA-capable GPU
  CPU and GPU operate concurrently
$$G_{GPU} = 1 + \frac{\tau_{TOTAL}\,\Theta_{KERNEL}}{data_{TOTAL}}$$
Simulation speed-up results of Rodinia
  Range: 1.05x to 9.72x
A closer look at a fusion-friendly benchmark
  Large speed-up (up to 9.72x) for small problem sizes
  Smaller speed-up (1.8x) for medium problem sizes
Kernel spawn latency
  From GPU API kernel launch request until actual kernel execution
  Simulation: order-of-magnitude reduction is important
CPU/GPU memory coherence
  Simulation: performance loss is minor, less than 2% for most Rodinia benchmarks
CPU from PTLSim: www.ptlsim.org
GPU from GPGPU-Sim: www.gpgpu-sim.org
CPU caches from MARSSx86: www.marss86.org
CPU: PTLSIM
  Fast x86: 200 KIPS (isolated)
  Out-of-order micro-op architecture
  Cycle-accurate
  Modular & detailed memory hierarchy model
GPU: GPGPU-SIM
  High correlation vs. Nvidia GT200 and Fermi
  NoC: detailed & configurable
  DRAM: detailed
|10 FusionSim: A Cycle Accurate CPU + GPU Simulator | June 13th, 2012
32-bit virtual address space (4 GBytes)
FusionSim does not simulate OS kernel code => top-most 1 GByte of addresses unused
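As a quick arithmetic check of this layout, here is a minimal sketch. It assumes the unused region is exactly the top 1 GByte of the 4 GByte space, i.e. addresses at or above 0xC0000000. All names are illustrative and are not FusionSim identifiers.

```python
# Illustrative sketch only; names are hypothetical, not FusionSim identifiers.
GIB = 1 << 30                               # 1 GByte
ADDRESS_SPACE = 1 << 32                     # 32-bit virtual address space (4 GBytes)
UNUSED_TOP = 1 * GIB                        # top-most 1 GByte, assumed unused
USABLE_LIMIT = ADDRESS_SPACE - UNUSED_TOP   # first unused address


def is_usable(addr: int) -> bool:
    """Return True if addr lies in the lower 3 GBytes of the 32-bit space."""
    return 0 <= addr < USABLE_LIMIT


print(hex(USABLE_LIMIT))  # 0xc0000000
```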
CPU
  Uses CPU virtual address
GPU
  Uses generic virtual address
Caches
  Physically addressed
  CPU adjusts virtual address to generic and translates it to physical
  GPU directly translates generic to physical
MMU
  Same MMU for both the CPU and the GPU
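The two translation paths can be pictured with a toy model. This is only a sketch: the constant-offset "adjustment" from CPU virtual to generic addresses, the 4 KiB page size, and every name below are assumptions made for illustration, not details taken from FusionSim.

```python
# Toy model. The offset-based adjustment and the page size are assumptions.
PAGE_SIZE = 4096
GENERIC_OFFSET = 0x1000_0000   # hypothetical CPU-virtual -> generic adjustment


class SharedMMU:
    """One page table (generic -> physical) used by both the CPU and the GPU."""

    def __init__(self):
        self.page_table = {}   # generic page number -> physical frame number

    def map_page(self, generic_page: int, frame: int) -> None:
        self.page_table[generic_page] = frame

    def translate(self, generic_addr: int) -> int:
        page, offset = divmod(generic_addr, PAGE_SIZE)
        return self.page_table[page] * PAGE_SIZE + offset


def cpu_to_physical(mmu: SharedMMU, cpu_vaddr: int) -> int:
    """CPU path: adjust the CPU virtual address to generic, then translate."""
    return mmu.translate(cpu_vaddr + GENERIC_OFFSET)


def gpu_to_physical(mmu: SharedMMU, generic_addr: int) -> int:
    """GPU path: the generic address is translated directly."""
    return mmu.translate(generic_addr)
```

Because both paths end in the same `translate` call, a CPU address and the corresponding generic GPU address reach the same physical location, which is what lets physically addressed caches be shared.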
Shared CUDA global address space
Same block from global space
  Cached in private CPU L1 cache
  Cached in private GPU L1 cache
Potential coherence problem
First-cut solution: flushing caches to LLC
Interesting area for exploration
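The problem and the flush-based workaround can be illustrated with a toy model. This is a sketch under simplifying assumptions (write-back private L1 caches, one value per block, a dictionary standing in for the LLC). It does not reflect how FusionSim implements its caches.

```python
class PrivateL1:
    """Toy write-back private L1 that can be flushed to a shared LLC (a dict)."""

    def __init__(self, llc: dict):
        self.llc = llc
        self.lines = {}      # block address -> cached value
        self.dirty = set()   # block addresses modified locally

    def read(self, addr: int) -> int:
        if addr not in self.lines:
            self.lines[addr] = self.llc.get(addr, 0)   # fill from LLC on a miss
        return self.lines[addr]

    def write(self, addr: int, value: int) -> None:
        self.lines[addr] = value                       # private until flushed
        self.dirty.add(addr)

    def flush(self) -> None:
        for addr in self.dirty:                        # write back dirty blocks only
            self.llc[addr] = self.lines[addr]
        self.lines.clear()                             # invalidate local copies
        self.dirty.clear()


llc = {}
cpu_l1, gpu_l1 = PrivateL1(llc), PrivateL1(llc)
gpu_l1.read(0x100)            # GPU caches the block (initial value 0)
cpu_l1.write(0x100, 42)       # CPU updates its private copy
stale = gpu_l1.read(0x100)    # GPU still sees its old copy
cpu_l1.flush()
gpu_l1.flush()
fresh = gpu_l1.read(0x100)    # GPU now refills from the LLC
print(stale, fresh)
```

Flushing is simple but discards useful cached data on every synchronization point, which is one reason a finer-grained coherence scheme is worth exploring.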
Rodinia benchmarks
Two baseline discrete systems:
  10 usec kernel spawn latency
  100 usec kernel spawn latency
Speed-up varies:
  From 1.05x (nn, 10 usec)
  Up to 9.72x (gaus_4, 10 usec)
One baseline discrete system:
  10 usec kernel spawn latency
Different fused systems:
  0.1 usec kernel spawn latency
  1 usec kernel spawn latency
  10 usec kernel spawn latency
Simulations show:
  Reduction of the latency to 1 usec is important
  Further reduction below 1 usec is NOT important
Two fused systems: incoherent vs. coherent
  Kernel spawn latency is 0.1 usec in both systems
Simulations show: minor performance loss
  Less than 2% for most benchmarks
  5% for bfs_small
$$G_{GPU} = 1 + \frac{\tau_{TOTAL}\,\Theta_{KERNEL}}{data_{TOTAL}}$$
  $\tau_{TOTAL}$: total latency
  $\Theta_{KERNEL}$: kernel throughput
  $data_{TOTAL}$: benchmark data input size
Greater speed-up for:
  Small benchmark input data size (small $data_{TOTAL}$)
  Long time spent in the GPU code relative to the CPU code
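A small numeric sketch of this dependence, taking the speed-up to be G_GPU = 1 + tau_TOTAL * Theta_KERNEL / data_TOTAL and treating the kernel throughput as a constant. The numbers are hypothetical and chosen only to show the trend. They are not measurements from FusionSim.

```python
def gpu_speedup(tau_total: float, theta_kernel: float, data_total: float) -> float:
    """G_GPU = 1 + tau_total * theta_kernel / data_total."""
    return 1.0 + tau_total * theta_kernel / data_total


# Hypothetical values: 1 ms of accumulated latency, 1 GB/s kernel throughput.
TAU_TOTAL = 1e-3       # seconds
THETA_KERNEL = 1e9     # bytes per second

small = gpu_speedup(TAU_TOTAL, THETA_KERNEL, data_total=1e5)   # 100 KB input
large = gpu_speedup(TAU_TOTAL, THETA_KERNEL, data_total=1e8)   # 100 MB input
print(f"small input: {small:.2f}x, large input: {large:.2f}x")
```

With the latency and throughput held fixed, shrinking the input by three orders of magnitude moves the speed-up from close to 1 to roughly an order of magnitude.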
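$$G_{GPU} = 1 + \frac{\tau_{CHUNK}\,\Theta_{KERNEL}(data_{CHUNK})}{data_{CHUNK}} + \frac{\Theta_{KERNEL}(data_{CHUNK})}{\Theta_{COPY}}$$
[Figure: significance of the $\Theta_{KERNEL} / \Theta_{COPY}$ term (significant or insignificant) for large and small $data_{CHUNK}$]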
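To see how the two chunk-dependent terms trade off, the sketch below evaluates a latency term, tau_CHUNK * Theta_KERNEL(data_CHUNK) / data_CHUNK, and a copy term, Theta_KERNEL(data_CHUNK) / Theta_COPY, for a small and a large chunk. The saturating throughput curve and all parameter values are assumptions made for illustration. The source only states that kernel throughput rises with data size and then saturates.

```python
def kernel_throughput(data_chunk: float, theta_max: float, data_half: float) -> float:
    """Hypothetical saturating curve: rises with chunk size, tends to theta_max."""
    return theta_max * data_chunk / (data_chunk + data_half)


def speedup_terms(data_chunk, tau_chunk, theta_copy, theta_max, data_half):
    """Return (latency_term, copy_term) of G_GPU = 1 + latency_term + copy_term."""
    theta = kernel_throughput(data_chunk, theta_max, data_half)
    latency_term = tau_chunk * theta / data_chunk
    copy_term = theta / theta_copy
    return latency_term, copy_term


# Hypothetical parameters: 10 usec per-chunk latency, 5 GB/s copy throughput,
# 10 GB/s peak kernel throughput, half of peak reached at 1 MB chunks.
PARAMS = dict(tau_chunk=10e-6, theta_copy=5e9, theta_max=10e9, data_half=1e6)

for size in (1e3, 1e9):
    latency_term, copy_term = speedup_terms(size, **PARAMS)
    print(f"chunk={size:.0e} B: latency term={latency_term:.4f}, "
          f"copy term={copy_term:.4f}, G_GPU={1 + latency_term + copy_term:.4f}")
```

Under these assumptions the latency term dominates for the small chunk, while the copy term dominates once the kernel throughput has saturated.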
[Figure: Rodinia BFS and Rodinia Gaussian; labels: "FUSED is better", $\tau_{TOTAL}$, $data_{TOTAL}$]
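Rodinia NN: speed-up 1.05x. Why?
  Compared with Gaussian:
    100 times more kernel spawns for Gaussian
    10 times more memory copies for Gaussian
  Compared with BFS:
    100 times greater kernel throughput $\Theta_{KERNEL}$ for BFS
Both factors enter $G_{GPU} = 1 + \tau_{TOTAL}\,\Theta_{KERNEL} / data_{TOTAL}$: more kernel spawns and memory copies increase $\tau_{TOTAL}$, and a greater kernel throughput increases $\Theta_{KERNEL}$.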
www.fusionsim.ca
Discrete FusionSim & Fused FusionSim
  Source code
  Documentation
  Google group for collaborators
The kernel can be modeled as a channel of throughput $\Theta_{KER}(data_{KER})$, since the actual throughput varies with $data_{KER}$. As $data_{KER}$ increases, $\Theta_{KER}$ saturates. Most existing CUDA applications (including all the considered Rodinia benchmarks) exhibit the following computation pattern:
For $n_{KER}$ iterations do:
  1. Copy the input data from the host to the device
  2. Launch the kernel on the data
  3. Copy the results from the device to the host
For such applications the GPU execution time $t_{GPU}$ is described by the following:
$$t_{GPU} = n_{KER}\left(\tau_{TOT} + \frac{data_{KER}}{\Theta_{KER}}\right),$$
where $\Theta_{KER}$ is the kernel data throughput and $\tau_{TOT}$ is the total latency per iteration resulting from both the memory transfers and the kernel spawn. For CUDA applications that do not use multiple concurrent CUDA streams, the total latency per computation iteration $\tau_{TOT}$ comprises the time spent transferring the data to or from the device and the kernel spawn latency:
$$\tau_{TOT} = \frac{data_{KER}}{\Theta_{COPY}} + \tau_{KS}$$
The above expression holds for all the considered Rodinia benchmarks.
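The model can be exercised numerically. The sketch below assumes one reading of the expressions above: the per-iteration latency is the copy time data_KER / Theta_COPY plus the kernel spawn latency tau_KS, and the GPU time is n_KER * (tau_TOT + data_KER / Theta_KER). The "no overhead" variant is a hypothetical idealization with no copy and no spawn cost, used only for comparison. All parameter values are made up.

```python
import math


def gpu_time_discrete(n_ker, data_ker, theta_ker, theta_copy, tau_ks):
    """t_GPU = n_KER * (tau_TOT + data_KER / Theta_KER), with assumed tau_TOT."""
    tau_tot = data_ker / theta_copy + tau_ks   # copy time + kernel spawn latency
    return n_ker * (tau_tot + data_ker / theta_ker)


def gpu_time_no_overhead(n_ker, data_ker, theta_ker):
    """Idealized case: only the kernel execution time remains."""
    return n_ker * data_ker / theta_ker


# Hypothetical parameters: 100 iterations of 1 MB each, 10 GB/s kernel throughput,
# 5 GB/s copy throughput, 10 usec kernel spawn latency.
args = dict(n_ker=100, data_ker=1e6, theta_ker=1e10)
t_discrete = gpu_time_discrete(theta_copy=5e9, tau_ks=10e-6, **args)
t_ideal = gpu_time_no_overhead(**args)
ratio = t_discrete / t_ideal
print(f"discrete: {t_discrete:.4f} s, idealized: {t_ideal:.4f} s, ratio: {ratio:.2f}")
```

The ratio of the two times equals 1 + tau_TOT * Theta_KER / data_KER, which is the form the speed-up takes when the per-iteration overhead is removed entirely.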
The speed-up $G_{GPU}$ in the time $t_{GPU}$ of executing the GPU code is:
$$G_{GPU} = 1 + \frac{\tau_{TOT}\,\Theta_{KER}}{data_{KER}} = 1 + \frac{n_{KER}\,\tau_{TOT}\,\Theta_{KERNEL}}{data_{TOTAL}} = 1 + \frac{\tau_{TOTAL}\,\Theta_{KERNEL}}{data_{TOTAL}}$$
Here $\tau_{TOTAL}$ is the total latency accumulated during the benchmark execution, comprising all the kernel spawn and memory transfer latencies. Note also that the throughput $\Theta_{KER}$ of a benchmark kernel increases with $data_{KER}$ for small $data_{KER}$ values and saturates to a constant for large $data_{KER}$ values. The throughput saturates when the input data size is sufficient for the maximum possible warp scheduler occupancy for the given benchmark kernel. For benchmarks that use CUDA streams and overlap kernel execution with data transfers, the latency is bounded from above, i.e.:
$$\tau_{TOT} \le \frac{data_{KER}}{\Theta_{COPY}} + \tau_{KS}$$
This results in a smaller speed-up
GTOT =