
Computer Architecture Simulators

Eines i Tecniques de Mesura
Master CANS

* Some contents taken from Sangyeun Cho, CS/COE1541 course, U. of Pittsburgh.

Alex Ramirez

What is a simulator?

A tool to reproduce the behavior of a computing device


Why use a simulator?
Obtain fine-grain details about internal behavior
Performance analysis
Enable software development on platforms that are not yet available
Obtain performance predictions for candidate system architectures

Taxonomy of simulation tools

Architecture simulators are classified along three axes:

Functional vs. Performance / Timing
Trace-driven vs. Execution-driven
User code vs. Full system

Functional vs. Timing simulators

Functional simulators implement the visible architecture state

Maintain the correctness of the programmer's view of the architecture
Main purpose: software development and/or emulation
e.g. virtual machines (SimNow, QEMU, VMware)

Timing simulators implement the microarchitecture details

Model the system's internal structures
Processor pipeline
Memory hierarchy
Branch predictors
Interconnection network
...

Functional simulation is much faster than performance simulation
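The distinction above can be sketched in a few lines: a purely functional simulator only updates the architectural state (registers) and knows nothing about time. The toy 3-instruction ISA below is an illustrative assumption, not any real machine.

```python
# Minimal sketch of a functional simulator: it maintains only the
# architectural state and models no timing at all.
# The ISA (li / add / mul over r0-r2) is invented for illustration.

def run(program):
    regs = {"r0": 0, "r1": 0, "r2": 0}
    for op, *args in program:
        if op == "li":                  # load immediate: li rd, imm
            rd, imm = args
            regs[rd] = imm
        elif op == "add":               # add rd, rs1, rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "mul":               # mul rd, rs1, rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] * regs[rs2]
    return regs

prog = [("li", "r0", 6), ("li", "r1", 7), ("mul", "r2", "r0", "r1")]
print(run(prog)["r2"])  # 42
```

A timing simulator would additionally model how many cycles each instruction spends in the pipeline; this sketch deliberately omits that.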

Trace vs. Execution driven

Trace-driven simulators
Instrument the application and execute it on a real platform
The instrumentation records an execution trace
The recorded trace is used as input to the simulator

Execution-driven simulators
The application runs directly on top of the simulator
The simulator maintains both application state and architecture state

Trace-driven simulation is usually faster than execution-driven
Only needs to maintain the architecture state
No need to reproduce the computational parts of the application

Trace-driven simulation makes host and target systems independent
Obtain traces on machine A, simulate on machine B

Traces enable simulation of proprietary applications & input sets
Take the traces away, independent of inputs & binaries

Traces cannot capture dynamic properties of the application
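The trace-driven idea can be sketched as follows: the application is never re-executed; the simulator only replays a recorded list of memory addresses against a model of the architecture state. The direct-mapped cache geometry and the tiny trace are illustrative assumptions.

```python
# Sketch of a trace-driven cache simulator: the "trace" is a recorded
# sequence of memory addresses; the simulator maintains only the
# architecture state (a direct-mapped cache) and never runs the program.

def simulate(trace, num_lines=64, line_size=64):
    tags = [None] * num_lines            # one tag per direct-mapped cache line
    hits = misses = 0
    for addr in trace:
        line = addr // line_size         # which memory line is accessed
        idx = line % num_lines           # which cache slot it maps to
        if tags[idx] == line:
            hits += 1
        else:
            misses += 1
            tags[idx] = line             # fill the slot on a miss
    return hits, misses

trace = [0, 64, 0, 64, 4096]             # addresses recorded on the host machine
print(simulate(trace))                   # hits=2, misses=3 (4096 conflicts with 0)
```

Note what the trace cannot express: if the target architecture would have reordered or skipped these accesses, the trace still replays them as recorded, which is exactly the "dynamic properties" limitation above.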

User-code vs. Full-system

User-code simulators only simulate the application code
System calls and I/O fall back to functional simulation
Often call the native OS to perform the functional emulation
Requires that the host OS and target OS are the same

Full-system simulators include the OS and I/O devices
Functional and timing simulation of OS code
Functional and timing simulation of all devices
Disks, network, ...

Building a supercomputer from the ground up

[Figure: pipeline diagram — F/D/A/M/W stages overlapped across consecutive instructions]

Processor exploits ILP
Multicore exploits TLP
Node board connects multiple chips
Rack connects multiple nodes
Many interconnected racks build a supercomputer

Multiple levels of abstraction

Simulation can be performed at multiple levels of abstraction

Cluster-level
Processors as atomic blocks
No simulation of pipeline or memory system
Detailed simulation of the interconnection network

Shared-memory node
Processor microarchitecture as an atomic block
Detailed simulation of the memory system
No interaction with I/O and interconnection network

Processor
Detailed simulation of processor pipeline stages
Detailed simulation of first-level caches
Little interaction with off-chip memory system
No interaction with I/O and interconnection network

The Zen of architecture simulators

Can't build a simulator that achieves everything: Speed, Accuracy, and Flexibility form a trade-off triangle

FPGA prototypes
Fast + accurate, but not flexible

Detailed software models
Accurate + flexible, but slow

Abstract software models
Fast + flexible, but not accurate

Developing architecture simulators

Monolithic C code
Models the architecture in low-level C code
Fast, accurate (but error-prone), not really flexible
Need to model interfaces + timing of components

Object-oriented (C++, Java)
Objects closely match architecture components
Methods implement the object interfaces
Still need to model the timing of components

Domain-specific languages
Modular infrastructures
Programmer defines components + interfaces
Programmer defines the component interconnect
The simulation engine models timing

The simulation speed problem

Trace-driven simulation
Collect traces on one machine
Run simulations on a different machine
Run multiple simulations in parallel
No need for total correctness
Easier for first-approach data

Execution-driven simulation
Must run the application and the simulator on the same machine
Cross-ISA emulation is costly

In all cases, simulation is slow (and traces are huge)
100x to 100,000x slower than real runs
1 min. of real application can take 2h to 2 years of simulation
Gigabytes of trace storage

Simulator speed example

                          Time           Ratio to Native   Ratio to Functional
Native                    1.054s         --                --
sim-fast                  2m 47s         158x              --
sim-outorder              1h 11m 07s     4,029x            25x
simics                    7m 41s         437x              --
simics w/Ruby             11h 27m 25s    39,131x           89x
simics w/Ruby + Opal      43h 13m 41s    147,648x          338x

SimpleScalar toolset
sim-fast provides functional emulation
sim-outorder provides timing simulation of the out-of-order processor pipeline

Simics + GEMS toolset
simics provides multicore full-system functional simulation
Ruby provides timing simulation of the caches + memory hierarchy
The processor is abstracted as a fixed 1 IPC
Opal provides timing simulation of the out-of-order processor pipeline

* gcc from SPEC'00, small input, on a Xeon 3.8GHz

Don't simulate faster: simulate LESS

Simulation is a single-threaded process
Even when simulating parallel hardware

Processors do not get any faster: we have hit the power wall
Simulation speed no longer improves with technology
Parallel machines are used to run multiple simulations in parallel

We cannot make a faster simulator
Do not simulate the entire system
Do not simulate the entire application

Downsizing the simulated platform

Simulate a smaller system
e.g. a multiprocessor of 16 cores instead of 1024
Does not expose scalability issues
Competition for shared resources
Conflicts
Race conditions

Simulate a smaller application
e.g. a matrix multiply of 1K x 1K instead of 1M x 1M
Does not exercise the hardware properly
Smaller working set fits in the cache
Capacity and conflict issues
Less data / code reuse
Heavier weight of initialization vs. steady state

Simulation techniques

Sampling
Simulate random samples of the application

Fast-forwarding
Reach the sample point in the application quickly

Warm-up
Fast simulation of components prior to the measured stage

Checkpointing
Dump application and architecture state before the simulation sample

Phase detection
Select (non-random) samples based on application analysis

Simulation sampling

[Figure: timeline of N instructions (total benchmark execution) with measured sample units (U) spread across it]

Statistical approach
Do not examine the entire population
Interview a representative sample

Mathematical approach
Confidence level (e.g. 95%)
Confidence interval (e.g. +/- 2.5)

Two sampling approaches
Systematic sampling
Sample every N instructions
Random sampling
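The two approaches can be sketched in a few lines; the benchmark length, sampling period, and sample size below are illustrative numbers, not recommended values.

```python
import random

# Sketch of the two sampling approaches over a stream of N instructions:
# systematic sampling starts a measurement window every `period`
# instructions; random sampling draws the same number of window starts
# uniformly over the run.

N = 1_000_000          # total instructions in the benchmark (illustrative)
period = 100_000       # systematic sampling period
unit = 1_000           # instructions measured per sample

systematic = list(range(0, N, period))
rng = random.Random(42)                  # seeded for reproducibility
randomized = sorted(rng.randrange(0, N - unit) for _ in systematic)

print(len(systematic), len(randomized))  # same sample count, different placement
```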

Number of required samples

The required sample count grows with
The variance of the target metric
The tightness of the desired confidence interval
The desired confidence level

n >= ( z * stddev / half-width )^2

where z is the z-score of the desired confidence level (e.g. 1.96 for 95%) and half-width is half the desired confidence interval
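The rule above can be turned into a small calculator; the z-score, standard deviation, and interval in the example are illustrative assumptions, not values from any particular study.

```python
import math

# Sample-size sketch for the rule above: n grows with the metric's
# standard deviation and shrinks as the tolerated interval widens.
# The z-score encodes the confidence level (1.96 for 95%).

def required_samples(z, stddev, half_width):
    return math.ceil((z * stddev / half_width) ** 2)

# e.g. a metric with stddev 0.4, 95% confidence, interval of +/- 0.05
print(required_samples(1.96, 0.4, 0.05))  # 246
```

Halving the interval quadruples the sample count, which is why the high-variance multiprocessor workloads on the next slide need far larger samples.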

SimFlex sample sizes

Application                Detailed warm-up   Simulation unit   Confidence interval
SPEC CPU                   2,000 instr        1,000 instr       99.7% +/- 3%
OLTP (TPC-C), Web server   100,000 cycles     50,000 cycles     95% +/- 5%

Multiprocessor applications exhibit much higher variance
Larger samples are required to stabilize measurements

Simulator warm-up

[Figure: timeline of N instructions (total benchmark execution) — functional simulation warms up large structures, then a detailed simulator warm-up phase (W) precedes the actual measurement (U)]

Can't simulate samples on an empty architecture state
Caches do not start in an invalid state
Branch predictor, TLB, ...
Operating system state
Files opened / closed, read / write pointers, ...

The architecture takes some time to warm up
The larger the structure, the longer the warm-up

Checkpointing and sampling simulation

[Figure: timeline of N instructions (total benchmark execution) — functional simulation warms up large structures, detailed simulator warm-up (W), actual measurement (U); the warmed-up state is restored from a checkpoint before each sample]

Store the warmed-up state before each sample to a checkpoint file
Amortizes the functional simulation time over more detailed simulations
Allows parallel simulation of all samples

Non-random sampling

Exploit the application's repetitive behavior to select valid samples
Assume application behavior is linked to the static code being executed
Each phase corresponds to a static section of the code

Basic block vector

Count executions of each basic block
For the whole program execution
For every N instructions executed
Typical N is 100 million instructions

Normalize each vector by the total number of basic block executions in the BBV
The sum of all elements in the BBV adds up to 1

Compare each sample BBV to the global BBV
Manhattan distance
Sum of absolute differences, produces a value between 0 and 2
Euclidean distance
Square root of the sum of squared differences
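A sketch of the normalization and the two distance metrics described above; the 3-element vectors stand in for real BBVs, which have one entry per basic block in the program.

```python
# Sketch of BBV comparison: vectors are normalized so their elements
# sum to 1, then compared with Manhattan distance (bounded by 2 for
# normalized vectors) or Euclidean distance.

def normalize(bbv):
    total = sum(bbv)
    return [x / total for x in bbv]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

a = normalize([10, 30, 60])   # BBV of one 100M-instruction interval
b = normalize([20, 30, 50])   # BBV of another interval
print(round(manhattan(a, b), 6))  # 0.2
```

Two identical vectors give distance 0; two intervals executing completely disjoint code give the maximum Manhattan distance of 2.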

BBV comparison

Comparison of BBVs also shows the periodic behavior of the application
The BBV difference graph is used to
Identify the initialization phase
Identify the repetitive interval
Fourier analysis

SimPoint issues

The most representative BBV might be far ahead in the code
Large fast-forward period
Pick the earliest significant sample instead

A single sample might not represent the whole application
Pick one sample from each application phase

Selected samples are architecture dependent
However, the same code with a different compiler has the same BBV

Multithreaded workloads

Random samples of multiple threads do not overlap in time
This invalidates the simulation of multithreaded systems

Taking a vertical sample does not guarantee proper alignment either
Thread speed is not the same during
Functional simulation
Warm-up
Simulation
Fast-forward is measured in instructions, not time

Alternatives to simulation

Statistical simulation
Analytical modeling
Abstract simulation
Hierarchical simulation

Task-based programming model

The programmer annotates the code to identify tasks, inputs, and outputs
The total task working set must fit in the LS

The runtime library takes care of
Detecting parallelism
Bundling tasks together
Transferring data in / out of the LS
Automatically performs double buffering for the tasks in a bundle

The first tasks in a bundle must wait for their DMAs to finish
The following tasks in a bundle don't have to wait much, since their data was requested before the previous task executed

The TaskSim simulator

TaskSim is a system simulator based on application-level traces
Thread phases + inter-thread dependencies + DMA events

Abstract simulation of CPU bursts
Obtain the burst duration from the trace
Apply a CPU speed factor based on the phase id

Detailed (cycle-accurate) simulation of the DMA controller, caches, interconnect, memory controller, and DRAM

Sample TaskSim application trace

Helper Thread:
  CPU 8, 15.51us      (prepare bundle)
  CPU 9, 1.19us       (prepare submission)
  CPU 11, 1.51us      (submit bundle)
  SIGNAL 8            (bundle 8 is ready)
  CPU 16, 0us         (low-level wait for events)
  WAIT 101            (wait for Task 101)
  CPU 7, 5.52us       (schedule)
  CPU 8, 15.61us      (prepare bundle)

SPE Thread 8:
  CPU 19, 4521.23us   (wait for tasks)
  WAIT 8              (wait for Bundle 8)
  CPU 20, 6.42us      (get task description)
  CPU 21, 1.35us      (task stage-in)
  CPU 23, 561.21us    (task execution)

In the simulation, the duration of wait phases is ignored, since it depends on when the WAIT event is satisfied

Where does CPU burst timing come from?

CPU bursts in the TaskSim trace are annotated with their duration
TaskSim can then adjust the CPU burst duration
Change the duration to a fixed amount (e.g. DMA wait goes to 1ns)
Multiply the duration by a given factor (e.g. SPU computation gets 2x faster)

CPU burst duration is obtained from
The real execution at the time the trace was collected
Cycle-accurate simulation on another simulator
Running an instruction-level trace through another simulator
Simulating a sample in an execution-driven simulator

All CPU bursts can be simulated independently on such cycle-accurate models
Hierarchical simulation
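The burst-adjustment rules above can be sketched as a lookup over phase ids; the phase names, fixed durations, and speed-up factors here are illustrative assumptions, not TaskSim's actual configuration.

```python
# Sketch of CPU burst adjustment: each trace record carries a phase id
# and a recorded duration; some phases are pinned to a fixed duration,
# others are scaled by a speed factor. All values are illustrative.

FIXED_NS = {"dma_wait": 1}          # e.g. DMA wait collapses to 1 ns
SPEEDUP = {"spu_compute": 2.0}      # e.g. SPU computation gets 2x faster

def adjust(phase, duration_ns):
    if phase in FIXED_NS:
        return FIXED_NS[phase]       # replace with a fixed amount
    return duration_ns / SPEEDUP.get(phase, 1.0)  # or scale (default 1x)

trace = [("spu_compute", 4000), ("dma_wait", 350), ("other", 120)]
print([adjust(p, d) for p, d in trace])  # [2000.0, 1, 120.0]
```

This is what makes a single trace reusable across candidate architectures: only the per-phase factors change, not the trace itself.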

Dynamic task scheduling automatically achieves load balancing

The simulator's task scheduler dynamically allocates tasks (bundles) to the first available processor
Simulate a variable number of processors from a single application trace

Matrix multiply simulations (8 to 256 CPUs)

Original execution, 8 SPU, 25.6 GB/s
16 SPU, 25.6 GB/s
3x PPU, 16 SPU, 25.6 GB/s
5x PPU, 32 SPU, 25.6 GB/s
3 PPU at 10x speed, 256 SPU, 409.6 GB/s

MatMul zoom on the 256-CPU simulation

Not even 3 masters at 10x speed can keep up with 256 workers
Wrong task creation + scheduling strategy (depth-first vs. breadth-first)

Task generation scheme scalability

[Figure: execution timelines for 16, 32, and 64 processors]

Task generation (green) on the master task limits scalability (on the left)
Parallelizing task generation (on the right) is crucial to avoid this bottleneck

Simulator results also allow analysis of per-module behavior

[Figure: Paraver traces for cache interleaving every 128 bytes, every 128 KB, and every 256 KB]

The Paraver trace shows cache accesses / ns for each of the 4 cache banks
128-byte interleaving evenly balances pressure across the 4 caches
128 KB interleaving uses only 1 out of 4 caches at any given time
Worst case: quotient of data size (or DMA size?) vs. interleave grain
256 KB interleaving uses 2 out of 4 caches

Also produces statistics (counters + histograms) in addition to traces

The graphs show the number of simultaneous data transfers on the cluster and global buses

Little intra-cluster traffic
All traffic competes to reach the global bus port

High traffic on the global bus
Continuously handles 2-4 transfers
Very little use for more than 4 simultaneous transfers

Why does FFT3D not perform well with 32 cache modules?

[Figure: FFT3D with 800 GB/s bandwidth — the cache is better for some access patterns; FFT3D with 32 cache blocks (800 GB/s bandwidth) and 100 GB/s memory bandwidth]

The cache write-allocate policy hurts the 1st FFT pass badly
Even if afterwards the whole working set fits in the cache
TransposeYZ works better with the cache due to reduced page conflicts and lower latency

Hierarchical simulation methodology

First execution
Trace the real application at a high level of abstraction (CPU bursts, synchronization, communication events)
Find a representative segment of the execution
Non-linear filtering and spectral analysis
10-100x smaller trace
CPU burst clustering algorithm
Density-based clustering using performance counters
Selection of CPU burst representatives
10-100x smaller trace

Second execution
Trace the CPU burst representatives at the microarchitecture level

Trace collection: WRF

[Figure: WRF application trace]
