Architecture Simulators
Measurement Tools and Techniques (Eines i Tècniques de Mesura)
Master CANS
Alex Ramirez
What is a simulator?
Architecture simulators
  Functional vs. Performance / Timing
  Trace-driven vs. Execution-driven
  User code vs. Full system
User-code vs Full-system
[Figure: pipeline diagram; instructions flow through the F, D, A, M, and W stages, overlapped across cycles.]
[Figure: full-system scope, from the processor up to the node and the rack; a rack connects multiple nodes.]
Speed vs. Flexibility
Monolithic C code
Domain-specific languages
Modular infrastructures
  Programmer defines components + interfaces
  Programmer defines components + interconnect
  Simulation engine models timing
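As a rough illustration of the modular approach, here is a minimal sketch (all class names, event names, and latencies are invented for this example, not taken from any real infrastructure): components only define their interface and behavior, while a discrete-event engine owns the clock and models timing.

```python
import heapq

class Engine:
    """Toy discrete-event engine: the engine, not the components, models timing."""
    def __init__(self):
        self.now = 0
        self.queue = []   # entries: (time, seq, component, event)
        self.seq = 0      # tie-breaker so heap never compares components

    def schedule(self, delay, component, event):
        heapq.heappush(self.queue, (self.now + delay, self.seq, component, event))
        self.seq += 1

    def run(self):
        while self.queue:
            self.now, _, component, event = heapq.heappop(self.queue)
            component.handle(event)

class Cache:
    """A component: defines its interface (handle) and connects to the engine."""
    def __init__(self, engine, mem_latency):
        self.engine = engine
        self.mem_latency = mem_latency
        self.lines = set()
        self.hits = 0
        self.misses = 0

    def handle(self, event):
        kind, addr = event
        if kind == "access":
            if addr in self.lines:
                self.hits += 1
            else:
                self.misses += 1
                self.lines.add(addr)   # simplification: line installed at once
                # the miss completes mem_latency cycles later
                self.engine.schedule(self.mem_latency, self, ("fill", addr))

engine = Engine()
cache = Cache(engine, mem_latency=100)
engine.schedule(0, cache, ("access", 0x1000))   # miss
engine.schedule(1, cache, ("access", 0x1000))   # hit
engine.run()
print(cache.hits, cache.misses, engine.now)     # 1 hit, 1 miss, time 100
```

Swapping the Cache for another component with the same `handle` interface needs no change to the engine, which is the point of the modular style.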
Trace-driven simulation
Execution-driven simulation
Must run application and simulator on the same machine
Cross-ISA emulation is costly
                   Time         Ratio to Native   Ratio to Functional
Native             1.054s       --                --
sim-fast           2m 47s       158x
sim-outorder       1h 11m 07s   4,029x            25x
simics             7m 41s       437x
  w/Ruby                        39,131x           89x
  w/Ruby + Opal                 147,648x          338x
SimpleScalar toolset
  sim-fast provides functional emulation
  sim-outorder provides timing simulation of the processor's out-of-order pipeline
Processors do not get any faster: we have hit the power wall
Simulation speed no longer improves with technology
Parallel machines are used to run multiple simulations in parallel
Simulation techniques
Sampling
  Simulate random samples of the application
Fast-forwarding
  Reaching the sample point in the application
Warm-up
  Fast simulation of components prior to the measuring stage
Checkpointing
  Dump application and architecture state before the simulation sample
Phase detection
  Select (non-random) samples based on application analysis
Simulation sampling
[Figure: timeline of the N instructions of the total benchmark execution, with small sampled units (U) on which the actual simulator measurement is taken.]
Statistical approach
Do not examine the entire population
Interview a representative sample
Mathematical approach
Confidence level (e.g. 95%)
Confidence interval (e.g. +/- 2.5)
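The statistical approach can be reproduced with a small sketch (the CPI numbers are synthetic, purely for illustration): simulate only a sample of units in detail, then report the sample mean together with a confidence interval instead of examining the entire population.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(42)
# Hypothetical population: per-unit CPI of every unit in the benchmark.
# In practice the population is never measured; only the sample is simulated.
population = [1.0 + 0.3 * random.random() for _ in range(100_000)]

sample = random.sample(population, 1000)   # units simulated in detail
m, s = mean(sample), stdev(sample)
z = NormalDist().inv_cdf(0.975)            # two-sided 95% confidence level
half_width = z * s / len(sample) ** 0.5    # confidence interval half-width

print(f"CPI = {m:.3f} +/- {half_width:.3f} (95% confidence)")
```

Growing the sample shrinks the interval with the square root of its size, which is why a few thousand units can stand in for billions of instructions.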
Application                 Detailed warm-up   Simulation unit   Confidence interval
SPEC CPU                    2,000 instr        1,000 instr       99.7% +/- 3%
OLTP (TPC-C), Web server    100,000 cycles     50,000 cycles     95% +/- 5%
Simulator warm-up
[Figure: sampling timeline; before each actual simulator measurement unit (U), functional simulation performs warm-up of large structures (W), followed by a detailed-simulator warm-up.]
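The payoff of warming up large structures can be shown with a toy direct-mapped cache (sizes and the synthetic trace are invented for this sketch): measuring a unit on a cold cache understates the hit rate, while functionally updating the cache during fast-forward makes the measured unit see warm state.

```python
import random

# Direct-mapped cache tag array: the "large structure" that needs warming.
LINE, SETS = 64, 1024
tags = [None] * SETS

def touch(addr):
    """Functional cache update: keeps state warm without modeling timing."""
    idx = (addr // LINE) % SETS
    tag = addr // (LINE * SETS)
    hit = tags[idx] == tag
    tags[idx] = tag
    return hit

random.seed(0)
working_set = [random.randrange(1 << 20) // LINE * LINE for _ in range(2000)]
trace = [random.choice(working_set) for _ in range(50_000)]

# Cold start: measure a 1000-access unit with no warm-up at all.
cold_hits = sum(touch(a) for a in trace[:1000])

# Warm start: fast-forward functionally through the preceding 49,000
# accesses, updating the cache, then measure the final 1000-access unit.
tags = [None] * SETS
for a in trace[:49_000]:
    touch(a)
warm_hits = sum(touch(a) for a in trace[49_000:])
print(cold_hits, warm_hits)   # the warm measurement sees far more hits
```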
Non-random sampling
BBV comparison
Comparison of BBVs (basic block vectors) also shows periodic behavior in the application
BBV difference graph used to:
  Identify the initialization phase
  Identify the repetitive interval
Fourier analysis
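The BBV comparison itself can be sketched as follows (the vectors below are made up for illustration): normalize each interval's basic-block counts and take the Manhattan distance between them, one common choice in SimPoint-style analyses. A small distance means two intervals execute the same code, i.e. belong to the same phase.

```python
def bbv_distance(a, b):
    """Manhattan distance between two normalized basic-block vectors.

    Each vector counts basic-block executions in one interval; normalizing
    makes intervals of different lengths comparable.
    """
    na, nb = sum(a), sum(b)
    return sum(abs(x / na - y / nb) for x, y in zip(a, b))

# Hypothetical BBVs for three intervals over the same 4 basic blocks:
init  = [90, 5, 5, 0]    # initialization phase
loop1 = [2, 50, 40, 8]   # steady-state interval
loop2 = [3, 52, 38, 7]   # another steady-state interval

print(bbv_distance(init, loop1))   # large: different phase
print(bbv_distance(loop1, loop2))  # small: same phase
```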
SimPoint issues
Multithreaded workloads
Taking a vertical sample does not guarantee proper alignment either
Thread speed is not the same during:
  Functional simulation
  Warm-up
  Simulation
Fast-forward is measured in instructions, not time
Alternatives to simulation
Statistical simulation
Analytical modeling
Abstract simulation
Hierarchical simulation
Helper Thread trace:
  CPU 8, 15.51us      (Prepare bundle)
  CPU 9, 1.19us       (Prepare submission)
  CPU 11, 1.51us      (Submit bundle)
  SIGNAL 8            (Bundle 8 is ready)
  CPU 16, 0us
  WAIT 101            (Wait for Task 101)
  CPU 7, 5.52us       (Schedule)
  CPU 8, 15.61us      (Prepare bundle)

SPE Thread 8 trace:
  CPU 19, 4521.23us
  WAIT 8              (Low level wait for events)
  CPU 20, 6.42us
  CPU 21, 1.35us
  CPU 23, 561.21us
CPU bursts in the TaskSim trace are annotated with their duration
TaskSim can then adjust these CPU burst durations:
  Change duration to a fixed amount (e.g. DMA wait goes to 1ns)
  Multiply duration by a given factor (e.g. SPU computation gets 2x faster)
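The two adjustments can be sketched as follows (the record format and field names are invented for illustration; this is not TaskSim's actual trace format):

```python
# Hypothetical trace records: (kind, value). "cpu" values are burst
# durations in microseconds; "signal" carries a bundle id.
trace = [
    ("cpu", 15.51), ("cpu", 1.19), ("signal", 8),
    ("dma_wait", 3.20), ("cpu", 561.21),
]

def adjust(records, fixed_us=None, cpu_factor=1.0):
    """Replay-time adjustment of annotated burst durations."""
    out = []
    for kind, val in records:
        if kind == "dma_wait" and fixed_us is not None:
            val = fixed_us              # force DMA waits to a fixed amount
        elif kind == "cpu":
            val = val / cpu_factor      # e.g. computation 2x faster
        out.append((kind, val))
    return out

# DMA waits forced to 0.001us (1ns); CPU bursts run 2x faster.
adjusted = adjust(trace, fixed_us=0.001, cpu_factor=2.0)
print(adjusted)
```

Events that carry no duration (here the SIGNAL record) pass through unchanged, so the trace's causal structure is preserved while its timing is rescaled.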
The simulator's task scheduler dynamically allocates tasks (bundles) to the first available processor
Simulate a variable number of processors from a single application trace
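A minimal sketch of replaying one trace on a varying number of processors, assuming independent tasks (real traces also carry dependencies and synchronization; the durations below are invented):

```python
import heapq

def replay(task_durations, num_procs):
    """Greedy replay: each task goes to the first processor to become free."""
    free_at = [0.0] * num_procs          # time each processor becomes idle
    heapq.heapify(free_at)
    for d in task_durations:
        start = heapq.heappop(free_at)   # first available processor
        heapq.heappush(free_at, start + d)
    return max(free_at)                  # makespan of the whole trace

tasks = [5.0, 3.0, 8.0, 2.0, 4.0, 6.0]
for p in (1, 2, 4):
    print(p, "processors ->", replay(tasks, p))
```

Because only the task durations come from the trace, the same trace answers "what if I had 2, 4, or 256 processors?" without re-running the application.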
Not even 3 masters at 10x speed can keep up with 256 workers
Wrong task creation + scheduling strategy (depth-first vs. breadth-first)
[Figure: execution traces for 16p, 32p, and 64p configurations.]
Task generation (green) on the master task limits scalability (on the left)
Parallelization of task generation (on the right) is crucial to avoid this bottleneck
Paraver trace shows cache accesses per ns for each of the 4 cache banks
128-byte interleave evenly balances pressure on the 4 caches
128 KB interleave is only using 1 out of 4 caches at any given time
  Worst case: quotient of data size (or DMA size?) vs. interleave grain
256 KB interleave is using 2 out of 4 caches
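The bank mapping behind these observations can be sketched as follows (4 banks and the two grains from the slide; the sequential address stream is synthetic):

```python
def bank(addr, grain, num_banks=4):
    """Cache bank serving an address under a given interleave grain."""
    return (addr // grain) % num_banks

# 128-byte lines of a sequential 256 KB stream:
lines = list(range(0, 256 * 1024, 128))

# 128-byte grain: any 4 consecutive lines map to 4 different banks.
print([bank(a, 128) for a in lines[:4]])              # [0, 1, 2, 3]

# 128 KB grain: 1024 consecutive lines all hammer the same bank.
print({bank(a, 128 * 1024) for a in lines[:1024]})    # {0}
```

With the coarse grain a sequential stream spends long stretches on one bank, which is exactly the single-active-cache behavior visible in the Paraver trace.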
Graphs show the number of simultaneous data transfers on the cluster and global buses
Cache is better for some access patterns
FFT3D with 32 cache blocks (800 GB/s bandwidth) and 100 GB/s memory bandwidth
TransposeYZ works better with cache due to reduced page conflicts and lower latency
First execution
  Trace real application at a high level of abstraction (CPU bursts, synchronization, communication events)
  Find representative segment of execution
    Non-linear filtering and spectral analysis
    10-100x smaller trace
  CPU burst clustering algorithm
    Density-based clustering using performance counters
    Selection of CPU burst representatives
    10-100x smaller trace
Second execution
  Trace CPU burst representatives at the microarchitecture level
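The clustering-and-representatives step can be illustrated with a crude stand-in (burst ids and counter values are invented, and simple grid-bucketing replaces the real density-based algorithm): bursts with similar performance-counter signatures are grouped, and only one representative per group is traced again at the microarchitecture level.

```python
from collections import defaultdict

# Hypothetical CPU bursts from the first execution: (burst_id, IPC, L2 miss rate).
bursts = [
    (0, 1.92, 0.01), (1, 0.41, 0.22), (2, 1.88, 0.02),
    (3, 0.39, 0.25), (4, 1.95, 0.01), (5, 0.44, 0.21),
]

# Crude stand-in for density-based clustering: bursts whose counters fall
# in the same grid cell are considered the same behavior.
clusters = defaultdict(list)
for bid, ipc, miss_rate in bursts:
    key = (round(ipc, 0), round(miss_rate, 1))
    clusters[key].append(bid)

# One representative per cluster goes to the second (microarchitectural) run.
representatives = [members[0] for members in clusters.values()]
print(dict(clusters))
print(representatives)
```

Six bursts collapse into two behaviors, so the second execution only needs to trace two bursts in microarchitectural detail, which is where the 10-100x trace reduction comes from.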