
Computer Architecture Simulators

Eines i Tecniques de Mesura
Master CANS

* Some contents taken from Sangyeun Cho, CS/COE1541 course, U. of Pittsburgh.

Alex Ramirez

What is a simulator?

A tool to reproduce the behavior of a computing device


Why use a simulator?
Obtain fine-grain details about internal behavior
Performance analysis
Enable software development on platforms that are not yet available
Obtain performance predictions for candidate system architectures

Taxonomy of simulation tools

Architecture simulators are classified along three axes:

Functional vs. Performance / Timing
Trace-driven vs. Execution-driven
User code vs. Full system

Functional vs. Timing simulators

Functional simulators implement the visible architecture state

Maintain the correctness of the programmer's view of the architecture
Main purpose: software development and/or emulation
e.g. virtual machines (SimNow, QEMU, VMware)

Timing simulators implement the microarchitecture details

Model the system's internal structures
Processor pipeline
Memory hierarchy
Branch predictors
Interconnection network
...

Functional simulation is much faster than performance simulation
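The distinction above can be sketched in a few lines: a purely functional simulator only updates the architectural state (registers) and knows nothing about time. The toy 3-instruction ISA below is an illustrative assumption, not any real machine.

```python
# Minimal sketch of a functional simulator: it maintains only the
# architectural state and models no timing at all.
# The ISA (li / add / mul over r0-r2) is invented for illustration.

def run(program):
    regs = {"r0": 0, "r1": 0, "r2": 0}
    for op, *args in program:
        if op == "li":                  # load immediate: li rd, imm
            rd, imm = args
            regs[rd] = imm
        elif op == "add":               # add rd, rs1, rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "mul":               # mul rd, rs1, rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] * regs[rs2]
    return regs

prog = [("li", "r0", 6), ("li", "r1", 7), ("mul", "r2", "r0", "r1")]
print(run(prog)["r2"])  # 42
```

A timing simulator would additionally model how many cycles each instruction spends in the pipeline; this sketch deliberately omits that.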

Trace vs. Execution driven

Trace-driven simulators
Instrument the application and execute it on a real platform
The instrumentation records an execution trace
The recorded trace is used as input to the simulator

Execution-driven simulators
The application runs directly on top of the simulator
The simulator maintains both application state and architecture state

Trace-driven simulation is usually faster than execution-driven
Only needs to maintain the architecture state
No need to reproduce the computational parts of the application

Trace-driven simulation makes host and target systems independent
Obtain traces on machine A, simulate on machine B

Traces enable simulation of proprietary applications & input sets
Take the traces away, independent of inputs & binaries

Traces cannot capture dynamic properties of the application
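The trace-driven idea can be sketched as follows: the application is never re-executed; the simulator only replays a recorded list of memory addresses against a model of the architecture state. The direct-mapped cache geometry and the tiny trace are illustrative assumptions.

```python
# Sketch of a trace-driven cache simulator: the "trace" is a recorded
# sequence of memory addresses; the simulator maintains only the
# architecture state (a direct-mapped cache) and never runs the program.

def simulate(trace, num_lines=64, line_size=64):
    tags = [None] * num_lines            # one tag per direct-mapped cache line
    hits = misses = 0
    for addr in trace:
        line = addr // line_size         # which memory line is accessed
        idx = line % num_lines           # which cache slot it maps to
        if tags[idx] == line:
            hits += 1
        else:
            misses += 1
            tags[idx] = line             # fill the slot on a miss
    return hits, misses

trace = [0, 64, 0, 64, 4096]             # addresses recorded on the host machine
print(simulate(trace))                   # hits=2, misses=3 (4096 conflicts with 0)
```

Note what the trace cannot express: if the target architecture would have reordered or skipped these accesses, the trace still replays them as recorded, which is exactly the "dynamic properties" limitation above.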

User-code vs. Full-system

User-code simulators only simulate the application code
System calls and I/O fall back to functional simulation
Often call the native OS to perform the functional emulation
Requires that the host OS and target OS are the same

Full-system simulators include the OS and I/O devices
Functional and timing simulation of OS code
Functional and timing simulation of all devices
Disks, network, ...

Building a supercomputer from the ground up

[Figure: pipeline diagram — F/D/A/M/W stages overlapped across consecutive instructions]

Processor exploits ILP
Multicore exploits TLP
Node board connects multiple chips
Rack connects multiple nodes
Many interconnected racks build a supercomputer

Multiple levels of abstraction

Simulation can be performed at multiple levels of abstraction

Cluster-level
Processors as atomic blocks
No simulation of pipeline or memory system
Detailed simulation of the interconnection network

Shared-memory node
Processor microarchitecture as an atomic block
Detailed simulation of the memory system
No interaction with I/O and interconnection network

Processor
Detailed simulation of processor pipeline stages
Detailed simulation of first-level caches
Little interaction with off-chip memory system
No interaction with I/O and interconnection network

The Zen of architecture simulators

Can't build a simulator that achieves everything: Speed, Accuracy, and Flexibility form a trade-off triangle

FPGA prototypes
Fast + accurate, but not flexible

Detailed software models
Accurate + flexible, but slow

Abstract software models
Fast + flexible, but not accurate

Developing architecture simulators

Monolithic C code
Models the architecture in low-level C code
Fast, accurate (but error-prone), not really flexible
Need to model interfaces + timing of components

Object-oriented (C++, Java)
Objects closely match architecture components
Methods implement the object interfaces
Still need to model the timing of components

Domain-specific languages
Modular infrastructures
Programmer defines components + interfaces
Programmer defines the component interconnect
The simulation engine models timing

The simulation speed problem

Trace-driven simulation
Collect traces on one machine
Run simulations on a different machine
Run multiple simulations in parallel
No need for total correctness
Easier for first-approach data

Execution-driven simulation
Must run the application and the simulator on the same machine
Cross-ISA emulation is costly

In all cases, simulation is slow (and traces are huge)
100x to 100,000x slower than real runs
1 min. of real application can take 2h to 2 years of simulation
Gigabytes of trace storage

Simulator speed example

                          Time           Ratio to Native   Ratio to Functional
Native                    1.054s         --                --
sim-fast                  2m 47s         158x              --
sim-outorder              1h 11m 07s     4,029x            25x
simics                    7m 41s         437x              --
simics w/Ruby             11h 27m 25s    39,131x           89x
simics w/Ruby + Opal      43h 13m 41s    147,648x          338x

SimpleScalar toolset
sim-fast provides functional emulation
sim-outorder provides timing simulation of the out-of-order processor pipeline

Simics + GEMS toolset
simics provides multicore full-system functional simulation
Ruby provides timing simulation of the caches + memory hierarchy
The processor is abstracted as a fixed 1 IPC
Opal provides timing simulation of the out-of-order processor pipeline

* gcc from SPEC'00, small input, on a Xeon 3.8GHz

Don't simulate faster: simulate LESS

Simulation is a single-threaded process
Even when simulating parallel hardware

Processors do not get any faster: we have hit the power wall
Simulation speed no longer improves with technology
Parallel machines are used to run multiple simulations in parallel

We cannot make a faster simulator
Do not simulate the entire system
Do not simulate the entire application

Downsizing the simulated platform

Simulate a smaller system
e.g. a multiprocessor of 16 cores instead of 1024
Does not expose scalability issues
Competition for shared resources
Conflicts
Race conditions

Simulate a smaller application
e.g. a matrix multiply of 1K x 1K instead of 1M x 1M
Does not exercise the hardware properly
Smaller working set fits in the cache
Capacity and conflict issues
Less data / code reuse
Heavier weight of initialization vs. steady state

Simulation techniques

Sampling
Simulate random samples of the application

Fast-forwarding
Reach the sample point in the application quickly

Warm-up
Fast simulation of components prior to the measured stage

Checkpointing
Dump application and architecture state before the simulation sample

Phase detection
Select (non-random) samples based on application analysis

Simulation sampling

[Figure: timeline of N instructions (total benchmark execution) with measured sample units (U) spread across it]

Statistical approach
Do not examine the entire population
Interview a representative sample

Mathematical approach
Confidence level (e.g. 95%)
Confidence interval (e.g. +/- 2.5)

Two sampling approaches
Systematic sampling
Sample every N instructions
Random sampling
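The two approaches can be sketched in a few lines; the benchmark length, sampling period, and sample size below are illustrative numbers, not recommended values.

```python
import random

# Sketch of the two sampling approaches over a stream of N instructions:
# systematic sampling starts a measurement window every `period`
# instructions; random sampling draws the same number of window starts
# uniformly over the run.

N = 1_000_000          # total instructions in the benchmark (illustrative)
period = 100_000       # systematic sampling period
unit = 1_000           # instructions measured per sample

systematic = list(range(0, N, period))
rng = random.Random(42)                  # seeded for reproducibility
randomized = sorted(rng.randrange(0, N - unit) for _ in systematic)

print(len(systematic), len(randomized))  # same sample count, different placement
```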

Number of required samples

The required sample count grows with
The variance of the target metric
The tightness of the desired confidence interval
The desired confidence level

n >= ( z * stddev / half-width )^2

where z is the z-score of the desired confidence level (e.g. 1.96 for 95%) and half-width is half the desired confidence interval
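The rule above can be turned into a small calculator; the z-score, standard deviation, and interval in the example are illustrative assumptions, not values from any particular study.

```python
import math

# Sample-size sketch for the rule above: n grows with the metric's
# standard deviation and shrinks as the tolerated interval widens.
# The z-score encodes the confidence level (1.96 for 95%).

def required_samples(z, stddev, half_width):
    return math.ceil((z * stddev / half_width) ** 2)

# e.g. a metric with stddev 0.4, 95% confidence, interval of +/- 0.05
print(required_samples(1.96, 0.4, 0.05))  # 246
```

Halving the interval quadruples the sample count, which is why the high-variance multiprocessor workloads on the next slide need far larger samples.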

SimFlex sample sizes

Application                Detailed warm-up   Simulation unit   Confidence interval
SPEC CPU                   2,000 instr        1,000 instr       99.7% +/- 3%
OLTP (TPC-C), Web server   100,000 cycles     50,000 cycles     95% +/- 5%

Multiprocessor applications exhibit much higher variance
Larger samples are required to stabilize measurements

Simulator warm-up

[Figure: timeline of N instructions (total benchmark execution) — functional simulation warms up large structures, then a detailed simulator warm-up phase (W) precedes the actual measurement (U)]

Can't simulate samples on an empty architecture state
Caches do not start in an invalid state
Branch predictor, TLB, ...
Operating system state
Files opened / closed, read / write pointers, ...

The architecture takes some time to warm up
The larger the structure, the longer the warm-up

Checkpointing and sampling simulation

[Figure: timeline of N instructions (total benchmark execution) — functional simulation warms up large structures, detailed simulator warm-up (W), actual measurement (U); the warmed-up state is restored from a checkpoint before each sample]

Store the warmed-up state before each sample to a checkpoint file
Amortizes the functional simulation time over more detailed simulations
Allows parallel simulation of all samples

Non-random sampling

Exploit the application's repetitive behavior to select valid samples
Assume application behavior is linked to the static code being executed
Each phase corresponds to a static section of the code

Basic block vector

Count executions of each basic block
For the whole program execution
For every N instructions executed
Typical N is 100 million instructions

Normalize each vector by the total number of basic block executions in the BBV
The sum of all elements in the BBV adds up to 1

Compare each sample BBV to the global BBV
Manhattan distance
Sum of absolute differences, produces a value between 0 and 2
Euclidean distance
Square root of the sum of squared differences
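A sketch of the normalization and the two distance metrics described above; the 3-element vectors stand in for real BBVs, which have one entry per basic block in the program.

```python
# Sketch of BBV comparison: vectors are normalized so their elements
# sum to 1, then compared with Manhattan distance (bounded by 2 for
# normalized vectors) or Euclidean distance.

def normalize(bbv):
    total = sum(bbv)
    return [x / total for x in bbv]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

a = normalize([10, 30, 60])   # BBV of one 100M-instruction interval
b = normalize([20, 30, 50])   # BBV of another interval
print(round(manhattan(a, b), 6))  # 0.2
```

Two identical vectors give distance 0; two intervals executing completely disjoint code give the maximum Manhattan distance of 2.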

BBV comparison

Comparison of BBVs also shows the periodic behavior of the application
The BBV difference graph is used to
Identify the initialization phase
Identify the repetitive interval
Fourier analysis

SimPoint issues

The most representative BBV might be far ahead in the code
Large fast-forward period
Pick the earliest significant sample instead

A single sample might not represent the whole application
Pick one sample from each application phase

Selected samples are architecture dependent
However, the same code with a different compiler has the same BBV

Multithreaded workloads

Random samples of multiple threads do not overlap in time
This invalidates the simulation of multithreaded systems

Taking a vertical sample does not guarantee proper alignment either
Thread speed is not the same during
Functional simulation
Warm-up
Simulation
Fast-forward is measured in instructions, not time

Alternatives to simulation

Statistical simulation
Analytical modeling
Abstract simulation
Hierarchical simulation

Task-based programming model

The programmer annotates the code to identify tasks, inputs, and outputs
The total task working set must fit in the LS

The runtime library takes care of
Detecting parallelism
Bundling tasks together
Transferring data in / out of the LS
Automatically performs double buffering for the tasks in a bundle

The first tasks in a bundle must wait for their DMAs to finish
The following tasks in a bundle don't have to wait much, since their data was requested before the previous task executed

The TaskSim simulator

TaskSim is a system simulator based on application-level traces
Thread phases + inter-thread dependencies + DMA events

Abstract simulation of CPU bursts
Obtain the burst duration from the trace
Apply a CPU speed factor based on the phase id

Detailed (cycle-accurate) simulation of the DMA controller, caches, interconnect, memory controller, and DRAM

Sample TaskSim application trace

Helper Thread:
  CPU 8, 15.51us      (prepare bundle)
  CPU 9, 1.19us       (prepare submission)
  CPU 11, 1.51us      (submit bundle)
  SIGNAL 8            (bundle 8 is ready)
  CPU 16, 0us         (low-level wait for events)
  WAIT 101            (wait for Task 101)
  CPU 7, 5.52us       (schedule)
  CPU 8, 15.61us      (prepare bundle)

SPE Thread 8:
  CPU 19, 4521.23us   (wait for tasks)
  WAIT 8              (wait for Bundle 8)
  CPU 20, 6.42us      (get task description)
  CPU 21, 1.35us      (task stage-in)
  CPU 23, 561.21us    (task execution)

In the simulation, the duration of wait phases is ignored, since it depends on when the WAIT event is satisfied

Where does CPU burst timing come from?

CPU bursts in the TaskSim trace are annotated with their duration
TaskSim can then adjust the CPU burst duration
Change the duration to a fixed amount (e.g. DMA wait goes to 1ns)
Multiply the duration by a given factor (e.g. SPU computation gets 2x faster)

CPU burst duration is obtained from
The real execution at the time the trace was collected
Cycle-accurate simulation on another simulator
Running an instruction-level trace through another simulator
Simulating a sample in an execution-driven simulator

All CPU bursts can be simulated independently on such cycle-accurate models
Hierarchical simulation
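The burst-adjustment rules above can be sketched as a lookup over phase ids; the phase names, fixed durations, and speed-up factors here are illustrative assumptions, not TaskSim's actual configuration.

```python
# Sketch of CPU burst adjustment: each trace record carries a phase id
# and a recorded duration; some phases are pinned to a fixed duration,
# others are scaled by a speed factor. All values are illustrative.

FIXED_NS = {"dma_wait": 1}          # e.g. DMA wait collapses to 1 ns
SPEEDUP = {"spu_compute": 2.0}      # e.g. SPU computation gets 2x faster

def adjust(phase, duration_ns):
    if phase in FIXED_NS:
        return FIXED_NS[phase]       # replace with a fixed amount
    return duration_ns / SPEEDUP.get(phase, 1.0)  # or scale (default 1x)

trace = [("spu_compute", 4000), ("dma_wait", 350), ("other", 120)]
print([adjust(p, d) for p, d in trace])  # [2000.0, 1, 120.0]
```

This is what makes a single trace reusable across candidate architectures: only the per-phase factors change, not the trace itself.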

Dynamic task scheduling automatically achieves load balancing

The simulator's task scheduler dynamically allocates tasks (bundles) to the first available processor
Simulate a variable number of processors from a single application trace

Matrix multiply simulations (8 to 256 CPUs)

Original execution, 8 SPU, 25.6 GB/s
16 SPU, 25.6 GB/s
3x PPU, 16 SPU, 25.6 GB/s
5x PPU, 32 SPU, 25.6 GB/s
3 PPU at 10x speed, 256 SPU, 409.6 GB/s

MatMul zoom on the 256-CPU simulation

Not even 3 masters at 10x speed can keep up with 256 workers
Wrong task creation + scheduling strategy (depth-first vs. breadth-first)

Task generation scheme scalability

[Figure: execution timelines for 16, 32, and 64 processors]

Task generation (green) on the master task limits scalability (on the left)
Parallelizing task generation (on the right) is crucial to avoid this bottleneck

Simulator results also allow analysis of per-module behavior

[Figure: Paraver traces for cache interleaving every 128 bytes, every 128 KB, and every 256 KB]

The Paraver trace shows cache accesses / ns for each of the 4 cache banks
128-byte interleaving evenly balances pressure across the 4 caches
128 KB interleaving uses only 1 out of 4 caches at any given time
Worst case: quotient of data size (or DMA size?) vs. interleave grain
256 KB interleaving uses 2 out of 4 caches

Also produces statistics (counters + histograms) in addition to traces

The graphs show the number of simultaneous data transfers on the cluster and global buses

Little intra-cluster traffic
All traffic competes to reach the global bus port

High traffic on the global bus
Continuously handles 2-4 transfers
Very little use for more than 4 simultaneous transfers

Why does FFT3D not perform well with 32 cache modules?

[Figure: FFT3D with 800 GB/s bandwidth — the cache is better for some access patterns; FFT3D with 32 cache blocks (800 GB/s bandwidth) and 100 GB/s memory bandwidth]

The cache write-allocate policy hurts the 1st FFT pass badly
Even if afterwards the whole working set fits in the cache
TransposeYZ works better with the cache due to reduced page conflicts and lower latency

Hierarchical simulation methodology

First execution
Trace the real application at a high level of abstraction (CPU bursts, synchronization, communication events)
Find a representative segment of the execution
Non-linear filtering and spectral analysis
10-100x smaller trace
CPU burst clustering algorithm
Density-based clustering using performance counters
Selection of CPU burst representatives
10-100x smaller trace

Second execution
Trace the CPU burst representatives at the microarchitecture level

Trace collection: WRF

[Figure: WRF application trace]
