Unit 5

Multiprocessor
Architectures
P. Raja vadhana
AP CSE
Computer Architecture II
BE CSE G2 S6
15/3/2016

Contents

Multiprocessors (15/3/2016 H5)
Cache coherency (15/3/2016 H5)
Coherency Protocols (15/3/2016 H5,7)
Models of Memory Consistency (15/3/2016 H7)
Basics of Synchronization (Intro Self Study) (15/3/2016 H7)
Symmetric Shared Memory Multiprocessors (17/3/2016 H3)
Performance (17/3/2016 H3)
Distributed Shared Memory Multiprocessors (17/3/2016 H4)
Implementation of Directory Based Coherence (18/3/2016 H3)

Multiprocessors
Factors:
Silicon usage
Power consumption
Cost of ILP & TLP

Solution: Leveraging design investment by replication is more advantageous than creating a new, unique design
Multiprocessors -> Multiprogramming -> Multithreading
Grain size: Amount of computation assigned to a thread

Cache Coherence
Reason
Existence
Shared variable exhibits 2 states
Global state
Local state
Occurrence
View of memory held by two different
processors is through their individual caches

Properties of Consistent Memory Systems

Coherency
Defines what values can be returned by a Read
Behavior of R & W to the same location
3 properties:
Preserving Program Order
Time Spaced Coherency
Write Serialization

Consistency
Defines when a written value will be returned by a Read
Behavior of R & W with respect to accesses to other memory locations

Properties of Coherency

Program Order
1. P writes to location X
2. No other WRITES in between
3. P reads from location X
Access returns the value WRITTEN by P

Time Spaced
1. B writes to location X
2. No other WRITES in between; the write and the read are sufficiently time spaced
3. A reads from location X
Access returns the value WRITTEN by B

Write Serialization
1. B writes to location X
2. A writes to location X
Access order is seen the same by all processors
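
The write serialization property can be checked mechanically. The sketch below is only an illustration: the function names and the idea of recording each processor's observed values as a list are assumptions, not part of the slides. It treats the writes to X as one agreed order and tests whether every processor's observations are consistent with that order (a processor may miss an intermediate value, but may never see two values in the opposite order).

```python
def is_subsequence(seen, order):
    """True if `seen` appears inside `order` with the same relative order."""
    remaining = iter(order)
    # `value in remaining` advances the iterator until the value is found
    return all(value in remaining for value in seen)


def write_serialized(write_order, observations):
    """Check write serialization for one location.

    write_order:  values written to X, in the agreed global order
                  (assumed distinct so they can be told apart)
    observations: dict of processor name -> values it read from X, in order
    """
    return all(is_subsequence(seen, write_order)
               for seen in observations.values())


# B writes 1 to X, then A writes 2 to X
print(write_serialized([1, 2], {"P0": [1, 2], "P1": [2]}))     # consistent
print(write_serialized([1, 2], {"P0": [1, 2], "P1": [2, 1]}))  # P1 saw the writes reversed
```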

Cache Coherence Protocols

Track the status of any sharing of a data block


Basic Techniques:
Write Invalidate Protocol
Write Update/Broadcast Protocol
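
The difference between the two techniques can be sketched with a toy model. Everything here is an assumption made for illustration: caches are plain dictionaries, memory is write-through, and the `write` helper and its return value (the number of remote copies touched) are hypothetical names, not a description of any real controller.

```python
def write(caches, memory, writer, addr, value, protocol):
    """Apply one write by `writer` and keep the other caches coherent.

    caches:   dict of cpu name -> {address: value} (the cached copies)
    memory:   dict of address -> value (write-through in this toy model)
    protocol: "invalidate" or "update"
    Returns the number of remote cached copies that had to be touched.
    """
    if protocol not in ("invalidate", "update"):
        raise ValueError("protocol must be 'invalidate' or 'update'")
    memory[addr] = value
    caches[writer][addr] = value
    touched = 0
    for cpu, cache in caches.items():
        if cpu == writer or addr not in cache:
            continue
        if protocol == "invalidate":
            del cache[addr]        # remote copy dropped; its next read misses
        else:
            cache[addr] = value    # remote copy refreshed in place
        touched += 1
    return touched


caches = {"P0": {"X": 0}, "P1": {"X": 0}}
memory = {"X": 0}
write(caches, memory, "P0", "X", 1, "invalidate")
print(caches)   # P1 no longer holds X
```

With write invalidate, repeated writes by the same processor cost nothing after the first, because no remote copy is left to touch; with write update, every write reaches every sharer.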

Models of Memory Consistency
Delayed Invalidation

Solution:
Sequential Consistency
Delay memory access completion until all
invalidations caused by that access are
completed
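
A small litmus test shows what sequential consistency rules out. The two-processor program below (P1 writes A then reads B; P2 writes B then reads A) is a standard textbook example chosen for illustration, not taken from the slides. The sketch enumerates every interleaving that keeps each processor's program order and collects the possible results. Under sequential consistency both reads can never return 0; with delayed invalidations on real hardware that outcome becomes possible.

```python
from itertools import combinations

# Each operation: (kind, location, value-to-write or register-to-fill)
P1 = [("write", "A", 1), ("read", "B", "r1")]
P2 = [("write", "B", 1), ("read", "A", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    total = len(a) + len(b)
    for slots in combinations(range(total), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(total)]


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    memory = {"A": 0, "B": 0}
    regs = {}
    for kind, loc, arg in schedule:
        if kind == "write":
            memory[loc] = arg
        else:
            regs[arg] = memory[loc]
    return regs["r1"], regs["r2"]


def sc_outcomes():
    """All (r1, r2) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(P1, P2)}


print(sorted(sc_outcomes()))
```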

Relaxed Consistency Models

Allow READS & WRITES to complete out of order
Enforce ordering through synchronization mechanisms
X->Y means that X must complete before Y
Relax lists the ordering newly relaxed by each model, in addition to those relaxed in the rows above it

Model                                            Require                   Relax
Sequential                                       R->W, R->R, W->W, W->R    (none)
Processor Consistency or Total Store Ordering    R->W, R->R, W->W          W->R
Partial Store Ordering                           R->W, R->R                W->W
Weak Ordering or Release Consistency Model       (none)                    R->W, R->R
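
The Require column can be written down as data and queried. This is a sketch under one assumption that the flattened slide only implies: relaxations accumulate from one model to the next, so weak ordering and release consistency keep none of the four orderings between ordinary accesses. The short model keys and the `may_reorder` helper are hypothetical names.

```python
# Orderings each model still enforces between ordinary (non-synchronization) accesses
REQUIRED = {
    "sequential": {"R->W", "R->R", "W->W", "W->R"},
    "tso":        {"R->W", "R->R", "W->W"},   # processor consistency / total store ordering
    "pso":        {"R->W", "R->R"},           # partial store ordering
    "weak":       set(),                      # weak ordering / release consistency
}


def may_reorder(model, first, second):
    """True if a later `second` access may complete before an earlier `first`.

    first, second: "R" or "W". Synchronization operations are not modeled;
    they are what restores ordering in the relaxed models.
    """
    return f"{first}->{second}" not in REQUIRED[model]


for model in REQUIRED:
    relaxed = [f"{a}->{b}" for a in "RW" for b in "RW" if may_reorder(model, a, b)]
    print(model, "relaxes", relaxed or "nothing")
```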

Basics Of Synchronization
Implement an atomic structure using hardware primitives
Implement a Coherence Mechanism

SWAP by Atomic Operation

Interchanges a value in a register with a value in memory
LOCK {0, 1} <-> Memory Value {1, ADDRESS}
TEST & SET
FETCH & INCREMENT
LoadLinked (LL) & StoreConditional (SC)

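LL-SC
SWAP:
try:  MOV    R3, R4       ; copy the exchange value into R3
      LL     R2, 0(R1)    ; load linked: R2 = current memory value
      SC     R3, 0(R1)    ; store conditional: R3 = 1 on success, 0 on failure
      BEQZ   R3, try      ; retry if the store failed
      MOV    R4, R2       ; put the loaded (old) value in R4
FETCH & INCREMENT:
try:  LL     R2, 0(R1)    ; load linked
      DADDUI R3, R2, #1   ; increment
      SC     R3, 0(R1)    ; store conditional
      BEQZ   R3, try      ; retry if the store failed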
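
The retry loops above can be mimicked in software to see why a failed SC forces another attempt. The model below is a simplification and every name in it is hypothetical: `LLSCMemory` keeps one link per CPU, and any store to the linked address breaks the link, which is what makes the following store conditional return 0.

```python
class LLSCMemory:
    """Toy memory with load-linked / store-conditional semantics."""

    def __init__(self):
        self.mem = {}
        self.link = {}            # cpu -> address it currently has linked

    def load_linked(self, cpu, addr):
        self.link[cpu] = addr
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        """Ordinary store: breaks every link on this address."""
        self.mem[addr] = value
        for cpu in [c for c, a in self.link.items() if a == addr]:
            del self.link[cpu]

    def store_conditional(self, cpu, addr, value):
        """Return 1 and store if the link is intact, else return 0."""
        if self.link.get(cpu) != addr:
            return 0
        self.store(addr, value)
        return 1


def atomic_swap(m, cpu, addr, new_value):
    """Mirror of the SWAP loop: returns the old memory value."""
    while True:
        old = m.load_linked(cpu, addr)
        if m.store_conditional(cpu, addr, new_value):
            return old


def fetch_and_increment(m, cpu, addr):
    """Mirror of the FETCH & INCREMENT loop: returns the value before the add."""
    while True:
        old = m.load_linked(cpu, addr)
        if m.store_conditional(cpu, addr, old + 1):
            return old
```

An intervening `store` by another processor between `load_linked` and `store_conditional` makes the conditional store fail, so the loop simply reloads and tries again.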

Coherence Mechanism

SPIN LOCKS: locks that a processor continuously tries to acquire, spinning around a loop until it succeeds
Caching of Lock variable
1. Place a LOCK VAR in memory
2. B requests LOCK by initiating ATOMIC EXCHANGE
3. Check if the LOCK = 0
4. Change LOCK = 1
5. Get and Change Data at X
6. Release LOCK = 0
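
The six steps above can be sketched as a small spin lock. Python has no hardware atomic exchange, so the `_exchange` method below stands in for it by using a `threading.Lock` internally; that substitution, the `SpinLock` class, and the `run_counter` driver are all assumptions made for illustration. The inner read loop corresponds to spinning on the locally cached copy of the lock variable before attempting the exchange.

```python
import threading
import time


class SpinLock:
    def __init__(self):
        self._value = 0                     # step 1: lock variable, 0 = free, 1 = held
        self._atomic = threading.Lock()     # stand-in for hardware atomicity

    def _exchange(self, new):
        """Emulated atomic exchange: store `new`, return the old value."""
        with self._atomic:
            old, self._value = self._value, new
            return old

    def acquire(self):
        while True:
            while self._value == 1:         # spin on the (cached) copy, no bus traffic
                time.sleep(0)               # yield so the lock holder can run
            if self._exchange(1) == 0:      # steps 2-4: exchange, check 0, set 1
                return

    def release(self):
        # Step 6. Hardware would use a plain store; the emulation goes through
        # _exchange so it cannot interleave with another thread's exchange.
        self._exchange(0)


def run_counter(n_threads=4, n_iters=100):
    """Step 5: each thread updates shared data only while holding the lock."""
    lock = SpinLock()
    shared = {"x": 0}

    def worker():
        for _ in range(n_iters):
            lock.acquire()
            shared["x"] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["x"]
```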

Symmetric Shared Memory Multiprocessors

Snooping based Coherence Protocol

* Cache Line is same as Cache Block

[Figure: remote read request to a write-back cache]

Simple Illustration of Snoop

[Figure: remote CPU access to local cache block]

Problem

Format for Answer

Processor Activity | Bus Activity | Cache content in P0 | State of Block containing __ | Cache content in P1 | State of Block containing __ | Memory Content

Solve the problem using above format under following two cases:
1. X and Y in SAME cache block/cache line
2. X and Y in DIFFERENT cache block/cache line
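
The two cases can be explored with a small write-invalidate snooping simulator. This is a sketch, not the worked answer: the slides do not give the access sequence, so the alternating writes below are hypothetical, and the three-state M/S/I protocol with write-back caches is an assumed baseline. Each returned row carries the processor activity, the bus activity, and the block state in every cache, which covers most columns of the answer format.

```python
def simulate(accesses, block_of, n_cpus=2):
    """Run an MSI write-invalidate snooping protocol over a shared bus.

    accesses: list of (cpu index, "R" or "W", variable name)
    block_of: dict mapping each variable to the cache block that holds it
    Returns (rows, invalidations); a block absent from a cache is Invalid.
    """
    state = [dict() for _ in range(n_cpus)]      # state[cpu][block] is "M" or "S"
    rows, invalidations = [], 0
    for cpu, op, var in accesses:
        blk = block_of[var]
        mine = state[cpu].get(blk, "I")
        bus = "none"                              # hits generate no bus traffic
        if op == "R" and mine == "I":
            bus = "read miss"
            for other in range(n_cpus):
                if other != cpu and state[other].get(blk) == "M":
                    state[other][blk] = "S"       # owner writes back, keeps a shared copy
                    bus = "read miss + write back"
            state[cpu][blk] = "S"
        elif op == "W" and mine != "M":
            bus = "write miss" if mine == "I" else "invalidate"
            for other in range(n_cpus):
                if other != cpu and blk in state[other]:
                    del state[other][blk]         # snooped write invalidates the copy
                    invalidations += 1
            state[cpu][blk] = "M"
        rows.append((f"P{cpu} {op} {var}", bus, [dict(s) for s in state]))
    return rows, invalidations


# Hypothetical sequence: P0 keeps writing X while P1 keeps writing Y
seq = [(0, "W", "X"), (1, "W", "Y"), (0, "W", "X"), (1, "W", "Y")]
_, same = simulate(seq, {"X": "B0", "Y": "B0"})   # case 1: same block
_, diff = simulate(seq, {"X": "B0", "Y": "B1"})   # case 2: different blocks
print(same, diff)
```

When X and Y share a block, every write invalidates the other processor's copy even though the processors never touch the same variable; with separate blocks the later writes are all hits.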

Symmetric Shared Memory in Practice

Performance of UMA

Memory Access Time


Bandwidth
Latency
Design
Scaling
Asymmetric Cache memory access
Cache Coherence from replication
Migration Vs Replication

Limitations
True Sharing Vs False Sharing
Scalability of Cache Coherence

Distributed Shared Memory Multiprocessors

Simple Illustration of Directory

State of Cache Block


1. Shared
2. Un-cached
3. Exclusive

Home Vs Requesting Vs Owner Node

Home Node
Vs
Requesting/Local Node
Vs
Owner/Remote Node

Directory Based Coherence Mechanism: Read Miss

Directory Based Coherence Mechanism: Write Miss


Block State : Shared

Directory Based Coherence Mechanism: Write Miss


Block State : Exclusive(P1) , Shared (P2, P3)

Directory Based Coherence Mechanism: Write Miss


Block State : Exclusive(P1) , Invalid (P2, P3)
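
The read miss and write miss handling at the home node can be sketched as a small state machine over the three block states. The `Directory` class, its message strings, and the choice to let the previous owner keep a shared copy after a read miss are illustrative assumptions; the sketch tracks one memory block and does not model the requester already being the owner.

```python
class Directory:
    """Home-node directory entry for a single memory block (toy model)."""

    def __init__(self):
        self.state = "Uncached"       # Uncached, Shared, or Exclusive
        self.sharers = set()          # nodes holding a copy (the owner if Exclusive)
        self.log = []                 # messages sent by the home node

    def read_miss(self, requester):
        if self.state == "Exclusive":
            (owner,) = self.sharers
            self.log.append(f"fetch from {owner}")      # owner writes the block back
        self.log.append(f"data reply to {requester}")
        self.sharers.add(requester)                     # old owner stays as a sharer
        self.state = "Shared"

    def write_miss(self, requester):
        if self.state == "Shared":
            for node in sorted(self.sharers - {requester}):
                self.log.append(f"invalidate {node}")
        elif self.state == "Exclusive":
            (owner,) = self.sharers
            self.log.append(f"fetch/invalidate {owner}")
        self.log.append(f"data reply to {requester}")
        self.sharers = {requester}                      # requester becomes the owner
        self.state = "Exclusive"


d = Directory()
d.read_miss("P2")
d.read_miss("P3")        # block is now Shared by P2 and P3
d.write_miss("P1")       # P2 and P3 are invalidated, P1 becomes Exclusive
print(d.state, d.sharers, d.log)
```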

Performance of NUMA

Memory Access Time


Bandwidth
Latency
Design
Scaling
Distributed Cache memory access
Cache Coherence from replication

Directory Based Coherence Performance

Directory:
Delineates Remote from Local Traffic
Avoids unnecessary Invalidation Messages
Congestion
Centralized Directory: Single Point of Failure
Distributed Directory: Complex
Reduce the storage overhead (Processors * Memory blocks)
Increase Cache Block Size
Hierarchical architecture with multiple processors per node

Optimizing Storage Overhead


Limited Pointer Scheme: Reduce P
Sparse Directories: Reduce M

Transition
Gray: Request from remote node
Black: Actions taken by Home directory