ECE437: Introduction to Digital Computer Design
Chapter 5b: Multiprocessors
Fall 2014

Why? Power changes everything
What about performance?
Key concepts:
Programming models
Cache coherence
Synchronization
Consistency
Why Multicores
In the past, we got a faster clock with every microprocessor generation
But a faster clock leads to higher power, which cannot be sustained
Power = C * V^2 * f, where C is circuit capacitance, V is supply voltage, f is clock frequency
V is roughly linear in f, so power is cubic in f
2x clock => 8x power, but 2x cores at the same clock => only 2x power
We have been lowering voltage, but that causes other problems (leakage, reliability)
[Figure: taxonomy of multiprocessors by where the processors join - at memory (shared memory), at the processor (Single-Instruction Multiple-Data (SIMD), dataflow/systolic), or at I/O (network), i.e., message-passing clusters]
Shared memory
shared memory - all processors see one global memory
programs use loads/stores to access data
+ conceptually compatible with uniprocessors
+ ease of programming, even if communication is complex/dynamic
+ lower latency when communicating small data items
Real programs tend to communicate small, fine-grain data

Message passing
message passing - each processor sees only its private memory
programs use sends/receives to access data
+ very simple hardware (almost nothing extra)
+ communication pattern is explicit
Message Passing
distribute data carefully to threads
no automatic data motion
Shared memory
Thread1                      Thread2
....                         .....
compute (data)               synchronize
store (A, B, C, D, ...)      load (A, B, C, D, ...)
synchronize                  compute
......                       .....
Message Passing
Thread1                          Thread2
....                             .....
compute (data)                   receive (mesg)
store (A, B, C, D, ...)          scatter (mesg to A, B, C, D, ...)
gather (A, B, C, D into mesg)    load (A, B, C, D, ...)
send (mesg)                      compute
......                           .....
A, B, C, D are DIFFERENT in each thread -- PRIVATE memory
Example
Ocean - compute the temperature of a part of an ocean
A very simple example, so you can see the communication
Real parallel programs have much more complex communication patterns: databases, online transaction processing (reservations), decision support (inventory control), web servers, Java middleware
in other words, all the real programs that make the world go round!
Sequential Ocean
main()
begin
  read(n);
  A = malloc(n * n);
  initialize(A);
  Solve(A);
end main

Solve(float **A)
begin
  while (!done)
    diff = 0;
    for i = 1 to n do
      for j = 1 to n do
        temp = A(i,j);
        A(i,j) = 0.2*(A(i,j) + A(i,j-1) + A(i,j+1) + A(i+1,j) + A(i-1,j));
        diff += abs(A(i,j) - temp);
      end for
    end for
    if (diff / (n*n) < TOL) then done = 1;
  end while
end Solve
main()
begin
  ...
  CREATE(p-1, solve);
  solve(A);
  WAIT_FOR_END(p-1);
end main

Solve(float **A)
begin
  myA = malloc(n/p + 2 by n array);
  while (!done)
    mydiff = 0;
    ...
Ocean
Think of how each core's cache would look in the shared memory case
Think of how each core's messages would look in the message passing case
SIMD
Single instruction, multiple data

SIMD (Vector)
a * x + y (daxpy), on a 64-element vector:
do i = 1,64
  y(i) = a * x(i) + y(i)

ld     f0, a        ; load scalar a
lv     v1, rx       ; load 64-element vector x
multsv v2, f0, v1   ; vector-scalar multiply
lv     v3, ry       ; load vector y
addv   v4, v2, v3   ; vector add
sv     ry, v4       ; store result vector
Above is a vector example
Intel MMX is similar, except with a narrower vector

Common Parallelization Strategies
Think of parallelism at the algorithm level
Think of partitioning:
Tasks? (do webpage serving and database lookup for different requests in parallel)
Data? (do similar operations on different data in parallel)
Possibly different for different programs
Programming models
Pthreads - shared-memory processors **
MPI (message passing interface) - clusters **
OpenMP - shared memory
Marking loops as parallel loops
[Figure (Embedded.com): shared-memory organizations - (a) shared cache: processors P1..Pn reach an interleaved first-level cache through a switch, backed by interleaved main memory; (b) bus-based: each processor has a private cache ($) on a shared bus with main memory and I/O devices; (c) dancehall: processors with caches on one side of an interconnection network, memories (Mem) on the other; (d) distributed-memory: each node couples a processor, cache, and local memory, with nodes joined by an interconnection network]
Context
Reason about coherence/consistency:
Synchronization
Sequencing

[Figure: processors P1, P2, P3 with private caches ($) on a bus with memory and I/O devices; memory holds u:5; P1 and P2 read u and cache u:5; P3 reads u, then writes u=7 into its cache; a later read (u=?) by P1 or P2 can still see the stale u:5]

Is there a problem? Replication + writes = coherence problem
Cache coherence
The previous slide makes intuitive sense, but EXACTLY what should coherence guarantee?
Is the previous slide the only way to do business? Are there many options in what coherence can and cannot do?
What are the rules of the game?
Coherence
A = 0 (initial)
Thread1      Thread2      Thread3
A = 10       A = 5        ..
..                        ..
..                        read A
Coherence
A = 0 (initial)
Thread1      Thread2            Thread3
A = 10       Read A             ..
..           ..                 wait for thread2
..           Signal thread 3    read A
Coherence
Hard to distinguish between the previous two slides
Hard to track arbitrary synchronization among arbitrary threads in a large distributed system
Coherence
Fuzzy idea vs. precise definition
For any memory location X, all processors see the writes to X in the same order
Derivative properties
Instantaneous propagation is not necessary - coherence is not instantaneous
Coherence
Correct actions:
On a write, invalidate other copies, or
On a write, update other copies
Snooping:
Every node can see the bus
If all transactions can be observed, each cache can maintain coherence
Hits are guaranteed not stale; misses have to get the latest data
Snoopy coherence
A protocol is defined by:
set of states
state-transition diagram
actions

Basic Choices
Processor actions: PrRd, PrWr
Writeback (BusWB) on replacement of a modified block
States: Modified, Shared, Invalid
Bus transactions: BusRd, BusRdX, BusWB
[MSI state-transition diagram:
M: PrRd / --, PrWr / --; BusRd / BusWB -> S; BusRdX / BusWB -> I
S: PrRd / --, BusRd / --; PrWr / BusRdX -> M; BusRdX / -- -> I
I: PrRd / BusRd -> S; PrWr / BusRdX -> M]
An example
Proc Action       P1 state   P2 state   P3 state
1. P1 reads u     S          -          -
2. P3 reads u     S          -          S
3. P3 writes u    I          -          M
4. P1 reads u     S          -          S
5. P2 reads u     S          S          S
BusWB
There is no snooping of writebacks
WBs are not writes - there is no atomicity requirement. Why?
You just write back the block to memory
Snoopy coherence
Snooping duplicates only the tag array copies, not the data array - why not?
Snoopy coherence
But the tag copies have to be updated
together
Snoopy bus
Coherence makes writes atomic
[Figure: one thread performs Wr A; thread2 and thread3 then rd A and both observe the written value]
Snoopy bus
Pattern 1:
for i = 1 to k
  P1(write, x);           // one write before many reads
  P2--PN-1(read, x);
end for i

Pattern 2:
for i = 1 to k
  for j = 1 to m
    P1(write, x);         // many writes before one read
  end for j
  P2(read, x);
end for i

Read Snarfing
if the tag matches but the block is invalid, snarf the data from the bus
Non-Atomic Bus
Why Synchronize?
Thread 1:                    Thread 2:
lw  $r1, ac-balance          lw  $r1, ac-balance
add $r1, $r1, -50            add $r1, $r1, 100
sw  $r1, ac-balance          sw  $r1, ac-balance
Correct result: ac-balance(new) = ac-balance(old) + 50
Each lw/add/sw sequence must be a critical section

Synchronization
Solution: Locks/Mutex
Only one thread can hold a lock at a time
Event notification
Producer consumer
Just use flags?
Consistency issues

Thread 1:                    Thread 2:
LOCK ac-lock                 LOCK ac-lock
lw  $r1, ac-balance          lw  $r1, ac-balance
add $r1, $r1, 100            add $r1, $r1, -50
sw  $r1, ac-balance          sw  $r1, ac-balance
UNLOCK ac-lock               UNLOCK ac-lock
LOCK
while (lock_variable == 1); /* locked, so wait */
lock_variable = 1;
UNLOCK
lock_variable = 0;
Implementation requires atomicity! (two threads can both see 0, then both set 1 and both enter)
Test&set (atomic):
r = m[x]      // r is a register, m[x] is memory location x
m[x] = 1
if r is 1, then x was already locked; else the accessor gets the lock
Test&set:
T&S: move $s2, 1         // 1 means locked; $s2 contains 1 to signify locked
     ll   $s1, X         // $s1 <- M[X] and sets link reg <- X
     sc   X, $s2         // if link reg == X then M[X] <- $s2 and $s2 <- 1
                         // else sc fails and sets $s2 <- 0
     beq  $s2, $0, T&S   // retry if atomicity failure
     jr   $31            // return $s1, which is the previous lock value

Unlock (slight change):
     sw $0, X
     jr $31
LOCK
while (test&set(x) == 1);
UNLOCK
x = 0;
High contention (many threads want the lock while one thread holds it):
all non-holders keep retrying T&S, but each retry writes a 1 on top of a 1 (the sc within T&S is a write)

Test-and-Test&Set
spin on an ordinary read; attempt T&S only when the lock looks free:
while (x == 1 || test&set(x) == 1);
Contention
TAS is good at low contention
Fairness
TAS offers no guarantees of fairness
Synchronization
Races
May not be visible to the programmer
Hard to debug (irreproducible)
/* initial A = flag = 0 */
P1:                 P2:
A = 1;              while (flag == 0); /* wait */
flag = 1;           print A;

Consider a coalescing write buffer: the write to flag can reach memory before the write to A, so P2 may print the old A
Consistency Model
P1:                  P2:
lw $arg1, x          while (a == 0); /* wait */
jal f                b = g(a);
sw a, $retval

P2 compiled:
      lw    $1, a
loop: beqz  $1, loop
      lw    $arg1, $1
      jal   g
Sequential Consistency
Memory Model
[Figure: processors P1, P2, ..., Pn each issue memory ops in program order; a switch connects one processor at a time to a single memory, yielding a sequential interleaving]
Detecting violation of SC
/* initial A = flag = 0 */
P1:                 P2:
A = 1;              while (flag == 0); /* wait */
flag = 1;           print A;
Sequential Consistency
Programmers think in SC
Announcement
Ch 5b done!
All the interfaces are pre-defined for both the cache and multicore labs - previously this was not so
You have one extra week for the lab