4 December 2015
Chapter 6: Parallel Processors from Client to Cloud
5th Edition
6.1 Introduction
- Multiprocessors: scalability, availability, power efficiency
- Multicore microprocessors
- Hardware
- Software
Synchronization
Subword Parallelism
Cache Coherence
Parallel Programming
- Difficulties:
  - Partitioning
  - Coordination
  - Communications overhead
Amdahl's Law
- Sequential part can limit speedup
- Example: 100 processors, target speedup of 90:

    Speedup = 1 / ((1 - Fparallelizable) + Fparallelizable/100) = 90

- Solving: Fparallelizable = 0.999, i.e., the sequential part can be at most 0.1% of the original time
Scaling Example
- Workload: sum of 10 scalars, plus a matrix sum of 100 elements
- Strong scaling (problem size fixed), as in example:
  - 10 processors: Time = 10 tadd + 100/10 tadd = 20 tadd
  - 100 processors: Time = 10 tadd + 100/100 tadd = 11 tadd
- Weak scaling (problem size proportional to processor count):
  - 10 processors, 10 x 10 matrix: Time = 20 tadd
  - 100 processors, 1000-element matrix: Time = 10 tadd + 1000/100 tadd = 20 tadd
An alternate classification

                            Data Streams
                            Single               Multiple
  Instruction   Single      SISD:                SIMD: SSE
  Streams                   Intel Pentium 4      instructions of x86
                Multiple    MISD:                MIMD:
                            No examples today    Intel Xeon e5345
Example: DAXPY (Y = a x X + Y)
- Conventional MIPS code:

        l.d    $f0,a($sp)      ;load scalar a
        addiu  r4,$s0,#512     ;upper bound of what to load
  loop: l.d    $f2,0($s0)      ;load x(i)
        mul.d  $f2,$f2,$f0     ;a x x(i)
        l.d    $f4,0($s1)      ;load y(i)
        add.d  $f4,$f4,$f2     ;a x x(i) + y(i)
        s.d    $f4,0($s1)      ;store into y(i)
        addiu  $s0,$s0,#8      ;increment index to x
        addiu  $s1,$s1,#8      ;increment index to y
        subu   $t0,r4,$s0      ;compute bound
        bne    $t0,$zero,loop  ;check if done

- Vector MIPS code:

        l.d     $f0,a($sp)     ;load scalar a
        lv      $v1,0($s0)     ;load vector x
        mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
        lv      $v3,0($s1)     ;load vector y
        addv.d  $v4,$v2,$v3    ;add y to product
        sv      $v4,0($s1)     ;store the result
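In C, the loop that both instruction sequences implement is simply (a sketch; the signature is mine, and the scalar code's #512 bound corresponds to 64 doubles):

```c
#include <stddef.h>

/* DAXPY: y(i) = a * x(i) + y(i) -- the loop the MIPS code above
   implements, one element per scalar iteration. */
void daxpy(size_t n, double a, const double x[], double y[]) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```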
Vector Processors
SIMD
- Simplifies synchronization
- Reduced instruction control hardware
- Works best for highly data-parallel applications
Multithreading
- Fine-grain multithreading
- Coarse-grain multithreading
- Simultaneous Multithreading
Multithreading Example
Future of Multithreading
Shared Memory
half = 100;
repeat
  synch();
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
    /* Conditional sum needed when half is odd;
       Processor0 gets missing element */
  half = half/2; /* dividing line on who sums */
  if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
History of GPUs
- 3D graphics processing
GPU Architectures

Programming Languages/APIs
- DirectX, OpenGL
- C for Graphics (Cg), High Level Shader Language (HLSL)
- Compute Unified Device Architecture (CUDA)
[Figure: 8 streaming processors]
Streaming Processors
- Executed in parallel, SIMD style: 8 SPs x 4 clock cycles
- Hardware contexts for 24 warps: registers, PCs, ...
Feature                                        Multicore with SIMD   GPU
SIMD processors                                4 to 8                8 to 16
SIMD lanes/processor                           2 to 4                8 to 16
Multithreading hardware support (SIMD threads) 2 to 4                16 to 32
Single- to double-precision performance ratio  2:1                   2:1
Largest cache size                             8 MB                  0.75 MB
Size of memory address                         64-bit                64-bit
Size of main memory                            8 GB to 256 GB        4 GB to 6 GB
Memory protection at level of page             Yes                   Yes
Demand paging                                  Yes                   No
Integrated scalar processor/SIMD processor     Yes                   No
Cache coherent                                 Yes                   No
Message Passing
Reduction
- Half the processors send, the other half receive and add
- Then a quarter send, a quarter receive and add, and so on
16
4 December 2015
Grid Computing
Interconnection Networks
- Network topologies:
  - Bus
  - Ring
  - 2D Mesh
  - N-cube (N = 3)
  - Fully connected
Multistage Networks
Network Characteristics
- Performance
  - Link bandwidth
  - Total network bandwidth
  - Bisection bandwidth
- Cost
- Power
- Routability in silicon
Code or Applications?
- Traditional benchmarks
Optimizing Performance
- Optimize FP performance
  - Balance adds & multiplies
  - Improve superscalar ILP and use of SIMD instructions
- Software prefetch
- Memory affinity
- Use OpenMP: multithreaded DGEMM
Multithreaded DGEMM
Fallacies
Pitfalls
Concluding Remarks