

Subject : ADVANCED COMPUTER ARCHITECTURE


Department : Computer Sc. & Engineering
Course : B.Tech 7th Semester



Echelon Institute of Technology
Unit 5
Concurrent Processors
1. Vector Processors
2. Vector Memory
3. Multiple Issue Machines
4. Comparing Vector and Multiple Issue Processors
Shared Memory Multiprocessors
1. Basic Issues: Partitioning, Synchronization and Coherency
2. Types of Shared Memory Multiprocessors
3. Memory Coherence in shared Memory Multiprocessors
Shared Memory Multiprocessors
Multiprocessors are usually designed for at least one of
two reasons
Fault Tolerance
Program Speed up
Fault Tolerant Systems: n identical processors ensure that failure of one
processor does not affect the ability of the multiprocessor to continue with
program execution.
These multiprocessors are called high availability or high integrity systems.
These systems may not provide any speed up over a single processor
system.
Program Speed up: Most multiprocessors are designed with the main
objective of improving program speed-up over that of a single processor.
Yet fault tolerance is still an issue, as no design for speed-up ought to come
at the expense of fault tolerance.
It is generally not acceptable for the whole multiprocessor system to fail if any
one of its processors fails.
Shared Memory Multiprocessors
Basic Issues: Three basic issues are associated with
the design of multiprocessor systems:
Partitioning
Scheduling of tasks
Communication and synchronization
Partitioning
This is the process of dividing a program into tasks, each of
which can be assigned to an individual processor for
execution.
The partitioning process occurs at compile time.
The goal of the partitioning process is to uncover the maximum
amount of parallelism possible within certain obvious
machine limitations.
Suppose a program P is converted into
a parallel form Pp. This conversion
consists of partitioning Pp into a set of
tasks Ti.
Pp consists of tasks, some of
which can be executed
concurrently (in parallel).
Program partitioning is usually performed with some
program overhead.
Overhead affects speedup.
The larger the size of the minimum task defined by the
partitioning program, the smaller the effect of program
overhead.
If the uniprocessor program P1 performs O1 operations, then the parallel
version of P1 performs Op operations, where Op ≥ O1.
If available parallelism exceeds the known number of
processors, or several shorter tasks share the same
instruction / data working set, Clustering is used to group
subtasks into a single assignable task.
Partitioning
The detection of parallelism is done by one of three methods:
1. Explicit statement of concurrency in a high-level language: the
programmer defines the boundaries among tasks that can be
executed in parallel (see the sketch following this list).
2. Programmer hints in source statements, which compilers can use
or ignore.
3. Implicit parallelism: sophisticated compilers can detect
parallelism in normal serial code and transform the program for
execution on multiprocessors.
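As a concrete illustration of the first method, the C sketch below (an illustrative example, not part of the original notes; OpenMP is just one possible mechanism) explicitly states that the loop iterations may execute concurrently, so the runtime can partition them into tasks for the available processors. Array names and sizes are arbitrary.

```c
/* Hypothetical sketch of explicit concurrency: the programmer marks the
 * loop as parallel; the runtime partitions its iterations into tasks. */
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) {   /* serial initialisation */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Explicit statement of concurrency: each chunk of iterations becomes
     * an independently schedulable task (requires -fopenmp; without it the
     * loop simply runs serially). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}
```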
Fig.: Partitioning - grouping tasks into process clusters.
Scheduling
Scheduling is concerned with each program's flow of control among its
sub-program tasks; each task may depend on others.
Scheduling
Scheduling is done both statically (at compile time) and
dynamically (at run time).
Static scheduling alone is not sufficient to ensure optimum
speedup or even fault tolerance.
The processor availability is difficult to predict and may vary
from run to run.
Run-time scheduling has the advantage of handling changing
system environments and program structures, but it has the
disadvantage of run-time overhead.
Major run time overheads in run-time scheduling:
1. Information gathering: information about dynamic state of the
program and the state of the system.
2. Scheduling
3. Dynamic execution control: clustering or process creation
4. Dynamic data management: placing data among tasks and processors
in such a way as to minimize the required memory overhead and delay.
Run Time Scheduling Techniques
1. System load balancing:
The objective is to balance the system's loading by dispatching the
ready tasks among the processors' execution queues.
2. Load balancing: relies on estimates of the amount of
computation needed within each concurrent sub-task.
3. Clustering: pairwise inter-process communication information is
developed and used to minimize inter-process communication.
4. Scheduling with compiler assistance: block-level
dynamic program information is gathered at run time.
5. Static scheduling / custom scheduling: inter-process
communication and computational requirements can be
determined at compile time.
Synchronization and Coherency
In a multiprocessor configuration with a high degree of task
concurrency, the tasks must follow an explicit order and
communication between active tasks must be performed in an
orderly way.
Value passing between tasks executing on different
processors is performed by synchronization primitives or
semaphores.
A semaphore is a variable or abstract data type that provides a
simple but useful abstraction for controlling access by multiple
processes to a common resource in a parallel programming
environment.
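As a minimal sketch of the idea (not from the notes; POSIX semaphores are used purely as an example primitive), two threads below serialize their updates to a shared counter with a binary semaphore:

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t mutex;          /* binary semaphore guarding the shared value */
static long shared_counter;  /* the "critical value in memory"             */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sem_wait(&mutex);    /* P operation: acquire the resource */
        shared_counter++;    /* critical section                  */
        sem_post(&mutex);    /* V operation: release the resource */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    sem_init(&mutex, 0, 1);                     /* initial value 1 -> mutual exclusion */
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", shared_counter);  /* 200000 when properly synchronized */
    sem_destroy(&mutex);
    return 0;
}
```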
Synchronization is the means to ensure that multiple processors
have a coherent or similar view of critical values in memory.
Memory coherence is the property of memory that ensures that a
read operation returns the same value that was stored by the
latest write to the same address.
In complex systems of multiple processors, the program order of
memory-related operations may be different from the order in which
the operations are actually executed.
Different degrees of operation ordering in multiprocessors:
1. Sequential Consistency
2. Processor Consistency
3. Weak Consistency
4. Release Consistency
Synchronization and Coherency
Fig. a: Sequential Consistency: the result of
any execution is the same as if the operations of all
processors were executed in some sequential
order.
Fig. b: Processor Consistency: loads
followed by a store are performed in
program order, but a store followed by a load is
not necessarily performed in program order.
Fig. c: Weak Consistency: synchronization
operations are performed before any
subsequent memory operation is performed,
and all pending memory operations are
performed before any synchronization
operation is performed.
Fig. d: Release Consistency: synchronization
operations are split into acquire (lock) and
release (unlock), and these operations are
processor consistent.
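The classic "store buffer" litmus test, sketched below (an illustration added here, not part of the notes), shows why the distinction matters: under sequential consistency at least one thread must observe the other's store, so the outcome r1 = r2 = 0 is impossible, whereas processor-consistent or weaker machines may allow it. Observing the relaxed outcome in practice may take many runs.

```c
#include <pthread.h>
#include <stdio.h>

static volatile int x, y;   /* shared flags, both initially 0 */
static int r1, r2;          /* values read by each processor  */

static void *p1(void *a) { (void)a; x = 1; r1 = y; return NULL; }
static void *p2(void *a) { (void)a; y = 1; r2 = x; return NULL; }

int main(void)
{
    pthread_t t1, t2;
    x = y = 0;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* (r1, r2) = (0, 0) indicates a non-sequentially-consistent ordering. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```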
Types of Shared Memory Multiprocessors
The variety in multiprocessors results from the
way memory is shared between processors.
1. Shared data cache, shared memory.
2. Separate data cache but shared bus, shared memory.
3. Separate data cache with separate buses
leading to a shared memory.
4. Separate processors and separate memory
modules interconnected with a multi-stage
interconnection network.
Fig.: Type 2 - separate data cache but shared bus, shared memory.
Fig.: Type 4 - separate processors and separate memory modules interconnected with a multi-stage interconnection network.
Memory Coherence in Shared Memory Multiprocessors
Each node in a multiprocessor system possesses a local
cache.
Since the address spaces of the processors overlap,
different processors can be holding (caching) the same
memory segment at the same time.
Further, each processor may be modifying these cached
locations simultaneously.
The cache coherency problem is to ensure that all caches
contain the same, most up-to-date copy of the data.
The protocol that maintains the consistency of data in all
the local caches is called the cache coherency protocol.
Snoopy Protocol
Directory Protocol
Snoopy Protocol
A write is broadcast to all processors in the system.
Broadcast protocols are usually reserved for shared bus
multiprocessors.
All processors share memory through a common memory
bus.
Snoopy protocols assume that all of the processors are
aware and receive all bus transactions (snoop on the bus).
Fig.: Snoopy protocol - processors connected to a shared memory over a common bus.
Snoopy protocols are further classified based on the type of
action the local processor must take when an altered line is
recognized.
There are two types of actions:
Invalidate: all copies in other caches are invalidated before
changes are made to the data in a particular line. The invalidate
signal is received from the bus, and all caches that hold the
same cache line invalidate their copies.
Update: Writes are broadcast on the bus and caches sharing
the same line snoop for data on the bus and update the contents
and state of their cache lines.
Snoopy Protocol
Three processors (P1, P2, and Pn) hold
a consistent copy of block X
in their local caches.
Using write-invalidate, processor P1
updates its cached copy from X to X', and all
other copies are invalidated via the
bus.
Write-update requires the new
block content X' to be broadcast to
all cache copies via the bus.
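A toy sketch of the write-invalidate case is given below; the state names, cache count, and block values are illustrative assumptions and do not model any particular commercial protocol.

```c
#include <stdio.h>

#define NCACHES 3
enum state { INVALID, SHARED, MODIFIED };

static enum state line_state[NCACHES];   /* state of block X in each cache */
static int        line_value[NCACHES];   /* cached copy of X               */

/* Processor 'writer' writes 'value' to X: the write is broadcast on the
 * bus and every other cache holding X invalidates its copy.              */
static void write_invalidate(int writer, int value)
{
    for (int c = 0; c < NCACHES; c++)
        if (c != writer && line_state[c] != INVALID)
            line_state[c] = INVALID;      /* snooped invalidate */
    line_state[writer] = MODIFIED;
    line_value[writer] = value;
}

int main(void)
{
    for (int c = 0; c < NCACHES; c++) {   /* every cache starts with X = 10 */
        line_state[c] = SHARED;
        line_value[c] = 10;
    }
    write_invalidate(0, 11);              /* P1 writes X -> X'              */
    for (int c = 0; c < NCACHES; c++)
        printf("cache %d: state=%d value=%d\n", c, line_state[c], line_value[c]);
    return 0;
}
```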
Directory Based Protocols
These protocols maintain the state information of the
cache lines at a single location called the directory.
Only the caches that are listed in the directory, and are
thus known to possess a copy of the newly altered line,
are sent write/update information.
Since there is no need to connect to all caches, in
contrast to snoopy protocols, directory-based protocols
can scale better.
As the number of processor nodes and the number of cache
lines increase, however, the size of the directory can become very
large.
In addition to the invalidate-versus-update action taken by
the local processor, there is an important distinction among
directory-based protocols depending on directory placement.
Central Directory: Directories specifying line
ownership or usage can be kept with memory
(central).
Distributed Directory: Directories specifying
line ownership or usage can be kept with the
processor-caches (distributed).
Central Directory:
Memory will be updated on a write update.
Directory is associated with the memory and contains information for
each line in memory.
Each line has a bit vector which contains a bit for each processor
cache in the system.
Fig.: Centralized directory structure.
Directory size (bits) = Number of caches x Number of memory lines
Distributed Directory:
Memory holds only one pointer per line, which identifies the cache that last
requested it. A subsequent request to that line is then referred to that
cache, and the requestor's ID is placed at the head of the list.
Main memory is generally not updated, and the true picture of the
memory state is found only in the group of caches.
Directory size (bits) = Number of memory lines × ⌈log2 (number of caches)⌉
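The short program below simply evaluates the two directory-size formulas for an assumed configuration of 64 caches and 2^20 memory lines; the figures are illustrative only.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    double caches = 64, lines = 1 << 20;                  /* assumed configuration */

    double central     = caches * lines;                  /* bit vector per line   */
    double distributed = lines * ceil(log2(caches));      /* one pointer per line  */

    printf("central directory    : %.0f bits\n", central);
    printf("distributed directory: %.0f bits\n", distributed);
    return 0;
}
```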
Concurrent Processors
Processors that can execute multiple instructions at the
same time (concurrently).
Concurrent processors can make simultaneous accesses to
memory and can execute multiple operations
simultaneously.
Processor performance depends on compiler ability,
execution resources and memory system design.
There are two main types of concurrent processors.
Vector Processors: a single vector instruction replaces
multiple scalar instructions. This depends on the compiler's ability
to vectorize the code, i.e. to transform loops into a sequence of
vector operations.
Multiple Issue Processors: Instructions whose effects are
independent of each other are executed concurrently.
Vector Processors
A vector computer or vector processor is a machine
designed to efficiently handle arithmetic operations on
elements of arrays, called vectors.
Such machines are especially useful in high-performance
scientific computing, where matrix and vector arithmetic are
quite common. Supercomputers such as the Cray Y-MP are
examples of vector processors.
A vector processor is an ensemble of hardware resources,
including vector registers, register counters, etc.
Vector processing occurs when arithmetic or logical
operations are applied to vectors.
Vector processors achieve considerable speed up in
processor performance over that of simple pipelined
processors.
Vector Processors
Six types of Vector Instructions
1. Vector-Vector Instructions
2. Vector-Scalar Instructions
3. Vector-Memory Instructions
4. Vector-Reduction Instructions
5. Gather and Scatter Instructions
6. Masking Instructions
Vector - Vector Instructions
One or two vector operands are fetched from the respective
vector registers and the result is produced in another vector
register.
Vj × Vk → Vi
Vector-Scalar Instructions
Each vector element is multiplied by a scalar
operand to produce a vector of equal length.
s × Vi → Vj
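In scalar form the two instruction types above correspond to the loops below; a vectorizing compiler would replace each loop with a single vector instruction. The vector length and function names are illustrative assumptions.

```c
#define VL 64   /* assumed vector register length */

/* Vj x Vk -> Vi */
void vector_vector(const double *Vj, const double *Vk, double *Vi)
{
    for (int i = 0; i < VL; i++)
        Vi[i] = Vj[i] * Vk[i];
}

/* s x Vi -> Vj */
void vector_scalar(double s, const double *Vi, double *Vj)
{
    for (int i = 0; i < VL; i++)
        Vj[i] = s * Vi[i];
}
```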
Vector-Memory Instructions
M → V (Vector Load)
V → M (Vector Store)
Gather and Scatter Instructions
Two vector registers are used to gather or to scatter
vector elements randomly throughout the memory.
M → V1 × V0 (Gather)
V1 × V0 → M (Scatter)
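In loop form, gather and scatter can be sketched as below; V0 is assumed to hold the index vector and V1 the data vector, and the memory array M and the vector length are illustrative.

```c
#define VL 64   /* assumed vector register length */

/* Gather: M -> V1 x V0 (indexed loads) */
void gather(const double *M, const int *V0, double *V1)
{
    for (int i = 0; i < VL; i++)
        V1[i] = M[V0[i]];
}

/* Scatter: V1 x V0 -> M (indexed stores) */
void scatter(double *M, const int *V0, const double *V1)
{
    for (int i = 0; i < VL; i++)
        M[V0[i]] = V1[i];
}
```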
Masking Instructions
A mask vector is used to compress or to expand a
vector to a shorter or longer index vector,
respectively.
V0 × Vm → V1
VM register = masking register
VL register = length of the vector
being tested
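A compress operation under a mask can be sketched as the loop below; the byte-per-element mask representation and the names are illustrative assumptions.

```c
#define VL 64   /* assumed vector register length */

/* V0 x Vm -> V1: elements of V0 whose mask bit is set are packed into V1;
 * the number of packed elements (the new vector length) is returned.      */
int compress(const double *V0, const unsigned char *Vm, double *V1)
{
    int n = 0;
    for (int i = 0; i < VL; i++)
        if (Vm[i])
            V1[n++] = V0[i];
    return n;   /* value that would be placed in the VL register */
}
```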
Vector Memory
Simple low order interleaving used in normal pipelined
processors is not suitable for vector processors.
Since access in the case of vectors is non-sequential but
systematic, if the array dimension or stride (the address
distance between adjacent elements) is the same as the
interleaving factor, then all references will concentrate on the
same module.
It is quite common for these strides to be of the form 2^k or
other even dimensions.
So vector memory designs use address remapping and a
prime number of memory modules.
Hashed addressing is a technique for dispersing addresses.
Hashing is a strict 1:1 mapping of the bits in X to form a new
address X', based on simple manipulations of the bits in X.
A memory system used in vector / matrix accessing consists of the following units:
Address hasher
2^k + 1 memory modules
Module mapper
This may add certain overhead and extra cycles to memory access, but since the purpose of the memory is to access vectors, the added latency can be overlapped in most cases.
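The sketch below illustrates the idea with 17 (= 2^4 + 1) memory modules and a simple invertible XOR-fold as the address hasher; both the hash function and the stride-16 access pattern are illustrative choices, not a description of any real machine.

```c
#include <stdio.h>

#define MODULES 17                       /* 2^4 + 1, a prime number of modules */

/* A strict 1:1 remapping of the address bits (an XOR-fold is invertible). */
static unsigned hash_address(unsigned x)
{
    return x ^ (x >> 7) ^ (x >> 13);
}

int main(void)
{
    /* A stride-16 vector access pattern: with a power-of-two number of
     * modules such strides tend to concentrate on few modules, while the
     * prime module count plus hashing disperses the references.           */
    for (unsigned i = 0; i < 8; i++) {
        unsigned addr = i * 16;
        unsigned x = hash_address(addr);
        printf("addr %3u -> module %2u, offset %u\n",
               addr, x % MODULES, x / MODULES);
    }
    return 0;
}
```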
Modeling Vector Memory Performance
Vector memory is designed for multiple simultaneous requests to
memory.
Operand fetching and storing is overlapped with vector execution.
Three concurrent operand accesses to memory are a common
target, but the increased cost of the memory system may limit this to two.
Chaining may require even more accesses.
Another issue is the degree of bypassing, or out-of-order requests,
that a source can make to the memory system.
In case of a conflict, i.e. a request directed to a busy module,
the source can continue to make subsequent requests only if the
unserviced requests are held in a buffer.
Assume each of s access ports to memory has a buffer of size
TBF/s which holds requests that are being held due to a conflict.
For each source, degree of bypassing is defined as the allowable
number of requests waiting before stalling of subsequent
requests occurs.
If Qc is the expected number of denied requests per
module and m is the number of modules, then the buffer size
must be large enough to hold the denied requests:
Buffer = TBF > m × Qc
If n is the total number of requests made and B is the
bandwidth achieved, then
m × Qc = n - B   (denied requests)
Modeling Vector Memory Performance
Typical Buffer entries include:
1. Request source ID.
2. Request source tag. (i.e VR number)
3. Module ID
4. Address for request to a module
5. Scheduled cycle time indicating when module is free.
6. Entry priority ID
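One possible C rendering of such an entry is sketched below; the field widths are illustrative assumptions only.

```c
/* A buffer entry holding a bypassed (conflicting) memory request. */
struct buffer_entry {
    unsigned src_id      : 4;   /* request source ID                      */
    unsigned src_tag     : 4;   /* request source tag, e.g. a VR number   */
    unsigned module_id   : 5;   /* target memory module                   */
    unsigned addr        : 12;  /* address of the request within a module */
    unsigned ready_cycle : 8;   /* cycle at which the module becomes free */
    unsigned priority    : 3;   /* entry priority ID                      */
};
```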
Gamma (γ) Binomial Model
Assume that each vector source issues a request each cycle (a request
rate of 1) and each physical requestor has the same buffer capacity and
characteristics.
If the vector processor can make s requests per cycle and there are t
cycles per Tc, then:
Total requests per Tc = t × s = n
This is the same as n requests per Tc in the simple binomial model.
If γ is the mean queue size of bypassed requests awaiting service, then
each of the buffered requests also makes a request.
From the memory modeling point of view this is equivalent to the buffer
requesting service each cycle until the module is free.
Total requests per Tc = t × s + t × s × γ = t × s (1 + γ) = n (1 + γ)
Using the simple binomial equation, the achieved bandwidth can then be
evaluated as B(m, n, γ) with n (1 + γ) requests per Tc.
Calculating γ_opt
γ is the mean expected bypassed request queue per source.
If we continue to increase the number of bypass buffer registers, we can
achieve a γ_opt which totally eliminates contention.
No contention occurs when B = n, i.e. B(m, n, γ) = n.
This occurs when a = ρ = n/m.
Since the MB/D/1 queue size is given by
Q = (a^2 - p·a) / (2(1 - a))
substituting a = ρ = n/m and p = 1/m we get:
Q = (n^2 - n) / (2(m^2 - nm)) = (n/m)(n - 1) / (2m - 2n)
Since Q = (n(1 + γ) - B) / m,
mQ = n(1 + γ) - B
Now for γ_opt, (n - B) = 0
So γ_opt = (m/n) × Q
So γ_opt = (n - 1) / (2m - 2n)
And the mean total buffer size TBF = n × γ_opt.
To avoid overflow the buffer may be considerably larger, perhaps 2 × TBF.
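As a quick numeric check of these expressions, the short program below evaluates γ_opt and TBF for illustrative values of m = 16 modules and n = 4 requests per cycle.

```c
#include <stdio.h>

int main(void)
{
    double m = 16.0, n = 4.0;                            /* illustrative values */

    double gamma_opt = (n - 1.0) / (2.0 * m - 2.0 * n);  /* (n-1)/(2m-2n)       */
    double tbf       = n * gamma_opt;                    /* mean total buffer   */

    printf("gamma_opt             = %.3f\n", gamma_opt);
    printf("mean TBF              = %.3f entries\n", tbf);
    printf("suggested buffer (2x) = %.3f entries\n", 2.0 * tbf);
    return 0;
}
```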
Multiple Issue Machines
These machines evaluate dependencies among a
group of instructions, and groups found to be
independent are simultaneously dispatched to
multiple execution units.
There are two broad classes of multiple issue
machines
Statically Scheduled: detection process is done by the
compiler.
Dynamically Scheduled: Detection of independent
instructions is done by hardware in the decoder at run
time.
Statically Scheduled Machines
Sophisticated compilers search the code at compile time, and
instructions found to be independent in their effect are assembled into
instruction packets, which are decoded and executed at run time.
Statically scheduled processors must have some additional
information either implicitly or explicitly indicating instruction
packet boundaries.
Early statically scheduled machines include the so-called VLIW
(Very Long Instruction Word) machines.
These machines use an instruction word that consists of 8 to 10
instruction fragments.
Each fragment controls a designated execution unit.
To accommodate multiple instruction fragments the instruction
word is typically over 200 bits long.
The register set is extensively multi-ported to support
simultaneous access by multiple execution units.
To avoid the performance limitation caused by the occurrence of branch
instructions, a novel compiler technology called trace
scheduling is used.
In trace scheduling, branches are predicted where possible
and the predicted path is incorporated into a large basic block.
If an unanticipated (or mispredicted) branch occurs during
the execution of the code, the proper result is fixed up at the end of
the basic block for use by the target basic block.
Dynamically Scheduled Machines
In dynamically scheduled machines, detection of
independent instructions is done by hardware at
run time.
The detection may also be done at compile time
and code suitably arranged to optimize
execution patterns.
At run time the search for concurrent instructions
is restricted to the localities of the last executing
instruction.
Superscalar Machines
The maximum program speed up available in
multiple issue machines, largely depends on
sophisticated compiler technology.
The potential speedup available from a multi-flow
compiler using trace scheduling is generally less
than 3.
Recent multiple issue machines, having more
modest objectives, are called Superscalar
Machines.
The ability to issue multiple instructions in a single
cycle is referred to as Superscalar implementation.
Comparing Vector and
Multiple Issue Processors
The goal of any processor design is to provide cost
effective computation across a range of
applications.
So we should compare the two technologies based
on the following two factors:
Cost
Performance
Cost Comparison
While comparing cost, we must approximate the area used
by each technology in the form of its additional required
units.
The cost of execution units is about the same for both (for
same maximum performance).
A major difference lies in the storage hierarchy.
Both rely heavily on multi ported registers.
These registers occupy a significant amount of area. If p is the
number of ports, the area required is:
Area = (number of registers + 3p) × (bits per register + 3p) rbe
Most vector processors have 8 sets of 64 element registers
with each element being 64 bit in size.
Each vector register is dual ported ( a read port and a write
port). Since registers are sequentially accessed each port can
be shared by all elements in the register set.
There is an additional switching overhead to switch each of the n
vector registers to each of the p external ports:
Switch area = 2 × (bits per register) × p × (number of registers)
So the area used by the register set in a vector processor (supporting
8 external ports) is:
Area = 8 × [(64 + 6) × (64 + 6)] = 39,200 rbe
Switch area = 2 × (64) × 8 × (8) = 8,192 rbe
A multiple issue processor with 32 registers, each 64
bits, supporting 8 ports will require:
Area = (32 + 3(8)) × (64 + 3(8)) = 4,928 rbe
So vector processors use almost 42,464 rbe of extra area
compared to MI processors.
This extra area corresponds to about 70,800 cache bits
(at 0.6 rbe/bit), i.e. approximately 8 KB of data cache.
Vector processors use a small data cache.
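The arithmetic above can be reproduced with the short program below; the parameter names are mine, and the formulas simply follow the text.

```c
#include <stdio.h>

int main(void)
{
    /* Vector processor: 8 vector registers of 64 x 64-bit elements,
     * each register set dual-ported (p = 2, so 3p = 6).              */
    double v_area   = 8.0 * (64 + 6) * (64 + 6);       /* 39,200 rbe */
    /* Switch connecting the 8 vector registers to p = 8 external ports. */
    double v_switch = 2.0 * 64 * 8 * 8;                /*  8,192 rbe */

    /* Multiple-issue processor: 32 x 64-bit registers with p = 8 ports. */
    double mi_area  = (32 + 3 * 8) * (64.0 + 3 * 8);   /*  4,928 rbe */

    double extra = v_area + v_switch - mi_area;        /* ~42,464 rbe */
    printf("vector register area : %.0f rbe\n", v_area);
    printf("vector switch area   : %.0f rbe\n", v_switch);
    printf("MI register area     : %.0f rbe\n", mi_area);
    printf("extra area           : %.0f rbe (~%.0f cache bits at 0.6 rbe/bit)\n",
           extra, extra / 0.6);
    return 0;
}
```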
Multiple issue machines require a larger data cache to ensure high
performance.
Vector processors require support hardware for managing access to
the memory system.
Also, a high degree of interleaving is required in the memory system to
support the processor bandwidth.
M.I. machines must support 4-6 reads and 2-3 writes per cycle. This
increases the area required by the buses between the arithmetic units and
the registers.
M.I. machines must access and hold multiple instructions each
cycle from the I-cache.
This increases the size of the I-fetch path between the I-cache and the
instruction decoder / instruction register.
At the instruction decoder, multiple instructions must be decoded
simultaneously and detection of instruction independence must be
performed.
Performance Comparison
The performance of vector processors depends primarily on two
factors:
Percentage of code that is vectorizable.
Average length of vectors.
We know that n_1/2, the vector size at which the vector processor
achieves approximately half its asymptotic performance, is roughly the same
as the length of the arithmetic plus memory access pipeline.
For short vectors the data cache is sufficient in M.I. machines, so for
short vectors M.I. processors would perform better than an equivalent
vector processor.
As vectors get longer, the performance of the M.I. machine becomes much
more dependent on the size of its data cache, while the vector processor's
performance improves relative to its n_1/2.
So for long vectors, performance would be better in the case of the vector
processor.
The actual difference depends largely on sophistication in compiler
technology.
The compiler can recognize the occurrence of a short vector and treat that
portion of code as if it were scalar code.
References
Computer Architecture by Michael J. Flynn
Advanced Computer Architecture by Kai Hwang
