Figure 1.1: SISD processor organizations
They are also called scalar processors, i.e., one instruction at a time, and each instruction has only one set of operands.
Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
Single data: only one data stream is being used as input during any one clock cycle.
Deterministic execution.
Instructions are executed sequentially.
This is the oldest and, until recently, the most prevalent form of computer.
Examples: most PCs, single-CPU workstations and mainframes.
B) Single Instruction Stream, Multiple Data Stream (SIMD)
Single instruction: all processing units execute the same instruction issued by the control unit at any given clock cycle; as shown in the figure, multiple processors execute the instruction given by one control unit.
Multiple data: each processing unit can operate on a different data element; as shown in the figure below, the processors are connected to a shared memory or interconnection network providing multiple data to the processing units.
Thus a single instruction is executed by different processing units on different sets of data, as shown in figure 1.2.
This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
Best suited for specialized problems characterized by a high degree of regularity, such as image processing.
Figure 1.2: SIMD processor organizations
C) Multiple Instruction Stream, Single Data Stream (MISD)
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via independent instruction streams; as shown in figure 1.3, a single data stream is forwarded to different processing units, each connected to a different control unit and executing the instructions given to it by the control unit to which it is attached.
Thus in these computers the same data flows through a linear array of processors executing different instruction streams. This architecture is also known as a systolic array, for pipelined execution of specific instructions.
Figure 1.3: MISD processor organizations
Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
D) Multiple Instruction Stream, Multiple Data Stream (MIMD)
• Multiple instructions: every processor may be executing a different instruction stream.
• Multiple data: every processor may be working with a different data stream, as shown in figure 1.4; the multiple data streams are provided by shared memory.
MIMD machines can be categorized as loosely coupled or tightly coupled depending on the sharing of data and control.
Figure 1.4: MIMD processor organizations
Figure 1.5: Shared Memory (UMA)
Non-Uniform Memory Access (NUMA):
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across the link is slower
If cache coherency is maintained, then it may also be called CC-NUMA (Cache Coherent NUMA).
Advantages:
• Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
• The programmer is responsible for many of the details associated with data communication between processors.
• It may be difficult to map existing data structures, based on global memory, to this memory organization.
Multi-vector and SIMD Computers
A vector operand contains an ordered set of n elements, where n is called the length of the vector. Each element in a vector is a scalar quantity, which may be a floating-point number, an integer, a logical value or a character.
A vector processor consists of a scalar processor and a vector unit, which could be thought of as an independent functional unit capable of efficient vector operations.
Vector Supercomputer
Vector computers have hardware to perform the vector operations efficiently. Operands cannot be used directly from memory but rather are loaded into registers and are put back into registers after the operation. Vector hardware has the special ability to overlap, or pipeline, operand processing. Vector functional units are pipelined and fully segmented: each stage of the pipeline performs a step of the function on different operand(s), and once the pipeline is full, a new result is produced each clock period (cp).
Control Dependence: This refers to the situation where the order of execution of statements cannot be determined before run time. For example, in any conditional statement the flow of control depends on the outcome of the condition. Different paths taken after a conditional branch may depend on the data, hence we need to eliminate this dependence among the instructions. This dependence also exists between operations performed in successive iterations of a looping procedure. Control dependence often prohibits parallelism from being exploited.
Control-independent example:
for (i = 0; i < n; i++) {
    a[i] = c[i];
    if (a[i] < 0) a[i] = 1;
}
Control-dependent example:
for (i = 1; i < n; i++) {
    if (a[i-1] < 0) a[i] = 1;
}
Control dependence thus prevents parallelism from being exploited; compilers are used to eliminate this control dependence and exploit the parallelism.
Resource dependence:
Data and control dependences are based on the independence of the work to be done. Resource dependence is concerned with conflicts in using shared resources, such as registers, integer and floating-point ALUs, etc. ALU conflicts are called ALU dependence. Memory (storage) conflicts are called storage dependence.
Bernstein’s Conditions
Bernstein’s conditions are a set of conditions which must hold if two processes are to execute in parallel.
Notation: Ii is the set of all input variables for a process Pi; Ii is also called the read set or domain of Pi. Oi is the set of all output variables for a process Pi; Oi is also called the write set of Pi.
If P1 and P2 can execute in parallel (which is written as P1 || P2), then:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Software Parallelism
Software parallelism is defined by the control and data dependence of programs, and is revealed in the program’s flow graph, i.e., it is defined by dependences within the code and is a function of algorithm, programming style, and compiler optimization.
Levels of Parallelism
Instruction Level Parallelism
This fine-grained, or smallest-granularity, level typically involves fewer than 20 instructions per grain. The number of candidates for parallel execution varies from 2 to thousands, with about five instructions or statements being the average level of parallelism.
Advantages:
• There are usually many candidates for parallel execution
• Compilers can usually do a reasonable job of finding this parallelism
Loop-level Parallelism
A typical loop has fewer than 500 instructions. If a loop operation is independent between iterations, it can be handled by a pipeline or by a SIMD machine. Loops are the most optimized program construct to execute on a parallel or vector machine. Some loops (e.g. recursive ones) are difficult to handle. Loop-level parallelism is still considered fine-grain computation.
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay requirements are observed, and no two nodes are executing on the same processor at the same time. Some general scheduling goals:
• Schedule all fine-grain activities in a node to the same processor to minimize communication delays.
• Select grain sizes for packing to achieve better schedules for a particular parallel machine.
Node Duplication
Grain packing may potentially eliminate interprocessor communication, but it may not always produce a shorter schedule. By duplicating nodes (that is, executing some instructions on multiple processors), we may eliminate some interprocessor communication, and thus produce a shorter schedule.
Demand-Driven Mechanisms
Data-driven machines select instructions for execution based on the availability of their operands; this is essentially a bottom-up approach.
Demand-driven machines take a top-down approach, attempting to execute the instruction (a demander) that yields the final result. This triggers the execution of instructions that yield its operands, and so forth. The demand-driven approach matches naturally with functional programming languages (e.g. LISP and SCHEME).
Pattern driven computers: An instruction is executed when we obtain a particular data pattern as output. There are two types of pattern driven computers.
String-reduction model: each demander gets a separate copy of the expression string to evaluate; each reduction step has an operator and embedded references to demand the corresponding operands; each operator is suspended while its arguments are evaluated.
Graph-reduction model: the expression graph is reduced by evaluation of branches or sub-graphs, possibly in parallel, with demanders given pointers to the results of reductions. Based on sharing of pointers to arguments; traversal and reversal of pointers continues until constant arguments are encountered.
Diameter: The maximum distance between any two processors in the network. In other words, the diameter is the maximum number of (routing) processors through which a message must pass on its way from source to destination. The diameter thus measures the maximum delay for transmitting a message from one processor to another; since it determines communication time, the smaller the diameter, the better the network topology.
Connectivity: How many paths are possible between any two processors, i.e., the multiplicity of paths between two processors. Higher connectivity is desirable as it minimizes contention.
Arc connectivity of the network: the minimum number of arcs that must be removed to break the network into two disconnected networks. The arc connectivity of various networks is as follows:
• 1 for linear arrays and binary trees
• 2 for rings and 2-d meshes
• 4 for 2-d torus
• d for d-dimensional hypercubes
The larger the arc connectivity, the lower the contention, and the better the network topology.
Channel width: The channel width is the number of bits that can be communicated simultaneously by the interconnection bus connecting two processors.
Bisection Width and Bandwidth: In order to divide the network into two equal halves we must remove some communication links. The minimum number of such communication links that have to be removed is called the bisection width. The bisection width basically tells us the largest number of messages which can be sent simultaneously (without needing to use the same wire or routing processor at the same time, and so delaying one another), no matter which processors are sending to which other processors. Thus the larger the bisection width, the better the network topology is considered. Bisection bandwidth is the minimum volume of communication allowed between the two halves of the network with equal numbers of processors. This is important for networks with weighted arcs, where the weights correspond to the link width (i.e., how much data it can transfer). The larger the bisection bandwidth, the better the network topology is considered.
Cost: The cost of a network can be estimated by a variety of criteria; here we take the number of communication links or wires used to design the network as the basis of cost estimation. The smaller the number of links, the better the cost.
Data Routing Functions: A data routing network is used for inter-PE data exchange. It can be static, as in the case of a hypercube routing network, or dynamic, such as a multistage network. Various types of data routing functions are shifting, rotating, permutation (one to one), broadcast (one to all), multicast (many to many), personalized broadcast (one to many), shuffle, exchange, etc.
In linear and matrix structures, processors are interconnected with their neighbors in a regular structure on a plane. A torus is a matrix structure in which elements at the matrix borders are connected in the frame of the same lines and columns. In a complete connection structure, all elements (e.g. processors) are directly interconnected (point-to-point).
Figure 1.14: A complete interconnection    Figure 1.15: A Ring    Figure 1.16: A Chordal Ring
In a tree structure, system elements are set in a hierarchical structure from the root to the leaves, as shown in the figure below. Either all elements of the tree (nodes) are processors, or only the leaves are processors and the rest of the nodes are linking elements which intermediate in transmissions. If from one node 2 or more connections go to different nodes towards the leaves, we speak of a binary or k-nary tree. If from one node more than one connection goes to the neighboring node, we speak of a fat tree. A binary tree in which the number of connections between neighboring nodes doubles in the direction of the root provides a uniform transmission throughput between the tree levels, a feature not available in a standard tree.
The diameter of a network determines the number of transfers that have to be done to send data between its most distant nodes. In this respect hypercubes have very good properties, especially for a very large number of constituent nodes; due to this, hypercubes are popular networks in existing parallel systems.
Figure 1.19: A two-by-two switching box and its four interconnection states
A multistage network is capable of connecting an arbitrary input terminal to an arbitrary output terminal. Generally it consists of n stages, where N = 2^n is the number of input and output lines, and each stage uses N/2 switch boxes. The interconnection patterns from one stage to the next are determined by the network topology. Each stage is connected to the next stage by at least N paths. The total wait time is proportional to the number of stages, n, and the total cost depends on the total number of switches used, which grows as N log2 N.
The control structure can be individual stage control, i.e., the same control signal is used to set all switch boxes in the same stage, so we need n control signals. The second control structure is individual box control, where a separate control signal is used to set the state of each switch box. This provides flexibility but at the same time requires nN/2 control signals, which increases the complexity of the control circuit. A compromise between the two is partial stage control.
Bus networks
A bus is the simplest type of dynamic interconnection network. It constitutes a common data transfer path for many devices. Depending on the type of implemented transmissions we have serial busses and parallel busses. The devices connected to a bus can be processors, memories, I/O units, as shown in the figure below.
Figure 1.20: A diagram of a system based on a single bus    Figure 1.21: A diagram of a system based on a multibus
Only one device connected to a bus can transmit data at a time, though many devices can receive it. When several devices receive the data we speak of a multicast transmission; if the data are meant for all devices connected to the bus we speak of a broadcast transmission. Access to the bus must be synchronized. This is done with the use of two methods: a token method and a bus arbiter method. With the token method, a token (a special control message or signal) circulates between the devices connected to the bus and gives the right to transmit on the bus to a single device at a time. The bus arbiter receives data transmission requests from the devices connected to the bus. It selects one device according to a selected strategy (e.g. using a system of assigned priorities) and sends an acknowledge message (signal) to that requesting device, granting it the transmitting right. After the selected device completes the transmission, it informs the arbiter, which can then select another request. The receiver(s) address is usually given in the header of the message. Special header values are used for broadcasts and multicasts. All receivers read and decode headers; those devices that are specified in the header read in the data transmitted over the bus.
The throughput of a network based on a bus can be increased by the use of a multibus network, shown in the figure below. In this network, processors connected to the busses can transmit data in parallel (one per bus) and many processors can read data from many busses at a time.
Crossbar switches
A crossbar switch is a circuit that enables many interconnections between elements of a parallel system at a time. A crossbar switch has a number of input and output data pins and a number of control pins. In response to control instructions set on its control inputs, the crossbar switch implements a stable connection of a determined input with a determined output. The diagrams of a typical crossbar switch are shown in the figure below.
It requires expensive memory control logic and a large number of cables and connections.