
Subject Name: Advanced Computer Architecture

Subject Code: CS-6001


Semester: 6th

Department of Computer Science and Engineering


Subject Notes
Subject Code: CS-6001 Subject Name: Advanced Computer Architecture
UNIT-1
Flynn's Classification
Flynn's classification distinguishes multi-processor computer architectures along two independent dimensions: the instruction stream and the data stream. An instruction stream is the sequence of instructions executed by the machine, and a data stream is the sequence of data, including inputs and partial or temporary results, used by the instruction stream. Each of these dimensions can have only one of two possible states: Single or Multiple. Flynn's classification depends on the distinction between the performance of the control unit and the data-processing unit rather than on its operational and structural interconnections. The four categories of Flynn's classification and the characteristic features of each are given below.

a) Single Instruction Stream, Single Data Stream (SISD)


Figure 1.1 represents the organization of a simple SISD computer having one control unit, one processor unit and a single memory unit.

Figure 1.1: SISD processor organization (the control unit sends the instruction stream to the processor, which exchanges the data stream with memory)
• They are also called scalar processors, i.e., one instruction is executed at a time and each instruction has only one set of operands.
• Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
• Single data: only one data stream is being used as input during any one clock cycle.
• Deterministic execution.
• Instructions are executed sequentially.
• This is the oldest and, until recently, the most prevalent form of computer.
• Examples: most PCs, single-CPU workstations and mainframes.

b) Single Instruction Stream, Multiple Data Stream (SIMD) processors

• A type of parallel computer.
• Single instruction: all processing units execute the same instruction, issued by the control unit, at any given clock cycle, as shown in the figure where multiple processors execute the instruction given by one control unit.
• Multiple data: each processing unit can operate on a different data element; as shown in the figure below, the processors are connected to a shared memory or interconnection network that provides multiple data streams to the processing units.
• Thus a single instruction is executed by different processing units on different sets of data, as shown in figure 1.2.
• This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
• Best suited for specialized problems characterized by a high degree of regularity, such as image processing and vector computation.
• Synchronous (lockstep) and deterministic execution.


Figure 1.2: SIMD processor organization (one control unit issues a single instruction stream to processors 1..N; each processor receives its own data stream from the shared memory or interconnection network)


c) Multiple Instruction Stream, Single Data Stream (MISD)

• A single data stream is fed into multiple processing units.
• Each processing unit operates on the data independently via an independent instruction stream; as shown in figure 1.3, a single data stream is forwarded to different processing units, each of which is connected to a different control unit and executes the instructions given to it by the control unit to which it is attached.

Figure 1.3: MISD processor organization (memory supplies a single data stream to processors 1..N; each processor executes an independent instruction stream from its own control unit)
• Thus in these computers the same data flows through a linear array of processors executing different instruction streams.
• This architecture is also known as systolic arrays, for pipelined execution of specific instructions.
• Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
d) Multiple Instruction Stream, Multiple Data Stream (MIMD)

• Multiple instructions: every processor may be executing a different instruction stream.
• Multiple data: every processor may be working with a different data stream, as shown in figure 1.4; the multiple data streams are provided by the shared memory.
• Can be categorized as loosely coupled or tightly coupled depending on the sharing of data and control.
• Execution can be synchronous or asynchronous, deterministic or non-deterministic.
• Examples: most current supercomputers, networked parallel computer "grids" and multi-processor SMP computers, including some types of PCs.

Figure 1.4: MIMD processor organization (processors 1..N each receive their own instruction stream from control units 1..N and their own data stream from the shared memory or interconnection network)


System Attributes to Performance
Performance of a system depends on:
• Hardware technology
• Architectural features
• Efficient resource management
• Algorithm design
• Data structures
• Language efficiency
• Programmer skill
• Compiler technology
When we talk about the performance of a computer system, we describe how quickly a given system can execute a program or programs. Thus we are interested in knowing the turnaround time. Turnaround time depends on:
• Disk and memory accesses
• Input and output
• Compilation time
• Operating system overhead
• CPU time
An ideal performance of a computer system means a perfect match between the machine capability and program behavior. The machine capability can be improved by using better hardware technology and efficient resource management. But program behavior depends on the code used, the compiler used and other run-time conditions. A machine's performance may also vary from program to program. Because there are too many programs and it is impractical to test a CPU's speed on all of them, benchmarks were developed. Computer architects have come up with a variety of metrics to describe computer performance.
Clock rate and CPI / IPC: Since I/O and system overhead frequently overlap processing by other programs, it is fair to consider only the CPU time used by a program, and the user CPU time is the most important factor. The CPU is driven by a clock with a constant cycle time τ (usually measured in nanoseconds), which controls the rate of internal operations in the CPU. The inverse of the cycle time is the clock rate (f = 1/τ, measured in megahertz). A shorter clock cycle time, or equivalently a larger number of cycles per second, implies that more operations can be performed per unit time. The size of a program is determined by its instruction count, Ic, the number of machine instructions to be executed by the program. Different machine instructions require different numbers of clock cycles to execute. CPI (cycles per instruction) is thus an important parameter.


Average CPI
It is easy to determine the average number of cycles per instruction for a particular processor if we know the frequency of occurrence of each instruction type. Any estimate is valid only for a specific set of programs (which defines the instruction mix), and then only if there is a sufficiently large number of instructions.
In general, the term CPI is used with respect to a particular instruction set and a given program mix. The time required to execute a program containing Ic instructions is T = Ic * CPI * τ.
Each instruction must be fetched from memory, decoded, operands fetched from memory, the instruction executed, and the results stored.
The time required to access memory is called the memory cycle time, which is usually k times the processor cycle time τ. The value of k depends on the memory technology and the processor-memory interconnection scheme. The processor cycles required for each instruction (CPI) can be attributed to cycles needed for instruction decode and execution (p), and cycles needed for memory references (m * k).
The total time needed to execute a program can then be rewritten as
T = Ic * (p + m * k) * τ
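As a rough numeric sketch of this formula (all values below are assumed for illustration, not taken from any benchmark), the following C fragment evaluates T = Ic * (p + m * k) * τ:

#include <stdio.h>

int main(void) {
    double Ic  = 2.0e9;    /* assumed instruction count                     */
    double p   = 2.0;      /* assumed decode/execute cycles per instruction */
    double m   = 0.4;      /* assumed memory references per instruction     */
    double k   = 4.0;      /* assumed memory-cycle to CPU-cycle ratio       */
    double tau = 0.5e-9;   /* cycle time in seconds (assumed 2 GHz clock)   */

    double cpi = p + m * k;       /* effective cycles per instruction */
    double T   = Ic * cpi * tau;  /* total execution time in seconds  */

    printf("CPI = %.2f, T = %.2f s\n", cpi, T);
    return 0;
}

With these assumed values the effective CPI is 3.6 and the execution time is 3.6 seconds.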

Parallel Computer M odels


Multiprocessor and Multicomputer
Two categories of parallel computers are discussed below, namely shared common memory and unshared distributed memory.
Shared Memory Multiprocessor
• Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space.
• Multiple processors can operate independently but share the same memory resources.
• Changes in a memory location effected by one processor are visible to all other processors.
• Shared memory machines can be divided into classes based upon memory access times: UMA, NUMA and COMA.

Uniform Memory Access (UMA):

• Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
• Identical processors.
• Equal access and access times to memory.
• Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.

Figure 1.5: Shared Memory (UMA) - several CPUs connected to a single shared memory
Non-Uniform Memory Access (NUMA):
• Often made by physically linking two or more SMPs
• One SMP can directly access the memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across the link is slower
• If cache coherency is maintained, it may also be called CC-NUMA - Cache Coherent NUMA


Figure 1.6: Shared Memory (NUMA) - two or more SMP nodes, each with its own memory and CPUs, joined by a bus interconnect


The COMA model (Cache Only Memory Access): The COMA model is a special case of a NUMA machine in which the distributed main memories are converted to caches. All caches form a global address space and there is no memory hierarchy at each processor node.
Advantages:
• The global address space provides a user-friendly programming perspective to memory.
• Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs.
Disadvantages:
• The primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs geometrically increases traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increases the traffic associated with cache/memory management.
• The programmer is responsible for synchronization constructs that ensure "correct" access of global memory.
• Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors.
Distributed Memory
• Like shared memory systems, distributed memory systems vary widely but share a common characteristic: distributed memory systems require a communication network to connect inter-processor memory.
Figure 1.7: Distributed Memory Systems - each CPU has its own local memory, and the CPU-memory pairs are connected by a network


• Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors.
• Because each processor has its own local memory, it operates independently.
• Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
• Modern multicomputers use hardware routers to pass messages.


Advantages:
• Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.
Disadvantages:
• The programmer is responsible for many of the details associated with data communication between processors.
• It may be difficult to map existing data structures, based on global memory, to this memory organization.
Multi-vector and SIMD Computers
A vector operand contains an ordered set of n elements, where n is called the length of the vector. Each element in a vector is a scalar quantity, which may be a floating-point number, an integer, a logical value or a character.
A vector processor consists of a scalar processor and a vector unit, which can be thought of as an independent functional unit capable of efficient vector operations.
Vector Supercomputer
Vector computers have hardware to perform the vector operations efficiently. Operands cannot be used directly from memory but rather are loaded into registers and are put back into registers after the operation. Vector hardware has the special ability to overlap or pipeline operand processing. Vector functional units are pipelined and fully segmented: each stage of the pipeline performs a step of the function on different operands, and once the pipeline is full a new result is produced each clock period (cp).

Figure 1.8: Architecture of Vector Supercomputer


SIMD Computer
Synchronous parallel architectures coordinate concurrent operations in lockstep through global clocks, central control units, or vector unit controllers. A synchronous array of parallel processors is called an array processor. These processors are composed of N identical processing elements (PEs) under the supervision of a single control unit (CU). This control unit is a computer with high-speed registers, local memory and an arithmetic logic unit. An array processor is basically a single instruction stream, multiple data stream (SIMD) computer. There are N data streams, one per processor, so different data can be used in each processor.


The figure below shows a typical SIMD or array processor.

Figure 1.9: Configuration of SIMD computers (the control unit issues one instruction stream to processors 1..N, which exchange data streams with the shared memory or interconnection network)


These processors consist of a number of memory modules, which can be either global or dedicated to each processor; thus the main memory is the aggregate of the memory modules. The processing elements and memory units communicate with each other through an interconnection network. SIMD processors are especially designed for performing vector computations. SIMD has two basic architectural organizations:
a. Array processors using random access memory
b. Associative processors using content addressable memory.
All N identical processors operate under the control of a single instruction stream issued by a central control unit. Popular examples of this type of SIMD configuration are the ILLIAC IV, CM-2 and MP-1. Each PEi is essentially an arithmetic logic unit (ALU) with attached working registers and a local memory PEMi for the storage of distributed data. The CU also has its own main memory for the storage of the program. The function of the CU is to decode the instructions and determine where the decoded instructions should be executed. The PEs perform the same function (same instruction) synchronously, in lockstep fashion, under command of the CU.

Data and Resource Dependence


Data dependence: The ordering relationship between statements is indicated by the data dependence. Five types of data dependence are defined below:
1. Flow dependence: A statement S2 is flow-dependent on S1 if an execution path exists from S1 to S2 and if at least one output (variable assigned) of S1 feeds in as input (operand to be used) to S2. Also called a RAW hazard and denoted as S1 -> S2.
2. Anti-dependence: Statement S2 is anti-dependent on statement S1 if S2 follows S1 in program order and the output of S2 overlaps the input of S1. Also called a WAR hazard and denoted as S1 |-> S2.
3. Output dependence: Two statements are output-dependent if they produce (write) the same output variable. Also called a WAW hazard and denoted as S1 o-> S2.
4. I/O dependence: Read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
5. Unknown dependence: The dependence relation between two statements cannot be determined in the following situations:
• The subscript of a variable is itself subscripted (indirect addressing).
• The subscript does not contain the loop index variable.
• A variable appears more than once with subscripts having different coefficients of the loop variable.
• The subscript is non-linear in the loop index variable.
Parallel execution of program segments which do not have total data independence can produce non-deterministic results.
Consider the following fragment of a program:
S1: Load R1, A
S2: Add R2, R1
S3: Move R1, R3
S4: Store B, R1
• Here there are flow dependences from S1 to S2 and from S3 to S4.
• There is an anti-dependence from S2 to S3.
• There is an output dependence from S1 to S3.

Control Dependence: This refers to the situation where the order of execution of statements cannot be determined before run time, for example in conditional statements, where the flow of control depends on a computed result. The different paths taken after a conditional branch may depend on the data, hence we need to eliminate this dependence among the instructions. This dependence also exists between operations performed in successive iterations of a looping procedure. Control dependence often prohibits parallelism from being exploited.
for (i=0;i<n;i++) {
a[i] = c[i];
if (a[i] < 0) a[i] = 1;
}
Cont rol-dependent example:
for (i=1;i<n;i++) {
if (a[i-1] < 0) a[i] = 1;
}
Control dependence also prevents parallelism from being exploited. Compilers are used to eliminate this control dependence and exploit the parallelism, for example by if-conversion, as in the sketch below.
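The following fragment is an assumed illustration (not from the original notes) of if-conversion applied to the first loop above: the branch inside each iteration is replaced by a conditional select, so every iteration executes the same instruction sequence and the loop becomes easier to run on a pipeline or SIMD machine.

for (i = 0; i < n; i++) {
    a[i] = c[i];
    a[i] = (a[i] < 0) ? 1 : a[i];   /* conditional select replaces the branch */
}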

Resource dependence:
Data and control dependences are based on the independence of the work to be done. Resource dependence is concerned with conflicts in using shared resources, such as registers, integer and floating-point ALUs, etc. ALU conflicts are called ALU dependence; memory (storage) conflicts are called storage dependence.
Bernstein's Conditions - 1
Bernstein's conditions are a set of conditions which must hold if two processes are to execute in parallel.
Notation: Ii is the set of all input variables for a process Pi; Ii is also called the read set or domain of Pi. Oi is the set of all output variables for a process Pi; Oi is also called the write set.
If P1 and P2 can execute in parallel (which is written as P1 || P2), then:
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅ and O1 ∩ O2 = ∅.
Bernstein's Conditions - 2

In terms of data dependences, Bernstein's conditions imply that two processes can execute in parallel if they are flow-independent, anti-independent, and output-independent. The parallelism relation || is commutative (Pi || Pj implies Pj || Pi), but not transitive (Pi || Pj and Pj || Pk does not imply Pi || Pk). Therefore, || is not an equivalence relation. Intersection of the input sets is allowed.
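As a small worked example (the statements are assumed for illustration), take P1: c = a + b and P2: d = a * e. Then I1 = {a, b}, O1 = {c}, I2 = {a, e}, O2 = {d}. Since I1 ∩ O2 = ∅, I2 ∩ O1 = ∅ and O1 ∩ O2 = ∅, we have P1 || P2, even though the two input sets share the variable a. If instead P3: a = c + 1, then O1 ∩ I3 = {c} ≠ ∅, so P1 and P3 are flow-dependent and cannot execute in parallel.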

Hardware and Software Parallelism


Hardware parallelism is defined by the machine architecture and hardware multiplicity, i.e., functional parallelism times processor parallelism. It can be characterized by the number of instructions that can be issued per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor. Conventional processors are one-issue machines. This provides the user with information about the peak attainable performance.


Software Parallelism
Software parallelism is defined by the control and data dependences of programs and is revealed in the program's flow graph, i.e., it is determined by dependences within the code and is a function of the algorithm, programming style, and compiler optimization.

Program partitioning and scheduling


Scheduling and allocation is a highly important issue, since an inappropriate scheduling of tasks can fail to exploit the true potential of the system and can offset the gain from parallelization. Here we focus on the scheduling aspect. The objective of scheduling is to minimize the completion time of a parallel application by properly allocating the tasks to the processors. In a broad sense, the scheduling problem exists in two forms: static and dynamic. In static scheduling, which is usually done at compile time, the characteristics of a parallel program (such as task processing times, communication, data dependences, and synchronization requirements) are known before program execution.
A parallel program can therefore be represented by a node- and edge-weighted directed acyclic graph (DAG), in which the node weights represent task processing times and the edge weights represent data dependences as well as the communication times between tasks; a minimal data-structure sketch is given below. In dynamic scheduling, only a few assumptions about the parallel program can be made before execution, and thus scheduling decisions have to be made on-the-fly. The goal of a dynamic scheduling algorithm therefore includes not only minimizing the program completion time but also minimizing the scheduling overhead, which constitutes a significant portion of the cost paid for running the scheduler. In general, dynamic scheduling is an NP-hard problem.
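A minimal sketch of such a weighted DAG in C (the field names and the size limit are assumptions made here for illustration, not part of the notes):

#define MAX_TASKS 64

/* One task node of the DAG: w is the node weight (task processing time),
   succ[] lists successor tasks and c[] holds the matching edge weights
   (communication time to each successor). */
struct task {
    int id;
    int w;
    int nsucc;
    int succ[MAX_TASKS];
    int c[MAX_TASKS];
};

A static scheduler would read such a graph at compile time and assign each task a processor and a start time that respect the edge (communication) delays.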

Grain size and latency


The size of the parts or pieces of a program that can be considered for parallel execution can vary. The sizes are roughly classified using the term "granule size," or simply "granularity." The simplest measure, for example, is the number of instructions in a program part. Grain sizes are usually described as fine, medium or coarse, depending on the level of parallelism involved.
Latency
Latency is the time required for communication between different subsystems in a computer. Memory latency, for example, is the time required by a processor to access memory. Synchronization latency is the time required for two processes to synchronize their execution. Computational granularity and communication latency are closely related. Latency and grain size are interrelated, and some general observations are (a rough numeric illustration follows the list):
• As grain size decreases, potential parallelism increases, and overhead also increases.
• Overhead is the cost of parallelizing a task. The principal overhead is communication latency.
• As grain size is reduced, there are fewer operations between communications, and hence the impact of latency increases.
• Surface to volume: inter-node versus intra-node communication.
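As a rough illustration (the formula and all numbers are assumed, not from the notes): if a grain performs g useful operations between communications and each communication costs L operation-times of latency, the useful-work fraction is roughly g / (g + L). The small C program below prints this fraction for a few assumed grain sizes.

#include <stdio.h>

int main(void) {
    double L = 100.0;                          /* assumed latency per communication */
    double grains[] = { 10.0, 100.0, 1000.0 }; /* assumed grain sizes (operations)  */
    for (int i = 0; i < 3; i++) {
        double g = grains[i];
        printf("grain %6.0f ops -> useful-work fraction %.2f\n", g, g / (g + L));
    }
    return 0;
}

Shrinking the grain from 1000 to 10 operations drops the fraction from about 0.91 to about 0.09, which is the trade-off described in the list above.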

Levels of Parallelism
Instruction Level Parallelism
This fine-grained, or smallest granularity, level typically involves less than 20 instructions per grain. The number of candidates for parallel execution varies from 2 to thousands, with about five instructions or statements being the average level of parallelism.
Advantages:
• There are usually many candidates for parallel execution.
• Compilers can usually do a reasonable job of finding this parallelism.
Loop-level Parallelism
A typical loop has less than 500 instructions. If a loop operation is independent between iterations, it can be handled by a pipeline or by a SIMD machine. This is the most optimized program construct to execute on a parallel or vector machine. Some loops (e.g. recursive ones) are difficult to handle. Loop-level parallelism is still considered fine-grain computation.

Figure 1.10: Level of Parallelism in Program Execution on M odern Computers


Procedure-level Parallelism
Medium-sized grain, usually less than 2000 instructions. Detection of parallelism is more difficult than with smaller grains; inter-procedural dependence analysis is difficult and history-sensitive. The communication requirement is less than at instruction level. SPMD (single program, multiple data) is a special case. Multitasking belongs to this level.
Subprogram-level Parallelism
Job-step level; a grain typically has thousands of instructions; medium- or coarse-grain level. Job steps can overlap across different jobs. Multi-programming is conducted at this level. No compilers are available to exploit medium- or coarse-grain parallelism at present.
Job or Program-Level Parallelism
This corresponds to the execution of essentially independent jobs or programs on a parallel computer. It is practical for a machine with a small number of powerful processors, but impractical for a machine with a large number of simple processors (since each processor would take too long to process a single job).
Communication Latency
Balancing granularity and latency can yield better performance. Various latencies are attributed to machine architecture, technology, and the communication patterns used. Latency imposes a limiting factor on machine scalability. For example, memory latency increases as memory capacity increases, limiting the amount of memory that can be used with a given tolerance for communication latency.
Inter-processor Communication Latency
• Needs to be minimized by the system designer.
• Affected by signal delays and communication patterns. For example, n communicating tasks may require n(n - 1)/2 communication links, and the complexity grows quadratically, effectively limiting the number of processors in the system.
Communication Patterns
• Determined by the algorithms used and the architectural support provided.
• Patterns include permutation, broadcast, multicast and conference.
• Trade-offs often exist between the granularity of parallelism and communication demand.
Program Graphs and Packing
A program graph is similar to a dependence graph. Nodes = {(n, s)}, where n = node name and s = size (larger s = larger grain size).
Edges = {(v, d)}, where v = the variable being "communicated" and d = the communication delay.
Packing two (or more) nodes produces a node with a larger grain size and possibly more edges to other nodes. Packing is done to eliminate unnecessary communication delays or to reduce overall scheduling overhead.

Scheduling
A schedule is a mapping of nodes to processors and start times such that communication delay requirements are observed and no two nodes are executing on the same processor at the same time. Some general scheduling goals:
• Schedule all fine-grain activities in a node to the same processor to minimize communication delays.
• Select grain sizes for packing to achieve better schedules for a particular parallel machine.
Node Duplication
Grain packing may potentially eliminate interprocessor communication, but it may not always produce a shorter schedule. By duplicating nodes (that is, executing some instructions on multiple processors), we may eliminate some interprocessor communication and thus produce a shorter schedule.

Program flow mechanism


Conventional machines use a control flow mechanism, in which the order of program execution is explicitly stated in the user program. In dataflow machines, instructions can be executed as soon as their operands become available. Reduction machines trigger an instruction's execution based on the demand for its results.
Control Flow vs. Data Flow: In control flow computers the next instruction is executed when the last instruction, as stored in the program, has been executed, whereas in dataflow computers an instruction is executed when the data (operands) required for executing that instruction are available. Control flow machines use shared memory for instructions and data. Since variables are updated by many instructions, there may be side effects on other instructions; these side effects frequently prevent parallel processing. Single-processor systems are inherently sequential.
Instructions in dataflow machines are unordered and can be executed as soon as their operands are available; data is held in the instructions themselves. Data tokens are passed from an instruction to its dependents to trigger execution.
Data Flow Features
There is no need for shared memory, a program counter, or a control sequencer. Special mechanisms are required to detect data availability, to match data tokens with the instructions needing them, and to enable the chain reaction of asynchronous instruction execution.
A Dataflow Architecture - 1: The Arvind machine (MIT) has N PEs and an N-by-N interconnection network. Each PE has a token-matching mechanism that dispatches only instructions with data tokens available. Each datum is tagged with:
• the address of the instruction to which it belongs
• the context in which the instruction is being executed
Tagged tokens enter a PE through the local path (pipelined) and can also be communicated to other PEs through the routing network. Instruction addresses effectively replace the program counter of a control flow machine, and the context identifier effectively replaces the frame base register of a control flow machine. Since the dataflow machine matches the data tags from one instruction with its successors, synchronized instruction execution is implicit.
An I-structure in each PE is provided to eliminate excessive copying of data structures. Each word of the I-structure has a two-bit tag indicating whether the value is empty, full, or has pending read requests. This is a retreat from the pure dataflow approach. Special compiler technology is needed for dataflow machines.


Demand-Driven Mechanisms
Data-driven machines select instructions for execution based on the availability of their operands; this is essentially a bottom-up approach.
Demand-driven machines take a top-down approach, attempting to execute the instruction (a demander) that yields the final result. This triggers the execution of the instructions that yield its operands, and so forth. The demand-driven approach matches naturally with functional programming languages (e.g. LISP and SCHEME).

Pattern-driven computers: An instruction is executed when we obtain a particular data pattern as output. There are two types of pattern-driven computers:
String-reduction model: each demander gets a separate copy of the expression string to evaluate; each reduction step has an operator and embedded references to demand the corresponding operands; each operator is suspended while its arguments are evaluated.
Graph-reduction model: the expression graph is reduced by evaluation of branches or sub-graphs, possibly in parallel, with demanders given pointers to the results of the reductions. It is based on sharing of pointers to arguments; traversal and reversal of pointers continues until constant arguments are encountered.

System interconnect architecture


Various types of interconnection networks have been suggested for SIMD computers. Based on their network topologies, they are classified into two categories, namely:
• Static networks
• Dynamic networks
Static versus Dynamic Networks
The topological structure of an SIMD array processor is mainly characterized by the data routing network used in interconnecting the processing elements. To execute a communication, the routing function f is applied and, via the interconnection network, PEi copies the content of its Ri register into the Rf(i) register of PE f(i), where f(i) is the processor identified by the mapping function f. The data routing operation occurs in all active PEs simultaneously.

Network properties and routing


The goals of an interconnection network are to provide low latency, a high data transfer rate and wide communication bandwidth. Analysis includes latency, bisection bandwidth, data-routing functions and the scalability of the parallel architecture. These networks are usually represented by a graph with a finite number of nodes linked by directed or undirected edges.
Number of nodes in the graph = network size.
Number of edges (links or channels) incident on a node = node degree d (also note in-degree and out-degree when edges are directed). The node degree reflects the number of I/O ports associated with a node and should ideally be small and constant.
A network is symmetric if the topology looks the same from any node; such networks are easier to implement and to program.

Diameter: The maximum distance between any two processors in the network; in other words, the diameter is the maximum number of (routing) processors through which a message must pass on its way from source to destination. Thus the diameter measures the maximum delay for transmitting a message from one processor to another; since it determines communication time, the smaller the diameter, the better the network topology.


Connectivity: How many paths are possible between any two processors, i.e., the multiplicity of paths between two processors. Higher connectivity is desirable as it minimizes contention.
Arc connectivity of the network: the minimum number of arcs that must be removed for the network to break into two disconnected networks. The arc connectivity of various networks is as follows:
• 1 for linear arrays and binary trees
• 2 for rings and 2-D meshes
• 4 for a 2-D torus
• d for d-dimensional hypercubes
The larger the arc connectivity, the fewer the contentions and the better the network topology.
Channel width: The channel width is the number of bits that can be communicated simultaneously by an interconnection bus connecting two processors.
Bisection Width and Bandwidth: In order to divide the network into two equal halves, we are required to remove some communication links; the minimum number of such communication links that have to be removed is called the bisection width. The bisection width basically gives us the largest number of messages which can be sent simultaneously (without needing to use the same wire or routing processor at the same time, and so delaying one another), no matter which processors are sending to which other processors. Thus the larger the bisection width, the better the network topology is considered to be. Bisection bandwidth is the minimum volume of communication allowed between the two halves of the network with equal numbers of processors. This is important for networks with weighted arcs, where the weights correspond to the link width (i.e., how much data it can transfer). The larger the bisection bandwidth, the better the network topology is considered to be.
Cost: the cost of a network can be estimated by a variety of criteria; here we consider the number of communication links or wires used to design the network as the basis of cost estimation - the smaller, the better the cost.
Data Routing Functions: A data routing network is used for inter-PE data exchange. It can be static, as in the case of a hypercube routing network, or dynamic, such as a multistage network. Various types of data routing functions are shifting, rotating, permutation (one to one), broadcast (one to all), multicast (many to many), personalized broadcast (one to many), shuffle, exchange, etc.
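Two of these routing functions have a simple bit-level form for N = 2^n PEs (a standard formulation, shown here as an assumed illustrative sketch): the perfect shuffle rotates the n-bit PE address one position to the left, and the exchange flips its least-significant bit.

/* Perfect-shuffle and exchange routing functions for N = 2^n PEs
   (illustrative sketch; PE addresses are n-bit integers). */
unsigned shuffle(unsigned i, unsigned n) {
    unsigned N = 1u << n;
    return ((i << 1) | (i >> (n - 1))) & (N - 1);  /* rotate address left by one bit */
}

unsigned exchange(unsigned i) {
    return i ^ 1u;                                  /* flip the least-significant bit */
}

For n = 3 (eight PEs), shuffle maps PE 3 (011) to PE 6 (110), and exchange pairs PE 6 with PE 7.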

Factors Affecting Performance


• Functionality - how the network supports data routing, interrupt handling, synchronization, request/message combining, and coherence.
• Network latency - worst-case time for a unit message to be transferred.
• Bandwidth - maximum data rate.
• Hardware complexity - implementation costs for wires, logic, switches, connectors, etc.
• Scalability - how easily the scheme adapts to an increasing number of processors, memories, etc.

Static interconnection networks


Static interconnection networks for elements of parallel systems (e.g. processors, memories) are based on fixed connections that cannot be modified without physically re-designing the system. Static interconnection networks can have many structures, such as a linear structure (pipeline), a matrix, a ring, a torus, a complete connection structure, a tree, a star, or a hypercube.

In linear and matrix structures, processors are interconnected with their neighbors in a regular structure on a plane. A torus is a matrix structure in which elements at the matrix borders are connected within the same rows and columns. In a complete connection structure, all elements (e.g. processors) are directly interconnected (point-to-point).

Page no: 13 Follow us on facebook to get real-time updates from RGPV


Downloaded from be.rgpvnotes.in

Figure 1.11: Linear structure (pipeline) of interconnections in a parallel system

Figure 1.12: 2D Torus; Figure 1.13: Matrix

Figure 1.14: A complete interconnection; Figure 1.15: A Ring; Figure 1.16: A Chordal Ring

In a tree structure, system elements are arranged in a hierarchical structure from the root to the leaves, as shown in the figures below. All elements of the tree (nodes) can be processors, or only the leaves are processors and the remaining nodes are linking elements which intermediate in transmissions. If from one node two or more connections go to different nodes towards the leaves, we speak of a binary or k-ary tree. If more than one connection goes between a pair of neighboring nodes, we speak of a fat tree. A binary tree in which, in the direction of the root, the number of connections between neighboring nodes doubles provides a uniform transmission throughput between the tree levels, a feature not available in a standard tree.

Figure 1.17: Binary tree; Figure 1.18: Fat tree


In a hypercube structure, processors are interconnected in a network in which the connections between processors correspond to the edges of an n-dimensional cube. The hypercube structure is very advantageous since it provides a low network diameter, equal to the degree (dimension) of the cube. The network diameter is the number of edges between the most distant nodes; it determines the number of intermediate transfers that have to be done to send data between the most distant nodes of the network. In this respect hypercubes have very good properties, especially for a very large number of constituent nodes. Because of this, hypercubes are popular networks in existing parallel systems.
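A small sketch of these properties (an assumed illustration): for an n-dimensional hypercube with N = 2^n nodes, the node degree and the diameter are both n, the total number of links is n*N/2, and the bisection width is N/2.

#include <stdio.h>

/* Print basic properties of n-dimensional hypercubes (illustrative sketch). */
int main(void) {
    for (int n = 1; n <= 6; n++) {
        int N = 1 << n;   /* number of nodes */
        printf("n=%d  nodes=%2d  degree=%d  diameter=%d  links=%3d  bisection=%2d\n",
               n, N, n, n, n * N / 2, N / 2);
    }
    return 0;
}

For example, with n = 10 (1024 nodes) the diameter is only 10, which is why the hypercube scales well to large node counts.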

Dynamic interconnection networks


Dynamic interconnection networks between processors enable changing (reconfiguring) the connection structure in a system. This can be done before or during parallel program execution, so we can speak of static or dynamic connection reconfiguration.
Dynamic networks are those networks in which the route through which data moves from one PE to another is established at the time the communication has to be performed. Usually all processing elements are equidistant, and an interconnection path is established, by use of switches, when two processing elements want to communicate. Such systems are more difficult to expand than static networks. Examples: bus-based, crossbar, and multistage networks. Here routing is done by comparing the bit-level representations of the source and destination addresses: if there is a match, the message goes to the next stage via the pass-through setting; in case of a mismatch, it goes via the cross-over setting of the switch.
There are two classes of dynamic networks, namely:
• single stage networks
• multistage networks
Single Stage Networks
A single stage switching network has N input selectors (IS) and N output selectors (OS). Each IS is essentially a 1-to-D demultiplexer, with 1 < D < N, and each OS is an M-to-1 multiplexer, with 1 < M <= N. A crossbar network is a single stage network with D = M = N. In order to establish a desired connecting path, different path control signals are applied to all IS and OS selectors. The single stage network is also called a recirculating network, because data items may have to recirculate through the single stage several times before reaching their final destinations. The number of recirculations depends on the connectivity of the single stage network; in general, the higher the hardware connectivity, the smaller the number of recirculations. In a crossbar network only one pass is needed to establish the connection path. The cost of a completely connected crossbar network is O(N^2), which is very high compared with most other recirculating networks, which have cost O(N log N) or lower and hence are more cost-effective for large values of N.
Multistage Networks
Many stages of interconnected switches form a multistage SIMD network. It basically consists of three characteristic features:
• the switch box,
• the network topology,
• the control structure.
Each box is essentially an interchange device with two inputs and two outputs. The four possible states of a switch box, shown in the figure, are:
• straight
• exchange
• upper broadcast
• lower broadcast.
A two-function switch box can assume only two possible states, namely the straight or exchange states, whereas a four-function switch box can be in any of the four possible states. A multistage network is capable of connecting any input terminal to any output terminal. Multistage networks are basically constructed from so-called shuffle-exchange switching elements, each of which is essentially a 2 x 2 crossbar. Multiple layers of these elements are connected to form the network.


Figure 1.19: A two-by-two switching box and its four interconnection states
A multistage network is capable of connecting an arbitrary input terminal to an arbitrary output terminal. Generally it consists of n stages, where N = 2^n is the number of input and output lines, and each stage uses N/2 switch boxes. The interconnection patterns from one stage to the next are determined by the network topology. Each stage is connected to the next stage by at least N paths. The total wait time is proportional to the number of stages n, and the total cost depends on the total number of switches used, which is of order N log2 N.
The control structure can be individual stage control, i.e., the same control signal is used to set all switch boxes in the same stage, so that only n control signals are needed. The second control structure is individual box control, where a separate control signal is used to set the state of each switch box; this provides flexibility but requires nN/2 control signals, which increases the complexity of the control circuit. An intermediate approach is partial stage control.
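A small numeric sketch of these relations (an assumed illustration): with 2 x 2 switch boxes, a network with N inputs and outputs has n = log2 N stages and N/2 boxes per stage, so individual stage control needs n signals while individual box control needs one signal per box.

#include <stdio.h>

/* Stage and switch-box counts of a multistage network built from 2x2 boxes
   (illustrative sketch). */
int main(void) {
    for (int n = 2; n <= 5; n++) {
        int N     = 1 << n;        /* number of input/output lines  */
        int boxes = (N / 2) * n;   /* N/2 boxes per stage, n stages */
        printf("N=%2d  stages=%d  switch boxes=%2d  box-control signals=%2d\n",
               N, n, boxes, boxes);
    }
    return 0;
}

For N = 16 this gives 4 stages and 32 switch boxes, compared with the 256 crosspoints of a 16 x 16 crossbar.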

Bus networks
A bus is the simplest type of dynamic interconnection network. It constitutes a common data transfer path for many devices. Depending on the type of transmission implemented, we have serial buses and parallel buses. The devices connected to a bus can be processors, memories, I/O units, as shown in the figures below.

Figure 1.20: A diagram of a system based on a single bus; Figure 1.21: A diagram of a system based on a multibus

Only one device connected to a bus can transmit data at a time, while many devices can receive the data. In the latter case we speak of a multicast transmission; if the data are meant for all devices connected to the bus, we speak of a broadcast transmission. Access to the bus must be synchronized. This is done with the use of two methods: a token method and a bus arbiter method. With the token method, a token (a special control message or signal) circulates between the devices connected to the bus and gives the right to transmit on the bus to a single device at a time. With the arbiter method, the bus arbiter receives data transmission requests from the devices connected to the bus. It selects one device according to a selected strategy (e.g. using a system of assigned priorities) and sends an acknowledge message (signal) to one of the requesting devices, granting it the transmission right. After the selected device completes its transmission, it informs the arbiter, which can then select another request. The receiver's address is usually given in the header of the message; special header values are used for broadcasts and multicasts. All receivers read and decode the headers, and the devices that are specified in the header read in the data transmitted over the bus.


The throughput of a bus-based network can be increased by the use of a multibus network, shown in figure 1.21. In such a network, processors connected to the buses can transmit data in parallel (one per bus) and many processors can read data from many buses at a time.

Crossbar switches
A crossbar switch is a circuit that enables many interconnections between elements of a parallel system at a time. A crossbar switch has a number of input and output data pins and a number of control pins. In response to control instructions set on its control input, the crossbar switch implements a stable connection of a determined input with a determined output. Diagrams of a typical crossbar switch are shown in the figures below.

Figure 1.22: Crossbar switch

Figure 1.23: Crossbar switch a) general scheme, b) internal structure


Control instructions can request reading the state of specified input and output pins, i.e., their current connections in the crossbar switch. Crossbar switches are built with the use of multiplexer circuits, controlled by latch registers, which are set by control instructions. Crossbar switches implement direct, single, non-blocking connections, on the condition that the necessary input and output pins of the switch are free. Connections between free pins can always be implemented independently of the status of other connections, and new connections can be set up during data transmissions through other connections. These non-blocking connections are a big advantage of crossbar switches. Some crossbar switches enable broadcast transmissions, but in a blocking manner for all other connections. The disadvantage of crossbar switches is that extending their size, in the sense of the number of input/output pins, is costly in terms of hardware; because of that, crossbar switches are built only up to a size of about 100 input/output pins.
Multiport Memory
In a multiport memory system, the different memory modules and CPUs have separate buses. Each module has internal control logic to determine which port will have access to the memory at any given time; priorities are assigned to each memory port to resolve memory access conflicts.
Advantages: Because of the multiple paths, a high transfer rate can be achieved.
Disadvantages: It requires expensive memory control logic and a large number of cables and connectors.

Figure 1.24: M ultiport memory organization


Multistage and combining networks
Multistage connection networks are designed with the use of small elementary crossbar switches (usually with two inputs) connected in multiple layers. The elementary crossbar switches can implement four types of connections: straight, crossed, upper broadcast and lower broadcast. All elementary switches are controlled simultaneously. A network like this is an alternative to a crossbar switch when we have to switch a large number of connections, over 100. The extension cost for such a network is relatively low.
In such networks there is no full freedom in implementing arbitrary connections when some connections have already been set in the switch. Because of this property, these networks belong to the category of so-called blocking networks.
However, if we increase the number of levels of elementary crossbar switches above the number necessary to implement connections for all pairs of inputs and outputs, it is possible to implement all requested connections at the same time, but statically, before any communication is started in the switch. This can be achieved at the cost of additional redundant hardware included in the switch. The block diagram of such a network, called the Benes network, is shown in the figure below.

Figure 1.25: A multistage connection network for parallel systems


To obtain nonblocking properties of a multistage connection network, the redundancy level in the circuit has to be increased much further. To build a nonblocking multistage n x n network, the elementary two-input switches have to be replaced by three layers of switches of sizes n x m, r x r and m x n, where m >= 2n - 1 and r is the number of elementary switches in layers 1 and 3. Such a switch was designed by Charles Clos and is called the Clos network. This switch is commonly used to build large integrated crossbar switches. The block diagram of the Clos network is shown in the figure below.

Figure 1.26: A nonblocking Clos interconnection network
