
Reference: Introduction to Parallel Computing by Vipin Kumar et al. (Pearson Education)

MODULE 1

" Introduction to Parallel Computing

Topic Overview
Motivating Parallelism
Scope of Parallel Computing Applications
Organization and Contents of the Course

Motivating Parallelism
The role of parallelism in accelerating computing speeds has been recognized for several decades. Its role in providing multiplicity of datapaths and increased access to storage elements has been significant in commercial applications. The scalable performance and lower cost of parallel platforms is reflected in the wide variety of applications. Developing parallel hardware and software has traditionally been time and effort intensive. If one is to view this in the context of rapidly improving uniprocessor speeds, one is tempted to question the need for parallel computing.

There are some unmistakable trends in hardware design, which indicate that uniprocessor (or implicitly parallel) architectures may not be able to sustain the rate of realizable performance increments in the future.

This is the result of a number of fundamental physical and computational limitations.

The emergence of standardized parallel programming environments, libraries, and hardware has significantly reduced time to (parallel) solution.

The Computational Power Argument


Moore's law states [1965]: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000."

Moore attributed this doubling rate to exponential behavior of die sizes, finer minimum dimensions, and "circuit and device cleverness". In 1975, he revised this law as follows: "There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors - bigger dies and finer dimensions." He revised his rate of circuit complexity doubling to 18 months and projected from 1975 onwards at this reduced rate. If one is to buy into Moore's law, the question still remains - how does one translate transistors into useful OPS (operations per second)? The logical recourse is to rely on parallelism, both implicit and explicit. Most serial (or seemingly serial) processors rely extensively on implicit parallelism. We focus in this class, for the most part, on explicit parallelism.

The Memory/Disk Speed Argument


While clock rates of high-end processors have increased at roughly 40% per year over the past decade, DRAM access times have only improved at the rate of roughly 10% per year over this interval. This mismatch in speeds causes significant performance bottlenecks. Parallel platforms provide increased bandwidth to the memory system. Parallel platforms also provide higher aggregate caches.

Principles of locality of data reference and bulk access, which guide parallel algorithm design, also apply to memory optimization. Some of the fastest growing applications of parallel computing utilize not their raw computational speed, but rather their ability to pump data to memory and disk faster.

The Data Communication Argument


As the network evolves, the vision of the Internet as one large computing platform has emerged. This view is exploited by applications such as SETI@home and Folding@home.

In many other applications (typically databases and data mining) the volume of data is such that it cannot be moved. Any analyses on this data must be performed over the network using parallel techniques.

Scope of Parallel Computing Applications


Parallelism finds applications in very diverse application domains for different motivating reasons. These range from improved application performance to cost considerations.

Applications in Engineering and Design


"esign of airfoils 'optimizing lift$ drag$ stabilit ($ internal combustion engines 'optimizing charge distribution$ burn($ high6speed circuits 'la outs for dela s and capacitive and inductive effects($ and structures 'optimizing structural integrit $ design parameters$ cost$ etc.(. "esign and simulation of micro6 and nano6scale s stems. Process optimization$ operations research.

Scientific Applications
Functional and structural characterization of genes and proteins. Advances in computational physics and chemistry have explored new materials, understanding of chemical pathways, and more efficient processes. Applications in astrophysics have explored the evolution of galaxies, thermonuclear processes, and the analysis of extremely large datasets from telescopes. Weather modeling, mineral prospecting, flood prediction, etc., are other important applications. Bioinformatics and astrophysics also present some of the most challenging problems with respect to analyzing extremely large datasets.

Commercial Applications
Some of the largest parallel computers power Wall Street! Data mining and analysis for optimizing business and marketing decisions.

Large scale servers (mail and web servers) are often implemented using parallel platforms. Applications such as information retrieval and search are typically powered by large clusters.

Applications in Computer Systems



Network intrusion detection, cryptography, and multiparty computations are some of the core users of parallel computing techniques. Embedded systems increasingly rely on distributed control algorithms. A modern automobile consists of tens of processors communicating to perform complex tasks for optimizing handling and performance. Conventional structured peer-to-peer networks impose overlay networks and utilize algorithms directly from parallel computing.

Organi"ation and Contents o! this Course


Fundamentals: This part of the class covers basic parallel platforms, principles of algorithm design, group communication primitives, and analytical modeling techniques.
Parallel Programming: This part of the class deals with programming using message passing libraries and threads.
Parallel Algorithms: This part of the class covers basic algorithms for matrix computations, graphs, sorting, discrete optimization, and dynamic programming.

===============xxxxxxxxxxx===============

Parallel Computing Platforms

Topic Overview


Implicit Parallelism: Trends in Microprocessor Architectures
Limitations of Memory System Performance
Dichotomy of Parallel Computing Platforms
Communication Model of Parallel Platforms
Physical Organization of Parallel Platforms
Communication Costs in Parallel Machines
Messaging Cost Models and Routing Mechanisms
Mapping Techniques
Case Studies

Scope of Parallelism
Conventional architectures coarsely comprise a processor, a memory system, and the datapath. Each of these components presents significant performance bottlenecks. Parallelism addresses each of these components in significant ways. Different applications utilize different aspects of parallelism - e.g., data intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance. It is important to understand each of these performance bottlenecks.

"mplicit $arallelism: 'rends in Microprocessor &rc*itectures

Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude). Higher levels of device integration have made available a large number of transistors. The question of how best to utilize these resources is an important one. Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle. The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.

Pipelining and Superscalar Execution


Pipelining overlaps various stages of instruction execution to achieve performance. At a high level of abstraction, an instruction can be executed while the next one is being decoded and the next one is being fetched. This is akin to an assembly line for manufacture of cars. Pipelining, however, has several limitations. The speed of a pipeline is eventually limited by the slowest stage. For this reason, conventional processors rely on very deep pipelines (20 stage pipelines in state-of-the-art Pentium processors). However, in typical program traces, every 5th-6th instruction is a conditional jump! This requires very accurate branch prediction. The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed. One simple way of alleviating these bottlenecks is to use multiple pipelines. The question then becomes one of selecting these instructions.

Superscalar Execution: An Example

Example of a two-way superscalar execution of instructions.

In the above example, there is some wastage of resources due to data dependencies. The example also illustrates that different instruction mixes with identical semantics can take significantly different execution time.
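As a rough illustration (not the instruction trace from the slides), the two fragments below have identical semantics - both compute the sum of four array elements - yet expose different amounts of instruction-level parallelism:

/* (a) Dependency chain: each add needs the previous result, so at most
       one add can issue per cycle even on a superscalar processor. */
sum = a[0] + a[1];
sum = sum + a[2];
sum = sum + a[3];

/* (b) Independent partial sums: the first two adds have no dependence on
       each other and can be co-scheduled on a two-way superscalar processor. */
s1  = a[0] + a[1];
s2  = a[2] + a[3];
sum = s1 + s2;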

Superscalar Execution
Scheduling of instructions is determined by a number of factors:
• True Data Dependency: The result of one operation is an input to the next.
• Resource Dependency: Two operations require the same resource.
• Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a-priori.
• The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
• The complexity of this hardware is an important constraint on superscalar processors.

#uperscalar +ecution: "ssue Mec*anisms


In the simpler model, instructions can be issued only in the order in which they are encountered. That is, if the second instruction cannot be issued because it has a data dependency with the first, only one instruction is issued in the cycle. This is called in-order issue. In a more aggressive model, instructions can be issued out of order. In this case, if the second instruction has data dependencies with the first, but the third instruction does not, the first and third instructions can be co-scheduled. This is also called dynamic issue. Performance of in-order issue is generally limited.

Superscalar Execution: Efficiency Considerations


Not all functional units can be kept busy at all times. If during a cycle, no functional units are utilized, this is referred to as vertical waste. If during a cycle, only some of the functional units are utilized, this is referred to as horizontal waste.

"ue to limited parallelism in t pical instruction traces$ dependencies$ or the inabilit of the scheduler to e#tract parallelism$ the performance of superscalar processors is eventuall limited. Conventional microprocessors t picall support four6!a superscalar e#ecution.

Very Long "nstruction ,ord (VL",) $rocessors


The hardware cost and complexity of the superscalar scheduler is a major consideration in processor design. To address this issue, VLIW processors rely on compile time analysis to identify and bundle together instructions that can be executed concurrently. These instructions are packed and dispatched together, and thus the name very long instruction word. This concept was used with some commercial success in the Multiflow Trace machine (circa 1984). Variants of this concept are employed in the Intel IA64 processors.

Very Long "nstruction ,ord (VL",) $rocessors: %onsiderations


Issue hardware is simpler. The compiler has a bigger context from which to select co-scheduled instructions. Compilers, however, do not have runtime information such as cache misses. Scheduling is, therefore, inherently conservative. Branch and memory prediction is more difficult. VLIW performance is highly dependent on the compiler. A number of techniques such as loop unrolling, speculative execution, and branch prediction are critical. Typical VLIW processors are limited to 4-way to 8-way parallelism.

Limitations of Memory System Performance


The memory system, and not processor speed, is often the bottleneck for many applications. Memory system performance is largely captured by two parameters, latency and bandwidth. Latency is the time from the issue of a memory request to the time the data is available at the processor. Bandwidth is the rate at which data can be pumped to the processor by the memory system.

Memory System Performance: Bandwidth and Latency


It is very important to understand the difference between latency and bandwidth. Consider the example of a fire-hose. If the water comes out of the hose two seconds after the hydrant is turned on, the latency of the system is two seconds. Once the water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second, the bandwidth of the system is 5 gallons/second. If you want immediate response from the hydrant, it is important to reduce latency. If you want to fight big fires, you want high bandwidth.

Memory Latency: An Example



Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The following observations follow:
• The peak processor rating is 4 GFLOPS.
• Since the memory latency is equal to 100 cycles and block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data.

Memory Latency: An Example


On the above architecture, consider the problem of computing a dot-product of two vectors.
• A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating point operation requires one data fetch.
• It follows that the peak speed of this computation is limited to one floating point operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak processor rating!
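The kernel in question is the familiar dot-product loop; a minimal sketch is shown below. With a block size of one word and no cache, each multiply-add needs two operand fetches at 100 ns each, which is what limits the loop to about 10 MFLOPS on this hypothetical machine.

/* Minimal dot-product kernel (sketch). Each iteration performs one
   multiply-add (two FLOPs) but requires two data fetches, a[i] and b[i];
   with 100 ns memory latency and no cache, the fetches dominate. */
float dot_product(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}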

"mpro(ing ffecti(e Memory Latency Using %ac*es


Caches are small and fast memory elements between the processor and DRAM. This memory acts as a low-latency high-bandwidth storage. If a piece of data is repeatedly used, the effective latency of this memory system can be reduced by the cache. The fraction of data references satisfied by the cache is called the cache hit ratio of the computation on the system. The cache hit ratio achieved by a code on a memory system often determines its performance.

Impact of Caches: Example


Consider the architecture from the previous example. In this case, we introduce a cache of size 32 KB with a latency of 1 ns or one cycle. We use this setup to multiply two matrices A and B of dimensions 32 × 32. We have carefully chosen these numbers so that the cache is large enough to store matrices A and B, as well as the result matrix C.

Impact of Caches: Example (continued)


The following observations can be made about the problem:
• Fetching the two matrices into the cache corresponds to fetching 2K words, which takes approximately 200 µs.
• Multiplying two n × n matrices takes 2n³ operations. For our problem, this corresponds to 64K operations, which can be performed in 16K cycles (or 16 µs) at four instructions per cycle.
• The total time for the computation is therefore approximately the sum of time for load/store operations and the time for the computation itself, i.e., 200 + 16 µs.
• This corresponds to a peak computation rate of 64K/216 µs, or 303 MFLOPS.

Impact of Caches
Repeated references to the same data item correspond to temporal locality. In our example, we had O(n²) data accesses and O(n³) computation. This asymptotic difference makes the above example particularly desirable for caches.

"ata reuse is critical for cache performance.

Impact of Memory Bandwidth


Memory bandwidth is determined by the bandwidth of the memory bus as well as the memory units. Memory bandwidth can be improved by increasing the size of memory blocks. The underlying system takes l time units (where l is the latency of the system) to deliver b units of data (where b is the block size).

Impact of Memory Bandwidth: Example


Consider the same setup as before, except in this case, the block size is 4 words instead of 1 word. We repeat the dot-product computation in this scenario:
• Assuming that the vectors are laid out linearly in memory, eight FLOPs (four multiply-adds) can be performed in 200 cycles.
• This is because a single memory access fetches four consecutive words in the vector.
• Therefore, two accesses can fetch four elements of each of the vectors. This corresponds to a FLOP every 25 ns, for a peak speed of 40 MFLOPS.

Impact of Memory Bandwidth


It is important to note that increasing block size does not change the latency of the system. Physically, the scenario illustrated here can be viewed as a wide data bus (4 words or 128 bits) connected to multiple memory banks. In practice, such wide buses are expensive to construct. In a more practical system, consecutive words are sent on the memory bus on subsequent bus cycles after the first word is retrieved.

Impact of Memory Bandwidth


The above examples clearly illustrate how increased bandwidth results in higher peak computation rates. The data layouts were assumed to be such that consecutive data words in memory were used by successive instructions (spatial locality of reference). If we take a data-layout centric view, computations must be reordered to enhance spatial locality of reference.

Impact of Memory Bandwidth: Example


Consider the following code fragment:
for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
}

The code fragment sums columns of the matrix b into a vector column_sum.

Impact of Memory Bandwidth: Example


The vector column_sum is small and easily fits into the cache. The matrix b is accessed in a column order.

The strided access results in very poor performance.

Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.

Impact of Memory Bandwidth: Example


We can fix the above code as follows:
for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];

In this case, the matrix is traversed in row order and performance can be expected to be significantly better.

Memory System Performance: Summary


The series of examples presented in this section illustrate the following concepts:
• Exploiting spatial and temporal locality in applications is critical for amortizing memory latency and increasing effective memory bandwidth.
• The ratio of the number of operations to the number of memory accesses is a good indicator of anticipated tolerance to memory bandwidth.
• Memory layouts and organizing computation appropriately can make a significant impact on the spatial and temporal locality.

Alternate Approaches for Hiding Memory Latency


Consider the problem of browsing the web on a very slow network connection. We deal with the problem in one of three possible ways:


• we anticipate which pages we are going to browse ahead of time and issue requests for them in advance;
• we open multiple browsers and access different pages in each browser, thus while we are waiting for one page to load, we could be reading others; or
• we access a whole bunch of pages in one go, amortizing the latency across various accesses.
The first approach is called prefetching, the second multithreading, and the third one corresponds to spatial locality in accessing memory words.

Multithreading for Latency Hiding


A thread is a single stream of control in the flow of a program. We illustrate threads with a simple example:
for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);

Each dot-product is independent of the other, and therefore represents a concurrent unit of execution. We can safely rewrite the above code segment as:
for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);

Multithreading for Latency Hiding: Example


In the code, the first instance of this function accesses a pair of vector elements and waits for them. In the meantime, the second instance of this function can access two other vector elements in the next cycle, and so on. After l units of time, where l is the latency of the memory system, the first function instance gets the requested data from memory and can perform the required computation. In the next cycle, the data items for the next function instance arrive, and so on. In this way, in every clock cycle, we can perform a computation.
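The create_thread call above is pseudocode; the sketch below expresses the same idea with POSIX threads (the types and function names here are invented for illustration). The point is only that each dot-product becomes an independent unit of execution whose memory stalls can overlap with those of the other threads.

#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-row argument for the sketch. */
typedef struct { const float *row, *b; float *out; int n; } row_arg_t;

static void *worker(void *p) {
    row_arg_t *arg = (row_arg_t *)p;
    float sum = 0.0f;
    for (int i = 0; i < arg->n; i++)      /* a load here may stall ...       */
        sum += arg->row[i] * arg->b[i];   /* ... while other threads compute */
    *arg->out = sum;
    return NULL;
}

/* One thread per row of an n x n matrix a; error handling omitted. */
void threaded_matvec(const float *a, const float *b, float *c, int n,
                     pthread_t *tid, row_arg_t *args) {
    for (int i = 0; i < n; i++) {
        args[i] = (row_arg_t){ a + (size_t)i * n, b, &c[i], n };
        pthread_create(&tid[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
}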

Multithreading for Latency Hiding


The execution schedule in the previous example is predicated upon two assumptions: the memory system is capable of servicing multiple outstanding requests, and the processor is capable of switching threads at every cycle. It also requires the program to have an explicit specification of concurrency in the form of threads. Machines such as the HEP and Tera rely on multithreaded processors that can switch the context of execution in every cycle. Consequently, they are able to hide latency effectively.

Prefetching for Latency Hiding


Misses on loads cause programs to stall.

Why not advance the loads so that by the time the data is actually needed, it is already there! The only drawback is that you might need more space to store advanced loads. However, if the advanced loads are overwritten, we are no worse off than before!

Tradeoffs of Multithreading and Prefetching


Multithreading and prefetching are critically impacted by the memory bandwidth. Consider the following example:
• Consider a computation running on a machine with a 1 GHz clock, 4-word cache line, single cycle access to the cache, and 100 ns latency to DRAM. The computation has a cache hit ratio at 1 KB of 25% and at 32 KB of 90%. Consider two cases: first, a single threaded execution in which the entire cache is available to the serial context, and second, a multithreaded execution with 32 threads where each thread has a cache residency of 1 KB.
• If the computation makes one data request in every cycle of 1 ns, you may notice that the first scenario requires 400 MB/s of memory bandwidth and the second, 3 GB/s.
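A rough back-of-the-envelope check of these figures, assuming 4-byte words and counting one word of traffic per miss:

Single-threaded (32 KB cache, 90% hit ratio): 0.10 misses per 1 ns cycle × 4 bytes ≈ 400 MB/s.
Multithreaded (1 KB per thread, 25% hit ratio): 0.75 misses per 1 ns cycle × 4 bytes ≈ 3 GB/s.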

Tradeoffs of Multithreading and Prefetching


Bandwidth requirements of a multithreaded system may increase very significantly because of the smaller cache residency of each thread. Multithreaded systems become bandwidth bound instead of latency bound. Multithreading and prefetching only address the latency problem and may often exacerbate the bandwidth problem. Multithreading and prefetching also require significantly more hardware resources in the form of storage.

Explicitly Parallel Platforms

Dichotomy of Parallel Computing Platforms


An explicitly parallel program must specify concurrency and interaction between concurrent subtasks. The former is sometimes also referred to as the control structure and the latter as the communication model.

Control Structure of Parallel Programs


Parallelism can be expressed at various levels of granularity - from instruction level to processes. Between these extremes exist a range of models, along with corresponding architectural support.

Control Structure of Parallel Programs


Processing units in parallel computers either operate under the centralized control of a single control unit or work independently. If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD). If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).

SIMD and MIMD Processors


A t pical SIM" architecture 'a( and a t pical MIM" architecture 'b(.

SIMD Processors
Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines. Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and DSP chips such as the Sharc. SIMD relies on the regular structure of computations (such as those in image processing). It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines if a processor should participate in a computation or not.

Conditional Execution in SIMD Processors


Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
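A sketch of how an activity mask serializes a two-branch conditional such as if (B == 0) C = A; else C = A / B; on a SIMD machine: the condition is evaluated everywhere, one step executes the "then" branch on the processors whose mask bit is set, and a second step executes the "else" branch on the rest (the code below is illustrative, not the exact figure).

enum { NPROC = 4 };   /* four processing elements, as in the figure */

/* Emulating SIMD conditional execution with an activity mask; each index i
   plays the role of one processing element executing in lockstep. */
void simd_conditional(const int A[NPROC], const int B[NPROC], int C[NPROC]) {
    int mask[NPROC];
    for (int i = 0; i < NPROC; i++)       /* evaluate the condition everywhere */
        mask[i] = (B[i] == 0);
    for (int i = 0; i < NPROC; i++)       /* step 1: PEs with the mask set     */
        if (mask[i]) C[i] = A[i];
    for (int i = 0; i < NPROC; i++)       /* step 2: PEs with the mask clear   */
        if (!mask[i]) C[i] = A[i] / B[i];
}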

MIMD Processors
In contrast to SIM" processors$ MIM" processors can e#ecute different programs on different processors. A variant of this$ called single program multiple data streams 'SPM"( e#ecutes the same program on different processors. It is eas to see that SPM" and MIM" are closel related in terms of programming fle#ibilit and underl ing architectural support. =#amples of such platforms include current generation Sun Pltra Servers$ SHI Origin Servers$ multiprocessor PCs$ !or&station clusters$ and the I@M SP.

SIMD-MIMD Comparison
SIM" computers re%uire less hard!are than MIM" computers 'single control unit(. 3o!ever$ since SIM" processors ae speciall designed$ the tend to be e#pensive and have long design c cles. Cot all applications are naturall suited to SIM" processors. In contrast$ platforms supporting the SPM" paradigm can be built from ine#pensive off6 the6shelf components !ith relativel little effort in a short amount of time.

Communication Model of Parallel Platforms


There are two primary forms of data exchange between parallel tasks - accessing a shared data space and exchanging messages. Platforms that provide a shared data space are called shared-address-space machines or multiprocessors. Platforms that support messaging are also called message passing platforms or multicomputers.

Shared-Address-Space Platforms
Part (or all) of the memory is accessible to all processors. Processors interact by modifying data objects stored in this shared-address-space.


If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; else, it is a non-uniform memory access (NUMA) machine.

NUMA and UMA Shared-Address-Space Platforms

Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.

NUMA and UMA Shared-Address-Space Platforms


The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from underlying algorithms for performance. Programming these platforms is easier since reads and writes are implicitly visible to other processors. However, read and write accesses to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming).


Caches in such machines require coordinated access to multiple copies. This leads to the cache coherence problem. A weaker model of these machines provides an address map, but not coordinated access. These models are called non cache coherent shared address space machines.

Shared-Address-Space vs. Shared Memory Machines


It is important to note the difference between the terms shared address space and shared memory. We refer to the former as a programming abstraction and to the latter as a physical machine attribute. It is possible to provide a shared address space using a physically distributed memory.

Message-Passing Platforms
These platforms comprise a set of processors, each with its own (exclusive) memory. Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers. These platforms are programmed using (variants of) send and receive primitives. Libraries such as MPI and PVM provide such primitives.

Message Passing vs. Shared Address Space Platforms


Message passing requires little hardware support, other than a network. Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).

Physical Organi"ation o! Parallel Plat!orms


We begin this discussion with an ideal parallel machine called the Parallel Random Access Machine, or PRAM.

Architecture of an Ideal Parallel Computer


A natural extension of the Random Access Machine (RAM) serial architecture is the Parallel Random Access Machine, or PRAM. PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to all processors. Processors share a common clock but may execute different instructions in each cycle.

Architecture of an Ideal Parallel Computer


"epending on ho! simultaneous memor accesses are handled$ P<AMs can be divided into four subclasses. E =#clusive6read$ e#clusive6!rite '=<=8( P<AM. E Concurrent6read$ e#clusive6!rite 'C<=8( P<AM. E =#clusive6read$ concurrent6!rite '=<C8( P<AM. E Concurrent6read$ concurrent6!rite 'C<C8( P<AM.


Architecture of an Ideal Parallel Computer

What does concurrent write mean, anyway?
• Common: write only if all values are identical.
• Arbitrary: write the data from a randomly selected processor.
• Priority: follow a predetermined priority order.
• Sum: write the sum of all data items.

Physical Complexity of an Ideal Parallel Computer


Processors and memories are connected via switches. Since these switches must operate in O(1) time at the level of words, for a system of p processors and m words, the switch complexity is O(mp). Clearly, for meaningful values of p and m, a true PRAM is not realizable.

Interconnection Networks for Parallel Computers


Interconnection networks carry data between processors and to memory. Interconnects are made of switches and links (wires, fiber). Interconnects are classified as static or dynamic. Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks. Dynamic networks are built using switches and communication links. Dynamic networks are also referred to as indirect networks.

Static and Dynamic Interconnection Networks

Classification of interconnection networks: (a) a static network; and (b) a dynamic network.


Interconnection Networks
Switches map a fixed number of inputs to outputs. The total number of ports on a switch is the degree of the switch. The cost of a switch grows as the square of the degree of the switch, the peripheral hardware linearly as the degree, and the packaging costs linearly as the number of pins.

Interconnection Networks: Network Interfaces


Processors talk to the network via a network interface. The network interface may hang off the I/O bus or the memory bus. In a physical sense, this distinguishes a cluster from a tightly coupled multicomputer. The relative speeds of the I/O and memory buses impact the performance of the network.

Network Topologies
A variety of network topologies have been proposed and implemented. These topologies trade off performance for cost. Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.

Network Topologies: Buses


Some of the simplest and earliest parallel machines used buses. All processors access a common bus for exchanging data. The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast media. However, the bandwidth of the shared bus is a major bottleneck. Typical bus based machines are limited to dozens of nodes. Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.

Network Topologies: Buses


Bus-based interconnects (a) with no local caches; (b) with local memory/caches. Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.

Network Topologies: Crossbars


A crossbar network uses a p × m grid of switches to connect p inputs to m outputs in a non-blocking manner.

A completely non-blocking crossbar network connecting p processors to b memory banks.


Network Topologies: Crossbars


The cost of a crossbar of p processors grows as O(p²). This is generally difficult to scale for large values of p. Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP500.

Network Topologies: Multistage Networks


Crossbars have excellent performance scalability but poor cost scalability. Buses have excellent cost scalability, but poor performance scalability. Multistage interconnects strike a compromise between these extremes.

Network Topologies: Multistage Networks

The schematic of a typical multistage interconnection network.


Network Topologies: Multistage Omega Network


One of the most commonly used multistage interconnects is the Omega network. This network consists of log p stages, where p is the number of inputs/outputs. At each stage, input i is connected to output j if:

j = 2i,           for 0 ≤ i ≤ p/2 − 1
j = 2i + 1 − p,   for p/2 ≤ i ≤ p − 1
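Equivalently, the stage mapping is a one-bit left rotation of the (log p)-bit representation of i. A minimal sketch of this mapping (assuming p is a power of two; the function name is illustrative):

/* Perfect-shuffle (left-rotation) mapping used by each Omega stage. */
unsigned omega_shuffle(unsigned i, unsigned bits, unsigned p) {
    unsigned msb = (i >> (bits - 1)) & 1u;   /* most significant bit of i */
    return ((i << 1) | msb) & (p - 1u);      /* rotate left by one bit    */
}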

Network Topologies: Multistage Omega Network


Each stage of the Omega network implements a perfect shuffle as follows:

A perfect shuffle interconnection for eight inputs and outputs.

Network Topologies: Multistage Omega Network


The perfect shuffle patterns are connected using 2 × 2 switches. The switches operate in two modes - crossover or passthrough.


Two switching configurations of the 2 × 2 switch: (a) Pass-through; (b) Cross-over.

Network Topologies: Multistage Omega Network


A complete Omega network with the perfect shuffle interconnects and switches can now be illustrated:

A complete omega network connecting eight inputs and eight outputs. An omega network has (p/2) × log p switching nodes, and the cost of such a network grows as p log p.

Network Topologies: Multistage Omega Network - Routing


Let s be the binary representation of the source processor and d be that of the destination processor.


The data traverses the link to the first switching node. If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; else, it switches to crossover. This process is repeated for each of the log p switching stages. Note that this is not a non-blocking switch.
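A compact sketch of this destination-tag routing rule (the names are illustrative, not from the slides): at stage k the switch compares bit (log p − 1 − k) of s and d and selects pass-through or crossover accordingly.

#define PASS_THROUGH 0
#define CROSSOVER    1

/* Fill route[k] with the switch setting at stage k for a message from
   source s to destination d; stages = log2(p). */
void omega_route(unsigned s, unsigned d, unsigned stages, int route[]) {
    for (unsigned k = 0; k < stages; k++) {
        unsigned bit = stages - 1 - k;        /* bit examined at stage k */
        unsigned sb = (s >> bit) & 1u;
        unsigned db = (d >> bit) & 1u;
        route[k] = (sb == db) ? PASS_THROUGH : CROSSOVER;
    }
}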

Network Topologies: Multistage Omega Network - Routing

An example of blocking in the omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.

Network Topologies: Completely Connected Network


Each processor is connected to every other processor. The number of links in the network scales as O(p²). While the performance scales very well, the hardware complexity is not realizable for large values of p. In this sense, these networks are static counterparts of crossbars.


Network Topologies: Completely Connected and Star Connected Networks


Example of an 8-node completely connected network.

(a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.

Network Topologies: Star Connected Network


Every node is connected only to a common node at the center. The distance between any pair of nodes is O(1). However, the central node becomes a bottleneck. In this sense, star connected networks are static counterparts of buses.

Network Topologies: Linear Arrays, Meshes, and k-d Meshes


In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring. A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west. A further generalization to d dimensions has nodes with 2d neighbors. A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.

Network Topologies: Linear Arrays


Linear arrays: (a) with no wraparound links; (b) with wraparound link.

Network Topologies: Two- and Three-Dimensional Meshes

Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.

Network Topologies: Hypercubes and their Construction


Construction of hypercubes from hypercubes of lower dimension.

Network Topologies: Properties of Hypercubes


The distance between any two nodes is at most log p. Each node has log p neighbors. The distance between two nodes is given by the number of bit positions at which the two nodes differ.

Network Topologies: Tree-Based Networks

Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.

Network Topologies: Tree Properties


The distance between any two nodes is no more than 2 log p.

Links higher up the tree potentially carry more traffic than those at the lower levels. For this reason, a variant called a fat-tree fattens the links as we go up the tree. Trees can be laid out in 2D with no wire crossings. This is an attractive property of trees.

Network Topologies: Fat Trees

A fat tree network of 16 processing nodes.

Evaluating Static Interconnection Networks


Diameter: The distance between the farthest two nodes in the network. The diameter of a linear array is p − 1, that of a mesh is 2(√p − 1), that of a tree and hypercube is log p, and that of a completely connected network is O(1).
Bisection Width: The minimum number of wires you must cut to divide the network into two equal parts. The bisection width of a linear array and tree is 1, that of a mesh is √p, that of a hypercube is p/2, and that of a completely connected network is p²/4.
Cost: The number of links or switches (whichever is asymptotically higher) is a meaningful measure of the cost. However, a number of other factors, such as the ability to lay out the network, the length of wires, etc., also factor into the cost.


Evaluating Static Interconnection Networks

Network                     Diameter            Bisection Width    Arc Connectivity    Cost (No. of links)
Completely-connected        1                   p²/4               p − 1               p(p − 1)/2
Star                        2                   1                  1                   p − 1
Complete binary tree        2 log((p + 1)/2)    1                  1                   p − 1
Linear array                p − 1               1                  1                   p − 1
2-D mesh, no wraparound     2(√p − 1)           √p                 2                   2(p − √p)
2-D wraparound mesh         2⌊√p/2⌋             2√p                4                   2p
Hypercube                   log p               p/2                log p               (p log p)/2
Wraparound k-ary d-cube     d⌊k/2⌋              2k^(d−1)           2d                  dp

Evaluating Dynamic Interconnection Networks


Networks compared: Crossbar, Omega Network, Dynamic Tree.
Metrics: Diameter, Bisection Width, Arc Connectivity, and Cost (No. of links); the values are tabulated in the reference text.

Cache Coherence in Multiprocessor Systems


Interconnects provide basic mechanisms for data transfer. In the case of shared address space machines, additional hardware is required to coordinate access to data that might have multiple copies in the network. The underlying technique must provide some guarantees on the semantics. This guarantee is generally one of serializability, i.e., there exists some serial order of instruction execution that corresponds to the parallel schedule.

When the value of a variable is changed, all its copies must either be invalidated or updated.

Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b) Update protocol for shared variables.


Cache Coherence: Update and Invalidate Protocols


If a processor just reads a value once and does not need it again, an update protocol may generate significant overhead. If two processors make interleaved tests and updates to a variable, an update protocol is better. Both protocols suffer from false sharing overheads (two words that are not shared, however, they lie on the same cache line). Most current machines use invalidate protocols.

Maintaining Coherence Using Invalidate Protocols


Each copy of a data item is associated with a state. One example of such a set of states is shared, invalid, or dirty. In shared state, there are multiple valid copies of the data item (and therefore, an invalidate would have to be generated on an update). In dirty state, only one copy exists and therefore, no invalidates need to be generated. In invalid state, the data copy is invalid, therefore, a read generates a data request (and associated state changes).

Maintaining Coherence Using Invalidate Protocols

State diagram of a simple three-state coherence protocol.


Maintaining Coherence Using Invalidate Protocols

Example of parallel program execution with the simple three-state coherence protocol.

Snoopy Cache Systems


How are invalidates sent to the right processors? In snoopy caches, there is a broadcast media that listens to all invalidates and read requests and performs appropriate coherence operations locally.


A simple snoopy bus based cache coherence system.

Performance of Snoopy Caches


Once copies of data are tagged dirty, all subsequent operations can be performed locally on the cache without generating external traffic. If a data item is read by a number of processors, it transitions to the shared state in the cache and all subsequent read operations become local. If processors read and update data at the same time, they generate coherence requests on the bus - which is ultimately bandwidth limited.

Directory Based Systems


In snoopy caches, each coherence operation is sent to all processors. This is an inherent limitation. Why not send coherence requests to only those processors that need to be notified? This is done using a directory, which maintains a presence vector for each data item (cache line) along with its global state.

Directory Based Systems


Architecture of typical directory based systems: (a) a centralized directory; and (b) a distributed directory.

Performance of Directory Based Schemes


The need for a broadcast media is replaced by the directory. The additional bits to store the directory may add significant overhead. The underlying network must be able to carry all the coherence requests. The directory is a point of contention; therefore, distributed directory schemes must be used.

Communication Costs in Parallel Machines


Along with idling and contention, communication is a major overhead in parallel programs. The cost of communication is dependent on a variety of features including the programming model semantics, the network topology, data handling and routing, and associated software protocols.

Message Passing Costs in Parallel Computers

The total time to transfer a message over a network comprises the following:
• Startup time (ts): Time spent at sending and receiving nodes (executing the routing algorithm, programming routers, etc.).
• Per-hop time (th): This time is a function of the number of hops and includes factors such as switch latencies, network delays, etc.
• Per-word transfer time (tw): This time includes all overheads that are determined by the length of the message. This includes bandwidth of links, error checking and correction, etc.

Store-and-Forward Routing
A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop. The total communication cost for a message of size m words to traverse l communication links is

tcomm = ts + (m tw + th) l

In most platforms, th is small and the above expression can be approximated by

tcomm = ts + m tw l


Routing Techniques


Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.

Packet Routing
Store-and-forward makes poor use of communication resources. Packet routing breaks messages into packets and pipelines them through the network. Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information. The total communication time for packet routing is approximated by:

tcomm = ts + th l + tw m

The factor tw accounts for overheads in pac&et headers.

Cut-Through Routing
Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits. Since flits are typically small, the header information must be minimized. This is done by forcing all flits to take the same path, in sequence. A tracer message first programs all intermediate routers. All flits then take the same route. Error checks are performed on the entire message, as opposed to flits. No sequence numbers are needed.

Cut-Through Routing
The total communication time for cut-through routing is approximated by:

tcomm = ts + l th + m tw

This is identical to packet routing; however, tw is typically much smaller.

Simplified Cost Model for Communicating Messages


The cost of communicating a message between two nodes l hops away using cut-through routing is given by

tcomm = ts + l th + m tw

In this expression, th is typically smaller than ts and tw. For this reason, the second term on the RHS does not contribute significantly, particularly when m is large. Furthermore, it is often not possible to control routing and placement of tasks. For these reasons, we can approximate the cost of message transfer by

tcomm = ts + m tw

Simplified Cost Model for Communicating Messages


It is important to note that the original expression for communication time is valid only for uncongested networks. If a link takes multiple messages, the corresponding tw term must be scaled up by the number of messages. Different communication patterns congest different networks to varying extents. It is important to understand and account for this in the communication time accordingly.

Cost Models for Shared Address Space Machines


While the basic messaging cost applies to these machines as well, a number of other factors make accurate cost modeling more difficult. Memory layout is typically determined by the system. Finite cache sizes can result in cache thrashing. Overheads associated with invalidate and update operations are difficult to quantify. Spatial locality is difficult to model. Prefetching can play a role in reducing the overhead associated with data access. False sharing and contention are difficult to model.

Routing Mechanisms for Interconnection Networks


How does one compute the route that a message takes from source to destination?
• Routing must prevent deadlocks - for this reason, we use dimension-ordered or e-cube routing.
• Routing must avoid hot-spots - for this reason, two-step routing is often used. In this case, a message from source s to destination d is first sent to a randomly chosen intermediate processor i and then forwarded to destination d.

Routing Mechanisms for Interconnection Networks


Routing a message from node Ps (010) to node Pd (111) in a three-dimensional hypercube using E-cube routing.

Mapping Techniques for Graphs


Often, we need to embed a known communication pattern into a given interconnection topology. We may have an algorithm designed for one network, which we are porting to another topology. For these reasons, it is useful to understand mapping between graphs.

Mapping Techniques for Graphs: Metrics


When mapping a graph G(V, E) into G'(V', E'), the following metrics are important:
The maximum number of edges mapped onto any edge in E' is called the congestion of the mapping.
The maximum number of links in E' that any edge in E is mapped onto is called the dilation of the mapping.
The ratio of the number of nodes in the set V' to that in set V is called the expansion of the mapping.

Embedding a Linear Array into a Hypercube


A linear array (or a ring) composed of 2^d nodes (labeled 0 through 2^d − 1) can be embedded into a d-dimensional hypercube by mapping node i of the linear array onto node G(i, d) of the hypercube. The function G(i, x) is defined as follows:

G(0, 1) = 0
G(1, 1) = 1
G(i, x + 1) = G(i, x)                         for i < 2^x
G(i, x + 1) = 2^x + G(2^(x+1) − 1 − i, x)     for i ≥ 2^x

Embedding a Linear Array into a Hypercube


The function G is called the binary reflected Gray code (RGC). Since adjoining entries (G(i, d) and G(i + 1, d)) differ from each other at only one bit position, corresponding processors are mapped to neighbors in a hypercube. Therefore, the congestion, dilation, and expansion of the mapping are all 1.
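The binary reflected Gray code also has the well-known closed form G(i) = i XOR (i >> 1), which gives a compact sketch of the ring-to-hypercube mapping (function names are illustrative):

/* Binary reflected Gray code: hypercube label for ring position i. */
unsigned gray(unsigned i) {
    return i ^ (i >> 1);
}

/* Map a ring of p = 2^d nodes onto a d-dimensional hypercube; consecutive
   ring positions map to hypercube labels that differ in exactly one bit. */
void map_ring_to_hypercube(unsigned p, unsigned node_of[]) {
    for (unsigned i = 0; i < p; i++)
        node_of[i] = gray(i);
}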

Embedding a Linear Array into a Hypercube: Example


(a) A three-bit reflected Gray code ring; and (b) its embedding into a three-dimensional hypercube.

Embedding a Mesh into a Hypercube


A 2^r × 2^s wraparound mesh can be mapped to a 2^(r+s)-node hypercube by mapping node (i, j) of the mesh onto node G(i, r − 1) || G(j, s − 1) of the hypercube (where || denotes concatenation of the two Gray codes).

Embedding a Mesh into a Hypercube


(a) A 4 × 4 mesh illustrating the mapping of mesh nodes to the nodes in a four-dimensional hypercube; and (b) a 2 × 4 mesh embedded into a three-dimensional hypercube. Once again, the congestion, dilation, and expansion of the mapping is 1.

Embedding a Mesh into a Linear Array


Since a mesh has more edges than a linear array, we will not have an optimal congestion/dilation mapping. We first examine the mapping of a linear array into a mesh and then invert this mapping. This gives us an optimal mapping (in terms of congestion).

Embedding a Mesh into a Linear Array: Example


(a) Embedding a 16-node linear array into a 2-D mesh; and (b) the inverse of the mapping. Solid lines correspond to links in the linear array and normal lines to links in the mesh.

Embedding a Hypercube into a 2-D Mesh


Each √p-node subcube of the hypercube is mapped to a √p-node row of the mesh. This is done by inverting the linear-array to hypercube mapping. This can be shown to be an optimal mapping.

Embedding a Hypercube into a 2-D Mesh: Example

Case Studies: The IBM Blue Gene Architecture



The hierarchical architecture of Blue Gene.

Case Studies: The Cray T3E Architecture

Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.

Case Studies: The SGI Origin 3000 Architecture


Architecture of the SGI Origin 3000 family of servers.

Case Studies: The Sun HPC Server Architecture


Architecture of the Sun Enterprise family of servers.

