5.5 Crosscutting Issues: The Design of Memory Hierarchies
Protection and Instruction Set Architecture

Protection is a joint effort of architecture and operating systems, but architects had to modify some details of existing instruction set architectures when virtual memory became popular. IBM mainframe hardware and VMM took three steps to improve performance of VMs:
1. Reduce the cost of processor virtualization.
2. Reduce interrupt overhead cost due to the virtualization.
3. Reduce interrupt cost by steering interrupts to the proper VM without invoking the VMM.
In 2006, new proposals by AMD and Intel tried to address the first point, reducing the cost of processor virtualization.

Speculative Execution and the Memory System

Inherent in processors that support speculative execution or conditional instructions is the possibility of generating invalid addresses. It would be incorrect behavior if protection exceptions were taken for such addresses, and the benefits of speculative execution would be swamped by false exception overhead. Hence, the memory system must identify speculatively executed instructions and conditionally executed instructions and suppress the corresponding exception. Similarly, we cannot allow such instructions to cause the cache to stall on a miss, because again unnecessary stalls could overwhelm the benefits of speculation. Hence, these processors must be matched with nonblocking caches. In reality, the penalty of an L2 miss is so large that compilers normally only speculate on L1 misses. For some scientific programs the compiler can sustain multiple outstanding L2 misses to cut the L2 miss penalty effectively.

I/O and Consistency of Cached Data

Data can be found in memory and in the cache. As long as one processor is the sole device changing or reading the data and the cache stands between the processor and memory, there is little danger in the processor seeing the old or stale copy. Multiple processors and I/O devices raise the opportunity for copies to be inconsistent and to read the wrong copy.
The frequency of the cache coherency problem is different for multiprocessors than for I/O. A program running on multiple processors will want to have copies of the same data in several caches, and performance of a multiprocessor program depends on the performance of the system when sharing data. The I/O cache coherency question is this: where does the I/O occur in the computer, between the I/O device and the cache or between the I/O device and main memory? If input puts data into the cache and output reads data from the cache, both I/O and the processor see the same data. The difficulty in this approach is that it interferes with the processor and can cause the processor to stall for I/O. Input may also interfere with the cache by displacing some information with new data that is unlikely to be accessed soon. The goal for the I/O system in a computer with a cache is to prevent the stale data problem while interfering as little as possible. Many systems, therefore, prefer that I/O occur directly to main memory, with main memory acting as an I/O buffer. If a write-through cache were used, then memory would have an up-to-date copy of the information, and there would be no stale-data issue for output. Write through is usually used in first-level data caches, while the L2 cache uses write back. Input requires some extra work. The software solution is to guarantee that no blocks of the input buffer are in the cache: a page containing the buffer can be marked as noncacheable, and the operating system can always input to such a page; alternatively, the operating system can flush the buffer addresses from the cache before the input occurs. A hardware solution is to check the I/O addresses on input to see if they are in the cache. If there is a match of I/O addresses in the cache, the cache entries are invalidated to avoid stale data. All these approaches can also be used for output with write-back caches.
RAGHAVENDRA REDDY, MTECH CSE, REVA ITM
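As a toy illustration of the hardware solution above, here is a sketch of invalidating any cached blocks touched by a DMA input. All names and the dictionary-based cache model are invented for illustration; no real cache controller works at this level.

```python
# Sketch of the "hardware solution": on DMA input, check each I/O address
# against the cache and invalidate matching entries to avoid stale data.
# The cache is modeled as a dict mapping block address -> cached bytes.

BLOCK_SIZE = 64  # bytes per cache block (assumed)

def block_addr(addr):
    """Round an address down to the start of its cache block."""
    return addr - (addr % BLOCK_SIZE)

def dma_input(cache, memory, start, data):
    """Write 'data' into memory at 'start', invalidating any cached copies."""
    for i, byte in enumerate(data):
        memory[start + i] = byte
    # Invalidate every cached block the transfer touched.
    touched = {block_addr(a) for a in range(start, start + len(data))}
    for b in touched:
        cache.pop(b, None)  # drop the stale copy if present

cache = {0: b"old", 64: b"old"}
memory = {}
dma_input(cache, memory, 60, b"\x01" * 8)   # transfer spans blocks 0 and 64
assert 0 not in cache and 64 not in cache   # stale copies invalidated
```

With a write-back cache, output needs the mirror-image check: a match must be written back (flushed) rather than just invalidated, so the I/O device reads the latest data.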
5.6 Putting It All Together: AMD Opteron Memory Hierarchy

The Opteron is an out-of-order execution processor that fetches up to three 80x86 instructions per clock cycle, translates them into RISC operations, and has 11 parallel execution units. In 2006, the 12-stage integer pipeline yields a maximum clock rate of 2.8 GHz, and the fastest memory supported is PC3200 DDR SDRAM. The Opteron uses 48-bit virtual addresses and 40-bit physical addresses. Figure 5.18 shows the mapping of the address through the multiple levels of data caches and TLBs.

The PC is sent to the instruction cache, which is 64 KB, two-way set associative with a 64-byte block size and LRU replacement, so the cache index is 9 bits. It is virtually indexed and physically tagged.

Figure 5.18 The virtual address, physical address, indexes, tags, and data blocks for the AMD Opteron caches and TLBs.

Step 1: The page frame portion of the instruction's virtual address is sent to the instruction TLB.
Step 2: At the same time, the 9-bit index from the virtual address is sent to the instruction cache.
Steps 3, 4: The fully associative TLB simultaneously searches all 40 entries to find a match between the address and a valid PTE. The Opteron checks for changes to the actual page table in memory and flushes the TLB only when that data structure is changed. In the worst case, the page is not in memory, and the operating system gets the page from disk. Since millions of instructions could execute during a page fault, the operating system will swap in another process if one is waiting to run. Otherwise, the instruction cache access continues.
Step 5: The index field of the address is sent to both groups of the two-way set associative instruction cache.
Step 6: The instruction cache tag is 40 - 9 (index) - 6 (block offset) bits, or 25 bits. The two tags and valid bits are compared to the physical page frame from the instruction TLB.
Step 7: The instruction cache is virtually addressed and physically tagged.
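The index and tag arithmetic in steps 2 and 6 can be checked with a short sketch, assuming the parameters given above (64 KB, two-way set associative, 64-byte blocks, 40-bit physical addresses); the function name is illustrative.

```python
# Address breakdown for the Opteron L1 instruction cache described above:
# 64 KB, two-way set associative, 64-byte blocks, 40-bit physical addresses.
CACHE_BYTES, ASSOC, BLOCK = 64 * 1024, 2, 64
SETS = CACHE_BYTES // (ASSOC * BLOCK)      # 512 sets
OFFSET_BITS = BLOCK.bit_length() - 1       # 6 bits of block offset
INDEX_BITS = SETS.bit_length() - 1         # 9 bits of index, as in step 2
TAG_BITS = 40 - INDEX_BITS - OFFSET_BITS   # 25 bits of tag, as in step 6

def split(paddr):
    """Split a physical address into (tag, index, offset) fields."""
    offset = paddr & (BLOCK - 1)
    index = (paddr >> OFFSET_BITS) & (SETS - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

assert (INDEX_BITS, TAG_BITS) == (9, 25)
tag, index, offset = split(0x12345678)
# The three fields reassemble into the original address.
assert (tag << (OFFSET_BITS + INDEX_BITS)) | (index << OFFSET_BITS) | offset == 0x12345678
```

Because 9 index bits plus 6 offset bits fit within the 12-bit (4 KB) page offset only partially, the virtually indexed, physically tagged design must deal with the synonym problem described next.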
On a miss, the cache controller must check for a synonym (two different virtual addresses that reference the same physical address).
Step 8: On a miss, the second-level cache tries to fetch the block. The L2 cache uses a pseudo-LRU scheme: it manages eight pairs of blocks with LRU and then randomly picks one block of the LRU pair on a replacement. For the 1 MB, 16-way set associative L2, the index is 1,048,576/(64 x 16) = 1024, so the L2 index is 10 bits.
Step 9: Once again, the index and tag are sent to all 16 groups of the 16-way set associative L2 cache,
Step 10: which are compared in parallel. If one matches and is valid, it returns the block in sequential order, 8 bytes per clock cycle.
Step 11: If the instruction is not found in the secondary cache, the on-chip memory controller must get the block from main memory. Since there is only one memory controller, the same address is sent on both channels.
Step 12: Wide transfers happen when both channels have identical DIMMs. Each channel supports up to four DDR DIMMs.
Step 13: The total latency of an instruction miss that is serviced by main memory is approximately 20 processor cycles plus the DRAM latency for the critical instructions. The memory controller fills the remainder of the 64-byte cache block at a rate of 16 bytes per memory clock cycle.
Step 14: The Opteron has a prefetch engine associated with the L2 cache. It looks for patterns of L2 misses to consecutive blocks, either ascending or descending, and then prefetches the next line into the L2 cache.
Step 15: Since the second-level cache is a write-back cache, any miss can lead to an old block being written back to memory. The Opteron places this "victim" block into a victim buffer, as it does with a victim dirty block in the data cache.
Step 16: The data cache and L2 cache check the victim buffer for the missing block; if it is found there, the access stalls until the data is written to memory and then fetched again. The new data are loaded into the instruction cache as soon as they arrive.

Performance of the Opteron Memory Hierarchy
How well does the Opteron work? The major components are the instruction and data caches, the instruction and data TLBs, and the secondary cache. A memory stall for one instruction may be completely hidden by successful completion of a later instruction. For the Opteron memory hierarchy, the average SPEC instruction cache misses per instruction are 0.01% to 0.09%, the average data cache misses per instruction are 1.34% to 1.43%, and the average L2 cache misses per instruction are 0.23% to 0.36%. The pipeline stall portion is a lower bound: about 50% of the memory CPI (25% overall) is due to L2 cache misses for the integer programs, and about 70% of the memory CPI (40% overall) for the floating-point programs. Although they are executing the same programs compiled for the same instruction set, the compilers and resulting code sequences are different, as are the memory hierarchies. The following table summarizes the two memory hierarchies:

5.7 Fallacies and Pitfalls

The memory hierarchy is less vulnerable to fallacies and pitfalls.
1. Fallacy: Predicting cache performance of one program from another. Depending on the program, the data misses will change. Commercial programs such as databases will have significant miss rates even in large second-level caches.
2. Pitfall: Simulating enough instructions to get accurate performance measures of the memory hierarchy. There are really three pitfalls here. One is trying to predict performance of a large cache using a small trace. Another is that a program's locality behavior is not constant over the run of the entire program. The third is that a program's locality behavior may vary depending on the input.
3. Pitfall: Overemphasizing memory bandwidth in DRAMs. RAMBUS innovated on the DRAM interface. Its product, Direct RDRAM, offered up to 1.6 GB/sec of bandwidth from a single DRAM.
PCs do most memory accesses through a two-level cache hierarchy, so it was unclear how much benefit is gained from high bandwidth without also improving memory latency. One measure of the RDRAM cost is die size: it had about a 20% larger die for the same capacity compared to SDRAM. DRAM designers use redundant rows and columns to improve yield significantly on the memory portion of the DRAM.
4. Pitfall: Not delivering high memory bandwidth in a cache-based system. Caches help with average cache memory latency but may not deliver high memory bandwidth to an application that must go to main memory. The architect must design a high-bandwidth memory behind the cache for such applications. The leaders on memory bandwidth benchmarks are typically vector computers, which do not rely on data caches for memory performance.
5. Pitfall: Implementing a virtual machine monitor on an ISA that wasn't designed to be virtualizable. Designing a VMM for such an ISA is a challenging task. Because the 80x86 TLBs do not support process ID tags, it is more expensive for the VMM and guest OSes to share the TLB; each address space change typically requires a TLB flush. Virtualizing I/O is also a challenge for the 80x86, because it supports memory-mapped I/O as well as separate I/O instructions, and there is a very large number and variety of device types and device drivers for PCs that the VMM must handle. Third-party vendors supply their own drivers, and they may not virtualize properly. One solution for conventional VM implementations is to load real device drivers directly into the VMM. To simplify implementations of VMMs on the 80x86, both AMD and Intel have proposed extensions to the architecture. Intel's VT-x provides a new execution mode for running VMs.

6.1 Introduction

6.2 Advanced Topics in Disk Storage

The disk industry has concentrated on improving the capacity of disks. Improvement in capacity is customarily expressed as improvement in areal density, measured in bits per square inch:

Areal density = (Tracks/inch on a disk surface) x (Bits/inch on a track)
Through about 1988, the rate of improvement of areal density was 29% per year, thus doubling density every three years. Between then and about 1996, the rate improved to 60% per year, quadrupling density every three years and matching the traditional rate of DRAMs. From 1997 to about 2003, the rate increased to 100%, doubling every year. After that the rate dropped to about 30% per year. Cost per gigabyte has dropped at least as fast as areal density has increased.

The bandwidth gap is more complex. Many have tried to invent a technology cheaper than DRAM but faster than disk to fill that gap, but the challengers have never had a product to market at the right time. By the time a new product would ship, DRAMs and disks have made advances, costs have dropped accordingly, and the challenging product is immediately obsolete. Flash memory is semiconductor memory that is nonvolatile like disks and has about the same bandwidth as disks, but latency is 100-1000 times faster than disk. In 2006, the price per gigabyte of flash was about the same as DRAM. Flash is popular in cameras and portable music players because it comes in much smaller capacities and is more power efficient than disks, although its cost per gigabyte is higher than disks. But flash memory is not popular in desktop and server computers.

Blocks in the same cylinder take less time to access since there is no seek time, and some tracks are closer than others. Figure 6.2 shows how a queue depth of 50 can double the number of I/Os per second of random I/Os due to better scheduling of accesses. Given buffers, caches, and out-of-order accesses, an accurate performance model of a real disk is much more complicated than sector-track-cylinder.

Figure 6.2 Throughput versus command queue depth using random 512-byte reads. The disk performs 170 reads per second starting with no command queue, and doubles performance at 50 and triples at 256.
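The kind of I/Os-per-second calculation behind figures like 6.2 can be sketched as follows; the seek time and transfer rate here are illustrative assumptions, not the parameters of any measured drive.

```python
# Back-of-the-envelope random-read service time:
# average seek + half a rotation + transfer of one sector.
avg_seek_ms = 4.0            # assumed average seek time
rpm = 7200                   # SATA-class rotation speed
transfer_mb_per_s = 60.0     # assumed media transfer rate
sector_bytes = 512

half_rotation_ms = 0.5 * 60_000 / rpm                         # ~4.17 ms at 7200 RPM
transfer_ms = sector_bytes / (transfer_mb_per_s * 1e6) * 1e3  # ~0.009 ms, negligible
service_ms = avg_seek_ms + half_rotation_ms + transfer_ms

iops = 1000 / service_ms     # I/Os per second with no command queuing
assert 100 < iops < 150      # roughly 120 for these assumed parameters
```

Command queuing helps precisely because the seek and rotation terms dominate: reordering queued requests shrinks the effective seek distance and rotational delay, which is how the disk in Figure 6.2 can double its random-I/O rate at a queue depth of 50.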
Disk Power

Power is an increasing concern for disks as well as for processors. A typical ATA disk in 2006 might use 9 watts when idle, 11 watts when reading or writing, and 13 watts when seeking. One formula that indicates the importance of rotation speed and the size of the platters for the power consumed by the disk motor is the following:

Power ~ Diameter^4.6 x RPM^2.8 x Number of platters

Thus, smaller platters, slower rotation, and fewer platters all help reduce disk motor power, and most of the power is in the motor. Figure 6.3 shows the specifications of two 3.5-inch disks in 2006. The Serial ATA (SATA) disks shoot for high capacity and the best cost per gigabyte, so the 500 GB drives cost less than $1 per gigabyte. They use the widest platters that fit the form factor and use four or five of them, but they spin at 7200 RPM and seek relatively slowly to lower power. The corresponding Serial Attach SCSI (SAS) drive aims at performance, so it spins at 15,000 RPM and seeks much faster. To reduce power, the platter is much narrower than the form factor, and the drive has only a single platter. This combination reduces the capacity of the SAS drive to 37 GB. The cost per gigabyte is better for the SATA drives, while the cost per I/O per second or per MB transferred per second is better for the SAS drives. The SAS disks use twice the power of the SATA drives, due to the much faster RPM and seeks.

Figure 6.3 Serial ATA (SATA) versus Serial Attach SCSI (SAS) drives in the 3.5-inch form factor in 2006. The I/Os per second are calculated using the average seek plus the time for one-half rotation plus the time to transfer one sector of 512 bytes.

Advanced Topics in Disk Arrays (10m)

Spreading data over multiple disks is called striping. A disk array has more faults than a smaller number of larger disks, so the lost information is reconstructed from redundant information. The mean time to failure (MTTF) of disks is tens of years, and the MTTR (mean time to repair) is measured in hours.
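A quick back-of-the-envelope view of why an array has more faults than a single disk, assuming independent failures and an illustrative per-disk MTTF:

```python
# With independent failures, failure rates add: N disks fail N times as
# often as one, so the array's MTTF is the per-disk MTTF divided by N.
# The numbers below are illustrative, not from any vendor datasheet.
disk_mttf_hours = 1_000_000      # "tens of years" for a single disk
n_disks = 100

array_failure_rate = n_disks / disk_mttf_hours   # failures per hour
array_mttf_hours = 1 / array_failure_rate
assert array_mttf_hours == 10_000                # only a bit over a year
```

This is exactly why redundancy plus a short MTTR matters: the array as a whole fails every year or so, but as long as the failed disk is repaired (in hours) before a second one fails, no data is lost.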
Such redundant disk arrays are known as RAID, originally redundant array of inexpensive disks, with "independent" also used for the I in the acronym. Array performance can be measured in MB/sec or in I/Os per second. Figure 6.4 summarizes the five standard RAID levels.

RAID 0: It has no redundancy and is sometimes called JBOD (just a bunch of disks). The data may be striped across the disks in the array. This level is generally used as a measuring stick for the other RAID levels in terms of cost, performance, and dependability.

Figure 6.4 RAID levels, their fault tolerance, and their overhead in redundant disks. Mirroring (RAID 1) in this instance can survive up to eight disk failures provided only one disk of each mirrored pair fails; the worst case is both disks in a mirrored pair.

RAID 1: Also called mirroring or shadowing, there are two copies of every piece of data. It is the simplest and oldest disk redundancy scheme, but it also has the highest cost. Some array controllers will optimize read performance by allowing the mirrored disks to act independently for reads, but it may take longer for the mirrored writes to complete.

RAID 2: This organization applies memory-style error-correcting codes to disks. It was included because there was such a disk array product at the time of the original RAID paper, but such products have since disappeared.

RAID 3: Since the higher-level disk interfaces understand the health of a disk, it is easy to figure out which disk failed. One extra disk is used to hold all the parity information of the data disks; this disk allows recovery from a disk failure. The data is organized in stripes, with N data blocks and one parity block. When a failure occurs, you just "subtract" the good data from the good blocks, and what remains is the missing data. RAID 3 assumes the data is spread across all disks on reads and writes.

RAID 4: We can increase the number of reads per second by allowing each disk to perform independent reads.
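The parity "subtraction" used by RAID 3 and RAID 4 is just XOR: the parity block is the XOR of the data blocks, and a lost block is the XOR of the survivors and the parity. A minimal sketch (the block contents are made up):

```python
# RAID 3/4 parity: parity = XOR of all data blocks; a lost block is
# recovered by XOR-ing the surviving blocks with the parity block.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"\x0f\xf0", b"\x33\xcc", b"\xaa\x55"]   # N = 3 data blocks
parity = xor_blocks(data)                        # the check disk's block

lost = 1                                         # pretend disk 1 failed
survivors = [blk for i, blk in enumerate(data) if i != lost]
recovered = xor_blocks(survivors + [parity])
assert recovered == data[lost]                   # the missing data is back
```

XOR works here because it is its own inverse: XOR-ing the parity with all surviving data blocks cancels everything except the missing block.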
To increase the number of writes per second: first, the array reads the old data that is about to be overwritten, and calculates which bits would change before it writes the new data. It then reads the old value of the parity on the check disk, updates the parity according to the list of changes, and writes the new value of the parity to the check disk. These so-called "small writes" are still slower than small reads, but they are faster than if you had to read all disks on every write. RAID 4 has the same low check disk overhead as RAID 3, and it can still do large reads and writes as fast as RAID 3, but control is more complex.

RAID 5: RAID 4 has the single parity disk as a performance bottleneck. RAID 5 simply distributes the parity information across all disks in the array, thereby removing the bottleneck. The parity block in each stripe is rotated so that parity is spread evenly across all disks. The disk array controller can calculate which disk has the parity when it wants to write a given block, and this is a simple calculation. RAID 5 has the same low check disk overhead as RAID 3 and 4, and it can do the large reads and writes of RAID 3 and the small reads of RAID 4, but it requires the most sophisticated controller.

Two further RAID organizations have become popular:

RAID 10 versus 01 (that is, 1+0 versus 0+1)

RAID 6: Beyond a Single Disk Failure

RAID 1 to 5 protect against a single self-identifying failure. However, if an operator accidentally replaces the wrong disk during a failure, then the disk array will experience two failures, and data will be lost. As disk bandwidth is growing more slowly than disk capacity, the MTTR of a disk in a RAID system is increasing, which in turn increases the chances of a second failure. Because these failures were becoming a danger to customers, a more robust scheme was created to protect data, called row-diagonal parity, or RAID-DP. Row-diagonal parity uses redundant space based on a parity calculation on a per-stripe basis.
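The read-modify-write "small write" above can also be sketched with XOR: the new parity follows from the old data, the new data, and the old parity alone, without touching the other disks. Values are illustrative.

```python
# RAID 4/5 small write: new parity = old data XOR new data XOR old parity.
def small_write_parity(old_data, new_data, old_parity):
    return bytes(od ^ nd ^ op
                 for od, nd, op in zip(old_data, new_data, old_parity))

old_data, new_data = b"\x0f", b"\xf0"   # block being overwritten
other_disk = b"\x33"                    # a disk the small write never reads
old_parity = bytes(a ^ b for a, b in zip(old_data, other_disk))

new_parity = small_write_parity(old_data, new_data, old_parity)

# Same result as recomputing parity across all disks from scratch:
assert new_parity == bytes(a ^ b for a, b in zip(new_data, other_disk))
```

This is why a small write costs four disk accesses (read old data, read old parity, write new data, write new parity) regardless of how wide the array is.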
Since it is protecting against a double failure, it adds two check blocks per stripe of data. Let's assume there are p + 1 disks total, so p - 1 disks have data. Figure 6.5 shows the case when p is 5. The row parity disk is just like in RAID 4; it contains the even parity across the other four data blocks in its stripe. Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal, and each diagonal does not cover one disk.

Figure 6.5 Row-diagonal parity for p = 5, which protects four data disks from double failures.

6.3 Definition and Examples of Real Faults and Failures (5m)

The terms fault, error, and failure are often used interchangeably, but they have different meanings. Computer system dependability is the quality of delivered service such that reliance can justifiably be placed on this service. The service delivered by a system is its observed actual behavior as perceived by other system(s) interacting with this system's users. Each module also has an ideal specified behavior, where a service specification is an agreed description of the expected behavior. A system failure occurs when the actual behavior deviates from the specified behavior. The failure occurred because of an error, a defect in that module. The cause of an error is a fault. The time between the occurrence of an error and the resulting failure is the error latency. Thus, an error is the manifestation in the system of a fault, and a failure is the manifestation on the service of an error.

Example: A programming mistake is a fault. The consequence is an error (or latent error) in the software. Upon activation, the error becomes effective. When this effective error produces erroneous data that affect the delivered service, a failure occurs. Similarly, an alpha particle hitting a DRAM can be considered a fault; if it changes the memory, it creates an error; when the error affects the delivered service, a failure occurs.
The relation between faults, errors, and failures is as follows. A fault creates one or more latent errors. The properties of errors are (1) a latent error becomes effective once activated; (2) an error may cycle between its latent and effective states; (3) an effective error often propagates from one component to another, thereby creating new errors. A component failure occurs when the error affects the delivered service. These properties are recursive and apply to any component in the system.

Gray and Siewiorek classify faults into four categories according to their cause:
1. Hardware faults: failures of hardware devices.
2. Design faults: faults in software and hardware design.
3. Operation faults: mistakes by operations and maintenance personnel.
4. Environmental faults: fire, flood, earthquake, power failure, and sabotage.

Faults are also classified by their duration into transient, intermittent, and permanent. Transient faults exist for a limited time and are not recurring. Intermittent faults cause a system to oscillate between faulty and fault-free operation. Permanent faults do not correct themselves with the passing of time.

6.4 I/O Performance, Reliability Measures, and Benchmarks

Response time is defined as the time a task takes from the moment it is placed in the buffer until the server finishes the task. Throughput is the average number of tasks completed by the server over a time period. To get high throughput, the server should never be idle, and thus the buffer should never be empty.

Figure 6.6 The traditional producer-server model of response time and throughput.

Throughput versus Response Time (5m)

Figure 6.9 shows throughput versus response time (or latency) for an I/O system. The knee of the curve is the area where a little more throughput results in much longer response time or, conversely, a little shorter response time results in much lower throughput.

Figure 6.9 Throughput versus response time. Latency is normally reported as response time.
An interaction, or transaction, with a computer is divided into three parts:
1. Entry time: the time for the user to enter the command.
2. System response time: the time between when the user enters the command and the complete response is displayed.
3. Think time: the time from the reception of the response until the user begins to enter the next command.

The sum of these three parts is called the transaction time. User productivity is inversely proportional to transaction time. To reflect the importance of response time to user productivity, I/O benchmarks also address the response time versus throughput trade-off.

Transaction-Processing Benchmarks

Transaction processing (TP, or OLTP for online transaction processing) is concerned with I/O rate as opposed to data rate. TP generally involves changes to a large body of shared information from many terminals, with the TP system guaranteeing proper behavior on a failure. Suppose, for example, a bank's computer fails when a customer tries to withdraw money from an ATM. The TP system would guarantee that the account is debited if the customer received the money and that the account is unchanged if the money was not received. Airline reservation systems as well as banks are traditional customers for TP. The Transaction Processing Council (TPC) has produced eight benchmarks; Figure 6.12 summarizes them. TPC-C uses a database to simulate an order-entry environment of a wholesale supplier, including entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses. It runs five concurrent transactions of varying complexity, and the database includes nine tables with a scalable range of records and customers. TPC-C is measured in transactions per minute (tpmC) and in price of system, including hardware, software, and three years of maintenance support. These TPC benchmarks have these unusual characteristics:
1. Price is included with the benchmark results. The cost of hardware, software, and maintenance agreements is included in a submission, which enables evaluations based on price-performance as well as high performance.
2. The data set generally must scale in size as the throughput increases. The benchmarks are trying to model real systems, in which the demand on the system and the size of the data stored in it increase together. It makes no sense, for example, to have thousands of people per minute access hundreds of bank accounts.
3. The benchmark results are audited. Before results can be submitted, they must be approved by a certified TPC auditor, who enforces the TPC rules that try to make sure that only fair results are submitted. Results can be challenged and disputes resolved by going before the TPC.
4. Throughput is the performance metric, but response times are limited. For example, with TPC-C, 90% of the New-Order transaction response times must be less than 5 seconds.
5. An independent organization maintains the benchmarks. Dues collected by the TPC pay for an administrative structure including a Chief Operating Office. This organization settles disputes, conducts mail ballots on approval of changes to benchmarks, holds board meetings, and so on.

6.5 A Little Queuing Theory

For I/O systems, a back-of-the-envelope calculation gives a best-case analysis, while full-scale simulation is much more accurate at far greater effort. In between, we have a mathematical tool to guide I/O design that is a little more work and much more accurate than best-case analysis, but much less work than full-scale simulation: we can define a set of simple theorems that help calculate response time and throughput of an entire I/O system. This helpful field is called queuing theory. Consider a black-box approach to I/O systems, as in Figure 6.15.
In our example, the processor is making I/O requests that arrive at the I/O device, and the requests "depart" when the I/O device fulfills them. If the system is in steady state, then the number of tasks entering the system must equal the number of tasks leaving the system. Here we assume we are evaluating systems with multiple independent requests for I/O service that are in equilibrium: the input rate must be equal to the output rate. This leads us to Little's Law, which relates the average number of tasks in the system, the average arrival rate of new tasks, and the average time to perform a task:

Mean number of tasks in system = Arrival rate x Mean response time

Derivation of Little's Law: Assume we observe a system for Time_observe minutes. During that observation, we record how long it took each task to be serviced, and then sum those times. The number of tasks completed during Time_observe is Number_tasks, and the sum of the times each task spends in the system is Time_accumulated. Note that the tasks can overlap in time, so Time_accumulated >= Time_observe. Then:

Mean number of tasks in system = Time_accumulated / Time_observe
Mean response time = Time_accumulated / Number_tasks
Arrival rate = Number_tasks / Time_observe

Multiplying the last two together gives Time_accumulated / Time_observe, which is the first quantity; that is, Mean number of tasks in system = Arrival rate x Mean response time.

In the black box, the area where the tasks are waiting to be serviced is called the queue, or waiting line, and the device performing the requested service is called the server. Little's Law and a series of definitions lead to several useful equations. Server utilization is:

Server utilization = Arrival rate x Time_server

Server utilization must be between 0 and 1; otherwise, there would be more tasks arriving than could be serviced, violating our assumption that the system is in equilibrium. Utilization is also called traffic intensity and is represented by the symbol rho.

Poisson Distribution of Random Variables

A variable is random if you cannot know exactly what its next value will be, but you may know the probability of all possible values.
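Little's Law and the utilization formula above can be checked numerically; the rates and times below are assumed purely for illustration.

```python
# Little's Law: mean tasks in system = arrival rate x mean response time.
arrival_rate = 40.0          # 40 tasks per second arriving (assumed)
mean_response_s = 0.050      # 50 ms average time in the system (assumed)

mean_tasks_in_system = arrival_rate * mean_response_s
assert abs(mean_tasks_in_system - 2.0) < 1e-12   # 2 tasks in flight on average

# Server utilization: arrival rate x average service time per task.
time_server_s = 0.020        # 20 ms average service time (assumed)
utilization = arrival_rate * time_server_s
assert abs(utilization - 0.8) < 1e-12            # must stay below 1 in equilibrium
```

Note what happens if the assumed service time rises to 25 ms: utilization hits 1.0, the equilibrium assumption breaks, and the queue (and response time) grows without bound.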
Requests for service from an I/O system can be modeled by a random variable, because the operating system is normally switching between several processes that generate independent I/O requests. One way to characterize the distribution of values of a random variable with discrete values is a histogram, which divides the range between the minimum and maximum values into subranges called buckets. Histograms then plot the number in each bucket as columns, and they work well for distributions that take discrete values. For the weighted arithmetic mean time, let's first assume that after measuring the number of occurrences of each task, say n_i, you could compute the frequency of occurrence of task i:

f_i = n_i / (sum of all n_i)

The weighted arithmetic mean time is then:

Weighted mean time = f_1 x T_1 + f_2 x T_2 + ... + f_n x T_n

where T_i is the time for task i and f_i is the frequency of occurrence of task i. The variance can be calculated as:

Variance = (f_1 x T_1^2 + f_2 x T_2^2 + ... + f_n x T_n^2) - (Weighted mean time)^2

One of the most widely used exponential distributions is called a Poisson distribution. It is used to characterize random events in a given time interval, and is described by the following equation (called the probability mass function):

Probability(k) = e^(-a) x a^k / k!

where a = Rate of events x Elapsed time. If interarrival times are exponentially distributed and we use the arrival rate from above for the rate of events, the number of arrivals in a time interval t is a Poisson process, which has the Poisson distribution with a = Arrival rate x t. The length of time a new task must wait for the server to complete a task is called the average residual service time; assuming Poisson arrivals:

Average residual service time = 1/2 x Weighted mean time x (1 + C^2)

where C^2 is the squared coefficient of variance, Variance/(Weighted mean time)^2. When the distribution is not random and all possible values are equal to the average, the standard deviation is 0, so C is 0, and the average residual service time is just half the average service time. If the distribution is random and it is Poisson, then C is 1 and the average residual service time equals the weighted arithmetic mean time.

Assumptions about the queuing model:
1. The system is in equilibrium.
2. The times between two successive requests arriving, called the interarrival times, are exponentially distributed, which characterizes the arrival rate mentioned earlier.
3. The number of sources of requests is unlimited.
4. The server can start on the next job immediately after finishing the prior one.
5. There is no limit to the length of the queue, and it follows the first in, first out order discipline, so all tasks in line must be completed.
6. There is one server.

Such a queue is called M/M/1:
M = exponentially random request arrival (C^2 = 1), with M standing for Markov;
M = exponentially random service time (C^2 = 1), with M again for Markov;
1 = single server.

The M/M/1 model is simple and widely used. Many real systems have multiple disks and hence could use multiple servers, as in Figure 6.17. Such a system is called an M/M/m model in queuing theory.

Figure 6.17 The M/M/m multiple-server model.

6.6 Crosscutting Issues (6m)

1. Point-to-Point Links and Switches Replacing Buses

Point-to-point links and switches are increasing in popularity as Moore's Law continues to reduce the cost of components. Combined with the higher I/O bandwidth demands from faster processors, faster disks, and faster local area networks, the decreasing cost advantage of buses means the days of buses in desktop and server computers are numbered. The number of bits and the bandwidth quoted for a new generation are per direction, so they double for both directions. A common way to increase bandwidth is to offer versions with several times the number of wires and bandwidth.

2. Block Servers versus Filers

The operating system provides the file abstraction on top of blocks stored on the disk. The terms logical units, logical volumes, and physical volumes are related terms used in Microsoft and UNIX systems to refer to collections of disk blocks.
A logical unit appears to the server as a single virtual "disk." In a RAID disk array, the logical unit is configured as a particular RAID layout. A physical volume is the device file used by the file system to access a logical unit. A logical volume provides a level of virtualization that enables the file system to split the physical volume across multiple pieces or to stripe data across multiple physical volumes.

• A logical unit is an abstraction of a disk array that presents a virtual disk to the operating system, while physical and logical volumes are abstractions used by the operating system to divide these virtual disks into smaller, independent file systems.

In one approach, the file illusion is maintained in the server: the server accesses storage as disk blocks and maintains the metadata, and since most file systems use a file cache, the server must also maintain consistency of file accesses. In the alternative approach, the disk subsystem itself maintains the file abstraction, and the server uses a file system protocol to communicate with storage. Such devices are called network attached storage (NAS) devices. The term filer is used for NAS devices that provide only file service and file storage. Network Appliance was one of the first companies to make filers. The idea behind placing storage on the network is to make it easier for many computers to share information and for operators to maintain the shared system.

3. Asynchronous I/O and Operating Systems

Disks spend more time in delays than in transferring data. The straightforward approach to I/O is to request data and then start using it; the operating system switches to another process until the desired data arrive, and then switches back to the requesting process. Such a style is called synchronous I/O. The alternative model is for the process to continue after making a request, and it is not blocked until it tries to read the requested data.
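This request-now, block-on-read-later pattern can be sketched with Python's asyncio. This is a toy model, not how a real operating system implements it: `disk_read`, the block numbers, and the latencies are all invented for illustration.

```python
import asyncio

async def disk_read(block: int, latency: float) -> bytes:
    """Model one I/O request; the await is where the OS would let
    the process (or another coroutine) keep running."""
    await asyncio.sleep(latency)      # stands in for seek + rotation + transfer
    return b"data-%d" % block

async def synchronous_style(blocks):
    # Synchronous I/O: wait for each result before issuing the next request.
    return [await disk_read(b, 0.01) for b in blocks]

async def asynchronous_style(blocks):
    # Asynchronous I/O: issue all requests first, so many are in flight;
    # block only when the data is actually consumed.
    pending = [asyncio.create_task(disk_read(b, 0.01)) for b in blocks]
    return [await t for t in pending]

results = asyncio.run(asynchronous_style(range(4)))
```

With four requests in flight, the asynchronous version finishes in roughly one device latency instead of four, which is the point of overlapping many outstanding I/Os.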
Such asynchronous I/O allows the process to continue making requests so that many I/O requests can be operating simultaneously. Asynchronous I/O shares the same philosophy as caches in out-of-order CPUs, which achieve greater bandwidth by having multiple outstanding events.

6.7 Designing and Evaluating an I/O System: The Internet Archive Cluster

The art of I/O system design is to find a design that meets goals for cost, dependability, and variety of devices while avoiding bottlenecks in I/O performance and dependability. Avoiding bottlenecks means that components must be balanced between main memory and the I/O device. The architect must also plan for expansion so that customers can tailor the I/O to their applications. In designing an I/O system, we analyze performance, cost, capacity, and availability using varying I/O connection schemes and different numbers of I/O devices of each type. The steps to follow in designing an I/O system are: 1. List the different types of I/O devices to be connected to the machine, or list the standard buses and networks that the machine will support. 2. List the physical requirements for each I/O device. Requirements include size, power, connectors, bus slots, expansion cabinets, and so on. 3. List the cost of each I/O device, including the portion of cost of any controller needed for this device. 4. List the reliability of each I/O device. 5. Record the processor resource demands of each I/O device. This list should include clock cycles for instructions used to initiate an I/O, to support operation of an I/O device (such as handling interrupts), and to complete I/O; processor clock stalls due to waiting for I/O to finish using the memory, bus, or cache; and processor clock cycles to recover from an I/O activity, such as a cache flush. 6. List the memory and I/O bus resource demands of each I/O device. Even when the processor is not using memory, the bandwidth of main memory and the I/O connection is limited. 7.
Assess the performance and availability of the I/O devices. Performance can be evaluated with simulation or by using queuing theory; reliability can be calculated assuming I/O devices fail independently; availability can be computed from reliability by estimating MTTF for the devices, taking into account the time from failure to repair. Performance can be measured either as megabytes per second or I/Os per second, depending on the needs of the application. For high performance, the speed of the I/O devices, the number of I/O devices, and the speed of the memory and processor are what matter. For low cost, most of the cost should be in the I/O devices themselves.

6.8 Putting It All Together: NetApp FAS6000 Filer

Network Appliance entered the storage market in 1992 with the goal of providing an easy-to-operate file server running NFS, using its own log-structured file system and a RAID 4 disk array. The company later added support for the Windows CIFS file system and RAID 6. To support applications that want access to raw data blocks without the overhead of a file system, such as database systems, NetApp filers can serve data blocks over a standard Fibre Channel interface. NetApp also supports iSCSI.

The latest hardware product is the FAS6000. It is a multiprocessor based on the AMD Opteron microprocessor connected using its HyperTransport links. The microprocessors run the NetApp software stack, including NFS, CIFS, RAID-DP, SCSI, and so on. The FAS6000 comes as either a dual processor (FAS6030) or a quad processor (FAS6070). DRAM is distributed to each Opteron microprocessor, and the DRAM bus is 128 bits wide, plus extra bits for SEC/DED memory. Both models dedicate four HyperTransport links to I/O. As a filer, the FAS6000 needs a lot of I/O to connect to the disks and to connect to the servers. The integrated I/O consists of: 1. 8 Fibre Channel (FC) controllers and ports, 2. 6 Gigabit Ethernet links, 3. 6 slots for x8 (2 GB/sec) PCI Express cards, 4.
3 slots for PCI-X 133 MHz, 64-bit cards, 5. plus standard I/O options like IDE, USB, and 32-bit PCI.

The 8 Fibre Channel controllers can each be attached to 6 shelves containing 14 3.5-inch FC disks, so the maximum number of drives for the integrated I/O is 672 disks. Additional FC controllers can be added to the option slots to connect up to 1008 drives. The six 1-gigabit Ethernet links connect to servers to make the FAS6000 look like a file server if running NFS or CIFS, or like a block server if running iSCSI.

For dependability, FAS6000 filers can be paired so that if one fails, the other can take over. Clustered failover requires that both filers have access to all disks in the pair using the FC interconnect. The health of the filers is constantly monitored, and failover happens automatically. The healthy filer maintains its own network identity and its own primary functions, but it also assumes the network identity of the failed filer and handles all its data requests via a virtual filer until an administrator restores the data service to the original state.

6.9 Fallacies and Pitfalls

Fallacy: Components fail fast.

The fault-tolerance literature is based on the assumption that a component operates perfectly until a latent error becomes effective, at which point a failure occurs that stops the component. The Tertiary Disk project had the opposite experience: many components started acting strangely long before they failed, and it was generally up to the system operator to determine whether to declare a component failed. The component would generally be willing to continue to act in violation of the service agreement until an operator "terminated" it.

Fallacy: Computer systems achieve 99.999% availability ("five nines"), as advertised.

Marketing departments of companies making servers tout the availability of their computer hardware; in terms of Figure 6.21, they claim availability of 99.999%, nicknamed five nines.
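The downtime implied by an availability class is easy to check; the following sketch computes unavailable minutes per year for one through five nines (the function name is invented for illustration).

```python
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600 minutes

def unavailable_minutes(nines: int) -> float:
    """Minutes of downtime per year for an availability of 'nines' nines
    (e.g., nines=5 means 99.999% available)."""
    availability = 1.0 - 10.0 ** (-nines)
    return (1.0 - availability) * MINUTES_PER_YEAR

for n in range(1, 6):
    print(n, "nines:", round(unavailable_minutes(n), 2), "minutes/year")
```

Five nines works out to about 5.26 minutes of unavailability per year, which is the figure the five-nines claims advertise.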
Even the marketing departments of operating system companies tried to give this impression.

Figure 6.21 Minutes unavailable per year to achieve availability class. Note that five nines means five minutes unavailable per year.

Five minutes of unavailability per year is certainly impressive. For example, Hewlett-Packard claims that the HP-9000 server hardware and HP-UX operating system can deliver a 99.999% availability guarantee "in certain pre-defined, pre-tested customer environments." This guarantee does not include failures due to operator faults, application faults, or environmental faults, which are likely the dominant fault categories today. It is also unclear what the financial penalty is to a company if a system does not match its guarantee. In contrast to the marketing suggestions, well-managed servers in 2006 typically achieved 99% to 99.9% availability.

Pitfall: Where a function is implemented affects its reliability.

In theory, it is fine to move the RAID function into software; in practice, it is very difficult to make it work reliably. The software culture is generally based on eventual correctness via a series of releases and patches, and software RAID is also difficult to isolate from other layers of software. Many customers have lost data due to software bugs or environment incompatibilities in software RAID systems. Hardware systems are not immune to bugs, but the hardware culture tends to place a greater emphasis on testing correctness in the initial release, and the hardware is independent of the version of the operating system.

Fallacy: Operating systems are the best place to schedule disk accesses.

Higher-level interfaces like ATA and SCSI offer logical block addresses to the host operating system, so the best an OS can do is sort the logical block addresses into increasing order. Since only the disk knows the mapping of the logical addresses onto the physical geometry of sectors, tracks, and surfaces, the disk itself can reduce the rotational and seek latencies.
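The OS-level scheduling described above can be sketched as follows. This toy model (the request queue and starting position are invented) measures "seek cost" as distance between logical block addresses, which is exactly the simplification that breaks down once the disk's physical geometry differs from the logical layout.

```python
def total_seek_distance(requests, start=0):
    """Sum of head movement, measuring distance in logical block addresses."""
    pos, total = start, 0
    for r in requests:
        total += abs(r - pos)
        pos = r
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]   # hypothetical pending LBAs
fifo = total_seek_distance(queue, start=53)            # serve in arrival order
elevator = total_seek_distance(sorted(queue), start=53)  # OS sorts into increasing order
```

Sorting helps (208 versus 640 units of movement here), but only the disk can go further: with knowledge of sectors, tracks, and rotational position it can reorder to cut both seek and rotational latency.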
Fallacy: The average seek time of a disk in a computer system is the time for a seek of one-third the number of cylinders.

This fallacy comes from confusing the way manufacturers market disks with the expected performance, and from the false assumption that seek times are linear in distance. The one-third-distance rule of thumb comes from calculating the distance of a seek from one random location to another random location; in the past, manufacturers listed the seek of this distance to offer a consistent basis for comparison. Assuming (incorrectly) that seek time is linear in distance, and using the manufacturer's reported minimum and "average" seek times, a common technique to predict seek time is

Time_seek = Time_minimum + (Distance / Distance_average) × (Time_average − Time_minimum)

Implementing Cache Coherence in a DSM Multiprocessor (II)

Implementing a directory-based cache coherence protocol requires overcoming all the problems related to nonatomic actions for a snooping protocol, but without the use of broadcast, which in a snooping protocol forced a serialization on competing writes and provided the serialization required for the memory consistency model. Avoiding the need to broadcast is a central goal for a directory-based system, so another method for ensuring serialization is necessary.

The serialization of requests for exclusive access to a memory block is easily enforced, since those requests will be serialized when they reach the unique directory for the specified block. If the directory controller simply ensures that one request is completely serviced before the next is begun, writes will be serialized. Because the requesters cannot know ahead of time who will win the race and because the communication is not a broadcast, the directory must signal to the winner when it completes the processing of the winner's request. This is done by a message that supplies the data or by an explicit acknowledgment message.
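A minimal sketch of this serialization, with invented names and none of a real protocol's messages or states, is a directory that queues competing exclusive requests per block and acknowledges only the winner:

```python
from collections import deque

class Directory:
    """Toy directory: requests for the same block are serviced strictly
    one at a time (FIFO), which serializes competing writes."""
    def __init__(self):
        self.owner = {}        # block -> node currently holding exclusive access
        self.pending = {}      # block -> queue of waiting requesters

    def request_exclusive(self, node, block):
        q = self.pending.setdefault(block, deque())
        q.append(node)
        # Grant only if this request is at the head and the block is free.
        if q[0] == node and block not in self.owner:
            return self._grant(block)
        return None            # no ack yet: the loser waits in the queue

    def _grant(self, block):
        node = self.pending[block].popleft()
        self.owner[block] = node
        return ("ack", node, block)   # explicit acknowledgment message

    def release(self, block):
        del self.owner[block]
        if self.pending.get(block):
            return self._grant(block)  # next queued requester wins
        return None

d = Directory()
first = d.request_exclusive("P0", 0x40)   # P0 wins the race and is acked
second = d.request_exclusive("P1", 0x40)  # P1 is serialized behind P0
later = d.release(0x40)                   # P1 is acked only now
```

The explicit ack is the signal the text describes: since there is no broadcast, the loser cannot observe the winner's transaction and must wait for the directory to reach its queued request.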
Although the acknowledgment that a requesting node has ownership is complete when the write miss or ownership acknowledgment message is transmitted, we still do not know that the invalidates have been received and processed by the nodes that were in the sharing set. All memory consistency models eventually require that a processor know that all the invalidates for a write have been processed. In a snooping scheme, the nature of the broadcast network provides this assurance. In a directory scheme, we can know that the invalidates have been completed by having the destination nodes of the invalidate messages explicitly acknowledge the invalidation messages sent from the directory. There are two possibilities. In the first, the acknowledgments are sent to the directory, which counts them and, when all acknowledgments have been received, confirms this with a single message to the original requester. Alternatively, when granting ownership, the directory can tell the requester how many acknowledgments to expect; the destinations of the invalidate messages then send an acknowledgment directly to the requester, whose identity is provided by the directory. Most existing implementations use the latter scheme, because it reduces the bottleneck at the directory. Although the requirement for acknowledgments adds complexity to directory protocols, it replaces the serialization that broadcast provided in a snooping scheme.

The Intel Pentium 4

The Pentium 4 is a processor with a deep pipeline supporting multiple issue with speculation; it also supports multithreading. The Pentium 4 uses an out-of-order speculative microarchitecture, called NetBurst, that is deeply pipelined with the goal of achieving high instruction throughput by combining multiple issue and high clock rates. Like the microarchitecture used in the Pentium III, a front-end decoder translates each IA-32 instruction to a series of micro-operations (uops), which are similar to typical RISC instructions.
The uops are then executed by a dynamically scheduled speculative pipeline. The Pentium 4 uses a novel execution trace cache to generate the uop instruction stream. A trace cache is a type of instruction cache that holds sequences of instructions to be executed, including nonadjacent instructions separated by branches; a trace cache tries to exploit the temporal sequencing of instruction execution rather than the spatial locality exploited in a normal cache. The Pentium 4's execution trace cache is a trace cache of uops, corresponding to the decoded IA-32 instruction stream. By filling the pipeline from the execution trace cache, the Pentium 4 avoids the need to redecode IA-32 instructions whenever the trace cache hits; only on trace cache misses are IA-32 instructions fetched from the L2 cache and decoded to refill the execution trace cache. Up to three IA-32 instructions may be decoded and translated every cycle, generating up to six uops; when a single IA-32 instruction requires more than three uops, the uop sequence is generated from the microcode ROM. The execution trace cache has its own branch target buffer, which predicts the outcome of uop branches.

After fetching from the execution trace cache, the uops are executed by an out-of-order speculative pipeline using register renaming rather than a reorder buffer. Up to three uops per clock can be renamed and dispatched to the functional unit queues, and three uops can be committed each clock cycle. There are four dispatch ports, which together allow a total of six uops to be dispatched to the functional units every clock cycle. The load and store units each have their own dispatch port, another port covers basic ALU operations, and a fourth handles FP and integer operations. Figure 2.26 shows a diagram of the microarchitecture. A two-level cache is used to minimize the frequency of DRAM accesses.
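The hit/miss behavior of a trace cache can be sketched as a map from a trace's starting address to an already-decoded uop sequence. This is a toy model only: the class, the stub decoder, and the two-uops-per-instruction assumption are all invented, and the real Pentium 4 structure is far more elaborate.

```python
class TraceCache:
    """Toy model: a hit on a trace's start address returns decoded uops
    directly, skipping decode; a miss decodes and fills the cache."""
    def __init__(self, decode):
        self.decode = decode        # function: IA-32 instruction -> uop list
        self.traces = {}            # start address -> cached uop sequence
        self.hits = self.misses = 0

    def fetch(self, start_addr, insts):
        if start_addr in self.traces:
            self.hits += 1          # pipeline is fed without redecoding
            return self.traces[start_addr]
        self.misses += 1            # decode (from L2 in the real machine)
        uops = [u for i in insts for u in self.decode(i)]
        self.traces[start_addr] = uops
        return uops

# Stub decoder: pretend every IA-32 instruction expands to two uops.
tc = TraceCache(lambda inst: [inst + "_uop0", inst + "_uop1"])
loop = ["add", "cmp", "jne"]        # one trace may span taken branches
tc.fetch(0x1000, loop)              # miss: decode and fill
tc.fetch(0x1000, loop)              # hit: no redecode needed
```

The payoff modeled here is the one the text describes: for loop-dominated code the same trace is fetched repeatedly, so decode work is paid only on the rare miss.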
Branch prediction is done with a branch-target buffer using a two-level predictor with both local and global branch histories. The Pentium 4's deep pipeline makes its use of speculation, and hence its dependence on branch prediction, critical to achieving high performance, so branch-prediction accuracy is crucial. Performance is also very dependent on the memory system: although dynamic scheduling and the large number of outstanding loads and stores support hiding the latency of cache misses, L2 misses are likely to cause a stall. The trace cache miss rate, by contrast, is almost negligible for this set of SPEC benchmarks.

The AMD Opteron and Intel Pentium 4 share a number of similarities:
• Both use a dynamically scheduled, speculative pipeline capable of issuing and committing three IA-32 instructions per clock.
• Both use a two-level on-chip cache structure, although the Pentium 4 uses a trace cache for the first-level instruction cache and recent Pentium 4s have larger second-level caches.
• They have similar transistor counts, die size, and power.

Figure 2.26 The Pentium 4 microarchitecture.