5.5 Crosscutting Issues: The Design of Memory Hierarchies
Protection and Instruction Set Architecture

Protection is a joint effort of architecture and operating systems, but architects had to modify some details of existing instruction set architectures when virtual memory became popular. IBM mainframe hardware and VMM took three steps to improve performance of VMs:
1. Reduce the cost of processor virtualization.
2. Reduce interrupt overhead cost due to the virtualization.
3. Reduce interrupt cost by steering interrupts to the proper VM without invoking the VMM.
In 2006, new proposals by AMD and Intel tried to address the first point, reducing the cost of processor virtualization.

Speculative Execution and the Memory System

Inherent in processors that support speculative execution or conditional instructions is the possibility of generating invalid addresses. It would be incorrect behavior if protection exceptions were taken for such addresses, and the benefits of speculative execution would be swamped by false exception overhead. Hence, the memory system must identify speculatively executed instructions and conditionally executed instructions and suppress the corresponding exception. Similarly, we cannot allow such instructions to cause the cache to stall on a miss, because again unnecessary stalls could overwhelm the benefits of speculation. Hence, these processors must be matched with nonblocking caches. In reality, the penalty of an L2 miss is so large that compilers normally only speculate on L1 misses. For some scientific programs the compiler can sustain multiple outstanding L2 misses to cut the L2 miss penalty effectively.

I/O and Consistency of Cached Data

Data can be found in memory and in the cache. As long as one processor is the sole device changing or reading the data and the cache stands between the processor and memory, there is little danger in the processor seeing the old or stale copy. Multiple processors and I/O devices raise the opportunity for copies to be inconsistent and to read the wrong copy.
The frequency of the cache coherency problem is different for multiprocessors than for I/O. A program running on multiple processors will want to have copies of the same data in several caches, and performance of a multiprocessor program depends on the performance of the system when sharing data. The I/O cache coherency question is this: where does the I/O occur in the computer, between the I/O device and the cache or between the I/O device and main memory? If input puts data into the cache and output reads data from the cache, both I/O and the processor see the same data. The difficulty in this approach is that it interferes with the processor and can cause the processor to stall for I/O. Input may also interfere with the cache by displacing some information with new data that is unlikely to be accessed soon. The goal for the I/O system in a computer with a cache is to prevent the stale data problem while interfering as little as possible. Many systems, therefore, prefer that I/O occur directly to main memory, with main memory acting as an I/O buffer. If a write-through cache were used, then memory would have an up-to-date copy of the information, and there would be no stale-data issue for output. Write through is usually used in first-level data caches, while the L2 cache uses write back. Input requires some extra work. The software solution is to guarantee that no blocks of the input buffer are in the cache: a page containing the buffer can be marked as noncacheable, and the operating system can always input to such a page; alternatively, the operating system can flush the buffer addresses from the cache before the input occurs. A hardware solution is to check the I/O addresses on input to see if they are in the cache. If there is a match of I/O addresses in the cache, the cache entries are invalidated to avoid stale data. All these approaches can also be used for output with write-back caches.
RAGHAVENDRA REDDY, MTECH CSE, REVA ITM
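As a toy illustration of the hardware solution above, here is a sketch of invalidating any cached blocks touched by a DMA input. All names and the dictionary-based cache model are invented for illustration; no real cache controller works at this level.

```python
# Sketch of the "hardware solution": on DMA input, check each I/O address
# against the cache and invalidate matching entries to avoid stale data.
# The cache is modeled as a dict mapping block address -> cached bytes.

BLOCK_SIZE = 64  # bytes per cache block (assumed)

def block_addr(addr):
    """Round an address down to the start of its cache block."""
    return addr - (addr % BLOCK_SIZE)

def dma_input(cache, memory, start, data):
    """Write 'data' into memory at 'start', invalidating any cached copies."""
    for i, byte in enumerate(data):
        memory[start + i] = byte
    # Invalidate every cached block the transfer touched.
    touched = {block_addr(a) for a in range(start, start + len(data))}
    for b in touched:
        cache.pop(b, None)  # drop the stale copy if present

cache = {0: b"old", 64: b"old"}
memory = {}
dma_input(cache, memory, 60, b"\x01" * 8)   # transfer spans blocks 0 and 64
assert 0 not in cache and 64 not in cache   # stale copies invalidated
```

With a write-back cache, output needs the mirror-image check: a match must be written back (flushed) rather than just invalidated, so the I/O device reads the latest data.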
5.6 Putting It All Together: AMD Opteron Memory Hierarchy

The Opteron is an out-of-order execution processor that fetches up to three 80x86 instructions per clock cycle, translates them into RISC operations, and has 11 parallel execution units. In 2006, the 12-stage integer pipeline yields a maximum clock rate of 2.8 GHz, and the fastest memory supported is PC3200 DDR SDRAM. The Opteron uses 48-bit virtual addresses and 40-bit physical addresses. Figure 5.18 shows the mapping of the address through the multiple levels of data caches and TLBs.

The PC is sent to the instruction cache, which is 64 KB, two-way set associative with a 64-byte block size and LRU replacement, so the cache index is 9 bits. It is virtually indexed and physically tagged.

Figure 5.18 The virtual address, physical address, indexes, tags, and data blocks for the AMD Opteron caches and TLBs.

Step 1: The page frame portion of the instruction's virtual address is sent to the instruction TLB.
Step 2: At the same time, the 9-bit index from the virtual address is sent to the instruction cache.
Steps 3, 4: The fully associative TLB simultaneously searches all 40 entries to find a match between the address and a valid PTE. The Opteron checks for changes to the actual page table in memory and flushes the TLB only when that data structure is changed. In the worst case, the page is not in memory, and the operating system gets the page from disk. Since millions of instructions could execute during a page fault, the operating system will swap in another process if one is waiting to run. Otherwise, the instruction cache access continues.
Step 5: The index field of the address is sent to both groups of the two-way set associative instruction cache.
Step 6: The instruction cache tag is 40 - 9 (index) - 6 (block offset) bits, or 25 bits. The two tags and valid bits are compared to the physical page frame from the instruction TLB.
Step 7: The instruction cache is virtually addressed and physically tagged.
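The index and tag arithmetic in steps 2 and 6 can be checked with a short sketch, assuming the parameters given above (64 KB, two-way set associative, 64-byte blocks, 40-bit physical addresses); the function name is illustrative.

```python
# Address breakdown for the Opteron L1 instruction cache described above:
# 64 KB, two-way set associative, 64-byte blocks, 40-bit physical addresses.
CACHE_BYTES, ASSOC, BLOCK = 64 * 1024, 2, 64
SETS = CACHE_BYTES // (ASSOC * BLOCK)      # 512 sets
OFFSET_BITS = BLOCK.bit_length() - 1       # 6 bits of block offset
INDEX_BITS = SETS.bit_length() - 1         # 9 bits of index, as in step 2
TAG_BITS = 40 - INDEX_BITS - OFFSET_BITS   # 25 bits of tag, as in step 6

def split(paddr):
    """Split a physical address into (tag, index, offset) fields."""
    offset = paddr & (BLOCK - 1)
    index = (paddr >> OFFSET_BITS) & (SETS - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

assert (INDEX_BITS, TAG_BITS) == (9, 25)
tag, index, offset = split(0x12345678)
# The three fields reassemble into the original address.
assert (tag << (OFFSET_BITS + INDEX_BITS)) | (index << OFFSET_BITS) | offset == 0x12345678
```

Because 9 index bits plus 6 offset bits fit within the 12-bit (4 KB) page offset only partially, the virtually indexed, physically tagged design must deal with the synonym problem described next.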
On a miss, the cache controller must check for a synonym (two different virtual addresses that reference the same physical address).
Step 8: On a miss, the second-level cache tries to fetch the block. The L2 cache uses a pseudo-LRU scheme: it manages eight pairs of blocks with LRU and then randomly picks one block of the LRU pair on a replacement. For the 1 MB, 16-way set associative L2, the index is 1,048,576/(64 x 16) = 1024, so the L2 index is 10 bits.
Step 9: Once again, the index and tag are sent to all 16 groups of the 16-way set associative L2 cache,
Step 10: which are compared in parallel. If one matches and is valid, it returns the block in sequential order, 8 bytes per clock cycle.
Step 11: If the instruction is not found in the secondary cache, the on-chip memory controller must get the block from main memory. Since there is only one memory controller, the same address is sent on both channels.
Step 12: Wide transfers happen when both channels have identical DIMMs. Each channel supports up to four DDR DIMMs.
Step 13: The total latency of an instruction miss that is serviced by main memory is approximately 20 processor cycles plus the DRAM latency for the critical instructions. The memory controller fills the remainder of the 64-byte cache block at a rate of 16 bytes per memory clock cycle.
Step 14: The Opteron has a prefetch engine associated with the L2 cache. It looks for patterns of L2 misses to consecutive blocks, either ascending or descending, and then prefetches the next line into the L2 cache.
Step 15: Since the second-level cache is a write-back cache, any miss can lead to an old block being written back to memory. The Opteron places this "victim" block into a victim buffer, as it does with a victim dirty block in the data cache.
Step 16: The data cache and L2 cache check the victim buffer for the missing block; if it is found there, the access stalls until the data is written to memory and then fetched again. The new data are loaded into the instruction cache as soon as they arrive.

Performance of the Opteron Memory Hierarchy
How well does the Opteron work? The major components are the instruction and data caches, the instruction and data TLBs, and the secondary cache. A memory stall for one instruction may be completely hidden by successful completion of a later instruction. For the Opteron memory hierarchy, the average SPEC instruction cache misses per instruction are 0.01% to 0.09%, the average data cache misses per instruction are 1.34% to 1.43%, and the average L2 cache misses per instruction are 0.23% to 0.36%. The pipeline stall portion is a lower bound: about 50% of the memory CPI (25% overall) is due to L2 cache misses for the integer programs, and about 70% of the memory CPI (40% overall) for the floating-point programs. Although they are executing the same programs compiled for the same instruction set, the compilers and resulting code sequences are different, as are the memory hierarchies. The following table summarizes the two memory hierarchies:

5.7 Fallacies and Pitfalls

The memory hierarchy is less vulnerable to fallacies and pitfalls.
1. Fallacy: Predicting cache performance of one program from another. Depending on the program, the data misses will change. Commercial programs such as databases will have significant miss rates even in large second-level caches.
2. Pitfall: Simulating enough instructions to get accurate performance measures of the memory hierarchy. There are really three pitfalls here. One is trying to predict performance of a large cache using a small trace. Another is that a program's locality behavior is not constant over the run of the entire program. The third is that a program's locality behavior may vary depending on the input.
3. Pitfall: Overemphasizing memory bandwidth in DRAMs. RAMBUS innovated on the DRAM interface. Its product, Direct RDRAM, offered up to 1.6 GB/sec of bandwidth from a single DRAM.
PCs do most memory accesses through a two-level cache hierarchy, so it was unclear how much benefit is gained from high bandwidth without also improving memory latency. One measure of the RDRAM cost is die size: it had about a 20% larger die for the same capacity compared to SDRAM. DRAM designers use redundant rows and columns to improve yield significantly on the memory portion of the DRAM.
4. Pitfall: Not delivering high memory bandwidth in a cache-based system. Caches help with average cache memory latency but may not deliver high memory bandwidth to an application that must go to main memory. The architect must design a high-bandwidth memory behind the cache for such applications. The leaders on memory bandwidth benchmarks are typically vector computers, which do not rely on data caches for memory performance.
5. Pitfall: Implementing a virtual machine monitor on an ISA that wasn't designed to be virtualizable. Designing a VMM for such an ISA is a challenging task. Because the 80x86 TLBs do not support process ID tags, it is more expensive for the VMM and guest OSes to share the TLB; each address space change typically requires a TLB flush. Virtualizing I/O is also a challenge for the 80x86, because it supports memory-mapped I/O as well as separate I/O instructions, and there is a very large number and variety of device types and device drivers for PCs that the VMM must handle. Third-party vendors supply their own drivers, and they may not virtualize properly. One solution for conventional VM implementations is to load real device drivers directly into the VMM. To simplify implementations of VMMs on the 80x86, both AMD and Intel have proposed extensions to the architecture. Intel's VT-x provides a new execution mode for running VMs.

6.1 Introduction

6.2 Advanced Topics in Disk Storage

The disk industry has concentrated on improving the capacity of disks. Improvement in capacity is customarily expressed as improvement in areal density, measured in bits per square inch:

Areal density = (Tracks/inch on a disk surface) x (Bits/inch on a track)
Through about 1988, the rate of improvement of areal density was 29% per year, thus doubling density every three years. Between then and about 1996, the rate improved to 60% per year, quadrupling density every three years and matching the traditional rate of DRAMs. From 1997 to about 2003, the rate increased to 100%, doubling every year. After that the rate dropped to about 30% per year. Cost per gigabyte has dropped at least as fast as areal density has increased.

The bandwidth gap is more complex. Many have tried to invent a technology cheaper than DRAM but faster than disk to fill that gap, but the challengers have never had a product to market at the right time. By the time a new product would ship, DRAMs and disks have made advances, costs have dropped accordingly, and the challenging product is immediately obsolete. Flash memory is semiconductor memory that is nonvolatile like disks and has about the same bandwidth as disks, but latency is 100-1000 times faster than disk. In 2006, the price per gigabyte of flash was about the same as DRAM. Flash is popular in cameras and portable music players because it comes in much smaller capacities and is more power efficient than disks, although its cost per gigabyte is higher than disks. But flash memory is not popular in desktop and server computers.

Blocks in the same cylinder take less time to access since there is no seek time, and some tracks are closer than others. Figure 6.2 shows how a queue depth of 50 can double the number of I/Os per second of random I/Os due to better scheduling of accesses. Given buffers, caches, and out-of-order accesses, an accurate performance model of a real disk is much more complicated than sector-track-cylinder.

Figure 6.2 Throughput versus command queue depth using random 512-byte reads. The disk performs 170 reads per second starting with no command queue, and doubles performance at 50 and triples at 256.
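The kind of I/Os-per-second calculation behind figures like 6.2 can be sketched as follows; the seek time and transfer rate here are illustrative assumptions, not the parameters of any measured drive.

```python
# Back-of-the-envelope random-read service time:
# average seek + half a rotation + transfer of one sector.
avg_seek_ms = 4.0            # assumed average seek time
rpm = 7200                   # SATA-class rotation speed
transfer_mb_per_s = 60.0     # assumed media transfer rate
sector_bytes = 512

half_rotation_ms = 0.5 * 60_000 / rpm                         # ~4.17 ms at 7200 RPM
transfer_ms = sector_bytes / (transfer_mb_per_s * 1e6) * 1e3  # ~0.009 ms, negligible
service_ms = avg_seek_ms + half_rotation_ms + transfer_ms

iops = 1000 / service_ms     # I/Os per second with no command queuing
assert 100 < iops < 150      # roughly 120 for these assumed parameters
```

Command queuing helps precisely because the seek and rotation terms dominate: reordering queued requests shrinks the effective seek distance and rotational delay, which is how the disk in Figure 6.2 can double its random-I/O rate at a queue depth of 50.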
Disk Power

Power is an increasing concern for disks as well as for processors. A typical ATA disk in 2006 might use 9 watts when idle, 11 watts when reading or writing, and 13 watts when seeking. One formula that indicates the importance of rotation speed and the size of the platters for the power consumed by the disk motor is the following:

Power ~ Diameter^4.6 x RPM^2.8 x Number of platters

Thus, smaller platters, slower rotation, and fewer platters all help reduce disk motor power, and most of the power is in the motor. Figure 6.3 shows the specifications of two 3.5-inch disks in 2006. The Serial ATA (SATA) disks shoot for high capacity and the best cost per gigabyte, so the 500 GB drives cost less than $1 per gigabyte. They use the widest platters that fit the form factor and use four or five of them, but they spin at 7200 RPM and seek relatively slowly to lower power. The corresponding Serial Attach SCSI (SAS) drive aims at performance, so it spins at 15,000 RPM and seeks much faster. To reduce power, the platter is much narrower than the form factor, and the drive has only a single platter. This combination reduces the capacity of the SAS drive to 37 GB. The cost per gigabyte is better for the SATA drives, while the cost per I/O per second or per MB transferred per second is better for the SAS drives. The SAS disks use twice the power of the SATA drives, due to the much faster RPM and seeks.

Figure 6.3 Serial ATA (SATA) versus Serial Attach SCSI (SAS) drives in the 3.5-inch form factor in 2006. The I/Os per second are calculated using the average seek plus the time for one-half rotation plus the time to transfer one sector of 512 bytes.

Advanced Topics in Disk Arrays (10m)

Spreading data over multiple disks is called striping. A disk array has more faults than a smaller number of larger disks, so the lost information is reconstructed from redundant information. The mean time to failure (MTTF) of disks is tens of years, and the MTTR (mean time to repair) is measured in hours.
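A quick back-of-the-envelope view of why an array has more faults than a single disk, assuming independent failures and an illustrative per-disk MTTF:

```python
# With independent failures, failure rates add: N disks fail N times as
# often as one, so the array's MTTF is the per-disk MTTF divided by N.
# The numbers below are illustrative, not from any vendor datasheet.
disk_mttf_hours = 1_000_000      # "tens of years" for a single disk
n_disks = 100

array_failure_rate = n_disks / disk_mttf_hours   # failures per hour
array_mttf_hours = 1 / array_failure_rate
assert array_mttf_hours == 10_000                # only a bit over a year
```

This is exactly why redundancy plus a short MTTR matters: the array as a whole fails every year or so, but as long as the failed disk is repaired (in hours) before a second one fails, no data is lost.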
Such redundant disk arrays are known as RAID, originally redundant array of inexpensive disks, with "independent" also used for the I in the acronym. Array performance can be measured in MB/sec or in I/Os per second. Figure 6.4 summarizes the five standard RAID levels.

RAID 0: It has no redundancy and is sometimes called JBOD (just a bunch of disks). The data may be striped across the disks in the array. This level is generally used as a measuring stick for the other RAID levels in terms of cost, performance, and dependability.

Figure 6.4 RAID levels, their fault tolerance, and their overhead in redundant disks. Mirroring (RAID 1) in this instance can survive up to eight disk failures provided only one disk of each mirrored pair fails; the worst case is both disks in a mirrored pair.

RAID 1: Also called mirroring or shadowing, there are two copies of every piece of data. It is the simplest and oldest disk redundancy scheme, but it also has the highest cost. Some array controllers will optimize read performance by allowing the mirrored disks to act independently for reads, but it may take longer for the mirrored writes to complete.

RAID 2: This organization applies memory-style error-correcting codes to disks. It was included because there was such a disk array product at the time of the original RAID paper, but such products have since disappeared.

RAID 3: Since the higher-level disk interfaces understand the health of a disk, it is easy to figure out which disk failed. One extra disk is used to hold all the parity information of the data disks; this disk allows recovery from a disk failure. The data is organized in stripes, with N data blocks and one parity block. When a failure occurs, you just "subtract" the good data from the good blocks, and what remains is the missing data. RAID 3 assumes the data is spread across all disks on reads and writes.

RAID 4: We can increase the number of reads per second by allowing each disk to perform independent reads.
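The parity "subtraction" used by RAID 3 and RAID 4 is just XOR: the parity block is the XOR of the data blocks, and a lost block is the XOR of the survivors and the parity. A minimal sketch (the block contents are made up):

```python
# RAID 3/4 parity: parity = XOR of all data blocks; a lost block is
# recovered by XOR-ing the surviving blocks with the parity block.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"\x0f\xf0", b"\x33\xcc", b"\xaa\x55"]   # N = 3 data blocks
parity = xor_blocks(data)                        # the check disk's block

lost = 1                                         # pretend disk 1 failed
survivors = [blk for i, blk in enumerate(data) if i != lost]
recovered = xor_blocks(survivors + [parity])
assert recovered == data[lost]                   # the missing data is back
```

XOR works here because it is its own inverse: XOR-ing the parity with all surviving data blocks cancels everything except the missing block.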
To increase the number of writes per second: first, the array reads the old data that is about to be overwritten, and calculates which bits would change before it writes the new data. It then reads the old value of the parity on the check disk, updates the parity according to the list of changes, and writes the new value of the parity to the check disk. These so-called "small writes" are still slower than small reads, but they are faster than if you had to read all disks on every write. RAID 4 has the same low check disk overhead as RAID 3, and it can still do large reads and writes as fast as RAID 3, but control is more complex.

RAID 5: RAID 4 has the single parity disk as a performance bottleneck. RAID 5 simply distributes the parity information across all disks in the array, thereby removing the bottleneck. The parity block in each stripe is rotated so that parity is spread evenly across all disks. The disk array controller can calculate which disk has the parity when it wants to write a given block, and this is a simple calculation. RAID 5 has the same low check disk overhead as RAID 3 and 4, and it can do the large reads and writes of RAID 3 and the small reads of RAID 4, but it requires the most sophisticated controller.

Two further RAID organizations have become popular:

RAID 10 versus 01 (that is, 1+0 versus 0+1)

RAID 6: Beyond a Single Disk Failure

RAID 1 to 5 protect against a single self-identifying failure. However, if an operator accidentally replaces the wrong disk during a failure, then the disk array will experience two failures, and data will be lost. As disk bandwidth is growing more slowly than disk capacity, the MTTR of a disk in a RAID system is increasing, which in turn increases the chances of a second failure. Because these failures were becoming a danger to customers, a more robust scheme was created to protect data, called row-diagonal parity, or RAID-DP. Row-diagonal parity uses redundant space based on a parity calculation on a per-stripe basis.
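The read-modify-write "small write" above can also be sketched with XOR: the new parity follows from the old data, the new data, and the old parity alone, without touching the other disks. Values are illustrative.

```python
# RAID 4/5 small write: new parity = old data XOR new data XOR old parity.
def small_write_parity(old_data, new_data, old_parity):
    return bytes(od ^ nd ^ op
                 for od, nd, op in zip(old_data, new_data, old_parity))

old_data, new_data = b"\x0f", b"\xf0"   # block being overwritten
other_disk = b"\x33"                    # a disk the small write never reads
old_parity = bytes(a ^ b for a, b in zip(old_data, other_disk))

new_parity = small_write_parity(old_data, new_data, old_parity)

# Same result as recomputing parity across all disks from scratch:
assert new_parity == bytes(a ^ b for a, b in zip(new_data, other_disk))
```

This is why a small write costs four disk accesses (read old data, read old parity, write new data, write new parity) regardless of how wide the array is.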
Since it is protecting against a double failure, it adds two check blocks per stripe of data. Let's assume there are p + 1 disks total, so p - 1 disks have data. Figure 6.5 shows the case when p is 5. The row parity disk is just like in RAID 4; it contains the even parity across the other four data blocks in its stripe. Each block of the diagonal parity disk contains the even parity of the blocks in the same diagonal, and each diagonal does not cover one disk.

Figure 6.5 Row-diagonal parity for p = 5, which protects four data disks from double failures.

6.3 Definition and Examples of Real Faults and Failures (5m)

The terms fault, error, and failure are often used interchangeably, but they have different meanings. Computer system dependability is the quality of delivered service such that reliance can justifiably be placed on this service. The service delivered by a system is its observed actual behavior as perceived by other system(s) interacting with this system's users. Each module also has an ideal specified behavior, where a service specification is an agreed description of the expected behavior. A system failure occurs when the actual behavior deviates from the specified behavior. The failure occurred because of an error, a defect in that module. The cause of an error is a fault. The time between the occurrence of an error and the resulting failure is the error latency. Thus, an error is the manifestation in the system of a fault, and a failure is the manifestation on the service of an error.

Example: A programming mistake is a fault. The consequence is an error (or latent error) in the software. Upon activation, the error becomes effective. When this effective error produces erroneous data that affect the delivered service, a failure occurs. Similarly, an alpha particle hitting a DRAM can be considered a fault; if it changes the memory, it creates an error; when the error affects the delivered service, a failure occurs.
The relation between faults, errors, and failures is as follows. A fault creates one or more latent errors. The properties of errors are (1) a latent error becomes effective once activated; (2) an error may cycle between its latent and effective states; (3) an effective error often propagates from one component to another, thereby creating new errors. A component failure occurs when the error affects the delivered service. These properties are recursive and apply to any component in the system.

Gray and Siewiorek classify faults into four categories according to their cause:
1. Hardware faults: failures of hardware devices.
2. Design faults: faults in software and hardware design.
3. Operation faults: mistakes by operations and maintenance personnel.
4. Environmental faults: fire, flood, earthquake, power failure, and sabotage.

Faults are also classified by their duration into transient, intermittent, and permanent. Transient faults exist for a limited time and are not recurring. Intermittent faults cause a system to oscillate between faulty and fault-free operation. Permanent faults do not correct themselves with the passing of time.

6.4 I/O Performance, Reliability Measures, and Benchmarks

Response time is defined as the time a task takes from the moment it is placed in the buffer until the server finishes the task. Throughput is the average number of tasks completed by the server over a time period. To get high throughput, the server should never be idle, and thus the buffer should never be empty.

Figure 6.6 The traditional producer-server model of response time and throughput.

Throughput versus Response Time (5m)

Figure 6.9 shows throughput versus response time (or latency) for an I/O system. The knee of the curve is the area where a little more throughput results in much longer response time or, conversely, a little shorter response time results in much lower throughput.

Figure 6.9 Throughput versus response time. Latency is normally reported as response time.
An interaction, or transaction, with a computer is divided into three parts:
1. Entry time: the time for the user to enter the command.
2. System response time: the time between when the user enters the command and the complete response is displayed.
3. Think time: the time from the reception of the response until the user begins to enter the next command.

The sum of these three parts is called the transaction time. User productivity is inversely proportional to transaction time. To reflect the importance of response time to user productivity, I/O benchmarks also address the response time versus throughput trade-off.

Transaction-Processing Benchmarks

Transaction processing (TP, or OLTP for online transaction processing) is concerned with I/O rate as opposed to data rate. TP generally involves changes to a large body of shared information from many terminals, with the TP system guaranteeing proper behavior on a failure. Suppose, for example, a bank's computer fails when a customer tries to withdraw money from an ATM. The TP system would guarantee that the account is debited if the customer received the money and that the account is unchanged if the money was not received. Airline reservation systems as well as banks are traditional customers for TP. The Transaction Processing Council (TPC) has produced eight benchmarks; Figure 6.12 summarizes them. TPC-C uses a database to simulate an order-entry environment of a wholesale supplier, including entering and delivering orders, recording payments, checking the status of orders, and monitoring the level of stock at the warehouses. It runs five concurrent transactions of varying complexity, and the database includes nine tables with a scalable range of records and customers. TPC-C is measured in transactions per minute (tpmC) and in price of system, including hardware, software, and three years of maintenance support. These TPC benchmarks have these unusual characteristics:
1. Price is included with the benchmark results. The cost of hardware, software, and maintenance agreements is included in a submission, which enables evaluations based on price-performance as well as high performance.
2. The data set generally must scale in size as the throughput increases. The benchmarks are trying to model real systems, in which the demand on the system and the size of the data stored in it increase together. It makes no sense, for example, to have thousands of people per minute access hundreds of bank accounts.
3. The benchmark results are audited. Before results can be submitted, they must be approved by a certified TPC auditor, who enforces the TPC rules that try to make sure that only fair results are submitted. Results can be challenged and disputes resolved by going before the TPC.
4. Throughput is the performance metric, but response times are limited. For example, with TPC-C, 90% of the New-Order transaction response times must be less than 5 seconds.
5. An independent organization maintains the benchmarks. Dues collected by the TPC pay for an administrative structure including a Chief Operating Office. This organization settles disputes, conducts mail ballots on approval of changes to benchmarks, holds board meetings, and so on.

6.5 A Little Queuing Theory

For I/O systems, a back-of-the-envelope calculation gives a best-case analysis, while full-scale simulation is much more accurate at far greater effort. In between, we have a mathematical tool to guide I/O design that is a little more work and much more accurate than best-case analysis, but much less work than full-scale simulation: we can define a set of simple theorems that help calculate response time and throughput of an entire I/O system. This helpful field is called queuing theory. Consider a black-box approach to I/O systems, as in Figure 6.15.
In our example, the processor is making I/O requests that arrive at the I/O device, and the requests "depart" when the I/O device fulfills them. If the system is in steady state, then the number of tasks entering the system must equal the number of tasks leaving the system. Here we assume we are evaluating systems with multiple independent requests for I/O service that are in equilibrium: the input rate must be equal to the output rate. This leads us to Little's Law, which relates the average number of tasks in the system, the average arrival rate of new tasks, and the average time to perform a task:

Mean number of tasks in system = Arrival rate x Mean response time

Derivation of Little's Law: Assume we observe a system for Time_observe minutes. During that observation, we record how long it took each task to be serviced, and then sum those times. The number of tasks completed during Time_observe is Number_tasks, and the sum of the times each task spends in the system is Time_accumulated. Note that the tasks can overlap in time, so Time_accumulated >= Time_observe. Then:

Mean number of tasks in system = Time_accumulated / Time_observe
Mean response time = Time_accumulated / Number_tasks
Arrival rate = Number_tasks / Time_observe

Multiplying the last two together gives Time_accumulated / Time_observe, which is the first quantity; that is, Mean number of tasks in system = Arrival rate x Mean response time.

In the black box, the area where the tasks are waiting to be serviced is called the queue, or waiting line, and the device performing the requested service is called the server. Little's Law and a series of definitions lead to several useful equations. Server utilization is:

Server utilization = Arrival rate x Time_server

Server utilization must be between 0 and 1; otherwise, there would be more tasks arriving than could be serviced, violating our assumption that the system is in equilibrium. Utilization is also called traffic intensity and is represented by the symbol rho.

Poisson Distribution of Random Variables

A variable is random if you cannot know exactly what its next value will be, but you may know the probability of all possible values.
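Little's Law and the utilization formula above can be checked numerically; the rates and times below are assumed purely for illustration.

```python
# Little's Law: mean tasks in system = arrival rate x mean response time.
arrival_rate = 40.0          # 40 tasks per second arriving (assumed)
mean_response_s = 0.050      # 50 ms average time in the system (assumed)

mean_tasks_in_system = arrival_rate * mean_response_s
assert abs(mean_tasks_in_system - 2.0) < 1e-12   # 2 tasks in flight on average

# Server utilization: arrival rate x average service time per task.
time_server_s = 0.020        # 20 ms average service time (assumed)
utilization = arrival_rate * time_server_s
assert abs(utilization - 0.8) < 1e-12            # must stay below 1 in equilibrium
```

Note what happens if the assumed service time rises to 25 ms: utilization hits 1.0, the equilibrium assumption breaks, and the queue (and response time) grows without bound.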
Requests for service from an I/O system can be modeled by a random variable, because the operating system is normally switching between several processes that generate independent I/O requests. One way to characterize the distribution of values of a random variable with discrete values is a histogram, which divides the range between the minimum and maximum values into subranges called buckets. Histograms then plot the number in each bucket as columns, and they work well for distributions that take discrete values. For the weighted arithmetic mean time, let's first assume that after measuring the number of occurrences of each task, say n_i, you could compute the frequency of occurrence of task i:

f_i = n_i / (sum of all n_i)

The weighted arithmetic mean time is then:

Weighted mean time = f_1 x T_1 + f_2 x T_2 + ... + f_n x T_n

where T_i is the time for task i and f_i is the frequency of occurrence of task i. The variance can be calculated as:

Variance = (f_1 x T_1^2 + f_2 x T_2^2 + ... + f_n x T_n^2) - (Weighted mean time)^2

One of the most widely used exponential distributions is called a Poisson distribution. It is used to characterize random events in a given time interval, and is described by the following equation (called the probability mass function):

Probability(k) = e^(-a) x a^k / k!

where a = Rate of events x Elapsed time. If interarrival times are exponentially distributed and we use the arrival rate from above for the rate of events, the number of arrivals in a time interval t is a Poisson process, which has the Poisson distribution with a = Arrival rate x t. The length of time a new task must wait for the server to complete a task is called the average residual service time; assuming Poisson arrivals:

Average residual service time = 1/2 x Weighted mean time x (1 + C^2)

where C^2 is the squared coefficient of variance, Variance/(Weighted mean time)^2. When the distribution is not random and all possible values are equal to the average, the standard deviation is 0, so C is 0, and the average residual service time is just half the average service time. If the distribution is random and it is Poisson, then C is 1 and the average residual service time equals the weighted arithmetic mean time.

Assumptions about the queuing model:
1. The system is in equilibrium.
2. The times between two successive requests arriving, called the interarrival times, are exponentially distributed, which characterizes the arrival rate mentioned earlier.
3. The number of sources of requests is unlimited.
4. The server can start on the next job immediately after finishing the prior one.
5. There is no limit to the length of the queue, and it follows the first in, first out order discipline, so all tasks in line must be completed.
6. There is one server.

Such a queue is called M/M/1:
M = exponentially random request arrival (C^2 = 1), with M standing for Markov;
M = exponentially random service time (C^2 = 1), with M again for Markov;
1 = single server.

The M/M/1 model is simple and widely used. Many real systems have multiple disks and hence could use multiple servers, as in Figure 6.17. Such a system is called an M/M/m model in queuing theory.

Figure 6.17 The M/M/m multiple-server model.

6.6 Crosscutting Issues (6m)

1. Point-to-Point Links and Switches Replacing Buses

Point-to-point links and switches are increasing in popularity as Moore's Law continues to reduce the cost of components. Combined with the higher I/O bandwidth demands from faster processors, faster disks, and faster local area networks, the decreasing cost advantage of buses means the days of buses in desktop and server computers are numbered. The number of bits and the bandwidth quoted for a new generation are per direction, so they double for both directions. A common way to increase bandwidth is to offer versions with several times the number of wires and bandwidth.

2. Block Servers versus Filers

The operating system provides the file abstraction on top of blocks stored on the disk. The terms logical units, logical volumes, and physical volumes are related terms used in Microsoft and UNIX systems to refer to collections of disk blocks.
A logical unit appears to the server as a single virtual "disk." In a RAID disk array, the logical unit is configured as a particular RAID layout. A physical volume is the device file used by the file system to access a logical unit. A logical volume provides a level of virtualization that enables the file system to split the physical volume across multiple pieces or to stripe data across multiple physical volumes.

• A logical unit is an abstraction of a disk array that presents a virtual disk to the operating system, while physical and logical volumes are abstractions used by the operating system to divide these virtual disks into smaller, independent file systems.

In one approach, the file illusion is maintained in the server: the server accesses storage as disk blocks and maintains the metadata, and since most file systems use a file cache, the server must also maintain consistency of file accesses. In the alternative approach, the disk subsystem itself maintains the file abstraction, and the server uses a file system protocol to communicate with storage. Such devices are called network attached storage (NAS) devices. The term filer is used for NAS devices that provide only file service and file storage. Network Appliance was one of the first companies to make filers. The idea behind placing storage on the network is to make it easier for many computers to share information and for operators to maintain the shared system.

3. Asynchronous I/O and Operating Systems

Disks spend more time in delays than in transferring data. The straightforward approach to I/O is to request data and then start using it; the operating system switches to another process until the desired data arrive, and then switches back to the requesting process. Such a style is called synchronous I/O. The alternative model is for the process to continue after making a request, and it is not blocked until it tries to read the requested data.
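This request-now, block-on-read-later pattern can be sketched with Python's asyncio. This is a toy model, not how a real operating system implements it: `disk_read`, the block numbers, and the latencies are all invented for illustration.

```python
import asyncio

async def disk_read(block: int, latency: float) -> bytes:
    """Model one I/O request; the await is where the OS would let
    the process (or another coroutine) keep running."""
    await asyncio.sleep(latency)      # stands in for seek + rotation + transfer
    return b"data-%d" % block

async def synchronous_style(blocks):
    # Synchronous I/O: wait for each result before issuing the next request.
    return [await disk_read(b, 0.01) for b in blocks]

async def asynchronous_style(blocks):
    # Asynchronous I/O: issue all requests first, so many are in flight;
    # block only when the data is actually consumed.
    pending = [asyncio.create_task(disk_read(b, 0.01)) for b in blocks]
    return [await t for t in pending]

results = asyncio.run(asynchronous_style(range(4)))
```

With four requests in flight, the asynchronous version finishes in roughly one device latency instead of four, which is the point of overlapping many outstanding I/Os.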
Such asynchronous I/O allows the process to continue making requests so that many I/O requests can be operating simultaneously. Asynchronous I/O shares the same philosophy as caches in out-of-order CPUs, which achieve greater bandwidth by having multiple outstanding events.

6.7 Designing and Evaluating an I/O System: The Internet Archive Cluster

The art of I/O system design is to find a design that meets goals for cost, dependability, and variety of devices while avoiding bottlenecks in I/O performance and dependability. Avoiding bottlenecks means that components must be balanced between main memory and the I/O device. The architect must also plan for expansion so that customers can tailor the I/O to their applications. In designing an I/O system, we analyze performance, cost, capacity, and availability using varying I/O connection schemes and different numbers of I/O devices of each type. The steps to follow in designing an I/O system are: 1. List the different types of I/O devices to be connected to the machine, or list the standard buses and networks that the machine will support. 2. List the physical requirements for each I/O device. Requirements include size, power, connectors, bus slots, expansion cabinets, and so on. 3. List the cost of each I/O device, including the portion of cost of any controller needed for this device. 4. List the reliability of each I/O device. 5. Record the processor resource demands of each I/O device. This list should include clock cycles for instructions used to initiate an I/O, to support operation of an I/O device (such as handling interrupts), and to complete I/O; processor clock stalls due to waiting for I/O to finish using the memory, bus, or cache; and processor clock cycles to recover from an I/O activity, such as a cache flush. 6. List the memory and I/O bus resource demands of each I/O device. Even when the processor is not using memory, the bandwidth of main memory and the I/O connection is limited. 7.
Assess the performance and availability of the I/O devices. Performance can be evaluated with simulation or by using queuing theory; reliability can be calculated assuming I/O devices fail independently; availability can be computed from reliability by estimating MTTF for the devices, taking into account the time from failure to repair. Performance can be measured either as megabytes per second or I/Os per second, depending on the needs of the application. For high performance, the speed of the I/O devices, the number of I/O devices, and the speed of the memory and processor are what matter. For low cost, most of the cost should be in the I/O devices themselves.

6.8 Putting It All Together: NetApp FAS6000 Filer

Network Appliance entered the storage market in 1992 with the goal of providing an easy-to-operate file server running NFS, using its own log-structured file system and a RAID 4 disk array. The company later added support for the Windows CIFS file system and RAID 6. To support applications that want access to raw data blocks without the overhead of a file system, such as database systems, NetApp filers can serve data blocks over a standard Fibre Channel interface. NetApp also supports iSCSI.

The latest hardware product is the FAS6000. It is a multiprocessor based on the AMD Opteron microprocessor connected using its HyperTransport links. The microprocessors run the NetApp software stack, including NFS, CIFS, RAID-DP, SCSI, and so on. The FAS6000 comes as either a dual processor (FAS6030) or a quad processor (FAS6070). DRAM is distributed to each Opteron microprocessor, and the DRAM bus is 128 bits wide, plus extra bits for SEC/DED memory. Both models dedicate four HyperTransport links to I/O. As a filer, the FAS6000 needs a lot of I/O to connect to the disks and to connect to the servers. The integrated I/O consists of: 1. 8 Fibre Channel (FC) controllers and ports, 2. 6 Gigabit Ethernet links, 3. 6 slots for x8 (2 GB/sec) PCI Express cards, 4.
3 slots for PCI-X 133 MHz, 64-bit cards, 5. plus standard I/O options like IDE, USB, and 32-bit PCI.

The 8 Fibre Channel controllers can each be attached to 6 shelves containing 14 3.5-inch FC disks, so the maximum number of drives for the integrated I/O is 672 disks. Additional FC controllers can be added to the option slots to connect up to 1008 drives. The six 1-gigabit Ethernet links connect to servers to make the FAS6000 look like a file server if running NFS or CIFS, or like a block server if running iSCSI.

For dependability, FAS6000 filers can be paired so that if one fails, the other can take over. Clustered failover requires that both filers have access to all disks in the pair using the FC interconnect. The health of the filers is constantly monitored, and failover happens automatically. The healthy filer maintains its own network identity and its own primary functions, but it also assumes the network identity of the failed filer and handles all its data requests via a virtual filer until an administrator restores the data service to the original state.

6.9 Fallacies and Pitfalls

Fallacy: Components fail fast.

The fault-tolerance literature is based on the assumption that a component operates perfectly until a latent error becomes effective, at which point a failure occurs that stops the component. The Tertiary Disk project had the opposite experience: many components started acting strangely long before they failed, and it was generally up to the system operator to determine whether to declare a component failed. The component would generally be willing to continue to act in violation of the service agreement until an operator "terminated" it.

Fallacy: Computer systems achieve 99.999% availability ("five nines"), as advertised.

Marketing departments of companies making servers tout the availability of their computer hardware; in terms of Figure 6.21, they claim availability of 99.999%, nicknamed five nines.
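The downtime implied by an availability class is easy to check; the following sketch computes unavailable minutes per year for one through five nines (the function name is invented for illustration).

```python
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600 minutes

def unavailable_minutes(nines: int) -> float:
    """Minutes of downtime per year for an availability of 'nines' nines
    (e.g., nines=5 means 99.999% available)."""
    availability = 1.0 - 10.0 ** (-nines)
    return (1.0 - availability) * MINUTES_PER_YEAR

for n in range(1, 6):
    print(n, "nines:", round(unavailable_minutes(n), 2), "minutes/year")
```

Five nines works out to about 5.26 minutes of unavailability per year, which is the figure the five-nines claims advertise.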
Even the marketing departments of operating system companies tried to give this impression.

Figure 6.21 Minutes unavailable per year to achieve availability class. Note that five nines means five minutes unavailable per year.

Five minutes of unavailability per year is certainly impressive. For example, Hewlett-Packard claims that the HP-9000 server hardware and HP-UX operating system can deliver a 99.999% availability guarantee "in certain pre-defined, pre-tested customer environments." This guarantee does not include failures due to operator faults, application faults, or environmental faults, which are likely the dominant fault categories today. It is also unclear what the financial penalty is to a company if a system does not match its guarantee. In contrast to the marketing suggestions, well-managed servers in 2006 typically achieved 99% to 99.9% availability.

Pitfall: Where a function is implemented affects its reliability.

In theory, it is fine to move the RAID function into software; in practice, it is very difficult to make it work reliably. The software culture is generally based on eventual correctness via a series of releases and patches, and software RAID is also difficult to isolate from other layers of software. Many customers have lost data due to software bugs or environment incompatibilities in software RAID systems. Hardware systems are not immune to bugs, but the hardware culture tends to place a greater emphasis on testing correctness in the initial release, and the hardware is independent of the version of the operating system.

Fallacy: Operating systems are the best place to schedule disk accesses.

Higher-level interfaces like ATA and SCSI offer logical block addresses to the host operating system, so the best an OS can do is sort the logical block addresses into increasing order. Since only the disk knows the mapping of the logical addresses onto the physical geometry of sectors, tracks, and surfaces, the disk itself can reduce the rotational and seek latencies.
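The OS-level scheduling described above can be sketched as follows. This toy model (the request queue and starting position are invented) measures "seek cost" as distance between logical block addresses, which is exactly the simplification that breaks down once the disk's physical geometry differs from the logical layout.

```python
def total_seek_distance(requests, start=0):
    """Sum of head movement, measuring distance in logical block addresses."""
    pos, total = start, 0
    for r in requests:
        total += abs(r - pos)
        pos = r
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]   # hypothetical pending LBAs
fifo = total_seek_distance(queue, start=53)            # serve in arrival order
elevator = total_seek_distance(sorted(queue), start=53)  # OS sorts into increasing order
```

Sorting helps (208 versus 640 units of movement here), but only the disk can go further: with knowledge of sectors, tracks, and rotational position it can reorder to cut both seek and rotational latency.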
Fallacy: The average seek time of a disk in a computer system is the time for a seek of one-third the number of cylinders.

This fallacy comes from confusing the way manufacturers market disks with the expected performance, and from the false assumption that seek times are linear in distance. The one-third-distance rule of thumb comes from calculating the distance of a seek from one random location to another random location; in the past, manufacturers listed the seek of this distance to offer a consistent basis for comparison. Assuming (incorrectly) that seek time is linear in distance, and using the manufacturer's reported minimum and "average" seek times, a common technique to predict seek time is

Time_seek = Time_minimum + (Distance / Distance_average) × (Time_average − Time_minimum)

Implementing Cache Coherence in a DSM Multiprocessor (II)

Implementing a directory-based cache coherence protocol requires overcoming all the problems related to nonatomic actions for a snooping protocol, but without the use of broadcast, which in a snooping protocol forced a serialization on competing writes and provided the serialization required for the memory consistency model. Avoiding the need to broadcast is a central goal for a directory-based system, so another method for ensuring serialization is necessary.

The serialization of requests for exclusive access to a memory block is easily enforced, since those requests will be serialized when they reach the unique directory for the specified block. If the directory controller simply ensures that one request is completely serviced before the next is begun, writes will be serialized. Because the requesters cannot know ahead of time who will win the race and because the communication is not a broadcast, the directory must signal to the winner when it completes the processing of the winner's request. This is done by a message that supplies the data or by an explicit acknowledgment message.
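A minimal sketch of this serialization, with invented names and none of a real protocol's messages or states, is a directory that queues competing exclusive requests per block and acknowledges only the winner:

```python
from collections import deque

class Directory:
    """Toy directory: requests for the same block are serviced strictly
    one at a time (FIFO), which serializes competing writes."""
    def __init__(self):
        self.owner = {}        # block -> node currently holding exclusive access
        self.pending = {}      # block -> queue of waiting requesters

    def request_exclusive(self, node, block):
        q = self.pending.setdefault(block, deque())
        q.append(node)
        # Grant only if this request is at the head and the block is free.
        if q[0] == node and block not in self.owner:
            return self._grant(block)
        return None            # no ack yet: the loser waits in the queue

    def _grant(self, block):
        node = self.pending[block].popleft()
        self.owner[block] = node
        return ("ack", node, block)   # explicit acknowledgment message

    def release(self, block):
        del self.owner[block]
        if self.pending.get(block):
            return self._grant(block)  # next queued requester wins
        return None

d = Directory()
first = d.request_exclusive("P0", 0x40)   # P0 wins the race and is acked
second = d.request_exclusive("P1", 0x40)  # P1 is serialized behind P0
later = d.release(0x40)                   # P1 is acked only now
```

The explicit ack is the signal the text describes: since there is no broadcast, the loser cannot observe the winner's transaction and must wait for the directory to reach its queued request.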
Although the acknowledgment that a requesting node has ownership is complete when the write miss or ownership acknowledgment message is transmitted, we still do not know that the invalidates have been received and processed by the nodes that were in the sharing set. All memory consistency models eventually require that a processor know that all the invalidates for a write have been processed. In a snooping scheme, the nature of the broadcast network provides this assurance. In a directory scheme, we can know that the invalidates have been completed by having the destination nodes of the invalidate messages explicitly acknowledge the invalidation messages sent from the directory. There are two possibilities. In the first, the acknowledgments are sent to the directory, which counts them and, when all acknowledgments have been received, confirms this with a single message to the original requester. Alternatively, when granting ownership, the directory can tell the requester how many acknowledgments to expect; the destinations of the invalidate messages then send an acknowledgment directly to the requester, whose identity is provided by the directory. Most existing implementations use the latter scheme, because it reduces the bottleneck at the directory. Although the requirement for acknowledgments adds complexity to directory protocols, it replaces the serialization that broadcast provided in a snooping scheme.

The Intel Pentium 4

The Pentium 4 is a processor with a deep pipeline supporting multiple issue with speculation; it also supports multithreading. The Pentium 4 uses an out-of-order speculative microarchitecture, called NetBurst, that is deeply pipelined with the goal of achieving high instruction throughput by combining multiple issue and high clock rates. Like the microarchitecture used in the Pentium III, a front-end decoder translates each IA-32 instruction to a series of micro-operations (uops), which are similar to typical RISC instructions.
The uops are then executed by a dynamically scheduled speculative pipeline. The Pentium 4 uses a novel execution trace cache to generate the uop instruction stream. A trace cache is a type of instruction cache that holds sequences of instructions to be executed, including nonadjacent instructions separated by branches; a trace cache tries to exploit the temporal sequencing of instruction execution rather than the spatial locality exploited in a normal cache. The Pentium 4's execution trace cache is a trace cache of uops, corresponding to the decoded IA-32 instruction stream. By filling the pipeline from the execution trace cache, the Pentium 4 avoids the need to redecode IA-32 instructions whenever the trace cache hits; only on trace cache misses are IA-32 instructions fetched from the L2 cache and decoded to refill the execution trace cache. Up to three IA-32 instructions may be decoded and translated every cycle, generating up to six uops; when a single IA-32 instruction requires more than three uops, the uop sequence is generated from the microcode ROM. The execution trace cache has its own branch target buffer, which predicts the outcome of uop branches.

After fetching from the execution trace cache, the uops are executed by an out-of-order speculative pipeline using register renaming rather than a reorder buffer. Up to three uops per clock can be renamed and dispatched to the functional unit queues, and three uops can be committed each clock cycle. There are four dispatch ports, which together allow a total of six uops to be dispatched to the functional units every clock cycle. The load and store units each have their own dispatch port, another port covers basic ALU operations, and a fourth handles FP and integer operations. Figure 2.26 shows a diagram of the microarchitecture. A two-level cache is used to minimize the frequency of DRAM accesses.
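The hit/miss behavior of a trace cache can be sketched as a map from a trace's starting address to an already-decoded uop sequence. This is a toy model only: the class, the stub decoder, and the two-uops-per-instruction assumption are all invented, and the real Pentium 4 structure is far more elaborate.

```python
class TraceCache:
    """Toy model: a hit on a trace's start address returns decoded uops
    directly, skipping decode; a miss decodes and fills the cache."""
    def __init__(self, decode):
        self.decode = decode        # function: IA-32 instruction -> uop list
        self.traces = {}            # start address -> cached uop sequence
        self.hits = self.misses = 0

    def fetch(self, start_addr, insts):
        if start_addr in self.traces:
            self.hits += 1          # pipeline is fed without redecoding
            return self.traces[start_addr]
        self.misses += 1            # decode (from L2 in the real machine)
        uops = [u for i in insts for u in self.decode(i)]
        self.traces[start_addr] = uops
        return uops

# Stub decoder: pretend every IA-32 instruction expands to two uops.
tc = TraceCache(lambda inst: [inst + "_uop0", inst + "_uop1"])
loop = ["add", "cmp", "jne"]        # one trace may span taken branches
tc.fetch(0x1000, loop)              # miss: decode and fill
tc.fetch(0x1000, loop)              # hit: no redecode needed
```

The payoff modeled here is the one the text describes: for loop-dominated code the same trace is fetched repeatedly, so decode work is paid only on the rare miss.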
Branch prediction is done with a branch-target buffer using a two-level predictor with both local and global branch histories. The Pentium 4's deep pipeline makes its use of speculation, and hence its dependence on branch prediction, critical to achieving high performance, so branch-prediction accuracy is crucial. Performance is also very dependent on the memory system: although dynamic scheduling and the large number of outstanding loads and stores support hiding the latency of cache misses, L2 misses are likely to cause a stall. The trace cache miss rate, by contrast, is almost negligible for this set of SPEC benchmarks.

The AMD Opteron and Intel Pentium 4 share a number of similarities:
• Both use a dynamically scheduled, speculative pipeline capable of issuing and committing three IA-32 instructions per clock.
• Both use a two-level on-chip cache structure, although the Pentium 4 uses a trace cache for the first-level instruction cache and recent Pentium 4s have larger second-level caches.
• They have similar transistor counts, die size, and power.

Figure 2.26 The Pentium 4 microarchitecture.