
UNIT 1

Basic Structure of Computers


FUNCTIONAL UNITS OF A COMPUTER SYSTEM
Digital computer systems consist of three distinct units: the input unit, the central processing unit, and the output unit. These units are interconnected by electrical cables to permit communication between them, which allows the computer to function as a system.
Input Unit
A computer must receive both data and program statements to function properly and be able to solve problems. Data and programs are fed to a computer through an input device. Computer input devices read data from a source, such as magnetic disks, and translate that data into electronic impulses for transfer into the CPU. Some typical input devices are a keyboard, a mouse, or a scanner.
Central Processing Unit
The brain of a computer system is the central processing unit (CPU). The CPU processes data transferred to it from one of the various input devices. It then transfers either an intermediate or a final result to one or more output devices. A central control section and work areas are required to perform calculations or manipulate data. The CPU is the computing center of the system. It consists of a control section, an arithmetic-logic section, and an internal storage section (main memory). Each section within the CPU serves a specific function and has a particular relationship with the other sections within the CPU.
CONTROL SECTION
The control section directs the flow of traffic (operations) and data. It also maintains order within the computer. The control section selects one program statement at a time from the program storage area, interprets the statement, and sends the appropriate electronic impulses to the arithmetic-logic and storage sections so they can carry out the instructions. The control section does not perform actual processing operations on the data. The control section instructs the input device on when to start and stop transferring data to the input storage area. It also tells the output device when to start and stop receiving data from the output storage area.
ARITHMETIC-LOGIC SECTION
The arithmetic-logic section performs arithmetic operations, such as addition, subtraction, multiplication, and division. Through internal logic capability, it tests various conditions encountered during processing and takes action based on the result. At no time does processing take place in the storage section. Data may be transferred back and forth between these two sections several times before processing is completed.
Computer architecture topics
Subdefinitions
Some practitioners of computer architecture at companies such as Intel and AMD use finer distinctions:
Macroarchitecture - architectural layers that are more abstract than microarchitecture, e.g. ISA
ISA (Instruction Set Architecture) - as defined above
Assembly ISA - a smart assembler may convert an abstract assembly language common to a group of machines into slightly different machine language for different implementations
Programmer Visible Macroarchitecture - higher-level language tools such as compilers may define a consistent interface or contract to programmers using them, abstracting differences between the underlying ISA, UISA, and microarchitectures. E.g. the C, C++, or Java standards define different Programmer Visible Macroarchitectures, although in practice the C microarchitecture for a particular computer includes ...
UISA (Microcode Instruction Set Architecture) - a family of machines with different hardware-level microarchitectures may share a common microcode architecture, and hence a UISA.
Pin Architecture - the set of functions that a microprocessor is expected to provide, from the point of view of a hardware platform. E.g. the x86 A20M, FERR/IGNNE or FLUSH pins, and the messages that the processor is expected to emit after completing a cache invalidation so that external caches can be invalidated. Pin architecture functions are more flexible than ISA functions (external hardware can adapt to changing encodings, or to a change from a pin to a message), but the functions are expected to be provided in successive implementations even if the manner of encoding them changes.
Design goals
The exact form of a computer system depends on the constraints and goals for which it was optimized. Computer architectures usually trade off standards, cost, memory capacity, latency and throughput. Sometimes other considerations, such as features, size, weight, reliability, expandability and power consumption, are factors as well.
The most common scheme carefully chooses the bottleneck that most reduces the computer's speed. Ideally, the cost is allocated proportionally to assure that the data rate is nearly the same for all parts of the computer, with the most costly part being the slowest. This is how skillful commercial integrators optimize personal computers.
Performance
Computer performance is often described in terms of clock speed (usually in MHz or GHz). This refers to the cycles per second of the main clock of the CPU. However, this metric is somewhat misleading, as a machine with a higher clock rate may not necessarily have higher performance. As a result, manufacturers have moved away from clock speed as a measure of performance.
Computer performance can also be measured by the amount of cache a processor has. If the clock speed (MHz or GHz) were a car, the cache would be its gas tank: no matter how fast the car goes, it still needs gas. All else being equal, the higher the speed and the greater the cache, the faster a processor runs.
Modern CPUs can execute multiple instructions per clock cycle, which dramatically speeds up a program. Other factors influence speed, such as the mix of functional units, bus speeds, available memory, and the type and order of instructions in the programs being run.
There are two main types of speed: latency and throughput. Latency is the time between the start of a process and its completion. Throughput is the amount of work done per unit time. Interrupt latency is the guaranteed maximum response time of the system to an electronic event (e.g. when the disk drive finishes moving some data). Performance is affected by a very wide range of design choices; for example, pipelining a processor usually makes latency worse (slower) but makes throughput better. Computers that control machinery usually need low interrupt latencies. These computers operate in a real-time environment and fail if an operation is not completed in a specified amount of time. For example, computer-controlled anti-lock brakes must begin braking almost immediately after they have been instructed to brake.
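The pipelining trade-off mentioned above can be made concrete with a little arithmetic. The following is a minimal sketch under assumed numbers (a hypothetical 5-stage datapath with 200 ps of logic per stage and 20 ps of pipeline-register overhead per stage, no stalls); it is not a model of any real processor.

```python
# Hypothetical numbers: a 5-stage datapath, 200 ps of logic per stage,
# and 20 ps of pipeline-register (latch) overhead per stage when pipelined.
STAGES = 5
STAGE_LOGIC_PS = 200
LATCH_OVERHEAD_PS = 20


def unpipelined():
    """Return (latency_ps, ps_between_completions) with no pipelining."""
    latency = STAGES * STAGE_LOGIC_PS  # all work is done in one long cycle
    return latency, latency            # the next instruction starts only after this one ends


def pipelined():
    """Return (latency_ps, ps_between_completions) for an ideal, stall-free pipeline."""
    cycle = STAGE_LOGIC_PS + LATCH_OVERHEAD_PS  # clock period = one stage plus its latch
    latency = STAGES * cycle                     # each instruction still passes every stage
    return latency, cycle                        # once full, one instruction completes per cycle


for name, (latency, gap) in [("unpipelined", unpipelined()), ("pipelined", pipelined())]:
    throughput = 1e12 / gap  # instructions completed per second
    print(f"{name:12s} latency = {latency} ps, throughput = {throughput:.3g} instr/s")
```

With these assumed values the latency of a single instruction grows from 1000 ps to 1100 ps, while throughput improves by roughly 4.5 times (one completion every 220 ps instead of every 1000 ps).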
The performance of a computer can be measured using other metrics, depending upon its application domain. A system may be CPU bound (as in numerical calculation), I/O bound (as in a web-serving application) or memory bound (as in video editing). Power consumption has become important in servers and portable devices like laptops.
Benchmarking tries to take all these factors into account by measuring the time a computer takes to run through a series of test programs. Although benchmarking shows strengths, it may not help one to choose a computer. Often the measured machines split on different measures. For example, one system might handle scientific applications quickly, while another might play popular video games more smoothly. Furthermore, designers have been known to add special features to their products, whether in hardware or software, which permit a specific benchmark to execute quickly but which do not offer similar advantages to other, more general tasks.
A Functional Unit is defined as a collection of computer systems and network infrastructure components which, when abstracted, can be more easily and obviously linked to the goals and objectives of the enterprise, ultimately supporting the success of the enterprise's mission.
From a technological perspective, a Functional Unit is an entity that consists of computer systems and network infrastructure components that deliver critical information assets, through network-based services, to constituencies that are authenticated to that Functional Unit.
Central Processing Unit (CPU)
The part of the computer that executes program instructions is known as the processor or central processing unit (CPU). In a microcomputer, the CPU is on a single electronic component, the microprocessor chip, within the system unit or system cabinet. The system unit also includes circuit boards, memory chips, ports and other components. A microcomputer system cabinet will also house disk drives, hard disks, etc., but these are considered separate from the CPU. The CPU is the principal part of any digital computer system, generally composed of the control unit and the arithmetic-logic unit. It constitutes the physical heart of the entire computer system; various peripheral equipment, including input/output devices and auxiliary storage units, is linked to it.
o The Control Unit is the part of a CPU or other device that directs its operation. The control unit tells the rest of the computer system how to carry out a program's instructions. It directs the movement of electronic signals between memory (which temporarily holds data, instructions and processed information) and the ALU. It also directs these control signals between the CPU and input/output devices. The control unit is the circuitry that controls the flow of information through the processor and coordinates the activities of the other units within it. In a way, it is the "brain", as it controls what happens inside the processor, which in turn controls the rest of the PC.
o The Arithmetic-Logic Unit, usually called the ALU, is a digital circuit that performs two types of operations: arithmetic and logical. Arithmetic operations are the fundamental mathematical operations consisting of addition, subtraction, multiplication and division. Logical operations consist of comparisons; that is, two pieces of data are compared to see whether one is equal to, less than, or greater than the other. The ALU is a fundamental building block of the central processing unit of a computer.
Memory
Memory enables a computer to store, at least temporarily, data and programs. Memory, also known as primary storage or main memory, is a part of the microcomputer that holds data for processing, instructions for processing the data (the program) and information (processed data). Part of the contents of the memory is held only temporarily; that is, it is stored only as long as the microcomputer is turned on. When you turn the machine off, the contents are lost. The capacity of the memory to hold data and program instructions varies in different computers. The original IBM PC could hold only approximately 640,000 characters of data or instructions, but modern microcomputers can hold millions, even billions of characters in their memory.
Input Device:
An input device, usually a keyboard or mouse, is the conduit through which data and instructions enter a computer. A personal computer would be useless if you could not interact with it, because the machine could not receive instructions or deliver the results of its work. Input devices accept data and instructions from the user or from another computer system (such as a computer on the Internet). Output devices return processed data to the user or to another computer system.
The most common input device is the keyboard, which accepts letters, numbers, and commands from the user. Another important type of input device is the mouse, which lets you select options from on-screen menus. You use a mouse by moving it across a flat surface and pressing its buttons.
A variety of other input devices work with personal computers, too. The trackball and touchpad are variations of the mouse and enable you to draw or point on the screen. The joystick is a swiveling lever mounted on a stationary base that is well suited for playing video games.
Basic Operational Concepts of a Computer
• Most computer operations are executed in the ALU (arithmetic and logic unit) of a processor.
• Example: to add two numbers that are both located in memory:
  - Each number is brought into the processor, and the actual addition is carried out by the ALU.
  - The sum then may be stored in memory or retained in the processor for immediate use.
Registers
• When operands are brought into the processor, they are stored in high-speed storage elements (registers).
• A register can store one piece of data (8-bit registers, 16-bit registers, 32-bit registers, 64-bit registers, etc.).
• Access times to registers are faster than access times to the fastest cache unit in the memory hierarchy.
Instructions
• Instructions for a processor are defined in the ISA (Instruction Set Architecture), Level 2.
• Typical instructions include:
  - Mov BX, LocA
    • Fetch the instruction
    • Fetch the contents of memory location LocA
    • Store the contents in general-purpose register BX
  - Add AX, BX
    • Fetch the instruction
    • Add the contents of registers BX and AX
    • Place the sum in register AX
How are instructions sent between memory and the processor?
• The program counter (PC) or instruction pointer (IP) contains the memory address of the next instruction to be fetched and executed.
• Send the address of the memory location to be accessed to the memory unit and issue the appropriate control signals (memory read).
• The instruction register (IR) holds the instruction that is currently being executed.
• Timing is crucial and is handled by the control unit within the processor.
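The fetch/execute steps above can be traced with a toy simulator. This is a minimal sketch, assuming a hypothetical machine in which each instruction is a Python tuple occupying one memory word, `LocA` is placed at address 100, and AX starts at 5; none of these details come from a real ISA.

```python
# Toy machine (hypothetical): instructions are tuples, one per memory word.
SYMBOLS = {"LocA": 100}  # symbolic name -> memory address (assumed layout)

memory = {
    0: ("MOV", "BX", "LocA"),  # Mov BX, LocA
    1: ("ADD", "AX", "BX"),    # Add AX, BX  (AX <- AX + BX)
    2: ("HALT",),
    100: 7,                    # data stored at LocA
}
regs = {"PC": 0, "IR": None, "AX": 5, "BX": 0}

while True:
    # Fetch: send PC to memory (a "memory read") and latch the word into IR.
    regs["IR"] = memory[regs["PC"]]
    regs["PC"] += 1  # PC now points at the next instruction
    op, *operands = regs["IR"]

    # Execute: decode the opcode held in IR.
    if op == "MOV":
        dest, src = operands
        regs[dest] = memory[SYMBOLS[src]]  # fetch contents of LocA, store in BX
    elif op == "ADD":
        dest, src = operands
        regs[dest] = regs[dest] + regs[src]  # ALU adds BX and AX; sum goes to AX
    elif op == "HALT":
        break

print(regs["AX"], regs["PC"])
```

The PC is incremented right after the fetch, so while an instruction executes, the PC already holds the address of the next one, while the IR holds the instruction currently being executed.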
SINGLE AND MULTIPLE BUS STRUCTURES
Single-bus and multiple-bus structures are two ways of organizing the buses in a computer. A bus is basically a subsystem that transfers data between the components of a computer, either within a computer or between two computers. A bus can connect several peripheral devices at the same time.
- A multiple-bus structure has several interconnected buses, so more than one transfer can take place at the same time.
- A single-bus structure is very simple: all units share one bus, so only one transfer can take place at a time.
(i) In a single-bus structure, all units are connected to the same bus, rather than to different buses as in a multiple-bus structure.
(ii) A multiple-bus structure's performance is better than a single-bus structure's.
(iii) A single-bus structure is cheaper than a multiple-bus structure.
Computer software, or just software, is a general term used to describe the role that computer programs, procedures and documentation play in a computer system.
The term includes:
Application software, such as word processors, which perform productive tasks for users.
Firmware, which is software that resides in electrically programmable memory devices on mainboards or other types of integrated hardware carriers.
Middleware, which controls and co-ordinates distributed systems.
System software, such as operating systems, which interface with hardware to provide the necessary services for application software.
Software testing is a domain independent of development and programming. Software testing consists of various methods to test and declare a software product fit before it can be launched for use by either an individual or a group.
Testware, which is an umbrella term or container term for all utilities and application software that serve in combination for testing a software package, but which may not necessarily contribute to operational purposes. As such, testware is not a standing configuration but merely a working environment for application software or subsets thereof.
Software Characteristics
Software is developed and engineered.
Software doesn't "wear out".
Most software continues to be custom built.
Types of software
[Figure: a layer structure showing where the operating system is located in software systems generally used on desktops]
System software
System software helps run the computer hardware and computer system. It includes a combination of the following:
device drivers
operating systems
servers
utilities
windowing systems
The purpose of systems software is to unburden the applications programmer from the often complex details of the particular computer being used, including such accessories as communications devices, printers, device readers, displays and keyboards, and also to partition the computer's resources such as memory and processor time in a safe and stable manner. Examples are Windows XP, Linux and Mac OS.
Programming software
Programming software usually provides tools to assist a programmer in writing computer programs and software using different programming languages in a more convenient way. The tools include:
compilers
debuggers
interpreters
linkers
text editors
Application software
Application software allows end users to accomplish one or more specific (not directly computer-development-related) tasks. Typical applications include:
industrial automation
business software
computer games
quantum chemistry and solid state physics software
telecommunications (i.e., the Internet and everything that flows on it)
databases
educational software
medical software
military software
molecular modeling software
image editing
spreadsheet
simulation software
word processing
decision-making software
Application software exists for and has impacted a wide variety of topics.
Assembler
Typically a modern assembler creates object code by translating assembly instruction mnemonics into opcodes, and by resolving symbolic names for memory locations and other entities. The use of symbolic references is a key feature of assemblers, saving tedious calculations and manual address updates after program modifications. Most assemblers also include macro facilities for performing textual substitution, e.g., to generate common short sequences of instructions to run inline, instead of in a subroutine.
There are two types of assemblers based on how many passes through the source are needed to produce the executable program. One-pass assemblers go through the source code once and assume that all symbols will be defined before any instruction that references them. Two-pass assemblers (and multi-pass assemblers) create a table with all unresolved symbols in the first pass, then use the second pass to resolve these addresses. The advantage of one-pass assemblers is speed, which is not as important as it once was with advances in computer speed and capabilities. The advantage of the two-pass assembler is that symbols can be defined anywhere in the program source. As a result, the program can be defined in a more logical and meaningful way. This makes two-pass assembler programs easier to read and maintain.
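The two-pass scheme can be sketched in a few lines. This is a minimal illustration, assuming a made-up toy syntax (`[label:] MNEMONIC [operand]`) in which every line occupies exactly one word; the mnemonics `LOAD`, `SUB`, `JNZ`, `JMP`, `HALT` and `WORD` are hypothetical. Note the forward reference `JMP done`, which a one-pass assembler could not resolve when it first reads that line.

```python
# Toy source: each line is "[label:] MNEMONIC [operand]"; every line assembles to one word.
SOURCE = [
    "start:  LOAD  count",
    "loop:   SUB   one",
    "        JNZ   loop",
    "        JMP   done",   # forward reference: 'done' is defined later
    "done:   HALT",
    "count:  WORD  3",
    "one:    WORD  1",
]


def parse(line):
    """Split a source line into (label or None, mnemonic, operand or None)."""
    label = None
    if ":" in line:
        label, line = line.split(":", 1)
        label = label.strip()
    fields = line.split()
    return label, fields[0], (fields[1] if len(fields) > 1 else None)


def pass_one(lines):
    """Pass 1: build the symbol table, label -> address (one word per line)."""
    symbols = {}
    for address, line in enumerate(lines):
        label, _, _ = parse(line)
        if label:
            symbols[label] = address
    return symbols


def pass_two(lines, symbols):
    """Pass 2: replace symbolic operands with addresses; numbers pass through."""
    program = []
    for address, line in enumerate(lines):
        _, mnemonic, operand = parse(line)
        if operand is None:
            value = None
        elif operand in symbols:
            value = symbols[operand]
        else:
            value = int(operand)
        program.append((address, mnemonic, value))
    return program


symbols = pass_one(SOURCE)
for word in pass_two(SOURCE, symbols):
    print(word)
```

Because the symbol table is complete before pass 2 begins, `done` can be used before it is defined, which is exactly the freedom the text credits to two-pass assemblers.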
More sophisticated high-level assemblers provide language abstractions such as:
Advanced control structures
High-level procedure/function declarations and invocations
High-level abstract data types, including structures/records, unions, classes, and sets
Sophisticated macro processing
Object-oriented features such as encapsulation, polymorphism, inheritance, interfaces
Assembly language
A program written in assembly language consists of a series of instructions (mnemonics) that correspond to a stream of executable instructions which, when translated by an assembler, can be loaded into memory and executed.
For example, an x86/IA-32 processor can execute the following binary instruction as expressed in machine language (see x86 assembly language):
Binary: 10110000 01100001 (Hexadecimal: B0 61)
The equivalent assembly language representation is easier to remember (example in Intel syntax, more mnemonic):
MOV AL, 61h

This instruction means:
Move the value 61h (or 97 decimal; the h-suffix means hexadecimal) into the processor register named "AL".
The mnemonic "mov" represents the opcode 1011, which moves the value in the second operand into the register indicated by the first operand. The mnemonic was chosen by the instruction set designer to abbreviate "move", making it easier for the programmer to remember. A comma-separated list of arguments or parameters follows the opcode; this is a typical assembly language statement.
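How `MOV AL, 61h` becomes `B0 61` can be shown by assembling the two bytes by hand. This is a minimal sketch that covers only the 8-bit "move immediate to register" form (bit pattern `1011 w rrr` followed by the immediate byte, with `w = 0` for 8-bit registers) and assumes the legacy x86 register numbering without REX prefixes; it is not a general x86 assembler.

```python
# 8-bit register numbers used in the x86 "B0+r" encoding of MOV r8, imm8
# (legacy encoding without REX prefixes).
REG8 = {"AL": 0, "CL": 1, "DL": 2, "BL": 3, "AH": 4, "CH": 5, "DH": 6, "BH": 7}


def encode_mov_r8_imm8(reg: str, imm: int) -> bytes:
    """Encode MOV reg, imm for an 8-bit register: 1011 w rrr, then the immediate byte."""
    if not 0 <= imm <= 0xFF:
        raise ValueError("immediate must fit in one byte")
    # 1011 = move-immediate opcode, w = 0 (8-bit operand), rrr = register number
    opcode = (0b1011 << 4) | (0 << 3) | REG8[reg]
    return bytes([opcode, imm])


code = encode_mov_r8_imm8("AL", 0x61)      # MOV AL, 61h
print(" ".join(f"{b:08b}" for b in code))  # 10110000 01100001
print(code.hex(" ").upper())               # B0 61
```

The first byte shows why "mov" is said to stand for the opcode 1011: the remaining four bits of that byte only select the operand size and the destination register.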
In practice many programmers drop the word mnemonic and, technically incorrectly, call "mov" an opcode. When they do this they are referring to the underlying binary code which it represents. To put it another way, a mnemonic such as "mov" is not an opcode, but as it symbolizes an opcode, one might refer to "the opcode mov" when one intends to refer to the binary opcode it symbolizes rather than to the symbol (the mnemonic) itself. As few modern programmers need to know which binary patterns are the opcodes for specific instructions, the distinction has in practice become a bit blurred among programmers, but not among processor designers.
Transforming assembly into machine language is accomplished by an assembler, and the reverse by a disassembler. Unlike in high-level languages, there is usually a one-to-one correspondence between simple assembly statements and machine language instructions. However, in some cases, an assembler may provide pseudoinstructions which expand into several machine language instructions to provide commonly needed functionality. For example, for a machine that lacks a "branch if greater or equal" instruction, an assembler may provide a pseudoinstruction that expands to the machine's "set if less than" and "branch if zero (on the result of the set instruction)". Most full-featured assemblers also provide a rich macro language (discussed below) which is used by vendors and programmers to generate more complex code and data sequences.
Each computer architecture and processor architecture has its own machine language. On this level, each instruction is simple enough to be executed using a relatively small number of electronic circuits. Computers differ by the number and type of operations they support. For example, a new 64-bit machine would have different circuitry from a 32-bit machine. They may also have different sizes and numbers of registers, and different representations of data types in storage. While most general-purpose computers are able to carry out essentially the same functionality, the ways they do so differ; the corresponding assembly languages reflect these differences.
Multiple sets of mnemonics or assembly-language syntax may exist for a single instruction set, typically instantiated in different assembler programs. In these cases, the most popular one is usually that supplied by the manufacturer and used in its documentation.
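The "branch if greater or equal" pseudoinstruction described earlier can be illustrated as a tiny expansion pass. This is a minimal sketch that assumes MIPS-style names (`bge`, `slt`, `beq`, the assembler temporary `$at` and `$zero`) purely as an example machine; the text itself names no specific architecture, and a real assembler handles many more forms.

```python
def expand(line):
    """Expand the bge pseudoinstruction; pass real instructions through unchanged."""
    mnemonic, _, rest = line.strip().partition(" ")
    if mnemonic != "bge":
        return [line.strip()]
    rs, rt, label = (part.strip() for part in rest.split(","))
    return [
        f"slt $at, {rs}, {rt}",      # "set if less than": $at = 1 if rs < rt, else 0
        f"beq $at, $zero, {label}",  # "branch if zero": taken when not (rs < rt), i.e. rs >= rt
    ]


program = ["bge $t0, $t1, done", "addi $t0, $t0, 1"]
for src in program:
    for machine_instr in expand(src):
        print(machine_instr)
```

One source statement becomes two machine instructions, which is why pseudoinstructions break the usual one-to-one correspondence between assembly statements and machine instructions.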
Basic elements
Any assembly language consists of 3 types of instruction statements which are used to define the program operations:
Opcode mnemonics
Data sections
Assembly directives
Opcode mnemonics
Instructions (statements) in assembly language are generally very simple, unlike those in high-level languages. Generally, an opcode is a symbolic name for a single executable machine language instruction, and there is at least one opcode mnemonic defined for each machine language instruction. Each instruction typically consists of an operation or opcode plus zero or more operands. Most instructions refer to a single value, or a pair of values. Operands can be either immediate (typically one-byte values, coded in the instruction itself) or the addresses of data located elsewhere in storage. This is determined by the underlying processor architecture: the assembler merely reflects how this architecture works.
Data sections
There are instructions used to define data elements to hold data and variables. They define the type, length and alignment of the data. These instructions can also define whether the data is available to outside programs (programs assembled separately) or only to the program in which the data section is defined.
Assembly directives and pseudo-ops
Assembly directives are instructions that are executed by the assembler at assembly time, not by the CPU at run time. They can make the assembly of the program dependent on parameters input by the programmer, so that one program can be assembled in different ways, perhaps for different applications. They also can be used to manipulate presentation of the program to make it easier for the programmer to read and maintain. (For example, pseudo-ops would be used to reserve storage areas and optionally their initial contents.) The names of pseudo-ops often start with a dot to distinguish them from machine instructions.
Some assemblers also support pseudo-instructions, which generate two or more machine instructions.
Symbolic assemblers allow programmers to associate arbitrary names (labels or symbols) with memory locations. Usually, every constant and variable is given a name so instructions can reference those locations by name, thus promoting self-documenting code. In executable code, the name of each subroutine is associated with its entry point, so any calls to a subroutine can use its name. Inside subroutines, GOTO destinations are given labels. Some assemblers support local symbols which are lexically distinct from normal symbols (e.g., the use of "10$" as a GOTO destination).
Most assemblers provide flexible symbol management, allowing programmers to manage different namespaces, automatically calculate offsets within data structures, and assign labels that refer to literal values or the result of simple computations performed by the assembler. Labels can also be used to initialize constants and variables with relocatable addresses.
Assembly languages, like most other computer languages, allow comments to be added to assembly source code that are ignored by the assembler. Good use of comments is even more important with assembly code than with higher-level languages, as the meaning and purpose of a sequence of instructions is harder to decipher from the code itself.
Wise use of these facilities can greatly simplify the problems of coding and maintaining low-level code. Raw assembly source code as generated by compilers or disassemblers (code without any comments, meaningful symbols, or data definitions) is quite difficult to read when changes must be made.
Defining (Speed) Performance
Normally we are interested in reducing:
Response time (aka execution time): the time between the start and the completion of a task
  Important to individual users
  Thus, to maximize performance, we need to minimize execution time
Throughput: the total amount of work done in a given time
  Important to data center managers
Decreasing response time almost always improves throughput.
$$\text{performance}_X = \frac{1}{\text{execution\_time}_X}$$
If X is n times faster than Y, then
$$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n$$
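The two definitions above translate directly into code. This is a minimal sketch; the run times of 10 s and 15 s are hypothetical values chosen only to illustrate the ratio.

```python
def performance(execution_time_s: float) -> float:
    """performance_X = 1 / execution_time_X"""
    return 1.0 / execution_time_s


def times_faster(time_x_s: float, time_y_s: float) -> float:
    """n such that X is n times faster than Y: perf_X / perf_Y = time_Y / time_X."""
    return performance(time_x_s) / performance(time_y_s)


# Hypothetical run times of the same program on two machines.
n = times_faster(time_x_s=10.0, time_y_s=15.0)
print(f"X is {n:.2f} times faster than Y")
```

Since performance is the reciprocal of execution time, the ratio of performances is the inverse ratio of execution times, so X, which needs 10 s instead of 15 s, is 1.5 times faster.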
Performance Factors
$$\text{CPU execution time for a program} = \text{CPU clock cycles for a program} \times \text{clock cycle time}$$
Machine clock rate: the clock rate (MHz, GHz) is the inverse of the clock cycle time (clock period):
$$CC = \frac{1}{CR}$$
or, equivalently,
$$\text{CPU execution time for a program} = \frac{\text{CPU clock cycles for a program}}{\text{clock rate}}$$
We can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.
Recall: sequential systems need synchronizing clocks.
A computer is a sequential system and has a clock.
Each instruction takes up a few clock cycles to execute.
The Performance Equation is a term used in computer science. It refers to the calculation of the performance or speed of a central processing unit (CPU).
Basically, the Basic Performance Equation [BPE] is an equation with 3 parameters which are required for the calculation of the "basic performance" of a given system.
It is given by:
$$T = \frac{N \times S}{R}$$
where
'T' is the processor time [Program Execution Time] required to execute a given program written in some high-level language. The compiler generates a machine language object program corresponding to the source program.
'N' is the actual number of instruction executions required to complete program execution. It is not necessarily equal to the total number of machine language instructions in the object program: some instructions are executed more than once (loops) and some are not executed at all (conditions).
'S' is the average number of basic steps each instruction execution requires, where each basic step is completed in one clock cycle. We say average because each instruction contains a variable number of steps depending on the instruction.
'R' is the clock rate [in cycles per second].
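The Basic Performance Equation can be evaluated directly. This is a minimal sketch with hypothetical inputs (1,000,000 dynamic instruction executions, an average of 4 basic steps per instruction, and a 500 MHz clock); the function name is illustrative only.

```python
def program_execution_time(n_instr: int, steps_per_instr: float, clock_rate_hz: float) -> float:
    """Basic Performance Equation: T = (N * S) / R, in seconds."""
    return (n_instr * steps_per_instr) / clock_rate_hz


# Hypothetical workload: N = 1,000,000 instruction executions,
# S = 4 basic steps (clock cycles) per instruction on average, R = 500 MHz.
T = program_execution_time(n_instr=1_000_000, steps_per_instr=4, clock_rate_hz=500e6)
print(f"T = {T * 1e3:.1f} ms")
```

Here N x S = 4,000,000 cycles, and 4,000,000 / 500,000,000 = 0.008 s. Halving S, doubling R, or reducing N (through a better compiler or ISA) each reduces T proportionally.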
Review: Machine Clock Rate
Clock rate (MHz, GHz) is the inverse of the clock cycle time (clock period):
$$CC = \frac{1}{CR}$$
[Figure: one clock period]
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec clock cycle => 1 GHz clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
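The table above can be regenerated from CR = 1 / CC. This is a minimal sketch; it expresses the cycle times in whole picoseconds so that every rate in the table is computed exactly in floating point (a cycle of 1e-9 s written directly as a float can round to slightly less than 1 GHz).

```python
def clock_rate_hz(cycle_time_ps: int) -> float:
    """CR = 1 / CC, with CC given in picoseconds (1 s = 1e12 ps)."""
    return 1e12 / cycle_time_ps


def describe(rate_hz: float) -> str:
    """Format a rate in MHz below 1 GHz, otherwise in GHz."""
    if rate_hz < 1e9:
        return f"{rate_hz / 1e6:g} MHz"
    return f"{rate_hz / 1e9:g} GHz"


for cycle_ps in (10_000, 5_000, 2_000, 1_000, 500, 250, 200):
    print(f"{cycle_ps} ps clock cycle => {describe(clock_rate_hz(cycle_ps))} clock rate")
```

Each halving of the cycle time doubles the clock rate, which is why the rows of the table move in reciprocal steps.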
Clock Cycles per Instruction
Not all instructions take the same amount of time to execute (different instructions take different numbers of clock cycles); for example, MUL takes more cycles than Add.
Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
A way to compare two different implementations of the same ISA
$$\text{CPU clock cycles for a program} = \text{Instructions for a program} \times \text{Average clock cycles per instruction}$$
Instruction class:                 A    B    C
CPI for this instruction class:    1    2    3
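Comparing two implementations of the same ISA means the instruction count is fixed and only CPI and cycle time differ. This is a minimal sketch with hypothetical figures: implementation A runs at a 250 ps cycle with CPI 2.0, implementation B at a 500 ps cycle with CPI 1.2, both on the same 1,000,000-instruction program.

```python
def cpu_clock_cycles(instruction_count: int, cpi: float) -> float:
    """CPU clock cycles = instructions for a program x average CPI."""
    return instruction_count * cpi


def cpu_time_s(instruction_count: int, cpi: float, cycle_time_s: float) -> float:
    """CPU time = CPU clock cycles x clock cycle time."""
    return cpu_clock_cycles(instruction_count, cpi) * cycle_time_s


IC = 1_000_000  # hypothetical: same program and same ISA, so the same instruction count
time_a = cpu_time_s(IC, cpi=2.0, cycle_time_s=250e-12)  # hypothetical implementation A
time_b = cpu_time_s(IC, cpi=1.2, cycle_time_s=500e-12)  # hypothetical implementation B
print(f"A: {time_a * 1e6:.1f} us, B: {time_b * 1e6:.1f} us, A is {time_b / time_a:.2f}x faster")
```

With these assumed numbers B has the lower CPI, yet A finishes first (500 us versus 600 us) because its clock cycle is half as long. Neither CPI nor clock rate alone decides the comparison.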
Performance Equation
Our basic performance equation is then
$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$
or
$$\text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}}$$
These equations separate the three key factors that affect performance.
Instruction Count: depends on the kinds of instructions supported by the architecture. For example, a multiply operation in the C language could be represented as a sequence of Adds in assembly code, but the number of instructions would be quite large. Having a dedicated Mul instruction reduces the total number of instructions in the program.
CPI: depends on how complicated the instructions are. More complex instructions need more clocks to execute. For example, the Mul instruction in MIPS takes more clocks than the Add instruction in MIPS. Hence, if the compiler and assembler choose more complex instructions, they will increase the CPI but may reduce the total number of instructions.
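The instruction-count versus CPI trade-off can be quantified for a single multiplication. This is a minimal sketch under loudly hypothetical assumptions: `add` costs 1 cycle and `mul` costs 4 cycles (not figures for MIPS or any real processor), and `x * 10` is written either as ten fully unrolled additions into an accumulator or as one multiply.

```python
# Hypothetical per-instruction cycle counts (not taken from any real processor).
CYCLES = {"add": 1, "mul": 4}


def cost(instructions):
    """Return (instruction count, total cycles, CPI) for a straight-line sequence."""
    count = len(instructions)
    cycles = sum(CYCLES[op] for op in instructions)
    return count, cycles, cycles / count


y = 10                   # compute x * 10
as_adds = ["add"] * y    # accumulator starts at 0; add x ten times (fully unrolled)
as_mul = ["mul"]         # one dedicated multiply instruction

for name, seq in [("adds", as_adds), ("mul", as_mul)]:
    count, cycles, cpi = cost(seq)
    print(f"{name:5s} IC = {count:2d}, cycles = {cycles:2d}, CPI = {cpi:.1f}")
```

With these assumed costs the `mul` version has the higher CPI (4.0 versus 1.0) but needs far fewer cycles in total (4 versus 10), which is the trade-off described above.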
Computing the Effective CPI
Given a specific computer architecture (MIPS, for instance), each instruction i can be associated with the number of clocks that it needs, C_i.
Given a C/Java program, the compiler and assembler decide which instructions to choose from the available instruction set; this affects both the number of instructions and the CPI. Let us suppose that they end up choosing IC_i instructions of type i. Then the effective CPI becomes (here n is the total number of instructions, n = Σ_i IC_i):
$$\text{CPI}_{\text{effective}} = \frac{1}{n}\sum_{i} C_i \times IC_i$$
Hence the effective CPI depends on:
• The kind of instructions (instruction set) supported by the architecture
• The choice of instructions from this instruction set by the compiler and assembler
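The effective-CPI formula can be applied to two instruction mixes a compiler might emit for the same source code. This is a minimal sketch: it reuses the class CPIs A = 1, B = 2, C = 3 from the instruction-class table in this unit, while the two instruction mixes themselves are hypothetical.

```python
# CPI per instruction class, as in the A/B/C instruction-class table in this unit.
C = {"A": 1, "B": 2, "C": 3}


def effective_cpi(ic):
    """CPI_eff = (sum over i of C_i * IC_i) / n, with n = total instruction count."""
    n = sum(ic.values())
    cycles = sum(C[cls] * count for cls, count in ic.items())
    return cycles, n, cycles / n


# Two hypothetical instruction mixes a compiler might emit for the same source code.
seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}

for name, ic in [("sequence 1", seq1), ("sequence 2", seq2)]:
    cycles, n, cpi = effective_cpi(ic)
    print(f"{name}: {n} instructions, {cycles} cycles, CPI = {cpi:.2f}")
```

Sequence 2 executes more instructions (6 versus 5) yet needs fewer cycles (9 versus 10), because its mix leans on the cheaper class A. This is why the compiler's choice of instructions matters as much as the instruction set itself.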
Determinants of CPU Performance
CPU time = Instruction_count × CPI × clock_cycle
                          Instruction_count   CPI   clock_cycle
Algorithm                         X            X
Programming language              X            X
Compiler                          X            X
ISA                               X            X         X
Processor organization                         X         X
Technology                                               X
A Simple Example
Op       Freq   CPI_i   Freq × CPI_i
ALU      50%    1       0.5
Load     20%    5       1.0
Store    10%    3       0.3
Branch   20%    2       0.4
                        Σ = 2.2
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
Freq × CPI_i under each change:
Op       Baseline   Load = 2 cycles   Branch = 1 cycle   Two ALU ops at once
ALU      0.5        0.5               0.5                0.25
Load     1.0        0.4               1.0                1.0
Store    0.3        0.3               0.3                0.3
Branch   0.4        0.4               0.2                0.4
CPI      2.2        1.6               2.0                1.95
,ow much faster would the machine e if a etter architecture reduced the
a"erage load time to 7 cyclesQ
CPU time new R 1.6 4 IC 4 CC so 7.7:1.6 means IM.ST faster
,ow does this compare with using ranch prediction to sha"e a cycle off the ranch
timeQ
CPU time new R 7.8 4 IC 4 CC so 7.7:7.8 means 18T faster
Dhat if two !.U instructions could e e4ecuted at onceQ
CPU time new R 1.LS 4 IC 4 CC so 7.7:1.LS means 17.5T faster
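The three "what if" questions can be recomputed mechanically from the instruction mix of the simple example (ALU 50% at 1 cycle, Load 20% at 5, Store 10% at 3, Branch 20% at 2). A minimal sketch; modelling "two ALU instructions at once" as an effective ALU CPI_i of 0.5 is the same simplifying assumption the worked answer makes, and the scenario names are invented here.

```python
# Instruction mix from the simple example: class -> (frequency, CPI_i)
BASE_MIX = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}


def average_cpi(mix):
    """CPI = sum over classes of Freq x CPI_i."""
    return sum(freq * cpi for freq, cpi in mix.values())


def speedup(old_mix, new_mix):
    """IC and clock cycle are unchanged, so the CPU-time ratio equals the CPI ratio."""
    return average_cpi(old_mix) / average_cpi(new_mix)


SCENARIOS = {
    "loads take 2 cycles": dict(BASE_MIX, Load=(0.2, 2)),
    "branches take 1 cycle": dict(BASE_MIX, Branch=(0.2, 1)),
    "two ALU ops at once": dict(BASE_MIX, ALU=(0.5, 0.5)),  # effective CPI_i of 0.5
}

for name, mix in SCENARIOS.items():
    faster = 100 * (speedup(BASE_MIX, mix) - 1)
    print(f"{name}: CPI {average_cpi(mix):.2f}, {faster:.1f}% faster")
```

Because only the CPI changes between scenarios, the speedup is simply the ratio of the two CPIs, which is why IC and CC cancel in the answers above.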
Types of Addressing Modes
Each instruction of a computer specifies an operation on certain data. There are various ways of specifying the address of the data to be operated on. These different ways of specifying data are called the addressing modes. The most common addressing modes are:
Immediate addressing mode
Direct addressing mode
Indirect addressing mode
Register addressing mode
Register indirect addressing mode
Displacement addressing mode
Stack addressing mode
To specify the addressing mode of an instruction, several methods are used. The most often used are:
a) Different operands will use different addressing modes.
b) One or more bits in the instruction format can be used as a mode field. The value of the mode field determines which addressing mode is to be used.
The effective address will be either a main memory address or a register.
Immediate Addressing:
This is the simplest form of addressing. Here, the operand is given in the instruction itself. This mode is used to define a constant or set initial values of variables. The advantage of this mode is that no memory reference other than the instruction fetch is required to obtain the operand. The disadvantage is that the size of the number is limited to the size of the address field, which in most instruction sets is small compared to the word length.
[Figure: INSTRUCTION | OPERAND (the operand is a field of the instruction)]
Direct Addressing:
In direct addressing mode, the effective address of the operand is given in the address field of the instruction. It requires one memory reference to read the operand from the given location and provides only a limited address space. The length of the address field is usually less than the word length.
Ex: Move P, R0; Add Q, R0 (P and Q are the addresses of the operands).
Indirect Addressing:
In indirect addressing mode, the address field of the instruction refers to the address of a word in memory, which in turn contains the full-length address of the operand. The advantage of this mode is that for a word length of N, an address space of 2^N can be addressed. The disadvantage is that instruction execution requires two memory references to fetch the operand. Multilevel or cascaded indirect addressing can also be used.
Register Addressing:
Register addressing mode is similar to direct addressing. The only difference is that the address field of the instruction refers to a register rather than a memory location. 3 or 4 bits are used as the address field to reference 8 to 16 general purpose registers. The advantage of register addressing is that only a small address field is needed in the instruction.
Register Indirect Addressing:
This mode is similar to indirect addressing. The address field of the instruction refers to a register. The register contains the effective address of the operand. This mode uses one memory reference to obtain the operand. The address space is limited to the width of the registers available to store the effective address.
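The five modes described so far differ mainly in how many steps it takes to reach the operand. The sketch below models this with a toy memory and register file (all addresses, register names and values are hypothetical) and returns the operand together with the number of data-memory references needed after the instruction itself has been fetched.

```python
# Toy machine state; the values are hypothetical and chosen only for illustration.
memory = {100: 500, 200: 42, 500: 7}
registers = {"R1": 200, "R2": 13}


def fetch_operand(mode, field):
    """Return (operand, memory_references) for one operand.

    `field` is the address field of the instruction; the reference count
    excludes the instruction fetch itself.
    """
    if mode == "immediate":          # the field is the operand
        return field, 0
    if mode == "direct":             # the field is the operand's address
        return memory[field], 1
    if mode == "indirect":           # the field is the address of the address
        return memory[memory[field]], 2
    if mode == "register":           # the field names a register holding the operand
        return registers[field], 0
    if mode == "register_indirect":  # the register holds the operand's address
        return memory[registers[field]], 1
    raise ValueError(f"unknown addressing mode {mode!r}")


for mode, field in [("immediate", 100), ("direct", 100), ("indirect", 100),
                    ("register", "R2"), ("register_indirect", "R1")]:
    operand, refs = fetch_operand(mode, field)
    print(f"{mode}: operand {operand}, {refs} memory reference(s)")
```

The same field value (100) yields three different operands under immediate, direct and indirect addressing, which is the whole point of the mode field.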
Displacement Addressing:
In displacement addressing mode there are 3 types of addressing mode. They are:
1) Relative addressing
2) Base register addressing
3) Indexing addressing.
This is a combination of direct addressing and register indirect addressing. The value contained in one address field, A, is used directly, and the other address field refers to a register whose contents are added to A to produce the effective address.
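All three displacement variants share one formula, EA = A + (register contents); they differ only in which register supplies the second term. In this sketch the register names PC, BASE and INDEX are hypothetical labels, and the PC value is taken as given (whether it already points past the current instruction is machine-dependent).

```python
def effective_address(kind, a, registers):
    """EA = A (from the instruction) + contents of the selected register."""
    reg = {"relative": "PC", "base": "BASE", "indexed": "INDEX"}[kind]
    return a + registers[reg]


regs = {"PC": 1000, "BASE": 4000, "INDEX": 3}
print(effective_address("relative", 20, regs))    # 1020: target near the current PC
print(effective_address("base", 8, regs))         # 4008: offset 8 in a block starting at 4000
print(effective_address("indexed", 2000, regs))   # 2003: element 3 of an array at 2000
```

In relative and base-register addressing A is a small offset added to a large register value; in indexing the roles flip, with A holding the array's start address and the register holding the index.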
Stack Addressing:
A stack is a linear array of locations referred to as a last-in first-out queue. The stack is a reserved block of locations, appended or deleted only at the top of the stack. The stack pointer is a register which stores the address of the top-of-stack location. This mode of addressing is also known as implicit addressing.
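A minimal sketch of stack addressing, assuming (as many machines do, though the text does not specify it) that the stack grows toward lower addresses; the base address 1000 is arbitrary. The ADD instruction names no operands at all, because both are implied by the stack pointer.

```python
class StackMachine:
    """A reserved block of memory used LIFO; sp holds the top-of-stack address."""

    def __init__(self, base=1000):
        self.memory = {}
        self.sp = base            # empty stack: the first push goes to base - 1

    def push(self, value):
        self.sp -= 1              # grow downward, then write at the new top
        self.memory[self.sp] = value

    def pop(self):
        value = self.memory[self.sp]
        self.sp += 1
        return value

    def add(self):
        """Zero-address ADD: both operands are implicit (the top two entries)."""
        self.push(self.pop() + self.pop())


m = StackMachine()
m.push(2)
m.push(3)
m.add()
print(m.pop())   # 5
```

Only push and pop touch the stack pointer, so every location access stays at the top of the stack, as the definition above requires.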
The Instruction Set Architecture
• superscalar processor -- can execute more than one instruction per cycle.
• cycle -- smallest unit of time in a processor.
• parallelism -- the ability to do more than one thing at once.
• pipelining -- overlapping parts of a large task to increase throughput without decreasing latency.
Crafting an ISA
• We'll look at some of the decisions facing an instruction set architect, and
• how those decisions were made in the design of the MIPS instruction set.
• MIPS, like SPARC, PowerPC, and Alpha AXP, is a RISC (Reduced Instruction Set Computer) ISA.
– fixed instruction length
– few instruction formats
– load/store architecture
• RISC architectures worked because they enabled pipelining. They continue to thrive because they enable parallelism.
Instruction Length
• Variable-length instructions (Intel 80x86, VAX) require multi-step fetch and decode, but allow for a much more flexible and compact instruction set.
• Fixed-length instructions allow easy fetch and decode, and simplify pipelining and parallelism.
All MIPS instructions are 32 bits long.
– this decision impacts every other ISA decision we make because it makes instruction bits scarce.
Accessing the Operands
• operands are generally in one of two places:
– registers (32 int, 32 fp)
– memory (2^32 locations)
• registers are
– easy to specify
– close to the processor (fast access)
• the idea that we want to access registers whenever possible led to load-store architectures.
– normal arithmetic instructions only access registers
– only access memory with explicit loads and stores.
Load-store architectures
can do:
add r1 = r2 + r3
and
load r3, M(address)
This forces heavy dependence on registers, which is exactly what you want in today's CPUs.
can't do:
add r1 = r2 + M(address)
– more instructions
+ fast implementation (e.g., easy pipelining)
How Many Operands?
• Most instructions have three operands (e.g., z = x + y).
• Well-known ISAs specify 0-3 (explicit) operands per instruction.
• Operands can be specified implicitly or explicitly.
Basic ISA Classes
Accumulator:
1 address   add A          acc ← acc + mem[A]
Stack:
0 address   add            tos ← tos + next
General Purpose Register:
2 address   add A B        EA(A) ← EA(A) + EA(B)
3 address   add A B C      EA(A) ← EA(B) + EA(C)
Load/Store:
3 address   add Ra Rb Rc   Ra ← Rb + Rc
            load Ra Rb     Ra ← mem[Rb]
            store Ra Rb    mem[Rb] ← Ra
Four principles of IS architecture
– simplicity favors regularity
– smaller is faster
– good design demands compromise
– make the common case fast
Instruction Set Architecture (ISA)
The Instruction Set Architecture (ISA) is the part of the processor that is visible to the programmer or compiler writer. The ISA serves as the boundary between software and hardware. We will briefly describe the instruction sets found in many of the microprocessors used today. The ISA of a processor can be described using 5 categories:
Operand Storage in the CPU
Where are the operands kept other than in memory?
Number of explicit named operands
How many operands are named in a typical instruction?
Operand location
Can any ALU instruction operand be located in memory? Or must all operands be kept internally in the CPU?
Operations
What operations are provided in the ISA?
Type and size of operands
What is the type and size of each operand and how is it specified?
Of all the above, the most distinguishing factor is the first.
The 3 most common types of ISAs are:
1. Stack - The operands are implicitly on top of the stack.
2. Accumulator - One operand is implicitly the accumulator.
3. General Purpose Register (GPR) - All operands are explicitly mentioned; they are either registers or memory locations.
Let's look at the assembly code of
C = A + B;
in all 3 architectures:
Stack      Accumulator   GPR
PUSH A     LOAD A        LOAD R1,A
PUSH B     ADD B         ADD R1,B
ADD        STORE C       STORE R1,C
POP C      -             -
Not all processors can be neatly tagged into one of the above categories. The i8086 has many instructions that use implicit operands although it has a general register set. The i8051 is another example: it has 4 banks of GPRs, but most instructions must have the A register as one of their operands.
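The three code sequences for C = A + B (stack: PUSH A, PUSH B, ADD, POP C; accumulator: LOAD A, ADD B, STORE C; GPR: LOAD R1,A, ADD R1,B, STORE R1,C) can be traced with one tiny interpreter per model. This is a hypothetical sketch; the tuple encoding of instructions is invented here. For this single statement all three leave 5 in C and each makes three data-memory accesses, while the stack version needs one more instruction.

```python
def run_stack(program, mem):
    """Zero-address machine: ADD's operands are implicit on the stack."""
    stack, accesses = [], 0
    for op, *args in program:
        if op == "PUSH":
            stack.append(mem[args[0]])
            accesses += 1
        elif op == "ADD":                       # pop two, push the sum
            stack.append(stack.pop() + stack.pop())
        elif op == "POP":
            mem[args[0]] = stack.pop()
            accesses += 1
    return accesses


def run_accumulator(program, mem):
    """One-address machine: the accumulator is the implicit operand."""
    acc, accesses = 0, 0
    for op, addr in program:
        if op == "LOAD":
            acc = mem[addr]
        elif op == "ADD":                       # acc <- acc + mem[addr]
            acc += mem[addr]
        elif op == "STORE":
            mem[addr] = acc
        accesses += 1                           # each instruction names one memory word
    return accesses


def run_gpr(program, mem):
    """Register-memory GPR machine: register and memory operands are both named."""
    regs, accesses = {}, 0
    for op, reg, addr in program:
        if op == "LOAD":
            regs[reg] = mem[addr]
        elif op == "ADD":                       # reg <- reg + mem[addr]
            regs[reg] += mem[addr]
        elif op == "STORE":
            mem[addr] = regs[reg]
        accesses += 1
    return accesses


PROGRAMS = {
    "stack": (run_stack, [("PUSH", "A"), ("PUSH", "B"), ("ADD",), ("POP", "C")]),
    "accumulator": (run_accumulator, [("LOAD", "A"), ("ADD", "B"), ("STORE", "C")]),
    "GPR": (run_gpr, [("LOAD", "R1", "A"), ("ADD", "R1", "B"), ("STORE", "R1", "C")]),
}

for name, (run, program) in PROGRAMS.items():
    mem = {"A": 2, "B": 3, "C": 0}              # hypothetical values of A and B
    accesses = run(program, mem)
    print(f"{name}: C = {mem['C']}, {len(program)} instructions, {accesses} data accesses")
```

The differences discussed next (stack bottlenecks, accumulator memory traffic) only become visible in longer programs where intermediate values must be kept and reused.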
What are the advantages and disadvantages of each of these approaches?
Stack
Advantages: Simple model of expression evaluation (reverse Polish). Short instructions.
Disadvantages: A stack can't be randomly accessed. This makes it hard to generate efficient code. The stack itself is accessed in every operation and becomes a bottleneck.
Accumulator
Advantages: Short instructions.
Disadvantages: The accumulator is only temporary storage, so memory traffic is the highest for this approach.
GPR
Advantages: Makes code generation easy. Data can be stored for long periods in registers.
Disadvantages: All operands must be named, leading to longer instructions.
Earlier CPUs were of the first 2 types, but in the last 15 years all CPUs made are GPR processors. The 2 major reasons are that registers are faster than memory (the more data that can be kept internally in the CPU, the faster the program will run) and that registers are easier for a compiler to use.
Reduced Instruction Set Computer (RISC)
As we mentioned before, most modern CPUs are of the GPR (General Purpose Register) type. A few examples of such CPUs are the IBM 360, DEC VAX, Intel 80x86 and Motorola 68xxx. But while these CPUs were clearly better than previous stack and accumulator based CPUs, they were still lacking in several areas:
1. Instructions were of varying length, from 1 byte to 6-8 bytes. This causes problems with the pre-fetching and pipelining of instructions.
2. ALU (Arithmetic Logical Unit) instructions could have operands that were memory locations. Because the number of cycles it takes to access memory varies, so does the whole instruction. This isn't good for compiler writers, pipelining and multiple issue.
3. Most ALU instructions had only 2 operands, where one of the operands is also the destination. This means this operand is destroyed during the operation or it must be saved somewhere beforehand.
Thus in the early 80's the idea of RISC was introduced. The SPARC project was started at Berkeley and the MIPS project at Stanford. RISC stands for Reduced Instruction Set Computer. The ISA is composed of instructions that all have exactly the same size, usually 32 bits. Thus they can be pre-fetched and pipelined successfully. All ALU instructions have 3 operands, which are only registers. The only memory access is through explicit LOAD/STORE instructions.
Thus C = A + B will be assembled as:
LOAD R1,A
LOAD R2,B
ADD R3,R1,R2
STORE C,R3
Although it takes 4 instructions, we can reuse the values in the registers.
Why is this architecture called RISC?
What is reduced about it?
The answer is that to make all instructions the same length, the number of bits that are used for the opcode is reduced. Thus fewer instructions are provided. The instructions that were thrown out are the less important string and BCD (binary-coded decimal) operations. In fact, now that memory access is restricted, there aren't several kinds of MOV or ADD instructions. Thus the older architecture is called CISC (Complex Instruction Set Computer). RISC architectures are also called LOAD/STORE architectures.
The number of registers in RISC is usually 32 or more. One of the first RISC CPUs, the MIPS R2000, has 32 GPRs, as opposed to 16 in the 68xxx architecture and 8 in the 80x86 architecture. The only disadvantage of RISC is its code size. Usually more instructions are needed, and simple instructions such as POP and PUSH, which could be encoded in fewer bits, still occupy a full-length instruction word.
So why are there still CISC CPUs being developed?
Why is Intel spending time and money to manufacture the Pentium II and the Pentium III?
The answer is simple: backward compatibility. The IBM-compatible PC is the most common computer in the world. Intel wanted a CPU that would run all the applications that are in the hands of more than 100 million users. On the other hand, Motorola, which builds the 68xxx series that was used in the Macintosh, made the transition and, together with IBM and Apple, built the PowerPC (PPC), a RISC CPU which is installed in the new Power Macs. As of now, Intel and the PC manufacturers are making more money, but with Microsoft playing in the RISC field as well (Windows NT runs on Compaq's Alpha) and with the promise of Java, the future of CISC isn't clear at all.
An important lesson that can be learnt here is that superior technology is a factor in the computer industry, but so are marketing and price (if not more so).
The CISC Approach
The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible. This is achieved by building processor hardware that is capable of understanding and executing a series of operations. For this particular task (multiplying the two values held in main memory at locations 2:3 and 5:2), a CISC processor would come prepared with a specific instruction (we'll call it "MULT"). When executed, this instruction loads the two values into separate registers, multiplies the operands in the execution unit, and then stores the product in the appropriate register. Thus, the entire task of multiplying two numbers can be completed with one instruction:
MULT 2:3, 5:2
MULT is what is known as a "complex instruction." It operates directly on the computer's memory banks and does not require the programmer to explicitly call any loading or storing functions. It closely resembles a command in a higher level language. For instance, if we let "a" represent the value of 2:3 and "b" represent the value of 5:2, then this command is identical to the C statement "a = a * b."
One of the primary advantages of this system is that the compiler has to do very little work to translate a high-level language statement into assembly. Because the length of the code is relatively short, very little RAM is required to store instructions. The emphasis is put on building complex instructions directly into the hardware.
The RISC Approach
RISC processors only use simple instructions that can be executed within one clock cycle. Thus, the "MULT" command described above could be divided into three separate commands: "LOAD," which moves data from the memory bank to a register, "PROD," which finds the product of two operands located within the registers, and "STORE," which moves data from a register to the memory banks. In order to perform the exact series of steps described in the CISC approach, a programmer would need to code four lines of assembly:
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
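Both approaches leave the same value in memory; the difference is how much of the work is exposed as separate instructions. A minimal Python model, assuming memory is a dictionary keyed by (row, column) and registers are named A and B as in the assembly; the stored values 6 and 7 are hypothetical.

```python
# Memory cells addressed as (row, column), matching the 2:3 / 5:2 notation.
INITIAL = {(2, 3): 6, (5, 2): 7}


def cisc_mult(mem, dst, src):
    """MULT dst, src: one complex instruction that works directly on memory."""
    mem[dst] = mem[dst] * mem[src]


def risc_mult(mem):
    """The four-instruction RISC sequence; returns the register file afterwards."""
    regs = {}
    regs["A"] = mem[(2, 3)]              # LOAD A, 2:3
    regs["B"] = mem[(5, 2)]              # LOAD B, 5:2
    regs["A"] = regs["A"] * regs["B"]    # PROD A, B
    mem[(2, 3)] = regs["A"]              # STORE 2:3, A
    return regs


cisc_mem, risc_mem = dict(INITIAL), dict(INITIAL)
cisc_mult(cisc_mem, (2, 3), (5, 2))
regs = risc_mult(risc_mem)
print(cisc_mem == risc_mem, cisc_mem[(2, 3)])   # True 42
print(regs)                                     # {'A': 42, 'B': 7}
```

After the RISC sequence the value from 5:2 is still in register B, so a later computation could use it without another LOAD; that register reuse is the advantage discussed in the following paragraphs.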
At first, this may seem like a much less efficient way of completing the operation. Because there are more lines of code, more RAM is needed to store the assembly level instructions. The compiler must also perform more work to convert a high-level language statement into code of this form.
However, the RISC strategy also brings some very important advantages. Because each instruction requires only one clock cycle to execute, the entire program will execute in approximately the same amount of time as the multi-cycle "MULT" command. These RISC "reduced instructions" require fewer transistors of hardware space than the complex instructions, leaving more room for general purpose registers. Because all of the instructions execute in a uniform amount of time (i.e. one clock), pipelining is possible.
Separating the "LOAD" and "STORE" instructions actually reduces the amount of work that the computer must perform. After a CISC-style "MULT" command is executed, the processor automatically erases the registers. If one of the operands needs to be used for another computation, the processor must re-load the data from the memory bank into a register. In RISC, the operand will remain in the register until another value is loaded in its place.
CISC | RISC
Emphasis on hardware | Emphasis on software
Includes multi-clock complex instructions | Single-clock, reduced instructions only
Memory-to-memory: "LOAD" and "STORE" incorporated in instructions | Register to register: "LOAD" and "STORE" are independent instructions
Small code sizes, high cycles per instruction | Low cycles per instruction, large code sizes
Transistors used for storing complex instructions | Spends more transistors on memory registers
The Performance Equation
The following equation is commonly used for expressing a computer's performance ability:
$$\frac{\text{time}}{\text{program}} = \frac{\text{time}}{\text{cycle}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{instructions}}{\text{program}}$$
The CISC approach attempts to minimize the number of instructions per program, sacrificing the number of cycles per instruction. RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per program.
Multiplication
More complicated than addition
• Accomplished via shifting and addition
More time and more area
Let's look at 3 versions based on the grade school algorithm:
    01010010 (multiplicand)
  x 01101101 (multiplier)
Negative numbers: convert and multiply
Use other, better techniques like Booth's encoding
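The grade-school idea behind all three versions can be sketched in a few lines: scan the multiplier one bit at a time and, for every 1 bit, add the multiplicand shifted into position. This is a minimal sketch of the basic idea for unsigned operands only, not of any one of the three hardware versions specifically.

```python
def shift_add_multiply(multiplicand, multiplier, bits=8):
    """Unsigned grade-school multiplication by repeated shift and add."""
    product = 0
    for i in range(bits):
        if (multiplier >> i) & 1:          # current multiplier bit is 1
            product += multiplicand << i   # add the multiplicand shifted into place
    return product


print(shift_add_multiply(0b01010010, 0b01101101))   # 82 x 109 = 8938
```

Each of the `bits` steps does at most one addition, which is why multiplication costs more time (or, if the steps are unrolled, more area) than a single addition.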
Signed Multiplication
The easiest way to deal with signed numbers is to first convert the multiplier and multiplicand to positive numbers and then remember the original sign. It turns out that the last algorithm will work with signed numbers provided that when we do the shifting steps we extend the sign of the product.
Speeding up multiplication (Booth's Algorithm)
The way we have done multiplication so far consisted of repeatedly scanning the multiplier, adding the multiplicand (or zeros) and shifting the accumulated result.
Observation:
If we could reduce the number of times we have to add the multiplicand, that would make the whole process faster.
Let's say we want to do:
b × a, where a = 7_ten = 0111_two
With the algorithm used so far we successively:
add b, add b, add b, and add 0
Booth's Algorithm
Observation: If besides addition we also use subtraction, we can reduce the number of consecutive additions and therefore make the multiplication faster.
This requires us to "recode" the multiplier in such a way that the number of consecutive 1s in the multiplier (indeed the number of consecutive additions we would have done) is reduced. For a = 0111_two, for example, b × 0111 = b × (1000 - 0001), so one subtraction of b and one addition of 8b replace three additions of b.
The key to Booth's algorithm is to scan the multiplier and classify groups of bits into the beginning, the middle and the end of a run of 1s.
Using Booth's encoding for multiplication
If the initial content of A is a_{n-1}...a_0, then at the i-th multiply step the low-order bit of register A is a_i, and step (i) in the multiplication algorithm becomes:
1. If a_i = 0 and a_{i-1} = 0, then add 0 to P
2. If a_i = 0 and a_{i-1} = 1, then add B to P
3. If a_i = 1 and a_{i-1} = 0, then subtract B from P
4. If a_i = 1 and a_{i-1} = 1, then add 0 to P
(For the first step, when i = 0, take a_{i-1} to be 0.)
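The four Booth rules can be checked in software. In this minimal sketch B is the multiplicand and A an n-bit two's-complement multiplier; instead of shifting the product register P right after each step as the hardware does, the sketch weights B by 2**i, which gives the same sum. The function name is hypothetical.

```python
def booth_multiply(b, a, n=8):
    """Multiply B by an n-bit two's-complement multiplier A using Booth's rules.

    Step i inspects a_i and a_(i-1), with a_(-1) = 0:
      a_i=0, a_(i-1)=0 -> add 0       a_i=0, a_(i-1)=1 -> add B
      a_i=1, a_(i-1)=0 -> subtract B  a_i=1, a_(i-1)=1 -> add 0
    """
    a &= (1 << n) - 1               # treat A as an n-bit two's-complement pattern
    p, prev = 0, 0                  # product P and a_(i-1)
    for i in range(n):
        cur = (a >> i) & 1          # a_i, the current low-order bit
        if cur == 0 and prev == 1:      # end of a run of 1s
            p += b << i
        elif cur == 1 and prev == 0:    # beginning of a run of 1s
            p -= b << i
        prev = cur
    return p


print(booth_multiply(5, 0b0111, n=4))   # 35: subtract at i=0, add at i=3
print(booth_multiply(5, -3, n=4))       # -15
```

For a = 0111 only two of the four steps do any arithmetic, matching the b × (1000 - 0001) recoding above, and negative multipliers need no separate sign handling.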
Division
Even more complicated; can be accomplished via shifting and addition/subtraction
More time and more area; we will look at 3 versions based on the grade school algorithm
0011 ) 0010 0010 (Dividend)
Negative numbers: Even more difficult. There are better techniques; we won't look at them
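A minimal sketch of the grade-school (restoring) method for unsigned operands: shift the next dividend bit into the remainder, try subtracting the divisor, and undo the subtraction if the remainder goes negative. The function name and the 8-bit width are assumptions for illustration.

```python
def restoring_divide(dividend, divisor, bits=8):
    """Unsigned restoring division of a `bits`-wide dividend; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    remainder, quotient = 0, 0
    for i in reversed(range(bits)):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # bring down the next bit
        remainder -= divisor                                  # trial subtraction
        if remainder < 0:
            remainder += divisor                              # restore; quotient bit is 0
            quotient <<= 1
        else:
            quotient = (quotient << 1) | 1                    # quotient bit is 1
    return quotient, remainder


print(restoring_divide(0b00100010, 0b0011))   # (11, 1): 34 = 3 x 11 + 1
```

Each quotient bit costs a subtraction and possibly a restoring addition, which is why division is slower than multiplication built from the same adder.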
Floating point numbers (a brief look)
We need a way to represent
• Numbers with fractions, e.g., 3.1416
• Very small numbers, e.g., 0.000000001
• Very large numbers, e.g., 3.15576 × 10^9
Representation
• Sign, exponent, significand: (-1)^sign × significand × 2^exponent
• More bits for significand gives more accuracy
• More bits for exponent increases range
IEEE 754 floating point standard
• Single precision: 8 bit exponent, 23 bit significand
• Double precision: 11 bit exponent, 52 bit significand
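The single-precision layout (1 sign bit, 8 exponent bits, 23 significand bits) can be inspected with Python's standard `struct` module. A minimal sketch: the rebuild function handles normalized numbers only, using the IEEE 754 single-precision exponent bias of 127 and the hidden leading 1; zeros, denormals, infinities and NaNs are not covered.

```python
import struct


def float32_fields(x):
    """Split the IEEE 754 single-precision encoding of x into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF      # 8-bit biased exponent
    fraction = bits & 0x7FFFFF          # 23-bit significand field (hidden 1 not stored)
    return sign, exponent, fraction


def float32_value(sign, exponent, fraction):
    """Rebuild a normalized number: (-1)^sign x 1.fraction x 2^(exponent - 127)."""
    significand = 1 + fraction / 2**23
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)


s, e, f = float32_fields(-6.5)
print(s, e, f)                  # 1 129 5242880
print(float32_value(s, e, f))   # -6.5
```

-6.5 is -1.625 × 2^2, so the stored exponent is 2 + 127 = 129 and the fraction field holds 0.625 × 2^23.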
Floating point complexities
Operations are somewhat more complicated (see text). In addition to overflow we can have "underflow". Accuracy can be a big problem.
• IEEE 754 keeps two extra bits, guard and round
• Four rounding modes
• Positive divided by zero yields "infinity"
• Zero divided by zero yields "not a number"
• Other complexities
Floating point add/subtract
To add/subtract two numbers
We first compare the two exponents
• Select the higher of the two as the exponent of the result
• Select the significand part of the lower-exponent number and shift it right by an amount equal to the difference of the two exponents
• Remember to keep the extra bits shifted out (the guard and round bits)
• Add/subtract the significands as required according to the operation and the signs of the operands
• Normalize the significand of the result, adjusting the exponent
• Round the result (add one to the least significant bit to be retained if the first bit being thrown away is a 1)
• Re-normalize the result
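The add steps can be traced on a deliberately tiny format. This is a simplified sketch under stated assumptions: positive operands only, a hypothetical 8-bit significand (leading 1 included) stored as an integer next to an unbiased exponent, two extra bits kept during alignment, and no sticky bit. Rounding follows the rule in the list (add one if the first discarded bit is 1), which is round-half-up, not IEEE 754's default round-to-nearest-even.

```python
P = 8  # hypothetical significand width in bits, including the leading 1


def value(x):
    """Numeric value of (exponent, significand), i.e. 1.fff... x 2**exponent."""
    exponent, significand = x
    return significand / 2 ** (P - 1) * 2.0 ** exponent


def fp_add(a, b):
    """Add two positive normalized (exponent, significand) pairs.

    Significands satisfy 2**(P-1) <= s < 2**P.
    """
    (ea, sa), (eb, sb) = sorted([a, b], reverse=True)  # ensure ea >= eb
    exp = ea                                           # take the larger exponent
    sa <<= 2                                           # append guard and round bits
    sb = (sb << 2) >> (ea - eb)                        # align the smaller operand
    total = sa + sb                                    # add the significands
    if total >= 1 << (P + 2):                          # 1x.xxx: normalize right
        total >>= 1
        exp += 1
    sig = (total >> 2) + ((total >> 1) & 1)            # round on the first dropped bit
    if sig >= 1 << P:                                  # rounding carried out: renormalize
        sig >>= 1
        exp += 1
    return exp, sig


print(value(fp_add((0, 0b11000000), (0, 0b10100000))))   # 1.5 + 1.25 = 2.75
print(value(fp_add((0, 0b10000000), (-3, 0b10000000))))  # 1.0 + 0.125 = 1.125
```

The last check in the function is the "re-normalize" step: rounding 1.1111111 up produces 10.0000000, which must be shifted right once more.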
Floating point multiply
To multiply two numbers
• Add the two exponents (remember excess 127 notation: both biased exponents carry the bias, so subtract 127 once from the sum)
• Produce the result sign as the XOR of the two signs
• Multiply the significand portions
• Results will be 1x.xxxxx... or 01.xxxx...
• In the first case, shift the result right and adjust the exponent
• Round off the result
• This may require another normalization step
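The multiply steps map directly onto the single-precision fields. A minimal sketch that takes (sign, biased exponent, 23-bit fraction) triples for normalized numbers; to keep it short it truncates instead of rounding and ignores overflow, underflow, zeros and special values.

```python
BIAS = 127        # excess-127 exponent bias of single precision
FRAC_BITS = 23    # stored significand bits (the leading 1 is hidden)


def fp_multiply(a, b):
    """Multiply two normalized (sign, biased_exponent, fraction) triples."""
    sa, ea, fa = a
    sb, eb, fb = b
    sign = sa ^ sb                                  # result sign = XOR of the signs
    exp = ea + eb - BIAS                            # the sum counts the bias twice
    ma = (1 << FRAC_BITS) | fa                      # restore the hidden 1s
    mb = (1 << FRAC_BITS) | fb
    prod = ma * mb                                  # 01.xxx or 1x.xxx, 46 fraction bits
    if prod >> (2 * FRAC_BITS + 1):                 # 1x.xxx: shift right, bump exponent
        prod >>= 1
        exp += 1
    frac = (prod >> FRAC_BITS) & ((1 << FRAC_BITS) - 1)   # drop hidden 1, truncate
    return sign, exp, frac


# 1.5 x -2.5 -> (1, 128, 7340032), i.e. -1.875 x 2**1 = -3.75
print(fp_multiply((0, 127, 0x400000), (1, 128, 0x200000)))
```

Squaring 1.5 exercises the other branch: 1.5 × 1.5 = 10.01 in binary, so the product is shifted right once and the exponent rises by one.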
Floating point divide
To divide two numbers
• Subtract the divisor's exponent from the dividend's exponent (remember excess 127 notation: the subtraction cancels the bias, so add 127 back)
• Produce the result sign as the XOR of the two signs
• Divide the dividend's significand by the divisor's significand
• Results will be 1.xxxxx... or 0.1xxxx...
• In the second case, shift the result left and adjust the exponent
• Round off the result
• This may require another normalization step
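Division follows the same pattern on the fields. A minimal sketch for normalized single-precision (sign, biased exponent, 23-bit fraction) triples; it truncates the quotient and does not handle zero, infinity, NaN, overflow or underflow.

```python
BIAS = 127        # excess-127 exponent bias of single precision
FRAC_BITS = 23    # stored significand bits (the leading 1 is hidden)


def fp_divide(a, b):
    """Divide two normalized (sign, biased_exponent, fraction) triples."""
    sa, ea, fa = a
    sb, eb, fb = b
    sign = sa ^ sb                              # result sign = XOR of the signs
    exp = ea - eb + BIAS                        # subtraction cancels the bias; add it back
    ma = (1 << FRAC_BITS) | fa                  # restore the hidden 1s
    mb = (1 << FRAC_BITS) | fb
    q = (ma << (FRAC_BITS + 1)) // mb           # quotient with 24 fraction bits
    if q < (1 << (FRAC_BITS + 1)):              # 0.1xxx: shift left, decrement exponent
        q <<= 1
        exp -= 1
    frac = (q >> 1) & ((1 << FRAC_BITS) - 1)    # drop hidden 1, keep 23 bits
    return sign, exp, frac


print(fp_divide((0, 128, 0x700000), (0, 127, 0x400000)))   # 3.75 / 1.5 = 2.5
print(fp_divide((0, 127, 0x100000), (0, 127, 0x400000)))   # 1.125 / 1.5 = 0.75
```

The second call takes the 0.1xxx branch: the raw quotient 0.11 (binary) is normalized to 1.1 × 2^-1.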
UNIT 2
BASIC PROCESSING UNIT
Execution of one instruction requires the following three steps to be performed by the CPU:
1. Fetch the contents of the memory location pointed at by the PC. The contents of this location are interpreted as an instruction to be executed. Hence, they are stored in the instruction register (IR). Symbolically, this can be written as:
IR ← [[PC]]
2. Assuming that the memory is byte addressable, increment the contents of the PC by 4, that is
PC ← [PC] + 4
3. Carry out the actions specified by the instruction stored in the IR.
But, in cases where an instruction occupies more than one word, steps 1 and 2 must be repeated as many times as necessary to fetch the complete instruction.
The first two steps are usually referred to as the fetch phase.
Step 3 constitutes the execution phase.
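The fetch and execution phases can be sketched as a loop. This is a hypothetical toy machine: instructions are Python tuples stored at word-aligned byte addresses, and the ADDI and JUMP opcodes are invented for illustration. It shows that the PC is incremented by 4 before the execution phase, so a jump simply overwrites the already-incremented value.

```python
def fetch_execute(memory, pc, steps):
    """Run `steps` instruction cycles; return the final PC and register file."""
    regs = {"R1": 0}
    for _ in range(steps):
        ir = memory[pc]          # fetch phase, step 1: IR <- [[PC]]
        pc = pc + 4              # fetch phase, step 2: PC <- [PC] + 4
        op, *args = ir           # execution phase, step 3: do what IR specifies
        if op == "ADDI":         # add an immediate value to a register
            reg, imm = args
            regs[reg] += imm
        elif op == "JUMP":       # overwrite the already-incremented PC
            pc = args[0]
    return pc, regs


program = {0: ("ADDI", "R1", 5), 4: ("ADDI", "R1", 7), 8: ("JUMP", 0)}
print(fetch_execute(program, pc=0, steps=3))   # (0, {'R1': 12})
```

Instructions longer than one word would need the fetch lines repeated, as the text notes; the sketch assumes every instruction fits in one word.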
SINGLE BUS ORGANIZATION OF THE DATAPATH INSIDE A PROCESSOR
To execute an instruction, the processor has to perform one or more of the following basic operations in some specified sequence:
Fetch the contents of a given memory location and load them into a CPU register.
Store a word of data from a CPU register into a given memory location.
Transfer a word of data from one CPU register to another or to the ALU.
Perform an arithmetic or logic operation, and store the result in a CPU register.
REGISTER TRANSFER:
The input and output gates for register Ri are controlled by the signals Riin and Riout, respectively.
Thus, when Riin is set to 1, the data available on the common bus is loaded into Ri.
Similarly, when Riout is set to 1, the contents of register Ri are placed on the bus.
While Riout is equal to 0, the bus can be used for transferring data from other registers.
Let us now consider data transfer between two registers. For example, to transfer the contents of register R1 to R4, the following actions are needed:
Enable the output gate of register R1 by setting R1out to 1. This places the contents of R1 on the CPU bus.
Enable the input gate of register R4 by setting R4in to 1. This loads data from the CPU bus into register R4.
This data transfer can be represented symbolically as R1out, R4in
PERFORMING AN ARITHMETIC OR LOGIC OPERATION
A SEQUENCE OF OPERATIONS TO ADD THE CONTENTS OF REGISTER R1 TO THOSE OF REGISTER R2 AND STORE THE RESULT IN REGISTER R3 IS:
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
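The gating signals of the single-bus datapath can be mimicked step by step in software. This is a hypothetical sketch: each call to `step` is one control step in which exactly one register drives the bus (its "out" signal) and the registers listed in `into` latch it (their "in" signals). Y feeds one ALU input and Z latches the ALU output, as in the sequence above; memory, MAR/MDR and the Select4 input are omitted.

```python
class SingleBusCPU:
    """Registers sharing one internal bus; Y feeds the ALU, Z latches its output."""

    def __init__(self, **registers):
        self.r = dict(registers)

    def step(self, out, into=(), add=False):
        """One control step.

        out  -- the register whose out-gate drives the bus this step
        into -- registers whose in-gate is enabled, so they load the bus value
        add  -- SelectY, Add, Zin: the ALU adds Y to the bus and Z latches the sum
        """
        bus = self.r[out]
        if add:
            self.r["Z"] = self.r["Y"] + bus
        for name in into:
            self.r[name] = bus


cpu = SingleBusCPU(R1=5, R2=7, R3=0, R4=0, Y=0, Z=0)
cpu.step("R1", into=["R4"])        # R1out, R4in (register transfer)
cpu.step("R1", into=["Y"])         # 1. R1out, Yin
cpu.step("R2", add=True)           # 2. R2out, SelectY, Add, Zin
cpu.step("Z", into=["R3"])         # 3. Zout, R3in
print(cpu.r["R4"], cpu.r["R3"])    # 5 12
```

Y and Z exist because the single bus can carry only one value per step: one ALU operand must be parked in Y, and the result must wait in Z until the bus is free again.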
FETCHING A WORD FROM MEMORY:
The CPU transfers the address of the required information word to the memory address register (MAR). The address of the required word is transferred to the main memory.
Meanwhile, the CPU uses the control lines of the memory bus to indicate that a read operation is required.
After issuing this request, the CPU waits until it receives an answer from the memory, informing it that the requested function has been completed. This is accomplished through the use of another control signal on the memory bus, which will be referred to as Memory Function Completed (MFC).
The memory sets this signal to 1 to indicate that the contents of the specified location in the memory have been read and are available on the data lines of the memory bus.
We will assume that as soon as the MFC signal is set to 1, the information on the data lines is loaded into MDR and is thus available for use inside the CPU. This completes the memory fetch operation.
The actions needed for instruction Move (R1), R2 are:
MAR ← [R1]
Start Read operation on the memory bus
Wait for the MFC response from the memory
Load MDR from the memory bus
R2 ← [MDR]
Signals activated for this operation are:
R1out, MARin, Read
MDRinE, WMFC
MDRout, R2in
Storing a word in Memory
This is a similar procedure to fetching a word from memory.
The desired address is loaded into MAR.
Then the data to be written are loaded into MDR, and
a write command is issued.
If we assume that the data word to be stored in the memory is in R2 and that the memory address is in R1, the Write operation requires the following sequence:
MAR ← [R1]
MDR ← [R2]
Write
Wait for the MFC
Move R2, (R1) requires the following sequence (signals):
R1out, MARin
R2out, MDRin, Write
MDRoutE, WMFC
Execution of a complete Instruction
Consider the instruction:
Add (R3), R1
Executing this instruction requires the following actions:
1. Fetch the instruction
2. Fetch the first operand (the contents of the memory location pointed to by R3)
3. Perform the addition
4. Load the result into R1
Control Sequence for instruction Add (R3), R1:
1. PCout, MARin, Read, Select4, Add, Zin
2. Zout, PCin, Yin, Wait for the MFC
3. MDRout, IRin
4. R3out, MARin, Read
5. R1out, Yin, Wait for MFC
6. MDRout, SelectY, Add, Zin
7. Zout, R1in, End
Branch Instructions:
1. PCout, MARin, Read, Select4, Add, Zin
2. Zout, PCin, Yin, Wait for the MFC (WMFC)
3. MDRout, IRin
4. Offset-field-of-IRout, Add, Zin
5. Zout, PCin, End
Multiple bus architecture
One solution to the bandwidth limitation of a single bus is simply to add more buses. Consider the architecture shown in Figure 2.2 that contains L processors, P1, P2, …, PL, each having its own private cache, and all connected to a shared memory by B buses, B1, B2, …, BB. The shared memory consists of G interleaved banks, M1, M2, …, MG, so that simultaneous memory requests can access the shared memory concurrently. This avoids the loss in performance that occurs if those accesses must be serialized, which is the case when there is only one memory bank. Each processor is connected to every bus, and so is each memory bank. When a processor needs to access a particular bank, it has B buses from which to choose. Thus each processor-memory pair is connected by several redundant paths, which implies that the failure of one or more paths can, in principle, be tolerated at the cost of some degradation in system performance.
In a multiple bus system several processors may attempt to access the shared memory simultaneously. To deal with this, a policy must be implemented that allocates the available buses to the processors making requests to memory. In particular, the policy must deal with the case when the number of processors exceeds B. For performance reasons this allocation must be carried out by hardware arbiters which, as we shall see, add significantly to the complexity of the multiple bus interconnection network.
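Control sequence for the instruction Add R4, R5, R6 in a three-bus organization of the datapath:
1. PCout, R=B, MARin, Read, IncPC
2. WMFC
3. MDRoutB, R=B, IRin
4. R4outA, R5outB, SelectA, Add, R6in, End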
HARDWIRED CONTROL:
[Figure: Generation of the Zin control signal for the processor]
[Figure: Generation of the End control signal]
Nanoprogramming
Second compromise: nanoprogramming
– Use a 2-level control storage organization
– Top level is a vertical format memory
» Output of the top level memory drives the address register of the bottom (nano-level) memory
– Nanomemory uses the horizontal format
» Produces the actual control signal outputs
– The advantage to this approach is a significant saving in control memory size (bits)
– The disadvantage is more complexity and slower operation (doing 2 memory accesses for each microinstruction)
Nanoprogrammed machine
Example: Suppose that a system is being designed with 200 control points and 2048 microinstructions
– Assume that only 256 different combinations of control points are ever used
– A single-level control memory would require 2048 × 200 = 409,600 storage bits
A nanoprogrammed system would use
» Microstore of size 2048 × 8 = 16k (each microinstruction needs only 8 bits to select one of the 256 nanowords)
» Nanostore of size 256 × 200 = 51,200
» Total size = 67,584 storage bits
Nanoprogramming has been used in many CISC microprocessors
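The storage arithmetic of the example generalizes to any sizes. A minimal sketch (the function name is hypothetical) that computes the single-level size and the two-level microstore plus nanostore sizes, assuming each microinstruction holds just enough bits to select one distinct nanoword.

```python
import math


def control_store_bits(control_points, microinstructions, distinct_words):
    """Return (single-level, microstore, nanostore, two-level total) sizes in bits."""
    single_level = microinstructions * control_points       # every word is horizontal
    select_bits = math.ceil(math.log2(distinct_words))      # bits to name one nanoword
    microstore = microinstructions * select_bits             # vertical top level
    nanostore = distinct_words * control_points              # horizontal bottom level
    return single_level, microstore, nanostore, microstore + nanostore


print(control_store_bits(200, 2048, 256))   # (409600, 16384, 51200, 67584)
```

The saving depends on `distinct_words` being much smaller than `microinstructions`; if most microinstructions used unique control-point combinations, the nanostore would approach the size of the single-level memory and the extra access would buy nothing.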
Applications of Microprogramming
Microprogramming application: emulation
– The use of a microprogram on one machine to execute programs originally written to run on another (different!) machine
– By changing the microcode of a machine, you can make it execute software from another machine
– Commonly used in the past to permit new machines to continue to run old software
   » VAX11-780 had 2 "modes"
      – Normal 11-780 mode
      – Emulation mode for a PDP-11
– The Nanodata QM-1 machine was marketed with no native instruction set!
   » Universal emulation engine
UNIT 3
Pipelining
What is Pipelining?
The Pipeline Defined
John Hayes provides a definition of a pipeline as it applies to a computer processor.
"A pipeline processor consists of a sequence of processing circuits, called segments or stages, through which a stream of operands can be passed.
"Partial processing of the operands takes place in each segment.
"... a fully processed result is obtained only after an operand set has passed through the entire pipeline."
In everyday life, people do many tasks in stages. For instance, when we do the laundry, we place a load in the washing machine. When it is done, it is transferred to the dryer and another load is placed in the washing machine. When the first load is dry, we pull it out for folding or ironing, move the second load to the dryer and start a third load in the washing machine. We proceed with folding or ironing of the first load while the second and third loads are being dried and washed, respectively. We may never have thought of it this way, but we do laundry by pipeline processing.
A Pipeline
is a series of stages, where some work is done at each stage. The work is not finished until it has passed through all stages.
Let us review Hayes' definition as it pertains to our laundry example. The washing machine is one "sequence of processing circuits," or a stage. The second is the dryer. The third is the folding or ironing stage.
Partial processing takes place in each stage. We certainly aren't done when the clothes leave the washer. Nor when they leave the dryer, although we're getting close. We must take the third step and fold (if we're lucky) or iron the clothes. The "fully processed result" is obtained only after the operand (the load of clothes) has passed through the entire pipeline.
We are often taught to take a large task and to divide it into smaller pieces. This may turn an unmanageably complex task into a series of more tractable smaller steps. In the case of manageable tasks such as the laundry example, it allows us to speed up the task by doing it in overlapping steps.
This is the key to pipelining: division of a larger task into smaller overlapping tasks.
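The laundry pipeline can be laid out step by step. A minimal sketch, assuming every stage takes exactly one time step and nothing ever stalls (real pipelines violate both assumptions, as the hazards discussed below show); the function and stage names are illustrative only.

```python
def pipeline_schedule(tasks, stages):
    """Ideal pipeline: task t occupies stage s during time step t + s.

    Returns {time_step: [(task, stage), ...]}.
    """
    schedule = {}
    for t in range(tasks):
        for s in range(stages):
            schedule.setdefault(t + s, []).append((t, s))
    return schedule


STAGES = ["wash", "dry", "fold"]
schedule = pipeline_schedule(3, len(STAGES))
for step in sorted(schedule):
    busy = ", ".join(f"load {t + 1}: {STAGES[s]}" for t, s in schedule[step])
    print(f"step {step + 1}: {busy}")
print("pipelined:", len(schedule), "steps; one load at a time:", 3 * len(STAGES), "steps")
```

In general, n tasks through k equal stages take k + n - 1 steps instead of n × k; each load still takes k steps (latency is unchanged), but a finished load emerges every step once the pipeline is full.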
"A significant aspect of our civilization is the division of labor. Major engineering achievements are based on subdividing the total work into individual tasks which can be handled despite their inter-dependencies.
"Overlap and pipelining are essentially operation management techniques based on job subdivisions under a precedence constraint."
Types of Pipelines
Many authors, such as Tabak [TAB95, p 67], separate the pipeline into two categories.
Instructional pipeline
where different stages of an instruction fetch and execution are handled in a pipeline.
Arithmetic pipeline
where different stages of an arithmetic operation are handled along the stages of a pipeline.
The above definitions are correct but are based on a narrow perspective, considering only the central processor. There are other types of computing pipelines. Pipelines are used to compress and transfer video data. Another is the use of specialized hardware to perform graphics display tasks. Discussing graphics displays, Ware Myers wrote:
"...the pipeline concept ... transforms a model of some object into representations that successively become more machine-dependent and finally results in an image upon a particular screen."
This example of pipelining fits the definitions from Hayes and Chen but not the categories offered by Tabak. These broader categories are beyond the scope of this paper and are mentioned only to alert the reader that different authors mean different things when referring to pipelining.
Disadvantages
There are two disadvantages of pipeline architecture. The first is complexity. The second is the inability to run the pipeline continuously at full speed, i.e. the pipeline stalls.
Let us examine why the pipeline cannot run at full speed. There are phenomena called pipeline hazards which disrupt the smooth execution of the pipeline. The resulting delays in the pipeline flow are called bubbles. These pipeline hazards include
structural hazards from hardware conflicts
data hazards arising from data dependencies
control hazards that come about from branch, jump, and other control flow changes
These issues can be, and are, successfully dealt with. But detecting and avoiding the hazards leads to a considerable increase in hardware complexity. The control paths controlling the gating between stages can contain more circuit levels than the data paths being controlled. In 1970, this complexity was one reason that led Foster to call pipelining "still controversial".
The one major idea that is still controversial is "instruction look-ahead" [pipelining]... Why then the controversy? First, there is a considerable increase in hardware complexity [...] The second problem [...] when a branch instruction comes along, it is impossible to know in advance of execution which path the program is going to take and, if the machine guesses wrong, all the partially processed instructions in the pipeline are useless and must be replaced [...]
In the second edition of Foster's book, published in 1976, this passage was gone. Apparently, Foster felt that pipelining was no longer controversial.
Doran also alludes to the nature of the problem. The model of pipelining is "amazingly simple" while the implementation is "very complex" and has many complications.
Because multiple instructions can be in various stages of execution at any given moment, handling an interrupt is one of the more complex tasks. In the IBM 360, this can lead to several instructions executing after the interrupt is signaled, resulting in an imprecise interrupt. An imprecise interrupt can result from an instruction exception, and the precise address of the instruction causing the exception may not be known.
This led Myers to criticize pipelining, referring to the imprecise interrupt as an "architectural nuisance". He stated that it was not an advance in computer architecture but an improvement in implementation that could be viewed as a step backward.
In retrospect, most of Myers' book Advances in Computer Architecture dealt with his concepts for improvements in computer architecture that would be termed CISC today. With the benefit of hindsight, we can see that pipelining is here today and that most of the new CPUs are in the RISC class. In fact, Myers is one of the co-architects of Intel's series of 32-bit RISC microprocessors. This processor is fully pipelined. I suspect that Myers no longer considers pipelining a step backwards.
The difficulty arising from imprecise interrupts should be viewed as a complexity to be overcome, not as an inherent flaw in pipelining. Doran explains how the B7700 carries the address of the instruction through the pipeline, so that any exception that the instruction may raise can be precisely located and does not generate an imprecise interrupt.
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time).
The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline refers to the fact that each step is carrying data at once (like water), and each step is connected to the next (like the links of a pipe).
The origin of pipelining is thought to be either the ... project or the ... project. The IBM Stretch project proposed the terms "Fetch, Decode, and Execute" that became common usage.
Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory (flip flops). When the clock signal arrives, the flip flops take their new value and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives and the flip flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip flops between the pieces of logic, the delay before the logic gives valid outputs is reduced. In this way the clock period can be reduced. For example, the RISC pipeline is broken into five stages with a set of flip flops between each stage:
1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back
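The effect of inserting flip flops on the clock period can be sketched numerically. The stage delays and the 20 ps register overhead below are hypothetical values, and the model assumes an unpipelined design that completes one instruction per (long) clock cycle.

```python
def unpipelined_cycle(stage_delays_ps):
    """All the logic must settle within a single clock period."""
    return sum(stage_delays_ps)


def pipelined_cycle(stage_delays_ps, register_overhead_ps):
    """With flip flops between stages, the clock only has to cover the
    slowest stage plus the overhead of the pipeline register itself."""
    return max(stage_delays_ps) + register_overhead_ps


# Hypothetical logic delays (ps) for IF, ID, EX, MEM, WB.
delays = [300, 200, 250, 300, 150]
t_single = unpipelined_cycle(delays)   # 1200 ps per instruction
t_pipe = pipelined_cycle(delays, 20)   # 320 ps per clock
print(t_single, t_pipe, t_single / t_pipe)  # 1200 320 3.75
```

The throughput gain (3.75) stays below the stage count (5) because these stages are unbalanced and each pipeline register adds its own delay.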
Hazards: When a programmer (or compiler) writes assembly code, they make the assumption that each instruction is executed before execution of the subsequent instruction is begun. This assumption is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards, such as forwarding and stalling, exist.
A non-pipelined architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely eliminate idle time in a CPU, but making those modules work in parallel improves program execution significantly.
Processors with pipelining are organized internally into stages which can semi-independently work on separate jobs. Each stage is organized and linked into a 'chain' so each stage's output is fed to another stage until the job is done. This organization of the processor allows overall processing time to be significantly reduced.
A deeper pipeline means that there are more stages in the pipeline, and therefore fewer logic gates in each stage. This generally means that the processor's frequency can be increased as the cycle time is lowered. This happens because there are fewer components in each stage of the pipeline, so the propagation delay is decreased for the overall stage.
Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5 stages. To operate at full performance, this pipeline will need to run 4 subsequent independent instructions while the first is completing. If 4 instructions that do not depend on the output of the first instruction are not available, the pipeline control logic must insert a stall or wasted clock cycle into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock frequency also scales with the number of stages), in reality most code does not allow for ideal execution.
Hazard (computer architecture)
In computer architecture, a hazard is a potential problem that can happen in a pipelined processor. It refers to the possibility of erroneous computation when a CPU tries to simultaneously execute multiple instructions which exhibit data dependence. There are typically three types of hazards: data hazards, structural hazards, and branching hazards (control hazards).
Instructions in a pipelined processor are performed in several stages, so that at any given time several instructions are being executed, and instructions may not be completed in the desired order.
A hazard occurs when two or more of these simultaneous (possibly out of order) instructions conflict.
Data hazards
Data hazards occur when data is modified. Ignoring potential data hazards can result in race conditions (sometimes known as race hazards). There are three situations in which a data hazard can occur:
1. Read after Write (RAW) or True dependency: An operand is modified and read soon after. Because the first instruction may not have finished writing to the operand, the second instruction may use incorrect data.
2. Write after Read (WAR) or Anti dependency: An operand is read and then written soon after. Because the write may have finished before the read, the read instruction may incorrectly get the new written value.
3. Write after Write (WAW) or Output dependency: Two instructions that write to the same operand are performed. The first one issued may finish second, and therefore leave the operand with an incorrect data value.
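The three cases can be told apart purely from which registers each instruction reads and writes. In the sketch below, `Instr`, `instr` and `dependences` are hypothetical names; the function checks a later instruction j against an earlier instruction i. Whether a dependence actually becomes a hazard still depends on how close the two instructions are in the pipeline.

```python
from typing import FrozenSet, NamedTuple, Set


class Instr(NamedTuple):
    dest: FrozenSet[str]  # registers written
    srcs: FrozenSet[str]  # registers read


def instr(dest, *srcs):
    """Build an instruction of the form dest <- f(srcs)."""
    return Instr(frozenset([dest]), frozenset(srcs))


def dependences(i: Instr, j: Instr) -> Set[str]:
    """Dependences of instruction j on an earlier instruction i."""
    found = set()
    if i.dest & j.srcs:
        found.add("RAW")  # j reads what i writes (true dependence)
    if i.srcs & j.dest:
        found.add("WAR")  # j writes what i reads (anti dependence)
    if i.dest & j.dest:
        found.add("WAW")  # both write the same register (output dependence)
    return found


# R2 <- R1 + R3 followed by R4 <- R2 + R3
print(dependences(instr("R2", "R1", "R3"), instr("R4", "R2", "R3")))  # {'RAW'}
```

A pair can carry more than one dependence: for R2 <- R1 + R2 followed by R2 <- R4 + R7, the function reports both WAW and WAR, because the first instruction also reads R2.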
RAW Read After Write
A RAW data hazard refers to a situation where we refer to a result that has not yet been calculated, for example:
i1. R2 <- R1 + R3
i2. R4 <- R2 + R3
The 1st instruction is calculating a value to be saved in register 2, and the second is going to use this value to compute a result for register 4. However, in a pipeline, when we fetch the operands for the 2nd operation, the results from the 1st will not yet have been saved, and hence we have a data dependency.
We say that there is a data dependency with instruction 2, as it is dependent on the completion of instruction 1.
WAR Write After Read
A WAR data hazard represents a problem with concurrent execution, for example:
i1. R4 <- R1 + R3
i2. R3 <- R1 + R2
If there is a chance that i2 may be completed before i1 (i.e. with concurrent execution), we must ensure that we do not store the result of register 3 before i1 has had a chance to fetch the operands.
WAW Write After Write
A WAW data hazard is another situation which may occur in a concurrent execution environment, for example:
i1. R2 <- R1 + R2
i2. R2 <- R4 + R7
We must delay the WB (Write Back) of i2 until after the execution of i1.
Structural hazards
A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A structural hazard might occur, for instance, if a program were to execute a branch instruction followed by a computation instruction. Because they are executed in parallel, and because branching is typically slow (requiring a comparison, program counter-related computation, and writing to registers), it is quite possible (depending on architecture) that the computation instruction and the branch instruction will both require the ALU (arithmetic logic unit) at the same time.
Branch (control) hazards
Branching hazards (also known as control hazards) occur when the processor is told to branch, i.e., if a certain condition is true, then jump from one part of the instruction stream to another, not necessarily to the next instruction sequentially. In such a case, the processor cannot tell in advance whether it should process the next instruction (when it may instead have to move to a distant instruction).
This can result in the processor doing unwanted actions.
Eliminating hazards
We can delegate the task of removing data dependencies to the compiler, which can fill in an appropriate number of NOP instructions between dependent instructions to ensure correct operation, or re-order instructions where possible.
Other methods include on-chip solutions such as:
Scoreboarding method
Tomasulo's method
There are several established techniques for either preventing hazards from occurring, or working around them if they do.
Bubbling the Pipeline
Bubbling the pipeline (a technique also known as a pipeline break or pipeline stall) is a method for preventing data, structural, and branch hazards from occurring. As instructions are fetched, control logic determines whether a hazard could/will occur. If this is true, then the control logic inserts NOPs into the pipeline. Thus, before the next instruction (which would cause the hazard) is executed, the previous one will have had sufficient time to complete and prevent the hazard. If the number of NOPs is equal to the number of stages in the pipeline, the processor has been cleared of all instructions and can proceed free from hazards. This is called flushing the pipeline. All forms of stalling introduce a delay before the processor can resume execution.
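How many bubbles the control logic must insert can be sketched for the classic five-stage pipeline. The model is an assumption, not a description of a specific processor: there is no forwarding, registers are read in ID and written in WB, and a register written in the first half of a cycle can be read in the second half of the same cycle. `program` is a hypothetical list of (destination, sources) pairs.

```python
def id_cycles(program):
    """Return the cycle in which each instruction performs ID (register
    read), plus the total number of bubbles needed for RAW dependences."""
    ready = {}    # register -> WB cycle that produces its newest value
    cycles = []
    bubbles = 0
    prev_id = 1   # so the first instruction decodes in cycle 2 (IF in cycle 1)
    for dest, srcs in program:
        earliest = prev_id + 1                       # one instruction per cycle
        needed = max([ready.get(r, 0) for r in srcs], default=0)
        this_id = max(earliest, needed)              # wait for operands
        bubbles += this_id - earliest
        cycles.append(this_id)
        if dest is not None:
            ready[dest] = this_id + 3                # ID -> EX -> MEM -> WB
        prev_id = this_id
    return cycles, bubbles


# The ADD/SUB/AND/OR/XOR sequence from the Data Hazards section.
program = [("R1", ["R2", "R3"]), ("R4", ["R5", "R1"]), ("R6", ["R1", "R7"]),
           ("R8", ["R1", "R9"]), ("R10", ["R1", "R11"])]
print(id_cycles(program))  # ([2, 5, 6, 7, 8], 2)
```

Only SUB is delayed (two bubbles); once it has waited, the later readers of R1 find the value already written.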
Eliminating data hazards
Forwarding
NOTE: In the following examples, "Register n" names a register; the other numbers are computed values.
Forwarding involves feeding output data into a previous stage of the pipeline. For instance, let's say we want to write the value 3 to register 1 (which already contains a 6), and then add 7 to register 1 and store the result in register 2, i.e.:
Instruction 0: Register 1 = 6
Instruction 1: Register 1 = 3
Instruction 2: Register 2 = Register 1 + 7 = 10
Following execution, register 2 should contain the value 10. However, if Instruction 1 (write 3 to register 1) does not completely exit the pipeline before Instruction 2 starts execution, it means that Register 1 does not contain the value 3 when Instruction 2 performs its addition. In such an event, Instruction 2 adds 7 to the old value of register 1 (6), and so register 2 would contain 13 instead, i.e.:
Instruction 0: Register 1 = 6
Instruction 1: Register 1 = 3
Instruction 2: Register 2 = Register 1 + 7 = 13
This error occurs because Instruction 2 reads Register 1 before Instruction 1 has committed/stored the result of its write operation to Register 1. So when Instruction 2 is reading the contents of Register 1, register 1 still contains 6, not 3.
Forwarding (described below) helps correct such errors by depending on the fact that the output of Instruction 1 (which is 3) can be used by subsequent instructions before the value 3 is committed to/stored in Register 1.
Forwarding is implemented by feeding back the output of an instruction into the previous stage(s) of the pipeline as soon as the output of that instruction is available. Forwarding applied to our example means that we do not wait to commit/store the output of Instruction 1 in Register 1 (in this example, the output is 3) before making that output available to the subsequent instruction (in this case, Instruction 2). The effect is that Instruction 2 uses the correct (the more recent) value of Register 1: the commit/store was made immediately and not pipelined.
With forwarding enabled, the ID/EX stage of the pipeline now has two inputs: the value read from the register specified (in this example, the value 6 from Register 1), and the new value of Register 1 (in this example, this value is 3), which is sent from the next stage (EX/MEM). Additional control logic is used to determine which input to use.
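The additional control logic that chooses between the register-file value and a forwarded value can be sketched as a small priority selector. The sketch assumes two forwarding sources (the EX/MEM and MEM/WB pipeline registers), each described by a hypothetical (destination register, value) pair, and assumes register 0 is hard-wired to zero as in DLX/MIPS-style machines, so it is never forwarded.

```python
from typing import Optional, Tuple

# (destination register, result value), or None if that stage writes no register
Bypass = Optional[Tuple[int, int]]


def select_operand(src: int, regfile_value: int,
                   ex_mem: Bypass, mem_wb: Bypass) -> int:
    """Pick the newest available value for source register `src`."""
    if src != 0 and ex_mem is not None and ex_mem[0] == src:
        return ex_mem[1]   # result of the immediately preceding instruction
    if src != 0 and mem_wb is not None and mem_wb[0] == src:
        return mem_wb[1]   # result of the instruction two ahead
    return regfile_value   # no newer value in flight: use the register file


# The example above: Register 1 still holds 6 in the register file, while
# Instruction 1's result (3) sits in EX/MEM, so Instruction 2 computes 3 + 7.
r1 = select_operand(1, 6, ex_mem=(1, 3), mem_wb=None)
print(r1 + 7)  # 10
```

EX/MEM is checked first because, when both pipeline registers hold a write to the same register, it carries the more recent value.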
[Figures: Forwarding Unit; What About Load-Use Stall?; What About Control Hazards? (Predict-Not-Taken); Reduce Branch Delay]
Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining.
There are three classes of hazards:
Structural Hazards. They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
Data Hazards. They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
Control Hazards. They arise from the pipelining of branches and other instructions that change the PC.
Hazards in pipelines can make it necessary to stall the pipeline. The processor can stall on different events:
A cache miss. A cache miss stalls all the instructions in the pipeline, both before and after the instruction causing the miss.
A hazard in the pipeline. Eliminating a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed. When an instruction is stalled, all the instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction must continue, since otherwise the hazard will never clear.
A hazard causes pipeline bubbles to be inserted. The following table shows how the stalls are actually implemented. As a result, no new instructions are fetched during clock cycle 4, and no instruction will finish during clock cycle 8.
In case of structural hazards:
Clock cycle number
Instr      1      2      3      4      5      6      7      8      9      10
Instr i    IF     ID     EX     MEM    WB
Instr i+1         IF     ID     EX     MEM    WB
Instr i+2                IF     ID     EX     MEM    WB
Stall                           bubble bubble bubble bubble bubble
Instr i+3                              IF     ID     EX     MEM    WB
Instr i+4                                     IF     ID     EX     MEM    WB
To simplify the picture it is also commonly shown like this:
Clock cycle number
Instr      1      2      3      4      5      6      7      8      9      10
Instr i    IF     ID     EX     MEM    WB
Instr i+1         IF     ID     EX     MEM    WB
Instr i+2                IF     ID     EX     MEM    WB
Instr i+3                       stall  IF     ID     EX     MEM    WB
Instr i+4                                     IF     ID     EX     MEM    WB
In case of data hazards:
Clock cycle number
Instr      1      2      3      4      5      6      7      8      9      10
Instr i    IF     ID     EX     MEM    WB
Instr i+1         IF     ID     bubble EX     MEM    WB
Instr i+2                IF     bubble ID     EX     MEM    WB
Instr i+3                       bubble IF     ID     EX     MEM    WB
Instr i+4                                     IF     ID     EX     MEM    WB
which appears the same with stalls:
Clock cycle number
Instr      1      2      3      4      5      6      7      8      9      10
Instr i    IF     ID     EX     MEM    WB
Instr i+1         IF     ID     stall  EX     MEM    WB
Instr i+2                IF     stall  ID     EX     MEM    WB
Instr i+3                       stall  IF     ID     EX     MEM    WB
Instr i+4                                     IF     ID     EX     MEM    WB
Performance of Pipelines with Stalls
A stall causes the pipeline performance to degrade from the ideal performance.
$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock cycle time}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock cycle time}_{\text{pipelined}}}$$
The ideal CPI on a pipelined machine is almost always 1. Hence, the pipelined CPI is
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction} = 1 + \text{Pipeline stall clock cycles per instruction}$$
If we ignore the cycle time overhead of pipelining and assume the stages are all perfectly balanced, then the cycle times of the two machines are equal and
$$\text{Speedup} = \frac{\text{CPI}_{\text{unpipelined}}}{1 + \text{Pipeline stall cycles per instruction}}$$
If all instructions take the same number of cycles, which must also equal the number of pipeline stages (the depth of the pipeline), then the unpipelined CPI is equal to the depth of the pipeline, leading to
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
If there are no pipeline stalls, this leads to the intuitive result that pipelining can improve performance by the depth of the pipeline.
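A one-line version of the final formula makes the cost of stalls easy to explore. It inherits all of the simplifications above, and the stall rate of 0.25 cycles per instruction in the usage line is a hypothetical value.

```python
def pipeline_speedup(depth, stall_cycles_per_instr):
    """Speedup = depth / (1 + stall cycles per instruction).

    Valid only under the simplifications above: balanced stages, no
    pipeline-register overhead, ideal CPI of 1, unpipelined CPI = depth.
    """
    return depth / (1 + stall_cycles_per_instr)


print(pipeline_speedup(5, 0))     # 5.0
print(pipeline_speedup(5, 0.25))  # 4.0
```

One stall every four instructions already removes a full stage's worth of the ideal five-fold speedup.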
Structural Hazards
When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.
If some combination of instructions cannot be accommodated because of a resource conflict, the machine is said to have a structural hazard.
Common instances of structural hazards arise when
Some functional unit is not fully pipelined. Then a sequence of instructions using that unpipelined unit cannot proceed at the rate of one per clock cycle.
Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.
Example 1:
A machine may have only one register-file write port, but in some cases the pipeline might want to perform two writes in a clock cycle.
Example 2:
A machine has a single memory pipeline shared by data and instructions. As a result, when an instruction contains a data-memory reference (load), it will conflict with the instruction reference for a later instruction (Instr 3):
Clock cycle number
Instr      1      2      3      4      5      6      7      8
Load       IF     ID     EX     MEM    WB
Instr 1           IF     ID     EX     MEM    WB
Instr 2                  IF     ID     EX     MEM    WB
Instr 3                         IF     ID     EX     MEM    WB
To resolve this, we stall the pipeline for one clock cycle when a data-memory access occurs. The effect of the stall is actually to occupy the resources for that instruction slot. The following table shows how the stalls are actually implemented.
Clock cycle number
Instr      1      2      3      4      5      6      7      8      9
Load       IF     ID     EX     MEM    WB
Instr 1           IF     ID     EX     MEM    WB
Instr 2                  IF     ID     EX     MEM    WB
Stall                           bubble bubble bubble bubble bubble
Instr 3                                IF     ID     EX     MEM    WB
Instruction 1 is assumed not to be a data-memory reference (load or store); otherwise Instruction 3 cannot start execution for the same reason as above.
To simplify the picture it is also commonly shown like this:
Clock cycle number
Instr      1      2      3      4      5      6      7      8      9
Load       IF     ID     EX     MEM    WB
Instr 1           IF     ID     EX     MEM    WB
Instr 2                  IF     ID     EX     MEM    WB
Instr 3                         stall  IF     ID     EX     MEM    WB
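The stall rule for a single shared memory can be sketched as a small scheduler. The model is an assumption matching the tables above: a five-stage pipeline in which a load or store uses the memory in its MEM stage, three cycles after its IF, and a data access wins the memory over an instruction fetch, so the fetch is delayed.

```python
def fetch_cycles(is_mem_op):
    """Return the IF cycle of each instruction.

    is_mem_op[k] is True when instruction k is a load or store.
    """
    data_busy = set()   # cycles in which the memory serves a data access
    cycles = []
    candidate = 1
    for mem_op in is_mem_op:
        while candidate in data_busy:
            candidate += 1                 # stall: no instruction fetched
        cycles.append(candidate)
        if mem_op:
            data_busy.add(candidate + 3)   # its MEM stage, three cycles later
        candidate += 1
    return cycles


# Load, Instr 1, Instr 2, Instr 3 from the example: Instr 3 slips to cycle 5.
print(fetch_cycles([True, False, False, False]))  # [1, 2, 3, 5]
```

If Instruction 1 were also a load, its data access in cycle 5 would block the fetch again, which is the case excluded by the assumption stated above.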
Introducing stalls degrades performance, as we saw before. Why, then, would the designer allow structural hazards? There are two reasons:
To reduce cost. For example, machines that support both an instruction and a data cache access every cycle (to prevent the structural hazard of the above example) require at least twice as much total memory bandwidth.
To reduce the latency of the unit. The shorter latency comes from the lack of pipeline registers that introduce overhead.
Data Hazards
A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine.
Consider the pipelined execution of these instructions:
Instruction       1      2      3      4      5      6      7      8      9
ADD R1, R2, R3    IF     ID     EX     MEM    WB
SUB R4, R5, R1           IF     ID     EX     MEM    WB
AND R6, R1, R7                  IF     ID     EX     MEM    WB
OR R8, R1, R9                          IF     ID     EX     MEM    WB
XOR R10, R1, R11                              IF     ID     EX     MEM    WB
All the instructions after the ADD use the result of the ADD instruction (in R1). The ADD instruction writes the value of R1 in its WB stage (cycle 5), and the SUB instruction reads the value during its ID stage (cycle 3). This problem is called a data hazard.
Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it.
The AND instruction is also affected by this data hazard. The write of R1 does not complete until the end of cycle 5. Thus, the AND instruction, which reads the registers during cycle 4 (its ID stage), will receive the wrong result.
The OR instruction can be made to operate without incurring a hazard by a simple implementation technique. The technique is to perform register file reads in the second half of the cycle, and writes in the first half. Because both WB for ADD and ID for OR are performed in cycle 5, the write to the register file by ADD will be performed in the first half of the cycle, and the read of registers by OR will be performed in the second half of the cycle.
The XOR instruction operates properly, because its register read occurs in cycle 6, after the register write by ADD.
The next page discusses forwarding, a technique to eliminate the stalls for the hazard involving the SUB and AND instructions.
We will also classify the data hazards and consider the cases when stalls cannot be eliminated. We will see what the compiler can do to schedule the pipeline to avoid stalls.
Data Hazard Classification
A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining would change the order of access to an operand. Our example hazards have all been with register operands, but it is also possible to create a dependence by writing and reading the same memory location. In the DLX pipeline, however, memory references are always kept in order, preventing this type of hazard from arising.
All the data hazards discussed here involve registers within the CPU. By convention, the hazards are named by the ordering in the program that must be preserved by the pipeline:
RAW (read after write)
WAW (write after write)
WAR (write after read)
Consider two instructions i and j, with i occurring before j. The possible data hazards are:
RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value.
This is the most common type of hazard and the kind that we use forwarding to overcome.
WAW (write after write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination.
This hazard is present only in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled. The DLX integer pipeline writes a register only in WB and avoids this class of hazards.
WAW hazards would be possible if we made the following two changes to the DLX pipeline:
move write back for an ALU operation into the MEM stage, since the data value is available by then;
suppose that the data memory access took two pipe stages.
Here is a sequence of two instructions showing the execution in this revised pipeline; the WB entries mark the stage that writes the result (cycle 6 for LD, cycle 5 for ADD):
Instruction     1      2      3      4      5      6
LD R1, 0(R2)    IF     ID     EX     MEM1   MEM2   WB
ADD R1, R2, R3         IF     ID     EX     WB
Unless this hazard is avoided, execution of this sequence on this revised pipeline will leave the result of the first write (the LD) in R1, rather than the result of the ADD.
Allowing writes in different pipe stages introduces other problems, since two instructions can try to write during the same clock cycle. The DLX FP pipeline, which has both writes in different stages and different pipeline lengths, will deal with both write conflicts and WAW hazards in detail.
WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value.
This cannot happen in our example pipeline because all reads are early (in ID) and all writes are late (in WB). This hazard occurs when there are some instructions that write results early in the instruction pipeline, and other instructions that read a source late in the pipeline.
Because of the natural structure of a pipeline, which typically reads values before it writes results, such hazards are rare. Pipelines for complex instruction sets that support autoincrement addressing and require operands to be read late in the pipeline could create WAR hazards.
If we modified the DLX pipeline as in the above example and also read some operands late, such as the source value for a store instruction, a WAR hazard could occur. Here is the pipeline timing for such a potential hazard; the conflict occurs in cycle 5:
Instruction     1      2      3      4      5      6
SD 0(R1), R2    IF     ID     EX     MEM1   MEM2   WB
ADD R2, R3, R4         IF     ID     EX     WB
If the SD reads its source value R2 during the second half of its MEM2 stage and the ADD writes R2 during the first half of its WB stage, the SD will incorrectly read and store the value produced by the ADD.
RAR (read after read): this case is not a hazard :).
Handling control hazards is very important
VAX e.g.,
- Emer and Clark report 39% of instr. change the PC
- Naive solution adds approx. 5 cycles every time
=> adds 2 to CPI, or ~20% increase
DLX e.g.,
- H&P report 13% branches
- Naive solution adds 3 cycles per branch
=> 0.39 added to CPI, or ~30% increase
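The CPI penalties on this slide are simply the fraction of control-transfer instructions multiplied by the naive per-branch penalty:

$$\text{VAX: } 0.39 \times 5 = 1.95 \approx 2 \text{ extra cycles per instruction}$$
$$\text{DLX: } 0.13 \times 3 = 0.39 \text{ extra cycles per instruction}$$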
Move control point earlier in the pipeline
- Find out whether branch is taken earlier
- Compute target address fast
Both need to be done, e.g. in ID stage
- target := PC + immediate
- if (Rs1 op 0) PC := target
Comparisons in ID stage
- must be fast
- can't afford to subtract
- compares with 0 are simple
- gt, lt test sign-bit
- eq, ne must OR all bits
More general conditions need ALU
- DLX uses conditional sets
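Why comparisons against zero are cheap enough for the ID stage can be illustrated in software. The sketch assumes 32-bit two's-complement registers: `is_zero` mimics the wide OR (followed by an inversion) used for eq/ne, and `is_negative` reads only the sign bit. A strict greater-than-zero test needs both signals, which is still far cheaper than a full subtraction.

```python
WIDTH = 32                    # assumed register width
MASK = (1 << WIDTH) - 1


def is_zero(x):
    """eq/ne against 0: OR all bits together, then invert."""
    bits = x & MASK
    any_one = 0
    for k in range(WIDTH):
        any_one |= (bits >> k) & 1
    return any_one == 0


def is_negative(x):
    """lt against 0: only the sign bit is inspected."""
    return ((x & MASK) >> (WIDTH - 1)) == 1


def is_positive(x):
    """gt against 0: sign bit clear and value not zero."""
    return not is_negative(x) and not is_zero(x)


print(is_zero(0), is_negative(-5), is_positive(0))  # True True False
```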
Branch prediction
- guess the direction of branch
- minimize penalty when right
- may increase penalty when wrong
Techniques
- static: by compiler
- dynamic: by hardware
Static techniques
- predict always not-taken
- predict always taken
- predict backward taken
- predict specific opcodes taken
- delayed branches
Dynamic techniques
- Discussed with ILP
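Three of the static rules listed above can be compared on a branch trace. The trace below is hypothetical (a loop-closing branch at 0x40 taken nine times then falling through, plus a forward if-branch at 0x20 taken twice in ten executions); real accuracy depends entirely on the program.

```python
def always_not_taken(pc, target):
    return False


def always_taken(pc, target):
    return True


def backward_taken(pc, target):
    # Backward branches usually close loops, so predict them taken;
    # forward branches are predicted not taken.
    return target < pc


def accuracy(trace, predict):
    hits = sum(1 for pc, target, taken in trace if predict(pc, target) == taken)
    return hits / len(trace)


# Hypothetical trace of (branch PC, target, taken?).
trace = ([(0x40, 0x10, True)] * 9 + [(0x40, 0x10, False)]
         + [(0x20, 0x30, False)] * 8 + [(0x20, 0x30, True)] * 2)
for rule in (always_not_taken, always_taken, backward_taken):
    print(rule.__name__, accuracy(trace, rule))
# always_not_taken 0.45
# always_taken 0.55
# backward_taken 0.85
```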
If taken then squash (aka abort or rollback)
- will work only if no state change until branch is resolved
- Simple 5-stage Pipeline, e.g., DLX: ok. Why?
- Other pipelines, e.g., VAX: autoincrement addressing?
For DLX must know target before branch is decoded
- can use prediction
- special hardware for fast decode
Execute both paths: hardware/memory b/w expensive
Fill with an instr before branch
- When? if branch and instr are independent.
- Helps? always
Fill from target (taken path)
- When? if safe to execute target; may have to duplicate code
- Helps? on taken branch; may increase code size
Fill from fall-through (not-taken path)
- When? if safe to execute instruction
- Helps? when not-taken
Filling in Delay Slots cont.
From Control-Independent code:
that's code that will eventually be visited no matter where the branch goes
Nullifying or Cancelling or Likely Branches:
Specify when the delay slot is executed and when it is squashed
Why? Increase fill opportunities
Major Concern w/ DS: Exposes implementation optimization
Cond. Branch statistics - DLX
- 14%-17% of all insts (integer)
- 3%-12% of all insts (floating-point)
- Overall 20% (int) and 10% (fp) control-flow insts.
- About 67% are taken
$$\text{Branch-Penalty} = \%\text{branches} \times \left(\%\text{taken} \times \text{taken-penalty} + \%\text{not-taken} \times \text{not-taken-penalty}\right)$$
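The branch-penalty formula can be evaluated directly with the statistics above. The per-branch penalties in the usage lines are hypothetical: a design that stalls one cycle on every branch, versus predict-not-taken, which pays one cycle only when the branch is taken.

```python
def branch_penalty(frac_branches, frac_taken, taken_penalty, not_taken_penalty):
    """Extra CPI due to conditional branches (formula above)."""
    frac_not_taken = 1 - frac_taken
    return frac_branches * (frac_taken * taken_penalty
                            + frac_not_taken * not_taken_penalty)


# 20% control-flow instructions, 67% of them taken.
print(round(branch_penalty(0.20, 0.67, 1, 1), 3))  # 0.2   (always stall)
print(round(branch_penalty(0.20, 0.67, 1, 0), 3))  # 0.134 (predict not-taken)
```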
Comparison of Branch Schemes
Impact of Pipeline Depth
Assume that now penalties are doubled
For example, we double clock frequency
Interrupts
Examples:
- power failing, arithmetic overflow
- I/O device request, OS call, page fault
- Invalid opcode, breakpoint, protection violation
Interrupts (aka faults, exceptions, traps) often require
- surprise jump (to vectored address)
- linking return address
- saving of PSW (including CCs)
- state change (e.g., to kernel mode)
Classifying Interrupts
1a. synchronous
- function of program state (e.g., overflow, page fault)
1b. asynchronous
- external device or hardware malfunction
2a. user request
- OS call
2b. coerced
- from OS or hardware (page fault, protection violation)
3a. User Maskable
User can disable processing
3b. Non-Maskable
User cannot disable processing
4a. Between Instructions
Usually asynchronous
4b. Within an instruction
Usually synchronous; harder to deal with
5a. Resume
As if nothing happened? Program will continue execution
5b. Termination
Handling Interrupts
Precise interrupts (sequential semantics)
- Complete instructions before the offending instr
- Squash (effects of) instructions after
- Save PC (& next PC with delayed branches)
- Force trap instruction into IF
Must handle simultaneous interrupts
- IF, M: memory access (page fault, misaligned, protection)
- ID: illegal/privileged instruction
- EX: arithmetic exception
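The first two rules of precise handling, together with the choice among simultaneous exceptions, can be sketched as follows. The sketch assumes the faulting instruction itself is restarted (as for a page fault), so it is squashed along with everything younger and the saved PC points at it. Delayed-branch PCs and the trap-forcing step are not modelled, and the pipeline contents are hypothetical.

```python
def precise_interrupt(in_flight):
    """Decide what happens to in-flight instructions on an exception.

    in_flight: list of (pc, exception or None), oldest instruction first.
    Returns (completed, squashed, saved_pc, cause).
    """
    for k, (pc, exc) in enumerate(in_flight):
        if exc is not None:
            # The oldest faulting instruction wins, even if a younger one
            # raised its exception earlier in time (in an earlier stage).
            completed = [p for p, _ in in_flight[:k]]
            squashed = [p for p, _ in in_flight[k:]]
            return completed, squashed, pc, exc
    return [p for p, _ in in_flight], [], None, None


# Hypothetical pipeline contents, oldest first (WB, MEM, EX, ID, IF):
pipe = [(96, None), (100, "page fault"), (104, None),
        (108, "illegal opcode"), (112, None)]
print(precise_interrupt(pipe))
# ([96], [100, 104, 108, 112], 100, 'page fault')
```

The illegal opcode at 108 is not reported yet; if it is still present when the program resumes at 100, it will be raised again in program order.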
Out-of-Order Interrupts
Post interrupts
- check interrupt bit on entering WB
- precise interrupts
- longer latency
Handle immediately
- not fully precise
- interrupt may occur in order different from sequential CPU
- may cause implementation headaches!
Other complications
- odd bits of state (e.g., CC)
- early-writes (e.g., autoincrement)
- instruction buffers and prefetch logic
- dynamic scheduling
- out-of-order execution
Interrupts come at random times
Both Performance and Correctness
- frequent case not everything
- rare case MUST work correctly
Delayed Branches and Interrupts
What happens on interrupt while in delay slot
- next instruction is not sequential
Solution #1: save multiple PCs
- save current and next PC
- special return sequence, more complex hardware
Solution #2: single PC plus
- branch delay bit
- PC points to branch instruction
- SW Restrictions
O"erlapping Instructions
Co"te"tio" i" CB
G static priority
G e.g.# 9U with longest latency
G instructions stall after issue
CAR ha7ar!s
G always read registers at same pipe stage
CAC ha7ar!s
G di"f f8# f7# fE followed y suf f8# f5# f18
G stall suf or aort di"f?s D=
Multicycle Operations
Problems with interrupts
DI19 f8# f7#fE
G !DD9 f7#f5# f18
G %U=9 f6# fE# f18
A##F comp%etes before #IPF
G Out)Of)Order completion
G Possile imprecise interrupts
Precice Interrupts
Reorder Buffer
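The reorder-buffer idea can be sketched as a toy model: instructions may complete out of order, but results only reach the register file when they commit from the head of the buffer in program order, so an exception in DIVF leaves f2 untouched even though ADDF already finished. This is a minimal sketch assuming single issue and no real timing; the class and method names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RobEntry:
    text: str                        # instruction, e.g. "DIVF f0, f2, f4"
    dest: str                        # destination register name
    value: Optional[float] = None    # buffered result, not yet architectural
    exception: Optional[str] = None  # exception detected during execution
    done: bool = False


class ReorderBuffer:
    """Toy reorder buffer: completion may be out of order, commit is in order."""

    def __init__(self) -> None:
        self.entries: deque = deque()
        self.regs: Dict[str, float] = {}   # architectural register file

    def issue(self, text: str, dest: str) -> RobEntry:
        entry = RobEntry(text, dest)
        self.entries.append(entry)         # program order is kept here
        return entry

    def complete(self, entry: RobEntry, value: Optional[float] = None,
                 exception: Optional[str] = None) -> None:
        # Results (or exceptions) are only recorded in the buffer entry
        entry.value = value
        entry.exception = exception
        entry.done = True

    def commit(self) -> Optional[str]:
        """Retire finished instructions from the head, in program order.

        Returns the text of the faulting instruction, or None."""
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            if head.exception is not None:
                # Every older instruction has committed; squash younger ones
                self.entries.clear()
                return head.text
            self.regs[head.dest] = head.value
        return None


rob = ReorderBuffer()
divf = rob.issue("DIVF f0, f2, f4", "f0")
addf = rob.issue("ADDF f2, f8, f10", "f2")
rob.complete(addf, 5.0)         # ADDF finishes first...
rob.commit()                    # ...but cannot commit past the unfinished DIVF
rob.complete(divf, exception="divide by zero")
print(rob.commit(), rob.regs)   # the fault is reported with f2 still unmodified
```

Holding results in the buffer until commit costs extra storage and latency, which is the "post interrupts" trade-off listed above.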
UNIT 4
Memory System
BASIC CONCEPTS:
Address space
• 16-bit: 2^16 = 64K mem. locations
• 32-bit: 2^32 = 4G mem. locations
• 40-bit: 2^40 = 1T locations
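The address-space sizes above follow from one rule: n address bits can name 2^n distinct locations. A quick check of the three cases, assuming binary prefixes (K = 2^10, G = 2^30, T = 2^40):

```python
def addressable_locations(address_bits: int) -> int:
    # n address bits can name 2**n distinct memory locations
    return 2 ** address_bits


K, G, T = 2 ** 10, 2 ** 30, 2 ** 40   # binary prefixes
for bits, unit, name in ((16, K, "K"), (32, G, "G"), (40, T, "T")):
    print(f"{bits}-bit: {addressable_locations(bits) // unit}{name} locations")
```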
Terminology:
• Memory access time: time between the Read and MFC signals
• Memory cycle time: minimum time delay between the initiation of two successive memory
operations
Internal Organization of memory chips
• Form of an array
• Word lines + bit lines
• 16x8 organization: 16 words of 8 bits each
Static memories
• Circuits capable of retaining their state as long as power is applied
• Static RAM (SRAM)
• volatile
DRAMs:
• Charge on a capacitor
• Needs "refreshing"
A single-transistor dynamic memory cell
Synchronous DRAMs
Synchronized with a clock signal
Memory system considerations
• Cost
• Speed
• Power dissipation
• Size of chip
Memory controller
• Used between processor and memory
• Refresh overhead
MEMORY HIERARCHY
Principle of locality:
• Temporal locality (locality in time): If an item is referenced, it will tend to
be referenced again soon.
• Spatial locality (locality in space): If an item is referenced, items whose
addresses are close by will tend to be referenced soon.
• Sequentiality (subset of spatial locality).
The principle of locality can be exploited by implementing the memory of a computer
as a memory hierarchy, taking advantage of all types of memories.
Method: The level closer to the processor (the fastest) is a subset of any level
further away, and all the data is stored at the lowest level (the slowest).
Cache Memories
• The speed of main memory is very low in comparison with the speed of the processor.
• For good performance, the processor cannot spend much of its time waiting
to access instructions and data in main memory.
• It is important to devise a scheme that reduces the time to access the information.
• An efficient solution is to use a fast cache memory.
• When a cache is full and a memory word that is not in the cache is referenced, the
cache control hardware must decide which block should be removed to create
space for the new block that contains the referenced word.
The basics of Caches
• Caches are organized on the basis of blocks, the smallest amount of data that can be
copied between two adjacent levels at a time.
• If data requested by the processor is present in some block in the upper level,
it is called a hit.
• If data is not found in the upper level, the request is called a miss and the data is
retrieved from the lower level in the hierarchy.
• The fraction of memory accesses found in the upper level is called the hit ratio.
• The storage which takes advantage of locality of accesses is called a cache.
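The hit ratio feeds directly into cache performance. A minimal sketch, assuming the form t_avg = h*C + (1 - h)*M used in many computer-organization texts, where C is the cache access time and M is the total time of an access that misses; the numbers below are made up for illustration:

```python
def hit_ratio(hits: int, accesses: int) -> float:
    # fraction of memory accesses found in the upper level (the cache)
    return hits / accesses


def average_access_time(h: float, cache_ns: float, miss_ns: float) -> float:
    # t_avg = h*C + (1 - h)*M, with M the total time of an access that misses
    return h * cache_ns + (1.0 - h) * miss_ns


h = hit_ratio(950, 1000)                     # hypothetical counts: h = 0.95
print(average_access_time(h, 1.0, 100.0))    # about 5.95 ns
```

Even at a 95% hit ratio the misses dominate the average, which is why small improvements in hit ratio matter.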
Performance of caches
Accessing a Cache
Address Mapping in Cache:
Direct Mapping
In this technique, block j of the main memory maps onto block j modulo 128 of the
cache.
• Whenever one of the main memory blocks 0, 128, 256, … is loaded into the cache, it is stored in cache block 0.
• Blocks 1, 129, 257, … are stored in cache block 1.
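The direct-mapping rule is a single modulo operation. A minimal sketch, assuming the 128-block cache of the example above (the function name is hypothetical):

```python
CACHE_BLOCKS = 128   # cache size in blocks, as in the example above


def direct_mapped_block(j: int) -> int:
    """Cache block that main-memory block j must occupy."""
    return j % CACHE_BLOCKS


print([direct_mapped_block(j) for j in (0, 128, 256, 1, 129, 257)])
# prints [0, 0, 0, 1, 1, 1]: blocks 0, 128 and 256 compete for cache block 0
```

Because each memory block has exactly one possible slot, two frequently used blocks that share a slot keep evicting each other even when the rest of the cache is empty.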
Direct Mapped Cache:
Associative Mapping
More flexible mapping technique
• A main memory block can be placed into any cache block position.
• Space in the cache can be used more efficiently, but all 128 tag patterns need to be searched.
Set-Associative Mapping
Combination of the direct- and associative-mapping techniques
• Blocks of the cache are grouped into sets, and the mapping allows a block of the main
memory to reside in any block of a specific set.
Note: Memory blocks 0, 64, 128, …, 4032 map into cache set 0.
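The set-0 list in the note above is consistent with a 128-block cache organized as 64 sets of 2 blocks and a main memory of 4096 blocks. The sketch below assumes that geometry plus a 16-bit word address and 16 words per block, which gives a 6-bit tag, a 6-bit set field and a 4-bit word field; these sizes are assumptions, not stated in the notes.

```python
WORD_BITS = 4                  # assumed: 16 words per block
CACHE_BLOCKS = 128
WAYS = 2                       # assumed: 2 blocks per set
SETS = CACHE_BLOCKS // WAYS    # 64 sets


def cache_set(j: int) -> int:
    """Set that main-memory block j maps into (either block of the set may hold it)."""
    return j % SETS


def split_address(addr: int) -> tuple:
    """Split a 16-bit word address into (tag, set, word) fields."""
    word = addr & ((1 << WORD_BITS) - 1)   # low 4 bits select the word
    block = addr >> WORD_BITS              # remaining 12 bits: memory block number
    return block // SETS, block % SETS, word


print([j for j in range(4096) if cache_set(j) == 0][-1])  # last block of set 0
print(split_address(0xFFFF))                               # (tag, set, word)
```

Setting WAYS to 1 turns this into the direct-mapped case, and setting it to 128 (a single set) gives the fully associative case.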
Calculating Block Size:
Write Hit Policies:
REPLACEMENT POLICY:
On a cache miss we need to evict a line to make room for the new line.
• In an A-way set-associative cache, we have A choices of which block to evict.
• Which block gets booted out?
  - random
  - least recently used (true LRU is too costly)
  - pseudo-LRU (approximated LRU: in the case of four-way set associativity, one bit keeps
    track of which pair of blocks is LRU, and then one bit per pair tracks which block
    in that pair is LRU)
  - fixed (processing audio stream)
For a two-way set-associative cache, random replacement has a miss rate about
1.1 times higher than LRU replacement. As caches become larger, the miss rates for
both replacement strategies fall, and the difference becomes small.
Random replacement is sometimes better than simple LRU approximations that can be
easily implemented in hardware.
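The four-way pseudo-LRU scheme described above (one bit for the pair, one bit per pair) can be sketched as follows. This is a minimal sketch of that 3-bit scheme for a single set; the class name and bit encoding are assumptions.

```python
class TreePLRU4:
    """Pseudo-LRU state for one 4-way set: 3 bits instead of a full LRU order."""

    def __init__(self) -> None:
        self.pair_bit = 0        # 0: pair {0, 1} is LRU; 1: pair {2, 3} is LRU
        self.way_bit = [0, 0]    # per pair: which way (0 or 1) in it is LRU

    def touch(self, way: int) -> None:
        """Record a hit or a fill of `way` (0..3)."""
        pair, within = divmod(way, 2)
        self.pair_bit = 1 - pair           # the other pair is now the LRU pair
        self.way_bit[pair] = 1 - within    # the other way of this pair is LRU

    def victim(self) -> int:
        """Way to evict on the next miss in this set."""
        pair = self.pair_bit
        return 2 * pair + self.way_bit[pair]


plru = TreePLRU4()
for way in (0, 1, 2, 3):
    plru.touch(way)
print(plru.victim())   # prints 0, the same choice as true LRU
plru.touch(0)
print(plru.victim())   # prints 2, although true LRU would evict way 1
```

The second eviction shows the approximation: after the accesses 0, 1, 2, 3, 0, true LRU would evict way 1, but the pair bit only remembers that pair {0, 1} was used more recently.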
WRITE MISS POLICY:
• Write allocate
  - allocate a new block on each write
  - fetch on write
      o fetch entire block, then write word into block
  - no-fetch
      o allocate block, but don't fetch
      o requires valid bits per word
      o more complex eviction
• Write no-allocate
  - don't allocate a block if it is not already in the cache
  - write around the cache
  - typically used by write-through, since we need to update main memory anyway
• Write invalidate
  - instead of update, for write-through
Measuring and Improving Cache Performance

Virtual memory
Virtual memory is a computer system technique which gives an application program the
impression that it has contiguous working memory (an address space), while in fact it
may be physically fragmented and may even overflow onto disk storage.
Virtual memory provides two primary functions:
1. Each process has its own address space, and thereby is not required to be relocated nor
required to use a relative addressing mode.
2. Each process sees one contiguous block of free memory upon launch.
Fragmentation is hidden.
All implementations (excluding emulators) require hardware support. This is typically in
the form of a Memory Management Unit built into the CPU.
Systems that use this technique make programming of large applications easier and use
real physical memory (e.g. RAM) more efficiently than those without virtual memory.
Virtual memory differs significantly from memory virtualization in that virtual memory
allows resources to be virtualized as memory for a specific system, as opposed to a large
pool of memory being virtualized as smaller pools for many different systems.
Note that "virtual memory" is more than just "using disk space to extend physical
memory size"; that is merely the extension of the memory hierarchy to include hard disk
drives. Extending memory to disk is a normal consequence of using virtual memory
techniques, but could be done by other means such as overlays or swapping programs and
their data completely out to disk while they are inactive. The definition of "virtual
memory" is based on redefining the address space with contiguous virtual memory
addresses to "trick" programs into thinking they are using large blocks of contiguous
addresses.
Paged virtual memory
Almost all implementations of virtual memory divide the virtual address space of an
application program into pages; a page is a block of contiguous virtual memory
addresses. Pages are usually at least 4 KB in size, and systems with large virtual
address ranges or large amounts of real memory (e.g. RAM) generally use larger page
sizes.
Page tables
Almost all implementations use page tables to translate the virtual addresses seen by the
application program into physical addresses (also referred to as "real addresses") used by
the hardware to process instructions. Each entry in the page table contains a mapping for
a virtual page to either the real memory address at which the page is stored, or an
indicator that the page is currently held in a disk file. (Although most do, some systems
may not support use of a disk file for virtual memory.)
Systems can have one page table for the whole system or a separate page table for each
application. If there is only one, different applications which are running at the same time
share a single virtual address space, i.e. they use different parts of a single range of
virtual addresses. Systems which use multiple page tables provide multiple virtual
address spaces: concurrent applications think they are using the same range of virtual
addresses, but their separate page tables redirect to different real addresses.
Dynamic address translation
If, while executing an instruction, a CPU fetches an instruction located at a particular
virtual address, fetches data from a specific virtual address, or stores data to a particular
virtual address, the virtual address must be translated to the corresponding physical
address. This is done by a hardware component, sometimes called a memory
management unit, which looks up the real address (from the page table) corresponding to
a virtual address and passes the real address to the parts of the CPU which execute
instructions.
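The lookup described above can be sketched with a one-level page table held in a dictionary. This sketch assumes 4 KB pages (so the low 12 bits of an address are the offset within the page); the table contents and names are hypothetical, and real MMUs use multi-level tables and a TLB.

```python
from typing import Dict, Optional

PAGE_SIZE = 4096   # assumed 4 KB pages: the low 12 bits are the offset in the page


class PageFault(Exception):
    """Raised when the page is not in real memory (e.g. it is held in a disk file)."""


def translate(vaddr: int, page_table: Dict[int, Optional[int]]) -> int:
    """Map a virtual address to a real address using a one-level page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # virtual page number and offset
    frame = page_table.get(vpn)              # None means "not resident"
    if frame is None:
        raise PageFault(vpn)                 # the paging supervisor takes over
    return frame * PAGE_SIZE + offset


page_table = {0: 5, 1: None}                 # hypothetical: page 1 is on disk
print(hex(translate(0x0123, page_table)))    # prints 0x5123
```

Only the page number is translated; the offset passes through unchanged, which is why page sizes are powers of two.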
Paging supervisor
This part of the operating system creates and manages the page tables. If the dynamic
address translation hardware raises a page fault exception, the paging supervisor searches
the page space on secondary storage for the page containing the required virtual address,
reads it into real physical memory, updates the page tables to reflect the new location of
the virtual address, and finally tells the dynamic address translation mechanism to start the
search again. Usually all of the real physical memory is already in use, and the paging
supervisor must first save an area of real physical memory to disk and update the page
table to say that the associated virtual addresses are no longer in real physical memory
but saved on disk. Paging supervisors generally save and overwrite areas of real physical
memory which have been least recently used, because these are probably the areas which
are used least often. So every time the dynamic address translation hardware matches a
virtual address with a real physical memory address, it must put a time-stamp in the page
table entry for that virtual address.
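The least-recently-used policy driven by time-stamps can be simulated on a page reference string. This is a minimal sketch, assuming a fixed number of page frames and a logical clock that ticks once per reference; the function name is hypothetical.

```python
def count_page_faults(refs, frames):
    """Simulate LRU page replacement using a time-stamp per resident page."""
    last_used = {}   # resident page -> time of its most recent reference
    faults = 0
    for time, page in enumerate(refs):
        if page not in last_used:
            faults += 1                     # page fault: page must be read in
            if len(last_used) == frames:    # no free frame: evict the LRU page
                victim = min(last_used, key=last_used.get)
                del last_used[victim]
        last_used[page] = time              # "put a time-stamp in the entry"
    return faults


refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
print(count_page_faults(refs, frames=3))    # prints 12
```

Scanning all time-stamps on every fault is expensive, which is one reason real systems approximate LRU instead.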
Permanently resident pages
All virtual memory systems have memory areas that are "pinned down", i.e. cannot be
swapped out to secondary storage, for example:
• Interrupt mechanisms generally rely on an array of pointers to the handlers for
various types of interrupt (I/O completion, timer event, program error, page fault,
etc.). If the pages containing these pointers or the code that they invoke were
pageable, interrupt handling would become even more complex and time-consuming,
and it would be especially difficult in the case of page fault interrupts.
• The page tables are usually not pageable.
• Data buffers that are accessed outside of the CPU, for example by peripheral
devices that use direct memory access (DMA) or by I/O channels. Usually such
devices and the buses (connection paths) to which they are attached use physical
memory addresses rather than virtual memory addresses. Even on buses with an
IOMMU, which is a special memory management unit that can translate virtual
addresses used on an I/O bus to physical addresses, the transfer cannot be stopped
if a page fault occurs and then restarted when the page fault has been processed.
So pages containing locations to which or from which a peripheral device is
transferring data are either permanently pinned down or pinned down while the
transfer is in progress.
• Timing-dependent kernel/application areas cannot tolerate the varying response
time caused by paging.
Figure 2: Address translation
• Compile time: If it is known in advance that a program will reside at a specific
location of main memory, then the compiler may be told to build the object code with
absolute addresses right away. For example, the boot sector in a bootable disk may be
compiled with the starting point of code set to 07C0:0000.
• Load time: It is rare that we know ahead of execution the location a program will be
assigned. In most cases, the compiler must generate relocatable code with logical
addresses. Thus the address translation may be performed on the code during load time.
Figure 3 shows that a program is loaded at location x. If the whole program resides in a
monolithic block, then every memory reference may be translated to a physical address
by adding x to it.
Figure 4: Example of fixed partitioning of a 64-Megabyte memory
However, fixed partitioning has two disadvantages:
• A program that is too big to be held in a partition needs some special design, called
overlay, which places a heavy burden on programmers. With overlay, a process consists
of several portions, each mapped to the same location of the partition, and
at any time only one portion may reside in the partition. When another portion is
referenced, the current portion will be switched out.
• A program may be much smaller than a partition, so the space left in the partition will be
wasted, which is referred to as internal fragmentation. As an improvement shown in
Figure 4 (b), unequal-size partitions may be configured in main memory so that small
programs occupy small partitions and big programs are also likely to be able to fit
into big partitions. Although this may solve the above problems with fixed, equal-size
partitioning to some degree, the fundamental weakness still exists: the number of
partitions is the maximum number of processes that could reside in main memory
at the same time. When most processes are small, the system should be able to
accommodate more of them but fails to do so due to this limitation. More flexibility is
needed.
Dynamic partitioning
To overcome the difficulties of fixed partitioning, partitioning may be done dynamically,
which is called dynamic partitioning. With it, the main memory portion for user applications is
initially a single contiguous block. When a new process is created, the exact amount of
memory space it needs is allocated to it. Similarly, when not enough space is available, a
process may be swapped out temporarily to release space for a new process. How
dynamic partitioning works is illustrated in Figure 5.
Figure 5: The effect of dynamic partitioning
As time goes on, many small holes appear in the main memory, which is
referred to as external fragmentation. Thus although much space is still available, it
cannot be allocated to new processes. A method for overcoming external fragmentation is
compaction. From time to time, the operating system moves the processes so that they
occupy contiguous sections and all of the small holes are brought together to make a big
block of space. The disadvantage of compaction is that the procedure is time-consuming and
requires relocation capability.
Address translation
Figure 6 shows the address translation procedure with dynamic partitioning, where the
processor provides hardware support for address translation, protection, and relocation.
Figure 6: Address translation with dynamic partitioning
The base register holds the entry point of the program, and may be added to a relative
address to generate an absolute address. The bounds register indicates the ending location
of the program, which is compared with each physical address generated. If the
latter is within bounds, then execution may proceed; otherwise, an interrupt is
generated, indicating illegal access to memory.
Relocation is easily supported by this mechanism: the new starting address and ending
address are assigned to the base register and the bounds register, respectively.
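The base/bounds check amounts to one addition and one comparison. A minimal sketch, assuming the bounds register holds the last valid location (inclusive) and modelling the interrupt as a Python exception; the names are hypothetical.

```python
class BoundsViolation(Exception):
    """Stands in for the interrupt raised on an illegal memory access."""


def to_absolute(relative: int, base: int, bounds: int) -> int:
    absolute = base + relative        # base register + relative address
    if absolute > bounds:             # bounds register: last location of the program
        raise BoundsViolation(absolute)
    return absolute


print(to_absolute(100, base=5000, bounds=5999))   # prints 5100
# Relocating the program only means loading new base and bounds values:
print(to_absolute(100, base=8000, bounds=8999))   # prints 8100
```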
Placement algorithm
Different strategies may be used to decide how space is allocated to processes:
• First fit: Allocate the first hole that is big enough. Searching may start either at
the beginning of the set of holes or where the previous first-fit search ended.
• Best fit: Allocate the smallest hole that is big enough. The entire list of holes must be
searched unless it is sorted by size. This strategy produces the smallest leftover hole.
• Worst fit: Allocate the largest hole. In contrast, this strategy aims to produce the largest
leftover hole, which may be big enough to hold another process. Experiments have
shown that both first fit and best fit are better than worst fit in terms of decreasing time
and storage utilization.
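The three placement strategies differ only in which of the fitting holes they pick. A minimal sketch, assuming the holes are given as a list of sizes in address order and that first fit always searches from the beginning (not from where the previous search ended):

```python
from typing import List, Optional


def choose_hole(holes: List[int], size: int, strategy: str) -> Optional[int]:
    """Return the index of the hole to allocate from, or None if nothing fits."""
    fits = [i for i, h in enumerate(holes) if h >= size]
    if not fits:
        return None
    if strategy == "first":
        return fits[0]                              # first hole big enough
    if strategy == "best":
        return min(fits, key=lambda i: holes[i])    # smallest leftover hole
    if strategy == "worst":
        return max(fits, key=lambda i: holes[i])    # largest leftover hole
    raise ValueError(f"unknown strategy: {strategy}")


holes = [100, 500, 200, 300, 600]                   # hypothetical hole sizes (KB)
for s in ("first", "best", "worst"):
    print(s, holes[choose_hole(holes, 212, s)])
```

For a 212 KB request, first fit takes the 500 KB hole, best fit the 300 KB hole, and worst fit the 600 KB hole.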
Handling a Page:
Translation Lookaside Buffer:

Integrating Virtual Memory, TLBs, and Caches
Implementing Protection with Virtual Memory
To enable the OS to implement protection in the VM system, the HW must:
1. Support at least two modes that indicate whether the running process is a user process
or an OS process (also called a kernel, supervisor, or executive process).
2. Provide a portion of the CPU state that a user process can read but not write (this includes
the supervisor mode bit).
3. Provide a mechanism whereby the CPU can go from user mode to supervisor mode
(accomplished by a system call exception) and vice versa (return from exception
instruction).
• Only an OS process can change page tables. Page tables are held in the OS address space,
thereby preventing a user process from changing them.
• When processes want to share information in a limited way, the operating system must assist them.
• The write access bit (in both the TLB and the page table) can be used to restrict the
sharing to just read sharing.
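The checks above can be sketched as a permission test applied on every access. This is a simplified sketch: the `user_ok` flag is a hypothetical stand-in for "this page belongs to the OS address space", and supervisor writes are assumed to honor the write bit as well, which real architectures do not all do.

```python
from dataclasses import dataclass


@dataclass
class PageEntry:
    frame: int
    writable: bool   # the write access bit
    user_ok: bool    # hypothetical flag: False for pages in the OS address space


class ProtectionFault(Exception):
    pass


def check_access(entry: PageEntry, is_write: bool, supervisor: bool) -> int:
    """Return the frame if the access is allowed, else raise ProtectionFault."""
    if not supervisor and not entry.user_ok:
        raise ProtectionFault("user access to an OS page")
    if is_write and not entry.writable:
        raise ProtectionFault("write to a read-only page")
    return entry.frame


shared = PageEntry(frame=7, writable=False, user_ok=True)       # read sharing only
table_page = PageEntry(frame=3, writable=True, user_ok=False)   # holds page tables
print(check_access(shared, is_write=False, supervisor=False))   # prints 7
```

A user process can read the shared page but not write it, and cannot touch the page holding the page tables at all; only supervisor mode can update them.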
Cache Misses
Computer data storage
Computer data storage, often called storage or memory, refers to computer
components, devices, and recording media that retain digital data used for computing for
some interval of time. Computer data storage provides one of the core functions of the
modern computer, that of information retention. It is one of the fundamental components
of all modern computers, and coupled with a central processing unit (CPU, a processor),
implements the basic computer model used since the 1940s.
In contemporary usage, memory usually refers to a form of semiconductor storage known
as random-access memory (RAM) and sometimes other forms of fast but temporary
storage. Similarly, storage today more commonly refers to mass storage: optical discs,
forms of magnetic storage like hard disk drives, and other types slower than RAM but of
a more permanent nature. Historically, memory and storage were respectively called
main memory and secondary storage. The terms internal memory and external memory
are also used.
The contemporary distinctions are helpful, because they are also fundamental to the
architecture of computers in general. The distinctions also reflect an important and
significant technical difference between memory and mass storage devices, which has
been blurred by the historical usage of the term storage. Nevertheless, this article uses the
traditional nomenclature.
Purpose of storage
Many different forms of storage, based on various natural phenomena, have been
invented. So far, no practical universal storage medium exists, and all forms of storage
have some drawbacks. Therefore a computer system usually contains several kinds of
storage, each with an individual purpose.
A digital computer represents data using the binary numeral system. Text, numbers,
pictures, audio, and nearly any other form of information can be converted into a string of
bits, or binary digits, each of which has a value of 1 or 0. The most common unit of
storage is the byte, equal to 8 bits. A piece of information can be handled by any
computer whose storage space is large enough to accommodate the binary representation
of the piece of information, or simply data. For example, using eight million bits, or about
one megabyte, a typical computer could store a short novel.
Traditionally the most important part of every computer is the central processing unit
(CPU, or simply a processor), because it actually operates on data, performs any
calculations, and controls all the other components.

In practice, almost all computers use a variety of memory types, organized in a storage
hierarchy around the CPU, as a tradeoff between performance and cost. Generally, the
lower a storage is in the hierarchy, the lower its bandwidth and the greater its access
latency from the CPU. This traditional division of storage into primary, secondary,
tertiary and off-line storage is also guided by cost per bit.
Hierarchy of storage
Secondary storage
Secondary storage (or external memory) differs from primary storage in that it is not
directly accessible by the CPU. The computer usually uses its input/output channels to
access secondary storage and transfers the desired data using an intermediate area in primary
storage. Secondary storage does not lose the data when the device is powered down: it is
non-volatile. Per unit, it is typically also an order of magnitude less expensive than
primary storage. Consequently, modern computer systems typically have an order of
magnitude more secondary storage than primary storage, and data is kept for a longer time
there.
In modern computers, hard disk drives are usually used as secondary storage. The time
taken to access a given byte of information stored on a hard disk is typically a few
thousandths of a second, or milliseconds. By contrast, the time taken to access a given
byte of information stored in random access memory is measured in billionths of a
second, or nanoseconds. This illustrates the very significant access-time difference which
distinguishes solid-state memory from rotating magnetic storage devices: hard disks are
typically about a million times slower than memory. Rotating optical storage devices,
such as CD and DVD drives, have even longer access times. With disk drives, once the
disk read/write head reaches the proper placement and the data of interest rotates under it,
subsequent data on the track are very fast to access. As a result, in order to hide the initial
seek time and rotational latency, data are transferred to and from disks in large
contiguous blocks.
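The benefit of large contiguous transfers can be estimated with the usual access-time breakdown: seek time, plus average rotational latency (half a revolution), plus transfer time. The drive parameters below (9 ms seek, 7200 RPM, 100 MB/s) are illustrative assumptions, not figures from the notes.

```python
def disk_access_ms(nbytes: int, seek_ms: float = 9.0, rpm: int = 7200,
                   transfer_mb_per_s: float = 100.0) -> float:
    """Seek + average rotational latency + transfer time, in milliseconds."""
    rotational_ms = 0.5 * 60_000 / rpm          # half a revolution on average
    transfer_ms = nbytes / (transfer_mb_per_s * 1e6) * 1000
    return seek_ms + rotational_ms + transfer_ms


one_block = disk_access_ms(1_048_576)      # one contiguous 1 MiB transfer
scattered = 256 * disk_access_ms(4096)     # the same data as 256 random 4 KiB reads
print(round(one_block, 1), round(scattered, 1))
```

Under these assumptions the scattered reads take over a hundred times longer, because each one pays the seek and rotational latency again while the actual transfer time is tiny.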
When data reside on disk, block access to hide latency offers a ray of hope in designing
efficient external memory algorithms. Sequential or block access on disks is orders of
magnitude faster than random access, and many sophisticated paradigms have been
developed to design efficient algorithms based upon sequential and block access.
Another way to reduce the I/O bottleneck is to use multiple disks in parallel in order to
increase the bandwidth between primary and secondary memory.
Some other examples of secondary storage technologies are: flash memory (e.g. USB
flash drives or keys), floppy disks, magnetic tape, paper tape, punched cards, standalone
RAM disks, and Iomega Zip drives.
Characteristics of storage
A 1 GB DDR RAM memory module
Storage technologies at all levels of the storage hierarchy can be differentiated by
evaluating certain core characteristics as well as measuring characteristics specific to a
particular implementation. These core characteristics are volatility, mutability,
accessibility, and addressability. For any particular implementation of any storage
technology, the characteristics worth measuring are capacity and performance.
Volatility
Non-volatile memory
Will retain the stored information even if it is not constantly supplied with electric
power. It is suitable for long-term storage of information. Nowadays it is used for
most secondary, tertiary, and off-line storage. In the 1950s and 1960s, it was also
used for primary storage, in the form of magnetic core memory.
Volatile memory
Requires constant power to maintain the stored information. The fastest memory
technologies of today are volatile ones (not a universal rule). Since primary
storage is required to be very fast, it predominantly uses volatile memory.
Differentiation
Dynamic random access memory
A form of volatile memory which also requires the stored information to be
periodically re-read and re-written, or refreshed, otherwise it would vanish.
Static memory
A form of volatile memory similar to DRAM with the exception that it never
needs to be refreshed.
Mutability
Read/write storage or mutable storage
Allows information to be overwritten at any time. A computer without some
amount of read/write storage for primary storage purposes would be useless for
many tasks. Modern computers typically use read/write storage also for secondary
storage.
Read only storage
Retains the information stored at the time of manufacture, and write once storage
(Write Once Read Many) allows the information to be written only once at some
point after manufacture. These are called immutable storage. Immutable storage
is used for tertiary and off-line storage. Examples include CD-ROM and CD-R.
Slow write, fast read storage
Read/write storage which allows information to be overwritten multiple times, but
with the write operation being much slower than the read operation. Examples
include CD-RW and flash memory.
Accessibility
Random access
Any location in storage can be accessed at any moment in approximately the same
amount of time. This characteristic is well suited for primary and secondary
storage.
Sequential access
Pieces of information are accessed in a serial order, one after the
other; therefore the time to access a particular piece of information depends upon
which piece of information was last accessed. This characteristic is typical of off-line
storage.
Addressability
Location-addressable
Each individually accessible unit of information in storage is selected with its
numerical memory address. In modern computers, location-addressable storage
is usually limited to primary storage, accessed internally by computer programs, since
location-addressability is very efficient but burdensome for humans.
File addressable
Information is divided into files of variable length, and a particular file is selected
with human-readable directory and file names. The underlying device is still
location-addressable, but the operating system of a computer provides the file
system abstraction to make the operation more understandable. In modern
computers, secondary, tertiary and off-line storage use file systems.
Content-addressable
Each individually accessible unit of information is selected on the basis of
(part of) the contents stored there. Content-addressable storage can be
implemented using software (computer program) or hardware (computer device),
with hardware being the faster but more expensive option. Hardware content-addressable
memory is often used in a computer's CPU cache.
Capacity
Raw capacity
The total amount of stored information that a storage device or medium can hold.
It is expressed as a quantity of bits or bytes (e.g. 10.4 megabytes).
Memory storage density
The compactness of stored information. It is the storage capacity of a medium
divided by a unit of length, area or volume (e.g. 1.2 megabytes per square inch).
Performance
Latency
The time it takes to access a particular location in storage. The relevant unit of
measurement is typically the nanosecond for primary storage, the millisecond for
secondary storage, and the second for tertiary storage. It may make sense to separate
read latency and write latency, and, in the case of sequential access storage, minimum,
maximum and average latency.
Throughput
The rate at which information can be read from or written to the storage. In
computer data storage, throughput is usually expressed in terms of megabytes per
second or MB/s, though bit rate may also be used. As with latency, read rate and
write rate may need to be differentiated. Also, accessing media sequentially, as
opposed to randomly, typically yields maximum throughput.
Magnetic
Magnetic storage uses different patterns of magnetization on a magnetically coated
surface to store information. Magnetic storage is non-volatile. The information is
accessed using one or more read/write heads which may contain one or more recording
transducers. A read/write head only covers a part of the surface, so the head or the
medium or both must be moved relative to each other in order to access data. In modern
computers, magnetic storage takes these forms:
• Magnetic disk
  o Floppy disk, used for off-line storage
  o Hard disk drive, used for secondary storage
• Magnetic tape data storage, used for tertiary and off-line storage
Hard Disk Technology
Diagram of a computer hard disk drive
HDDs record data by magnetizing ferromagnetic material directionally, to represent
either a 0 or a 1 binary digit. They read the data back by detecting the magnetization of
the material. A typical HDD design consists of a spindle that holds one or more flat
circular disks called platters, onto which the data is recorded. The platters are made from
a non-magnetic material, usually aluminum alloy or glass, and are coated with a thin
layer of magnetic material, typically 10-20 nm in thickness, with an outer layer of carbon
for protection.
The platters are spun at very high speeds. Information is written to a platter as it rotates
past devices called read-and-write heads that operate very close (tens of nanometers in
new drives) over the magnetic surface. The read-and-write head is used to detect and
modify the magnetization of the material immediately under it. There is one head for
each magnetic platter surface on the spindle, mounted on a common arm. An actuator
arm (or access arm) moves the heads on an arc (roughly radially) across the platters as
they spin, allowing each head to access almost the entire surface of the platter as it spins.
The arm is moved using a voice coil actuator or, in some older designs, a stepper motor.
The magnetic surface of each platter is conceptually divided into many small
sub-micrometre-sized magnetic regions, each of which is used to encode a single binary unit
of information. Initially the regions were oriented horizontally, but beginning about 2005,
the orientation was changed to perpendicular. Due to the polycrystalline nature of the
magnetic material, each of these magnetic regions is composed of a few hundred magnetic
grains. Magnetic grains are typically 10 nm in size and each form a single magnetic
domain. Each magnetic region in total forms a magnetic dipole which generates a highly
localized magnetic field nearby. A write head magnetizes a region by generating a strong
local magnetic field. Early HDDs used an electromagnet both to magnetize the region and
to then read its magnetic field by using electromagnetic induction. Later versions of
inductive heads included Metal in Gap (MIG) heads and thin film heads. As data density
increased, read heads using magnetoresistance (MR) came into use; the electrical
resistance of the head changed according to the strength of the magnetism from the
platter. Later development made use of spintronics; in these heads, the magnetoresistive
effect was much greater than in earlier types, and was dubbed "giant" magnetoresistance
(GMR). In today's heads, the read and write elements are separate, but in close proximity,
on the head portion of an actuator arm. The read element is typically magneto-resistive
while the write element is typically thin-film inductive.
HD heads are kept from contacting the platter surface by the air that is extremely close to
the platter; that air moves at, or close to, the platter speed. The record and playback heads
are mounted on a block called a slider, and the surface next to the platter is shaped to
keep it just barely out of contact. It is a type of air bearing.
In modern dri"es# the small si;e of the magnetic regions creates the danger that their
magnetic state might e lost ecause of thermal effects. To counter this# the platters are
coated with two parallel magnetic layers# separated y a I)atom)thic$ layer of the non)
magnetic element ruthenium# and the two layers are magneti;ed in opposite orientation#
thus reinforcing each other. !nother technology used to o"ercome thermal effects to
allow greater recording densities is perpendicular recording# first shipped in 788S# as of
788M the technology was used in many ,DDs.
The grain boundaries turn out to be very important in HDD design. The grains are very small and close to each other, so the coupling between adjacent grains is very strong. When one grain is magnetized, the adjacent grains tend to be aligned parallel to it or demagnetized, which degrades both the stability of the data and the signal-to-noise ratio. A clear grain boundary can weaken the coupling of the grains and subsequently increase the signal-to-noise ratio. In longitudinal recording, the single-domain grains have uniaxial anisotropy with easy axes lying in the film plane. The consequence of this arrangement is that adjacent magnets repel each other, so the magnetostatic energy is so large that it is difficult to increase areal density. Perpendicular recording media, on the other hand, have the easy axis of the grains oriented perpendicular to the disk plane. Adjacent magnets attract each other and the magnetostatic energy is much lower, so much higher areal density can be achieved in perpendicular recording.
Another unique feature of perpendicular recording is that a soft magnetic underlayer is incorporated into the recording disk. This underlayer conducts the writing magnetic flux so that writing is more efficient (this is discussed further under the writing process). Therefore, a higher-anisotropy medium film, such as L1₀-FePt and rare-earth magnets, can be used.
Error handling
Modern drives also make extensive use of Error Correcting Codes (ECCs), particularly Reed–Solomon error correction. These techniques store extra bits for each block of data that are determined by mathematical formulas. The extra bits allow many errors to be fixed. While these extra bits take up space on the hard drive, they allow higher recording densities to be employed, resulting in much larger storage capacity for user data. In 2009, in the newest drives, low-density parity-check codes (LDPC) are supplanting Reed–Solomon. LDPC codes enable performance close to the Shannon limit and thus allow for the highest storage density available.
Typical hard drives attempt to “remap” the data in a physical sector that is going bad to a spare physical sector, hopefully while the number of errors in that bad sector is still small enough that the ECC can completely recover the data without loss.
Architecture
A hard disk drive with the platters and motor hub removed, showing the copper-colored stator coils surrounding a bearing at the center of the spindle motor. The orange stripe along the side of the arm is a thin printed-circuit cable. The spindle bearing is in the center.
A typical hard drive has two electric motors, one to spin the disks and one to position the read/write head assembly. The disk motor has an external rotor attached to the platters; the stator windings are fixed in place. The actuator has a read-write head under the tip of its very end (near center); a thin printed-circuit cable connects the read-write heads to the hub of the actuator. A flexible, somewhat U-shaped ribbon cable, seen edge-on below and to the left of the actuator arm in the first image and more clearly in the second, continues the connection from the head to the controller board on the opposite side.
Capacity and access speed
PC hard disk drive capacity (in GB). The vertical axis is logarithmic, so the fit line corresponds to exponential growth.
Using rigid disks and sealing the unit allows much tighter tolerances than in a floppy disk drive. Consequently, hard disk drives can store much more data than floppy disk drives and can access and transmit it faster.
As of April 2009, the highest capacity consumer HDDs are 2 TB.
A typical “desktop HDD” might store between 120 GB and 2 TB, although rarely above 500 GB of data (based on US market data), rotate at 5,400 to 10,000 rpm, and have a media transfer rate of 1 Gbit/s or higher (1 GB = 10^9 B; 1 Gbit/s = 10^9 bit/s).
The fastest “enterprise” HDDs spin at 10,000 or 15,000 rpm, and can achieve sequential media transfer speeds above 1.6 Gbit/s and a sustained transfer rate up to 1 Gbit/s. Drives running at 10,000 or 15,000 rpm use smaller platters to mitigate increased power requirements (as they have less air drag) and therefore generally have lower capacity than the highest capacity desktop drives.
“Mobile HDDs”, i.e., laptop HDDs, are physically smaller than their desktop and enterprise counterparts and tend to be slower; because of their physically smaller platters, they also generally have lower capacity. A typical mobile HDD spins at 5,400 rpm, with 7,200 rpm models available for a slight price premium.
The exponential increases in disk space and data access speeds of HDDs have enabled the commercial viability of consumer products that require large storage capacities, such as digital video recorders and digital audio players.
The main way to decrease access time is to increase rotational speed, thus reducing rotational delay, while the main way to increase throughput and storage capacity is to increase areal density. Based on historic trends, analysts predict a future growth in HDD bit density (and therefore capacity) of about 40% per year. Access times have not kept up with throughput increases, which themselves have not kept up with growth in storage capacity.
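Average rotational delay follows directly from spindle speed: on average the wanted sector is half a revolution away from the head, so the delay is half the rotation period. A minimal Python sketch (seek time and transfer time deliberately ignored) for the spindle speeds quoted in this section:

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average rotational delay in milliseconds.

    On average the target sector is half a revolution away from the head,
    so the delay is half of one rotation period (60 / rpm seconds).
    """
    seconds_per_revolution = 60.0 / rpm
    return 0.5 * seconds_per_revolution * 1000.0


for rpm in (5_400, 7_200, 10_000, 15_000):
    print(f"{rpm:>6} rpm: {avg_rotational_latency_ms(rpm):.2f} ms")
```

Going from 5,400 to 15,000 rpm cuts the average rotational delay from about 5.56 ms to 2.00 ms.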
The first 3.5" HDD marketed as able to store 1 TB was the Hitachi Deskstar 7K1000. It contains five platters at approximately 200 GB each, providing 1 TB (931.5 GiB) of usable space; note the difference between its capacity in decimal units (1 TB = 10^12 bytes) and binary units (1 TiB = 1024 GiB = 2^40 bytes). Hitachi has since been joined by Samsung (Samsung SpinPoint F1, which has 3 × 334 GB platters), Seagate and Western Digital in the 1 TB drive market.
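The decimal/binary distinction can be checked with a few lines of Python. This sketch converts nominal decimal capacities only. For exactly 10^12 bytes it gives about 931.3 GiB, so the 931.5 GiB quoted for the Deskstar presumably reflects a drive holding slightly more than 10^12 bytes (an assumption, not stated in the text).

```python
TB = 10**12    # decimal terabyte: 10^12 bytes
GiB = 2**30    # binary gibibyte: 2^30 bytes
TiB = 2**40    # binary tebibyte: 1024 GiB = 2^40 bytes


def tb_to_gib(tb: float) -> float:
    """Express a capacity given in decimal terabytes in binary gibibytes."""
    return tb * TB / GiB


print(f"1 TB = {tb_to_gib(1):.1f} GiB")   # 1 TB = 931.3 GiB
print(f"1 TB = {TB / TiB:.3f} TiB")      # 1 TB = 0.909 TiB
```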
In September 2009, Showa Denko announced capacity improvements in platters that they manufacture for HDD makers. A single 2.5" platter is able to hold 334 GB worth of data, and preliminary results for 3.5" indicate a 750 GB per platter capacity.
Optical
Optical storage, the typical optical disc, stores information in deformities on the surface of a circular disc and reads this information by illuminating the surface with a laser diode and observing the reflection. Optical disc storage is non-volatile. The deformities may be permanent (read only media), formed once (write once media) or reversible (recordable or read/write media). The following forms are currently in common use:
CD, CD-ROM, DVD, BD-ROM: read only storage, used for mass distribution of digital information (music, video, computer programs)
CD-R, DVD-R, DVD+R, BD-R: write once storage, used for tertiary and off-line storage
CD-RW, DVD-RW, DVD+RW, DVD-RAM, BD-RE: slow write, fast read storage, used for tertiary and off-line storage
Ultra Density Optical or UDO is similar in capacity to BD-R or BD-RE and is slow write, fast read storage used for tertiary and off-line storage.
Magneto-optical disc storage is optical disc storage where the magnetic state on a ferromagnetic surface stores information. The information is read optically and written by combining magnetic and optical methods. Magneto-optical disc storage is non-volatile, sequential access, slow write, fast read storage used for tertiary and off-line storage.
A Compact Disc (also known as a CD) is an optical disc used to store digital data. It was originally developed to store sound recordings exclusively, but later it also allowed the preservation of other types of data. Audio CDs have been commercially available since October 1982. In 2009, they remain the standard physical storage medium for audio. Standard CDs have a diameter of 120 mm and can hold up to 80 minutes of uncompressed audio (700 MB of data). The Mini CD has various diameters ranging from 60 to 80 mm; they are sometimes used for CD singles or device drivers, storing up to 24 minutes of audio.
The technology was eventually adapted and expanded to encompass data storage CD-ROM, write-once audio and data storage CD-R, rewritable media CD-RW, Video Compact Discs (VCD), Super Video Compact Discs (SVCD), PhotoCD, PictureCD, CD-i, and Enhanced CD.
Physical details
Diagram of CD layers.
A. A polycarbonate disc layer has the data encoded by using bumps.
B. A shiny layer reflects the laser.
C. A layer of lacquer helps keep the shiny layer shiny.
D. Artwork is screen printed on the top of the disc.
E. A laser beam reads the CD and is reflected back to a sensor, which converts it into electronic data.
A CD is made from 1.2 mm thick, almost-pure polycarbonate plastic and weighs approximately 15–20 grams. From the center outward, the components are: the center (spindle) hole, the first-transition area (clamping ring), the clamping area (stacking ring), the second-transition area (mirror band), the information (data) area, and the rim.
A thin layer of aluminium or, more rarely, gold is applied to the surface to make it reflective, and is protected by a film of lacquer that is normally spin coated directly on top of the reflective layer, upon which the label print is applied. Common printing methods for CDs are screen-printing and offset printing.
CD data are stored as a series of tiny indentations known as “pits”, encoded in a spiral track molded into the top of the polycarbonate layer. The areas between pits are known as “lands”. Each pit is approximately 100 nm deep by 500 nm wide, and varies from 850 nm to 3.5 µm in length.
The distance between the tracks, the pitch, is 1.6 µm. A CD is read by focusing a 780 nm wavelength (near infrared) semiconductor laser through the bottom of the polycarbonate layer. The change in height between pits (actually ridges as seen by the laser) and lands results in a difference in intensity in the light reflected. By measuring the intensity change with a photodiode, the data can be read from the disc.
The pits and lands themselves do not directly represent the zeros and ones of binary data. Instead, non-return-to-zero, inverted (NRZI) encoding is used: a change from pit to land or land to pit indicates a one, while no change indicates a series of zeros. There must be at least two and no more than ten zeros between each one, which is defined by the length of the pit. This in turn is decoded by reversing the Eight-to-Fourteen Modulation used in mastering the disc, and then reversing the Cross-Interleaved Reed–Solomon Coding, finally revealing the raw data stored on the disc.
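The NRZI rule and the zero-run constraint can be sketched in Python. Assumptions: the track is sampled once per channel-bit period as 'P' (pit) or 'L' (land), and the surface before the first sample is a land. The EFM and CIRC decoding stages are not modeled.

```python
from __future__ import annotations


def nrzi_decode(surface: str, start: str = "L") -> list[int]:
    """Decode NRZI: a pit/land transition is a 1, no transition is a 0.

    `surface` holds one character per channel-bit period ('P' or 'L');
    `start` is the assumed surface state just before the first period.
    """
    bits = []
    previous = start
    for current in surface:
        bits.append(1 if current != previous else 0)
        previous = current
    return bits


def satisfies_run_length(bits: list[int], dmin: int = 2, dmax: int = 10) -> bool:
    """Check that every run of zeros between two ones has dmin..dmax zeros."""
    ones = [i for i, b in enumerate(bits) if b == 1]
    return all(dmin <= (b - a - 1) <= dmax for a, b in zip(ones, ones[1:]))


bits = nrzi_decode("LPPPLLLLPPP")
print(bits)                        # [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
print(satisfies_run_length(bits))  # True
```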
CDs are susceptible to damage from both daily use and environmental exposure. Pits are much closer to the label side of a disc, so that defects and dirt on the clear side can be out of focus during playback. Consequently, damage to the label side is more serious, whereas scratches on the clear side can be repaired by refilling them with similar refractive plastic, or by careful polishing. Initial music CDs were known to suffer from “CD rot”, or “laser rot”, in which the internal reflective layer degrades. When this occurs the CD may become unplayable.
Disc shapes and diameters
A Mini-CD is 8 centimetres in diameter.
The digital data on a CD begin at the center of the disc and proceed toward the edge, which allows adaptation to the different size formats available. Standard CDs are available in two sizes. By far the most common is 120 mm in diameter, with a 74- or 80-minute audio capacity and a 650 or 700 MB data capacity. This diameter has also been adopted by later formats, including Super Audio CD, DVD, HD DVD, and Blu-ray Disc. 80 mm discs (“Mini CDs”) were originally designed for CD singles and can hold up to 21 minutes of music or 184 MB of data, but never really became popular. Today, nearly every single is released on a 120 mm CD, called a Maxi single.
UNIT 5
Input/Output Organization
Input/Output Module
• Interface to CPU and memory
• Interface to one or more peripherals
Generic Model of IO Module
Interface for an IO Device:
• CPU checks I/O module device status
• I/O module returns status
• If ready, CPU requests data transfer
• I/O module gets data from device
• I/O module transfers data to CPU
Input Output Techniques
• Programmed
• Interrupt driven
• Direct Memory Access (DMA)
Programmed I/O
• CPU has direct control over I/O
  – Sensing status
  – Read/write commands
  – Transferring data
• CPU waits for I/O module to complete operation
• Wastes CPU time
• CPU requests I/O operation
• I/O module performs operation
• I/O module sets status bits
• CPU checks status bits periodically
• I/O module does not inform CPU directly
• I/O module does not interrupt CPU
• CPU may wait or come back later
• Under programmed I/O, data transfer is very like memory access (CPU viewpoint)
• Each device given unique identifier
• CPU commands contain identifier (address)
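A minimal Python simulation of programmed I/O, assuming a toy input interface whose data register and status flag borrow the DATAIN and SIN names used for the keyboard interface in this unit. The device timing (one character every `delay` polls) is hypothetical; the point is that the CPU alone drives the transfer and burns cycles while it waits.

```python
from __future__ import annotations


class InputInterface:
    """Toy input interface with a DATAIN data register and a SIN status flag.

    Assumptions (illustrative only): the device delivers one character every
    `delay` ticks, and reading DATAIN clears SIN.
    """

    def __init__(self, chars: list[int], delay: int = 3) -> None:
        self._pending = list(chars)
        self._delay = delay
        self._countdown = delay
        self.datain = 0
        self.sin = False                      # status bit: True = data available

    def has_pending(self) -> bool:
        return bool(self._pending)

    def tick(self) -> None:
        """Device side: runs on its own, independent of the CPU."""
        if self.sin or not self._pending:
            return
        self._countdown -= 1
        if self._countdown == 0:
            self.datain = self._pending.pop(0)
            self.sin = True                   # I/O module sets the status bit
            self._countdown = self._delay

    def read_datain(self) -> int:
        self.sin = False                      # reading the data clears the flag
        return self.datain


def programmed_read(dev: InputInterface, count: int) -> tuple[list[int], int]:
    """CPU side: busy-wait on SIN, then read DATAIN.

    Returns the characters read and the number of status checks that found
    no data (CPU time spent waiting).
    """
    data: list[int] = []
    wasted_polls = 0
    while len(data) < count:
        dev.tick()
        if dev.sin:                           # CPU checks the status bit
            data.append(dev.read_datain())
        elif not dev.has_pending():
            break                             # nothing more will ever arrive
        else:
            wasted_polls += 1
    return data, wasted_polls


data, wasted = programmed_read(InputInterface([0x41, 0x42, 0x43], delay=3), 3)
print(data, wasted)                           # [65, 66, 67] 6
```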
I/O Mapping
• Memory-mapped I/O
  – Devices and memory share an address space
  – I/O looks just like memory read/write
  – No special commands for I/O
    • Large selection of memory access commands available
• Isolated I/O
  – Separate address spaces
  – Need I/O or memory select lines
  – Special commands for I/O
    • Limited set
Memory-Mapped IO:
• Input and output buffers use the same address space as memory locations
• All instructions can access the buffer
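A toy address decoder makes the memory-mapped idea concrete. The addresses (RAM at 0x0000 to 0x0FFF, an output buffer DATAOUT at 0x1000) and the class name are hypothetical; the sketch only shows that one store operation reaches either memory or a device depending on the address.

```python
from __future__ import annotations


class MemoryMappedBus:
    """Toy address decoder for memory-mapped I/O (addresses are hypothetical).

    0x0000-0x0FFF is RAM; 0x1000 is the DATAOUT buffer of an output device.
    The CPU reaches both with the same load/store operations.
    """

    RAM_SIZE = 0x1000
    DATAOUT = 0x1000

    def __init__(self) -> None:
        self.ram = bytearray(self.RAM_SIZE)
        self.device_output: list[int] = []    # bytes received by the device

    def store(self, addr: int, value: int) -> None:
        if 0 <= addr < self.RAM_SIZE:         # decoder selects main memory...
            self.ram[addr] = value & 0xFF
        elif addr == self.DATAOUT:            # ...or the device's buffer register
            self.device_output.append(value & 0xFF)
        else:
            raise ValueError(f"nothing mapped at {addr:#06x}")

    def load(self, addr: int) -> int:
        if 0 <= addr < self.RAM_SIZE:
            return self.ram[addr]
        raise ValueError(f"{addr:#06x} is not readable")


bus = MemoryMappedBus()
bus.store(0x0010, 0x48)                                # ordinary memory write
bus.store(MemoryMappedBus.DATAOUT, bus.load(0x0010))   # same operation, now I/O
print(bus.device_output)                               # [72]
```

Under isolated I/O, DATAOUT would instead live in a separate I/O address space reached only through special I/O instructions.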
Interrupts
• Interrupt-request line
  – Interrupt-request signal
  – Interrupt-acknowledge signal
• Interrupt-service routine
  – Similar to a subroutine
  – May have no relationship to the program being executed at the time of the interrupt
• Program info must be saved
• Interrupt latency
Transfer of control through the use of interrupts
INTERRUPT HANDLING
Handling Interrupts
• Many situations where the processor should ignore interrupt requests
  – Interrupt-disable
  – Interrupt-enable
• Typical scenario
  – Device raises interrupt request
  – Processor interrupts program being executed
  – Processor disables interrupts and acknowledges interrupt
  – Interrupt-service routine executed
  – Interrupts enabled and program execution resumed
An equivalent circuit for an open-drain bus used to implement a common interrupt-request line.
Handling Multiple Devices
Interrupt Priority
• During execution of interrupt-service routine
  – Disable interrupts from devices at the same level priority or lower
  – Continue to accept interrupt requests from higher priority devices
  – Privileged instructions executed in supervisor mode
• Controlling device requests
  – Interrupt-enable
    • KEN, DEN
Polled interrupts: priority is determined by the order in which the processor polls the devices (polls their status registers).
Vectored interrupts: priority is determined by the order in which the processor tells a device to put its code on the address lines (order of connection in the chain).
Daisy chaining of INTA: if a device has not requested service, it passes the INTA signal to the next device; if it needs service, it does not pass INTA on, and puts its code on the address lines.
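Daisy-chained acknowledgement amounts to “the first requester along the chain wins”. A minimal sketch, assuming each device is described by a (requesting, code) pair ordered from the processor outward; the codes are hypothetical:

```python
from __future__ import annotations

from typing import Optional


def daisy_chain_inta(devices: list[tuple[bool, int]]) -> Optional[int]:
    """Propagate INTA along a daisy chain and return the code placed on the bus.

    devices[i] = (requesting, code) for the device at position i, where
    position 0 is electrically closest to the processor.
    """
    for requesting, code in devices:
        if requesting:
            return code      # this device keeps INTA and puts its code on the lines
        # not requesting: the device passes INTA on to the next one
    return None              # no device claimed the acknowledge


chain = [(False, 0x10), (True, 0x20), (True, 0x30)]
print(hex(daisy_chain_inta(chain)))  # 0x20: the closer requester wins
```

Priority is therefore fixed by electrical position: moving a device closer to the processor raises its priority.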
Multiple Interrupts
• Priority in Processor Status Word
  – Status Register: active program
  – Status Word: inactive program
• Changed only by privileged instruction
• Mode changes: automatic or by privileged instruction
• Interrupt enable/disable, by device, system-wide
Common Functions of Interrupts
• Interrupt transfers control to the interrupt service routine, generally through the interrupt vector table, which contains the addresses of all the service routines.
• Interrupt architecture must save the address of the interrupted instruction and the contents of the processor status register.
• Incoming interrupts are disabled while another interrupt is being processed to prevent a lost interrupt.
• A software-generated interrupt may be caused either by an error or a user request (sometimes called a trap).
• An operating system is interrupt driven.
Hardware interrupts: from I/O devices, memory, processor.
Software interrupts: generated by a program.
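The save, disable, dispatch and restore sequence listed above can be modeled in a few lines of Python. Everything here (the ToyCPU class, the `int_enable` bit, the handler functions and addresses) is hypothetical. Requests that arrive while interrupts are disabled are latched in `pending` rather than dropped, so none is lost.

```python
from __future__ import annotations

from typing import Callable


class ToyCPU:
    """Hypothetical CPU model showing interrupt entry, dispatch and return."""

    def __init__(self, vector_table: dict[int, Callable[[ToyCPU], None]]) -> None:
        self.pc = 0
        self.status = {"int_enable": True}        # stand-in for the status register
        self.saved: list[tuple[int, dict[str, bool]]] = []
        self.vector_table = vector_table
        self.pending: list[int] = []              # requests latched while disabled
        self.log: list[str] = []

    def request(self, irq: int) -> None:
        self.pending.append(irq)
        self._service_pending()

    def _service_pending(self) -> None:
        while self.status["int_enable"] and self.pending:
            irq = self.pending.pop(0)
            self.saved.append((self.pc, dict(self.status)))  # save PC and status
            self.status["int_enable"] = False     # block further interrupts
            self.log.append(f"enter ISR {irq} (interrupted pc={self.pc})")
            self.vector_table[irq](self)          # dispatch through the vector table
            self.pc, self.status = self.saved.pop()  # return: restore PC and status


def keyboard_isr(cpu: ToyCPU) -> None:
    cpu.pc = 0x400                                # the handler runs elsewhere
    cpu.request(2)                                # a second device interrupts meanwhile


def timer_isr(cpu: ToyCPU) -> None:
    cpu.pc = 0x500


cpu = ToyCPU({1: keyboard_isr, 2: timer_isr})
cpu.pc = 100
cpu.request(1)
print(cpu.log)   # IRQ 2 is serviced only after ISR 1 returns
print(cpu.pc)    # 100
```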
Direct Memory Access (DMA)
• Polling or interrupt-driven I/O incurs considerable overhead
  – Multiple program instructions
  – Saving program state
  – Incrementing memory addresses
  – Keeping track of word count
• Transfer large amounts of data at high speed without continuous intervention by the processor
• Special control circuit required in the I/O device interface, called a DMA controller
• DMA controller keeps track of memory locations, transfers directly to memory (via the bus) independent of the processor
Figure: Use of DMA controllers in a computer system
DMA Controller
• Part of the I/O device interface
  – DMA channels
• Performs functions that would normally be carried out by the processor
  – Provides memory address
  – Bus signals that control transfer
  – Keeps track of number of transfers
• Under control of the processor
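A minimal Python model of a DMA transfer, assuming a single channel with hypothetical address, word-count and done registers. The processor only programs the channel; the per-word bookkeeping that would otherwise cost program instructions is done by the controller.

```python
from __future__ import annotations


class DMAChannel:
    """Toy DMA channel; register names are illustrative, not from a real chip."""

    def __init__(self, memory: list[int]) -> None:
        self.memory = memory
        self.address = 0          # next memory address to write
        self.word_count = 0       # words still to transfer
        self.done = False         # status flag (would trigger an interrupt)
        self.bus_cycles = 0       # bus cycles used by the controller

    def program(self, start_address: int, word_count: int) -> None:
        """Processor side: a few register writes, then the CPU is free."""
        self.address = start_address
        self.word_count = word_count
        self.done = False

    def run(self, device_words: list[int]) -> None:
        """Controller side: move words from the device straight into memory."""
        for word in device_words[: self.word_count]:
            self.memory[self.address] = word   # controller supplies the address
            self.address += 1                  # and increments it itself
            self.word_count -= 1               # and keeps track of the count
            self.bus_cycles += 1
        self.done = self.word_count == 0


memory = [0] * 8
dma = DMAChannel(memory)
dma.program(start_address=2, word_count=3)
dma.run([7, 8, 9, 10])
print(memory, dma.done, dma.bus_cycles)   # [0, 0, 7, 8, 9, 0, 0, 0] True 3
```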
Bus arbitration
In a single-bus architecture, when more than one device requests the bus, a controller called the bus arbiter decides which device gets the bus; this is called bus arbitration.
Bus Master:
In computing, bus mastering is a feature supported by many bus architectures that enables a device connected to the bus to initiate transactions.
Bus arbitration is the procedure in bus communication that chooses between connected devices contending for control of the shared bus; the device currently in control of the bus is often termed the bus master. Devices may be allocated differing priority levels that will determine the choice of bus master in case of contention. A device that is not currently bus master must request control of the bus before attempting to initiate a data transfer via the bus. The normal protocol is that only one device may be bus master at any time and that all other devices act as slaves to this master. Only a bus master may initiate a normal data transfer on the bus; slave devices respond to commands issued by the current bus master by supplying data requested or accepting data sent.
Centralized arbitration
Distributed arbitration
Figure. A simple arrangement for bus arbitration using a daisy chain.
• The bus arbiter may be the processor or a separate unit connected to the bus.
• One bus-request line and one bus-grant line form a daisy chain.
• This arrangement leads to considerable flexibility in determining the order.
[Figure: processor, DMA controller 1 and DMA controller 2 sharing the bus-request line BR and the bus-busy line BBSY, with the grant passed along the chain as BG1, BG2]
Fig: Sequence of signals during transfer of mastership for the devices
[Timing diagram: BR, BG1, BG2, BBSY and the current bus master versus time]
Distributed Arbitration
[Figure: interface circuit for device A, driving the open-collector (O.C.) arbitration lines ARB0–ARB3, which are pulled up to Vcc]
• All devices have equal responsibility in carrying out the arbitration process.
• Each device on the bus is assigned an identification number.
• Requesting devices place their ID numbers on four open-collector lines.
• A winner is selected as a result.
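The bullets do not say how the winner is chosen. One common scheme, assumed here, treats the lines as a wired-OR: each contending device drives its ID, and a device that sees a 1 on a line where its own ID bit is 0 stops driving that line and all lower-order lines. The pattern that settles on the lines is the highest requesting ID. A Python sketch of that rule:

```python
from __future__ import annotations


def distributed_arbitration(ids: list[int], width: int = 4) -> int:
    """Self-selection on open-collector ARB lines (a sketch under assumptions).

    Assumed rules: a line reads 1 if any device drives it (wired-OR); a device
    that sees a 1 on a line where its own ID bit is 0 stops driving that line
    and every lower-order line. The pattern left on the lines is the winner.
    """
    driving = {dev: dev for dev in ids}          # bits each device still drives
    while True:
        lines = 0
        for pattern in driving.values():
            lines |= pattern                      # wired-OR of all drivers
        changed = False
        for dev in ids:
            for bit in range(width - 1, -1, -1):  # compare from the MSB down
                mine = (dev >> bit) & 1
                seen = (lines >> bit) & 1
                if seen and not mine:
                    kept = driving[dev] & ~((1 << (bit + 1)) - 1)
                    if kept != driving[dev]:
                        driving[dev] = kept       # withdraw this bit and below
                        changed = True
                    break
        if not changed:
            return lines


print(distributed_arbitration([5, 6]))   # 6: ID 0110 beats 0101
```

If the physical lines are active-low, the same logic applies to the asserted levels rather than to the voltages.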
Types of Bus
Synchronous Bus
• All devices derive timing information from a common clock line.
• Equally spaced pulses on this line define equal time intervals; each of these intervals constitutes a bus cycle during which one data transfer can take place.
Synchronous Bus Input Transfer
[Timing diagram: bus clock, address and command, and data over one bus cycle from t0 to t2]
Asynchronous Bus
• Data transfers on the bus are based on the use of a handshake between the master and the slave.
• The common clock is replaced by two timing control lines, Master-ready and Slave-ready.
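The full handshake for an input transfer can be modeled as two small state machines that react only to Master-ready and Slave-ready, with no shared clock. The state names and log messages are illustrative, and a dictionary stands in for the slave's storage:

```python
from __future__ import annotations


def async_read(address: int, memory: dict[int, int]) -> tuple[int, list[str]]:
    """Simulate one fully interlocked input transfer on an asynchronous bus."""
    bus = {"address": None, "master_ready": False, "slave_ready": False, "data": None}
    log: list[str] = []
    latched = None
    master, slave = "start", "idle"

    while not (master == "done" and slave == "idle"):
        # Master side
        if master == "start":
            bus["address"] = address
            bus["master_ready"] = True
            log.append("master: address and command placed, Master-ready asserted")
            master = "waiting"
        elif master == "waiting" and bus["slave_ready"]:
            latched = bus["data"]
            bus["master_ready"] = False
            bus["address"] = None
            log.append("master: data latched, Master-ready dropped, address removed")
            master = "done"
        # Slave side
        if slave == "idle" and bus["master_ready"]:
            bus["data"] = memory[bus["address"]]
            bus["slave_ready"] = True
            log.append("slave: data placed on bus, Slave-ready asserted")
            slave = "responding"
        elif slave == "responding" and not bus["master_ready"]:
            bus["data"] = None
            bus["slave_ready"] = False
            log.append("slave: data removed, Slave-ready dropped")
            slave = "idle"

    return latched, log


value, log = async_read(0x20, {0x20: 99})
for line in log:
    print(line)
print(value)   # 99
```

Each side waits for the other's signal before its next step, which is what lets devices of different speeds share the bus.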
Figure: Handshake control of data transfer during an input operation
[Timing diagram: address and command, Master-ready, Slave-ready and data over one bus cycle, with events at t0 through t5]
Figure: Handshake control of data transfer during an output operation
[Timing diagram: address and command, data, Master-ready and Slave-ready over one bus cycle, with events at t0 through t5]
INTERFACE CIRCUITS
• Circuitry required to connect an I/O device to a computer bus
• Provides a storage buffer for at least one word of data
• Contains a status flag that can be accessed by the processor
• Contains address-decoding circuitry
• Generates the appropriate timing signals required by the bus control scheme
• Performs format conversions
• Ports
  – Serial port
  – Parallel port
Figure: Keyboard to processor connection
[Figure: keyboard switches feed an encoder and debouncing circuit, which passes Data and Valid to the input interface (DATAIN, SIN); the interface connects to the processor through the Data, Address, R/W, Master-ready and Slave-ready lines]
INPUT INTERFACE CIRCUIT
[Figure: input interface circuit showing the DATAIN register (keyboard data D7–D0 to Q7–Q0), the SIN status flag, an address decoder (A31–A1, A0), and the Valid, Read-data, Read-status, R/W, Master-ready and Slave-ready signals]
Figure. An example of a computer system using different interface standards.
[Figure: processor and main memory connected through a bridge to the PCI bus, which carries additional memory, a SCSI controller (SCSI bus with a disk controller for disks 1 and 2 and a CD-ROM controller), an Ethernet interface, a USB controller (video, keyboard, game) and an ISA interface (IDE disk)]
PCI (Peripheral Component Interconnect)
• PCI stands for Peripheral Component Interconnect
• Introduced in 1992
• It is a low-cost bus
• It is processor independent
• It has plug-and-play capability
PCI bus transactions
PCI bus traffic is made of a series of PCI bus transactions. Each transaction is made up of an address phase followed by one or more data phases. The direction of the data phases may be from initiator to target (write transaction) or vice versa (read transaction), but all of the data phases must be in the same direction. Either party may pause or halt the data phases at any point. (One common example is a low-performance PCI device that does not support burst transactions, and always halts a transaction after the first data phase.)
Any PCI device may initiate a transaction. First, it must request permission from a PCI bus arbiter on the motherboard. The arbiter grants permission to one of the requesting devices. The initiator begins the address phase by broadcasting a 32-bit address plus a 4-bit command code, then waits for a target to respond. All other devices examine this address and one of them responds a few cycles later.
64-bit addressing is done using a 2-stage address phase. The initiator broadcasts the low 32 address bits, accompanied by a special “dual address cycle” command code. Devices which do not support 64-bit addressing can simply not respond to that command code. The next cycle, the initiator transmits the high 32 address bits, plus the real command code. The transaction operates identically from that point on. To ensure compatibility with 32-bit PCI devices, it is forbidden to use a dual address cycle if not necessary, i.e. if the high-order address bits are all zero.
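Splitting an address into one or two address cycles can be sketched as follows. The 4-bit command values used here (0110 for memory read, 1101 for the dual address cycle) are an assumption to check against the PCI specification's command table.

```python
from __future__ import annotations

MEMORY_READ = 0b0110          # assumed PCI command codes
DUAL_ADDRESS_CYCLE = 0b1101


def address_cycles(address: int, command: int) -> list[tuple[int, int]]:
    """Return the (AD[31:0], C/BE[3:0]#) values driven in each address cycle."""
    if not 0 <= address < 1 << 64:
        raise ValueError("address must fit in 64 bits")
    low = address & 0xFFFF_FFFF
    high = address >> 32
    if high == 0:
        return [(low, command)]              # DAC is forbidden when not needed
    return [(low, DUAL_ADDRESS_CYCLE), (high, command)]


print([(hex(ad), bin(cbe)) for ad, cbe in address_cycles(0x1_2345_6780, MEMORY_READ)])
# [('0x23456780', '0b1101'), ('0x1', '0b110')]
```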
While the PCI bus transfers 32 bits per data phase, the initiator transmits a 4-bit byte mask indicating which 8-bit bytes are to be considered significant. In particular, a masked write must affect only the desired bytes in the target PCI device.
Arbitration
Any device on a PCI bus that is capable of acting as a bus master may initiate a transaction with any other device. To ensure that only one transaction is initiated at a time, each master must first wait for a bus grant signal, GNT#, from an arbiter located on the motherboard. Each device has a separate request line REQ# that requests the bus, but the arbiter may “park” the bus grant signal at any device if there are no current requests. The arbiter may remove GNT# at any time. A device which loses GNT# may complete its current transaction, but may not start one (by asserting FRAME#) unless it observes GNT# asserted the cycle before it begins.
The arbiter may also provide GNT# at any time, including during another master's transaction. During a transaction, either FRAME# or IRDY# or both are asserted; when both are deasserted, the bus is idle. A device may initiate a transaction at any time that GNT# is asserted and the bus is idle.
Address phase
A PCI bus transaction begins with an address phase. The initiator, seeing that it has GNT# and the bus is idle, drives the target address onto the AD[31:0] lines, the associated command (e.g. memory read, or I/O write) on the C/BE[3:0]# lines, and pulls FRAME# low.
Each other device examines the address and command and decides whether to respond as the target by asserting DEVSEL#. A device must respond by asserting DEVSEL# within 3 cycles. Devices which promise to respond within 1 or 2 cycles are said to have “fast DEVSEL” or “medium DEVSEL”, respectively. (Actually, the time to respond is 2.5 cycles, since PCI devices must transmit all signals half a cycle early so that they can be received three cycles later.)
Note that a device must latch the address on the first cycle; the initiator is required to remove the address and command from the bus on the following cycle, even before receiving a DEVSEL# response. The additional time is available only for interpreting the address and command after it is captured.
On the fifth cycle of the address phase (or earlier if all other devices have medium DEVSEL or faster), a catch-all “subtractive decoding” is allowed for some address ranges. This is commonly used by an ISA bus bridge for addresses within its range (24 bits for memory and 16 bits for I/O).
On the sixth cycle, if there has been no response, the initiator may abort the transaction by deasserting FRAME#. This is known as master abort termination and it is customary for PCI bus bridges to return all-ones data (0xFFFFFFFF) in this case. PCI devices therefore are generally designed to avoid using the all-ones value in important status registers, so that such an error can be easily detected by software.
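Software-side detection of a master abort can be sketched like this. `bus_read`, `probe` and the register value are hypothetical, and the dictionary stands in for the set of targets that would assert DEVSEL#.

```python
from __future__ import annotations

from functools import partial
from typing import Callable, Optional

ALL_ONES = 0xFFFF_FFFF


def bus_read(targets: dict[int, int], address: int) -> int:
    """Simulated read: a bridge returns all ones when no target claims the cycle."""
    return targets.get(address, ALL_ONES)     # no DEVSEL# -> master abort


def probe(read: Callable[[int], int], address: int) -> Optional[int]:
    """Software check: treat an all-ones value as 'no device responded'."""
    value = read(address)
    return None if value == ALL_ONES else value


targets = {0x100: 0x1234_0001}                 # hypothetical register contents
read = partial(bus_read, targets)
print(hex(probe(read, 0x100)), probe(read, 0x200))   # 0x12340001 None
```

This check only works because, as noted above, devices avoid using the all-ones value in important status registers.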
Address phase timing
On the rising edge of clock 0, the initiator observes FRAME# and IRDY# both high, and GNT# low, so it drives the address, command, and asserts FRAME# in time for the rising edge of clock 1. Targets latch the address and begin decoding it. They may respond with DEVSEL# in time for clock 2 (fast DEVSEL), 3 (medium) or 4 (slow). Subtractive decode devices, seeing no other response by clock 4, may respond on clock 5. If the master does not see a response by clock 5, it will terminate the transaction and remove FRAME# on clock 6.
TRDY# and STOP# are deasserted (high) during the address phase. The initiator may assert IRDY# as soon as it is ready to transfer data, which could theoretically be as soon as clock 2.
Data phases
After the address phase (specifically, beginning with the cycle that DEVSEL# goes low) comes a burst of one or more data phases. In all cases, the initiator drives active-low byte select signals on the C/BE[3:0]# lines, but the data on AD[31:0] may be driven by the initiator (in case of writes) or the target (in case of reads).
During data phases, the C/BE[3:0]# lines are interpreted as active-low byte enables. In case of a write, the asserted signals indicate which of the four bytes on the AD bus are to be written to the addressed location. In the case of a read, they indicate which bytes the initiator is interested in. For reads, it is always legal to ignore the byte enable signals and simply return all 32 bits; cacheable memory resources are required to always return 32 valid bits. The byte enables are mainly useful for I/O space accesses where reads have side effects.
A data phase with all four C/BE# lines deasserted is explicitly permitted by the PCI standard, and must have no effect on the target (other than to advance the address in the burst access in progress).
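Applying active-low byte enables to a 32-bit write can be sketched in Python. The mapping of bit i of the mask to byte lane i (AD bits 8i to 8i+7) is assumed here; the all-ones mask reproduces the permitted no-op data phase.

```python
def apply_masked_write(old: int, data: int, cbe_n: int) -> int:
    """Merge a 32-bit write into `old` using active-low byte enables.

    Bit i of cbe_n models C/BE[i]#: 0 (asserted) means byte lane i,
    bits 8*i .. 8*i+7 of AD[31:0], is written. 0b1111 writes nothing.
    """
    result = old & 0xFFFF_FFFF
    for lane in range(4):
        if ((cbe_n >> lane) & 1) == 0:        # active low: 0 means enabled
            mask = 0xFF << (8 * lane)
            result = (result & ~mask) | (data & mask)
    return result & 0xFFFF_FFFF


print(hex(apply_masked_write(0x11223344, 0xAABBCCDD, 0b1100)))   # 0x1122ccdd
print(hex(apply_masked_write(0x11223344, 0xAABBCCDD, 0b1111)))   # 0x11223344
```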
The data phase continues until both parties are ready to complete the transfer and continue to the next data phase. The initiator asserts IRDY# (initiator ready) when it no longer needs to wait, while the target asserts TRDY# (target ready). Whichever side is providing the data must drive it on the AD bus before asserting its ready signal.
Once one of the participants asserts its ready signal, it may not become un-ready or otherwise alter its control signals until the end of the data phase. The data recipient must latch the AD bus each cycle until it sees both IRDY# and TRDY# asserted, which marks the end of the current data phase and indicates that the just-latched data is the word to be transferred.
To maintain full burst speed, the data sender then has half a clock cycle after seeing both IRDY# and TRDY# asserted to drive the next word onto the AD bus.
This example continues the address phase timing described above, assuming a single address cycle with medium DEVSEL, so the target responds in time for clock 3. However, at that time, neither side is ready to transfer data. For clock 4, the initiator is ready, but the target is not. On clock 5, both are ready, and a data transfer takes place (as indicated by the vertical lines). For clock 6, the target is ready to transfer, but the initiator is not. On clock 7, the initiator becomes ready, and data is transferred. For clocks 8 and 9, both sides remain ready to transfer data, and data is transferred at the maximum possible rate (32 bits per clock cycle).
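The clock-by-clock example above can be replayed in Python, assuming one sample of each active-low ready signal per clock (0 = asserted). The second function checks the rule that a side may not become un-ready once it has asserted its ready signal.

```python
from __future__ import annotations

# Active-low samples for clocks 3..9 of the example (0 = asserted).
CLOCKS = [3, 4, 5, 6, 7, 8, 9]
IRDY_N = [1, 0, 0, 1, 0, 0, 0]
TRDY_N = [1, 1, 0, 0, 0, 0, 0]


def transfer_clocks(clocks: list[int], irdy_n: list[int], trdy_n: list[int]) -> list[int]:
    """A word moves on every clock where both IRDY# and TRDY# are asserted."""
    return [c for c, i, t in zip(clocks, irdy_n, trdy_n) if i == 0 and t == 0]


def ready_rule_holds(irdy_n: list[int], trdy_n: list[int]) -> bool:
    """Once a side asserts its ready signal, it must hold it until a transfer."""
    for k in range(len(irdy_n) - 1):
        if irdy_n[k] == 0 and trdy_n[k] == 0:
            continue                      # data phase ended; both sides are free
        for signal in (irdy_n, trdy_n):
            if signal[k] == 0 and signal[k + 1] == 1:
                return False              # became un-ready mid data phase
    return True


moved = transfer_clocks(CLOCKS, IRDY_N, TRDY_N)
print(moved, len(moved) * 32, "bits")     # [5, 7, 8, 9] 128 bits
print(ready_rule_holds(IRDY_N, TRDY_N))   # True
```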
In case of a read, clock 2 is reserved for turning around the AD bus, so the target is not permitted to drive data on the bus even if it is capable of fast DEVSEL.
Fast DEVSEL# on reads
A target that supports fast DEVSEL could in theory begin responding to a read the cycle after the address is presented. This cycle is, however, reserved for AD bus turnaround. Thus, a target may not drive the AD bus (and thus may not assert TRDY#) on the second cycle of a transaction. Note that most targets will not be this fast and will not need any special logic to enforce this condition.
Ending transactions
Either side may request that a burst end after the current data phase. Simple PCI devices that do not support multi-word bursts will always request this immediately. Even devices that do support bursts will have some limit on the maximum length they can support, such as the end of their addressable memory.
The initiator can mark any data phase as the final one in a transaction by deasserting FRAME# at the same time as it asserts IRDY#. The cycle after the target asserts TRDY#, the final data transfer is complete, both sides deassert their respective RDY# signals, and the bus is idle again. The master may not deassert FRAME# before asserting IRDY#, nor may it deassert FRAME# while waiting, with IRDY# asserted, for the target to assert TRDY#.
The only minor exception is a master abort termination, when no target responds with DEVSEL#. Obviously, it is pointless to wait for TRDY# in such a case. However, even in this case, the master must assert IRDY# for at least one cycle after deasserting FRAME#. (Commonly, a master will assert IRDY# before receiving DEVSEL#, so it must simply hold IRDY# asserted for one cycle longer.) This is to ensure that bus turnaround timing rules are obeyed on the FRAME# line.
The target requests the initiator end a burst by asserting STOP#. The initiator will then end the transaction by deasserting FRAME# at the next legal opportunity. If it wishes to transfer more data, it will continue in a separate transaction. There are several ways to do this:
Disconnect with data
If the target asserts STOP# and TRDY# at the same time, this indicates that the target wishes this to be the last data phase. For example, a target that does not support burst transfers will always do this to force single-word PCI transactions. This is the most efficient way for a target to end a burst.
Disconnect without data
If the target asserts STOP# without asserting TRDY#, this indicates that the target wishes to stop without transferring data. STOP# is considered equivalent to TRDY# for the purpose of ending a data phase, but no data is transferred.
Retry
A Disconnect without data before transferring any data is a retry, and unlike other PCI transactions, PCI initiators are required to pause slightly before continuing the operation. See the PCI specification for details.
Target abort
Normally, a target holds DEVSEL# asserted through the last data phase. However, if a target deasserts DEVSEL# before disconnecting without data (asserting STOP#), this indicates a target abort, which is a fatal error condition. The initiator may not retry, and typically treats it as a bus error. Note that a target may not deassert DEVSEL# while waiting with TRDY# or STOP# low; it must do this at the beginning of a data phase.
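The four target-initiated endings can be summarized as a small classifier. The function, its inputs (active-low levels sampled at the end of a data phase, plus a count of words already moved) and the enum are hypothetical, and protocol details such as the retry delay are not modeled.

```python
from enum import Enum


class Termination(Enum):
    NONE = "no target termination"
    DISCONNECT_WITH_DATA = "disconnect with data"
    DISCONNECT_WITHOUT_DATA = "disconnect without data"
    RETRY = "retry"
    TARGET_ABORT = "target abort"


def classify(stop_n: int, trdy_n: int, devsel_n: int, words_transferred: int) -> Termination:
    """Classify what the target signals at the end of a data phase.

    Inputs are active-low levels (0 = asserted). words_transferred counts the
    words already moved in this transaction before the current data phase.
    """
    if stop_n == 1:
        return Termination.NONE
    if devsel_n == 1:
        if trdy_n == 0:
            raise ValueError("TRDY# asserted without DEVSEL#")
        return Termination.TARGET_ABORT          # DEVSEL# dropped, STOP# asserted
    if trdy_n == 0:
        return Termination.DISCONNECT_WITH_DATA  # last word moves in this phase
    if words_transferred == 0:
        return Termination.RETRY                 # stopped before any data moved
    return Termination.DISCONNECT_WITHOUT_DATA


print(classify(stop_n=0, trdy_n=1, devsel_n=0, words_transferred=0).value)   # retry
```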
After seeing STOP#, the initiator will terminate the transaction at the next legal opportunity, but if it has already signaled its desire to continue a burst (by asserting IRDY# without deasserting FRAME#), it is not permitted to deassert FRAME# until the following data phase. A target that requests a burst end (by asserting STOP#) may have to wait through another data phase (holding STOP# asserted without TRDY#) before the transaction can end.
Table 4.3: Data transfer signals on the PCI bus.
Read operation on the PCI Bus
[Timing diagram, clock cycles 1–7: CLK, FRAME#, AD (address, then data words #1–#4), C/BE# (command, then byte enables), IRDY#, TRDY#, DEVSEL#]
Read operation showing the role of the IRDY# and TRDY# signals
[Timing diagram, clock cycles 1–9: CLK, FRAME#, AD (address, then data words #1–#4), C/BE# (command, then byte enables), IRDY#, TRDY#, DEVSEL#]
SCSI Bus
• Defined by ANSI X3.131
• Small Computer System Interface
• 50, 68 or 80 pins
• Max. transfer rate: 160 MB/s, 320 MB/s
%C%I =us %ignals
USB
• Universal Serial Bus
• Speed
  • Low-speed (1.5 Mb/s)
  • Full-speed (12 Mb/s)
  • High-speed (480 Mb/s)
• Port Limitation
• Device Characteristics
• Plug-and-play
Universal Serial Bus Tree Structure
USB (Universal Serial Bus) is a specification to establish communication between
devices and a host controller (usually personal computers). USB is intended to replace
many varieties of serial and parallel ports. USB can connect computer peripherals such as
mice, keyboards, digital cameras, printers, personal media players, flash drives, and
external hard drives. For many of those devices, USB has become the standard
connection method. USB was designed for personal computers, but it has
become commonplace on other devices such as smartphones, PDAs and video game
consoles, and as a power cord between a device and an AC adapter plugged into a wall
plug for charging. As of 2008, there are about 2 billion USB devices sold per year, and
approximately 6 billion total sold to date.
[Figure: USB tree structure. The host computer contains the root hub; hubs branch from the root hub, and each hub connects further hubs or I/O devices.]
The design of USB is standardized by the USB Implementers Forum (USB-IF), an
industry standards body incorporating leading companies from the computer and
electronics industries. Notable members have included Agere (now merged with LSI
Corporation), Apple Inc., Hewlett-Packard, Intel, Microsoft and NEC.
Split Bus Operation
Signaling
USB supports the following signaling rates:
• A low speed rate of 1.5 Mbit/s is defined by USB 1.0. It is very similar to "full
speed" operation except each bit takes 8 times as long to transmit. It is intended
primarily to save cost in low-bandwidth human interface devices (HID) such as
keyboards, mice, and joysticks.
• The full speed rate of 12 Mbit/s is the basic USB data rate defined by USB 1.1.
All USB hubs support full speed.
• A hi-speed (USB 2.0) rate of 480 Mbit/s was introduced in 2001. All hi-speed
devices are capable of falling back to full-speed operation if necessary; they are
backward compatible. Connectors are identical.
• A SuperSpeed (USB 3.0) rate of 5.0 Gbit/s. The USB 3.0 specification was
released by Intel and partners in August 2008, according to early reports from
CNET news. The first USB 3 controller chips were sampled by NEC in May 2009[11],
and products using the 3.0 specification are expected to arrive beginning in Q3
2009 and 2010.[12] USB 3.0 connectors are generally backwards compatible, but
include new wiring and full duplex operation. There is some incompatibility with
older connectors.
USB signals are transmitted on a braided pair data cable with 90 Ω ±15% characteristic
impedance,[13] labeled D+ and D−. Prior to USB 3.0, these collectively use half-duplex
differential signaling to reduce the effects of electromagnetic noise on longer lines.
Transmitted signal levels are 0.0 to 0.3 volts for low and 2.8 to 3.6 volts for high in full speed
(FS) and low speed (LS) modes, and −10 to 10 mV for low and 360 to 440 mV for high in hi-speed
(HS) mode. In FS mode the cable wires are not terminated, but the HS mode has
termination of 45 Ω to ground, or 90 Ω differential to match the data cable impedance,
reducing interference of particular kinds. USB 3.0 cables add two additional pairs of
shielded twisted wire, with new, mostly interoperable contacts for them. These pairs
permit the higher data rate and full duplex operation.
A USB connection is always between a host or hub at the "A" connector end, and a
device or hub's "upstream" port at the other end. Originally, this was a "B" connector,
preventing erroneous loop connections, but additional upstream connectors were
specified, and some cable vendors designed and sold cables which permitted erroneous
connections (and potential damage to the circuitry). USB interconnections are not as
foolproof or as simple as originally intended.
The host includes 15 kΩ pull-down resistors on each data line. When no device is
connected, this pulls both data lines low into the so-called "single-ended zero" state (SE0
in the USB documentation), and indicates a reset or disconnected connection.
A USB device pulls one of the data lines high with a 1.5 kΩ resistor. This overpowers
one of the pull-down resistors in the host and leaves the data lines in an idle state called
"J". For USB 1.x, the choice of data line indicates a device's speed support; full-speed
devices pull D+ high, while low-speed devices pull D− high.
USB data is transmitted by toggling the data lines between the J state and the opposite K
state. USB encodes data using the NRZI convention; a 0 bit is transmitted by toggling the
data lines from J to K or vice-versa, while a 1 bit is transmitted by leaving the data lines
as-is. To ensure a minimum density of signal transitions, USB uses bit stuffing; an extra 0
bit is inserted into the data stream after any appearance of six consecutive 1 bits. Seven
consecutive 1 bits is always an error. USB 3.0 has introduced additional data
transmission encodings.
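The bit stuffing and NRZI rules above can be sketched in Python as follows. This is a minimal illustration assuming bits are given in transmission order and line states are represented by the strings "J" and "K"; the function names are hypothetical. Stuffing is applied before NRZI encoding, so every stuffed 0 produces a guaranteed transition.

```python
def bit_stuff(bits: list[int]) -> list[int]:
    """Insert a 0 after every run of six consecutive 1 bits."""
    out: list[int] = []
    ones = 0
    for b in bits:
        out.append(b)
        ones = ones + 1 if b == 1 else 0
        if ones == 6:
            out.append(0)  # stuffed bit guarantees an NRZI transition
            ones = 0
    return out


def bit_unstuff(bits: list[int]) -> list[int]:
    """Remove stuffed 0 bits; seven consecutive 1 bits is an error."""
    out: list[int] = []
    ones = 0
    expect_stuffed_zero = False
    for b in bits:
        if expect_stuffed_zero:
            if b != 0:
                raise ValueError("bit stuff violation: seven consecutive 1 bits")
            expect_stuffed_zero = False
            ones = 0
            continue  # drop the stuffed bit
        out.append(b)
        ones = ones + 1 if b == 1 else 0
        if ones == 6:
            expect_stuffed_zero = True
    return out


def nrzi_encode(bits: list[int], start: str = "J") -> str:
    """NRZI: a 0 toggles between J and K, a 1 leaves the line unchanged."""
    state, out = start, []
    for b in bits:
        if b == 0:
            state = "K" if state == "J" else "J"
        out.append(state)
    return "".join(out)


def nrzi_decode(states: str, start: str = "J") -> list[int]:
    """Inverse of nrzi_encode: no change means 1, a change means 0."""
    prev, out = start, []
    for s in states:
        out.append(1 if s == prev else 0)
        prev = s
    return out
```

Starting from the idle J state, `nrzi_encode([0, 0, 0, 0, 0, 0, 0, 1])` gives "KJKJKJKK", which is the sync pattern every packet starts with.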
A USB packet begins with an 8-bit synchronization sequence 00000001. That is, after the
initial idle state J, the data lines toggle KJKJKJKK. The final 1 bit (repeated K state)
marks the end of the sync pattern and the beginning of the USB frame.
A USB packet's end, called EOP (end-of-packet), is indicated by the transmitter driving 2
bit times of SE0 (D+ and D− both below max) and 1 bit time of J state. After this, the
transmitter ceases to drive the D+/D− lines and the aforementioned pull-up resistors hold
it in the J (idle) state. Sometimes skew due to hubs can add as much as one bit time
before the SE0 of the end of packet. This extra bit can also result in a "bit stuff violation"
if the six bits before it in the CRC are '1's. This bit should be ignored by the receiver.
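The line states and the end-of-packet rule above can be expressed as a small classifier. This is a minimal Python sketch, assuming one sample of D+ and D− per bit time taken as simple high/low booleans; real receivers use differential and single-ended receivers with voltage thresholds, which it ignores. The helper names are hypothetical.

```python
def line_state(d_plus: bool, d_minus: bool, low_speed: bool = False) -> str:
    """Map sampled D+/D- levels (True = high) to a USB 1.x line state."""
    if not d_plus and not d_minus:
        return "SE0"
    if d_plus and d_minus:
        return "SE1"  # both high: not a valid signaling state
    # Full-speed idle (J) has D+ high; low-speed idle (J) has D- high.
    j_line_high = d_minus if low_speed else d_plus
    return "J" if j_line_high else "K"


def ends_with_eop(states: list[str]) -> bool:
    """EOP = two bit times of SE0 followed by one bit time of J."""
    return states[-3:] == ["SE0", "SE0", "J"]
```

Because J and K are defined relative to the device's pull-up, the same electrical levels decode to opposite states at full and low speed.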
A USB bus is reset using a prolonged (10 to 20 milliseconds) SE0 signal.
USB 2.0 devices use a special protocol during reset, called "chirping", to negotiate the
high speed mode with the host/hub. A device that is HS capable first connects as an FS
device (D+ pulled high), but upon receiving a USB RESET (both D+ and D− driven
LOW by host for 10 to 20 ms) it pulls the D− line high, known as chirp K. This indicates
to the host that the device is high speed. If the host/hub is also HS capable, it chirps
(returns alternating J and K states on D− and D+ lines), letting the device know that the
hub will operate at high speed. The device has to receive at least 3 sets of KJ chirps
before it changes to high speed terminations and begins high speed signaling. Because
USB 3.0 uses wiring separate and additional to that used by USB 2.0 and USB 1.x, such
speed negotiation is not required.
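The final step of the chirp handshake, in which the device waits for at least three KJ pairs, can be sketched as a counter over the hub's chirp states. This is an idealized Python illustration: it assumes each chirp has already been detected as a single "K" or "J" symbol, ignores the timing windows a real device must check, and uses hypothetical function names.

```python
def count_kj_pairs(chirps) -> int:
    """Count non-overlapping K-then-J pairs in a sequence of chirp symbols."""
    pairs = 0
    prev = None
    for s in chirps:
        if prev == "K" and s == "J":
            pairs += 1
            prev = None  # this K and J are used up as one pair
        else:
            prev = s
    return pairs


def should_enter_high_speed(chirps, required_pairs: int = 3) -> bool:
    """Device-side decision: switch to HS after at least three KJ pairs."""
    return count_kj_pairs(chirps) >= required_pairs
```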
Clock tolerance is 480.00 Mbit/s ±500 ppm, 12.000 Mbit/s ±2500 ppm, 1.50 Mbit/s
±15000 ppm.
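A ppm tolerance turns into an absolute bit-rate range by multiplying the nominal rate by ppm × 10⁻⁶. The small Python helper below (hypothetical name `rate_bounds`) shows the arithmetic: ±500 ppm of 480 Mbit/s is ±240 kbit/s, ±2500 ppm of 12 Mbit/s is ±30 kbit/s, and ±15000 ppm of 1.5 Mbit/s is ±22.5 kbit/s.

```python
def rate_bounds(nominal_bps: float, ppm: float) -> tuple[float, float]:
    """Return (min, max) bit rate allowed by a +/- ppm tolerance."""
    delta = nominal_bps * ppm / 1_000_000
    return nominal_bps - delta, nominal_bps + delta


# Nominal rates and tolerances quoted in the text
TOLERANCES = {
    "high speed": (480.00e6, 500),
    "full speed": (12.000e6, 2500),
    "low speed": (1.50e6, 15000),
}

if __name__ == "__main__":
    for name, (rate, ppm) in TOLERANCES.items():
        lo, hi = rate_bounds(rate, ppm)
        print(f"{name}: {lo / 1e6:.4f} to {hi / 1e6:.4f} Mbit/s")
```

The looser tolerance at lower speeds is what lets low-speed devices use cheaper clock sources.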
Though high speed devices are commonly referred to as "USB 2.0" and advertised as "up
to 480 Mbit/s", not all USB 2.0 devices are high speed. The USB-IF certifies devices and
provides licenses to use special marketing logos for either "basic speed" (low and full) or
high speed after passing a compliance test and paying a licensing fee. All devices are
tested according to the latest specification, so recently-compliant low speed devices are
also 2.0 devices.
Data packets
USB communication takes the form of packets. Initially, all packets are sent from the
host, via the root hub and possibly more hubs, to devices. Some of those packets direct a
device to send some packets in reply.
After the sync field described above, all packets are made of 8-bit bytes, transmitted
least-significant bit first. The first byte is a packet identifier (PID) byte. The PID is
actually 4 bits; the byte consists of the 4-bit PID followed by its bitwise complement.
This redundancy helps detect errors. (Note also that a PID byte contains at most four
consecutive 1 bits, and thus will never need bit-stuffing, even when combined with the
final 1 bit in the sync byte. However, trailing 1 bits in the PID may require bit-stuffing
within the first few bits of the payload.)
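The PID byte layout above (the 4-bit PID in the low nibble, which is sent first, followed by its complement) can be sketched in Python. The numeric PID codes in `PIDS` are taken from the USB 2.0 specification, not from these notes, and PRE and ERR share the same code; the helper names are hypothetical. The last function lets you check the claim that a PID byte never contains more than four consecutive 1 bits.

```python
# 4-bit PID codes from the USB 2.0 specification (not listed in these notes).
PIDS = {
    "OUT": 0b0001, "IN": 0b1001, "SOF": 0b0101, "SETUP": 0b1101,     # tokens
    "DATA0": 0b0011, "DATA1": 0b1011, "DATA2": 0b0111, "MDATA": 0b1111,
    "ACK": 0b0010, "NAK": 0b1010, "STALL": 0b1110, "NYET": 0b0110,   # handshakes
    "PRE": 0b1100,  # ERR (split-transaction handshake) reuses this code
    "SPLIT": 0b1000, "PING": 0b0100,
}


def pid_byte(pid: int) -> int:
    """Low nibble = PID (sent first, LSB first), high nibble = its complement."""
    return (pid & 0xF) | ((~pid & 0xF) << 4)


def decode_pid_byte(byte: int) -> int:
    """Return the 4-bit PID, or raise if the check nibble does not match."""
    pid, check = byte & 0xF, (byte >> 4) & 0xF
    if check != (~pid & 0xF):
        raise ValueError(f"corrupted PID byte 0x{byte:02X}")
    return pid


def longest_run_of_ones(byte: int) -> int:
    """Longest run of 1 bits in transmission (LSB-first) order."""
    run = best = 0
    for i in range(8):
        run = run + 1 if (byte >> i) & 1 else 0
        best = max(best, run)
    return best
```

For example, `pid_byte(PIDS["ACK"])` is 0xD2, and any single-bit error in a PID byte breaks the complement relation, so `decode_pid_byte` rejects it.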
Handshake packets
Handshake packets consist of nothing but a PID byte, and are generally sent in response
to data packets. The three basic types are ACK, indicating that data was successfully
received, NAK, indicating that the data cannot be received at this time and should be
retried, and STALL, indicating that the device has an error and will never be able to
successfully transfer data until some corrective action (such as device initialization) is
performed.
USB 2.0 added two additional handshake packets. The first, NYET, indicates that a split
transaction is not yet complete. A NYET packet is also used to tell the host that the
receiver has accepted a data packet, but cannot accept any more due to buffers being full.
The host will then send PING packets and will continue with data packets once the
device ACKs the PING. The other packet added was the ERR handshake, which indicates that
a split transaction failed.
The only handshake packet the USB host may generate is ACK; if it is not ready to
receive data, it should not instruct a device to send any.
Token packets
Token packets consist of a PID byte followed by 2 payload bytes: 11 bits of address and a
5-bit CRC. Tokens are only sent by the host, never a device.
IN and OUT tokens contain a 7-bit device number and 4-bit endpoint number (for
devices with more than one endpoint) and command the device to transmit DATAx packets, or receive
the following DATAx packets, respectively.
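A token's 11-bit field and its 5-bit CRC can be sketched as follows. This Python example assumes details from the USB 2.0 specification that the notes do not state: the device address occupies bits 0 to 6 and the endpoint bits 7 to 10, the CRC5 generator is x⁵ + x² + 1 with the register preset to all ones and the result complemented, and the CRC is transmitted most-significant bit first. `token_packet`, `token_crc_ok` and the other helpers are hypothetical names.

```python
# Token PID codes from the USB 2.0 specification (assumption, not in the notes).
TOKEN_PIDS = {"OUT": 0b0001, "IN": 0b1001, "SOF": 0b0101, "SETUP": 0b1101}


def crc5_usb(value: int, nbits: int = 11) -> int:
    """CRC5 over the low `nbits` bits of `value`, fed LSB first.

    Generator x^5 + x^2 + 1, register preset to 0b11111, result inverted.
    Bit 4 of the returned value is the first CRC bit on the wire.
    """
    crc = 0b11111
    for i in range(nbits):
        feedback = ((crc >> 4) & 1) ^ ((value >> i) & 1)
        crc = (crc << 1) & 0b11111
        if feedback:
            crc ^= 0b00101  # x^2 + 1 terms of the generator
    return crc ^ 0b11111


def reverse5(x: int) -> int:
    """Reverse the order of the low 5 bits of x."""
    return int(f"{x & 0b11111:05b}"[::-1], 2)


def token_packet(kind: str, address: int, endpoint: int) -> bytes:
    """Build the PID byte plus 2 payload bytes for an IN/OUT/SETUP token."""
    if not (0 <= address < 128 and 0 <= endpoint < 16):
        raise ValueError("address must fit in 7 bits and endpoint in 4 bits")
    pid = TOKEN_PIDS[kind]
    fields = address | (endpoint << 7)  # 11 bits, sent LSB first
    # The CRC goes out MSB first, so it sits bit-reversed in bits 11 to 15.
    word = fields | (reverse5(crc5_usb(fields)) << 11)
    return bytes([pid | ((~pid & 0xF) << 4), word & 0xFF, word >> 8])


def token_crc_ok(packet: bytes) -> bool:
    """Receiver check: recompute the CRC over the 11-bit field."""
    word = packet[1] | (packet[2] << 8)
    return reverse5(word >> 11) == crc5_usb(word & 0x7FF)
```

With these helpers, `token_packet("SETUP", 0, 0)` returns the three bytes 0x2D, 0x00, 0x10, and flipping any single bit of the two payload bytes makes `token_crc_ok` return False.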
An IN token expects a response from a device. The response may be a NAK or STALL
response, or a DATAx frame. In the latter case, the host issues an ACK handshake if
appropriate.
An OUT token is followed immediately by a DATAx frame. The device responds with
ACK, NAK, NYET, or STALL, as appropriate.
SETUP operates much like an OUT token, but is used for initial device setup. It is
followed by an 8-byte DATA0 frame with a standardized format.
Every millisecond (12000 full-speed bit times), the USB host transmits a special SOF
(start of frame) token, containing an 11-bit incrementing frame number in place of a
device address. This is used to synchronize isochronous data flows. High-speed USB 2.0
devices receive 7 additional duplicate SOF tokens per frame, each introducing a 125 µs
"microframe" (60000 high-speed bit times each).
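The bit-time figures above follow directly from the signaling rates. The last line adds one consequence the notes do not spell out: an 11-bit frame number that increments once per millisecond wraps around after 2¹¹ frames.

```latex
\begin{aligned}
12\ \text{Mbit/s} \times 1\ \text{ms} &= 12\,000\ \text{full-speed bit times per frame}\\
480\ \text{Mbit/s} \times 125\ \mu\text{s} &= 60\,000\ \text{high-speed bit times per microframe}\\
\frac{1\ \text{ms}}{125\ \mu\text{s}} &= 8\ \text{microframes per frame} = 1 + 7\ \text{SOF tokens}\\
2^{11}\ \text{frames} \times 1\ \text{ms} &= 2.048\ \text{s before the frame number wraps}
\end{aligned}
```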
USB 2.0 added a PING token, which asks a device if it is ready to receive an
OUT/DATA packet pair. The device responds with ACK, NAK, or STALL, as
appropriate. This avoids the need to send the DATA packet if the device knows that it
will just respond with NAK.
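The PING flow described above can be sketched as a small host-side state machine for one high-speed OUT endpoint. This is a simplified Python illustration with hypothetical names. It follows the text for NYET and STALL, and it also assumes, based on the USB 2.0 PING protocol rather than on these notes, that a NAK to an OUT makes the host switch to PING.

```python
from enum import Enum


class HostState(Enum):
    SEND_OUT = "send OUT token + DATAx packet"
    SEND_PING = "send PING token"
    HALTED = "endpoint halted, needs corrective action"


def next_state(state: HostState, handshake: str) -> HostState:
    """Simplified host logic for one high-speed OUT endpoint."""
    if state is HostState.HALTED or handshake == "STALL":
        return HostState.HALTED
    if state is HostState.SEND_PING:
        # ACK: device has buffer space, so send data; NAK: keep pinging
        return HostState.SEND_OUT if handshake == "ACK" else HostState.SEND_PING
    # Response to OUT + DATAx
    if handshake in ("NYET", "NAK"):
        # NYET: data accepted but buffers full. NAK: data not accepted and
        # must be resent later. Either way, probe with PING first.
        return HostState.SEND_PING
    return HostState.SEND_OUT  # ACK: keep sending data
```

The benefit is that a NAKed PING costs only a token and a handshake on the bus, instead of a full data packet that the device would discard.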
USB 2.0 also added a larger 3-byte SPLIT token with a 7-bit hub number, 12 bits of
control flags, and a 5-bit CRC. This is used to perform split transactions. Rather than tie
up the high-speed USB bus sending data to a slower USB device, the nearest high-speed
capable hub receives a SPLIT token followed by one or two USB packets at high speed,
performs the data transfer at full or low speed, and provides the response at high speed
when prompted by a second SPLIT token. The details are complex; see the USB
specification.
Data packets
A data packet consists of the PID followed by 0 to 1023 bytes of data payload (up to 1024
in high speed, at most 8 at low speed), and a 16-bit CRC.
There are two basic data packets, DATA0 and DATA1. They must always be preceded
by an address token, and are usually followed by a handshake token from the receiver
back to the transmitter. The two packet types provide the 1-bit sequence number required
by stop-and-wait ARQ. If a USB host does not receive a response (such as an ACK) for
data it has transmitted, it does not know if the data was received or not; the data might
have been lost in transit, or it might have been received but the handshake response was
lost.
To solve this problem, the device keeps track of the type of DATAx packet it last
accepted. If it receives another DATAx packet of the same type, it is acknowledged but
ignored as a duplicate. Only a DATAx packet of the opposite type is actually received.
When a device is reset with a SETUP packet, it expects an 8-byte DATA0 packet next.
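The duplicate-detection rule above is the receiver half of stop-and-wait ARQ with a 1-bit sequence number. The following is a minimal Python sketch with a hypothetical `DataToggleReceiver` class. For brevity it always answers ACK and leaves out the NAK and STALL cases.

```python
from dataclasses import dataclass, field


@dataclass
class DataToggleReceiver:
    """Device-side DATA0/DATA1 tracking for one endpoint (simplified)."""
    expected: int = 0                          # 0 = expect DATA0, 1 = DATA1
    delivered: list[bytes] = field(default_factory=list)

    def on_setup(self) -> None:
        # After SETUP the device expects a DATA0 packet next
        self.expected = 0

    def on_data(self, toggle: int, payload: bytes) -> str:
        if toggle == self.expected:
            self.delivered.append(payload)     # new data: accept it
            self.expected ^= 1                 # next packet must be the other type
        # Same type as the last accepted packet: a retransmission, so it is
        # acknowledged again but not delivered a second time.
        return "ACK"
```

If the device's first ACK is lost, the host resends DATA0. The device acknowledges it again but delivers the payload only once, and the next DATA1 is accepted normally.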
USB 2.0 added DATA2 and MDATA packet types as well. They are used only by high-speed
devices doing high-bandwidth isochronous transfers which need to transfer more
than 1024 bytes per 125 µs "microframe" (8192 kB/s).
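The 8192 kB/s figure corresponds to a full 1024-byte payload in every microframe, with 1 kB taken as 1000 bytes:

```latex
\frac{1\ \text{s}}{125\ \mu\text{s}} = 8000\ \text{microframes/s},\qquad
1024\ \text{bytes} \times 8000\ \text{s}^{-1} = 8\,192\,000\ \text{bytes/s} = 8192\ \text{kB/s}
```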
PRE "packet"
Low-speed devices are supported with a special PID value, PRE. This marks the
beginning of a low-speed packet, and is used by hubs which normally do not send full-
speed packets to low-speed devices. Since all PID bytes include four 0 bits, they leave the
bus in the full-speed K state, which is the same as the low-speed J state. It is followed by
a brief pause during which hubs enable their low-speed outputs, already idling in the J
state, then a low-speed packet follows, beginning with a sync sequence and PID byte, and
ending with a brief period of SE0. Full-speed devices other than hubs can simply ignore
the PRE packet and its low-speed contents, until the final SE0 indicates that a new packet
follows.
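USB Packet Format
Output Transfer
USB Frames
[Figure: USB frames. (a) SOF packet: PID (8 bits), frame number (11 bits), CRC5 (5 bits). (b) Frame example: each 1-ms frame begins with a start-of-frame packet, followed by token, data and ACK packets. Legend: S = start-of-frame packet; Tn = token packet, address = n; D = data packet; A = ACK packet.]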