Computer System Architecture Lecture Notes
Budditha Hettige · October 2015
DOI: 10.13140/RG.2.1.2592.8407
CSC 203 1.5
Computer System Architecture
By
Budditha Hettige
Department of Statistics and Computer Science
University of Sri Jayewardenepura
Virtual Machine
(Figure: a multilevel view of a computer — a virtual machine with machine language L1 is layered on top of the actual machine, whose native machine language is L0.)
• 1946 – ENIAC (Electronic Numerical Integrator And Computer): decimal, vacuum tubes
• 1952 – IAS (von Neumann): most current machines use this design
• 42 million transistors
• 2 GHz
• 0.13 µm process
Controller    Port / Device              Typical Data Transfer Rate
Super I/O     PS/2 (keyboard / mouse)    2 KB/s
Super I/O     Serial Port                25 KB/s
Super I/O     Floppy Disk                125 KB/s
Super I/O     Parallel Port              200 KB/s
Southbridge   Integrated Audio           1 MB/s
Southbridge   Integrated LAN             12 MB/s
Southbridge   USB                        60 MB/s
Southbridge   Integrated Video           133 MB/s
Southbridge   IDE (HDD, DVD)             133 MB/s
Southbridge   SATA (HDD, DVD)            300 MB/s
(Figure: a simple CPU model with an ALU, an Instruction Counter (IC) and a register C. The slides step through an execution trace: the IC advances 01, 02, 03, … over a small memory of 8-bit instruction words at addresses 1–8, and register C takes the successive values 0000, 0001, 0010, 1000, 0111, 1001, 0000, 1111 as each instruction is fetched and executed.)
CPU and the system bus
(Figure: the CPU connects to memory and I/O modules through three shared buses — a DATA BUS, an ADDRESS BUS, and a CONTROL BUS.)
2011 Computer System Architecture 87
How the BUS System works
• The CONTROL bus carries a 2-bit code: 01 – READ, 10 – WRITE
• Each device decodes its own 4-bit ADDRESS (0100, 0010 and 0001 in the figure)
• (Figure sequence: the DATA / ADDRESS / CONTROL buses pass through the states 0000 0100 00 → 1010 0100 10 → 1010 0010 00 → 1010 0010 01, showing data 1010 written to the device at address 0100 and then read back from the device at address 0010.)
Intel
Microprocessor History
162
Computer Pipelines
163
Example
T – cycle time (ns)
N – number of stages in the pipeline
Latency: time taken to execute an instruction = N × T
Processor bandwidth: no. of MIPS the CPU has = 1000 / T
164
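The two formulas above can be checked with a small calculation (a sketch; the 5-stage, 2 ns figures are illustrative, not from the slides):

```python
def pipeline_latency_ns(n_stages: int, cycle_ns: float) -> float:
    # Latency: an instruction passes through all N stages, T ns each
    return n_stages * cycle_ns

def pipeline_mips(cycle_ns: float) -> float:
    # Bandwidth: one instruction completes per cycle,
    # so a T ns cycle gives 1000 / T million instructions per second
    return 1000.0 / cycle_ns

latency = pipeline_latency_ns(5, 2.0)  # 5 stages x 2 ns = 10 ns per instruction
mips = pipeline_mips(2.0)              # 1000 / 2 = 500 MIPS
```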
Processor - pipeline depth
165
Dual pipelines
166
Superscalar architecture
• Single pipeline with multiple functional
units
Processor level parallelism
• High bus traffic
• Execution time:
– Time between start and completion of a task
(including disk accesses, memory accesses )
• Throughput:
– Total amount of work done in a given time
Performance of a Computer
• % of CPU time = (User CPU time + System CPU time) / Execution time × 100 %
• Example:
% CPU time = (90.7 + 12.9) / 159 × 100
= 65 %
Clock Rate
• The computer clock runs at a constant rate and determines when events take place in the hardware
• Clock rate = 1 / clock cycle time
Amdahl’s law
• Performance improvement that can be
gained from some faster mode of
execution is limited by fraction of the
time the faster mode can be used
Amdahl’s law
• Speedup depends on
– Fraction of computation time in original
machine that can be converted to take
advantage of the enhancement
(Fraction Enhanced)
– Improvement gains by enhanced
execution mode
(Speedup Enhanced)
Example
• Total execution time of a program = 50 s
• Execution time that can be enhanced = 30 s
• FractionEnhanced = 30 / 50 = 0.6

Speedup Example
• Normal mode execution time for some portion of a program = 6 s
• Enhanced mode execution time for the same portion = 2 s
• SpeedupEnhanced = 6 / 2 = 3
188
Remark
• If an enhancement is only usable for a fraction F of a task, we cannot speed the task up by more than 1 / (1 − F), no matter how fast the enhanced mode is
189
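Amdahl's law and the 1/(1 − F) bound can be sketched in Python using the slides' own figures (F = 0.6 from the 30 s / 50 s example, S = 3 from the 6 s → 2 s example):

```python
def amdahl_speedup(f_enhanced: float, s_enhanced: float) -> float:
    # Overall speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - f_enhanced) + f_enhanced / s_enhanced)

overall = amdahl_speedup(0.6, 3.0)  # 1 / (0.4 + 0.2) = 1.666...
bound = 1.0 / (1.0 - 0.6)           # limit as S -> infinity: 2.5
```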
Example
• A common transformation required in graphics engines is square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics
• Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics program
• Design alternatives:
1. Enhance the FPSQR hardware and speed up this operation by a factor of 10
2. Make all FP instructions run faster by a factor of 1.6
190
Example
• FP instructions are responsible for a total of 50% of execution time. The design team believes they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root
191
192
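The blank slide above presumably held the worked answer; a sketch of the standard Amdahl's-law comparison of the two alternatives:

```python
def amdahl(f: float, s: float) -> float:
    # Overall speedup = 1 / ((1 - f) + f / s)
    return 1.0 / ((1.0 - f) + f / s)

# Alternative 1: FPSQR is 20% of execution time, sped up by 10x
alt1 = amdahl(0.20, 10.0)  # 1 / (0.80 + 0.02) ~ 1.22

# Alternative 2: all FP instructions are 50% of execution time, sped up by 1.6x
alt2 = amdahl(0.50, 1.6)   # 1 / (0.50 + 0.3125) ~ 1.23
# Improving all FP instructions wins slightly, because it covers more run time
```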
CPU performance equation
CPU time = CPU clock cycles for a program x Clock cycle time
Design alternatives:
1. Decrease the CPI of FPSQR to 2
2. Decrease the average CPI of all FP operations to 2.5
207
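The equation above is often expanded as CPU time = instruction count × CPI × clock cycle time; a minimal sketch (the 10⁹-instruction program and 1 GHz clock are illustrative assumptions, not from the slides):

```python
def cpu_time_s(instr_count: float, cpi: float, clock_rate_hz: float) -> float:
    # CPU clock cycles for a program = instruction count x CPI;
    # CPU time = cycles x cycle time = cycles / clock rate
    return instr_count * cpi / clock_rate_hz

t = cpu_time_s(1e9, 2.0, 1e9)  # 2e9 cycles at 1 GHz -> 2.0 s
```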
Introduction
208
Instruction Set Architecture
• Positioned between microarchitecture
level and operating system level
• Important to system architects
– interface between software and hardware
209
Instruction Set Architecture
210
ISA contd..
• General approach of system designers:
– Build programs in high-level languages
– Translate to ISA level
– Build hardware that executes ISA level
programs directly
• Key challenge:
– Build better machines subject to backward
compatibility constraint
211
Features of a good ISA
• Define a set of instructions that can be
implemented efficiently in current and
future technologies resulting in cost
effective designs over several
generations
• Provide a clean target for compiled
code
212
Properties of the ISA level
• ISA level code is what a compiler
outputs
• To produce ISA code, compiler writer
has to know
– What the memory model is
– What registers are there
– What data types and instructions are
available
213
ISA level memory models
• Computers divide memory into cells (8
bits) that have consecutive addresses
• Bytes are grouped into words (4-, 8-
byte) with instructions available for
manipulating entire words
• Many architectures require words to be
aligned on their natural boundaries
– Memories operate more efficiently that
way
214
ISA level Memory Models
215
ISA level registers
• Main function of ISA level registers:
– provide rapid access to heavily used data
• Registers are divided into 2 categories
– special purpose registers (program
counter, stack pointer)
– General purpose registers (hold key local
variables, intermediate results of
calculations).
• These are interchangeable
216
Instructions
• Main feature of ISA level is its set of
machine instructions
• They control what the machine can do
• Ex:
– LOAD and STORE instructions move data
between memory and registers
– MOVE instruction copies data among
registers
217
Pentium II ISA level (Intel’s IA-32)
• Maintains full support for execution of programs
written for 8086, 8088 processors (16-bit)
• Pentium II has 3 operating modes (Real mode,
Virtual 8086 mode, Protected mode)
• Address space: memory is divided into 16,384 segments, each going from address 0 to address 2^32 − 1 (Windows supports only one segment)
• Every byte has its own address, with words being
32 bits long
• Words are stored in Little endian format (low-
order byte has lowest address)
218
Little endian and Big endian
format
219
Pentium II’s primary registers
220
Pentium II’s primary registers
• EAX: main arithmetic register, 32-bit
– 16-bit register in low-order 16 bits
– 8-bit register in low-order 8 bits
– easy to manipulate 16-bit (in 80286) and 8-bit
(in 8088) quantities
• EBX: holds pointers
• ECX: used in looping
• EDX: used for multiplication and division,
where together with EAX, it holds 64-bit
products and dividends
221
Pentium II’s primary registers
• ESI, EDI: hold pointers into memory
– Especially for hardware string manipulation
instructions (ESI points to source string, EDI
points to destination string)
• EBP: pointer register
• ESP: stack pointer
• CS through GS: segment registers
• EIP: program counter
• EFLAGS: flag register (holds various
miscellaneous bits such as conditional
codes)
222
Pentium II data Types
223
Instruction Formats
• An instruction consists of an opcode,
plus additional information such as
where operands come from, where
results go to
• Opcode tells what instruction does
• On some machines, all instructions
have same length
– Advantages: simple, easy to decode
– Disadvantages: waste space
224
Common Instruction Formats
225
Instruction and Word length
Relationships
226
Example
• An instruction with a 4-bit opcode and three 4-bit addresses
227
Design of Instruction Formats
• Factors:
– Length of instruction
• short instructions are better than long
instructions (modern processors can execute
multiple instructions per clock cycle)
– Sufficient room in the instruction format to
express all operations required
– No. of bits in an address field
228
Intel® 64 and IA-32 Architectures
• Intel 64 and IA-32 instructions
– General purpose
– x87 FPU
– x87 FPU and SIMD state management
– Intel MMX technology
– SSE extensions
– SSE2 extensions
– SSE3 extensions
– SSSE3 extensions
– SSE4 extensions
– AESNI and PCLMULQDQ
– Intel AVX extensions
– F16C, RDRAND, FS/GS base access
– System instructions
– IA-32e mode: 64-bit mode instructions
– VMX instructions
– SMX instructions
229
Addressing
230
Addressing
• Subject of specifying where the operands
(addresses) are
– ADD instruction requires 2 or 3 operands, and
instruction must tell where to find operands and
where to put result
• Addressing Modes
– Methods of interpreting the bits of an address field
to find operand
• Immediate Addressing
• Direct Addressing
• Register Addressing
• Register Indirect Addressing
• Indexed Addressing
231
Immediate Addressing
• Simplest way to specify where the operand is
• Address part of instruction contains operand
itself (immediate operand)
• The operand is automatically fetched from memory at the same time the instruction itself is fetched
– Immediately available for use
• No additional memory references are required
• Disadvantages
– only a constant can be supplied
– value of the constant is limited by size of address field
• Good for specifying small integers
232
Example
Immediate Addressing
MOV R1, #8 ; Reg[R1] ← 8
ADD R2, #3 ; Reg[R2] ← Reg[R2] + 3
233
Direct Addressing
• Operand is in memory, and is specified by giving
its full address (memory address is hardwired
into instruction)
• The instruction will always access exactly the same memory location, which cannot change
• Can only be used for global variables whose address is known at compile time
• Example instruction:
– ADD R1, (1001) ; Reg[R1] ← Reg[R1] + Mem[1001]
234
Direct Addressing Example
235
Register Addressing
• Same as direct addressing, except that it specifies a register instead of a memory location
• Most common addressing mode on most computers, since register accesses are very fast
• Compilers try to put the most commonly accessed variables in registers
• Cannot be the only mode used in LOAD and STORE instructions (one operand is always a memory address)
• Example instruction:
– ADD R3, R4 ; Reg[R3] ← Reg[R3] + Reg[R4]
236
Register Indirect Addressing
• Operand being specified comes from memory or
goes to memory
• Its address is not hardwired into instruction, but is
contained in a register (pointer)
• Can reference memory without having full memory
address in the instruction
• Different memory words can be used on different
executions of the instruction
• Example instruction:
– ADD R1, (R2) ; Reg[R1] ← Reg[R1] + Mem[Reg[R2]]
237
Example
• The following generic assembly program calculates the sum of the 1024 elements of an array A of 4-byte integers, and stores the result in register R1
238
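The program itself was shown in a figure; a sketch of what it could look like in the generic assembly used on these slides (the loop label, register numbers, and the symbol A for the array's base address are illustrative):

```
      MOV R1, #0         ; R1 accumulates the sum
      MOV R2, #A         ; R2 points at A[0] (register indirect addressing)
      MOV R3, #A+4096    ; first address past the array (1024 x 4 bytes)
LOOP: ADD R1, (R2)       ; Reg[R1] <- Reg[R1] + Mem[Reg[R2]]
      ADD R2, #4         ; advance the pointer to the next 4-byte element
      CMP R2, R3         ; reached the end of the array?
      BLT LOOP           ; branch back while R2 < R3
```

Each pass through the loop uses register indirect addressing to fetch a different array element without encoding any memory address in the instruction itself.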
Indexed Addressing
• Memory is addressed by giving a register
plus a constant offset
• Used to access local variables
• Example instruction:
– ADD R3, 100(R2)
; Reg[R3] ← Reg[R3] + Mem[100+Reg[R2]]
239
Based-Indexed Addressing
• Memory address is computed by
adding up two registers plus an optional
offset
• Example instruction:
ADD R3, (R1+R2)
;Reg[R3] ← Reg[R3] + Mem[Reg[R1] +
Reg[R2]]
240
Instruction Types
• ISA level instructions are divided into few
categories
– Data Movement Instructions
• Copy data from one location to another
– Examples (Pentium II integer instructions):
• MOV DST, SRC – copies SRC (source) to DST
(destination)
• PUSH SRC – pushes SRC onto the stack
• XCHG DS1, DS2 – exchanges DS1 and DS2
• CMOV DST, SRC – conditional move
241
Instruction Types contd..
– Dyadic Operations
• Combine two operands to produce a result
(arithmetic instructions, Boolean instructions)
– Examples (Pentium II integer instructions):
• ADD DST, SRC – adds SRC to DST, puts result in DST
• SUB DST, SRC – subtracts SRC from DST
• AND DST, SRC – Boolean AND of SRC into DST
• OR DST, SRC – Boolean OR of SRC into DST
• XOR DST, SRC – Boolean Exclusive OR of SRC into DST
242
Instruction Types contd..
• Monadic Operations
– Have one operand and produce one result
– Shorter than dyadic instructions
• Examples (Pentium II integer
instructions):
– INC DST – adds 1 to DST
– DEC DST – subtracts 1 from DST
– NOT DST – replace DST with 1’s
complement
243
Instruction Types contd..
• Comparison and Conditional Branch
Instructions
244
Instruction Types contd..
• Procedure (Subroutine) call
Instructions
– When the procedure has finished its task,
transfer is returned to statement after the call
245
Instruction Types contd..
• Loop Control Instructions
– LOOPxx – loops until condition is met
• Input / Output Instructions
There are several input/output schemes
currently used in personal computers
– Programmed I/O with busy waiting
– Interrupt-driven I/O
– DMA (Direct Memory Access) I/O
246
Programmed I/O with busy waiting
247
DMA I/O
• DMA controller is a chip that has a direct
access to the bus
• It consists of at least four registers, each
can be loaded by software.
– Register 1 contains memory address to be
read/written
– Register 2 contains the count of how many
bytes / words to be transferred
– Register 3 specifies the device number or I/O
space address to use
– Register 4 indicates whether data are to be
read from or written to I/O device
248
Structure of a DMA
249
Registers in the DMA
• Status register: readable by the CPU to determine the status
of the DMA device (idle, busy, etc)
• Command register: writable by the CPU to issue a command
to the DMA
• Data register: readable and writable. It is the buffering place
for data that is being transferred between the memory and the
IO device.
• Address register: contains the starting location of memory
where from or where to the data will be transferred. The
Address register must be programmed by the CPU before
issuing a "start" command to the DMA.
• Count register: contains the number of bytes that need to be
transferred. The information in the address and the count
register combined will specify exactly what information need to
be transferred.
250
Example
• Writing a block of 32 bytes from memory
address 100 to a terminal device (4)
251
Example contd..
• The CPU writes 100 (the memory address), 32 (the count), and 4 (the device number) into the first three DMA registers, and writes the code for WRITE (1, for example) into the fourth register
• DMA controller makes a bus request to read byte
100 from memory
• DMA controller makes an I/O request to device 4 to
write the byte to it
• DMA controller increments its address register by 1
and decrements its count register by 1
• If the count register is > 0, another byte is read from
memory and then written to device
• DMA controller stops transferring data when count =
0
252
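The transfer loop above can be sketched as a toy simulation (Python; the dict-backed memory and list-backed device are modelling assumptions, not real DMA hardware):

```python
def dma_write_to_device(memory: dict, address: int, count: int) -> list:
    """Simulate the DMA controller's loop: read bytes from memory and
    hand them to the device until the count register reaches 0."""
    device_bytes = []
    while count > 0:
        device_bytes.append(memory[address])  # bus read, then I/O write
        address += 1                          # increment address register
        count -= 1                            # decrement count register
    return device_bytes                       # stop when count == 0

# The slide's example: 32 bytes starting at memory address 100
memory = {100 + i: i * 2 for i in range(32)}
sent = dma_write_to_device(memory, address=100, count=32)
```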
Sample Questions
Q1.
1. Explain the processor architecture of the 8086.
2. What are the differences between the Intel Pentium processor and a dual-core processor?
3. What are the advantages and disadvantages of multi-core processors?
253
Sample Questions
Q2.
1. What is addressing?
2. Comparing advantages, disadvantages and features, briefly explain each addressing mode.
3. What is DMA and why is it useful for programming? Explain your answer.
254
Computer Memory
• Primary Memory
• Secondary Memory
• Virtual Memory
255
Levels in Memory Hierarchy
(Figure: CPU registers ↔ cache ↔ main memory ↔ disk, with typical transfer units of 8 B between registers and cache, 32 B between cache and main memory, and 4 KB between main memory and disk.)
257
Primary memory
• Memory is the workspace for CPU
• When a file is loaded into memory, it is a copy of the
file that is actually loaded
• Consists of a number of cells, each having a number (address)
• n cells → addresses 0 to n − 1
• Same number of bits in each cell
• Adjacent cells have consecutive addresses
• An m-bit address → 2^m addressable cells
258
Ways of organizing a 96-bit
memory
259
SRAM (Static RAM)
• Constructed using flip flops
• 6 transistors for each bit of storage
• Very fast
• Contents are retained as long as power is
kept on
• Expensive
• Used in level 2 cache
260
DRAM (Dynamic RAM)
• No flip‐flops
• Array of cells, each consisting a transistor and a capacitor
• Capacitors can be charged or discharged, allowing 0s
and 1s to be Stored
• Electric charge tends to leak out ⇒ each bit in a DRAM must be reloaded (refreshed) every few milliseconds (~15 ms) to prevent data from leaking away
• Refreshing takes several CPU cycles to complete (less than 1% of overall bandwidth)
• High density (30 times smaller than SRAM)
• Used in main memories
• Slower than SRAM
• Inexpensive (30 times lower than SRAM)
261
SDRAM (Synchronous DRAM)
• Hybrid of SRAM and DRAM
• Runs in synchronization with the system bus
• Driven by a single synchronous clock
• Used in large caches, main memories
262
DDR (Double Data Rate) SDRAM
263
Dual channel DDR
• Technique in which 2 DDR DIMMs are installed at one time and
function as a single bank doubling the bandwidth of a single module
• DDR2 SDRAM
– A faster version of DDR SDRAM (doubles the data rate of DDR)
– Less power consumption than DDR
– Achieves higher throughput by using differential pairs of signal wires
– Additional signals add to the pin count
• DDR3 SDRAM
– An improved version of DDR2 SDRAM
– Same number of pins as in DDR2, but not compatible with DDR2
– Can transfer twice the data rate of DDR2
– DDR3 standard allows chip sizes of 512 Megabits to
8 Gigabits (max module size – 16GB)
264
DRAM Memory module
265
DRAM Memory module
266
SDRAM and DDR DIMM versions
• Buffered
• Unbuffered
• Registered
267
SDRAM and DDR DIMM
• Buffered Module
– Has additional buffer circuits between memory
chips and the connector to buffer signals
– New motherboards are not designed to use
buffered modules
• Unbuffered Module
– Allows memory controller signals to pass directly
to memory chips with no interference
– Fast and most efficient design
– Most motherboards are designed to use
unbuffered modules
268
SDRAM and DDR DIMM
• Registered Module
– Uses register chips on the module that act
as an interface between RAM chip and
chipset
– Used in systems designed to accept
extremely large amounts of RAM (server
motherboards)
269
Memory Errors
270
Memory errors
• Hard errors
– Permanent failure
– How to fix? (replace the chip)
• Soft errors
– Non permanent failure
– Occurs at infrequent intervals
– How to fix? (restart the system)
• Best way to deal with soft errors is to
increase system’s fault tolerance
(implement ways of detecting and
correcting errors)
271
Techniques used for fault
tolerance
• Parity
• ECC (Error Correcting Code)
272
Parity Checking
• 9 bits are used in the memory chip to
store 1 byte of information
• Extra bit (parity bit) keeps tabs on other
8 bits
• Parity can only detect errors, but
cannot correct them
273
Odd parity standard for error checking
• Parity generator/checker is a part of CPU
or located in a special chip on
motherboard
• Parity checker evaluates the 8 data bits
by adding the no. of 1s in the byte
• If an even no. of 1s is found, parity
generator creates a 1 and stores it as the
parity bit in memory chip
274
Odd parity standard for error checking (contd.)
• If the sum is odd, the parity bit stored is 0
• If a (9-bit) byte has an even number of 1s, that byte must have an error
• The system cannot tell which bit or bits have changed
• If 2 bits changed, the bad byte could pass unnoticed
• Multiple-bit errors in a single byte are very rare
• The system halts when a parity-check error is detected
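The odd-parity scheme described above can be sketched as follows (a toy model; real parity generation happens in hardware on the motherboard or in the CPU):

```python
def odd_parity_bit(data_byte: int) -> int:
    # Even number of 1s in the byte -> store parity 1; odd -> store 0,
    # so the 9 stored bits always contain an odd number of 1s
    ones = bin(data_byte & 0xFF).count("1")
    return 1 if ones % 2 == 0 else 0

def parity_check_ok(data_byte: int, parity: int) -> bool:
    # A stored 9-bit unit is valid when its total number of 1s is odd
    return (bin(data_byte & 0xFF).count("1") + parity) % 2 == 1

byte = 0b00010010                     # two 1s -> parity bit is 1
p = odd_parity_bit(byte)
single_bit_error = byte ^ 0b00001000  # flip one bit: now detected as bad
```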
275
ECC- Error Correcting Code
• Successor to parity checking
• Can detect and correct memory errors
• Only a single-bit error can be corrected, though it can detect double-bit errors
• This type of ECC is known as single-bit error correction, double-bit error detection (SEC-DED)
• SEC DED requires an additional 7 check
bits over 32 bits in a 4 byte system, or 8
check bits over 64 bits in an 8 byte system
276
ECC- Error Correcting Code
• ECC entails memory controller
calculating check bits on a
memory write operation, performing a
compare between read and calculated
check bits on a read operation
• Cost of additional ECC logic in memory
controller is not significant
• It affects memory performance on a
write
277
Cache memory
278
Cache Memory
• A high-speed, small memory
• The most frequently used memory words are kept in it
• When the CPU needs a word, it first checks the cache; if not found, it checks main memory
279
Cache and Main Memory
280
Cache memory Vs Main Memory
281
Cache Hit and Miss
• Cache Hit: a request to
read from memory,
which can satisfy from
the cache without using
the main memory.
• Cache Miss: A request
to read from memory,
which cannot be
satisfied from the cache,
for which the main
memory has to be
consulted.
282
Locality Principle
• PRINCIPLE OF LOCALITY: the tendency to reference data items that are near other recently referenced data items, or that were recently referenced themselves.
• TEMPORAL LOCALITY: a memory location that is referenced once is likely to be referenced multiple times in the near future.
• SPATIAL LOCALITY: if a memory location is referenced once, the program is likely to reference a nearby memory location in the near future.
283
Locality Principle
Let
c – cache access time
m – main memory access time
h – hit ratio (fraction of all references that can be satisfied out of the cache)
miss ratio = 1 − h
Average memory access time = c + (1 − h) × m
h = 1: no main memory references are needed
h = 0: every reference goes to main memory
284
Example:
Suppose that a word is read k times in a short interval.
First reference: main memory; the other k − 1 references: cache.
h = (k − 1) / k
Average memory access time = c + m / k
285
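Both formulas can be checked numerically (a sketch; c = 1 and m = 10 cycle times are illustrative values):

```python
def avg_access_time(c: float, m: float, h: float) -> float:
    # Average memory access time = c + (1 - h) * m
    return c + (1.0 - h) * m

c, m, k = 1.0, 10.0, 5
h = (k - 1) / k              # word read k times: 1 miss, k - 1 hits
t = avg_access_time(c, m, h) # equals c + m / k, about 3 cycle times here
```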
Cache Memory
• Main memories and caches are divided into fixed sized
blocks
• Cache lines – blocks inside the cache
• On a cache miss, entire cache line is loaded into cache
from memory
• Example:
– 64K cache can be divided into 1K lines of 64 bytes, 2K lines of
32 byte etc
• Unified cache
– instruction and data use the same cache
• Split cache
– Instructions in one cache and data in another
286
A system with three levels of
cache
287
Pentium 4 Block Diagram
288
Replacement Algorithm
• Optimal Replacement: replace the
block which is no longer needed in the
future. If all blocks currently in Cache
Memory will be used again, replace the
one which will not be used in the future
for the longest time.
• Random selection: replace a randomly
selected block among all blocks
currently in Cache Memory.
289
Replacement Algorithm
• FIFO (first-in first-out): replace the block
that has been in Cache Memory for the
longest time.
• LRU (Least recently used): replace the
block in Cache Memory that has not
been used for the longest time.
• LFU (Least frequently used): replace
the block in Cache Memory that has
been used for the least number of times
290
Cache Memory Placement Policy
• Three commonly used methods to
translate main memory addresses to
cache memory addresses.
– Associative Mapped Cache
– Direct-Mapped Cache
– Set-Associative Mapped Cache
• The choice of cache mapping scheme
affects cost and performance, and there
is no single best method that is
appropriate for all situations
291
Associative Mapping
292
Associative Mapping
• A block in the Main Memory can
be mapped to any block in the
Cache Memory available (not
already occupied)
• Advantage: Flexibility. A Main Memory block can be mapped anywhere in Cache Memory.
• Disadvantage: Slow or
expensive. A search through all
the Cache Memory blocks is
needed to check whether the
address can be matched to any
of the tags.
293
Direct Mapping
294
Direct Mapping
• To avoid the search through all CM blocks needed by associative mapping, this method allows only
(# blocks in main memory) / (# blocks in cache memory)
blocks to be mapped to each Cache Memory block.
• Each entry (row) in the cache can hold exactly one cache line from main memory
• With a 32-byte cache line size, a 2,048-entry cache can hold 64 KB
295
Direct Mapping
• Advantage: Direct mapping is faster than
the associative mapping as it avoids
searching through all the CM tags for a
match.
• Disadvantage: But it lacks mapping
flexibility. For example, if two MM blocks
mapped to same CM block are needed
repeatedly (e.g., in a loop), they will keep
replacing each other, even though all
other CM blocks may be available.
296
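The address split behind direct mapping can be sketched for the slides' 64 KB cache with 32-byte lines (2,048 lines; the hexadecimal addresses are illustrative):

```python
LINE_SIZE = 32    # bytes per cache line
NUM_LINES = 2048  # 64 KB cache / 32-byte lines

def direct_map(addr: int):
    # Split a memory address into (tag, line index, byte offset)
    offset = addr % LINE_SIZE
    block = addr // LINE_SIZE
    return block // NUM_LINES, block % NUM_LINES, offset

# Two blocks exactly one cache-size (64 KB) apart collide on the same line,
# which is the repeated-replacement problem the slide describes:
a = direct_map(0x10000)  # tag 1, index 0, offset 0
b = direct_map(0x20000)  # tag 2, index 0, offset 0 -> evicts a's line
```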
Set-Associative Mapping
297
Set-Associative Mapping
• This is a trade-off between
associative and direct mappings
where each address is mapped
to a certain set of cache
locations.
• The cache is broken into sets
where each set contains "N"
cache lines, let's say 4. Then,
each memory address is
assigned a set, and can be
cached in any one of those 4
locations within the set that it is
assigned to. In other words,
within each set the cache is
associative, and thus the name.
298
Set Associative cache
• LRU (Least Recently Used) algorithm
is used
– keep an ordering of each set of locations
that could be accessed from a given
memory location
– whenever any of present lines are
accessed, it updates list, making that entry
the most recently accessed
– when it comes to replace an entry, one at
the end of list is discarded
299
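The LRU bookkeeping described above can be sketched with an ordered dict (a simplified model; tags only, with no data or address decoding):

```python
from collections import OrderedDict

class LRUSet:
    """One set of an N-way set-associative cache with LRU replacement."""
    def __init__(self, ways: int = 4):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, ordered oldest to newest

    def access(self, tag) -> bool:
        """Return True on a hit; on a miss, evict the LRU line if full."""
        hit = tag in self.lines
        if hit:
            self.lines.move_to_end(tag)        # mark most recently used
        else:
            if len(self.lines) >= self.ways:
                self.lines.popitem(last=False)  # discard entry at LRU end
            self.lines[tag] = None
        return hit

s = LRUSet(ways=2)
s.access("A"); s.access("B")  # both miss; set now holds A, B
s.access("A")                 # hit; A becomes most recently used
s.access("C")                 # miss; B (least recently used) is evicted
```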
Load-Through and Store-Through
• Load-Through : When the CPU
needs to read a word from the
memory, the block containing the
word is brought from MM to CM,
while at the same time the word is
forwarded to the CPU.
• Store-Through : If store-through is
used, a word to be stored from
CPU to memory is written to both
CM (if the word is in there) and
MM. By doing so, a CM block to be
replaced can be overwritten by an
in-coming block without being
saved to MM.
300
Cache Write Methods
• Words in a cache have so far been viewed simply as copies of words from main memory that are read from the cache to provide faster access. However, this viewpoint changes once writes are considered.
• There are 3 possible write actions:
– Write the result into the main memory
– Write the result into the cache
– Write the result into both main memory and cache
memory
301
Cache Write Methods
• Write Through: A cache architecture in which
data is written to main memory at the same
time as it is cached.
• Write Back / Copy Back: CPU performs write
only to the cache in case of a cache hit. If there
is a cache miss, CPU performs a write to main
memory.
• When the cache is missed :
– Write Allocate: loads the memory block into cache
and updates the cache block
– No-Write allocation: this bypasses the cache and
writes the word directly into the memory.
302
Cache Evolution
Problem | Solution | Processor on which feature first appears
External memory slower than the system bus | Add external cache using faster memory technology | 386
Increased processor speed results in external bus becoming a bottleneck for cache access | Move external cache on-chip, operating at the same speed as the processor | 486
Internal cache is rather small, due to limited space on chip | Add external L2 cache using faster technology than main memory | 486
303
Cache Evolution (contd.)
Problem | Solution | Processor on which feature first appears
Increased processor speed results in external bus becoming a bottleneck for L2 cache access | Move L2 cache on to the processor chip | Pentium II
Increased processor speed results in external bus becoming a bottleneck for L2 cache access | Create separate back-side bus (BSB) that runs at higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache | Pentium Pro
306
Example
Assume we have a machine where CPI is 2.0
when all memory accesses hit in the cache.
Only data accesses are loads and stores,
and these total 40% of instructions. If the
miss penalty is 25 clock cycles and miss ratio
is 2%, how much faster would the machine
be if all instructions were cache hits?
307
Answer
308
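The answer slide above is an image; a sketch of the usual calculation, assuming each instruction makes one instruction fetch plus the 0.4 data accesses stated:

```python
cpi_all_hits = 2.0
accesses_per_instr = 1.0 + 0.4  # 1 instruction fetch + 40% loads/stores
miss_rate = 0.02
miss_penalty = 25.0

stalls = accesses_per_instr * miss_rate * miss_penalty  # ~0.7 cycles/instr
cpi_real = cpi_all_hits + stalls                        # ~2.7
speedup = cpi_real / cpi_all_hits                       # ~1.35x with all hits
```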
Secondary Memory
309
Technologies
• Magnetic storage
– Floppy, Zip disk, Hard drives, Tapes
• Optical storage
– CD, DVD, Blue-Ray, HD-DVD
• Solid state memory
– USB flash drive, Memory cards for mobile
phones/digital cameras/MP3 players, Solid
State Drives
310
Magnetic Disk
• Purpose:
– Long term, nonvolatile storage
– Large, inexpensive, and slow
– Lowest level in the memory hierarchy
• Two major types:
– Floppy disk
– Hard disk
• Both types of disks:
– Rely on a rotating platter coated with a magnetic surface
– Use a moveable read/write head to access the disk
• Advantages of hard disks over floppy disks:
– Platters are more rigid ( metal or glass) so they can be larger
– Higher density because it can be controlled more precisely
– Higher data rate because it spins faster
– Can incorporate more than one platter
Components of a Disk
(Figure: spindle, platters, tracks, sectors, disk heads, and arm movement.)
• The arm assembly is moved in or out to position a head on a desired track. The tracks under the heads make an (imaginary) cylinder.
• Only one head reads/writes at any one time.
313
Internal Hard-Disk
Page 223
Magnetic Disk
• A stack of platters, a surface with a magnetic
coating
• Typical numbers (depending on the disk size):
– 500 to 2,000 tracks per surface
– 32 to 128 sectors per track
• A sector is the smallest unit that can be read or
written
• Traditionally all tracks have the same number
of sectors:
• Constant bit density: record more sectors on
the outer tracks
Magnetic Disk Characteristic
• Disk head: each side of a platter has separate disk head
• Cylinder: all the tracks under the heads at a given arm position, on all surfaces
• Read/write data is a three-stage process:
– Seek time: position the arm over the proper track
– Rotational latency: wait for the desired sector to rotate under the
read/write head
– Transfer time: transfer a block of bits (sector) under the read-write
head
• Average seek time as reported by the industry:
– Typically in the range of 8 ms to 15 ms
– (Sum of the times for all possible seeks) / (total # of possible seeks)
• Due to locality of disk reference, actual average seek time may:
– Only be 25% to 33% of the advertised number
Typical Numbers of a Magnetic
Disk
• Rotational Latency:
– Most disks rotate at 3,600 / 5,400 / 7,200 RPM
– Approximately 16 ms per revolution at 3,600 RPM
– The average latency to the desired information is
halfway around the disk: 8 ms (at 3,600 RPM)
• Transfer Time is a function of:
– Transfer size (usually a sector): 1 KB / sector
– Rotation speed: 3,600 RPM to 5,400 RPM to 7,200 RPM
– Recording density: typical diameter ranges from 2
to 14 in
– Typical values: 2 to 4 MB per second
Disk I/O Performance
322
Example
• The advertised average seek time of a disk is 5
ms, its transfer rate is 40 MB per second, it
rotates at 10,000 RPM, and the controller overhead
is 0.1 ms. Calculate the average time to read a
512-byte sector.
323
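Following the three-stage model above, the average read time is seek time + rotational latency + transfer time + controller overhead. A minimal sketch, using the numbers from the example:

```python
# Average time to read one 512-byte sector (values from the example above).
seek_ms = 5.0                    # advertised average seek time
rpm = 10_000
transfer_rate = 40e6             # 40 MB per second, in bytes/second
sector_bytes = 512
overhead_ms = 0.1                # controller overhead

# On average the desired sector is half a revolution away.
rotational_latency_ms = 0.5 * (60 / rpm) * 1000   # = 3 ms
transfer_ms = sector_bytes / transfer_rate * 1000  # ~0.0128 ms

total_ms = seek_ms + rotational_latency_ms + transfer_ms + overhead_ms
print(round(total_ms, 4))        # 5 + 3 + 0.0128 + 0.1 = 8.1128 ms
```

Note how seek time and rotational latency dominate: the actual data transfer accounts for only about 0.01 ms of the ~8.11 ms total.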
RAID
(Redundant Array of Inexpensive Disks)
• A disk organization used to improve
performance of storage systems
• An array of disks controlled by a
controller (RAID Controller)
• Data are distributed over disks
(striping) to allow parallel operation
324
RAID 0- No redundancy
• No redundancy to tolerate disk failure
• Each strip has k sectors (say)
– Strip 0: sectors 0 to k − 1
– Strip 1: sectors k to 2k − 1, etc.
• Works well with large accesses
• Less reliable than having a single large
disk
325
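The strip layout above amounts to a mapping from a logical sector number to a (disk, local sector) pair. A minimal sketch, assuming N disks and k sectors per strip, distributed round-robin (the function name and parameters are illustrative):

```python
def raid0_map(sector: int, n_disks: int, k: int) -> tuple[int, int]:
    """Map a logical sector to (disk index, sector offset on that disk)."""
    strip = sector // k            # which strip the sector falls in
    disk = strip % n_disks         # strips are distributed round-robin
    local_strip = strip // n_disks # how many strips precede it on that disk
    return disk, local_strip * k + sector % k

# 4 disks, 8 sectors per strip:
print(raid0_map(0, 4, 8))    # (0, 0)  strip 0 starts on disk 0
print(raid0_map(8, 4, 8))    # (1, 0)  first sector of strip 1, on disk 1
print(raid0_map(33, 4, 8))   # (0, 9)  strip 4 wraps back to disk 0
```

Because consecutive strips land on different disks, a large access spanning several strips can be serviced by all disks in parallel, which is why RAID 0 works well with large accesses.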
Example (RAID 0)
• Suppose a RAID 0 array consists of 4 disks,
each with an MTTF (mean time to failure) of
20,000 hours.
– On average, some drive in the array fails once
every 20,000 / 4 = 5,000 hours
– So a single large drive with an MTTF of 20,000
hours is 4 times as reliable
326
RAID 1 (Mirroring)
• Uses twice as many disks as does RAID 0
(first half: primary, second half: backup)
• Duplicates all disks
328
RAID 4 (Block-Interleaved Parity)
329
RAID 5- Block Interleaved
Distributed Parity
• In RAID 5, parity information is spread
across all disks
• In RAID 5, multiple writes can occur
simultaneously as long as the stripe units involved
are not on the same disks; this is not possible in
RAID 4, where the single dedicated parity disk
becomes a bottleneck
330
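The parity used by RAID 4/5 is the bitwise XOR of the data strips, which lets any one lost strip be rebuilt from the survivors. A minimal sketch (strip contents are illustrative):

```python
from functools import reduce

def parity(strips: list[bytes]) -> bytes:
    """XOR all strips together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

data = [b"\x0f\x12", b"\xf0\x34", b"\xaa\x56"]
p = parity(data)

# If one data strip is lost, XOR-ing the remaining strips with the
# parity strip reconstructs it, because x ^ x = 0.
lost = data[1]
rebuilt = parity([data[0], data[2], p])
print(rebuilt == lost)   # True
```

This also shows the write cost: updating any one data strip requires recomputing (or incrementally updating) the parity strip, which in RAID 4 always hits the same disk.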
Secondary Storage Devices:
CD-ROM
331
Physical Organization of CD-ROM
• Compact Disc – read-only memory (write once)
• Data is encoded and read optically with a laser
• Can store around 600 MB of data
• Digital data is represented as a series of Pits and
Lands:
– Pit = a little depression, forming a lower level in the track
– Land = the flat part between pits, or the upper levels in the
track
• Reading a CD is done by shining a laser at the disc and detecting
the changing reflection patterns.
– 1 = change in height (land to pit or pit to land)
– 0 = a “fixed” amount of time between 1’s
332
Organization of data
LAND PIT LAND PIT LAND
...------+ +-------------+ +---...
|_____| |_______|
..0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 ..
333
CD-ROM
• Addressing
– 1 second of play time is divided up into 75 sectors.
– Each sector holds 2KB
– 60-minute CD:
60 min × 60 sec/min × 75 sectors/sec = 270,000 sectors; at 2 KB per
sector, 270,000 × 2 KB = 540,000 KB ≈ 540 MB
– A sector is addressed by: Minute:Second:Sector e.g. 16:22:34
• Type of laser
– CD: 780nm (infrared)
– DVD: 635nm or 650nm (visible red)
– HD-DVD/Blu-ray Disc: 405nm (visible blue)
• Capacity
– CD: 650 MB, 700 MB
– DVD: 4.7 GB per layer, up to 2 layers
– HD-DVD: 15 GB per layer, up to 3 layers
– BD: 25 GB per layer, up to 2 layers
334
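The Minute:Second:Sector addressing above maps directly to a linear sector number (75 sectors per second, 2 KB per sector). A minimal sketch:

```python
def msf_to_sector(minute: int, second: int, frame: int) -> int:
    """Linear sector number for a Minute:Second:Sector address."""
    return (minute * 60 + second) * 75 + frame

# The example address 16:22:34 from above:
sector = msf_to_sector(16, 22, 34)
print(sector)                  # 73684
print(sector * 2048)           # byte offset, at 2 KB per sector

# A 60-minute disc ends at 270,000 sectors (~540 MB):
print(msf_to_sector(60, 0, 0))   # 270000
```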
Solid state storage
335
Solid state storage
• Memory cards
– For Digital cameras, mobile phones, MP3 players...
– Many types: Compact flash, Smart Media, Memory Stick,
Secure Digital card...
• USB flash drives
– Replace floppies/CD-RW
• Solid State Drives
– Replace traditional hard disks
• Uses flash memory
– Type of EEPROM
• Electrically erasable programmable read only memory
– Grid of cells (1 cell = 1 bit)
– Write/erase cells by blocks
336
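The "write/erase cells by blocks" point above is what distinguishes flash from ordinary RAM: a program (write) operation can only flip bits from 1 to 0, and restoring bits to 1 requires erasing an entire block, which consumes one of the limited erase cycles. A toy model of this behavior (class name and sizes are illustrative, not any real controller API):

```python
class FlashBlock:
    """Toy model of a flash block: writes can only clear bits;
    only a whole-block erase sets them all back to 1."""
    def __init__(self, size: int = 4):
        self.cells = [0xFF] * size     # erased state: all bits are 1

    def write(self, i: int, value: int):
        # A program operation can only turn 1-bits into 0-bits.
        self.cells[i] &= value

    def erase(self):
        # Resets the whole block; real devices count these cycles.
        self.cells = [0xFF] * len(self.cells)

blk = FlashBlock()
blk.write(0, 0xA5)
print(hex(blk.cells[0]))   # 0xa5
blk.write(0, 0x5A)         # cannot restore the already-cleared bits:
print(hex(blk.cells[0]))   # 0xa5 & 0x5a = 0x0
blk.erase()
print(hex(blk.cells[0]))   # back to 0xff
```

This is why SSD controllers remap writes to pre-erased blocks (wear leveling) rather than erasing in place on every update.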
Solid state storage
• Cell = two transistors
– Bit 1: no electrons trapped in between
– Bit 0: many electrons trapped in between
• Performance
– Access time: ~10X faster than a hard drive
– Transfer rate
• 1x = 150 KB/sec, up to 100x for memory cards
• Similar to a normal hard drive for SSDs (~100-150
MB/sec)
– Limited writes: 100k to 1,000k cycles
337
Solid state storage
• Size
– Very small: 1cm² for some memory cards
• Capacity
– Memory cards: up to 32 GB
– USB flash drives: up to 32 GB
– Solid State Drives: up to 256 GB
338
Solid state storage
• Reliability
– Resistant to shocks
– Silent!
– Avoid extreme heat/cold
– Limited number of erase/write cycles
• Challenges
– Increasing size
– Improving write endurance
339
Virtual Memory
340
Virtual Memory
• Virtual memory is a memory management
technique developed for multitasking kernels
• Separation of user logical memory from
physical memory.
• Logical address space can therefore be
much larger than physical address space
341
A System with
Physical Memory Only
• Examples:
– Most Cray machines, early PCs, nearly all embedded systems, etc.
[Diagram: the CPU issues physical addresses 0 to N-1 directly into memory]
[Diagram: with virtual memory, the CPU issues virtual addresses 0 to P-1,
which are mapped to physical memory (0 to N-1) or to disk]
Address Translation: Hardware converts virtual addresses to physical ones
via OS-managed lookup table (page table)
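The translation above can be sketched as splitting a virtual address into a virtual page number and an offset, then indexing the page table. A minimal sketch (the page size, the dict-based table, and its contents are illustrative):

```python
PAGE_SIZE = 4096  # 4 KB pages, so the low 12 bits are the page offset

# Toy page table: virtual page number -> (valid bit, physical page number).
# An invalid entry means the page lives on disk, not in physical memory.
page_table = {0: (1, 7), 1: (0, None), 2: (1, 3)}

def translate(vaddr: int) -> int:
    vpn, offset = divmod(vaddr, PAGE_SIZE)      # split the virtual address
    valid, ppn = page_table.get(vpn, (0, None))
    if not valid:
        raise RuntimeError("page fault: OS must bring the page in from disk")
    return ppn * PAGE_SIZE + offset             # same offset, new page frame

print(hex(translate(0x2010)))   # VPN 2 -> PPN 3, so 0x3010
```

Note that the page offset passes through unchanged; only the page number is remapped, and an invalid entry triggers a page fault that the OS services from disk.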
Page Tables
[Diagram: the virtual page number indexes a memory-resident page table;
each entry holds a valid bit and a physical page number or disk address.
Entries with valid bit 1 point into physical memory; entries with valid
bit 0 point to disk storage (a swap file or regular file-system file).]
VM – Windows
• Can change the paging file size
• Can place paging files on several
different drives
345
Windows Memory management
346
IO Fundamentals
I/O Fundamentals
• A computer system has three major
components:
– CPU
– Memory
– I/O
PC with PCI and ISA bus
Types and Characteristics of I/O
Devices
• Behavior: how does an I/O device behave?
– Input – Read only
– Output - write only, cannot read
– Storage - can be reread and usually rewritten
• Partner:
– Either a human or a machine is at the other end of
the I/O device
– Either feeding data on input or reading data on
output
• Data rate:
– The peak rate at which data can be transferred
• between the I/O device and the main memory
• Or between the I/O device and the CPU
Data Rate
Buses
• A bus is a shared communication link
• Multiple sources and multiple destinations
• It uses one set of wires to connect multiple
subsystems
• Different uses:
– Data
– Address
– Control
Motherboard
Advantages
• Versatility:
– New devices can be added easily
– Peripherals can be moved between computer
systems that use the same bus standard
• Low Cost:
– A single set of wires is shared in multiple
ways
Disadvantages
• It creates a communication bottleneck
– The bandwidth of that bus can limit the
maximum I/O throughput
• The maximum bus speed is largely limited
by:
– The length of the bus
– The number of devices on the bus
– The need to support a range of devices with:
• Widely varying latencies
• Widely varying data transfer rates
The General Organization of a Bus
• Control lines:
– Signal requests and acknowledgments
– Indicate what type of information is on the
data lines
• Data lines carry information between the
source and the destination:
– Data and Addresses
– Complex commands
• A bus transaction includes two parts:
– Sending the address
– Receiving or sending the data
Master vs. Slave
• A bus transaction includes two parts:
– Sending the address
– Receiving or sending the data
• Master is the one who starts the bus
transaction by:
– Sending the address
• Slave is the one who responds to the
address by:
– Sending data to the master if the master asks
for data
– Receiving data from the master if the master
wants to send data
Output Operation
Input Operation
• Input is defined as the Processor
receiving data from the I/O device
Type of Buses
• Processor-Memory Bus (design specific or proprietary)
– Short and high speed
– Only need to match the memory system
– Maximize memory-to-processor bandwidth
– Connects directly to the processor
• I/O Bus (industry standard)
– Usually lengthy and slower
– Need to match a wide range of I/O devices
– Connects to the processor-memory bus or backplane bus
• Backplane Bus (industry standard)
– Backplane: an interconnection structure within the chassis
– Allow processors, memory, and I/O devices to coexist
• Distributed memory:
– In this model, each processor has its own
(small) local memory, and its content is not
replicated anywhere else
• A multi-core processor is a special
kind of multiprocessor:
– All processors are on the same chip
[Diagrams: superscalar pipelines showing decoder, rename/alloc, uop
queues, BTB and I-TLB, L1 D-Cache, D-TLB, and bus; Thread 2 runs an
integer operation]
• SMT processor: both threads can run concurrently
(Thread 1 and Thread 2 share one core's pipeline resources)
• Multi-core: threads can run on separate cores
(Thread 3 and Thread 4, each core with its own L1 D-Cache and D-TLB)
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads:
2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
SMT dual-core: all four threads
can run concurrently
[Diagram: two SMT cores, each with its own L1 D-Cache, D-TLB,
rename/alloc, and BTB and I-TLB, connected by a shared bus]
• Advantages/disadvantages?
Comparison: multi-core vs SMT
• Multi-core:
– Since there are several cores,
each is smaller and not as powerful
(but also easier to design and manufacture)
– However, great with thread-level parallelism
• SMT
– Can have one large and fast superscalar core
– Great performance on a single thread
– Mostly still only exploits instruction-level
parallelism
The memory hierarchy
• If simultaneous multithreading only:
– all caches shared
• Multi-core chips:
– L1 caches private
– L2 caches private in some architectures
and shared in others
• Memory is always shared
“Fish” machines
• Dual-core Intel Xeon processors
• Each core is hyper-threaded
• Private L1 caches
• Shared L2 cache
[Diagram: CORE0 and CORE1, each with its own L1 cache,
share an L2 cache and memory]
Designs with private L2 caches
• Both L1 and L2 are private
– Examples: AMD Opteron, AMD Athlon, Intel Pentium D
• A design with L3 caches
– Example: Intel Itanium 2
[Diagrams: CORE0 and CORE1 with private L1 caches; in the L3 design,
each core also has an L3 cache; memory is shared]
Private vs shared caches?
• Advantages/disadvantages?
Private vs shared caches
• Advantages of private:
– They are closer to core, so faster access
– Reduces contention
• Advantages of shared:
– Threads on different cores can share the
same cache data
– More cache space available if a single (or
a few) high-performance thread runs on
the system