
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/282845738

Computer System Architecture Lecturer Notes
Research · October 2015
DOI: 10.13140/RG.2.1.2592.8407
Author: Budditha Hettige, General Sir John Kotelawala Defence University
CSC 203 1.5
Computer System Architecture

By
Budditha Hettige
Department of Statistics and Computer Science
University of Sri Jayewardenepura

(2011) Computer System architectures 1


Course Outline
Course Type Core
Credit Value 1.5
Duration 22 lecture hours
Pre-requisites CSC 106 2.0
Course contents
• Introduction and Historical Developments
– About Historical System development
– Processor families
• Computer Architecture and Organization
– Instruction Set Architecture (ISA)
– Microarchitecture
– System architecture
– Processor architecture
– Processor structures
• Interfacing and I/O Strategies
– I/O fundamentals, Interrupt mechanisms, Buses
Course contents
• Memory Architecture
– Primary memory, Cache memory, Secondary memory
• Functional Organization
– Instruction pipelining
– Instruction level parallelism (ILP),
– Superscalar architectures
– Processor and system performance
• Multiprocessing
– Amdahl’s law
– Short vector processing
– Multi-core
– multithreaded processors
Introduction



What is a Computer?
• A machine that can solve problems
for people by carrying out instructions
given to it
• A sequence of instructions is called a
program
• The language a machine can understand
is called machine language



What is Machine Language?
• Machine language(ML) is a system of instructions and data
executed directly by a computer's Central Processing Unit
• The codes are strings of 0s and 1s, or binary digits (“bits”)
• Instructions typically use some bits to represent
– The operation (e.g., addition)
– The operands, or
– The location of the next instruction
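This bit-field encoding can be sketched in code. The 8-bit width and the 4/4 field split below are illustrative assumptions, not the format of any specific real machine:

```python
# A hypothetical 8-bit machine instruction (assumed for illustration):
# the top 4 bits encode the operation, the bottom 4 bits the operand.
def decode(word):
    opcode = (word >> 4) & 0b1111   # operation bits
    operand = word & 0b1111         # operand bits
    return opcode, operand

print(decode(0b10000011))  # (8, 3): opcode 0b1000, operand 0b0011
```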



Machine Language contd..
• Advantages
– Directly executable by the machine
(electronic circuits)
– High speed
• Disadvantages
– Hard for humans to read and write
– Machine dependent
(hardware dependent)



More on Machines
• A machine defines a language
– The set of instructions the machine can carry out
• A language defines a machine
– A machine that can execute every program written in
that language



Two-Layer (Level) Machine
• This machine contains a new language (L1)
in addition to the machine language (L0)
• Programs in L1 are translated or interpreted
into L0
[Figure: the Virtual Machine (L1) sits above the real Machine (L0),
connected by a translator/interpreter.]



Translation (L1 → L0)
1. Each instruction written in L1 is replaced by an
equivalent sequence of L0 instructions
2. The machine then executes the resulting L0 program
3. The program that performs this translation is called a
translator (compiler)



Interpretation
• Each instruction in L1 is executed directly
by carrying out the equivalent L0
instructions
• The program that does this is called an interpreter
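The difference can be sketched in code. Below is a minimal interpreter for a made-up two-instruction language: each instruction is carried out immediately, and no translated program is ever produced (instruction names are invented for illustration):

```python
# A minimal sketch of interpretation: each "L1" instruction is carried
# out at once by "L0" operations (here, plain Python statements).
def interpret(program):
    acc = 0                       # a single accumulator register
    for op, arg in program:
        if op == "LOAD":          # acc <- arg
            acc = arg
        elif op == "ADD":         # acc <- acc + arg
            acc += arg
        elif op == "PRINT":
            print(acc)
    return acc

interpret([("LOAD", 2), ("ADD", 3), ("PRINT", None)])  # prints 5
```

A translator, by contrast, would emit a complete new program first and only then run it.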



Multi Level Machine
High-level Language Program (C, C++)

Assembly Language Program

Machine Language



Multilevel Machine
Virtual Machine Ln

Virtual Machine Ln-1

.
.
.

Machine Language L0



Six-Level Machine
• A computer designed as a hierarchy of six levels
of abstraction



Digital Logic Level
• The interesting objects at this level are gates
• Each gate has one or more digital inputs (0 or 1)
• Each gate is built of at most a handful of
transistors
• A small number of gates can be combined to
form a 1-bit memory, which can store a 0 or 1
• The 1-bit memories can be combined in
groups of, for example, 16, 32 or 64 to form
registers
• Each register can hold a single binary number
up to some maximum
• Gates can also be combined to form the main
computing engine itself
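The 1-bit memory mentioned above can be sketched as a cross-coupled NAND (SR) latch. The simulation below is a minimal illustration under the usual active-low convention, not a timing-accurate gate model:

```python
# A 1-bit memory built from gates: a cross-coupled NAND (SR) latch.
# Inputs are active-low: s_n=0 sets the bit, r_n=0 clears it,
# and s_n = r_n = 1 holds the stored value.
def nand(a, b):
    return 0 if (a and b) else 1

def sr_latch(s_n, r_n, q, q_n):
    # Iterate a few times until the cross-coupled outputs settle.
    for _ in range(4):
        q, q_n = nand(s_n, q_n), nand(r_n, q)
    return q, q_n

q, q_n = sr_latch(0, 1, 0, 1)    # set   -> q becomes 1
q, q_n = sr_latch(1, 1, q, q_n)  # hold  -> q stays 1
q, q_n = sr_latch(1, 0, q, q_n)  # reset -> q becomes 0
```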



Microarchitecture Level
• A collection of 8–32 registers that form a
local memory, and a circuit called an ALU
(Arithmetic Logic Unit) that can perform
simple arithmetic operations
• The registers are connected to the ALU to
form a data path over which the data
flow
• The basic operation of the data path
consists of selecting one or two registers
and having the ALU operate on them
• On some machines the operation of the
data path is controlled by a program called
a microprogram; on other machines it is
controlled by hardware
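One data-path step can be sketched as follows. Register names and the operation set are illustrative assumptions, not those of a particular machine:

```python
# Sketch of one data-path step: select source registers, have the
# ALU operate on them, then write the result back into a register.
def alu(op, a, b):
    return {"ADD": a + b, "SUB": a - b, "AND": a & b}[op]

def datapath_step(regs, op, src1, src2, dst):
    regs[dst] = alu(op, regs[src1], regs[src2])  # write-back stage
    return regs

regs = {"R1": 6, "R2": 4, "R3": 0}
datapath_step(regs, "ADD", "R1", "R2", "R3")  # R3 <- R1 + R2 = 10
```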



Data Path



Instruction Set Architecture Level
• The ISA level is defined by the
machine’s instruction set
• This is the set of instructions carried
out interpretively by the
microprogram or by hardware
execution circuits



Operating System Level
• Uses different memory organization, a new
set of instructions, the ability to run one or
more programs concurrently
• Those level 3 instructions identical to level
2’s are carried out directly by the
microprogram (or hardwired control), not by
the OS;
• In other words, some of the level 3
instructions are interpreted by the OS and
some of the level 3 instructions are
interpreted directly by the microprogram;
• This level is hybrid



Assembly Language Level
• This level is really a symbolic form for one
of the underlying languages;
• This level provides a method for people to write
programs for levels 1, 2 and 3 in a form that is
not as unpleasant as the virtual machine
languages themselves;
• Programs in assembly language are first
translated to level 1, 2 or 3 language and then
interpreted by the appropriate virtual or actual
machine;
• The program that performs the translation is
called an assembler.



Between Levels 3 and 4
• The lower 3 levels are not for the
average programmer – Instead
they are primarily for running the
interpreters and translators
needed to support the higher
levels;
• These are written by system
programmers who specialise in
developing new virtual machines;
• Levels 4 and above are intended
for the applications programmer
• Levels 2 and 3 are always
interpreted, Levels 4 and above
are usually, but not always,
supported by translation;
Problem-oriented Language Level
• This level usually consists of
languages designed to be used by
applications programmers;
• These languages are generally
called higher level languages
• Some examples: Java, C, BASIC,
LISP, Prolog;
• Programs written in these
languages are generally translated
to Level 3 or 4 by translators
known as compilers, although
occasionally they are interpreted
instead;



Multilevel Machines: Hardware
• Programs written in a computer’s true machine language (level
1) can be directly executed by the computer’s electronic circuits
(level 0), without any intervening interpreters or translators.
• These electronic circuits, along with the memory and
input/output devices, form the computer’s hardware.
• Hardware consists of tangible objects:
– integrated circuits
– printed circuit boards
– Cables
– power supplies
– Memories
– Printers
• Hardware is not abstract ideas, algorithms, or instructions.



Multi level machine Software
• Software consists of algorithms (detailed instructions
telling how to do something) and their computer
representations, namely programs
• Programs can be stored on hard disk, floppy disk, CD-
ROM, or other media but the essence of software is the
set of instructions that makes up the programs, not the
physical media on which they are recorded.
• In the very first computers, the boundary between
hardware and software was crystal clear.
• Over time, however, it has blurred considerably, primarily
due to the addition, removal, and merging of levels as
computers have evolved.
• Hardware and software are logically equivalent



The Hardware/Software Boundary

• Any operation performed by software can also
be built directly into the hardware;
• Also, any instruction executed by the hardware
can also be simulated in software;
• The decision to put certain functions in
hardware and others in software is based on
such factors as:
– Cost
– Speed
– Reliability and
– Frequency of expected changes



Exercises
1. Explain each of the following terms in your own
words
– Machine Language
– Instruction
2. What are the differences between Interpretation
and translation?
3. What are Multilevel Machines?
4. What are the differences between the two-level
machine and the six-level machine?



Historical Developments



Computer Generation
1. Zeroth generation- Mechanical Computers (1642-1940)
2. First generation - Vacuum Tubes (1940-1955)
3. Second Generation -Transistors (1956-1963)
4. Third Generation - Integrated Circuits (1964-1971)
5. Fourth Generation – VLSI (Very Large Scale Integration) (1971-present)
6. Fifth Generation – Artificial Intelligence (Present and
Beyond)



The Zero Generation (1)
Year  Name               Made by          Comments
1834  Analytical Engine  Babbage          First attempt to build a digital computer
1936  Z1                 Zuse             First working relay calculating machine
1943  COLOSSUS           British gov't    First electronic computer
1944  Mark I             Aiken            First American general-purpose computer
1946  ENIAC I            Eckert/Mauchley  Modern computer history starts here
1949  EDSAC              Wilkes           First stored-program computer
1951  Whirlwind I        M.I.T.           First real-time computer
1952  IAS                Von Neumann      Most current machines use this design
1960  PDP-1              DEC              First minicomputer (50 sold)
1961  1401               IBM              Enormously popular small business machine
1962  7094               IBM              Dominated scientific computing in the early 1960s
The Zero Generation (2)
Year  Name    Made by    Comments
1963  B5000   Burroughs  First machine designed for a high-level language
1964  360     IBM        First product line designed as a family
1964  6600    CDC        First scientific supercomputer
1965  PDP-8   DEC        First mass-market minicomputer (50,000 sold)
1970  PDP-11  DEC        Dominated minicomputers in the 1970s
1974  8080    Intel      First general-purpose 8-bit computer on a chip
1974  CRAY-1  Cray       First vector supercomputer
1978  VAX     DEC        First 32-bit superminicomputer
1981  IBM PC  IBM        Started the modern personal computer era
1985  MIPS    MIPS       First commercial RISC machine
1987  SPARC   Sun        First SPARC-based RISC workstation
1990  RS6000  IBM        First superscalar machine

The Zero Generation (3)
• Pascal’s machine
– Addition and Subtraction
• Analytical engine
– Four components (Store, mill, input,
output)



Charles Babbage

• Difference Engine 1823

• Analytic Engine 1833


– The forerunner of modern digital computer
– The first conception of a general purpose
computer



Von-Neumann machine



First Generation-Vacuum Tubes
(1945-1955)
• First generation computers are
characterized by the use of vacuum
tube logic
• Developments
– ABC
– ENIAC
– UNIVAC I



First Generation – Timeline

Date  Event     Description                                      Arithmetic  Logic         Memory
1942  ABC       Atanasoff-Berry Computer                         binary      vacuum tubes  capacitors
1946  ENIAC     Electronic Numerical Integrator And Computer     decimal     vacuum tubes  vacuum tubes
1947  EDVAC     Electronic Discrete Variable Automatic Computer  binary      vacuum tubes  mercury delay lines
1948  The Baby  Manchester Small-Scale Experimental Machine      binary      vacuum tubes  CRT
1949  UNIVAC I  Universal Automatic Computer                     decimal     vacuum tubes  mercury delay lines
1949  EDSAC     Electronic Delay Storage Automatic Computer      binary      vacuum tubes  mercury delay lines
1952  IAS       Institute for Advanced Study                     binary      vacuum tubes  cathode ray tubes
1953  IBM 701                                                    binary      vacuum tubes  mercury delay lines



ABC - Atanasoff-Berry Computer

• The world's first electronic digital computer
• The ABC used binary arithmetic



ENIAC – First general purpose
computer
• Electronic Numerical Integrator And Computer
• Designed and built by Eckert and Mauchly at the University of
Pennsylvania during 1943-45
• capable of being reprogrammed to solve a full range of computing
problems
• The first, completely electronic, operational, general-purpose analytical
calculator!
– 30 tons, 72 square meters, 200KW
• Performance
– Read in 120 cards per minute
– Addition took 200 µs, Division 6 ms



UNIVAC - UNIVersal Automatic
Computer
• The first commercial computer
• UNIVAC was delivered in 1951
• designed at the outset for business and
administrative use
• The UNIVAC I had 5200 vacuum tubes, weighed
29,000 pounds, and consumed 125 kilowatts of
electrical power
• Originally priced at US$159,000



The Second Generation-
Transistors (1955-1965)
• Second generation computers are
characterized by the use of discrete
transistor logic
• Use of magnetic core for primary storage
• Developments
– IBM 1620 System
– IBM 7030 System
– IBM 7090 System
– IBM 7094 System



IBM 7090
• The IBM 7090 system was announced in 1958.
• The 7090 included a multiplexor which supported up to 8
I/O channels.
• The 7090 supported both fixed point and floating point
arithmetic.
• Two fixed point numbers could be added in 4.8
microseconds, and two floating point numbers could be
added in 16.8 microseconds.
• The 7090 had 32,768 thirty-six bit words of core storage.
• In 1960, the American Airlines SABRE system
used two 7090 systems.
• Cost of a 7090 system was in the
$3,000,000 range.



IBM 1620
• The IBM 1620 system was announced in 1959.
• The IBM 1620 system had up to 60,000 digits of core
storage (6 bits each.)
• Floating point hardware was optional.
• The IBM 1620 system performed decimal arithmetic.
• The system was digit oriented, not word oriented.



IBM 7030
• The IBM 7030 system was
announced in 1960.
• The IBM 7030 system used
magnetic core for main memory,
and magnetic disks for
secondary storage.
• The ALU could perform
1,000,000 operations per
second.
• Up to 32 I/O channels were
supported.
• The 7030 was also referred to
as "Stretch."
• Cost of a 7030 system was in
the $10,000,000 range.



IBM 7094
• The IBM 7094 system was announced in
1962.
• The 7094 was an improved 7090.
• The 7094 introduced double precision
floating point arithmetic.



Third Generation
• Third generation computers are
characterized by the use of integrated
circuit logic.
• Development
– IBM System/360



IBM S 360
• The IBM S/360 family was announced in 1964.
• Included both multiplexor and selector I/O
channels.
• Supported both fixed point and floating point
arithmetic.
• Had a microprogrammed instruction set.
• Cost between $133,000 and $12,500,000.



Fourth Generation
• Very Large Scale Integration (VLSI) and Ultra
Large Scale Integration (ULSI)
• Fourth generation computers are
characterized by the use of
microprocessors.
• Semiconductor memory was commonly
used
• Development
– Intel
– AMD etc



Intel 4004
• The Intel 4004 microprocessor was announced in
1971.
• The Intel 4004 microprocessor had
– 2,300 transistors.
– A clock speed of 108 KHz.
– A die size of 12 sq mm.
– 4 bit memory access.
– 4 bit registers.
• The Intel 4004 microprocessor supported
– Up to 32,768 bits of program storage.
– Up to 5,120 bits of data storage.
• The 4004 was used mainly in calculators.



Intel 4004 - 1971



MOS 6502
• The MOS 6502 microprocessor was announced in 1975.
• The MOS 6502 microprocessor had
– A clock speed of 1 MHz.
– 8 bit memory access.
– 8 bit registers.
• The MOS 6502 microprocessor supported
– Up to 65,536 bytes (8 bit) of main memory.
• The MOS 6502 was used in
– The Apple II personal computer.
– The Commodore PET personal computer.
– The KIM-1 computer kit.
– The Atari 2600 game system.
– The Nintendo Famicom game system.
• Initial price of the 6502 was $25.00.



Intel Pentium IV - 2001
• “State of the art”

• 42 million
transistors
• 2GHz
• 0.13µm process

• Could fit ~15,000 4004s on this chip!



Now
- zEnterprise196 Microprocessor
• 1.4 billion transistors, Quad core design
• Up to 96 cores (80 visible to OS) in one multichip module
• 5.2 GHz, IBM 45nm SOI CMOS technology
• 64-bit virtual addressing
– original 360 was 24-bit; 370 was a 31-bit extension
• Superscalar, out-of-order
– Up to 72 instructions in flight
• Variable length instruction pipeline: 15-17 stages
• Each core has 2 integer units, 2 load-store units and 2 floating point
units
• 8K-entry Branch Target Buffer
– Very large buffer to support commercial workload
• Four Levels of caches:
– 64KB L1 I-cache, 128KB L1 D-cache
– 1.5MB L2 cache per core
– 24MB shared on-chip L3 cache
– 192MB shared off-chip L4 cache



Fifth Generation
• Computing devices, based on artificial
intelligence
• Features
– Voice recognition,
– Parallel processing
– Quantum computation and molecular and
nanotechnology will radically change the face
of computers in years to come.
– The goal of fifth-generation computing is to
develop devices that respond to natural
language input and are capable of learning
and self-organization



Computer Architecture



What is Computer Architecture?

• A level's set of data types, operations, and
features is called its architecture
• The architecture deals with those aspects
that are visible to the user of that level
• The study of how to design these parts of a
computer is called Computer
Architecture



Why Computer Architecture
• Maximise the overall performance of the system
while keeping within cost constraints
• Bridge the performance gap between the slowest
and fastest components in a computer
• Architecture design
– Search the space of possible designs
– Evaluate the performance of each candidate design
– Identify bottlenecks, redesign, and repeat the
process



Computer Organization
• A simple computer consists of
– CPU
– I/O Devices
– Memory
– BUS (connection method)



Simple Computer



CPU – Central Processing Unit
• The “brain” of the computer
• It executes programs stored in
the main memory
• It is composed of several parts
– Control Unit
– Arithmetic and Logic Unit
– Registers



Registers
• High-speed memory
• Top of the memory hierarchy, and
provide the fastest way to access data
• Store temporary results
• Some useful registers
– PC – Program Counter
• Points to the next instruction
– IR – Instruction Register
• Holds the instruction currently being executed



Registers more…
• Types
– User-accessible Registers
– Data registers
– Address registers
– General purpose registers
– Special purpose registers
– Etc.



Instruction
• Types
– Data handling and Memory operations
• Set, Move, Read, Write
– Arithmetic and Logic
• Add, subtract, multiply, or divide
• Compare
– Control flow
• Complex instructions
– Take the place of many instructions on other computers
• Saving many registers on the stack at once
• Moving large blocks of memory



Parts of an instruction
• Opcode
– Specifies the operation to be performed
• Operands
– Register values,
– Values in the stack,
– Other memory values,
– I/O ports



Type of the operation
• Register-Register Operation
– Add, subtract, compare, and logical
operations
• Memory Reference
– All loads from memory
• Multi Cycle Instructions
– Integer multiply and divide and all floating-
point operations



Fetch-Decode-Execute Cycle
• Instruction fetch
– A 32-bit instruction is fetched from the cache
• Decode
• Execute
• Memory access
• Write back



Fetch-Decode-Execute Cycle



Microprocessors
• Processors can be identified by two main
parameters
– Speed (MHz / GHz)
– Width
• Data bus
• Address bus
• Internal registers



Data bus
• Known as the front-side bus, CPU bus, or
processor-side bus
• Used between the CPU and the main chipset
• Its width defines the size of each memory transfer
– 32 bit
– 64 bit etc.



Data bus



I/O Ports with data transfer rates

The division of I/O buses is according to data transfer rate. Specifically:

Controller   Port / Device            Typical Data Transfer Rate
Super I/O    PS/2 (keyboard / mouse)  2 KB/s
             Serial Port              25 KB/s
             Floppy Disk              125 KB/s
             Parallel Port            200 KB/s
Southbridge  Integrated Audio         1 MB/s
             Integrated LAN           12 MB/s
             USB                      60 MB/s
             Integrated Video         133 MB/s
             IDE (HDD, DVD)           133 MB/s
             SATA (HDD, DVD)          300 MB/s



Address Bus
• Carries addressing information
• Each wire carries a single bit
• Width indicates maximum amount of RAM the
processor can handle
• Data bus and address bus are independent
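The relationship between address-bus width and maximum addressable memory can be checked directly (assuming one byte per address, i.e. byte-addressable memory):

```python
# Sketch: the address-bus width sets the maximum amount of memory
# the processor can address (each wire carries one address bit).
def max_addressable_bytes(address_bus_width):
    return 2 ** address_bus_width   # one byte per address assumed

print(max_addressable_bytes(20))  # 1,048,576 bytes = 1 MB (e.g. 8086)
print(max_addressable_bytes(32))  # 4,294,967,296 bytes = 4 GB
```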



How CPU works?
• A Simple CPU
– 4-bit address bus
– Registers A, B and C (4-bit)
– 8-bit program words (4-bit instruction,
4-bit data)



How CPU works? Instruction SET

0000  SLEEP
0001  LOAD M → A
0010  LOAD M → B
0101  SET A → M
0110  SET B → M
0111  SET C → M
1000  ADD A + B → C
1001  RESET
1111  MOVE

[Figure series: step-by-step execution trace of an example program.
The instruction counter (IC) steps through memory one 8-bit word at a
time (4-bit opcode, 4-bit operand): a value (0010) is loaded into
register A and another (0101) into register B, the ALU adds them into
register C (0111), SET C → M writes the result back to memory, and
RESET/SLEEP end the run.]

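The fetch-decode-execute steps above can be sketched as a small simulator. The opcodes follow the slide's instruction set; treating LOAD's operand as an immediate value and SET's operand as a destination address are assumptions made for illustration:

```python
# A minimal simulator for the toy 4-bit CPU, written as a sketch.
# Each memory word is 8 bits: a 4-bit opcode and a 4-bit operand.
SLEEP, LOAD_A, LOAD_B, SET_C, ADD = 0b0000, 0b0001, 0b0010, 0b0111, 0b1000

def run(mem):
    a = b = c = 0
    ic = 0                                   # instruction counter
    while ic < len(mem):
        word = mem[ic]; ic += 1              # fetch
        op, arg = word >> 4, word & 0b1111   # decode
        if op == LOAD_A:   a = arg           # execute
        elif op == LOAD_B: b = arg
        elif op == ADD:    c = (a + b) & 0b1111
        elif op == SET_C:  mem[arg] = c      # write result to memory
        elif op == SLEEP:  break
    return mem, c

# Program: A <- 2, B <- 5, C <- A + B, store C at address 0, sleep.
mem = [0b0001_0010, 0b0010_0101, 0b1000_0000, 0b0111_0000, 0b0000_0000]
mem, c = run(mem)   # c == 7 and mem[0] == 7
```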


How BUS System works
• The CPU and Devices A, B and C share three buses:
– DATA BUS (4 bit)
– ADDRESS BUS (4 bit)
– CONTROL BUS (2 bit: 01 – READ, 10 – Write)
• Each device responds only to its own address
– Device A: 0100, Device B: 0010, Device C: 0001

[Figure series: bus transaction trace. The CPU places a device address
on the address bus and a READ/WRITE code on the control bus; the
selected device then transfers a value over the data bus.]
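The bus protocol above can be sketched in code. The device addresses and control codes follow the slides; the class and method names are illustrative assumptions:

```python
# Sketch of the shared-bus protocol: the CPU drives the address and
# control buses, and only the addressed device reads or writes the
# data bus. Control codes follow the slide: 01 = READ, 10 = WRITE.
READ, WRITE = 0b01, 0b10

class Device:
    def __init__(self, address):
        self.address = address
        self.value = 0
    def tick(self, addr_bus, ctrl_bus, data_bus):
        if addr_bus != self.address:   # not selected: ignore the cycle
            return data_bus
        if ctrl_bus == WRITE:          # CPU -> device
            self.value = data_bus
        elif ctrl_bus == READ:         # device -> CPU
            data_bus = self.value
        return data_bus

devices = [Device(0b0100), Device(0b0010), Device(0b0001)]  # A, B, C

def bus_cycle(addr, ctrl, data):
    for d in devices:                  # every device sees every cycle
        data = d.tick(addr, ctrl, data)
    return data

bus_cycle(0b0100, WRITE, 0b1010)  # write 1010 to Device A
bus_cycle(0b0100, READ, 0)        # read it back -> 0b1010
```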
Intel
Microprocessor History



Microprocessor History
• Intel 4004 (1971)
– 0.1 MHz
– 4 bit
– World's first single-chip microprocessor
– Instruction set contained 46 instructions
– Register set contained 16 registers of 4 bits each



Microprocessor History
• Intel 8008 (1972)
– Max. CPU clock rate 0.5 MHz to 0.8 MHz
– 8-bit CPU with an external 14-bit address bus
– could address 16KB of memory
– had 3,500 transistors



Microprocessor History
• Intel 8080 (1974)
– second 8-bit microprocessor
– Max. CPU clock rate 2 MHz
– Large 40-pin DIP packaging
– 16-bit address bus and an 8-bit data bus
– Easy access to 64 kilobytes of memory
– Processor had seven 8-bit registers, (A, B,
C, D, E, H, and L)



Microprocessor History
• Intel 8086 (1978)
– 16-bit microprocessor
– Max. CPU clock rate 5 MHz to 10 MHz
– 20-bit external address bus gave a 1 MB
physical address
– 16-bit registers including the stack pointer,



Microprocessor History
• Intel 80286 (1982)
– 16-bit x86 microprocessor
– 134,000 transistors
– Max. CPU clock rate 6 MHz to 25 MHz
– Run in two modes
• Protected mode
• Real mode



Microprocessor History
• Intel 80386 (1985)
– 32-bit Microprocessor
– 275,000 transistors
– 32-bit data bus (16-bit on the later 386SX variant)
– Max. CPU clock rate 12 MHz to 40 MHz
– Instruction set
• x86 (IA-32)



Microprocessor History
• Intel 80486 (1989)
– Max. CPU clock rate 16 MHz to 100 MHz
– FSB speeds 16 MHz to 50 MHz
– Instruction set x86 (IA-32)
– An 8 KB on-chip SRAM cache
– 486 has a 32-bit data bus and a 32-bit address bus.
– Power Management Features and System Management
Mode (SMM) became a standard feature



Microprocessor History
• Intel Pentium I (1993)
– Intel's 5th-generation microarchitecture
– Operated at 60 MHz
– powered at 5V and generated enough heat to
require a CPU cooling fan
– Level 1 CPU cache from 16 KB to 32 KB
– Contained 4.5 million transistors
– compatible with the common Socket 7
motherboard configuration



Microprocessor History
• Intel Pentium II (1997)
– Intel's sixth-generation microarchitecture
– 242-contact Single Edge Contact Cartridge (Slot 1)
– speeds from 233 MHz to 450 MHz
– Instruction set IA-32, MMX
– cache size was increased to 512 KB
– better choice for consumer-level operating systems, such as
Windows 9x, and multimedia applications



Microprocessor History
• Intel Pentium III (1999)
– 400 MHz to 1.4 GHz
– Instruction set IA-32, MMX, SSE
– L1-Cache: 16 + 16 KB (Data + Instructions)
– L2-Cache: 512 KB, external chips on CPU
module at 50% of CPU-speed
– the first x86 CPU to include a unique, retrievable,
identification number



Microprocessor History
• Intel Pentium IV (2000)
– Max. CPU clock rate 1.3 GHz to 3.8 GHz
– Instruction set x86 (i386), x86-64, MMX, SSE,
SSE2, SSE3
– featured Hyper-Threading Technology (HTT)
– The 64-bit external data bus
– More than 42 million transistors
– Processor (front-side) bus runs at 400MHz,
533MHz, 800MHz, or 1066MHz
– Can address up to 4GB RAM
– 2MB of full-speed L3 cache



Microprocessor History
• Intel Core Duo
– Processing Die Transistors 151 million
– Consists of two cores
– 2 MB L2 cache
– All models support: MMX, SSE, SSE2,
SSE3, EIST, XD bit
– FSB Speed 533 MHz
– Intel® Virtualization Technology (VT-x)
– Execute Disable Bit



Microprocessor History
• Pentium Dual-Core
– Max. CPU clock rate 1.3 GHz to 2.6 GHz
– based on either the 32-bit Yonah or (with quite
different microarchitectures) 64-bit Merom-2M
– Instruction set MMX, SSE, SSE2, SSE3, SSSE3,
x86-64
– FSB speeds 533 MHz to 800 MHz
– Cores 2



Microprocessor History
• Intel Core Duo
– Clock Speed 1.2 GHz
– L2 Cache 2 MB
– FSB Speed 533 MHz
– Instruction Set 32-bit
– Processing Die Transistors 151 million
– Advanced Technologies
• Intel® Virtualization Technology (VT-x)
• Enhanced Intel SpeedStep® Technology
• Execute Disable Bit



Microprocessor History
• Core 2 Duo
– Cores 2 , Threads 2
– Clock Speed 3.33 GHz
– L2 Cache 6 MB
– FSB Speed 1333 MHz
– Processing Die Transistors 410 million
– Advanced Technologies
• Intel® Virtualization Technology (VT-x)
• Intel® Virtualization Technology for Directed IO (VT-d)
• Intel® Trusted Execution Technology
• Intel® 64
• Idle States
• Enhanced Intel SpeedStep® Technology
• Thermal Monitoring Technologies
• Execute Disable Bit



Microprocessor History
• Intel Core 2 Quad
– Cores 4 , Threads 4
– Clock Speed 3.0 GHz
– L2 Cache 12 MB
– FSB Speed 1333 MHz
– Processing Die Transistors 410 million
– Advanced Technologies
• Intel® Virtualization Technology (VT-x)
• Intel® Virtualization Technology for Directed IO (VT-d)
• Intel® Trusted Execution Technology
• Intel® 64
• Idle States
• Enhanced Intel SpeedStep® Technology
• Thermal Monitoring Technologies
• Execute Disable Bit



Microprocessor History
• Core i3
– Cores 2
– Threads 4
– Clock Speed 2.13 GHz
– Intel® Smart Cache 3 MB
– Instruction Set 64-bit Instruction Set Extensions
SSE4.1,SSE4.2
– Max Memory Size 8 GB
– Processing Die Transistors 382 million
– Technologies
• Intel® Trusted Execution Technology
• Intel® Fast Memory Access
• Intel® Flex Memory Access



Microprocessor History
• Core i5
– Cores 2
– Threads 4
– Clock Speed 1.7 - 3.0 GHz
– Max Memory Size 8 GB
– Processing Die Transistors 382 million
– Technologies
• Intel® Trusted Execution Technology
• Intel® Fast Memory Access
• Intel® Flex Memory Access
• Intel® Anti-Theft Technology
• Intel® My WiFi Technology
• 4G WiMAX Wireless Technology
• Idle States



Microprocessor History
• Core i7
– Cores 4
– Threads 8
– Clock Speed 3.4 GHz
– Max Turbo Frequency 3.8 GHz
– Intel® Smart Cache 8 MB
– Technologies
• Intel® Turbo Boost Technology 2.0
• Intel® vPro Technology
• Intel® Hyper-Threading Technology
• Intel® Virtualization Technology (VT-x)
• Intel® Virtualization Technology for Directed I/O (VT-d)
• Intel® Trusted Execution Technology
• AES New Instructions
• Intel® 64
• Idle States
• Enhanced Intel SpeedStep® Technology
• Thermal Monitoring Technologies
• Intel® Fast Memory Access
• Intel® Flex Memory Access
• Execute Disable Bit



Summary –
Processor Family Vs Buses



Summary - Intel processors (1)



AMD processors (1)



AMD processors (2)



Microprocessors



Processor Instructions
• Intel 80386 (1985)
– x86 (IA-32)
• Intel 80486 (1989)
– x86 (IA-32)
• Intel Pentium I (1993)
– x86 (IA-32)
• Intel Pentium II (1997)
– IA-32, MMX



Processor Instructions(2)
• Intel Pentium III (1999)
– IA-32, MMX, SSE
• Intel Pentium IV (2000)
– x86 (i386), x86-64, MMX, SSE, SSE2,
SSE3
• Intel Core Duo
– MMX, SSE, SSE2, SSE3, EIST, XD bit
• Pentium Dual-Core
– MMX, SSE, SSE2, SSE3, SSSE3, x86-64



Processor Modes



Processor modes
• Intel and compatible processors run in several modes
– Real Mode
– IA 32 Mode
• Protected Mode
• Virtual Real Mode
– IA 32e 64 bit mode
• 64-bit mode
• Compatibility mode



8086 Real Mode (x86)
• 80286 and later x86-compatible CPUs
• Execute 16 bit instructions
• Address only 1MB Memory
• Single task
• MS-DOS programs run in this mode
– Windows 1.x, 3.x
– 16-bit instructions
• No built-in protection to keep one program from overwriting another in memory



IA-32 - Protected Mode
• First implemented in the Intel 80386 as
a 32-bit extension of x86 architecture
• Can run 32-bit instructions
• A 32-bit OS and 32-bit applications are required
• Programs are protected to keep one program from overwriting another in memory



Virtual Real mode (IA- 32 Mode)

• Backward compatibility (can run 16-bit apps)
– used to execute DOS programs in Windows/386, Windows 3.x, Windows 9x/Me
• 16-bit programs run on top of the 32-bit protected mode
• Address only up to 1 MB
• All Intel and Intel-compatible processors power up in real mode
IA-32e 64-bit Execution Mode
• Originally designed by AMD, later adopted by Intel
• Processor can run
– Real mode
– IA 32 mode
– IA 32e mode
• IA-32e 64-bit mode runs a 64-bit OS and 64-bit apps
• Needs a 64-bit OS and 64-bit support throughout the hardware
64-Bit Operating Systems
• Windows XP – 64 bit Edition for Itanium (IA-
64 bit processors)
• Windows XP Professional x64 (IA-32e, Athlon 64)
• 32-bit applications can run without any problem
• 16-bit and DOS applications do not run
• Problem?
– 64-bit drivers are required for all hardware



Physical memory limit



Processors Features



Processors Features
• System Management Mode (SMM)
• MMX Technology
• SSE, SSE2, SSE3, SSE4, etc.
• 3DNow! Technology
• Math core processor
• Hyper Threading
• Dual core technology
• Quad core technology
• Intel Virtualization
• Execute Disable bit
• Intel® Turbo Boost Technology



System Management Mode(SMM)
• is an operating mode in which normal execution, including the operating
system, is suspended, and special separate software is executed in a
high-privilege mode
• It is available in all later microprocessors in the x86
architecture
• Some uses of SMM are
– Handle system events like memory or chipset errors.
– Manage system safety functions, such as shutdown
on high CPU temperature and turning the fans on and
off.
– Control power management operations, such as
managing the voltage regulator modules.



MMX Technology
• Multimedia extension / Matrix math
extension
• Improves audio/video compression
• MMX defined eight registers, known as
MM0 through MM7
• Each of the MMn registers holds 64 bits
• MMX provides only integer operations
• Used for both 2D and 3D calculations
• 57 new instructions + (SIMD- Single
instruction multiple data)



SSE -Streaming SIMD Extensions

• Used to accelerate floating point and parallel calculations


• is a SIMD instruction set extension to the x86 architecture
• subsequently expanded by Intel to SSE2, SSE3, SSSE3,
and SSE4
• it supports floating point math
• SSE originally added eight new 128-bit registers known as
XMM0 through XMM7
• SSE Instructions
– Floating point instructions
– Integer instructions
– Other instructions



SSE2- Streaming SIMD
Extensions 2
• Introduced in the Pentium IV
• Adds 114 additional instructions
• Also includes MMX and SSE instructions
• SSE2 is an extension of the IA-32
architecture



SSE3- Streaming SIMD
Extensions 3
• Introduced in the Pentium IV Prescott processor
• Code name: Prescott New Instructions (PNI)
• Contains 13 new instructions
• Also includes MMX, SSE, SSE2



SSSE3 - Supplemental SSE3
• Introduced in Xeon and Core 2 processors
• Adds 32 new SIMD instructions to SSE3



SSE4 (HD Boost)
• Introduced by Intel in 2008
• Adds 54 new instructions
• 47 of SSE4 instructions are referred to as
SSE4.1
• 7 other instruction as SSE4.2
• SSE4.1 – is targeted to improve
performance of media, imaging and 3D
• SSE4.2 improves string and text
processing



SSE - Advantages
• Higher image quality and resolution
• High-quality audio and MPEG-2 video in multimedia applications
• Reduced CPU utilization for speech recognition software
• SSEx instructions are useful with MPEG-2 decoding



3DNow! Technology
• AMD’s alternative to SSE
• Adds 21 instructions using SIMD technology
• Enhanced 3DNow! adds 24 more instructions
• Professional 3DNow! adds 51 SSE commands to Enhanced 3DNow!



Math coprocessor
• Provides hardware for floating-point math
• Speeds up mathematical operations
• All Intel processors since the 486DX include a built-in floating point unit (FPU)
• Can perform high-level mathematical operations
• Its instruction set differs from the main CPU’s
Hyper-Threading Technology
• Is an Intel-proprietary technology used to
improve parallelization of computations
doing multiple tasks at once
• The operating system addresses two virtual
processors, and shares the workload
between them when possible
• Allowing multiple threads to run
simultaneously



Hyper-Threading Technology
• Originally introduced in the Xeon processor for servers (2002)
• Available in all Pentium IV processors with 800 MHz bus speed
• HT-enabled processors have 2 sets of general purpose registers and
control registers
• Only a single cache memory and a single set of buses
HT - Requirements
• Processor with HT Technology
• Compatible MB (Chipset)
• BIOS support
• Compatible OS
• Software written to Support HT



Dual Core Technology
• Introduced in 2005
• Consists of 2 CPU cores (enables a single processor to work as 2 processors)
• Multitasking performance is improved



Quad-Core Technology
• Consists of 4 CPU cores (enables a single processor to work as 4 processors)
• Less power consumption
• Designed for multimedia and multitasking workloads



Intel Virtualization
• Allows a hardware platform to run multiple operating systems
• Available in Core 2 Quad processors



Execute Disable Bit
• Is a hardware-based security feature
• Can reduce exposure to viruses and malicious-code attacks, and prevent
harmful software from executing and propagating on the server or network
• Helps protect business assets and reduces the need for costly
virus-related repairs
Intel® Turbo Boost Technology
• Provides more performance when needed
• Automatically allows processor cores to run
faster than the base operating frequency
• Depends on the workload and operating
environment
• Processor frequency will dynamically increase
until the upper limit of frequency is reached
• Has multiple algorithms operating in parallel to
manage current, power, and temperature to
maximize performance and energy efficiency



Bugs



Bugs
• Processors can contain defects or errors
• Previously, the only ways to fix a bug were to
– work around it, or replace the processor with a bug-free revision
• Now…
– many bugs can be fixed by altering the microcode
– microcode defines how the processor carries out its instructions
– processors incorporate reprogrammable microcode



Fixing the Bugs
• Microcode updates reside in ROM
BIOS
• Each time the system is rebooted, the fixed code is loaded
• The microcode updates are provided by Intel to motherboard
manufacturers, who incorporate them into the ROM BIOS
• The most recent BIOS should therefore always be installed
CPU Design Strategy

CISC & RISC



What is CISC?
• CISC is an acronym for Complex
Instruction Set Computer
• Most common microprocessor designs such
as the Intel 80x86 and Motorola 68K series
followed the CISC philosophy.
• But recent changes in software and hardware
technology have forced a re-examination of
CISC and many modern CISC processors
are hybrids, implementing many RISC
principles.
• CISC was developed to make compiler
development simpler.
CISC Characteristics
• 2-operand format,
• Variable length instructions where the length
often varies according to the addressing mode
• Instructions which require multiple clock cycles
to execute.
• E.g. Pentium is considered a modern CISC
processor
• Complex instruction-decoding logic, driven by
the need for a single instruction to support
multiple addressing modes.
• A small number of general purpose registers
• Several special purpose registers.
• A 'condition code' register which is set as a side-effect of most
instructions.
CISC Advantages
• Microprogramming is as easy to implement as assembly language
• The ease of microcoding new instructions
allowed designers to make CISC machines
upwardly compatible: a new computer could
run the same programs as earlier computers
because the new computer would contain a
superset of the instructions of the earlier
computers.
• As each instruction became more capable,
fewer instructions could be used to
implement a given task. This made more
efficient use of the relatively slow main
memory.
CISC Disadvantages
• Instruction set & chip hardware become
more complex with each generation of
computers.
• Many specialized instructions aren't used frequently enough to justify
their existence
• CISC instructions typically set the
condition codes as a side effect of the
instruction.
What is RISC?
• RISC - Reduced Instruction Set Computer.
– is a type of microprocessor architecture
– utilizes a small, highly-optimized set of
instructions, rather than a more specialized set of
instructions often found in other types of
architectures.
• History
– The first RISC projects came from IBM,
Stanford, and UC-Berkeley in the late 70s
and early 80s.
– The IBM 801, Stanford MIPS, and Berkeley RISC
1 and 2 were all designed with a similar
philosophy which has become known as RISC.
RISC - Characteristic
• one cycle execution time: RISC processors
have a CPI (clock per instruction) of one
cycle. This is due to the optimization of each
instruction on the CPU and a technique
called PIPELINING
• pipelining: a technique that allows for simultaneous execution of parts,
or stages, of instructions to process instructions more efficiently;
• large number of registers: the RISC design philosophy generally
incorporates a larger number of registers to reduce the amount of
interaction with memory



RISC Attributes
The main characteristics of CISC microprocessors are:
• Extensive instructions.
• Complex and efficient machine instructions.
• Microencoding of the machine instructions.
• Extensive addressing capabilities for memory
operations.
• Relatively few registers.
In comparison, RISC processors are more or less the
opposite of the above:
• Reduced instruction set.
• Less complex, simple instructions.
• Hardwired control unit and machine instructions.
• Few addressing schemes for memory operands with
only two basic instructions, LOAD and STORE
CISC Vs RISC
CISC | RISC
Emphasis on hardware | Emphasis on software
Includes multi-clock complex instructions | Single-clock, reduced instructions only
Memory-to-memory: "LOAD" and "STORE" incorporated in instructions | Register-to-register: "LOAD" and "STORE" are independent instructions
Small code sizes, high cycles per second | Low cycles per second, large code sizes
Transistors used for storing complex instructions | Spends more transistors on memory registers
Performance of
Computers
Improving Performance of
Computers
• Increasing clock speed
– Physical limitation (Need new hardware)
• Parallelism (Doing more things at once)
– Instruction-level parallelism
• Getting more instruction per second
– Processor-level parallelism
• Having multiple CPUs working on the same
problem
Instruction-level parallelism
• Pipelining
– Instruction execution speed is affected by the time taken to fetch
instructions from memory
– Early computers fetched instructions in advance and stored them in
registers (prefetch buffer)
• Prefetching divides instruction execution into two
parts
– Fetching
– Actual execution
– Pipelining divides instruction in to many parts;
each handled by different hardware and can
run in parallel
Pipelining example
• Packaging cakes
– W1: Place an empty box on the belt every 10 second
– W2: Place the cake in the empty box
– W3: Close and seal the box
– W4: Label the box
– W5: Remove the box and place it in the large
container

Computer Pipelines

• S1: Fetch instruction from memory and place it in a


buffer until it is needed
• S2: Decode the instruction; determine it type and
operands it needs
• S3: locate the fetch operands from memory (or registers)
• S4: Execute instruction
• S5: Write back result in a register

Example
T - Cycle time
N - Number of stages in the pipeline

Latency:
Time taken to execute an instruction = N x T

Processor Bandwidth:
No. of MIPS the CPU has = 1000 / T   (T in ns)

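The two formulas above can be checked with a small calculation. A minimal sketch (the 5-stage, 2 ns pipeline here is an assumed example, not from the slides):

```python
# Ideal pipeline metrics from the slide's formulas:
#   latency   = N x T          (time for one instruction to pass all stages)
#   bandwidth = 1000 / T MIPS  (one instruction completes per cycle, T in ns)

def pipeline_metrics(n_stages, cycle_time_ns):
    latency_ns = n_stages * cycle_time_ns
    bandwidth_mips = 1000.0 / cycle_time_ns
    return latency_ns, bandwidth_mips

# Hypothetical 5-stage pipeline with a 2 ns cycle time:
latency, mips = pipeline_metrics(5, 2.0)
print(latency, mips)  # 10.0 500.0
```

Note that the bandwidth figure assumes the pipeline is kept full, with no stalls.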
Processor - pipeline depth

Dual pipelines

• Instruction fetch unit fetches a pair of instructions and puts


each one into own pipeline
• Pentium has two five-stage pipelines
– U pipeline (main) executes an arbitrary Pentium instruction
– V pipeline (second) executes integer instructions, plus one simple
floating-point instruction
• If instructions in a pair conflict, instruction in u pipeline is
executed. Other instruction is held and is paired with next
instruction

Superscalar architecture
• Single pipeline with multiple functional
units
Processor level parallelism
• High bus traffic

• Low bus traffic


Measuring Performance
Moore’s law
• Describes a long-term trend in the
history of computing hardware
• Defined by Dr. Gordon Moore during
the sixties.
• Predicts an exponential increase in
component density over time, with a
doubling time of 18 months.
• Applicable to microprocessors, DRAMs
, DSPs and other microelectronics.
Moore's Law and Performance
• The performance of computers is
determined by architecture and clock
speed.
• Clock speed doubles over a 3 year period
due to the scaling laws on chip.
• Processors using identical or similar
architectures gain performance directly as
a function of Moore's Law.
• Improvements in internal architecture can
yield better gains than predicted by
Moore's Law.
Measuring Performance

• Execution time:
– Time between start and completion of a task
(including disk accesses, memory accesses )
• Throughput:
– Total amount of work done in a given time
Performance of a Computer

Two computers X and Y;


Performance of (X) > Performance of (Y)

Execution Time (Y) > Execution Time (X)


Performance difference of two computers
X is n times faster than Y
CPU Time
• Time CPU spends on a task
• User CPU time
– CPU time spent in the program
• System CPU time
– CPU time spent in OS performing tasks on
behalf of the program
CPU Time (Example)
• User CPU time = 90.7s
• System CPU time 12.9s
• Execution time = 2 min 39 s = 159 s

• % of CPU time = (User CPU time + System CPU time) / Execution time x 100%
CPU Time
% CPU time = (90.7 + 12.9) x 100 / 159
           = 65 %
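The same calculation can be checked in Python (a minimal sketch; the function name is ours):

```python
def cpu_time_percent(user_s, system_s, elapsed_s):
    # % of CPU time = (user + system CPU time) / elapsed execution time x 100
    return (user_s + system_s) / elapsed_s * 100

pct = cpu_time_percent(90.7, 12.9, 159)
print(round(pct))  # 65
```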
Clock Rate
• Computer clock runs at the constant
rate and determines when events take
place in the hardware

Clock Rate = 1 / Clock Cycle Time
Amdahl’s law
• Performance improvement that can be
gained from some faster mode of
execution is limited by fraction of the
time the faster mode can be used
Amdahl’s law
• Speedup depends on
– Fraction of computation time in original
machine that can be converted to take
advantage of the enhancement
(Fraction Enhanced)
– Improvement gains by enhanced
execution mode
(Speedup Enhanced)
Example
Total execution time of a program = 50 s
Execution time that can be enhanced = 30 s

Fraction Enhanced = 30 / 50 = 0.6
Speedup
Example
Normal mode execution time for some portion of a program = 6 s
Enhanced mode execution time for the same portion = 2 s

Speedup Enhanced = 6 / 2 = 3
Execution Time
Example
• Suppose we consider an enhancement to the
processor of a server system used for Web serving.
New CPU is 10 times faster on computation in Web
application than original CPU. Assume original CPU
is busy with computation 40% of the time and is
waiting for I/O 60% of time.

What is the overall speedup gained


from enhancement?
Answer

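The answer on the slide survives only as an image; the result can be reproduced directly from Amdahl's law. A minimal sketch (the function name is ours):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup = 1 / ((1 - f) + f / s)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 40% of the time is computation, and the new CPU is 10x faster there:
print(amdahl_speedup(0.4, 10))  # ≈ 1.5625
```

Even a 10x faster CPU gives only about a 1.56x overall speedup, because the 60% spent waiting for I/O is untouched.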
Remark
• If an enhancement is only usable for a fraction of a task, we cannot
speed up the task by more than 1 / (1 - Fraction Enhanced)
Example
• A common transformation required in graphics engines is square root.
Implementations of floating-point (FP) square root vary significantly
in performance, especially among processors designed for graphics
• Suppose FP square root (FPSQR) is responsible for 20% of the execution
time of a critical graphics program
• Design alternatives
1. Enhance the FPSQR hardware and speed up this operation by a factor of 10
2. Make all FP instructions run faster by a factor of 1.6

Example
• FP instructions are responsible for a total of 50% of execution time.
The design team believes they can make all FP instructions run 1.6
times faster with the same effort as required for the fast square root.

Compare these two design alternatives

CPU performance equation
CPU time = CPU clock cycles for a program x Clock cycle time
         = CPU clock cycles for a program / Clock rate
Example
A program runs in 10s on computer A
having 400 MHz clock. A new machine
B, which could run the same program in
6s, has to be designed. Further, B
should have 1.2 times as many clock
cycles as A.

What should be the clock rate of B?


Answer
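The answer slide exists only as an image in these notes; the calculation it contains can be reconstructed as follows (a sketch):

```python
# Machine A: 10 s at 400 MHz.
cycles_a = 10 * 400e6            # clock cycles used by A = 4e9
cycles_b = 1.2 * cycles_a        # B needs 1.2x as many cycles
clock_rate_b = cycles_b / 6      # B must finish in 6 s
print(clock_rate_b / 1e6)        # 800.0 (MHz)
```

So machine B needs an 800 MHz clock: twice A's clock rate, because it must cover 20% more cycles in 40% less time.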
CPU Clock Cycles
CPI (clock cycles per instruction)
average no. of clock cycles each instruction
takes to execute
IC (instruction count)
no. of instructions executed in the program

CPU clock cycles = CPI x IC

Note: CPI can be used to compare two different


implementations of the same instruction set
architecture (as IC required for a program is
same)
Example
• Consider two implementations of same
instruction set architecture. For a certain
program, details of time measurements of
two machines are given below

• Which machine is faster for this program and


by how much?
Answer
Measuring components
of CPU performance equation
• CPU Time: by running the program
• Clock Cycle Time: published in
documentation
• IC: by a software tool / simulator of the architecture (more difficult
to obtain)
• CPI: by simulation of an implementation
(more difficult to obtain)
CPU clock cycles
Suppose there are n different types of instructions.
Let
ICi – no. of times instruction type i is executed in the program
CPIi – avg. no. of clock cycles for instruction type i

CPU clock cycles = Σ (CPIi x ICi), for i = 1 to n
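The total can be sketched as a weighted sum over the instruction types; the instruction mix below is an invented example, not from the slides:

```python
def cpu_clock_cycles(instr_mix):
    # instr_mix: iterable of (IC_i, CPI_i) pairs, one per instruction type
    return sum(ic * cpi for ic, cpi in instr_mix)

# Invented mix: 5e9 ALU ops (CPI 1), 1e9 loads (CPI 2), 5e8 branches (CPI 3)
print(cpu_clock_cycles([(5e9, 1), (1e9, 2), (5e8, 3)]))  # 8500000000.0
```

Dividing by the total instruction count gives the program's average CPI.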
Example
Suppose we have made the following measurements:
– Frequency of FP operations (other than FPSQR) = 25%
– Average CPI of FP operations = 4.0
– Average CPI of other instructions = 1.33
– Frequency of FPSQR= 2%
– CPI of FPSQR = 20

Design alternatives:
1. decrease CPI of FPSQR to 2
2. decrease average CPI of all FP operation to 2.5

Compare these two design alternatives using CPU


performance equation
Answers
• Note that only the CPI changes; the clock rate and IC remain identical
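The comparison can be checked numerically. A sketch, assuming (as in the common textbook form of this example) that the 2% FPSQR instructions are included in the 25% FP operations:

```python
other_frac, other_cpi = 0.75, 1.33
fp_frac, fp_cpi = 0.25, 4.0          # all FP ops, FPSQR included
fpsqr_frac, fpsqr_cpi = 0.02, 20.0

cpi_orig = fp_frac * fp_cpi + other_frac * other_cpi   # ≈ 2.00
cpi_alt1 = cpi_orig - fpsqr_frac * (fpsqr_cpi - 2.0)   # FPSQR CPI drops to 2 -> ≈ 1.64
cpi_alt2 = other_frac * other_cpi + fp_frac * 2.5      # all FP CPI -> 2.5   -> ≈ 1.62

# Alternative 2 wins slightly; its speedup over the original is cpi_orig / cpi_alt2
print(cpi_orig, cpi_alt1, cpi_alt2)
```

Because clock rate and IC are unchanged, comparing CPIs is enough: the design with the lower CPI is the faster one.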
MIPS as a performance measure
Problems
MIPS as a performance measure
• MIPS is dependent on the instruction set
– difficult to compare MIPS of computers
with different instruction sets
• MIPS can vary inversely to
performance
MFLOPS as a performance
measure
Problems
MFLOPS as a performance measure
• MFLOPS is not dependable
– Cray C90 has no divide instructions while
Pentium has
• MFLOPS depends on the mixture of
fast and slow floating point operations
– add (fast) and divide (slow) operations
Instruction Set Architecture
(ISA) Level

Introduction

Instruction Set Architecture
• Positioned between microarchitecture
level and operating system level
• Important to system architects
– interface between software and hardware

Instruction Set Architecture

ISA contd..
• General approach of system designers:
– Build programs in high-level languages
– Translate to ISA level
– Build hardware that executes ISA level
programs directly

• Key challenge:
– Build better machines subject to backward
compatibility constraint

Features of a good ISA
• Define a set of instructions that can be
implemented efficiently in current and
future technologies resulting in cost
effective designs over several
generations
• Provide a clean target for compiled
code

Properties of the ISA level
• ISA level code is what a compiler
outputs
• To produce ISA code, compiler writer
has to know
– What the memory model is
– What registers are there
– What data types and instructions are
available

ISA level memory models
• Computers divide memory into cells (8
bits) that have consecutive addresses
• Bytes are grouped into words (4-, 8-
byte) with instructions available for
manipulating entire words
• Many architectures require words to be
aligned on their natural boundaries
– Memories operate more efficiently that
way

ISA level Memory Models

• On Pentium II (fetches 8 bytes at a time from


memory), ISA programs can make memory
references to words starting at any address
– Requires extra logic circuits on the chip
– Intel allows it because of the backward compatibility constraint
(8088 programs made non-aligned memory references)

ISA level registers
• Main function of ISA level registers:
– provide rapid access to heavily used data
• Registers are divided into 2 categories
– special purpose registers (program
counter, stack pointer)
– General purpose registers (hold key local
variables, intermediate results of
calculations).
• These are interchangeable

Instructions
• Main feature of ISA level is its set of
machine instructions
• They control what the machine can do
• Ex:
– LOAD and STORE instructions move data
between memory and registers
– MOVE instruction copies data among
registers

Pentium II ISA level (Intel’s IA-32)
• Maintains full support for execution of programs
written for 8086, 8088 processors (16-bit)
• Pentium II has 3 operating modes (Real mode,
Virtual 8086 mode, Protected mode)
• Address space: memory is divided into 16,384 segments, each going from
address 0 to address 2^32 - 1 (Windows supports only one segment)
• Every byte has its own address, with words being
32 bits long
• Words are stored in Little endian format (low-
order byte has lowest address)

Little endian and Big endian
format

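The two byte orders can be demonstrated with Python's struct module (a sketch, independent of the slides):

```python
import struct

word = 0x01020304
little = struct.pack('<I', word)  # little endian: low-order byte first
big = struct.pack('>I', word)     # big endian: high-order byte first
print(little.hex())  # 04030201
print(big.hex())     # 01020304
```

In little endian format (used by the Pentium II) the low-order byte 0x04 sits at the lowest address.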
Pentium II’s primary registers

Pentium II’s primary registers
• EAX: Main arithmetic registers, 32-bit
– 16-bit register in low-order 16 bits
– 8-bit register in low-order 8 bits
– easy to manipulate 16-bit (in 80286) and 8-bit
(in 8088) quantities
• EBX: holds pointers
• ECX: used in looping
• EDX: used for multiplication and division,
where together with EAX, it holds 64-bit
products and dividends
Pentium II’s primary registers
• ESI, EDI: hold pointers into memory
– Especially for hardware string manipulation
instructions (ESI points to source string, EDI
points to destination string)
• EBP: pointer register
• ESP: stack pointer
• CS through GS: segment registers
• EIP: program counter
• EFLAGS: flag register (holds various
miscellaneous bits such as conditional
codes)
Pentium II data Types

Instruction Formats
• An instruction consists of an opcode,
plus additional information such as
where operands come from, where
results go to
• Opcode tells what instruction does
• On some machines, all instructions
have same length
– Advantages: simple, easy to decode
– Disadvantages: waste space

Common Instruction Formats

(a) Zero address instruction


(b) One address instruction
(c) Two address instruction
(d) Three address instruction

Instruction and Word length
Relationships

Example
• An instruction with a 4-bit opcode and three 4-bit addresses

Design of Instruction Formats
• Factors:
– Length of instruction
• short instructions are better than long
instructions (modern processors can execute
multiple instructions per clock cycle)
– Sufficient room in the instruction format to
express all operations required
– No. of bits in an address field

Intel® 64 and IA-32 Architectures
• Intel 64 and IA-32 instructions
– General purpose
– x87 FPU
– x87 FPU and SIMD state management
– Intel MMX technology
– SSE extensions
– SSE2 extensions
– SSE3 extensions
– SSSE3 extensions
– SSE4 extensions
– AESNI and PCLMULQDQ
– Intel AVX extensions
– F16C, RDRAND, FS/GS base access
– System instructions
– IA-32e mode: 64-bit mode instructions
– VMX instructions
– SMX instructions

Addressing

Addressing
• Subject of specifying where the operands
(addresses) are
– ADD instruction requires 2 or 3 operands, and
instruction must tell where to find operands and
where to put result
• Addressing Modes
– Methods of interpreting the bits of an address field
to find operand
• Immediate Addressing
• Direct Addressing
• Register Addressing
• Register Indirect Addressing
• Indexed Addressing

Immediate Addressing
• Simplest way to specify where the operand is
• Address part of instruction contains operand
itself (immediate operand)
• Operand is automatically fetched from memory
at the same time the instruction itself is fetched
– Immediately available for use
• No additional memory references are required
• Disadvantages
– only a constant can be supplied
– value of the constant is limited by size of address field
• Good for specifying small integers

Example
Immediate Addressing
MOV R1, #8 ; Reg[R1] ← 8
ADD R2, #3 ; Reg[R2] ← Reg[R2] + 3

Direct Addressing
• Operand is in memory, and is specified by giving
its full address (memory address is hardwired
into instruction)
• Instruction will always access exactly same
memory location, which cannot change
• Can only be used for global variables whose address is known at
compile time

• Example Instruction:
– ADD R1, (1001) ; Reg[R1] ← Reg[R1] + Mem[1001]

Direct Addressing Example

Register Addressing
• Same as direct addressing with the exception that it
specifies a register instead of memory location
• Most common addressing mode on most computers
since register accesses are very fast
• Compilers try to put most commonly accessed
variables in registers
• Cannot be used alone in LOAD and STORE instructions (one operand is
always a memory address)
• Example instruction:
– ADD R3, R4 ; Reg[R3] ← Reg[R3] + Reg[R4]

Register Indirect Addressing
• Operand being specified comes from memory or
goes to memory
• Its address is not hardwired into instruction, but is
contained in a register (pointer)
• Can reference memory without having full memory
address in the instruction
• Different memory words can be used on different
executions of the instruction

• Example instruction:
– ADD R1, (R2) ; Reg[R1] ← Reg[R1] + Mem[Reg[R2]]

Example
• Following generic assembly program calculates the
sum of elements (1024) of an array A of integers of 4
bytes each, and stores result in register R1

– MOV R1, #0 ; sum in R1 (0 initially)


– MOV R2, #A ; Reg[R2] = address of array A
– MOV R3, #A+4096 ; Reg[R3] = address of first
word beyond A
– LOOP: ADD R1, (R2) ; register indirect via R2 to get
operand
– ADD R2, #4 ; increment R2 by one word
– CMP R2, R3 ; is R2 < R3?
– BLT LOOP ; loop if R2 < R3

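The same loop can be written in Python, with the registers as plain variables (a sketch; memory is modeled as a Python list, so the 4-byte word step becomes an index step of 1):

```python
memory = list(range(1024))   # array A: 1024 sample integers

r1 = 0                       # MOV R1, #0       ; sum
r2 = 0                       # MOV R2, #A       ; "address" of first element
r3 = 1024                    # MOV R3, #A+4096  ; first element beyond A
while r2 < r3:               # CMP R2, R3 / BLT LOOP
    r1 += memory[r2]         # ADD R1, (R2)     ; register indirect via R2
    r2 += 1                  # ADD R2, #4       ; next word (index step of 1 here)

print(r1)  # 523776
```

The key step is `memory[r2]`: the register supplies the address, exactly as in the register indirect `ADD R1, (R2)`.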
Indexed Addressing
• Memory is addressed by giving a register
plus a constant offset
• Used to access local variables

• Example instruction:
– ADD R3, 100(R2)
; Reg[R3] ← Reg[R3] + Mem[100+Reg[R2]]

Based-Indexed Addressing
• Memory address is computed by
adding up two registers plus an optional
offset

• Example instruction:
ADD R3, (R1+R2)
;Reg[R3] ← Reg[R3] + Mem[Reg[R1] +
Reg[R2]]

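The addressing modes above can be modeled in a few lines of Python (a sketch; the register and memory contents are invented):

```python
reg = {'R1': 7, 'R2': 100}              # invented register contents
mem = {100: 55, 104: 66, 200: 99}       # invented memory contents

immediate      = lambda const: const               # operand is in the instruction
direct         = lambda addr: mem[addr]            # full memory address hardwired
register       = lambda r: reg[r]                  # operand is a register
register_indir = lambda r: mem[reg[r]]             # register holds a pointer
indexed        = lambda off, r: mem[off + reg[r]]  # register + constant offset

print(immediate(8), direct(200), register('R1'),
      register_indir('R2'), indexed(4, 'R2'))  # 8 99 7 55 66
```

Each helper returns the operand value an instruction would fetch under that mode; based-indexed addressing simply adds a second register into the `indexed` sum.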
Instruction Types
• ISA level instructions are divided into few
categories
– Data Movement Instructions
• Copy data from one location to another
– Examples (Pentium II integer instructions):
• MOV DST, SRC – copies SRC (source) to DST
(destination)
• PUSH SRC – push SRC into the stack
• XCHG DS1, DS2 – exchanges DS1 and DS2
• CMOV DST, SRC – conditional move

Instruction Types contd..
– Dyadic Operations
• Combine two operands to produce a result
(arithmetic instructions, Boolean instructions)
– Examples (Pentium II integer instructions):
• ADD DST, SRC – adds SRC to DST, puts result in
DST
• SUB DST, SRC – subtracts DST from SRC
• AND DST, SRC – Boolean AND SRC into DST
• OR DST, SRC - Boolean OR SRC into DST
• XOR DST, SRC – Boolean Exclusive OR SRC into
DST

242
Instruction Types contd..
• Monadic Operations
– Have one operand and produce one result
– Shorter than dyadic instructions
• Examples (Pentium II integer
instructions):
– INC DST – adds 1 to DST
– DEC DST – subtracts 1 from DST
– NOT DST – replace DST with 1’s
complement

243
Instruction Types contd..
• Comparison and Conditional Branch
Instructions

• Examples (Pentium II integer


instructions):
– TST SRC1, SRC2 – Boolean AND operands, set flags
(EFLAGS)
– CMP SRC1, SRC2 – sets flags based on SRC1-SRC2

244
Instruction Types contd..
• Procedure (Subroutine) call
Instructions
– When the procedure has finished its task,
control is returned to the statement after the call

• Examples (Pentium II integer


instructions):
– CALL ADDR -Calls procedure at ADDR
– RET - Returns from procedure

245
Instruction Types contd..
• Loop Control Instructions
– LOOPxx – loops until condition is met
• Input / Output Instructions
There are several input/output schemes
currently used in personal computers
– Programmed I/O with busy waiting
– Interrupt-driven I/O
– DMA (Direct Memory Access) I/O

246
Programmed I/O with busy waiting

• Simplest I/O method


• Commonly used in low-end processors
• Processors have a single input instruction and a
single output instruction, and each of them
selects one of the I/O devices
• A single character is transferred between a fixed
register in the processor and selected I/O device
• Processor must execute an explicit sequence of
instructions for each and every character read or
written

247
DMA I/O
• DMA controller is a chip that has a direct
access to the bus
• It consists of at least four registers, each of
which can be loaded by software:
– Register 1 contains memory address to be
read/written
– Register 2 contains the count of how many
bytes / words to be transferred
– Register 3 specifies the device number or I/O
space address to use
– Register 4 indicates whether data are to be
read from or written to I/O device
248
Structure of a DMA

249
Registers in the DMA
• Status register: readable by the CPU to determine the status
of the DMA device (idle, busy, etc)
• Command register: writable by the CPU to issue a command
to the DMA
• Data register: readable and writable. It is the buffering place
for data that is being transferred between the memory and the
IO device.
• Address register: contains the starting location of memory
where from or where to the data will be transferred. The
Address register must be programmed by the CPU before
issuing a "start" command to the DMA.
• Count register: contains the number of bytes that need to be
transferred. The information in the address and the count
register combined will specify exactly what information need to
be transferred.
250
Example
• Writing a block of 32 bytes from memory
address 100 to a terminal device (4)

251
Example contd..
• CPU writes numbers 32, 100, and 4 into first three
DMA registers, and writes the code for WRITE (1, for
example) in the fourth register
• DMA controller makes a bus request to read byte
100 from memory
• DMA controller makes an I/O request to device 4 to
write the byte to it
• DMA controller increments its address register by 1
and decrements its count register by 1
• If the count register is > 0, another byte is read from
memory and then written to device
• DMA controller stops transferring data when count =
0

252
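The transfer sequence above can be sketched as a simple model (illustrative only; real DMA hardware works on bus signals, not Python objects):

```python
# Model of the DMA register-driven copy loop described above.
def dma_write(memory, device, address, count):
    """Copy `count` bytes starting at `address` from memory to a device."""
    # The CPU has already loaded the DMA registers: address, count,
    # device number, and the WRITE command.
    while count > 0:
        byte = memory[address]   # bus request: read one byte from memory
        device.append(byte)      # I/O request: write the byte to the device
        address += 1             # increment the address register
        count -= 1               # decrement the count register
    # the controller stops transferring data when count reaches 0

memory = {100 + i: i for i in range(32)}  # 32 bytes at address 100
terminal = []                             # stands in for device 4
dma_write(memory, terminal, 100, 32)
print(len(terminal))                      # 32 bytes transferred
```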
Sample Questions
Q1.
1. Explain the processor architecture of 8086.
2. What are the differences between the Intel
Pentium processor and a dual-core processor?
3. What are the advantages and disadvantages
of multi-core processors?

253
Sample Questions
Q2.
1. What is addressing?
2. Briefly explain each addressing mode,
comparing their advantages,
disadvantages and features.
3. What is DMA and why is it useful for
programming? Explain your answer.

254
Computer Memory
• Primary Memory
• Secondary Memory
• Virtual Memory

255
Levels in Memory Hierarchy
CPU ↔ registers ↔ cache ↔ main memory ↔ disk (virtual memory),
with 8 B, 32 B and 4 KB transfer units between adjacent levels

            Register   Cache        Memory      Disk Memory
size:       32 B       32 KB-4 MB   4096 MB     1 TB
speed:      0.3 ns     2 ns         7.5 ns      8 ms
$/Mbyte:    -          $75/MB       $0.014/MB   $0.00012/MB
line size:  4 B        32 B         4 KB        -

larger, slower, cheaper →


Primary Memory

257
Primary memory
• Memory is the workspace for the CPU
• When a file is loaded into memory, it is a copy of the
file that is actually loaded
• Consists of a no. of cells, each having a number
(address)
• n cells → addresses: 0 to n−1
• Same no. of bits in each cell
• Adjacent cells have consecutive addresses
• m-bit address → 2^m addressable cells
• A portion of RAM address space is mapped into one
or more ROM chips

258
Ways of organizing a 96-bit
memory

259
SRAM (Static RAM)
• Constructed using flip flops
• 6 transistors for each bit of storage
• Very fast
• Contents are retained as long as power is
kept on
• Expensive
• Used in level 2 cache

260
DRAM (Dynamic RAM)
• No flip‐flops
• Array of cells, each consisting a transistor and a capacitor
• Capacitors can be charged or discharged, allowing 0s
and 1s to be Stored
• Electric charge tends to leak out ⇒ each bit in a DRAM
must be reloaded (refreshed) every few milliseconds (15
ms) to prevent data from leaking away
• Refreshing takes several CPU cycles to complete (less
than 1% of overall bandwidth)
• High density (30 times smaller than SRAM)
• Used in main memories
• Slower than SRAM
• Inexpensive (30 times lower than SRAM)

261
SDRAM (Synchronous DRAM)
• Hybrid of SRAM and DRAM
• Runs in synchronization with the system bus
• Driven by a single synchronous clock
• Used in large caches, main memories

262
DDR (Double Data Rate) SDRAM

• An upgrade to standard SDRAM


• Performs 2 transfers per clock cycle (one at falling
edge, one at rising edge) without doubling actual
clock rate

263
Dual channel DDR
• Technique in which 2 DDR DIMMs are installed at one time and
function as a single bank doubling the bandwidth of a single module

• DDR2 SDRAM
– A faster version of DDR SDRAM (doubles the data rate of DDR)
– Less power consumption than DDR
– Achieves higher throughput by using differential pairs of signal wires
– Additional signal add to the pin count

• DDR3 SDRAM
– An improved version of DDR2 SDRAM
– Same no. of pins as in DDR2,
– Not compatible with DDR2
– Can transfer twice the data rate of DDR2
– DDR3 standard allows chip sizes of 512 Megabits to
8 Gigabits (max module size – 16GB)

264
DRAM Memory module

265
DRAM Memory module

266
SDRAM and DDR DIMM versions

• Buffered
• Unbuffered
• Registered

267
SDRAM and DDR DIMM
• Buffered Module
– Has additional buffer circuits between memory
chips and the connector to buffer signals
– New motherboards are not designed to use
buffered modules

• Unbuffered Module
– Allows memory controller signals to pass directly
to memory chips with no interference
– Fast and most efficient design
– Most motherboards are designed to use
unbuffered modules

268
SDRAM and DDR DIMM
• Registered Module
– Uses register chips on the module that act
as an interface between RAM chip and
chipset
– Used in systems designed to accept
extremely large amounts of RAM (server
motherboards)

269
Memory Errors

270
Memory errors
• Hard errors
– Permanent failure
– How to fix? (replace the chip)
• Soft errors
– Non permanent failure
– Occurs at infrequent intervals
– How to fix? (restart the system)
• Best way to deal with soft errors is to
increase system’s fault tolerance
(implement ways of detecting and
correcting errors)
271
Techniques used for fault
tolerance
• Parity
• ECC (Error Correcting Code)

272
Parity Checking
• 9 bits are used in the memory chip to
store 1 byte of information
• Extra bit (parity bit) keeps tabs on other
8 bits
• Parity can only detect errors, but
cannot correct them

273
Odd parity standard for error
checking
• Parity generator/checker is a part of CPU
or located in a special chip on
motherboard
• Parity checker evaluates the 8 data bits
by adding the no. of 1s in the byte
• If an even no. of 1s is found, parity
generator creates a 1 and stores it as the
parity bit in memory chip

274
Odd parity standard for error
checking (contd.)
• If the sum is odd, the parity bit would be 0
• If a (9-bit) byte has an even no. of 1s, that
byte must have an error
• System cannot tell which bit or bits have changed
• If 2 bits changed, the bad byte could pass
unnoticed
• Multiple-bit errors in a single byte are very
rare
• System halts when a parity check error is
detected

275
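A minimal sketch of the odd-parity scheme just described (pure Python, no hardware specifics):

```python
# Odd parity: the stored 9th bit makes the total number of 1s odd;
# an even count on read signals a detectable error.
def odd_parity_bit(byte):
    ones = bin(byte & 0xFF).count("1")
    return 1 if ones % 2 == 0 else 0   # force an odd total across 9 bits

def check(byte, parity_bit):
    total = bin(byte & 0xFF).count("1") + parity_bit
    return total % 2 == 1              # True = no detectable error

b = 0b10110100                 # four 1s -> parity bit must be 1
p = odd_parity_bit(b)
print(p, check(b, p))          # 1 True
print(check(b ^ 0b1, p))       # single bit flipped -> False (detected)
print(check(b ^ 0b11, p))      # two bits flipped -> True (passes unnoticed)
```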
ECC- Error Correcting Code
• Successor to parity checking
• Can detect and correct memory errors
• Only a single-bit error can be corrected,
though it can detect double-bit errors
• This type of ECC is known as single bit
error correction double bit error detection
(SEC DED)
• SEC DED requires an additional 7 check
bits over 32 bits in a 4 byte system, or 8
check bits over 64 bits in an 8 byte system
276
ECC- Error Correcting Code
• ECC entails memory controller
calculating check bits on a
memory write operation, performing a
compare between read and calculated
check bits on a read operation
• Cost of additional ECC logic in memory
controller is not significant
• It affects memory performance on a
write

277
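The check-bit counts quoted above follow from the Hamming bound (2^r ≥ m + r + 1), plus one extra parity bit for double-error detection; a small sketch:

```python
# Smallest r with 2**r >= m + r + 1 (single-error correction),
# plus one extra bit to upgrade SEC to SEC-DED.
def sec_ded_check_bits(data_bits):
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1   # +1 parity bit for double-bit error detection

print(sec_ded_check_bits(32))  # 7, as stated for a 4-byte system
print(sec_ded_check_bits(64))  # 8, as stated for an 8-byte system
```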
Cache memory

278
Cache Memory
• A high-speed, small memory
• Most frequently used memory words are kept in it
• When the CPU needs a word, it first checks the
cache. If not found, it checks main memory

279
Cache and Main Memory

280
Cache memory Vs Main Memory

281
Cache Hit and Miss
• Cache Hit: a request to
read from memory,
which can satisfy from
the cache without using
the main memory.
• Cache Miss: A request
to read from memory,
which cannot be
satisfied from the cache,
for which the main
memory has to be
consulted.

282
Locality Principle
• PRINCIPLE OF LOCALITY is the tendency to
reference data items that are near other
recently referenced data items, or that were
recently referenced themselves.
• TEMPORAL LOCALITY : memory location that
is referenced once is likely to be referenced
multiple times in near future.
• SPATIAL LOCALITY : if a memory location is
referenced once, the program is likely to
reference a nearby memory location in the
near future.
283
Locality Principle
Let
c – cache access time
m – main memory access time
h – hit ratio (fraction of all references that can
be satisfied out of cache)
miss ratio = 1 − h
Average memory access time = c + (1 − h) m
h = 1: no main memory references
h = 0: all references go to main memory

284
Example:
Suppose that a word is read k times in a
short interval
First reference: memory; other k − 1
references: cache
h = (k − 1) / k
Average memory access time = c + m / k
285
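A quick sketch of the formulas above (the access times c = 2 ns and m = 70 ns are assumed values for illustration):

```python
# Average access time from the formula above: t = c + (1 - h) * m.
def avg_access_time(c, m, h):
    return c + (1 - h) * m

# A word read k times: the first reference misses, the other k-1 hit,
# so h = (k - 1) / k and the average collapses to c + m / k.
def avg_access_time_k(c, m, k):
    h = (k - 1) / k
    return avg_access_time(c, m, h)

print(avg_access_time(2, 70, 0.95))     # assumed c=2 ns, m=70 ns -> 5.5 ns
print(avg_access_time_k(2, 70, 10))     # c + m/k = 2 + 7 -> 9.0 ns
```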
Cache Memory
• Main memories and caches are divided into fixed sized
blocks
• Cache lines – blocks inside the cache
• On a cache miss, entire cache line is loaded into cache
from memory
• Example:
– A 64 KB cache can be divided into 1K lines of 64 bytes, 2K lines of
32 bytes, etc.
• Unified cache
– instruction and data use the same cache
• Split cache
– Instructions in one cache and data in another

286
A system with three levels of
cache

287
Pentium 4 Block Diagram

288
Replacement Algorithm
• Optimal Replacement: replace the
block which is no longer needed in the
future. If all blocks currently in Cache
Memory will be used again, replace the
one which will not be used in the future
for the longest time.
• Random selection: replace a randomly
selected block among all blocks
currently in Cache Memory.

289
Replacement Algorithm
• FIFO (first-in first-out): replace the block
that has been in Cache Memory for the
longest time.
• LRU (Least recently used): replace the
block in Cache Memory that has not
been used for the longest time.
• LFU (Least frequently used): replace
the block in Cache Memory that has
been used for the least number of times

290
Cache Memory Placement Policy
• Three commonly used methods to
translate main memory addresses to
cache memory addresses.
– Associative Mapped Cache
– Direct-Mapped Cache
– Set-Associative Mapped Cache
• The choice of cache mapping scheme
affects cost and performance, and there
is no single best method that is
appropriate for all situations
291
Associative Mapping

292
Associative Mapping
• A block in the Main Memory can
be mapped to any block in the
Cache Memory available (not
already occupied)
• Advantage: Flexibility. A Main
Memory block can be mapped
anywhere in Cache Memory.
• Disadvantage: Slow or
expensive. A search through all
the Cache Memory blocks is
needed to check whether the
address can be matched to any
of the tags.

293
Direct Mapping

294
Direct Mapping
• To avoid the search through all
CM blocks needed by
associative mapping, this
method only allows
(# blocks in main memory) /
(# blocks in cache memory)
blocks to be mapped to each
Cache Memory block
• Each entry (row) in cache can
hold exactly one cache line
from main memory
• e.g., with 2048 entries and a
32-byte cache line size, the
cache can hold 64 KB
295
Direct Mapping
• Advantage: Direct mapping is faster than
the associative mapping as it avoids
searching through all the CM tags for a
match.
• Disadvantage: But it lacks mapping
flexibility. For example, if two MM blocks
mapped to same CM block are needed
repeatedly (e.g., in a loop), they will keep
replacing each other, even though all
other CM blocks may be available.

296
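A sketch of the address split in a direct-mapped cache, using the assumed parameters from the example (64 KB cache, 32-byte lines, hence 2048 entries):

```python
# Address -> (tag, line, offset) split for a direct-mapped cache.
LINE_SIZE = 32
NUM_LINES = 2048          # 64 KB / 32 B

def split_address(addr):
    offset = addr % LINE_SIZE                  # byte within the line
    line = (addr // LINE_SIZE) % NUM_LINES     # which cache entry
    tag = addr // (LINE_SIZE * NUM_LINES)      # identifies the MM block
    return tag, line, offset

# Two addresses exactly 64 KB apart map to the same line with different
# tags, so they keep evicting each other -- the conflict noted above.
print(split_address(0x12345))
print(split_address(0x12345 + 64 * 1024))
```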
Set-Associative Mapping

297
Set-Associative Mapping
• This is a trade-off between
associative and direct mappings
where each address is mapped
to a certain set of cache
locations.
• The cache is broken into sets
where each set contains "N"
cache lines, let's say 4. Then,
each memory address is
assigned a set, and can be
cached in any one of those 4
locations within the set that it is
assigned to. In other words,
within each set the cache is
associative, and thus the name.
298
Set Associative cache
• LRU (Least Recently Used) algorithm
is used
– keep an ordering of each set of locations
that could be accessed from a given
memory location
– whenever any of present lines are
accessed, it updates list, making that entry
the most recently accessed
– when it comes to replace an entry, one at
the end of list is discarded

299
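The LRU bookkeeping described above can be sketched for a single set (a list-based model for illustration; real hardware uses counters or pseudo-LRU bits):

```python
# One set of a set-associative cache with LRU replacement:
# accessed lines move to the front; the victim comes from the back.
class LRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = []            # front = most recently used

    def access(self, tag):
        if tag in self.lines:      # hit: promote to most recently used
            self.lines.remove(tag)
            self.lines.insert(0, tag)
            return True
        if len(self.lines) == self.ways:
            self.lines.pop()       # miss on a full set: evict LRU entry
        self.lines.insert(0, tag)
        return False

s = LRUSet(ways=4)
for t in ["A", "B", "C", "D"]:
    s.access(t)                    # four misses; the set is now full
s.access("A")                      # hit: A becomes most recently used
s.access("E")                      # miss: evicts B (least recently used)
print(s.lines)                     # ['E', 'A', 'D', 'C']
```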
Load-Through and Store-Through
• Load-Through : When the CPU
needs to read a word from the
memory, the block containing the
word is brought from MM to CM,
while at the same time the word is
forwarded to the CPU.

• Store-Through : If store-through is
used, a word to be stored from
CPU to memory is written to both
CM (if the word is in there) and
MM. By doing so, a CM block to be
replaced can be overwritten by an
in-coming block without being
saved to MM.

300
Cache Write Methods
• Words in a cache have been viewed simply
as copies of words from main memory that
are read from the cache to provide faster
access. However, this viewpoint changes
when writes are considered.
• There are 3 possible write actions:
– Write the result into the main memory
– Write the result into the cache
– Write the result into both main memory and cache
memory

301
Cache Write Methods
• Write Through: A cache architecture in which
data is written to main memory at the same
time as it is cached.
• Write Back / Copy Back: CPU performs write
only to the cache in case of a cache hit. If there
is a cache miss, CPU performs a write to main
memory.
• When the cache is missed :
– Write Allocate: loads the memory block into cache
and updates the cache block
– No-Write allocation: this bypasses the cache and
writes the word directly into the memory.

302
Cache Evaluation
Problem: External memory is slower than the system
bus.
Solution: Add external cache using faster memory
technology.
Processor on which feature first appears: 386

Problem: Increased processor speed results in the
external bus becoming a bottleneck for cache access.
Solution: Move external cache on-chip, operating at
the same speed as the processor.
Processor on which feature first appears: 486

Problem: Internal cache is rather small, due to limited
space on chip.
Solution: Add external L2 cache using faster
technology than main memory.
Processor on which feature first appears: 486

303
Cache Evaluation
Problem: Increased processor speed results in the
external bus becoming a bottleneck for L2 cache
access.
Solution: Move L2 cache on to the processor chip
(Pentium II).
Solution: Create a separate back-side bus that runs at
higher speed than the main (front-side) external bus;
the BSB is dedicated to the L2 cache (Pentium Pro).

Problem: Some applications deal with massive
databases and must have rapid access to large
amounts of data. The on-chip caches are too small.
Solution: Add external L3 cache (Pentium III).
Solution: Move L3 cache on-chip (Pentium IV).
304
Comparison of Cache Sizes
Processor Type Year of Introduction L1 cache L2 cache L3 cache
IBM 360/85 Mainframe 1968 16 to 32 KB — —
PDP-11/70 Minicomputer 1975 1 KB — —
VAX 11/780 Minicomputer 1978 16 KB — —
IBM 3033 Mainframe 1978 64 KB — —
IBM 3090 Mainframe 1985 128 to 256 KB — —
Intel 80486 PC 1989 8 KB — —
Pentium PC 1993 8 KB/8 KB 256 to 512 KB —
PowerPC 601 PC 1993 32 KB — —
PowerPC 620 PC 1996 32 KB/32 KB — —
PowerPC G4 PC/server 1999 32 KB/32 KB 256 KB to 1 MB 2 MB
IBM S/390 G4 Mainframe 1997 32 KB 256 KB 2 MB
IBM S/390 G6 Mainframe 1999 256 KB 8 MB —
Pentium 4 PC/server 2000 8 KB/8 KB 256 KB —
IBM SP High-end server 2000 64 KB/32 KB 8 MB —
CRAY MTAb Supercomputer 2000 8 KB 2 MB —
Itanium PC/server 2001 16 KB/16 KB 96 KB 4 MB
SGI Origin 2001 High-end server 2001 32 KB/32 KB 4 MB —
Itanium 2 PC/server 2002 32 KB 256 KB 6 MB
IBM POWER5 High-end server 2003 64 KB 1.9 MB 36 MB
CRAY XD-1 Supercomputer 2004 64 KB/64 KB 1MB —
Memory stall cycles
No. of clock cycles during which CPU is
stalled waiting for a memory access
CPU time =
(CPU clock cycles + Memory stall cycles)
x Clock cycle time
Memory stall cycles = No. of misses x Miss
penalty
= IC x Misses per instruction x Miss penalty
= IC x Memory accesses per instruction x
Miss ratio x Miss penalty

306
Example
Assume we have a machine where CPI is 2.0
when all memory accesses hit in the cache.
Only data accesses are loads and stores,
and these total 40% of instructions. If the
miss penalty is 25 clock cycles and miss ratio
is 2%, how much faster would the machine
be if all instructions were cache hits?

307
Answer

308
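A worked version of this answer, following the stall-cycle formula from the previous slides:

```python
# Memory accesses per instruction = 1 fetch + 0.4 data accesses.
cpi_ideal = 2.0
accesses_per_instr = 1 + 0.4
miss_rate = 0.02
miss_penalty = 25

stalls_per_instr = accesses_per_instr * miss_rate * miss_penalty  # 0.7
cpi_real = cpi_ideal + stalls_per_instr                           # 2.7
speedup = cpi_real / cpi_ideal
print(cpi_real, round(speedup, 2))   # 2.7 CPI; 1.35x faster with all hits
```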
Secondary Memory

309
Technologies
• Magnetic storage
– Floppy, Zip disk, Hard drives, Tapes
• Optical storage
– CD, DVD, Blue-Ray, HD-DVD
• Solid state memory
– USB flash drive, Memory cards for mobile
phones/digital cameras/MP3 players, Solid
State Drives

310
Magnetic Disk
• Purpose:
– Long term, nonvolatile storage
– Large, inexpensive, and slow
– Lowest level in the memory hierarchy
• Two major types:
– Floppy disk
– Hard disk
• Both types of disks:
– Rely on a rotating platter coated with a magnetic surface
– Use a moveable read/write head to access the disk
• Advantages of hard disks over floppy disks:
– Platters are more rigid ( metal or glass) so they can be larger
– Higher density because it can be controlled more precisely
– Higher data rate because it spins faster
– Can incorporate more than one platter
Components of a Disk
• The arm assembly is moved in or out to
position a head on a desired track. Tracks
under the heads make a cylinder (imaginary!).
• Only one head reads/writes at any one
time.
• Block size is a multiple of sector size
(which is often fixed).
(Figure: platters on a spindle, with tracks, sectors,
disk heads and the arm assembly)

313
Internal Hard-Disk

Magnetic Disk
• A stack of platters, a surface with a magnetic
coating
• Typical numbers (depending on the disk size):
– 500 to 2,000 tracks per surface
– 32 to 128 sectors per track
• A sector is the smallest unit that can be read or
written
• Traditionally all tracks have the same number
of sectors:
• Constant bit density: record more sectors on
the outer tracks
Magnetic Disk Characteristic
• Disk head: each side of a platter has separate disk head
• Cylinder: all the tracks under the head at a given point on all
surface
• Read/write data is a three-stage process:
– Seek time: position the arm over the proper track
– Rotational latency: wait for the desired sector to rotate under the
read/write head
– Transfer time: transfer a block of bits (sector) under the read-write
head
• Average seek time as reported by the industry:
– Typically in the range of 8 ms to 15 ms
– (Sum of the time for all possible seek) / (total # of possible seeks)
• Due to locality of disk reference, actual average seek time may:
– Only be 25% to 33% of the advertised number
Typical Numbers of a Magnetic
Disk
• Rotational Latency:
– Most disks rotate at 3,600/5400/7200 RPM
– Approximately 16 ms per revolution
– An average latency to the desired information is
halfway around the disk: 8 ms
• Transfer Time is a function of :
– Transfer size (usually a sector): 1 KB / sector
– Rotation speed: 3600 RPM to 5400 RPM to 7200
– Recording density: typical diameter ranges from 2
to 14 in
– Typical values: 2 to 4 MB per second
Disk I/O Performance

Disk Access Time =


Seek time + Rotational Latency
+ Transfer time + Controller Time
+ Queueing Delay
Disk I/O Performance
• Disk Access Time = Seek time +
Rotational Latency + Transfer time +
Controller Time + Queueing Delay
• Estimating Queue Length:
– Utilization = U = Request Rate / Service
Rate
– Mean Queue Length = U / (1 - U)
– As Request Rate → Service Rate, Mean
Queue Length → Infinity
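The queue estimate above in code (this is the standard single-queue approximation; the 100 requests/s service rate is an assumed figure):

```python
# Mean queue length U / (1 - U), which blows up as utilization nears 1.
def mean_queue_length(request_rate, service_rate):
    u = request_rate / service_rate
    return u / (1 - u)

for r in (50, 80, 95, 99):                # requests/s against 100/s service
    print(r, round(mean_queue_length(r, 100), 1))
# length grows from 1.0 at 50% utilization to 99.0 at 99%
```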
Example
• Setup parameters:
– 16383 cylinders, 63 sectors per track, 3 platters,
6 heads
• Bytes per sector: 512
• RPM: 7200
• Transfer mode: 66.6MB/s
• Average seek time: 9.0 ms (read), 9.5 ms
(write)
• Average latency: 4.17ms
• Physical dimension: 1’’ x 4’’ x 5.75’’
• Interleave: 1:1
Disk performance
• Preamble: allows head to be synchronized before read/write
• ECC (Error Correction Code): corrects errors
• Unformatted capacity: preambles, ECCs and inter sector gaps are
counted as data
• Disk performance depends on
– seek time ‐ time to move arm to desired track
– rotational latency – time needed for requested sector to
rotate under head
• Rotational speed: 5400, 7200, 10000, 15000 rpm

– Transfer time – time needed to transfer a block of
bits under head (e.g., 40 MB/s)
321
Disk performance
Disk controller
– chip that controls the drive. Its tasks include accepting
– commands (READ, WRITE, FORMAT) from software,
controlling arm motion, detecting and correcting errors
Controller time
– overhead the disk controller imposes in performing an
I/O access

Avg. disk access time = avg. seek time +
avg. rotational delay +
transfer time +
controller overhead

322
Example
• Advertised average seek time of a disk is 5
ms, transfer rate is 40 MB per second, and it
rotates at 10,000 rpm Controller overhead is
0.1 ms. Calculate the average time to read a
512 byte sector.

323
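A worked version of this example (the average rotational delay is half a revolution at 10,000 rpm):

```python
# Avg. access time = seek + avg. rotational delay + transfer + controller.
seek_ms = 5.0
rpm = 10_000
transfer_rate = 40e6          # bytes/s
sector = 512
controller_ms = 0.1

rotation_ms = 60_000 / rpm            # 6 ms per revolution
avg_rotational_ms = rotation_ms / 2   # 3 ms on average
transfer_ms = sector / transfer_rate * 1000   # 0.0128 ms for 512 bytes

total = seek_ms + avg_rotational_ms + transfer_ms + controller_ms
print(round(total, 4))        # 8.1128 ms
```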
RAID-
(Redundant Array of Inexpensive
Disks)
• A disk organization used to improve
performance of storage systems
• An array of disks controlled by a
controller (RAID Controller)
• Data are distributed over disks
(striping) to allow parallel operation

324
RAID 0- No redundancy
• No redundancy to tolerate disk failure
• Each strip has k sectors (say)
– Strip 0: sectors 0 to k−1
– Strip 1: sectors k to 2k−1 ...etc
• Works well with large accesses
• Less reliable than having a single large
disk

325
Example (RAID 0)
• Suppose that RAID consists of 4 disks
with MTTF (mean time to failure) of
20,000 hours.
– A drive will fail once every 5,000 hours on
average
– A single large drive with MTTF of 20,000
hours is 4 times more reliable

326
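A sketch of the reliability arithmetic above: with independent failures, an n-drive array fails n times as often, so its MTTF is the drive MTTF divided by n:

```python
# Array MTTF under the independent-failure assumption used in the example.
def array_mttf(drive_mttf_hours, n_drives):
    return drive_mttf_hours / n_drives

print(array_mttf(20_000, 4))   # 5000.0 hours: a failure every 5,000 hours
```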
RAID 1 (Mirroring)
• Uses twice as many disk as does RAID 0
(first half: primary, next half: backup)
• Duplicates all disks

• On a write, every strip is written twice


• Excellent fault tolerance (if a disk fails, backup
copy is used)
• Requires more disks
327
RAID 3 (Bit Interleaved Parity)
• Reads/writes go to all disks in the group,
with one extra disk (parity disk) to hold
check information in case of a failure

• Parity contains sum of all data in other


disks
• If a disk fails, subtract all data in good
disks from parity disk

328
RAID 4 (Block Interleaved Parity)

• RAID 4 is much like RAID 3, with a
strip-for-strip parity written onto an extra
disk
– A write involves accessing 2 disks instead
of all
– Parity disk must be updated on every write

329
RAID 5- Block Interleaved
Distributed Parity
• In RAID 5, parity information is spread
throughout all disks
• In RAID 5, multiple writes can occur
simultaneously as long as stripe units are not
located in same disks, but it is not possible in
RAID 4

330
Secondary Storage Devices:
CD-ROM

331
Physical Organization of CD-ROM
• Compact Disk – read only memory (write once)
• Data is encoded and read optically with a laser
• Can store around 600MB data
• Digital data is represented as a series of Pits and
Lands:
– Pit = a little depression, forming a lower level in the track
– Land = the flat part between pits, or the upper levels in the
track
• Reading a CD is done by shining a laser at the disc and detecting
changing reflections patterns.
– 1 = change in height (land to pit or pit to land)
– 0 = a “fixed” amount of time between 1’s

332
Organization of data
LAND PIT LAND PIT LAND
...------+ +-------------+ +---...
|_____| |_______|
..0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 ..

• Cannot have two 1’s in a row!


=> uses Eight to Fourteen Modulation (EFM) encoding table.
• 0's are represented by the length of time between transitions, so we
must travel at constant linear velocity (CLV) on the tracks.
• Sectors are organized along a spiral
• Sectors have same linear length
• Advantage: takes advantage of all storage space available.
• Disadvantage: has to change rotational speed when seeking
(slower towards the outside)

333
CD-ROM
• Addressing
– 1 second of play time is divided up into 75 sectors.
– Each sector holds 2KB
– 60 min CD:
60min * 60 sec/min * 75 sectors/sec = 270,000 sectors = 540,000 KB ~ 540
MB
– A sector is addressed by: Minute:Second:Sector e.g. 16:22:34
• Type of laser
– CD: 780nm (infrared)
– DVD: 635nm or 650nm (visible red)
– HD-DVD/Blu-ray Disc: 405nm (visible blue)
• Capacity
– CD: 650 MB, 700 MB
– DVD: 4.7 GB per layer, up to 2 layers
– HD-DVD: 15 GB per layer, up to 3 layers
– BD: 25 GB per layer, up to 2 layers

334
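The Minute:Second:Sector addressing can be sketched as follows (75 sectors per second, 2 KB per sector, as above):

```python
# Convert an M:S:S address to a byte offset on the disc.
SECTORS_PER_SECOND = 75
BYTES_PER_SECTOR = 2048       # 2 KB

def msf_to_bytes(minute, second, sector):
    sectors = (minute * 60 + second) * SECTORS_PER_SECOND + sector
    return sectors * BYTES_PER_SECTOR

print(msf_to_bytes(16, 22, 34))          # byte offset of sector 16:22:34
print(msf_to_bytes(60, 0, 0) // 1024)    # 60 min -> 540,000 KB, as above
```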
Solid state storage

335
Solid state storage
• Memory cards
– For Digital cameras, mobile phones, MP3 players...
– Many types: Compact flash, Smart Media, Memory Stick,
Secure Digital card...
• USB flash drives
– Replace floppies/CD-RW
• Solid State Drives
– Replace traditional hard disks
• Uses flash memory
– Type of EEPROM
• Electrically erasable programmable read only memory
– Grid of cells (1 cell = 1 bit)
– Write/erase cells by blocks

336
Solid state storage
• Cell=two transistors
– Bit 1: no electrons in between
– Bit 0: many electrons in between
• Performance
– Access time: 10x faster than a hard drive
– Transfer rate
• 1x=150 kb/sec, up to 100X for memory cards
• Similar to normal hard drive for SSD ( 100-150
MB/sec)
– Limited write: 100k to 1,000k cycles

337
Solid state storage
• Size
– Very small: 1cm² for some memory cards
• Capacity
– Memory cards: up to 32 GB
– USB flash drives: up to 32 GB
– Solid State Drives: up to 256 GB

338
Solid state storage
• Reliability
– Resists to shocks
– Silent!
– Avoid extreme heat/cold
– Limited number of erase/write
• Challenges
– Increasing size
– Improving writing limits

339
Virtual Memory

340
Virtual Memory
• Virtual memory is a memory management
technique developed for multitasking kernels
• Separation of user logical memory from
physical memory.
• Logical address space can therefore be
much larger than physical address space

341
A System with
Physical Memory Only
• Examples:
– Most Cray machines, early PCs, nearly all embedded systems, etc.
(Figure: CPU issues physical addresses 0 to N−1 directly into memory)
 Addresses generated by the CPU correspond directly to bytes in physical
memory
A System with Virtual Memory
• Examples:
– Workstations, servers, modern PCs, etc.
(Figure: CPU issues virtual addresses 0 to N−1; a page table translates
them to physical addresses 0 to P−1 in memory, or to disk)
 Address Translation: Hardware converts virtual addresses to physical ones
via OS-managed lookup table (page table)
Page Tables
(Figure: a memory-resident page table; each entry holds a valid bit and a
physical page or disk address. Valid entries point into physical memory;
invalid entries point to disk storage, i.e. a swap file or regular file
system file.)
VM – Windows
• Can change the
paging file size
• Can set paging files on
multiple different drives

345
Windows Memory management

346
IO Fundamentals
I/O Fundamentals
• Computer System has three major
functions
– CPU
– Memory
– I/O
PC with PCI and ISA bus
Types and Characteristics of I/O
Devices
• Behavior: how does an I/O device behave?
– Input – Read only
– Output - write only, cannot read
– Storage - can be reread and usually rewritten
• Partner:
– Either a human or a machine is at the other end of
the I/O device
– Either feeding data on input or reading data on
output
• Data rate:
– The peak rate at which data can be transferred
• between the I/O device and the main memory
• Or between the I/O device and the CPU
Data Rate
Buses
• A bus is a shared communication link
• Multiple sources and multiple destinations
• It uses one set of wires to connect multiple
subsystems
• Different uses:
– Data
– Address
– Control
Motherboard
Advantages
• Versatility:
– New devices can be added easily
– Peripherals can be moved between
computer
– systems that use the same bus standard
• Low Cost:
– A single set of wires is shared in multiple
ways
Disadvantages
• It creates a communication bottleneck
– The bandwidth of that bus can limit the
maximum I/O throughput
• The maximum bus speed is largely limited
by:
– The length of the bus
– The number of devices on the bus
– The need to support a range of devices with:
• Widely varying latencies
• Widely varying data transfer rates
The General Organization of a Bus
• Control lines:
– Signal requests and acknowledgments
– Indicate what type of information is on the
data lines
• Data lines carry information between the
source and the destination:
– Data and Addresses
– Complex commands
• A bus transaction includes two parts:
– Sending the address
– Receiving or sending the data
Master Vs Slave
• A bus transaction includes two parts:
– Sending the address
– Receiving or sending the data
• Master is the one who starts the bus
transaction by:
– Sending the address
• Slave is the one who responds to the
address by:
– Sending data to the master if the master asks
for data
– Receiving data from the master if the master
wants to send data
Output Operation
Input Operation
• Input is defined as the Processor
receiving data from the I/O device
Type of Buses
• Processor-Memory Bus (design specific or proprietary)
– Short and high speed
– Only need to match the memory system
– Maximize memory-to-processor bandwidth
– Connects directly to the processor
• I/O Bus (industry standard)
– Usually is lengthy and slower
– Need to match a wide range of I/O devices
– Connects to the processor-memory bus or backplane bus
• Backplane Bus (industry standard)
– Backplane: an interconnection structure within the chassis
– Allow processors, memory, and I/O devices to coexist

• Cost advantage: one single bus for all components


Increasing the Bus Bandwidth
• Separate versus multiplexed address and data lines:
– Address and data can be transmitted in one bus cycle if
separate address and data lines are available
– Cost: (a) more bus lines, (b) increased complexity
• Data bus width:
– By increasing the width of the data bus, transfers of multiple
words require fewer bus cycles
– Example: SPARCstation 20’s memory bus is 128 bit wide
– cost: more bus lines
• Block transfers:
– Allow the bus to transfer multiple words in back-to-back bus
cycles
– Only one address needs to be sent at the beginning
– The bus is not released until the last word is transferred
– Cost: (a) increased complexity (b) decreased response time
for request
Operating System Requirements
• Provides protection for shared I/O resources
– Guarantees that a user’s program can only access the
portions of an I/O device to which the user has rights
• Provides abstractions for accessing devices:
– Supplies routines that handle low-level device operation
• Handles the interrupts generated by I/O devices
• Provides equitable access to the shared I/O
resources
– All user programs must have equal access to the I/O
resources
• Schedules accesses in order to enhance system
throughput
OS and I/O Systems
Communication Requirements
• The Operating System must be able to prevent:
– The user program from communicating with the I/O
device directly
• If user programs could perform I/O directly:
– Protection to the shared I/O resources could not be
provided
• Three types of communication are required:
– The OS must be able to give commands to the I/O
devices
– The I/O device must be able to notify the OS when the
I/O device has completed an operation or has
encountered an error
– Data must be transferred between memory and an I/O
device
Commands to I/O Devices
• Two methods are used to address the device:
– Special I/O instructions
– Memory-mapped I/O
• Special I/O instructions specify:
– Both the device number and the command word
– Device number: the processor communicates this via a set of
wires normally included as part of the I/O bus
– Command word: this is usually sent on the bus’s data lines
• Memory-mapped I/O:
– Portions of the address space are assigned to I/O device
– Read and writes to those addresses are interpreted as
commands to the I/O devices
– User programs are prevented from issuing I/O operations
directly:
• The I/O address space is protected by the address translation
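The idea behind memory-mapped I/O can be illustrated with a toy simulation: writes that land in a reserved address range are routed to the device instead of memory (the `IO_BASE` address and the `Bus` class are invented for this sketch, not part of any real system):

```python
IO_BASE = 0xFF00  # hypothetical start of the I/O region

class Bus:
    def __init__(self, mem_size=0x10000):
        self.memory = bytearray(mem_size)
        self.device_commands = []  # what the I/O device received

    def write(self, addr, value):
        if addr >= IO_BASE:
            # Address falls in the I/O region: the write is interpreted
            # as a command to the device, not as a memory store.
            self.device_commands.append((addr - IO_BASE, value))
        else:
            self.memory[addr] = value

bus = Bus()
bus.write(0x0010, 42)       # ordinary memory write
bus.write(IO_BASE + 4, 1)   # lands on a device register instead
```

In a real machine the routing is done by the address decoding hardware, and the address translation mechanism keeps user programs from reaching the I/O region directly.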
I/O Device Notifying the OS
• The OS needs to know when:
– The I/O device has completed an operation
– The I/O operation has encountered an error
• This can be accomplished in two different
ways:
– Polling:
• The I/O device puts information in a status register
• The OS periodically checks the status register
– I/O Interrupt:
• Whenever an I/O device needs attention from the
processor, it interrupts the processor from what it is
currently doing.
Polling
• Advantage:
– Simple: the processor is
totally in control and does all
the work
• Disadvantage:
– Polling overhead can
consume a lot of CPU time
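The polling overhead can be seen in a small simulation: the CPU repeatedly reads a status register and burns cycles until the device flags ready (the device model and its latency are made up for illustration):

```python
READY = 0x1

class SlowDevice:
    """Toy device that becomes ready after a fixed number of status reads."""
    def __init__(self, latency, value):
        self.latency = latency
        self.value = value

    def read_status(self):
        self.latency -= 1
        return READY if self.latency <= 0 else 0

def polled_read(dev):
    wasted_polls = 0
    while not (dev.read_status() & READY):
        wasted_polls += 1   # CPU time spent doing nothing useful
    return dev.value, wasted_polls

value, wasted = polled_read(SlowDevice(latency=5, value=0xAB))
print(hex(value), wasted)  # 0xab 4
```

Every wasted poll is a cycle the processor could have spent on useful work, which is the motivation for interrupts below.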
Interrupts
• An interrupt is an asynchronous signal
indicating the need for attention, or a
synchronous event in software indicating the
need for a change in execution
• Advantage:
– User program progress is only halted during actual
transfer
• Disadvantage, special hardware is needed to:
– Cause an interrupt (I/O device)
– Detect an interrupt (processor)
– Save the proper states to resume after the interrupt
(processor)
Interrupt Driven Data Transfer
• An I/O interrupt is just like the
exceptions except:
– An I/O interrupt is asynchronous
– Further information needs to be
conveyed
• An I/O interrupt is
asynchronous with respect to
instruction execution:
– I/O interrupt is not associated
with any instruction
– I/O interrupt does not prevent
any instruction from completion
– You can pick your own
convenient point to take an
interrupt
I/O Interrupt
• An I/O interrupt is more complicated than an
exception:
– Needs to convey the identity of the device
generating the interrupt
– Interrupt requests can have different
urgencies and need to be prioritized
• Interrupt Logic
– Detect and synchronize interrupt requests
• Ignore interrupts that are disabled (masked off)
• Rank the pending interrupt requests
• Create interrupt microsequence address
• Provide select signals for interrupt microsequence
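The masking and ranking steps above can be sketched as a small function that picks the next interrupt to service (the IRQ numbering and the lower-number-is-more-urgent convention are assumptions for this example, not a property of any particular processor):

```python
def select_interrupt(pending, enabled_mask):
    """Return the most urgent pending IRQ that is not masked off,
    or None if nothing serviceable is pending.
    Convention here: lower IRQ number = higher urgency."""
    serviceable = [irq for irq in pending if enabled_mask & (1 << irq)]
    return min(serviceable) if serviceable else None

# IRQs 2, 5 and 7 pending; IRQ 2 is disabled (masked off).
mask = 0b1111_1111 & ~(1 << 2)
print(select_interrupt({2, 5, 7}, mask))  # 5
```

Real interrupt logic does the same filtering and ranking in hardware, then generates the microsequence address for the winner.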
Multi-core architectures
Single Computer
Single Core CPU
Multi core architecture
• Replicate multiple processor cores on a
single die
Multi-core CPU chip
• The cores fit on a single processor
socket
• Also called CMP (Chip Multi-Processor)
Why Multi-core
• Difficult to make single-core clock
frequencies even higher
• Deeply pipelined circuits:
– heat problems
– speed of light problems
– difficult design and verification
– large design teams necessary
– server farms need expensive air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift
towards more parallelism)
Instruction-level parallelism
• Parallelism at the machine-instruction
level
• The processor can re-order, pipeline
instructions, split them into
microinstructions, do aggressive branch
prediction, etc.
• Instruction-level parallelism enabled
rapid increases in processor speeds
over the last 15 years
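Whether two instructions can be issued in parallel comes down to register dependences; a minimal sketch (the tuple encoding of an instruction as a destination register plus a set of source registers is invented for illustration):

```python
def can_issue_together(i1, i2):
    """Two instructions, each (dest, set_of_sources), are independent
    if neither reads or writes what the other writes."""
    d1, s1 = i1
    d2, s2 = i2
    return d1 != d2 and d1 not in s2 and d2 not in s1

# r3 = r1 + r2  and  r6 = r4 * r5: no shared registers -> parallel
print(can_issue_together(("r3", {"r1", "r2"}), ("r6", {"r4", "r5"})))  # True
# r3 = r1 + r2  and  r4 = r3 * r5: r4 depends on r3 -> not parallel
print(can_issue_together(("r3", {"r1", "r2"}), ("r4", {"r3", "r5"})))  # False
```

A superscalar processor performs this kind of dependence check in hardware, every cycle, across a window of instructions.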
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• Server can serve each client in a separate
thread (Web server, database server)
• A computer game can do AI, graphics, and
physics in three separate threads
• Single-core superscalar processors cannot
fully exploit TLP
• Multi-core architectures are the next step in
processor evolution: explicitly exploiting TLP
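A server handling each client in its own thread, as described above, can be sketched in a few lines (the request handler here is a stand-in for real work such as a database query):

```python
import threading

results = {}
results_lock = threading.Lock()

def handle_client(client_id):
    # Stand-in for real per-client work (e.g. serving a web request).
    reply = sum(range(client_id + 1))
    with results_lock:
        results[client_id] = reply

# One thread per client; on a multi-core chip the OS can
# schedule these threads onto different cores.
threads = [threading.Thread(target=handle_client, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

The lock around the shared dictionary is the price of the shared-memory model discussed next.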
Multiprocessor memory types
• Shared memory:
In this model, there is one (large)
common shared memory for all
processors
• Distributed memory:
In this model, each processor has its
own (small) local memory, and its
content is not replicated anywhere else
Multi-core processor is a special
kind of a multiprocessor:
All processors are on the same chip
• Multi-core processors are MIMD:
Different cores execute different threads
(Multiple Instructions), operating on different
parts of memory (Multiple Data).
• Multi-core is a shared memory multiprocessor:
All cores share the same memory
What applications benefit
from multi-core?
• Database servers
• Web servers (Web commerce)
• Compilers
• Multimedia applications
• Scientific applications, CAD/CAM
(Each can run on its own core)
• In general, applications with
thread-level parallelism
(as opposed to instruction-level
parallelism)
More examples
• Editing a photo while recording a TV
show through a digital video recorder
• Downloading software while running an
anti-virus program
• “Anything that can be threaded today
will map efficiently to multi-core”
• BUT: some applications difficult to
parallelize
A technique complementary to multi-core:
Simultaneous multithreading
• Problem addressed:
the processor pipeline can get stalled:
– Waiting for the result
of a long floating point
(or integer) operation
– Waiting for data to
arrive from memory
• Other execution units wait unused
[Figure: Pentium 4 pipeline — L1 D-Cache/D-TLB, integer and floating-point units, L2 cache and control, schedulers, uop queues, rename/alloc, BTB, trace cache, uCode ROM, decoder, bus. Source: Intel]
Simultaneous multithreading (SMT)
• Permits multiple independent threads to execute
SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads”
on the same core
• Example: if one thread is waiting for a floating
point operation to complete, another thread can
use the integer units
Without SMT, only a single thread
can run at any given time
[Figure: pipeline diagram — only Thread 1 (floating point) occupies the execution units]
Without SMT, only a single thread
can run at any given time
[Figure: pipeline diagram — only Thread 2 (integer operation) occupies the execution units]
SMT processor: both threads can
run concurrently
[Figure: pipeline diagram — Thread 1 (floating point) and Thread 2 (integer operation) share the core’s execution units]
But: Can’t simultaneously use the
same functional unit
[Figure: pipeline diagram — Thread 1 and Thread 2 both needing the integer unit; this scenario is impossible with SMT on a single core (assuming a single integer unit)]
SMT not a “true” parallel
processor
• Enables better threading (e.g. up to 30%)
• OS and applications perceive each
simultaneous thread as a separate
“virtual processor”
• The chip has only a single copy
of each resource
• Compare to multi-core:
each core has its own copy of resources
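The “virtual processor” view can be observed from user code: `os.cpu_count()` reports logical processors, so on an SMT machine it counts hardware threads rather than physical cores (how many of each depends entirely on the machine it runs on):

```python
import os

# os.cpu_count() reports logical processors: with SMT enabled this is
# typically (physical cores) x (hardware threads per core).
logical = os.cpu_count()
print(logical, "logical processors visible to the OS")
```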
Multi-core:
threads can run on separate cores
[Figure: two complete cores side by side, each with its own pipeline resources — Thread 1 runs on one core, Thread 2 on the other]
Multi-core:
threads can run on separate cores
[Figure: the same two cores — Thread 3 runs on one core, Thread 4 on the other]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads:
2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
SMT Dual-core: all four threads
can run concurrently
[Figure: two SMT-enabled cores — Threads 1 and 3 on one core, Threads 2 and 4 on the other]
Comparison: multi-core vs SMT

• Advantages/disadvantages?
Comparison: multi-core vs SMT

• Multi-core:
– Since there are several cores,
each is smaller and not as powerful
(but also easier to design and manufacture)
– However, great with thread-level parallelism
• SMT
– Can have one large and fast superscalar core
– Great performance on a single thread
– Mostly still only exploits instruction-level
parallelism
The memory hierarchy
• If simultaneous multithreading only:
– all caches shared
• Multi-core chips:
– L1 caches private
– L2 caches private in some architectures
and shared in others
• Memory is always shared
“Fish” machines
• Dual-core Intel Xeon processors
• Each core is hyper-threaded
• Private L1 caches
• Shared L2 cache
[Figure: two hyper-threaded cores (CORE0, CORE1), each with a private L1 cache, sharing one L2 cache and main memory]
Designs with private L2 caches
• Both L1 and L2 are private
– Examples: AMD Opteron, AMD Athlon, Intel Pentium D
• A design with L3 caches
– Example: Intel Itanium 2
[Figure: two layouts — per-core private L1 and L2 over shared memory; and per-core private L1 and L2 with an added L3 level over shared memory]
Private vs shared caches?
• Advantages/disadvantages?
Private vs shared caches
• Advantages of private:
– They are closer to core, so faster access
– Reduces contention
• Advantages of shared:
– Threads on different cores can share the
same cache data
– More cache space available if a single (or
a few) high-performance thread runs on
the system
Windows Task Manager
[Figure: Windows Task Manager CPU usage history showing one graph per core (core 1, core 2)]