
Instruction Set Principles and Examples

1
Outline
• Introduction
• Classifying instruction set architectures
• Instruction set measurements
– Memory addressing
– Addressing modes for signal processing
– Type and size of operands
– Operations in the instruction set
– Operations for media and signal processing
– Instructions for control flow
– Encoding an instruction set
• Role of compilers
• MIPS architecture
2
Brief Introduction to ISA
• Instruction Set Architecture: a set of instructions
– Each instruction is directly executed by the CPU’s hardware
• How is it represented?
– By a binary format since the hardware understands only bits
• Concatenate together binary encoding for instructions, registers,
constants, memories
• Typical physical blobs are bits, bytes, words, n-words
• Word size is typically 16, 32, 64 bits today
• Options - fixed or variable length formats
– Fixed - each instruction encoded in same size field (typically 1 word)
– Variable – half-word, whole-word, multiple word instructions are
possible

3
Example of
Program Execution
• Commands (opcodes)
– 1: Load AC from memory
– 2: Store AC to memory
– 5: Add to AC from memory
• Add the contents of memory location 940 to the contents of memory location 941 and store the result at 941

(Figure: fetch and execute cycles for this example program)

4
A Note on Measurements
• We’re taking the quantitative approach
• BUT measurements will vary:
– Due to application selection or application mix
– Due to the particular compiler being used
– Also dependent on compiler optimization selection
– And the target ISA
• Hence the measurements we’ll talk about
– Are useful to understand the method
– Are a typical yet small sample derived from benchmark codes
• To do it for real
– You would want lots of real applications
– Plus - your compiler and ISA

5
Classifying Instruction Set
Architecture

6
Instruction Set Design

CPU_Time = IC × CPI × Cycle_time   (see the worked sketch below)


The instruction set influences everything
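To make the formula concrete, here is a minimal C sketch with made-up numbers (they are not taken from the slides) showing how instruction count, CPI, and cycle time combine:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical values, for illustration only */
    double ic = 2.0e9;          /* instructions executed              */
    double cpi = 1.5;           /* average clock cycles / instruction */
    double cycle_time = 0.5e-9; /* seconds per cycle (a 2 GHz clock)  */

    /* CPU_Time = IC * CPI * Cycle_time */
    double cpu_time = ic * cpi * cycle_time;
    printf("CPU time = %.2f s\n", cpu_time);  /* prints 1.50 s */
    return 0;
}
```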

7
Instruction Characteristics
• Usually a simple operation
– Which operation is identified by the op-code field
• But operations require operands - 0, 1, or 2
– To identify where they are, they must be addressed
• Address is to some piece of storage
• Typical storage possibilities are main memory, registers, or a
stack
• Two options: explicit or implicit addressing
– Implicit - the op-code implies the address of the operands
• ADD on a stack machine - pops the top 2 elements of the stack,
then pushes the result
• HP calculators work this way
– Explicit - the address is specified in some field of the instruction
• Note the potential for 3 addresses - 2 operands + the destination

8
Classifying Instruction Set Architectures
Based on CPU internal storage options and # of operands

These choices critically affect #instructions, CPI, and cycle time

9
Operand Locations for Four
ISA Classes

10
C=A+B
• Stack
– Push A
– Push B
– Add
• Pop the top-2 values of the stack (A, B) and push the result value onto the stack
– Pop C
• Accumulator (AC)
– Load A
– Add B
• Add AC (A) with B and store the result into AC
– Store C
• Register (register-memory)
– Load R1, A
– Add R3, R1, B
– Store R3, C
• Register (load-store)
– Load R1, A
– Load R2, B
– Add R3, R1, R2
– Store R3, C

11
Pro’s and Con’s of Stack,
Accumulator, Register Machine

12
Modern Choice – Load-store
Register (GPR) Architecture
• Reasons for choosing GPR (general-purpose registers) architecture
– Registers (stacks and accumulators…) are faster than memory
– Registers are easier and more effective for a compiler to use
• (A+B) – (C*D) – (E*F)
– May be evaluated in any order (for pipelining concerns or …)
» But on a stack machine it must be evaluated left to right
– Registers can be used to hold variables
• Reduce memory traffic
• Speed up programs
• Improve code density (fewer bits are used to name a register)
• Compiler writers prefer that all registers be equivalent and unreserved
– The number of GPR: at least 16

13
Characteristics Divide GPR
Architectures
• # of operands
– Three-operand: 1 result and 2 source operands
– Two-operand – 1 both source/result and 1 source
• How many of the operands are memory addresses
– Ranges from 0 to 3 (two sources + 1 result)
– 0 memory operands per ALU instruction: load-store architecture
– 1 memory operand: register-memory architecture
– 2 or 3 memory operands: memory-memory architecture

14
Pro’s and Con’s of Three Most
Common GPR Computers

15
Short Summary – Classifying
Instruction Set Architectures
• Expect the use of general-purpose registers
• Figure 2.4 + pipelining (Appendix A)
– Expect the use of Register-Register (load-store) GPR architecture

16
Memory Addressing

17
Memory Addressing Basics
All architectures must address memory

• What is accessed - byte, word, multiple words?


– Today’s machines are byte addressable
– Main memory is organized in 32 - 64 byte lines
– Big-Endian or Little-Endian addressing
• Hence there is a natural alignment problem (see the sketch after this list)
– An access of size s bytes at byte address A is aligned if A mod s = 0
– A misaligned access takes multiple aligned memory references
• Memory addressing mode influences instruction counts (IC)
and clock cycles per instruction (CPI)
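A minimal C sketch of the two hardware-visible facts above: the A mod s alignment rule and byte ordering (endianness). The addresses and the sample word are illustrative only:

```c
#include <stdio.h>
#include <stdint.h>

/* An access of size s bytes at byte address a is aligned if a mod s == 0 */
static int is_aligned(uintptr_t a, size_t s) {
    return (a % s) == 0;
}

int main(void) {
    uint32_t word = 0x11223344;
    uint8_t first_byte = *(uint8_t *)&word;   /* lowest-addressed byte */

    printf("address 0x1004, 4-byte access: %s\n",
           is_aligned(0x1004, 4) ? "aligned" : "misaligned");   /* aligned    */
    printf("address 0x1006, 4-byte access: %s\n",
           is_aligned(0x1006, 4) ? "aligned" : "misaligned");   /* misaligned */

    /* Little-Endian machines store the least significant byte first */
    printf("this host is %s-endian\n",
           first_byte == 0x44 ? "little" : "big");
    return 0;
}
```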

18
Typical Address Modes (I)

19
Typical Address Modes (II)

20
Use of Memory Addressing Modes (Figure 2.7)
Based on a VAX, which supported everything

Not counting Register mode (50% of all)

21
Displacement Field Size
At least 12-16 bits cover 75%-99% of the displacements

22
Immediate Operands

23
Distribution of Immediate
Values

24
Addressing Modes for Signal
Processing
• Because DSPs deal with infinite, continuous streams of data, they routinely rely on circular buffers (see the sketch after this list)
– Modulo or circular addressing mode
• Support data shuffling in the Fast Fourier Transform (FFT)
– Bit-reverse addressing, e.g. 011₂ → 110₂
• However, these two fancy addressing modes are not used heavily
– Mismatch between what programmers and compilers actually use versus what architects expect
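A minimal C sketch of what these two DSP addressing modes compute in hardware; the buffer size and index width below are arbitrary choices for illustration:

```c
#include <stdio.h>

#define BUF_SIZE 8   /* circular buffer length (illustrative) */

/* Modulo (circular) addressing: the next index wraps around the buffer */
static unsigned circ_next(unsigned i) {
    return (i + 1) % BUF_SIZE;
}

/* Bit-reverse addressing over nbits, as used for FFT data shuffling */
static unsigned bit_reverse(unsigned i, unsigned nbits) {
    unsigned r = 0;
    for (unsigned b = 0; b < nbits; b++) {
        r = (r << 1) | ((i >> b) & 1);
    }
    return r;
}

int main(void) {
    printf("after index 7, circular addressing wraps to %u\n", circ_next(7)); /* 0 */
    printf("bit-reversed 011 over 3 bits = %u (110)\n", bit_reverse(3, 3));   /* 6 */
    return 0;
}
```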

25
Frequency of Addressing Modes
for T1 TMS320C54x DSP

26
Short Summary – Memory
Addressing
• Need to support at least three addressing modes
– Displacement, immediate, and register deferred (+ REGISTER)
– They represent 75% -- 99% of the addressing modes in benchmarks
• The size of the address for displacement mode should be at least 12-16 bits (covers 75%-99%)
• The size of the immediate field should be at least 8-16 bits (covers 50%-80%)

27
Operand Type & Size
• Specified by instruction (opcode) or by hardware tag
– Tagged machines are extinct
• Typical types: assume word = 32 bits
– Character - byte - ASCII or EBCDIC (IBM) - 4 per word
– Short integer - 2 bytes, 2's complement
– Integer - one word - 2's complement
– Float - one word - usually IEEE 754 these days
– Double precision float - 2 words - IEEE 754
– BCD or packed decimal - 4-bit values packed 8 per word
• Instructions will be needed for common conversions --
software can do the rare ones

28
Data Access Patterns

29
Operands for Media and
Signal Processing
• Graphics applications – vertex
– (x, y, z) + w to help with color or hidden surfaces (R, G, B, A)
– 32-bit floating-point values
• DSPs
– Fixed point – a binary point just to the right of the sign bit (see the sketch after this list)
• Represents fractions between –1 and +1
• Has a separate exponent variable
• Blocked floating point – a block of variables has a common
exponent
• Need some registers that are wider to guard against round-off
error
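A minimal sketch of the fixed-point idea, here in Q15 format (a sign bit followed by 15 fraction bits); the format choice and values are mine, not the slides':

```c
#include <stdio.h>
#include <stdint.h>

/* Q15 fixed point: value = raw / 2^15, representing fractions in [-1, +1) */
#define Q15_ONE 32768.0

static double q15_to_double(int16_t x) { return x / Q15_ONE; }
static int16_t double_to_q15(double d) { return (int16_t)(d * Q15_ONE); }

int main(void) {
    int16_t a = double_to_q15(0.5);
    int16_t b = double_to_q15(-0.25);

    /* Products need a wider (32-bit) accumulator to guard against overflow
       and round-off - this is why DSPs provide wide accumulating registers.
       (Arithmetic right shift of the signed intermediate is assumed.)      */
    int32_t prod = (int32_t)a * (int32_t)b;   /* Q30 intermediate */
    int16_t result = (int16_t)(prod >> 15);   /* back to Q15      */

    printf("0.5 * -0.25 = %f\n", q15_to_double(result));  /* -0.125 */
    return 0;
}
```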

30
Operand Type and Size in
DSP

31
Short Summary – Type and
Size of Operand
• The future, as we go to 64-bit machines:
– Decimal's future is unclear
– Larger offsets, immediates, etc. are likely
– Usage of 64- and 128-bit values will increase
• DSPs need wider accumulating registers than the size in
memory to aid accuracy in fixed-point arithmetic

32
What Operations are Needed
• Arithmetic + Logical
– Integer arithmetic: ADD, SUB, MULT, DIV, SHIFT
– Logical operation: AND, OR, XOR, NOT
• Data Transfer - copy, load, store
• Control - branch, jump, call, return, trap
• System - OS and memory management
– We’ll ignore these for now - but remember they are needed
• Floating Point
– Same as arithmetic but usually take bigger operands
• Decimal - if you go for it what else do you need?
– legacy from COBOL and the commercial application domain
• String - move, compare, search
• Graphics – pixel and vertex, compression/decompression operations

33
Top 10 Instructions for 80x86
• load: 22%
• conditional branch: 20%
• compare: 16%
• store: 12%
• add: 8%
• and: 6%
• sub: 5%
• move register-register: 4%
• call: 1%
• return: 1%

• The most widely executed instructions are the simple operations of an instruction set
• The top-10 instructions for 80x86 account for 96% of instructions executed
• Make them fast, as they are the common case

34
Control Instructions are a Big
Deal
• Jumps - unconditional transfer
• Conditional Branches
– How is condition code set? – by flag or part of the instruction
– How is target specified? How far away is it?
• Calls
– How is target specified? How far away is it?
– Where is return address kept?
– How are the arguments passed? Callee vs. Caller save!
• Returns
– Where is the return address? How far away is it?
– How are the results passed?

35
Breakdown of Control Flows
• Call/Returns
– Integer: 19% FP: 8%
• Jump
– Integer: 6% FP: 10%
• Conditional Branch
– Integer: 75% FP: 82%

36
Branch Address Specification
• Known at compile time for unconditional and conditional
branches - hence specified in the instruction
– As a register containing the target address
– As a PC-relative offset
• Consider word length addresses, registers, and instructions
– Full address desired? Then pick the register option.
• BUT - setup and effective address will take longer.
– If you can deal with a smaller offset then PC-relative works (see the sketch after this list)
• PC-relative is also position independent - so simple linker duty
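A minimal sketch of how a PC-relative target is formed; the 16-bit field width and the convention that the offset counts 4-byte instructions are assumptions for illustration, not a specific ISA's definition:

```c
#include <stdio.h>
#include <stdint.h>

/* PC-relative branch: target = PC of the next instruction + sign-extended,
   scaled offset. Here the offset field counts 4-byte instructions.        */
static uint64_t branch_target(uint64_t pc, int16_t offset_field) {
    return (pc + 4) + (int64_t)offset_field * 4;
}

int main(void) {
    /* A branch at 0x1000 with offset -3 jumps back 12 bytes from the next PC */
    printf("target = 0x%llx\n",
           (unsigned long long)branch_target(0x1000, -3));  /* 0xff8 */
    return 0;
}
```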

37
Returns and Indirect Jumps
• Branch target is not known at compile time
• Need a way to specify the target dynamically
– Use a register
– Permit any addressing mode
– Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
• Also useful for
– case or switch
– Dynamically shared libraries
– High-order functions or function pointers
– Virtual functions in OO
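In C, all of these cases compile down to an indirect jump through a register; a minimal sketch with a function-pointer table (the handler names are made up):

```c
#include <stdio.h>

static void handle_add(void) { puts("add"); }
static void handle_sub(void) { puts("sub"); }

/* A switch/case or virtual-function dispatch becomes a table of code
   addresses; the call below is a jump through a register-held target. */
static void (*const dispatch[])(void) = { handle_add, handle_sub };

int main(void) {
    int opcode = 1;       /* in general not known at compile time           */
    dispatch[opcode]();   /* indirect jump: target loaded from memory into a register */
    return 0;
}
```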

38
Branch Stats - 90% are PC
Relative
• Call/Return
– TeX = 16%, Spice = 13%, GCC = 10%
• Jump
– TeX = 18%, Spice = 12%, GCC = 12%
• Conditional
– TeX = 66%, Spice = 75%, GCC = 78%

39
Branch Distances

40
Condition Testing Options

41
What kinds of compares do
Branches Use?

42
Direction, Frequency, and Real Change

Key points – 75% are forward branches


• Most backward branches are loops - taken about 90%
• Branch statistics are both compiler and application dependent
• Any loop optimizations may have large effect

43
Short Summary – Operations
in the Instruction Set
• Branch addressing to be able to jump to about 100+
instructions either above or below the branch
– This implies a PC-relative branch displacement of at least 8 bits
• Register-indirect and PC-relative addressing for jump
instructions to support returns as well as many other
features of current systems

44
Encoding an Instruction Set

45
Encoding the ISA
• Encode instructions into a binary representation for
execution by CPU
• Can pick anything but:
– Affects the size of code - so it should be tight
– Affects the CPU design - in particular the instruction decode
• So it may have a big influence on the CPI or cycle-time
• Must balance several competing forces
– Desire for lots of addressing modes and registers
– Desire to make average program size compact
– Desire to have instructions encoded into lengths that will be easy to
handle in a pipelined implementation (multiple of bytes)

46
3 Popular Encoding Choices
• Variable (compact code but difficult to encode)
– Primary opcode is fixed in size, but opcode modifiers may exist
– Opcode specifies number of arguments - each used as address fields
– Best when there are many addressing modes and operations
– Use as few bits as possible, but individual instructions can vary widely in
length
– e. g. VAX - integer ADD versions vary between 3 and 19 bytes
• Fixed (easy to encode, but lengthy code)
– Every instruction looks the same - some field may be interpreted differently
– Combine the operation and the addressing mode into the opcode
– e. g. all modern RISC machines
• Hybrid (a trade-off between program size and ease of decoding)
– A set of fixed formats
– e.g. IBM 360 and Intel 80x86

47
3 Popular Encoding Choices
(Cont.)

48
An Example of Variable
Encoding -- VAX
• addl3 r1, 737(r2), (r3): 32-bit integer add instruction with 3
operands  need 6 bytes to represent it
– Opcode for addl3: 1 byte
– A VAX address specifier is 1 byte (4 bits: addressing mode, 4 bits: register)
• r1: 1 byte (register addressing mode + r1)
• 737(r2)
– 1 byte for address specifier (displacement addressing + r2)
– 2 bytes for displacement 737
• (r3): 1 byte for address specifier (register indirect + r3)
• Length of VAX instructions: 1 to 53 bytes
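A rough C sketch of the size accounting above for "addl3 r1, 737(r2), (r3)"; the component sizes come from the slide, but the byte values themselves are not shown (this is not an exact VAX bit-level encoding):

```c
#include <stdio.h>

/* Size accounting for a variable-length, VAX-like encoding of
   addl3 r1, 737(r2), (r3)                                      */
int main(void) {
    int opcode       = 1;  /* 1-byte opcode for addl3              */
    int spec_r1      = 1;  /* address specifier: register mode + r1 */
    int spec_disp_r2 = 1;  /* address specifier: displacement + r2  */
    int disp_737     = 2;  /* 16-bit displacement constant 737      */
    int spec_ind_r3  = 1;  /* address specifier: register indirect + r3 */

    printf("total instruction length = %d bytes\n",
           opcode + spec_r1 + spec_disp_r2 + disp_737 + spec_ind_r3);  /* 6 */
    return 0;
}
```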

49
Short Summary – Encoding
the Instruction Set
• Choice between variable and fixed instruction encoding
– If code size matters more than performance → variable encoding
– If performance matters more than code size → fixed encoding

50
Role of Compilers

• Critical goals in ISA from the compiler viewpoint
– What features will lead to high-quality code
– What makes it easy to write efficient compilers for an architecture

51
Compiler and ISA
• ISA decisions are no longer made primarily to ease assembly language (AL) programming
• Due to high-level languages (HLL), the ISA is a compiler target today
• Performance of a computer will be significantly affected by
compiler
• Understanding compiler technology today is critical to
designing and efficiently implementing an instruction set
• Architecture choice affects the code quality and the
complexity of building a compiler for it

52
Goal of the Compiler
• Primary goal is correctness
• Second goal is speed of the object code
• Others:
– Speed of the compilation
– Ease of providing debug support
– Inter-operability among languages
– Flexibility of the implementation - languages may not change much
but they do evolve - e. g. Fortran 66 ===> HPF

Make the frequent cases fast and the rare case correct

53
Typical Modern Compiler
Structure

(Figure: multi-pass compiler structure built around a common intermediate representation. The phase annotations read, from higher-level to lower-level passes: somewhat language dependent / largely machine independent; small language dependence / slight machine dependence; language independent / highly machine dependent.)
54
Typical Modern Compiler
Structure (Cont.)
• Multi-pass structure → easier to write bug-free compilers
– Transform high-level, more abstract representations into progressively lower-level representations, eventually reaching the instruction set
• Compilers must make assumptions about the ability of later
steps to deal with certain problems
– Ex. 1 choose which procedure calls to expand inline before they
know the exact size of the procedure being called
– Ex. 2 Global common sub-expression elimination
• Find two instances of an expression that compute the same
value and saves the result of the first one in a temporary
– Temporary must be register, not memory (Performance)
– Assume register allocator will allocate temporary into register
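A minimal before/after C illustration of common sub-expression elimination (the variable names are made up); the compiler keeps the repeated expression in a temporary that it hopes the register allocator will place in a register:

```c
/* Before: the expression (i * stride) is computed twice */
void before(int *a, int *b, int i, int stride) {
    a[i * stride] = 1;
    b[i * stride] = 2;
}

/* After CSE: the common sub-expression is computed once into a temporary;
   the optimization only pays off if 'idx' ends up in a register           */
void after(int *a, int *b, int i, int stride) {
    int idx = i * stride;
    a[idx] = 1;
    b[idx] = 2;
}
```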

55
Optimization Types
• High level - done at source code level
– Procedure called only once - so put it in-line and save CALL
• Local - done on basic sequential block (straight-line code)
– Common sub-expressions produce same value
– Constant propagation - replace constant valued variable with the
constant - saves multiple variable accesses with same value
• Global - same as local but done across branches
– Code motion - remove code from loops that computes the same value on each pass and put it before the loop (see the sketch after this list)
– Simplify or eliminate array addressing calculations in loops
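A minimal before/after C illustration of loop-invariant code motion (the function and variable names are invented for the example):

```c
/* Before: the loop-invariant product (scale * factor) is recomputed every pass */
void scale_before(int *a, int n, int scale, int factor) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] * (scale * factor);
}

/* After code motion: the invariant value is hoisted out of the loop */
void scale_after(int *a, int n, int scale, int factor) {
    int k = scale * factor;
    for (int i = 0; i < n; i++)
        a[i] = a[i] * k;
}
```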

56
Optimization Types (Cont.)
• Register allocation
– Use graph coloring (graph theory) to allocate registers
• NP-complete
• Heuristic algorithm works best when there are at least 16 (and
preferably more) registers
• Processor-dependent optimization
– Strength reduction: replace a multiply with a shift-and-add sequence (see the sketch after this list)
– Pipeline scheduling: reorder instructions to minimize pipeline stalls
– Branch offset optimization: Reorder code to minimize branch offsets
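A minimal sketch of strength reduction, one of the processor-dependent optimizations above (the multiplier 10 is an arbitrary example):

```c
#include <assert.h>

/* Strength reduction: replace a multiply by a constant with shifts and adds.
   x * 10 == x * 8 + x * 2 == (x << 3) + (x << 1)                            */
static int times10(int x) {
    return (x << 3) + (x << 1);
}

int main(void) {
    assert(times10(7) == 70);
    return 0;
}
```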

57
Major Types of Optimizations
and Example in Each Class

58
Change in IC Due to
Optimization

• Level 1: local optimizations, code scheduling, and local register allocation
• Level 2: global optimization, loop transformation (software pipelining), global register allocation
• Level 3: Level 2 + procedure integration
59
Optimization Observations
• Hard to reduce branches
• Biggest reduction is often memory references
• Some ALU operation reduction happens but it is usually a
few %
• Implication:
– Branch, Call, and Return become a larger relative % of the
instruction mix
– Control instructions among the hardest to speed up

60
Impact of Compiler Technology
on Architect’s Decisions
• Important questions
– How are variables allocated and addressed?
– How many registers will be needed?
• We must look at 3 areas to allocate data

61
Where to allocate data?
• Stack
– Local variable access in activation records, almost no push/pop
– Addressing is relative to the stack pointer
– Grown or shrunk on calls and returns
• Global data area - the easy one
– Constants and global static structures
– For arrays, addressing may be indexed off the head (base address) of the array
• Heap
– Used for dynamic objects
– Access usually by pointers
– Data is typically not scalar

62
Register Allocation & Data
• Reasonably simple for stack objects
• Hard for global data due to the aliasing opportunity - must be conservative, e.g.:
– p = &a
– a = ...
– *p = ...
– ... a ...
• Heap objects & pointers in general are even harder
– Computed pointers make it impossible to register-allocate the target data
– Any structured data - string, array, etc. - is too big to keep in a register
• Since register allocation is a major source of optimization
– The effect is clearly important

63
How can Architects Help
Compiler Writers
• Provide Regularity
– Address modes, operations, and data types should be orthogonal
(independent) of each other
• Simplify code generation especially multi-pass
• Counterexample: restricting which registers can be used for certain classes of instructions
• Provide primitives - not solutions
– Special features that match a HLL construct are often unusable
– What works in one language may be detrimental to others

64
How can Architects Help
Compiler Writers (Cont.)
• Simplify trade-offs among alternatives
– How do you write good code? What is good code?
• Metric used to be IC or code size (no longer true with caches and pipelines…)
– Anything that makes code sequence performance obvious is a
definite win!
• How many times a variable should be referenced before it is
cheaper to load it into a register
• Provide instructions that bind the quantities known at
compile time as constants
– Don’t hide compile time constants
• Instructions that work off of something the compiler thinks could be a run-time determined value handcuff the optimizer

65
Short Summary -- Compilers
• The ISA should have at least 16 GPRs (not counting FP registers) to simplify allocation of registers using graph coloring
• Orthogonality suggests all supported addressing modes
apply to all instructions that transfer data
• Simplicity – understand that less is more in ISA design
– Provide primitives instead of solutions
– Simplify trade-offs between alternatives
– Don’t bind constants at runtime
• Counterexample – Lack of compiler support for multimedia
instructions

66
The MIPS Architecture

67
Expectations for New ISA
• Use general-purpose registers, with a load-store architecture
• Support displacement (offset size 12-16 bits), immediate (size 8 to 16 bits), and register indirect addressing
• Support 8-, 16-, 32-, and 64-bit integers and 64-bit IEEE 754 floating-
point numbers
• Support the following simple instructions: load, store, add, subtract,
move register-register, and, shift, compare equal, compare not equal,
branch (with a PC-relative address at least 8 bits long), jump, call, return
• Use fixed instruction encoding if interested in performance and use
variable instruction encoding if interested in code size
• Provide at least 16 general-purpose registers (GPR) + separate floating-point registers, be sure all addressing modes apply to all data transfer instructions, and aim for a minimalist instruction set

68
MIPS
• Simple load-store ISA
• Enable efficient pipeline implementation
• Fixed instruction set encoding
• Efficiency as a compiler target
• MIPS64 variant is discussed here

69
Register for MIPS
• 32 64-bit integer GPR’s - R0, R1, ... R31, R0= 0 always
• 32 FPR’s - used for single or double precision
– For single precision: F0, F1, ... , F31 (32-bit)
– For double precision: F0, F2, ... , F30 (64-bit)
• Extra status registers - moves via GPR’s
• Instructions for moving data between an FPR and a GPR

70
Data Types for MIPS
• 8-bit byte, 16-bit half words, 32-bit word, and 64-bit double
words for integer data
• 32-bit single precision and 64-bit double precision for FP
• MIPS64 operations work on 64-bit integer and 32- or 64-bit
floating point
– Bytes, half words, and words are loaded into the GPRs with zeros or
the sign bit replicated to fill the 64 bits of the GPRs
• All references between memory and either GPRs or FPRs are through loads or stores

71
Addressing Modes for MIPS
• Data addressing: immediate and displacement (16 bits)
– Displacement: Add R4, 100(R1)
(Regs[R4] ← Regs[R4] + Mem[100 + Regs[R1]])
– Register indirect: placing 0 in the displacement field
• Add R4, (R1)  (Regs[R4] ← Regs[R4] + Mem[Regs[R1]])
– Absolute addressing (16 bits): using R0 as the base register
• Add R1, (1001)  (Regs[R1] ← Regs[R1] + Mem[1001])
• Byte addressable with 64-bit address
– Mode selection for Big Endian or Little Endian

72
MIPS Instruction Format
• Encode addressing mode into the opcode
• All instructions are 32 bits with 6-bit primary opcode

73
MIPS Instruction Format
(Cont.)
I-Type Instruction
opcode (6 bits) | rs (5) | rt (5) | immediate (16)
• Loads and stores: LW R1, 30(R2); S.S F0, 40(R4)
• ALU ops on immediates: DADDIU R1, R2, #3
– rt <-- rs op immediate
• Conditional branches: BEQZ R3, offset
– rs is the register checked
– rt is unused
– immediate specifies the offset
• Jump register, jump and link register: JR R3
– rs is the target register
– rt and immediate are unused

74
MIPS Instruction Format
(Cont.)
R-Type Instruction
opcode (6 bits) | rs (5) | rt (5) | rd (5) | shamt (5) | funct (6)
• Register-register ALU operations: rd ← rs funct rt, e.g. DADDU R1, R2, R3
– The funct field encodes the data path operation: Add, Sub, ...
• Read/write special registers
• Moves

J-Type Instruction: Jump, Jump and Link, Trap, and Return From Exception
opcode (6 bits) | offset added to PC (26)
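A minimal C sketch of how the fixed 32-bit word is carved into these fields; the field positions follow the layouts above, and the sample word is just one example (in classic MIPS encoding it is LW R1, 40(R2)):

```c
#include <stdio.h>
#include <stdint.h>

/* Extract the fixed-position fields of a 32-bit MIPS instruction word.
   Which fields are meaningful depends on the format (I-, R-, or J-type). */
static void decode(uint32_t insn) {
    unsigned opcode = (insn >> 26) & 0x3F;   /* bits 31..26            */
    unsigned rs     = (insn >> 21) & 0x1F;   /* bits 25..21            */
    unsigned rt     = (insn >> 16) & 0x1F;   /* bits 20..16            */
    unsigned imm    =  insn        & 0xFFFF; /* I-type: bits 15..0     */
    unsigned rd     = (insn >> 11) & 0x1F;   /* R-type: bits 15..11    */
    unsigned funct  =  insn        & 0x3F;   /* R-type: bits 5..0      */

    printf("opcode=%u rs=%u rt=%u imm=%u rd=%u funct=%u\n",
           opcode, rs, rt, imm, rd, funct);
}

int main(void) {
    decode(0x8C410028);  /* example word: opcode 0x23 (LW), rs=2, rt=1, imm=40 */
    return 0;
}
```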

75
MIPS instruction MIX

SPECint2000

76
MIPS instruction MIX (Cont.)

SPECfp2000

77
