
What is the Itanium® Architecture?

Thomas Siebold
Technology Consultant
Alpha Systems Division

thomas.siebold@hp.com

Agenda

• Terminology
• What is the Itanium® Architecture?

Terminology

Processor Architectures and Implementations

Alpha Architecture
• Implementations: EV4, EV5, EV6, EV68, EV7

IA64 Architecture (Intel Itanium® Architecture)
• Implementations (Itanium® Processor Family):
– Merced: Itanium® processor
– McKinley: Itanium® 2 processor
– Madison
– Future implementations

Itanium® Processor Family Roadmap
• Intel has enhanced the Itanium® Processor Family roadmap
– To deliver the most competitive product offerings for enterprise customers
– To pull-in dual core technology as early as possible and deliver a significant
performance boost
– To maintain a consistent introduction rate on new Itanium® Processor Family product
offerings

Year   Processor                                     Clock / L3 cache    Silicon Process
2002   Itanium® 2 Processor                          1 GHz, 3MB L3       0.18 µm
2003   Itanium® 2 Processor (Madison & Deerfield)    1.5GHz, 6MB L3      0.13 µm
2004   Itanium® 2 Processor (Madison 9M)             >1.5GHz, 9MB L3     0.13 µm
2005   Montecito (Dual Core)                                             90 nm

• Montecito processor will enable dual-core technology


– Continues PAC611 and maintains the same bus protocol
– Extends Itanium® 2 microarchitecture to 90nm process technology
– Platform Release target of 2005

Roadmap maintains world class performance

next generation processor technologies

[Figure: processor innovation chart, © 2002]
• CISC / RISC, SuperScalar: IA-32 Processor Family, SPARC-III, MIPS 14K, PA-8700, Alpha EV68
• Multiple Cores & Integrated Interconnects: POWER4, PA-8800, Alpha EV7
• New features! PA 8800+, Alpha EV79
• Explicitly Parallel Instruction Computing (EPIC): Itanium, Itanium™ 2

Itanium2 Processor

• 221M FETs
• 421 mm² die (21.6mm x 19.5mm)
• 90+% of the transistors and 50+% of the die area are devoted to cache and cache support logic!

What is the Itanium® Architecture?

Traditional CPU Architectures

• Performance barriers:
- Memory latency
- Branches
- Loop pipelining
- Procedure call / return overhead
• Headroom constraints :
- Hardware-based instruction scheduling
- Unable to efficiently schedule parallel execution
• Resource constraints
- Too few registers
- Unable to fully utilize multiple execution units

EPIC – Explicitly Parallel Instruction Computing


Basic Ideas

• Static Hardware Design


– Compiler creates record of execution
• Instructions in bundles
– Machine plays record
• Distribute among execution units
– No runtime changes like out-of-order execution
• High Scalability of ‘execution units’
– Very Long Instruction Word (VLIW) concept
– Focus is parallelism
• 6 instructions in parallel (2 bundles per cycle)
– High number of execution units
• Enhancement of VLIW concepts with
– Predication
– Indication of parallelism in machine code
– Speculative data loading

Improving Performance

• Itanium® architecture boosts performance by


– allowing compiler to provide information to chip
– using available compile time information
– Moving performance burden from microarchitecture (chip) to compiler

• Itanium® architecture code accomplishes the following:


– Increases instruction level parallelism (ILP)
– Improves branch handling
– Reduces memory access cost
– Supports modular code (note)

Increasing Instruction Level Parallelism

• Improving instruction level parallelism (ILP) by:


– Compiler/assembly writer is able to explicitly indicate
parallelism
– Instruction groups
– Three-instruction-wide word
– Instruction bundle
– Two executed per cycle
– Massive resources on chip
– Large number of registers to avoid register contention

Instruction Format: Bundles & Templates

•Bundle (128 bits)
•Set of three instructions (3 x 41 = 123 bits)
•Template (5 bits)
•Identifies types of instructions in bundle
•One of Integer, Memory, Branch, Floating, eXtended
•Identifies independent operations (“stops”) -> MM_F
•Defines execution units to be invoked executing the bundle
•Compiler can schedule functional units to avoid contention
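
The bundle layout above can be sketched as plain bit packing. This is only an illustration: it assumes the 5-bit template sits in the low-order bits with the three 41-bit slots above it (as the S2 S1 S0 T ordering suggests), and the function names `pack_bundle` and `unpack_bundle` are hypothetical helpers, not part of any toolchain.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        # Slot 0 starts right above the template, slot 1 above slot 0, and so on.
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    ]
    return template, slots
```

The arithmetic is the point here: 5 + 3 x 41 = 128 bits, so a bundle always fills exactly one 16-byte unit of the instruction stream.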

Explicitly Parallel Instruction Computing (EPIC)

• 128-bit instruction bundles from I-cache: S2 S1 S0 T
• Fetch one or more bundles for execution (implementation dependent; Itanium® takes two)
• Processor tries to execute all instructions in parallel, depending on available functional units
• Functional units: MEM MEM INT INT FP FP B B B
• Retired instruction bundles

Instruction Groups

• Instruction groups:
• Set of instructions
• No dependencies (read-after-write) within group
• May execute in parallel
• The processor executes as many instructions per
instruction group as possible, based on its resources
• Must contain at least one instruction (no upper limit)
• Instruction groups are indicated by cycle breaks (;;)

Instruction groups and bundles

ld8 r5 = [r7]
sub r1 = r2, r3
add r10 = r20, r21 ;;
add r1 = r1, r5 ;;
st8 [r7] = r1

Instructions within a group may not have any register dependencies within the group.
;; indicates the end of a group.
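
A minimal sketch of how a code generator might place the stops: it walks the instruction stream and ends the current group as soon as an instruction reads or writes a register that was already written in that group. The tuple format and the name `split_into_groups` are assumptions made for this illustration, and memory dependencies are deliberately ignored.

```python
def split_into_groups(instructions):
    """Split (text, dest, sources) tuples into instruction groups.

    A new group starts whenever an instruction touches a register that an
    earlier instruction in the current group has written.
    """
    groups = []
    current = []
    written = set()
    for text, dest, sources in instructions:
        depends = dest in written or any(src in written for src in sources)
        if depends:
            groups.append(current)   # corresponds to emitting ';;'
            current, written = [], set()
        current.append(text)
        if dest is not None:
            written.add(dest)
    if current:
        groups.append(current)
    return groups


# The five-instruction example from the slide; st8 writes memory, not a register.
example = [
    ("ld8 r5 = [r7]", "r5", ["r7"]),
    ("sub r1 = r2, r3", "r1", ["r2", "r3"]),
    ("add r10 = r20, r21", "r10", ["r20", "r21"]),
    ("add r1 = r1, r5", "r1", ["r1", "r5"]),
    ("st8 [r7] = r1", None, ["r7", "r1"]),
]
groups = split_into_groups(example)
```

For the slide's example this yields three groups of three, one and one instructions, matching the two ;; stops shown.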

Instruction bundles

Instructions are fetched and executed in bundles.

{
.mii // template
ld8 r10 = [r5] // slot 0, Memory
add r1 = r2, r3 // slot 1, Integer
add r4 = r5, r6 // slot 2, Integer
}

Instruction groups and bundles
Itanium® and Itanium2® fetch 2 bundles at a time for execution.
They may or may not execute in parallel.

[Figure: handwritten code (a stream of instructions, with instruction groups ended by ;;) passes through the code generator, which packs it into instruction bundles (three instruction slots plus a template, padded with nops where needed). Bundles are fetched in pairs for execution. Can the bundle pair execute in parallel?]

Code generator creates bundles, possibly including nops.

Forgetting end-of-group may be fatal:
add r1 = r1, r5 ;;
st8 [r7]= r1

There are two difficulties:


1) Finding instruction triplets matching the defined templates.
2) Matching pairs of bundles that can execute in parallel.

Massive On Chip Resources

• Several register files visible to the programmer:
• 128 General registers
• 128 Floating-point registers
• 64 Predicate registers
• 8 Branch registers
• 128 Application registers
• Instruction Pointer (IP) register
• Control Registers
• Processor Status Register (includes slot index within current bundle)

Improving Branch Handling

What is the problem?

• Traditional CPUs:
• Branch-prediction is used to predict the most likely set of
instructions
• Correct branch prediction keeps the execution pipelines full
• A mispredicted branch flushes the pipeline with a large
penalty

• Itanium® architecture improves branch handling:
• Provide a way to minimize branches using predicates
• Provide support for special branch instructions
– counted loop

Branch Handling

• Predication
– Conditional execution of instructions
– When the predicate is true, the instruction is executed
– When it is false, the instruction is treated as a NOP
• Predication converts a control dependency into a data
dependency
• Predication eliminates branches in the code

Predication

• Traditional code:
if (a>b)
c = c + 1
else
d = d * e + f
• Avoid branch by using predicated code
p1, p2 = compare(a>b)
if (p1) c = c + 1
if (p2) d = d * e + f
– Predicate p1 set to 1 if compare is true, and to 0 if it
evaluates to false
– p2 is the complement of p1
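
The if-conversion above can be mimicked in a few lines. This is a sketch only: `execute_if` is a hypothetical helper standing in for one predicated instruction, which always flows through the pipeline but commits its result only when its predicate is true.

```python
def branchy(a, b, c, d, e, f):
    """Traditional code: a real branch selects one of the two paths."""
    if a > b:
        c = c + 1
    else:
        d = d * e + f
    return c, d


def execute_if(predicate, result, old_value):
    """Model one predicated instruction: keep the old value when predicate is false."""
    return result if predicate else old_value


def predicated(a, b, c, d, e, f):
    """Predicated code: one compare sets p1 and its complement p2, no branch."""
    p1 = a > b
    p2 = not p1
    c = execute_if(p1, c + 1, c)          # (p1) c = c + 1
    d = execute_if(p2, d * e + f, d)      # (p2) d = d * e + f
    return c, d
```

Both versions return the same (c, d) pair for every input; the predicated one simply trades the control dependency for a data dependency on p1 and p2.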

Predication

Before:
• Instructions c = c + 1 and d = d * e + f are control dependent on a>b

After:
• Instructions are data dependent:
– Values of p1 and p2
– They determine execution
– The branch is eliminated

Predication

Traditional Architecture:
cmp a,b
br NEQ (jump)
then: Y=3
br END (jump)
else: Y=4

Itanium® Architecture:
cmp a,b -> pT, pF
(pT) Y=3
(pF) Y=4

Code for both paths is loaded and routed to different execution pipelines.
Only one ‘branch’ will have a valid predicate and be executed.

Reducing Memory Access Cost

• Itanium® architecture eliminates many memory accesses through:
• large register files to manage work in progress
• better control of the memory hierarchy (cache hints)

• Itanium® architecture reduces the cost of the remaining memory accesses by:
• moving load instructions earlier in the code
– Data speculation - the execution of a load before a preceding store
– Control speculation - the execution of a load before its guarding branch
• hiding memory latency
• enabling the processor to bring in the data in time
• avoiding processor stalls

“Data Speculation”: Advanced Loads

• Load is performed before a store that logically precedes it
– may potentially use the same address
– also referred to as ‘advanced load’
– at compile time memory addresses need to be “disambiguated” (relationship)

Traditional sequence:

store(st_addr,data)
load(ld_addr,target)
use(target)

Itanium® architecture sequence:

aload(ld_addr,target)
/* other operations including uses of target */
store(st_addr,data)
acheck(target,recovery_addr)
use(target)
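
The aload/store/acheck sequence can be simulated with a small table that remembers which addresses were loaded early. This is a simplified sketch, not the hardware mechanism: the class `AdvancedLoadMachine`, the dictionary-based table and the register name "r5" are all assumptions made for illustration (on real Itanium hardware the table is the ALAT and the instructions are ld.a and chk.a).

```python
class AdvancedLoadMachine:
    """Toy model of advanced loads: a store to a tracked address invalidates the entry."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.registers = {}
        self.tracked = {}   # target register -> address of its advanced load

    def aload(self, ld_addr, target):
        self.registers[target] = self.memory[ld_addr]
        self.tracked[target] = ld_addr

    def store(self, st_addr, data):
        self.memory[st_addr] = data
        # Any advanced load from the same address is now stale.
        for target in [t for t, addr in self.tracked.items() if addr == st_addr]:
            del self.tracked[target]

    def acheck(self, target, recovery):
        """Run the recovery code only if the advanced load was invalidated."""
        if target in self.tracked:
            return False
        recovery()
        return True


def run(ld_addr, st_addr):
    machine = AdvancedLoadMachine({0x100: 1, 0x200: 2})
    machine.aload(ld_addr, "r5")
    machine.store(st_addr, 99)

    def recovery():
        # Recovery code simply repeats the load after the store.
        machine.registers["r5"] = machine.memory[ld_addr]

    recovered = machine.acheck("r5", recovery)
    return machine.registers["r5"], recovered
```

When the store goes to a different address, the early-loaded value is used as is; when the addresses collide, the check fails and the recovery code reloads the value.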

“Control Speculation”

• Load is performed before the branch that guards it
– Need to check for exceptions

Traditional sequence:

if a>b
then
load(ld_addr1,target1)
else
load(ld_addr2,target2)

Itanium® architecture sequence:

sload(ld_addr1,target1)
sload(ld_addr2,target2)
/* other operations including usage of target1/target2 */
if a>b
then
scheck(target1,recovery_addr1)
else
scheck(target2, recovery_addr2)
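
A rough simulation of the sload/scheck pattern: a speculative load that would fault records a deferred-exception token in the target instead of raising, and the fault is only taken if the value is really needed. Everything here is an illustrative assumption, including the `NAT` token object, the `PageFault` exception and the class name (on real Itanium hardware the instructions are ld.s and chk.s, and the token is the register's NaT bit).

```python
NAT = object()   # stand-in for the "Not a Thing" deferred-exception token


class PageFault(Exception):
    """Raised by a non-speculative load from an unmapped address."""


class SpeculativeLoadMachine:
    def __init__(self, memory):
        self.memory = dict(memory)
        self.registers = {}

    def load(self, addr, target):
        if addr not in self.memory:
            raise PageFault(hex(addr))
        self.registers[target] = self.memory[addr]

    def sload(self, addr, target):
        # A faulting speculative load defers the exception instead of raising it.
        self.registers[target] = self.memory.get(addr, NAT)

    def scheck(self, target, recovery):
        if self.registers[target] is NAT:
            recovery()
            return True
        return False


def select(a, b, memory, addr1, addr2):
    machine = SpeculativeLoadMachine(memory)
    machine.sload(addr1, "target1")     # both loads hoisted above the branch
    machine.sload(addr2, "target2")
    if a > b:
        machine.scheck("target1", lambda: machine.load(addr1, "target1"))
        return machine.registers["target1"]
    machine.scheck("target2", lambda: machine.load(addr2, "target2"))
    return machine.registers["target2"]
```

With one valid and one unmapped address, the unmapped speculative load is harmless as long as the branch goes the other way; the fault surfaces only when the recovery code re-executes the load on the path that needs it.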

Massive Memory Resources

• Physical memory
– Full implementation will address 16 EB of physical memory (2^64 bytes)
• 16,000,000,000GB
– Itanium® architecture microprocessor has 44-bit
address bus
• 16TB (16,000GB) physical memory addressable
– Itanium2® architecture microprocessor has 50-bit
address bus
• Virtual memory
– Itanium® architecture microprocessor uses 50-bits
– Itanium2® architecture microprocessor uses 64-bits
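
The sizes quoted above follow directly from the address widths. A short check, using binary units (the slide's GB and TB figures are rounded); the helper name `addressable_bytes` is just for this sketch:

```python
GIB = 1 << 30   # bytes per gibibyte
TIB = 1 << 40   # bytes per tebibyte


def addressable_bytes(address_bits):
    """Number of distinct byte addresses reachable with the given address width."""
    return 1 << address_bits


sizes_in_gib = {bits: addressable_bytes(bits) // GIB for bits in (44, 50, 64)}
```

A 44-bit address gives 16 TiB (16,384 GiB), a 50-bit address gives 1 PiB, and the full 64 bits give 16 EiB, which is 17,179,869,184 GiB exactly.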

Supporting Modular Code

Procedure Call Overhead

• Modular programs create more overhead


– Programs tend to be call intensive
– Register space shared by caller and callee
– Call/Returns require register save/restores
– Frequent memory access
– Limitations due to resource shortage
• Itanium® solution
– Massive register resources
• Renaming, rotating
• Integer registers stackable
– Register Stack Engine (RSE)
– Eliminates memory accesses
– Allows local registers to be allocated dynamically

Register Stack

• The general register file is divided into two subsets:
• Static: 32 permanent registers (r0-r31)
– visible to all procedures
– Used for global variables
• Stacked: the other 96 registers (r32-r127) behave like a stack
– procedure code allocates up to 96 registers for a frame
• Frame allocation:
– previous frame is hidden
– first register is renamed to logical register r32
– small frames eliminate/reduce saving/restoring registers to/from memory
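
The renaming idea can be sketched as a base pointer into the 96 stacked registers: every procedure addresses its frame starting at r32, while the hardware adds the frame base. This toy model is an assumption-laden simplification: the class `RegisterStack` is hypothetical, and it omits output registers shared between caller and callee, wraparound, and spilling to memory.

```python
class RegisterStack:
    NUM_STATIC = 32
    NUM_STACKED = 96

    def __init__(self):
        self.static = [0] * self.NUM_STATIC
        self.stacked = [0] * self.NUM_STACKED
        self.base = 0        # physical index that logical r32 maps to
        self.size = 0        # size of the current frame
        self.callers = []    # saved (base, size) of hidden frames

    def alloc(self, size):
        if self.base + size > self.NUM_STACKED:
            raise OverflowError("sketch does not model spilling to memory")
        self.size = size

    def call(self):
        # Hide the caller's frame; the callee's r32 starts right after it.
        self.callers.append((self.base, self.size))
        self.base += self.size
        self.size = 0

    def ret(self):
        # Remap only: the caller's values never left the register file.
        self.base, self.size = self.callers.pop()

    def _locate(self, logical):
        if logical < self.NUM_STATIC:
            return self.static, logical
        offset = logical - self.NUM_STATIC
        if offset >= self.size:
            raise IndexError(f"r{logical} is outside the current frame")
        return self.stacked, self.base + offset

    def read(self, logical):
        bank, index = self._locate(logical)
        return bank[index]

    def write(self, logical, value):
        bank, index = self._locate(logical)
        bank[index] = value
```

A caller and a callee can both write "r32" without disturbing each other, while static registers such as r1 stay shared across the call.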

Procedure Call Overhead

IA-32                                Itanium® Architecture

Procedure A                          Procedure A
• call B                             • call B

Procedure B                          Procedure B
• save current register state        • alloc, no save!
• ...                                • ...
• restore previous register state    • no restore! (remap)
• return                             • return

Register Stack Engine (RSE)

• When a procedure is called
– New frame of registers is made available
– Caller’s register contents remain in registers, invisible and inaccessible to called procedure
– If deep nesting exhausts physical registers the RSE will save contents of hidden registers to memory to free up resources
– On return to caller, caller’s register contents are automatically restored
• RSE works in background, utilizing unused
memory bandwidth
• Activity not visible to application programs
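
The spill and fill behaviour described above can be modelled with a small physical register file and a backing store in memory. This is a sketch under stated assumptions: the class `RegisterStackEngine`, its tiny register file and its eager spilling at call time are illustrative choices, whereas the real RSE works in the background and at single-register granularity with its own policies.

```python
class RegisterStackEngine:
    def __init__(self, physical_size):
        self.physical_size = physical_size
        self.resident = []        # stacked register values in the chip, oldest first
        self.backing_store = []   # values spilled to memory, oldest first
        self.frame_sizes = []     # frame sizes, outermost procedure first
        self.spills = 0
        self.fills = 0

    def call(self, frame_size):
        if frame_size > self.physical_size:
            raise ValueError("frame larger than the physical register file")
        # Spill the oldest hidden registers until the new frame fits.
        while len(self.resident) + frame_size > self.physical_size:
            self.backing_store.append(self.resident.pop(0))
            self.spills += 1
        self.resident.extend([0] * frame_size)
        self.frame_sizes.append(frame_size)

    def ret(self):
        size = self.frame_sizes.pop()
        del self.resident[len(self.resident) - size:]
        # The caller's frame must be fully resident again before it resumes.
        if self.frame_sizes:
            while len(self.resident) < self.frame_sizes[-1]:
                self.resident.insert(0, self.backing_store.pop())
                self.fills += 1

    def write(self, index, value):
        start = len(self.resident) - self.frame_sizes[-1]
        self.resident[start + index] = value

    def read(self, index):
        start = len(self.resident) - self.frame_sizes[-1]
        return self.resident[start + index]
```

With 8 physical registers and three nested 4-register frames, the third call forces the outermost frame out to memory, and the final return brings it back without the procedures noticing.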

Loop Optimization Overhead

• Enhance loop performance:


– Done by unrolling loops
– Causes code expansion
– Prologue/epilogue add to code size
• Itanium® solution
– Software pipelining
– Architecture support
– Minimal prologue/epilogue code
– Predication
– Loop control registers (LC, EC)
– Loop branches (br.ctop, br.wtop)
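
A rough simulation of how these pieces cooperate in a software-pipelined loop with three stages (load, add, store). The kernel is a single loop body whose stages are guarded by rotating stage predicates; a br.ctop-like step counts down LC while new iterations start and then EC while the pipeline drains. The rules modelled here are a simplification written from memory of the br.ctop behaviour, and `run_pipelined_loop` is a hypothetical name, so check the details against the architecture manual.

```python
STAGES = 3


def run_pipelined_loop(src, addend):
    """Add `addend` to each element of src with a 3-stage software pipeline.

    Returns (results, kernel_iterations).
    """
    if not src:
        return [], 0
    lc = len(src) - 1                              # LC: iterations still to start
    ec = STAGES                                    # EC: stages to drain at the end
    preds = [True] + [False] * (STAGES - 1)        # stage predicates (p16, p17, p18)
    rot = [None] * STAGES                          # rotating registers between stages
    results, next_load, iterations = [], 0, 0
    while True:
        iterations += 1
        if preds[0]:                               # stage 0: load
            rot[0] = src[next_load]
            next_load += 1
        if preds[1]:                               # stage 1: compute
            rot[1] = rot[1] + addend
        if preds[2]:                               # stage 2: store
            results.append(rot[2])
        # br.ctop-like loop branch
        if lc > 0:
            lc -= 1
            new_pred, taken = True, True           # start another iteration
        elif ec > 1:
            ec -= 1
            new_pred, taken = False, True          # epilogue: drain the pipeline
        else:
            ec -= 1
            new_pred, taken = False, False         # pipeline empty: fall through
        # Register rotation: every value and predicate moves one stage along.
        preds = [new_pred] + preds[:-1]
        rot = [None] + rot[:-1]
        if not taken:
            break
    return results, iterations
```

For N source elements the kernel runs N + STAGES - 1 times, and no separate prologue or epilogue code is needed because the predicates switch stages on and off.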

IA64 Instruction Peculiarities

There is a floating point multiply and add instruction, fma (f= a*b+c)
A simple floating point multiply is a fma with c=0.
A simple floating point add is a fma with b=1.

There is an integer multiply and add instruction, which executes in fp registers!

There is a memory fence instruction: mf (Alpha: MB)

There are three atomic semaphore instructions: xchg, cmpxchg and fetchadd.

There are no load/store instructions with immediate offsets a la LDQ R1, 32(R5) on Alpha.

There are speculative and advanced loads that do not exist on Alpha.

The Register Stack Engine (RSE) is a powerful tool in procedure nestings.

• Itanium® Architecture Training

• https://shale.intel.com/softwarecollege/

Q & A

