Itanium

What is the Itanium® Architecture?
Thomas Siebold
Technology Consultant
Alpha Systems Division
thomas.siebold@hp.com
Agenda
• Terminology
• What is the Itanium® Architecture?
Terminology
Processor Architectures and Implementations
• Alpha Architecture
– Implementations: EV4, EV5, EV6, EV68, EV7
• IA64 Architecture = Intel Itanium® Architecture
– Implementations (Itanium® Processor Family): Merced (Itanium® processor), McKinley (Itanium® 2 processor), Madison (future Itanium® processor)
Itanium® Processor Family Roadmap
• Intel has enhanced the Itanium® Processor Family roadmap
– To deliver the most competitive product offerings for enterprise customers
– To pull in dual core technology as early as possible and deliver a significant performance boost
– To maintain a consistent introduction rate on new Itanium® Processor Family product offerings
Year   Processor                                                      Silicon Process
2002   Itanium® 2 Processor (1 GHz, 3MB L3)                           0.18 µm
2003   Itanium® 2 Processor (Madison & Deerfield) (1.5GHz, 6MB L3)    0.13 µm
2004   Itanium® 2 Processor (Madison 9M) (>1.5GHz, 9MB L3)            0.13 µm
2005   Montecito (Dual Core)                                          90 nm
• Montecito processor will enable dual-core technology

– Continues PAC611 and maintains the same bus protocol
– Extends Itanium® 2 microarchitecture to 90nm process technology
– Platform Release target of 2005
Roadmap maintains world class performance
Next generation processor technologies
[Figure: innovation chart placing processors along a progression from CISC and RISC through SuperScalar to Explicitly Parallel Instruction Computing (EPIC) and Multiple Cores & Integrated Interconnects. Processors shown: IA-32 Processor Family, SPARC-III, MIPS 14K, PA-8700, Alpha EV68, POWER4, PA-8800, PA 8800+, Alpha EV7, Alpha EV79, Itanium, Itanium™ 2. Callout: New features!]
© 2002
Itanium® 2 Processor
• 221M FETs
• 421 mm² die (21.6 mm × 19.5 mm)
• 90+% of the transistors and 50+% of the die area are devoted to cache and cache support logic!
What is the Itanium® Architecture?
Traditional CPU Architectures
• Performance barriers:
- Memory latency
- Branches
- Loop pipelining
- Procedure call / return overhead
• Headroom constraints:
- Hardware-based instruction scheduling
- Unable to efficiently schedule parallel execution
• Resource constraints
- Too few registers
- Unable to fully utilize multiple execution units
EPIC – Explicitly Parallel Instruction Computing

Basic Ideas
• Static Hardware Design

– Compiler creates record of execution
• Instructions in bundles
– Machine plays record
• Distribute among execution units
– No runtime changes like out-of-order execution
• High scalability of 'execution units'
– Very Large Instruction Word (VLIW) concept
– Focus is parallelism
• 6 instructions in parallel (2 bundles per cycle)
– High number of execution units
• Enhancement of VLIW concepts with
– Predication
– Indication of parallelism in machine code
– Speculative data loading
Improving Performance
• Itanium® architecture boosts performance by

– allowing compiler to provide information to chip
– using available compile time information
– Moving performance burden from microarchitecture (chip) to compiler
• Itanium® architecture code accomplishes the following:

– Increases instruction level parallelism (ILP)
– Improves branch handling
– Reduces memory access cost
– Supports modular code (note)
Increasing Instruction Level Parallelism
• Improving instruction level parallelism (ILP) by:

– Compiler/assembly writer is able to explicitly indicate parallelism
• Instruction groups
– Three-instruction-wide word
• Instruction bundle
• Two executed per cycle
– Massive resources on chip
• Large number of registers to avoid register contention
Instruction Format: Bundles & Templates
• Bundle (128 bits)
– Set of three instructions (41 bits each, 123 bits in total)
– Template (5 bits)
• Identifies types of instructions in bundle
• One of Integer, Memory, Branch, Floating, eXtended
• Identifies independent operations (“stops”), e.g. MM_F, where _ marks the stop
• Defines execution units to be invoked executing the bundle
• Compiler can schedule functional units to avoid contention
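The bundle layout above (three 41-bit instruction slots plus a 5-bit template) can be sketched as plain bit-field arithmetic. This is a minimal illustration, assuming the template sits in the low five bits with slot 0 next to it, as the "S2 S1 S0 T" diagram suggests. The function names `pack_bundle` and `unpack_bundle` are hypothetical, and the template and slot values used with them are arbitrary numbers, not real encodings.

```python
SLOT_BITS = 41        # width of one instruction slot
TEMPLATE_BITS = 5     # width of the template field
BUNDLE_BITS = TEMPLATE_BITS + 3 * SLOT_BITS  # 5 + 3 * 41 = 128


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three slot values into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot does not fit in 41 bits")
    # Assumed layout: template in bits 0-4, then slot 0, slot 1, slot 2.
    return (template
            | slot0 << TEMPLATE_BITS
            | slot1 << (TEMPLATE_BITS + SLOT_BITS)
            | slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS))


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, (slot0, slot1, slot2))."""
    slot_mask = (1 << SLOT_BITS) - 1
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple((bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
                  for i in range(3))
    return template, slots
```

Packing and then unpacking returns the original fields, which is a quick way to confirm that the three slots and the template account for all 128 bits.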
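Explicitly Parallel Instruction Computing (EPIC)
[Figure: 128-bit bundle layout with fields S2, S1, S0 (instruction slots) and T (template), flowing from the I-cache through the functional units MEM MEM INT INT FP FP B B B to retired instruction bundles.]
• 128-bit instruction bundles from I-cache
• Fetch one or more bundles for execution (implementation dependent; Itanium® takes two)
• Processor tries to execute all instructions in parallel, depending on available functional units
• Retired instruction bundles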
Instruction Groups
• Instruction groups:
• Set of instructions
• No dependencies (read-after-write) within group
• May execute in parallel
• The processor executes as many instructions per
instruction group as possible, based on its resources
• Must contain at least one instruction (no upper limit)
• Instruction groups are indicated by cycle breaks (;;)
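The grouping rule above can be sketched as a small pass that starts a new group whenever an instruction depends on a register written earlier in the current group. This is an illustrative model only: the `Instr` type and `split_into_groups` are hypothetical names, memory dependencies are ignored, and treating write-after-write as a group break is an assumption that goes beyond the read-after-write case named on the slide.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str            # assembly text, for display only
    writes: frozenset    # registers written
    reads: frozenset     # registers read


def split_into_groups(instrs):
    """Return a list of instruction groups; a stop (;;) belongs after each one."""
    groups, current, written = [], [], set()
    for ins in instrs:
        # Read-after-write or write-after-write on a register already
        # written in this group forces the group to end first.
        if current and (ins.reads & written or ins.writes & written):
            groups.append(current)
            current, written = [], set()
        current.append(ins)
        written |= ins.writes
    if current:
        groups.append(current)
    return groups


program = [
    Instr("ld8 r5 = [r7]", frozenset({"r5"}), frozenset({"r7"})),
    Instr("sub r1 = r2, r3", frozenset({"r1"}), frozenset({"r2", "r3"})),
    Instr("add r10 = r20, r21", frozenset({"r10"}), frozenset({"r20", "r21"})),
    Instr("add r1 = r1, r5", frozenset({"r1"}), frozenset({"r1", "r5"})),
    Instr("st8 [r7] = r1", frozenset(), frozenset({"r7", "r1"})),
]
groups = split_into_groups(program)
```

For this five-instruction sequence the pass yields three groups: the first three instructions together, then the dependent add, then the store.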
Instruction groups and bundles
Instructions within a group may not have any register dependencies within the group. ;; indicates the end of a group.
    ld8 r5 = [r7]
    sub r1 = r2, r3
    add r10 = r20, r21 ;;
    add r1 = r1, r5 ;;
    st8 [r7] = r1 ;;
Instruction bundles
Instructions are fetched and executed in bundles.
    {
    .mii                      // template
    ld8 r10 = [r5]            // slot 0, Memory
    add r1 = r2, r3           // slot 1, Integer
    add r4 = r5, r6           // slot 2, Integer
    }
Instruction groups and bundles
Itanium® and Itanium® 2 fetch 2 bundles at a time for execution.
They may or may not execute in parallel.
[Figure: handwritten code (a column of instructions, with ;; marking the end of each group) is turned by the code generator into instruction bundles of three instruction slots plus a template, padded with nops where needed. Bundles are fetched in pairs for execution. Question posed: can the bundle pair execute in parallel?]
• Code generator creates bundles, possibly including nops.
• Forgetting end-of-group may be fatal:
    add r1 = r1, r5 ;;
    st8 [r7] = r1
There are two difficulties:

1) Finding instruction triplets matching the defined templates.
2) Matching pairs of bundles that can execute in parallel.
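The first difficulty, fitting instructions into the defined templates, can be sketched with a greedy packer that pads with nops. This is a simplified illustration rather than how a real code generator works: the template list is a subset given from memory of the architecture (MLX is left out because its long instruction spans two slots), stops inside a bundle are ignored, and each instruction is reduced to a single unit letter (M, I, F or B). The names `fill` and `make_bundles` are hypothetical.

```python
# Assumed subset of template shapes, without their stop variants.
TEMPLATES = ("MII", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB")


def fill(template, pending):
    """Place leading pending instructions into a template, in order.

    Returns (slots, used): the three slot contents and how many pending
    instructions were consumed. Unfilled slots become "nop".
    """
    slots, used = [], 0
    for unit in template:
        if used < len(pending) and pending[used] == unit:
            slots.append(pending[used])
            used += 1
        else:
            slots.append("nop")
    return slots, used


def make_bundles(units):
    """Greedily pick, for each bundle, the template consuming most instructions."""
    bundles, pending = [], list(units)
    while pending:
        best = max(TEMPLATES, key=lambda t: fill(t, pending)[1])
        slots, used = fill(best, pending)
        if used == 0:
            raise ValueError(f"no template accepts unit {pending[0]!r}")
        bundles.append((best, slots))
        pending = pending[used:]
    return bundles
```

Even this toy version shows why nops appear: a trailing integer instruction still occupies a whole bundle, with the unused slots padded.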
Massive On Chip Resources
• Several register files visible to the programmer:
– 128 General registers
– 128 Floating-point registers
– 64 Predicate registers
– 8 Branch registers
– 128 Application registers
– Instruction Pointer (IP) register
• Control Registers
• Processor Status Register (includes slot index within current bundle)
Improving Branch Handling
What is the problem?
• Traditional CPUs:
• Branch-prediction is used to predict the most likely set of
instructions
• Correct branch prediction keeps the execution pipelines full
• A mispredicted branch flushes the pipeline with a large
penalty
• Itanium® architecture improves branch handling:

• Provide a way to minimize branches using predicates
• Provide support for special branch instructions
– counted loop
Branch Handling
• Predication
– Conditional execution of instructions
– When the predicate is true, the instruction is executed
– When it is false, the instruction is treated as a NOP
• Predication converts a control dependency into a data
dependency
• Predication eliminates branches in the code
Predication
• Traditional code:
    if (a>b)
        c = c + 1
    else
        d = d * e + f
• Avoid branch by using predicated code
    p1, p2 = compare(a>b)
    if (p1) c = c + 1
    if (p2) d = d * e + f
– Predicate p1 set to 1 if compare is true, and to 0 if it evaluates to false
– p2 is the complement of p1
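The if-conversion above can be modelled in a few lines. This is a sketch of the idea only: both arms are computed unconditionally and the predicates decide which result is committed. The helper `commit` and the function names are hypothetical, and Python's conditional expression merely stands in for the hardware discarding a predicated-off instruction.

```python
def branchy(a, b, c, d, e, f):
    """Traditional form: a branch selects which statement runs."""
    if a > b:
        c = c + 1
    else:
        d = d * e + f
    return c, d


def commit(pred, new, old):
    """Model of a predicated write: keep the old value when pred is false."""
    return new if pred else old


def predicated(a, b, c, d, e, f):
    """Predicated form: one compare sets p1 and its complement p2."""
    p1 = a > b
    p2 = not p1
    # Both instructions are issued; only the one whose predicate is
    # true changes architectural state.
    c = commit(p1, c + 1, c)
    d = commit(p2, d * e + f, d)
    return c, d
```

Both functions return the same (c, d) pair for any inputs, which is the property that lets the compiler replace the branch.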
Predication
Before:
• Instructions c = c + 1 and d = d * e + f are control dependent on a>b
After:
• Instructions are data dependent:
– On the values of p1 and p2
– They determine execution
– The branch is eliminated
Predication
Traditional Architecture:
    Cmp a,b
    br NEQ        (jump to else)
    Y=3           (then)
    br END        (jump)
    Y=4           (else)
Itanium® Architecture:
    Cmp a,b -> pT, pF
    (pT) Y=3
    (pF) Y=4
Code for both paths loaded and routed to different execution pipelines.
Only one 'branch' will have a valid predicate and be executed.
Reducing Memory Access Cost
• Itanium® architecture eliminates many memory accesses through:
– large register files to manage work in progress
– better control of the memory hierarchy (cache hints)
• Itanium® architecture reduces the cost of the remaining memory accesses by:
– moving load instructions earlier in the code
• Data speculation: the execution of a load before a preceding store
• Control speculation: the execution of a load before its guarding branch
– hiding memory latency
– enabling the processor to bring in the data in time
– avoiding processor stalls
"Data Speculation": Advanced Loads
• Load is performed before a store that logically precedes it
– may potentially use the same address
– also referred to as 'advanced load'
– at compile time memory addresses need to be "disambiguated" (relationship between load and store address)
Traditional sequence:
    store(st_addr,data)
    load(ld_addr,target)
    use(target)
Itanium® architecture sequence:
    aload(ld_addr,target)
    /* other operations including uses of target */
    store(st_addr,data)
    acheck(target,recovery_addr)
    use(target)
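The aload / store / acheck sequence can be modelled with a small table that remembers which addresses have outstanding advanced loads. This is a toy model under stated assumptions: the `Machine` class and its method names are hypothetical, the table is a plain dictionary standing in for the hardware structure that tracks advanced loads, and recovery is modelled as simply redoing the load.

```python
class Machine:
    def __init__(self, memory):
        self.memory = dict(memory)   # address -> value
        self.regs = {}               # register name -> value
        self.advanced = {}           # target register -> watched address

    def load(self, addr, target):
        """Ordinary load."""
        self.regs[target] = self.memory[addr]

    def aload(self, addr, target):
        """Advanced load: load early and remember the address."""
        self.load(addr, target)
        self.advanced[target] = addr

    def store(self, addr, value):
        """Store: any advanced load from the same address becomes invalid."""
        self.memory[addr] = value
        self.advanced = {t: a for t, a in self.advanced.items() if a != addr}

    def acheck(self, target, recovery):
        """Check an advanced load; run recovery code if a store hit it."""
        if target in self.advanced:
            return True
        recovery()
        return False


m = Machine({0x10: 1, 0x20: 2})
m.aload(0x10, "r5")
m.store(0x20, 99)                                  # different address
ok_no_conflict = m.acheck("r5", lambda: m.load(0x10, "r5"))

m.aload(0x10, "r5")
m.store(0x10, 7)                                   # same address: conflict
ok_conflict = m.acheck("r5", lambda: m.load(0x10, "r5"))
```

In the first case the early load stands and no recovery runs; in the second the check fails and the recovery code reloads the value written by the store.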
"Control Speculation"
• Load is performed before the branch that guards it
– Need to check for exceptions
Traditional architecture sequence:
    if a>b
    then
        load(ld_addr1,target1)
    else
        load(ld_addr2,target2)
Itanium® sequence:
    sload(ld_addr1,target1)
    sload(ld_addr2,target2)
    /* other operations including usage of target1/target2 */
    if a>b
    then
        scheck(target1,recovery_addr1)
    else
        scheck(target2,recovery_addr2)
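The sload / scheck sequence can be sketched by letting a speculative load return a "deferred fault" marker instead of raising immediately. This is an illustrative model only: the marker `DEFERRED`, the dictionary used as memory, and the function names are assumptions for the example, and a missing key stands in for an access that would fault.

```python
DEFERRED = object()   # marker: the speculative load would have faulted


def load(memory, addr):
    """Ordinary load: faults (KeyError) on an invalid address."""
    return memory[addr]


def sload(memory, addr):
    """Speculative load: never faults, records the failure in the result."""
    return memory.get(addr, DEFERRED)


def scheck(value, recovery):
    """Check a speculative result; run recovery only if it was deferred."""
    return recovery() if value is DEFERRED else value


def pick(memory, a, b, addr1, addr2):
    # Both loads are hoisted above the branch.
    target1 = sload(memory, addr1)
    target2 = sload(memory, addr2)
    # Only the load on the path actually taken is checked.
    if a > b:
        return scheck(target1, lambda: load(memory, addr1))
    return scheck(target2, lambda: load(memory, addr2))
```

A bad address on the path not taken is harmless here, while a bad address on the taken path still faults, just later, when the check runs the ordinary load.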
Massive Memory Resources
• Physical memory
– Full implementation will address 16 EB of physical memory (2^64 bytes)
• 16,000,000,000 GB
– Itanium® microprocessor has 44-bit address bus
• 16 TB (16,000 GB) physical memory addressable
– Itanium® 2 microprocessor has 50-bit address bus
• Virtual memory
– Itanium® microprocessor uses 50 bits
– Itanium® 2 microprocessor uses 64 bits
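The sizes quoted above follow directly from the address widths. The short sketch below converts a bit count into an addressable size using binary units (1 KB = 1024 bytes); the function name `addressable` is hypothetical.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def addressable(bits):
    """Return the size addressable with `bits` address bits, in binary units."""
    size = 1 << bits          # number of addressable bytes
    index = 0
    while size >= 1024 and index < len(UNITS) - 1:
        size //= 1024
        index += 1
    return f"{size} {UNITS[index]}"


sizes = {bits: addressable(bits) for bits in (44, 50, 64)}
```

With binary units, 44 bits give 16 TB (16,384 GB, rounded to 16,000 GB on the slide), 50 bits give 1 PB, and 64 bits give 16 EB.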
Supporting Modular Code
Procedure Call Overhead
• Modular programs create more overhead

– Programs tend to be call intensive
– Register space shared by caller and callee
– Call/Returns require register save/restores
– Frequent memory access
– Limitations due to resource shortage
• Itanium® solution
– Massive register resources
• Renaming, rotating
• Integer registers stackable
– Register Stack Engine (RSE)
• Eliminates memory accesses
• Allows local registers to be allocated dynamically
Register Stack
• The general register file is divided into two subsets:
• Static: 32 permanent registers (r0-r31)
– visible to all procedures
– used for global variables
• Stacked: 96 other registers are like a stack
– procedure code allocates up to 96 registers for a frame
• Frame allocation:
– previous frame is hidden
– first register of the new frame is renamed to logical register r32
– small frames eliminate/reduce saving/restoring registers to/from memory
Procedure Call Overhead
IA-32                                        Itanium® Architecture
• Procedure A                                Procedure A
• call B                                     call B
• Procedure B                                Procedure B
• save current register state                alloc, no save!
• ...                                        ...
• restore previous register state            no restore! (remap)
• return                                     return
Register Stack Engine (RSE)
• When a procedure is called

– New frame of registers is made available
– Caller’s register content remain in registers,
invisible and inaccessible to called procedure
– If deep nesting exhausts physical registers
the RSE will save contents of hidden registers
to memory to free up resources
– On return to caller, caller’s register content
automatically restored
• RSE works in background, utilizing unused
memory bandwidth
• Activity not visible to application programs
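The behaviour described above can be sketched with a toy model of the 96 stacked registers. This is a deliberate simplification and not the real mechanism: the `RegisterStack` class is hypothetical, it spills and refills whole frames at once, and it only counts spills and fills instead of moving register contents.

```python
class RegisterStack:
    def __init__(self, physical=96):
        self.physical = physical   # stacked physical registers available
        self.frames = []           # frame sizes, outermost caller first
        self.spilled = 0           # number of outermost frames held in memory
        self.spill_count = 0
        self.fill_count = 0

    def _resident(self):
        """Registers currently occupied by frames that are not spilled."""
        return sum(self.frames[self.spilled:])

    def call(self, size):
        """Allocate a new frame; spill the oldest frames if space runs out."""
        if not 0 <= size <= self.physical:
            raise ValueError("frame larger than the stacked register file")
        self.frames.append(size)
        while self._resident() > self.physical:
            self.spilled += 1
            self.spill_count += 1

    def ret(self):
        """Drop the current frame; refill the caller's frame if it was spilled."""
        self.frames.pop()
        if self.frames and self.spilled == len(self.frames):
            self.spilled -= 1
            self.fill_count += 1


rs = RegisterStack()
for _ in range(5):
    rs.call(20)        # five nested calls, 20 registers each
spills_after_calls = rs.spill_count
for _ in range(4):
    rs.ret()           # unwind back to the outermost procedure
```

With 20-register frames, the first four calls fit in the 96 registers without touching memory. The fifth call forces one frame out, and that frame is brought back only when execution returns to its owner.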
Loop Optimization Overhead
• Enhance loop performance:

– Done by unrolling loops
– Causes code expansion
– Prologue/epilogue add to code size
• Itanium® solution
– Software pipelining
– Architecture support
– Minimal prologue/epilogue code
– Predication
– Loop control registers (LC, EC)
– Loop branches (br.ctop, br.wtop)
IA64 Instruction Peculiarities
There is a floating point multiply and add instruction, fma (f= a*b+c)
A simple floating point multiply is a fma with c=0.
A simple floating point add is a fma with b=1.
There is an integer multiply and add instruction, which executes in fp registers!
There is a memory fence instruction: mf (Alpha: MB)
There are three atomic semaphore instructions: xchg, cmpxchg and fetchadd.
There are no load/store instructions with immediate offsets a la LDQ R1, 32(R5) on Alpha.
There are speculative and advanced loads that do not exist on Alpha.
The Register Stack Engine (RSE) is a powerful tool in procedure nestings.
• Itanium® Architecture Training
• https://shale.intel.com/softwarecollege/
Q & A
