What is the Itanium® Architecture?
Thomas Siebold
Technology Consultant
Alpha Systems Division
thomas.siebold@hp.com
Agenda
• Terminology
• What is the Itanium® Architecture?
Terminology
IA64 Architecture
Itanium® Processor Family Roadmap
• Intel has enhanced the Itanium® Processor Family roadmap
– To deliver the most competitive product offerings for enterprise customers
– To pull in dual-core technology as early as possible and deliver a significant performance boost
– To maintain a consistent introduction rate for new Itanium® Processor Family product offerings
[Roadmap figure, 2002 to 2005]
2002: Itanium® 2 Processor (1 GHz, 3MB L3)
2003: Itanium® 2 Processor (Madison & Deerfield) (1.5GHz, 6MB L3)
2004: Itanium® 2 Processor (Madison 9M) (>1.5GHz, 9MB L3)
2005: Montecito (Dual Core)
Silicon Process: 0.18 µm, 0.13 µm, 90 nm
Roadmap maintains world class performance
New features !
Alpha EV79
PA 8800+
Itanium2 Processor
• 221M FETs
• 421 mm² die (21.6 mm × 19.5 mm)
• 90+% of the transistors and 50+% of the die area are devoted to cache and cache support logic!
What is the Itanium® Architecture?
Traditional CPU Architectures
• Performance barriers:
- Memory latency
- Branches
- Loop pipelining
- Procedure call / return overhead
• Headroom constraints:
- Hardware-based instruction scheduling
- Unable to efficiently schedule parallel execution
• Resource constraints:
- Too few registers
- Unable to fully utilize multiple execution units
Improving Performance
Increasing Instruction Level Parallelism
Instruction Format: Bundles & Templates
Instruction Groups
• Instruction groups:
• Set of instructions
• No dependencies (read-after-write) within group
• May execute in parallel
• The processor executes as many instructions per instruction group as possible, based on its resources
• Must contain at least one instruction (no upper limit)
• Instruction groups are delimited by stops (;;), also called cycle breaks
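The grouping rule above can be sketched in Python. This is only an illustration under stated assumptions: the `Instr` type and `split_into_groups` function are hypothetical names, the register read/write sets are supplied by hand, and only the read-after-write rule named on the slide is modelled (a real compiler considers more than that).

```python
from typing import List, NamedTuple, Set


class Instr(NamedTuple):
    """Hypothetical instruction record: text plus registers written and read."""
    text: str
    writes: Set[str]
    reads: Set[str]


def split_into_groups(instrs: List[Instr]) -> List[List[Instr]]:
    """Start a new group whenever an instruction reads a register
    written earlier in the current group (read-after-write)."""
    groups: List[List[Instr]] = []
    current: List[Instr] = []
    written: Set[str] = set()
    for ins in instrs:
        if current and (ins.reads & written):
            groups.append(current)      # a stop (;;) goes here
            current, written = [], set()
        current.append(ins)
        written |= ins.writes
    if current:
        groups.append(current)
    return groups


program = [
    Instr("ld8 r10 = [r5]", {"r10"}, {"r5"}),
    Instr("add r1 = r2, r3", {"r1"}, {"r2", "r3"}),
    Instr("add r4 = r1, r10", {"r4"}, {"r1", "r10"}),  # needs r1 and r10
    Instr("add r6 = r7, r8", {"r6"}, {"r7", "r8"}),
]
groups = split_into_groups(program)

# Print the program with a stop after each group.
for group in groups:
    for ins in group:
        print(ins.text)
    print(";;")
```

The third instruction reads r1 and r10, so the sketch places a stop before it; the fourth has no dependency and stays in the second group.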
Instruction bundles
Instructions are fetched and executed in bundles.
{
.mii // template
ld8 r10 = [r5] // slot 0, Memory
add r1 = r2, r3 // slot 1, Integer
add r4 = r5, r6 // slot 2, Integer
}
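As an illustration of the bundle format, the sketch below packs and unpacks a 128-bit bundle in Python, using the layout from the architecture definition: a 5-bit template in the low bits followed by three 41-bit instruction slots. The function names are hypothetical, the slot payloads are arbitrary placeholder values rather than real instruction encodings, and the template value 0x00 for MII is assumed here.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1

TEMPLATE_MII = 0x00  # assumed template encoding for Memory, Integer, Integer


def pack_bundle(template: int, slot0: int, slot1: int, slot2: int) -> int:
    """Pack a template and three slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
    return (
        template
        | (slot0 << TEMPLATE_BITS)
        | (slot1 << (TEMPLATE_BITS + SLOT_BITS))
        | (slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS))
    )


def unpack_bundle(bundle: int):
    """Return (template, slot0, slot1, slot2) from a 128-bit integer."""
    return (
        bundle & TEMPLATE_MASK,
        (bundle >> TEMPLATE_BITS) & SLOT_MASK,
        (bundle >> (TEMPLATE_BITS + SLOT_BITS)) & SLOT_MASK,
        (bundle >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & SLOT_MASK,
    )


# Placeholder slot contents, not real opcodes.
bundle = pack_bundle(TEMPLATE_MII, 0x111, 0x222, 0x333)
fields = unpack_bundle(bundle)
```

5 + 3 × 41 = 128, so a bundle with every field at its maximum value fills exactly 128 bits.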
Instruction groups and bundles
Itanium® and Itanium2® fetch 2 bundles at a time for execution.
They may or may not execute in parallel.
Improving Branch Handling
• Traditional CPUs:
• Branch prediction is used to predict the most likely set of instructions
• Correct branch prediction keeps the execution pipelines full
• A mispredicted branch flushes the pipeline with a large penalty
Branch Handling
• Predication
– Conditional execution of instructions
– When the predicate is true, the instruction is executed
– When it is false, the instruction is treated as a NOP
• Predication converts a control dependency into a data dependency
• Predication can eliminate branches in the code
Predication
• Traditional code:
if (a>b)
c = c + 1
else
d = d * e + f
• Avoid branch by using predicated code
p1, p2 = compare(a>b)
if (p1) c = c + 1
if (p2) d = d * e + f
– Predicate p1 is set to 1 if the compare is true, and to 0 if it evaluates to false
– p2 is the complement of p1
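The equivalence of the two forms can be checked with a small Python model. This is a sketch, not hardware behaviour: the function names are hypothetical, and Python's conditional expression stands in for the predicate deciding whether a result is committed or the instruction is treated as a NOP.

```python
def branching(a, b, c, d, e, f):
    """Traditional code: one of the two paths is chosen by a branch."""
    if a > b:
        c = c + 1
    else:
        d = d * e + f
    return c, d


def predicated(a, b, c, d, e, f):
    """Predicated code: both operations are issued, predicates gate the result."""
    p1 = a > b          # p1, p2 = compare(a > b)
    p2 = not p1         # p2 is the complement of p1
    c_result = c + 1    # issued regardless of p1
    d_result = d * e + f  # issued regardless of p2
    c = c_result if p1 else c   # committed only when p1 is true
    d = d_result if p2 else d   # committed only when p2 is true
    return c, d


# Both forms agree whether the compare is true or false.
taken = predicated(5, 3, 10, 2, 4, 1)
not_taken = predicated(3, 5, 10, 2, 4, 1)
```

With a=5, b=3 only c changes; with a=3, b=5 only d changes, matching the branching version in both cases.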
Predication
Before:
• Instructions c = c + 1 and d = d * e + f are control dependent on a>b
After:
• Instructions are data dependent:
– On the values of p1 and p2
– They determine execution
– The branch is eliminated
Predication
Traditional code:
br NEQ (Jump)
then: Y=3
br END (Jump)
else: Y=4
Predicated code:
pT Y=3
pF Y=4
Code for both paths is loaded and routed to different execution pipelines.
Only one ‘branch’ will have a valid predicate and be executed.
Reducing Memory Access Cost
"Data Speculation": Advanced Loads
Traditional sequence:
store(st_addr,data)
load(ld_addr,target)
use(target)
Itanium® architecture sequence:
aload(ld_addr,target)
/* other operations including uses of target */
store(st_addr,data)
acheck(target,recovery_addr)
use(target)
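The advanced-load sequence can be modelled in a few lines of Python. This is a simplified sketch: the class `AdvancedLoadMachine` and its methods are hypothetical, the table keyed by target register loosely mimics the Advanced Load Address Table (ALAT) of the architecture, and recovery is modelled as a callback rather than a branch to recovery code.

```python
class AdvancedLoadMachine:
    """Toy model of data speculation with an ALAT-like table."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.regs = {}
        self.alat = {}  # target register -> address of the advanced load

    def aload(self, addr, target):
        """Advanced load: load early and remember the address."""
        self.regs[target] = self.memory[addr]
        self.alat[target] = addr

    def store(self, addr, value):
        """Store: invalidate any advanced load from the same address."""
        self.memory[addr] = value
        for target in [t for t, a in self.alat.items() if a == addr]:
            del self.alat[target]

    def acheck(self, target, recovery):
        """Check: run recovery only if a store hit the loaded address."""
        if target in self.alat:
            return True
        recovery()
        return False


# Case 1: the store goes to a different address, speculation holds.
m1 = AdvancedLoadMachine({100: 1, 200: 2})
m1.aload(100, "target")
m1.store(200, 9)
ok1 = m1.acheck("target", lambda: m1.aload(100, "target"))

# Case 2: the store aliases the load, recovery reloads the value.
m2 = AdvancedLoadMachine({100: 1, 200: 2})
m2.aload(100, "target")
m2.store(100, 7)
ok2 = m2.acheck("target", lambda: m2.aload(100, "target"))
```

In the first case the early-loaded value is used as is; in the second the check fails and the recovery step picks up the stored value 7.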
"Control Speculation"
Traditional sequence:
if a>b
then
load(ld_addr1,target1)
else
load(ld_addr2,target2)
Itanium® architecture sequence:
sload(ld_addr1,target1)
sload(ld_addr2,target2)
/* other operations including usage of target1/target2 */
if a>b
then
scheck(target1,recovery_addr1)
else
scheck(target2,recovery_addr2)
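A minimal Python model of the control-speculation sequence follows. It is a sketch under assumptions: `SpeculativeLoadMachine`, `run` and the addresses are hypothetical, a missing dictionary key stands in for a faulting address, and the `NAT` token plays the role of the deferred-exception marker that a speculative load leaves in its target instead of faulting.

```python
NAT = object()  # deferred-exception token ("not a thing")


class SpeculativeLoadMachine:
    """Toy model of control speculation with deferred faults."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.regs = {}

    def load(self, addr, target):
        """Normal load: an unmapped address faults (KeyError) immediately."""
        self.regs[target] = self.memory[addr]

    def sload(self, addr, target):
        """Speculative load: a fault is deferred by writing NAT."""
        self.regs[target] = self.memory.get(addr, NAT)

    def scheck(self, target, recovery):
        """Check: run recovery only if the speculative load was deferred."""
        if self.regs[target] is NAT:
            recovery()
            return False
        return True


def run(a, b, memory, addr1=0x10, addr2=0x20):
    m = SpeculativeLoadMachine(memory)
    # Both loads are hoisted above the condition.
    m.sload(addr1, "target1")
    m.sload(addr2, "target2")
    if a > b:
        m.scheck("target1", lambda: m.load(addr1, "target1"))
        return m.regs["target1"]
    m.scheck("target2", lambda: m.load(addr2, "target2"))
    return m.regs["target2"]


# addr2 is unmapped, but with a > b that path is never checked.
value = run(2, 1, {0x10: 42})
```

With a > b the bad address on the untaken path never raises; with a <= b the fault surfaces only when `scheck` triggers the recovery load.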
Massive Memory Resources
• Physical memory
– Full implementation will address 16 EB of physical memory (2^64 bytes)
• approx. 16,000,000,000 GB
– Itanium® architecture microprocessor has a 44-bit address bus
• 16 TB (approx. 16,000 GB) physical memory addressable
– Itanium2® architecture microprocessor has a 50-bit address bus
• Virtual memory
– Itanium® architecture microprocessor uses 50 bits
– Itanium2® architecture microprocessor uses 64 bits
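The sizes that follow from the address widths quoted above can be worked out directly. The short Python sketch below uses binary units (1 TiB = 2^40 bytes), so the figures are slightly larger than the rounded decimal values on the slide; the helper names are hypothetical.

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def addressable_bytes(bits: int) -> int:
    """Number of bytes reachable with an address of the given width."""
    return 1 << bits


def human(n: int) -> str:
    """Format an exact power-of-two byte count in binary units."""
    i = 0
    while n >= 1024 and n % 1024 == 0 and i < len(UNITS) - 1:
        n //= 1024
        i += 1
    return f"{n} {UNITS[i]}"


sizes = {bits: human(addressable_bytes(bits)) for bits in (44, 50, 64)}
for bits, size in sizes.items():
    print(f"{bits}-bit address: {size}")

# 2^64 bytes expressed in GiB, for comparison with the slide's round figure.
full_in_gib = addressable_bytes(64) // (1 << 30)
```

A 44-bit address reaches 16 TiB, a 50-bit address 1 PiB, and a full 64-bit address 16 EiB, which is 17,179,869,184 GiB.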
Supporting Modular Code
Procedure Call Overhead
Register Stack
Procedure Call Overhead
Traditional:
• Procedure B
• save current register state
• ...
• restore previous register state
• return
Register Stack:
• Procedure B
• alloc, no save!
• ...
• no restore! (remap)
• return
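The "alloc, no save / no restore (remap)" idea can be sketched with a toy register stack in Python. Everything here is a simplification under assumptions: the class and method names are hypothetical, stacked registers are numbered from r32, the caller's output registers become the callee's first registers, and running out of physical registers raises an error instead of spilling to memory as real hardware would.

```python
class RegisterStack:
    """Toy model: each procedure sees a window into one physical register file."""

    def __init__(self, size=96):
        self.phys = [0] * size
        self.base = 0        # physical index of the current frame's r32
        self.n_local = 0     # inputs + locals of the current frame
        self.n_out = 0       # outputs of the current frame
        self.frames = []     # saved frame markers, not register contents

    def alloc(self, n_local, n_out):
        """Resize the current frame; no register contents are copied."""
        if self.base + n_local + n_out > len(self.phys):
            raise OverflowError("toy model does not spill to memory")
        self.n_local, self.n_out = n_local, n_out

    def _index(self, reg):
        offset = reg - 32
        if not 0 <= offset < self.n_local + self.n_out:
            raise IndexError(f"r{reg} is outside the current frame")
        return self.base + offset

    def read(self, reg):
        return self.phys[self._index(reg)]

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def call(self):
        """Remap: the caller's outputs become the callee's r32 onward."""
        self.frames.append((self.base, self.n_local, self.n_out))
        self.base += self.n_local
        self.n_local, self.n_out = self.n_out, 0

    def ret(self):
        """Remap back to the caller's frame; nothing is restored from memory."""
        self.base, self.n_local, self.n_out = self.frames.pop()


rs = RegisterStack()
rs.alloc(n_local=2, n_out=1)   # Procedure A: r32, r33 local, r34 output
rs.write(32, 111)              # A's local value
rs.write(34, 5)                # argument for Procedure B
rs.call()
rs.alloc(n_local=2, n_out=0)   # Procedure B: r32 input, r33 local
arg_seen_by_b = rs.read(32)    # A's r34 appears as B's r32
rs.write(33, 999)              # B's local lands outside A's frame
rs.ret()
a_local_after_return = rs.read(32)
```

Procedure B reads its argument and writes a local without touching A's registers, and A finds its r32 unchanged after the return, with no save or restore code in either procedure.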
Loop Optimization Overhead
There is a floating-point multiply-and-add instruction, fma (f = a*b + c).
A simple floating-point multiply is an fma with c = 0.
A simple floating-point add is an fma with b = 1.
There are three atomic semaphore instructions: xchg, cmpxchg and fetchadd.
There are no load/store instructions with immediate offsets à la LDQ R1, 32(R5) on Alpha.
There are speculative and advanced loads that do not exist on Alpha.
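The fma identities listed above (multiply as fma with c = 0, add as fma with b = 1) can be checked with a small Python model. This is a sketch, not the hardware instruction: the hypothetical `fma` helper computes a*b + c exactly with rational arithmetic and rounds once to a double, which models the single rounding of a fused operation for finite inputs only (infinities, NaNs and the sign of zero are not handled).

```python
from fractions import Fraction


def fma(a: float, b: float, c: float) -> float:
    """a*b + c computed exactly, then rounded once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


# Multiply is fma with c = 0; add is fma with b = 1.
product = fma(0.1, 0.2, 0.0)
total = fma(0.1, 1.0, 0.2)

# A fused operation rounds once, so it can differ from multiply-then-add.
fused = fma(0.1, 10.0, -1.0)
unfused = 0.1 * 10.0 - 1.0
```

In the last pair, the separate multiply rounds 0.1 * 10.0 to exactly 1.0 and the subtraction yields 0.0, while the fused form keeps the tiny residue 2^-54 that comes from 0.1 not being exactly representable.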
• Itanium® Architecture Training
• https://shale.intel.com/softwarecollege/
Q & A