ARM Architecture
Sept 14, 2005
Kyoung-Su Kim
E-mail: kimks@rayman.sejong.ac.kr
Real-Time Graphics Lab., Sejong Univ.
• Introduction
• ARM architecture
• Coprocessor Interface
• Processor Cores
• AMBA
Introduction
ARM architecture
• Architecture version
Version 1 (obsolete)
Basic data processing
Byte, word and multi-word load/store
Software interrupt
26 bit address bus
Version 2 (obsolete)
Multiply & Multiply-accumulate
Coprocessor support
Atomic instruction for thread synchronization
26 bit address bus
Version 3
Version 4
Half-word transfer
Introduces the THUMB processor state
Adds a privileged mode for the operating system
PC is two words ahead of the current instruction
PC+8 behavior (in ARM state)
Version 5
• Architecture Variants
THUMB (denoted by the letter T)
Programmers Model
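Big Endian
• Most significant byte is at the lowest address
• Word is addressed by the byte address of its most significant byte
Little Endian
• Least significant byte is at the lowest address
• Word is addressed by the byte address of its least significant byte
[Figure: byte numbering (0 to 11) within the 32-bit words at word addresses 0, 4 and 8, drawn from lower to higher address, for the big-endian and little-endian layouts]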
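The two byte orders can be illustrated with a short sketch. The `store_word` helper and the dictionary used as byte-addressed memory are assumptions made for this example, not part of the slides.

```python
def store_word(memory, address, value, big_endian):
    """Store a 32-bit word as four bytes starting at a word address."""
    order = "big" if big_endian else "little"
    for offset, byte in enumerate(value.to_bytes(4, order)):
        memory[address + offset] = byte

big, little = {}, {}
store_word(big, 0, 0x11223344, big_endian=True)
store_word(little, 0, 0x11223344, big_endian=False)

# Big endian: the most significant byte (0x11) sits at the lowest address.
print([hex(big[a]) for a in range(4)])     # ['0x11', '0x22', '0x33', '0x44']
# Little endian: the least significant byte (0x44) sits at the lowest address.
print([hex(little[a]) for a in range(4)])  # ['0x44', '0x33', '0x22', '0x11']
```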
• Registers
37 registers
31 general 32-bit registers
6 status registers
16 general registers and one or two status registers are visible at any time
The visible registers depend on the processor mode
The other registers (the banked registers) are switched in to support IRQ, FIQ, Supervisor, Abort and Undefined mode processing
R0 to R15 are directly accessible
R0 to R14 are general purpose
R15 holds the Program Counter (PC)
CPSR (Current Program Status Register) contains the condition code flags and the current mode bits
5 SPSRs (Saved Program Status Registers), one of which is loaded with the CPSR when an exception occurs
• Registers (contd)
Registers visible in each mode (banked registers carry the mode suffix):
Mode          R0-R7   R8-R12           R13       R14       R15       Status registers
User32        R0-R7   R8-R12           R13       R14       R15 (PC)  CPSR
Fiq32         R0-R7   R8_fiq-R12_fiq   R13_fiq   R14_fiq   R15 (PC)  CPSR, SPSR_fiq
Supervisor32  R0-R7   R8-R12           R13_svc   R14_svc   R15 (PC)  CPSR, SPSR_svc
Abort32       R0-R7   R8-R12           R13_abt   R14_abt   R15 (PC)  CPSR, SPSR_abt
IRQ32         R0-R7   R8-R12           R13_irq   R14_irq   R15 (PC)  CPSR, SPSR_irq
Undefined32   R0-R7   R8-R12           R13_und   R14_und   R15 (PC)  CPSR, SPSR_und
• Processor Status Registers
Bit 31: N, Negative/Less Than
Bit 30: Z, Zero
Bit 29: C, Carry/Borrow/Extend
Bit 28: V, Overflow
I: IRQ disable
F: FIQ disable
M4 to M0: Mode bits
• Exceptions
Type of exception
FIQ (Fast Interrupt reQuest)
Externally generated by taking the nFIQ input LOW
Fast handling for data or channel transfer
IRQ (Interrupt ReQuest)
Normal interrupt caused by a LOW level on the nIRQ input
ABORT
Software interrupt
Generated by the software interrupt instruction (SWI)
Enters Supervisor mode, usually to request a particular supervisor function (OS support)
• Exceptions (contd)
Exception Priorities
• Instruction Set
Instruction Format
• Conditional execution
Cond field (bits 31-28 of the instruction)
0000 = EQ - Z set (equal)
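As a rough sketch of how the Cond field gates execution, the table below maps each condition value to a test on the N, Z, C and V flags. Only the EQ entry appears on the slide; the remaining entries are the standard ARM condition codes, and the names `CONDITIONS` and `condition_passed` are made up for this example. The value 1111 is left out of the table.

```python
# Cond value (bits 31-28) -> (mnemonic, test on the N, Z, C, V flags)
CONDITIONS = {
    0b0000: ("EQ", lambda n, z, c, v: z),
    0b0001: ("NE", lambda n, z, c, v: not z),
    0b0010: ("CS", lambda n, z, c, v: c),
    0b0011: ("CC", lambda n, z, c, v: not c),
    0b0100: ("MI", lambda n, z, c, v: n),
    0b0101: ("PL", lambda n, z, c, v: not n),
    0b0110: ("VS", lambda n, z, c, v: v),
    0b0111: ("VC", lambda n, z, c, v: not v),
    0b1000: ("HI", lambda n, z, c, v: c and not z),
    0b1001: ("LS", lambda n, z, c, v: (not c) or z),
    0b1010: ("GE", lambda n, z, c, v: n == v),
    0b1011: ("LT", lambda n, z, c, v: n != v),
    0b1100: ("GT", lambda n, z, c, v: (not z) and n == v),
    0b1101: ("LE", lambda n, z, c, v: z or n != v),
    0b1110: ("AL", lambda n, z, c, v: True),
}

def condition_passed(instruction, n, z, c, v):
    """Return True if the instruction's Cond field is satisfied by the flags."""
    cond = (instruction >> 28) & 0xF
    _name, test = CONDITIONS[cond]
    return bool(test(n, z, c, v))

# An instruction word whose Cond field is 0000 (EQ) executes only when Z is set.
print(condition_passed(0x0A000000, n=False, z=True, c=False, v=False))   # True
print(condition_passed(0x0A000000, n=False, z=False, c=False, v=False))  # False
```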
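• Control instruction
[Figure: branch and branch with link encoding: cond (bits 31-28), 101 (bits 27-25), L (bit 24), offset (bits 23-0)]
[Figure: branch and exchange encoding: cond (bits 31-28), 0001 0010 1111 1111 1111 00 (bits 27-6), L (bit 5), 1 (bit 4), Rm (bits 3-0)]
[Figure: data processing instruction encoding: cond (bits 31-28), 00 (bits 27-26), # (bit 25), opcode (bits 24-21, arithmetic/logic function), S (bit 20, set condition codes), Rn (bits 19-16, first operand register), Rd (bits 15-12, destination register), operand 2 (bits 11-0)]
Operand 2 when # = 1: #rot (bits 11-8, immediate alignment), 8-bit immediate (bits 7-0)
Operand 2 when # = 0: #shift (bits 11-7), Sh (bits 6-5), 0 (bit 4), Rm (bits 3-0); or Rs (bits 11-8), 0 (bit 7), Sh (bits 6-5), 1 (bit 4), Rm (bits 3-0)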
Mnemonic  Effect
AND       Rd := Rn AND Op2
EOR       Rd := Rn EOR Op2
SUB       Rd := Rn - Op2
RSB       Rd := Op2 - Rn
ADD       Rd := Rn + Op2
ADC       Rd := Rn + Op2 + C
SBC       Rd := Rn - Op2 + C - 1
RSC       Rd := Op2 - Rn + C - 1
TST       Scc on Rn AND Op2
TEQ       Scc on Rn EOR Op2
CMP       Scc on Rn - Op2
CMN       Scc on Rn + Op2
ORR       Rd := Rn OR Op2
MOV       Rd := Op2
BIC       Rd := Rn AND NOT Op2
MVN       Rd := NOT Op2
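The effects listed above can be written out directly. This is only an illustrative sketch: the function name `data_processing` and its return convention (a 32-bit result plus a flag saying whether Rd is written) are assumptions, and flag computation is omitted.

```python
MASK = 0xFFFFFFFF  # keep results to 32 bits

def data_processing(mnemonic, rn, op2, c):
    """Return (32-bit result, writes_rd) for one data processing operation.

    c is the carry flag as 0 or 1. TST, TEQ, CMP and CMN only set the
    condition codes, so their result is not written to Rd.
    """
    effects = {
        "AND": rn & op2,
        "EOR": rn ^ op2,
        "SUB": rn - op2,
        "RSB": op2 - rn,
        "ADD": rn + op2,
        "ADC": rn + op2 + c,
        "SBC": rn - op2 + c - 1,
        "RSC": op2 - rn + c - 1,
        "TST": rn & op2,
        "TEQ": rn ^ op2,
        "CMP": rn - op2,
        "CMN": rn + op2,
        "ORR": rn | op2,
        "MOV": op2,
        "BIC": rn & ~op2,
        "MVN": ~op2,
    }
    writes_rd = mnemonic not in ("TST", "TEQ", "CMP", "CMN")
    return effects[mnemonic] & MASK, writes_rd

print(data_processing("RSB", 3, 10, 0))       # (7, True)
print(data_processing("BIC", 0xFF, 0x0F, 0))  # (240, True)
print(data_processing("CMP", 5, 5, 0))        # (0, False)
```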
Shift operation
In any data processing instruction, the second register operand can have a shift operation applied to it.
Logical shift
Arithmetic shift
ASR: arithmetic shift right; unlike LSR, the vacated bits are filled with the sign bit
ASL: arithmetic shift left (same result as LSL)
• Multiply Instruction
[Figure: multiply instruction encoding: cond (bits 31-28), 0000 (bits 27-24), mul (bits 23-21), S (bit 20), Rd/RdHi (bits 19-16), Rn/RdLo (bits 15-12), Rs (bits 11-8), 1001 (bits 7-4), Rm (bits 3-0)]
Opcode [23:21]  Mnemonic
000             MUL
001             MLA
100             UMULL
101             UMLAL
110             SMULL
111             SMLAL
Addressing
Register offset
Address = base register +/- offset register
Immediate offset
Address = base register +/- immediate constant
Pre/Post indexing
Pre-indexing: modify address before use
Post-indexing: modify address after use
Auto increment or decrement
Write back the base register
Write back: selected by a special bit in the instruction
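A small sketch of the pre-indexed and post-indexed address calculation follows. The function name and parameters are hypothetical; it assumes, as on ARM, that post-indexing always updates the base register while pre-indexing updates it only when write back is requested.

```python
MASK = 0xFFFFFFFF

def effective_address(base, offset, pre_index, write_back, subtract=False):
    """Return (address used for the transfer, new base register value)."""
    delta = -offset if subtract else offset
    if pre_index:
        # Pre-indexing: the offset is applied before the address is used.
        address = (base + delta) & MASK
        new_base = address if write_back else base
    else:
        # Post-indexing: the unmodified base is used, then the base is updated.
        address = base
        new_base = (base + delta) & MASK
    return address, new_base

print([hex(x) for x in effective_address(0x1000, 4, pre_index=True, write_back=False)])
# ['0x1004', '0x1000']
print([hex(x) for x in effective_address(0x1000, 4, pre_index=True, write_back=True)])
# ['0x1004', '0x1004']
print([hex(x) for x in effective_address(0x1000, 4, pre_index=False, write_back=False)])
# ['0x1000', '0x1004']
```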
Atomic instruction
Cannot be interrupted during execution
The external memory management unit is locked during the operation by the LOCK signal output
Use
Synchronization in multi-threaded programs (OS support)
Lock
Semaphore
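To show why an uninterruptible read-and-write is enough to build a lock, here is a minimal sketch of a spin lock on top of an atomic swap. Everything in it is an assumption for illustration: the `Memory` class emulates the atomic instruction in software, and an internal `threading.Lock` plays the part of the LOCK signal.

```python
import threading
import time

class Memory:
    """One memory word with an atomic swap operation."""
    def __init__(self):
        self.word = 0
        self._bus_lock = threading.Lock()  # stands in for the LOCK signal

    def swap(self, new_value):
        # The read and the write happen as one indivisible step.
        with self._bus_lock:
            old, self.word = self.word, new_value
        return old

def acquire(mem):
    # 1 means "taken": keep swapping in 1 until the old value was 0.
    while mem.swap(1) == 1:
        time.sleep(0)  # let other threads run while we spin

def release(mem):
    mem.swap(0)

mem, counter = Memory(), [0]

def worker():
    for _ in range(200):
        acquire(mem)
        counter[0] += 1  # critical section
        release(mem)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])  # 800
```

A semaphore can be built the same way by guarding a counter with this lock.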
• Coprocessor instructions
Coprocessor
General mechanism to extend the instruction set through the addition of coprocessors to the core
Examples: system controller (such as MMU and cache), FPU
Registers
Private to the coprocessor
ARM controls the data flow
The coprocessor is concerned only with the data processing and memory transfer operations
[Figure: coprocessor data operation encoding: cond (bits 31-28), 1110 (bits 27-24), Cop1 (bits 23-20), CRn (bits 19-16), CRd (bits 15-12), CP# (bits 11-8), Cop2 (bits 7-5), 0 (bit 4), CRm (bits 3-0)]
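The field layout of the coprocessor data operation can be unpacked with a few shifts and masks. This is a sketch under the bit positions given by the encoding above; the function name `decode_cdp` and the dictionary keys are invented for the example.

```python
def decode_cdp(instruction):
    """Split a coprocessor data operation into its fields.

    Returns None when bits 27-24 are not 1110 or bit 4 is not 0,
    that is, when the word is not a coprocessor data operation.
    """
    if (instruction >> 24) & 0xF != 0b1110 or (instruction >> 4) & 1 != 0:
        return None
    return {
        "cond": (instruction >> 28) & 0xF,
        "cop1": (instruction >> 20) & 0xF,
        "crn": (instruction >> 16) & 0xF,
        "crd": (instruction >> 12) & 0xF,
        "cp_num": (instruction >> 8) & 0xF,
        "cop2": (instruction >> 5) & 0x7,
        "crm": instruction & 0xF,
    }

fields = decode_cdp(0xEE123446)
print(fields["cp_num"], fields["cop1"], fields["crn"], fields["crd"], fields["crm"])
# 4 1 2 3 6
```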
ARM version 6
Improved memory management
Multiprocessing
Adds new synchronization instructions (LDREX, STREX)
ARM SIMD (16-bit 2-way and 8-bit 4-way)
FFT, MPEG4
Saturation, Selection
• ARM extension
[Figure: pipeline block diagram with FETCH, DECODE, EXECUTE, MEMORY and WRITEBACK stages. Labels: Instruction Fetch, Bytecode Instruction Stream, Stack Management, ARM Decode, Thumb Decode, Register Decode, Register Read, Control Signals, Shift + ALU, Memory Access, Compute Partial Products, Sum/Accumulate & Saturation, Register Write]
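The 16-bit two-way SIMD with saturation listed for version 6 can be pictured as two independent signed 16-bit additions packed in one 32-bit word, each clamped to the 16-bit range instead of wrapping. The sketch below is an illustration only; the function names are made up and it does not model any particular instruction's flag behavior.

```python
def to_signed16(x):
    """Interpret a 16-bit value as a signed number."""
    return x - 0x10000 if x & 0x8000 else x

def saturate16(x):
    """Clamp to the signed 16-bit range instead of wrapping around."""
    return max(-0x8000, min(0x7FFF, x))

def saturating_add_2x16(a, b):
    """Add the two 16-bit halves of a and b independently, with saturation."""
    lo = saturate16(to_signed16(a & 0xFFFF) + to_signed16(b & 0xFFFF))
    hi = saturate16(to_signed16((a >> 16) & 0xFFFF) + to_signed16((b >> 16) & 0xFFFF))
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

# Upper half: 0x7FFF + 1 saturates at 0x7FFF. Lower half: 1 + 1 = 2.
print(hex(saturating_add_2x16(0x7FFF0001, 0x00010001)))  # 0x7fff0002
```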
Coprocessor Interface
• Implementation dependent
[Figure: ARM7 connected to memory and to a coprocessor through the nCPI, CPA and CPB signals]
• Busy-waiting
If CPA goes LOW, the ARM watches the CPB (coprocessor busy) line
The ARM will busy-wait while CPB is HIGH, unless an enabled interrupt occurs
When CPB goes LOW, the instruction continues to completion
• Pipeline following
• Data transfer cycles
• Privileged Instructions
• Idempotency
Any action taken by the coprocessor before it goes not-busy must be idempotent, i.e. it must be repeatable with identical results after an interrupt
Processor Cores
• ARM7
Two main blocks: datapath and decoder
Register bank (r0 to r15)
Two read ports to A-bus/B-bus
One write port from ALU-bus
Additional read/write ports for program counter r15
Barrel shifter / ALU
Address register / incrementer
Single memory port
Holds either PC address (with increment) or operand address
[Figure: ARM7 datapath organization. Labels: A[31:0], address register, incrementer, PC, register bank, multiply register, barrel shifter, ALU, A bus, B bus, ALU bus, instruction decode & control, data in register, data out register, D[31:0]]
• ARM7 (contd)
[Figure: 3-stage pipeline (fetch, decode, execute) against time for the instructions at PC, PC+4 and PC+8]
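The PC+8 behavior follows from the three-stage fetch, decode, execute pipeline: while one instruction executes, the next is being decoded and the one after that is being fetched. A minimal sketch, with hypothetical function names and assuming 4-byte ARM-state instructions:

```python
def pipeline_snapshot(executing_address):
    """Addresses occupying each stage while one instruction executes."""
    return {
        "execute": executing_address,
        "decode": executing_address + 4,
        "fetch": executing_address + 8,
    }

def read_r15(executing_address):
    """Value an instruction sees when it reads r15 in ARM state."""
    return executing_address + 8

snapshot = pipeline_snapshot(0x8000)
print({stage: hex(addr) for stage, addr in snapshot.items()})
# {'execute': '0x8000', 'decode': '0x8004', 'fetch': '0x8008'}
print(hex(read_r15(0x8000)))  # 0x8008
```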
• ARM7 (contd)
2-phase non-overlapping clocking scheme
[Figure: phase 1 and phase 2 within 1 clock cycle]
Datapath timing
[Figure: datapath timing across phase 1 and phase 2: ALU operands latched, register read time, shift time, ALU time, ALU out, register write time; precharge invalidates buses]
Multi-cycle operation
[Figure: pipeline timing for the sequence ADD, STR, ADD, ADD, ADD, showing the fetch, decode and execute stages against time]
• ARM7 (contd)
Memory Interface
Cycle type
Pipelined addressing
De-pipelined addressing
• ARM9
5-stage pipeline
Multi-cycle operation: MUL, multiple load/store
Data forwarding
[Figure: ARM9 pipeline organization. Fetch: next pc, +4, I-cache. Decode: instruction decode, register read, immediate fields, r15 (pc + 8). Execute: shift, ALU, mul, pre-index and post-index paths, LDM/STM, forwarding paths, mux. Buffer/data: D-cache, load/store address, byte repl., rot/sgn ex. Write-back: register write. Paths that load the pc: SUBS pc, LDR pc]
• ARM9 (contd)
[Figure: pipeline comparison. ARM7TDMI: Fetch (instruction fetch), Decode (Thumb decompress, ARM decode, reg read), Execute (shift/ALU, reg write). ARM9TDMI: Fetch (instruction fetch), Decode (decode, reg read), Execute (shift/ALU), Memory (data memory access), Write (reg write)]
• StrongARM
First developed by DEC, now Intel
Harvard architecture
Separate instruction and data ports
5-stage pipeline, same as ARM9
*SA1110: v4
*XScale: v5TE
[Figure: StrongARM pipeline organization. Fetch: next pc, +4, I-cache. Decode: instruction decode, register read, immediate fields, branch offset, + disp, branch target (B, BL), r15 (pc + 8). Execute: shift, pre-index and post-index paths, LDM/STM, forwarding paths, mux. Buffer/data: D-cache, load/store address, rotate, rot/sgn ex. Write-back: register write. Paths that load the pc: MOV pc, SUBS pc, LDR pc]
• StrongARM (contd)
Penalty cycle
CMP r0, #0
BNE label
[Figure: pipeline timing for the sequence above. CMP: fetch CMP, read r0, set CCs, (buffer), (write). BNE: fetch BNE, + disp, (execute), (buffer), (write). One following instruction is fetched before the branch target: fetch .., (decode), (execute), (buffer). Target: fetch tgt, decode, execute]
• StrongARM (contd)
Memory port
Multiplier
Branch adder
Multiply implementation
ARM7: 8 bit
ARM9: 8 bit
StrongARM: 12 bit
AMBA
• AMBA buses
• AMBA buses (contd)
AHB
- burst transfers
- split transactions
- single-cycle bus master handover
- single-clock edge operation
- wider data bus configurations (64/128 bits)
- multiple bus masters (up to 16)
- pipelined operation
ASB
- burst transfers
- pipelined operation
- multiple bus masters
APB
- low power
- latched address and control
- simple interface
- suitable for many peripherals
Master
Initiates read and write operations by providing an address and control information. Only one bus master is allowed to actively use the bus at any one time.
Slave
Responds to a read or write operation within a given address-space range. The bus slave signals back to the active master the success, failure or waiting of the data transfer.
Arbiter
Ensures that only one bus master at a time is allowed to initiate data transfers. It can use a priority scheme to choose between masters.
Decoder
Decodes the address of each transfer and provides a select signal for the slave that is involved in the transfer.
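The roles of the arbiter and the decoder can be sketched in a few lines. The fixed-priority rule (lowest master number wins), the address map and all names below are assumptions chosen for the example; the slides only say that a priority scheme may be used.

```python
def arbitrate(requests):
    """Grant the bus to one requesting master (assumed: lowest number wins)."""
    return min(requests) if requests else None

def decode(address, address_map):
    """Return the name of the slave whose address range contains the address."""
    for name, (base, size) in address_map.items():
        if base <= address < base + size:
            return name
    return None  # no slave selected

# Hypothetical address map: name -> (base address, size in bytes)
address_map = {
    "sram": (0x0000_0000, 0x1000),
    "uart": (0x8000_0000, 0x100),
}

granted = arbitrate({2, 0, 1})              # masters 0, 1 and 2 all request the bus
selected = decode(0x8000_0010, address_map)
print(granted, selected)                    # 0 uart
```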
• AMBA APB
• Processor core
Master in AHB
Connects through the memory interface of the core
• Memory System
Memory hierarchy
Cache system
Temporal locality
Spatial locality
• Write strategy
Write-through
All writes are passed to main memory immediately
If there is a hit, the cache is updated to hold the new value
The processor slows down to main memory speed during writes
Copy-back
A write operation updates the cache, but not main memory
The cache remembers that a line differs from main memory via a dirty bit
The line is copied back to main memory only when the cache line is reused for new data
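The difference between the two write strategies shows up in when main memory sees the new value. The toy model below is a sketch only: it tracks single words instead of cache lines, the class and method names are invented, and it assumes that a copy-back write allocates an entry in the cache even on a miss.

```python
class Cache:
    """Toy cache with one word per entry and a dirty bit."""
    def __init__(self, main_memory, copy_back):
        self.memory = main_memory
        self.copy_back = copy_back
        self.lines = {}  # address -> [value, dirty]

    def write(self, address, value):
        if self.copy_back:
            # Update only the cache and remember that it differs from memory.
            self.lines[address] = [value, True]
        else:
            # Write-through: main memory is always updated immediately.
            self.memory[address] = value
            if address in self.lines:  # on a hit, keep the cached copy current
                self.lines[address][0] = value

    def evict(self, address):
        """Reuse the entry for new data; copy it back first if it is dirty."""
        value, dirty = self.lines.pop(address)
        if dirty:
            self.memory[address] = value

wt_memory, cb_memory = {0x100: 0}, {0x100: 0}
Cache(wt_memory, copy_back=False).write(0x100, 42)
cb_cache = Cache(cb_memory, copy_back=True)
cb_cache.write(0x100, 42)
print(wt_memory[0x100], cb_memory[0x100])  # 42 0
cb_cache.evict(0x100)
print(cb_memory[0x100])                    # 42
```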
• Protection Unit
Register     Purpose
0            ID Register
1            Configuration
2            Cache Control
3            Write Buffer Control
5            Access Permissions
6            Region Base and Size
7            Cache Operations
9            Cache Lock Down
15           Test
4, 8, 10-14  UNUSED
[Figure: physical address space from 0x0 to 0xf..f divided into Region 0, Region 1, Region 2 and Region 3]
Configure ...
1. Cacheable
2. Use Write buffer
3. Privileged access
4. Enable / Disable
5. Size and Base Address
6. ......
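[Figure: CP15 register transfer instruction encoding: cond (bits 31-28), 1110 (bits 27-24), 000 (bits 23-21), L (bit 20), CRn (bits 19-16), Rd (bits 15-12), 1111 (bits 11-8), Cop2 (bits 7-5), 1 (bit 4), CRm (bits 3-0)]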
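Packing the fields of the CP15 register transfer encoding into a 32-bit word is a matter of shifting each field to its position. The helper name and default arguments below are assumptions for illustration; L = 1 is taken to mean a transfer from the coprocessor to the ARM register (MRC) and L = 0 the reverse (MCR).

```python
def encode_cp15_transfer(load, crn, rd, cop2=0, crm=0, cond=0b1110):
    """Build the instruction word for a CP15 register transfer."""
    return (
        (cond << 28)          # condition field, default 1110 (always)
        | (0b1110 << 24)      # fixed bits 27-24
        | (int(load) << 20)   # L bit; bits 23-21 stay 000
        | (crn << 16)         # coprocessor register
        | (rd << 12)          # ARM register
        | (0b1111 << 8)       # coprocessor number 15
        | (cop2 << 5)
        | (1 << 4)            # fixed bit 4
        | crm
    )

# Read CP15 register 1 into r0, then write r0 back to it.
print(hex(encode_cp15_transfer(load=True, crn=1, rd=0)))   # 0xee110f10
print(hex(encode_cp15_transfer(load=False, crn=1, rd=0)))  # 0xee010f10
```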
[Figure: protection unit organization: the address (bits 31-12) is matched against region 0 to region 7; a priority encoder selects among the attribute registers, which supply the cacheable, bufferable and permissions attributes]
• ARM MMU
[Figure: two-level translation: the virtual address indexes the 1st level page table, which leads to the 2nd level page table, which gives the physical address used for the memory access and data]
[Figure: section translation: CP15 register 2 supplies the translation base (bits 31-14); virtual address bits 31-20 are the table index and bits 19-0 the section index; the first-level descriptor holds the section base (bits 31-20), AP (bits 11-10), 0 (bit 9), domain (bits 8-5), ? (bit 4), C (bit 3), B (bit 2) and 1 0 (bits 1-0); the physical address is the section base followed by the section index]
[Figure: page translation: CP15 register 2 supplies the translation base (bits 31-14); virtual address bits 31-20 are the table index, bits 19-12 the page table index and bits 11-0 the page offset; the first-level descriptor (bits 1-0 = 01) holds the domain and the address of the 2nd level page table; the second-level descriptor holds the page base address (bits 31-12); the physical address is the page base address followed by the page offset]
Register   Purpose
0          ID Register
1          Control
2          Translation Table Base
3          Domain Access Control
5          Fault Status
6          Fault Address
7          Cache Operations
8          TLB Operations
9          Read Buffer Operations
10         TLB lockdown
13         Process ID Mapping
14         Debug Support
15         Test & Clock Control
4, 11-12   UNUSED
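Section translation can be sketched as a single table lookup: the top 12 bits of the virtual address select a first-level descriptor, and the section base from that descriptor replaces those 12 bits. The function name, the `read_word` callback and the sample table contents are assumptions for the example; access permission and domain checks are left out.

```python
def translate_section(virtual_address, translation_base, read_word):
    """Translate a virtual address that is mapped by a section descriptor."""
    table_index = virtual_address >> 20                      # VA bits 31-20
    descriptor_address = (translation_base & 0xFFFFC000) | (table_index << 2)
    descriptor = read_word(descriptor_address)
    if descriptor & 0b11 != 0b10:                            # type bits 1-0
        raise ValueError("not a section descriptor")
    section_base = descriptor & 0xFFF00000                   # bits 31-20
    section_index = virtual_address & 0x000FFFFF             # VA bits 19-0
    return section_base | section_index

# Hypothetical first-level table at 0x4000 with one section entry for index 0x123.
table = {0x4000 + (0x123 << 2): 0xABC00C0E}
physical = translate_section(0x12345678, 0x4000, table.__getitem__)
print(hex(physical))  # 0xabc45678
```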
• Stack (contd)
Idea of stack
Pop operation
Implemented by the LDM/STM instructions
• Subroutine
Subroutines let you modularize your code so that it is more reusable.
ARMulator
Cycle-accurate simulator
MMU, coprocessor
Profiler
Boot-up code
On reset, the processor starts at address 0x0