WITH BEST OF LUCK FROM: Prof. Vidya Gogate, SAKEC, Chembur

THE ARCHITECTURE FOR THE DIGITAL WORLD
The ARM Architecture (39v10)

Agenda
Introduction to ARM Ltd
Programmers Model
Instruction Set
System Design
Development Tools

ARM Ltd
Founded in November 1990
  Spun out of Acorn Computers
Designs the ARM range of RISC processor cores
Licenses ARM core designs to semiconductor partners who fabricate and sell to their customers
  ARM does not fabricate silicon itself
Also develops technologies to assist with the design-in of the ARM architecture
  Software tools, boards, debug hardware, application software, bus architectures, peripherals etc.

ARM Partnership Model

ARM Powered Products

Latest NEWS
For 30 years, Intel basically had the market to itself and, as a result, its chips were priced much higher than ARM chips.
But ARM's entry into Intel's territory is a real threat to Intel's dominance.
In fact, Microsoft's first version of the Surface tablet/keyboard device uses ARM chips. The Intel-based versions will come out about three months later.
There has been an important new development in the processor world lately. The folks behind the ARM processor, the chip that powers most smartphones and tablets today, decided to scale up this processor technology to run at speeds that could be used in advanced tablets and, more importantly, laptops and even desktops.
And ARM processors got a major boost when Microsoft made the decision to create an ARM-based version of Windows 8. For the first time, Microsoft broke away from what's been called the Win-Tel monopoly.
But Microsoft's new operating system for ARM, called Windows RT, has touched off a new battlefront in the processor wars.
This has forced Intel to try and make its x86 processors more energy efficient;
the company hopes to have chips that are on par with ARM by mid-2013.

RISC Design Philosophy
Instructions: RISC processors have a reduced number of instruction classes, which provide simple operations that can each execute in a single cycle. In contrast, in CISC processors the instructions are often of variable size and take many cycles to execute.
Pipelines: The processing of instructions is broken down into smaller units that can be executed in parallel by pipelines. There is no need for an instruction to be executed by a mini-program called microcode, as on CISC processors.
Registers: RISC machines have a large general-purpose register set. Any register can contain either data or an address. Registers act as the fast local memory store for all data processing operations. In contrast, CISC processors have dedicated registers for specific purposes.
Load-store architecture: The processor operates on data held in registers. Separate load and store instructions transfer data between the register bank and external memory.
Memory accesses are costly, so separating memory accesses from data processing provides an advantage because you can use data items held in the register bank multiple times without needing multiple memory accesses.
In contrast, with a CISC design the data processing operations can act on memory directly.
These design rules allow a RISC processor to be simpler, and thus the core can operate at higher clock frequencies.
In contrast, traditional CISC processors are more complex and operate at lower clock frequencies.

ARM Design Philosophy
Portable embedded systems require some form of battery power. The ARM processor has been specifically designed to be small to reduce power consumption and extend battery operation, which is essential for applications such as PDAs.
Embedded systems have limited memory due to cost and/or physical size restrictions, so high code density is a useful feature of ARM for applications that have limited on-board memory. The ability to use low-cost memory devices produces substantial savings.
For a single-chip solution, the smaller the area used by the embedded processor (reduced die size), the more space is available for specialized peripherals. This in turn reduces the cost of design and manufacturing, since fewer discrete chips are required for the end product.
ARM has incorporated hardware debug technology within the processor so that software engineers can view what is happening while the processor is executing code.
With greater visibility, software engineers can resolve issues faster, which has a direct effect on the time to market and reduces overall development costs.

Instruction set FEATURES
Variable cycle execution for certain instructions: Load-store-multiple instructions vary in the number of execution cycles depending upon the number of registers being transferred. The transfer can occur on sequential memory addresses, which increases performance since sequential memory accesses are often faster than random accesses. Code density is also improved since multiple register transfers are common operations at the start and end of functions.
Inline barrel shifter leading to more complex instructions: The inline barrel shifter is a hardware component that preprocesses one of the input registers before it is used by an instruction. This expands the capability of many instructions to improve core performance and code density.
Thumb 16-bit instruction set: The 16-bit instructions improve code density by about 30% over 32-bit fixed-length instructions.
Conditional execution: An instruction is only executed when a specific condition has been satisfied. This feature improves performance and code density by reducing branch instructions.
Enhanced instructions: The enhanced digital signal processor (DSP) instructions were added to the standard ARM instruction set to support fast 16×16-bit multiplier operations and saturation. These instructions allow a faster-performing ARM processor, in some cases, to replace the traditional combination of a processor plus a DSP.

Nomenclature
The instruction set architecture (ISA) is upward compatible.
ARM{x}{y}{z}{T}{D}{M}{I}{E}{J}{F}{-S}
  x: family
  y: memory management/protection unit
  z: cache
  T: Thumb 16-bit decoder
  D: JTAG debug
  M: fast multiplier
  I: EmbeddedICE macrocell
  E: enhanced instructions (assumes TDMI)
  J: Jazelle
  F: vector floating-point unit
  S: synthesizable version
All ARM cores after the ARM7TDMI include the TDMI features even though they may not include those letters after the ARM label.
The processor family is a group of processor implementations that share the same hardware characteristics.
For example, the ARM7TDMI, ARM740T, and ARM720T all share the same family characteristics and belong to the ARM7 family.
JTAG is described by IEEE 1149.1 Standard Test Access Port and boundary scan architecture. It is a serial protocol used by ARM to send and receive debug information between the processor core and test equipment.
Embedded ICE macro-cell is the debug hardware built into the processor that allows breakpoints and watch-points to be set. Synthesizable means that the processor core is supplied as source code that can be compiled into a form easily used by EDA tools.

Syllabus
ARM processor fundamentals: introduction to the ARM and Thumb instruction sets; processor and memory organization; CPU bus configuration; ARM bus; memory devices; input/output devices; component interfacing; designing with microprocessors; development and debugging; design example.
Instruction set with enhanced DSP features on the ARM core; mixed-mode programming (Thumb + ARM); assembly programming concepts; comparison of ARM7, ARM9 and ARM11 with their new feature additions.

PIN DIAGRAM

Architecture block diagram

Hardware Fundamentals
The ARM processor can be abstracted into eight components: ALU, barrel shifter, MAC, register file, instruction decoder, address register, incrementer, and sign extend.

Data Sizes and Instruction Sets
The ARM is a 32-bit architecture.
When used in relation to the ARM:
  Byte means 8 bits
  Halfword means 16 bits (two bytes)
  Word means 32 bits (four bytes)
Most ARMs implement two instruction sets:
  32-bit ARM Instruction Set
  16-bit Thumb Instruction Set
Jazelle cores can also execute Java bytecode

Processor Modes
The ARM has seven basic operating modes:
User : unprivileged mode under which most tasks run
FIQ : entered when a high priority (fast) interrupt is raised
IRQ : entered when a low priority (normal) interrupt is raised
Supervisor : entered on reset and when a Software Interrupt instruction is executed
Abort : used to handle memory access violations
Undef : used to handle undefined instructions
System : privileged mode using the same registers as user mode

The ARM Register Set
[Figure: The ARM Register Set, showing the current visible registers and the banked-out registers for User, FIQ, IRQ, SVC, Undef and Abort modes]

Register Organization Summary
  Mode    Registers shared with User mode               Banked registers
  User    r0-r12, r13 (sp), r14 (lr), r15 (pc), cpsr    (none)
  FIQ     r0-r7, r15, cpsr                              r8-r12, r13 (sp), r14 (lr), spsr
  IRQ     r0-r12, r15, cpsr                             r13 (sp), r14 (lr), spsr
  SVC     r0-r12, r15, cpsr                             r13 (sp), r14 (lr), spsr
  Undef   r0-r12, r15, cpsr                             r13 (sp), r14 (lr), spsr
  Abort   r0-r12, r15, cpsr                             r13 (sp), r14 (lr), spsr
Thumb state low registers: r0-r7. Thumb state high registers: r8-r15.
Note: System mode uses the User mode register set

The Registers
ARM has 37 registers, all of which are 32 bits long.
  1 dedicated program counter
  1 dedicated current program status register
  5 dedicated saved program status registers
  30 general purpose registers
The current processor mode governs which of several banks is accessible. Each mode can access:
  a particular set of r0-r12 registers
  a particular r13 (the stack pointer, sp) and r14 (the link register, lr)
  the program counter, r15 (pc)
  the current program status register, cpsr
Privileged modes (except System) can also access a particular spsr (saved program status register)

Program Status Registers
Condition code flags
  N = Negative result from ALU
  Z = Zero result from ALU
  C = ALU operation Carried out
  V = ALU operation oVerflowed
Sticky Overflow flag - Q flag
  Architecture 5TE/J only
  Indicates if saturation has occurred
J bit
  Architecture 5TEJ only
  J = 1: Processor in Jazelle state
Interrupt Disable bits
  I = 1: Disables the IRQ.
  F = 1: Disables the FIQ.
T Bit
  Architecture xT only
  T = 0: Processor in ARM state
  T = 1: Processor in Thumb state
Mode bits
  Specify the processor mode

Bit layout of the program status registers:
  Bit(s):  31  30  29  28  27  26-25      24  23-8       7  6  5  4-0
  Field:   N   Z   C   V   Q   Undefined  J   Undefined  I  F  T  mode
  Byte fields: f = bits [31:24], s = bits [23:16], x = bits [15:8], c = bits [7:0]

Program Counter (r15)
When the processor is executing in ARM state:
  All instructions are 32 bits wide
  All instructions must be word aligned
  Therefore the pc value is stored in bits [31:2] with bits [1:0] undefined (as an instruction cannot be halfword or byte aligned).
When the processor is executing in Thumb state:
  All instructions are 16 bits wide
  All instructions must be halfword aligned
  Therefore the pc value is stored in bits [31:1] with bit [0] undefined (as an instruction cannot be byte aligned).
When the processor is executing in Jazelle state:
  All instructions are 8 bits wide
  Processor performs a word access to read 4 instructions at once

Exception Handling
When an exception occurs, the ARM:
  Copies CPSR into SPSR_<mode>
  Sets appropriate CPSR bits:
    Change to ARM state
    Change to exception mode
    Disable interrupts (if appropriate)
  Stores the return address in LR_<mode>
  Sets PC to vector address
To return, the exception handler needs to:
  Restore CPSR from SPSR_<mode>
  Restore PC from LR_<mode>
This can only be done in ARM state.

Vector Table
  0x1C  FIQ
  0x18  IRQ
  0x14  (Reserved)
  0x10  Data Abort
  0x0C  Prefetch Abort
  0x08  Software Interrupt
  0x04  Undefined Instruction
  0x00  Reset
The vector table can be at 0xFFFF0000 on ARM720T and on ARM9/10 family devices.

Instruction set
ARM has three instruction sets: ARM, Thumb, and Jazelle.
The register file contains 37 registers, but only 17 or 18 registers are accessible at any point in time; the rest are banked according to processor mode.
The current processor mode is stored in the CPSR. It holds the current status of the processor core as well as interrupt masks, condition flags, and state. The state determines which instruction set is being executed.

Conditional Execution and Flags
ARM instructions can be made to execute conditionally by postfixing them with the appropriate condition code field.
  This improves code density and performance by reducing the number of forward branch instructions.

    CMP r3,#0              CMP r3,#0
    BEQ skip               ADDNE r0,r1,r2
    ADD r0,r1,r2
  skip

By default, data processing instructions do not affect the condition code flags, but the flags can be optionally set by using "S". CMP does not need "S".

  loop
    SUBS r1,r1,#1          ; decrement r1 and set flags
    BNE loop               ; if Z flag clear then branch

Condition Codes
The possible condition codes are listed below. Note that AL is the default and does not need to be specified.
  Suffix   Description               Flags tested
  EQ       Equal                     Z=1
  NE       Not equal                 Z=0
  CS/HS    Unsigned higher or same   C=1
  CC/LO    Unsigned lower            C=0
  MI       Minus                     N=1
  PL       Positive or Zero          N=0
  VS       Overflow                  V=1
  VC       No overflow               V=0
  HI       Unsigned higher           C=1 & Z=0
  LS       Unsigned lower or same    C=0 or Z=1
  GE       Greater or equal          N=V
  LT       Less than                 N!=V
  GT       Greater than              Z=0 & N=V
  LE       Less than or equal        Z=1 or N!=V
  AL       Always

Examples of conditional execution
Use a sequence of several conditional instructions
  if (a==0) func(1);
    CMP r0,#0
    MOVEQ r0,#1
    BLEQ func
Set the flags, then use various condition codes
  if (a==0) x=0;
  if (a>0) x=1;
    CMP r0,#0
    MOVEQ r1,#0
    MOVGT r1,#1
Use conditional compare instructions
  if (a==4 || a==10) x=0;
    CMP r0,#4
    CMPNE r0,#10
    MOVEQ r1,#0

Branch : B{<cond>} label
Branch with Link : BL{<cond>} subroutine_label
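
The condition suffixes used in the examples above, and in B{<cond>} and BL{<cond>}, can be modelled as predicates over the N, Z, C and V flags. The following Python sketch is only an illustration: the names `CONDITIONS`, `cmp_flags` and `passes` are hypothetical, and `cmp_flags` assumes CMP behaves as a plain 32-bit subtraction that sets the flags.

```python
MASK32 = 0xFFFFFFFF

# Each condition suffix as a predicate over the flags (n, z, c, v).
CONDITIONS = {
    "EQ": lambda n, z, c, v: z,
    "NE": lambda n, z, c, v: not z,
    "CS": lambda n, z, c, v: c,              # also written HS
    "CC": lambda n, z, c, v: not c,          # also written LO
    "MI": lambda n, z, c, v: n,
    "PL": lambda n, z, c, v: not n,
    "VS": lambda n, z, c, v: v,
    "VC": lambda n, z, c, v: not v,
    "HI": lambda n, z, c, v: c and not z,
    "LS": lambda n, z, c, v: (not c) or z,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: (not z) and n == v,
    "LE": lambda n, z, c, v: z or n != v,
    "AL": lambda n, z, c, v: True,
}


def cmp_flags(rn, op2):
    """Flags (n, z, c, v) produced by CMP rn, op2, i.e. the 32-bit result of rn - op2."""
    rn &= MASK32
    op2 &= MASK32
    result = (rn - op2) & MASK32
    n = bool(result >> 31)                         # sign bit of the result
    z = result == 0
    c = rn >= op2                                  # carry set means "no borrow"
    v = bool(((rn ^ op2) & (rn ^ result)) >> 31)   # signed overflow
    return n, z, c, v


def passes(cond, flags):
    """True if an instruction carrying this condition suffix would execute."""
    return bool(CONDITIONS[cond](*flags))
```

For instance, with a == 0 in r0, `passes("EQ", cmp_flags(0, 0))` is true and `passes("GT", cmp_flags(0, 0))` is false, so only the MOVEQ in the second example would execute.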
The processor core shifts the offset field left by 2 positions, sign-extends it and adds it to the PC
  ± 32 Mbyte range
  How to perform longer branches?
Branch instructions (encoding):
  Bits [31:28]  Cond (condition field)
  Bits [27:25]  1 0 1
  Bit  [24]     L (link bit): 0 = Branch, 1 = Branch with link
  Bits [23:0]   Offset

Data processing Instructions
Consist of:
  Arithmetic: ADD ADC SUB SBC RSB RSC
  Logical: AND ORR EOR BIC
  Comparisons: CMP CMN TST TEQ
  Data movement: MOV MVN
These instructions only work on registers, NOT memory.
Syntax:
  <Operation>{<cond>}{S} Rd, Rn, Operand2
  Comparisons set flags only - they do not specify Rd
  Data movement does not specify Rn
Second operand is sent to the ALU via the barrel shifter.

The Barrel Shifter
  LSL : Logical Left Shift       CF <- Destination <- 0      Multiplication by a power of 2
  LSR : Logical Shift Right      0 -> Destination -> CF      Division by a power of 2
  ASR : Arithmetic Right Shift   Destination -> CF           Division by a power of 2, preserving the sign bit
  ROR : Rotate Right             Destination -> CF           Bit rotate with wrap around from LSB to MSB
  RRX : Rotate Right Extended    CF -> Destination -> CF     Single bit rotate with wrap around from CF to MSB

Register, optionally with shift operation
  Shift value can be either:
    a 5 bit unsigned integer
    specified in the bottom byte of another register
  Used for multiplication by a constant
Immediate value
  8 bit number, with a range of 0-255
    Rotated right through an even number of positions
  Allows an increased range of 32-bit constants to be loaded directly into registers
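
The shift and rotate operations listed above can be sketched in Python on 32-bit values. This is an illustrative model, not a description of the hardware: the function names are hypothetical, inputs are assumed to be unsigned 32-bit integers, and shift amounts are assumed to lie in the range 1 to 31. Each function returns the result together with the carry-out bit (CF).

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left: zeros enter at the LSB, last bit out of the MSB goes to CF."""
    carry = (value >> (32 - amount)) & 1
    return (value << amount) & MASK32, carry


def lsr(value, amount):
    """Logical shift right: zeros enter at the MSB, last bit out of the LSB goes to CF."""
    carry = (value >> (amount - 1)) & 1
    return value >> amount, carry


def asr(value, amount):
    """Arithmetic shift right: the sign bit is copied into the vacated bits."""
    carry = (value >> (amount - 1)) & 1
    result = value >> amount
    if value & 0x80000000:
        result |= (MASK32 << (32 - amount)) & MASK32
    return result, carry


def ror(value, amount):
    """Rotate right: bits leaving the LSB wrap around to the MSB."""
    result = ((value >> amount) | (value << (32 - amount))) & MASK32
    return result, result >> 31


def rrx(value, carry_in):
    """Rotate right extended: one-bit rotate through CF (CF -> MSB, LSB -> CF)."""
    return (carry_in << 31) | (value >> 1), value & 1


def add_with_shifted_operand(rn, rm, amount):
    """Models ADD Rd, Rn, Rm, LSL #amount (flags not modelled)."""
    shifted, _ = lsl(rm, amount)
    return (rn + shifted) & MASK32
```

As a usage note, `add_with_shifted_operand(7, 7, 2)` gives 35, which is the "multiplication by a constant" idiom: ADD r0, r1, r1, LSL #2 computes r1 * 5 in a single instruction.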
[Figure: Using the Barrel Shifter: The Second Operand. Operand 1 goes straight to the ALU; Operand 2 passes through the barrel shifter first; the ALU produces the Result.]

Immediate constants (1)
No ARM instruction can contain a 32 bit immediate constant
  All ARM instructions are fixed as 32 bits long
The data processing instruction format has 12 bits available for operand2:
  Bits [11:8]  rot      4 bit rotate value (0-15), multiplied by two to give a range of 0-30 in steps of 2
  Bits [7:0]   immed_8  8 bit immediate, passed through the shifter as immed_8 ROR (rot x 2)
Rule to remember is "8-bits shifted by an even number of bit positions".
Quick Quiz: 0xe3a004ff is MOV r0, #???

Examples:
The assembler converts immediate values to the rotate form:
  MOV r0,#4096          ; uses 0x40 ror 26
  ADD r1,r2,#0xFF0000   ; uses 0xFF ror 16
The bitwise complements can also be formed using MVN:
  MOV r0, #0xFFFFFFFF   ; assembles to MVN r0,#0
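
The rotate form can be checked with a short Python sketch. The function names below are hypothetical, and the encoder simply returns the first valid (immed_8, rotation) pair it finds, which may differ from the pair a real assembler chooses: 4096, for example, can be written as 0x40 ror 26 (as in the examples above) or as 0x01 ror 20.

```python
MASK32 = 0xFFFFFFFF


def ror32(value, amount):
    """Rotate a 32-bit value right by amount bits (amount taken modulo 32)."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def decode_operand2_immediate(operand2):
    """Expand the 12-bit field: bits [11:8] = rot, bits [7:0] = immed_8."""
    rot = (operand2 >> 8) & 0xF
    immed_8 = operand2 & 0xFF
    return ror32(immed_8, 2 * rot)


def encode_immediate(value):
    """Return (immed_8, rotation) with value == immed_8 ROR rotation, or None."""
    value &= MASK32
    for rot in range(16):
        # Rotating left by 2*rot undoes a rotate right by 2*rot.
        immed_8 = ror32(value, 32 - 2 * rot)
        if immed_8 <= 0xFF:
            return immed_8, 2 * rot
    return None
```

Applied to the quiz word, `decode_operand2_immediate(0xe3a004ff & 0xFFF)` yields 0xFF000000 (rot = 4, so 0xFF is rotated right by 8). `encode_immediate(0x55555555)` returns None, which corresponds to the case where the assembler reports an error for MOV.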
Values that cannot be generated in this way (as a rotated 8-bit immediate or its bitwise complement) will cause an error.
[Figure: Immediate constants (2). An 8-bit field positioned within the 32-bit word (bits 31 down to 0):
  ror #0   range 0-0x000000ff   step 0x00000001
  ror #8   range 0-0xff000000   step 0x01000000
  ror #30  range 0-0x000003fc   step 0x00000004]

Loading 32 bit constants
To allow larger constants to be loaded, the assembler offers a pseudo-instruction:
  LDR rd, =const
This will either:
  Produce a MOV or MVN instruction to generate the value (if possible),
or
  Generate a LDR instruction with a PC-relative address to read the constant from a literal pool (a constant data area embedded in the code).
For example:
  LDR r0,=0xFF         =>  MOV r0,#0xFF
  LDR r0,=0x55555555   =>  LDR r0,[PC,#Imm12]
                           DCD 0x55555555
This is the recommended way of loading constants into a register.

Multiply
Syntax:
  MUL{<cond>}{S} Rd, Rm, Rs                    Rd = Rm * Rs
  MLA{<cond>}{S} Rd, Rm, Rs, Rn                Rd = (Rm * Rs) + Rn
  [U|S]MULL{<cond>}{S} RdLo, RdHi, Rm, Rs      RdHi,RdLo := Rm*Rs
  [U|S]MLAL{<cond>}{S} RdLo, RdHi, Rm, Rs      RdHi,RdLo := (Rm*Rs)+RdHi,RdLo
Cycle time
  Basic MUL instruction
    2-5 cycles on ARM7TDMI
    1-3 cycles on StrongARM/XScale
    2 cycles on ARM9E/ARM102xE
  +1 cycle for ARM9TDMI (over ARM7TDMI)
  +1 cycle for accumulate (not on 9E, though result delay is one cycle longer)
  +1 cycle for "long"
Above are general rules - refer to the TRM for the core you are using for the exact details

Single register data transfer
  LDR    STR    Word
  LDRB   STRB   Byte
  LDRH   STRH   Halfword
  LDRSB         Signed byte load
  LDRSH         Signed halfword load
  e.g. LDREQB

Address accessed
Address accessed by LDR/STR is specified by a base register plus an offset
For word and unsigned byte accesses, offset can be:
  An unsigned 12-bit immediate value (i.e. 0 - 4095 bytes).
    LDR r0,[r1,#8]
  A register, optionally shifted by an immediate value
    LDR r0,[r1,r2]
    LDR r0,[r1,r2,LSL#2]
This can be either added or subtracted from the base register:
    LDR r0,[r1,#-8]
    LDR r0,[r1,-r2]
    LDR r0,[r1,-r2,LSL#2]
For halfword and signed halfword / byte, offset can be:
  An unsigned 8 bit immediate value (i.e. 0-255 bytes).
  A register (unshifted).
Choice of pre-indexed or post-indexed addressing

Pre or Post Indexed Addressing?
Pre-indexed: STR r0,[r1,#12]
  With r0 = 0x5 (source register for STR) and r1 = 0x200 (base register), the offset 12 gives address 0x20c, where 0x5 is stored. r1 is unchanged.
  Auto-update form: STR r0,[r1,#12]!
Post-indexed: STR r0,[r1],#12
  With r0 = 0x5 and r1 = 0x200 (original base register), 0x5 is stored at 0x200; r1 is then updated by the offset 12 to 0x20c (updated base register).

LDM / STM operation
Syntax:
  <LDM|STM>{<cond>}<addressing_mode> Rb{!}, <register list>
4 addressing modes:
  LDMIA / STMIA  increment after
  LDMIB / STMIB  increment before
  LDMDA / STMDA  decrement after
  LDMDB / STMDB  decrement before
[Figure: memory layout, in order of increasing address, of r0, r1 and r4 relative to the base register (Rb) r10 for each of IA, IB, DA and DB, using LDMxx r10, {r0,r1,r4} / STMxx r10, {r0,r1,r4}]

Software Interrupt (SWI)
Causes an exception trap to the SWI hardware vector
The SWI handler can examine the SWI number to decide what operation has been requested.
By using the SWI mechanism, an operating system can implement a set of privileged operations which applications running in user mode can request.
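
How a handler might recover the SWI number can be sketched in Python. This is a minimal illustration, assuming the handler has already fetched the 32-bit ARM-state instruction word that caused the trap; the function name and the service table are hypothetical and not part of any real operating system.

```python
def swi_number(instruction):
    """Return the 24-bit SWI number held in bits [23:0] of an ARM SWI instruction."""
    if (instruction >> 24) & 0xF != 0xF:
        raise ValueError("bits [27:24] are not 1111: not an SWI instruction")
    return instruction & 0x00FFFFFF


# Hypothetical table mapping SWI numbers to privileged services.
SERVICES = {0x11: "exit", 0x12: "write_char"}


def dispatch(instruction):
    """Look up the service requested by an SWI instruction word."""
    return SERVICES.get(swi_number(instruction), "unknown")
```

With the condition field set to AL (1110), the word 0xEF000011 carries SWI number 0x11, so `dispatch(0xEF000011)` selects the hypothetical "exit" service.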
Syntax:
  SWI{<cond>} <SWI number>
Encoding:
  Bits [31:28]  Cond (condition field)
  Bits [27:24]  1 1 1 1
  Bits [23:0]   SWI number (ignored by processor)

PSR Transfer Instructions
MRS and MSR allow contents of CPSR / SPSR to be transferred to / from a general purpose register.
Syntax:
  MRS{<cond>} Rd,<psr>               ; Rd = <psr>
  MSR{<cond>} <psr[_fields]>,Rm      ; <psr[_fields]> = Rm
where
  <psr> = CPSR or SPSR
  [_fields] = any combination of 'fsxc'
Also an immediate form
  MSR{<cond>} <psr_fields>,#Immediate
In User Mode, all bits can be read but only the condition flags (_f) can be written.
PSR bit layout:
  Bit(s):  31  30  29  28  27  26-25      24  23-8       7  6  5  4-0
  Field:   N   Z   C   V   Q   Undefined  J   Undefined  I  F  T  mode
  Byte fields: f = bits [31:24], s = bits [23:16], x = bits [15:8], c = bits [7:0]

ARM Branches and Subroutines
B <label>
  PC relative. ±32 Mbyte range.
BL <subroutine>
  Stores return address in LR
  Returning implemented by restoring the PC from LR
  For non-leaf functions, LR will have to be stacked

                   func1                        func2
    :              STMFD sp!,{regs,lr}          :
    :              :                            :
    BL func1       BL func2                     :
    :              :                            :
    :              LDMFD sp!,{regs,pc}          MOV pc, lr

Thumb
Thumb is a 16-bit instruction set
  Optimised for code density from C code (~65% of ARM code size)
  Improved performance from narrow memory
  Subset of the functionality of the ARM instruction set
Core has additional execution state - Thumb
  Switch between ARM and Thumb using BX instruction
[Figure: a 32-bit ARM instruction (bits 31-0), ADDS r2,r2,#1, and the equivalent 16-bit Thumb instruction (bits 15-0), ADD r2,#1]
For most instructions generated by compiler:
  Conditional execution is not used
  Source and destination registers identical
  Only Low registers used
  Constants are of limited size
  Inline barrel shifter not used

Agenda
Introduction
Programmers Model
Instruction Sets
System Design
Development Tools

Example ARM-based System
[Figure: an ARM core connected to 16 bit RAM, 8 bit ROM, 32 bit RAM, I/O peripherals, and an interrupt controller driving nFIQ and nIRQ]

ARM Based microcontroller

A Basic ARM MEMORY SYSTEM

AMBA
[Figure: AMBA-based system. Components shown: ARM, on-chip RAM, TIC, arbiter, decoder, bus interface / external bus interface, external ROM, external RAM, timer, interrupt controller, remap/pause, reset, and a bridge between the system bus (AHB or ASB) and the peripheral bus (APB)]
AMBA: Advanced Microcontroller Bus Architecture
ADK: Complete AMBA Design Kit
ACT: AMBA Compliance Testbench
PrimeCell: ARM's AMBA compliant peripherals

System Design-Hardware
An embedded system includes the following hardware components:
ARM processors are found embedded in chips. Programmers access peripherals through memory-mapped registers.
There is a special type of peripheral called a controller, which embedded systems use to configure higher-level functions such as memory and interrupts.
The AMBA on-chip bus is used to connect the processor and peripherals together.

System Design-Software
An embedded system also includes the following software components:
Initialization code configures the hardware to a known state. Once configured, operating systems can be loaded and executed.
Operating systems provide a common programming environment for the use of hardware resources and infrastructure.
Device drivers provide a standard interface to peripherals.
An application program performs the task-specific duties of an embedded system.

Agenda
Introduction
Programmers Model
Instruction Sets
System Design
Development Tools

The RealView Product Families
Debug Tools
  AXD (part of ADS)
  Trace Debug Tools
  Multi-ICE
  Multi-Trace
Platforms
  ARMulator (part of ADS)
  Integrator Family
Compilation Tools
  ARM Developer Suite (ADS): Compilers (C/C++ ARM & Thumb), Linker & Utilities

ARM Debug Architecture
[Figure: a debugger (+ optional trace tools) connected over Ethernet to the JTAG interface and trace port analyzer, which attach to the JTAG port (TAP controller) and trace port of an ARM core containing EmbeddedICE logic and an ETM]
EmbeddedICE Logic
  Provides breakpoints and processor/system access
JTAG interface (ICE)
  Converts debugger commands to JTAG signals
Embedded Trace Macrocell (ETM)
  Compresses real-time instruction and data access trace
  Contains ICE features (trigger & filter logic)
Trace port analyzer (TPA)
  Captures trace in a deep buffer

Thumb instruction set
The Thumb instruction set encodes a subset of the 32-bit ARM instructions into a 16-bit instruction set space, so it has higher code density: about 30% less memory.
Since Thumb has higher performance than ARM on a processor with a 16-bit data bus, but lower performance than ARM on a 32-bit data bus, use Thumb for memory-constrained systems.

Thumb Instruction set

Thumb register usage

Code Density

Thumb instruction decoding

Thumb Instructions limitations
Only the branch relative instruction can be conditionally executed.
The limited space available in 16 bits causes the barrel shift operations ASR, LSL, LSR, and ROR to be separate instructions in the Thumb ISA.
There is no direct access to the CPSR or SPSR, so there are no MSR- and MRS-equivalent Thumb instructions. To alter the CPSR or SPSR, you must switch into ARM state to use MSR and MRS.
Similarly, there are no coprocessor instructions in Thumb state. You need to be in ARM state to access the coprocessor for configuring cache and memory management.

ARM Thumb interworking
Interworking is the method of linking ARM and Thumb code together for both assembly and C/C++.
It handles the transition between the two states. Extra code, called a veneer, is sometimes needed to carry out the transition.
ATPCS defines the ARM and Thumb procedure call standards. To call a Thumb routine from an ARM routine, the core has to change state, which is recorded in the T bit of the CPSR.
The BX and BLX branch instructions cause a switch between ARM and Thumb state while branching to a routine.

MIX Mode Programming: branch instructions
There are two versions of the BX or BLX instructions: an ARM instruction and a Thumb equivalent.
The ARM BX instruction enters Thumb state only if bit 0 of the address in Rm is set to binary 1; otherwise it enters ARM state. The Thumb BX instruction does the same.
Syntax:
  BX Rm
  BLX Rm | label
Unlike the ARM version, the Thumb BX instruction cannot be conditionally executed. The conditional branch instruction is the only conditionally executed instruction in Thumb state.
  B   branch
  BL  branch with link; lr = (instruction address after the BL) + 1

MIX Mode Programming
The Thumb data processing instructions are a subset of the ARM data processing instructions. Most Thumb data processing instructions operate on low registers and update the cpsr. The exceptions are
  MOV Rd,Rn
  ADD Rd,Rm
  CMP Rn,Rm
  ADD sp, #immediate
  SUB sp, #immediate
  ADD Rd,sp,#immediate
  ADD Rd,pc,#immediate
which can operate on the higher registers r8-r14 and the pc. These instructions, except for CMP, do not update the condition flags in the cpsr when using the higher registers. The CMP instruction, however, always updates the cpsr.

Single Register Load Store Instructions
The Thumb instruction set supports loading and storing registers with LDR and STR. These instructions use two pre-indexed addressing modes: offset by register and offset by immediate.
  Load/store register       [Rn, Rm]
  Base register + offset    [Rn, #immediate]
  Relative                  [pc|sp, #immediate]
The offset by register uses a base register Rn plus the register offset Rm. The second uses the same base register Rn plus a 5-bit immediate or a value dependent on the data size. The 5-bit offset encoded in the instruction is multiplied by one for byte accesses, two for 16-bit accesses, and four for 32-bit accesses.
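
The scaling of the 5-bit offset can be made concrete with a short Python sketch. The function and table names are hypothetical; the scale factors are the ones stated above.

```python
# Scale factor applied to the 5-bit immediate, by access size.
SCALE = {"byte": 1, "halfword": 2, "word": 4}


def thumb_immediate_offset(imm5, size):
    """Byte offset encoded by a 5-bit immediate for the given access size."""
    if not 0 <= imm5 <= 31:
        raise ValueError("the immediate must fit in 5 bits (0-31)")
    return imm5 * SCALE[size]


def thumb_effective_address(rn, imm5, size):
    """Address accessed by a [Rn, #immediate] load or store."""
    return (rn + thumb_immediate_offset(imm5, size)) & 0xFFFFFFFF
```

The largest reachable offsets are therefore 31 bytes for byte accesses, 62 for halfword accesses and 124 for word accesses; for example, `thumb_effective_address(0x200, 3, "word")` is 0x20c.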

Multiple Register Load-Store Instructions
The Thumb versions of the load-store multiple instructions are reduced forms of the ARM load-store multiple instructions. They only support the increment after (IA) addressing mode.
Syntax: <LDM|STM>IA Rn!, {low register list}
  LDMIA  load multiple registers   {Rd}*N <- mem32[Rn + 4*N], Rn = Rn + 4*N
  STMIA  save multiple registers   {Rd}*N -> mem32[Rn + 4*N], Rn = Rn + 4*N
Here N is the number of registers in the list of registers. These instructions always update the base register Rn after execution. The base register and list of registers are limited to the low registers r0 to r7.

MIX Mode Programming: stack instructions
The Thumb stack operations are different from the equivalent ARM instructions because they use the more traditional POP and PUSH concept.
Syntax:
  POP {low_register_list{, pc}}
  PUSH {low_register_list{, lr}}
  POP   pop registers from the stack     Rd*N <- mem32[sp + 4*N], sp = sp + 4*N
  PUSH  push registers on to the stack   Rd*N -> mem32[sp + 4*N], sp = sp - 4*N
There is no stack pointer in the instruction because the stack pointer is fixed as register r13 in Thumb operations, and sp is automatically updated. The list of registers is limited to the low registers r0 to r7.
The PUSH register list can also include the link register lr; similarly, the POP register list can include the pc. This provides support for subroutine entry and exit. The stack instructions only support full descending stack operations.
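
The full descending behaviour of PUSH and POP can be sketched with a small Python model. The class and its dictionary-based memory are hypothetical simplifications; the sketch assumes that the lowest-numbered register is stored at the lowest address and that sp always points at the last word pushed.

```python
PUSH_ORDER = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "lr"]
POP_ORDER = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "pc"]


class ThumbStack:
    """Toy model of a full descending stack addressed through sp (r13)."""

    def __init__(self, sp):
        self.sp = sp
        self.memory = {}  # word address -> 32-bit value

    def push(self, regs, names):
        """PUSH {names}: decrement sp by 4*N, then store the N registers."""
        ordered = [r for r in PUSH_ORDER if r in names]
        if set(ordered) != set(names):
            raise ValueError("PUSH accepts only r0-r7 and lr")
        self.sp -= 4 * len(ordered)
        for i, name in enumerate(ordered):
            self.memory[self.sp + 4 * i] = regs[name]

    def pop(self, regs, names):
        """POP {names}: load the N registers, then increment sp by 4*N."""
        ordered = [r for r in POP_ORDER if r in names]
        if set(ordered) != set(names):
            raise ValueError("POP accepts only r0-r7 and pc")
        for i, name in enumerate(ordered):
            regs[name] = self.memory[self.sp + 4 * i]
        self.sp += 4 * len(ordered)
```

A subroutine that starts with PUSH {r4, lr} and ends with POP {r4, pc} restores r4 and returns in one instruction, because the saved lr value is loaded straight into the pc.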
MIX Mode Programming: software interrupt
Similar to the ARM equivalent, the Thumb software interrupt (SWI) instruction causes a software interrupt exception. If any interrupt or exception flag is raised in Thumb state, the processor automatically reverts to ARM state to handle the exception.
Syntax: SWI immediate
The Thumb SWI instruction has the same effect and nearly the same syntax as the ARM equivalent. It differs in that the SWI number is limited to the range 0 to 255 and it is not conditionally executed.
ARM7 Family
One significant variation in the ARM7 family is the ARM7TDMI-S, which is synthesizable. The ARM720T includes an MMU, making it capable of handling the Linux and Microsoft embedded platform operating systems. The processor also includes a unified 8K cache. The vector table can be relocated to a higher address by setting a coprocessor 15 register. Another variation is the ARM7EJ-S processor, also synthesizable, which provides both Java acceleration and the enhanced instructions but has no memory protection. The ARM7EJ-S is quite different since it includes a five-stage pipeline and executes ARMv5TEJ instructions.
ARM PROCESSOR VARIANTS
ARM7, ARM9, ARM11
The attributes of ARM7, ARM9, ARM10 and ARM11 cores depend directly on the type and geometry of the manufacturing process, which has a direct effect on the frequency (MHz) and power consumption (watts). An ARM processor is an implementation of a specific instruction set architecture (ISA). The ISA has been continuously improved since the first ARM processor design. Processors are grouped into implementation families (ARM7, ARM9, ARM10, and ARM11) with similar characteristics.
ARM9 Family
The ARM9 family was announced in 1997. With its five-stage pipeline, the ARM9 can run at higher clock frequencies than the ARM7.
The extra stages improve the overall performance of the processor. The memory system has been redesigned to follow the Harvard architecture, which separates the data (D) and instruction (I) buses. The first processor in the ARM9 family was the ARM920T, which includes a separate D + I cache and an MMU for virtual memory support. The ARM922T is a variation on the ARM920T but with half the D + I cache size.
The ARM940T includes a smaller D + I cache and an MPU, and is designed for applications that do not require a platform operating system. Both the ARM920T and the ARM940T execute the architecture v4T instructions. The next processors are based on the ARM9E-S core, a synthesizable version of the ARM9 core with the E extensions. There are two variations: the ARM946E-S and the ARM966E-S. Both execute architecture v5TE instructions. They also support the optional embedded trace macrocell (ETM), which allows a developer to trace instruction and data execution in real time on the processor. This is important when debugging applications with time-critical segments.
The ARM946E-S includes TCM, cache, and an MPU. The sizes of the TCM and caches are configurable. This processor is designed for use in embedded control applications that require deterministic real-time response. In contrast, the ARM966E-S does not have the MPU and cache extensions but does have configurable TCMs. The latest core in the ARM9 product line is the ARM926EJ-S synthesizable processor core, announced in 2000. It was the first ARM processor core to include the Jazelle technology, which accelerates Java byte-code execution. It is designed for use in small portable Java-enabled devices such as 3G phones and personal digital assistants (PDAs). It features an MMU, configurable TCMs, and D + I caches with zero or nonzero wait state memories.
ARM10 Family
The ARM10, announced in 1999, was designed for performance.
It extends the ARM9 pipeline to six stages. It also supports an optional vector floating-point (VFP) unit, which adds a seventh stage to the ARM10 pipeline. The VFP significantly increases
floating-point performance and is compliant with the IEEE 754-1985 floating-point standard.
The ARM1020E is the first processor to use an ARM10E core. Like the ARM9E, it includes the enhanced E instructions. It has separate 32K D + I caches, an optional vector floating-point unit, and an MMU.
The ARM1020E also has a dual 64-bit bus interface for increased performance. ARM1026EJ-S is very similar to the ARM926EJ-S but with both MPU and MMU. This processor has the performance of the ARM10 with the flexibility of an ARM926EJ-S.
ARM11 Family
The ARM1136J-S, announced in 2003, was designed for high-performance and power-efficient applications. It was the first processor implementation to execute architecture ARMv6 instructions. It incorporates an eight-stage pipeline with separate load-store and arithmetic pipelines. Included in the ARMv6 instructions are single instruction multiple data (SIMD) extensions for media processing, specifically designed to increase video processing performance. The ARM1136JF-S is an ARM1136J-S with the addition of the vector floating-point unit for fast floating-point operations.
Enhanced DSP features
Processing digitized signals requires high memory bandwidths and fast multiply accumulate operations. A single-core design can reduce cost and power consumption over a two-core solution. DSP applications are typically multiply and load-store intensive. A basic operation is a multiply accumulate: multiplying two 16-bit signed numbers and accumulating onto a 32-bit signed accumulator. The ARMv5TE extensions available in the ARM9E and later cores provide efficient multiply accumulate operations. With careful coding, the ARM9E processor will perform decently on the DSP parts of an application while outperforming a DSP on the control parts of the application.
Generations suitable for DSP applications
DSP Algorithms Characteristics
Due to their high data bandwidth and performance requirements, we have to code DSP algorithms in hand-written assembly.
We need fine control of register allocation and instruction scheduling to achieve the best performance.
Filtering is probably the most commonly used signal processing operation. It can be used to remove noise, to analyze signals, or in signal compression.
Another very common algorithm is the Discrete Fourier Transform (DFT), which converts a signal from a time representation to a frequency representation or vice versa.
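As an illustration of filtering, a direct-form FIR filter can be sketched in C as a sliding dot product of coefficients against recent samples. This sketch is not taken from the slides: the function name fir_filter and its parameters are hypothetical, it assumes 1 <= m <= n, and it assumes the 32-bit accumulator does not overflow for the data used.

```c
#include <stdint.h>
#include <stddef.h>

/* Computes y[t] = sum over i of c[i] * x[t - i] for t = m-1 .. n-1.
   x has n samples, c has m coefficients, y receives n - m + 1 outputs.
   Assumes 1 <= m <= n and that the sums fit in 32 bits. */
void fir_filter(const int16_t *x, size_t n,
                const int16_t *c, size_t m,
                int32_t *y)
{
    for (size_t t = m - 1; t < n; t++) {
        int32_t acc = 0;
        for (size_t i = 0; i < m; i++)
            acc += (int32_t)c[i] * x[t - i];  /* 16x16 multiply accumulate */
        y[t - (m - 1)] = acc;
    }
}
```

With x = {1, 2, 3, 4} and c = {1, 1}, each output is the sum of two neighbouring samples.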
How to represent a signal on the ARM
Use a floating-point representation for prototyping algorithms. Do not use floating point in applications where speed is critical. Most ARM implementations do not include hardware floating-point support.
Use a fixed-point representation for DSP applications where speed is critical with moderate dynamic range. The ARM cores provide good support for 8-, 16- and 32-bit fixed-point DSP.
For applications requiring speed and high dynamic range, use a block-floating or logarithmic representation.
The key idea is to use block algorithms that calculate several results at once, and thus require less memory bandwidth, increase performance and decrease power consumption compared with calculating single results.
The figure shows a sine wave signal digitized at the sampling points 0, 1, 2, 3, and so on.
Dynamic Range & Accuracy
There are two things to worry about when choosing a representation of x[t]:
1. The dynamic range of the signal: the maximum fluctuation in the signal, defined by Equation A. For a signed signal we are interested in the maximum absolute value M possible. For this example, let's take M = 1 volt.
M = max |x[t]| over all t = 0, 1, 2, 3, ...   (A)
2. The accuracy required in the representation, sometimes given as a proportion of the maximum range. For example, an accuracy of 100 parts per million means that each x[t] needs to be represented within an error of
E = M * 0.0001 = 0.0001 volts
Suitable Representation
We could use a floating-point representation for x[t]. (1) This would certainly meet our dynamic range and accuracy constraints, and (2) it would also be easy to manipulate using the C type float. However, most ARM cores do not support floating point in hardware, and so a floating-point representation would be very slow.
A better choice for fast code is a fixed-point representation. A fixed-point representation uses an integer to represent a fractional value by scaling the fraction.
Error Vs Accuracy
A common error is to think that floating point is more accurate than fixed point. This is false! For the same number of bits, a fixed-point representation gives greater accuracy. The floating-point representation gives higher dynamic range at the expense of lower absolute accuracy. For example, if you use a 32-bit integer to hold a fixed-point value scaled to full range, then the maximum error in a representation is 2^-32. However, single-precision 32-bit floating-point values give a relative error of 2^-24. The single-precision floating-point mantissa is 24 bits. The leading 1 of the mantissa is not stored, so 23 bits of storage are actually used. For values near the maximum, the fixed-point representation is 2^(32-24) = 256 times more accurate! The 8-bit floating-point exponent is of little use when you are interested in maximum error rather than relative accuracy.
Better representation
To summarize, a fixed-point representation is best when there is a clear bound to the strength of the signal and when maximum error is important.
When there is no clear bound and you require a large dynamic range, then floating point is better.
You can also use the other representations (block-floating or logarithmic), which give more dynamic range than fixed point while still being more efficient to implement than floating point.
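A fixed-point representation for the example signal (M = 1 volt, E = 0.0001 volts) can be sketched in C. The choice of a 16-bit format with 15 fractional bits is an assumption made here for illustration, and the names Q15_ONE, to_q15 and from_q15 are hypothetical. Rounding to nearest gives an error of at most 2^-16 volts (about 0.000015) away from the clamped top of the range, which is within E.

```c
#include <stdint.h>

#define Q15_ONE 32768.0  /* scale factor 2^15 (assumed format for this sketch) */

/* Convert a voltage in [-1, 1) to a 16-bit fixed-point integer,
   rounding to nearest and clamping at the ends of the range. */
int16_t to_q15(double v)
{
    double scaled = v * Q15_ONE;
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;   /* round half away from zero */
    if (scaled > 32767.0)  scaled = 32767.0;
    if (scaled < -32768.0) scaled = -32768.0;
    return (int16_t)scaled;                   /* truncation completes the rounding */
}

/* Convert the fixed-point integer back to volts. */
double from_q15(int16_t q)
{
    return q / Q15_ONE;
}
```

The conversion functions use double only at the boundary of the system (for example when preparing test data); the DSP code itself would operate on the int16_t values.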
General rules on writing DSP algorithms for the ARM
ARM does not provide operations that saturate automatically. Design the DSP algorithm so that saturation is not required, because saturation will cost extra cycles. ARM supports extended-precision 32-bit multiplied by 32-bit to 64-bit operations very well. Use extended-precision arithmetic or additional scaling rather than saturation.
The ARM core is not a dedicated DSP. There is no single instruction that issues a multiply accumulate and data fetch in parallel. However, by reusing loaded data you can achieve a respectable DSP performance.
Design the DSP algorithm to minimize loads and stores. Once you load a data item, perform as many operations that use the datum as possible. You can often do this by calculating several output results at once. Another way of increasing reuse is to concatenate several operations. For example, you could perform a dot product and signal scale at the same time, while only loading the data once.
Guidelines for writing DSP code
From ARM9 onwards, ARM implementations use a multistage execute pipeline for loads and multiplies, which introduces potential processor interlocks. If you load a value and then use it in either of the following two instructions, the processor may stall for a number of cycles waiting for the loaded value to arrive. Similarly, if you use the result of a multiply in the following instruction, this may cause stall cycles. It is particularly important to schedule code to avoid these stalls.
Write ARM assembly to avoid processor interlocks. The results of load and multiply instructions are often not available to the next instruction without adding stall cycles. Sometimes the results will not be available for several cycles.
There are 14 registers available for general use on the ARM, r0 to r12 and r14. Design the DSP algorithm so that the inner loop will require 14 registers or fewer.
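Two of the rules above, using extended-precision arithmetic instead of saturation and reusing each loaded datum, can be sketched together in C. This is a hedged illustration rather than tuned code: the function name dot_and_scale is hypothetical, the scale is assumed to be a division by a power of two with shift below 31, and the dot product is taken over the unscaled samples.

```c
#include <stdint.h>
#include <stddef.h>

/* Returns the sum of x[i] * c[i] over the original samples and scales
   x[] in place by 1 / 2^shift, loading each sample only once.
   Assumes shift < 31; the division truncates toward zero. */
int64_t dot_and_scale(int32_t *x, const int32_t *c, size_t n, unsigned shift)
{
    int64_t acc = 0;
    const int32_t divisor = (int32_t)1 << shift;
    for (size_t i = 0; i < n; i++) {
        int32_t xi = x[i];             /* single load of the sample */
        acc += (int64_t)xi * c[i];     /* 32 x 32 -> 64-bit multiply accumulate */
        x[i] = xi / divisor;           /* signal scale reuses the loaded value */
    }
    return acc;
}
```

The 64-bit accumulator leaves headroom for many full-range 32-bit products, so no saturation step is needed inside the loop.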
An example: a DOT Product
A dot-product is one of the simplest DSP operations and highlights the differences among ARM implementations. A dot-product combines N samples from two signals x(t) and c(t) to produce a correlation value a:
a = sum over i = 0 to N-1 of c[i] * x[i]
The C interface to the dot-product function is
int dot_product(sample *x, coefficient *c, unsigned int N);
where
sample is the type to hold a 16-bit audio sample, usually a short
coefficient is the type to hold a 16-bit coefficient, usually a short
x[i] and c[i] are two arrays of length N (the data and coefficients)
the function returns the accumulated 32-bit integer dot product a
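A plain C version of this interface might look like the sketch below. It is a straightforward reference loop, not the optimized routine: it assumes that short is 16 bits and int is 32 bits, as on typical ARM toolchains, and that the accumulated sum fits in 32 bits.

```c
typedef short sample;       /* 16-bit audio sample (assumes 16-bit short) */
typedef short coefficient;  /* 16-bit coefficient (assumes 16-bit short) */

/* Reference dot product: a = sum of c[i] * x[i] for i = 0 .. N-1. */
int dot_product(sample *x, coefficient *c, unsigned int N)
{
    int acc = 0;
    for (unsigned int i = 0; i < N; i++)
        acc += x[i] * c[i];   /* operands are promoted to int before multiplying */
    return acc;
}
```

A reference loop like this is useful for checking the results of hand-written assembly versions against known inputs.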
DSP rating of ARM7TDMI
This example shows a 16-bit dot-product optimized for the ARM7TDMI. Each MLA takes a worst case of four cycles. We store the 16-bit input samples in 32-bit words so that we can use the LDM instruction to load them efficiently. This code assumes that the number of samples N is a multiple of five, so we can use a five-word load multiple to increase data bandwidth. The cost per load is 7/5 = 1.4 cycles, compared to 3 cycles per load if we had used LDR or LDRSH. The inner loop requires a worst case of 7 + 7 + 5 * 4 + 1 + 3 = 38 cycles to process each block of 5 products from the sum. This gives the ARM7TDMI a DSP rating of 38/5 = 7.6 cycles per tap for a 16-bit dot-product.
Assembly code for DOT Product
x       RN 0    ; input array x[]
c       RN 1    ; input array c[]
N       RN 2    ; number of samples (a multiple of 5)
acc     RN 3    ; accumulator
x_0     RN 4    ; elements from array x[]
x_1     RN 5
x_2     RN 6
x_3     RN 7
x_4     RN 8
c_0     RN 9    ; elements from array c[]
c_1     RN 10
c_2     RN 11
c_3     RN 12
c_4     RN 14

        ; int dot_16by16_arm7m(int *x, int *c, unsigned N)
dot_16by16_arm7m
        STMFD   sp!, {r4-r11, lr}       ; save callee-saved registers and lr
        MOV     acc, #0
loop_7m
        ; accumulate 5 products
        LDMIA   x!, {x_0, x_1, x_2, x_3, x_4}
        LDMIA   c!, {c_0, c_1, c_2, c_3, c_4}
        MLA     acc, x_0, c_0, acc
        MLA     acc, x_1, c_1, acc
        MLA     acc, x_2, c_2, acc
        MLA     acc, x_3, c_3, acc
        MLA     acc, x_4, c_4, acc
        SUBS    N, N, #5
        BGT     loop_7m
        MOV     r0, acc                 ; return the accumulated sum
        LDMFD   sp!, {r4-r11, pc}       ; restore registers and return
DSP Rating: 16-bit DOT Product
ARM7TDMI: the inner loop requires 38 cycles to process 5 taps, a DSP rating of 38/5 = 7.6 cycles per tap.
ARM9TDMI: the inner loop requires 28 cycles to process 4 taps, giving 28/4 = 7 cycles per tap.
StrongARM: the inner loop uses 19 cycles to process 4 taps, giving a rating of 19/4 = 4.75 cycles per tap.
ARM9E: the inner loop requires 20 cycles to accumulate 8 products, a rating of 20/8 = 2.5 cycles per tap.
ARM10E: the inner loop requires 25 cycles to process 10 samples, or 2.5 cycles per tap.
Intel XScale: the inner loop requires 14 cycles to accumulate 8 products, a rating of 1.75 cycles per tap.
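The ratings above are simply inner-loop cycles divided by taps per iteration. The arithmetic can be reproduced with a short C sketch; the cycle and tap counts are the ones quoted in the list, while the names loop_cost, dot16_costs, cycles_per_tap and print_ratings are hypothetical.

```c
#include <stdio.h>

/* Inner-loop cost of the 16-bit dot product on one core. */
typedef struct {
    const char *core;
    int cycles;  /* worst-case cycles per inner-loop iteration */
    int taps;    /* products accumulated per iteration */
} loop_cost;

/* Figures taken from the list above. */
static const loop_cost dot16_costs[] = {
    {"ARM7TDMI",     38,  5},
    {"ARM9TDMI",     28,  4},
    {"StrongARM",    19,  4},
    {"ARM9E",        20,  8},
    {"ARM10E",       25, 10},
    {"Intel XScale", 14,  8},
};

/* DSP rating: cycles per tap for one core. */
double cycles_per_tap(loop_cost lc)
{
    return (double)lc.cycles / lc.taps;
}

/* Prints one rating per line, to two decimal places. */
void print_ratings(void)
{
    size_t count = sizeof dot16_costs / sizeof dot16_costs[0];
    for (size_t i = 0; i < count; i++)
        printf("%-13s %.2f cycles per tap\n",
               dot16_costs[i].core, cycles_per_tap(dot16_costs[i]));
}
```

Calling print_ratings() lists the six cores in the same order as above, which makes the drop from the ARM7TDMI to the ARM9E and XScale easy to compare.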
Performance improvement in DSP
The block filter algorithm gives a much better performance per tap if you are calculating multiple products.
THANK YOU