
The ARM architecture and Fundamentals Chapter-1

11/13/2013

Amit Kulkarni

ARM Ltd
Founded in November 1990
Spun out of Acorn Computers

Designs the ARM range of RISC processor cores
Licenses ARM core designs to semiconductor partners who fabricate and sell to their customers.
ARM does not fabricate silicon itself

Also develops technologies to assist with the design-in of the ARM architecture
Software tools, boards, debug hardware, application software, bus architectures, peripherals, etc.

ARM Powered Products


ARM core
A key component of many successful 32-bit embedded systems. The most successful core is the ARM7TDMI. Over 32 billion ARM processors have been shipped worldwide (2011 figure). ARM is not a single core but a whole family of designs sharing similar design principles and a common instruction set.


What is a Microprocessor?
The word comes from the combination of micro and processor. A processor is a device that processes something; in this context, it is a device that processes numbers, specifically binary numbers (0s and 1s).
To process means to manipulate. It is a general term that describes all manipulation. In this context, it means to perform certain operations on the numbers, and those operations depend on the microprocessor's design.

Differences between:
Microcomputer: a computer with a microprocessor as its CPU. Includes memory, I/O, etc.
Microprocessor: a silicon chip that includes the ALU, register circuits, and control circuits.
Microcontroller: a silicon chip that includes a microprocessor, memory, and I/O in a single package.

What about micro?


Micro is a new addition. In the late 1960s, processors were built using discrete elements. These devices performed the required operation, but were too large and too slow.

In the early 1970s, integrated-circuit technology made it possible to place all of the components that made up the processor on a single piece of silicon. The size became several thousand times smaller and the speed became several hundred times faster. The microprocessor was born.

Definition of the Microprocessor


The microprocessor is a programmable device that takes in numbers, performs arithmetic or logical operations on them according to the program stored in memory, and then produces other numbers as a result. The key terms in this definition are elaborated on the following slides.

Definition (Contd.)
Programmable device: The microprocessor can perform different sets of operations on the data it receives depending on the sequence of instructions supplied in the given program. By changing the program, the microprocessor manipulates the data in different ways.
Instructions: Each microprocessor is designed to execute a specific group of operations. This group of operations is called an instruction set. This instruction set defines what the microprocessor can and cannot do.

Definition (Contd.)
Takes in: The data that the microprocessor manipulates must come from somewhere. It comes from what are called input devices. These are devices that bring data into the system from the outside world, such as a keyboard, a mouse, switches, and the like.

Definition (Contd.)
Numbers: The microprocessor has a very narrow view on life. It only understands binary numbers. A binary digit is called a bit (which comes from binary digit). The microprocessor recognizes and processes a group of bits together. This group of bits is called a word. The number of bits in a microprocessor's word is a measure of its abilities.

Definition (Contd.)
Words, Bytes, etc.
The earliest microprocessors (such as the Intel 8080 and Motorola's 6800) recognized 8-bit words. They processed information 8 bits at a time. That's why they are called 8-bit processors. They could handle large numbers, but in order to process these numbers, they broke them into 8-bit pieces and processed each group of 8 bits separately.

Later microprocessors (8086 and 68000) were designed with 16-bit words. A group of 8 bits was referred to as a half-word or byte. A group of 4 bits is called a nibble. Also, 32-bit groups were given the name long word.

Today, many processors manipulate at least 32 bits at a time, and there exist microprocessors that can process 64, 80, or 128 bits.

Definition (Contd.)
Arithmetic and Logic Operations:
Every microprocessor has arithmetic operations such as add and subtract as part of its instruction set. Most microprocessors will have operations such as multiply and divide. Some of the newer ones will have complex operations such as square root.

In addition, microprocessors have logic operations such as AND, OR, XOR, shift left, shift right, etc. Again, the number and types of operations define the microprocessor's instruction set and depend on the specific microprocessor.

Definition (Contd.)
Stored in memory :
First, what is memory? Memory is the location where information is kept while not in current use. Memory is a collection of storage devices. Usually, each storage device holds one bit. Also, in most kinds of memory, these storage devices are grouped into groups of 8. These 8 storage locations can only be accessed together. So, one can only read or write in terms of bytes to and from memory. Memory is usually measured by the number of bytes it can hold. It is measured in Kilos, Megas and lately Gigas. A Kilo in computer language is 2^10 = 1024. So, a KB (KiloByte) is 1024 bytes. Mega is 1024 Kilos and Giga is 1024 Megas.

Definition (Contd.)
Stored in memory: When a program is entered into a computer, it is stored in memory. Then as the microprocessor starts to execute the instructions, it brings the instructions from memory one at a time. Memory is also used to hold the data. The microprocessor reads (brings in) the data from memory when it needs it and writes (stores) the results into memory when it is done.

Definition (Contd.)
Produces:
For the user to see the result of the execution of the program, the results must be presented in a human readable form. The results must be presented on an output device. This can be the monitor, a paper from the printer, a simple LED or many other forms.

Computer architecture & Organization


Computer architecture describes the user's view of the computer. The instruction set, visible registers, memory management table structures and exception handling model are all part of the architecture.
Computer organization describes the user-invisible implementation of the architecture. The pipeline structure, transparent cache, table-walking hardware and translation look-aside buffer are all aspects of the organization.

Overview
Instruction set architecture is distinguished from the microarchitecture, which is the set of processor design techniques used to implement the instruction set. Computers with different microarchitectures can share a common instruction set. For example, the Intel Pentium and the AMD Athlon implement nearly identical versions of the x86 instruction set.
A complex instruction set computer (CISC) has many specialized instructions, some of which may only rarely be used in practical programs. A reduced instruction set computer (RISC) simplifies the processor by implementing only instructions that are frequently used in programs.

The stored-program digital computer


Hardware abstraction levels


1. Transistors;
2. Logic gates, memory cells, special circuits;
3. Single-bit adders, multiplexers, decoders, flip-flops;
4. Word-wide adders, multiplexers, decoders, registers, buses;
5. ALUs (Arithmetic-Logic Units), barrel shifters, register banks, memory blocks;
6. Processor, cache and memory management organizations;
7. Processors, peripheral cells, cache memories, memory management units;
8. Integrated system chips;
9. Printed circuit boards;
10. Mobile telephones, PCs, engine controllers.

MU - a simple processor
A simple form of processor can be built from a few basic components:
a program counter (PC) register that is used to hold the address of the current instruction;
a single register called an accumulator (ACC) that holds a data value while it is worked upon;
an arithmetic-logic unit (ALU) that can perform a number of operations on binary operands, such as add, subtract, increment, and so on;
an instruction register (IR) that holds the current instruction while it is executed;
instruction decode and control logic that employs the above components to achieve the desired results from each instruction.
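The components listed above can be sketched as a tiny simulator. This is only an illustration: the opcode numbers, the 16-bit "4-bit opcode plus 12-bit address" layout, and the names `encode` and `run` are assumptions, since the instruction-set table from the slides is not reproduced in this text.

```python
# Minimal accumulator-machine sketch (PC, ACC, IR, ALU operations).
# Opcode values and the instruction layout are hypothetical.
LDA, STO, ADD, SUB, JMP, STP = range(6)


def encode(opcode, address=0):
    """Pack an opcode and a 12-bit address into one 16-bit word."""
    return (opcode << 12) | (address & 0xFFF)


def run(memory, max_steps=1000):
    """Fetch, decode and execute until STP; return the accumulator."""
    pc, acc = 0, 0
    for _ in range(max_steps):
        ir = memory[pc]                      # fetch into the instruction register
        pc += 1
        opcode, addr = ir >> 12, ir & 0xFFF  # decode
        if opcode == LDA:
            acc = memory[addr]
        elif opcode == STO:
            memory[addr] = acc
        elif opcode == ADD:
            acc = (acc + memory[addr]) & 0xFFFF
        elif opcode == SUB:
            acc = (acc - memory[addr]) & 0xFFFF
        elif opcode == JMP:
            pc = addr
        elif opcode == STP:
            return acc
        else:
            raise ValueError(f"unknown opcode {opcode}")
    raise RuntimeError("step limit reached")


program = [
    encode(LDA, 5),   # acc = memory[5]
    encode(ADD, 6),   # acc = acc + memory[6]
    encode(STO, 7),   # memory[7] = acc
    encode(STP),
    0,                # unused word
    20, 22, 0,        # data at addresses 5, 6 and 7
]
result = run(program)
```

Program and data share one memory here, which matches the stored-program idea introduced earlier.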

The MU instruction set


MU logic design
To understand how this instruction set might be implemented we will go through the design process in a logical order. The approach taken here will be to separate the design into two components:
The datapath
The control logic


The datapath
All the components carrying, storing or processing many bits in parallel will be considered part of the datapath, including the accumulator, program counter, ALU and instruction register. For these components we will use a register transfer level (RTL) design style based on registers, multiplexers, and so on.


The control logic


Everything that does not fit comfortably into the datapath will be considered part of the control logic and will be designed using a finite state machine (FSM) approach.


Datapath design
Each instruction takes exactly the number of clock cycles defined by the number of memory accesses it must make.


Contd.
Datapath operation
Access the memory operand and perform the desired operation.
Fetch the next instruction to be executed.

Initialization
Register transfer level design
Control logic
ALU design


Instruction set design


4-address instruction format:

3-address instruction format:

2-address instruction format:

1-address instruction format:


Instruction types
Data processing instructions such as add, subtract and multiply.
Data movement instructions that copy data from one place in memory to another, or from memory to the processor's registers, and so on.
Control flow instructions that switch execution from one part of the program to another, possibly depending on data values.
Special instructions to control the processor's execution state, for instance to switch into a privileged mode to carry out an operating system function.

Addressing modes
1. Immediate addressing: the desired value is presented as a binary value in the instruction.
2. Absolute addressing: the instruction contains the full binary address of the desired value in memory.
3. Indirect addressing: the instruction contains the binary address of a memory location that contains the binary address of the desired value.
4. Register addressing: the desired value is in a register, and the instruction contains the register number.
5. Register indirect addressing: the instruction contains the number of a register which contains the address of the value in memory.
6. Base plus offset addressing: the instruction specifies a register (the base) and a binary offset to be added to the base to form the memory address.
7. Base plus index addressing: the instruction specifies a base register and another register (the index) which is added to the base to form the memory address.
8. Base plus scaled index addressing: as above, but the index is multiplied by a constant (usually the size of the data item, and usually a power of two) before being added to the base.
9. Stack addressing: an implicit or specified register (the stack pointer) points to an area of memory (the stack) where data items are written (pushed) or read (popped) on a last-in-first-out basis.
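As a rough illustration of the modes that compute a memory address, the sketch below returns the effective address for each. The function name, the mode strings, and the register and memory dictionaries are all hypothetical; immediate and register addressing are left out because they do not form a memory address.

```python
def effective_address(mode, regs, mem, *, address=None, reg=None,
                      base=None, index=None, offset=0, scale=1):
    """Return the memory address a mode would access (illustrative only)."""
    if mode == "absolute":
        return address                      # full address held in the instruction
    if mode == "indirect":
        return mem[address]                 # memory holds the address of the value
    if mode == "register_indirect":
        return regs[reg]                    # register holds the address
    if mode == "base_offset":
        return regs[base] + offset
    if mode == "base_index":
        return regs[base] + regs[index]
    if mode == "base_scaled_index":
        return regs[base] + regs[index] * scale
    raise ValueError(f"unsupported mode: {mode}")


regs = {"r1": 0x100, "r2": 3}
mem = {0x40: 0x200}

# Element 3 of an array of 4-byte words starting at the address in r1.
element_address = effective_address(
    "base_scaled_index", regs, mem, base="r1", index="r2", scale=4)
```

Scaling the index by the item size is what lets one register act as an array subscript rather than a byte offset.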


CISC
The principal trend in instruction set design was towards increasing complexity in an attempt to reduce the semantic gap that the compiler had to bridge. The origins of this trend were in the minicomputers developed during the 1970s. These computers had relatively slow main memories coupled to processors built using many simple integrated circuits. So it made sense to implement frequently used operations as microcode sequences rather than requiring several instructions to be fetched from main memory.

CISC contd
In particular, the microcode ROM which was needed for all the complex routines absorbed an unreasonable proportion of the area of a single chip, leaving little room for other performance-enhancing features.
[Figure: compiler, code generation and processor, annotated "greater complexity"]


RISC
Reducing the semantic gap between the processor instruction set and the high-level language is not the right way to make an efficient computer. What other options are open to the designer? What do processors actually do?


RISC contd
The ARM core uses a RISC architecture. RISC is a design philosophy aimed at delivering simple but powerful instructions that execute within a single cycle at a high clock speed. The RISC philosophy concentrates on reducing the complexity of instructions performed by the hardware, because it is easier to provide greater flexibility and intelligence in software than in hardware. As a result, a RISC design places greater demands on the compiler.

RISC contd

The RISC philosophy is implemented with four major design rules:


Instructions
Pipelines
Registers
Load-store architecture

RISC design rules


Instructions: RISC processors have a reduced number of instruction classes. These classes provide simple operations that can each execute in a single cycle. The compiler or programmer synthesizes complicated operations (for example, a divide operation) by combining several simple instructions. Each instruction is a fixed length to allow the pipeline to fetch future instructions before decoding the current instruction. In contrast, in CISC processors the instructions are often of variable size and take many cycles to execute.

Contd
Pipelines: The processing of instructions is broken down into smaller units that can be executed in parallel by pipelines. Ideally the pipeline advances by one step on each cycle for maximum throughput. Instructions can be decoded in one pipeline stage. There is no need for an instruction to be executed by a miniprogram called microcode as on CISC processors.


Contd
Registers: RISC machines have a large general-purpose register set. Any register can contain either data or an address. Registers act as the fast local memory store for all data processing operations. In contrast, CISC processors have dedicated registers for specific purposes.


Contd
Load-store architecture: The processor operates on data held in registers. Separate load and store instructions transfer data between the register bank and external memory. Memory accesses are costly, so separating memory accesses from data processing provides an advantage because you can use data items held in the register bank multiple times without needing multiple memory accesses. In contrast, with a CISC design the data processing operations can act on memory directly.

Instruction Set for Embedded Systems


Variable cycle execution for certain instructions: not every instruction executes in a single cycle. For example, a multiple-register transfer to sequential addresses increases performance, since sequential memory accesses are often faster than random accesses. Code density is also improved.

Inline barrel shifter leading to more complex instructions

Thumb 16-bit instruction set
Conditional execution
Enhanced instructions


Embedded System Hardware


ARM Bus Technology


Embedded systems use different bus technologies than those designed for x86 PCs. The most common PC bus technology, the Peripheral Component Interconnect (PCI) bus, connects such devices as video cards and hard disk controllers to the x86 processor bus. This type of technology is external or off-chip and is built into the motherboard of a PC. In contrast, embedded devices use an on-chip bus that is internal to the chip and that allows different peripheral devices to be interconnected with an ARM core.

ARM Bus Technology


There are two different classes of devices attached to the bus.
The ARM processor core is a bus master. Peripherals tend to be bus slaves.

A bus has two architecture levels.


The first is a physical level that covers the electrical characteristics and bus width (16, 32, or 64 bits).
The second level deals with protocol: the logical rules that govern the communication between the processor and a peripheral.

AMBA bus protocol.



Memory
Hierarchy
Width
Types


Peripherals
All ARM peripherals are memory mapped: the programming interface is a set of memory-addressed registers. The address of these registers is an offset from a specific peripheral base address.
Memory Controllers
Interrupt Controllers
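The "base address plus offset" scheme can be sketched as follows. The peripheral, its base address, and its register names are invented for illustration and do not describe any real device; `Bus` is a stand-in for the memory system.

```python
# Hypothetical peripheral: base address and register offsets are made up.
UART_BASE = 0x1000_0000
UART_REGS = {"DATA": 0x00, "STATUS": 0x04, "CONTROL": 0x08}


def register_address(base, offsets, name):
    """A register's address is its offset added to the peripheral base."""
    return base + offsets[name]


class Bus:
    """Toy memory system: word-aligned 32-bit reads and writes by address."""

    def __init__(self):
        self._cells = {}

    def write32(self, address, value):
        if address % 4:
            raise ValueError("unaligned word access")
        self._cells[address] = value & 0xFFFFFFFF

    def read32(self, address):
        if address % 4:
            raise ValueError("unaligned word access")
        return self._cells.get(address, 0)


bus = Bus()
control = register_address(UART_BASE, UART_REGS, "CONTROL")
bus.write32(control, 0x1)          # programming the peripheral is a store
enabled = bus.read32(control)      # reading its state is a load
```

Because the registers look like ordinary memory, the same load and store instructions used for data also drive the peripheral.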


Embedded System Software


Summary
RISC
A fixed (32-bit) instruction size with few formats.
A load-store architecture where instructions that process data operate only on registers and are separate from instructions that access memory.
A large register bank of thirty-two 32-bit registers, all of which could be used for any purpose, to allow the load-store architecture to operate efficiently.
Hard-wired instruction decode logic.
Pipelined execution.
Single-cycle execution.
A smaller die size, a shorter development time, and higher performance, but poor code density.

CISC
CISC processors used large microcode ROMs to decode their instructions.
CISC processors allowed little, if any, overlap between consecutive instructions (though they do now).
CISC processors typically took many clock cycles to complete a single instruction.
Greater code density.

Chapter-2


Data Sizes and Instruction Sets


The ARM is a 32-bit architecture. When used in relation to the ARM:
Byte means 8 bits
Halfword means 16 bits (two bytes)
Word means 32 bits (four bytes)

Most ARMs implement two instruction sets


32-bit ARM Instruction Set
16-bit Thumb Instruction Set

Jazelle cores can also execute Java bytecode



ARM core dataflow model


Registers
Current Visible Registers (Abort Mode)
r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 (sp) r14 (lr) r15 (pc) cpsr spsr

Banked out Registers
User: r13 (sp) r14 (lr)
FIQ: r8 r9 r10 r11 r12 r13 (sp) r14 (lr) spsr
IRQ: r13 (sp) r14 (lr) spsr
SVC: r13 (sp) r14 (lr) spsr
Undef: r13 (sp) r14 (lr) spsr


Current Program Status Register

Condition code flags
N = Negative result from ALU
Z = Zero result from ALU
C = ALU operation Carried out
V = ALU operation oVerflowed

Interrupt Disable bits
I = 1: Disables the IRQ.
F = 1: Disables the FIQ.

T bit
Architecture xT only
T = 0: Processor in ARM state
T = 1: Processor in Thumb state

Sticky Overflow flag (Q flag)
Architecture 5TE/J only
Indicates if saturation has occurred

J bit
Architecture 5TEJ only
J = 1: Processor in Jazelle state

Mode bits
Specify the processor mode

Processor Modes
User: unprivileged mode under which most tasks run
FIQ: entered when a high priority (fast) interrupt is raised
IRQ: entered when a low priority (normal) interrupt is raised
Supervisor: entered on reset and when a Software Interrupt instruction is executed
Abort: used to handle memory access violations
Undef: used to handle undefined instructions
System: privileged mode using the same registers as user mode
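A short sketch of how the status-register fields and mode bits could be decoded from a 32-bit cpsr value. The slides only state the positions of I and F (bits 7 and 6); the remaining bit positions and the 5-bit mode encodings used here follow the classic ARM cpsr layout and should be read as supplied from outside this text. The function name `decode_cpsr` is hypothetical.

```python
# Mode encodings for cpsr bits [4:0] in the classic ARM architecture.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undef",
    0b11111: "System",
}


def decode_cpsr(cpsr):
    """Split a 32-bit cpsr value into its flag, mask, state and mode fields."""
    def bit(n):
        return (cpsr >> n) & 1

    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28),
        "Q": bit(27),           # sticky overflow (5TE/J only)
        "J": bit(24),           # Jazelle state (5TEJ only)
        "I": bit(7),            # 1 = IRQ disabled
        "F": bit(6),            # 1 = FIQ disabled
        "T": bit(5),            # 1 = Thumb state
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }


fields = decode_cpsr(0x600000D3)
```

For the example value, Z and C are set, both interrupts are masked, and the mode bits select Supervisor mode in ARM state.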


Banked Registers


State and Instruction Sets


The state of the core determines which instruction set is being executed. There are three instruction sets: ARM, Thumb, and Jazelle.

You cannot intermingle sequential ARM, Thumb, and Jazelle instructions. The Jazelle J and Thumb T bits in the cpsr reflect the state of the processor. When both J and T bits are 0, the processor is in ARM state and executes ARM instructions.

The third instruction set


The ARM designers introduced a third instruction set called Jazelle. Jazelle executes 8-bit instructions. It is a hybrid mix of software and hardware designed to speed up the execution of Java bytecodes. To execute Java bytecodes, you require the Jazelle technology plus a specially modified version of the Java virtual machine. Jazelle DBX (Direct Bytecode eXecution) allows some ARM processors to execute Java bytecode in hardware.

Jazelle
Jazelle functionality was specified in the ARMv5TEJ architecture, and the first processor with Jazelle technology was the ARM926EJ-S. Jazelle RCT (Runtime Compilation Target) is a different technology; it is based on ThumbEE mode and supports ahead-of-time (AOT) and just-in-time (JIT) compilation with Java and other execution environments. The most prominent use of Jazelle DBX is by manufacturers of mobile phones to increase the execution speed of Java ME games and applications.

ARM and Thumb instruction set features


Interrupt Masks
Interrupt masks are used to stop specific interrupt requests from interrupting the processor. There are two interrupt request levels available on the ARM processor core: interrupt request (IRQ) and fast interrupt request (FIQ).

The cpsr has two interrupt mask bits, 7 and 6 (or I and F), which control the masking of IRQ and FIQ, respectively. The I bit masks IRQ when set to binary 1, and similarly the F bit masks FIQ when set to binary 1.

Condition Flags


Condition mnemonics


Pipeline
A pipeline is the mechanism a RISC processor uses to execute instructions. Using a pipeline speeds up execution by fetching the next instruction while other instructions are being decoded and executed.


Pipelined instruction sequence


Pipeline Executing Characteristics


The ARM pipeline has not processed an instruction until it passes completely through the execute stage.
The pc always points two instructions ahead of the instruction being executed (in ARM state, the address of that instruction plus 8 bytes).
The execution of a branch instruction or branching by the direct modification of the pc causes the ARM core to flush its pipeline.
ARM10 uses branch prediction, which reduces the effect of a pipeline flush by predicting possible branches and loading the new branch address prior to the execution of the instruction.
An instruction in the execute stage will complete even though an interrupt has been raised.
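The "two instructions ahead" rule falls out of a three-stage fetch, decode, execute pipeline, which the sketch below traces for straight-line code. It assumes 4-byte ARM-state instructions and no branches or stalls; the function names are hypothetical.

```python
def pipeline_trace(start, count, width=4):
    """Per cycle, the addresses in the (fetch, decode, execute) stages."""
    trace = []
    for cycle in range(count + 2):          # two extra cycles to drain
        fetch = start + cycle * width if cycle < count else None
        decode = start + (cycle - 1) * width if 1 <= cycle <= count else None
        execute = (start + (cycle - 2) * width
                   if 2 <= cycle <= count + 1 else None)
        trace.append((fetch, decode, execute))
    return trace


def pc_during_execute(instruction_address, width=4):
    """Value of pc seen by an instruction while it is in the execute stage."""
    return instruction_address + 2 * width


trace = pipeline_trace(0x8000, 3)
# In cycle 2 the first instruction executes while the third is fetched.
fetch_addr, decode_addr, execute_addr = trace[2]
```

The first instruction only reaches the execute stage in the third cycle, which is why the pipeline must be refilled after a branch flushes it.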

Exceptions, Interrupts, and the Vector Table


When an exception or interrupt occurs, the processor suspends normal execution and starts loading instructions from the exception vector table.
Reset vector is the location of the first instruction executed by the processor when power is applied. This instruction branches to the initialization code.
Undefined instruction vector is used when the processor cannot decode an instruction.
Software interrupt vector is called when you execute a SWI instruction. The SWI instruction is frequently used as the mechanism to invoke an operating system routine.
Prefetch abort vector occurs when the processor attempts to fetch an instruction from an address without the correct access permissions. The actual abort occurs in the decode stage.
Data abort vector is similar to a prefetch abort but is raised when an instruction attempts to access data memory without the correct access permissions.
Interrupt request vector is used by external hardware to interrupt the normal execution flow of the processor. It can only be raised if IRQs are not masked in the cpsr.

The vector table


Exception Handling
When an exception occurs, the ARM:
Copies CPSR into SPSR_<mode>
Sets appropriate CPSR bits
Change to ARM state
Change to exception mode
Disable interrupts (if appropriate)
Stores the return address in LR_<mode>
Sets PC to vector address

To return, exception handler needs to:
Restore CPSR from SPSR_<mode>
Restore PC from LR_<mode>

This can only be done in ARM state.

Vector table
0x1C FIQ
0x18 IRQ
0x14 (Reserved)
0x10 Data Abort
0x0C Prefetch Abort
0x08 Software Interrupt
0x04 Undefined Instruction
0x00 Reset

Vector table can be at 0xFFFF0000 on ARM720T and on ARM9/10 family devices
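The entry sequence can be modelled in a few lines. This is a simplified sketch: the cpsr bit positions and mode encodings come from the classic ARM layout rather than from these slides, the per-exception adjustment of the return address is left to the caller, reset is omitted because it has no previous state to save, and the names `enter_exception` and `high_vectors` are hypothetical.

```python
VECTORS = {"undef": 0x04, "swi": 0x08, "prefetch_abort": 0x0C,
           "data_abort": 0x10, "irq": 0x18, "fiq": 0x1C}
EXCEPTION_MODE = {"undef": 0b11011, "swi": 0b10011,
                  "prefetch_abort": 0b10111, "data_abort": 0b10111,
                  "irq": 0b10010, "fiq": 0b10001}
I_BIT, F_BIT, T_BIT = 1 << 7, 1 << 6, 1 << 5


def enter_exception(kind, cpsr, return_address, high_vectors=False):
    """Return the banked spsr/lr and the new cpsr/pc after exception entry."""
    new_cpsr = (cpsr & ~0x1F) | EXCEPTION_MODE[kind]   # change to exception mode
    new_cpsr &= ~T_BIT                                 # change to ARM state
    new_cpsr |= I_BIT                                  # IRQ is always disabled
    if kind == "fiq":
        new_cpsr |= F_BIT                              # FIQ disabled only on FIQ entry
    base = 0xFFFF0000 if high_vectors else 0x00000000
    return {
        "spsr": cpsr,                 # copy of the interrupted cpsr
        "lr": return_address,         # return address for the handler
        "cpsr": new_cpsr & 0xFFFFFFFF,
        "pc": base + VECTORS[kind],
    }


state = enter_exception("irq", 0x60000010, 0x8004)
```

Returning is the mirror image: the handler copies `spsr` back into the cpsr and `lr` back into the pc.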


Nomenclature


An Introduction Chapter-3


Data types
ARM processors support six data types:
8-bit signed and unsigned bytes.
16-bit signed and unsigned half-words; these are aligned on 2-byte boundaries.
32-bit signed and unsigned words; these are aligned on 4-byte boundaries.


Data Processing Instructions


Move Instructions

If you use the S suffix on a data processing instruction, then it updates the flags in the cpsr.

Barrel Shifter
Data processing instructions are processed within the arithmetic logic unit (ALU). A unique and powerful feature of the ARM processor is the ability to shift the 32-bit binary pattern in one of the source registers left or right by a specific number of positions before it enters the ALU. This shift increases the power and flexibility of many data processing operations. Pre-processing or shift occurs within the cycle time of the instruction. This is particularly useful for loading constants into a register and achieving fast multiplies or division by a power of 2.

Barrel Shifter


Barrel shifter operations

x represents the register being shifted and y represents the shift amount.


Barrel shift operation syntax for data processing instructions


Shift operations
LSL #n Logical shift left immediate


Shift operations
ASL #n Arithmetic shift left immediate: This is a synonym for LSL #n and has an identical effect.
LSR #n Logical shift right immediate: n is the number of bit positions by which the value is shifted. It has the value 1..32. An LSR by one bit is shown below:


Shift operations
ASR #n Arithmetic shift right immediate: n is the number of bit positions by which the value is shifted. It has the value 1..32. An ASR by one bit is shown below:


Shift operations
ROR #n Rotate right immediate: n is the number of bit positions to rotate in the range 1..31. A rotate right by one bit is shown below:


Shift operations
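RRX Rotate right one bit with extend: This special case of rotate right has a slightly different effect from the usual rotates. There is no count; it always rotates by one bit only, and the rotation goes through the carry flag: the old carry moves into bit 31 and the old bit 0 becomes the new carry.
[Figure: pictorial representation of RRX]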
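The shift operations described on these slides can be sketched on 32-bit values as follows. The Python function names simply mirror the mnemonics; the carry-out of LSL, LSR, ASR and ROR is not modelled, only that of RRX.

```python
MASK = 0xFFFFFFFF


def lsl(x, n):
    """Logical shift left: zeros enter from the right."""
    return (x << n) & MASK


def lsr(x, n):
    """Logical shift right: zeros enter from the left."""
    return (x & MASK) >> n


def asr(x, n):
    """Arithmetic shift right: the sign bit (bit 31) is replicated."""
    x &= MASK
    if x & 0x80000000:
        x -= 1 << 32            # reinterpret as a negative number
    return (x >> n) & MASK


def ror(x, n):
    """Rotate right: bits leaving on the right re-enter on the left."""
    x &= MASK
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK


def rrx(x, carry_in):
    """Rotate right by one through carry; returns (result, carry_out)."""
    x &= MASK
    return ((carry_in & 1) << 31) | (x >> 1), x & 1


times_sixteen = lsl(3, 4)       # shifting left by n multiplies by 2**n
```

Shifting left by n is the fast multiply by a power of two mentioned earlier; ASR gives the matching signed division that rounds towards minus infinity.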


Arithmetic Instructions


Logical Instructions


Comparison Instructions
The comparison instructions are used to compare or test a register with a 32-bit value. They update the cpsr flag bits according to the result, but do not affect other registers. There is no need to apply the S suffix to comparison instructions; they always update the flags.
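As an illustration of how a compare updates the flags, the sketch below models a subtraction whose result is discarded. The function name `cmp_flags` is hypothetical; the flag rules follow ARM subtraction semantics, where C is set when no borrow occurs.

```python
MASK = 0xFFFFFFFF


def cmp_flags(rn, operand):
    """Flags produced by comparing rn with operand (rn - operand, discarded)."""
    rn &= MASK
    operand &= MASK
    result = (rn - operand) & MASK
    return {
        "N": result >> 31,                 # bit 31 of the result
        "Z": int(result == 0),             # operands were equal
        "C": int(rn >= operand),           # set when there is no borrow
        # Signed overflow: operand signs differ and the result sign
        # differs from the sign of rn.
        "V": ((rn ^ operand) & (rn ^ result)) >> 31,
    }


equal = cmp_flags(5, 5)
smaller = cmp_flags(1, 2)
```

No destination register appears anywhere in the function, which is the point: only the flags change.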


Multiply Instructions
The multiply instructions multiply the contents of a pair of registers and, depending upon the instruction, accumulate the result with another register. The long multiplies accumulate onto a pair of registers representing a 64-bit value. The final result is placed in a destination register or a pair of registers.
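A minimal sketch of the arithmetic involved, using unsigned values only. The function names echo the ARM mnemonics MUL, MLA, UMULL and UMLAL, which are not listed in this text (the instruction table was a figure); the register-pair convention of returning (low word, high word) is chosen here for illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def mul(rm, rs):
    """32-bit multiply: only the low 32 bits of the product are kept."""
    return (rm * rs) & MASK32


def mla(rm, rs, rn):
    """Multiply and accumulate: (rm * rs) + rn, truncated to 32 bits."""
    return (rm * rs + rn) & MASK32


def umull(rm, rs):
    """Unsigned long multiply: 64-bit product as a (low, high) register pair."""
    product = (rm & MASK32) * (rs & MASK32)
    return product & MASK32, product >> 32


def umlal(rd_lo, rd_hi, rm, rs):
    """Unsigned long multiply-accumulate onto the 64-bit pair rd_hi:rd_lo."""
    total = (((rd_hi << 32) | rd_lo) + (rm & MASK32) * (rs & MASK32)) & MASK64
    return total & MASK32, total >> 32


low, high = umull(0xFFFFFFFF, 2)
```

The pair is needed because the product of two 32-bit values can occupy up to 64 bits.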


Multiply Instructions


Branch Instructions
A branch instruction changes the flow of execution or is used to call a routine. This type of instruction allows programs to have subroutines, if-then-else structures, and loops.


Condition codes
To make an instruction conditional, a two-letter suffix is added to the mnemonic.
AL Always: An instruction with this suffix is always executed. ADDAL and ADD mean the same thing: add unconditionally.
NV Never: An instruction with this suffix is never executed. Such instructions might be used for 'padding' or perhaps to use up a (very) small amount of time in a program.
EQ Equal: This condition is true if the result flag Z (zero) is set. This might arise after a compare instruction where the operands were equal, or in any data instruction which received a zero result into the destination.

Condition codes
NE Not equal: This is the opposite of EQ, and is true if the Z flag is cleared. If Z is set, an instruction with the NE condition will not be executed.
VS Overflow set: This condition is true if the result flag V (overflow) is set. Add, subtract and compare instructions affect the V flag.
VC Overflow clear: The opposite of VS.
MI Minus: Instructions with this condition only execute if the N (negative) flag is set.

Condition codes
PL Plus: This is the opposite of the MI condition; instructions with the PL condition will only execute if the N flag is cleared.
CS Carry set: This condition is true if the result flag C (carry) is set. The carry flag is affected by arithmetic instructions such as ADD, SUB and CMP.
CC Carry clear: This is the inverse condition to CS.
HI Higher: This condition is true if the C flag is set and the Z flag is cleared.
LS Lower or same: This condition is true if the C flag is cleared or the Z flag is set.

Condition codes
GE Greater than or equal: This is true if N is cleared and V is cleared, or N is set and V is set.
LT Less than: This is the opposite of GE; instructions with this condition are executed if N is set and V is cleared, or N is cleared and V is set.
GT Greater than: This is the same as GE, with the addition that the Z flag must be cleared too.
LE Less than or equal: This is the same as LT, and is also true whenever the Z flag is set.
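The condition definitions on these slides translate directly into predicates over the N, Z, C and V flags. The table and function names below are hypothetical; NV is left out because an NV instruction never executes.

```python
CONDITIONS = {
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "CS": lambda f: f["C"] == 1,
    "CC": lambda f: f["C"] == 0,
    "MI": lambda f: f["N"] == 1,
    "PL": lambda f: f["N"] == 0,
    "VS": lambda f: f["V"] == 1,
    "VC": lambda f: f["V"] == 0,
    "HI": lambda f: f["C"] == 1 and f["Z"] == 0,
    "LS": lambda f: f["C"] == 0 or f["Z"] == 1,
    "GE": lambda f: f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
    "GT": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "LE": lambda f: f["Z"] == 1 or f["N"] != f["V"],
    "AL": lambda f: True,
}


def condition_passed(cond, flags):
    """True if an instruction with this condition suffix would execute."""
    return CONDITIONS[cond](flags)


# Flags as left by comparing 1 with 2: negative result, borrow occurred.
flags = {"N": 1, "Z": 0, "C": 0, "V": 0}
taken = [name for name in CONDITIONS if condition_passed(name, flags)]
```

HI and LS use the carry flag and so suit unsigned comparisons, while GE, LT, GT and LE combine N and V and suit signed ones.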

Load-Store Instructions
Load-store instructions transfer data between memory and processor registers. There are three types of load-store instructions: single-register transfer, multiple-register transfer, and swap.


Single-Register Transfer
These instructions are used for moving a single data item in and out of a register. The data types supported are signed and unsigned words (32-bit), halfwords (16-bit), and bytes.


Single-Register Transfer


Single-Register Load-Store Addressing Modes


The ARM instruction set provides different modes for addressing memory. Each mode incorporates one of the indexing methods: preindex with writeback, preindex, or postindex.
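The three ARM indexing methods are preindex, preindex with writeback, and postindex. A small sketch of how each one forms the access address and updates the base register follows; the function name and method strings are hypothetical.

```python
def index_access(method, base, offset):
    """Return (address_used, base_after) for one single-register transfer."""
    if method == "preindex":
        return base + offset, base            # base register unchanged
    if method == "preindex_writeback":
        return base + offset, base + offset   # base updated before the access
    if method == "postindex":
        return base, base + offset            # base updated after the access
    raise ValueError(f"unknown indexing method: {method}")


# Walking an array of words with postindexing: each access uses the
# current base, then steps it on by four bytes.
base = 0x9000
visited = []
for _ in range(3):
    address, base = index_access("postindex", base, 4)
    visited.append(address)
```

Writeback and postindexing fold the pointer update into the transfer itself, which saves a separate add instruction in loops.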


Single-Register Load-Store Addressing Modes


Multiple-Register Transfer
Load-store multiple instructions can transfer multiple registers between memory and the processor in a single instruction. The transfer occurs from a base address register Rn pointing into memory. Multiple-register transfer instructions are more efficient than single-register transfers for moving blocks of data around memory and for saving and restoring context and stacks.


Examples of LDR instructions using different addressing modes


Variations of STRH instructions.


Multiple-Register Transfer
Load-store multiple instructions can increase interrupt latency. ARM implementations do not usually interrupt instructions while they are executing. For example, on an ARM7 a load multiple instruction takes 2 + Nt cycles, where N is the number of registers to load and t is the number of cycles required for each sequential access to memory.
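The cycle formula can be turned into a quick estimate of how long an interrupt might have to wait behind a load multiple. The function name and the example values of N and t are illustrative assumptions, not measurements.

```python
def ldm_cycles(register_count, sequential_access_cycles):
    """Cycles for an ARM7 load multiple: 2 + N * t."""
    return 2 + register_count * sequential_access_cycles


# Loading 8 registers from memory that needs 2 cycles per sequential access.
worst_case_wait = ldm_cycles(8, 2)

# The same transfer from single-cycle memory.
fast_memory_wait = ldm_cycles(8, 1)
```

Since the instruction is not interrupted part-way, an interrupt raised just after it starts waits for the whole transfer, so slow memory and long register lists both add to latency.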

Stack Operations
The ARM architecture uses the load-store multiple instructions to carry out stack operations. The pop operation (removing data from a stack) uses a load multiple instruction; similarly, the push operation (placing data onto the stack) uses a store multiple instruction. When using a stack you have to decide whether the stack will grow up or down in memory. A stack is either ascending (A) or descending (D). Ascending stacks grow towards higher memory addresses; in contrast, descending stacks grow towards lower memory addresses.

Stack Operations
When you use a full stack (F), the stack pointer sp points to an address that is the last used or full location (i.e., sp points to the last item on the stack). In contrast, if you use an empty stack (E) the sp points to an address that is the first unused or empty location (i.e., it points after the last item on the stack).

Stack Operations
When handling a checked stack there are three attributes that need to be preserved: the stack base, the stack pointer, and the stack limit.

The stack base is the starting address of the stack in memory. The stack pointer initially points to the stack base. If the stack pointer passes the stack limit, a stack overflow error has occurred; if it goes back past the stack base, a stack underflow error has occurred.
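A checked full descending stack can be sketched in C as follows. The CheckedStack type, the four-word size, and the function names are assumptions made for this illustration; the comments name the ARM instructions each operation stands in for.

```c
#include <stdint.h>
#include <stdbool.h>

#define STACK_WORDS 4   /* hypothetical stack size for the sketch */

typedef struct {
    uint32_t storage[STACK_WORDS];
    uint32_t *base;    /* starting address: one past the highest word */
    uint32_t *limit;   /* lowest address the stack may use */
    uint32_t *sp;      /* full stack: points at the last item pushed */
} CheckedStack;

void stack_init(CheckedStack *s)
{
    s->limit = s->storage;
    s->base = s->storage + STACK_WORDS;
    s->sp = s->base;               /* the stack pointer starts at the base */
}

/* Push, like STMFD sp!, {r}: decrement first, then store. */
bool stack_push(CheckedStack *s, uint32_t value)
{
    if (s->sp == s->limit)
        return false;              /* would pass the limit: overflow */
    *--s->sp = value;
    return true;
}

/* Pop, like LDMFD sp!, {r}: load first, then increment. */
bool stack_pop(CheckedStack *s, uint32_t *value)
{
    if (s->sp == s->base)
        return false;              /* would go back past the base: underflow */
    *value = *s->sp++;
    return true;
}
```

Because the stack is full and descending, the pointer is decremented before a store and incremented after a load, so sp always points at the last item pushed.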

Swap Instruction
The swap instruction is a special case of a load-store instruction. It swaps the contents of memory with the contents of a register. This instruction is an atomic operation: it reads and writes a location in the same bus operation, preventing any other instruction from reading or writing to that location until it completes.

Swap Instruction
The swap instruction loads a word from memory into register r0 and overwrites the memory with register r1.

This instruction is particularly useful when implementing semaphores and mutual exclusion in an operating system.

Swap Instruction
This example shows a simple data guard that can be used to protect data from being written by another task. The SWP instruction holds the bus until the transaction is complete.
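The same guard idea can be sketched in portable C. This is not the SWP instruction itself: it assumes C11 atomics, where atomic_exchange plays the role of the swap, and the Semaphore type and function names are invented for the illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = free, 1 = taken. */
typedef atomic_uint Semaphore;

/* Try once to take the lock. atomic_exchange stands in for SWP:
   it stores 1 and returns the previous value in one atomic step. */
bool semaphore_try_acquire(Semaphore *sem)
{
    return atomic_exchange(sem, 1u) == 0u;
}

/* Spin until the lock is obtained, like a loop around SWP. */
void semaphore_acquire(Semaphore *sem)
{
    while (!semaphore_try_acquire(sem)) {
        /* busy-wait: another task holds the lock */
    }
}

void semaphore_release(Semaphore *sem)
{
    atomic_store(sem, 0u);
}
```

A task that reads back 0 from the exchange knows it was the one that took the lock; any other task reads back 1 and keeps spinning.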

Software Interrupt Instruction


A software interrupt instruction (SWI) causes a software interrupt exception, which provides a mechanism for applications to call operating system routines.

When the processor executes an SWI instruction, it sets the program counter pc to the offset 0x8 in the vector table. The instruction also forces the processor mode to SVC, which allows an operating system routine to be called in a privileged mode.

Program Status Register Instructions


The ARM instruction set provides two instructions to directly control a program status register (psr). The MRS instruction transfers the contents of either the cpsr or spsr into a register; in the reverse direction, the MSR instruction transfers the contents of a register into the cpsr or spsr. Together these instructions are used to read and write the cpsr and spsr.

Program Status Register Instructions

In the syntax you can see a label called fields. This can be any combination of control (c), extension (x), status (s), and flags (f).
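Each field letter selects one byte of the status register: control is bits [7:0], extension bits [15:8], status bits [23:16], and flags bits [31:24]. A small C model of a masked write, with macro and function names that are assumptions of this sketch:

```c
#include <stdint.h>

/* Byte-wide field masks of a program status register. */
#define PSR_FIELD_C 0x000000FFu  /* control:   bits [7:0]   */
#define PSR_FIELD_X 0x0000FF00u  /* extension: bits [15:8]  */
#define PSR_FIELD_S 0x00FF0000u  /* status:    bits [23:16] */
#define PSR_FIELD_F 0xFF000000u  /* flags:     bits [31:24] */

/* Model of MSR psr_<fields>, Rm: only the selected fields are replaced,
   all other bits of the psr keep their old value. */
uint32_t msr_write(uint32_t psr, uint32_t rm, uint32_t fields)
{
    return (psr & ~fields) | (rm & fields);
}
```

For example, writing with only the control field selected changes the mode and interrupt mask bits but leaves the condition flags untouched.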

Coprocessor Instructions
Coprocessor instructions are used to extend the instruction set. A coprocessor can either provide additional computation capability or be used to control the memory subsystem including caches and memory management. The coprocessor instructions include data processing, register transfer, and memory transfer instructions.

Code Density
In early computers, memory was expensive, so minimizing the size of a program to make sure it would fit in the limited memory was often central. Thus the combined size of all the instructions needed to perform a particular task, the code density, was an important characteristic of any instruction set. Computers with high code density often have complex instructions for procedure entry, parameterized returns, loops, etc. (and were therefore retroactively named Complex Instruction Set Computers, CISC).

Reduced instruction-set computers, RISC


RISC designs were widely implemented during a period of rapidly growing memory subsystems. They sacrifice code density in order to simplify implementation circuitry and thereby try to increase performance via higher clock frequencies and more registers. RISC instructions typically perform only a single operation, such as an "add" of registers or a "load" from a memory location into a register. They also normally use a fixed instruction width.

Overview
The Thumb instruction set addresses the issue of code density. Thumb encodes a subset of the 32-bit ARM instructions into a 16-bit instruction set space. The Thumb programmer's model maps onto the ARM programmer's model. Implementations of Thumb use dynamic decompression in an ARM instruction pipeline, and the instructions then execute as standard ARM instructions within the processor. Thumb has higher code density than ARM, that is, an executable program takes up less space in memory. On average, a Thumb implementation of the same code takes up around 30% less memory than the equivalent ARM implementation.

Overview
Thumb is not a complete architecture. It is not anticipated that a processor would execute Thumb instructions without also supporting the ARM instruction set. A Thumb implementation may use more instructions, but the overall memory footprint is reduced.

An Introduction Chapter-3

Thumb Register Usage

1. No direct access to the cpsr or spsr.
2. To alter the cpsr or spsr, you must switch into ARM state to use MSR and MRS.
3. No coprocessor instructions in Thumb state.

Branch Instruction

Data Processing Instructions

The ARM based controllers

Code optimization
Optimizing code takes time and reduces source code readability. Optimize only functions that are frequently executed and important for performance. C compilers have to translate your C function literally into assembler so that it works for all possible inputs. In practice, many of the input combinations are not possible or won't occur.

Example
1. The compiler does not know whether N can be 0 on input or not.
2. The compiler doesn't know whether the data array pointer is four-byte aligned or not.
3. Nor does it know whether N is a multiple of four or not.
4. The compiler must be conservative and assume all possible values for N and all possible alignments for data.

Efficient C code
To write efficient C code, you must be aware of:
1. Areas where the C compiler has to be conservative.
2. The limits of the processor architecture the C compiler is mapping to.
3. The limits of a specific C compiler.

C compiler datatype mappings

Local Variable Types


ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data. Most ARM data processing operations are 32-bit only. For this reason, you should use a 32-bit datatype, int or long, for local variables wherever possible. Avoid using char and short as local variable types, even if you are manipulating an 8- or 16-bit value. If you require modulo arithmetic of the form 255 + 1 = 0, then use the char type.

Example

Function Argument Types


short add_v1(short a, short b)
1. The input values a, b, and the return value will be passed in 32-bit ARM registers.
2. Should the compiler assume that these 32-bit values are in the range of a short type, that is, -32,768 to +32,767?
3. Or should the compiler force values to be in this range by sign-extending the lowest 16 bits to fill the 32-bit register?
4. The compiler must make compatible decisions for the function caller and callee.
5. Either the caller or callee must perform the cast to a short type.

Wide and Narrow arguments


Function arguments are passed wide if they are not reduced to the range of the type and narrow if they are. char or short type function arguments and return values introduce extra casts. These increase code size and decrease performance. It is more efficient to use the int type for function arguments and return values, even if you are only passing an 8-bit value.
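The cost of a narrow type can be seen in a pair of functions like the following. The names add_narrow and add_wide are hypothetical, and the comment about the extra instruction describes what a compiler typically has to do rather than the output of any particular compiler.

```c
/* Narrow version: the result must be reduced to 8 bits, which costs an
   extra instruction (for example an AND with 0xFF) on a 32-bit register. */
unsigned char add_narrow(unsigned char a, unsigned char b)
{
    return (unsigned char)(a + b);
}

/* Wide version: plain 32-bit arithmetic, no range reduction needed. */
int add_wide(int a, int b)
{
    return a + b;
}
```

The narrow version is only the right choice when the modulo behaviour (255 + 1 = 0) is actually wanted.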

Signed versus Unsigned Types


If your code uses addition, subtraction, and multiplication, then there is no performance difference between signed and unsigned operations. However, there is a difference when it comes to division. In C on an ARM target, a divide by two of a signed value x is not a right shift if x is negative. For example, -3>>1 = -2 but -3/2 = -1. Division rounds towards zero, but arithmetic right shift rounds towards -∞. It is more efficient to use unsigned types for divisions. The compiler converts unsigned power of two divisions directly to right shifts.
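The difference between the two roundings is easy to reproduce. The function names below are made up for this sketch, and one assumption is flagged in the code: shifting a negative int right is implementation-defined in C, although ARM compilers and GCC implement it as an arithmetic shift.

```c
/* Signed divide by two: C rounds the quotient towards zero. */
int half_signed(int x) { return x / 2; }

/* Arithmetic shift: rounds towards minus infinity for negative x.
   Assumption: the compiler implements >> on a negative int as an
   arithmetic shift (implementation-defined in C). */
int shift_signed(int x) { return x >> 1; }

/* Unsigned divide by two: identical to a logical right shift. */
unsigned int half_unsigned(unsigned int x) { return x / 2u; }
```

Because the two signed results disagree for negative odd values, the compiler has to add correction code around the shift for a signed division, which the unsigned version avoids.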

The Efficient Use of C Types

C Looping Structures
Loops with a fixed number of iterations
Loops with a variable number of iterations
Loop unrolling

Loops with a Fixed Number of Iterations


for (i=0; i<64; i++)
It takes three instructions to implement the for loop structure:
An ADD to increment i
A compare to check if i is less than 64
A conditional branch to continue the loop if i < 64

Is it an efficient loop?
This is not efficient. On the ARM, a loop should only use two instructions:
A subtract to decrement the loop counter, which also sets the condition code flags on the result
A conditional branch instruction
The key point is that the loop counter should count down to zero rather than counting up to some arbitrary limit. Then the comparison with zero is free since the result is stored in the condition flags. Since we are no longer using i as an array index, there is no problem in counting down rather than up.
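The two loop shapes can be put side by side. This is a sketch with hypothetical function names, summing a fixed block of 64 words; the comments describe the loop overhead a compiler is expected to generate, not measured output.

```c
/* Counts up: needs an add, a compare against 64, and a branch. */
int sum_count_up(const int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < 64; i++)
        sum += *(data++);
    return sum;
}

/* Counts down to zero: the decrement sets the flags, so only a
   subtract and a conditional branch are needed. */
int sum_count_down(const int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 64; i != 0; i--)
        sum += *(data++);
    return sum;
}
```

Both functions return the same sum; the data pointer, not i, walks the array, which is why the direction of the counter does not matter.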

Loops Using a Variable Number of Iterations


for (; N!=0; N--)
The compiler checks that N is nonzero on entry to the function. Often this check is unnecessary since you know that the array won't be empty. In this case a do-while loop gives better performance and code density than a for loop.
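A minimal comparison of the two forms, using hypothetical function names. The do-while version relies on the stated precondition that N is at least 1; calling it with N equal to 0 would be a bug.

```c
/* for loop: the compiler must test N before the first iteration. */
int sum_for(const int *data, unsigned int N)
{
    int sum = 0;
    for (; N != 0; N--)
        sum += *(data++);
    return sum;
}

/* do-while: no entry test. Precondition (assumed): N >= 1. */
int sum_do_while(const int *data, unsigned int N)
{
    int sum = 0;
    do {
        sum += *(data++);
    } while (--N != 0);
    return sum;
}
```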

Loop Unrolling
Each loop iteration costs two instructions in addition to the body of the loop: 1. a subtract to decrement the loop count and 2. a conditional branch. These instructions are the loop overhead. On ARM7 or ARM9 processors the subtract takes one cycle and the branch three cycles, giving an overhead of four cycles per loop.

Loop Unrolling
Some of these cycles can be saved by unrolling a loop: repeating the loop body several times, and reducing the number of loop iterations by the same proportion. There are two questions you need to ask when unrolling a loop:
How many times should I unroll the loop?
What if the number of loop iterations is not a multiple of the unroll amount?

Example
int checksum_v9(int *data, unsigned int N)
{
    int sum=0;
    do
    {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    } while (N!=0);
    return sum;
}

Loop unrolling
1. Only unroll loops that are important for the overall performance of the application. Otherwise unrolling will increase the code size with little performance benefit. Unrolling may even reduce performance by evicting more important code from the cache.
2. Try to arrange it so that array sizes are multiples of your unroll amount. If this isn't possible, then you must add extra code to take care of the leftover cases. This increases the code size a little but keeps the performance high.
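One way to take care of the leftover cases is a second, short loop after the unrolled one. The sketch below is an illustration with a hypothetical function name, unrolled four times; unlike a version that requires N to be a nonzero multiple of four, it accepts any N including zero.

```c
int sum_unrolled(const int *data, unsigned int N)
{
    unsigned int i;
    int sum = 0;

    /* Main loop: four elements per iteration. */
    for (i = N / 4; i != 0; i--) {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    /* Leftover loop: the remaining N % 4 elements. */
    for (i = N & 3; i != 0; i--)
        sum += *(data++);
    return sum;
}
```

The leftover loop runs at most three times, so its overhead stays small compared with the savings in the main loop.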

Writing Loops Efficiently


Use loops that count down to zero. Then the compiler does not need to allocate a register to hold the termination value, and the comparison with zero is free.
Use unsigned loop counters by default and the continuation condition i!=0 rather than i>0. This will ensure that the loop overhead is only two instructions.
Use do-while loops rather than for loops when you know the loop will iterate at least once. This saves the compiler checking to see if the loop count is zero.

Writing Loops Efficiently


Unroll important loops to reduce the loop overhead. Do not overunroll. If the loop overhead is small as a proportion of the total, then unrolling will increase code size and hurt the performance of the cache.
Try to arrange that the number of elements in arrays are multiples of four or eight. You can then unroll loops easily by two, four, or eight times without worrying about the leftover array elements.

Register Allocation
The compiler attempts to allocate a processor register to each local variable you use in a C function. It will try to use the same register for different local variables if the uses of the variables do not overlap. When there are more local variables than available registers, the compiler stores the excess variables on the processor stack. These variables are called spilled or swapped out variables since they are written out to memory. Spilled variables are slow to access compared to variables allocated to registers.

Register Allocation
To implement a function efficiently, you need to:
Minimize the number of spilled variables
Ensure that the most important and frequently accessed variables are stored in registers

Function Calls
The ARM Procedure Call Standard (APCS) defines how to pass function arguments and return values in ARM registers. The more recent ARM-Thumb Procedure Call Standard (ATPCS) covers ARM and Thumb interworking.
The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and r3. Subsequent integer arguments are placed on the full descending stack, ascending in memory.

Example
Function with 5 arguments

Function with 3 arguments

Calling Functions Efficiently


Try to restrict functions to four arguments. This will make them more efficient to call. Use structures to group related arguments and pass structure pointers instead of multiple arguments.
Define small functions in the same source file and before the functions that call them. The compiler can then optimize the function call or inline the small function.
Critical functions can be inlined using the __inline keyword.
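Grouping related arguments into a structure might look like this. The Rect type and both function names are hypothetical and exist only to contrast a five-argument call with a single pointer argument.

```c
/* Hypothetical example type: a rectangle described by four values
   plus a scale factor, five arguments in total if passed separately. */
typedef struct {
    int x0, y0;   /* one corner */
    int x1, y1;   /* opposite corner */
    int scale;
} Rect;

/* Five arguments: the fifth one has to be passed on the stack. */
int area_v1(int x0, int y0, int x1, int y1, int scale)
{
    return (x1 - x0) * (y1 - y0) * scale;
}

/* One pointer argument: fits in r0, fields are loaded as needed. */
int area_v2(const Rect *r)
{
    return (r->x1 - r->x0) * (r->y1 - r->y0) * r->scale;
}
```

The pointer version pays off most when the same structure is passed through several layers of calls, since the caller does not have to reload every value into argument registers each time.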

To reduce overheads
The caller function need not preserve registers that it can see the callee doesn't corrupt. Therefore the caller function need not save all the ATPCS corruptible registers. If the callee function is very small, then the compiler can inline the code in the caller function. This removes the function call overhead completely.

Pointer Aliasing
Two pointers are said to alias when they point to the same address. If you write to one pointer, it will affect the value you read from the other pointer. In a function, the compiler often doesn't know which pointers can alias and which pointers can't.

How to avoid?
Do not rely on the compiler to eliminate common subexpressions involving memory accesses. Instead create new local variables to hold the expression. This ensures the expression is evaluated only once. Avoid taking the address of local variables. The variable may be inefficient to access from then on.
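The effect of a possible alias on a repeated memory access can be sketched with two small functions. The names are hypothetical; the point is only that the first version forces a second load of *step, while the second makes the single load explicit.

```c
/* The compiler must reload *step for the second addition, because the
   write to *timer1 might have changed it (timer1 and step may alias). */
void timers_reload(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;
    *timer2 += *step;
}

/* Reading *step once into a local makes the single load explicit. */
void timers_local(int *timer1, int *timer2, int *step)
{
    int step_value = *step;
    *timer1 += step_value;
    *timer2 += step_value;
}
```

Note that the two versions give different results when step really does alias timer1, so the local-variable form is only a valid replacement when the original step value is the one intended.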

Structure Arrangement
Place all 8-bit elements at the start of the structure.
Place all 16-bit elements next, then 32-bit, then 64-bit.
Place all arrays and larger elements at the end of the structure.
If the structure is too big for a single instruction to access all the elements, then group the elements into substructures. The compiler can maintain pointers to the individual substructures.
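The ordering rule can be illustrated with two layouts of the same four members. The struct names are made up for this sketch, and the sizes mentioned afterwards assume a typical ABI with a 4-byte int and a 2-byte short.

```c
#include <stddef.h>

/* Mixed order: padding is inserted before each wider member. */
struct Mixed {
    char  a;
    int   b;
    char  c;
    short d;
};

/* Sorted by size, smallest first: members pack without gaps. */
struct Sorted {
    char  a;
    char  c;
    short d;
    int   b;
};
```

Under those assumptions, struct Mixed typically occupies 12 bytes and struct Sorted 8 bytes; sizeof and offsetof can be used to confirm the layout on a given compiler.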

ARM Thumb Interworking


What is ARM/Thumb Interworking?

An application is allowed to be written as a mix of ARM and Thumb instruction sets.

Why use ARM/Thumb Interworking?

Better code density using Thumb.
Certain ARM instructions have better performance than their Thumb equivalents.
ARM instructions provide some functionality which Thumb does not.
Exception handling is required to run in ARM state.
A Thumb program needs a state change from the default ARM state.

How does ARM/Thumb Interworking work?

The ARM processor is initially in ARM state. Therefore a state change is required before Thumb instructions are executed; otherwise they won't execute correctly. To branch to Thumb state, bit 0 of the branch target address is set; this changes the processor state when the branch is taken. Bit 5 in the CPSR (the T bit) then changes to 1, indicating that the processor is in Thumb state.
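The role of bit 0 in a state-changing branch can be modelled in a few lines of C. The Cpu type and function names are assumptions of this sketch; it models only the state selection of a branch-exchange (BX), nothing else about the processor.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical, minimal view of the processor state for this sketch. */
typedef struct {
    uint32_t pc;
    bool thumb;      /* mirrors the T bit (bit 5) of the cpsr */
} Cpu;

/* Model of BX Rm: bit 0 of the target selects the state and is not
   part of the address that is actually fetched from. */
void branch_exchange(Cpu *cpu, uint32_t target)
{
    cpu->thumb = (target & 1u) != 0;
    cpu->pc = target & ~1u;
}

/* The T bit as it would appear in the cpsr. */
uint32_t cpsr_t_bit(const Cpu *cpu)
{
    return cpu->thumb ? (1u << 5) : 0u;
}
```

Branching to an address with bit 0 clear returns the model to ARM state, which is how a Thumb routine hands control back to ARM code.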

ARM/Thumb interworking using ASM (no Veneer)


This program performs computations on registers. No veneer is needed; the interworking state change is implemented manually. The program consists of 4 parts:
1. Main: Generate the branch address and set bit 0 to 1 so as to arrive at the target in Thumb state. Initially in ARM state.
2. ThumbProg: Set values for r2, r3. Sum r2, r3 into r2. Executed in Thumb state.
3. ArmProg: Set values for r4, r5. Sum r4, r5 into r4. Executed in ARM state.
4. Stop: Terminate the program.

ARM/Thumb interworking using C:


This program consists of 2 parts:
1. Armmain.c for the main function, using the ARM instruction set.
Print strings
Call Thumb function
Compiled using the ARM C compiler.
2. Thumbsub.c for the sub function called by the main function, using the Thumb instruction set.
Print strings
Return to main function
Compiled using the Thumb C compiler.

ARM/Thumb interworking using C:


Armmain.c code:

#include <stdio.h>

extern void thumb_function(void);

int main(void)
{
    printf("Hello from ARM\n");
    thumb_function();
    printf("And goodbye from ARM\n");
    return (0);
}

ARM/Thumb interworking using C:


Thumbsub.c code:

#include <stdio.h>

void thumb_function(void)
{
    printf("Hello and goodbye from Thumb\n");
}

        AREA AddReg,CODE,READONLY   ;Name this block of code.
        ENTRY                       ;Mark first instruction to call.
Main
        ADR r0,ThumbProg+1          ;Generate branch target address
                                    ;and set bit 0, hence arrive at
                                    ;target in Thumb state.
        BX r0                       ;Branch exchange to ThumbProg.
        CODE16                      ;Subsequent instructions are Thumb code.
ThumbProg
        MOV r2,#2                   ;Load r2 with value 2.
        MOV r3,#3                   ;Load r3 with value 3.
        ADD r2,r2,r3                ;r2 = r2 + r3
        ADR r0,ARMProg
        BX r0
        CODE32                      ;Subsequent instructions are ARM code.
ARMProg
        MOV r4,#4
        MOV r5,#5
        ADD r4,r4,r5
Stop
        MOV r0,#0x18                ;angel_SWIreason_ReportException
        LDR r1,=0x20026             ;ADP_Stopped_ApplicationExit
        SWI 0x123456                ;ARM semihosting SWI
        END                         ;Mark end of this file.

Input and output
