The first ARM processor was developed at Acorn Computers Limited, of Cambridge, England, between October 1983 and April 1985. At that time, and until the formation of Advanced RISC Machines Limited (which was later renamed simply ARM Limited) in 1990, ARM stood for Acorn RISC Machine.
Architectural inheritance
At the time the first ARM chip was designed, the only examples of RISC architectures were the Berkeley RISC I and II and the Stanford MIPS (which stands for Microprocessor without Interlocking Pipeline Stages).
Features used
The ARM architecture incorporated a number of features from the Berkeley RISC design, but a number of other features were rejected. Those that were used were:
• a load-store architecture;
• fixed-length 32-bit instructions;
• 3-address instruction formats.
Features rejected
Register windows
The register banks on the Berkeley RISC processors incorporated a large number of registers, 32 of which were visible at any time. Register windows reduce the data traffic between the processor and memory resulting from register saving and restoring. The principal problem with register windows is the large chip area occupied by the large number of registers. Register windows were not adopted on the ARM, although the shadow registers used to handle exceptions on the ARM are similar in concept.
Delayed branches
Branches cause pipeline problems since they interrupt the smooth flow of instructions. Most RISC processors mitigate this problem by using delayed branches, where the branch takes effect after the following instruction has executed. They work well on single-issue pipelined processors, but they do not scale well to super-scalar implementations and can interact badly with branch prediction mechanisms. On the original ARM delayed branches were not used because they made exception handling more complex.
Single-cycle execution of all instructions
Although the ARM executes most data processing instructions in a single clock cycle, many other instructions take multiple clock cycles. A simple load or store instruction requires at least two memory accesses (one for the instruction and one for the data). Single-cycle operation of all instructions is only possible with separate data and instruction memories, which were considered too expensive for the intended ARM application areas. Instead of single-cycle execution of all instructions, the ARM was designed to use the minimum number of cycles required for memory accesses.
The bits at the bottom of the current program status register (CPSR) control the processor mode. The "T" field is used to switch between ARM and Thumb instruction sets. The "I" and "F" flags mask (disable) normal and fast interrupts respectively when set. Finally, the "mode" field selects one of seven execution modes:
• User mode is the main execution mode. By running application software in user mode, the operating system can achieve protection and isolation.
• Fast interrupt processing mode is entered whenever the processor receives an interrupt signal from the designated fast interrupt source.
• Normal interrupt processing mode is entered whenever the processor receives an interrupt signal from any other interrupt source.
• Software interrupt mode is entered when the processor encounters a software interrupt instruction.
• Undefined instruction mode is entered when the processor attempts to execute an instruction that is supported neither by the main integer core nor by one of the coprocessors.
• System mode is used for running privileged operating system tasks.
• Abort mode is entered in response to memory faults.
The condition code flags at the top of the register have the following meanings:
N: Negative; the last ALU operation which changed the flags produced a negative result (the top bit of the 32-bit result was a one).
Z: Zero; the last ALU operation which changed the flags produced a zero result (every bit of the 32-bit result was zero).
C: Carry; the last ALU operation which changed the flags generated a carry-out, either as a result of an arithmetic operation in the ALU or from the shifter.
V: oVerflow; the last arithmetic ALU operation which changed the flags generated an overflow into the sign bit.
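The flag definitions above can be made concrete with a small Python sketch. It models only a 32-bit addition and a decoder for the status register fields. The bit positions used (N, Z, C, V in bits 31 to 28; I, F, T in bits 7, 6, 5; mode in bits 4 to 0) and the mode encodings are taken from the classic ARM architecture rather than from the text above, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

# Mode field encodings of the classic ARM architecture (assumed, not stated in the text).
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}


def add_with_flags(a, b):
    """Add two 32-bit values and return (result, flags) as an ALU would see them."""
    a &= MASK32
    b &= MASK32
    full = a + b                       # unbounded Python integer sum
    result = full & MASK32             # truncate to 32 bits
    flags = {
        "N": (result >> 31) & 1,       # top bit of the result
        "Z": int(result == 0),         # every bit of the result is zero
        "C": int(full > MASK32),       # carry out of bit 31
        # Overflow: both operands differ in sign from the result.
        "V": (((a ^ result) & (b ^ result)) >> 31) & 1,
    }
    return result, flags


def decode_cpsr(cpsr):
    """Split a 32-bit status register value into its named fields."""
    return {
        "N": (cpsr >> 31) & 1,
        "Z": (cpsr >> 30) & 1,
        "C": (cpsr >> 29) & 1,
        "V": (cpsr >> 28) & 1,
        "I": (cpsr >> 7) & 1,
        "F": (cpsr >> 6) & 1,
        "T": (cpsr >> 5) & 1,
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }
```

Adding 1 to 0x7FFFFFFF sets N and V but not C, whereas adding 1 to 0xFFFFFFFF sets Z and C but not V, which shows that carry and overflow are independent conditions.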
Load-store architecture
In common with most RISC processors, ARM employs a load-store architecture. The instruction set will only process (add, subtract, and so on) values which are in registers (or specified directly within the instruction itself), and will always place the results of such processing into a register. The only operations which apply to memory state are ones which copy memory values into registers (load instructions) or copy register values into memory (store instructions).
ARM does not support "memory-to-memory" operations, in which an instruction processes values held in memory directly. Therefore all ARM instructions fall into one of the following three categories:
1. Data processing instructions. These use and change only register values. For example, an instruction can add two registers and place the result in a register.
2. Data transfer instructions. These copy memory values into registers (load instructions) or copy register values into memory (store instructions).
3. Control flow instructions. Normal instruction execution uses instructions stored at consecutive memory addresses. Control flow instructions cause execution to switch to a different address, either permanently (branch instructions) or saving a return address to resume the original sequence (branch and link instructions).
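The three categories can be illustrated with a toy register machine in Python. This is a minimal sketch, not a model of the real instruction set: the class and method names are hypothetical, memory is a plain dictionary, and the program counter is treated as the address of the current instruction (real ARM pipelines expose a different offset).

```python
class ToyMachine:
    """Toy load-store machine: 16 registers, word-addressed dictionary memory."""

    def __init__(self):
        self.regs = [0] * 16
        self.mem = {}
        self.pc = 0

    # 1. Data processing: registers in, register out, memory untouched.
    def add(self, rd, rn, rm):
        self.regs[rd] = (self.regs[rn] + self.regs[rm]) & 0xFFFFFFFF

    # 2. Data transfer: the only operations that touch memory.
    def ldr(self, rd, addr):
        self.regs[rd] = self.mem.get(addr, 0)

    def store(self, rd, addr):
        self.mem[addr] = self.regs[rd]

    # 3. Control flow: change the address of the next instruction.
    def b(self, target):
        self.pc = target

    def bl(self, target):
        self.regs[14] = self.pc + 4    # save return address in the link register
        self.pc = target


# Adding two memory values takes two loads, one add and one store.
m = ToyMachine()
m.mem[0x100], m.mem[0x104] = 5, 7
m.ldr(0, 0x100)
m.ldr(1, 0x104)
m.add(2, 0, 1)
m.store(2, 0x108)
```

The usage lines show the cost of the load-store restriction: an operation that a memory-to-memory architecture could express in one instruction becomes four here.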
Supervisor mode
The ARM processor supports a protected supervisor mode. The protection mechanism ensures that user code cannot gain supervisor privileges without appropriate checks being carried out to ensure that the code is not attempting illegal operations. System-level functions can only be accessed through specified supervisor calls. These functions generally include any accesses to hardware peripheral registers, and widely used operations such as character input and output.
Notable features of the ARM instruction set include:
• Conditional execution of every instruction.
• The inclusion of very powerful load and store multiple register instructions.
• The ability to perform a general shift operation and a general ALU operation in a single instruction that executes in a single clock cycle.
• Open instruction set extension through the coprocessor instruction set, including adding new registers and data types to the programmer's model.
• A very dense 16-bit compressed representation of the instruction set in the Thumb architecture.
The I/O system
The ARM handles I/O (input/output) peripherals (such as disk controllers, network interfaces, and so on) as memory-mapped devices with interrupt support. The internal registers in these devices appear as addressable locations within the ARM's memory map and may be read and written using the same (load-store) instructions as any other memory locations.
Peripherals may attract the processor's attention by making an interrupt request using either the normal interrupt (IRQ) or the fast interrupt (FIQ) input. Both interrupt inputs are level-sensitive and maskable. Some systems may include direct memory access (DMA) hardware external to the processor to handle high-bandwidth I/O traffic.
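Memory-mapped I/O can be sketched as a bus that routes every load and store by address, either to RAM or to a device register. The Python below is an illustration under stated assumptions: the `Uart` device, its register offsets and the base address 0x10000000 are all hypothetical, and interrupts and DMA are not modelled.

```python
class Uart:
    """Hypothetical peripheral: data register at offset 0, status register at offset 4."""

    def __init__(self):
        self.sent = []

    def read(self, offset):
        # Status register always reports "ready"; other offsets read as zero.
        return 1 if offset == 4 else 0

    def write(self, offset, value):
        if offset == 0:
            self.sent.append(value & 0xFF)   # transmit the low byte


class Bus:
    """Routes loads and stores to RAM or to a mapped device by address."""

    def __init__(self):
        self.ram = {}
        self.devices = []                    # list of (base, size, device)

    def map_device(self, base, size, device):
        self.devices.append((base, size, device))

    def _find(self, addr):
        for base, size, device in self.devices:
            if base <= addr < base + size:
                return device, addr - base
        return None, addr

    def load(self, addr):
        device, offset = self._find(addr)
        if device is not None:
            return device.read(offset)
        return self.ram.get(addr, 0)

    def store(self, addr, value):
        device, offset = self._find(addr)
        if device is not None:
            device.write(offset, value)
        else:
            self.ram[addr] = value & 0xFFFFFFFF
```

The processor side needs no special I/O instructions: the same `load` and `store` calls reach either memory or the device, depending only on the address.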
ARM exceptions
The ARM architecture supports a range of interrupts, traps and supervisor calls, all grouped under the general heading of exceptions. The general way these are handled is the same in all cases:
1. The current state is saved by copying the PC into r14_exc and the CPSR into SPSR_exc (where exc stands for the exception type).
2. The processor operating mode is changed to the appropriate exception mode.
3. The PC is forced to a value between 0x00 and 0x1C, the particular value depending on the type of exception.
The instruction at the location the PC is forced to (the vector address) will usually contain a branch to the exception handler. The exception handler will use r13_exc, which will normally have been initialized to point to a dedicated stack in memory, to save some user registers for use as work registers.
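The three-step entry sequence can be sketched in Python as follows. This is a simplified model that follows only the steps listed above: interrupt masking on entry and the exact return-address offset are left out. The vector addresses and the mapping from exception type to mode are those of the classic ARM architecture, not values given in the text, and the class and dictionary names are hypothetical.

```python
# Classic ARM vector addresses (assumed from the architecture, all within 0x00..0x1C).
VECTORS = {
    "reset": 0x00,
    "undefined": 0x04,
    "swi": 0x08,
    "prefetch_abort": 0x0C,
    "data_abort": 0x10,
    "irq": 0x18,
    "fiq": 0x1C,
}

# Exception mode entered for each exception type.
EXC_MODE = {
    "reset": "svc",
    "undefined": "und",
    "swi": "svc",
    "prefetch_abort": "abt",
    "data_abort": "abt",
    "irq": "irq",
    "fiq": "fiq",
}


class Cpu:
    def __init__(self):
        self.pc = 0
        self.cpsr = {"mode": "usr", "flags": 0}
        self.r14 = {}     # banked link registers, keyed by exception mode
        self.spsr = {}    # banked saved status registers, keyed by exception mode

    def take_exception(self, kind):
        mode = EXC_MODE[kind]
        # Step 1: save the current state in the banked registers.
        self.r14[mode] = self.pc
        self.spsr[mode] = dict(self.cpsr)
        # Step 2: switch to the exception mode.
        self.cpsr["mode"] = mode
        # Step 3: force the PC to the vector address.
        self.pc = VECTORS[kind]
```

Because r14 and the SPSR are banked per mode, an IRQ taken from user mode leaves the user-mode registers untouched, which is what allows the handler to return to the interrupted code.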
The image format files can be built to include the debug tables required by the ARM symbolic debugger (ARMsd), which can load, run and debug programs either on hardware such as the ARM Development Board or using a software emulation of the ARM (the ARMulator).
The ARM C compiler is compliant with the ANSI (American National Standards Institute) standard for C and is supported by the appropriate library of standard functions. It uses the ARM Procedure Call Standard for all externally available functions. It can produce assembly source output instead of ARM object format. The compiler can also produce Thumb code.
The ARM assembler is a full macro assembler which produces ARM object format output that can be linked with output from the C compiler.
The linker takes one or more object files and combines them into an executable program. It resolves symbolic references between the object files and extracts object modules from libraries as needed by the program. It can assemble the various components of the program in a number of different ways, depending on whether the code is to run in RAM (Random Access Memory, which can be read and written) or ROM (Read Only Memory), whether overlays are required, and so on.
The ARM symbolic debugger is a front-end interface to assist in debugging programs running either under emulation (on the ARMulator) or remotely on a target system such as the ARM development board. The remote system must support the appropriate remote debug protocols either via a serial line or through a JTAG test interface. It allows the setting of breakpoints, which are addresses in the code that, if executed, cause execution to halt so that the processor state can be examined.
The ARMulator (ARM emulator) is a suite of programs that models the behaviour of various ARM processor cores in software on a host system. It can operate at various levels of accuracy:
• Instruction-accurate modelling gives the exact behaviour of the system state without regard to the precise timing characteristics of the processor.
• Cycle-accurate modelling gives the exact behaviour of the processor on a cycle-by-cycle basis, allowing the exact number of clock cycles that a program requires to be established.
• Timing-accurate modelling presents signals at the correct time within a cycle, allowing logic delays to be accounted for.
All these approaches run considerably slower than the real hardware.
The ARM Development Board is a circuit board incorporating a range of components and interfaces to support the development of ARM-based systems.
• The address register and incrementer, which select and hold all memory addresses and generate sequential addresses when required.
• The data registers, which hold data passing to and from memory.
• The instruction decoder and associated control logic.
In a single-cycle data processing instruction, two register operands are accessed, the value on the B bus is shifted and combined with the value on the A bus in the ALU, then the result is written back into the register bank.
The time required to execute a given program is
Tprog = (Ninst × CPI) / fclk
where Ninst is the number of ARM instructions executed in the course of the program, CPI is the average number of clock cycles per instruction, and fclk is the processor's clock frequency.
For a fixed Ninst, there are two ways to improve performance:
• Increase the clock rate, fclk. This requires the logic in each pipeline stage to be simplified and, therefore, the number of pipeline stages to be increased.
• Reduce the average number of clock cycles per instruction, CPI. This requires either that instructions which occupy more than one pipeline slot in a 3-stage pipeline ARM are re-implemented to occupy fewer slots, or that pipeline stalls caused by dependencies between instructions are reduced, or a combination of both.
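A short numerical sketch may help compare the two options. It reads the three quantities defined above as execution time = Ninst × CPI / fclk, which is an assumption consistent with those definitions. The function name and all figures used (one million instructions, CPI values of 1.5 and 1.2, clock rates of 50 and 100 MHz) are hypothetical and chosen only for illustration.

```python
def program_time(n_inst, cpi, f_clk_hz):
    """Execution time in seconds for n_inst instructions at the given CPI and clock."""
    return n_inst * cpi / f_clk_hz


# Hypothetical baseline: 1,000,000 instructions, CPI 1.5, 50 MHz clock.
baseline = program_time(1_000_000, 1.5, 50e6)

# Option 1: double the clock rate (deeper pipeline), same CPI.
faster_clock = program_time(1_000_000, 1.5, 100e6)

# Option 2: keep the clock, reduce CPI (fewer multi-slot instructions and stalls).
lower_cpi = program_time(1_000_000, 1.2, 50e6)
```

In practice the two options interact: a deeper pipeline raises the clock rate but tends to increase the number of stall cycles, so the CPI term has to be watched when fclk is increased.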
• Buffer/data; data memory is accessed if required. Otherwise the ALU result is simply buffered for one clock cycle to give the same pipeline flow for all instructions.
• Write-back; the results generated by the instruction are written back to the register file, including any data loaded from memory.
Data forwarding
The only way to resolve data dependencies without stalling the pipeline is to introduce forwarding paths. Data dependencies arise when an instruction needs to use the result of one of its predecessors before that result has returned to the register file. Forwarding paths allow results to be passed between stages as soon as they are available, and the 5-stage ARM pipeline requires each of the three source operands to be forwarded from any of three intermediate result registers.
Even with forwarding, it is not possible to avoid a pipeline stall in all cases. Consider the following code sequence:
    LDR rN, [..]      ; load rN from somewhere
    ADD r2, r1, rN    ; and use it immediately
The loaded value is not available until after the data memory access, so the ADD must wait. The only way to avoid this stall is to encourage the compiler (or assembly language programmer) not to put a dependent instruction immediately after a load instruction.
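The load-use case can be checked mechanically. The Python sketch below counts how many instructions immediately follow a load whose destination they read. It assumes a one-cycle stall per such pair and a simplified instruction representation of `(opcode, destination, sources)`; the function name and the register numbers in the usage lines are hypothetical.

```python
def count_load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before.

    program is a list of (opcode, dest, sources) tuples, for example
    ("LDR", "r3", ("r0",)). One stall cycle is assumed per load-use pair.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LDR" and prev_dest in cur_sources:
            stalls += 1
    return stalls


# Dependent instruction directly after the load: one stall.
naive = [
    ("LDR", "r3", ("r0",)),
    ("ADD", "r2", ("r1", "r3")),
    ("SUB", "r5", ("r6", "r7")),
]

# Same work with the independent SUB moved between the load and its use: no stall.
scheduled = [
    ("LDR", "r3", ("r0",)),
    ("SUB", "r5", ("r6", "r7")),
    ("ADD", "r2", ("r1", "r3")),
]
```

Reordering of this kind is what a scheduling compiler does when it fills the slot after a load with an instruction that does not depend on the loaded value.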