In the early 1980s the idea of RISC was introduced. RISC stands for
Reduced Instruction Set Computer.
RISC processors have faster clock rates. The clock rates range from 20 to
120 MHz.
A large register file and separate instruction and data caches are used. The
large register file eliminates unnecessary storage of intermediate results in memory.
Problems in CISC processors:
1. Most ALU instructions had only two operands, where one of the operands is
also the destination. This means this operand is destroyed during the operation
or must be saved somewhere beforehand.
A RISC instruction set, in contrast, is composed of instructions that all have exactly the same size, usually 32 bits.
This will slow down the instruction execution.
RISC versus CISC:
No.  RISC                                      CISC
1.   Simple instructions taking one cycle.     Instructions taking multiple cycles.
3.   Very few instructions refer to memory.
7.   Highly pipelined.                         Less pipelined.
8.                                             Complexity is in the microprogram.
RISC properties
1. Register to register operations.
2. One instruction per cycle.
3. Hardwired instructions.
4. Reduced number of instructions.
7. Instruction pipelining.
RISC properties
1. Register to register operations
These register sets are organized into overlapped windows and act as
a small, fast buffer for holding the subset of all variables that are most likely to
be used.
Figure: overlapped register windows, showing the parameter, local and temporary registers of the called procedure and of the current procedure.
2. One instruction per cycle.
In RISC processors, one instruction is executed per machine cycle.
A machine cycle is defined to be the time it takes to fetch two operands
from registers, perform an ALU operation, and store the result in a
register.
So RISC machine instructions are not complicated and can execute about as fast
as microinstructions on CISC machines.
3. Hardwired Instructions
With simple, one-cycle instructions, there is no need for microinstructions.
The machine instructions can be hardwired.
These instructions execute faster than instructions implemented
with microinstructions, since it is not necessary to access a microprogram
control memory during instruction execution.
4. Reduced number of instructions.
A RISC processor provides a limited number of instructions, which simplifies
the design of the control unit.
Switching power:
This is the power dissipated by charging and discharging the gate
output capacitance.
Short-circuit power:
During a transition on the input of a CMOS gate, both the p and n transistors
can conduct simultaneously, resulting in a transitory conducting path from
Vdd to Vss.
This causes a power dissipation that is a small fraction of the total.
Leakage current:
A very small current called leakage current flows through the
transistors when they are in OFF state.
The power dissipation due to leakage current is small and can be
neglected.
Neglecting power consumption due to leakage current and short circuit, the
total power dissipation of a CMOS circuit is the summation of power
dissipation due to all the gates in the circuit.
It is given by,
$$P = \frac{1}{2} \cdot f \cdot V_{dd}^{2} \cdot \sum_{g} A_g \cdot C_g$$
Where,
f = clock frequency,
Vdd = supply voltage,
Ag = gate activity factor,
Cg = gate load capacitance.
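As a rough numerical illustration of the switching-power sum, the sketch below evaluates P = 1/2 · f · Vdd² · Σ(Ag · Cg), assuming the usual CMOS switching-power form. The function name, the three gate values, the 50 MHz clock and the 3.3 V supply are hypothetical and chosen only for illustration.

```python
def switching_power(f_hz, vdd, gates):
    """Total switching power in watts.

    gates is a list of (activity_factor, load_capacitance_farads) pairs.
    """
    return 0.5 * f_hz * vdd ** 2 * sum(a * c for a, c in gates)

# Hypothetical values: three gates, 50 MHz clock, 3.3 V supply.
gates = [(0.5, 10e-15), (0.1, 20e-15), (1.0, 5e-15)]
power = switching_power(50e6, 3.3, gates)
print(f"{power * 1e6:.3f} microwatts")

# Halving the supply voltage cuts the switching power to one quarter.
reduced = switching_power(50e6, 1.65, gates)
```

Because Vdd appears squared, lowering the supply voltage has a larger effect on power than lowering the clock frequency by the same factor.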
Architecture Inheritance
Load-store architecture:
It has two instruction types, load and store, for transferring data in and
out of the processor respectively.
LOAD : This instruction copies data from memory to registers in the
processor core.
STORE : This instruction copies data from registers in the processor
core to memory.
The ARM processor instruction set does not include instructions that
directly manipulate data in memory.
Data processing is carried out only in registers.
Data bus:
Data enters the ARM core through the data bus.
The data is either an instruction opcode or a data item.
Data and instructions share the same bus.
Instruction decoder:
This unit decodes the instruction opcode read from the memory and
then the instruction is executed.
Register file:
This is a bank of 32 bit registers used for storing data items.
Sign extend:
The ARM core is a 32-bit processor, so most ARM instructions
treat registers as holding signed or unsigned 32-bit values.
When the processor reads signed 8-bit or 16-bit numbers from memory,
the sign extend hardware converts these numbers to 32-bit values and
then places them in the register file.
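The effect of the sign extend hardware can be sketched in software as follows. This is only an illustration of the arithmetic, not a model of the actual circuit; the function name `sign_extend` is hypothetical.

```python
def sign_extend(value, bits):
    """Sign-extend a `bits`-wide value to a 32-bit register value."""
    value &= (1 << bits) - 1          # keep only the low `bits` bits
    sign = 1 << (bits - 1)            # mask for the sign bit
    extended = (value ^ sign) - sign  # negative if the sign bit was set
    return extended & 0xFFFFFFFF      # 32-bit two's-complement pattern

print(hex(sign_extend(0x80, 8)))     # signed byte -128
print(hex(sign_extend(0x7F, 8)))     # signed byte +127
print(hex(sign_extend(0xFFFE, 16)))  # signed halfword -2
```

A value whose top bit is set has that bit copied into all higher bit positions, so the 32-bit register holds the same signed number as the narrower memory value.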
ALU and MAC:
Most ARM instructions take two source operands, which are held in the
source registers Rn and Rm.
These source operands are read from the Rn and Rm registers using the
internal buses A and B respectively.
The ALU or MAC takes the operand values from the A and B buses and
computes a result, which is written via the internal C bus to the
destination register, Rd, in the register file.
Address register:
This holds the address generated by the load and store instructions and
places it on the address bus.
Barrel shifter:
The contents of the Rm register can alternatively be preprocessed in
the barrel shifter before being applied as an input to the ALU.
Incrementer:
For load and store instructions, the incrementer updates the contents of
the address register before the processor core reads or writes the next
register value from or to the consecutive memory locations.
The register file in the ARM core contains all the registers available to a
programmer.
The current mode of the processor decides the availability of the registers
to the programmer.
The ARM processor has a total of 37 registers.
All registers are 32 bits wide. They can be classified into two groups:
General purpose registers and
Special purpose registers.
General purpose registers:
Registers r0 to r12 are used as general purpose registers. Depending
upon the context, registers r13 to r15 can also be used as general
purpose registers.
The general purpose registers hold either data or an address.
Format of CPSR
The current program status register is accessible in all processor modes.
It contains condition code flags, interrupt disable bits, the current processor
mode and other status and control information.
User mode and system mode do not have an SPSR, because they are not
exception modes.
Control flags:
The control bits change when an exception arises and can be altered
by software.
Bits 0-4 (mode select bits):
These bits determine the processor mode.
PROCESSOR MODE            MODE BITS [4:0]
Abort                     10111
Fast interrupt request    10001
Interrupt request         10010
Supervisor                10011
System                    11111
Undefined                 11011
User                      10000
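The mode select encoding can be used to decode a CPSR value in software, as in the minimal sketch below. The dictionary and function names are hypothetical, and the sample CPSR value is made up for illustration; the entry for 10001 is the standard ARM encoding for fast interrupt request mode.

```python
CPSR_MODES = {
    0b10000: "User",
    0b10001: "Fast interrupt request",
    0b10010: "Interrupt request",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}

def processor_mode(cpsr):
    """Return the mode name encoded in bits 0-4 of a CPSR value."""
    return CPSR_MODES.get(cpsr & 0x1F, "unknown encoding")

# Hypothetical CPSR value whose low five bits are 10011.
print(processor_mode(0x600000D3))
```

Masking with 0x1F isolates bits 0-4, so the condition flags and interrupt disable bits in the upper part of the register do not affect the lookup.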
Thumb:
The Thumb instruction set is a reworking of the ARM set, with a few
things omitted.
Thumb instructions are 16 bits.
This allows for greater code density in places where memory is
restricted.
Most Thumb instructions can only address the first eight registers, and there are
no conditionally executed instructions apart from branches.
For this reason, the Thumb instruction set always comes along with the full ARM
instruction set.
Jazelle:
Jazelle executes 8 bit instructions.
It is a hybrid mix of software and hardware.
It is designed to speed up the execution of Java bytecodes.
Both the Jazelle technology and a specially modified version of the Java
virtual machine are needed to execute Java bytecodes.
Little endian:
In little endian format, the lowest addressed byte in a word is
considered the least significant byte of the word.
The highest addressed byte is the most significant.
So the byte at address 0 of the memory system connects to data lines 7
through 0.
For a word aligned address A, the figure shows how the word at
address A, the halfword at address A and A+2 and the byte addresses A,
A+1, A+2 and A+3 map on to each other when the core is configured
as little endian.
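Figure: little endian layout of the word at address A.
Bits 31-0: Word at address A
Bits 31-16: Halfword at address A+2; bits 15-0: Halfword at address A
Bits 31-24: Byte at address A+3; bits 23-16: Byte at address A+2; bits 15-8: Byte at address A+1; bits 7-0: Byte at address A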
Big endian:
In big endian format, the ARM processor stores the most significant
byte of a word at the lowest numbered byte and the least significant
byte at the highest numbered byte.
So the byte at address 0 of the memory system connects to data lines
31 through 24.
For a word aligned address A, the figure shows how the word at
address A, the halfword at address A and A+2 and the byte addresses A,
A+1, A+2 and A+3 map on to each other when the core is configured
as big endian.
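Figure: big endian layout of the word at address A.
Bits 31-0: Word at address A
Bits 31-16: Halfword at address A; bits 15-0: Halfword at address A+2
Bits 31-24: Byte at address A; bits 23-16: Byte at address A+1; bits 15-8: Byte at address A+2; bits 7-0: Byte at address A+3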
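The two byte orderings can be compared with a short sketch. The word value 0x11223344 is arbitrary; the list index stands in for the byte offset from the word aligned address A.

```python
word = 0x11223344

little = word.to_bytes(4, "little")  # bytes at addresses A, A+1, A+2, A+3
big = word.to_bytes(4, "big")

print([hex(b) for b in little])  # byte at A is the least significant byte
print([hex(b) for b in big])     # byte at A is the most significant byte

# Halfword at address A as each configuration would read it back.
half_little = int.from_bytes(little[0:2], "little")
half_big = int.from_bytes(big[0:2], "big")
print(hex(half_little), hex(half_big))
```

In the little endian case the halfword at address A is the low half of the word (bits 15 to 0), while in the big endian case it is the high half (bits 31 to 16).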
Branch instructions:
Branch instructions are executed in three cycles.
In the first cycle, a 24-bit immediate field is extracted from the
instruction and then shifted left two bit positions using the barrel shifter to
give a word-aligned offset.
This offset is added to the PC and the result is loaded into the address
register.
In the second cycle, for a branch with link, the return address derived from the contents of the PC is loaded
into the link register r14 through the ALU.
The third cycle is used to fill the instruction pipeline.
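The first-cycle address calculation can be sketched as below. The sketch assumes that the 24-bit field is a signed two's-complement value, as in the ARM branch encoding, and that `pc` is whatever PC value the datapath adds the offset to; the function name and the example instruction words are illustrative only.

```python
def branch_target(instruction, pc):
    """Compute the branch destination from a 32-bit branch instruction.

    `pc` is the PC value seen by the instruction when the offset is added.
    """
    imm24 = instruction & 0x00FFFFFF   # extract the 24-bit immediate field
    if imm24 & 0x00800000:             # assumption: the field is signed
        imm24 -= 1 << 24
    offset = imm24 << 2                # shift left two bits: word-aligned offset
    return (pc + offset) & 0xFFFFFFFF

# Illustrative instruction words with immediates +2 and -2.
print(hex(branch_target(0xEA000002, 0x8008)))
print(hex(branch_target(0xEAFFFFFE, 0x8008)))
```

Shifting by two bit positions turns a count of words into a count of bytes, which is why the resulting offset is always word aligned.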
3-Stage pipelining
$$T_{prog} = \frac{N_{inst} \times CPI}{F_{clk}}$$
Tprog : Time required to execute a given program.
Ninst : Number of ARM instructions executed in the program.
CPI : Average number of clock cycles per instruction.
Fclk : Processor's clock frequency.
There are some ways to increase the performance:
Increase the clock rate, Fclk : To achieve this it is necessary to
simplify the logic in each pipeline stage and, therefore, to increase the number of pipeline stages.
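The relation Tprog = (Ninst × CPI) / Fclk can be tried out with a few numbers. The instruction count, CPI and clock frequencies below are hypothetical and serve only to show how the clock rate affects the execution time.

```python
def program_time(n_inst, cpi, f_clk_hz):
    """Execution time in seconds: Tprog = (Ninst * CPI) / Fclk."""
    return n_inst * cpi / f_clk_hz

# Hypothetical program: 1 million instructions with an average CPI of 1.9.
slow = program_time(1_000_000, 1.9, 20e6)   # 20 MHz clock
fast = program_time(1_000_000, 1.9, 40e6)   # doubling Fclk halves Tprog
print(slow, fast)
```

With Ninst fixed by the program, the remaining levers are a higher Fclk and a lower CPI.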
5-Stage pipelining
Figure: 5-stage pipeline. Fetch: instruction fetch. Decode: Thumb/ARM instruction decoder. Execute: shift and ALU. Memory: memory access. Write: register write.
Write:
In this stage, the results generated by the instruction are written back to
the register file including any data loaded from memory.
ARM implementation
As shown in the figure, the register read buses are valid early in phase 1.
One operand is passed through the barrel shifter, and the output of the barrel shifter
becomes valid later in phase 1.
The ALU has input latches that are open during phase 1, so the operands begin to combine in the ALU as soon as they are valid.
The latches close at the end of phase 1 so that the phase 2 precharge
does not get through to the ALU.
The ARM supports 32-bit addition, which has a significant effect on the
datapath cycle time.
As a result it also has a significant effect on the processor's performance.
A simple ripple-carry adder has a worst-case carry path 32 gates long.
In order to reduce the worst-case carry path and to allow a higher clock rate,
ARM2 uses a 4-bit carry look-ahead circuit.
4 bit carry look ahead circuit
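The idea of 4-bit carry look-ahead can be sketched in software. The sketch below is a generic textbook formulation using generate and propagate terms, not the actual ARM2 circuit; the function names are hypothetical. Within each 4-bit group every carry is computed directly from the inputs, and the carry only ripples between the eight groups.

```python
def cla4(a, b, c_in):
    """Add two 4-bit values with carry look-ahead; returns (sum, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # generate terms
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate terms
    c0 = c_in & 1
    # Each carry is a flat expression in g, p and c0 (no rippling inside the group).
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (p[i] ^ carries[i]) << i               # sum bit i
    return total, c4

def add32(a, b, c_in=0):
    """32-bit addition built from eight 4-bit look-ahead groups."""
    result, carry = 0, c_in
    for k in range(8):
        nibble, carry = cla4((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF, carry)
        result |= nibble << (4 * k)
    return result, carry

print(add32(0x12345678, 0x11111111))
```

In this arrangement the carry passes through eight group stages instead of 32 single-bit stages, which is the reason for the shorter worst-case path.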
ALU functions:
Along with the addition, ALU does address computations for memory
transfer, branch calculations, bit wise logical functions and so on.
In this scheme, the worst-case addition time is significantly shorter than with the 4-bit
carry look-ahead adder.
The above table shows the values of u and v for inputs A, B and C
(carry) for a particular bit position.
When C is unknown, values of u and v are 1 and 0, respectively.
It is important to note that u gives the carry out if the carry in is one
and v gives the carry out if the carry in is zero.
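The meaning of u and v can be checked directly from their definition: u is the carry out when the carry in is one, and v is the carry out when the carry in is zero. The sketch below derives both from the ordinary full-adder carry function; the function names are hypothetical.

```python
def carry_out(a, b, c_in):
    """Carry out of a one-bit full adder (majority of the three inputs)."""
    return (a & b) | (a & c_in) | (b & c_in)

def arbitration_bits(a, b):
    """Return (u, v): carry out for carry-in = 1 and for carry-in = 0."""
    u = carry_out(a, b, 1)
    v = carry_out(a, b, 0)
    return u, v

for a in (0, 1):
    for b in (0, 1):
        print(a, b, arbitration_bits(a, b))
```

When A and B differ, the carry out depends on the carry in, which is the "unknown" case with u = 1 and v = 0. When A and B are equal, the carry out is already known and u equals v.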
The above figure shows a 4x4 matrix. ARM processors use a 32x32
matrix.
Precharging sets all outputs to logic 0, so those which are not connected to
any input during switching remain at 0 giving the zero filling required by
the shift operation.
For rotate right, the right shift diagonal is enabled together with the complementary left shift
diagonal.
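The combination of a right shift with the complementary left shift can be written out in software as a sketch of what the two enabled diagonals achieve on a 32-bit value. The function names are hypothetical, and the sketch covers rotate amounts from 1 to 31 only.

```python
MASK32 = 0xFFFFFFFF

def shift_right(x, n):
    """Logical shift right with zero fill (the 'right n' diagonal)."""
    return (x & MASK32) >> n

def shift_left(x, n):
    """Logical shift left with zero fill (the 'left n' diagonal)."""
    return (x << n) & MASK32

def rotate_right(x, n):
    """Rotate right by n (1..31) by combining right n with left 32 - n."""
    return shift_right(x, n) | shift_left(x, 32 - n)

print(hex(rotate_right(0x000000F1, 4)))
```

The bits shifted out at the bottom by the right shift are exactly the bits that the complementary left shift places at the top, so the two zero-filled results can simply be combined.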
Multiplier Design:
The older ARM cores support multiplication with a 32-bit result.
They use the barrel shifter and ALU to generate the product.
Here, multiplication is implemented using a modified Booth's algorithm.
Recent ARM cores, on the other hand, support multiplication with a 64-bit
result.
For high-performance multiplication they use carry-save adders.
In this technique, the carry output from bit i during step j is applied to
carry input bit i+1 during the next step j+1.
After addition of carry components in the last row, one more step is
required in which the carries are allowed to ripple from the least to the
most significant bit.
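The carry-save technique can be illustrated with a small software model. For simplicity the sketch uses plain shift-and-add partial products rather than Booth recoding, so it shows only the carry-save accumulation and the final carry-propagate step; it is not a model of any particular ARM multiplier, and the function name is hypothetical.

```python
def carry_save_multiply(a, b, width=32):
    """Multiply two unsigned `width`-bit values using carry-save accumulation."""
    mask = (1 << (2 * width)) - 1
    partial_sum, partial_carry = 0, 0
    for step in range(width):
        # Partial product for this step (plain shift-and-add, not Booth recoding).
        pp = (a << step) if (b >> step) & 1 else 0
        # Bitwise full adders: no carry ripples within a step.
        new_sum = partial_sum ^ partial_carry ^ pp
        # The carry out of bit i is saved and applied to bit i+1 in the next step.
        new_carry = ((partial_sum & partial_carry)
                     | (partial_sum & pp)
                     | (partial_carry & pp)) << 1
        partial_sum, partial_carry = new_sum & mask, new_carry & mask
    # Final step: one carry-propagate addition resolves the saved carries.
    return (partial_sum + partial_carry) & mask

print(carry_save_multiply(13, 11))
```

Each step costs only one full-adder delay regardless of the word width; the slow carry propagation happens once, in the final addition.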
Read buses A and B are provided to read the state of the cell.
A read operation is activated by asserting the control signals read A and read B.
The register cells are arranged in columns to form a 32-bit register.
Such columns are packed together to form the complete register bank.
The decoders for the read and write enable lines are packed
above the columns.
In the ARM processor, the Program Counter is a part of the register bank and has
two write ports and three read ports.
The other registers in the bank have only one write port and two read ports.
The PC is kept at one end of the register array.
Instruction decoder PLA : It uses an internal cycle counter and some of the
instruction bits to identify the class of operation to be performed on the
datapath in the next cycle.
Distributed Secondary Control : It uses information from PLA to select
other instruction bits or processor state information to control the datapath.
Decentralized Control Units : They control the datapath for specific
instructions that take a variable number of cycles to complete their
execution.
The cycle count block indicates the current cycle number in the execution of a multicycle instruction.
According to the cycle count, the PLA generates different control outputs.
The cycle count also determines whether it is the last cycle of the current
instruction and, if it is, initiates the transfer of the next instruction
from the instruction pipeline.
Physical Design:
There are two principal mechanisms used to implement an ARM
processor core.
Hard Macrocell:
It is a physical layout.
It can be used only on the particular process for which it has been
designed.
For every new process, the layout needs to be modified and
recharacterized.
Soft Macrocell:
It is a synthesizable design expressed in a hardware description
language such as VHDL.
It can readily be ported to a new process technology.
Recent ARM processor cores are available in both hard and soft forms.
ARM7TDMI core
ARM7TDMI Organization
CLOCK SIGNALS:
MCLK : Memory clock input. This is the main clock for all memory accesses
and processor operations.
WAIT : When LOW the processor extends an access over a number of cycles of
MCLK, which is useful for accessing slow memory.
ECLK : External clock output.
MEMORY INTERFACE:
MREQ : Memory request : When the processor requires memory access
during the following cycle this is low.
SEQ : Sequential Address : When the address of next memory cycle is
closely related to that of the last memory access, this is high.
LOCK : Locked operation : When the processor is performing a locked
memory access this is high. This is used to prevent the memory
controller allowing another device to access the memory. It is active
only during the data swap instructions.
R / W : Read / Write : When the processor is performing a read cycle, this
is low.
MAS [1:0]:
Memory access size : Used to indicate to the memory system the size
of data transfer required for both read and write cycles. The signals become valid
before the falling edge of MCLK and remain valid until the rising edge
of MCLK.
The binary values 00, 01 and 10 represent byte, halfword and word
respectively.
BL[3:0]:
Byte latch control : The values on the data bus are latched on the
falling edge of MCLK when these signals are high.
MMU INTERFACE:
TRANS:
Memory translate : When the processor is in user mode, this is low. It
can be used to tell the memory management system when
address translation should be turned on.
MODE[4:0]:
Processor mode : These are the inverse of the internal status bits
including the current processor mode.
ABORT:
Memory abort : the memory system uses this signal to tell the processor that
a requested access is not allowed.
STATUS SIGNAL:
TBIT:
When the processor is executing the thumb instruction set, this is high. It is
low when executing the ARM instruction set.
CONFIGURATION:
BIGEND:
Big endian configuration : selects how the processor treats bytes in memory.
HIGH for big endian format.
LOW for little endian format.
INTERRUPTS:
FIQ:
Fast interrupt request : Taking this LOW causes the processor to be
interrupted if the appropriate enable in the processor is active.
The signal is level sensitive and must be held LOW until a suitable response
is received from the processor.
IRQ:
Interrupt request : As FIQ, but with lower priority. Can be taken LOW
to interrupt the processor.
ISYNC:
Synchronous interrupts : Set this HIGH if IRQ and FIQ are
synchronous to the processor clock. Set it LOW for asynchronous
interrupts.
INITIALIZATION:
RESET:
Used to start the processor from a known address.
A LOW level causes the instruction being executed to terminate
abnormally.
When HIGH for at least one clock cycle, the processor restarts from
address 0.
BUS CONTROL:
ENIN:
Enable input : This must be LOW for the data bus to be driven during
write cycle.
ENOUT:
Enable output : during a write cycle, this signal is driven LOW before
the rising edge of MCLK and remains LOW for the entire cycle.
DBE:
Data bus enable : Must be HIGH for data to appear on either the
bidirectional or unidirectional data output bus.
When LOW, the bidirectional data bus is placed into high impedance
state and data output is prevented on the unidirectional data output bus.
ABE:
Address bus enable : The address bus drivers are disabled when this is LOW.
ABE must be HIGH if there is no system requirement to disable the
address drivers.
ALE:
Address latch enable : The signal is provided for backwards
compatibility with older ARM processors.
This enables these address signals to be held valid for the complete
duration of a memory access cycle.
APE:
Address pipeline enable : selects whether the address bus and other
signals operate in pipelined mode (APE is HIGH)
or depipelined mode (APE is LOW).
BUSEN:
Data bus configuration : A static configuration signal that selects
whether the bidirectional data bus (D[31:0]) or the unidirectional data
buses (DIN[31:0] and DOUT[31:0]) are used to transfer data between
the processor and memory.
When BUSEN is LOW, D[31:0] is used.
When BUSEN is HIGH, DIN[31:0] and DOUT[31:0] are used.
DEBUG INTERFACE:
The ARM7TDMI processor contains hardware extensions for advanced
debugging features.
DBGACK:
Debug acknowledge : when the processor is in debug state this is high.
DBGEN:
Debug enable : A static configuration signal that disables the debug
features of the processor when held LOW.
This signal must be HIGH to enable the debug function.
DBGRQ :
Debug request : This is a level sensitive input, that when HIGH causes
ARM7TDMI core to enter debug state after executing the current
instruction.
It has also additional debugging features.
EXTERN0:
External input 0 : This is connected to the Embedded ICE debug logic
and enables breakpoints and watchpoints to be dependent on an
external condition.
EXTERN1:
External input 1 : This is connected to the Embedded ICE debug logic
and enables breakpoints and watchpoints to be dependent on an
external condition.
COMMRX:
Communication channel receive : When the communication channel
receive buffer is full this is HIGH.
This signal changes after the rising edge of MCLK.
COMMTX:
Communication channel transmit : When the communication channel
transmit buffer is empty this is HIGH.
This signal changes after the rising edge of MCLK.
EXEC:
Executed : This is HIGH when the instruction in the execution unit is
not being executed.
RANGEOUT0:
When the embedded ICE watchpoint unit 0 has matched the conditions
currently present on the address, data and control buses, then this is
HIGH.
RANGEOUT1:
When the embedded ICE watchpoint unit 1 has matched the conditions
currently present on the address, data and control buses, then this is
HIGH.