
Sharif University of Technology, EE Department, Embedded Real-Time Systems Course, imangh@sharif.edu
Sharif University of Technology, EE Department
Embedded Real-Time Systems
Spring 2017
Part One
Embedded Real-Time Systems
Analog Processing
Analog computers, Fourier optics

Why Go Digital?
Programmability
One hardware platform can perform several tasks
Upgradeability and flexibility
Repeatability
Identical performance from unit to unit
No drift in performance due to temperature or aging
Immunity to noise
Offers higher quality or performance
(compare CD players with phonograph turntables)
Embedded Real-Time Systems
Embedded Systems:
Computer systems designed to perform one or a few dedicated tasks
Examples: MP3 players, traffic control systems, radars, ...
Compared to general-purpose (GP) computers, which stay flexible to end-user needs
Can be optimized for cost, size, power consumption, ...

Real-Time Systems:
Systems subject to a real-time constraint:
operational deadlines from event to system response
Examples: video recorders, ...
Compared to non-real-time systems
Embedded Real-Time Systems
Real-Time Systems

Hard (Immediate) Real-Time Systems
Correct execution of the main task depends on the duration of execution
Deadline concept ~ real-time constraint (RTC)
Example: car engine controller
[Timing diagram: between sample n and sample n+1, waiting time plus processing time fill the sample interval; real-time operation requires waiting time > 0]
Latency = transmission (acquisition) delay + algorithmic delay

Soft Real-Time Systems
Completion after the deadline is tolerated, at the price of lost QoS
Example: dropping frames in a video chat
Processors
History
First commercial microprocessor: the 4-bit Intel 4004, 1971
4-bit processors were followed by 8-, 16-, 32-, and 64-bit ones
The most successful family started with the 16-bit Intel 8086
x86 / IA-32 (i386) architecture; the 32-bit line started with the 80386

The architecture is the processor's contents from the programmer's vantage point

Moore's Law, 1965
The number of transistors that can be integrated on a single piece of silicon will double roughly every 18-24 months
Has held true for more than 45 years now; may hold true for another decade
Roughly applies to both density and clock frequency: denser ~ faster
Processors
[Figure: transistor dimensions shrinking across process generations]
Processors
Moore's Law
Challenge for processor designers:
Make performance follow at least (Moore's law)^2
density effect x clock effect
adding improvements through innovations: micro-architectures, multi-core, ...
Performance has not gone up that fast! It follows Moore's law
Orchestration problem: 1971-2009, 2^(38/2) ~ 524,000x
Power consumption bottleneck
Heat dissipation: not worth it to increase the clock

Intel rule of thumb:
Increasing the clock rate by 25% yields approximately a 15% performance increase,
but power consumption doubles

MIPS-per-Watt challenge -> change of viewpoint
Processors
Processor Design
Architecture (ISA)
programmer/compiler view
Functional structure; interface to the user/system programmer
Op-codes, addressing modes, registers, number formats

Implementation (Micro-architecture)
processor designer view
Logical structure, pipelining
Functional units, caches, physical registers

Realization (Chip)
chip/system designer view
Physical structure for the implementation
Gates, cells, transistors, interconnect
Processors
Iron Law (Joel Emer)

Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
(to be minimized)

Level:       Architecture        Implementation       Realization
Determines:  Code size           CPI                  Cycle time
Owner:       Compiler designer   Processor designer   Chip designer

Instructions/Program: instructions executed, not static code size
Determined by algorithm, compiler, ISA

Cycles/Instruction (CPI): determined by ISA and CPU organization
Overlap among instructions reduces this term

Time/Cycle: determined by technology, organization, clever circuit design
Processors
Iron Law Examples

Processor A: clock 1 ns, CPI 2.0; program P takes N instructions on A
Processor B: clock 2 ns, CPI 1.2; program P takes N instructions on B

Time(A) = N x 2.0 x 1 ns = 2N
Time(B) = N x 1.2 x 2 ns = 2.4N
Time(B)/Time(A) = 2.4N/2N = 1.2 -> A is 20% faster on program P

For the performance of B to reach that of A, any one of:
CPI(B) improved to 1.0
Clock(B) shortened to 1.667 ns
ISA(B) redesigned to support "golden" instructions, so that only 0.833N instructions perform P
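The comparison above can be checked mechanically with the Iron Law; a minimal sketch (function names are mine, not from the slides):

```c
#include <stdio.h>

/* Iron Law: time = (instructions/program) x (cycles/instruction) x (time/cycle) */
double exec_time(double instructions, double cpi, double clock_ns)
{
    return instructions * cpi * clock_ns;
}

/* Time(B)/Time(A) for the slide's processors; N cancels, so N = 1.0 is used. */
double ratio_b_over_a(void)
{
    double ta = exec_time(1.0, 2.0, 1.0);  /* A: CPI 2.0, 1 ns clock -> 2.0N */
    double tb = exec_time(1.0, 1.2, 2.0);  /* B: CPI 1.2, 2 ns clock -> 2.4N */
    return tb / ta;                        /* 1.2 -> A is 20% faster */
}
```

Each proposed fix for B (CPI 1.0, a 1.667 ns clock, or 0.833N instructions) brings its exec_time back to 2.0N.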
Processors
Iron Law Examples

Option: stores can be executed in 1 cycle by slowing the clock down by 15%.
Is it better to adopt the modification?

op       frequency in P   cycles
ALU      43%              1
Load     21%              1
Store    12%              2
Branch   24%              2

oldCPI = 0.43 + 0.21 + 0.12 x 2 + 0.24 x 2 = 1.36
newCPI = 0.43 + 0.21 + 0.12 + 0.24 x 2 = 1.24   (Store: 12%, now 1 cycle)

Speedup = oldtime/newtime
        = (P x oldCPI x T) / (P x newCPI x 1.15T)
        = 1.36 / (1.24 x 1.15) = 0.95

Don't do it!
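The slide's evaluation can be reproduced with a small weighted-CPI helper (a sketch; names are mine):

```c
/* Weighted CPI over an instruction mix (frequencies sum to 1). */
double weighted_cpi(const double freq[], const double cycles[], int n)
{
    double cpi = 0.0;
    for (int i = 0; i < n; i++)
        cpi += freq[i] * cycles[i];
    return cpi;
}

/* Speedup of the "1-cycle store, 15% slower clock" option. */
double store_option_speedup(void)
{
    double freq[]       = {0.43, 0.21, 0.12, 0.24}; /* ALU, Load, Store, Branch */
    double old_cycles[] = {1, 1, 2, 2};
    double new_cycles[] = {1, 1, 1, 2};             /* stores now take 1 cycle  */
    double old_cpi = weighted_cpi(freq, old_cycles, 4);   /* 1.36 */
    double new_cpi = weighted_cpi(freq, new_cycles, 4);   /* 1.24 */
    return old_cpi / (new_cpi * 1.15);              /* clock period 15% longer  */
}
```

The result is about 0.95, i.e. a slowdown, which is the slide's conclusion.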
Processors
Instruction Set Architecture (ISA)
The boundary between software and hardware
Specifies the functional machine that is visible to the programmer
Also a functional spec for the processor designers

What an ISA needs to specify:
Operations: what to perform and what to perform next
Temporary operand storage in the CPU: accumulator, stacks, registers
Number of operands per instruction
Operand location: where and how to specify the operands
Type and size of operands
Instruction-to-binary encoding
Processors
Basic ISA Classification
Stack architecture (zero operands):
Operands popped from stack(s); result pushed on the stack
Accumulator (one operand):
A special accumulator register is the implicit operand; the other operand comes from memory
Register-Memory (two operands):
One operand from a register, the other from memory or a register
Generally, one of the source operands is also the destination
A few architectures have allowed memory-to-memory operations
Register-Register or Load/Store (three operands):
All operands of ALU instructions must be registers
General format: Rd <- Rs op Rt
Separate Load and Store instructions for memory access
Processors
Important ISA Considerations
Number of registers
Data types/sizes
Addressing modes
Instruction complexity
Branch/jump/function call
Exception handling
Instruction format/size/regularity

Data type / size:
Fixed point: 8-, 16-, 24-, 32-, ... bit
Floating point: IEEE 754 standard, 32- and 64-bit
Processors
Addressing Modes
Register indirect: M[Ri]
Indexed: M[Ri + Rj]
Absolute: M[#n]
Memory indirect: M[M[Ri]]
Auto-increment: M[Ri]; Ri += d
Auto-decrement: M[Ri]; Ri -= d
Scaled: M[Ri + #n + Rj x d]
Update: M[Ri = Ri + #n]
(#n: immediate value; Ri, Rj: registers; Ri + #n: displacement; M: memory)
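Several of these modes correspond to familiar C pointer idioms; a rough mapping (which mode the compiler actually emits depends on the ISA, and the function name is mine):

```c
/* C idioms a compiler may lower to the addressing modes listed above. */
int addressing_demo(int *Ri, int j)
{
    int v = 0;
    v += *Ri;        /* register indirect: M[Ri]          */
    v += Ri[j];      /* indexed:           M[Ri + Rj]     */
    v += *(Ri + 2);  /* displacement:      M[Ri + #n]     */
    v += *Ri++;      /* auto-increment:    M[Ri]; Ri += d */
    return v;
}
```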

Branches
Conditional/unconditional
A function call is similar, but needs parameter passing, saving state,
restoring state, latency, ...
Processors
Modern ISAs
Operations: simple ALU ops, data movement, control transfer
Temporary operand storage in the CPU:
Large general-purpose register (GPR) file
Load/Store architecture
Three operands per ALU instruction (all registers): A <- B op C
Addressing modes:
Limited addressing modes, e.g. register indirect addressing only
Type and size of operands:
32/64-bit integers, IEEE floats
Instruction-to-binary encoding:
Fixed width, regular fields

Important exception: Intel x86
Processors
MIPS ISA
The MIPS ISA was one of the first RISC instruction sets (1985)

Main characteristics:

Load-Store Architecture

Three operand format (Rd Rs op Rt)

Simple instruction format

32 General Purpose Registers

Only one addressing mode for memory operands: reg. indirect + displacement

Limited, highly orthogonal instruction set: 52 instructions

Simple branch/jump/subroutine call architecture

Processors
x86 ISA, a CISC ISA
First introduced with the Intel 8086 processor in 1978; evolved over the years
Main characteristics:
Reg/Mem architecture: ALU instructions can have memory operands
Two-operand format: one source operand can also be the destination
Eight general-purpose registers
Seven memory addressing modes
More than 500 instructions
The instruction set is non-orthogonal
Highly variable instruction size and format: instruction size varies from 1 to 17 bytes
Processors
ARM and Thumb ISAs
The ARM processor architecture provides support for:
The 32-bit ARM and 16-bit Thumb instruction set architectures
Reduced Instruction Set Computer (RISC) architecture
Load/store architecture
Simple addressing modes (determined from register contents and the instruction)
Sixteen 32-bit registers
8-, 16-, and 32-bit data types
Instructions that combine a shift with an arithmetic or logical operation
Auto-increment and auto-decrement addressing modes to optimize loops
Load and Store Multiple instructions to maximize data throughput
Conditional execution of almost all instructions to maximize execution throughput
ARM uses the Unified Assembler Language (UAL) to provide a canonical form for all ARM and Thumb instructions

Processors
ARM and Thumb ISAs
Enhancements to a basic RISC architecture enable ARM processors to achieve a good balance of high performance, small code size, low power consumption, and small silicon area.

ARM6, ARM7, ARM9, ARM11, Cortex, ...

ARM7TDMI
3-stage pipeline
Von Neumann bus structure (ARM9: Harvard) — www.arm.com
TDMI: Thumb, Debug (JTAG) interface, enhanced Multiplier, embedded ICE
CPI ~ 1.9
20 billion chips created; about 10 million shipped every day!
About 60 instructions
Processors
Performance Depends on What Is Important!
Execution time, throughput, cost, area, power, ...
Execution time (elapsed time) is the time to finish a job
Throughput is completions per second, i.e. the number of jobs done per second
Faster CPUs or more CPUs to improve performance

Performance metrics: MIPS, MFLOPS
MIPS = instruction count / (execution time x 10^6) = clock rate / (CPI x 10^6)
MIPS has serious shortcomings
MFLOPS = FP operations in program / (execution time x 10^6)
Assumes FP operations are independent of compiler and ISA
However, this is not always safe:
missing instructions (e.g. FP divide, sqrt/sin/cos) and optimizing compilers
Relative MIPS and normalized MFLOPS: normalized to some common baseline machine
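The two MIPS formulas above are equivalent because execution time = instructions x CPI / clock rate; a sketch (function names are mine):

```c
/* MIPS = instruction_count / (exec_time * 1e6) */
double mips_from_time(double instr_count, double exec_time_s)
{
    return instr_count / (exec_time_s * 1e6);
}

/* MIPS = clock_rate / (CPI * 1e6) */
double mips_from_clock(double clock_hz, double cpi)
{
    return clock_hz / (cpi * 1e6);
}
```

For instance, a 100 MHz machine with CPI 2.0 rates 50 MIPS by either formula (10^8 instructions take 2 s).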
Processors
Performance
Program selection:
a set of programs -> benchmarks
covering different aspects,
tested on different processors
www.BDTi.com — Berkeley Design Technology, Inc.

Example: DSP kernel benchmarks
Each kernel is implemented in hand-optimized assembly language on the target processor.
Video, OFDM, DQPSK benchmarks, ...
Processors
Going Deeper into Implementation
How to improve the implementation: increasing clock / decreasing CPI

Amdahl's Law (Gene Amdahl, IBM)
Expected speedup of a partial program improvement (fraction p improved by factor S):
speedup = 1 / ((1 - p) + p/S)
Expected speedup of a partial parallel implementation (fraction p on N units):
speedup = 1 / ((1 - p) + p/N)
%improvement = (1 - 1/speedup) x 100

For the parallel implementation:
speedup_max = lim (N -> inf) 1 / ((1 - p) + p/N) = 1 / (1 - p)

For the sequential case:
speedup_max = S_max / ((1 - p)(S_max - 1) + 1)
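Amdahl's Law as written above translates directly into code (function names are mine):

```c
/* Amdahl's Law: fraction p of the work is sped up by factor s
   (for the parallel case, s = N). */
double amdahl(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Limiting speedup as s -> infinity: bounded by the unimproved part. */
double amdahl_limit(double p)
{
    return 1.0 / (1.0 - p);
}
```

For example, with p = 0.9 the speedup can never exceed 10x, no matter how many parallel units are added.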
Processors
Going Deeper into Implementation — Pipelining, 1980s
Overlap the execution of instructions
Un-pipelined process: inputs I1, I2, I3, ... produce outputs O1, O2, O3, ...; each task takes time T
N independent sub-tasks done in N independent modules -> N-stage pipeline
Stages S1, S2, ..., SN, each taking T/N
Pipeline depth N; latency > 1 clock cycle
[Diagram: I1 ... IN march through the stages; the fill (prologue) and drain (epilogue) phases bracket the fully pipelined steady state]

Amdahl's Law view: if p is the fully pipelined portion, then for K inputs/outputs p = (K - 1)/K, and
speedup = KT / ((K + N - 1) T/N) = KN / (K + N - 1) = 1 / ((1 - p) + p/N)
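The pipeline speedup and its Amdahl form can be cross-checked numerically (a sketch; names are mine):

```c
/* Speedup of an N-stage pipeline over K tasks:
   un-pipelined time K*T versus pipelined (K + N - 1) * T/N. */
double pipeline_speedup(double k, double n)
{
    return (k * n) / (k + n - 1.0);
}

/* Same quantity via Amdahl's Law with p = (K-1)/K as the pipelined portion. */
double pipeline_speedup_amdahl(double k, double n)
{
    double p = (k - 1.0) / k;
    return 1.0 / ((1.0 - p) + p / n);
}
```

For large K the speedup approaches the stage count N; for K = 1 there is no gain at all (pure latency).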
Processors
Going Deeper into Implementation
Instruction Level Parallelism (ILP), 1990s
A measure of how many operations in a computer program can be performed simultaneously.
Example program, ILP = 3/2 [IPC (instructions per cycle) at the CPU level]:
1. x = a + b
2. y = c - d
3. z = x * f
(1 and 2 are independent and can run together; 3 must wait for x: 3 instructions in 2 cycles)
Design problem: the compiler or processor must increase ILP
ILP processors
Possibility of overlap among instructions
Pipelining is what can be done in a basic single block: IPC_max = 1
Substantial improvement can be achieved by having multiple blocks, IPC > 1:
speedup = 1 / ((1 - p)/S1 + p/(N x S2))
Processors
Going Deeper into Implementation — ILP
Single-issue architecture -> multi-issue architecture
Parallelism expands in space (1990s), using multiple basic units
VLIW (Very Long Instruction Word) processors
Static; no code compatibility — recompilation needed for different processors
Superscalar processors
Dynamic; code compatibility between processor family members

Dynamic-Static Interface (DSI): the gap between HW and SW
Placement of the DSI determines how the gap is bridged
Software (program): complexity exposed to the compiler -> static
--- DSI (architecture) ---
Hardware (machine): complexity hidden in hardware -> dynamic
Processors
The Role of the Compiler
Phases to manage complexity:
Parsing -> intermediate representation
Optimizations: loop optimizations, common sub-expression elimination, procedure inlining, jump optimization, constant propagation, strength reduction (replacement by simpler equivalent code), register allocation, pipeline scheduling
Code generation -> assembly code

Problems with optimization:
More important in the VLIW processor case
Directing the compiler + hand optimization are needed for full optimization
Digital Signal Processors
Microprocessors Optimized for Digital Signal Processing Algorithms
DSPs are shaped by DSP algorithms

Let's start with the Von Neumann architecture and implement an FIR filter:

y[n] = sum_{k=0..N-1} h[k] * x[n-k]

for (;;) {
    ReadNewSample(&xn);
    UpdateInputArray(xn);
    sum = 0;
    for (int k = 0; k < N; k++)
        sum += h[k] * x[n-k];   /* multiply-accumulate (MAC) */
    yn = sum;
    n++;
}

What the hardware needs: separate memories/paths for code and data, a MAC unit, faster memories
Digital Signal Processors
Modify the C code and see how to match the hardware:

for (;;) {
    ReadNewSample(&xn);
    UpdateInputArray(xn);
    sum = 0;
    a = &x[n-N+1];
    b = &h[N-1];
    for (int k = 0; k < N; k++)
        sum += (*a++) * (*b--);   /* or store h reversed and use *b++, so both
                                     pointers post-increment as DSP hardware does */
    yn = sum;
}

y[0] = h[0]*x[0] + h[1]*x[-1] + h[2]*x[-2]
y[1] = h[0]*x[1] + h[1]*x[0]  + h[2]*x[-1]
y[2] = h[0]*x[2] + h[1]*x[1]  + h[2]*x[0]

The same or similar ideas make FFT, convolution, and IIR filtering efficient — all are built on the inner (dot) product
Digital Signal Processors
The number and variety of products that include DSP algorithms, plus the challenge of being cost-, power-, and size-effective, led to special hardware: many dedicated ICs, and slightly more general programmable DSP chips.
Because of the high cost of IC design, DSP processors became the less risky and more popular solution, especially in low-volume applications.

DSP algorithms mold DSP architectures:
Every feature in a DSP processor is included to ease performing a DSP algorithm
The FIR filtering problem again: each tap needs a multiply and an add

1) Fast multipliers
General-purpose processors performed multiplication by a series of shift/add operations, needing multiple clock cycles. To go for higher performance:
The first commercially successful DSP, the TI TMS32010, included a single-cycle multiply, or a combined multiply-accumulate (MAC) unit, using specialized HW
Almost all modern DSPs have followed
Digital Signal Processors
2) Multiple execution units
Adding more independent execution units,
for example a MAC unit + an ALU and a shifter (barrel shifter: single-cycle)

3) Efficient memory access — higher memory bandwidth
Faster memories + more memories
a) Harvard architecture instead of Von Neumann (see the figures on the next slides),
or modified Harvard
Separate paths for data and program -> two memory accesses per clock cycle
-> single-cycle execution for single-memory-operand instructions
FIR needs more! It is a two-operand instruction
b) Using an instruction cache
A small block of RAM near the processor core
In loops (repeated instruction blocks) there is no need for instruction fetch, so another operand can be fetched from program memory
Digital Signal Processors
Von Neumann vs. Harvard Architecture
Not just a matter of separate memories, but also separate data paths
[Figure: Von Neumann (one bus to one internal memory) vs. Harvard (separate program and data buses/memories)]
Digital Signal Processors
Program memory used to provide filter coefficients
Multiple data memories; a program cache (even one instruction deep, to support the RPT command!)
Modified Harvard architectures
[Figure: modified Harvard memory organizations]
Digital Signal Processors
3) Efficient memory access (continued)
The predictable patterns of memory access in DSP algorithms lead to:
a) dedicated HW for calculating memory addresses,
b) which runs independently, in parallel with other parts,
c) accessing new memory locations without pausing

The most common addressing mode in DSP processors:
register indirect addressing with post-increment
Circular addressing support in dedicated HW:
access a block of specified length sequentially and wrap around,
as seen in the FIR delay line
Circular addressing is not part of C/C++ programming —
one of the reasons C compilers are not a perfect fit for DSP processors
Circular addressing is also very helpful in implementing FIFO buffers for I/O
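In C the wrap-around must be coded explicitly — the DSP's address-generation hardware does it for free; a minimal delay-line sketch (the buffer length and names are assumptions for illustration):

```c
#define DELAY_LEN 8   /* assumed buffer length for illustration */

typedef struct {
    float data[DELAY_LEN];
    int   head;                /* next write position */
} delay_line;

/* Push a sample, wrapping the index as circular-addressing HW would. */
void delay_push(delay_line *d, float x)
{
    d->data[d->head] = x;
    d->head = (d->head + 1) % DELAY_LEN;   /* explicit wrap in software */
}

/* Push `count` samples into a fresh delay line; return the final head. */
int demo_pushes(int count)
{
    delay_line d = {{0}, 0};
    for (int i = 0; i < count; i++)
        delay_push(&d, (float)i);
    return d.head;
}
```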
Digital Signal Processors
4) Data format
Numeric fidelity needs careful attention in DSP algorithms —
for example, avoiding overflow, covering the dynamic range, ...
DSP applications are normally easier to implement in floating-point format
Sensitivity to cost and energy consumption, however, pushes toward:
Fixed-point DSPs
Smaller word widths
Conventional DSPs usually had 16-bit architectures; a few had 20, 24, or 32 bits

To compensate, most conventional DSP processors:
a) have wider accumulator registers, providing extra bits called guard bits to extend the range of intermediate values
b) support saturation arithmetic, rounding, and shifting to avoid overflow and maintain numeric fidelity
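The guard-bit and saturation ideas can be sketched in C: accumulate in a wider type, then clamp instead of wrapping (a hypothetical helper, not from the slides):

```c
#include <stdint.h>

/* Saturating 16-bit add: the 32-bit intermediate plays the role of an
   accumulator with guard bits; the clamp is saturation arithmetic. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits     */
    if (s > INT16_MAX) s = INT16_MAX;      /* clip high instead of wrapping  */
    if (s < INT16_MIN) s = INT16_MIN;      /* clip low instead of wrapping   */
    return (int16_t)s;
}
```

With plain wrapping arithmetic, 30000 + 30000 would come out negative; saturation pins it at full scale, which is far less audible/visible in signal terms.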
Digital Signal Processors
5) Zero-overhead looping
The processing time of most DSP algorithms is spent in loops
DSP architectures usually provide special HW support for loops:
no extra clock cycles for updating the loop counter, testing loop expiry, and branching back

6) Streamed I/O
DSP architectures usually:
a) provide special support for serial ports, parallel ports, memory interfaces, ...
b) have special ports for specific streams, like the TI McBSP for connecting to audio codecs, or video ports in newer DSPs (BT.656-style, ~27 MHz)
c) use DMA to allow data transfer with little or no intervention from the computational units — most ports in TI DSPs are equipped with DMA

Digital Signal Processors
7) Specialized instruction set
DSP ISA design is cost-sensitive
Conventional DSP ISAs are traditionally designed to:
a) make maximum use of the underlying HW -> maximum performance
specifying multiple parallel operations in a single instruction, including several data fetches, multiple arithmetic operations, a few address-pointer updates, ...
b) minimize program code size -> minimum program memory
restricting the registers usable in different operations, restricting which operations can be done together, mode settings kept out of the instruction encoding (e.g. rounding and saturation modes)
-> Highly specialized, complicated, and irregular instruction sets

Another reason why DSP C/C++ compilers are not that efficient, and why low-level programming and hand optimization are needed
Also, C/C++ was not designed for describing DSP algorithms
Evolution of Digital Signal Processors
Conventional DSPs
Based on the conventional DSP architecture we have studied so far
1) Low-cost, low-performance range
Multiplier or MAC + an ALU, address generators
Still in use; very similar to the DSPs of the 1980s
20-50 MHz clocks
Examples: Motorola DSP560xx, TI TMS320C2xx, Lucent DSP16xx, AD ADSP-21xx

Typical applications:
consumer electronics, modest telecommunications products, hard disks
Evolution of Digital Signal Processors
Conventional DSPs
2) Midrange DSPs, higher performance
Through higher clock rates and more sophisticated architectures: deeper pipelines,
more HW units like barrel shifters, instruction caches, ...
Incremental (not dramatic) enhancement compared to (1)
100-150 MHz clocks
Examples: Motorola DSP563xx, TI TMS320C54x, AD ADSP-218x

Typical applications:
consumer electronics, telecommunications products, VoIP, wireless
Evolution of Digital Signal Processors
Conventional DSPs: review
Issues in the era of the old conventional DSPs:
slow external memory; pin-count problem; expensive internal memory;
expensive HW and the need for higher computational power
Design goals:
1) make maximum use of the processor's underlying HW
2) minimize the amount of memory needed to store DSP programs
-> Short instructions, each still capable of a few memory fetches plus some ALU ops
-> Fewer registers encodable in instructions
-> Mode bits to control features instead of encoding them in instructions
-> Short but irregular, complicated, and highly specialized instruction sets
-> Compilers are complicated and unfriendly; hand optimization is a must
Evolution of Digital Signal Processors
Enhanced Conventional DSPs
Going beyond a faster clock:
extend the conventional DSP architecture by adding more parallel execution units
Higher computational power, sometimes with no higher clock and no faster HW
Typically adding more ALUs, multipliers, or MAC units,
e.g. performing 2 MACs per cycle
Extended instruction set to support the new units
Wider data buses to retrieve more data in parallel and feed the new units
May need wider instruction words to include the parameters of the new units in one instruction
Higher cost and complexity, but higher performance
More advanced fabrication processes may help justify the modifications
Evolution of Digital Signal Processors
Enhanced Conventional DSPs
Still suffer from conventional DSP problems:
complex, irregular instruction sets — hard to code and hard to build compilers for

Example: Lucent DSP16XXX, with two MACs

Same range of applications
Evolution of Digital Signal Processors
Multi-Issue Approach
Design goals:
1) achieving higher performance
2) instruction sets that lend themselves to compilers
TI was the first to come up with a solution: the TMS320C62xx, 1996
Then Motorola, Analog Devices, and Lucent followed
Very simple instructions, typically encoding a single operation per instruction
High levels of parallelism achieved by issuing and executing instructions in parallel groups rather than one at a time
Simplified instruction decoding and execution -> higher clock rates
Many parallel execution units to execute instructions in parallel groups
-> High level of parallelism and higher performance,
but generally higher power consumption
Evolution of Digital Signal Processors
Multi-Issue Approach
Multi-issue processors: 1) superscalar, 2) VLIW — most multi-issue DSPs are VLIW
Both have many execution units compared to enhanced conventional DSPs
VLIW DSPs issue a maximum of 4 to 8 instructions per cycle,
fetched as parts of one long super-instruction
Grouping is done at programming time, by the compiler or programmer:
the assembly language programmer or code-generation tool specifies which instructions will be executed in parallel (at the time the program is assembled),
and the grouping does not change during execution
Wider instruction words, usually 32-bit, remove the restrictions on register use
Sufficient routing, buses, and memory bandwidth are needed
Simpler instructions mean more instructions are needed to do a task
Evolution of Digital Signal Processors
Multi-Issue Approach
Example: TI TMS320C62x architecture [block-diagram figure]
Note: no zero-overhead branching

Evolution of Digital Signal Processors
Multi-Issue Approach
Superscalar processors typically issue fewer instructions than VLIWs, say 2 to 4
They differ from VLIW in how instructions are grouped:
grouping is done at run time, by specialized HW, and differs depending on the situation,
based on data dependencies and resource contention
In superscalar processors:
the burden of grouping shifts from programmer to HW (easier to program)
the grouping of the same set of instructions may differ, for example across iterations of a loop
it is difficult for programmers to predict how long given segments of code will take —
for real-time applications, that is a disadvantage in meeting time constraints
extra cost/size/power for the dynamic features
Evolution of Digital Signal Processors
Multi-Issue Approach
Most multi-issue DSPs are based on the VLIW architecture —
as a rule, DSPs traditionally avoid dynamic features
Only a few superscalar DSPs are on the market, e.g. the ZSP500
Evolution of Digital Signal Processors
SIMD
Single Instruction, Multiple Data is not an architecture —
it is an architectural technique that improves performance:
the processor executes multiple instances of the same operation in parallel on different data
Two methods:
1) More independent units with separate registers, working on different data at the same time
e.g. the ADSP-2116x, which has two of the basic ADSP-2106x sets of execution units (MAC, ALU, shifter)
2) Logically splitting execution units into multiple sub-units,
for example an ALU used for one 32-bit add or two 16-bit adds
e.g. the ADI TigerSHARC is a VLIW DSP with two sets of units, and in each set the split-ALU and split-MAC can be used -> two levels of SIMD:
8 x 16-bit multiplications can be done in a single cycle
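The split-ALU method can be mimicked in plain C: one 32-bit add carrying two independent 16-bit lanes (a sketch that, unlike real split-ALU hardware, simply discards carry out of the low lane and ignores saturation; names are mine):

```c
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word. */
uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Two 16-bit adds performed lane-wise inside a 32-bit word. */
uint32_t add2x16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;           /* low lane, carry discarded */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;   /* high lane, computed alone */
    return hi | lo;
}
```

The key property is lane independence: an overflow in the low lane wraps within that lane instead of spilling into the high lane, which is exactly what a split-ALU must guarantee.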
Evolution of Digital Signal Processors
SIMD
Difficult to program:
needs programmer effort to modify the algorithms to fit the architecture,
not possible for all algorithms -> specific applications

Alternatives to DSPs in some applications:
high-performance CPUs,
DSP/micro-controller hybrids
Real-Time Embedded Systems
Goals:
To know why recent processors look the way we will see
To know the processors' environment and constraints
To know how to choose processors
To know when to go for HW, and when to partition the algorithm between HW and SW

Applications of embedded real-time systems:
Real-time systems maintain a continuous, timely interaction with the environment (RTC)
Hard real-time / soft real-time systems:
violating the constraints causes failure / performance degradation
DSP systems are usually hard real-time — feasibility? and cost?
Understanding the characteristics of the tasks and the execution environment:
bounds and constraints
Prediction of task and execution environment behavior:
corner/worst cases
Real-Time Embedded Systems
Task characteristics:
Timeliness parameters, e.g. arrival times, event rates
Deadlines
Resource utilization profiles
Relative importance
Worst-case execution time
Ready and suspension times
Precedence and exclusion constraints

Execution environment characteristics:
System loading
Service latencies
Resources and their interactions
Interrupt priorities and timing
Queuing disciplines
Arbitration mechanisms

Design flow: Specification -> HW/SW partitioning -> Architecture & tasks -> Allocation & scheduling -> Modeling, simulation, prototype -> Implementation -> Product
Real-Time Embedded Systems
Real-time versus time-shared systems
Time-shared systems use multi-programming or multi-tasking to maximize throughput
Real-time systems are designed to provide predictably fast response to urgent tasks
(they might also run some non-real-time tasks)
Differences — real-time systems need:
High degree of schedulability:
the timing requirements of the system must be satisfied even at high degrees of resource usage
Worst-case latency:
ensuring the system still operates under the worst-case response time to events
Stability under transient overload:
when the system is overloaded by events and it is impossible to meet all deadlines, the deadlines of selected critical tasks must still be guaranteed
Real-Time Embedded Systems
Real-time versus time-shared systems
Real-time task categories:
1) Synchronous — sharply predictable; such tasks can be packed into one task
2) Asynchronous — unpredictable
3) Isochronous — loosely predictable (within a time window)
Examples —
S: camcorder audio and video;
A: calls from phones received by a BTS;
I: audio and video in video-over-IP systems — when video arrives, the related audio will arrive within a time window
Real-Time Embedded Systems
Scheduling and Resource Allocation to meet all the deadlines:
1) Offline algorithms: done by the designer
2) Online algorithms: done by the OS or other software
2-1) Static (fixed priorities): e.g. RMA, Rate Monotonic Algorithm (higher-rate tasks get higher priority)
2-2) Dynamic (changeable priorities): e.g. EDF, Earliest Deadline First
Static priority assignment is suitable for deterministic tasks
A truly hard real-time system is not feasible except for deterministic tasks
In DSP systems, complicated RTOSes are avoided when possible!
Example: DSP/BIOS from TI, which works on all TI DSPs
A pre-emptive static-priority RTOS: 15 task priorities, HWI, SWI
On top of this: use the processor resources as much as possible
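The difference between the two online policies (RMA versus EDF) can be sketched in a few lines of Python; the task names, periods, and deadlines below are illustrative. RMA fixes priorities once, offline, by period; EDF re-decides at every scheduling point by absolute deadline:

```python
# RMA: fixed priorities assigned offline -- shorter period means higher priority.
tasks = [("ctrl", 5.0), ("audio", 10.0), ("logger", 100.0)]  # (name, period)
rma_order = [name for name, period in sorted(tasks, key=lambda t: t[1])]
# rma_order[0] is always the highest-priority task, regardless of runtime state.

# EDF: dynamic choice at runtime -- among the currently ready jobs,
# the one with the earliest absolute deadline runs next.
ready_jobs = [("audio", 12.0), ("ctrl", 15.0), ("logger", 40.0)]  # (name, abs deadline)
edf_pick = min(ready_jobs, key=lambda j: j[1])[0]

print(rma_order)  # ['ctrl', 'audio', 'logger']
print(edf_pick)   # 'audio': earliest deadline wins even though ctrl has a shorter period
```

Note how EDF can pick a task that RMA would rank lower, which is why EDF can reach higher utilization but makes behavior under overload harder to predict.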
Real-Time Embedded Systems
Embedded Real-time System Requirements
1) Efficiency
Performance vs. cost: processor cycles, power, size, memory
Achieved through selection and optimization
2) Acceptable Timeliness
Achieved through resource management
The complexity of the resource management adds cost:
an expensive over-capacity processor, a complicated OS (online/real-time resource management)
The resource management that is used is usually static and requires analysis of the system prior to executing it in its environment.
Real-Time Embedded Systems
Selection/Design Guide
1. Response Time: optimal partitioning into HW and SW
1) Is the architecture suitable?
2) Are the programmable processing resources powerful enough?
High utilization (> 90%) makes the system unpredictable and increases development time
3) Are the communication speeds adequate?
4) Are enough communication ports/IOs available?
5) Is the right scheduling method available/possible?
2. Failure Recovery: there is no reset button (or it is hard to access), and it is not possible to check all cases
Internal or external failures: processor, board, or link/connectivity failure, invalid environmental behavior; simulation can help

Real-Time Embedded Systems
Selection/Design Guide
3. Scheduling Loss
Liu & Layland (1973):
For n periodic tasks with fixed periods, a feasible schedule that will always meet deadlines exists if the CPU utilization is below a specific bound (depending on the number of tasks):

U = sum_{k=1..n} C_k / T_k <= n * (2^(1/n) - 1) = U_max(n)

n = 1: U_max = 100%
n = 2: U_max ~= 83%
n -> infinity: U_max -> ln 2 ~= 69.3%

C_k: worst-case computation time of task k
T_k: release period of task k
n: number of scheduled tasks
The other 30.7% of the CPU can be dedicated to lower-priority non-real-time tasks.
Rule of thumb: at 90% utilization, development takes twice the time; at 95%, triple the time.
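The Liu & Layland bound above is straightforward to evaluate in code; a minimal sketch in Python (function names are my own, and this is the sufficient-but-not-necessary form of the test):

```python
import math

def rma_bound(n: int) -> float:
    """Liu & Layland utilization bound for n periodic tasks under RMA."""
    return n * (2 ** (1.0 / n) - 1)

def rma_schedulable(tasks) -> bool:
    """Sufficient test: tasks = [(C_k, T_k), ...]. Passing guarantees a
    feasible fixed-priority schedule; failing is inconclusive."""
    u = sum(c / t for c, t in tasks)
    return u <= rma_bound(len(tasks))

print(round(rma_bound(1), 3))  # 1.0   -> 100% for a single task
print(round(rma_bound(2), 3))  # 0.828 -> ~83% for two tasks
print(round(math.log(2), 3))   # 0.693 -> ~69% limit as n grows
```

For example, a two-task set with C/T pairs (1, 10) and (2, 10) has U = 0.3 and passes, while (5, 10) and (5, 10) has U = 1.0 and fails the test.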
Real-Time Embedded Systems
Selection/Design Guide

4. Distributed and Multi-Processor Architectures
Several nodes: DSP + µP + FPGA
Common practice in today's designs
Consider the following points when deciding:
4-1) Initialization
4-2) Processor interfaces
4-3) Load distribution and timeliness
4-4) Managing shared resources
Real-Time Embedded Systems
Characteristics of Embedded Systems
Application Specific
Monitoring and reacting to the environment
Processing the Information
Controlling the environment
Examples:
Airbag system,
Digital Still Camera,
Cell Phone,

Real-Time Embedded Systems
Components of Embedded Systems
Firmware: programs deep in the HW, not often changed
Software: programs that can be changed by the user
Memory: RAM, ROM, Flash, ...
User interface: from a few LEDs to a GUI on an LCD
Sensors: sense the real world
Actuators: react to or control the real world
Emulation and diagnostics:
e.g. JTAG (Joint Test Action Group)
Application-specific gates
Analog IO:
A/D and D/A + ASP (filters, amplifiers, ...)
Last but not least, the processor: from an 8-bit MCU to a 64-bit µP; can be an ASIC or FPGA
Real-Time Embedded Systems
Development Life Cycle
1) Examine the overall system needs
BDTI selection criteria:
Price and BOM size, performance, time to market, power, size, features, ...
Rules in this phase:
For fixed cost, maximize the performance
For fixed performance, minimize the cost
Real-Time Embedded Systems
Development Life Cycle
2) Select the Hardware Components Required
µC / MCU versus FPGA (ASIC)
ASIC: only for extremely low power or extremely high performance at high volume, with a good time-to-market margin; less flexible
FPGA: more flexible, faster time to market than an ASIC; faster than processors, but more development time
Typical application: radar
Real-Time Embedded Systems
Development Life Cycle
2) Select the Hardware Components Required
DSP versus µP (GPP)
Go for an MCU when low cost, low chip count, and low processing power are needed
Go for a DSP only if cost, size, and power requirements cannot be met by a GPP, or if other DSP-related features are needed, such as fast IO, low-latency interrupts, or specific instructions
Real-Time Embedded Systems
Development Life Cycle
2) Select the Hardware Components Required
Other rules in this phase:
FPGAs are good for bit-manipulation applications
Processors are better for numerical applications
If a single DSP will do the job, go for it!
A DSP is better than an FPGA if the algorithms are complex and the needed resources already exist in DSPs
Have a DSP if special memory accesses are needed
Have an FPGA along with a DSP if possible to:
Use a smaller/cheaper DSP/MCU by offloading some computationally intensive tasks to the FPGA
Increase the computational throughput
Prototype a new signal processing algorithm (like an EVM)
Pack the glue logic and keep it flexible
Have a GPP along with a DSP if there are many non-real-time tasks
In embedded DSP systems development, the technical trend is towards programmability
Real-Time Embedded Systems
Development Life Cycle
2) Select the Hardware Components Required
Programmable or HW or mixed?
Most of the time, cost has the final say!
Cost is a multifold issue, usually addressed by a mixed strategy:
device cost, NRE (non-recurring engineering), manufacturing cost, opportunity cost,
low time-to-market gain, physical advantages (power dissipation, weight, size, ...)
3) Understanding DSP basics and architecture
What makes a DSP a DSP?!
How fast can it go?
How can I achieve maximum performance? Max # of channels for an algorithm
IO options? GPIO, UART, CAN, SPI, USB, McBSP, HPI
IO speed and performance: can the IO keep up with the max # of channels?
Memory speed?
High-speed internal memory? Special access modes, DMA
Sample-based and frame-based processing