
Sharif University of Technology, EE Department, Embedded Real-Time Systems Course, imangh@sharif.edu
Sharif University of Technology, EE Department
Embedded Real-Time Systems
Spring 2017
Part One
Embedded Real-Time Systems
Analog Processing
Analog computers, Fourier optics

Why Go Digital?
Programmability
One hardware platform can perform several tasks
Upgradeability and flexibility
Repeatability
Identical performance from unit to unit
No drift in performance due to temperature or aging
Immunity to noise
Offers higher quality or performance
(compare CD players with phonograph turntables)
Embedded Real-Time Systems
Embedded Systems:
Computer systems designed to perform one or a few dedicated tasks
Examples: MP3 players, traffic control systems, radars, ...
Compared to general-purpose (GP) computers, which stay flexible to end-user needs
Can be optimized for cost, size, power consumption, ...

Real-Time Systems:
Systems subject to a real-time constraint:
operational deadlines from event to system response
Examples: video recorders, ...
Compared to non-real-time systems
Embedded Real-Time Systems
Real-Time Systems

Hard (Immediate) Real-Time Systems
Correct execution of the main task depends on the duration of execution
Deadline concept ~ real-time constraint (RTC)
Example: car engine controller
[Timing diagram: between sample n and sample n+1, waiting time plus processing time fill the sample interval; real-time operation requires waiting time > 0]
Latency = transmission (acquisition) delay + algorithmic delay

Soft Real-Time Systems
Completion after the deadline is tolerated, at the price of lost QoS
Example: dropping frames in a video chat
Processors
History
First commercial microprocessor: the 4-bit Intel 4004, 1971
4-bit processors were followed by 8-, 16-, 32-, and 64-bit ones
The most successful family started with the 16-bit Intel 8086
x86 / IA-32 (i386) architecture; the 32-bit line started with the 80386

The architecture is the processor's contents from the programmer's vantage point

Moore's Law, 1965
The number of transistors that can be integrated on a single piece of silicon will double roughly every 18-24 months
Has held true for more than 45 years now; may hold true for another decade
Roughly applies to both density and clock frequency: denser ~ faster
Processors
[Figure: transistor dimensions shrinking across process generations]
Processors
Moore's Law
Challenge for processor designers:
Make performance follow at least (Moore's law)^2
density effect x clock effect
adding improvements through innovations: micro-architectures, multi-core, ...
Performance has not gone up that fast! It follows Moore's law
Orchestration problem: 1971-2009, 2^(38/2) ~ 524,000x
Power consumption bottleneck
Heat dissipation: not worth it to increase the clock

Intel rule of thumb:
Increasing the clock rate by 25% yields approximately a 15% performance increase,
but power consumption doubles

MIPS-per-Watt challenge -> change of viewpoint
Processors
Processor Design
Architecture (ISA)
programmer/compiler view
Functional structure; interface to the user/system programmer
Op-codes, addressing modes, registers, number formats

Implementation (Micro-architecture)
processor designer view
Logical structure, pipelining
Functional units, caches, physical registers

Realization (Chip)
chip/system designer view
Physical structure for the implementation
Gates, cells, transistors, interconnect
Processors
Iron Law (Joel Emer)

Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
(to be minimized)

Level:       Architecture        Implementation       Realization
Determines:  Code size           CPI                  Cycle time
Owner:       Compiler designer   Processor designer   Chip designer

Instructions/Program: instructions executed, not static code size
Determined by algorithm, compiler, ISA

Cycles/Instruction (CPI): determined by ISA and CPU organization
Overlap among instructions reduces this term

Time/Cycle: determined by technology, organization, clever circuit design
Processors
Iron Law Examples

Processor A: clock 1 ns, CPI 2.0; program P takes N instructions on A
Processor B: clock 2 ns, CPI 1.2; program P takes N instructions on B

Time(A) = N x 2.0 x 1 ns = 2N
Time(B) = N x 1.2 x 2 ns = 2.4N
Time(B)/Time(A) = 2.4N/2N = 1.2 -> A is 20% faster on program P

For the performance of B to reach that of A, any one of:
CPI(B) improved to 1.0
Clock(B) shortened to 1.667 ns
ISA(B) redesigned to support "golden" instructions, so that only 0.833N instructions perform P
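The comparison above can be checked mechanically with the Iron Law; a minimal sketch (function names are mine, not from the slides):

```c
#include <stdio.h>

/* Iron Law: time = (instructions/program) x (cycles/instruction) x (time/cycle) */
double exec_time(double instructions, double cpi, double clock_ns)
{
    return instructions * cpi * clock_ns;
}

/* Time(B)/Time(A) for the slide's processors; N cancels, so N = 1.0 is used. */
double ratio_b_over_a(void)
{
    double ta = exec_time(1.0, 2.0, 1.0);  /* A: CPI 2.0, 1 ns clock -> 2.0N */
    double tb = exec_time(1.0, 1.2, 2.0);  /* B: CPI 1.2, 2 ns clock -> 2.4N */
    return tb / ta;                        /* 1.2 -> A is 20% faster */
}
```

Each proposed fix for B (CPI 1.0, a 1.667 ns clock, or 0.833N instructions) brings its exec_time back to 2.0N.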
Processors
Iron Law Examples

Option: stores can be executed in 1 cycle by slowing the clock down by 15%.
Is it better to adopt the modification?

op       frequency in P   cycles
ALU      43%              1
Load     21%              1
Store    12%              2
Branch   24%              2

oldCPI = 0.43 + 0.21 + 0.12 x 2 + 0.24 x 2 = 1.36
newCPI = 0.43 + 0.21 + 0.12 + 0.24 x 2 = 1.24   (Store: 12%, now 1 cycle)

Speedup = oldtime/newtime
        = (P x oldCPI x T) / (P x newCPI x 1.15T)
        = 1.36 / (1.24 x 1.15) = 0.95

Don't do it!
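The slide's evaluation can be reproduced with a small weighted-CPI helper (a sketch; names are mine):

```c
/* Weighted CPI over an instruction mix (frequencies sum to 1). */
double weighted_cpi(const double freq[], const double cycles[], int n)
{
    double cpi = 0.0;
    for (int i = 0; i < n; i++)
        cpi += freq[i] * cycles[i];
    return cpi;
}

/* Speedup of the "1-cycle store, 15% slower clock" option. */
double store_option_speedup(void)
{
    double freq[]       = {0.43, 0.21, 0.12, 0.24}; /* ALU, Load, Store, Branch */
    double old_cycles[] = {1, 1, 2, 2};
    double new_cycles[] = {1, 1, 1, 2};             /* stores now take 1 cycle  */
    double old_cpi = weighted_cpi(freq, old_cycles, 4);   /* 1.36 */
    double new_cpi = weighted_cpi(freq, new_cycles, 4);   /* 1.24 */
    return old_cpi / (new_cpi * 1.15);              /* clock period 15% longer  */
}
```

The result is about 0.95, i.e. a slowdown, which is the slide's conclusion.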
Processors
Instruction Set Architecture (ISA)
The boundary between software and hardware
Specifies the functional machine that is visible to the programmer
Also a functional spec for the processor designers

What an ISA needs to specify:
Operations: what to perform and what to perform next
Temporary operand storage in the CPU: accumulator, stacks, registers
Number of operands per instruction
Operand location: where and how to specify the operands
Type and size of operands
Instruction-to-binary encoding
Processors
Basic ISA Classification
Stack architecture (zero operands):
Operands popped from stack(s); result pushed on the stack
Accumulator (one operand):
A special accumulator register is the implicit operand; the other operand comes from memory
Register-Memory (two operands):
One operand from a register, the other from memory or a register
Generally, one of the source operands is also the destination
A few architectures have allowed memory-to-memory operations
Register-Register or Load/Store (three operands):
All operands of ALU instructions must be registers
General format: Rd <- Rs op Rt
Separate Load and Store instructions for memory access
Processors
Important ISA Considerations
Number of registers
Data types/sizes
Addressing modes
Instruction complexity
Branch/jump/function call
Exception handling
Instruction format/size/regularity

Data type / size:
Fixed point: 8-, 16-, 24-, 32-, ... bit
Floating point: IEEE 754 standard, 32- and 64-bit
Processors
Addressing Modes
Register indirect: M[Ri]
Indexed: M[Ri + Rj]
Absolute: M[#n]
Memory indirect: M[M[Ri]]
Auto-increment: M[Ri]; Ri += d
Auto-decrement: M[Ri]; Ri -= d
Scaled: M[Ri + #n + Rj x d]
Update: M[Ri = Ri + #n]
(#n: immediate value; Ri, Rj: registers; Ri + #n: displacement; M: memory)
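Several of these modes correspond to familiar C pointer idioms; a rough mapping (which mode the compiler actually emits depends on the ISA, and the function name is mine):

```c
/* C idioms a compiler may lower to the addressing modes listed above. */
int addressing_demo(int *Ri, int j)
{
    int v = 0;
    v += *Ri;        /* register indirect: M[Ri]          */
    v += Ri[j];      /* indexed:           M[Ri + Rj]     */
    v += *(Ri + 2);  /* displacement:      M[Ri + #n]     */
    v += *Ri++;      /* auto-increment:    M[Ri]; Ri += d */
    return v;
}
```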

Branches
Conditional/unconditional
A function call is similar, but needs parameter passing, saving state,
restoring state, latency, ...
Processors
Modern ISAs
Operations: simple ALU ops, data movement, control transfer
Temporary operand storage in the CPU:
Large general-purpose register (GPR) file
Load/Store architecture
Three operands per ALU instruction (all registers): A <- B op C
Addressing modes:
Limited addressing modes, e.g. register indirect addressing only
Type and size of operands:
32/64-bit integers, IEEE floats
Instruction-to-binary encoding:
Fixed width, regular fields

Important exception: Intel x86
Processors
MIPS ISA
The MIPS ISA was one of the first RISC instruction sets (1985)

Main characteristics:

Load-Store Architecture

Three operand format (Rd Rs op Rt)

Simple instruction format

32 General Purpose Registers

Only one addressing mode for memory operands: reg. indirect + displacement

Limited, highly orthogonal instruction set: 52 instructions

Simple branch/jump/subroutine call architecture

Processors
x86 ISA, a CISC ISA
First introduced with the Intel 8086 processor in 1978; evolved over the years
Main characteristics:
Reg/Mem architecture: ALU instructions can have memory operands
Two-operand format: one source operand can also be the destination
Eight general-purpose registers
Seven memory addressing modes
More than 500 instructions
The instruction set is non-orthogonal
Highly variable instruction size and format: instruction size varies from 1 to 17 bytes
Processors
ARM and Thumb ISAs
The ARM processor architecture provides support for:
The 32-bit ARM and 16-bit Thumb instruction set architectures
Reduced Instruction Set Computer (RISC) architecture
Load/store architecture
Simple addressing modes (determined from register contents and the instruction)
Sixteen 32-bit registers
8-, 16-, and 32-bit data types
Instructions that combine a shift with an arithmetic or logical operation
Auto-increment and auto-decrement addressing modes to optimize loops
Load and Store Multiple instructions to maximize data throughput
Conditional execution of almost all instructions to maximize execution throughput
ARM uses the Unified Assembler Language (UAL) to provide a canonical form for all ARM and Thumb instructions

Processors
ARM and Thumb ISAs
Enhancements to a basic RISC architecture enable ARM processors to achieve a good balance of high performance, small code size, low power consumption, and small silicon area.

ARM6, ARM7, ARM9, ARM11, Cortex, ...

ARM7TDMI
3-stage pipeline
Von Neumann bus structure (ARM9: Harvard) — www.arm.com
TDMI: Thumb, Debug (JTAG) interface, enhanced Multiplier, embedded ICE
CPI ~ 1.9
20 billion chips created; about 10 million shipped every day!
About 60 instructions
Processors
Performance Depends on What Is Important!
Execution time, throughput, cost, area, power, ...
Execution time (elapsed time) is the time to finish a job
Throughput is completions per second, i.e. the number of jobs done per second
Faster CPUs or more CPUs to improve performance

Performance metrics: MIPS, MFLOPS
MIPS = instruction count / (execution time x 10^6) = clock rate / (CPI x 10^6)
MIPS has serious shortcomings
MFLOPS = FP operations in program / (execution time x 10^6)
Assumes FP operations are independent of compiler and ISA
However, this is not always safe:
missing instructions (e.g. FP divide, sqrt/sin/cos) and optimizing compilers
Relative MIPS and normalized MFLOPS: normalized to some common baseline machine
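The two MIPS formulas above are equivalent because execution time = instructions x CPI / clock rate; a sketch (function names are mine):

```c
/* MIPS = instruction_count / (exec_time * 1e6) */
double mips_from_time(double instr_count, double exec_time_s)
{
    return instr_count / (exec_time_s * 1e6);
}

/* MIPS = clock_rate / (CPI * 1e6) */
double mips_from_clock(double clock_hz, double cpi)
{
    return clock_hz / (cpi * 1e6);
}
```

For instance, a 100 MHz machine with CPI 2.0 rates 50 MIPS by either formula (10^8 instructions take 2 s).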
Processors
Performance
Program selection:
a set of programs -> benchmarks
covering different aspects,
tested on different processors
www.BDTi.com — Berkeley Design Technology, Inc.

Example: DSP kernel benchmarks
Each kernel is implemented in hand-optimized assembly language on the target processor.
Video, OFDM, DQPSK benchmarks, ...
Processors
Going Deeper into Implementation
How to improve the implementation: increasing clock / decreasing CPI

Amdahl's Law (Gene Amdahl, IBM)
Expected speedup of a partial program improvement (fraction p improved by factor S):
speedup = 1 / ((1 - p) + p/S)
Expected speedup of a partial parallel implementation (fraction p on N units):
speedup = 1 / ((1 - p) + p/N)
%improvement = (1 - 1/speedup) x 100

For the parallel implementation:
speedup_max = lim (N -> inf) 1 / ((1 - p) + p/N) = 1 / (1 - p)

For the sequential case:
speedup_max = S_max / ((1 - p)(S_max - 1) + 1)
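Amdahl's Law as written above translates directly into code (function names are mine):

```c
/* Amdahl's Law: fraction p of the work is sped up by factor s
   (for the parallel case, s = N). */
double amdahl(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Limiting speedup as s -> infinity: bounded by the unimproved part. */
double amdahl_limit(double p)
{
    return 1.0 / (1.0 - p);
}
```

For example, with p = 0.9 the speedup can never exceed 10x, no matter how many parallel units are added.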
Processors
Going Deeper into Implementation — Pipelining, 1980s
Overlap the execution of instructions
Un-pipelined process: inputs I1, I2, I3, ... produce outputs O1, O2, O3, ...; each task takes time T
N independent sub-tasks done in N independent modules -> N-stage pipeline
Stages S1, S2, ..., SN, each taking T/N
Pipeline depth N; latency > 1 clock cycle
[Diagram: I1 ... IN march through the stages; the fill (prologue) and drain (epilogue) phases bracket the fully pipelined steady state]

Amdahl's Law view: if p is the fully pipelined portion, then for K inputs/outputs p = (K - 1)/K, and
speedup = KT / ((K + N - 1) T/N) = KN / (K + N - 1) = 1 / ((1 - p) + p/N)
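The pipeline speedup and its Amdahl form can be cross-checked numerically (a sketch; names are mine):

```c
/* Speedup of an N-stage pipeline over K tasks:
   un-pipelined time K*T versus pipelined (K + N - 1) * T/N. */
double pipeline_speedup(double k, double n)
{
    return (k * n) / (k + n - 1.0);
}

/* Same quantity via Amdahl's Law with p = (K-1)/K as the pipelined portion. */
double pipeline_speedup_amdahl(double k, double n)
{
    double p = (k - 1.0) / k;
    return 1.0 / ((1.0 - p) + p / n);
}
```

For large K the speedup approaches the stage count N; for K = 1 there is no gain at all (pure latency).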
Processors
Going Deeper into Implementation
Instruction Level Parallelism (ILP), 1990s
A measure of how many operations in a computer program can be performed simultaneously.
Example program, ILP = 3/2 [IPC (instructions per cycle) at the CPU level]:
1. x = a + b
2. y = c - d
3. z = x * f
(1 and 2 are independent and can run together; 3 must wait for x: 3 instructions in 2 cycles)
Design problem: the compiler or processor must increase ILP
ILP processors
Possibility of overlap among instructions
Pipelining is what can be done in a basic single block: IPC_max = 1
Substantial improvement can be achieved by having multiple blocks, IPC > 1:
speedup = 1 / ((1 - p)/S1 + p/(N x S2))
Processors
Going Deeper into Implementation — ILP
Single-issue architecture -> multi-issue architecture
Parallelism expands in space (1990s), using multiple basic units
VLIW (Very Long Instruction Word) processors
Static; no code compatibility — recompilation needed for different processors
Superscalar processors
Dynamic; code compatibility between processor family members

Dynamic-Static Interface (DSI): the gap between HW and SW
Placement of the DSI determines how the gap is bridged
Software (program): complexity exposed to the compiler -> static
--- DSI (architecture) ---
Hardware (machine): complexity hidden in hardware -> dynamic
Processors
The Role of the Compiler
Phases to manage complexity:
Parsing -> intermediate representation
Optimizations: loop optimizations, common sub-expression elimination, procedure inlining, jump optimization, constant propagation, strength reduction (replacement by simpler equivalent code), register allocation, pipeline scheduling
Code generation -> assembly code

Problems with optimization:
More important in the VLIW processor case
Directing the compiler + hand optimization are needed for full optimization
Digital Signal Processors
Microprocessors Optimized for Digital Signal Processing Algorithms
DSPs are shaped by DSP algorithms

Let's start with the Von Neumann architecture and implement an FIR filter:

y[n] = sum_{k=0..N-1} h[k] * x[n-k]

for (;;) {
    ReadNewSample(&xn);
    UpdateInputArray(xn);
    sum = 0;
    for (int k = 0; k < N; k++)
        sum += h[k] * x[n-k];   /* multiply-accumulate (MAC) */
    yn = sum;
    n++;
}

What the hardware needs: separate memories/paths for code and data, a MAC unit, faster memories
Digital Signal Processors
Modify the C code and see how to match the hardware:

for (;;) {
    ReadNewSample(&xn);
    UpdateInputArray(xn);
    sum = 0;
    a = &x[n-N+1];
    b = &h[N-1];
    for (int k = 0; k < N; k++)
        sum += (*a++) * (*b--);   /* or store h reversed and use *b++, so both
                                     pointers post-increment as DSP hardware does */
    yn = sum;
}

y[0] = h[0]*x[0] + h[1]*x[-1] + h[2]*x[-2]
y[1] = h[0]*x[1] + h[1]*x[0]  + h[2]*x[-1]
y[2] = h[0]*x[2] + h[1]*x[1]  + h[2]*x[0]

The same or similar ideas make FFT, convolution, and IIR filtering efficient — all are built on the inner (dot) product
Digital Signal Processors
The number and variety of products that include DSP algorithms, plus the challenge of being cost-, power-, and size-effective, led to special hardware: many dedicated ICs, and slightly more general programmable DSP chips.
Because of the high cost of IC design, DSP processors became the less risky and more popular solution, especially in low-volume applications.

DSP algorithms mold DSP architectures:
Every feature in a DSP processor is included to ease performing a DSP algorithm
The FIR filtering problem again: each tap needs a multiply and an add

1) Fast multipliers
General-purpose processors performed multiplication by a series of shift/add operations, needing multiple clock cycles. To go for higher performance:
The first commercially successful DSP, the TI TMS32010, included a single-cycle multiply, or a combined multiply-accumulate (MAC) unit, using specialized HW
Almost all modern DSPs have followed
Digital Signal Processors
2) Multiple execution units
Adding more independent execution units,
for example a MAC unit + an ALU and a shifter (barrel shifter: single-cycle)

3) Efficient memory access — higher memory bandwidth
Faster memories + more memories
a) Harvard architecture instead of Von Neumann (see the figures on the next slides),
or modified Harvard
Separate paths for data and program -> two memory accesses per clock cycle
-> single-cycle execution for single-memory-operand instructions
FIR needs more! It is a two-operand instruction
b) Using an instruction cache
A small block of RAM near the processor core
In loops (repeated instruction blocks) there is no need for instruction fetch, so another operand can be fetched from program memory
Digital Signal Processors
Von Neumann vs. Harvard Architecture
Not just a matter of separate memories, but also separate data paths
[Figure: Von Neumann (one bus to one internal memory) vs. Harvard (separate program and data buses/memories)]
Digital Signal Processors
Program memory used to provide filter coefficients
Multiple data memories; a program cache (even one instruction deep, to support the RPT command!)
Modified Harvard architectures
[Figure: modified Harvard memory organizations]
Digital Signal Processors
3) Efficient memory access (continued)
The predictable patterns of memory access in DSP algorithms lead to:
a) dedicated HW for calculating memory addresses,
b) which runs independently, in parallel with other parts,
c) accessing new memory locations without pausing

The most common addressing mode in DSP processors:
register indirect addressing with post-increment
Circular addressing support in dedicated HW:
access a block of specified length sequentially and wrap around,
as seen in the FIR delay line
Circular addressing is not part of C/C++ programming —
one of the reasons C compilers are not a perfect fit for DSP processors
Circular addressing is also very helpful in implementing FIFO buffers for I/O
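In C the wrap-around must be coded explicitly — the DSP's address-generation hardware does it for free; a minimal delay-line sketch (the buffer length and names are assumptions for illustration):

```c
#define DELAY_LEN 8   /* assumed buffer length for illustration */

typedef struct {
    float data[DELAY_LEN];
    int   head;                /* next write position */
} delay_line;

/* Push a sample, wrapping the index as circular-addressing HW would. */
void delay_push(delay_line *d, float x)
{
    d->data[d->head] = x;
    d->head = (d->head + 1) % DELAY_LEN;   /* explicit wrap in software */
}

/* Push `count` samples into a fresh delay line; return the final head. */
int demo_pushes(int count)
{
    delay_line d = {{0}, 0};
    for (int i = 0; i < count; i++)
        delay_push(&d, (float)i);
    return d.head;
}
```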
Digital Signal Processors
4) Data format
Numeric fidelity needs careful attention in DSP algorithms —
for example, avoiding overflow, covering the dynamic range, ...
DSP applications are normally easier to implement in floating-point format
Sensitivity to cost and energy consumption, however, pushes toward:
Fixed-point DSPs
Smaller word widths
Conventional DSPs usually had 16-bit architectures; a few had 20, 24, or 32 bits

To compensate, most conventional DSP processors:
a) have wider accumulator registers, providing extra bits called guard bits to extend the range of intermediate values
b) support saturation arithmetic, rounding, and shifting to avoid overflow and maintain numeric fidelity
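The guard-bit and saturation ideas can be sketched in C: accumulate in a wider type, then clamp instead of wrapping (a hypothetical helper, not from the slides):

```c
#include <stdint.h>

/* Saturating 16-bit add: the 32-bit intermediate plays the role of an
   accumulator with guard bits; the clamp is saturation arithmetic. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits     */
    if (s > INT16_MAX) s = INT16_MAX;      /* clip high instead of wrapping  */
    if (s < INT16_MIN) s = INT16_MIN;      /* clip low instead of wrapping   */
    return (int16_t)s;
}
```

With plain wrapping arithmetic, 30000 + 30000 would come out negative; saturation pins it at full scale, which is far less audible/visible in signal terms.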
Digital Signal Processors
5) Zero-overhead looping
The processing time of most DSP algorithms is spent in loops
DSP architectures usually provide special HW support for loops:
no extra clock cycles for updating the loop counter, testing loop expiry, and branching back

6) Streamed I/O
DSP architectures usually:
a) provide special support for serial ports, parallel ports, memory interfaces, ...
b) have special ports for specific streams, like the TI McBSP for connecting to audio codecs, or video ports in newer DSPs (BT.656-style, ~27 MHz)
c) use DMA to allow data transfer with little or no intervention from the computational units — most ports in TI DSPs are equipped with DMA

Digital Signal Processors
7) Specialized instruction set
DSP ISA design is cost-sensitive
Conventional DSP ISAs are traditionally designed to:
a) make maximum use of the underlying HW -> maximum performance
specifying multiple parallel operations in a single instruction, including several data fetches, multiple arithmetic operations, a few address-pointer updates, ...
b) minimize program code size -> minimum program memory
restricting the registers usable in different operations, restricting which operations can be done together, mode settings kept out of the instruction encoding (e.g. rounding and saturation modes)
-> Highly specialized, complicated, and irregular instruction sets

Another reason why DSP C/C++ compilers are not that efficient, and why low-level programming and hand optimization are needed
Also, C/C++ was not designed for describing DSP algorithms
Evolution of Digital Signal Processors
Conventional DSPs
Based on the conventional DSP architecture we have studied so far
1) Low-cost, low-performance range
Multiplier or MAC + an ALU, address generators
Still in use; very similar to the DSPs of the 1980s
20-50 MHz clocks
Examples: Motorola DSP560xx, TI TMS320C2xx, Lucent DSP16xx, AD ADSP-21xx

Typical applications:
consumer electronics, modest telecommunications products, hard disks
Evolution of Digital Signal Processors
Conventional DSPs
2) Midrange DSPs, higher performance
Through higher clock rates and more sophisticated architectures: deeper pipelines,
more HW units like barrel shifters, instruction caches, ...
Incremental (not dramatic) enhancement compared to (1)
100-150 MHz clocks
Examples: Motorola DSP563xx, TI TMS320C54x, AD ADSP-218x

Typical applications:
consumer electronics, telecommunications products, VoIP, wireless
Evolution of Digital Signal Processors
Conventional DSPs: review
Issues in the era of the old conventional DSPs:
slow external memory; pin-count problem; expensive internal memory;
expensive HW and the need for higher computational power
Design goals:
1) make maximum use of the processor's underlying HW
2) minimize the amount of memory needed to store DSP programs
-> Short instructions, each still capable of a few memory fetches plus some ALU ops
-> Fewer registers encodable in instructions
-> Mode bits to control features instead of encoding them in instructions
-> Short but irregular, complicated, and highly specialized instruction sets
-> Compilers are complicated and unfriendly; hand optimization is a must
Evolution of Digital Signal Processors
Enhanced Conventional DSPs
Going beyond a faster clock:
extend the conventional DSP architecture by adding more parallel execution units
Higher computational power, sometimes with no higher clock and no faster HW
Typically adding more ALUs, multipliers, or MAC units,
e.g. performing 2 MACs per cycle
Extended instruction set to support the new units
Wider data buses to retrieve more data in parallel and feed the new units
May need wider instruction words to include the parameters of the new units in one instruction
Higher cost and complexity, but higher performance
More advanced fabrication processes may help justify the modifications
Evolution of Digital Signal Processors
Enhanced Conventional DSPs
Still suffer from conventional DSP problems:
complex, irregular instruction sets — hard to code and hard to build compilers for

Example: Lucent DSP16XXX, with two MACs

Same range of applications
Evolution of Digital Signal Processors
Multi-Issue Approach
Design goals:
1) achieving higher performance
2) instruction sets that lend themselves to compilers
TI was the first to come up with a solution: the TMS320C62xx, 1996
Then Motorola, Analog Devices, and Lucent followed
Very simple instructions, typically encoding a single operation per instruction
High levels of parallelism achieved by issuing and executing instructions in parallel groups rather than one at a time
Simplified instruction decoding and execution -> higher clock rates
Many parallel execution units to execute instructions in parallel groups
-> High level of parallelism and higher performance,
but generally higher power consumption
Evolution of Digital Signal Processors
Multi-Issue Approach
Multi-issue processors: 1) superscalar, 2) VLIW — most multi-issue DSPs are VLIW
Both have many execution units compared to enhanced conventional DSPs
VLIW DSPs issue a maximum of 4 to 8 instructions per cycle,
fetched as parts of one long super-instruction
Grouping is done at programming time, by the compiler or programmer:
the assembly language programmer or code-generation tool specifies which instructions will be executed in parallel (at the time the program is assembled),
and the grouping does not change during execution
Wider instruction words, usually 32-bit, remove the restrictions on register use
Sufficient routing, buses, and memory bandwidth are needed
Simpler instructions mean more instructions are needed to do a task
Evolution of Digital Signal Processors
Multi-Issue Approach
Example: TI TMS320C62x architecture [block-diagram figure]
Note: no zero-overhead branching

Evolution of Digital Signal Processors
Multi-Issue Approach
Superscalar processors typically issue fewer instructions than VLIWs, say 2 to 4
They differ from VLIW in how instructions are grouped:
grouping is done at run time, by specialized HW, and differs depending on the situation,
based on data dependencies and resource contention
In superscalar processors:
the burden of grouping shifts from programmer to HW (easier to program)
the grouping of the same set of instructions may differ, for example across iterations of a loop
it is difficult for programmers to predict how long given segments of code will take —
for real-time applications, that is a disadvantage in meeting time constraints
extra cost/size/power for the dynamic features
Evolution of Digital Signal Processors
Multi-Issue Approach
Most multi-issue DSPs are based on the VLIW architecture —
as a rule, DSPs traditionally avoid dynamic features
Only a few superscalar DSPs are on the market, e.g. the ZSP500
Evolution of Digital Signal Processors
SIMD
Single Instruction, Multiple Data is not an architecture —
it is an architectural technique that improves performance:
the processor executes multiple instances of the same operation in parallel on different data
Two methods:
1) More independent units with separate registers, working on different data at the same time
e.g. the ADSP-2116x, which has two of the basic ADSP-2106x sets of execution units (MAC, ALU, shifter)
2) Logically splitting execution units into multiple sub-units,
for example an ALU used for one 32-bit add or two 16-bit adds
e.g. the ADI TigerSHARC is a VLIW DSP with two sets of units, and in each set the split-ALU and split-MAC can be used -> two levels of SIMD:
8 x 16-bit multiplications can be done in a single cycle
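The split-ALU method can be mimicked in plain C: one 32-bit add carrying two independent 16-bit lanes (a sketch that, unlike real split-ALU hardware, simply discards carry out of the low lane and ignores saturation; names are mine):

```c
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word. */
uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Two 16-bit adds performed lane-wise inside a 32-bit word. */
uint32_t add2x16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;           /* low lane, carry discarded */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;   /* high lane, computed alone */
    return hi | lo;
}
```

The key property is lane independence: an overflow in the low lane wraps within that lane instead of spilling into the high lane, which is exactly what a split-ALU must guarantee.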
Evolution of Digital Signal Processors
SIMD
Difficult to program:
needs programmer effort to modify the algorithms to fit the architecture,
not possible for all algorithms -> specific applications

Alternatives to DSPs in some applications:
high-performance CPUs,
DSP/micro-controller hybrids
Real-Time Embedded Systems
Goals:
To know why recent processors look the way we will see
To know the processors' environment and constraints
To know how to choose processors
To know when to go for HW, and when to partition the algorithm between HW and SW

Applications of embedded real-time systems:
Real-time systems maintain a continuous, timely interaction with the environment (RTC)
Hard real-time / soft real-time systems:
violating the constraints causes failure / performance degradation
DSP systems are usually hard real-time — feasibility? and cost?
Understanding the characteristics of the tasks and the execution environment:
bounds and constraints
Prediction of task and execution environment behavior:
corner/worst cases
Real-Time Embedded Systems
Task characteristics:
Timeliness parameters, e.g. arrival times, event rates
Deadlines
Resource utilization profiles
Relative importance
Worst-case execution time
Ready and suspension times
Precedence and exclusion constraints

Execution environment characteristics:
System loading
Service latencies
Resources and their interactions
Interrupt priorities and timing
Queuing disciplines
Arbitration mechanisms

Design flow: Specification -> HW/SW partitioning -> Architecture & tasks -> Allocation & scheduling -> Modeling, simulation, prototype -> Implementation -> Product
Real-Time Embedded Systems
Real-time versus time-shared systems
Time-shared systems use multi-programming or multi-tasking to maximize throughput
Real-time systems are designed to provide predictably fast response to urgent tasks
(they might also run some non-real-time tasks)
Differences — real-time systems need:
High degree of schedulability:
the timing requirements of the system must be satisfied even at high degrees of resource usage
Worst-case latency:
ensuring the system still operates under the worst-case response time to events
Stability under transient overload:
when the system is overloaded by events and it is impossible to meet all deadlines, the deadlines of selected critical tasks must still be guaranteed
Real-Time Embedded Systems
Real-time versus time-shared systems
Real-time task categories:
1) Synchronous — sharply predictable; such tasks can be packed into one task
2) Asynchronous — unpredictable
3) Isochronous — loosely predictable (within a time window)
Examples —
S: camcorder audio and video;
A: calls from phones received by a BTS;
I: audio and video in video-over-IP systems — when video arrives, the related audio will arrive within a time window
Real-Time Embedded Systems
Scheduling and Resource Allocation to meet all the deadlines:
1) Offline algorithms: done by the designer
2) Online algorithms: done by the OS or other software
2-1) Static (fixed priorities): e.g. RMA, Rate Monotonic Algorithm (higher-rate tasks get higher priority)
2-2) Dynamic (changeable priorities): e.g. EDF, Earliest Deadline First
Static priority assignment is suitable for deterministic tasks
A truly hard real-time system is not feasible except for deterministic tasks
In DSP systems, complicated RTOSes are avoided when possible!
Example: DSP/BIOS from TI, which works on all TI DSPs
A pre-emptive static-priority RTOS: 15 task priorities, HWI, SWI
On top of this: use the processor resources as much as possible
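The difference between the two online policies (RMA versus EDF) can be sketched in a few lines of Python; the task names, periods, and deadlines below are illustrative. RMA fixes priorities once, offline, by period; EDF re-decides at every scheduling point by absolute deadline:

```python
# RMA: fixed priorities assigned offline -- shorter period means higher priority.
tasks = [("ctrl", 5.0), ("audio", 10.0), ("logger", 100.0)]  # (name, period)
rma_order = [name for name, period in sorted(tasks, key=lambda t: t[1])]
# rma_order[0] is always the highest-priority task, regardless of runtime state.

# EDF: dynamic choice at runtime -- among the currently ready jobs,
# the one with the earliest absolute deadline runs next.
ready_jobs = [("audio", 12.0), ("ctrl", 15.0), ("logger", 40.0)]  # (name, abs deadline)
edf_pick = min(ready_jobs, key=lambda j: j[1])[0]

print(rma_order)  # ['ctrl', 'audio', 'logger']
print(edf_pick)   # 'audio': earliest deadline wins even though ctrl has a shorter period
```

Note how EDF can pick a task that RMA would rank lower, which is why EDF can reach higher utilization but makes behavior under overload harder to predict.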
Real-Time Embedded Systems
Embedded Real-time System Requirements
1) Efficiency
Performance vs. cost: processor cycles, power, size, memory
Achieved through selection and optimization
2) Acceptable Timeliness
Achieved through resource management
The complexity of the resource management adds cost:
an expensive over-capacity processor, a complicated OS (online/real-time resource management)
The resource management that is used is usually static and requires analysis of the system prior to executing it in its environment.
Real-Time Embedded Systems
Selection/Design Guide
1. Response Time: optimal partitioning into HW and SW
1) Is the architecture suitable?
2) Are the programmable processing resources powerful enough?
High utilization (> 90%) makes the system unpredictable and increases development time
3) Are the communication speeds adequate?
4) Are enough communication ports/IOs available?
5) Is the right scheduling method available/possible?
2. Failure Recovery: there is no reset button (or it is hard to access), and it is not possible to check all cases
Internal or external failures: processor, board, or link/connectivity failure, invalid environmental behavior; simulation can help

Real-Time Embedded Systems
Selection/Design Guide
3. Scheduling Loss
Liu & Layland (1973):
For n periodic tasks with fixed periods, a feasible schedule that will always meet deadlines exists if the CPU utilization is below a specific bound (depending on the number of tasks):

U = sum_{k=1..n} C_k / T_k <= n * (2^(1/n) - 1) = U_max(n)

n = 1: U_max = 100%
n = 2: U_max ~= 83%
n -> infinity: U_max -> ln 2 ~= 69.3%

C_k: worst-case computation time of task k
T_k: release period of task k
n: number of scheduled tasks
The other 30.7% of the CPU can be dedicated to lower-priority non-real-time tasks.
Rule of thumb: at 90% utilization, development takes twice the time; at 95%, triple the time.
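The Liu & Layland bound above is straightforward to evaluate in code; a minimal sketch in Python (function names are my own, and this is the sufficient-but-not-necessary form of the test):

```python
import math

def rma_bound(n: int) -> float:
    """Liu & Layland utilization bound for n periodic tasks under RMA."""
    return n * (2 ** (1.0 / n) - 1)

def rma_schedulable(tasks) -> bool:
    """Sufficient test: tasks = [(C_k, T_k), ...]. Passing guarantees a
    feasible fixed-priority schedule; failing is inconclusive."""
    u = sum(c / t for c, t in tasks)
    return u <= rma_bound(len(tasks))

print(round(rma_bound(1), 3))  # 1.0   -> 100% for a single task
print(round(rma_bound(2), 3))  # 0.828 -> ~83% for two tasks
print(round(math.log(2), 3))   # 0.693 -> ~69% limit as n grows
```

For example, a two-task set with C/T pairs (1, 10) and (2, 10) has U = 0.3 and passes, while (5, 10) and (5, 10) has U = 1.0 and fails the test.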
Real-Time Embedded Systems
Selection/Design Guide

4. Distributed and Multi-Processor Architectures
Several nodes: DSP + µP + FPGA
Common practice in today's designs
Consider the following points when deciding:
4-1) Initialization
4-2) Processor interfaces
4-3) Load distribution and timeliness
4-4) Managing shared resources
Real-Time Embedded Systems
Characteristics of Embedded Systems
Application Specific
Monitoring and reacting to the environment
Processing the Information
Controlling the environment
Examples:
Airbag system,
Digital Still Camera,
Cell Phone,

Real-Time Embedded Systems
Components of Embedded Systems
Firmware: programs deep in the HW, not often changed
Software: programs that can be changed by the user
Memory: RAM, ROM, Flash, ...
User interface: from a few LEDs to a GUI on an LCD
Sensors: sense the real world
Actuators: react to or control the real world
Emulation and diagnostics:
e.g. JTAG (Joint Test Action Group)
Application-specific gates
Analog IO:
A/D and D/A + ASP (filters, amplifiers, ...)
Last but not least, the processor: from an 8-bit MCU to a 64-bit µP; can be an ASIC or FPGA
Real-Time Embedded Systems
Development Life Cycle
1) Examine the overall system needs
BDTI selection criteria:
Price and BOM size, performance, time to market, power, size, features, ...
Rules in this phase:
For fixed cost, maximize the performance
For fixed performance, minimize the cost
Real-Time Embedded Systems
Development Life Cycle
2) Select the Hardware Components Required
µC / MCU versus FPGA (ASIC)
ASIC: only for extremely low power or extremely high performance at high volume, with a good time-to-market margin; less flexible
FPGA: more flexible, faster time to market than an ASIC; faster than processors, but more development time
Typical application: radar
Real-Time Embedded Systems
Development Life Cycle
2) Select the Hardware Components Required
DSP versus µP (GPP)
Go for an MCU when low cost, low chip count, and low processing power are needed
Go for a DSP only if cost, size, and power requirements cannot be met by a GPP, or if other DSP-related features are needed, such as fast IO, low-latency interrupts, or specific instructions
Real-Time Embedded Systems
Development Life Cycle
2) Select the Hardware Components Required
Other rules in this phase:
FPGAs are good for bit-manipulation applications
Processors are better for numerical applications
If a single DSP will do the job, go for it!
A DSP is better than an FPGA if the algorithms are complex and the needed resources already exist in DSPs
Have a DSP if special memory accesses are needed
Have an FPGA along with a DSP if possible to:
Use a smaller/cheaper DSP/MCU by offloading some computationally intensive tasks to the FPGA
Increase the computational throughput
Prototype a new signal processing algorithm (like an EVM)
Pack the glue logic and keep it flexible
Have a GPP along with a DSP if there are many non-real-time tasks
In embedded DSP systems development, the technical trend is towards programmability
Real-Time Embedded Systems
Development Life Cycle
2) Select the Hardware Components Required
Programmable or HW or mixed?
Most of the time, cost has the final say!
Cost is a multifold issue, usually addressed by a mixed strategy:
device cost, NRE (non-recurring engineering), manufacturing cost, opportunity cost,
low time-to-market gain, physical advantages (power dissipation, weight, size, ...)
3) Understanding DSP basics and architecture
What makes a DSP a DSP?!
How fast can it go?
How can I achieve maximum performance? Max # of channels for an algorithm
IO options? GPIO, UART, CAN, SPI, USB, McBSP, HPI
IO speed and performance: can the IO keep up with the max # of channels?
Memory speed?
High-speed internal memory? Special access modes, DMA
Sample-based and frame-based processing