Embedded Real-Time Systems, Part One
Sharif University of Technology, EE Department
Spring 2017
Sharif University of Technology, EE Department, Embedded Real-Time Systems Course, imangh@sharif.edu
Embedded Real-Time Systems
Analog Processing
Analog Computers
Fourier Optics
Programmability
The same hardware can perform several tasks
Upgradeability and flexibility
Repeatability
Identical performance from unit to unit
No drift in performance due to temperature or aging
Immune to noise
Offering higher quality or performance
(Compare CD players with phonograph turntables)
Embedded Real-Time Systems
Embedded Systems:
Computer systems designed to perform one or a few tasks
Examples: MP3 players, traffic control systems, radars, ...
In contrast to general-purpose (GP) computers, which are flexible to end-user needs
Can be optimized for cost, size, power consumption, ...
The architecture is the processor's contents from the programmer's vantage point
Has held true for more than 45 years now; may hold true for another decade
Transistor dimensions
Processors
Processors
Moore's Law
Challenge for processor designers
Make performance follow at least (Moore's law)^2:
density effect × clock effect
adding improvements through innovations: micro-architectures, multi-core
Performance has not gone up that fast! It follows Moore's law
Orchestration problem, 1971-2009: $2^{38/2} \approx 524{,}000$
Power consumption bottleneck
Heat dissipation: not worth it to increase the clock
Intel rules:
Increasing the clock rate by 25% yields approximately a 15% performance increase
But power consumption doubles
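The two numbers on this slide can be checked with a few lines of C. This is only a sketch: the function names `doubling_factor` and `perf_per_watt` are made up for illustration, and the two-year doubling period is the assumption behind the 38/2 exponent.

```c
#include <stdint.h>

/* Growth factor after `years` if a quantity doubles every `period` years.
   Assumes period > 0 and years / period < 64 so the shift cannot overflow. */
uint64_t doubling_factor(unsigned years, unsigned period) {
    return UINT64_C(1) << (years / period);
}

/* Performance per watt relative to the baseline (baseline = 1.0),
   given relative performance and relative power after a design change. */
double perf_per_watt(double relative_perf, double relative_power) {
    return relative_perf / relative_power;
}
```

`doubling_factor(38, 2)` gives 524288, the "~524,000" for 1971-2009. With the Intel rule, `perf_per_watt(1.15, 2.0)` is 0.575: a 25% faster clock leaves each watt doing a little over half the work it did before.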
Implementation (Microarchitecture)
processor designer view
Logical structure, Pipelining
Functional units, Caches, Physical registers
Realization (Chip)
chip/system designer view
Physical structure for the implementation
Gates, Cells, Transistors, Interconnection
Processors
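Iron Law (Joel Emer)
$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
Time/Program is to be minimized

Instructions/Program   Cycles/Instruction   Time/Cycle
Architecture           Implementation       Realization
Code size              CPI                  Cycle time
Compiler designer      Processor designer   Chip designer

Time(A) = N × 2.0 × 1 = 2N
Time(B) = N × 1.2 × 2 = 2.4N
Speedup = old time / new time
= (P × oldCPI × T) / (P × newCPI × 1.15 T)
= 1.36 / (1.24 × 1.15) = 0.95
The speedup is below 1, so the change makes the program slower. Don't do it!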
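The Iron Law arithmetic above can be reproduced in C. This is a minimal sketch; `exec_time` and `speedup` are illustrative names, and the instruction count and cycle time are in arbitrary units.

```c
/* Iron Law: execution time = instruction count x CPI x cycle time. */
double exec_time(double instructions, double cpi, double cycle_time) {
    return instructions * cpi * cycle_time;
}

/* Speedup of a new design over an old one; values below 1 mean a slowdown. */
double speedup(double old_time, double new_time) {
    return old_time / new_time;
}
```

For the last case on the slide, `speedup(exec_time(1, 1.36, 1.0), exec_time(1, 1.24, 1.15))` is about 0.95: the lower CPI does not make up for the 15% longer cycle.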
Processors
Instruction Set Architecture (ISA)
The boundary between software and hardware
Instruction-to-Binary Encoding
Processors
Basic ISA Classification
Stack Architecture (zero operand):
Operands popped from stack(s)
Result pushed on stack
Accumulator (one operand):
Special accumulator register is implicit operand
Other operand from memory
Register-Memory (two operand):
One operand from register, other from memory or register
Generally, one of the source operands is also the destination
A few architectures have allowed memory-to-memory operations
Register-Register or Load/Store (three operand):
All operands for ALU instructions must be registers
General format: Rd ← Rs op Rt
Separate Load and Store instructions for memory access
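To make the zero-operand (stack) class concrete, here is a toy interpreter in C. The instruction set (`OP_PUSH`, `OP_ADD`, `OP_SUB`, `OP_MUL`) is hypothetical and exists only for this sketch; it does not correspond to any real ISA.

```c
#include <stddef.h>

/* Hypothetical zero-operand instruction set, for illustration only. */
typedef enum { OP_PUSH, OP_ADD, OP_SUB, OP_MUL } Opcode;

typedef struct {
    Opcode op;
    int    imm;   /* used only by OP_PUSH */
} Instr;

/* Runs `len` instructions and returns the value left on top of the stack.
   ALU instructions name no operands: both sources are popped from the
   stack and the result is pushed back.
   Assumes a well-formed program (no underflow, at most 16 stack slots). */
int run_stack_machine(const Instr *prog, size_t len) {
    int stack[16];
    int sp = 0;                          /* index of the next free slot */
    for (size_t pc = 0; pc < len; pc++) {
        if (prog[pc].op == OP_PUSH) {
            stack[sp++] = prog[pc].imm;
            continue;
        }
        int rhs = stack[--sp];           /* second operand was pushed last */
        int lhs = stack[--sp];
        switch (prog[pc].op) {
        case OP_ADD: stack[sp++] = lhs + rhs; break;
        case OP_SUB: stack[sp++] = lhs - rhs; break;
        case OP_MUL: stack[sp++] = lhs * rhs; break;
        default: break;                  /* unreachable: PUSH handled above */
        }
    }
    return stack[sp - 1];
}
```

The program `PUSH 2, PUSH 3, ADD, PUSH 4, MUL` evaluates (2 + 3) × 4. A load/store machine would express the same computation with explicit register names in every ALU instruction.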
Processors
Important ISA Considerations
Number of registers
Data types/sizes
Addressing modes
Instruction complexity
Branch/jump/function call
Exception handling
Instruction format/size/regularity
Branches
Conditional/Unconditional
A function call is similar but needs parameter passing, saving state,
restoring state, latency, ...
Processors
Modern ISAs
Operations: simple ALU ops, data movement, control transfer
Temporary Operand Storage in the CPU
Large General Purpose Register (GPR) File
Load/Store Architecture
Three operands per ALU instruction (all registers)
A ← B op C
Addressing Modes
Limited addressing modes, e.g. register indirect addressing only
Type and size of operands
32/64-bit integers, IEEE floats
Instruction-to-Binary Encoding
Fixed width, regular fields
Processors
MIPS ISA
The MIPS ISA was one of the first RISC instruction sets (1985)
Main characteristics:
Load-Store Architecture
Only one addressing mode for memory operands: reg. indirect + displacement
Processors
x86 ISA, a CISC ISA
First introduced with the Intel 8086 processor in 1978; evolved over the years
Main characteristics:
Reg/Mem architecture: ALU instructions can have memory operands
Highly variable instruction size and format: instruction size varies from 1 to
15 bytes.
Processors
ARM and Thumb ISAs
The ARM processor architecture provides support for:
The 32-bit ARM and 16-bit Thumb Instruction Set Architectures
Reduced Instruction Set Computer (RISC) architecture
Load/store architecture
Simple addressing modes (determined from register contents and instruction)
16 × 32-bit registers
Processors
ARM and Thumb ISAs
Enhancements to a basic RISC architecture enable ARM processors to achieve
a good balance of high performance, small code size, low power
consumption, and small silicon area.
ARM7-TDMI
3-stage pipeline
CPI ~ 1.9
Video, OFDM, DQPSK benchmarks
Processors
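Going Deeper into Implementation
How to improve the implementation: increasing IPC / decreasing CPI
Amdahl's Law (Gene Amdahl, IBM)
Expected speed-up of partial program improvement (a fraction p of the program is sped up by a factor S):
$$\text{speedup} = \frac{1}{(1-p) + p/S}$$
Expected speed-up of partial parallel implementation (a fraction p runs on N parallel units):
$$\text{speedup} = \frac{1}{(1-p) + p/N}$$
$$\%\,\text{improvement} = \left(1 - \frac{1}{\text{speedup}}\right) \times 100$$
For the sequential case:
$$\text{speedup}_{\max} = \frac{S_{\max}}{(1-p)(S_{\max}-1) + 1}$$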
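Amdahl's law is easy to evaluate numerically. The sketch below assumes p is the fraction of execution time that benefits and s is the factor by which that fraction is sped up (use the number of units N for the parallel case); the function names are illustrative.

```c
/* Amdahl's law: overall speedup when a fraction p (0..1) of the run time
   is accelerated by a factor s and the remaining (1 - p) is unchanged. */
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

/* Percentage reduction in run time that corresponds to a given speedup. */
double percent_improvement(double speedup) {
    return (1.0 - 1.0 / speedup) * 100.0;
}
```

For example, accelerating 90% of a program by a factor of 10 gives `amdahl_speedup(0.9, 10.0)`, roughly 5.3 rather than 10, because the untouched 10% starts to dominate.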
Processors
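Going Deeper into Implementation: Pipelining, 1980s
Overlap the execution of instructions
Un-pipelined process: inputs I1, I2, I3, ... produce outputs O1, O2, O3, ..., each taking time T
For K inputs through an N-stage pipeline:
$$\text{speedup} = \frac{KT}{(K+N-1)\,T/N} \approx N \quad (K \gg N)$$
Amdahl's Law, if p is the fully pipelined portion, for K ≫ 1 inputs/outputs:
$$\text{speedup} = \frac{1}{(1-p) + p/N}$$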
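The ideal pipeline speedup can be tabulated with a one-line function. This sketch assumes an un-pipelined unit needs time T per input (KT for K inputs), while an N-stage pipeline with stage time T/N needs (K + N - 1) T/N; the name `pipeline_speedup` is illustrative.

```c
/* Ideal speedup of an n-stage pipeline over an un-pipelined unit when
   processing k inputs: k*T / ((k + n - 1) * T / n) = k*n / (k + n - 1).
   Assumes k >= 1 and n >= 1. */
double pipeline_speedup(unsigned k, unsigned n) {
    return ((double)k * (double)n) / ((double)k + (double)n - 1.0);
}
```

With a single input there is no gain (`pipeline_speedup(1, 5)` is 1), while for 1000 inputs a 5-stage pipeline gets close to its limit of 5. The fill time of N - 1 stages is the only loss in this idealized model.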
Processors
Going Deeper into Implementation
Instruction Level Parallelism (ILP), 1990s
A measure of how many operations in a computer program can be
performed simultaneously.
Example program, ILP = 3/2 [IPC (instructions per cycle) at the CPU level]
1. x = a + b
2. y = c - d
3. z = x * f
Instructions 1 and 2 are independent and can execute together; instruction 3 must wait for x, so 3 instructions take 2 steps
Design problem: the compiler or the processor must increase ILP
ILP Processors
Possibility of having overlap among instructions
Pipelining is what can be done in a basic single block, IPCmax = 1
Substantial improvement can be achieved by having multiple blocks, IPC > 1
$$\text{speedup} = \frac{1}{(1-p)/S_1 + p/(N S_2)}$$
Processors
Going Deeper into Implementation: ILP
Single-issue architecture → multi-issue architecture
Parallelism expands in space (1990s) using multiple basic units
VLIW (Very Long Instruction Word) processors
Static; no code compatibility, recompilation needed for different processors
Superscalar processors
Dynamic; code compatibility between processor family members
for (;;) {
    ReadNewSample(&xn);
    UpdateInputArray(xn);
    sum = 0;
    for (int k = 0; k < N; k++)
        sum += h[k] * x[n-k];    /* one multiply-accumulate (MAC) per tap */
    yn = sum;
    n++;
}
Hardware this loop calls for:
Separate memories/paths for code and data
MAC
Faster memories
Digital Signal Processors
Modify the C code and see how it maps onto hardware
for (;;)
{
    ReadNewSample(&xn);
    UpdateInputArray(xn);
    sum = 0;
    a = &x[n-N+1];               /* oldest sample in the window */
    b = &h[N-1];                 /* coefficient that multiplies it */
    for (int k = 0; k < N; k++)
        sum += (*a++) * (*b--);  /* x walks forward while h walks backward */
    yn = sum;
}
y[0] = h[0]*x[0] + h[1]*x[-1] + h[2]*x[-2]
y[1] = h[0]*x[1] + h[1]*x[0] + h[2]*x[-1]
y[2] = h[0]*x[2] + h[1]*x[1] + h[2]*x[0]
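A self-contained version of the pointer-based inner loop is sketched below for one output sample. It assumes float samples, a linear (not circular) input buffer, and n >= taps - 1 so that every x[n-k] exists; the name `fir_sample` is illustrative and is not part of the slide code.

```c
#include <stddef.h>

/* Computes y[n] = sum over k of h[k] * x[n-k] for k = 0 .. taps-1.
   Requires taps >= 1 and n >= taps - 1 so that every x[n-k] is valid. */
float fir_sample(const float *x, size_t n, const float *h, size_t taps) {
    const float *a = &x[n - taps + 1];   /* oldest sample in the window */
    const float *b = h + taps;           /* one past the last coefficient */
    float sum = 0.0f;
    for (size_t k = 0; k < taps; k++)
        sum += (*a++) * (*--b);          /* one MAC per tap; x forward, h backward */
    return sum;
}
```

Each loop iteration needs two data fetches and one multiply-accumulate, which is the access pattern that the dual data paths and the MAC unit of a DSP are built to complete in a single cycle. The pre-decrement on `b` keeps both pointers inside (or one past) their arrays at all times.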
Digital Signal Processors
Number and variety of products that include DSP algorithms
+ Challenge of being cost-effective and power/size-effective → special hardware
So many dedicated ICs → slightly more general, programmable DSP chips
Because of the high cost of IC design, DSP processors became the less risky and more popular
solution, especially in low-volume applications
Digital Signal Processors
Program Memory to Provide Filter Coefficients
Multiple data memories, program cache (at least one entry, to support the RPT
command!)
Modified Harvard Architectures
Digital Signal Processors
3) Efficient Memory Access
Predictable memory-access patterns in DSP algorithms lead to:
6) Streamed I/O
Most of the processing time of DSP algorithms is spent in loops
DSP architectures usually:
a) provide special support for serial ports, parallel ports, memory interfaces, ...
b) have special ports for specific streams, like the TI McBSP to connect to audio
codecs, or video ports in new DSPs BT565x50 27.5 MHz
c) use DMA to allow data transfer with little or no intervention from the
computational units. Most ports in TI DSPs are equipped with DMA
Digital Signal Processors
7) Specialized Instruction Set
DSP ISA design is cost-sensitive
This is another reason why DSP C/C++ compilers are not that efficient, and why low-level
programming and hand optimization are needed
Also, C/C++ is not designed for describing DSP algorithms
Evolution of Digital Signal Processors
Conventional DSPs
Based on the conventional DSP architecture we have studied so far
1) Low cost, low performance range
Multiplier or MAC + an ALU, address generators
Still in use; very similar to the DSPs of the 1980s
20-50 MHz clock
Examples: Motorola DSP560xx, TI TMS320C2xx, Lucent DSP16xx, AD ADSP-21xx
Typical applications:
Consumer electronics,
Modest telecommunications products,
Hard disks
Evolution of Digital Signal Processors
Conventional DSPs
2) Midrange DSPs, higher performance
Through higher clock rates and more sophisticated architectures, deeper pipelines
More HW units like barrel shifter, instruction cache, ...
Incremental (not dramatic) enhancement compared to (1)
100-150 MHz clock
Examples:
Motorola DSP563xx,
TI TMS320C54x,
AD ADSP-218x
Typical applications:
Consumer electronics,
Telecommunications products: VoIP, wireless
Evolution of Digital Signal Processors
Conventional DSPs review
Issues with Old Conventional DSPs
Slow External memory
Pin count problem
Expensive Internal memory
Expensive HW and need for higher computational power
Design Goals :
1) To make maximum use of the processor's underlying HW
2) To minimize the amount of memory space needed to store DSP programs
Short instructions, each capable of a few memory fetches plus some ALU ops
Fewer registers encoded in instructions
Using mode bits to control features instead of encoding them in instructions
Result: short but irregular, complicated and highly specialized instruction sets
Compilers are complicated and not friendly; hand optimization is a must
Evolution of Digital Signal Processors
Enhanced Conventional DSPs
Going beyond Faster Clock
May need wider instruction words to include the parameters of the new units in one
instruction
Higher Cost and Complexity but Higher Performance
More advanced fabrication processes may help to justify the modifications
Evolution of Digital Signal Processors
Enhanced Conventional DSPs
Still Suffer from Conventional DSP Problems
Complex Irregular instruction sets
Hard to code and make compilers
Example:
Lucent DSP16XXX
Two MACs
Evolution of Digital Signal Processors
Multi Issue Approach
Design Goals :
1) Achieving Higher Performance
2) Instruction sets that lend themselves to compilers
TI was the first to come up with a solution: TMS320C62xx, 1996
Then Motorola, Analog Devices, Lucent followed
Very Simple Instructions, typically encoding a single operation per instruction
Achieving high level of parallelism by issuing and executing instructions in
parallel groups rather than one at a time
Simplified instruction decoding and execution → higher clock rates
Many parallel execution units to execute instructions in parallel groups
High level of Parallelism and higher performance
Generally higher power consumption
Evolution of Digital Signal Processors
Multi Issue Approach
Multi-issue processors: 1) Superscalar 2) VLIW
Most DSPs are VLIW
Evolution of Digital Signal Processors
Multi Issue Approach
Superscalar processors typically issue fewer instructions per cycle than VLIWs, say 2 to 4
They differ from VLIW in how instructions are grouped
Grouping is done at run time by specialized HW, and differs depending
on the situation
Based on data dependencies and resource contention
In superscalar processors:
The burden of programming shifts from the programmer to HW (easier to program)
Grouping of the same set of instructions may differ, for example from one loop iteration to the next
Difficult for programmers to predict how long a given segment of code will take
Evolution of Digital Signal Processors
SIMD
Single Instruction Multiple Data (SIMD) is not an architecture
It is an architectural technique that improves performance
The processor executes multiple instances of the same operation in parallel on
different data
Two methods:
1) More independent units with separate registers to work on different data at
the same time
e.g. the ADSP-2116x, which has two sets of the basic ADSP-2106x execution units, including
MAC, ALU, shifter
2) Logically split the execution units into multiple sub-units
For example, an ALU that can be used for one 32-bit add or two 16-bit adds
e.g. the ADI TigerSHARC is a VLIW DSP that has two sets of units, and in each set
a split ALU and a split MAC can be used: two levels of SIMD
8 × 16-bit multiplications can be done in a single cycle
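The split-ALU idea (method 2) can be emulated in plain C by treating one 32-bit word as two independent 16-bit lanes. This is a software sketch of the concept, not how any particular DSP implements it; `add2x16` is an illustrative name.

```c
#include <stdint.h>

/* Adds the two 16-bit lanes of a and b independently, with wrap-around
   inside each lane, emulating a 32-bit ALU split into two 16-bit halves. */
uint32_t add2x16(uint32_t a, uint32_t b) {
    uint32_t low  = (a + b) & 0x0000FFFFu;          /* carry out of the low lane is dropped */
    uint32_t high = ((a >> 16) + (b >> 16)) << 16;  /* high lane is added on its own */
    return high | low;
}
```

An ordinary 32-bit add of 0x0001FFFF and 0x00010001 gives 0x00030000 because the carry crosses bit 16, whereas `add2x16` returns 0x00020000: the low lane wraps to 0 and the high lane is 1 + 1. In hardware the same effect comes from breaking the carry chain between the two halves.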
Evolution of Digital Signal Processors
SIMD
Difficult to program
Needs programmer effort to modify the algorithms to fit into the architecture
Not possible for all algorithms
Specific applications
Real-Time Embedded Systems
Real-time versus Time-shared systems
Real-Time Embedded Systems
Selection/Design Guide
3. Scheduling Loss
Liu & Layland (1973)
For n periodic tasks with fixed periods, a feasible schedule that will always meet
deadlines exists if the CPU utilization is below a specific bound (depending on
the number of tasks).
$$U = \sum_{k=1}^{n} \frac{C_k}{T_k} \le n\left(2^{1/n} - 1\right)$$
C_k: worst-case computation time of task k
T_k: release period of task k
n: number of scheduled tasks

n     U_max
1     100%
2     83%
∞     69%

For large n, the other 30.7% of the CPU can be dedicated to lower-priority non-real-time
tasks.
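The utilization test can be coded without computing an n-th root, because U <= n(2^(1/n) - 1) is equivalent to (1 + U/n)^n <= 2. The sketch below assumes computation times and periods are given in the same time unit; the function names are illustrative.

```c
#include <stddef.h>

/* Total CPU utilization U = sum over k of C[k] / T[k]. */
double utilization(const double *c, const double *t, size_t n) {
    double u = 0.0;
    for (size_t k = 0; k < n; k++)
        u += c[k] / t[k];
    return u;
}

/* Returns 1 if U <= n * (2^(1/n) - 1), else 0.
   The comparison is rewritten as (1 + U/n)^n <= 2 so no root is needed. */
int passes_utilization_bound(const double *c, const double *t, size_t n) {
    if (n == 0)
        return 1;                        /* nothing to schedule */
    double base = 1.0 + utilization(c, t, n) / (double)n;
    double power = 1.0;
    for (size_t k = 0; k < n; k++)
        power *= base;
    return power <= 2.0;
}
```

Two tasks with (C, T) = (1, 2) and (1, 4) have U = 0.75 and pass; (1, 2) and (2, 5) have U = 0.9 and exceed the two-task bound of about 83%. The bound is a sufficient condition only, so a task set that fails this check is not necessarily unschedulable.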
Real-Time Embedded Systems
Characteristics of Embedded Systems
Application Specific
Monitoring and reacting to the environment
Processing the Information
Control the environment
Examples:
Airbag system,
Digital Still Camera,
Cell phone, ...
Real-Time Embedded Systems
Components of Embedded Systems
Firmware: Programs deep in the HW, not often changed
Software: Programs that can be changed by the user
Memory: RAM, ROM, Flash, ...
User interface: from a few LEDs to a GUI on an LCD
Sensors: sense the real world
Actuators: react to or control the real world
Emulation and diagnostics:
e.g. JTAG (Joint Test Action Group)
Application-specific gates
Analog I/O:
A/D and D/A + ASP (filters, amplifiers, ...)
Last but not least, the processor: from an 8-bit MCU to a 64-bit μP; can be an ASIC or
FPGA
Real-Time Embedded Systems
Development Life Cycle
1) Examine the overall system needs
Real-Time Embedded Systems
Development Life Cycle
2) Select the Hardware Components Required
DSP
μP (GPP)
Go for an MCU when low cost, low chip
count, and low processing power are needed
Go for a DSP only if the cost, size and power
requirements cannot be met by a GPP,
or if other DSP-related features are needed,
like fast I/O, low-delay interrupts, specific
instructions, ...
Real-Time Embedded Systems
Development Life Cycle
2) Select the Hardware Components Required
Other rules in this phase:
FPGAs are good for bit manipulation applications
Processors are better for numerical applications
If a single DSP will do the job, go for it!
A DSP is better than an FPGA if the algorithms are very complex and the needed resources already exist in DSPs
Have a DSP if special memory accesses are needed
Have an FPGA and a DSP, if possible, to:
Use a smaller/cheaper DSP/MCU by offloading some computationally intensive tasks to the FPGA
Increase the computational throughput
Make a prototype of a new signal processing algorithm (like an EVM)
Pack the glue logic and keep it flexible
Have a GPP along with a DSP if there are many non-real-time tasks
Embedded DSP systems development
The technical trend is towards programmability
Real-Time Embedded Systems
Development Life Cycle
2) Select the Hardware Components Required
Programmable, HW, or mixed?
Most of the time, cost has the final say!
Cost is a multifaceted issue, usually best addressed by a mixed strategy
Device cost, NRE (non-recurring engineering), manufacturing cost, opportunity cost, ...
Gain from a short time to market, physical advantages (power dissipation, weight, size, ...)