Lec10b DSP2

Lecture 10b: Implementing DSP Functionality: Alternatives
Prepared by: Professor Kurt Keutzer
Computer Science 252, Spring 2000
With contributions from:
Prof. Heinrich Meyr, University of Aachen
Philip Chong, David Chinnery, Rhett Davis, Paul Husted, Niraj Shah, Chris Taylor, Scott Weber, Ning Zhang

Kurt Keutzer 1
System Implementation Choices
[Figure: system functionality mapped onto four implementation alternatives: off-the-shelf uP/DSP; embedded core uP/DSP (DSP core with program ROM, coefficient ROM, and control); application-specific uP (ASIP core with program ROM, coefficient ROM, and control); and ASIC.]

Making a Successful Comparison - 1

Find an interesting application kernel
  Viterbi decoding for speech processing (not a full modem!)
Find realistic constraints native to the application
  n=2, K=7, QPSK, 100 kb/s, BER = 10^-4
Find architectures/implementations that are promising for the application
  TI TMS320C54, Tensilica Xtensa
  What are the relevant features of this architecture that support this application?
Fix application constraints across all implementations (above)
Fix key parameters for implementation comparison
  performance (constraint), area, power

Making a Successful Comparison - 2

Identify how key parameters will be measured
  performance: instruction set simulator, eval board
  area: data sheets, gate estimates
  power: eval board, TI application note
Implement your application kernel
  Examine different algorithms
  Start with code downloaded from the web (multimedia benchmarks, etc.)
  Build your software development/evaluation environment:
  http://www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm

Making a Successful Comparison - 3

Implement your application kernel (cont)
Phase 0: Research
  Find application notes, research reports for your own or comparable architectures
Phase 1: Estimation
  Develop a quick estimate based on initial code
  Integrate research findings
  Do a quick back-of-envelope reality check
Phase 2: Real implementation/Tuning
  Tailor algorithm, implementation to architecture
  Do your very best!
  Have a contest with your partner
Phase 3: Evaluation
  Apply evaluation tools to key parameters
  Evaluate and compare results; return to Phase 2
If your life depended on choosing the right part, what would you do?

Making a Successful Comparison - 4

Final evaluation and comparison: compare all implementations
To evaluate for a product: everything is fair game
To evaluate principally the architectures, need to consider:
  fab differences: TSMC vs. IBM (10-20% faster)
  process differences: 0.35 micron vs. 0.25 micron (50% faster)
  power supply differences: 3.0 V vs. 1.5 V
  ASIC vs. custom implementations (2x faster)
Now evaluate: if I were the architect of this processor/implementor of this system on a chip, what would I do differently?
  cache sizes
  register availability
  additional instructions
  on-chip memory

Making a Successful Comparison - 5

Just for fun
In addition to the primary constraints (speed, cost, power), final real-world considerations:
  business relationships (joint partnership with Lucent)
  time-to-market issues
    time to configure?
    software development environment
    library/application software support
    application engineering support

Viterbi Algorithm
Prof. Heinrich Meyr University of Aachen

Viterbi Decoders in digital communication systems
[Figure: block diagram of a digital communication link. Transmit side: signal source, source coder, convolutional or trellis coder and mapper (information bits in, channel symbols out), modulator, channel. Receive side: demodulator (received symbols), Viterbi decoder (decoded bits), source decoder, signal sink.]

Convolutional Coder and Trellis diagram
[Figure: a convolutional coder built from delay elements (z^-1) and modulo-2 additions maps information bits u_k to code symbols b_k; a BPSK mapper turns them into channel symbols c_k; the channel adds white noise n_k to give y_k. The four-state trellis (states 0 to 3) runs from the known start state X_0 = 0 to the known end state X_T = 0. The Viterbi decoder writes decisions into the survivor memory and outputs the decoded bits.]
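
ACS recursion for M = 2
[Figure: state i at time k is reached from its two predecessor states Z(0,i) and Z(1,i) at time k-1; one incoming branch is the survivor path, the other the competing path.]
Reconstructed recursion (the Greek metric symbols were lost in extraction; \gamma for path metrics and \lambda for branch metrics are assumed here):
\gamma_{i,k} = \max\{ \gamma_{Z(0,i),k-1} + \lambda^{(0,i)}_{k} ,\; \gamma_{Z(1,i),k-1} + \lambda^{(1,i)}_{k} \}
The decision bit d_{i,k} records which branch was selected (the figure shows the case d_{i,k} = 1).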
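
The add-compare-select step for one state can be sketched as follows. This is a minimal illustration, not code from the lecture: the function name `acs`, the argument names, and the tie-breaking rule are assumptions, and a larger metric is treated as better to match the Max{...} on the slide.

```python
def acs(pm_prev, z0, z1, bm0, bm1):
    """One add-compare-select step for a state with two predecessors (M = 2).

    pm_prev  : path metrics of all states at time k-1
    z0, z1   : indices of the two predecessor states, Z(0,i) and Z(1,i)
    bm0, bm1 : branch metrics of the two incoming branches at time k
    Returns (new_path_metric, decision_bit).
    """
    cand0 = pm_prev[z0] + bm0      # add
    cand1 = pm_prev[z1] + bm1
    if cand1 > cand0:              # compare
        return cand1, 1            # select branch 1
    return cand0, 0                # select branch 0 (ties go to branch 0, an arbitrary choice)


pm_prev = [5, 3, 0, 7]
new_pm, d = acs(pm_prev, z0=0, z1=1, bm0=2, bm1=6)
print(new_pm, d)  # 9 1
```

The decision bit is what the survivor memory stores; the path metric itself only needs to be kept for one trellis step.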

Viterbi Decoder block diagram
[Figure: channel symbols y_k enter the TMU, which passes branch metrics to the ACSU; the ACSU feeds state metrics back through a latch and sends decision bits to the SMU, which outputs the decoded bits u_k.]

Characteristic of a 2-bit step-at-zero quantizer
[Figure: staircase characteristic with output levels Q = -2, -1, 0, 1 plotted against the normalized input level (axis from -2 to 2); the outputs Q = 1 and Q = -2 are marked as saturation.]

Architecture

Node parallel ACS architecture
[Figure: the TMU supplies branch metrics to N parallel ACS units (states 0 to N-1); the state metrics are held in registers and routed back through a shuffle-exchange network, and the decisions dec(i,k) go to the SMU.]

Alternative Implementations
[Figure: two ways to map a trellis butterfly onto hardware: four dedicated ACS units (each with a block labeled M), or shared ACS units.]

Butterfly trellis structure and resource sharing for the K = 3, rate 1/2 code
[Figure: the four states 0 to 3 at time k connect to the states at time k+1 through two butterflies. In the shared implementation, old state metrics are read from the path metric memory, routed through multiplexers to the ACS units, and written back as new state metrics.]

Survivor Memory Unit

REA hardware architecture
[Figure: survivor memory for a four-state trellis (states s0 to s3): an array of processing elements (PE) holding, for each state, the estimated bits \hat{u}[k] back to \hat{u}[k-D], updated by the decision bits d_{0,k} to d_{3,k}.]

Decoded Sequence: 0 0 ... 0 1 0
[Figure: decoding by acquisition of the final survivor; the estimated bits from \hat{u}[k-D] to \hat{u}[k] are read out, giving the decoded sequence 0 0 ... 0 1 0.]

Viterbi Project Constraints

uncoded word length = 1
coded word length (n) = 2
  this means that it is rate 1/2
chain-backing depth (D) = 96
generator polynomials: p0 = 171, p1 = 133 (octal)
  this means that p0 = 1111001, p1 = 1011011
constraint length (K, aka L) = 7
  this means that the number of states in the trellis is 2^(K-1), or 64 states
data rate: 100 kb/s
goal: bit error rate (BER) = 10^-4
branch metric calculation is QPSK
soft decision wordlength (q) = 6
signal-to-noise ratio (SNR) degradation: 0.05 dB
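
To make the code parameters concrete, here is a small sketch of a rate-1/2, K = 7 convolutional encoder using the generator polynomials from the constraints. The function names and the bit-ordering convention (newest input bit in the most significant position of the 7-bit window) are assumptions made for illustration; the slides do not specify them.

```python
K = 7                        # constraint length
G0, G1 = 0o171, 0o133        # generator polynomials from the slide (octal)
NUM_STATES = 1 << (K - 1)    # 2^(K-1) = 64 trellis states


def parity(x):
    """Modulo-2 sum of the bits of x."""
    return bin(x).count("1") & 1


def encode(bits):
    """Encode a bit sequence; returns one (c0, c1) pair per input bit."""
    state = 0                              # the K-1 most recent input bits
    out = []
    for b in bits:
        reg = (b << (K - 1)) | state       # 7-bit window, newest bit in the MSB (assumed convention)
        out.append((parity(reg & G0), parity(reg & G1)))
        state = reg >> 1                   # shift: drop the oldest bit
    return out


print(format(G0, "07b"), format(G1, "07b"))   # 1111001 1011011
print(NUM_STATES)                             # 64
```

With this convention, feeding a single 1 followed by zeros reads the two polynomials out bit by bit, which is a quick way to check an encoder against the specification.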

Viterbi Decoder Implementation on an ARM
EE 290S Final Project, May 4, 1999
Phillip Chong

ARM Overview
32-bit RISC microprocessor
Five-stage pipeline
Features fast ALU operations (barrel shifter)
Scalar integer unit, no FPU

Algorithm Tweaking
Performing the metric computation through table lookup (load = 1 delay slot) is faster than using ALU (multiplication = up to 3 delay slots)
Parity computation (Viterbi code) can also be done through table lookup

Reducing Memory Footprint

Cache misses can be very costly due to pipeline stalls
We are willing to give up some algorithmic efficiency to eliminate cache misses
To minimize the memory footprint, we pack 32 bits of traceback into a single word; we can easily unpack this data thanks to the barrel shifter (1-cycle operation)
For 128-level traceback, memory requirements are 512 bytes (metrics table) + 1024 bytes (traceback) + 768 bytes (parity lookup tables) = 2304 bytes
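
The packing idea and the footprint arithmetic can be sketched as below. The layout (two 32-bit words per trellis step for 64 states) and the helper names are assumptions chosen to reproduce the 1024-byte traceback figure; the original C code is not shown in the slides.

```python
STATES = 64                       # 2^(K-1) states for K = 7
DEPTH = 128                       # traceback levels
WORDS_PER_STEP = STATES // 32     # 32 decision bits per 32-bit word (assumed layout)


def pack_decisions(decisions):
    """Pack one decision bit per state into 32-bit words."""
    words = [0] * WORDS_PER_STEP
    for state, d in enumerate(decisions):
        words[state >> 5] |= (d & 1) << (state & 31)
    return words


def get_decision(words, state):
    """Unpack a single decision bit with a shift and a mask."""
    return (words[state >> 5] >> (state & 31)) & 1


traceback_bytes = DEPTH * WORDS_PER_STEP * 4     # 128 * 2 * 4 = 1024
total_bytes = 512 + traceback_bytes + 768        # metrics + traceback + parity tables
print(traceback_bytes, total_bytes)              # 1024 2304
```

On the ARM the shift in `get_decision` is what the barrel shifter makes cheap; in Python it only illustrates the bit layout.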

Simulation Results
Simulated decoding of 4096 bits on a 125 MHz, 3.3 V model
Execution requires 11.72M ARM instruction cycles, giving a 44 kb/s data rate
Power consumption was estimated at 52.47 mW
Scaling simulation results up to a 275 MHz, 2.0 V ARM (fastest commercially available) gives 96 kb/s at 42.40 mW
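
The figures above can be reproduced with a few lines of arithmetic. The scaling rule used here (throughput proportional to clock frequency, power proportional to frequency times supply voltage squared) is an assumption; it is the standard first-order CMOS model and it happens to reproduce the slide's numbers.

```python
bits = 4096
cycles = 11.72e6
f_sim, v_sim, p_sim = 125e6, 3.3, 52.47    # Hz, V, mW (simulated model)
f_new, v_new = 275e6, 2.0                  # Hz, V (scaled target)

rate_sim = bits / (cycles / f_sim)         # bits per second at 125 MHz
rate_new = rate_sim * f_new / f_sim        # throughput scales with clock
p_new = p_sim * (f_new / f_sim) * (v_new / v_sim) ** 2   # assumed P ~ f * V^2

print(round(rate_sim / 1e3, 1))   # about 44 kb/s
print(round(rate_new / 1e3, 1))   # about 96 kb/s
print(round(p_new, 2))            # about 42.40 mW
```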

Summary
Clock speed: 275 MHz
Execution performance: 96 kb/s
Power dissipation: 42.40 mW (5.68 mW/mm^2)
Area: 7.47 mm^2 in 0.25 um
Design effort: 4 days
Portability very high: code is ANSI C; architecture-dependent tweaks may need reworking

Conclusion/Thanks
One-bit quantization gives opportunities for performance improvements, at a huge cost in QOR
The Viterbi algorithm would benefit greatly from having hardware parallelism (vector ops) available
Many thanks to Marlene Wan for providing power estimation

Viterbi Decoder Implementation on a TI C54x
EE 290S Final Project May 4, 1999 Paul Husted

Introduction
Implemented Viterbi Decoder on a TI TMS320VC5402 DSP
Examine:
  Performance (bits/sec)
  Power (mW/bit)
  Cost ($/unit, area)
  Design effort (engineer-months)

Viterbi Decoder Specifications
Implementation Specifications:
  Constraint Length (K, aka L) = 7
  Branch Metric Calculation is QPSK
  Soft Decision Wordlength (q) = 6
  Chain-backing Depth (D) = 96
  Gen. Polynomials: p0 = 171, p1 = 133 (octal)
  Data Rate: 100 kb/s
  Goal: Bit Error Rate (BER) = 10^-4

C54x Capabilities
Capabilities of all C54x DSP Cores:
  Three 16-bit Data Buses, One 16-bit Program Bus
  40-bit ACC with 40-bit Barrel Shifter
  Two Independent Accumulators
  A Single-Cycle Non-Pipelined MAC
  Single-Instruction Repeat and Block-Repeat
  Six-Channel DMA Controller
  Arithmetic Instructions with Parallel Store and Parallel Load

Helpful Instructions for the Viterbi Decoder

The C54x Has a Specialized Instruction Set
  Dual Add/Subtract in 1 Cycle
  Compare, Select, and Store Unit (CSSU)
    Compare Branch Metrics
    Store Larger Value, Store Decision Bit
    Increment Address Registers in Circular Buffer
    1 Cycle
  Allows Butterfly (2 States) in 5 Cycles

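Butterfly Implementation
[Figure: one butterfly. With the local distance held in the T register, the old metrics Old(2*j) and Old(2*j+1) are combined by DADST followed by CMPS to give New(j), and by DSADT followed by CMPS to give New(j+2^(K-2)).]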
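
A butterfly of this shape can be sketched in a few lines. This is an illustration under stated assumptions, not the project's assembly: the exact pairing of signs (add/subtract for the upper output, subtract/add for the lower) follows the usual dual add-subtract pattern, the function names are hypothetical, and larger metrics are treated as better.

```python
K = 7
HALF = 1 << (K - 2)    # 2^(K-2) = 32 butterflies per trellis step


def butterfly(old, j, dist):
    """Update states j and j + 2^(K-2) from old states 2j and 2j+1."""
    a, b = old[2 * j], old[2 * j + 1]
    up0, up1 = a + dist, b - dist      # dual add/subtract (DADST-like, assumed signs)
    lo0, lo1 = a - dist, b + dist      # dual subtract/add (DSADT-like, assumed signs)
    upper = (up0, 0) if up0 >= up1 else (up1, 1)   # compare-select (CMPS-like)
    lower = (lo0, 0) if lo0 >= lo1 else (lo1, 1)
    return upper, lower


def acs_step(old, dists):
    """Run all butterflies for one received symbol; dists[j] is the local distance."""
    new = [0] * len(old)
    dec = [0] * len(old)
    for j in range(HALF):
        (new[j], dec[j]), (new[j + HALF], dec[j + HALF]) = butterfly(old, j, dists[j])
    return new, dec


new, dec = acs_step(list(range(64)), [1] * HALF)
print(new[0], dec[0], new[32], dec[32])   # 1 0 2 1
```

At 5 cycles per butterfly, the 32 butterflies of a K = 7 code cost 160 cycles per decoded bit for the ACS loop alone, before traceback and overhead.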

TI TMS320VC5402 DSP
Specific Chip Characteristics:
  Operates at 100 MIPS
  Core Voltage of 1.8 V
  I/O Pins Operate at 3.3 V
  16K Words x 16 Bits of Dual-Access RAM
  4K Words x 16 Bits of ROM
  Internal DMA
  Created in 0.18 Micron Technology

Dataflow
Data I/O
Input Values Assumed to be Placed at Specified Memory Location by Internal DMA
Output Values Assumed to be removed from another Memory Location by Internal DMA
Alternatively, Data Could be Placed in this Memory Location After Other On-Chip Receiver Processing

Implementation Analysis
Viterbi Decoder Code Created in Assembly
Linked to Processor-Specific Memory Map
Simulated on Cycle-Accurate Simulator
  Used Correct Memory Model for VC5402

Implementation Results
                        Estimated              Actual
Code Size               500 Instructions       1032 (16-bit) Words
Data Size               1280 (16-bit) Words    1280 (16-bit) Words
MIPS (100 Kbps)         18.425                 21.53125
Max. Speed (100 MIPS)   582 Kbps               464.7 Kbps

Power Calculation
Compared with TI Figures:
  TI Uses 1/2 MACs, 1/2 NOPs for Power Figure
  .25 Micron Estimate is .45 mA/MIPS
Fully Static Design Can Be Clocked at Any Rate
Viterbi Code Uses 1.08 Times More Current than TI Estimate
At 22 MIPS, 19.25 mW are Consumed in the Core
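
The 19.25 mW figure can be reproduced as follows. Two readings are assumed here: "1.08 times more current" is taken as a multiplicative factor of 1.08 on TI's estimate, and the 1.8 V core voltage of the VC5402 is used as the supply. With those assumptions the arithmetic matches the slide.

```python
ma_per_mips = 0.45      # TI estimate, mA per MIPS
current_factor = 1.08   # Viterbi code vs. TI's MAC/NOP mix (assumed multiplicative)
mips = 22               # load at 100 kb/s, rounded up from 21.53
core_voltage = 1.8      # V (assumed: VC5402 core supply)

current_ma = ma_per_mips * current_factor * mips   # core current in mA
power_mw = current_ma * core_voltage               # core power in mW

print(round(current_ma, 3))   # 10.692
print(round(power_mw, 2))     # 19.25
```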

Area Estimate
TI Will Not Release Die Sizes
.25 Micron Chips Fit Inside a 3.2 mm x 3.2 mm Area on a 144-pin BGA
Maximum Die Size is thus 10.24 mm^2

Development Cost
Engineering Time
  Estimate: 3 days
  Assumes Engineer Has Experience with Assembly Language and TI Tools
Tool Cost: $13262.45
  Includes Emulator, Simulator, Compiler, Assembler, Linker, Debugger
Cost of Chip: $8.52

Conclusion
Optimized Instructions Make Algorithm Efficient
Static Design Allows Clock Rate to be Set As Needed to Reduce Power
Flexibility Exists to Perform Other Processing of Data
Very Little Development Time/Cost

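Tensilica Viterbi Implementation
Niraj Shah, Scott Weber
290A Final Presentation

Tensilica Flow
[Figure: tool flow. The designer supplies C sources (.c) and TIE descriptions; the Tensilica Processor Generator generates the microarchitecture (uArch) and the tools; xt-gcc compiles the C sources into object files (.o), which are run with xt-run.]
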
Xtensa Architecture
[Figure: the Xtensa core passes the source registers Rs and Rt and the instruction I to the TIE unit, which returns its result in Rr.]
TIE Extensions:
  single cycle
  state free
  no new exceptions
  no stalls
  typeless data
Rs, Rt, Rr are 32-bit registers
I is the instruction controlling the TIE unit
Xtensa Core is a 32-bit configurable RISC processor

Viterbi Architecture
[Figure: blocks labeled ADC, I/O Device, RAM, Init, ACS, and TraceBack, with the annotation "Measured Performance Here".]

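TIE SetupBMreg (ACS)
[Figure: datapath of the SetupBMreg instruction. Fields of Rs and Rt (mask 0x7F) are combined by add and subtract units under instruction control to form the four branch metrics bm0 to bm3, which are packed into the four bytes of Rr (bits 0 to 7, 8 to 15, 16 to 23, 24 to 31).]
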
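ACS TIE Extension (ACS)
[Figure: datapath of the ACS instructions (ACS03 || ACS12 || ACS30 || ACS21). Rs holds packed path metrics (pm); Rt holds the four packed branch metrics bm3 to bm0 in its four bytes. Adders and an msb test select the result; Rr receives the decision bit and the new path metric, with the remaining bits set to 0.]
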
ACS TIE Extension with State (ACS)
[Figure: datapath of the ACS instruction with state. Rs and Rt hold packed path metrics (pm); a further register holds the four packed branch metrics bm3 to bm0. Adders, msb tests, and control logic select the results; Rr receives two decision bits and two new path metrics.]

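TIE Zmask (TraceBack)
[Figure: datapath of the Zmask instruction used in traceback. Bits from Rs and Rt are shifted left by one and combined with AND/OR operations using the masks 0x7F and 0x3F under instruction control; the result is written to Rr.]
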
Designs
All designs had a BER of 0.000095 after 10 million iterations
Design 1:  100 MHz, 48 mW, 1K DCache, 1K ICache, TIE
Design 1+: 222 MHz, 144 mW, 1K DCache, 1K ICache, TIE
Design 2:  100 MHz, 69 mW, 16K DCache, 16K ICache, TIE
Design 2:  222 MHz, 191 mW, 16K DCache, 16K ICache, TIE
Design 3:  222 MHz, 191 mW, 16K DCache, 16K ICache, TIE with state

Performance
[Bar chart, Kb/s, axis 0 to 1200; values read from the chart labels]
                      Cache    Perfect Cache
Design 1              118      409
Design 1+             263      909
Design 2 (100 MHz)    357      409
Design 2 (222 MHz)    793      909
Design 3              966      1142

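Energy Dissipation
[Bar chart, uJ/bit, axis 0 to 0.6; values read from the chart labels]
                      Cache    Perfect Cache
Design 1              0.4      0.12
Design 1+             0.54     0.16
Design 2 (100 MHz)    0.19     0.17
Design 2 (222 MHz)    0.24     0.21
Design 3              0.2      0.17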
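
The energy figures follow from dividing each design's power by its throughput, since mW divided by kb/s gives uJ per bit. The sketch below checks this; the assignment of chart values to designs is a reconstruction from the extracted chart labels and should be treated as an assumption, as are the dictionary and function names.

```python
# design: (power in mW, throughput with cache in kb/s, throughput with perfect cache in kb/s)
designs = {
    "Design 1":           (48,  118, 409),
    "Design 1+":          (144, 263, 909),
    "Design 2 (100 MHz)": (69,  357, 409),
    "Design 2 (222 MHz)": (191, 793, 909),
    "Design 3":           (191, 966, 1142),
}


def energy_uj_per_bit(power_mw, rate_kbps):
    """mW / (kb/s) = 1e-3 W / 1e3 (b/s) = 1e-6 J per bit."""
    return power_mw / rate_kbps


for name, (p, r_cache, r_perfect) in designs.items():
    print(name,
          round(energy_uj_per_bit(p, r_cache), 2),
          round(energy_uj_per_bit(p, r_perfect), 2))
```

The computed values agree with the chart labels to within rounding, which supports the pairing of values to designs used here.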

n(s*J)/Bit
[Bar chart, energy-delay per bit in n(s*J)/bit, axis 0 to 3.5; values read from the chart labels]
                      Cache    Perfect Cache
Design 1              3.39     0.293
Design 1+             2.05     0.176
Design 2 (100 MHz)    0.532    0.416
Design 2 (222 MHz)    0.315    0.231
Design 3              0.207    0.148

Die Area
[Bar chart, mm^2, axis 0 to 7; the Cache and Perfect Cache bars are identical]
Design 1              2.1
Design 1+             2.37
Design 2 (100 MHz)    6.14
Design 2 (222 MHz)    6.7
Design 3              6.7

Conclusions
TIE extensions, cache configuration, and improved code efficiency resulted in an order of magnitude improvement over our original design
For power and performance, the effect of cache size is greater than the effect of a higher clock frequency
Use voltage scaling to reduce the power
If streaming data, then scale frequency
Adding state will result in the ability to increase performance
Having the ability to remove core instructions will decrease decode complexity and should lower power and area

Soft Core Viterbi Decoder

EECS 290A Project
Dave Chinnery, Rhett Davis, Chris Taylor, Ning Zhang
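
High Level Architecture
[Figure: block diagram of the decoder annotated with each block's share of gates, area, and power; the block labels did not survive extraction.]
Recovered annotations (% Gates, % Area, % Power), one line per block:
  4%, 1%, 5%
  23%, 36%, 30%
  38%, 8%, 22%
  18%, 4%, 16%
  0%, 48%, 15%
  9%, 2%, 8%
  2%, 1%, 4%
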
Branch & Path Metric Generation

Branch Metrics Computation apparently implemented with a CORDIC block (contains 840 MUXs, 58 adders & flip-flops, 32 15-bit busses)
Branch Metrics Hard-wired to each ACS unit
Path Metrics Stored in ACS units
Each ACS unit handles 16 states
Hard-wired Path Metric Interconnect
[Figure: four ACS units, each with upper (U) and lower (L) ports, connected by the hard-wired interconnect.]

ACS Architecture
Each ACS unit stores 32 path metrics
Only two SRAMs are active at a time
Across all four ACS units, each path metric is stored twice
SRAM accounts for 88% of the area and 27% of the power for each ACS unit
[Figure: ACS unit datapath with 8x9 SRAMs, inputs labeled PML, PMU, BML, and BMU, a MUX, the Add-Compare-Select logic, and a pipeline register.]

Traceback Architecture
State-Machine blocks are just large sum-of-products combinational networks (351 gates each)
Traceback Memory Unit: 22% Area, 20% Power
Finite State Machine: 11% Area, 13% Power
Each memory unit contains a 16x64 SRAM and logic (192 MUXs, 128 flip-flops)
[Figure: traceback unit built from traceback memory units, each containing an SRAM, a MUX, and a pipeline register; signals labeled Decision Bits, Traceback, Next_ramin, and Out; a bus labeled 192.]

Design Flow
Synthesis & Module Generation
  Design Compiler synthesis script (from Mentor/Inventra)
  SRAM Generator (from Norman Walker)
Pre-Layout Verification & Analysis
  VHDL gate-level sims (timing verification, switching activity annotation)
  PowerMill simulations (SRAM, core)
  Design Compiler, Power Compiler (static timing, power analysis)
Floor Planning
  Floor planning (Preview)
Place & Route
  Place & Route (Silicon Ensemble)
Post-Layout Verification & Analysis
  Interconnect Parasitic Extraction (report simcap
  PowerMill simulations, PathMill static analysis
  Design Compiler, Power Compiler (static timing, power analysis with back-annotated interconnect parasitics)

Synthesis and SRAM Generation

Synthesis with Synopsys Design Compiler
  Constraint: 66 kHz clock (effectively infinite)
  Bottom-up synthesis of 62 VHDL entities
Low-Power SRAM generator (from Pleiades)
  Very large sense-amps, control logic
  Optimized for power, speed at low supply voltages
  Word length limited to a power of 2

Simulation Models
Behavioral C
  Parameterized, bit-true, and fast
  Used for system level design and BER simulations
Behavioral VHDL
  Parameterized, bit-true, and cycle-true
  Used for structural simulations and test bench reference
RTL VHDL
  Synthesizable, crafted for specific parameters and implementation structure
  Used for synthesis quality

BER Simulation Results

SRAM
Simulation Tools: TimeMill & PowerMill
Parameters
  66 MHz clock
  Voltage 2.5 V
  Randomly generated test vectors
Results
  Power Analysis
  Timing Analysis

SRAM: Power Numbers

SRAM used for ACS Unit
8 words by 9 data bits

Operations        Avg. (uA)   Avg. (mW)   Avg. (pJ)
Read Activity     663.73      1.659       24.885
Write Activity    563.21      1.408       21.120
Read/Write        612.29      1.530       22.950

Parasitic Extraction
Operations        Avg. (uA)   Avg. (mW)   Avg. (pJ)
Read Activity     949.89      2.3747      35.6205
Write Activity    772.830     1.9320      28.980
Read/Write        851.42      2.1285      31.9275
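
The three columns are related by simple arithmetic, sketched below. Two things are assumed rather than stated on the slide: the current unit is microamps (the unit prefix was lost in extraction), and the energy is per 15 ns cycle, which is roughly the period of the 66 MHz clock. With those assumptions the read-activity row is reproduced.

```python
SUPPLY_V = 2.5        # V, from the SRAM simulation parameters
CYCLE_NS = 15.0       # ns, assumed cycle time (about 1 / 66 MHz)


def power_mw(current_ua):
    """Average power from average current: uA * V = uW, then to mW."""
    return current_ua * SUPPLY_V / 1000.0


def energy_pj(current_ua):
    """Energy per cycle: mW * ns = pJ."""
    return power_mw(current_ua) * CYCLE_NS


print(round(power_mw(663.73), 3))    # 1.659
print(round(energy_pj(663.73), 2))   # 24.89
```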


SRAM: Power Numbers

SRAM used for Traceback Unit
16 words by 64 data bits

Parasitic Extraction?
Operations        Avg. (uA)   Avg. (mW)   Avg. (pJ)
Read Activity     2170.7      5.4267      81.4005
Write Activity    1893.4      4.7335      71.0025
Read/Write        2086.9      5.2172      78.2580

SRAM: Timing Numbers

Delays
Setup Time; Hold Time: time needed for data address to become stable

                  Setup (ns)   Hold (ns)   Data Resolution (ns)
ACS SRAM          ~1           ~2          ~1.8
Traceback SRAM    ~1           ~2          ~5

Place and Route

Floor planning of the Viterbi SRAM macro cells and standard cells was done in Preview, and Silicon Ensemble was used for routing.
Total SRAM macro cell area was 1.58 mm^2 (1.08 mm^2 with 9x8 SRAMs)
  Area of the 16 9x8 bit SRAM macro cells: 0.052 mm^2 each, 62% larger than required, as 16x8 bit SRAMs were used (SRAM generator output had been verified for powers of 2)
  Area of the 3 16x64 bit SRAM macro cells: 0.25 mm^2 each
Area of the standard cells: 1.02 mm^2 (0.35 mm^2 from DEF file)
Final chip area was 4.0 mm^2 (original estimate 2.5 mm^2)
Parasitics for timing simulation were extracted from the final routed nets in Silicon Ensemble.

Wiring Statistics
Six metal layers; layers 5 and 6 used for power and ground respectively
Ground and power spaced alternately 100 um apart horizontally and vertically
There were about 6200 nets and 46,114 vias
Total wire lengths:
  metal layer 1: 3,293 um
  metal layer 2: 458,440 um
  metal layer 3: 510,517 um
  metal layer 4: 218,023 um
  metal layer 5: 96,882 um signal, and 38,400 um power
  metal layer 6: 8,660 um signal, and 37,500 um ground
  wire length: 685 mm horizontal, 611 mm vertical, total 1296 mm

Final Placement and Routing

Significant routing congestion at 16 by 64 bit SRAM outputs, due to Silicon Ensemble grid size of 1 um (observe white and light blue wires). Minimum of 6 unroutable nets observed, even at 12 mm2 chip area. Final size was 1.25 mm x 3.2 mm, 4 mm2, with 9 unroutable nets. Violation reports in Silicon Ensemble did not identify which nets were unroutable, other than problems with ground and power connections.

Static Timing Checks

All timing checks performed with Design Compiler's report_timing command
Parasitic capacitances back-annotated with the set_load command
No RC parasitics annotated
No SRAM model was used for timing checks
Critical path was from ACS control logic, through a PM output MUX select signal (in an ACS unit), through the following ACS unit
Checks performed at 2.5 V

                     Delay Before       Delay After        Max Clock          Max Symbol
                     Annotation (ns)    Annotation (ns)    Frequency (MHz)    Rate (Msps)
Critical Path        8.7                17                 60                 3.8
Longest SRAM Path    8.5                14                 -                  -

Static Power Checks

Power                   Before Annotation   After SAIF Annotation   After Parasitic Annotation
Cell Internal (mW)      28                  20                      20
Net Switching (mW)      15                  6.3                     8.7
Total Dynamic (mW)      43                  26                      29
Cell Leakage (nW)       750                 810                     810

All power checks performed with Design Compiler's report_power command
Switching activity was measured for every output port (transition counts over a 16,000-cycle simulation)
Back-annotation performed with SAIF files
No SRAM model was used for power checks (added in manually)
Checks performed at 2.5 V with a 60 MHz clock

Delay and Energy Scaling

Performance Results
                              Optimized for    Optimized for    Optimized for
                              Performance      Energy           EDP
Supply Voltage (V)            2.5              0.8              1.25
Clock Rate (MHz)              60               7.46             25.12
Symbol Rate (Msps)            3.75             0.47             1.57
Energy Delay Product (fJs)    4.24             3.49             2.53
Power (mW)                    59.6             0.76             6.24

For fixed throughput requirement 100 ksps:
                              Optimized for    Optimized for    Optimized for
                              Performance      Energy           EDP
Supply Voltage (V)            2.5              0.8              1.25
Clock Rate (MHz)              1.6              1.6              1.6
Symbol Rate (Msps)            0.1              0.1              0.1
Power (mW)                    1.59             0.16             0.40
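
The relationships between these rows can be checked with a short calculation. It assumes 16 clock cycles per symbol (60 MHz / 3.75 Msps), defines the energy-delay product as energy per symbol times time per symbol, and assumes power scales linearly with clock rate at a fixed supply voltage. These are assumptions that reproduce the tabulated values to within rounding; the names are illustrative.

```python
CYCLES_PER_SYMBOL = 16    # assumed: 60 MHz / 3.75 Msps

# operating point: (clock in MHz, power in mW)
points = {
    "performance": (60.0, 59.6),
    "energy":      (7.46, 0.76),
    "edp":         (25.12, 6.24),
}


def edp_fjs(clock_mhz, power_mw):
    """Energy per symbol times time per symbol, in femtojoule-seconds."""
    rate = clock_mhz * 1e6 / CYCLES_PER_SYMBOL    # symbols per second
    energy = power_mw * 1e-3 / rate               # joules per symbol
    delay = 1.0 / rate                            # seconds per symbol
    return energy * delay * 1e15


def power_at_fixed_rate(clock_mhz, power_mw, target_mhz=1.6):
    """Scale power linearly with clock rate at the same supply voltage."""
    return power_mw * target_mhz / clock_mhz


for name, (f, p) in points.items():
    print(name, round(edp_fjs(f, p), 2), round(power_at_fixed_rate(f, p), 2))
```

Running every design at 1.6 MHz corresponds to the 100 ksps requirement (1.6 MHz / 16 cycles per symbol), which is why the low-voltage operating points win once throughput is fixed.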

Summary NORMALIZED (100kbs)

Implementation   Performance   Power    Norm     Gates    Area      Gates/      Power (uW)/   Effort
                 (kbs)         (mW)                       (mm^2)    Area        Gate          (days)
ASIC             100.00        0.14     1.0      35100    4.00      8775.00     0.004         30
DSP              100.00        1.97     14.3     47098    10.24     4599.41     0.042         3
CP 3             100.00        2.02     14.7     26480    6.69      3958.15     0.076         6
CP 2             100.00        2.47     17.9     47098    6.69      7040.06     0.052         6
ARM              100.00        36.86    266.8    50000    7.47      6695.68     0.737         4
CP 1             100.00        40.68    294.4    50000    2.10      23809.52    0.814         6

Summary MAX PERFORMANCE

Implementation   Performance   Power     Norm   Gates    Area      Gates/      Power (uW)/   Effort
                 (kbs)         (mW)                      (mm^2)    Area        Gate          (days)
ASIC             3750.00       50.60     1.0    35100    4.00      8775.00     1.44          30
CP 3             966.00        191.00    3.8    26480    6.69      3958.15     7.21          6
CP 2             793.00        191.00    3.8    47098    6.69      7040.06     4.06          6
DSP              464.70        89.46     1.8    47098    10.24     4599.41     1.90          3
CP 1             118.00        48.00     0.9    50000    2.10      23809.52    0.96          6
ARM              116.48        42.94     0.8    50000    7.47      6695.68     0.86          4
Reference        100.00        N/A       N/A    N/A      N/A       N/A         N/A           N/A
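
The derived columns of the max-performance summary can be recomputed from power, gate count, and area. The sketch below is a consistency check under the assumption that Norm is power relative to the ASIC, Gates/Area is a plain ratio, and Power/Gate is power in uW divided by gate count; the row values are the ones read from the summary, and the names are illustrative.

```python
# implementation: (power in mW, gates, area in mm^2)
rows = {
    "ASIC": (50.60, 35100, 4.00),
    "CP 3": (191.00, 26480, 6.69),
    "CP 2": (191.00, 47098, 6.69),
    "DSP":  (89.46, 47098, 10.24),
    "CP 1": (48.00, 50000, 2.10),
    "ARM":  (42.94, 50000, 7.47),
}

ASIC_POWER_MW = rows["ASIC"][0]


def derived(power_mw, gates, area_mm2):
    """Return (norm, gates per mm^2, uW per gate)."""
    norm = power_mw / ASIC_POWER_MW          # power relative to the ASIC
    density = gates / area_mm2               # gates per mm^2
    uw_per_gate = power_mw * 1000.0 / gates  # mW -> uW, per gate
    return norm, density, uw_per_gate


for name, row in rows.items():
    norm, density, uw = derived(*row)
    print(f"{name:5s} norm={norm:.1f} gates/area={density:.2f} uW/gate={uw:.2f}")
```

Most rows match the summary to the printed precision; the ARM gates/area comes out slightly lower than the tabulated 6695.68, which suggests the 50000-gate figure is a rounded value.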