Prepared by: Professor Kurt Keutzer Computer Science 252, Spring 2000
[Figure: implementation alternatives DSP, ASIP, and ASIC; block labels: DSP Core, Program ROM, Control, ASIP Core, Coefficient ROM]
What are the relevant features of this architecture that support this application?
Fix application constraints across all implementations (above)
Fix key parameters for implementation comparison
performance - instruction set simulator, eval board
area - data sheets, gate estimates
power - eval board, TI application note
Examine different algorithms
Start with code downloaded from the web - multimedia benchmarks etc.
Build your software development/evaluation environment:
http://www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm
Phase 0: Research
Find application notes, research reports for your own or comparable architectures
Phase 1: Estimation
Develop a quick estimate based on initial code
Integrate research findings
Do a quick back-of-envelope reality check
Phase 2
Tailor algorithm, implementation to architecture
Do your very best!
Have a contest with your partner
Phase 3: Evaluation
Apply evaluation tools to key parameters
Evaluate and compare results - return to 2
If your life depended on choosing the right part - what would you do?
Now evaluate - if I were the architect of this processor/implementor of this system on a chip, what would I do differently?
time to configure?
software development environment
library/application software support
application engineering support
Viterbi Algorithm
Prof. Heinrich Meyr University of Aachen
[Figure: digital transmission system. Blocks: Signal Source, Source Coder, Modulator, Channel, Demodulator, Viterbi Decoder, Source Decoder, Signal Sink. Signals: information bits i_k, received symbols y_k, decoded bits]
[Figure: system model. Convolutional coder with states 0 to 3 (known start state X_0 = 0, known end state X_T = 0), BPSK mapping of coded bits to channel symbols c_k, channel with additive white noise n_k producing y_k, and Viterbi decoder with decisions and survivor memory producing the decoded bits]
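[Figure: add-compare-select recursion. For state i at step k, the new path metric is Max{ path metric of predecessor Z(0,i) at step k-1 + branch metric of transition (0,i) at step k, path metric of predecessor Z(1,i) at step k-1 + branch metric of transition (1,i) at step k }. The winning branch is the survivor path, the other is the competing path; the decision d_{i,k} = 1 marks that branch (1,i) was selected]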
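The add-compare-select step in the figure can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `acs`, its argument names, and the tie-breaking rule (branch 0 wins on a tie) are assumptions. It follows the Max convention shown on the slide, so larger metrics are better.

```python
def acs(pm_prev, pred0, pred1, bm0, bm1):
    """One add-compare-select step for a single state i.

    pm_prev      -- path metrics of all states at step k-1
    pred0, pred1 -- indices of the predecessor states Z(0,i) and Z(1,i)
    bm0, bm1     -- branch metrics of transitions (0,i) and (1,i) at step k
    Returns (new path metric, decision bit d_{i,k}).
    """
    cand0 = pm_prev[pred0] + bm0   # add
    cand1 = pm_prev[pred1] + bm1
    if cand1 > cand0:              # compare (ties go to branch 0, an assumption)
        return cand1, 1            # select: branch (1,i) survives
    return cand0, 0                # select: branch (0,i) survives


# Hypothetical metrics, chosen only to exercise both outcomes.
metric, decision = acs([5, 2, 7, 1], pred0=0, pred1=1, bm0=1, bm1=6)
```

In the example call the candidates are 5 + 1 and 2 + 6, so branch (1,i) survives and the decision bit is 1.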
[Figure: Viterbi decoder block diagram. Channel symbols y feed the TMU, which produces branch metrics; the ACSU, with a latch holding the state metrics, produces decision bits; the SMU produces the decoded bits u_k]
[Figure: soft-decision quantizer characteristic: quantized value Q (levels shown: 0, -1, -2) versus normalized input level, with saturation at the extremes]
Architecture
[Figure: state-parallel architecture. One ACS unit with a state metric register per state (0 to N-1), connected through a Shuffle Exchange Network; the decisions dec(i,k) feed the SMU]
Alternative Implementations
[Figure: two alternatives for the butterfly: four dedicated ACS units, or two shared ACS units with four blocks labeled M]
Butterfly trellis structure and resource sharing for the K = 3, rate 1/2 code
[Figure: ACS units read the old state metrics of states 0 to 3 at step k from the path metric memory and write the new state metrics at step k+1 back through multiplexers (MUX)]
[Figure: Decoding. Survivor memory for the four-state trellis: the decision bits d_{0,k} to d_{3,k} update, for each state 0 to 3, a chain of estimated bits û_k, û_{k-1}, ..., û_{k-D+1}, û_{k-D}; the decoded bit is taken at depth D]
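To make the decoding step concrete, here is a minimal hard-decision Viterbi decoder with traceback for a K = 3, rate 1/2 code such as the one named in the butterfly caption. It is a sketch under stated assumptions, not the lecture's implementation: the generator polynomials (7, 5 octal) are not given in the slides, the metric is Hamming distance (so the best path has the minimum metric, unlike the Max form of the slides, which suits correlation metrics), and the whole block is traced back from the known end state 0 instead of using a fixed depth D.

```python
K = 3                      # constraint length
G = (0b111, 0b101)         # ASSUMED generators (7, 5 octal); not from the slides
N_STATES = 1 << (K - 1)


def parity(x):
    return bin(x).count("1") & 1


def encode(bits):
    """Rate-1/2 encoder; appends K-1 zero tail bits to return to state 0."""
    state, out = 0, []
    for b in list(bits) + [0] * (K - 1):
        reg = (b << (K - 1)) | state          # newest bit sits in the MSB
        out.extend(parity(reg & g) for g in G)
        state = reg >> 1                      # drop the oldest bit
    return out


def decode(symbols):
    """Hard-decision Viterbi decoding with full traceback."""
    inf = float("inf")
    pm = [0] + [inf] * (N_STATES - 1)         # known start state 0
    decisions = []
    for k in range(0, len(symbols), 2):
        r = symbols[k:k + 2]
        new_pm, dec = [inf] * N_STATES, [0] * N_STATES
        for s in range(N_STATES):
            b = s >> (K - 2)                  # input bit that leads into s
            for d in (0, 1):                  # the two predecessors of s
                prev = ((s << 1) & (N_STATES - 1)) | d
                reg = (b << (K - 1)) | prev
                bm = sum(parity(reg & g) != x for g, x in zip(G, r))
                if pm[prev] + bm < new_pm[s]:  # add, compare, select
                    new_pm[s], dec[s] = pm[prev] + bm, d
        pm = new_pm
        decisions.append(dec)
    state, bits = 0, []                       # known end state 0
    for dec in reversed(decisions):           # traceback through decision bits
        bits.append(state >> (K - 2))
        state = ((state << 1) & (N_STATES - 1)) | dec[state]
    bits.reverse()
    return bits[:-(K - 1)]                    # strip the tail bits


message = [1, 0, 1, 1, 0, 0, 1, 0]
received = encode(message)
received[3] ^= 1                              # inject one channel error
recovered = decode(received)
```

With this state numbering, predecessors 2j and 2j+1 feed new states j and j + 2^(K-2), which is the butterfly structure the slides exploit.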
data rate: 100 kb/s
goal: bit error rate (BER) = 10^-4
QPSK
soft decision wordlength (q) = 6
ARM Overview
32-bit RISC microprocessor
Five stage pipeline
Features fast ALU operations (barrel shifter)
Scalar integer unit, no FPU
Algorithm Tweaking
Performing the metric computation through table lookup (load = 1 delay slot) is faster than using ALU (multiplication = up to 3 delay slots)
Parity computation (Viterbi code) can also be done through table lookup
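As an illustration of the parity lookup idea, the sketch below replaces a bit-by-bit parity loop with byte-wide table lookups. The table layout (one 256-entry table indexed per byte) and the function names are assumptions made for this example; the slides do not describe how the actual lookup tables were organized.

```python
# PARITY[x] is 1 when the byte x has an odd number of set bits.
PARITY = [bin(x).count("1") & 1 for x in range(256)]


def parity_alu(x):
    """Reference version: fold the bits one at a time with ALU operations."""
    p = 0
    while x:
        p ^= x & 1
        x >>= 1
    return p


def parity_lookup(x):
    """Parity of a value of up to 16 bits using two table lookups."""
    return PARITY[x & 0xFF] ^ PARITY[(x >> 8) & 0xFF]


# The two versions agree on every 16-bit input.
agree = all(parity_lookup(x) == parity_alu(x) for x in range(1 << 16))
```

The lookup version trades a small table in memory for a fixed, short instruction sequence, which is the trade-off the slide describes for the metric computation as well.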
For 128-level traceback, memory requirements are 512 bytes (metrics table) + 1024 bytes (traceback) + 768 bytes (parity lookup tables) = 2304 bytes
Simulation Results
Simulated decoding of 4096 bits on a 125 MHz 3.3V model
Execution requires 11.72M ARM instruction cycles, giving 44 kb/s data rate
Power consumption was estimated at 52.47 mW
Scaling simulation results up to 275 MHz 2.0V ARM (fastest commercially available) gives 96 kb/s at 42.40 mW
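The figures on this slide can be reproduced with a short calculation. The sketch assumes that the data rate scales linearly with clock frequency and that power scales as f * V^2 (the usual dynamic-power model); the slides do not state the scaling rule, but these assumptions reproduce the quoted numbers. Function names are illustrative.

```python
def data_rate_kbps(bits, cycles, clock_hz):
    """Decoded bits per second, in kb/s, for a run of `cycles` clock cycles."""
    return bits / (cycles / clock_hz) / 1e3


def scale_power_mw(p_mw, f_old, f_new, v_old, v_new):
    """ASSUMED model: dynamic power proportional to f * V^2."""
    return p_mw * (f_new / f_old) * (v_new / v_old) ** 2


rate_125 = data_rate_kbps(4096, 11.72e6, 125e6)        # about 43.7 kb/s
rate_275 = rate_125 * 275e6 / 125e6                    # about 96.1 kb/s
power_275 = scale_power_mw(52.47, 125e6, 275e6, 3.3, 2.0)  # about 42.40 mW
```

The 2.2x clock increase alone would raise power to about 115 mW; the drop from 3.3 V to 2.0 V more than offsets it.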
Summary
Clock speed: 275 MHz
Execution Performance: 96 kb/s
Power Dissipation: 42.40 mW (5.68 mW/mm2)
Area: 7.47 mm2 in 0.25 µm
Design Effort: 4 days
Portability very high: code is ANSI C; architecture-dependent tweaks may need reworking
Conclusion/Thanks
One-bit quantization gives opportunities for performance improvements, at a huge cost in QOR
Viterbi algorithm would benefit greatly from having hardware parallelism (vector ops) available
Many thanks to Marlene Wan for providing power estimation
Introduction
Implemented Viterbi Decoder on a TI TMS320VC5402 DSP
Examine:
Constraint Length (K aka. L) = 7
Branch Metric Calculation is QPSK
Soft Decision Wordlength (q) = 6
Chain-backing Depth (D) = 96
Gen. Polynomials: p0 = 171, p1 = 133 (octal)
Data Rate 100 kb/s
Goal: Bit Error Rate (BER) = 10^-4
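For reference, a rate-1/2 convolutional encoder with these parameters can be sketched as follows. The bit ordering (newest input bit in the most significant position of the shift register, polynomials read MSB first) is an assumption, since the slides give only the octal values; other references use the mirrored convention.

```python
K = 7                       # constraint length, so 2^(K-1) = 64 states
POLYS = (0o171, 0o133)      # generator polynomials p0, p1 (octal)


def conv_encode(bits):
    """Emit two coded bits per input bit (ASSUMED bit order: newest bit = MSB)."""
    state, out = 0, []      # state holds the K-1 most recent input bits
    for b in bits:
        reg = (b << (K - 1)) | state
        out.extend(bin(reg & g).count("1") & 1 for g in POLYS)
        state = reg >> 1
    return out


# The impulse response reads the two polynomials out bit by bit.
impulse = conv_encode([1, 0, 0, 0, 0, 0, 0])
```

Under this convention the impulse response interleaves the binary digits of 171 (1111001) and 133 (1011011), which is a quick way to check an encoder implementation.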
C54x Capabilities
Capabilities of all C54x DSP Cores:
Three 16-bit Data, One 16-bit program bus
40 bit ACC with 40 bit barrel shifter
Two independent accumulators
A single cycle non-pipelined MAC
Single-instruction repeat and block-repeat
Six channel DMA controller
Arithmetic instructions with parallel store and parallel load
Compare Branch Metrics
Store Larger Value, Store Decision Bit
Increment Address Registers in Circular Buffer
1 Cycle
Butterfly Implementation
[Figure: butterfly connecting old states Old(2*j) and Old(2*j+1) to new states New(j) and New(j+2^(K-2))]
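The butterfly indexing can be enumerated directly. The sketch assumes the usual reading of the slide labels, namely that old states 2*j and 2*j+1 feed new states j and j + 2^(K-2); the generator function name is illustrative.

```python
def butterflies(K):
    """Yield ((old states), (new states)) for each butterfly of a 2^(K-1)-state trellis."""
    half = 1 << (K - 2)
    for j in range(half):
        yield (2 * j, 2 * j + 1), (j, j + half)


# K = 7 gives 64 states grouped into 32 butterflies.
pairs = list(butterflies(7))
first, last = pairs[0], pairs[-1]
```

Each butterfly reads two adjacent old metrics and writes two new metrics half the state space apart, which is why a circular buffer with auto-incremented address registers fits the access pattern.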
TI TMS320VC5402 DSP
Specific Chip Characteristics:
16K Word x 16 Bits of Dual-Access RAM
4K Word x 16 Bits of ROM
Internal DMA
Created in 0.18 Micron Technology
Dataflow
Data I/O
Output Values Assumed to be removed from another Memory Location by Internal DMA
Alternatively, Data Could be Placed in this Memory Location After Other On-Chip Receiver Processing
Implementation Analysis
Viterbi Decoder Code Created in Assembly
Linked to Processor Specific Memory Map
Simulated on Cycle-Accurate Simulator
Implementation Results
                   Estimated              Actual
Code Size          500 Instructions       1032 (16-bit) Words
Data Size          1280 (16-bit) Words    1280 (16-bit) Words
MIPS (100 Kbps)    18.425                 21.53125
                                          464.7 Kbps
Power Calculation
Compared with TI Figures:
TI uses 1/2 MACs, 1/2 NOPs For Power Figure
0.25 Micron Estimate is 0.45 mA/MIPS
Area Estimate
TI Will Not Release Die Sizes
.25 Micron Chips Fit Inside 3.2 mm x 3.2 mm Area on a 144 pin BGA
Development Cost
Engineering Time
Estimate - 3 days
Conclusion
Optimized Instructions Make Algorithm Efficient
Static Design Allows Clock Rate to be Set As Needed to Reduce Power
Flexibility Exists to Perform Other Processing of Data
Very Little Development Time/Cost
[Figure: datapath of a TIE ACS instruction. Path metrics (pm) packed into bit fields of source registers Rs and Rt are added and compared (msb, =1? tests) under Control; result register Rr receives the selected path metrics and the decision bits]
Tensilica Flow
[Figure: Tensilica flow. Elements: .c sources, TIE description, Designer, Tensilica Processor Generator (gen), uArch, xt-gcc, .o files, xt-run]
Xtensa Architecture
TIE Extensions:
no new exceptions
no stalls
typeless data
Rs, Rt, Rr are 32-bit registers
I is the instruction controlling the TIE unit
Xtensa Core is a 32-bit configurable RISC processor
[Figure: Xtensa Core connected to the TIE unit through Rs, Rt, I, and Rr]
Viterbi Architecture
[Figure: Viterbi architecture. Blocks: ADC, I/O Device, RAM, Init, ACS, TraceBack]
[Figure: TIE instruction datapath: bit fields of Rs and Rt (bits 8:7, constant 0x7F) are combined by add and subtract; result in Rr]
48
Rs
31 27 pm17 11 pm1:0
Rt
31 24:23 16:15 8:7 0 bm3 bm2 bm1 bm0
=1?
msb
+
ACS03 || ACS12 || ACS30 || ACS21
instruction
0:1
decision bit pm
11:12
0s
31
Kurt Keutzer
Rr
49
[Figure: datapath of a TIE ACS instruction. Path metrics (pm) packed into bit fields of source registers Rs and Rt are added and compared (msb, =1? tests) under Control; result register Rr receives the selected path metrics and the decision bits]
[Figure: TIE instruction datapath: fields of Rs and Rt are combined with a left shift by 1, masks 0x7F and 0x3F, and AND/OR under Control; result in Rr]
Designs
All designs had a BER of 0.000095 after 10 million iterations
Design 1
100 MHz, 48 mW, 1K DCache, 1K ICache, TIE
Design 1+
222 MHz, 144 mW, 1K DCache, 1K ICache, TIE
Design 2
Design 2
Design 3
222 MHz, 191 mW, 16K DCache, 16K ICache, TIE with state
Performance
[Figure: bar chart of performance in Kb/s (axis 0 to 1200); values shown: 118, 263, 357, 409, 409, 793, 909, 909, 966, 1142]
Energy Dissipation
[Figure: bar chart of energy dissipation in uJ/bit (axis 0 to 0.6) for Design 1, 1+, 2, 2, 3; values shown: 0.4, 0.54, 0.12, 0.16, 0.19, 0.17, 0.24, 0.21, 0.2, 0.17]
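Energy per bit is power divided by data rate, and the units work out conveniently: mW divided by Kb/s is uJ/bit. The sketch below checks this against the first two bars of the chart. Pairing 48 mW with 118 Kb/s (Design 1) and 144 mW with 263 Kb/s (Design 1+) is an inference from the numbers on the surrounding slides, not something the deck states explicitly.

```python
def energy_per_bit_uj(power_mw, rate_kbps):
    """mW / (Kb/s) = (1e-3 J/s) / (1e3 bit/s) = 1e-6 J/bit, i.e. uJ/bit."""
    return power_mw / rate_kbps


design_1 = energy_per_bit_uj(48, 118)        # about 0.41 uJ/bit
design_1_plus = energy_per_bit_uj(144, 263)  # about 0.55 uJ/bit
```

Raising the clock from 100 MHz to 222 MHz increases throughput by about 2.2x but power by 3x, so the energy per bit gets worse in this configuration.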
[Figure: bar chart in n(s*J)/Bit (axis 1.5 to 3.5); values shown: 2.05, 3.39]
Die Area
[Figure: bar chart of die area in mm2 (axis 0 to 7); values shown: 2.1, 2.1, 2.37, 2.37, 6.14, 6.14, 6.7, 6.7, 6.7, 6.7]
Conclusions
TIE extensions, cache configuration, and improved code efficiency resulted in an order of magnitude improvement from our original
For power and performance, the effect of cache size is greater than the
ACS Architecture
8x9 SRAM
Each ACS unit stores 32 path metrics
Only two SRAMs are active at a time
Across all four ACS units, each path metric is stored twice
SRAM accounts for 88% of the area and 27% of the power for each ACS unit
[Figure: ACS unit with Add, Compare, and Select stages; blocks labeled PML, BML, PMU, BMU, MUX, and Pipeline Register]
Traceback Architecture
[Figure: Traceback Unit: SRAM holding the Decision Bits (192), MUX, Pipeline Register, and State-Machine blocks built from combinational networks (351 gates each); signal Next_ramin]
Design Flow
Design Compiler Synthesis script (from Mentor/Inventra)
PowerMill Simulations (SRAM, core)
Design Compiler, Power Compiler (Static timing, power analysis)
Very large sense-amps, control logic
Optimized for power, speed at low supply-voltages
Word-length limited to a power of 2
Simulation Models
Behavioral C
Parameterized, bit-true, and fast
Used for system level design and BER simulations
Behavioral VHDL
Parameterized, bit-true, and cycle-true
Used for structural simulations and test bench reference
RTL VHDL
Synthesizable, crafted for specific parameters and implementation structure
Used for synthesis quality
SRAM
Simulation Tools: TimeMill & PowerMill
Parameters
Results
Power Analysis
Timing Analysis
Operations        Avg. (µA)    Avg. (mW)    Avg. (pJ)
Read Activity     663.73       1.659        24.885
Write Activity    563.21       1.408        21.120
Read/Write        612.29       1.530        22.950
Parasitic Extraction
Operations: Read Activity, Write Activity, Read/Write
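The three columns of the SRAM table are mutually consistent, which can be checked with a short calculation. The sketch rests on assumptions: the current column is read as microamps (the prefix appears to have been lost in extraction), the supply is taken as 2.5 V as quoted for the power checks later in the deck, and the 15 ns access period is inferred from the ratio of the energy and power columns; none of these is stated next to the table.

```python
VDD = 2.5          # volts, ASSUMED from the 2.5 V figure quoted for power checks
PERIOD_NS = 15.0   # ASSUMED: inferred from the pJ / mW ratio, not given in the slides


def power_mw(current_ua, vdd=VDD):
    """P = I * V, with current in microamps and the result in milliwatts."""
    return current_ua * vdd / 1e3


def energy_pj(p_mw, period_ns=PERIOD_NS):
    """E = P * t; milliwatts times nanoseconds gives picojoules."""
    return p_mw * period_ns


read_power = power_mw(663.73)     # about 1.659 mW
read_energy = energy_pj(1.659)    # about 24.885 pJ
```

The same two formulas reproduce the write and read/write rows to within rounding.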
Delays
Setup Time; Hold Time time needed for data address to become stable
Hold(ns) ~2 ~2
Area of the 16 9x8 bit SRAM macro cells: 0.052 mm2 each, 62% larger than required, as 16x8 bit SRAMs were used (SRAM generator output had been verified for powers of 2)
Area of the 3 16x64 bit SRAM macro cells: 0.25 mm2 each
Area of the standard cells: 1.02 mm2 (0.35 mm2 from DEF file)
Final chip area was 4.0 mm2 (original estimate 2.5 mm2)
Parasitics for timing simulation were extracted from the final routed nets in Silicon Ensemble.
Wiring Statistics
Six metal layers; layers 5 and 6 used for power and ground respectively
Ground and power spaced alternately 100 um apart horizontally and vertically
There were about 6200 nets and 46,114 vias
Total wire lengths:
metal layer 1: 3,293 um
metal layer 2: 458,440 um
metal layer 3: 510,517 um
8.7 8.5
17 14
60 -
3.8 -
                     Cell Internal (mW)   Net Switching (mW)   Total Dynamic (mW)   Cell Leakage (nW)
Before Annotation    28                   15                   43                   750
                     20                   6.3                  26                   810
                     20                   8.7                  29                   810
All power checks performed with Design Compiler's report_power command
Switching activity was measured for every output port (transition counts over 16,000-cycle simulation)
Back-annotation performed with SAIF files
No SRAM model was used for power checks (added in manually)
Checks performed at 2.5V w/ 60 MHz clock
Performance Results
                        ARM        CP 1
Performance (Kb/s)      100.00     100.00
Power (mW)              36.86      40.68
                        266.8      294.4
                        50000      50000
Area (mm2)              7.47       2.10
                        6695.68    23809.52
                        0.737      0.814
Design Effort (days)    4          6

                        DSP        CP 3       CP 2
Design Effort (days)    3          6          6
                        ASIC      CP 3      CP 2      DSP       CP 1       ARM       Reference
Performance (Kb/s)      3750.00   966.00    793.00    464.70    118.00     116.48    100.00
Power (mW)              50.60     191.00    191.00    89.46     48.00      42.94     N/A
                        1.0       3.8       3.8       1.8       0.9        0.8       N/A
                        35100     26480     47098     47098     50000      50000     N/A
Area (mm2)              4.00      6.69      6.69      10.24     2.10       7.47      N/A
                        8775.00   3958.15   7040.06   4599.41   23809.52   6695.68   N/A
                        1.44      7.21      4.06      1.90      0.96       0.86      N/A
Design Effort (days)    30        6         6         3         6          4