
Lecture 10b: Implementing DSP Functionality: Alternatives

Prepared by: Professor Kurt Keutzer
Computer Science 252, Spring 2000

With contributions from:

Prof. Heinrich Meyr, University of Aachen
Philip Chong, David Chinnery, Rhett Davis, Paul Husted, Niraj Shah, Chris Taylor, Scott Weber, Ning Zhang


Kurt Keutzer 1

System Implementation Choices

[Figure: system functionality implemented as an ASIC, an off-the-shelf µP/DSP, an embedded core µP/DSP, or an application-specific µP (ASIP). Block labels: DSP core, ASIP core, program ROM, coefficient ROM, control.]

Making a Successful Comparison - 1


Find an interesting application kernel
    Viterbi decoding for speech processing (not a full modem!)

Find realistic constraints native to the application
    n=2, K=7, QPSK, 100 kb/s, BER = 10^-4

Find architectures/implementations that are promising for the application
    TI TMS320C54, Tensilica Xtensa
    What are the relevant features of this architecture that support this application?

Fix application constraints across all implementations (above)

Fix key parameters for implementation comparison
    performance (constraint)
    area
    power

Making a Successful Comparison - 2


Identify how key parameters will be measured
    performance: instruction set simulator, eval board
    area: data sheets, gate estimates
    power: eval board, TI application note

Implement your application kernel
    Examine different algorithms
    Start with code downloaded from the web (multimedia benchmarks, etc.)
    Build your software development/evaluation environment:
        http://www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm

Making a Successful Comparison - 3


Implement your application kernel (cont.)

Phase 0: Research
    Find application notes, research reports for your own or comparable architectures

Phase 1: Estimation
    Develop a quick estimate based on initial code
    Integrate research findings
    Do a quick back-of-envelope reality check

Phase 2: Real implementation/Tuning
    Tailor algorithm, implementation to architecture
    Do your very best! Have a contest with your partner

Phase 3: Evaluation
    Apply evaluation tools to key parameters
    Evaluate and compare results, then return to Phase 2

If your life depended on choosing the right part, what would you do?

Making a Successful Comparison - 4


Final evaluation and comparison: compare all implementations
To evaluate for a product, everything is fair game
To evaluate principally the architectures, need to consider:
    fab differences: TSMC vs. IBM (10-20% faster)
    process differences: 0.35 micron vs. 0.25 micron (50% faster)
    power supply differences: 3.0V vs. 1.5V
    ASIC vs. custom implementations (2x faster)

Now evaluate: if I were the architect of this processor/implementor of this system on a chip, what would I do differently?
    cache sizes
    register availability
    additional instructions
    on-chip memory

Making a Successful Comparison - 5


Just for fun
In addition to primary constraints (speed, cost, power), final real-world considerations:
    business relationships (joint partnership with Lucent)
    time-to-market issues
        time to configure?
        software development environment
        library/application software support
        application engineering support

Viterbi Algorithm
Prof. Heinrich Meyr University of Aachen


Viterbi Decoders in digital communication systems

[Figure: digital communication system. Transmit chain: signal source, source coder (information bits), convolutional or trellis coder & mapper (channel symbols c_k), modulator, channel. Receive chain: demodulator (received symbols y_k), Viterbi decoder (decoded bits), source decoder, signal sink.]

Convolutional Coder and Trellis diagram


[Figure: convolutional coder, channel, and Viterbi decoder. The coder combines the information bits u_k with delayed copies (z^-1 elements) using modulo-2 addition to form the code symbols b_k; a BPSK mapper turns them into channel symbols c_k; the channel adds white noise n_k to give y_k. The trellis diagram shows states 0 to 3 between steps k and k+1, with known start state X_0 = 0 and known end state X_T = 0. In the Viterbi decoder, decisions are written to the survivor memory, which outputs the decoded bits.]

ACS recursion for M = 2

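[Figure: trellis detail for state i at time k. Two branches with branch metrics \lambda_k^{(0,i)} and \lambda_k^{(1,i)} arrive from the predecessor states Z(0,i) and Z(1,i), whose path metrics are \gamma_{Z(0,i),k-1} and \gamma_{Z(1,i),k-1}. One branch extends the survivor path, the other the competing path; the example shows decision d_{i,k} = 1.]

$$\gamma_{i,k} = \max\left\{ \lambda_k^{(0,i)} + \gamma_{Z(0,i),k-1},\ \lambda_k^{(1,i)} + \gamma_{Z(1,i),k-1} \right\}$$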
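A minimal Python sketch of one add-compare-select step for M = 2 may help make the recursion concrete. It assumes the larger metric survives, as in the Max form on the slide. The function and argument names (`acs`, `pm0`, `bm0`, ...) are illustrative and not from the lecture, and the convention that decision 1 means "second predecessor wins" is an assumption.

```python
def acs(pm0, bm0, pm1, bm1):
    """One add-compare-select step for a state with two predecessors.

    pm0, pm1: path metrics of the two predecessor states at time k-1.
    bm0, bm1: branch metrics of the two incoming branches at time k.
    Returns (new path metric, decision bit). Decision 1 means the second
    candidate won (assumed convention).
    """
    cand0 = pm0 + bm0          # add
    cand1 = pm1 + bm1
    if cand1 > cand0:          # compare
        return cand1, 1        # select second path
    return cand0, 0            # select first path (also on ties)


print(acs(10, 3, 8, 6))
```

A full decoder would apply this step to every state at every trellis step and hand the decision bits to the survivor memory.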

Viterbi Decoder block diagram

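[Figure: Viterbi decoder block diagram. Channel symbols y enter the TMU, which produces branch metrics for the ACSU; the ACSU feeds its state metrics back through a latch and passes decision bits to the SMU, which outputs the decoded bits u_k.]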

Characteristic of a 2-bit step-at-zero quantizer


[Figure: staircase characteristic of the quantizer. Horizontal axis: normalized input level, from -2 to 2. Vertical axis: interpretation, from -2 to 2. Output levels: Q = -2 (saturation), Q = -1, Q = 0, Q = 1 (saturation).]

Architecture


Node parallel ACS architecture


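[Figure: node-parallel ACS architecture. The TMU supplies the branch metrics \lambda_k^{(0,i)} and \lambda_k^{(1,i)} to N ACS units, one per state. Registers hold the path metrics \gamma_{0,k}, \gamma_{1,k}, ..., \gamma_{N-1,k}, which are routed back to the ACS units through a shuffle-exchange network. The decisions dec(i,k) go to the SMU.]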

Alternative Implementations

[Figure: two alternative implementations of a butterfly. Left: four dedicated ACS units. Right: two shared ACS units with memories (M) holding the metrics.]

Butterfly trellis structure and resource sharing for the K = 3, rate 1/2 code

[Figure: butterfly trellis for the four states of the K = 3 code. Old state metrics \gamma_{0,k} ... \gamma_{3,k} are read from the path metric memory, selected through MUXes, and processed by shared ACS units to produce the new state metrics \gamma_{0,k+1}, \gamma_{2,k+1}, \gamma_{1,k+1}, \gamma_{3,k+1}.]

Survivor Memory Unit


REA hardware architecture


[Figure: REA hardware architecture for a four-state trellis (states s0 to s3). Each state i has a row of D registers holding the survivor bits \hat{u}^{[i]}_k, \hat{u}^{[i]}_{k-1}, ..., \hat{u}^{[i]}_{k-D+1}, \hat{u}^{[i]}_{k-D}; processing elements PE steer the exchange between rows using the decision bits d_{0,k} ... d_{3,k}.]

Decoded Sequence: 0 0 ... 0 1 0

[Figure: survivor register contents from \hat{u}^{[0]}_{k-(D+M-1)} through \hat{u}^{[0]}_{k-D} to \hat{u}^{[0]}_k, split into a decoding section and a section for acquisition of the final survivor. Decoded sequence: 0 0 ... 0 1 0.]

Viterbi Project Constraints


uncoded word length = 1
coded word length (n) = 2
    this means that it is rate 1/2
chain-backing depth (D) = 96
generator polynomials: p0 = 171, p1 = 133 (octal)
    this means that p0 = 1111001, p1 = 1011011
constraint length (K aka. L) = 7
    this means that the number of states in trellis is 2^(K-1) or 64 states
data rate 100 kbs
goal: bit error rate (BER) = 10^-4
branch metric calculation is QPSK
soft decision wordlength (q) = 6
signal to noise ratio (SNR) degradation 0.05dB
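As a hedged illustration of these constraints, the sketch below builds a rate 1/2, K = 7 convolutional encoder from the two generator polynomials. The shift-register bit ordering (newest bit in the most significant position) is an assumption, since the slides do not specify it, and the names `encode` and `parity` are illustrative.

```python
G0 = 0o171                  # generator p0, binary 1111001
G1 = 0o133                  # generator p1, binary 1011011
K = 7                       # constraint length
NUM_STATES = 1 << (K - 1)   # 2^(K-1) = 64 trellis states


def parity(x):
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1


def encode(bits):
    """Encode a list of 0/1 bits; emits two code bits per input bit."""
    sr = 0  # K-bit shift register, newest bit in the MSB (assumed ordering)
    out = []
    for b in bits:
        sr = (b << (K - 1)) | (sr >> 1)
        out.append(parity(sr & G0))
        out.append(parity(sr & G1))
    return out


print(NUM_STATES, encode([1, 0]))
```

Two output bits per input bit is what makes the code rate 1/2; the six older bits of the register form the encoder state.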

Viterbi Decoder Implementation on an ARM

EE 290S Final Project
May 4, 1999
Phillip Chong

ARM Overview
32-bit RISC microprocessor
Five stage pipeline
Features fast ALU operations (barrel shifter)
Scalar integer unit, no FPU

Algorithm Tweaking
Performing the metric computation through table lookup (load = 1 delay slot) is faster than using ALU (multiplication = up to 3 delay slots)

Parity computation (Viterbi code) can also be done through table lookup
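A table-driven parity computation could look like the following sketch. The 256-entry table layout and the name `parity_lookup` are assumptions for illustration; the slides state only that parity can be computed through table lookup.

```python
# 256-entry table: PARITY8[x] is the XOR of the 8 bits of x.
PARITY8 = [bin(x).count("1") & 1 for x in range(256)]


def parity_lookup(x):
    """Parity of a value below 2**16 using two table lookups and one XOR."""
    return PARITY8[x & 0xFF] ^ PARITY8[(x >> 8) & 0xFF]


# Parity of the encoder register masked with a generator polynomial
# gives one code bit without looping over individual bits.
print(parity_lookup(0b1000000 & 0o171))
```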


Reducing Memory Footprint


Cache misses can be very costly due to pipeline stalls
We are willing to give up some algorithmic efficiency to eliminate cache misses
To minimize the memory footprint, we pack 32 bits of traceback into a single word; we can easily unpack this data using the barrel shifter (1 cycle operation)

For 128-level traceback, memory requirements are 512 bytes (metrics table) + 1024 bytes (traceback) + 768 bytes (parity lookup tables) = 2304 bytes
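The packing scheme can be sketched as follows, assuming the 64-state trellis from the project constraints and one decision bit per state per trellis step. The helper names `pack` and `unpack` are hypothetical; the shift-and-mask in `unpack` stands in for the single-cycle barrel shifter operation.

```python
NUM_STATES = 64      # assumed: 2^(K-1) states for K = 7
DEPTH = 128          # traceback levels
WORD_BITS = 32
WORDS_PER_STEP = NUM_STATES // WORD_BITS   # 2 words per trellis step


def pack(decisions):
    """Pack one step's decision bits (list of 0/1) into 32-bit words."""
    words = [0] * WORDS_PER_STEP
    for state, d in enumerate(decisions):
        words[state // WORD_BITS] |= d << (state % WORD_BITS)
    return words


def unpack(words, state):
    """Recover the decision bit of one state with a shift and a mask."""
    return (words[state // WORD_BITS] >> (state % WORD_BITS)) & 1


# Traceback storage: 128 steps * 2 words * 4 bytes per word.
traceback_bytes = DEPTH * WORDS_PER_STEP * (WORD_BITS // 8)
print(traceback_bytes)
```

Under these assumptions the traceback storage comes to 1024 bytes, which agrees with the traceback figure quoted on the slide.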

Simulation Results
Simulated decoding of 4096 bits on a 125 MHz 3.3V model
Execution requires 11.72M ARM instruction cycles, giving 44 kb/s data rate
Power consumption was estimated at 52.47 mW
Scaling simulation results up to 275 MHz 2.0V ARM (fastest commercially available) gives 96 kb/s at 42.40 mW
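The figures above can be reproduced with a short calculation. The sketch assumes throughput scales linearly with clock frequency and dynamic power scales with f * V^2; the slides do not state the scaling rule, so that rule is an assumption that happens to match the quoted numbers.

```python
bits = 4096
cycles = 11.72e6
f_sim, v_sim, p_sim = 125e6, 3.3, 52.47   # Hz, V, mW (simulated model)
f_fast, v_fast = 275e6, 2.0               # Hz, V (faster ARM)

# Data rate = bits decoded / execution time.
rate_sim = bits / (cycles / f_sim)        # bits per second
rate_fast = rate_sim * f_fast / f_sim     # linear in clock frequency

# Assumed scaling rule: dynamic power proportional to f * V^2.
p_fast = p_sim * (f_fast / f_sim) * (v_fast / v_sim) ** 2

print(round(rate_sim / 1e3), round(rate_fast / 1e3), round(p_fast, 2))
```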

Summary
Clock speed: 275 MHz
Execution Performance: 96 kb/s
Power Dissipation: 42.40 mW (5.68 mW/mm2)
Area: 7.47 mm2 in 0.25 µm
Design Effort: 4 days
Portability very high: code is ANSI C; architecture-dependent tweaks may need reworking

Conclusion/Thanks
One-bit quantization gives opportunities for performance improvements, at a huge cost in QOR
Viterbi algorithm would benefit greatly from having hardware parallelism (vector ops) available
Many thanks to Marlene Wan for providing power estimation

Viterbi Decoder Implementation on a TI C54x

EE 290S Final Project
May 4, 1999
Paul Husted

Introduction
Implemented Viterbi Decoder on a TI TMS320VC5402 DSP
Examine:
    Performance (bits/sec)
    Power (mW/bit)
    Cost ($/unit, area)
    Design effort (engineer-months)

Viterbi Decoder Specifications


Implementation Specifications:
    Constraint Length (K aka. L) = 7
    Branch Metric Calculation is QPSK
    Soft Decision Wordlength (q) = 6
    Chain-backing Depth (D) = 96
    Gen. Polynomials: p0 = 171, p1 = 133 (octal)
    Data Rate 100 kbs
    Goal: Bit Error Rate (BER) = 10^-4

C54x Capabilities
Capabilities of all C54x DSP Cores:
    Three 16-bit data buses, one 16-bit program bus
    40-bit ACC with 40-bit barrel shifter
    Two independent accumulators
    A single-cycle non-pipelined MAC
    Single-instruction repeat and block-repeat
    Six-channel DMA controller
    Arithmetic instructions with parallel store and parallel load

Helpful Instructions for the Viterbi Decoder


The C54x Has Specialized Instruction Set
    Dual Add/Subtract in 1 Cycle
    Compare, Select, and Store Unit (CSSU)
        Compare Branch Metrics
        Store Larger Value, Store Decision Bit
        Increment Address Registers in Circular Buffer
        1 Cycle
    Allows Butterfly (2 States) in 5 cycles

Butterfly Implementation

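[Figure: butterfly. T register = local distance. Old(2*j) and Old(2*j+1) feed New(j) and New(j+2^(K-2)); each new state is computed with a dual add/subtract (DADST, DSADT) followed by a compare-select (CMPS).]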
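A behavioral Python sketch of one butterfly is given below. It models only the arithmetic, not the C54x instructions. It assumes the two branches into each new state carry +d and -d (antipodal branch metrics, with d the local distance held in the T register) and that the larger sum survives; the sign assignment and the name `butterfly` are illustrative assumptions.

```python
def butterfly(old, j, d, K=7):
    """Update the two new states fed by old states 2j and 2j+1.

    old: list of 2**(K-1) path metrics; d: local distance.
    Returns {new_state: (metric, decision)} for states j and j + 2**(K-2).
    """
    half = 1 << (K - 2)
    a, b = old[2 * j], old[2 * j + 1]

    def select(m0, m1):
        # Compare-select: keep the larger metric, record which path won.
        return (m1, 1) if m1 > m0 else (m0, 0)

    return {
        j: select(a + d, b - d),          # dual add/subtract, then select
        j + half: select(a - d, b + d),   # mirrored signs for the partner
    }


old_metrics = list(range(64))
print(butterfly(old_metrics, 3, 5))
```

Because both new states reuse the same two old metrics and the same local distance, one butterfly amortizes the memory reads over two state updates.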

TI TMS320VC5402 DSP
Specific Chip Characteristics:
    Operates at 100 MIPS
    Core Voltage of 1.8V
    I/O Pins Operate at 3.3V
    16K Word x 16 Bits of Dual-Access RAM
    4K Word x 16 Bits of ROM
    Internal DMA
    Created in 0.18 Micron Technology

Dataflow
Data I/O

Input Values Assumed to be Placed at Specified Memory Location by Internal DMA

Output Values Assumed to be removed from another Memory Location by Internal DMA
Alternatively, Data Could be Placed in this Memory Location After Other On-Chip Receiver Processing


Implementation Analysis
Viterbi Decoder Code Created in Assembly
Linked to Processor Specific Memory Map
Simulated on Cycle-Accurate Simulator
    Used Correct Memory Model for VC5402

Implementation Results

                         Estimated              Actual
Code Size                500 Instructions       1032 (16 bit) Words
Data Size                1280 (16 bit) Words    1280 (16 bit) Words
MIPS (100 Kbps)          18.425                 21.53125
Max. Speed (100 MIPS)    582 Kbps               464.7 Kbps

Power Calculation
Compared with TI Figures:
    TI uses 1/2 MACs, 1/2 NOPs for power figure
    0.25 micron estimate is 0.45 mA/MIPS
    Fully static design can be clocked at any rate

Viterbi code uses 1.08 times the current of the TI estimate

At 22 MIPS, 19.25 mW are consumed in the core
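The core power figure follows from the quoted current estimate. The sketch below assumes power is the core current times the 1.8 V core supply of the VC5402; the slides do not spell out the formula, but this assumption reproduces the 19.25 mW result.

```python
MA_PER_MIPS = 0.45      # TI current estimate quoted on the slide
VITERBI_FACTOR = 1.08   # Viterbi code relative to the TI estimate
MIPS = 22               # processing load at 100 kb/s, rounded up
CORE_VOLTAGE = 1.8      # V, assumed from the VC5402 core supply

current_ma = MA_PER_MIPS * VITERBI_FACTOR * MIPS   # core current in mA
power_mw = current_ma * CORE_VOLTAGE               # core power in mW

print(round(current_ma, 2), round(power_mw, 2))
```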

Area Estimate
TI Will Not Release Die Sizes

.25 Micron Chips Fit Inside 3.2 mm x 3.2 mm Area on a 144 pin BGA

Maximum Die Size is thus 10.24 mm2


Development Cost
Engineering Time

Estimate - 3 days

Assumes Engineer Has Experience with Assembly Language and TI Tools

Tool Cost - $13262.45

Includes Emulator, Simulator, Compiler, Assembler, Linker, Debugger

Cost of Chip - $8.52


Conclusion
Optimized Instructions Make Algorithm Efficient
Static Design Allows Clock Rate to be Set As Needed to Reduce Power
Flexibility Exists to Perform Other Processing of Data
Very Little Development Time/Cost


Tensilica Viterbi Implementation


Niraj Shah
Scott Weber
290A Final Presentation

Tensilica Flow
[Figure: Tensilica flow. Elements: designer; .c sources; TIE description; uArch configuration; Tensilica Processor Generator, which generates xt-gcc and xt-run; .o output.]

Xtensa Architecture
TIE Extensions:
    single cycle
    state free
    no new exceptions
    no stalls
    typeless data

Rs, Rt, Rr are 32 bit regs
I is the instruction controlling the TIE unit
Xtensa Core is a 32 bit configurable RISC processor

[Figure: Xtensa Core connected to the TIE unit through Rs, Rt, I, and Rr.]

Viterbi Architecture

[Figure: Viterbi architecture with blocks ADC, I/O Device, Init, RAM, ACS, and TraceBack. Annotation: "Measured Performance Here".]

TIE SetupBMreg (ACS)

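[Figure: TIE SetupBMreg datapath. Inputs: 32-bit registers Rs and Rt and the constant 0x7F. Add and subtract units under instruction control produce four branch metrics packed into the 32-bit result Rr: bm0 (bits 0 to 7), bm1 (bits 8 to 15), bm2 (bits 16 to 23), bm3 (bits 24 to 31).]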

ACS TIE Extension (ACS)

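[Figure: ACS TIE datapath. Rs holds the path metrics (pm fields); Rt holds the four branch metrics bm3, bm2, bm1, bm0 in bits 31:24, 23:16, 15:8, 7:0. Adders and an msb comparison select the survivor; the instruction variants are ACS03 || ACS12 || ACS30 || ACS21. The result Rr holds the decision bit and the new pm, with the remaining bits set to 0.]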

ACS TIE Extension with State (ACS)

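[Figure: ACS TIE datapath with state. Rs and Rt each hold path metric (pm) fields; the four branch metrics bm3, bm2, bm1, bm0 occupy bits 31:24, 23:16, 15:8, 7:0. Three adders with msb comparisons, under instruction control, produce two results packed into Rr, each consisting of a decision bit and a new pm.]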

TIE Zmask (TraceBack)

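[Figure: TIE Zmask datapath for traceback. Inputs: 32-bit registers Rs and Rt. The datapath uses a left shift by 1, AND masks with the constants 0x7F and 0x3F, and an OR, under instruction control; the result is written to Rr.]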

Designs
All designs had a BER of 0.000095 after 10 million iterations

Design 1
    100 MHz, 48 mW, 1K DCache, 1K ICache, TIE
Design 1+
    222 MHz, 144 mW, 1K DCache, 1K ICache, TIE
Design 2
    100 MHz, 69 mW, 16K DCache, 16K ICache, TIE
Design 2+
    222 MHz, 191 mW, 16K DCache, 16K ICache, TIE
Design 3
    222 MHz, 191 mW, 16K DCache, 16K ICache, TIE with state

Performance
[Bar chart: throughput in Kb/s]

                 Design 1   Design 1+   Design 2   Design 2+   Design 3
Cache               118        263         357        793         966
Perfect Cache       409        909         409        909        1142

Energy Dissipation
[Bar chart: energy in uJ/bit]

                 Design 1   Design 1+   Design 2   Design 2+   Design 3
Cache               0.4        0.54        0.19       0.24        0.2
Perfect Cache       0.12       0.16        0.17       0.21        0.17
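The energy-per-bit values are consistent with dividing each design's power by its measured throughput. The sketch below assumes that relationship and pairs the power figures from the Designs slide with the real-cache throughput values from the Performance chart; the pairing is an assumed reading of the extracted charts, and the labels "Design 1+" and "Design 2+" follow the same reading.

```python
designs = {
    # name: (power in mW, throughput in kb/s with the real cache)
    "Design 1": (48, 118),
    "Design 1+": (144, 263),
    "Design 2": (69, 357),
    "Design 2+": (191, 793),
    "Design 3": (191, 966),
}


def energy_uj_per_bit(power_mw, rate_kbps):
    """mW / (kb/s) = (1e-3 J/s) / (1e3 bit/s) = 1e-6 J/bit = uJ/bit."""
    return power_mw / rate_kbps


for name, (p, r) in designs.items():
    print(name, round(energy_uj_per_bit(p, r), 2))
```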

n(s*J)/Bit
[Bar chart: energy-delay per bit, n(s*J)/Bit]

                 Design 1   Design 1+   Design 2   Design 2+   Design 3
Cache               3.39       2.05        0.532      0.315       0.207
Perfect Cache       0.293      0.176       0.416      0.231       0.148

Die Area
[Bar chart: die area in mm2; Cache and Perfect Cache values are identical]

                 Design 1   Design 1+   Design 2   Design 2+   Design 3
Area (mm2)          2.1        2.37        6.14       6.7         6.7

Conclusions
TIE extensions, cache configuration, and improved code efficiency resulted in an order of magnitude improvement from our original
For power and performance, the effect of cache size is greater than the effect of a higher clock frequency
Use voltage scaling to reduce the power
If streaming data, then scale frequency
Adding state will result in the ability to increase performance
Having the ability to remove core instructions will decrease decode complexity and should lower power and area

Soft Core Viterbi Decoder


EECS 290A Project

Dave Chinnery, Rhett Davis, Chris Taylor, Ning Zhang


High Level Architecture

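[Figure: high-level block diagram of the decoder. Each block is annotated with its share of the design as % Gates / % Area / % Power: 4% / 1% / 5%; 23% / 36% / 30%; 38% / 8% / 22%; 18% / 4% / 16%; 0% / 48% / 15%; 9% / 2% / 8%; 2% / 1% / 4%.]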

Branch & Path Metric Generation


Branch Metrics Computation apparently implemented with a CORDIC block (contains 840 MUXs, 58 adders & flip-flops, 32 15-bit busses)
Branch Metrics Hard-wired to each ACS unit
Path Metrics Stored in ACS units
Each ACS unit handles 16 states
Hard-wired Path Metric Interconnect

[Figure: two rows of four blocks, each block labeled U and L.]

ACS Architecture
Each ACS unit stores 32 path metrics
Only two SRAMs are active at a time
Across all four ACS units, each path metric is stored twice
SRAM accounts for 88% of the area and 27% of the power for each ACS unit

[Figure: ACS unit datapath. Elements: 8x9 SRAMs, inputs PMU, PML, BMU, BML, a MUX, Add, Compare, and Select stages, and a pipeline register.]

Traceback Architecture
State-Machine blocks are just large sum-of-products combinational networks (351 gates each)
Each memory unit contains a 16x64 SRAM and logic (192 MUXs, 128 flip-flops)
Traceback Memory Unit: 22% Area, 20% Power
Finite State Machine: 11% Area, 13% Power

[Figure: traceback unit. Decision bits (bus labeled 192) enter the Traceback Memory Unit, which contains an SRAM, a MUX, and a pipeline register, with signals Traceback and Next_ramin; the output is labeled Out.]

Design Flow
Synthesis & Module Generation
    Design Compiler synthesis script (from Mentor/Inventra)
    SRAM generator (from Norman Walker)

Pre-Layout Verification & Analysis
    VHDL gate-level sims (timing verification, switching activity annotation)
    PowerMill simulations (SRAM, core)
    Design Compiler, Power Compiler (static timing, power analysis)

Floor Planning
    Floor planning (Preview)

Place & Route
    Place & route (Silicon Ensemble)
    Interconnect parasitic extraction (report simcap

Post-Layout Verification & Analysis
    PowerMill simulations, PathMill static analysis
    Design Compiler, Power Compiler (static timing, power analysis with back-annotated interconnect parasitics)

Synthesis and SRAM Generation


Synthesis with Synopsys Design Compiler
    Constraint: 66 kHz clock (effectively infinite)
    Bottom-up synthesis of 62 VHDL entities

Low-Power SRAM generator (from Pleiades)
    Very large sense-amps, control logic
    Optimized for power, speed at low supply-voltages
    Word-length limited to a power of 2

Simulation Models

Behavioral C
    Parameterized, bit-true, and fast
    Used for system level design and BER simulations

Behavioral VHDL
    Parameterized, bit-true, and cycle-true
    Used for structural simulations and test bench reference

RTL VHDL
    Synthesizable, crafted for specific parameters and implementation structure
    Used for synthesis quality

BER Simulation Results


SRAM
Simulation Tools: TimeMill & PowerMill

Parameters
    66 MHz clock
    Voltage 2.5V
    Random Generated Test Vectors

Results
    Power Analysis
    Timing Analysis

SRAM: Power Numbers


SRAM used for ACS Unit
    8 words by 9 data bits

Operations        Avg.(µA)    Avg.(mW)    Avg.(pJ)
Read Activity      663.73      1.659       24.885
Write Activity     563.21      1.408       21.120
Read/Write         612.29      1.530       22.950

Parasitic Extraction
Operations        Avg.(µA)    Avg.(mW)    Avg.(pJ)
Read Activity      949.89      2.3747      35.6205
Write Activity     772.830     1.9320      28.980
Read/Write         851.42      2.1285      31.9275
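The three columns of the SRAM power table are related by simple arithmetic. The sketch below assumes the current column is in microamperes, the supply is the 2.5 V simulation voltage, and the energy is taken over a 15 ns cycle (roughly one period of the 66 MHz clock); the 15 ns figure is inferred from the table values rather than stated on the slides.

```python
SUPPLY_V = 2.5     # V, simulation supply voltage
CYCLE_NS = 15.0    # ns, assumed cycle time (about 1 / 66 MHz)


def sram_power_mw(current_ua):
    """Average power: uA * V gives uW; divide by 1000 for mW."""
    return current_ua * SUPPLY_V / 1000.0


def sram_energy_pj(current_ua):
    """Energy per cycle: mW * ns gives pJ."""
    return sram_power_mw(current_ua) * CYCLE_NS


# ACS SRAM read activity, pre-extraction row of the table.
print(round(sram_power_mw(663.73), 3), round(sram_energy_pj(663.73), 2))
```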

SRAM: Power Numbers


SRAM used for Traceback Unit
    16 words by 64 data bits

Parasitic Extraction?
Operations        Avg.(µA)    Avg.(mW)    Avg.(pJ)
Read Activity      2170.7      5.4267      81.4005
Write Activity     1893.4      4.7335      71.0025
Read/Write         2086.9      5.2172      78.2580

SRAM: Timing Numbers


Delays
    Setup Time; Hold Time
    time needed for data address to become stable

                  Setup(ns)   Hold(ns)   Data Resolution(ns)
ACS SRAM             ~1          ~2           ~1.8
Traceback SRAM       ~1          ~2           ~5

Place and Route


Floor planning of the Viterbi SRAM macro cells and standard cells was done in Preview, and Silicon Ensemble was used for routing.
Total SRAM macro cell area was 1.58 mm2 (1.08 mm2 with 9x8 SRAMs)
    Area of the 16 9x8 bit SRAM macro cells: 0.052 mm2 each, 62% larger than required, as 16x8 bit SRAMs were used (SRAM generator output had been verified for powers of 2)
    Area of the 3 16x64 bit SRAM macro cells: 0.25 mm2 each
Area of the standard cells: 1.02 mm2 (0.35 mm2 from DEF file)
Final chip area was 4.0 mm2 (original estimate 2.5 mm2)
Parasitics for timing simulation were extracted from the final routed nets in Silicon Ensemble.

Wiring Statistics
Six metal layers, layers 5 and 6 used for power and ground respectively
Ground and power spaced alternately 100 um apart horizontally and vertically
There were about 6200 nets and 46,114 vias

Total wire lengths:
    metal layer 1: 3,293 um
    metal layer 2: 458,440 um
    metal layer 3: 510,517 um
    metal layer 4: 218,023 um
    metal layer 5: 96,882 um signal, and 38,400 um power
    metal layer 6: 8,660 um signal, and 37,500 um ground
    wire length: 685 mm horizontal, 611 mm vertical, total 1296 mm

Final Placement and Routing


Significant routing congestion at 16 by 64 bit SRAM outputs, due to Silicon Ensemble grid size of 1 um (observe white and light blue wires).
Minimum of 6 unroutable nets observed, even at 12 mm2 chip area.
Final size was 1.25 mm x 3.2 mm, 4 mm2, with 9 unroutable nets.
Violation reports in Silicon Ensemble did not identify which nets were unroutable, other than problems with ground and power connections.

Static Timing Checks


All timing checks performed with Design Compiler's report_timing command
Parasitic capacitances back-annotated with the set_load command
No RC parasitics annotated
No SRAM model was used for timing checks
Critical path was from ACS control logic, through a PM output MUX select signal (in an ACS unit), through the following ACS unit
Checks performed at 2.5V

                     Delay Before       Delay After        Max Clock          Max Symbol
                     Annotation (ns)    Annotation (ns)    Frequency (MHz)    Rate (Msps)
Critical Path             8.7                17                 60                3.8
Longest SRAM Path         8.5                14                 -                 -

Static Power Checks

Power                   Before Annotation   After SAIF Annotation   After Parasitic Annotation
Cell Internal (mW):           28                    20                       20
Net Switching (mW):           15                    6.3                      8.7
Total Dynamic (mW):           43                    26                       29
Cell Leakage (nW):            750                   810                      810

All power checks performed with Design Compiler's report_power command
Switching activity was measured for every output port (transition counts over 16,000-cycle simulation)
Back-annotation performed with SAIF files
No SRAM model was used for power checks (added in manually)
Checks performed at 2.5V w/ 60 MHz clock

Delay and Energy Scaling


Performance Results

                             Supply        Clock Rate   Symbol Rate   Energy Delay     Power
                             Voltage (V)   (MHz)        (Msps)        Product (fJs)    (mW)
Optimized for Performance       2.5          60           3.75           4.24           59.6
Optimized for Energy            0.8          7.46         0.47           3.49           0.76
Optimized for EDP               1.25         25.12        1.57           2.53           6.24

For fixed throughput requirement 100ksps:

                             Supply        Clock Rate   Symbol Rate   Power
                             Voltage (V)   (MHz)        (Msps)        (mW)
Optimized for Performance       2.5          1.6          0.1           1.59
Optimized for Energy            0.8          1.6          0.1           0.16
Optimized for EDP               1.25         1.6          0.1           0.40
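The relationships among the columns of these results can be checked with a few lines of Python. The sketch assumes 16 clock cycles per symbol (64 states spread over four ACS units), power that scales linearly with clock rate at a fixed supply voltage, and an energy-delay product defined as energy per symbol times time per symbol. These are assumptions inferred from the numbers, not statements from the slides.

```python
CYCLES_PER_SYMBOL = 16   # assumed: 64 states / 4 ACS units


def symbol_rate_msps(clock_mhz):
    """Symbol rate implied by the clock rate."""
    return clock_mhz / CYCLES_PER_SYMBOL


def scaled_power_mw(power_mw, clock_mhz, target_mhz):
    """Power after slowing the clock at a fixed supply voltage."""
    return power_mw * target_mhz / clock_mhz


def edp_fjs(power_mw, rate_msps):
    """Energy per symbol times time per symbol, in fJ*s."""
    rate = rate_msps * 1e6                   # symbols per second
    energy_j = power_mw * 1e-3 / rate        # joules per symbol
    return energy_j * (1.0 / rate) * 1e15    # J*s converted to fJ*s


print(symbol_rate_msps(60))
print(round(scaled_power_mw(59.6, 60, 1.6), 2))
print(round(edp_fjs(59.6, 3.75), 2))
```

Under these assumptions, slowing each design to 1.6 MHz gives the 100 ksps operating points in the second table.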

Summary NORMALIZED (100kbs)


Implementation   Performance   Power    Norm     Gates    Area      Gates/      Power (uW)/   Effort
                 (kbs)         (mW)                       (mm^2)    Area        Gate          (days)
ASIC             100.00        0.14     1.0      35100    4.00      8775.00     0.004         30
DSP              100.00        1.97     14.3     47098    10.24     4599.41     0.042         3
CP 3             100.00        2.02     14.7     26480    6.69      3958.15     0.076         6
CP 2             100.00        2.47     17.9     47098    6.69      7040.06     0.052         6
ARM              100.00        36.86    266.8    50000    7.47      6695.68     0.737         4
CP 1             100.00        40.68    294.4    50000    2.10      23809.52    0.814         6

Summary MAX PERFORMANCE


Implementation   Performance   Power    Norm     Gates    Area      Gates/      Power (uW)/   Effort
                 (kbs)         (mW)                       (mm^2)    Area        Gate          (days)
ASIC             3750.00       50.60    1.0      35100    4.00      8775.00     1.44          30
CP 3             966.00        191.00   3.8      26480    6.69      3958.15     7.21          6
CP 2             793.00        191.00   3.8      47098    6.69      7040.06     4.06          6
DSP              464.70        89.46    1.8      47098    10.24     4599.41     1.90          3
CP 1             118.00        48.00    0.9      50000    2.10      23809.52    0.96          6
ARM              116.48        42.94    0.8      50000    7.47      6695.68     0.86          4
Reference        100.00        N/A      N/A      N/A      N/A       N/A         N/A           N/A
