Video/Imaging
Multi-Processing
Application Examples
W-CDMA
Radars
DSP Building Blocks Function/Application Specific
Digital Radios & Bit Slice Processors (MUL, etc.) ( MP)
High-End
Control
Modems
DSP P and RISC
Voice Coding ( MP )
Instruments
C and Analog P
Low-End
Modems
Industrial
Control
3
Texas Instruments TMS320 Family
Multiple DSP P Generations
First Bit Size Clock Instruction MAC MOPS Device density (#
Sample speed Throughput execution of transistors)
(MHz) (ns)
Uniprocessor
Based
(Harvard
Architecture)
TMS32010 1982 16 integer 20 5 MIPS 400 5 58,000 (3)
Multiprocessor
Based
TMS320C80 1996 32 integer/flt. 2 GOPS MIMD
120 MFLOP
TMS320C62XX 1997 16 integer 1600 MIPS 5 20 GOPS VLIW
TMS310C67XX 1997 32 flt. pt. 5 1 GFLOP VLIW
4
First Generation DSP P Case Study
TMS32010 (Texas Instruments) - 1982
Features
200 ns instruction cycle (5 MIPS)
144 words (16 bit) on-chip data RAM
1.5K words (16 bit) on-chip program ROM - TMS32010
External program memory expansion to a total of 4K words at full
speed
16-bit instruction/data word
single cycle 32-bit ALU/accumulator
Single cycle 16 x 16-bit multiply in 200 ns
Two cycle MAC (5 MOPS)
Zero to 15-bit barrel shifter
Eight input and eight output channels
5
TMS32010 BLOCK DIAGRAM
6
TMS32010 Program Memory Maps
Microcomputer Mode Microprocessor Mode
External
Memory
1525
Space
Internal
Memory
Space Reserved
For Testing
1536
External
Memory
Space
4095 4095
7
Digital FIR Filter Implementation
(Uniprocessor-Circular Buffer)
Start each
Time here
1st. Cycle 2nd. Cycle
End
X0 Start
a n-1 a n-2 a1 a0 X1 Start
X2
a0 a n-1 X3
X4
X X5
Xn-1
End
+ Replace
starting
value
Acc with new
value
8
TMS32010 FIR FILTER PROGRAM
Indirect Addressing (Smaller Program Space)
10
Third Generation DSP P Case Study
TMS320C30 - 1988
TMS320C30 Key Features
60 ns single-cycle instruction execution time
33.3 MFLOPS (million floating-point operations per second)
16.7 MIPS (million instructions per second)
One 4K x 32-bit single-cycle dual-access on-chip ROM block
Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks
64 x 32-bit instruction cache
32-bit instruction and data words, 24-bit addresses
40/32-bit floating-point/integer multiplier and ALU
32-bit barrel shifter
11
Third Generation DSP P Case Study
TMS320C30 - 1988
12
TMS320C30 BLOCK DIAGRAM
13
TMS320C3x CPU BLOCK DIAGRAM
14
TMS320C3x MEMORY BLOCK DIAGRAM
15
TMS320C30 Memory Organization
Oh Interrupt locations Oh Interrupt locations
& reserved (192) & reserved (192)
BFh external STRB active BFh
COh External COh
ROM
STRB Active 0FFFh
7FFFFFh (Internal)
1000h
800000h Expansion BUS MSTRB
7FFFFFh
Expansion BUS MSTRB Active (8K)
801FFFh 800000h
Active (8K)
802000h 801FFFh Reserved
Reserved
802000h (8K)
803FFFh (8K)
804000h 803FFFh
Expansion Bus Expansion Bus
IOSTRB Active (8K) 804000h IOSTRB Active (8K)
805FFFh
806000h 805FFFh
Reserved Reserved
806000h
807FFFH (8K) (8K)
80800h 807FFFH Peripheral Bus Memory Mapped
Peripheral Bus Memory Mapped
80800h
8097FFh Registers (Internal) (6K) Registers (Internal) (6K)
809800h 8097FFh
RAM Block 0 (1K) RAM Block 0 (1K)
809800h
809BFFh (Internal) (Internal)
809C00h 809BFFh
RAM Block 1 (1K)
RAM Block 1 (1K) 809C00h
809FFFh (Internal)
(Internal)
80A00h 809FFFh
External 80A00h External
0FFFFFFh STRB Active STRB Active
0FFFFFFh
Microprocessor Mode Microcomputer Mode 16
TMS320C30 FIR FILTER PROGRAM
18
TMS320C54x Internal Block Diagram
19
Architecture optimized for DSP
#1: CPU designed for efficient DSP processing
MAC unit, 2 Accumulators, Additional Adder,
Barrel Shifter
#2: Multiple busses for efficient data
and program flow
Four busses and large on-chip memory that
result in sustained performance near peak
#3: Highly tuned instruction set for
powerful DSP computing
Sophisticated instructions that execute in fewer
cycles, with less code and low power demands
20
Key #1: DSP engine
40
Y = an * xn
n = 1
x a
MPY
ADD
y
21
Key #1: MAC Unit
MAC *AR2+, *AR3+, A
MPY A
Fractional B
Mode Bit
ADD O
acc A acc B
22
Key #1: Accumulators + Adder
General-Purpose Math example: t = s+e-r
LD @s, A
acc A acc B ALU
ADD @e, A
MUX U Bus SUB @r, A
STL A, @t
A B MAC
23
Key #1: Barrel shifter
LD @X, 16, A
STH @B, Y
A B C D
Barrel Shifter
(-16-+31)
S Bus
ALU E Bus
24
Key #1: Temporary register
LD @x, T
MPY @a, A
D X EXP A
Encoder B
MAC ALU
25
Key #2: Efficient data/program flow
#1: CPU designed for efficient DSP processing
MAC unit, 2 Accumulators, Additional Adder,
Barrel Shifter
#2: Multiple busses for efficient data
and program flow
Four busses and large on-chip memory that
result in sustained performance near peak
#3: Highly tuned instruction set for
powerful DSP computing
Sophisticated instructions that execute in fewer
cycles, with less code and low power demands
26
Key #2: Multiple busses
MAC *AR2+, *AR3+, A
P
INTERNAL
EXTERNAL
MEMORY
MEMORY
U D M
X C U
E X
S E
Central
C D
Arithmetic T MAC A B ALU SHIFTER
Logic Unit
M
27
Key #2: Pipeline
Prefetch Fetch Decode Access Read Execute
P F D A R E
28
Key #2: Bus usage
CNTL PC ARs
P
INTERNAL
EXTERNAL
MEMORY
MEMORY
U D M
X C U
E X
S E
Central
Arithmetic
Logic Unit
T MAC A B ALU SHIFTER
29
Key #2: Pipeline performance
CYCLES
P1 F1 D1 A1 R1 X1
P2 F2 D2 A2 R2 X2
P3 F3 D3 A3 R3 X3
P4 F4 D4 A4 R4 X4
P5 F5 D5 A5 R5 X5
P6 F6 D6 A6 R6 X6
30
Key #3: Powerful instructions
#1: CPU designed for efficient DSP processing
MAC Unit, 2 Accumulators, Additional Adder,
Barrel Shifter
#2: Multiple busses for efficient data and
program flow
Four busses and large on-chip memory that
result in sustained performance near peak
#3: Highly tuned instruction set for
powerful DSP computing
Sophisticated instructions that execute in fewer
cycles, with less code and low power demands
31
Key #3: Advanced applications
Symmetric FIR filter FIRS
Adaptive filtering LMS
32
C62x Architecture
33
TMS320C6201 Revision 2
Program Cache / Program Memory
32-bit address, 256-Bit data512K Bits RAM
Memory
Interface
2 Timers
2 Multi-
Data Memory channel
buffered
32-Bit address, 8-, 16-, 32-Bit data serial ports
(T1/E1)
512K Bits RAM
34
C6201 Internal Memory
Architecture
Separate Internal Program and Data Spaces
Program
16K 32-bit instructions (2K Fetch Packets)
256-bit Fetch Width
Configurable as either
Direct Mapped Cache, Memory Mapped Program Memory
Data
32K x 16
Single Ported Accessible by Both CPU Data Buses
4 x 8K 16-bit Banks
2 Possible Simultaneous Memory Accesses (4 Banks)
4-Way Interleave, Banks and Interleave Minimize Access
Conflicts
35
C62x Interrupts
12 Maskable Interrupts , Non-Maskable Interrupt (NMI)
Interrupt Return Pointers (IRP, NRP)
Fast Interrupt Handing
Branches Directly to 8-Instruction Service Fetch Packet
Can Branch out with no overhead for longer service
7 Cycle Overhead : Time When No Code is Running
12 Cycle Latency : Interrupt Response Time
Interrupt Acknowledge (IACK) and Number (INUM)
Signals
Branch Delay Slots Protected From Interrupts
Edge Triggered
36
C62x Datapaths
Registers A0 - A15 Registers B0 - B15
1X 2X
S1 S2 D DL SL SL DL D S1 S2 D S1 S2 D S1 S2 S2 S1 D S2 S1 D S2 S1 D DL SL SL DL D S2 S1
L1 S1 M1 D1 D2 M2 S2 L2
DDATA_I1 DDATA_I2
(load data) (load data)
Cross Paths
40-bit Write Paths (8 MSBs)
40-bit Read Paths/Store Paths
37
Functional Units
L-Unit (L1, L2)
40-bit Integer ALU, Comparisons
Bit Counting, Normalization
S-Unit (S1, S2)
32-bit ALU, 40-bit Shifter
Bitfield Operations, Branching
M-Unit (M1, M2)
16 x 16 -> 32
D-Unit (D1, D2)
32-bit Add/Subtract
Address Calculations
38
C62x Datapaths
S1 S2 D DL SL SL DL D S1 S2 D S1 S2 D S1 S2 S2 S1 D S2 S1 D S2 S1 D DL SL SL DL D S2 S1
L1 S1 M1 D1 D2 M2 S2 L2
Cross Paths
DDATA_O1 DDATA_I1 DDATA_I2 DDATA_O2 40-bit Write Paths (8 MSBs)
(store data) (load data) (load data) (store data) 40-bit Read Paths/Store Paths
DADR1 DADR2
(address) (address)
39
C62x Instruction Packing
Instruction Packing Advanced VLIW
Fetch Packet
CPU fetches 8 instructions/cycle
Example 1
Execute Packet
A B C D E F G H CPU executes 1 to 8 instructions/cycle
Fetch packets can contain multiple execute packets
A Parallelism determined at compile / assembly
time
B Examples
C 1) 8 parallel instructions
D Example 2 2) 8 serial instructions
E 3) Mixed Serial/Parallel Groups
A // B
F C
G D
H A B E // F // G // H
Reduces Codesize, Number of Program Fetches,
C Power Consumption
D Example 3
E
F G H
40
C62x Pipeline Operation
Pipeline Phases
Fetch Decode Execute
PG PS PW PR DP DC E1 E2 E3 E4 E5
Decode
Single-Cycle Throughput
DP Instruction Dispatch
Operate in Lock Step
DC Instruction Decode
Fetch Execute
PG Program
E1Address
- E5 Generate
Execute 1 through Execute 5
PS Program Address Send
PW Program Access Ready Wait
PR Program Fetch Packet Receive
Execute Packet 1 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 2 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 3 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 4 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 5 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 6 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 7 PG PS PW PR DP DC E1 E2 E3 E4 E5
41
C62x Pipeline Operation
Delay Slots
Delay Slots: number of extra cycles until result is:
written to register file
available for use by a subsequent instructions
Multi-cycle NOP instruction can fill delay slots while
minimizing codesize impact
43
C6000 Instruction Set Features
Conditional Instructions
44
C6000 Instruction Set Addressing
Features
Load-Store Architecture
Two Addressing Units (D1, D2)
Orthogonal
Any Register can be used for Addressing or
Indexing
Signed/Unsigned Byte, Half-Word, Word,
Double-Word Addressable
Indexes are Scaled by Type
Register or 5-Bit Unsigned Constant
Index
45
C6000 Instruction Set Addressing
Features
Indirect Addressing Modes
Pre-Increment *++R[index]
Post-Increment *R++[index]
Pre-Decrement *--R[index]
Post-Decrement *R--[index]
Positive Offset *+R[index]
Negative Offset *-R[index]
15-bit Positive/Negative Constant Offset
from Either B14 or B15
46
C6000 Instruction Set Addressing
Features
Circular Addressing
Fast and Low Cost: Power of 2 Sizes and
Alignment
Up to 8 Different Pointers/Buffers, Up to 2
Different Buffer Sizes
Dual Endian Support
47
C67x Architecture
48
TMS320C6701 DSP
Block Diagram
Program Cache/Program Memory
32-bit address, 256-Bit data
512K Bits RAM
49
TMS320C6701
Advanced VLIW CPU (VelociTI ) TM
50
TMS320C6701
Memory /Peripherals
Same as ’C6201
External interface supports
SDRAM, SRAM, SBSRAM
4-channel bootloading DMA
16-bit host port interface
1Mbit on-chip SRAM
2 multichannel buffered serial ports (T1/E1)
Pin compatible with ’C6201
51
TMS320C67x CPU Core
’C67x Floating-Point CPU Core
Program Fetch
Control
Instruction Dispatch Registers
Instruction Decode
Data Path 1 Data Path 2 Control
Logic
A Register File B Register File
Test
Emulation
L1 S1 M1 D1 D2 M2 S2 L2
Interrupts
Floating-Point
Arithmetic Auxiliary
Logic Logic
Multiplier Capabilities
Unit
Unit Unit
52
C67x Interrupts
12 Maskable Interrupts
Non-Maskable Interrupt (NMI)
Interrupt Return Pointers (IRP, NRP)
Fast Interrupt Handling
Branches Directly to 8-Instruction Service Fetch Packet
7 Cycle Overhead: Time When No Code is Running
12 Cycle Latency : Interrupt Response Time
Interrupt Acknowledge (IACK) and Number
(INUM) Signals
Branch Delay Slots Protected From Interrupts
Edge Triggered
53
C67x New Instructions
.L Unit .M Unit .S Unit
Floating Point Arithmetic Unit
S1 S2 D DL SL SL DL D S1 S2 D S1 S2 D S1 S2 S2 S1 D S2 S1 D S2 S1 D DL SL SL DL D S2 S1
L1 S1 M1 D1 D2 M2 S2 L2
55
C67x Instruction Packing
Instruction Packing Enhanced VLIW
Example 1
Fetch Packet
A B C D E F G H CPU fetches 8 instructions/cycle
Execute Packet
CPU executes 1 to 8
instructions/cycle
A Fetch packets can contain multiple
execute packets
B Parallelism determined at
C compile/assembly time
Examples
D Example 2 1) 8 parallel instructions
E 2) 8 serial instructions
3) Mixed Serial/Parallel Groups
F A // B
G
C
D
H A B E // F // G // H
C Reduces
Codesize
D Example 3 Number of Program Fetches
E Power Consumption
F G H
56
C67x Pipeline Operation
Pipeline Phases
Fetch Decode Execute
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Operate in Lock Step Decode
Fetch DP Instruction Dispatch
PG Program Address Generate DC Instruction Decode
PS Program Address Send Execute
PW Program Access Ready Wait E1 - E5 Execute 1 through Execute 5
PR Program Fetch Packet Receive E6 - E10 Double Precision Only
57
C67x Pipeline Operation
Delay Slots
Delay Slots: number of extra cycles until result is:
written to register file
available for use by a subsequent instructions
Multi-cycle NOP instruction can fill delay slots while
minimizing codesize impact
Branches E1
Decode Decode
Register Register Register Register
file file file file
59
TMS320C80 MIMD MULTIPROCESSOR
Texas Instruments - 1996
60
Copyright 1999
61