Lec10a DSP1

Lecture 10a: Digital Signal Processors: A TI Architectural History
Collated by: Professor Kurt Keutzer Computer Science 252, Spring 2000 With contributions from: Dr. Brock Barton, Clark Hise TI; Dr. Surendar S. Magar, Berkeley Concept Research Corporation
DSP ARCHITECTURE EVOLUTION
[Figure: timeline from 1980 to 1995 showing architecture tracks against application examples.
Tracks: DSP Building Blocks & Bit Slice Processors (MUL, etc.); DSP µP and RISC (MP); Function/Application Specific (MP); µC and Analog.
Legend: MUL = Multipliers; MP = Multiprocessors / Multi-Processing.
Application examples, high end to low end: Video/Imaging, W-CDMA, Radars, Digital Radios, High-End Control, Modems, Voice Coding, Instruments, Low-End Modems, Industrial Control.]
DSP ARCHITECTURE Enabling Technologies

Time Frame | Approach | Primary Application | Enabling Technologies
Early 1970s | Discrete logic | Non-real-time processing, simulation | Bipolar SSI, MSI; FFT algorithm
Late 1970s | Building block | Military radars, digital comm. | Single-chip bipolar multiplier; flash A/D
Early 1980s | Single-chip DSP µP | Telecom, control | µP architectures; NMOS/CMOS
Late 1980s | Function/application-specific chips | Computers, communication | Vector processing; parallel processing
Early 1990s | Multiprocessing | Video/image processing | Advanced multiprocessing; VLIW, MIMD, etc.
Late 1990s | Single-chip multiprocessing | Wireless telephony, Internet related | Low-power single-chip DSP; multiprocessing
Texas Instruments TMS320 Family: Multiple DSP µP Generations

Uniprocessor Based (Harvard Architecture)
Device | First Sample | Bit Size | Clock Speed (MHz) | Instruction Throughput | MAC Execution (ns) | MOPS | Device Density (# of transistors)
TMS32010 | 1982 | 16 integer | 20 | 5 MIPS | 400 | 5 | 58,000 (3)
TMS320C25 | 1985 | 16 integer | 40 | 10 MIPS | 100 | 20 | 160,000 (2)
TMS320C30 | 1988 | 32 flt. pt. | 33 | 17 MIPS | 60 | 33 | 695,000 (1)
TMS320C50 | 1991 | 16 integer | 57 | 29 MIPS | 35 | 60 | 1,000,000 (0.5)
TMS320C2XXX | 1995 | 16 integer | | 40 MIPS | 25 | 80 |

Multiprocessor Based
Device | First Sample | Bit Size | Architecture
TMS320C80 | 1996 | 32 integer/flt. | MIMD
TMS320C62XX | 1997 | 16 integer | VLIW
TMS320C67XX | 1997 | 32 flt. pt. | VLIW
Remaining entries for this group, in source order: 1600 MIPS; 5; 5; 2 GOPS; 120 MFLOP; 20 GOPS; 1 GFLOP
First Generation DSP µP Case Study

TMS32010 (Texas Instruments) - 1982
Features
- 200 ns instruction cycle (5 MIPS)
- 144 words (16 bit) on-chip data RAM
- 1.5K words (16 bit) on-chip program ROM (TMS32010)
- External program memory expansion to a total of 4K words at full speed
- 16-bit instruction/data word
- Single-cycle 32-bit ALU/accumulator
- Single-cycle 16 x 16-bit multiply in 200 ns
- Two-cycle MAC (5 MOPS)
- Zero to 15-bit barrel shifter
- Eight input and eight output channels
TMS32010 BLOCK DIAGRAM
TMS32010 Program Memory Maps

Microcomputer Mode (address: 16-bit word)
  0      Reset, 1st word
  1      Reset, 2nd word
  2      Interrupt
         Internal memory space
  1525   Internal memory space reserved for testing
  1536   External memory space
  4095   End of external memory space

Microprocessor Mode (address: 16-bit word)
  0      Reset, 1st word
  1      Reset, 2nd word
  2      Interrupt
         External memory space
  4095   End of external memory space
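Digital FIR Filter Implementation (Uniprocessor-Circular Buffer)

[Figure: coefficients a0, a1, ..., a(n-2), a(n-1) are multiplied against samples X0, X1, ..., X(n-1) held in a circular buffer, and the products are summed (+) into the accumulator (Acc). Each pass runs from a start position to an end position in the buffer; the start position moves on from the 1st cycle to the 2nd cycle. After each pass, replace the starting value with the new value.]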
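
The circular-buffer scheme on this slide can be sketched in a few lines of Python. This is an illustrative model only, not TMS32010 code; the class name `CircularFir` and its fields are made up for the example, and the buffer is walked from the newest sample back to the oldest.

```python
class CircularFir:
    """FIR filter that keeps its delay line in a fixed-size circular buffer."""

    def __init__(self, coeffs):
        self.h = list(coeffs)           # h[0] .. h[N-1]
        self.x = [0.0] * len(self.h)    # delay line, initially all zeros
        self.newest = 0                 # index of the most recent sample

    def step(self, sample):
        n = len(self.h)
        # The slot after the newest sample holds the oldest one: overwrite it.
        self.newest = (self.newest + 1) % n
        self.x[self.newest] = sample
        acc = 0.0
        idx = self.newest
        for k in range(n):              # acc += h[k] * x[n-k]
            acc += self.h[k] * self.x[idx]
            idx = (idx - 1) % n         # step back in time, wrapping around
        return acc


fir = CircularFir([1.0, 2.0, 3.0])
impulse_response = [fir.step(s) for s in [1.0, 0.0, 0.0, 0.0]]
```

Feeding a unit impulse returns the coefficients in order, which is a quick sanity check that no samples are shifted in memory and only the index moves.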
TMS32010 FIR FILTER PROGRAM: Indirect Addressing (Smaller Program Space)

$$y(n) = x[n-(N-1)]\,h(N-1) + x[n-(N-2)]\,h(N-2) + \cdots + x(n)\,h(0)$$

For N=50, Indirect Addressing: t = 42 µs (23.8 kHz)
For N=50, Direct Addressing: t = 21.6 µs (40.2 kHz)
TMS320C203/LC203 BLOCK DIAGRAM: DSP Core Approach - 1995
Third Generation DSP µP Case Study: TMS320C30 - 1988
TMS320C30 Key Features

- 60 ns single-cycle instruction execution time
  - 33.3 MFLOPS (million floating-point operations per second)
  - 16.7 MIPS (million instructions per second)
- One 4K x 32-bit single-cycle dual-access on-chip ROM block
- Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks
- 64 x 32-bit instruction cache
- 32-bit instruction and data words, 24-bit addresses
- 40/32-bit floating-point/integer multiplier and ALU
- 32-bit barrel shifter
- Eight extended-precision registers (accumulators)
- Two address generators with eight auxiliary registers and two auxiliary register arithmetic units
- On-chip direct memory access (DMA) controller for concurrent I/O and CPU operation
- Parallel ALU and multiplier instructions
- Block repeat capability
- Interlocked instructions for multiprocessing support
- Two serial ports to support 8/16/32-bit transfers
- Two 32-bit timers
- 1 µm CMOS process
TMS320C30 BLOCK DIAGRAM
TMS320C3x CPU BLOCK DIAGRAM
TMS320C3x MEMORY BLOCK DIAGRAM
TMS320C30 Memory Organization

Microprocessor Mode
0h - BFh | Interrupt locations & reserved (192), external STRB active
C0h - 7FFFFFh | External STRB active
800000h - 801FFFh | Expansion bus MSTRB active (8K)
802000h - 803FFFh | Reserved (8K)
804000h - 805FFFh | Expansion bus IOSTRB active (8K)
806000h - 807FFFh | Reserved (8K)
808000h - 8097FFh | Peripheral bus memory-mapped registers (internal) (6K)
809800h - 809BFFh | RAM block 0 (1K) (internal)
809C00h - 809FFFh | RAM block 1 (1K) (internal)
80A000h - 0FFFFFFh | External STRB active

Microcomputer Mode
0h - BFh | Interrupt locations & reserved (192)
C0h - 0FFFh | ROM (internal)
1000h - 7FFFFFh | External STRB active
800000h - 801FFFh | Expansion bus MSTRB active (8K)
802000h - 803FFFh | Reserved (8K)
804000h - 805FFFh | Expansion bus IOSTRB active (8K)
806000h - 807FFFh | Reserved (8K)
808000h - 8097FFh | Peripheral bus memory-mapped registers (internal) (6K)
809800h - 809BFFh | RAM block 0 (1K) (internal)
809C00h - 809FFFh | RAM block 1 (1K) (internal)
80A000h - 0FFFFFFh | External STRB active
TMS320C30 FIR FILTER PROGRAM

$$y(n) = x[n-(N-1)]\,h(N-1) + x[n-(N-2)]\,h(N-2) + \cdots + x(n)\,h(0)$$

For N=50, t = 3.6 µs (277 kHz)
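
The relation between the time per output sample and the maximum sample rate quoted on the FIR slides is a reciprocal. The sketch below checks it against two figures from the slides (3.6 µs on the 60 ns TMS320C30, and 42 µs with indirect addressing on the 200 ns TMS32010). The function names are made up for this example.

```python
def max_sample_rate_hz(time_per_output_s):
    """Highest input sample rate the filter loop can keep up with."""
    return 1.0 / time_per_output_s


def cycles_per_output(time_per_output_s, cycle_time_s):
    """Instruction cycles spent per output sample, rounded to a whole cycle."""
    return round(time_per_output_s / cycle_time_s)


c30_rate = max_sample_rate_hz(3.6e-6)          # about 277.8 kHz
c30_cycles = cycles_per_output(3.6e-6, 60e-9)  # 60 cycles for 50 taps
c10_rate = max_sample_rate_hz(42e-6)           # about 23.8 kHz
c10_cycles = cycles_per_output(42e-6, 200e-9)  # 210 cycles for 50 taps
```

Seen this way, the TMS320C30 spends a little over one cycle per tap, while the indirect-addressing TMS32010 loop spends a little over four.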
C54x Architecture
TMS320C54x Internal Block Diagram
Architecture optimized for DSP

#1: CPU designed for efficient DSP processing
- MAC unit, 2 accumulators, additional adder, barrel shifter
#2: Multiple busses for efficient data and program flow
- Four busses and large on-chip memory that result in sustained performance near peak
#3: Highly tuned instruction set for powerful DSP computing
- Sophisticated instructions that execute in fewer cycles, with less code and low power demands
Key #1: DSP engine

$$y = \sum_{n=1}^{40} a_n x_n$$

[Figure: x and a feed the multiplier (MPY); the product feeds the adder (ADD), which accumulates y.]
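Key #1: MAC Unit

MAC *AR2+, *AR3+, A

[Figure: MAC unit. Operand sources are labeled Data, Acc A, Temp, Coeff, Prgm, Data, and Acc A. Two signed/unsigned (S/U) inputs feed MPY, followed by the fractional mode bit. ADD combines the product with A, B, or 0 and writes acc A or acc B.]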
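
The MAC slide shows a fractional mode bit after the multiplier. A minimal sketch of what such a bit usually does, assuming the common Q15 convention (16-bit operands with 15 fractional bits, and a one-bit left shift of the product in fractional mode), is given below. The helper names are hypothetical, and Python integers stand in for the accumulator, so overflow and saturation are not modeled.

```python
def q15(value):
    """Convert a float in [-1, 1) to a 16-bit Q15 integer, with clamping."""
    return max(-32768, min(32767, int(round(value * 32768))))


def mac_q15(acc, a, x, fractional=True):
    """One multiply-accumulate step: acc + a * x."""
    product = a * x            # 16 x 16 -> signed product with 30 fractional bits
    if fractional:
        product <<= 1          # assumed fractional mode: realign to 31 fractional bits
    return acc + product


def dot_q15(coeffs, samples):
    """Sum of products of two Q15 sequences, returned as a float."""
    acc = 0
    for a, x in zip(coeffs, samples):
        acc = mac_q15(acc, a, x)
    return acc / 2**31


result = dot_q15([q15(0.5), q15(0.25)], [q15(0.5), q15(-0.5)])  # 0.25 - 0.125
```

The shift removes the redundant sign bit that appears when two signed fractions are multiplied, so the accumulator keeps the same binary point as a value loaded into its upper half.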
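Key #1: Accumulators + Adder

General-purpose math example: t = s + e - r

    LD   @s, A
    ADD  @e, A
    SUB  @r, A
    STL  A, @t

[Figure: the ALU takes its operands through the A-bus and B-bus MUXes from acc A, acc B, the C and D inputs, T, the shifter, and the MAC; the result goes to acc A or acc B and the U bus.]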
Key #1: Barrel shifter

    LD   @X, 16, A
    STH  @B, Y

[Figure: barrel shifter (-16 to +31) with inputs A, B, C, and D, connected to the S bus, the ALU, and the E bus.]
Key #1: Temporary register

    LD   @x, T
    MPY  @a, A

For example: A = xa

[Figure: the temporary register, with inputs labeled D, X, and the EXP encoder (fed by A and B), drives the T bus to the MAC and ALU.]
Key #2: Efficient data/program flow

#2: Multiple busses for efficient data and program flow
- Four busses and large on-chip memory that result in sustained performance near peak
Key #2: Multiple busses

MAC *AR2+, *AR3+, A

[Figure: internal memory and external memory connect through MUXes and the P, C, D, and E buses to the central arithmetic logic unit (T, MAC, A, B, ALU, shifter).]
Key #2: Pipeline

Stage | Abbreviation | Action
Prefetch | P | Calculate address of instruction
Fetch | F | Collect instruction
Decode | D | Interpret instruction
Access | A | Collect address of operand
Read | R | Collect operand
Execute | E | Perform operation
Key #2: Bus usage

[Figure: control (CNTL), PC, and ARs drive internal and external memory through MUXes over the P, C, D, and E buses; the central arithmetic logic unit contains T, MAC, A, B, ALU, and shifter.]
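Key #2: Pipeline performance

Cycle    1  2  3  4  5  6  7  8  9  10 11
Instr 1  P1 F1 D1 A1 R1 X1
Instr 2     P2 F2 D2 A2 R2 X2
Instr 3        P3 F3 D3 A3 R3 X3
Instr 4           P4 F4 D4 A4 R4 X4
Instr 5              P5 F5 D5 A5 R5 X5
Instr 6                 P6 F6 D6 A6 R6 X6

Fully loaded pipeline: from cycle 6 on, all six stages are busy (X1, R2, A3, D4, F5, P6).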
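
The overlap shown on the pipeline performance slide follows a simple rule: instruction i enters stage s in cycle i + s. The sketch below regenerates the chart and the total cycle count. The stage letters P, F, D, A, R, X are taken from the slide; the function names are made up for the example.

```python
STAGES = ["P", "F", "D", "A", "R", "X"]


def total_cycles(num_instructions, depth=len(STAGES)):
    """Cycles to run a stall-free instruction stream through the pipeline."""
    return num_instructions + depth - 1


def pipeline_table(num_instructions, stages=STAGES):
    """One row per instruction, one column per cycle, empty string when idle."""
    width = total_cycles(num_instructions, len(stages))
    rows = []
    for i in range(num_instructions):
        row = [""] * width
        for s, name in enumerate(stages):
            row[i + s] = f"{name}{i + 1}"   # instruction i+1 is in stage s at cycle i+s
        rows.append(row)
    return rows


table = pipeline_table(6)
cycle_6 = [row[5] for row in table]   # the first fully loaded cycle
```

Six instructions need 11 cycles rather than 36, and once the pipeline is full one instruction completes every cycle.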
Key #3: Powerful instructions

#3: Highly tuned instruction set for powerful DSP computing
- Sophisticated instructions that execute in fewer cycles, with less code and low power demands
Key #3: Advanced applications

Application | Instructions
Symmetric FIR filter | FIRS
Adaptive filtering | LMS
Polynomial evaluation | POLY
Code book search | STRCD, SACCD, SRCCD
Viterbi | DADST, DSADT, CMPS
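
To illustrate why a symmetric FIR filter merits its own instruction, the sketch below compares a direct sum of products with a version that first adds the two samples sharing a coefficient. This is a plain Python model of the arithmetic, not of the FIRS instruction itself; the function names are made up, and an even number of taps is assumed.

```python
def fir_direct(h, window):
    """Plain sum of products: one multiply per tap."""
    return sum(c * x for c, x in zip(h, window))


def fir_symmetric(half, window):
    """Symmetric FIR: h[k] == h[N-1-k], so add the paired samples first.

    `half` holds the first N/2 coefficients; `window` holds N samples.
    """
    n = len(window)
    if n != 2 * len(half):
        raise ValueError("window must hold twice as many samples as half-coefficients")
    return sum(c * (window[k] + window[n - 1 - k]) for k, c in enumerate(half))


half = [1, 2, 3]
window = [4, 5, 6, 7, 8, 9]
direct = fir_direct(half + half[::-1], window)
folded = fir_symmetric(half, window)
```

Both give the same result, but the folded form needs half as many multiplies, which is the saving a combined add-and-multiply instruction can exploit.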
C62x Architecture
TMS320C6201 Revision 2

[Block diagram]
- Program cache / program memory: 32-bit address, 256-bit data, 512K bits RAM
- C6201 CPU megamodule
  - Program fetch, instruction dispatch, instruction decode
  - Data path 1: A register file; L1, S1, M1, D1
  - Data path 2: B register file; D2, M2, S2, L2
  - Control registers, control logic, test, emulation, interrupts
- Data memory: 32-bit address; 8-, 16-, 32-bit data; 512K bits RAM
- Peripherals: power down, host port interface, 4-channel DMA, external memory interface, 2 timers, 2 multichannel buffered serial ports (T1/E1)
C6201 Internal Memory Architecture

- Separate internal program and data spaces
- Program
  - 16K 32-bit instructions (2K fetch packets)
  - 256-bit fetch width
  - Configurable as either direct-mapped cache or memory-mapped program memory
- Data
  - 32K x 16
  - Single ported, accessible by both CPU data buses
  - 4 x 8K 16-bit banks
    - 2 possible simultaneous memory accesses (4 banks)
    - 4-way interleave; banks and interleave minimize access conflicts
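
The effect of four interleaved 16-bit banks on simultaneous accesses can be sketched as follows. The mapping used here, with consecutive 16-bit halfwords assigned to consecutive banks, is an assumption made for illustration; the slide states only that there are four 16-bit banks with 4-way interleave. All names are hypothetical.

```python
NUM_BANKS = 4
BANK_WIDTH_BYTES = 2   # each bank is 16 bits wide


def bank_of(byte_address):
    """Bank holding the halfword at this byte address (assumed interleave)."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS


def banks_touched(byte_address, size_bytes):
    """Set of banks used by one access of the given size."""
    first = byte_address // BANK_WIDTH_BYTES
    last = (byte_address + size_bytes - 1) // BANK_WIDTH_BYTES
    return {halfword % NUM_BANKS for halfword in range(first, last + 1)}


def conflicts(access_a, access_b):
    """True if two (address, size) accesses in one cycle need the same bank."""
    return bool(banks_touched(*access_a) & banks_touched(*access_b))


same_bank = conflicts((0, 2), (8, 2))      # halfwords 0 and 4 share bank 0
neighbors = conflicts((0, 4), (4, 4))      # adjacent words use different banks
```

Under this model, two accesses to neighboring data proceed in parallel, while accesses whose addresses differ by a multiple of 8 bytes collide.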
C62x Interrupts

- 12 maskable interrupts
- Non-maskable interrupt (NMI)
- Interrupt return pointers (IRP, NRP)
- Fast interrupt handling
  - Branches directly to 8-instruction service fetch packet
  - Can branch out with no overhead for longer service
  - 7-cycle overhead: time when no code is running
  - 12-cycle latency: interrupt response time
- Interrupt acknowledge (IACK) and number (INUM) signals
- Branch delay slots protected from interrupts
- Edge triggered
C62x Datapaths

[Figure: two datapaths. Register file A (A0 - A15) serves units L1, S1, M1, and D1; register file B (B0 - B15) serves units L2, S2, M2, and D2. Each unit has source ports S1 and S2 and a destination port D; the L and S units also have long ports SL and DL. Cross paths 1X and 2X link the two register files. D1 and D2 drive the address buses DADR1 and DADR2, with load data on DDATA_I1 and DDATA_I2 and store data on DDATA_O1 and DDATA_O2. Legend: cross paths; 40-bit write paths (8 MSBs); 40-bit read paths/store paths.]
Functional Units

- L-Unit (L1, L2)
  - 40-bit integer ALU, comparisons
  - Bit counting, normalization
- S-Unit (S1, S2)
  - 32-bit ALU, 40-bit shifter
  - Bitfield operations, branching
- M-Unit (M1, M2)
  - 16 x 16 -> 32
- D-Unit (D1, D2)
  - 32-bit add/subtract
  - Address calculations
C62x Instruction Packing: Advanced VLIW

- Fetch packet
  - CPU fetches 8 instructions/cycle
- Execute packet
  - CPU executes 1 to 8 instructions/cycle
  - Fetch packets can contain multiple execute packets
- Parallelism determined at compile/assembly time
- Examples (each fetch packet holds instructions A B C D E F G H)
  - 1) 8 parallel instructions
  - 2) 8 serial instructions
  - 3) Mixed serial/parallel groups
    - A // B
    - C
    - D
    - E // F // G // H
- Reduces codesize, number of program fetches, power consumption
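
The three packing examples can be reproduced with a small model in which each instruction carries a flag saying whether the next instruction executes in parallel with it. The flag list is an assumption used for illustration (the slide only shows the `//` grouping), and the function name is made up.

```python
def split_execute_packets(instructions, parallel_with_next):
    """Group a fetch packet into execute packets.

    parallel_with_next[i] is True when instruction i+1 runs in the same
    cycle as instruction i (written `//` on the slide).
    """
    packets, current = [], []
    for instr, chained in zip(instructions, parallel_with_next):
        current.append(instr)
        if not chained:          # end of this execute packet
            packets.append(current)
            current = []
    if current:                  # tolerate a trailing chained instruction
        packets.append(current)
    return packets


fetch_packet = list("ABCDEFGH")
example_1 = split_execute_packets(fetch_packet, [True] * 7 + [False])
example_2 = split_execute_packets(fetch_packet, [False] * 8)
example_3 = split_execute_packets(
    fetch_packet, [True, False, False, False, True, True, True, False]
)
```

Example 3 yields four execute packets from one fetch packet, so the same 256-bit fetch feeds four cycles of execution without padding the code with NOPs.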
C62x Pipeline Operation: Pipeline Phases

- Single-cycle throughput
- Operate in lock step
- Fetch
  - PG: Program Address Generate
  - PS: Program Address Send
  - PW: Program Access Ready Wait
  - PR: Program Fetch Packet Receive
- Decode
  - DP: Instruction Dispatch
  - DC: Instruction Decode
- Execute
  - E1 - E5: Execute 1 through Execute 5

Cycle             1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17
Execute Packet 1  PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 2     PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 3        PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 4           PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 5              PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 6                 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 7                    PG PS PW PR DP DC E1 E2 E3 E4 E5
C62x Pipeline Operation: Delay Slots

- Delay slots: number of extra cycles until a result is
  - written to the register file
  - available for use by a subsequent instruction
- A multi-cycle NOP instruction can fill delay slots while minimizing codesize impact

Instruction type | Execute phases | Delay slots
Most instructions | E1 | None
Integer multiply | E1 E2 | 1
Loads | E1 E2 E3 E4 E5 | 4
Branches | E1 (branch target: PG PS PW PR DP DC E1) | 5
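
The delay-slot counts above translate directly into scheduling arithmetic. The sketch below, with made-up names, computes the first cycle in which a dependent instruction can issue and how many NOP cycles remain after the compiler has found some independent work to place in the slots.

```python
DELAY_SLOTS = {"most": 0, "multiply": 1, "load": 4, "branch": 5}


def earliest_use_cycle(issue_cycle, kind):
    """First cycle in which an instruction may consume the result."""
    return issue_cycle + DELAY_SLOTS[kind] + 1


def nops_needed(kind, independent_cycles=0):
    """NOP cycles left after filling slots with independent execute packets."""
    return max(0, DELAY_SLOTS[kind] - independent_cycles)


load_ready = earliest_use_cycle(0, "load")       # a load issued in cycle 0
padding = nops_needed("load", independent_cycles=2)
```

A load issued in cycle 0 can feed an instruction in cycle 5 at the earliest; with two cycles of unrelated work scheduled in between, two NOP cycles remain, which a single multi-cycle NOP can cover.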
C6000 Pipeline Operation: Benefits

- Cycle time
  - Allows 6 ns cycle time on C67x
  - Allows 5 ns cycle time & single-cycle execution on C62x
- Parallelism
  - 8 new instructions can always be dispatched every cycle
- High-performance internal memory access
  - Pipelined program and data accesses
  - Two 32-bit data accesses/cycle (C62x)
  - Two 64-bit data accesses/cycle (C67x)
  - 256-bit program access/cycle
- Good compiler target
  - Visible: no variable-length pipeline flow
  - Deterministic: order and time of execution
  - Orthogonal: independent instructions
C6000 Instruction Set Features: Conditional Instructions

- All instructions can be conditional
  - A1, A2, B0, B1, B2 can be used as conditions
  - Based on zero or non-zero value
  - Compare instructions can allow other conditions (<, >, etc.)
- Reduces branching
- Increases parallelism
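
How conditional instructions remove a branch can be sketched with a tiny register model. The example computes |a - b| with one compare and two predicated subtracts, one taken when the condition register is non-zero and one when it is zero. The helper `predicated` and the register assignments are hypothetical; only the list of condition registers comes from the slide.

```python
import operator

CONDITION_REGISTERS = {"A1", "A2", "B0", "B1", "B2"}


def predicated(regs, cond_reg, when_zero, dst, op, src1, src2):
    """Execute dst = op(src1, src2) only if the condition register matches."""
    if cond_reg not in CONDITION_REGISTERS:
        raise ValueError(f"{cond_reg} cannot be used as a condition")
    is_zero = regs[cond_reg] == 0
    if is_zero == when_zero:
        regs[dst] = op(regs[src1], regs[src2])


def abs_diff(a, b):
    regs = {"A3": a, "A4": b, "A5": 0, "B0": 0}
    # A compare instruction turns "a > b" into a zero/non-zero value.
    regs["B0"] = 1 if regs["A3"] > regs["A4"] else 0
    # Both subtracts are issued; exactly one of them writes A5.
    predicated(regs, "B0", False, "A5", operator.sub, "A3", "A4")  # if B0 != 0
    predicated(regs, "B0", True, "A5", operator.sub, "A4", "A3")   # if B0 == 0
    return regs["A5"]
```

Because neither subtract depends on the other, both can sit in the same execute packet, which is how predication reduces branching and increases parallelism at once.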
C6000 Instruction Set Addressing Features

- Load-store architecture
- Two addressing units (D1, D2)
- Orthogonal
  - Any register can be used for addressing or indexing
- Signed/unsigned byte, half-word, word, double-word addressable
  - Indexes are scaled by type
- Register or 5-bit unsigned constant index
- 15-bit positive/negative constant offset from either B14 or B15
- Indirect addressing modes
  - Pre-increment: *++R[index]
  - Post-increment: *R++[index]
  - Pre-decrement: *--R[index]
  - Post-decrement: *R--[index]
  - Positive offset: *+R[index]
  - Negative offset: *-R[index]
- Circular addressing
  - Fast and low cost: power-of-2 sizes and alignment
  - Up to 8 different pointers/buffers, up to 2 different buffer sizes
- Dual endian support
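
Restricting circular buffers to power-of-2 sizes and alignment makes the wrap-around a matter of masking address bits, which is why the slide calls it fast and low cost. The sketch below shows the idea in Python; the function name is made up, and the exact hardware behavior is not modeled.

```python
def circular_advance(pointer, step, block_size):
    """Advance a pointer inside an aligned power-of-2 circular buffer.

    Only the low log2(block_size) bits change; the upper bits, which
    identify the buffer, are carried over untouched.
    """
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    mask = block_size - 1
    base = pointer & ~mask                  # start of the aligned buffer
    return base | ((pointer + step) & mask)


wrapped_forward = circular_advance(0x100C, 8, 16)    # 0x1004
wrapped_backward = circular_advance(0x1000, -4, 16)  # 0x100C
```

No comparison against a buffer end address is needed, so the update costs the same as an ordinary pointer increment.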
C67x Architecture
TMS320C6701 DSP Block Diagram

[Block diagram]
- Program cache / program memory: 32-bit address, 256-bit data, 512K bits RAM
- C67x floating-point CPU core
  - Program fetch, instruction dispatch, instruction decode
  - Data path 1: A register file; L1, S1, M1, D1
  - Data path 2: B register file; D2, M2, S2, L2
  - Control registers, control logic, test, emulation, interrupts
- Data memory: 32-bit address; 8-, 16-, 32-bit data; 512K bits RAM
- Peripherals: power down, host port interface, 4-channel DMA, external memory interface, 2 timers, 2 multichannel buffered serial ports (T1/E1)
TMS320C6701 Advanced VLIW CPU (VelociTI™)

- 1 GFLOPS @ 167 MHz
  - 6-ns cycle time
  - 6 x 32-bit floating-point instructions/cycle
- Load-store architecture
- 3.3-V I/Os, 1.8-V internal
- Single- and double-precision IEEE floating-point
- Dual data paths
  - 6 floating-point units / 8 x 32-bit instructions
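
The headline rates follow from the cycle time and the number of operations issued per cycle. A short check, using the figures quoted on the slides (6 ns and 6 floating-point instructions per cycle for the C6701; 5 ns and 8 instructions per cycle for the C62x), with a made-up function name:

```python
def peak_rate(cycle_time_ns, ops_per_cycle):
    """Peak operations per second for a given cycle time and issue width."""
    clock_hz = 1e9 / cycle_time_ns
    return clock_hz * ops_per_cycle


c6701_flops = peak_rate(6, 6)   # about 1e9, i.e. 1 GFLOPS at roughly 167 MHz
c62x_ips = peak_rate(5, 8)      # 1.6e9, i.e. 1600 MIPS at 200 MHz
```

These are peak figures that assume every issue slot is filled in every cycle.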
TMS320C6701 Memory/Peripherals

- Same as C6201
- External interface supports
  - SDRAM, SRAM, SBSRAM
- 4-channel bootloading DMA
- 16-bit host port interface
- 1 Mbit on-chip SRAM
- 2 multichannel buffered serial ports (T1/E1)
- Pin compatible with C6201
TMS320C67x CPU Core

[Block diagram: C67x floating-point CPU core with program fetch, instruction dispatch, and instruction decode; data path 1 (A register file; L1, S1, M1, D1) and data path 2 (B register file; D2, M2, S2, L2); control registers, control logic, test, emulation, and interrupts. The arithmetic logic units, auxiliary logic units, and multiplier units have floating-point capabilities.]
C67x Interrupts

- 12 maskable interrupts
- Non-maskable interrupt (NMI)
- Interrupt return pointers (IRP, NRP)
- Fast interrupt handling
  - Branches directly to 8-instruction service fetch packet
  - 7-cycle overhead: time when no code is running
  - 12-cycle latency: interrupt response time
- Interrupt acknowledge (IACK) and number (INUM) signals
- Branch delay slots protected from interrupts
- Edge triggered
C67x New Instructions

.L Unit (floating-point arithmetic unit)
ADDSP, ADDDP, SUBSP, SUBDP, INTSP, INTDP, SPINT, DPINT, SPTRUNC, DPTRUNC, DPSP

.M Unit (floating-point multiply unit)
MPYSP, MPYDP, MPYI, MPYID, MPY24, MPY24H

.S Unit (floating-point auxiliary unit)
ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
C67x Datapaths

- 2 data paths
- 8 functional units
  - Orthogonal/independent
  - 2 floating-point multipliers
  - 2 floating-point arithmetic
  - 2 floating-point auxiliary
- Control
  - Independent
  - Up to 8 32-bit instructions
- Registers
  - 2 files
  - 32 32-bit registers total
- Cross paths (1X, 2X)
- L-Unit (L1, L2)
  - Floating-point, 40-bit integer ALU
  - Bit counting, normalization
- S-Unit (S1, S2)
  - Floating-point auxiliary unit
  - 32-bit ALU/40-bit shifter
  - Bitfield operations, branching
- M-Unit (M1, M2)
  - Multiplier: integer & floating-point
- D-Unit (D1, D2)
  - 32-bit add/subtract
  - Address calculations

[Figure: register file A (A0 - A15) with units L1, S1, M1, D1 and register file B (B0 - B15) with units L2, S2, M2, D2, linked by cross paths 1X and 2X.]
C67x Instruction Packing: Enhanced VLIW

- Fetch packet
  - CPU fetches 8 instructions/cycle
- Execute packet
  - CPU executes 1 to 8 instructions/cycle
  - Fetch packets can contain multiple execute packets
- Parallelism determined at compile/assembly time
- Examples (each fetch packet holds instructions A B C D E F G H)
  - 1) 8 parallel instructions
  - 2) 8 serial instructions
  - 3) Mixed serial/parallel groups
    - A // B
    - C
    - D
    - E // F // G // H
- Reduces
  - Codesize
  - Number of program fetches
  - Power consumption
C67x Pipeline Operation: Pipeline Phases

- Operate in lock step
- Fetch
  - PG: Program Address Generate
  - PS: Program Address Send
  - PW: Program Access Ready Wait
  - PR: Program Fetch Packet Receive
- Decode
  - DP: Instruction Dispatch
  - DC: Instruction Decode
- Execute
  - E1 - E5: Execute 1 through Execute 5
  - E6 - E10: Double precision only

[Figure: seven execute packets, each starting PG one cycle after the previous one and passing through PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10.]
C67x Pipeline Operation: Delay Slots

- Delay slots: number of extra cycles until a result is
  - written to the register file
  - available for use by a subsequent instruction
- A multi-cycle NOP instruction can fill delay slots while minimizing codesize impact

Instruction type | Execute phases | Delay slots
Most integer | E1 | None
Single-precision | E1 E2 E3 E4 | 3
Loads | E1 E2 E3 E4 E5 | 4
Branches | E1 (branch target: PG PS PW PR DP DC E1) | 5
C67x and C62x Commonality

- Driving commonality between C67x & C62x shortens C67x design time.
- Maintaining symmetry between datapaths shortens the C67x design time.

[Figure: side-by-side block diagrams of the C62x CPU and the C67x CPU. Both contain program fetch & dispatch, decode, two register files, control registers, emulation, and paired units: M-Unit 1/2 (multiplier unit), D-Unit 1/2 (data load/store), S-Unit 1/2 (auxiliary logic unit), and L-Unit 1/2 (arithmetic logic unit). In the C67x CPU the M-, S-, and L-units are extended with floating point.]
TMS320C80 MIMD MULTIPROCESSOR: Texas Instruments - 1996
Copyright 1999