Collated by: Professor Kurt Keutzer, Computer Science 252, Spring 2000
With contributions from: Dr. Brock Barton and Clark Hise, TI; Dr. Surendar S. Magar, Berkeley Concept Research Corporation
[Figure: Application Examples, plotted over 1980 to 1995. Applications shown: Video/Imaging, W-CDMA, Radars, Digital Radios, High-End Control, Modems, Voice Coding, Instruments, Low-End Modems, Industrial Control. Other labels: Multipliers (MUL); µC and Analog.]
Time Frame: Early 1970s; Late 1970s; Early 1980s; Late 1980s; Early 1990s; Late 1990s
Approach: Discrete logic; Building block; Single Chip DSP µP; Function/Application specific chips; Multiprocessing; Single-chip multiprocessing
Primary Application: Non-real time processing, Simulation, Military radars, Digital Comm., Telecom, Control, Computers, Communication
Enabling Technologies: Bipolar SSI, MSI, FFT algorithm, Single chip bipolar multiplier, Flash A/D, µP architectures, NMOS/CMOS, Vector processing, Parallel processing, Advanced multiprocessing (VLIW, MIMD, etc.), Low power single-chip DSP, Multiprocessing
TMS32010, TMS320C25, TMS320C30, TMS320C50, TMS320C2XXX, Multiprocessor Based TMS320C80, TMS320C62XX, TMS320C67XX
20 40 33 57
400 100 60 35 25
5 20 33 60 80
Features
- 200 ns instruction cycle (5 MIPS)
- 144 words (16 bit) on-chip data RAM
- 1.5K words (16 bit) on-chip program ROM - TMS32010
- External program memory expansion to a total of 4K words at full speed
- 16-bit instruction/data word
- Single cycle 32-bit ALU/accumulator
- Single cycle 16 x 16-bit multiply in 200 ns
- Two cycle MAC (5 MOPS)
- Zero to 15-bit barrel shifter
- Eight input and eight output channels
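The throughput figures in the feature list follow from the cycle time. The sketch below redoes that arithmetic; the function names are illustrative, and counting a MAC as two operations (one multiply plus one add) is an assumption made here to reproduce the 5 MOPS figure.

```python
CYCLE_NS = 200   # instruction cycle time from the feature list
MAC_CYCLES = 2   # "Two cycle MAC"


def mips(cycle_ns):
    """Millions of instructions per second at one instruction per cycle."""
    return 1e3 / cycle_ns


def mac_time_ns(cycle_ns, mac_cycles):
    """Time for one multiply-accumulate, in nanoseconds."""
    return cycle_ns * mac_cycles


def mops(cycle_ns, mac_cycles, ops_per_mac=2):
    """Millions of operations per second.

    Assumption: each MAC counts as ops_per_mac operations
    (a multiply and an add by default).
    """
    return ops_per_mac * 1e3 / mac_time_ns(cycle_ns, mac_cycles)
```

Under these assumptions, `mips(CYCLE_NS)` gives 5, `mac_time_ns(CYCLE_NS, MAC_CYCLES)` gives 400 ns, and `mops(CYCLE_NS, MAC_CYCLES)` gives 5.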
[Figure: program memory map, Microprocessor Mode, 16-bit words. Addresses 0, 1, and 2 hold Reset 1st Word, Reset 2nd Word, and Interrupt. Labeled regions: Internal Memory Space, Reserved For Testing, External Memory Space. Address markers: 1525, 1536, 4095.]
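[Figure: sum-of-products filter structure. Coefficients a(n-1), a(n-2), ..., a1, a0 multiply samples X0, X1, ..., X(n-1); the products are added (+) into the accumulator (Acc).]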
For N = 50, indirect addressing: t = 42 µs (23.8 kHz)
For N = 50, direct addressing: t = 21.6 µs (40.2 kHz as given; the reciprocal of 21.6 µs is about 46.3 kHz)
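As a rough illustration of the computation being timed, here is a minimal sketch of an N-tap sum-of-products filter built from multiply-accumulate steps, plus the conversion from time per output sample to maximum sample rate. It is plain Python, not TMS32010 code, and all names are illustrative.

```python
def fir_sample(coeffs, history):
    """One output: accumulate coeffs[k] * history[k]; history[0] is newest."""
    acc = 0
    for a, x in zip(coeffs, history):
        acc += a * x          # one multiply-accumulate per tap
    return acc


def fir_filter(coeffs, samples):
    """Run the filter over a sample stream, starting from a zeroed delay line."""
    history = [0] * len(coeffs)
    out = []
    for x in samples:
        history = [x] + history[:-1]   # shift the delay line by one sample
        out.append(fir_sample(coeffs, history))
    return out


def max_sample_rate_khz(time_per_sample_us):
    """Highest sample rate sustainable if each output takes this long."""
    return 1e3 / time_per_sample_us
```

Feeding an impulse through `fir_filter` returns the coefficients in order, which is a quick sanity check. `max_sample_rate_khz(42.0)` is about 23.8, matching the indirect-addressing figure.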
- 33.3 MFLOPS (million floating-point operations per second)
- 16.7 MIPS (million instructions per second)
- One 4K x 32-bit single-cycle dual-access on-chip ROM block
- Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks
- 64 x 32-bit instruction cache
- 32-bit instruction and data words, 24-bit addresses
- 40/32-bit floating-point/integer multiplier and ALU
- 32-bit barrel shifter
Memory map, shown for Microprocessor Mode and Microcomputer Mode (the preceding region ends at 807FFFh):
808000h to 8097FFh   Peripheral Bus Memory Mapped Registers (Internal) (6K)
809800h to 809BFFh   RAM Block 0 (1K) (Internal)
809C00h to 809FFFh   RAM Block 1 (1K) (Internal)
80A000h to 0FFFFFFh  External, STRB Active
C54x Architecture
Sophisticated instructions that execute in fewer cycles, with less code and low power demands
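[Figure: multiply-accumulate. Products an * xn from MPY feed ADD to form y.]
[Figure: multiplier inputs with S/U controls; labels A, B, O.]
[Figure: ALU with input MUX and E Bus. Operand labels: @x, T and @a, A.]
[Figure: EXP Encoder with A and B. For example: A = xa]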
Four busses and large on-chip memory that result in sustained performance near peak
[Figure: block diagram. INTERNAL MEMORY and EXTERNAL MEMORY connect through MUXES (bus labels P, D, C, E) to the ALU, SHIFTER, MAC, accumulators A and B, and T.]
- Prefetch: Calculate address of instruction
- Fetch: Collect instruction
- Decode: Interpret instruction
- Access: Collect address of operand
- Read: Collect operand
- Execute: Perform operation
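To see how the six stages overlap, the following sketch builds a cycle-by-cycle occupancy table. It assumes an idealized pipeline in which one instruction enters per cycle and nothing stalls; real code can deviate from that, and the function names are illustrative.

```python
STAGES = ["Prefetch", "Fetch", "Decode", "Access", "Read", "Execute"]


def pipeline_schedule(n_instructions, stages=STAGES):
    """Map each cycle (1-based) to the (instruction, stage) pairs active in it.

    Assumption: instruction i enters the first stage in cycle i and advances
    one stage per cycle with no stalls.
    """
    schedule = {}
    for i in range(n_instructions):
        for s, stage in enumerate(stages):
            schedule.setdefault(i + s + 1, []).append((i + 1, stage))
    return schedule


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed to drain n instructions through the idealized pipeline."""
    return n_stages + n_instructions - 1
```

With four instructions, cycle 6 has instruction 1 in Execute while instructions 2, 3, and 4 are in Read, Access, and Decode, and the last instruction finishes in cycle 9.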
[Figure: pipeline timing diagram. Surviving cell labels: X3; R4 X4; A5 R5 X5; D6 A6 R6 X6.]
C62x Architecture
TMS320C6201 Revision 2
Block diagram components:
- Program Cache / Program Memory: 32-bit address, 256-bit data, 512K bits RAM
- Pwr Dwn
- Host Port Interface
- 4 DMA
- C6201 CPU Megamodule
  - Program Fetch, Instruction Dispatch, Instruction Decode
  - Control Registers, Control Logic, Test, Emulation, Interrupts
  - Data Path 1: A Register File; L1, S1, M1, D1
  - Data Path 2: B Register File; D2, M2, S2, L2
- Data Memory: 32-bit address; 8-, 16-, 32-bit data; 512K bits RAM
Program
- 16K 32-bit instructions (2K Fetch Packets)
- 256-bit Fetch Width
- Configurable as either
  - Direct Mapped Cache
  - Memory Mapped Program Memory
Data
- 32K x 16
- Single Ported, Accessible by Both CPU Data Buses
- 4 x 8K 16-bit Banks
  - 2 Possible Simultaneous Memory Accesses (4 Banks)
  - 4-Way Interleave, Banks and Interleave Minimize Access Conflicts
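A small model can show why interleaving reduces conflicts between the two simultaneous data accesses. The sketch assumes low-order interleaving on 16-bit boundaries (consecutive halfwords land in consecutive banks) and models only halfword accesses; the slide does not spell out the exact mapping, so treat `bank_of` as hypothetical.

```python
N_BANKS = 4            # "4 x 8K 16-bit Banks"
BANK_WIDTH_BYTES = 2   # each bank is 16 bits wide


def bank_of(byte_addr):
    """Bank index for a halfword access.

    Assumption: low-order interleave, so consecutive halfwords rotate
    through banks 0, 1, 2, 3, 0, ...
    """
    return (byte_addr // BANK_WIDTH_BYTES) % N_BANKS


def conflicts(addr_a, addr_b):
    """True if two simultaneous halfword accesses would hit the same bank."""
    return bank_of(addr_a) == bank_of(addr_b)
```

Under this mapping, two pointers walking adjacent halfwords never collide, while pointers exactly 8 bytes apart always do.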
- Interrupt Acknowledge (IACK) and Number (INUM) Signals
- Branch Delay Slots Protected From Interrupts
- Edge Triggered
C62x Datapaths
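[Figure: C62x datapaths. Register file A (Registers A0 - A15) serves units L1, S1, M1, D1; register file B (Registers B0 - B15) serves units D2, M2, S2, L2. Cross paths 1X and 2X link the two sides. Unit ports are labeled S1, S2 (sources) and D (destination); L and S units also carry SL and DL ports. DADR1 and DADR2 carry addresses. Legend: Cross Paths; 40-bit Write Paths (8 MSBs); 40-bit Read Paths/Store Paths.]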
Functional Units
L-Unit (L1, L2)
- 40-bit Integer ALU, Comparisons
- Bit Counting, Normalization
S-Unit (S1, S2)
- 32-bit ALU, 40-bit Shifter
- Bitfield Operations, Branching
M-Unit (M1, M2)
- 16 x 16 -> 32
C62x Datapaths
[Figure: C62x datapaths with memory connections. Register files A0 - A15 and B0 - B15, cross paths 1X and 2X, and units L1, S1, M1, D1, D2, M2, S2, L2, plus DDATA_I2 (load data), DADR1 (address), and DADR2 (address). Legend: Cross Paths; 40-bit Write Paths (8 MSBs); 40-bit Read Paths/Store Paths.]
- CPU fetches 8 instructions/cycle
- CPU executes 1 to 8 instructions/cycle
- Fetch packets can contain multiple execute packets
[Figure: three fetch packets, each holding instructions A B C D E F G H, labeled Example 1, Example 2, and Example 3.]
Pipeline phases:
- Fetch: PG (Program Address Generate), PS (Program Address Send), PW (Program Access Ready Wait), PR (Program Fetch Packet Receive)
- Decode: DP, DC
- Execute: E1 - E5 (Execute 1 through Execute 5)

Cycle             1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
Execute Packet 1  PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5
Execute Packet 2      PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5
Execute Packet 3          PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5
Execute Packet 4              PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5
Execute Packet 5                  PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5
Execute Packet 6                      PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5
Execute Packet 7                          PG  PS  PW  PR  DP  DC  E1  E2  E3  E4  E5
- Allows 6 ns cycle time on C67x
- Allows 5 ns cycle time & single cycle execution on C62x
- 8 new instructions can always be dispatched every cycle
- Pipelined Program and Data Accesses
  - Two 32-bit Data Accesses/Cycle (C62x)
  - Two 64-bit Data Accesses/Cycle (C67x)
  - 256-bit Program Access/Cycle
- Visible: No Variable-Length Pipeline Flow
- Deterministic: Order and Time of Execution
- Orthogonal: Independent Instructions
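The cycle times and per-cycle widths above imply peak rates that are easy to work out. The sketch below does so, assuming every cycle dispatches the full eight instructions and uses both data accesses, which is a best case rather than a sustained figure; the function names are illustrative.

```python
def peak_mips(cycle_ns, instr_per_cycle=8):
    """Peak millions of instructions per second at full dispatch width."""
    return instr_per_cycle * 1e3 / cycle_ns


def peak_data_mbytes_per_s(cycle_ns, accesses_per_cycle, bits_per_access):
    """Peak data bandwidth in MB/s, taking 1 MB as 10**6 bytes."""
    bytes_per_cycle = accesses_per_cycle * bits_per_access / 8
    return bytes_per_cycle * 1e3 / cycle_ns


c62x_mips = peak_mips(5.0)                       # 5 ns cycle
c67x_mips = peak_mips(6.0)                       # 6 ns cycle
c62x_data = peak_data_mbytes_per_s(5.0, 2, 32)   # two 32-bit accesses/cycle
c67x_data = peak_data_mbytes_per_s(6.0, 2, 64)   # two 64-bit accesses/cycle
```

With these inputs the C62x peaks at 1600 MIPS and 1600 MB/s of data traffic, and the C67x at roughly 1333 MIPS and 2667 MB/s.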
- A1, A2, B0, B1, B2 can be used as Conditions
- Based on Zero or Non-Zero Value
- Compare Instructions can allow other Conditions (<, >, etc)
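The conditional-execution rules above can be mimicked with a toy register model: an operation carries a condition register and runs only if that register is non-zero (or zero, when negated), and a compare writes 1 or 0 so that other relations can drive the condition. This is a behavioral sketch with hypothetical helper names, not the instruction encoding.

```python
CONDITION_REGISTERS = {"A1", "A2", "B0", "B1", "B2"}


def condition_holds(regs, cond_reg, negate=False):
    """True if the operation should execute under the given condition."""
    if cond_reg not in CONDITION_REGISTERS:
        raise ValueError(f"{cond_reg} cannot be used as a condition")
    nonzero = regs[cond_reg] != 0
    return (not nonzero) if negate else nonzero


def cmp_lt(regs, dst, src1, src2):
    """Compare: write 1 to dst if src1 < src2, else 0."""
    regs[dst] = 1 if regs[src1] < regs[src2] else 0


def predicated_add(regs, cond_reg, negate, dst, src1, src2):
    """Add that takes effect only when its condition holds."""
    if condition_holds(regs, cond_reg, negate):
        regs[dst] = regs[src1] + regs[src2]
```

A "less than" test is thus expressed in two steps: `cmp_lt` sets A1, then any number of operations predicated on A1 (or its negation) follow without a branch.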
- Signed/Unsigned Byte, Half-Word, Word, Double-Word Addressable
- Register or 5-Bit Unsigned Constant Index
  - Any Register can be used for Addressing or Indexing
  - Indexes are Scaled by Type
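"Indexes are scaled by type" means the index counts elements, not bytes. A minimal sketch of that address calculation follows; the element sizes are the usual 1, 2, 4, and 8 bytes, and the function name and the 32-bit wraparound are assumptions for illustration.

```python
TYPE_SIZE_BYTES = {"byte": 1, "halfword": 2, "word": 4, "doubleword": 8}


def effective_address(base, index, access_type, constant_index=False):
    """Byte address of element `index` of the given type, relative to base.

    A constant index must fit in 5 unsigned bits (0 to 31); a register
    index is not range-checked here. The result wraps to 32 bits
    (an assumption of this sketch).
    """
    if constant_index and not 0 <= index <= 31:
        raise ValueError("constant index must fit in 5 unsigned bits")
    return (base + index * TYPE_SIZE_BYTES[access_type]) & 0xFFFFFFFF
```

For example, index 3 from base 0x1000 addresses 0x100C for a word but 0x1006 for a halfword.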
- Fast and Low Cost: Power of 2 Sizes and Alignment
- Up to 8 Different Pointers/Buffers, Up to 2 Different Buffer Sizes
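The slide title for these two points did not survive extraction; they read like the constraints of circular (modulo) buffer addressing, and the sketch below assumes that reading. It shows why power-of-2 sizes and alignment make wraparound cheap: the update reduces to a mask, with no compare or divide. The function name is hypothetical.

```python
def circular_advance(ptr, step, size):
    """Advance ptr by step bytes inside a size-byte, size-aligned buffer.

    Assumption: size is a power of 2 and the buffer starts on a multiple
    of size, so the upper address bits stay fixed and only the low bits wrap.
    """
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of 2")
    mask = size - 1
    return (ptr & ~mask) | ((ptr + step) & mask)
```

Stepping forward 8 bytes from offset 12 in a 16-byte buffer at 0x1000 lands on 0x1004, and stepping back 4 bytes from 0x1000 lands on 0x100C.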
C67x Architecture
[Figure: C67x block diagram. Surviving labels: 4 Channel DMA; Data Path 2 with B Register File and units D2, M2, S2, L2; 2 Timers; Data Memory (32-bit address; 8-, 16-, 32-bit data; 512K bits RAM).]
- Load store architecture
- 3.3-V I/Os, 1.8-V internal
- Single- and double-precision IEEE floating-point
- Dual data paths
- 4-channel bootloading DMA
- 16-bit host port interface
- 1Mbit on-chip SRAM
- 2 multichannel buffered serial ports (T1/E1)
- Pin compatible with C6201
Multiplier Unit
Floating-Point Capabilities
C67x Interrupts
- 12 Maskable Interrupts
- Non-Maskable Interrupt (NMI)
- Interrupt Return Pointers (IRP, NRP)
- Fast Interrupt Handling
  - Branches Directly to 8-Instruction Service Fetch Packet
  - 7 Cycle Overhead: Time When No Code is Running
  - 12 Cycle Latency: Interrupt Response Time
- Interrupt Acknowledge (IACK) and Number (INUM) Signals
- Branch Delay Slots Protected From Interrupts
- Edge Triggered
.M Unit (Floating Point Multiply Unit): MPYSP, MPYDP, MPYI, MPYID, MPY24, MPY24H
.S Unit: ABSSP, ABSDP, CMPGTSP, CMPEQSP, CMPLTSP, CMPGTDP, CMPEQDP, CMPLTDP, RCPSP, RCPDP, RSQRSP, RSQRDP, SPDP
C67x Datapaths
- 2 Data Paths
- 8 Functional Units
  - Orthogonal/Independent
  - 2 Floating Point Multipliers
  - 2 Floating Point Arithmetic
  - 2 Floating Point Auxiliary
- Control
  - Independent
  - Up to 8 32-bit Instructions
- Registers
  - 2 Files
  - 32, 32-bit registers total
- Cross paths (1X, 2X)
- L-Unit (L1, L2)
  - Floating-Point, 40-bit Integer ALU
  - Bit Counting, Normalization
- S-Unit (S1, S2)
  - Floating Point Auxiliary Unit
  - 32-bit ALU/40-bit shifter
  - Bitfield Operations, Branching
- M-Unit (M1, M2)
  - Multiplier: Integer & Floating-Point
- D-Unit (D1, D2)
  - 32-bit add/subtract
  - Addr Calculations
[Figure: C67x datapaths. Register files A0 - A15 and B0 - B15, cross paths 1X and 2X, and units L1, S1, M1, D1, D2, M2, S2, L2 with source (S1, S2), destination (D), and long (SL, DL) ports.]
- Fetch Packet
  - CPU fetches 8 instructions/cycle
- Execute Packet
  - CPU executes 1 to 8 instructions/cycle
  - Fetch packets can contain multiple execute packets
- Parallelism determined at compile/assembly time
- Examples
  - 1) 8 parallel instructions
  - 2) 8 serial instructions
  - 3) Mixed Serial/Parallel Groups
    - A // B
    - C
    - D
    - E // F // G // H
- Reduces
  - Codesize
  - Number of Program Fetches
  - Power Consumption
[Figure: fetch packets of instructions A B C D E F G H for Example 1, Example 2, and Example 3.]
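The three examples can be reproduced with a short sketch that splits one fetch packet into execute packets. It assumes each instruction carries a flag saying whether the next instruction runs in parallel with it (the "//" in the notation above); the flag representation and function name are hypothetical, chosen only for illustration.

```python
def split_execute_packets(names, parallel_with_next):
    """Group a fetch packet's instructions into execute packets.

    names:              instruction labels in program order
    parallel_with_next: one bool per instruction; True means the following
                        instruction executes in the same cycle (assumed flag)
    """
    packets, current = [], []
    for name, chained in zip(names, parallel_with_next):
        current.append(name)
        if not chained:          # this instruction closes the execute packet
            packets.append(current)
            current = []
    if current:                  # tolerate a trailing True on the last flag
        packets.append(current)
    return packets


NAMES = list("ABCDEFGH")
all_parallel = split_execute_packets(NAMES, [True] * 7 + [False])
all_serial = split_execute_packets(NAMES, [False] * 8)
mixed = split_execute_packets(
    NAMES, [True, False, False, False, True, True, True, False]
)
```

Here `all_parallel` is a single execute packet of eight instructions, `all_serial` is eight packets of one, and `mixed` comes out as A with B, then C, then D, then E through H together. Since the grouping is fixed when the code is assembled, the hardware does not have to discover it at run time.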
Pipeline stages: PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
- Operate in Lock Step
- Fetch
  - PG: Program Address Generate
  - PS: Program Address Send
  - PW: Program Access Ready Wait
  - PR: Program Fetch Packet Receive
- Decode
  - DP: Instruction Dispatch
  - DC: Instruction Decode
- Execute
  - E1 - E5: Execute 1 through Execute 5
  - E6 - E10: Double Precision Only

Cycle             1   2   3   4   5   6
Execute Packet 1  PG  PS  PW  PR  DP  DC
Execute Packet 2      PG  PS  PW  PR  DP
Execute Packet 3          PG  PS  PW  PR
Execute Packet 4              PG  PS  PW
Execute Packet 5                  PG  PS
Execute Packet 6                      PG
Execute Packet 7
4 Delay Slots
Branch Target
[Figure: CPU block diagram with two sides, each with a Register file and Decode, plus Control Registers and Emulation. Functional units:
- M-Unit 1, M-Unit 2: Multiplier Unit
- D-Unit 1, D-Unit 2: Data Load/Store
- S-Unit 1, S-Unit 2: Auxiliary Logic Unit
- L-Unit 1, L-Unit 2: Arithmetic Logic Unit]
Copyright 1999