
DSP Processor Fundamentals

Subhasish Mukherjee

Slide: 1

Salient Features of DSP Processors


Fast multiply and accumulate

Multiple access memory architecture


Specialized addressing modes

Specialized execution control

Peripherals and I/O interfaces

Slide: 2

DSP Processor Embodiments


Multichip modules
Multiple dies in a single package
Increased operating speed and reduced power dissipation

Multiple processors on a chip

Chip sets
Divide the processor into two or more packages
Make sense when the processor is very complex and has a large number of I/O pins
Save cost

DSP Cores

Slide: 3

Fixed-Point vs. Floating Point


Most DSPs are fixed-point
Fixed-point DSPs support integer and fractional arithmetic
Limited dynamic range and precision
Cheaper
Mostly use a 16-bit format, though some use 20-bit or 24-bit formats

Floating point DSPs use mantissa and exponent representation


They provide good dynamic range and precision
Mostly use a 32-bit format
Easier to program
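
To make the fixed-point "fraction arithmetic" concrete, here is a minimal sketch of a 16-bit fractional (Q15) multiply in C. The names `q15_t`, `q15_from_double` and `q15_mul` are illustrative assumptions and do not come from the slides or from any particular DSP toolchain.

```c
#include <stdint.h>

/* Q15: 16-bit signed fraction, value = raw / 32768, range [-1, 1). */
typedef int16_t q15_t;

/* Convert a real value to Q15, saturating at the representable limits. */
static q15_t q15_from_double(double x)
{
    double scaled = x * 32768.0;
    if (scaled >= 32767.0) return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    /* Round to nearest; the cast truncates toward zero. */
    return (q15_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Multiply two Q15 values. The 32-bit product is in Q30 format, so it is
 * rounded and shifted right by 15 to return to Q15.
 * Assumes an arithmetic right shift for negative values, which is what
 * common compilers provide. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;     /* Q30 */
    product += 1 << 14;                             /* round to nearest */
    product >>= 15;                                 /* back to Q15 */
    if (product > INT16_MAX) product = INT16_MAX;   /* only -1 * -1 gets here */
    return (q15_t)product;
}
```

For example, `q15_mul(q15_from_double(0.5), q15_from_double(0.5))` gives the raw value 8192, which represents 0.25. The saturation branch shows the limited dynamic range: -1 times -1 cannot be represented as +1 and is clipped.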

Slide: 4

Fixed Point Data Path

Slide: 5

Content of Fixed Point Data Path

Typically incorporate a multiplier, an ALU, shifters, operand registers & accumulators.


Single-cycle multipliers are central to programmable DSPs
Often integrated with an adder to form a multiply-accumulate (MAC) unit

Slide: 6

Accumulator
Holds intermediate and final results of MAC operations
Most DSP processors provide multiple accumulators

Have guard bits so that a number of values can be accumulated without overflow


Guard bits provide greater flexibility than scaling.
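
A small sketch of why guard bits help, assuming 16-bit operands and a 40-bit accumulator (a 32-bit product plus 8 guard bits). The width and the helper names `mac_sum` and `fits_in_bits` are assumptions for illustration. The accumulator is modelled in a 64-bit integer and then checked against the assumed register width.

```c
#include <stdint.h>
#include <stddef.h>

#define ACC_BITS 40  /* assumed width: 32-bit product + 8 guard bits */

/* Accumulate 16x16-bit products, as a MAC unit would. */
static int64_t mac_sum(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)h[i];
    return acc;
}

/* Return 1 if value fits in a signed register of the given width. */
static int fits_in_bits(int64_t value, int bits)
{
    int64_t max = ((int64_t)1 << (bits - 1)) - 1;
    int64_t min = -max - 1;
    return value >= min && value <= max;
}
```

Four full-scale products (-32768 times -32768) already sum to 2^32, which overflows a 32-bit accumulator but fits in the assumed 40-bit one. No input scaling is needed, so no precision is lost.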

Slide: 7

ALU
Implements basic arithmetic and logical operations in a single instruction cycle.
Common operations include add, subtract, increment, negate, logical and, or, not.

Processors differ in the word size used for logical operations.

Slide: 8

Shifter
Used for scaling the input by a power of 2

Eliminates the possibility of overflow or reduces it to an acceptably low level.

Trade-off is loss of precision and dynamic range.
Barrel shifters offer more flexibility.

Slide: 9

Memory Architecture & Addressing Schemes

Slide: 10

Motivation
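An FIR filter involves the following operations
Fetch the MAC instruction
Fetch coefficient h(m)
Fetch delayed input x(n-m)
Multiply both
Add to the previous result
Shift data in the delay line

[Figure: FIR filter block diagram. Input x(n) passes through a chain of z^-1 delays; the taps are weighted by h0, h1, h2, ..., hN-1, hN and summed to give the output y(n)]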

The above set of operations is done for all the taps of the filter for each sample

$$y(n) = \sum_{m=0}^{N} h(m)\, x(n-m)$$
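
The per-tap operations listed on this slide can be sketched in C as follows. This is a plain reference version under stated assumptions: it uses `double` rather than a DSP data type, `ntaps` equals N + 1, and the function name `fir_step` is illustrative.

```c
#include <stddef.h>

/* Compute one output sample y(n) of an FIR filter with taps h[0..ntaps-1].
 * After the call, delay[m] holds x(n-m). */
static double fir_step(const double *h, double *delay, size_t ntaps, double x_n)
{
    if (ntaps == 0)
        return 0.0;

    /* Shift data in the delay line; the oldest sample is dropped. */
    for (size_t m = ntaps - 1; m > 0; m--)
        delay[m] = delay[m - 1];
    delay[0] = x_n;

    /* Fetch coefficient, fetch delayed input, multiply, accumulate. */
    double acc = 0.0;
    for (size_t m = 0; m < ntaps; m++)
        acc += h[m] * delay[m];
    return acc;
}
```

Feeding a unit impulse (1, 0, 0, ...) into a cleared delay line returns the taps h(0), h(1), ... in order, which is a quick way to check the routine. Both loops run once per tap per sample, and that cost is what the DSP architecture sets out to reduce.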

Slide: 11

Motivation
Conventional processors need more than 5 cycles/tap/sample to implement the above FIR filter

DSP architectures try to reduce the cycles needed to compute this primitive
This is accomplished by
Harvard architecture
Efficient addressing modes

Slide: 12

Harvard Architecture
Basic Harvard Architecture
Separate program and data buses
Different from the von Neumann architecture
[Figure: Basic Harvard Architecture. Program Memory on the P bus, Data Memory on the D bus]

Modification 1
Data fetches possible from program memory
Opcode and one data fetch done in parallel
[Figure: Harvard Architecture Modification 1. Program/Data Memory on the P bus, Data Memory on the D bus]

Slide: 13

Harvard Architecture
Modification 2
One program memory
One dual-ported data memory
3 buses for the internal memory
2 for data
1 for program
2 data words can be fetched in parallel with an instruction
[Figure: Harvard Architecture Modification 2. Program Memory on the P bus, multi-port Data Memory on D bus 1 and D bus 2]

Slide: 14

Harvard Architecture
Modification 3
One program memory and a program cache
Two data memories
3 buses for the internal memory
2 for data and 1 for program
2 data words can be fetched in parallel with an instruction
[Figure: Harvard Architecture Modification 3. Program Cache and Program Memory on the P bus, Data Memory 1 on D bus 1, Data Memory 2 on D bus 2]

Slide: 15

Addressing Mode: Circular Addressing

Avoids shifting of data in the delay line
Oldest element is overwritten by the new element
Pointer wraps around once it crosses the start or the end of the circular buffer
Need to maintain 5 parameters for circular buffer operation

Circular buffer - Example
[Figure: circular buffer of eight samples X(n) ... X(n-7) at time instant n. X(n) is the most recent sample and becomes the second most recent at instant n+1. X(n-7) is the oldest sample at instant n and will be overwritten by the new sample at instant n+1]
[Figure: linear view of the buffer between Start and End, holding X(n-m), X(n), X(n-N) and X(n-m-1), with wrap-around at the boundaries]
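
A minimal software sketch of circular addressing, assuming a buffer described by a base pointer, a length and an index to the newest sample. The struct `circ_t` and its field names are hypothetical. A real DSP keeps equivalent state in address-generation registers, and this sketch does not try to reproduce the five parameters mentioned on the slide.

```c
#include <stddef.h>

/* Hypothetical circular delay line; field names are illustrative. */
typedef struct {
    double *buf;     /* start of the buffer */
    size_t  len;     /* number of elements, so end = buf + len */
    size_t  newest;  /* index of the most recent sample x(n) */
} circ_t;

/* Overwrite the oldest sample with x_n. No other data is moved. */
static void circ_push(circ_t *c, double x_n)
{
    c->newest = (c->newest + 1) % c->len;  /* pointer wraps past the end */
    c->buf[c->newest] = x_n;
}

/* Return x(n-m) for 0 <= m < len, stepping backwards with wrap-around. */
static double circ_get(const circ_t *c, size_t m)
{
    return c->buf[(c->newest + c->len - m) % c->len];
}

/* One FIR output sample using the circular delay line. */
static double fir_circ(const double *h, circ_t *c, double x_n)
{
    circ_push(c, x_n);
    double acc = 0.0;
    for (size_t m = 0; m < c->len; m++)
        acc += h[m] * circ_get(c, m);
    return acc;
}
```

The slot after `newest` always holds the oldest sample, so `circ_push` overwrites exactly that element. In software the modulo operation costs cycles. The point of hardware circular addressing is that the wrap-around happens in the address generator at no extra cost.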

Slide: 16

Multiple Access Memories


Supports multiple, sequential accesses per instruction cycle
Can be combined with the Harvard architecture for better performance

Supporting off-chip memory means introducing significant additional delay between processor core and memory

Slide: 17

Multiported Memories
Has multiple independent sets of address and data connections
Can provide multiple simultaneous accesses
Costly
Supporting off-chip memory means a larger and more expensive package

Slide: 18

Program Cache
Simplest type is single instruction repeat buffer

Can be extended to multi word repeat buffer


Another type is a single-sector instruction cache
Can be extended to a cache with multiple independent sectors
Used only for program instructions and not for data

Slide: 19

Wait States
State in which the processor waits to access memory

Conflict wait states
Multiple accesses to a memory that cannot handle multiple accesses

Externally requested wait states


Multiple processors sharing a data bus

TMS320C5x has a special READY pin which can be used by external hardware to signal the processor that it must wait before accessing external memory.
Slide: 20

Multiprocessor Support- Memory Interface


Multiple external memory ports

Sometimes multiple processors share one external memory bus


Bus arbitration required

Two pins can be configured to act as bus request and bus grant signals
TMS320C5x allows external access to on-chip memory through BR and IAQ signals
Helpful for multiprocessor communication without shared memory
Slide: 21

Direct Memory Access


Handled by DMA controller

Coupled with Bus Request and Bus Grant pins of the processor
Some sophisticated DMA controllers reside on-chip and access on-chip memory
Multiple-channel DMA controllers handle multiple memory transfers in parallel

Slide: 22

Memory Addressing Schemes


Implied addressing
Operand addresses are implied
P = X * Y

Immediate data
Operand itself is encoded in the instruction
AX0 = 1234

Memory direct addressing
The address of the data in memory is encoded in the instruction word
AX0 = DM(1000)
Slide: 23

Memory Addressing Schemes


Register direct addressing
Data being addressed resides in a register
SUBF R1, R2

Register indirect addressing
Data resides in memory and its address resides in a register
A0 = A0 + *R5
[Figure: an address register holding 0x1000 points to memory location 0x1000, which holds the value 7]
Slide: 24

Memory Addressing Schemes


Register indirect addressing with pre and post increment
A0 = A0 + *R5++ (Post Increment)
A0 = A0 + *R5++R17 (Post Increment)
Address is incremented by the value stored in register R17

MOVE X: -(R0), A1 (Pre Decrement)
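
These addressing modes map closely onto C pointer idioms. The sketch below is only an analogy: the variable names `r5`, `r17` and `a0` mirror the registers in the examples, and the function names are made up for illustration.

```c
#include <stddef.h>

/* Register indirect with post-increment: a0 = a0 + *r5++, n times. */
static long sum_post_increment(const int *r5, size_t n)
{
    long a0 = 0;
    while (n--)
        a0 += *r5++;          /* use the address, then advance it by one */
    return a0;
}

/* Post-increment by a register value, like *R5++R17: the effective
 * address advances by r17 elements after each access. */
static long sum_post_increment_by(const int *r5, size_t n, size_t r17)
{
    long a0 = 0;
    for (size_t i = 0; i < n; i++)
        a0 += r5[i * r17];
    return a0;
}

/* Pre-decrement: the address is decremented first, then used. */
static long sum_pre_decrement(const int *end, size_t n)
{
    long a0 = 0;
    while (n--)
        a0 += *--end;
    return a0;
}
```

On a DSP the address update is performed by the address generation unit in the same cycle as the data access, so it costs no extra instruction.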

Slide: 25

Memory Addressing Schemes


Register indirect addressing with indexing
Values stored in two address registers are added to form an effective address
Does not change the content of any of the address registers
MOVE Y1, X: (R6 + N6)
LDI *-AR1(1), R7

Slide: 26

Memory Addressing Schemes


Register addressing with bit reversal
Used for FFT
The output or input is in a scrambled order
Bit-reversed address sequence for 3-bit addresses:
000 = 0
100 = 4
010 = 2
110 = 6
001 = 1
101 = 5
011 = 3
111 = 7
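
The scrambled order is obtained by reversing the bits of a linear index. The following is a minimal software sketch (the function name `bit_reverse` is illustrative). A DSP with bit-reversed addressing produces this sequence in its address generator instead.

```c
/* Reverse the low `bits` bits of index i, e.g. bits = 3 for 8 points. */
static unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);  /* move the lowest bit of i into r */
        i >>= 1;
    }
    return r;
}
```

Calling `bit_reverse(k, 3)` for k = 0, 1, ..., 7 yields 0, 4, 2, 6, 1, 5, 3, 7, the sequence listed on the slide.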

Slide: 27

Instruction Set

Slide: 28

Instruction Types
Arithmetic & multiplication
Logic operations
Shifting
Rotation
Comparison
Looping
Branching, subroutine calls and returns
Conditional instructions
Special function instructions
Block floating point instructions, stack operations etc.
Bit manipulation instructions
Slide: 29

Registers
Accumulators

General & special purpose registers


Address registers

Other registers
Stack pointer
Program counter
Loop registers

Slide: 30

Parallel Move Support


Operand related parallel moves
MPY (R0), (R4)
Accesses are limited to those required by the arithmetic operation

Operand unrelated parallel moves
MPY X0, Y0, A   X: (R0)+, X0   Y1, Y: (R4)+
Memory accesses are unrelated to the operands of the ALU operation

Slide: 31

Orthogonality
Indicates the extent to which a processor's instruction set is consistent
Depends upon
Consistency and completeness of the instruction set

Degree to which operands and addressing modes are uniformly available with different operations

Slide: 32

Assembly Language Format


Traditional opcode-operand variety
MPY X0, Y0
ADD P, A
MOV (R0), X0
JMP LOOP

C-like syntax
P = X0 * Y0
A = P + A
X0 = *R0
GOTO LOOP
Slide: 33

Execution Control

Slide: 34

Looping
Hardware looping
RPT #16
MAC (R0)+, (R4)+, A

Software looping
MOVE #16, B
LOOP: MAC (R0)+, (R4)+, A
DEC B
JNE LOOP

Slide: 35

Considerations in Looping
Sometimes a loop repetition count of 0 causes the processor to repeat the loop the maximum number of times
Consider loop effects on interrupt latency
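
One way a zero count can turn into the maximum count is sketched below. This is a hypothetical model of a repeat unit with a 16-bit counter that runs the body first and tests the counter afterwards. It does not describe any specific processor, and `hw_loop_iterations` is an invented name.

```c
#include <stdint.h>

/* Hypothetical repeat unit: execute the body, decrement the counter,
 * and repeat while the counter is nonzero. */
static uint32_t hw_loop_iterations(uint16_t count)
{
    uint32_t iterations = 0;
    do {
        iterations++;   /* the loop body would execute here */
        count--;        /* a count of 0 wraps to 0xFFFF in 16 bits */
    } while (count != 0);
    return iterations;
}
```

In this model a count of 16 gives 16 iterations, but a count of 0 gives 65536. A computed loop count should therefore be checked for zero before it is loaded into the loop register.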

Slide: 36

Nesting
Directly nestable
Hardware loop instruction placed within the outer loop

Partially nestable
Single instruction loop inside multi instruction loop

Software nestable
Multi instruction hardware loops are nested by saving various registers like loop index, loop start & loop count
Slide: 37

Interrupts
Interrupt sources
On chip peripherals, External interrupt lines and software interrupts

Interrupt vectors
Associating each interrupt with a different memory address
Typically one or two word long and are located in low memory

Usually contains a branch or subroutine call to an interrupt handler routine

Slide: 38

Interrupt latency
Time from the assertion of an external interrupt line to the execution of the first word of the interrupt vector
The following add up to the interrupt latency
The interrupt line must be asserted some time before the start of an instruction cycle for the interrupt to be recognized in that cycle (set-up time)
The signal has to pass through synchronization stages
Wait until the processor reaches an interruptible state
Wait until all instructions in the pipeline are finished
If the interrupt vector holds only the address of the interrupt routine, the time required to branch to that location

Slide: 39

Stacks
Typically one of the three kinds of stack support is provided
Shadow registers
Hardware stack
Software stack

Slide: 40

Pipelining

Slide: 41

Pipelining and Performance


Technique for increasing the performance of a processor
Breaks a sequence of operations into smaller pieces
Executes the pieces in parallel whenever possible

Hypothetical processor
Fetch an instruction word from memory
Decode the instruction
Read/write data operands from/to memory
Execute the ALU or MAC operation of the instruction
Slide: 42

Pipelining and Performance


Clock Cycle           1     2     3     4     5     6     7
Instruction Fetch     I1    I2    I3    I4    I5    I6    I7
Decode                      I1    I2    I3    I4    I5    I6
Data Read/Write                   I1    I2    I3    I4    I5
Execute                                 I1    I2    I3    I4
(The four stages, top to bottom, form the pipeline depth)

Perfect Overlap

100% utilization of processor execution stages


Ideal scenario
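
The ideal case can be summarized with a little arithmetic. The sketch below assumes one cycle per stage and no stalls, and the function names are illustrative. Stage 0 is instruction fetch and stage `depth - 1` is execute.

```c
/* Cycles needed to finish n instructions with perfect overlap. */
static unsigned pipelined_cycles(unsigned n, unsigned depth)
{
    return n == 0 ? 0 : depth + n - 1;
}

/* Cycles needed if each instruction runs all stages before the next starts. */
static unsigned unpipelined_cycles(unsigned n, unsigned depth)
{
    return n * depth;
}

/* Instruction number occupying `stage` at 1-based `cycle`, or 0 if the
 * stage is still empty while the pipeline fills. */
static unsigned instr_in_stage(unsigned cycle, unsigned stage)
{
    return cycle > stage ? cycle - stage : 0;
}
```

For the four-stage pipeline on this slide, seven instructions finish in 4 + 7 - 1 = 10 cycles instead of 28. Once the pipeline is full, one instruction completes every cycle.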
Slide: 43

Conflicting Instruction
Clock Cycle           1     2     3     4     5     6     7
Instruction Fetch     I1    I2    I3    I4    I5    I6    I7
Decode                      I1    I2    I3    I4    I5    I6
Data Read/Write                   I1    I2    I3*   I4    I5
Execute                                 I1    I2*   I3    I4
(* marks the two instructions that contend for memory in cycle 5)

I2 tries to write to memory while I3 tries to read memory

Solution to this problem is interlocking


Interlocking delays the conflicting instruction in the pipeline
Slide: 44

Interlocking
Clock Cycle           1     2     3     4     5     6     7
Instruction Fetch     I1    I2    I3    I4    I4    I5    I6
Decode                      I1    I2    I3    I3    I4    I5
Data Read/Write                   I1    I2    I2    I3    I4
Execute                                 I1    I2    NOP   I3

Interlocking resolves resource conflict


Pipeline sequencer holds instruction I3 at the decode stage
I4 is held at the fetch stage
One instruction cycle penalty occurs
Slide: 45

Multicycle Branching Effects


Clock Cycle           1     2     3     4     5     6     7     8
Instruction Fetch     BR    I2    --    --    I4    I5    I6    I7
Decode                      BR    --    --    --    I4    I5    I6
Data Read/Write                   BR    --    --    --    I4    I5
Execute                                 BR    NOP   NOP   NOP   I4

When a branch instruction reaches the decode stage, one instruction has already been fetched and has to be flushed from the pipeline
NOPs are executed for the invalidated pipeline slots
A multicycle branch typically executes for as many cycles as the pipeline depth
Slide: 46

Delayed Branching Effects


Clock Cycle           1     2     3     4     5     6     7     8
Instruction Fetch     BR    N2    N3    N4    I4    I5    I6    I7
Decode                      BR    N2    N3    N4    I4    I5    I6
Data Read/Write                   BR    N2    N3    N4    I4    I5
Execute                                 BR    N2    N3    N4    I4

An alternative to the multicycle branch that does not flush the pipeline
Instructions to be executed before the branch takes effect must be located immediately after the branch instruction in memory
Increased efficiency, but the code can be confusing on casual inspection
Slide: 47

Interrupt Effects
Clock Cycle           3     4     5     6     7     8     9     10
Instruction Fetch     I6    --    --    --    V1    V2    V3    V4
Decode                I5    INTR  --    --    --    V1    V2    V3
Data Read/Write       I4    I5    INTR  --    --    --    V1    V2
Execute               I3    I4    I5    INTR  NOP   NOP   NOP   V1
(The interrupt is recognized before cycle 4)

Processor inserts the INTR instruction in the pipeline
INTR is a special branch instruction that flushes the pipeline and jumps to the appropriate interrupt vector location
Causes a 4 cycle delay before the first word of the interrupt vector is executed
I6 is flushed but would be refetched on returning from the interrupt
Slide: 48

Fast Interrupt Processing


Clock Cycle           1     2     3     4     5     6     7     8
Instruction Fetch     I3    I4    V1    V2    I5    I6    I7    I8
Decode                I2    I3    I4    V1    V2    I5    I6    I7
Execute               I1    I2    I3    I4    V1    V2    I5    I6
(The interrupt is recognized before cycle 3)

Interrupt handler is stored at the interrupt vector location
In this case V1 and V2 are the two instructions in the interrupt vector
This is called a fast interrupt because it does not insert any delay in the pipeline

Slide: 49

Peripherals

Slide: 50

Serial Ports
Serial interface transmits and receives data one bit at a time
Requires far fewer interface pins than a parallel interface

Used for a variety of applications
Sending/receiving data to/from A/D and D/A converters
Sending/receiving data to/from other processors or DSPs
Communicating with other external peripherals
Slide: 51

Serial Ports
Synchronous
Transmits a bit clock signal in addition to the serial data bits
Receiver uses the bit clock to sample the received data

Asynchronous
Does not transmit a separate clock signal
Receiver deduces the clock from the serial data itself
More complex
Slide: 52

Data and Clock


[Figure: timing diagram of the BIT CLOCK, FRAME SYNC and DATA signals]

Most DSPs allow changing the clock polarity, data polarity and shift direction
Frame sync signal indicates the position of the first bit of a data word on the serial data line
Common frame sync formats are bit length and word length
Can also have multiple words per frame

Slide: 53

Serial Clock Generation


DSPs provide circuitry for clock generation
Usually called serial clock generation support
Normally done by scaling down the DSP master clock
Usually contains a prescaler and a down counter
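
As a rough illustration of the prescaler and down-counter arrangement, the sketch below assumes that the prescaler divides the master clock by `prescale` and the down counter divides the result by `count`. This is an assumed model with an invented function name. Real devices often encode the divisors differently (for example as N + 1), so the data sheet of the specific part takes precedence.

```c
#include <stdint.h>

/* Assumed model: serial clock = master clock / prescale / count.
 * A divisor of 0 is treated here as "clock disabled". */
static uint32_t serial_clock_hz(uint32_t master_hz, uint32_t prescale,
                                uint32_t count)
{
    if (prescale == 0 || count == 0)
        return 0;
    return master_hz / prescale / count;
}
```

Under this model a 40 MHz master clock with a prescale of 4 and a count of 5 gives a 2 MHz serial clock.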

Slide: 54

Time Division Multiplex


[Figure: four DSPs connected to shared CLOCK, FRAME SYNC and DATA lines]

One processor (or external circuitry) generates the clock and frame sync signals
Frame sync indicates the start of a new set of time slots
Transmitted data word might contain some number of bits to indicate the destination DSP
Other bits are used for data
Slide: 55

Timers
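Programmable timers are often a source of periodic interrupts
May also be used as a software-controlled square wave generator
[Figure: timer block diagram. The clock source feeds a prescaler (loaded with the prescale preload value) followed by a counter (loaded with the counter preload value)]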

Slide: 56

Parallel Ports
Transmit/receive multiple data bits at a time
Faster than serial ports but require more pins
External data bus may be used as a parallel port
Can also have separate parallel ports

Bit I/O ports
Individual pins can be made input or output on a bit-by-bit basis

Host ports
Specialized 8/16-bit bidirectional parallel ports used for data transfer between the DSP and a host microprocessor
May be used to control the DSP

Communication ports
Special parallel ports intended for multiprocessor communication
Slide: 57
