
Signal Processing for Wireless Communications and Multimedia: Design, Tools, Architectures

Advanced Digital System Design Course 2006, EPF-L

Prof. Heinrich Meyr

RWTH Aachen University, Germany
and
Chief Scientific Officer, CoWare Inc

Agenda

Future Wireless Communication Systems and their Impact on ESL
The End of Moore's Law
Receiver Structure, Models and Performance Metrics
Massive Parallel Processing on heterogeneous MPSoC
Application Specific Processors
Summary and Conclusions

Future Wireless
Communication Systems

Internet Access Today

Fixed
DSL (→3 Mb/s)
Intranet (100Mb/s)
Wireless
WLAN (10-54 Mb/s)
Mobile
UMTS (2 Mb/s)

Mobile Internet Access

The Vision
UMTS Standard: Ultra High-Speed 2 Mb/s Mobile Information and Communication everywhere at low cost

Reality today:
UMTS 0.1-0.3 Mb/s
GSM/GPRS 0.02 Mb/s
€€€
In optimally located places
For a few users

4G and Beyond

New concepts
Ultra high speed transmission
Mobile multimedia processing

Wearable and environmental information processing

Smart systems

Flexible, cognitive radio access

Multi-Processor Systems on Chip (MPSoC)

Digitized radio front end

Mobile Applications and Services

Future mobile wireless internet services:


Information (web browsing, …)
Communication (VoIP, video, P2P, …)
Entertainment (distributed gaming, …)

Challenging mobile application classes


Wearable and environmental information processing:
work, sport, health care
e.g. location aware services, seamless mobile working
Mobile multimedia processing
e.g. entertainment, information access, navigation,…

Future Wireless Systems: In a nutshell

Will be
cognitive
multifunctional
software definable

Will have
multiple antennas

They will make use of ultra-complex signal processing
to optimally use the available bandwidth
and process these algorithms on heterogeneous
configurable computing engines

Future Wireless Communication Systems
and their Impact on ESL

Impact of NGMN on Design Process: I

To meet the schedule of NGMN it is imperative to have a concurrent and iterative development and validation process to design
Standard
Development and validation of algorithm and HW/SW of the digital receiver
Application (SW) development

New approaches are needed!

Impact of NGMN on Design Process: II

Development and integration issues need to be uncovered as early as possible

Companies cannot wait for hardware to be available to start software development

Development costs need to be reduced and schedules accelerated

New approaches are needed!


Virtual Platform Based Development

[Figure: timeline comparing hardware and software development when a virtual platform is used. Hardware development track: specification, incremental virtual platform development, initial simulator availability, simulator/HW refinement, initial HW availability, silicon. Software development track: incremental software development on the virtual platform (develop, unit test and integrate OS, device software, connectivity stack, UI and application; debugging), followed by hardware integration, validation and system test integration. Annotated benefits: reduced bring-up, reduced system test.]

The End of Moore's Law:

"Design Competence rules the World"

Cross-disciplinary Task Management

Analysis
The task comprises many subtasks in various disciplines
“The whole is more than the sum of the parts”

Conclusion

The solution requires the interaction of people in the
various disciplines

The Paradigm Shift: Innovation Overtakes Scaling

Innovation now dominates performance gains between generations


This means that “scheduled invention” is now the majority component in
all technology gains

[Figure: IBM Transistor Performance Improvement. Bars of relative % improvement (0 to 100) per technology generation, split into gain by traditional scaling and gain by innovation. Generations labelled CMOS5X, CMOS6X, CMOS7S-SOI, CMOS8S2, CMOS9S, CMOS10S, CMOS11S; feature sizes labelled 550 nm, 350 nm, 250 nm, 180 nm, 130 nm, 90 nm, 65 nm.]

Source: Lisa Su / IBM: MPSoC Conference 2005

The Paradigm Shift: Integrated Design Approach


Future improvement in systems performance will require an integrated design approach

Microprocessor frequency will no longer be the dominant driver of system level performance
Integration over the entire stack, from semiconductor technology to end-user applications, will replace scaling as the major driver of increased system performance
Systems will be designed with the ability to dynamically manage and optimize power
Scale-out and small SMPs will continue to outpace scale-up growth
Systems will increasingly rely on modular components for continued performance leadership

[Figure: the design stack and the optimization levers at each level.
Application: Languages, Software Tuning, Efficient Programming, Middleware
System Level: Dynamic Optimization, Assist Threads, Morphing Support, Fast Computation Migration, Power Optimization, Compiler Support
Chip Level: Morphing, Multiple Cores, SMT, Accelerators, Power Shifting, Interconnect, Circuits
Technology: Silicon Innovation, Packaging, Efficient Cooling, Dense SRAM, embedded DRAM]

Source: Lisa Su / IBM: MPSoC Conference 2005

Core Proposition

ASIP based Platforms
(heterogeneous MPSoC)

[Figure: rotated annotation text, only partly recoverable from the extraction. Legible fragments: "Physics and Technology", "shrinking Geometries", "Soft Errors", "Power Consumption".]


The Human Element

Building and managing an interdisciplinary engineering team of

1. Algorithm Designers
2. Computer/Compiler Architects
3. System Integrators
4. RTL Designers

Most critical problem: Design Competence

No psychobabble: It is the most critical element

Cross-disciplinary Task

Algorithm

Architecture

Tools


Food Chain and Alliances

Service Provider

Equipment Manufacturers (SIEMENS)

Semiconductor House

Enabling Technology Providers

Alliances and the Business Equation

Managing alliances is a key to success

EDA
Mobile provider
Semiconductor company

Receiver Structure, Models and Performance Metrics

Design - Space I: Physical Layer

Complexity

Bandwidth

Power

25

System Design

System design = algorithm design + implementation

algorithm design space
implementation design space

JOINTLY optimizing algorithm and architecture

Center-of-Gravity Approach

Algorithm

Architecture

Tools

Methodology

Design - Methodology: I

Mathematical Theory and Experiment are complementary


Design - Methodology: II

Mathematical Theory provides Bounds

1. Estimation and Detection Theory used to systematically derive (optimum) Receiver Structures
   Synthesis

2. Mathematical Analysis used to compute Performance Bounds
   Analysis

Design - Methodology: III

Computer Simulation is used to

1. Obtain numerical Performance Data
   Detection Loss
   Implementation Loss

2. Validate a Design (Conformance to Standards)

3. Verify Correctness of Implementation (Verification) against test patterns

Models

Communication Model

34

16
Signal Model

36

17
Received Bandlimited Signals

37

Approximation by BL Signals

Approximation of a non-bandlimited signal x(t) by a BL signal:

$$E\{\,|x(t) - x_{BL}(t)|^2\,\} = \int_{|\omega| \ge B_L} S_x(\omega)\, d\omega$$

Truncation defines a (2K+1)-dimensional approximation in vector space:

$$\sum_{k=-K}^{K} x(kT_s)\, \varphi_k(t)$$

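A minimal Python sketch of the truncation step, under assumptions the slide does not state: the basis functions are taken to be shifted sinc pulses, phi_k(t) = sinc((t - k*T_s)/T_s), and the test signal sinc^2(t / 2T_s) is chosen only because it is band-limited and its samples decay quickly. The sketch evaluates the (2K+1)-term sum at one time instant for two values of K.

```python
import math

def sinc(u):
    """Normalized sinc: sin(pi*u) / (pi*u), with sinc(0) = 1."""
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)

def x(t, T=1.0):
    """Illustrative test signal (an assumption): sinc^2(t / 2T), band-limited to 1/(2T)."""
    return sinc(t / (2.0 * T)) ** 2

def truncated_series(t, K, T=1.0):
    """(2K+1)-term approximation: sum over k = -K..K of x(kT) * sinc((t - kT) / T)."""
    return sum(x(k * T, T) * sinc((t - k * T) / T) for k in range(-K, K + 1))

t0 = 0.5  # evaluate halfway between two sampling instants
err_small = abs(x(t0) - truncated_series(t0, K=5))
err_large = abs(x(t0) - truncated_series(t0, K=50))
print(f"K=5:  error {err_small:.2e}")
print(f"K=50: error {err_large:.2e}")
```

The larger K is, the more coordinates of the vector-space representation are kept and the smaller the truncation error becomes for this test signal.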

18
Equivalence of digital/analog Signal Processing

40

Properties

41

19
Canonical Receiver Model
1 sample/symbol

[Figure: canonical receiver model. RF & ADC feed the inner receiver, which consists of the signal detection path and the parameter estimation path. The outer receiver consists of the channel decoder followed by the source decoder. Inputs labelled "From Channel Decoder" and "From Source Decoder" feed back into the inner receiver.]

Use the estimated channel parameters in the detection path as if they were the true values

H. Meyr et al., “Digital Communication Receivers”, J. Wiley 1998


Receiver Task

Inner Receiver
To provide a “good” channel to the decoder based on the
principle of synchronized Detection.
NOTHING ELSE !

Outer Receiver
To decode the information

43

20
Performance Measure

Inner Receiver
Properties of the estimator
Variance
Unbiased

Outer Receiver
Bit-error-rate of the coded system

44

Performance Loss

Detection Loss of synchronized Detection
ΔSNR (dB) required to achieve the performance obtained with perfect channel knowledge (infinite-precision arithmetic assumed)

Implementation Loss
ΔSNR (dB) resulting from finite-precision arithmetic and algorithmic approximations
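As an illustration of how the two losses are read off simulated BER curves, here is a small Python sketch. Everything numeric in it is an assumption: uncoded BPSK over AWGN stands in for the system, and the curves for synchronized detection (floating point and fixed point) are modelled simply as shifted copies of the ideal curve, with hypothetical shifts of 0.3 dB and 0.5 dB.

```python
import math

def ber_bpsk(snr_db):
    """BER of uncoded BPSK over AWGN: 0.5 * erfc(sqrt(Eb/N0))."""
    ebn0 = 10.0 ** (snr_db / 10.0)
    return 0.5 * math.erfc(math.sqrt(ebn0))

def required_snr_db(ber_curve, target_ber, lo=-5.0, hi=25.0, tol=1e-6):
    """Bisection on a monotonically decreasing BER curve."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ber_curve(mid) > target_ber:
            lo = mid   # still too many errors, need more SNR
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical curves (assumptions, not measured data):
perfect = ber_bpsk                                    # perfect channel knowledge
sync_float = lambda snr_db: ber_bpsk(snr_db - 0.3)    # estimated parameters, infinite precision
sync_fixed = lambda snr_db: ber_bpsk(snr_db - 0.5)    # estimated parameters, finite precision

target = 1e-3
snr_perfect = required_snr_db(perfect, target)
snr_float = required_snr_db(sync_float, target)
snr_fixed = required_snr_db(sync_fixed, target)

detection_loss_db = snr_float - snr_perfect
implementation_loss_db = snr_fixed - snr_float
print(f"detection loss      {detection_loss_db:.2f} dB")
print(f"implementation loss {implementation_loss_db:.2f} dB")
```

In a real design flow the three curves would come from simulation runs rather than from a closed-form expression.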

45

21
BER Performance

Source: Digital Communication Receivers, H. Meyr, M. Moeneclaey, S.A. Fechtel

48

Complexity DVB-S

Source: Digital Communication Receivers, H. Meyr, M. Moeneclaey, S.A. Fechtel

49

22
DVB-S Chip

Siemens-RWTH Aachen (ISS) Design 1997

0.5 µm technology
3 metal layers
1.5 W @ 88 MHz
> 500 k transistors
First silicon success


DVB-T Specifications

Digital terrestrial video broadcasting:

high symbol rates: up to 7.4 Msym/s
sensitive modulation: 4 - 64 QAM
net bit rate up to 31.67 Mb/s
wide range of channels: (AWGN) 0 < Tau < 224 µs (SFN)
error correction:
  outer code: Reed-Solomon (204,188)
  inner code: punctured convolutional
  BER < 10e-9 (after RS)
3 dB < Es/No < 40 dB
Challenges: > 200 transmission modes
  algorithms
  design methodology
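The maximum net bit rate quoted above can be checked with a short calculation. The sketch below is in Python and uses parameters that are not on the slide but are taken from the DVB-T standard (ETSI EN 300 744) for the fastest mode in an 8 MHz channel: 8k mode with 6048 data carriers, useful symbol duration 896 µs, guard interval 1/32, 64-QAM (6 bits per carrier) and inner code rate 7/8.

```python
def dvbt_net_bit_rate(data_carriers, bits_per_carrier, inner_rate,
                      rs_n, rs_k, t_useful_s, guard_fraction):
    """Net (useful) bit rate in bit/s after inner code and Reed-Solomon overhead."""
    t_symbol = t_useful_s * (1.0 + guard_fraction)   # OFDM symbol incl. guard interval
    coded_bits = data_carriers * bits_per_carrier    # bits carried by one OFDM symbol
    net_bits = coded_bits * inner_rate * rs_k / rs_n
    return net_bits / t_symbol

rate = dvbt_net_bit_rate(
    data_carriers=6048,      # 8k mode (assumption from the standard)
    bits_per_carrier=6,      # 64-QAM
    inner_rate=7 / 8,        # punctured convolutional code
    rs_n=204, rs_k=188,      # Reed-Solomon (204,188), as on the slide
    t_useful_s=896e-6,       # 8k mode, 8 MHz channel
    guard_fraction=1 / 32,
)
print(f"net bit rate: {rate / 1e6:.2f} Mb/s")
```

Varying constellation, code rate, guard interval and FFT size in such a formula also shows where the large number of transmission modes comes from.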

51

23
System Performance: DVB-T

52

DVB-T Chip: First single Chip Solution

Joint Infineon-Nokia-ISS Design 1999

AGC: Automatic Gain Control
IQ: IQ-Mixer and Resampling
PPU: Postprocessing Unit
FFT: Fast Fourier Transform (2k, 8k)
DTO: Digital Timing Oscillator
RAM: OFDM Symbol Memory
CHE: Channel Estimation
IFFT: Inverse FFT and Fine Timing
ESG: Equalization and Softbit Generation
FEC: Forward Error Correction (Viterbi, Reed-Solomon)

DVB-T Complexity

Analog part : 10%


Input interfaces
DC removal
anti-aliasing filter
ADC,AGC
Digital demodulator: 60 %
Channel estimation and equalization
synchronization
control flow implementation
FFT (alone 30%)
Channel decoder : 20 %
Viterbi and RS decoder
Miscellaneous : 10%
IIC bus controller, DAC

54

Design Space : Architecture and Algorithm

Inner Receiver
The algorithms of the inner receiver are never specified
by the standard
BOTH algorithm and architecture space exploration

Outer Receiver
The decoder is exactly specified in the standard
ONLY architecture space exploration

55

25
Massive Parallel Processing on
Heterogeneous MPSoC

Parallel Computing in Mobiles

Massive parallelism required in the foreseeable future

                          2003    2009    2013
Frequency (MHz)            300     600    1500
Giga Operations            0.3      14    2458
Operations per Cycle         1      23    1638

Source: International Technology Roadmap for Semiconductors (ITRS, TX 2003)
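The last row of the table follows from the first two: operations per cycle = operations per second / clock frequency. A small Python check using only the numbers from the table (the function name is illustrative):

```python
# (frequency in MHz, giga operations per second), copied from the table above
roadmap = {
    2003: (300, 0.3),
    2009: (600, 14),
    2013: (1500, 2458),
}

def ops_per_cycle(freq_mhz, giga_ops):
    """Operations that must complete in every clock cycle."""
    return (giga_ops * 1e9) / (freq_mhz * 1e6)

for year, (freq_mhz, giga_ops) in roadmap.items():
    print(year, f"{ops_per_cycle(freq_mhz, giga_ops):.1f} operations per cycle")
```

The clock frequency grows by a factor of 5 between 2003 and 2013, while the required operations per cycle grow by more than three orders of magnitude, which is the argument for massive parallelism.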

58

26
Why Many-Processor Architectures today?

Not because of a fundamental breakthrough in novel software and parallel architecture

... simply because the problems with traditional architectures pose an even greater challenge

59

Guiding Principles for Manycore SoC: I

Energy efficiency and power are the dominating issues
There exists a fundamental trade-off between energy efficiency and flexibility
Below 65 nm high soft and hard error rates occur
Bandwidth improves by at least the square of the improvement in latency
Memory wall: loads and stores are slow (up to 200 cycles to access DRAM)

Guiding Principles for Manycore SoC: II

Multiplies are fast
Instruction Level Parallelism (ILP) wall: diminishing returns on finding new ILP
Brick wall: Power Wall + Memory Wall + ILP Wall
Increasing parallelism and decreasing clock frequency is the primary source of improving processor performance

61

GP -Processor Performance Improvement between 1978 and 2006

Source: Seven Questions and Seven Dwarfs for Parallel


Computing,
UC Berkeley Report, June 2006

62

28
Parallel Computing

“Switching from sequential to modestly parallel computing will make programming much more difficult … without a dramatic improvement in performance”
Source: Seven Questions and Seven Dwarfs for Parallel Computing,
UC Berkeley Report, June 2006

We need to go from multiple processors to many cores


The Need for New Architectures


[Figure: growth over time of several curves, labelled Algorithmic Complexity (Shannon's Law) with the wireless generations 1G, 2G, 3G and 4G marked, Memory, Moore's Law, Microprocessor / DSP, and Battery Power.]

Source: R. Subramanian, Berkeley Design Automation Inc

Computational Efficiency vs. Flexibility

[Figure: design points arranged between flexibility at one end and efficiency at the other; the ASIP (ICORE, DVB-T Sync & Track) is marked.]

How to Exploit the Design Space and Design MPSoCs?

Design Principles

Focus first on applications and constituent algorithms, not the silicon architecture
  Identify key attributes of the application
  Identify periodicity of signal processing tasks (cyclostationarity)
  Block processing
  Identify loose coupling of tasks
Use extensive profiling to find the spatial and temporal mapping with the following goals
  Minimize the processor flexibility to a constrained set to optimize the energy efficiency
  Maximize the software parameterizability and ease of use of the programmer’s model for flexibility


MPSoC design flow: Temporal and Spatial Mapping


[Figure: MPSoC design flow. The application is a chain of tasks (Task 1 to Task 5); the open question is how to map them onto the platform. The specification leads to an MPSoC virtual prototype and then to an MPSoC HW prototype. Both prototypes consist of HW blocks, processors (Proc) and memories (Mem) connected by a network-on-chip.]

MPSoC exploration principles

Interconnect
Structure

Divide and conquer


Separate processing elements from communication
Early SW performance estimation

69

MPSoC virtual prototyping platform

[Figure: Tasks 1 to 4 are mapped onto VPUs (processor simulators) that communicate through a NoC simulator. The interconnect structure can be a P2P model, a bus model or a router model.]

Communication: CoWare Architect's View Framework (AVF)
VPU: virtual processing unit
Enables modeling spatial and temporal task-to-PE mapping

MPSoC exploration Results

An MPSoC is defined by its processing elements (PE) and their interconnect (NoC)
Interconnect is defined by its topology.
Communication performance is measured for a given topology
PE performance is determined by a set of numbers


Message Sequence Chart (MSC) Trace

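[Figure: message sequence chart trace.]

Aggregated Communication Graph

[Figure: aggregated communication graph next to the message sequence chart, shown as interacting partner view and topology view.]

Histogram Views

[Figure: histogram next to the message sequence chart, together with the interacting partner view and topology view.]
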
MPSoC Exploration Results: Communication

Source: Seven Questions and


Seven Dwarfs for Parallel
Computing,
UC Berkeley Report, June 2006

75

The "Key Algorithm" Proposition

Each application is composed of a small number of fundamental algorithms ("Nuclei") that represent a significant amount of the computation.

Focus on an efficient composition ("design of an MPSoC") or mapping ("programming of the MPSoC")

Composition of Nuclei

Nuclei can be composed/mapped on a multiprocessor in three different ways
  Temporally distributed, or time-shared, on a common processor
  Spatially distributed, with each nucleus occupying one or more processors
  Pipelined: a single nucleus is distributed in time and space
    In a given time slot a nucleus is running on a group of processors
    On a given processor a group of nucleus computations runs over time
Source: Schaumont et al. 2001

77

Intel RMS View (Recognition, Mining,Synthesis)

78

36
Example: Baseband Processing for 4G

Canonical Receiver Model


1 sample/symbol

[Figure: canonical receiver model. RF & ADC feed the inner receiver, which consists of the signal detection path and the parameter estimation path. The outer receiver consists of the channel decoder followed by the source decoder. Inputs labelled "From Channel Decoder" and "From Source Decoder" feed back into the inner receiver.]

Use the estimated channel parameters in the detection path as if they were the true values

H. Meyr et al., “Digital Communication Receivers”, J. Wiley 1998

Lessons Learned from Design Reviews 2005

Virtual Prototype (Product) of utmost importance


Early customer interaction
Debugging
Verification&Validation
Product Differentiator
80% of Area and Power Consumption in the inner receiver
(Algorithm and Architecture Design)
10-15% of Area and Power Consumption in Decoder
(Architecture Design)
5% of Area and Power Consumption in the ARM (But
major portion of cost is SW/Protocol implementation)

81

Properties of the Task

The signal/information processing task can be naturally partitioned
  Decoders
  Filters
  Channel estimator

Use A-Priori Knowledge of the Task

The building blocks are loosely coupled

The signal processing task is (mostly) cyclostationary

From Function to Algorithm Classes
Butterfly unit
  Viterbi & MAP decoder
  MLSE equalizer
Eigenvalue decomposition (EVD)
  Delay acquisition (CDMA)
  MIMO Tx processing
Matrix-Matrix & Matrix-Vector Multiplication
  MIMO processing (Rx & Tx)
  LMMSE channel estimation (OFDM & MIMO)
Iterative (Turbo) Decoding
  Message Passing Algorithm, LDPC Decoding
CORDIC
  Frequency offset estimation (e.g. AFC)
  OFDM post-FFT synchronization (sampling clock, fine frequency)
FFT & IFFT (spectral processing)
  OFDM
  Speech post processing (noise suppression)
  Image processing (not FFT but DCT)

83

Decoder for Convolutional Codes

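$$
x_{k+1} =
\begin{bmatrix} x_{1,k+1} \\ x_{2,k+1} \end{bmatrix}
=
\begin{bmatrix} a_{11,k} & a_{12,k} \\ a_{21,k} & a_{22,k} \end{bmatrix}
\otimes
\begin{bmatrix} x_{1,k} \\ x_{2,k} \end{bmatrix}
=
\begin{bmatrix}
a_{11,k} \otimes x_{1,k} \oplus a_{12,k} \otimes x_{2,k} \\
a_{21,k} \otimes x_{1,k} \oplus a_{22,k} \otimes x_{2,k}
\end{bmatrix}
$$

OPERATIONS   MAP      LOGMAP               VITERBI
x ⊕ y        x + y    log_e[e^x + e^y]     max(x, y)
x ⊗ y        x · y    x + y                x + y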
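The table says that MAP, LOGMAP and Viterbi decoding share one recursion and differ only in the pair of operations (⊕, ⊗). A minimal Python sketch of that idea follows; the names `Semiring` and `step` and the example numbers are illustrative assumptions, not part of the slides.

```python
import math
from collections import namedtuple

# A pair of operations (add = ⊕, mul = ⊗) as listed in the table above
Semiring = namedtuple("Semiring", ["add", "mul"])

MAP = Semiring(add=lambda x, y: x + y, mul=lambda x, y: x * y)
LOGMAP = Semiring(
    # log(e^x + e^y), written in a numerically stable form
    add=lambda x, y: max(x, y) + math.log1p(math.exp(-abs(x - y))),
    mul=lambda x, y: x + y,
)
VITERBI = Semiring(add=max, mul=lambda x, y: x + y)

def step(A, x, sr):
    """One recursion x_{k+1} = A_k (x) x_k, evaluated over the semiring sr."""
    out = []
    for row in A:
        acc = sr.mul(row[0], x[0])
        for a, xi in zip(row[1:], x[1:]):
            acc = sr.add(acc, sr.mul(a, xi))
        out.append(acc)
    return out

# Hypothetical branch weights and state metrics (probability domain)
A = [[0.6, 0.1], [0.4, 0.9]]
x = [0.5, 0.5]
log_A = [[math.log(a) for a in row] for row in A]
log_x = [math.log(v) for v in x]

x_map = step(A, x, MAP)                   # sum-product in the probability domain
x_logmap = step(log_A, log_x, LOGMAP)     # same result in the log domain
x_viterbi = step(log_A, log_x, VITERBI)   # max-sum: keeps only the best path
print(x_map, x_logmap, x_viterbi)
```

Because only the two operations change, one data path (an add-compare-select style butterfly) can serve all three decoders, which is why they appear under the same algorithm class.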

84

39
Algorithmic Descriptors

Clock rate of processing elements (1/Tc)


Sampling rate of the signal (1/Ts)
Algorithm characteristic
Complexity (MOPS/sample)
Computational characteristic
Data flow
Data locality
Data storage
Parallelism
Control flow
Connectivity of algorithms
Spatial
Temporal

85

384 kbps UMTS Receiver BB Complexity

[Figure: 384 kbps UMTS receiver, digital BB complexity. Log-log plot of OPs per sample (10^0 to 10^5) versus sampling rate (10^2 to 10^8 1/s), with lines of constant complexity at 1 MOPS, 10 MOPS, 100 MOPS and 1000 MOPS. Blocks shown: MUSIC delay acquisition, turbo decoder, path searcher, maximum ratio combining, SIR estimation, AFC, correlators, RRC pulse MF, timing tracking, channel estimation, interpolation/decimation, AGC.]
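The lines of constant MOPS in this kind of plot follow from complexity = (operations per sample) x (sampling rate). A short Python sketch of that arithmetic; the block names and numbers are hypothetical and are not read off the figure.

```python
def mops(ops_per_sample, sampling_rate_hz):
    """Complexity in million operations per second."""
    return ops_per_sample * sampling_rate_hz / 1e6

def ops_per_cycle(total_mops, clock_hz):
    """Parallel operations needed per clock cycle at clock rate 1/Tc."""
    return total_mops * 1e6 / clock_hz

# Hypothetical blocks: (operations per sample, sampling rate 1/Ts in Hz)
blocks = {
    "pulse matched filter": (100, 15.36e6),   # assumed 4x oversampled 3.84 Mchip/s
    "channel estimation": (50, 15e3),
    "AGC": (10, 1.5e3),
}

total = 0.0
for name, (ops, rate_hz) in blocks.items():
    block_mops = mops(ops, rate_hz)
    total += block_mops
    print(f"{name}: {block_mops:.3f} MOPS")

print(f"total: {total:.1f} MOPS")
print(f"at a 300 MHz clock: {ops_per_cycle(total, 300e6):.2f} operations per cycle")
```

Blocks running at the sampling rate dominate even when their per-sample cost is modest, while control-like blocks with high per-sample cost but low rate contribute little.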

86

40
Hardware

Guiding Principle

Employ all forms of Parallelism

88

41
Potential Processor Parallelism

Three forms of instruction-set parallelism
  Instruction parallelism
  Data parallelism
  Pipeline parallelism

Multi-issue instructions (VLIW), instruction size L
  Number of operation slots per instruction
  Operation mix in each slot

SIMD instructions, M
  Types of vector operations
  Number of vector elements
  Number and size of vector register files

Fusion of operations, N
  Number and type of composing operations
  Number of inputs and outputs
  Latency

Maximum parallelism: L x M x N

Source: C. Rowen, Tensilica

89

Memory Architecture

DRAM prices have dramatically decreased
  From $ 10,000,000 for 1 Gigabyte in 1980
  To $ 100 in 2006
Memory wall is the major obstacle to good performance of many applications

Novel memory architectures are a key component of ASIPs

Programming Models and Design
Methodology

System Architecture Concept (HW & SW)


Source: Dr. H. Dawid, Infineon

[Figure: system architecture concept of a UMTS modem, partitioned into hardware and software. Blocks recoverable from the extraction:
RF frontend (down-/upconversion)
Inner receiver: AGC, AFC, timing tracking, delay profile estimation, cell search
Outer receiver: FEC, transport channel processing
Transmitter: Tx framing & modulator
Layer 1 SW: DL and UL power control, closed loop Tx diversity, physical layer scheduling & control, HSDPA HARQ ACK/NACK, HSUPA E-TFC selection, intra-frequency and inter-frequency measurements, physical layer reconfiguration, hard handover, soft handover, inter-RAT handover
Layer 2/3 stack (MAC, RLC, RRC)
Signal classes: data, L1 config/ctrl, system information/higher layer ctrl]

Programming Models

Goal: To maximize programmer productivity

Requirements
  Independent of the number of processors
  Allow concurrency to be described naturally
  Support a rich set of data types
  Support parallel models
    Data level parallelism
    Instruction level parallelism
    Independent task parallelism
Autotuners should take on a complementary role to compilers
Far more formal methods must be developed to guarantee correctness (e.g. avoid deadlocks when using threads)
Source: Seven Questions and Seven Dwarfs for Parallel Computing,
UC Berkeley Report, June 2006

93

Software Synthesis and Autotuners

Principle of Autotuners:

Optimize a set of library kernels by generating many variants of a given kernel
Benchmark each variant on a given platform

Source: Bilmes et al. 1997; Frigo and Johnson 1998; Whaley and Dongarra 1998; Im et al. 2005
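The two steps above (generate variants, benchmark on the target) can be sketched in a few lines of Python. This is a toy illustration only: the kernel (a blocked summation), the block sizes and the function names are assumptions, and a real autotuner searches a far larger space of generated code variants.

```python
import timeit

def make_blocked_sum(block):
    """Generate one variant of a summation kernel that processes `block` items per step."""
    def kernel(data):
        total = 0
        for i in range(0, len(data), block):
            total += sum(data[i:i + block])
        return total
    return kernel

def autotune(variants, data, repeats=3):
    """Benchmark every variant on this platform and return the fastest one."""
    timings = {}
    for name, kernel in variants.items():
        # best of `repeats` measurements, each running the kernel 5 times
        timings[name] = min(timeit.repeat(lambda: kernel(data), number=5, repeat=repeats))
    best = min(timings, key=timings.get)
    return best, timings

data = list(range(20000))
variants = {block: make_blocked_sum(block) for block in (1, 16, 256, 4096)}
best, timings = autotune(variants, data)
print("fastest block size on this machine:", best)
```

The selected variant depends on the machine it runs on, which is the point: the search replaces a hand-written, platform-specific optimization.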

94

44
Conclusion

We are presently at a juncture of the semiconductor industry as it seldom occurs
The existing (RTL) design paradigm has reached its end-of-life
  We need to move to a higher level of abstraction (ESL) to keep the cost within reasonable bounds
The existing multiprocessor architectures and the programming tools do not scale
  We need much innovation in these areas to make economic use of scaling

95

Application Specific
Processors Design

96

45
Processor Design Space

Instruction Set Design
• Exploit regularity/parallelism in data flow/data storage
• VLIW, SIMD, ?
• Which instructions for compiler support?
• Instruction encoding?
• How many general purpose registers?

Micro Architecture Design
[Figure: pipeline FE, DC, EX, WB with execution units butterfly 0, butterfly 1 and load/store]
• Pipeline length?
• Shared resources?
• Bypass?
• Parallel execution units?

RTL Design
• Area constraints met?
• Clock frequency?

SoC Integration
[Figure: core with MMU and cache, connected to memory and peripheral]
• Bus fast enough?
• Communication?


Processor Design Space

[Figure: the processor design space slide repeated, overlaid with the tasks it implies.]

- Instruction-Set Design
- Micro Architecture Design
- Compiler Design

- RTL Design
- RTL/ISS Co-verification

- System Integration
- Embedded Software
- Simulation

Optimal design requires powerful tools and automation!

Traditional Processor Design

Processor Design-phase Dependencies:

[Figure: design phases over time. Hardware track: Instruction Set Design, Micro Architecture Design, RTL Design, Verification, SoC Integration. Tool track: IA Simulator Development, CA Simulator Development, Assembler & Linker, Debugger Coupling, Compiler Design, Software Development. Annotation on the late phases: "Far too late!"]

Handwriting fast simulators is tedious, error-prone and difficult
Compiler cannot be considered in the architecture definition cycle
  Risk of compiler-unfriendly instruction set
Inconsistencies between tools and models
Traditional design methodology does not allow for efficient processor design
Verification, Software Development and SoC integration too late
  Real-world stimuli and SoC interaction might reveal bottlenecks

Design phases need to be parallelized!

Today: ADL based Processor design

OBJECTIVE

Improve Design and Implementation Efficiency

... at the same time

Architecture Description Language based Processor Design

The purpose of an architecture description language (e.g. LISA) is:
  To allow for an iterative design to efficiently explore architecture alternatives
  To jointly design "Architecture - Compiler" and on-chip communication
  To automatically generate hardware (path to implementation)
  To automatically generate tools
    Assembler, Linker, Compiler, Simulator, co-simulation interfaces
  From a single model at various levels of temporal and spatial abstraction

101

[Figure: LISATek flow. Describe/adopt: a custom processor model is written in LISA 2.0 with the LISATek Processor Designer, starting from an empty model or from LISATek IP samples (RISC sample, VLIW sample, DSP sample, FFT processor). Generate: the software tool chain, consisting of C compiler, assembler & linker, simulator, debugger & profiler. Application: the application runs on the generated tools.]

Function and instruction level profiling reveals hot-spots
special purpose instructions

LISATek (Multi Core) Analyzer

Source level analysis
Extendable Instruction Profiling
Symbolic C/C++ debugging
Memory/Cache Analysis
Extendable C Profiling
Pipeline Analysis (Stalls, Flushes...)

[Figure: the LISATek flow again (describe/adopt a processor model, generate the software tool chain, run the application), extended on the modeling side by instruction set synthesis, memory architecture and verification, and on the generation side by RTL, software platform and SoC integration.]

Rapid modeling and re-targetable simulation + code-generation allows for:
joint optimization of application and architecture

Function and instruction level profiling reveals hot-spots
special purpose instructions

Tool Structure Principles

Orthogonalize "Workbench" and Optimization Tools

R. Leupers et al., "Fine Grained Application Source Code Profiling for ASIP", DAC 2005

R. Leupers et al., "A Design Flow for Configurable Embedded Processors based on Optimized Instruction Set Extension Synthesis", DATE 2006

P. Ienne, R. Leupers (Editors), "Customizable Embedded Processors", Morgan Kaufmann (Elsevier), 2006

107

ASIP: Lofty Ambitions, Stark Realities

J. Fisher, "Customizing Processors: Lofty Ambitions, Stark Realities", Chapter 2 in: Customizable Embedded Processors, ed. by R. Leupers, P. Ienne, to be published by Morgan Kaufmann, July 2006

Mapping Application to Architecture
[Figure: evolution of SoC building blocks in three columns (Yesterday, Today, Tomorrow), arranged between programmable (flexibility, reuse) and fixed (performance optimization, specialization). Blocks shown across the columns: RISC CPU, DSP, extensible RISC CPU with application specific extensions, DMA controller, SIMD ASIP, VLIW ASIP, hardwired logic.]

Exploit Parallelism
1. Instruction level
2. Data level
3. Pipeline level

Processor instruction-set extensions with highly specialized data-path
  MIPS CorXtend
  ARM Optimode
  Tensilica XTensa
  ARC - ARC600

Application specific processors
  Programmable DMA
  Application specific SIMD engine for image processing
  iDCT VLIW processor

More and more fixed ASIC datapath moves into application specific processors


From C-to Complex Instructions

Mapping Algorithm to Architecture

Fixed and Re-Configurable ASIP

113

51
rASIP : A Huge and Complex Design Space

[Figure: four ways to couple a re-configurable data path to an ASIP core with register file and pipeline stages:
re-configurable data path: multiple stages
re-configurable data path: single stage
re-configurable data path in VLIW slots (Slot A, Slot B)
re-configurable data path in loosely coupled rASIP]

Design Space Exploration is the key

Pre-fabrication
  ASIP architecture
  FPGA architecture
  ASIP-FPGA Interface
  Static / Dynamic re-configurability
  ....

Post-fabrication
  Instruction-Set Extension
  Configuration code generation
  Scheduling and Code-Generation
  FPGA-targeted optimization
  ....

Case Studies

References:
Tilman Glöckler, H. Meyr, Design of Energy-Efficient Application-Specific Instruction Set Processors, Kluwer Academic Publishers, 2004
Oliver Wahlen, C Compiler Aided Design of Application Specific Instruction-Set Processors Using the Machine Description Language LISA, Ph.D. thesis submitted to Aachen University of Technology (RWTH), 2004

116

52
The ICORE Example

A low-power ASIP for Infineon DVB-T 2nd generation single-chip receiver:
  ASIP for DVB-T acquisition and tracking algorithms (sampling-clock synchronization, interpolation / decimation, carrier frequency offset estimation)
  Harvard architecture
  60 mostly RISC-like instructions & special instructions for the CORDIC algorithm
  8x32-bit general purpose registers, 4x9-bit address registers
  2048x20-bit instruction ROM, 512x32-bit data memory
  I2C registers and dedicated interfaces for external communication
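To illustrate why CORDIC instructions suit carrier frequency offset estimation, here is a generic vectoring-mode CORDIC in Python. It is a textbook sketch, not the ICORE instruction set: the iteration count, the floating-point arithmetic (hardware would use shifts and adds on fixed-point values) and all names are assumptions.

```python
import math

N_ITER = 16
# Elementary rotation angles atan(2^-i); in hardware these come from a small table
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]

def cordic_angle(x, y):
    """Vectoring-mode CORDIC: rotate (x, y) onto the positive x-axis and
    return the accumulated angle, i.e. the phase of x + j*y. Valid for x > 0."""
    z = 0.0
    for i in range(N_ITER):
        d = -1.0 if y > 0 else 1.0            # rotate towards y = 0
        # micro-rotation; multiplying by 2^-i is a right shift in fixed point
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return z

# Usage: phase increment between two consecutive samples (hypothetical 0.2 rad step),
# computed as the angle of curr * conj(prev)
prev = (1.0, 0.0)
curr = (math.cos(0.2), math.sin(0.2))
re = curr[0] * prev[0] + curr[1] * prev[1]
im = curr[1] * prev[0] - curr[0] * prev[1]
phase_step = cordic_angle(re, im)
print(f"estimated phase step: {phase_step:.4f} rad")
```

Dividing the phase step by 2π times the sample spacing gives a frequency offset estimate; the loop body needs only additions, shifts and a table lookup, which maps well onto a few special instructions.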

121

Computational Efficiency vs. Flexibility

Source: T.Noll, RWTH Aachen

129

53
The Retinex Project

Application: Retinex-like Algorithms

[Figure: block diagram of a Retinex-like algorithm; legible block labels: /, β, *, F, Γ, LinSt.]

Knowledge: Application Knowledge, VLSI and Basic Processor Design Knowledge

Outline: From Specification to FPGA Prototyping

Duration: 7.5 weeks

A cooperation between Pisa University and RWTH Aachen University

130

Retinex Architecture Reference

Paper presentation at DATE 2006

ASIP DESIGN AND SYNTHESIS FOR NON LINEAR FILTERING IN IMAGE PROCESSING

L. Fanucci, M. Cassiano and S. Saponara,
DIIEIT-Pisa University, Italy
D. Kammler, E. M. Witte, O. Schleibusch, G. Ascheid, R. Leupers and H. Meyr,
RWTH Aachen University, Germany

The Retinex ASIP
[Figure: Retinex ASIP block diagram. Memories: program memory, X-memory, Y-memory, ROM. Pipeline stages: FE, DC, LD, CMP, ROM, ARITH, WB.]


The Retinex ASIP


[Figure: Retinex ASIP block diagram (program memory, X-memory, Y-memory, ROM; pipeline stages FE, DC, LD, CMP, ROM, ARITH, WB) with annotations.]

Address Generation Units to optimally implement the address calculation scheme
Special Instructions to implement non-linear transformations
Zero Overhead Loops to accelerate loop control

Performance Comparison

System                               Athlon XP 3000+                      Retinex ASIP mapped on FPGA
Design Flow                          plain C application, compiled with   optimized ASIP and handwritten
                                     gcc, executed on AMD Athlon          assembly program (~100 lines of code)
Frequency                            2100 MHz                             16 MHz
Computation time (picture 513x385)   ~ 3000 ms                            593 ms (~ 20 % of Athlon run-time)
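The table compares run times, but the clock frequencies differ by a factor of about 130, so the gap in clock cycles is much larger. A short Python calculation using only the numbers in the table (the Athlon time is the approximate 3000 ms given there):

```python
PIXELS = 513 * 385  # picture size from the table

systems = {
    "Athlon XP 3000+": {"freq_hz": 2100e6, "time_s": 3.000},   # ~3000 ms
    "Retinex ASIP on FPGA": {"freq_hz": 16e6, "time_s": 0.593},
}

for name, s in systems.items():
    s["cycles"] = s["freq_hz"] * s["time_s"]
    s["cycles_per_pixel"] = s["cycles"] / PIXELS
    print(f"{name}: {s['cycles']:.3g} cycles, {s['cycles_per_pixel']:.0f} cycles/pixel")

athlon = systems["Athlon XP 3000+"]
asip = systems["Retinex ASIP on FPGA"]
runtime_ratio = asip["time_s"] / athlon["time_s"]
cycle_ratio = athlon["cycles"] / asip["cycles"]
print(f"ASIP run time is {runtime_ratio:.1%} of the Athlon run time")
print(f"the Athlon needs about {cycle_ratio:.0f} times more clock cycles")
```

Measured in cycles per pixel, the application specific data path and address generation do in tens of cycles what the compiled C code does in tens of thousands.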


Retargetable Compiler

Infineon PP32 Network Processor

[Figure: bar chart, values in % (axis 0 to 200); legend: lcc, CoSy cycle count, CoSy code size; benchmarks: frag, tos, hwacc, route, reed, md5, crc]

ST200 VLIW Multimedia Processor

[Figure: bar chart, values in % (axis 0 to 350); legend: ST Multiflow, CoSy cycle count, CoSy code size; benchmarks: fir, dct, adpcm, fht, viterbi, gsm, sieve]

Low Cost Commercial ASIP
Increasing SW Content, but How?

Project Goals

Initial goal:
+ Custom processor design to save royalties
    LISA processor design
+ Development of an ASIP with superior architectural efficiency
    General purpose register file
+ Support a smooth legacy code migration
    Perl translation script
+ An architecture which is smaller than the existing architecture !!!
    LISA

Development Time Sheet

Phase I
Initial Model                   4 weeks
Design Space Analysis           3 weeks
Design Space Exploration        4 weeks
- Address Calculation           1 week
- Non-delayed Branches          1 week
- Timing Improvement            ½ week
- Others                        1½ weeks

Phase II
Translation Script              5 weeks
Move Elimination                2 weeks
Verification Script             5 weeks
Synthesis & FPGA Mapping        1 day

FPGA System (one-time effort)   10 weeks

Moving through the Design Space

1 First synthesis of verified RTL code, no port constraints, no optimizations
2 Memory port constraints, autom. optimization: path sharing
3 Grouping in functional units for more detailed analysis
4 Change in address calculation enabling resource sharing
5 Critical path analysis, modification of fetch mechanism, optimization: decision minimization
6 Pin for FPGA prototype added
7 Implementation of non-delayed branches, prog-mem size reduction
8 Changed multiplier implementation from 32bit to 17bit
9 Removed functional unit grouping from step 3
10 Final Synthesis: timing constraint adapted to synthesis results

[Figure: design-space plot with points labeled 1, 2, 3, 4, 5-7, 8-9 and 10]

Multimedia Processor

Processor Designer in a video deblocking unit

Multi standard video decoder IP

[Figure: block diagram with Core decoder and DBLK blocks and External memory; signals: Coded bitstream, Reference frames, Deblocked frames]
Semiconductors 144

Why Processor Designer?

• Until now: an RTL block for each standard. => Make a generic block for all (changing!) video standards.
• A programmable architecture brings flexibility (C compilation).
• 288 conditional filters (4 and 8 taps) to be done in 600 cycles.
• High throughput needed: custom operations and a special memory addressing scheme are required.
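
The throughput requirement above can be turned into a per-cycle budget. The sketch below uses the two figures from the slide; the assumption of one arithmetic operation per filter tap is made here only for illustration and ignores condition evaluation and memory accesses.

```python
filters = 288          # conditional filters to process (from the slide)
cycle_budget = 600     # cycles available (from the slide)

cycles_per_filter = cycle_budget / filters       # about 2.08

# Assumption for illustration: one arithmetic operation per filter tap.
min_ops_per_cycle = 4 * filters / cycle_budget   # all filters 4-tap: 1.92
max_ops_per_cycle = 8 * filters / cycle_budget   # all filters 8-tap: 3.84

print(f"{cycles_per_filter:.2f} cycles per filter")
print(f"{min_ops_per_cycle:.2f} to {max_ops_per_cycle:.2f} tap operations per cycle")
```

Under this assumption, a core that issues one simple operation per cycle cannot meet the budget, which is consistent with the call for custom operations and a special memory addressing scheme.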

DBLK architecture

[Figure: block diagram with DMA IN, Pixels memory and DMA OUT; Processor with Data ram and Prog. rom; bus width label: 2x88 bits]

Step 1: function call

Application development:
• Quickly get a C model for the system
• Debug the application in a SystemC environment

[Figure: DBLK block diagram with DMA IN, Pixels memory, DMA OUT, Data ram and Prog. rom; the processor is represented by a deblock() function call]

Step 2: integration of lt_risc_32p5
• The provided RISC model is used
• Compilation of the application on the LISA processor
• Memory latencies are modelled in the pipeline

[Figure: DBLK block diagram with DMA IN, Pixels memory, DMA OUT, Data ram and Prog. rom; Processor (SystemC/RTL)]

Step 3: RTL generation and performance improvement

• RTL generation
• C optimization
• Asm. optimization
• Use of specialized asm. instructions
• Removal of unnecessary asm. instructions
• Improve model for RTL generation (clock speed, area)

Results
• Architecture far from the initial RISC
• Target of 166 MHz easily reached
• Size comparable to an all-RTL design (processor = 50 kgates)
• Performance target reached
• IP taped out in a Set Top Box chip

Next steps
• No problem met yet on prototype
• Make the block more generic to handle other standards

Planning

[Figure: timeline across Step 1, Step 2 and Step 3 with five phases lasting 8 weeks, 2 weeks, 4 weeks, 2 weeks and 5 weeks; phase labels, in no particular order: Application development, Lt_risc_32p5 integration, Use of pixel memories, Pin interfaces, Optimisation]

Conclusion

- Con
• Long learning
• First use -> rough estimate of time needed

+ Pro
• RTL and SystemC always consistent (=> most of the validation can be run on SC)
• Faster than writing independent SC and RTL models
• Fast exploration of architecture choices
• Use of firmware:
  – can be generic
  – C debug
  – If program ram: fixes and feature changes can be downloaded
• No royalties

Thank You
