Signal Processing For Wireless Communications and Multimedia: Design, Tools, Architectures

Advanced Digital System Design Course 2006, EPF-L
Prof. Heinrich Meyr
RWTH Aachen University, Germany, and
Chief Scientific Officer, CoWare Inc.
Agenda
Future Wireless Communication Systems and their Impact on ESL
The End of Moore's Law
Receiver Structure, Models and Performance Metrics
Massive Parallel Processing on heterogeneous MPSoC
Application Specific Processors
Summary and Conclusions
Future Wireless
Communication Systems
Internet Access Today
Fixed
DSL (→3 Mb/s)
Intranet (100Mb/s)
Wireless
WLAN (10-54 Mb/s)
Mobile
UMTS (2 Mb/s)
Mobile Internet Access
The Vision
UMTS Standard:
Ultra High-Speed 2 Mb/s Mobile Information and Communication everywhere at low cost
Reality today:
UMTS 0,1-0,3 Mb/s
GSM/GPRS 0,02 Mb/s
€€€
In optimally located places
For a few users
4G and Beyond
New concepts
Ultra high speed transmission
Mobile multimedia processing
Wearable and environmental information processing
Smart systems
Flexible, cognitive radio access
Multi-Processor Systems on Chip (MPSoC)
Digitized radio front end
Mobile Applications and Services
Future mobile wireless internet services:

Information (web browsing, …)
Communication (VoIP, video, P2P, …)
Entertainment (distributed gaming, …)
Challenging mobile application classes

Wearable and environmental information processing:
work, sport, health care
e.g. location aware services, seamless mobile working
Mobile multimedia processing
e.g. entertainment, information access, navigation,…
Future Wireless Systems: In a nutshell
Will be
cognitive
multifunctional
software definable
Will have
multiple Antennas
They will make use of ultra-complex signal processing to optimally use the available bandwidth
And process these algorithms on heterogeneous configurable computing engines
Future Wireless Communication Systems
and their Impact on ESL
Impact of NGMN on Design Process: I
To meet the schedule of NGMN it is imperative to have a concurrent and iterative development and validation process to design:
Standard
Development and validation of algorithm and HW/SW of the digital receiver
Application (SW) development
New approaches are needed!
Impact of NGMN on Design Process: II
Development and integration issues need to be uncovered as early as possible
Companies cannot wait for hardware to be available to start software development
Development costs need to be reduced and schedules accelerated
New approaches are needed!
Virtual Platform Based Development
[Figure: two development timelines. Hardware development: specification, simulator initial availability, HW initial availability, simulator/HW refinement, silicon. Software development (OS, device software, connectivity stack, UI, application): develop, unit test, integrate, test on hardware, integration, validation, system test. With incremental virtual platform development, software is developed incrementally on the virtual platform; the labels note debugging on the virtual platform, reduced bring-up and reduced system test.]
The End of Moore's Law:
"Design Competence rules the World"
Cross-disciplinary Task Management
Analysis
The task comprises many subtasks in various disciplines
“The whole is more than the sum of the parts”
Conclusion
The solution requires the interaction of people in the various disciplines
The Paradigm Shift: Innovation Overtakes Scaling
Innovation now dominates performance gains between generations

This means that “scheduled invention” is now the majority component in
all technology gains
IBM Transistor Performance Improvement
[Figure: bar chart of relative % improvement (0 to 100) per CMOS generation (CMOS5X, CMOS6X, CMOS7S-SOI, CMOS8S2, CMOS9S, CMOS10S, CMOS11S; node labels from 550 nm down to 65 nm), split into gain by traditional scaling and gain by innovation.]
Source: Lisa Su / IBM, MPSoC Conference 2005
The Paradigm Shift: Integrated Design Approach
Future improvement in systems performance will require an integrated design approach
Microprocessor frequency will no longer be the dominant driver of system level performance
Integration over the entire stack, from semiconductor technology to end-user applications, will replace scaling as the major driver of increased system performance
Systems will be designed with the ability to dynamically manage and optimize power
Scale-out and small SMPs will continue to outpace scale-up growth
Systems will increasingly rely on modular components for continued performance leadership
[Figure: design stack]
Application: Languages, Software Tuning, Efficient Programming, Middleware
System Level: Dynamic Optimization, Assist Threads, Morphing Support, Fast Computation Migration, Power Optimization, Compiler Support
Chip Level: Morphing, Multiple Cores, SMT, Accelerators, Power Shifting, Interconnect, Circuits
Technology: Silicon Innovation, Packaging, Efficient Cooling, Dense SRAM, embedded DRAM
Source: Lisa Su / IBM, MPSoC Conference 2005
Core Proposition
ASIP based Platforms (heterogeneous MPSoC)
[Rotated slide annotations, only partly legible in the extraction; the legible fragments mention physics and technology, shrinking geometries, soft errors and power consumption.]
The Human Element
Building and managing an interdisciplinary engineering team of
1. Algorithm Designers
2. Computer/Compiler Architects
3. System Integrators
4. RTL Designers
Most critical problem: Design Competence
[Rotated annotation, only partly legible in the extraction; it appears to read "No psychobabble: it is the most critical element".]
Cross-disciplinary Task
Algorithm
Architecture
Tools
Food Chain and Alliances
Service Provider
Equipment Manufacturers (SIEMENS)
Semiconductor House
Enabling Technology Providers
Alliances and the Business Equation
Managing alliances is a key to success
EDA
Mobile provider
Semiconductor company
Receiver Structure, Models and Performance Metrics
Design - Space I: Physical Layer
Complexity
Bandwidth
Power
System Design
System design = algorithm design + implementation
[Figure: algorithm design space and implementation design space]
JOINTLY optimizing algorithm and architecture
Center-of-Gravity Approach
Algorithm
Architecture
Tools
Methodology
Design - Methodology: I
Mathematical Theory and Experiment are complementary
Design - Methodology: II
Mathematical Theory provides Bounds
1. Estimation and Detection Theory used to systematically derive (optimum) Receiver Structures (Synthesis)
2. Mathematical Analysis used to compute Performance Bounds (Analysis)
Design - Methodology: III
Computer Simulation is used to
1. Obtain numerical Performance Data
Detection Loss
Implementation Loss
2. Validate a Design (Conformance to Standards)
3. Verify Correctness of Implementation (Verification) against Test Patterns
Models
Communication Model
Signal Model
Received Bandlimited Signals
Approximation by BL Signals
Approximation of the non-bandlimited signal x(t) by a BL signal:
$$E\{|x(t) - x_{BL}(t)|^2\} = \int_{|\omega| \ge B_L} S_x(\omega)\, d\omega$$
Truncation defines a (2K+1)-dimensional approximation in vector space:
$$\sum_{k=-K}^{K} x(kT_s)\, \varphi(k)$$
Equivalence of digital/analog Signal Processing
Properties
Canonical Receiver Model
[Figure: block diagram. RF & ADC feeds the inner receiver, which consists of a signal detection path and a parameter estimation path. The inner receiver delivers 1 sample/symbol to the outer receiver, which consists of the channel decoder and the source decoder. Feedback inputs are labelled "From Channel Decoder" and "From Source Decoder".]
Use the estimated channel parameters in the detection path as if they were the true values
H. Meyr et al., "Digital Communication Receivers", J. Wiley, 1998
Receiver Task
Inner Receiver
To provide a “good” channel to the decoder based on the
principle of synchronized Detection.
NOTHING ELSE !
Outer Receiver
To decode the information
Performance Measure
Inner Receiver
Properties of the estimator
Variance
Unbiased
Outer Receiver
Bit-error-rate of the coded system
Performance Loss
Detection Loss of synchronized Detection
ΔSNR (dB): additional SNR required to achieve the performance obtained with perfect channel knowledge (infinite precision arithmetic assumed)
Implementation Loss
ΔSNR (dB) resulting from finite precision arithmetic and algorithmic approximations
BER Performance
Source: Digital Communication Receivers, H. Meyr, M. Moeneclaey, S.A. Fechtel
Complexity DVB-S
Source: Digital Communication Receivers, H. Meyr, M. Moeneclaey, S.A. Fechtel
DVB-S Chip
Siemens-RWTH Aachen (ISS) Design 1997
0.5 µm technology
3 metal layers
1.5 W @ 88 MHz
> 500 k transistors
First silicon success
DVB-T Specifications
Digital terrestrial video broadcasting:
high symbol rates: up to 7.4 Msym/s
sensitive modulation: 4 - 64 QAM
net bit rate up to 31.67 Mb/s
wide range of channels: (AWGN) 0 < Tau < 224 µs (SFN)
error correction:
outer code: Reed-Solomon (204,188)
inner code: punctured convolutional
BER < 10e-9 (after RS)
3 dB < Es/No < 40 dB
Challenges: > 200 transmission modes
algorithms
design methodology
System Performance: DVB-T
DVB-T Chip: First single Chip Solution
Joint Infineon-Nokia-ISS Design 1999
AGC: Automatic Gain Control
IQ: IQ-Mixer and Resampling
PPU: Postprocessing Unit
FFT: Fast Fourier Transform (2k, 8k)
DTO: Digital Timing Oscillator
RAM: OFDM Symbol Memory
CHE: Channel Estimation
IFFT: Inverse FFT and Fine Timing
ESG: Equalization and Softbit Generation
FEC: Forward Error Correction (Viterbi, Reed-Solomon)
DVB-T Complexity
Analog part: 10 %
Input interfaces
DC removal
anti-aliasing filter
ADC, AGC
Digital demodulator: 60 %
Channel estimation and equalization
synchronization
control flow implementation
FFT (alone 30 %)
Channel decoder: 20 %
Viterbi and RS decoder
Miscellaneous: 10 %
IIC bus controller, DAC
Design Space : Architecture and Algorithm
Inner Receiver
The algorithms of the inner receiver are never specified by the standard
BOTH algorithm and architecture space exploration
Outer Receiver
The decoder is exactly specified in the standard
ONLY architecture space exploration
Massive Parallel Processing on
Heterogeneous MPSoC
Parallel Computing in Mobiles
Massive Parallelism required in the foreseeable future
                        2003    2009    2013
Frequency (MHz)          300     600    1500
Giga Operations          0,3      14    2458
Operations per Cycle       1      23    1638
Source: International Technology Roadmap for Semiconductors (ITRS, TX 2003)
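The "Operations per Cycle" row follows directly from the other two rows: it is the required operation rate divided by the clock frequency. A minimal sketch of that arithmetic, assuming "Giga Operations" means giga-operations per second (the function and variable names are illustrative, not from the slides):

```python
# Roadmap figures quoted on the slide: year -> (clock in MHz, giga-operations per second).
ROADMAP = {2003: (300, 0.3), 2009: (600, 14), 2013: (1500, 2458)}


def ops_per_cycle(freq_mhz, giga_ops):
    """Operations that have to complete in every clock cycle."""
    return (giga_ops * 1e9) / (freq_mhz * 1e6)


for year, (freq_mhz, giga_ops) in ROADMAP.items():
    # The slide lists the truncated values of this quotient.
    print(year, int(ops_per_cycle(freq_mhz, giga_ops)))
```

The clock frequency grows by a factor of 5 over the decade while the required operation rate grows by several thousand, which is why the gap has to be closed by parallelism.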
Why Many-Processor Architectures today?
Not because of a fundamental breakthrough in novel software and parallel architecture
... simply because the problems with traditional architectures pose an even greater challenge
Guiding Principles for Manycore SoC: I
Energy efficiency and power are the dominating issues
There exists a fundamental trade-off between energy efficiency and flexibility
Below 65 nm high soft and hard error rates occur
Bandwidth improves by at least the square of the improvement in latency
Memory wall: loads and stores are slow (up to 200 cycles to access DRAM)
Guiding Principles for Manycore SoC: II
Multiplies are fast
Instruction Level Parallelism (ILP) wall: diminishing returns on finding new ILP
Brick wall: Power Wall + Memory Wall + ILP Wall
Increasing parallelism and decreasing clock frequency is the primary source of improving processor performance
GP Processor Performance Improvement between 1978 and 2006
Source: Seven Questions and Seven Dwarfs for Parallel Computing, UC Berkeley Report, June 2006
Parallel Computing
“Switching from sequential to modestly parallel computing will make programming much more difficult ... without a dramatic improvement in performance”
Source: Seven Questions and Seven Dwarfs for Parallel Computing, UC Berkeley Report, June 2006
We need to go from multiple processors to many cores
The Need for New Architectures
[Figure: curves over time for algorithmic complexity (Shannon's Law) of the wireless generations 1G, 2G, 3G and 4G, for memory and microprocessor/DSP performance (Moore's Law), and for battery power.]
Source: R. Subramanian, Berkeley Design Automation Inc
Computational Efficiency vs. Flexibility
[Figure: design points plotted between flexibility and efficiency; the marked design point is an ASIP (ICORE, DVB-T Sync&Track).]
How to Exploit the Design Space and Design MPSoCs?
Design Principles
Focus ... first on applications and constituent algorithms, not the silicon architecture
Identify key attributes of the application
Identify periodicity of signal processing tasks (cyclostationarity)
Block processing
Identify loose coupling of tasks
Use ... extensive profiling to find the spatial and temporal mapping with the following goals
Minimize the processor flexibility to a constrained set to optimize the energy efficiency
Maximize the software parameterizability and ease of use of the programmer's model for flexibility
MPSoC design flow: Temporal and Spatial Mapping
[Figure: the tasks of an application (Task 1 to Task 5) are mapped onto an MPSoC specification consisting of HW blocks, processors and memories connected by a Network-on-Chip. The same structure appears as MPSoC virtual prototype and as MPSoC HW prototype.]
MPSoC exploration principles
Interconnect
Structure
Divide and conquer

Separate processing elements from communication
Early SW performance estimation
MPSoC virtual prototyping platform
[Figure: Tasks 1 to 4 run on VPUs (processor simulators) that communicate through an NoC simulator; the interconnect structure is modeled by a P2P model, a bus model or a router model.]
Communication: CoWare Architect's View Framework (AVF)
VPU: virtual processing unit
Enables modeling spatial and temporal task-to-PE mapping
MPSoC exploration Results
An MPSoC is defined by its processing elements (PE) and their interconnect (NoC)
Interconnect is defined by its topology.
Communication performance is measured for a given topology
PE performance is determined by a set of numbers
Message Sequence Chart (MSC) Trace
[Figure: message sequence chart]
Aggregated Communication Graph
[Figure: message sequence chart, interacting partner view, topology view]
Histogram Views
[Figure: message sequence chart, histogram, interacting partner view, topology view]
MPSoC Exploration Results: Communication
Source: Seven Questions and Seven Dwarfs for Parallel Computing
The "Key Algorithm" Proposition
Each application is composed of a small number of fundamental algorithms ("Nuclei") that represent a significant amount of the computation.
Focus on an efficient composition ("design of an MPSoC") or mapping ("programming of the MPSoC")
Composition of Nuclei
Nuclei can be composed/mapped on a multiprocessor in three different ways
Temporally distributed or time-shared on a common processor
Spatially distributed with each nucleus occupying one or more processors
Pipelined: a single nucleus is distributed in time and space
In a given time slot a nucleus is running on a group of processors
On a given processor a group of nucleus computations runs over time
Source: Schaumont et al., 2001
Intel RMS View (Recognition, Mining, Synthesis)
Example: Baseband Processing for 4G
Canonical Receiver Model
[Figure: canonical receiver block diagram with RF & ADC, inner receiver (signal detection path, parameter estimation path), 1 sample/symbol interface and outer receiver (channel decoder, source decoder).]
Use the estimated channel parameters in the detection path as if they were the true values
H. Meyr et al., "Digital Communication Receivers", J. Wiley, 1998
Lessons Learned from Design Reviews 2005
Virtual Prototype (Product) of utmost importance

Early customer interaction
Debugging
Verification&Validation
Product Differentiator
80% of Area and Power Consumption in the inner receiver
(Algorithm and Architecture Design)
10-15% of Area and Power Consumption in Decoder
(Architecture Design)
5% of Area and Power Consumption in the ARM (But
major portion of cost is SW/Protocol implementation)
Properties of the Task
Use A-Priori Knowledge of the Task
The signal/information processing task can be naturally partitioned
Decoders
Filters
Channel estimator
The building blocks are loosely coupled
The signal processing task is (mostly) cyclostationary
From Function to Algorithm Classes
Basic Blocks: Algorithm Types
Butterfly unit
  Viterbi & MAP decoder
  MLSE equalizer
Eigenvalue decomposition (EVD)
  Delay acquisition (CDMA)
  MIMO Tx processing
Matrix-Matrix & Matrix-Vector Multiplication
  MIMO processing (Rx & Tx)
  LMMSE channel estimation (OFDM & MIMO)
Iterative (Turbo) Decoding
  Message Passing Algorithm, LDPC Decoding
CORDIC
  Frequency offset estimation (e.g. AFC)
  OFDM post-FFT synchronization (sampling clock, fine frequency)
FFT & IFFT (spectral processing)
  OFDM
  Speech post processing (noise suppression)
  Image processing (not FFT but DCT)
Decoder for Convolutional Codes
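$$x_{k+1} = \begin{bmatrix} x_{1,k+1} \\ x_{2,k+1} \end{bmatrix} = \begin{bmatrix} a_{11,k} & a_{12,k} \\ a_{21,k} & a_{22,k} \end{bmatrix} \otimes \begin{bmatrix} x_{1,k} \\ x_{2,k} \end{bmatrix} = \begin{bmatrix} a_{11,k} \otimes x_{1,k} \oplus a_{12,k} \otimes x_{2,k} \\ a_{21,k} \otimes x_{1,k} \oplus a_{22,k} \otimes x_{2,k} \end{bmatrix}$$
OPERATIONS   MAP      LOGMAP              VITERBI
x ⊕ y        x + y    log_e[e^x + e^y]    max(x, y)
x ⊗ y        x · y    x + y               x + y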
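The point of the table is that one data flow (a matrix-vector product over a generalized ⊕/⊗ pair) covers all three decoders. A minimal sketch of a single trellis step under that reading; the names `SEMIRINGS`, `log_add` and `trellis_step` are illustrative assumptions, not taken from the slides:

```python
import math


def log_add(x, y):
    """Numerically stable log_e(e^x + e^y), the LOGMAP 'addition'."""
    m = max(x, y)
    if m == float("-inf"):
        return m
    return m + math.log(math.exp(x - m) + math.exp(y - m))


# decoder name -> (oplus, otimes), exactly the rows of the table above
SEMIRINGS = {
    "MAP": (lambda x, y: x + y, lambda x, y: x * y),
    "LOGMAP": (log_add, lambda x, y: x + y),
    "VITERBI": (max, lambda x, y: x + y),
}


def trellis_step(a, x, decoder):
    """Compute x_{k+1} = A_k (x) x_k with the operations of the chosen decoder."""
    oplus, otimes = SEMIRINGS[decoder]
    result = []
    for row in a:
        acc = otimes(row[0], x[0])
        for a_ij, x_j in zip(row[1:], x[1:]):
            acc = oplus(acc, otimes(a_ij, x_j))
        result.append(acc)
    return result


# Same 2-state step with probabilities (MAP) and with additive metrics (VITERBI).
print(trellis_step([[0.5, 0.25], [0.125, 0.5]], [0.5, 0.5], "MAP"))
print(trellis_step([[1, 2], [3, 4]], [10, 20], "VITERBI"))
```

Only the two scalar operations change between decoders, which is what makes the butterfly a candidate for a shared, parameterizable processing element.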
Algorithmic Descriptors
Clock rate of processing elements (1/Tc)
Sampling rate of the signal (1/Ts)
Algorithm characteristic
Complexity (MOPS/sample)
Computational characteristic
Data flow
Data locality
Data storage
Parallelism
Control flow
Connectivity of algorithms
Spatial
Temporal
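The first descriptors combine into a simple budget: the processing load is operations per sample times the sampling rate, and Ts/Tc is the number of clock cycles a processing element has per sample. A minimal sketch of that arithmetic; the function names and the example numbers (100 operations per sample, a 100 MHz clock) are hypothetical, only the UMTS chip rate of 3.84 Mchip/s is a standard figure:

```python
import math


def mops(ops_per_sample, sampling_rate_hz):
    """Processing load in million operations per second."""
    return ops_per_sample * sampling_rate_hz / 1e6


def cycles_per_sample(clock_hz, sampling_rate_hz):
    """Clock cycles available per input sample, i.e. Ts / Tc."""
    return clock_hz / sampling_rate_hz


def min_ops_per_cycle(ops_per_sample, clock_hz, sampling_rate_hz):
    """Operations that must execute in parallel per cycle to keep up."""
    return math.ceil(ops_per_sample / cycles_per_sample(clock_hz, sampling_rate_hz))


# Hypothetical block: 100 operations per sample at the UMTS chip rate, 100 MHz clock.
print(mops(100, 3.84e6))                      # load in MOPS
print(cycles_per_sample(100e6, 3.84e6))       # cycle budget per sample
print(min_ops_per_cycle(100, 100e6, 3.84e6))  # required parallelism
```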
384 kbps UMTS Receiver BB Complexity
[Figure: 384 kbps UMTS receiver, digital BB complexity. Log-log plot of OPs per sample (10^0 to 10^5) versus sampling rate [1/s] (10^2 to 10^8) with lines of constant load at 1 MOPS, 10 MOPS, 100 MOPS and 1000 MOPS. Blocks shown: MUSIC delay acquisition, turbo decoder, path searcher, maximum ratio combining, SIR estimation, AFC, correlators, RRC pulse MF, timing tracking, channel estimation, interpolation/decimation, AGC.]
Hardware
Guiding Principle
Employ all forms of Parallelism
Potential Processor Parallelism
Three forms of instruction-set parallelism
Instruction parallelism
Data parallelism
Pipeline parallelism
Multi-issue instructions (VLIW): instruction size L
Number of operation slots per instruction
Operation mix in each slot
SIMD instructions: M
Types of vector operations
Number of vector elements
Number and size of vector register files
Fusion of operations: N
Number and type of composing operations
Number of inputs and outputs
Latency
Maximum parallelism: L x M x N
Source: C. Rowen, Tensilica
Memory Architecture
DRAM prices have dramatically decreased
From $ 10,000,000 for 1 Gigabyte in 1980
To $ 100 in 2006
Memory wall is the major obstacle to good performance of many applications
Novel memory architectures are a key component of ASIPs
Programming Models and Design Methodology
System Architecture Concept (HW & SW)
Source: Dr. H. Dawid, Infineon
[Figure: HW/SW system architecture of a baseband modem. Recoverable block labels: RF Frontend (Down-/Upconversion); Tx Framing & Modulator; Inner Receiver; Outer Receiver; FEC; Transport Channel; Cell Search; AGC; AFC; Timing Tracking; Delay Profile Estimation; Layer 1 SW; Closed Loop Power Control; DL and UL Power Control; Tx Diversity; HSDPA; HSUPA; Scheduling; HARQ ACK/NACK; E-TFC Selection; Intra-Frequency and Inter-Frequency Measurements; Soft and Hard Handover; Physical Layer Reconfig; Inter-RAT Handover; Layer 2/3 Stack (MAC, RLC, RRC). Interfaces: Data, L1 Config/Ctrl, System Information/Higher Layer Ctrl.]
Programming Models
Goal: To maximize programmer productivity
Requirements
Independent of number of processors
Allow concurrency to be described naturally
Support rich set of data types
Support parallel models
Data level parallelism
Instruction level parallelism
Independent task parallelism
Autotuners should take on a complementary role to compilers
Far more formal methods must be developed to guarantee correctness (e.g. avoid deadlocks when using threads)
Source: Seven Questions and Seven Dwarfs for Parallel Computing
Software Synthesis and Autotuners
Principle of Autotuners:
Optimize a set of library kernels by generating many variants of a given kernel
Benchmark each variant on a given platform
Source: Bilmes et al. 1997; Frigo and Johnson 1998; Whaley and Dongarra 1998; Im et al. 2005
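The two steps (generate variants, benchmark them on the target) can be illustrated with a toy kernel. This is only a sketch of the principle: the kernel, the block-size parameter and the names `make_blocked_sum` and `autotune` are assumptions for illustration and do not reproduce any of the cited autotuners.

```python
import timeit


def make_blocked_sum(block):
    """Generate one variant of a summation kernel, parameterized by block size."""
    def kernel(data):
        total = 0
        for start in range(0, len(data), block):
            total += sum(data[start:start + block])
        return total
    return kernel


def autotune(variants, data, repeats=3):
    """Benchmark every variant on this machine and return the fastest one."""
    timings = {}
    for name, kernel in variants.items():
        runs = timeit.repeat(lambda: kernel(data), number=5, repeat=repeats)
        timings[name] = min(runs)
    best = min(timings, key=timings.get)
    return best, timings


data = list(range(10_000))
variants = {f"block={b}": make_blocked_sum(b) for b in (8, 64, 512, 4096)}
best, timings = autotune(variants, data)
print("fastest variant on this platform:", best)
```

Which variant wins depends on the machine the script runs on; that platform dependence is exactly what the benchmarking step is meant to capture.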
Conclusion
We are presently at a juncture of the semiconductor industry as it seldom occurs
The existing (RTL) design paradigm has reached its end-of-life: we need to move to a higher level of abstraction (ESL) to keep the cost within reasonable bounds
The existing processor and multiprocessor architectures and the programming tools do not scale: we need much innovation in these areas to make economic use of scaling
Application Specific
Processors Design
Processor Design Space
Instruction Set Design
• Exploit regularity/parallelism in data flow/data storage
• VLIW, SIMD, ?
• Which instructions for compiler support?
• Instruction encoding?
• How many general purpose registers?
Micro Architecture Design
[Figure: pipeline FE, DC, EX, WB with units butterfly 0, butterfly 1, load/store]
• Pipeline length?
• Shared resources?
• Bypass?
• Parallel execution units?
RTL Design
• Area constraints met?
• Clock frequency?
SoC Integration
[Figure: Core, MMU, Cache, Memory, Peripheral]
bus fast enough?
communication?
Processor Design Space
[Same design-space diagram as on the previous slide, with overlay labels: Instruction-Set Design, Micro Architecture Design, Compiler Design, System Integration, RTL Design, Embedded Software, RTL ISS Co-verification, Simulation.]
Optimal design requires powerful tools and automation!
Traditional Processor Design
Processor design-phase dependencies:
[Figure: sequential phases over time. Instruction Set Design, Micro Architecture Design, RTL Design, Verification; IA Simulator Development, CA Simulator Development, SoC Integration; Assembler & Linker, Debugger Coupling; Compiler Design, Software Development. Annotation: "Far too late!"]
Handwriting fast simulators is tedious, error-prone and difficult
Compiler cannot be considered in the architecture definition cycle
Risk of compiler-unfriendly instruction set
Inconsistencies between tools and models
Traditional design methodology does not allow for efficient processor design
Verification, software development and SoC integration too late
Real-world stimuli and SoC interaction might reveal bottlenecks
Design phases need to be parallelized!
Today: ADL based Processor design
OBJECTIVE
Improve Design- and Implementation Efficiency
... at the same time
Architecture Description Language based Processor Design
The purpose of an architecture description language (e.g. LISA) is:
To allow for an iterative design to efficiently explore architecture alternatives
To jointly design "Architecture - Compiler" and on-chip communication
To automatically generate hardware (path to implementation)
To automatically generate tools: assembler, linker, compiler, simulator, co-simulation interfaces
From a single model at various levels of temporal and spatial abstraction
[Figure: LISATek design flow. Describe/adopt a processor model in LISA 2.0 with the LISATek Processor Designer, starting from an empty model or from the LISATek IP samples (RISC sample, VLIW sample, DSP sample, FFT processor, custom processor). Generate the software tool chain: C-compiler, assembler & linker, simulator, debugger & profiler. Run the application on the generated tools.]
Function and instruction level profiling reveals hot-spots
special purpose instructions
LISATek (Multi Core) Analyzer
Source level analysis
Extendable Instruction Profiling
Symbolic C/C++ debugging
Extendable Memory/Cache Analysis
Extendable C Profiling
Pipeline Analysis (Stalls, Flushes...)
[Figure: LISATek design flow as before (describe/adopt a processor model in LISA 2.0, generate the software tool chain, run the application), extended by:]
• Instruction Set Synthesis
• Memory architecture
• Verification
Rapid modeling and re-targetable simulation + code-generation allows for: joint optimization of application and architecture
Generate...: RTL, Software Platform, SoC Integration
Function and instruction level profiling reveals hot-spots
special purpose instructions
Tool Structure Principles
Orthogonalize "Workbench" and Optimization Tools
R. Leupers et al., "Fine Grained Application Source Code Profiling for ASIP", DAC 2005
R. Leupers et al., "A Design Flow for Configurable Embedded Processors based on Optimized Instruction Set Extension Synthesis", DATE 2006
P. Ienne, R. Leupers (Editors), "Customizable Embedded Processors", Morgan Kaufmann (Elsevier), 2006
ASIP: Lofty Ambitions, Stark Realities
J. Fisher, "Customizing Processors: Lofty Ambitions, Stark Realities", Chapter 2 in: Customizable Embedded Processors, ed. by P. Ienne, R. Leupers, to be published by Morgan Kaufmann, July 2006
Mapping Application to Architecture
Exploit Parallelism
1. Instruction level
2. Data level
3. Pipeline level
[Figure: processor specialization from "Yesterday" via "Today" to "Tomorrow", between flexibility/reuse and performance optimization. Recoverable labels: RISC CPU, programmable DSP, hardwired logic; extensible RISC CPUs with application specific extensions (processor instruction-set extensions with highly specialized data-path: MIPS CorXtend, ARM OptimoDE, Tensilica Xtensa, ARC ARC600); DMA controller, programmable DMA; SIMD ASIP (application specific SIMD engine for image processing); VLIW ASIP (iDCT VLIW processor); fixed hardwired logic.]
More and more fixed ASIC datapath moves into application specific processors
From C-to Complex Instructions
Mapping Algorithm to Architecture
Fixed and Re-Configurable ASIP
rASIP: A Huge and Complex Design Space
[Figure: four architecture variants, each with core and register file: re-configurable data path with multiple stages; re-configurable data path with a single stage; re-configurable data path in VLIW slots (Slot A, Slot B); re-configurable data path in a loosely coupled rASIP.]
Design Space Exploration is the key
Pre-fabrication: ASIP architecture, FPGA architecture, ASIP-FPGA interface, static/dynamic re-configurability, ...
Post-fabrication: instruction-set extension, configuration code generation, scheduling and code generation, FPGA-targeted optimization, ...
Case Studies
References:
Tilman Glöckler, H. Meyr, Design of Energy-Efficient Application-Specific Instruction Set Processors, Kluwer Academic Publishers, 2004
Oliver Wahlen, C Compiler Aided Design of Application Specific Instruction-Set Processors Using the Machine Description Language LISA, Ph.D. thesis submitted to Aachen University of Technology (RWTH), 2004
The ICORE Example
A low-power ASIP for Infineon DVB-T 2nd generation single-chip receiver:
ASIP for DVB-T acquisition and tracking algorithms
(sampling-clock-synchronization, interpolation / decimation,
carrier frequency offset estimation)
Harvard architecture
60 mostly RISC-like instructions &
special instructions for CORDIC-algorithm
8x32-Bit general purpose registers, 4x9-Bit address registers
2048x20-Bit instruction ROM, 512x32-Bit data memory
I2C registers and dedicated interfaces for external communication
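The slides do not describe ICORE's CORDIC instructions themselves. As general background, the following is a minimal floating-point sketch of textbook CORDIC in vectoring mode, which recovers the phase of a complex sample (the quantity needed for carrier frequency offset estimation) using only additions and scalings by powers of two. The function name and the iteration count are illustrative assumptions; a hardware version would use fixed-point shifts and a small arctangent table.

```python
import math


def cordic_vectoring(x, y, iterations=16):
    """Return (magnitude, angle) of the vector (x, y), assuming x > 0."""
    angle = 0.0
    gain = 1.0
    for i in range(iterations):
        step = 2.0 ** -i                 # a right shift by i bits in hardware
        d = -1.0 if y > 0 else 1.0       # rotate towards the x axis
        x, y = x - d * y * step, y + d * x * step
        angle -= d * math.atan(step)     # table lookup in hardware
        gain *= math.sqrt(1.0 + step * step)
    return x / gain, angle


magnitude, phase = cordic_vectoring(1.0, 1.0)
print(magnitude, phase)  # close to sqrt(2) and pi/4
```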
Computational Efficiency vs. Flexibility
Source: T. Noll, RWTH Aachen
The Retinex Project
Application: Retinex-like Algorithms
[Figure: block diagram of the Retinex-like algorithm]
Knowledge: Application Knowledge, VLSI and Basic Processor Design Knowledge
Outline: From Specification to FPGA Prototyping
Duration: 7,5 Weeks
A cooperation between Pisa University and RWTH Aachen University
Retinex Architecture Reference
Paper presentation at DATE 2006
ASIP DESIGN AND SYNTHESIS FOR NON LINEAR FILTERING IN IMAGE PROCESSING
L. Fanucci, M. Cassiano and S. Saponara, DIIEIT-Pisa University, Italy
D. Kammler, E. M. Witte, O. Schleibusch, G. Ascheid, R. Leupers and H. Meyr, RWTH Aachen University, Germany
The Retinex ASIP
[Figure: pipeline FE, DC, LD, CMP, ROM, ARITH, WB with program memory, X-memory, Y-memory and ROM.]
Address Generation Units to optimally implement the address calculation scheme
Special Instructions to implement non-linear transformations
Zero Overhead Loops to accelerate loop control
Performance Comparison
System                               Athlon XP 3000+                     Retinex ASIP mapped on FPGA
Design Flow                          plain C-application, compiled       Optimized ASIP and handwritten
                                     with gcc, executed on AMD Athlon    assembly program (~100 lines of code)
Frequency                            2100 MHz                            16 MHz
Computation time (Picture 513x385)   ~ 3000 ms                           593 ms (~ 20 % of Athlon run-time)
Retargetable Compiler
56
Infineon PP32 Network Processor
200
200
180
180
160
160
140
140
120
120
lcc
lcc
100 CoSy cycle count
%
100 CoSy cycle count

%
CoSy code size

CoSy code size
80
80
60
60
40
40
20
20
0
0
frag tos hwacc route reed md5 crc
frag tos hwacc route reed md5 crc
136
ST200 VLIW Multimedia Processor
350
350
300
300
250
250
200
200 ST Multiflow
ST Multiflow
CoSy cycle count
%
CoSy cycle count

%
CoSy code size

150 CoSy code size
150
100
100
50
50
0
0
fir dct adpcm fht viterbigsm sieve
fir dct adpcm fht viterbigsm sieve
Low Cost Commercial ASIP
Increasing SW Content- but How?
Project Goals
Initial goal:
+ Custom processor design to save royalties
  -> LISA processor design
+ Development of an ASIP with superior architectural efficiency
  -> General purpose register file
+ Support a smooth legacy code migration
  -> Perl translation script
+ An architecture which is smaller than the existing architecture
  -> !!!
Development Time Sheet
Phase I
Initial Model                      4 weeks
Design Space Analysis              3 weeks
Design Space Exploration           4 weeks
- Address Calculation              1 week
- Non-delayed Branches             1 week
- Timing Improvement               ½ week
- Others                           1½ weeks
Phase II
Translation Script                 5 weeks
Move Elimination                   2 weeks
Verification Script                5 weeks
Synthesis & FPGA Mapping           1 day
FPGA System (one time effort)      10 weeks
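As a quick consistency check on the time sheet, the sketch below sums the listed durations. The grouping into two dictionaries and the assumption that the items simply add up (that is, were done one after another) are mine; the slide does not say whether tasks overlapped.

```python
# Durations in weeks, as listed on the time sheet.
exploration_items = {
    "Address Calculation": 1.0,
    "Non-delayed Branches": 1.0,
    "Timing Improvement": 0.5,
    "Others": 1.5,
}
first_group = {
    "Initial Model": 4.0,
    "Design Space Analysis": 3.0,
    "Design Space Exploration": 4.0,
}
second_group = {
    "Translation Script": 5.0,
    "Move Elimination": 2.0,
    "Verification Script": 5.0,
}

exploration_sum = sum(exploration_items.values())
first_group_weeks = sum(first_group.values())
second_group_weeks = sum(second_group.values())

# The four sub-items should account for the whole exploration effort.
print(f"exploration sub-items: {exploration_sum} weeks")
print(f"model + analysis + exploration: {first_group_weeks} weeks")
print(f"scripts + move elimination: {second_group_weeks} weeks")
```

The sub-items add up to exactly the 4 weeks given for Design Space Exploration. Synthesis and FPGA mapping (1 day) and the one-time FPGA system effort (10 weeks) are left out of the sums.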
Moving through the Design Space
1 First synthesis of verified RTL code, no port constraints, no optimizations
2 Memory port constraints, autom. optimization: path sharing
3 Grouping in functional units for more detailed analysis
4 Change in address calculation enabling resource sharing
5 Critical path analysis, modification of fetch mechanism, optimization: decision minimization
6 Pin for FPGA prototype added
7 Implementation of non-delayed branches, prog-mem size reduction
8 Changed multiplier implementation from 32bit to 17bit
9 Removed functional unit grouping from 3
10 Final Synthesis: timing constraint adapted to synthesis results
[Figure: design space plot with the points 1, 2, 3, 4, 5-7, 8-9 and 10.]
Multimedia Processor
Processor Designer in a video deblocking unit
Multi standard video decoder IP
[Figure: block diagram with external memory (coded bitstream, reference frames, deblocked frames), core decoder and DBLK.]
Semiconductors 144
Why Processor Designer?
• Until now: one RTL block for each standard. => Make a generic block for all (changing!) video standards.
• A programmable architecture brings flexibility (C compilation).
• 288 conditional filters (4 and 8 taps) to be done in 600 cycles.
• High throughput needed: custom operations and a special memory addressing scheme are required.
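The throughput requirement can be turned into a per-filter cycle budget. The sketch below is a worked calculation from the figures on the slides: 288 filters, 600 cycles, and the 166 MHz clock target quoted on the Results slide. The helper `required_speedup` and its baseline of 20 cycles per filter are purely hypothetical, included only to show how the budget translates into a speedup target.

```python
FILTERS = 288          # conditional filters to execute (from the slide)
CYCLE_BUDGET = 600     # cycles available for them (from the slide)
CLOCK_HZ = 166e6       # clock target quoted on the Results slide

cycles_per_filter = CYCLE_BUDGET / FILTERS
budget_time_us = CYCLE_BUDGET / CLOCK_HZ * 1e6


def required_speedup(baseline_cycles_per_filter):
    """Speedup needed over a processor that spends the given number of
    cycles on one filter. The baseline figure is a hypothetical input."""
    return baseline_cycles_per_filter / cycles_per_filter


print(f"budget per filter: {cycles_per_filter:.2f} cycles")
print(f"600 cycles at 166 MHz: {budget_time_us:.2f} us")
print(f"speedup needed from a 20-cycle baseline: {required_speedup(20):.1f}x")
```

With only about two cycles available per filter, a conventional load, compute, store sequence per tap does not fit, which is the motivation for custom operations and a dedicated addressing scheme.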
DBLK architecture
[Figure: block diagram with DMA IN, pixels memory (2x88 bits), DMA OUT, and a processor with data RAM and program ROM.]
Step 1: function call
Application development:
• Quickly get a C model of the system
• Debug the application in a SystemC environment
[Figure: deblock() function with data RAM and program ROM.]
Step 2: integration of lt_risc_32p5
• The provided RISC model is used
• Compilation of the application on the LISA processor
• Memory latencies are modelled in the pipeline
[Figure: processor (SystemC/RTL) with data RAM and program ROM.]
Step 3: RTL generation and performance improvement
• RTL generation
• C optimization
• Asm. optimization
• Use of specialized asm. instructions
• Remove unnecessary asm. instructions
• Improve model for RTL generation (clock speed, area)
Results
• Architecture far from the initial RISC
• Target of 166 MHz easily reached
• Size comparable to an all-RTL design (processor = 50 kgates)
• Performance targets reached
• IP taped out in a Set Top Box chip
Next steps
• No problems encountered yet on the prototype
• Make the block more generic to handle other standards
Planning
8 weeks 2 weeks 4 weeks 2 weeks 5 weeks
[Figure: timeline with the task labels Application development, Lt_risc_32p5 integration, Pin interfaces, Use of pixel memories and Optimisations, grouped into Step 1, Step 2, Step 3.]
Conclusion
- Con
• Long learning
• First use -> rough estimate of time needed
+ Pro
• RTL and SystemC always consistent (=> most of the validation can be run on SC)
• Faster than writing independent SC and RTL models
• Fast exploration of architecture choices
• Use of firmware:
– can be generic
– C debug
– If program ram: fixes and feature changes can be downloaded
• No royalties
Thank You