Communications
and Multimedia: Design, Tools, Architectures
Agenda
Future Wireless
Communication Systems
Fixed
DSL (→3 Mb/s)
Intranet (100Mb/s)
Wireless
WLAN (10-54 Mb/s)
Mobile
UMTS (2 Mb/s)
Mobile Internet Access
The Vision
UMTS Standard:
Ultra High-Speed 2 Mb/s Mobile Information and Communication
everywhere at low cost
Reality today:
UMTS 0.1-0.3 Mb/s
GSM/GPRS 0.02 Mb/s
€€€
In optimally located places
For a few users
4G and Beyond
New concepts
Ultra high speed transmission
Mobile multimedia processing
Smart systems
Mobile Applications and Services
Will be
cognitive
multifunctional
software definable
Will have
multiple Antennas
Future Wireless Communication Systems
and Their Impact on ESL
Impact of NGMN on Design Process: II
[Figure: hardware development (specification, hardware silicon) and software development (OS, device connectivity, software stack, UI, application; develop, unit test, integrate, test, validation, debugging), shown together with a virtual platform]
Virtual Platform: reduced bring-up
Virtual Platform: reduced system test
The End of Moore's Law:
Analysis
The task comprises many subtasks in various disciplines
“The whole is more than the sum of the parts”
Conclusion
The Paradigm Shift: Innovation Overtakes Scaling
[Figure: relative % improvement (axis 0 to 80) across CMOS technology generations; node labels 550 nm, 350 nm, 250 nm, 180 nm, 130 nm, 90 nm, 65 nm; process labels CMOS5X, CMOS6X, CMOS7S-SOI, CMOS8S2, CMOS9S, CMOS10S, CMOS11S]
Source: Lisa Su / IBM, MPSoC Conference 2005
Silicon Innovation
Technology
Packaging
Efficient Cooling
Dense SRAM, embedded DRAM
Systems will increasingly rely on modular components for continued performance leadership
Source: Lisa Su / IBM, MPSoC Conference 2005
Core Proposition
ASIP-based Platforms (heterogeneous MPSoC)
[Figure: rotated labels around the core proposition, only partly recoverable: Technology, Physics, Geometries, Soft Errors, Power Consumption]
Cross-disciplinary Task
Algorithm
Architecture
Tools
Service Provider
Equipment Manufacturers (SIEMENS)
Semiconductor House
Enabling Technology Providers
Alliances and the Business Equation
EDA
Mobile provider
Semiconductor company
Design - Space I: Physical Layer
Complexity
Bandwidth
Power
System Design
algorithm design space
implementation design space
Center-of-Gravity Approach
Algorithm
Architecture
Tools
Methodology
Design - Methodology: I
Design - Methodology: II
Design - Methodology: III
2. Validate a Design
(Conformance to Standards)
Models
Communication Model
Signal Model
Received Bandlimited Signals
Approximation by BL Signals
Truncation of the series to $\sum_{k=-K}^{K} x(kT_s)\,\phi(k)$ defines a
(2K+1)-dim. approximation in vector space
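The truncation idea can be tried numerically. The following is a minimal sketch, assuming the basis functions are shifted sinc pulses (the slide does not define ϕ, so this choice, the example signal, and all function names are illustrative assumptions). It approximates a bandlimited signal from 2K+1 samples and shows the error at an off-grid time instant.

```python
import math

def sinc(u):
    """Normalized sinc: sin(pi*u)/(pi*u), with sinc(0) = 1."""
    return 1.0 if u == 0.0 else math.sin(math.pi * u) / (math.pi * u)

def truncated_reconstruction(x, Ts, K, t):
    """Approximate x(t) by the 2K+1 terms sum_{k=-K}^{K} x(k*Ts) * sinc((t - k*Ts)/Ts).

    Assumption: sinc basis functions, i.e. the classical sampling series.
    """
    return sum(x(k * Ts) * sinc((t - k * Ts) / Ts) for k in range(-K, K + 1))

def example_signal(t):
    """Hypothetical bandlimited signal: bandwidth 0.2, well below 1/(2*Ts) for Ts = 1."""
    return sinc(0.4 * t - 0.3)

if __name__ == "__main__":
    Ts, t = 1.0, 0.5
    for K in (5, 50, 200):
        approx = truncated_reconstruction(example_signal, Ts, K, t)
        print(f"K={K:4d}  dimension={2 * K + 1:4d}  error={approx - example_signal(t):+.2e}")
```

At the sampling instants themselves the truncated series reproduces the samples (up to rounding); between them the error is governed by the discarded tail terms.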
Equivalence of digital/analog Signal Processing
Properties
Canonical Receiver Model
[Figure: canonical receiver model operating at 1 sample/symbol. Inner receiver: signal detection path and parameter estimation path, with an input labeled "From Channel Decoder". Outer receiver: channel decoder and source decoder.]
H. Meyr et al., “Digital Communication Receivers”, J. Wiley, 1998
Receiver Task
Inner Receiver
To provide a “good” channel to the decoder based on the
principle of synchronized Detection.
NOTHING ELSE !
Outer Receiver
To decode the information
Performance Measure
Inner Receiver
Properties of the estimator
Variance
Unbiased
Outer Receiver
Bit-error-rate of the coded system
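The inner-receiver measures (bias and variance of an estimator) can be checked empirically. The sketch below is an illustration only: it assumes a hypothetical amplitude estimator (the sample mean of noisy observations), which is not taken from the slides, and measures its bias and variance by Monte Carlo simulation.

```python
import random

def estimate_amplitude(samples):
    """Hypothetical estimator: the sample mean of the noisy observations."""
    return sum(samples) / len(samples)

def monte_carlo_bias_variance(true_a=1.0, sigma=1.0, n_samples=16, trials=2000, seed=1):
    """Return (bias, variance) of the estimator, measured over many noise realizations."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        samples = [true_a + rng.gauss(0.0, sigma) for _ in range(n_samples)]
        estimates.append(estimate_amplitude(samples))
    mean_est = sum(estimates) / trials
    bias = mean_est - true_a  # close to zero for an unbiased estimator
    variance = sum((e - mean_est) ** 2 for e in estimates) / (trials - 1)
    return bias, variance

if __name__ == "__main__":
    bias, variance = monte_carlo_bias_variance()
    print(f"empirical bias     = {bias:+.4f}")
    print(f"empirical variance = {variance:.4f} (theory for the sample mean: {1.0 / 16:.4f})")
```

For this particular estimator the theoretical variance is sigma²/N, which gives a reference value to compare the simulation against.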
Performance Loss
Implementation Loss
ΔSNR (dB) resulting from finite precision arithmetic and
algorithmic approximations
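One way to read the implementation loss off two BER curves is to compare the SNR each needs to reach the same target BER. The sketch below assumes uncoded BPSK over AWGN as the ideal reference and models the fixed-point receiver as a synthetic 0.5 dB SNR degradation; both curves and all names are assumptions for illustration, not measurements from the slides.

```python
import math

def ber_bpsk(snr_db):
    """Theoretical BER of uncoded BPSK over AWGN; snr_db is Eb/N0 in dB."""
    ebn0 = 10.0 ** (snr_db / 10.0)
    return 0.5 * math.erfc(math.sqrt(ebn0))

def required_snr_db(ber_curve, target_ber, lo=-10.0, hi=30.0, iterations=60):
    """Bisection for the SNR at which a monotonically decreasing BER curve hits target_ber."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if ber_curve(mid) > target_ber:
            lo = mid  # BER still too high: more SNR is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)

def implementation_loss_db(ber_ideal, ber_implemented, target_ber):
    """Delta-SNR in dB between the implemented and the ideal receiver at target_ber."""
    return required_snr_db(ber_implemented, target_ber) - required_snr_db(ber_ideal, target_ber)

def ber_fixed_point(snr_db):
    """Synthetic stand-in for a finite-precision receiver: ideal curve shifted by 0.5 dB."""
    return ber_bpsk(snr_db - 0.5)

if __name__ == "__main__":
    loss = implementation_loss_db(ber_bpsk, ber_fixed_point, target_ber=1e-4)
    print(f"implementation loss at BER 1e-4: {loss:.3f} dB")
```

In practice the implemented curve would come from bit-true simulation; the bisection step stays the same.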
BER Performance
Complexity DVB-S
DVB-S Chip
Siemens-RWTH Aachen (ISS) Design 1997
0.5 µm technology
3 metal layers
1.5 W @ 88 MHz
>500 k transistors
First silicon success
DVB-T Specifications
System Performance: DVB-T
Joint Infineon-Nokia-ISS Design 1999
AGC: Automatic Gain Control
IQ: IQ-Mixer and Resampling
PPU: Postprocessing Unit
FFT: Fast Fourier Transform (2k, 8k)
DTO: Digital Timing Oscillator
RAM: OFDM Symbol Memory
CHE: Channel Estimation
IFFT: Inverse FFT and Fine Timing
ESG: Equalization and Softbit Generation
FEC: Forward Error Correction (Viterbi, Reed-Solomon)
DVB-T Complexity
Inner Receiver
The algorithms of the inner receiver are never specified
by the standard
BOTH algorithm and architecture space exploration
Outer Receiver
The decoder is exactly specified in the standard
ONLY architecture space exploration
Massive Parallel Processing on
Heterogeneous MPSoC
Massive parallelism required in the foreseeable future
Frequency (MHz)         300    600    1500
Giga operations         0.3    14     2458
Operations per cycle    1      23     1638
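The last row of the table follows from the first two: operations per cycle is the operation rate divided by the clock frequency. A small sketch that recomputes it (the function name is illustrative; the input figures are those of the table):

```python
def operations_per_cycle(giga_ops, freq_mhz):
    """Operations per clock cycle = (giga_ops * 1e9) / (freq_mhz * 1e6)."""
    return giga_ops * 1000.0 / freq_mhz

# (frequency in MHz, giga operations per second) as listed in the table
TABLE = [(300, 0.3), (600, 14), (1500, 2458)]

if __name__ == "__main__":
    for freq_mhz, giga_ops in TABLE:
        ratio = operations_per_cycle(giga_ops, freq_mhz)
        print(f"{freq_mhz:5d} MHz  {giga_ops:7.1f} GOPS  {ratio:8.1f} operations per cycle")
```

The table's integer values appear to be the truncated results of this division.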
Why Many-Processor Architectures Today?
Guiding Principles for Manycore SoC: II
Parallel Computing
Wireless 1G
Microprocessor / DSP
Battery Power
Time
Computational Efficiency vs. Flexibility
ASIP (ICORE, DVB-T Sync&Track)
Flexibility →
← Efficiency
Design Principles
[Figure: from specification to MPSoC virtual prototype to MPSoC HW prototype; each prototype consists of HW blocks, processors (Proc) and memories (Mem) connected by a Network-on-Chip]
MPSoC exploration principles
[Figure: tasks 1 to 4 running on VPUs (processor simulators) connected through a NoC simulator; the interconnect structure is modeled as a P2P model, a bus model, or a router model]
MPSoC exploration Results
Message Sequence Chart
Aggregated Communication Graph
Histogram Views
MPSoC Exploration Results: Communication
Composition of Nuclei
Example: Baseband Processing for 4G
[Figure: receiver model with inner receiver (signal detection path) and outer receiver (channel decoder, source decoder)]
H. Meyr et al., “Digital Communication Receivers”, J. Wiley, 1998
Lessons Learned from Design Reviews 2005
From Function to Algorithm Classes
Basic Blocks: Algorithm Types
Butterfly unit
  Viterbi & MAP decoder
  MLSE equalizer
Eigenvalue decomposition (EVD)
  Delay acquisition (CDMA)
  MIMO Tx processing
Matrix-Matrix & Matrix-Vector Multiplication
  MIMO processing (Rx & Tx)
  LMMSE channel estimation (OFDM & MIMO)
Iterative (Turbo) Decoding
  Message Passing Algorithm, LDPC Decoding
CORDIC
  Frequency offset estimation (e.g. AFC)
  OFDM post-FFT synchronization (sampling clock, fine frequency)
FFT & IFFT (spectral processing)
  OFDM
  Speech post processing (noise suppression)
  Image processing (not FFT but DCT)
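As an illustration of one of the listed basic blocks, here is a minimal floating-point sketch of CORDIC in vectoring mode, which extracts the phase and magnitude of a vector using only shift-and-add style micro-rotations. A phase of this kind is what a frequency offset estimator typically needs. The function name, the iteration count, and the restriction to x > 0 are assumptions of this sketch; a hardware version would use fixed-point arithmetic and a small arctangent lookup table.

```python
import math

def cordic_vectoring(x, y, iterations=16):
    """Rotate (x, y) onto the positive x-axis by micro-rotations of atan(2**-i).

    Returns (magnitude, angle_in_radians). Valid for x > 0 in this sketch.
    """
    angle = 0.0
    gain = 1.0
    for i in range(iterations):
        step = 2.0 ** -i               # a right shift by i bits in hardware
        d = -1.0 if y > 0 else 1.0     # rotate so that y is driven toward zero
        x, y = x - d * y * step, y + d * x * step
        angle -= d * math.atan(step)   # table lookup in hardware
        gain *= math.sqrt(1.0 + step * step)
    return x / gain, angle             # remove the accumulated CORDIC gain

if __name__ == "__main__":
    magnitude, angle = cordic_vectoring(1.0, 1.0)
    print(f"magnitude = {magnitude:.5f}, angle = {angle:.5f} rad")
    print(f"reference = {math.hypot(1.0, 1.0):.5f}, {math.atan2(1.0, 1.0):.5f} rad")
```

Each additional iteration adds roughly one bit of angular accuracy, which is why the iteration count is a natural architecture parameter.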
$$
\begin{array}{l|c|c|c}
x \oplus y & x + y & \log_e\left[e^{x} + e^{y}\right] & \max(x, y) \\
x \otimes y & x \cdot y & x + y & x + y
\end{array}
$$
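The log-domain and max variants of the ⊕ operation can be compared directly. The sketch below computes log_e[e^x + e^y] in the numerically stable Jacobian-logarithm form max(x, y) + log(1 + e^(-|x-y|)) and contrasts it with the plain max(x, y) approximation; the function names are illustrative.

```python
import math

def log_add(x, y):
    """Exact log-domain addition: ln(e**x + e**y), computed without overflow."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))

def max_log_add(x, y):
    """Max approximation of log-domain addition: drops the correction term."""
    return max(x, y)

if __name__ == "__main__":
    for x, y in [(0.0, 0.0), (1.0, 2.0), (-3.0, 4.0)]:
        exact = log_add(x, y)
        approx = max_log_add(x, y)
        print(f"x={x:+.1f} y={y:+.1f}  exact={exact:.4f}  max={approx:.4f}  gap={exact - approx:.4f}")
```

The gap between the two never exceeds ln 2 and shrinks quickly as |x - y| grows, which is why the max form is attractive when the correction lookup is too costly.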
Algorithmic Descriptors
[Figure: 384 kbps UMTS receiver, digital BB complexity. Log-log plot of OPs per sample (10^0 to 10^5) versus sampling rate [1/s] (10^2 to 10^8), with lines of constant 1 MOPS, 10 MOPS, 100 MOPS and 1000 MOPS. Blocks shown: MUSIC delay acq., Turbo decoder, Path searcher, Max. ratio combining, SIR estimation, AFC, Correlators, RRC pulse MF, Timing tracking, Channel estimation, Interpolation/decimation, AGC.]
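The two axes of this plot combine into the MOPS figure: operations per sample times sampling rate. A minimal sketch of that arithmetic follows; the example numbers are hypothetical and are not read off the figure.

```python
def mops(ops_per_sample, sampling_rate_hz):
    """Million operations per second for a block running at the given sample rate."""
    return ops_per_sample * sampling_rate_hz / 1e6

def ops_per_sample_budget(mops_budget, sampling_rate_hz):
    """Operations per sample that fit into a given MOPS budget."""
    return mops_budget * 1e6 / sampling_rate_hz

if __name__ == "__main__":
    # Hypothetical block: 100 operations per sample at the 3.84 MHz UMTS chip rate.
    print(f"{mops(100, 3.84e6):.1f} MOPS")
    # Hypothetical budget: how many operations per sample 1000 MOPS allows at that rate.
    print(f"{ops_per_sample_budget(1000, 3.84e6):.1f} operations per sample")
```

On log-log axes a constant product of the two quantities is a straight line, which is how constant-MOPS lines appear in such a plot.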
Hardware
Guiding Principle
Potential Processor Parallelism
Maximum parallelism: LxMxN
SIMD Instructions (M)
  Types of vector operations
  Number of vector elements
  Number and size of vector register files
Fusion of Operation (N)
  Number and type of composing operations
  Number of inputs and outputs
  Latency
Source: C. Rowen, Tensilica
Memory Architecture
Programming Models and Design
Methodology
[Figure: UMTS modem function blocks. Recoverable labels: Tx Framing & FEC, Tx Modulator, RF Frontend (Down-/Upconversion), Closed Loop Tx Diversity, DL Power Control, UL Power Control, Intra-Frequency Measurements, Inter-Frequency Measurements, HSDPA Scheduling, HSUPA E-TFC Selection, HARQ ACK/NACK Ctrl, Physical Layer Control, Physical Layer Reconfig, Hard Handover, Soft Handover, Inter-RAT Handover, Cell Search, Timing Tracking, AGC, AFC, Transport Channel, Delay Profile Estimation, Layer 1 SW, Layer 2/3 Stack (MAC, RLC, RRC)]
Programming Models
Principle of Autotuners:
Conclusion
Application Specific
Processors Design
Processor Design Space
communication?
Memory Peripheral
Traditional Processor Design
time
Handwriting fast simulators is tedious, error-prone and difficult
Compiler cannot be considered in the architecture definition cycle
Risk of a compiler-unfriendly instruction set
Inconsistencies between tools and models
Traditional design methodology does not allow for efficient processor design
Verification, Software Development and SoC integration too late
Real-world stimuli and SoC interaction might reveal bottlenecks
Design phases need to be parallelized!
OBJECTIVE
Improve Design- and Implementation Efficiency
…..at the same time
Architecture Description Language based Processor Design
[Figure: LISATek flow. Describe/adopt a processor model (empty model or LISATek IP samples: RISC sample, VLIW sample, DSP sample, FFT processor), turn it into a custom LISA 2.0 processor model in the LISATek Processor Designer, and generate the application software tool chain: C-Compiler, Assembler & Linker, Simulator, Debugger & Profiler]
LISATek (Multi Core) Analyzer
Source level analysis
Extendable Instruction Profiling
Symbolic C/C++ debugging
Memory/Cache Analysis
C Profiling
Extendable Pipeline Analysis (Stalls, Flushes...)
• Instruction Set Synthesis
• Memory architecture
• Verification
[Figure: LISATek design flow (describe/adopt a processor model, generate the application software tool chain)]
Rapid modeling and re-targetable simulation + code-generation allow for:
joint optimization of application and architecture
Generate... RTL
Tool Structure Principles
Orthogonalize “Workbench”
and Optimization Tools
J. Fisher, “Customizing Processors: Lofty Ambitions, Stark Realities”, Chapter 2 in: Customizable
Embedded Processors, ed. by R. Leupers and P. Ienne, to be published by Morgan Kaufmann, July 2006
Mapping Application to Architecture
Exploit Parallelism
1. Instruction level
2. Data level
3. Pipeline level
[Figure: processor evolution from Yesterday to Today to Tomorrow, trading flexibility and reuse against performance optimization and specialization; labels: RISC CPU, DSP, extensible RISC CPU with application-specific extensions, programmable DSP, programmable DMA controller, SIMD ASIP, VLIW ASIP, hardwired logic]
Processor instruction-set extensions with highly specialized data-path: MIPS CorXtend, ARM OptimoDE, Tensilica Xtensa, ARC - ARC600
Application-specific: SIMD engine for image processing, iDCT VLIW processor
More and more fixed ASIC datapath moves into application-specific processors
rASIP : A Huge and Complex Design Space
Design Space Exploration is the key
Re-configurable data path options: multiple stages; single stage; in VLIW slots; in loosely coupled rASIP
Pre-fabrication                        Post-fabrication
ASIP architecture                      Instruction-Set Extension
FPGA architecture                      Configuration code generation
ASIP-FPGA Interface                    Scheduling and Code-Generation
Static / Dynamic re-configurability    FPGA-targeted optimization
....                                   ....
Case Studies
References:
Tilman Glöckler, H. Meyr, Design of Energy-Efficient Application-Specific Instruction Set Processors, Kluwer Academic Publishers, 2004
Oliver Wahlen, C Compiler Aided Design of Application Specific Instruction-Set Processors Using the Machine Description Language LISA, Ph.D. thesis submitted to Aachen University of Technology (RWTH), 2004
The ICORE Example
The Retinex Project
[Figure: Retinex processing chain; surviving labels: F, Γ, β, LinSt]
The Retinex ASIP
Program Memory
X-Memory
Y-Memory
ROM
Pipeline stages: FE, DC, LD, CMP, ROM, ARITH, WB
Zero Overhead Loops to accelerate loop control
Performance Comparison
System                                 Athlon XP 3000+    Retinex ASIP mapped on FPGA
Computation time (Picture 513x385)     ~ 3000 ms          593 ms (~ 20 % of Athlon run-time)
Retargetable Compiler
Infineon PP32 Network Processor
[Figure: cycle count in %, lcc versus CoSy, for the benchmarks frag, tos, hwacc, route, reed, md5, crc]
[Figure: cycle count in %, ST Multiflow versus CoSy, for the benchmarks fir, dct, adpcm, fht, viterbi, gsm, sieve]
Low Cost Commercial ASIP
Increasing SW Content- but How?
Project Goals
Initial goal:
+ Custom processor design to save royalties
LISA processor design
+ development of an ASIP with superior
architectural efficiency
General purpose register file
+ support a smooth legacy code migration
Perl - translation script
+ an architecture which is smaller than the
existing architecture
!!!
LISA
Development Time Sheet
Phase I
- Address Calculation 1 week
- Non-delayed Branches 1 week
- Timing Improvement ½ week
- Others 1½ weeks
branches
prog-mem size reduction
Multimedia Processor
Multi standard video decoder IP
Coded bitstream
External memory
Semiconductors 144
DBLK architecture
2x88 bits
Data ram
Processor
Prog. rom
Data ram
deblock()
Prog. rom
Step 2 : integration of lt_risc_32p5
• Provided model of RISC used
• Compilation of application on the Lisa Processor
• Memory latencies are modelled in the pipeline
• RTL generation
• C optimization
• Asm. optimization
• Use of specialized asm. instruction
• Remove unnecessary asm. instructions
• Improve model for RTL generation (clock speed, area)
Results
• Architecture far from the initial RISC
• Target of 166 MHz easily reached
• Size comparable to an all-RTL design
(processor = 50 kgates)
• Performances reached
• IP taped out in a Set Top Box chip
Next steps
• No problem met yet on prototype
• Make the block more generic to handle other standards
Planning
[Figure: project planning chart; phases: pin interfaces, application development, use of pixel memories, lt_risc_32p5 integration, optimisation]
Conclusion
- Con
• Long learning
• First use -> rough estimate of time needed
+ Pro
• RTL and SystemC always consistent (=> most of the validation can be run on SC)
• Faster than writing independent SC and RTL models
• Fast exploration of architecture choices
• Use of firmware :
– can be generic
– C debug
– If program ram : fixes and feature changes can be downloaded
• No royalties
Thank You