Research With "Chameleon" Chips or Chips That Can Do Almost Anything
Jose Luis Nunez-Yanez

Department of Electrical and Electronic Engineering
Multicore everywhere
Instead of continuing to increase clock speed, Intel admitted that its gigahertz race was foolish and that the next decade of performance improvements would come from increasing the number of cores per processor, not clock speed.
Possible multicore challenges

Programming (general purpose?)
Reliability and fault-tolerance (yield/cost)
Test and verification
Energy/Power (static and dynamic)
What can we do about this? The NoRC (Network on a Reconfigurable Chip) concept
Asynchronous network-on-chip with fine-grained, dynamically reconfigurable tiles

[Figure: NoRC chip organisation. Labels: Reconfigurable part, Static part, Off-chip Memory/Interface, Tile, Serial I/O core, Embedded Processor, Local Memory, Fixed Function IP, Internal Bus, Memory, Router, and Reconfigurable Area Controller, DVS logic, ICAP and Network Interface.]
Evolution of current FPGA technology towards NoRC?
Partial and dynamic reconfiguration is already supported in state-of-the-art FPGAs: Xilinx Virtex-4/5/6, and Altera is getting there as well.
NoRC functional model: the assignment of processing resources to tasks is determined at run time, following requests for services.
[Figure: six snapshots of the NoRC fabric labelled Initial state, Configuration state, Configured/Steady state, Clone state, Fault state and Reconfigured state. Node labels: R, MA, MB, SA2, SA3, SA4, SB1, SB2, SAB 1, IN A, IN B, OUT A, OUT B. Annotations: "MB reconfigures SB1" and "Fault".]
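The run-time assignment of resources in the NoRC functional model can be illustrated with a small software sketch. This is not the NoRC implementation: the `Fabric` class, the number of tiles and the allocation policy are all hypothetical, chosen only to show tiles being configured on request and a service being re-hosted on a spare tile after a fault. The service names are borrowed from the figure labels.

```python
class Fabric:
    """Toy model of a pool of reconfigurable tiles (hypothetical, for illustration)."""

    def __init__(self, num_tiles):
        # None means the tile is empty; a string names the service configured on it.
        self.tiles = [None] * num_tiles
        self.faulty = set()

    def request(self, service):
        """Configure `service` on the first healthy empty tile and return its index."""
        for idx, current in enumerate(self.tiles):
            if current is None and idx not in self.faulty:
                self.tiles[idx] = service
                return idx
        raise RuntimeError("no free tile available for " + service)

    def report_fault(self, idx):
        """Mark a tile as faulty and re-host its service on another healthy tile."""
        service = self.tiles[idx]
        self.faulty.add(idx)
        self.tiles[idx] = None
        return self.request(service) if service is not None else None


fabric = Fabric(num_tiles=4)
a = fabric.request("SA2")          # configured on demand
b = fabric.request("SB1")
moved_to = fabric.report_fault(b)  # SB1 is reconfigured onto a spare tile
print(a, b, moved_to, fabric.tiles)
```

A real system would also have to reroute network traffic to the new tile and transfer any state the service holds; the sketch leaves both out.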
What do we get out of NoRC?

A fairly non-deterministic system, but:
1. Fault tolerance thanks to dynamic reconfiguration, and therefore better yield.
2. Adaptability to process variations and operating conditions with the GALS (globally asynchronous, locally synchronous) approach and voltage/frequency (V/F) scaling techniques.
3. Well priced, since it can support many different applications by reconfiguration.
4. Energy efficiency thanks to being able to match the architecture to the task instead of using a programmable processor (I thought FPGAs were terrible at power).
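As a reminder of why V/F scaling helps with energy, the standard first-order model of dynamic power in CMOS logic is given below. The symbols are the usual textbook ones and are not taken from the slides: activity factor \alpha, switched capacitance C, supply voltage V and clock frequency f.

```latex
P_{\text{dyn}} \approx \alpha \, C \, V^{2} f
```

Because a lower clock frequency usually permits a lower supply voltage, scaling both together reduces dynamic power faster than scaling frequency alone.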
Example of processor customization : Motion estimation overview

Motion estimation removes the correlation between video frames and directly influences video quality and compression ratio.
It consumes a large percentage of video encoding cycles (50-80% in H.264).
We propose an application-specific architecture and programming language: programmable and configurable.
[Figure: video encoder block diagram. Blocks: Video In, Next Frame, ME, Buffer, Loop Filter, Intra Prediction, Switch, DCT, Q, Q-1, IDCT, MC (Motion Comp.), Prediction Loop, Entropy Coding, Bitstream.]
Execution Time Distribution

[Figure: execution time distribution for 720p pedestrian, baseline, noasm. Horizontal axis: Execution Time (s), 0 to 200. Inter (ME) 60%, Intra 10%, DCT 9%, Quantisation 6%, CABAC 4%, Other 11%.]
Exhaustive and fast motion estimation hardware

Full search motion estimation is the preferred approach in hardware solutions.
The large number of points tested in ME means that it can be very costly in terms of time and energy.
Fast algorithms typically reduce the number of points tested, for example with a simple diamond search.
Lagrangian optimization, multiple reference frames and quarter-pel resolution add further complexity to full search strategies.
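The difference in cost between full search and a point-reducing search can be made concrete with a short sketch. It is an illustration only, not the hardware algorithm from the slides: the function names, the small-diamond pattern, the block size, the search range and the synthetic frames are all assumptions made for the example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, rng):
    """Test every displacement in [-rng, rng] x [-rng, rng]."""
    best, best_cost, tested = (0, 0), None, 0
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            cost = sad(cur, ref, bx, by, dx, dy, n)
            tested += 1
            if best_cost is None or cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, tested


def diamond_search(cur, ref, bx, by, n, rng):
    """Small-diamond search: move to the best of four neighbours until the centre wins."""
    mv = (0, 0)
    best_cost = sad(cur, ref, bx, by, 0, 0, n)
    tested = 1
    while True:
        winner = mv
        for ox, oy in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            dx, dy = mv[0] + ox, mv[1] + oy
            if abs(dx) > rng or abs(dy) > rng:
                continue  # stay inside the search window
            cost = sad(cur, ref, bx, by, dx, dy, n)
            tested += 1
            if cost < best_cost:
                winner, best_cost = (dx, dy), cost
        if winner == mv:
            return mv, tested
        mv = winner


# Synthetic frames (an assumption for the demo): the current frame is the
# reference frame displaced by (2, 1), so the true motion vector is (2, 1).
SIZE = 32
ref = [[x * x + 64 * y * y for x in range(SIZE)] for y in range(SIZE)]
cur = [[(x + 2) ** 2 + 64 * (y + 1) ** 2 for x in range(SIZE)] for y in range(SIZE)]

mv_full, n_full = full_search(cur, ref, 12, 12, 4, 4)
mv_ds, n_ds = diamond_search(cur, ref, 12, 12, 4, 4)
print("full search:", mv_full, "points tested:", n_full)
print("diamond search:", mv_ds, "points tested:", n_ds)
```

On this smooth synthetic frame both searches return the same vector while the diamond search evaluates far fewer candidates. On real video a greedy search can stop in a local minimum, which is why the choice of search pattern matters.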
ME algorithms performance analysis

Fast motion estimation algorithms perform as well as or better than full search.
Lagrangian optimization, the Hadamard transform, fractional search and sub-blocks must be supported in a high-performance ME processor.

Different video sequences benefit from different features.
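The cost measures named on this slide can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the LMx datapath: it uses a 4x4 block, an unnormalised Hadamard transform, and the common Lagrangian form J = D + lambda * R, where the lambda value and the motion vector bit count in the example are made up.

```python
# 4x4 Hadamard matrix (symmetric, so it equals its own transpose).
H4 = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def sad4(cur, ref):
    """Sum of absolute differences of two 4x4 blocks."""
    return sum(abs(c - r) for cr, rr in zip(cur, ref) for c, r in zip(cr, rr))


def satd4(cur, ref):
    """Sum of absolute Hadamard-transformed differences (unnormalised)."""
    diff = [[c - r for c, r in zip(cr, rr)] for cr, rr in zip(cur, ref)]
    transformed = matmul4(matmul4(H4, diff), H4)
    return sum(abs(v) for row in transformed for v in row)


def rd_cost(distortion, mv_bits, lam):
    """Lagrangian cost J = D + lambda * R (lambda and R are illustrative inputs)."""
    return distortion + lam * mv_bits


flat_cur = [[10] * 4 for _ in range(4)]
flat_ref = [[8] * 4 for _ in range(4)]
spike_cur = [[16, 0, 0, 0]] + [[0] * 4 for _ in range(3)]
zero_ref = [[0] * 4 for _ in range(4)]

print(sad4(flat_cur, flat_ref), satd4(flat_cur, flat_ref))
print(sad4(spike_cur, zero_ref), satd4(spike_cur, zero_ref))
print(rd_cost(sad4(flat_cur, flat_ref), mv_bits=6, lam=4))
```

The two test blocks show how the measures differ: a uniform brightness offset concentrates in a single transform coefficient, whereas a one-pixel spike spreads across all sixteen, so SATD penalises it much more heavily than SAD does.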
[Figure: PSNR versus Bit Rate (Mbits) for the Crowdrun and Pedestrian area sequences. Curves: Full-pel only; Full-pel only with Lagrangian disabled; Fractional-pel with SAD; Fractional-pel with SATD; Fractional-pel with SATD and 8x8 blocks; Fractional-pel with SATD and all blocks.]
LiquidMotion (LMx) Motion Estimation Processor features

Highly configurable architecture enables the designer to optimize the hardware for the selected algorithm.
Intuitive and easy programming using a C-like syntax.
Support for advanced features such as rate-distortion optimization using Lagrangian techniques, sub-partitions, multiple reference frames and fractional search.

[Fig. 1: Base configuration (one integer-pel execution unit). Blocks: Program Memory; Fetch, Decode, Issue; Point Memory; Address Calculator; Reference Memory; Macroblock Memory; SAD Selector. Labelled signals: 20-bit instructions, 8-bit fetch addresses, 8-bit point addresses, 8-bit pattern addresses, 16-bit current motion vector, 128-bit and 64-bit reference vectors, 64-bit current vector, 16-bit current SAD, 16-bit best SAD, 8-bit EU id, 8-bit best EU id.]
[Fig. 2: Complex configuration (four integer-pel execution units, one fractional-pel execution unit, Lagrangian optimizer). Blocks in addition to those of Fig. 1: Half-pel Interpolator, Half-pel Reference Memory, Quarter-pel Interpolator, Vector Alignment Unit, Lagrangian Optimizer (MV cost), Integer-pel Execution Unit, Fractional-pel Execution Unit.]
LMx processor microarchitecture

[Figure: LMx processor microarchitecture. Blocks: Program Memory; Fetch, Decode, Issue; Point Memory; Address Calculator; Reference Memory; Macroblock Memory; Half-pel Interpolation; Quarter-pel Interpolator; Vector Alignment Unit; SAD; DIF; Hadamard; COST Accumulator and control; COST Selector; Motion Vector cost (inputs MVP, QP, MV). Labelled signals: 20-bit instructions, 8-bit fetch addresses, 8-bit pattern addresses, 8-bit point addresses, 12-bit and 8-bit reference addresses, 16-bit current motion vector, 128-bit and 64-bit reference vectors, 64-bit current vector, 64-bit interpolated vector, 64-bit SAD vector, 64-bit DIF vector, 32-bit SATD vector, 16-bit current SAD, 16-bit current cost, 16-bit best SAD and 8-bit winner id, 16-bit best cost, 8-bit EU id, 8-bit best EU id.]
Toolset (Compiler + CAS + design space exploration)

[Figure: toolset screenshots. Panels: ME algorithm source code; Selected configuration details; Processor configuration window; Result plots; Hardware configuration selection and RTL export; Exploration of design space.]
Programming Model
Compiler supports typical constructs such as if-else, for loops, comparison with current SAD values and early termination.
Efficient evaluation of multiple user-defined motion vector candidates, transparently to the rest of the algorithm.
Binary compatibility, so that once an algorithm has been compiled it can be executed on any hardware configuration.

EstimoC source code:

    S = 8; // Initial step size
    check(0, 0); check(0, S); check(0, -S); check(S, 0); check(-S, 0);
    update;
    do {
        S = S / 2;
        for(i = 0 to 4 step 1) {
            check(0, S); check(0, -S); check(S, 0); check(-S, 0);
            update;
            #if( WINID == 0 )
                #break;
        }
    } while( S > 1);
    for(i = -0.5 to 0.5 step 0.25)
        for(j = -0.5 to 0.5 step 0.25)
            check(i, j);
    update;

Compiler output (address, instruction word in hexadecimal, disassembly):

     0   0 05 00   chk     NumPoints: 5    startAddr: 0
     1   0 04 05   chk     NumPoints: 4    startAddr: 5
     2   2 00 0B   chkjmp  WIN: 0          goto: 11
     3   0 04 05   chk     NumPoints: 4    startAddr: 5
     .
    11   0 04 0A   chk     NumPoints: 4    startAddr: 9
    12   2 00 15   chkjmp  WIN: 0          goto: 21
    ..
    21   0 04 0D   chk     NumPoints: 4    startAddr: 13
    22   2 00 1F   chkjmp  WIN: 0          goto: 31
     .
    31   1 19 11   chkfr   NumPoints: 25   startAddr: 17

chk: integer check pattern instruction
chkfr: fractional check pattern instruction
chkjmp: conditional jump instruction

EstimoC source code and compiler output
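The compiler output can be read as 20-bit instruction words made of one hexadecimal digit followed by two hexadecimal bytes. The decoder below is a sketch of that reading, not the documented LMx encoding: the bit layout (a 4-bit opcode above two 8-bit fields) and the helper names `parse_listing_word` and `decode` are assumptions inferred from the listing and from the 20-bit instruction width.

```python
# Opcode values as they appear in the first column of the compiler listing.
OPCODES = {0: "chk", 1: "chkfr", 2: "chkjmp"}


def parse_listing_word(text):
    """Turn a listing entry such as '1 19 11' into a 20-bit integer.

    Assumed layout: [19:16] opcode, [15:8] first field, [7:0] second field.
    """
    op, a, b = (int(part, 16) for part in text.split())
    return (op << 16) | (a << 8) | b


def decode(word):
    """Disassemble one instruction word using the assumed layout."""
    opcode = (word >> 16) & 0xF
    field_a = (word >> 8) & 0xFF
    field_b = word & 0xFF
    name = OPCODES[opcode]
    if name == "chkjmp":
        return f"{name} WIN: {field_a} goto: {field_b}"
    return f"{name} NumPoints: {field_a} startAddr: {field_b}"


for entry in ("0 05 00", "2 00 0B", "1 19 11"):
    print(entry, "->", decode(parse_listing_word(entry)))
```

Under this reading the fractional instruction `1 19 11` carries 0x19 = 25 points, which matches the 5 x 5 grid produced by the two nested quarter-pel loops in the source program.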
LMx processor performance evaluation

Video sequences: 1080p crowdrun and pedestrian sequences from the SVT High Definition Multi Format Test Set
ftp://vqeg.its.bldrdoc.gov/HDTV/SVT_MultiFormat/1080p50_CgrLevels_SINC_FILTER_SVTdec05_/
FPGA part: Virtex-4 SX35 at 200 MHz clock frequency
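To put the frame rates in context, the clock frequency can be converted into a cycle budget per macroblock. The arithmetic below is a back-of-the-envelope sketch: the 1920 x 1080 frame size and 16 x 16 macroblocks are the usual values for 1080p H.264 video, and the 50 frames/second target is an assumption taken from the "1080p50" in the test set name, not a figure from the slides.

```python
import math


def macroblocks_per_frame(width, height, mb_size=16):
    """Number of macroblocks, rounding partial rows and columns up."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)


def cycles_per_macroblock(clock_hz, fps, width, height):
    """Clock cycles available for each macroblock at the given frame rate."""
    return clock_hz / (fps * macroblocks_per_frame(width, height))


mbs = macroblocks_per_frame(1920, 1080)                 # 120 columns x 68 rows
budget = cycles_per_macroblock(200e6, 50, 1920, 1080)   # 200 MHz, 50 frames/second
print(mbs, round(budget))
```

A single execution unit that needs more cycles than this budget per macroblock cannot sustain the target rate, which is the motivation for scaling the number of integer-pel execution units.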
[Figure: frames/second (0 to 200) versus Number of IPEU (1, 2, 4, 8, 16) for the Pedestrian area and Crowdrun sequences. Series: dia 16x16, dia 8x8, dia 4x4, hex 16x16, hex 8x8, hex 4x4, UMH 16x16, UMH 8x8, UMH 4x4.]
LMx processor power evaluation

[Figure: FPGA floorplan inside the Virtex-4 ML402 FPGA boundary. Labels: Static area, Dynamic area with ME x1, Dynamic area with ME x2, JTAG, RS232.]

[Figure: system block diagram. Blocks: LEON3 Processor, DSU, Reconfiguration Controller, Reconfigurable Module, AHB Controller, Memory Controller, AHB/APB Bridge, VGA, UART, Timers, I/O port, connected through AMBA AHB and AMBA APB. External interfaces: 8/32-bit memory bus (PROM, I/O, SRAM, SDRAM), Video DAC, RS232, WDOG, 32-bit I/O port. Legend: AHB bus master, AHB bus slave, APB bus slave.]

ME x1: system idle power and system runtime power (mW)

    Frequency (MHz)      0 (static)   30    50    70
    Empty region         425          500   520   528
    Configured region    454          545   576   606
    ME inactive          -            550   601   636
    ME active            -            565   623   660

ME x2: system idle power and system runtime power (mW)

    Frequency (MHz)      0 (static)   30    50    70
    Empty region         427          500   521   532
    Configured region    470          558   616   661
    ME inactive          -            571   640   691
    ME active            -            587   662   714
LMx processor memory and logic complexity
Virtex-4 SX35

    Configuration      LUTs used / available    Memory blocks used / available / minimum memory bits   Critical path (ns) / logic levels
    1 IPEU / 0 FPEU    2259 / 30720 (7.4%)      21 / 192 (10%) / 95 Kbits                              4.976 / 8
    2 IPEU / 0 FPEU    3805 / 30270 (12.6%)     38 / 192 (19%) / 179 Kbits                             5.040 / 8
    3 IPEU / 0 FPEU    5571 / 30270 (18.4%)     55 / 192 (28%) / 263 Kbits                             5.032 / 7
    1 IPEU / 1 FPEU    9143 / 30270 (30.2%)     31 / 192 (16%) / 95+42 Kbits                           4.986 / 6
    2 IPEU / 1 FPEU    10985 / 30270 (36.2%)    48 / 192 (39%) / 179+84 Kbits                          4.996 / 9
Prototype implementation
[Figure: prototype block diagram. External interfaces: PCI, JTAG. Inside the FPGA: PCI Core, AHB, UART Interface, Timer Unit, Interrupt Controller, me_wrapper, Memory Controller, AHB/APB Bridge, APB. Outside the FPGA: On-board memory. Legend: AHB Master, AHB Slave, APB Slave.]
Conclusions
Programmable and configurable processor able to support high-quality, high-definition motion estimation.
Can support H.264, MPEG-4, MPEG-2 and VC-1 motion estimation features (alternative interpolation hardware available).
Compiler tool chain and cycle-accurate simulator are open source at http://sharpeye.borelspace.com
An example of how processor customization in NoRC can be used to obtain a flexible architecture with high performance and low energy requirements.
