
Research with Chameleon chips, or chips that can do almost anything

Jose Luis Nunez-Yanez, Department of Electrical and Electronic Engineering

Multicore everywhere
Instead of continuing to raise clock speeds, Intel publicly admitted that its gigahertz race had been foolish and that the next decade of performance improvements would come from increasing the number of cores per processor, not the clock speed.


Possible multicore challenges


- Programming (general purpose?)
- Reliability and fault-tolerance (yield/cost)
- Test and verification
- Energy/power (static and dynamic)


What can we do about this? The NoRC (Network on a Reconfigurable Chip) concept

An asynchronous network-on-chip with fine-grained, dynamically reconfigurable tiles.


[Figure: structure of a NoRC tile. Labels: reconfigurable part (serial I/O core, embedded processor, local memory, fixed-function IP); static part (internal bus, memory, reconfigurable area controller, DVS logic, ICAP and network interface); router; off-chip memory/interface.]


Evolution of current FPGA technology towards NoRC?

Partial and dynamic reconfiguration is already supported in state-of-the-art FPGAs: Xilinx Virtex-4/5/6, and Altera is getting there as well.


NoRC functional model: the assignment of processing resources to tasks is determined at run time following requests for services.

[Figure: six snapshots of a NoRC mesh of routers (R) with tiles labelled MA, MB, SA2, SA3, SA4, SB1, SB2 and SAB1, and ports IN A, IN B, OUT A and OUT B. Panels: Initial state, Configuration state, Configured/Steady state, Clone state, Reconfigured state, Fault state. Annotations: "MB reconfigures SB1", "Fault".]


What do we get out of NoRC?


A pretty non-deterministic system, but:
1. Fault tolerance thanks to dynamic reconfiguration, and therefore better yield.
2. Adaptability to process variations and operating conditions with the GALS (globally asynchronous, locally synchronous) approach and voltage/frequency (V/F) scaling techniques.
3. Well priced, since it can support many different applications by reconfiguration.
4. Energy efficiency thanks to being able to match the architecture to the task instead of using a programmable processor. (I thought FPGAs were terrible at power.)

Example of processor customization : Motion estimation overview


- Motion estimation removes the correlation between video frames and directly influences video quality and compression ratio.
- It consumes a large percentage of video encoding cycles (50-80% in H.264).
- We propose an application-specific architecture and programming language: programmable and configurable.

[Figure: video encoder block diagram. Blocks: video in, next frame, ME, buffer, MC (motion compensation), intra prediction, switch, DCT and Q, Q-1 and IDCT, loop filter, entropy coding, bitstream.]

Execution time distribution (720p pedestrian, baseline, noasm)

    Stage          Share of execution time
    Inter (ME)     60%
    Intra          10%
    DCT            9%
    Quantisation   6%
    CABAC          4%
    Other          11%

[Chart axis: execution time (s), 0 to 200.]
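To put the profile in perspective, Amdahl's law bounds the overall encoder speedup that can be obtained by accelerating motion estimation alone. The sketch below assumes the 60% inter-prediction share from the profile; the function name and the acceleration factors are illustrative and are not measurements from this work.

```python
def overall_speedup(fraction: float, factor: float) -> float:
    """Amdahl's law: speedup of the whole run when `fraction` of the
    execution time is accelerated by `factor`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


ME_SHARE = 0.60  # inter (ME) share of encoding time, taken from the profile

# Illustrative acceleration factors for the ME stage (assumed, not measured).
for factor in (2, 10, 100):
    print(f"ME x{factor:<3} -> encoder x{overall_speedup(ME_SHARE, factor):.2f}")

# Upper bound when ME takes no time at all.
print(f"limit    -> encoder x{1.0 / (1.0 - ME_SHARE):.2f}")
```

With a 60% share the encoder as a whole can never run more than 2.5 times faster, however fast the ME hardware is, so the remaining stages soon dominate.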


Exhaustive and fast motion estimation hardware



- Full-search motion estimation is the preferred approach in hardware solutions.
- The large number of points tested in ME means that it can be very costly in terms of time and energy.
- Lagrangian optimization, multiple reference frames and quarter-pel resolution add further complexity to full-search strategies.

[Figure: typical point reduction using a simple diamond search.]
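The difference in the number of tested points can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the LMx hardware behaviour: the frame content (a bright 6x6 square on a black background), the function names and the search range are all made up for the example, and the fast search is a plain small-diamond descent.

```python
from typing import List, Tuple

Frame = List[List[int]]
Result = Tuple[Tuple[int, int], int, int]  # (motion vector, SAD, points tested)


def sad(cur: Frame, ref: Frame, bx: int, by: int, dx: int, dy: int, n: int) -> int:
    """Sum of absolute differences between the n x n block at (bx, by) in
    `cur` and the block displaced by (dx, dy) in `ref`."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def in_frame(ref: Frame, bx: int, by: int, dx: int, dy: int, n: int) -> bool:
    """True if the displaced block lies completely inside the reference frame."""
    return (0 <= bx + dx and bx + dx + n <= len(ref[0])
            and 0 <= by + dy and by + dy + n <= len(ref))


def full_search(cur: Frame, ref: Frame, bx: int, by: int, n: int, rng: int) -> Result:
    """Test every displacement in a (2*rng+1)^2 window."""
    best, mv, tested = sad(cur, ref, bx, by, 0, 0, n), (0, 0), 1
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            if (dx, dy) == (0, 0) or not in_frame(ref, bx, by, dx, dy, n):
                continue
            tested += 1
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if cost < best:
                best, mv = cost, (dx, dy)
    return mv, best, tested


def diamond_search(cur: Frame, ref: Frame, bx: int, by: int, n: int, rng: int) -> Result:
    """Small diamond: test the four neighbours of the current best point,
    move to the winner, and stop when the centre is still the best."""
    mv, best, tested = (0, 0), sad(cur, ref, bx, by, 0, 0, n), 1
    while True:
        winner = None
        for ox, oy in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            dx, dy = mv[0] + ox, mv[1] + oy
            if max(abs(dx), abs(dy)) > rng or not in_frame(ref, bx, by, dx, dy, n):
                continue
            tested += 1
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if cost < best:
                best, winner = cost, (dx, dy)
        if winner is None:
            return mv, best, tested
        mv = winner


def make_frames(size: int = 32, shift: Tuple[int, int] = (2, 1)) -> Tuple[Frame, Frame]:
    """Synthetic test data: the current frame is the reference frame moved so
    that the true motion vector equals `shift`."""
    ref = [[255 if 13 <= x < 19 and 13 <= y < 19 else 0 for x in range(size)]
           for y in range(size)]
    sx, sy = shift
    cur = [[ref[y + sy][x + sx] if 0 <= y + sy < size and 0 <= x + sx < size else 0
            for x in range(size)] for y in range(size)]
    return cur, ref


cur, ref = make_frames()
full = full_search(cur, ref, 8, 8, 16, 4)
fast = diamond_search(cur, ref, 8, 8, 16, 4)
print("full search:   ", full)
print("diamond search:", fast)
```

On this smooth synthetic input both searches reach the same vector, with the diamond search evaluating far fewer points. On real video the SAD surface has local minima, which is why fast algorithms need carefully designed patterns.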

ME algorithms performance analysis



- Fast motion estimation algorithms perform as well as or better than full search.
- Lagrangian optimization, the Hadamard transform, fractional search and sub-blocks must be supported in a high-performance ME processor.
- Different video sequences benefit from different features.
[Figure: PSNR versus bit rate (Mbits) in two panels, Crowdrun and Pedestrian area. Curves: full-pel only with Lagrangian disabled; full-pel only; fractional-pel with SAD; fractional-pel with SATD; fractional-pel with SATD and 8x8 blocks; fractional-pel with SATD and all blocks.]
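Two of the features compared here, SATD and Lagrangian optimization, are compact enough to sketch in software. The code below is an illustration under assumptions, not the LMx datapath: it uses a 4x4 Hadamard transform with the common halving normalisation, models the motion vector rate with the signed Exp-Golomb code length used by H.264, and takes the Lagrange multiplier `lam` as a free parameter (the slides do not state how the hardware derives it).

```python
from typing import List, Tuple

Block = List[List[int]]

# 4x4 Hadamard matrix (symmetric, so it equals its own transpose).
H4: Block = [[1, 1, 1, 1],
             [1, 1, -1, -1],
             [1, -1, -1, 1],
             [1, -1, 1, -1]]


def matmul(a: Block, b: Block) -> Block:
    """Plain 4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd4x4(cur: Block, ref: Block) -> int:
    """Sum of absolute transformed differences of two 4x4 blocks."""
    diff = [[c - r for c, r in zip(cur_row, ref_row)]
            for cur_row, ref_row in zip(cur, ref)]
    transformed = matmul(matmul(H4, diff), H4)
    return sum(abs(v) for row in transformed for v in row) // 2


def se_bits(value: int) -> int:
    """Length in bits of the signed Exp-Golomb code for `value`."""
    code_num = 2 * abs(value) - (1 if value > 0 else 0)
    return 2 * ((code_num + 1).bit_length() - 1) + 1


def rd_cost(distortion: int, mv: Tuple[int, int], pred: Tuple[int, int], lam: int) -> int:
    """Lagrangian cost J = D + lambda * R, with R the bits needed to code the
    motion vector difference against the predictor."""
    rate = se_bits(mv[0] - pred[0]) + se_bits(mv[1] - pred[1])
    return distortion + lam * rate


flat_a = [[10] * 4 for _ in range(4)]
flat_b = [[9] * 4 for _ in range(4)]
print("SATD of a constant offset of 1:", satd4x4(flat_a, flat_b))
print("cost of MV (2, 1), predictor (0, 0):", rd_cost(100, (2, 1), (0, 0), lam=4))
```

The Lagrangian term penalises candidates that are cheap in distortion but expensive to code, which is one reason a fast search guided by the predictor can match or beat an unconstrained full search in rate-distortion terms.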


LiquidMotion (LMx) Motion Estimation Processor features

- Highly configurable architecture enables the designer to optimize the hardware for the selected algorithm.
- Intuitive and easy programming using a C-like syntax.
- Support of advanced features such as rate-distortion optimization using Lagrangian techniques, sub-partitions, multiple reference frames and fractional search.

[Fig. 1: Base configuration (one integer-pel execution unit). Blocks: program memory; fetch, decode, issue; point memory; address calculator; reference memory; macroblock memory; vector alignment unit; integer-pel execution unit; SAD selector. Signals: 8-bit fetch addresses, 20-bit instructions, 8-bit point addresses, 8-bit pattern addresses, 16-bit current motion vector, 12-bit reference addresses, 128-bit reference vector, 64-bit current vector, 64-bit reference vector, 16-bit current SAD, 8-bit EU id, 16-bit best SAD, 8-bit best EU id.]

[Fig. 2: Complex configuration (four integer-pel execution units, one fractional-pel execution unit, Lagrangian optimizer). Blocks: program memory; fetch, decode, issue; four sets of point memory, address calculator, reference memory, vector alignment unit and integer-pel execution unit; macroblock memory; half-pel interpolator; half-pel reference memory; quarter-pel interpolator; fractional-pel execution unit; MV cost and Lagrangian optimizer; SAD selectors. Additional signals: 16-bit point addresses, 64-bit and 128-bit half-pel interpolated pixels, 128-bit quarter-pel interpolated pixels, 64-bit interpolated vector.]


LMx processor microarchitecture


[Figure: LMx processor microarchitecture. Blocks shown: program memory; fetch, decode, issue; point memories; address calculators; reference memories; macroblock memory; vector alignment units; SAD units; COST accumulator and control units; COST selectors; half-pel interpolation; half-pel reference memory; quarter-pel interpolators; DIF units; Hadamard units; motion vector cost (inputs MVP, QP, MV). Signals: 8-bit fetch addresses, 20-bit instructions, 8-bit point and pattern addresses, 16-bit current motion vector, 12-bit reference addresses, 128-bit reference vectors, 64-bit current vectors, 64-bit SAD vector, 64-bit DIF vector, 32-bit SATD vector, 16-bit current cost, 8-bit EU id, 16-bit best cost, 8-bit best EU id.]

Toolset (Compiler + CAS + design space exploration)


[Figure: toolset screenshots. Panels: ME algorithm source code; processor configuration window; selected configuration details; result plots; exploration of design space; hardware configuration selection and RTL export.]


Programming Model
- The compiler supports typical constructs such as if-else, for loops, comparison with current SAD values and early termination.
- Efficient evaluation of multiple user-defined motion vector candidates, transparently to the rest of the algorithm.
- Binary compatibility, so that once an algorithm has been compiled it can be executed on any hardware configuration.

EstimoC source code:

    S = 8; // Initial step size
    check(0, 0);
    check(0, S);
    check(0, -S);
    check(S, 0);
    check(-S, 0);
    update;
    do
    {
        S = S / 2;
        for(i = 0 to 4 step 1)
        {
            check(0, S);
            check(0, -S);
            check(S, 0);
            check(-S, 0);
            update;
            #if( WINID == 0 )
                #break;
        }
    } while( S > 1);
    for(i = -0.5 to 0.5 step 0.25)
        for(j = -0.5 to 0.5 step 0.25)
            check(i, j);
    update;

Compiler output (address, instruction word, disassembly):

     0   0 05 00   chk     NumPoints: 5   startAddr: 0    (integer check pattern instruction)
     1   0 04 05   chk     NumPoints: 4   startAddr: 5
     2   2 00 0B   chkjmp  WIN: 0         goto: 11        (conditional jump instruction)
     3   0 04 05   chk     NumPoints: 4   startAddr: 5
     ...
    11   0 04 09   chk     NumPoints: 4   startAddr: 9
    12   2 00 15   chkjmp  WIN: 0         goto: 21
     ...
    21   0 04 0D   chk     NumPoints: 4   startAddr: 13
    22   2 00 1F   chkjmp  WIN: 0         goto: 31
     ...
    31   1 19 11   chkfr   NumPoints: 25  startAddr: 17   (fractional check pattern instruction)

EstimoC source code and compiler output
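The compiler listing suggests a simple instruction layout, which the following sketch decodes. The field layout is an assumption inferred from the listing and from the 20-bit instruction width shown in the architecture figures: a 4-bit opcode followed by two 8-bit operands. It is not a documented instruction set, and the names `decode` and `describe` are illustrative.

```python
from typing import NamedTuple

# Opcode values as they appear in the listing (assumed to be complete here).
OPCODES = {0: "chk", 1: "chkfr", 2: "chkjmp"}


class Instr(NamedTuple):
    mnemonic: str
    a: int  # NumPoints for chk/chkfr, winner id for chkjmp
    b: int  # point-memory start address, or jump target for chkjmp


def decode(word: int) -> Instr:
    """Split a 20-bit word into opcode (bits 19-16) and two 8-bit operands."""
    if not 0 <= word < (1 << 20):
        raise ValueError("instruction word must fit in 20 bits")
    opcode = (word >> 16) & 0xF
    if opcode not in OPCODES:
        raise ValueError(f"unknown opcode {opcode}")
    return Instr(OPCODES[opcode], (word >> 8) & 0xFF, word & 0xFF)


def describe(word: int) -> str:
    """Render a word in the style of the compiler listing."""
    ins = decode(word)
    if ins.mnemonic == "chkjmp":
        return f"chkjmp WIN: {ins.a} goto: {ins.b}"
    return f"{ins.mnemonic} NumPoints: {ins.a} startAddr: {ins.b}"


# Words taken from the listing: "0 05 00", "2 00 0B" and "1 19 11".
for word in (0x00500, 0x2000B, 0x11911):
    print(f"{word:05X}  {describe(word)}")
```

Under this reading the point memory holds the check patterns back to back (5 points at address 0, then 4 points each at 5, 9 and 13, then 25 fractional points at 17), and the program itself only refers to them by start address and length.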


LMx processor performance evaluation


- Video sequences: 1080p crowdrun and pedestrian sequences from the SVT High Definition Multi Format Test Set (ftp://vqeg.its.bldrdoc.gov/HDTV/SVT_MultiFormat/1080p50_CgrLevels_SINC_FILTER_SVTdec05_/).
- FPGA part: Virtex-4 SX35 at a 200 MHz clock frequency.

[Figure: frames/second (0 to 200) versus number of IPEU (1, 2, 4, 8, 16) in two panels, Pedestrian area and Crowdrun. Series: dia 16x16, dia 8x8, dia 4x4, hex 16x16, hex 8x8, hex 4x4, UMH 16x16, UMH 8x8, UMH 4x4.]
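A quick way to read frame-rate figures like these is to convert them into a cycle budget per macroblock. The sketch below does that arithmetic for the 200 MHz clock quoted above. It assumes 16x16 macroblocks processed one after another and ignores memory transfer overheads; the function names and the chosen frame rates are illustrative.

```python
import math


def macroblocks_per_frame(width: int, height: int, mb_size: int = 16) -> int:
    """Number of macroblocks, rounding partial blocks up at the frame edges."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)


def cycles_per_macroblock(clock_hz: float, width: int, height: int, fps: float) -> float:
    """Clock cycles available per macroblock at a target frame rate."""
    return clock_hz / (fps * macroblocks_per_frame(width, height))


CLOCK_HZ = 200e6  # clock frequency quoted for the Virtex-4 SX35 implementation

print("macroblocks in a 1080p frame:", macroblocks_per_frame(1920, 1080))
for fps in (25, 50):
    budget = cycles_per_macroblock(CLOCK_HZ, 1920, 1080, fps)
    print(f"{fps} frames/s -> about {budget:.0f} cycles per macroblock")
```

The budget has to cover every search point tested for the macroblock, which is why the number of execution units and the choice of search pattern both show up directly in the achievable frame rate.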


LMx processor power evaluation


[Figure: test system inside the Virtex-4 ML402 FPGA boundary. Static area: LEON3 processor, DSU, reconfiguration controller, AHB controller, memory controller, AHB/APB bridge, VGA, UART, timers and I/O port on the AMBA AHB and AMBA APB buses. Dynamic area: reconfigurable module, shown with ME x1 and with ME x2. External connections: JTAG, RS232, video DAC, WDOG, 32-bit I/O port, and an 8/32-bit memory bus to PROM, I/O, SRAM and SDRAM. Legend: AHB bus master, AHB bus slave, APB bus slave.]

ME x1, power in mW:

                      System idle power                   System runtime power
    Frequency (MHz)   Empty region   Configured region    ME inactive   ME active
    0 (static)        425            454                  -             -
    30                500            545                  550           565
    50                520            576                  601           623
    70                528            606                  636           660

ME x2, power in mW:

                      System idle power                   System runtime power
    Frequency (MHz)   Empty region   Configured region    ME inactive   ME active
    0 (static)        427            470                  -             -
    30                500            558                  571           587
    50                521            616                  640           662
    70                532            661                  691           714


LMx processor memory and logic complexity

Virtex-4 SX35

    Configuration     LUTs used / available   Memory blocks used / available   Minimum memory bits   Critical path (ns) / logic levels
    1 IPEU / 0 FPEU   2259/30720 (7.4%)       21/192 (10%)                     95 Kbits              4.976 / 8
    2 IPEU / 0 FPEU   3805/30270 (12.6%)      38/192 (19%)                     179 Kbits             5.040 / 8
    3 IPEU / 0 FPEU   5571/30270 (18.4%)      55/192 (28%)                     263 Kbits             5.032 / 7
    1 IPEU / 1 FPEU   9143/30270 (30.2%)      31/192 (16%)                     95+42 Kbits           4.986 / 6
    2 IPEU / 1 FPEU   10985/30270 (36.2%)     48/192 (25%)                     179+84 Kbits          4.996 / 9


Prototype implementation
[Figure: prototype implementation. Blocks inside the FPGA: PCI core, AHB UART interface, timer unit, interrupt controller, me_wrapper, memory controller and AHB/APB bridge. External connections: PCI, JTAG interface and on-board memory. Legend: AHB master, AHB slave, APB slave.]


Conclusions
- Programmable and configurable processor able to support high-quality, high-definition motion estimation.
- Can support H.264, MPEG-4, MPEG-2 and VC-1 motion estimation features (alternative interpolation hardware available).
- Compiler tool chain and cycle-accurate simulator are open source at http://sharpeye.borelspace.com
- An example of how processor customization in NoRC can be used to obtain a flexible architecture with high performance and low energy requirements.
