Multicore everywhere
Instead of focusing on increasing clock speed, Intel publicly admitted that its gigahertz race was foolish and that the next decade of performance improvements would come from increasing the number of cores per
tolerance (yield/cost)
Test and verification
Energy/Power (static and dynamic)
What can we do about this? The NoRC (network on a reconfigurable chip) concept
Asynchronous network-on-chip
[Figure: NoRC tile. Labels: serial I/O core, embedded processor, local memory, fixed-function IP, static part, internal bus, memory, router.]
Evolution of current FPGA technology towards NoRC?
Partial and dynamic reconfiguration already supported
NoRC functional model: the assignment of processing resources to tasks is determined at run time, following requests for services.
[Figure: NoRC router (R) mesh shown in six panels: initial state, configuration state, configured/steady state, clone state, fault state and reconfigured state. Node labels include IN A, IN B, OUT A, OUT B, MA, MB, SA2, SA3, SA4, SB1, SB2 and SAB1; annotations include "Fault" and "MB reconfigures SB1".]
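The run-time assignment described in the functional model can be illustrated with a toy allocator. This is only a sketch under stated assumptions: the class `NoRCPool`, its methods, and the policy of moving a service to a free tile after a fault are hypothetical and are not taken from the NoRC design itself.

```python
class NoRCPool:
    """Toy model (hypothetical): a pool of reconfigurable tiles that are
    loaded with a service when a task requests it at run time."""

    def __init__(self, n_tiles):
        # tile id -> name of the service currently configured, or None
        self.tiles = {t: None for t in range(n_tiles)}
        self.faulty = set()

    def request(self, service):
        """Return a healthy tile that runs `service`, configuring a free
        tile if no healthy tile offers the service yet."""
        for tile, loaded in self.tiles.items():
            if loaded == service and tile not in self.faulty:
                return tile
        for tile, loaded in self.tiles.items():
            if loaded is None and tile not in self.faulty:
                # Stands in for partial, dynamic reconfiguration of the tile.
                self.tiles[tile] = service
                return tile
        raise RuntimeError("no free tile available for " + service)

    def report_fault(self, tile):
        """Mark a tile as faulty and move its service to another tile.
        Returns the new tile, or None if the faulty tile was empty."""
        service = self.tiles[tile]
        self.faulty.add(tile)
        self.tiles[tile] = None
        return self.request(service) if service is not None else None
```

For example, with three tiles, requesting two services occupies tiles 0 and 1; reporting a fault on tile 0 then reconfigures tile 2 with the displaced service.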
The large number of points tested in motion estimation (ME) means that it can be very costly in terms of time and energy.
Typical point reduction using a simple diamond search.
Lagrangian optimization, multiple reference frames and quarter-pel resolution add further complexity to full-search strategies.
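To make the point-count argument concrete, the sketch below compares an exhaustive search with a small-diamond search on a synthetic frame. It is a minimal illustration, not the search used in the presented hardware: the function names, the 64x64 frame with a single bright square, the 16x16 block and the +/-8 search range are all assumptions chosen so that the SAD surface has a single minimum. On real video a diamond search can stop at a local minimum.

```python
def sad(cur, ref, x, y):
    """Sum of absolute differences between block `cur` and the same-sized
    block of `ref` whose top-left corner is at column x, row y."""
    n = len(cur)
    return sum(abs(cur[j][i] - ref[y + j][x + i])
               for j in range(n) for i in range(n))


def full_search(cur, ref, x0, y0, rng):
    """Test every displacement in [-rng, rng] x [-rng, rng].
    Returns (motion vector, SAD, number of points tested)."""
    best_mv, best_sad, points = None, None, 0
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            cost = sad(cur, ref, x0 + dx, y0 + dy)
            points += 1
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad, points


def diamond_search(cur, ref, x0, y0, rng):
    """Small-diamond search: test the four neighbours of the current
    centre, move to the best one, and stop when the centre wins.
    Points that are revisited are counted again."""
    mv = (0, 0)
    best_sad = sad(cur, ref, x0, y0)
    points = 1
    while True:
        cx, cy = mv
        for ddx, ddy in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            dx, dy = cx + ddx, cy + ddy
            if abs(dx) > rng or abs(dy) > rng:
                continue  # stay inside the search window
            cost = sad(cur, ref, x0 + dx, y0 + dy)
            points += 1
            if cost < best_sad:
                mv, best_sad = (dx, dy), cost
        if mv == (cx, cy):
            return mv, best_sad, points


def make_example():
    """Synthetic reference frame with one bright 8x8 square, and a 16x16
    current block whose true displacement is (3, 2)."""
    ref = [[255 if 30 <= x < 38 and 30 <= y < 38 else 0
            for x in range(64)] for y in range(64)]
    x0, y0, n = 24, 24, 16
    cur = [row[x0 + 3:x0 + 3 + n] for row in ref[y0 + 2:y0 + 2 + n]]
    return cur, ref, x0, y0


if __name__ == "__main__":
    cur, ref, x0, y0 = make_example()
    print("full search   :", full_search(cur, ref, x0, y0, 8))
    print("diamond search:", diamond_search(cur, ref, x0, y0, 8))
```

With a search range of +/-8 the exhaustive search always tests (2*8+1)^2 = 289 points, whereas the diamond search only tests points along its descent path.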
Department of Electrical and Electronic Engineering
Fast motion algorithms perform as well as or better than full search. Lagrangian optimization, Hadamard transform, fractional search and sub-
[Figure: quality (axis range 38 to 43) versus bit rate (Mbits, 3.0 to 7.0). Series: full-pel only; fractional-pel with SAD; fractional-pel with SATD; fractional-pel with SATD and 8x8 blocks; fractional-pel with SATD and all blocks.]
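The difference between the SAD and SATD metrics named in the figure, and the Lagrangian cost that combines distortion with motion-vector rate, can be sketched as follows. This is a hedged illustration: the 4x4 block size, the division of the SATD sum by two, the signed Exp-Golomb rate model and all function names are assumptions, not details taken from the slides.

```python
# 4x4 Hadamard matrix (entries +1/-1, rows mutually orthogonal).
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]


def matmul(a, b):
    """Product of two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def sad(diff):
    """Sum of absolute differences of a 4x4 residual block."""
    return sum(abs(v) for row in diff for v in row)


def satd(diff):
    """Sum of absolute transformed differences: Hadamard-transform the
    residual, then sum absolute coefficients. The division by 2 is a
    common normalisation and is an assumption here."""
    coeffs = matmul(matmul(H4, diff), transpose(H4))
    return sum(abs(v) for row in coeffs for v in row) // 2


def se_bits(v):
    """Length in bits of a signed Exp-Golomb code for integer v
    (assumed rate model for one motion-vector component)."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * ((code_num + 1).bit_length() - 1) + 1


def lagrangian_cost(distortion, mv, pred, lam):
    """J = D + lambda * R, where R is the bit cost of coding the motion
    vector `mv` relative to its predictor `pred`."""
    rate = se_bits(mv[0] - pred[0]) + se_bits(mv[1] - pred[1])
    return distortion + lam * rate
```

A flat residual concentrates into a single transform coefficient, so SATD rates it as cheaper than SAD does, while an isolated impulse spreads over all sixteen coefficients and is rated as more expensive.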
[Block diagram: four point memories (16-bit point addresses) and four address calculators (12-bit reference addresses); other labels: 16-bit current motion vector, 20-bit instructions.]
enables the designer to optimize the hardware for the selected algorithm.
[Block diagram, continued: program memory; fetch, decode, issue; four reference memories; macroblock memories; half-pel interpolator; quarter-pel interpolator; MV cost; Lagrangian optimizer; SAD selectors (8-bit best EU id, 16-bit best SAD); 64-bit current vector; 128-bit quarter-pel interpolated pixels.]
Fig. 2 Complex configuration (four integer-pel execution units, one fractional-pel execution unit, Lagrangian optimizer)
Support of advanced features such as rate-distortion optimization using Lagrangian techniques, sub-partitions, multiple reference frames and fractional search.
[Block diagram: point memories (8-bit point addresses, 8-bit pattern addresses); address calculators (12-bit reference addresses); program memory (20-bit instructions); 16-bit current motion vector; macroblock memories; reference memories; vector alignment units; 128-bit reference vectors and 64-bit current vectors; SAD units (64-bit SAD vector); DIF units (64-bit DIF vector); Hadamard units (32-bit SATD vector); quarter-pel interpolator; MVP, QP and MV inputs; cost accumulator and control (16-bit current SAD, 16-bit current cost, 8-bit EU id); cost selectors and SAD selector (16-bit best SAD, 8-bit best EU id).]
Programming Model
Compiler supports typical constructs such as if-else, for loops, comparison with current SAD values and early termination.
Efficient evaluation of multiple user-defined motion vector candidates, transparently to the rest of the algorithm.
S = 8; // Initial step size
check(0, 0);
check(0, S);
check(0, -S);
check(S, 0);
check(-S, 0);
update;
do {
    S = S / 2;
    for(i = 0 to 4 step 1) {
        check(0, S);
        check(0, -S);
        check(S, 0);
        check(-S, 0);
        update;
        #if( WINID == 0 )
            #break;
    }
} while( S > 1);
Compiled program (chk: integer check pattern instruction; chkjmp: conditional jump instruction):

Addr  Word     Mnemonic  Operands
0     0 05 00  chk       NumPoints: 5   startAddr: 0
1     0 04 05  chk       NumPoints: 4   startAddr: 5
2     2 00 0B  chkjmp    WIN: 0         goto: 11
3     0 04 05  chk       NumPoints: 4   startAddr: 5
...
11    0 04 09  chk       NumPoints: 4   startAddr: 9
12    2 00 15  chkjmp    WIN: 0         goto: 21
...
21    0 04 0D  chk       NumPoints: 4   startAddr: 13
22    2 00 1F  chkjmp    WIN: 0         goto: 31
...
31    1 19 11  chkfr     NumPoints: 25  startAddr: 17
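The instruction words in the listing can be decoded with a few shifts and masks. The sketch below assumes a layout inferred only from the listing and from the stated 20-bit instruction width: a 4-bit opcode followed by two 8-bit fields. The opcode table, field names and function names are illustrative assumptions, not a documented encoding.

```python
# Opcode values as they appear in the first column of the listing
# (assumed meaning: 0 = chk, 1 = chkfr, 2 = chkjmp).
OPCODES = {0: "chk", 1: "chkfr", 2: "chkjmp"}


def parse_listing(text):
    """Turn a listing entry such as '0 05 00' (three hexadecimal groups)
    into a 20-bit integer instruction word."""
    op, field_a, field_b = text.split()
    return (int(op, 16) << 16) | (int(field_a, 16) << 8) | int(field_b, 16)


def decode(word):
    """Split a 20-bit word into mnemonic and operands.
    Assumed layout: bits 19..16 opcode, 15..8 first field, 7..0 second."""
    if not 0 <= word < (1 << 20):
        raise ValueError("instruction word must fit in 20 bits")
    opcode = word >> 16
    field_a = (word >> 8) & 0xFF
    field_b = word & 0xFF
    name = OPCODES[opcode]  # raises KeyError for an unknown opcode
    if name == "chkjmp":
        # Conditional jump: compare the winner id, then jump to an address.
        return name, {"WIN": field_a, "goto": field_b}
    # Check instructions: number of points and their start address
    # in point memory.
    return name, {"NumPoints": field_a, "startAddr": field_b}


if __name__ == "__main__":
    for entry in ("0 05 00", "2 00 0B", "1 19 11"):
        print(entry, "->", decode(parse_listing(entry)))
```

Under this layout the point addresses are contiguous: 5 points at address 0, then blocks of 4 points at addresses 5, 9 and 13, then 25 fractional points at address 17.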
for(i = -0.5 to 0.5 step 0.25)
    for(j = -0.5 to 0.5 step 0.25)
        check(i, j);
update;
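The behaviour of the two search programs can be modelled in software to see how `check`, `update` and `WINID` interact. This is a sketch under explicit assumptions: offsets passed to `check` are taken relative to the current best vector, `update` evaluates the queued candidates and keeps the cheapest, and `WINID` is 0 when the previous best is retained. These semantics, the class `MotionSearch` and the cost function used in the usage example are hypothetical.

```python
class MotionSearch:
    """Software model (assumed semantics) of the check/update primitives."""

    def __init__(self, cost):
        self.cost = cost          # callable: cost(mv_x, mv_y) -> number
        self.best_mv = (0, 0)
        self.best_cost = None
        self.pending = []         # candidates queued since the last update
        self.winid = 0
        self.points = 0           # total candidates evaluated

    def check(self, dx, dy):
        """Queue a candidate at an offset from the current best vector."""
        self.pending.append((self.best_mv[0] + dx, self.best_mv[1] + dy))

    def update(self):
        """Evaluate queued candidates; WINID is the 1-based index of the
        winner, or 0 if none beats the previous best."""
        self.winid = 0
        for index, mv in enumerate(self.pending, start=1):
            value = self.cost(mv[0], mv[1])
            self.points += 1
            if self.best_cost is None or value < self.best_cost:
                self.best_cost, self.best_mv, self.winid = value, mv, index
        self.pending = []


def run_search(cost):
    """Step-halving integer search followed by a quarter-pel refinement."""
    ms = MotionSearch(cost)
    s = 8  # initial step size
    for dx, dy in ((0, 0), (0, s), (0, -s), (s, 0), (-s, 0)):
        ms.check(dx, dy)
    ms.update()
    while True:
        s //= 2
        for _ in range(5):        # for(i = 0 to 4 step 1)
            for dx, dy in ((0, s), (0, -s), (s, 0), (-s, 0)):
                ms.check(dx, dy)
            ms.update()
            if ms.winid == 0:     # early termination: centre still best
                break
        if s <= 1:
            break
    # Fractional search: 5 x 5 = 25 candidates on a quarter-pel grid.
    for i in range(-2, 3):
        for j in range(-2, 3):
            ms.check(i * 0.25, j * 0.25)
    ms.update()
    return ms
```

As a usage example, `run_search(lambda x, y: abs(x - 5.25) + abs(y + 3))` walks towards the minimum of that hypothetical cost surface in integer steps and lets the quarter-pel stage recover the fractional part. The 25 candidates of the final stage match the `NumPoints: 25` operand of the `chkfr` instruction.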
configuration.
[Figure: frames/second (0 to 200) versus number of IPEU (1, 2, 4, 8, 16) for the Pedestrian area and Crowdrun sequences. Series: dia 16x16, dia 8x8, dia 4x4, hex 16x16, hex 8x8, hex 4x4, UMH 16x16, UMH 8x8, UMH 4x4.]
[Figure: panels "Static area" and "Dynamic area"; axis: frequency (MHz) with values 0 (static), 30, 50, 70; legend labels: Empty Region, Configured Region, ME inactive, with ME x2.]
[Figure: system block diagram within the Virtex-4 ML402 FPGA boundary. Labels: LEON3 processor, DSU, reconfiguration controller, reconfigurable module, AMBA AHB, AHB controller, memory controller, AHB/APB bridge, AMBA APB, VGA, UART, timers, I/O port. External labels: 8/32-bit memory bus, video DAC, RS232, WDOG, 32-bit I/O port, PROM, I/O, SRAM, SDRAM.]
Virtex-4 SX35

Configuration     LUTs used / LUTs available   Memory blocks used / available / minimum memory bits   Critical path (ns) / logic levels
1 IPEU / 0 FPEU   2259/30720 (7.4%)            21/192 (10%) / 95 Kbits                                 4.976 / 8
2 IPEU / 0 FPEU   3805/30270 (12.6%)           38/192 (19%) / 179 Kbits                                5.040 / 8
3 IPEU / 0 FPEU   5571/30270 (18.4%)           55/192 (28%) / 263 Kbits                                5.032 / 7
1 IPEU / 1 FPEU   9143/30270 (30.2%)           31/192 (16%) / 95+42 Kbits                              4.986 / 6
2 IPEU / 1 FPEU   10985/30270 (36.2%)          48/192 (39%) / 179+84 Kbits                             4.996 / 9
Prototype implementation
[Figure: prototype block diagram. Labels: PCI, JTAG interface, FPGA, PCI core, AHB UART interface, timer unit, interrupt controller, AHB master, AHB slave, APB slave, me_wrapper, memory controller.]
Conclusions
source at (http://sharpeye.borelspace.com)
An example of how processor customization in NoRC can be used to obtain a flexible architecture with high performance