
R-Stream: Enabling efficient development of portable, high-performance, parallel applications

Peter Mattson, Allen Leung, Ken Mackenzie, Eric Schweitz, Peter Szilagyi, Richard Lethin ({mattson, leunga, kenmac, schweitz, szilagyi, lethin}@reservoir.com)

R-Stream Compiler
Reservoir Labs Inc.

Goals and Benefits

A portable, consistently high-performance compiler for streaming applications on morphable architectures.

Relative to fixed-function hardware, R-Stream with a morphable architecture yields:
- Cost savings
- Dynamic mission changes
- Shorter time to deployment
- In-mission adaptation
- Multipurpose hardware
- Multi-sensor support
- More advanced algorithms
- Less costly bug fixes and upgrades

Streaming Compilation

Streaming is a pattern found in efficient implementations of computation- and data-intensive applications. It has three key characteristics:
- Processing is largely parallel
- Data access patterns are apparent
- Control is high-level, steady, and simple

The pattern arises from the intersection of the application domain and architecture limits.

Application domain:
- Applications process, simulate, or render physical systems
- Independence of spatially and/or temporally distributed data
- Continuous, N-dimensional spatial/temporal data
- Few tasks per application, few unpredictable events

Architecture limits:
- VLIW, SIMD, and multiprocessor demand for parallelism
- Small local memories
- Handling unpredictability in hardware is expensive

[Diagram: application domain and architecture limits intersect to produce streaming]

Streaming languages and architectures designed for this pattern are emerging:
- Streaming languages enforce the pattern, making parallelism and data flow apparent
- Streaming architectures provide maximum parallel resources and minimize control overhead by exposing resources to the compiler

R-Stream uses these to expand the scope of optimization and resource choreography:
- Streaming languages avoid limits on compiler analysis (e.g., aliasing), enabling top-down optimizations
- Streaming architectures allow the compiler to lay out all computation, data, and communication

[Diagram: optimization scope widens from classical single-loop-nest optimization (CSE etc.; ALUs, registers, cache) through multiple loop nests (local memory) to the whole program (multiprocessor, routing)]

Implementation Plan

Spiral development:
- Helps to ensure eventual results
- Enables early usage by partners

Three major releases:
- Release 1.0 (2003 Q4): functionality, but no performance optimization
- Release 2.0 (2004 Q3): common-case, phase-ordered performance optimization; static morphing
- Release 3.0 (2005 Q2): general, unified performance optimization; dynamic morphing

[Schedule chart: quarterly status of the front end, streaming IR, mapping, and code generation components, progressing from non-existent through discardable draft and draft to final]

Compiler Flow

Input languages: Brook and StreamIt. The examples below are Brook.

Kernels perform computations on streams. This kernel computes a pair-wise sum; in StreamIt terms, it represents s3.push(s1.pop() + s2.pop()):

    void kernel streamAdd(stream float s1, stream float s2, out stream float s3) {
        s3 = s1 + s2;
    }

A stream is 1D, but its elements can be arrays. This kernel averages a stream of 3x3 arrays:

    void kernel weightedSum(stream float image_in[3][3], out stream float image_out) {
        image_out = 1.0 / 9.0 * (image_in[0][0] + image_in[0][1] + image_in[0][2] +
                                 image_in[1][0] + image_in[1][1] + image_in[1][2] +
                                 image_in[2][0] + image_in[2][1] + image_in[2][2]);
    }

    int brookMain() {
        float image[100][100][3][3];
        float imageOut[100][100];
        stream float st[3][3];  /* a stream of 3x3 arrays */
        stream float stOut;
        stream float stOutDouble;
        /* stream operator reads from array into stream */
        streamRead(st, image, 0, 99, 0, 99);
        weightedSum(st, stOut);
        /* streamAdd kernel used to double the stream */
        streamAdd(stOut, stOut, stOutDouble);
        streamWrite(imageOut, stOutDouble, 0, 99, 0, 99);
    }
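As a plain-C sketch of what the two Brook kernels above compute per stream element (the helper names stream_add_elem and weighted_sum_elem are hypothetical, not Brook API; Brook applies the kernel to every element in parallel, while this only shows the scalar semantics):

```c
#include <assert.h>

/* streamAdd semantics: pair-wise sum of corresponding elements */
float stream_add_elem(float s1, float s2) {
    return s1 + s2;
}

/* weightedSum semantics: average of one 3x3 element */
float weighted_sum_elem(const float in[3][3]) {
    float acc = 0.0f;
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            acc += in[r][c];
    return acc / 9.0f;
}
```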

Front end

The front end parses the Brook or StreamIt source and extracts the streaming IR graph.

Mapper

The mapper targets the PCA machine model for the target architecture (processors, DMA engines, local memories, caches, and global memory). The simple 1.0 algorithm, shown on a small example graph:

0. Extract graph
1. Unfold graph
2. Cluster based on edge/node weights
3. Convert communication
4. Schedule and assign storage

[Diagram: the example graph (splitter, weightedSum, and streamAdd kernels) is unfolded into weightedSum1/weightedSum2, splitter1/splitter2, and streamAdd1/streamAdd2; the nodes are clustered onto Proc0 and Proc1; communication is converted into DMA transfers (DMA1-DMA4) through Mem0, Mem1, the cache, and GlobalMem; finally kernels and transfers are scheduled per processor and DMA engine, and storage is assigned]
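Mapper step 2 can be sketched as a greedy choice over edge weights: co-locate the kernel pair joined by the heaviest stream edge so the costliest communication stays inside one processor. This is a toy illustration, not R-Stream's actual clustering algorithm or data structures; the graph encoding and weights are invented:

```c
#include <assert.h>

#define NKERNELS 4

/* edge[i][j] = stream traffic between kernels i and j (0 = no edge).
 * Finds the heaviest edge, the first candidate pair to co-locate. */
static int heaviest_edge(int edge[NKERNELS][NKERNELS],
                         int *out_i, int *out_j) {
    int best = 0;
    for (int i = 0; i < NKERNELS; i++)
        for (int j = i + 1; j < NKERNELS; j++)
            if (edge[i][j] > best) {
                best = edge[i][j];
                *out_i = i;
                *out_j = j;
            }
    return best;
}
```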
Code generation: Streaming Virtual Machine (SVM) code

Streams are mapped to memories, kernels are mapped to processors, and the whole application is strip-mined:

    streamInitRAM(&(ts_1_0), MEM1, 0, 512, sizeof(float[3][3]), (STREAM_ONE_EOS));
    streamInitRAM(&(ts_1_18432), MEM1, 18432, 512, sizeof(float), (STREAM_ONE_EOS));
    ...
    int hlc_loop = 0;
    /* whole application strip-mined */
    for (; hlc_loop < 10000; hlc_loop += 512) {
        int hlc_smc = min(10000 - hlc_loop, 512);
        int hlc_seteos = hlc_loop + hlc_smc == 10000;
        kernelWait(&mv7.HLC_kernel);
        moveInit(&mv7, DMA2, &streamIn, &ts_1_0, hlc_smc, hlc_seteos);
        kernelAddDependence(&mv7.HLC_kernel, &mv8.HLC_kernel);
        kernelRun(&mv7.HLC_kernel);
        /* weightedSum_2 */
        kernelWait(&weightedSum_2.HLC_kernel);
        HLC_INIT_weightedSum(&weightedSum_2, PROC1, (Block*)0, &ts_1_0, &ts_1_18432);
        kernelAddDependence(&weightedSum_2.HLC_kernel, &streamAdd_5.HLC_kernel);
        kernelRun(&weightedSum_2.HLC_kernel);
        /* mv6 */
        kernelWait(&mv6.HLC_kernel);
        moveInit(&mv6, DMA2, &streamIn, &ts_0_0, hlc_smc, hlc_seteos);
        kernelAddDependence(&mv6.HLC_kernel, &mv7.HLC_kernel);
        kernelRun(&mv6.HLC_kernel);
        /* splitter_4 */
        kernelWait(&splitter_4.HLC_kernel);
        HLC_INIT_splitter(&splitter_4, PROC1, (Block*)20480, &ts_1_18432, &ts_1_20480, &ts_1_22528);
        kernelAddDependence(&splitter_4.HLC_kernel, &weightedSum_2.HLC_kernel);
        kernelRun(&splitter_4.HLC_kernel);
        /* streamAdd_5 */
        kernelWait(&streamAdd_5.HLC_kernel);
        HLC_INIT_streamAdd(&streamAdd_5, PROC1, (Block*)24576, &ts_1_20480, &ts_1_22528, &ts_1_24576);
        kernelAddDependence(&streamAdd_5.HLC_kernel, &splitter_4.HLC_kernel);
        kernelRun(&streamAdd_5.HLC_kernel);
        ...
    }
    ...
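The strip-mining pattern in the generated code above (the hlc_loop / hlc_smc / hlc_seteos idiom) can be isolated in a minimal sketch: a 10000-element stream is processed in 512-element strips so each strip fits in local memory, with min() handling the ragged last strip. The function strip_mine is a hypothetical stand-in for the DMA moves and kernel launches:

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Walks the stream in fixed-size strips; returns the number of
 * strips issued and reports total elements covered. */
static int strip_mine(int total, int strip, int *elems_seen) {
    int strips = 0;
    *elems_seen = 0;
    for (int base = 0; base < total; base += strip) {
        int count = min_int(total - base, strip);  /* hlc_smc   */
        int is_last = (base + count == total);     /* hlc_seteos */
        (void)is_last;  /* would mark end-of-stream on the final strip */
        *elems_seen += count;
        strips++;
    }
    return strips;
}
```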

The SVM code is then compiled by a low-level compiler for the target architecture into a binary executable.

Technologies

Streaming Intermediate Representation (IR)

- Represents an application within the compiler as a series of kernels or loop nests connected by streams or arrays
- Enables the compiler to directly analyze and optimize streaming applications
- Makes task, pipeline, and data parallelism readily apparent

[Diagram: kernels/loop nests as nodes and streams/arrays as edges, illustrating data parallelism (one kernel replicated across elements), task parallelism (independent kernels side by side), and pipeline parallelism (kernels chained A -> B -> C -> D)]
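The parallelism the streaming IR exposes can be read off the kernel graph. A toy sketch (the encoding is illustrative, not R-Stream's IR): kernels are nodes, streams are directed edges, and two kernels are task-parallel exactly when neither reaches the other through stream edges, while a chain is pipeline-parallel:

```c
#include <assert.h>

#define NK 4  /* kernels A=0, B=1, C=2, D=3 */

/* Fill reach[i][j] with the transitive closure of the stream edges
 * (Floyd-Warshall style). */
static void close_graph(int edge[NK][NK], int reach[NK][NK]) {
    for (int i = 0; i < NK; i++)
        for (int j = 0; j < NK; j++)
            reach[i][j] = edge[i][j];
    for (int k = 0; k < NK; k++)
        for (int i = 0; i < NK; i++)
            for (int j = 0; j < NK; j++)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = 1;
}

/* Task-parallel: no dependence in either direction. */
static int task_parallel(int reach[NK][NK], int a, int b) {
    return !reach[a][b] && !reach[b][a];
}
```

For the diamond graph A -> B, A -> C, B -> D, C -> D, the two branch kernels B and C are task-parallel, while A and D are not.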

Top-down, unified streaming IR mapping

- Mapping will include top-down optimizations, such as software-pipelining an entire application to cover communication latency
- Mapping of computation, data, and communication layout will be performed by a single, unified pass for greater efficiency
- Mapping will exploit all degrees of parallelism exposed by the streaming IR, and edges in the streaming IR become communication or storage

[Diagram: space-time layouts of kernels A-D mapped data-parallel, task-parallel, pipelined, and in combination]
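Why software-pipelining the whole application covers communication latency can be shown with a back-of-envelope cost model (cycle counts and both functions are invented for illustration): serially, every strip pays DMA then compute; pipelined with double buffering, the DMA of strip i+1 overlaps the compute of strip i, so the steady-state cost per strip is only the larger of the two:

```c
#include <assert.h>

/* Serial schedule: each strip pays DMA followed by compute. */
static int serial_cycles(int strips, int dma, int compute) {
    return strips * (dma + compute);
}

/* Pipelined schedule: one DMA to fill the pipeline, then each strip
 * costs max(dma, compute) in steady state. */
static int pipelined_cycles(int strips, int dma, int compute) {
    int steady = dma > compute ? dma : compute;
    return dma + strips * steady;
}
```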

Compiler-driven architecture morphing

Morphable architectures reconfigure high-level resources such as processors, memories, and networks to meet application demands (e.g., computation -> processor, data -> memory).

The compiler must choose a configuration and map to that configuration; this expands the compiler's mapping space from space and time to space, time, and configuration. The compiler can employ one or multiple configurations:

1. Static morphing: pick one configuration and mapping for the whole application.
2. Dynamic morphing: pick multiple configurations and mappings for different parts of the application, and orchestrate morphing between them.

[Diagram: space-time mappings of kernels A-D under a single configuration versus a sequence of configurations]
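The static-vs-dynamic trade-off can be framed as a toy cost model (all numbers, names, and the per-phase cost matrix are invented for illustration, not R-Stream's cost functions): static morphing picks the single configuration that is cheapest over the whole run, while dynamic morphing picks the best configuration per phase and pays a reconfiguration cost at phase boundaries:

```c
#include <assert.h>

#define NPHASES 2
#define NCONFIGS 2

/* Static morphing: best single configuration across all phases. */
static int static_cost(const int cost[NCONFIGS][NPHASES]) {
    int best = -1;
    for (int c = 0; c < NCONFIGS; c++) {
        int total = 0;
        for (int p = 0; p < NPHASES; p++) total += cost[c][p];
        if (best < 0 || total < best) best = total;
    }
    return best;
}

/* Dynamic morphing: best configuration per phase, plus a morph cost
 * at each phase boundary (pessimistically, every boundary morphs). */
static int dynamic_cost(const int cost[NCONFIGS][NPHASES], int morph) {
    int total = 0;
    for (int p = 0; p < NPHASES; p++) {
        int best = cost[0][p];
        for (int c = 1; c < NCONFIGS; c++)
            if (cost[c][p] < best) best = cost[c][p];
        total += best;
    }
    return total + (NPHASES - 1) * morph;
}
```

Dynamic morphing wins when phases favor different configurations by more than the morphing cost.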