
R-Stream: Enabling efficient development of portable, high-performance, parallel applications

Peter Mattson, Allen Leung, Ken Mackenzie, Eric Schweitz, Peter Szilagyi, Richard Lethin ({mattson, leunga, kenmac, schweitz, szilagyi, lethin}@reservoir.com)

R-Stream Compiler
Reservoir Labs Inc.

Goals and Benefits

A portable, consistently high-performance compiler for streaming applications on morphable architectures.

Relative to fixed-function hardware, R-Stream with a morphable architecture yields:
- Cost savings
- Dynamic mission changes
- Shorter time to deployment
- In-mission adaptation
- Multipurpose hardware
- Multi-sensor support
- More advanced algorithms
- Less costly bug fixes and upgrades

Streaming Compilation

Streaming is a pattern found in efficient implementations of computation- and data-intensive applications. It has three key characteristics:
- Processing is largely parallel
- Data access patterns are apparent
- Control is high-level, steady, and simple

The pattern arises from the intersection of the application domain and architecture limits.

Application domain:
- Applications process, simulate, or render physical systems
- Independence of spatially and/or temporally distributed data
- Continuous, N-dimensional spatial/temporal data
- Few tasks per application, few unpredictable events

Architecture limits:
- VLIW, SIMD, and multiprocessor demand for parallelism
- Small local memories
- Handling unpredictability in hardware is expensive

[Diagram: application domain and architecture limits intersect to produce streaming]

Streaming languages and architectures designed for this pattern are emerging:
- Streaming languages enforce the pattern, making parallelism and data flow apparent
- Streaming architectures provide maximum parallel resources and minimize control overhead by exposing resources to the compiler

R-Stream uses these to expand the scope of optimization and resource choreography:
- Streaming languages avoid limits on compiler analysis (e.g., aliasing), enabling top-down optimizations
- Streaming architectures allow the compiler to lay out all computation, data, and communication

[Diagram: optimization scope widens from classical single-loop-nest optimization (CSE etc.; ALUs, registers, cache) through multiple loop nests (local memory) to the whole program (multiprocessor, routing)]

Implementation Plan

Spiral development:
- Helps to ensure eventual results
- Enables early usage by partners

Three major releases:
- Release 1.0 (2003 Q4): functionality, but no performance optimization
- Release 2.0 (2004 Q3): common-case, phase-ordered performance optimization; static morphing
- Release 3.0 (2005 Q2): general, unified performance optimization; dynamic morphing

[Schedule chart: quarterly status of the front end, streaming IR, mapping, and code generation components, progressing from non-existent through discardable draft and draft to final]

Compiler Flow

Input languages: Brook and StreamIt. The examples below are Brook.

Kernels perform computations on streams. This kernel computes a pair-wise sum; in StreamIt terms, it represents s3.push(s1.pop() + s2.pop()):

    void kernel streamAdd(stream float s1, stream float s2, out stream float s3) {
        s3 = s1 + s2;
    }

A stream is 1D, but its elements can be arrays. This kernel averages a stream of 3x3 arrays:

    void kernel weightedSum(stream float image_in[3][3], out stream float image_out) {
        image_out = 1.0 / 9.0 * (image_in[0][0] + image_in[0][1] + image_in[0][2] +
                                 image_in[1][0] + image_in[1][1] + image_in[1][2] +
                                 image_in[2][0] + image_in[2][1] + image_in[2][2]);
    }

    int brookMain() {
        float image[100][100][3][3];
        float imageOut[100][100];
        stream float st[3][3];  /* a stream of 3x3 arrays */
        stream float stOut;
        stream float stOutDouble;
        /* stream operator reads from array into stream */
        streamRead(st, image, 0, 99, 0, 99);
        weightedSum(st, stOut);
        /* streamAdd kernel used to double the stream */
        streamAdd(stOut, stOut, stOutDouble);
        streamWrite(imageOut, stOutDouble, 0, 99, 0, 99);
    }
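As a plain-C sketch of what the two Brook kernels above compute per stream element (the helper names stream_add_elem and weighted_sum_elem are hypothetical, not Brook API; Brook applies the kernel to every element in parallel, while this only shows the scalar semantics):

```c
#include <assert.h>

/* streamAdd semantics: pair-wise sum of corresponding elements */
float stream_add_elem(float s1, float s2) {
    return s1 + s2;
}

/* weightedSum semantics: average of one 3x3 element */
float weighted_sum_elem(const float in[3][3]) {
    float acc = 0.0f;
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            acc += in[r][c];
    return acc / 9.0f;
}
```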

Front end

The front end parses the Brook or StreamIt source and extracts the streaming IR graph.

Mapper

The mapper targets the PCA machine model for the target architecture (processors, DMA engines, local memories, caches, and global memory). The simple 1.0 algorithm, shown on a small example graph:

0. Extract graph
1. Unfold graph
2. Cluster based on edge/node weights
3. Convert communication
4. Schedule and assign storage

[Diagram: the example graph (splitter, weightedSum, and streamAdd kernels) is unfolded into weightedSum1/weightedSum2, splitter1/splitter2, and streamAdd1/streamAdd2; the nodes are clustered onto Proc0 and Proc1; communication is converted into DMA transfers (DMA1-DMA4) through Mem0, Mem1, the cache, and GlobalMem; finally kernels and transfers are scheduled per processor and DMA engine, and storage is assigned]
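Mapper step 2 can be sketched as a greedy choice over edge weights: co-locate the kernel pair joined by the heaviest stream edge so the costliest communication stays inside one processor. This is a toy illustration, not R-Stream's actual clustering algorithm or data structures; the graph encoding and weights are invented:

```c
#include <assert.h>

#define NKERNELS 4

/* edge[i][j] = stream traffic between kernels i and j (0 = no edge).
 * Finds the heaviest edge, the first candidate pair to co-locate. */
static int heaviest_edge(int edge[NKERNELS][NKERNELS],
                         int *out_i, int *out_j) {
    int best = 0;
    for (int i = 0; i < NKERNELS; i++)
        for (int j = i + 1; j < NKERNELS; j++)
            if (edge[i][j] > best) {
                best = edge[i][j];
                *out_i = i;
                *out_j = j;
            }
    return best;
}
```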
Code generation: Streaming Virtual Machine (SVM) code

Streams are mapped to memories, kernels are mapped to processors, and the whole application is strip-mined:

    streamInitRAM(&(ts_1_0), MEM1, 0, 512, sizeof(float[3][3]), (STREAM_ONE_EOS));
    streamInitRAM(&(ts_1_18432), MEM1, 18432, 512, sizeof(float), (STREAM_ONE_EOS));
    ...
    int hlc_loop = 0;
    /* whole application strip-mined */
    for (; hlc_loop < 10000; hlc_loop += 512) {
        int hlc_smc = min(10000 - hlc_loop, 512);
        int hlc_seteos = hlc_loop + hlc_smc == 10000;
        kernelWait(&mv7.HLC_kernel);
        moveInit(&mv7, DMA2, &streamIn, &ts_1_0, hlc_smc, hlc_seteos);
        kernelAddDependence(&mv7.HLC_kernel, &mv8.HLC_kernel);
        kernelRun(&mv7.HLC_kernel);
        /* weightedSum_2 */
        kernelWait(&weightedSum_2.HLC_kernel);
        HLC_INIT_weightedSum(&weightedSum_2, PROC1, (Block*)0, &ts_1_0, &ts_1_18432);
        kernelAddDependence(&weightedSum_2.HLC_kernel, &streamAdd_5.HLC_kernel);
        kernelRun(&weightedSum_2.HLC_kernel);
        /* mv6 */
        kernelWait(&mv6.HLC_kernel);
        moveInit(&mv6, DMA2, &streamIn, &ts_0_0, hlc_smc, hlc_seteos);
        kernelAddDependence(&mv6.HLC_kernel, &mv7.HLC_kernel);
        kernelRun(&mv6.HLC_kernel);
        /* splitter_4 */
        kernelWait(&splitter_4.HLC_kernel);
        HLC_INIT_splitter(&splitter_4, PROC1, (Block*)20480, &ts_1_18432, &ts_1_20480, &ts_1_22528);
        kernelAddDependence(&splitter_4.HLC_kernel, &weightedSum_2.HLC_kernel);
        kernelRun(&splitter_4.HLC_kernel);
        /* streamAdd_5 */
        kernelWait(&streamAdd_5.HLC_kernel);
        HLC_INIT_streamAdd(&streamAdd_5, PROC1, (Block*)24576, &ts_1_20480, &ts_1_22528, &ts_1_24576);
        kernelAddDependence(&streamAdd_5.HLC_kernel, &splitter_4.HLC_kernel);
        kernelRun(&streamAdd_5.HLC_kernel);
        ...
    }
    ...
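The strip-mining pattern in the generated code above (the hlc_loop / hlc_smc / hlc_seteos idiom) can be isolated in a minimal sketch: a 10000-element stream is processed in 512-element strips so each strip fits in local memory, with min() handling the ragged last strip. The function strip_mine is a hypothetical stand-in for the DMA moves and kernel launches:

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Walks the stream in fixed-size strips; returns the number of
 * strips issued and reports total elements covered. */
static int strip_mine(int total, int strip, int *elems_seen) {
    int strips = 0;
    *elems_seen = 0;
    for (int base = 0; base < total; base += strip) {
        int count = min_int(total - base, strip);  /* hlc_smc   */
        int is_last = (base + count == total);     /* hlc_seteos */
        (void)is_last;  /* would mark end-of-stream on the final strip */
        *elems_seen += count;
        strips++;
    }
    return strips;
}
```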

The SVM code is then compiled by a low-level compiler for the target architecture into a binary executable.

Technologies

Streaming Intermediate Representation (IR)

- Represents an application within the compiler as a series of kernels or loop nests connected by streams or arrays
- Enables the compiler to directly analyze and optimize streaming applications
- Makes task, pipeline, and data parallelism readily apparent

[Diagram: kernels/loop nests as nodes and streams/arrays as edges, illustrating data parallelism (one kernel replicated across elements), task parallelism (independent kernels side by side), and pipeline parallelism (kernels chained A -> B -> C -> D)]
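The parallelism the streaming IR exposes can be read off the kernel graph. A toy sketch (the encoding is illustrative, not R-Stream's IR): kernels are nodes, streams are directed edges, and two kernels are task-parallel exactly when neither reaches the other through stream edges, while a chain is pipeline-parallel:

```c
#include <assert.h>

#define NK 4  /* kernels A=0, B=1, C=2, D=3 */

/* Fill reach[i][j] with the transitive closure of the stream edges
 * (Floyd-Warshall style). */
static void close_graph(int edge[NK][NK], int reach[NK][NK]) {
    for (int i = 0; i < NK; i++)
        for (int j = 0; j < NK; j++)
            reach[i][j] = edge[i][j];
    for (int k = 0; k < NK; k++)
        for (int i = 0; i < NK; i++)
            for (int j = 0; j < NK; j++)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = 1;
}

/* Task-parallel: no dependence in either direction. */
static int task_parallel(int reach[NK][NK], int a, int b) {
    return !reach[a][b] && !reach[b][a];
}
```

For the diamond graph A -> B, A -> C, B -> D, C -> D, the two branch kernels B and C are task-parallel, while A and D are not.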

Top-down, unified streaming IR mapping

- Mapping will include top-down optimizations, such as software-pipelining an entire application to cover communication latency
- Mapping of computation, data, and communication layout will be performed by a single, unified pass for greater efficiency
- Mapping will exploit all degrees of parallelism exposed by the streaming IR, and edges in the streaming IR become communication or storage

[Diagram: space-time layouts of kernels A-D mapped data-parallel, task-parallel, pipelined, and in combination]
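Why software-pipelining the whole application covers communication latency can be shown with a back-of-envelope cost model (cycle counts and both functions are invented for illustration): serially, every strip pays DMA then compute; pipelined with double buffering, the DMA of strip i+1 overlaps the compute of strip i, so the steady-state cost per strip is only the larger of the two:

```c
#include <assert.h>

/* Serial schedule: each strip pays DMA followed by compute. */
static int serial_cycles(int strips, int dma, int compute) {
    return strips * (dma + compute);
}

/* Pipelined schedule: one DMA to fill the pipeline, then each strip
 * costs max(dma, compute) in steady state. */
static int pipelined_cycles(int strips, int dma, int compute) {
    int steady = dma > compute ? dma : compute;
    return dma + strips * steady;
}
```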

Compiler-driven architecture morphing

Morphable architectures reconfigure high-level resources such as processors, memories, and networks to meet application demands (e.g., computation -> processor, data -> memory).

The compiler must choose a configuration and map to that configuration; this expands the compiler's mapping space from space and time to space, time, and configuration. The compiler can employ one or multiple configurations:

1. Static morphing: pick one configuration and mapping for the whole application.
2. Dynamic morphing: pick multiple configurations and mappings for different parts of the application, and orchestrate morphing between them.

[Diagram: space-time mappings of kernels A-D under a single configuration versus a sequence of configurations]
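The static-vs-dynamic trade-off can be framed as a toy cost model (all numbers, names, and the per-phase cost matrix are invented for illustration, not R-Stream's cost functions): static morphing picks the single configuration that is cheapest over the whole run, while dynamic morphing picks the best configuration per phase and pays a reconfiguration cost at phase boundaries:

```c
#include <assert.h>

#define NPHASES 2
#define NCONFIGS 2

/* Static morphing: best single configuration across all phases. */
static int static_cost(const int cost[NCONFIGS][NPHASES]) {
    int best = -1;
    for (int c = 0; c < NCONFIGS; c++) {
        int total = 0;
        for (int p = 0; p < NPHASES; p++) total += cost[c][p];
        if (best < 0 || total < best) best = total;
    }
    return best;
}

/* Dynamic morphing: best configuration per phase, plus a morph cost
 * at each phase boundary (pessimistically, every boundary morphs). */
static int dynamic_cost(const int cost[NCONFIGS][NPHASES], int morph) {
    int total = 0;
    for (int p = 0; p < NPHASES; p++) {
        int best = cost[0][p];
        for (int c = 1; c < NCONFIGS; c++)
            if (cost[c][p] < best) best = cost[c][p];
        total += best;
    }
    return total + (NPHASES - 1) * morph;
}
```

Dynamic morphing wins when phases favor different configurations by more than the morphing cost.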