Programming Challenges & Solutions for Multi-Processor SoCs: An Industrial Perspective


Pierre Paulin
STMicroelectronics Inc.
16 Fitzgerald Road
Ottawa, Canada, K2H 8R6
Tel: +1-613-768-9069

pierre.paulin@st.com

ABSTRACT
In this paper, we describe challenges and solutions for programming multi-processor systems-on-a-chip, based on our experience in programming Platform 2012, a large-scale multi-core fabric under development by STMicroelectronics and CEA, using the MultiFlex multi-core programming environment. We present a component-based environment which is the basis for a rich set of parallel programming constructs supporting task-level and data-level parallelism. The MultiFlex programming tools are described, supporting platform mapping, debug, trace and analysis. We discuss the applicability of different parallel programming model variants for two versions of a high-definition VC-1 video decoding application. These two versions are mapped onto variants of a homogeneous multi-core platform.

Categories and Subject Descriptors
D.1.3 [Concurrent Programming]

General Terms
Performance, Experimentation, Languages.

Keywords
Programming models, components, multi-core platform mapping.

1. INTRODUCTION

The emergence of multi-processor SoC platforms in the consumer-driven SoC market is driven by three main challenges:
- The ability to respond to rapidly evolving consumer-style markets in minimum time.
- Developing a new SoC platform for feature-rich products requires significant design investment and high non-recurring engineering (NRE) costs. A longer time-in-market is needed to amortize that investment over more product variants and market niches.
- Original equipment manufacturers which make use of an SoC platform increasingly request the ability to add their own value-added features as market differentiators.

2. PLATFORM 2012 OVERVIEW

The Platform 2012 (P2012) project [1], currently under joint development by STMicroelectronics and CEA, was defined in order to address these challenges. It is an accelerator architecture targeted at imaging, video, and next-generation immersive applications such as computational photography and augmented reality. P2012 is an area- and power-efficient many-core computing fabric, and it provides an architectural harness that eases the integration of hardwired accelerators. The P2012 computing fabric is highly modular, as it is based on multiple clusters implemented with independent power and clock domains. As depicted in Figure 1, clusters are connected via a high-performance, fully-asynchronous network-on-chip (NoC), which provides scalable bandwidth and robust communication across different power and clock domains [2].

[Figure 1. Programming model and P2012 architecture: the platform programming model, with mapping control and performance feedback, sits above a multi-core GPP host (L2 memory, fabric controller) connected via the system bus and a fully-asynchronous NoC to the P2012 clusters; each cluster contains a cluster controller, DMA engines, PEs 0..n with a hardware synchronizer and shared L1 memory, and hardware processing elements (HWPE 0..N) with send/receive interfaces on an asynchronous local interconnect (A-LIC).]

Each cluster features up to 16 tightly-coupled processing elements (PEs) sharing uncached, multi-banked level-1 data memories, individual cached instruction memories, a multi-channel advanced DMA engine, and specialized hardware for synchronization and scheduling acceleration. P2012 targets extreme area and energy efficiency through aggressive exploitation of domain-specific acceleration at the processor and cluster levels. Each PE is a customizable 32-bit RISC processor which can be specialized at design time with modular extensions (vector units, a floating-point unit, special-purpose instructions). Clusters can easily become heterogeneous computing engines thanks to the integration of coarse-grained hardware processing elements which communicate via send/receive streaming interfaces and an asynchronous local interconnect (A-LIC).

Platform programming productivity is essential in order to meet time-to-market constraints for the complex target applications. This requires a high-level abstraction layer, which we refer to as a platform programming model (PPM), as illustrated at the top of Figure 1, and associated platform mapping tools which provide control of the mapping process when needed, as well as rich feedback on the results of a given mapping, in particular on its performance aspects.

3. THE P2012 S/W DEVELOPMENT KIT

The P2012 programming environment leverages a tool stack, as depicted in the software development kit (SDK) view of Figure 2. The stack starts at the top with the platform programming models (PPMs), which are used to develop parallel applications. The next level of the stack is the P2012 programming environment, an evolution of the first-generation MultiFlex tools [5], [6], which supports multiple levels of the development cycle: from high-level application capture and simulation, to the analysis, debug and visualization of the performance- and power-optimized version of the application mapped onto the fabric. The overall environment is integrated into an Eclipse-based integrated development environment (IDE).

[Figure 2. Platform 2012 Software Development Kit stack: programming models (standard programming models such as OpenCL; native programming models covering streaming (PEDF, DDF), parallel programming patterns for TLP and DLP, and dynamic task dispatch; and the native programming layer with the DTD and NPL APIs), the programming environment (language-based, component-based and API-based tools with the parallel programming pattern library), the system infrastructure and runtime (dynamic deployment, QoS, power management, execution engines), and the platform models (TLM).]

The programming environment relies on the services of the system infrastructure and runtime layer, which supports several execution engines, dynamic deployment of applications, platform quality of service, and power management. Finally, the base of the stack consists of the platform modeling layer, including a functional simulator, a transaction-level simulator integrating performance modeling of the architecture platform, and architectural power estimation models. An FPGA board and a demonstrator SoC are under development.

4. PROGRAMMING MODEL OVERVIEW

As shown at the top of Figure 2, the Platform 2012 programming tools and runtime support three main classes of platform programming models (PPMs):
1. Standards-based programming models target platform portability and are based on industry standards. They will be used by both third-party and in-house programmers. We are currently developing support for the OpenCL standard.
2. Native programming models target a combination of high productivity and high performance. They are built on well-defined parallel programming abstractions that can be mapped efficiently onto P2012 capabilities, while offering substantial automation tools for exploring different mapping solutions. They are typically used internally to offer high-performance versions of applications delivered with a platform instance, and they help make application code portable across a range of P2012 platform configurations.
3. The native programming layer (NPL) is a low-level API which is very closely coupled to the platform capabilities. It allows the highest level of control over application-to-resource mapping, and therefore the ability to extract the highest performance, at the expense of abstraction and of code portability across P2012 platform variants.

In this paper, we focus primarily on the definition and use of the native programming models.

5. NATIVE PROGRAMMING MODELS

The native programming models offer a midway productivity/performance tradeoff between the standards-based programming models and the native programming layer. In order to fulfill the requirements of key applications, which need to be mapped to a range of different platform configurations, the native programming models are designed to get the best out of the P2012-specific resources while still providing high-level abstractions that enable automated mapping. Three classes of native programming models are supported:
- Streaming-oriented programming models support applications with data-dominated parallelism and medium to low levels of control. We defined the predicated execution synchronous dataflow model (PEDF), dedicated to mixed HW/SW applications which make use of the user-defined hardware accelerators supported in the cluster template. It relies on a dataflow programming model with simple predicates [1].
- The parallel programming patterns (PPPs) are used to implement applications on single- or multi-cluster configurations with high mapping flexibility. This is achieved using a set of core parallel programming patterns implemented within a component framework. These patterns support various forms of data- and task-level parallelism, and are typically used to encapsulate medium-grain parallel tasks.
- Dynamic task dispatching (DTD) performs the scheduling of fine-grain tasks onto a set of PEs which have access to the shared L1 memory of a single cluster.

Here, we focus on the PPP and DTD programming models.

5.1 Parallel Programming Patterns


Parallel programming patterns (PPPs) provide a set of high-level abstractions for capturing the synchronization, communication and data-access behavior of a thread-based parallel application. These PPPs help programmers structure their code around well-defined data-level parallelism (DLP) and task-level parallelism (TLP) constructs, which can be mapped onto different platform resources. These patterns are implemented in a programming environment combining component-based programming with a component library, based on the open-source MIND component environment [3]. The use of component-based technology brings many benefits to the programming model implementation. First, the encapsulation of data with code, provided by components, helps programmers manage data isolation and locality, which is fundamental to mastering parallelism. Second, thanks to the separation of programming interfaces from their implementations, the above patterns may be implemented in different ways as separate components, each optimized for a different mapping scenario (e.g. a data exchanger pattern implemented using intra- or inter-cluster communication). The programmer can pick and choose the component which optimizes the mapping of an application to a given set of resources using an architecture description language (ADL), without modifying the source code of the application.
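To make this interface/implementation separation concrete, here is a minimal C sketch, under the assumption of a function-pointer-based component interface; the names and structure are illustrative only and do not reproduce the actual MIND framework API:

```c
#include <stdio.h>

/* Hypothetical exchanger interface: application code is written
 * against this function table, never against an implementation. */
typedef struct exchanger_itf {
    void (*swap)(void **local, void **remote);
} exchanger_itf;

/* Candidate implementation 1: intra-cluster exchange, a simple
 * pointer swap within the cluster's shared L1 memory. */
static void swap_intra(void **a, void **b) {
    void *t = *a; *a = *b; *b = t;
    puts("intra-cluster swap (pointer exchange in shared L1)");
}

/* Candidate implementation 2: inter-cluster exchange, which would
 * move the buffer contents over the ANoC (modeled by a message). */
static void swap_inter(void **a, void **b) {
    void *t = *a; *a = *b; *b = t;
    puts("inter-cluster swap (buffer copied over the ANoC)");
}

static const exchanger_itf intra_impl = { swap_intra };
static const exchanger_itf inter_impl = { swap_inter };

int main(void) {
    int buf0 = 1, buf1 = 2;
    void *mine = &buf0, *theirs = &buf1;
    /* In practice the ADL selects the implementation at mapping
     * time; the application call site below never changes. */
    const exchanger_itf *ex = &intra_impl;
    ex->swap(&mine, &theirs);
    ex = &inter_impl;          /* remapped to another cluster */
    ex->swap(&mine, &theirs);
    return 0;
}
```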
[Figure 3. Parallel Programming Patterns: the native programming models (streaming DDF/SDF for SW and HW/SW, parallel programming patterns for DLP and TLP, and dynamic task dispatch) are built on the parallel programming pattern (PPP) library, which groups execution-model patterns (run-to-completion, DDF, PEDF, SDF, dynamic task dispatch, thread), communication/synchronization patterns (queue iterator, exchanger, FIFO, synchronized buffer) and memory-access patterns (prefetch, async prefetch), layered above the native programming layer (NPL) and the HAL.]

As depicted in Figure 3, the PPP library consists of three categories: execution models, communication & synchronization patterns, and memory access patterns. Different variants of the native programming models (streaming, programming patterns for DLP and TLP, and dynamic task dispatch) use different subsets of these patterns.

The first category, the execution model patterns, provides the underlying execution engines to deploy and schedule components on hardware execution resources.

The second pattern category offers communication & synchronization services. Four main pattern classes are provided:
- The Queue Iterator pattern class provides an optimized communication channel for implementing several producer/consumer tasks (e.g. split, join, broadcast) based on the iterator pattern [4].
- The Exchanger provides a barrier-type synchronization point between two parallel tasks, where they can swap buffers of data.
- The Synchronized Buffer provides a shared memory element that manages the synchronization between a single writer task and multiple reader tasks.
- A class of basic Queue patterns supports producer-consumer data interactions. These queues can be specialized to implement various access orderings such as FIFO or FILO.

The third category of patterns supports memory access. For example, the Data Prefetcher provides a memory access pattern that automates the management of data prefetching into local memory, optimizing access latency and bandwidth. The Async Data Prefetch pattern provides a non-blocking version of the same function.

An abstract illustration of the use of components in the platform mapping process is shown in Figure 4. The high-level application is decomposed into five coarse-grain components (Comp1 to Comp5). These components can be assigned by the programmer to a fixed subset of platform resources, as depicted by the vertical red dotted arrows in the figure. At compile time, the programmer can explore different component-to-resource allocations, and the programming tools will automatically implement the abstract communication as intra-cluster or inter-cluster communication, based on the selected assignment. The tasks embodied inside a component can be assigned dynamically to PEs in a single cluster, when using the dynamic task dispatch programming constructs described next.

[Figure 4. Mapping Process: an application (codec, IQI, AR, modem) is decomposed into components Comp1 to Comp5; Comp1 handles deployment, Comp2 to Comp4 use DTD internally, and Comp5 contains the PEDF filters Fa and Fb. The components are mapped onto a multi-core GPP host and two P2012 clusters (cluster controller, DMAs, PEs 0..n, hardware synchronizer, shared data memory, hardware PEs with streaming interfaces, A-LIC), all connected by the fabric ANoC.]
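As an illustration of the basic Queue pattern, the following self-contained C sketch emulates a bounded FIFO between a producer filter and a consumer filter; the ppp_queue type and functions are hypothetical stand-ins for illustration, not the actual PPP library API:

```c
#include <stdio.h>

/* Minimal sketch of a basic Queue pattern with FIFO ordering.
 * In the real environment, queue components are generated and
 * bound by the MIND tools; this ring buffer only models behavior. */
#define QDEPTH 4
typedef struct { int data[QDEPTH]; int head, tail, count; } ppp_queue;

static int ppp_queue_push(ppp_queue *q, int v) {
    if (q->count == QDEPTH) return -1;   /* full: producer would block */
    q->data[q->tail] = v;
    q->tail = (q->tail + 1) % QDEPTH;
    q->count++;
    return 0;
}

static int ppp_queue_pop(ppp_queue *q, int *v) {
    if (q->count == 0) return -1;        /* empty: consumer would block */
    *v = q->data[q->head];
    q->head = (q->head + 1) % QDEPTH;
    q->count--;
    return 0;
}

int main(void) {
    ppp_queue q = {0};
    /* Producer filter pushes macroblock indices... */
    for (int i = 0; i < 3; i++)
        ppp_queue_push(&q, i);
    /* ...and the consumer filter pops them in FIFO order. */
    int mb;
    while (ppp_queue_pop(&q, &mb) == 0)
        printf("consume MB %d\n", mb);
    return 0;
}
```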

5.2 Dynamic Task Dispatch


The dynamic task dispatching (DTD) programming constructs are based on a C API which implements OpenMP-like constructs to fork and join either heterogeneous or identical tasks, implemented respectively using the fork_join and dup function calls. The fork_join primitive takes as argument an array containing one function pointer for each separate function to be forked. This is typically used to implement task-level parallelism, as will be demonstrated in the DTD version of the VC-1 decoder (see Figure 6 in Section 7 below). The dup primitive takes as argument a single pointer to a function which will be invoked N times in parallel on N different data items. This construct is used to exploit data-level parallelism.

The DTD API is built directly on top of the native programming layer and is streamlined for maximum execution efficiency. The software primitives are highly optimized and rely on hardware acceleration provided by the cluster's hardware synchronizer. The overhead for scheduling a task on a processor is a few cycles, and a full fork/join sequence can be handled in a few dozen cycles. This implies that very fine-grained tasks can be defined and managed with minimal overhead. In addition, load balancing is a natural by-product of this fork/join programming style: the execution semantics ensure that no processor is left idle if there are tasks to be executed in any fork/join queue. The DTD programming interface provides a simple form of execution scalability, as it does not enforce any static (compile-time) binding constraints between tasks and processing elements. In some cases, this unconstrained execution model may not be desirable (e.g. if a precise processor-to-task mapping is needed to achieve a controlled workload). This is in contrast with the more controlled, static mappings performed when using the component-based parallel programming patterns described in the previous subsection.
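The following C sketch illustrates the fork_join and dup semantics described above, using a sequential stand-in that models the synchronous fork/join behavior on a host PC; the signatures are our assumptions for illustration, not the actual DTD API:

```c
#include <stdio.h>

/* fork_join() launches one task per function pointer in an array;
 * dup() invokes one function N times on N different data items.
 * Sequential stand-ins: on P2012, free PEs would grab these tasks
 * greedily and the calls return only when all tasks have joined. */
typedef void (*dtd_task)(void *arg);

static void fork_join(dtd_task tasks[], void *args[], int n) {
    for (int i = 0; i < n; i++)
        tasks[i](args[i]);        /* implicit join at loop exit */
}

static void dup(dtd_task task, void *args[], int n) {
    for (int i = 0; i < n; i++)
        task(args[i]);            /* same code, N data items */
}

static void idct_block(void *arg)  { printf("IDCT on block %d\n", *(int *)arg); }
static void motion_comp(void *arg) { printf("MC on block %d\n",   *(int *)arg); }

int main(void) {
    int blocks[6] = {0, 1, 2, 3, 4, 5};
    void *args[6];
    for (int i = 0; i < 6; i++) args[i] = &blocks[i];

    /* Task-level parallelism: heterogeneous tasks forked together. */
    dtd_task pipe[]   = { idct_block, motion_comp };
    void *pipe_args[] = { args[0], args[0] };
    fork_join(pipe, pipe_args, 2);

    /* Data-level parallelism: one function over the 6 blocks of a MB. */
    dup(idct_block, args, 6);
    return 0;
}
```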

5.3 Native Programming Layer


P2012 offers a native programming layer (NPL) that implements a lightweight OS kernel and exposes a low-level C-based API to access platform resources. The NPL includes thread creation/destruction, synchronization and memory allocation operations, enabling classical thread-based parallel programming. Moreover, it includes operations for accessing the cluster-wide hardware synchronizer (HWS) functions, such as atomic counters, system messaging and dynamic task dispatching. It accesses the underlying hardware platform via a hardware abstraction layer (HAL), as depicted in Figure 3. A frequent use-case scenario for the native programming layer is the development of highly optimized computation kernels that must be fine-tuned to fully exploit the fabric capabilities. This efficiency is achieved at the cost of a loss of platform portability and a higher development effort when compared with the standards-based and native programming models described above. It also requires the explicit consideration of inter-cluster communication in the application programming.
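As a flavor of this programming level, the sketch below shows PEs sharing work through an atomic counter; it is illustrative only, with hypothetical stand-ins for the kinds of services the NPL exposes (thread management, HWS atomic counters), not the actual P2012 NPL API:

```c
#include <stdio.h>

/* Hypothetical stand-in for an HWS atomic counter. On P2012 the
 * hardware synchronizer would make fetch-and-add atomic across PEs;
 * this sequential emulation only models the usage pattern. */
typedef struct { volatile int value; } hws_counter;

static int hws_fetch_add(hws_counter *c, int n) {
    int v = c->value;
    c->value += n;    /* may go slightly negative; fine for a sketch */
    return v;
}

static void kernel_body(int pe_id, hws_counter *work) {
    int item;
    /* Each PE atomically grabs work items until none remain. */
    while ((item = hws_fetch_add(work, -1)) > 0)
        printf("PE %d processes item %d\n", pe_id, item);
}

int main(void) {
    hws_counter work = { 8 };
    /* A real NPL program would create one thread per PE; we emulate
     * two PEs sequentially to show the atomic-counter usage. */
    kernel_body(0, &work);
    kernel_body(1, &work);
    return 0;
}
```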

6. PLATFORM MAPPING PROCESS

An abstract illustration of the mapping process was introduced in Figure 4. Let us elaborate on the specific functionality of the mapping tools using this example. The first component (Comp1) is responsible for deployment onto the platform fabric, as depicted by the dotted left-to-right blue lines. This leverages a dynamic component deployment tool associated with the MIND framework [3]. The components Comp2 to Comp4 embody applications which make use of the DTD constructs. Comp5 embodies a streaming programming model description (also expressed internally using components).

The use of components at the top level, combined with the use of the DTD constructs within a component, supports a flexible task-to-resource allocation strategy. Components are assigned statically at compile time to a subset of the platform resources, in a one-to-one or one-to-many fashion. This is flexible and user-driven, with the restriction that a single component can only be assigned to the resources of a single cluster in any given platform mapping. For example, in Figure 4, the tasks of components Comp2 and Comp3 are assigned to the PEs of the left-hand cluster, while the tasks of component Comp4 are mapped to a subset of the PEs of the right-hand cluster. At deployment time, any of these components can be transparently reassigned between clusters, or in many cases even to the host GPP.

One key feature of this component-based approach is that inter-component communication is abstracted logically. Depending on the specific component-to-resource assignment chosen, the tools will generate the appropriate low-level communication, as sketched after this section. In the case of Comp2 to Comp3 (with the sample resource binding shown), this could be a local procedure call. In the case of Comp3 to Comp4, it is automatically converted to an inter-cluster communication over the ANoC. Since the shared L1 data memory is not cached, management of data movement to and from external memory is extremely important. Data defined in a component can be mapped to different memory locations (e.g. the cluster's shared L1 memory or the external memory). Moreover, communication can be implemented using a variety of mechanisms, depending on the location of the PEs and the memory they communicate with. For example, data movement can be achieved by pointer passing, by memory copy, or by DMA for large blocks. Finally, communication from the GPP host to PEs in the fabric involves data marshalling and the transparent use of driver layers in the host O/S.

In contrast with the compile-time task assignment of components, tasks described using the DTD fork_join and dup constructs are assigned dynamically at runtime to the set of resources associated with the enveloping component. The only constraint is that all tasks in a given DTD section must run on PEs of the same cluster. Finally, the component Comp5 in Figure 4 illustrates a component containing two filters, Fa and Fb, which compose a PEDF streaming dataflow description. One is mapped to a S/W PE and the other to a H/W PE. The streaming communication between these two PE classes, as well as the static control and scheduling, is implemented automatically by the PEDF mapping tools [1].
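The C sketch below illustrates the kind of dispatch such generated communication stubs embody; the types and names are hypothetical, chosen only to mirror the three mechanisms named above:

```c
#include <stdio.h>
#include <string.h>

/* The same logical send becomes a pointer pass, a DMA block copy,
 * or a marshalled host call, depending on where the mapping tools
 * placed the two communicating components. */
typedef enum { SAME_CLUSTER, OTHER_CLUSTER, HOST_GPP } placement;

static void comm_send(placement p, const void *buf, size_t len) {
    (void)buf;
    switch (p) {
    case SAME_CLUSTER:   /* shared L1: just hand over the pointer  */
        printf("pass pointer (no copy), %zu bytes in shared L1\n", len);
        break;
    case OTHER_CLUSTER:  /* large block to another cluster: DMA    */
        printf("DMA copy of %zu bytes over the ANoC\n", len);
        break;
    case HOST_GPP:       /* marshal and go through the host driver */
        printf("marshal %zu bytes through the host O/S driver\n", len);
        break;
    }
}

int main(void) {
    char mb[384];                 /* one 4:2:0 macroblock payload */
    memset(mb, 0, sizeof mb);
    comm_send(SAME_CLUSTER, mb, sizeof mb);   /* e.g. Comp2 -> Comp3 */
    comm_send(OTHER_CLUSTER, mb, sizeof mb);  /* e.g. Comp3 -> Comp4 */
    return 0;
}
```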

7. APPLICATION MAPPING EXAMPLES

We use a high-definition VC-1 video decoder application to illustrate the pros and cons of different programming models, mapped onto two different variants of the P2012 platform memory architecture.

7.1 VC-1 using PPPs


For this first VC-1 mapping experiment, we targeted an earlier P2012 cluster configuration which uses distributed L1 TCDM memories (in contrast with the shared L1 data memory configuration of Figure 1). In this configuration, there is also a shared L2 memory for each cluster, which has a higher access time than the local TCDM of each PE. Due to the importance of explicitly managing data movement between the L1 TCDMs, the cluster L2 and the external L3, we chose a functional pipeline description exploiting task-level parallelism. The block diagram of Figure 5 represents the top-level decomposition of the components, using the PPP library described in Section 5.1.
[Figure 5. VC-1 decoder using PPPs: the input bitstream feeds, through a control queue and a queue-iterator broadcast, a VLD pipe working on frame N+1 (bit parse, VLD intra, VLD inter, motion-vector prediction) and a decoder pipe working on frame N (intra prediction, AC/DC, IZZ, IQ, IDCT, reconstruction, smoothing, motion compensation with intensity compensation and a prefetch manager over the reference pictures, loop filter, deblocking, range mapping, display), all connected by PPP queues.]

There are two top-level pipelines: the variable length decoder (VLD) pipe, and the main decoding pipe, which is the combination of the intra prediction, the IDCT reconstruction, the loop filter, the motion compensation, and the motion vector prediction. The VLD pipe operates on frame N+1 and the main decoding pipe operates on frame N. The motion vector prediction uses previously decoded reference pictures stored in external memory and prefetches the appropriate macroblocks (MBs). The overall description is dataflow-oriented, with explicit queues to buffer variable execution rates among sub-blocks. There is also a special QueueIterator block which is used to perform out-of-band global control of the underlying filters at
the frame (or frame slice) level. Components are statically assigned to PEs on a one-to-one basis, and data movement is managed automatically by the underlying communication pattern implementation. This requires care in the task decomposition, to balance loads across the statically assigned PE resources. This mapping enables the efficient usage of both the L1 TCDM memories and the cluster shared L2 when the space in L1 is insufficient (for example, the deblocking filter needs a full line of macroblocks to process the edge between the current MB and the one above it). This choice needs to be made carefully because proper placement of data is performance-critical. The queues from the PPP repository take care of moving the data between the L1 TCDMs of the communicating processors. Using the tracing and visualization tools, the size of each queue can be tracked over time; this information is useful to optimally size the queue depths. Furthermore, moving the memory used by these queues from L1 to L2 can be done simply by changing the implementation of the PPP used, which supports rapid exploration of different mappings. The downside of this streaming model is that the pipeline speed is conditioned by the slowest filter. Because of this, intensive memory accesses - like the accesses to the reference pictures, or the management of the loop filter buffer for deblocking - must be managed asynchronously with DMA accesses, in order to ensure that all filters are doing useful computing. One advantage of this mapping is that it can be achieved quickly, starting from the VC-1 reference code. Each core processing function can be left as-is, while the top-level loop is split into multiple parts which communicate through the PPP queues, as sketched below. Initially, all data was placed in the external L3 memory to obtain a first functional mapping; memory placement was then refined incrementally. Details of the TLP and vectorial DLP optimizations of this VC-1 mapping can be found in [7]. Finally, this description can be transparently mapped to single or multiple clusters, leveraging the communication abstraction supported by the component-based PPPs.
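A minimal C sketch of this loop-splitting refactoring, assuming a toy queue stand-in (the real queues are PPP library components bound through the ADL; all names here are hypothetical):

```c
#include <stdio.h>

/* Toy stand-in for a PPP queue connecting two pipeline stages. */
typedef struct { int data[16]; int n; } queue;
static void q_push(queue *q, int v) { q->data[q->n++] = v; }

/* The reference-code kernels are kept as-is... */
static int vld(int mb)        { return mb * 10; }
static int reconstruct(int c) { return c + 1;   }

int main(void) {
    queue vld_to_rec = {0};
    /* ...but the original single loop "for each MB { vld; reconstruct; }"
     * is split into two stages joined by a queue, so each stage can be
     * placed on its own PE. */
    for (int mb = 0; mb < 4; mb++)                 /* stage 1 */
        q_push(&vld_to_rec, vld(mb));
    for (int i = 0; i < vld_to_rec.n; i++)         /* stage 2 */
        printf("MB %d reconstructed: %d\n", i, reconstruct(vld_to_rec.data[i]));
    return 0;
}
```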
7.2 VC-1 with Dynamic Task Dispatch

In this experiment, we make use of the cluster configuration of Figure 1, which uses a shared L1 tightly-coupled data memory (TCDM), accessible from all PEs with symmetric access times. This memory configuration makes it easy to dynamically assign a task to any free PE, since the PE has direct access, in the shared L1 TCDM, to the data needed to execute the task. For this reason, the DTD programming constructs were chosen for an alternative mapping.

The VC-1 implementation using DTD is depicted in Figure 6. It is encapsulated in a single component and deployed as such. In this implementation, the VC-1 program is organized in a two-level pipeline structure. At the highest level, the VLD / bit-stream parsing and the decoding part are pipelined on frames, i.e. the VLD / bit-stream parsing works on frame N+1 while the decoder works on frame N. This allows the decoding time to be averaged out at the frame level rather than at the macroblock level. The VLD process is particularly computationally demanding and at the same time cannot be parallelized, because of temporal dependencies; this high-level pipelining on a frame basis is a solution to alleviate this bottleneck. Both the VLD and the decoding part are also pipelined internally, at the macroblock level. The VLD pipeline contains three tasks: one for the input DMA transfer, one for the VLD itself, and one for the output DMA transfer. The VLD / bit parsing task (the most demanding) is purely sequential. The pipe for the decoding part uses the fork/join construct to launch nine parallel tasks performing DMA transfers, intra prediction, IDCT, motion compensation, the deblocking filter, etc. Three of these tasks exploit data-level parallelism using the dup construct. In particular, the IDCT and motion compensation work in parallel on the six blocks of a macroblock. Apart from the traditional tasks involved in the decoding process, the pipe also contains a dedicated stage for prefetching the reference block needed for motion compensation (Motion Comp I).

[Figure 6. VC-1 decoder using dynamic task dispatching: an outer loop on frames forks the VLD pipe (frame N+1: DMA in, VLD, DMA out, with an implicit macroblock dataflow loop) and the decoder pipe (frame N: DMA in, intra prediction, dup IDCT, Motion Comp I, dup Motion Comp F, OVL, deblocking filter, dup chroma deblocking, DMA out), each joining at the end of its macroblock loop.]

With this mapping, approximately 20 tasks can be active at the same time. Based on the DTD runtime, each processor uses a greedy scheduling loop that grabs the next task in the queue whenever it is free; the exception is the master processor (the one which called the initial fork), since it must at some point wait for all its descendants to complete. When mapped to a cluster with 8 cores, we observe an average idle time of ~5% per core. The idle time here is calculated as the total time spent in the DTD scheduling loop, including waiting for tasks when the ready list is empty. The other exception is the processor that performs the DMA input task, which is not computationally intensive but has a long execution time. When this processor is the top-level master, which must also wait for all children to join after each iteration, its idle time can go up to ~16%. This application mapping takes better advantage of the underlying shared L1 data memory architecture and allows for finer-grain dynamic load balancing. On the negative side, this description cannot be automatically mapped to multiple clusters, as is supported by the PPP version. Moreover, the functional pipeline is implemented manually, via global variables storing intermediate results and the state of the pipeline after each loop; this implicitly realizes the functional pipeline of the dataflow-oriented version of Figure 5. This is more time-consuming and error-prone. Nevertheless, we believe that the benefits of improved load balancing compensate for these efforts.
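The two-level structure of this mapping can be sketched in C as follows, reusing the hypothetical fork_join()/dup() stand-ins from the Section 5.2 example (only the IDCT task of the nine decoder tasks is shown):

```c
#include <stdio.h>

/* Sequential stand-ins for the DTD constructs (see Section 5.2). */
typedef void (*dtd_task)(void *arg);
static void fork_join(dtd_task t[], void *a[], int n) { for (int i = 0; i < n; i++) t[i](a[i]); }
static void dup(dtd_task t, void *a[], int n)         { for (int i = 0; i < n; i++) t(a[i]); }

static int frame;
static void vld_pipe(void *arg) { (void)arg; printf("VLD parses frame %d\n", frame + 1); }
static void idct(void *arg)     { printf("  IDCT block %d\n", *(int *)arg); }

static void decoder_pipe(void *arg) {
    (void)arg;
    int blk[6] = {0, 1, 2, 3, 4, 5};
    void *av[6];
    for (int i = 0; i < 6; i++) av[i] = &blk[i];
    printf("decode frame %d:\n", frame);
    dup(idct, av, 6);   /* DLP: the six blocks of a macroblock */
}

int main(void) {
    /* TLP: the frame-level pipeline forks the VLD pipe (frame N+1)
     * together with the decoder pipe (frame N). */
    dtd_task pipes[] = { vld_pipe, decoder_pipe };
    void *args[]     = { NULL, NULL };
    for (frame = 0; frame < 2; frame++)
        fork_join(pipes, args, 2);
    return 0;
}
```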

8. RELATED WORK
Scalable multi-core architectures have been widely adopted for high-end graphics and media processing, e.g. the IBM Cell BE, NVIDIA Fermi and Tilera TILE64. A description of multi-core programming tools developed in industry and academia for the Cell BE heterogeneous multi-core platform is given in [8]. For special cases where direct control of resources is preferred over productivity, P2012 offers a low-level API, called the native programming layer (NPL), which can be compared with the Cell SDK. The other five programming models presented in [8] are language extensions expressing
parallelism in different ways. P2012 takes a similar approach by providing multiple programming patterns that support a rich set of parallel programming models. Moreover, it consolidates these patterns within a component-based infrastructure, which is used as a semantically neutral basis. Many of the programming models cited in [8] are based on variations of the OpenMP standard. The P2012 tools also support OpenMP-like semantics for exploiting data- and task-level parallelism. However, P2012 implements this support using a lightweight C API, namely the dynamic task dispatch (DTD) API, in order to give fine control to the programmer by eliminating any intrusion of compiler-managed directive interpretation. The DTD API can also be compared to the Tagged Procedure Calls (TPC) presented in [8]. Although both programming models provide runtime support for the execution of parallel tasks, DTD leverages the synchronous execution of tasks whereas TPC leverages asynchronous execution. Thanks to the fast scheduling of synchronous parallel tasks, DTD is compatible with the very fine-grain parallelism found in nested loops. The commercial toolset for the Tilera TILE64 architecture is based on a set of C/C++ libraries that implement high-level communication. The C++ libraries of the Intel Cilk [9] and RapidMind tools (now part of the ArBB toolset [10]) also follow this approach. The Intel TBB (Threading Building Blocks) library [11] extends the C++ language with OpenMP-like primitives, with support for data-level parallelism on a rich set of datatypes. These are somewhat analogous to the P2012 programming patterns, except that in P2012 they are implemented using components on top of the C language. Leveraging the encapsulation properties of components, the P2012 programming tools automate the mapping of the application to the platform by generating the required communication stubs. The complementary use of the component-based parallel programming patterns and the DTD API enables support for a wide range of use cases. This C-based approach also allows the use of a standard C compiler for final code generation. Finally, many GP-GPU multi-core platform tools support the OpenCL standard. We are also developing an OpenCL compiler for the P2012 platform, but this is not the focus of this paper.

9. CONCLUSION
We have described challenges and solutions in programming multi-processor systems-on-a-chip, based on our experience in mapping representative video applications to the Platform 2012 multi-core fabric, developed by STMicroelectronics and CEA. We described a component-based environment which is the basis for a rich set of parallel programming constructs supporting task-level and data-level parallelism. The mapping of two versions of a VC-1 HD video decoding application highlighted the importance of choosing the most appropriate programming model variant, depending on the characteristics of the application and those of the platform configuration.

10. ACKNOWLEDGMENTS
Our thanks to the P2012 programming tool team members: Olivier Benny, Youcef Bouchebaba, Vincent Gagné, Michel Langevin, Bruno Lavigueur, Matthieu Leclercq, Michel Metzger, Erdem Ozcan and Chuck Pilkington; to our P2012 partners at ST and CEA; and to our academic partners at the University of Genova.

11. REFERENCES
[1] STMicroelectronics and CEA, "Platform 2012: A Many-core Programmable Accelerator for Ultra-Efficient Embedded Computing in Nanometer Technology," white paper, Nov. 2010. http://www.cmc.ca/en/NewsAndEvents/~/media/English/Files/Events/20101105_Whitepaper_Final.pdf
[2] Y. Thonnart, P. Vivet and F. Clermidy, "A Fully-Asynchronous Low-Power Framework for GALS NoC Integration," Proc. DATE'10, Dresden, April 2010.
[3] OW2 Consortium, The MIND Project. http://mind.ow2.org
[4] E. Gamma, R. Helm, R. Johnson and J. M. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995.
[5] P. G. Paulin et al., "Parallel Programming Models for a Multi-Processor SoC Platform Applied to Networking and Multimedia," IEEE Transactions on VLSI Systems, Vol. 14, No. 7, July 2006, pp. 667-680.
[6] P. G. Paulin et al., "MPSoC Platform Mapping Tools for Data-Dominated Applications," in Model-Based Design for Embedded Systems, Ed. G. Nicolescu and P. Mosterman, CRC Press, 2010.
[7] M. Bariani, P. Lambruschini and M. Raggio, "VC-1 Decoder on the STMicroelectronics P2012 Architecture," Proc. 8th Annual Intl. Workshop STreaming Day, Sept. 2010, Univ. of Udine, Udine, Italy. http://stday2010.uniud.it/stday2010/stday_2010.html
[8] R. Ferrer et al., "Parallel Programming Models for Heterogeneous Multicore Architectures," IEEE Micro, Sept./Oct. 2010, pp. 42-53.
[9] Intel Cilk Plus. http://software.intel.com/en-us/articles/intel-cilk-plus/
[10] Intel Array Building Blocks. http://software.intel.com/en-us/articles/intel-array-building-blocks/
[11] Intel Threading Building Blocks. http://threadingbuildingblocks.org/
