pierre.paulin@st.com
ABSTRACT
In this paper, we describe challenges and solutions for programming multi-processor systems-on-a-chip, based on our experience in programming Platform 2012, a large-scale multi-core fabric under development by STMicroelectronics and CEA, using the MultiFlex multi-core programming environment. We present a component-based environment which is the basis for a rich set of parallel programming constructs supporting task-level and data-level parallelism. The MultiFlex programming tools are described, supporting platform mapping, debug, trace, and analysis. We discuss the applicability of different parallel programming model variants for two versions of a high-definition VC-1 video decoding application. These two versions are mapped onto variants of a homogeneous multi-core platform.
General Terms
Performance, Experimentation, Languages.
Keywords
Programming models, components, multi-core platform mapping.
1. INTRODUCTION
The emergence of multi-processor SoC platforms in the consumer-driven SoC market is driven by three main challenges:
- The ability to respond to rapidly evolving consumer-style markets in minimal time.
- Developing a new SoC platform for feature-rich products requires significant design investment and high non-recurring engineering (NRE) costs, so a longer time-in-market is needed to amortize that investment over more product variants and market niches.
- Original equipment manufacturers which build on an SoC platform increasingly request the ability to add their own value-added features as a market differentiator.
Figure 1. Programming model and P2012 architecture

Each cluster features up to 16 tightly-coupled processing elements (PEs) sharing uncached multi-banked level-1 data memories, individual cached instruction memories, a multi-channel advanced DMA engine, and specialized hardware for synchronization and scheduling acceleration. P2012 targets extreme area and energy efficiency through aggressive exploitation of domain-specific acceleration at the processor and cluster levels. Each PE is a customizable 32-bit RISC processor which can be specialized at design time with modular extensions (vector units, a floating-point unit, special-purpose instructions). Clusters can easily become heterogeneous computing engines thanks to the integration of coarse-grained hardware processing elements (HWPEs) which communicate via send/receive streaming interfaces and an asynchronous local interconnect (A-LIC).

Platform programming productivity is essential in order to meet time-to-market constraints for the complex target applications. This requires a high-level abstraction layer, which we refer to as the platform programming model (PPM), illustrated at the top of Figure 1, and associated platform mapping tools which provide control of the mapping process when needed, as well as rich feedback on the results of a given mapping, in particular on the performance aspects.
In this paper, we focus primarily on the definition and use of the native programming models.

3. PROGRAMMING ENVIRONMENT
Figure 2. Platform 2012 Software Development Kit Stack

The programming environment relies on the services of the system infrastructure and runtime layer, which supports several execution engines, dynamic deployment of applications, platform quality of service, and power management. Finally, the base of the stack consists of the platform modeling layer, which includes a functional simulator, a transaction-level simulator integrating performance modeling of the architecture platform, and architectural power estimation models. An FPGA board and a demonstrator SoC are under development.
The parallel programming pattern (PPP) library provides data-level parallelism (DLP) and task-level parallelism (TLP) constructs, which can be mapped onto different platform resources. These patterns are implemented in a programming environment combining component-based programming with a component library, based on the open-source MIND component environment [3]. The use of component-based technology brings many benefits to the programming model implementation. First, the encapsulation of data with code provided by components helps programmers manage data isolation and locality, which is fundamental to mastering parallelism. Second, thanks to the separation of programming interfaces from their implementations, the above patterns may be implemented in different ways as separate components, each optimized for a different mapping scenario (e.g. a data exchanger pattern implemented using intra- or inter-cluster communication). The programmer can pick and choose the component which optimizes the mapping of an application onto a given set of resources using an architecture description language (ADL), without modifying the application source code.
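To make this interface/implementation separation concrete, the following is a minimal C sketch, assuming a function-pointer-style interface; the names (exchanger, intra_cluster_swap, filter_step) are illustrative only and are not the actual MIND or P2012 APIs.

/* Hypothetical sketch of a PPP-style exchanger behind a stable interface. */
typedef struct exchanger {
    /* barrier-style rendezvous that swaps data buffers with a peer task */
    void *(*swap)(struct exchanger *self, void *buf);
    void  *peer_buf;   /* implementation-private state */
} exchanger;

/* Intra-cluster variant: both tasks share L1 memory, so exchanging
 * pointers suffices (the barrier synchronization is elided here). */
static void *intra_cluster_swap(exchanger *self, void *buf) {
    void *in = self->peer_buf;
    self->peer_buf = buf;
    return in;
}

/* An inter-cluster variant would copy the buffer (e.g. by DMA) instead;
 * the application code below is unaffected by which variant the ADL binds. */
static void *filter_step(exchanger *x, void *out_buf) {
    return x->swap(x, out_buf);   /* hand off output, receive next input */
}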
[Figure 4 diagram: an application decomposed into components Comp 1 to Comp 5 (with internal tasks using DTD) deployed onto the fabric clusters and I/O; Comp 5 contains the filters Fa and Fb.]
[Figure 3 diagram: execution models (run-to-completion, DDF, PEDF, SDF), communication/synchronization patterns (queue iterator, exchanger, synchronized buffer, FIFO), and memory access patterns (prefetch, async prefetch), implemented over the HAL on the fabric resources.]

Figure 3. Parallel Programming Patterns

As depicted in Figure 3, the PPP library consists of three categories: execution models, communication & synchronization patterns, and memory access patterns. Different variants of the native programming models (streaming, programming patterns for DLP and TLP, and dynamic task dispatch) use different subsets of these patterns.

The first category, the execution model patterns, provides the underlying execution engines used to deploy and schedule components on hardware execution resources.

The second category offers communication & synchronization services. Four main pattern classes are provided:
- The Queue Iterator pattern class provides an optimized communication channel for implementing several producer/consumer tasks (e.g. split, join, broadcast) based on the iterator pattern [4].
- The Exchanger provides a barrier-type synchronization point between two parallel tasks at which they can swap buffers of data.
- The Synchronized Buffer provides a shared memory element that manages the synchronization between a single writer task and multiple reader tasks.
- A class of basic Queue patterns supports producer-consumer data interactions. These queues can be specialized to implement various access orderings such as FIFO or FILO.

The third category of patterns supports memory access. For example, the Data Prefetcher provides a memory access pattern that automates the management of data prefetching into local memory to optimize access latency and bandwidth. The Async Data Prefetch pattern provides a non-blocking version of the same function.
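As an illustration of the blocking and non-blocking prefetch patterns, the following C sketch uses double buffering to overlap prefetch and computation; the ppp_prefetch* names are hypothetical stand-ins for the pattern interfaces, not the actual PPP library API, and are shown as declarations provided by the runtime.

#include <stddef.h>

typedef struct ppp_prefetch_req ppp_prefetch_req;   /* opaque transfer handle */

/* Blocking variant: returns once 'size' bytes from remote (e.g. L3)
 * memory have landed in local (L1) memory. */
void ppp_prefetch(void *local, const void *remote, size_t size);

/* Non-blocking variant: starts the transfer (e.g. via DMA) and returns
 * immediately; completion is awaited separately. */
ppp_prefetch_req *ppp_prefetch_async(void *local, const void *remote, size_t size);
void ppp_prefetch_wait(ppp_prefetch_req *req);

void process_ref_blocks(const char *ref_l3, void *buf[2], size_t mb_size, int n) {
    /* Fetch macroblock i+1 while macroblock i is being processed. */
    ppp_prefetch_req *req = ppp_prefetch_async(buf[0], ref_l3, mb_size);
    for (int i = 0; i < n; i++) {
        ppp_prefetch_wait(req);                      /* MB i is now local */
        if (i + 1 < n)
            req = ppp_prefetch_async(buf[(i + 1) & 1],
                                     ref_l3 + (size_t)(i + 1) * mb_size,
                                     mb_size);
        /* ... process macroblock i in buf[i & 1] ... */
    }
}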
An abstract illustration of the use of components in the platform mapping process is shown in Figure 4. The high-level application is decomposed into five coarse-grain components (Comp 1 to Comp 5). These components can be assigned by the programmer to a fixed subset of platform resources, as depicted by the vertical red dotted arrows in the figure. At compile time, the programmer can explore different component-to-resource allocations, and the programming tools will automatically implement the abstract communication as intra-cluster or inter-cluster communication based on the selected assignment. The tasks embodied inside a component can be assigned dynamically to PEs in a single cluster, when using the dynamic task dispatch programming constructs described next.

The communication between components is implemented using different mechanisms, depending on the location of the PEs and the memory they communicate with. For example, data movement can be achieved by pointer passing, by memory copy, or by DMA for large blocks. Finally, communication from the GPP host to PEs in the fabric involves data marshalling and the transparent use of driver layers in the host O/S.

In contrast with the compile-time task assignment of components, tasks described using the DTD fork_join and dup constructs are assigned dynamically at runtime to the set of resources associated with the enveloping component. The only constraint is that all tasks in a given DTD section must run on PEs of the same cluster. Finally, the component Comp 5 in Figure 4 illustrates a component containing two filters, Fa and Fb, which compose a PEDF streaming dataflow description. One is mapped to a S/W PE and the other to a H/W PE. The streaming communication between these two PE classes, as well as the static control and scheduling, is implemented automatically by the PEDF mapping tools [1].

With dynamic task dispatch, tasks are created and assigned to PEs at runtime (e.g. to achieve a controlled workload). This is in contrast with the more controlled, static mappings performed when using the component-based parallel programming patterns described above.
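The following hypothetical C sketch shows how the fork_join and dup constructs could appear in application code; the dtd_* names and signatures are illustrative only, not the actual DTD API.

#include <stdint.h>

typedef void (*dtd_task_fn)(void *arg);
typedef struct { dtd_task_fn fn; void *arg; } dtd_task;

/* Launch a set of (possibly different) tasks and wait for all of them;
 * free PEs of the current cluster grab ready tasks at runtime. */
void dtd_fork_join(const dtd_task *tasks, int n);

/* Launch n copies of one task for data-level parallelism; each copy
 * receives its index (0..n-1) as its argument, then join. */
void dtd_dup(dtd_task_fn fn, int n);

static void idct_block(void *arg) {
    int block = (int)(intptr_t)arg;  /* one of the 6 blocks of a macroblock */
    /* ... inverse DCT of this block ... */
    (void)block;
}

static void intra_predict(void *arg) { (void)arg; /* ... */ }
static void motion_comp(void *arg)   { (void)arg; /* ... */ }

static void decode_macroblock(void) {
    const dtd_task stages[] = { { intra_predict, 0 }, { motion_comp, 0 } };
    dtd_fork_join(stages, 2);   /* TLP: run the stages as parallel tasks */
    dtd_dup(idct_block, 6);     /* DLP: IDCT of the 6 blocks in parallel */
}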
Figure 5. VC-1 decoder using PPPs

There are two top-level pipelines: the variable-length decoder (VLD) pipe, and the main decoding pipe, which combines the intra prediction, the IDCT reconstruction, the loop filter, the motion compensation, and the motion vector prediction. The VLD pipe operates on frame N+1 while the main decoding pipe operates on frame N. The motion vector prediction uses previously decoded reference pictures stored in external memory and prefetches the appropriate macroblocks (MBs). The overall description is dataflow-oriented, with explicit queues to buffer variable execution rates among sub-blocks. There is also a special QueueIterator block which is used to perform out-of-band global control of the underlying filters at the frame (or frame-slice) level.

Components are statically assigned to PEs on a one-to-one basis, and data movement is managed automatically by the underlying communication pattern implementation. This requires care in the task decomposition to balance loads across the statically assigned PE resources. This mapping enables efficient usage of both the L1 TCDM memories and the cluster-shared L2 when the space in L1 is insufficient (for example, the deblocking filter needs a full line of macroblocks to process the edge between the current MB and the one above it). This choice needs to be made carefully because proper placement of data is performance-critical.

The queues from the PPP repository take care of moving the data between the L1 TCDMs of the communicating processors. Using the tracing and visualization tools, the size of each queue can be tracked over time; this information is useful for optimally sizing the queue depths. Furthermore, moving the memory used by these queues from L1 to L2 can be done simply by changing the implementation of the PPP used, which supports rapid exploration of different mappings.

The downside of this streaming model is that the pipeline speed is limited by the slowest filter. Because of this, intensive memory accesses, such as the accesses to the reference pictures or the management of the loop filter buffer for deblocking, must be handled asynchronously with DMA accesses in order to ensure that all filters are doing useful computing.

One advantage of this mapping is that it can be achieved quickly starting from the VC-1 reference code. Each core processing function can be left as-is, while the top-level loop is split into multiple parts which communicate through the PPP queues. Initially, all data was placed in the external L3 memory to obtain a first functional mapping; memory placement was then refined incrementally. Details of the TLP and vectorial DLP optimizations of this VC-1 mapping can be found in [7]. Finally, this description can be transparently mapped to single or multiple clusters, leveraging the communication abstraction supported by the component-based PPPs.
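The sketch below shows what a queue-based filter stage of Figure 5 could look like in C; the ppp_queue_* names are hypothetical, not the actual PPP interfaces. The point is that the stage code is independent of whether the queue buffers live in L1 or L2, since that choice is made by selecting the queue implementation at mapping time.

typedef struct ppp_queue ppp_queue;  /* bound to an L1- or L2-backed
                                        implementation at mapping time */

void *ppp_queue_pop(ppp_queue *q);             /* blocks until data is ready */
void  ppp_queue_push(ppp_queue *q, void *buf);

/* One pipeline stage, e.g. the IDCT filter: pop a macroblock from the
 * upstream queue, transform it, push it downstream. */
void idct_stage(ppp_queue *in, ppp_queue *out) {
    for (;;) {
        void *mb = ppp_queue_pop(in);   /* macroblock from the IZZ/IQ stage */
        /* ... inverse DCT of the macroblock, in place ... */
        ppp_queue_push(out, mb);        /* hand off to reconstruction */
    }
}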
Figure 6. VC-1 decoder using dynamic task dispatching

The VLD pipe (the most demanding) is purely sequential. The pipe for the decoding part uses the fork/join construct to launch nine parallel tasks performing DMA transfers, intra prediction, IDCT, motion compensation, the deblocking filter, etc. Three of these tasks exploit data-level parallelism using the dup construct. In particular, the IDCT and the motion compensation work in parallel on the six blocks of a macroblock. Apart from the traditional tasks involved in the decoding process, the pipe also contains a dedicated stage for prefetching the reference block needed for motion compensation (Motion Comp I).

With this mapping, approximately 20 tasks can be active at the same time. Based on the DTD runtime, each processor runs a greedy scheduling loop that grabs the next task in the queue whenever the processor is free; the exception is the master processor (the one which called the initial fork), since it must at some point wait for all its descendants to complete. When mapped to a cluster with 8 cores, we observe an average idle time of ~5% per core. The idle time here is calculated as the total time spent in the DTD scheduling loop, including waiting for tasks when the ready list is empty. The only exception is the processor that performs the DMA input task, which is not computationally intensive but has a long execution time. When this processor is the top-level master, which must also wait for all children to join after each iteration, its idle time can reach ~16%.

This application mapping takes better advantage of the underlying L1 shared data memory architecture and allows finer-grain dynamic load balancing. On the negative side, this description cannot be automatically mapped to multiple clusters, as is supported by the PPP version. Moreover, the functional pipeline is implemented manually via global variables storing intermediate results and the state of the pipeline after each loop; this implicitly implements the functional pipeline of the dataflow-oriented version of Figure 5, and is more time-consuming and error-prone. Nevertheless, we believe that the benefits of improved load balancing compensate for these efforts.
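A minimal sketch of such a greedy scheduling loop is shown below; the names are illustrative, and the actual DTD runtime differs in its task representation and synchronization details.

typedef void (*dtd_task_fn)(void *arg);
typedef struct { dtd_task_fn fn; void *arg; } dtd_task;

int ready_list_pop(dtd_task *out);   /* nonzero if a ready task was grabbed */
int all_tasks_done(void);            /* true when the current DTD section ends */

/* Each non-master PE spins here; time spent in this loop (including
 * waiting on an empty ready list) is what is reported as idle time. */
void pe_scheduler_loop(void) {
    dtd_task t;
    while (!all_tasks_done()) {
        if (ready_list_pop(&t))
            t.fn(t.arg);   /* run the grabbed task to completion */
    }
}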
8. RELATED WORK
Scalable multi-core architectures have been widely adopted for high-end graphics and media processing, e.g. the IBM Cell BE, NVIDIA Fermi, and Tilera TILE64. A description of the multicore programming tools developed in industry and academia for the Cell BE heterogeneous multicore platform is given in [8]. For special cases where direct control of resources is preferred over productivity, P2012 offers a low-level API, called the native programming layer (NPL), which can be compared with the Cell SDK. The other five programming models presented in [8] are language extensions expressing parallelism in different ways. P2012 takes a similar approach by providing multiple programming patterns that support a rich set of parallel programming models. Moreover, it consolidates these patterns within a component-based infrastructure, which is used as a semantically neutral basis.

Many of the programming models cited in [8] are based on variations of the OpenMP standard. The P2012 tools also support OpenMP-like semantics for exploiting data- and task-level parallelism. However, P2012 implements this support using a lightweight C API, namely the dynamic task dispatch (DTD) API, in order to give fine control to the programmer by eliminating any reliance on compiler-managed directive interpretation. The DTD API can also be compared to the Tagged Procedure Calls (TPC) presented in [8]. Although both programming models provide runtime support for the execution of parallel tasks, DTD leverages the synchronous execution of tasks whereas TPC leverages asynchronous execution. Thanks to the fast scheduling of synchronous parallel tasks, DTD is compatible with the very fine-grain parallelism found in nested loops.

The commercial toolset for the Tilera TILE64 architecture is based on a set of C/C++ libraries that implement high-level communication. The C++ libraries of the Intel Cilk [9] and RapidMind tools (now part of the ArBB toolset [10]) follow the same approach. The Intel TBB (Threading Building Blocks) library [11] extends the C++ language with OpenMP-like primitives, with support for data-level parallelism on a rich set of datatypes. These are somewhat analogous to the P2012 programming patterns, except that in P2012 they are implemented using components on top of the C language. Leveraging the encapsulation properties of components, the P2012 programming tools automate the mapping of the application to the platform by generating the required communication stubs. The complementary use of the component-based parallel programming patterns and the DTD API supports a wide range of use cases. This C-based approach also allows the use of a standard C compiler for final code generation. Finally, many GP-GPU multicore platform tools support the OpenCL standard. We are also developing an OpenCL compiler for the P2012 platform, but this is not the focus of this paper.
9. CONCLUSION
We have described challenges and solutions in programming multi-processor systems-on-a-chip, based on our experience in mapping representative video applications to the Platform 2012 multi-core fabric, developed by STMicroelectronics and CEA. We described a component-based environment which is the basis for a rich set of parallel programming constructs supporting task-level and data-level parallelism. The mapping of two versions of a VC-1 HD video decoding application highlighted the importance of choosing the most appropriate programming model variant, depending on the characteristics of the application and those of the platform configuration.

10. ACKNOWLEDGMENTS
Our thanks to the P2012 programming tool team members, Olivier Benny, Youcef Bouchebaba, Vincent Gagné, Michel Langevin, Bruno Lavigueur, Matthieu Leclercq, Michel Metzger, Erdem Ozcan, and Chuck Pilkington; to our P2012 partners at ST and CEA; and to our academic partners at the University of Genova.
11. REFERENCES
[1] STMicroelectronics and CEA, "Platform 2012: A Many-core Programmable Accelerator for Ultra-Efficient Embedded Computing in Nanometer Technology," white paper, Nov. 2010. http://www.cmc.ca/en/NewsAndEvents/~/media/English/Files/Events/20101105_Whitepaper_Final.pdf
[2] Y. Thonnart, P. Vivet, and F. Clermidy, "A Fully Asynchronous Low-Power Framework for GALS NoC Integration," Proc. DATE'10, Dresden, April 2010.
[3] OW2 Consortium, The MIND Project. http://mind.ow2.org
[4] E. Gamma, R. Helm, R. Johnson, and J. M. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995.
[5] P. G. Paulin et al., "Parallel Programming Models for a Multi-Processor SoC Platform Applied to Networking and Multimedia," IEEE Transactions on VLSI Systems, Vol. 14, No. 7, July 2006, pp. 667-680.
[6] P. G. Paulin et al., "MPSoC Platform Mapping Tools for Data-Dominated Applications," in Model-Based Design for Embedded Systems, Ed. G. Nicolescu and P. Mosterman, CRC Press, 2010.
[7] M. Bariani, P. Lambruschini, and M. Raggio, "VC-1 Decoder on the STMicroelectronics P2012 Architecture," Proc. 8th Annual Intl. Workshop STreaming Day, Sept. 2010, Univ. of Udine, Udine, Italy. http://stday2010.uniud.it/stday2010/stday_2010.html
[8] R. Ferrer et al., "Parallel Programming Models for Heterogeneous Multicore Architectures," IEEE Micro, Sept./Oct. 2010, pp. 42-53.
[9] Intel Cilk Plus. http://software.intel.com/en-us/articles/intel-cilk-plus/
[10] Intel Array Building Blocks. http://software.intel.com/en-us/articles/intel-array-building-blocks/
[11] Intel Threading Building Blocks. http://threadingbuildingblocks.org/