
A Modular Coprocessor Architecture for Embedded Real-Time Image and Video Signal Processing

Holger Flatt, Sebastian Hesselbarth, Sebastian Flügel, and Peter Pirsch

Institut für Mikroelektronische Systeme, Gottfried Wilhelm Leibniz Universität Hannover, Appelstr. 4, 30167 Hannover, Germany {flatt,hesselbarth,fluegel,pirsch}

Abstract. This paper presents a modular coprocessor architecture for embedded real-time image and video signal processing. Applications are separated into high-level and low-level algorithms and mapped onto a RISC and a coprocessor, respectively. The coprocessor comprises an optimized system bus, different application specific processing elements, and I/O interfaces. For low volume production or prototyping, the architecture can be mapped onto FPGAs, which allows flexible extension or adaptation of the architecture. Depending on the complexity of the coprocessor data paths, frequencies up to 150 MHz have been achieved on a Virtex II-Pro FPGA. Compared to a RISC processor, the performance gain for an SSD algorithm is more than a factor of 70.


1 Introduction

In recent years, the integration of smart image and video processing algorithms in sensor devices has increased. Applications like object detection, tracking, and classification demand high computing performance. It is desirable to have embedded signal processing integrated in the sensor, e.g. in video cameras, to perform image or video compression, filtering, and data reduction techniques. Moreover, flexibility is mandatory where new applications have to be supported or existing code needs to be modified.

General purpose processors can be used as a first approach for embedded real-time image and video signal processing. They comprise instruction set extensions like SSE, which allow SIMD operations [1], but the efficiency of the execution units remains low [2]. Moreover, due to their high power consumption, they are not suitable for embedded systems.

As an alternative to general purpose processors, digital signal processors (DSP) provide high performance at low power consumption. Although they exploit the inherent parallelism of algorithms, they are inferior to application specific arithmetic cores [3]. Dedicated arithmetic cores provide the highest optimization potential for a specific application, but lack flexibility if support for different applications is required.
S. Vassiliadis et al. (Eds.): SAMOS 2007, LNCS 4599, pp. 241–251, 2007. © Springer-Verlag Berlin Heidelberg 2007


H. Flatt et al.

Design and modification of dedicated units are time-consuming, while their reusability is low.

The analysis of the algorithmic hierarchy of image and video processing applications yields three layers of hardware abstraction [4]. It is advantageous if the hardware architecture combines processing cores that are optimized for the execution of algorithms associated with one of the layers. High-level algorithms (HLA) consist of data-dependent decisions and control operations. They require high flexibility. A RISC processor supports mapping of complex HLAs with low development effort if a compiler is available. Low-level algorithms (LLA) comprise simple computing operations that need high processing power. They are regular and have high potential for parallel execution. A coprocessor that is optimized for processing of low-level algorithms is superior to a RISC. Medium-level algorithms (MLA) are situated between HLAs and LLAs. Depending on their complexity, they can be executed either on a RISC or on a coprocessor.

In [5] a special programmable coprocessor was combined with a RISC. This coprocessor is optimized for processing of low-level algorithms. Due to its complexity, adaptations and extensions of the architecture are time-consuming.

In this paper, a modular coprocessor architecture for a generic embedded system is proposed, which can be easily modified or extended for different applications. This embedded system, shown in Figure 1, comprises a RISC core, a reconfigurable coprocessor, data I/O, debug, and memory interfaces.

[Figure 1: block diagram of the Embedded System (RISC, Coprocessor, I/O, Memory) and its peripherals: Host PC (optional), Sensor, Actuator]

Fig. 1. Embedded architecture and peripheral units

The architecture is designed for accelerating different image and video signal processing algorithms. The combination of RISC and coprocessor takes the algorithmic complexity of embedded applications into account [6]. While dedicated processing elements inside the coprocessor compute time-consuming low-level operations, irregular high-level parts of the application are executed on a RISC core.

Commonly used, feature-rich commercial system-on-chip bus systems like AMBA AHB [7] require complex finite state machines for master units. To reduce the modularization effort for peripheral and coprocessor units, a simplified multilayer communication bus is introduced.

Because FPGAs are reconfigurable, the focus is on FPGA implementations, although the architecture is also suitable for ASIC implementation. Current FPGAs allow real-time signal processing of sophisticated algorithms. They provide high-speed communication interfaces, internal memories, and arithmetic cores [8] [9]. Moreover, some FPGAs contain embedded RISC processors [10]. These embedded cores allow the integration of a RISC with a coprocessor in one device [11].

This paper is organized as follows. Chapter 2 gives a brief description of the proposed coprocessor architecture and the communication approach. Chapter 3 shows an application example and the design flow for dedicated processing elements. Subsequently, chapter 4 presents verification and results. Conclusions and an outlook to future work are given in chapter 5.


2 Embedded Coprocessor Architecture

2.1 Communication Approach

In order to utilize dedicated hardware acceleration units, a sufficient communication structure between RISC and coprocessor is needed. If the processes on the RISC and the coprocessor are frequently synchronized, communication latencies result in a high performance reduction. Aiming at a lower synchronization rate, a hierarchical control approach reduces the communication overhead [12]. Instead of calling and synchronizing single coprocessor micro instructions, e.g. multiply-accumulate, the RISC calls and synchronizes low-level algorithms that consist of a set of micro instructions. Figure 2 shows the structure of the proposed communication scheme, which includes a Dynamic Resource Scheduler for converting medium-level function calls (MLA) into a set of low-level function calls (LLA).

[Figure 2: the Application (HLAs) on the RISC sends MLA calls to the Dynamic Resource Scheduler, which sends LLA calls to Processing Elements 1 to n in the Coprocessor (with a Control Unit for data transfers); interrupt requests flow back]

Fig. 2. RISC/coprocessor communication approach with Dynamic Resource Scheduler
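
The scheme in Figure 2 can be illustrated with a small software model. This is a minimal sketch and not the authors' implementation: the class name, the `MLA_TABLE` contents, and the PE names are hypothetical, and only resource conflicts (a busy PE) are modeled, not data conflicts.

```python
from collections import deque

# Hypothetical MLA -> LLA expansion table; names are illustrative only.
MLA_TABLE = {
    "classify_image": [("ssd_pe", "ssd"), ("ssd_pe", "ssd"), ("dma", "copy")],
}


class DynamicResourceScheduler:
    """Software model: expands one MLA call into LLA calls for the PEs."""

    def __init__(self, pe_names):
        self.busy = {name: False for name in pe_names}
        self.pending = deque()   # LLA calls not yet started
        self.outstanding = 0     # LLA calls not yet finished
        self.log = []

    def call_mla(self, mla_name):
        """Entry point used by the application running on the RISC."""
        llas = MLA_TABLE[mla_name]
        self.pending.extend(llas)
        self.outstanding += len(llas)
        self._dispatch()

    def _dispatch(self):
        # Start every pending LLA whose PE is idle; keep the rest queued.
        waiting = deque()
        while self.pending:
            pe, lla = self.pending.popleft()
            if self.busy[pe]:
                waiting.append((pe, lla))
            else:
                self.busy[pe] = True
                self.log.append(f"start {lla} on {pe}")
        self.pending = waiting

    def pe_interrupt(self, pe):
        """A PE reports completion; returns True once the MLA is finished."""
        self.busy[pe] = False
        self.outstanding -= 1
        self._dispatch()
        if self.outstanding == 0:
            self.log.append("interrupt to application")
            return True
        return False
```

In this model the application is interrupted once per MLA call rather than once per LLA call, which is the reduction in synchronization rate that the hierarchical control approach aims at.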

The RISC transfers MLA function calls to the Dynamic Resource Scheduler. The Scheduler creates a list of LLA function calls and forwards them to the associated Processing Elements. Different Processing Elements can work in parallel if no data and resource conflicts occur. The Processing Elements send interrupt requests to the Scheduler after finishing their computation. After an MLA function call has been processed, the Scheduler signals the application through an interrupt request. The Dynamic Resource Scheduler can be implemented in software if saving hardware resources is intended, or in hardware if reducing the communication cost between RISC and coprocessor is highly important. In this work, a software approach is used.

2.2 Architecture Overview

The coprocessor carries out LLA computations. The proposed modular approach currently supports several dedicated Processing Elements (PEs) executing different LLAs [11]. These autonomous working units are compact and have high potential for optimization. Replacement and extension of PEs demand low development effort. Function calls and data transfers are performed via a central system bus. Synchronization of PEs is managed by a Dynamic Resource Scheduler instead of using semaphores [13]. A Control Interface connects the RISC to the system bus of the coprocessor to allow access to all resources.

For the current design exploration phase, the RISC has been supplemented by a Host PC, which is attached to an FPGA-based emulation system. This allows HW/SW co-emulation during initial phases of application development to evaluate HLAs, LLAs, and bus communication in detail.

On-chip memories can be used to reduce data transfer latencies. Memory modules for the proposed system bus have been implemented with data and address widths that are configurable prior to logic synthesis. Additionally, external memories like DDR-SDRAM can be accessed through external memory interface modules. A DMA Unit can be integrated if an application requires large amounts of data to be transferred between different coprocessor memories. The DMA Unit is controlled by the Dynamic Resource Scheduler. The resulting coprocessor architecture is shown in Figure 3.
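
[Figure 3: the Host PC / RISC connects through the Control Interface to the MIB inside the Coprocessor; also attached to the MIB are PE 1 to PE m, the DMA Unit, the Internal Memory, and the External MEM IF, which leads to the External Memory]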

Fig. 3. Modular coprocessor architecture




2.3 Module Interconnect Bus (MIB)

A key component in any System-on-Chip (SoC) design is the interconnection structure, which is used for inter-module communication. The bus architecture is the most popular integration choice for SoC designs today. The main advantages of buses are flexibility and extensibility [14]. Commercial bus systems allow high speed communication between different units. Common SoC buses like the AMBA AHB Bus [7] and the Processor Local Bus (PLB) [15] are powerful, but both have multi-state protocols, which result in complex development and integration of new bus modules. Moreover, the majority of applications only requires a small subset of the specified bus features [16]. If full compatibility to the bus protocol is needed, hardware overhead is unavoidable.

For the modular coprocessor architecture approach, these commercial bus systems are not suitable. Therefore, a small, flexible, and powerful bus system called Module Interconnect Bus (MIB) was developed. It allows rapid development of new bus components. Timing conditions of the bus protocol are simple.

The communication protocol of the MIB is based upon synchronous transmission with a double handshake mechanism. A valid bus transfer occurs at a rising clock edge if the sending module asserts a valid signal and the receiving module responds with an accept signal. Two temporally independent sub-buses are used for data transmission. Read requests and write operations are transmitted through a Request/Write Bus. Data read operations are sent over a Read Bus. All transfers are initiated by master modules while slave modules receive transfer requests. Both sub-buses allow multiple layers to provide independent parallel transfers. Control and data flow is managed by two bus arbiters. A slave may introduce an arbitrary delay into a read operation as long as the correct sequential order of responses is maintained. This decoupling allows integration of pipeline stages in both sub-buses, which is very suitable for complex SoCs running at high clock frequencies. A Reorder Scheduler on the Read Bus is used to keep in-order data delivery from slaves with different latencies. Figure 4 shows the structure of the Module Interconnect Bus.
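
The handshake rule and the in-order delivery on the Read Bus can be sketched in a few lines. This is an illustrative model under stated assumptions, not the MIB implementation: the function and class names are hypothetical, and tagging each read request is an assumption made here to keep the example short.

```python
from collections import deque


def transfer_occurs(valid: bool, accept: bool) -> bool:
    """A transfer completes on a rising clock edge only if the sender
    asserts valid and the receiver asserts accept."""
    return valid and accept


class ReorderScheduler:
    """Delivers read data in request order even if slaves with
    different latencies answer out of order (tags are an assumption)."""

    def __init__(self):
        self.order = deque()   # request tags in issue order
        self.ready = {}        # tag -> data that has already arrived

    def request(self, tag):
        self.order.append(tag)

    def response(self, tag, data):
        """Store a response; return all data that can now be delivered in order."""
        self.ready[tag] = data
        delivered = []
        while self.order and self.order[0] in self.ready:
            oldest = self.order.popleft()
            delivered.append(self.ready.pop(oldest))
        return delivered
```

For example, if requests a, b, c are issued and the slave serving b answers first, nothing is delivered until the response for a arrives, after which both a and b are released together.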


3 Processing Element Design

3.1 PE Example Application

An exemplary processing element for implementation of an image classification algorithm based on a support vector machine (SVM) [17] is described in order to demonstrate the modular coprocessor architecture. The purpose of this algorithm is to classify a test image to a given set of classes. A main processing task of the algorithm is to compare the test image x with all reference images y_j. Input images of 64x64 pixels with 16 bit fixed-point values per pixel are used and a total of 2520 reference images are available.



[Figure 4: the devices (Master 1 to Master n, Slave 1 to Slave m) are connected through the Request/Write Bus (buffers, Req/Write Arbiter) and the Read Bus (buffers, Reorder Scheduler)]


Fig. 4. Module Interconnect Bus architecture

The analysis of the algorithm has shown that most of the computation time is needed for calculating the sum of square differences function $ssd_j$:

$$ssd_j = \sum_{i} (x_i - y_{j,i})^2$$
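
As a plain software reference for the sum of square differences, the following sketch may help when checking a hardware result. It is not taken from the paper; the function names are hypothetical and images are assumed to be flat sequences of integer pixel values.

```python
def ssd(x, y):
    """Sum of square differences between two equally sized pixel sequences."""
    if len(x) != len(y):
        raise ValueError("images must have the same number of pixels")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def ssd_all(x, references):
    """Compare the test image x with every reference image y_j."""
    return [ssd(x, y_j) for y_j in references]
```

For the example application, `x` would hold 4096 pixels and `references` would hold 2520 such images.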

Figure 5 shows pseudo assembler code of the SSD function for an unoptimized RISC core. After initializing the loop counters, the core operations are executed in lines 5-8. Assuming one cycle per operation, this pseudo code yields four cycles per loop. Considering loop unrolling, only every fourth branch operation is counted. For the given example of 2520 reference images with 64x64 pixels, the code would take roughly 47M cycles to finish.

     1:             MOV Rj, #2520
     2: ssdj_loop:
     3:             MOV Ri, #4096
     4: ssdi_loop:
     5:             LD  Rx, (Ax+)
     6:             LD  Ry, (Ay+)
     7:             SUB Rx, Rx, Ry
     8:             MAC Ra, Rx, Rx
                    ...
    21:             SUB Ri, Ri, #4
    22:             BNZ ssdi_loop
    23:             ST  (Ar+), Ra
    24:             DEC Rj
    25:             BNZ ssdj_loop


Fig. 5. Pseudo RISC code of SSD
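
The cycle estimate for the RISC can be reproduced with a short calculation. The sketch assumes one cycle per instruction and reads the unrolled inner loop as four core operations per pixel plus two loop-control instructions (SUB, BNZ) per four pixels; that reading is an assumption, chosen because it matches the 0.222 pixels per cycle reported later in Table 2. The negligible outer-loop overhead is ignored.

```python
REFERENCE_IMAGES = 2520
PIXELS_PER_IMAGE = 64 * 64   # 4096 pixels per image
CORE_OPS_PER_PIXEL = 4       # LD, LD, SUB, MAC (lines 5-8)
UNROLL = 4                   # pixels handled per inner-loop iteration
LOOP_OVERHEAD = 2            # SUB Ri + BNZ, once per unrolled iteration (assumed)

cycles_per_pixel = CORE_OPS_PER_PIXEL + LOOP_OVERHEAD / UNROLL
total_pixels = REFERENCE_IMAGES * PIXELS_PER_IMAGE
risc_cycles = total_pixels * cycles_per_pixel

print(f"{cycles_per_pixel} cycles/pixel, {risc_cycles / 1e6:.1f}M cycles")
# prints "4.5 cycles/pixel, 46.4M cycles"
```

Under these assumptions the estimate is about 46.4M cycles, in line with the rough figure of 47M quoted in the text.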




3.2 SSD Data Path Architecture

Processing Elements carry out the LLA computations in the coprocessor. Figure 6 shows a generic architecture of an autonomous PE. It comprises an MIB Slave Interface, a Control Unit, an MIB Master Interface for accessing external data, and a Data Path for performing computations. An Internal Memory can be integrated into the PE when needed.

[Figure 6: a PE attached to the MIB through a Slave interface with Control and a Master interface, containing an Internal Memory and a Data Path]

Fig. 6. Generic Architecture of a Processing Element

Performing a computation task requires that the Dynamic Resource Scheduler first transfers function calls to the processing element via the MIB Slave Interface. A function call comprises data memory addresses and defines function specific parameters. Afterwards the PE starts processing. Source data is taken from external memories via the MIB or directly from internal memories if available. After finishing computations, the PE sends an interrupt request to the Dynamic Resource Scheduler.

For the exemplary algorithm, dedicated hardware can reduce the number of clock cycles as follows. The test image x is compared with each reference picture y_j. Loading the test image from external memory is necessary only once if the image fits in internal memory. Reference image data is loaded via the MIB Master Interface and is processed by the PE as soon as it is available. Data parallelism is exploited in order to increase the computation performance. The level of maximal concurrency is limited by the data bus width of both external memory and system bus.

Figure 7 shows the architecture approach. For this 64 bit example, four 16 bit pixels from a reference image are loaded in parallel. These pixels are subtracted from the corresponding pixels of the test image and squared afterwards. The results are added by a tree of adders and accumulated in the last step. After computing the whole sum of square differences, it is stored into internal memory. To further increase hardware performance, a pipeline stage is inserted after each operation.



[Figure 7: data path diagram with 64 bit inputs x_i and y_j,i split into 16 bit lanes, followed by subtract, multiply, and add stages with registers]

Fig. 7. Sum of Square Differences architecture, example 64 bit
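
The behaviour of the 64 bit data path can be modeled in software to check results and count cycles. This is a functional sketch only: the function name is hypothetical, the lane count is assumed to be a power of two, and pipeline fill latency and memory stalls are ignored.

```python
LANES = 4  # four 16 bit pixels per 64 bit word, as in the 64 bit example


def ssd_datapath(x_pixels, y_pixels, lanes=LANES):
    """Model one SSD computation that consumes `lanes` pixel pairs per cycle.

    Returns the accumulated SSD and the number of cycles spent on data words.
    """
    if len(x_pixels) != len(y_pixels) or len(x_pixels) % lanes:
        raise ValueError("pixel counts must match and be a multiple of lanes")
    acc = 0
    cycles = 0
    for base in range(0, len(x_pixels), lanes):
        # Subtract and square in every lane.
        squares = [(x_pixels[base + k] - y_pixels[base + k]) ** 2
                   for k in range(lanes)]
        # Adder tree: pairwise reduction until one partial sum remains.
        while len(squares) > 1:
            squares = [squares[k] + squares[k + 1]
                       for k in range(0, len(squares), 2)]
        acc += squares[0]   # final accumulation stage
        cycles += 1
    return acc, cycles
```

With four lanes, one 4096-pixel image takes 1024 cycles in this model; widening the bus to more lanes reduces the cycle count proportionally, which is why the data bus width limits the achievable concurrency.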

4 Verification and Results

For demonstrating the efficiency of the modular coprocessor architecture, the ASIC verification system CHIPit Gold Edition Pro from ProDesign [18] was used. CHIPit is used for emulation only. Real embedded processing demands a more area- and energy-efficient platform. Figure 8 shows the system architecture.

[Figure 8: system diagram including the Host PC]

Fig. 8. CHIPit Gold Edition Pro architecture

The CHIPit system comprises two Virtex II Pro XC2VP100-6 FPGAs. Each of them is connected to 256 MB DDR RAM. User software running on a Host PC can be used for executing high-level algorithms and controlling the hardware mapped on the FPGAs. A 528 Mbps connection is provided for communication between Host PC and FPGAs. In order to use both DDR RAMs, the coprocessor architecture was partitioned and mapped onto both FPGAs. The data width of the MIB was adjusted to 128 bit. Using two independent DDR RAMs allows simultaneous data transfers for two Processing Elements. Table 1 shows the synthesis results for coprocessors with different complexity. Frequency decreases with increasing number of processing elements due to the more complex place-and-route process for the 128 bit multilayer bus system.



Table 1. Synthesis Results after Place and Route using one XC2VP100-6 FPGA

    #SSD PEs   Frequency   Slices   Block RAMs   Multipliers
    0          156 MHz      5581    23            0
    1          151 MHz      8173    35            8
    2          147 MHz      9930    47           16
    3          133 MHz     12936    59           24
    4          125 MHz     14785    71           32
    5          119 MHz     16449    83           40

Computing the SSD application example involves loading of all reference data from external memory. Therefore, maximum processing performance is limited by the available external memory bandwidth [19]. The memory hierarchy of the demonstration system supports memory transfers of 256 bit per cycle, which is equal to SSD processing of sixteen 16 bit pixels per clock cycle. Table 2 shows the processing performance running the SSD algorithm on a RISC and a coprocessor containing two PEs, respectively. Compared to the RISC, the coprocessor needs 1/72 of the RISC clock frequency to achieve the same performance. According to Amdahl's law, the speedup for the whole application is approximately 5 if 20% of the high-level computations remain on the RISC.

Table 2. Performance for 2520 SSD computations with 4096 pixels (16 bit) per image

    Platform              Cycles   Pixels / cycle
    RISC                  47M      0.222
    2x 128 bit SSD PEs    645k     16
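
The figures in Table 2 and the overall speedup can be checked with a short calculation. The sketch assumes 4.5 RISC cycles per pixel (the reciprocal of the 0.222 pixels per cycle in Table 2) and interprets the 20% as the share of total RISC run time that stays on the RISC; both are assumptions made here for illustration.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when a fraction of the run time is sped up by `factor`."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


total_pixels = 2520 * 4096
coproc_cycles = total_pixels // 16    # 256 bit per cycle / 16 bit per pixel
risc_cycles = total_pixels * 4.5      # assumed: 1 / 0.222 pixels per cycle
ratio = risc_cycles / coproc_cycles

print(coproc_cycles)                          # 645120
print(round(ratio))                           # 72
print(round(amdahl_speedup(0.8, ratio), 2))   # 4.74
```

With these assumptions the overall speedup is about 4.7, which approaches the upper bound of 1 / 0.2 = 5 that applies when the accelerated part takes no time at all.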


5 Conclusions

In this paper, a modular coprocessor platform is presented, which is easy to extend or modify to support a large range of applications. It allows adding support for new applications without re-implementing all modules from scratch and can be used as a framework for dedicated hardware architectures. The architecture reaches feasible speed even on FPGAs, with frequencies up to 150 MHz on a Xilinx Virtex II-Pro. The architecture approach is optimized for integration of dedicated application specific processing elements. If several irregular low-level algorithms like image warping must be processed by different PEs, a fully programmable solution might require less area.

Currently, the coprocessor is only accessible by a Host PC. Future work will focus on interfacing an external RISC with the coprocessor architecture. To show the capabilities of the embedded architecture, more complex algorithms and processing engines will be implemented.



References

1. Lee, R.: Multimedia extensions for general-purpose processors. In: IEEE Workshop on Signal Processing Systems, SiPS 97, Design and Implementation, pp. 9–23. IEEE Computer Society Press, Los Alamitos (1997)
2. Talla, D., John, L., Burger, D.: Bottlenecks in multimedia processing with SIMD style extensions and architectural enhancements. IEEE Transactions on Computers 52, 1015–1031 (2003)
3. Vejanovski, R., Singh, J., Faulkner, M.: ASIC and DSP implementation of channel filter for 3G wireless TDD system. In: 14th Annual IEEE International ASIC/SOC Conference, Proceedings, pp. 47–51. IEEE Computer Society Press, Los Alamitos (2001)
4. Pirsch, P., Stolberg, H.J.: VLSI implementations of image and video multimedia processing systems. IEEE Transactions on Circuits and Systems for Video Technology 8, 878–891 (1998)
5. Jachalsky, J., Wahle, M., Pirsch, P., Capperon, S., Gehrke, W., Kruijtzer, W., Nuñez, A.: A core for ambient and mobile intelligent imaging applications. In: IEEE International Conference on Multimedia & Expo (ICME), Proceedings, IEEE Computer Society Press, Los Alamitos, CDROM (2003)
6. Paulin, P., Liem, C., Cornero, M., Nacabal, F., Goossens, G.: Embedded software in real-time signal processing systems: application and architecture trends. In: Proceedings of the IEEE, vol. 85, pp. 419–435. IEEE Computer Society Press, Los Alamitos (1997)
7. ARM: AMBA specification (rev. 2.0) (1999)
8. Xilinx: Xilinx website,
9. Altera: Altera website,
10. Xilinx: Virtex-II Pro and Virtex-II Pro X platform FPGAs: Complete data sheet (2005)
11. Stechele, W., Herrmann, S.: Reconfigurable hardware acceleration for video-based driver assistance. In: Workshop on Hardware for Visual Computing, Tübingen (2005)
12. Jachalsky, J., Wahle, M., Pirsch, P., Gehrke, W., Hinz, T.: A coprocessor for intelligent image and video processing in the automotive and mobile communication domain. In: IEEE International Symposium on Consumer Electronics, Proceedings, pp. 142–145. IEEE Computer Society Press, Los Alamitos (2004)
13. Dejnožková, E., Dokládal, P.: Embedded real-time architecture for level-set-based active contours. EURASIP Journal on Applied Signal Processing 2005, 2788–2803 (2005)
14. Lee, A., Bergmann, N.: On-chip communication architectures for reconfigurable system-on-chip. In: IEEE International Conference on Field-Programmable Technology, Proceedings, pp. 332–335. IEEE Computer Society Press, Los Alamitos (2003)
15. IBM: 64-bit processor local bus architecture specifications, version 3.5 (2001)
16. Cyr, G., Bois, G., Aboulhamid, M.: Generation of processor interface for SoC using standard communication protocol. IEE Proceedings - Computers and Digital Techniques 151, 367–376 (2004)
17. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002)
18. ProDesign: CHIPit Gold Edition Pro,
19. Ding, C., Kennedy, K.: The memory bandwidth bottleneck and its amelioration by a compiler. In: 14th International Symposium on Parallel and Distributed Processing (IPDPS), Proceedings, Washington, DC, USA, p. 181. IEEE Computer Society, Los Alamitos (2000)