
2012 International Conference on Computer Science and Electronics Engineering

Study of High Speed Parallel Image Processing Machine


LEI Tao
Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
Graduate School of the Chinese Academy of Sciences, Beijing 100039, China
e-mail: taoleiyan@yahoo.com.cn

ZHOU Jin
Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
e-mail: zhoujin@163.com
Abstract—As high-resolution, large-field-of-view sensors come into use, the real-time performance of the target tracking system becomes more and more important. In addition, a single-mode algorithm cannot track a target stably, because the target changes its characteristics from time to time. To solve these two problems, a parallel, multi-processor high-speed image processing machine is presented. The kernel methods for designing the parallel image processing system are analyzed, and optimization techniques for the DSPs are discussed. The FPGAs are used to detect targets over the whole field of view, and the DSPs are used for target tracking. The machine can automatically detect and track targets in real time. Experimental results indicate that the system adapts to various environments and to targets with different characteristics.

Keywords—multi-processors; real-time image processing; parallel processing; software optimization

JIANG Ping
Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
e-mail: jpnl@yahoo.com.cn

WU Qin-zhang
Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
e-mail: wqz@163.com

I. INTRODUCTION

Real-time image processing involves processing vast amounts of image data in a timely manner in order to extract useful target information. Digital images are essentially multidimensional signals and are thus quite data intensive, requiring a significant amount of computation and memory resources for processing [1]. In the optoelectronic tracking system, real-time performance becomes more and more important as high-resolution, large-field sensors are adopted. Because multiple targets must be tracked across a variety of scenes, the traditional single-mode tracking algorithm is no longer adequate, and a multi-mode tracking algorithm was therefore put forward. In such a multi-mode tracking system, several algorithms run in parallel at the same time to detect and track different kinds of objects, such as small, massed, extended, slow, and fast-moving targets. A traditional image processing machine based on a single processor cannot accomplish this task in time. To solve the problem, a high-speed image processing machine based on multiple processors is presented. In this new platform, four high-performance DSPs and four high-performance FPGAs are used, DDRII memory is exploited, and SRIO (Serial RapidIO) connects the processors for high-speed data transmission. The FPGAs run the low-level image processing algorithms, while the DSPs run the high-level algorithms for target association and tracking.

II. FEATURES OF THE IMAGE PROCESSING SYSTEM

A. The request for real-time performance of the system

Because the optoelectronic tracking system requires high precision, the image size grows larger and larger, and 12 or more bits per pixel may be needed for higher levels of accuracy. The amount of data increases further if color is also considered. For example, a typical application with a frame size of 2 Mbytes (1024x1024 pixels at 16 bits of precision) running at 100 fps must handle roughly 200 Mbytes of image data per second. With the trend toward higher resolution and faster frame rates, the amount of data that must be processed in a short time will continue to increase dramatically. The key to coping with this issue is to exploit a higher-performance hardware platform together with parallel image processing algorithms.

B. Parallelism of the image processing algorithm

Much of what goes into implementing an efficient image processing system centers on how well the implementation, both hardware and software, exploits the different forms of parallelism in an algorithm, namely data-level parallelism (DLP) and instruction-level parallelism (ILP) [2, 3]. DLP manifests itself in the application of the same operation to different sets of data, while ILP manifests itself in scheduling the simultaneous execution of multiple independent operations in a pipelined fashion. Traditionally, image processing operations have been classified into three main levels, namely low-level, intermediate-level, and high-level, where each successive level differs in its input/output data relationship [4, 5]. Low-level operators take an image as their input and produce an image as their output. They can be further classified

into point, neighborhood, and global operations. At this level the operations are data-intensive, with highly structured and predictable processing, and they require a high bandwidth for accessing image data. In general, low-level operations are excellent candidates for exploiting DLP. Intermediate-level operations transform image data into a slightly more abstract form of information; they include segmenting an image into regions of interest and extracting edges, lines, contours, or other image attributes of interest such as statistical features. The goal of these operations is to reduce the amount of data to a set of features suitable for further high-level processing. Some intermediate-level operations are also data intensive with a regular processing structure, which makes them suitable candidates for exploiting DLP as well. High-level operations interpret the abstract data coming from the intermediate level, performing knowledge-based scene analysis on a reduced amount of data. Such operations include recognition of objects or a control decision based on some extracted features. They are usually characterized by control- or branch-intensive processing; thus they are less data intensive and more inherently sequential than parallel. Due to their irregular structure and low-bandwidth requirements, such operations are suitable candidates for exploiting ILP [6]. A minimal sketch of a low-level point operation that exposes DLP is given below.
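As an illustration of the kind of low-level point operation described above, the following C sketch applies a fixed threshold to every pixel of a 16-bit grayscale frame. The function name and buffer layout are chosen here only for illustration and are not part of the original system; the point is that the same independent operation is applied to every pixel, which is exactly the data-level parallelism the low-level stage exploits.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative point operation: binarize a 16-bit grayscale image.
 * Every output pixel depends only on the corresponding input pixel,
 * so the loop is trivially data-parallel (DLP). */
void threshold_u16(const uint16_t *src, uint8_t *dst,
                   size_t num_pixels, uint16_t thresh)
{
    for (size_t i = 0; i < num_pixels; ++i) {
        dst[i] = (src[i] > thresh) ? 255u : 0u;
    }
}
```

For a 1024x1024 frame, num_pixels would be 1048576; the loop body is free of data dependences between iterations, so it maps well onto SIMD lanes of a DSP or onto a pixel pipeline in an FPGA.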
III. PARALLEL IMAGE PROCESSING MACHINE BASED ON MULTIPROCESSORS

Three kernel components decide the performance of a real-time image processing system: the processor nodes, the memory, and the network connecting the processors. A deficiency in any one of the three restricts the performance of the whole system. Since processor performance has improved so quickly, the CPU itself is no longer the main problem; the key to improving overall system performance is to raise the speed of the memory and the performance of the network connecting the processors.

A. High performance processor node

The image processing machine presented consists of four Virtex-5 LX50T FPGAs (V5) and four TMS320C6455 DSPs (C6455). The FPGAs offer flexibility in implementing custom hardware solutions and have been used extensively for realizing the low-level image algorithms. DSPs are well known for their high performance and their relative flexibility for implementing more comprehensive algorithms than FPGAs. The C6455 is a high-performance DSP with a core clock of 1.2 GHz and a 16-bit fixed-point processing capability of 9600 MMAC/s.

B. High speed DDRII storage

Because of the ever-increasing gap between computation performance and memory performance, the real-time image processing system presented uses DDRII SDRAM connected to the 32-bit, 533-MHz (data rate) DDR2 memory controller of the C6455. The maximum data rate achievable by the DDR2 memory controller is about 2.1 Gbytes/s.

C. High speed system interconnection

The parallel processing hardware platform can exploit its advantages fully only when the parallel structure is reasonable, in which case the speed-up can approach N (where N is the total number of processors in the system). The performance of the parallel system is realized through the network connecting the processors. Such networks can be divided into two types: in the first, the processors share a bus or share memory, which is called a tightly coupled (close-coupling) parallel system; in the second, every processor has its own data memory and the processors are linked through communication interfaces, which is called a loosely coupled (loose-coupling) parallel system.

Figure 1. Interconnection of multi-processors
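Figure 1 shows the physical interconnection; to make the loosely coupled style concrete, the following C fragment is a purely conceptual model in which each node owns private memory and data moves only through an explicit link buffer. The structure and function names are invented for illustration and do not correspond to the actual SRIO or PCI driver interfaces of the machine.

```c
#include <stdint.h>
#include <string.h>

#define MSG_WORDS 256

typedef struct {
    uint32_t data[MSG_WORDS];   /* private local memory of one node */
} node_mem_t;

typedef struct {
    uint32_t buf[MSG_WORDS];    /* link buffer between two nodes */
    int      ready;             /* set by the sender, cleared by the receiver */
} link_t;

/* Sender: copy a block from its private memory into the link buffer. */
static void link_send(link_t *link, const node_mem_t *src)
{
    memcpy(link->buf, src->data, sizeof link->buf);
    link->ready = 1;            /* real hardware would raise a doorbell/interrupt here */
}

/* Receiver: poll the link and copy the payload into its own private memory. */
static int link_recv(link_t *link, node_mem_t *dst)
{
    if (!link->ready)
        return 0;
    memcpy(dst->data, link->buf, sizeof link->buf);
    link->ready = 0;
    return 1;
}
```

The essential property modeled here is that no processor ever reads another processor's memory directly; all sharing happens through explicit transfers over the communication interface, which is what distinguishes the loosely coupled organization from a shared-bus or shared-memory design.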

In the machine presented, the interconnection uses a shared PCI bus together with SRIO. The SRIO links provide point-to-point interconnection between pairs of processors, as shown in Figure 1. The PCI bus is 32 bits wide and works at 33 MHz, so its peak data transmission bandwidth is 132 Mbytes/s. Because a DSP occupies the PCI bus exclusively and bus arbitration is frequent, the communication efficiency of the PCI bus alone is too low to satisfy the requirements of the task. SRIO is a high-performance serial interconnection technology for point-to-point data transmission [7]. The SRIO module embedded in the C6455 has four full-duplex ports, and every port supports a data rate of up to 3.125 Gb/s. As Fig. 1 shows, every DSP is linked to another DSP through one 1x channel, so each DSP can access the resources of the other DSPs on the board. Each FPGA is linked to one DSP through SRIO, so there is a high-bandwidth, high-efficiency data path between them. Through the PCI bus and the SRIO links, the communication efficiency of the image processing machine is maximized.

D. Preprocessing based on FPGA

FPGAs are arrays of reconfigurable complex logic blocks with a network of programmable interconnections. They can be programmed to exploit the different types of parallelism inherent in an image processing algorithm, which leads to highly efficient real-time processing of the low-level and intermediate-level operations. In the image processing machine presented, four FPGAs carry out the low-level filtering and the intermediate-level operations that produce the labeled target results after segmentation. Fig. 2 shows the whole-field-of-view preprocessing frame constructed with the FPGAs. M10, M11, ..., M40 represent the data stored in memory, which are the image data after filtering and the line tables after target labeling; a1, a2, ..., a7 represent the algorithm modules


realized in the FPGAs. They are: a fast histogram statistics module, a 5x5 median filter, a max-median filter over all directions, a 9x9 cross high-pass filter, an adaptive contrast enhancement module, an edge detection module, and a target labeling module. Furthermore, several peripheral interface modules needed for image processing, such as the Camera Link interface, the SDRAM controller, and the image store module, are also implemented. Their physical implementations run at 192 MHz, giving a processing capability of more than 120 fps for 1024x1024 16-bit gray images. A software reference sketch of one of these filters is given below.
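As a software reference model of one of the FPGA modules listed above (the 5x5 median filter), the following C sketch computes, for each interior pixel, the median of its 5x5 neighborhood with a simple insertion sort. The function name, border handling, and row-major layout are assumptions made for illustration; the FPGA implementation itself is a pipelined hardware design, not this C code.

```c
#include <stdint.h>
#include <stddef.h>

/* Reference model of a 5x5 median filter on a 16-bit grayscale image.
 * Border pixels (the outer 2-pixel frame) are copied through unchanged. */
void median5x5_u16(const uint16_t *src, uint16_t *dst,
                   size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            if (y < 2 || y >= height - 2 || x < 2 || x >= width - 2) {
                dst[y * width + x] = src[y * width + x];
                continue;
            }
            /* Gather the 25 neighborhood samples. */
            uint16_t win[25];
            int n = 0;
            for (int dy = -2; dy <= 2; ++dy)
                for (int dx = -2; dx <= 2; ++dx)
                    win[n++] = src[(y + dy) * width + (x + dx)];
            /* Insertion sort; the median is the 13th smallest value. */
            for (int i = 1; i < 25; ++i) {
                uint16_t v = win[i];
                int j = i - 1;
                while (j >= 0 && win[j] > v) {
                    win[j + 1] = win[j];
                    --j;
                }
                win[j + 1] = v;
            }
            dst[y * width + x] = win[12];
        }
    }
}
```

An FPGA implementation would typically replace the per-pixel sort with line buffers feeding a pipelined sorting network, which is the kind of structure that makes throughputs such as the quoted 120 fps achievable in hardware.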

Figure 2. The frame of whole-field-of-view processing

As Fig. 2 shows, every FPGA constructs one channel, and every channel is used for detecting one kind of target over the whole field of view. In these four channels, four target detection algorithms are realized: a small-target detection algorithm, a mass-target detection algorithm, a high-contrast target detection algorithm, and a three-frame difference detection algorithm. These four algorithms run in the system at the same time, so the machine can detect four kinds of targets with different features. The features of each candidate target extracted are calculated and recognized, so that noise and false targets are discarded.

E. High speed parallel processing based on DSPs

There are two levels of parallelism in high-speed image processing based on DSPs: high-level parallelism and low-level parallelism. High-level parallelism means that more than one DSP runs at the same time in the system. Among these DSPs, one serves as the master handling system control and running an RTOS for complex control-intensive operations; the other DSPs act as slaves to the master and mainly perform computationally intensive data processing. In the parallel image processing machine presented in this paper, four DSPs are used, different algorithms run on them, and DSP4 acts as the master DSP. Fig. 3 shows the task partition for these DSPs; on this partition the multi-mode image tracking algorithm is realized for extracting different types of targets.

Figure 3. The task partition for multi DSPs

Low-level parallelism relies on three architectural features that are essential to an image processing system, namely single instruction multiple data (SIMD), very long instruction word (VLIW), and an efficient memory subsystem. SIMD means that the processor executes the same instruction simultaneously on different portions of data in parallel, thus allowing more computations to be performed in a shorter time [8]. While SIMD exploits data-level parallelism, VLIW can be used to exploit instruction-level parallelism for speeding up high-level operations. It furnishes the ability to execute multiple instructions within one processor clock cycle, allowing software-oriented pipelining of instructions

by the programmer. While SIMD and VLIW help speed up the processing of diverse image operations, the time saved through these mechanisms would be wasted if there were no efficient way to transfer data through the system, so direct memory access (DMA) is introduced. DMA is a well-known tool for hiding memory access latencies, especially for image data. Many techniques can be applied to optimize the most time-consuming portions of the code in order to bring the execution time within an acceptable real-time range; the following are some important optimization methods frequently used on a DSP-based image processing platform.

1) Algorithm-level optimization
To achieve real-time performance on a given hardware platform, the algorithms often have to be optimized. In general, the greatest gains in performance are obtained through optimization at the algorithmic level. Modifications performed at this level help streamline the algorithm down to its core functionality, which not only lowers the computational complexity but also makes the implementation faster.

2) Compiler optimization options
The compiler provided by TI is equipped with an automatic software optimizer. After the code has been profiled, suitable compiler optimization options should be applied to the functions that will benefit most.

3) Substituting fixed-point for floating-point arithmetic
More often than not, floating-point computations pose a major performance bottleneck in real-time implementations, especially on fixed-point processors. Fixed-point calculations are usually much faster than floating-point calculations on a fixed-point processor.

4) Substituting in-line code for subroutines
Because a subroutine call incurs overhead, since variables have to be pushed onto a stack and popped back upon return, frequently called small functions should be replaced with in-line code.


5) Packed data processing
The C6455 incorporates packed data processing to enhance the performance of processing data that is narrower than the data path. Because most modern data paths are 32 bits wide, it is possible to process four 8-bit pixels or two 16-bit pixels simultaneously, so more image data is handled in each clock cycle.

6) Loop transformations
It is good practice to first apply loop transformations in the high-level language before resorting to assembly optimizations. This includes removing unnecessary calculations from the loop body, replacing function calls with in-line code, and loop unrolling. Loop unrolling increases the ILP of the loop: it performs multiple iterations per pass, groups more computations together for simultaneous access to more data, reduces the loop overhead due to branching, and helps the compiler create a more efficient schedule for the main loop body. After loop unrolling, the software pipelining scheduled by the compiler improves the loop performance distinctly. A short sketch illustrating the packed-data and loop-unrolling ideas is given below. Table I shows the execution times of several software modules before and after applying the optimizations introduced above; the region tested is 256x256 pixels with 16 bits per pixel.
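The following C sketch illustrates the flavor of optimizations 5) and 6) on a simple pixel-wise add of two 16-bit images: the loop is unrolled and each 32-bit word is treated as two packed 16-bit pixels. It is written in portable C with explicit shifts and masks rather than TI-specific intrinsics, so the function name and the wrap-around (non-saturating) arithmetic are illustrative assumptions, not the code actually used on the machine.

```c
#include <stdint.h>
#include <stddef.h>

/* Add two 16-bit images pixel by pixel, two pixels per 32-bit word.
 * num_pixels is assumed to be a multiple of 4 for the unrolled loop. */
void add_u16_packed(const uint32_t *a, const uint32_t *b,
                    uint32_t *out, size_t num_pixels)
{
    size_t num_words = num_pixels / 2;           /* two 16-bit pixels per word */

    for (size_t i = 0; i < num_words; i += 2) {  /* unrolled by 2 words = 4 pixels */
        uint32_t a0 = a[i],     b0 = b[i];
        uint32_t a1 = a[i + 1], b1 = b[i + 1];

        /* Add the low and high 16-bit halves independently (wrap-around,
         * no saturation) and repack them into one 32-bit result word. */
        uint32_t lo0 = (a0 + b0) & 0x0000FFFFu;
        uint32_t hi0 = ((a0 >> 16) + (b0 >> 16)) << 16;
        uint32_t lo1 = (a1 + b1) & 0x0000FFFFu;
        uint32_t hi1 = ((a1 >> 16) + (b1 >> 16)) << 16;

        out[i]     = hi0 | lo0;
        out[i + 1] = hi1 | lo1;
    }
}
```

On the C6455, the same effect would typically be obtained through the compiler's packed-data support and automatic software pipelining after unrolling; the portable version above only shows the data-layout idea behind processing two 16-bit pixels per 32-bit word.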
TABLE I. TIME COST OF SOFTWARE MODULES BEFORE AND AFTER OPTIMIZATION IN THE PARALLEL IMAGE PROCESSING MACHINE

Ord  Name of software module          Before opt   After opt
1    Binary morphologic filtering     4 ms         0.06 ms
2    Gray 5x5 morphologic filtering   16 ms        1.2 ms
3    5x5 average filtering            6 ms         0.8 ms
4    3x3 Sobel edge filtering         3.8 ms       0.18 ms
5    Binary statistical filtering     1.8 ms       0.18 ms
6    Contrast linear enhancement      25 ms        0.7 ms

IV. RESULTS OF THE PARALLEL IMAGE PROCESSING MACHINE

Fig. 4 and Fig. 5 show the robust tracking results of the high-speed parallel image processing machine presented. From Fig. 4 we can see that, although the background is complex, the tracking remains stable all the time. In this situation, the three-frame difference algorithm was scheduled automatically by the image processing machine.

Figure 4. Frames from a car tracking sequence on a complex ground background. The car is tracked automatically by our parallel tracking system.

Figure 5. Complex background, with target size and illumination changing

Fig. 5 shows the tracking result for a model aircraft. In this image sequence, the size of the object and the illumination change all the time, and the clouds and buildings in the background are additional challenges; without multiple algorithms in the system, the tracking would fail quickly. Thanks to the multi-mode tracking algorithm, the tracking is not influenced by the changing features of the target or by the complex background.

V. CONCLUSION

The high-speed parallel image processing machine presented runs at 100 fps. The frames processed range in size from 320x256 to 1024x1024, with 16 bits per pixel. Four target detection algorithms based on the FPGAs and four target tracking algorithms based on the DSPs have been realized in the multi-mode tracking system. The experimental results indicate that the system adapts to various environments and to targets with different characteristics, and that the time cost is acceptable.

REFERENCES

[1] A. Bovik, "Introduction to Digital Image and Video Processing," in Handbook of Image & Video Processing, A. C. Bovik, Ed. Amsterdam: Elsevier Academic Press, 2005.
[2] K. Dong, M. Hu, Z. Ji, and B. Fang, "Research on Architectures for High Performance Image Processing," Proceedings of the Fourth International Workshop on Advanced Parallel Processing Technologies, September 2001.
[3] H. Hunter and J. Moreno, "A New Look at Exploiting Data Parallelism in Embedded Systems," Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems, pp. 159-169, October/November 2003.
[4] K. Dong, M. Hu, Z. Ji, and B. Fang, "Research on Architectures for High Performance Image Processing," Proceedings of the Fourth International Workshop on Advanced Parallel Processing Technologies, September 2001.
[5] A. Downton and D. Crookes, "Parallel Architectures for Image Processing," Electronics & Communication Engineering Journal, Vol. 10, No. 3, pp. 139-151, June 1998.
[6] H. Broers, W. Caarls, P. Jonker, and R. Kleihorst, "Architecture Study for Smart Cameras," Proceedings of the European Optical Society Conference on Industrial Imaging and Machine Vision, pp. 39-49, June 2005.
[7] RapidIO Trade Association, "RapidIO Interconnect Specification Rev. 1.2" [OL], http://www.rapidio.org, 2002-10-14.
[8] H. Hunter and J. Moreno, "A New Look at Exploiting Data Parallelism in Embedded Systems," Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems, pp. 159-169, October/November 2003.

