
Parallel Image Processing with CUDA

A case study with the Canny Edge Detection Filter

Daniel Weingaertner
Informatics Department, Federal University of Paraná, Brazil

Hochschule Regensburg 02.05.2011

Daniel Weingaertner (DInf-UFPR)

FH-Regensburg

1 / 40

Outline

1. Introduction
2. Insight Toolkit (ITK)
3. GPGPU and CUDA
4. Integrating CUDA and ITK
5. Canny Edge Detection
6. Experimental Results
7. Conclusion


Paraná, Brazil


Brazil Europe


Paraná


Curitiba


Federal University of Paraná


Informatics Department
Undergraduate:
- Bachelor in Computer Science: 8-semester course, 80 incoming students per year
- Bachelor in Biomedical Informatics: 8-semester course, 30 incoming students per year

Graduate: Master and PhD in Computer Science
- Algorithms, Image Processing, Computer Vision, Artificial Intelligence
- Databases, Scientific Computing and Open Source Software, Computer-Human Interface
- Computer Networks, Embedded Systems


Insight Toolkit (ITK)


- Created in 1999
- Open source
- Multi-platform
- Object-oriented (templates)
- Good documentation and support

Figure: Image Processing Workflow in ITK


ITK - Sample code


    #include <cstdlib>   // atof, atoi, EXIT_SUCCESS, EXIT_FAILURE
    #include <iostream>

    #include "itkImage.h"
    #include "itkImageFileReader.h"
    #include "itkImageFileWriter.h"
    #include "itkCannyEdgeDetectionImageFilter.h"

    // 2D float image, plus the reader, writer and filter types that operate on it.
    typedef itk::Image<float, 2>                                      ImageType;
    typedef itk::ImageFileReader<ImageType>                           ReaderType;
    typedef itk::ImageFileWriter<ImageType>                           WriterType;
    typedef itk::CannyEdgeDetectionImageFilter<ImageType, ImageType>  CannyFilter;

    int main(int argc, char **argv)
    {
      // Arguments: input file, output file, variance, upper and lower threshold.
      if (argc < 6) {
        std::cerr << "Usage: " << argv[0]
                  << " input output variance upperThreshold lowerThreshold\n";
        return EXIT_FAILURE;
      }

      // Read the input image.
      ReaderType::Pointer reader = ReaderType::New();
      reader->SetFileName(argv[1]);
      reader->Update();

      // Configure and run the Canny filter.
      CannyFilter::Pointer canny = CannyFilter::New();
      canny->SetInput(reader->GetOutput());
      canny->SetVariance(atof(argv[3]));
      canny->SetUpperThreshold(atoi(argv[4]));
      canny->SetLowerThreshold(atoi(argv[5]));
      canny->Update();

      // Write the edge image.
      WriterType::Pointer writer = WriterType::New();
      writer->SetFileName(argv[2]);
      writer->SetInput(canny->GetOutput());
      writer->Update();

      return EXIT_SUCCESS;
    }


What is GPGPU Computing?


- The use of the GPU for general-purpose computation
- CPU and GPU can be used concurrently
- To the end user, it is simply a way to run applications faster


What is CUDA?
- CUDA = Compute Unified Device Architecture
- A general-purpose parallel computing architecture
- Provides libraries, a C language extension, and a hardware driver


Parallel Processing Models


Single-Instruction Multiple-Thread Unit

- Creates, manages, schedules, and executes threads in groups of 32 (warps)
- All threads in a warp start at the same point in the code
- They are nevertheless free to jump to different code positions independently
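
The warp grouping can be illustrated on the host. The following is a minimal C++ sketch, not CUDA device code: the names `kWarpSize`, `warpsInBlock` and `serializedPasses` are hypothetical, and the "two passes for a divergent if/else" cost model is a deliberate simplification of how divergent branches within one warp are executed one after the other.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Warp size stated on the slide: threads are grouped in sets of 32.
constexpr int kWarpSize = 32;

// Number of warps needed to cover a block (the last warp may be partly empty).
int warpsInBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Simplified cost model for a single if/else: a warp whose threads all agree
// on the branch needs one pass; a warp with mixed outcomes executes both
// sides in sequence (two passes). takesBranch[i] is the outcome of thread i.
int serializedPasses(const std::vector<bool>& takesBranch) {
    int passes = 0;
    for (std::size_t start = 0; start < takesBranch.size(); start += kWarpSize) {
        std::size_t end = start + kWarpSize;
        if (end > takesBranch.size()) end = takesBranch.size();
        bool anyTrue = false, anyFalse = false;
        for (std::size_t i = start; i < end; ++i) {
            if (takesBranch[i]) anyTrue = true; else anyFalse = true;
        }
        passes += (anyTrue && anyFalse) ? 2 : 1;
    }
    return passes;
}
```

Under this model, a single thread that disagrees with the other 31 in its warp doubles the cost of that warp, which is why branch conditions that are uniform per warp are preferable.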


CUDA Architecture Overview


Optimization Strategies for CUDA

Main optimization strategies for CUDA involve:
- Optimized, careful memory access
- Maximizing processor utilization
- Maximizing non-serialized instructions


CUDA - Sample Code


    #include <stdio.h>
    #include <stdlib.h>   /* malloc, free */
    #include <assert.h>
    #include <cuda.h>

    /* Reference implementation on the CPU. */
    void incrementArrayOnHost(float *a, int N)
    {
      int i;
      for (i = 0; i < N; i++) a[i] = a[i] + 1.f;
    }

    /* Kernel: each thread increments one element. */
    __global__ void incrementArrayOnDevice(float *a, int N)
    {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N) a[idx] = a[idx] + 1.f;   /* guard for the last, partly filled block */
    }

    int main(void)
    {
      float *a_h, *b_h;   /* pointers to host memory */
      float *a_d;         /* pointer to device memory */
      int i, N = 10000;
      size_t size = N * sizeof(float);

      a_h = (float *)malloc(size);
      b_h = (float *)malloc(size);
      cudaMalloc((void **)&a_d, size);

      for (i = 0; i < N; i++) a_h[i] = (float)i;
      cudaMemcpy(a_d, a_h, sizeof(float) * N, cudaMemcpyHostToDevice);

      incrementArrayOnHost(a_h, N);

      /* Round the number of blocks up so that all N elements are covered. */
      int blockSize = 256;
      int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);
      incrementArrayOnDevice<<<nBlocks, blockSize>>>(a_d, N);

      cudaMemcpy(b_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);

      /* Check that host and device results agree (the listing includes assert.h). */
      for (i = 0; i < N; i++) assert(a_h[i] == b_h[i]);

      free(a_h); free(b_h); cudaFree(a_d);
      return 0;
    }
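
The block-count arithmetic in the sample is a ceiling division, and the kernel's index computation can be replayed on the host to see why the bounds guard is needed. The sketch below is plain C++ rather than CUDA; `blocksFor` and `incrementLikeDevice` are hypothetical names, and the nested loops only emulate what the grid of threads does in parallel.

```cpp
#include <cassert>
#include <vector>

// Ceiling division: smallest number of blocks whose threads cover n elements.
// Equivalent to N / blockSize + (N % blockSize == 0 ? 0 : 1) in the sample.
int blocksFor(int n, int blockSize) {
    assert(n >= 0 && blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Host-side emulation of the launch: every (block, thread) pair computes the
// same global index as blockIdx.x * blockDim.x + threadIdx.x in the kernel.
void incrementLikeDevice(std::vector<float>& a, int blockSize) {
    const int n = static_cast<int>(a.size());
    const int nBlocks = blocksFor(n, blockSize);
    for (int block = 0; block < nBlocks; ++block) {
        for (int thread = 0; thread < blockSize; ++thread) {
            const int idx = block * blockSize + thread;
            if (idx < n) a[idx] += 1.0f;  // bounds guard, as in the kernel
        }
    }
}
```

For N = 10000 and a block size of 256 this yields 40 blocks, that is 10240 threads, of which the last 240 fall outside the array and are skipped by the guard.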


Integrating CUDA Filters into ITK Workflow

The ITK community suggests:
- Re-implement filters where parallelization provides a significant speedup
- Consider the entire workflow: copying to and from the GPU is very time-consuming
- Careful! "Premature optimization is the root of all evil!" (Donald Knuth)


CUDA Insight Toolkit (CITK)


Changes to ITK:
- Slight architecture change: CudaImportImageContainer
- Backwards compatible
- Data transfer between HOST and DEVICE only on demand
- Allows filter chaining inside the DEVICE
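
The idea of transferring data only on demand can be sketched with two validity flags. This is a hypothetical illustration in plain C++, not the actual CudaImportImageContainer: the class name `LazyImageBuffer` and its methods are invented here, and a second `std::vector` stands in for device memory so that the sketch runs without a GPU.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical container: holds a host copy and a stand-in "device" copy of
// the pixels and copies between them only when the requested side is stale.
class LazyImageBuffer {
public:
    explicit LazyImageBuffer(std::vector<float> pixels)
        : host_(std::move(pixels)), device_(host_.size(), 0.0f) {}

    // Access for GPU filters: uploads only if the device copy is out of date.
    std::vector<float>& deviceData() {
        if (!deviceValid_) {
            assert(device_.size() == host_.size());
            device_ = host_;       // stands in for a host-to-device copy
            ++transfers_;
            deviceValid_ = true;
        }
        hostValid_ = false;        // the caller may modify the device copy
        return device_;
    }

    // Access for CPU filters: downloads only if the host copy is out of date.
    std::vector<float>& hostData() {
        if (!hostValid_) {
            host_ = device_;       // stands in for a device-to-host copy
            ++transfers_;
            hostValid_ = true;
        }
        deviceValid_ = false;      // the caller may modify the host copy
        return host_;
    }

    int transfers() const { return transfers_; }

private:
    std::vector<float> host_;
    std::vector<float> device_;
    bool hostValid_ = true;
    bool deviceValid_ = false;
    int transfers_ = 0;
};
```

With this scheme, two consecutive device-side filters trigger a single upload, and the download happens only when a host-side consumer finally asks for the pixels, which is the behaviour that makes filter chaining inside the DEVICE cheap.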


CudaCanny

itkCudaCannyEdgeDetectionImageFilter

Algorithm 1: Canny Edge Detection Filter
  1. Gaussian Smoothing
  2. Gradient Computation
  3. Non-Maximum Suppression
  4. Hysteresis


Gradient Computation with Sobel Filter


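itkCudaSobelEdgeDetectionImageFilter

Figure: (a) Sobel X, (b) Sobel Y

Gradient magnitude and direction:

\[ L_v = \sqrt{L_x^2 + L_y^2} \tag{1} \]

\[ \theta = \arctan\left(\frac{L_y}{L_x}\right) \tag{2} \]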
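
As an illustration of the gradient magnitude and direction computed from the Sobel responses L_x and L_y, here is a minimal host-side C++ sketch for a single interior pixel. The function name `sobelAt`, the `Gradient` struct, and the sign convention of the two kernels are assumptions (the kernel figures did not survive extraction), and `std::atan2` is used instead of a plain arctangent of the quotient so that L_x = 0 does not cause a division by zero.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Gradient {
    float magnitude;   // sqrt(Lx^2 + Ly^2)
    float direction;   // angle in radians
};

// Sobel gradient at the interior pixel (x, y) of a row-major image.
Gradient sobelAt(const std::vector<float>& img, int width, int height,
                 int x, int y) {
    assert(x > 0 && y > 0 && x < width - 1 && y < height - 1);
    auto p = [&](int dx, int dy) { return img[(y + dy) * width + (x + dx)]; };

    // Sobel X: right column minus left column, centre row weighted by 2.
    const float lx = (p(1, -1) + 2 * p(1, 0) + p(1, 1))
                   - (p(-1, -1) + 2 * p(-1, 0) + p(-1, 1));
    // Sobel Y: bottom row minus top row, centre column weighted by 2.
    const float ly = (p(-1, 1) + 2 * p(0, 1) + p(1, 1))
                   - (p(-1, -1) + 2 * p(0, -1) + p(1, -1));

    return { std::sqrt(lx * lx + ly * ly), std::atan2(ly, lx) };
}
```

For a 3x3 patch whose right column is 1 and whose other pixels are 0, the centre pixel gets L_x = 4 and L_y = 0, so the magnitude is 4 and the direction is 0 (a gradient pointing along the x axis).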


Optimization for Edge Direction Computation
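
The figure for this slide is not recoverable, so the author's actual optimization is unknown. One common way to avoid the arctangent, shown here purely as an assumption, is to classify the gradient into the four directions needed by non-maximum suppression by comparing |L_y| against |L_x| scaled by tan(22.5°) and tan(67.5°). The names `Sector` and `quantizeDirection` are hypothetical, and which diagonal is called 45° or 135° depends on the image coordinate convention.

```cpp
#include <cassert>
#include <cmath>

// Sector labels for the four directions used by non-maximum suppression.
enum Sector { kHorizontal = 0, kDiagonal45 = 1, kVertical = 2, kDiagonal135 = 3 };

// Classifies the gradient (lx, ly) without calling atan: only comparisons
// and two multiplications are needed.
Sector quantizeDirection(float lx, float ly) {
    assert(std::isfinite(lx) && std::isfinite(ly));
    const float tan22_5 = 0.41421356f;  // tan(22.5 deg) = sqrt(2) - 1
    const float tan67_5 = 2.41421356f;  // tan(67.5 deg) = sqrt(2) + 1
    const float ax = std::fabs(lx);
    const float ay = std::fabs(ly);
    if (ay <= tan22_5 * ax) return kHorizontal;
    if (ay >= tan67_5 * ax) return kVertical;
    // Components with equal sign lie on one diagonal, mixed signs on the other.
    return ((lx > 0) == (ly > 0)) ? kDiagonal45 : kDiagonal135;
}
```

Because the classification depends only on the ratio of the two components, it gives the same sector as rounding the angle to the nearest multiple of 45° while avoiding a transcendental function per pixel.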


Code Extract from CudaSobel


Hysteresis Operation


Hysteresis Algorithm

Algorithm 2: Hysteresis on CPU
  Transfer the Gradient/NMS images to the GPU
  repeat
    Run the hysteresis kernel on the GPU
  until no pixel changes status
  Return the edge image
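
The host-side loop of Algorithm 2 can be sketched in plain C++ by letting one full-image sweep stand in for one kernel launch. The names `hysteresisSweep` and `runHysteresis` and the three status codes are assumptions made for this illustration; a sequential in-place sweep also propagates edges faster than a parallel kernel would, so the number of iterations differs from the GPU version even though the final edge image is the same.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pixel status after thresholding: below the lower threshold, between the
// two thresholds (candidate), or above the upper threshold (definitive edge).
constexpr std::uint8_t kNone = 0, kCandidate = 1, kEdge = 2;

// One sweep, standing in for one kernel launch: promote every candidate that
// touches a definitive edge in its 8-neighbourhood. Returns true on change.
bool hysteresisSweep(std::vector<std::uint8_t>& status, int width, int height) {
    assert(static_cast<int>(status.size()) == width * height);
    bool changed = false;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (status[y * width + x] != kCandidate) continue;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int nx = x + dx, ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                    if (status[ny * width + nx] == kEdge) {
                        status[y * width + x] = kEdge;
                        changed = true;
                    }
                }
            }
        }
    }
    return changed;
}

// Repeat until no pixel changes status, then drop unconnected candidates.
void runHysteresis(std::vector<std::uint8_t>& status, int width, int height) {
    while (hysteresisSweep(status, width, height)) {
        // keep sweeping: the previous pass still promoted at least one pixel
    }
    for (auto& s : status) {
        if (s == kCandidate) s = kNone;
    }
}
```

In a single row `edge, candidate, candidate, none, candidate`, the two candidates next to the edge are promoted and the isolated one at the end is discarded.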


Hysteresis Algorithm

Algorithm 3: Hysteresis on GPU
  Load an image region of size 18x18 into shared memory
  modified ← false
  repeat
    modified_region ← false
    Synchronize threads of the same multiprocessor
    if pixel changes status then
      modified ← true
      modified_region ← true
    end if
    Synchronize threads of the same multiprocessor
  until modified_region = false
  if modified = true then
    Update modified status on HOST
  end if
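
The per-block work of Algorithm 3 can be emulated on the host as follows. This is a sketch under assumptions: the 18x18 region is read here as a 16x16 thread block plus a one-pixel halo on each side (the source only states the region size), a local array stands in for shared memory, and `hysteresisTile` with its status codes is a hypothetical name. The return value plays the role of the `modified` flag reported back to the HOST.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint8_t kNone = 0, kCandidate = 1, kEdge = 2;
constexpr int kBlock = 16;         // assumed thread-block edge length
constexpr int kTile = kBlock + 2;  // 18: block plus one halo pixel per side

// Processes the block whose top-left pixel is (x0, y0). Returns true if any
// pixel of the block changed status.
bool hysteresisTile(std::vector<std::uint8_t>& img, int width, int height,
                    int x0, int y0) {
    assert(static_cast<int>(img.size()) == width * height);
    assert(x0 >= 0 && y0 >= 0);

    // Load the 18x18 region; pixels outside the image are read as kNone.
    std::uint8_t tile[kTile][kTile];
    for (int ty = 0; ty < kTile; ++ty) {
        for (int tx = 0; tx < kTile; ++tx) {
            const int x = x0 + tx - 1, y = y0 + ty - 1;
            const bool inside = x >= 0 && y >= 0 && x < width && y < height;
            tile[ty][tx] = inside ? img[y * width + x] : kNone;
        }
    }

    bool modified = false;
    bool modifiedRegion = false;
    do {  // iterate locally until the region is stable
        modifiedRegion = false;
        for (int ty = 1; ty <= kBlock; ++ty) {
            for (int tx = 1; tx <= kBlock; ++tx) {
                if (tile[ty][tx] != kCandidate) continue;
                bool nearEdge = false;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx)
                        nearEdge = nearEdge || tile[ty + dy][tx + dx] == kEdge;
                if (nearEdge) {
                    tile[ty][tx] = kEdge;
                    modified = true;
                    modifiedRegion = true;
                }
            }
        }
    } while (modifiedRegion);

    // Write back only the interior; the halo belongs to neighbouring blocks.
    for (int ty = 1; ty <= kBlock; ++ty) {
        for (int tx = 1; tx <= kBlock; ++tx) {
            const int x = x0 + tx - 1, y = y0 + ty - 1;
            if (x < width && y < height) img[y * width + x] = tile[ty][tx];
        }
    }
    return modified;
}
```

Edges cannot cross a block boundary within one call, because the halo is read-only. That is why the outer loop on the HOST has to relaunch the kernel for as long as any block reports a modification.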

Methodology

Hardware:

Server:
- CPU: 4x AMD Opteron(tm) Processor 6136, 2.4 GHz, with 8 cores, each with 512 KB cache, and 126 GB RAM
- GPU1: NVIDIA Tesla C2050 with 448 cores at 1.15 GHz and 3 GB RAM
- GPU2: NVIDIA Tesla C1060 with 240 cores at 1.3 GHz and 4 GB RAM

Desktop:
- CPU: Intel Core 2 Duo E7400, 2.80 GHz, with 3072 KB cache and 2 GB RAM
- GPU: NVIDIA GeForce 8800 GT with 112 cores at 1.5 GHz and 512 MB RAM


Methodology

Images from the Berkeley Segmentation Dataset

Base   Image resolution             Num. of Images
B1     321×481 and 481×321          100
B2     642×962 and 962×642          100
B3     1284×1924 and 1924×1284      100
B4     2568×3848 and 3848×2568      100


Performance Tests


Conclusion

Parallel Programming
- Parallel programming is definitely the way to go.
- Implementing efficient parallel code is demanding.
- The programmer needs to know more details about the hardware, especially the memory architecture.

Canny Filter with CUDA
- We obtained a great speedup on the edge detection filter.
- We also noticed that the existing implementation is not efficient.
- There is still a LOT of work to do if we want to parallelize ITK.


Contact

Thank You!

Daniel Weingaertner danielw@inf.ufpr.br
