CUDACanny

Parallel Image Processing with CUDA
A case study with the Canny Edge Detection Filter
Daniel Weingaertner
Informatics Department, Federal University of Paraná, Brazil
Hochschule Regensburg 02.05.2011
Daniel Weingaertner (DInf-UFPR)
FH-Regensburg
Summary
1. Introduction
2. Insight Toolkit (ITK)
3. GPGPU and CUDA
4. Integrating CUDA and ITK
5. Canny Edge Detection
6. Experimental Results
7. Conclusion
Paraná, Brazil
Brazil Europe
Paraná
Curitiba
Federal University of Paraná
Informatics Department
Undergraduate:
- Bachelor in Computer Science: 8-semester course, 80 incoming students per year
- Bachelor in Biomedical Informatics: 8-semester course, 30 incoming students per year
Graduate: Master and PhD in Computer Science
- Algorithms, Image Processing, Computer Vision, Artificial Intelligence
- Databases, Scientific Computing and Open Source Software, Computer-Human Interface
- Computer Networks, Embedded Systems
Insight Toolkit (ITK)

- Created in 1999
- Open Source
- Multi-platform
- Object Oriented (Templates)
- Good documentation and support
Figure: Image Processing Workflow in ITK
ITK - Sample code

    #include <cstdlib>  // atof, atoi, EXIT_SUCCESS, EXIT_FAILURE
    #include "itkImage.h"
    #include "itkImageFileReader.h"
    #include "itkImageFileWriter.h"
    #include "itkCannyEdgeDetectionImageFilter.h"

    typedef itk::Image<float, 2> ImageType;
    typedef itk::ImageFileReader<ImageType> ReaderType;
    typedef itk::ImageFileWriter<ImageType> WriterType;
    typedef itk::CannyEdgeDetectionImageFilter<ImageType, ImageType> CannyFilter;

    int main(int argc, char** argv)
    {
      // Arguments: input file, output file, variance, upper threshold, lower threshold
      if (argc < 6) return EXIT_FAILURE;

      // Read the input image
      ReaderType::Pointer reader = ReaderType::New();
      reader->SetFileName(argv[1]);
      reader->Update();

      // Configure and run the Canny filter
      CannyFilter::Pointer canny = CannyFilter::New();
      canny->SetInput(reader->GetOutput());
      canny->SetVariance(atof(argv[3]));
      canny->SetUpperThreshold(atoi(argv[4]));
      canny->SetLowerThreshold(atoi(argv[5]));
      canny->Update();

      // Write the edge image
      WriterType::Pointer writer = WriterType::New();
      writer->SetFileName(argv[2]);
      writer->SetInput(canny->GetOutput());
      writer->Update();

      return EXIT_SUCCESS;
    }
What is GPGPU Computing?

- The use of the GPU for general-purpose computation.
- CPU and GPU can be used concurrently.
- To the end user, it is simply a way to run applications faster.
What is CUDA?
- CUDA = Compute Unified Device Architecture.
- General-Purpose Parallel Computing Architecture.
- Provides libraries, C language extension and hardware driver.
Parallel Processing Models
Single-Instruction Multiple-Thread Unit
- Creates, handles, schedules and executes groups of 32 threads (a warp).
- All threads in a warp start at the same point.
- They are free to branch to different code positions independently, although divergent paths within a warp are executed one after another.
CUDA Architecture Overview
Optimization Strategies for CUDA
Main optimization strategies for CUDA involve:
- Optimized/careful memory access
- Maximization of processor utilization
- Maximization of non-serialized instructions
CUDA - Sample Code

    #include <stdio.h>
    #include <stdlib.h>  // malloc, free
    #include <assert.h>
    #include <cuda.h>

    // Host (CPU) version: increments every element sequentially.
    void incrementArrayOnHost(float *a, int N)
    {
      int i;
      for (i = 0; i < N; i++) a[i] = a[i] + 1.f;
    }

    // Device (GPU) kernel: each thread increments one element.
    __global__ void incrementArrayOnDevice(float *a, int N)
    {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N) a[idx] = a[idx] + 1.f;
    }

    int main(void)
    {
      float *a_h, *b_h;  // pointers to host memory
      float *a_d;        // pointer to device memory
      int i, N = 10000;
      size_t size = N * sizeof(float);

      a_h = (float *)malloc(size);
      b_h = (float *)malloc(size);
      cudaMalloc((void **)&a_d, size);

      // Initialize on the host and copy to the device
      for (i = 0; i < N; i++) a_h[i] = (float)i;
      cudaMemcpy(a_d, a_h, sizeof(float) * N, cudaMemcpyHostToDevice);

      incrementArrayOnHost(a_h, N);

      // Launch enough blocks of 256 threads to cover all N elements
      int blockSize = 256;
      int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);
      incrementArrayOnDevice<<<nBlocks, blockSize>>>(a_d, N);

      // Copy the device result back to the host
      cudaMemcpy(b_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);

      free(a_h); free(b_h); cudaFree(a_d);
      return 0;
    }
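The launch configuration in the sample above rounds the number of blocks up, so the last block may contain threads with no element to process. The following is a minimal CPU-only sketch of that index arithmetic, not CUDA code: the names blocksFor and emulateIncrement are hypothetical, and the two nested loops merely stand in for blockIdx.x and threadIdx.x.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that nBlocks * blockSize >= n (ceiling division).
int blocksFor(int n, int blockSize)
{
  assert(n >= 0 && blockSize > 0);
  return (n + blockSize - 1) / blockSize;
}

// Sequential stand-in for the kernel launch (hypothetical helper).
// Returns how many emulated threads fell outside the array.
int emulateIncrement(std::vector<float>& a, int blockSize)
{
  const int n = static_cast<int>(a.size());
  const int nBlocks = blocksFor(n, blockSize);
  int idleThreads = 0;
  for (int block = 0; block < nBlocks; ++block) {
    for (int thread = 0; thread < blockSize; ++thread) {
      const int idx = block * blockSize + thread;  // same formula as the kernel
      if (idx < n) {
        a[idx] += 1.f;   // the bounds guard keeps the write inside the array
      } else {
        ++idleThreads;   // thread exists but has nothing to do
      }
    }
  }
  return idleThreads;
}
```

With N = 10000 and 256 threads per block, this gives 40 blocks, and the guard is what keeps the surplus threads of the last block from writing past the end of the array.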
Integrating CUDA Filters into ITK Workflow
The ITK community suggests:
- Re-implement filters where parallelizing provides significant speedup.
- Consider the entire workflow: copying to/from the GPU is very time consuming.
Careful! "Premature optimization is the root of all evil!" (Donald Knuth)
CUDA Insight Toolkit (CITK)

Changes to ITK:
- Slight architecture change: CudaImportImageContainer
- Backwards compatible
- Data transfer between HOST and DEVICE only on demand
- Allows for filter chaining inside the DEVICE
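The idea of transferring data only on demand can be illustrated with a small sketch. This is not the CudaImportImageContainer API: LazyBuffer, scaleOnDevice and offsetOnDevice are hypothetical names, and the "device" is emulated by a second host vector so that the example runs without a GPU. It only shows how validity flags let two chained filters share one upload.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical container: keeps a host copy and an emulated device copy,
// and copies between them only when the requested side is stale.
class LazyBuffer {
public:
  explicit LazyBuffer(std::vector<float> data)
      : host_(std::move(data)), device_(host_.size()),
        hostValid_(true), deviceValid_(false), transfers_(0) {}

  // Access for device-side filters: uploads only if the device copy is stale.
  std::vector<float>& deviceData() {
    if (!deviceValid_) { device_ = host_; deviceValid_ = true; ++transfers_; }
    hostValid_ = false;  // the caller may modify the device copy
    return device_;
  }

  // Access for host-side code: downloads only if the host copy is stale.
  const std::vector<float>& hostData() {
    if (!hostValid_) { host_ = device_; hostValid_ = true; ++transfers_; }
    return host_;
  }

  int transfers() const { return transfers_; }

private:
  std::vector<float> host_;
  std::vector<float> device_;
  bool hostValid_;
  bool deviceValid_;
  int transfers_;
};

// Two stand-in "device filters" that work directly on the device copy.
void scaleOnDevice(LazyBuffer& b, float k) {
  for (float& v : b.deviceData()) v *= k;
}
void offsetOnDevice(LazyBuffer& b, float c) {
  for (float& v : b.deviceData()) v += c;
}
```

Chaining scaleOnDevice and offsetOnDevice costs one upload in this sketch, and the download happens only when hostData() is finally called.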
CudaCanny
itkCudaCannyEdgeDetectionImageFilter
Algorithm 1: Canny Edge Detection Filter
1. Gaussian Smoothing
2. Gradient Computation
3. Non-Maximum Suppression
4. Hysteresis
Gradient Computation with Sobel Filter

itkCudaSobelEdgeDetectionImageFilter
Figure: (a) Sobel X, (b) Sobel Y

$$L_v = \sqrt{L_x^2 + L_y^2} \qquad (1)$$
$$\theta = \arctan\left(\frac{L_y}{L_x}\right) \qquad (2)$$
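Equations (1) and (2) can be tried out on a single pixel with a short CPU sketch. It assumes the standard 3x3 Sobel masks and a row-major float image. The names Gradient and sobelAt are hypothetical, and std::atan2 is used in place of a plain arctan of the quotient so that L_x = 0 does not cause a division by zero. It is not the code of itkCudaSobelEdgeDetectionImageFilter.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Gradient {
  float magnitude;  // L_v in equation (1)
  float direction;  // theta in equation (2), in radians
};

// Sobel gradient at interior pixel (x, y) of a row-major w x h image.
Gradient sobelAt(const std::vector<float>& img, int w, int h, int x, int y)
{
  assert(x >= 1 && y >= 1 && x < w - 1 && y < h - 1);
  static const int kx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};   // Sobel X
  static const int ky[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};   // Sobel Y
  float lx = 0.f, ly = 0.f;
  for (int dy = -1; dy <= 1; ++dy) {
    for (int dx = -1; dx <= 1; ++dx) {
      const float p = img[(y + dy) * w + (x + dx)];
      lx += kx[dy + 1][dx + 1] * p;
      ly += ky[dy + 1][dx + 1] * p;
    }
  }
  Gradient g;
  g.magnitude = std::sqrt(lx * lx + ly * ly);
  g.direction = std::atan2(ly, lx);  // atan2 variant of arctan(L_y / L_x)
  return g;
}
```

For a vertical step edge (left columns dark, right column bright) the sketch yields a purely horizontal gradient, that is, a direction of zero.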
Optimization for Edge Direction Computation
Code Extract from CudaSobel
Hysteresis Operation
Hysteresis Algorithm
Algorithm 2: Hysteresis on CPU
  Transfer the Gradient/NMS images to the GPU
  repeat
    Run the hysteresis kernel on GPU
  until no pixel changes status
  Return edge image
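The repeat-until loop of Algorithm 2 can be sketched on the CPU as follows. This is an illustration under assumptions, not the CudaCanny source: the function names, the low and high thresholds, and the 8-neighbour rule are the usual textbook formulation of hysteresis, and hysteresisPass is a sequential stand-in for one kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One pass over the image, standing in for one kernel launch. A pixel at or
// above `low` becomes an edge if any 8-neighbour is already an edge.
// Returns true if at least one pixel changed status.
bool hysteresisPass(const std::vector<float>& mag,
                    std::vector<unsigned char>& edge,
                    int w, int h, float low)
{
  bool modified = false;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const std::size_t i = static_cast<std::size_t>(y) * w + x;
      if (edge[i] || mag[i] < low) continue;  // already an edge, or too weak
      for (int dy = -1; dy <= 1 && !edge[i]; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
          const int nx = x + dx, ny = y + dy;
          if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
          if (edge[static_cast<std::size_t>(ny) * w + nx]) {
            edge[i] = 1;
            modified = true;
            break;
          }
        }
      }
    }
  }
  return modified;
}

// Host loop: seed with strong pixels, then repeat passes until nothing changes.
std::vector<unsigned char> hysteresis(const std::vector<float>& mag,
                                      int w, int h, float low, float high)
{
  assert(mag.size() == static_cast<std::size_t>(w) * h);
  std::vector<unsigned char> edge(mag.size(), 0);
  for (std::size_t i = 0; i < mag.size(); ++i) {
    if (mag[i] >= high) edge[i] = 1;
  }
  while (hysteresisPass(mag, edge, w, h, low)) {
    // keep launching passes until no pixel changes status
  }
  return edge;
}
```

The loop always terminates because each productive pass turns at least one more pixel into an edge, and the number of pixels is finite.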
Hysteresis Algorithm
Algorithm 3: Hysteresis on GPU
  Load an image region of size 18x18 into shared memory
  modified ← false
  repeat
    modified_region ← false
    Synchronize threads of same multiprocessor
    if Pixel changes status then
      modified ← true
      modified_region ← true
    end if
    Synchronize threads of same multiprocessor
  until modified_region = false
  if modified = true then
    Update modified status on HOST
  end if
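Algorithm 3 can be emulated for a single block with the CPU sketch below. It assumes that the 18x18 region corresponds to a 16x16 thread block plus a one-pixel halo on each side, which the slides do not state. The pixel-state encoding (0 = no edge, 1 = candidate, 2 = edge) and the name hysteresisTile are likewise hypothetical, and the local array only stands in for shared memory.

```cpp
#include <cassert>
#include <vector>

const int kBlock = 16;         // assumed threads per block side
const int kTile = kBlock + 2;  // 18: block plus a one-pixel halo on each side

// Processes the tile of block (bx, by) of a row-major w x h state image.
// Returns true if any pixel of the block changed (the `modified` flag).
bool hysteresisTile(std::vector<unsigned char>& state, int w, int h,
                    int bx, int by)
{
  assert(bx >= 0 && by >= 0);
  unsigned char tile[kTile][kTile];  // stand-in for shared memory

  // Load the 18x18 region; pixels outside the image count as "no edge".
  for (int ty = 0; ty < kTile; ++ty) {
    for (int tx = 0; tx < kTile; ++tx) {
      const int x = bx * kBlock + tx - 1;
      const int y = by * kBlock + ty - 1;
      const bool inside = x >= 0 && y >= 0 && x < w && y < h;
      tile[ty][tx] = inside ? state[y * w + x] : 0;
    }
  }

  bool modified = false;
  bool modifiedRegion;
  do {
    modifiedRegion = false;
    for (int ty = 1; ty <= kBlock; ++ty) {
      for (int tx = 1; tx <= kBlock; ++tx) {
        if (tile[ty][tx] != 1) continue;  // only candidates can change
        bool nearEdge = false;
        for (int dy = -1; dy <= 1; ++dy)
          for (int dx = -1; dx <= 1; ++dx)
            if (tile[ty + dy][tx + dx] == 2) nearEdge = true;
        if (nearEdge) {
          tile[ty][tx] = 2;
          modified = true;
          modifiedRegion = true;
        }
      }
    }
  } while (modifiedRegion);

  // Write back only the 16x16 interior; the halo belongs to other blocks.
  for (int ty = 1; ty <= kBlock; ++ty) {
    for (int tx = 1; tx <= kBlock; ++tx) {
      const int x = bx * kBlock + tx - 1;
      const int y = by * kBlock + ty - 1;
      if (x < w && y < h) state[y * w + x] = tile[ty][tx];
    }
  }
  return modified;
}
```

Because a block only writes its own interior, an edge that reaches a tile border is picked up by the neighbouring block through its halo on a later pass, which is why the host has to keep relaunching while any block reports a modification.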
Methodology
Hardware:
Server:
- CPU: 4x AMD Opteron(tm) Processor 6136, 2.4 GHz with 8 cores, each with 512 KB cache, and 126 GB RAM
- GPU1: NVidia Tesla C2050 with 448 cores at 1.15 GHz and 3 GB RAM
- GPU2: NVidia Tesla C1060 with 240 cores at 1.3 GHz and 4 GB RAM
Desktop:
- CPU: Intel(R) Core(TM)2 Duo E7400, 2.80 GHz with 3072 KB cache, and 2 GB RAM
- GPU: NVidia GeForce 8800 GT with 112 cores at 1.5 GHz and 512 MB RAM
Methodology
Images from the Berkeley Segmentation Dataset

Base   Image resolution            Num. of images
B1     321x481 and 481x321         100
B2     642x962 and 962x642         100
B3     1284x1924 and 1924x1284     100
B4     2568x3848 and 3848x2568     100
Performance Tests
Conclusion
Parallel Programming
- Parallel programming is definitely the way to go.
- Implementing efficient parallel code is demanding.
- The programmer should know more details about the hardware, especially the memory architecture.
Canny Filter with CUDA
- We obtained a great speedup on the edge detection filter.
- We also noticed that the existing implementation is not efficient.
- There is still a LOT of work if we want to parallelize ITK.
Contact
Thank You!
Daniel Weingaertner danielw@inf.ufpr.br