
IMPLEMENTATION OF FAST FOURIER TRANSFORM ON A GRAPHICS PROCESSING UNIT

by NUST CDT AAMIR MAJEED (060902)

COLLEGE OF AERONAUTICAL ENGINEERING
PAF ACADEMY RISALPUR
September 2010

IMPLEMENTATION OF FAST FOURIER TRANSFORM ON A GRAPHICS PROCESSING UNIT


By NUST CDT AAMIR MAJEED (060902)

ADVISOR: SQN LDR DR. TAUSEEF UR REHMAN
CO-ADVISOR: WG CDR DR. SOHAIL AHMED

REPORT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BE

COLLEGE OF AERONAUTICAL ENGINEERING
PAF ACADEMY RISALPUR
September 2010


COLLEGE OF AERONAUTICAL ENGINEERING PAF ACADEMY, RISALPUR

IMPLEMENTATION OF FAST FOURIER TRANSFORM ON A GRAPHICS PROCESSING UNIT

by NUST Cdt Aamir Majeed, 69th EC

A report submitted to the College of Aeronautical Engineering in partial fulfillment of the requirements for the degree of B.E (AVIONICS)

APPROVED

(TAUSEEF UR REHMAN)
Squadron Leader
Project Advisor
College of Aeronautical Engineering

(JAHANGIR KIYANI)
Group Captain
Head of Avionics Dept
College of Aeronautical Engineering


ABSTRACT

The Fourier Transform is a widely used tool in many scientific and engineering fields such as Digital Signal Processing. The Fast Fourier Transform (FFT) refers to a class of algorithms for efficiently computing the Discrete Fourier Transform (DFT). The FFT is a computationally intensive algorithm, and real-time implementation of the FFT on a general purpose CPU is challenging due to limited processing power. Graphics Processing Units (GPUs) are an emerging breed of massively parallel processors having hundreds of processing cores, in contrast to CPUs. Greater computational power and the parallel architecture of GPUs help them outperform CPUs on data-parallel compute applications by a huge factor. The growing computational power of GPUs has introduced the concept of General Purpose Computing on GPU (GPGPU). Open Computing Language (OpenCL) is a developing, royalty-free standard for cross-platform general-purpose parallel programming. OpenCL provides a uniform programming environment for developing efficient and portable software for multi-core CPUs and GPUs. The aim of this project is to implement the Radix-2 FFT algorithm on a state-of-the-art AMD GPU, the ATI Radeon 5870, using the OpenCL programming language. 1D and 2D FFT algorithms were successfully implemented with significant performance gains.


TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
DEDICATION

CHAPTER 1: PROJECT INTRODUCTION
    Project Title
    Project Introduction

CHAPTER 2: LITERATURE REVIEW
    FFT on Graphics Hardware
    CUDA FFT Library (CUFFT)
    Apple FFT Library
    IPT ATI Project
    Project Motivation

CHAPTER 3: GRAPHICS PROCESSING UNIT (GPU)
    Introduction
    Major GPU Vendors
    Evolution of GPUs
    GPU Capabilities
    General Purpose GPU (GPGPU)
    CPU vs GPU
    GPU Application Areas

CHAPTER 4: GPU ARCHITECTURE
    Flynn's Taxonomy
    Single Instruction Multiple Data (SIMD) Architecture
    Generalized GPU Architecture
    ATI Radeon 5870 Architecture
    Memory Hierarchy of ATI 5870

CHAPTER 5: GPGPU PROGRAMMING ENVIRONMENT
    Introduction
    Compute Unified Device Architecture (CUDA)
    Open Computing Language (OpenCL)
    Anatomy of OpenCL
    OpenCL Architecture
    OpenCL Execution Model
    OpenCL Memory Model
    OpenCL Program Structure

CHAPTER 6: FAST FOURIER TRANSFORM (FFT)
    Introduction
    Fourier Transform
    Categories of Fourier Transform
    Discrete Fourier Transform (DFT)
    Fast Fourier Transform (FFT)
    FFT Algorithms
    Radix-2 FFT Algorithm
    Decomposition of Time Domain Signal
    Calculating Frequency Spectra
    Frequency Spectrum Synthesis
    Reducing Operations Count

CHAPTER 7: OPENCL IMPLEMENTATION
    Introduction
    Data Packaging
    1D FFT Implementation
    Data Decomposition
    Parallel Implementation of Elster Algorithm
    Butterfly Computations
    Improved Program Structure
    2D FFT Implementation
    Matrix Transpose Implementation
    Matrix Transpose Using Local Memory

CHAPTER 8: RESULTS AND CONCLUSION
    Computing Environment
    Experiment Setup
    1D FFT Results
    2D FFT Results
    Analysis of Results
    Conclusion

REFERENCES

LIST OF FIGURES

Figure 3-1: ATI Radeon HD 5870 Graphics Card
Figure 3-2: Layout of ATI 5870 Graphics Card
Figure 3-3: GPU Market Share
Figure 3-4: CPU vs GPU Peak Performance
Figure 3-5: CPU vs GPU Architecture Comparison
Figure 4-1: SIMD Architecture
Figure 4-2: Simplified GPU Architecture
Figure 4-3: Cypress Architecture
Figure 4-4: Cross Section of a Compute Unit
Figure 4-5: Stream Core
Figure 4-6: ATI 5870 Memory Hierarchy
Figure 5-1: Compute Device
Figure 5-2: OpenCL Memory Model
Figure 6-1: Interlaced Decomposition
Figure 6-2: Bit Reversal Sorting
Figure 6-3: Radix-2 FFT Butterfly
Figure 6-4: FFT Flow Graph (N=8)
Figure 6-5: Generalized FFT Butterfly
Figure 6-6: Simplified FFT Butterfly
Figure 7-1: Bit Reversed Vector
Figure 7-2: Applying Elster on 16 Elements
Figure 7-3: Simplified FFT Butterfly
Figure 7-4: Matrix Transpose
Figure 7-5: 4x4 Matrix Transpose Example
Figure 7-6: Result of Simple Matrix Transpose Kernel
Figure 7-7: Coalesced Data Transfer from Global to Local Memory
Figure 7-8: Writing Transposed Elements to Global Memory

DEDICATION

I dedicate this report to my parents, who have been a source of inspiration for me; to my sisters, who made me believe in myself when I thought I could not make it; and finally to my friends, just for being there.


ACKNOWLEDGEMENT
I am grateful to my advisor, Sqn Ldr Tauseef ur Rehman, for all his guidance and support. I would also like to thank the Department of Avionics Engineering for providing a conducive environment for research and study.


CHAPTER 1

PROJECT INTRODUCTION


Project Title

1. The aim of this project is to implement the 1D and 2D Radix-2 Fast Fourier Transform (FFT) algorithm on the ATI Radeon 5870 Graphics Processing Unit (GPU) using the OpenCL programming language, and to benchmark its performance against state-of-the-art CPUs and NVIDIA GPUs.

Project Introduction

2. The Fourier Transform is a well known and widely used tool in many scientific and engineering fields. The Fourier transform converts a signal from the time domain to the frequency domain. It is essential for many image processing techniques including filtering, convolution, manipulation, correlation, and compression.

3. The Fast Fourier Transform (FFT) refers to a class of algorithms for efficiently computing the Discrete Fourier Transform (DFT). The FFT is a computationally intensive algorithm, and real-time implementation of the FFT on a general purpose CPU is challenging due to limited processing power.

4. Graphics Processing Units (GPUs) are an emerging breed of processors originating from the world of graphics, and form an integral part of a commodity graphics card in a Personal Computer (PC). Historically, GPUs have been used primarily for graphics processing and gaming purposes. However, the growing computational power of GPUs has introduced the concept of General Purpose Computing on GPU (GPGPU) [1]. In contrast to CPUs, GPUs have hundreds of processing cores working in parallel and can handle real-time parallel computations. Greater computational power and the parallel architecture of GPUs help them outperform CPUs on data-parallel compute applications by a huge factor.


5. NVIDIA and AMD are the leading manufacturers of GPUs. NVIDIA has introduced a GPU development platform, Compute Unified Device Architecture (CUDA), for general purpose computing on GPUs. CUDA is not a cross-platform tool and its use is limited to NVIDIA GPUs only [2]. NVIDIA has developed a GPU accelerated FFT library called CUFFT; AMD GPUs cannot take advantage of this library.

6. Open Computing Language (OpenCL) is a developing, royalty-free standard for cross-platform general-purpose parallel programming. OpenCL provides a uniform programming environment for developing efficient and portable software for multi-core CPUs and GPUs [3]. Both NVIDIA and AMD now support OpenCL for their respective GPUs.

7. The aim of this project is to implement the Radix-2 FFT algorithm on an AMD GPU using OpenCL as the programming language. A state-of-the-art AMD GPU, the ATI Radeon 5870, will be used for implementation of this code. The ATI Radeon 5870 can potentially deliver a peak computational power of 2.72 Tera floating point operations per second (FLOPS).

8. The focus of the project is to exploit the data parallelism inherent in the FFT algorithm and to develop code that exploits AMD's GPU architecture by fully utilizing its parallel computing resources. Performance benchmarking of this implementation against state-of-the-art Intel CPUs will be done.

CHAPTER 2

LITERATURE REVIEW


FFT on Graphics Hardware

1. The mathematical complexity of the FFT suggests that it is a computationally expensive task for uni-processor machines, especially when the input size (N) is in the millions. The data-level parallelism of the FFT algorithm can be exploited through a parallel processing architecture to gain speedup.

2. With the introduction of the GPGPU concept in 2002, GPUs became an attractive target for computation because of their high performance and low cost compared to parallel vector machines and CPUs. At that time, general purpose algorithms for the GPU had to be mapped to the programming model provided by graphics APIs such as OpenGL and DirectX. The graphics APIs were unable to fully utilize the compute resources, as access to low-level hardware features was not supported. The use of graphics APIs for GPGPU was challenging and the performance gains were small compared to the programming effort.

3. K. Moreland and E. Angel [4] were among the first to implement FFT on graphics hardware, in 2003, using graphics APIs. The implementation was done on the NVIDIA GeForce FX 5800 Ultra graphics card, which features a fully programmable pipeline with full 32-bit floating point math enabled throughout the entire pipeline. The programming environment used was OpenGL with the Cg language and runtime libraries. The average performance achieved was 2.5 Giga FLOPS. J. Spitzer also implemented FFT on the NVIDIA GeForce FX 5800 Ultra graphics card in 2003 [5] using the graphics APIs; the reported peak performance was 5 Giga FLOPS.

4. These FFT implementations on GPUs revealed that using graphics APIs for GPGPU is inefficient and that the achievable peak performance was very limited compared to the programming effort. Developers came up with the solution in the form of non-graphics APIs to fully utilize the compute resources of the GPU while reducing the programming effort.

5. NVIDIA launched CUDA in 2007 [6], allowing developers to fully utilize the immense GPU power by accessing all hardware features and resources via an industry-standard high-level language, C. With CUDA, the programming effort was reduced and the performance gains were much more significant. In February 2007, NVIDIA launched the first GPU accelerated FFT library, the CUDA FFT Library (CUFFT) [7].

CUDA FFT Library (CUFFT)

6. CUFFT is the first GPU accelerated FFT library. The initial version of CUFFT was released in February 2007. The latest release is CUFFT version 3.0, launched in February 2010 [8]. The salient features of this library are listed below.

(a) 1D, 2D and 3D transforms of complex and real-valued data
(b) Batched execution of multiple transforms of any dimension in parallel
(c) 1D transform sizes up to 8 million elements
(d) 2D and 3D transform sizes in the range [2, 16384] in any dimension
(e) In-place and out-of-place transforms for real and complex data
(f) Double precision transforms on compatible hardware

7. Implemented Algorithms. The CUFFT library implements several FFT algorithms, each having different performance and accuracy. The Radix-2 algorithm is implemented for input sizes that are integral powers of 2; this corresponds to the best performance paths in CUFFT. For transform sizes that are not integral powers of 2, CUFFT uses a more general Mixed-Radix FFT algorithm that is usually slower and less numerically accurate [8].
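For illustration, a minimal CUFFT call sequence for a forward single-precision 1D complex-to-complex transform might look as follows. This is only a sketch: error handling is omitted, and the device buffer d_data is assumed to have been allocated and filled beforehand.

    #include <cufft.h>

    /* In-place forward 1D complex-to-complex transform of N points. */
    void fft_forward(cufftComplex *d_data, int N)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);   /* one transform per batch */
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cufftDestroy(plan);
    }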

8. CUFFT Limitation. The use of CUFFT is limited to NVIDIA GPUs only, because CUDA is not a heterogeneous programming environment; AMD GPUs cannot take advantage of this library. Thus, there is still a need for an accelerated FFT library targeting GPUs from all vendors as well as multi-core CPUs.

Apple FFT Library

9. In February 2010, Apple Inc. published an FFT library for the Mac OS X implementation of OpenCL [9]. This FFT library includes all the features of the CUFFT library. The runtime requirements for this library are Mac OS X v10.6 or later with OpenCL 1.0 support and Apple's Xcode compiler. This limits the use of the library to Apple computers only.

10. The OpenCL developer community modified the library to make it compatible with AMD's OpenCL implementation. The largest transform size achieved in this way is 1024 points, with reported issues of numerical accuracy.

IPT ATI Project

11. In February 2010, Jingfei Kong published an OpenCL implementation of FFT for AMD ATI GPUs. This implementation accelerates MATLAB FFT using the MATLAB external interface (MEX) [10]. It only supports 1D transforms in single precision.

Project Motivation

12. This project is a step towards the development of an OpenCL FFT library for AMD ATI GPUs that is comparable to the CUFFT library in features and performance. The contribution of this project will be the implementation of 1D transforms, batched 1D transforms and 2D transforms.


CHAPTER 3

GRAPHICS PROCESSING UNIT (GPU)


Introduction

1. Graphics Processing Unit (GPU). A GPU is a specialized processor designed for processing and displaying computer graphics. The terms graphics processing unit and GPU were coined by NVIDIA, the largest GPU manufacturer, in 1999. GPUs are highly efficient at performing the calculations necessary to generate visual output from program data. They are widely used as co-processors in mobile phones, personal computers, laptops and game consoles to offload graphics processing from the central processing unit (CPU) and to meet the ever increasing demand for better graphics. GPUs commonly accompany standard CPUs in Personal Computers (PCs) to accelerate graphics generation and video display. In a PC, a GPU can be present on the motherboard or on a dedicated graphics card.

2. Graphics Card. A graphics card is a peripheral device that interfaces with the motherboard by means of an expansion slot such as a Peripheral Component Interconnect Express (PCIe) slot or an Accelerated Graphics Port (AGP). Fig. 3-1 shows the ATI 5870 graphics card.

Figure 3-1: ATI Radeon HD 5870 Graphics Card

3. The key components of a graphics card are the GPU, video memory, output interface and motherboard interface. Fig. 3-2 depicts the layout of the ATI 5870 graphics card [11].

Figure 3-2: Layout of ATI 5870 Graphics Card

4. Graphics cards offer added functions, such as video capture and decoding, TV output, or the ability to connect multiple monitors. High performance graphics cards are used for more graphically demanding purposes, such as PC games.

Major GPU Vendors

5. In 2008, Intel, NVIDIA and AMD/ATI were the market share leaders, with 49.4%, 27.8% and 20.6% market share respectively. However, these numbers count Intel's very low-cost, less powerful integrated graphics solutions as GPUs [12].

6. In June 2010, a 30-day GPU market share survey was carried out by PassMark [13] for dedicated graphics cards only. According to this survey, NVIDIA and AMD/ATI are the leading GPU manufacturers with 49% and 34% market share respectively, while Intel captures 10% of the market. Fig. 3-3 shows the results of this survey.

Figure 3-3: GPU Market Share

Evolution of GPUs

7. The history of graphics processors traces back to the 1980s, when 2D graphics and text displays were generated by graphics chips called accelerators.

8. The IBM Professional Graphics Controller, released in 1984, was one of the very first 2D graphics accelerators available for the IBM PC [12]. Its high price, slow processor and lack of compatibility with commercial programs made it unable to succeed in the mass market.

9. In 1991, S3 Graphics introduced the first single-chip 2D accelerator, the S3 86C911 [12]. By 1995, all major PC graphics chip makers had added 2D acceleration support to their chips. In the mid-1990s, CPU-assisted real-time 3D graphics were becoming increasingly common in computer and console games, which led to an increasing public demand for hardware-accelerated 3D graphics. Early examples of mass-marketed 3D graphics hardware can be found in fifth generation video game consoles such as the PlayStation and Nintendo 64.

10. In 1997, 3D accelerators added another significant hardware stage to the 3D graphics pipeline: hardware transform and lighting. The NVIDIA GeForce 256 (NV10), released in 1999, was the first card on the market with this capability [12].

11. NVIDIA was the first to produce a chip with a programmable graphics pipeline, the GeForce 3 (NV20). By October 2002, the ATI Radeon 9700 (R300), the world's first Direct3D 9.0 accelerator, had added floating point math capability to GPUs [12]. Since then GPUs have quickly become as flexible as CPUs, and orders of magnitude faster for image-array operations.

GPU Capabilities

12. Historically, GPUs have been used primarily for graphics processing and gaming purposes. As graphics processing is inherently a parallel task, GPUs naturally have a more parallel architecture than standard CPUs. Furthermore, 3D video games demand very high computational power, driving GPU development beyond CPUs. Thus modern GPUs, in comparison to CPUs, offer extremely high performance for the monetary cost.

13. Modern GPUs have a more flexible and programmable graphics pipeline and offer high peak performance. Naturally, interest has developed as to whether this GPU processing power can be harnessed for more general purpose calculations.

General Purpose GPU (GPGPU)

14. The addition of programmable stages and higher precision arithmetic to the graphics pipeline allows software developers to use GPUs for processing non-graphics data. The idea of utilizing the parallel computing resources of a GPU for non-graphics general purpose computations is named General-Purpose computation on Graphics Processing Units (GPGPU). The term GPGPU was coined by Mark Harris in 2002 when he recognized an early trend of using GPUs for non-graphics applications [1].

15. GPGPU is the technique of using a GPU, which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the CPU. Once specially designed for computer graphics and difficult to program, today's GPUs are general-purpose parallel processors with support for accessible programming interfaces and industry-standard languages such as C. Applications ported to GPUs often achieve speedups of orders of magnitude over optimized CPU implementations.

CPU vs GPU

16. Performance Comparison. The motivation behind GPGPU is manifold. A high end graphics card costs less than a high end CPU while providing peak performance more than a hundred times that of a contemporary CPU. As depicted in Fig. 3-4 [14], modern GPUs completely outperform state-of-the-art CPUs in theoretical peak performance, the maximum number of floating point operations per second (FLOPS).

17. CPU performance tops out at about 25 Giga FLOPS per core; thus a Core i7 with 4 cores delivers a peak performance of around 110 Giga FLOPS [15], whereas the ATI Radeon 5870 delivers 2.72 Tera FLOPS and the NVIDIA GTX 285 delivers 1 Tera FLOPS. This huge performance advantage of GPUs justifies GPGPU.

Figure 3-4: CPU vs GPU Peak Performance

18. Architecture Comparison. The architecture comparison of a quad core CPU and a generic GPU is shown in Fig. 3-5 [6]. The processing element in a microprocessor where the floating point math occurs is the Arithmetic Logic Unit (ALU); ALUs are shown as green blocks in Fig. 3-5. On a quad core CPU the 4 ALUs can handle 4 floating point operations simultaneously, whereas the GPU shown here, with 128 ALUs, can handle 128 floating point operations simultaneously.

Figure 3-5: CPU vs GPU Architecture Comparison

19. The key difference between GPUs and CPUs is that while a modern CPU contains a few high-functionality cores, GPUs typically contain hundreds of basic cores. Each CPU core is capable of running a heavy task independently, so multiple tasks can map to different cores. The GPU core is a very basic processing element, and each core can perform the same operation on different data simultaneously.

20. CPUs use a large chip area for control circuitry and data caching for faster data accesses. GPUs, on the other hand, devote far less area to caches and control circuitry; most of the die area is occupied by ALUs. Thus GPUs gain their performance advantage by allocating a huge number of transistors to floating point calculations.

21. GPUs also boast a larger memory bus width than CPUs, which results in faster memory access. The dynamic RAM used in modern GPUs is GDDR5, with a much greater bandwidth than the DDR2 DRAM generally found in consumer PCs. CPUs typically operate at a 2-3 GHz clock frequency. The GPU clock frequency is lower, typically up to 1.2 GHz, but this gap has been closing over the last few years.

GPU Application Areas

22. GPUs offer very high peak performance, but not all consumer applications can take full advantage of it. As GPUs are massively parallel devices, an application needs to be highly parallel to fully utilize the GPU resources. Applications such as graphics processing are highly parallel in nature and can keep the cores busy, resulting in a significant performance improvement over a standard CPU.

23. For applications less susceptible to such high levels of parallelization, the extent to which the available performance can be harnessed will depend on the nature of the application and the investment put into software development.

24. Following are the major application areas where GPUs provide significant speedups over standard CPUs.

(a) Fast Fourier Transform (FFT)
(b) Image and Video Processing
(c) Multi-Dimensional Signal Processing
(d) Particle Interaction and Fluid Dynamics Simulations
(e) Radar Signal Processing
(f) Linear Algebra (BLAS, LAPACK)
(g) Partial Differential Equations
(h) MATLAB Acceleration

CHAPTER 4

GPU ARCHITECTURE


Flynn's Taxonomy

1. Flynn's Taxonomy is a way to characterize computer architectures. It categorizes all computers according to the number of instruction streams and data streams they have, where a stream is a sequence of instructions or data on which a computer operates. There are four classes of computers as defined by Flynn's Taxonomy [16].

(a) Single Instruction Single Data (SISD)
(b) Single Instruction Multiple Data (SIMD)
(c) Multiple Instruction Single Data (MISD)
(d) Multiple Instruction Multiple Data (MIMD)

Single Instruction Multiple Data (SIMD) Architecture

2. In a SIMD system, a single instruction stream is concurrently broadcast to multiple processors, each with its own data stream. Each processor thus executes the same instruction or program on a different data set concurrently. GPU architecture is based on the SIMD model. Fig. 4-1 [17] shows a typical SIMD architecture.

Figure 4-1: SIMD Architecture

Generalized GPU Architecture

3. Fig. 4-2 [18] shows a simplified diagram of a generalized GPU device. A GPU device comprises a set of compute units. Each compute unit has a set of stream cores, which are further divided into basic execution units called processing elements. All ATI GPUs follow a similar design pattern; however, the number of compute units or stream processors may vary from device to device.

Figure 4-2: Simplified GPU Architecture

ATI Radeon 5870 Architecture

4. The ATI Radeon 5870 architecture is given the code name Cypress [19]. Fig. 4-3 illustrates the Cypress architecture. The GPU comprises 20 compute units which operate as SIMD engines. Each SIMD engine consists of 16 stream cores, and a stream core houses 5 processing elements.

5. SIMD Engine. Each compute unit consists of 16 stream cores and operates as a SIMD engine. All stream cores within a SIMD engine have to execute the same instruction sequence; different compute units may execute different instructions. Fig. 4-4 [19] shows the internal design of a compute unit.

Figure 4-3: Cypress Architecture

Figure 4-4: Cross Section of a Compute Unit

6. Stream Cores. A stream core consists of multiple processing elements for floating point calculations. The branch unit shown in Fig. 4-5 handles branching statements and thus allows threads to take different computation paths without incurring overheads. The register file is extremely fast memory, private to each thread, used for fast data access and for holding intermediate variables.

Figure 4-5: Stream Core

7. Processing Elements. Processing elements are the fundamental programmable computational units that perform integer, single-precision floating point, double-precision floating point and transcendental operations. A stream core is arranged as a five-way very long instruction word (VLIW) processor; see Fig. 4-5. Up to five scalar operations can be co-issued in a VLIW instruction, each executed on one of the corresponding five processing elements. Processing elements can execute single-precision floating point or integer operations. One of the five processing elements can also perform transcendental operations (sine, cosine, logarithm, etc.). Double-precision floating point operations are processed by connecting two or four of the processing elements (excluding the transcendental core) to perform a single double-precision operation [18].

Memory Hierarchy of ATI 5870

8. GPUs generally feature a multi-level memory space for efficient data access and communication within the GPU. Fig. 4-6 [19] illustrates the memory spaces on the ATI 5870.

9. Global Memory. Global memory is the main memory pool on the GPU and is accessible by all processing elements on the GPU for read and write operations. The ATI 5870 features 1 GB of GDDR5 global memory operating at 1.2 GHz. The data transfer rate is 4.8 Gbps via a 256 bit wide memory bus, and the memory bandwidth is 153.6 GB/s. This is the slowest memory on the GPU.

10. Constant Data Cache. The ATI 5870 GPU features 48 KB of constant data cache memory used to store frequently used constant values. This memory is written by the host, and all processing elements have read-only access to it. The constant cache is a very fast memory with 4.25 TB/s memory bandwidth.

11. Local Data Share (LDS). Each SIMD engine has 32 KB of local memory called the LDS. All processing elements within a SIMD engine can share data using this memory. The LDS offers 2.125 TB/s of memory bandwidth, providing low latency data access to each SIMD engine. The LDS is arranged into 32 banks, each with 1 KB of memory. The LDS provides zero latency reads in broadcast mode and in conflict-free reads/writes.

Figure 4-6: ATI 5870 Memory Hierarchy

12. Registers. Registers are the fastest memory available on the GPU. Each SIMD engine possesses a 256 KB register file. Registers provide 13 TB/s memory bandwidth and are local to each processing element.

13. Global Data Share (GDS). The ATI 5870 also features a low latency global data share allowing all processing elements to share data. This memory space is not available on NVIDIA GPUs or older ATI GPUs. The size of the GDS is 64 KB and the memory access latency is only 25 clock cycles.

CHAPTER 5

GPGPU PROGRAMMING ENVIRONMENT


Introduction

1. Early GPUs were designed specifically to implement graphics programming standards such as OpenGL and Microsoft DirectX. The tight coupling between the language used by graphics programmers and the graphics hardware ensured good performance for most applications. However, this relationship limited graphics-processing realism to only that which was defined in the graphics language. To overcome this limitation, GPU designers eventually made the pixel processing elements customizable using specialized programs called graphics shaders [20].

2. Over time, developers and GPU vendors evolved shaders from simple assembly language programs into high-level programs that create the amazingly rich scenes found in today's 3D software. To handle increasing shader complexity, the pixel processing elements were redesigned to support more generalized math, logic, and flow control operations. This set the stage for a new way to accelerate computation: the GPGPU.

3. GPU vendors and software developers realized that the trends in GPU design offered an incredible opportunity to take the GPU beyond graphics. All that was needed was a non-graphics Application Program Interface (API) that could engage the emerging programmable aspects of the GPU and access its immense power for non-graphics applications.

4. The first non-graphics API, named Compute Unified Device Architecture (CUDA), was introduced by NVIDIA in August 2007. NVIDIA actually devoted silicon area to facilitate the ease of parallel programming, so this does not represent software changes alone; additional hardware was added to the chip [6].

5. Open Computing Language (OpenCL) is another emerging standard for GPGPU programming. OpenCL was proposed by Apple and has broad industry support from AMD, Intel, ARM, Texas Instruments and many others. The OpenCL specifications are managed by the KHRONOS Group [3].

6. Microsoft DirectCompute is another API that supports GPGPU, on Microsoft Windows Vista and Windows 7. The DirectCompute architecture shares a range of computational interfaces with its competitors, the KHRONOS Group's OpenCL and NVIDIA's CUDA. The following sections provide further details of these APIs.

Compute Unified Device Architecture (CUDA)

7. Introduction. CUDA is a parallel computing architecture developed by NVIDIA. The programming language used to access the GPU resources is a subset of the widely used C language with extensions to support parallel processing. 'C for CUDA' (C with NVIDIA extensions) is compiled through a PathScale Open64 C compiler or the NVIDIA CUDA Compiler (NVCC) to generate machine code for execution on the GPU. CUDA works with all NVIDIA GPUs from the G8X series onwards, including the GeForce, Quadro and Tesla lines. Programs developed for the GeForce 8 series also work without modification on later NVIDIA video cards, due to binary compatibility. CUDA gives developers access to the native instruction set and memory of the parallel computational elements in CUDA enabled GPUs. Using CUDA, the latest NVIDIA GPUs effectively become open architectures like CPUs [6].

8. Limitations. The major limitation of CUDA is that it targets NVIDIA GPUs only [2]. OpenCL, on the other hand, is extremely heterogeneous and targets not only GPUs from all vendors but also x86 CPUs, DSPs, Cell engines and handheld devices.

Open Computing Language (OpenCL)

9. Introduction. OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems and handheld devices using a diverse mix of multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs [3]. OpenCL will form the foundation layer of a parallel computing ecosystem of platform-independent tools, middleware and applications.

10. OpenCL is being created by the KHRONOS Group with the participation of many industry-leading companies and institutions including AMD, Apple, ARM, Electronic Arts, Ericsson, IBM, Intel, Nokia, NVIDIA and Texas Instruments.

11. The OpenCL language is based on C99 for writing programs that execute on OpenCL devices, together with APIs that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism. Its architecture shares a range of computational interfaces with two competitors, NVIDIA's CUDA and Microsoft's DirectCompute.

12. OpenCL Advantages. OpenCL advantages over CUDA are summarized below:

(a) Heterogeneous computing
(b) Code portability
(c) Support for task-parallelism
(d) Cross-platform compatibility

Anatomy of OpenCL

13. The KHRONOS Group released the first OpenCL specification in 2008 [21]. The OpenCL 1.0 specification is made up of three main parts:

(a) Language specification
(b) Platform layer API
(c) Runtime API

14. Language Specification. The language specification describes the syntax and programming interface for writing compute programs that run on supported devices, such as AMD GPUs and multi-core CPUs. The language used is based on a subset of ISO C99; C was chosen as the basis for OpenCL due to its prevalence and familiarity in the developer community. To foster consistent results across different platforms, well-defined IEEE 754 numerical accuracy is specified for all floating point operations, along with a rich set of built-in functions.

15. Platform Layer API. The platform layer API gives the developer access to routines that query for the number and types of devices in the system. The developer can then select and initialize the necessary compute devices to properly run the workload. The required compute resources for job submission and data transfer are created at this layer.

16. Runtime API. The runtime API allows the developer to queue up work for execution and is responsible for managing the compute and memory resources in the OpenCL system.

OpenCL Architecture

17. OpenCL Platform Model. The OpenCL platform consists of a Host and one or more Compute Devices. The host is the CPU device running the main operating system. A compute device is any CPU or GPU device that provides processing power for OpenCL. OpenCL allows multiple heterogeneous compute devices to connect to a single host and efficiently divides the work among them.

18. Compute Device. A compute device consists of a collection of one or more compute units. A compute unit consists of one or more processing elements. Each processing element executes the same code in single instruction multiple data (SIMD) fashion [18]. Fig. 5-1 shows the general layout of a compute device.

Figure 5-1: Compute Device

OpenCL Execution Model

19. Since OpenCL is meant to target not only GPUs but also other accelerators, such as multi-core CPUs, flexibility is given in specifying the type of task, whether data-parallel or task-parallel. The OpenCL execution model includes Compute Kernels and Compute Programs.

20. Compute Kernel. A compute kernel is the basic unit of executable code and can be thought of as similar to a C function. Execution of such kernels can proceed either in-order or out-of-order, depending on the parameters passed to the system when queuing up the kernel for execution. Events are provided so that the developer can check on the status of outstanding kernel execution requests and other runtime requests.

21. Compute Program. A compute program is a collection of compute kernels and functions and is similar to a dynamic library. Both compute kernels and programs are executed on the compute device specified in the host code.

22. Computation Domain. In terms of organization, the execution domain of a kernel is defined by an N-dimensional computation domain. This lets the system know the problem size to which the user would like to apply a kernel. Each element in the execution domain is a work-item, and OpenCL provides the ability to group work-items into work-groups for synchronization and communication purposes.
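To make the work-item terminology concrete, the following is a minimal sketch of a data-parallel OpenCL kernel (the kernel name and arguments are illustrative only): each work-item obtains its index in the computation domain with get_global_id() and processes exactly one element.

    // Each work-item scales one element of the input vector.
    __kernel void scale(__global float *data, float factor)
    {
        size_t gid = get_global_id(0);   // index in the N-D computation domain
        data[gid] = factor * data[gid];
    }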

OpenCL Memory Model

23. OpenCL defines a multi-level memory model, ranging from private memory visible only to an individual processing element to global memory visible to all compute units on the device. Depending on the actual memory subsystem, different memory spaces are allowed to be collapsed together.

24. OpenCL 1.0 defines four memory spaces: private, local, constant and global [22]. Fig. 5-2 shows a diagram of the memory hierarchy defined by OpenCL.

25. Private Memory. Private memory can only be used by a single processing element; no two processing elements can access each other's private memory. It is similar to the registers in a single CPU core. This is the fastest memory available on the GPU, but its size is very limited, generally a few kilobytes.

26. Local Memory. Local memory can be used by the work-items within a work-group. All work-items within a work-group can share data, but data cannot be shared among different work-groups. Physically, local memory is the local data share (LDS) available on the current generation of GPUs. Each compute unit has its own local memory, shared among all the processing elements in that compute unit. Local memory is also extremely fast, though slower than private memory. The size of the LDS ranges from 16 KB up to a maximum of 48 KB on the latest hardware.

Figure 5-2: OpenCL Memory Model

27. Constant Memory. Constant memory is used to store constant data for read-only access by all of the compute units in the device during the execution of a kernel. The host processor is responsible for allocating and initializing the memory objects that reside in this memory space. This is similar to the constant caches available on GPUs.

28. Global Memory. Global memory can be used by all the compute units on the device. This is similar to the off-chip DRAM available on GPUs. On the latest GPUs, GDDR5 DRAM is used as global memory and operates at a bandwidth of 100 GB/s and above.
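In OpenCL C these four memory spaces are selected with address space qualifiers on kernel arguments and variables. A hypothetical kernel signature illustrating all four (the names are for illustration only):

    __kernel void example(__global  float *input,    // off-chip global memory
                          __constant float *coeffs,  // constant memory, read-only
                          __local   float *scratch)  // local memory, shared per work-group
    {
        float acc = 0.0f;   // private memory: registers of one work-item
        /* ... */
    }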

OpenCL Program Structure

29. An OpenCL program consists of host code that executes on the host processor and kernel code that executes on the compute device. A CPU device can be used both as host and compute device.

30. OpenCL Host Code. Host code is a C/C++ program executing on the host processor to augment the kernel code [22]. A sample host code includes the following steps, illustrated in the sketch after this list.

(a) Create an OpenCL context
(b) Get and select the devices to execute the kernel
(c) Create a command queue to accept the execution and memory requests
(d) Allocate OpenCL memory to hold the data for the compute kernel
(e) Compile and build the compute kernel code online
(f) Set up the arguments and execution domain
(g) Kick off compute kernel execution
(h) Collect the results
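The following condensed host-side sketch shows these steps with the standard OpenCL C API. Error checks are omitted for brevity, and the kernel source string and its entry point name "scale" are assumed to be defined elsewhere; this is a minimal sketch, not the project's actual host code.

    #include <CL/cl.h>

    void run_kernel(const char *src, float *data, size_t n)
    {
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);       /* (b) */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL); /* (a) */
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);      /* (c) */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(float), data, NULL);           /* (d) */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);                   /* (e) */
        cl_kernel k = clCreateKernel(prog, "scale", NULL);
        float factor = 2.0f;
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(k, 1, sizeof(float), &factor);                         /* (f) */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);       /* (g) */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float), data,
                            0, NULL, NULL);                                   /* (h) */
    }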


CHAPTER 6

FAST FOURIER TRANSFORM (FFT)


Introduction

1. The Fourier Transform is a well known and widely used tool in many scientific and engineering fields. It is essential for many image processing techniques, including filtering, manipulation, correlation, and compression. Fourier analysis is a family of mathematical techniques, all based on decomposing signals into sinusoids. The discrete Fourier transform (DFT) is the family member used with digitized signals and forms the basis of digital signal processing. The FFT is an efficient algorithm to compute the DFT and its inverse. This chapter provides a brief description of Fourier analysis and its applications, followed by a detailed description of the DFT and FFT.

Fourier Transform

2. The Fourier Transform is named after Jean Baptiste Joseph Fourier (1768-1830), a French mathematician and physicist. Fourier was interested in heat propagation, and presented a paper in 1807 to the Institut de France on the use of sinusoids to represent temperature distributions. The paper contained the controversial claim that any continuous periodic signal could be represented as the sum of properly chosen sinusoidal waves [23].

3. The Fourier transform converts a signal from the spatial domain to the frequency domain by representing it as a sum of properly chosen sinusoids. A spatial domain signal is defined by amplitude values at specific time intervals. A frequency domain signal is defined by the amplitudes and phase shifts of the various sinusoids that make up the signal.

4. Sinusoidal Fidelity. Sinusoids are used to represent a signal in the frequency domain because they are the easiest waveforms to work with, due to Sinusoidal Fidelity: if a sinusoid enters a linear system, the output will also be a sinusoid, at exactly the same frequency and of the same shape as the input; only the amplitude and phase can change [23].

Categories of Fourier Transform

5. The general term Fourier transform can be broken into four categories, resulting from the four basic types of signals that can be encountered [23].

(a) Aperiodic-Continuous. Continuous signals that do not repeat in a periodic fashion, for example decaying exponentials and the Gaussian curve. These signals extend to both positive and negative infinity without repeating in a periodic pattern. The Fourier transform for this type of signal is simply called the Fourier Transform.

(b) Periodic-Continuous. Continuous signals that repeat themselves after a fixed time interval. Examples include sine waves, square waves, and any waveform that repeats itself in a regular pattern from negative to positive infinity. This version of the Fourier transform is called the Fourier Series.

(c) Aperiodic-Discrete. Signals defined only at discrete points between positive and negative infinity that do not repeat themselves in a periodic fashion. This type of Fourier transform is called the Discrete Time Fourier Transform (DTFT).

(d) Periodic-Discrete. Discrete signals that repeat themselves in a periodic fashion from negative to positive infinity. This class of Fourier transform is sometimes called the Discrete Fourier Series, but is most often called the Discrete Fourier Transform (DFT).

Discrete Fourier Transform (DFT)

6. The DFT is one of the most important algorithms in Digital Signal Processing (DSP). It converts a periodic-discrete time domain signal to a periodic-discrete frequency domain signal. As digital computers can only work with information that is discrete and finite in length, the only Fourier transform that can be used in DSP is the DFT.

7. The input to the DFT is a finite sequence of real or complex numbers, making the DFT ideal for processing information stored in computers. In particular, the DFT is widely employed in signal processing and related fields to analyze the frequencies contained in a sampled signal, to solve partial differential equations, and to perform other operations such as convolutions or multiplying large integers.

8. Mathematical Complexity. The DFT is defined by the formula shown in Eq. (6.1).

    X_k = \sum_{n=0}^{N-1} x_n W_N^{nk} = \sum_{n=0}^{N-1} x_n e^{-2\pi i nk/N}, \quad k = 0, 1, \dots, N-1        Eq. (6.1)

where N is the total number of input points (samples), X_k represents the DFT of the time domain signal x_n, i is the imaginary unit, and W_N = e^{-2\pi i/N} is a primitive Nth root of unity, called the Twiddle Factor (W). Evaluating this definition directly requires N^2 operations, as there are N outputs X_k and each output requires a sum of N terms. Thus the mathematical complexity of the DFT is O(N^2) [24].
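As a concrete illustration of this O(N^2) cost, a direct C implementation of Eq. (6.1) is sketched below (illustrative only; it is not part of the project's GPU implementation). The doubly nested loop performs N complex multiply-accumulates for each of the N outputs.

    #include <math.h>
    #include <complex.h>

    /* Direct evaluation of Eq. (6.1): O(N^2) complex operations. */
    void dft(const float complex *x, float complex *X, int N)
    {
        for (int k = 0; k < N; k++) {          /* N outputs           */
            float complex sum = 0;
            for (int n = 0; n < N; n++) {      /* sum of N terms each */
                float angle = -2.0f * M_PI * k * n / N;
                sum += x[n] * (cosf(angle) + I * sinf(angle));
            }
            X[k] = sum;
        }
    }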

Fast Fourier Transform (FFT)

9. The FFT was introduced by J.W. Cooley and J.W. Tukey in 1965 [25]. The FFT is a very efficient method to compute the DFT in O(N log N) operations. It reduces the operation count by exploiting the symmetry of the twiddle factors and eliminating trivial operations such as multiplications by 1. The difference in speed can be substantial, especially for long data sets where N may be in the thousands or millions. In practice, the computation time can be reduced by several orders of magnitude in such cases, and the improvement is roughly proportional to N/log(N). This huge improvement made many DFT-based algorithms practical.

FFT Algorithms

10. Cooley-Tukey FFT Algorithm. By far the most common FFT is the Cooley-Tukey algorithm. This is a divide and conquer algorithm that recursively breaks down a DFT of size N into two pieces of size N/2 at each step, and is therefore limited to power-of-two sizes. Commonly used variants of this framework are listed here.

(a) Radix-2
(b) Radix-4
(c) Split Radix

11. Prime-Factor Algorithm (PFA). The PFA, introduced by Good and Thomas [26], is based on the Chinese Remainder Theorem and factorizes the DFT similarly to Cooley-Tukey but without the twiddle factors.

12. Rader-Brenner Algorithm. The Rader-Brenner algorithm [27] is a Cooley-Tukey-like factorization but with purely imaginary twiddle factors, reducing multiplications at the cost of increased additions and reduced numerical stability. It was later superseded by the split-radix variant of Cooley-Tukey, which achieves the same multiplication count but with fewer additions and without sacrificing accuracy.

Radix-2 FFT Algorithm

13. The Radix-2 FFT algorithm recursively divides the N point input signal into two N/2 point signals at each stage. This recursion continues until the signal is divided into 2 point signals, hence the name Radix-2. The input size N must be an integral power of two to apply this recursion. The FFT is computed in log2(N) stages; the total complex multiplications are (N/2) log2(N) and the complex additions are N log2(N) [28].

14. The Radix-2 FFT algorithm computes the FFT in the following three steps.

(a) Decompose an N point time domain signal into N time domain signals, each composed of a single point.

(b) Calculate the N frequency spectra corresponding to these N time domain signals.

(c) Synthesize the N spectra into a single frequency spectrum.

Decomposition of Time Domain Signal

15. Interlaced Decomposition. The N point time domain signal is decomposed into N time domain signals, each composed of a single point. Decomposition is done in log2(N) stages, using interlaced decomposition at each stage [28]. Interlaced decomposition breaks the signal into its even and odd numbered samples; thus log2(N) stages are required to decompose an N point signal into N single point signals. Fig. 6-1 shows how the decomposition works.

16. Bit Reversal Sorting. The interlaced decomposition can be done in one pass using a bit reversal sorting algorithm. This algorithm rearranges the order of the N time domain samples by counting in binary with the bits flipped left-for-right, as shown in Fig. 6-2. It produces the same output as the stage-by-stage interlaced decomposition of Fig. 6-1.
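A simple C sketch of this reordering is shown below (illustrative only; the parallel GPU version based on the Elster algorithm is described in Chapter 7). For each index, the bits are flipped left-for-right and the two samples at mirrored positions are swapped.

    #include <complex.h>

    /* Reorder x[0..N-1] into bit-reversed order; N must be a power of two. */
    void bit_reverse_sort(float complex *x, int N)
    {
        int bits = 0;
        while ((1 << bits) < N) bits++;        /* bits = log2(N) */
        for (int n = 0; n < N; n++) {
            int rev = 0;
            for (int b = 0; b < bits; b++)     /* flip the bits left-for-right */
                rev |= ((n >> b) & 1) << (bits - 1 - b);
            if (rev > n) {                     /* swap each pair only once */
                float complex tmp = x[n];
                x[n] = x[rev];
                x[rev] = tmp;
            }
        }
    }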

Figure 6-1: Interlaced Decomposition 32 RESTRICTED

RESTRICTED Calculating Frequency Spectra 17. The second step in computing the FFT is to calculate the frequency spectra of N

time domain signals of single point each. According to the Duality Principle the frequency spectrum of a 1 point time domain signal is equal to itself [28]. A single point in the frequency domain corresponds to a sinusoid in the time domain. By duality, the inverse is also true; a single point in the time domain corresponds to a sinusoid in the frequency domain. Thus nothing is required to do this step and each of the 1 point signals is now a frequency spectrum, and not a time domain signal.

Figure 6-2: Bit Reversal Sorting

Frequency Spectrum Synthesis

18. To synthesize a single frequency spectrum from the N frequency spectra, the spectra are combined in the exact reverse order of the time domain decomposition, one stage at a time. For N = 16, the first stage synthesizes 16 frequency spectra (1 point each) into 8 frequency spectra (2 points each); the second stage synthesizes the 8 frequency spectra (2 points each) into 4 frequency spectra (4 points each), and so on. The last stage produces the output of the FFT, a 16 point frequency spectrum.

19. Radix-2 FFT Butterfly. The butterfly is the basic computational element of the FFT, transforming two complex input points into two complex output points. Fig. 6-3 shows a Radix-2 FFT butterfly.

20. For N = 2, the calculations for the FFT are shown in Eq. (6.2) through Eq. (6.4).

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, where W_N = e^{-j2\pi/N}    Eq. (6.2)
X(0) = x(0) + W_2^0 x(1) = x(0) + x(1)    Eq. (6.3)
X(1) = x(0) + W_2^1 x(1) = x(0) - x(1)    Eq. (6.4)

21. Fig. 6-3 shows the flow graph for these calculations, where W represents the twiddle factor. The butterfly shown in Fig. 6-3 requires two complex multiplications and two complex additions.

Figure 6-3: Radix-2 FFT Butterfly


22. This butterfly pattern is repeated over and over to compute the entire frequency spectrum. The flow graph for an 8 point FFT is shown in Fig. 6-4.

Figure 6-4: FFT Flow Graph (N=8)

Reducing Operations Count

23. Evaluated without further simplification, the flow graph in Fig. 6-4 implies an operations count of O(N^2). The FFT reduces the operation count by exploiting the symmetry property of the twiddle factor (W) and eliminating the trivial multiplications. The generalized FFT butterfly is shown in Fig. 6-5.

Figure 6-5: Generalized FFT Butterfly

24. Symmetry Property. Applying the symmetry property, defined in Eq. (6.5) and Eq. (6.6), to the twiddle factors in Fig. 6-5 reduces the complex multiplications by a factor of two [29]. Fig. 6-6 shows the simplified butterfly with one complex multiplication and two complex additions.

W_N^{s+N/2} = W_N^s \cdot W_N^{N/2} = -W_N^s    Eq. (6.5)
W_N^{N/2} = e^{-j\pi} = \cos(-\pi) + j\sin(-\pi) = -1    Eq. (6.6)

25. Thus the operation count of the FFT is reduced to (N/2)·log2(N) complex multiplications and N·log2(N) complex additions, for an overall complexity of O(N·log2(N)).
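As a quick worked check of these counts (an illustrative example, not from the report), consider N = 1024:

N^2 = 1024^2 \approx 1.05 \times 10^6 complex multiplications for the direct DFT
(N/2)\log_2 N = 512 \times 10 = 5120 complex multiplications for the FFT

a reduction of roughly 200 times.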

Figure 6-6: Simplified FFT Butterfly



CHAPTER 7
OPENCL IMPLEMENTATION


Introduction

1. The previous chapter (chapter 6) introduced the Radix-2 FFT algorithm and provided the mathematical details of the algorithm. This chapter describes the handling of complex numbers and how the mathematics is implemented to compute the FFT on the GPU using the OpenCL programming language.

Data Packaging

2. For generality, a Complex to Complex FFT is implemented. The OpenCL 1.0 specification does not support complex numbers. As there is no special data type for handling them, the data needs to be packaged to manage the complex arithmetic in OpenCL.

3. The real and imaginary parts of the input data are single precision floating point values. Each complex number can be stored as a two element vector, where each element is a floating point value. OpenCL supports vector data types with floating point elements; float2 is a built-in data type which stores two floating point values in a vector and is well suited for handling complex data in OpenCL.

4. The host code copies the real and imaginary input values to a buffer in the global memory of the GPU. A kernel is then launched to package the floating point real and imaginary values into the float2 vector data type: the first element of each float2 vector is the real part and the second element is the corresponding imaginary part.
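A packaging kernel along these lines is sketched below. It is a minimal sketch: the kernel and buffer names are illustrative, not the project's actual code.

__kernel void pack_complex(__global const float *re,   /* real inputs */
                           __global const float *im,   /* imaginary inputs */
                           __global float2 *data)      /* packed output */
{
    size_t i = get_global_id(0);          /* one work-item per complex sample */
    data[i] = (float2)(re[i], im[i]);     /* x = real part, y = imaginary part */
}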

1D FFT Implementation

5. As discussed in chapter 6, the Radix-2 algorithm computes the FFT in two major steps.
(a) Data Decomposition (Bit Reversal Sorting)
(b) Butterfly Computations (Frequency Spectrum Synthesis)

Data Decomposition

6. Data decomposition in the FFT is achieved using a bit reversal algorithm. Computing the bit reversal permutation efficiently is a problem in its own right. In general, bit reversal methods fall into two main classes: in-place bit reversals and indirect addressing methods [30]. The first class rearranges the input vector x into its bit reversed order, normally through nested sequences of stride permutations. This method is not efficient for parallel architectures due to extensive branching.

7. Indirect addressing methods, in turn, do not reorder x but instead compute a vector representation of the bit reversal permutation. For example, for N=8 the vector representation is shown in Fig. 7-1.

Figure 7-1: Bit Reversed Vector

8. Elster Algorithm. One of the most efficient methods for producing a vector representation of the bit reversal is Elster's algorithm [31]. The purpose of this algorithm is to create a vector B containing the bit reversal permutation values. Elster's algorithm computes the N-point bit reversal vector in log2(N) steps. For example, the construction for N=8 is shown in Table 7-1.

Table 7-1: Elster Algorithm Calculations

Initial Value    +4    +2    +1
      0           0     0     0
                  4     4     4
                        2     2
                        6     6
                              1
                              5
                              3
                              7
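Elster's doubling construction can be sketched in host-side C as follows (a minimal sketch assuming N is a power of two; the function name is illustrative):

void elster_bit_reversal(unsigned int *B, unsigned int N)
{
    unsigned int len = 1;                                 /* entries filled so far */
    B[0] = 0;
    for (unsigned int inc = N / 2; inc >= 1; inc /= 2) {  /* steps +N/2, +N/4, ..., +1 */
        for (unsigned int i = 0; i < len; ++i)
            B[len + i] = B[i] + inc;                      /* append a shifted copy */
        len *= 2;
    }
}

For N = 8 this yields B = [0 4 2 6 1 5 3 7], matching the last column of Table 7-1.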

Parallel Implementation of Elster Algorithm

9. The Elster method can be parallelized by dividing the total length N by the number of parallel processes and calculating each block on a separate processing element. Assuming a block size of M, the total number of blocks formed is N/M. Since GPUs have a large number of processing elements, this implementation maps well to the GPU architecture.

10. The initial point (head) of each block is pre-calculated by applying Elster's algorithm to the total number of blocks. Fig. 7-2 shows an example of Elster on 16 elements: the heads are computed by bit reversing the 4 element vector [0 1 2 3], which yields [0 2 1 3]. The four blocks are then computed in parallel on separate processing elements, thus speeding up the calculations.
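A hypothetical kernel realizing this blockwise scheme is sketched below: each work-item fills one M-element block, starting from its pre-computed head and applying the same doubling construction with increments scaled to the block. The names and the exact mapping are assumptions for illustration, not the report's code.

__kernel void elster_blocks(__global const uint *heads,  /* bit-reversed block heads */
                            __global uint *B,            /* output permutation vector */
                            const uint N,                /* total length */
                            const uint M)                /* block size */
{
    uint blk  = get_global_id(0);          /* one work-item per block, N/M in total */
    uint base = blk * M;
    uint len  = 1;
    B[base] = heads[blk];                  /* block head, pre-computed on the host */
    for (uint inc = N / 2; inc >= N / M; inc /= 2) {
        for (uint i = 0; i < len; ++i)
            B[base + len + i] = B[base + i] + inc;   /* doubling step within the block */
        len *= 2;
    }
}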

Butterfly Computations

11. The FFT butterfly is derived by expanding the formula of the DFT shown as Eq. (7.1).

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk},  k = 0, 1, ..., N-1    Eq. (7.1)

12. After bit reversal sorting of the input data, one stage of 2 point butterflies can be launched in parallel. Fig. 7-3 shows the simplified FFT butterfly, which includes one complex multiplication, two complex additions and a sign change.
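With complex values packed as float2, the simplified butterfly maps onto a handful of vector operations. A minimal sketch (the helper names are illustrative):

inline float2 cmul(float2 a, float2 b)           /* complex multiplication */
{
    return (float2)(a.x * b.x - a.y * b.y,       /* real part */
                    a.x * b.y + a.y * b.x);      /* imaginary part */
}

/* One radix-2 butterfly: a' = a + W*b, b' = a - W*b. */
inline void butterfly(float2 *a, float2 *b, float2 W)
{
    float2 t = cmul(W, *b);    /* the single complex multiplication */
    *b = *a - t;               /* complex addition with sign change */
    *a = *a + t;               /* complex addition */
}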



Figure 7-2: Applying Elster on 16 Elements

13. Simple Program Structure. To calculate the FFT of N points, the simplest approach is to bit reverse the input data and launch 2 point FFT kernels at each stage to calculate the entire frequency spectrum. This approach requires log2(N) kernel launches, with the input stride increasing by a factor of two at each kernel launch.

14. Limitations. The 2 point FFT butterfly requires only 25 ALU operations, while each thread of the 2 point FFT kernel reads 4 floating point values (16 bytes). The ALU to Fetch ratio, a kernel performance parameter, is the ratio of the time taken by ALU operations to the time spent fetching data; for high kernel performance this ratio should be high. Here the ratio is very low, which degrades kernel performance. Doing more calculations per thread can improve the performance.



Figure 7-3: Simplified FFT Butterfly

Improved Program Structure

15. The ALU to Fetch ratio can be improved by calculating 2 or more FFT stages in a single kernel launch. This approach reduces the total number of kernel launches, thus reducing the overheads incurred in issuing a kernel call.

16. Hard-Coded FFT Kernels. FFT kernels for 2 points, 4 points and 8 points are hard-coded in the program. The code is developed by expanding the DFT formula for N = 2, 4 and 8. A 16 point FFT kernel is not hard-coded because it consumes more register file space per thread, reducing the total number of concurrent threads and resulting in performance degradation.

17. Twiddle Factor Calculation. The twiddle factor (W) for each FFT stage is calculated on the fly. The mathematical equation for calculating the twiddle factor is derived as follows.

W_N^{nk} = e^{-j2\pi nk/N}    Eq. (7.2)

18. Applying Euler's identity, Eq. (7.3), to Eq. (7.2) yields the final equation, Eq. (7.4), used to calculate the twiddle factor in the kernel.

e^{-j\theta} = \cos(\theta) - j\sin(\theta)    Eq. (7.3)
W_N^{nk} = \cos(2\pi nk/N) - j\sin(2\pi nk/N)    Eq. (7.4)
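Inside a kernel, Eq. (7.4) maps onto two transcendental calls. A sketch, where k and n (the butterfly index and the current sub-FFT length) are illustrative names:

float angle = -2.0f * M_PI_F * (float)k / (float)n;          /* -2*pi*k/n */
float2 W = (float2)(native_cos(angle), native_sin(angle));   /* cos(2*pi*k/n) - j*sin(2*pi*k/n) */

The native_* variants trade a little accuracy for speed on the GPU.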


19. Computing Large FFT. FFTs of length 16 and above are computed by invoking a set of hard-coded, small length FFT kernels. An FFT of 1024 points requires 10 stages and is calculated by launching the 8 point FFT kernel thrice, completing 9 stages, followed by a 2 point FFT kernel for the 10th stage. The largest FFT that can be computed using this implementation is 8 million (2^23) points.
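On the host, such staging might look like the sketch below, assuming the cl_command_queue queue and the cl_kernel objects fft8 and fft2 have already been created; the argument index and the stride mapping are illustrative assumptions, not the report's code.

/* 1024-point FFT: three 8-point launches (9 stages) plus one 2-point launch. */
cl_uint stride = 1;
size_t gws8 = 1024 / 8;                 /* one work-item per 8-point sub-FFT */
for (int s = 0; s < 3; ++s) {
    clSetKernelArg(fft8, 2, sizeof(cl_uint), &stride);
    clEnqueueNDRangeKernel(queue, fft8, 1, NULL, &gws8, NULL, 0, NULL, NULL);
    stride *= 8;                        /* input stride grows between stages */
}
size_t gws2 = 1024 / 2;
clSetKernelArg(fft2, 2, sizeof(cl_uint), &stride);
clEnqueueNDRangeKernel(queue, fft2, 1, NULL, &gws2, NULL, 0, NULL, NULL);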

2D FFT Implementation

20. The 2D FFT is implemented by computing a 1D FFT on the rows followed by a 1D FFT on the columns of the input matrix. The 2D input matrix is stored in row major order in computer memory: a matrix is converted to row major order by placing each successive row next to the previous one in a 1D array.

21. On a GPU, computing the FFT on columns is very expensive due to the large input strides in accessing the data. A modified technique therefore introduces a matrix transpose stage after the 1D FFT on rows, so the 2D FFT is calculated in the following steps (see the sketch after this list).
(a) 1D FFT on rows of the matrix
(b) Transpose the matrix
(c) 1D FFT on rows of the transposed matrix
(d) Transpose the matrix to restore natural order
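In host pseudocode the sequence reduces to four calls; the wrapper names are illustrative stand-ins for the kernel launches described in this chapter.

fft_rows(bufA, bufB, width, height);     /* step (a): 1D FFT on every row */
transpose(bufB, bufA, width, height);    /* step (b) */
fft_rows(bufA, bufB, height, width);     /* step (c): rows of the transposed matrix */
transpose(bufB, bufA, height, width);    /* step (d): restore natural order */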

Matrix Transpose Implementation

22. Matrix Transpose. The transpose of a matrix A is another matrix A^t in which each row has been interchanged with the respective column. A matrix transpose thus swaps the elements along each axis of the matrix, as depicted in Fig. 7-4. Mathematically, the transpose is defined by Eq. (7.5), where i and j are the indices along the X and Y axes respectively.

A^t_{i,j} = A_{j,i}    Eq. (7.5)


23. Simple Transpose Kernel. In a simple transpose kernel, each thread reads one element of the input matrix from global memory and writes the same element back at its transposed index in global memory. Fig. 7-5 shows a 4x4 matrix with the index of each element.
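Such a kernel might read as follows for float2 data; the bounds parameters are illustrative.

__kernel void transpose_naive(__global const float2 *in,
                              __global float2 *out,
                              const uint width,     /* columns of 'in' */
                              const uint height)    /* rows of 'in' */
{
    uint x = get_global_id(0);                      /* column index */
    uint y = get_global_id(1);                      /* row index */
    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];    /* coalesced read, scattered write */
}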

Figure 7-4: Matrix Transpose

Figure 7-5: 4x4 Matrix Transpose Example


24. Limitations of Simple Transpose Kernel. The result of applying the simple matrix transpose kernel to a 4x4 matrix is shown in Fig. 7-6. All threads read consecutive elements from global memory, so the reads are coalesced. Coalesced loads are the most efficient way of accessing global memory because the data is loaded as a single chunk; all global memory accesses should be coalesced for optimal performance. Each thread, however, writes out its result in a non-coalesced fashion, as the threads do not write consecutive elements. Each non-coalesced access to global memory is serviced individually, and multiple accesses are serialized, causing long latency [32].

Figure 7-6: Result of Simple Matrix Transpose Kernel

25. The simple transpose kernel is inefficient due to non-coalesced memory accesses and needs modifications to coalesce all memory accesses. This can be achieved using the local memory available on the GPU.

Matrix Transpose Using Local Memory

26. Local memory, or LDS, is an extremely fast memory available on GPUs, ideally providing near zero latency data accesses. Each SIMD engine on the ATI 5870 owns 32 KB of LDS. Data in LDS can be accessed in any pattern without performance penalty provided there are no bank conflicts [32]; bank conflicts occur when two or more threads simultaneously access addresses in the same LDS bank. Using LDS in the transpose kernel can improve performance by a huge factor.


27. The improved transpose kernel reads the input data from global memory in a coalesced fashion and writes it into local memory. Fig. 7-7 shows the coalesced data transfer from global to local memory.

Figure 7-7: Coalesced Data Transfer from Global to Local Memory

28. Once the data is loaded into local memory, each thread can access its transposed data element and write it out to global memory in a coalesced fashion. Fig. 7-8 shows how consecutive threads write the transposed elements to global memory.

Figure 7-8: Writing Transposed Elements to Global Memory

29. The use of local memory in the matrix transpose kernel provides a speedup of over 20 times compared to the un-optimized kernel.
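A tile-based kernel along these lines is sketched below; TILE and the names are illustrative choices, and the __local buffer would be sized to TILE*TILE float2 elements via clSetKernelArg on the host.

#define TILE 16

__kernel void transpose_lds(__global const float2 *in,
                            __global float2 *out,
                            const uint width,           /* columns of 'in' */
                            const uint height,          /* rows of 'in' */
                            __local float2 *tile)       /* TILE*TILE elements */
{
    uint lx = get_local_id(0), ly = get_local_id(1);
    uint gx = get_group_id(0) * TILE + lx;              /* source column */
    uint gy = get_group_id(1) * TILE + ly;              /* source row */

    if (gx < width && gy < height)
        tile[ly * TILE + lx] = in[gy * width + gx];     /* coalesced read */
    barrier(CLK_LOCAL_MEM_FENCE);                       /* wait until the tile is loaded */

    uint tx = get_group_id(1) * TILE + lx;              /* destination column */
    uint ty = get_group_id(0) * TILE + ly;              /* destination row */
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[lx * TILE + ly];   /* coalesced write */
}

Production kernels commonly pad each tile row (for example to TILE + 1 entries) so that the strided reads from the tile do not fall into the same LDS bank.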



CHAPTER 8
RESULTS AND CONCLUSION


1. OpenCL code for the 1D and 2D FFT is developed using the methodology discussed in the previous chapter. This chapter summarizes the results of the 1D and 2D FFT implementation and the benchmarks against high-end Intel CPUs.

Computing Environment

2. Hardware. The GPU performance is benchmarked against two high-end Intel CPUs. The hardware details are as follows.
(a) GPU. ATI Radeon HD 5870
(b) CPU-1. Core-i7 930 @ 2.80 GHz
(c) CPU-2. Core 2 Quad 8200 @ 2.33 GHz

3. Software. The operating system installed on both CPUs is Windows 7 64-Bit, and the CPU performance results are obtained in MATLAB R2009b.

Experiment Setup

4. CPU performance results are obtained by computing the FFT of random values in MATLAB and measuring the execution time using the commands tic and toc. All CPU times presented in Table 8-1 and Table 8-2 are averaged over 50 iterations.

5. The GPU FFT time is measured using the ATI Stream Profiler, a tool provided by AMD for accurate performance measurement; it can measure time with 1 nanosecond precision [33].

1D FFT Results

6. This FFT implementation supports power of two transform sizes up to 8 million (2^23) points. Table 8-1 summarizes the performance comparison over the input size range. FFT time is the kernel execution time of the FFT; the total time additionally includes the memory transfer time between CPU and GPU over the PCIe bus.

Table 8-1: Performance Benchmark for 1D FFT

Input Size        ATI Radeon 5870                    CPU 1 Core-i7 930    CPU 2 Core 2 Quad
(x1000 points)    FFT Time (ms)    Total Time (ms)   FFT Time (ms)        FFT Time (ms)
1                 0.04398          0.5037            0.0064               0.0357
4                 0.05753          0.5077            0.0264               0.1592
16                0.06822          0.6446            0.1125               0.7829
64                0.13872          0.6501            0.4666               3.3158
256               0.42907          1.36833           3.1224               18.278
1024              1.77403          7.19109           17.036               87.589
4096              7.69335          33.5992           134.728              548.41
8192              15.62233         68.87037          273.503              1192.2

2D FFT Results

7. The largest 2D FFT input size supported by this implementation is 2048x2048 (2^22) points. Table 8-2 summarizes the performance comparison over the input size range. FFT time is the kernel execution time of the FFT; the total time additionally includes the memory transfer time between CPU and GPU over the PCIe bus.


Table 8-2: Performance Benchmark for 2D FFT

Input Size      ATI Radeon 5870                    CPU 1 Core-i7 930    CPU 2 Core 2 Quad
(Points)        FFT Time (ms)    Total Time (ms)   FFT Time (ms)        FFT Time (ms)
64x64           0.07556          0.41039           0.085                0.2982
128x128         0.11226          0.55073           0.243                1.3016
256x256         0.20268          0.68671           0.965                5.6619
512x512         0.63373          1.55834           4.288                28.576
1024x1024       3.81680          9.27949           11.395               155.24
2048x2048       27.4293          54.0063           72.13                648.38

Analysis of Results

8. The results in Table 8-1 and Table 8-2 show that the OpenCL FFT implemented on the ATI 5870 GPU outperforms state of the art Intel CPUs. The performance advantage grows with the input size, because larger inputs expose more parallelism. The data transfer time between the GPU and the motherboard is a factor which limits the performance gains.

9. GPU Bottleneck. The GPU is connected to the motherboard through a PCIe bus with a memory bandwidth of only 5.2 GB/s. Data transfer between CPU and GPU is therefore very expensive, as evident from the performance results presented in Tables 8-1 and 8-2.

10. AMD Fusion Project. AMD has recently launched a new compute architecture having both the CPU and the GPU on a single chip [34]. This architecture is


named the Accelerated Processing Unit (APU). Having both the CPU and the GPU on a single chip removes the PCIe interface, thus removing the data transfer bottleneck.

Conclusion

11. The GPU is a novel parallel computing architecture providing huge speedups to data-parallel computations at a relatively low monetary cost. The performance gains depend upon the exploitable parallelism in the algorithm being implemented.

12. OpenCL is a heterogeneous parallel computing environment allowing developers to tap into the immense computing power of GPUs for general purpose computations. OpenCL provides code portability across a range of compute devices.

13. The OpenCL implementation of the FFT on the ATI 5870 GPU outperforms the latest Intel CPUs by exploiting the data-level parallelism inherent in the FFT algorithm. Both 1D and 2D FFT algorithms have been implemented successfully.

14. Currently, data transfer to the GPU over the PCIe interface presents a performance bottleneck. The AMD Fusion APUs remove this bottleneck, opening new horizons for parallel computing on GPUs.



REFERENCES
[1] http://gpgpu.org/about
[2] http://www.streamcomputing.nl/blog/difference-between-cuda-and-opencl
[3] http://www.khronos.org/opencl
[4] K. Moreland and E. Angel, "The FFT on a GPU," in Proceedings of the ACM SIGGRAPH Conference on Graphics Hardware, 2003, pp. 112-119.
[5] J. Spitzer, "Implementing a GPU-efficient FFT," SIGGRAPH Course on Interactive Geometric and Scientific Computations with Graphics Hardware, 2003.
[6] David Kirk/NVIDIA and Wen-mei Hwu, "CUDA Programming Book".
[7] http://developer.download.nvidia.com/compute/CUFFT_Library_0.8.pdf
[8] http://developer.download.nvidia.com/compute/CUFFT_Library_3.0.pdf
[9] http://developer.apple.com/mac/library/samplecode/OpenCL_FFT.html
[10] J. Kong et al., "Accelerating MATLAB Image Processing Toolbox Functions on GPUs," GPGPU-3, March 2010.
[11] http://ixbtlabs.com/articles3/video/cypress-p2.html
[12] http://en.wikipedia.org/wiki/Graphics_processing_unit
[13] http://www.videocardbenchmark.net/30dayshare.html
[14] Jimmy Pettersson and Ian Wainwright, "Radar Signal Processing with GPUs," SAAB, Master Thesis, 2010.
[15] http://www.hpcwire.com/features/Compilers_and_More_GPU_Architecture_and_Applications
[16] http://en.wikipedia.org/wiki/Flynn's_taxonomy
[17] Timothy G. Mattson and Beverly A. Sanders, "Patterns for Parallel Programming," p. 15.
[18] "ATI Stream SDK - OpenCL Programming Guide," Advanced Micro Devices, 2010.
[19] "OpenCL and the ATI Radeon HD 5870 Architecture," Advanced Micro Devices.
[20] "OpenCL Technology Brief," Apple Inc.
[21] www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf
[22] "An Introduction to OpenCL," www.amd.com.
[23] Steven W. Smith, "The Scientist and Engineer's Guide to Digital Signal Processing," Ch 8, pp. 140-146, 1997.
[24] Douglas L. Jones and Ivan Selesnick, "The DFT, FFT, and Practical Spectral Analysis," p. 42, Rice University, Houston, Texas, 2007.
[25] James W. Cooley and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19: 297-301, 1965.
[26] I. J. Good and L. H. Thomas, "Using a computer to solve problems in physics," in Applications of Digital Computers, Boston, 1963.
[27] C. M. Rader, "Discrete Fourier transforms when the number of data samples is prime," IEEE Proceedings, 1968, pp. 1107-1108.
[28] Steven W. Smith, "The Scientist and Engineer's Guide to Digital Signal Processing," Ch 12, pp. 225-235, 1997.
[29] http://www.cmlab.csie.ntu.edu.
[30] Dniza C. Morales Berrios, "A Parallel Bit Reversal Algorithm and its CILK Implementation," University of Puerto Rico, Mayagüez Campus, 1999.
[31] A. C. Elster, "Fast Bit-Reversal Algorithms," ICASSP '89 Proceedings, 1989.
[32] David W. Gohara, "OpenCL Memory Layout and Access," Washington University School of Medicine, St. Louis, September 2009.
[33] André Heidekrüger, Sr. System Engineer Graphics, "AMD/ATI Stream Computing on GPU," T-Systems HPCN-Workshop, 2010.
[34] http://www.fusion.amd.com
