Alberto Quesada Aranda, Åbo Akademi University, Finland. Written Spring 2014.
Abstract
The aim of this report is to put into practice the skills learned in the course Advanced Computer Graphics and Graphics Hardware. For this, I will implement a CUDA program that calculates the two-point angular correlation function for two sets of galaxies. Because of the amount of calculation needed, it is a perfect example to prove the efficiency of GPUs over CPUs in this kind of operation. As presented later, we obtain a speed-up of roughly 70x on average, compared to computing the same calculation on the CPU.

Introduction

GPU accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate scientific, engineering, and enterprise applications. The main objective of the course is to be able to utilize the GPU resources as well as possible and to design, implement and run a program that applies them.
The use of GPUs for scientific calculations has proven very useful due to the large number of operations required and the capability of GPUs to run these calculations in parallel. We have chosen cosmological measurements as our field of study.
In this paper, I describe the GPU implementation of one of these cosmological calculations: the two-point angular correlation function. This function requires independent calculations of the same quantity for all data points, which makes it an ideal candidate for parallelization.
The two-point angular correlation function is a statistical measure of the extent to which galaxies are randomly distributed in the universe or lumped together. Once I have implemented it, I will run it with two different data sets: the real data D (real observations of galaxy coordinates) and the random data R (randomly generated galaxy coordinates).
The input consists of a number of positions of galaxies on a celestial sphere, all of them at the same distance from the Earth. Since the observations are located on a sphere centred on the Earth, the angular separation between two observations gives the distance between the galaxies.
The idea is to compare the real observations to a randomly generated set and see if the real galaxies are more lumped together in space than in a random distribution. If they are, this can be seen as evidence of the gravitational forces caused by cold dark matter, which causes more attraction than can be explained by the known visible mass of the galaxies.
For this, I have to calculate the histogram distribution of angular separations, covering [0, 64] degrees. Each histogram (DD, DR and RR) counts the number of galaxy pairs whose angular separation falls in each bin.
CUDA Model

To implement this I need to master some aspects of CUDA and GPU execution, so I will give a quick summary of the basics that I use for the implementation.
A simple way to understand the difference between a CPU and a GPU is to compare how they process tasks. A CPU consists of a few cores optimized for sequential serial processing, while a GPU consists of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously, which lets it process parallel workloads efficiently.
For the GPU implementation, I make use of the CUDA programming language. From now on, the host is the CPU and the device is the GPU. Parallel portions of the application are executed on the device as kernels by an array of threads (one kernel is executed at a time, but many threads execute each kernel). Each thread performs an individual calculation in parallel on the GPU; threads are grouped into blocks, which define how the threads will be distributed on the GPU and what sections of local memory they can access.
A kernel launch creates a grid of thread blocks in which threads can cooperate via shared memory. Threads in different blocks cannot cooperate. A thread block is divided into warps of 32 threads; a warp is executed physically in parallel on a multiprocessor (SIMD).
Summary of a GPU program:
- Allocate data on host.
- Allocate data on device.
- Transfer data from host to device.
- Operate on device (kernel).
- Transfer data from device to host.
It is possible that different threads want to access the same data at the same time; to protect against this risk we need to use atomic operations when accessing that data in global or shared memory. Essentially, with atomic operations we are serializing the execution of the threads.
In CUDA we have no control over thread block execution order, hence we must design for arbitrary thread block execution order. Threads within the same thread block can be synchronized. Threads from different thread blocks cannot be synchronized.
Implementation details

In this section I will explain how the code works. The code is divided into three functions: main(), calc() and distance().
The main() function just defines a few variables, opens the input/output files and calls the calc() function. It also checks whether the two input files are the same, in which case the calculations will be slightly different. The most important variable declared here is nbins, the number of bins of the histogram (in most of the code I use nbins + 2 to also store the underflow and overflow bins).
The calc() function manages the whole program except the angle calculations. It starts by reading all the input data, saving it in global memory and converting it to radians. Then the GPU part starts: first of all I define how many blocks and threads per block there will be in each grid (each kernel call). The total number of threads in the grid will be the size of the sub-matrix used for the calculations. The chosen values are 32 blocks and 512 threads per block, which gives a sub-matrix size of 32 x 512 = 16384.
The next step is to allocate the histograms. I need three histogram arrays: hist[], dev_hist[] and hist_array[]. I use the first two to calculate the sub-histograms of each sub-matrix on the GPU and copy them back to the CPU. dev_hist[] is allocated in GPU memory and its size is 8256 (num_blocks * (nbins + 2)): by giving each block its own memory region, I avoid the need for atomic operations when saving the sub-histograms once all threads have finished the angle operations.
hist[] is another array of the same size as dev_hist[], but allocated in global (CPU) memory. Once the kernel has finished its task, I copy the contents of the GPU back to the CPU into hist[]. The last array, hist_array[], holds the global sum of the histograms of all blocks; its size is nbins + 2. After getting the histogram arrays ready, I copy the galaxy data to GPU memory.
Before starting the kernel, I have to know how many sub-matrices I am going to need for the calculations (if the number of galaxies is bigger than the number of threads per grid, I will have to call the kernel more than once). The number of sub-matrices needed is the number of galaxies divided by the number of threads per grid (the sub-matrix size), rounded up. In the case of 100k galaxies this is 100000 / 16384 = 6.1, rounded up to 7 sub-matrices.
The kernel has to be called num_submatrices * num_submatrices times (7 x 7 = 49 in the 100k example), unless the two files are the same, in which case I only loop over the upper half of the matrix of operations. Before each call I have to prepare the dev_hist[] array, initialising it to 0s; after each call I copy dev_hist[] from the GPU back to global memory (CPU) into hist[]. Once I have the results of the kernel in global memory I can process them and add them to the global histogram (hist_array[]), which at the end of the execution will contain the final data for creating the histogram.

In the previous image I show how the kernel calls are organised. In the first call, the kernel calculates the angles between each of the first SUBMATRIX_SIZE elements of the first galaxy set and all of the first SUBMATRIX_SIZE elements of the second galaxy set. The second call performs the operations between each element of the first galaxy set in the range [SUBMATRIX_SIZE, SUBMATRIX_SIZE*2] and, again, all of the first SUBMATRIX_SIZE elements of the second galaxy set. When all the distances between the first SUBMATRIX_SIZE elements of the second set and every element of the first set have been computed, the program repeats the process with the elements of the second set in the range [SUBMATRIX_SIZE, SUBMATRIX_SIZE*2]. [[ This explanation is difficult to understand, but the image is more informative. ]] When all the sub-matrices have been computed I just have to write the results to an output file and free all the memory.
The distance() function is the kernel. On every kernel call, the code is executed 16384 times (blocks * threads per block = total threads) and each execution calculates 16384 angles. For example, to complete the 100k * 100k operations the kernel must be called 38 times (I call it 49 times, but 11 of those calls don't do anything; in the code this is the case when idx > max_xind).
Each block of threads has a shared vector for saving the sub-histogram calculated by that block. Before all the threads are allowed to start the calculations, one of them has to initialise the shared vector to 0; the instruction __syncthreads() enforces this (all the threads have to reach this point before continuing with the execution). Once all the threads have reached this point they start with the calculations; I used the formula specified in the course slides. Each thread calculates the two-point angular distance from one galaxy of the first set (idx) to SUBMATRIX_SIZE galaxies of the second set (range [yind, ymax]); this can be seen in the previous image. In the case of two different input files the total number of operations is N*N, and in the case of the same file N*(N-1)/2.
After calculating the distance I have to convert it to degrees and check in which bin it should be included. The histogram array (shared_array) is shared by all the threads within the block, so I have to use an atomic operation to write each result into it. This operation prevents loss of information if two different threads try to write to the same location at the same time.
The last step is to copy the information of every shared array to the global array (dev_hist[]). Before doing so, we need to ensure that all the threads have reached this point and that every shared array already holds all its information.
Summary:
1. Initialise default variables.
2. Save galaxy data to global memory.
3. Allocate memory for histograms.
4. Copy galaxy data to GPU memory.
5. Determine size and position of the sub-matrices of the calculation.
6. Launch the kernel for each sub-matrix:
   1. Allocate local memory for the local histogram.
   2. Perform the operations.
   3. Sum the local histograms into global memory.
7. Copy GPU memory (histograms) back to the CPU.
8. Combine everything into a single histogram and save it to an output file.
Results

The code has been executed with the following hardware and software:
Cuda compilation tools, release 5.5, V5.5.0

CUDA device:
- Name: Tesla M2050
- Compute capability: 2.0
- Clock rate: 1147000
- Total global mem: 2817982464
- Multiprocessor count: 14
- Threads in warp: 32
- Max threads per block: 1024
- Max thread dimensions: (1024, 1024, 64)
- Max grid dimensions: (65535, 65535, 65535)
Code compiled and executed on asterope with these commands:

nvcc -arch sm_20 project.cu -o project
srun --gres=gpu project input1 input2
The execution times of the CUDA code (GPU) and the sequential code (CPU) for two different data sets (10k and 100k) can be seen in the following tables:
For the 10k data set we obtain a speed-up of approximately 38x and for the 100k data set of 101x.
The execution time is bigger in the case of DR because we need to compute more operations than in the other cases. When the input files are different, the number of required operations is N*N (where N is the number of galaxies in the input), while if we use the same input file (DD, RR) the number of operations is N*(N-1)/2.
On the following page the histograms obtained for the 100k-galaxy example are presented. To create them I used Excel (the Excel data and original graphics can be found in the project folder).
Both histograms (DD, RR) are quite similar. The only visible difference is that in the RR histogram there are almost twice as many values in the range 0 - 0.25.
GPU     DR       DD        RR
10k     0.3100   0.2500    0.2500
100k    19.1400  10.3600   10.6400

CPU     DR       DD        RR
10k     15.9300  7.8700    8.0500
100k    1630.86  796.6600  805.6700
[Figure: Histogram DD — frequency (0 to 700,000,000) over the range [0, 64] degrees. Figure: Histogram RR — frequency (0 to 1,400,000,000) over the same range.]
A positive value of ω(θ) indicates that there are more galaxies with an angular separation θ than expected in a random distribution. A negative value of ω(θ) indicates that there are fewer galaxies with an angular separation θ than expected in a random distribution. If ω(θ) = 0, the distribution of galaxies is random.