Alberto Quesada Aranda, Åbo Akademi University, Finland. Written Spring 2014.
Abstract
The aim of this report is to put into practice the skills learned in the course Advanced Computer Graphics and Graphics Hardware. For this, I will implement a CUDA program that calculates the two-point angular correlation function for two sets of galaxies. Because of the amount of calculation needed, it is a perfect example to prove the efficiency of GPUs over CPUs in this kind of operation. As presented later, we obtain a speed-up of roughly 70x on average, compared to computing the same calculation on the CPU.

Introduction

GPU accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate scientific, engineering, and enterprise applications. The main objective of the course is to be able to utilize the GPU resources as well as possible and to design, implement and run a program that applies them.
The use of GPUs for scientific calculations has proven very useful due to the large number of operations required and the capability of GPUs to run these calculations in parallel. We have chosen cosmological measurements as our field of study.
In this paper, I describe the GPU implementation of one of these cosmological calculations: the two-point angular correlation function. This function requires independent calculations of the same quantity for all data points, which makes it an ideal candidate for parallelization.
The two-point angular correlation function is a statistical measure of the extent to which galaxies are randomly distributed in the universe or lumped together. Once I have implemented it, I will run it with two different data sets: the real data D (real observations of galaxy coordinates) and the random data R (randomly generated galaxy coordinates).
The input consists of a number of positions of galaxies on a celestial sphere, all of them at the same distance from the Earth. Since the observations are located on a sphere centred on the Earth, the angular separation between two observations gives the distance between the galaxies.
The idea is to compare the real observations to a randomly generated set and see if the real galaxies are more lumped together in space than in a random distribution. If they are, this can be seen as evidence of the gravitational forces caused by cold dark matter, which causes more attraction than can be explained by the known visible mass of the galaxies.
For this, I have to calculate the histogram distribution of angular separations, covering [0, 64] degrees. Each histogram (DD, DR and RR) counts the number of galaxy pairs whose angular separation falls in each bin.
CUDA Model

To implement this I need to master some aspects of CUDA and GPU execution, so I will give a quick summary of the basics that I use for the implementation.
A simple way to understand the difference between a CPU and a GPU is to compare how they process tasks. A CPU consists of a few cores optimized for sequential serial processing, while a GPU consists of thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously, which lets it process parallel workloads efficiently.
For the GPU implementation, I make use of the CUDA programming language. From now on, the host is the CPU and the device is the GPU. Parallel portions of the application are executed on the device as kernels by an array of threads (one kernel is executed at a time, but many threads execute each kernel). Each thread performs an individual calculation in parallel on the GPU; threads are grouped into blocks, which define how the threads will be distributed on the GPU and what sections of local memory they can access.
A kernel launch creates a grid of thread blocks in which threads can cooperate via shared memory. Threads in different blocks cannot cooperate. A thread block is divided into warps of 32 threads; a warp is executed physically in parallel on a multiprocessor (SIMD).
Summary of a GPU program:
- Allocate data on host.
- Allocate data on device.
- Transfer data from host to device.
- Operate on device (kernel).
- Transfer data from device to host.
It is possible that different threads want to access the same data at the same time; to protect against this risk we need to use atomic operations when accessing that data in global or shared memory. Essentially, with atomic operations we are serializing the execution of the threads.
In CUDA we have no control over thread block execution order, hence we must design for arbitrary thread block execution order. Threads within the same thread block can be synchronized. Threads from different thread blocks cannot be synchronized.
Implementation details

In this section I will explain how the code works. The code is divided into three functions: main(), calc() and distance().
The main() function just defines a few variables, opens the input/output files and calls the calc() function. It also checks whether the two input files are the same, in which case the calculations will be slightly different. The most important variable declared here is nbins, the number of bins of the histogram (in most of the code I use nbins + 2 to also store the underflow and overflow bins).
The calc() function manages the whole program except the angle calculations. It starts by reading all the input data, saving it in global memory and converting it to radians. Then the GPU part starts: first of all I define how many blocks and threads per block there will be in each grid (each kernel call). The total number of threads in the grid will be the size of the sub-matrix used for the calculations. The chosen values are 32 blocks and 512 threads per block, which gives a sub-matrix size of 32 x 512 = 16384.
The next step is to allocate the histograms. I need three histogram arrays: hist[], dev_hist[] and hist_array[]. I use the first two to calculate the sub-histograms of each sub-matrix on the GPU and copy them back to the CPU. dev_hist[] is allocated in GPU memory and its size is 8256 (num_blocks * (nbins + 2)): by giving each block its own memory region, I avoid the need for atomic operations when saving the sub-histograms once all threads have finished the angle operations.
hist[] is another array of the same size as dev_hist[], but allocated in global (CPU) memory. Once the kernel has finished its task, I copy the contents of the GPU back to the CPU into hist[]. The last array, hist_array[], holds the global sum of the histograms of all blocks; its size is nbins + 2. After getting the histogram arrays ready, I copy the galaxy data to GPU memory.
Before starting the kernel, I have to know how many sub-matrices I am going to need for the calculations (if the number of galaxies is bigger than the number of threads per grid, I will have to call the kernel more than once). The number of sub-matrices needed is the number of galaxies divided by the number of threads per grid (the sub-matrix size), rounded up. In the case of 100k galaxies this is 100000 / 16384 = 6.1, rounded up to 7 sub-matrices.
The kernel has to be called num_submatrices * num_submatrices times (7 x 7 = 49 in the 100k example), unless the two files are the same, in which case I only loop over the upper half of the matrix of operations. Before each call I have to prepare the dev_hist[] array, initialising it to 0s; after each call I copy dev_hist[] from the GPU back to global memory (CPU) into hist[]. Once I have the results of the kernel in global memory I can process them and add them to the global histogram (hist_array[]), which at the end of the execution will contain the final data for creating the histogram.

In the previous image I show how the kernel calls are organised. In the first call, the kernel calculates the angles between each of the first SUBMATRIX_SIZE elements of the first galaxy set and all of the first SUBMATRIX_SIZE elements of the second galaxy set. The second call performs the operations between each element of the first galaxy set in the range [SUBMATRIX_SIZE, SUBMATRIX_SIZE*2] and, again, all of the first SUBMATRIX_SIZE elements of the second galaxy set. When all the distances between the first SUBMATRIX_SIZE elements of the second set and every element of the first set have been computed, the program repeats the process with the elements of the second set in the range [SUBMATRIX_SIZE, SUBMATRIX_SIZE*2]. [[ This explanation is difficult to understand, but the image is more informative. ]] When all the sub-matrices have been computed I just have to write the results to an output file and free all the memory.
The distance() function is the kernel. On every kernel call, the code is executed 16384 times (blocks * threads per block = total threads) and each execution calculates 16384 angles. For example, to complete the 100k * 100k operations the kernel must be called 38 times (I call it 49 times, but 11 of those calls don't do anything; in the code this is the case when idx > max_xind).
Each block of threads has a shared vector for saving the sub-histogram calculated by that block. Before all the threads are allowed to start the calculations, one of them has to initialise the shared vector to 0; the instruction __syncthreads() enforces this (all the threads have to reach this point before continuing with the execution). Once all the threads have reached this point they start with the calculations; I used the formula specified in the course slides. Each thread calculates the two-point angular distance from one galaxy of the first set (idx) to SUBMATRIX_SIZE galaxies of the second set (range [yind, ymax]); this can be seen in the previous image. In the case of two different input files the total number of operations is N*N, and in the case of the same file N*(N-1)/2.
After calculating the distance I have to convert it to degrees and check in which bin it should be included. The histogram array (shared_array) is shared by all the threads within the block, so I have to use an atomic operation to write each result into it. This operation prevents loss of information if two different threads try to write to the same location at the same time.
The last step is to copy the information of every shared array to the global array (dev_hist[]). Before doing so, we need to ensure that all the threads have reached this point and that every shared array already holds all its information.
Summary:
1. Initialise default variables.
2. Save galaxy data to global memory.
3. Allocate memory for histograms.
4. Copy galaxy data to GPU memory.
5. Determine size and position of the sub-matrices of the calculation.
6. Launch the kernel for each sub-matrix:
   1. Allocate local memory for the local histogram.
   2. Perform the operations.
   3. Sum the local histograms into global memory.
7. Copy GPU memory (histograms) back to the CPU.
8. Combine everything into a single histogram and save it to an output file.
Results

The code has been executed with the following hardware and software:
Cuda compilation tools, release 5.5, V5.5.0

CUDA device:
- Name: Tesla M2050
- Compute capability: 2.0
- Clock rate: 1147000
- Total global mem: 2817982464
- Multiprocessor count: 14
- Threads in warp: 32
- Max threads per block: 1024
- Max thread dimensions: (1024, 1024, 64)
- Max grid dimensions: (65535, 65535, 65535)
Code compiled and executed on asterope with these commands:

nvcc -arch sm_20 project.cu -o project
srun --gres=gpu project input1 input2
The execution times of the CUDA code (GPU) and the sequential code (CPU) for two different data sets (10k and 100k) can be seen in the following tables:
For the 10k data set we obtain a speed-up of approximately 38x and for the 100k data set of 101x.
The execution time is bigger in the case of DR because we need to compute more operations than in the other cases. When the input files are different, the number of required operations is N*N (where N is the number of galaxies in the input), while if we use the same input file (DD, RR) the number of operations is N*(N-1)/2.
On the following page the histograms obtained for the 100k-galaxy example are presented. To create them I used Excel (the Excel data and original graphics can be found in the project folder).
Both histograms (DD, RR) are quite similar. The only visible difference is that in the RR histogram there are almost twice as many values in the range 0 - 0.25.
GPU     DR       DD        RR
10k     0.3100   0.2500    0.2500
100k    19.1400  10.3600   10.6400

CPU     DR       DD        RR
10k     15.9300  7.8700    8.0500
100k    1630.86  796.6600  805.6700
[Figure: Histogram DD — frequency (0 to 700,000,000) over the range [0, 64] degrees. Figure: Histogram RR — frequency (0 to 1,400,000,000) over the same range.]
A positive value of ω(θ) indicates that there are more galaxies with an angular separation θ than expected in a random distribution. A negative value of ω(θ) indicates that there are fewer galaxies with an angular separation θ than expected in a random distribution. If ω(θ) = 0, the distribution of galaxies is random.