Acknowledgement
All praise to the Almighty Allah, whose blessings helped us to successfully complete this thesis work. We express our deep and indescribable gratitude to our honorable supervisor Dr. K.M. Azharul Hasan, Professor, Department of Computer Science and Engineering, Khulna University of Engineering & Technology, for his outstanding and helpful contribution in giving support, suggestions and encouragement. We acknowledge his constant cooperation and proper guidance throughout the development process. He has been a great source of effective and feasible ideas, profound knowledge and all-time feedback for us.
We thank all the teachers of the Department of Computer Science and Engineering who helped us by providing guidelines to perform the work. We would also like to thank our friends and family for their cordial support.
Authors.
Abstract
In today's world, various research activities are being done on data science. Matrix mathematics is one of the most significant sectors in this field; multidimensional matrices are also known as tensors. In our thesis work, we have worked on multidimensional matrix mathematics. Multidimensional matrices have huge applications in the field of data science, but previous works on multidimensional matrix operations were done on the CPU. These operations are so complex that performing them on the CPU is inefficient and costs a great amount of time. As the applications of data science grow day by day, this time and cost must be reduced. So we have performed these multidimensional matrix operations on the GPU. The GPU gives us the ability to do these operations in parallel with the use of threads and blocks, so the operations become faster and more efficient. Our main goal is to perform multidimensional matrix operations such as addition, subtraction and multiplication in a GPU environment and to apply this in the fields of data science and high performance computing.
Contents
Page
Acknowledgement i
Abstract ii
Contents iii
List of Figures v
List of Tables vi
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Organization of The Thesis . . . . . . . . . . . . . . . . . . . . . 2
2 Literature Review 3
2.1 GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 What is GPU . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 GPU vs CPU . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.3 GPU Programming . . . . . . . . . . . . . . . . . . . . . . 4
2.1.4 Simple Processing Flow . . . . . . . . . . . . . . . . . . . 5
2.1.5 Managing Memory in GPU . . . . . . . . . . . . . . . . . 6
2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Multidimensional Matrix Operations On GPU 10
3.1 Blocks and Threads Organization . . . . . . . . . . . . . . . . . 10
3.2 Multidimensional Matrix Addition . . . . . . . . . . . . . . . . . 13
3.2.1 Formula Construction . . . . . . . . . . . . . . . . . . . . 13
3.2.2 Implementation On GPU . . . . . . . . . . . . . . . . . . . 14
3.2.3 Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Multidimensional Matrix Subtraction . . . . . . . . . . . . . . . 17
3.4 Multidimensional Matrix Scalar Multiplication . . . . . . . . . 17
3.5 Multidimensional Matrix Multiplication . . . . . . . . . . . . . . 18
3.5.1 Formula Construction . . . . . . . . . . . . . . . . . . . . 18
3.5.2 Implementation on GPU . . . . . . . . . . . . . . . . . . . 19
3.5.3 Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5 Conclusion 39
5.0.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.0.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.0.3 Future Scope of Work . . . . . . . . . . . . . . . . . . . . 40
References 41
List of Figures
List of Tables
Chapter 1
Introduction
1.1 Introduction
on GPU using the concept of parallel computing. Performing these operations with threads and blocks enhances performance by reducing the computation time and the space needed to store data.
1.3 Scope
Chapter 2
Literature Review
2.1 GPU Architecture
2.1.1 What is GPU
GPU stands for Graphics Processing Unit, which can be used for parallel computation to obtain fast operation. Parallel computation is the process of breaking down large and complex problems into smaller and simpler subproblems and solving them concurrently, which leads to fast computation.
2.1.2 GPU vs CPU
Computations on the CPU are not always fast. When we deal with complex and large data, CPU computation becomes slower, as the CPU architecture is not designed to perform operations in parallel.
The GPU has high computation capability: it is designed with many core processors, termed Arithmetic Logic Units (ALUs), whereas the CPU does not support multiple ALUs the way the GPU structure does. These ALUs are important for parallel operations. That is why GPU programming is necessary for efficient parallel computing.
2.1.3 GPU Programming
There are two types of parallelism: (i) task parallelism, which is concerned with distributing many tasks, and (ii) data parallelism, which maps each data element to a parallel thread. We have used the CUDA platform, developed by NVIDIA, to perform parallel computation on the GPU. CUDA C, which is based on standard C, is used for programming in the GPU environment. CUDA is a platform for heterogeneous computing, in which host code runs on the CPU and device code runs on the GPU. A simple "Hello World" program in the heterogeneous CUDA C paradigm is as follows:
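A minimal version of this program, following the standard NVIDIA CUDA C introduction and matching the description below, is:

```cuda
#include <stdio.h>

// Device code: __global__ marks a kernel that runs on the GPU. It does nothing here.
__global__ void mykernel(void) {
}

int main(void) {
    // Host code: launch the kernel on the device with 1 block of 1 thread.
    mykernel<<<1, 1>>>();
    printf("Hello World!\n");
    return 0;
}
```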
__global__ is used to indicate that the function is to run on the GPU instead of the CPU, and the function mykernel() is launched from the CPU.
2.1.4 Simple Processing Flow
The processing flow describes how data parallelism is carried out and how the CPU and GPU interact. The following steps are performed for any operation done on the GPU:
Step 1, shown in figure 3.2(a): the input data is copied from CPU memory to GPU memory (DRAM).
Step 2 involves loading and executing the GPU code in the kernel (figure 3.2(b)), which distributes the data to each core processing unit of the GPU using threads.
Finally, in step 3, the result is transferred from GPU memory back to CPU memory (figure 3.2(c)).
A kernel launch generates a large number of blocks on the GPU, and each block consists of a large number of threads running in parallel.
2.1.5 Managing Memory in GPU
In heterogeneous computing, host and device are two important terms used in the CUDA programming model. By host we mean the CPU and by device we mean the GPU. For allocating and freeing memory on the GPU we use functions that differ from the standard C functions used on the CPU.
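Device memory is handled with cudaMalloc, cudaMemcpy and cudaFree, the device-side counterparts of malloc, memcpy and free; a minimal sketch:

```cuda
#include <stdio.h>

int main(void) {
    int a[4] = {1, 2, 3, 4};
    int *d_a;                                             // device pointer

    cudaMalloc((void **)&d_a, sizeof a);                  // allocate device memory
    cudaMemcpy(d_a, a, sizeof a, cudaMemcpyHostToDevice); // host -> device copy
    /* ... kernel launches that use d_a would go here ... */
    cudaMemcpy(a, d_a, sizeof a, cudaMemcpyDeviceToHost); // device -> host copy
    cudaFree(d_a);                                        // free device memory
    return 0;
}
```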
2.2 Related Works
• In the paper [5], the author discussed the formulation of various multidimensional matrix operations such as addition and multiplication, and established their relation with classical matrix algebra. This greatly helped us in our implementation of multidimensional matrix operations on GPU. Ashu M.G. Solo also discussed other multidimensional matrix operations in his papers [6] [7] [8] [9].
• Techniques for gaining insight into higher-order tensor decompositions are discussed in paper [11]. These tensor decomposition techniques helped us understand higher-order tensors, i.e., multidimensional matrices. There are also other works on the decomposition of tensors or multidimensional matrices [12] [13] [14] [15] [16].
• In the paper [18], the authors discussed an efficient process for storing sparse and factored tensors. They proposed a coordinate format for storing multidimensional matrices or tensors and also discussed the computational efficiency of this format for various tensor operations. This gave us a clear view of the indexing structure of multidimensional matrices. An efficient and fast way of compiling compound tensor expressions with sparse and dense operands is discussed in [19].
• In the paper [20], the authors introduced a relational data model based on multidimensional matrices or tensors. This model facilitates data analysis based on multidimensional matrices and relational manipulations of them. They then built a tensor-relational data management system that combines relational algebraic operations with tensor algebraic operations.
• In the paper [21], the authors used the GPU to accelerate the computation of tensor eigenvalues. This has inspired us to perform further multidimensional matrix operations on GPU.
2.3 Discussion
Chapter 3
Multidimensional Matrix
Operations On GPU
3.1 Blocks and Threads Organization
Heterogeneous computation uses both host (CPU) and device (GPU) participation. When a kernel function is launched from the host side, a large number of threads are generated; the host code may launch multiple kernels. A grid is made up of many blocks: usually, a grid is organized as a 1D, 2D or 3D array of blocks, and a block consists of a 1D, 2D or 3D array of threads. An example of thread organization is shown in Figure 3.1.
A block can be accessed with the built-in variable blockIdx. Similarly, the built-in variable threadIdx is used to access the particular index location of a thread. Both variables are used inside a kernel.
In the CUDA programming model there are two more built-in variables: gridDim and blockDim. gridDim represents the dimensions of the grid, and blockDim represents the dimensions of a block along a particular axis. For example, blockDim.x returns the number of threads a block has along the x axis, and gridDim.x returns the number of blocks the grid has along the x axis. Similarly, blockIdx.x gives the block index along the x axis and threadIdx.x gives the thread index along the x axis. Both the grid of blocks and the threads within a block can be organized in 1D, 2D or 3D.
A block can be split into parallel threads in three ways.
1) Using blocks: launch many blocks with one thread each, so each block handles one element of the computation.
2) Using threads: use all the threads in one block. The CUDA example for adding two vector elements can be written using only the threads of one block as follows:
3) Combining blocks and threads: blocks and threads are often combined. At most 65,535 blocks can be launched in one kernel function call, and for many GPUs the maximum number of threads per block is 512 (or 1024). The CUDA example for adding two vector elements by combining blocks and threads is as follows:
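The elided listing presumably followed the standard CUDA vector-add pattern, combining the block and thread indices into one global index; a sketch under that assumption:

```cuda
#define N 1024
#define THREADS_PER_BLOCK 256

// Each thread adds one pair of elements; the guard handles
// the case where n is not a multiple of the block size.
__global__ void add(const int *a, const int *b, int *c, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

// Launched from the host as:
//   add<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);
```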
3.2 Multidimensional Matrix Addition
The first and foremost condition for the addition of two multidimensional matrices is that they must have the same number of elements in each dimension. The addition can only be done element by element: we simply add the correspondingly positioned elements of the two multidimensional matrices. For that reason the "linearization" technique is efficient, and it is also a simple task. We linearize each element of each multidimensional matrix; after linearization we obtain a 1-D vector for each matrix and can simply add the same-indexed elements of the two vectors.
3.2.1 Formula Construction
Consider two multidimensional matrices M1 and M2 with resultant matrix M, so the addition operation is M1 + M2 = M. As addition is performed on an element-to-element basis [5], we obtain
3.2.2 Implementation On GPU
Suppose a 3-D matrix M1 of dimensions 3*2*2 and another 3-D matrix M2 of dimensions 3*2*2 are to be added, and the sum is a 3-D matrix M of dimensions 3*2*2.
M1* : 1 7 4 10 2 8 5 11 3 9 6 12
M2* : 2 14 8 20 4 16 10 22 6 18 12 24
The Sum gives M* such that M* = M1* + M2* and thus we obtain resultant
matrix M. For this addition total 12 threads are generated:
Thread1 adds the 1st two elements of M1* and M2* [1+2 =3]
Thread2 adds the 2nd two elements of M1* and M2* [7+14 =21]
...
Thread12 adds the last two elements of M1* and M2* [12+24 =36]
Linearization of the addition of the two matrix elements gives,
M* : 3 21 12 30 6 24 15 33 9 27 18 36
Figure 3.4: Logical Computational Architecture of Addition.
Figure 3.5: Actual Computational Architecture of Addition.
3.2.3 Pseudocode
id := 0
for i = 0 to N−1            // N = blocks * threads
begin loop
    p1 := B                 // B = block index
    p2 := b1                // b1 = thread index in x dimension
    dimB := blockDim        // block dimension along x axis
    id := p2 + p1 * dimB
    m[id] := m1[id] + m2[id]
end loop
3.3 Multidimensional Matrix Subtraction
3.4 Multidimensional Matrix Scalar Multiplication
After linearizing each element of a multidimensional matrix, we can simply multiply each element by the scalar.
3.5 Multidimensional Matrix Multiplication
3.5.1 Formula Construction
For two multidimensional matrices M1 and M2 to be multiplied, the lengths of the two dimensions participating in the multiplication must agree, and the dimensions not participating in multiplication should also be equal.
The number of dimensions of the product matrix M will be equal to the number of dimensions of M1 and M2. As we are multiplying the 1st dimension of M1 and the 4th dimension of M2, the 1st dimension of M will be equal to the 1st dimension of M1 and the 4th dimension of M will be equal to the 4th dimension of M2. For example, in the multiplication of 2D matrices, if M1 is 2*3 and M2 is 3*4, then M will be 2*4. Similarly, for 3D matrices, if M1 is 2*3*4 and M2 is 4*3*2 and the multiplied dimensions are 1 and 3, then M will be 2*3*2. For 4D matrices, if M1 is 2*3*4*4 and M2 is 4*3*4*3 and the multiplied dimensions are 1 and 4, then M will be 2*3*4*3, and so on. We have previously seen that the formula to find each specific component of the product matrix is:

C_{ij} = \sum_{x=1}^{n} a_{ix} \cdot b_{xj}
Suppose M1 (2*2*2*2) and M2 (2*2*2*2), and we want to multiply the 1st and 2nd dimensions, so M will be 2*2*2*2. Here the number of dimensions is 4 and the length is 2. In this case the formula for computing each component of M is:

M_{ijkl} = \sum_{x=1}^{2} M1_{ixkl} \cdot M2_{xjkl}
then the index of any element of M = blockIndex * (3*4*5) + (threadIndex in the z dimension) * (4*5) + (threadIndex in the y dimension) * 5 + (threadIndex in the x dimension) * 1.
Say the M1 and M2 matrices are as follows:
formula stated above:

M_{0000} = (M1_{0000} \cdot M2_{0000}) + (M1_{0100} \cdot M2_{1000})
Firstly, we linearized the elements of the M1 and M2 matrices. Then, to find the 1st element of the product matrix, we compute the sum given by the formula stated above.
3.5.3 Pseudocode
sum := 0
input width
for x = 0 to width − 1
begin loop
    p1 := B                 // B = block index
    p2 := b1                // b1 = thread index in z dimension
    p3 := b2                // b2 = thread index in y dimension
    p4 := b3                // b3 = thread index in x dimension
    t1 := p1 * 8 + x * 4 + p3 * 2 + p4
    t2 := x * 8 + p2 * 4 + p3 * 2 + p4
    temp := m1[t1] * m2[t2]
    sum := sum + temp
end loop
t3 := p1 * 8 + p2 * 4 + p3 * 2 + p4
m[t3] := sum
Chapter 4
In this chapter, we present the implementation and the related experiments. The experiments and analyses were done on a computer with a Core i5 processor having 4 cores, each running at 1.7 GHz, and 8 GB of RAM. The graphics card we used is an NVIDIA GeForce GTX 1070. It has a total of 16 GB of available graphics memory, with 8 GB of shared system memory and 8 GB of dedicated GDDR5 video memory, and it has 1920 CUDA cores. The graphics clock is 1.5 GHz.
The integrated development environment (IDE) we used is Microsoft Visual Studio 2015 Community Edition, and the programming was done in CUDA C compiled with nvcc [22] [23].
4.2 Experiment Results
The main aim of this thesis is to reduce the time required to perform multidimensional matrix operations, so the main measure in our experiments is the computational time on both CPU and GPU, measured in ms. By computational time we mean the amount of time required to complete a process; for example, in matrix multiplication, the time the system needs to find all the elements of the product matrix is the computational time for matrix multiplication. We also define speed up as a measure: the ratio of the CPU computational time to the GPU computational time. To identify the characteristics of the speed up values, we have measured their R² values to check whether they are linear. R² is a statistical term used mainly to show how closely data are fitted to a regression line. We have also evaluated GPU performance for various data types to find the speed up in all cases.
Table 4.1: Addition for Fixed number of Data Length.
Number of Blocks Number of Threads CPU Time (ms) GPU Time (ms)
100000 10 2.594514 0.387584
10000 100 2.587917 0.92976
1000 1000 2.588284 0.798912
100 10000 2.593782 0.00112
10 100000 2.5951 0.001152
As the data length is the same in all samples, the computational time for the CPU is nearly the same in each case. The number of threads, however, is not fixed, and increasing the number of threads has reduced the required GPU computational time.
Secondly we have done it for fixed number of blocks and variable vector
length and threads. Here the number of blocks is fixed to 16,384. Then the
result we obtained for this is shown in Table 4.2
Table 4.2: Addition for Fixed number of Blocks.
Number of Blocks Number of Threads CPU Time (ms) GPU Time (ms) Speed Up
16384 10 0.429609 0.383584 1.12
16384 20 0.936562 0.757408 1.24
16384 40 1.693876 1.501984 1.13
16384 80 3.366491 2.99184 1.13
16384 160 6.846616 5.96752 1.15
16384 320 13.551007 11.941888 1.13
16384 512 25.233357 20.006816 1.26
Here the data length is not the same for all samples. The data length equals the product of the number of blocks and the number of threads; as the number of blocks is fixed, the data length is proportional to the number of threads. So increasing the threads increases the data length, which in turn increases the computational time for both CPU and GPU. But it is clearly seen that the GPU computation is faster than the CPU's, and the GPU remains faster as more threads are added. The result can be visualized by the graph in Figure 4.1.
Thirdly, we fixed the number of threads and varied the vector length and the number of blocks. Here the number of threads is fixed at 512, and the result we obtained is shown in Table 4.3. Again the data length differs between samples: as the number of threads is fixed, the data length is proportional to the number of blocks, so increasing the blocks increases the data length, which increases the computational time for both CPU and GPU. But it is clearly seen that the GPU computation is much faster than the CPU's.
Table 4.3: Addition for Fixed number of Threads.
Number of Blocks Number of Threads CPU Time (ms) GPU Time (ms) Speed Up
1 512 0.001832 0.008801 0.21
2 512 0.004399 0.008896 0.49
4 512 0.005499 0.009286 0.59
8 512 0.01063 0.010592 1.01
16 512 0.020527 0.015584 1.32
32 512 0.067081 0.021376 3.13
64 512 0.081376 0.032576 2.49
128 512 0.170084 0.056448 3.01
256 512 0.330637 0.104352 3.17
512 512 0.680702 0.198752 3.42
1024 512 1.341612 0.32656 4.11
2048 512 2.781095 0.763872 3.64
4096 512 5.438085 1.51008 3.6
8192 512 11.367773 3.00992 3.77
16384 512 22.233357 6.006816 3.7
32768 512 47.950872 12.0025 4.01
65536 512 87.61068 23.99005 3.65
The result can be visualized by the graph in Figure 4.2, where the number of blocks is taken along the x axis and the computation time along the y axis. The graph clearly shows that the GPU is faster than the CPU. At first, when the number of blocks is small, the difference in computational time is small, as the data size is small and the computation is simple. As the number of blocks increases, the data size also increases, and the GPU becomes comparatively faster while the CPU slows down on the large data. Thus we have performed the addition operation faster and more efficiently using the GPU. This result was obtained for the multidimensional matrix addition operation, but it also holds for the other element-by-element operations such as multidimensional matrix subtraction and scalar multiplication.
Figure 4.2: Computational Time of Addition for Fixed number of Threads.
we have been successful in reducing the computation time to a great extent for multiplication.
Figure 4.3: Computational Time Graph for 4D Multiplication.
Table 4.5: Computational Time for 6D Multiplication.
The results are also shown in the graph in Figure 4.4, in the same way as before. It likewise indicates that the GPU is clearly faster than the CPU.
Thirdly, we implemented the operation for 8D square matrices with data lengths from 2 to 10. It also shows that the GPU completes the operation faster than the CPU. We can also see that as the dimension increases, the difference between the CPU and GPU computational times grows, i.e., the CPU becomes slower for high-dimensional data. So we can say that the higher the dimension, the greater the reduction in computation time on the GPU.
The obtained result is shown below in the graph in Figure 4.5, showing the computational time for CPU and GPU:
For 10D square matrix data, we have gone up to length 6 in all dimensions: because of the large amount of data (approximately 300 million elements), going further exceeds the capacity of our GPU. With a more powerful GPU we could overcome this limitation.
For 10D data the resulting graph is as follows:
Table 4.8: Speed Up For various Data Types.
Data type Size(Bytes) CPU time (ms) GPU Time (ms) Speed Up
short 2 52.21875 26.736223 1.953109
int 4 54.68471 26.392832 2.071953
long int 4 55.61866 26.632641 2.088365
long long int 8 60.72672 42.441345 1.430839
After plotting this data in a graph, taking the data type size along the x axis and the speed up along the y axis, we obtained a nearly flat curve. So the GPU works efficiently and faster irrespective of the data type.
4.2.2.3 R² Analysis
Operation R² Value
Addition 0.81
4D Multiplication 0.97
6D Multiplication 0.73
8D Multiplication 0.62
10D Multiplication 0.8
We can see from the table and graphs that the R² values are close to 1, so the speed ups we obtained are nearly linear. We could get more accurate results with more samples on a more powerful GPU. We can visualize the speed up results by plotting the data in a graph, which we call the speed up graph. The speed up graph for multidimensional data is shown below, and the average speed up for various dimensions can also be shown in a graph:
(a) Addition. (b) 4D Multiplication.
Chapter 5
Conclusion
5.0.1 Summary
Various research activities are being done on data science. Matrix mathematics is one of the most significant sectors in this field, and various areas of computer science rely on multidimensional matrix operations. We have seen that performing these operations on higher-dimensional matrices on the CPU is computationally much slower than on the GPU. Thus we have proposed methods for computing multidimensional matrix operations efficiently on GPU.
For the addition and subtraction of two multidimensional matrices, we linearized each element of each matrix into a 1-D array and then performed the respective operation on a per-element basis; for scalar multiplication the process is the same. For the multiplication of multidimensional matrices, we developed a method by which we can pair the appropriate elements of the two multidimensional matrices following the traditional multiplication rule, with each single thread computing one element of the product matrix.
Performing multidimensional matrix operations on the CPU is less efficient; performing the same operations on the GPU makes them faster and more efficient. This speed up will surely help in various areas of data science and high performance computing.
In our thesis, we have used the CUDA platform developed by NVIDIA together with standard C functions. CUDA C supports heterogeneous computation with a processing flow that provides data parallelism.
5.0.2 Limitations
Our thesis has some limitations. While dealing with higher-dimensional matrices, 10-D and above, we faced garbage outputs for the GPU computation times: as the number of elements becomes very large (approximately 300 million), the data goes beyond the capacity of the GPU. A higher-performance GPU is required to perform operations on higher-dimensional matrices efficiently.
References
[3] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd Ed.).
Johns Hopkins University Press, Baltimore, MD, USA, 1996.
part 4 of 6. Lecture Notes in Engineering and Computer Science, 2185,
06 2010.
[10] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and ap-
plications. SIAM REVIEW, 51(3):455–500, 2009.
[12] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and
Matus Telgarsky. Tensor decompositions for learning latent variable
models. Journal of Machine Learning Research, 15:2773–2832, 2014.
[13] Parikshit Shah, Nikhil Rao, and Gongguo Tang. Sparse and low-
rank tensor decomposition. In C. Cortes, N. D. Lawrence, D. D. Lee,
M. Sugiyama, and R. Garnett, editors, Advances in Neural Informa-
tion Processing Systems 28, pages 2548–2556. Curran Associates, Inc.,
2015.
[16] S. Acer, T. Torun, and C. Aykanat. Improving medium-grain partitioning
for scalable sparse tensor decomposition. IEEE Transactions on Parallel
Distributed Systems, 29(12):2814–2825, Dec. 2018.
[17] Brett W. Bader and Tamara G. Kolda. Algorithm 862: MATLAB tensor
classes for fast algorithm prototyping. ACM Transactions on Mathemat-
ical Software, 32(4):635–653, December 2006.
[19] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and
Saman Amarasinghe. The tensor algebra compiler. Proc. ACM Program.
Lang., 1(OOPSLA):77:1–77:29, October 2017.
[20] Mijung Kim and K. Selçuk Candan. Tensordb: In-database tensor ma-
nipulation with tensor-relational query plans. In Proceedings of the 23rd
ACM International Conference on Conference on Information and Knowl-
edge Management, CIKM ’14, pages 2039–2041, New York, NY, USA,
2014. ACM.
[21] Grey Ballard, Tamara G. Kolda, and Todd Plantenga. Efficiently computing tensor eigenvalues on a GPU. Pages 1340–1348, May 2011.