Acknowledgement
All praise to the Almighty Allah, whose blessings helped us to successfully complete this thesis work. We express our deep and indescribable gratitude to our honorable supervisor Dr. K.M. Azharul Hasan, Professor, Department of Computer Science and Engineering, Khulna University of Engineering & Technology, for his outstanding and helpful contribution in giving support, suggestions and encouragement. We acknowledge his constant cooperation and proper guidance throughout the development process. He has been a great source of effective and feasible ideas, profound knowledge and all-time feedback for us.
We thank all the teachers of the Department of Computer Science and Engineering who helped us by providing guidelines to perform the work. We would also like to thank our friends and family for their cordial support.
Authors.
Abstract
In today's world, various research activities are being done on data science. Matrix mathematics is one of the most significant sectors in this field; multidimensional matrices are also known as tensors. In our thesis work, we have worked on multidimensional matrix mathematics. Multidimensional matrices have huge applications in the field of data science, but previous works on multidimensional matrix operations were done on the CPU. These operations are so complex that performing them on the CPU is inefficient and costs a great amount of time. As the applications of data science grow day by day, this time and cost must be reduced. So we have performed these multidimensional matrix operations on the GPU. The GPU gives us the ability to do these operations in parallel with the use of threads and blocks, so the operations become faster and more efficient. Our main goal is to perform multidimensional matrix operations such as addition, subtraction and multiplication in a GPU environment and to apply this in the fields of data science and high performance computing.
Contents
Page
Acknowledgement i
Abstract ii
Contents iii
List of Figures v
List of Tables vi
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Organization of The Thesis . . . . . . . . . . . . . . . . . . . . . 2
2 Literature Review 3
2.1 GPU Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 What is GPU . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 GPU vs CPU . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.3 GPU Programming . . . . . . . . . . . . . . . . . . . . . . 4
2.1.4 Simple Processing Flow . . . . . . . . . . . . . . . . . . . 5
2.1.5 Managing Memory in GPU . . . . . . . . . . . . . . . . . 6
2.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Multidimensional Matrix Operations On GPU 10
3.1 Blocks and Threads Organization . . . . . . . . . . . . . . . . . 10
3.2 Multidimensional Matrix Addition . . . . . . . . . . . . . . . . . 13
3.2.1 Formula Construction . . . . . . . . . . . . . . . . . . . . 13
3.2.2 Implementation On GPU . . . . . . . . . . . . . . . . . . . 14
3.2.3 Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Multidimensional Matrix Subtraction . . . . . . . . . . . . . . . 17
3.4 Multidimensional Matrix Scalar Multiplication . . . . . . . . . 17
3.5 Multidimensional Matrix Multiplication . . . . . . . . . . . . . . 18
3.5.1 Formula Construction . . . . . . . . . . . . . . . . . . . . 18
3.5.2 Implementation on GPU . . . . . . . . . . . . . . . . . . . 19
3.5.3 Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5 Conclusion 39
5.0.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.0.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.0.3 Future Scope of Work . . . . . . . . . . . . . . . . . . . . 40
References 41
List of Figures
List of Tables
Chapter 1
Introduction
1.1 Introduction
on GPU using the concept of parallel computing. Performing these operations with threads and blocks enhances performance by reducing the computation time and the space needed to store data.
1.3 Scope
Chapter 2
Literature Review
2.1 GPU Architecture
2.1.1 What is GPU
GPU stands for Graphics Processing Unit, which can be used for parallel computation to obtain fast operation. Parallel computation is the process of breaking down large and complex problems into smaller and simpler subproblems and solving them concurrently, which leads to fast computation.
2.1.2 GPU vs CPU
Computations on the CPU are not always fast. When we deal with complex and large data, CPU computation becomes slower, as the CPU architecture is not designed to perform operations in parallel.
The GPU has high computation capability: it is designed with many core processors, termed Arithmetic Logic Units (ALUs), whereas the CPU does not support multiple ALUs the way the GPU structure does. These ALUs are important for parallel operations. That is why GPU programming is necessary for efficient parallel computing.
2.1.3 GPU Programming
There are two types of parallelism: (i) task parallelism, which is concerned with distributing many tasks, and (ii) data parallelism, which maps each data element to a parallel thread. We have used the CUDA platform, developed by NVIDIA, to perform parallel computation on the GPU. CUDA C, which is based on standard C, is used for programming in the GPU environment. CUDA is a platform for heterogeneous computing, in which host code runs on the CPU and device code runs on the GPU. A simple "Hello World" program in the heterogeneous CUDA C paradigm is as follows:
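A minimal version of this program, following the standard NVIDIA CUDA C introduction and matching the description below, is:

```cuda
#include <stdio.h>

// Device code: __global__ marks a kernel that runs on the GPU. It does nothing here.
__global__ void mykernel(void) {
}

int main(void) {
    // Host code: launch the kernel on the device with 1 block of 1 thread.
    mykernel<<<1, 1>>>();
    printf("Hello World!\n");
    return 0;
}
```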
__global__ is used to indicate that the function is to run on the GPU instead of the CPU, and the function mykernel() is launched from the CPU.
2.1.4 Simple Processing Flow
The processing flow describes how data parallelism is carried out and how the CPU and GPU interact. The following steps are performed for any operation done on the GPU:
Step 1, shown in figure 3.2(a): the input data is copied from CPU memory to GPU memory (DRAM).
Step 2 involves loading and executing the GPU code in the kernel (figure 3.2(b)), which distributes the data to each core processing unit of the GPU using threads.
Finally, in step 3, the result is transferred from GPU memory back to CPU memory (figure 3.2(c)).
A kernel launch generates a large number of blocks on the GPU, and each block consists of a large number of threads running in parallel.
2.1.5 Managing Memory in GPU
In heterogeneous computing, host and device are two important terms used in the CUDA programming model. By host we mean the CPU and by device we mean the GPU. For allocating and freeing memory on the GPU we use functions that differ from the standard C functions used on the CPU.
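Device memory is handled with cudaMalloc, cudaMemcpy and cudaFree, the device-side counterparts of malloc, memcpy and free; a minimal sketch:

```cuda
#include <stdio.h>

int main(void) {
    int a[4] = {1, 2, 3, 4};
    int *d_a;                                             // device pointer

    cudaMalloc((void **)&d_a, sizeof a);                  // allocate device memory
    cudaMemcpy(d_a, a, sizeof a, cudaMemcpyHostToDevice); // host -> device copy
    /* ... kernel launches that use d_a would go here ... */
    cudaMemcpy(a, d_a, sizeof a, cudaMemcpyDeviceToHost); // device -> host copy
    cudaFree(d_a);                                        // free device memory
    return 0;
}
```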
2.2 Related Works
• In the paper [5], the author discussed the formulation of various multidimensional matrix operations such as addition and multiplication, and established their relation with classical matrix algebra. This greatly helped us in our implementation of multidimensional matrix operations on GPU. Ashu M.G. Solo also discussed other multidimensional matrix operations in his papers [6] [7] [8] [9].
• Techniques for gaining insight into higher-order tensor decompositions are discussed in paper [11]. These tensor decomposition techniques helped us understand higher-order tensors, i.e., multidimensional matrices. There are also other works on the decomposition of tensors or multidimensional matrices [12] [13] [14] [15] [16].
• In the paper [18], the authors discussed an efficient process for storing sparse and factored tensors. They proposed a coordinate format for storing multidimensional matrices or tensors and also discussed the computational efficiency of this format for various tensor operations. This gave us a clear view of the indexing structure of multidimensional matrices. An efficient and fast way of compiling compound tensor expressions with sparse and dense operands is discussed in [19].
• In the paper [20], the authors introduced a relational data model based on multidimensional matrices or tensors. This model facilitates data analysis based on multidimensional matrices and relational manipulations of them. They then built a tensor-relational data management system that combines relational algebraic operations with tensor algebraic operations.
• In the paper [21], the authors used the GPU to accelerate the computation of tensor eigenvalues. This has inspired us to perform further multidimensional matrix operations on GPU.
2.3 Discussion
Chapter 3
Multidimensional Matrix
Operations On GPU
3.1 Blocks and Threads Organization
Heterogeneous computation uses both host (CPU) and device (GPU) participation. When a kernel function is launched from the host side, a large number of threads are generated; the host code may launch multiple kernels. A grid is made up of many blocks: usually, a grid is organized as a 1D, 2D or 3D array of blocks, and a block consists of a 1D, 2D or 3D array of threads. An example of thread organization is shown in Figure 3.1.
A block can be accessed with the built-in variable blockIdx. Similarly, the built-in variable threadIdx is used to access the particular index location of a thread. Both variables are used inside a kernel.
In the CUDA programming model there are two more built-in variables: gridDim and blockDim. gridDim represents the dimensions of the grid, and blockDim represents the dimensions of a block along a particular axis. For example, blockDim.x returns the number of threads a block has along the x axis, and gridDim.x returns the number of blocks the grid has along the x axis. Similarly, blockIdx.x gives the block index along the x axis and threadIdx.x gives the thread index along the x axis. Both the grid of blocks and the threads within a block can be organized in 1D, 2D or 3D.
A block can be split into parallel threads in three ways.
1) Using blocks: launch many blocks with one thread each, so each block handles one element of the computation.
2) Using threads: use all the threads in one block. The CUDA example for adding two vector elements can be written using only the threads of one block as follows:
3) Combining blocks and threads: blocks and threads are often combined. At most 65,535 blocks can be launched in one kernel function call, and for many GPUs the maximum number of threads per block is 512 (or 1024). The CUDA example for adding two vector elements by combining blocks and threads is as follows:
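The elided listing presumably followed the standard CUDA vector-add pattern, combining the block and thread indices into one global index; a sketch under that assumption:

```cuda
#define N 1024
#define THREADS_PER_BLOCK 256

// Each thread adds one pair of elements; the guard handles
// the case where n is not a multiple of the block size.
__global__ void add(const int *a, const int *b, int *c, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

// Launched from the host as:
//   add<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);
```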
3.2 Multidimensional Matrix Addition
The first and foremost condition for the addition of two multidimensional matrices is that they must have the same number of elements in each dimension. The addition can only be done element by element: we simply add the correspondingly positioned elements of the two multidimensional matrices. For that reason the "linearization" technique is efficient, and it is also a simple task. We linearize each element of each multidimensional matrix; after linearization we obtain a 1-D vector for each matrix and can simply add the same-indexed elements of the two vectors.
3.2.1 Formula Construction
Consider two multidimensional matrices M1 and M2 with resultant matrix M, so the addition operation is M1 + M2 = M. As addition is performed on an element-to-element basis [5], we obtain
3.2.2 Implementation On GPU
Suppose a 3-D matrix M1 of dimensions 3*2*2 and another 3-D matrix M2 of dimensions 3*2*2 are to be added, and the sum is a 3-D matrix M of dimensions 3*2*2.
M1* : 1 7 4 10 2 8 5 11 3 9 6 12
M2* : 2 14 8 20 4 16 10 22 6 18 12 24
The Sum gives M* such that M* = M1* + M2* and thus we obtain resultant
matrix M. For this addition total 12 threads are generated:
Thread1 adds the 1st two elements of M1* and M2* [1+2 =3]
Thread2 adds the 2nd two elements of M1* and M2* [7+14 =21]
...
Thread12 adds the last two elements of M1* and M2* [12+24 =36]
Linearization of the addition of the two matrix elements gives,
M* : 3 21 12 30 6 24 15 33 9 27 18 36
Figure 3.4: Logical Computational Architecture of Addition.
Figure 3.5: Actual Computational Architecture of Addition.
3.2.3 Pseudocode
id := 0
for i = 0 to N−1            // N = blocks * threads
begin loop
    p1 := B                 // B = block index
    p2 := b1                // b1 = thread index in x dimension
    dimB := blockDim        // block dimension along x axis
    id := p2 + p1 * dimB
    m[id] := m1[id] + m2[id]
end loop
3.3 Multidimensional Matrix Subtraction
3.4 Multidimensional Matrix Scalar Multiplication
After linearizing each element of a multidimensional matrix, we can simply multiply each element by the scalar.
3.5 Multidimensional Matrix Multiplication
3.5.1 Formula Construction
For two multidimensional matrices M1 and M2 to be multiplied, the lengths of the two dimensions participating in the multiplication must agree, and the dimensions not participating in multiplication should also be equal.
The number of dimensions of the product matrix M will be equal to the number of dimensions of M1 and M2. As we are multiplying the 1st dimension of M1 and the 4th dimension of M2, the 1st dimension of M will be equal to the 1st dimension of M1 and the 4th dimension of M will be equal to the 4th dimension of M2. For example, in the multiplication of 2D matrices, if M1 is 2*3 and M2 is 3*4, then M will be 2*4. Similarly, for 3D matrices, if M1 is 2*3*4 and M2 is 4*3*2 and the multiplied dimensions are 1 and 3, then M will be 2*3*2. For 4D matrices, if M1 is 2*3*4*4 and M2 is 4*3*4*3 and the multiplied dimensions are 1 and 4, then M will be 2*3*4*3, and so on. We have previously seen that the formula to find each specific component of the product matrix is:

C_{ij} = \sum_{x=1}^{n} a_{ix} \cdot b_{xj}
Suppose M1 (2*2*2*2) and M2 (2*2*2*2), and we want to multiply the 1st and 2nd dimensions, so M will be 2*2*2*2. Here the number of dimensions is 4 and the length is 2. In this case the formula for computing each component of M is:

M_{ijkl} = \sum_{x=1}^{2} M1_{ixkl} \cdot M2_{xjkl}
then the index of any element of M = blockIndex * (3*4*5) + (threadIndex in the z dimension) * (4*5) + (threadIndex in the y dimension) * 5 + (threadIndex in the x dimension) * 1.
Say the M1 and M2 matrices are as follows:
formula stated above:

M_{0000} = (M1_{0000} \cdot M2_{0000}) + (M1_{0100} \cdot M2_{1000})
Firstly, we linearized the elements of the M1 and M2 matrices. Then, to find the 1st element of the product matrix, we compute the sum given by the formula stated above.
3.5.3 Pseudocode
sum := 0
input width
for x = 0 to width − 1
begin loop
    p1 := B                 // B = block index
    p2 := b1                // b1 = thread index in z dimension
    p3 := b2                // b2 = thread index in y dimension
    p4 := b3                // b3 = thread index in x dimension
    t1 := p1 * 8 + x * 4 + p3 * 2 + p4
    t2 := x * 8 + p2 * 4 + p3 * 2 + p4
    temp := m1[t1] * m2[t2]
    sum := sum + temp
end loop
t3 := p1 * 8 + p2 * 4 + p3 * 2 + p4
m[t3] := sum
Chapter 4
In this chapter, we present the implementation and the related experiments. The experiments and analyses were done on a computer with a Core i5 processor having 4 cores, each running at 1.7 GHz, and 8 GB of RAM. The graphics card we used is an NVIDIA GeForce GTX 1070. It has a total of 16 GB of available graphics memory, with 8 GB of shared system memory and 8 GB of dedicated GDDR5 video memory, and it has 1920 CUDA cores. The graphics clock is 1.5 GHz.
The integrated development environment (IDE) we used is Microsoft Visual Studio 2015 Community Edition, and the programming was done in CUDA C compiled with nvcc [22] [23].
4.2 Experiment Results
The main aim of this thesis is to reduce the time required to perform multidimensional matrix operations, so the main measure in our experiments is the computational time on both CPU and GPU, measured in ms. By computational time we mean the amount of time required to complete a process; for example, in matrix multiplication, the time the system needs to find all the elements of the product matrix is the computational time for matrix multiplication. We also define speed up as a measure: the ratio of the CPU computational time to the GPU computational time. To identify the characteristics of the speed up values, we have measured their R² values to check whether they are linear. R² is a statistical term used mainly to show how closely data are fitted to a regression line. We have also evaluated GPU performance for various data types to find the speed up in all cases.
Table 4.1: Addition for Fixed number of Data Length.
Number of Blocks Number of Threads CPU Time (ms) GPU Time (ms)
100000 10 2.594514 0.387584
10000 100 2.587917 0.92976
1000 1000 2.588284 0.798912
100 10000 2.593782 0.00112
10 100000 2.5951 0.001152
As the data length is the same in all samples, the computational time for the CPU is nearly the same in each case. The number of threads, however, is not fixed, and increasing the number of threads has reduced the required GPU computational time.
Secondly we have done it for fixed number of blocks and variable vector
length and threads. Here the number of blocks is fixed to 16,384. Then the
result we obtained for this is shown in Table 4.2
Table 4.2: Addition for Fixed number of Blocks.
Number of Blocks Number of Threads CPU Time (ms) GPU Time (ms) Speed Up
16384 10 0.429609 0.383584 1.12
16384 20 0.936562 0.757408 1.24
16384 40 1.693876 1.501984 1.13
16384 80 3.366491 2.99184 1.13
16384 160 6.846616 5.96752 1.15
16384 320 13.551007 11.941888 1.13
16384 512 25.233357 20.006816 1.26
Here the data length is not the same for all samples. The data length equals the product of the number of blocks and the number of threads; as the number of blocks is fixed, the data length is proportional to the number of threads. So increasing the threads increases the data length, which in turn increases the computational time for both CPU and GPU. But it is clearly seen that the GPU computation is faster than the CPU's, and the GPU remains faster as more threads are added. The result can be visualized by the graph in Figure 4.1.
Thirdly, we fixed the number of threads and varied the vector length and the number of blocks. Here the number of threads is fixed at 512, and the result we obtained is shown in Table 4.3. Again the data length differs between samples: as the number of threads is fixed, the data length is proportional to the number of blocks, so increasing the blocks increases the data length, which increases the computational time for both CPU and GPU. But it is clearly seen that the GPU computation is much faster than the CPU's.
Table 4.3: Addition for Fixed number of Threads.
Number of Blocks Number of Threads CPU Time (ms) GPU Time (ms) Speed Up
1 512 0.001832 0.008801 0.21
2 512 0.004399 0.008896 0.49
4 512 0.005499 0.009286 0.59
8 512 0.01063 0.010592 1.01
16 512 0.020527 0.015584 1.32
32 512 0.067081 0.021376 3.13
64 512 0.081376 0.032576 2.49
128 512 0.170084 0.056448 3.01
256 512 0.330637 0.104352 3.17
512 512 0.680702 0.198752 3.42
1024 512 1.341612 0.32656 4.11
2048 512 2.781095 0.763872 3.64
4096 512 5.438085 1.51008 3.6
8192 512 11.367773 3.00992 3.77
16384 512 22.233357 6.006816 3.7
32768 512 47.950872 12.0025 4.01
65536 512 87.61068 23.99005 3.65
The result can be visualized by the graph in Figure 4.2, where the number of blocks is taken along the x axis and the computation time along the y axis. The graph clearly shows that the GPU is faster than the CPU. At first, when the number of blocks is small, the difference in computational time is small, as the data size is small and the computation is simple. As the number of blocks increases, the data size also increases, and the GPU becomes comparatively faster while the CPU slows down on the large data. Thus we have performed the addition operation faster and more efficiently using the GPU. This result was obtained for the multidimensional matrix addition operation, but it also holds for the other element-by-element operations such as multidimensional matrix subtraction and scalar multiplication.
Figure 4.2: Computational Time of Addition for Fixed number of Threads.
we have been successful in reducing the computation time to a great extent for multiplication.
Figure 4.3: Computational Time Graph for 4D Multiplication.
Table 4.5: Computational Time for 6D Multiplication.
The results are also shown in the graph in Figure 4.4, in the same way as before. It likewise indicates that the GPU is clearly faster than the CPU.
Thirdly, we implemented the operation for 8D square matrices with data lengths from 2 to 10. It also shows that the GPU completes the operation faster than the CPU. We can also see that as the dimension increases, the difference between the CPU and GPU computational times grows, i.e., the CPU becomes slower for high-dimensional data. So we can say that the higher the dimension, the greater the reduction in computation time on the GPU.
The obtained result is shown below in the graph in Figure 4.5, showing the computational time for CPU and GPU:
For 10D square matrix data, we have gone up to length 6 in all dimensions: because of the large amount of data (approximately 300 million elements), going further exceeds the capacity of our GPU. With a more powerful GPU we could overcome this limitation.
For 10D data the resulting graph is as follows:
Table 4.8: Speed Up For various Data Types.
Data type Size(Bytes) CPU time (ms) GPU Time (ms) Speed Up
short 2 52.21875 26.736223 1.953109
int 4 54.68471 26.392832 2.071953
long int 4 55.61866 26.632641 2.088365
long long int 8 60.72672 42.441345 1.430839
After plotting this data in a graph, taking the data type size along the x axis and the speed up along the y axis, we obtained a nearly flat curve. So the GPU works efficiently and faster irrespective of the data type.
4.2.2.3 R² Analysis
Operation R² Value
Addition 0.81
4D Multiplication 0.97
6D Multiplication 0.73
8D Multiplication 0.62
10D Multiplication 0.8
We can see from the table and graphs that the R² values are close to 1, so the speed ups we obtained are nearly linear. We could get more accurate results with more samples on a more powerful GPU. We can visualize the speed up results by plotting the data in a graph, which we call the speed up graph. The speed up graph for multidimensional data is shown below, and the average speed up for various dimensions can also be shown in a graph:
(a) Addition. (b) 4D Multiplication.
Chapter 5
Conclusion
5.0.1 Summary
Various research activities are being done on data science. Matrix mathematics is one of the most significant sectors in this field, and various areas of computer science rely on multidimensional matrix operations. We have seen that performing these operations on higher-dimensional matrices on the CPU is computationally much slower than on the GPU. Thus we have proposed methods for computing multidimensional matrix operations efficiently on GPU.
For the addition and subtraction of two multidimensional matrices, we linearized each element of each matrix into a 1-D array and then performed the respective operation on a per-element basis; for scalar multiplication the process is the same. For the multiplication of multidimensional matrices, we developed a method by which we can pair the appropriate elements of the two multidimensional matrices following the traditional multiplication rule, with each single thread computing one element of the product matrix.
Performing multidimensional matrix operations on the CPU is less efficient; performing the same operations on the GPU makes them faster and more efficient. This speed up will surely help in various areas of data science and high performance computing.
In our thesis, we have used the CUDA platform developed by NVIDIA together with standard C functions. CUDA C supports heterogeneous computation with a processing flow that provides data parallelism.
5.0.2 Limitations
Our thesis has some limitations. While dealing with higher-dimensional matrices, 10-D and above, we faced garbage outputs for the GPU computation times: as the number of elements becomes very large (approximately 300 million), the data goes beyond the capacity of the GPU. A higher-performance GPU is required to perform operations on higher-dimensional matrices efficiently.
References
[3] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd Ed.).
Johns Hopkins University Press, Baltimore, MD, USA, 1996.
part 4 of 6. Lecture Notes in Engineering and Computer Science, 2185,
06 2010.
[10] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and ap-
plications. SIAM REVIEW, 51(3):455–500, 2009.
[12] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and
Matus Telgarsky. Tensor decompositions for learning latent variable
models. Journal of Machine Learning Research, 15:2773–2832, 2014.
[13] Parikshit Shah, Nikhil Rao, and Gongguo Tang. Sparse and low-
rank tensor decomposition. In C. Cortes, N. D. Lawrence, D. D. Lee,
M. Sugiyama, and R. Garnett, editors, Advances in Neural Informa-
tion Processing Systems 28, pages 2548–2556. Curran Associates, Inc.,
2015.
[16] S. Acer, T. Torun, and C. Aykanat. Improving medium-grain partitioning
for scalable sparse tensor decomposition. IEEE Transactions on Parallel
Distributed Systems, 29(12):2814–2825, Dec. 2018.
[17] Brett W. Bader and Tamara G. Kolda. Algorithm 862: MATLAB tensor
classes for fast algorithm prototyping. ACM Transactions on Mathemat-
ical Software, 32(4):635–653, December 2006.
[19] Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and
Saman Amarasinghe. The tensor algebra compiler. Proc. ACM Program.
Lang., 1(OOPSLA):77:1–77:29, October 2017.
[20] Mijung Kim and K. Selçuk Candan. Tensordb: In-database tensor ma-
nipulation with tensor-relational query plans. In Proceedings of the 23rd
ACM International Conference on Conference on Information and Knowl-
edge Management, CIKM ’14, pages 2039–2041, New York, NY, USA,
2014. ACM.
[21] Grey Ballard, Tamara G. Kolda, and Todd Plantenga. Efficiently computing tensor eigenvalues on a GPU. Pages 1340–1348, May 2011.