I. INTRODUCTION
Matrix multiplication is one of the basic operations in linear algebra. It is not the same as single-number multiplication, but closer to multiplying vectors, and many algorithms have been proposed to perform this operation as efficiently as possible. The naive algorithm has been used for ages because of its simplicity and accuracy; it captures the basic concept of matrix multiplication and runs in O(n^3) time. In 1969, however, Strassen invented a new algorithm with lower complexity than the naive one [1], [2]. Using a trial-and-error process and relying primarily on the Divide and Conquer (DnC) concept, his algorithm runs in only O(n^{log_2 7}) time. It was a slight improvement in matrix multiplication, but it later became very useful in computer science for computing huge matrix problems. Since then, scholars have believed that an O(n^2)-time matrix multiplication algorithm may exist that has not yet been invented [3], [4], [5].
A hybrid computation system combining a Graphics Processing Unit (GPU) and a Central Processing Unit (CPU) has been used to increase the performance of operations across an entire PC, and also to support research activities in technology. In fact, the concept of parallel computing was first found in
978-1-4799-1078-6/13/$31.00 ©2013 IEEE
Consider the matrix multiplication

C = AB, \quad A, B, C \in \mathbb{R}^{2^n \times 2^n}    (1)

with

A = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix}, \quad B = \begin{pmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{pmatrix}    (2)

C = \begin{pmatrix} C_{1,1} & C_{1,2} \\ C_{2,1} & C_{2,2} \end{pmatrix}    (3)

where each sub-matrix has size 2^{n-1} \times 2^{n-1}. With Strassen's seven products M_1, \ldots, M_7 [1], [2], the result blocks are then

C_{1,1} = M_1 + M_4 - M_5 + M_7    (4)

C_{1,2} = M_3 + M_5    (5)

C_{2,1} = M_2 + M_4    (6)

C_{2,2} = M_1 - M_2 + M_3 + M_6    (7)

where t_m(n) and t_a(n) denote the number of multiplications and the number of additions, respectively. The threshold of the multiplication can then be obtained from equation (8).
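The block recombination above can be sketched in ordinary Python with NumPy (a one-level illustration of Strassen's scheme, not the paper's CUDA implementation; the helper name `strassen_step` is ours):

```python
import numpy as np

def strassen_step(A, B):
    """One level of Strassen's divide and conquer: seven half-size
    multiplications instead of eight, then block recombination."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    # Strassen's seven products
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    # recombination of the result blocks
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])
```

Applying the step to any even-sized matrices reproduces the ordinary product, which is the identity the derivation above establishes.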
procedure MATRIX-POWER(A, p)                  ▷ iterative version of DnC
    n ← rows[A]
    let ans be the n × n identity matrix
    while p > 0 do
        if (p & 1) then                       ▷ check if p is odd
            ans ← STRASSEN-MULTIPLICATION(ans, A)
        end if
        A ← STRASSEN-MULTIPLICATION(A, A)     ▷ square the matrix
        p ← p ≫ 1                             ▷ divide p by 2
    end while
    return ans
end procedure
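The same iterative squaring routine can be sketched in Python (NumPy's matrix product stands in for STRASSEN-MULTIPLICATION here; note that the accumulator must start as the identity matrix):

```python
import numpy as np

def matrix_power(A, p):
    """Compute A**p with O(log p) multiplications by iterative squaring."""
    ans = np.eye(A.shape[0], dtype=A.dtype)  # identity: neutral element
    while p:
        if p & 1:          # p is odd: fold the current power into the answer
            ans = ans @ A
        A = A @ A          # square the matrix
        p >>= 1            # divide p by 2
    return ans
```

For example, raising the Fibonacci matrix [[1, 1], [1, 0]] to the 5th power yields [[8, 5], [5, 3]], using only three squarings instead of four sequential multiplications.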
We create special kernels to reduce the arithmetic operations in calculating sub-matrices C1,1 and C2,2 (countC11) and sub-matrices C1,2 and C2,1 (countC22). Figs. 4 and 5 show the two kernels, respectively. Each consists of three basic operations to reduce the running time. The countC11 kernel is a recursive process for calculating p1, p4, p5, and p7, while the countC22 kernel calculates p1, p2, p3, and p6. The arithmetic operations are written row by row to ensure correct parenthesization.
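The fusion idea can be sketched as a single elementwise pass (a hypothetical NumPy analogue of a countC11-style kernel, not the authors' CUDA code; `count_c11` is our name):

```python
import numpy as np

def count_c11(M1, M4, M5, M7):
    """Fused elementwise combination C11 = M1 + M4 - M5 + M7.
    One pass over the operands instead of three separate
    add/subtract kernel launches and temporaries."""
    out = np.empty_like(M1)
    # in a CUDA kernel each thread would handle one (i, j) element;
    # here the vectorized in-place operations play that role
    np.add(M1, M4, out=out)
    np.subtract(out, M5, out=out)
    np.add(out, M7, out=out)
    return out
```

Fusing the additions this way avoids materializing intermediate sub-matrices, which matters on a device where every temporary competes for the same limited global memory.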
B. Implementation and Results
TABLE I
EXECUTION TIME COMPARISON

Matrix Size    Naive (s)        Strassen (s)
2              0.37             0.37
4              0.37             0.37
8              0.32             0.32
16             0.31             0.36
32             0.34             0.37
64             0.34             0.36
128            0.35             0.38
256            0.34             0.37
512            0.34             0.37
1,024          0.36             0.31
2,048          0.42             0.69
4,096          0.77             0.51
8,192          2.24             2.04
16,384         Out of Memory    15.55
32,768         Out of Memory    21.42

The four sub-matrices C1,1, C1,2, C2,1, and C2,2 are computed with the add, subtract, countC11, and countC22 kernels. Finally, the four sub-matrices are joined into one complete result, matrix C.
We compare the running times for different matrix sizes, as shown in Table I. Two programs are compared: the naive and the Strassen implementations. The two programs produce almost the same result for sizes 2^n < 4,096; their running times are below 1 second, and below 0.5 second in some cases. This shows that the divide-and-conquer method makes little difference there, because the matrix size is mostly below the threshold value. The threshold for this experiment is 256, since setting the threshold below that value makes the Strassen process slower, and matrices below that size do not strain the device memory allocation. Only matrices of size 2^n × 2^n with n ≥ 1 were tested. The naive approach failed to compute the 16,384 × 16,384 matrix due to lack of memory, while Strassen still succeeded. This is the breakthrough of the memory limitation.

V. RESULT ANALYSIS
Through all phases of this research, titled "Breaking through memory limitation in GPU parallel processing with Strassen algorithm", the main conclusion is that we can make the GPU compute matrix multiplications that need more memory
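The threshold scheme used in the experiment (recurse with Strassen only above size 256, fall back to the plain multiply below it) can be sketched as follows; this is our illustrative CPU sketch, with NumPy's `@` standing in for the naive GPU kernel:

```python
import numpy as np

THRESHOLD = 256  # below this size the plain multiplication wins

def hybrid_strassen(A, B):
    """Strassen recursion with a cutoff to the ordinary product."""
    n = A.shape[0]
    if n <= THRESHOLD:
        return A @ B  # small blocks: naive multiply is faster
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = hybrid_strassen(A11 + A22, B11 + B22)
    M2 = hybrid_strassen(A21 + A22, B11)
    M3 = hybrid_strassen(A11, B12 - B22)
    M4 = hybrid_strassen(A22, B21 - B11)
    M5 = hybrid_strassen(A11 + A12, B22)
    M6 = hybrid_strassen(A21 - A11, B11 + B12)
    M7 = hybrid_strassen(A12 - A22, B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])
```

Each recursion level replaces one full-size multiplication with seven half-size ones, which is also what lets the computation fit in device memory: only half-size operands need to be resident at a time.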
TABLE II
NVIDIA PROFILE RESULTS OF NAIVE ALGORITHM

Matrix Size                     4,096    8,192
Duration Max (ms)               167.9    1,219.0
Duration Min (µs)               1.3      1.3
Duration Avg (ms)               74.0     340.2
Compute Utilization (%)         0.9      1.6
Compute/Memcpy Efficiency       0.6      1.5
Memcpy Throughput Avg (GB/s)    0.9      1.2
TABLE III
NVIDIA PROFILE RESULTS OF STRASSEN ALGORITHM

Matrix Size                     4,096    8,192    16,384
Duration Max (ms)               169.1    168.9    166.4
Duration Min (µs)               1.3      1.3      1.3
Duration Avg (ms)               61.3     62.1     61.5
Compute Utilization (%)         0.9      0.9      0.5
Compute/Memcpy Efficiency       0.8      0.8      0.8
Memcpy Throughput Avg (GB/s)    1.3      1.2      1.3
VI. CONCLUSION
This paper concludes that the maximum size of data samples processed by the Strassen algorithm reaches 32,768 for a (2^n × 2^n) matrix, which is larger than the 8,192 that the naive algorithm could process. Also, for all attempts on matrices larger than 2,048, the running times are faster, by about 0.04 to 0.2 seconds in the worst-case scenario, and outperform those of the naive algorithm.
We realize that further investigation should be performed, as we did not properly analyse the memory footprint on the GPGPU, nor did we introduce an innovative technique to manage unlimited memory by sending chunks of the matrix back and forth between DRAM and the GPGPU memory segment without reducing performance.

ACKNOWLEDGMENT
This research is partially funded by Research Grant Universitas Pelita Harapan, Contract No. P-001-FIK/X/2012, "CUDA Parallel Computation: a Case Study of Empirical Orthogonal Function In Climatology".
We would like to thank the three anonymous reviewers for providing us with constructive comments and suggestions.

REFERENCES
[1] M. Cvet, "Strassen's Algorithm: Theory vs. Application," http://mikecvet.wordpress.com/2010/04/17/strassens-algorithm-theory-vs-application-part-2/; as seen 18 March 2013.
[2] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms, Second Edition, Cambridge: McGraw Hill, 2003.
[3] D.C. Lay, Linear Algebra and Its Applications, United States: Pearson Education, Inc., 2006.
[4] E.W. Weisstein, "Matrix Multiplication," from MathWorld, a Wolfram Web Resource, http://mathworld.wolfram.com/MatrixMultiplication.html; as seen 23 Feb. 2013.
[5] R. Johnsonbaugh and M. Schaefer, Algorithms, United States: Pearson Education, Inc., 2004.
[6] I. Foster, "11.4 Mergesort," 1995, http://www.mcs.anl.gov/~itf/dbpp/text/node127.html; as seen 4 May 2013.
[7] B. Barney, Introduction to Parallel Computing, United States: Livermore Computing, 2012.
[8] S. Halim and F. Halim, Competitive Programming 2: This Increases the Lower Bound of Programming Contests. Again., International Olympiad in Informatics, 2011.
[9] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Boston: Pearson Education, Inc., 2011.
[10] G. Ruetsch and B. Oster, Getting Started with CUDA, NVISION 08: the world of visual computing, NVIDIA Corporation, 2008.
[11] J. Li, S. Ranka, and S. Sahni, "Strassen's Matrix Multiplication on GPUs," Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, pp. 157-164, 7-9 Dec. 2011.