
2013 International Conference on Computer, Control, Informatics and Its Applications

Breaking Through Memory Limitation in GPU Parallel Processing Using Strassen Algorithm

Pujianto Yugopuspito, Sutrisno, and Robertus Hudi
Informatics Department, Universitas Pelita Harapan, Tangerang, Indonesia
pujianto.yugopuspito@uph.edu, sutrisno.fik@uph.edu, robertus.hudi@uph.edu
Abstract—Matrix multiplication is one of the basic operations in linear algebra and is widely used in computer science. For ages it has been carried out with the naive algorithm, which has a standard complexity of O(n^3). Much research has been conducted to find more efficient and effective algorithms for this operation, until Strassen presented one that beats the naive complexity with only O(n^2.8074). The basic concept of this algorithm is divide and conquer (DnC), here with an adjustment to break the memory limitation in GPU computation. The idea for overcoming this limit is to combine CPU and GPU processing in an implementation of the Strassen algorithm. This paper shows the result of breaking through the GPU memory limitation, checks the accuracy of the algorithm, compares the running times, and, most importantly, shows that this algorithm can process bigger matrices than the naive one. Through the charts and tables created from each performance run, it is shown that the maximum data size processed by the Strassen algorithm reaches 32,768 for (2^n × 2^n) matrices, which is bigger than the 8,192 the naive algorithm could process. Also, for all attempts on matrices larger than 2,048, the running times are much faster, about 0.04 seconds to 0.2 seconds for the worst case scenario, and overcome those of the naive algorithm.
Index Terms—GPU, CUDA Programming, Strassen Algorithm, Matrix Multiplication

I. INTRODUCTION
Matrix multiplication is one of the basic operations in linear algebra. It is not the same as single-number multiplication, but closer to multiplying vectors. Many algorithms have therefore been proposed to carry out this operation in the most effective way. The naive algorithm has been used for ages because of its simplicity and accuracy; it follows directly from the definition of matrix multiplication and has O(n^3) running time. In 1969, however, Strassen invented a new algorithm with lower complexity than the naive one [1], [2]. Using a trial and error process and relying primarily on the divide and conquer (DnC) concept, his algorithm runs in only O(n^(log2 7)). It is a slight improvisation of the matrix multiplication algorithm, but it later became very useful in computer science for computing huge matrix problems. Since then, scholars have believed that an O(n^2) running time algorithm for matrix multiplication may exist but has not yet been discovered [3], [4], [5].
A hybrid computation system combining a Graphics Processing Unit (GPU) and a Central Processing Unit (CPU) has been used to increase the performance of operations on a whole PC, and also to support research activities in technology.

In fact, the concept of parallel computing was first found in GPU architectures, and research then turned to how to optimize the performance of multi-core CPUs and make the two work together [6], [7]. The CPU has long been known for its serial core processing, while the GPU, originally intended for graphics processing, has long been equipped with multi-core arithmetic units and other cores. One issue that limits GPU performance, however, is memory; to overcome this problem, research on hybrid systems was initiated [8]. To meet its purpose, the right algorithm must be selected and applied to the right problem.
The Compute Unified Device Architecture (CUDA) library lets us compute problems on the GPU, and it provides the Basic Linear Algebra Subroutines (BLAS), which contain kernels to process matrices [9], [10]. However, the kernel provided there for matrix multiplication has the same running time as the naive algorithm, which leads to the idea that another algorithm can be implemented on top of this hybrid system to optimize its performance. The Strassen algorithm is chosen to overcome the memory limitation issue in this parallel environment because its basic concept is to divide the main problem into smaller sub-problems, and a threshold can be set to make sure the memory can hold the data at its peak performance, even if the data are larger than the memory can hold.
The content of this paper is divided into six parts: the first is this introduction, followed by related works in Section II. Section III explores matrix multiplication with the Strassen algorithm and its optimization for computing the power of a matrix, with further explanation of the CUDA library and programming environment. The implementation of the Strassen algorithm in the CUDA programming model and the results are provided in Section IV, while Sections V and VI present the analysis of the results and the conclusion.
II. RELATED WORKS
At the beginning of this research we had not read any references concerning the implementation of the Strassen algorithm in the CUDA environment. Thanks to an anonymous reviewer, a very similar work by Junjie Li et al. [11] was pointed out. They discussed implementations of Strassen's algorithm as well as of Winograd's variant on an NVIDIA C1060 GPU.
This research only concerns the comparison of execution times between the naive and Strassen algorithms, and shows the ability to execute as the breakthrough of the memory limitation. Furthermore, we explored the size of the memory limitation on an NVIDIA GTX 660


graphics card [12]. The graphics card has 2,048 MB of memory and runs CUDA 5.0 on the Ubuntu 12.04 operating system. Some useful information on CUDA programming can be found in [13], [14].

III. MATRIX OPERATION AND STRASSEN ALGORITHM

A. Matrix Multiplication and Strassen Algorithm

Volker Strassen published the Strassen algorithm in 1969, based on a divide and conquer strategy. Let A, B be two square matrices over a ring R. The objective is to calculate the matrix product C as (1).

C = AB,   A, B, C ∈ R^(2^n × 2^n)   (1)

If the matrices A, B are not of type 2^n × 2^n, the missing rows and columns are filled with zeros; in this experiment, however, we do not deal with that case. A, B, and C are partitioned into equally sized block matrices such that

A = | A1,1  A1,2 |   B = | B1,1  B1,2 |   C = | C1,1  C1,2 |   (2)
    | A2,1  A2,2 |       | B2,1  B2,2 |       | C2,1  C2,2 |

with

A_i,j, B_i,j, C_i,j ∈ R^(2^(n-1) × 2^(n-1))   (3)

then

C1,1 = A1,1 B1,1 + A1,2 B2,1
C1,2 = A1,1 B1,2 + A1,2 B2,2
C2,1 = A2,1 B1,1 + A2,2 B2,1
C2,2 = A2,1 B1,2 + A2,2 B2,2   (4)

In this construction 8 multiplications are still needed to calculate the C_i,j matrices. In order to reduce the number of multiplications, the following new matrices are defined:

M1 = (A1,1 + A2,2)(B1,1 + B2,2)
M2 = (A2,1 + A2,2) B1,1
M3 = A1,1 (B1,2 - B2,2)
M4 = A2,2 (B2,1 - B1,1)
M5 = (A1,1 + A1,2) B2,2
M6 = (A2,1 - A1,1)(B1,1 + B1,2)
M7 = (A1,2 - A2,2)(B2,1 + B2,2)   (5)

Now, using only the above 7 multiplications, the C_i,j can be expressed in terms of the M_k as follows:

C1,1 = M1 + M4 - M5 + M7
C1,2 = M3 + M5
C2,1 = M2 + M4
C2,2 = M1 - M2 + M3 + M6   (6)

This partition process can be applied recursively until the sub-matrices degenerate into numbers. The number of additions and multiplications required in the Strassen algorithm can be calculated as follows. Let f(n) be the number of operations for a 2^n × 2^n matrix. Recursive application of the Strassen algorithm gives f(n) = 7 f(n-1) + l·4^n for some constant l, which depends on the number of additions performed at each application of the algorithm. Hence the asymptotic complexity for multiplying matrices of size N = 2^n with the Strassen algorithm is O([7 + o(1)]^n) = O(N^(log2 7 + o(1))) ≈ O(N^2.8074). The reduction in the number of operations, however, comes at the price of somewhat reduced numerical stability, and the algorithm also requires significantly more memory than the naive algorithm: both initial matrices must have their dimensions expanded to the next power of 2, which results in storing up to four times as many elements, and the seven auxiliary matrices each contain a quarter of the elements of the expanded ones. The arithmetic complexity of the algorithm is

t_m(n) = n^(log2 7),   t_a(n) = 6 n^(log2 7) - 6 n^2   (7)

where t_m(n) and t_a(n) denote the number of multiplications and the number of additions respectively. The running time of the multiplication then satisfies the recurrence

T(n) = 7 T(n/2) + Θ(n^2)   (8)

This recurrence solves to Θ(n^(log2 7)), which is asymptotically faster than naive matrix multiplication.
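The correctness of (5) and (6) is easy to check numerically. The following small C program (an illustration added here, not part of the paper's implementation) compares the seven-product recombination against the naive block products of (4) for the scalar base case:

#include <stdio.h>

int main(void) {
    /* arbitrary example blocks, here scalars (the base case of the recursion) */
    float A11 = 1, A12 = 2, A21 = 3, A22 = 4;
    float B11 = 5, B12 = 6, B21 = 7, B22 = 8;

    /* the seven Strassen products, eq. (5) */
    float M1 = (A11 + A22) * (B11 + B22);
    float M2 = (A21 + A22) * B11;
    float M3 = A11 * (B12 - B22);
    float M4 = A22 * (B21 - B11);
    float M5 = (A11 + A12) * B22;
    float M6 = (A21 - A11) * (B11 + B12);
    float M7 = (A12 - A22) * (B21 + B22);

    /* recombination, eq. (6), versus the naive block products, eq. (4) */
    printf("C11: %.0f vs %.0f\n", M1 + M4 - M5 + M7, A11 * B11 + A12 * B21);
    printf("C12: %.0f vs %.0f\n", M3 + M5,           A11 * B12 + A12 * B22);
    printf("C21: %.0f vs %.0f\n", M2 + M4,           A21 * B11 + A22 * B21);
    printf("C22: %.0f vs %.0f\n", M1 - M2 + M3 + M6, A21 * B12 + A22 * B22);
    return 0;
}

Each line prints the same value twice (19, 22, 43 and 50 for this input), confirming that the recombination reproduces the naive products with seven multiplications instead of eight.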
B. The Power of Matrix
Matrix multiplication can be applied to many cases, and one of them is computing the power of a matrix. In this research environment, however, every matrix used must have size exactly 2^n × 2^n, and this overview is restricted to that limitation. In matrix form, [A]^0 = I and [A]^p = [A]·[A]·...·[A] (p factors), where I is the identity matrix and p is the requested power of the square matrix A. Analytically, the naive way of computing the power of a matrix performs Θ(p) multiplications of Θ(n^3) each. There is a better way to compute this problem that uses the divide and conquer principle. We can express [A]^p with two base cases, [A]^0 = I and [A]^1 = [A], and the rules [A]^p = [A]^(p-1) [A] if p is odd, and [A]^p = ([A]^(p/2))^2 if p is even. As this approach keeps halving the value of p, it needs only Θ(log p) matrix multiplications.


procedure MATRIX-POWER(A, p)
    n ← rows[A]
    let ans be the n × n identity matrix
    while p do                                     ▷ iterative version of DnC
        if (p & 1) then                            ▷ check if p is odd
            ans = STRASSEN-MULTIPLICATION(ans, A);
        end if
        A = STRASSEN-MULTIPLICATION(A, A);         ▷ square the matrix
        p >>= 1;                                   ▷ divide p by 2
    end while
    return ans;
end procedure

Fig. 1. Algorithm Power of Matrix

__global__ void add(float *a, float *b, float *c, int size) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < (size * size)) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

Fig. 2. add kernel.


__global__ void substract(float *a, float *b, float *c, int size) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < (size * size)) {
        c[tid] = a[tid] - b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

Fig. 3. substract kernel.

For example, by definition 2^9 = 2·2·2·2·2·2·2·2·2, which takes Θ(p) multiplications; but with divide and conquer, 2^9 = 2^8 · 2 = (2^4)^2 · 2 = ((2^2)^2)^2 · 2, which takes only Θ(log p) multiplications. The divide and conquer scheme for calculating the power of a matrix is given as pseudo-code in Fig 1, where the matrix multiplications inside it are done with the Strassen algorithm.
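As an aside, the same halving idea on plain integers makes the bit operations p & 1 and p >>= 1 of Fig 1 concrete. The following is a minimal C sketch added for illustration (fast_pow is a hypothetical name, not from the paper); in the matrix version every multiplication below becomes a call to STRASSEN-MULTIPLICATION:

#include <stdio.h>

long long fast_pow(long long x, unsigned int p) {
    long long ans = 1;        /* identity element (the identity matrix in Fig. 1) */
    while (p) {
        if (p & 1)            /* p is odd: fold one factor into the answer */
            ans *= x;
        x *= x;               /* square the base */
        p >>= 1;              /* halve the exponent */
    }
    return ans;
}

int main(void) {
    printf("2^9 = %lld\n", fast_pow(2, 9));   /* prints 512 after only 6 multiplications */
    return 0;
}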
IV. STRASSEN IMPLEMENTATION ON CUDA ARCHITECTURE
A. Basic Function and Kernels
Several basic kernels make up the implementation of the Strassen algorithm on the CUDA architecture in this research. The cublasSgemm kernel, adopted from the CUBLAS library, performs parallel multiplication of two matrices or vectors.

void cublasSgemm(char transa, char transb, int m, int n, int k,
                 float alpha, const float *A, int lda,
                 const float *B, int ldb,
                 float beta, float *C, int ldc)



This kernel has to meet the size conditions where matrix A is m × k, matrix B is k × n, and the result matrix C is m × n. The character-typed variables transa and transb refer to matrices A and B respectively. These two variables can be set in two ways: CUBLAS_OP_N for the normal case and CUBLAS_OP_T for a transposed matrix. lda, ldb, and ldc represent the leading memory dimensions reserved on the GPU for the experiment; in this setup their values mean that only square matrices are accepted.
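For orientation, a call for the square case used throughout this paper might look as follows. This is a sketch under assumptions: it uses the CUBLAS v2 interface (cublas_v2.h), which takes a handle and scalar pointers and thus differs slightly from the legacy prototype quoted above, and the function and pointer names are ours, not the authors'.

#include <cublas_v2.h>

/* Hypothetical helper: C = A * B for n x n matrices already on the device,
 * in CUBLAS' column-major convention. The handle is created elsewhere with
 * cublasCreate(). */
void gemm_square(cublasHandle_t handle, int n,
                 const float *d_A, const float *d_B, float *d_C)
{
    const float alpha = 1.0f;   /* C = 1.0 * A * B + 0.0 * C */
    const float beta  = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_A, n, d_B, n,
                &beta,  d_C, n);
}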
The add kernel performs the addition of sub-matrices (Z = X + Y) required for the multiplications, as shown in Fig 2. The kernel computes the index into GPU memory in the integer variable tid and is executed in parallel on the GPU; the grid-stride increment blockDim.x * gridDim.x corresponds to the block and grid dimensions given at the kernel call in the main program, reserved for the matrix problem dimensions. This kernel is invoked repeatedly inside the recursive computation of the sub-products Pi.
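A launch of this kernel might look like the following sketch (our illustration, not the authors' code; the helper name, thread count and buffer names are assumptions). Because of the grid-stride loop, any launch shape covers the whole size*size array; the subtraction kernel described next is launched the same way.

/* Hypothetical wrapper: Z = X + Y for two size x size matrices already on
 * the device. Assumes the add kernel of Fig. 2 is defined in the same file. */
void launch_add(float *d_X, float *d_Y, float *d_Z, int size)
{
    int threads = 256;                                    /* illustrative block size */
    int blocks  = (size * size + threads - 1) / threads;  /* one thread per element  */
    add<<<blocks, threads>>>(d_X, d_Y, d_Z, size);
    cudaDeviceSynchronize();   /* wait for the element-wise addition to finish */
}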
The subtract kernel performs the subtraction of sub-matrices (Z = X - Y), as shown in Fig 3. Like the add kernel, it computes the GPU memory index in the integer variable tid and is executed in parallel on the GPU, with the grid-stride increment blockDim.x * gridDim.x covering the matrix problem dimensions.
We also create special kernels that fuse the arithmetic operations for calculating sub-matrix C1,1, called countC11, and sub-matrix C2,2, called countC22. Fig 4 and Fig 5 show both kernels respectively. Each fuses three basic element-wise operations to reduce running time: the countC11 kernel combines the recursively computed sub-products p1, p4, p5 and p7, while the countC22 kernel combines p1, p2, p3 and p6. The arithmetic operations are written row by row to ensure correct parenthesization.

B. Implementation and Results

Fig 6 shows the algorithm of the Strassen implementation in a general programming environment. In this research the applied routine differs slightly in its use of functions and kernels. First we set a barrier variable called the threshold. The routine then has two main parts. The first handles the case where the sub-matrix size is not larger than the threshold: the sub-matrix is processed with the basic naive algorithm, since it is still possible to reserve that much memory. The part after the first else is the core of the Strassen idea: it halves the sub-matrix size (n/2), calculates the Pi recursively with the addition and subtraction kernels, and then combines the sub-matrices into the four result parts

__global__ void countC11(float *a, float *b, float *c,
                         float *d, float *result, int size) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < (size * size)) {
        result[tid] = a[tid] + b[tid];
        result[tid] -= c[tid];
        result[tid] += d[tid];
        tid += blockDim.x * gridDim.x;
    }
}

Fig. 4. C11 kernel.


__global__ void countC22(float *a, float *b, float *c,
                         float *d, float *result, int size) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < (size * size)) {
        result[tid] = a[tid] - b[tid];
        result[tid] += c[tid];
        result[tid] += d[tid];
        tid += blockDim.x * gridDim.x;
    }
}

Fig. 5. C22 kernel.

TABLE I
EXECUTION TIME COMPARISON

Matrix Size    Naive (s)          Strassen (s)
2              0.37               0.37
4              0.37               0.37
8              0.32               0.32
16             0.31               0.36
32             0.34               0.37
64             0.34               0.36
128            0.35               0.38
256            0.34               0.37
512            0.34               0.37
1,024          0.36               0.31
2,048          0.42               0.69
4,096          0.77               0.51
8,192          2.24               2.04
16,384         Out of Memory      15.55
32,768         Out of Memory      21.42


procedure STRASSEN-MULTIPLICATION(A, B, size)
    n ← rows[A]
    let C be n × n matrix
    if size ≤ threshold then
        C = A × B;                                        ▷ using standard algorithm
    else
        newSize = size / 2;
        initialize submatrices with newSize × newSize dimension;
        partition A into four submatrices A11, A12, A21, A22;
        partition B into four submatrices B11, B12, B21, B22;
        p1 = Strassen((A11 + A22), (B11 + B22), newSize);
        p2 = Strassen((A21 + A22), B11, newSize);
        p3 = Strassen(A11, (B12 - B22), newSize);
        p4 = Strassen(A22, (B21 - B11), newSize);
        p5 = Strassen((A11 + A12), B22, newSize);
        p6 = Strassen((A21 - A11), (B11 + B12), newSize);
        p7 = Strassen((A12 - A22), (B21 + B22), newSize);
        C11 = p1 + p4 - p5 + p7;
        C12 = p3 + p5;
        C21 = p2 + p4;
        C22 = p1 - p2 + p3 + p6;
    end if
    return C;
end procedure

Fig. 6. Algorithm Strassen

C1,1, C1,2, C2,1, and C2,2 using the add, subtract, countC11, and countC22 kernels. Finally, the four sub-matrices are joined into one complete result, matrix C.
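For instance (an illustrative fragment we add here; the buffer names and launch shape are assumptions, not taken from the paper), forming C1,1 from the four sub-products already resident on the device is a single launch of the kernel in Fig 4, with the argument order following (6):

/* Hypothetical helper: C11 = p1 + p4 - p5 + p7, all buffers being device
 * arrays of newSize*newSize floats; countC11 is the kernel of Fig. 4. */
void combine_c11(float *d_p1, float *d_p4, float *d_p5, float *d_p7,
                 float *d_c11, int newSize)
{
    int threads = 256;
    int blocks  = (newSize * newSize + threads - 1) / threads;
    /* argument order a, b, c, d maps to the signs +, +, -, + inside the kernel */
    countC11<<<blocks, threads>>>(d_p1, d_p4, d_p5, d_p7, d_c11, newSize);
    cudaDeviceSynchronize();
}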
We compare the running times for different matrix sizes, as shown in Table I. Two programs are compared, the naive one and the Strassen one. They display almost the same results for sizes 2^n < 4,096: the running times are under 1 second, and below 0.5 second in most cases. This shows that the divide and conquer method makes little difference there, because the matrix size is mostly below the threshold value. The threshold for this experiment is 256, since setting it below that value makes the Strassen process slower, and matrices below that size do not trouble the device memory allocation. Only matrices of size 2^n × 2^n with n ≥ 1 are experimented on. The naive approach fails to compute the 16,384 × 16,384 matrix due to lack of memory, while Strassen still survives. This is the breakthrough of the memory limitation.
V. RESULT ANALYSIS
Through all phases of this research, titled Breaking Through Memory Limitation in GPU Parallel Processing Using Strassen Algorithm, the main conclusion is that we can make the GPU compute matrix multiplications that need more memory than its architecture provides.


By applying Strassen, the hybrid environment only sends to the GPU the part of the problem that it can handle effectively. The results satisfy the expectation that the program with the Strassen algorithm handles the problem better than the naive one. The first program, which uses the naive algorithm to compute matrix multiplication, meets its limit when an out-of-memory error message appears. This could be overcome by applying a divide and conquer strategy, but first the problem must be analyzed so that the solution is applied in the right place. The naive system executes the matrix multiplication with the raw method and does not divide the problem into smaller sub-problems, so the memory copied into device memory is exactly the original problem, in fact multiplied by three, because three matrices must be reserved: the two matrices being multiplied and one for the result. In this research, program one displays the memory error message at matrix sizes bigger than (8,192 × 8,192).
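A back-of-the-envelope calculation (our illustration, assuming 4-byte single-precision elements and the 2,048 MB reported for the GTX 660) makes this limit visible:

#include <stdio.h>

int main(void) {
    const double device_mb = 2048.0;                     /* NVIDIA GTX 660 memory */
    for (long n = 4096; n <= 32768; n *= 2) {
        /* A, B and C kept on the device at the same time, dense floats */
        double mb = 3.0 * n * n * sizeof(float) / (1024.0 * 1024.0);
        printf("%6ld x %-6ld -> %8.0f MB %s\n", n, n, mb,
               mb > device_mb ? "(exceeds device memory)" : "(fits)");
    }
    return 0;
}

Three 8,192 × 8,192 float matrices need only 768 MB, while three 16,384 × 16,384 matrices already need 3,072 MB, which no longer fits on the card.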
Meanwhile, the other program, which uses the Strassen implementation for matrix multiplication on the hybrid CPU-GPU system, provides a different result. The efficiency of memory usage is increased and, although the growth of the running time is unstable, it still works at higher matrix sizes. The divide and conquer strategy lets us set a threshold, a barrier that won't reserve any memory on the GPU unless the piece fits within what the device memory can process. The result is quite satisfying: with the threshold set to 256 (that is, 2^8), the program shows comparable or better running times up to the (1,024 × 1,024) matrix multiplication, and it reaches the (16,384 × 16,384) matrix size, outperforming the naive program and still working correctly, although the running time starts to grow much larger. It also works at the (32,768 × 32,768) matrix size, where the trial and error stops because it has already been proved that the Strassen algorithm can do better than the original one.
Further analysis is conducted with the NVIDIA Visual Profiler [15], as suggested. The aim of this profiling is to compare the kernel memory usage of the naive versus the Strassen method. According to the current results, the timeline profiling is the most meaningful information, given the low compute utilization of our problem.

TABLE II
NVIDIA PROFILE RESULTS OF NAIVE ALGORITHM

Matrix Size                     4,096     8,192
Duration Max (ms)               167.9     1,219.0
Duration Min (µs)               1.3       1.3
Duration Avg (ms)               74.0      340.2
Compute utilization (%)         0.9       1.6
Compute/Memcpy Efficiency       0.6       1.5
Memcpy Throughput Avg (GB/s)    0.9       1.2

TABLE III
NVIDIA PROFILE RESULTS OF STRASSEN ALGORITHM

Matrix Size                     4,096     8,192     16,384
Duration Max (ms)               169.1     168.9     166.4
Duration Min (µs)               1.3       1.3       1.3
Duration Avg (ms)               61.3      62.1      61.5
Compute utilization (%)         0.9       0.9       0.5
Compute/Memcpy Efficiency       0.8       0.8       0.8
Memcpy Throughput Avg (GB/s)    1.3       1.2       1.3

So the characteristics of the currently observed problem are low compute utilization (below 2%), low compute/memcpy efficiency (at most 1.5), low memcpy/compute overlap (almost 0%), and low memcpy throughput. Table II and Table III show the important results for the 4,096, 8,192, and 16,384 matrix sizes. Kernel concurrency across the multiprocessors is low, and the profiler reports insufficient figures for the branch divergence overhead of the kernel instructions as well as for the local memory overhead and DRAM utilization of the kernel memory. The average global memory store efficiency is less than 50%.


VI. CONCLUSION
This paper concludes that the maximum data size processed by the Strassen algorithm reaches 32,768 for (2^n × 2^n) matrices, which is bigger than the 8,192 the naive algorithm could process. Also, for all attempts on matrices larger than 2,048, the running times are much faster, about 0.04 seconds to 0.2 seconds for the worst case scenario, and overcome those of the naive algorithm.
We realize that further investigation should be performed, as we did not analyse the memory footprint on the GPGPU properly, nor did we introduce an innovative technique for managing unlimited memory by sending chunks of the matrix back and forth between DRAM and the GPGPU memory segment without reducing performance.
ACKNOWLEDGMENT
This research is partially funded by a Research Grant of Universitas Pelita Harapan, Contract No. P-001-FIK/X/2012, CUDA Parallel Computation: a Case Study of Empirical Orthogonal Function in Climatology.
We would like to thank the three anonymous reviewers for providing us with constructive comments and suggestions.
REFERENCES
[1] M. Cvet, "Strassen's Algorithm: Theory vs. Application," http://mikecvet.wordpress.com/2010/04/17/strassens-algorithm-theory-vs-application-part-2/; as seen 18 March 2013.
[2] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, Second Edition, Cambridge: McGraw Hill, 2003.
[3] D. C. Lay, Linear Algebra and Its Applications, United States: Pearson Education, Inc., 2006.
[4] E. W. Weisstein, "Matrix Multiplication," from MathWorld--A Wolfram Web Resource, http://mathworld.wolfram.com/MatrixMultiplication.html; as seen 23 Feb. 2013.
[5] R. Johnsonbaugh and M. Schaefer, Algorithms, United States: Pearson Education, Inc., 2004.
[6] I. Foster, "11.4 Mergesort," 1995, http://www.mcs.anl.gov/~itf/dbpp/text/node127.html; Internet; as seen 4 May 2013.
[7] B. Barney, Introduction to Parallel Computing, United States: Livermore Computing, 2012.
[8] S. Halim and F. Halim, Competitive Programming 2: This Increases the Lower Bound of Programming Contests. Again., International Olympiad in Informatics, 2011.
[9] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Boston: Pearson Education, Inc., 2011.
[10] G. Ruetsch and B. Oster, Getting Started with CUDA, NVISION 08: The World of Visual Computing, NVIDIA Corporation, 2008.
[11] J. Li, S. Ranka, and S. Sahni, "Strassen's Matrix Multiplication on GPUs," Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on, pp. 157-164, 7-9 Dec. 2011.
[12] H. Hagedoorn, "MSI GeForce GTX 660 TwinFrozr III Review," 2012, http://www.guru3d.com/articles_pages/msi_geforce_gtx_660_twinfrozr_iii_review,5.html; Internet; as seen 14 May 2013.
[13] V. Volkov and D. Barbieri, CUBLAS Library User Guide, NVIDIA Corporation, 2012.
[14] A. Burnes, "Kepler For Every Gamer: Meet the New GeForce GTX 660 and 650," 2012, http://www.geforce.com/whats-new/articles/geforce-gtx-660-650-launch; Internet; as seen 28 April 2013.
[15] NVIDIA, Profiler User's Guide, DU-05982-001_v03, May 2012.

