
Toward Techniques for Auto-tuning GPU Algorithms


Andrew Davidson and John Owens
University of California, Davis

Abstract. We introduce a variety of techniques for auto-tuning data-parallel algorithms on the GPU. Our techniques tune these algorithms independently of hardware architecture and attempt to select near-optimum parameters. We work toward a general framework for creating auto-tuned data-parallel algorithms, applying these techniques to common algorithms with varying characteristics. Our contributions include tuning a set of algorithms with a variety of computational patterns, with the goal of building a general framework from these results. Our tuning strategy focuses first on identifying the computational patterns an algorithm exhibits, and then on reducing our tuning model based on these observed patterns.
Keywords: GPU Computing, Auto-Tuning Algorithms, Data-Parallel
Programming, CUDA.

1 Introduction

Given the complexity of emerging heterogeneous systems, where small parameter changes lead to large variations in performance, hand-tuning algorithms has become impractical, and auto-tuning algorithms have gained popularity in recent years. When tuned, these algorithms select near-optimum parameters for any machine. We demonstrate a number of techniques which successfully auto-tune parameters for a variety of GPU algorithms. These algorithms are commonly used, were not synthetically created, and were chosen to rely on different parameters and to be performance-bound in different ways.
Work by both Liu et al. [4] and Kerr et al. [2] uses adaptive database methods for deriving relationships between input sets and their direct influence on the parameter space. Work by Li et al. [3] focuses on tuning one GEMM algorithm, while work by Ryoo et al. [5] mainly deals with kernel- and compiler-level optimizations for one machine (an 8800GTX). Our work differs from these in that we first try to identify computational patterns between algorithms and then tune their parameters using a number of different techniques.
Rather than use a single database method for all of our algorithms, which might be susceptible to non-linear relations, we try to identify the relationships between each algorithm and its possible input sets on any given machine; we then choose a tuning strategy accordingly. With these relationships in hand, we can heavily prune the parameter search space before attempting to tune the algorithm,
K. Jónasson (Ed.): PARA 2010, Part II, LNCS 7134, pp. 110–119, 2012.
© Springer-Verlag Berlin Heidelberg 2012



resulting in very fast tuning runs. For our set of algorithms, we show a quick
tuning run (less than a few minutes for most machines) can generate the needed
information to automatically choose near-optimum parameters for future runs.
We will next discuss the general strategy (or philosophy) we used for developing
each of our auto-tuned algorithms.

2 Algorithms

Algorithms designed in a data-parallel fashion are concerned with maintaining the highest throughput possible. This means parameters are often dependent not only on the machine, but also on the workload being operated on. With this in mind, when attempting to auto-tune an algorithm we approach the problem using this philosophy:
- Identify the tunable parameters for the algorithm.
- Look for a relationship between the input space and the parameter space.
- If such a relationship exists, build a model from the input space to the parameter space, or model the parameter space solely on machine characteristics.
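The philosophy above can be sketched as a small host-side tuning loop. This is an illustrative sketch only, not the paper's framework: `measure(params, workload)` is a hypothetical timing callback that runs the kernel and returns elapsed time, and the decision rule (one global winner versus a per-workload table) is an assumption.

```python
def autotune(tunable_params, input_samples, measure):
    """Illustrative sketch of the three-step tuning philosophy.

    measure(params, workload) is a hypothetical timing callback; none of
    these names come from the paper's framework.
    """
    best = {}
    for workload in input_samples:
        # Sweep the (already pruned) parameter space per sample workload
        # to expose any relationship between input and parameter space.
        best[workload] = min(tunable_params,
                             key=lambda p: measure(p, workload))
    if len(set(best.values())) == 1:
        # One setting wins everywhere: model on machine characteristics.
        return ("machine", best[input_samples[0]])
    # The winner varies with the workload: keep a workload-based model.
    return ("workload", best)
```

When `measure` ignores the workload, the loop collapses to a single machine-dependent setting; otherwise the returned table captures the input-to-parameter relationship.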
When considering the input parameters for each algorithm, we therefore identify two axes of tuning: workload-specific and machine-specific. The relationship between these two may vary widely between algorithms, but we concentrate on identifying common patterns between certain algorithms that can be used as a starting point for general tuning. As an example, in our N-Body algorithm we discovered a heavy dependence between our input parameters and the workload (input space). Yet for our reduction kernel, once the workload reaches a critical size, our tuning parameters rely solely on machine-specific parameters; therefore, using micro-benchmarks, we developed a model that estimates the correct parameters. Section 2.2 covers this strategy in more depth. Next we will introduce our auto-tuning test suite, and the approaches we took to quickly tune each algorithm.
2.1 Test Suite

We studied four parallel algorithms extensively in our work: reduction, a common parallel primitive which operates on a large set of values and reduces that set to one value; scalar product, which sums the products of elements between two vectors; an N-Body algorithm, which simulates the gravitational interactions between a set of objects; and finally SGEMM, a fast matrix multiply. For the reduction, scalar product and N-Body algorithms we used the kernels found in the NVIDIA SDK as our base algorithms. These algorithms have already been highly optimized, and are considered standards for benchmarking. Since many of these algorithms have been hand-tuned for certain machines, we may obtain only a marginal performance boost for certain devices and workloads. In particular, the NVIDIA SDK has changed a number of default parameters to obtain


optimal performance on GT200 series cards. As a result, performance suffers on older cards, which highlights the importance of auto-tuning algorithms that scale with architecture changes. For our SGEMM algorithm we used Chien's highly tuned and optimized kernel [1], which demonstrates higher performance than NVIDIA's released SGEMM. In our final auto-tuned algorithm we wish to trim as much of the parameter search space as possible, while maintaining an accurate model to select parameters.

Auto-tuning algorithms that use standard techniques, such as model-driven inputs with interpolation between points, may result in sub-optimal parameters (one method does not fit all). This is where the major challenge lies, as great care must be taken to ensure one's auto-tuning strategy is not susceptible to unpredictable non-linear effects.
2.2 Reduction

We made a few minor improvements to NVIDIA's SDK reduction kernel code. We chose this algorithm as it is clearly memory-bound and each item can be accessed in any order (coalesced reads would obviously be preferred). Tuning reduction would therefore give insight into many other common parallel algorithms, such as scan and sort, which have similar computational patterns. For the reduction kernel, the parameters that require tuning are the number of threads per block and the number of blocks. The optimum set of parameters may fluctuate with respect to the number of elements we reduce.
Through experiments and benchmarking, we were able to show a standard behavior for the optimum parameter set given a number of elements. Using the results from these tests, our auto-tuning method first searches for a thread cap for all elements, which is assumed to fully occupy the machine. The thread cap is defined as the maximum number of threads optimum for any valid input set (no optimum parameters will have more total threads than the thread cap). Since our input set is a one-dimensional array in the reduction algorithm, it is easy to test what this thread cap is, and where it applies. All workloads greater than the number of elements at which this thread cap first applies are also bound by the thread cap. Next we test for a lower switchpoint where the CPU outperforms the GPU, and the optimum parameters for that point. Using these two points, and their associated numbers of elements, we are able to select a number of total threads (threads per block and number of blocks) for any input set.
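The two-point selection rule above can be sketched as follows. The paper does not specify how thread counts are chosen between the switchpoint and the thread cap; the linear interpolation here, and all parameter names, are assumptions for illustration.

```python
def reduction_total_threads(n, switchpoint, switch_threads, cap_n, thread_cap):
    """Sketch of the two-point reduction model.

    switchpoint / switch_threads: smallest GPU-worthwhile workload and its
    tuned total-thread count; cap_n / thread_cap: the workload at which the
    machine-wide thread cap takes over, and the cap itself. All four values
    come from the one-time tuning run. The linear interpolation between the
    two measured points is an assumption, not the paper's stated rule.
    """
    if n < switchpoint:
        return None                  # CPU outperforms the GPU here
    if n >= cap_n:
        return thread_cap            # device fully occupied: cap applies
    frac = (n - switchpoint) / (cap_n - switchpoint)
    return int(switch_threads + frac * (thread_cap - switch_threads))
```

The returned total-thread count would then be split into threads-per-block and block-count according to device limits.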
This algorithm is thus highly machine-dependent, less workload-dependent, and therefore much easier to tune. Highly workload-dependent algorithms require micro-benchmarks in order to correlate the workload space with the parameter space. The next algorithms considered are cases where the workload has a more dominant effect on the parameter space.
2.3 Scalar Product

This algorithm is also memory-bound and available in NVIDIA's SDK. However, the parameters for tuning the scalar product kernel are more complex, as the vector size and total number of vectors add a dimension to the tuning process. Therefore, while applying some of the same principles from our reduction method for a thread cap and an associated work cap (the minimum workload at which our thread cap is applied), we also select a set of distributed points (vsize_i, n_i), with vsize_i < vsize_m and n_i < n_m, that are under the work cap, and test for the optimum number of threads; these points operate as a nearest-neighbor spline.
Here our machine-dependent parameters help prune our tuning space, as we generalize parameters for all input sets outside these limits. For all input sets under the thread cap, we pre-tune each point from our distributed set and select parameters for the knot closest to the input set. This works due to the fine-grained correlation between the input set and the parameter set. We used a nearest-neighbor approach rather than an interpolation approach, as our tests showed the closest-knot approach generally performed better and performance was more stable (less variance).
Since we prune all points greater than (vsize_m, n_m), where vsize_m and n_m depend on machine parameters, we have reduced the tuning space to a smaller subset of the original input space.
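The knot lookup described above might look like the following sketch. The distance metric and all names are assumptions; the paper only specifies that the closest pre-tuned knot is used under the work cap and a single generalized setting beyond it.

```python
def scalar_product_params(vsize, n, knots, capped_params, vsize_m, n_m):
    """Nearest-neighbour knot lookup for scalar product tuning.

    knots maps pre-tuned points (vsize_i, n_i) -> thread parameters;
    capped_params is the single machine-dependent setting used once the
    work cap (vsize_m, n_m) is reached. The Euclidean distance metric is
    an assumption; any reasonable nearest-knot rule would do.
    """
    if vsize >= vsize_m or n >= n_m:
        return capped_params         # pruned region: one generalised setting
    nearest = min(knots, key=lambda k: (k[0] - vsize) ** 2 + (k[1] - n) ** 2)
    return knots[nearest]
```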
2.4 N-Body

We selected the N-Body algorithm as it has block-wise tunable parameters and is more arithmetically intense, and therefore not global-memory-bound. The tunable parameters in this case are two dimensions of block sizes that will be shared per block. Tuning these parameters therefore involves a tradeoff between register usage and the amount of data sharing per block. These parameters are fairly fine-grained, and when paired with a fine-grained input set we find that there is a correlation between the input space and parameter space. In other words, the optimum parameter point does not vary widely between input set a_i and a_i + x, where x is a small variation of the input set a_i.
This motivates us to use a nearest-neighbor spline technique and concentrate
on intensely tuning a few distributed points. This technique allows us to greatly
reduce the tuning search space to only a small subset, while maintaining near
optimum performance for all inputs.
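One plausible shape for this strategy is sketched below: tune only a distributed subset of body counts, then answer every future query from the nearest tuned point. The `(p, q)` tile pairs, the timing callback `measure`, and the sampling scheme are all hypothetical; the twenty-knot default mirrors the example discussed in the results section.

```python
def tune_nbody_spline(tile_candidates, n_range, measure, knots=20):
    """Sketch of the distributed-point nearest-neighbour spline strategy.

    tile_candidates: hypothetical (p, q) block-size pairs to choose among.
    measure(tile, n): hypothetical timing callback for n bodies.
    knots: how many distributed points to tune intensely.
    """
    step = max(1, len(n_range) // knots)
    tuned = {n: min(tile_candidates, key=lambda t: measure(t, n))
             for n in n_range[::step]}       # tune a distributed subset only

    def lookup(n):
        # The nearest tuned body count answers every future query.
        return tuned[min(tuned, key=lambda m: abs(m - n))]
    return lookup
```

Increasing `knots` trades longer tuning runs for parameters closer to the per-input optimum, which matches the observation that more tuned points bring performance nearer the optimum curve.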
2.5 SGEMM

As mentioned previously, we used Lung-Sheng Chien's optimized code [1] (which is a further optimization of Vasily Volkov's code [6]) as the base kernel for our auto-tuning method. There are a number of variations of this code; we selected the one with the best performance that would not lead to out-of-bound problems (method8 in Chien's benchmark suite). Using templates, we created twenty-one valid versions of this method that relied on three input parameters. Due to Chien's reliance on precompiling kernels into cubin binaries before runs, we created an associated binary for each of the twenty-one versions. Experimental results showed that a number of these valid versions were inherently inefficient, and these were removed from the tuning suite.


We found that for larger matrices, one parameter set would deliver near-optimum performance. Therefore, our strategy was to find the optimum parameter set (of the twenty-one available) for a given machine, which would allow us to tune all input sets larger than a certain matrix size. We also noticed that, in general, this preferred kernel had higher register usage per thread than the other kernels. Whether this is a rule of thumb one can follow for quickly selecting an appropriate variation requires more testing on a wider variety of devices.
For smaller matrices, our tests on a GTX260 and an 8600 GTS showed that there was more variety and workload dependence in the optimal kernel. Therefore, our complete tuning strategy for this algorithm is as follows:
- Find the optimal kernel for large matrices. This relies on machine-specific parameters.
- Find the matrix size at which we switch strategies.
- Test a variety of candidates from a pre-tuning run, and finely tune all small matrices with these candidates.
Though this algorithm requires us to finely tune for smaller matrices, we gain an advantage by pruning all matrices greater than the switchpoint. Since smaller matrices are solved much faster, we are able to prune the most vital section of tuning space. Also, since we only select a few valid candidates for the smaller matrices, we further eliminate unnecessary kernel tests.
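The hybrid selection rule above can be sketched in a few lines. The switchpoint, the candidate list, and the `measure` timing callback are hypothetical stand-ins for values found in the tuning run; only the method8-style dominant kernel for large matrices is taken from the paper.

```python
def sgemm_kernel_for(n, switch_size, large_kernel, small_candidates, measure):
    """Sketch of the hybrid SGEMM tuning strategy.

    large_kernel: the machine-dependent winner found for big matrices
    (a method8-like variant). small_candidates: the few variants that
    survive the pre-tuning run. measure(kernel, n): hypothetical timing
    callback for an n x n multiply.
    """
    if n >= switch_size:
        return large_kernel          # one dominant kernel above the switchpoint
    # Small matrices run quickly, so finely tuning the survivors is cheap.
    return min(small_candidates, key=lambda k: measure(k, n))
```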

3 Results

We tested our methods and ran a number of experiments on a high-end GTX 260 and GTX 280, a mid-range 8800GT and 5600FX, and a low-end 8600 GTS. We had access to a Tesla C1060 for our reduction tests; however, it was unavailable for test runs of our other algorithms.
In the next subsections, we will compare the performance of each of our auto-tuned algorithms versus the untuned default algorithms. For all auto-tuned algorithms, tuning runs are very short (around one to five minutes) because we carefully prune the tuning space so that we gather only the necessary information for our tuning process.
3.1 Reduction

Figure 1 compares the performance of our auto-tuning method against that of the SDK's default parameters. The results show a speedup that brings memory performance close to the bandwidth limit. Our auto-tuned method (blue plot) performs as well as a brute-force check of all possible parameters (red dotted line), while taking only a minute or less for a one-time tuning run. As a memory-bound function, we found performance depended directly on the number of memory controllers available to the machine. Though this cannot be queried

Fig. 1. Auto-Tuned vs SDK Performance Comparison for Reduction, on (a) 8600 GTS, (b) GTX 280, and (c) Tesla C1060. The theoretical bound is the reported maximum DRAM bandwidth. In all cases, the auto-tuned implementation performs as well as the algorithmic bound (brute-force test of all implementations).

directly (one can query the number of memory controllers, but not the bandwidth at which each operates), some pre-tuning tests can supply this information, which can be used to reduce the tuning space further.
The performance comparison in Figure 1 also shows this algorithm is dominated by machine-dependent parameters. If there were a higher workload dependence, selecting optimal parameters would be more difficult, resulting in pockets of poorer performance. However, the auto-tuned reduction kernel consistently matches the optimum algorithmic-bound curve. This algorithmic-bound curve is a brute-force performance check of every possible parameter combination.
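The algorithmic-bound baseline amounts to an exhaustive sweep, which can be sketched in one line. `measure` is a hypothetical timing callback; this exhaustive baseline is what the auto-tuner is judged against, not something one would run in production.

```python
def algorithmic_bound(param_grid, workloads, measure):
    """Brute-force 'algorithmic bound': for every workload, time every
    parameter combination and keep the best (lowest) result.

    measure(params, workload) is a hypothetical timing callback.
    """
    return {w: min(measure(p, w) for p in param_grid) for w in workloads}
```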
3.2 Scalar Product

Though using knots adds complexity and the possibility of non-optimal selections, our results still outperform those of NVIDIA's SDK for most points. Since visualizing the performance of a multi-dimensional input space is difficult, we instead present our results in Table 1 as the relative performance over a set


of random inputs. Our results in Table 1 show good speedups for almost all cases on the GTX 280 and 260 (achieving nearly a 30 percent performance boost on the 280). While the performance gains on the 8800GT and 5600FX were not as drastic, we were still able to boost performance slightly.

Table 1. Performance speedup comparison of the auto-tuned kernel vs. the default for a set of 1000 randomly selected points

Architecture (GPU)   Speedup
GTX 280              1.2817
GTX 260              1.1511
8800 GT              1.0201
5600 FX              1.0198

Fig. 2. Auto-Tuned vs SDK performance comparison for N-Body, on (a) GTX 260, (b) GTX 280, (c) Quadro 5600, and (d) 8800GT

3.3 N-Body

Figure 2 shows the performance comparison of our spline auto-tuned strategy versus the default parameters. The spikes in performance from the untuned implementation illustrate the dependence of the parameter space on the workload. As the workload changes, the untuned parameters fall in and out of sync with the optimum implementation.
Our spline strategy helps to minimize these spikes in performance and maintains near-optimum performance by updating parameters as the workload changes. In some cases the resulting speedups can be up to 2x over the untuned performance. One can also vary the amount of tuning in order to get closer and closer to the optimum performance. The example in Figure 2 uses twenty tuned points as a reference.
3.4 SGEMM

As was illustrated in Section 2.5, small variations in parameters can lead to large variations in kernel performance. Therefore, we cannot use the same spline technique as in our previous two algorithms. The default parameters that Chien [1] used were found to be near-optimal for most cases on larger matrices on both the GTX260 and 8600GTS. For smaller matrices (N < M) other kernels were preferred, as shown in Figure 3. More testing is needed to confirm that this holds true for a variety of devices. However, this motivated our hybrid tuning technique, in which coarse tuning runs are used to select candidate parameters for smaller matrices, and a dominating machine-dependent parameter set is used for all larger matrices.

Fig. 3. Performance of the coarser-grained parameter selection for our SGEMM. The right-hand figure demonstrates possible auto-tuned performance versus the default kernel. For smaller matrices one could possibly achieve a 2x speedup (e.g. for matrices of 200 × 200, possible performance is about 130 GFLOPS versus 60 GFLOPS).

3.5 General Auto-tuning Summary and Practices

This section serves both as a summary of our previous techniques and as a guide to the types of algorithms to which each is applicable. Table 2 shows our tested algorithms, and how strongly their optimum parameters are machine- and workload-dependent. Once these are identified, one can begin developing an auto-tuning scheme for parameter selection.
Generally speaking, we find that for strongly workload-dependent algorithms with a fine-grained input set and fine-grained parameter set, nearest-neighbor spline strategies have returned good results. This was evident in both our N-Body and scalar product tuned algorithms. For our reduction kernel, we saw a strong dependence between our thread parameters and the device parameters (memory controllers); therefore our strategy relies on this simple relationship. Finally, our SGEMM kernel displayed various levels of dependency, and we therefore developed a hybrid strategy for different workload ranges.
Table 2. Summary of each auto-tuned algorithm's dependencies on device-specific parameters and workload-specific parameters (input space)

Algorithm Name   Device Dependency   Workload Dependency
Reduction        Strong              Weak
Scalar Product   Medium              Strong
N-Body           Weak                Strong
SGEMM            Strong              Medium

4 Conclusion

We believe that hand-tuning algorithms for each machine will become impractical as systems become more diverse in capability and algorithm bounds become more complex. Therefore, developing methods that either fully automate or assist the tuning process could prove to be powerful tools for developers to boost utilization.
Future work is needed in developing firmer relationships between algorithms with similar computational patterns, and in developing auto-tuning schemes between these algorithms. Testing on newer architectures, such as the recently released Fermi architecture, is also needed. The Fermi 400-series cards contain a number of new features that would change tuning strategies for a number of algorithms: in addition to faster global memory bandwidth and more shared memory within blocks, these cards offer faster atomic operations than previous cards and more computational power for double-precision operations.
Our work has shown a number of auto-tuning practices and methods which boost performance for a number of common algorithms. We believe this is an important stepping stone in developing a generalized tuning methodology for data-parallel programs.


References
1. Chien, L.S.: Hand-Tuned SGEMM on GT200 GPU. Technical Report, Tsing Hua University (2010), http://oz.nthu.edu.tw/~d947207/NVIDIA/SGEMM/HandTuned Sgemm 2010 v1.1.pdf
2. Kerr, A., Diamos, G., Yalamanchili, S.: Modeling GPU-CPU Workloads and Systems. In: GPGPU 2010: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pp. 31–42. ACM, New York (2010)
3. Li, Y., Dongarra, J., Tomov, S.: A Note on Auto-Tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009. LNCS, vol. 5544, pp. 884–892. Springer, Heidelberg (2009)
4. Liu, Y., Zhang, E., Shen, X.: A Cross-Input Adaptive Framework for GPU Program Optimizations. In: IPDPS 2009: Proceedings of the 2009 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–10. IEEE Computer Society Press, Washington, DC (2009)
5. Ryoo, S., Rodrigues, C.I., Stone, S.S., Baghsorkhi, S.S., Ueng, S.Z., Stratton, J.A., Hwu, W.W.: Program Optimization Space Pruning for a Multithreaded GPU. In: CGO 2008: Proceedings of the Sixth Annual IEEE/ACM International Symposium on Code Generation and Optimization, pp. 195–204 (April 2008)
6. Volkov, V., Demmel, J.W.: Benchmarking GPUs to Tune Dense Linear Algebra. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–11. IEEE Press, Piscataway (2008)
