Abstract. We introduce a variety of techniques toward auto-tuning data-parallel algorithms on the GPU. Our techniques tune these algorithms independently of hardware architecture and attempt to select near-optimum parameters. We work towards a general framework for creating auto-tuned data-parallel algorithms, using these techniques for common algorithms with varying characteristics. Our contributions include tuning a set of algorithms with a variety of computational patterns, with the goal of building a general framework from these results. Our tuning strategy focuses first on identifying the computational patterns an algorithm exhibits, and then on reducing our tuning model based on these observed patterns.
Keywords: GPU Computing, Auto-Tuning Algorithms, Data-Parallel
Programming, CUDA.
1 Introduction
Our approach aggressively prunes the tuning space, resulting in very fast tuning runs. For our set of algorithms, we show that a quick tuning run (less than a few minutes on most machines) can generate the information needed to automatically choose near-optimum parameters for future runs. We will next discuss the general strategy (or philosophy) we used for developing each of our auto-tuned algorithms.
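Concretely, a tuning run of this kind can be as simple as timing each candidate configuration once and caching the winner. The sketch below does this for a basic reduction kernel; the kernel body, candidate block sizes, and problem size are illustrative assumptions rather than the SDK kernels and search space we actually use.

#include <cstdio>
#include <cuda_runtime.h>

// A simple per-block reduction used as a stand-in for a tuned kernel.
// Assumes blockDim.x is a power of two (true for the candidates below).
__global__ void reduce_kernel(const float* in, float* out, int n) {
  extern __shared__ float s[];
  int tid = threadIdx.x;
  int i   = blockIdx.x * blockDim.x + tid;
  s[tid] = (i < n) ? in[i] : 0.0f;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) s[tid] += s[tid + stride];
    __syncthreads();
  }
  if (tid == 0) out[blockIdx.x] = s[0];
}

// Time one launch configuration with CUDA events.
static float time_config(const float* d_in, float* d_out, int n, int threads) {
  int blocks = (n + threads - 1) / threads;
  cudaEvent_t beg, end;
  cudaEventCreate(&beg);
  cudaEventCreate(&end);
  cudaEventRecord(beg);
  reduce_kernel<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
  cudaEventRecord(end);
  cudaEventSynchronize(end);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, beg, end);
  cudaEventDestroy(beg);
  cudaEventDestroy(end);
  return ms;
}

int main() {
  const int n = 1 << 24;
  float *d_in, *d_out;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, n * sizeof(float));
  const int candidates[] = {64, 128, 256, 512};
  int best = candidates[0];
  float best_ms = 1e30f;
  for (int t : candidates) {
    float ms = time_config(d_in, d_out, n, t);
    if (ms < best_ms) { best_ms = ms; best = t; }
  }
  // A real tuner would persist the winner (e.g. to a file) so future
  // runs on this machine can select it without re-timing.
  printf("best threads/block: %d (%.3f ms)\n", best, best_ms);
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}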
2 Algorithms
2.1 Test Suite
We studied four parallel algorithms extensively in our work: reduction, a common parallel primitive which reduces a large set of values to a single value; scalar product, which sums the products of corresponding elements of two vectors; an N-Body algorithm, which simulates the gravitational interactions between a set of objects; and SGEMM, a fast matrix multiply. For the reduction, scalar product, and N-Body algorithms we used the kernels found in the NVIDIA SDK as our base algorithms. These algorithms have already been highly optimized and are considered standards for benchmarking. Since many of these algorithms have been hand-tuned for certain machines, we may receive only a marginal performance boost for certain devices and workloads. In particular, the NVIDIA SDK has changed a number of default parameters to obtain better performance on newer hardware.
2.2 Reduction
2.3 Scalar Product
This algorithm is also memory bound and available in NVIDIA's SDK. However, the parameters for tuning the scalar product kernel are more complex, as the vector size and the total number of vectors add a dimension to the tuning process.
Therefore, while applying some of the same principles from our reduction method for a thread cap and associated work cap (the minimum workload at which our thread cap is applied), we also select a set of distributed points (vsize_i, n_i), with vsize_i < vsize_m and n_i < n_m, that lie under the work cap, and test each for the optimum number of threads; these points then operate as the knots of a nearest-neighbor spline.
Here our machine-dependent parameters help prune our tuning space, as we generalize parameters for all input sets outside these limits. For all input sets under the thread cap we pre-tune each point from our distributed set and select the parameters of the knot closest to the input set. This works due to the fine-grained correlation between the input set and the parameter set. We used a nearest-neighbor approach rather than an interpolation approach, as our tests showed the closest-knot approach generally performed better and its performance was more stable (less variance).
Since we prune all points greater than (vsize_m, n_m), where vsize_m and n_m depend on machine parameters, we reduce the tuning space to a smaller subset of the original input space.
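As an illustration of this knot-based lookup, the sketch below pre-stores tuned thread counts at a few knots and selects the nearest one in log space. The log-space distance metric, the fallback parameters, and the function and field names are assumptions for illustration rather than our exact implementation.

#include <cmath>
#include <vector>

struct Knot {
  float log_vsize;  // log2 of the pre-tuned vector size
  float log_n;      // log2 of the pre-tuned number of vectors
  int   threads;    // optimum threads/block found by pre-tuning
};

// Inputs at or beyond (vsize_m, n_m) fall back to generalized parameters;
// everything else takes the parameters of the closest pre-tuned knot
// (nearest neighbor rather than interpolation, as discussed above).
int select_threads(const std::vector<Knot>& knots, int vsize, int n,
                   int vsize_m, int n_m, int generalized_threads) {
  if (vsize >= vsize_m || n >= n_m || knots.empty())
    return generalized_threads;
  float lv = std::log2(static_cast<float>(vsize));
  float ln = std::log2(static_cast<float>(n));
  const Knot* best = &knots[0];
  float best_d2 = 1e30f;
  for (const Knot& k : knots) {
    float dv = lv - k.log_vsize, dn = ln - k.log_n;
    float d2 = dv * dv + dn * dn;
    if (d2 < best_d2) { best_d2 = d2; best = &k; }
  }
  return best->threads;
}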
2.4 N-Body

2.5 SGEMM
As mentioned previously, we used Lung-Sheng Chien's optimized code [1] (which is a further optimization of Vasily Volkov's code [6]) as the base kernel for our auto-tuning method. There are a number of variations to this set of code; we selected the one with the best performance that would not lead to out-of-bounds problems (method8 in Chien's benchmark suite). Using templates, we created twenty-one valid versions of this method that relied on three input parameters. Due to Chien's reliance on precompiling kernels into cubin binaries before runs, we created twenty-one associated binaries, one for each version. Experimental results showed that a number of these valid versions were inherently inefficient, and these were removed from the tuning suite.
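The sketch below shows the template mechanism in miniature: one kernel body is specialized at compile time by its template parameters, and each explicit instantiation becomes one precompiled variant. The placeholder body and the two parameters (TILE, WORK) are illustrative assumptions; Chien's actual kernels take three parameters and use a far more sophisticated tiled inner loop.

#include <cuda_runtime.h>

// One source body, many compile-time variants. TILE sets the thread-block
// tile and WORK the number of output columns computed per thread.
template <int TILE, int WORK>
__global__ void sgemm_variant(const float* A, const float* B, float* C,
                              int n) {
  int row  = blockIdx.y * TILE + threadIdx.y;
  int col0 = (blockIdx.x * TILE + threadIdx.x) * WORK;
  if (row >= n) return;
  for (int w = 0; w < WORK; ++w) {
    int col = col0 + w;
    if (col >= n) continue;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
      acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
  }
}

// Explicit instantiations: each one can be compiled to its own cubin.
template __global__ void sgemm_variant<16, 1>(const float*, const float*,
                                              float*, int);
template __global__ void sgemm_variant<16, 2>(const float*, const float*,
                                              float*, int);
template __global__ void sgemm_variant<32, 4>(const float*, const float*,
                                              float*, int);
// ... remaining parameter combinations instantiated the same way ...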
We found that for larger matrices, one parameter set would deliver near-optimum performance. Therefore, our strategy was to find the optimum parameter set (of the twenty-one available) for a given machine, which would allow us to tune all input sets larger than a certain matrix size. We also noticed that, in general, this preferred kernel had higher register usage per thread than the other kernels. Whether this is a rule of thumb one can follow for quickly selecting an appropriate variation requires more testing on a wider variety of devices.
For smaller matrices, our tests on a GTX 260 and an 8600 GTS showed more variety and workload dependence in the optimal kernel. Therefore, our complete tuning strategy for this algorithm is as follows (a short dispatch sketch appears after the list):
1. Find the optimal kernel for large matrices. This relies on machine-specific parameters.
2. Find the matrix size at which we switch strategies.
3. Test a variety of candidates from a pre-tuning run, and finely tune all small matrices with these candidates.
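The sketch below codifies this dispatch. The switch point, the single large-matrix configuration, and the small-matrix table are values a pre-tuning run would produce; all names here are illustrative assumptions.

#include <iterator>
#include <map>

struct SgemmConfig { int variant_id; };  // index into the kernel variants

// Above the switch point, one machine-specific variant is near-optimal;
// below it, use the pre-tuned entry nearest to the requested size.
SgemmConfig pick_sgemm_config(int size, int switch_point,
                              SgemmConfig large_cfg,
                              const std::map<int, SgemmConfig>& small_table) {
  if (size >= switch_point || small_table.empty())
    return large_cfg;
  auto hi = small_table.lower_bound(size);
  if (hi == small_table.begin()) return hi->second;
  if (hi == small_table.end())   return std::prev(hi)->second;
  auto lo = std::prev(hi);
  return (size - lo->first <= hi->first - size) ? lo->second : hi->second;
}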
Though this algorithm requires us to finely tune for smaller matrices, we gain an advantage by pruning all matrices greater than the switch point. Since smaller matrices are solved much faster, we are able to prune the most expensive section of the tuning space. Also, since we only select a few valid candidates for the smaller matrices, we further eliminate unnecessary kernel tests.
3 Results
We tested our methods and ran a number of experiments on a high-end GTX 260 and GTX 280, a medium-end 8800GT and 5600FX, and a low-end 8600 GTS. We had access to a Tesla C1060 for our reduction tests; however, it was unavailable for test runs of our other algorithms.
In the next subsections, we compare the performance of each of our auto-tuned algorithms against the untuned default algorithms. For all auto-tuned algorithms, tuning runs are very short (around one to five minutes) because we carefully prune the tuning space so that we gather only the information necessary for our tuning process.
3.1 Reduction
While this bandwidth cannot be queried directly (one can query the number of memory controllers, but not the bandwidth at which each operates), some pre-tuning tests can supply this information and be used to reduce the tuning space further.
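The sketch below illustrates one such pre-tuning probe on current CUDA runtimes: it queries the memory bus width and clock, then estimates achievable bandwidth with a timed device-to-device copy. The memoryBusWidth and memoryClockRate fields are modern cudaDeviceProp members, an assumption beyond the runtime available for the hardware in this study.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("bus width: %d bits, memory clock: %d kHz\n",
         prop.memoryBusWidth, prop.memoryClockRate);

  const size_t bytes = 256u << 20;  // 256 MB test buffers
  char *src, *dst;
  cudaMalloc(&src, bytes);
  cudaMalloc(&dst, bytes);
  cudaEvent_t beg, end;
  cudaEventCreate(&beg);
  cudaEventCreate(&end);
  cudaEventRecord(beg);
  cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
  cudaEventRecord(end);
  cudaEventSynchronize(end);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, beg, end);
  // The copy reads and writes every byte, so effective traffic is 2x bytes.
  printf("measured bandwidth: %.1f GB/s\n", 2.0 * bytes / (ms * 1e6));
  cudaFree(src);
  cudaFree(dst);
  return 0;
}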
The performance comparison in Figure 1 also shows that this algorithm is dominated by machine-dependent parameters. If there were a higher workload dependence, selecting optimal parameters would be more difficult, resulting in pockets of poorer performance. However, the auto-tuned reduction kernel consistently matches the optimum algorithmic bound curve, which is derived from a brute-force performance check over every possible parameter combination.
3.2 Scalar Product
Though using knots adds complexity and the possibility of non-optimal selections, our results still outperform NVIDIA's SDK for most points. Since visualizing the performance of a multi-dimensional input space is difficult, we instead present our results in Table 1 as the relative performance on a set of random inputs. Our results in Table 1 show good speedups for almost all cases on the GTX 280 and 260 (achieving nearly a 30 percent performance boost on the 280). While the performance gains on the 8800GT and 5600FX were not as drastic, we were still able to boost performance slightly.
[Table 1: per-device speedups of the auto-tuned scalar product over the NVIDIA SDK kernel; the recovered fragment lists speedups of 1.2817, 1.1511, 1.0201, and 1.0198, with panel (d) covering the 8800GT.]
3.3 N-Body

3.4 SGEMM
Fig. 3. Performance of the coarser-grained parameter selection for our SGEMM. The right-hand figure demonstrates possible auto-tuned performance versus the default kernel. For smaller matrices one could possibly achieve a 2x speedup (e.g., for matrices of 200 x 200, possible performance is about 130 GFlop/s versus 60 GFlop/s).
3.5 Summary of Techniques
This section serves both as a summary of our previous techniques and as a guide to which types of algorithms each is applicable. Table 2 shows our tested algorithms and how strongly their optimum parameters are machine and workload dependent. Once these dependencies are identified, one can begin developing an auto-tuning scheme for parameter selection.
Generally speaking, we find that for strongly workload-dependent algorithms with a fine-grained input set and fine-grained parameter set, nearest-neighbor spline strategies have returned good results. This was evident in both our N-Body and scalar product tuned algorithms. For our reduction kernel, we saw a strong dependence between our thread parameters and the device parameters (memory controllers); therefore, our strategy relies on this simple relationship. Finally, our SGEMM kernel displayed various levels of dependency, and we therefore developed a hybrid strategy for different workload ranges.
Table 2. Summary of each auto-tuned algorithm's dependency on device-specific parameters and workload-specific parameters (input space)

Algorithm Name   Device Dependency   Workload Dependency
Reduction        Strong              Weak
Scalar Product   Medium              Strong
N-Body           Weak                Strong
SGEMM            Strong              Medium
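One way to codify this summary is a small dispatch from dependency profile to strategy, as sketched below. The enums and mapping are our reading of Table 2, an illustrative codification rather than code from the original tuners.

enum class Dependency { Weak, Medium, Strong };
enum class Strategy { MachineModel, NearestNeighborSpline, Hybrid };

// Map an algorithm's dependency profile to a tuning strategy: strong
// workload dependence favors knot-based nearest-neighbor tuning, strong
// device dependence with weak workload dependence favors a machine-
// parameter model, and mixed profiles favor a hybrid scheme.
Strategy choose_strategy(Dependency device, Dependency workload) {
  if (workload == Dependency::Strong)
    return Strategy::NearestNeighborSpline;  // scalar product, N-Body
  if (device == Dependency::Strong && workload == Dependency::Weak)
    return Strategy::MachineModel;           // reduction
  return Strategy::Hybrid;                   // SGEMM
}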
4 Conclusion
We believe that hand-tuning algorithms for each machine will become impractical as systems become more diverse in capability and algorithm bounds become more complex. Therefore, developing methods that either fully automate or assist the tuning process could prove to be a powerful tool for developers to boost utilization.
Future work is needed in developing firmer relationships between algorithms with similar computational patterns, and in developing auto-tuning schemes that span these algorithms. Testing on newer architectures, such as the recently released Fermi architecture, is also needed. The Fermi 400-series cards contain a number of new features that would change tuning strategies for a number of algorithms. On top of higher global memory bandwidth, more shared memory within blocks, and greater compute power, these additions include faster atomic operations than previous cards and more computational power for double-precision operations.
Our work has shown a number of auto-tuning practices and methods which boost performance for a number of common algorithms. We believe this is an important stepping stone in developing a generalized tuning methodology for data-parallel programs.
References
1. Chien, L.S.: Hand-Tuned SGEMM on GT200 GPU. Technical Report, Tsing Hua University (2010), http://oz.nthu.edu.tw/~d947207/NVIDIA/SGEMM/HandTunedSgemm_2010_v1.1.pdf
2. Kerr, A., Diamos, G., Yalamanchili, S.: Modeling GPU-CPU Workloads and Systems. In: GPGPU 2010: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pp. 31–42. ACM, New York (2010)
3. Li, Y., Dongarra, J., Tomov, S.: A Note on Auto-Tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009. LNCS, vol. 5544, pp. 884–892. Springer, Heidelberg (2009)
4. Liu, Y., Zhang, E., Shen, X.: A Cross-Input Adaptive Framework for GPU Program Optimizations. In: IPDPS 2009: Proceedings of the 2009 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–10. IEEE Computer Society Press, Washington, DC (2009)
5. Ryoo, S., Rodrigues, C.I., Stone, S.S., Baghsorkhi, S.S., Ueng, S.Z., Stratton, J.A., Hwu, W.W.: Program Optimization Space Pruning for a Multithreaded GPU. In: CGO 2008: Proceedings of the Sixth Annual IEEE/ACM International Symposium on Code Generation and Optimization, pp. 195–204 (April 2008)
6. Volkov, V., Demmel, J.W.: Benchmarking GPUs to Tune Dense Linear Algebra. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–11. IEEE Press, Piscataway (2008)