
A Fully Saturated OpenCL Particle Swarm Optimizer

Donald Pupecki
Computer Science Department, SUNYIT, 100 Seymour Road, Utica, NY 13502, USA
pupeckd@cs.sunyit.edu

Abstract: In this paper, I describe the methods I used to create a Particle Swarm Optimizer (PSO) that runs under the Khronos Open Compute Language (OpenCL) on General Purpose Graphics Processing Units. The implementation is compared to similar work done in the NVIDIA Compute Unified Device Architecture (CUDA) [1] and to a well-written parallel implementation, the 2006 Standard PSO (SPSO) [3].

Keywords: PSO, CUDA, OpenCL, GPU

I. INTRODUCTION

The Open Compute Language (OpenCL) is a parallel computation language designed to run on a multitude of parallel platforms, including General Purpose Graphics Processing Units (GPGPUs). Particle Swarm Optimization is a technique for solving problems that can be defined by a fitness function, and it falls under the broad category of Evolutionary Algorithms. PSOs have been run on GPGPUs before, in [1] and [2]; however, the Fully Saturated approach presented in this paper explicitly uses all of the silicon available on the platform. This approach should scale well into the future, because the processing power of GPGPUs continues to grow faster than that of CPUs. The GPU platform has already pulled far ahead: as of May 2011, a rack of Tesla 20-series GPUs delivers over 20 teraflops, versus roughly 4 teraflops for a rack of Intel Westmere CPUs [10]. This means the approach should continue to improve with newer hardware.

II. SETUP

The setup for this work was similar to that of previous work with OpenCL [6]. The primary GPGPU used was an NVIDIA GTX 570; an NVIDIA Quadro FX 4800, an ATI Radeon 4870, and an AMD Phenom II x4 were also used for testing. The AMD processor was particularly useful for quick testing and debugging because AMD's cl_amd_printf extension allows the use of the printf function in OpenCL kernels when they are compiled down to x86 code (i.e., run on the CPU); a short example kernel using this extension is sketched at the end of this section. The GTX 570 was used for the final calculations because it is the newest card and therefore supports more features at a higher clock speed, even though the Quadro has more compute units. Table I compares the test devices.

TABLE I
COMPARISON OF TEST DEVICES

Maker    Device           Speed     Compute Units  Max Work Group Size
ATI      Radeon 4870      850 MHz   10             256
AMD      Phenom II x4     3.6 GHz   4              1024
NVIDIA   GTX 570          1464 MHz  15             1024
NVIDIA   Quadro FX 4800   1204 MHz  24             512

The operating system was Linux Mint 11 x64 (the latest release as of this writing). The program was written in a mix of C++ and OpenCL, with some bash scripting for the tests and Gnuplot for the graphs. Development was done by linking against the ATI Stream SDK, while the NVIDIA CUDA SDK was used for testing. The reason for this split is that the NVIDIA SDK fails to implement the ability, required by the specification, for OpenCL code to be compiled for the CPU. It is unclear whether this will be fixed in future releases.
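As an illustration, a minimal debugging kernel using this extension might look like the following sketch. The kernel name and the fitness buffer are assumptions for illustration, not code from OCL PSO itself.

    // Enable AMD's printf extension; this only compiles on platforms that
    // expose cl_amd_printf (e.g., the ATI Stream SDK targeting the CPU).
    #pragma OPENCL EXTENSION cl_amd_printf : enable

    __kernel void debug_fitness(__global const float *fitness)
    {
        size_t gid = get_global_id(0);
        // Print one value per work item for manual inspection; this is a
        // debugging aid only and should be removed from production kernels.
        printf("particle %d: fitness %f\n", (int)gid, fitness[gid]);
    }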

A. Note on Hybrid Environments

Just getting code running on the hybrid ATI/NVIDIA environment was a small challenge in itself, as modern Linux distributions tend to include the nouveau and radeon drivers, open-source replacements for the proprietary NVIDIA and ATI drivers that either do not support OpenCL or do so in a fashion that would greatly hurt performance. To stop these drivers from taking over, it was necessary to blacklist both kernel modules and to modify GRUB so they are not loaded at boot. Only then could both proprietary drivers be installed simultaneously. After this, the Xorg settings had to be changed so that both cards would be recognized under OpenCL.

B. Note on PCI-e Slotting on Modern Motherboards

Another thing that caused some concern during setup is the (mis)labelling of PCI-e motherboard slot speeds. These are normally given in terms of lane count, as a card can use more lanes in parallel to improve transfer speed; lane counts range from 1x and 4x up to 32x. Slot sizes are also given in lanes, but a large slot may not have all of its lanes active (e.g., a 16x slot running at 4x), and it is common for motherboards to have several 16x slots that only run at 8x or 4x speed. Microsoft provides more information on the subject in [7]. After testing, I determined that this was not a problem on my older hardware, but it deserves mention so that others do not fall victim: these cards, when running properly optimized GPU code, can saturate the full bandwidth of a 16x bus, and that performance would be lost in a lower-lane slot.

C. Note on the Difficulty of Debugging OpenCL

There are a few debuggers available, but none without a learning curve. gDEBugger [11] looked promising, but I did not have time to explore it. For this project I was able to fall back on the AMD printf function and do some manual debugging that way, but this was no help once it came time to debug on the NVIDIA cards. I expect that, as more people use OpenCL, more tools will become available.

III. GOALS

The goals of the project were as follows.

A. Implement More Test Functions

Most of the functions used in previous work [6] were parabolic or semi-parabolic. In the end, the functions implemented were:

1. DeJong F1 (Sphere)
2. Rastrigin
3. Rosenbrock
4. Griewank
5. Ackley
6. Schwefel
7. Sum of Powers

However, only the first three were used for testing, as the CUDA PSO implements only those three.

B. Shift Test Functions

Since all of these functions have their global minima at 0.0, shifting was necessary to ensure my testing did not fall victim to the PSO central bias described in [4][5]. All functions were shifted so that their global minima now lie at -2.5 (see the fitness-function sketch following these goals).

C. More Parallelism

In OCL PSO 1.0, fitness calculations were done in parallel over particles, but each dimension was handled in a loop. In the 2.0 version, fitness calculations should instead take the form of parallel reductions.

D. Move Everything to the Card

Previous versions of the project used the GPU only for fitness calculations, and the speedup was almost completely swallowed by IO overhead. Table II shows that only about one tenth of the time attributed to fitness calculation was spent on actual computation; the rest was spent copying data to or from the card, indicating a low computational density (ratio of compute to IO). By doing all calculations on the card, copying is limited to reading the final result and passing a few parameters.

E. Compare to CUDA PSO and SPSO

Compare the code to similar work: the CUDA PSO [1] and the Standard PSO [3]. See Section V.

F. Improve Upon Standard Techniques

Come up with some way of utilizing the graphics card better than is normally done. This was achieved through the Fully Saturated approach, in which every effort was made to use all of the silicon on the card by running as many swarms, and as many particles per swarm, as the card allows. The idea is not new ([12] uses an adaptive swarm size) and is somewhat intuitively obvious, but it does not seem to be done in practice, and to the best of my knowledge it has not been done in OpenCL.
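To make goal B concrete, a shifted fitness function might look like the following OpenCL C sketch. The function name, the SHIFT constant, and the argument layout are assumptions for illustration; only the shifted minimum at -2.5 comes from the text above.

    // Shifted DeJong F1 (sphere). Adding 2.5 to each coordinate moves the
    // global minimum from the origin to x_d = -2.5 in every dimension,
    // avoiding the origin-seeking bias described in [4][5].
    #define SHIFT 2.5f

    float sphere_shifted(__global const float *pos, int dim)
    {
        float sum = 0.0f;
        for (int d = 0; d < dim; ++d) {
            float x = pos[d] + SHIFT;  // shift each coordinate
            sum += x * x;
        }
        return sum;  // 0.0 only when every pos[d] == -2.5
    }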

All of these goals were eventually satisfied in the OCL PSO 2.0 code [8]. The program went through at least two major rewrites during development, gaining more parallelism as it went along, and finally settled on what is now OCL PSO 2.0.

IV. DEVELOPMENT

A. First Version

The first major rewrite moved not only the fitness calculation but also the update function to the card; later, the bests calculation was moved to the card as well. The 1.0 parallelism still applied: work was done in parallel over the number of particles but sequentially over each dimension, and there was still just one swarm. This version did fitness calculations, best-position updates, and position/velocity updates in separate kernels, and it still suffered from the slowness of multiple copies to and from the card on each iteration. A Linear Congruential Random Number Generator (LCRNG) was also implemented on the card: the seed was generated on the CPU and sent to the card, each seed was multiplied by the particle's thread ID to obtain a distinct per-thread seed, and a small loop generated a run of random numbers to spin up the LCRNG (a sketch of this scheme appears below). Testing indicated that this method produced uniform random numbers with a variance from the median of about +/- 5% when plotted as a histogram with 10 bins. While not perfect, this seems random enough for these calculations.
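The seeding scheme just described might be sketched as follows. The multiplier and increment constants, the warm-up count, and all names are assumptions for illustration, not the exact values used in OCL PSO.

    // Per-thread linear congruential generator. The constants are the
    // well-known Numerical Recipes values; the modulus is 2^32 via
    // unsigned integer overflow.
    inline uint lcg_next(uint *state)
    {
        *state = *state * 1664525u + 1013904223u;
        return *state;
    }

    __kernel void init_rng(__global uint *states, uint host_seed)
    {
        size_t gid = get_global_id(0);
        // Derive a distinct seed per work item from the single host seed.
        // (gid + 1 avoids a zero seed for work item 0.)
        uint s = host_seed * ((uint)gid + 1u);
        // "Spin up" the generator so that nearby seeds decorrelate.
        for (int i = 0; i < 16; ++i)
            lcg_next(&s);
        states[gid] = s;
    }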
TABLE II
COMPUTATIONAL DENSITY OF OCL PSO 1.0

Method                Time     Data Transferred
WriteBuffer           0.08559  2.34
WriteBuffer           0.07935  2.34
WriteBuffer           0.07701  0.23
WriteBuffer           0.07851  2.34
WriteBuffer           0.07641  0.08
rast__k3_ATI RV7701   0.04503  0
ReadBuffer            0.02778  2.34
ReadBuffer            0.02052  2.34

Table II shows results given by the ATI tool sprofile, which can aid in measuring kernel performance.

B. Second Version

In this version, the terminating-conditions function was also moved to the card, bringing the total number of arguments required to launch the kernel to an uneconomical 19. The code now ran entirely on the card, but it produced odd results. A hill climber was added to stabilize the code, on the assumption that the odd behavior was a result of the swarm not converging on peaks: the card would switch to the climber after seeing no improvement for a certain number of generations. This did not stop the odd behavior, and much research was done to figure out why. The cause was eventually determined to be a subtle point in the OpenCL framework: while there is a barrier that causes all threads in a work group to sync up before proceeding, there is no way, short of returning to the host C code, to synchronize between work groups. This is a problem for the PSO because it relies on the contribution of a global best particle, which may or may not have been computed by the time the rest of the threads need its value. This show-stopper required a redesign. OpenCL's built-in performance tracking was also enabled at this point to determine actual on-card compute times.

C. Third Version

The last revision was made to fix the bugginess of the previous version, and it included a change in how the PSO is mapped onto the card. In the latest revision, each work group acts as its own independent swarm. The swarm's parameters are determined by the dimensionality one wishes to solve: the number of particles is calculated as #particles = #compute units / dimension, so as to always maximize utilization of the card. Fitness calculations now take the form of bank-conflict-free reductions (a sketch appears below); however, this restricts the possible dimensionalities to powers of two. The restriction could be avoided either by rounding the dimensionality up to the next power of two and padding with null values, or by switching to a slower serial calculation whenever the dimensionality is not a power of two. It is unclear which would perform better, and more work should be done to determine this; it should be noted that the NVIDIA demos use the round-up method. In addition, each compute unit available on the target device runs a swarm, so the card is not only computing all dimensions of all particles in parallel, it is also computing many swarms at the same time. This increases the chance of a lucky roll in the stochastic algorithm. Figure 1 shows a flowchart of the algorithm. Since this version was almost a complete rewrite, I never got around to reimplementing the hill-climber switch. An issue also arose with this version: it will not run on NVIDIA devices unless the -cl-opt-disable flag is passed, turning off compiler optimizations. When executed, the kernel returns success, but it does not store results or do anything indicative of actual execution. I suspect the compiler is erroneously optimizing a large portion of the code into a no-op. Some investigation tracked it down to an issue with optimizing the main for loop: if the loop is manually unrolled, the code works fine; otherwise it exhibits this odd behavior. After much searching no fix was found, and it appears others have had a similar problem [13]. Unfortunately, this means the OCL PSO is not performing as well as it could if optimizations did not have to be disabled.
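A bank-conflict-free fitness reduction along the lines described here might look like the following sketch. For clarity this version reduces one particle per work group, whereas in the actual OCL PSO a work group hosts a whole swarm; the buffer names and indexing are assumptions for illustration.

    // Tree reduction of one particle's per-dimension fitness terms.
    // Assumes dim is a power of two and one work item per dimension.
    __kernel void reduce_fitness(__global const float *terms,
                                 __global float *fitness,
                                 __local float *partial, int dim)
    {
        size_t lid = get_local_id(0);
        size_t grp = get_group_id(0);

        // Each work item loads one per-dimension term into local memory.
        partial[lid] = terms[grp * dim + lid];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Halve the active stride each step; contiguous work items touch
        // contiguous local addresses, which avoids bank conflicts (the same
        // pattern as the NVIDIA reduction demos).
        for (int s = dim / 2; s > 0; s >>= 1) {
            if (lid < (size_t)s)
                partial[lid] += partial[lid + s];
            // Work-group barrier only; OpenCL offers no cross-work-group
            // barrier, which is what forced the one-swarm-per-work-group
            // redesign described above.
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // Work item 0 writes the reduced fitness for this particle.
        if (lid == 0)
            fitness[grp] = partial[0];
    }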

Fig. 1 Flowchart of the OpenCL PSO version 2.0.
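For readers unfamiliar with the update step at the heart of the flowchart's main loop, the canonical inertia-weight PSO update might be sketched as follows. The coefficient values, buffer names, and memory layout are assumptions; this shows the textbook update, not necessarily the exact variant in OCL PSO 2.0.

    // One work item updates one (particle, dimension) pair. Positions and
    // velocities are laid out particle-major, dim entries per particle.
    #define W  0.729f   /* inertia weight */
    #define C1 1.494f   /* cognitive (personal-best) coefficient */
    #define C2 1.494f   /* social (global-best) coefficient */

    __kernel void update_pos_vel(__global float *pos, __global float *vel,
                                 __global const float *pbest, // per-particle best
                                 __global const float *gbest, // swarm best (dim wide)
                                 __global const float *rnd,   // uniform [0,1) pairs
                                 int dim)
    {
        size_t gid = get_global_id(0);
        float r1 = rnd[2 * gid];
        float r2 = rnd[2 * gid + 1];

        // Pull toward this particle's best and the swarm's best positions.
        vel[gid] = W * vel[gid]
                 + C1 * r1 * (pbest[gid] - pos[gid])
                 + C2 * r2 * (gbest[gid % dim] - pos[gid]);
        pos[gid] += vel[gid];
    }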

V. TESTING

Originally I had wanted to pit all three codes against one another; however, after some trial and error it became clear that they cannot easily be made to speak the same language. The SPSO is fairly straightforward and, like most evolutionary codes, keeps track of a best-so-far value. The CUDA PSO, to eliminate unnecessary IO, instead gives the kernel a set number of iterations and transfers only the final result back to the CPU. While this eliminates a lot of IO and synchronization issues, it makes testing somewhat difficult because it is impossible, without modifying the CUDA PSO, to get a best-so-far value or to capture the iteration at which the best fitness was found. I believe this is why the author in [1][2] elected to pick a number of iterations and see how far the optimizer got in that time; this is the method I have used to compare the CUDA PSO to the OCL PSO. For the SPSO, however, I return to the usual best-so-far method.

A. OCL PSO 2.0 vs. CUDA PSO 2.0

Fig. 2 OpenCL PSO fitness as a function of dimension, best after 500 iterations.

The first test was against the most similar work, the CUDA PSO [1][2], which, being written in CUDA, was unable to run on the ATI hardware or the CPU. (It has since come to my attention, via [9], that tools exist for porting CUDA code to OpenCL.) Figure 2 graphs the best fitness after 500 iterations; it would have been nice to test more than three functions, however. A special case is the Rosenbrock banana function. Performance on this function is very poor, most likely a result of the lack of neighborhoods causing particles to get stuck in the trough of local minima. Overall it looks to be nearly a draw between the two codes.

B. OCL PSO 2.0 vs. SPSO

Fig. 3 OpenCL PSO and SPSO fitness as a function of dimension.

The OCL PSO fared much better against the 2006 Standard PSO (Fig. 3). The SPSO outperformed the OCL PSO 2.0 for the two- and four-dimensional cases, and for the eight-dimensional sphere. Beyond that, however, while the OCL PSO kept getting faster (most likely because fewer particles mean less waiting on synchronization and fewer bank conflicts / less false sharing), the SPSO took almost exponentially longer (note the log scale).

VI. CONCLUSION

The OCL PSO appears to be a competitive player, even though it lacks the features of other non-classical or improved PSOs. This is likely due to the massive parallelism inherent in the OCL PSO. With some improvements, such as neighborhoods, the reimplementation of the hill-climber switch, and a fix for the compiler-optimization issue, the OCL PSO could perform even better. Some drawbacks of this approach are the lack of best-so-far fitness results and, closely related, the need to transfer data to and from the card. This is being remedied in newer generations of NVIDIA cards through smarter memory management and easier, more powerful synchronization operations [10].

REFERENCES

[1] L. Mussi, F. Daolio, and S. Cagnoni, "Evaluation of parallel particle swarm optimization algorithms within the CUDA architecture," Information Sciences, 2010, in press.
[2] L. Mussi, Y. S. G. Nashed, and S. Cagnoni, "GPU-based asynchronous particle swarm optimization," in Proc. GECCO 2011, 2011.
[3] J. Kennedy and M. Clerc. (2006). Standard PSO 2006 [Online]. Available: http://www.particleswarm.info/Standard PSO 2006.c
[4] M. Clerc. Confinements and Biases in Particle Swarm Optimization [Online]. Available: http://clerc.maurice.free.fr/pso/
[5] C. K. Monson and K. D. Seppi, "Exposing origin-seeking bias in PSO," presented at GECCO '05, Washington, DC, USA, 2005, pp. 241-248.
[6] D. Pupecki. (2011). OpenCL PSO: Development, Benchmarking, Lessons, Future Work [Online]. Available: http://web.cs.sunyit.edu/~pupeckd/pso10.pdf
[7] MSDN. PCI Express FAQ for Graphics [Online]. Available: http://msdn.microsoft.com/en-us/windows/hardware/gg463285
[8] D. Pupecki. (2011). OpenCL PSO 2.0 [Online]. Available: https://github.com/Flamewires/PSO
[9] Wikipedia. CUDA [Online]. Available: http://en.wikipedia.org/wiki/CUDA
[10] Nvidia. (2011, May). Tesla GPU Computing: Supercomputing at 1/10th the Cost [Online]. Available: http://www.nvidia.com/docs/IO/96958/Tesla_Master_Deck.pdf
[11] Graphic Remedy. gDEBugger Main Page [Online]. Available: http://www.gremedy.com/
[12] M. Clerc, "TRIBES - un exemple d'optimisation par essaim particulaire sans paramètres de contrôle," in Optimisation par Essaim Particulaire (OEP 2003), Paris, France, 2003.
[13] Nvidia Forums. (2010, Apr. 15). My kernel works only if I compile it with -cl-opt-disable [Online]. Available: http://forums.nvidia.com/index.php?showtopic=166385
