1 Introduction
Among the manycore architectures available nowadays, GPGPU (General Purpose
Graphic Processing Unit) cards offer one of the most attractive cost/performance
ratios. However, programming such machines is a difficult task. This paper
focuses on a specific kind of resource-consuming application: evolutionary
algorithms. It is well known that such algorithms offer efficient solutions to
many optimization problems, but they usually require a great number of evalu-
ations, making processing power a limiting factor on standard micro-processors.
Fortunately, their algorithmic structure clearly exhibits computationally costly
parts that can be naturally parallelized. GPGPU programming constraints, how-
ever, call for dedicated operations to obtain efficient parallel execution, one of
the main performance-relevant constraints being the time needed to transfer
data from the host memory to the GPGPU memory.
This paper starts by presenting evolutionary algorithms and studying them
to determine where parallelization could take place. Then, GPGPU cards are
presented in section 3, and a proposal on how evolutionary algorithms could
be parallelized on such cards is described in section 4. Experiments are run
on two benchmarks and two NVidia cards in section 5, and related works are
described in section 7. Finally, results and future developments are discussed
in the conclusion.
2 Presentation of Evolutionary Algorithms
In [5], Darwin suggests that species evolve through two main principles: variation
in the creation of new children (that are not exactly like their parents) and
survival of the fittest, as many more individuals of each species are born than
can possibly survive.
Evolutionary Algorithms (EAs) [9] take their inspiration from this paradigm
to suggest a way to answer the following interesting question: given a set of
parameter values that have already been tried out and evaluated, how can one
use the accumulated knowledge to choose a new set of parameters to try out
(and therefore do better than a random search)? EAs rely on artificial Darwinism
to do just that: they create new potential solutions from variations on good
individuals, and keep a constant population size through selection of the best
solutions. The Darwinian inspiration for this paradigm leads to borrowing
some specific vocabulary from biology: given an initial set of evaluated potential
solutions (called a population of individuals), parents are selected among the best
to create children thanks to genetic operators (that Darwin called “variation”
operators), such as crossover and mutation. Children (new potential solutions)
are then evaluated and from the pool of parents and children, a replacement
operator selects those that will make it to the new generation before the loop is
started again.
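The generational loop described above can be sketched as a minimal C program. The sphere fitness function, binary tournament selection, population size, and operator rates below are illustrative assumptions, not the paper's actual setup; comments mark which steps are independent (and hence parallelisable) and which are not.

```c
#include <stdlib.h>

#define POP  16   /* population size (illustrative) */
#define DIM  4    /* genome length (illustrative) */
#define GENS 20   /* number of generations */

typedef struct { float x[DIM]; float fit; } Ind;

static float frand(void) { return (float)rand() / (float)RAND_MAX * 2.f - 1.f; }

/* Placeholder fitness: sphere function, to be minimised. */
static float evaluate(const Ind *a) {
    float s = 0.f;
    for (int i = 0; i < DIM; i++) s += a->x[i] * a->x[i];
    return s;
}

/* Binary tournament: each selection is independent, hence parallelisable,
   because the same parent may legitimately be picked several times. */
static const Ind *tournament(const Ind *pop) {
    const Ind *a = &pop[rand() % POP], *b = &pop[rand() % POP];
    return a->fit < b->fit ? a : b;
}

/* Run the evolutionary loop and return the best final fitness. */
float run_ea(unsigned seed) {
    Ind pop[POP], child[POP];
    srand(seed);
    for (int i = 0; i < POP; i++) {              /* parallel: independent init */
        for (int d = 0; d < DIM; d++) pop[i].x[d] = frand();
        pop[i].fit = evaluate(&pop[i]);          /* parallel: independent eval */
    }
    for (int g = 0; g < GENS; g++) {
        for (int i = 0; i < POP; i++) {          /* parallel: create children */
            const Ind *p1 = tournament(pop), *p2 = tournament(pop);
            for (int d = 0; d < DIM; d++) {
                float a = (float)rand() / (float)RAND_MAX;
                child[i].x[d] = a * p1->x[d] + (1 - a) * p2->x[d]; /* crossover */
                if (rand() % 10 == 0) child[i].x[d] += 0.1f * frand(); /* mutation */
            }
            child[i].fit = evaluate(&child[i]);  /* parallel: independent eval */
        }
        /* sequential: replacement (here, simple elitism over the child pool) */
        int best = 0;
        for (int i = 1; i < POP; i++) if (pop[i].fit < pop[best].fit) best = i;
        Ind elite = pop[best];
        for (int i = 0; i < POP; i++) pop[i] = child[i];
        pop[0] = elite;
    }
    int best = 0;
    for (int i = 1; i < POP; i++) if (pop[i].fit < pop[best].fit) best = i;
    return pop[best].fit;
}
```

In the approach studied in this paper, only the evaluation step is ultimately deported to the GPGPU; the loop above merely marks where parallelism is available in principle.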
The algorithm presented in figure 1 contains several steps that may or may not
be independent. To start with, population initialisation is inherently parallel,
because all individuals are created independently (usually with random values).
Then, all newly created individuals need to be evaluated. But since they
are all evaluated independently using a fitness function, evaluation of the pop-
ulation can be done in parallel. It is interesting to note that in evolutionary
algorithms, evaluation of individuals is usually the most CPU-consuming step
of the algorithm, due to the high complexity of the fitness function.
Once a parent population has been obtained (by evaluating all the individu-
als of the initial population), one needs to create a new population of children. In
order to create a child, it is necessary to select some parents on which variation
operators (crossover, mutation) will be applied. In evolutionary algorithms, se-
lection of parents is also parallelizable because one parent can be selected several
times, meaning that independent selectors can select whoever they wish without
any restrictions.
Creation of a child out of the selected parents is also a totally independent
step: a crossover operator needs to be called on the parents, followed by a mu-
tation operator on the created child.
So up to now, all steps of the evolutionary loop are inherently parallel but
for the last one: replacement. In order to preserve diversity in the successive
generations, the (N + 1)-th generation is created by selecting some of the best
individuals of the parents+children populations of generation N. However, if
an individual is allowed to appear several times in the new generation, it could
rapidly become predominant in the population, inducing a loss of diversity that
would reduce the exploratory power of the algorithm.
Therefore, evolutionary algorithms impose that all individuals of the new
generation be different. This is a real restriction on parallelism, since it means
that the selection of N survivors cannot be made independently: otherwise,
the same individual could be selected several times by several independent
selectors.
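The constraint can be seen in a minimal sketch of such a replacement operator: because every pick removes the chosen individual from further consideration, the picks cannot be made by independent selectors. The function below is a hypothetical illustration (minimisation, pool capped at 256 for simplicity), not the paper's operator.

```c
#include <stddef.h>

/* Select the n_keep distinct best indices from a pool of n_pool fitnesses
   (lower is better). The shared `taken` state makes the step sequential:
   independent selectors could pick the same individual twice. */
void select_survivors(const float *fit, size_t n_pool,
                      size_t *out, size_t n_keep) {
    unsigned char taken[256] = {0};       /* assumes n_pool <= 256 */
    for (size_t k = 0; k < n_keep; k++) {
        size_t best = n_pool;             /* sentinel: none found yet */
        for (size_t i = 0; i < n_pool; i++)
            if (!taken[i] && (best == n_pool || fit[i] < fit[best]))
                best = i;
        taken[best] = 1;                  /* duplicates become impossible */
        out[k] = best;
    }
}

/* Small self-check on made-up data: pool {3,1,2,0}, keep 2. */
size_t survivor_at(size_t k) {
    const float fit[4] = {3.f, 1.f, 2.f, 0.f};
    size_t out[2];
    select_survivors(fit, 4, out, 2);
    return out[k];
}
```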
Finally, one could wonder whether several generations could evolve in parallel.
The fact that generation (N + 1) is based on generation N invalidates this idea.
3 GPGPU architecture
GPGPU and classic CPU designs are very different. GPGPUs come from the
gaming industry and are designed for 3D rendering, and they inherit specific
features from this usage. For example, they feature several hundred execution
units grouped into SIMD bundles that have access to a small amount of shared
memory (16KB on the NVidia 8800GTX that was used for this paper), a large
memory space (several hundred megabytes), a special access mode for texture
memory and a hardware scheduling mechanism.
The 8800GTX GPGPU card features 128 stream processors (compared to
4 general purpose processors on the Intel Quad Core) even though both chips
share a similar number of transistors (681 million for the 8800GTX vs 582 million
for the Intel Quad Core). This is made possible by a simplified architecture
that comes with some serious drawbacks. For instance, the stream processors
are not all independent: they are grouped into SIMD bundles (16 SPMD bundles
of 8 SIMD units on the 8800GTX, which saves 7 fetch and dispatch units per
bundle). Moreover, space-consuming cache memory is simply not available on
GPGPUs, meaning that all memory accesses (which cost only a few cycles on a
CPU when the data is already in the cache) take several hundred cycles.
Fortunately, some workarounds are provided. For instance, the hardware
scheduling mechanism runs a bundle of threads called a warp at the same time,
swapping between warps as soon as a thread of the current warp stalls on a
memory access, so memory latency can be hidden by warp scheduling. But there
is a limit to what can be done: enough parallel tasks must be available for
scheduling while waiting for the memory. A thread's state is not saved into
memory; it stays on the execution unit (as in hyper-threading), so the number
of registers used by a task directly impacts the number of tasks that can be
scheduled on a bundle of stream processors. Finally, there is a limit on the
number of schedulable warps (24 warps, i.e. 768 threads, on the 8800GTX).
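The register-pressure arithmetic can be made concrete with a rough estimate. The 8192-register file size per multiprocessor used below is our assumption for the G80 generation (it is not stated in the text), and allocation granularity is ignored; the 768-thread cap is the figure quoted above.

```c
/* Rough occupancy estimate: resident threads per multiprocessor are capped
   both by the hardware limit (768 threads = 24 warps of 32 on the 8800GTX)
   and by register pressure. Register file size of 8192 is an assumption for
   the G80 generation; warp allocation granularity is ignored. */
int resident_threads(int regs_per_thread) {
    const int reg_file = 8192, hw_limit = 768;
    int by_regs = reg_file / regs_per_thread;
    return by_regs < hw_limit ? by_regs : hw_limit;
}
```

Under these assumptions, a kernel using 10 registers per thread stays at the 768-thread cap, while one using 32 registers drops to 256 resident threads, leaving far fewer warps to hide memory latency.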
All these quirks make it very difficult for standard programs to exploit the
real power of these graphic cards.
As has been shown in section 2.1, it is possible to parallelize most of the evolu-
tionary loop. However, whether it is worth running everything in parallel on the
GPGPU card is another matter: in [8, 7, 11], the authors implemented complete
algorithms on GPGPU cards, but clearly show that doing so is very difficult,
for quite small performance gains.
Rather than going this way, the choice made for this paper was to keep every-
thing simple, and start by experimenting with the obvious idea of only paralleliz-
ing children evaluation, based on the three following considerations.
5 Experiments
Two implementations have been tested: a toy problem with interesting tuneable
parameters that allow observing the behaviour of the GPGPU card, and a much
more complex real-world problem, to make sure that the GPGPU processors are
also able to run more complex fitness functions. In fact, the 400 code lines of the
real-world evaluation function were programmed by a chemist who had no idea
how to use a GPGPU card.
Weierstrass functions are very interesting to use as a test case of CPU usage
in evolutionary computation since they provide two parameters that can be
adjusted independently.
[Figure] Fig. 2. Left: Host CPU (top) and CPU+8800GTX (bottom) time for 10 generations
of the Weierstrass problem on an increasing population size. Right: CPU+8800GTX
curve only, for increasing numbers of iterations and increasing population sizes.
[Figure] Fig. 3. Left: determination of genome size overhead on a very short evaluation
(curves for CPU and for 40B, 2KB and 4KB genomes at iteration 10). Right: same
curves as figure 2 right, but for the GTX260 card (iterations 10, 70 and 120).
[Figure] Fig. 4. Left: evaluation times for increasing population sizes on host CPU (top) and
host CPU + GTX260 (bottom). Right: CPU + GTX260 total time.
First, while the 3.60GHz CPU evaluated the 20,000 individuals in only 23 sec-
onds, which seemed really fast considering the very complex evaluation function,
the GPGPU version took around 80 seconds, which was disappointing.
When looking at the genome of the individuals, it appeared that it was coded
in a strange structure: an array of 4 pointers towards 4 other arrays of 3 floats.
This structure seemed much too complex to access, so it was suggested to flatten
it into a single array of 12 floats. This was easy to do, but unfortunately the
whole evaluation function was made of pointers to parts of the previous genome
structure. After some hard work, everything was turned back into pointer-less
flat code, and evaluation time for the 20,000 individuals instantly dropped from
80 seconds down to 0.13 seconds. One conclusion to draw from this experience
is that, as expected, GPGPUs are not very good at allocating, copying and
de-allocating memory.
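The flattening step described above can be sketched as follows. The structure layouts are taken from the text (4 pointers to arrays of 3 floats, flattened into 12 contiguous floats); the function and field names are hypothetical.

```c
#include <string.h>

/* The pointer-based genome described in the text:
   4 pointers to separate arrays of 3 floats each. */
typedef struct { float *part[4]; } PtrGenome;

/* Flattened version: one contiguous block of 12 floats, copyable to the
   card in a single transfer and free of host pointers, which would be
   meaningless in GPGPU memory. */
typedef struct { float x[12]; } FlatGenome;

void flatten(const PtrGenome *src, FlatGenome *dst) {
    for (int p = 0; p < 4; p++)
        memcpy(&dst->x[p * 3], src->part[p], 3 * sizeof(float));
}

/* Tiny self-check on made-up data {0..11}: returns the last flattened value. */
float flatten_demo(void) {
    float a[3] = {0, 1, 2}, b[3] = {3, 4, 5};
    float c[3] = {6, 7, 8}, d[3] = {9, 10, 11};
    PtrGenome g = { { a, b, c, d } };
    FlatGenome f;
    flatten(&g, &f);
    return f.x[11];
}
```

The evaluation function then indexes the flat array directly, which is what removed the costly pointer chasing on the card.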
Back on the host CPU, the new function took 7.66s to evaluate 20,000
individuals, meaning that all in all, the speedup offered by the GPGPU card
is nearly 60 on the new GTX260 (figure 4 left). Figure 4 right shows only the
GTX260 curve.
EASEA4 [3] is a software platform that was originally designed to help non-
expert programmers try out evolutionary algorithms to optimise their applied
problems.
4
EASEA (pron. [i:zi:]) stands for EAsy Specification for Evolutionary Algorithms.
    \User classes :
    GenomeClass {
      float x[SIZE];
    }

    \GenomeClass::initialiser :
    for(int i=0; i<N; i++) {
      Genome.x[i] = random(-1.0,1.0);
    }

    \GenomeClass::mutator :
    for (int i=0; i<N; i++)
      if (tossCoin(pMutProb)){
        Genome.x[i] += SIGMA*random(0.,1.);
        Genome.x[i] = MAX(X_MIN, MIN(X_MAX, Genome.x[i]));
      }

    \GenomeClass::crossover:
    for(i=0; i<N; i++){
      float a = (float)randomLoc(0.,1.);
      if (&child1)
        child1.x[i] = a*parent1.x[i] + (1-a)*parent2.x[i];
    }

    \GenomeClass::evaluator:
    float res = 0., b=2.;
    float h=.25, val[SIZE];
    int i,k;
    for(i = 0; i<N; i++){
      val[i] = 0.;
      for (k=0; k<ITER; k++)
        val[i] += pow(b,-(float)k*h) * sin(pow(b,(float)k)*Genome.x[i]);
      res += Abs(val[i]);
    }
    return (res);
5
Although available on all recent GPGPU cards, using double precision variables
will apparently slow down the calculations considerably on current GPGPU cards;
this has not been tested yet, as all this work has been done on an 8800GTX card
that can only manipulate floats.
Adding the -cuda option to this compiler is very important, since it not
only allows replication of the presented work, but also gives non-GPGPU-expert
programmers the possibility to run their own code on these powerful parallel
cards.
7 Related Work
Even though many papers have been written on the implementation of Genetic
Programming algorithms on GPGPU cards, only three papers were found on the
implementation of standard evolutionary algorithms on these cards.
In [11], Yu et al. implement a refined fine-grained algorithm with a 2D
toroidal population structure stored as a set of 2D textures, which imposes re-
strictions on mating individuals (that must be neighbours). Other constraints
arise, such as the need to store a matrix of random numbers in GPGPU memory
for future reference, since there is no random number generator on the card.
Still, a 10-fold speedup is obtained, but on a huge population of 512 × 512
individuals.
In [7], Fok et al. find that standard genetic algorithms are ill-suited to run
on GPGPUs because of such operators as crossovers “that would slow down
execution when executed on the GPGPU” and therefore choose to implement
a crossover-less Evolutionary Programming algorithm [6] here again entirely on
the GPGPU card. The obtained speedup of their parallel EP “ranges from 1.25
to 5.02 when the population size is large enough.”
In [8], Li et al. implement a Fine Grained Parallel Genetic Algorithm once
again on the GPGPU, to “avoid massive data transfer.” Strangely, they imple-
ment a binary genetic algorithm even though GPGPUs have no bitwise opera-
tors, and therefore go to a lot of trouble to implement simple genetic operators.
To our knowledge, no paper has proposed the simple approach of only paralleliz-
ing the evaluation of the population on the GPGPU card.
8 Conclusion
Results show that deporting the children population onto the GPGPU card for a
parallel evaluation yields quite significant speedups of up to 100 on a $250 GTX260
card, in spite of the overhead induced by the population transfer.
Being faster by around 2 orders of magnitude is a real breakthrough in evo-
lutionary computation, as it will allow applied scientists to find new results in
their domains. Researchers in artificial evolution will then need to modify their
algorithms to adapt them to such speeds, which will probably lead to premature
convergence, for instance. Moreover, unlike many other works that are difficult
(if not impossible) to replicate, know-how on the parallelization of evolutionary
algorithms has been integrated into the EASEA language. Researchers who would
like to try out these cards can simply specify their algorithm using EASEA and
the compiler will parallelize the evaluation.
Many improvements can still be expected. Load balancing could probably
be improved, in order to maximize bundle throughput. Using the texture cache
may be interesting for evaluation functions that repeatedly access genome data.
Automatic use of shared memory could also yield good results, particularly for
local variables in the evaluation function.
Finally, an attempt to implement evolutionary algorithms on Sony/Toshiba/IBM
Cell multicore chips is currently being made. Its integration into the EASEA lan-
guage would allow comparing the performance of the GPGPU and Cell architec-
tures on identical programs.
References
1. L. A. Baumes, M. Moliner, and A. Corma. Design of a full-profile matching so-
lution for high-throughput analysis of multi-phases samples through powder x-ray
diffraction. Chemistry - A European Journal, In Press.
2. L. A. Baumes, M. Moliner, N. Nicoloyannis, and A. Corma. A reliable methodology
for high throughput identification of a mixture of crystallographic phases from
powder x-ray diffraction data. CrystEngComm, 10:1321–1324, 2008.
3. P. Collet, E. Lutton, M. Schoenauer, and J. Louchet. Take it EASEA. In Parallel
Problem Solving from Nature VI, pages 891–901. Springer, LNCS, 2000.
4. A. Corma, M. Moliner, J. M. Serra, P. Serna, M. J. Diaz-Cabanas, and L. A.
Baumes. A new mapping/exploration approach for ht synthesis of zeolites. Chem-
istry of Materials, pages 3287–3296, 2006.
5. C. Darwin. On the Origin of Species by Means of Natural Selection or the Preser-
vation of Favoured Races in the Struggle for Life. John Murray, London, 1859.
6. D. B. Fogel. Evolving artificial intelligence. Technical report, 1992.
7. K.-L. Fok, T.-T. Wong, and M.-L. Wong. Evolutionary computing on consumer
graphics hardware. Intelligent Systems, IEEE, 22(2):69–78, March-April 2007.
8. J.-M. Li, X.-J. Wang, R.-S. He, and Z.-X. Chi. An efficient fine-grained parallel
genetic algorithm based on gpu-accelerated. In Network and Parallel Computing
Workshops, 2007. NPC Workshops. IFIP International Conference on, pages 855–
862, 2007.
9. K. De Jong. Evolutionary Computation: a Unified Approach. MIT Press, 2005.
10. R. A. Young. The Rietveld Method. OUP and International Union of Crystallog-
raphy, 1993.
11. Q. Yu, C. Chen, and Z. Pan. Parallel genetic algorithms on programmable graphics
hardware. In Advances in Natural ComputationICNC 2005, Proceedings, Part III,
volume 3612 of LNCS, pages 1051–1059, Changsha, August 27-29 2005. Springer.