
DARWIN IS BACK!

A STRING GUESSING GAME (1)


Demonstration of a Genetic Algorithm

(Published as an article in “Information Technology”, Aug 2005)

Tushar Malhotra
totobogy@gmail.com

(1) The OIPRs (Open Intellectual Property Rights) for this paper are held by Tushar Malhotra. However, the
publishing rights are held by the EFY Group.
ABSTRACT
This paper introduces Genetic Algorithms as a search and optimization technique and compares them with traditional methods. It demonstrates the implementation, working and utility of a Genetic Algorithm by simulating a “String Guessing Game”. A simple statistical performance analysis of the algorithm is also discussed.

INTRODUCTION
Should the ghost of Darwin happen to visit the people involved with AI and algorithm research today, he would be pleasantly surprised to see his proposed principles being applied in all sorts of interesting applications!
Genetic Algorithms are a subdomain of Evolutionary Algorithms, which simulate evolution: the way living beings evolve into forms that are more successful and better adapted to their environments. The main and often cited principles involved are
• Survival of the Fittest
• Natural Selection
GAs simulate evolution in the following manner: we start from an initial set of possible solutions to a given problem and allow these solutions to “evolve” or improve over successive generations, trying to arrive at the exact or optimal solution.
As in nature, we go by the assumption that the solutions will improve, in general, with each successive generation under the pressures of selection and reproduction.

A typical Genetic (Evolutionary) Algorithm broadly consists of the following stages:
1. Coding of Solutions and determination of Solution Search-Space
2. Initialization of Population
3. Evaluation based on some objective function
4. Selection of individuals to reproduce and propagate the next generation
5. Reproduction, Variation, Alteration to produce better offspring solutions
6. Generation and reinsertion of new population
Steps 3 through 6 are repeated until the desired results are obtained. This is depicted in Diagram 1.

[Diagram 1: Progression of Solution Population. A repeating cycle of four stages: Evaluation based on some objective function; Selection of individuals for breeding to propagate next generation; Reproduction, Mutation, Variation to produce better offspring; Generation and reinsertion of new Population.]

A variety of algorithms/procedures are available to choose from at each step of the evolution, depending on the specific problem at hand. For details on the various aspects of, and techniques available at, each stage of a Genetic Algorithm, readers are referred to [1].

HOW DO GENETIC ALGORITHMS DIFFER FROM THE TRADITIONAL SEARCH AND OPTIMIZATION TECHNIQUES?
(See [1])
• Genetic algorithms are most useful for optimization problems with a solution search-space so large that conventional search techniques become inefficient at locating the optimal solution.
• The optimization is achieved by the use of some objective evaluation function. Thus no derivative information is required.
• A typical GA provides a number of potentially optimal or near optimal solutions to the given problem. This is especially useful for optimization problems which do not have a single optimal solution (e.g. scheduling problems like Activity selection; Shortest paths, Minimum spanning trees etc.). In these cases a genetic algorithm facilitates determination of various alternative solutions simultaneously.
• For the above reason, Genetic Algorithms are inherently Anytime Algorithms. That is, they can be interrupted at any time to yield a selection of the fittest solutions reached so far. This characteristic is of much practical importance.
• Genetic Algorithms deal with a population of potential solutions instead of individual solutions. Thus they are typical candidates for parallel processing due to their inherent parallel search nature.

I now demonstrate the implementation and working of a Genetic Algorithm through a simple problem and its solution. I will explain why the problem is a suitable candidate for a GA solution, implement the solution and analyze its working. The discussion will serve as an illustrative example for understanding the key concepts involved.

A STRING GUESSING GAME (2)

Let us play a guessing game! This is how it goes: you think of an n-letter English word (specifically, an n-letter string of characters drawn from the English alphabet; the word may or may not be meaningful), while I try to guess your choice. Simple, right?
Let’s see who is better poised to win the game.
What is the probability that by randomly guessing a word, I would hit upon the same word that you chose? Or, in how many attempts will I be able to guess your choice?
The answer to both these questions is disappointing (for me!).
Even for a single letter word the probability that I would be able to guess the selected word in the first attempt is only 1 / 26 (as there are 26 letters in the alphabet, and only one of them is correct).
Also, in the worst case, I may require as many as 26 attempts to guess the correct choice.
The complexity magnifies greatly as the length of the word increases.
Consider an 8 letter word. There are as many as 26^8 = 208827064576 (over two hundred billion) such strings possible. The probability of guessing the correct solution in the first attempt is minuscule (1 / 208827064576; see if you can calculate that, even my calculator couldn’t!). And in the worst case, I may require that many attempts (read a lifetime, perhaps more) to guess the correct word.
Even if I implement an algorithm which tries to guess the correct string directly, chances are that I will give up before the computer can pop up the winning guess.
The writing is on the wall: I will invariably lose this game unless I do something radical.
But that is exactly what I will do!

(2) The author is indebted to [2] for the idea behind this illustration.

The Winning Formula: A Genetic Algorithm Recipe

Because any conventional algorithm would necessarily prove inefficient as shown above, I devise the following GA to beat you in the game.

1. Population Initialization

I choose a population size, N (N = 100, say) and generate the N initial individuals (here strings of letters of the English alphabet) randomly. For simplicity, I keep the population size fixed for each generation, which means at least some of the individuals of each generation must perish.

2. Fitness Evaluation of a Generation

Each individual in a generation is evaluated in the following manner:
• The maximum possible score for any string is twice the number of letters in it.
• For each letter that matches the original string at the same position, two points are awarded to the string.
• For each letter which appears in the original string but not at the same position at which it appears in the string being evaluated, one point is awarded.
• Only the correct solution can have (and has) the maximum score.
Note that this scheme of assigning scores to the strings is different from our earlier assumption where the string was either accepted or rejected. In the game, for example, if I ask you to evaluate my guesses in the above manner, it might simplify my task but only to a small extent. So I have changed the rules of the game slightly! However, note that the evaluation scheme is objective. I still do not know which letter contributed how much to the score. Thus I do not use any derivative information, as pointed out earlier.

3. Selection

I use Rank Assignment and Truncation Selection techniques. After each string has been evaluated, all the individuals of a generation are ranked on the basis of their scores. The top N/2 ranking strings are then selected for reproduction, while the rest of the population of that generation is simply truncated. Note that this selection technique is artificial. In nature, even the less fit individuals get a chance to reproduce and propagate. The truncation, if at all, is gradual, spanning several generations. There are more natural selection techniques available such as Proportional Assignment and Roulette Wheel Selection. I chose truncation because it was the simplest and served my purpose of demonstration
well!

4. Reproduction and Mutations

The selected strings are allowed to reproduce using the Single/Double Point Crossover operation, randomly. That is, pairing of parents is done randomly and each pair reproduces either by single or by double point crossover, arbitrarily. Moreover, the crossover point along the length of the string is also chosen randomly. Each crossover yields two offspring. The parents survive to the next generation while the children replace unfit strings from the previous generation which were not selected. After the new population has been generated, its individuals are subjected to mutations randomly. This is achieved by altering a random number (not more than half the string length) of genes (letters) in the chromosomes of some randomly selected individuals. Elitism is used to accelerate convergence towards the optimal solution by preserving the fittest individual at each generation. The fittest string is cloned and at least one copy is passed on to the next generation without being subjected to mutations. The random mutations are necessary to ensure diversity so that the population does not converge too fast and stagnate without reaching the optimal solution. The Crossover and Mutation operations are depicted in Figure 2.

NOTES ON IMPLEMENTATION

• For each generation, the program displays the average score of individuals in that generation. This is useful for analysis as I will show soon. The program also lists the top five fittest strings of each generation, besides the entire generation, along with their scores. The program is an anytime implementation in the sense that it displays the statistics of each generation (or after every n generations, determined by a macro) along with the fittest solution(s) so far. This feature may not seem very useful in this case. However, this is a distinctive and rather necessary feature of Genetic Algorithms in general, when applied to real life optimization problems.
• Various parameters like population size (PSIZE), string length (LEN) etc. are controlled using macros, which are documented in the source code. I have commented the code as well as I could! It is indeed difficult to study others’ code. However, an understanding of the exact working of the entire code is not necessary. An overview of what each function accomplishes is sufficient to see the GA in action.
• The crossover and mutation operations can be easily observed by running the program for a test string with a very small population size (say 5 or 6).
• Generation of random numbers is crucial to the algorithm. I have used the available library functions rand() and srand(). As an increased measure of randomization, I ask the user to input a random number and use it to set the seed for srand(). Though this trick suffices for the demo, practical applications commonly require more sophisticated random number generators.
• Implementation of each function is not necessarily the most efficient possible in terms of space or running time. My purpose wasn’t to optimize these but only to demonstrate the GA. For example, I use bubble sort for ranking the individuals where some other algorithms would have done much better on the above parameters.
• The population size and length of the strings are defined using macros and may be altered. However, this demonstrative algorithm does not work equally well for all possible string lengths. For very small strings (say less than 6 letters) the double point crossover operation is counterproductive. For larger strings (more than 10 letters) it is generally insufficient. Therefore I have kept the string length to 8-9 letters for demonstration.
• The source code was compiled and tested using Bloodshed Dev-C++ (MinGW32 C/C++ compiler) on an AMD Sempron 2200+ running Windows XP Professional Edition. It also compiled and ran successfully on Turbo C after minor modifications. However, the convergence rate was much slower on Turbo C due to its poor random number generation. If you are patient enough though, you will get results for sure!

PERFORMANCE ANALYSIS

The question now is: after all these efforts, does this GA thing actually work? If yes, then how well does it perform?
Fortunately, this time the answer to both the questions is pretty encouraging.
A detailed analysis of the algorithm requires information about the probability distribution of the random number generator and the use of more elaborate statistical techniques. However, the key points become clear even with the somewhat simplified analysis which I present here.
I have used the average score of the generation as a measure of convergence towards the optimal solution. I tested the algorithm on a number of strings by varying various parameters.
For 8-letter strings, the algorithm almost always constructed the correct string successfully or gave a very near approximation well within 100 generations.
Thus, with a population of 100, this algorithm could “guess” the target string by evaluating at most 100 x 100 = 10000 strings, which is only a very small fraction of the number mentioned above (specifically, 10000 / 208827064576).
Even if we assume the overhead involved to be considerable, this algorithm will, in all probability, produce results in a fraction of the time required by the direct method.

Statistics for a few sample runs of the program are listed in Table 1.
The variation of the average score of individuals of a generation over successive generations for these sample runs is depicted graphically in Figure 1(a) and 1(b).

In general it is verified, as expected, that the average score of individuals in a generation increases with each successive generation until it levels off, after which it remains more or less constant. In the sample runs of Table 1 the final averages lie between roughly 55% and 65% of the maximum possible score.
Also,
• The rate of convergence (given by the slope of the graph) is generally greater for larger population sizes.
• The algorithm converges towards the optimal solution in fewer generations for larger population sizes. However, the actual number of individuals evaluated may be more and the average score of individuals in the final generation may be less for a larger population.
• It is important to note, though, that these assertions about larger population sizes are true only in a broad sense and may not hold in each and every case. This is because the algorithm relies on non-determinism for its working. In fact it is this stochastic (probabilistic) nature which is the essence of any Genetic (Evolutionary) Algorithm and the key reason for the success of this technique.

Example 1                           Population size = 100    Population size = 500
String length                       8                        8
Random seed value                   6767                     6767
String                              “zebrafry”               “zebrafry”
No. of generations till solution    48                       31
Total no. of strings evaluated      48 x 100 = 4800          31 x 500 = 15500
Max. possible score of a string     16                       16
Avg. score of initial population    2.24                     2.208
Avg. score of final population      9.79                     9.53

Example 2                           Population size = 100    Population size = 500
String length                       9                        9
Random seed value                   5650                     5650
String                              “lostfound”              “lostfound”
No. of generations till solution    39                       18
Total no. of strings evaluated      39 x 100 = 3900          18 x 500 = 9000
Max. possible score of a string     18                       18
Avg. score of initial population    2.46                     2.722
Avg. score of final population      11.66                    10.12

Table 1: Statistical Information about two sample runs
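
To put the counts in Table 1 in perspective, the following is a minimal sketch (not part of the original program) that computes what share of the 26^LEN search space a run has evaluated. The helper names search_space(), fraction_evaluated() and print_table1_fractions() are hypothetical and introduced only for this illustration; the evaluation counts are the ones listed in Table 1.

```c
#include <stdio.h>

/* Number of distinct lowercase strings of the given length: 26^len.
   An unsigned long long holds the result for len up to 13. */
unsigned long long search_space(int len)
{
    unsigned long long n = 1;
    int i;
    for (i = 0; i < len; i++)
        n *= 26ULL;
    return n;
}

/* Share of the search space covered by `evaluated` strings of length `len`. */
double fraction_evaluated(unsigned long long evaluated, int len)
{
    return (double)evaluated / (double)search_space(len);
}

/* Prints the share for the 100 x 100 bound and for each run of Table 1. */
void print_table1_fractions(void)
{
    printf("bound 100 x 100, 8 letters: %.3e\n", fraction_evaluated(10000ULL, 8));
    printf("zebrafry,  N = 100:         %.3e\n", fraction_evaluated(4800ULL, 8));
    printf("zebrafry,  N = 500:         %.3e\n", fraction_evaluated(15500ULL, 8));
    printf("lostfound, N = 100:         %.3e\n", fraction_evaluated(3900ULL, 9));
    printf("lostfound, N = 500:         %.3e\n", fraction_evaluated(9000ULL, 9));
}
```

Calling print_table1_fractions() from a main() function prints one line per run. For the 100 x 100 bound on 8-letter strings the share is roughly 4.8 x 10^-8, that is, fewer than five strings in every hundred million.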


[Figure 1(a): Number of generations vs. average score of individuals of the generation for the string "zebrafry". X-axis: Generations (0 to 48); Y-axis: Avg. score of generation (0 to 12); one curve each for population size 100 and population size 500.]

[Figure 1(b): Number of generations vs. average score of individuals of the generation for the string "lostfound". X-axis: Generations (0 to 39); Y-axis: Avg. score of generation (0 to 14); one curve each for population size 100 and population size 500.]
Sample Output with Population Size 6

Enter a seed value: 6767
Enter the string(8 characters without space) to be constructed: zebrafry

Generation: 0
0 hsirfcna 4
1 wuctpqzy 3
2 mmhnkrbx 2
3 aqlnbjqw 2
4 rdeuvcrs 4
5 pycygnob 2
Maximum possible score of any individual: 16
Average score of generation 0 : 2.83333

Top 5 fittest individuals of generation 0 are:-

Rank 1 is individual hsirfcna having the score :4
Rank 2 is individual rdeuvcrs having the score :4
Rank 3 is individual wuctpqzy having the score :3
Rank 4 is individual mmhnkrbx having the score :2
Rank 5 is individual aqlnbjqw having the score :2
Do you wish to continue evolution?(y/n): y

Generation: 1
0 hsirfcna 4
1 rdeuvcrs 4
2 wultprzy 4
3 rdfuvcrs 4
4 xsirfcna 4
5 hsirfcna 4

Maximum possible score of any individual: 16
Average score of generation 1 : 4

Top 5 fittest individuals of generation 1 are:-

Rank 1 is individual hsirfcna having the score :4
Rank 2 is individual rdeuvcrs having the score :4
Rank 3 is individual wultprzy having the score :4
Rank 4 is individual rdfuvcrs having the score :4
Rank 5 is individual xsirfcna having the score :4

Do you wish to continue evolution?(y/n): y

Generation: 2
0 hsirfcna 4
1 rdevvzrs 5
2 hultpfzy 5
3 hsltprgr 2
4 wuirscna 3
5 hsirfcna 4

Maximum possible score of any individual: 16
Average score of generation 2 : 3.83333

Top 5 fittest individuals of generation 2 are:-

Rank 1 is individual rdevvzrs having the score :5
Rank 2 is individual hultpfzy having the score :5
Rank 3 is individual hsirfcna having the score :4
Rank 4 is individual hsirfcna having the score :4
Rank 5 is individual wuirscna having the score :3

Crossover Operation

Single Point
Parents:  abcd efgh
          stuv wxyz
Children: abcd wxyz
          stuv efgh

Double Point
[Illustration of parents and children for double point crossover]

Random Mutations

Mutations are non-reproductive operations as a single individual is involved. They are made to occur after reproduction by crossover in order to add diversity to the population. Some examples, picked from the sample output above, are shown. The mutated genes, darkened in the original figure, are the first letter in the first example and the third and sixth letters in the second.

hsirfcna mutates to xsirfcna
wuctpqzy mutates to wultprzy

Figure 2: Crossover and Mutation
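
The crossover operations of Figure 2 can be sketched compactly as follows. This is a minimal illustration under stated assumptions, not the routine from the appendix: the function names single_point() and double_point() are hypothetical, the cut positions are passed in by the caller instead of being drawn at random, and both parents are assumed to have the same length.

```c
#include <string.h>

/* Single point crossover: the children swap tails at position `cut`.
   c1 and c2 must each have room for strlen(p1) + 1 characters. */
void single_point(const char *p1, const char *p2, size_t cut, char *c1, char *c2)
{
    size_t n = strlen(p1);
    size_t i;
    for (i = 0; i < n; i++) {
        c1[i] = (i < cut) ? p1[i] : p2[i];
        c2[i] = (i < cut) ? p2[i] : p1[i];
    }
    c1[n] = '\0';
    c2[n] = '\0';
}

/* Double point crossover: the children swap the middle segment [a, b).
   The two cut positions may be given in either order. */
void double_point(const char *p1, const char *p2, size_t a, size_t b, char *c1, char *c2)
{
    size_t n = strlen(p1);
    size_t i;
    if (a > b) { size_t t = a; a = b; b = t; }
    for (i = 0; i < n; i++) {
        int inside = (i >= a && i < b);
        c1[i] = inside ? p2[i] : p1[i];
        c2[i] = inside ? p1[i] : p2[i];
    }
    c1[n] = '\0';
    c2[n] = '\0';
}
```

With the parents "abcdefgh" and "stuvwxyz" and a cut at position 4, single_point() yields "abcdwxyz" and "stuvefgh", the pair of children shown in Figure 2.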
APPENDIX: SOURCE CODE

/* A string guessing game. Author: Tushar Malhotra totobogy@ieee.org.
   This file was created using Bloodshed Dev-C++ 4.9.8.0 */

/* INCLUDES */
#include <stdio.h>  //printf, scanf
#include <stdlib.h> //rand, srand, system
#include <string.h> //strcmp, strcpy
#include <conio.h>  //getch (available with Windows/DOS compilers such as Dev-C++ and Turbo C)

//Population size (MUST be greater than 5 or you are calling for trouble)
#define PSIZE 100
#define LEN 8 //string length
#define MAX (LEN*2) //max possible score of any individual
#define STR_(x) #x
#define STR(x) STR_(x) //turns LEN into text so that scanf can limit the input width

unsigned int seed; // seed for srand();

char sspace[] = "abcdefghijklmnopqrstuvwxyz"; //solution space generator

//Prototypes
void evaluate(char*,char[][LEN+1],int*,int*); //evaluates a generation
void select(char[][LEN+1],int*,char*); //selects top N/2 individuals for reproduction
void crossover(char*,char*,char[][LEN+1],int); //performs single and double point crossover
void mutate(char[][LEN+1]); //mutates genes(letters) arbitrarily

int main()
{
    //Variable definitions
    char solstr[LEN+1] = "";
    char fittest[LEN+1] = "";
    int gen = 0;
    char pop[PSIZE][LEN+1];
    int score[PSIZE];
    double totalscore;
    int rank[PSIZE]; //rank i is the string pop[rank[i]]
    int i,j; //loop variables
    char ch;

    printf("Enter a seed value: ");
    scanf("%u",&seed); //take a random value (seed is unsigned)
    printf("Enter the string (%d characters without space in lowercase) to be guessed: ",LEN);
    scanf("%" STR(LEN) "s",solstr); //reads at most LEN characters

    //Initialize a random population
    for(i=0;i<PSIZE;i++)
    {
        seed *= rand();
        srand(seed); //set seed for rand()

        for(j=0;j<LEN;j++)
            pop[i][j] = sspace[(rand() % 26)];
        pop[i][j] = '\0';
        srand(1); //reinitialize srand()
    }

    while(strcmp(solstr,fittest) != 0)
    {
        //new generation starts
        gen++;
        totalscore = 0;

        evaluate(solstr,pop,score,rank); //Evaluate current population
        strcpy(fittest,pop[rank[0]]);

        system("CLS"); //clears the console (Windows command)
        //Display analysis of the current generation
        printf("\nGeneration %d of my guesses:-",gen);

        for(i=0;i<PSIZE;i++)
            totalscore += score[i];
        printf("\n\nAverage score of generation %d of my guesses: %f",gen,totalscore/PSIZE);
        printf("\n(This statistic is based on the scores you have assigned to my guesses.\nIt shows how good or bad I'm doing!)");
        printf("\n\nThe Fittest individual of this generation is: %s",fittest);
        printf("\n\nTop 5 fittest individuals of generation %d are:-",gen);

        for(i=0;i<5;i++)
            printf("\nRank %d is %s having the score %d",i+1,pop[rank[i]],score[rank[i]]);

        printf("\n\nDo you want me to continue guessing (Evolution!)? (y/n):");
        scanf(" %c",&ch); //skip leading whitespace, then read one character

        if(ch != 'y')
            break;

        //Selection,Variation and Reproduction to produce next gen
        select(pop,rank,fittest);
    } //end of generation loop

    printf("\n\nMy best guess till generation %d is %s with the score %d",gen,fittest,score[rank[0]]);

    if(score[rank[0]] == MAX)
        printf("\n\nI guessed the word! Watch out, I can read your mind now!");

    getch(); //wait for a key press before closing
    return 0;
}

void evaluate(char *solstr,char pop[][LEN+1], int score[PSIZE], int rank[])


{
int i,j,k,t,tmp;
int tscore[PSIZE]; //local copy for rank calculation
int mloc[LEN]; //stores match location of a pop string character in solstr(if any), for each pop str
for(i=0;i<PSIZE;i++)
{
score[i]=0;
tscore[i]=0;
for(j=0;j<LEN;j++)
{
mloc[j]= -1; //matching process yet to be started for jth character of ith pop str
for(k=0;k<LEN;k++)
{
if(pop[i][j] == solstr[k])
{
//make sure kth of solstr has not already been matched
//to some character of current string
for(t=0; (t<j)&&(mloc[t] != k); t++);
if(t<j)
continue; //not break else j will increase as well, the char may not be counted
else
mloc[j]=k;//pop[i][j] matched to solstr[k]

if(j == k)
{
score[i] +=2;
tscore[i] +=2;
}
else
{
score[i]++;
tscore[i]++;
}
break; //break and lookup next char
}
} //end for k
} //end for j
} //end for i
for(i=0;i<PSIZE;i++)
rank[i] = i;

for(i=0;i<PSIZE;i++)
{
for(j=0;j<PSIZE-i-1;j++)
{
if(tscore[j]<tscore[j+1])
{
tmp = rank[j];
rank[j] = rank[j+1];
rank[j+1] = tmp;

tmp = tscore[j];
tscore[j] = tscore[j+1];
tscore[j+1] = tmp;
}
}
}
}//end of function

void select(char pop[][LEN+1],int rank[], char *fittest)


{
char newpop[PSIZE][LEN+1];
int i,j;
//Deterministic Selection.Top PSIZE/2 selected for
//reproduction. Parents survive.

for(i=0;i<PSIZE/2;i++)
strcpy(newpop[i],pop[rank[i]]);

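//Reproduction by crossover of pairs - single pt/double point. Yields 2 children per pair.
//Children are written to slots (PSIZE/2)-1 onwards, so the lowest ranked parent is replaced too.
//The last slot may never be written by crossover; give it a valid string first
//(it receives a copy of the fittest at the end of this function anyway).
//Elitism - best preserved.

strcpy(newpop[PSIZE-1],pop[rank[0]]);

for(i=0;i<PSIZE;i+=2)
crossover(newpop[rand()%(PSIZE/2)],newpop[rand()%(PSIZE/2)],newpop,((PSIZE/2)+(i%(PSIZE/2))-1));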

//mutate random individuals


mutate(newpop);

//reinsertion of the new generation to population


for(i=0;i<PSIZE;i++)
strcpy(pop[i],newpop[i]);

strcpy(pop[PSIZE-1],pop[0]); //two copies of fittest, one is preserved


}

//Single/Double point crossover


void crossover(char *parent1, char *parent2, char newpop[][LEN+1], int index)
{
seed *= rand();
srand(seed);
int i,j;
char tmp;
int ch = rand()%2;
int cpt1 = rand() % LEN; //random crossover point
int cpt2 = rand() % LEN; //second pt;
switch(ch)
{
case 0://single point crossover
for(i=0;i<cpt1;i++)
{
newpop[index][i] = parent1[i];
newpop[index+1][i] = parent2[i];
}
for(i=cpt1;i<LEN+1;i++)
{
newpop[index][i] = parent2[i];
newpop[index+1][i] = parent1[i];
}
break;
case 1: //double point crossover
if(cpt2>cpt1)
{
for(i=0;i<cpt1;i++)
{
newpop[index][i] = parent1[i];
newpop[index+1][i] = parent2[i];
}
for(i=cpt1;i<cpt2;i++)
{
newpop[index][i] = parent2[i];
newpop[index+1][i] = parent1[i];
}
for(i=cpt2;i<LEN+1;i++)
{
newpop[index][i] = parent1[i];
newpop[index+1][i] = parent2[i];
}
}
else
{
for(i=0;i<cpt2;i++)
{
newpop[index][i] = parent1[i];
newpop[index+1][i] = parent2[i];
}
for(i=cpt2;i<cpt1;i++)
{
newpop[index][i] = parent2[i];
newpop[index+1][i] = parent1[i];
}
for(i=cpt1;i<LEN+1;i++)
{
newpop[index][i] = parent1[i];
newpop[index+1][i] = parent2[i];
}
}
break;
}//end switch

}//end of function

//Random Mutations
void mutate(char newpop[PSIZE][LEN+1])
{
int i,j,k1,k2;
char tmp;

seed *= rand();
srand(seed);

for(i=1;i<PSIZE;i++) //Elitism - preserve the best individual


for(j=0;j<(rand()%(LEN/2));j++) //mutate upto LEN/2 chars in each string, randomly
newpop[i][rand()%LEN] = sspace[rand()%26];
}//end of function

End of Code
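
The scoring and ranking performed by evaluate() in the listing above can also be sketched in a more compact form. This is an illustrative variant under stated assumptions, not the author's code: the names score_guess(), rank_population() and struct ranked are hypothetical, strings are assumed to be of equal length (at most 64 letters), and the standard library qsort() is swapped in for the bubble sort that the implementation notes mention as a candidate for improvement.

```c
#include <stdlib.h>
#include <string.h>

/* Score a guess against the target. Each letter of the guess is matched
   to the first not-yet-used occurrence of that letter in the target:
   2 points if that occurrence is at the same position, 1 point otherwise. */
int score_guess(const char *target, const char *guess)
{
    int used[64] = {0};
    int n = (int)strlen(target);
    int total = 0;
    int j, k;
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            if (guess[j] == target[k] && !used[k]) {
                used[k] = 1;
                total += (j == k) ? 2 : 1;
                break;
            }
        }
    }
    return total;
}

/* One entry per individual: its index in the population and its score. */
struct ranked { int index; int score; };

/* qsort comparator: higher scores come first. */
static int by_score_desc(const void *a, const void *b)
{
    const struct ranked *x = (const struct ranked *)a;
    const struct ranked *y = (const struct ranked *)b;
    return (y->score > x->score) - (y->score < x->score);
}

/* Fill out[0..n-1] with the individuals ordered from fittest to least fit. */
void rank_population(const char *target, const char *const pop[], int n, struct ranked out[])
{
    int i;
    for (i = 0; i < n; i++) {
        out[i].index = i;
        out[i].score = score_guess(target, pop[i]);
    }
    qsort(out, (size_t)n, sizeof out[0], by_score_desc);
}
```

For the target "zebrafry", score_guess() gives 4 for "hsirfcna" and 3 for "wuctpqzy", in agreement with generation 0 of the sample output. Note that qsort() does not promise any particular order among individuals with equal scores, so ties may be ranked differently from the bubble sort in the listing.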
REFERENCES

[1] Hartmut Pohlheim, "Evolutionary Algorithms: Overview, Methods & Operators", Documentation for GEATbx version 3.5 (Evolutionary & GA toolbox for MATLAB), July 2004.

[2] M. Marjit Singh, "Genetic Algorithms: Inspired by Nature", Information Technology, July 2004.

[3] Creed F. Jones III and Michael Christman, "Genetic Algorithm Solution of Vigenere Alphabetic Codes", Virginia Polytechnic & State University, 2001.

[4] The World Wide Web