Sie sind auf Seite 1von 17

GENETIC PROGRAMMING FOR

GENERATING PROTOTYPES IN
CLASSIFICATION PROBLEMS

Presented by:
Tarundeep Dhot
Dept of ECE
Concordia University
Acknowledgements

This presentation is based on a research paper written by the following


authors:
L.P. Cordella, C. De Stefano, F. Fontanella, A Marcelli

This paper was published at:


The 2005 IEEE Congress on Evolutionary Computation

This presentation is solely meant for educational purposes.


Salient Features

 GP based approach for generating prototypes in


a classification problem is proposed.
 The set of prototypes is encoded by a multitree.
 Multitrees are set of trees which represent the
chromosome.
 Chromosomes are of variable length allowing
classification for problems consisting of classes
as well as further sub-classes.
WHAT IS CLASSIFICATION?
 Categorization: the act of distributing things into classes or
categories of the same type.
X1

X2

Type X
Data SYSTEM /
Sets CLASSIFIER
(Labeled)
Y1
Fig: Training Phase
Type Y
 Classification problems undergo a supervised training phase.
 Set of labeled data samples indicating the class it belongs to is provided to
the system.
 Based on such data, unknown data is classified using rules, decision trees
or mathematical functions.
PROPOSED APPROACH
 Provides a new GP based method for determining the prototypes in a c-
class problem where c ≥ 2.

 Prototypes describing samples belonging to ‘c’ different classes consists


of logical expressions.
PROTOTYPE 1 PROTOTYPE 2 PROTOTYPE 3

X1 X2 X3 Y1 Y2 Y3 Z1 Z2

 Each prototype represents a cluster of samples in the training set.


 Each prototype consists of a set of assertions (logical predicates)
connected by Boolean operators.
 The value of a particular feature in a sample depends on such assertions.

 A class can be represented by a variable number of logical expressions


i.e. the length of a single expression (as well as its predicates) is variable,
thus, the devised approach is able to automatically find subclasses present
in the data set.
PROPOSED APPROACH (cont:)


The set of prototypes describing the classes makes up a single
individual of the evolving population.

Each prototype is encoded as a derivation tree, thus, an individual
is a multi-tree (list of trees).

Classification consists of attributing a sample to one of the classes
i.e. associating the sample to one of the prototypes.

Fitness function or value is the recognition rate obtained on the
training set of an individual.
 Selection is done based on the fitness values of individuals.

At the end of the process, the best individual obtained, constitutes
the set of prototypes to be used for the considered application.
DESCRIPTION OF THE APPROACH

 Set of prototypes
 Each prototype characterizes a class or subclass
 Each class or subclass consists of set of logical expressions
 Each expression may contain variable number of predicates.
 Each predicate establishes a condition on the value of a particular feature it is
representing. If all predicates of an expression are satisfied by values in the feature
vector describing a sample, we say the expression matches the sample.
DESCRIPTION OF THE APPROACH (cont:)

CLASSIFICATION: Given a data set and a set of labeled expressions, the


classification task is performed in the following way:
Each sample of the data set is matched against the set of expressions and is either
assigned to one of them or is rejected. Different cases may occur:
 Sample matches ONE expression: it is assigned to it.
 Sample matches more than one expression with DIFFERENT number of predicates:
it is assigned to the expression with the SMALLEST number of predicates.
 Sample matches more than one expression with SAME number of predicates and
different labels: sample is REJECTED.
 Sample matches NO expression: sample is REJECTED.
 Sample matches more than one expression with equal label: sample is assigned to
the class the expression belongs to.
IMPLEMENTATION OF EVOLUTIONARY APPROACH

 GP starts by randomly generating a population of p individuals.


 Phenotype  String containing logical expressions, each one encoded as a
derivation tree.
 Chromosome of an individual  Multitree
 Number of trees in an individual’s chromosome is referred as length of the
individual. Length of individual in the initial population [2, Lmax ].
 For generating a new population, best e individuals are selected and copied to
new generation  ELITISM.
 Tournament selection for remaining (p – e)/2 couples. This controls loss of
diversity and selection intensity.
 Crossover is applied on selected couples with probability factor pc.
 Mutation is applied on individuals with probability factor pm.
 Finally, these individuals are added to the new population.
 Process is repeated for NG generations.
LEARNING CLASSIFICATION RULES

Before implementing the proposed evolutionary paradigm, the following steps


must be executed:

 Structure Definition: definition of the structure to be evolved


 Training Phase and Fitness Function: choice of fitness function
 Genetic Operators: definition of genetic operators
LEARNING CLASSIFICATION RULES (cont:)

STRUCTURE DEFINITION:
 The implementation requires a program generator providing syntactically correct
programs and an interpreter to execute them.
 PROGRAM GENERATOR: Based on grammar written for S-expressions.
Grammar G is defined as a quadruple = (T, N, S, P)
where T and N are disjoint finite alphabets.
S is the starting symbol
P is the set of production rules used to define strings.
 INTERPRETER: is implemented by an automation that computes Boolean
functions. Such an automation accepts an expression as an input and returns
TRUE or FALSE as an output depending on whether the expression matches the
sample or not .
LEARNING CLASSIFICATION RULES (cont:)

TRAINING PHASE AND FITNESS FUNCTION:


The system is trained with a set containing Ntr patterns. The training set is used for
evaluating the fitness of the individuals in the population.
 Training set samples are assigned to the expressions belonging to each individual.
 Each valid expression of an individual is labeled with the most widely represented in
the corresponding cluster.
 Recognition rate for each individual is evaluated using a classifier. This rate is
assigned as the fitness value to the individual.

In order to favor those individuals able to obtain good performance with lesser number
of expressions, the fitness of each individual is increased by 0.1/Nc where Nc is the
number of expressions in an individual.
LEARNING CLASSIFICATION RULES (cont:)

GENETIC OPERATORS:
Crossover and mutation are used as the genetic operator.

Crossover: Applied to two chromosomes C1 and C2 and yields two new


chromosomes by swapping the lists of the initial chromosomes.

Fig: (a) and (b) Chromosome C1 and C2 of length 4 and 3.


(c) and (d) Chromosomes obtained after crossover at t1 = 2 and t2 = 1
LEARNING CLASSIFICATION RULES (cont:)

GENETIC OPERATORS:

Mutation: It is independently applied to every tree of the chromosome C with


probability pm. More specifically, mutation operator is applied by randomly
choosing a single non-terminal node Ti in a given tree Ts.
If there are n non-terminal nodes in a chromosome C, probability of mutating
each single node of C = pm / n.
EXPERIMENTAL RESULTS

The proposed approach was tested on three standard databases (IRIS,


BUPA and Vehicle available on the UCI website and also compared another
GP based approach. The method showed better results with higher
recognition rates.

Data Sets R2 R1

IRIS 98.67 99.4

BUPA 69.87 74.3

Vehicle 61.75 66.5

Table: Average Recognition rates (%).


R1  proposed classifer R2  comparison classifier.
CONCLUSIONS

 GP based approach to prototype generation and classification is


proposed.
 Population is evolved where each individual consists of a set of possible
prototypes of the classes present in the data to be analyzed.
 A prototype consists of a set of logical expressions.
 Recognition rate obtained using each set of prototypes is used as fitness
function for controlling evolution.
 Method is able to automatically cluster data.
 The method offers greater flexibility in comparison to other methods due
to dynamic labeling mechanism of logical expressions.
 The comparison with another method show better results with the
proposed GP.
THANK YOU !!

Das könnte Ihnen auch gefallen