A Genetic Algorithm for Text Classification Rule Induction

Adriana Pietramala¹, Veronica L. Policicchio¹, Pasquale Rullo¹,², and Inderbir Sidhu³

¹ University of Calabria, Rende, Italy
{a.pietramala,policicchio,rullo}@mat.unical.it
² Exeura s.r.l., Rende, Italy
³ Kenetica Ltd, Chicago, IL, USA
isidhu@computer.org
Abstract. This paper presents a Genetic Algorithm, called Olex-GA, for the induction of rule-based text classifiers of the form "classify document d under category c if t_1 ∈ d or ... or t_n ∈ d and not (t_{n+1} ∈ d or ... or t_{n+m} ∈ d) holds", where each t_i is a term. Olex-GA relies on an efficient several-rules-per-individual binary representation and uses the F-measure as the fitness function. The proposed approach is tested over the standard test sets Reuters-21578 and Ohsumed and compared against several classification algorithms (namely, Naive Bayes, Ripper, C4.5, SVM). Experimental results demonstrate that it achieves very good performance on both data collections, proving competitive with (and in some cases superior to) the evaluated classifiers.
1 Introduction
Text Classification is the task of assigning natural language texts to one or more thematic categories on the basis of their contents.
A number of machine learning methods have been proposed in the last few years, including k-nearest neighbors (k-NN), probabilistic Bayesian, neural networks and SVMs. Overviews of these techniques can be found in [13, 20].
Along a different line, rule learning algorithms, such as [2, 3, 18], have become a successful strategy for classifier induction. Rule-based classifiers offer the desirable property of being readable and, thus, easy for people to understand (and, possibly, modify).
Genetic Algorithms (GAs) are stochastic search methods inspired by biological evolution [7, 14]. Their capability to provide good solutions to classical optimization tasks has been demonstrated by various applications, including TSP [9, 17] and Knapsack [10]. Rule induction is also one of the application fields of GAs [1, 5, 15, 16]. The basic idea is that each individual encodes a candidate solution (i.e., a classification rule or a classifier), and that its fitness is evaluated in terms of predictive accuracy.
W. Daelemans et al. (Eds.): ECML PKDD 2008, Part II, LNAI 5212, pp. 188-203, 2008.
© Springer-Verlag Berlin Heidelberg 2008
In this paper we address the problem of inducing propositional text classifiers of the form

c ← (t_1 ∈ d ∨ ... ∨ t_n ∈ d) ∧ ¬(t_{n+1} ∈ d ∨ ... ∨ t_{n+m} ∈ d)

where c is a category, d a document and each t_i a term (n-gram) taken from a given vocabulary. We denote a classifier for c as above by H_c(Pos, Neg), where Pos = {t_1, ..., t_n} and Neg = {t_{n+1}, ..., t_{n+m}}. Positive terms in Pos are used to cover the training set of c, while negative terms in Neg are used to keep precision under control.
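The semantics of such a classifier is easy to operationalize. A minimal sketch in Python, with hypothetical term sets (this is an illustration, not the authors' implementation):

```python
def classify(doc_terms, pos, neg):
    """Apply H_c(Pos, Neg): classify a document under c iff it
    contains at least one positive term and no negative term."""
    return any(t in doc_terms for t in pos) and \
           not any(t in doc_terms for t in neg)

# Hypothetical term sets for a category such as "grain"
pos = {"wheat", "corn", "harvest"}
neg = {"dividend", "stock"}

print(classify({"wheat", "export", "price"}, pos, neg))  # True
print(classify({"wheat", "stock"}, pos, neg))            # False
```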
The problem of learning H_c(Pos, Neg) is formulated as an optimization task (hereafter referred to as MAX-F) aimed at finding the sets Pos and Neg which maximize the F-measure when H_c(Pos, Neg) is applied to the training set. MAX-F can be represented as a 0-1 combinatorial problem and, thus, the GA approach turns out to be a natural candidate resolution method. We call Olex-GA the genetic algorithm designed for the task of solving problem MAX-F.
Thanks to the simplicity of the hypothesis language, in Olex-GA an individual represents a candidate classifier (instead of a single rule). The fitness of an individual is expressed in terms of the F-measure attained by the corresponding classifier when applied to the training set. This several-rules-per-individual approach (as opposed to the single-rule-per-individual approach) provides the advantage that the fitness of an individual reliably indicates its quality, as it is a measure of the predictive accuracy of the encoded classifier rather than of a single rule.
Once the population of individuals has been suitably initialized, evolution takes place by iterating elitism, selection, crossover and mutation, until a predefined number of generations is created.
Unlike other rule-based systems (such as Ripper) or decision tree systems (like C4.5), the proposed method is a one-step learning algorithm which does not need any post-induction optimization to refine the induced rule set. This is clearly a notable advantage, as rule-set refinement algorithms are rather complex and time-consuming.
The experimentation over the standard test sets Reuters-21578 and Ohsumed confirms the goodness of the proposed approach: on both data collections, Olex-GA proved highly competitive with some of the top-performing learning algorithms for text categorization, notably, Naive Bayes, C4.5, Ripper and SVM (both polynomial and rbf). Furthermore, it consistently outperformed the greedy approach to problem MAX-F we reported in [19]. In addition, Olex-GA turned out to be an efficient rule induction method (e.g., faster than both C4.5 and Ripper).
In the rest of the paper, after providing a formulation of the learning optimization problem (Section 2) and proving its NP-hardness, we describe the GA approach proposed to solve it (Section 3). Then, we present the experimental results (Section 5) and provide a performance comparison with some of the top-performing learning approaches (Section 6). Finally, we briefly discuss the relation to other rule induction systems (Sections 7 and 8) and give the conclusions.
2 Basic Definitions and Problem Statement
In the following we assume the existence of:
1. a finite set C of categories, called classification scheme;
2. a finite set D of documents, called corpus; D is partitioned into a training set TS, a validation set and a test set; the training set, along with the validation set, represents the so-called seen data, used to induce the model, while the test set represents the unseen data, used to assess the performance of the induced model;
3. a binary relationship which assigns each document d ∈ D to a number of categories in C (ideal classification). We denote by TS_c the subset of TS whose documents belong to category c (the training set of c);
4. a set Φ = {f_1, ..., f_k} of scoring functions (or feature selection functions), such as Information Gain, Chi Square, etc. [4, 22], used for vocabulary reduction (we will hereafter often refer to them simply as functions).

A vocabulary V(k, f) over the training set TS is a set of terms (n-grams) defined as follows: let V_c(k, f) be the set of the k terms occurring in the documents of TS_c that score highest according to the scoring function f; then, V(k, f) = ∪_{c ∈ C} V_c(k, f), that is, V(k, f) is the union of the k best terms, according to f, of each category of the classification scheme.
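As an illustration, the construction of V(k, f) can be sketched as follows, assuming the scoring function f is available as a callable; all names here are illustrative, not taken from the Olex-GA implementation:

```python
def build_vocabulary(k, score, categories, docs):
    """Sketch of V(k, f): for each category c, take the k terms
    occurring in the documents of TS_c that score highest under the
    scoring function f (passed as `score`), then take the union over C.
    `docs` maps each category to its training documents (term sets)."""
    vocab = set()
    for c in categories:
        terms = set().union(*docs[c]) if docs[c] else set()
        best = sorted(terms, key=lambda t: score(t, c), reverse=True)[:k]
        vocab.update(best)
    return vocab
```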
We consider a hypothesis language of the form

c ← (t_1 ∈ d ∨ ... ∨ t_n ∈ d) ∧ ¬(t_{n+1} ∈ d ∨ ... ∨ t_{n+m} ∈ d)   (1)

where each t_i is a term taken from a given vocabulary. In particular, each term in Pos = {t_1, ..., t_n} is a positive term, while each term in Neg = {t_{n+1}, ..., t_{n+m}} is a negative term. A classifier as above, denoted H_c(Pos, Neg), states the condition "if any of the terms t_1, ..., t_n occurs in d and none of the terms t_{n+1}, ..., t_{n+m} occurs in d, then classify d under category c"¹. That is, the occurrence of a positive term in a document d requires the contextual absence of the (possibly empty) set of negative terms in order for d to be classified under c².
To assess the accuracy of a classifier for c, we use the classical notions of Precision Pr_c, Recall Re_c and F-measure F_{c,α}, defined as follows:

Pr_c = a / (a + b);   Re_c = a / (a + e);   F_{c,α} = (Pr_c · Re_c) / ((1 − α)·Pr_c + α·Re_c)   (2)

Here a is the number of true positive documents w.r.t. c (i.e., the number of documents of the test set that have correctly been classified under c), b the number of false positive documents w.r.t. c and e the number of false negative documents w.r.t. c. Further, in the definition of F_{c,α}, the parameter α ∈ [0, 1]
¹ It is immediate to recognize that c ← (t_1 ∈ d ∨ ... ∨ t_n ∈ d) ∧ ¬(t_{n+1} ∈ d ∨ ... ∨ t_{n+m} ∈ d) is equivalent to the following set of classification rules: {c ← t_1 ∈ d, ¬(t_{n+1} ∈ d), ..., ¬(t_{n+m} ∈ d); ...; c ← t_n ∈ d, ¬(t_{n+1} ∈ d), ..., ¬(t_{n+m} ∈ d)}.
² d may satisfy more than one classifier; thus, it may be assigned to multiple categories.
is the relative degree of importance given to Precision and Recall; notably, if α = 1 then F_{c,α} coincides with Pr_c, and if α = 0 then F_{c,α} coincides with Re_c (a value of α = 0.5 attributes the same importance to Pr_c and Re_c)³.
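A small sketch of equation (2), useful for checking the boundary behaviour of α (an assumed helper for illustration, not part of Olex-GA):

```python
def f_measure(a, b, e, alpha=0.5):
    """Equation (2): a = true positives, b = false positives,
    e = false negatives for category c."""
    pr = a / (a + b)   # Precision Pr_c
    re = a / (a + e)   # Recall Re_c
    return (pr * re) / ((1 - alpha) * pr + alpha * re)

# alpha = 1 recovers Precision, alpha = 0 recovers Recall
print(f_measure(8, 2, 8, alpha=1.0))  # 0.8
print(f_measure(8, 2, 8, alpha=0.0))  # 0.5
```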
Now we are in a position to state the following optimization problem:

PROBLEM MAX-F. Let a category c ∈ C and a vocabulary V(k, f) over the training set TS be given. Then, find two subsets of V(k, f), namely, Pos = {t_1, ..., t_n} and Neg = {t_{n+1}, ..., t_{n+m}}, with Pos ≠ ∅, such that H_c(Pos, Neg) applied to TS yields a maximum value of F_{c,α} (over TS), for a given α.
Proposition 1. Problem MAX-F is NP-hard.

Proof. We next show that the decision version of problem MAX-F is NP-complete (i.e., it requires time exponential in the size of the vocabulary, unless P = NP). To this end, we restrict our attention to the subset of the hypothesis language where each rule consists of only positive terms, i.e., c ← t_1 ∈ d ∨ ... ∨ t_n ∈ d (that is, classifiers are of the form H_c(Pos, ∅)). Let us call MAX-F+ this special case of problem MAX-F.
Given a generic set Pos ⊆ V(k, f), we denote by S ⊆ {1, ..., q} the set of indices of the elements of V(k, f) = {t_1, ..., t_q} that are in Pos, and by δ(t_i) the set of documents of the training set TS where t_i occurs. Thus, the set of documents classifiable under c by H_c(Pos, ∅) is D(Pos) = ∪_{i ∈ S} δ(t_i). The F-measure of this classification is

F_{c,α}(Pos) = |D(Pos) ∩ TS_c| / ((1 − α)·|TS_c| + α·|D(Pos)|).   (3)

Now, problem MAX-F+, in its decision version, can be formulated as follows: "∃ Pos ⊆ V(k, f) such that F_{c,α}(Pos) ≥ K?", where K is a constant (let us call it MAX-F+(D)).
Membership. Since Pos ⊆ V(k, f), we can clearly verify a YES answer of MAX-F+(D), using equation (3), in time polynomial in the size |V(k, f)| of the vocabulary.

Hardness. We define a partition of D(Pos) into the following subsets: (1) γ(Pos) = D(Pos) ∩ TS_c, i.e., the set of documents classifiable under c by H_c(Pos, ∅) that belong to the training set TS_c (true classifiable); (2) ε(Pos) = D(Pos) \ TS_c, i.e., the set of documents classifiable under c by H_c(Pos, ∅) that do not belong to TS_c (false classifiable).
Now, it is straightforward to see that the F-measure, given by equation (3), is proportional to the size |γ(Pos)| of γ(Pos) and inversely proportional to the size |ε(Pos)| of ε(Pos) (just replace in equation (3) the quantity |D(Pos) ∩ TS_c| by |γ(Pos)| and |D(Pos)| by |γ(Pos)| + |ε(Pos)|). Therefore, problem MAX-F+(D) can equivalently be stated as follows: "∃ Pos ⊆ V(k, f) such that |γ(Pos)| ≥ V and |ε(Pos)| ≤ C?", where V and C are constants.
³ Since α ∈ [0, 1] is a parameter of our model, we find it convenient to use F_{c,α} rather than the equivalent F_β = ((β² + 1)·Pr_c·Re_c) / (β²·Pr_c + Re_c), where β ∈ [0, ∞) has no upper bound.
Now, let us restrict MAX-F+(D) to the simpler case in which terms in Pos are pairwise disjoint, i.e., δ(t_i) ∩ δ(t_j) = ∅ for all t_i, t_j ∈ V(k, f) (let us call DSJ this assumption). Next we show that, under DSJ, MAX-F+(D) coincides with the Knapsack problem. To this end, we associate with each t_i ∈ V(k, f) two constants, v_i (the value of t_i) and c_i (the cost of t_i), as follows:

v_i = |δ(t_i) ∩ TS_c|,   c_i = |δ(t_i) \ TS_c|.

That is, the value v_i (resp. cost c_i) of a term t_i is the number of documents containing t_i that are true classifiable (resp., false classifiable).
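The value/cost mapping of the reduction can be sketched as follows (illustrative names; documents are represented by integer identifiers):

```python
def term_weights(vocab, delta, ts_c):
    """Value/cost pairs of the Knapsack reduction:
    v_i = |delta(t_i) & TS_c|  (true classifiable docs containing t_i)
    c_i = |delta(t_i) - TS_c|  (false classifiable docs containing t_i)
    where delta maps each term to the set of TS documents containing it."""
    return {t: (len(delta[t] & ts_c), len(delta[t] - ts_c)) for t in vocab}

# Toy data: TS_c = {1, 2}
delta = {"wheat": {1, 2, 3}, "stock": {3, 4}}
print(term_weights({"wheat", "stock"}, delta, {1, 2}))
```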
Now we prove that, under DSJ, the equality Σ_{i ∈ S} v_i = |γ(Pos)| holds, for any Pos ⊆ V(k, f). Indeed:

Σ_{i ∈ S} v_i = Σ_{i ∈ S} |δ(t_i) ∩ TS_c| = |(∪_{i ∈ S} δ(t_i)) ∩ TS_c| = |D(Pos) ∩ TS_c| = |γ(Pos)|.

To get the first equality above we apply the definition of v_i; for the second, we exploit assumption DSJ; for the third and the fourth we apply the definitions of D(Pos) and γ(Pos), respectively.
In the same way as for v_i, the equality Σ_{i ∈ S} c_i = |ε(Pos)| can easily be drawn. Therefore, by replacing |γ(Pos)| and |ε(Pos)| in our decision problem, we get the following new formulation, valid under DSJ: "∃ Pos ⊆ V(k, f) such that Σ_{i ∈ S} v_i ≥ V and Σ_{i ∈ S} c_i ≤ C?". That is, under DSJ, MAX-F+(D) is the Knapsack problem, a well known NP-complete problem. Therefore, MAX-F+(D) under DSJ is NP-complete. It follows that (the general case of) MAX-F+(D), which is at least as complex as MAX-F+(D) under DSJ, is NP-hard and, thus, the decision version of problem MAX-F+ is NP-hard; hence, the decision version of problem MAX-F is NP-hard as well.
Having proved both membership (in NP) and hardness, we conclude that the decision version of problem MAX-F is NP-complete.
A solution of problem MAX-F is a best classifier for c over the training set TS, for a given vocabulary V(f, k). We assume that categories in C are mutually independent, i.e., the classification results for a given category are not affected by the results for any other. It follows that we can solve problem MAX-F for each category independently of the others. For this reason, in the following, we will concentrate on a single category c ∈ C.
3 Olex-GA: A Genetic Algorithm to Solve MAX-F
Problem MAX-F is a combinatorial optimization problem aimed at finding a best combination of terms taken from a given vocabulary. That is, MAX-F is a typical problem for which GAs are known to be a good candidate resolution method.
A GA can be regarded as composed of three basic elements: (1) a population, i.e., a set of candidate solutions, called individuals or chromosomes, that will evolve during a number of iterations (generations); (2) a fitness function used to assign a score to each individual of the population; (3) an evolution mechanism based on operators such as selection, crossover and mutation. A comprehensive description of GAs can be found in [7].
Next we describe our choices concerning the above points.
3.1 Population Encoding
In the various GA-based approaches to rule induction used in the literature (e.g., [5, 15, 16]), an individual of the population may either represent a single rule or a rule set. The former approach (single-rule-per-individual) makes the individual encoding simpler, but the fitness of an individual may not be a meaningful indicator of the quality of the rule. On the other hand, the several-rules-per-individual approach, where an individual may represent an entire classifier, requires a more sophisticated encoding of individuals, but the fitness provides a reliable indicator. So, in general, there is a trade-off between simplicity of encoding and effectiveness of the fitness function.
Thanks to the structural simplicity of the hypothesis language, in our model an individual encodes a classifier in a very natural way, thus combining the advantages of the two aforementioned approaches. In fact, an individual is simply a binary representation of the sets Pos and Neg of a classifier H_c(Pos, Neg).
Our initial approach was to represent H_c(Pos, Neg) through an individual consisting of 2·|V(k, f)| bits, half for the terms in Pos, and half for the terms in Neg (recall that both Pos and Neg are subsets of V(k, f)). This, however, proved to be very ineffective (and inefficient), as local solutions representing classifiers with hundreds of terms were generated.
More effective classifiers were obtained by restricting the search of both positive and negative terms, based on the following simple observations: (1) scoring functions are used to assess the goodness of terms w.r.t. a given category; hence, it is quite reasonable to search positive terms for c within V_c(k, f), i.e., among the terms that score highest for c according to a given function f; (2) the role played by negative terms in the training phase is that of avoiding the classification of documents (of the training set) containing some positive term, but falling outside the training set TS_c of c.
Based on the above remarks, we define the following subsets Pos* and Neg* of V(k, f):

- Pos* = V_c(k, f), i.e., Pos* is the subset of V(k, f) consisting of the k terms occurring in the documents of the training set TS_c of c that score highest according to scoring function f; we say that t_i ∈ Pos* is a candidate positive term of c over V(k, f);
- Neg* = {t ∈ V(k, f) | (∪_{t_j ∈ Pos*} δ(t_j) \ TS_c) ∩ δ(t) ≠ ∅}, where δ(t_j) ⊆ TS (resp. δ(t) ⊆ TS) is the set of documents of TS containing term t_j (resp. t); we say that t_i ∈ Neg* is a candidate negative term of c over V(k, f). Intuitively, a candidate negative term is one which occurs in a document containing some candidate positive term and not belonging to the training set TS_c of c.
Now, we represent a classifier H_c(Pos, Neg) over V(k, f) as a chromosome K made of |Pos*| + |Neg*| bits, where Pos* and Neg* are the sets of candidate positive and negative terms. The first |Pos*| bits of K, denoted K⁺, are used to represent the terms in Pos*, and the remaining |Neg*|, denoted K⁻, to represent the terms in Neg*. We denote the bit in K⁺ (resp. K⁻) representing the candidate positive (resp. negative) term t_i as K⁺[t_i] (resp. K⁻[t_i]). Thus, the value 0/1 of K⁺[t_i] (resp. K⁻[t_i]) denotes whether t_i ∈ Pos* (resp. t_i ∈ Neg*) is included in Pos (resp. Neg) or not.
A chromosome K is legal if there is no term t_i ∈ Pos* ∩ Neg* such that K⁺[t_i] = K⁻[t_i] = 1 (that is, t_i cannot be at the same time a positive and a negative term)⁴.
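Representing K⁺ and K⁻ as term-to-bit maps, the legality condition can be sketched as follows (a hypothetical representation, not the paper's code):

```python
def is_legal(k_pos, k_neg):
    """k_pos / k_neg: dicts mapping candidate positive / negative
    terms to their bit in K+ / K-. A chromosome is legal iff no term
    has both bits set to 1."""
    return not any(bit and k_neg.get(t) for t, bit in k_pos.items())

print(is_legal({"t1": 1, "t2": 0}, {"t2": 1}))  # True
print(is_legal({"t1": 1}, {"t1": 1}))           # False
```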
3.2 Fitness Function
The fitness of a chromosome K, representing H_c(Pos, Neg), is the value of the F-measure resulting from applying H_c(Pos, Neg) to the training set TS. This choice naturally follows from the formulation of problem MAX-F. Now, if we denote by D(K) ⊆ TS the set of all documents containing any positive term in Pos and no negative term in Neg, i.e.,

D(K) = ∪_{t ∈ Pos} δ(t) \ ∪_{t ∈ Neg} δ(t),

then, starting from the definition of F_{c,α}, after some algebra, we obtain the following formula for F_{c,α}:

F_{c,α}(K) = |D(K) ∩ TS_c| / ((1 − α)·|TS_c| + α·|D(K)|).
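A sketch of the fitness computation, directly following the formula above (δ is given as a term-to-document-set map; all names are illustrative):

```python
def fitness(k_pos, k_neg, delta, ts_c, alpha=0.5):
    """F_{c,alpha}(K): D(K) is the set of training documents that
    contain some selected positive term and no selected negative term."""
    covered = set().union(*(delta[t] for t, b in k_pos.items() if b))
    blocked = set().union(*(delta[t] for t, b in k_neg.items() if b))
    d_k = covered - blocked
    denom = (1 - alpha) * len(ts_c) + alpha * len(d_k)
    return len(d_k & ts_c) / denom if denom else 0.0

# Toy data: TS_c = {1, 2}; selecting "a", "b" as positive and "x" as negative
delta = {"a": {1, 2}, "b": {2, 3}, "x": {3}}
print(fitness({"a": 1, "b": 1}, {"x": 1}, delta, {1, 2}))  # 1.0
```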
3.3 Evolutionary Operators
We perform selection via the roulette-wheel method, and crossover by the uniform crossover scheme. Mutation consists of flipping each single bit with a given (low) probability. In order not to lose good chromosomes, we apply elitism, thus ensuring that the best individuals of the current generation are passed to the next one without being altered by a genetic operator. All the above operators are described in [7].
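Standard textbook versions of these operators, sketched in Python for concreteness (the paper relies on the formulations in [7]; parameter values here are illustrative):

```python
import random

def roulette_select(pop, fits):
    """Roulette wheel: an individual is picked with probability
    proportional to its fitness."""
    r = random.uniform(0, sum(fits))
    acc = 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]

def uniform_crossover(p1, p2):
    """Uniform crossover: each bit position is independently swapped
    between the two parents with probability 0.5."""
    k1, k2 = [], []
    for a, b in zip(p1, p2):
        if random.random() < 0.5:
            a, b = b, a
        k1.append(a)
        k2.append(b)
    return k1, k2

def mutate(bits, rate=0.001):
    """Flip each single bit with a (low) probability `rate`."""
    return [b ^ 1 if random.random() < rate else b for b in bits]
```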
3.4 The Genetic Algorithm
A description of Olex-GA is sketched in Figure 1. First, the sets Pos* and Neg* of candidate positive and negative terms, respectively, are computed from the input vocabulary V(k, f). After the population has been initialized, evolution takes place by iterating elitism, selection, crossover and mutation, until a predefined number n of generations is created. At each step, a repair operator ρ, aimed at correcting possible illegal individuals generated by crossover/mutation, is applied. In particular, if K is an illegal individual with K⁺[t] = K⁻[t] = 1,

⁴ Note that the classifier encoded by an illegal individual is simply redundant, as it contains a dummy rule of the form c ← t ∈ d, ¬(t ∈ d), ....
Algorithm Olex-GA
Input: vocabulary V(f, k) over the training set TS; number n of generations;
Output: best classifier H_c(Pos, Neg) of c over TS;
begin
  Evaluate the sets of candidate positive and negative terms from V(f, k);
  Create the population oldPop and initialize each chromosome;
  Repeat n times
    Evaluate the fitness of each chromosome in oldPop;
    newPop = ∅;
    Copy into newPop the best r chromosomes of oldPop (elitism; r is
      determined on the basis of the elitism percentage);
    While size(newPop) < size(oldPop)
      select parent1 and parent2 in oldPop via roulette wheel;
      generate kid1, kid2 through crossover(parent1, parent2);
      apply mutation, i.e., kid1 = mut(kid1) and kid2 = mut(kid2);
      apply the repair operator to both kid1 and kid2;
      add kid1 and kid2 to newPop;
    endwhile
    oldPop = newPop;
  endrepeat;
  Select the best chromosome K in oldPop;
  Eliminate redundancies from K;
  return the classifier H_c(Pos, Neg) associated with K.
end

Fig. 1. Evolutionary Process for category c
the application of ρ to K simply sets K⁺[t] = 0 (thus, the redundant rule c ← t ∈ d, ¬(t ∈ d), ... is removed from the encoded classifier). After the iteration completes, the elimination of redundant bits/terms from the best chromosome/classifier K is performed. A bit K[t] = 1 is redundant in a chromosome K, representing H_c(Pos, Neg), if switching it off leaves the set of documents classified by H_c(Pos, Neg) unchanged (i.e., t does not contribute to remove any misclassified document).
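A sketch of the repair operator ρ, with K⁺ and K⁻ represented as term-to-bit maps (a hypothetical representation, not the authors' code):

```python
def repair(k_pos, k_neg):
    """Repair operator: for every term switched on in both K+ and K-,
    clear the positive bit, so the dummy rule c <- t in d, not(t in d)
    disappears from the encoded classifier."""
    for t, bit in k_pos.items():
        if bit and k_neg.get(t):
            k_pos[t] = 0
    return k_pos, k_neg

print(repair({"t1": 1, "t2": 1}, {"t1": 1}))  # ({'t1': 0, 't2': 1}, {'t1': 1})
```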
4 Learning Process
While Olex-GA is the search for a best classifier for category c over the training set, for a given input vocabulary, the learning process is the search for a best classifier for c over the validation set, for all input vocabularies. More precisely, the learning process proceeds as follows:

- Learning: for each input vocabulary:
  - Evolutionary Process: execute a predefined number of runs of Olex-GA over the training set, according to Figure 1;
  - Validation: run over the validation set the best chromosome/classifier generated by Olex-GA;
- Testing: after all runs have been executed, pick the best classifier (over the validation set), and assess its accuracy over the test set (unseen data).

The output of the learning process is assumed to be the best classifier of c.
5 Experimentation
5.1 Benchmark Corpora
We have experimentally evaluated our method using both the Reuters-21578 [12] and the Ohsumed [8] test collections.
Reuters-21578 consists of 12,902 documents. We have used the ModApte split, in which 9,603 documents are selected for training (seen data) and the other 3,299 form the test set (unseen data). Of the 135 categories of the TOPICS group, we have considered the 10 with the highest number of positive training examples (in the following, we will refer to this subset as R10). We remark that we used all 9,603 documents of the training corpus for the learning phase, and performed the test using all 3,299 documents of the test set (including those not belonging to any category in R10).
The second data set we considered is Ohsumed, in particular, the collection consisting of the first 20,000 documents from the 50,216 medical abstracts of the year 1991. Of the 20,000 documents of the corpus, the first 10,000 were used as seen data and the second 10,000 as unseen data. The classification scheme consisted of the 23 MeSH diseases.

5.2 Document Preprocessing
Preliminarily, documents were subjected to the following preprocessing steps: (1) first, we removed all words occurring in a list of common stopwords, as well as punctuation marks and numbers; (2) then, we extracted all n-grams, defined as sequences of at most three words consecutively occurring within a document (after stopword removal)⁵; (3) at this point, we randomly split the set of seen data into a training set (70%), on which to run the GA, and a validation set (30%), on which to tune the model parameters. We performed the split in such a way that each category was proportionally represented in both sets (stratified holdout); (4) finally, for each category c ∈ C, we scored all n-grams occurring in the documents of the training set TS_c by each scoring function f ∈ {CHI, IG}, where IG stands for Information Gain and CHI for Chi Square [4, 22].

⁵ Preliminary experiments showed that n-grams of length ranging between 1 and 3 perform slightly better than single words.
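Step (2) of the preprocessing can be sketched as follows (a minimal illustration, assuming stop words have already been removed):

```python
def ngrams_up_to(words, n_max=3):
    """All n-grams of 1..n_max consecutive words, as extracted
    after stop-word removal in preprocessing step (2)."""
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]

stop = {"the", "of"}
words = [w for w in "the price of wheat".split() if w not in stop]
print(ngrams_up_to(words))  # ['price', 'wheat', 'price wheat']
```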
Table 1. Best classifiers for categories in R10 and microaveraged performance

Cat.       Input        Training (α = 0.5)   Test               Classifier (GA)
           f      k     F-meas               F-meas    BEP      #pos   #neg
earn       CHI    80    93.91                95.34     95.34    36     9
acq        CHI    60    83.44                87.15     87.49    22     35
money-fx   CHI    150   79.62                66.47     66.66    33     26
grain      IG     60    91.05                91.61     91.75    9      8
crude      CHI    70    85.08                77.01     77.18    26     14
trade      IG     80    76.77                61.80     61.81    20     11
interest   CHI    150   73.77                64.20     64.59    44     23
wheat      IG     50    92.46                89.47     89.86    8      6
ship       CHI    30    87.88                74.07     74.81    17     33
corn       IG     30    93.94                90.76     91.07    4      12
BEP                                          86.35     86.40
5.3 Experimental Setting
We conducted a number of experiments for inducing the best classifiers for both Reuters-21578 and Ohsumed according to the learning process described in Section 4. To this end, we used different input vocabularies V(k, f), with k ∈ {10, 20, ..., 90, 100, 150, 200} and f ∈ {CHI, IG}. For each input vocabulary, we ran Olex-GA three times. The parameter α of the fitness function F_{c,α} was set to 0.5 for all the experiments (thus attributing equal importance to Precision and Recall).
In all cases, the population size was held at 500 individuals, the number of generations at 200, the crossover rate at 1.0, the mutation rate at 0.001 and the elitism probability at 0.2.
For each chromosome K in the population, we initialized K⁺ at random, while we set K⁻