
A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification

Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, Member, IEEE
Abstract—Feature clustering is a powerful method to reduce the dimensionality of feature vectors for text classification. In this paper, we propose a fuzzy similarity-based self-constructing algorithm for feature clustering. The words in the feature vector of a document set are grouped into clusters based on a similarity test. Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a membership function with statistical mean and deviation. When all the words have been fed in, a desired number of clusters is formed automatically. We then have one extracted feature for each cluster. The extracted feature corresponding to a cluster is a weighted combination of the words contained in the cluster. By this algorithm, the derived membership functions match closely with and properly describe the real distribution of the training data. Besides, the user need not specify the number of extracted features in advance, and trial-and-error for determining the appropriate number of extracted features can thus be avoided. Experimental results show that our method can run faster and obtain better extracted features than other methods.
Index Terms—Fuzzy similarity, feature clustering, feature extraction, feature reduction, text classification.

1 INTRODUCTION
In text classification, the dimensionality of the feature
vector is usually huge. For example, 20 Newsgroups [1]
and Reuters-21578 top-10 [2], which are two real-world data
sets, both have more than 15,000 features. Such high
dimensionality can be a severe obstacle for classification
algorithms [3], [4]. To alleviate this difficulty, feature
reduction approaches are applied before document classi-
fication tasks are performed [5]. Two major approaches,
feature selection [6], [7], [8], [9], [10] and feature extraction
[11], [12], [13], have been proposed for feature reduction. In
general, feature extraction approaches are more effective
than feature selection techniques, but are more computa-
tionally expensive [11], [12], [14]. Therefore, developing
scalable and efficient feature extraction algorithms is highly
demanded for dealing with high-dimensional document
data sets.
Classical feature extraction methods aim to convert the
representation of the original high-dimensional data set into
a lower-dimensional data set by a projecting process
through algebraic transformations. For example, Principal
Component Analysis [15], Linear Discriminant Analysis
[16], Maximum Margin Criterion [12], and Orthogonal
Centroid algorithm [17] perform the projection by linear
transformations, while Locally Linear Embedding [18],
ISOMAP [19], and Laplacian Eigenmaps [20] do feature
extraction by nonlinear transformations. In practice, linear
algorithms are in wider use due to their efficiency. Several
scalable online linear feature extraction algorithms [14],
[21], [22], [23] have been proposed to improve the
computational complexity. However, the complexity of
these approaches is still high. Feature clustering [24], [25],
[26], [27], [28], [29] is one of the effective techniques for feature
reduction in text classification. The idea of feature cluster-
ing is to group the original features into clusters with a high
degree of pairwise semantic relatedness. Each cluster is
treated as a single new feature, and, thus, feature
dimensionality can be drastically reduced.
The first feature extraction method based on feature
clustering was proposed by Baker and McCallum [24],
which was derived from the distributional clustering idea
of Pereira et al. [30]. Al-Mubaid and Umair [31] used
distributional clustering to generate an efficient representa-
tion of documents and applied a learning logic approach for
training text classifiers. The Agglomerative Information
Bottleneck approach was proposed by Tishby et al. [25],
[29]. The divisive information-theoretic feature clustering
algorithm was proposed by Dhillon et al. [27], which is an
information-theoretic feature clustering approach, and is
more effective than other feature clustering methods. In
these feature clustering methods, each new feature is
generated by combining a subset of the original words.
However, difficulties are associated with these methods. A
word is exactly assigned to a subset, i.e., hard-clustering,
based on the similarity magnitudes between the word and
the existing subsets, even if the differences among these
magnitudes are small. Also, the mean and the variance of a
cluster are not considered when similarity with respect to
the cluster is computed. Furthermore, these methods
require the number of new features be specified in advance
by the user.
We propose a fuzzy similarity-based self-constructing
feature clustering algorithm, which is an incremental feature
clustering approach to reduce the number of features for the
text classification task. The words in the feature vector of a
document set are represented as distributions, and pro-
cessed one after another. Words that are similar to each
other are grouped into the same cluster. Each cluster is
characterized by a membership function with statistical
mean and deviation. If a word is not similar to any existing
cluster, a new cluster is created for this word. Similarity
between a word and a cluster is defined by considering both
the mean and the variance of the cluster. When all the words
have been fed in, a desired number of clusters are formed
automatically. We then have one extracted feature for each
cluster. The extracted feature corresponding to a cluster is a
weighted combination of the words contained in the cluster.
Three ways of weighting, hard, soft, and mixed, are
introduced. By this algorithm, the derived membership
functions match closely with and describe properly the real
distribution of the training data. Besides, the user need not
specify the number of extracted features in advance, and
trial-and-error for determining the appropriate number of
extracted features can then be avoided. Experiments on real-
world data sets show that our method can run faster and
obtain better extracted features than other methods.
The remainder of this paper is organized as follows:
Section 2 gives a brief background about feature reduction.
Section 3 presents the proposed fuzzy similarity-based self-
constructing feature clustering algorithm. An example,
illustrating how the algorithm works, is given in Section 4.
Experimental results are presented in Section 5. Finally, we
conclude this work in Section 6.
2 BACKGROUND AND RELATED WORK
To process documents, the bag-of-words model [32], [33] is commonly used. Let $D = \{d_1, d_2, \ldots, d_n\}$ be a document set of $n$ documents, where $d_1, d_2, \ldots, d_n$ are individual documents, and each document belongs to one of the classes in the set $\{c_1, c_2, \ldots, c_p\}$. If a document belongs to two or more classes, then two or more copies of the document with different classes are included in $D$. Let the word set $W = \{w_1, w_2, \ldots, w_m\}$ be the feature vector of the document set. Each document $d_i$, $1 \le i \le n$, is represented as $d_i = \langle d_{i1}, d_{i2}, \ldots, d_{im}\rangle$, where each $d_{ij}$ denotes the number of occurrences of $w_j$ in the $i$th document. The feature reduction task is to find a new word set $W' = \{w'_1, w'_2, \ldots, w'_k\}$, $k \ll m$, such that $W$ and $W'$ work equally well for all the desired properties with $D$. After feature reduction, each document $d_i$ is converted into a new representation $d'_i = \langle d'_{i1}, d'_{i2}, \ldots, d'_{ik}\rangle$, and the converted document set is $D' = \{d'_1, d'_2, \ldots, d'_n\}$. If $k$ is much smaller than $m$, the computation cost of subsequent operations on $D'$ can be drastically reduced.
2.1 Feature Reduction
In general, there are two ways of doing feature reduction: feature selection and feature extraction. By feature selection approaches, a new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ is obtained, which is a subset of the original feature set $W$. Then $W'$ is used as the input for classification tasks.

Information Gain (IG) is frequently employed in the feature selection approach [10]. It measures the reduced uncertainty by an information-theoretic measure and gives each word a weight. The weight of a word $w_j$ is calculated as follows:

$$\mathrm{IG}(w_j) = -\sum_{l=1}^{p} P(c_l)\log P(c_l) + P(w_j)\sum_{l=1}^{p} P(c_l|w_j)\log P(c_l|w_j) + P(\bar{w}_j)\sum_{l=1}^{p} P(c_l|\bar{w}_j)\log P(c_l|\bar{w}_j), \qquad (1)$$

where $P(c_l)$ denotes the prior probability of class $c_l$, $P(w_j)$ denotes the prior probability of feature $w_j$, $P(\bar{w}_j)$ is identical to $1 - P(w_j)$, and $P(c_l|w_j)$ and $P(c_l|\bar{w}_j)$ denote the probability of class $c_l$ given the presence and absence, respectively, of $w_j$. The words with the top $k$ weights in $W$ are selected as the features in $W'$.
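To make (1) concrete, the following is a minimal Python sketch of IG computed from a dense document-term count matrix. It is our illustration, not the authors' implementation (the paper reports using MATLAB); the names `doc_term` and `labels` are hypothetical, and probabilities are estimated by simple frequency counts.

```python
import numpy as np

def information_gain(doc_term, labels, num_classes):
    """IG weight of each word as in (1).

    doc_term: (n_docs, n_words) array of term counts.
    labels:   (n_docs,) array of class indices in {0, ..., num_classes-1}.
    """
    n_docs, n_words = doc_term.shape
    present = doc_term > 0                      # whether a document contains the word
    eps = 1e-12                                 # guard against log(0)

    p_c = np.array([(labels == c).mean() for c in range(num_classes)])
    ig = np.full(n_words, -(p_c * np.log(p_c + eps)).sum())   # -sum_l P(c_l) log P(c_l)

    for j in range(n_words):
        p_w = present[:, j].mean()              # P(w_j)
        for c in range(num_classes):
            in_c = (labels == c)
            # P(c_l | w_j) and P(c_l | not w_j) estimated by counting
            p_c_w = (in_c & present[:, j]).sum() / max(present[:, j].sum(), 1)
            p_c_nw = (in_c & ~present[:, j]).sum() / max((~present[:, j]).sum(), 1)
            ig[j] += p_w * p_c_w * np.log(p_c_w + eps)
            ig[j] += (1 - p_w) * p_c_nw * np.log(p_c_nw + eps)
    return ig
```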
In feature extraction approaches, extracted features are obtained by a projecting process through algebraic transformations. An incremental orthogonal centroid (IOC) algorithm was proposed in [14]. Let a corpus of documents be represented as an $m \times n$ matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$, where $m$ is the number of features in the feature set and $n$ is the number of documents in the document set. IOC tries to find an optimal transformation matrix $\mathbf{F}^{*} \in \mathbb{R}^{m \times k}$, where $k$ is the desired number of extracted features, according to the following criterion:

$$\mathbf{F}^{*} = \arg\max \, \mathrm{trace}\big(\mathbf{F}^{T}\mathbf{S}_b\mathbf{F}\big), \qquad (2)$$

where $\mathbf{F} \in \mathbb{R}^{m \times k}$ and $\mathbf{F}^{T}\mathbf{F} = \mathbf{I}$, and

$$\mathbf{S}_b = \sum_{q=1}^{p} P(c_q)\big(\mathbf{M}_q - \mathbf{M}_{all}\big)\big(\mathbf{M}_q - \mathbf{M}_{all}\big)^{T}, \qquad (3)$$

with $P(c_q)$ being the prior probability of a pattern belonging to class $c_q$, $\mathbf{M}_q$ being the mean vector of class $c_q$, and $\mathbf{M}_{all}$ being the mean vector of all patterns.
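As a small illustration of the criterion, the sketch below forms the between-class scatter matrix $\mathbf{S}_b$ of (3) explicitly; it is only meant to make the definition concrete (IOC itself avoids materializing the $m \times m$ matrix for large $m$), and it assumes every class has at least one document.

```python
import numpy as np

def between_class_scatter(X, labels, num_classes):
    """S_b of (3). X is (m_features, n_docs); labels give each document's class."""
    m, n = X.shape
    M_all = X.mean(axis=1, keepdims=True)            # mean vector of all patterns
    S_b = np.zeros((m, m))
    for q in range(num_classes):
        cols = X[:, labels == q]                     # assumes class q is non-empty
        P_cq = cols.shape[1] / n                     # prior probability of class c_q
        M_q = cols.mean(axis=1, keepdims=True)       # mean vector of class c_q
        diff = M_q - M_all
        S_b += P_cq * (diff @ diff.T)
    return S_b
```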
2.2 Feature Clustering
Feature clustering is an efficient approach for feature reduction [25], [29], which groups all features into clusters, where features in a cluster are similar to each other. The feature clustering methods proposed in [24], [25], [27], [29] are "hard" clustering methods, where each word of the original features belongs to exactly one word cluster. Therefore, each word contributes to the synthesis of only one new feature. Each new feature is obtained by summing up the words belonging to one cluster. Let $\mathbf{D}$ be the matrix consisting of all the original documents with $m$ features and $\mathbf{D}'$ be the matrix consisting of the converted documents with $k$ new features. The new feature set $W' = \{w'_1, w'_2, \ldots, w'_k\}$ corresponds to a partition $\{W_1, W_2, \ldots, W_k\}$ of the original feature set $W$, i.e., $W_t \cap W_q = \emptyset$ for $1 \le t, q \le k$ and $t \ne q$. Note that a cluster corresponds to an element in the partition. Then, the $t$th feature value of the converted document $d'_i$ is calculated as follows:

$$d'_{it} = \sum_{w_j \in W_t} d_{ij}, \qquad (4)$$

which is a linear sum of the feature values in $W_t$.
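The conversion in (4) can be written in a few lines of Python, shown here only as a sketch; the partition is assumed to be given as a list of word-index lists.

```python
import numpy as np

def convert_by_partition(D, partition):
    """Implement (4). D is (n_docs, m_features); partition is a list of k
    index lists, one per word cluster W_t. Returns the (n_docs, k) matrix D'."""
    D_new = np.zeros((D.shape[0], len(partition)))
    for t, word_idx in enumerate(partition):
        D_new[:, t] = D[:, word_idx].sum(axis=1)   # linear sum over the words in W_t
    return D_new
```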
The divisive information-theoretic feature clustering (DC) algorithm, proposed by Dhillon et al. [27], calculates the distributions of words over classes, $P(\mathcal{C}|w_j)$, $1 \le j \le m$, where $\mathcal{C} = \{c_1, c_2, \ldots, c_p\}$, and uses Kullback-Leibler divergence to measure the dissimilarity between two distributions. The distribution of a cluster $W_t$ is calculated as follows:

$$P(\mathcal{C}|W_t) = \sum_{w_j \in W_t} \frac{P(w_j)}{\sum_{w_j \in W_t} P(w_j)}\, P(\mathcal{C}|w_j). \qquad (5)$$

The goal of DC is to minimize the following objective function:

$$\sum_{t=1}^{k} \sum_{w_j \in W_t} P(w_j)\, KL\big(P(\mathcal{C}|w_j),\, P(\mathcal{C}|W_t)\big), \qquad (6)$$

which takes the sum over all the $k$ clusters, where $k$ is specified by the user in advance.
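For readers who want to see (5) and (6) spelled out, the following Python sketch evaluates the DC objective for a given clustering; it is an illustration under the stated assumptions (dense arrays, a small smoothing constant in the KL divergence), not the DC algorithm itself, which iteratively re-partitions the words to minimize this quantity.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def dc_objective(p_c_given_w, p_w, clusters):
    """Objective (6).

    p_c_given_w: (m_words, p_classes) array; row j is P(C | w_j).
    p_w:         (m_words,) array of word priors P(w_j).
    clusters:    list of k index lists, one per cluster W_t.
    """
    p_c_given_w = np.asarray(p_c_given_w, dtype=float)
    p_w = np.asarray(p_w, dtype=float)
    total = 0.0
    for idx in clusters:
        weights = p_w[idx] / p_w[idx].sum()
        p_c_cluster = (weights[:, None] * p_c_given_w[idx]).sum(axis=0)   # (5)
        total += sum(p_w[j] * kl(p_c_given_w[j], p_c_cluster) for j in idx)
    return total
```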
3 OUR METHOD
There are some issues pertinent to most of the existing feature clustering methods. First, the parameter $k$, indicating the desired number of extracted features, has to be specified in advance. This places a burden on the user, since trial-and-error has to be done until the appropriate number of extracted features is found. Second, when calculating similarities, the variance of the underlying cluster is not considered. Intuitively, the distribution of the data in a cluster is an important factor in the calculation of similarity. Third, all words in a cluster have the same degree of contribution to the resulting extracted feature. Sometimes, it may be better if more similar words are allowed to have larger degrees of contribution. Our feature clustering algorithm is proposed to deal with these issues.
Suppose we are given a document set $D$ of $n$ documents $d_1, d_2, \ldots, d_n$, together with the feature vector $W$ of $m$ words $w_1, w_2, \ldots, w_m$ and $p$ classes $c_1, c_2, \ldots, c_p$, as specified in Section 2. We construct one word pattern for each word in $W$. For word $w_i$, its word pattern $\mathbf{x}_i$ is defined, similarly as in [27], by

$$\mathbf{x}_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip}\rangle = \langle P(c_1|w_i), P(c_2|w_i), \ldots, P(c_p|w_i)\rangle, \qquad (7)$$

where

$$P(c_j|w_i) = \frac{\sum_{q=1}^{n} d_{qi} \times \delta_{qj}}{\sum_{q=1}^{n} d_{qi}} \qquad (8)$$

for $1 \le j \le p$. Note that $d_{qi}$ indicates the number of occurrences of $w_i$ in document $d_q$, as described in Section 2. Also, $\delta_{qj}$ is defined as

$$\delta_{qj} = \begin{cases} 1, & \text{if document } d_q \text{ belongs to class } c_j,\\ 0, & \text{otherwise.} \end{cases} \qquad (9)$$

Therefore, we have $m$ word patterns in total. For example, suppose we have four documents $d_1$, $d_2$, $d_3$, and $d_4$ belonging to $c_1$, $c_1$, $c_2$, and $c_2$, respectively. Let the occurrences of $w_1$ in these documents be 1, 2, 3, and 4, respectively. Then the word pattern $\mathbf{x}_1$ of $w_1$ is

$$P(c_1|w_1) = \frac{1\times 1 + 2\times 1 + 3\times 0 + 4\times 0}{1+2+3+4} = 0.3,$$
$$P(c_2|w_1) = \frac{1\times 0 + 2\times 0 + 3\times 1 + 4\times 1}{1+2+3+4} = 0.7,$$
$$\mathbf{x}_1 = \langle 0.3, 0.7\rangle. \qquad (10)$$
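A short Python sketch of (7)-(9) follows; it is our illustration (not the authors' code), with hypothetical names `doc_term` and `labels`, and it reproduces the example of (10).

```python
import numpy as np

def word_patterns(doc_term, labels, num_classes):
    """Word patterns of (7)-(9).

    doc_term: (n_docs, m_words) array of term counts d_qi.
    labels:   (n_docs,) class index of each document.
    Returns an (m_words, num_classes) array whose ith row is x_i."""
    delta = np.zeros((doc_term.shape[0], num_classes))
    delta[np.arange(doc_term.shape[0]), labels] = 1.0            # delta_{qj} of (9)
    totals = doc_term.sum(axis=0)                                # sum_q d_qi
    return doc_term.T @ delta / np.maximum(totals, 1e-12)[:, None]   # (8)

# Reproduces (10): four documents in classes c1, c1, c2, c2 with
# occurrences 1, 2, 3, 4 of w_1 give x_1 = <0.3, 0.7>.
doc_term = np.array([[1], [2], [3], [4]], dtype=float)
labels = np.array([0, 0, 1, 1])
print(word_patterns(doc_term, labels, 2))   # [[0.3 0.7]]
```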
It is these word patterns that our clustering algorithm works on. Our goal is to group the words in $W$ into clusters, based on these word patterns. A cluster contains a certain number of word patterns and is characterized by the product of $p$ one-dimensional Gaussian functions. Gaussian functions are adopted because of their superiority over other functions in performance [34], [35]. Let $G$ be a cluster containing $q$ word patterns $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_q$, and let $\mathbf{x}_j = \langle x_{j1}, x_{j2}, \ldots, x_{jp}\rangle$, $1 \le j \le q$. Then the mean $\mathbf{m} = \langle m_1, m_2, \ldots, m_p\rangle$ and the deviation $\boldsymbol{\sigma} = \langle \sigma_1, \sigma_2, \ldots, \sigma_p\rangle$ of $G$ are defined as

$$m_i = \frac{\sum_{j=1}^{|G|} x_{ji}}{|G|}, \qquad (11)$$

$$\sigma_i = \sqrt{\frac{\sum_{j=1}^{|G|} \big(x_{ji} - m_i\big)^2}{|G|}} \qquad (12)$$

for $1 \le i \le p$, where $|G|$ denotes the size of $G$, i.e., the number of word patterns contained in $G$. The fuzzy similarity of a word pattern $\mathbf{x} = \langle x_1, x_2, \ldots, x_p\rangle$ to cluster $G$ is defined by the following membership function:

$$\mu_G(\mathbf{x}) = \prod_{i=1}^{p} \exp\left[-\left(\frac{x_i - m_i}{\sigma_i}\right)^2\right]. \qquad (13)$$

Notice that $0 \le \mu_G(\mathbf{x}) \le 1$. A word pattern close to the mean of a cluster is regarded as very similar to this cluster, i.e., $\mu_G(\mathbf{x}) \approx 1$. On the contrary, a word pattern far distant from a cluster is hardly similar to this cluster, i.e., $\mu_G(\mathbf{x}) \approx 0$. For example, suppose $G_1$ is an existing cluster with mean vector $\mathbf{m}_1 = \langle 0.4, 0.6\rangle$ and deviation vector $\boldsymbol{\sigma}_1 = \langle 0.2, 0.3\rangle$. The fuzzy similarity of the word pattern $\mathbf{x}_1$ shown in (10) to cluster $G_1$ becomes

$$\mu_{G_1}(\mathbf{x}_1) = \exp\left[-\left(\frac{0.3-0.4}{0.2}\right)^2\right] \times \exp\left[-\left(\frac{0.7-0.6}{0.3}\right)^2\right] = 0.7788 \times 0.8948 = 0.6969. \qquad (14)$$
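The membership function (13) is only a few lines of code. The sketch below is our illustration and reproduces the value obtained in (14).

```python
import numpy as np

def membership(x, mean, sigma):
    """Fuzzy similarity (13) of a word pattern x to a cluster with the
    given mean and deviation vectors."""
    x, mean, sigma = map(np.asarray, (x, mean, sigma))
    return float(np.prod(np.exp(-((x - mean) / sigma) ** 2)))

# Reproduces (14): mu_{G1}(x1) = 0.7788 * 0.8948 = 0.6969.
print(round(membership([0.3, 0.7], [0.4, 0.6], [0.2, 0.3]), 4))   # 0.6969
```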
3.1 Self-Constructing Clustering
Our clustering algorithm is an incremental, self-construct-
ing learning approach. Word patterns are considered one
by one. The user does not need to have any idea about
the number of clusters in advance. No clusters exist at the
beginning, and clusters can be created if necessary. For
each word pattern, the similarity of this word pattern to
each existing cluster is calculated to decide whether it is
combined into an existing cluster or a new cluster is
created. Once a new cluster is created, the corresponding
membership function should be initialized. On the
contrary, when the word pattern is combined into an
existing cluster, the membership function of that cluster
should be updated accordingly.
Let $k$ be the number of currently existing clusters. The clusters are $G_1, G_2, \ldots, G_k$, respectively. Each cluster $G_j$ has mean $\mathbf{m}_j = \langle m_{j1}, m_{j2}, \ldots, m_{jp}\rangle$ and deviation $\boldsymbol{\sigma}_j = \langle \sigma_{j1}, \sigma_{j2}, \ldots, \sigma_{jp}\rangle$. Let $S_j$ be the size of cluster $G_j$. Initially, we have $k = 0$, so no clusters exist at the beginning. For each word pattern $\mathbf{x}_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip}\rangle$, $1 \le i \le m$, we calculate, according to (13), the similarity of $\mathbf{x}_i$ to each existing cluster, i.e.,

$$\mu_{G_j}(\mathbf{x}_i) = \prod_{q=1}^{p} \exp\left[-\left(\frac{x_{iq} - m_{jq}}{\sigma_{jq}}\right)^2\right] \qquad (15)$$

for $1 \le j \le k$. We say that $\mathbf{x}_i$ passes the similarity test on cluster $G_j$ if

$$\mu_{G_j}(\mathbf{x}_i) \ge \rho, \qquad (16)$$

where $\rho$, $0 \le \rho \le 1$, is a predefined threshold. If the user intends to have larger clusters, then he/she can give a smaller threshold; otherwise, a bigger threshold can be given. As the threshold increases, the number of clusters also increases. Note that, as usual, the power in (15) is 2 [34], [35]. Its value has an effect on the number of clusters obtained. A larger value will make the boundaries of the Gaussian function sharper, and more clusters will be obtained for a given threshold. On the contrary, a smaller value will make the boundaries of the Gaussian function smoother, and fewer clusters will be obtained instead.

Two cases may occur. First, there are no existing fuzzy clusters on which $\mathbf{x}_i$ has passed the similarity test. For this case, we assume that $\mathbf{x}_i$ is not similar enough to any existing cluster, and a new cluster $G_h$, $h = k + 1$, is created with

$$\mathbf{m}_h = \mathbf{x}_i, \quad \boldsymbol{\sigma}_h = \boldsymbol{\sigma}_0, \qquad (17)$$

where $\boldsymbol{\sigma}_0 = \langle \sigma_0, \ldots, \sigma_0\rangle$ is a user-defined constant vector. Note that the new cluster $G_h$ contains only one member, the word pattern $\mathbf{x}_i$, at this point. Estimating the deviation of a cluster by (12) is impossible, or inaccurate, if the cluster contains few members. In particular, the deviation of a new cluster is 0, since it contains only one member. We cannot use zero deviation in the calculation of fuzzy similarities. Therefore, we initialize the deviation of a newly created cluster by $\boldsymbol{\sigma}_0$, as indicated in (17). Of course, the number of clusters is increased by 1 and the size of cluster $G_h$, $S_h$, should be initialized, i.e.,

$$k = k + 1, \quad S_h = 1. \qquad (18)$$
Second, if there are existing clusters on which $\mathbf{x}_i$ has passed the similarity test, let cluster $G_t$ be the cluster with the largest membership degree, i.e.,

$$t = \arg\max_{1 \le j \le k} \mu_{G_j}(\mathbf{x}_i). \qquad (19)$$

In this case, we regard $\mathbf{x}_i$ to be most similar to cluster $G_t$, and $\mathbf{m}_t$ and $\boldsymbol{\sigma}_t$ of cluster $G_t$ should be modified to include $\mathbf{x}_i$ as its member. The modification to cluster $G_t$ is described as follows:

$$m_{tj} = \frac{S_t \times m_{tj} + x_{ij}}{S_t + 1}, \qquad (20)$$

$$\sigma_{tj} = \sqrt{A - B} + \sigma_0, \qquad (21)$$

$$A = \frac{(S_t - 1)(\sigma_{tj} - \sigma_0)^2 + S_t \times m_{tj}^2 + x_{ij}^2}{S_t}, \qquad (22)$$

$$B = \frac{S_t + 1}{S_t}\left(\frac{S_t \times m_{tj} + x_{ij}}{S_t + 1}\right)^2 \qquad (23)$$

for $1 \le j \le p$, and

$$S_t = S_t + 1. \qquad (24)$$

Equations (20) and (21) can be derived easily from (11) and (12). Note that $k$ is not changed in this case. Let's give an example, following (10) and (14). Suppose $\mathbf{x}_1$ is most similar to cluster $G_1$, the size of cluster $G_1$ is 4, and the initial deviation $\sigma_0$ is 0.1. Then cluster $G_1$ is modified as follows:

$$m_{11} = \frac{4 \times 0.4 + 0.3}{4 + 1} = 0.38, \quad m_{12} = \frac{4 \times 0.6 + 0.7}{4 + 1} = 0.62, \quad \mathbf{m}_1 = \langle 0.38, 0.62\rangle,$$

$$A_{11} = \frac{(4-1)(0.2-0.1)^2 + 4 \times 0.38^2 + 0.3^2}{4}, \quad B_{11} = \frac{4+1}{4}\left(\frac{4 \times 0.38 + 0.3}{4+1}\right)^2,$$
$$\sigma_{11} = \sqrt{A_{11} - B_{11}} + 0.1 = 0.1937,$$

$$A_{12} = \frac{(4-1)(0.3-0.1)^2 + 4 \times 0.62^2 + 0.7^2}{4}, \quad B_{12} = \frac{4+1}{4}\left(\frac{4 \times 0.62 + 0.7}{4+1}\right)^2,$$
$$\sigma_{12} = \sqrt{A_{12} - B_{12}} + 0.1 = 0.2769,$$

$$\boldsymbol{\sigma}_1 = \langle 0.1937, 0.2769\rangle, \quad S_1 = 4 + 1 = 5.$$
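The update (20)-(24) is easy to check numerically. The following Python sketch (our illustration, applying (20) first and then (21)-(23) as in the worked example) reproduces the numbers above.

```python
import math

def update_cluster(mean, sigma, size, x, sigma0):
    """Incorporate word pattern x into a cluster, following (20)-(24).
    mean, sigma, x are equal-length lists; size is the current cluster size."""
    new_mean, new_sigma = [], []
    for m_old, s_old, xi in zip(mean, sigma, x):
        m_new = (size * m_old + xi) / (size + 1)                         # (20)
        A = ((size - 1) * (s_old - sigma0) ** 2
             + size * m_new ** 2 + xi ** 2) / size                       # (22)
        B = (size + 1) / size * ((size * m_new + xi) / (size + 1)) ** 2  # (23)
        new_mean.append(m_new)
        new_sigma.append(math.sqrt(A - B) + sigma0)                      # (21)
    return new_mean, new_sigma, size + 1                                 # (24)

# Reproduces the example: G1 with mean <0.4, 0.6>, deviation <0.2, 0.3>,
# size 4, sigma0 = 0.1, absorbing x1 = <0.3, 0.7>.
m, s, n = update_cluster([0.4, 0.6], [0.2, 0.3], 4, [0.3, 0.7], 0.1)
print([round(v, 4) for v in m], [round(v, 4) for v in s], n)
# [0.38, 0.62] [0.1937, 0.2769] 5
```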
The above process is iterated until all the word patterns have been processed. Consequently, we have $k$ clusters. The whole clustering algorithm can be summarized below.

Initialization:
  # of original word patterns: m
  # of classes: p
  Threshold: rho
  Initial deviation: sigma_0
  Initial # of clusters: k = 0
Input:
  x_i = <x_i1, x_i2, ..., x_ip>, 1 <= i <= m
Output:
  Clusters G_1, G_2, ..., G_k
procedure Self-Constructing-Clustering-Algorithm
  for each word pattern x_i, 1 <= i <= m
    temp_W = { G_j | mu_{G_j}(x_i) >= rho, 1 <= j <= k };
    if (temp_W is empty)
      A new cluster G_h, h = k + 1, is created by (17)-(18);
    else
      let G_t in temp_W be the cluster to which x_i is closest, by (19);
      Incorporate x_i into G_t by (20)-(24);
    endif;
  endfor;
  return with the created k clusters;
endprocedure
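For completeness, here is a compact Python sketch of the whole single pass, assembled from (13) and (15)-(24). It is a sketch under our own assumptions (NumPy arrays, a small guard against round-off in the square root), not the authors' implementation.

```python
import numpy as np

def membership(x, mean, sigma):
    """Fuzzy similarity (13)/(15)."""
    return float(np.prod(np.exp(-((x - mean) / sigma) ** 2)))

def self_constructing_clustering(patterns, rho, sigma0):
    """One pass over the word patterns, following the procedure above.
    patterns: (m, p) array; rho: similarity threshold; sigma0: initial deviation.
    Returns the cluster means, deviations, sizes, and per-pattern assignment."""
    means, sigmas, sizes, assignment = [], [], [], []
    for x in patterns:
        sims = [membership(x, m, s) for m, s in zip(means, sigmas)]
        passing = [j for j, sim in enumerate(sims) if sim >= rho]
        if not passing:                                   # create a new cluster, (17)-(18)
            means.append(x.astype(float).copy())
            sigmas.append(np.full_like(x, sigma0, dtype=float))
            sizes.append(1)
            assignment.append(len(means) - 1)
        else:                                             # join the closest cluster, (19)-(24)
            t = max(passing, key=lambda j: sims[j])
            S = sizes[t]
            m_new = (S * means[t] + x) / (S + 1)                              # (20)
            A = ((S - 1) * (sigmas[t] - sigma0) ** 2 + S * m_new ** 2 + x ** 2) / S
            B = (S + 1) / S * ((S * m_new + x) / (S + 1)) ** 2
            sigmas[t] = np.sqrt(np.maximum(A - B, 0.0)) + sigma0              # (21)-(23)
            means[t] = m_new
            sizes[t] = S + 1                                                  # (24)
            assignment.append(t)
    return means, sigmas, sizes, assignment
```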
Note that the word patterns in a cluster have a high degree of similarity to each other. Besides, when new training patterns are considered, the existing clusters can be adjusted or new clusters can be created, without the necessity of regenerating the whole set of clusters from scratch.

Note that the order in which the word patterns are fed in influences the clusters obtained. We apply a heuristic to determine the order. We sort all the patterns, in decreasing order, by their largest components. Then the word patterns are fed in this order. In this way, more significant patterns will be fed in first and are likely to become the core of the underlying cluster. For example, let $\mathbf{x}_1 = \langle 0.1, 0.3, 0.6\rangle$, $\mathbf{x}_2 = \langle 0.3, 0.3, 0.4\rangle$, and $\mathbf{x}_3 = \langle 0.8, 0.1, 0.1\rangle$ be three word patterns. The largest components in these word patterns are 0.6, 0.4, and 0.8, respectively. The sorted list is 0.8, 0.6, 0.4, so the order of feeding is $\mathbf{x}_3$, $\mathbf{x}_1$, $\mathbf{x}_2$. This heuristic seems to work well.
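The ordering heuristic is a one-liner; the snippet below (our illustration) reproduces the feeding order of the three example patterns.

```python
import numpy as np

# Feeding-order heuristic: sort word patterns by their largest component,
# in decreasing order, before running the clustering pass.
patterns = np.array([[0.1, 0.3, 0.6],
                     [0.3, 0.3, 0.4],
                     [0.8, 0.1, 0.1]])
order = np.argsort(-patterns.max(axis=1))   # largest components: 0.6, 0.4, 0.8
print(order)                                # [2 0 1] -> feed x3, x1, x2
```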
We discuss briefly here the computational cost of our method and compare it with DC [27], IOC [14], and IG [10]. For an input pattern, we have to calculate the similarity between the input pattern and every existing cluster. Each pattern consists of $p$ components, where $p$ is the number of classes in the document set. Therefore, in the worst case, the time complexity of our method is $O(mkp)$, where $m$ is the number of original features and $k$ is the number of clusters finally obtained. For DC, the complexity is $O(mkpt)$, where $t$ is the number of iterations to be done. The complexity of IG is $O(mp + m\log m)$, and the complexity of IOC is $O(mkpn)$, where $n$ is the number of documents involved. Apparently, IG is the quickest one. Our method is better than DC and IOC.
3.2 Feature Extraction
Formally, feature extraction can be expressed in the following form:

$$\mathbf{D}' = \mathbf{D}\mathbf{T}, \qquad (25)$$

where

$$\mathbf{D} = [\,d_1\ d_2\ \cdots\ d_n\,]^{T}, \qquad (26)$$
$$\mathbf{D}' = [\,d'_1\ d'_2\ \cdots\ d'_n\,]^{T}, \qquad (27)$$
$$\mathbf{T} = \begin{bmatrix} t_{11} & \cdots & t_{1k}\\ t_{21} & \cdots & t_{2k}\\ \vdots & \ddots & \vdots\\ t_{m1} & \cdots & t_{mk} \end{bmatrix}, \qquad (28)$$

with

$$d_i = [\,d_{i1}\ d_{i2}\ \cdots\ d_{im}\,], \quad d'_i = [\,d'_{i1}\ d'_{i2}\ \cdots\ d'_{ik}\,]$$

for $1 \le i \le n$. Clearly, $\mathbf{T}$ is a weighting matrix. The goal of feature reduction is achieved by finding an appropriate $\mathbf{T}$ such that $k$ is smaller than $m$. In the divisive information-theoretic feature clustering algorithm [27] described in Section 2.2, the elements of $\mathbf{T}$ in (25) are binary and can be defined as follows:

$$t_{ij} = \begin{cases} 1, & \text{if } w_i \in W_j,\\ 0, & \text{otherwise,} \end{cases} \qquad (29)$$

where $1 \le i \le m$ and $1 \le j \le k$. That is, if a word $w_i$ belongs to cluster $W_j$, $t_{ij}$ is 1; otherwise, $t_{ij}$ is 0.
By applying our clustering algorithm, word patterns have been grouped into clusters, and words in the feature vector $W$ are also clustered accordingly. For one cluster, we have one extracted feature. Since we have $k$ clusters, we have $k$ extracted features. The elements of $\mathbf{T}$ are derived based on the obtained clusters, and feature extraction is then done. We propose three weighting approaches: hard, soft, and mixed. In the hard-weighting approach, each word is only allowed to belong to one cluster, and so it only contributes to one new extracted feature. In this case, the elements of $\mathbf{T}$ in (25) are defined as follows:

$$t_{ij} = \begin{cases} 1, & \text{if } j = \arg\max_{1\le \alpha \le k} \mu_{G_\alpha}(\mathbf{x}_i),\\ 0, & \text{otherwise.} \end{cases} \qquad (30)$$

Note that if $j$ is not unique in (30), one of them is chosen randomly. In the soft-weighting approach, each word is allowed to contribute to all new extracted features, with the degrees depending on the values of the membership functions. The elements of $\mathbf{T}$ in (25) are defined as follows:

$$t_{ij} = \mu_{G_j}(\mathbf{x}_i). \qquad (31)$$

The mixed-weighting approach is a combination of the hard-weighting approach and the soft-weighting approach. For this case, the elements of $\mathbf{T}$ in (25) are defined as follows:

$$t_{ij} = \gamma\, t^{H}_{ij} + (1-\gamma)\, t^{S}_{ij}, \qquad (32)$$

where $t^{H}_{ij}$ is obtained by (30), $t^{S}_{ij}$ is obtained by (31), and $\gamma$ is a user-defined constant lying between 0 and 1. Note that $\gamma$ is not related to the clustering. It concerns the merging of component features in a cluster into a resulting feature. The merge can be hard or soft by setting $\gamma$ to 1 or 0. By selecting the value of $\gamma$, we provide flexibility to the user. When the similarity threshold is small, the number of clusters is small, and each cluster covers more training patterns. In this case, a smaller $\gamma$ will favor soft-weighting and get a higher accuracy. On the contrary, when the similarity threshold is large, the number of clusters is large, and each cluster covers fewer training patterns. In this case, a larger $\gamma$ will favor hard-weighting and get a higher accuracy.
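The three weighting matrices of (30)-(32) can be built directly from the clusters produced in Section 3.1. The sketch below is our illustration (NumPy, dense matrices); a converted training set is then simply $\mathbf{D}' = \mathbf{D}\mathbf{T}$ for whichever $\mathbf{T}$ is chosen, as in (25).

```python
import numpy as np

def weighting_matrices(patterns, means, sigmas, gamma=0.5):
    """Build the hard, soft, and mixed weighting matrices of (30)-(32).
    patterns: (m, p) word patterns; means/sigmas: per-cluster vectors."""
    m, k = len(patterns), len(means)
    T_soft = np.zeros((m, k))
    for i, x in enumerate(patterns):
        for j in range(k):
            T_soft[i, j] = np.prod(np.exp(-((x - means[j]) / sigmas[j]) ** 2))  # (31)
    T_hard = np.zeros((m, k))
    T_hard[np.arange(m), T_soft.argmax(axis=1)] = 1.0                           # (30)
    T_mixed = gamma * T_hard + (1.0 - gamma) * T_soft                           # (32)
    return T_hard, T_soft, T_mixed
```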
3.3 Text Classification
Given a set $D$ of training documents, text classification can be done as follows: We specify the similarity threshold $\rho$ for (16) and apply our clustering algorithm. Assume that $k$ clusters are obtained for the words in the feature vector $W$. Then we find the weighting matrix $\mathbf{T}$ and convert $D$ to $D'$ by (25). Using $D'$ as training data, a classifier based on support vector machines (SVM) is built. Note that any classifying technique other than SVM can be applied. Joachims [36] showed that SVM is better than other methods for text categorization. SVM is a kernel method, which finds the maximum margin hyperplane in feature space separating the images of the training patterns into two groups [37], [38], [39]. To make the method more flexible and robust, some patterns need not be correctly classified by the hyperplane, but the misclassified patterns should be penalized. Therefore, slack variables $\xi_i$ are introduced to account for misclassifications. The objective function and constraints of the classification problem can be formulated as:

$$\min_{\mathbf{w},\, b} \ \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + C\sum_{i=1}^{l}\xi_i$$
$$\text{s.t.} \quad y_i\big(\mathbf{w}^{T}\phi(\mathbf{x}_i) + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, l, \qquad (33)$$

where $l$ is the number of training patterns, $C$ is a parameter which gives a tradeoff between maximum margin and classification error, and $y_i$, being $+1$ or $-1$, is the target label of pattern $\mathbf{x}_i$. Note that $\phi: \mathcal{X} \to \mathcal{F}$ is a mapping from the input space to the feature space $\mathcal{F}$, where patterns are more easily separated, and $\mathbf{w}^{T}\phi(\mathbf{x}) + b = 0$ is the hyperplane to be derived, with $\mathbf{w}$ and $b$ being the weight vector and offset, respectively.

An SVM described above can only separate two classes, $y_i = +1$ and $y_i = -1$. We follow the idea in [36] to construct an SVM-based classifier. For $p$ classes, we create $p$ SVMs, one SVM for each class. For the SVM of class $c_v$, $1 \le v \le p$, the training patterns of class $c_v$ are treated as having $y_i = +1$, and the training patterns of the other classes are treated as having $y_i = -1$. The classifier is then the aggregation of these SVMs. Now we are ready for classifying unknown documents. Suppose $d$ is an unknown document. We first convert $d$ to $d'$ by

$$d' = d\mathbf{T}. \qquad (34)$$

Then we feed $d'$ to the classifier. We get $p$ values, one from each SVM. Then $d$ belongs to those classes with $+1$ appearing at the outputs of their corresponding SVMs. For example, consider a case of three classes $c_1$, $c_2$, and $c_3$. If the three SVMs output $+1$, $-1$, and $+1$, respectively, then the predicted classes will be $c_1$ and $c_3$ for this document. If the three SVMs output $-1$, $+1$, and $+1$, respectively, the predicted classes will be $c_2$ and $c_3$.
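As a rough sketch of this one-SVM-per-class setup, the code below uses scikit-learn's one-vs-rest wrapper with a linear SVM. This is only an assumption-laden stand-in (the authors use LIBSVM [42] and MATLAB), and the random `D_prime` and `Y` arrays are purely illustrative.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# D_prime: (n_docs, k) reduced training matrix; Y: (n_docs, p) binary label
# matrix with Y[i, j] = 1 if document i belongs to class c_j (multilabel).
rng = np.random.default_rng(0)
D_prime = rng.random((100, 20))
Y = (rng.random((100, 3)) > 0.7).astype(int)

clf = OneVsRestClassifier(LinearSVC())   # one binary SVM per class, as in Section 3.3
clf.fit(D_prime, Y)

d_prime = rng.random((1, 20))            # an unknown document already converted by (34)
print(clf.predict(d_prime))              # one 0/1 decision per class
```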
4 AN EXAMPLE
We give an example here to illustrate how our method works. Let $D$ be a simple document set, containing 9 documents $d_1, d_2, \ldots, d_9$ of two classes $c_1$ and $c_2$, with 10 words "office," "building," ..., "fridge" in the feature vector $W$, as shown in Table 1. For simplicity, we denote the ten words as $w_1, w_2, \ldots, w_{10}$, respectively.

We calculate the ten word patterns $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{10}$ according to (7) and (8). For example, $\mathbf{x}_6 = \langle P(c_1|w_6), P(c_2|w_6)\rangle$, and $P(c_2|w_6)$ is calculated by (35):

$$P(c_2|w_6) = \frac{1\times 0 + 2\times 0 + 0\times 0 + 1\times 0 + 1\times 1 + 1\times 1 + 1\times 1 + 1\times 1 + 0\times 1}{1+2+0+1+1+1+1+1+0} = 0.50. \qquad (35)$$

The resulting word patterns are shown in Table 2. Note that each word pattern is a two-dimensional vector, since there are two classes involved in $D$.

We run our self-constructing clustering algorithm, by setting $\sigma_0 = 0.5$ and $\rho = 0.64$, on the word patterns and obtain 3 clusters $G_1$, $G_2$, and $G_3$, which are shown in Table 3. The fuzzy similarity of each word pattern to each cluster is shown in Table 4. The weighting matrices $\mathbf{T}_H$, $\mathbf{T}_S$, and $\mathbf{T}_M$ obtained by hard-weighting, soft-weighting, and mixed-weighting (with $\gamma = 0.8$), respectively, are shown in Table 5. The transformed data sets $\mathbf{D}'_H$, $\mathbf{D}'_S$, and $\mathbf{D}'_M$ obtained by (25) for the different cases of weighting are shown in Table 6.

Based on $\mathbf{D}'_H$, $\mathbf{D}'_S$, or $\mathbf{D}'_M$, a classifier with two SVMs is built. Suppose $d$ is an unknown document and $d = \langle 0, 1, 1, 1, 1, 1, 0, 1, 1, 1\rangle$. We first convert $d$ to $d'$ by (34). The transformed document is then obtained as $d'_H = d\mathbf{T}_H = \langle 2, 4, 2\rangle$, $d'_S = d\mathbf{T}_S = \langle 2.5591, 4.3478, 3.9964\rangle$, or $d'_M = d\mathbf{T}_M = \langle 2.1118, 4.0696, 2.3993\rangle$. The transformed unknown document is then fed to the classifier. For this example, the classifier concludes that $d$ belongs to $c_2$.
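Because (32) is linear, the mixed-weighting result for a document is the same $\gamma$-combination of its hard and soft results. The small sketch below (our illustration) verifies the numbers above with $\gamma = 0.8$.

```python
import numpy as np

# Transformed vectors reported above for the unknown document d.
d_hard = np.array([2.0, 4.0, 2.0])
d_soft = np.array([2.5591, 4.3478, 3.9964])
gamma = 0.8

# d'_M = gamma * d'_H + (1 - gamma) * d'_S, by linearity of (32) and (25).
d_mixed = gamma * d_hard + (1 - gamma) * d_soft
print(np.round(d_mixed, 4))   # [2.1118 4.0696 2.3993]
```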
5 EXPERIMENTAL RESULTS
In this section, we present experimental results to show the
effectiveness of our fuzzy self-constructing feature clustering
method. Three well-known data sets for text classification research, 20 Newsgroups [1], RCV1 [40], and Cade12 [41], are used in the following experiments.
TABLE 1. A Simple Document Set D.
TABLE 2. Word Patterns of W.
We compare our method
with three other feature reduction methods: IG [10], IOC [14], and DC [27]. As described in Section 2, IG is one of the state-of-the-art feature selection approaches, IOC is an incremental feature extraction algorithm, and DC is a feature clustering approach. For convenience, our method is abbreviated as FFC, standing for Fuzzy Feature Clustering, in this section. In our method, $\sigma_0$ is set to 0.25. The value was determined by investigation, and $\sigma_0$ is kept at the constant 0.25 for all the following experiments. Furthermore, we use H-FFC, S-FFC, and M-FFC to represent hard-weighting, soft-weighting, and mixed-weighting, respectively. We run a classifier built on support vector machines [42], as described in Section 3.3, on the extracted features obtained by the different methods. We use a computer with an Intel(R) Core(TM)2 Quad CPU Q6600 at 2.40 GHz and 4 GB of RAM to conduct the experiments. The programming language used is MATLAB 7.0.
To compare the classification effectiveness of each method, we adopt the performance measures in terms of microaveraged precision (MicroP), microaveraged recall (MicroR), microaveraged F1 (MicroF1), and microaveraged accuracy (MicroAcc) [4], [43], [44], defined as follows:

$$\mathrm{MicroP} = \frac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p} (TP_i + FP_i)}, \qquad (36)$$

$$\mathrm{MicroR} = \frac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p} (TP_i + FN_i)}, \qquad (37)$$

$$\mathrm{MicroF1} = \frac{2 \times \mathrm{MicroP} \times \mathrm{MicroR}}{\mathrm{MicroP} + \mathrm{MicroR}}, \qquad (38)$$

$$\mathrm{MicroAcc} = \frac{\sum_{i=1}^{p} (TP_i + TN_i)}{\sum_{i=1}^{p} (TP_i + TN_i + FP_i + FN_i)}, \qquad (39)$$

where $p$ is the number of classes. $TP_i$ (true positives with respect to $c_i$) is the number of $c_i$ test documents that are correctly classified to $c_i$. $TN_i$ (true negatives with respect to $c_i$) is the number of non-$c_i$ test documents that are classified to non-$c_i$. $FP_i$ (false positives with respect to $c_i$) is the number of non-$c_i$ test documents that are incorrectly classified to $c_i$. $FN_i$ (false negatives with respect to $c_i$) is the number of $c_i$ test documents that are classified to non-$c_i$.
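Given per-class counts of true/false positives and negatives, the micro-averaged measures (36)-(39) reduce to a few sums. A minimal sketch (our illustration) follows.

```python
import numpy as np

def micro_metrics(tp, tn, fp, fn):
    """Micro-averaged measures (36)-(39); tp, tn, fp, fn are per-class arrays."""
    tp, tn, fp, fn = map(np.asarray, (tp, tn, fp, fn))
    micro_p = tp.sum() / (tp.sum() + fp.sum())
    micro_r = tp.sum() / (tp.sum() + fn.sum())
    micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
    micro_acc = (tp.sum() + tn.sum()) / (tp.sum() + tn.sum() + fp.sum() + fn.sum())
    return micro_p, micro_r, micro_f1, micro_acc
```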
5.1 Experiment 1: 20 Newsgroups Data Set
The 20 Newsgroups collection contains about 20,000 articles
taken from the Usenet newsgroups. These articles are evenly distributed over 20 classes, and each class has about 1,000 articles, as shown in Fig. 1a.
TABLE 3. Three Clusters Obtained.
TABLE 4. Fuzzy Similarities of Word Patterns to Three Clusters.
TABLE 5. Weighting Matrices: Hard T_H, Soft T_S, and Mixed T_M.
TABLE 6. Transformed Data Sets: Hard D'_H, Soft D'_S, and Mixed D'_M.
In this figure, the x-axis
indicates the class number, and the y-axis indicates the
fraction of the articles of each class. We use two-thirds of
the documents for training and the rest for testing. After
preprocessing, we have 25,718 features, or words, for this
data set.
Fig. 2 shows the execution time (sec) of different feature
reduction methods on the 20 Newsgroups data set. Since H-
FFC, S-FFC, and M-FFC have the same clustering phase,
they have the same execution time, and thus we use FFC to
denote them in Fig. 2. In this figure, the horizontal axis
indicates the number of extracted features. To obtain
different numbers of extracted features, different values of
, are used in FFC. The number of extracted features is
identical to the number of clusters. For the other methods,
the number of extracted features should be specified in
advance by the user. Table 7 lists values of certain points in
Fig. 2. Different values of , are used in FFC and are listed in
the table. Note that for IG, each word is given a weight. The
words of top / weights in W are selected as the extracted
features in W
0
. Therefore, the execution time is basically the
same for any value of /. Obviously, our method runs much
faster than DC and IOC. For example, our method needs
8.61 seconds for 20 extracted features, while DC requires
88.38 seconds and IOC requires 6,943.40 seconds. For
84 extracted features, our method only needs 17.68 seconds,
but DC and IOC require 293.98 and 28,098.05 seconds,
respectively. As the number of extracted features increases, DC and IOC slow down significantly. In particular, when the
number of extracted features exceeds 280, IOC spends more
than 100,000 seconds without getting finished, as indicated
by dashes in Table 7.
Fig. 3 shows the MicroAcc (percent) of the 20 News-
groups data set obtained by different methods, based on
the extracted features previously obtained. The vertical axis
indicates the MicroAcc values (percent), and the horizontal
axis indicates the number of extracted features. Table 8 lists
values of certain points in Fig. 3. Note that $\gamma$ is set to 0.5 for
M-FFC. Also, no accuracies are listed for IOC when the
number of extracted features exceeds 280. As shown in the
figure, IG performs the worst in classification accuracy,
especially when the number of extracted features is small.
For example, IG only gets 95.79 percent in accuracy for
20 extracted features. S-FFC works very well when the
number of extracted features is smaller than 58. For
example, S-FFC gets 98.46 percent in accuracy for 20 ex-
tracted features. H-FFC and M-FFC perform well in
accuracy all the time, except for the case of 20 extracted
features. For example, H-FFC and M-FFC get 98.43 percent and 98.54 percent, respectively, in accuracy for 58 extracted features, 98.63 percent and 98.56 percent for 203 extracted features, and 98.65 percent and 98.59 percent for 521 extracted features.
Fig. 1. Class distributions of three data sets. (a) 20 Newsgroups. (b) RCV1. (c) Cade12.
Fig. 2. Execution time (sec) of different methods on 20 Newsgroups data.
TABLE 7. Sampled Execution Times (Seconds) of Different Methods on 20 Newsgroups Data.
Fig. 3. Microaveraged accuracy (percent) of different methods for 20 Newsgroups data.
Although the number of extracted features
is affected by the threshold value, the classification
accuracies obtained keep fairly stable. DC and IOC perform
a little bit worse in accuracy when the number of extracted
features is small. For example, they get 97.73 percent and
97.09 percent, respectively, in accuracy for 20 extracted
features. Note that while the whole 25,718 features are used
for classification, the accuracy is 98.45 percent.
Table 9 lists values of the MicroP, MicroR, and MicroF1
(percent) of the 20 Newsgroups data set obtained by
different methods. Note that $\gamma$ is set to 0.5 for M-FFC.
From this table, we can see that S-FFC can, in general, get
best results for MicroF1, followed by M-FFC, H-FFC, and
DC in order. For MicroP, H-FFC performs a little bit
better than S-FFC. But for MicroR, S-FFC performs better
than H-FFC by a greater margin than in the MicroP case.
M-FFC performs well for MicroP, MicroR, and MicroF1.
For MicroF1, M-FFC performs only a little bit worse than
S-FFC, with almost identical performance when the
number of features is less than 500. For MicroP, M-FFC
performs better than S-FFC most of the time. H-FFC
performs well when the number of features is less than
200, while it performs worse than DC and M-FFC when
the number of features is greater than 200. DC performs
well when the number of features is greater than 200, but
it performs worse when the number of features is less
than 200. For example, DC gets 88.00 percent and M-FFC
gets 89.65 percent when the number of features is 20. For
MicroR, S-FFC performs best for all the cases. M-FFC
performs well, especially when the number of features is
less than 200. Note that while the whole 25,718 features
are used for classification, MicroP is 94.53 percent, MicroR
is 73.18 percent, and MicroF1 is 82.50 percent.
In summary, IG runs very fast in feature reduction, but
the extracted features obtained are not good for classifica-
tion. FFC can not only run much faster than DC and IOC in
feature reduction, but also provide comparably good or
better extracted features for classification.
TABLE 8. Microaveraged Accuracy (Percent) of Different Methods for 20 Newsgroups Data.
TABLE 9. Microaveraged Precision, Recall, and F1 (Percent) of Different Methods for 20 Newsgroups Data.
Fig. 4. Microaveraged F1 (percent) of M-FFC with different $\gamma$ values for 20 Newsgroups data.
Fig. 4 shows the influence of $\gamma$ values on the performance of M-FFC in MicroF1 for three numbers of extracted features, 20, 203, and 1,453, indicated by M-FFC20, M-FFC203, and M-FFC1453, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of $\gamma$.
5.2 Experiment 2: Reuters Corpus Volume 1 (RCV1)
Data Set
The RCV1 data set consists of 804,414 news stories
produced by Reuters from 20 August 1996 to 19 August
1997. All news stories are in English, and have 109 distinct
terms per document on average. The documents are
divided, by the LYRL2004 split defined in Lewis et al.
[40], into 23,149 training documents and 781,265 testing
documents. There are 103 Topic categories and the
distribution of the documents over the classes is shown in
Fig. 1b. All the 103 categories have one or more positive test
examples in the test set, but only 101 of them have one or
more positive training examples in the training set. It is
these 101 categories that we use in this experiment. After
pre-processing, we have 47,152 features for this data set.
Fig. 5 shows the execution time (sec) of different feature
reduction methods on the RCV1 data set. Table 10 lists
values of certain points in Fig. 5. Clearly, our method runs
much faster than DC and IOC. For example, for 18 extracted
features, DC requires 609.63 seconds and IOC requires
94,533.88 seconds, but our method only needs 14.42 seconds.
For 65 extracted features, our method only needs 37.40 sec-
onds, but DC and IOC require 1,709.77 and over 100,000 sec-
onds, respectively. As the number of extracted features
increases, DC and IOC slow down significantly.
Fig. 6 shows the MicroAcc (percent) of the RCV1 data set
obtained by different methods, based on the extracted
features previously obtained. Table 11 lists values of certain
points in Fig. 6. Note that $\gamma$ is set to 0.5 for M-FFC. No
accuracies are listed for IOC when the number of extracted
features exceeds 18. As shown in the figure, IG performs the
worst in classification accuracy, especially when the number
of extracted features is small. For example, IG only gets
96.95 percent in accuracy for 18 extracted features. H-FFC, S-
FFC, and M-FFC perform well in accuracy all the time.
For example, H-FFC, S-FFC, and M-FFC get 98.03 percent,
98.26 percent, and 97.85 percent, respectively, in accuracy
for 18 extracted features, 98.13 percent, 98.39 percent, and
98.31 percent, respectively, for 65 extracted features, and
98.41 percent, 98.53 percent, and 98.56 percent, respectively,
for 421 extracted features. Note that while the whole
47,152 features are used for classification, the accuracy is
98.83 percent. Table 12 lists values of the MicroP, MicroR, and MicroF1 (percent) of the RCV1 data set obtained by different methods. From this table, we can see that S-FFC can, in general, get the best results for MicroF1, followed by M-FFC, DC, and H-FFC in that order. S-FFC performs better than the other methods, especially when the number of features is less than 500. For MicroP, all the methods perform about equally well, the values being around 80-85 percent. IG gets a high MicroP when the number of features is smaller than 50. M-FFC performs well for MicroP, MicroR, and MicroF1. Note that when the whole set of 47,152 features is used for classification, MicroP is 86.66 percent, MicroR is 75.03 percent, and MicroF1 is 80.43 percent.
In summary, IG runs very fast in feature reduction, but the extracted features obtained are not good for classification. For example, IG gets 96.95 percent in MicroAcc, 91.58 percent in MicroP, 5.80 percent in MicroR, and 10.90 percent in MicroF1 for 18 extracted features, and 97.24 percent in MicroAcc, 68.82 percent in MicroP, 25.72 percent in MicroR, and 37.44 percent in MicroF1 for 65 extracted features. FFC
can not only run much faster than DC and IOC in feature
reduction, but also provide comparably good or better
extracted features for classification. For example, S-FFC
gets 98.26 percent in MicroAcc, 85.18 percent in MicroP,
55.33 percent in MicroR, and 67.08 percent in MicroF1 for
18 extracted features, and 98.39 percent in MicroAcc,
82.34 percent in MicroP, 63.41 percent in MicroR, and 71.65 percent in MicroF1 for 65 extracted features.
Fig. 5. Execution time (sec) of different methods on RCV1 data.
Fig. 6. Microaveraged accuracy (percent) of different methods for RCV1 data.
TABLE 10. Sampled Execution Times (sec) of Different Methods on RCV1 Data.
TABLE 11. Microaveraged Accuracy (Percent) of Different Methods for RCV1 Data.
TABLE 12. Microaveraged Precision, Recall, and F1 (Percent) of Different Methods for RCV1 Data.
Fig. 7. Microaveraged F1 (percent) of M-FFC with different $\gamma$ values for RCV1 data.
Fig. 7 shows the influence of $\gamma$ values on the performance of M-FFC in MicroF1 for three numbers of extracted features, 18, 202, and 1,700, indicated by M-FFC18, M-FFC202, and M-FFC1700, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of $\gamma$.
5.3 Experiment 3: Cade12 Data
Cade12 is a set of classified Web pages extracted from the
Cadê Web directory [41]. This directory points to Brazilian
Web pages that were classified by human experts into
12 classes. The Cade12 collection has a skewed distribution,
and the three most popular classes represent more than
50 percent of all documents. A version of this data set,
40,983 documents in total with 122,607 features, was
obtained from [45], from which two-thirds, 27,322 docu-
ments, are split for training, and the remaining, 13,661 docu-
ments, for testing. The distribution of Cade12 data is shown
in Fig. 1c and Table 13.
Fig. 8 shows the execution time (sec) of different feature
reduction methods on the Cade12 data set. Clearly, our
method runs much faster than DC and IOC. For example,
for 22 extracted features, DC requires 689.63 seconds and
IOC requires 57,958.10 seconds, but our method only needs
24.02 seconds. For 84 extracted features, our method only
needs 47.48 seconds, while DC requires 2,939.35 seconds
and IOC has a difficulty in completion. As the number of
extracted features increases, DC and IOC run significantly
slow. In particular, IOC can only get finished in 100,000 sec-
onds with 22 extracted features, as shown in Table 14.
Fig. 9 shows the MicroAcc (percent) of the Cade12 data
set obtained by different methods, based on the extracted
features previously obtained. Table 15 lists values of
certain points in Fig. 9. Table 16 lists values of the MicroP, MicroR, and MicroF1 (percent) of the Cade12 data set obtained by different methods. Note that $\gamma$ is set to 0.5
for M-FFC. Also, no accuracies are listed for IOC when the
number of extracted features exceeds 22. As shown in the
tables, all the methods work equally well for MicroAcc.
But none of the methods work satisfactorily well for
MicroP, MicroR, and MicroF1. In fact, it is hard to get
feature reduction for Cade12. Loose correlation exists
among the original features. IG performs best for MicroP.
For example, IG gets 78.21 percent in MicroP when the
number of features is 38. However, IG performs worst for
MicroR, getting only 7.04 percent in MicroR when the
number of features is 38. S-FFC, M-FFC, and H-FFC get a
little bit worse than IG in MicroP, but they are good in
MicroR. For example, S-FFC gets 75.83 percent in MicroP
and 38.33 percent in MicroR when the number of features
is 38. In general, S-FFC can get best results, followed by M-
FFC, H-FFC, and DC in order. Note that while the whole
122,607 features are used for classification, MicroAcc
is 93.55 percent, MicroP is 69.57 percent, MicroR is
40.11 percent, and MicroF1 is 50.88 percent. Again, this
experiment shows that FFC can not only run much faster
than DC and IOC in feature reduction, but also provide
comparably good or better extracted features for classifica-
tion. Fig. 10 shows the influence of $\gamma$ values on the performance of M-FFC in MicroF1 for three numbers of extracted features, 22, 286, and 1,338, indicated by M-FFC22, M-FFC286, and M-FFC1338, respectively. As shown in the figure, the MicroF1 values do not vary significantly with different settings of $\gamma$.
TABLE 13. The Number of Documents Contained in Each Class of the Cade12 Data Set.
TABLE 14. Sampled Execution Times (Seconds) of Different Methods on Cade12 Data.
TABLE 15. Microaveraged Accuracy (Percent) of Different Methods for Cade12 Data.
TABLE 16. Microaveraged Precision, Recall, and F1 (Percent) of Different Methods for Cade12 Data.
Fig. 8. Execution time (seconds) of different methods on Cade12 data.
Fig. 9. Microaveraged accuracy (percent) of different methods for Cade12 data.
Fig. 10. Microaveraged F1 (percent) of M-FFC with different $\gamma$ values for Cade12 data.
6 CONCLUSIONS
We have presented a fuzzy self-constructing feature
clustering (FFC) algorithm, which is an incremental
clustering approach to reduce the dimensionality of the
features in text classification. Features that are similar to
each other are grouped into the same cluster. Each cluster is
characterized by a membership function with statistical
mean and deviation. If a word is not similar to any existing
cluster, a new cluster is created for this word. Similarity
between a word and a cluster is defined by considering
both the mean and the variance of the cluster. When all the
words have been fed in, a desired number of clusters are
formed automatically. We then have one extracted feature
for each cluster. The extracted feature corresponding to a
cluster is a weighted combination of the words contained in
the cluster. By this algorithm, the derived membership
functions match closely with and describe properly the real
distribution of the training data. Besides, the user need not
specify the number of extracted features in advance, and
trial-and-error for determining the appropriate number of
extracted features can then be avoided. Experiments on
three real-world data sets have demonstrated that our
method can run faster and obtain better extracted features
than other methods.
Other projects were done on text classification, with or
without feature reduction, and evaluation results were
published. Some methods adopted the same performance
measures as we did, i.e., microaveraged accuracy or
microaveraged F1. Others adopted unilabeled accuracy
(ULAcc). Unilabeled accuracy is basically the same as the
normal accuracy, except that a pattern, either training or
testing, with i class labels are copied i times, each copy is
associated with a distinct class label. We discuss some
methods here. The evaluation results given here for them
are taken directly from corresponding papers. Joachims [33]
applied the Rocchio classifier over TFIDF-weighted repre-
sentation and obtained 91.8 percent unilabeled accuracy on
the 20 Newsgroups data set. Lewis et al. [40] used the SVM
classifier over a form of TFIDF weighting and obtained
81.6 percent microaveraged F1 on the RCV1 data set. Al-
Mubaid and Umair [31] applied a classifier, called Lsquare,
which combines the distributional clustering of words and a
learning logic technique, and achieved 98.00 percent
microaveraged accuracy and 86.45 percent microaveraged
F1 on the 20 Newsgroups data set. Cardoso-Cachopo [45]
applied the SVM classifier over the normalized TFIDF term
weighting, and achieved 52.84 percent unilabeled accuracy
on the Cade12 data set and 82.48 percent on the 20 News-
groups data set. Al-Mubaid and Umair [31] applied AIB
[25] for feature reduction, and the number of features was
reduced to 120. No feature reduction was applied in the
other three methods. The evaluation results published for
these methods are summarized in Table 17. Note that,
according to the definitions, unilabeled accuracy is usually
lower than Microaveraged accuracy for multilabel classifi-
cation. This explains why both Joachims [33] and Cardoso-
Cachopo [45] have lower unilabeled accuracies than the
microaveraged accuracies shown in Tables 8 and 15.
However, Lewis et al. [40] report a higher microaveraged F1 than that shown in Table 11. This may be because Lewis et al. [40] used TFIDF weighting, which is more
elaborate than the TF weighting we used. However, it is
hard for us to explain the difference between the microaveraged F1 obtained by Al-Mubaid and Umair [31] and that shown in Table 9, since a kind of sampling was applied in Al-Mubaid and Umair [31].
TABLE 17. Published Evaluation Results for Some Other Text Classifiers.
Similarity-based clustering is one of the techniques we
have developed in our machine learning research. In this
paper, we apply this clustering technique to text categor-
ization problems. We are also applying it to other problems,
such as image segmentation, data sampling, fuzzy model-
ing, web mining, etc. The work of this paper was motivated
by distributional word clustering proposed in [24], [25],
[26], [27], [29], [30]. We found that when a document set is
transformed to a collection of word patterns, as by (7), the
relevance among word patterns can be measured, and the
word patterns can be grouped by applying our similarity-
based clustering algorithm. Our method is good for text
categorization problems due to the suitability of the
distributional word clustering concept.
ACKNOWLEDGMENTS
The authors are grateful to the anonymous reviewers for
their comments, which were very helpful in improving the
quality and presentation of the paper. They would also like to
express their thanks to National Institute of Standards and
Technology for the provision of the RCV1 data set. This
work was supported by the National Science Council under
the grants NSC-95-2221-E-110-055-MY2 and NSC-96-2221-
E-110-009.
REFERENCES
[1] http://people.csail.mit.edu/jrennie/20Newsgroups/, 2010.
[2] http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html, 2010.
[3] H. Kim, P. Howland, and H. Park, Dimension Reduction in Text
Classification with Support Vector Machines, J. Machine Learning
Research, vol. 6, pp. 37-53, 2005.
[4] F. Sebastiani, Machine Learning in Automated Text Categoriza-
tion, ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[5] B.Y. Ricardo and R.N. Berthier, Modern Information Retrieval.
Addison Wesley Longman, 1999.
[6] A.L. Blum and P. Langley, Selection of Relevant Features and
Examples in Machine Learning, Artificial Intelligence, vol. 97,
nos. 1/2, pp. 245-271, 1997.
[7] E.F. Combarro, E. Montañés, I. Díaz, J. Ranilla, and R. Mones,
Introducing a Family of Linear Measures for Feature Selection in
Text Categorization, IEEE Trans. Knowledge and Data Eng., vol. 17,
no. 9, pp. 1223-1232, Sept. 2005.
[8] K. Daphne and M. Sahami, Toward Optimal Feature Selection,
Proc. 13th Intl Conf. Machine Learning, pp. 284-292, 1996.
[9] R. Kohavi and G. John, Wrappers for Feature Subset Selection,
Artificial Intelligence, vol. 97, nos. 1/2, pp. 273-324, 1997.
[10] Y. Yang and J.O. Pedersen, A Comparative Study on Feature
Selection in Text Categorization, Proc. 14th Intl Conf. Machine
Learning, pp. 412-420, 1997.
[11] D.D. Lewis, Feature Selection and Feature Extraction for Text
Categorization, Proc. Workshop Speech and Natural Language,
pp. 212-217, 1992.
[12] H. Li, T. Jiang, and K. Zang, Efficient and Robust Feature
Extraction by Maximum Margin Criterion, T. Sebastian, S.
Lawrence, and S. Bernhard eds. Advances in Neural Information
Processing System, pp. 97-104, Springer, 2004.
[13] E. Oja, Subspace Methods of Pattern Recognition. Research Studies
Press, 1983.
[14] J. Yan, B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang, W. Xi,
and Z. Chen, Effective and Efficient Dimensionality Reduction
for Large-Scale and Streaming Data Preprocessing, IEEE Trans.
Knowledge and Data Eng., vol. 18, no. 3, pp. 320-333, Mar. 2006.
[15] I.T. Jolliffe, Principal Component Analysis. Springer-Verlag, 1986.
[16] A.M. Martinez and A.C. Kak, PCA versus LDA, IEEE Trans.
Pattern Analysis and Machine Intelligence, vol. 23, no. 2 pp. 228-233,
Feb. 2001.
[17] H. Park, M. Jeon, and J. Rosen, Lower Dimensional Representa-
tion of Text Data Based on Centroids and Least Squares, BIT
Numerical Math, vol. 43, pp. 427-448, 2003.
[18] S.T. Roweis and L.K. Saul, Nonlinear Dimensionality Reduction
by Locally Linear Embedding, Science, vol. 290, pp. 2323-2326,
2000.
[19] J.B. Tenenbaum, V. de Silva, and J.C. Langford, A Global
Geometric Framework for Nonlinear Dimensionality Reduction,
Science, vol. 290, pp. 2319-2323, 2000.
[20] M. Belkin and P. Niyogi, Laplacian Eigenmaps and Spectral
Techniques for Embedding and Clustering, Advances in Neural
Information Processing Systems, vol. 14, pp. 585-591, The MIT Press
2002.
[21] K. Hiraoka, K. Hidai, M. Hamahira, H. Mizoguchi, T. Mishima,
and S. Yoshizawa, Successive Learning of Linear Discriminant
Analysis: Sanger-Type Algorithm, Proc. IEEE CS Intl Conf.
Pattern Recognition, pp. 2664-2667, 2000.
[22] J. Weng, Y. Zhang, and W.S. Hwang, Candid Covariance-Free
Incremental Principal Component Analysis, IEEE Trans. Pattern
Analysis and Machine Intelligence, vol. 25, no. 8, pp. 1034-1040, Aug.
2003.
[23] J. Yan, B.Y. Zhang, S.C. Yan, Z. Chen, W.G. Fan, Q. Yang, W.Y.
Ma, and Q.S. Cheng, IMMC: Incremental Maximum Margin
Criterion, Proc. 10th ACM SIGKDD, pp. 725-730, 2004.
[24] L.D. Baker and A. McCallum, Distributional Clustering of Words
for Text Classification, Proc. ACM SIGIR, pp. 96-103, 1998.
[25] R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter, Distribu-
tional Word Clusters versus Words for Text Categorization,
J. Machine Learning Research, vol. 3, pp. 1183-1208, 2003.
[26] M.C. Dalmau and O.W.M. Flórez, Experimental Results of the
Signal Processing Approach to Distributional Clustering of Terms
on Reuters-21578 Collection, Proc. 29th European Conf. IR Research,
pp. 678-681, 2007.
[27] I.S. Dhillon, S. Mallela, and R. Kumar, A Divisive Information-
Theoretic Feature Clustering Algorithm for Text Classification,
J. Machine Learning Research, vol. 3, pp. 1265-1287, 2003.
[28] D. Ienco and R. Meo, Exploration and Reduction of the Feature
Space by Hierarchical Clustering, Proc. SIAM Conf. Data Mining,
pp. 577-587, 2008.
[29] N. Slonim and N. Tishby, The Power of Word Clusters for Text
Classification, Proc. 23rd European Colloquium on Information
Retrieval Research (ECIR), 2001.
[30] F. Pereira, N. Tishby, and L. Lee, Distributional Clustering of
English Words, Proc. 31st Ann. Meeting of ACL, pp. 183-190, 1993.
[31] H. Al-Mubaid and S.A. Umair, A New Text Categorization
Technique Using Distributional Clustering and Learning Logic,
IEEE Trans. Knowledge and Data Eng., vol. 18, no. 9, pp. 1156-1165,
Sept. 2006.
[32] G. Salton and M.J. McGill, Introduction to Modern Retrieval.
McGraw-Hill Book Company, 1983.
[33] T. Joachims, A Probabilistic Analysis of the Rocchio Algorithm
with TFIDF for Text Categorization, Proc. 14th Intl Conf. Machine
Learning, pp. 143-151, 1997.
[34] J. Yen and R. Langari, Fuzzy Logic-Intelligence, Control, and
Information. Prentice-Hall, 1999.
[35] J.S. Wang and C.S.G. Lee, Self-Adaptive Neurofuzzy Inference
Systems for Classification Applications, IEEE Trans. Fuzzy
Systems, vol. 10, no. 6, pp. 790-802, Dec. 2002.
[36] T. Joachims, Text Categorization with Support Vector Machine:
Learning with Many Relevant Features, Technical Report LS-8-
23, Univ. of Dortmund, 1998.
[37] C. Cortes and V. Vapnik, Support-Vector Network, Machine
Learning, vol. 20, no. 3, pp. 273-297, 1995.
[38] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[39] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern
Analysis. Cambridge Univ. Press, 2004.
[40] D.D. Lewis, Y. Yang, T. Rose, and F. Li, RCV1: A New
Benchmark Collection for Text Categorization Research,
J. Machine Learning Research, vol. 5, pp. 361-397, http://
www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf, 2004.
[41] The Cadê Web directory, http://www.cade.com.br/, 2010.
[42] C.C. Chang and C.J. Lin, Libsvm: A Library for Support Vector
Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm. 2001.
[43] Y. Yang and X. Liu, A Re-Examination of Text Categorization
Methods, Proc. ACM SIGIR, pp. 42-49, 1999.
[44] G. Tsoumakas, I. Katakis, and I. Vlahavas, Mining Multi-Label
Data, Data Mining and Knowledge Discovery Handbook, O. Maimon
and L. Rokach eds., second ed. Springer, 2009.
[45] http://web.ist.utl.pt/~acardoso/datasets/, 2010.
Jung-Yi Jiang received the BS degree from I-
Shou University, Taiwan, in 2002, and the MSEE
degree from Sun Yat-Sen University, Taiwan, in
2004. He is currently pursuing the PhD degree at
the Department of Electrical Engineering, Na-
tional Sun Yat-Sen University. His main re-
search interests include machine learning, data
mining, and information retrieval.
Ren-Jia Liou received the BS degree from
National Dong Hwa University, Taiwan, in 2005,
and the MSEE degree from Sun Yat-Sen
University, Taiwan, in 2009. He is currently a
research assistant in the Department of Chem-
istry, National Sun Yat-Sen University. His main
research interests include machine learning,
data mining, and web programming.
Shie-Jue Lee received the BS and MS degrees
in electrical engineering, in 1977 and 1979,
respectively, from National Taiwan University,
and the PhD degree in computer science from
the University of North Carolina, Chapel Hill, in
1990. He joined the faculty of the Department of
Electrical Engineering at National Sun Yat-Sen
University, Taiwan, in 1983, where he has been
working as a professor since 1994. Now, he is
the deputy dean of academic affairs and director
of NSYSU-III Research Center, National Sun Yat-Sen University. His
research interests include artificial intelligence, machine learning, data
mining, information retrieval, and soft computing. He served as director
of the Center for Telecommunications Research and Development,
National Sun Yat-Sen University (1997-2000), director of the Southern
Telecommunications Research Center, National Science Council (1998-
1999), and chair of the Department of Electrical Engineering, National
Sun Yat-Sen University (2000-2003). He is a recipient of the
Distinguished Teachers Award of the Ministry of Education, Taiwan,
1993. He was awarded by the Chinese Institute of Electrical Engineering
for Outstanding MS Thesis Supervision, 1997. He received the
Distinguished Paper Award of the Computer Society of the Republic of
China, 1998, and the Best Paper Award of the Seventh Conference on
Artificial Intelligence and Applications, 2002. He received the Distin-
guished Research Award of National Sun Yat-Sen University, 1998. He
received the Distinguished Teaching Award of National Sun Yat-Sen
University, 1993 and 2008. He received the Best Paper Award of the
International Conference on Machine Learning and Cybernetics, 2008.
He also received the Distinguished Mentor Award of National Sun Yat-
Sen University, 2008. He served as the program chair for the
International Conference on Artificial Intelligence (TAAI-96), Kaohsiung,
Taiwan, December 1996, International Computer Symposium Workshop on Artificial Intelligence, Tainan, Taiwan, December 1998, and the
Sixth Conference on Artificial Intelligence and Applications, Kaohsiung,
Taiwan, November, 2001. He is a member of the IEEE Society of
Systems, Man, and Cybernetics, the Association for Automated
Reasoning, the Institute of Information and Computing Machinery, the
Taiwan Fuzzy Systems Association, and the Taiwanese Association of
Artificial Intelligence. He is also a member of the IEEE.