1 INTRODUCTION

DOCUMENT clustering aims to automatically group related documents into clusters. It is one of the most important tasks in machine learning and artificial intelligence and has received much attention in recent years [1], [2], [3]. Based on various distance measures, a number of methods have been proposed to handle document clustering [4], [5], [6], [7], [8], [9], [10]. A typical and widely used distance measure is the euclidean distance. The k-means method [4] is one of the methods that use the euclidean distance; it minimizes the sum of the squared euclidean distances between the data points and their corresponding cluster centers. Since the document space is always of high dimensionality, it is preferable to find a low-dimensional representation of the documents to reduce computational complexity.
Low computation cost is achieved in spectral clustering methods, in which the documents are first projected into a low-dimensional semantic space and then a traditional clustering algorithm is applied to find document clusters. Latent semantic indexing (LSI) [7] is one of the effective spectral clustering methods, aimed at finding the best subspace approximation to the original document space by minimizing the global reconstruction error (euclidean distance).
However, because of the high dimensionality of the document space, a certain representation of documents usually resides on a nonlinear manifold embedded in the similarities between the data points [11]. Unfortunately, the euclidean distance is a dissimilarity measure which describes the dissimilarities rather than the similarities between the documents. Thus, it is not able to effectively capture the nonlinear manifold structure embedded in the similarities between them [12]. An effective document clustering method must be able to find a low-dimensional representation of the documents that can best preserve the similarities between the data points. The locality preserving indexing (LPI) method is a different spectral clustering method based on graph partitioning theory [8]. The LPI method applies a weighting function to each pairwise distance, attempting to focus on capturing the similarity structure, rather than the dissimilarity structure, of the documents. However, it does not overcome the essential limitation of the euclidean distance. Furthermore, the selection of the weighting function is often a difficult task.
In recent years, several studies [13], [14], [15] have suggested that correlation as a similarity measure can capture the intrinsic structure embedded in high-dimensional data, especially when the input data are sparse. In probability theory and statistics, correlation indicates the strength and direction of a linear relationship between two random variables, and it reveals the nature of the data through the classical geometric concept of an angle. It is a scale-invariant association measure commonly used to calculate the similarity between two vectors. In many cases, correlation can effectively represent the distributional structure of the input data, which the conventional euclidean distance cannot capture.
The usage of correlation as a similarity measure can be found in the canonical correlation analysis (CCA) method [16]. The CCA method finds projections for paired data sets such that the correlations between their low-dimensional representatives in the projected spaces are mutually maximized. Specifically, given a paired data set consisting of matrices $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, we
would like to find directions $w_x$ for $X$ and $w_y$ for $Y$ that maximize the correlation between the projections of $X$ on $w_x$ and the projections of $Y$ on $w_y$. This can be expressed as

$$\max_{w_x, w_y} \frac{\langle X w_x, Y w_y \rangle}{\|X w_x\| \, \|Y w_y\|}, \qquad (1)$$

where $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ denote the inner product and norm operators, respectively. As a powerful statistical technique, the CCA method has been applied in the fields of pattern recognition and machine learning [15], [16]. Rather than finding a projection of one set of data, CCA finds projections for two sets of corresponding data $X$ and $Y$ into a single latent space such that corresponding points in the two data sets are projected as near to each other as possible. In the application of document clustering, while the document matrix $X$ is available, the cluster label $Y$ is not. So the CCA method cannot be directly used for clustering.
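For readers unfamiliar with CCA, the following minimal numpy sketch (ours, not part of the original paper; all names are our own) estimates the leading direction pair of (1) via the standard whitened-SVD formulation, with a small ridge term added as an assumption to keep the covariance blocks invertible:

```python
import numpy as np

def cca_first_pair(X, Y, reg=1e-8):
    """Leading CCA directions w_x, w_y maximizing corr(X @ w_x, Y @ w_y).

    X: (n, p) and Y: (n, q) are row-paired, column-centered data matrices.
    A small ridge `reg` (our assumption) keeps the covariance blocks invertible.
    """
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten: the singular vectors of Lx^{-1} Cxy Ly^{-T} give the directions.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M)
    w_x = np.linalg.solve(Lx.T, U[:, 0])
    w_y = np.linalg.solve(Ly.T, Vt[0])
    return w_x, w_y, s[0]  # s[0] is the leading canonical correlation
```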
In this paper, we propose a new document clustering method based on correlation preserving indexing (CPI), which explicitly considers the manifold structure embedded in the similarities between the documents. It aims to find an optimal semantic subspace by simultaneously maximizing the correlations between the documents in the local patches and minimizing the correlations between the documents outside these patches. This is different from LSI and LPI, which are based on a dissimilarity measure (the euclidean distance) and focus on detecting the intrinsic structure between widely separated documents rather than between nearby documents. The similarity-measure-based CPI method, by contrast, focuses on detecting the intrinsic structure between nearby documents rather than between widely separated documents. Since the intrinsic semantic structure of the document space is often embedded in the similarities between the documents [11], CPI can effectively detect the intrinsic semantic structure of the high-dimensional document space. In this respect, it is similar to Latent Dirichlet Allocation (LDA) [17], which attempts to capture significant intradocument statistical structure (intrinsic semantic structure embedded in the similarities between the documents) via a mixture distribution model.
The rest of the paper is organized as follows: the
proposed document clustering method is presented in
Section 2. In Section 3, experimental results are provided
to illustrate the performance of the CPI method. Finally,
conclusions are given in Section 4.
2 DOCUMENT CLUSTERING BASED ON CORRELATION PRESERVING INDEXING
In high-dimensional document space, the semantic struc-
ture is usually implicit. It is desirable to find a low-
dimensional semantic subspace in which the semantic
structure can become clear. Hence, discovering the intrinsic
structure of the document space is often a primary concern
of document clustering. Since the manifold structure is
often embedded in the similarities between the documents,
correlation as a similarity measure is suitable for capturing
the manifold structure embedded in the high-dimensional
document space. Mathematically, the correlation between two vectors (column vectors) $u$ and $v$ is defined as

$$\mathrm{Corr}(u, v) = \frac{u^T v}{\sqrt{u^T u}\,\sqrt{v^T v}} = \left\langle \frac{u}{\|u\|}, \frac{v}{\|v\|} \right\rangle. \qquad (2)$$

Note that the correlation corresponds to an angle $\theta$ such that $\cos\theta = \mathrm{Corr}(u, v)$. The larger the value of $\mathrm{Corr}(u, v)$, the stronger the association between the two vectors $u$ and $v$.
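As a quick illustration of (2), here is a one-line numpy version (our sketch; note that (2) is the uncentered correlation, i.e., the cosine of the angle between the two vectors):

```python
import numpy as np

def corr(u, v):
    """Correlation of (2): the cosine of the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u, v = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(corr(u, v))  # 1.0: parallel vectors are maximally correlated
```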
Document clustering aims to group documents into clusters, which belongs to unsupervised learning. However, it can be transformed into a semi-supervised learning problem by using the following side information:

A1. If two documents are close to each other in the original document space, then they tend to be grouped into the same cluster [8].

A2. If two documents are far away from each other in the original document space, they tend to be grouped into different clusters.

Based on these assumptions, we can derive a spectral clustering method in the correlation similarity measure space through nearest-neighbors graph learning.
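As an illustration of how the neighbor sets behind A1/A2 might be built, here is a short sketch (ours; the paper does not commit to a particular neighbor-search routine, and the function name and brute-force search are our assumptions):

```python
import numpy as np

def knn_by_correlation(X, k):
    """Index sets N(x_i) of the k most correlated documents for each column x_i.

    X: (m, n) term-by-document matrix; returns an (n, k) integer array.
    Neighbors are taken under the correlation (cosine) similarity of (2).
    """
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)  # unit columns
    S = Xn.T @ Xn                                 # pairwise correlations
    np.fill_diagonal(S, -np.inf)                  # exclude self-matches
    return np.argsort(-S, axis=1)[:, :k]          # top-k per document
```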
2.1 Correlation-Based Clustering Criterion
Suppose $y_i \in Y$ is the low-dimensional representation of the $i$th document $x_i \in X$ in the semantic subspace, where $i = 1, 2, \ldots, n$. Then the above assumptions (A1) and (A2) can be expressed as

$$\max \sum_i \sum_{x_j \in N(x_i)} \mathrm{Corr}(y_i, y_j) \qquad (3)$$

and

$$\min \sum_i \sum_{x_j \notin N(x_i)} \mathrm{Corr}(y_i, y_j), \qquad (4)$$

respectively, where $N(x_i)$ denotes the set of nearest neighbors of $x_i$. The optimization of (3) and (4) is equivalent to the following metric learning:

$$d(x, y) = c - \cos(x, y),$$

where $d(x, y)$ denotes the similarity between the documents $x$ and $y$, and $c$ corresponds to whether $x$ and $y$ are nearest neighbors of each other.
The maximization problem (3) attempts to ensure that if $x_i$ and $x_j$ are close, then $y_i$ and $y_j$ are close as well. Similarly, the minimization problem (4) attempts to ensure that if $x_i$ and $x_j$ are far apart, then $y_i$ and $y_j$ are also far apart. Since the following equality always holds:

$$\sum_i \sum_{y_j \in N(y_i)} \mathrm{Corr}(y_i, y_j) + \sum_i \sum_{y_j \notin N(y_i)} \mathrm{Corr}(y_i, y_j) = \sum_i \sum_j \mathrm{Corr}(y_i, y_j), \qquad (5)$$
the simultaneous optimization of (3) and (4) can be achieved by maximizing the following objective function:

$$\frac{\sum_i \sum_{x_j \in N(x_i)} \mathrm{Corr}(y_i, y_j)}{\sum_i \sum_j \mathrm{Corr}(y_i, y_j)}. \qquad (6)$$

Without loss of generality, we denote the mapping between the original document space and the low-dimensional semantic subspace by $W$, that is, $W^T x_i = y_i$. Following some algebraic manipulations, we have
$$\begin{aligned}
\frac{\sum_i \sum_{x_j \in N(x_i)} \mathrm{Corr}(y_i, y_j)}{\sum_i \sum_j \mathrm{Corr}(y_i, y_j)}
&= \frac{\sum_i \sum_{x_j \in N(x_i)} y_i^T y_j \big/ \sqrt{(y_i^T y_i)(y_j^T y_j)}}{\sum_i \sum_j y_i^T y_j \big/ \sqrt{(y_i^T y_i)(y_j^T y_j)}} \\
&= \frac{\sum_i \sum_{x_j \in N(x_i)} \mathrm{tr}(y_i y_j^T) \big/ \sqrt{\mathrm{tr}(y_i y_i^T)\,\mathrm{tr}(y_j y_j^T)}}{\sum_i \sum_j \mathrm{tr}(y_i y_j^T) \big/ \sqrt{\mathrm{tr}(y_i y_i^T)\,\mathrm{tr}(y_j y_j^T)}} \\
&= \frac{\sum_i \sum_{x_j \in N(x_i)} \mathrm{tr}(W^T x_i x_j^T W) \big/ \sqrt{\mathrm{tr}(W^T x_i x_i^T W)\,\mathrm{tr}(W^T x_j x_j^T W)}}{\sum_i \sum_j \mathrm{tr}(W^T x_i x_j^T W) \big/ \sqrt{\mathrm{tr}(W^T x_i x_i^T W)\,\mathrm{tr}(W^T x_j x_j^T W)}}, \qquad (7)
\end{aligned}$$

where $\mathrm{tr}(\cdot)$ is the trace operator. Based on optimization theory, the maximization of (7) can be written as

$$\arg\max_W \frac{\sum_i \sum_{x_j \in N(x_i)} \mathrm{tr}(W^T x_i x_j^T W)}{\sum_i \sum_j \mathrm{tr}(W^T x_i x_j^T W)} = \arg\max_W \frac{\mathrm{tr}\big(W^T \big(\sum_i \sum_{x_j \in N(x_i)} x_i x_j^T\big) W\big)}{\mathrm{tr}\big(W^T \big(\sum_i \sum_j x_i x_j^T\big) W\big)}, \qquad (8)$$

with the constraints

$$\mathrm{tr}(W^T x_i x_i^T W) = 1 \quad \text{for all } i = 1, 2, \ldots, n. \qquad (9)$$
Consider a mapping $W \in \mathbb{R}^{m \times d}$, where $m$ and $d$ are the dimensions of the original document space and the semantic subspace, respectively. We need to solve the following constrained optimization:

$$\arg\max \frac{\sum_{i=1}^{d} w_i^T M_S w_i}{\sum_{i=1}^{d} w_i^T M_T w_i} \qquad (10)$$

subject to

$$\sum_{i=1}^{d} w_i^T x_j x_j^T w_i = 1, \quad j = 1, 2, \ldots, n. \qquad (11)$$

Here, the matrices $M_T$ and $M_S$ (see footnote 1) are defined as

$$M_T = \sum_i \sum_j x_i x_j^T, \qquad M_S = \sum_i \sum_{x_j \in N(x_i)} x_i x_j^T.$$
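To make the two matrices concrete, here is a small numpy sketch (ours; the helper name and the dense adjacency representation are our assumptions) that builds $M_S$ and $M_T$ from a document matrix and precomputed neighbor sets, including the symmetrization described in footnote 1:

```python
import numpy as np

def build_MS_MT(X, neighbors):
    """M_T = sum_i sum_j x_i x_j^T; M_S sums only over neighbor pairs.

    X: (m, n) documents as columns; neighbors: (n, k) index array N(x_i).
    The neighbor relation is symmetrized so that M_S is symmetric
    (cf. footnote 1 on M_S).
    """
    m, n = X.shape
    s = X.sum(axis=1, keepdims=True)
    MT = s @ s.T                         # sum_i sum_j x_i x_j^T = (sum x)(sum x)^T
    A = np.zeros((n, n))
    for i in range(n):
        A[i, neighbors[i]] = 1.0
    A = np.maximum(A, A.T)               # symmetrize the k-NN relation
    MS = X @ A @ X.T                     # sum over pairs with x_j in N(x_i)
    return MS, MT
```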
It is easy to verify that the matrix $M_T$ is positive semidefinite. Since the documents are projected into the low-dimensional semantic subspace in which the correlations between the document points among the nearest neighbors are preserved, we call this criterion correlation preserving indexing.

Physically, this model may be interpreted as follows: all documents are projected onto the unit hypersphere (a circle in 2D). The global angles between the points in the local neighborhoods, $\theta_i$, are minimized and the global angles between the points outside the local neighborhoods, $\varphi_j$, are maximized simultaneously, as illustrated in Fig. 1. On the unit hypersphere, a global angle can be measured by the spherical arc, that is, the geodesic distance. The geodesic distance between $z$ and $z'$ on the unit hypersphere can be expressed as
$$d_G(z, z') = \arccos(z^T z') = \arccos(\mathrm{Corr}(z, z')). \qquad (12)$$

Since a strong correlation between $z$ and $z'$ means a small geodesic distance between $z$ and $z'$, CPI is equivalent to simultaneously minimizing the geodesic distances between the points in the local patches and maximizing the geodesic distances between the points outside these patches. The geodesic distance is superior to the traditional euclidean distance in capturing the latent manifold [18]. Based on this conclusion, CPI can effectively capture the intrinsic structures embedded in the high-dimensional document space.
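A direct transcription of (12) in Python (ours; the clipping guard is our addition to protect against floating-point values slightly outside $[-1, 1]$):

```python
import numpy as np

def geodesic_on_sphere(z1, z2):
    """Spherical arc length of (12) between unit vectors z1 and z2."""
    c = np.clip(z1 @ z2, -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return float(np.arccos(c))
```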
It is worth noting that semi-supervised learning using the nearest-neighbors graph approach in the euclidean distance space was originally proposed in [19] and [20], and LPI is also based on this idea. In contrast, CPI is a semi-supervised learning method using the nearest-neighbors graph approach in the correlation measure space. Zhong and Ghosh showed that the euclidean distance is not appropriate for clustering high-dimensional normalized data such as text, and that a better metric for text clustering is the cosine similarity [12]. In [21], Lebanon proposed a distance metric for text documents, which was defined as
$$d_{F_\lambda}(x, y) = \arccos\left( \sum_{i=1}^{n+1} \frac{\lambda_i \sqrt{x_i y_i}}{\sqrt{\langle x, \lambda\rangle \langle y, \lambda\rangle}} \right).$$

If we use the notations $\bar{x} = (\sqrt{x_1}, \sqrt{x_2}, \ldots)$, $\bar{y} = (\sqrt{y_1}, \sqrt{y_2}, \ldots)$, and set $\lambda_1 = \lambda_2 = \cdots = \lambda_n$, then the distance metric $d_{F_\lambda}(x, y)$ reduces to

$$d_{F_\lambda}(x, y) = \arccos\left( \frac{\sum_{i=1}^{n+1} \bar{x}_i \bar{y}_i}{\sqrt{\langle \bar{x}, \bar{x}\rangle \langle \bar{y}, \bar{y}\rangle}} \right) = \arccos(\mathrm{Corr}(\bar{x}, \bar{y})).$$
This distance is very similar to the distance defined by (12). Since the distance $d_{F_\lambda}(x, y)$ is local (thus it captures local variations within the space) and is defined on the entire embedding space [21], correlation might be a suitable distance measure for capturing the intrinsic structure embedded in the document space. That is why the proposed CPI method is expected to outperform the LPI method. Note that the distance $d_{F_\lambda}(x, y)$ is obtained from training data, and hence it is suited to classification rather than clustering.
Fig. 1. 2D projections of CPI.
1. When computing the matrix $M_S$, if $x_j$ is among the nearest neighbors of $x_i$, then we consider $x_i$ to be among the nearest neighbors of $x_j$ as well. This ensures that $M_S$ is a symmetric matrix.
2.2 Algorithm Derivation
The optimization problem (10) with the constraints (11) can be solved by maximizing the objective function $\sum_{i=1}^{d} w_i^T M_S w_i$ under the constraints

$$\sum_{i=1}^{d} w_i^T M_T w_i = 1 \qquad (13)$$

and

$$\sum_{i=1}^{d} w_i^T x_j x_j^T w_i = 1, \quad j = 1, 2, \ldots, n. \qquad (14)$$

To do this, we introduce an additional Lagrangian function with multipliers $\Lambda$ [22] as follows:
$$J_\Lambda(W) = \sum_{i=1}^{d} \frac{1}{\lambda} w_i^T M_S w_i - \lambda_0 \left( \sum_{i=1}^{d} w_i^T M_T w_i - 1 \right) - \lambda_1 \left( \sum_{i=1}^{d} w_i^T x_1 x_1^T w_i - 1 \right) - \cdots - \lambda_n \left( \sum_{i=1}^{d} w_i^T x_n x_n^T w_i - 1 \right), \qquad (15)$$
where $W = (w_1, \ldots, w_d)$ and $\Lambda = (\lambda, \lambda_0, \lambda_1, \ldots, \lambda_n)$. The additional Lagrange multiplier $1/\lambda$ ($\lambda \neq 0$) is introduced as a multiplicative factor for $\sum_{i=1}^{d} w_i^T M_S w_i$, which does not affect the solution (see [22] for details). The additional Lagrangian function $J_\Lambda(W)$ is maximized by setting the partial derivatives of $J_\Lambda(W)$ with respect to $w_i$ to zero, i.e., $\partial J_\Lambda(W) / \partial w_i = 0$. This yields
$$\frac{1}{\lambda} M_S w_i - \lambda_0 M_T w_i - \lambda_1 x_1 x_1^T w_i - \cdots - \lambda_n x_n x_n^T w_i = 0, \qquad (16)$$
or equivalently

$$M_S w_i - \lambda \left( \lambda_0 M_T + \lambda_1 x_1 x_1^T + \cdots + \lambda_n x_n x_n^T \right) w_i = 0. \qquad (17)$$
Equation (17) means that the $w_i$, $i = 1, 2, \ldots, d$, are the generalized eigenvectors of the matrix $M_S$ and the matrix

$$M = \lambda_0 M_T + \lambda_1 x_1 x_1^T + \cdots + \lambda_n x_n x_n^T \qquad (18)$$

corresponding to the $d$ largest generalized eigenvalues.
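In code, this step is a symmetric-definite generalized eigenvalue problem. A minimal scipy sketch (ours; it assumes $M$ is symmetric positive definite, which is why the paper later projects into the SVD subspace first):

```python
from scipy.linalg import eigh

def cpi_directions(MS, M, d):
    """Top-d generalized eigenvectors of MS w = mu * M w (cf. (17)-(18)).

    `eigh` returns eigenvalues in ascending order, so the last d columns
    correspond to the d largest generalized eigenvalues.
    """
    vals, vecs = eigh(MS, M)          # symmetric-definite generalized problem
    return vecs[:, -d:][:, ::-1]      # columns ordered by decreasing eigenvalue
```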
In order to find the optimal solution, we first need to fix the values of the Lagrange multipliers $\lambda_i$. In (15), suppose that the function

$$F(W) = \sum_{i=1}^{d} \frac{1}{\lambda} w_i^T M_S w_i$$

attains a relative extremum $F^* = \sum_{i=1}^{d} (w_i^*)^T M_S w_i^*$ together with the optimal values $\lambda_i^*$ and $w_i^*$ under the constraints

$$\sum_{i=1}^{d} w_i^T M_T w_i = b_0 \quad \text{and} \quad \sum_{i=1}^{d} w_i^T x_j x_j^T w_i = b_j.$$
Based on the interpretation of the Lagrange multipliers [23], the values of $F^*$, $w_i^*$, and $\lambda_i^*$ depend on the values of the $b_i$ on the right-hand sides of the constraint equations. Suppose that $\lambda_i^*$ and $w_i^*$ are continuously differentiable functions of $b_i$ in some $\varepsilon$-neighborhood of $b_i$. Then $F^*$ is also continuously differentiable with respect to $b_i$, and the partial derivatives of $F^*$ with respect to $b_i$ are equal to the corresponding Lagrange multipliers $\lambda_i^*$, i.e.,

$$\lambda_i^* = \frac{\partial F^*}{\partial b_i}, \quad i = 0, 1, 2, \ldots, n.$$

Let $D^* = W^* (W^*)^T$. It follows that
$$F^* = \sum_{i=1}^{d} (w_i^*)^T M_S w_i^* = \mathrm{tr}\big(M_S W^* (W^*)^T\big) = \mathrm{tr}(M_S D^*),$$

$$b_0 = \sum_{i=1}^{d} (w_i^*)^T M_T w_i^* = \mathrm{tr}\big(M_T W^* (W^*)^T\big) = \mathrm{tr}(M_T D^*),$$

$$b_j = \sum_{i=1}^{d} (w_i^*)^T x_j x_j^T w_i^* = \mathrm{tr}\big(x_j x_j^T W^* (W^*)^T\big) = \mathrm{tr}(x_j x_j^T D^*),$$

for all $j = 1, 2, \ldots, n$.
Then, the values of $\lambda_i$ can be computed as

$$\lambda_0 = \frac{\partial F^*}{\partial b_0} = \frac{\mathrm{tr}(M_S)}{\mathrm{tr}(M_T)}, \qquad (19)$$

$$\lambda_j = \frac{\partial F^*}{\partial b_j} = \frac{\mathrm{tr}(M_S)}{\mathrm{tr}(x_j x_j^T)}, \quad j = 1, 2, \ldots, n. \qquad (20)$$
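Since $\mathrm{tr}(x_j x_j^T) = \|x_j\|^2$, the multipliers of (19) and (20) reduce to ratios of traces that are cheap to evaluate. A short numpy sketch (ours; it assumes no document vector is zero):

```python
import numpy as np

def multipliers(MS, MT, X):
    """Lagrange multipliers of (19)-(20) from traces of MS, MT, x_j x_j^T."""
    lam0 = np.trace(MS) / np.trace(MT)
    # tr(x_j x_j^T) = ||x_j||^2, computed per column of X:
    lam = np.trace(MS) / np.sum(X * X, axis=0)
    return lam0, lam
```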
Note that in document clustering, the matrix $M$ in (18) is often singular, since the dimension of the documents is generally larger than the number of documents. To circumvent the requirement that $M$ be nonsingular, we may first project the document vectors into the SVD subspace by throwing away the zero singular values.
2.3 Clustering Algorithm Based on CPI
Given a set of documents $x_1, x_2, \ldots, x_n \in \mathbb{R}^m$, let $X$ denote the document matrix. The algorithm for document clustering based on CPI can be summarized as follows (a sketch of the full pipeline in code is given after the steps):

1. Construct the local neighbor patches, and compute the matrices $M_S$ and $M_T$.

2. Project the document vectors into the SVD subspace by throwing away the zero singular values. The singular value decomposition of $X$ can be written as $X = U \Sigma V^T$. Here all zero singular values in $\Sigma$ have been removed; accordingly, the vectors in $U$ and $V$ that correspond to these zero singular values have been removed as well. The document vectors in the SVD subspace are then obtained by $\tilde{X} = U^T X$.

3. Compute the CPI projection. Based on the multipliers $\lambda_0, \lambda_1, \ldots, \lambda_n$ obtained from (19) and (20), one can compute the matrix $M = \lambda_0 M_T + \lambda_1 x_1 x_1^T + \cdots + \lambda_n x_n x_n^T$. Let $W_{CPI}$ be the solution of the generalized eigenvalue problem $M_S W = \lambda M W$. Then the low-dimensional representation of the documents can be computed as

$$Y = W_{CPI}^T \tilde{X} = W^T X,$$

where $W = U W_{CPI}$ is the transformation matrix.

4. Cluster the documents in the CPI semantic subspace. Since the documents were projected onto the unit hypersphere, the inner product is a natural measure of similarity. We seek a partitioning $\{\pi_j\}_{j=1}^{k}$ of the documents that maximizes the following objective function [24]:

$$Q\big(\{\pi_j\}_{j=1}^{k}\big) = \sum_{j=1}^{k} \sum_{x \in \pi_j} x^T c_j,$$

with $c_j = m_j / \|m_j\|$, where $m_j$ is the mean of the document vectors contained in the cluster $\pi_j$.
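To make the four steps concrete, here is a self-contained end-to-end Python sketch (our illustration, not the authors' code). The correlation-based k-NN search, the small ridge added to keep $M$ positive definite, and the minimal spherical k-means used in place of the procedure of [24] are all our assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def cpi_cluster(X, k_neighbors, d, n_clusters, iters=50, seed=0):
    """End-to-end sketch of CPI clustering (Steps 1-4).

    X: (m, n) term-by-document matrix, documents as columns.
    Returns cluster labels for the n documents.
    """
    m, n = X.shape
    # Step 1: neighbor patches under correlation similarity.
    Xu = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    S = Xu.T @ Xu
    np.fill_diagonal(S, -np.inf)
    nbrs = np.argsort(-S, axis=1)[:, :k_neighbors]
    A = np.zeros((n, n))
    for i in range(n):
        A[i, nbrs[i]] = 1.0
    A = np.maximum(A, A.T)                      # symmetrize (footnote 1)
    # Step 2: project into the SVD subspace (drop zero singular values).
    U, sv, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, sv > 1e-10]
    Xt = U.T @ X                                # documents in SVD subspace
    MS = Xt @ A @ Xt.T
    s = Xt.sum(axis=1, keepdims=True)
    MT = s @ s.T
    # Step 3: multipliers (19)-(20), matrix M of (18), CPI projection.
    lam0 = np.trace(MS) / np.trace(MT)
    lam = np.trace(MS) / np.sum(Xt * Xt, axis=0)
    M = lam0 * MT + (Xt * lam) @ Xt.T           # lam_0 M_T + sum_j lam_j x_j x_j^T
    M += 1e-8 * np.eye(M.shape[0])              # ridge (ours): keep M definite
    _, vecs = eigh(MS, M)
    W_cpi = vecs[:, -d:]                        # d largest generalized eigenvalues
    Y = W_cpi.T @ Xt
    # Step 4: spherical k-means on the unit hypersphere.
    Y = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-12)
    rng = np.random.default_rng(seed)
    C = Y[:, rng.choice(n, n_clusters, replace=False)]
    for _ in range(iters):
        labels = np.argmax(C.T @ Y, axis=0)     # assign by inner product
        for j in range(n_clusters):
            pts = Y[:, labels == j]
            if pts.size:
                mu = pts.mean(axis=1)
                C[:, j] = mu / (np.linalg.norm(mu) + 1e-12)
    return labels
```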
2.4 Complexity Analysis
The time complexity of the CPI clustering algorithm can be analyzed as follows. Consider $n$ documents in the $d$-dimensional space ($d \gg n$). In Step 1, we first need to compute the pairwise distances, which requires $O(n^2 d)$ operations. Second, we need to find the $k$ nearest neighbors for each data point, which requires $O(k n^2)$ operations. Third, computing the matrices $M_S$ and $M_T$ requires $O(n^2 d)$ and $O(n(n+k)d)$ operations, respectively. Thus, the computation cost of Step 1 is $O(2n^2 d + kn^2 + n(n+k)d)$. In Step 2, the SVD decomposition of the matrix $X$ requires $O(d^3)$ operations, and projecting the documents into the $n$-dimensional SVD subspace takes $O(n^2 d)$ operations. As a result, Step 2 costs $O(d^3 + n^2 d)$. In Step 3, we need to solve the generalized eigenvalue problem $M_S w = \lambda M w$ in order to find the $r$ generalized eigenvectors associated with the $r$ largest eigenvalues, which requires $O(n^3)$ operations. Then, transforming the documents into the $r$-dimensional semantic subspace requires $O(r n^2)$ operations. Consequently, the computation cost of Step 3 is $O(n^3 + r n^2)$. In Step 4, it takes $O(l c n r)$ operations to find the final document clusters, where $l$ is the number of iterations and $c$ is the number of clusters. Since $k \ll n$, $l \ll n$, and $r, n \ll d$ in document clustering applications, Step 2 dominates the computation. To reduce the computation cost of Step 2, one can apply an iterative SVD algorithm [25] rather than a full matrix decomposition, or use a feature selection method to first reduce the dimension.
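As an illustration of how the $O(d^3)$ full decomposition in Step 2 can be avoided, the sketch below uses scipy's ARPACK-based truncated SVD on a sparse term-document matrix. This is our substitution for illustration purposes; it is not necessarily the iterative SVD algorithm of [25], and the matrix sizes are made up:

```python
from scipy.sparse import random as sprandom
from scipy.sparse.linalg import svds

# Truncated SVD keeps only the top-r singular triplets, avoiding the O(d^3)
# full decomposition when the term-document matrix is large and sparse.
X = sprandom(20000, 500, density=0.01, random_state=0)  # d=20000 terms, n=500 docs
r = 100
U, s, Vt = svds(X, k=r)          # ARPACK-based iterative solver
X_tilde = (X.T @ U).T            # documents in the r-dimensional SVD subspace
print(X_tilde.shape)             # (100, 500)
```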
3 EXPERIMENTAL RESULTS
In this section, the performance of the proposed CPI
method is demonstrated by various experiments and
compared with that of other competing methods.
3.1 Evaluation Metrics
In this work, the accuracy (AC) metric and the normalized mutual information (NMI) metric are used to measure the clustering performance [9]. The AC metric is defined as follows:

$$AC = \frac{\sum_{i=1}^{n} \delta(s_i, \mathrm{map}(r_i))}{n},$$

where $r_i$ is the cluster label obtained by our algorithm, $s_i$ is the label provided by the corpus, $n$ is the total number of documents, $\delta(x, y)$ is the delta function that equals one if $x = y$ and zero otherwise, and $\mathrm{map}(r_i)$ is the permutation mapping function that maps the cluster label $r_i$ to the equivalent label from the data corpus. The best mapping can be found using the Kuhn-Munkres algorithm [27].
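In practice, the optimal label mapping can be computed with scipy's assignment solver, which implements the Hungarian (Kuhn-Munkres) method. A sketch (ours; function name and the brute-force contingency counting are our choices):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def accuracy(true_labels, pred_labels):
    """Clustering accuracy: best one-to-one mapping of predicted labels
    to true labels (Kuhn-Munkres), then the fraction of matches."""
    t = np.unique(true_labels); p = np.unique(pred_labels)
    cost = np.zeros((p.size, t.size))
    for i, pi in enumerate(p):
        for j, tj in enumerate(t):
            cost[i, j] = -np.sum((pred_labels == pi) & (true_labels == tj))
    rows, cols = linear_sum_assignment(cost)   # minimizes, hence the negation
    return -cost[rows, cols].sum() / len(true_labels)

print(accuracy(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 2])))  # 0.75
```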
The normalized mutual information (NMI) is defined as

$$NMI(C, C') = \frac{MI(C, C')}{\max(H(C), H(C'))},$$

where $C$ is the set of clusters provided by the document corpus and $C'$ is the set of clusters obtained by our algorithm. $H(C)$ and $H(C')$ are the entropies of $C$ and $C'$, respectively, and $MI(C, C')$ is the mutual information between $C$ and $C'$:

$$MI(C, C') = \sum_{c_i \in C,\, c'_j \in C'} p(c_i, c'_j) \log_2 \frac{p(c_i, c'_j)}{p(c_i)\, p(c'_j)}.$$

Here, $p(c_i)$ (resp. $p(c'_j)$) is the probability that a document arbitrarily selected from the corpus belongs to the cluster $c_i$ (resp. $c'_j$), and $p(c_i, c'_j)$ is the joint probability that the arbitrarily selected document belongs to both clusters $c_i$ and $c'_j$ at the same time. It is easy to check that $NMI(C, C')$ equals zero when the two sets of clusters are independent, and equals one when they are identical.
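A direct transcription of the NMI definition above (our sketch, computing the empirical probabilities from label arrays):

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information MI(C, C') / max(H(C), H(C'))."""
    n = len(labels_a)
    mi = 0.0
    for ai in np.unique(labels_a):
        for bj in np.unique(labels_b):
            p_ab = np.sum((labels_a == ai) & (labels_b == bj)) / n
            p_a = np.sum(labels_a == ai) / n
            p_b = np.sum(labels_b == bj) / n
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / n
        return -np.sum(p * np.log2(p))
    return mi / max(entropy(labels_a), entropy(labels_b))

print(nmi(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))  # 1.0: identical partitions
```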
3.2 Document Representation
In all experiments, each document is represented as a term frequency vector, computed as follows (a sketch in code is given after the steps):

1. Transform each document into a list of terms after word stemming operations.

2. Remove stop words. Stop words are common words that carry no semantic content.

3. Compute the term frequency vector using the TF/IDF weighting scheme. The TF/IDF weight assigned to the term $t_i$ in document $d_j$ is given by

$$tfidf_{i,j} = tf_{i,j} \times idf_i.$$

Here,

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

is the term frequency of the term $t_i$ in document $d_j$, where $n_{i,j}$ is the number of occurrences of the considered term $t_i$ in document $d_j$, and

$$idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

is the inverse document frequency, where $|D|$ is the total number of documents in the corpus and $|\{d : t_i \in d\}|$ is the number of documents containing the term $t_i$.