
CLUSTERING DATASETS WITH SINGULAR VALUE
DECOMPOSITION
A thesis submitted in partial fulfillment of the requirements for the
degree
MASTER OF SCIENCE
in
MATHEMATICS
by
EMMELINE P. DOUGLAS
NOVEMBER 2008
at
THE GRADUATE SCHOOL OF THE COLLEGE OF CHARLESTON
Approved by:
Dr. Amy Langville, Thesis Advisor
Dr. Ben Cox
Dr. Katherine Johnston-Thom
Dr. Martin Jones
Dr. Amy T. McCandless, Dean of the Graduate School:
Copyright 2009 by Emmeline P. Douglas. All rights reserved.
ABSTRACT
CLUSTERING DATASETS WITH SINGULAR VALUE
DECOMPOSITION
A thesis submitted in partial fulfillment of the requirements for the
degree
MASTER OF SCIENCE
in
MATHEMATICS
by
EMMELINE P. DOUGLAS
NOVEMBER 2008
at
THE GRADUATE SCHOOL OF THE COLLEGE OF CHARLESTON
Spectral graph partitioning has been widely acknowledged as a useful way to cluster
matrices. Since eigen decompositions do not exist for rectangular matrices, it is
necessary to find an alternative method for clustering rectangular datasets. The
Singular Value Decomposition lends itself to two convenient and effective clustering
techniques, one using the signs of singular vectors and the other using gaps in singular
vectors. We can measure and compare the quality of our resultant clusters using an
entropy measure. When unable to decide which is better, the results can be nicely
aggregated.
Contents

1 Introduction
2 The Fiedler Method
2.1 Background
2.1.1 Clustering with the Fiedler vector
2.1.2 Limitations
2.2 Extended Fiedler Method
2.2.1 Clustering with Multiple Eigenvectors
2.2.2 Limitations
3 Moving from Eigenvectors to Singular Vectors
3.1 Small Example Datasets
3.2 How to Cluster a matrix with SVD Signs
3.2.1 Results on Small Yahoo! Dataset
3.2.2 Why the SVD Signs method works
3.2.3 Limitations of SVD signs
3.3 SVD Gaps Method
3.3.1 When are Gaps Large Enough?
3.3.2 Results on Small Yahoo! dataset
4 Quality of Clusters
4.1 Entropy Measure
4.2 Comparing results from Small Yahoo! Example
5 Cluster Aggregation
5.1 Results on Small Yahoo! Dataset
6 Experiments on Large datasets
6.1 Yahoo!
6.2 Wikipedia
6.3 Netflix
7 Conclusion
A MATLAB Code
A.1 SVD Signs
A.2 SVD Gaps
A.3 Entropy Measure
A.4 Cluster Aggregation
A.5 Other Code Used
Acknowledgements
I owe a great debt to the many people who helped me to successfully complete this
thesis. First and most of all, to my thesis advisor Dr. Amy Langville who not only
listened to countless research updates, presentations, and MATLAB complaints, but
who encouraged me and gave me the confidence to start a thesis in the first place.
Second to Kathryn Pedings who was always a ready and willing soundboard whenever
I was excited or frustrated and needed someone to talk to. Also I would like to thank
my committee members for their interest, criticism, and encouragement. Last, but
not least, I would like to thank my fiancé, Andrew Aghapour, for being patient with
me and my mood swings as I finished this thesis.
Chapter 1
Introduction
Since the introduction of the internet to the public in the late 1980s, life has become
much more convenient. Instead of making trips to the bank, the mall, the post
office, and the grocery store, a person can manage their finances, pay their bills,
shop for gifts, and even buy groceries from the comfort of their own home. As lucky
as the consumers think they are, the internet has actually proven to be a greater
boon to the companies providing these conveniences. Now companies can easily
gather information about their customers that they may not have known before,
which, in the end, will help them make even more sales to even more customers.
For example, the internet company Netflix invests a great deal of time and money
collecting data about their customers' movie preferences. They use this information
to make recommendations to other customers. Better recommendations may result
in more rentals and, more importantly, higher customer loyalty. However, moving
from data collection to movie recommendation is not a trivial task, as the datasets
inevitably grow quite large. One useful way to glean information from these massive
datasets is to cluster them.
Clustering is a data mining technique which reorganizes a dataset and places
objects from the dataset into groups of similar items. (Note to the reader: clustering
should not be confused with classification; classification names or qualifies the groups
and clustering does not, though it might be used as a means to that end, i.e. datasets
may be easier to classify once they have been clustered.) When the dataset is rep-
resented as a matrix, clustering is essentially reordering the matrix so that similar
rows and columns are near each other. Datasets can be created in hundreds of dif-
ferent structures and sizes; therefore, it makes sense that the methods used to cluster
them are just as abundant and varied. Hundreds of different clustering techniques
have been developed by both mathematicians and computer scientists over the years;
these techniques can be grouped into two main categories: hierarchical and partitional
[28].
Hierarchical: In Hierarchical algorithms, clusters are created in a tree-like
process by which the dataset is broken down into nested sets of clusters based on some
measure of similarity between objects. An example diagram describing this process
is shown in Figure 1.1.
Figure 1.1: Tree diagram for a hierarchical clustering algorithm. Any vertical cut
would result in a clustering.
Hierarchical algorithms can be subdivided into groups: the more widely used
agglomerative (or bottom up) methods, and the divisive (or top down) methods.
Linkage methods such as the Nearest Neighbors algorithm, which forms clusters by
grouping objects that are nearest to each other, and the Centroid Method, which
chooses central objects and then clusters the other objects according to their proximity
to either centroid, are good examples of popular agglomerative clustering algorithms
[8]. Divisive algorithms work in the opposite direction, starting with the full dataset
as one cluster and then splitting it into smaller and smaller pieces. However, these
techniques tend to be more computationally demanding, and, as mentioned before,
are not as popular as the agglomerative methods. More examples of hierarchical
techniques, along with some discussion of their merits and disadvantages, are
summarized by Everitt et al. in [12].
Partitional: Partitional algorithms work by dividing the dataset into disjoint
subsets. Principal Direction Divisive Partitioning, or PDDP, which divides a dataset
into halves using the principal direction of variation of the dataset as described by
Boley in [9], falls into this category along with the other Singular Value Decomposition
(SVD) based algorithms presented in Chapter 3. Spectral methods, or clustering
algorithms that analyze components of the eigen decomposition, are also partitional
algorithms. One partitional technique that has gotten a lot of attention lately uses
the PageRank vector to cluster data [2]. Independent Component Analysis, or ICA,
analyzes and divides a dataset so that objects between clusters are independent, and
objects within clusters are dependent [3], [17]. The k-means algorithm is a very
popular partitional algorithm that has long been upheld as a standard in the field
of clustering due to its efficiency, flexibility, and robustness. The algorithm divides the
data into k groups centered around the k cluster centers that must be chosen at the
outset of the algorithm [19]. A nice comparison of many different algorithms, both
hierarchical and partitional, is presented by Halkidi et al. in [16].
I go into more detail about the history of spectral clustering in Chapter 2, since
this particular field gave birth to the SVD clustering methods introduced in Chapter
3; the first, SVD Signs, is an algorithm outlined by Dr. Carl Meyer in [22], and
the second, SVD Gaps, is my own SVD-based clustering algorithm. Of course, after
introducing these two SVD clustering methods, some measure of cluster goodness
is necessary in order to compare the two methods directly. Therefore, in Chapter 4
an entropy measure will be introduced that can be used to measure how well a
dataset has been clustered. Many times, though, it is useful to be able to find an
average clustering when one algorithm does not stand out above the rest, so Chapter
5 introduces some ideas about cluster aggregation, a way to combine the results from
several clustering algorithms, that will be helpful in cases where multiple algorithms
produce good clusterings. Throughout these chapters small datasets will be used as
examples to help clarify how the algorithms work, and what a well-clustered matrix
looks like. However, as mentioned above, datasets tend to be very large in real life.
Thus, it is crucial that clustering methods work as well on these large datasets as
they do on the smaller ones. In Chapter 6 the results of my experiments with the
SVD Signs and SVD Gaps algorithms on three large datasets (one from Yahoo!, one
from Wikipedia, and one from Netix) will be presented.
The images in this paper have been created using several different computer
programs. I used MATLAB for some of the simpler images such as line graphs, and
the Apple Grapher application to display small three-dimensional datasets. All of the
images of large matrices were created with David Gleich's VISMATRIX tool [11].
Chapter 2
The Fiedler Method
Though graph theoretic clustering has been used heavily by computer scientists in
recent decades, the understanding of these methods is rather new. The mathematics
behind these methods was not explored until the late 1960s and early 1970s. In 1968,
Anderson and Morely published their paper [18] on the eigenvalues of the Laplacian
matrix, which is a special matrix in graph theory that will be defined in Section 2.1. Then
in 1973 and 1975, Miroslav Fiedler published his landmark papers [14] and [13] on
the properties of the eigensystems of the Laplacian matrix. Fiedler's ideas were not
applied to the field of clustering until Pothen, Simon, and Liou did so with their
1990 paper [24]. These papers are the origins of spectral graph partitioning, a sub-
field of clustering that uses the spectral or eigen properties of a matrix to identify
clusters. There are two methods in the spectral category that inspired the SVD-based
clustering methods introduced in Chapter 3: the Fiedler Method and the Extended
Fiedler method.
2.1 Background
The Fiedler Method takes its name from Miroslav Fiedler because of two important
papers he published in 1973 and 1975 that explored the properties of eigensystems
of the Laplacian Matrix. What is the Laplacian matrix? Consider this small graph
with 10 vertices or nodes that are connected by several edges.
Figure 2.1: Small graph with 10 nodes and its adjacency matrix.
We can easily represent this graph with a binary adjacency matrix, where the
rows and columns represent the 10 nodes, and non-zero entries represent the edges
between nodes. Any graph, small or large, can be fully represented by a matrix. Once
we have an adjacency matrix, the corresponding Laplacian matrix L can be found by
L = D − A,    (2.1)
where A is the adjacency matrix, and D is a diagonal matrix containing the row sums
of A. Figure 2.2 shows the Laplacian matrix for the 10 node graph given above.
Figure 2.2: Finding the Laplacian matrix for the adjacency matrix in Figure 2.1
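In MATLAB, for example, this construction takes only a couple of lines; the sketch
below uses a small made-up adjacency matrix rather than the graph in Figure 2.1:

% Build the Laplacian matrix of a graph from its binary adjacency matrix.
A = [0 1 1 0; 1 0 1 0; 1 1 0 1; 0 0 1 0];   % small illustrative 4-node graph
D = diag(sum(A, 2));                        % diagonal matrix of row sums (degrees)
L = D - A;                                  % Laplacian matrix, as in Equation (2.1)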
These matrices can be used to discover important properties of the graphs they
represent. Most importantly, a Laplacian matrix can give us information about the
connectivity of the graph it represents. In fact, in his papers [14] and [13], Miroslav
Fiedler proved that the eigenvector corresponding to the second smallest eigenvalue,
which is now called the Fiedler vector, can tell us how a graph can be broken down into
maximally intraconnected components and minimally interconnected components. In
other words, the Fiedler vector is a very useful tool for partitioning the graph. A more
in-depth look at spectral graph theory can be found in Chung's Spectral Graph Theory
[10]. Some more important results about spectral partitioning are shown in [26] and
[20].
2.1.1 Clustering with the Fiedler vector
Suppose we have the graph from Figure 2.1 with its corresponding Laplacian matrix
(see Figure 2.3).
Figure 2.3: Graph with 10 nodes and its Laplacian matrix.
From the Laplacian matrix we obtain the eigen decomposition; Figure 2.4
shows the eigenvectors and eigenvalues from the Laplacian matrix in Figure 2.3. No-
tice that the smallest eigenvalue is 0, and its corresponding eigenvector is a scalar
multiple of the identity vector, as is the case with all Laplacian matrices. The eigen-
vector we are interested in is the Fiedler vector, which is circled. We can use the
signs of this eigenvector to cluster our graph. This clustering method is known as the
Fiedler Method.
Figure 2.4: Eigenvectors and eigenvalues of L, with second smallest eigenvalue and
the Fiedler vector circled.
The rows with the same sign are placed in the same cluster. Therefore, for
the 10 node example, nodes 1, 2, 3, 7, 8, and 9 are in one cluster while nodes 4, 5, 6,
and 10 are in another cluster. Looking at Figure 2.5, we can see this partition makes
a lot of sense in the context of the graph. As expected, the Fiedler Method cut the
graph into two better connected subgraphs.
Figure 2.5: Signs of v_2, the Fiedler vector, and the partition made by the first iteration
of the Fiedler Method.
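One iteration of this sign-based partition can be sketched in a few lines of MATLAB
(an illustrative sketch only, not the code used to produce the figures in this chapter):

% One partitioning step of the Fiedler Method.
% L is the Laplacian matrix of the (sub)graph currently being partitioned.
[V, E]   = eig(L);                    % eigenvectors and eigenvalues of L
[~, idx] = sort(diag(E), 'ascend');   % order eigenvalues from smallest to largest
fiedler  = V(:, idx(2));              % eigenvector of the second smallest eigenvalue
cluster1 = find(fiedler >= 0);        % nodes with nonnegative entries
cluster2 = find(fiedler <  0);        % nodes with negative entries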
The next step is to take each subgraph and partition each with its own Fiedler
vector. We will only do this with the larger subgraph, since the cluster containing
nodes 4, 5, 6, and 10 is fully connected and so it does not make sense to cluster it
further. However, as we can see in Figure 2.6, the second iteration of the Fiedler
Method works very nicely on the second half of the graph.
Figure 2.6: Partition made by the second iteration of the Fiedler Method.
For this small graph, two iterations are sufficient for a satisfactory clustering.
As graphs become larger, though, many more iterations are certainly necessary. The
algorithm stops when no more partitions can be made such that the number of edges
between two clusters is less than the minimum number of edges within either of
the two clusters. Barbara Ball presents a much more thorough explanation of this
algorithm in [4].
The Fiedler Method has been shown to perform very well in experimentation.
Some experimental results are given in [31], [4], and [15].
2.1.2 Limitations
Though this method is theoretically sound, and has been shown to work very nicely on
large as well as small square symmetric matrices, it does have drawbacks. First, the
Fiedler Method is iterative. Therefore, if at any point a questionable partition is made,
the mistake is exacerbated by further iterations. Also, new eigen decompositions must
be found at every iteration, which can be expensive for larger datasets.
Secondly, the Fiedler Method only works for square symmetric matrices. Many
different symmetrization techniques have been developed for non-square or non-
symmetric matrices, but inevitably some information contained in the matrix is lost
whenever symmetry is forced.
Though it is still based on the eigen decomposition, the next clustering algo-
rithm does not carry the drawbacks of an iterative procedure.
2.2 Extended Fiedler Method
In the last section we considered the application of the Fiedler vector to the problem of
clustering. Surely the Fiedler vector is not the only eigenvector that can be of service.
In fact, Extended Fiedler finds much success by incorporating multiple eigenvectors.
2.2.1 Clustering with Multiple Eigenvectors
Since the rise of the Fiedler Method, many mathematicians have developed a similar
clustering algorithm, referred to as the Extended Fiedler method, that uses multiple
eigenvectors (see references [1] and [5] for such algorithms).
The Extended Fiedler method follows the same preliminary steps as the Fiedler
Method, but diverges when it comes to the actual clustering. Instead of looking at
the signs of one eigenvector, Extended Fiedler looks at the sign patterns of multiple
eigenvectors.
Algorithm 1 Extended Fiedler

Let L be the Laplacian matrix for a symmetric matrix A.

1. Find V_k, a matrix containing the first k eigenvectors of L, and E_k, a diagonal
matrix containing the first k eigenvalues of L in ascending order, such that V_i is
an eigenvector with the i-th eigenvalue in E_k.

2. Look at the signs of columns 2 through k of V_k.

3. If rows i and j have the same sign pattern, then rows i and j of A belong in the
same cluster.
The algorithm extends to as many eigenvectors as the user deems necessary.
If k vectors are used, then up to 2^k (but often fewer) clusters result.
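A rough MATLAB sketch of steps 2 and 3 (with illustrative variable names of my own;
this is not the implementation from [1] or [5]):

% Extended Fiedler sketch: group nodes by the sign patterns of eigenvectors 2 through k.
k = 3;                                       % number of eigenvectors to consider
[V, E]   = eig(L);                           % L is the Laplacian of a symmetric matrix A
[~, idx] = sort(diag(E), 'ascend');
Vk       = V(:, idx(2:k));                   % skip the trivial first eigenvector
patterns = double(Vk >= 0);                  % sign pattern of each row (node)
[~, ~, labels] = unique(patterns, 'rows');   % identical patterns share a cluster label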
To help explain Extended Fiedler, I will demonstrate how it clusters the 10
node graph from Section 2.1 when k = 2 eigenvectors are used. Note first that the
algorithm finds only 3 clusters (see Figure 2.7), which is less than the potential of
4 clusters. Second, notice that for this example the Extended Fiedler method and
the Fiedler Method produced the exact same clustering of this dataset. Also, I only
needed to calculate one eigen decomposition. Some experimental results with this
algorithm are detailed by Basabe in [5].
Figure 2.7: Results of Extended Fiedler on the 10 node graph when using the signs
of 2 eigenvectors.
2.2.2 Limitations
Though Extended Fiedler frees us from the iterative processes of the Fiedler Method,
we are still bound to and limited by the eigen decomposition which only exists for
square matrices, and only has real-valued eigenvalues and eigenvectors when the ma-
trix is symmetric. The next chapter moves on to the related but more flexible singular
value decomposition, and the two SVD-based clustering methods, which do not have
as many limitations as the two Fiedler algorithms.
Chapter 3
Moving from Eigenvectors to
Singular Vectors
It would be nice to find a clustering method as simple and robust as the extended
Fiedler method but that is more flexible and extends to rectangular matrices. Ob-
viously, the main obstacle in our way is the decomposition being used. Is there a
decomposition for both rectangular and square matrices that has the same structure
as the eigen decomposition?
As it turns out, the Singular Value Decomposition accomplishes this goal. A
unique SVD is defined for a matrix of any size, and the SVD of a square matrix is related
to its eigen decomposition. SVD is not as widely known or studied as the eigen
decomposition, so we will dene it now.
Definition 1 [21] The singular value decomposition of an m × n matrix A with rank
r is an orthogonal decomposition of a matrix into three matrices such that

A_{m×n} = U_{m×r} S_{r×r} V^T_{r×n}.    (3.1)
Figure 3.1: A Diagram for the Singular Value Decomposition of a matrix
This decomposition is called orthogonal since the columns of U are orthogonal
to each other. The same holds for the rows of V^T. The matrix S is a diagonal matrix
that contains the singular values of A in descending order. Singular values are always
non-negative real numbers [21]. In Section 3.2, the three components of the SVD, U,
S, and V^T, will be addressed in greater detail.
As mentioned above, the eigen decomposition and the singular value decom-
position are closely connected. Suppose B and C are square symmetric matrices such
that B = AA^T and C = A^T A for some rectangular matrix A with singular
values s_i, left singular vectors u_i, and right singular vectors v_i. Then

Bu_i = s_i^2 u_i,    (3.2)

and

Cv_i = s_i^2 v_i.    (3.3)

Therefore, u_i is an eigenvector of B with eigenvalue s_i^2, and v_i is an eigenvector
of C with eigenvalue s_i^2 [28]. Hence, if A is a square symmetric matrix, then the eigen
decomposition of A and the singular value decomposition of A are equivalent.
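This relationship is easy to check numerically; the following is a small illustrative
verification (not part of the thesis code):

% Check that singular vectors of A are eigenvectors of A*A' and A'*A,
% with eigenvalues equal to the squared singular values.
A = rand(5, 3);                                       % any small rectangular matrix
[U, S, V] = svd(A, 'econ');
i = 1;                                                % test the first singular triplet
residB = norm((A*A')*U(:,i) - S(i,i)^2 * U(:,i));     % should be near zero
residC = norm((A'*A)*V(:,i) - S(i,i)^2 * V(:,i));     % should be near zero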
The Singular Value Decomposition is an exact decomposition. In other words,
we can multiply U, S, and V^T and get back the original matrix A. However, we can
also use SVD to find an approximation of A of rank k, where k < r, by multiplying
only the first k columns of U, the first k values in S, and the first k rows of V^T, as
in the diagram below. This is called the truncated SVD of a matrix.
Figure 3.2: A Diagram for the truncated Singular Value Decomposition of a matrix
This truncated SVD not only gives us a rank k approximation to a matrix A,
it gives us the best possible rank k approximation in the following sense:

Theorem 1 (Eckart and Young; see [21]) Let A be a matrix of rank r, and let A_k be
the SVD rank k approximation to A, with k ≤ r, and let B be any other matrix of
rank k. Then

‖A − A_k‖_F ≤ ‖A − B‖_F.    (3.4)
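As a numerical illustration of Theorem 1 (a sketch with made-up data, not an
experiment from this thesis), the truncated SVD can be computed with MATLAB's
svds and its error compared against some other rank k matrix:

% Compare the rank-k SVD approximation of A against an arbitrary rank-k matrix.
A = sprand(200, 100, 0.05);          % a random sparse matrix, purely for illustration
k = 4;
[Uk, Sk, Vk] = svds(A, k);           % k dominant singular triplets
Ak = Uk * Sk * Vk';                  % best rank-k approximation, by Theorem 1
B  = rand(200, k) * rand(k, 100);    % some other matrix of rank at most k
errAk = norm(A - Ak, 'fro');         % always less than or equal to errB
errB  = norm(A - B,  'fro');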
When k << r, the truncated SVD is much less expensive to find than the
full singular value decomposition. For this reason, both SVD clustering methods
introduced in Sections 3.2 and 3.3 use the truncated SVD. But how do we know what
value to use for k? This is not a trivial question, and there are hundreds of ways to
answer it (some quite simple, and some highly complex). Throughout this paper, I
typically use the scree plot of the singular values of A to choose k. The singular
values of a matrix can be plotted on a line graph without great expense. Since the
values appear in descending order, it is often easy to find a place where the magnitude
of these values drops, forming a sort of elbow on the line graph. If the elbow
occurs at the j-th singular value, we might set k = j.
Figure 3.3: A line plot of the singular values of a matrix where the y-axis represents
the magnitude of a singular value. Since the graph drops sharply after the fourth
singular value, it is reasonable to set k = 4.
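A scree plot like the one in Figure 3.3 takes only a few lines of MATLAB (a sketch;
the choice of 30 values is arbitrary):

% Scree plot: line graph of the leading singular values of A.
s = svds(A, 30);                  % 30 largest singular values, in descending order
plot(s, '-o');
xlabel('index'); ylabel('singular value');
% Look for the elbow: if the drop occurs at the j-th value, try k = j.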
Because of these properties, the singular value decomposition has many appli-
cations outside the field of clustering as well. Mike Berry and others established the
usefulness of SVD with respect to information retrieval in [6] and [7], and Skillicorn
has applied it to counterterrorism in [27].
3.1 Small Example Datasets
The next two sections cover the algorithms for my two SVD-based clustering methods.
In these sections, it will be useful to have a small example matrix to demonstrate how
well each method clusters and reorders the given matrix. For this purpose, a small
subset of 45 rows, or search phrases, and 24 columns, or advertisers, was chosen from
the large Yahoo! dataset (the full Yahoo! matrix is introduced and used in Chapter
6). The rows and columns were selected so that the small matrix has the nice block-
diagonal structure associated with well-clustered matrices; the matrix was then randomized
by means of a random permutation. This way we have an answer key with which to
compare our results.
Figure 3.4: Small Yahoo! example matrix, along with the terms represented by each
row, before (left) and after (right) randomization.
Notice there are row labels for the dataset, but no column labels. Yahoo! was
willing to release the terms represented by the dataset, but kept the advertisers' names
a secret for business and privacy reasons. This does not present a great obstacle, since
we can still get a very good idea of how a clustering algorithm performs based on
how the rows have been reordered.
I will also use two other small matrices that each represent a different set of
points shown in Figures 3.5 and 3.6 in three-dimensional space. These will be helpful
visuals when discussing the geometric aspects of each clustering algorithm.
3.2 How to Cluster a matrix with SVD Signs
The first SVD-based clustering method to be discussed is the SVD Signs method,
which uses the sign patterns of the singular vectors rather than eigenvectors, as done
by the Extended Fiedler method.
Figure 3.5: 11 by 3 matrix as a set of eleven points in three-dimensional space.
Figure 3.6: Another set of points used in this chapter.
As discussed earlier, eigenvectors hold a lot of information about a graph's
connectivity, and this information is exploited by the Fiedler clustering methods.
Since the eigen decomposition and singular value decomposition are so closely related,
it is not surprising that the singular vectors also carry a wealth of information about
the matrices they represent, and so play a central role in both SVD Signs and SVD
Gaps methods of clustering.
For both methods, the first step is to find the truncated SVD of the matrix.
This will give us three matrices U_{m×k}, S_{k×k}, and V^T_{k×n}, for some chosen k, where
the columns of U are the dominant k left singular vectors of A (the k vectors that
contribute most to the dataset), the entries in S are the dominant singular values
of A, and the rows of V^T contain the dominant right singular vectors.
Once the truncated SVD is obtained, the SVD signs algorithm uses the sign
patterns of the singular vectors to group the rows and columns in precisely the same
way as the sign patterns of eigenvectors were used in the extended Fiedler method.
Rows that have the same sign pattern in the first k singular vectors are grouped
together.
For example, Figure 3.7, below, shows the first two left singular vectors from
some matrix A. Using the sign patterns from these two vectors, rows 1 and 4 would
be clustered together, rows 2, 5, 6, and 7 would be clustered together, and row 3
would be placed in a cluster by itself. Note that if we use k singular vectors, we can
have up to 2^k clusters, since each row of U_k has k entries, each with 2 possible values.
Luckily, the algorithm rarely yields such a high number of clusters.
Figure 3.7: Clustering by using sign patterns of the first two singular vectors
Why were the left singular vectors used here instead of the right ones? Recall
that earlier this was not an issue because we used the eigen decomposition, which
has only one set of eigenvectors. On the other hand, the SVD gives two sets of
singular vectors. Which set of singular vectors, the left or the right, should be used?
Note also that the spectral methods only dealt with square symmetric matrices, and
so one reordering could be applied to both rows and columns. This is not the case for
SVD signs, which can be used on rectangular matrices, as well as asymmetric square
matrices, and so calls for two independent re-orderings. This is where the two sets of
singular vectors come in handy: the signs of the left singular vectors, or the
columns of U, give a clustering for the rows, while the signs of the right
singular vectors, or the columns of V, can be used to cluster the columns.
Algorithm 2 SVD Signs

1. Find [U_k, S_k, V_k^T] = svds(A, k).

2. If rows i and j of U_k have the same sign pattern, then rows i and j of A are in
the same cluster.

3. If columns i and j of V_k^T have the same sign pattern, then columns i and j of A
are in the same cluster.
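The full SVD Signs code used for the experiments in this thesis is listed in Appendix
A.1; the fragment below is only a compact sketch of the same idea, with variable
names of my own:

% SVD Signs sketch: cluster rows and columns of A by the sign patterns
% of the k dominant singular vectors.
k = 3;
[Uk, Sk, Vk] = svds(A, k);                         % truncated SVD of A
rowPatterns = double(Uk >= 0);                     % sign pattern of each row of U_k
colPatterns = double(Vk >= 0);                     % sign pattern of each row of V_k
[~, ~, rowClusters] = unique(rowPatterns, 'rows'); % same pattern -> same row cluster
[~, ~, colClusters] = unique(colPatterns, 'rows'); % same pattern -> same column cluster
[~, rowOrder] = sort(rowClusters);                 % reordering that groups the clusters
[~, colOrder] = sort(colClusters);
Areordered = A(rowOrder, colOrder);                % clustered and reordered matrix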
3.2.1 Results on Small Yahoo! Dataset
The SVD signs method performs very well on the small Yahoo! example, as can
be seen in Figure 3.8. First, the reordered matrix has the very nice block diagonal
structure that is characteristic of well-clustered matrices. Second, we can see that,
with one exception, all the terms were returned to their original categories. Note that
we chose k = 3 here. It turns out that using three left singular vectors gives the best
clustering of the rows, even though the singular values (shown in Figure 3.8) suggest
a k value of 4. This is a good example of how problematic choosing a k value can be.
The next obvious question one might ask is: why does this work? Since SVD is
an orthogonal decomposition, it has a very nice geometry. In fact, it is the geometrical
properties of the SVD, which are explained and proved by Meyer in [22], that give a
clear explanation for why this SVD Signs method works so well.
Figure 3.8: The singular values of the Small Yahoo! dataset and the results of using
signs method with k = 3 on the dataset.
3.2.2 Why the SVD Signs method works
Note that any m × n matrix can be thought of as a set of m points in an n-dimensional
space. For example, Figure 3.9 shows an 11 × 3 matrix and the corresponding cloud
of 11 points.
Figure 3.9: 11 by 3 matrix as a set of eleven points in three-dimensional space.
In this geometrical context, the right singular vectors represent the vectors
of principal trend of the data cloud. In other words, the first right singular vector
(referred to from now on as v_1) will point in the direction of highest variation in the
data cloud. However, when we plot the data cloud and its first right singular vector
together, as in Figure 3.10, it certainly does not look like the vector is pointing in the
direction with the most variation.
Figure 3.10: First right singular vector
This is because the dataset has not been centered, and v_1 looks for the direction
of highest variation from the origin. After centering the dataset, as in Figure 3.11, v_1
does point in the accurate direction of principal trend.
Figure 3.11: First right singular vector of centered dataset
The second right singular vector, v_2, points in the direction of secondary trend
orthogonal to v_1, and v_3 points in the direction of tertiary trend orthogonal to both
v_1 and v_2.
Not only do these three right singular vectors represent the three directions of
principal trend in the dataset, but they also represent a new set of axes for our set of
points!

Figure 3.12: The three right singular vectors of the dataset
Figure 3.13: The right singular vectors can be thought of as a new set of axes for the
dataset.
In this particular example, the dimension of the original dataset and the di-
mension of the new space created by the right singular vectors are the same because
all of the right singular vectors were used. What happens when the truncated SVD,
rather than the full SVD is used? In other words, what if instead of using all n right
singular vectors, we decide to use only k of them? In this case, the original dataset
of dimension n is projected into a space of lower dimension k. Why would anyone
want to do this? Wouldn't a lot of the information in the original dataset be lost?
Of course, as with any projection of this nature, some of the information will be lost.
But the information lost will be the least important, and sometimes even superfluous,
information that could be obfuscating important correlations in the original matrix.
This is because the Singular Value Decomposition naturally sorts trends in the matrix
from most important to least important [28]. Therefore, in most cases, losing extra
dimensions does not pose a problem, and can even be helpful.
Now let us consider the left singular vectors. What is their role? It was shown
by Meyer [22] that the left singular vectors give the coordinates of the points on the
new set of axes created by the right singular vectors! In other words, u_1 contains the
orthogonal projections of each point onto v_1.
Figure 3.14: The left singular vectors contain the coordinates of the points when
projected onto each right singular vector
This information about the geometry of SVD can now be used to better un-
derstand the SVD Signs clustering method discussed above. When the signs of u_1
are used to divide the cloud of points into two pieces, all the points projected onto
the positive half of v_1 are placed in one cluster, and all the points projected onto the
negative half of v_1 are placed in another cluster. This is essentially the same as slicing
through the centered set of points at the origin with a hyperplane that is orthogonal
to v_1. Figure 3.15 demonstrates this with the set of 11 points, and shows the two
clusters that result.
Figure 3.15: Points on the positive side of the hyper-plane are in one cluster, while
points on the negative side are in another (left). The dataset is now divided into two
clusters (right).
These steps are repeated with the signs of u_2, resulting in another hyper-plane
orthogonal to the first that divides the set of points with respect to v_2. As shown
in Figure 3.16, these two planes divide the space into quadrants, with each quadrant
containing a different sign pattern and therefore a different cluster.
Figure 3.16: Points are further clustered according to their quadrant.
Next, the set of points is divided with respect to the signs of u_3, resulting in a
third hyper-plane that divides the space into octants. Note that not all of the octants
happen to contain points, and so fewer than eight, or 2^3, clusters result.

It is possible to go too far with the method and divide the set of points into
too many groups. Notice that, while the first two singular vectors resulted in very
Figure 3.17: Points further clustered according to their octant.
intuitive clusterings, some of the clusters resulting from the third vector, particularly
the clusters circled in Figure 3.18, are questionable. This represents the consequence
of moving from k = 2 to k = 3 when clustering this set of points. Is it better to
have a clustering that is too fine, or not fine enough? There is really no good answer
to this question. The most appropriate k depends on the goals of and applications
envisioned by the researcher.
Figure 3.18: Cost of choosing a k that is too high.
3.2.3 Limitations of SVD signs
Like the Fiedler and Extended Fiedler methods before it, SVD signs clusters strictly
according to positive and negative signs. It breaks the dataset into halves according
to whether the projection of each point lies on the positive or negative half of a vector.
But what if the data set doesn't naturally break in half? Or if the break point lies
somewhere other than the middle of the dataset? For example, what if SVD Signs
were applied to the following trimodal data set introduced earlier:
Figure 3.19: Example of a trimodal dataset
Since the SVD Signs method can only bisect a dataset at any given iteration,
it would divide this set of points into two halves, cutting right through the middle
clump of points as shown in Figure 3.20.
Figure 3.20: Signs method breaks dataset in half, rather than into thirds as desired.
Clearly, dividing this set of points into three clusters by splitting it at two
places along the direction of principal trend would be preferable, but SVD Signs
simply does not have that capability. It was with this flaw in mind that we set out
to create an SVD clustering method that can tailor itself to the shape of a dataset.
3.3 SVD Gaps Method
What if, instead of using the signs of the left singular vectors to blindly cut through
a dataset at its center, we considered the gaps in the left singular vectors instead?
Remember that these vectors contain orthogonal projections of the points in the
directions of principal, secondary, and tertiary trend. Therefore, we can use them to
find the gaps between points in any of these directions and divide the dataset where
the gaps occur.
Figure 3.21: Gaps preserved when a dataset is projected onto a singular vector.
For example, when we look at the first left singular vector of the tri-modal
dataset introduced earlier, the large gaps can be found quite easily. As seen in Figure
3.22, this new gaps method would cut through the center of these gaps and therefore
create the three clusters we desired earlier.
The algorithm for SVD Gaps follows many of the same steps as the Signs
method, but introduces a few important changes. As with the Signs method, the
Figure 3.22: First singular vector of trimodal dataset with the gaps between entries,
and the resulting cuts.
truncated SVD, using the appropriate rank k, must be found, and then the gaps
between entries of the left singular vectors are calculated. If a gap between two entries
is large enough, a division is placed between the corresponding rows of the original
matrix. Again, clustering the columns of a matrix is similar. The only difference is
that the algorithm uses gaps in the right singular vectors to determine where
to divide the columns.
Algorithm 3 SVD Gaps

1. Find [U_k, S_k, V_k] = svds(A, k).

2. For 1 ≤ i ≤ k, sort U_i (or V_i if clustering columns) and find the gaps between
entries.

3. If the gap between rows j and j + 1 of U_i (V_i) is large enough, then divide A
between the corresponding rows (columns).

4. Create a column vector C_i that contains numerical cluster labels for U_i (V_i) for
all rows (columns).

5. After finding C_i for all 1 ≤ i ≤ k, compare cluster label patterns for rows of C.

6. If rows (columns) i and j have the same cluster label pattern in C, then rows
(columns) i and j belong in the same cluster.
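The SVD Gaps code used in this thesis is listed in Appendix A.2; the following is
only a rough sketch of the row-clustering steps, with the gap tolerance discussed in
the next section:

% SVD Gaps sketch (rows only): cut each sorted left singular vector at large gaps.
k   = 3;
tol = 2;                                      % gap tolerance, in standard deviations
[Uk, Sk, Vk] = svds(A, k);
m = size(A, 1);
C = zeros(m, k);                              % C(:,i) holds the cluster labels from u_i
for i = 1:k
    [u, order] = sort(Uk(:, i));              % sort the i-th left singular vector
    gaps = diff(u);                           % gaps between consecutive sorted entries
    cuts = find(gaps > mean(gaps) + tol*std(gaps));   % gaps that are "large enough"
    labels = ones(m, 1);
    for c = cuts'                             % everything after a cut gets a new label
        labels(c+1:end) = labels(c+1:end) + 1;
    end
    C(order, i) = labels;                     % map labels back to the original row order
end
[~, ~, rowClusters] = unique(C, 'rows');      % same label pattern -> same cluster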
3.3.1 When are Gaps Large Enough?
Earlier, the issue of when gaps are large enough was glossed over. However, this is
a very important component of the Gaps algorithm. In fact, the effectiveness of the
entire method rests on deciding when a gap in the data is large enough to be used as
a cut. If the criteria are too relaxed, there will be far too many clusters. However, if
the criteria are too stringent the algorithm might not discover as many clusters as it
should.
So how big is big enough? Obviously, the measure should be relative to the
data. For example, some datasets might have smaller gaps overall, and so have
smaller significant gaps, than other datasets. Therefore, the significance of a gap
should be tied to the average size of the gaps in the data. However, the points in
a dataset can be greatly spread out in the direction of v_1, but more compact in the
direction of v_2. In this case, it would not be a good idea to pool all the gaps from
all the singular vectors together, as this would result in too many cuts in the earlier
vectors and too few cuts in the later vectors. It makes sense, then, for an average to
be taken with respect to each individual singular vector, which is exactly what the
SVD Gaps algorithm does.
Now that we have an average gap size for each singular vector, we know we
should only choose gaps that are larger than the average gap, but how much larger
should it be? It would make a lot of sense if, before making this decision, the algorithm
took into account how spread out the gap sizes were. Therefore, a standard deviation
for the gaps of each singular vector should also be found. From there, it is easy to
calculate how many standard deviations away each gap is from the average gap (this
might bring to mind the z-score of the normal distribution, but we must remember
that these gaps are most likely not normally distributed). Any gap that is more than
a certain number of standard deviations larger than the average gap will be chosen as
a place to cut the dataset. We can use Chebyshev's result about general distributions
to help us decide the cut-off or tolerance level for the number of standard deviations;
experimentation with various datasets has shown that using gaps that are 1.5 to
2.5 standard deviations larger than the average yields nice clusters. The number of
standard deviations is a parameter of the SVD Gaps method, and can be set by the
user.
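Concretely, writing μ_i and σ_i for the mean and standard deviation of the gaps in
the i-th singular vector (notation of mine, not the thesis's), a gap g in that vector is
used as a cut whenever

g > μ_i + t σ_i,

where t is the user-chosen tolerance, typically between 1.5 and 2.5.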
3.3.2 Results on Small Yahoo! dataset
SVD Gaps also performs fairly well on the small Yahoo! dataset (see Figure 3.23
below). I chose a value of 3 for k, again, and a tolerance level (i.e. the standard
deviation cut-off for the gaps) of 2.15.
Figure 3.23: Small Yahoo Dataset clustered with the SVD Gaps algorithm using
k = 3 and tol = 2.15
The reordered picture for SVD Gaps does not look quite as nice as the one
for SVD Signs, but this does not necessarily mean that the clustering is not as good.
However, by looking at the reordered list of terms, we can see that all of the terms
are returned to their original group. Are the clusterings equal in strength, since they
got essentially the same term reordering? It is hard to decide based simply on the
appearance of the reordered matrix and reordered list of terms. The next chapter
introduces a more rigorous way to compare the performances of the two algorithms.
Chapter 4
Quality of Clusters
With small examples, such as the tri-modal set of points and the small Yahoo! dataset,
determining whether a clustering algorithm works well is easy and can be done visually
(especially when the dataset was created with specific clusters in mind, as was the case
with both of these). If we want to be able to apply either of these algorithms to real
world problems, however, a more rigorous measure of the quality of the clustering,
or cluster goodness, must be found. Such a measure allows one to decide which
algorithm is better: SVD Signs or SVD Gaps. With these goals in mind, let's take a
look at an entropy measure.
4.1 Entropy Measure
The entropy measure used in this paper for clustering is based on the measure pre-
sented by Meyer in [23] and revolves around the concept of surprise, or the surprise
felt when an event occurs.
Definition 2 [23] For an event E such that 0 < P(E) = p ≤ 1, the surprise S(p)
elicited by the occurrence of E is defined by the following four axioms.

1. S(1) = 0

2. S(p) is continuous with p

3. p < q implies S(p) > S(q)

4. S(pq) = S(p) + S(q)

Basically, events with lower probabilities elicit a higher surprise when they
occur, and vice versa; the function for surprise in terms of the probability turns out
to be

S(p) = −log_α p    (4.1)

where S(1/α) = 1 [23]. Now, if X is a random variable, then the entropy of
X is the expected surprise of X.
Definition 3 For a discrete random variable X whose distribution vector is p, where
p_i = P(X = x_i), the α-entropy of X is defined as

E[S(X)] = H_α(X) = −∑_{i=1}^{n} p_i log_α p_i    (4.2)

Set t log t = 0 when t = 0.
Before we can apply this measure to clustering we must resolve a problem
with the nature of rectangular datasets. One issue that makes these datasets difficult
to work with is that the rows must be clustered and reordered independently of
the columns. Therefore the reordered matrix does not often have the nice, clean,
block diagonal structure that we get from symmetric reorderings like the Fiedler or
Extended Fiedler methods (see Figure 4.1).
Figure 4.1: This matrix has very well defined clusters, even though the matrix is not
in block diagonal form.

So, even if a matrix has been clustered and reordered well, it can be hard to
tell by looking at the reordered matrix. However, if a matrix A is clustered well and
reordered to Ã, then R = ÃÃ^T, or the reordered row by reordered row matrix, and
C = Ã^T Ã, or the reordered column by reordered column matrix, both have nice block
diagonal structures with few nonzero entries outside of the blocks. For R these blocks
represent the row clusters, and for C they represent the column clusters. Figure 4.2
shows a perfectly block diagonal matrix, as well as one that has a few stray points. If
these matrices had been reordered by a clustering algorithm, we would say that the
one on the left had been clustered better than the one on the right.
Figure 4.2: Two block diagonal matrices 1 and 2
When we look at R and C for each of these matrices (Figures 4.3 and 4.4),
it is even more clear that the first matrix has been clustered better than the second
one.
Figure 4.3: The reordered row by reordered row matrices for 1 and 2
Figure 4.4: The reordered column by reordered column matrices for 1 and 2
Now this idea of entropy can be applied to clustering in the following way
(which is nicely laid out and fully explained by Meyer in [23]). Say we have a set
of distinct objects A = {A_1, A_2, ..., A_n}, each classified with a label L_j from the set
{L_1, L_2, ..., L_n}, and say we group these objects into k clusters {C_1, C_2, ..., C_k}. Then
we can create a probability distribution containing probabilities p_ij such that

p_ij = (number of objects in C_i labeled L_j) / (number of objects in C_i)    (4.3)

The entropy of an individual cluster C_i is then
H_k(C_i) = −∑_{j=1}^{k} p_ij log_k p_ij.    (4.4)
We set p_ij log_k p_ij = 0 when p_ij = 0. Therefore the entropy of the entire clustering or
partition is

H̄ = ∑_{i=1}^{k} ω_i H_k(C_i),  where ω_i = |C_i| / n.    (4.5)
This entropy measure H̄ has the following properties:

• 0 ≤ H̄ ≤ 1

• H̄ = 0 if and only if H_k(C_i) = 0 for all i = 1, ..., k

• H̄ = 1 if and only if H_k(C_i) = 1 for all i = 1, ..., k

In other words, the entropy scores for a clustering will range from 0 to 1, with 0 being
the best and 1 being the worst.
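As a rough sketch of how Equations (4.4) and (4.5) might be computed (the full
entropy code used in this thesis is in Appendix A.3; the variable names below are
mine), given integer cluster assignments and labels:

% Sketch of the cluster entropy measure (Equations 4.4 and 4.5).
% clusters(i) = cluster number of object i, labels(i) = label of object i,
% both coded as integers starting at 1.
k = max(clusters);                          % number of clusters (also the log base)
n = numel(clusters);
H = 0;                                      % overall entropy, Equation (4.5)
for i = 1:k
    members = labels(clusters == i);        % labels of the objects in cluster C_i
    Hi = 0;                                 % entropy of C_i, Equation (4.4)
    for j = 1:max(labels)
        p = sum(members == j) / numel(members);   % p_ij
        if p > 0
            Hi = Hi - p * log(p) / log(k);  % log taken base k
        end
    end
    H = H + (numel(members) / n) * Hi;      % weight by |C_i| / n
end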
What if a dataset is not labeled? In fact, none of the datasets presented in
the next chapter are labeled. In these instances there is a clever way to force labels
on a dataset that has been clustered. Consider the small clustered 15 by 10 matrix
shown below in Figure 4.5.
First we divide R according to the row clusters, which in this case will result
in 3 row blocks and 3 column blocks. Then for each row of R we create a vector of
ratios of the number of nonzeros in each column block over the number of columns
in that block. For example, for the first row of R in Figure 4.5, the ratio vector will
be {4/4, 1/7, 0/4}. If block j of row i has the highest ratio, then that row will
be labeled j. Obviously, row one of R will be labeled 1. The second row is a more
interesting case, as the ratio vector is {4/4, 5/7, 4/4} and so we have a tie between blocks 1
and 3. Since a row can only have one label for our entropy measure, we pick the first
block with the highest ratio, and so row 2 of R will be labeled 1.
Figure 4.5: A clustered 15 by 10 matrix A and the corresponding row by row matrix
R.
Most of the row labels for R match with the clustering, except for row 6. The
ratio vector for this row is {4/4, 3/7, 0/4}, and so this row will be labeled 1 even though it is
in cluster 2. Therefore the entropy measure for R, or the row entropy measure for A,
will be less than perfect. It turns out to be 0.1742.
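A sketch of this labeling step (illustrative only; blockStart and blockEnd, which mark
where each column block begins and ends, are assumed inputs of my own naming):

% Force labels on an unlabeled clustering: label each row of R by the column
% block in which it has the highest ratio of nonzeros.
numBlocks = numel(blockStart);
rowLabels = zeros(size(R, 1), 1);
for i = 1:size(R, 1)
    ratios = zeros(1, numBlocks);
    for j = 1:numBlocks
        cols = blockStart(j):blockEnd(j);
        ratios(j) = nnz(R(i, cols)) / numel(cols);  % nonzeros in block / block width
    end
    [~, rowLabels(i)] = max(ratios);   % ties go to the first block with the highest ratio
end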
This measure, though not the method of labeling, is presented by Meyer in
[23], and resembles the method presented in [25]. A similar cluster measure is also
presented in [19], though here the entropy measure is relative to a perfect partition and
so is not useful when such a partition is not known. A nice synopsis and comparison
of several different cluster measuring techniques is presented in [16].
4.2 Comparing results from Small Yahoo! Example
The row re-ordering of the small Yahoo! dataset for SVD Signs is perfect and has a
row entropy of zero! SVD Gaps does not do quite as well, but didn't do too badly,
with a row entropy of 0.0279. Figure 4.6 shows the reordered row matrices side by
side.
Figure 4.6: Symmetric reordered small Yahoo! term matrix for SVD Signs reordering
with entropy 0 (left) and for SVD Gaps with entropy 0.0279
The column re-orderings were worse for both algorithms. The SVD Signs
column reordering had an entropy of 0.1332, and it is obvious in Figure 4.7 that the
column clusters of SVD Signs are not as good as its row clusters. SVD Gaps had a
clearly worse score of 0.4493 for its column reordering.
Figure 4.7: Symmetric reordered small Yahoo! column matrix for SVD Signs reorder-
ing with entropy 0.1332 (left) and for SVD Gaps with entropy 0.4493
Chapter 5
Cluster Aggregation
No matter how strong the theoretical or experimental evidence behind a clustering
algorithm may be, nothing is perfect. All algorithms will have their strengths and
weaknesses, and sometimes one algorithm's weakness may be another algorithm's
strength. Thus it makes sense to find a way to combine them and bring out the best
from several algorithms. This is exactly what cluster aggregation does.
Before I give my algorithm for cluster aggregation I will demonstrate it on
a small example. Suppose three different clustering algorithms are used on a small
dataset containing eight objects, resulting in the three clusterings shown in Figure
5.1.
Figure 5.1: Small example of three different clusterings
We can use these clusterings to build a graph that represents the relationships
between the clusterings. For instance, since objects 4 and 8 are clustered together for
two of the algorithms, there is an edge with weight 2 between nodes 4 and 8 on
the graph shown in Figure 5.2. If two objects have no common clusters, then there
is no edge between them.
Figure 5.2: Graph representing three different clusterings of 8 objects
Of course, any graph can be easily translated to an adjacency matrix A, as
in Figure 5.3, where the nodes become rows and columns and the edges become the
entries of the matrix. For instance, the 4th row of the 8th column and the 8th row of
the 4th column are both 2. In general, if objects i and j have n common clusters,
then A_ij = A_ji = n.
Figure 5.3: Creating an adjacency matrix from the graph
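A minimal sketch of how such an adjacency matrix could be assembled from several
cluster label vectors (illustrative only; the aggregation code used in this thesis is in
Appendix A.4):

% Build the aggregation adjacency matrix from several clusterings.
% clusterings is an n-by-m matrix: column c holds algorithm c's cluster labels
% for the n objects (an assumed input, not the data from Figure 5.1).
n = size(clusterings, 1);
Agg = zeros(n);
for c = 1:size(clusterings, 2)
    labels = clusterings(:, c);
    same = bsxfun(@eq, labels, labels');   % 1 where two objects share a cluster
    Agg = Agg + same;                      % entry (i,j) counts common clusters
end
Agg = Agg - diag(diag(Agg));               % zero out self-edges on the diagonal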
This adjacency matrix represents all three of the clustering algorithms, and
clustering this matrix yields a nice aggregation of the three methods. Of course, this
begs the question, what clustering method should be used? After all, if we knew the
best clustering method, there would be no need for cluster aggregation in the first
place. However, notice that the adjacency matrix is a square symmetric matrix. For
these types of matrices, spectral methods, specifically those which use the signs of
the eigenvectors as in [5] and [1], have actually been proven to be the best clustering
algorithms. Therefore, I have chosen to use the method laid out in [5], and described
in Section 2.2 of this thesis, to cluster the adjacency matrix. The results for the small
example are shown in Figure 5.4.
Figure 5.4: Results of Cluster Aggregation with small example
The simplicity of this aggregation method gives it a surprising amount of
flexibility and therefore a wide range of uses. In the small example used here, all of
the algorithms were given equal weight in the adjacency matrix, and therefore seen as
equally valid. However, in some instances, one or more of the clusterings might stand
out above the rest. In these cases, it is easy to adjust the values in the adjacency
matrix to reflect this disparity. Aggregation might also be used when a proper k value
cannot be discerned for a given matrix. Instead of settling on one value of k that may
not be optimal, several values can be chosen, and their corresponding results can be
aggregated.
Many aggregation techniques have been introduced in the field of data mining,
and some are quite similar to the one presented above. The aggregation algorithm
introduced in [29] uses a hyper-graph that is equivalent to the adjacency matrix used
here to capture the relationships between different clusterings. Another interesting
aggregation algorithm is the one presented in [30], which is modeled after the way
ant-colonies sort larvae and also makes use of the hyper-graphs presented by [29].
5.1 Results on Small Yahoo! Dataset
Figure 5.5: Adjacency Matrix of Small Yahoo! Dataset using clusters from SVD Signs
and SVD Gaps
Figure 5.6: Clustered adjacency matrix and the sorted term labels
Chapter 6
Experiments on Large datasets
All the time spent researching and creating these algorithms is wasted if the algo-
rithms themselves are useless when applied to large data sets. After all, one of the
main reasons that clustering is so important is that it can be useful for breaking down
and processing large amounts of information. Therefore, it is important at this point
to demonstrate the results of the two SVD based clustering algorithms when applied
to a couple of large data sets.
6.1 Yahoo!
The first data set is a binary 3,000 by 2,000 matrix compiled by Yahoo! The matrix
represents the relationships between 3,000 search terms and 2,000 (anonymous) ad-
vertisers. If an advertiser j bought a given search phrase i, then there is a 1 in the
ij-th entry of the matrix; all other entries are 0.
The image below, created with David Gleich's VISMATRIX tool, gives an idea
of what the raw dataset looks like before reordering (the terms are in alphabetical
order, and advertisers are in random order). Blue dots represent ones, and zeros are
represented by whitespace.
Before we run each of the algorithms on the full Yahoo! dataset, we must
Figure 6.1: Vismatrix display of the raw 3000 by 2000 Yahoo! dataset.
choose a value for k. Since the singular values can often be helpful for making this
decision, Figure 6.2 shows a line graph of the first 100 singular values of the Yahoo!
matrix.
Figure 6.2: Plot of the singular values of the Yahoo! dataset.
Unfortunately, the singular values plotted in Figure 6.2 do not provide a very
clear answer. We can see that k = 20 would probably be too early a cut-off. I finally
chose to compare the results for k = 25 and k = 35. Figure 6.3 shows the results for
SVD signs.
Figure 6.3: Results of the SVD Signs algorithm on the full Yahoo! dataset for k = 25
(left) and k = 35 (right).
SVD Signs performed quite nicely in both of these trials. In both, there are
plenty of nice dense rectangles along the diagonal. On exploring these clusters, we
find that the terms in each cluster have similar themes: one contains search terms
related to hotels, one contains terms related to online gambling, etc. Which choice
for k yields better results? For k = 25, the row entropy is 0.0555 and the column
entropy is 0.0482, whereas for k = 35 the average row and column entropies are higher.
It is clear from the pictures and entropy scores that using k = 25 results
in better clusters; using a higher k results in too many clusters for both rows and
columns. Next let us look at how the SVD Gaps algorithm performed. Figure 6.4
shows the results using a tolerance level of 2.3. The SVD Gaps clustering got a row
entropy of 0.1023 and a column entropy of 0.1421.
How do the results compare? As we see in Figure 6.5, the SVD Signs method
produced a much cleaner block diagonal structure than SVD Gaps. But, SVD Gaps
seems to have found more dense clusters than SVD Signs. This seems to indicate that
Figure 6.4: SVD Gaps results using tol = 2.3
SVD Gaps is good at finding really strong clusters, but not good at finding weaker
clusters.
Figure 6.5: Side by side comparison of results from SVD Signs (left) and SVD Gaps
(right).
Figures 6.6 and 6.7 show side by side images of the reordered search phrase by
reordered search phrase matrices, and the reordered advertiser by reordered advertiser
matrices for SVD Signs and SVD Gaps so we can compare the row clusters and the
column clusters for the two algorithms.
Figure 6.6: Term by Term matrices for SVD Signs (left) and SVD Gaps (right)
Figure 6.7: Column by Column matrices for SVD Signs and SVD Gaps
Again, the results for SVD Signs do look stronger at first glance; however,
although SVD Gaps did not perform as well on the dataset as a whole, it still did
a better job on the really dense clusters. Also, we might conclude that SVD Gaps
tends to focus only on the really strong clusters in a dataset and ignore the weaker
ones. In some applications, this would be a very helpful trait.
6.2 Wikipedia
The next large dataset, shown in Figure 6.8, is a binary matrix representing links
between 5,176 Wikipedia articles and 4,107 categories. Each article can be placed in
multiple categories, and each category can contain several articles.
Figure 6.8: Vismatrix display of the raw 5176 by 4107 Wikipedia dataset, and a plot
of its first 100 singular values
The results of SVD Signs and SVD Gaps on the Wikipedia dataset are shown
side by side in Figure 6.9. Neither algorithm produced a strong block diagonal struc-
ture, but this is due to the nature of the dataset, and does not necessarily mean that
the dataset was clustered poorly by either algorithm.
Figure 6.9: Side by side comparison of the SVD Signs and SVD Gaps results for the
Wikipedia dataset.
Though the SVD Gaps reordering does not look as nice, it actually got better
entropy scores. The row entropies for SVD Signs and SVD Gaps were 0.1257 and
0.0956 respectively, while the column entropies were 0.3573 and 0.2013 respectively.
For this dataset, it is much easier to compare the results by looking at the reordered
article by reordered article (see Figure 6.10) and reordered category by reordered
category matrices (see Figure 6.11).
Looking at Figure 6.10, it is more apparent that SVD Gaps produced a better
clustering for the Wikipedia dataset. The structure is much cleaner, and there are
even lots of nice clusters within clusters, which are also present for SVD Signs but
are not as well defined.
The column reorderings show a similar story. The reordering for SVD Signs
looks more interesting than that of SVD Gaps, but the clusters found by SVD Signs
Figure 6.10: Reordered row by reordered row matrices for SVD Signs (left) and SVD
Gaps (right).
Figure 6.11: Reordered column by reordered column matrices for SVD Signs (left)
and SVD Gaps (right).
are rather weak, and we can see that there are many groups of objects that lie outside
of, yet are rectilinearly aligned with, the clusters. This suggests that many columns have
been mis-clustered. SVD Gaps, on the other hand, produced only very small and
dense clusters, and there is not much noise in the rows and columns aligned with
these clusters.
6.3 Netix
The Netflix dataset used here is a 280 user by 17,770 movie dataset containing the
users' ratings (from 1 through 5, 5 being the best) of movies rented through Netflix.
All the users in this dataset rated at least 500 movies. An image of this matrix is
shown below in Figure 6.12.
Figure 6.12: Vismatrix display of the raw 280 by 17,770 Netflix dataset and a plot of
its singular values.
The results for both methods, shown in Figure 6.13, are interesting. Both
methods seem to have gathered as much data as possible into a few clusters in the
outside columns, and neither algorithm found very many column clusters.
This seems odd, since the number of columns is so large, but it makes more
sense when the unusual nature of the dataset is considered. Remember that the dataset
only represents users who have rated at least 500 movies. Thus, every person
represented in the matrix is a Netflix superuser who must like movies a great deal,
Figure 6.13: Results for SVD Signs with k = 7 and for SVD Gaps with k = 7 and
tol = 2
and many of these 280 people probably share a lot of opinions. If all of the users are
rating similar numbers of similar movies with similar scores, the dataset will be fairly
homogeneous, which will lead to strange and poor clusterings. After considering these
aspects of the dataset, it makes more sense that both methods found a few very large
and dense clusters.
Unfortunately, because the matrix is so large, there was not enough memory in
MATLAB to compute the entropy scores for the SVD Signs and SVD Gaps clusterings
of the Netflix dataset, and so we cannot conclude whether one algorithm performed
better than the other on this dataset.
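One possible workaround, sketched below but not implemented here, would be to compute
the per-column overlap ratios that the entropy measure needs one column cluster at a
time, so that the full 17,770 by 17,770 column co-occurrence matrix is never formed.
The vectors colstart and colstop are hypothetical and would hold the first and last
reordered column of each cluster.
% Sketch (assumption: reorderedA is the reordered sparse ratings matrix, numcolclusters
% is the number of column clusters found, and colstart/colstop are hypothetical vectors
% marking the column cluster boundaries).
n = size(reorderedA, 2);
ratios = zeros(n, numcolclusters);
for j = 1:numcolclusters
block = reorderedA(:, colstart(j):colstop(j)); % columns belonging to cluster j
overlap = (reorderedA' * block) > 0; % which columns share a user with cluster j
ratios(:, j) = sum(overlap, 2)/size(block, 2); % fraction of cluster j each column touches
end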
Chapter 7
Conclusion
Thus far, SVD Signs tends to be the more robust clustering algorithm. Because it has
only one parameter, k, it is easier to work with and produces better overall clusterings
more consistently than SVD Gaps. With the SVD Gaps method introduced here,
we not only have to choose an appropriate value for k, but we also have to choose an
appropriate tolerance level for the dataset.
However, there are some very positive aspects of the SVD Gaps method. The
method showed again and again that it excelled at singling out the strongest clusters in
the dataset, while ignoring weaker and less important clusters. In many applications
of clustering, this might be a highly desirable quality. Regardless, SVD Gaps has
not yet reached its full potential. It might be helpful to first learn more about
the statistical distribution of a dataset before applying the SVD Gaps algorithm.
This would help us to determine a tolerance level, and maybe even whether different
tolerance levels should be used for different singular vectors.
The Cluster Aggregation algorithm presented in Chapter 5 has a lot of promise
as well, and it would have been nice to spend more time working and experimenting
with it. Some expansions of this algorithm also need to be explored. Weighting
clusterings before aggregating them (perhaps somehow inversely proportional to
their entropy scores) could improve results. Also, the algorithm could be used to
aggregate clusterings for different values of k, especially in cases where one value
produces too few clusters but a higher value produces too many.
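As a rough sketch of that last idea (purely illustrative, not something implemented in
this thesis), suppose svdsignslabels is a hypothetical variant of svdclusterrect from
Appendix A.1 that returns the row cluster assignment vector x instead of only displaying
results; the assignments for several values of k could then be aggregated with ClusterAgg
from Appendix A.4.
% Sketch: aggregate SVD Signs row clusterings obtained for several values of k.
% svdsignslabels is a hypothetical helper (assumption), returning the vector x.
kvalues = [20 25 30 35];
L = zeros(m, length(kvalues));
for i = 1:length(kvalues)
L(:,i) = svdsignslabels(A, kvalues(i), 0); % 0 = no centering
end
ClusterAgg(L, termlabels); % aggregate with the code from Appendix A.4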
Some other areas that need further work are the entropy measure, the method
of choosing k, and cluster ordering. The entropy measure works nicely, but the
labeling scheme needs to be expanded to work on denser matrices, as it now
works only for sparse matrices. Since the success of both SVD Signs and SVD Gaps
depends so much on the choice of k, a more reliable method must be found and
employed for both algorithms. Finally, it would be nice if clusters were ordered and
displayed so that clusters next to each other are most similar; this would be most
useful when analyzing a clustering for data mining purposes.
Appendix A
MATLAB Code
A.1 SVD Signs
function []=svdclusterrect(A,k,centeringtoggle,trms,docs)
%% INPUT: A = m by n matrix
%% k = number of principal directions to compute
%% centeringtoggle = 1 if you want to center the data matrix A
%% first and work on the centered matrix C
%% = 0 if you want to work with uncentered data
%% matrix A
%% trms = row labels
%% docs = column labels
[m,n]=size(A);
% finds the dimensions of A (m and n are used below)
if centeringtoggle==1
mu = A*ones(n,1)/n;
A= A-mu*ones(1,n); % A is now the centered A matrix
end
[U,S,V]=fastsvds(A,k); %finds truncated SVD of A
% to find a row reordering that clusters rows
E=(U>=0);
%finds positive entries of left singular vectors
x=zeros(m,1);
%creates vector of zeros
for i=1:k;
x=x+(2^(i-1))*(E(:,k-i+1));
end
%designates sign pattern to each row of U
[sortedrowx,rowindex]=sort(x);
%sorts x by sign patterns of U
numrowclusters=length(unique(x))
%finds the number of row clusters of A
% to find a column reordering that clusters columns
F=(V>=0);
% finds all positive entries of matrix V
y=zeros(n,1); %creates vector of zeros
for i=1:k;
y=y+(2^(i-1))*(F(:,k-i+1));
end
%designates sign patterns to each row of V
[sortedcoly,colindex]=sort(y);
%sorts y by sign patterns of V
numcolclusters=length(unique(y))
%finds number of column clusters of A
if centeringtoggle==1
A= A+mu*ones(1,n);
% A changed back to the original uncentered A matrix
end
reorderedA=A(rowindex,colindex);
%reorders A into clustered form
spy(reorderedA)
%spyplot of reordered matrix
trms=cellstr(trms);
reorderedterms=trms(rowindex);
%reorders row labels
docs=cellstr(docs);
reordereddocs=docs(colindex);
%reorders column labels
er=entropy2(reorderedA*reorderedA',x,m,numrowclusters)
%finds row entropy of reordered A
ec=entropy2(reorderedA'*reorderedA,y,n,numcolclusters)
%finds column entropy of reordered A
cd /Users/langvillea/David/vismatrix2
vismatrix2(reorderedA,reorderedterms,reordereddocs)
cd /Users/langvillea/Desktop/datavisSAS-Meyer/vismatrix
%sends reordered A to VISMATRIX tool
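For reference, a call to this function might look like the following. This is only an
illustrative sketch; A, termlabels, and doclabels stand for whatever matrix and label
cell arrays are being clustered, and k = 4 is an arbitrary choice.
% Illustrative call (not one of the experiments above): cluster A using the signs
% of k = 4 singular vectors, without centering, with row and column label cell arrays.
svdclusterrect(A, 4, 0, termlabels, doclabels);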
A.2 SVD Gaps
function []=svdgapcluster2(A,k,centeringtoggle,confidence, ...
confidence2,termlabels,doclabels)
%% INPUT: A = m by n matrix
%% k = number of principal directions to compute
%% centeringtoggle = 1 if you want to center the data matrix A
%% first and work on the centered matrix C
%% = 0 if you want to work with uncentered data
%% matrix A
%% confidence = tolerance level for rows
%% confidence2 = tolerance level for columns
%% termlabels = labels for rows
%% doclabels = labels for columns
[m,n]=size(A);
% finds the dimensions of A
if centeringtoggle==1
mu = A*ones(n,1)/n;
A= A-mu*ones(1,n); % A is now the centered A matrix
end
[U,S,V]=fastsvds(A,k);
% finds truncated SVD of A
% later do smart implementation of svd on centered data
%using rank-one update rules.
%% for Term Clustering
[sortedU,index]=sort(U);
%sort left singular vectors
gapmatrix=sortedU(2:m,:)-sortedU(1:m-1,:);
%each column of gapmatrix contains gaps of a left singular vector
gapmeans=mean(gapmatrix,1);
%find mean gap size for each vector
stdgaps=std(gapmatrix,1,1);
%find st. dev of gap size for each vector
gapszscore=(gapmatrix-ones(m-1,1)*gapmeans)./(ones(m-1,1)*stdgaps);
%convert all gaps to z-scores
[row,col]=find(gapszscore>confidence);
%find indices for all z-scores that are greater than
%tolerance level for rows
D=full(sparse(row,col,ones(length(row),1),m,k));
%creates binary sparse matrix whose columns
%contain ones to mark where large gaps in sing vectors occur
C=zeros(m,k);
for j=1:k
count=0;
for i=1:m
C(i,j)=count+1; %creates cluster label matrix
if D(i,j)==1
count=count+1;
%cluster label changes where large gaps occur
end
end
end
% matrix C is the matrix of cluster labels
% need to sort C by index matrix
[sortedindex,IIndex]=sort(index,1);
for i=1:k
C(:,i)=C((IIndex(:,i)),i);
end
%each column of cluster labels is now sorted
[b,i,h]=unique(C,'rows');
%finds the rows of C with the same label patterns
%these rows will be clustered together
[termclusters,termclusterindex]=sort(h);
%finds row reordering for A
C=C(termclusterindex,:);
%reorders the cluster label matrix
%rows with same cluster label patterns will be adjacent
numtrmclusters=size(b)
%finds number of row clusters of A
%% For Doc Clustering
[sortedV,indexV]=sort(V);
%sort right singular vectors
gapmatrix=sortedV(2:n,:)-sortedV(1:n-1,:);
%each column of gapmatrix contains gaps of a right singular vector
gapmeans=mean(gapmatrix,1);
%find mean gap size for each vector
stdgaps=std(gapmatrix,1,1);
%find st. dev of gap size for each vector
gapszscore=(gapmatrix-ones(n-1,1)*gapmeans)./(ones(n-1,1)*stdgaps);
%convert all gaps to z-scores
[row,col]=find(gapszscore>confidence2);
%find indices for all z-scores that are
%greater than tolerance level for columns
F=full(sparse(row,col,ones(length(row),1),n,k));
%creates binary sparse matrix whose columns
%contain ones to mark where large gaps in sing vectors occur
E=zeros(n,k);
for j=1:k
count=0;
for i=1:n
E(i,j)=count+1; %creates cluster label matrix
if F(i,j)==1
count=count+1;
%cluster label changes where large gaps occur
end
end
end
% matrix E is the matrix of cluster labels
% need to sort E by index matrix
[sortedindexV,IIndexV]=sort(indexV,1);
for i=1:k
E(:,i)=E((IIndexV(:,i)),i);
end
%each column of cluster labels is now sorted
[b,i,z]=unique(E,'rows');
%finds the rows of E with the same label patterns
%these rows represent columns of A that will be clustered together
[docclusters,docclusterindex]=sort(z);
%finds column reordering for A
E=E(docclusterindex,:);
%reorders the cluster label matrix
%rows with same cluster label patterns will be adjacent
numdocclusters=size(b)
%finds number of column clusters of A
if centeringtoggle==1
A= A+mu*ones(1,n);
end
% A is now back to the original uncentered A matrix
reorderedA=A(termclusterindex,docclusterindex);
%reorders A into clustered form
spy(reorderedA)
%spyplot of reordered A
doclabels=cellstr(doclabels);
doclabels=doclabels(docclusterindex);
%reorders column labels of A
termlabels=cellstr(termlabels);
termlabels=termlabels(termclusterindex);
%reorders row labels of A
er=entropy2(reorderedA*reorderedA',h,m,numtrmclusters(1))
%finds row entropy of reordered A
ec=entropy2(reorderedA'*reorderedA,z,n,numdocclusters(1))
%finds column entropy of reordered A
cd /Users/langvillea/David/vismatrix2
vismatrix2(reorderedA, termlabels, doclabels)
cd /Users/langvillea/Desktop/datavisSAS-Meyer/vismatrix
%sends reordered A to VISMATRIX tool
A.3 Entropy Measure
function [entropy]=entropy2(A,clusters,n,k)
%input A is symmetric row by row or column by column
%n is the number of rows of A
%k is the number of clusters of A
x=sort(clusters); %cluster labels in ascending order
[m,xstart]=unique(x,'first'); %find cluster start row
[m,xstop]=unique(x,'last'); %find cluster stop row
spy(A)
Q=zeros(n,k);
for i=1:n
for j=1:k
maxvector(j)=nnz(A(i,xstart(j):xstop(j)))/((xstop(j)+1)-xstart(j));
end
[maxratio,index]=max(maxvector);
Q(i,index(1))=1;
end
%forces row labels using max ratio vector
P=zeros(k,k);
for j=1:k
for i=1:k
P(j,i)=sum(Q(xstart(j):xstop(j),i))/((xstop(j)+1)-xstart(j));
end
end
%creates a matrix of p(i,j)s
P=(P.*log(P))/log(k);
[row,col]=find(isnan(P));
for i=1:length(row)
P(row(i),col(i))=0;
end
%if p(i,j)=0 then set p(i,j)*log p(i,j)=0
for i=1:k
H(i)=-sum(P(i,:));
alpha(i)=((xstop(i)+1)-xstart(i))/n;
end
%finds entropy measure for each cluster
entropy=sum(alpha.*H)
%finds entropy of entire partition
A.4 Cluster Aggregation
%%%%%%%%%%%%%%%%% ClusterAgg.m %%%%%%%%%%%%%%%%
%% INPUT : L = n-by-p matrix of cluster results;
%% column i contains results from clustering method i
%% OUTPUT : A = n-by-n weighted undirected (symmetric)
%% Aggregation matrix;
%% A(i,j) = # of methods having items i and j in same cluster
%% EXAMPLE INPUT: L=[1 3 1;3 1 2; 2 2 2 ; 1 3 1; 1 1 3; 2 2 2]
%% L =
%%
%% 1 3 1
%% 3 1 2
%% 2 2 2
%% 1 3 1
%% 1 1 3
%% 2 2 2
%% means that clustering method 1 (info. in col. 1 of L) groups items
%% 1, 4 and 5 together, then
%% items 3 and 6, and finally item 2 in its own cluster, creating
%% a total of three clusters. Clustering method 2
%% (info. in col. 2 of L)
%% groups items 1 and 4, then 3 and 6, and finally 2 and 5. Notice
%% that the cluster assignment labels used by one clustering
%% method do not need to match those from another method. And
%% the number of clusters found by each method do not need to
%% match either. Yet all lists must be full, i.e., have the same
%% number of items.
function [] = ClusterAgg(L,labels);
% n = # of items/documents
% p = # of lists of cluster assignment results = # of methods
[n,p]=size(L);
A=zeros(n,n);
% need to do p*(n choose 2) pairwise comparisons to create
%Aggregation matrix A.
for i=1:n
for j=i+1:n
matchcount=0;
for k=1:p
if L(i,k)==L(j,k)
matchcount=matchcount+1;
end
end
A(i,j)=matchcount;
end
end
A=A+A'; %symmetrize so that A(i,j)=A(j,i)=# of methods agreeing on items i and j
%Now run any clustering method you like on Aggregation matrix A
%For example, code for running the extended Fiedler method is below.
%The Fiedler method is a good choice since the graph is undirected.
% F = Fiedler matrix F=D-A
D=diag(A*ones(n,1));
F=D-A;
% k = # of eigenvectors to use for extended Fiedler
k=2;
[FiedlerVector,evalue]=eigs(F,k+1,'sa');
FiedlerVector=FiedlerVector(:,2:k+1)
U=(FiedlerVector>=0);
x=zeros(n,1);
for l=1:k;
x=x+(2^(k-l))*(U(:,l));
end
% x contains the cluster assignments
x
% numclusters is the number of clusters produced by the aggregated
%method
numclusters=length(unique(x))
%%%%%%%%%%%%%%%%%% ClusterAgg.m %%%%%%%%%%%%
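Using the example input documented in the header comments, an illustrative call (with
placeholder label strings) might look like the following.
% Illustrative call using the example L from the comments above; the labels cell
% array holds placeholder item names.
L = [1 3 1; 3 1 2; 2 2 2; 1 3 1; 1 1 3; 2 2 2];
labels = {'item1','item2','item3','item4','item5','item6'};
ClusterAgg(L, labels);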
A.5 Other Code Used
%%%%%%%%%%%BEGIN GLEICHS fastsvds.m%%%%%%%%%%%
function [U, S, V] = fastsvds(varargin)
% fastsvds performs the singular value decomposition more quickly than
% MATLAB's built-in function.
%
% [U S V] = fastsvds(A, k, sigma, opts)
%
% see svds for a description of the parameters.
%
A = varargin{1};
[m,n] = size(A);
p = min(m,n);
if nargin < 2
k = min(p,6);
else
k = varargin{2};
end
if nargin < 3
bk = min(p,k);
if isreal(A)
bsigma = 'LA';
else
bsigma = 'LR';
end
else
sigma = varargin{3};
if sigma == 0 % compute a few extra eigenvalues to be safe
bk = 2 * min(p,k);
else
bk = k;
end
if strcmp(sigma,'L')
if isreal(A)
bsigma = 'LA';
else
bsigma = 'LR';
end
elseif isa(sigma,'double')
bsigma = sigma;
if ~isreal(bsigma)
error('Sigma must be real');
end
else
error('Third argument must be a scalar or the string ''L''')
end
end
if isreal(A)
boptions.isreal = 1;
else
boptions.isreal = 0;
end
boptions.issym = 1;
if nargin < 4
% norm(B*W-W*D,1) / norm(B,1) <= tol / sqrt(2)
% => norm(A*V-U*S,1) / norm(A,1) <= tol
boptions.tol = 1e-10 / sqrt(2);
boptions.disp = 0;
else
options = varargin{4};
if isstruct(options)
if isfield(options,'tol')
boptions.tol = options.tol / sqrt(2);
else
boptions.tol = 1e-10 / sqrt(2);
end
if isfield(options,'maxit')
boptions.maxit = options.maxit;
end
if isfield(options,'disp')
boptions.disp = options.disp;
else
boptions.disp = 0;
end
else
error('Fourth argument must be a structure of options.')
end
end
if (m > n)
% this means we want to find the right singular vectors first
% [V D] = eigs(A'*A)
%f = inline(global AFASTSVDMATRIX;
%AFASTSVDMATRIX*(AFASTSVDMATRIX*v), v);
[V D] = eigs(@multiply_mtm, n, bk, bsigma, boptions, A);
[dummy, perm] = sort(-diag(D));
S = diag(sqrt(diag(D(perm, perm))));
V = V(:, perm);
Sinv = diag(1./sqrt(diag(D)));
U = (A*V)*Sinv;
else
% find the left singular vectors first
% [U D] = eigs(A*A')
%f = inline(global AFASTSVDMATRIX; A*(A*v), v);
[U D] = eigs(@multiply_mmt, m, bk, bsigma, boptions, A);
[dummy, perm] = sort(-diag(D));
S = diag(sqrt(diag(D(perm, perm))));
U = U(:, perm);
Sinv = diag(1./sqrt(diag(D)));
V = Sinv*(U'*A);
V = V';
end;
if nargout <= 1
U = diag(S);
end
%clear global AFASTSVDMATRIX;
function mtmv = multiply_mtm(v, A)
mtmv = A'*(A*v);
%global AFASTSVDMATRIX;
%mtmv = AFASTSVDMATRIX*(AFASTSVDMATRIX*v);
function mmtv = multiply_mmt(v, A)
mmtv = A*(A'*v);
%global AFASTSVDMATRIX;
%mmtv = AFASTSVDMATRIX*(AFASTSVDMATRIX*v);
%%%%%%%%%%%%%END GLEICHS fastsvds.m%%%%%%%%%%
%%%%%%%%%%readSMAT.m%%%%%%%%%%%%
function A = readSMAT(filename)
% readSMAT reads an indexed sparse matrix representation of
% a matrix and creates a MATLAB sparse matrix.
%
% A = readSMAT(filename)
% filename - the name of the SMAT file
% A - the MATLAB sparse matrix
%
s = load(filename);
m = s(1,1);
n = s(1,2);
ind_i = s(2:length(s),1)+1;
ind_j = s(2:length(s),2)+1;
val = s(2:length(s),3);
A = sparse(ind_i,ind_j,val, m, n);
%%%%%%%%%%end readSMAT.m%%%%%%%%%%
Bibliography
[1] Charles J. Alpert, Andrew B. Kahng, and So-Zen Yao. Spectral partitioning: The
more eigenvectors, the better. In Proc. ACM/IEEE Design Automation Conf.,
pages 195–200, 1995.
[2] Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using
PageRank vectors. In Proceedings of the 47th Annual IEEE Symposium on Foun-
dations of Computer Science, 2006.
[3] Francis R. Bach and Michael I. Jordan. Finding clusters in independent compo-
nent analysis. In 4th Intl. Symp. on Independent Component Analysis and
Signal Separation (ICA 2003), pages 891–896, 2003.
[4] Barbara E. Ball. Clustering directed graphs without symmetrization, Masters
Thesis, College of Charleston, 2006.
[5] Ibai E. Basabe. A new way to cluster data, Masters Thesis, College of Charleston,
2007.
[6] Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra
for intelligent information retrieval. SIAM Review, 37:573–595, 1995.
[7] Michael W. Berry and Murray Browne. Understanding search engines: mathe-
matical modeling and text retrieval. Society for Industrial and Applied Mathe-
matics, Philadelphia, PA, USA, 1999.
[8] Mike W. Berry and Murray Browne. Lecture Notes in Data Mining. World
Scientific Publishing Co., 2006.
[9] Daniel Boley. Principal direction divisive partitioning. Data Mining and Knowl-
edge Discovery, 2:325–344, 1998.
[10] Fan R.K. Chung. Spectral Graph Theory. Number 92. American Mathematical
Society, 1997.
[11] Matt Rasmussen, David Gleich, and Leonid Zhukov. Vismatrix, 2006.
[12] Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Arnold
Publishers, May 2001.
[13] Miroslav Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical
Journal, 23:298–305, 1973.
[14] Miroslav Fiedler. A property of eigenvectors of nonnegative symmetric matrices
and its application to graph theory. Czechoslovak Mathematical Journal, 25:619–
633, 1975.
[15] Stephen Guattery and Gary L. Miller. On the performance of spectral graph par-
titioning methods. In SODA: ACM-SIAM Symposium on Discrete Algorithms (A
Conference on Theoretical and Experimental Analysis of Discrete Algorithms),
1995.
[16] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. On clustering vali-
dation techniques. Journal of Intelligent Information Systems, 17:107–145, 2001.
[17] A. Hyvarinen and E. Oja. Independent component analysis: algorithms and
applications. Neural Netw., 13(4-5):411–430, 2000.
[18] W.N. Anderson Jr. and T.D. Morely. Eigenvalues of the Laplacian of a graph,
1968.
[19] Jacob Kogan. Introduction to Clustering Large and High-Dimensional Data.
Cambridge University Press, New York, NY, USA, 2007.
[20] Ulrike von Luxburg. Limits of spectral clustering. In Advances in Neural
Information Processing Systems, pages 857–864. MIT Press, 2004.
[21] Carl D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2000.
[22] Carl D. Meyer. Extended Fiedler clustering, Preprint 2007.
[23] Carl D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, preprint
2007.
[24] A. Pothen, H. Simon, and K. Liou. Partitioning sparse matrices with eigenvectors
of graphs. SIAM Journal on Matrix Analysis and Applications, 11(3):430–452,
Jul 1990.
[25] Andrew Rosenberg and Julia Hirschberg. V-measure: A conditional entropy-
based external cluster evaluation measure. In Proceedings of the 2007 Joint
Conference on Empirical Methods in Natural Language Processing and Compu-
tational Natural Language Learning (EMNLP-CoNLL), pages 410–420.
[26] Marco Saerens, Francois Fouss, Luh Yen, and Pierre Dupont. The principal com-
ponents analysis of a graph, and its relationships to spectral clustering. In Pro-
ceedings of the 15th European Conference on Machine Learning (ECML 2004).
Lecture Notes in Artificial Intelligence, pages 371–383. Springer-Verlag, 2004.
[27] D. B. Skillicorn. Clusters within clusters: SVD and counterterrorism. In SIAM
Workshop on Counterterrorism, 2003.
[28] David Skillicorn. Understanding Complex Datasets: Data Mining with Matrix
Decompositions (Chapman & Hall/CRC Data Mining and Knowledge Discovery
Series). Chapman & Hall/CRC, May 2007.
[29] Alexander Strehl and Joydeep Ghosh. Cluster ensembles: a knowledge reuse
framework for combining multiple partitions. J. Mach. Learn. Res., 3:583–617,
2003.
[30] Yan Yang and Mohamed S. Kamel. An aggregated clustering approach using
multi-ant colonies algorithms. Pattern Recogn., 39(7):1278–1289, 2006.
[31] Leonid Zhukov. Technical report: Spectral clustering of large advertiser datasets
part 1, 2003.