
Subspace Clustering for High Dimensional Data: A Review∗

Lance Parsons, Ehtesham Haque, Huan Liu
Department of Computer Science Engineering
Arizona State University
Tempe, AZ 85281
lparsons@asu.edu, Ehtesham.Haque@asu.edu, hliu@asu.edu

∗ Supported in part by grants from Prop 301 (No. ECR A601) and CEINT 2004.

ABSTRACT
Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. Often in high dimensional data, many dimensions are irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, possibly overlapping subspaces. There are two major branches of subspace clustering based on their search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches find dense regions in low dimensional spaces and combine them to form clusters. This paper presents a survey of the various subspace clustering algorithms along with a hierarchy organizing the algorithms by their defining characteristics. We then compare the two main approaches to subspace clustering using empirical scalability and accuracy tests and discuss some potential applications where subspace clustering could be particularly useful.

Keywords
clustering survey, subspace clustering, projected clustering, high dimensional data

1. INTRODUCTION & BACKGROUND
Cluster analysis seeks to discover groups, or clusters, of similar objects. The objects are usually represented as a vector of measurements, or a point in multidimensional space. The similarity between objects is often determined using distance measures over the various dimensions in the dataset [44; 45]. Technology advances have made data collection easier and faster, resulting in larger, more complex datasets with many objects and dimensions. As the datasets become larger and more varied, adaptations to existing algorithms are required to maintain cluster quality and speed. Traditional clustering algorithms consider all of the dimensions of an input dataset in an attempt to learn as much as possible about each object described. In high dimensional data, however, many of the dimensions are often irrelevant. These irrelevant dimensions can confuse clustering algorithms by hiding clusters in noisy data. In very high dimensions it is common for all of the objects in a dataset to be nearly equidistant from each other, completely masking the clusters. Feature selection methods have been employed somewhat successfully to improve cluster quality. These algorithms find a subset of dimensions on which to perform clustering by removing irrelevant and redundant dimensions. Unlike feature selection methods, which examine the dataset as a whole, subspace clustering algorithms localize their search and are able to uncover clusters that exist in multiple, possibly overlapping subspaces.
Another reason that many clustering algorithms struggle with high dimensional data is the curse of dimensionality. As the number of dimensions in a dataset increases, distance measures become increasingly meaningless. Additional dimensions spread out the points until, in very high dimensions, they are almost equidistant from each other. Figure 1 illustrates how additional dimensions spread out the points in a sample dataset. The dataset consists of 20 points randomly placed between 0 and 2 in each of three dimensions. Figure 1(a) shows the data projected onto one axis. The points are close together, with about half of them in a one unit sized bin. Figure 1(b) shows the same data stretched into the second dimension. By adding another dimension we spread the points out along another axis, pulling them further apart. Now only about a quarter of the points fall into a unit sized bin. In Figure 1(c) we add a third dimension, which spreads the data further apart. A one unit sized bin now holds only about one eighth of the points. If we continue to add dimensions, the points will continue to spread out until they are all almost equally far apart and distance is no longer very meaningful. The problem is exacerbated when objects are related in different ways in different subsets of dimensions. It is this type of relationship that subspace clustering algorithms seek to uncover. In order to find such clusters, the irrelevant features must be removed to allow the clustering algorithm to focus on only the relevant dimensions. Clusters found in lower dimensional space also tend to be more easily interpretable, allowing the user to better direct further study.
Methods are needed that can uncover clusters formed in various subspaces of massive, high dimensional datasets and represent them in easily interpretable and meaningful ways [48; 66].



[Figure 1: three panels plotting the 20 sample points over Dimension a, Dimension b, and Dimension c — (a) 11 Objects in One Unit Bin, (b) 6 Objects in One Unit Bin, (c) 4 Objects in One Unit Bin.]

Figure 1: The curse of dimensionality. Data in only one dimension is relatively tightly packed. Adding a dimension stretches the points across that dimension, pushing them further apart. Additional dimensions spread the data even further, making high dimensional data extremely sparse.
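To make the effect in Figure 1 concrete, the short Python sketch below (an illustration written for this review, not code from the original study) places 20 random points between 0 and 2 in d dimensions and counts how many fall inside one fixed unit sized bin as d grows; the bin location is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

n_points = 20
for d in (1, 2, 3, 10):
    # 20 points placed uniformly at random between 0 and 2 in d dimensions
    points = rng.uniform(0.0, 2.0, size=(n_points, d))
    # count points inside a fixed unit-sized bin, here [0.5, 1.5] in every dimension
    inside = np.all((points >= 0.5) & (points <= 1.5), axis=1).sum()
    print(f"d={d:2d}: {inside} of {n_points} points in a one unit sized bin")
```

On average roughly half of the points land in the bin for d = 1, a quarter for d = 2, an eighth for d = 3, and almost none by d = 10, mirroring the fractions quoted above.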

This scenario is becoming more common as we strive to examine data from various perspectives. One example of this occurs when clustering query results. A query for the term "Bush" could return documents on the president of the United States as well as information on landscaping. If the documents are clustered using the words as attributes, the two groups of documents would likely be related on different sets of attributes. Another example is found in bioinformatics with DNA microarray data. One population of cells in a microarray experiment may be similar to another because they both produce chlorophyll, and thus be clustered together based on the expression levels of a certain set of genes related to chlorophyll. However, another population might be similar because the cells are regulated by the circadian clock mechanisms of the organism. In this case, they would be clustered on a different set of genes. These two relationships represent clusters in two distinct subsets of genes. These datasets present new challenges and goals for unsupervised learning. Subspace clustering algorithms are one answer to those challenges. They excel in situations like those described above, where objects are related in multiple, different ways.
There are a number of excellent surveys of clustering techniques available. The classic book by Jain and Dubes [43] offers an aging, but comprehensive look at clustering. Zait and Messatfa offer a comparative study of clustering algorithms in [77]. Jain et al. published another survey in 1999 [44]. More recent data mining texts include a chapter on clustering [34; 39; 45; 73]. Kolatch presents an updated hierarchy of clustering algorithms in [50]. One of the more recent and comprehensive surveys was published by Berkhin and includes a small section on subspace clustering [11]. Gan presented a small survey of subspace clustering methods at the Southern Ontario Statistical Graduate Students Seminar Days [33]. However, little work has dealt with the subject of subspace clustering in a comprehensive and comparative manner.
In the next section we discuss feature transformation and feature selection techniques, which also deal with high dimensional data. Each of these techniques works well with particular types of datasets. However, in Section 3 we show an example of a dataset for which neither technique is well suited, and introduce subspace clustering as a potential solution. Section 4 contains a summary of many subspace clustering algorithms and organizes them into a hierarchy according to their primary characteristics. In Section 5 we analyze the performance of a representative algorithm from each of the two major branches of subspace clustering. Section 6 discusses some potential real world applications for subspace clustering in Web text mining and bioinformatics. We summarize our conclusions in Section 7. Appendix A contains definitions for some common terms used throughout the paper.

2. DIMENSIONALITY REDUCTION
Techniques for clustering high dimensional data have included both feature transformation and feature selection techniques. Feature transformation techniques attempt to summarize a dataset in fewer dimensions by creating combinations of the original attributes. These techniques are very successful in uncovering latent structure in datasets. However, since they preserve the relative distances between objects, they are less effective when there are large numbers of irrelevant attributes that hide the clusters in a sea of noise. Also, because the new features are combinations of the originals, they may be very difficult to interpret in the context of the domain. Feature selection methods select only the most relevant of the dimensions from a dataset to reveal groups of objects that are similar on only a subset of their attributes. While quite successful on many datasets, feature selection algorithms have difficulty when clusters exist in different subspaces. It is this type of data that motivated the evolution to subspace clustering algorithms. These algorithms take the concepts of feature selection one step further by selecting relevant subspaces for each cluster separately.

2.1 Feature Transformation
Feature transformations are commonly used on high dimensional datasets. These methods include techniques such as principal component analysis and singular value decomposition. The transformations generally preserve the original, relative distances between objects. In this way, they summarize the dataset by creating linear combinations of the attributes and, hopefully, uncover latent structure.



Feature transformation is often a preprocessing step, allowing the clustering algorithm to use just a few of the newly created features. A few clustering methods have incorporated the use of such transformations to identify important features and iteratively improve their clustering [24; 41]. While often very useful, these techniques do not actually remove any of the original attributes from consideration. Thus, information from irrelevant dimensions is preserved, making these techniques ineffective at revealing clusters when there are large numbers of irrelevant attributes that mask the clusters. Another disadvantage of using combinations of attributes is that they are difficult to interpret, often making the clustering results less useful. Because of this, feature transformations are best suited to datasets where most of the dimensions are relevant to the clustering task, but many are highly correlated or redundant.

[Figure 2: three-dimensional scatter plot of the sample data over Dimension a, Dimension b, and Dimension c, with the (ab) and (bc) clusters marked.]

Figure 2: Sample dataset with four clusters, each in two dimensions with the third dimension being noise. Points from two clusters can be very close together, confusing many traditional clustering algorithms.
2.2 Feature Selection
Feature selection attempts to discover the attributes of a dataset that are most relevant to the data mining task at hand. It is a commonly used and powerful technique for reducing the dimensionality of a problem to more manageable levels. Feature selection involves searching through various feature subsets and evaluating each of these subsets using some criterion [13; 54; 63; 76]. The most popular search strategies are greedy sequential searches through the feature space, either forward or backward. The evaluation criteria follow one of two basic models, the wrapper model and the filter model [49]. The wrapper model techniques evaluate the dataset using the data mining algorithm that will ultimately be employed. Thus, they "wrap" the selection process around the data mining algorithm. Algorithms based on the filter model examine intrinsic properties of the data to evaluate the feature subset prior to data mining.
Much of the work in feature selection has been directed at supervised learning. The main difference between feature selection in supervised and unsupervised learning is the evaluation criterion. Supervised wrapper models use classification accuracy as a measure of goodness. The filter based approaches almost always rely on the class labels, most commonly assessing correlations between features and the class labels [63]. In the unsupervised clustering problem, there are no universally accepted measures of accuracy and no class labels. However, there are a number of methods that adapt feature selection to clustering.
Entropy measurements are the basis of the filter model approach presented in [19; 22]. The authors argue that entropy tends to be low for data that contains tight clusters, and thus is a good measure to determine feature subset relevance. The wrapper method proposed in [47] forms a new feature subset and evaluates the resulting set by applying a standard k-means algorithm. The EM clustering algorithm is used in the wrapper framework in [25]. Hybrid methods have also been developed that use a filter approach as a heuristic and refine the results with a clustering algorithm. One such method, by Devaney and Ram, uses category utility to evaluate the feature subsets. They use the COBWEB clustering algorithm to guide the search, but evaluate based on intrinsic data properties [23]. Another hybrid approach uses a greedy algorithm to rank features based on entropy values and then uses k-means to select the best subsets of features [20].
In addition to using different evaluation criteria, unsupervised feature selection methods have employed various search methods in attempts to scale to large, high dimensional datasets. With such datasets, random searching becomes a viable heuristic method and has been used with many of the aforementioned criteria [1; 12; 29]. Sampling is another method used to improve the scalability of algorithms. Mitra et al. partition the original feature set into clusters based on a similarity function and select representatives from each group to form the sample [57].

3. SUBSPACE CLUSTERING THEORY
Subspace clustering is an extension of feature selection that attempts to find clusters in different subspaces of the same dataset. Just as with feature selection, subspace clustering requires a search method and an evaluation criterion. In addition, subspace clustering must somehow limit the scope of the evaluation criteria so as to consider different subspaces for each different cluster.

3.1 Motivation
We can use a simple dataset to illustrate the need for subspace clustering. We created a sample dataset with four hundred instances in three dimensions. The dataset is divided into four clusters of 100 instances, each existing in only two of the three dimensions. The first two clusters exist in dimensions a and b. The data forms a normal distribution with means 0.5 and -0.5 in dimension a and 0.5 in dimension b, and standard deviations of 0.2. In dimension c, these clusters have µ = 0 and σ = 1. The second two clusters are in dimensions b and c and were generated in the same manner. The data can be seen in Figure 2. When k-means is used to cluster this sample data, it does a poor job of finding the clusters. This is because each cluster is spread out over some irrelevant dimension. In higher dimensional datasets this problem becomes even worse and the clusters become impossible to find, suggesting that we consider fewer dimensions.
As we have discussed, using feature transformation techniques such as principal component analysis does not help in this instance, since relative distances are preserved and the effects of the irrelevant dimension remain. Instead we might try using a feature selection algorithm to remove one or two dimensions. Figure 3 shows the data projected in a single dimension (organized by index on the x-axis for ease of interpretation). We can see that none of these projections of the data are sufficient to fully separate the four clusters. Alternatively, if we only remove one dimension, we produce the graphs in Figure 4.
However, it is worth noting that the first two clusters (red and green) are easily separated from each other and the rest of the data when viewed in dimensions a and b (Figure 4(a)). This is because those clusters were created in dimensions a and b and removing dimension c removes the noise from those two clusters. The other two clusters (blue and purple) completely overlap in this view since they were created in dimensions b and c and removing c made them indistinguishable from one another. It follows, then, that those two clusters are most visible in dimensions b and c (Figure 4(b)). Thus, the key to finding each of the clusters in this dataset is to look in the appropriate subspaces. Next, we explore the various goals and challenges of subspace clustering.

3.2 Goals and Challenges
The main objective of clustering is to find high quality clusters within a reasonable time. However, different approaches to clustering often define clusters in different ways, making it impossible to define a universal measure of quality. Instead, clustering algorithms must make reasonable assumptions, often somewhat based on input parameters. Still, the results must be carefully analyzed to ensure they are meaningful. The algorithms must be efficient in order to scale with the increasing size of datasets. This often requires heuristic approaches, which again make some assumptions about the data and/or the clusters to be found.
While all clustering methods organize instances of a dataset into groups, some require clusters to be discrete partitions and others allow overlapping clusters. The shape of clusters can also vary, from semi-regular shapes defined by a center and a distribution to much more irregular shapes defined by units in a hyper-grid. Subspace clustering algorithms must also define the subset of features that are relevant for each cluster. These subspaces are almost always allowed to overlap. Many clustering algorithms create discrete partitions of the dataset, putting each instance into one group. Some, like k-means, put every instance into one of the clusters. Others allow for outliers, which are defined as points that do not belong to any of the clusters. Still other clustering algorithms create overlapping clusters, allowing an instance to belong to more than one group; usually these approaches also allow an instance to be an outlier, belonging to no particular cluster.
Clustering algorithms also differ in the shape of clusters they find. This is often determined by the way clusters are represented. Algorithms that represent clusters using a center and some distribution function(s) are biased toward hyper-elliptical clusters. However, other algorithms define clusters as combinations of units of a hyper-grid, creating potentially very irregularly shaped clusters with flat faces. No one definition of clusters is better than the others, but some are more appropriate for certain problems. Domain specific knowledge is often very helpful in determining which type of cluster formation will be the most useful.
The very nature of unsupervised learning means that the user knows little about the patterns they seek to uncover. Because of the problem of determining cluster quality, it is often difficult to determine the number of clusters in a particular dataset. The problem is compounded for subspace algorithms in that they must also determine the dimensionality of the subspaces. Many algorithms rely on an input parameter to assist them; however, this is often not feasible to determine ahead of time and is often a question that one would like the algorithm to answer. To be most useful, algorithms should strive for stability across a range of parameter values.
Since there is no universal definition of clustering, there is no universal measure with which to compare clustering results. However, evaluating cluster quality is essential, since any clustering algorithm will produce some clustering for every dataset, even if the results are meaningless. Cluster quality measures are just heuristics and do not guarantee meaningful clusters, but the clusters found should represent some meaningful pattern in the data in the context of the particular domain, often requiring the analysis of a human expert.
In order to verify clustering results, domain experts require the clusters to be represented in interpretable and meaningful ways. Again, subspace clusters must not only represent the cluster, but also the subspace in which it exists. Interpretation of clustering results is also important since clustering is often a first step that helps direct further study.
In addition to producing high quality, interpretable clusters, clustering algorithms must also scale well with respect to the number of instances and the number of dimensions in large datasets. Subspace clustering methods must also be scalable with respect to the dimensionality of the subspaces where the clusters are found. In high dimensional data, the number of possible subspaces is huge, requiring efficient search algorithms. The choice of search strategy creates different biases and greatly affects the assumptions made by the algorithm. In fact, there are two major types of subspace clustering algorithms, distinguished by the type of search employed. In the next section we review subspace clustering algorithms and present a hierarchy that organizes existing algorithms based on the strategy they implement.

4. SUBSPACE CLUSTERING METHODS
This section discusses the main subspace clustering strategies and summarizes the major subspace clustering algorithms. Figure 5 presents a hierarchy of subspace clustering algorithms organized by the search technique employed and the measure used to determine locality. The two main types of algorithms are distinguished by their approach to searching for subspaces. A naive approach might be to search through all possible subspaces and use cluster validation techniques to determine the subspaces with the best clusters [43]. This is not feasible because the subset generation problem is intractable [49; 14]. More sophisticated heuristic search methods are required, and the choice of search technique determines many of the other characteristics of an algorithm. Therefore, the first division in the hierarchy (Figure 5) splits subspace clustering algorithms into two groups, the top-down search and bottom-up search methods.
Subspace clustering algorithms must evaluate features on only a subset of the data that represents a cluster, and they must use some measure to define this context. We refer to this as a "measure of locality". We further divide the two categories of subspace clustering algorithms based on how they determine a measure of locality with which to evaluate subspaces.



[Figure 3: three panels with histograms — (a) Dimension a, (b) Dimension b, (c) Dimension c — each plotting one dimension of the sample data against the instance index.]

Figure 3: Sample data plotted in one dimension, with histogram. While some clustering can be seen, points from multiple
clusters are grouped together in each of the three dimensions.

[Figure 4: three scatter plots of the sample data — (a) Dims a & b, (b) Dims b & c, (c) Dims a & c.]

Figure 4: Sample data plotted in each set of two dimensions. In both (a) and (b) we can see that two clusters are properly separated, but the remaining two are mixed together. In (c) the four clusters are more visible, but they still overlap each other and are impossible to completely separate.

4.1 Bottom-Up Subspace Search Methods
The bottom-up search methods take advantage of the downward closure property of density to reduce the search space, using an APRIORI style approach. Algorithms first create a histogram for each dimension and select those bins with densities above a given threshold. The downward closure property of density means that if there are dense units in k dimensions, there are dense units in all (k − 1) dimensional projections. Candidate subspaces in two dimensions can then be formed using only those dimensions which contained dense units, dramatically reducing the search space. The algorithm proceeds until there are no more dense units found. Adjacent dense units are then combined to form clusters. This is not always easy, and one cluster may be mistakenly reported as two smaller clusters. The nature of the bottom-up approach leads to overlapping clusters, where one instance can be in zero or more clusters. Obtaining meaningful results is dependent on the proper tuning of the grid size and the density threshold parameters. These can be particularly difficult to set, especially since they are used across all of the dimensions in the dataset. A popular adaptation of this strategy provides data driven, adaptive grid generation to stabilize the results across a range of density thresholds. A sketch of the basic candidate generation step is given below.
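The following Python sketch illustrates the APRIORI style search described above. It is a simplification written for this review, not code from any of the surveyed systems; the grid, the density threshold, and the dimensionality cap are illustrative parameters.

```python
import numpy as np
from itertools import combinations

def dense_units(data, bins=10, density_threshold=0.05, max_dims=3):
    """Grid-based bottom-up search: a 'unit' is a tuple of (dimension, bin) pairs."""
    n, d = data.shape
    # discretize every dimension into equal-width bins (a static grid)
    bin_ids = np.column_stack([
        np.digitize(data[:, j], np.linspace(data[:, j].min(), data[:, j].max(), bins + 1)[1:-1])
        for j in range(d)
    ])

    def support(unit):
        mask = np.ones(n, dtype=bool)
        for dim, b in unit:
            mask &= bin_ids[:, dim] == b
        return mask.mean()

    # one-dimensional dense units
    dense = {1: [((j, b),) for j in range(d) for b in range(bins)
                 if support(((j, b),)) >= density_threshold]}

    # grow candidates one dimension at a time, pruning by downward closure
    for k in range(2, max_dims + 1):
        candidates = set()
        previous = set(dense[k - 1])
        for u, v in combinations(dense[k - 1], 2):
            merged = tuple(sorted(set(u) | set(v)))
            dims = [dim for dim, _ in merged]
            if len(merged) == k and len(set(dims)) == k:
                # keep the candidate only if every (k-1)-dimensional projection is dense
                if all(tuple(sorted(set(merged) - {cell})) in previous for cell in merged):
                    candidates.add(merged)
        dense[k] = [u for u in candidates if support(u) >= density_threshold]
        if not dense[k]:
            break
    return dense
```

Adjacent dense units in each discovered subspace would then be merged into clusters; the surveyed algorithms differ mainly in how the bins are chosen and how the merging is performed.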
The bottom-up methods determine locality by creating bins for each dimension and using those bins to form a multi-dimensional grid. There are two main approaches to accomplishing this. The first group consists of CLIQUE and ENCLUS, which both use a static sized grid to divide each dimension into bins. The second group contains MAFIA, Cell-based Clustering (CBF), CLTree, and DOC. These methods distinguish themselves by using data driven strategies to determine the cut-points for the bins on each dimension. MAFIA and CBF use histograms to analyze the density of data in each dimension. CLTree uses a decision tree based strategy to find the best cutpoint. DOC uses a maximum width and minimum number of instances per cluster to guide a random search.

4.1.1 CLIQUE
CLIQUE [8] was one of the first algorithms proposed that attempted to find clusters within subspaces of the dataset. As described above, the algorithm combines density and grid based clustering and uses an APRIORI style technique to find clusterable subspaces. Once the dense subspaces are found they are sorted by coverage, where coverage is defined as the fraction of the dataset covered by the dense units in the subspace. The subspaces with the greatest coverage are kept and the rest are pruned. The algorithm then finds adjacent dense grid units in each of the selected subspaces using a depth first search. Clusters are formed by combining these units using a greedy growth scheme. The algorithm starts with an arbitrary dense unit and greedily grows a maximal region in each dimension until the union of all the regions covers the entire cluster. Redundant regions are removed by a repeated procedure where the smallest redundant regions are discarded until no further maximal region can be removed. The hyper-rectangular clusters are then defined by a Disjunctive Normal Form (DNF) expression. CLIQUE is able to find many types and shapes of clusters and presents them in easily interpretable ways.



Figure 5: Hierarchy of Subspace Clustering Algorithms. At the top there are two main types of subspace clustering algorithms,
distinguished by their search strategy. One group uses a top-down approach that iteratively updates weights for dimensions
for each cluster. The other uses a bottom-up approach, building clusters up from dense units found in low dimensional space.
Each of those groups is then further categorized based on how they define a context for the quality measure.

The region growing, density based approach to generating clusters allows CLIQUE to find clusters of arbitrary shape. CLIQUE is also able to find any number of clusters in any number of dimensions, and the number is not predetermined by a parameter. Clusters may be found in the same, overlapping, or disjoint subspaces. The DNF expressions used to represent clusters are often very interpretable. The clusters may also overlap each other, meaning that instances can belong to more than one cluster. This is often advantageous in subspace clustering since the clusters often exist in different subspaces and thus represent different relationships.
Tuning the parameters for a specific dataset can be difficult. Both grid size and the density threshold are input parameters which greatly affect the quality of the clustering results. Even with proper tuning, the pruning stage can sometimes eliminate small but important clusters from consideration. Like other bottom-up algorithms, CLIQUE scales well with the number of instances and dimensions in the dataset. CLIQUE (and other similar algorithms), however, does not scale well with the number of dimensions in the output clusters. This is not usually a major issue, since subspace clustering tends to be used to find low dimensional clusters in high dimensional data.

4.1.2 ENCLUS
ENCLUS [17] is a subspace clustering method based heavily on the CLIQUE algorithm. However, ENCLUS does not measure density or coverage directly, but instead measures entropy. The algorithm is based on the observation that a subspace with clusters typically has lower entropy than a subspace without clusters. Clusterability of a subspace is defined using three criteria: coverage, density, and correlation. Entropy can be used to measure all three of these criteria. Entropy decreases as the density of cells increases. Under certain conditions, entropy will also decrease as the coverage increases. Interest is a measure of correlation and is defined as the difference between the sum of entropy measurements for a set of dimensions and the entropy of the multi-dimensional distribution. Larger values indicate higher correlation between dimensions, and an interest value of zero indicates independent dimensions.
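A rough sketch of these two quantities for a candidate subspace is shown below, assuming a fixed-width grid. The grid handling and thresholds are illustrative, not the ENCLUS authors' implementation: the entropy of a subspace is computed from the fraction of points in each grid cell, and interest is the sum of the one-dimensional entropies minus the joint entropy.

```python
import numpy as np

def entropy(data, dims, bins=10):
    """Shannon entropy (in bits) of the grid-cell distribution over the given dimensions."""
    hist, _ = np.histogramdd(data[:, dims], bins=bins)
    p = hist.flatten() / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def interest(data, dims, bins=10):
    """Sum of one-dimensional entropies minus the joint entropy; zero for independent dimensions."""
    return sum(entropy(data, [d], bins) for d in dims) - entropy(data, dims, bins)

# ENCLUS-style check for a candidate subspace (omega and eps_prime are hypothetical thresholds):
# interesting = entropy(data, dims) < omega and interest(data, dims) > eps_prime
```

Subspaces with low entropy contain tight clusters, while a large interest value signals correlated dimensions, which is what the ENCLUS pruning criteria exploit.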
ENCLUS uses the same APRIORI style, bottom-up approach as CLIQUE to mine significant subspaces. Pruning is accomplished using the downward closure property of entropy (below a threshold ω) and the upward closure property of interest (i.e. correlation) to find minimally correlated subspaces. If a subspace is highly correlated (above a threshold ε), all of its superspaces must not be minimally correlated. Since non-minimally correlated subspaces might be of interest, ENCLUS searches for interesting subspaces by calculating interest gain and finding subspaces whose entropy is below ω and whose interest gain exceeds ε′. Once interesting subspaces are found, clusters can be identified using the same methodology as CLIQUE or any other existing clustering algorithm. Like CLIQUE, ENCLUS requires the user to set the grid interval size, Δ, and, also like CLIQUE, the algorithm is highly sensitive to this parameter.
ENCLUS also shares much of the flexibility of CLIQUE. ENCLUS is capable of finding overlapping clusters of arbitrary shapes in subspaces of varying sizes. ENCLUS has a running time similar to that of CLIQUE and shares its poor scalability with respect to the subspace dimensions. One improvement the authors suggest is a pruning approach similar to the way CLIQUE prunes subspaces with low coverage, but based on entropy. This would help to reduce the number of irrelevant clusters in the output, but could also eliminate small, yet meaningful, clusters.

4.1.3 MAFIA
MAFIA [35] is another extension of CLIQUE that uses an adaptive grid based on the distribution of data to improve efficiency and cluster quality. MAFIA also introduces parallelism to improve scalability. MAFIA initially creates a histogram to determine the minimum number of bins for a dimension. The algorithm then combines adjacent cells of similar density to form larger cells. In this manner, the dimension is partitioned based on the data distribution, and the resulting boundaries of the cells capture the cluster perimeter more accurately than fixed sized grid cells.


Once the bins have been defined, MAFIA proceeds much like CLIQUE, using an APRIORI style algorithm to generate the list of clusterable subspaces by building up from one dimension. MAFIA also attempts to allow for parallelization of the clustering process.
In addition to requiring a density threshold parameter, MAFIA also requires the user to specify a threshold for merging adjacent windows. Windows are essentially subsets of bins in a histogram that is plotted for each dimension. Adjacent windows within the specified threshold are merged to form larger windows. The formation of the adaptive grid is dependent on these windows. Also, to assist the adaptive grid algorithm, MAFIA takes a default grid size as input for dimensions where the data is uniformly distributed. In such dimensions, the data is divided into a small, fixed number of partitions. Despite the adaptive nature of the grids, the algorithm is rather sensitive to these parameters.
Like related methods, MAFIA finds any number of clusters of arbitrary shape in subspaces of varying size. It also represents clusters as DNF expressions. Due to parallelization and other improvements, MAFIA performs many times faster than CLIQUE on similar datasets. It scales linearly with the number of instances and even better with the number of dimensions in the dataset (see Section 5). However, like the other algorithms in this category, MAFIA's running time grows exponentially with the number of dimensions in the clusters.
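The adaptive grid idea can be sketched as follows. This is an illustrative simplification under an assumed relative-difference merging rule; MAFIA's actual window-merging procedure and thresholds differ.

```python
import numpy as np

def adaptive_bins(values, initial_bins=50, merge_threshold=0.2):
    """Merge adjacent equal-width histogram bins of similar density into larger windows.

    Returns the boundaries of the resulting variable-width cells for one dimension.
    """
    counts, edges = np.histogram(values, bins=initial_bins)
    boundaries = [edges[0]]
    for i in range(1, initial_bins):
        left, right = counts[i - 1], counts[i]
        # keep a cut-point only where the density changes noticeably;
        # otherwise the two bins are merged into one larger window
        if abs(left - right) > merge_threshold * max(left, right, 1):
            boundaries.append(edges[i])
    boundaries.append(edges[-1])
    return boundaries
```

Dimensions whose histogram is nearly uniform would instead be split into the small, fixed number of default partitions mentioned above.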
4.1.4 Cell-based Clustering Method (CBF)
Cell-based Clustering (CBF) [16] attempts to address scalability issues associated with many bottom-up algorithms. One problem for other bottom-up algorithms is that the number of bins created increases dramatically as the number of dimensions increases. CBF uses a cell creation algorithm that creates optimal partitions by repeatedly examining minimum and maximum values on a given dimension, which results in the generation of fewer bins (cells). CBF also addresses scalability with respect to the number of instances in the dataset. In particular, other approaches often perform poorly when the dataset is too large to fit in main memory. CBF stores the bins in an efficient filtering-based index structure, which results in improved retrieval performance.
The CBF algorithm is sensitive to two parameters. The section threshold determines the bin frequency of a dimension. The retrieval time is reduced as the threshold value is increased because the number of records accessed is minimized. The other parameter is the cell threshold, which determines the minimum density of data points in a bin. Bins with density above this threshold are selected as potential members of a cluster. Like the other bottom-up methods, CBF is able to find clusters of various sizes and shapes. It also scales linearly with the number of instances in a dataset. CBF has been shown to have slightly lower precision than CLIQUE but is faster in retrieval and cluster construction [16].

4.1.5 CLTree
CLTree [53] uses a modified decision tree algorithm to select the best cutting planes for a given dataset. It uses a decision tree algorithm to partition each dimension into bins, separating areas of high density from areas of low density. Users can browse the tree and refine the final clustering results. CLTree follows the bottom-up strategy, evaluating each dimension separately and then using only those dimensions with areas of high density in further steps. The decision tree splits correspond to the boundaries of bins and are chosen using a modified gain criterion that essentially measures the density of a region. Hypothetical, uniformly distributed noise is "added" to the dataset and the tree attempts to split the real data from the noise. The noise does not actually have to be added to the dataset; instead, the density can be estimated for any given bin under investigation.
While the number and size of clusters are not required ahead of time, there are two parameters. The first, min_y, is the minimum number of points that a region must contain to be considered interesting during the pruning stage. The second, min_rd, is the minimum relative density between two adjacent regions before the regions are merged to form a larger cluster. These parameters are used to prune the tree, and the results are sensitive to their settings. CLTree finds clusters as hyper-rectangular regions in subspaces of different sizes. Though building the tree could be expensive (O(n²)), the algorithm scales linearly with the number of instances and the number of dimensions of the subspace.

4.1.6 DOC
Density-based Optimal projective Clustering (DOC) is somewhat of a hybrid method that blends the grid based approach used by the bottom-up approaches and the iterative improvement method from the top-down approaches. DOC proposes a mathematical definition for the notion of an optimal projective cluster [64]. A projective cluster is defined as a pair (C, D), where C is a subset of the instances and D is a subset of the dimensions of the dataset. The goal is to find a pair where C exhibits a strong clustering tendency in D. To find these optimal pairs, the algorithm creates a small subset X, called the discriminating set, by random sampling. Ideally, this set can be used to differentiate between relevant and irrelevant dimensions for a cluster. For a given cluster pair (C, D), instances p in C, and instances q in X, the following should hold true: for each dimension i in D, |q(i) − p(i)| ≤ w, where w is the fixed side length of a subspace cluster or hyper-cube, given by the user. p and X are both obtained through random sampling, and the algorithm is repeated with the best result being reported.
The results are heavily dependent on parameter settings. The value of w determines the maximum length of one side of a cluster, and a heuristic is presented to choose an optimal value for w. DOC also requires a parameter α that specifies the minimum number of instances that can form a cluster. α along with w determine the minimum cluster density. Another parameter β is provided to specify the balance between the number of points and the number of dimensions in a cluster. The clusters are hyper-rectangular in shape and represented by a pair (C, D), meaning the set of data points C is projected on the dimension set D. The running time grows linearly with the number of data points but it increases exponentially with the number of dimensions in the dataset.
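A minimal sketch of the dimension test at the heart of DOC is given below, under the assumptions just described: a candidate point p, a random discriminating set X, and a user-supplied width w. The helper names are hypothetical, and the surrounding Monte Carlo loop and quality function are omitted.

```python
import numpy as np

def relevant_dimensions(p, X, w):
    """Dimensions on which every point of the discriminating set X stays within w of p."""
    return np.flatnonzero(np.all(np.abs(X - p) <= w, axis=0))

def cluster_members(data, p, dims, w):
    """All instances inside the box of half-width w around p on the chosen dimensions."""
    return np.flatnonzero(np.all(np.abs(data[:, dims] - p[dims]) <= w, axis=1))
```

DOC repeats this test for many random choices of p and X and keeps the (C, D) pair that scores best under its quality function, which uses β to trade off the number of points against the number of dimensions.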

4.2 Iterative Top-Down Subspace Search Methods
The top-down subspace clustering approach starts by finding an initial approximation of the clusters in the full feature space with equally weighted dimensions.
Next, each dimension is assigned a weight for each cluster. The updated weights are then used in the next iteration to regenerate the clusters. This approach requires multiple iterations of expensive clustering algorithms in the full set of dimensions. Many of the implementations of this strategy use a sampling technique to improve performance. Top-down algorithms create clusters that are partitions of the dataset, meaning each instance is assigned to only one cluster. Many algorithms also allow for an additional group of outliers. Parameter tuning is necessary in order to get meaningful results. Often the most critical parameters for top-down algorithms are the number of clusters and the size of the subspaces, which are often very difficult to determine ahead of time. Also, since subspace size is a parameter, top-down algorithms tend to find clusters in the same or similarly sized subspaces. For techniques that use sampling, the size of the sample is another critical parameter and can play a large role in the quality of the final results.
Locality in top-down methods is determined by some approximation of clustering based on the weights for the dimensions obtained so far. PROCLUS, ORCLUS, FINDIT, and δ-Clusters determine the weights of instances for each cluster. COSA is unique in that it uses the k nearest neighbors for each instance in the dataset to determine the weights for each dimension for that particular instance.
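The overall top-down loop can be summarized by the following skeleton. It is a hypothetical illustration: the inverse-variance weighting used here is only a stand-in, and each concrete algorithm below substitutes its own weighting and assignment rules.

```python
import numpy as np

def top_down_subspace_clustering(data, k, n_iterations=10, seed=0):
    """Generic top-down skeleton: cluster in full space, then re-weight dimensions per cluster."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    centers = data[rng.choice(n, size=k, replace=False)]  # initial approximation
    weights = np.ones((k, d))                              # equally weighted dimensions

    for _ in range(n_iterations):
        # assign each instance to the nearest center under that cluster's weighted distance
        dists = np.stack([np.sqrt((((data - c) ** 2) * w).sum(axis=1))
                          for c, w in zip(centers, weights)])
        labels = dists.argmin(axis=0)

        # update centers and give higher weight to dimensions with low spread within the cluster
        for j in range(k):
            members = data[labels == j]
            if len(members) == 0:
                continue
            centers[j] = members.mean(axis=0)
            inv_spread = 1.0 / (members.var(axis=0) + 1e-9)
            weights[j] = inv_spread / inv_spread.sum() * d
    return labels, centers, weights
```

Each iteration reruns a full-dimensional clustering, which is why sampling is so commonly used to keep the cost of this loop manageable.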
can be inspected after clustering to evaluate the choice of
4.2.1 PROCLUS
PROCLUS [5] was the first top-down subspace clustering algorithm. Similar to CLARANS [58], PROCLUS samples the data, then selects a set of k medoids and iteratively improves the clustering. The algorithm uses a three phase approach consisting of initialization, iteration, and cluster refinement. Initialization uses a greedy algorithm to select a set of potential medoids that are far apart from each other. The objective is to ensure that each cluster is represented by at least one instance in the selected set. The iteration phase selects a random set of k medoids from this reduced dataset, replaces bad medoids with randomly chosen new medoids, and determines if clustering has improved. Cluster quality is based on the average distance between instances and the nearest medoid. For each medoid, a set of dimensions is chosen whose average distances are small compared to statistical expectation. The total number of dimensions associated to medoids must be k × l, where l is an input parameter that selects the average dimensionality of cluster subspaces. Once the subspaces have been selected for each medoid, average Manhattan segmental distance is used to assign points to medoids, forming clusters. The medoid of the cluster with the least number of points is thrown out, along with any medoids associated with fewer than (N/k) × minDeviation points, where minDeviation is an input parameter. The refinement phase computes new dimensions for each medoid based on the clusters formed and reassigns points to medoids, removing outliers.
Like many top-down methods, PROCLUS is biased toward clusters that are hyper-spherical in shape. Also, while clusters may be found in different subspaces, the subspaces must be of similar sizes since the user must input the average number of dimensions for the clusters. Clusters are represented as sets of instances with associated medoids and subspaces and form non-overlapping partitions of the dataset with possible outliers. Due to the use of sampling, PROCLUS is somewhat faster than CLIQUE on large datasets. However, using a small number of representative points can cause PROCLUS to miss some clusters entirely. The cluster quality of PROCLUS is very sensitive to the accuracy of the input parameters, which may be difficult to determine.

4.2.2 ORCLUS
ORCLUS [6] is an extended version of the algorithm PROCLUS [5] that looks for non-axis parallel subspaces. This algorithm arose from the observation that many datasets contain inter-attribute correlations. The algorithm can be divided into three steps: assign clusters, subspace determination, and merge. During the assign phase, the algorithm iteratively assigns data points to the nearest cluster centers. The distance between two points is defined in a subspace E, where E is a set of orthonormal vectors in some d-dimensional space. Subspace determination redefines the subspace E associated with each cluster by calculating the covariance matrix for a cluster and selecting the orthonormal eigenvectors with the least spread (smallest eigenvalues). Clusters that are near each other and have similar directions of least spread are merged during the merge phase.
The number of clusters and the size of the subspace dimensionality must be specified. The authors provide a general scheme for selecting a suitable value. A statistical measure called the cluster sparsity coefficient is provided, which can be inspected after clustering to evaluate the choice of subspace dimensionality. The algorithm is computationally intensive, due mainly to the computation of covariance matrices. ORCLUS uses random sampling to improve speed and scalability and as a result may miss smaller clusters. The output clusters are defined by the cluster center and an associated partition of the dataset, with possible outliers.

4.2.3 FINDIT
A Fast and Intelligent Subspace Clustering Algorithm using Dimension Voting, FINDIT [74] is similar in structure to PROCLUS and the other top-down methods, but uses a unique distance measure called the Dimension Oriented Distance (DOD). The idea is compared to voting, whereby the algorithm tallies the number of dimensions on which two instances are within a threshold distance, ε, of each other. The concept is based on the assumption that in higher dimensions it is more meaningful for two instances to be close in several dimensions rather than in a few [74]. The algorithm typically consists of three phases, namely a sampling phase, a cluster forming phase, and a data assignment phase. The algorithm starts by selecting two small sets generated through random sampling of the data. The sets are used to determine initial representative medoids of the clusters. In the cluster forming phase the correlated dimensions are found using the DOD measure for each medoid. FINDIT then increments the value of ε and repeats this step until the cluster quality stabilizes. In the final phase all of the instances are assigned to medoids based on the subspaces found.
FINDIT requires two input parameters, the minimum number of instances in a cluster and the minimum distance between two clusters. FINDIT is able to find clusters in subspaces of varying size. The DOD measure is dependent on the ε threshold, which is determined by the algorithm in an iterative manner. The iterative phase that determines the best threshold adds significantly to the running time of the algorithm. Because of this, FINDIT employs sampling techniques like the other top-down algorithms.
Sampling helps to improve performance, especially with very large datasets. Section 5 shows the results of empirical experiments using the FINDIT and MAFIA algorithms.

4.2.4 δ-Clusters
The δ-Clusters algorithm uses a distance measure that attempts to capture the coherence exhibited by a subset of instances on a subset of attributes [75]. Coherent instances may not be close in many dimensions, but instead they both follow a similar trend, offset from each other. One coherent instance can be derived from another by shifting by the offset. Pearson R correlation [68] is used to measure coherence among instances. The algorithm starts with initial seeds and iteratively improves the overall quality of the clustering by randomly swapping attributes and data points to improve individual clusters. Residue measures the decrease in coherence that a particular attribute or instance brings to the cluster. The iterative process terminates when individual improvement levels off in each cluster.
The algorithm takes as parameters the number of clusters and the individual cluster size. The running time is particularly sensitive to the cluster size parameter. If the value chosen is very different from the optimal cluster size, the algorithm could take considerably longer to terminate. The main difference between δ-Clusters and other methods is the use of coherence as a similarity measure. Using this measure makes δ-Clusters suitable for domains such as microarray data analysis. Often finding related genes means finding those that respond similarly to environmental conditions. The absolute response rates are often very different, but the type of response and the timing may be similar.
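As a simple illustration of coherence (a sketch with made-up values; the actual δ-Clusters residue computation is more involved), two instances that follow the same trend with a constant offset have a Pearson correlation near one even though their absolute values, and hence their Euclidean distance, differ greatly:

```python
import numpy as np

def coherence(x, y):
    """Pearson correlation of two instances over a shared set of attributes."""
    return float(np.corrcoef(x, y)[0, 1])

gene_a = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
gene_b = gene_a + 10.0            # same response pattern, offset by a constant
print(coherence(gene_a, gene_b))  # 1.0: perfectly coherent despite the large offset
```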
4.2.5 COSA
Clustering On Subsets of Attributes (COSA) [32] is an iterative algorithm that assigns weights to each dimension for each instance, not each cluster. Starting with equally weighted dimensions, the algorithm examines the k nearest neighbors (knn) of each instance. These neighborhoods are used to calculate the respective dimension weights for each instance. Higher weights are assigned to those dimensions that have a smaller dispersion within the knn group. These weights are used to calculate dimension weights for pairs of instances, which are in turn used to update the distances used in the knn calculation. The process is then repeated using the new distances until the weights stabilize. The neighborhoods for each instance become increasingly enriched with instances belonging to its own cluster. The dimension weights are refined as the dimensions relevant to a cluster receive larger weights. The output is a distance matrix based on weighted inverse exponential distance and is suitable as input to any distance-based clustering method. After clustering, the weights of each dimension of cluster members are compared and an overall importance value for each dimension for each cluster is calculated.
The number of dimensions in clusters need not be specified directly, but instead is controlled by a parameter, λ, that controls the strength of incentive for clustering on more dimensions. Each cluster may exist in a different subspace of a different size, but the subspaces do tend to be of similar dimensionality. The algorithm also allows for the adjustment of the k used in the knn calculations. Friedman claims that the results are stable over a wide range of k values. The dimension weights are calculated for each instance and pair of instances, not for each cluster. After clustering, the relevant dimensions must be calculated based on the dimension weights assigned to cluster members.

5. EMPIRICAL COMPARISON
In this section we measure the performance of representative top-down and bottom-up algorithms. We chose MAFIA [35], an advanced version of the bottom-up subspace clustering method, and FINDIT [74], an adaptation of the top-down strategy. In datasets with very high dimensionality, we expect the bottom-up approaches to perform well as they only have to search in the lower dimensionality of the hidden clusters. However, the sampling schemes of top-down approaches should scale well to large datasets. To measure the scalability of the algorithms, we measure the running time of both algorithms and vary the number of instances or the number of dimensions. We also examine how well each of the algorithms was able to determine the correct subspaces for each cluster. The implementation of each algorithm was provided by the respective authors.

5.1 Data
To facilitate the comparison of the two algorithms, we chose to use synthetic datasets so that we have control over the characteristics of the data. Synthetic data also allows us to easily measure the accuracy of the clustering by comparing the output of the algorithms to the known input clusters. The datasets were generated using specialized dataset generators that provided control over the number of instances, the dimensionality of the data, and the dimensions for each of the input clusters. They also output the data in the formats required by the algorithm implementations and provided the necessary meta-information to measure cluster accuracy. The data values lie in the range of 0 to 100. Clusters were generated by restricting the value of relevant dimensions for each instance in a cluster. Values for irrelevant dimensions were chosen randomly to form a uniform distribution over the entire data range.

5.2 Scalability
We measured the scalability of the two algorithms in terms of the number of instances and the number of dimensions in the dataset. In the first set of tests, we fixed the number of dimensions at twenty. On average the datasets contained five hidden clusters, each in a different five dimensional subspace. The number of instances was increased first from 100,000 to 500,000 and then from 1 million to 5 million. Both algorithms scaled linearly with the number of instances, as shown by Figure 6. MAFIA clearly outperforms FINDIT through most of the cases. The superior performance of MAFIA can be attributed to the bottom-up approach, which does not require as many passes through the dataset. Also, when the clusters are embedded in few dimensions, the bottom-up algorithms have the advantage that they only consider the small set of relevant dimensions during most of their search. If the clusters exist in high numbers of dimensions, then performance will degrade as the number of candidate subspaces grows exponentially with the number of dimensions in the subspace [5].
In Figure 6(b) we can see that FINDIT actually outperforms MAFIA when the number of instances approaches four million. This could be attributed to the use of random sampling, which eventually gives FINDIT a performance edge with huge datasets.


[Figure 6: running time in seconds for FindIt and MAFIA — (a) Time vs. Thousands of Instances, (b) Time vs. Millions of Instances.]

Figure 6: Running time vs. number of instances for FINDIT (top-down) and MAFIA (bottom-up). MAFIA outperforms
FINDIT in smaller datasets. However, due to the use of sampling techniques, FINDIT is able to maintain relatively high
performance even with datasets containing five million records.

MAFIA, on the other hand, must scan the entire dataset each time it makes a pass to find dense units on a given number of dimensions. We can see in Section 5.3 that sampling can cause FINDIT to miss clusters or dimensions entirely.
In the second set of tests we fixed the number of instances at 100,000 and increased the number of dimensions of the dataset. Figure 7 is a plot of the running times as the number of dimensions in the dataset is increased from 10 to 100. The datasets contained, on average, five clusters, each in five dimensions. Here, the bottom-up approach is clearly superior, as can be seen in Figure 7. The running time for MAFIA increased linearly with the number of dimensions in the dataset. The running time for the top-down method, FINDIT, increases rapidly as the number of dimensions increases. Sampling does not help FINDIT in this case. As the number of dimensions increases, FINDIT must weight each dimension for each cluster in order to select the most relevant ones. The bottom-up algorithms like MAFIA, however, are not adversely affected by the additional, irrelevant dimensions. This is because those algorithms project the data into small subspaces first, adding only the interesting dimensions to the search.

[Figure 7: running time in seconds vs. number of dimensions for FindIt and MAFIA.]

Figure 7: Running time vs. number of dimensions for FINDIT (top-down) and MAFIA (bottom-up). MAFIA consistently outperforms FINDIT and scales better as the number of dimensions increases.

5.3 Subspace Detection & Accuracy
In addition to comparing the algorithms' scalability, we also compared how accurately each algorithm was able to determine the clusters and corresponding subspaces in the dataset. The results are presented as a confusion matrix that lists the relevant dimensions of the input clusters as well as those output by the algorithm. Table 1 and Table 2 show the best case input and output clusters for MAFIA and FINDIT on a dataset of 100,000 instances in 20 dimensions. The bottom-up algorithm, MAFIA, discovered all of the clusters but left out one significant dimension in four out of the five clusters. Missing one dimension in a cluster can be caused by prematurely pruning a dimension based on a coverage threshold, which can be difficult to determine. This could also occur because the density threshold may not be appropriate across all the dimensions in the dataset. The results were very similar in tests where the number of instances was increased up to 4 million. The only difference was that some clusters were reported as two separate clusters, instead of being properly merged. This fracturing of clusters is an artifact of the grid-based approach used by many bottom-up algorithms, which requires them to merge dense units to form the output clusters. The top-down approach used by FINDIT was better able to identify the significant dimensions for the clusters it uncovered. As the number of instances was increased, FINDIT occasionally missed an entire cluster. As the dataset grew, the clusters were more difficult to find among the noise, and the sampling employed by many top-down algorithms causes them to miss clusters.
Tables 3 and 4 show the results from FINDIT and MAFIA, respectively, when the number of dimensions in the dataset is increased to 100. MAFIA was able to detect all of the clusters. Again, it was missing one dimension for four of the five clusters. Also, higher dimensionality caused the same problem that we noticed with higher numbers of instances: MAFIA mistakenly split one cluster into multiple separate clusters. FINDIT did not fare so well, and we can see that FINDIT missed entire clusters and was unable to find all of the relevant dimensions for the clusters it did find.



Cluster 1 2 3 4 5
Input (4, 6, 12, 14, 17) (1, 8, 9, 15, 18) (1, 7, 9, 18, 20) (1, 12, 15, 18, 19) (5, 14, 16, 18, 19)
Output (4, 6, 14, 17) (1, 8, 9, 15, 18) (7, 9, 18, 20) (12, 15, 18, 19) (5, 14, 18, 19)

Table 1: MAFIA misses one dimension in 4 out of 5 clusters with N = 100,000 and D = 20.

Cluster 1 2 3 4 5
Input (11, 16) (9, 14, 16) (8, 9, 16, 17) (0, 7, 8, 10, 14, 16) (8, 16)
Output (11, 16) (9, 14, 16) (8, 9, 16, 17) (0, 7, 8, 10, 14, 16) (8, 16)

Table 2: FINDIT uncovers all of the clusters in the appropriate dimensions with N = 100,000 and D = 20.
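The comparison summarized in Tables 1 and 2 (and in Tables 3 and 4 below) amounts to matching each output cluster to an input cluster and checking which relevant dimensions were recovered, missed, or added. A small helper along the following lines could produce such a summary; the rule of matching by largest dimension overlap is an assumption made purely for illustration.

    def compare_subspaces(true_subspaces, found_subspaces):
        """For each true cluster, report the best-matching output subspace
        together with the dimensions that were missed or added (sketch)."""
        report = []
        for t in true_subspaces:
            t = set(t)
            # match by largest overlap with any output subspace
            best = set(max(found_subspaces, key=lambda f: len(t & set(f)), default=[]))
            report.append({
                "input": sorted(t),
                "output": sorted(best),
                "missed": sorted(t - best),
                "extra": sorted(best - t),
            })
        return report

    # e.g. cluster 1 of Table 1:
    # compare_subspaces([[4, 6, 12, 14, 17]], [[4, 6, 14, 17]])
    # -> missed = [12], extra = []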

The top-down approach means that FINDIT must evaluate all of the dimensions, and as a greater percentage of them are irrelevant, the relevant ones are more difficult to uncover. The use of sampling can exacerbate this problem, as some clusters may be only weakly represented.

While subspace clustering seems to perform well on synthetic data, the real test for any data mining technique is to use it to solve a real-world problem. In the next section we discuss some areas where subspace clustering may prove to be very useful. Faster and more sophisticated computers have allowed many fields of study to collect massive stores of data. In particular we will look at applications of subspace clustering in information integration systems, web text mining, and bioinformatics.

6. APPLICATIONS
Limitations of current methods and the application of subspace clustering techniques to new domains drive the creation of new techniques and methods. Subspace clustering is especially effective in domains where one can expect to find relationships across a variety of perspectives. Some areas where we feel subspace clustering has great potential are information integration systems, text mining, and bioinformatics. Creating hierarchies of data sources is a difficult task for information integration systems and may be improved upon by specialized subspace clustering algorithms. Web search results can be organized into clusters based on topics to make finding relevant results easier. In bioinformatics, DNA microarray technology allows biologists to collect a huge amount of data, and the explosion of the World Wide Web threatens to drown us in a sea of poorly organized information. Much of the knowledge in these datasets can be extracted by finding and analyzing the patterns in the data. Clustering algorithms have been used with DNA microarray data in the identification and characterization of disease subtypes and for molecular pathway elucidation. However, the high dimensionality of the microarray and text data makes the task of pattern discovery difficult for traditional clustering algorithms [66]. Subspace clustering techniques can be leveraged to uncover the complex relationships found in data from each of these areas.

6.1 Information Integration Systems
Information integration systems are motivated by the fact that our information needs of the future will not be satisfied by closed, centralized databases. Rather, increasingly sophisticated queries will require access to heterogeneous, autonomous information sources located in a distributed manner and accessible through the Internet. In this scenario, query optimization becomes a complex problem since the data is not centralized. The decentralization of data poses a difficult challenge for information integration systems, mainly in the determination of the best subset of sources to use for a given user query [30]. An exhaustive search on all the sources would be a naive and costly solution. In the following, we discuss an application of subspace clustering in the context of query optimization for an information integration system developed here at ASU, Bibfinder [60].

The Bibfinder system maintains coverage statistics for each source based on a log of previous user queries. With such information, Bibfinder can rank the sources for a given query. In order to classify previously unseen queries, a hierarchy of query classes is generated to generalize the statistics. For Bibfinder, the process of generating and storing statistics is proving to be expensive. A combination of subspace clustering and classification methods offers a promising solution to this bottleneck. The subspace clustering algorithm can be applied to the query list, with the queries being instances and the sources corresponding to dimensions of the dataset. The result is a rapid grouping of queries where a group represents queries coming from the same set of sources. Conceptually, each group can be considered a query class, where the classes are generated in a one-step process using subspace clustering. Furthermore, the query classes can be used to train a classifier so that when a user query is given, the classifier can predict a set of sources likely to be useful.
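As a rough illustration of this idea (and not the actual Bibfinder implementation), the query log can be turned into a binary query-by-source matrix, and queries can then be grouped by the subset of sources that answered them, with each group playing the role of a query class. The grouping below simply keys on identical source sets and is only a stand-in for a full subspace clustering step; the log format and names are hypothetical.

    from collections import defaultdict

    def build_query_source_matrix(query_log, sources):
        """query_log: dict mapping each query string to the set of sources
        that returned results for it (hypothetical log format)."""
        rows = []
        for query, answered_by in query_log.items():
            rows.append((query, [1 if s in answered_by else 0 for s in sources]))
        return rows

    def group_queries_by_sources(query_log):
        """Group queries answered by exactly the same set of sources; each
        group approximates one query class in its own subspace of sources."""
        classes = defaultdict(list)
        for query, answered_by in query_log.items():
            classes[frozenset(answered_by)].append(query)
        return classes

    # Example with made-up data:
    # log = {"subspace clustering": {"DBLP", "ACM DL"}, "gene expression": {"PubMed"}}
    # group_queries_by_sources(log)

The resulting classes, or those produced by a genuine subspace clustering algorithm, could then serve as labels for training the classifier described above.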
6.2 Web Text Mining
The explosion of the world wide web has prompted a surge of research attempting to deal with the heterogeneous and unstructured nature of the web. A fundamental problem with organizing web sources is that web pages are not machine readable, meaning their contents only convey semantic meaning to a human user. In addition, semantic heterogeneity is a major challenge; that is, a keyword in one domain may hold a different meaning in another domain, making information sharing and interoperability between heterogeneous systems difficult [72].

Recently, there has been strong interest in developing ontologies to deal with the above issues. The purpose of ontologies is to serve as a semantic, conceptual, hierarchical model representing a domain or a web page. Currently, ontologies are manually created, usually by a domain expert who identifies the key concepts in the domain and the interrelationships between them. There has been a substantial amount of research effort put toward the goal of automating this process.



Cluster 1 2 3 4 5
Input (1, 5, 16, 20, 27, 58) (1, 8, 46, 58) (8, 17, 18, 37, 46, 58, 75) (14, 17, 77) (17, 26, 41, 77)
Output (5, 16, 20, 27, 58, 81) None Found (8, 17, 18, 37, 46, 58, 75) (17, 77) (41)

Table 3: FINDIT misses many dimensions and an entire cluster at high dimensions with N = 100,000 and D = 100.

Cluster 1 2 3 4 5
Input (4, 6, 12, 14, 17) (1, 8, 9, 15, 18) (1, 7, 9, 18, 20) (1, 12, 15, 18, 19) (5, 14, 16, 18, 19)
Output (4, 6, 14, 17) (8, 9, 15, 18) (7, 9, 18, 20) (12, 15, 18, 19) (5, 14, 18, 19)
(1, 8, 9, 18)
(1, 8, 9, 15)

Table 4: MAFIA misses one dimension in four out of five clusters. All of the dimensions are uncovered for cluster number two, but it is split into three smaller clusters. N = 100,000 and D = 100.

In order to automate the process, the key concepts of a domain must be learned. Subspace clustering has two major strengths that will help it learn the concepts. First, the algorithm would find subspaces, or sets of keywords, that represent important concepts on a page. The subspace clustering algorithm would also scale very well with the high dimensionality of the data. If the web pages are presented in the form of a document-term matrix, where the instances correspond to the pages and the features correspond to the keywords in the page, the result of subspace clustering will be the identification of sets of keywords (subspaces) for a given group of pages. These keyword sets can be considered to be the main concepts connecting the corresponding groups. Ultimately, the clusters would represent a domain and their corresponding subspaces would indicate the key concepts of the domain. This information could be further utilized by training a classifier with domains and representative concepts. This classifier could then be used to classify or categorize previously unseen web pages. For example, suppose the subspace clustering algorithm identifies the domains Hospital, Music, and Sports, along with their corresponding concepts, from a list of web pages. This information can then be used to train a classifier that will be able to determine the category for newly encountered web pages.
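For illustration, a document-term matrix of the kind described above can be built directly from tokenized page text; the per-cluster keyword subspaces would then be read off from whichever dimensions a subspace clustering algorithm selects for each group of pages. The sketch below only constructs the matrix (using scikit-learn's CountVectorizer) and reports the dominant keywords of a given group of pages as a crude stand-in for those selected dimensions; the function names and the top-k rule are assumptions.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    def pages_to_matrix(page_texts):
        """Rows = pages (instances), columns = keywords (dimensions)."""
        vectorizer = CountVectorizer(stop_words="english")
        X = vectorizer.fit_transform(page_texts)   # sparse document-term matrix
        return X, vectorizer.get_feature_names_out()

    def dominant_keywords(X, terms, page_indices, top_k=5):
        """Crude proxy for a cluster's keyword subspace: the terms with the
        highest total counts over the pages in the group."""
        counts = np.asarray(X[page_indices].sum(axis=0)).ravel()
        top = counts.argsort()[::-1][:top_k]
        return [terms[i] for i in top if counts[i] > 0]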
6.3 DNA Microarray Analysis
DNA microarrays are an exciting new technology with the potential to increase our understanding of complex cellular mechanisms. Microarray datasets provide information on the expression levels of thousands of genes under hundreds of conditions. For example, we can interpret a lymphoma dataset as 100 cancer profiles with 4000 features, where each feature is the expression level of a specific gene. This view allows us to uncover various cancer subtypes based upon relationships between gene expression profiles. Understanding the differences between cancer subtypes on a genetic level is crucial to understanding which types of treatments are most likely to be effective. Alternatively, we can view the data as 4000 gene profiles with 100 features corresponding to particular cancer specimens. Patterns in the data reveal information about genes whose products function together in pathways that perform complex functions in the organism. The study of these pathways and their relationships to one another can then be used to build a complete model of the cell and its functions, bridging the gap between genetic maps and living organisms [48].

Currently, microarray data must be preprocessed to reduce the number of attributes before meaningful clusters can be uncovered. In addition, individual gene products have many different roles under different circumstances. For example, a particular cancer may be subdivided along more than one set of characteristics. There may be subtypes based on the motility of the cell as well as the cell's propensity to divide. Separating the specimens based on motility would require examining one set of genes, while subtypes based on cell division would be discovered when looking at a different set of genes [48; 66]. In order to find such complex relationships in massive microarray datasets, more powerful and flexible methods of analysis are required. Subspace clustering is a promising technique that extends the power of traditional feature selection by searching for unique subspaces for each cluster.
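The two complementary views described above differ only in which axis of the expression matrix is treated as the set of instances. Assuming the data is held as a samples-by-genes array (the dimensions below are placeholders), switching views is just a transpose:

    import numpy as np

    # Hypothetical expression matrix: 100 cancer specimens x 4000 genes
    expression = np.random.rand(100, 4000)

    # View 1: instances are cancer profiles, dimensions are genes.
    # Subspace clustering here groups specimens into subtypes, each
    # characterized by a small set of genes.
    samples_view = expression

    # View 2: instances are gene profiles, dimensions are specimens.
    # Clustering here groups genes that behave similarly across a subset
    # of specimens, hinting at shared pathways.
    genes_view = expression.T

    print(samples_view.shape, genes_view.shape)   # (100, 4000) (4000, 100)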
7. CONCLUSION
High dimensional data is increasingly common in many fields. As the number of dimensions increases, many clustering techniques begin to suffer from the curse of dimensionality, degrading the quality of the results. In high dimensions, data becomes very sparse and distance measures become increasingly meaningless. This problem has been studied extensively and there are various solutions, each appropriate for different types of high dimensional data and data mining procedures.

Subspace clustering attempts to integrate feature evaluation and clustering in order to find clusters in different subspaces. Top-down algorithms simulate this integration by using multiple iterations of evaluation, selection, and clustering. This process selects a group of instances first and then evaluates the attributes in the context of that cluster of instances. This relatively slow approach, combined with the fact that many are forced to use sampling techniques, makes top-down algorithms more suitable for datasets with large clusters in relatively large subspaces. The clusters uncovered by top-down methods are often hyper-spherical in nature due to the use of cluster centers to represent groups of similar instances. The clusters form non-overlapping partitions of the dataset. Some algorithms allow for an additional group of outliers that contains instances not related to any cluster or to each other (other than the fact that they are all outliers). Also, many require that the number of clusters and the size of the subspaces be input as parameters. The user must use their domain knowledge to help select and tune these settings.
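A single pass of this top-down style can be caricatured as follows: assign instances to the current cluster centers, then re-estimate, for each cluster, which dimensions are most relevant (here, simply the ones with the smallest spread around the center). This is a schematic sketch of the general strategy only, not a reproduction of any particular published algorithm, and the tightest-variance selection rule is an illustrative assumption.

    import numpy as np

    def topdown_iteration(X, centers, dims_per_cluster):
        """One evaluate/select/cluster pass: assign points to centers using
        all dimensions, then keep the lowest-variance dimensions per cluster."""
        # assignment step (full-dimensional Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        subspaces, new_centers = [], []
        for c in range(len(centers)):
            members = X[labels == c]
            if len(members) == 0:
                subspaces.append([])
                new_centers.append(centers[c])
                continue
            spread = members.var(axis=0)
            # dimensions where this cluster is tightest become its subspace
            subspaces.append(sorted(np.argsort(spread)[:dims_per_cluster].tolist()))
            new_centers.append(members.mean(axis=0))
        return labels, np.array(new_centers), subspaces

Repeating such passes, often on a sample of the data, is what makes the approach comparatively slow and dependent on the number of clusters and subspace sizes supplied by the user.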
Bottom-up algorithms integrate the clustering and subspace



selection by first selecting a subspace, then evaluating the instances in that context. Adding one dimension at a time, they are able to work in relatively small subspaces. This allows these algorithms to scale much more easily with both the number of instances in the dataset and the number of attributes. However, performance drops quickly with the size of the subspaces in which the clusters are found. The main parameter required by these algorithms is the density threshold. This can be difficult to set, especially across all dimensions of the dataset. Fortunately, even if some dimensions are mistakenly ignored due to improper thresholds, the algorithms may still find the clusters in a smaller subspace. Adaptive grid approaches help to alleviate this problem by allowing the number of bins in a dimension to change based on the characteristics of the data in that dimension. Often, bottom-up algorithms are able to find clusters of various shapes and sizes, since the clusters are formed from various cells in a grid. This means that the clusters can overlap each other, with one instance having the potential to be in more than one cluster. It is also possible for an instance to be considered an outlier and not belong to any cluster.
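The first level of this bottom-up search can be sketched as a simple grid pass: divide each dimension into equal-width bins, count how many instances fall in each one-dimensional bin, and keep only the bins whose density exceeds the threshold; candidate two-dimensional units are then built only from dimensions that retained dense bins. The code below is a schematic of that first pass under assumed bin counts and thresholds, not a faithful reproduction of CLIQUE or MAFIA.

    import numpy as np

    def dense_units_1d(X, bins=10, density_threshold=0.05):
        """Return, per dimension, the bin indices whose fraction of instances
        exceeds the density threshold (first level of a bottom-up search)."""
        n, d = X.shape
        dense = {}
        for dim in range(d):
            col = X[:, dim]
            edges = np.linspace(col.min(), col.max(), bins + 1)
            counts, _ = np.histogram(col, bins=edges)
            keep = np.where(counts / n > density_threshold)[0]
            if keep.size:
                dense[dim] = keep.tolist()
        return dense

    # Only dimensions present in `dense` would be combined into 2-d candidate
    # units; an adaptive-grid variant would instead size the bins per dimension.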
Clustering is a powerful data exploration tool capable of uncovering previously unknown patterns in data. Often, users have little knowledge of the data prior to clustering analysis and are seeking to find some interesting relationships to explore further. Unfortunately, all clustering algorithms require that the user set some parameters and make some assumptions about the clusters to be discovered. Subspace clustering algorithms allow users to break the assumption that all of the clusters in a dataset are found in the same set of dimensions.

There are many potential applications with high dimensional data where subspace clustering approaches could help to uncover patterns missed by current clustering approaches. Applications in bioinformatics and text mining are particularly relevant and present unique challenges to subspace clustering. As with any clustering technique, finding meaningful and useful results depends on the selection of the appropriate technique and proper tuning of the algorithm via the input parameters. In order to do this, one must understand the dataset in a domain specific context in order to be able to best evaluate the results from various approaches. One must also understand the various strengths, weaknesses, and biases of the potential clustering algorithms.

8. REFERENCES

[1] D. Achlioptas. Database-friendly random projections. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 274–281. ACM Press, 2001.

[2] C. C. Aggarwal. Re-designing distance functions and distance-based applications for high dimensional data. ACM SIGMOD Record, 30(1):13–18, 2001.

[3] C. C. Aggarwal. Towards meaningful high-dimensional nearest neighbor search by human-computer interaction. In Data Engineering, 2002. Proceedings. 18th International Conference on, pages 593–604, 2002.

[4] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional space. In Database Theory, Proceedings of 8th International Conference on, pages 420–434, 2001.

[5] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park. Fast algorithms for projected clustering. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 61–72. ACM Press, 1999.

[6] C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional spaces. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 70–81. ACM Press, 2000.

[7] C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pages 37–46. ACM Press, 2001.

[8] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pages 94–105. ACM Press, 1998.

[9] N. Alon, S. Dar, M. Parnas, and D. Ron. Testing of clustering. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 240–250, 2000.

[10] D. Barbará, Y. Li, and J. Couto. Coolcat: an entropy-based algorithm for categorical clustering. In Proceedings of the eleventh international conference on Information and knowledge management, pages 582–589. ACM Press, 2002.

[11] P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.

[12] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In Knowledge Discovery and Data Mining, pages 245–250, 2001.

[13] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97:245–271, 1997.

[14] A. Blum and R. Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5:117–127, 1992.

[15] Y. Cao and J. Wu. Projective ART for clustering data sets in high dimensional spaces. Neural Networks, 15(1):105–120, 2002.

[16] J.-W. Chang and D.-S. Jin. A new cell-based clustering method for large, high-dimensional data in data mining applications. In Proceedings of the 2002 ACM symposium on Applied computing, pages 503–507. ACM Press, 2002.

[17] C.-H. Cheng, A. W. Fu, and Y. Zhang. Entropy-based subspace clustering for mining numerical data. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 84–93. ACM Press, 1999.



[18] G. M. D. Corso. Estimating an eigenvector by the power method with a random start. SIAM Journal on Matrix Analysis and Applications, 18(4):913–937, October 1997.

[19] M. Dash, K. Choi, P. Scheuermann, and H. Liu. Feature selection for clustering - a filter solution. In Data Mining, 2002. Proceedings. 2002 IEEE International Conference on, pages 115–122, 2002.

[20] M. Dash and H. Liu. Feature selection for clustering. In Proceedings of the Fourth Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD-2000), Kyoto, Japan, pages 110–121. Springer-Verlag, 2000.

[21] M. Dash, H. Liu, and X. Xu. '1 + 1 > 2': merging distance and density based clustering. In Database Systems for Advanced Applications, 2001. Proceedings. Seventh International Conference on, pages 32–39, 2001.

[22] M. Dash, H. Liu, and J. Yao. Dimensionality reduction of unsupervised data. In Tools with Artificial Intelligence, 1997. Proceedings., Ninth IEEE International Conference on, pages 532–539, November 1997.

[23] M. Devaney and A. Ram. Efficient feature selection in conceptual clustering. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 92–97, 1997.

[24] C. Ding, X. He, H. Zha, and H. D. Simon. Adaptive dimension reduction for clustering high dimensional data. In Data Mining, 2002. Proceedings., Second IEEE International Conference on, pages 147–154, December 2002.

[25] J. G. Dy and C. E. Brodley. Feature subset selection and order identification for unsupervised learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 247–254, 2000.

[26] S. Epter, M. Krishnamoorthy, and M. Zaki. Clusterability detection and initial seed selection in large datasets. Technical Report 99-6, Rensselaer Polytechnic Institute, Computer Science Dept., Troy, NY 12180, 1999.

[27] L. Ertöz, M. Steinbach, and V. Kumar. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In Proceedings of the 2003 SIAM International Conference on Data Mining, 2003.

[28] D. Fasulo. An analysis of recent work on clustering algorithms. Technical report, University of Washington, 1999.

[29] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Machine Learning, Proceedings of the International Conference on, 2003.

[30] D. Florescu, A. Y. Levy, and A. O. Mendelzon. Database techniques for the world-wide web: A survey. SIGMOD Record, 27(3):59–74, 1998.

[31] D. Fradkin and D. Madigan. Experiments with random projections for machine learning. In Proceedings of the 2003 ACM KDD. ACM Press, 2002.

[32] J. H. Friedman and J. J. Meulman. Clustering objects on subsets of attributes. http://citeseer.nj.nec.com/friedman02clustering.html, 2002.

[33] G. Gan. Subspace clustering for high dimensional categorical data. http://www.math.yorku.ca/~gjgan/talk.pdf, May 2003. Talk given at SOSGSSD (Southern Ontario Statistics Graduate Student Seminar Days).

[34] J. Ghosh. Handbook of Data Mining, chapter Scalable Clustering Methods for Data Mining. Lawrence Erlbaum Assoc, 2003.

[35] S. Goil, H. Nagesh, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report CPDC-TR-9906-010, Northwestern University, 2145 Sheridan Road, Evanston IL 60208, June 1999.

[36] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Cluster validity methods: part i. ACM SIGMOD Record, 31(2):40–45, 2002.

[37] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. Clustering validity checking methods: part ii. ACM SIGMOD Record, 31(3):19–27, 2002.

[38] G. Hamerly and C. Elkan. Learning the k in k-means. To be published; obtained by Dr. Liu at the ACM SIGKDD 2003 conference, 2003.

[39] J. Han, M. Kamber, and A. K. H. Tung. Geographic Data Mining and Knowledge Discovery, chapter Spatial clustering methods in data mining: A survey, pages 188–217. Taylor and Francis, 2001.

[40] A. Hinneburg, D. Keim, and M. Wawryniuk. Using projections to visually cluster high-dimensional data. Computing in Science and Engineering, IEEE Journal of, 5(2):14–25, March/April 2003.

[41] A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Very Large Data Bases, Proceedings of the 25th International Conference on, pages 506–517, Edinburgh, 1999.

[42] I. Inza, P. Larrañaga, and B. Sierra. Feature weighting for nearest neighbor by estimation of bayesian networks algorithms. Technical Report EHU-KZAA-IK-3/00, University of the Basque Country, PO Box 649, E-20080 San Sebastian, Basque Country, Spain, 2000.

[43] A. K. Jain and R. C. Dubes. Algorithms for clustering data. Prentice-Hall, Inc., 1988.

[44] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.

[45] J. Han and M. Kamber. Data Mining: Concepts and Techniques, chapter 8, pages 335–393. Morgan Kaufmann Publishers, 2001.



[46] S. Kaski and T. Kohonen. Exploratory data analysis by the self-organizing map: Structures of welfare and poverty in the world. In A.-P. N. Refenes, Y. Abu-Mostafa, J. Moody, and A. Weigend, editors, Neural Networks in Financial Engineering. Proceedings of the Third International Conference on Neural Networks in the Capital Markets, London, England, 11-13 October, 1995, pages 498–507. World Scientific, Singapore, 1996.

[47] Y. Kim, W. Street, and F. Menczer. Feature selection for unsupervised learning via evolutionary search. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 365–369, 2000.

[48] I. S. Kohane, A. Kho, and A. J. Butte. Microarrays for an Integrative Genomics. MIT Press, 2002.

[49] R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

[50] E. Kolatch. Clustering algorithms for spatial databases: A survey, 2001.

[51] T. Li, S. Zhu, and M. Ogihara. Algorithms for clustering high dimensional and distributed data. Intelligent Data Analysis Journal, 7(4):?–?, 2003.

[52] J. Lin and D. Gunopulos. Dimensionality reduction by random projection and latent semantic indexing. In Proceedings of the Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, May 2003.

[53] B. Liu, Y. Xia, and P. S. Yu. Clustering through decision tree construction. In Proceedings of the ninth international conference on Information and knowledge management, pages 20–29. ACM Press, 2000.

[54] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery & Data Mining. Boston: Kluwer Academic Publishers, 1998.

[55] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 169–178. ACM Press, 2000.

[56] B. L. Milenova and M. M. Campos. O-cluster: scalable clustering of large high dimensional data sets. In Data Mining, Proceedings from the IEEE International Conference on, pages 290–297, 2002.

[57] P. Mitra, C. A. Murthy, and S. K. Pal. Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):301–312, 2002.

[58] R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th VLDB Conference, pages 144–155, 1994.

[59] E. Ng Ka Ka and A. W. chee Fu. Efficient algorithm for projected clustering. In Data Engineering, 2002. Proceedings. 18th International Conference on, pages 273–, 2002.

[60] Z. Nie and S. Kambhampati. Frequency-based coverage statistics mining for data integration. In Proceedings of the International Conference on Data Engineering 2004, 2003.

[61] A. Patrikainen. Projected clustering of high-dimensional binary data. Master's thesis, Helsinki University of Technology, October 2002.

[62] D. Pelleg and A. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 727–734, San Francisco, 2000. Morgan Kaufmann.

[63] J. M. Peña, J. A. Lozano, P. Larrañaga, and I. Inza. Dimensionality reduction in unsupervised learning of conditional gaussian networks. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(6):590–603, June 2001.

[64] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. A monte carlo algorithm for fast projective clustering. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pages 418–427. ACM Press, 2002.

[65] B. Raskutti and C. Leckie. An evaluation of criteria for measuring the quality of clusters. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 99, pages 905–910, Stockholm, Sweden, July 1999. Morgan Kaufmann.

[66] S. Raychaudhuri, P. D. Sutphin, J. T. Chang, and R. B. Altman. Basic microarray analysis: grouping and feature reduction. Trends in Biotechnology, 19(5):189–193, 2001.

[67] S. M. Rüger and S. E. Gauch. Feature reduction for document clustering and classification. Technical report, Computing Department, Imperial College, London, UK, 2000.

[68] U. Shardanand and P. Maes. Social information filtering: algorithms for automating word of mouth. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 210–217. ACM Press/Addison-Wesley Publishing Co., 1995.

[69] L. Talavera. Dependency-based feature selection for clustering symbolic data, 2000.

[70] L. Talavera. Dynamic feature selection in incremental hierarchical clustering. In Machine Learning: ECML 2000, 11th European Conference on Machine Learning, Barcelona, Catalonia, Spain, May 31 - June 2, 2000, Proceedings, volume 1810, pages 392–403. Springer, Berlin, 2000.

[71] C. Tang and A. Zhang. An iterative strategy for pattern discovery in high-dimensional data sets. In Proceedings of the eleventh international conference on Information and knowledge management, pages 10–17. ACM Press, 2002.



[72] H. Wache, T. Vögele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, and S. Hübner. Ontology-based integration of information - a survey of existing approaches. In Stuckenschmidt, H., editor, IJCAI-01 Workshop, pages 108–117, 2001.

[73] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, chapter 6.6, pages 210–228. Morgan Kaufmann, 2000.

[74] K.-G. Woo and J.-H. Lee. FINDIT: a Fast and Intelligent Subspace Clustering Algorithm using Dimension Voting. PhD thesis, Korea Advanced Institute of Science and Technology, Taejon, Korea, 2002.

[75] J. Yang, W. Wang, H. Wang, and P. Yu. δ-clusters: capturing subspace correlation in a large data set. In Data Engineering, 2002. Proceedings. 18th International Conference on, pages 517–528, 2002.

[76] L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. In Proceedings of the twentieth International Conference on Machine Learning, pages 856–863, 2003.

[77] M. Zait and H. Messatfa. A comparative study of clustering methods. Future Generation Computer Systems, 13(2-3):149–159, November 1997.

APPENDIX
A. DEFINITIONS AND NOTATION

Dataset - A dataset consists of a matrix of data values. Rows represent individual instances and columns represent dimensions.

Instance - An instance refers to an individual row in a dataset. It typically refers to a vector of d measurements. Other common terms include observation, datum, feature vector, or pattern. N is defined as the number of instances in a given dataset.

Dimension - A dimension refers to an attribute or feature of the dataset. d is defined as the number of dimensions in a dataset.

Cluster - A group of instances in a dataset that are more similar to each other than to other instances. Often, similarity is measured using a distance metric over some or all of the dimensions in the dataset. k is defined as the number of clusters in a dataset.

Clustering - The process of discovering groups (clusters) of instances such that instances in one cluster are more similar to each other than to instances in another cluster. The measures used to determine similarity can vary a great deal, leading to a wide variety of definitions and approaches.

Subspace - A subspace is a subset of the d dimensions of a given dataset.

Clusterability - A measure of how well a given set of instances can be clustered. Intuitively, this is a measure of how "tight" the results of clustering are. Having tight clustering means that intra-cluster distances tend to be much smaller than inter-cluster distances.

Subspace Clustering - These clustering techniques seek to find clusters in a dataset by selecting the most relevant dimensions for each cluster separately. There are two main approaches: the top-down iterative approach and the bottom-up grid approach. Subspace clustering is sometimes referred to as projected clustering.

Measure of Locality - The measure used to determine a context of instances and/or features in which to apply a quality measure. Top-down subspace clustering methods use either an approximate clustering or the nearest neighbors of an instance. Bottom-up approaches use the cells of a grid created by defining divisions within each dimension.

Feature Transformation - Techniques that generate new features by creating combinations of the original features. Often, a small number of these new features are used to summarize the dataset in fewer dimensions.

Feature Selection - The process of determining and selecting the dimensions (features) that are most relevant to the data mining task. This technique is also known as attribute selection or relevance analysis.
