Beruflich Dokumente
Kultur Dokumente
i i
In full dimensional algorithms, one of the techniques fM is the set of medoids in current iteration g
current
for nding a piercing set of medoids is based on a greedy f M is the best set of medoids found so far g
f N is the nal set of medoids with associated dimensions g
best
method. In this process medoids are picked iteratively, f A; B are constant integers g
so that the current medoid is well separated from the begin
medoids that have been chosen so far. The greedy
technique has been proposed in [14] and is illustrated f1. Initialization Phaseg
S = random sample of size A k
in Figure 3. In full dimensionality, if there are no M = Greedy(S , B k)
outliers and if the clusters are well enough separated,
this method always returns a piercing set of medoids. f2. Iterative Phaseg
However, it does not guarantee a piercing set for the BestObjective = 1
M = Random set of medoids fm1 ; m2 ; : :: m g M
projected clustering problem. repeat
current k
In our algorithm we will use the greedy technique f Approximate the optimal set of dimensions g
in order to nd a good superset of a piercing set of for each medoid m 2 M do
begin
i current
medoids. In other words, if we wish to nd k clusters Let be distance to nearest medoid from m
in the data, we will pick a set of points of cardinality a i
L = Points in sphere centered at m with radius
i
In the rst step, we choose a random sample of data f Form the clusters g
k
even smaller nal set of points on which to apply the Compute the bad medoids in M
end
best
hill climbing technique during the next phase. In our Compute M by replacing the bad medoids in
algorithm, the nal set of points has size B k; where current
M with random points from M
B is a small constant. We shall denote this set by M. until (termination criterion)
best
The reasons for choosing this two-step method are as f3. Renement Phaseg
follows: L = fC1 ; :: : ; C g
(D1 ; D2 ;: : : D ) = FindDimensions(k;l; L)
k
(1) The greedy technique tends to pick many outliers (C1 ;: : :; C ) = AssignPoints(D1; :: : D )
k
dist(x) = minfdist(x);d(x; m )g
distance associated with the medoid mi is related to the
i
end
i
(which is the most \central") is likely to form a cluster for each medoid i do
i
Y =
Xi;j
rP
i
D =; i
d
pass over the data to improve the quality of the cluster- for each dimension j do Z = (X , Y )=
end
i;j i;j i i
ing. Let fC1; : : :; Ckg be the clusters corresponding to Pick the k l numbers with the least (most negative) values
these medoids, formed during the Iterative Phase. We of Z subject to the constraint that there are at least 2
i;j
discard the dimensions associated with each medoid and dimensions for each cluster
if Z is picked then add dimension j to D
compute new ones by a procedure similar to that in the return(D1; D2 ; :: : D )
i;j i
Outliers are also handled during this last pass over for each i 2 f1;: :: ; kg do C =
for each data point p do
i
the set of dimensions Di : Find i with lowest value of dD (p;m ) and add p to C ;
i i
end;
i i
i
We also refer to i as the sphere of in
uence of return (C1 ;: : : ; C )
end;
k
of the Initialization Phase, we gave some insight into Y = Average distance of points in C to
why we expect the set M to contain a piercing set of
i;j i
centroid of C along dimension j
endP
i
Table 1: PROCLUS: Dimensions of the Input Clusters Table 2: PROCLUS: Dimensions of the Input Clusters
(Top) and Output Clusters (Bottom) for Case 1 (Top) and Output Clusters (Bottom) for Case 2
Input A B C D E Out.
Output
1 0 20992 267 416 18 358
2 34 0 0 0 16097 669
3 0 9 1 15309 10 58
4 0 2256 16536 0 0 178
5 21357 0 0 2 10 129
Outliers 0 21 1441 1 222 3609
Table 4: PROCLUS: Confusion Matrix (dierent number of dimensions) for Case 2
Output
10
PROCLUS
CLIQUE
2 11128 0 0 0 0 0
15 19510 0 0 0 0 0
31 0 0 0 101 0 0
2500
of high dimensional data spaces. This is a generaliza-
tion of feature selection, in that it allows the selection of
2000
dierent sets of dimensions for dierent subsets of the
Running time in seconds
210
choice.
6 Acknowledgements
200
190
Running time in seconds