Hierarchical clustering is one of the most frequently used methods in unsupervised learning. Given a set of data points, the output is a binary tree (dendrogram) whose leaves are the data points and whose internal nodes represent nested clusters of various sizes. The tree organizes these clusters hierarchically, where the hope is that this hierarchy agrees with the intuitive organization of real-world data.

Our Bayesian hierarchical clustering algorithm uses marginal likelihoods to decide which clusters to merge and to avoid overfitting. In essence, it asks what the probability is that all the data in a potential merge were generated from the same mixture component, and compares this to exponentially many hypotheses at lower levels of the tree (section 2).

The generative model for our algorithm is a Dirichlet process mixture model (i.e. a countably infinite mixture model), and the algorithm can be viewed as a fast bottom-up agglomerative way of performing approximate inference in a DPM. Instead of giving weight to all possible partitions of the data into clusters, which is intractable and would require the use of sampling methods, the algorithm efficiently computes the weight of exponentially many partitions which are consistent with the tree structure (section 3).

Figure 1. (a) Schematic of a portion of a tree where Ti and Tj are merged into Tk, and the associated data sets Di and Dj are merged into Dk. (b) An example tree with 4 data points. The clusterings (1 2 3)(4) and (1 2)(3)(4) are tree-consistent partitions of this data; the clustering (1)(2 3)(4) is not a tree-consistent partition.

2. Algorithm

Our Bayesian hierarchical clustering algorithm is similar to traditional agglomerative clustering in that it is a one-pass, bottom-up method which initializes each data point in its own cluster and iteratively merges pairs of clusters. As we will see, the main difference is that our algorithm uses a statistical hypothesis test to choose which clusters to merge.

Let D = {x^(1), ..., x^(n)} denote the entire data set, and Di ⊂ D the set of data points at the leaves of the subtree Ti. The algorithm is initialized with n trivial trees, {Ti : i = 1...n}, each containing a single data point Di = {x^(i)}. At each stage the algorithm considers merging all pairs of existing trees. For example, if Ti and Tj are merged into some new tree Tk, then the associated set of data is Dk = Di ∪ Dj (see figure 1(a)).
In considering each merge, two hypotheses are compared. The first hypothesis, which we will denote H1k, is that all the data in Dk were in fact generated independently and identically from the same probabilistic model, p(x|θ), with unknown parameters θ. Let us imagine that this probabilistic model is a multivariate Gaussian, with parameters θ = (µ, Σ), although it is crucial to emphasize that for different types of data, different probabilistic models may be appropriate. To evaluate the probability of the data under this hypothesis we need to specify some prior over the parameters of the model, p(θ|β), with hyperparameters β. We now have the ingredients to compute the probability of the data Dk under H1k:

    p(D_k|H_1^k) = \int p(D_k|\theta)\, p(\theta|\beta)\, d\theta = \int \Big[ \prod_{x^{(i)} \in D_k} p(x^{(i)}|\theta) \Big] p(\theta|\beta)\, d\theta    (1)

This calculates the probability that all the data in Dk were generated from the same parameter values, assuming a model of the form p(x|θ). This is a natural model-based criterion for measuring how well the data fit into one cluster. If we choose models with conjugate priors (e.g. Normal-Inverse-Wishart priors for Normal continuous data or Dirichlet priors for Multinomial discrete data) this integral is tractable. Throughout this paper we use such conjugate priors so the integrals are simple functions of sufficient statistics of Dk. For example, in the case of Gaussians, (1) is a function of the sample mean and covariance of the data in Dk.
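To make equation (1) concrete, here is a minimal sketch for the binary (Bernoulli) case used later in the paper, with an independent Beta(a, b) prior on each dimension's Bernoulli parameter. The hyperparameters a and b stand in for β and are our illustrative choice, not values from the paper; the integral collapses to a ratio of Beta functions of the sufficient statistics (per-dimension counts of ones).

```python
import numpy as np
from scipy.special import betaln

def log_marginal_likelihood_bernoulli(X, a=1.0, b=1.0):
    """Log of equation (1) for binary data under independent
    Beta(a, b)-Bernoulli models in each dimension.

    X: (n, d) array of 0/1 values -- the data D_k in one cluster.
    The integral reduces to a function of the sufficient
    statistics: m_d, the number of ones in dimension d."""
    X = np.asarray(X)
    n, _ = X.shape
    m = X.sum(axis=0)
    # Per-dimension Beta-Bernoulli evidence: B(a+m, b+n-m) / B(a, b)
    return float(np.sum(betaln(a + m, b + n - m) - betaln(a, b)))
```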
The alternative hypothesis to H1k would be that the data in Dk has two or more clusters in it. Summing over the exponentially many possible ways of dividing Dk into two or more clusters is intractable. However, if we restrict ourselves to clusterings that partition the data in a manner that is consistent with the subtrees Ti and Tj, we can compute the sum efficiently using recursion. (We elaborate on the notion of tree-consistent partitions in section 3 and figure 1(b).) The probability of the data under this restricted alternative hypothesis, H2k, is simply a product over the subtrees, p(Dk|H2k) = p(Di|Ti) p(Dj|Tj), where the probability of a data set under a tree (e.g. p(Di|Ti)) is defined below.

Combining the probability of the data under hypotheses H1k and H2k, weighted by the prior that all points in Dk belong to one cluster, πk ≡ p(H1k), we obtain the marginal probability of the data in tree Tk:

    p(D_k|T_k) = \pi_k\, p(D_k|H_1^k) + (1 - \pi_k)\, p(D_i|T_i)\, p(D_j|T_j)    (2)

This equation is defined recursively: the first term considers the hypothesis that there is a single cluster in Dk, and the second term efficiently sums over all other clusterings of the data in Dk which are consistent with the tree structure (see figure 1(a)). In section 3 we show that equation 2 can be used to derive an approximation to the marginal likelihood of a Dirichlet process mixture model, and in fact provides a new lower bound on this marginal likelihood.¹ We also show that the prior for the merged hypothesis, πk, can be computed bottom-up in a DPM.

The posterior probability of the merged hypothesis, rk ≡ p(H1k|Dk), is obtained using Bayes rule:

    r_k = \frac{\pi_k\, p(D_k|H_1^k)}{\pi_k\, p(D_k|H_1^k) + (1 - \pi_k)\, p(D_i|T_i)\, p(D_j|T_j)}    (3)

This quantity is used to decide greedily which two trees to merge, and is also used to determine which merges in the final hierarchy structure were justified. The general algorithm is very simple (see figure 2).
input: data D = {x^(1), ..., x^(n)}, model p(x|θ), prior p(θ|β)
initialize: number of clusters c = n, and Di = {x^(i)} for i = 1...n
while c > 1 do
    Find the pair Di and Dj with the highest probability of the merged hypothesis, rk (equation 3)
    Merge Dk ← Di ∪ Dj, Tk ← (Ti, Tj)
    Delete Di and Dj; c ← c − 1
end while

Figure 2. The Bayesian hierarchical clustering algorithm.
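As a minimal illustration of how equations (2) and (3) drive the greedy loop of figure 2, here is a naive sketch, not the paper's implementation. All function and field names are ours: `log_ml` is the conjugate evidence of equation (1) (e.g. the Beta-Bernoulli function above), and `log_pi` returns log πk for a candidate merge; a DPM-based choice for it is sketched in section 3.

```python
import numpy as np
from itertools import combinations

def logsumexp2(a, b):
    """Stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def bhc(X, log_ml, log_pi):
    """Greedy BHC loop (figure 2), computed in log space.
    X: (n, d) data array.
    log_ml(D): log p(D|H1) from equation (1).
    log_pi(Ti, Tj): log pi_k for merging subtrees Ti and Tj."""
    trees = [{'data': X[i:i + 1], 'log_pT': log_ml(X[i:i + 1]),
              'children': None} for i in range(len(X))]
    while len(trees) > 1:
        best = None
        for i, j in combinations(range(len(trees)), 2):
            Ti, Tj = trees[i], trees[j]
            Dk = np.vstack([Ti['data'], Tj['data']])
            lpi = log_pi(Ti, Tj)
            log_h1 = lpi + log_ml(Dk)                 # log pi_k p(Dk|H1k)
            log_h2 = np.log1p(-np.exp(lpi)) + Ti['log_pT'] + Tj['log_pT']
            log_pT = logsumexp2(log_h1, log_h2)       # equation (2)
            log_r = log_h1 - log_pT                   # equation (3)
            if best is None or log_r > best[0]:
                best = (log_r, i, j, Dk, log_pT)
        log_r, i, j, Dk, log_pT = best
        merged = {'data': Dk, 'log_pT': log_pT, 'log_r': log_r,
                  'children': (trees[i], trees[j])}
        trees = [t for k, t in enumerate(trees) if k not in (i, j)]
        trees.append(merged)
    return trees[0]  # root of the dendrogram
```

The quadratic pass over candidate pairs at each step gives the overall cost noted in the discussion; a practical implementation would cache merge scores rather than recompute them.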
3. Approximate Inference in a DPM

Allowing infinitely many components makes it possible to more realistically model the kinds of complicated distributions which we expect in real problems. We briefly review DPMs here, starting from finite mixture models.

Consider a finite mixture model with C components,

    p(x^{(i)}|\phi) = \sum_{j=1}^{C} p(x^{(i)}|\theta_j)\, p(c_i = j|\mathbf{p})    (4)

where ci ∈ {1, ..., C} is a cluster indicator variable for data point i, p are the parameters of a multinomial distribution with p(ci = j|p) = pj, θj are the parameters of the jth component, and φ = (θ1, ..., θC, p). Let the parameters of each component have conjugate priors p(θ|β) as in section 2, and the multinomial parameters also have a conjugate Dirichlet prior,

    p(\mathbf{p}|\alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/C)^C} \prod_{j=1}^{C} p_j^{\alpha/C - 1}.    (5)

Integrating out the mixing proportions p yields a distribution over the indicator variables that depends only on the counts of points assigned to each component. From this we can compute the conditional probability of a single indicator:

    p(c_i = j|\mathbf{c}_{-i}, \alpha) = \frac{n_{-i,j} + \alpha/C}{n - 1 + \alpha}

where n_{-i,j} is the number of data points, excluding point i, assigned to class j. Finally, taking the infinite limit (C → ∞), we obtain

    p(c_i = j|\mathbf{c}_{-i}, \alpha) = \frac{n_{-i,j}}{n - 1 + \alpha}, \qquad p(c_i \neq c_{i'}\ \forall\, i' \neq i \mid \mathbf{c}_{-i}, \alpha) = \frac{\alpha}{n - 1 + \alpha},

i.e. the probabilities of joining an existing component or starting a new one.

Lemma 1. The marginal likelihood of a DPM is

    p(D_k) = \sum_{v \in V} \frac{\alpha^{m_v} \prod_{\ell=1}^{m_v} \Gamma(n_\ell^v)}{\Gamma(n_k + \alpha)/\Gamma(\alpha)} \prod_{\ell=1}^{m_v} p(D_\ell^v)

where V is the set of all possible partitionings of Dk, mv is the number of clusters in partitioning v, and n_ℓ^v is the number of points in cluster ℓ of partitioning v.

The corresponding sum restricted to tree-consistent partitionings can be computed recursively:

    p(D_i|T_i) = \sum_{v' \in V_{T_i}} \frac{\alpha^{m_{v'}} \prod_{\ell'=1}^{m_{v'}} \Gamma(n_{\ell'}^{v'})}{d_i} \prod_{\ell'=1}^{m_{v'}} p(D_{\ell'}^{v'})

and similarly for p(Dj|Tj), where V_{Ti} (V_{Tj}) is the set of all tree-consistent partitionings of Di (Dj). Combining terms we obtain

    p(D_i|T_i)\, p(D_j|T_j) = \frac{1}{d_i d_j} \sum_{v'} \alpha^{m_{v'}} \Big[ \prod_{\ell'=1}^{m_{v'}} \Gamma(n_{\ell'}^{v'}) \Big] \prod_{\ell'=1}^{m_{v'}} p(D_{\ell'}^{v'})

where the sum ranges over tree-consistent partitionings of Dk that subdivide both Di and Dj, so that p(Dk|Tk) in equation 2 takes the same form, with normalizer dk.

Figure 3. Algorithm for computing the prior on merging, where right_k (left_k) indexes the right (left) subtree of Tk and d_{right_k} (d_{left_k}) is the value of d computed for the right (left) child of internal node k.
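The box of figure 3 did not survive extraction; the following is a minimal sketch under our reading of its caption and of the normalizers above, namely that each leaf is initialized with d_i = α and π_i = 1, and each internal node k uses d_k = αΓ(n_k) + d_{left_k} d_{right_k} and π_k = αΓ(n_k)/d_k. A `log_pi` for the `bhc()` sketch above would carry these fields on each node and return the `log_pi` entry of `merge_prior`.

```python
import numpy as np
from scipy.special import gammaln

def init_leaf(alpha):
    """Leaf k holds one point: n_k = 1, d_k = alpha, pi_k = 1."""
    return {'n': 1, 'log_d': np.log(alpha), 'log_pi': 0.0}

def merge_prior(left, right, alpha):
    """Quantities for internal node k merging `left` and `right`:
        d_k  = alpha * Gamma(n_k) + d_left * d_right
        pi_k = alpha * Gamma(n_k) / d_k
    computed in log space for numerical stability."""
    n_k = left['n'] + right['n']
    log_ag = np.log(alpha) + gammaln(n_k)      # log(alpha * Gamma(n_k))
    log_dd = left['log_d'] + right['log_d']    # log(d_left * d_right)
    m = max(log_ag, log_dd)
    log_d = m + np.log(np.exp(log_ag - m) + np.exp(log_dd - m))
    return {'n': n_k, 'log_d': log_d, 'log_pi': log_ag - log_d}
```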
[Figure: top levels of a newsgroup hierarchy, with clusters of sizes 205, 149, 284, and 162 labeled by characteristic words such as Play, Ball, Space, Baseball, NHL, Car, Vehicle, Team, Pitch, Hockey, Dealer, NASA, Game, Hit, Round, Drive, Orbit, and Driver; see the discussion of figure 5 in section 5.]
4. Learning and Prediction

The predictive distribution for a new data point x averages over the clusters in the tree,

    p(x|D) = \sum_{k \in N} \omega_k\, p(x|D_k)

where N is the set of all nodes in the tree, ωk ≡ rk ∏_{i∈Nk}(1 − ri) is the weight on cluster k, and Nk is the set of nodes on the path from the root node to the parent of node k. This expression can be derived by rearranging the sum over all tree-consistent partitionings into a sum over all clusters in the tree (noting that a cluster can appear in many partitionings). For Gaussian components with conjugate priors, this results in a predictive distribution which is a mixture of multivariate t distributions. We show some examples of this in the Results section.
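The weights ωk can be accumulated in a single pass from the root, multiplying in (1 − ri) along the path. The sketch below reuses the hypothetical node dicts from the `bhc()` sketch; treating r = 1 at leaf nodes is our assumption, since a leaf satisfies p(D|T) = p(D|H1).

```python
import numpy as np

def cluster_weights(root):
    """Compute omega_k = r_k * prod_{i in N_k} (1 - r_i) for every
    node k, where N_k is the path from the root to k's parent.
    Accumulates the product of (1 - r_i) on the way down."""
    weights = []
    def visit(node, path_prod):
        is_leaf = node['children'] is None
        r = 1.0 if is_leaf else float(np.exp(node['log_r']))
        weights.append((node, r * path_prod))          # omega_k
        if not is_leaf:
            for child in node['children']:
                visit(child, path_prod * (1.0 - r))    # extend N_k
    visit(root, 1.0)
    return weights
```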
5. Results

We compared Bayesian hierarchical clustering to traditional hierarchical clustering using average, single, and complete linkage over 5 datasets (4 real and 1 synthetic). We also compared our algorithm to average linkage hierarchical clustering on several toy 2D problems (figure 8). On these problems we were able to compare the different hierarchies generated by the two algorithms, and to visualize clusterings and predictive distributions.

The 4 real datasets we used are the spambase (100 random examples from each class, 2 classes, 57 attributes) and glass (214 examples, 7 classes, 9 attributes) datasets from the UCI repository, the CEDAR Buffalo digits (20 random examples from each class, 10 classes, 64 attributes), and the CMU 20Newsgroups dataset (120 examples, 4 classes: rec.sport.baseball, rec.sport.hockey, rec.autos, and sci.space; 500 attributes). We also used synthetic data generated from a mixture of Gaussians (200 examples, 4 classes, 2 attributes). The synthetic, glass, and toy datasets were modeled using Gaussians, while the digits, spambase, and newsgroup datasets were binarized and modeled using Bernoullis. We binarized the digits dataset by thresholding at a greyscale value of 128 out of 0 through 255, and the spambase dataset by whether each attribute value was zero or non-zero. We ran the algorithms on 3 digits (0, 2, 4) and on all 10 digits. The newsgroup dataset was constructed using Rainbow (McCallum, 1996), where a stop list was used and words appearing fewer than 5 times were ignored. The dataset was then binarized based on word presence/absence in a document.

For these classification datasets, where labels for the data points are known, we computed a measure between 0 and 1 of how well a dendrogram clusters the known labels, called the dendrogram purity.² We found that the marginal likelihood of the tree structure for the data was highly correlated with the purity. Over 50 iterations with different hyperparameters this correlation was 0.888 (see figure 4). Table 1 shows the results on these datasets. On all datasets except glass, BHC found the highest purity trees. For glass, the Gaussian assumption may have been poor. This highlights the importance of model choice. Similarly, for classical distance-based hierarchical clustering methods, a poor choice of distance metric may result in poor clusterings.

²Let T be a tree with leaves 1, ..., n and c1, ..., cn be the known discrete class labels for the data points at the leaves. Pick a leaf ℓ uniformly at random; pick another leaf j uniformly in the same class, i.e. cℓ = cj. Find the smallest subtree containing ℓ and j. Measure the fraction of leaves in that subtree which are in the same class (cℓ). The expected value of this fraction is the dendrogram purity, and it can be computed exactly in a bottom-up recursion on the dendrogram. The purity is 1 iff all leaves in each class are contained in some pure subtree.
Table 1. Purity scores for 3 kinds of traditional agglomerative clustering, and Bayesian hierarchical clustering. (Column headings: Data Set, Single Linkage, Complete Linkage, Average Linkage, BHC.)
Another advantage of the BHC algorithm, not fully addressed by purity scores alone, is that it tends to create hierarchies with good structure, particularly at high levels. Figure 5 compares the top three levels (last three merges) of the newsgroups hierarchy (using 800 examples and the 50 words with highest information gain) from BHC and ALHC. Continuing to look at lower levels does not improve ALHC, as can be seen from the full dendrograms for this dataset (figure 7). Although sports and autos are both under the newsgroup heading rec, our algorithm finds more similarity between autos and the space document class. This is quite reasonable, since these two classes share many more common words (such as engine, speed, cost, develop, gas, etc.) than autos and sports do. Figure 6 also shows dendrograms of the 3digits dataset.
6. Related Work

The work in this paper is related to and inspired by several previous probabilistic approaches to clustering³, which we briefly review here. Stolcke and Omohundro (1993) described an algorithm for agglomerative model merging based on marginal likelihoods in the context of hidden Markov model structure induction. Williams (2000) and Neal (2003) describe Gaussian and diffusion-based hierarchical generative models, respectively, for which inference can be done using MCMC methods. Similarly, Kemp et al. (2004) present a hierarchical generative model for data based on a mutation process. Ward uses a hierarchical fusion of contexts based on marginal likelihoods in a Dirichlet language model.

Banfield and Raftery (1993) present an approximate method based on the likelihood ratio test statistic to compute the marginal likelihood for c and c − 1 clusters, and use this in an agglomerative algorithm. Vaithyanathan and Dom (2000) perform hierarchical clustering of multinomial data consisting of a vector of features. The clusters are specified in terms of which subset of features have common distributions; their clusters can have different parameters for some features and the same parameters for other features, and their method is based on finding merges that maximize marginal likelihood under a Dirichlet-Multinomial model. Iwayama and Tokunaga define a Bayesian hierarchical clustering algorithm which attempts to agglomeratively find the maximum posterior probability clustering, but it makes strong independence assumptions and does not use the marginal likelihood.

Segal et al. (2002) present probabilistic abstraction hierarchies (PAH), which learn a hierarchical model in which each node contains a probabilistic model and the hierarchy favors placing similar models at neighboring nodes in the tree (as measured by a distance function between probabilistic models). The training data is assigned to leaves of this tree. Ramoni et al. (2002) present an agglomerative algorithm for merging time series based on greedily maximizing marginal likelihood. Friedman (2003) has also recently proposed a greedy agglomerative algorithm based on marginal likelihood which simultaneously clusters rows and columns of gene expression data.

The algorithm in our paper differs from the above algorithms in several ways. First, unlike (Williams, 2000; Neal, 2003; Kemp et al., 2004), it is not in fact a hierarchical generative model of the data, but rather a hierarchical way of organizing nested clusters. Second, our algorithm is derived from Dirichlet process mixtures. Third, the hypothesis test at the core of our algorithm tests a single merged hypothesis against exponentially many other clusterings of the same data (not one versus two clusters at each stage). Lastly, our algorithm does not use any iterative method, like EM, or require sampling, like MCMC, and is therefore significantly faster than most of the above algorithms.

³There has also been a considerable amount of decision-tree-based work on Bayesian tree structures for classification and regression (Chipman et al., 1998; Denison et al., 2002), but this is not closely related to the work presented here.
7. Discussion

We have presented a novel algorithm for Bayesian hierarchical clustering based on Dirichlet process mixtures. This algorithm has several advantages over traditional approaches, which we have highlighted throughout the paper. We have presented prediction and hyperparameter optimization procedures and shown that the algorithm provides competitive clusterings of real-world data, as measured by purity with respect to known labels. The algorithm can also be seen as an extremely fast alternative to MCMC inference in DPMs.

The limitations of our algorithm include its inherent greediness, a computational complexity which is quadratic in the number of data points, and the lack of any incorporation of tree uncertainty.
References

Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821.
Figure 6. (a) The tree given by performing average linkage hierarchical clustering on 120 examples of 3 digits (0, 2, and 4), yielding a purity score of 0.870. Numbers on the x-axis correspond to the true class label of each example, while the y-axis corresponds to distance between merged clusters. (b) The tree given by performing Bayesian hierarchical clustering on the same dataset, yielding a purity score of 0.924. Here higher values on the y-axis also correspond to higher levels of the hierarchy (later merges), though the exact numbers are irrelevant. Red dashed lines correspond to merges with negative posterior log probabilities and are labeled with these values.
Figure 7. Full dendrograms for average linkage hierarchical clustering and BHC on 800 newsgroup documents (4 Newsgroups dataset). Each leaf in the dendrograms is labeled with a color according to which newsgroup that document came from: red is rec.autos, blue is rec.sport.baseball, green is rec.sport.hockey, and magenta is sci.space.
Figure 8. Three toy examples (one per column). The first row shows the original data sets. The second row gives the dendrograms resulting from average linkage hierarchical clustering; each number on the x-axis corresponds to a data point as displayed in the plots in the fourth row. The third row gives the dendrograms resulting from Bayesian hierarchical clustering, where the red dashed lines are merges our algorithm prefers not to make given our priors. The numbers on the branches are the log odds for merging, log(rk/(1 − rk)). The fourth row shows the clusterings found by our algorithm using Gaussian components, when the tree is cut at the red dashed branches (rk < 0.5). The last row shows the predictive densities resulting from our algorithm. The BHC algorithm behaves structurally more sensibly than the traditional algorithm. Note that point 18 in the middle column is clustered into the green class even though it is substantially closer to points in the red class. In the first column the traditional algorithm prefers to merge cluster 1-6 with cluster 7-10 rather than, as BHC does, cluster 21-25 with cluster 26-27. Finally, in the last column, the BHC algorithm clusters all points along the upper and lower horizontal parallels together, whereas the traditional algorithm merges some subsets of points vertically (for example cluster 7-12 with cluster 16-27).