
A tag-topic model for blog mining

Flora S. Tsai

School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore
Keywords: Blog mining; Author-Topic model; Latent Dirichlet Allocation

Abstract
Blog mining addresses the problem of mining information from blog data. Although mining blogs may
share many similarities to Web and text documents, existing techniques need to be reevaluated and
adapted for the multidimensional representation of blog data, which exhibit dimensions not present in
traditional documents, such as tags. Blog tags are semantic annotations in blogs which can be valuable
sources of additional labels for the myriad of blog documents. In this paper, we present a tag-topic model
for blog mining, which is based on the Author-Topic model and Latent Dirichlet Allocation. The tag-topic
model determines the most likely tags and words for a given topic in a collection of blog posts. The model
has been successfully implemented and evaluated on real-world blog data.
© 2010 Elsevier Ltd. All rights reserved.
1. Introduction
A blog, or weblog, is a type of online journal where entries are
made in reverse chronological order. Blogs can comment on a
particular subject, as well as form part of a social network (Tsai, Han,
Xu, & Chua, 2009). The blogosphere is defined as the collection of
all blogs as a community or social network. Because of the large
number of existing blog documents (posts), the blogosphere content
may be random and chaotic (Chen, Tsai, & Chan, 2008). As a
result, effective mining and visualization techniques are needed
to aid in the analysis and understanding of blog data.
A tag is a keyword that can be used to describe a blog. The tag
metadata is useful for users to quickly find related blog entries that
are tagged to a topic of interest. Tags can be chosen by the blogger,
the viewer, or both. If many users tag many items, this tag collection
forms a folksonomy. Tagging was popularized by Web 2.0
and is an important feature of many existing services.
Many blog systems allow bloggers to add new tags to a post, in
addition to placing the post into categories. For example, a post
may display that it has been tagged with "web" and "security".
Each of those tags can link to a main page that lists all of the related
posts with the same tag. A sidebar may list all the tags for that blog,
with each tag leading to an index page. If a post is incorrectly
classified, a blogger can edit the list of tags.
Analysis of large data sets with multiple tags may require the use of
dimensionality reduction or projection techniques to transform
the data into a smaller set. Dimensionality reduction finds a smaller
set of features that can describe the original set of observed
dimensions. Dimensionality reduction can uncover hidden structure,
which is useful for understanding and visualizing the data.
Previous studies (Chen, Tsai, & Chan, 2007; Liang, Tsai, & Kwee,
2009; Tsai & Chan, 2007a) use existing data mining techniques
without considering the additional dimensions present in blogs.
In this paper, we show how blog mining is different from traditional
Web and text mining by defining the multiple dimensions
in blog documents and comparing them to Web and text documents.
Next, we describe a tag-topic model for mining the multiple tags
present in blogs. Finally, we implement the Isomap (Tenenbaum, de
Silva, & Langford, 2000) dimensionality reduction technique for
visualizing real-world collections of security blogs.
The paper is organized as follows: Section 2 describes past work
in blog content and tag mining. Section 3 presents the models and
techniques for blog mining, including the proposed tag-topic mod-
el to analyze and visualize the multiple tags present in blog data.
Section 4 presents experimental results on real-world blog data,
and Section 5 concludes the paper.
2. Blog content and tag mining
2.1. Dimensions of blog documents
A blog is structured differently from a typical Web or text document.
Table 1 compares the different components of blog, Web,
and text documents. URL stands for the Uniform Resource Locator,
the Web address from which a document can be found. A permalink
is specific to blogs, and is a URL that points to a specific blog
entry after the entry has passed from the front page into the blog
archives. Outlinks are documents that are linked from the blog or
Web document. Tags are labels that people use to make it easier
to find related blog posts, photos, and videos.
Expert Systems with Applications 38 (2011) 5330–5335
If we consider the different components of blogs, we can group
general blog data mining into five main dimensions (blog content,
tags, authors, links, and time), shown in Table 2.
The next sections define and summarize blog content and tag
mining techniques.
2.2. Blog content mining
Blog content consists of the title and content of the blog documents.
Many of the techniques are similar to those used for text and Web
documents; however, important distinctions that pose challenges in
natural language processing include the common use of abbreviations
and slang words, spelling and grammatical errors, and different
languages present within one document.
Many blog content mining techniques focus on sentiment or
opinion mining, that is, judging whether a particular blog post is
negative, positive, or neutral toward a particular entity (such as a person or
product). In fact, one of the main tasks in the Text Retrieval Conference
(TREC) Blog Track was the Opinion Retrieval Task, which involved
locating blog posts that express an opinion about a given
target (Ounis, de Rijke, Macdonald, Mishne, & Soboroff, 2006; Ounis,
Macdonald, & Soboroff, 2008; Macdonald, Ounis, & Soboroff, 2007).
Another prevalent theme in blog content mining is the filtering
of spam blogs, or splogs, which can greatly distort any estimate
of the number of blogs posted. Previous work in splog
detection includes splog detection using self-similarity analysis on
blog temporal dynamics (Lin, Sundaram, Chi, Tatemura, & Tseng,
2007) and the use of Support Vector Machines (SVMs) to identify blogs and
splogs (Kolari, Finin, & Joshi, 2006).
Yet another important task in blog content mining is topic distillation,
which was the second main task in TREC Blog 2007 (Macdonald
et al., 2007) and 2008 (Ounis et al., 2008). The blog
distillation, or feed search, task focuses on blog feeds, which are
aggregates of blog posts. The blog distillation task searches for a blog
feed with a principal, recurring interest in topic t. For a given topic
t, systems should suggest feeds that are principally devoted to t
over the timespan of the feed, and that would be recommended for
subscription as an interesting feed about t (Macdonald et al., 2007).
This task has direct relevance to the problem of searching for blogs
that a user may wish to subscribe to. As many blog posts are inherently
noisy, finding the relevant feeds is not a trivial problem.
2.3. Blog tag mining
A blog tag is a word that categorizes a document according to its
topic. Blog tag mining is a subset of social media tag mining. Social
media sites, such as Flickr and MySpace, allow users to
semantically annotate many different types of content. These
user-generated tags classify content so that it can be easily found.
Because blog tags are typically user-generated, different users
may use different tags to describe a similar blog. There is also a lack
of information about the meaning of each tag. For example, the tag
"apple" could refer to either the fruit or the company. The personalized
variety of tags makes it difficult to find comprehensive information
about a subject. Our proposed model attempts to solve some of
the difficulties of blog tag mining by applying probabilistic and
dimensionality reduction techniques, which can reduce the noise
in blog tags.
3. Models and techniques for blog mining
In this section, we propose and apply probabilistic models and
dimensionality reduction techniques for analyzing and visualizing
the multiple tags present in blog data. This model can easily be extended
to different categories of multidimensional data, such as
other types of social media. The techniques are based on Latent
Dirichlet Allocation (Blei, Ng, & Jordan, 2003), a modified version
of the Author-Topic model, and Isomap dimensionality reduction.
3.1. Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) (Blei et al., 2003) models text
documents as mixtures of latent topics, which are key concepts
presented in the text. LDA is not as vulnerable to overfitting as traditional
methods based on Latent Semantic Analysis (LSA) (Chen
et al., 2008; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990).
The topic mixture is drawn from a conjugate Dirichlet prior that
is the same for all documents. The steps adapted for blog docu-
ments are summarized below:
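(1) Select a multinomial distribution $\phi_t$ for each topic t from a
Dirichlet distribution with parameter $\beta$.
(2) For each blog document b, select a multinomial distribution
$\theta_b$ from a Dirichlet distribution with parameter $\alpha$.
(3) For each word token w in blog b, select a topic t from $\theta_b$.
(4) Select a word w from $\phi_t$.
The probability of generating a corpus is:
$$\int\!\!\int \prod_{t=1}^{K} P(\phi_t \mid \beta) \prod_{b=1}^{N} P(\theta_b \mid \alpha) \left( \prod_{i=1}^{N_b} \sum_{t_i=1}^{K} P(t_i \mid \theta_b)\, P(w_i \mid t_i, \phi) \right) d\theta\, d\phi \qquad (1)$$
where K is the number of topics, N is the number of blog documents, and $N_b$ is the number of word tokens in blog b.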
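The four generative steps above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names, the toy vocabulary, and the parameter values in the usage lines are assumptions made for the example and are not taken from the experiments in this paper.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index from a discrete (multinomial) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_blogs, words_per_blog, vocab, num_topics, alpha, beta, seed=0):
    """Generate a toy corpus by following steps (1)-(4) of the LDA process."""
    rng = random.Random(seed)
    # Step 1: one word distribution phi_t per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_blogs):
        # Step 2: one topic mixture theta_b per blog, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        blog = []
        for _ in range(words_per_blog):
            t = sample_index(theta, rng)   # Step 3: pick a topic for this token.
            w = sample_index(phi[t], rng)  # Step 4: pick a word from that topic.
            blog.append(vocab[w])
        corpus.append(blog)
    return corpus, phi


# Hypothetical toy settings, chosen only to keep the example small.
vocab = ["spyware", "virus", "browser", "theft", "credit", "worm"]
corpus, phi = generate_corpus(num_blogs=3, words_per_blog=8, vocab=vocab,
                              num_topics=2, alpha=1.0, beta=0.1)
```

Inference runs in the opposite direction: given only the observed words, the topic assignments and the distributions $\phi$ and $\theta$ are estimated.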
3.2. Tag-topic model
An extension of LDA to probabilistic Author-Topic (AT) modeling
(Rosen-Zvi, Griffiths, Steyvers, & Smyth, 2004; Steyvers, Smyth,
Rosen-Zvi, & Griffiths, 2004) is proposed for the blog tag and topic
visualization. The AT model is based on Gibbs sampling, a Markov
chain Monte Carlo technique, where each author is represented by
a probability distribution over topics, and each topic is represented
as a probability distribution over terms (words) for that topic
(Steyvers et al., 2004).
Table 1
Comparison of blog, Web, and text documents.
Components    Blog    Web    Text
[Table body not recoverable: component names and check marks lost in extraction.]
Table 2
Blog dimensions.
Dimensions    Blog components
Content       Title and content
Tags          Tags (labels or
Author        Author or blogger
Links         URL, permalink,
Time          Date and time
We have extended the AT model for analysis of blog tags. For
the tag-topic (TT) model, each tag is represented by a probability
distribution over topics, and each topic represented by a probabil-
ity distribution over terms for that topic.
Fig. 1 shows the generative model of the TT model using plate notation.
For the TT model, the probability of generating a blog b is defined
in terms of the $T_b$ tags attached to that blog. The probability is then
integrated over $\phi$ and $\theta$ and their Dirichlet distributions and
sampled using the Gibbs sampling Monte Carlo technique.
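As a sketch only, assuming the TT likelihood takes the same form as the Author-Topic likelihood of Rosen-Zvi et al. (2004) with the tags of a blog in place of its authors, the probability of the words of blog b could be written as below. The symbols $N_b$ (number of word tokens in blog b), $K$ (number of topics), and $\mathbf{x}_b$ (the set of tags on blog b) are notation assumed here for illustration.

$$P(\mathbf{w}_b \mid \theta, \phi, \mathbf{x}_b) = \prod_{i=1}^{N_b} \frac{1}{T_b} \sum_{x \in \mathbf{x}_b} \sum_{t=1}^{K} \phi_{w_i t}\, \theta_{t x}$$

Under this assumption, $\theta_{tx}$ is the probability of topic t given tag x, $\phi_{w_i t}$ is the probability of word $w_i$ given topic t, and each word is attributed to one of the blog's $T_b$ tags chosen uniformly at random.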
The similarity matrices for tags and content can then be calculated
using the symmetrized Kullback–Leibler (KL) distance between
topic distributions, which measures the
difference between two probability distributions. The similarity
matrices can be visualized using the Isomap dimensionality reduction
technique described in the following section.
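A minimal sketch of the symmetrized KL computation is given below, using only the Python standard library. It sums the two directed divergences; some authors halve this sum, and the paper does not state which convention was used. The smoothing constant and the toy tag-topic distributions are assumptions made for the example.

```python
import math


def smooth(dist, eps=1e-12):
    """Add a small constant and renormalize so that no probability is exactly zero."""
    total = sum(dist) + eps * len(dist)
    return [(p + eps) / total for p in dist]


def kl_divergence(p, q):
    """Directed divergence KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetrized_kl(p, q):
    """Symmetrized KL distance: KL(p || q) + KL(q || p)."""
    p, q = smooth(p), smooth(q)
    return kl_divergence(p, q) + kl_divergence(q, p)


def distance_matrix(distributions):
    """Pairwise symmetrized KL distances between topic distributions."""
    n = len(distributions)
    return [[symmetrized_kl(distributions[i], distributions[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical topic distributions for three tags over three topics.
tag_topics = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
dist = distance_matrix(tag_topics)
```

In this toy input the first two tags have similar topic distributions, so their distance is much smaller than the distance from either of them to the third tag.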
3.3. Isometric feature mapping (Isomap)
Isomap (Tenenbaum et al., 2000) is a nonlinear dimensionality
reduction technique that uses multidimensional scaling (MDS)
(Davison, 2000) techniques with geodesic interpoint distances in-
stead of Euclidean distances. Geodesic distances represent the
shortest paths along the curved surface of the manifold. Unlike
the linear techniques, Isomap can discover the nonlinear degrees
of freedom that underlie complex natural observations (Tenen-
baum et al., 2000).
Isomap deals with finite data sets of points in $\mathbb{R}^n$ which are
assumed to lie on a smooth submanifold $M^d$ of low dimension d < n.
The algorithm attempts to recover M given only the data points.
Isomap estimates the unknown geodesic distance in M between
data points in terms of the graph distance with respect to some
graph G constructed on the data points.
The Isomap algorithm consists of three basic steps:
(1) Find the nearest neighbors on the manifold M, based on the
distances between pairs of points in the input space.
(2) Approximate the geodesic distances between all pairs of
points on the manifold M by computing their shortest path
distances in the graph G.
(3) Apply MDS to the matrix of graph distances, constructing an
embedding of the data in a d-dimensional Euclidean space
Y that best preserves the manifold's estimated intrinsic
geometry (Tenenbaum et al., 2000).
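The three steps above can be sketched with NumPy as follows. This is an illustrative implementation, not the code used for the experiments: it takes a precomputed distance matrix as input, uses Floyd-Warshall for the shortest paths, and uses classical MDS for the embedding. The function name and the half-circle test data are assumptions made for the example.

```python
import numpy as np


def isomap(dist, n_neighbors, n_components=2):
    """Embed points given an (n, n) symmetric matrix of pairwise input distances."""
    n = dist.shape[0]
    # Step 1: k-nearest-neighbor graph; pairs that are not neighbors get infinite length.
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):
        # Position 0 of the sorted row is the point itself (distance zero), so skip it.
        nearest = np.argsort(dist[i])[1:n_neighbors + 1]
        graph[i, nearest] = dist[i, nearest]
        graph[nearest, i] = dist[i, nearest]  # keep the graph undirected
    # Step 2: all-pairs shortest paths (Floyd-Warshall) approximate geodesic distances.
    for k in range(n):
        graph = np.minimum(graph, graph[:, [k]] + graph[[k], :])
    if np.isinf(graph).any():
        raise ValueError("neighborhood graph is disconnected; increase n_neighbors")
    # Step 3: classical MDS on the graph distances.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (graph ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


# Toy data: 20 points on a half circle, a one-dimensional manifold embedded in the plane.
angles = np.linspace(0.0, np.pi, 20)
points = np.column_stack([np.cos(angles), np.sin(angles)])
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
embedding = isomap(pairwise, n_neighbors=2, n_components=1)
```

On this toy input the one-dimensional embedding orders the points along the arc, and its total extent is close to the arc length rather than to the straight-line distance between the two end points.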
If two points appear on a nonlinear manifold, their Euclidean
distance in the high-dimensional input space may not accurately
reflect their intrinsic similarity. The geodesic distance along the
low-dimensional manifold is thus a better representation for these
points. The neighborhood graph G constructed in the first step
allows an estimate of the true geodesic path to be computed efficiently
in step two, as the shortest path in G. The two-dimensional
embedding recovered by Isomap in step three best preserves
the shortest path distances in the neighborhood graph.
Straight lines in the embedding now represent simpler and cleaner
approximations to the true geodesic paths than do the corresponding graph
paths (Tenenbaum et al., 2000).
Isomap is a very useful noniterative, polynomial-time algorithm
for nonlinear dimensionality reduction. Isomap is able to compute
a globally optimal solution, and for a certain class of data manifolds
(Swiss roll), is guaranteed to converge asymptotically to the true
structure (Tenenbaum et al., 2000). However, Isomap may not eas-
ily handle more complex domains such as non-trivial curvature or
topology. Because a previous study showed that Isomap was gen-
erally able to perform well on visualization of synthetic as well
as real-world data (Tsai & Chan, 2007b), we have applied Isomap
for visualizing blog content and tags.
4. Experiments and results
We used the tag-topic model for blog data mining on our collec-
tion of real-world blog data. Dimensionality reduction was per-
formed with Isomap to show the similarity plot of blog content
and tags. Experiments show that the tag-topic model can reveal
interesting patterns in the underlying tags and topics for our data-
set of security-related blogs.
4.1. Data corpus
For our experiments, we extracted a subset of the Nielsen BuzzMetrics
blog data corpus that focuses on blogs related to security
threats and incidents related to cyber crime and computer viruses.
The original dataset consists of 14 million blog posts collected by
Nielsen BuzzMetrics for May 2006. Although the blog entries span
only a short period of time, they are indicative of the amount and
variety of blog posts that exist in different languages throughout
the world.
Blog entries related to security threats such as malware, cyber
crime, computer virus, encryption, and information security were
extracted by keyword search and stored for use in our analysis.
Fig. 1. The graphical model for the tag-topic model using plate notation.
There were a total of 3096 entries in our dataset; however, as
most of the blog posts do not have tags associated with them, we
eliminated those documents with null or blank tags, as well as
those with tags labeled as "uncategorized". Each of the remaining
948 blog entries was saved as a text file for further text preprocessing.
For the preprocessing of the blog content, HTML tags were removed,
and lexical analysis was performed by removing stopwords,
stemming, and pruning with the Text to Matrix Generator (TMG)
(Zeimpekis & Gallopoulos, 2006) prior to generating the term-document
matrix using term frequency (TF) local term weighting. The
total number of terms after pruning and stopword removal was
4111. For the tag-document matrix, tags separated by "and", "/",
or "&" were treated as separate tags. Otherwise, the words were
combined to form one tag. The tag-document matrix was generated
with binary local term weighting, resulting in a total of 552
unique tags. The term-document matrix and tag-document matrix
were used to compute the tag-topic model.
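The tag handling described above can be illustrated with a short Python sketch. The function names, the lowercasing step, and the example tag strings are assumptions made for illustration; the actual preprocessing was carried out with TMG, whose exact behavior is not reproduced here.

```python
import re


def normalize_tags(raw_tag):
    """Split a raw tag string on 'and', '/', or '&'; merge remaining words into one tag."""
    # Lowercasing is an assumption; "and" only counts as a separator between spaces.
    parts = re.split(r"\s+and\s+|/|&", raw_tag.lower())
    tags = ["".join(part.split()) for part in parts]
    # Drop empty tags and the "uncategorized" label.
    return [tag for tag in tags if tag and tag != "uncategorized"]


def binary_tag_document_matrix(raw_tags_per_blog):
    """Build a tag-by-document matrix with binary weights (1 if the blog has the tag)."""
    tag_lists = [normalize_tags(raw) for raw in raw_tags_per_blog]
    vocabulary = sorted({tag for tags in tag_lists for tag in tags})
    matrix = [[1 if tag in tags else 0 for tags in tag_lists] for tag in vocabulary]
    return vocabulary, matrix


# Hypothetical raw tag strings for three blog posts.
raw_tags = ["Web and Security", "spyware news", "security"]
vocabulary, matrix = binary_tag_document_matrix(raw_tags)
```

Here "Web and Security" yields the two tags "web" and "security", while "spyware news" is merged into the single tag "spywarenews".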
In this model, each tag is represented by a probability distribution
over topics, and each topic is represented as a probability distribution
over terms for that topic (Steyvers et al., 2004). The topic-term
and tag-topic distributions were then learned from the blog
data in an unsupervised manner. The parameters used in our
experiments were the number of topics (t = 50) and number of iterations
(N = 2000). We used symmetric Dirichlet priors in the TT
estimation with $\alpha = 50/t$ and $\beta = 0.01$, which are common settings
in the literature.
The most likely terms and corresponding tags from each topic of
the blog entry collection are listed in Tables 3–6.
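Listing the most likely tags for a topic amounts to sorting the tags by their probability and keeping the first few. A minimal sketch follows; the helper name is hypothetical, and the input is a four-tag excerpt of the values reported for Topic 26 in Table 5.

```python
def top_k(distribution, k):
    """Return the k entries with the highest probability, most likely first."""
    return sorted(distribution.items(), key=lambda item: item[1], reverse=True)[:k]


# Excerpt of the tag probabilities for Topic 26 (Spyware) from Table 5.
topic_26_tags = {
    "quizzes": 0.04806,
    "spywarenews": 0.52080,
    "warroom": 0.00756,
    "aquifer": 0.03719,
}
most_likely = top_k(topic_26_tags, 2)
# most_likely == [("spywarenews", 0.52080), ("quizzes", 0.04806)]
```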
From the results, we observe that some of the blog tags may not
be very descriptive of the topic. For example, for the topic Spyware,
the tags "quizzes", "thankyouforsmoking", "aquifer", and
"catchingupwithtowanda" do not seem especially relevant to the topic.
Since tags are user generated, there is often a problem of mislabeling,
or of using long phrases instead of one or two words to tag a blog.
Bloggers also have a tendency to use the same tag for many or all of
their posts, no matter what the subject.
4.2. Blog content visualization
For visualizing the document similarities, the symmetrized
Kullback–Leibler distance between topic distributions was calculated
for each document pair. Fig. 2 shows the 2D plot of the document
similarities based on the document-topic distributions. A
random sample of 100 titles was taken and shown in the plot.
4.3. Blog tag visualization
For visualizing the tag similarities, the symmetrized Kullback–Leibler
distance between topic distributions was calculated for
each tag pair. Fig. 3 shows the 2D plot of the tag similarities based
on the tag-topic distributions of the most popular tags. In the plot,
Table 3
Topic 11: malware.
Term Probability
browser 0.07184
worm 0.04667
yahoo 0.03283
user 0.03121
safeti 0.02768
instal 0.02488
facetim 0.02355
hijack 0.02002
malwar 0.01870
site 0.01708
Tag Probability
world 0.13636
web 0.09365
videogames 0.07790
links 0.05805
www 0.05079
news 0.05011
opinion 0.03409
internet 0.03245
windows 0.02834
economy 0.02369
Table 4
Topic 22: Windows security.
Term Probability
threat 0.02759
secure 0.02566
custom 0.02227
window 0.02203
antivirus 0.02178
beta 0.01985
protect 0.01960
response 0.01839
vista 0.01839
offer 0.01476
Tag Probability
diggnews 0.47986
miscellanea 0.03511
gallery 0.02606
world 0.02111
musique 0.01960
spywarenews 0.01637
blogging 0.01271
warroom 0.01228
photos 0.00862
mobilesociety 0.00797
Table 5
Topic 26: Spyware.
Term Probability
spyware 0.10403
comput 0.02331
software 0.02177
anti 0.01868
yahoo 0.01800
web 0.01594
user 0.01525
system 0.01320
new 0.01234
person 0.01183
Tag Probability
spywarenews 0.52080
quizzes 0.04806
thankyouforsmoking 0.04412
aquifer 0.03719
catchingupwithtowanda 0.01765
writing 0.01623
spywarebooks 0.01529
secularhumanism 0.00961
sport 0.00804
warroom 0.00756
Table 6
Topic 48: Identity theft.
Term Probability
secure 0.04668
card 0.02941
theft 0.02462
access 0.02334
credit 0.02302
compani 0.01982
ident 0.01695
execute 0.01567
laptop 0.01567
employe 0.01503
Tag Probability
photos 0.31245
security 0.04562
religion 0.03325
miscellanea 0.03243
vehicles 0.02556
review 0.01539
veggingout 0.01182
wespen 0.01154
intellisense 0.01127
writing 0.01127
each tag was scaled according to the number of blogs posted using
that tag. The distances between the tags reflect the dissimilarity
between tags, based on the topic distributions of the
blogs that were posted. As seen from the graphs, the majority of
blogs in our dataset were tagged with either "spywarenews" or
"news". Because of the free-form nature of the tags, problems
may arise due to nonstandardized tag labels. This problem may
be solved when a larger set of blogs is taken. In addition, some
of the tags overlap because they are tagged to the same or similar
topics. This may be due to the specialized nature of our dataset,
which focused on security blogs. If a larger set of blogs is taken,
there may not be as many overlapping tags.
5. Conclusion and future work
In this paper, we proposed a tag-topic model for blog mining
based on the Author-Topic model. In this model, each tag is represented
by a probability distribution over topics, and each topic is
represented as a probability distribution over terms for that topic.
This can solve the problem of finding the most likely tags and
terms for a given topic.
We have successfully implemented and evaluated the tag-topic
model on real-world security blogs. Using the output of the tag-topic
model, we present results in visualizing which tags are similar
to each other with the Isomap dimensionality reduction technique.
In addition, we plot the results of the blog document similarities,
based on the same techniques.
Since the tags are user generated, there may be some inherent
noise in the tags. Dimensionality reduction can help remove the
noise in the tags, and may prove useful for future studies focusing
on tag mining and visualization. The tag-topic model can be extended
in the future to larger datasets as well as other types of social
media with semantic annotations.
References
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn.
Res., 3, 993–1022.
Chen, Y., Tsai, F. S., & Chan, K. L. (2007). Blog search and mining in the business
domain. In DDDM '07: Proceedings of the 2007 international workshop on domain
driven data mining (pp. 55–60). New York, NY, USA: ACM.
Chen, Y., Tsai, F. S., & Chan, K. L. (2008). Machine learning techniques for business
blog search and mining. Expert Systems with Applications, 35(3), 581–590.
Davison, M. (2000). Multidimensional scaling. Florida: Krieger Publishing Company.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990).
Indexing by latent semantic analysis. Journal of the American Society for
Information Science, 41(6), 391–407.
Kolari, P., Finin, T., & Joshi, A. (2006). SVMs for the blogosphere: Blog identification
and splog detection. In AAAI spring symposium on computational approaches to
analysing Weblogs.
Liang, H., Tsai, F. S., & Kwee, A. T. (2009). Detecting novel business blogs. In ICICS
2009 - Conference Proceedings of the 7th international conference on information,
communications and signal processing (ICICS).
Lin, Y.-R., Sundaram, H., Chi, Y., Tatemura, J., & Tseng, B. L. (2007). Splog detection
using self-similarity analysis on blog temporal dynamics. In AIRWeb '07:
Proceedings of the third international workshop on Adversarial information
retrieval on the web (pp. 1–8). New York, NY, USA: ACM.
Macdonald, C., Ounis, I., & Soboroff, I. (2007). Overview of the TREC-2007 blog track.
In The sixteenth text REtrieval conference (TREC 2007) proceedings.
Ounis, I., de Rijke, M., Macdonald, C., Mishne, G. A., & Soboroff, I. (2006). Overview of
the TREC-2006 Blog track. In TREC 2006 working notes (pp. 15–27).
Ounis, I., Macdonald, C., & Soboroff, I. (2008). Overview of the TREC-2008 Blog track.
In TREC 2008 working notes.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model
for authors and documents. In AUAI '04: Proceedings of the 20th conference on
uncertainty in artificial intelligence (pp. 487–494). Arlington, Virginia, United
States: AUAI Press.
Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic author-
topic models for information discovery. In KDD '04: Proceedings of the tenth ACM
SIGKDD international conference on knowledge discovery and data mining
(pp. 306–315). New York, NY, USA: ACM.
Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for
nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
Fig. 2. Results on visualization of blog content using Isomap (k = 100).
Fig. 3. Results on visualization of blog tags using Isomap (k = 20).
Tsai, F. S., & Chan, K. L. (2007a). Detecting cyber security threats in weblogs using
probabilistic models. In Lecture notes in computer science (LNCS) (Vol. 4430).
Tsai, F. S., & Chan, K. L. (2007b). Dimensionality reduction techniques for data
exploration. In 2007 6th international conference on information, communications
and signal processing, ICICS.
Tsai, F. S., Han, W., Xu, J., & Chua, H. C. (2009). Design and development of a mobile
peer-to-peer social networking application. Expert Systems with Applications,
36(8), 11077–11087.
Zeimpekis, D., & Gallopoulos, E. (2006). TMG: A MATLAB Toolbox for generating
term-document matrices from text collections. In Grouping multidimensional
data (pp. 187–210). Cambridge, MA: MIT Press.