Clusteri

Clustering users of an online content service
Sami Linnanvuo
Helsinki 21st June 2006

Master of Science Thesis
UNIVERSITY OF HELSINKI
Department of Computer Science
HELSINGIN YLIOPISTO – HELSINGFORS UNIVERSITET
Tiedekunta/Osasto Laitos – Institution
Faculty of Science Department of Computer Science

Tekijä – Författare
Sami Linnanvuo
Työn nimi – Arbetets titel
Clustering users of an online content service
Oppiaine – Läroämne
Computer Science
Työn laji – Arbetets art Aika – Datum Sivumäärä – Sidoantal
M.Sc. Thesis 21.6.2006 70 pages
Tiivistelmä – Referat
Online content services can greatly benefit from personalisation features that enable delivery of
content that is suited to each user’s specific interests. This thesis presents a system that applies text
analysis and user modeling techniques in an online news service for the purpose of personalisation
and user interest analysis. The system creates a detailed thematic profile for each content item and
observes user’s actions towards content items to learn user’s preferences. A handcrafted taxonomy of
concepts, or ontology, is used in profile formation to extract relevant concepts from the text. User
preference learning is automatic and there is no need for explicit preference settings or ratings from
the user. Learned user profiles are segmented into interest groups using clustering techniques with
the objective of providing a source of information for the service provider. Some theoretical
background for chosen techniques is presented while the main focus is in finding practical solutions
to some of the current information needs, which are not optimally served with traditional techniques.
Avainsanat – Nyckelord
Clustering, user modeling, personalisation, text categorization, ontologies
Säilytyspaikka – Förvaringställe
Muita tietoja – Övriga uppgifter
2
Abstract
Online content services can greatly benefit from personalisation features that enable
delivery of content that is suited to each user’s specific interests. This thesis presents a
system that applies text analysis and user modeling techniques in an online news service
for the purpose of personalisation and user interest analysis. The system creates a detailed
thematic profile for each content item and observes user’s actions towards content items
to learn user’s preferences. A handcrafted taxonomy of concepts, or ontology, is used in
text analysis for extracting relevant concepts from the text. User preference learning is
automatic and there is no need for explicit preference settings or ratings from the user.
Learned user profiles are segmented into interest groups using clustering techniques with
the objective of providing a source of information for the service provider. Some
theoretical background for chosen techniques is presented while the main focus is in
finding practical solutions to some of the current information needs, which are not
optimally served with traditional techniques.
3
1 Introduction.................................................................................................................... 5
2 Text analysis ................................................................................................................... 7

2.1 Information content of a word .................................................................................. 8
2.2 Creating a document model ...................................................................................... 9
2.3 Measuring the similarity of two documents............................................................ 12
2.4 Vector space model................................................................................................. 13
2.5 Semantics in language............................................................................................. 17
2.6 Using ontology in capturing the semantics of text.................................................. 19
3 User modeling............................................................................................................... 23
3.1 Content personalisation........................................................................................... 23
3.2 Collaborative filtering............................................................................................. 24
3.3 Content-based filtering............................................................................................ 27
3.4 Hybrid methods....................................................................................................... 29
4 Cluster analysis ............................................................................................................ 31

4.1 Clustering in information retrieval.......................................................................... 32
4.2 K-means .................................................................................................................. 33
4.3 Hierarchical clustering ............................................................................................ 35
4.4 Group Average Agglomerative Clustering ............................................................. 37
5 Creation of user and content profiles......................................................................... 40

5.1 Architecture of the content service ......................................................................... 40
5.2 Leiki Targeting personalisation engine................................................................... 42
5.3 Ontology ................................................................................................................. 43
5.4 Ontology construction............................................................................................. 46
5.5 Content profiling..................................................................................................... 47
5.6 User profiling .......................................................................................................... 50
6 Clustering of users ....................................................................................................... 53

6.1 Clustering algorithm ............................................................................................... 53
6.2 Data ......................................................................................................................... 54
6.3 Pre-processing of the data....................................................................................... 57
6.4 Visualization of clusters.......................................................................................... 58
6.5 Results..................................................................................................................... 61
6.6 Validation of the result............................................................................................ 65
7 Conclusions................................................................................................................... 68
7.1 Privacy issues.......................................................................................................... 69
7.2 Future work............................................................................................................. 70
BIBLIOGRAPHY ................................................................................................71
4
1 Introduction
The amount of available digital content has increased drastically since the birth of World
Wide Web. Shortly after introduction of a user-friendly interface, web browser, anyone
having a computer and a net connection has been able to access vast amount of online
content in the form of news services, discussion boards and online shops. As the amount
of available information increases, the ability to distinguish relevant content from
irrelevant becomes crucial. Recently, as many of these services have become available to
mobile devices with a small screen size and limited bandwidth, efficient filtering and
personalisation techniques have become more necessary than ever. Increased competition
and more demanding users have brought new concerns to the service providers: how to
keep users using the service and not to switch to a competing one? How to help users find
the content they are looking for? How to increase advertising revenue without irritating
users? These are important questions to any content provider running a commercial
content service.
This thesis, motivated by the above questions, presents a system that aims at helping
service providers to improve the efficiency of their information delivery, and at providing
them information about the end-users. A key factor driving the user satisfaction in a
content service is the quality of the content; if the content serves well user’s information
needs, it is more likely that the user will return to the service again. The problem is that
often the service provider does not exactly know what the users are interested in. By
segmenting users into groups according to their interests and presenting those groups in a
graphical user interface helps service provider to adjust the content according to the
interests of the target audience. This information can be further used in targeted
advertising campaigns in which users are offered only the ads that they are most likely to
be interested to see.
To achieve the results outlined above, the system should be able to resolve the interests of
any individual user and measure the level of similarity of any two users using the system.
5
User profiles are created using Leiki Targeting, a real-time learning, profiling and
personalisation engine. The engine uses domain ontology to automatically form a human
readable profile for each content and user in the system. Once the profiles are created, the
system is capable of recommending users the content they are probably most interested in
seeing. The purpose of this work is to present a way to add user segmentation feature to
this existing personalisation engine.
This work is divided in 7 chapters so that each of the chapters builds on top of the
previous ones. Chapter 2 provides the reader some basics of text analysis and document
modeling with the objective of finding topics from the text. The concept of an ontology, a
taxonomy of concepts, is introduced as a provider of semantics for the text analysis task.
A vector space model is introduced as a methodology for efficiently performing
calculations with documents. Chapter 3 introduces techniques for user modeling and
personalisation. Both content-based and collaborative approaches for content filtering are
compared and their differences are explained. Chapter 4 provides reader an introduction
to data clustering as a method for automatically finding groups in data. Chapter 5
describes the functionality of Leiki Targeting, a real-time learning, profiling and
personalisation engine, which is used in content and user profiling. Chapter 6 is devoted
to analysis of the generated user profiles. Users are segmented into interest groups using
cluster analysis. Scalability to a large number of users is achieved by using a custom
hybrid algorithm, which combines hierarchical and partitional clustering.
6
2 Text analysis
Retrieving relevant information from a document collection is a primary challenge in the

field of information retrieval. Search engines, filtering and personalisation software and
information extraction systems are examples of attempts to solve the same basic need -
providing users with the information they wish to see. In order to do so, one must not
only find the desired content from the document collection but also be able to form a
correct understanding of the information need of the user. Neither of these two
requirements is easy to solve using automated methods. Typically this involves
interpretation of user’s, often ambiguous, query terms, finding relevant documents from
the document collection and ranking them according to their relevance to the user’s query
terms.
There are several approaches for extracting knowledge from text. Some of them rely on
handcrafted rules on detecting words and sentences while some are more computationally
oriented attempting to automatically learn any such rules. Traditional Natural language
processing (NLP) methods are based on the concept of grammar as a model of language
and sentence is stated grammatical if it follows the rules of a grammar or ungrammatical
if it conflicts with any of the rules. However, implementing such a set of rules that
governs all the features of a language is far from easy. Even for a human, it is sometimes
difficult to determine the grammaticality of a sentence. In common use of language,
people often use sentences that do not follow the grammar but they still get understood
and also should get understood if they are communicating with a computer. The fact that,
the use of language changes over time when new terms are invented and new forms of
expressing thoughts are incorporated into language, implies the need for constant
updating of the language model.
7
2.1 Information content of a word
By observing words in text one can clearly see that words are unevenly distributed: some
words are very common while some appear very rarely. The majority of the words lie
between those two extremes. In calculations of the words in Tom Sawyer, the most
common 100 words accounted 50,9% of the text while 49,8% of the words occurred only
once [MaS02]. Similar results would be achieved with any typical text corpus. This
characteristic of language follows the Zipf’s law [Zip49], which states that the frequency
of use of a word is inversely proportional to its statistical rank such that
Pn ≈ 1/n a , (2.1)
where Pn is the frequency of occurrence of the nth ranked word and a is close to 1. Given
any sufficiently large corpus of text in any natural language, the second most common
word will occur approximately ½ as often as the first and the nth most common word will
occur 1/n as often as the first. As a consequence, 20% of words take up 80% of written
text. This pattern is also visible in many related domains such as in distribution of page
views in a web services (the second popular gets half the impressions of the most popular
page) and in the long tail of products in retail market (where 20% of the items makes up
80% of the sales).
An important property related to frequency of a word is its information content, which

denotes the importance of that word in describing the documents contents. The more a
term contributes to the unique characteristics of a document when compared to other
documents in the document collection, the more information content that term bears.
From the information theory it is known that entropy H is a measure of predictability in a
series of symbols. If P is a probability distribution function over the random variable X,
then the entropy of P is given by [Shw63]
8
 1 
H ( P) = ∑ P( X ) log 2   . (2.2)
x∈X  P( X ) 
A collection of documents can be viewed a stream of tokens generated by an information

source. This distribution is never completely known, but it can be estimated from the
sample text. The information content of a single term is
 1
log 2   , (2.3)
 pt 
where pt is the probability of occurrence of the term. The more frequent a word is, the
less information it carries.
2.2 Creating a document model
Computerized text processing requires transforming text into in-memory representation

where operations such as comparing the similarity of two documents can be easily
performed. The model of the document should be compact while preserving as much
essential information from the original document as possible. In typical information
retrieval tasks, a document model should capture all the essence about the topic of the
document although there are also domains in which properties such as writing style may
be of interest e.g. in the detection of authorship of a text [Die00]. A document model
consists of a set of features, words or text fragments, extracted from text. A typical
procedure for finding the suitable set of features can be divided in the following steps:
1. Tokenization of text
2. Stripping punctuation
3. Removal of stop-words
4. Stemming
5. Weighting
9
The transformation process starts by first tokenizing text into units of language. What is a
unit then? Intuitively thinking, a unit is a word, more precisely a sequence of characters
with spaces on both ends. However, tokenizing text turns out to be more complicated
than that. Because of punctuation marks (commas, semicolons etc.), words are not always
surrounded by white spaces. Simply removing punctuation is a common solution but not
a perfect one; there is information stored in the punctuation as well, for example a
question mark suggests that the sentence contains a question. Detecting a boundary of a
sentence is not trivial either: periods may appear inside sentence with abbreviations or
digits. In an extreme case where sentence ends in abbreviation the trailing period serves
two meanings [Msa02].
Words usually have different morphological variants with similar semantic

interpretations. To combine these semantically similar words into the same root form, a
method called stemming is used to remove word suffixes. Some morphological
knowledge about the word is needed (e.g. her -> she) in order to be able to derive the
right stem. The commonly used stemming algorithms, such as porter stemmer [Por80],
rely on lists of suffixes accompanied by rules for handling morphological special cases.
One shortcoming of stemming algorithms is that they are language specific.
There are varying approaches for choosing the smallest unit of representation of which
the most common is a bag-of-words representation where one unit consists of a single
term (word, punctuation mark, digit etc.). In a bag-of-words representation, terms are
seen as separate entities and their ordering has no importance. It is obvious that bag-of-
words representation is not optimal due to existence of special word combinations such
as collocations and idioms. Bag-of-words approach can be extended to include word
combinations as in [MlG98]. Since meaningful word combinations occur more often
together than separately, they can be automatically detected to some extent by statistical
calculations from the sufficiently large text corpora and then treated as single units.
10
The next step is to filter out irrelevant terms in order to make the model of the document
more compact. As the number of distinct terms in a moderate sized collection of
documents easily exceeds the limit of what can be processed efficiently, some selection
must be performed in terms of which words are chosen to represent a document. We are
primarily interested in words that co-occur often in same documents but are absent in
most other documents. As a consequence, the words that have fairly even distribution of
occurrence in the document collection can be discarded as they hardly contain any
significance to the document’s topic, and as such, we are more interested in terms that are
relatively common. As was shown in chapter 2.1, a significant number of terms appear
only once in the whole document collection and increasing the number of documents in
the collection won’t change that fact. In order to filter out such extremely rare terms, a
threshold value can be used. Most common terms in language are usually not relevant for
information retrieval tasks. In English, such most common terms are often determiners
(a, an, the) and personal pronouns such as I, he and you. These topically irrelevant words
are often called stop-words. A manually constructed list of stop-words can be used for
filtering out the most common irrelevant terms. The downside of such lists is that they are
laborious to create and language dependent. The goal of term selection is to filter out
irrelevant terms while keeping the relevant ones. This not only increases the performance
of retrieval tasks but also improves the quality of the outcome when irrelevant terms do
not interfere in the retrieval task.
Once the relevant terms for representing the document are chosen, they are usually
weighted according to their importance. Weighting can be based on several properties
such as frequency of a term, its information content, its position in a document or even its
position within a sentence. A comparative study of different weighting schemes can be
found in [Nan03]. Typically, weighting is based on combination of more than one
property. The outcome of the document modeling process is a set (or sometimes a list) of
units (features) of text with associated weights in a suitable data structure. Ideally, the
model reflects as much as possible the features of the original document that are relevant
in the task while enabling desired operations to be performed to the document efficiently.
11
2.3 Measuring the similarity of two documents
Creation of reliable document similarity measure is an active topic in the research of

information retrieval. In search engines, a query can be thought of being a document that
is to be compared to the documents in the database. In a news service, links to the similar
documents as currently viewed can be provided. Many data mining tasks such as text
classification and clustering relies on measurements of document similarity.
The similarity of two documents can be defined as a degree of overlap between the two
documents. In a simple bag-of-words model, similarity between two documents can be
characterized by the following rules:
• If a term is present in both documents, the similarity will increase.

• If a term is present in only one of the documents, the similarity will decrease.
• Terms emphasis to the similarity score should be inversely proportional to its
frequency in the document collection.
Similarity in an information theoretic viewpoint can be defined in every domain where a

probabilistic model is available. Let s be the proposition that states the commonalities
between A and B. I(s) is the information contained in a proposition s. In information
theory, the information contained in a statement is measured by the negative logarithm of
the probability of the statement. Let A ∩ B be the features that A and B share in common.
Therefore, the information contained in this statement is
I ( A ∩ B ) = − log P( A ∩ B ) . (2.5)
The difference between A and B can be measured by
A∆B = ( A − B ) ∪ ( B − A) . (2.6)
12
Similarity between A and B can be stated as a distance between their commonality and
their difference. The less information is needed to describe what A and B are after their
commonality is known, the more similar they are. Since documents can be described as a
sum of their commonality and difference, their similarity can be defined as [AsI99]
I ( A ∩ B)
Sim( A, B ) = . (2.7)
I ( A ∩ B ) + I ( A∆B )
The problem with this simplified method is that it does not take into account the meaning
of words. Two documents can be similar even though they do not share any words. This
is not only due to synonymy but also due to the fact that two words may be semantically
similar even though they are not considered synonyms. In fact, true synonymy is quite
rare but there are plenty of words that can be considered to be near-synonyms. In
addition, some words, even though not considered even near-synonyms, can belong to
same topical category. For example, banana, apple and orange are all fruits and therefore
can be considered to be semantically similar with each other. Measuring similarity of
two documents is essentially about measuring the similarity of words in them. Including
the semantic similarity of words into document model enables more comprehensive
similarity measures than a simple overlap of words.
2.4 Vector space model
The vector space model [Sal75] is widely used in information retrieval largely due to its
simplicity. In order to perform computation with documents in vector space model, the
extracted terms (features) must be converted into vector representation. Documents are
represented as weighted n-dimensional term vectors where each dimension corresponds
to a term (or a phrase). Term vectors are created by first indexing the documents with the
terms they contain (this usually includes the removal of uninformative words). An entry
in the vector denotes that a certain term appears in a document. After indexing, the
13
extracted terms are weighted according to their relevance for the information retrieval
task. Formally, a term vector for document d is represented as
d = (w1,w 2 ,...,w n ) , (2.8)
where wi corresponds to the weight of the term i. Terms are associated weights according
to their relevance for the document’s contents so that the more relevant a term is the more
weight it is given. In the simplest case, the weight of a term is the number of occurrences
of that term in a document. More advanced weighting schemes take into account the
importance of term, that is, it’s relevancy in describing the contents of the document. A
widely used TFIDF weighting [SaM83] is based on three elements:
1. Local frequency of a term i in document d (tfi,d).

2. Global frequency of a term i in a document collection (dfi).
3. Normalization factor (n).
The local frequency tells the number of occurrences of term ti in a document dj. The more
the term appears in a document the more likely it is that the term is related to the topic of
the document. The global frequency of a term tells how many times the term appears in
the entire collection, and, as such, reflects the amount of information contained in the
term. The more common the term among the text corpus is, the less emphasis it gives to
the semantic distinctiveness of that document from other documents in the collection. If a
term is evenly distributed in the whole collection, it is probably not semantically focused
on any topic, but instead, is used in the context of any topic. To find out how
semantically focused a term is in the collection of N documents, its inverse document
frequency can be calculated as:
 N 
idf i, j = log  . (2.9)
 df j 
14
Idf adjusts the weight of a term in a document with a factor that discounts its importance
when it appears an almost all of the documents, and as such, do not discriminate well
between documents in the collection.
Normalization factor (n) compensates for the differences in the lengths of the documents.
The rationale behind normalization is that the term that appears equally many times in
short document than in a long one is likely to be more relevant for the shorter one. This
issue can be addressed by normalizing the weights by the length (norm) of the vector:
w
w′ = . (2.10)
w + w 22 + ...+ w n2
2
1
Using the weighting scheme based on three above components, the final weight for the
term i in a document j is given by
wi , j = tf i , j idf i n . (2.11)
There is a lot of variation in TFIDF based weighting schemes. Sometimes the term
frequency is dampened by a logarithm (1 + log(tfi,j)) in order to decrease the weight
difference caused by the frequency of a term. The table below lists some of the many
options for TFIDF weighting [MaS02].
Table 2.1 Components of TFIDF weighting schemes.

Term occurrence Document frequency Normalization
tfi,j dfi no normalization
1 + log(tfi,j) N 1
log
dfi w + w + ... + wn2
2
1
2
2
0.5 ∗ tfi, j
0.5 +
max t (tfi, j)
15
One of the strengths of the vector space model is the ability to efficiently compare the
level of similarity of two documents. Since document features are represented as
attributes in vectors the similarity can be defined as a distance between the two vectors.
The closer the two vectors are, the more similar the documents with each other. The
Euclidean distance (the straight line distance between two points) between two vectors A
and B is given by
n
Sim(A,B) = ∑ (A − B ) i i
2
. (2.14)
i= 1
The cosine distance measures the similarity by computing the angle between two vectors
rather than the distance:
∑A B i i
Sim( A, B) = i= 1
. (2.15)
n n
∑ A ∑B 2
i
2
i
i= 1 i= 1
A vector is normalized if it has unit length according to the Euclidean norm (also called a
unit vector):
n
length(A) = ∑ A 2i = 1. (2.16)
i=1
Cosine distance de-emphasizes the lengths of the vectors preventing long documents to
dominate over shorter ones in similarity calculations. For normalized vectors, the cosine
is simply the dot product:
n
Sim(A,B) = ∑ Ai Bi . (2.17)
i=1
16
Although the document vectors are sparse and contain only a subset of the terms in the
corpus, the size of the vectors easily becomes too large for efficient processing. A major
difficulty in information retrieval is the high dimensionality of the feature space. The
number of unique terms that can appear in text is counted in tens or hundreds of
thousands. As the similarity measures based on Euclidean similarity are not well suited
for high dimensional data [AHK01], a reduction of dimensions us required for improving
the performance. The most notable problem with Euclidean distance measures is known
as the “the curse of high dimensionality”, which means that the contrast between the
closest and the furthest point decreases rapidly as number of dimensions increases. The
consequence is that all the vectors seem to be equally far away questioning the
meaningfulness of the similarity measure. Since information retrieval tasks such as
content filtering or text clustering rely heavily on calculating similarities between vectors,
the number of dimensions is the primary factor affecting the performance [Msa02,
BDO95]. A comparative study of dimensionality reduction methods for document vectors
can be found in [YaP97].
2.5 Semantics in language
Study of the semantics of language is a study of meaning of individual words and

sentences. One way of studying semantics of individual words is to organize them into
lexical hierarchy in which words have relations to each other illustrating the meaning of
the word in a context of related words. Examples of such relations are hypernymy and
hyponymy. A hypernym is a word with more general sense, for example vehicle is a
hypernym of car. A hyponym is a word with more specialized meaning: apple is a
hyponym of fruit. Synonyms are words with the same or similar meaning, for example
car and automobile are synonyms. A simple lexical matching technique fails to find
similarity between two documents that share a topic if they do not use the same
terminology to describe the shared topics. Words that are almost synonyms, but not quite
like forest and woods are called near-synonyms or plesionyms. Near-synonyms can be
mostly used interchangeably in a text but they are not being used to represent the whole
meaning but rather differences between meanings. The semantic similarity of words is a
17
shallow notion; Miller et al. [MiW91] defines semantic similarity of words as the degree
of contextual interchangeability or the degree to which one word can be substituted for
another in context.
Homonyms are words that are written in the same way but bear different meaning, for
example, a bank as a financial institution versus a bank as in riverbank. Polysemy is a
case where word’s multiple meanings or senses are related as for word take in take a
picture and take a look. Polysemy and Homonymy both cause ambiguity for the
semantics of a word, which is often fairly easy for a human to handle but notoriously
difficult for a computer. Due to morphological variation there are words that share a
common root and can be considered as referring to the same concept (e.g. house, houses,
housing). Lexical hierarchies denoting this kind of relations between words are often
used in computerized text analysis to uncover the semantics of an individual word.
Interpreting the meaning of the whole sentence is far more difficult: one property of
natural language is the lack of compositionality, which means that the meaning of the
sentence can not be predicted from the meaning of the individual words in it.
Collocations are special word combinations that often occur in a text more often together
in than just by chance. They refer to a certain unique concept in a world around us.
Below are examples of collocations related to computers:
central processing unit

hard drive
network adapter
Collocations are examples of word combinations that bear more meaning than what can
be induced from the combination of the words alone. Idioms are expressions whose
meaning is even less compositional. For example the idiom raining cats and dogs has
nothing to do with pets but instead it denotes a heavy rain.
18
A direct consequence of this diversity of a language is that a perfect search engine must
not only be able to deal with the linguistic characteristics of a term but also know its
position in relation to other terms in the semantic space of the user. There are many
approaches for capturing the semantics from text. These include
• Using Bayesian network to model the probabilistic relationships between users’

query terms and their informational goals [Heh98].
• Using latent semantic indexing with singular value decomposition to arrange
documents in a semantic space [Dee90].
• Using ontology as a knowledge base for measuring similarity between words
[RSM00].
First two methods are more computationally oriented while the third one relies on
external lexical resources. Typical use of lexical resource is to expand user’s query with
synonyms from the dictionary. Thesauri, a simple form of ontology, can be used for
providing related concepts to the ones detected in text. The idea behind dictionaries is to
expand query in order to find, not only the exact matches, but also the items that are close
enough to the user’s query. Taxonomies have been widely studied in text retrieval,
[Jin94, RiS95]. Gonzalo et al [Gon98] showed that the use of WordNet synsets (a set of
one or more synonyms) might result up to 29% improvement in the text categorization
task in comparison to keyword based approach.
2.6 Using ontology in capturing the semantics of text
The keyword-based search is probably the most common way of searching information.
In a typical search engine, user enters one or more terms related to the subject the user is
interested in. Query terms are matched to the terms in the document collection and
articles with best rankings are returned. However, some properties of language such as
synonymy, ambiguity and morphological variation decrease the accuracy of keyword-
based search. A typical query like ‘jazz concerts helsinki’ in a simple database like -query
19
results in a documents that have one or more of the exact terms written in them. More
advanced approaches count the number of hits and use stemmer to find a base form of a
term. However, since the algorithm treats each term as an isolated piece, it fails to capture
the semantic relations between the words in a document. Sometimes documents do not
even contain the terms that would best describe their content. For example, articles about
latest domestic news do not necessarily contain words latest domestic news. Only
understanding of the semantics behind the search words makes it possible to find
connection between two documents with similar semantic concepts but which does not
use same words for describing them and produce all the relevant results.
Ontology has been defined in philosophy as an explicit specification of conceptualization.

In computer science, ontologies are used to provide a knowledge source in the form of
nodes (concepts) and their relations. It provides a set of well-founded constructs that
define concepts relevant for the domain and their context. There can be different kind of
relations between nodes in ontology, most common ones being is-a and is-part-of.
Ontologies can range from a simple taxonomy to a much more complex set of logical
relationships. In a simple generalization hierarchy, the semantics of relations between
nodes is not defined, most of them being is-a relationships. For example, football is a
type of sports.
An example of a simple “flat” ontology is WordNet [Wor06], which is an attempt to

organize English words in a semantic network. The nodes in the network correspond to
word meanings while arcs correspond to relations between meanings such as synonymy,
antonymy (opposite meaning), hyponymy (is-a relation) and meronymy (part-whole
relation). The network contains about 144 000 English nouns, verbs, adjectives and
adverbs connected to synonym sets (synsets), each representing an underlying lexical
concept. Figure 2.1 presents a section of the upper-level concepts in WordNet (only
some of the arcs are shown).
20
Figure 2.1. Section of upper-level concepts in WordNet.
Ontologies have been widely studied in research of text analysis. Blake et al. [BlP01]
investigated the quality of features chosen to represent a document when features had
varying degree of semantics attached to them. They used an existing knowledge base,
Unified Medical Language System (UMLS), to map clauses in documents to medical
concepts and used Apriori-algorithm to learn bi-directional association rules. Their
findings indicated that the association rules based on concepts were more useful than
those based on word features. Hotho et al. [HMS01] used a domain-specific ontology in
text clustering to improve clustering results. They used taxonomy of concepts with
associated key terms to extract relevant concepts from documents. Each document was
represented by a vector of concepts, each entry denoting the frequency that a concept
occurs in a document. Documents were clustered using a K-means algorithm. The use of
ontology improved the clustering task by reducing the dimensions of feature vectors into
the size suitable for efficient processing with K-means and provided human readable
cluster representation. Pretschner et al [PrG99] used a publicly available hierarchy of
categories of web sites to create ontology. Each node in the ontology was associated with
a vector of key terms chosen from the set of web sites belonging to that category. The
system analyzed the pages that a user browsed and compared its profile to each of the
vectors associated with the nodes in the ontology. Best matching nodes were assumed
most related to the surfed page. A hierarchical user profile was formed based on the
observations about the user’s browsing history.
21
The growing interest in ontology-based information processing due to central role of
ontologies in semantic web research have led to emergence of standards both in the
representation of ontologies [RDF06] and in describing their semantics [Con01,
OWL06]. There is also an active research on automatic learning of ontology from a
document collection [Sac99]. As ontology construction is a laborious task, automation of
the task would be greatly welcomed. However, due to the difficulty of the task,
automated methods have been most useful when accompanied with manual guidance by a
human.
22
3 User modeling
User modeling in information retrieval systems focuses on developing an understanding

of information needs of the user in order to provide better support for human-computer
interaction. By knowing what user wants the system can provide user the kind of
information that will be most useful to that user and adapt according to the preferences,
skills and characteristics of that specific user. Efficient filtering according to user
preferences is especially important in wireless data services where small screens and
limited bandwidth bring additional obstacles in finding the right content.
An example of a simple user model is a user of a search engine, which is represented a

single query. Thus, any user making the same query gets the same results. A more
comprehensive user model would include other factors such as queries made by user in
the past or pages visited by the user. A good model takes into account user’s changing
preferences according to the current context, creating a need for adaptation. User may
have different preferences when using the system at his work or when using it at home.
User’s preferences may also change as a result of the interaction with the information. A
sophisticated system can adjust the content offering according to user's preferences in
relation to factors such as time and location. Synergistic effects can be achieved by
sharing the user model acquired by one application with other applications.
3.1 Content personalisation
Personalisation is a generic term denoting user modeling and user-adaptive system

features. On a content service, personalisation is the process of tailoring the content
offering according to the individual user’s tastes and preferences. The need for
personalisation is based on assumption that users with varying skills, interests and
expertise should not be offered the same user interface since no single interface will
satisfy every user. Personalisation is closely related to customization, which denotes
features in a product or a service that allow user to customize the service according to
23
their personal preferences. Examples of customization are downloadable ringtones and
wallpapers on mobile phones or portals on the web that show user-selected content
sections on the site such as local weather reports. Despite the ambiguity, in this thesis,
term personalisation is used to refer to algorithmic methods that apply user-modeling
techniques to deliver personalized content.
Many web sites today offer personalized content based on user modeling to improve the
user experience and to increase customer retention. There are many approaches to acquire
information about users and to use that information for personalisation. The most
common methods are content-based recommending and collaborative filtering. The main
difference between these two is that in the former case, the recommendations are based
on past behavior of the user, while in the collaborative filtering they are based on
behavior of like-minded people. Hybrid methods overcome some of the shortcomings of
these two methods by taking best of the both worlds, collaboration and content.
3.2 Collaborative filtering
Collaborative filtering (CF, also referred as clickstream analysis) [ShM95] is a

personalisation method that mimics the word-of-mouth model by recommending items
that have been popular among similar-minded people. Target user’s ratings of items are
compared to other users’ ratings and item is recommended to a user if it is highly rated by
some other users who tend to agree with target user. Consequently, an item, which is
rated poorly by similar users, is not recommended to a target user. The items that are not
seen by the user but are often found in the transaction history of users with similar tastes
are used as candidates for recommendations. Since recommendations are completely
based on similarities in ratings, content items need not be analyzed at all. An example of
collaborative filtering is the “people who bought this also bought that” -feature seen on
many online bookshops.
The transactions in collaborative filtering may be in the form of explicit ratings for items
such as books or movies or implicit feedbacks as a result of user simply viewing an item.
24
In the case of explicit feedbacks, user is given possibility to rate item in a numerical scale
or to just give positive or negative feedback. For example, in [PaB97], a learning
software agent allowed users to rate pages either hot (two thumbs up) or cold (two
thumbs down), while in NewsWeeder [Lan95] the feedback for articles was given in a
five-point scale. When implicit feedbacks are used, ratings are typically recorded as
binary (true/false) values. Lieberman [Lie95] used implicit feedbacks successfully in a
learning agent that assists a user browsing the web. In his approach, an implicit feedback
was scaled according to the duration that user has spent viewing an item.
A simple CF algorithm can be summarized with the following steps:
1. Create a matrix of m users with ratings to n items.
2. Find a set k of most similar neighbors of the target user according to correlations
between rows of the matrix m and the target user.
3. Create a candidate set s by choosing items which are rated positively by users in k
and which are not rated by the target user.
4. Weight items in s according to chosen criteria.
There are several weighting schemes for items in s. Typically first step is to rank items
according to their frequency in s. Secondly, items can be weighted according to the
closeness of the user in k to the target user so that the closer the two users are the more
weight the item gets. When calculating the closeness of users, additional weight can be
given to items according to how rare they are among the rates. The rationale behind this
is the same as when weighting term vectors according to their information content
[SWI01].
25
Table 3.1 illustrates the ratings of two users for six artists. Both of the users have stated
whether they like or dislike the artist in question. Columns, which are marked with ‘-‘,
have no rating from that user.
Table 3.1. Ratings of two users for collaborative filtering algorithm.

Items User A User B
1. Like -
2. Like Like
4. Dislike -
5. - Like
6. - -
CF algorithm tries to find similar users for a target user by computing similarities based
on a common set of items that users have rated. In the simplified example above, user A
and B have both voted item 2 interesting and thus they might also agree on items 1 and 5,
which were liked by one of them. As a result, CF algorithm could recommend item 1 for
user B and item 5 for user A. Item 6 is has no rates from either of the two users, and thus,
lacks information in order to judge whether it would be liked or disliked by A or B.
Table 3.2. Recommendations produced by collaborative filtering algorithm.

User A User B
Recommendations 5 1
CF techniques are easy to implement and provide good recommendations with little or no
intrusion to user. They can perform well even if there is not much textual content or
metadata associated with items. CF also has the ability to recommend user an item, which
is relevant to user but which could not be derived from the user’s browsing history alone.
However, they also bear some well-known limitations. Finding the k most similar
neighbors in real time easily cause problems in scalability as the complexity of the
algorithm increases linearly as a function of the number of users. This will become a
26
problem if the set of the viewed items is large, which is often case in web personalisation
where items are pages browsed.
When explicit ratings are used, most users do not rate items and therefore the probability
to find a set of users with highly similar ratings is low. Large, sparse item sets decrease
the likelihood of a significant overlap among users causing less reliable
recommendations. Since the system’s knowledge about the content items is derived
solely from the users feedbacks, recommendations tend to be biased to those items that
have been popular in the past, severely limiting the diversity of the recommended
content. Perhaps the most notable shortcoming is a phenomenon known as “new item
problem”, which is caused by the lack of records for any new or recently added item. As
the recommendations are solely based on historical data, the recently inserted item cannot
be recommended until it has been rated (or visited) by sufficient number of users.
There are two basic types of CF, user-based and item based. While user-based
approaches compare user’s choices to those of others to find similar-minded people, the
item-based approaches identifies similarities in the items themselves. The item-based CF
constructs the recommendations list by comparing each item in the list of user’s rated
items to ratings of other items and selects those with the highest correlation. Item-based
collaborative filtering avoids the bottleneck of user-user comparison by avoiding the need
of comparing between possibly millions of users. Comparing items usually requires fewer
operations since the number of items is typically smaller than the number of users. There
is, however, still a problem of data sparseness and the lack of recommendations for new
items [SWI01, MJZ03, MMN01].
3.3 Content-based filtering
Content-based filtering is based on analysis of the textual content and is closely related to
text categorization problem. Each item will be classified into one or more categories and
recommendations are based on the similarities in classification. While CF systems base
their recommendations on similarities among users, content-based filtering systems
27
generate a personal profile for each user in the system and recommend items according to
similarities in content items and a user profile.
The integration of semantic similarities for items provides two primary advantages over
user-based methods. The semantic features of items provide clues for underlying reasons
why user is interested in particular content item. While user-based methods solely record
the items user has accessed, content-based methods record the actual semantic features of
the accessed content item, avoiding the bias to popular items among other users. It helps
unpopular items to find visibility, enabling the better use of the long tail (see chapter 2.1).
Also, in the case of new items or sparse data sets, the system can still base
recommendations on semantic similarities of content items.
Content-based filtering has also limitations: items must be in machine-readable format

and contain text or other metadata. With content such as audio files or images this is
sometimes difficult to achieve. Since the system recommends items that are similar to the
ones the user has already seen, there is danger of “tunnel effect” unless the user is given
access to yet unseen topics as well. The similarity measures typically used have no means
to assess factors such as novelty or quality of the content item; it may be similar in topic
to what user likes, but the quality, style or point-of-view may not match with user’s
tastes. Another downside is the large amount of memory and storage space needed for
storing and manipulating profiles. Content-based methods are usually more demanding
since they require natural language processing techniques for understanding the contents
of the documents [MJZ03, ShM95].
Suppose that we have a pure content-based filtering algorithm, which is capable to learn
the genres a user is interested in by analyzing the metadata of the items that user has
rated. It can then recommend user other artists belonging to those genres. Table 3.2
shows an example of the ratings of two users in such a system.
Table 3.3. Ratings of two users for content-based algorithm.

Items User A User B
28
1. John Coltrane Like -
2. Miles Davis Like Like
4. Guitar Essentials Dislike -
5. Bossa Nova Brazil - Like
6. Dave Brubeck - -
Content-based recommendation system would learn from the item metadata (in the first
column in the table 3.3) that user A likes Jazz and user B likes Jazz and Bossa Nova. As a
result, A would be recommended item 6 since it also belongs to Jazz genre. User B would
be recommended item 6 and also item 1, which is in Jazz genre and not yet rated by B.
Table 3.4. Recommendations produced by content-based algorithm.

User A User B
Recommendations 6 1,6
As can be seen from the recommendations, a purely content-based algorithm produces

slightly different results than the CF algorithm. While CF algorithm is unaware of the
characteristics of the content item, the content-based algorithm ignores the information
hidden in the ratings of the like-minded users.
3.4 Hybrid methods
As can be seen in the examples above, neither pure CF algorithm nor purely content-
based algorithm is capable of producing all the relevant recommendations when used
alone. A hybrid solution can be implemented to overcome these shortcomings by
combining the recommendations produced by both methods. Hybrid method would know
from collaboration data that user A might be interested in item 5 and user B in item 1.
Using a content-based prediction it would also know that both users are probably
interested in item 6. As a result, the recommendations based on hybrid method would be:
29
Table 3.6. Example recommendations by hybrid algorithm.
User A User B
Recommendations 5,6 1,6
Several approaches have been suggested for combining content-based recommendations

with collaborative information. Melville et al. [MMN01] use content-based predictions to
fill the sparse user-rating matrix into full rating matrix and then use CF to provide
recommendations. Soboroff et al. [Son99] use latent semantic indexing (LSI) for term-
document matrix, which multiplied by content profile matrix. Singular value
decomposition (SVI) is calculated for this matrix and recommendations are produced by
ranking user profile against content profiles in LSI space. Pazzani [Paz99] presents a
combined approach which utilizes content, collaboration and demographic. In his
approach user profiles are constructed from the weighted words in the training example
and CF is applied to this content-based user profile matrix.
30
4 Cluster analysis
Clustering is the unsupervised classification of patterns (observations, data items or

feature vectors) into groups (clusters) [JMF99]. By another definition, clustering problem
is about partitioning a given data set into groups such that the data points in a cluster are
more similar to each other than points in different clusters [GRS98]. Cluster analysis is
an important field in data mining and knowledge discovery and is used in explorative
data analysis, which aims at gaining useful understanding of the data through
observation. Data items, having features of varying values, is arranged into groups so that
similar objects are in the same group and dissimilar in different groups. It is closely
related to classification task where purpose is to assign each data item to one or more
predefined classes. However, classification is an example of supervised learning in which
training set is provided with information of the correct classes. By examining the features
in training data and classes, the classification algorithm tries to form a function f to map
each new item into correct class. In clustering, no predefined classes exist but instead, the
algorithm produces classes in which items in data set are classified during the clustering
process.
What is a cluster then? There seems to be no definite answer for this question. Some
authors have defined clusters in terms of internal cohesion and external separation. While
this is intuitively sound, it does not provide a theoretical basis for clustering.
Unfortunately, there is no general definition for a cluster that could be stated in a terms of
mathematics. The definition of a cluster and the goal of the clustering vary according to
the data and the application. Human eyes are good at detecting patterns and structure in
seemingly random source, a property that makes them natural tools for judging a
clustering result. This is not always a good thing: people tend to see structure in data even
when there is no structure at all. Another characteristic of a clustering is its computational
complexity. When clustering large amount of high-dimensional data-objects, there is
always a trade-of between quality of the clusters and the execution time of the algorithm.
Since the number of possible clustering results is exponential to the number of items to
31
be clustered, it is clear that iteration of all possible results not feasible. As a consequence,
many clustering algorithms, including hierarchical algorithms, have exponential time
complexity.
Term segmentation is mainly used in market analysis where clustering techniques are
used for discovering user groups with similar interests or groups of shopping items
frequently purchased together. In this thesis, term user segment refers to a group of users
sharing similar interest while the term clustering denotes the technical process of finding
such segments.
4.1 Clustering in information retrieval
Document clustering has been studied in information retrieval mostly as a method for
increasing the accuracy and performance of search. Since documents inside one cluster
are similar with each other, they are often relevant to the same or similar query terms.
One can increase the efficiency of a similarity search by searching only documents
belonging to the same cluster. In Scatter/Gather [Cut92], clustering was used as a primary
information retrieval method. The document collection was clustered in the document
groups allowing quick glance over the structure of the whole collection. The user selected
the most interesting groups, which were then combined and clustered into more detailed
groups for further browsing. Knowledge of clusters can be utilized for achieving scalable
collaborative filtering. Clustering based CF algorithm selects one or more clusters, which
are closest to target user and finds k nearest users from those clusters only.
There are many applications for clustering in IR systems. By displaying content in

clusters, one can get a quick overview of the subjects the content covers. Clustering can
be used to improve the usability of the user interface as in Scatter/Gather [], which relied
on clustering as a primary method for browsing content. Another application for
clustering is segmentation of the user base according to their browsing behavior in order
to discover users with similar interests.
32
In NLP, a common use for clustering techniques is in word sense disambiguation
problem. Words tend to have multiple senses and meanings in language. The correct
interpretation for such a word can be achieved by looking the context in which it appears.
Clustering can help in disambiguation by grouping the different usages for a word into
different clusters according to different contexts it has. A correct interpretation for a word
can be resolved by selecting the cluster that is most similar to the current context and
using the meaning associated with that cluster. In machine translation clustering has its
use in determining features of a word by using clusters as generalizations of a certain
word types [MaS02]. For example, when determining the correct preposition for the word
Monday, one can use knowledge of clusters for deriving that the word behaves in similar
manner as other weekdays seen on text and choose the preposition used with them.
4.2 K-means
K-means [Mac67] and its variations are the most well known clustering algorithms due to
their simplicity and ease of implementation. K-means is a partitioning algorithm, which
organizes the items in a data set into k partitions, where each partition represents a
cluster. The k-means algorithm first selects randomly k items, which initially represents
centers of the clusters. For each of the remaining items, the algorithm finds the closest
cluster and assigns item to it. After each round of operation the cluster means are
recalculated. The aim of k-means clustering is the optimization of an objective function
that is described as by the equation
c
E = ∑ ∑ d ( x, mi ) , (4.1)
i =1 x∈Ci
where, mi is the center of cluster and d(x, mi) is some distance metric such as Euclidean
distance, between a point x and cluster center. The criterion function E attempts to
minimize the distance of each item to be clustered from the center of the cluster it
33
belongs to. The process continues until the centers of the cluster stop changing. K-means
algorithm is composed of the following steps:
1. Select k initial centers

2. Assign each data point to the cluster that has the closest centroid
3. Recalculate the cluster centroids
4. Repeat steps 2 and 3 until centroids do not change
The result is a non-overlapping, non-hierarchical clustering in which each member of a

cluster is closer to its own cluster than any other cluster. K-means converges to a local
minimum (finding a global minimum is NP complete). Since the initial centers are chosen
randomly, different runs result in different final clusters, which is most notable when
clustering small amount of data.
The complexity of the K-means is O(kni), where k is the number of the clusters and n is
the number of items to be clustered and i is the number of iterations. Number of iterations
can in theory be large but most in most of the cases the algorithm converges quite
quickly. Reassigning of clusters and updating the means, are both O(n) in complexity and
only constant number of iterations are computed. As the algorithm is linear in both is
main components, K-means is faster than hierarchical methods in clustering amounts of
large data.
However, the fact that its performance depends heavily on the initial conditions [PLL99]
is a serious disadvantage for K-means. Another limitations is that the number of clusters
must be provided as a parameter. However, there are approaches that try to approximate
the number of clusters [SiR99] as well as optimizations to improve the performance
[ARS98, Kan02].
34
4.3 Hierarchical clustering
Hierarchical clustering algorithms are popular due to their simplicity and their ability to
produce hierarchical clustering result. Hierarchical clustering algorithms are either
agglomerative or divisive. In divisive methods, clustering starts with only one cluster. In
each round of operation, one cluster is divided into two smaller clusters until desired
amount of clusters is left. Agglomerative algorithms start with each data item as its own
cluster. In each round of operation, two most similar clusters are merged until desired
amount of clusters is left. Agglomerative methods are more common than divisive
methods, mostly because they are more straightforward to implement. The execution of
an agglomerative algorithm can be defined as
1. Create a singleton cluster of each item

2. Find closest cluster for each cluster using some similarity measure
3. Merge the two clusters with most similarity
4. Update cluster centers
5. Repeat from 2 to 4 until number of clusters equal to k
The result of hierarchical clustering algorithm is a dendogram, a hierarchy of merging

steps during the execution of the algorithm. In divisive methods, it progresses from one to
many, and in agglomerative methods from many to one, as demonstrated in figure 4.1.
35
Figure 4.1. Merging steps of a hierarchical clustering algorithm.
A similarity function is used to select the two clusters to be merged at each step.
The three main approaches for measuring the distance between two clusters are
• Single linkage: measures the distance between the closest members of the clusters.
• Complete linkage: measures the distance between the most distant members.
• Average linkage: measures the average distance between the members of the
clusters.
The similarity function must be monotonic so that the similarity of the merged cluster
with any other cluster is always less than the similarity between either of them alone and
any other cluster.
∀c',c'': min(sim(c,c'),sim(c,c'')) ≤ sim(c,c'∪c'')
36
A monotonic similarity function guarantees that merging does not decrease the overall
similarity, which would lead to an inconsistency in the hierarchy of steps.
Hierarchical clustering algorithms (as most clustering algorithms) are complex in terms
of computation required. In n merging (or division) steps each cluster must be compared
to each other in order to find the closest pair. Even though the number of operations
needed decreases in each round in each round of operation, the comparison step requires
n2 operations.
4.4 Group Average Agglomerative Clustering
In group-average agglomerative clustering, the similarity between any two clusters is

defined as the average similarity between members in the clusters. It can be computed
more efficiently than complete-linkage clustering and the clustering result has shown to
be better than the single-linkage clustering in some applications [MaS02]. The average
similarity for a cluster C can be calculated as
1
sim(C ) = ∑ ∑ s ( a, b) ,
| C | (| C | −1) a∈C b ≠ a∈C
(4.2)
where s(a,b) is a cosine similarity between members a and b. The algorithm starts by
constructing the initial set of clusters G by placing each vector to its own cluster. In each
step of iteration, the algorithm identifies the two most similar clusters and merges them.
The algorithm starts by finding two clusters Ci and Cj which maximize S(Ci u Cj ) and
creates a new set of clusters G’:
G ' = (G − {C i , C j }) ∪ {C i ∪ C j } . (4.3)
The iteration terminates when |G| = k. The complexity of computing the average
similarity of the items in a cluster is O(n2). If the average similarities are computed each
37
time a cluster is merged then the algorithm would be O(n3). However, the algorithm can
be speeded up by pre-computing for each cluster the sum of its members so that it can be
easily updated after merge (by simply summing the similarity between the merged
clusters) and it allows easy computation of the average similarity [MaS02].
Initially, the algorithm computes the similarities between the singleton clusters and sorts
them according to the similarity. First, n2 similarities are computed and sorted, which
takes O(n2 log n). In each of the n merge operations, a pair of clusters with the highest
similarity is identified and merged. The similarity matrix must be updated so that the two
chosen clusters are removed from the matrix and replaced with the similarities with the
merged cluster. Each iteration takes O(n log n). The pairwise similarity between n items
must be computed only in the first iteration and the n merging steps can be performed in
linear time resulting to an overall complexity of O(n2 log n).
A profile for a cluster can be constructed by calculating the average profile of the
members in the cluster. A cosine similarity measure can be used for comparing a cluster
profile with a user profile:
S (C , x) = p (C ), p ( x) . (4.4)
As the cluster profile represents a typical member in that cluster, averaging over all of the
members in the cluster is probably not the best representation since it is vulnerable to
outliers. A more suitable profile can be constructed by averaging the profiles of n most
central members in a cluster. Let R be the set of n most central members, then a trimmed
average profile pn(C) can be calculated as
p ' n (C )
p n (C) = , (4.5)
|| p ' n (C ) ||
where p' n (C ) is un-normalized sum profile of the n most central members
38
p ' n(C ) = ∑ p ( x) . (4.6)
x∈R
39
5 Creation of user and content profiles
5.1 Architecture of the content service
The personalisation engine is embedded into a server part of a client-server content

distribution application, which gathers content from the content source, in this case is a
content provider’s FTP-server, and delivers it to mobile clients. Content items are fetched
from the server in XML format after which they are parsed and stored into the database.
The copies of content items are fed to the personalisation engine, which creates a profile
for each content item. The process of retrieving and profiling new content is repeated
once in a minute in order to guarantee the freshness of the content.
Users use web browser or a mobile client to access the content items. The content is
personalized so that the headlines best matching the user profile are shown boldfaced. In
addition, each user is provided a “my news” list, which contains personal
recommendations from the latest news items. For any content item, user can request a list
of similar items to be displayed. The server is implemented in Java and it runs on any
operating system, which has Java Runtime Environment available. Clients are mobile
applications implemented both for J2ME and Symbian platforms. A web-based admin
interface is provided for ontology updates and user and content analysis features. The
figure 5.1 provides a general overview of the system.
40
Figure 5.1. Service architecture.
Numbered entities in the figure are:
1. Content feed is a (usually continuous) feed of content items. The system supports
various different content types such as XML, plain text and PDF documents.
41
2. Content poller is a Java thread, which periodically fetches content from the
content source such as FTP or HTTP server.
3. A content handler parses the content item and forwards it to be stored into an SQL
database.
4. Personalisation engine creates profiles for each user and content item in the
system.
5. The metadata of the content item is fed into personalisation engine.
6. Each user subscribes to one or more content feeds.
7. Client processor transforms the content into suitable format for the device in
question.
8. Administration interface is provided for the service administrator.
5.2 Leiki Targeting personalisation engine
Leiki Targeting is a software engine for real-time learning, profiling and personalisation.
It automatically creates comprehensive profiles for all the users and content in the
system. The content is profiled by semantic analysis of textual content, guided by an
ontology of thematic concepts related to the domain. User profiles are automatically
learned from the usage history and the user’s feedback on content items. User preference
learning is automatic and there is no need for explicit preference settings from the user.
Explicit feedbacks can be used when available but the system learns quite well just by
observing the reading patterns of users. By comparing the user profiles to the content
profiles, the system can generate best matches –lists for any user or content item, sorted
according to their similarity. Figure 5.2 illustrates the functionality of the personalisation
engine.
42
Figure 5.2. Leiki Targeting personalisation engine.
5.3 Ontology
The content profiling process is guided by ontology, a collection of categories organized

according to their relation with each other. The ontology contains concepts that are
relevant in the domain and patterns for identifying those concepts in the content.
Ontology is constructed using the combination of common knowledge from the ontology
constructor with the help of statistical calculations from the content.
The ontology consist of
• A set of patterns
• A set of categories
• References linking patterns to categories
• Relations linking categories and forming a tree structure
43
A pattern can be a word, a phrase or a regular expression. Patterns are associated with
categories, which denote concepts relevant for the domain. Patterns guide the extraction
of concepts from the content items in the process of content profiling. Multilingual
ontology can be constructed by translating the patterns into target language. The
categories, as well as the derived profiles, are language independent (some categories
may not exist for every language though). That is a useful property if the aim is at finding
similarities between documents written in different languages.
Figure 5.3. A category with translations to multiple languages.
A category has one or more connections with other categories denoting the semantic
relationships between concepts in the domain. The resulting structure is a directed acyclic
graph (DAG), in which nodes represent concepts and connections between them
represent their generalization hierarchy. The difference of this structure to a simple
hierarchical taxonomy is that a child (more specialized term) can have many parents (less
specialized terms) and there is a possibility to have different kind of relations between
concepts allowing more sophisticated reasoning when creating a profile. The Leiki
General Ontology currently consists of about 17000 categories. The uppermost levels are
constructed according to the International Press Telecommunications Council (IPTC)
topic sets. The rest of the ontology contains more specific categories placed under this
publishing industry -standard top-level classification. Below is an example of a branch in
the Leiki General Ontology in which IPTC categories are shown in boldface and
44
additional, more specific categories, are shown in regular text.
Geography -> Europe -> Finland -> Southern County of Finland -> Helsinki Area ->
Helsinki
Figure 5.4 shows a typical generalization hierarchy where concepts are generalized when
traversing up in the tree. When the semantics of edges is not specified it corresponds to a
generalization relation although other kinds of relations can be defined as well.
Europe Euro area
Faroe
Finland Luxemburg Italy
Islands
Northern Southern
county of county of
Finland Finland
Helsinki
area
Espoo Helsinki Vantaa
Figure 5.4. A section in the Leiki General Ontology (not all connections are shown).
The knowledge about the type of relation can be used in choosing a suitable weighting
scheme in a profile creation. For example, a user’s search query can be made more
45
restrictive by expanding the categories only if the relation denotes a highly relevant
relationship between parent and child, and thus, the user’s query can be presented with a
set of categories in the ontology that are highly relevant for the terms the user entered.
5.4 Ontology construction
The ontology was constructed using ontology editor, a custom tool for constructing
ontologies. The ontology constructor used the editor to define categories and set the
connections between them. The GUI of the editor is split into two panes. In the left pane,
categories are represented as nodes in a hierarchical tree view. Clicking a node with a
mouse expands the node and reveals its children. When a node is selected in the tree, the
right pane shows the patterns associated with the category along with its parent and child
relations. New patterns for a category can be added by inputting them into keywords list.
A list of possible parents and children are shown allowing addition of connections to
other categories.
46
Figure 5.5. Ontology editor.
The Leiki General Ontology currently contains over 17000 categories with over 19000
patterns associated with them. The two uppermost levels are constructed according to the
International Press Telecommunications Council (IPTC) topic sets. The rest of the
ontology contains concepts related to Financial Times news subjects.
5.5 Content profiling
The content profiler module analyses incoming content and creates a profile for each
content item. The four-step process of profile creation is illustrated in Figure 4.6
47
Figure 5.6. Content profiling process.
In the pre processing -phase, the punctuation is removed from the metadata and possible
markup-tags are stripped away.
In the matching phase, the patterns in the ontology are matched against the metadata of
the content item and found categories are chosen for the initial profile.
Found categories are expanded in the reasoning phase so that additional categories are
included into the profile according to the connections in the ontology.
Selected categories are weighted in the weighting phase according to their relevance in
describing the content item. The weight of a category is based on the local frequency of a
category (number of matches to that category), the global frequency of that category (the
cumulated frequency of that category in all the content) and the position of the category
48
in the ontology. Profiles are normalized with their length in order to avoid long content
items to dominate over shorter ones in the matching process.
The outcome of the content profiling process is a human readable, language independent
profile, which contains categories with associated weights according to the topics found
in the metadata. Figure 4.7 shows a profile that was created from the headline and the
ingress text of an RSS news item.
Swinging in cricket-mad Pakistan

ISLAMABAD - US PRESIDENT George W. Bush, an avid baseball fan,
yesterday tried his hand at cricket and was pronounced 'not bad'
for a first-timer.
Figure 5.7. Profile created from the headline and the ingress of an RSS news item.
There are fifteen categories in the resulting profile. The category cricket has the biggest
weight since it has two occurrences in the text and the other being in the headline
suggests that it is highly relevant in describing the contents of the text. The category
49
pakistan has relatively high weight as well since it, in addition to being mentioned in the
headline, inherits weight from one of it’s children, islamabad. In general, the categories
with biggest weights have resulted from a direct match to one or more of the patterns
associated with the category. The rest of the categories are the result of the expansion
according to the connections in the ontology. The further the connected category is to the
matched category, the less weight it is given. Weights are further adjusted according to
their information content (see chapter 2.1).
5.6 User profiling
User profile is automatically learned from the user’s browsing behavior. Each time a
users accesses content item, the user profile is adjusted according to the profile of the
content item that the user accessed. The user profiling process is illustrated in figure 5.8.
Figure 5.8. User profiling process.
The user profiling process starts by user accessing a content item (in this case an RSS
news item). This results a feedback action, which can be one of the following:
50
• An explicit negative feedback: user dislikes the content item.
• An explicit positive feedback: user likes the content item.
• An implicit positive feedback: user simply reads the content item, which the
system records as a slight positive feedback.
The feedback action is sent to the server with the id of the content item for user profile
adjustment. User profiling is an ongoing process, it starts when the user first time uses the
system and the profile gets more comprehensive each time the user reads a content item
or gives an explicit feedback. The use of implicit feedback is based on an assumption that
user reading an article suggests that the user is interested about the subjects present in
that article. To illustrate the profile formation, lets consider user accesses the following
three content items.
Swinging in cricket-mad Pakistan

ISLAMABAD - US PRESIDENT George W. Bush, an avid baseball fan,
yesterday tried his hand at cricket and was pronounced 'not bad'
for a first-timer.
Screaming Intruder Jumps White House Fence (AP)
AP - A screaming intruder made it onto the front lawn of the
White House Sunday while President Bush was at home before being
apprehended by Secret Service officers.
High oil prices threaten world economy: IMF
High oil prices are storing up trouble for the world economy by
creating serious imbalances in national finances, not least in
the US, the International Monetary Fund warned on Thursday
The user profile is adjusted according to the profiles of these three items and the resulting
profile is shown in figure 2.7.
51
Figure 5.9: Example of a user profile after user has accessed three content items
Like content profiles, a user profile is human interpretable and can be represented in any
of the languages supported by the ontology. In order to avoid over-developing of profile,
weights are slowly decayed by time. This enables the system to better adjust to user’s
changing preference as the previously interesting but lately ignored topics are slowly
diminishing in the profile.
52
6 Clustering of users
The previous chapters provided some background for text analysis, user modeling and
clustering and gave examples of their use in information retrieval tasks. This chapter
introduces an application that combines these three components with the goal of
clustering users into segments with similar interests. Found segments are represented as
cluster profiles denoting the profiles of stereotype users in the user base. The aim of
clustering process is to find natural user groups in data so that users inside a cluster are
share more interests with each other than the users in other clusters.
The clustering process starts by transforming the data into the format suitable for
processing by the algorithm and by removal of incorrect or erroneous items. Once the
data is preprocessed and the relevant are features selected, a suitable algorithm is chosen
for the clustering task. Since there is no general clustering algorithm that would work
with every data set it is important to consider the suitability of the algorithm for this
specific data and purpose.
The last step is the interpretation of the result. The purpose of clustering may be purely
explorative although usually there is some prior understanding of what kind of
information clustering can provide. In the case of user segmentation, the objective is to
find the stereotype users. The presentation method for the clustering should be chosen
according to the purpose of clustering. Good visualization techniques help in representing
the results in a meaningful way.
6.1 Clustering algorithm
Two alternative clustering algorithms were presented in Chapter 3: K-means for

partitional clustering and group-average hierarchical algorithm for hierarchical clustering.
A chosen algorithm should be scalable enough to cluster hundreds of thousands users
within a reasonable delay. Real-time clustering is not a strict requirement as the algorithm
53
can be run as a batch process at specified time intervals. In that light, K-means is a good
choice to start with due to its scalability for large data sets. It has limitations though. The
result is sensitive to initial seeds and it has a tendency to produce different clustering
result on each run. Since the result is highly dependent on the quality of the initial seed
clusters, using heuristics in choosing the seeds seems to yield better results.
Our hybrid algorithm uses k-means in conjunction with hierarchical algorithm as a post-
processing step. As K-means stops to one of the many local minima, the use of
hierarchical algorithm to select the initial seeds should eliminate most of the bad ones.
The algorithm first chooses randomly sqrt(n) users and clusters them with the hierarchical
algorithm. The centroids of the clusters are provided as an input to K-means. The initial
clustering can be performed in time O(sqrt(n2 )log n), where n is the total number of items
to be clustered and k is the number of clusters. Combined with K-means it yields a total
complexity of
O( n 2 log n + kni)
This algorithm always produces the same result for the same data.
6.2 Data
The data consists of catalog of content items with associated metadata and a log file
containing the user transactions. The service is a mobile content service providing
downloadable ringtones. The content catalog contains 4745 ringtones with the following
information attached to them:
• Product id
• Title (of the song)
• Artist
• Type (Polyphonic or True Tone)
54
• Timestamp
Since the title of the song does not provide much information about the topic which, in
the case of ring tones, denotes the music style of genre of the song, it is not included in
the profiled metadata. Therefore, the only metadata that is used in profile creation is the
name of the artist. Figure 6.1 shows a profile created from the input 2pac.
Figure 6.1. Profile created from the input string 2pac.
There are nine categories in the resulting profile. The category with the biggest weight,
tupac shakur, is a result of a match to one of the patterns associated with the category
tupac shakur. The rest of the categories are the result of the expansion according to the
connections in the ontology.
The user log file contains a total of 260 000 user transactions, each transaction consisting
of
• User id
• Product id
• User Agent
• Timestamp
55
This data is enough to be used on the basis of our segmentation. In addition to
timestamps, the only piece of information that is not utilized is the User Agent –field in
the user log file as we want to base our segments purely on user interests and we do not
want to include information about user’s phone into model. Timestamp might be useful in
two ways:
• The popularity of the product is dependent of the time it has been available to the
users. If two products are equally popular among the users in the cluster, the
newer one is probably better candidate for a recommendation since it has
generated same amount of transactions in less time.
• Since user’s tastes are changing by time the more recent transactions reflect more
user’s current interests than the older one. The take this into account in the user
model the transactions could be weighted according to their freshness.
However, the current approach ignores timestamps. There are several reasons for this.
Firstly, we do not know how reliable the timestamp associated with the content item is.
Does the timestamp denote when the item was first time added to the catalog or is the
time it was last updated? Secondly, if the user’s history in the data is limited to few
months, as in this case, there probably has not been much change in the user’s interests.
Let’s consider a user that has ordered the three ring tones below:
• RUN DMC – It’s Like That

• Eric Prydz – Call On Me
• Armand Van Helden – Hear My Name
The resulting user profile is shown in Figure 6.2.
56
Figure 6.2. Example of a user profile after user has accessed three content items.
Next step is to apply cluster analysis for segmenting the generated profiles. The input for
the clustering is a set of vectors. The output is a set of vector clusters. The number of
clusters must be given as a parameter to the clustering algorithm.
6.3 Pre-processing of the data
Data pre-processing consists of transforming the input data into the format suitable for
our clustering algorithm and removing incorrect or erroneous items. No scaling of values
is needed since we are dealing with only one kind of properties (item downloads) and
binary values (downloaded, not downloaded). The 260 000 transactions in the data were
performed by a total of 43 000 users. The number of transactions per user ranges from 1
to a well above a hundred, an average being 6,1 transactions per user.
The quality of the data was good in general although there were some content items in the
order log, which could not be found from the content catalog. We did not filter out these
items, since they did not cause harm to clustering procedure. However, some selection of
users was performed: Users with only one transaction were not included. This is justified
57
because one transaction is hardly enough for learning of reliable user model. There were
6,7 % of such users. Also, users with more than 40 transactions were not included since
they typically contained items from most of the genres and were often a cause of
erroneous merging of two distinct clusters. There were 0.3% of users having over 40
transactions. After these preprocessing tasks, the total number of users included into
clustering procedure was 38 500.
6.4 Visualization of clusters
Clusters are presented in a user interface by displaying a cluster profile accompanied with
a list of recommended items for the cluster. The cluster profile differs from the content
and user profiles in the way the concepts are chosen from the ontology. In cluster profile
only the music genres are displayed. This way the profiles stay relatively short while still
reflecting all the relevant areas of interest in the cluster. There are many options for
forming a profile for a cluster of which the most obvious ones are
• Using the profile of the medium user as the cluster profile

• Calculating a (weighted) average profile of all the users in the clusters
• Calculating the average profile of the n most central users of the cluster
• Creating a profile from the cluster recommendations list
The first option is obviously not perfect since it displays only the interests of one user
leaving out a lot of interests possessed by other users in the cluster. The second option
removes this shortcoming by combining the interests of all users into one profile.
Optionally, the more central users are given more weight in the profile. Another option is
create average profile of n most central users in the cluster using, for example, a fixed
percentage of all users in the cluster. The fourth option simply creates a profile from the
list of recommendations provided for the cluster by profiling (see chapter 5.5) the items
in the list. Since cluster recommendations by definition reflect the common interests of
58
the (most central) users in the cluster, this will intuitively provide a good profile for a
cluster.
After comparing different options, the cluster representation was chosen to be a

combination of demographics data, ontology-based profile and a list of recommendations
for the cluster as show in figure 5.1.
Figure 5.3: A Cluster profile.
59
Each of these three components provides information about the contents of the clu ster in
a slightly different angle. Demographic data plays an important role in making the
concept of a cluster more tangible to the viewer as it is something that the viewer is
already familiar with. Demographics data is calculated as averages of all users in the
cluster. In the example cluster in figure 5.1 demographics data consists of age, gender
and number of downloaded items. Since user demographics are not available in the log
data, N/A is displayed in the corresponding fields. The average number of downloads
denotes the number of content items the user has downloaded from the service. This is
valuable information since it enables comparison of clusters according to the ARPU
(average revenue per user) value for the users in the clusters. Based on this knowledge,
the service provider may choose to direct marketing campaigns on those segments, which
are likely to generate most revenue or, in contrary, to send special offers to the less
committed user segments.
Below demographic data is shown a cluster profile, which provides general overview of
the topics that the users in this cluster are interested in. This profile is slightly different
from user and cluster profiles presented earlier since not all of the levels in the ontology
are shown. In order to keep profile short and easy to understand, only categories related
to music genres are shown.
Below the cluster profile, a list of recommended items to the (medoid user of the) cluster
is shown. From the list of recommendations, the service provider can easily see what
products are most in demand for the users in this cluster. By combining all these there
components into single representation, we are able to provide viewer the following
information about the cluster
• What kind of users does this segment contain?

• What kind of topics users in this cluster are interested in?
• What are the actual products (in this service) that the users in this cluster would
like to have?
60
The above questions were the primary motivation for performing the segmentation for
this data, and the chosen representation was chosen to serve that purpose. In a case with
different motivations and objectives choosing another representation might provide better
results – as is always the case in cluster visualization.
6.5 Results
In order to find out what kind of stereotype users can be found from the data, a sample set
of 4000 users was chosen randomly and clustered into 3 clusters. A web-based user
interface was developed as a part of this thesis for observing the clustering result.
The profiles of the resulting clusters are shown in the figures 6.1 – 6.3.
61
Figure 6.3. Profile of segment 1 containing 89,5 % of the users.
62
Figure 4. Profile of segment 2 containing 4.1% of the users.
63
Figure 5. Profile of segment 3 containing 6.4 % of the users.
Cluster profiles and the recommendation lists shows that the users in the three segments
differ in their music tastes. The profile of the first segment shows biggest interest in hip-
hop & rap followed by pop and electronic music. This is the kind of music that currently
gets most of the airtime in radios, which is reflected in the size of this segment that
contains nearly 90% percents of the users. The second segment is the smallest one with
only 4,1 % of the users. The users in this segment show interest towards rock music with
some hip-hop & rap flavor as well. The third segment, containing 6,4% of the users,
contains mostly users that like to choose TV themes as their ringtones.
64
6.6 Validation of the result
How can one be sure that the clustering result really is a good description for the data or
would there be a better result, which was not found? In validation of clustering, the
objective is to choose a suitable measure for identifying the sets of clusters that are
compact and well separated. Due to unsupervised, explorative nature of clustering,
general validation schemes do not exist, and as such, the validation method must often be
tailored for the specific clustering algorithm and data individually. The validation
procedure may contain technical and semantic components. In technical evaluation,
technical qualities of the methods such as scalability and flexibility of the algorithm is
assessed. In semantic evaluation, the quality of clusters is measured using metrics such as
compactness and separation of clusters. Compactness of clusters indicates how close
items in the cluster are with each other, measured by calculating the variance of the items
in relation to cluster center. The clustering result can also be judged by a human. This
requires proper visualization of clusters possibly with the projection of high-dimensional
data into two or three dimensions. The mere judging by looking at the clustering result is
subjective, and suffers from a human tendency to recognize structure even when no
structure exists, which leads to need for more objective methods [HBV01].
Since the K-means method aims at producing compact clusters by minimizing the sum of
the squared distances of data points to cluster centre, we can use the distances of the
points from their cluster centre to determine whether the clusters are compact. We define
intra-cluster distance as an average distance of all the points in the cluster to the cluster
centre. We also want to measure the inter-cluster distance, which we define as the
minimum distance between centers of two clusters. Only minimum value is taken into
account in order to maximize the separateness of the two closest clusters. As all the other
values are larger, the rest of the clusters will be well separated too.
As can be seen in figure 6.1 the quality score of the clustering is heavily dependent on the
value of k. With k values smaller than 4, the scores are at their highest due to high level
of separation between clusters. When visually observing the clustering result, good
65
separation is reflected in the cluster profiles as highly distinct music tastes between user
segments. As the number of clusters is increased the average separation of clusters tends
to decrease. Consequently, profiles of different clusters share more genres.
Cluster validation
120
100
80
Score
60 Series1
40
20
0
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Nro of clusters
Figure 6.1. Validation scores for k values in between 2 and 16
As the purpose of the user interface is to provide an easily understandable overview to

the user base, we are primarily interested in the results with relatively small number of
clusters. Figure 6.2 shows a table of clustering results from the k values between 4 and
13. For each clustering result, the table shows the density and separation as well as
overall quality score.
Table 6.1. Quality scores for clustering results with k values between 6 and 13.
Nro of Density (d) Separation (s) Score
clusters d/sqrt(s)
4 656 115 61,17
5 669 110 63,78
6 715 115 66,67
66
7 652 96 66,54
8 651 101 64,77
9 657 100 65,7
10 623 101 61,99
11 614 100 61,4
12 635 100 63,5
13 619 97 62,84
From the table we see that according to the chosen quality score the k value of 6 produces
the best clustering result for this set of users when using the ratio of density and
separation as the criteria.
Cluster validation
68
67
66
65
64
Score
63 Series1
62
61
60
59
58
4 5 6 7 8 9 10 11 12 13
Nro of clusters
Figure 6.2. Validation scores for k values between 4 and 13.
67
7 Conclusions
In this thesis it was shown that the information hidden in user’s browsing history
provides a valuable source of knowledge. Analysis of users selections while they are
browsing helps system to understand the information needs of the user and make
suggestions based on that. This work showed some approaches for using that information
to personalize the content offering and segment users into interests groups. Two
approaches for content recommendation was presented:
• An ontology-based approach in which content items are analyzed with the help of
ontology and user profiles are learned by observing the topics found in the content
items that the user has accessed. The user will be recommended content items that
contain similar topics than those that the user has previously shown interest in.
• A collaborative approach in which recommendations are based on the similarities

in the users browsing history. The more similar the histories of two users are the
more likely it is that they have similar interests currently as well.
User base was segmented using clustering techniques. A hybrid algorithm was
implemented for the task as a mixture of hierarchical and partitional approaches. The
results showed that users differ in their taste for content preferences and that the user base
can be segmented into natural interests groups, each group having a unique taste of
music. Further assessment about the meaningfulness of the clustering result was left to
further work. The cluster centroid was used to represent the user segment. The objectives
of user segmentation in an online digital content service include:
• To provide knowledge about the users. The service provider can observe users as
evolving interest segments and adjust the content in the service based on changes
in the trends.
68
• To provide segmented user base for content targeting or targeted marketing.
• To initiate the formation of social networks or communities of users sharing

similar interest.
• To provide basis for efficient collaborative filtering based on user segments: a

user can be suggested new items based on likings of the other users in the same
cluster.
7.1 Privacy issues
Users are hesitant to give away their personal information. Online services that apply
registration process requiring inputting demographic information and details about
personal interests often face this in a form of decreased user base. In the age of increasing
amount nuisances such as unsolicited email and viruses, people can hardly be blamed
being concerned of their privacy while online. All the above methods of user modeling
require some level of compromise in individual user’s privacy. A good practice is to
inform users about the gathering of information and explain what it is used for. Gathered
information should not be shared with third party unless the user has given permission to
do so. Naturally, it is important to maintain user information in a secure system, which is
not vulnerable to attacks by intruders.
In terms of privacy, it is more acceptable to gather information so that individual users

cannot be identified. Practically, such information is comparable to the statistics
information in the web server logs. The current practice in online services seems to be
that the service provider informs users about its privacy practices in a specific privacy
policy –statement. There are also some emerging standards for publishing the privacy
policy in the service such as W3C’s Platform for Privacy Preferences (P3P) [P3P06],
which aims to standardize the method of publishing the privacy policy in the service and
69
allows users to set their preferences for the privacy issues. When the user enters the site,
the privacy policy of the service is compared to the privacy preferences set by the user
and any conflicts are reported to the user. In order to work in practice, the support for
exchanging such information must exist in the server as well as in the client software.
Still additional effort is required from the user and the service provider to manually enter
the privacy levels. Future will show whether this kind of standard will make it to the
mainstream.
7.2 Future work
As future work, it would be interesting to use trend analysis for user segments. Doubtless,
the natural clusters that can be found from data look different from time to time. Trend
analysis could further reveal users changing interests by finding topics that have gained
interest recently as well those that are loosing attention. Also, clustering could have been
applied to ontological user profiles rather than just the click histories. This approach
would have benefited from the ontology not only in the visualization, but also in
expanding the user profile to reflect related interests. There would have been fewer users
that do not belong to any of the clusters due to their unique browsing history. The
downside is that the ontology-based approach is more dependent of the existence and the
quality of the ontology. The most severe limitation in the current implementation is its
restriction to only non-overlapping clusters. It is obvious that one user should be able to
belong to more than one clusters. Fortunately, there seems to be a K-means
implementation that allows fuzzy clusters [DeM88].
70
Bibliography
AHK01 Aggarwal, C., Hinneburg, A., Keim, A., On the surprising

behavior of distance metrics in high dimensional space.
Proc. 8th International Conference on Database Theory,
London, 420-434, 2001.
AsI99 Aslam, J.A, Isaacs, J.D., Investigating measures for

pairwise document similarity. Technical report. Hanover,
1999.
ARS98 Alsabti, K., Ranka, S., Singh, V., An Efficient Parallel

Algorithm for High Dimensional Similarity Join, Proc. 11th
International Parallel Processing Symposium, IEEE
Computer Society Press, 1998.
BlC01 Blake, C., Pratt, W., Better rules, fewer features: a semantic
approach to selecting features from text. Proc. of the
Institute of Electrical and Electronics Engineers Data
Mining Conference (IEEE DM 2001), San Jose, Ca, 2001.
BDO95 Berry, M.W., Dumais, S.T., O’Brien, G.W., Using linear

algebra for intelligent information retrieval. SIAM Review.
37:573-595, 1995.
Con01 Connolly, D. et al, DAML+OIL (March 2001) Reference

Description. W3C Working Draft 29 July, 2002.
Cut92 Cutting et al. Scatter/Gather: A Cluster-based Approach to

Browsing Large Document Collections, Proc. of the 15th
71
Annual International ACM/SIGIR Conference,
Copenhagen, 1992.
Dee90 Deerwester et al., Indexing by latent semantic analysis.

Journal of the American Society of Information Science,
41(6), pages 391-407, 1990.
DeM88 deGruijiter, J., McBratney, A., A modified fuzzy k means

for predictive classification. Classification and Related
Methods of Data Analysis, 97-104, Elsevier Science,
Amsterdam, 1988.
Die00 Diederich, J., et al. Authorship attribution with Support

Vector Machines. Applied Intelligence, 2000.
Dum90 Dumais, S. T. Enhancing performance in latent semantic

indexing (LSI) retrieval. Technical report, Bellcore, 1990.
Gon98 Gonzalo et al., Indexing with WordNet synsets can improve

text retrieval, Proceedings of the COLING/ACL '98
Workshop on Usage of WordNet for NLP, Montreal, 1998.
GRS98 Guha, S., Rastogi, R., Shim, K., CURE: An efficient

clustering algorithm for large databases. Proc of ACM
SIGMOD International Conference on Management of
Data, pages 73-84, New York, 1998.
HBV01 Halkidi, M., Batistakis, Y., Vazirgiannis, M., On clustering

Validation Techniques. Journal of Intelligent Information
Systems, 17(23), pages 107-145, 2001.
72
HeH98 Heckerman D., Horvitz E., Inferring Informational Goals
from Free-Text Queries, Proc of Fourteenth Conference on
Uncertainty in Artificial Intelligence, p 230-237, Madison,
Wisconsin, 1998.
HMS01 Hotho, A., Maedche, A., Staab, S.. Ontology-based text

clustering. Proc. of the IJCAI-2001 Workshop "Text
Learning: Beyond Supervision", August, Seattle, USA.
2001.
JMF99 Jain, A. Murty, M., Flynn, P., Data Clustering: A Review.

ACM Computing Surveys, 31(3), p 264-323, 1999.
Jin94 Jing, Y., W. Croft, W.B., An association thesaurus for

information retrieval, Proceedings of {RIAO}-94, 4th
International Conference, p. 146 – 160, New York, 1994.
Kan02 Kanungo et al., An Efficient k-means clustering

algorithm: Analysis and Implementation, IEEE
Transactions on Pattern Analysis and Machine
Intelligence, Volume 24 , Issue 7, p. 881 – 892, 2002.
Lan95 Lang, K. NewsWeeder: Learning to filter news. Proc. of

the Twelth International Conference on Machine Learning.
Lake Tahoe, CA, 1995.
Lie95 Lieberman, H., Letizia: An agent that assists web browsing.

Proc. of the International Joint Conference on Artificial
Intelligence, Montreal, 1995.
73
MaS02 Manning C., Schytze H., Foundations of Statistical Natural
Language Processing. MIT Press, p. 495-527. Cambridge,
2002.
MiW91 Miller, G., Walter, C., Contextual correlates of semantic

similarity. Language and Cognitive Processes, 6(1), p. 1-
28, 1991.
MJZ03 Mobasher, B., Jin, X., Zhou, Y., Semantically Enhanced

Collaborative Filtering on the Web. EWMF 2003, p. 57-76,
2003.
MlG98 Mladenic, D., Globelnik, M., Word sequences as features in

text learning. Proc of the 17th Electrotechnical and
Computer Science Conference (ERK98), Ljubljana,
Slovenia, 1998.
MMN01 Melville, P., Mooney, R., Nagarajan, R., Content-boosted

collaborative filtering. Proc. of the SIGIR2001 Workshop
on Recommender Systems, New Orleans, LA, 2001.
Nan03 Nanas, N. et al., A comparative study of term weighting

methods for information filtering. Technical Report KMi-
TR-128, Knowledge Media Institue, The Open University,
2003.
OWL06 Web Ontology Language, http://www.w3.org/2004/OWL/

(9.3.2006).
74
PaB97 Pazzani, M., Billsus, D., Learning and Revising User
Profiles: The Identification of Interesting Web Sites.
Machine Learning, 27, p. 313-331, 1997.
Paz99 Pazzani, M., A Framework for Collaborative,

ContentBased and Demographic Filtering. Artificial
Intelligence Review, p. 393-408, 1999.
PLL99 Pea, J., Lozano, J., Larraaga, P., An empirical comparison

of four initialization methods for the k-means algorithm.
Pattern recognition letters 20:, p. 1027-1040., 1999.
Por80 Porter, M., An algorithm for suffix stripping, Program,

14(3) p. 130−137. 1980.
PrG99 Pretschner, A., Gauch, S., Ontology based personalized

search. Proc. of 11th IEEE International Conference on
Tools with Artificial Intelligence, 391-398, Chicago, 1999.
Ris95 Richardson R., Smeaton A., Using WordNet in a

Knowledge-Based Approach to Information Retrieval,
Working paper CA-0395, School of Computer
Applications, Trinity College Dublin, 1995.
RSM00 Richardson R., Smeaton A.F., Murphy J., Using Wordnet

as a knowledge base for measuring semantic similarity
between words. Proc of AICS Conference. Dublin, 1994.
RDF06 Resource Description Framework (RDF),

http://www.w3.org/RDF/ (9.3.2006).
75
SaC99 Sanderson, M., Croft, B. Deriving concept hierarchies from
text. Proc. of SIGIR-99, the 22nd ACM Conference on
Research and Development in Information Retrieval, p.
206--213, Berkeley, 1999.
Sal75 Salton, G., Wong, A., Yang, C.,. A vector space model for
automatic indexing. Communications of the ACM, 18(11),
p. 613--620, 1975.
SaM83 Salton, G., McGill, M. J., Introduction to modern

information retrieval. McGraw-Hill, 1983.
ShM95 Shardanand U., Maes P., Social Information Filtering:

Algorithms for Automating "Word of Mouth”, Proc. of
ACM CHI'95 Conference on Human Factors in Computing
Systems, 1995
Sir99 Siddheswar R., Rose H. T.,

Determination of number of clusters in
k-means clustering and application in
colour image segmentation. Proc of 4th
International Conference on Advances in Pattern
Recognition and Digital Techniques (ICAPRDT'99), 1999.
ShW63 Shannon, C. E., Weawer, W., The Mathematical Theory of

Communication., Chicago, University of Illinois Press,
1963.
Son99 Soboroff, I., Nicholas, C., Combining content and

collaboration in text filtering of netnews. Proc. of the 1994
76
Computer Supported Cooperative Work Conference, ACM,
New York, 1994.
SWI01 Sholom M. Weiss, Nitin Indurkhya: Lightweight

Collaborative Filtering Method for Binary-Encoded Data.
PKDD 2001, 484-491, 1991.
P3P06 http://www.w3.org/TR/P3P/ (9.3.2006).
Wor06 WordNet,http://www.cogsci.princeton.edu/~wn/
(9.3.2006).
YaP97 Yang, Y., Pedersen, J., Comparative Study on Feature

Selection in Text Categorization. Proc. of the 14th
International Conference on Machine Learning ICML97, p.
412-420, 1997.
Zip49 Zipf, G., Human Behaviour and the Principle of Least-

Effort, Addison-Wesley, Cambridge MA, 1949.
77

Clusteri

Hochgeladen von

Dokumentinformationen

Originalbeschreibung:

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Clusteri

Hochgeladen von

Copyright:

Verfügbare Formate

Clustering users of an online content service

Helsinki 21st June 2006

Faculty of Science Department of Computer Science

Muita tietoja – Övriga uppgifter

2 Text analysis ................................................................................................................... 7

4 Cluster analysis ............................................................................................................ 31

5 Creation of user and content profiles......................................................................... 40

6 Clustering of users ....................................................................................................... 53

Retrieving relevant information from a document collection is a primary challenge in the

An important property related to frequency of a word is its information content, which

A collection of documents can be viewed a stream of tokens generated by an information

2.2 Creating a document model

Computerized text processing requires transforming text into in-memory representation

Words usually have different morphological variants with similar semantic

Creation of reliable document similarity measure is an active topic in the research of

• If a term is present in both documents, the similarity will increase.

Similarity in an information theoretic viewpoint can be defined in every domain where a

The difference between A and B can be measured by

2.4 Vector space model

d = (w1,w 2 ,...,w n ) , (2.8)

1. Local frequency of a term i in document d (tfi,d).

Table 2.1 Components of TFIDF weighting schemes.

2.5 Semantics in language

Study of the semantics of language is a study of meaning of individual words and

central processing unit

• Using Bayesian network to model the probabilistic relationships between users’

2.6 Using ontology in capturing the semantics of text

Ontology has been defined in philosophy as an explicit specification of conceptualization.

An example of a simple “flat” ontology is WordNet [Wor06], which is an attempt to

User modeling in information retrieval systems focuses on developing an understanding

An example of a simple user model is a user of a search engine, which is represented a

3.1 Content personalisation

Personalisation is a generic term denoting user modeling and user-adaptive system

3.2 Collaborative filtering

Collaborative filtering (CF, also referred as clickstream analysis) [ShM95] is a

A simple CF algorithm can be summarized with the following steps:

1. Create a matrix of m users with ratings to n items.

4. Weight items in s according to chosen criteria.

Table 3.1. Ratings of two users for collaborative filtering algorithm.

Table 3.2. Recommendations produced by collaborative filtering algorithm.

3.3 Content-based filtering

Content-based filtering has also limitations: items must be in machine-readable format

Table 3.3. Ratings of two users for content-based algorithm.

Table 3.4. Recommendations produced by content-based algorithm.

As can be seen from the recommendations, a purely content-based algorithm produces

3.4 Hybrid methods

Several approaches have been suggested for combining content-based recommendations

Clustering is the unsupervised classification of patterns (observations, data items or

4.1 Clustering in information retrieval

There are many applications for clustering in IR systems. By displaying content in

1. Select k initial centers

The result is a non-overlapping, non-hierarchical clustering in which each member of a

1. Create a singleton cluster of each item

The result of hierarchical clustering algorithm is a dendogram, a hierarchy of merging

∀c',c'': min(sim(c,c'),sim(c,c'')) ≤ sim(c,c'∪c'')

4.4 Group Average Agglomerative Clustering

In group-average agglomerative clustering, the similarity between any two clusters is

where p' n (C ) is un-normalized sum profile of the n most central members

5.1 Architecture of the content service

The personalisation engine is embedded into a server part of a client-server content

Numbered entities in the figure are:

5. The metadata of the content item is fed into personalisation engine.