Sami Linnanvuo
Abstract
Online content services can greatly benefit from personalisation features that enable
delivery of content suited to each user's specific interests. This thesis presents a
system that applies text analysis and user modeling techniques in an online news service
for the purpose of personalisation and user interest analysis. The system creates a detailed
thematic profile for each content item and observes the user's actions on content items
to learn the user's preferences. A handcrafted taxonomy of concepts, or ontology, is used in
text analysis for extracting relevant concepts from the text. User preference learning is
automatic, and there is no need for explicit preference settings or ratings from the user.
Learned user profiles are segmented into interest groups using clustering techniques, with
the objective of providing a source of information for the service provider. Some
theoretical background for the chosen techniques is presented, while the main focus is on
finding practical solutions to some current information needs that are not
optimally served by traditional techniques.
Keywords: Clustering, user modeling, personalisation, text categorization, ontologies
1 Introduction.................................................................................................................... 5
3 User modeling............................................................................................................... 23
3.1 Content personalisation........................................................................................... 23
3.2 Collaborative filtering............................................................................................. 24
3.3 Content-based filtering............................................................................................ 27
3.4 Hybrid methods....................................................................................................... 29
7 Conclusions................................................................................................................... 68
7.1 Privacy issues.......................................................................................................... 69
7.2 Future work............................................................................................................. 70
BIBLIOGRAPHY ................................................................................................71
1 Introduction
The amount of available digital content has increased drastically since the birth of the World
Wide Web. Since the introduction of a user-friendly interface, the web browser, anyone
with a computer and a network connection has been able to access a vast amount of online
content in the form of news services, discussion boards and online shops. As the amount
of available information increases, the ability to distinguish relevant content from
irrelevant becomes crucial. Recently, as many of these services have become available on
mobile devices with small screens and limited bandwidth, efficient filtering and
personalisation techniques have become more necessary than ever. Increased competition
and more demanding users have brought new concerns to service providers: how to
keep users from switching to a competing service? How to help users find
the content they are looking for? How to increase advertising revenue without irritating
users? These are important questions for any provider running a commercial
content service.
This thesis, motivated by the above questions, presents a system that aims at helping
service providers improve the efficiency of their information delivery, and at providing
them with information about the end users. A key factor driving user satisfaction in a
content service is the quality of the content; if the content serves the user's information
needs well, it is more likely that the user will return to the service. The problem is that
often the service provider does not know exactly what the users are interested in.
Segmenting users into groups according to their interests and presenting those groups in a
graphical user interface helps the service provider adjust the content according to the
interests of the target audience. This information can further be used in targeted
advertising campaigns, in which users are offered only the ads they are most likely to
be interested in seeing.
To achieve the results outlined above, the system should be able to resolve the interests of
any individual user and measure the level of similarity of any two users using the system.
User profiles are created using Leiki Targeting, a real-time learning, profiling and
personalisation engine. The engine uses a domain ontology to automatically form a human-readable
profile for each content item and user in the system. Once the profiles are created, the
system is capable of recommending to users the content they are probably most interested in
seeing. The purpose of this work is to present a way to add a user segmentation feature to
this existing personalisation engine.
This work is divided into seven chapters, so that each chapter builds on top of the
previous ones. Chapter 2 provides the reader with the basics of text analysis and document
modeling, with the objective of finding topics in text. The concept of an ontology, a
taxonomy of concepts, is introduced as a provider of semantics for the text analysis task.
The vector space model is introduced as a methodology for efficiently performing
calculations with documents. Chapter 3 introduces techniques for user modeling and
personalisation. Both content-based and collaborative approaches to content filtering are
compared and their differences are explained. Chapter 4 provides the reader with an introduction
to data clustering as a method for automatically finding groups in data. Chapter 5
describes the functionality of Leiki Targeting, a real-time learning, profiling and
personalisation engine, which is used in content and user profiling. Chapter 6 is devoted
to analysis of the generated user profiles. Users are segmented into interest groups using
cluster analysis. Scalability to a large number of users is achieved by using a custom
hybrid algorithm, which combines hierarchical and partitional clustering.
2 Text analysis
There are several approaches for extracting knowledge from text. Some of them rely on
handcrafted rules for detecting words and sentences, while others are more computationally
oriented, attempting to learn such rules automatically. Traditional natural language
processing (NLP) methods are based on the concept of a grammar as a model of language:
a sentence is deemed grammatical if it follows the rules of the grammar and ungrammatical
if it conflicts with any of them. However, implementing a set of rules that
governs all the features of a language is far from easy. Even for a human, it is sometimes
difficult to determine the grammaticality of a sentence. In common use of language,
people often utter sentences that do not follow the grammar, yet they are still understood,
and they should also be understood when communicating with a computer. The fact that
the use of language changes over time, as new terms are invented and new forms of
expression are incorporated into the language, implies the need for constant
updating of the language model.
2.1 Information content of a word
By observing words in text one can clearly see that words are unevenly distributed: some
words are very common while some appear very rarely, and the majority of words lie
between those two extremes. In a word count of Tom Sawyer, the 100 most
common words accounted for 50.9% of the text, while 49.8% of the words occurred only
once [MaS02]. Similar results would be obtained with any typical text corpus. This
characteristic of language follows Zipf's law [Zip49], which states that the frequency
of use of a word is inversely proportional to its statistical rank, such that

$P_n \approx 1/n^a$, (2.1)

where $P_n$ is the frequency of occurrence of the nth ranked word and $a$ is close to 1. Given
a sufficiently large corpus of text in any natural language, the second most common
word will occur approximately half as often as the first, and the nth most common word will
occur 1/n as often as the first. As a consequence, roughly 20% of words take up 80% of written
text. This pattern is also visible in many related domains, such as the distribution of page
views in a web service (the second most popular page gets half the impressions of the most
popular one) and the long tail of products in the retail market (where 20% of the items make up
80% of the sales).
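As a quick illustration, Zipf's law can be checked empirically against any sizable plain-text corpus. The following sketch (the corpus file path and the naive tokenization are placeholder assumptions) compares observed frequencies to the idealized prediction of frequency of the top word divided by rank:

```python
import re
from collections import Counter

def zipf_table(text, top=10):
    """Count word frequencies and compare each rank to the Zipf prediction f1/rank."""
    words = re.findall(r"[a-z']+", text.lower())  # naive tokenization
    counts = Counter(words).most_common(top)
    f1 = counts[0][1]  # frequency of the most common word
    for rank, (word, freq) in enumerate(counts, start=1):
        print(f"{rank:>4} {word:<15} observed={freq:<8} zipf={f1 / rank:.0f}")

# Usage ('corpus.txt' is a placeholder for any large plain-text file):
# zipf_table(open("corpus.txt", encoding="utf-8").read())
```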
The information content of words can be quantified with concepts from information theory. The entropy of a probability distribution P is defined as

$H(P) = \sum_{x \in X} P(x) \log_2 \frac{1}{P(x)}$. (2.2)

Correspondingly, the information content of an individual term is

$\log_2 \frac{1}{p_t}$, (2.3)

where $p_t$ is the probability of occurrence of the term. The more frequent a word is, the
less information it carries. For example, a term occurring with probability 1/1024 carries $\log_2 1024 = 10$ bits of information.
Before documents can be compared or retrieved efficiently, they are transformed into a more compact model. The transformation typically consists of the following steps:

1. Tokenization of text
2. Stripping punctuation
3. Removal of stop-words
4. Stemming
5. Weighting
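To make the five steps concrete, below is a minimal Python sketch of such a pipeline. It assumes the NLTK package for stemming and uses a toy stop-word list; a production system would use full, language-specific resources:

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer  # assumes the NLTK package is installed

STOP_WORDS = {"a", "an", "the", "i", "he", "you", "it", "of", "to", "and", "in", "is"}

def preprocess(text):
    """Steps 1-5: tokenize, strip punctuation, remove stop-words, stem, weight."""
    tokens = text.lower().split()                               # 1. tokenization
    tokens = [re.sub(r"[^\w]", "", t) for t in tokens]          # 2. strip punctuation
    tokens = [t for t in tokens if t and t not in STOP_WORDS]   # 3. stop-word removal
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]                   # 4. stemming
    return Counter(stems)                                       # 5. weighting (raw term frequency)

print(preprocess("The engine creates a thematic profile for each content item."))
```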
The transformation process starts by tokenizing text into units of language. What is a
unit, then? Intuitively, a unit is a word, more precisely a sequence of characters
with spaces on both ends. However, tokenizing text turns out to be more complicated
than that. Because of punctuation marks (commas, semicolons etc.), words are not always
surrounded by white space. Simply removing punctuation is a common solution but not
a perfect one; there is information stored in the punctuation as well, for example a
question mark suggests that the sentence contains a question. Detecting the boundary of a
sentence is not trivial either: periods may appear inside a sentence in abbreviations or
digits. In the extreme case where a sentence ends in an abbreviation, the trailing period serves
two purposes [Msa02].
There are varying approaches for choosing the smallest unit of representation, of which
the most common is the bag-of-words representation, where one unit consists of a single
term (word, punctuation mark, digit etc.). In a bag-of-words representation, terms are
seen as separate entities and their ordering has no importance. It is obvious that the bag-of-words
representation is not optimal due to the existence of special word combinations such
as collocations and idioms. The bag-of-words approach can be extended to include word
combinations, as in [MlG98]. Since meaningful word combinations occur together more often
than separately, they can to some extent be detected automatically by statistical
calculations over a sufficiently large text corpus and then treated as single units.
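One common statistical calculation of this kind scores adjacent word pairs by pointwise mutual information (PMI), so that pairs occurring together more often than chance predicts rise to the top. A rough sketch (the min_count cutoff is an assumption that keeps rare pairs from dominating the ranking):

```python
import math
from collections import Counter

def collocations(tokens, min_count=5, top=10):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:          # ignore rare pairs; PMI overrates them
            continue
        pmi = math.log2((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        scored.append((pmi, w1, w2))
    return sorted(scored, reverse=True)[:top]
```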
The next step is to filter out irrelevant terms in order to make the model of the document
more compact. As the number of distinct terms in a moderately sized collection of
documents easily exceeds the limit of what can be processed efficiently, some selection
must be performed in terms of which words are chosen to represent a document. We are
primarily interested in words that often co-occur in the same documents but are absent
from most other documents. Conversely, words that have a fairly even distribution of
occurrence across the document collection can be discarded, as they carry hardly any
significance for a document's topic. As was shown in chapter 2.1, a significant number of terms appear
only once in the whole document collection, and increasing the number of documents in
the collection will not change that fact. A threshold value can be used to filter out such
extremely rare terms. The most common terms in a language are usually not relevant for
information retrieval tasks either; in English these are often determiners
(a, an, the) and personal pronouns such as I, he and you. Such topically irrelevant words
are often called stop-words. A manually constructed list of stop-words can be used for
filtering out the most common irrelevant terms. The downside of such lists is that they are
laborious to create and language dependent. The goal of term selection is to filter out
irrelevant terms while keeping the relevant ones. This not only increases the performance
of retrieval tasks but also improves the quality of the outcome, as irrelevant terms do
not interfere with the retrieval task.
Once the relevant terms for representing the document are chosen, they are usually
weighted according to their importance. Weighting can be based on several properties,
such as the frequency of a term, its information content, its position in the document or even its
position within a sentence. A comparative study of different weighting schemes can be
found in [Nan03]. Typically, weighting is based on a combination of more than one
property. The outcome of the document modeling process is a set (or sometimes a list) of
units (features) of text with associated weights in a suitable data structure. Ideally, the
model reflects as much as possible those features of the original document that are relevant
for the task, while enabling the desired operations to be performed on the document efficiently.
2.3 Measuring the similarity of two documents
The similarity of two documents can be defined as the degree of overlap between them.
In a simple bag-of-words model, the similarity of two documents can be characterized
in terms of their commonality and their difference. The information content of the
commonality of documents A and B is

$I(A \cap B) = -\log P(A \cap B)$, (2.5)

and their difference is the symmetric difference of their term sets,

$A \Delta B = (A - B) \cup (B - A)$. (2.6)

Similarity between A and B can then be stated as a relation between their commonality and
their difference: the less information is needed to describe what A and B are once their
commonality is known, the more similar they are. Since documents can be described as a
sum of their commonality and difference, their similarity can be defined as [AsI99]

$Sim(A, B) = \frac{I(A \cap B)}{I(A \cap B) + I(A \Delta B)}$. (2.7)
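As an illustration of equations (2.5)-(2.7), the sketch below computes the similarity of two small documents represented as term sets. It makes the simplifying assumption that terms are independent, so the information of a set is the sum of the surprisals of its members; the term probabilities are toy values:

```python
import math

def information(words, prob):
    """I(S) = -log2 P(S); under an independence assumption, a sum of surprisals."""
    return sum(-math.log2(prob[w]) for w in words)

def sim(a, b, prob):
    """Similarity (2.7): information of the commonality over total information."""
    common = information(a & b, prob)
    diff = information(a ^ b, prob)  # symmetric difference, as in (2.6)
    return common / (common + diff) if common + diff else 0.0

# Toy term probabilities (illustrative values only):
prob = {"jazz": 0.01, "concert": 0.02, "helsinki": 0.005, "weather": 0.02}
print(sim({"jazz", "concert"}, {"jazz", "helsinki"}, prob))
```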
The problem with this simplified method is that it does not take into account the meaning
of words. Two documents can be similar even though they do not share any words. This
is not only due to synonymy but also due to the fact that two words may be semantically
similar even though they are not considered synonyms. In fact, true synonymy is quite
rare, but there are plenty of words that can be considered near-synonyms. In
addition, some words, even though not considered even near-synonyms, can belong to the
same topical category. For example, banana, apple and orange are all fruits and can therefore
be considered semantically similar to each other. Measuring the similarity of
two documents is essentially about measuring the similarity of the words in them. Including
the semantic similarity of words in the document model enables more comprehensive
similarity measures than a simple overlap of words.
The vector space model [Sal75] is widely used in information retrieval, largely due to its
simplicity. In order to perform computations with documents in the vector space model, the
extracted terms (features) must be converted into a vector representation. Documents are
represented as weighted n-dimensional term vectors, where each dimension corresponds
to a term (or a phrase). Term vectors are created by first indexing the documents with the
terms they contain (this usually includes the removal of uninformative words). An entry
in the vector denotes that a certain term appears in a document. After indexing, the
extracted terms are weighted according to their relevance for the information retrieval
task. Formally, a term vector for document d is represented as

$\vec{d} = (w_1, w_2, \ldots, w_n)$, (2.8)

where $w_i$ corresponds to the weight of term i. Terms are assigned weights according to
their relevance for the document's contents, so that the more relevant a term is, the more
weight it is given. In the simplest case, the weight of a term is the number of occurrences
of that term in the document. More advanced weighting schemes take into account the
importance of a term, that is, its relevance in describing the contents of the document. The
widely used TFIDF weighting [SaM83] is based on three elements: the local term
frequency (tf), the global frequency of the term in the collection, and a normalization factor (n).
The local frequency tells the number of occurrences of term $t_i$ in a document $d_j$. The more
often the term appears in a document, the more likely it is that the term is related to the topic of
the document. The global frequency of a term tells how many times the term appears in
the entire collection, and, as such, reflects the amount of information contained in the
term. The more common a term is in the text corpus, the less it contributes to
the semantic distinctiveness of a document relative to the other documents in the collection. If a
term is evenly distributed over the whole collection, it is probably not semantically focused
on any topic but is instead used in the context of any topic. To find out how
semantically focused a term is in a collection of N documents, its inverse document
frequency can be calculated as
$idf_i = \log \frac{N}{df_i}$, (2.9)

where $df_i$ is the number of documents that contain term $t_i$. Idf adjusts the weight of a term
in a document with a factor that discounts its importance when the term appears in almost
all of the documents and, as such, does not discriminate well between documents in the collection.
The normalization factor (n) compensates for differences in the lengths of the documents.
The rationale behind normalization is that a term that appears equally many times in a
short document and in a long one is likely to be more relevant to the shorter one. This
issue can be addressed by normalizing the weights by the length (norm) of the vector:

$w'_i = \frac{w_i}{\sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}}$. (2.10)
Using a weighting scheme based on the three components above, the final weight for the
term i in a document j is given by

$w_{i,j} = tf_{i,j} \cdot idf_i \cdot n$. (2.11)

There is a lot of variation in TFIDF-based weighting schemes. Sometimes the term
frequency is dampened by a logarithm, $1 + \log(tf_{i,j})$, in order to decrease the weight
differences caused by raw term frequencies. Another option is the augmented term
frequency $0.5 + \frac{0.5 \cdot tf_{i,j}}{\max_t(tf_{i,j})}$, which scales the frequency against
the most frequent term in the document; these and many other options are listed in [MaS02].
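Putting equations (2.9)-(2.11) together, a compact TFIDF implementation might look like the following sketch (raw term frequency, logarithmic idf and length normalization; the variable names are my own):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute length-normalized TFIDF vectors (2.9)-(2.11) for tokenized documents."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequencies
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})  # normalization (2.10)
    return vectors

docs = [["cricket", "pakistan", "match"], ["cricket", "england"], ["weather", "helsinki"]]
print(tfidf_vectors(docs)[0])
```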
One of the strengths of the vector space model is the ability to efficiently compare the
level of similarity of two documents. Since document features are represented as
attributes in vectors, similarity can be defined as a distance between the two vectors: the
closer the two vectors are, the more similar the documents are to each other. The
Euclidean distance (the straight-line distance between two points) between two vectors A
and B is given by

$Sim(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$. (2.14)
The cosine distance measures the similarity by computing the angle between two vectors
rather than the distance:

$Sim(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$. (2.15)
A vector is normalized if it has unit length according to the Euclidean norm (such a vector
is also called a unit vector):

$length(A) = \sqrt{\sum_{i=1}^{n} A_i^2} = 1$. (2.16)
The cosine distance de-emphasizes the lengths of the vectors, preventing long documents
from dominating shorter ones in similarity calculations. For normalized vectors, the cosine
is simply the dot product:

$Sim(A, B) = \sum_{i=1}^{n} A_i B_i$. (2.17)
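For sparse term vectors stored as dictionaries, the two measures (2.14) and (2.15) can be implemented as in the following sketch:

```python
import math

def euclidean(a, b):
    """Euclidean distance (2.14) between two sparse term vectors (dicts)."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

def cosine(a, b):
    """Cosine similarity (2.15); reduces to a dot product (2.17) for unit vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```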
Although document vectors are sparse and contain only a subset of the terms in the
corpus, the size of the vectors easily becomes too large for efficient processing. A major
difficulty in information retrieval is the high dimensionality of the feature space: the
number of unique terms that can appear in text is counted in tens or hundreds of
thousands. As similarity measures based on Euclidean distance are not well suited
to high-dimensional data [AHK01], a reduction of dimensions is required to improve
performance. The most notable problem with Euclidean distance measures is known
as "the curse of dimensionality", which means that the contrast between the
closest and the furthest point decreases rapidly as the number of dimensions increases. The
consequence is that all the vectors seem to be equally far away from each other, calling into
question the meaningfulness of the similarity measure. Since information retrieval tasks such as
content filtering or text clustering rely heavily on calculating similarities between vectors,
the number of dimensions is the primary factor affecting performance [Msa02,
BDO95]. A comparative study of dimensionality reduction methods for document vectors
can be found in [YaP97].
shallow notion; Miller et al. [MiW91] define the semantic similarity of words as the degree
of contextual interchangeability, that is, the degree to which one word can be substituted for
another in context.
Homonyms are words that are written in the same way but bear different meanings, for
example, a bank as a financial institution versus a bank as in riverbank. Polysemy is the
case where a word's multiple meanings or senses are related, as for the word take in take a
picture and take a look. Polysemy and homonymy both make the semantics of a word
ambiguous, which is often fairly easy for a human to handle but notoriously
difficult for a computer. Due to morphological variation, there are words that share a
common root and can be considered as referring to the same concept (e.g. house, houses,
housing). Lexical hierarchies denoting these kinds of relations between words are often
used in computerized text analysis to uncover the semantics of an individual word.
Interpreting the meaning of a whole sentence is far more difficult: one property of
natural language is the lack of compositionality, which means that the meaning of a
sentence cannot always be predicted from the meanings of the individual words in it.
Collocations are special word combinations that occur together in text more often than
by chance. They refer to a certain unique concept in the world around us; typical examples
related to computers are compounds such as hard disk and operating system.
Collocations are examples of word combinations that bear more meaning than what can
be induced from the combination of the words alone. Idioms are expressions whose
meaning is even less compositional. For example, the idiom raining cats and dogs has
nothing to do with pets; instead it denotes a heavy rain.
A direct consequence of this diversity of language is that a perfect search engine must
not only be able to deal with the linguistic characteristics of a term, but must also know the
term's position in relation to other terms in the semantic space of the user. There are many
approaches for capturing semantics from text. Some are computationally oriented,
while others rely on external lexical resources. A typical use of a lexical resource is to
expand the user's query with synonyms from a dictionary. A thesaurus, a simple form of
ontology, can be used for providing concepts related to the ones detected in the text. The
idea behind such dictionaries is to expand the query in order to find not only the exact
matches but also the items that are close enough to the user's query. Taxonomies have
been widely studied in text retrieval [Jin94, RiS95]. Gonzalo et al. [Gon98] showed that the
use of WordNet synsets (sets of one or more synonyms) can result in up to a 29%
improvement in a text categorization task in comparison to a keyword-based approach.
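For example, with the NLTK interface to WordNet (assuming NLTK and its WordNet data are installed), a query can be expanded with synset lemmas roughly as follows:

```python
from nltk.corpus import wordnet as wn  # assumes NLTK and its WordNet data are installed

def expand_query(terms):
    """Expand query terms with the lemmas of their WordNet synsets."""
    expanded = set(terms)
    for term in terms:
        for synset in wn.synsets(term):
            expanded.update(lemma.name().replace("_", " ") for lemma in synset.lemmas())
    return expanded

print(expand_query(["concert"]))
```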
Keyword-based search is probably the most common way of searching for information.
In a typical search engine, the user enters one or more terms related to the subject of
interest. The query terms are matched against the terms in the document collection, and the
articles with the best rankings are returned. However, some properties of language, such as
synonymy, ambiguity and morphological variation, decrease the accuracy of keyword-based
search. A typical query like 'jazz concerts helsinki' in a simple database-like query system
returns the documents that contain one or more of the exact query terms. More
advanced approaches count the number of hits and use a stemmer to find the base form of a
term. However, since the algorithm treats each term as an isolated piece, it fails to capture
the semantic relations between the words in a document. Sometimes documents do not
even contain the terms that would best describe their content; for example, articles about
the latest domestic news do not necessarily contain the words latest domestic news. Only an
understanding of the semantics behind the search words makes it possible to find the
connection between two documents that express similar semantic concepts with different
words, and thus to produce all the relevant results.
Figure 2.1. Section of upper-level concepts in WordNet.
Ontologies have been widely studied in text analysis research. Blake et al. [BlP01]
investigated the quality of features chosen to represent a document when the features had a
varying degree of semantics attached to them. They used an existing knowledge base,
the Unified Medical Language System (UMLS), to map clauses in documents to medical
concepts, and used the Apriori algorithm to learn bi-directional association rules. Their
findings indicated that association rules based on concepts were more useful than
those based on word features. Hotho et al. [HMS01] used a domain-specific ontology in
text clustering to improve clustering results. They used a taxonomy of concepts with
associated key terms to extract relevant concepts from documents. Each document was
represented by a vector of concepts, each entry denoting the frequency with which a concept
occurs in the document. Documents were clustered using a K-means algorithm. The use of the
ontology improved the clustering task by reducing the dimensionality of the feature vectors to
a size suitable for efficient processing with K-means, and provided human-readable
cluster representations. Pretschner et al. [PrG99] used a publicly available hierarchy of
categories of web sites to create an ontology. Each node in the ontology was associated with
a vector of key terms chosen from the set of web sites belonging to that category. The
system analyzed the pages that a user browsed and compared each page's profile to the
vectors associated with the nodes in the ontology. The best matching nodes were assumed to be
most related to the browsed page. A hierarchical user profile was formed based on the
observations about the user's browsing history.
The growing interest in ontology-based information processing, due to the central role of
ontologies in semantic web research, has led to the emergence of standards both in the
representation of ontologies [RDF06] and in describing their semantics [Con01,
OWL06]. There is also active research on the automatic learning of ontologies from
document collections [Sac99]. As ontology construction is a laborious task, automation of
the task would be most welcome. However, due to the difficulty of the task,
automated methods have been most useful when accompanied by manual guidance from a
human.
3 User modeling
Personalisation is sometimes confused with customization, in which users explicitly tailor
the service to their personal preferences. Examples of customization are downloadable ringtones and
wallpapers on mobile phones, or portals on the web that show user-selected content
sections, such as local weather reports. Despite the ambiguity, in this thesis the
term personalisation is used to refer to algorithmic methods that apply user-modeling
techniques to deliver personalized content.
Many web sites today offer personalized content based on user modeling to improve the
user experience and to increase customer retention. There are many approaches to acquiring
information about users and using that information for personalisation. The most
common methods are content-based recommending and collaborative filtering. The main
difference between the two is that in the former, recommendations are based
on the past behavior of the user, while in collaborative filtering they are based on the
behavior of like-minded people. Hybrid methods overcome some of the shortcomings of
these two methods by taking the best of both worlds, collaboration and content.
The transactions in collaborative filtering may be explicit ratings for items such as books
or movies, or implicit feedback resulting from the user simply viewing an item.
In the case of explicit feedback, the user is given the possibility to rate an item on a numerical
scale or to just give positive or negative feedback. For example, in [PaB97], a learning
software agent allowed users to rate pages either hot (two thumbs up) or cold (two
thumbs down), while in NewsWeeder [Lan95] the feedback for articles was given on a
five-point scale. When implicit feedback is used, ratings are typically recorded as
binary (true/false) values. Lieberman [Lie95] used implicit feedback successfully in a
learning agent that assists a user browsing the web. In his approach, an implicit feedback
was scaled according to the duration that the user spent viewing an item.
A typical user-based collaborative filtering algorithm proceeds as follows:

1. Collect the users' ratings into a user-item matrix m, in which each row represents one user's ratings.
2. Find a set k of the most similar neighbors of the target user according to correlations between the rows of the matrix m and the target user's row.
3. Create a candidate set s by choosing items that are rated positively by users in k and that are not rated by the target user.
There are several weighting schemes for the items in s. Typically, the first step is to rank
the items according to their frequency in s. Secondly, items can be weighted according to the
closeness of the recommending user in k to the target user, so that the closer the two users
are, the more weight the item gets. When calculating the closeness of users, additional
weight can be given to items according to how rare they are among the ratings. The rationale
behind this is the same as when weighting term vectors according to their information
content [SWI01].
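A compact sketch of the neighbor-based scheme above, using cosine similarity over the rows of a rating matrix (NumPy assumed; 0 denotes an unrated item, and candidate items are weighted by the neighbor's closeness, as described above):

```python
import numpy as np

def recommend(ratings, target, k=3, top=5):
    """User-based CF sketch: rows are users, columns items (0 = unrated).
    Find the k most similar users and rank items they rated positively."""
    target_row = ratings[target]
    norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(target_row) + 1e-12
    sims = ratings @ target_row / norms          # cosine similarity to every user
    sims[target] = -np.inf                       # exclude the target user itself
    neighbors = np.argsort(sims)[-k:]            # k most similar users
    scores = {}
    for u in neighbors:
        for item in np.where((ratings[u] > 0) & (target_row == 0))[0]:
            # weight each unseen, positively rated item by the neighbor's closeness
            scores[item] = scores.get(item, 0.0) + sims[u] * ratings[u][item]
    return sorted(scores, key=scores.get, reverse=True)[:top]
```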
Table 3.1 illustrates the ratings of two users for six artists. Both users have stated
whether they like or dislike the artist in question. Columns marked with '-' have no rating
from that user.
A CF algorithm tries to find similar users for a target user by computing similarities based
on the common set of items that the users have rated. In the simplified example above, users A
and B have both rated item 2 interesting, and thus they might also agree on items 1 and 5,
each of which was liked by one of them. As a result, a CF algorithm could recommend item 1 to
user B and item 5 to user A. Item 6 has no ratings from either of the two users, and thus
there is not enough information to judge whether A or B would like it.
CF techniques are easy to implement and provide good recommendations with little or no
intrusion on the user. They can perform well even if there is not much textual content or
metadata associated with the items. CF also has the ability to recommend an item that is
relevant to the user but that could not be derived from the user's own browsing history.
However, CF techniques also have some well-known limitations. Finding the k most similar
neighbors in real time easily causes scalability problems, as the complexity of the
algorithm increases linearly with the number of users. This becomes a
problem if the set of viewed items is large, which is often the case in web personalisation,
where the items are browsed pages.
When explicit ratings are used, most users do not rate items, and therefore the probability
of finding a set of users with highly similar ratings is low. Large, sparse item sets decrease
the likelihood of a significant overlap among users, leading to less reliable
recommendations. Since the system's knowledge about the content items is derived
solely from the users' feedback, recommendations tend to be biased towards items that
have been popular in the past, severely limiting the diversity of the recommended
content. Perhaps the most notable shortcoming is a phenomenon known as the "new item
problem", which is caused by the lack of records for any new or recently added item. As
the recommendations are based solely on historical data, a recently inserted item cannot
be recommended until it has been rated (or visited) by a sufficient number of users.
There are two basic types of CF, user-based and item-based. While user-based
approaches compare the user's choices to those of others to find like-minded people,
item-based approaches identify similarities between the items themselves. Item-based CF
constructs the recommendation list by comparing each item in the list of the user's rated
items to the ratings of other items and selecting those with the highest correlation. Item-based
collaborative filtering sidesteps the bottleneck of user-to-user comparison, removing the need
to compare possibly millions of users. Comparing items usually requires fewer
operations, since the number of items is typically smaller than the number of users. There
remain, however, the problems of data sparseness and the lack of recommendations for new
items [SWI01, MJZ03, MMN01].
Content-based filtering is based on analysis of the textual content and is closely related to
the text categorization problem. Each item is classified into one or more categories, and
recommendations are based on similarities in classification. While CF systems base
their recommendations on similarities among users, content-based filtering systems
generate a personal profile for each user in the system and recommend items according to
the similarity between the content item profiles and the user profile.
The integration of semantic similarities between items provides two primary advantages over
user-based methods. First, the semantic features of items provide clues about the underlying
reasons why a user is interested in a particular content item. While user-based methods merely
record which items a user has accessed, content-based methods record the actual semantic
features of the accessed content items, avoiding the bias towards items popular among other
users. This helps unpopular items gain visibility, enabling better use of the long tail (see
chapter 2.1). Second, in the case of new items or sparse data sets, the system can still base its
recommendations on the semantic similarities of the content items.
Suppose that we have a pure content-based filtering algorithm that is capable of learning
the genres a user is interested in by analyzing the metadata of the items that the user has
rated. It can then recommend other artists belonging to those genres. Table 3.2
shows an example of the ratings of two users in such a system.
Item                   User A    User B
1. John Coltrane       Like      -
2. Miles Davis         Like      Like
4. Guitar Essentials   Dislike   -
5. Bossa Nova Brazil   -         Like
6. Dave Brubeck        -         -
A content-based recommendation system would learn from the item metadata (in the first
column of the table) that user A likes Jazz and user B likes Jazz and Bossa Nova. As a
result, A would be recommended item 6, since it also belongs to the Jazz genre. User B would
be recommended item 6 and also item 1, which is in the Jazz genre and not yet rated by B.
As can be seen from the examples above, neither a pure CF algorithm nor a purely content-based
algorithm is capable of producing all the relevant recommendations when used
alone. A hybrid solution can be implemented to overcome these shortcomings by
combining the recommendations produced by both methods. A hybrid method would know
from the collaboration data that user A might be interested in item 5 and user B in item 1.
Using content-based prediction, it would also know that both users are probably
interested in item 6. As a result, the recommendations based on the hybrid method would be:
Table 3.6. Example recommendations by the hybrid algorithm.

                  User A   User B
Recommendations   5, 6     1, 6
4 Cluster analysis
What is a cluster, then? There seems to be no definite answer to this question. Some
authors have defined clusters in terms of internal cohesion and external separation. While
this is intuitively sound, it does not provide a theoretical basis for clustering;
unfortunately, there is no general definition of a cluster that could be stated in mathematical
terms. The definition of a cluster and the goal of clustering vary according to
the data and the application. Human eyes are good at detecting patterns and structure in
seemingly random data, a property that makes them natural tools for judging a
clustering result. This is not always a good thing: people tend to see structure in data even
when there is no structure at all. Another important characteristic of a clustering algorithm is its
computational complexity. When clustering large amounts of high-dimensional data objects, there is
always a trade-off between the quality of the clusters and the execution time of the algorithm.
Since the number of possible clusterings grows exponentially with the number of items to
be clustered, it is clear that iterating over all possible results is not feasible. As a consequence,
practical clustering algorithms settle for approximate solutions; even so, many of them,
including hierarchical algorithms, remain computationally expensive.
The term segmentation is mainly used in market analysis, where clustering techniques are
used for discovering user groups with similar interests or groups of shopping items
frequently purchased together. In this thesis, the term user segment refers to a group of users
sharing similar interests, while the term clustering denotes the technical process of finding
such segments.
Document clustering has been studied in information retrieval mostly as a method for
increasing the accuracy and performance of search. Since the documents inside a cluster
are similar to each other, they are often relevant to the same or similar query terms.
One can increase the efficiency of a similarity search by searching only the documents
belonging to the same cluster. In Scatter/Gather [Cut92], clustering was used as the primary
information retrieval method. The document collection was clustered into document
groups, allowing a quick glance over the structure of the whole collection. The user selected
the most interesting groups, which were then combined and clustered into more detailed
groups for further browsing. Knowledge of clusters can also be utilized for achieving scalable
collaborative filtering: a clustering-based CF algorithm selects one or more clusters
closest to the target user and finds the k nearest users from those clusters only.
In NLP, a common use for clustering techniques is the word sense disambiguation
problem. Words tend to have multiple senses and meanings in language, and the correct
interpretation of such a word can be determined by looking at the context in which it appears.
Clustering can help in disambiguation by grouping the different usages of a word into
different clusters according to the different contexts it has. A correct interpretation for a word
can then be resolved by selecting the cluster that is most similar to the current context and
using the meaning associated with that cluster. In machine translation, clustering is used
for determining the features of a word by using clusters as generalizations of certain
word types [MaS02]. For example, when determining the correct preposition for the word
Monday, one can use knowledge of clusters to derive that the word behaves in a similar
manner to other weekdays seen in text, and choose the preposition used with them.
4.2 K-means
K-means [Mac67] and its variations are the most well-known clustering algorithms due to
their simplicity and ease of implementation. K-means is a partitioning algorithm that
organizes the items in a data set into k partitions, where each partition represents a
cluster. The algorithm first selects k items at random; these initially represent the
centers of the clusters. Each of the remaining items is assigned to the closest cluster, and
after each round of operation the cluster means are recalculated. The aim of K-means
clustering is the minimization of an objective function described by the equation

$E = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, m_i)$, (4.1)

where $m_i$ is the center of cluster $C_i$ and $d(x, m_i)$ is some distance metric, such as the
Euclidean distance, between a point x and the cluster center. The criterion function E attempts
to minimize the distance of each item from the center of the cluster it belongs to. The process
continues until the centers of the clusters stop changing. The K-means algorithm is thus
composed of the following steps:

1. Select k items at random as the initial cluster centers.
2. Assign each item to the cluster with the closest center.
3. Recompute the center (mean) of each cluster.
4. Repeat steps 2 and 3 until the cluster centers no longer change.
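A minimal NumPy sketch of these four steps (the convergence test and the handling of empty clusters are simplified):

```python
import numpy as np

def kmeans(x, k, iters=100, seed=0):
    """Plain K-means: x is an (n, d) array; returns (assignments, centers)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]       # step 1: random seeds
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)                            # step 2: closest center
        new_centers = np.array([x[assign == i].mean(axis=0) if np.any(assign == i)
                                else centers[i] for i in range(k)])  # step 3: new means
        if np.allclose(new_centers, centers):                    # step 4: stop when stable
            break
        centers = new_centers
    return assign, centers
```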
The complexity of K-means is O(kni), where k is the number of clusters, n is
the number of items to be clustered and i is the number of iterations. The number of iterations
can in theory be large, but in most cases the algorithm converges quite
quickly. Reassigning items to clusters and updating the means are both O(n) operations, and
typically only a constant number of iterations is needed. As the algorithm is linear in both of its
main components, K-means is faster than hierarchical methods when clustering large amounts of
data.
However, the fact that its performance depends heavily on the initial conditions [PLL99]
is a serious disadvantage of K-means. Another limitation is that the number of clusters
must be provided as a parameter. There are, however, approaches that try to approximate
the number of clusters [SiR99], as well as optimizations that improve the performance
[ARS98, Kan02].
4.3 Hierarchical clustering
Hierarchical clustering algorithms are popular due to their simplicity and their ability to
produce a hierarchical clustering result. Hierarchical clustering algorithms are either
agglomerative or divisive. In divisive methods, clustering starts with a single cluster
containing all items; in each round of operation, one cluster is divided into two smaller
clusters until the desired number of clusters remains. Agglomerative algorithms start with each
data item in its own cluster; in each round of operation, the two most similar clusters are merged
until the desired number of clusters remains. Agglomerative methods are more common than
divisive methods, mostly because they are more straightforward to implement. Figure 4.1
illustrates the merging steps of an agglomerative algorithm.
Figure 4.1. Merging steps of a hierarchical clustering algorithm.
A similarity function is used to select the two clusters to be merged at each step.
The three main approaches for measuring the distance between two clusters are
• Single linkage: measures the distance between the closest members of the clusters.
• Complete linkage: measures the distance between the most distant members.
• Average linkage: measures the average distance between the members of the
clusters.
The similarity function must be monotonic so that the similarity of the merged cluster
with any other cluster is always less than the similarity between either of them alone and
any other cluster.
A monotonic similarity function guarantees that the similarity level at which clusters are
merged never increases from one merging step to the next; a non-monotonic function would
lead to inconsistencies (inversions) in the hierarchy of steps.

Hierarchical clustering algorithms (like most clustering algorithms) are computationally
demanding. In each of the n merging (or division) steps, every cluster must be compared
to every other cluster in order to find the closest pair. Even though the number of operations
needed decreases in each round of operation, the comparison step requires
O(n²) operations.
In group-average agglomerative clustering, the quality of a cluster C can be measured as the
average pairwise similarity between its members:

$sim(C) = \frac{1}{|C|(|C|-1)} \sum_{a \in C} \sum_{b \in C, b \neq a} s(a, b)$, (4.2)

where $s(a, b)$ is the cosine similarity between members a and b. The algorithm starts by
constructing an initial set of clusters G, placing each vector in its own cluster. In each
iteration, the algorithm finds the two clusters $C_i$ and $C_j$ that maximize $sim(C_i \cup C_j)$,
merges them, and creates a new set of clusters G':

$G' = (G - \{C_i, C_j\}) \cup \{C_i \cup C_j\}$. (4.3)
The iteration terminates when |G| = k. The complexity of computing the average
similarity of the items in a cluster is O(n²), so if the average similarities were recomputed
from scratch each time a cluster is merged, the algorithm would be O(n³). However, the
algorithm can be sped up by maintaining for each cluster the sum of its member vectors:
the sum is easily updated after a merge (by simply adding together the sums of the merged
clusters), and it allows efficient computation of the average similarity [MaS02].
Initially, the algorithm computes the similarities between the singleton clusters and sorts
them by similarity. Computing and sorting the n² initial similarities takes O(n² log n). In
each of the n merge operations, the pair of clusters with the highest similarity is identified
and merged. The similarity matrix must then be updated so that the two chosen clusters are
removed and replaced with the similarities to the merged cluster; each such iteration takes
O(n log n). Since the pairwise similarities between the n items need to be computed only
once, in the first iteration, the n merging steps give an overall complexity of O(n² log n).
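In practice, an off-the-shelf implementation can be used for group-average agglomerative clustering. The sketch below (assuming SciPy is installed) uses average linkage with cosine distances, which approximates the group-average criterion above, and cuts the merge tree into k clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster  # assumes SciPy is installed

def agglomerative(x, k):
    """Group-average agglomerative clustering; returns cluster labels 1..k."""
    merges = linkage(x, method="average", metric="cosine")  # full merge hierarchy
    return fcluster(merges, t=k, criterion="maxclust")      # cut into k clusters

labels = agglomerative(np.random.rand(20, 5), k=4)
print(labels)
```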
A profile for a cluster can be constructed by calculating the average profile of the
members in the cluster. A cosine similarity measure can be used for comparing a cluster
profile with a user profile:
$S(C, x) = \langle p(C), p(x) \rangle$. (4.4)
As the cluster profile represents a typical member in that cluster, averaging over all of the
members in the cluster is probably not the best representation since it is vulnerable to
outliers. A more suitable profile can be constructed by averaging the profiles of n most
central members in a cluster. Let R be the set of n most central members, then a trimmed
average profile pn(C) can be calculated as
$p_n(C) = \frac{p'_n(C)}{\| p'_n(C) \|}$, (4.5)

where

$p'_n(C) = \sum_{x \in R} p(x)$. (4.6)
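A sketch of the trimmed average profile of equations (4.5)-(4.6). The text does not fix how the n most central members are found; here, as an assumption, centrality is taken as similarity to the plain mean profile:

```python
import numpy as np

def trimmed_profile(members, n):
    """Trimmed average profile (4.5)-(4.6): average the n most central members.
    Centrality here = similarity to the plain mean profile (an assumption)."""
    mean = members.mean(axis=0)
    central = np.argsort(members @ mean)[-n:]   # indices of the n most central members
    p = members[central].sum(axis=0)            # p'_n(C), equation (4.6)
    return p / np.linalg.norm(p)                # normalize, equation (4.5)
```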
5 Creation of user and content profiles
Users access the content items with a web browser or a mobile client. The content is
personalized so that the headlines best matching the user profile are shown in boldface. In
addition, each user is provided with a "my news" list, which contains personal
recommendations from the latest news items. For any content item, the user can request a list
of similar items to be displayed. The server is implemented in Java and runs on any
operating system that has a Java Runtime Environment available. The clients are mobile
applications implemented for both the J2ME and Symbian platforms. A web-based admin
interface is provided for ontology updates and for user and content analysis features.
Figure 5.1 provides a general overview of the system.
Figure 5.1. Service architecture.
1. The content feed is a (usually continuous) feed of content items. The system supports
various content types such as XML, plain text and PDF documents.
2. The content poller is a Java thread that periodically fetches content from a content
source such as an FTP or HTTP server.
3. The content handler parses the content item and forwards it to be stored in an SQL
database.
4. The personalisation engine creates profiles for each user and content item in the
system.
7. The client processor transforms the content into a format suitable for the device in
question.
Leiki Targeting is a software engine for real-time learning, profiling and personalisation.
It automatically creates comprehensive profiles for all the users and content in the
system. The content is profiled by semantic analysis of the textual content, guided by an
ontology of thematic concepts related to the domain. User profiles are automatically
learned from the usage history and the user's feedback on content items. User preference
learning is automatic, and there is no need for explicit preference settings from the user.
Explicit feedback can be used when available, but the system learns quite well just by
observing the reading patterns of users. By comparing the user profiles to the content
profiles, the system can generate best-matches lists for any user or content item, sorted
according to similarity. Figure 5.2 illustrates the functionality of the personalisation
engine.
Figure 5.2. Leiki Targeting personalisation engine.
5.3 Ontology

The ontology consists of the following elements:

• A set of patterns
• A set of categories
• References linking patterns to categories
• Relations linking categories and forming a tree structure
A pattern can be a word, a phrase or a regular expression. Patterns are associated with
categories, which denote concepts relevant to the domain. Patterns guide the extraction
of concepts from the content items in the content profiling process. A multilingual
ontology can be constructed by translating the patterns into the target languages. The
categories, as well as the derived profiles, are language independent (although some categories
may not exist for every language). That is a useful property if the aim is to find
similarities between documents written in different languages.
A category has one or more connections to other categories, denoting the semantic
relationships between concepts in the domain. The resulting structure is a directed acyclic
graph (DAG), in which nodes represent concepts and the connections between them
represent their generalization hierarchy. The difference between this structure and a simple
hierarchical taxonomy is that a child (a more specialized term) can have many parents (less
specialized terms), and that different kinds of relations between
concepts are possible, allowing more sophisticated reasoning when creating a profile. The Leiki
General Ontology currently consists of about 17000 categories. The uppermost levels are
constructed according to the International Press Telecommunications Council (IPTC)
topic sets. The rest of the ontology contains more specific categories placed under this
publishing-industry-standard top-level classification. Below is an example of a branch in
the Leiki General Ontology, in which IPTC categories are shown in boldface and the
additional, more specific categories are shown in regular text.
Geography -> Europe -> Finland -> Southern County of Finland -> Helsinki Area ->
Helsinki
Figure 5.4 shows a typical generalization hierarchy, where concepts become more general
when traversing up the tree. When the semantics of an edge is not specified, it corresponds to a
generalization relation, although other kinds of relations can be defined as well.
Figure 5.4. A section of the Leiki General Ontology (not all connections are shown).
Knowledge about the type of a relation can be used when choosing a suitable weighting
scheme in profile creation. For example, a user's search query can be made more
restrictive by expanding categories only when the relation denotes a highly relevant
relationship between parent and child; in this way, the user's query can be represented by a
set of categories in the ontology that are highly relevant to the terms the user entered.
The ontology was constructed using the ontology editor, a custom tool for building
ontologies. The ontology constructor used the editor to define categories and set the
connections between them. The GUI of the editor is split into two panes. In the left pane,
categories are represented as nodes in a hierarchical tree view; clicking a node
expands it and reveals its children. When a node is selected in the tree, the
right pane shows the patterns associated with the category, along with its parent and child
relations. New patterns for a category can be added by entering them into the keywords list.
A list of possible parents and children is shown, allowing connections to
other categories to be added.
Figure 5.5. Ontology editor.
The Leiki General Ontology currently contains over 17000 categories with over 19000
patterns associated with them. The two uppermost levels are constructed according to the
International Press Telecommunications Council (IPTC) topic sets. The rest of the
ontology contains concepts related to Financial Times news subjects.
The content profiler module analyses incoming content and creates a profile for each
content item. The four-step process of profile creation is illustrated in figure 5.6.
Figure 5.6. Content profiling process.
In the preprocessing phase, punctuation is removed from the metadata and possible
markup tags are stripped away.

In the matching phase, the patterns in the ontology are matched against the metadata of
the content item, and the categories found are chosen for the initial profile.

The found categories are expanded in the reasoning phase, so that additional categories are
included in the profile according to the connections in the ontology.

Finally, the selected categories are weighted in the weighting phase according to their relevance
in describing the content item. The weight of a category is based on the local frequency of the
category (the number of matches to that category), the global frequency of the category (the
cumulative frequency of that category over all content) and the position of the category
in the ontology. Profiles are normalized by their length in order to prevent long content
items from dominating shorter ones in the matching process.
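The exact weighting formula used by the engine is not reproduced here; the sketch below is only a hypothetical illustration of how the three named factors, local frequency, global frequency and ontology position, could be combined into a single weight:

```python
import math

def category_weight(local_freq, global_freq, depth):
    """Hypothetical category weight: frequent in the item, rare overall,
    and deeper (more specific) ontology categories get more weight."""
    rarity = 1.0 / math.log(2 + global_freq)   # discount globally common categories
    specificity = 1.0 + 0.1 * depth            # deeper in the ontology = more specific
    return local_freq * rarity * specificity
```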
The outcome of the content profiling process is a human-readable, language-independent
profile, which contains categories with weights assigned according to the topics found
in the metadata. Figure 5.7 shows a profile that was created from the headline and the
ingress text of an RSS news item.
Figure 5.7. Profile created from the headline and the ingress of an RSS news item.
There are fifteen categories in the resulting profile. The category cricket has the biggest
weight: it occurs twice in the text, and one of the occurrences being in the headline
suggests that it is highly relevant in describing the contents of the text. The category
pakistan has a relatively high weight as well, since it, in addition to being mentioned in the
headline, inherits weight from one of its children, islamabad. In general, the categories
with the biggest weights result from a direct match to one or more of the patterns
associated with the category. The rest of the categories are the result of expansion
according to the connections in the ontology. The further the connected category is from the
matched category, the less weight it is given. Weights are further adjusted according to
their information content (see chapter 2.1).
The user profile is automatically learned from the user's browsing behavior. Each time a
user accesses a content item, the user profile is adjusted according to the profile of the
content item that the user accessed. The user profiling process is illustrated in figure 5.8.
The process starts with the user accessing a content item (in this case an RSS
news item). This results in a feedback action, which can be one of the following:
• An explicit negative feedback: user dislikes the content item.
• An explicit positive feedback: user likes the content item.
• An implicit positive feedback: user simply reads the content item, which the
system records as a slight positive feedback.
The feedback action is sent to the server with the id of the content item for user profile
adjustment. User profiling is an ongoing process: it starts when the user uses the
system for the first time, and the profile becomes more comprehensive each time the user reads
a content item or gives explicit feedback. The use of implicit feedback is based on the
assumption that a user reading an article is interested in the subjects present in
that article. To illustrate profile formation, let us consider a user who accesses
three content items. The user profile is adjusted according to the profiles of these three
items, and the resulting profile is shown in figure 5.9.
Figure 5.9: Example of a user profile after user has accessed three content items
Like content profiles, a user profile is human-interpretable and can be represented in any
of the languages supported by the ontology. In order to keep the profile from growing without
bound, weights are slowly decayed over time. This enables the system to better adjust to the
user's changing preferences, as previously interesting but lately ignored topics slowly
diminish in the profile.
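A schematic sketch of this learning step; the feedback weights and the decay factor are illustrative assumptions, not the engine's actual values:

```python
FEEDBACK_WEIGHTS = {  # assumed strengths for the three feedback types
    "explicit_negative": -1.0,
    "explicit_positive": 1.0,
    "implicit_positive": 0.2,  # simply reading an item counts slightly positive
}
DECAY = 0.99  # per-update decay factor; an illustrative value

def update_profile(user_profile, item_profile, feedback):
    """Decay the old category weights, then add the content item's weights
    scaled by the feedback strength."""
    w = FEEDBACK_WEIGHTS[feedback]
    for category in user_profile:
        user_profile[category] *= DECAY            # old interests slowly diminish
    for category, weight in item_profile.items():
        user_profile[category] = user_profile.get(category, 0.0) + w * weight
    return user_profile
```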
6 Clustering of users
The previous chapters provided some background for text analysis, user modeling and clustering and gave examples of their use in information retrieval tasks. This chapter introduces an application that combines these three components with the goal of clustering users into segments with similar interests. The found segments are represented as cluster profiles denoting the profiles of stereotype users in the user base. The aim of the clustering process is to find natural user groups in the data so that users inside a cluster share more interests with each other than with the users in other clusters.
The clustering process starts by transforming the data into a format suitable for processing by the algorithm and by removing incorrect or erroneous items. Once the data is preprocessed and the relevant features are selected, a suitable algorithm is chosen for the clustering task. Since there is no general clustering algorithm that would work with every data set, it is important to consider the suitability of the algorithm for this specific data and purpose.
The last step is the interpretation of the result. The purpose of clustering may be purely explorative, although usually there is some prior understanding of what kind of information clustering can provide. In the case of user segmentation, the objective is to find the stereotype users. The presentation method should be chosen according to the purpose of the clustering; good visualization techniques help in representing the results in a meaningful way.
6.1 Algorithm
Clustering of the user base does not need to happen in real time; it can be run as a batch process at specified time intervals. In that light, K-means is a good choice to start with due to its scalability to large data sets. It has limitations, though: the result is sensitive to the initial seeds, and the algorithm has a tendency to produce a different clustering result on each run. Since the result is highly dependent on the quality of the initial seed clusters, using heuristics in choosing the seeds seems to yield better results.
Our hybrid algorithm uses a hierarchical algorithm as a pre-processing step for K-means. Since K-means stops at one of many local minima, using the hierarchical algorithm to select the initial seeds should eliminate most of the bad ones. The algorithm first chooses sqrt(n) users at random and clusters them with the hierarchical algorithm. The centroids of the resulting clusters are provided as input to K-means. The initial clustering can be performed in time O((sqrt(n))^2 log n) = O(n log n), where n is the total number of items to be clustered. Combined with K-means, this yields a total complexity of

O(n log n + kni),

where k is the number of clusters and i is the number of K-means iterations. This algorithm always produces the same result for the same data.
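A sketch of the hybrid algorithm using common scientific-Python building blocks follows. The library choices, the average-linkage criterion and the fixed random seed are assumptions of this sketch rather than details of the implementation described above; the fixed seed is also what makes the sketch return the same result for the same data.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.cluster import KMeans

    def hybrid_cluster(X, k, seed=0):
        # Cluster the rows of X into k clusters.
        n = X.shape[0]
        rng = np.random.default_rng(seed)   # fixed seed: repeatable result
        # 1. Hierarchically cluster a random sample of sqrt(n) users.
        sample = X[rng.choice(n, size=int(np.sqrt(n)), replace=False)]
        labels = fcluster(linkage(sample, method="average"),
                          t=k, criterion="maxclust")
        # 2. Use the centroids of the sample clusters as K-means seeds.
        seeds = np.array([sample[labels == c].mean(axis=0)
                          for c in range(1, k + 1)])
        # 3. Run K-means on the full data starting from these seeds.
        return KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)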
6.2 Data
The data consists of a catalog of content items with associated metadata and a log file containing the user transactions. The service is a mobile content service providing downloadable ringtones. The content catalog contains 4745 ringtones with the following information attached to them:
• Product id
• Title (of the song)
• Artist
• Type (Polyphonic or True Tone)
• Timestamp
Since the title of the song does not provide much information about the topic, which, in the case of ringtones, denotes the music style or genre of the song, it is not included in the profiled metadata. Therefore, the only metadata that is used in profile creation is the name of the artist. Figure 6.1 shows a profile created from the input 2pac.
There are nine categories in the resulting profile. The category with the biggest weight, tupac shakur, is the result of a match to one of the patterns associated with that category. The rest of the categories result from the expansion according to the connections in the ontology.
The user log file contains a total of 260 000 user transactions, each transaction consisting of:
• User id
• Product id
• User Agent
• Timestamp
This data is sufficient as a basis for our segmentation. Apart from the timestamps, the only piece of information that is not utilized is the User Agent field in the user log file, as we want to base our segments purely on user interests and do not want to include information about the user's phone in the model. The timestamp might be useful in two ways:
• The popularity of a product depends on the time it has been available to the users. If two products are equally popular among the users in a cluster, the newer one is probably the better candidate for a recommendation since it has generated the same number of transactions in less time.
• Since users' tastes change over time, more recent transactions reflect the user's current interests better than older ones. To take this into account in the user model, the transactions could be weighted according to their freshness, as sketched below.
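For illustration, such freshness weighting could decay exponentially with the age of the transaction; the half-life parameter below is hypothetical.

    import time

    def transaction_weight(tx_timestamp, now=None, half_life_days=30.0):
        # The weight halves every `half_life_days`, so recent transactions
        # influence the user model more than old ones.
        now = time.time() if now is None else now
        age_days = max(0.0, now - tx_timestamp) / 86400.0
        return 0.5 ** (age_days / half_life_days)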
However, the current approach ignores timestamps. There are several reasons for this. Firstly, we do not know how reliable the timestamp associated with the content item is: does it denote when the item was first added to the catalog, or the time it was last updated? Secondly, if the user's history in the data is limited to a few months, as in this case, there probably has not been much change in the user's interests.
Let us consider a user who has ordered the three ringtones below:
Figure 6.2. Example of a user profile after the user has accessed three content items.
The next step is to apply cluster analysis to segment the generated profiles. The input for the clustering is a set of vectors and the output is a set of vector clusters. The number of clusters must be given as a parameter to the clustering algorithm.
Data pre-processing consists of transforming the input data into a format suitable for our clustering algorithm and removing incorrect or erroneous items. No scaling of values is needed since we are dealing with only one kind of property (item downloads) and binary values (downloaded, not downloaded). The 260 000 transactions in the data were performed by a total of 43 000 users. The number of transactions per user ranges from 1 to well above a hundred, the average being 6.1 transactions per user.
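Concretely, each user can be represented as a binary vector over the product catalog. A minimal sketch, with the data structures assumed for the example:

    def user_vectors(transactions, product_ids):
        # One binary component per product: 1 = downloaded, 0 = not.
        index = {pid: i for i, pid in enumerate(product_ids)}
        vectors = {}
        for user_id, product_id in transactions:
            vec = vectors.setdefault(user_id, [0] * len(product_ids))
            if product_id in index:        # some logged ids are missing
                vec[index[product_id]] = 1  # from the catalog
        return vectors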
The quality of the data was good in general, although there were some content items in the order log that could not be found in the content catalog. We did not filter out these items, since they did not harm the clustering procedure. However, some selection of users was performed: users with only one transaction were not included, because one transaction is hardly enough for learning a reliable user model. There were 6.7% of such users. Also, users with more than 40 transactions were not included, since their histories typically contained items from most of the genres and were often a cause of erroneous merging of two distinct clusters. 0.3% of the users had over 40 transactions. After these preprocessing tasks, the total number of users included in the clustering procedure was 38 500.
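The user selection described above amounts to a simple filter; the thresholds below come directly from the text.

    def select_users(transactions_by_user, min_tx=2, max_tx=40):
        # One transaction is hardly enough for a reliable user model,
        # and very heavy users span most genres and tend to merge
        # otherwise distinct clusters.
        return {user: tx for user, tx in transactions_by_user.items()
                if min_tx <= len(tx) <= max_tx}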
Clusters are presented in the user interface by displaying a cluster profile accompanied by a list of recommended items for the cluster. The cluster profile differs from the content and user profiles in the way the concepts are chosen from the ontology: in a cluster profile, only the music genres are displayed. This way the profiles stay relatively short while still reflecting all the relevant areas of interest in the cluster. There are many options for forming a profile for a cluster, of which the most obvious ones are:
• using the profile of the most central (medoid) user of the cluster,
• combining the profiles of all users in the cluster into one profile,
• averaging the profiles of the n most central users in the cluster, and
• creating a profile from the list of items recommended for the cluster.
The first option is obviously not perfect since it displays only the interests of one user, leaving out many interests possessed by other users in the cluster. The second option removes this shortcoming by combining the interests of all users into one profile; optionally, the more central users are given more weight in the profile. The third option is to create an average profile of the n most central users in the cluster using, for example, a fixed percentage of all users in the cluster. The fourth option simply creates a profile from the list of recommendations provided for the cluster by profiling (see chapter 5.5) the items in the list. Since cluster recommendations by definition reflect the common interests of the (most central) users in the cluster, this will intuitively provide a good profile for a cluster.
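As an illustration of the second and third options, a cluster profile can be computed as an average of member profiles, optionally restricted to the most central users. The vector representation and the NumPy usage are assumptions of this sketch.

    import numpy as np

    def cluster_profile(member_profiles, centroid, n_central=None):
        # member_profiles: one row per user in the cluster.
        distances = np.linalg.norm(member_profiles - centroid, axis=1)
        if n_central is not None:
            # Third option: average only the n most central users.
            order = np.argsort(distances)
            member_profiles = member_profiles[order[:n_central]]
        # Second option (n_central=None): average over all users.
        return member_profiles.mean(axis=0)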
The cluster view consists of three components: demographic data, a cluster profile and a list of recommended items. Each of these components provides information about the contents of the cluster from a slightly different angle. Demographic data plays an important role in making the concept of a cluster more tangible to the viewer, as it is something that the viewer is already familiar with. Demographic data is calculated as averages over all users in the cluster. In the example cluster, the demographic data consists of age, gender and the number of downloaded items. Since user demographics are not available in the log data, N/A is displayed in the corresponding fields. The average number of downloads denotes the number of content items the user has downloaded from the service. This is valuable information since it enables comparison of clusters according to the ARPU (average revenue per user) value of the users in the clusters. Based on this knowledge, the service provider may choose to direct marketing campaigns at those segments that are likely to generate the most revenue or, on the contrary, to send special offers to the less committed user segments.
Below the demographic data, a cluster profile is shown, which provides a general overview of the topics that the users in this cluster are interested in. This profile is slightly different from the content and user profiles presented earlier, since not all of the levels in the ontology are shown: in order to keep the profile short and easy to understand, only categories related to music genres are displayed.
Below the cluster profile, a list of items recommended to the (medoid user of the) cluster is shown. From the list of recommendations, the service provider can easily see which products are most in demand among the users in this cluster. By combining all three components into a single representation, we can provide the viewer with answers to the following questions about the cluster:
• What kind of users does this segment contain (demographic data)?
• What topics are the users in this segment interested in (cluster profile)?
• Which products are in demand in this segment (recommendation list)?
The above questions were the primary motivation for performing the segmentation of this data, and the representation was chosen to serve that purpose. With different motivations and objectives, another representation might provide better results, as is always the case in cluster visualization.
6.5 Results
In order to find out what kind of stereotype users can be found in the data, a sample set of 4000 users was chosen randomly and clustered into 3 clusters. A web-based user interface was developed as a part of this thesis for observing the clustering results. The profiles of the resulting clusters are shown in figures 6.3 – 6.5.
Figure 6.3. Profile of segment 1, containing 89.5% of the users.
Figure 6.4. Profile of segment 2, containing 4.1% of the users.
Figure 6.5. Profile of segment 3, containing 6.4% of the users.
The cluster profiles and the recommendation lists show that the users in the three segments differ in their music tastes. The profile of the first segment shows the biggest interest in hip-hop & rap, followed by pop and electronic music. This is the kind of music that currently gets most of the airtime on the radio, which is reflected in the size of this segment: it contains nearly 90% of the users. The second segment is the smallest one, with only 4.1% of the users. The users in this segment show interest towards rock music, with some hip-hop & rap flavor as well. The third segment, containing 6.4% of the users, consists mostly of users who like to choose TV themes as their ringtones.
6.6 Validation of the result
How can one be sure that the clustering result really is a good description of the data, or whether there would be a better result that was not found? In validation of clustering, the objective is to choose a suitable measure for identifying the sets of clusters that are compact and well separated. Due to the unsupervised, explorative nature of clustering, general validation schemes do not exist, and the validation method must often be tailored for the specific clustering algorithm and data individually. The validation procedure may contain technical and semantic components. In technical evaluation, technical qualities of the method, such as the scalability and flexibility of the algorithm, are assessed. In semantic evaluation, the quality of clusters is measured using metrics such as the compactness and separation of clusters. Compactness indicates how close the items in a cluster are to each other, measured by calculating the variance of the items in relation to the cluster center. The clustering result can also be judged by a human. This requires proper visualization of the clusters, possibly with a projection of the high-dimensional data into two or three dimensions. Judging merely by looking at the clustering result is subjective and suffers from the human tendency to recognize structure even when no structure exists, which leads to the need for more objective methods [HBV01].
Since the K-means method aims at producing compact clusters by minimizing the sum of the squared distances of data points to the cluster centre, we can use the distances of the points from their cluster centre to determine whether the clusters are compact. We define the intra-cluster distance as the average distance of all the points in a cluster to the cluster centre. We also want to measure the inter-cluster distance, which we define as the minimum distance between the centers of two clusters. Only the minimum value is taken into account in order to maximize the separateness of the two closest clusters: as all the other values are larger, the rest of the clusters will be well separated too.
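These definitions translate directly into code. The score below combines them as d/sqrt(s), following table 6.1; averaging the intra-cluster distances over all points at once is a simplification made in this sketch.

    import numpy as np

    def quality_score(X, labels, centers):
        # Intra-cluster distance d: average distance of the points to
        # the centre of their own cluster (compactness).
        d = np.mean(np.linalg.norm(X - centers[labels], axis=1))
        # Inter-cluster distance s: minimum distance between any two
        # cluster centres (separation of the two closest clusters).
        s = min(np.linalg.norm(a - b)
                for i, a in enumerate(centers)
                for b in centers[i + 1:])
        return d / np.sqrt(s)   # the score used in table 6.1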
As can be seen in the chart below, the quality score of the clustering is heavily dependent on the value of k. With k values smaller than 4, the scores are at their highest due to the high level of separation between the clusters. When visually observing the clustering result, good separation is reflected in the cluster profiles as highly distinct music tastes between the user segments. As the number of clusters is increased, the average separation of the clusters tends to decrease; consequently, the profiles of different clusters share more genres.
[Chart: Cluster validation. Quality score (y-axis: Score, 0-120) as a function of the number of clusters (x-axis: 2-16).]
Table 6.1. Quality scores for clustering results with k values between 4 and 13.

Nro of clusters    Density (d)    Separation (s)    Score d/sqrt(s)
4                  656            115               61.17
7                  652            96                66.54
13                 619            97                62.84
From the table we see that, according to the chosen quality score, the ratio of the density to the square root of the separation, the k value of 7 produces the best clustering result for this set of users.
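Choosing k then reduces to scanning candidate values and keeping the best-scoring result. The sketch below reuses the hypothetical helpers sketched earlier (hybrid_cluster and quality_score) and, following table 6.1, treats a higher score as better.

    def best_k(X, k_values):
        scores = {}
        for k in k_values:
            model = hybrid_cluster(X, k)
            scores[k] = quality_score(X, model.labels_,
                                      model.cluster_centers_)
        # Return the k with the best (highest) quality score.
        return max(scores, key=scores.get)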
[Chart: Cluster validation. Quality score (y-axis: Score, 58-68) as a function of the number of clusters (x-axis: 4-13).]
7 Conclusions
In this thesis it was shown that the information hidden in a user's browsing history provides a valuable source of knowledge. Analyzing users' selections while they browse helps the system understand the information needs of the user and make suggestions based on them. This work showed some approaches for using that information to personalize the content offering and to segment users into interest groups. Two approaches for content recommendation were presented:
• An ontology-based approach, in which content items are analyzed with the help of an ontology and user profiles are learned by observing the topics found in the content items that the user has accessed. The user is recommended content items that contain topics similar to those that the user has previously shown interest in.
• A cluster-based approach, in which the user base is segmented into groups of users with similar interests and the items popular within a user's segment serve as recommendations for the users in that segment.
The user base was segmented using clustering techniques. A hybrid algorithm, a mixture of hierarchical and partitional approaches, was implemented for the task. The results showed that users differ in their content preferences and that the user base can be segmented into natural interest groups, each group having a distinct taste in music. Further assessment of the meaningfulness of the clustering result was left for future work. The cluster centroid was used to represent the user segment. The objectives of user segmentation in an online digital content service include:
• To provide knowledge about the users. The service provider can observe users as
evolving interest segments and adjust the content in the service based on changes
in the trends.
• To provide segmented user base for content targeting or targeted marketing.
Users are hesitant to give away their personal information. Online services that apply a registration process requiring demographic information and details about personal interests often face this in the form of a decreased user base. In an age of increasing nuisances such as unsolicited email and viruses, people can hardly be blamed for being concerned about their privacy while online. All of the above methods of user modeling require some level of compromise in the individual user's privacy. A good practice is to inform users about the gathering of information and to explain what it is used for. Gathered information should not be shared with third parties unless the user has given permission to do so. Naturally, it is important to maintain user information in a secure system that is not vulnerable to attacks by intruders.
A standard such as the Platform for Privacy Preferences (P3P) allows users to set their preferences regarding privacy issues. When the user enters a site, the privacy policy of the service is compared to the privacy preferences set by the user, and any conflicts are reported to the user. In order to work in practice, support for exchanging such information must exist on the server as well as in the client software. Still, additional effort is required from the user and the service provider to manually enter the privacy levels. The future will show whether this kind of standard makes it to the mainstream.
As future work, it would be interesting to use trend analysis for user segments. Doubtless, the natural clusters that can be found in the data look different from time to time. Trend analysis could further reveal users' changing interests by finding topics that have gained interest recently as well as those that are losing attention. Also, clustering could have been applied to ontological user profiles rather than just the click histories. This approach would have benefited from the ontology not only in the visualization, but also in expanding the user profile to reflect related interests. There would have been fewer users that do not belong to any of the clusters due to their unique browsing history. The downside is that the ontology-based approach is more dependent on the existence and the quality of the ontology. The most severe limitation in the current implementation is its restriction to non-overlapping clusters only: it is obvious that one user should be able to belong to more than one cluster. Fortunately, there is a K-means variant that allows fuzzy clusters [DeM88].
Bibliography
BlC01 Blake, C., Pratt, W., Better rules, fewer features: a semantic
approach to selecting features from text. Proc. of the
Institute of Electrical and Electronics Engineers Data
Mining Conference (IEEE DM 2001), San Jose, Ca, 2001.
CuK92 Cutting, D., Karger, D., Pedersen, J., Tukey, J., Scatter/Gather: a cluster-based approach to browsing large document collections. Proc. of the 15th Annual International ACM/SIGIR Conference, Copenhagen, 1992.
DeM88 De Gruijter, J.J., McBratney, A.B., A modified fuzzy k-means method for predictive classification. In: Bock, H.H. (ed.), Classification and Related Methods of Data Analysis, Elsevier, Amsterdam, 1988.
HBV01 Halkidi, M., Batistakis, Y., Vazirgiannis, M., On clustering validation techniques. Journal of Intelligent Information Systems, 17(2-3), p. 107-145, 2001.
HeH98 Heckerman, D., Horvitz, E., Inferring Informational Goals from Free-Text Queries. Proc. of the Fourteenth Conference on Uncertainty in Artificial Intelligence, p. 230-237, Madison, Wisconsin, 1998.
MaS02 Manning, C., Schütze, H., Foundations of Statistical Natural Language Processing. MIT Press, p. 495-527, Cambridge, 2002.
PaB97 Pazzani, M., Billsus, D., Learning and Revising User
Profiles: The Identification of Interesting Web Sites.
Machine Learning, 27, p. 313-331, 1997.
SaC99 Sanderson, M., Croft, B., Deriving concept hierarchies from text. Proc. of SIGIR-99, the 22nd ACM Conference on Research and Development in Information Retrieval, p. 206-213, Berkeley, 1999.
Sal75 Salton, G., Wong, A., Yang, C., A vector space model for automatic indexing. Communications of the ACM, 18(11), p. 613-620, 1975.
Computer Supported Cooperative Work Conference, ACM,
New York, 1994.
Wor06 WordNet, http://www.cogsci.princeton.edu/~wn/ (9.3.2006).