
FAKULTÄT FÜR INFORMATIK
DER TECHNISCHEN UNIVERSITÄT MÜNCHEN

LEHRSTUHL FÜR WIRTSCHAFTSINFORMATIK (I17)
PROF. DR. HELMUT KRCMAR

Bachelor Thesis in Wirtschaftsinformatik

Identification and Prevention of Semantic Duplicates in Web 2.0-based Innovation Portals

Julian Riediger


Identification and Prevention of Semantic Duplicates in Web 2.0-based Innovation Portals
Identifikation und Verhinderung von Semantischen Duplikaten in Web 2.0-basierten Innovationsportalen

Author: Julian Riediger
Supervisor: Prof. Dr. Helmut Krcmar
Advisor: Christoph Riedl, M.Sc.
Date of Submission: 17.08.2009

I assure the single-handed composition of this bachelor thesis, only supported by declared resources.

Garching b. München, 17.08.2009
Place, Date

________________________
Signature

Abstract
This thesis covers an approach for the identification and prevention of semantically duplicate ideas in web 2.0-based innovation portals. Based on a vector space representation of documents, a special clustering algorithm is proposed that is able to identify and group similar ideas into a single cluster. The algorithm's specific focus on duplicate identification proved successful with regard to the clustering and recognition of semantic duplicates. Beyond improved idea evaluation, applications of clustered idea sets such as automatic thesaurus construction, cluster labeling and automated idea tagging, which can help to improve duplicate identification, are discussed and evaluated. On the foundation of the developed model and algorithm, a component featuring semantic duplicate prevention for an existing innovation portal has been developed. Based on a suitable user interface, semantically duplicate ideas entered by a user are automatically recognized. The effectiveness of this component was the subject of a user study conducted in the course of this thesis. The results show that the developed component is indeed able to reduce the number of duplicates significantly. The users' perception of the component, both with regard to the quality of suggested semantic duplicates and to an improved idea quality induced by it, was positive.

Keywords: duplicate detection, duplicate, semantic, web 2.0, clustering, cluster, algorithm, innovation portal, evaluation, user interface, labeling, tagging, automatic thesaurus, thesaurus construction


Contents
List of Figures
List of Tables
List of Abbreviations
Acknowledgement
1. Introduction
   1.1 Motivation
   1.2 Semantic Duplicate
   1.3 Research Objectives
   1.4 Approach
2. Background & Methods
   2.1 The Vector Space Model
   2.2 Term Weighting
   2.3 Stemming
   2.4 Similarity Matrix
   2.5 Clustering Algorithms
   2.6 Search Queries
   2.7 Weighted Zone Scoring
   2.8 Implementation
3. An Algorithm for Semantic Duplicate Identification
   3.1 Concept
      3.1.1 Requirements
      3.1.2 An Algorithm
      3.1.3 Characteristics of the Algorithm
      3.1.4 Functioning of the Algorithm
   3.2 Evaluation of the Algorithm
      3.2.1 Evaluation Criteria
      3.2.2 Test Set
      3.2.3 Creating Configurations
      3.2.4 Evaluating Configurations
      3.2.5 Interpretation of Results
      3.2.6 External Criterion
   3.3 Further applications of clustering
      3.3.1 Automatic Thesaurus Creation
      3.3.2 Automatic Thesaurus Creation in Innovation Portals
      3.3.3 Cluster Labeling
      3.3.4 Automated Idea Tagging
4. Semantic Duplicate Prevention in an Innovation Portal
   4.1 Concept
      4.1.1 Requirements
      4.1.2 Development
      4.1.3 Implementation
      4.1.4 Configuration
   4.2 Functioning of the Duplicate Detection
   4.3 Description of the User Interface
      4.3.1 Submitting an Idea
5. User study
   5.1 Study Design
      5.1.1 Overview
      5.1.2 Survey Design
      5.1.3 Evaluation Criteria
      5.1.4 Demographics
   5.2 Study Results
      5.2.1 Effectiveness of the Semantic Duplicate Detection
      5.2.2 Improved idea quality
      5.2.3 Quality of Suggested Duplicates
6. Conclusion
References
Appendix
   Appendix A
      Survey Evaluation Control Group
      Survey Evaluation Treatment Group 1
      Survey Evaluation Treatment Group 2
      Survey Evaluation Treatment Groups 1 & 2


List of Figures
Figure 1 An example of a 3-dimensional vector space, containing a total of 3 terms. Each document vector contains a weight (frequency) for each of the three terms.
Figure 2 The angle between two document vectors in a 3-dimensional vector space.
Figure 3 Example of a similarity matrix.
Figure 4 Algorithm step 0.
Figure 5 Algorithm step 1.
Figure 6 Algorithm step 2.
Figure 7 Algorithm step 3.
Figure 8 Algorithm step 4.
Figure 9 Algorithm final result.
Figure 10 Comparison of values for the top-4 performing configurations.
Figure 11 Relating number of duplicates to average similarity.
Figure 12 Relating to .
Figure 13 A thesaurus class is deducted from a three-element document cluster shown in a Venn diagram. Members of are therefore considered as semantically related.
Figure 14 Example thesaurus classes with stemmed word forms deducted from the clustered test set of ideas.
Figure 15 Conceptual UML class diagram of back-end solution.
Figure 16 Description of the functioning of the duplicate detection.
Figure 17 Input screen for a new idea. Source: Theseus TEXO Innovation Repository
Figure 18 Semantic duplicates have been found for the entered idea. Source: Theseus TEXO Innovation Repository
Figure 19 Close-up of the duplicate detection pop-up.
Figure 20 Number of ideas entered compared with the number of participants per group.
Figure 21 Comparison of suggested and accepted duplicates per treatment group.
Figure 22 Accepted vs. not accepted suggested semantic duplicates in treatment group 2.
Figure 23 Comparison between the number of semantic duplicates accepted and the total number of semantic duplicates found (including those not recognized by the presented solution).
Figure 24 Distribution of answer choices to the statement "The application helped me to improve the quality and extent of the ideas I entered" in treatment group 2.
Figure 25 Distribution of answers to the statement "The application helped me to improve the quality and extent of the ideas I entered" in the control group.
Figure 26 Mann-Whitney-Wilcoxon Test.
Figure 27 Distribution of answers to the statement "I found the suggested similar ideas useful" for both treatment groups 1 and 2.
Figure 28 Distribution of answers to the statement "Suggested similar ideas were indeed similar to ideas I entered" for both treatment groups 1 and 2.
Figure 29 Distribution of answers to the statement "The order of suggested similar ideas was matching the actual degree of similarity" for both treatment groups 1 and 2.

List of Tables
Table 1 Different configurations for zone weighting factor
Table 2 Different configurations for
Table 3 Evaluation results of various algorithm configurations
Table 4 Likert scale that has been used in the survey

List of Abbreviations
AJAX: Asynchronous JavaScript and XML
CosSim: Cosine Similarity
DF: Document Frequency
EC: External Criterion
IC: Internal Criterion
IDF: Inverse Document Frequency
SUTVA: Stable Unit Treatment Value Assumption
TAM: Technology Acceptance Model
TF: Term Frequency
TVM: Term Vector Model
USE: Usefulness, Satisfaction, Ease of Use
VSM: Vector Space Model

Acknowledgement
It is a pleasure for me to thank those who made this thesis possible. First of all, I would like to thank my supervisor Prof. Dr. Helmut Krcmar for his supervision and for granting me access to the resources of his chair. I am especially thankful to my advisor Christoph Riedl for his outstanding guidance and support during the creation of this thesis. With his challenging nature he helped me to further improve this thesis and supported me whenever I needed advice, even during the weekends. I would also like to thank my family and friends for not only giving me moral support (and, in the case of my parents, financial support for my studies), but also providing me with many useful ideas that have found their way into this thesis. Last but not least, I would like to thank the participants of the user study for their precious time, and a special thanks goes to the developers of Lucene for their high-quality open-source API that has been leveraged in the implementation presented in this thesis.

1. Introduction

1.1 Motivation

In many innovative environments across academia, industrial research and open user innovation, web 2.0-based innovation portals play a decisive role nowadays. However, one of the biggest obstacles in this context is to manage and categorize the number of ideas that evolve in typical innovation portal settings. More precisely, similar ideas in innovation portals cannot be identified properly in most cases, for a number of reasons. The most important reason is that natural language allows the same idea to be described in multiple ways, making it difficult for a machine to automatically identify two ideas as similar or even as duplicates. Most web-based innovation portals currently in use (e.g. Salesforce IdeaExchange) lack any automatic identification of similar ideas at all. Understandably, this lack of identification poses a big drawback for idea evaluation. This thesis dwells on the handling and prevention of similar ideas in innovation portals, so-called semantic duplicates.

1.2 Semantic Duplicate

In this thesis, the term semantic duplicate is used to describe an idea that is conceptually the same as or very similar to another idea, in distinction to the definition of a regular duplicate or near-duplicate, which only includes ideas that are exactly the same (e.g. identical entries in a database) or nearly the same (e.g. only a minor difference in title/description), respectively. Unless stated otherwise, the term duplicate is used synonymously with semantic duplicate in this thesis.

1.3 Research Objectives

The first research objective is to examine how semantic duplicates can be identified in an innovation portal. This question tackles the choice of a suitable measure for identifying semantically duplicate ideas as well as the development of a solution that makes it possible to identify semantic duplicates in an existing set of ideas.

The second research objective dwells on the prevention of semantic duplicates. More specifically, the question arises how the number of semantic duplicates can be reduced by a solution that is able to detect and prevent a semantic duplicate when a new idea is submitted. Based on the solution of the first research objective, in combination with a reasonable user interface, a component for an existing innovation portal is to be developed and evaluated. The focus of examination is especially the users' perception of the duplicate prevention, relating to the question of what an effective user interface should look like.

Yet there is another interesting question in this context that deserves attention. If it were possible to build a solution that links semantically related ideas in order to prevent semantic duplicates, can the use of such an application for duplicate prevention ultimately result in an improved quality of ideas, as users with similar ideas are then able to see each other's contributions? To be more specific, can the assumption be asserted that bringing (people with) similar ideas together results in a positive effect on idea quality?

1.4 Approach

This thesis tries to answer the research questions with a two-step approach. The first objective has been approached through the evaluation and development of a suitable algorithm that is able to cluster similar ideas based on their titles and descriptions. Although there have been successful attempts to leverage methods from information retrieval to identify real duplicates or near-duplicates, specifically in databases (Koudas/Marathe/Srivastava 2004), nothing comparable has been done so far for the identification of semantic duplicates in innovation portals. Compared to the identification of near-duplicates in databases, the identification of semantic duplicates is far more challenging. For the similarity evaluation of ideas, a model based on methods from information retrieval has been developed. The usage of such methods, specifically clustering algorithms, has shown good results in grouping related documents in many other domains, so the question remains whether this applies to duplicate detection in innovation portals as well. In particular, the fact that most descriptions of ideas are comparatively short poses a challenge in deciding whether an idea is a semantic duplicate, and therefore deserves special attention. Furthermore, applications of the developed algorithm for improvements in duplicate recognition will be discussed.

The second objective has been approached through an implementation of the underlying model of the first subproblem in an existing web 2.0-based innovation portal, paired with the creation of a suitable web-based user interface for semantic duplicate detection. The effectiveness of this approach was the subject of a user study that has been conducted for this thesis.

This thesis begins with an introduction to the necessary basics, especially the Vector Space Model and its characteristics, as well as a short overview of clustering algorithms. The third chapter provides a detailed description of the developed algorithm, its configuration parameters and implementation, as well as applications of this algorithm for semantic duplicate detection. The fourth chapter dwells on the development and implementation of an active semantic duplicate detection. The effectiveness of the developed solution is evaluated in chapter five. In the last chapter, the results are summarized and an outlook on potential improvements is given.

2. Background & Methods

In the following chapter, the methods that have been used to build a suitable representation for semantic duplicate identification are presented. Whenever necessary, special adjustments that have been made in the concrete implementation of the model are discussed in detail.

2.1 The Vector Space Model

First presented in 1975 by Salton, Yang and Wong (1975, 613ff), the Vector Space Model (VSM) has become one of the most important and influential models in information retrieval. It describes a mapping of document sets into an algebraic vector space that enables a range of applications on these document sets. The basic idea behind the VSM is to map each document to a term vector where each component of the vector represents a term of this document (hence the VSM is also known as the Term Vector Model, TVM). The set of documents becomes a vector space of dimension t, where t is the number of unique terms in the document set. Each document d_j of the document set is then represented by a t-dimensional vector

    d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})

where w_{i,j} represents the weight of term i in document d_j, measured by the frequency of this term (the number of times the term occurs in the document). For example, w_{i,j} = 0 means that term i does not occur in document d_j at all, while w_{i,j} = 4 means that term i occurs exactly four times in document d_j. This vector-based approach requires that all grammatical structure inside the document is ignored in this representation, so that the document as a term vector can actually be imagined as a bag of words, where only the number of occurrences of a term is considered.
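The bag-of-words mapping described above can be sketched in a few lines of Python. This is an illustrative sketch only (the thesis implementation itself builds on Lucene); the function name and sample text are invented for illustration.

```python
from collections import Counter

def term_vector(text):
    """Map a document to a bag-of-words term vector: term -> raw frequency.
    Grammatical structure and word order are discarded, as in the VSM."""
    return Counter(text.lower().split())

doc = "recycle bins recycle waste"
vec = term_vector(doc)
# vec["recycle"] is 2, vec["waste"] is 1, and absent terms have weight 0
```

A `Counter` is a convenient sparse representation of the t-dimensional vector: dimensions whose weight is zero are simply not stored.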

Figure 1 An example of a 3-dimensional vector space, containing a total of 3 terms. Each document vector contains a weight (frequency) for each of the three terms. Source: Own illustration, according to (Salton/Yang/Wong, 1975, 614)

This quite simple approach of mapping documents into an algebraic model offers numerous opportunities based on its mathematical characteristics. In the case of this thesis, a similarity measure between documents is needed to evaluate how close (similar) two documents (ideas) actually are¹. Thanks to the vector representation of document sets, one can now resort to the dot product, also known as the scalar product, to determine the similarity of two document vectors, derived from their distance in a t-dimensional vector space, expressed by their inner angle. The dot product of two vectors is defined as

    d_a · d_b = Σ_{i=1}^{t} w_{i,a} · w_{i,b}    (1)

which is simply the sum of the products of the document vectors' components in a t-dimensional vector space. This definition is based on an orthonormal vector space². In the VSM this assumption is usually not met, creating the need for a more generalized definition.

¹ As ideas can be considered as documents from a technical point of view, and the term document is common in most descriptions of the VSM, the term document is used in the description of this model whenever referring to an idea in this thesis.
² Orthonormal vector spaces require per definition that all vectors are mutually orthogonal and of unit length.

The generalization of (1) for non-orthonormal vector spaces is the following extended definition

    d_a · d_b = |d_a| · |d_b| · cos(θ)    (2)

where |d_a| · |d_b| denotes the product of their Euclidean lengths and cos(θ) represents the cosine of the angle θ between the two vectors in their respective vector space. Transforming (2) into

    cos(θ) = (d_a · d_b) / (|d_a| · |d_b|)    (3)

gives the cosine of the angle between the two document vectors, calculated from their dot product, which is in turn divided by the product of their Euclidean lengths. With definition (3), the cosine of the angle of two document vectors is easy to calculate and serves as a substitute for the angle itself, which in fact determines the similarity of two documents in the VSM. Therefore we define the cosine of the angle between two vectors d_a and d_b as CosSim(d_a, d_b) = cos(θ), which serves as our similarity measure for documents, the so-called Cosine Similarity. Next to a comparably simple calculation, the Cosine Similarity offers characteristics that help to quickly evaluate the similarity of two documents. Per definition, the cosine function delivers values between -1 and 1, although in this context we will only see values between 0 and 1³. In this specific context, a value of 0 means no similarity between the documents at all, while a value of 1 means a perfect duplicate (two identical documents). Another important feature of the Cosine Similarity is its symmetry

    CosSim(d_a, d_b) = CosSim(d_b, d_a)

which will be dwelled on later in this thesis.

³ Negative values for the cosine function are not possible by definition (3), as all term weights are non-negative.
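Definition (3) translates directly into code. The following Python sketch is illustrative only and assumes the bag-of-words term vectors described above; it is not the Lucene-based implementation used in the thesis, and the sample ideas are invented.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors, per definition (3):
    the dot product divided by the product of the Euclidean lengths."""
    dot = sum(w * b.get(term, 0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

d1 = Counter("recycle old phones for charity".split())
d2 = Counter("charity program to recycle phones".split())
sim = cosine_similarity(d1, d2)
# 0 <= sim <= 1; identical documents yield 1; the measure is symmetric
```

Because term weights are non-negative, the result is always in [0, 1], matching the value range discussed above.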

Figure 2 The angle between two document vectors in a 3-dimensional vector space. Source: Own illustration

The VSM as described above offers another handy feature. Not only can existing documents be transformed into a vector space representation, but a term query, as found in a variety of search applications, can be represented as a vector as well. All terms contained in a (multi-)term query are simply represented as a regular term vector. In fact, one can imagine a multi-term query as a document that is transformed into a vector space representation as described above. This now enables the use of the Cosine Similarity to compare documents with queries, yielding a result set of documents matching the query. However, in contrast to the Boolean retrieval model (Waller/Kraft 1979, 236), which simply returns all documents containing the query, the VSM additionally returns a ranked (ordered by relevance) result set of documents. The ranking becomes possible as the Cosine Similarity allows a value-based ranking (therefore the VSM is also called a Ranked Retrieval model). This feature will turn out to be essential in the component for semantic duplicate prevention.
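The query-as-vector idea can be illustrated with a minimal ranked retrieval sketch. The example ideas and helper names below are invented for illustration; this is not the thesis implementation.

```python
import math
from collections import Counter

def cos_sim(a, b):
    """Cosine similarity of two bag-of-words term vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "idea1": "recycle old phones for charity",
    "idea2": "a mobile app for carpooling",
    "idea3": "collect and recycle old electronics",
}
# The query is treated exactly like a document: it becomes a term vector.
query = Counter("recycle phones".split())
ranked = sorted(docs,
                key=lambda d: cos_sim(query, Counter(docs[d].split())),
                reverse=True)
# ranked lists all ideas ordered by relevance to the query
```

Unlike a Boolean model, documents that only partially match the query (idea3 here) still appear in the result set, just with a lower rank.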

2.2 Term Weighting

Using raw term frequencies as initially described suffers from one big downside, though. To understand this problem, one has to look at the idea of counting term frequencies in documents as a basis for similarity measuring. Ideally, the more often a term occurs in a document, the higher is the document's perceived relevance for a query (or another document) containing this specific term, while all terms of the t-dimensional vector space enjoy the same weight during relevance assessment. But what happens if a term occurs in almost every document of the set, so that its discriminating power to decide between relevant and non-relevant documents is very low? The term internet in documents on the evolution of the internet obviously has little discriminating power in that context. To solve this problem, a promising idea is to attenuate the weight of terms that occur too often in the document set and therefore lack discriminating power. The naive approach of counting the total number of occurrences of each term in the whole document set and attenuating those terms with the highest occurrences has been rejected in favor of a more document-based statistic, though. The reason for this becomes obvious once one recalls the purpose of improving term discriminating power. Having many occurrences of a term in total, but distributed over only some documents of the set, is not necessarily a bad thing with regard to discriminating power. However, this looks completely different if the term occurs in nearly every document of the set. With this in mind, it makes more sense to count the number of documents in which a term actually occurs, the so-called Document Frequency, defined as df_i for term i, rather than counting the total number of occurrences in the document set. Spärck Jones (1972, 11f) was the first to describe a concept called Inverse Document Frequency which incorporates the described thoughts. In it, a relation between df_i and the total number of documents in the set, denoted as N, has been established as

    idf_i = log(N / df_i)

where idf_i denotes the logarithm of the quotient of N and df_i for each term i of the t-dimensional vector space. Based on idf_i it is now possible to weigh the raw term frequencies in a document, initially described as tf_{i,j}, to attenuate the characterized negative effect of high-occurrence terms. So instead of the raw term frequencies as described above, the term vectors actually store the product of the Inverse Document Frequency and the raw term frequency, defined as

    w_{i,j} = tf_{i,j} · idf_i

For each term i, idf_i is calculated and then used to weigh its raw term frequency tf_{i,j} in document d_j.
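The tf-idf weighting scheme can be sketched as follows. This is an illustrative Python sketch with invented sample documents, using the plain idf_i = log(N / df_i) formulation given above (real implementations such as Lucene use smoothed variants of this formula).

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Weigh raw term frequencies with the inverse document frequency
    idf_i = log(N / df_i), attenuating terms that occur in many documents."""
    N = len(docs)
    tf = [Counter(doc.split()) for doc in docs]
    # document frequency: in how many documents does each term occur?
    df = Counter(term for vec in tf for term in vec)
    return [{t: f * math.log(N / df[t]) for t, f in vec.items()} for vec in tf]

docs = ["internet history", "internet protocols", "internet commerce"]
weighted = tf_idf_vectors(docs)
# "internet" occurs in every document, so df = N and its weight is log(1) = 0,
# while discriminating terms like "history" keep a positive weight
```

This reproduces the motivating example from the text: a term present in every document loses all discriminating weight.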


2.3 Stemming

In the description of the VSM, I described that all terms existent in the document set span a t-dimensional vector space. As natural language has the feature of having multiple forms of words (e.g. singular/plural, conjugations, tenses etc.), a simple term analysis of documents will result in many different terms for exactly the same word, merely existing in different word forms. Imagine the terms recycling and recycle, which would be stored as two separate terms. The negative effect of this is that comparing recycling to recycle would not result in a match, although in most cases, and especially in semantic duplicate detection, a match is required. To overcome this issue, stemming algorithms have been developed along language-specific rules. Their purpose in the described model is to reduce all terms to their root before indexing them into the vector space. One of the most famous algorithms for stemming is Porter's stemming algorithm (1980, 130ff), which has been used for English language processing in this thesis. Following the example above, both terms are stemmed to recycl by the Porter stemmer. A second stemmer has additionally been used for German language processing.
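The following toy suffix stripper only hints at what a stemmer does. It is deliberately much simpler than the Porter algorithm actually used in the thesis, which applies elaborate measure-based rules rather than a fixed suffix list; the function name and suffix list are invented for illustration.

```python
def naive_stem(term):
    """A deliberately simplified suffix-stripping sketch (not the Porter
    algorithm): drop the first matching suffix if a reasonable root remains."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: len(term) - len(suffix)]
    return term

# Both word forms collapse to a common root, so they match during indexing:
naive_stem("recycling")  # -> "recycl"
naive_stem("recycle")    # -> "recycl"
```

In practice one would use an existing implementation (e.g. a Porter or Snowball stemmer) rather than hand-rolled rules like these.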

2.4 Similarity Matrix

A similarity matrix is a square matrix that contains non-negative similarity values measuring the similarity of two documents. The numbers of rows and columns match the number of documents in the data set, so that each similarity value represents a similarity pair of two documents. The similarity values are usually bounded to an interval between 0 and 1. A common similarity measure used in similarity matrices meeting this requirement is the Cosine Similarity, which has already been described and is used here as well. Further constraints of a similarity matrix are symmetry and reflexivity. Recalling the symmetry property of the Cosine Similarity

    CosSim(d_a, d_b) = CosSim(d_b, d_a)

it is obvious that it fulfills this requirement. Reflexivity in this context means that a document is identical to itself, resulting in

    CosSim(d_a, d_a) = 1

expressing that the similarity of the document to itself equals 1.


Figure 3 Example of a similarity matrix. Source: Own illustration

To relate the similarity matrix to the described VSM and its features, one can think of the similarity matrix as the base outcome of a VSM that a number of retrieval applications can leverage as their foundation: the VSM provides a vector space representation of documents, in which the similarity of two documents can be determined using the Cosine Similarity, resulting in similarity pairs of documents that are stored in a similarity matrix with the presented properties.
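The construction just described can be sketched as follows. This is an illustrative Java sketch (names are my own, not from the thesis implementation): sparse term vectors are compared with the Cosine Similarity, and the resulting pairs fill a symmetric matrix with ones on the diagonal. It assumes non-empty vectors (non-zero norm):

```java
import java.util.*;

// Sketch of a similarity matrix built from cosine similarities of sparse
// term vectors (maps from term to weight). The matrix is symmetric and
// reflexive, as described in the text.
public class SimilarityMatrix {

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;  // only shared terms contribute
        }
        return dot / (norm(a) * norm(b));
    }

    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) sum += w * w;
        return Math.sqrt(sum);
    }

    static double[][] build(List<Map<String, Double>> docs) {
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            sim[i][i] = 1.0;                          // reflexivity
            for (int j = i + 1; j < n; j++) {
                sim[i][j] = cosine(docs.get(i), docs.get(j));
                sim[j][i] = sim[i][j];                // symmetry
            }
        }
        return sim;
    }
}
```

Only the upper triangle is actually computed; the lower triangle is mirrored, which halves the number of cosine evaluations.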


2.5 Clustering Algorithms

Clustering algorithms have been and still are subject to research and play a decisive role in information retrieval. Jardine and van Rijsbergen (1971, 219) formulated the so-called Cluster Hypothesis for information retrieval, which supposes that similar documents also possess similar relevance to a specified information need. Clustering algorithms should therefore help to answer this information need effectively.

The purpose of clustering algorithms in information retrieval is to group (cluster) the documents of a given document set into suitable subsets. The objective is to create clusters that feature a high intra-cluster but low inter-cluster similarity. Clusters should therefore be as coherent internally as possible, i.e. all documents of a cluster should be as similar as possible to each other, while documents in different clusters should be as dissimilar as possible.

The typical clustering process itself runs unsupervised; the only input needed is a suitable distance measure in order to compare the similarity of documents. While several measures are possible, the Cosine Similarity described above presents itself as a suitable distance measure for clustering documents based on their textual contents. It is important to understand, though, that the distance measure can be composed of any feature documents maintain, i.e. not only the textual content of a document but e.g. also the file type, creation date, tags etc. The distance measure only serves as a proxy that allows an easy comparison of how similar two documents actually are.

In the literature, a distinction is made between flat (partitioning) and hierarchical clustering algorithms (Jain/Dubes 1988, 55ff). A flat clustering algorithm creates clusters that are not structured in any way. Such algorithms usually require a fixed number of clusters as input in order to distribute the documents of the input set over the clusters.
The assignment to a cluster is usually based on the distance to pre-determined cluster seeds (centroids). One of the most popular flat clustering algorithms is the K-means algorithm, a well-known implementation of which was given by Hartigan and Wong (1979, 100ff). A hierarchical clustering algorithm, however, creates a structure in the shape of a hierarchy. Each cluster is a sub-cluster and/or a super-cluster of another cluster in the hierarchy. Based on the level of similarity, one can decide on a hierarchy level, which gives the underlying clusters. In contrast to flat clustering algorithms, hierarchical clustering algorithms do not need a pre-determined number and definition of clusters. One of the most common types in hierarchical clustering is the agglomerative hierarchical clustering algorithm, which starts with singleton clusters that are merged subsequently.

Another distinction is made between hard and soft clustering algorithms. Hard clustering requires that each document is assigned to exactly one cluster, while soft clustering allows a document to be a member of several clusters, although its membership is only fractional, i.e. the assignment is a distribution over the whole cluster set.

In regard to Jardine and van Rijsbergen's Cluster Hypothesis (1971, 219) and to the objective of duplicate identification, the information need obviously is to find similar documents for a given document of the document set. By building clusters containing similar documents, one can comply with this task of identifying semantic duplicates. In other words, the result should be one cluster for each idea/concept found, containing all ideas representing this idea/concept.

The decision whether to develop a flat or a hierarchical clustering algorithm depends on a variety of parameters. While flat clustering algorithms are usually less time-complex than hierarchical clustering algorithms, a difference in the quality of the results remains controversial. Earlier publications suggest that hierarchical algorithms outperform flat algorithms (Jain/Dubes 1988, 140), while newer examinations suggest the opposite (Steinbach/Karypis/Kumar 2000, 19). As time complexity is not a crucial factor in common innovation portals (even in big portals the number of documents is considerably less than one million) but the number of unique ideas/concepts is usually not known, using a traditional flat clustering algorithm such as K-means, which requires a pre-defined number of clusters, did not seem reasonable. However, Steinbach, Karypis and Kumar (2000, 19) observed that K-means, especially an extended version called bisecting K-means, outperforms hierarchical agglomerative clustering approaches in regard to clustering quality. In order to leverage the simplicity and quality of a flat clustering algorithm, as well as the positive characteristics of a hierarchical algorithm (no need for a pre-determined number of clusters), a new algorithm has been proposed that can be seen as a mixture of both types. The focus has been to develop a hard clustering algorithm that specifically fits the purpose of duplicate identification. A detailed description of the proposed algorithm can be found in chapter 3.

2.6 Search Queries

Another handy feature of the VSM is the ability to transform a query into a vector representation. This follows the same process as the transformation of a document to a vector. After transforming a query into a vector representation q, it can be handled like a regular document and compared with actual documents of the document set as follows:

sim(q, d) = (q * d) / (|q| |d|)

where the query vector q takes the place of a second document vector. In fact, the definition of a query can range from a multi-term query typically found in web search applications to complete documents that are handled as queries. The latter is the case in semantic duplicate prevention, where an entered document has to be transformed into a vector in order to check if similar documents already exist. The developed solution for duplicate prevention therefore makes use of this query-to-vector transformation.
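As a sketch of this query-to-vector transformation, the following illustrative Java snippet (naive whitespace tokenization, no stemming or idf weighting, all names invented for this example) turns a query string into a term vector and scores it against a small document set:

```java
import java.util.*;

// A query is transformed into a term vector exactly like a document and
// then scored against each document with the cosine measure. The
// tokenization is deliberately naive to keep the sketch short.
public class QueryScoring {

    static Map<String, Double> toVector(String text) {
        Map<String, Double> v = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+")) v.merge(t, 1.0, Double::sum);
        return v;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) nb += w * w;
        return dot == 0.0 ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Index of the document most similar to the query.
    static int bestMatch(String query, List<String> docs) {
        Map<String, Double> q = toVector(query);
        int best = -1;
        double bestSim = -1.0;
        for (int i = 0; i < docs.size(); i++) {
            double s = cosine(q, toVector(docs.get(i)));
            if (s > bestSim) { bestSim = s; best = i; }
        }
        return best;
    }
}
```

In duplicate prevention, the "query" would be the full text of a newly entered idea rather than a few search terms; the mechanics are identical.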

2.7 Weighted Zone Scoring

The score can be seen as the outcome of the Cosine Similarity calculation of two document vectors. In the approach described so far, the score of two documents has been calculated based on the similarity of the two complete documents, defined by score(d1, d2) = sim(d1, d2). A differentiated view of the title and the body of a document is imaginable as well. Such a view offers the possibility to weight title and body similarities separately in the actual score. A zone in this context describes a part of the document (e.g. title or body) that is considered for a differentiated similarity examination. The basic formula for weighted zone scoring that has been used in this thesis is defined as

score(d1, d2) = α * sim(title1, title2) + (1 - α) * sim(body1, body2)

where α denotes a value between 0 and 1 and title and body are regarded as separate zones. The actual weighted score used in the implementation of the described model is somewhat more complex than described here, as it additionally considers the cross-similarities of titles and bodies of a document pair. However, this does not have a substantial effect on the weighting scheme. Weighted zones allow an improved statement of whether two documents are actually similar. Especially in the context of innovation portals, where documents happen to be comparably short, a differentiated examination of zones may improve duplicate identification and prevention significantly. If empirical evidence should show that e.g. the title of a document is a good feature to decide whether two documents are similar, the title could gain a higher weight in the actual similarity score. The question how to weight the different zones will be addressed in the course of the configuration of the developed clustering algorithm.
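The basic weighting formula can be captured directly. The following minimal Java sketch (my own naming, cross-similarities omitted as in the basic formula) assumes the two zone similarities have already been computed as cosine similarities and combines them with the weighting factor alpha:

```java
// Weighted zone scoring sketch:
// score = alpha * sim(title1, title2) + (1 - alpha) * sim(body1, body2).
// The zone similarities are assumed to be cosine similarities computed
// as in the previous sections; here they are simply passed in as numbers.
public class ZoneScoring {

    // alpha in [0,1]: weight of the title zone; (1 - alpha) weights the body.
    static double score(double titleSim, double bodySim, double alpha) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in [0,1]");
        }
        return alpha * titleSim + (1.0 - alpha) * bodySim;
    }
}
```

Setting alpha = 1 scores by title only, alpha = 0 by body only, and alpha = 0.5 weighs both zones equally, which corresponds to the configurations evaluated later.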


2.8 Implementation

Based on the described approach, a model has been developed and implemented that features a VSM with stemming and weighted zone scoring capability as presented. The implementation has been done in Java; for indexing capability, including stemming, the Lucene API has been leveraged. The implemented model forms the foundation for the duplicate detection and prevention that will be described in the following chapters.


3. An Algorithm for Semantic Duplicate Identification

Derived from the motivation of this thesis to build a solution that is able to identify semantic duplicates in innovation portals, suitable approaches had to be evaluated. Clustering similar ideas based on a common feature seemed to be a promising approach. Theoretically, such ideas offer many features, such as titles, descriptions, tags, categories etc., that could be used as clustering features. Nevertheless, simply clustering based on common features such as titles or tags did not turn out to be flexible and accurate enough and would have ended in poor results. To obtain good results with such a simple approach, it would be required that ideas are correctly labeled with meaningful tags or that idea titles match each other completely, which is usually not the case in reality. For an effective semantic duplicate identification there was therefore no way around a more sophisticated approach that builds clusters based on a real content analysis of idea titles and descriptions. To carry out such a content analysis, the model described and implemented in the previous chapter has been used. On this foundation, a suitable clustering algorithm had to be developed and implemented. In the following, the nature of the developed algorithm and how it works will be described.

3.1 Concept

3.1.1 Requirements

The requirements for a clustering algorithm that is able to group semantically duplicate ideas are basically the same as for clustering algorithms that focus on grouping the whole document set. The goal is to maintain a high intra-cluster similarity, while inter-cluster similarity remains low, allowing for a clear distinction of clusters. Yet, there are some special requirements that have to be considered. First, the clusters should be optimized in regard to semantically duplicate ideas only. This is in contrast to most clustering applications, where the whole document set should be clustered. An idea is considered a semantic duplicate if its similarity value with another idea is higher than a threshold, defined as

sim(d_i, d_j) > θ

where θ denotes the similarity threshold that decides whether two ideas are semantic duplicates or not. Each cluster c is then defined by the documents d which fulfill the following condition

sim(d, z_c) > θ  and  sim(d, z_c) = max over all clusters c' of sim(d, z_c')

where z_c is defined as the centroid document of the cluster. In other words, in order to be part of a cluster, a document's Cosine Similarity to the cluster centroid has to be higher than θ and at the same time equal the maximum such similarity over all cluster centroids. All clusters build a set of clusters C, while the number of clusters is defined as

k = |C|

stating the cardinality of the set of all clusters. As each cluster represents a unique concept, k describes the number of unique ideas/concepts found that feature at least one duplicate.

3.1.2 An Algorithm

The following algorithm is based on the already presented similarity matrix.

1. Select a document y.
2. Check its similarity with the next document x of the active set. If
   a. the similarity is higher than the fixed threshold,
   b. higher than the maximum similarity for document x, and
   c. in case document y already belongs to another cluster, its maximum similarity is lower than the similarity to document x, then
      i. set document x's maximum similarity to document y,
      ii. in case of (c), remove document y from its old cluster and adjust the old cluster's maximum similarity if necessary.
3. Repeat step 2 until all documents of the document set have been checked.
4. Repeat step 1 until all documents of the active set have been checked.
5. Repeat steps 1 to 4 to rearrange empty clusters.
6. Write the maximum similarity per document x into a duplicate matrix, and write the document having the maximum similarity to document x to the list of cluster centroid documents.
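The steps above can be sketched in code. The following Java sketch is a deliberately simplified, single-pass reading of the algorithm: each document is linked to its most similar neighbour above the threshold, and linked documents are grouped with a union-find. It omits the iterative centroid rearrangement and the duplicate matrix of the full algorithm, and all names are my own, so treat it as an illustration of the core rule rather than the thesis implementation:

```java
// Simplified sketch of duplicate clustering over a precomputed similarity
// matrix: link every document to its most similar neighbour whose
// similarity exceeds theta; linked documents end up in the same cluster.
public class DuplicateClustering {

    // For each document, the index of its nearest neighbour above theta,
    // or -1 if the document has no semantic duplicate.
    static int[] nearestAboveThreshold(double[][] sim, double theta) {
        int n = sim.length;
        int[] duplicateOf = new int[n];
        for (int i = 0; i < n; i++) {
            int best = -1;
            double bestSim = theta;  // must strictly exceed the threshold
            for (int j = 0; j < n; j++) {
                if (j != i && sim[i][j] > bestSim) { bestSim = sim[i][j]; best = j; }
            }
            duplicateOf[i] = best;
        }
        return duplicateOf;
    }

    // Cluster labels per document: documents sharing a label share a cluster.
    static int[] clusterLabels(double[][] sim, double theta) {
        int n = sim.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        int[] dup = nearestAboveThreshold(sim, theta);
        for (int i = 0; i < n; i++) if (dup[i] >= 0) union(parent, i, dup[i]);
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) labels[i] = find(parent, i);
        return labels;
    }

    static int find(int[] parent, int x) {
        while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
        return x;
    }

    static void union(int[] parent, int a, int b) {
        parent[find(parent, a)] = find(parent, b);
    }
}
```

Documents whose best neighbour falls below theta keep a singleton label, mirroring the behaviour described later for documents that are not assigned to any cluster.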


3.1.3 Characteristics of the Algorithm

The proposed algorithm can be considered a hybrid of a flat and a hierarchical agglomerative hard clustering algorithm. In its rationale, the algorithm resembles the classical K-means algorithm: documents are assigned, in an iterative process, to the centroid document they are most similar to. However, in contrast to K-means, where exactly K documents have to be specified as centroids initially, the developed algorithm considers all documents as potential centroids. This characteristic is crucial in duplicate identification, as the number of unique ideas/concepts, which matches the number of clusters, remains unclear in the beginning. In the course of the iterations, cluster centroids can move until they reach an optimum, or, as seen in agglomerative hierarchical clustering algorithms, clusters can be merged or disappear. Just like in hierarchical clustering algorithms, the number of clusters as well as the documents that serve as cluster centroids do not have to be specified in advance. That way the proposed algorithm tries to unite the characteristics of both flat and hierarchical clustering algorithms into a simple, well-fitting algorithm for duplicate identification. In comparison to K-means' linear time complexity, the proposed hybrid algorithm runs in quadratic time, just like most well-implemented hierarchical clustering algorithms do. However, one big advantage over traditional hierarchical implementations (Berkhin 2006, 6ff) is the absence of similarity calculations during the clustering process. Instead of computationally intense recalculations, the algorithm simply uses the pre-calculated values of the similarity matrix.

3.1.4 Functioning of the Algorithm

In the following, the functioning of the developed algorithm will be shown by means of a simple example. Assuming that the whole document set consists of only five documents, the single steps of the algorithm are shown to make its functioning more understandable. Recalling the similarity matrix presented in the previous chapter, figure 4 shows an extended version of it. Two arrays have been added that help to track the current cluster constellations: "Max. SimValue" stores the maximum similarity value identified so far for each document, and "Duplicate of" indicates to which document another document is considered semantically duplicate. The similarity threshold θ in this example is set so that all similarity values equaling or exceeding it qualify as a semantic duplicate. The grey-shaded bar indicates the line (iteration) that is currently being checked. Initially, all values of both arrays are empty.


Figure 4 Algorithm step 0. Source: Own illustration

During the first step it is checked whether one of the documents is semantically duplicate to Doc 1. If this is the case, Doc 1 becomes the centroid of a new cluster. As self-similarities are ignored, the first potential duplicate to Doc 1 is Doc 3. Its similarity value is higher than the threshold θ, and neither Doc 1 nor Doc 3 maintains a maximum similarity yet. The same holds for Doc 4. Both documents are marked as duplicates of Doc 1 by setting their maximum similarity values to 0.4 and 0.5, respectively. Additionally, Doc 1's maximum similarity is set to 0.5, as this is the highest similarity attained in this cluster. The result of the described step can be seen in figure 5.


Figure 5 Algorithm step 1. Source: Own illustration

In the next step, Doc 3's similarity to Doc 2 exceeds the threshold θ, but not Doc 3's current maximum similarity, so Doc 3 stays in Doc 1's cluster. However, Doc 4 and Doc 5 qualify as duplicates of Doc 2, as both similarity values exceed θ as well as their respective maximum similarity values. As Doc 4 moves from the cluster with centroid Doc 1 to the cluster with centroid Doc 2, Doc 1's maximum similarity value has to be adjusted to 0.4 (now the highest similarity remaining in its cluster). Finally, the other maximum similarity values and duplicate markers have to be adjusted as well.


Figure 6 Algorithm step 2. Source: Own illustration

In this step Doc 1's cluster is dissolved, as its centroid Doc 1 moves to the cluster with centroid Doc 3. Subsequently, Doc 5 is also moved to cluster Doc 3. The values for maximum similarity and the duplicate markers have to be adjusted accordingly, as shown in figure 7.


Figure 7 Algorithm step 3. Source: Own illustration

In step 3 Doc 2 passes its status as centroid of the mutual cluster to Doc 4. The members of the cluster itself do not change by this operation, only the status of the centroid changes. Also, Doc 1 moves to the cluster with centroid Doc 4.


Figure 8 Algorithm step 4. Source: Own illustration

In the last step Doc 3 and Doc 5 exchange their centroid status and Doc 2 moves to the cluster determined by Doc 5.


Figure 9 Algorithm final result. Source: Own illustration

Figure 9 shows the final result of the clustering algorithm for this example. The five documents are distributed over two clusters,

c1 = {Doc 2, Doc 3, Doc 5} and c2 = {Doc 1, Doc 4}

where Doc 5 is the centroid of cluster c1 and Doc 4 is the centroid of cluster c2. The iterative functioning of the developed algorithm assures a good distribution of documents over the clusters. In the presented example all documents have been assigned to a cluster, which is usually not the case in large document collections. More often, some documents are not assigned to any cluster, as they do not fulfill the mentioned conditions for being a semantic duplicate (e.g. their maximum similarity value is below θ).


3.2 Evaluation of the Algorithm

3.2.1 Evaluation Criteria

For evaluating clustering algorithms, two possible approaches can be pursued. The first one is to compare different configurations of the algorithm in regard to its objectives of high intra-cluster similarity and low inter-cluster similarity. Such a criterion is called an internal criterion (IC). As the intra-cluster similarity shows the coherence of a cluster, it may be used as an internal criterion for the quality of the produced clusters. I decided to use a criterion that fits best to the developed clustering algorithm in terms of the nature of the algorithm and easy determination of the criterion. An internal criterion that fits the developed algorithm well measures the similarity of the documents of a cluster to their cluster centroid (document). Such a criterion has been proposed by Zhao and Karypis (2001, 5). It focuses on the intra-cluster similarity by maximizing the following function

φ(IC) = (1/k) * Σ_{c ∈ C} (1/|c|) * Σ_{d ∈ c} sim(d, z_c)

where k denotes the number of clusters, d are the documents of cluster c, and z_c is the centroid of the cluster. In other words, φ(IC) can be described as the average similarity of all clusters. As the objective of the developed clustering algorithm is to produce only clusters that feature a minimum similarity θ, the formula originally suggested by Zhao and Karypis had to be adjusted by a weight for the number of documents assigned to clusters, i.e. the number of identified duplicates. Hence n is the number of documents of the set that have been assigned to a cluster. Simply maximizing the presented function is, however, not very helpful in the given context, as not all documents of a set are clustered. It would be easy to maximize φ(IC) by setting θ to a high level, say 0.9, identifying only identical documents (e.g. documents that have accidentally been posted twice). Therefore, maximizing φ(IC) has to be subject to the amount of duplicates detected. The following measure has been constructed to relate φ(IC) to n

δ(ω) = φ(IC) * log10(n / φ(IC))

where the logarithm of the ratio of the number of identified duplicates n and φ(IC) is weighted with φ(IC) itself. The idea behind the logarithmic scaling is to make the large range of configuration results comparable. In this way the logarithmic scaling punishes very high numbers of identified duplicates that come with low values of φ(IC).
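Assuming the measure described above takes the form δ(ω) = φ(IC) * log10(n / φ(IC)) (the exact form and the base-10 logarithm are my reconstruction; they are consistent with the magnitudes reported in Table 3, e.g. φ(IC) = 0.46 with 72 duplicates yields roughly 1.0), it can be computed as:

```java
// Sketch of the combined evaluation measure
// delta(w) = phi(IC) * log10(numDuplicates / phi(IC)).
// The concrete formula is a reconstruction from the reported values,
// not a verbatim copy of the thesis definition.
public class EvaluationCriterion {
    static double delta(double phi, int numDuplicates) {
        // Configurations with no detected duplicates score 0, as in Table 3.
        if (numDuplicates == 0 || phi == 0.0) return 0.0;
        return phi * Math.log10(numDuplicates / phi);
    }
}
```

The logarithm dampens configurations that inflate the duplicate count at the expense of cluster coherence: a low phi both shrinks the weight and cannot be compensated by an arbitrarily large duplicate count.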


3.2.2 Test Set

A set of 480 randomly selected documents from Starbucks' customer innovation portal, myStarbucksIdeas.com, has been used as the data set for the clustering algorithm evaluation. To show the effectiveness in clustering ideas in innovation portals, which happen to be quite short, it was important to use real data from an actual, heavily frequented innovation portal. Starbucks' customer innovation portal with more than 50,000 entries easily fulfills this criterion.

3.2.3 Creating Configurations

Based on the presented internal criterion, the goal has been to find the most suitable configuration, i.e. the configuration maximizing δ(ω). First of all, different scoring mechanisms had to be identified for testing. Recalling the score definition from the last chapter,

score(d1, d2) = α * sim(title1, title2) + (1 - α) * sim(body1, body2)

the pivotal point is the definition of α. A naive approach would be setting α = 0.5, resulting in equal weights for title and body. Another possibility is weighting the title twice as much as the body similarity by setting α = 0.66, or, in the reverse case, α = 0.33. Additionally, it might also be of interest how the algorithm performs when α = 1 or α = 0, respectively. In these cases only the title or only the body, respectively, is considered for similarity calculation.

Configuration        α
2 * title + body     0.66
title + 2 * body     0.33
title + body         0.5
only title           1
only body            0

Table 1  Different configurations for zone weighting factor α.

The second important parameter in the configuration of the algorithm is the similarity threshold θ. Naturally, given one of the described configurations for zone weighting, the algorithm will produce different results dependent on θ. So the first question might be which level of θ produces the best results for the chosen weighting scheme in regard to maximizing δ(ω). It is important to understand that the algorithm's outcome depends on both parameters α and θ. Hence, each of the described weighting schemes had to be tested with a range of levels of θ. This is necessary as different weighting schemes with the same level of θ are likely to produce different results. Trying to cover the whole interval of the definition of the Cosine Similarity, five values have been identified.

Configuration   θ
c1              0.1
c2              0.3
c3              0.5
c4              0.7
c5              0.9

Table 2  Different configurations for θ.

Having five different weighting schemes and five different levels of θ, a total of 25 configurations have been created and benchmarked against each other.

3.2.4 Evaluating Configurations

According to the definition of the presented internal criterion, all 25 configurations have been benchmarked against each other. As the results show, the configuration with the highest value of δ(ω) has been the one with α = 0.5 and θ = 0.3, i.e. having title and body similarity equally weighted. This configuration detected 72 duplicates in 53 clusters with an average similarity of φ(IC) = 0.46. In other words, 53 unique ideas or concepts could be identified that have 72 semantically duplicate ideas in total.


Configuration                       α     θ    Duplicates  Clusters  φ(IC)  δ(ω)
title + body, simThreshold 0.3      0.5   0.3  72          53        0.46   1.004
2 * title + body, simThreshold 0.3  0.66  0.3  93          64        0.43   0.997
only title, simThreshold 0.5        1     0.5  68          49        0.45   0.988
title + 2 * body, simThreshold 0.3  0.33  0.3  54          43        0.47   0.969
2 * title + body, simThreshold 0.5  0.66  0.5  19          14        0.66   0.958
only title, simThreshold 0.7        1     0.7  24          19        0.59   0.945
only title, simThreshold 0.3        1     0.3  181         101       0.32   0.874
title + body, simThreshold 0.5      0.5   0.5  10          10        0.77   0.856
only title, simThreshold 0.9        1     0.9  12          9         0.65   0.825
only body, simThreshold 0.3         0     0.3  41          34        0.41   0.817
2 * title + body, simThreshold 0.1  0.66  0.1  267         117       0.27   0.813
title + 2 * body, simThreshold 0.5  0.33  0.5  8           8         0.82   0.810
title + body, simThreshold 0.1      0.5   0.1  265         122       0.27   0.809
title + 2 * body, simThreshold 0.1  0.33  0.1  260         117       0.27   0.794
only body, simThreshold 0.5         0     0.5  8           8         0.77   0.784
only title, simThreshold 0.1        1     0.1  287         116       0.25   0.759
title + 2 * body, simThreshold 0.7  0.33  0.7  6           6         0.90   0.743
only body, simThreshold 0.7         0     0.7  5           5         0.93   0.679
only body, simThreshold 0.9         0     0.9  5           5         0.93   0.679
2 * title + body, simThreshold 0.7  0.66  0.7  4           4         0.94   0.592
title + body, simThreshold 0.7      0.5   0.7  4           4         0.94   0.592
only body, simThreshold 0.1         0     0.1  291         132       0.18   0.571
2 * title + body, simThreshold 0.9  0.66  0.9  0           0         0.00   0.000
title + 2 * body, simThreshold 0.9  0.33  0.9  0           0         0.00   0.000
title + body, simThreshold 0.9      0.5   0.9  0           0         0.00   0.000

Table 3  Evaluation results of various algorithm configurations.


Figure 10  Comparison of δ(ω) values for the top-4 performing configurations (title + body, θ = 0.3; 2 * title + body, θ = 0.3; only title, θ = 0.5; title + 2 * body, θ = 0.3). Source: Own illustration

As figure 10 shows, the δ(ω) value of the first three configurations is only marginally different. These three configurations all incorporate an at least equal weight of the title, with α ranging from 0.5 to 1. Obviously, the title can be seen as a good characteristic for idea discrimination. In contrast, all configurations weighting the body more strongly than the title perform significantly worse.

Figure 11  Relating the number of duplicates to the average similarity φ(IC). Source: Own illustration

Figure 11 shows a hockey-stick-shaped curve, where each configuration is marked by its average similarity φ(IC) and its number of identified duplicates. Not surprisingly, in general it can be inferred that the more duplicates are identified, the lower the average similarity value φ(IC) is. However, some configurations are obviously able to identify more duplicates while maintaining higher values of φ(IC) as well. Especially in the vertical segment from 50 to 100 duplicates one can see an increased performance compared to the outlier in the first vertical segment. These four values in the second segment are exactly the top-performing configurations already mentioned.
Figure 12  Relating δ(ω) to φ(IC). Source: Own illustration

Figure 12 shows the distribution of configurations for values of δ(ω) and φ(IC). Obviously, configurations with values between 0.4 and 0.5 for φ(IC) offer a well-balanced ratio of the number of detected duplicates and φ(IC) in order to maximize δ(ω).

3.2.5 Interpretation of Results

One has to be cautious about the interpretation of the presented results. The assessment of the configurations' performance has been based purely on the mentioned internal criterion. Its assumptions about the quality of the developed algorithm are solely based on the ratio of the average similarity φ(IC), as a measure for internal cluster coherence, and the number of detected duplicates, resulting in the measure δ(ω). Even a high value for δ(ω) does not automatically guarantee good clustering results. Nevertheless, three things can be inferred so far. First, the developed algorithm showed good results in regard to cluster coherence: the top-performing configuration averaged φ(IC) = 0.46, which is an absolutely acceptable value. Second, but even more relevant, valuable information could be gained in view of meaningful configuration parameters. Third, under the assumption that φ(IC) serves as a good indicator for cluster coherence, δ(ω) may help to rank and compare different configurations in regard to their clustering quality. However, it does not suggest anything about the general quality and effectiveness of the presented algorithm, which is subject to an external criterion. Although it can be assumed that the observed results might vary from document set to document set, at least in the domain of innovation portals similar results are expected. Therefore, the top-performing configuration has been used as the default configuration. Finally, this configuration is not only valuable for the clustering algorithm and its applications, but also for the active duplicate detection presented in the next chapter. Information about a reasonable level of θ and a well-working weighting scheme is important in order to correctly identify entered ideas as duplicates.

3.2.6 External Criterion

Although the internal criterion usually gives a good marker for the quality of a clustering algorithm, it does not state anything about its effectiveness in applications. The internal criterion might indicate a high value for φ(IC) but may have clustered only a small fraction of the set, although a lot more semantic duplicates exist. This would imply that such an algorithm's effectiveness is bad. To solve this problem, external criteria may help to determine the algorithm's effectiveness for a specific application. External criteria require an already human-judged set of classified or clustered documents that is used to rate an algorithm's outcome against it. Usually, freely available, existing standard sets are used for this purpose. In this case, however, the question was specifically how well the algorithm is able to perform on documents from innovation portals, for which such sets do not exist. As the creation of such a classified set is quite time-intensive, a hard external criterion has not been used in this thesis. Instead, a general overall judgment about the quality of a configuration has been made based on a manual clustering. Although no precise statement can be given, the two top-performing configurations according to the internal criterion showed to be capable of correctly identifying most of the semantic duplicates of the test set that had been found manually. In some cases the top configuration of the algorithm even correctly identified semantic duplicates that had been overlooked at first during manual duplicate identification.

3.3 Further Applications of Clustering

The primary goal of clustering semantic duplicates in this thesis has been to improve idea representation and the process of evaluation. Yet, there are other interesting applications of clustering that can help to further improve duplicate identification. In the following, three such applications are briefly described.


3.3.1 Automatic Thesaurus Creation

Although the representation of ideas in a VSM in combination with the described clustering algorithm showed good results in semantic duplicate detection, it has its limitations. The hypothesis behind the described approach has been that semantically duplicate ideas usually share common words. The evaluation of the clustering algorithm showed that this is often the case, but there are situations where this hypothesis does not hold. Especially with short documents (as ideas in innovation portals happen to be) it may happen that two ideas describe the same concept without having any words in common. To point this out, recall the example of one idea describing "internet at no charge" and another saying "free wireless access". For simplification, assume that these ideas only consist of those two titles. In this case the approach of detecting semantic duplicates described in the last chapter would simply fail, because the similarity between the term-based vector representations of these ideas would be 0, as they do not share a single term.

One way to improve this situation would be to use a thesaurus that features links between words that are considered semantically related, such as synonyms and homonyms. That way it would be possible to consider related words for matches during the similarity calculation of term-based vectors. The challenge in this approach is to obtain a good-quality thesaurus that proves effective in the designated application domain. Manually created general thesauri such as WordNet (Miller 1995, 39) usually offer good quality but lack domain-specific relations of words and are therefore not very effective. Creating a domain-specific thesaurus manually would probably generate the best results but involves a lot of effort for creation and maintenance. So the most elegant approach would be to create a domain-specific thesaurus automatically.
Crouch (1990, 629ff) presented an interesting approach to generating a thesaurus automatically. Her idea was to derive a thesaurus from clustered documents through a quite simple method. Following and extending Jardine and van Rijsbergen's Cluster Hypothesis (1971), she assumes that the probability that words occurring in all documents of the same cluster are semantically related is quite high. More specifically, she assumes that the intersection of the low-frequency terms of the documents in a cluster should yield a set of terms (words) that are somehow semantically related. The starting point is an appropriate term-vector-based clustering algorithm that produces clusters of similar documents. Once clustered, the clusters are analyzed, focusing on terms that have reasonable discriminating power. Such terms can be thought of as terms that are essential for the content and understanding of the document. To decide whether a term in a document carries enough discriminating power, high- and low-frequency terms are differentiated. The frequency of a term is a collection-wide measure, indicating for each term how many occurrences exist in all documents of the set. High-frequency terms usually have little discriminating power in documents, while low-frequency terms usually carry more. Focusing on low-frequency terms only, the intersection of the terms of all documents of a cluster is computed and used to generate a thesaurus class as follows:

T_C = L ∩ (T_{d_1} ∩ T_{d_2} ∩ … ∩ T_{d_k})

where C = {d_1, …, d_k} denotes a cluster of documents d_i and T_{d_i} denotes the set of terms of document d_i. The set L includes all terms that are considered low-frequency terms. The set T represents all unique terms occurring throughout the document set D. The definition of L is of course dependent on the total number of words in the index; according to Salton (1974, 10), low-frequency terms are terms whose frequency lies below a threshold depending on n, where n denotes the total number of documents in the collection. T_C denotes the set of terms that fulfill the conditions of being in the intersection of the documents' term sets and being a member of the set L. Such a thesaurus class should now include all derived terms that are semantically related to each other.

Figure 13 A thesaurus class T_C is derived from a three-element document cluster, shown in a Venn diagram. Members of T_C are therefore considered semantically related. Source: Own illustration
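The derivation above can be sketched in a few lines of Java. The names documentFrequencies and derive as well as the toy collection and the threshold value are illustrative assumptions, not code from the thesis prototype.

```java
import java.util.*;

// Sketch of Crouch's derivation: the thesaurus class of a cluster is the
// intersection of its documents' terms, restricted to low-frequency terms.
public class ThesaurusClass {

    // Collection-wide frequencies: number of documents containing each term.
    public static Map<String, Integer> documentFrequencies(List<Set<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs)
            for (String t : doc) df.merge(t, 1, Integer::sum);
        return df;
    }

    // Thesaurus class: terms present in EVERY document of the cluster whose
    // collection frequency lies below the low-frequency threshold.
    public static Set<String> derive(List<Set<String>> cluster,
                                     Map<String, Integer> df, int threshold) {
        Set<String> result = new HashSet<>(cluster.get(0));
        for (Set<String> doc : cluster) result.retainAll(doc);
        result.removeIf(t -> df.getOrDefault(t, 0) >= threshold);
        return result;
    }

    public static void main(String[] args) {
        // Toy collection: the first three documents form one cluster.
        List<Set<String>> collection = List.of(
            Set.of("coffee", "lactose", "milk"),
            Set.of("coffee", "lactose", "milk", "dairy"),
            Set.of("lactose", "milk", "intolerance"),
            Set.of("coffee", "shop"), Set.of("coffee", "wifi"));
        Map<String, Integer> df = documentFrequencies(collection);
        List<Set<String>> cluster = collection.subList(0, 3);
        // "coffee" is dropped: it occurs in 4 of 5 documents (high frequency).
        System.out.println(derive(cluster, df, 4)); // {lactose, milk}, order may vary
    }
}
```

Note that the common but collection-wide frequent term "coffee" is filtered out, so only the discriminating terms survive as the thesaurus class.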


The motivation behind this procedure is that, with the help of many automatically derived thesaurus classes, retrieval performance can be increased significantly. According to Crouch, her method allows an increase in retrieval performance of up to 15 percent (1990, 629).

3.3.2 Automatic Thesaurus Creation in Innovation Portals

In the context of this thesis the question arose whether, based on the clustering algorithm introduced, reasonable thesaurus classes can be derived from the produced clusters of semantically duplicate ideas. Subject to examination therefore has been whether Crouch's method proves effective with the comparably short ideas of innovation portals as well. Should this approach yield reasonable results, it would be one way to easily improve the identification of semantic duplicates.

Evaluation

Based on the myStarbucksIdeas.com test set already used for the clustering algorithm evaluation, a prototype Java method following Crouch's approach has been developed and implemented. Input of this method is the low-frequency threshold and a set of document clusters. The method then leverages the term-vector representations of the clusters' documents to derive thesaurus terms as described above. The outcome was surprisingly well suited to the application domain of ideas in the myStarbucksIdeas.com innovation portal. Some examples are provided here; please note that the terms are directly extracted from the document vectors and therefore only exist in their stemmed word form. For better understanding, the stemmed words are referred to here by their corresponding full words. Thesaurus class 49 builds a semantic relation between lactose, intolerance, dairy and milk; thesaurus class 28 sets employee in relation to union. Another good example is class 31, where election is set in relation to voter. Thesaurus class 45 relates carbon to footprint, trash and politics. Two good examples of domain-specific semantic relations that can be gained by the presented approach are classes 44 and 11. Relating compost to paper and reuse to sleeve creates semantic relations in the area of recycling products at Starbucks.


Figure 14 Example thesaurus classes with stemmed word forms derived from the clustered test set of ideas. Source: Own illustration

Findings

Although the implemented Java method to derive thesaurus classes has only been a prototype, it already showed usable results for building up an automatic thesaurus. Due to the additional effort required, the thesaurus has been integrated into neither the clustering algorithm nor the solution helping to prevent duplicates. However, the presented research findings show that an integration of such an automatic thesaurus could be of real use in the context of semantic duplicate identification and prevention.

Potential Approach

A potential integration into duplicate identification could look as follows. Based on a first run of the clustering algorithm, thesaurus classes could be derived and used in subsequent runs of the algorithm for similarity calculations in order to improve clustering quality. In the first step the algorithm is run as described. In a second step the produced clusters are used for the creation of an automatic thesaurus as presented. The automatic thesaurus classes are then linked to the related documents and thus considered during the calculation of the similarity matrix. Now the clustering algorithm is restarted based on the newly calculated similarity matrix, hopefully providing improved clustering quality. The same approach could be used for duplicate prevention. First, all existing ideas of the collection are clustered. This step may happen from time to time; it can be considered a learning phase. From this cluster set a thesaurus is derived, possibly refining a previously created automatic thesaurus. As presented above, thesaurus classes are


linked to the related documents of the collection. Now the related thesaurus classes are included during similarity comparisons with query vectors representing potential semantic duplicates.

Limitations

Although it generates some thesaurus classes of good quality, the described algorithm for automatic thesaurus generation has its limitations. Obviously, it is not able to compete with manually created thesauri in regard to either quality or quantity: manual thesauri that have been custom-built for a specific application domain will still be superior in both the quantity and quality of their relations. While other approaches rely e.g. on explicit linguistic knowledge (Ruge 1997), the elegance of the presented approach lies in its easy derivation once a set of documents has been clustered. The described approach thus offers an easy and yet domain-specific way to increase retrieval performance with little effort.

3.3.3 Cluster Labeling

Another application following a similar procedure as the automatic thesaurus generation is cluster labeling. The motivation behind this lies again in improved idea evaluation. Although methods for conveniently presenting clusters of semantically duplicate ideas to the user or evaluator are not the subject of this thesis, a short discussion in regard to cluster labeling is worthwhile. One challenge in large collections of ideas in innovation portals is the proper display of and navigation through cluster sets with semantically duplicate ideas. Navigating through hundreds of clusters is not really convenient if one has to check each cluster's ideas to know what concept or general idea the cluster represents. Labels that describe a cluster with a number of discriminating words would be helpful. That way an evaluator could get a fast overview of the ideas in an innovation portal. Such labels can be derived by the same method as used for the creation of an automatic thesaurus: based on a set of clusters, the sets of words that constitute thesaurus classes can simply be used as labels for the clusters. In the literature many approaches to cluster labeling can be found. A comprehensive approach has been described by Treeratpituk and Callan (2006, 168f). They developed an algorithm that infers cluster labels based on information from single clusters as well as from cluster-wide statistics. In general, differential cluster labeling and internal cluster labeling are distinguished. While differential cluster labeling compares the distribution of terms between clusters, internal cluster labeling focuses solely on the distribution of terms within a single cluster. The approach presented in this thesis can be regarded as a type of internal cluster labeling and is simpler and less sophisticated than Treeratpituk and Callan's approach. Possibly, it is necessary to refine the described method to obtain good cluster labels, e.g. by considering term occurrences and ranking them in order to get more meaningful labels.

3.3.4 Automated Idea Tagging

An approach similar to cluster labeling and thesaurus creation is to relate the obtained terms to ideas directly and use them to tag ideas in an innovation portal. Clusters could be produced over an existing pool of ideas using the presented clustering algorithm. Once an idea is assigned to a cluster, meaningful tags could be derived for this idea by using terms from the related term class (presented above as thesaurus class). Song et al. have proposed a more sophisticated approach than the one presented here, based on a novel framework for tag recommendation (2008, 515f). The method proposed in this thesis is not to be considered an alternative to existing well-thought-out approaches, but is meant to deliver impulses for improvements that are easy to implement without the use of further models. Such automated idea tagging has several imaginable applications itself. As most innovation portals already have a tagging feature implemented, automated tagging could support user navigation through ideas. Tags for ideas of one cluster would be identical, helping the user or evaluator to quickly identify similar ideas when stumbling across an idea that features semantic duplicates. The big advantage of such a navigation is that it is already supported by most innovation portals. However, automated tagging is not limited to a previously performed clustering. It could be used in real time during semantic duplicate prevention, when a new idea is entered and checked for existing semantic duplicates. If the user appends his idea to an existing idea, the tags of the existing idea could be updated by tags derived from the cluster of those semantically duplicate ideas, as mentioned above. Nevertheless, automated tagging does not replace manual tagging by a user; rather, it can be seen as a way to enrich user tagging, allowing better linkage and navigation between semantically duplicate ideas.
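The tagging step itself is straightforward: every idea of a cluster receives the cluster's derived term class as its tag set. The following sketch uses hypothetical names (AutoTagger, tagCluster) and stemmed example terms in the style of class 11 above.

```java
import java.util.*;

// Sketch of automated tagging: attach a cluster's derived term class
// (thesaurus class) as the tag set of every idea in that cluster.
public class AutoTagger {

    // Maps each idea id of the cluster to the shared set of derived tags.
    public static Map<Integer, Set<String>> tagCluster(List<Integer> clusterIdeaIds,
                                                       Set<String> termClass) {
        Map<Integer, Set<String>> tags = new HashMap<>();
        for (int id : clusterIdeaIds) tags.put(id, termClass);
        return tags;
    }

    public static void main(String[] args) {
        // Stemmed forms, illustrative of thesaurus class 11 (reuse/sleeve).
        Set<String> termClass = Set.of("reus", "sleev");
        Map<Integer, Set<String>> tags = tagCluster(List.of(7, 12, 31), termClass);
        System.out.println(tags.get(12).contains("sleev")); // prints true
    }
}
```

Because all ideas of a cluster carry identical tags, a standard tag-based navigation immediately links semantically duplicate ideas to each other.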


4. Semantic Duplicate Prevention in an Innovation Portal

The last chapter evaluated methods for identifying semantic duplicates in existing idea sets. As semantic duplicates remain a challenge in innovation portals, the following chapter will dwell on the possibilities of preventing the emergence of such duplicates in the first place. Next to reduced effort in the evaluation of ideas, there is an interesting hypothesis behind the prevention of semantic duplicates. Recalling the research objective from the introduction, the question emerges whether it is possible to improve the quality of ideas by preventing semantic duplicates. Although this assumption may look a little far-fetched at first, I will try to explain the nature of the problem. The general hypothesis behind this question is that idea quality can be improved when users with similar ideas collaborate. The main challenge, however, remains to get users into a process of collaboration. In big innovation portals this turns out to be a huge problem, because users are not based in the same location and enter ideas independently. The focus in most innovation portals is on allowing quick and simple entry of new ideas. So users will usually enter an idea directly, without searching whether similar ideas already exist. Even if they try, they might not find all relevant similar ideas due to search limitations. So a way has to be found to bring users with similar ideas together. The idea behind this is that, given the ability to see related ideas, a collaboration process between users will take place. A general method for such a solution is proposed in this chapter. Furthermore, the focus will be on the development and implementation of the proposed solution. Yet, verification of the improved idea quality hypothesis is subject to a user study, presented in the next chapter. This chapter solely concentrates on the development and implementation of a solution for semantic duplicate prevention.

4.1 Concept

4.1.1 Requirements

Starting point is the entry of a new idea into an innovation portal. Before an idea is finally submitted, a routine should check whether similar ideas already exist. Potential semantic duplicates are then presented to the user, and he may choose the idea to which his entered idea is most similar or simply reject the offered choices. The user is thus able to examine the content of other similar ideas and has the possibility to refine his idea before appending it to one of the offered choices. In case none of the suggested ideas seems to fit, his entered idea will be submitted to the system as a new idea.


4.1.2 Development

The solution has been developed in Java as a component for an already existing innovation portal. This innovation portal is part of the TEXO stream of the THESEUS project, a project funded by the German Federal Ministry of Economics and Technology and realized by various partners from industry and academia, amongst others the chair of Prof. Dr. Helmut Krcmar. Subject to development for this thesis has been packaging the implemented model presented in the first chapter as a component for the existing innovation repository as well as developing a suitable user interface.

4.1.3 Implementation

In the following sub-chapter the implemented backend model is presented. The purpose of the simplified conceptual class diagram in figure 15 is to show the links between the different classes. To reduce complexity, only the most important public methods that are necessary to understand the interaction of these classes have been included. Methods are shown without parameters. Classes with bold borders are core classes of the developed component, while the ones with regular borders represent interfaces to classes belonging to the Lucene API. Starting point is the indexing process, where existing ideas from an innovation portal's database have to be stored in an index that can be leveraged by a VSM. The class IdeaIndexer is a wrapper class for the Indexer class, fetching ideas out of a MySQL database and handing them over to the Indexer class. The Indexer uses a Stemmer class that stems all ideas before calling addDocument() in Lucene's IndexWriter class to add the stemmed ideas to a Lucene index. Lucene's index structure can be stored either in RAM or on disk; due to performance and handling issues (easier concurrent access), the described implementation uses a RAM-based index. Once the indexing process is finished, the class SimilarityMatrix creates a vector representation of the indexed ideas. For each idea an instance of DocumentVector is produced that leverages a TermFreqVector instance. A TermFreqVector instance incorporates the term frequencies of an idea derived from a Lucene index. As soon as an instance of DuplicateServlet is triggered by a newly entered idea, SearchDuplicates uses the Stemmer class to stem the entered idea, transforming it into a query vector and searching for similar ideas using SimilarityMatrix. Found semantic duplicates are then returned as DuplicateDocument. The class MasterData is implemented as a singleton, allowing easy access between multiple instances of the presented classes.


Figure 15 Conceptual UML class diagram of back-end solution. Source: Own illustration

4.1.4 Configuration

Configuration aspects of the presented solution have already been covered in the second and third chapter. The assumptions and decisions made there apply to the active duplicate detection as well. The similarity threshold level is set to 0.3 by default. Specifically relevant are the choices of stemming language and similarity threshold level. As stemming and the stemmer (algorithm) are language-dependent, a stemming language matching the actual input language of ideas has to be set. The implementation therefore resorts to the stemming algorithms for English and German offered by the Lucene API.


4.2 Functioning of the Duplicate Detection

Figure 16 Description of the functioning of the duplicate detection. Source: Own illustration

Once the user has entered and submitted an idea to the innovation portal, JavaScript code calls the class DuplicateServlet, passing the title and description of the idea. In the next step the method findSimilar() of the class SearchDuplicates is called, forwarding title and description. This class now transforms title and description into a term-vector representation, resulting in a query vector, as presented in the first chapter. The transformation process includes the stemming of title and description. Calling cosineSimilarity() compares the query vector with all document vectors (the vector representations of all existing ideas in the index) of the class SimilarityMatrix, returning a list of duplicate ideas satisfying the pre-defined similarity threshold. SearchDuplicates converts the result list into a priority queue, which is returned to DuplicateServlet. DuplicateServlet in turn converts the priority queue into an XML document containing the ranked duplicate ideas, including title, description and similarity score. This XML document is then used in an AJAX pop-up to display the ranked result set of ideas to the user. If no semantic duplicates could be found, the XML document is empty. In this case the user will not see a pop-up and the idea will directly be submitted to the innovation portal's database.
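The core search step can be sketched as follows. The class and method names echo the description above, but the bodies are deliberately simplified illustrations (no stemming, and the XML serialization and servlet plumbing are omitted).

```java
import java.util.*;

// Sketch of the findSimilar() flow: compare a query vector against all
// indexed document vectors and return the matches above the similarity
// threshold, ranked best-first via a priority queue.
public class SearchDuplicatesSketch {

    public record Duplicate(int docId, double score) {}

    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static List<Duplicate> findSimilar(Map<String, Integer> query,
                                              List<Map<String, Integer>> index,
                                              double threshold) {
        PriorityQueue<Duplicate> queue =
            new PriorityQueue<>(Comparator.comparingDouble(Duplicate::score).reversed());
        for (int i = 0; i < index.size(); i++) {
            double s = cosine(query, index.get(i));
            if (s >= threshold) queue.add(new Duplicate(i, s));
        }
        List<Duplicate> ranked = new ArrayList<>();
        while (!queue.isEmpty()) ranked.add(queue.poll()); // drain best-first
        return ranked;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> index = List.of(
            Map.of("save", 1, "energy", 1, "light", 1),
            Map.of("turn", 1, "off", 1, "light", 1),
            Map.of("bike", 1, "work", 1));
        var hits = findSimilar(Map.of("energy", 1, "light", 1), index, 0.3);
        System.out.println(hits.size());         // prints 2 -- third idea is below 0.3
        System.out.println(hits.get(0).docId()); // prints 0 -- highest similarity first
    }
}
```

An empty result list corresponds to the case described above where the XML document is empty and the idea is submitted directly.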


4.3 Description of the User Interface

Figure 17 Input screen for a new idea. Source: Theseus TEXO Innovation Repository

Once the user has entered his idea and hit the Save button (Figure 17), the idea will be checked for semantic duplicates. If no duplicates are found, his idea will be submitted instantly. If semantic duplicates are found, an AJAX-based pop-up window inside the browser is shown as depicted in figure 18 and figure 19 respectively.


Figure 18 Semantic duplicates have been found for the entered idea. Source: Theseus TEXO Innovation Repository

Similar ideas are presented in descending order with regard to their computed similarity to the entered idea. Additionally, a bar to the right of each idea shows the level of similarity. A completely filled orange bar indicates that the two ideas are very similar to each other, a 75%-filled bar indicates good similarity, and a 50%-filled bar indicates that the ideas are considered similar. Of course, this similarity indication is based on the pre-defined similarity thresholds that have been discussed in the configuration section.


Figure 19 Close-up of the duplicate detection pop-up. Source: Theseus TEXO Innovation Repository

In the list of similar ideas only the idea titles are shown. If a user wants to see the content of an idea, he can simply move his mouse over it and a black-bordered box with the idea's content will be displayed below. The last option in the radio-button list is always "Submit my idea as a new idea". As soon as the user has made up his mind to which existing idea he wants to append his idea (or to create an independently saved idea), he chooses one of the presented options. Also, he is required to select whether his idea is a "Specialization" or "Generalization" of the selected idea. Theoretically, other ontology concepts can be considered as well. In case he wants to refine his entered idea after reading similar ideas, he can simply close the pop-up and modify his idea. Once he presses the Save button again, the whole process is started anew, ensuring that his changes are taken into consideration for the similarity comparison with all existing ideas in the database.

4.3.1 Submitting an Idea

Once the user has chosen the idea that seems most similar to the one he entered, his entered idea will be saved as a comment to the existing idea. The comment includes the full title and description of the newly entered idea as well as the selected ontology concept. In the detailed idea view, all comments are grouped by their respective ontology concept. Finally, the whole idea, meaning the existing idea plus all similar ideas added as comments, is reindexed, starting the indexing process presented above.

5. User Study

The purpose of the user study has been the evaluation of the developed solution for duplicate prevention in regard to its effectiveness in semantic duplicate detection and prevention. The testing of the improved idea quality hypothesis stated in the last chapter has been subject to examination as well.

5.1 Study Design

5.1.1 Overview

The objective of the study has been to simulate an innovative setting, typically found in the context of innovation portals. In such a setting users brainstorm about ideas on a predefined topic or task, supported by an innovation portal where the ideas are entered. So a number of groups of users were given the task of brainstorming ideas on a specified problem statement using the presented innovation portal. To properly test the effectiveness of the developed semantic duplicate detection and prevention, one would need an already existing set of ideas big enough that the probability of the occurrence of semantic duplicates when users enter new ideas would be quite high. As the myStarbucksIdeas.com test set that had been used for the clustering algorithm evaluation was too small and too specific, it could not be used for the evaluation of the developed solution. Additionally, the ideas of the test set were in English, while the participants' native tongue was mostly German. Instead, the thought was to start with an empty set of ideas for each group and let the users generate ideas by themselves. Obviously, the brainstorming topic then had to be one which left little space for ambiguity, ensuring that users would enter many semantically duplicate ideas. As the participants were students and researchers with various academic backgrounds, a common topic had to be found. The chosen brainstorming topic was "Saving energy at home". Following this approach, it could be assured that the effectiveness of the semantic duplicate detection and prevention could be thoroughly tested, while preserving a typical innovative brainstorming setting. Each brainstorming session was conducted separately per group, while all members of a group had to brainstorm simultaneously. However, each user was supposed to enter his ideas directly and independently. Consultation between members of a group was not allowed, in order to simulate an environment where users are not necessarily sitting in the same place, but entering ideas at roughly the same time. A total of n=15 students and researchers had been recruited for the study. The participants were randomly selected from students and faculty members of the computer science department. In order to test the already mentioned improved idea quality hypothesis, the participants had been split up into one control and two treatment groups. The control group had to enter

ideas into the innovation portal while semantic duplicate detection and prevention was disabled. Both treatment groups were working with enabled semantic duplicate detection and prevention. Naturally, membership in either the control or a treatment group was not revealed to the participants. With the aim of avoiding potential bias in regard to user behavior, the study was simply presented as a study on idea quality in innovation portals, not mentioning the real purpose of the study in advance. In particular, the aspect of semantically duplicate ideas was not mentioned. As most participants' native tongue was German, the study took place in German. This means that the task itself as well as the entered ideas were formulated in German. The task that was given to each group was as follows: "Finde ca. 10 Ideen mit denen Du zuhause Energie sparen kannst" (translated: "Find about 10 ideas that can help you to save energy at home"). For each group one brainstorming session of 20 minutes was conducted. At the end of a session, each participant had to fill out a survey.

5.1.2 Survey Design

To evaluate the behavior of the participants of the study, a survey has been designed. Based on the Technology Acceptance Model (TAM), the created survey evaluates three major aspects of the developed solution: usefulness, satisfaction and ease of use. TAM was first introduced by Davis (1989) and later extended by Adams, Nelson and Todd (1992). Their model suggests that a number of factors influence the acceptance of a new technology, specifically perceived usefulness and perceived ease of use. Lund additionally proposes to measure the perceived satisfaction with a technology. The survey that has been created is based on Lund's proposal of a USE survey measuring usefulness, satisfaction and ease of use. For each of these three categories, items were generated and had to be rated by each participant using an ordinal 5-step Likert scale, ranging from strongly disagree (1) to strongly agree (5).

N/A (0) | Strongly Disagree (1) | Disagree (2) | Neither Agree Nor Disagree (3) | Agree (4) | Strongly Agree (5)
Table 4 5-step Likert scale for rating statements.

As some items did not apply to the control group, these items were marked with an asterisk. Participants assigned to the control group were asked to indicate N/A for items not applying to their group. In the actual survey the three described categories were not visible and the order of the items was scrambled. Shown below is the template for the survey.

Usefulness
1. The user interface helped me to be more innovative
2. The application helped me to improve the quality and extent of the ideas I entered
3. The application supported me in developing new ideas
4. I found the suggested similar ideas useful (*)
5. Suggested similar ideas were indeed similar to ideas I entered (*)
6. The order of suggested similar ideas was matching the actual degree of similarity (*)

Ease of Use
1. It was easy to enter ideas in general
2. Choosing and selecting a suggested similar idea was easy (*)

Satisfaction
1. The application worked the way I want it to work
2. I would recommend the application to a friend
3. I liked the application
4. Finding similar ideas was fun (*)
5. Finding similar ideas was annoying (*)
6. Finding similar ideas was interesting (*)

Source: Own survey construction, according to Lund's USE survey

5.1.3 Evaluation Criteria

For the evaluation of the study two criteria have been chosen. To measure the effectiveness of the solution in regard to a reduction of semantic duplicates, the number of times semantic duplicates were detected as well as the number of times the user agreed to one of the suggested semantic duplicates were recorded. In this context the number of duplicate detections does not mean the total number of semantically duplicate ideas detected, but the number of times at least one semantically duplicate idea was detected. In other words, this figure counts the number of times the duplicate detection pop-up was presented to the users. This can be regarded as an objective criterion. In contrast, the results of the survey have to be seen as subjective criteria, as the deductions from the survey are derived from the users' self-assessments. I will discuss potential biases of this later in this chapter. Next to a general evaluation of the survey's results, a direct comparison between the control and the treatment groups in regard to the improved idea quality hypothesis has been prepared. The latter can be seen as the major subjective criterion to verify the stated hypothesis.


5.1.4 Demographics

Of the 15 participants of the study, nine (60%) were male and six (40%) were female. A-Level/Abitur was the highest educational degree so far for 11 (73.3%) participants. Two (13.3%) held a Bachelor's degree and two (13.3%) held a Master's degree as their highest educational degree. Twelve (80%) participants were aged between 16 and 25, three (20%) were between 26 and 35.

5.2 Study Results

5.2.1 Effectiveness of the Semantic Duplicate Detection In treatment group 1 a total of 39 ideas had been entered by four participants. Consisting of seven participants, treatment group 2 generated a total of 73 entries, with 54 ideas and 19 extensions to existing ideas. The control group generated 41 ideas and consisted of four participants.
Figure 20 Number of ideas entered compared with the number of participants per group. Source: Own illustration

In treatment group 1, semantic duplicates were detected four times; however, in no case did a user select to append his idea to an existing one. In treatment group 2, semantically duplicate ideas were detected 27 times, and in 19 cases one of the suggested semantic duplicates was selected and the user's own idea was appended to the suggested duplicate.

Figure 21 Comparison of suggested and accepted duplicates per treatment group. Source: Own illustration

Evaluating the results, it has to be discussed why the user behavior of the two treatment groups in regard to the detection and acceptance of suggested duplicates turned out to be so different. In treatment group 1 a suggested duplicate was not once accepted by any of the users. One possible explanation might be that the users wanted to see their ideas as independent ideas, ignoring any existing similar ideas. Additionally, several ideas generated by this group contained major orthographical errors, as some participants' native tongue was obviously not German. This, of course, impairs the detection of semantically duplicate ideas. In the following evaluation process I will therefore focus on treatment group 2, as the mentioned problem could not be observed in that group. Referring to treatment group 2: in 70% of the cases when the pop-up of semantically duplicate ideas was shown to the user, he selected one of the suggested duplicates to append his idea to. In 30% of the cases, none of the suggested duplicates was chosen and the entered idea was saved as an independent idea.


Figure 22 Accepted vs. not accepted suggested semantic duplicates in treatment group 2. Source: Own illustration

This observation already allows first conclusions to be drawn about the effectiveness of the developed solution in treatment group 2. The high acceptance rate of 70% shows that the suggested semantically duplicate ideas are in fact perceived as semantic duplicates by the user in most cases. Naturally, it remains unknown in how many cases a semantically duplicate idea existed but the duplicate detection could not identify it and consequently did not show the pop-up to the user. Therefore the number of accepted suggested duplicates does not allow a general statement about the degree of reduction of semantic duplicates in relation to all semantically duplicate ideas in the presented treatment group. However, it can be said that with 19 ideas having been appended to an existing idea (and 73 entered ideas in total, of which 54 were saved as independent ideas), at least 26% of all ideas were semantic duplicates. Statements about the degree of duplicate reduction in treatment group 2 have to be inferred with other measures. As mentioned above, to deduce statements about the degree of reduction of semantic duplicates, the total number of duplicates has to be known. To determine this number the generated ideas were evaluated manually, checking for the total amount of duplicate ideas per evaluated group. In treatment group 2 a total of 28 semantically duplicate ideas were found, of which 19 could be identified by the developed solution. Compared to a system without an active duplicate detection, and under the assumption of similar user behavior, the number of semantically duplicate ideas could thus be reduced by 68% in treatment group 2 with the help of the developed duplicate detection. Based on this number it can be inferred that the developed duplicate detection has been able to reduce the number of semantic duplicates significantly.



Figure 23 Comparison between the number of semantic duplicates accepted and the total number of semantic duplicates found (including those not recognized by the presented solution). Source: Own illustration

Another way to measure the degree of duplicate reduction would be a comparison with the control group. However, it must be considered that the control group had fewer participants than treatment group 2. Because the given task was narrowly focused on a specific topic (in order to generate many duplicates) and each participant had to enter a fixed number of ideas, the absolute number of ideas and the number of semantic duplicates appear to correlate positively, as shown in figure 23 above. The degree of duplicate reduction can therefore not be reliably measured by a comparison with the control group, as the numbers of participants differ too much.

Findings

Although a significant reduction of semantic duplicates could be measured in treatment group 2, implications for generalization have to be discussed carefully. To derive statements about the statistical significance of the observed effects, one would have to conduct further sessions with additional treatment and control groups of equal size. Yet, the described results suggest that similar observations can be expected in further sessions. As the effect of the active duplicate detection could be demonstrated and quantified, the criterion of internal validity appears to be fulfilled. In regard to the external validity of the results of treatment group 2, generalizations beyond the experimental situation have to be argued. It can be assumed that in a typical non-experimental situation the proportion of duplicates relative to the total number of ideas is significantly lower than observed in this study. Usually, the brainstorming task is less specific or there is no specified task at all, resulting in far fewer duplicates, simply because a wider range of topics is tackled. Nevertheless, there is no reason to believe that the duplicate detection is less effective in a typical innovation portal setting than it was in the study.

5.2.2 Improved idea quality

To test the stated improved idea quality hypothesis, a subjective criterion has been used. Subject to examination was the participants' perception of an improved idea quality induced by the developed solution. The presented survey was used to evaluate this perception in treatment group 2 as well as in the control group. For the verification of the hypothesis, I focused on the statement "The application helped me to improve the quality and extent of the ideas I entered". This statement was evaluated in both the treatment and the control group, so that a comparison between both median values becomes possible. In treatment group 2, one participant disagreed with the given statement, one neither agreed nor disagreed, and five participants agreed.

Figure 24 Distribution of answer choices to the statement "The application helped me to improve the quality and extent of the ideas I entered" in treatment group 2. Source: Own illustration


In the control group, three participants disagreed with the statement and one participant neither agreed nor disagreed.

Figure 25 Distribution of answers to the statement "The application helped me to improve the quality and extent of the ideas I entered" in the control group. Source: Own illustration

The results show that in treatment group 2, 71% of the participants agreed that the active duplicate detection improved idea quality. In contrast, 75% of the participants in the control group, which did not encounter an active duplicate detection, disagreed with the statement. As the only difference between the control and the treatment group was the presence of the duplicate detection, other sources of irritation can be ruled out. Under the stable unit treatment value assumption (SUTVA) (Rubin 1978), which is supported by the randomized selection and assignment of participants to either the control or a treatment group, the duplicate detection's positive effect on idea quality appears to be measurable.

To test the statistical significance of this assumption, the characteristics of the samples from the control and treatment groups have to be considered. Both samples were drawn independently, i.e. each participant was assigned to exactly one group, so the samples can be considered independent. The described 5-step Likert scale implies an ordinal scale of answer choices. In order to compare the treatment and the control group, the median values of both groups have been determined. As the Likert scale is ordinal and both samples are independent, the non-parametric Mann-Whitney-Wilcoxon test presents itself as suitable. The null hypothesis H0 states that there is no difference between the distributions of the control and the treatment group, while the alternative hypothesis H1 states the opposite. In other words, retaining H0 implies that the observed improvement in idea quality in the treatment group is statistically insignificant, while rejecting H0 and thus accepting H1 means that the observation is in fact statistically significant.

The following output shows the result of the Mann-Whitney-Wilcoxon test performed on the answer data of both the treatment and the control group. The control group is coded as 1, the treatment group as 2. Performing a two-tailed test at a significance level of 5%, the result showed a p-value of 0.025, indicating that the observation can be considered statistically significant at that level.

Ranks

  GROUP           N    Mean Rank   Sum of Ranks
  1 (control)     4    3.25        13.00
  2 (treatment)   7    7.57        53.00
  Total          11

Test Statistics (grouping variable: GROUP)

  Mann-Whitney U:                   3.000
  Wilcoxon W:                      13.000
  Z:                               -2.243
  Asymp. Sig. (2-tailed):            .025
  Exact Sig. [2*(1-tailed Sig.)]:    .042 (not corrected for ties)

Figure 26 Mann-Whitney-Wilcoxon Test. Source: SPSS
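The SPSS result above can be reproduced from the raw answer distributions reported earlier (control: three "disagree", one "neither"; treatment 2: one "disagree", one "neither", five "agree", coded 2, 3, 4 on the Likert scale). The following sketch is not part of the thesis implementation; it is a minimal pure-Python version of the tie-corrected asymptotic Mann-Whitney-Wilcoxon test without continuity correction, matching the "Asymp. Sig." value in the SPSS output:

```python
from math import sqrt, erf

def mann_whitney_u(sample_a, sample_b):
    """Two-tailed Mann-Whitney-Wilcoxon test, tie-corrected normal
    approximation without continuity correction."""
    combined = sorted(sample_a + sample_b)
    # assign average ranks to tied values
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        rank_of[combined[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2, n = len(sample_a), len(sample_b), len(combined)
    r1 = sum(rank_of[v] for v in sample_a)      # rank sum of group 1
    u = r1 - n1 * (n1 + 1) / 2                  # U statistic of group 1
    # variance of U with tie correction
    ties = sum(t ** 3 - t for t in (combined.count(v) for v in set(combined)))
    sigma = sqrt(n1 * n2 / 12 * ((n + 1) - ties / (n * (n - 1))))
    z = (u - n1 * n2 / 2) / sigma
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-tailed normal p
    return u, z, p

control = [2, 2, 2, 3]                 # 3 x Disagree, 1 x Neither
treatment = [2, 3, 4, 4, 4, 4, 4]      # 1 x Disagree, 1 x Neither, 5 x Agree
u, z, p = mann_whitney_u(control, treatment)
print(u, round(z, 3), round(p, 3))     # 3.0 -2.243 0.025
```

The rank sums (13 and 53), U = 3 and Z = -2.243 agree with the SPSS figures, which confirms that SPSS reports the asymptotic p-value without continuity correction here.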

Findings

It can thus be inferred that, at a significance level of 5%, the perceived improvement in idea quality in the treatment group has to be considered statistically significant. The improved idea quality hypothesis can therefore be retained: the duplicate detection can in fact help to improve idea quality by linking related ideas. However, it has to be noted that this hypothesis was validated solely on the basis of the users' self-assessment and not on an inter-judge assessment of idea quality. Such an inter-judge assessment, along with larger samples for control and treatment groups, would certainly help to corroborate the results presented here.


5.2.3 Quality of Suggested Duplicates

One major driver of the effectiveness of an active duplicate detection and prevention is the quality of the suggested duplicates. It is therefore of high interest to evaluate whether the suggested duplicates are perceived as relevant to the user's input idea. At the same time, the question arises whether the suggested duplicates are actually ranked according to the user's perceived relatedness to the entered idea. These two questions were measured with the items "I found the suggested similar ideas useful", "Suggested similar ideas were indeed similar to ideas I entered" and "The order of suggested similar ideas was matching the actual degree of similarity". Naturally, only participants of the two treatment groups were asked to evaluate these items; the following evaluation therefore always refers to both treatment groups combined.

The first item, "I found the suggested similar ideas useful", can be seen as a general assessment of the quality of the active duplicate detection. Of the 11 participants of both treatment groups, six agreed with the statement and two strongly agreed. Taken together, 73% of the participants either agreed or strongly agreed on the usefulness of the suggested semantic duplicates.

Figure 27 Distribution of answers to the statement "I found the suggested similar ideas useful" for both treatment groups 1 and 2. Source: Own illustration


Evaluating the second item, "Suggested similar ideas were indeed similar to ideas I entered", gives a clear indication of the perceived quality of the duplicate detection. Five participants agreed with this statement and two even strongly agreed, together accounting for 63% of participants.
Figure 28 Distribution of answers to the statement "Suggested similar ideas were indeed similar to ideas I entered" for both treatment groups 1 and 2. Source: Own illustration

Closely related to this is the third item, "The order of suggested similar ideas was matching the actual degree of similarity". Interestingly, seven participants agreed with this statement and one even strongly agreed, accounting for 73% of participants. The fact that approval for the third item is higher than for the second one might look strange at first glance; however, the third item is based only on a relative assessment. Suggested duplicates may be ranked exactly as the user expects, even if non-relevant suggestions appear at the end of the list.


Figure 29 Distribution of answers to the statement "The order of suggested similar ideas was matching the actual degree of similarity" for both treatment groups 1 and 2. Source: Own illustration

Findings

Due to the small sample size, statistical inferences for generalizing the observed results cannot be derived. However, the high approval for all three items suggests a well-balanced configuration of the evaluated duplicate detection. Of course, the number of found duplicates can easily be increased by lowering the similarity threshold, but this is likely to decrease the relevance of the suggested duplicates at the same time. In fact, the similarity threshold as well as the weighting of the different similarity calculations described in chapter three are the two main parameters affecting the quality of the duplicate detection.
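To illustrate how these two parameters interact, the following sketch combines two field-level cosine similarities into a weighted score and compares it against a threshold. The field weights, the threshold value and the plain term-frequency vectors are illustrative assumptions, not the configuration used in the thesis:

```python
import math
from collections import Counter

def cosine(vec_a, vec_b):
    """Cosine similarity between two bag-of-words term vectors."""
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a if t in vec_b)
    norm = (math.sqrt(sum(v * v for v in vec_a.values()))
            * math.sqrt(sum(v * v for v in vec_b.values())))
    return dot / norm if norm else 0.0

def weighted_similarity(idea_a, idea_b, w_title=0.4, w_body=0.6):
    """Weighted score over title and body similarity
    (weights are hypothetical example values)."""
    sim_title = cosine(Counter(idea_a["title"].lower().split()),
                       Counter(idea_b["title"].lower().split()))
    sim_body = cosine(Counter(idea_a["body"].lower().split()),
                      Counter(idea_b["body"].lower().split()))
    return w_title * sim_title + w_body * sim_body

def is_duplicate(idea_a, idea_b, threshold=0.5):
    """Flag the pair as a duplicate candidate above the threshold."""
    return weighted_similarity(idea_a, idea_b) >= threshold

a = {"title": "free coffee refills", "body": "offer free refills on coffee"}
b = {"title": "coffee refills for free", "body": "refills on coffee should be free"}
c = {"title": "open stores earlier", "body": "stores should open earlier in the morning"}
print(is_duplicate(a, b), is_duplicate(a, c))  # True False
```

Lowering `threshold` makes the check catch more duplicates at the cost of more false positives; shifting weight between the fields changes which kind of overlap dominates the score.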


6. Conclusion

The chosen approach of identifying semantic duplicates in innovation portals with methods borrowed from information retrieval turned out to be successful. Performing a textual content analysis of ideas proved a suitable foundation for the similarity comparisons conducted by the implemented VSM. The proposed hybrid clustering algorithm proved capable of clustering semantic duplicates in the provided test set. With the help of the introduced internal criterion, effective algorithm configurations in regard to weighted scoring and similarity threshold could be identified. The developed backend functionality and the user interface for semantic duplicate prevention in innovation portals demonstrated their effectiveness through a significant reduction of semantic duplicates during a user study. The improved idea quality hypothesis was confirmed by means of a statistically significant difference in perceived idea quality between treatment group 2 and the control group.

Yet, the developed solution has also shown its limitations. If two factually duplicate ideas are entered that share almost no words, chances are low that the described solution will recognize them as semantic duplicates. However, experiments with the creation of domain-specific automatic thesauri showed great promise for mitigating this flaw. It can therefore be assumed that with the help of automatic thesauri and fine-tuned configuration parameters it becomes possible to further increase semantic duplicate recognition and prevention.
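The vocabulary-mismatch limitation and the thesaurus-based remedy can be sketched as follows. The thesaurus entries and the expansion strategy below are illustrative assumptions; a real system would construct the thesaurus automatically from the idea corpus, as discussed for automatic thesaurus construction:

```python
# hypothetical domain thesaurus mapping terms to synonym sets
THESAURUS = {
    "beverage": {"drink"},
    "drink": {"beverage"},
    "discount": {"rebate"},
    "rebate": {"discount"},
}

def expand(tokens):
    """Add thesaurus synonyms to a token set to bridge vocabulary gaps."""
    expanded = set(tokens)
    for t in tokens:
        expanded |= THESAURUS.get(t, set())
    return expanded

def overlap(text_a, text_b, use_thesaurus=False):
    """Jaccard term overlap, optionally after thesaurus expansion."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if use_thesaurus:
        a, b = expand(a), expand(b)
    return len(a & b) / len(a | b)

idea_a = "offer a rebate on every beverage"
idea_b = "give a discount on each drink"
print(overlap(idea_a, idea_b))                      # low: only stop words shared
print(overlap(idea_a, idea_b, use_thesaurus=True))  # higher after expansion
```

The two ideas share almost no words, so a purely lexical comparison scores them low; after expansion the synonym pairs overlap and the score rises, which is the effect the automatic thesauri are intended to exploit.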


References
Adams, D.A.; Nelson, R.R.; Todd, P.A. (1992): Perceived Usefulness, Ease of Use, and Usage of Information Technology: A Replication. In: MIS Quarterly, Vol. 16 (1992), No. 2, pp. 227-247.
Berkhin, P. (2006): A Survey of Clustering Data Mining Techniques. In: Grouping Multidimensional Data, Editors: Kogan, J.; Nicholas, C.; Teboulle, M., Springer, Berlin Heidelberg 2006, pp. 25-71.
Crouch, C.J. (1990): An approach to the automatic construction of global thesauri. In: Information Processing & Management, Vol. 26 (1990), No. 5, pp. 629-640.
Davis, F.D. (1989): Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology. In: MIS Quarterly, Vol. 13 (1989), No. 3, pp. 319-340.
Hartigan, J.A.; Wong, M.A. (1979): A K-means clustering algorithm. In: Applied Statistics, Vol. 28 (1979), pp. 100-108.
Jain, A.K.; Dubes, R.C. (1988): Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs 1988.
Jardine, N.; van Rijsbergen, C.J. (1971): The use of hierarchic clustering in information retrieval. In: Information Storage and Retrieval, Vol. 7 (1971), No. 5, pp. 217-240.
Koudas, N.; Marathe, A.; Srivastava, D. (2004): Flexible String Matching Against Large Databases in Practice. In: Proceedings of the 30th VLDB Conference, Vol. 30 (2004), pp. 1078-1086.
Lucene API: Apache Lucene API. In: http://lucene.apache.org/, accessed on 14.08.2009.
Lund, A.: USE Questionnaire Resource Page. In: http://www.usesurvey.com/, accessed on 14.08.2009.
Miller, G.A. (1995): WordNet: A Lexical Database for English. In: Communications of the ACM, Vol. 38 (1995), No. 11, pp. 39-41.
Porter, M.F. (1980): An algorithm for suffix stripping. In: Program, Vol. 14 (1980), No. 3, pp. 130-137.
Rubin, D.B. (1978): Bayesian Inference for Causal Effects: The Role of Randomization. In: Annals of Statistics, Vol. 6 (1978), pp. 34-58.
Ruge, G. (1997): Automatic detection of thesaurus relations for information retrieval applications. In: Foundations of Computer Science, Lecture Notes in Computer Science (LNCS), Springer, Berlin Heidelberg 1997, pp. 499-506.
Salesforce.com Inc.: Salesforce IdeaExchange. In: http://ideas.salesforce.com/, accessed on 14.08.2009.
Salton, G.; Yang, C.S.; Yu, C.T. (1974): A theory of term importance in automatic text analysis. Technical Paper.
Salton, G.; Wong, A.; Yang, C.S. (1975): A vector space model for automatic indexing. In: Communications of the ACM, Vol. 18 (1975), No. 11, pp. 613-620.
Song, Y.; Zhuang, Z.; Li, H.; Zhao, Q.; Li, J.; Lee, W.; Giles, C.L. (2008): Real-time Automatic Tag Recommendation. In: Proceedings of the 31st annual international ACM SIGIR conference (2008), pp. 512-522.
Spärck Jones, K. (1972): A statistical interpretation of term specificity and its application in retrieval. In: Journal of Documentation, Vol. 28 (1972), No. 1, pp. 11-21.
Starbucks Inc.: My Starbucks Idea. In: http://mystarbucksidea.force.com, accessed on 14.08.2009.
Steinbach, M.; Karypis, G.; Kumar, V. (2000): A comparison of document clustering techniques. At: KDD Workshop on Text Mining (2000).
Theseus TEXO Project: TEXO Business Webs in the Internet of Services. In: http://theseusprogramm.de/en-us/theseus-application-scenarios/texo/default.aspx, accessed on 14.08.2009.
Treeratpituk, P.; Callan, J. (2006): Automatically Labeling Hierarchical Clusters. In: Proceedings of the 2006 International Conference on Digital Government Research, Vol. 151 (2006), pp. 167-176.
Waller, W.G.; Kraft, D.H. (1979): A mathematical model of a weighted boolean retrieval system. In: Information Processing & Management, Vol. 15 (1979), pp. 235-245.
Zhao, Y.; Karypis, G. (2001): Criterion Functions for Document Clustering: Experiments and Analysis. Technical Report, University of Minnesota 2001.


Appendix


Appendix A

Survey Evaluation Control Group (n=4)

Participants rated the following statements on the scale N/A (0), Strongly disagree (1), Disagree (2), Neither agree nor disagree (3), Agree (4), Strongly agree (5). Items marked with (*) refer to the duplicate suggestion feature:

- The user interface helped me to be more innovative
- The application helped me to improve the quality and extent of the ideas I entered
- The application supported me in developing new ideas
- I found the suggested similar ideas useful (*)
- Suggested similar ideas were indeed similar to ideas I entered (*)
- The order of suggested similar ideas was matching the actual degree of similarity (*)
- It was easy to enter ideas in general
- Choosing and selecting a suggested similar idea was easy (*)
- The application worked the way I want it to work
- I would recommend the application to a friend
- I liked the application
- Finding similar ideas was fun (*)
- Finding similar ideas was annoying (*)
- Finding similar ideas was interesting (*)


Survey Evaluation Treatment Group 1 (n=4)

The four participants of treatment group 1 rated the same fourteen statements on the same answer scale as in the control group survey.


Survey Evaluation Treatment Group 2 (n=7)

The seven participants of treatment group 2 rated the same fourteen statements on the same answer scale as in the control group survey.

Survey Evaluation Treatment Groups 1 & 2 (n=11)

Aggregated answer distributions of both treatment groups over the same fourteen statements and answer scale as in the control group survey.
