Beruflich Dokumente
Kultur Dokumente
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 1, Issue 2, July August 2012 ISSN 2278-6856
questions are similar? The classic approaches to information retrieval (IR) would suggest a similarity calculation according to their keywords. However, this approach has some known drawbacks due to the limitations of keywords. At the retrieval level, traditional approaches are also limited by the fact that they require a document to share some keywords with the query to be retrieved. In reality, it is known that users often use keywords or terms that are different from the documents. There are then two different term spaces, one for the users, and another for the documents. How to create relationships for the related terms between the two spaces is an important issue. The creation of such relationships would allow the system to match queries with relevant documents, even though they contain different terms. Again, with a large amount of user logs, this may be possible Despite the fact that keywords are not always good descriptors of contents, most existing search engines still rely solely on the keywords contained in documents and queries to calculate their similarity. This is one of the main factors that affect the precision of the search engines. In many cases, the answers returned by search engines are not relevant to the users information need, although they do contain the same keywords as the query. The classic approach to information retrieval (IR) would suggest a similarity calculation between queries according to their keywords. However, this approach has some known drawbacks due to the limitations of keywords. In the case of queries, in particular, the keyword-based similarity calculation will be very inaccurate (with respect to semantic similarity) due to the short lengths of the queries.
Keywords:
Clustering,
Sense
Disambiguation,
INTRODUCTION
In order to find more precise answers to a query, a new generation of search engines - or question answering systems - has appeared on the Web (e.g. AskJeeves, http://www.askjeaves.com/). Unlike the traditional search engines that only use keywords to match documents, this new generation of systems tries to understand the user's question, and suggest some similar questions that other people have often asked and for which the system has the correct answers. However, the queries submitted by users are very different, both in form and in intention. How human editors can determine which queries are FAQs is still an open issue. A closely related question is: how can a system judge if two
Page 38
The most similar work is that of [7], which exploits the same hypothesis for document and query clustering. However, while exploiting cross-references, [7] reject the use of the contents of queries and documents. They consider that keyword similarity is totally unreliable. We believe that content words provide some useful information for query clustering that is complementary to cross-references. Therefore, our approach tries to combine both content words and cross-references in query clustering. However, using query keywords directly is not reliable due to their short length and word ambiguity. In [8], the top documents retrieved by a search engine for each query are considered as the data source of the query and a hierarchical clustering algorithm is applied. Based on the click-through data, in [9], a bipartite graph of queries and documents is constructed and then a graph based agglomerative iterative clustering method is applied to merge vertices of graph continually until a termination condition reaches. Similar approach is also followed by [10] and found good results. Going beyond the clicked URLs in the search engine query and browsing logs [11], developed one algorithm to use query keywords and the clicked documents to estimate the similarity between queries. [12] Developed a hybrid similarity measure feature for the clustering approach. In [13] [14], a query is assumed ambiguous based on the underlying document collection. [13] Introduced a query disambiguation method by considering part-of-speech patterns of the querys context terms. For each query, [13] analyzed its context in the data collection and classify each occurrence of the query to a question which reflects its context type. Ambiguity is resolved by choosing a question from the generated question list. Such a method suffers from scalability issues and is not applicable to the Web environment. [15] Proposed a way to classify users query to the Open Directory Project (ODP) categories to solve query ambiguity. However, it relies on the users profile.
Sim keyword ( p, q )
KN ( p, q ) Max(kn( p ), kn(q ))
where kn(.) is the number of keywords in a query, KN(p, q) is the number of common keywords in two queries. A first feedback-based similarity considers each document in isolation. This similarity is proportional to the number of the clicked individual documents in common for two queries p and q as follows:
Simclick ( p, q)
RD( p, q ) Max(rd ( p ), rd (q ))
where rd(.) is the number of clicked documents for a query and RD (p, q) is the number of document clicks in common. In spite of its simplicity, this measure demonstrates a strong ability to cluster semantically related queries that contain different words. Below are some queries from one such cluster: Query 1 atomic bomb Query 2 Manhattan Project Query 3 Hiroshima Query 4 nuclear fission Query 5 Japan surrender A very similar study [7] has been carried out recently. However, that study rejects the use of content words and relies solely on document clicks to cluster queries. We think that both query contents words and the corresponding document clicks can partially capture the users interests. Therefore, it is better to use both. A simple way to do it is to combine both measures linearly as follows:
where
and
similarity measure, with 1 . Here, and represent varying levels of contribution a particular approach (results-based or content-based respectively) has in determining the similarity between queries. 2.4 Incremental approach The system incorporates implicit ambiguity resolution method based on query-oriented document clusters. In the system, a query in Korean is first translated into English by looking up dictionaries, and documents are retrieved based on the vector space retrieval for the translated query. For the top-ranked retrieved documents, document clusters are incrementally created and the weight of each retrieved document is re-calculated by using clusters with preference. This phase is the core of our implicit ambiguity resolution method. While synonyms can improve retrieval effectiveness, terms with different meanings produced from the same original term can degrade retrieval performance tremendously. At this stage, we can apply statistical ambiguity resolution method based on mutual information. For the query, documents are retrieved based on the vector space retrieval method. This method simply checks the existence of query terms, and calculates similarities between the query and documents. The query-document similarity of each document is calculated by vector inner product of the query and document vectors:
t
The clusters made are based on incremental centroid method. There are a few variations in the agglomerative clustering method. The agglomerative centroid method joins the pair of clusters with the most similar centroid at each stage [20]. Incremental centroid clustering method is straightforward. The input document of incremental clustering proceeds according to the ranks of the topranked N documents resulted from vector space retrieval for a query. Document and cluster centroid are represented in vectors. For the first input document (rank 1), create one cluster whose member is itself. For each consecutive document (rank 2, ..., N), compute cosine similarity between the document and each cluster centroid in the already created clusters. If the similarity between the document and a cluster is above a threshold, then add the document to the cluster as a member and update cluster centroid. Otherwise, create a new cluster with this document. Note that one document can be a member of several clusters. Similarities between the clusters and the query, or query-cluster similarities, are calculated by the combination of the query inclusion ratio and vector inner product between the query vector and the centroid vectors of the clusters.
simC (q, c )
cq t . wqi .wci q i 1
Where |q| is the number of terms in the query, |cq| is the number of query terms included in a cluster centroid, |cq|/|q| is the query inclusion ratio for the cluster. The documents included in the same cluster have the same query-cluster similarity. Cluster preferences are influenced by the query inclusion ratio, which prefers the cluster whose centroid includes more various query terms. Thus incorporating this information into the weighting of each document means adding information which is related to the behavior of terms in documents as well as the association of terms and documents into the evaluation of the relevance of each document; it therefore has the effect of ambiguity resolution. 2.5 Overall comparison of all approaches Keyword-based document clustering has provided interesting results. One contributing factor is the large number of keywords contained in documents. Even if Page 41
where query and document weight, qi w and di w , are calculated by ntc-ltn weighting scheme which yields the best retrieval result in [17] among several weighting schemes used in SMART system [18]. As the translated query can contain noises, non-relevant documents may have higher ranks than relevant documents. In order to exclude non-relevant documents from higher ranks, the method used N documents to create clusters incrementally and dynamically, and use similarities between the clusters and the query to re-rank the documents. Basic idea is: Each cluster created by clustering of retrieved documents can be seen as giving a context of the documents belonging to the cluster; by calculating the similarity between each cluster and the query, therefore, we can spot the relevant context of the Volume 1, Issue 2 July-August 2012
[22] Suggested the method which used the clusters of retrieved documents as a context for re-weighting each retrieved document and for re-ranking the retrieved documents. Their method was evaluated on TREC-6 CLIR test collection. This method achieved 28.29% performance improvement for translated queries without ambiguity resolution. This corresponds to 97.27% of the monolingual performance. Their method was used with the query ambiguity resolution method based on mutual information; it illustrates 105.87% performance improvement of the monolingual retrieval. These results indicate that cluster analysis help to resolve ambiguity greatly, and each cluster itself provide a context for a query. Their method is language independent model which can be applied to any language retrieval. Query clustering has received more and more attention in recent years, but most existing clustering algorithms seldom consider query disambiguation. [25] Proposed a unified framework to simultaneously cluster and disambiguate queries by mining query logs. The proposed iterative procedure not only resolves the data sparseness of query logs, but also gives more meaningful Web page clusters. The concept set which corresponds to the Web page cluster result is used to assign the queries. In this way, ambiguous queries could be assigned to concepts with different meanings. The experimental results proved that their algorithm is very effective to cluster and disambiguate queries. They get 29% and 11% performance improvement on semi-synthetic and real data respectively compared to the baseline.
In the past few years, Web clustering engines [21] have been proposed as a solution to the lexical ambiguity issue in Web Information Retrieval. These systems group search results, by providing a cluster for each specific aspect (i.e., meaning) of the input query. Users can then select the cluster(s) and the pages therein that best answer their information needs. However, many Web clustering engines group search results on the basis of their lexical similarity. For instance, consider the following snippets returned for the beagle search query: 1. Beagle is a search tool that ransacks your... 2. ...the beagle disappearing in search of game... 3. Beagle indexes your files and searches... While snippets 1 and 3 both concern the Linux search tool, they do not have any content word in common except our query words. As a result, they will most likely be assigned to two different clusters. The ambiguity problem is one of the problems which deteriorate the performance of the web information retrieval. Query Clustering is found one of the solutions for this problem. Volume 1, Issue 2 July-August 2012
4. CONCLUSION
In this paper we had tried to find out the role of query clustering in sense disambiguation. As discussed in section IV we can justify the role of sense clustering in WSD. Sense Ambiguity is one of the core problems of web information retrieval. Sense Ambiguity deteriorates the performance of the web information retrieval. The paper concludes that clustering approach is helpful in disambiguation of word senses. The researchers showed high level of performance level in using clustering approach for WSD. The paper also outlined the various approaches for clustering and also their benefits and drawbacks.
Table 1 Overall Comparision of all Approchaces
Page 42
Hybrid Approac h
Increment al Approach
Result Documents
Result documents
Metho d used
Bipartite Graph
Adva ntage
Number of keywords contained in documents. Even if some of the keywords of two similar documents are different, there are still many others that can make the documents similar in the similarity calculation. Not works fine with short queries
Disad vanta ge
The approach is affected by the possibility that queries representing different information needs might lead to the same results since one result URL can contain information in several topics
References
[1] G. Salton And M. J. Mcgill, Introduction to Modern Information Retrieval. McGraw-Hill New York, NY, 1983. [2] D. Gusfield, Inexact matching, sequence alignment, and dynamic programming. In Algorithms on Strings, Trees, and Sequences Computer Science and Computational Biology, Cambridge University Press, 1997 [3] R. C. Dubes And A. K. Jain, Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ, 1988 [4] E. Garfield, Citation Indexing: Its Theory and Application in Science, Technology and Humanities, 2nd ed. The ISI Press, Philadelphia, PA, 1983. [5] M. M. Kessler, Bibliographic coupling between scientific papers. In American Documentation, 14, 1, 1025, 1963. Volume 1, Issue 2 July-August 2012
Page 44