
IAETSD JOURNAL FOR ADVANCED RESEARCH IN APPLIED SCIENCES ISSN NO: 2394-8442

Using Word Net for Document Clustering: A Detailed Review


Harsha Patil 1 , Dr. R. S. Thakur 2
1 Research Scholar, MANIT, Bhopal, Ashoka Center For Business and Computer Studies, Nashik, India.
2 Maulana Azad National Institute of Technology, Bhopal, India.

1 Harshap.acbcs@aef.edu.in, 2 ramthakur2000@yahoo.com

ABSTRACT: Document clustering is an unsupervised technique for organizing documents into groups on the basis of their
similarity. Document clustering techniques are particularly useful for efficiently managing and organizing the results of a search
engine query. Most document clustering techniques use the Vector Space Model (VSM) to represent a document. VSM based
methods generate a bag of words, but they do not consider the semantic relationships among the words. Many researchers
are working on the semantic aspects of document clustering to improve cluster quality, and over the last eight to nine years there
have been efforts to apply semantics to document clustering. Many external knowledge bases, such as WordNet, Wikipedia, and
Lucene, are used to handle this challenge. This article explains the semantic approach in detail, along with the different semantic
similarity measures used in algorithms for finding semantic associations among words.

Keywords: Document Clustering, Semantic, WordNet, Similarity measures, synonyms.

INTRODUCTION
Document clustering deals with unstructured text, which raises many challenges. The abstract concepts contained in text are
difficult to represent and can be linked by countless combinations of abstract relationships. Most traditional methods are based
on the Bag of Words (BOW) model, which disregards semantic associations among words and consequently produces lower
quality output. Over the last nine to ten years, many researchers have explored the semantic aspects of words to improve text
mining techniques. External knowledge bases have proved very helpful in developing semantic based approaches for document
clustering. WordNet (Miller, 1995) is one of the most widely used thesauri for the English language. Ambiguity and synonymy
are two of the major problems that document clustering techniques regularly fail to tackle. In recent research, WordNet has been
widely used to improve the quality of document clustering. WordNet is a lexical knowledge base built on conceptual lookup,
which organizes lexical information in terms of word meaning rather than word form. Here we present an extensive survey of
semantic based techniques for document clustering that exploit WordNet to improve the quality of the clusters formed.

DOCUMENT CLUSTERING TECHNIQUES: AN OVERVIEW


Document clustering is the process of forming cohesive groups of documents that are similar to each other. Documents in the
same group therefore have higher similarity to one another than to documents from a different cluster. Document clustering
techniques can be classified into three broad categories: partitioning methods [1], agglomerative and divisive clustering [2], and
item set based clustering [3]. Many clustering algorithms based on partitioning or hierarchical methodology, such as K-Means [4],
Bisecting K-Means [1], Hierarchical Agglomerative Clustering (HAC) [x], and Unweighted Pair Group Method with Arithmetic
Mean (UPGMA) [x], perform efficiently on low dimensional data but produce poor clusterings on high dimensional data.
Frequent item set based algorithms handle the high dimensionality of text documents by selecting only frequent item sets as
features for clustering. Hierarchical Frequent Term based Clustering (HFTC) [5], proposed by Beil, Ester, and Xu (2002), made
a major contribution in this direction. Fung et al. then proposed hierarchical document clustering using frequent item sets
(FIHC) [3], which uses association rule mining and provides meaningful labels for the clusters. None of the above algorithms
considers the semantic associations among words.
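To illustrate why a purely term based representation misses semantic associations, the following minimal sketch compares three toy documents using term-frequency vectors and cosine similarity. The documents, the function names, and the use of raw term counts (rather than a weighting such as TF-IDF) are assumptions made for illustration only; they are not taken from any of the surveyed algorithms.

```python
import math
from collections import Counter

def bow_vector(text):
    """Term-frequency bag of words: word order and word meaning are discarded."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy documents (illustrative only). doc_a and doc_b say nearly the same
# thing with different words; doc_a and doc_c share surface terms.
doc_a = bow_vector("the car needs fuel")
doc_b = bow_vector("an automobile requires petrol")
doc_c = bow_vector("the car needs repair")

sim_ab = cosine(doc_a, doc_b)  # no shared terms, so the score is zero
sim_ac = cosine(doc_a, doc_c)  # three of four terms are shared

print(f"synonymous wording: {sim_ab:.2f}")
print(f"shared surface terms: {sim_ac:.2f}")
```

Under this representation the two documents with synonymous wording receive a similarity of zero, which is the gap that WordNet based approaches aim to close.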

VOLUME 4, ISSUE 7, DEC/2017 339 http://iaetsdjaras.org/



Over the last nine to ten years, many researchers have used external knowledge bases to associate meanings with words.
WordNet is used extensively for this purpose. Related work using WordNet is covered in detail in a later section of this
paper.

WORDNET
WordNet is one of the most widely used lexical reference systems and is based on psycholinguistic theories of human lexical
memory. There is a multilingual WordNet for European languages that is structured in the same way as the English WordNet.
WordNets are now also available for some Indian languages, such as Hindi, Marathi, and Punjabi, and are to be linked with the
English and European WordNets. The most recent Windows version of WordNet is 2.1, released in March 2005. Version 3.0 for
Unix/Linux/Solaris/etc. was released in December 2006.

Various natural language processing applications use WordNet for word sense disambiguation, measuring the semantic distance
between words, machine translation, search engine processing, plagiarism detection, sentiment analysis, etc. WordNet
establishes lexical and semantic connections between the noun, verb, adjective, and adverb forms of words, each of which
expresses a distinct concept. Each distinct concept is called a sense and has its own synset, which gives the specific meaning of
the word, its explanation, and its synonyms. The fundamental unit searched in WordNet is therefore the concept rather than the
word, whereas in a dictionary the fundamental unit is the word.
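The idea of looking up concepts rather than word forms can be sketched with a toy lexicon. Everything below is hypothetical: the `Synset` class, the identifiers such as `car#1`, the glosses, and the three entries are invented for illustration and do not reproduce WordNet's actual data or file format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Synset:
    synset_id: str    # hypothetical identifier, not a real WordNet key
    gloss: str        # short explanation of the concept
    synonyms: tuple   # word forms that express this concept

# Toy lexicon (illustrative only).
SYNSETS = [
    Synset("car#1", "a motor vehicle with four wheels",
           ("car", "automobile", "auto")),
    Synset("bank#1", "sloping land beside a body of water",
           ("bank",)),
    Synset("bank#2", "a financial institution that accepts deposits",
           ("bank", "depository")),
]

def senses(word):
    """All concepts (senses) that a word form can express."""
    return [s for s in SYNSETS if word.lower() in s.synonyms]

def share_concept(word_a, word_b):
    """True if the two word forms have at least one synset in common."""
    ids_a = {s.synset_id for s in senses(word_a)}
    ids_b = {s.synset_id for s in senses(word_b)}
    return bool(ids_a & ids_b)

for sense in senses("bank"):
    print(sense.synset_id, "-", sense.gloss)
print(share_concept("car", "automobile"))
```

In this sketch an ambiguous word form ("bank") maps to several synsets, while synonymous word forms ("car", "automobile") map to the same synset, which mirrors the ambiguity and synonymy problems noted in the introduction.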

SEMANTIC SIMILARITY MEASURES:

Semantic similarity is a confidence score for two words that expresses the likeness of their meanings. The growing use of
WordNet has led to many semantic similarity methods for finding the distance between words. According to Meng et al. [6],
semantic similarity measures can be broadly categorized into four classes: path length based measures, information content
based measures, feature based measures, and hybrid measures.

1 Path length based measures: These are based on the length of the path connecting the concepts and the location of the
concepts in the taxonomy [7]. They count the edges between concepts. The disadvantage of this approach is that any two pairs
with shortest paths of equal length will have the same similarity.
2 Information content based measures: These are based on the principle that the more information two concepts share,
the more similar they are [8].
3 Feature based measures: According to these measures, two concepts are more similar if they have more common features
and fewer non-common features [9]. These measures do not work properly if the complete feature sets of the concepts are not
available.
4 Hybrid measures: These measures combine the principles of path length based, information content based, and feature
based measures. They also consider relations such as IS-A and Part-Of when finding semantic similarity.
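The first three classes can be sketched on a small hand-built IS-A taxonomy. This is a minimal illustration under stated assumptions: the taxonomy, the corpus counts, and the feature sets are invented; the path score uses one common form, 1 / (1 + shortest path length); the information content score follows Resnik's definition [8], the information content of the lowest common subsumer; and the feature score uses Tversky's ratio model [9] with weights chosen arbitrarily as 0.5. Hybrid measures are not shown.

```python
import math

# Toy IS-A taxonomy as child -> parent links (illustrative only).
PARENT = {"vehicle": "entity", "animal": "entity",
          "car": "vehicle", "bicycle": "vehicle",
          "dog": "animal", "cat": "animal"}
# Hypothetical corpus counts for the leaf concepts.
COUNTS = {"car": 40, "bicycle": 10, "dog": 30, "cat": 20}
# Hypothetical feature sets for the feature based measure.
FEATURES = {"car": {"wheels", "engine", "carries_people"},
            "bicycle": {"wheels", "pedals", "carries_people"},
            "dog": {"legs", "fur", "barks"}}

def ancestors(concept):
    """The concept followed by its chain of parents up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_subsumer(a, b):
    """Most specific concept that is an ancestor of both a and b."""
    up_b = set(ancestors(b))
    return next(c for c in ancestors(a) if c in up_b)

def path_similarity(a, b):
    """Path length based: 1 / (1 + number of edges on the shortest path)."""
    lcs = lowest_common_subsumer(a, b)
    edges = ancestors(a).index(lcs) + ancestors(b).index(lcs)
    return 1.0 / (1.0 + edges)

def information_content(concept):
    """-log p(concept), where p counts the concept and everything below it."""
    freq = sum(n for leaf, n in COUNTS.items() if concept in ancestors(leaf))
    return -math.log(freq / sum(COUNTS.values()))

def resnik(a, b):
    """Information content based: IC of the lowest common subsumer."""
    return information_content(lowest_common_subsumer(a, b))

def tversky(a, b, alpha=0.5, beta=0.5):
    """Feature based: common features weighed against distinctive ones."""
    fa, fb = FEATURES[a], FEATURES[b]
    common = len(fa & fb)
    denom = common + alpha * len(fa - fb) + beta * len(fb - fa)
    return common / denom if denom else 0.0

print(path_similarity("car", "bicycle"), path_similarity("dog", "cat"))
print(resnik("car", "bicycle"), resnik("car", "dog"))
print(tversky("car", "bicycle"), tversky("car", "dog"))
```

In this toy setting the pairs (car, bicycle) and (dog, cat) receive the same path score because their shortest paths have equal length, which is the disadvantage noted for path length based measures. The feature based function also depends entirely on `FEATURES` being complete, in line with the limitation stated for that class.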

RELATED WORKS
In this section we provide a tabulated summary of the related research on document clustering using WordNet. This detailed
study should be a valuable resource for researchers who are considering this area for their future research.


Table 1: Summarized details on review of Document clustering algorithms using WordNet


CONCLUSION
This paper covers semantic similarity measures in WordNet for finding semantic associations among words. The survey should
be useful for researchers who want to use semantic based techniques for document clustering that exploit WordNet to improve
the quality of the clusters formed. It gives a summarized description of ongoing research using WordNet and briefly covers the
algorithms used, the evaluation measures, and the future scope.

REFERENCES
[1] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. Proc. of the 6th ACM SIGKDD
international conference on Text Mining Workshop, KDD 2000, 2000.

[2] Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data, An introduction to Cluster Analysis, John Wiley & Sons, Inc (1990)

[3] Fung, B., Wang, K., Ester, M.: Hierarchical Document Clustering using Frequent Item sets, In Proc. of SIAM Intl. Conf. on Data
Mining. (2003)

[4] Hartigan, J. A., and Wong, M. A.: Algorithm AS 136: A K-Means Clustering Algorithm. In: Journal of the Royal Statistical Society.
Series C (Applied Statistics), Vol. 28, pp. 100-108, Royal Statistical Society (1979)


[5] Beil, F., Ester, M., Xu, X.: Frequent Term-based Text Clustering, In Proc. of Intl. Conf. on Knowledge Discovery and Data Mining.
(2002)

[6] Meng, L., Huang, R., & Gu, J. (2013). A review of semantic similarity measures in wordnet. International Journal of Hybrid Information
Technology, 6(1), 1–12.

[7] G. Varelas, E. Voutsakis, P. Raftopoulou, E. G. M. Petrakis and E. E. Milios, “Semantic similarity methods in WordNet and their
application to information retrieval on the web”, Proceedings of the 7th annual ACM international workshop on Web information and data
management, (2005) October 31- November 05, Bremen, Germany.

[8] P. Resnik, “Using information content to evaluate semantic similarity”, Proceedings of the 14th International Joint Conference on Artificial
Intelligence, (1995) August 20-25; Montréal Québec, Canada.

[9] A. Tversky, “Features of Similarity”, Psychological Review, vol. 84, no. 4, (1977).

[10] Andreas Hotho , Steffen Staab , Gerd Stumme, “Wordnet improves Text Document Clustering,” In Proc. of the SIGIR 2003 Semantic
Web Workshop, 2003

[11] Chihli Hung, Stefan Wermter, Peter Smith, “Hybrid Neural Document Clustering Using Guided Self-Organization and WordNet,”
Journal IEEE Intelligent Systems archive, Vol. 19 Issue 2, pp. 68-77, Mar. 2004

[12] Yong Wang, Julia Hodges, “Document Clustering with Semantic Analysis,” In Proc. of the 39th Annual Hawaii International Conference
on System Sciences, HICSS, Vol. 03, pp. 54.3, 2006

[13] Hai-Tao Zheng, Bo-Yeong Kang, Hong-Gee Kim, “Exploiting noun phrases and semantic relationships for text document clustering,”
Journal of Information Sciences, Vol. 179, Issue 13, pp. 2249-2262, Jun. 2009

[14] Chun-Ling Chen, Frank S. Tseng, Tyne Liang, “An Integration of Fuzzy Association Rules and WordNet for Document Clustering,”
In Proc. of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD, pp. 147-159, 2009

[15] S.Vijayalakshmi, Dr.D.Manimegalai, “Query based Text Document Clustering using its Hypernymy Relation,” International Journal of
Computer Applications 23(1):13–16, Jun. 2011

[16] Dang, Q., Zhang, J., Lu, Y., & Zhang, K. (2013). WordNet-based suffix tree clustering algorithm. Paper presented at the 2013
International Conference on Information Science and Computer Applications (ISCA 2013).

[17] T. Wei, Y. Lu, H. Chang, Q. Zhou, and X. Bao, “A semantic approach for text clustering using WordNet and lexical chains,” Expert
Systems with Applications, vol. 42, no. 4, pp. 2264–2275, 2015.

[18] Sujata R. Kolhe, S. D. Sawarkar, “A concept driven document clustering using WordNet,” In Proc. of the International Conference on
Nascent Technologies in Engineering (ICNTE), 2017.
