Survey on Clustering Algorithm for Document Clustering
Priyanka Khadse, Student, Computer Science and Engineering, W.C.E.M, Nagpur, India
Harshal Chowhan, Asst. Professor, Computer Science and Engineering, W.C.E.M, Nagpur, India
Abstract: In today's era, all documents are kept in electronic format to save space and allow easy access. The biggest task is therefore to retrieve documents from a large database. Document clustering is the process of partitioning collected data into subgroups with similarities. The purpose of document clustering is to provide a technique that supports people in searching for and understanding information. The goal is to provide an approach for extracting unknown patterns from a large database. Clustering algorithms are used for document clustering; in this way, the database is summarised and presented in a unique manner.
Keywords: Document clustering, information extraction, clustering algorithm.

I. INTRODUCTION
Document clustering is a process for quickly extracting important information from files. The goal of this survey is to present and review different clustering techniques. Clustering is the task of grouping a set of objects by similarity and is known as unsupervised learning. Like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. Document clustering is the process of organising documents into meaningful clusters. In other words, the documents in one cluster are more similar to each other than to the documents in another cluster.

Clustering is the most common form of unsupervised learning, and this is the major difference between clustering and classification. Unsupervised means that grouping the objects does not depend on predefined classes or on training. Many applications, such as segmentation, image processing, pattern recognition, marketing and economics, use clustering. Clustering is therefore a very important area of research.

Document clustering is done using different techniques and models, such as Kohonen's Self-Organizing Maps (SOM) [4] and the k-means algorithm [1]. Beebe and Dietrich in [5] proposed a new process model for text string searches that advocated the use of machine learning techniques, clustering being one of them. Clustering algorithms are typically used for exploratory data analysis, where there is little or no prior knowledge about the data [2], [3].

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, document clustering was investigated for improving the precision or recall in information retrieval systems [Rij79, Kow97] and as an efficient way of finding the nearest neighbours of a document [BL85].
More recently, clustering has been proposed for use in browsing a collection of documents [CKPT92] or in organizing the results returned by a search engine in response to a user's query [ZEMK97]. Document clustering has also been used to automatically generate hierarchical clusters of documents [KS97].

There are many algorithms for clustering in data mining, where clustering is one of the primary tools for analysis. Clustering algorithms are divided into two groups: hierarchical algorithms and partition algorithms. A hierarchical algorithm divides the dataset into two subsets at each step, whereas a partition algorithm divides the dataset into a number of subsets in one step. Hierarchical algorithms are further categorized as agglomerative or divisive. Agglomerative clustering starts with one-point clusters and repeatedly merges the two or more most appropriate clusters. Divisive clustering starts with one cluster containing all data points and repeatedly splits it into the most appropriate clusters. The process continues until a stopping criterion is met. Hierarchical clustering builds a tree known as a dendrogram, as shown in Fig (a).
Fig (a): Example of a dendrogram

II. RELATED WORKS
This section reviews the clustering techniques used in this study. A clustering algorithm has the stages shown in Fig (b). Clustering algorithms are used for document clustering. In this study, the hierarchical algorithm performs better than K-means but is slower. Fuzzy-c means works on fuzzy logic and takes less time to execute than the hierarchical algorithm.

Collection of data covers all documents that are to be clustered, including all types of file. This data is passed on to the next step, where it is indexed for storage and filtered to remove extra words.
Fig (b): Stages of clustering: collection of data, pre-processing of data, document clustering, similarity calculation, iteration & output

International Journal of Computer Trends and Technology (IJCTT), volume 11, number 3, May 2014. ISSN: 2231-5381. http://www.ijcttjournal.org Page 129
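The pre-processing and similarity stages named in Fig (b) can be illustrated with a minimal Python sketch. The paper does not specify a document representation or a similarity measure, so the bag-of-words term counts, the cosine similarity and the tiny `STOP_WORDS` list below are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Assumed, illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def preprocess(text):
    """Lower-case the text, split it into words and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def term_frequencies(tokens):
    """Index a document as a bag of words: term -> count."""
    return Counter(tokens)


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "Clustering is the grouping of documents",
    "The grouping of documents is clustering",
    "Fuzzy logic in image analysis",
]
vectors = [term_frequencies(preprocess(d)) for d in docs]
print(cosine_similarity(vectors[0], vectors[1]))  # same words after filtering
print(cosine_similarity(vectors[0], vectors[2]))  # no words in common
```

The first two documents contain the same words once stop words are removed, so their similarity is at the maximum, while the third shares no terms with them. A similarity matrix built this way is the kind of input a clustering step could consume.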
Pre-processing takes plain text as input, filters it and provides the filtered text as output. In this step, all stop words are removed and synonyms are reduced to a single term. Document clustering is the main part of this paper and is discussed in detail. Similarity calculation has two parts: inter-cluster and intra-cluster distance. This distance depends on the similarity between clusters. Iteration & output covers the execution time and the number of iterations needed to form the clusters.

A. Hierarchical Algorithm
Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and singleton clusters of individual points at the bottom. Each intermediate level can be viewed as combining two clusters from the next lower level. The result of a hierarchical clustering algorithm can be displayed graphically as a tree, called a dendrogram. This tree shows the merging process and the intermediate clusters. For document clustering, this dendrogram provides a hierarchical index.

There are two basic approaches to generating a hierarchical clustering:
Agglomerative: Start with the points as individual clusters and, at each step, merge the most similar or closest pair of clusters. This requires a definition of cluster similarity or distance.
Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain. In this case, we need to decide, at each step, which cluster to split and how to perform the split.

Agglomerative Algorithm:
1. Compute the similarity between all pairs of clusters, i.e., calculate a similarity matrix whose ijth entry gives the similarity between the ith and jth clusters.
2. Merge the most similar (closest) two clusters.
3. Update the similarity matrix to reflect the pairwise similarity between the new cluster and the original clusters.
4. Repeat steps 2 and 3 until only a single cluster remains.
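The four agglomerative steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `agglomerative`, the `num_clusters` stopping parameter and the update rule in step 3 (the similarity to a merged cluster is the larger of the two old similarities, which is one possible choice) are assumptions.

```python
def agglomerative(similarity, num_clusters=1):
    """Merge the most similar clusters until `num_clusters` remain.

    similarity: symmetric n x n list of lists of pairwise point similarities.
    Returns (clusters, merges), where merges records (id_a, id_b, similarity).
    """
    n = len(similarity)
    clusters = {i: [i] for i in range(n)}  # every point starts as its own cluster
    # Step 1: similarity between all pairs of clusters, keyed by (smaller id, larger id).
    sim = {(i, j): similarity[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > num_clusters:
        # Step 2: pick and merge the most similar pair of clusters.
        (a, b), best = max(sim.items(), key=lambda kv: kv[1])
        merged = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, best))
        # Step 3: update the similarity matrix for the new cluster.
        new_sim = {pair: s for pair, s in sim.items()
                   if a not in pair and b not in pair}
        for c in clusters:
            s_a = sim[(min(a, c), max(a, c))]
            s_b = sim[(min(b, c), max(b, c))]
            new_sim[(c, next_id)] = max(s_a, s_b)  # assumed update rule
        clusters[next_id] = merged
        next_id += 1
        sim = new_sim
        # Step 4: the loop repeats steps 2 and 3.
    return list(clusters.values()), merges


# Hypothetical similarities: points 0 and 1 are close, as are points 2 and 3.
S = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
clusters, merges = agglomerative(S, num_clusters=2)
print(clusters)
print(merges)
```

The recorded merge order and similarity levels are the information a dendrogram displays. Calling the function with `num_clusters=1` runs the procedure to a single cluster, as in step 4.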
Agglomerative algorithms are classified according to the inter-cluster similarity measure they use. The most popular of these are single-link, complete-link and group average. In the single-link method, the distance between clusters is the minimum distance between any pair of elements drawn from these clusters; in the complete-link method it is the maximum distance; and in the average-link method it is correspondingly the average distance.

1) Advantages of hierarchical algorithm:
i) Embedded flexibility regarding the level of granularity.
ii) Ease of handling any form of similarity or distance.
iii) Applicability to any attribute type.

2) Disadvantages of hierarchical algorithm:
i) Vagueness of termination criteria.
ii) Most hierarchical algorithms do not revisit clusters once they are constructed in order to improve them.
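The three inter-cluster measures named in this section (single-link, complete-link and group average) can be compared on a toy example. The sketch below assumes points in the plane and Euclidean distance; the function names and the sample clusters are hypothetical.

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance between every element of one cluster and every element of the other."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_link(cluster_a, cluster_b):
    """Minimum distance between any pair of elements drawn from the two clusters."""
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_link(cluster_a, cluster_b):
    """Maximum distance between any pair of elements drawn from the two clusters."""
    return max(pairwise_distances(cluster_a, cluster_b))


def average_link(cluster_a, cluster_b):
    """Average distance over all pairs of elements drawn from the two clusters."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Two hypothetical clusters of 2-D points.
A = [(0, 0), (0, 1)]
B = [(3, 0), (4, 0)]
print(single_link(A, B), average_link(A, B), complete_link(A, B))
```

For any two clusters the single-link distance is never larger than the average-link distance, which in turn is never larger than the complete-link distance, so the choice of measure changes which pair of clusters counts as closest.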
B. Fuzzy-c means algorithm
Bezdek [5] introduced the Fuzzy C-Means (FCM) clustering method in 1981 as an extension of the Hard C-Means clustering method. Fuzzy-C means clustering is used to sort out complex, multi-dimensional data in a dataset whose members are only partially related or have fuzzy relations. FCM is an unsupervised clustering method that is applied to a wide range of problems connected with feature analysis. FCM is used in document clustering, engineering, astronomy and image analysis. The underlying fuzzy theory is based on the work of Ruspini. Fuzzy clustering theory was proposed in the 1980s. The analysis is based on the distances between the input data points, and clusters are formed according to the distance between the data points and the cluster centres.

Fuzzy-C Means Algorithm:
1. Initialize the membership matrix $U = [u_{ij}]$, $U^{(0)}$.
2. At step $k$: calculate the centre vectors $C^{(k)} = [c_j]$ with $U^{(k)}$.
3. Update $U^{(k)}$ to $U^{(k+1)}$.
4. If $\|U^{(k+1)} - U^{(k)}\| < \varepsilon$ then STOP; otherwise return to step 2.

Where:
'n' is the number of data points.
'$c_j$' represents the jth cluster centre.
'm' is the fuzziness index, $m \in [1, \infty)$.
'c' represents the number of cluster centres.
'$u_{ij}$' represents the membership of the ith data point in the jth cluster centre.
'$\|x_i - c_j\|$' is the Euclidean distance between the ith data point and the jth cluster centre.
'$\varepsilon$' is the termination threshold.

Each FCM iteration moves the cluster centres towards the right position within the dataset. In general, Fuzzy C-Means is K-Means clustering with fuzzy logic introduced. FCM is based on a fuzzy method that works on membership weights, which is a natural way of producing clusters.

1) Advantages of fuzzy-c means:
i) Gives the best result for overlapping data sets and is comparatively better than the k-means algorithm.
ii) Unlike k-means, where a data point must belong exclusively to one cluster centre, here a data point is assigned a membership in each cluster centre, so a data point may belong to more than one cluster centre.

2) Disadvantages of fuzzy-c means:
i) A priori specification of the number of clusters.
ii) A lower value of the termination threshold $\varepsilon$ gives a better result, but at the expense of a larger number of iterations.
iii) Euclidean distance measures can unequally weight underlying factors.
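The FCM steps listed in this section can be sketched in plain Python. The text does not spell out the update rules, so the sketch assumes the standard FCM formulas: each centre is the mean of the points weighted by membership raised to the power m, and each membership is `1 / sum_k (d_ij / d_ik) ** (2 / (m - 1))`, which requires m > 1. The random initialisation, the `seed` and `max_iter` parameters and the use of the largest absolute change as the norm in step 4 are also assumptions.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Return (centres, U) for `c` fuzzy clusters; U[i][j] is the membership of point i in cluster j."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Step 1: random membership matrix U(0); every row sums to 1.
    U = []
    for _ in range(n):
        row = [rng.random() + 1e-12 for _ in range(c)]
        total = sum(row)
        U.append([u / total for u in row])
    power = 2.0 / (m - 1.0)  # assumes m > 1
    for _ in range(max_iter):
        # Step 2: centres as membership-weighted means of the points.
        centres = []
        for j in range(c):
            weights = [U[i][j] ** m for i in range(n)]
            total = sum(weights)
            centres.append([sum(w * p[d] for w, p in zip(weights, points)) / total
                            for d in range(dim)])
        # Step 3: update memberships from the distances to the centres.
        new_U = []
        for p in points:
            dists = [math.dist(p, centre) for centre in centres]
            if 0.0 in dists:
                # The point sits exactly on one or more centres: share full membership among them.
                zeros = dists.count(0.0)
                row = [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
            else:
                row = [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]
            new_U.append(row)
        # Step 4: stop when the largest membership change falls below eps.
        change = max(abs(new_U[i][j] - U[i][j]) for i in range(n) for j in range(c))
        U = new_U
        if change < eps:
            break
    return centres, U


# Two hypothetical, well-separated groups of 2-D points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, U = fuzzy_c_means(data, c=2)
for point, row in zip(data, U):
    print(point, [round(u, 3) for u in row])
```

Each row of `U` sums to 1, so a point lying between two groups would receive a noticeable membership in both, which is the behaviour the advantages above describe. Lowering `eps` tightens the stopping test at the cost of more iterations.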
III. CONCLUSIONS AND FUTURE WORK
In this survey, various clustering approaches and algorithms for document clustering are presented. There are many open issues in the area of document clustering. The paper gives interested readers a summary and a broad overview of the existing techniques. As future work, we plan to improve existing systems to obtain better results. These results can be used for tasks such as search.
REFERENCES
[1] A. L. N. Fred and A. K. Jain, Combining multiple clusterings using evidence accumulation, IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 835-850, Jun. 2005.
[2] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, Fuzzy c-means for very large data, ieeexplore.ieee.org, 2012.
[3] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[4] A. Strehl and J. Ghosh, Cluster ensembles: A knowledge reuse framework for combining multiple partitions, J. Mach. Learning Res., vol. 3, pp. 583-617, 2002.
[5] X. Rui, D. Wunsch II, Survey of Clustering Algorithms, IEEE Transactions on Neural Networks, vol. 16, no. 3, 2005.
[6] T. Kanungo and D. M. Mount, An Efficient K-means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, 2002.
[7] V. S. Rao and Dr. S. Vidyavathi, Comparative Investigations and Performance Analysis of FCM and MFPCM Algorithms on Iris data, Indian Journal of Computer Science and Engineering, vol. 1, no. 2, pp. 145-151, 2010.