
A Divide-and-Conquer Approach for Minimum Spanning Tree-Based Clustering

Abstract

Due to their ability to detect clusters with irregular boundaries, minimum spanning tree-based clustering algorithms have been widely used in practice. However, in such clustering algorithms, the search for nearest neighbors during construction of the minimum spanning tree is the main source of computation, and the standard solutions take O(N^2) time. In this paper, we present a fast minimum spanning tree-inspired clustering algorithm which, by using an efficient implementation of the cut and the cycle properties of minimum spanning trees, can perform much better than O(N^2).

Existing System

Some classical algorithms rely either on grouping the data points around some centers or on separating the data points using regular geometric curves such as hyperplanes. As a result, they generally do not work well when the boundaries of the clusters are irregular.

Proposed System

Ample empirical evidence has shown that a minimum spanning tree representation is quite invariant to detailed geometric changes in cluster boundaries. Therefore, the shape of a cluster has little impact on the performance of minimum spanning tree (MST)-based clustering algorithms, which allows us to overcome many of the problems faced by the classical clustering algorithms.

Modules

Form Data set: All the edges that satisfy the inconsistency measure are removed, and the data points in the smallest clusters are regarded as outliers. As a result, the definition of the inconsistent edges and the development of the terminating condition are two major issues that have to be addressed in all MST-based clustering algorithms, even when the number of clusters is given as an input parameter. Due to the invisibility of the MST representation of a data set of dimensionalities

MST-based clustering: The number of clusters is either given as an input parameter or determined by the algorithms themselves. Under the ideal condition, that is, when the clusters are well separated and there are no outliers, the inconsistent edges are simply the longest edges. However, in real-world tasks, outliers often exist, which makes the longest edges an unreliable indication of cluster separations. In these cases, all the edges that satisfy the inconsistency measure are removed, and the data points in the smallest clusters are regarded as outliers. As a result, the definition of the inconsistent edges and the development of the terminating condition are two major issues that have to be addressed in all MST-based clustering algorithms, even when the number of clusters is given as an input parameter.

Weight Assign Each Data Set: When the weight associated with each edge denotes the distance between its two end points, any edge in the minimum spanning tree is the shortest distance between the two subtrees that are connected by that edge. Therefore, removing the longest edge will theoretically result in a two-cluster grouping. Removing the next longest edge will result in a three-cluster grouping, and so on. This corresponds to choosing the breaks where the maximum weights occur in the sorted edges.

Identify the longest edges: The design of a more efficient scheme is motivated by the following observations. First, MST-based clustering algorithms can be more efficient if the longest edges of an MST can be identified quickly, before most of the shorter ones are found.
This is because, for some MST-based clustering problems, if we can find the longest edges in the MST very quickly, there is no need to compute the exact distance values associated with the shorter ones. Second, for other MST-based clustering algorithms, if the longest edges can be found quickly, Prim's algorithm can be applied more efficiently to each individual size-reduced cluster.
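The edge-removal idea described in the modules (build the MST, then cut the longest edges to obtain the clusters) can be illustrated with a minimal Python sketch. It assumes Euclidean distance and a known number of clusters k, and it uses the standard O(N^2) form of Prim's algorithm as a baseline, not the faster scheme proposed in the paper. The function names prim_mst and mst_clusters are hypothetical and chosen only for this illustration; outlier handling is left out.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def prim_mst(points):
    """Baseline O(N^2) Prim's algorithm on the complete distance graph.

    Returns the N-1 MST edges as (weight, i, j) tuples.
    Assumes at least one point.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known distance from the tree to each point
    parent = [-1] * n       # tree endpoint that realises best[v]
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        # Update the distances of the remaining points to the grown tree.
        for v in range(n):
            if not in_tree[v]:
                d = euclidean(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges

def mst_clusters(points, k):
    """Cut the k-1 longest MST edges and label the connected components."""
    edges = sorted(prim_mst(points))        # ascending by weight
    kept = edges[:len(edges) - (k - 1)]     # drop the k-1 longest edges
    root = list(range(len(points)))         # union-find forest

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]         # path halving
            x = root[x]
        return x

    for _, i, j in kept:
        root[find(i)] = find(j)
    labels = {}
    # Number the components in order of first appearance.
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(mst_clusters(pts, 2))  # [0, 0, 0, 1, 1, 1]
```

In this sketch every pairwise distance is computed, which is exactly the quadratic cost that the observations above aim to avoid by locating the longest edges first.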

Conclusion

As a graph partition technique, MST-based clustering algorithms are of growing importance for detecting clusters with irregular boundaries. A central problem in such clustering algorithms is the classic quadratic time complexity of constructing an MST. In this paper, we present a more efficient method that can quickly identify the longest edges in an MST so as to save some computations. Our contribution is the design of a new MST-inspired clustering algorithm for large data sets (without any specific requirements on the distance measure used) that utilizes a divisive hierarchical clustering algorithm (DHCA) in an efficient implementation of the cut and the cycle properties.

Software Requirements

Windows XP
J2SDK1.6
Eclipse 3.2

Hardware Requirements

Intel Pentium III Processor and above
128MB RAM and above
10GB Hard Disk and above
