
Knowledge Mining Using Classification Through Clustering

Introduction

Knowledge discovery and data mining are rapidly evolving areas of research at the intersection of several disciplines, including statistics, databases, AI, visualization, and high-performance and parallel computing [1]. Data mining is the core part of the Knowledge Discovery in Databases (KDD) process. The KDD process consists of data selection, data transformation, pattern searching (data mining), and pattern evaluation. Data mining can be defined as the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories [2]. Thus data mining is the extraction of implicit, previously unknown, and potentially useful information from the vast amount of data available in such repositories. The tasks in data mining are distinct because many kinds of patterns exist in large databases, and the techniques can be integrated or combined to deal with complicated problems residing in them. Based on the kind of pattern sought, data mining tasks are classified into summarization, classification, clustering, association, and trend analysis [3]. Clustering and classification are two of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. Clustering is the process of organizing unlabeled objects into groups whose members are similar in some way. This paper presents the results of first applying a clustering technique to a data set and then applying a classification algorithm, which provides better results than applying a single classification algorithm alone to the same data set. Experiments were also conducted to compare classification performance using the proposed model.

Clustering

Clustering is a kind of unsupervised learning: it does not use category labels when grouping objects. In semi-supervised clustering, some prior knowledge is available, either in the form of labeled data or of pair-wise constraints on some of the objects. Clustering methods include partitioning, density-based, hierarchical, model-based, and grid-based approaches. Partitioning-based clustering typically uses the k-means algorithm with distance as the similarity function; k-means assigns objects to clusters based on the mean value of the objects in each cluster. Density-based methods use density as the similarity function and apply to arbitrarily shaped clusters. Among the density-based algorithms, such as OPTICS and DENCLUE, DBSCAN is the most widely used. In DBSCAN, regions with a high density of objects form clusters, while low-density points are treated as noise or outliers.

DBSCAN: The key idea behind density-based clustering is that, for each object of a cluster, the neighborhood of a given radius (eps) has to contain at least a minimum number of objects (MinPts), i.e. the cardinality of the neighborhood has to exceed some threshold. DBSCAN therefore requires two parameters: the radius eps and the minimum number of points required to form a cluster, MinPts. The algorithm starts with an arbitrary point that has not been visited. That point's neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise, the point is labeled as noise. The shape of a neighborhood is determined by the choice of a distance function for two points p and q, denoted dist(p, q). To find clusters, DBSCAN starts from an arbitrary point and retrieves all points density-reachable from it.

The parameters used by DBSCAN, illustrated in Figure 1, are described below:

eps: maximum radius of the neighborhood.

MinPts: minimum number of points required in the eps-neighborhood of a point.

Core point: an object with at least MinPts objects within its eps-neighborhood.

Border point: an object that lies on the border of a cluster, i.e. within the eps-neighborhood of a core point without being a core point itself.

Noise: a point that is neither a core point nor a border point.

Figure 1: DBSCAN.
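To make the procedure concrete, Listing 1 gives a minimal, self-contained DBSCAN sketch in plain Java over 2-D points with Euclidean distance, following the eps/MinPts definitions above. It is illustrative only, not WEKA's implementation; all names are ours.

Listing 1: A plain-Java DBSCAN sketch.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DbscanSketch {

    // Labels: 0 = unvisited, -1 = noise, >0 = cluster id.
    static int[] dbscan(double[][] pts, double eps, int minPts) {
        int[] label = new int[pts.length];
        int cluster = 0;
        for (int p = 0; p < pts.length; p++) {
            if (label[p] != 0) continue;                    // already processed
            List<Integer> seeds = regionQuery(pts, p, eps);
            if (seeds.size() < minPts) { label[p] = -1; continue; }  // noise, for now
            label[p] = ++cluster;                           // p is a core point
            // Expand the cluster through every density-reachable point.
            for (int i = 0; i < seeds.size(); i++) {
                int q = seeds.get(i);
                if (label[q] == -1) label[q] = cluster;     // noise becomes a border point
                if (label[q] != 0) continue;
                label[q] = cluster;
                List<Integer> qNbrs = regionQuery(pts, q, eps);
                if (qNbrs.size() >= minPts) seeds.addAll(qNbrs);  // q is a core point too
            }
        }
        return label;
    }

    // Eps-neighborhood of point p under Euclidean distance (includes p itself).
    static List<Integer> regionQuery(double[][] pts, int p, double eps) {
        List<Integer> nbrs = new ArrayList<>();
        for (int q = 0; q < pts.length; q++) {
            double dx = pts[p][0] - pts[q][0], dy = pts[p][1] - pts[q][1];
            if (Math.sqrt(dx * dx + dy * dy) <= eps) nbrs.add(q);
        }
        return nbrs;
    }

    public static void main(String[] args) {
        double[][] pts = { {1, 1}, {1.2, 1.1}, {0.9, 1}, {8, 8}, {8.1, 8}, {25, 25} };
        // Prints [1, 1, 1, -1, -1, -1]: one dense cluster and three noise points.
        System.out.println(Arrays.toString(dbscan(pts, 0.5, 3)));
    }
}

Note that a point first labeled as noise can later be relabeled as a border point when an expanding cluster reaches it, but it never becomes a core point, since its neighborhood was already found to contain fewer than MinPts objects.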

This clustering method is used to preprocess the data set before classification is applied in the proposed method.

Classification

Classification is a kind of supervised learning: it is a procedure for assigning class labels. A classifier is constructed from labeled training data using a classification algorithm and is then used to predict the class labels of test data. The commonly used methods for data mining classification tasks include decision trees, Bayesian methods, Bayesian networks, rule-based algorithms, neural networks, support vector machines, association rule mining, k-nearest neighbor, case-based reasoning, genetic algorithms, rough sets, and fuzzy logic. In this work we use decision tree induction. The decision tree algorithm used here was given by Ross Quinlan in 1993. It is a data mining induction technique that recursively partitions the records of a data set using a depth-first greedy approach (Hunt et al., 1966).

A decision tree structure is made of root, internal, and leaf nodes. Each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node assigns a classification; the tree structure is then used to classify unknown records. At each internal node, the best split is chosen using an attribute selection measure (Quinlan, 1993), and the leaves carry the class labels into which the data items are grouped. Decision tree classification is performed in two phases: tree building and tree pruning. Tree building is done top-down; during this phase the tree is recursively partitioned until all the data items in a node belong to the same class (Hunt et al., 1966). This phase is computationally intensive, as the training data set is traversed repeatedly. Tree pruning is done bottom-up; it is used to improve the prediction and classification accuracy of the algorithm by minimizing overfitting (Mehta et al., 1996).
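The growing phase above can be summarized by a single recursive step. Listing 2 is a deliberately simplified sketch of Hunt's algorithm for nominal attributes, with no pruning and with trivial attribute selection (the gain-ratio criterion C4.5 actually uses is shown separately in Listing 3); the names are ours, and the class label is assumed to be the last column of each record.

Listing 2: A simplified recursive tree-growing sketch.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TreeSketch {

    static class Node {
        int splitAttr = -1;                        // -1 marks a leaf
        String label;                              // majority class at this node
        Map<String, Node> children = new HashMap<>();
    }

    // Grow a tree over nominal attributes; the class is the last column.
    static Node build(List<String[]> rows, Set<Integer> attrs) {
        Node node = new Node();
        node.label = majorityClass(rows);
        if (attrs.isEmpty() || isPure(rows)) return node;   // stop: leaf

        // For brevity pick any remaining attribute; C4.5 would pick the
        // attribute with the maximum gain ratio (see Listing 3).
        int best = attrs.iterator().next();
        node.splitAttr = best;

        Map<String, List<String[]>> parts = new HashMap<>();
        for (String[] r : rows)
            parts.computeIfAbsent(r[best], k -> new ArrayList<>()).add(r);

        Set<Integer> rest = new HashSet<>(attrs);
        rest.remove(best);                         // one branch per attribute value
        for (Map.Entry<String, List<String[]>> e : parts.entrySet())
            node.children.put(e.getKey(), build(e.getValue(), rest));
        return node;
    }

    static boolean isPure(List<String[]> rows) {
        String c = rows.get(0)[rows.get(0).length - 1];
        for (String[] r : rows)
            if (!r[r.length - 1].equals(c)) return false;
        return true;
    }

    static String majorityClass(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] r : rows)
            counts.merge(r[r.length - 1], 1, Integer::sum);
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"sunny", "yes"},
                new String[]{"sunny", "yes"},
                new String[]{"rainy", "no"});
        Node root = build(rows, new HashSet<>(Arrays.asList(0)));
        System.out.println(root.children.get("rainy").label);   // prints "no"
    }
}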

There are various top-down decision tree inducers, such as ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), and CART (Breiman et al., 1984). Some consist of two conceptual phases, growing and pruning (C4.5 and CART); other inducers perform only the growing phase. J48 is an open-source Java implementation of the C4.5 algorithm in the WEKA data mining tool. C4.5 is an evolution of ID3, presented by the same author (Quinlan, 1993), and uses gain ratio as the splitting criterion to partition the data set. The algorithm applies a kind of normalization to information gain using a split information value. Split information for an attribute A with v values is defined as in (1) [9], [10]:

SplitInfo_A(D) = - Σ_{i=1}^{v} (|Di| / |D|) · log2(|Di| / |D|)    ------(1)

where |Di| is the number of instances in the training set D having the i-th value of attribute A, and |D| is the total number of instances in the training set. Gain ratio is defined as in (2), and the attribute with the maximum gain ratio is selected as the splitting attribute [10]:

GainRatio(A) = Gain(A) / SplitInfo_A(D)    ------(2)
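Listing 3 is a direct Java transcription of Eq. (1) and (2); counts[i] holds |Di|. Computing the information gain Gain(A) itself (the entropy of D minus the weighted entropy of the partitions) is omitted for brevity, and the helper names are ours, not WEKA's.

Listing 3: Split information and gain ratio.

public class GainRatioSketch {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Eq. (1): counts[i] holds |Di|, the number of instances taking the
    // i-th value of attribute A; the sum of counts is |D|.
    static double splitInfo(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double s = 0.0;
        for (int c : counts)
            if (c > 0) {
                double p = (double) c / total;
                s -= p * log2(p);
            }
        return s;
    }

    // Eq. (2): gain ratio normalizes the information gain of A by its
    // split information (computing Gain(A) itself is omitted here).
    static double gainRatio(double infoGain, int[] counts) {
        double si = splitInfo(counts);
        return si == 0.0 ? 0.0 : infoGain / si;
    }

    public static void main(String[] args) {
        int[] counts = {4, 6, 4};   // illustrative partition of a set with |D| = 14
        System.out.printf("SplitInfo = %.4f%n", splitInfo(counts));        // 1.5567
        System.out.printf("GainRatio = %.4f%n", gainRatio(0.246, counts)); // 0.1580
    }
}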

Figure 2: Decision tree, with a root node, internal nodes, and class-label leaves.

The J48 decision tree classifier follows a simple algorithm. To classify a new item, it first creates a decision tree based on the attribute values of the available training data: whenever it encounters a set of training items, it identifies the attribute that discriminates the instances most clearly. The feature that tells the most about the data instances, so that they can be classified best, is said to have the highest information gain. If there is any value of this feature for which there is no ambiguity, i.e. for which all data instances falling in its category have the same value of the target variable, that branch is terminated and assigned the target value so obtained [11].

Graphical Representation

Approach 1: DATA SET → PREPROCESS → CLASSIFICATION → ANALYSIS → RESULT

Approach 2: DATA SET → PREPROCESS → CLUSTERING → CLUSTERED DATA SET → CLASSIFICATION → ANALYSIS → RESULT

Result Analysis

In this method we first classify the data set by applying the classification algorithm directly and record the result. The data set is then analyzed further: clustering is applied to the data set, the resulting clustered data set is saved, the classification technique is applied to this clustered data set, and the result is obtained. A comparison is then made between the two results to find out which approach gives the better result, i.e. which is more effective and less prone to error.

Tool

The WEKA toolkit is a data mining system developed by the University of Waikato in New Zealand that implements data mining algorithms in the Java language. It is free software available under the GNU General Public License. WEKA supports several standard data mining tasks, more specifically data preprocessing, clustering, classification, regression, visualization, and feature selection. WEKA's main user interface is the Explorer, but essentially the same functionality can be accessed through the component-based Knowledge Flow interface and from the command line. There is also the Experimenter, which allows systematic comparison of the predictive performance of WEKA's machine learning algorithms on a collection of data sets. Data used in WEKA is in the Attribute-Relation File Format (ARFF), which consists of special tags to indicate the different parts of a data set, such as attribute names, attribute types, attribute values, and the data itself. The data set used to obtain the results is the bmw data set, which has 3000 instances and 4 attributes.
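For illustration, a minimal ARFF file has the shape shown below. The paper does not list the bmw data set's actual attribute names or types, so the ones here are hypothetical; only the 4-attribute, two-class shape is taken from the text.

% Hypothetical bmw-style ARFF file; the real attribute names are not given in the paper.
@relation bmw

@attribute IncomeBracket {0,1,2,3,4,5,6,7}
@attribute FirstPurchase numeric
@attribute LastPurchase numeric
@attribute responded {1,0}

@data
4,200210,200601,0
5,200301,200601,1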

Results

The following parameters are used to analyze which method gives the best classification results on the bmw data set with the lowest error rate. First, the decision tree classifier J48 is applied directly to the bmw data set with 10-fold cross-validation:

Correctly classified instances: 54.733 %
Incorrectly classified instances: 45.266 %
Kappa statistic: 0.0933
Mean absolute error: 0.49
Root mean squared error: 0.5038
Relative absolute error: 98.0236 %
Root relative squared error: 100.7747 %
Coverage of cases (0.95 level): 99.7 %
Mean rel. region size (0.95 level): 99.4 %
Total number of instances: 3000

Its confusion matrix is as follows:

   a    b   <-- classified as
 903  736 |  a
 622  739 |  b
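This first experiment can also be reproduced programmatically through the WEKA Java API, as sketched in Listing 4; the file name bmw.arff is an assumption about where the data set is stored locally, and the last attribute is assumed to be the class.

Listing 4: J48 with 10-fold cross-validation via the WEKA Java API.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DirectClassification {
    public static void main(String[] args) throws Exception {
        // Load the data set; "bmw.arff" is an assumed local file name.
        Instances data = new DataSource("bmw.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // class = last attribute

        J48 tree = new J48();                           // WEKA's C4.5 implementation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());     // accuracy, kappa, error rates
        System.out.println(eval.toMatrixString());      // confusion matrix
    }
}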

In the second case, clustering is first applied to the data set, the clustered data set is saved, and classification is then applied to it. This gives a more enhanced and correct result:

Correctly classified instances: 100 %
Incorrectly classified instances: 0
Kappa statistic: 1
Mean absolute error: 0
Root mean squared error: 0
Relative absolute error: 0 %
Root relative squared error: 0 %
Coverage of cases (0.95 level): 100 %
Mean rel. region size (0.95 level): 20 %
Total number of instances: 3000

Its confusion matrix is as follows:

    a    b    c    d    e   <-- classified as
  103    0    0    0    0 |  a = cluster0
    0  567    0    0    0 |  b = cluster1
    0    0  559    0    0 |  c = cluster2
    0    0    0 1702    0 |  d = cluster3
    0    0    0    0   69 |  e = cluster4
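The second experiment corresponds to the sketch in Listing 5. WEKA's AddCluster filter appends each instance's cluster id as a new attribute, playing the role of the saved clustered data set, and that attribute is then used as the class for J48. Because the paper's density-based clusterer is not part of core WEKA, SimpleKMeans with k = 5 (matching the five clusters above) is substituted here; the near-perfect accuracy arises because the tree only has to recover the clusterer's own decision boundaries.

Listing 5: Cluster-then-classify via the WEKA Java API.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class ClusterThenClassify {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("bmw.arff").getDataSet();

        // Stand-in clusterer (the paper uses a density-based method).
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(5);

        // AddCluster appends each instance's cluster id as a new attribute,
        // producing the "clustered data set" that is then classified.
        AddCluster add = new AddCluster();
        add.setClusterer(km);
        add.setInputFormat(data);
        Instances clustered = Filter.useFilter(data, add);

        // Use the appended cluster attribute as the class for J48.
        clustered.setClassIndex(clustered.numAttributes() - 1);
        Evaluation eval = new Evaluation(clustered);
        eval.crossValidateModel(new J48(), clustered, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}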

These results show that applying clustering to the data set before classification gives more refined classification results that are less prone to error; in other words, clustering enhances the classification method.

Conclusion and future work

Comparing the above results, we conclude that clustering enhances the classification technique and gives results that are less prone to error.

References

[1] Xindong Wu, Data mining: artificial intelligence in data analysis. Proceedings of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology, 2004, p. 7.
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2001.
[3] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1998.
[4] Sanjay Chakraborty, N. K. Nagwani, Analysis and Study of Incremental DBSCAN Clustering Algorithm, International Journal of Enterprise Computing and Business Systems.
[5] Introduction to Data Mining and Knowledge Discovery, third edition, Two Crows Corporation.
[6] Hunt et al., 1966.
[7] J. Ross Quinlan.
[8] Mehta et al., 1996.
[9] R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant, The Quest Data Mining System, in Proc. 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, 1996, pp. 244-249.
[10] Richard W. Selby, Adam A. Porter, Learning from Examples: Generation and Evaluation of Decision Trees for Software Resource Analysis, IEEE Transactions on Software Engineering, Vol. 14, No. 12, 1988, pp. 1743-1757.
[11] http://www.d.umn.edu/~padhy005/Chapter5.html
[12] Breiman et al., 1984.
