
International Journal of Scientific Research Engineering & Technology (IJSRET)

Volume 2 Issue 7 pp 420-426 October 2013 www.ijsret.org ISSN 2278 0882

Temporal Data Clustering Using a Weighted Aggregate Function


Sunitha Azmeera 1, P. Sreelatha 2
1 M.Tech Student, Department of Computer Science, CVSR College of Engineering, A.P., India
2 Assistant Professor, Department of Computer Science and Engineering, Anurag Group of Institutions, A.P., India

Abstract- We propose a novel weighted consensus function, guided by clustering validation criteria, that reconciles initial partitions into candidate consensus partitions from different perspectives, and then introduce an agreement function to further reconcile those candidate consensus partitions into a final partition. As a result, the proposed weighted clustering ensemble algorithm provides an effective enabling technique for the joint use of different representations, which reduces the information loss incurred by any single representation and exploits the various information sources underlying temporal data. In addition, our approach tends to capture the intrinsic structure of a data set, e.g., the number of clusters. Our approach has been evaluated on benchmark time-series, motion-trajectory, and time-series data stream clustering tasks.

Index Terms - Temporal data clustering, clustering ensemble, cluster

1. INTRODUCTION

Temporal data mining is an important extension of data mining: the non-trivial extraction of implicit, potentially useful, and previously unrecorded information with an implicit or explicit temporal content from large databases. Basically, temporal data mining is concerned with the analysis of temporal data and with finding temporal patterns and regularities in sets of temporal data. Temporal data mining techniques also allow for computer-driven, automatic exploration of the data. Temporal data mining has led to a new way of interacting with a temporal database: specifying queries at a much more abstract level than, say, Temporal Structured Query Language (TSQL) permits. It also facilitates data exploration for problems that, due to their multiplicity and multidimensionality [1], would otherwise be very difficult for humans to explore, regardless of the use of, or efficiency issues with, TSQL.

The contributions of this paper are summarized as follows. First, we develop a practical temporal data clustering model [2] based on different representations via clustering ensemble learning, to overcome the fundamental weakness of representation-based temporal data clustering analysis. Next, we propose a novel weighted clustering ensemble algorithm, which not only provides an enabling technique to support our model but also can be used to combine any input partitions. A formal analysis has also been carried out. Finally, we demonstrate the effectiveness and efficiency of our model on a variety of temporal data clustering tasks, as well as its easy-to-use nature, since all internal parameters are fixed in our simulations.

In temporal data analysis, many temporal data mining applications make use of clustering according to similarity and optimization of temporal set functions. If the number of clusters is given, then clustering techniques can be divided into three classes: (1) metric-distance-based techniques, (2) model-based techniques, and (3) partition-based techniques. These techniques can occasionally be used in combination, such as probability-based vs. distance-based clustering analysis. If the number of clusters is not given, then non-hierarchical clustering algorithms can be used to find it. Although clustering algorithms have been intensively developed over the last decades, due to the natural complexity of temporal data we still face many challenges in temporal data clustering tasks. How to select an intrinsic number of clusters is still a critical model selection problem [5] existing in many clustering algorithms.
In a statistical framework, model selection is the task of selecting a mixture of the appropriate number of mathematical models with the appropriate parameter setup that fits the target dataset by optimizing some criterion.
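As a concrete (and purely illustrative) instance of this kind of criterion-driven model selection, the sketch below chooses the number of mixture components for a Gaussian mixture by minimizing the Bayesian information criterion. The data, the Gaussian mixture family, and the range of candidate k are our own assumptions, not part of the algorithm described in this paper.

```python
# Illustrative sketch: selecting the number of clusters for a mixture model
# by optimizing an information criterion (here BIC). Assumes scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs standing in for fixed-length feature
# vectors extracted from temporal data.
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(6.0, 1.0, (100, 2))])

bic_scores = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_scores[k] = gmm.bic(X)          # lower BIC = better fit/complexity trade-off

best_k = min(bic_scores, key=bic_scores.get)
print("Selected number of clusters:", best_k)
```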

2. EXISTING WORK

In general, there are two core problems in clustering analysis, i.e., model selection and grouping. The former seeks a solution that uncovers the number of intrinsic clusters underlying a temporal data set, while the latter demands a proper grouping rule that groups coherent sequences together to form a cluster matching an underlying distribution. Clustering analysis is an extremely difficult unsupervised learning task [4]. It is inherently an ill-posed problem and its solution often violates some common assumptions. Based on a temporal data representation of fixed yet lower dimensionality, any existing clustering algorithm is applicable to temporal data clustering, which makes it efficient in computation. Various temporal data representations have been proposed [8] from different perspectives. To our knowledge, there is no universal representation that perfectly characterizes all kinds of temporal data; a single representation tends to encode only those features well presented in its own representation space and inevitably incurs a loss of useful information. Furthermore, it is difficult to select a representation that presents a given temporal data set properly without prior knowledge and careful analysis.

3. PROPOSED APPROACH

First, we develop a practical temporal data clustering model based on different representations via clustering ensemble learning, to overcome the fundamental weakness of representation-based temporal data clustering analysis. We propose a novel weighted clustering ensemble algorithm, which not only provides an enabling technique to support our model but also can be used to combine any input partitions. A formal analysis has also been carried out. In Fig. 1, four representations of the time series present themselves with various distributions in their principal component analysis (PCA) representation subspaces [6]. For instance, both classes marked with triangle and star are easily separated from the other two overlapping classes in Fig. 1a, as are the classes marked with circle and dot in Fig. 1b. Similarly, different yet useful structural information can also be observed in the plots of Figs. 1c and 1d. Intuitively, our observation suggests that a single representation captures only partial structural information, and the joint use of different representations is more likely to capture the intrinsic structure of a given temporal data set. When a clustering algorithm is applied to different representations, diverse partitions are generated. To exploit all information sources, we need to reconcile these diverse partitions to find a consensus partition superior to any of the input partitions.

Fig. 1: Distributions of the Time-Series Data Set in Various PCA Representation Manifolds Formed by the First Two Principal Components of Their Representations. (a). PLS, (b). PDWT, (c). PCF, (d). DFT
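A minimal sketch of how Fig. 1-style inspection could be reproduced: project each representation of the same time-series set onto its first two principal components and compare the resulting 2-D distributions. The function and variable names (and the assumption that the per-representation feature matrices are already computed) are ours, not the paper's.

```python
# Assumes scikit-learn; `representations` maps a representation name
# (e.g., "PLS", "PDWT", "PCF", "DFT") to an (n_series, dim) feature matrix.
import numpy as np
from sklearn.decomposition import PCA

def project_2d(features):
    """Return the first two principal components of an (n_series, dim) matrix."""
    return PCA(n_components=2).fit_transform(np.asarray(features, dtype=float))

def inspect_representations(representations):
    """Project every representation into its own 2-D PCA subspace."""
    return {name: project_2d(feats) for name, feats in representations.items()}
```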

4. RELATED WORK

Representing spatio-temporal data in a concise manner can be done by converting it into a trajectory form. In Hwang et al. [3], a trajectory is a function that maps time to locations. To represent object movement, a trajectory is decomposed into a set of linear functions, one for each disjoint time interval. The derivative of each linear function yields the direction and the speed in the associated time interval. A trajectory is a disjunction of all its linear pieces. For example, a trajectory of a user moving in a 2-D space may consist of two such linear pieces.

Our goal is to develop, through the HMM models derived from available data, an accurate and explainable representation of system dynamics in a given domain. It is important for our clustering system to determine the best partition, i.e., the number of clusters in the data, and the best model structure, i.e., the number of states in a model. The first step allows us to break up the data into homogeneous groups, and the second step provides an accurate state-based representation of the dynamic phenomena corresponding to each group. This approach accomplishes these tasks by [11]: (i) developing an explicit HMM model size selection procedure that dynamically modifies the size of the HMMs during the clustering process, and (ii) casting the HMM model size selection and partition selection problems in the Bayesian model selection framework.

5. WEIGHTED CLUSTERING ENSEMBLE

In this section, we first present the weighted consensus function based on clustering validation criteria, and then describe the agreement function. Finally, we analyze our algorithm under the mean partition assumption made for a formal clustering ensemble analysis.

5.1 Weighted Consensus Function

We present a new approach, weighted consensus clustering, to identify the clusters in protein-protein interaction (PPI) networks, where each cluster corresponds to a group of functionally similar proteins. In weighted consensus clustering, different input clustering results are weighted differently, i.e., a weight is introduced for each input clustering, and the weights are automatically determined by an optimization process. We evaluate our proposed method with standard measures [9] such as modularity, normalized mutual information (NMI), and the Gene Ontology (GO) consortium database, and compare the performance of our approach with other consensus clustering methods. Experimental results demonstrate the effectiveness of our proposed approach.

5.1.1 Weighted Similarity Matrix

We consider weighted estimation of correlation and covariance matrices [10] from historical financial data. To this end, we introduce a weighting scheme that accounts for the similarity of previous market conditions to the present one. The resulting estimators are less biased and show lower variance than either unweighted or exponentially weighted estimators. The weighting scheme is based on a similarity measure which compares the current correlation structure of the market to the structures at past times. Similarity is then measured by the matrix 2-norm of the difference of probe correlation matrices estimated for two different times. The method is validated in a simulation study and tested empirically in the context of mean-variance portfolio optimization. Now, we use the matrix H^m to derive an N x N binary similarity matrix S^m that encodes the pairwise similarity between any two objects in a partition. For each partition P^m, its similarity matrix S^m in {0,1}^{N x N} is constructed by setting S^m(i, j) = 1 if objects i and j belong to the same cluster of P^m, and S^m(i, j) = 0 otherwise.
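The following is a minimal sketch (our own illustration, not the authors' code) of how such a binary similarity matrix S^m can be derived from a label vector, and how several of them might be combined once the validation-based weights have been computed elsewhere.

```python
import numpy as np

def binary_similarity(labels):
    """S^m[i, j] = 1 if objects i and j share a cluster in partition P^m, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def weighted_similarity(partitions, weights):
    """Weighted combination of the per-partition similarity matrices.

    `partitions` is a list of label vectors; `weights` are the
    clustering-validation weights assumed to be computed beforehand.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize so the weights sum to 1
    return sum(w * binary_similarity(p) for w, p in zip(weights, partitions))
```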

5.2 Agreement Function

In order to obtain a final partition, we develop an agreement function by means of the evidence accumulation idea again. A pairwise similarity matrix S is constructed from the three candidate consensus partitions. A binary membership indicator matrix H^* is constructed from each candidate consensus partition P^*, where * in {MH, DVI, NMI}. Then, concatenating the three H matrices leads to an adjacency matrix relating all the data in a given data set to the candidate consensus partitions, H = [H^MH | H^DVI | H^NMI]. Thus, the pairwise similarity matrix S is derived from H.
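Below is a hedged sketch of this agreement step: the indicator matrices of the candidate consensus partitions are concatenated, pairwise similarity is accumulated as the fraction of partitions in which two objects agree, and a final partition is cut from the resulting dendrogram. The use of average-linkage clustering for the final cut is our own assumption for illustration, not necessarily the authors' choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def indicator_matrix(labels):
    """Binary membership indicator H: one column per cluster of the partition."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    return (labels[:, None] == clusters[None, :]).astype(float)

def agreement_partition(candidate_partitions, n_clusters):
    """Reconcile candidate consensus partitions (e.g., the MH-, DVI-, and
    NMI-weighted ones) into a final partition by evidence accumulation."""
    H = np.hstack([indicator_matrix(p) for p in candidate_partitions])
    S = H @ H.T / len(candidate_partitions)      # fraction of partitions agreeing
    D = 1.0 - S                                  # turn similarity into a distance
    condensed = D[np.triu_indices_from(D, k=1)]  # condensed form expected by linkage
    Z = linkage(condensed, method='average')     # assumed linkage for the final cut
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```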

6. SIMULATION

We apply our approach to a collection of time-series benchmarks for temporal data mining, the CAVIAR visual tracking database [12], and the PDMC time-series data stream data set. We first present the temporal data representations used for the time-series benchmarks and the CAVIAR database.

6.1 Temporal Data Representations

It is known that different representations encode different facets of the structural information of temporal data in their representation spaces. For illustration, we perform principal component analysis (PCA) on four typical representations. Temporal data representations are generally classified into two categories: piecewise and global representations. A piecewise representation is generated by partitioning the temporal data into segments at critical points based on a criterion, and then each segment is modeled by a concise representation. All segment representations, in order, collectively form a piecewise representation, e.g., adaptive piecewise constant approximation [8] and curvature-based PCA segments [9]. In contrast, a global representation is derived from modeling the temporal data via a set of basis functions, and therefore the coefficients of the basis functions constitute a holistic representation, e.g., polynomial curve fitting [10], [11], discrete Fourier transforms, and discrete wavelet transforms [12]. In general, the temporal data representations used in this module should be of a complementary nature, and hence we recommend the use of both piecewise and global temporal data representations together. In the representation extraction module, different representations are extracted by transforming the raw temporal data into feature vectors of fixed dimensionality for the initial clustering analysis.
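A brief sketch of the two representation families just described, under our own simplifying assumptions: piecewise aggregate means as a simple piecewise representation, and low-order Fourier coefficients or a polynomial fit as global representations. The segment counts, coefficient counts, and polynomial degree are illustrative defaults, not values from the paper.

```python
import numpy as np

def piecewise_means(series, n_segments=8):
    """Mean of each of n_segments equal-length pieces (a piecewise representation)."""
    segments = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([seg.mean() for seg in segments])

def dft_coefficients(series, n_coeffs=8):
    """Magnitudes of the first n_coeffs discrete Fourier coefficients (a global representation)."""
    return np.abs(np.fft.rfft(series))[:n_coeffs]

def polynomial_fit(series, degree=4):
    """Coefficients of a least-squares polynomial fit (a global representation)."""
    t = np.linspace(0.0, 1.0, len(series))
    return np.polyfit(t, series, degree)
```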

6.2 Time-Series Benchmarks

Most data mining algorithms require the setting of many input parameters. There are two main dangers of working with parameter-laden algorithms. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. In order to perform convincing experiments, we wanted to test our algorithm against all reasonable alternatives [11]. However, lack of space prevents us from referencing, much less explaining, them. So, we re-implemented every time series distance/dissimilarity/similarity measure that has appeared in the last decade in any of the following conferences: SIGKDD, SIGMOD, ICDM, ICDE, VLDB, ICML, SSDB, PKDD, and PAKDD.


In total, we implemented fifty-one such measures, including the ten mentioned in and the eight variations mentioned in [13]. For fairness, we should note that many of these measures were designed to deal with short time series and made no claim about their ability to handle longer time series. In addition to the above, we considered the classic Euclidean distance, Dynamic Time Warping (DTW), the L1 metric, the L-infinity metric, and the Longest Common Subsequence (LCSS), all of which are more than a decade old. Some of these (Euclidean and the other Lp metrics) are parameter-free. For measures that require a single parameter, we did an exhaustive search for the best parameter. For measures requiring more than one parameter (one method required seven!), we spent one hour of CPU time searching for the best parameters using a genetic algorithm and independently spent one hour searching manually for the best parameters. We then considered only the better of the two. Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms [11].
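For reference, a deliberately simple, textbook dynamic-programming DTW is sketched below; it illustrates the measure itself and is not the optimized search procedure that the cited work refers to.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return np.sqrt(cost[n, m])

# Example: a small warped offset yields a small DTW distance.
print(dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 3]))
```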

Fig. 2 Temporal data clustering with different representations.

The difficulty of scaling search to large datasets largely explains why most academic work on time series data mining has plateaued at considering a few millions of time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine truly massive time series for the first time. We demonstrate the following extremely unintuitive fact: in large datasets we can exactly search under DTW much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published.

6.3 Motion Trajectory

In order to explore a potential application, we apply our approach to the CAVIAR database [12] for trajectory clustering analysis. The CAVIAR database was originally designed for video content analysis; it contains manually annotated video sequences of pedestrians, resulting in 222 motion trajectories. Motion trajectories tend to have various lengths, and therefore a normalization technique needs to be used to facilitate the representation extraction. Thus, each motion trajectory is resampled to a prespecified number of sample points, i.e., a length of 1,500 in our simulations, by a polynomial interpolation algorithm. After resampling, all trajectories are normalized to a Gaussian distribution of zero mean and unit variance in the x- and y-directions. In a greatly simplified model, an object moves only under the influence of a uniform gravitational force field; this can be a good approximation for a rock that is thrown over short distances, for example, at the surface of the moon. In this simple approximation, the trajectory takes the shape of a parabola. Generally, when determining trajectories, it may be necessary to account for nonuniform gravitational forces and air resistance (drag and aerodynamics); this is the focus of the discipline of ballistics.
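The preprocessing described above can be sketched as follows. For brevity this sketch uses linear interpolation where the paper specifies a polynomial interpolation algorithm, and the function names are ours.

```python
import numpy as np

def resample_trajectory(traj, n_points=1500):
    """Resample a (length, 2) trajectory onto n_points equally spaced samples."""
    traj = np.asarray(traj, dtype=float)
    t_old = np.linspace(0.0, 1.0, len(traj))
    t_new = np.linspace(0.0, 1.0, n_points)
    return np.column_stack([np.interp(t_new, t_old, traj[:, k]) for k in range(2)])

def z_normalize(traj):
    """Normalize to zero mean and unit variance in the x- and y-directions."""
    return (traj - traj.mean(axis=0)) / traj.std(axis=0)
```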

Fig. 3: All motion trajectories in the CAVIAR database.

7. CONCLUSION

Temporal clustering analysis provides an effective way to discover the intrinsic structure of, and condense information over, temporal data by exploring the dynamic regularities underlying temporal data in an unsupervised learning manner. First, we develop a practical temporal data clustering model based on different representations via clustering ensemble learning to overcome the fundamental weakness of representation-based temporal data clustering analysis. Next, we propose a novel weighted clustering ensemble algorithm, which not only provides an enabling technique to support our model but also can be used to combine any input partitions. A formal analysis has also been carried out. Finally, we demonstrate the effectiveness and efficiency of our model on a variety of temporal data clustering tasks.

REFERENCES

[1] J. Kleinberg, "An Impossibility Theorem for Clustering," Advances in Neural Information Processing Systems, Vol. 15, 2002.
[2] E. Keogh and S. Kasetty, "On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Study," Knowledge and Data Discovery, Vol. 6, pp. 102-111, 2002.

[3] R. Xu and D. Wunsch II, "Survey of Clustering Algorithms," IEEE Trans. Neural Networks, Vol. 16, No. 3, pp. 645-678, May 2005.
[4] P. Smyth, "Probabilistic Model-Based Clustering of Multivariate and Sequential Data," Proc. Int'l Workshop Artificial Intelligence and Statistics, pp. 299-304, 1999.
[5] K. Murphy, "Dynamic Bayesian Networks: Representation, Inference and Learning," Ph.D. thesis, Dept. of Computer Science, Univ. of California, Berkeley, 2002.
[6] Y. Xiong and D. Yeung, "Mixtures of ARMA Models for Model-Based Time Series Clustering," Proc. IEEE Int'l Conf. Data Mining, pp. 717-720, 2002.
[7] N. Dimitova and F. Golshani, "Motion Recovery for Video Content Classification," ACM Trans. Information Systems, Vol. 13, pp. 408-439, 1995.
[8] W. Chen and S. Chang, "Motion Trajectory Matching of Video Objects," Proc. SPIE/IS&T Conf. Storage and Retrieval for Media Databases, 2000.
[9] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," Proc. ACM SIGMOD, pp. 419-429, 1994.
[10] C. Cheong, W. Lee, and N. Yahaya, "Wavelet-Based Temporal Clustering Analysis on Stock Time Series," Proc. Int'l Conf. Quantitative Sciences and Its Applications, 2005.
[11] E. Keogh, "Temporal Data Mining Benchmarks," http://www.cs.ucr.edu/~eamonn/time_series_data, 2010.
[12] CAVIAR: Context Aware Vision Using Image-Based Active Recognition, School of Informatics, The Univ. of Edinburgh, http://homepages.inf.ed.ac.uk/rbf/CAVIAR, 2010.
[13] PDMC: Physiological Data Modeling Contest Workshop, Proc. Int'l Conf. Machine Learning (ICML) Workshop, http://www.cs.utexas.edu/users/sherstov/pdmc/, 2004.

