
In Search of Meaning for Time Series Subsequence Clustering

Dina Goldin, Brown University


work done with Ricardo Mardales, UConn
and George Nagy, RPI
CIKM, Nov. 8, 2006

The Meaningless Paper


[KLT03] Keogh, E., Lin, J., Truppel, W.
Clustering of Time Series is meaningless.
Proc. IEEE Conf. on Data Mining (2003)
[KL05] Keogh, E. & Lin, J.
Clustering of time-series subsequences is meaningless: implications for previous and future research.
J. Knowledge and Inf. Sys. 8:2 (2005)

"Clustering of time series subsequences is meaningless [because] the result of clustering these subsequences is independent of the input."
November 8, 2006

CIKM06

Implications of Meaningless Result

It cast a shadow over STS clustering.

Jeopardized the legitimacy of research that had used subsequence clustering.

Led to a flurry of follow-up research

Chen 05 uses cyclical data and k-medoids

Simon et al. 05 uses self-organizing maps

Denton 04 uses density based clustering

Struzik 03 uses correlation for trivial matches

Bagnall 03, Mahoney 05, Rodrigues et al. 04 moved away from STS

No one had challenged the results head-on, i.e. shown that the output and input of STS clustering are not independent


Independence of Input and Output


[Figure: time series -> STS clustering algorithm -> clusters]

Is there a way to match C to the right time series (X or Y) reliably?


Before: NO; cluster_set_dist(C,B) / cluster_set_dist(C,A) not small
Our work: YES

Find a different distance measure!


Outline
1. Introduction
2. New Distance Measure for Cluster Sets
   based on the notion of cluster shapes
3. STS Cluster Matching
4. Observations and Conclusions


STS Clustering

Consider all subsequences of the same time series
(time series T of length m, window size w)

Normalize each subsequence so its average is 0 and its std. deviation is 1

Normalize(x) = (x - avg(x)) / stddev(x)

Cluster the normalized subsequences using the K-means clustering algorithm
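
The first two steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of the population standard deviation, and the handling of constant windows are all assumptions.

```python
import math

def normalize(x):
    """Z-normalize a window: subtract its mean, divide by its std. deviation."""
    mean = sum(x) / len(x)
    # Assumption: population standard deviation (divide by len(x)).
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if std == 0:
        # Assumption: a constant window maps to all zeros to avoid dividing by zero.
        return [0.0] * len(x)
    return [(v - mean) / std for v in x]

def sts_subsequences(series, w):
    """Return all m - w + 1 normalized sliding windows of length w."""
    return [normalize(series[i:i + w]) for i in range(len(series) - w + 1)]

# Toy series of length m = 6 with window size w = 3 (made-up values).
T = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
subs = sts_subsequences(T, w=3)
print(len(subs))
```

Each element of `subs` is a point of dimension w that is then handed to the clustering step.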


K-means Clustering

Given a set of multidimensional points (of dimension w), partition it into K groups, so each point belongs to one cluster.

Compute the center of each cluster; it is the mean of all points in the cluster.

Result: a set of K cluster centers
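
A minimal sketch of K-means (Lloyd's iteration) in plain Python, to make the two steps above concrete. The random initialization from the data points, the iteration cap, and the seed parameter are assumptions; the slides do not say how the runs were initialized.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Return k cluster centers found by alternating assignment and averaging."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: each center becomes the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Two well-separated toy groups (made-up values).
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers = kmeans(points, k=2)
print(sorted(centers))
```

Different seeds give different starting centers, which is why repeated clustering runs over the same series can return different cluster sets.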

[Figure: cluster centers]


Cluster Set Distance


- Previous approach to measuring distance between cluster sets
- Returns sum of Euclidean distances between cluster centers
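
One possible reading of this measure, as a sketch. The slide only says "sum of Euclidean distances between cluster centers"; pairing each center in A with its nearest center in B is an assumption here, and the example values are made up.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_set_dist(A, B):
    """Sum, over the centers in A, of the distance to the nearest center in B.

    Assumption: nearest-center pairing; the slide does not spell out the pairing.
    """
    return sum(min(euclidean(a, b) for b in B) for a in A)

A = [[0.0, 0.0], [3.0, 4.0]]
# B is A translated by (10, 0): same relative layout, different absolute position.
B = [[10.0, 0.0], [13.0, 4.0]]
print(cluster_set_dist(A, A), cluster_set_dist(A, B))
```

The measure is zero for identical sets but grows under a pure translation, so it depends on the absolute positions of the centers.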


[Figure: cluster_set_dist(B,A)]

Cluster Shape Distance


- New distance measure for cluster sets
- Returns Euclidean distance between cluster set shapes

Cluster set shape: sorted list of pairwise distances between cluster centers; has K*(K-1)/2 values

[Figure: cluster set A with centers X, Y, Z]
Shape of cluster set A = [XZ, ZY, XY]

A and B have the same shape (B is a rotated and translated copy of A), so cluster_shape_dist(A,B) = 0
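
A short sketch of the shape and the shape distance as defined on this slide. The function names follow the slide where it gives them (`cluster_shape_dist`); `cluster_shape` and the example coordinates are assumptions for illustration.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_shape(centers):
    """Sorted list of the K*(K-1)/2 pairwise distances between centers."""
    return sorted(euclidean(p, q) for p, q in combinations(centers, 2))

def cluster_shape_dist(A, B):
    """Euclidean distance between the shapes of two cluster sets (same K)."""
    return euclidean(cluster_shape(A), cluster_shape(B))

A = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
# B is A rotated by 90 degrees about the origin, then translated by (10, 5).
B = [[10.0, 5.0], [10.0, 8.0], [6.0, 5.0]]
print(cluster_shape(A), cluster_shape_dist(A, B))
```

Because only pairwise distances enter the shape, rotating or translating a cluster set leaves it unchanged.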

Cluster Shape Example

STS clustering for ocean series with K=3

Ds: pairwise distances between cluster centers

Note: all our datasets come from the UC Riverside repository


Cluster Structure

Sort the pairwise distances

Observation: for each K and w, the shapes obtained from different STS clustering runs are similar!

Cluster structure of T: the average of the cluster set shapes from many clustering runs over T.
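
The averaging step can be sketched as an element-wise mean over sorted shape vectors. Element-wise averaging is an assumption about what "average" means here, and the shape values below are made up for illustration.

```python
def cluster_structure(shapes):
    """Element-wise mean of several sorted shape vectors of equal length."""
    n = len(shapes)
    return [sum(vals) / n for vals in zip(*shapes)]

# Hypothetical shapes from three clustering runs with K = 3 (three distances each).
shapes = [
    [1.0, 2.0, 3.0],
    [1.2, 2.2, 2.8],
    [0.8, 1.8, 3.2],
]
structure = cluster_structure(shapes)
print(structure)
```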


Cluster Structure: Example

Cluster structures for datasets from UCR repository

[Figure panels: k=3, w=8; k=3, w=16; k=4, w=8]

Outline
1. Introduction
2. New Distance Measure for Cluster Sets
3. STS Cluster Matching
4. Observations and Conclusions


STS Cluster Matching Problem


Given a dataset of multiple time series
and a cluster center set from one of them (query),
match it to the series that produced it.
Note: K and w are assumed to be fixed.

Matching algorithm:
Outputs a guess -- which of the N time series in the
dataset produced the query?

Algorithm accuracy:
Percentage of times that the matching algorithm is correct.
Note: no previous work succeeded in attaining high accuracy,
even with a dataset of size 2!
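
The accuracy measure defined above amounts to a simple percentage. A small sketch, with hypothetical series names and guesses that do not come from the talk:

```python
def accuracy(guesses, truths):
    """Percentage of queries for which the guessed series is the true one."""
    correct = sum(g == t for g, t in zip(guesses, truths))
    return 100.0 * correct / len(truths)

# Hypothetical outcome of four queries; three guesses are right.
guesses = ["a", "b", "a", "c"]
truths = ["a", "b", "b", "c"]
print(accuracy(guesses, truths))
```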


Matching Algorithm

1. Pre-processing phase:
   For each sequence in the dataset, perform Q clustering runs with given K and w, and calculate its cluster structure.
   Store all the structures in a master table.

2. Matching phase:
   1. Given a query, find the Euclidean distance from its shape to each of the structures in the master table.
   2. Return the sequence whose structure is the closest.
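
The two phases can be sketched as below, assuming the cluster set shapes of the Q runs per series have already been computed. The function names, the dictionary layout of the master table, and all numbers are illustrative assumptions, not the authors' implementation.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_master_table(shapes_by_series):
    """Pre-processing: average the Q shapes of each series into one structure."""
    return {
        name: [sum(vals) / len(shapes) for vals in zip(*shapes)]
        for name, shapes in shapes_by_series.items()
    }

def match(query_shape, master_table):
    """Matching: return the series whose structure is closest to the query shape."""
    return min(master_table, key=lambda name: euclidean(query_shape, master_table[name]))

# Hypothetical shapes for two series, Q = 2 runs each, K = 3.
shapes_by_series = {
    "series_a": [[1.0, 2.0, 3.0], [1.2, 2.2, 2.8]],
    "series_b": [[2.0, 2.5, 4.0], [2.2, 2.7, 4.2]],
}
master_table = build_master_table(shapes_by_series)
guess = match([1.1, 2.0, 3.0], master_table)
print(guess)
```

The pre-processing cost is paid once per series; each query then needs only N distance computations over vectors of K*(K-1)/2 values.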


Example

[Table: master table for k=3, w=8]


Performance Evaluation

10 datasets from UCR time series repository


100 clustering runs per structure
Algorithm evaluated with 3 values of K,
4 values of w (12 combinations)

Result: 100% accuracy


Outline
1. Introduction
2. New Distance Measure for Cluster Sets
3. STS Cluster Matching Algorithm
4. Observations and Conclusions


Conclusions

Previous work seemed to show that the output of STS clustering is independent of input.

The correct conclusion: cluster set distance is an inappropriate distance metric.

Instead of absolute positions of cluster centers, one needs to use relative positions (as represented by cluster shapes).

STS clustering becomes meaningful: cluster centers are reliably matched to original series.

We also found correlation between some characteristics (number of unique shapes, shape skew) and sequence smoothness.


Future Work
WHY?

Difference in behavior between whole-sequence and subsequence clustering? (some preliminary answers are in the paper)

Apparent presence of transformations among cluster sets?

Dependency between smoothness, skew, number of unique clusters, etc.?

HOW?

Find expected accuracy of the matching algorithm for given input and Q (number of clustering runs to compute each structure).

Questions?

Thank you!
