In Search of Meaning for Time Series Subsequence Clustering
Dina Goldin, Brown University

work done with Ricardo Mardales, UConn
and George Nagy, RPI
CIKM, Nov. 8, 2006
The Meaningless Paper

[KLT03] Keogh, E., Lin, J., Truppel, W.
Clustering of Time Series is meaningless.
Proc. IEEE Conf. on Data Mining (2003)
[KL05] Keogh, E. & Lin, J.
Clustering of time-series subsequences is meaningless: implications for previous and future research.
J. Knowledge and Inf. Sys. 8:2 (2005)
"Clustering of time series subsequences is meaningless [because] the result of clustering these subsequences is independent of the input."
November 8, 2006
CIKM06
Implications of Meaningless Result
It cast a shadow over STS (subsequence time series) clustering.
It jeopardized the legitimacy of research that had used subsequence clustering.
It led to a flurry of follow-up research:
- Chen 05 uses cyclical data and k-medoids
- Simon et al. 05 use self-organizing maps
- Denton 04 uses density-based clustering
- Struzik 03 uses correlation for trivial matches
- Bagnall 03, Mahoney 05, Rodrigues et al. 04 moved away from STS
No one had challenged the result head-on, i.e. shown that the output and input of STS clustering are not independent.
Independence of Input and Output

[Diagram: time series -> STS clustering algorithm -> clusters]
Is there a way to match C to the right time series (X or Y) reliably?

Before: NO; cluster_set_dist(C,B) / cluster_set_dist(C,A) not small
Our work: YES
Find a different distance measure!
Outline
1. Introduction
2. New Distance Measure for Cluster Sets
   based on the notion of cluster shapes
3. STS Cluster Matching
4. Observations and Conclusions
STS Clustering
Consider all subsequences of the same time series
(time series T of length m, window size w)
Normalize each subsequence so its average is 0 and std. deviation is 1:
Normalize(x) = (x - avg(x)) / stddev(x)
Cluster the normalized subsequences using the K-means clustering algorithm
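The extraction and normalization steps can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names and the sample series are made up, and the use of the population standard deviation (dividing by w rather than w - 1) is an assumption. The K-means step is left out here.

```python
import math

def normalize(x):
    """Z-normalize a subsequence: (x - avg(x)) / stddev(x)."""
    mean = sum(x) / len(x)
    # Assumption: population standard deviation (divide by len(x)).
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if std == 0:
        # Assumption: a constant window has no shape, so map it to zeros.
        return [0.0] * len(x)
    return [(v - mean) / std for v in x]

def normalized_subsequences(series, w):
    """Return the m - w + 1 normalized sliding windows of width w."""
    return [normalize(series[i:i + w]) for i in range(len(series) - w + 1)]

T = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]    # made-up series, m = 6
subs = normalized_subsequences(T, w=3)
print(len(subs))                       # m - w + 1 = 4 windows
```

Each element of `subs` is a point of dimension w, which is the input the clustering step expects.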
K-means Clustering
Given a set of multidimensional points (of dimension w), partition it into K groups,
so each point belongs to exactly one cluster.
Compute the center of each cluster; it is the mean of all points in the cluster.
Result: a set of K cluster centers
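A bare-bones K-means (Lloyd's iteration) is sketched below to make the "partition, then average" loop concrete. It is only an illustration under stated assumptions: the initial centers are K randomly sampled points, an empty cluster keeps its previous center, and the sample points are made up. The talk does not say which K-means variant or initialization was used.

```python
import math
import random

def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iters=100, seed=0):
    """Return k cluster centers found by Lloyd's iteration."""
    rng = random.Random(seed)
    # Assumption: start from k distinct points picked at random.
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: euclid(p, centers[j]))
            groups[nearest].append(p)
        # Update step: each center becomes the mean of its points.
        # Assumption: an empty cluster keeps its old center.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11]]   # two made-up blobs
centers = sorted(kmeans(points, k=2))
print(centers)
```

Because the result depends on the random start, repeated runs over the same series can return different center sets, which is why the later slides average over many runs.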
Cluster Set Distance

- Previous approach to measuring distance between cluster sets
- Returns the sum of Euclidean distances between cluster centers
cluster_set_dist(B,A)
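A possible reading of cluster_set_dist is sketched below. The slide does not say how centers of the two sets are paired; this sketch assumes each center in the first set is paired with its nearest center in the second set, and the sample centers are made up.

```python
import math

def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_set_dist(A, B):
    """Sum over centers a in A of the distance to the nearest center in B.

    Assumption: nearest-center pairing; under it the measure is not
    symmetric in general.
    """
    return sum(min(euclid(a, b) for b in B) for a in A)

A = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
B = [[v + 10.0 for v in p] for p in A]    # A shifted by (10, 10)

print(cluster_set_dist(A, A))             # identical sets
print(cluster_set_dist(A, B))             # shifted copy
```

The shifted copy is far from A under this measure even though the two sets have identical geometry, because the measure depends on the absolute positions of the centers.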
Cluster Shape Distance

- New distance measure for cluster sets
- Returns Euclidean distance between cluster set shapes
- Cluster set shape: sorted list of pairwise distances between cluster centers; has K*(K-1)/2 values
Shape of cluster set A = [XZ, ZY, XY] (A has centers X, Y, Z)
A and B have the same shape (B is a rotated and translated copy of A),
so cluster_shape_dist(A,B) = 0
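The definition above translates almost directly into code. The sketch below is an illustration with made-up centers; the helper names other than cluster_shape_dist are assumptions. It builds B as a rotated and translated copy of A and checks that the shape distance vanishes.

```python
import math
from itertools import combinations

def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cluster_set_shape(centers):
    """Sorted list of the K*(K-1)/2 pairwise distances between centers."""
    return sorted(euclid(p, q) for p, q in combinations(centers, 2))

def cluster_shape_dist(A, B):
    """Euclidean distance between the shapes of two cluster sets."""
    return euclid(cluster_set_shape(A), cluster_set_shape(B))

# Made-up cluster set with K = 3 centers forming a 3-4-5 triangle.
A = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
# B: A rotated by 90 degrees, then translated by (7, -2).
B = [[-y + 7.0, x - 2.0] for x, y in A]

print(cluster_set_shape(A))
print(cluster_shape_dist(A, B))
```

Both sets must have the same K for the two shapes to have equal length, which matches the talk's assumption that K and w are fixed.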
Cluster Shape Example
STS clustering for ocean series with K=3
Ds: pairwise distances between cluster centers
Note: all our datasets come from the UC Riverside repository
Cluster Structure
Sort the pairwise distances
Observation: for each K and w, the shapes obtained from different STS clustering runs are similar!
Cluster structure of T: the average of cluster set shapes from many clustering runs over T.
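Averaging shapes can be sketched as below, assuming the average is taken position by position over the sorted distance lists (the slide says only "average"). The three shapes are made-up numbers standing in for three clustering runs over one series.

```python
def cluster_structure(shapes):
    """Element-wise mean of several cluster set shapes.

    Assumption: every shape is a sorted list of the same length
    K*(K-1)/2, and the i-th smallest distances are averaged together.
    """
    runs = len(shapes)
    return [sum(values) / runs for values in zip(*shapes)]

# Made-up shapes from three runs with K = 3 (three distances each).
shapes = [
    [1.0, 2.0, 3.0],
    [1.2, 2.2, 2.8],
    [0.8, 1.8, 3.2],
]
structure = cluster_structure(shapes)
print(structure)
```

Since every input list is sorted, the element-wise mean is sorted too, so a structure can be compared with a single shape using the same Euclidean distance.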
Cluster Structure: Example
Cluster structures for datasets from UCR repository
[Plots of cluster structures for k=3, w=8; k=3, w=16; k=4, w=8]
Outline
1. Introduction
2. New Distance Measure for Cluster Sets
3. STS Cluster Matching
4. Observations and Conclusions
STS Cluster Matching Problem

Given a dataset of multiple time series
and a cluster center set from one of them (query),
match it to the series that produced it.
Note: K and w are assumed to be fixed.
Matching algorithm:
Outputs a guess -- which of the N time series in the
dataset produced the query?
Algorithm accuracy:
Percentage of times that the matching algorithm is correct.
Note: no previous work succeeded in attaining high accuracy,
even with a dataset of size 2!
Matching Algorithm
1. Pre-processing phase:
   For each sequence in the dataset, perform Q clustering runs with given K and w, and calculate its cluster structure.
   Store all the structures in a master table.
2. Matching phase:
   1. Given a query, find the Euclidean distance from its shape to each of the structures in the master table.
   2. Return the sequence whose structure is the closest.
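The two phases can be sketched as below, starting from shapes that have already been computed. Everything concrete here is hypothetical: the series names, the shape values, and the function names are made up for illustration, and the structure is assumed to be the element-wise mean of the Q shapes.

```python
import math

def euclid(p, q):
    """Euclidean distance between two equal-length lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def build_master_table(shapes_by_series):
    """Pre-processing: map each series name to its cluster structure.

    shapes_by_series: name -> list of Q shapes from Q clustering runs.
    """
    return {
        name: [sum(vals) / len(shapes) for vals in zip(*shapes)]
        for name, shapes in shapes_by_series.items()
    }

def match(query_shape, master_table):
    """Matching: return the series whose structure is closest to the query."""
    return min(master_table,
               key=lambda name: euclid(query_shape, master_table[name]))

def accuracy(queries, master_table):
    """Percentage of (shape, true_name) queries that are matched correctly."""
    correct = sum(match(shape, master_table) == name for shape, name in queries)
    return 100.0 * correct / len(queries)

# Hypothetical data: two series, Q = 2 runs each, K = 3.
master = build_master_table({
    "series_a": [[1.0, 2.0, 3.0], [1.2, 2.2, 2.8]],
    "series_b": [[0.5, 0.6, 4.0], [0.4, 0.7, 4.2]],
})
queries = [([1.0, 2.1, 3.0], "series_a"), ([0.5, 0.7, 4.0], "series_b")]
print(match(queries[0][0], master))
print(accuracy(queries, master))
```

The master table is built once per (K, w) pair, so each query costs only N distance computations over vectors of length K*(K-1)/2.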
Example
Master table
k=3 w=8
Performance Evaluation
10 datasets from UCR time series repository
100 clustering runs per structure
Algorithm evaluated with 3 values of K and 4 values of w (12 combinations)
Result: 100% accuracy
Outline
1. Introduction
2. New Distance Measure for Cluster Sets
3. STS Cluster Matching
4. Observations and Conclusions
Conclusions
Previous work seemed to show that the output of STS clustering is independent of input.
The correct conclusion: cluster set distance is an inappropriate distance metric.
Instead of absolute positions of cluster centers, one needs to use relative positions (as represented by cluster shapes).
STS clustering becomes meaningful: cluster centers are reliably matched to original series.
We also found correlation between some characteristics (number of unique shapes, shape skew) and sequence smoothness.
Future Work
WHY?
Difference in behavior between whole-sequence and subsequence clustering?
(some preliminary answers are in the paper)
Apparent presence of transformations among cluster sets?
Dependency between smoothness, skew, number of unique clusters, etc.?
HOW?
Find expected accuracy of the matching algorithm for given input and Q (number of clustering runs to compute each structure).
Questions?
Thank you!
