Given a set of time series data, we can construct an index ([AFS93]) as follows: find the DFT of each sequence and keep the first few DFT coefficients as the sequence features. Let's assume that we keep the first $k$ coefficients. Since all DFT coefficients except the first one are complex numbers, keeping the first $k$ DFT coefficients maps every time series

latter is a necessary (but not sufficient) condition for the former.

The size of the query rectangle has a strong effect on the number of directory nodes accessed during the search process and on the number of candidates, which includes all qualifying data items plus some false positives (data items whose full database records do not intersect the query region). Our goal here is to reduce the size of the query region, using the inherent properties of DFT, without sacrificing the correctness.

3.1 Our Proposal

The following lemma is central to our proposal.

Lemma 1 The DFT coefficients of a real-valued sequence of duration $n$ satisfy $X_{n-f} = X_f^*$ for $f = 1, \ldots, n-1$, where the asterisk denotes complex conjugation¹.

Proof: See Oppenheim and Schafer [OS75, page 25].

This means the Fourier transform of every real-valued sequence is symmetric with respect to its middle. A simple implication of this lemma is $|X_{n-f}| = |X_f|$, i.e. every amplitude at the beginning except the first one appears at the end.

Observation 1 In the class of (real-valued) time sequences that have an energy spectrum of the form $O(F^{-2b})$ for $b \ge 0.5$, the DFT coefficients are not only strong at the beginning but also strong at the end.

This means if we do our distance computations based on only the first $k$ Fourier coefficients, we will miss all the information carried by the last $k$ Fourier coefficients, which are as important as the former. However, the next observation shows that the first $k$ Fourier coefficients are the only features that we need to store in the index.

Observation 2 The first $\lceil (n+1)/2 \rceil$ DFT coefficients of every (real-valued) time sequence contain the whole information about the sequence.

The point left to describe now is how we can take advantage of the last $k$ Fourier coefficients without storing them in the index. We can write the Euclidean distance between two time sequences $\vec{x}$ and $\vec{q}$, using equations 4 and 2, as follows:

$$D^2(\vec{x}, \vec{q}) = D^2(\vec{X}, \vec{Q}) = \sum_{f=0}^{n-1} |X_f - Q_f|^2 \qquad (5)$$

where $\vec{X}$ and $\vec{Q}$ are respectively DFTs of $\vec{x}$ and $\vec{q}$. Since $X_{n-f} = X_f^*$ and $Q_{n-f} = Q_f^*$ by Lemma 1, we have $|X_{n-f} - Q_{n-f}| = |X_f - Q_f|$ for $f = 1, \ldots, n-1$, and we can write $D^2(\vec{X}, \vec{Q})$ as follows:

$$D^2(\vec{X}, \vec{Q}) = |X_0 - Q_0|^2 + \begin{cases} \sum_{f=1}^{n/2-1} 2|X_f - Q_f|^2 + |X_{n/2} - Q_{n/2}|^2 & \text{for even } n \\ \sum_{f=1}^{(n-1)/2} 2|X_f - Q_f|^2 & \text{for odd } n \end{cases} \qquad (6)$$

A necessary condition for the left side to be less than $\epsilon^2$ is that every magnitude on the right side be less than $\epsilon^2$. For the time being and just for the purpose of presentation, we assume time sequences are normalized² before being stored in the index. In general, time sequences may be normalized because of efficiency reasons [GK95] or other useful properties [Raf98]. Since the first Fourier coefficient is zero for normalized sequences, there is no need to store it in the index. In addition, since $k$ is usually a small number, much smaller than $n$, we can assume that the $(n/2)$th coefficient is also not stored in the index. Now the condition left to be checked on the index is

$$2|X_f - Q_f|^2 < \epsilon^2 \quad \text{for } f = 1, \ldots, k$$

or, equivalently

$$|X_f - Q_f| < \frac{\epsilon}{\sqrt{2}} \quad \text{for } f = 1, \ldots, k.$$

A common approach to check this condition is to build a search rectangle of side $2\epsilon/\sqrt{2} = \sqrt{2}\epsilon$ (or a circle of diameter $\sqrt{2}\epsilon$) around $\vec{Q}$ and check for an overlap between this rectangle (circle) and every rectangle in the index. The search rectangle is still guaranteed to include all points within the Euclidean distance $\epsilon$ from $\vec{Q}$, but there is a major drop in the number of false positives. The effect of reducing the size of the search rectangle on the search time of a range query is analytically discussed in the next section.

The symmetry property can be similarly used to reduce the size of the search rectangle even if sequences are not normalized. The only difference is that one side of the search rectangle (the one representing the first DFT coefficient³) is $2\epsilon$ and all other sides are $\sqrt{2}\epsilon$.

We can show that all-pair queries also benefit from the symmetry property of DFT. Suppose we want to answer an all-pair query using two R-tree indices, i.e., to find all pairs of sequences that are within distance $\epsilon$ from each other. A common approach for processing this query is to take pairs of (minimum bounding) rectangles, one rectangle from each index, extend the sides of one by $2\epsilon$ and check for a possible overlap with the other. However, the symmetry property implies that if we extend every side by $\sqrt{2}\epsilon$, the result is still guaranteed to include all qualifying pairs though the number of false positives is reduced.

¹ $(a + bj)^* = (a - bj)$
² A sequence is in normal form if its mean is 0 and its standard deviation is 1.
³ Note that the first DFT coefficient is a real number.

3.2 Analytical Results on the Search Time Improvements

There are two factors that affect the search time of a range query, if we assume the CPU time to be negligible; one is the number of index nodes touched by the query rectangle and the other is the number of data points inside the search rectangle (or candidates). Both factors can be approximated by the area of the search rectangle, if we assume data points are uniformly distributed over a unit square, and the search rectangle is a rectangle within this square⁴. Thus, to compare the search time of a rectangle of side $\sqrt{2}\epsilon$ to that of one of side $2\epsilon$, we compare their areas.

[Illustration: rectangle R with the best case query point and the worst case query point marked.]
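The symmetry of Lemma 1, the half-spectrum form of equation (6), and the index-level condition $|X_f - Q_f| < \epsilon/\sqrt{2}$ can be checked numerically on a toy example. The following is a minimal sketch, assuming a DFT scaled by $1/\sqrt{n}$ so that Euclidean distances are preserved as equation (5) requires. The function names (`unitary_dft`, `distance_sq_half_spectrum`, `passes_index_filter`) are illustrative only and do not come from the paper, and the naive $O(n^2)$ transform is used purely to keep the example self-contained.

```python
import cmath
import math
import random


def unitary_dft(x):
    """Naive O(n^2) DFT scaled by 1/sqrt(n) (assumed normalization)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * t * f / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]


def distance_sq_half_spectrum(X, Q):
    """Squared distance per equation (6), reading only coefficients 0..floor(n/2)."""
    n = len(X)
    total = abs(X[0] - Q[0]) ** 2
    last_paired = (n - 1) // 2  # n/2 - 1 for even n, (n - 1)/2 for odd n
    total += sum(2 * abs(X[f] - Q[f]) ** 2 for f in range(1, last_paired + 1))
    if n % 2 == 0:
        # The middle coefficient has no mirror image, so it is counted once.
        total += abs(X[n // 2] - Q[n // 2]) ** 2
    return total


def passes_index_filter(X, Q, eps, k):
    """Index-level check |X_f - Q_f| < eps / sqrt(2) for f = 1..k (k < n/2)."""
    return all(abs(X[f] - Q[f]) < eps / math.sqrt(2) for f in range(1, k + 1))


# Toy data: two random real-valued sequences of even length.
rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(16)]
q = [rng.uniform(-1, 1) for _ in range(16)]
X, Q = unitary_dft(x), unitary_dft(q)

time_domain = sum((a - b) ** 2 for a, b in zip(x, q))
print("time-domain D^2   :", time_domain)
print("equation (6) D^2  :", distance_sq_half_spectrum(X, Q))
print("max symmetry error:",
      max(abs(X[16 - f] - X[f].conjugate()) for f in range(1, 16)))
```

Because every doubled term in equation (6) is non-negative, any pair with $D(\vec{x}, \vec{q}) < \epsilon$ necessarily passes `passes_index_filter`, which is why the smaller rectangle cannot dismiss a qualifying sequence.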
Since a search rectangle has $2k$ sides, the area (or the volume) of a search rectangle of side $\sqrt{2}\epsilon$ is $(\sqrt{2}\epsilon)^{2k} = 2^k \epsilon^{2k}$. This is one $2^k$th of the area (or the volume) of a rectangle of side $2\epsilon$, which is $(2\epsilon)^{2k} = 2^{2k} \epsilon^{2k}$. Thus under the assumptions we have made, using a search rectangle of side $\sqrt{2}\epsilon$ instead of one of side $2\epsilon$ should reduce the search time by $(1 - 1/2^k) \times 100$ percent. For example, using a rectangle of side $\sqrt{2}\epsilon$ on an index built on the first two non-zero DFT coefficients should reduce the search time by 75 percent.

However, for the class of time sequences that have an energy spectrum of the form $O(F^{-2b})$, the amplitude spectrum follows $O(F^{-b})$. In particular for $b > 0$, the amplitude decreases as the frequency increases and points get denser in higher frequencies. If we assume that the first non-zero DFT coefficient (for every data or query sequence) is uniformly distributed within a unit square, the $i$th DFT coefficient (for $i = 1, \ldots, k$) must be distributed uniformly within a square of side $i^{-b}$. Thus keeping the first $k$ Fourier coefficients maps sequences into points which are uniformly distributed within rectangle $R = (\langle 0, 1 \rangle, \langle 0, 1 \rangle, \langle 0, 2^{-b} \rangle, \langle 0, 2^{-b} \rangle, \ldots, \langle 0, k^{-b} \rangle, \langle 0, k^{-b} \rangle)$.

In addition, a search rectangle built on an arbitrarily chosen query point $\vec{Q}$ (inside or on $R$) is not necessarily contained fully within $R$. If $\vec{Q}$ happens to be a central point of $R$, the overlap between the two rectangles reaches its maximum. We refer to this query as `the worst case query' since it requires the largest number of disk accesses. On the other hand, if $\vec{Q}$ happens to be a corner point of $R$, the overlap between the two rectangles reaches its minimum. We call this query `the best case query'. Thus the area of the overlap between the search rectangle and $R$, and as a result the search time, depends not only on $\epsilon$ but also on $\vec{Q}$.

To compare the search time of a query rectangle of side $\sqrt{2}\epsilon$ to that of one of side $2\epsilon$, we can compare their area of overlap with $R$. The projection of the overlap between a search rectangle of side $2\epsilon$ and $R$ to the $i$th DFT coefficient plane is a square of side $\min(i^{-b}, 2\epsilon)$ for the worst case query and a square of side $\min(i^{-b}, \epsilon)$ for the best case query. Thus the area of the overlap between the search rectangle and $R$ for the worst case query is $\prod_{i=1}^{k} (\min(i^{-b}, 2\epsilon))^2$ and that for the best case query is $\prod_{i=1}^{k} (\min(i^{-b}, \epsilon))^2$.

To eliminate the effect of the size of $R$ in our estimates, we divide the area of the overlap by the area of $R$, i.e. $\prod_{i=1}^{k} (i^{-b})^2$, to get what we call the query selectivity. The query selectivity for the worst case query using a search rectangle of side $2\epsilon$ can be expressed as follows:

$$S(b, k, 2\epsilon) = \frac{\prod_{i=1}^{k} (\min(i^{-b}, 2\epsilon))^2}{\prod_{i=1}^{k} (i^{-b})^2} = \prod_{i=1}^{k} (\min(i^{-b}, 2\epsilon)\, i^{b})^2. \qquad (7)$$

The term $\min(i^{-b}, 2\epsilon)\, i^{b}$ is 1 for $i^{-b} \le 2\epsilon$ (or $i \ge (2\epsilon)^{-1/b}$), and it is $2\epsilon\, i^{b}$ for $i^{-b} > 2\epsilon$ (or $i < (2\epsilon)^{-1/b}$). Thus the query selectivity can be expressed as

$$S(b, k, 2\epsilon) = \prod_{i=1}^{\min(k, \lfloor (2\epsilon)^{-1/b} \rfloor)} (2\epsilon\, i^{b})^2 \qquad (8)$$

It can be easily shown that $S(b, k, \epsilon)$ gives the query selectivity for the best case query using the same search rectangle. If we employ the symmetry property of the DFT, i.e. use a search rectangle of side $\sqrt{2}\epsilon$, the query selectivities for the worst and the best case queries would be $S(b, k, \sqrt{2}\epsilon)$ and $S(b, k, \epsilon/\sqrt{2})$ respectively.

Figure 1 shows the worst case query selectivity per search rectangle and $k$, varying the query threshold, for Brownian noise data ($b = 1$). As is shown, using the symmetry property reduces the query selectivity by 50 to 75 percent for $k = 2$ and $\epsilon \le 0.5$. If we keep the first three non-zero DFT coefficients ($k = 3$), using the symmetry property reduces the selectivity by up to 87 percent. In general, taking the symmetry property into account reduces the selectivity, and as a result the search time, in the worst case by 50 to $(1 - 1/2^k) \times 100$ percent for $k \ge 2$ and $\epsilon \le 0.5$.

Figure 2 shows the best case query selectivity per search rectangle and $k$, varying the query threshold, again for the Brownian noise data. As is shown, taking the symmetry property into account reduces the selectivity by at least 75 percent for all values of $\epsilon \le 0.5$, if we keep only the first two non-zero DFT coefficients. In general, taking

⁴ We relax our assumptions later in this section.
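[Figures 1 and 2: panels "Worst Case Query" and "Best Case Query"; y-axis: selectivity; curves labelled by k, with ("sym") and without ("no sym") the symmetry property.]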
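The selectivity estimates of equations (7) and (8) are easy to evaluate numerically. The following is a minimal sketch of the product form in equation (7); the names `selectivity` and `reduction_percent` are illustrative and not taken from the paper, and the choice of sides ($2\epsilon$ versus $\sqrt{2}\epsilon$ for the worst case, $\epsilon$ versus $\epsilon/\sqrt{2}$ for the best case) follows the text above.

```python
import math


def selectivity(b, k, side):
    """Overlap of the search rectangle with R divided by the area of R.

    Product form of equation (7): each of the k stored coefficients
    contributes (min(i**-b, side) * i**b) ** 2.
    """
    result = 1.0
    for i in range(1, k + 1):
        result *= (min(i ** -b, side) * i ** b) ** 2
    return result


def reduction_percent(b, k, eps, worst_case=True):
    """Percent drop in selectivity when the symmetry property is used."""
    if worst_case:
        plain, sym = 2 * eps, math.sqrt(2) * eps
    else:
        plain, sym = eps, eps / math.sqrt(2)
    return 100.0 * (1.0 - selectivity(b, k, sym) / selectivity(b, k, plain))


# Brownian noise (b = 1), two stored coefficients, a few thresholds.
for eps in (0.1, 0.3, 0.5):
    print(eps,
          round(reduction_percent(1, 2, eps, worst_case=True), 1),
          round(reduction_percent(1, 2, eps, worst_case=False), 1))
```

Setting `b = 0` recovers the uniform unit-square assumption made at the start of the section: for sides no larger than 1 the ratio of the two selectivities is then $1/2^k$, which is the $(1 - 1/2^k) \times 100$ percent estimate.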
[Figure 3 plots: two "Range Queries" panels; x-axis: Threshold / MaxAmp; curves: index, index (sym).]
Figure 3: Both query selectivities and running times for range queries varying the query threshold
for $\epsilon = 0.24 \times \mathit{MaxAmp}$ was zero, so we didn't try smaller thresholds. Since query points were chosen randomly, we expected the query selectivity for every threshold $\epsilon \times \mathit{MaxAmp}$ to fall between the two extreme selectivities (the worst case and the best case) computed analytically for $\epsilon$. As is shown in Figure 3, for $\epsilon/\mathit{MaxAmp} \le 0.5$, using the symmetry property reduces the query selectivity by 53 to 64 percent and the search time by 70 to 74 percent. This is consistent with our analytical results. For $0.5 < \epsilon/\mathit{MaxAmp} \le 1$, as the figure shows, using the symmetry property reduces the query selectivity by 45 to 64 percent and the running time by 62 to 74 percent.

4.2 Varying the number of DFT coefficients

Our next experiment was again on stock prices data, but this time we fixed the query threshold for range queries to $0.95 \times \mathit{MaxAmp}$ and that for all-pair queries to $0.32 \times \mathit{MaxAmp}$. This setting gave us average output sizes of 30 and 203 respectively for range and all-pair queries. We varied the number of DFT coefficients kept in the index from 1 to 4. Figure 4 shows the running times per query for range and all-pair queries. Taking our observations into account reduces the search time of the index by 66 to 72 percent for range queries and by 61 to 72 percent for all-pair queries.

[Figure 4 plots: panels "Range Queries" and "All-Pair Queries"; x-axis: Number of DFT coefficients; y-axis: Execution time (seconds); curves: index, index (sym).]
Figure 4: Running times for range and all-pair queries varying the number of DFT coefficients

4.3 Varying the number of sequences

In our next experiment, we fixed the number of DFT coefficients to 2 and the sequence length to 128, but we varied the number of sequences from 100 to 1067. The experiment was conducted on the stock prices data set. We again fixed the query threshold for range queries to $0.95 \times \mathit{MaxAmp}$ and that for all-pair queries to $0.32 \times \mathit{MaxAmp}$. Figure 5 shows the running times per query for range and all-pair queries. Our observation reduces the search time of the index by 63 to 71 percent for range queries and by 64 to 72 percent for all-pair queries.

[Figure 5 plots: two panels; x-axis: Number of sequences.]
Figure 5: Running times for range and all-pair queries varying the number of sequences

4.4 Varying the length of sequences

Our last experiment was on synthetic data where we fixed the number of DFT coefficients to 2 and the number of sequences to 20,000, but we varied the sequence length from 128 to 512. The size of the data file was in the range of 40 Mbytes (for sequences of length 128) to 160 Mbytes (for sequences of length 512). We fixed the query threshold to $0.44 \times \mathit{MaxAmp}$ and, based on our analytical results, we expected using the symmetry property to reduce the search time by 50 to 75 percent. Figure 6 shows the running times per query for range queries. Our proposed method reduces the search time of the index by 73 to 77 percent. The search time improvement is slightly more than our analytical estimates, mainly because of the CPU time reduction for distance computations, which is not accounted for in our analytical estimates. Because of the high volume of data, experiments on all-pair queries were very time consuming. For example, doing a self-join on sequences of length 512 did not finish after 12 hours of overnight running. For this reason, we did not report them.

[Figure 6 plot: panel "Range Queries"; x-axis: Sequence length; y-axis: Execution time (seconds); curves: index, index (sym).]
Figure 6: Running times for range queries varying the length of sequences

5 Conclusions

We have proposed using the last few Fourier coefficients of time sequences in the distance computation, the main observation being that every coefficient at the end is the complex conjugate of a coefficient at the beginning and as strong as its counterpart. Our analytical observation shows that using the last few Fourier coefficients in the distance computation reduces the search time of the index by more than a factor of two for a large range of thresholds. We also evaluated our proposed method over real and synthetic data. Our experimental results were consistent with our analytical observation; in all our experiments the proposed method reduced the search time of the index by 61 to 77 percent for both range and all-pair queries.

Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council of Canada and the Information Technology Research Centre of Ontario.

References

[AFS93] Rakesh Agrawal, Christos Faloutsos, and Arun Swami. Efficient similarity search in sequence databases. In Proceedings of the 4th International Conference on Foundations of Data Organizations and Algorithms (FODO '93), pages 69–84, Chicago, October 1993.
[ALSS95] Rakesh Agrawal, King-Ip Lin, Harpreet S. Sawhney, and Kyuseok Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95), pages 490–501, Zurich, September 1995. Morgan Kaufmann Publishers.
[APWZ95] R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB '95), pages 502–514, Zurich, September 1995.
[BKSS90] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R* tree: an efficient and robust index method for points and rectangles. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '90), pages 322–331, Atlantic City, May 1990.
[Cha84] Christopher Chatfield. The Analysis of Time Series: an Introduction. Chapman and Hall, fourth edition, 1984.
[FRM94] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '94), pages 419–429, Minneapolis, May 1994.
[GK95] D. Q. Goldin and P. C. Kanellakis. On similarity queries for time-series data: constraint specification and implementation. In 1st Intl. Conf. on the Principles and Practice of Constraint Programming, pages 137–153. LNCS 976, Sept. 1995.
[Gut84] Antonin Guttman. R-trees: a dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '84), pages 47–57, Boston, June 1984.
[JMM95] H. V. Jagadish, A. O. Mendelzon, and T. Milo. Similarity-based queries. In Proceedings of the 14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '95), pages 36–45, San Jose, May 1995.
[Man83] B. Mandelbrot. Fractal Geometry of Nature. W.H. Freeman, New York, 1983.
[NHS84] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: an adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):38–71, March 1984.
[OS75] A. V. Oppenheim and R. W. Schafer. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, N.J., 1975.
[Raf98] Davood Rafiei. On similarity-based queries for time series data. Submitted for publication, 1998.
[RM97] Davood Rafiei and Alberto Mendelzon. Similarity-based queries for time series data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '97), pages 13–24, Tucson, Arizona, May 1997.
[Rot93] William G. Roth. MIMSY: A system for analyzing time series data in the stock market domain. University of Wisconsin, Madison, 1993. Master Thesis.
[RS92] Raghu Ramakrishnan and Divesh Srivastava. CORAL: Control, relations and logic. In Proceedings of 18th International Conference on Very Large Data Bases (VLDB '92), pages 238–250, Vancouver, August 1992. Morgan Kaufmann.
[Sch91] Manfred Schroeder. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W.H. Freeman, New York, 1991.
[SLR94] P. Seshadri, M. Livny, and R. Ramakrishnan. Sequence query processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '94), pages 430–441, Minneapolis, May 1994.
[WS90] B.J. West and M. Shlesinger. The noise in natural phenomena. American Scientist, 78:40–45, Jan-Feb 1990.
[YJF98] Byoung-Kee Yi, H. V. Jagadish, and Christos Faloutsos. Efficient retrieval of similar time sequences under time warping. In Proceedings of the 14th International Conference on Data Engineering (ICDE '98), pages 201–208, Orlando, February 1998.