
To be presented at the American Control Conference, Denver, CO, June 4–6, 2003

Data Compression Issues with Pattern Matching in Historical Data


Ashish Singhal∗ and Dale E. Seborg†

Department of Chemical Engineering
University of California, Santa Barbara, CA 93106

∗ Present address: Johnson Controls, Inc., 507 E. Michigan St., Milwaukee, WI 53202. Email: Ashish.Singhal@jci.com
† Corresponding author. Email: seborg@engineering.ucsb.edu

Abstract
It is a common practice in the process industries to compress plant data before it is archived. However, compression may alter the data in a manner that makes it difficult to extract useful information from it. In this paper we evaluate the effectiveness of a new pattern matching technique [1] for applications involving compressed historical data. We also compare several data compression methods with regard to efficiency, data reconstruction, and suitability for pattern matching applications.
1 Introduction
Due to the advances in computer technology, large amounts of data produced by industrial plants are recorded as frequently as every second using commercially available data historians [2, 3]. Although storage media are inexpensive, the cost of building large-bandwidth networks is still high. Thus, to minimize the cost of transmitting large amounts of data over company networks or the Internet, data have to be compressed.

One of the classic papers on compression of process data was published by Hale and Sellars [2]. They provided an excellent overview of the issues in the compression of process data and also described piecewise linear compression methods. Other researchers have developed several algorithms to compress time-varying signals in efficient ways. Bristol [4] modified the piecewise linear compression methods of Hale and Sellars [2] to propose a swinging door data compression algorithm. Mah et al. [5] proposed a complex piecewise linear online trending (PLOT) algorithm that performed better than the classical box-car, backward slope and swinging door methods. Bakshi and Stephanopoulos [6] compressed process data using wavelet methods. Recently, Misra et al. [7] developed an online data compression method using wavelets where the algorithm computes and updates the wavelet decomposition tree before receiving the next data point.
In this paper, six data compression methods are evaluated not only on the basis of how accurately they represent process data, but also on how they affect pattern matching.
2 Popular data compression and reconstruction methods for time-series data
This section briefly describes some of the popular compression methods for time-series data. Because the accuracy of retrieved data depends not only on the method that was used for compression, but also on the method used for reconstruction, some simple reconstruction techniques, including zero-order hold and linear interpolation, are also discussed briefly.
2.1 Data compression methods
The box-car method is a simple piecewise linear compression method. This method records data when a value is significantly different from the last recorded value [2]. Because the recording of a future value depends only on the last recorded value and the recording limits, the box-car algorithm performs best when the process runs for long periods of steady-state operation [8]. (A sketch of this method, and of wavelet thresholding, is given at the end of this subsection.)
The backward slope method is also a piecewise linear compression method; it utilizes the trending nature of a process variable by projecting the recording limit into the future on the basis of the slope of the previous two recorded values [2].
The combination method combines the box-car and backward slope algorithms [2]. This algorithm handles cases when the system is at steady state as well as when process variables exhibit trends.
Data averaging compression is a common compression technique where the time-series data are simply averaged over a specified period of time. In this case, the compression is performed off-line rather than online.
Wavelet-based compression. Wavelet transforms can be used to compress time-series data by thresholding the wavelet coefficients [7, 8]. Hard thresholding, in which only those wavelet coefficients whose magnitudes exceed a specified threshold are retained, is used in this research. For data compression, only the non-zero thresholded wavelet coefficients are stored; these coefficients can then be used to reconstruct the data when needed. In the present study, the recording limits on each of the process variables will be used as the threshold values (see the sketch at the end of this subsection).
Compression using the commercial PI software (OSI Software, www.osisoft.com). Because PI is widely used for data archiving, it is informative to compare the commercially available software with the classical techniques. In particular, the BatchFile Interface for the PI software was used in this research for data compression.
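To make the piecewise linear and wavelet ideas concrete, the sketch below illustrates a box-car compressor and wavelet hard thresholding in Python. This is a minimal illustration under our own assumptions (uniformly sampled data, a single scalar recording limit, and the PyWavelets library, which was not necessarily used in this study); it is not the exact algorithms employed in this paper.

```python
import numpy as np
import pywt  # PyWavelets; assumed here for illustration


def boxcar_compress(t, x, limit):
    """Box-car sketch: keep a point only when it deviates from the last
    recorded value by more than the recording limit. (Industrial variants
    differ in detail; e.g., some record the previous point instead.)"""
    rec_t, rec_x = [t[0]], [x[0]]
    for ti, xi in zip(t[1:], x[1:]):
        if abs(xi - rec_x[-1]) > limit:
            rec_t.append(ti)
            rec_x.append(xi)
    return np.asarray(rec_t), np.asarray(rec_x)


def wavelet_compress(x, threshold, wavelet="db4"):
    """Hard-thresholding sketch: zero the detail coefficients whose
    magnitudes fall below the threshold; only the non-zero coefficients
    would be stored. Returns the reconstruction and the number kept."""
    coeffs = pywt.wavedec(x, wavelet)
    kept = [coeffs[0]] + [pywt.threshold(c, threshold, mode="hard")
                          for c in coeffs[1:]]
    n_stored = sum(int(np.count_nonzero(c)) for c in kept)
    return pywt.waverec(kept, wavelet)[: len(x)], n_stored
```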
2.2 Data reconstruction methods
All of the data compression methods described in the previous section produce lossy compression, i.e., it is not possible to reconstruct the compressed data to exactly match the original data. The accuracy with which compressed data can describe the original uncompressed data depends not only on the compression algorithm, but also on the method of data reconstruction. Many reconstruction methods are available, such as the zero-order hold (ZOH), where the value of a variable is held at the last recorded value until the next recording.
Linear interpolation (LIN) is a simple method that can overcome part of this limitation (the staircase-like reconstruction produced by the ZOH) by reconstructing data between recordings. It can provide more accurate reconstruction both for situations where the process is at steady state and for situations where process variables show trends. (Both reconstruction methods are sketched below.)
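A minimal sketch of these two reconstruction methods, assuming the compressed record is a pair of arrays (recording times and values) and the signal is to be rebuilt on a uniform time grid; the function names are ours:

```python
import numpy as np


def reconstruct_zoh(rec_t, rec_x, t_grid):
    """Zero-order hold: hold each recorded value until the next recording.
    Grid points before the first recording are assigned the first value."""
    idx = np.searchsorted(rec_t, t_grid, side="right") - 1
    return rec_x[np.clip(idx, 0, len(rec_x) - 1)]


def reconstruct_linear(rec_t, rec_x, t_grid):
    """Linear interpolation between successive recordings."""
    return np.interp(t_grid, rec_t, rec_x)
```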
More sophisticated methods, such as spline interpolation and the expectation-maximization algorithm for data reconstruction, have also been proposed [9, 10]. But these methods are sensitive to the amount of missing data, and do not perform well when a significant amount of data is missing [9, 10].
3 Pattern matching approach
In this article, the pattern matching methodology described by Singhal [11] and Singhal and Seborg [1] is used to compare historical and current snapshot datasets. First, the user defines the snapshot data that serve as a template for searching the historical database. The snapshot specifications consist of: (i) the relevant process variables, and (ii) the duration of the abnormal situation. These specifications can be arbitrarily chosen by the user; no special plant tests or pre-imposed conditions are necessary.
In order to find periods of operation in historical data that are similar to the snapshot data, a window of the same size as the snapshot data is moved through the historical data. The similarity between the snapshot and the historical data in the moving window is characterized by the $S_{PCA}$ and $S_{dist}$ similarity factors [1, 12]. The PCA similarity factor compares two datasets by comparing the angles between the subspaces spanned by the datasets, while the distance similarity factor compares datasets by calculating the Mahalanobis distance between their centers [1].
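As an illustration, $S_{PCA}$ can be computed from the $k$ dominant right singular vectors of each autoscaled data window using Krzanowski's trace form, $S_{PCA} = \mathrm{trace}(L^T M M^T L)/k$ [12]. The sketch below follows that form; the mapping from the Mahalanobis distance to $S_{dist}$ is shown only schematically, since the exact expression of [1] is not reproduced here.

```python
import numpy as np
from scipy.stats import norm


def pca_similarity(X1, X2, k):
    """Krzanowski's S_PCA: average squared cosine of the angles between
    the subspaces spanned by the first k PCs of each dataset."""
    _, _, V1 = np.linalg.svd(X1 - X1.mean(axis=0), full_matrices=False)
    _, _, V2 = np.linalg.svd(X2 - X2.mean(axis=0), full_matrices=False)
    L, M = V1[:k].T, V2[:k].T  # (n_vars x k) loading matrices
    return np.trace(L.T @ M @ M.T @ L) / k


def dist_similarity(X1, X2):
    """Schematic S_dist: map the Mahalanobis distance between the dataset
    centers onto (0, 1]; the exact mapping used in [1] may differ."""
    d = X2.mean(axis=0) - X1.mean(axis=0)
    dist = np.sqrt(d @ np.linalg.solve(np.cov(X1, rowvar=False), d))
    return 2.0 * (1.0 - norm.cdf(dist))  # equals 1 when centers coincide
```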
The historical data windows with the largest values of the similarity factors are collected in a candidate pool. The individual data windows in the candidate pool are called records. After the candidate pool has been formed, a person familiar with the process can then perform a more detailed examination of the records. The number of observations by which the window is moved through the historical data is denoted by w, and is set equal to one-tenth to one-fifth of the length of the snapshot data window [1]. A detailed description of the similarity factors and the pattern matching methodology is provided by Singhal [11] and Singhal and Seborg [1].
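The search itself is then a simple loop. The sketch below uses the similarity functions above together with the weighted combination SF = 0.67 S_PCA + 0.33 S_dist introduced in Section 5, and assumes the data windows are already scaled:

```python
def pattern_match(snapshot, history, w, n_pool, k=3):
    """Move a snapshot-sized window through the historical data in steps
    of w observations; return the n_pool most similar windows (records)."""
    m = len(snapshot)
    scored = []
    for start in range(0, len(history) - m + 1, w):
        window = history[start:start + m]
        sf = (0.67 * pca_similarity(snapshot, window, k)
              + 0.33 * dist_similarity(snapshot, window))
        scored.append((sf, start))
    scored.sort(reverse=True)  # most similar windows first
    return scored[:n_pool]     # the candidate pool
```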
3.1 Performance measures for pattern matching
Two important metrics are used to quantify the effectiveness of a pattern matching technique. But first, several definitions are introduced:

$N_P$: The size of the candidate pool. $N_P$ is the number of historical data windows that have been labeled similar to the snapshot data by a pattern matching technique. The data windows collected in the candidate pool are called records.

$N_1$: The number of records in the candidate pool that are actually similar to the current snapshot, i.e., the number of correctly identified records.

$N_2$: The number of records in the candidate pool that are actually not similar to the current snapshot, i.e., the number of incorrectly identified records. By definition, $N_1 + N_2 = N_P$.

$N_{DB}$: The total number of historical data windows that are actually similar to the current snapshot. In general, $N_{DB} \neq N_P$.
The first metric, the pool accuracy p, characterizes the accuracy of the candidate pool:

$$p \equiv \frac{N_1}{N_P} \times 100\% \qquad (1)$$

A second metric, the pattern matching efficiency $\eta$, characterizes how effective the pattern matching technique is in locating similar records in the historical database. It is defined as:

$$\eta \equiv \frac{N_1}{N_{DB}} \times 100\% \qquad (2)$$

Because an effective pattern matching technique should ideally produce large values of both p and $\eta$, the average of the two quantities, $\xi$, is used as a measure of the overall effectiveness of pattern matching:

$$\xi \equiv \frac{p + \eta}{2} \qquad (3)$$
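Once the records in the candidate pool have been classified, the three metrics follow directly; a minimal sketch:

```python
def pattern_matching_metrics(n1, n_pool, n_db):
    """Pool accuracy p, efficiency eta, and their average xi (Eqs. 1-3)."""
    p = 100.0 * n1 / n_pool
    eta = 100.0 * n1 / n_db
    return p, eta, (p + eta) / 2.0


# Example: 10 of 14 pool records are correct; the database holds 16
# truly similar windows.
p, eta, xi = pattern_matching_metrics(10, 14, 16)  # 71.4, 62.5, 67.0
```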
4 Simulation case study: continuous stirred tank reactor example
In order to compare the effect of data compression on pattern matching, a case study was performed for a simulated chemical reactor. A nonlinear continuous stirred tank reactor (CSTR) with cooling jacket dynamics, variable liquid level, and a first-order irreversible reaction, A → B, was simulated. The dynamic model of Russo and Bequette [13], based on the assumptions of perfect mixing and constant physical parameters, was used for the simulation. In the simulation study, white noise was added to several measurements and process variables in order to simulate the variability present in real-world processes [14].
4.1 Generation of recording limits
For the simulation study, 95% Shewhart chart limits were used to calculate the recording limits. The chart limits were constructed using representative data that included small disturbances, as described by Johannesmeyer et al. [14]. The high and low limits for each variable were calculated using these data [14].

The recording limits for each variable were specified by calculating the Shewhart chart limits around the nominal value of each variable. The standard deviation for the i-th process variable, $\sigma_i$, was determined using the methodology described above. Then the recording limit for that variable was set equal to $c\sigma_i$, where c is a scaling factor. The value of c was specified differently for each compression method, as described later. The value of the standard deviation, $\sigma$, for each measured variable is reported by Singhal [11].
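Under this procedure the recording limits reduce to a scaled standard deviation per variable; a sketch (the representative data array and the constant c are placeholders):

```python
import numpy as np


def recording_limits(representative_data, c):
    """Recording limit for each variable: c times its standard deviation,
    estimated from representative data containing small disturbances."""
    sigma = representative_data.std(axis=0, ddof=1)  # one sigma per variable
    return c * sigma
```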
5 Results and discussion
The data compression methods described in Section 2 were compared on the basis of the reconstruction error as well as the compression ratio. The compression ratio (CR) is defined as

$$CR \equiv \frac{\text{No. of data points in original dataset}}{\text{No. of data points in compressed dataset}} \qquad (4)$$

and the mean squared error (MSE) of reconstruction is defined as

$$MSE \equiv \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \epsilon_{i,j}^{2} \qquad (5)$$

where m is the number of measurements in the original dataset, n is the number of variables, and $\epsilon_{i,j} = x_{i,j} - \hat{x}_{i,j}$, where $x_{i,j}$ represents the j-th measurement of the i-th variable in the original data and $\hat{x}_{i,j}$ is the corresponding reconstructed value.
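Equations (4) and (5) translate directly into code; the sketch below assumes the original and reconstructed data are stored as m x n arrays (measurements by variables):

```python
import numpy as np


def compression_ratio(n_points_original, n_points_compressed):
    """Eq. (4): ratio of stored data points before and after compression."""
    return n_points_original / n_points_compressed


def reconstruction_mse(X, X_hat):
    """Eq. (5): mean squared reconstruction error over all m*n entries."""
    return float(np.mean((X - X_hat) ** 2))
```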
If the recording limit constant, c, is the same for all methods, then the resulting compression ratios will be different for each method. These types of results would indicate how effective each method is at compressing data. However, in order to compare the methods with respect to reconstruction accuracy, it is easier to analyze the results if all methods have the same compression ratio. A constant compression ratio requires adjusting the recording limits individually for each method. Because the accuracy of the data reconstruction is a key concern, the recording limits for each method were varied in order to achieve the same compression ratio. As mentioned in the previous section, the recording limits for a given method and each process variable are proportional to the variable's standard deviation. For example, the OSI PI recording limits were chosen as $3\sigma_i$, while the recording limits for the box-car method were adjusted to produce the same compression ratio as the PI method. Thus, the recording limits for the box-car method were $2.23\sigma_i$.
The effectiveness of a compression-reconstruction method was characterized in two ways: (i) reconstruction error, and (ii) degree of similarity between the original data and the reconstructed data. The $S_{PCA}$ and $S_{dist}$ similarity factors were used to quantify the similarity between the original and reconstructed data.
5.1 Comparison of different methods with respect to compression and reconstruction
The different data compression methods were first compared on the basis of reconstruction error. The recording limits for the OSI PI method were set to $3\sigma$ and data compression was performed using PI's proprietary algorithm. The compression ratio was calculated for each of the 28 datasets; the average compression ratio obtained was 14.8. The recording limits for all other methods were then adjusted using numerical root-finding techniques, such as the bisection method, to obtain an average compression ratio of approximately 14.8 for each method (this adjustment is sketched at the end of this subsection). The results presented in Table 1 show that the PI algorithm provides the best reconstruction of the compressed data, while wavelet-based compression is second best. Except for the box-car method, linear interpolation provided better reconstruction than zero-order hold. The common practice of averaging data provides the worst reconstruction.
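Matching the compression ratios amounts to a one-dimensional root-finding problem in c. A sketch using Brent's method (a refinement of bisection) from SciPy; compress_with_constant is a placeholder for any of the compression routines above, returning the number of stored points:

```python
from scipy.optimize import brentq


def find_limit_constant(compress_with_constant, datasets, target_cr,
                        c_low=0.1, c_high=10.0):
    """Find c so that the average compression ratio equals target_cr.
    Assumes the bracket [c_low, c_high] contains a sign change."""
    def cr_error(c):
        ratios = [len(d) / compress_with_constant(d, c) for d in datasets]
        return sum(ratios) / len(ratios) - target_cr
    return brentq(cr_error, c_low, c_high)
```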
5.2 Effect of data compression on pattern matching
Because the present research is concerned with pattern matching, it is important to investigate the effect of data compression on pattern matching. Data compression inevitably affects pattern matching because the original and reconstructed datasets are not the same. In order to evaluate the effect of different compression methods on the effectiveness of the proposed pattern matching methodology, similarity factors between the original data and the reconstructed data were calculated to see how similar the reconstructed dataset was to the original one. For scaling purposes, the original dataset was considered to be the snapshot dataset, while the reconstructed dataset was considered to be the historical dataset. The average values of $S_{PCA}$, $S_{dist}$ and their combination, $SF = 0.67\,S_{PCA} + 0.33\,S_{dist}$, are presented in Table 2. Although the averaging compression method performed worst in terms of reconstruction error (cf. Table 1), it produced compressed datasets that show a high degree of similarity to the original ones, as indicated by high $S_{PCA}$ and $S_{dist}$ values. The wavelet compression method produces low MSE values as well as high $S_{PCA}$ and $S_{dist}$ values. These results demonstrate that wavelet-based compression is very accurate both in terms of reconstruction error and the similarity of the reconstructed and original datasets.
Table 1. Data compression and reconstruction results for the CSTR example for a constant compression ratio.

| Compression method        | Recording limit constant (c) | Reconstruction method | CR    | MSE   |
|---------------------------|------------------------------|-----------------------|-------|-------|
| Box-car                   | 2.2295                       | Linear                | 14.84 | 5.23  |
|                           |                              | Zero-order hold       | 14.84 | 4.91  |
| Backward-slope            | 2.7744                       | Linear                | 14.83 | 4.09  |
|                           |                              | Zero-order hold       | 14.83 | 8.83  |
| Combination               | 2.2003                       | Linear                | 14.86 | 5.28  |
|                           |                              | Zero-order hold       | 14.63 | 7.94  |
| Averaging (over 1.25 min) | NA                           | Linear                | 14.60 | 24.69 |
|                           |                              | Zero-order hold       | 14.63 | 60.35 |
| Wavelet                   | 2.2669                       | Wavelet               | 14.83 | 2.61  |
| PI                        | 3.0                          | PI                    | 14.83 | 0.33  |
Although the PI algorithm produces a very low MSE, it does not represent the data very well for pattern matching. The wavelet method produces both a low MSE and high similarity factor values. The wavelet transform preserves the essential dynamic features of the signal in the detail coefficients while retaining the correlation structure between the variables in the approximation coefficients. These two features of the wavelet transform produce low MSE and high $S_{PCA}$ values between the original and reconstructed data. These features also minimize mean shifts and result in high $S_{dist}$ values. By contrast, the PI method records data very accurately and produces very low MSE values, but its variable sampling rates disrupt the correlation structure between variables and produce low $S_{PCA}$ values. Variable sampling also affects the mean value of the reconstructed data and produces low $S_{dist}$ values. The detailed results for different operating conditions for the CSTR case study are reported by Singhal [11].
5.3 Pattern matching in compressed historical data
The historical data for the CSTR example described in Section 4 were compressed using three different methods: wavelets, averaging, and a combination of the box-car and backward slope methods. The performance of the proposed pattern matching technique for compressed historical data was then evaluated. As described by Singhal [11] and Singhal and Seborg [1], a data window that was the same size as the snapshot data (S) was moved through the historical database, 100 observations at a time (i.e., w = 100). The i-th moving window was denoted as $H_i$. For pattern matching, the compressed data were reconstructed using the linear interpolation method.
The same compression method was used for both the snapshot and historical data. The snapshot data were then scaled to zero mean and unit variance, and the historical data were scaled using the scaling factors for the snapshot data (this step is sketched below). Similarity factors were then calculated for each $H_i$. After the entire database was analyzed for one set of snapshot data, the analysis was repeated for a new snapshot dataset. A total of 28 different snapshot datasets, one for each of the 28 operating conditions, were used for pattern matching [11].
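A sketch of this scaling step (arrays are observations by variables; names are ours):

```python
import numpy as np


def scale_pair(snapshot, window):
    """Autoscale the snapshot to zero mean and unit variance, and scale
    the historical window with the snapshot's own mean and std. dev."""
    mu = snapshot.mean(axis=0)
    sigma = snapshot.std(axis=0, ddof=1)
    return (snapshot - mu) / sigma, (window - mu) / sigma
```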
Table 3 compares the pattern matching results for historical and snapshot data compressed using different methods. The best pattern matching results were obtained when the data were compressed using the wavelet method. The optimum $N_P$ values were determined by choosing the value of $N_P$ for which $\xi$ had the largest value. Table 3 indicates that pattern matching is adversely affected by data compression when the data are compressed using either the averaging method or the combination of box-car and backward slope compression methods. By contrast, wavelet-based compression has very little effect on pattern matching because similar results are obtained for both compressed and uncompressed data. Table 4 presents results for the situation when the snapshot data are not compressed while the historical data are compressed using the wavelet method. The p, $\eta$ and $\xi$ values in Table 4 are slightly lower compared to those in Table 3. Thus, if the historical data are compressed, it may be beneficial to compress the snapshot data as well to obtain better pattern matching.
6 Conclusions
A variety of data compression methods have been compared and evaluated for pattern matching applications using a case study approach. Classical methods such as the box-car, backward slope and data averaging compression methods do not accurately represent data either in terms of reconstruction error or similarity with the original dataset. Data compressed using the PI software represents the original data very accurately, but produces somewhat lower similarity factor values. Compression using the wavelet method produces reconstruction errors that are higher than those obtained with PI, but much lower than those of conventional compression methods such as box-car. Data compressed using wavelets also show a high degree of similarity with the original data.

For pattern matching applications, it is beneficial to compress the snapshot data prior to performing pattern matching.
Table 2. Effect of different data compression and reconstruction methods on pattern matching for the CSTR example.

| Compression method        | Recording limit constant (c) | Reconstruction method | S_PCA | S_dist | SF*  |
|---------------------------|------------------------------|-----------------------|-------|--------|------|
| Box-car                   | 2.2295                       | Linear                | 0.88  | 0.67   | 0.81 |
|                           |                              | Zero-order hold       | 0.87  | 0.83   | 0.86 |
| Backward-slope            | 2.7744                       | Linear                | 0.84  | 0.63   | 0.77 |
|                           |                              | Zero-order hold       | 0.83  | 0.39   | 0.68 |
| Combination               | 2.20025                      | Linear                | 0.87  | 0.67   | 0.80 |
|                           |                              | Zero-order hold       | 0.85  | 0.79   | 0.83 |
| Averaging (over 1.25 min) | NA                           | Linear                | 0.92  | 0.99   | 0.94 |
|                           |                              | Zero-order hold       | 0.93  | 0.97   | 0.94 |
| Wavelet                   | 2.2669                       | Wavelet               | 0.95  | >0.99  | 0.97 |
| PI                        | 3.0                          | PI                    | 0.88  | 0.71   | 0.82 |

* SF = 0.67 S_PCA + 0.33 S_dist
Table 3. Effect of data compression on pattern matching for the CSTR example when both the snapshot and historical data are compressed using the same method.

| Compression method | Similarity factor | Opt. N_P | p (%) | η (%) | η_max (%) | ξ (%) |
|--------------------|-------------------|----------|-------|-------|-----------|-------|
| Original data      | S_PCA only        | 34       | 43    | 90    | 99        | 66    |
|                    | S_dist only       | 25       | 41    | 68    | 97        | 54    |
|                    | SF*               | 14       | 75    | 72    | 88        | 74    |
| Combination        | S_PCA only        | 41       | 30    | 78    | 99        | 54    |
|                    | S_dist only       | 59       | 19    | 75    | 100       | 47    |
|                    | SF*               | 15       | 65    | 67    | 91        | 66    |
| Averaging          | S_PCA only        | 21       | 49    | 65    | 95        | 57    |
|                    | S_dist only       | 24       | 40    | 65    | 96        | 53    |
|                    | SF*               | 17       | 64    | 73    | 92        | 68    |
| Wavelet            | S_PCA only        | 34       | 38    | 82    | 99        | 60    |
|                    | S_dist only       | 52       | 25    | 83    | 100       | 54    |
|                    | SF*               | 16       | 71    | 76    | 92        | 73    |

* SF = 0.67 S_PCA + 0.33 S_dist

Table 4. Effect of data compression on pattern matching when snapshot data are not compressed and historical data are compressed.

| Compression method | Similarity factor | Opt. N_P | p (%) | η (%) | η_max (%) | ξ (%) |
|--------------------|-------------------|----------|-------|-------|-----------|-------|
| Original data      | S_PCA only        | 34       | 43    | 90    | 99        | 66    |
|                    | S_dist only       | 25       | 41    | 68    | 97        | 54    |
|                    | SF*               | 14       | 75    | 72    | 88        | 74    |
| Combination        | S_PCA only        | 48       | 26    | 76    | 100       | 51    |
|                    | S_dist only       | 40       | 25    | 67    | 99        | 46    |
|                    | SF*               | 15       | 59    | 63    | 91        | 61    |
| Averaging          | S_PCA only        | 60       | 23    | 85    | 100       | 54    |
|                    | S_dist only       | 16       | 52    | 57    | 92        | 54    |
|                    | SF*               | 16       | 63    | 70    | 92        | 66    |
| Wavelet            | S_PCA only        | 39       | 31    | 75    | 99        | 53    |
|                    | S_dist only       | 15       | 50    | 53    | 91        | 52    |
|                    | SF*               | 14       | 68    | 67    | 88        | 68    |

* SF = 0.67 S_PCA + 0.33 S_dist
For the simulated case study, data compression had only a minor effect on the effectiveness of a new pattern matching strategy [11].
Acknowledgements
The authors thank OSI Software for providing financial support and the data archiving software PI, and Gregg LeBlanc at OSI for providing software support during the research. Financial support from ChevronTexaco Research and Technology Co. is also acknowledged.
References
[1] Singhal, A. and Seborg, D. E. Pattern Matching in Multivariate Time Series Databases Using a Moving Window Approach. Ind. Eng. Chem. Res., 2002, 41, 3822–3838.
[2] Hale, J. C. and Sellars, H. L. Historical Data Recording for Process Computers. Chemical Eng. Prog., 1981, 77(11), 38–43.
[3] Kennedy, J. P. Building an Industrial Desktop. Chemical Engr., 1996, 103(1), 82–86.
[4] Bristol, E. H. Swinging Door Trending: Adaptive Trend Recording? In Advances in Instrumentation and Control, volume 45, Instrument Society of America, Research Triangle Park, NC, 1990, 749–754.
[5] Mah, R. S. H.; Tamhane, A. C.; Tung, S. H. and Patel, A. N. Process Trending with Piecewise Linear Smoothing. Comput. Chem. Engr., 1995, 19, 129–137.
[6] Bakshi, B. R. and Stephanopoulos, G. Compression of Chemical Process Data Through Functional Approximation and Feature Extraction. AIChE J., 1996, 42, 477–492.
[7] Misra, M.; Kumar, S.; Qin, S. J. and Seemann, D. Error Based Criterion for On-Line Wavelet Data Compression. J. Process Control, 2001, 11, 717–731.
[8] Watson, M. J.; Liakopoulos, A.; Brzakovic, D. and Georgakis, C. A Practical Assessment of Process Data Compression Techniques. Ind. Eng. Chem. Res., 1998, 37, 267–274.
[9] Nelson, P. R. C.; Taylor, P. A. and MacGregor, J. F. Missing Data Methods in PCA and PLS: Score Calculations with Incomplete Observations. Chemometrics and Intel. Lab. Syst., 1996, 19, 45–65.
[10] Roweis, S. EM Algorithms for PCA and SPCA. In Neural Information Processing Systems 11 (NIPS'98), 1997, 626–632.
[11] Singhal, A. Pattern Matching in Multivariate Time-Series Data. Ph.D. Dissertation, University of California, Santa Barbara, CA, 2002.
[12] Krzanowski, W. J. Between-Groups Comparison of Principal Components. J. Amer. Stat. Assoc., 1979, 74(367), 703–707.
[13] Russo, L. P. and Bequette, B. W. Effect of Process Design on the Open-Loop Behavior of a Jacketed Exothermic CSTR. Comput. Chem. Eng., 1996, 20, 417–426.
[14] Johannesmeyer, M. C.; Singhal, A. and Seborg, D. E. Pattern Matching in Historical Data. AIChE J., 2002, 48, 2022–2038.