Journal of Process Control xxx (2017) xxx–xxx

Contents lists available at ScienceDirect

Journal of Process Control

journal homepage: www.elsevier.com/locate/jprocont

Data mining and clustering in chemical process databases for monitoring and knowledge discovery

Michael C. Thomas, Wenbo Zhu, Jose A. Romagnoli*
Cain Department of Chemical Engineering, Louisiana State University, Baton Rouge, LA 70808, United States

a r t i c l e   i n f o

Article history:
Received 26 April 2016
Received in revised form 24 November 2016
Accepted 7 February 2017
Available online xxx

Keywords:
Data mining
Data clustering
Dimensionality reduction
Knowledge discovery

a b s t r a c t

Modern chemical plants maintain large historical databases recording past sensor measurements, which advanced process monitoring techniques analyze to help plant operators and engineers interpret the meaning of live trends in databases. However, many of the best process monitoring methods require data organized into groups before training is possible. In practice, such organization rarely exists, and the time required to create classified training data is an obstacle to the use of advanced process monitoring strategies. Data mining and knowledge discovery techniques drawn from the computer science literature can help engineers find fault states in historical databases and group them together with little detailed knowledge of the process. This study evaluates how several data clustering and feature extraction techniques work together to reveal useful trends in industrial chemical process data. Two studies, on an industrial scale separation tower and on the Tennessee Eastman process simulation, demonstrate data clustering and feature extraction effectively revealing significant process trends from high dimensional, multivariate data. Process knowledge and supervised clustering metrics compare the cluster results against true labels in the data to evaluate the performance of different combinations of dimensionality reduction and data clustering approaches.

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Advancements in computing power and data storage at modern chemical plants have led to the build-up of large amounts of data in historical databases which store sensor measurements from past process behavior. Recent research has led to process monitoring strategies which use the large output of process data to improve process safety and product quality [1-3]. Data based process monitoring requires minimal process knowledge to perform this task, in contrast to model based approaches that require detailed mechanistic models.

Unfortunately, many of the best methods for data-driven fault detection and diagnosis are supervised, meaning that training these algorithms requires data organized into labelled groups, such as "faulty" or "normal". In real plants labelled data is seldom available, and creating properly labelled databases for training process monitoring algorithms can be a time consuming task. This task requires an engineer to assess multiple operating states, a large number of sensors, and data from months or years of operations. It also requires familiarity with the process to judge which measurements are abnormal under differing operating regimes. Reducing the difficulty of this initial step could lower the time and money required to create advanced fault detection and diagnosis systems and expand their application in industrial settings.

Unsupervised learning strategies can help discover groups of data automatically that might otherwise be buried in the sheer volume of data. Approaches to unsupervised learning include dimensionality reduction and data clustering. Learning patterns and extracting information about a process from data clusters or dimensionally reduced data can be called knowledge discovery or data mining. In order to expand the application of supervised process monitoring algorithms, a software framework must be constructed to: a) separate fault data from normal data, b) train a model based on statistics or supervised learning techniques for fault detection, and c) assist with the identification and management of new faults. Each of these tasks must be performed in a way that is simple to understand for non-experts in data science and easy to deploy on multiple units around a plant with low overhead. The normal-faulty knowledge extracted using unsupervised learning can then be exploited to train supervised learning approaches for process monitoring.

Unsupervised learning is a widely studied topic in computer science [4,5] and chemometrics [6], but many clustering techniques beyond K-means have seen relatively limited application in process monitoring situations.

* Corresponding author.
E-mail address: jose@lsu.edu (J.A. Romagnoli).


Process data clustering has previously been shown to be effective in semiconductor manufacturing [7], high speed milling [8], and other applications [9,10]. Research in chemical process monitoring has also used data clustering concepts. Wang and McGreavy [11] performed an early study in clustering chemical process data from a fluid catalytic cracker simulation with a Bayesian automatic classification method. Bhushan and Romagnoli [12] utilized a self-organizing map for unsupervised pattern classification, with an application to a CSTR model for a fault diagnosis problem. Strategies integrating principal component analysis (PCA) and data clustering have also seen success. Maestri et al. [13] developed a fault detection strategy for multiple operating states based on PCA supported by data clustering. Zhu et al. [14] used a k-ICA-PCA modelling method to capture relevant process patterns, with an application to monitoring the Tennessee Eastman process. Singhal and Seborg [15] developed a modified K-means methodology to cluster multivariate time-series data from similarity factors based on PCA. Barragan et al. [16] used a clustering strategy based on a wavelet transform and a novel similarity metric to cluster data from the Tennessee Eastman process, but only studied one process fault. Thornhill et al. [17] studied an approach for visualizing and clustering data based on PCA and hierarchical clustering.
This study uses traditional techniques for dimensionality reduction (DR) and data clustering from the computer science literature to extract faulty data and knowledge about process states from chemical process databases. Instead of focusing on how to detect and diagnose faults, this research focuses on how to create the data sets used to train conventional supervised process monitoring algorithms. We compare how effectively combinations of DR and data clustering techniques recreate fault labels on two case studies: the benchmark Tennessee Eastman process and an industrial separation tower.

An advantage of the workflow we propose is that it is relatively simple to use, because each DR and clustering combination requires the specification of only one or two parameters. Additionally, we expand on previous research by considering innovative and proven clustering techniques such as DBSCAN, BIRCH, and mean shift clustering that have been widely applied in computer science but are untested on fault data discovery. We also study the role of DR because of the prominent role it plays in visualization and feature extraction. As an example, Ding [18] explores the close relationship between unsupervised learning and DR and provides a theoretical basis for the use of PCA to enhance K-means clustering. The DR techniques considered include not only several techniques already adapted to fault detection and process monitoring (principal component analysis (PCA) [19], independent component analysis [20], kernel PCA [21]), but also non-linear manifold preserving techniques like Isomap and spectral embedding.

This paper is organized as follows: Section 2 summarizes our overall approach to data mining; Sections 3 and 4 introduce dimensionality reduction and data clustering respectively, providing a brief introduction to the techniques used in this study; Section 5 discusses how we decided the parameters of the DR and clustering techniques used; Section 6 considers a case study on the Tennessee Eastman process where unsupervised learning is leveraged to discover faults from sets of data; Section 7 studies the clustering of a real event on an industrial scale separation tower; Section 8 contains a brief review of the challenge of time series clustering; and Section 9 concludes and summarizes this research.

2. Data mining approach

Fig. 1 outlines the data mining approach used. First, DR techniques project the raw process data, removing redundant, correlated sensor measurements and combining them into lower dimensional scores. DR may project data to two or three dimensions to enable visualization, or the technique may simply remove redundant information from raw process data. In some cases DR may not be necessary if the data are of good quality.

Fig. 1. Schematic of data mining approach.

After projection by a DR technique, data clustering algorithms partition the data. Xu and Wunsch [8] present a survey of data clustering techniques, but clustering is a subjective process and there is no universal definition of a cluster beyond that clusters consist of groups whose members are more similar to each other than to data from different groups. Depending on the data and the parameters used to calculate the clustering, the clusters found may or may not correspond to significant structures; therefore, cluster evaluation metrics are important to help the user judge the quality of the clusters extracted before more detailed analysis.

Finally, in the cluster assignment step, the user analyses the data in the clusters to relate them to meaningful process events such as faults or operating procedures. When labelled according to process events, the data can be used by machine learning or other supervised fault detection or diagnosis algorithms for training and fitting. Extracting information in this way from databases is called knowledge discovery. The data mining algorithms used in this study were drawn from the Python Scikit-learn module [22], which provides a rich environment of supervised and unsupervised machine learning algorithms; a minimal sketch of the full workflow follows.
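To make the workflow of Fig. 1 concrete, the following is a minimal sketch of the pipeline using Scikit-learn [22]. The data array, the parameter values, and the choice of PCA with DBSCAN are illustrative assumptions, not the exact settings of this study; Section 5 describes the rules actually used to fix parameters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

# X: (n_samples, n_sensors) array from the historian; y_true exists only
# in benchmark studies where the fault labels are already known.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # stand-in for historian data
y_true = rng.integers(0, 2, size=1000)   # stand-in labels

# 1) Normalize each sensor to zero mean and unit variance.
scores = StandardScaler().fit_transform(X)

# 2) Dimensionality reduction: keep enough PCs for 95% of the variance.
scores = PCA(n_components=0.95).fit_transform(scores)

# 3) Cluster the projected scores; label -1 marks points DBSCAN calls noise.
labels = DBSCAN(eps=2.0, min_samples=10).fit_predict(scores)

# 4) Supervised evaluation, possible only when true labels are available.
print("clusters found:", len(set(labels) - {-1}))
print("ARI:", adjusted_rand_score(y_true, labels))
```

Any DR technique from Section 3 and any clustering technique from Section 4 can be substituted in steps 2 and 3 without changing the surrounding workflow.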
3. Dimensionality reduction

Dimensionality reduction is an important data mining step because it addresses the "curse of dimensionality". High dimensional spaces lead to problems such as the empty space phenomenon (increasing dimensionality increases volume such that the available data becomes sparse), the weaker discrimination power of metrics like Euclidean distance, and correlations between variables [25]. The dimensionality reduction methods considered here were chosen based on their characteristics and computational costs. PCA is the most commonly used dimensionality reduction technique and has numerous successful applications in statistical process monitoring [19,23]. ICA and KPCA have been successfully adapted to

Fig. 2. Illustration of the various clustering metrics.

process monitoring [20,21], so it is natural to evaluate them for data mining tasks. Isomap and spectral embedding are more recently developed non-linear manifold learning DR techniques that can be more sensitive to non-linear structures in the data than PCA, ICA, or KPCA, which are based on matrix decompositions. As a benchmark, we also calculate clustering results obtained without any DR. A comparative study by van der Maaten et al. [24] revealed that while advanced nonlinear dimensionality reduction outperformed PCA and others at reducing data sets with artificial structure, on other natural data sets not artificially generated by computers the DR techniques considered failed to provide any advantage at all. This diverse environment of techniques calls for a comparative study to guide judgement in selecting which is most appropriate for a given application.

A brief discussion of each DR technique considered in this work follows.

3.1. Principal component analysis

Principal component analysis (PCA) is a linear distance preservation technique which determines a set of orthogonal vectors that optimally capture the variability of the data. The orthogonal vectors are determined through an eigenvalue decomposition of the covariance matrix and are arranged in order of the variance explained in the loading vector directions [19].

PCA reduces the data as follows. For a given training set of n observations and m sensor measurements stored in X_{n×m}, the sample covariance matrix S can be calculated as:

S = \frac{1}{n-1} X^T X = P \Lambda P^T

By finding the eigenvalues (\Lambda) of the covariance matrix S, the projections of the observations in X into the lower-dimensional space are calculated in the score matrix:

T = XP

In our dimensionality reduction studies we determine the number of principal components (PCs) using the percent variance test. The percent variance test calculates the smallest number of loading vectors needed to explain a minimum percentage of the total variance.


Fig. 3. Process schematic with control scheme.

Our models include enough principal components to model 95% of the variance of the original data [19].
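A sketch of the percent variance test as a helper function follows; the function name and stand-in data are ours, and Scikit-learn's PCA also accepts a fractional n_components that applies the same rule internally.

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_percent_variance(X, threshold=0.95):
    """Smallest number of loading vectors explaining `threshold` of
    the total variance (the percent variance test)."""
    pca = PCA().fit(X)                                  # full decomposition
    cum = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cum, threshold) + 1)

X_scaled = np.random.default_rng(0).normal(size=(500, 20))  # stand-in data
n_pcs = n_components_percent_variance(X_scaled)
scores = PCA(n_components=n_pcs).fit_transform(X_scaled)    # T = XP
```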
3.2. Independent component analysis (ICA)

Independent component analysis (ICA) is used in multivariate signal separation for extracting hidden and statistically independent components (ICs) from the observed data, and has been adapted for process monitoring tasks similar to PCA [20,26]. Signal source separation recovers the independent signals after linear mixing. In other words, for mixed signals represented by x, independent signal sources s are linearly mixed by the matrix A:

x = \sum_{k=1}^{n} s_k a_k = As

The goal of ICA is to find the source signals s using the information stored in x. After the matrix A is estimated, its inverse W (W = A^{-1}) can be used to recover the original independent source signals by:

s = Wx

The matrix W can be found using the FastICA algorithm of Hyvärinen and Oja [27]. In separating signals, the two key assumptions that ICA makes are that the source signals are independent of each other and that the values in each source signal have non-Gaussian distributions. W is calculated through fixed point iteration to find components with the maximum non-gaussianity, measured using negentropy [27].

ICA requires us to specify the number of ICs to use in the dimensionality reduction. A simple option given by Lee [26] suggests that the number of PCs found by PCA can provide a good estimate of the number of ICs necessary. Indeed, in the fault detection review performed by Yin et al. [28], ICA and PCA found identical numbers of ICs and PCs from TEP data. In this study, we use the same number of ICs as PCs in our PCA model.

3.3. Kernel principal component analysis (KPCA)

Kernel principal component analysis extends traditional principal component analysis to nonlinear data spaces. Instead of directly taking an eigenvalue decomposition of the covariance matrix like PCA, KPCA takes a data set with non-linear features that PCA fails to preserve and projects it to a higher dimensional space where the features vary linearly. KPCA first assumes that the data have been transformed non-linearly using a non-linear mapping function \Phi(x). Conventional PCA is then performed in the feature space to perform the transformations for dimensionality reduction [29]. The kernel matrix K is defined as:

K_{ij} := \langle \Phi(x_i), \Phi(x_j) \rangle

We then substitute the kernel function k(x, y) for all occurrences of \langle \Phi(x), \Phi(y) \rangle, which allows us to calculate dot products under the non-linear mapping \Phi without knowing its form. In this work, we use the radial basis kernel k(x, y) = \exp(-\|x - y\|^2 / c). After this substitution, the data can be reduced non-linearly using an eigenvalue decomposition in a similar manner as PCA.

As with other non-linear dimensionality reduction methods (e.g. Isomap), a weakness of KPCA is the high computational cost of calculating the kernel matrix. To determine the number of PCs, in accordance with Lee et al. [21], we applied the cut-off method using the average eigenvalue, which only includes PCs with eigenvalues above the average.
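The sketch below spells out this KPCA variant with NumPy so that the centering of the kernel matrix and the average-eigenvalue cut-off are explicit. The median-based choice of the kernel width c is an assumption made for illustration, not a value from the study; Scikit-learn's KernelPCA offers an equivalent projection.

```python
import numpy as np

def kpca_average_eigenvalue(X, c=None):
    """Kernel PCA with the RBF kernel k(x, y) = exp(-||x - y||^2 / c),
    keeping only components whose eigenvalue exceeds the average."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise ||x-y||^2
    if c is None:
        c = np.median(sq[sq > 0])      # illustrative kernel-width heuristic
    K = np.exp(-sq / c)
    J = np.ones((n, n)) / n
    Kc = K - J @ K - K @ J + J @ K @ J   # center K in feature space
    w, V = np.linalg.eigh(Kc)            # ascending eigenvalues
    w, V = w[::-1], V[:, ::-1]
    keep = w > w[w > 1e-12].mean()       # average-eigenvalue cut-off
    return Kc @ V[:, keep] / np.sqrt(w[keep])  # non-linear score vectors
```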


Fig. 4. (a) Data from normal TEP operation and Fault 1 projected to 3 principal components. (b) DBSCAN clustering results identify the separate clusters formed by steady state operations and tag much of the transition as noise. (c) and (d) show how the clusters can be used with time series plots of sensor measurements to identify the dominant behavior of each cluster.

3.4. Isomap

Isomap performs nonlinear dimensionality reduction by estimating the intrinsic geometry of the data manifold using graph distances instead of Euclidean distances. A manifold, in this case, is essentially the underlying support of a data distribution known only through a finite sample [25]. To project data, Isomap creates a topology preserving network and uses it to find the shortest path graph distance between any two points on the network. The graph distances are then used to create a geometry preserving map of the observations in the lower dimensional space [30]. Isometric feature mapping finds the map that preserves the global, nonlinear geometry of the data. Two kinds of points are defined in Isomap: neighboring points and faraway points. For neighboring point pairs, the path between them is estimated by the Euclidean distance, while distances between faraway point pairs are estimated by adding the distances of short hops between neighboring points.

For a given data set, Isomap first builds the neighborhood graph by randomly selecting r points to be nodes in the graph and using their nearest neighbors to form connections between all points within radius ε. Next, it calculates the graph distances between nodes by first assigning weights to connections using the Euclidean distance and summing link weights along the shortest path between the nodes. In the final steps, Isomap finds the lower dimensional embedding based on the graph distances d_G^{ij} using ordinal multidimensional scaling (MDS). MDS computes the feature vectors in a lower dimensional space that minimizes the stress function:

S = \min \sqrt{ \frac{ \sum_{i<j} \left( d_Y^{ij} - \hat{d}_G^{ij} \right)^2 }{ \sum_{i<j} \left( d_Y^{ij} \right)^2 } }

where d_Y^{ij} is the Euclidean distance between feature vectors i and j, and \hat{d}_G^{ij} is the primary monotone least-squares regression of the data's pairwise Euclidean distances and the graph distances [31]. In this work, the number of components found by PCA was used to set the number of Isomap components.

3.5. Spectral embedding (SE)

Spectral embedding, also known as Laplacian eigenmaps, is a non-linear dimensionality reduction technique that uses the Laplacian of a graph based on the topology of the data set to non-linearly project data in a way that optimally preserves local neighborhood information and emphasizes clustering structures in the data [32].

The adjacency graph used by SE is created using pairwise distances between data points on a k nearest neighbor graph. Once the graph is defined, we calculate the non-linear embedding of the data with an eigenvalue decomposition:

L f = \lambda D f

Fig. 5. Time series plots of cluster membership for selected DR and clustering results. In (a), clusters found by DBSCAN largely correspond to the original sets from the TEP simulation, except that some data are dismissed as noise; (b) shows DBSCAN on data reduced by PCA, which lost the information separating Faults 4 and 14 from normal in the projection; (c) shows k-means with no DR, where large groups of data are incorrectly associated with normal; in (d), the result of DBSCAN on spectral embedding, a high ARI was calculated, but some fault groups are divided into multiple clusters.

where L is the graph Laplacian and D is the degree matrix calculated by summing the rows of the weighted affinity matrix [33]. The embedding of the i-th data point in m dimensional Euclidean space comes from the m eigenvectors:

x_i \mapsto \left( f_1(i), \ldots, f_m(i) \right)

The eigendecomposition of the graph Laplacian has previously been used for clustering in spectral clustering algorithms. The clustering methods developed by Shi and Malik [34] and Ng et al. [35], among others, begin by performing an eigenvalue decomposition on the graph Laplacian to embed the data, followed by K-means clustering to generate clusters. Within the subspace clustering framework studied here, spectral embedding is used for dimensionality reduction while several different techniques are tested for the clustering step. As with ICA and Isomap, the number of PCs used by the PCA model was used to determine the number of score variables output by SE.
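Both manifold techniques are available in scikit-learn's manifold module; a brief sketch follows, where the number of components is borrowed from the PCA model as described above and the neighborhood size and stand-in data are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import Isomap, SpectralEmbedding

X_scaled = np.random.default_rng(0).normal(size=(500, 20))  # stand-in data
n_comp = 7   # e.g., the number of PCs retained by PCA (cf. Table 3)

# Isomap: shortest-path graph distances followed by an MDS embedding.
Z_iso = Isomap(n_neighbors=10, n_components=n_comp).fit_transform(X_scaled)

# Spectral embedding: eigenvectors of the graph Laplacian (Lf = lambda*Df).
Z_se = SpectralEmbedding(n_components=n_comp,
                         n_neighbors=10).fit_transform(X_scaled)
```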
4. Data clustering

The goal of data clustering is the unsupervised classification of data into approximately homogeneous groups, or clusters, based on a chosen similarity measure, such that the similarity between objects within a subgroup is larger than the similarity between objects belonging to different subgroups [4]. Using data clustering, key events within process historical databases can be identified and connected to meaningful operating conditions in a plant. The techniques k-means, DBSCAN, mean shift, and BIRCH were selected for their scalability and their ability to find clusters corresponding to regions of high density in data. The latter three techniques locate clusters based on density, and therefore have the ability to find clusters of any arbitrary shape. It should be noted that all techniques applied here consider each data measurement to be independent of time, an assumption with disadvantages discussed in Section 8.

4.1. K-means clustering

K-means begins by choosing K initial centroids, corresponding to the number of clusters desired. Each point in the dataset is assigned to the closest centroid, and the centroid of each cluster is updated each iteration based on the points assigned to the cluster. K-means only finds spherical clusters in data [36]. K-means is still one of the most widely used algorithms for clustering due to its simplicity and efficiency [9], and was recognized as one of the top 10 data mining algorithms by IEEE [37].

Given a specified number of clusters in a sample of data, the K-means algorithm seeks to minimize the sum of squared error (SSE) between each mean \mu_k of each cluster c_k, given by:

SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \left\| x_i - \mu_k \right\|^2

Minimizing the SSE objective function is an NP-hard problem, thus K-means will converge to a local minimum [9]. The implementation used here randomly chooses k points from the data set and


Fig. 6. Projections of the data can give insights into the behavior of different faults and into what the clustering algorithms are locating. A PCA projection of the data colored by true fault group is shown in (a). Zooming into the data grouped around normal in (b) shows that several faults occur exclusively near and around normal operations. Clustering with k-means (c) and DBSCAN (d) locates some groups, though issues with unrelated data and noise data remain challenges.

Fig. 7. Simplified process flow sheet.

sets these points as the initial centroids of the specified number of clusters the algorithm must find. All points are assigned to clusters by associating with the nearest centroid. The centroid of each cluster is then recalculated and becomes the new mean. Again, all points are assigned to clusters by associating with the nearest new mean. The above steps are repeated until convergence is achieved.
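The loop just described is Lloyd's iteration; a compact NumPy sketch follows. The study itself drew its clustering implementations from Scikit-learn [22], whose KMeans class adds smarter initialization and multiple restarts.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration matching the description above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Assign each point to the nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # converged to a local minimum
            break
        centroids = new
    return labels, centroids

# Example: labels, centers = kmeans(X_scaled, k=8)
```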
4.2. DBSCAN

Density-based clustering locates regions of high density of any shape that are separated from one another by regions of low density. DBSCAN divides all the data in a set into three groups: core points making up the body of a cluster, border points in more diffuse regions of the data but near a few core points, and noise points


Fig. 8. Projections of 9 months of data from the separation tower using (a) PCA, (b) ICA, (c) Isomap, and (d) SE. PCA and ICA preserve the density of normal operations, while Isomap bursts normal operations into smaller groups. SE separates the data well, but the projection is less intuitive.

far from concentrated groups of data. DBSCAN is ideally suited to clustering chemical process data, since clusters in real data sets are rarely spherical and are often clouded by noise [36].

Within the DBSCAN algorithm, a cluster is defined as a set of density-connected points separated from other clusters by regions of lower object density. Two parameters are used to determine clusters in DBSCAN: ε, the distance threshold, and minPts, the minimum number of points to form a cluster. Given ε and minPts, each point is categorized as a core, border, or noise point. A point is a core point if there are more than the specified number of points (minPts) within the distance threshold (ε). A data point is a border point if the number of points within its neighborhood is less than minPts but it lies within the neighborhood of a core point. Any points not included in the core and border points are defined as noise [38].

DBSCAN starts with an arbitrary initial point that has not been visited. If the point satisfies the definition of a core point, a cluster begins. All points that are found within the ε-neighborhood are added to this cluster. The process continues until the cluster is completely developed. Then, a new unvisited point is processed. Due to the nature of the algorithm, it is robust to noisy data, since small numbers of outliers can be automatically defined as noise.

4.3. BIRCH

Different from the previously discussed clustering methods, BIRCH finds subclusters of data and condenses them into a three element data vector. The Clustering Feature (CF) vectors of the subclusters are combined using a specialized tree structure to generate the output clusters. This approach affords BIRCH advantages over other clustering techniques, particularly in a computing environment with limited memory resources [39]. The combined CF tree and vector approach treats dense areas of data as a single vector, and can allow BIRCH to obtain a clustering of the data with one scan of the input. In other words, the decisions to combine and split clusters are made locally, without the need to calculate the pairwise distances over the entire data set.

The CF vector representing an individual subcluster is given by CF = (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the N data points, \sum_{i=1}^{N} X_i, and SS is the square sum, \sum_{i=1}^{N} X_i^2. CF vectors allow BIRCH to accurately and incrementally combine subclusters of data within the CF tree structure [39].

A new data point or subcluster X_i is added to the tree by starting at the root and descending the CF tree by choosing the closest branch based on a chosen distance metric, such as the Euclidean distance. At a leaf node, BIRCH first tests whether X_i can be incorporated into one of the other subclusters present on the leaf. If the leaf node can absorb X_i, the leaf node's CF vector is updated with the data from X_i; if not, a new split node is created at the leaf using the farthest pair of entries as seeds. Additional operations refine the clusters that arise from this approach, as described in [39].

4.4. Mean shift clustering

Mean shift clustering is a probability density formulation of the clustering problem. As an example, in two dimensions probability density functions (PDFs) form structures like mountains around clusters, with slopes rising to a peak at the area of highest density in the cluster. Mean shift clustering uses the theory of probability density estimation to create
Fig. 9. Different clustering results from Table 8 projected using PCA onto the first 3 PCs. It must be noted that higher dimensional data was used for clustering, not just the 3 PCs shown in this figure. In (a) and (b), k-means with k = 2 and 3 respectively both split the normal cluster; in the k = 2 case it erroneously grouped faulty and normal data together. K-means with k = 3 in (b), DBSCAN in (c), and mean shift in (d) each successfully assigned the faulty data its own cluster.

Fig. 10. Adjusting the eps parameter in DBSCAN finds the smaller clusters in (a). In (b), coloring the resulting data based on time (blue is older, yellow is newer, with dark red being the most recent) reveals that each use of the separation tower at this grade forms a distinct new cluster, which poses a challenge to the training of a data model. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

clusters using the mean shift procedure first introduced by Fukunaga and Hostetler [40].

The most widely used nonparametric technique for estimating the PDF of a set of data is kernel density estimation. For data X_i, i = 1, \ldots, n, the multivariate kernel density estimator with kernel K is defined as:

\hat{f}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K \left( \frac{x - X_i}{h} \right)

where K is a kernel function, h is the bandwidth or window width, and d is the number of dimensions in the data [41]. The bandwidth is a crucial parameter for mean shift clustering. In our application we obtained acceptable results using a heuristic based on the median of the pairwise distances. The selection of the kernel also affects results, but the common normal or Gaussian kernel K_N is usually most effective.
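In scikit-learn the bandwidth rule and the clustering step look as follows; the quantile value and stand-in data are assumptions, with quantile = 0.5 standing in for the median-of-pairwise-distances heuristic just described.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X_scaled = np.random.default_rng(0).normal(size=(500, 10))  # stand-in data

# estimate_bandwidth averages each point's distance to its neighbours inside
# the given quantile of pairwise distances; 0.5 approximates the median rule.
h = estimate_bandwidth(X_scaled, quantile=0.5)

# Each point is shifted uphill on the KDE surface; points converging to the
# same mode form one cluster.
labels = MeanShift(bandwidth=h).fit_predict(X_scaled)
```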

Mean shift clustering first finds the zeros of \nabla \hat{f}(x), which correspond to local maxima and minima of the PDF. Towards this goal, we used the mean shift vector:

m(x) = \frac{ \sum_{i=1}^{n} x_i \, g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right) }{ \sum_{i=1}^{n} g\!\left( \left\| \frac{x - x_i}{h} \right\|^2 \right) } - x

where g is defined as g(x) = -k'(x), which arises from the shadow of the kernel K, a concept introduced by Cheng [42]. The resulting mean shift vector always points towards the direction of maximum increase in probability density, and following the mean shift vector through the PDF leads us to a zero of \nabla \hat{f}(x). Derivations and additional details about the mean shift procedure can be found in Fukunaga and Hostetler [40] and Cheng [42].

In the mean shift clustering approach presented by Comaniciu and Meer [43], clusters arise from this mode seeking process. All the points visited by each execution of the mean shift procedure are associated with the cluster corresponding to the local maximum the procedure converges upon.

4.5. Supervised cluster evaluation

Almost every clustering algorithm will find clusters in a data set, even if that data set has no natural cluster structure. Cluster evaluation metrics are important to give an idea of the validity of a given clustering generated by an algorithm. This study uses four cluster evaluation metrics: homogeneity, completeness, V-measure [44], and Adjusted Rand Index (ARI) [45]. Each metric gives the output of a clustering algorithm a score from 0 (corresponding to a poor or random clustering) to 1, calculated from the cluster labels and the correct labels of the data, meaning that these metrics can be considered supervised. A brief summary of these clustering metrics is given in Table 1.

Table 1
Clustering metrics used to compare the quality of clustering results.

Metric          Equals 1 when...
Homogeneity     All clusters contain data from a single class
Completeness    Members of a class are elements of the same cluster
V-measure       Harmonic mean of homogeneity and completeness
ARI (-1 to 1)   Cluster labels match true labels (0 for random labelling)

Homogeneity determines whether the data points in the same cluster are from a single class. The score is between 0 and 1, where 1 represents perfectly homogeneous labelling. Completeness determines whether data from the same single group are assigned to the same cluster. The score is between 0 and 1, where 1 represents perfectly complete labelling. V-measure is the harmonic mean of homogeneity and completeness. Formulas for homogeneity, completeness, and V-measure are:

h = 1 - \frac{H(C|K)}{H(C)}

c = 1 - \frac{H(K|C)}{H(K)}

v = 2 \frac{hc}{h + c}

where H(C|K) is the conditional entropy of the classes given the cluster assignments, and H(C) is the entropy of the classes:

H(C|K) = -\sum_{c=1}^{|C|} \sum_{k=1}^{|K|} \frac{n_{c,k}}{n} \log \left( \frac{n_{c,k}}{n_k} \right)

H(C) = -\sum_{c=1}^{|C|} \frac{n_c}{n} \log \left( \frac{n_c}{n} \right)

where n is the total number of data points, n_{c,k} is the number of data points that are members of class c and cluster k, and n_c and n_k are the number of data points in class c and cluster k respectively.

Finally, we judge the similarity of a given clustering result to the true clusters of data using the Adjusted Rand Index (ARI) of Hubert and Arabie [45]. In this study we calculate the ARI by the equation

ARI = \frac{ \sum_{i,j} \binom{n_{ij}}{2} - \left[ \sum_{i} \binom{n_{i.}}{2} \sum_{j} \binom{n_{.j}}{2} \right] / \binom{n}{2} }{ \frac{1}{2} \left[ \sum_{i} \binom{n_{i.}}{2} + \sum_{j} \binom{n_{.j}}{2} \right] - \left[ \sum_{i} \binom{n_{i.}}{2} \sum_{j} \binom{n_{.j}}{2} \right] / \binom{n}{2} }

where two clusterings of the data, U and V, are compared. The indices i and j refer to groups within U and V respectively: n_{i.} and n_{.j} refer to the number of objects in class u_i in U and class v_j in V. The contingency table in Hubert and Arabie [45] fully illustrates this notation.

Fig. 2 illustrates the meaning of the various clustering metrics using the Fisher Iris data set [46]. The Iris data is composed of 3 species of flowers, and Fig. 2 gives several possible clustering results found by K-means, similar to good and bad clustering results: 1. all data in one cluster, 2. data separated into many clusters, and 3. data separated into 3 clusters (which closely match the true clusters). The sample cases in Fig. 2 illustrate how homogeneity and completeness give us insight into how the clusters separate the data, particularly their relative size to each other, and whether they might be too general or too specific. ARI gives an evaluation of how accurately the clustering results capture the true grouping of the data (assuming the true clusters are known).

5. Specification of parameters

Each DR and clustering technique had one or more parameters which needed to be specified. The determination of parameters has a big effect on clustering performance, and finding parameters is often a trial and error process. Here we use consistent rules to determine cluster parameters to avoid tuning to the data. In order to set some benchmark in finding cluster parameters, in some cases we did not attempt to avoid using our knowledge of the sets, such as for k-means, where we took the number of clusters to be the number of groups known to be in the data. All data studied were first normalized to zero mean and unit variance before DR and clustering. The parameters used for each clustering technique were:

- DBSCAN: minPts was fixed at 10, while eps was determined using the k-nearest neighbors (kNN) graph as suggested by Ester et al. [38]. We used the 95th and 70th percentiles for the tower and TEP data sets respectively.
- BIRCH: the cluster diameter threshold was determined heuristically using the eps parameter found from the DBSCAN kNN result. The branching factor had a limited effect on results and was set at 50.
- K-means: the number of clusters was fixed as the number of TEP faults that were known to be in the data (8 for the reduced set, 20

for the full set). In the tower data set, results were obtained for 2 and 3 clusters.
- Mean shift: the bin width parameter was fixed using the method of Comaniciu and Meer [43], meaning that our mean shift implementation had no parameter that needed to be tuned by the user, an advantage for unsupervised learning tasks.

The dimensionality reduction techniques used also required some parameters to be specified:

- PCA: cross validation was used such that 95% of the variance was preserved in the projection.
- KPCA: we used the average eigenvalue approach utilized by Lee et al. [21], which accepts all components with eigenvalues above the average eigenvalue.
- ICA, Isomap, and SE each required an estimate of the intrinsic dimensionality for the projection. In practice this can be difficult to estimate, so as a benchmark we used the same number of components as PCA.

Table 2
TEP process fault descriptions.

Fault No.   Description                                                Type
1           A/C feed ratio, B composition constant (Stream 4)          Step
2           B composition, A/C ratio constant (Stream 4)               Step
3           D feed temperature (Stream 2)                              Step
4           Reactor cooling water inlet temperature                    Step
5           Condenser cooling water inlet temperature                  Step
6           A feed loss (Stream 1)                                     Step
7           C header pressure loss - reduced availability (Stream 4)   Step
8           A, B, C feed composition (Stream 4)                        Random variation
9           D feed temperature (Stream 2)                              Random variation
10          C feed temperature (Stream 4)                              Random variation
11          Reactor cooling water inlet temperature                    Random variation
12          Condenser cooling water inlet temperature                  Random variation
13          Reaction kinetics                                          Slow drift
14          Reactor cooling water valve                                Sticking
15          Condenser cooling water valve                              Sticking
16-20       Unknown                                                    Unknown

The clustering results in the following sections present the ARI (accuracy) and other metrics of clusterings of process data using the parameter specifications above. It should be noted that in a clustering study without any a priori knowledge of the true data groups, ARI cannot be calculated. Without true cluster labels, only unsupervised clustering metrics like the Davies-Bouldin Index [47] provide a benchmark for the usefulness of a clustering of the data.
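The DBSCAN and BIRCH rules in the list above can be written compactly; the sketch below exposes the percentile so the 95th (tower) or 70th (TEP) choice can be passed in. The helper name and the stand-in data are ours.

```python
import numpy as np
from sklearn.cluster import DBSCAN, Birch
from sklearn.neighbors import NearestNeighbors

def eps_from_knn(X, min_pts=10, percentile=95):
    """Pick eps from the kNN-distance graph (Ester et al. [38]): a chosen
    percentile of each point's distance to its min_pts-th neighbour."""
    # n_neighbors includes the query point itself, hence the +1.
    nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
    d, _ = nn.kneighbors(X)
    return np.percentile(d[:, -1], percentile)

X_scaled = np.random.default_rng(0).normal(size=(500, 10))  # stand-in data

eps = eps_from_knn(X_scaled, min_pts=10, percentile=95)
db_labels = DBSCAN(eps=eps, min_samples=10).fit_predict(X_scaled)

# BIRCH reuses the same radius as its subcluster diameter threshold; the
# branching factor had little effect on results and is fixed at 50.
birch_labels = Birch(threshold=eps, branching_factor=50,
                     n_clusters=None).fit_predict(X_scaled)
```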
6. Case 1: Tennessee Eastman Process

6.1. Tennessee Eastman Process (TEP) description

First introduced as a process control challenge problem by Downs and Vogel [48], the TEP is a realistic simulation based on a real chemical process and has become an important benchmark in the evaluation of process control and process monitoring techniques. The process uses four gaseous reactants (A, C, D, and E) to produce two liquid products (G and H) in competition with an unwanted byproduct (F) and in the presence of an inert (B). There are four competing reactions related to temperature through Arrhenius laws, two producing the desired products, the others producing the byproduct. The entire simulation includes five unit operations: a reactor, condenser, separator, compressor, and a stripper, and has 41 measured variables along with 12 manipulated variables for a total of 53 process variables. Downs and Vogel [48] also defined 20 disturbances or faults for the challenge process (see Table 2).

The data was generated using the control scheme and Simulink code from Ricker [49], who applied a decentralized control strategy to the process which involved partitioning the plant into subunits and the creation of a controller for each. The process flow sheet and control scheme is given in Fig. 3.

In initial clustering studies including faults with random variations like Fault 13 (slow drift in kinetics), the dynamical faults primarily contributed noise without forming the dense groups that can be found by data clustering. Therefore, in addition to the full data set with all fault data sets, we studied a reduced dataset in more detail. The reduced dataset primarily contained step change faults with fewer time varying dynamics that could obfuscate our analysis. The reduced data set consists of data from faults 0, 1, 2, 4, 6, 7, 8, and 14, with 8 and 14 possessing time varying dynamics that might be more challenging to learn than the step faults.

The goal in our study of clustering and dimensionality reduction on the TEP is to study how effectively different approaches to clustering extract different process operating regimes from the data. In contrast to the separation tower study discussed later, where the data contain one large normal cluster and a much smaller fault cluster, the data set studied here contains a variety of different behaviors including step changes, random variations, and other plant trends. An additional advantage to using the simulation is that the ground truth classes are known with certainty, again enabling the use of supervised cluster evaluation metrics.

6.2. Normal and fault 1: basic case

We first demonstrate our proposed approach to clustering in a simple case with two operating regimes: normal operations and Fault 1, which is a step change in one of the feeds to the process. Here, we would expect the clustering algorithm to reproduce the groups in the data that might be difficult to find without projection and clustering. We project the data using PCA and cluster with DBSCAN for this example, but any technique in this paper could accomplish this task.

Fig. 4a projects the data to 3 principal components, showing the dense cluster corresponding to normal operations as well as the transition to Fault 1's steady state. Fig. 4b shows the result of DBSCAN applied to the data set. Blue corresponds to the faulty data, green is the normal cluster, and the black data is identified by the DBSCAN algorithm as noise. Fig. 4c and d point to how to use clustering results to reveal knowledge from data sets. Coloring a time series plot of the evolution of the Stream 4 flow with the clusters found by DBSCAN, the green, blue, and noise (black) groups can be clearly connected to changes in Stream 4 and studied as separate process states. This result can reveal the dominant behavior of each cluster and, if the ultimate goal is the creation of a process monitoring algorithm, be exploited to separate normal and faulty groups for training.

6.3. Reduced TEP data set

We consider a reduced data set consisting of TEP faults 0, 1, 2, 4, 6, 7, 8, and 14 because the results are simpler to analyze and visualize than a data set including all 20 TEP faults studied in Section 6.4.
Table 3
Number of components used by DR projections in clustering TEP data.

PCA ICA KPCA Isomap SE

Reduced 7 7 8 7 7
Full 9 9 11 9 9

Table 4
Clustering Results on reduced TEP data set.

Reduced data set

DBSCAN K-means Mean Shift BIRCH

NO DR Homogeneity 0.86 0.60 0.43 0.56


Completeness 0.86 0.81 0.79 0.85
V-measure 0.86 0.69 0.56 0.68
ARI 0.81 0.40 0.24 0.36

DBSCAN K-means Mean Shift BIRCH


PCA Homogeneity 0.66 0.60 0.57 0.58
Completeness 0.88 0.81 0.81 0.85
V-measure 0.76 0.69 0.67 0.69
ARI 0.53 0.41 0.38 0.39

DBSCAN K-means Mean Shift BIRCH


ICA Homogeneity 0.67 0.69 0.44 0.69
Completeness 0.85 0.82 0.73 0.83
V-measure 0.75 0.75 0.54 0.75
ARI 0.52 0.56 0.24 0.56

DBSCAN K-means Mean Shift BIRCH


KPCA Homogeneity 0.65 0.68 0.36 0.69
Completeness 0.68 0.78 0.71 0.79
V-measure 0.66 0.72 0.48 0.74
ARI 0.45 0.47 0.17 0.49

DBSCAN K-means Mean Shift BIRCH


Isomap Homogeneity 0.80 0.70 0.53 0.71
Completeness 0.87 0.81 0.83 0.89
V-measure 0.83 0.75 0.65 0.79
ARI 0.72 0.58 0.37 0.57

DBSCAN K-means Mean Shift BIRCH


SE Homogeneity 0.81 0.81 0.33 0.74
Completeness 0.73 0.86 0.58 0.79
V-measure 0.77 0.84 0.42 0.76
ARI 0.65 0.68 0.15 0.61

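The entries in Tables 4, 5 and 8 correspond to standard functions in sklearn.metrics; the sketch below shows how one table cell is computed from true fault labels and cluster labels. Keeping DBSCAN's noise label (-1) as its own group is one reasonable convention and is an assumption on our part.

```python
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score, adjusted_rand_score)

def cluster_report(y_true, labels):
    """Supervised cluster evaluation metrics used throughout this study."""
    return {
        "Homogeneity":  homogeneity_score(y_true, labels),
        "Completeness": completeness_score(y_true, labels),
        "V-measure":    v_measure_score(y_true, labels),
        "ARI":          adjusted_rand_score(y_true, labels),
    }

# Example: cluster_report(fault_labels, db_labels)
```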
Most of these faults are step changes that are simpler to detect, but faults 8 and 14 express dynamic, non-linear variations that do not form into a dense group and might disrupt the clustering of other data. All data studied were first normalized to zero mean and unit variance with respect to normal operations, which is common with plant data due to the overabundance of data from normal operations. Table 3 gives the number of components used by each DR technique in calculating clustering results, according to the heuristics given in Section 5. Clustering results were also calculated without using any DR to evaluate how important dimensionality reduction is to preserving the clustering structures in the data.

Table 4 gives the cluster evaluation results of the four data clustering techniques and the DR techniques (including a No DR case where we cluster the data with all variables monitored). The best homogeneity, completeness, V-measure, and ARI observed on each data set is bolded. The highest ARI, and therefore the most accurate reconstruction of the original TEP classes, was obtained by DBSCAN working without dimensionality reduction. DBSCAN successfully found groups of all step faults except Fault 7, and even grouped data from valve sticking in Fault 14. Data from Fault 8 (random feed composition variations) was all classified as noise, which may favorably skew the clustering metrics. In some cases k-means was close to or better than DBSCAN, but overall DBSCAN performed the best over most DR techniques. Mean shift clustering performed relatively poorly compared to the other clustering techniques across all DR methods. Based on the mean shift clustering's relatively low homogeneity and high completeness, we can deduce that the algorithm frequently groups together unrelated data, a disadvantage shared by k-means. This behavior is visualized in Fig. 6c.

While the highest ARI observed was calculated by DBSCAN applied to the data set with no DR, in general projection improved clustering results for the other clustering techniques. In comparing the different dimensionality reduction results, the non-linear dimensionality reduction techniques SE and Isomap generally outperformed the matrix decomposition methods (PCA, ICA, and KPCA). This agrees with the finding of van der Maaten et al. [24], who observed that non-linear dimensionality reduction methods were generally better at DR on artificial data sets. Less consistent results were observed on the Tower data set in Section 7.

In addition to evaluations of the overall clustering, we can use a time series plot of cluster membership, where vertical position indicates cluster membership, to evaluate how well the individual faults were learned. Fig. 5a-d gives a sampling of cluster series plots. Clusters found by DBSCAN with No DR largely correspond to the original sets from the TEP simulation, though transitions between steady states as well as all of Fault 8 and some of Fault 14 were designated as noise. In Fig. 5b, DBSCAN with PCA finds many of the same clusters but incorrectly combined data from faults 4, 6, and 14. In the simulation, Fault 4 and 14 both affect the cooling water of the

Table 5
Data clustering and DR applied to the full TEP data set.

Full data set

DBSCAN K-means Mean Shift BIRCH

NO DR Homogeneity 0.40 0.44 0.29 0.39


Completeness 0.80 0.72 0.69 0.74
V-measure 0.54 0.54 0.41 0.51
ARI 0.15 0.11 0.06 0.08

DBSCAN K-means Mean Shift BIRCH


PCA Homogeneity 0.37 0.45 0.29 0.44
Completeness 0.84 0.71 0.69 0.73
V-measure 0.52 0.55 0.41 0.55
ARI 0.15 0.12 0.06 0.12

DBSCAN K-means Mean Shift BIRCH


ICA Homogeneity 0.37 0.46 0.33 0.44
Completeness 0.80 0.71 0.68 0.71
V-measure 0.51 0.56 0.44 0.54
ARI 0.15 0.12 0.08 0.11

DBSCAN K-means Mean Shift BIRCH


KPCA Homogeneity 0.33 0.43 0.17 0.36
Completeness 0.57 0.53 0.59 0.58
V-measure 0.42 0.47 0.26 0.44
ARI 0.11 0.17 0.03 0.11

DBSCAN K-means Mean Shift BIRCH


Isomap Homogeneity 0.40 0.45 0.28 0.39
Completeness 0.76 0.69 0.70 0.75
V-measure 0.52 0.55 0.40 0.52
ARI 0.15 0.13 0.06 0.09

DBSCAN K-means Mean Shift BIRCH


SE Homogeneity 0.50 0.52 0.31 0.52
Completeness 0.62 0.54 0.73 0.54
V-measure 0.55 0.53 0.43 0.53
ARI 0.17 0.33 0.10 0.31

reactor, so the PCA clustering results suggest that PCA removed the information in the cooling water temperature sensor data.

Fig. 6a-d provide a view of how dimensionality reduction affects the data being clustered, using a PCA projection to the first 3 components. However, recall that in the practical case where data labels are not known, the comparison between clustering results and true labels typically is not possible without more knowledge of the data set. Coloring the data by cluster, as in Fig. 6a, gives us further insight into the results in Table 4. Fig. 6b shows a detail of the cluster from normal operations, where we can see the closeness of normal data and data from Faults 2, 4, and 14. Fig. 6c shows how K-means attempts to fit the data into hyperspherical clusters, which causes long, narrow transitions between steady states to be grouped into different clusters. Fig. 6d shows how DBSCAN effectively identifies dense groups of related data but removes most data from transitions as noise.

6.4. Full TEP data set

Finally, Table 5 shows DR/clustering results over the full TEP data set consisting of behavior from all 20 faults. As might be expected, the results are lower, owing to the greatly increased complexity of the data. While all ARI values are low and closer to 0 (corresponding to random cluster labels) than in the reduced data set, SE reduced data yielded the highest ARI values across all clustering techniques. Mean shift clustering consistently yielded ARI values close to zero, corresponding to random cluster labels, while k-means and BIRCH (with SE) yielded the highest metrics.

Table 6
Process measurements.

Variable name   Description
F1              Feed to upstream reactor
T1              Tank 1 temperature
P1              Tank 1 feed pressure
P2              Tank 1 pressure
L1              Tank 1 level
T2              Tower feed temperature
T3              Middle tower temperature
L2              Tower level
T4              Tower overhead temperature
T5              Cooling water temperature
T6              Tower overhead temperature
P3              Overhead pressure
F2              Bottoms flow
T7              Bottoms temperature
T8              Bottoms product temperature
SG              Specific gravity sensor
F3              Total overhead product flow
F4              Non-recycle overhead product flow
T9              Overhead product temperature
P4              Vapor purge
L3              Tank 2 level

7. Case 2: industrial separation tower

7.1. Industrial separation tower description

The data used in this study came from an industrial scale reactor and separation system as shown in Fig. 7, with the measurements indicated explained in Table 6. The feed to the system is produced by a reactor upstream and varies depending on the petroleum grade fed. The reactor and separation system is fed a number of different grades, including standby grades, thus leading to a number of

Table 7
Number of components used by DR projection in clustering Tower data.

PCA ICA KPCA Isomap SE

12 12 6 12 12

Table 8
Supervised clustering metrics on tower data set.

            DBSCAN   K-means (k = 2)   K-means (k = 3)   Mean Shift   BIRCH*
No DR Homogeneity 0.63 0.07 0.59 0.43 0.88


Completeness 0.25 0.03 0.24 0.23 0.09
V measure 0.36 0.05 0.34 0.30 0.16
ARI 0.40 0.08 0.25 0.40 0.02

DBSCAN (k = 2) (k = 3) Mean Shift BIRCH*


PCA Homogeneity 0.64 0.07 0.59 0.42 0.87
Completeness 0.20 0.04 0.24 0.23 0.09
V measure 0.31 0.05 0.34 0.30 0.16
ARI 0.25 0.08 0.25 0.37 0.02
DBSCAN (k = 2) (k = 3) Mean Shift BIRCH*

ICA Homogeneity 0.44 0.00 0.51 0.43 0.86


Completeness 0.28 0.00 0.21 0.21 0.08
V measure 0.34 0.00 0.30 0.28 0.15
ARI 0.46 0.01 0.18 0.34 0.02

DBSCAN (k = 2) (k = 3) Mean Shift BIRCH*


Isomap Homogeneity 0.73 0.47 0.59 0.39 0.88
Completeness 0.16 0.68 0.24 0.21 0.09
V measure 0.26 0.55 0.34 0.28 0.16
ARI 0.18 0.67 0.26 0.42 0.02

DBSCAN (k = 2) (k = 3) Mean Shift BIRCH*


KPCA Homogeneity 0.31 0.31 0.08 0.18 0.84
Completeness 0.08 0.15 0.03 0.181 0.07
V measure 0.13 0.21 0.04 0.18 0.13
ARI 0.01 0.14 0.01 0.37 0.01

DBSCAN (k = 2) (k = 3) Mean Shift BIRCH*


SE Homogeneity 0.88 0.02 0.09 0.33 0.98
Completeness 0.12 0.03 0.04 0.15 0.07
V measure 0.21 0.02 0.06 0.20 0.13
ARI 0.06 0.06 0.11 0.26 0.01

Before entering the Tower outlined in Fig. 7, the feed is passed through Tank 1 to remove water. The feed then enters the Tower at high temperature and pressure; a mixture of solvent and product leaves from the bottoms, while extra solvent and any water leave from the top of the column. Tank 2 removes any remaining water before recycling the solvent. To give a sense of scale, the tower is approximately two stories tall. The specific gravity sensor at the bottom of the column is used to evaluate the quality of the separation and to manually control the feed to the tower by manipulating the flow of steam to Tank 2. The product quality specification is loosely based on the composition of the polymer product in the bottoms of the tower, evaluated by the specific gravity analyzer (SG). The specific gravity is controlled by manipulating the amount of steam fed to Tank 1.

Recently, a fault occurred which created changes in many disparate process variables and incurred significant financial losses. Our goal is to isolate data from this fault and distinguish it from the much larger volume of data from normal operations. We studied about 7 months of measurements from the tower, taken at 10 min intervals, from one of several different feed compositions or grades fed to the process. The total data set consisted of about 4500 data points.

7.2. Separation tower DR and clustering

The goal of data mining on the tower is to identify the faulty series of data from the fault event among the larger amount of data from normal operations. Working within the workflow described in Fig. 1, the data will be projected, clustered, and evaluated using clustering metrics. Here, because expert analysis already determined which data are faulty and which are normal, we can use supervised cluster evaluation methods. Table 7 gives the number of components used by each DR technique.
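When the true labels are known, these supervised metrics can be computed directly with scikit-learn [22]. The snippet below is a minimal sketch, not the authors' evaluation code; the arrays true_labels and cluster_labels are hypothetical stand-ins for the expert labels and one clustering result.

```python
# Hedged sketch: computing the supervised metrics reported in Table 8 with
# scikit-learn [22].  true_labels / cluster_labels are hypothetical
# placeholders for the expert labels and a clustering output.
from sklearn import metrics

true_labels    = [0, 0, 0, 0, 1, 1]   # e.g., 0 = normal, 1 = fault
cluster_labels = [0, 0, 0, 1, 1, 1]   # labels returned by a clustering run

print("Homogeneity :", metrics.homogeneity_score(true_labels, cluster_labels))
print("Completeness:", metrics.completeness_score(true_labels, cluster_labels))
print("V-measure   :", metrics.v_measure_score(true_labels, cluster_labels))
print("ARI         :", metrics.adjusted_rand_score(true_labels, cluster_labels))
```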
Fig. 8a–d show different 3-dimensional projections of data from the tower made by 4 different dimensionality reduction techniques for visualization of the data. PCA in Fig. 8a and ICA in Fig. 8b yield qualitatively similar projection results and preserve the density of normal operations. Both have large, dense normal regions in blue and a smaller cluster for the fault event in red. In contrast, Isomap in Fig. 8c and SE in Fig. 8d break apart the dense normal cluster into smaller clusters (which, from Fig. 10b, we can see are related to time). The data in the faulty group could be straightforwardly linked to a fault event using latent variable methods and time series plots of the original sensor data.
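As a rough illustration of how such a projection is produced, the sketch below projects a fabricated stand-in for the scaled tower data onto three components with scikit-learn's PCA and colors it by a hypothetical fault flag; Isomap or spectral embedding could be substituted on the same lines. The matrix X and flag y are synthetic placeholders, not the plant data.

```python
# Hedged sketch of a 3-dimensional DR projection in the spirit of Fig. 8.
# X and y are fabricated placeholders for scaled sensor data and fault flags.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))             # placeholder: 12 scaled sensors
y = np.zeros(500, dtype=int)
y[400:] = 1                                # pretend the last samples are faulty
X[y == 1] += 3.0                           # crude stand-in for a step fault

Z = PCA(n_components=3).fit_transform(X)   # Isomap or SE would swap in here

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(Z[y == 0, 0], Z[y == 0, 1], Z[y == 0, 2], s=3, c="blue", label="normal")
ax.scatter(Z[y == 1, 0], Z[y == 1, 1], Z[y == 1, 2], s=3, c="red", label="fault")
ax.legend()
plt.show()
```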
For data clustering, we study a problem where the faulty period of data was not known beforehand, to analyze how effectively data clustering techniques isolate the faulty data series, and we check the results against the known groups in the data. Table 8 shows the
quality of the dimensionality reduction and data clustering applied to the reduced data set, with great variation in the quality of clustering. Based on ARI, the true clusters were most accurately recreated by DBSCAN applied to the ICA projection. Mean shift also performed as strongly as DBSCAN based on ARI. K-means with k = 2 (finding two clusters) yields different information compared to calculating with k = 3. The reason for this difference is given in Fig. 9a and b: while two is a good guess as to the number of clusters, in this case k-means incorrectly divided both clusters, which could also be the result of a poor initialization of the k-means centroids. Fig. 9c and d also show that, while DBSCAN and mean shift yielded lower evaluation metrics than in the TEP study, the clusters found by most of the techniques applied captured the essential grouping in the data: most found one or more large clusters for normal operation as well as separate clusters containing the faulty data. Using the kNN-selected BIRCH parameter for clustering did not work as well on the tower as it did on the TEP data; however, other selections of parameters may significantly improve BIRCH clustering.
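A minimal sketch of the kind of comparison summarized in Table 8 follows, assuming synthetic stand-ins Z (a 3-component projection) and y (the known normal/fault split). The parameter choices shown (k, eps, the BIRCH threshold) are illustrative only and do not reproduce the kNN-based selection used in the study.

```python
# Hedged sketch: scoring several clustering algorithms against known labels
# with ARI, as in Table 8.  Z and y are synthetic placeholders.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MeanShift, Birch
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(0, 1, (400, 3)), rng.normal(4, 1, (100, 3))])
y = np.array([0] * 400 + [1] * 100)    # stand-in for the normal/fault split

candidates = {
    "k-means (k = 2)": KMeans(n_clusters=2, n_init=10),
    "k-means (k = 3)": KMeans(n_clusters=3, n_init=10),
    "DBSCAN":          DBSCAN(eps=1.0),
    "mean shift":      MeanShift(),
    "BIRCH":           Birch(threshold=0.5, n_clusters=None),
}
for name, algo in candidates.items():
    print(f"{name:15s} ARI = {adjusted_rand_score(y, algo.fit_predict(Z)):.2f}")
```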
If the time of the fault was not known beforehand, data clustering, dimensionality reduction, and time series plots of the original data could be used in tandem to analyze the significance of the clusters found, as in the basic approach in Fig. 5. DR projections to two or three dimensions can display the general clustering structure of the data, and the clusters found can be used to select sections of a time series plot of raw sensor data, locating and guiding analysis of the observations isolated by the cluster algorithm.
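The sketch below illustrates this tandem use of cluster labels and time series plots with a fabricated sensor trend and label vector; in practice the labels would come from a clustering run and the sensor trend from the plant historian.

```python
# Hedged sketch: highlighting the samples of one (hypothetical) suspect
# cluster on a raw sensor trend.  sensor and labels are fabricated.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
sensor = rng.normal(size=800).cumsum()   # fabricated raw sensor trend
labels = np.zeros(800, dtype=int)
labels[500:650] = 2                      # pretend cluster 2 flags these samples

t = np.arange(sensor.size)
plt.plot(t, sensor, lw=0.8)
flagged = labels == 2
plt.scatter(t[flagged], sensor[flagged], s=6, c="red", label="suspect cluster")
plt.xlabel("sample index")
plt.ylabel("sensor value")
plt.legend()
plt.show()
```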
Manually adjusting and visualizing this data set produced an interesting finding among the normal data. Applying DBSCAN to these data with the eps parameter set to 1 and removing data classified as noise by DBSCAN yields the clustering shown in Fig. 10a, which separates the normal cluster into several smaller clusters. While nothing in particular distinguishes these clusters from each other, Fig. 10b shows the clusters colored based on the time of observation. Each cluster has a consistent color, meaning that the dense normal cluster is composed of multiple smaller, time-dependent clusters. The clusters could be distinguished by factors not immediately obvious from the sensors, such as maintenance operations or the ambient temperature. A new cluster may be formed during each new run of the tower. This behavior poses a significant challenge to all modelling of this process, because the parameters of the system are gradually and subtly changing with time, requiring models and controls to be constantly adjusted.
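A minimal sketch of this noise-removal step, assuming scikit-learn [22] and a synthetic placeholder Z for the scaled, dimensionally reduced tower data:

```python
# Hedged sketch: DBSCAN with eps = 1 as described above, discarding the
# points it labels as noise.  Z is a synthetic placeholder, not plant data.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
Z = rng.normal(size=(1000, 3))               # placeholder for a DR projection

labels = DBSCAN(eps=1.0).fit_predict(Z)      # eps = 1 as in the text
keep = labels != -1                          # DBSCAN marks noise points as -1
Z_clean, labels_clean = Z[keep], labels[keep]
print(f"kept {keep.sum()} of {len(labels)} points")
```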
8. A note about time series clustering

A key disadvantage of this work, visible for example in the results in Fig. 10, is that our approach directly clusters individual time measurements and assumes all observations are independent of time. Our results demonstrate that assuming time independence works effectively in the case of step faults, where the process switches from one steady state to another; often, however, the process measurements continue changing with time. In these dynamic faults the process is unstable or fluctuating, and assuming time independence can fail because the data may not form dense, contiguous clusters. Searching for faults with time-dependent dynamics requires more advanced time series clustering. Methods for time series clustering generally modify existing clustering algorithms for time series data or transform the time series data into a form that allows the application of clustering techniques for static data [50]. However, clustering sliding-window subsequences of a time series to identify frequently appearing patterns has been shown to be meaningless [51]. Esling and Agon [52] and Wang et al. [53] review many specialized methods needed for the time series clustering tasks of data representation, similarity measurement, and indexing.

Some previous process monitoring studies have considered time series clustering. Srinivasan et al. [54] developed a dynamic PCA-based similarity factor for clustering transitions between different process states in agile chemical plants. Beaver and Palazoglu [55] used a moving window clustering algorithm based on PCA to detect process states and transition points disturbed by a periodic signal. Abonyi et al. [56] developed a fuzzy time series clustering algorithm and applied it to data from a polymerization reactor. Bo and Hao [57] used qualitative trend analysis for the hierarchical clustering of data from a blast furnace. Time series clustering is an active area of research [52,53], particularly the multivariate time series clustering that would be needed to analyze chemical processes [58].
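One simple transformation of the latter kind, offered here only as an illustrative sketch rather than a method from [50], is to stack a short window of consecutive observations into a single feature vector, in the spirit of the dynamic PCA approach of [54], so that short-term dynamics enter the feature space before a static clustering algorithm is applied. The window length w is an assumed tuning choice.

```python
# Hedged sketch: lag-embedding a multivariate series so static clustering
# sees short-term dynamics.  The window length w is an assumption.
import numpy as np

def lag_embed(X, w):
    """Stack w consecutive observations of X into one row per window."""
    n = X.shape[0]
    return np.hstack([X[i:n - w + 1 + i] for i in range(w)])

X = np.arange(20.0).reshape(10, 2)   # toy series: 10 samples of 2 sensors
print(lag_embed(X, w=3).shape)       # (8, 6): each row spans 3 time steps
```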

9. Conclusions

This study sought to fill the gap between the need for labelled data for training supervised monitoring algorithms and the raw, unclassified data that have accumulated in process historical databases. We demonstrated how unsupervised learning techniques drawn from the computer science literature can identify fault states and extract knowledge from chemical process databases. A selection of dimensionality reduction and data clustering techniques identified different operating and fault states in data from real and simulated chemical processes. On the Tennessee Eastman Process simulation, data clustering and DR functioned in tandem to isolate data from the different faults in the process, information which could be used to train a supervised classification technique for fault detection. On an industrial scale separation tower, data clustering identified a large, dense normal region corresponding to normal process operations and distinguished it from data from a fault that had occurred during the months of operations considered. Further analysis reveals that, at the grade studied, each start-up of the column formed a new cluster in the dimensionally reduced space, as visualized by a PCA projection of the clusters found by DBSCAN.

Ongoing work will study how unsupervised learning can improve process data analytics. While the data clustering techniques in this study performed satisfactorily on the data from the industrial separation tower, further work will study multistate processes where the steady state may move more often based on the rate of production or other frequent changes to the process steady state. Additionally, further tools are needed to leverage process knowledge in explaining the meaning and significance of clusters, such as identifying which sensor measurements contribute most to the separation between two given clusters. Finally, a weakness of this research is that supervised clustering metrics, which require labelled process data, were used to evaluate the different clustering strategies. The methods used to evaluate the quality of the clusters found are nearly as important as the cluster algorithms themselves. Several different clustering algorithms may find good clusters in the data, but identifying the right parameter values for a given technique and data set can be a challenging task. Unsupervised clustering metrics can provide crucial insight into identifying good parameter settings.
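For instance, one might sweep a parameter and track an unsupervised metric such as the Davies–Bouldin index [47]; the sketch below does this for DBSCAN's eps on synthetic data, with an illustrative grid, and is not the selection procedure used in this study.

```python
# Hedged sketch: parameter selection guided by an unsupervised metric.
# Z is synthetic; noise points (label -1) count as one group for simplicity.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(3)
Z = np.vstack([rng.normal(0, 1, (300, 3)), rng.normal(5, 1, (100, 3))])

for eps in (0.5, 1.0, 1.5, 2.0):           # illustrative grid of eps values
    labels = DBSCAN(eps=eps).fit_predict(Z)
    if len(set(labels)) >= 2:              # the score needs >= 2 groups
        print(f"eps={eps}: DB index = {davies_bouldin_score(Z, labels):.2f}")
    else:
        print(f"eps={eps}: too few clusters to score")
```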

References

[1] S.J. Qin, Survey on data-driven industrial process monitoring and diagnosis, Annu. Rev. Control 36 (2012) 220–234.
[2] Z. Ge, Z. Song, F. Gao, Review of recent research on data-based process monitoring, Ind. Eng. Chem. Res. 52 (2013) 3543–3562.
[3] S. Yin, X. Li, H. Gao, O. Kaynak, Data-based techniques focused on modern industry: an overview, IEEE Trans. Ind. Electron. 62 (1) (2015) 657–667.
[4] R. Xu, D. Wunsch, Survey of clustering algorithms, IEEE Trans. Neural Netw. 16 (2005).
[5] A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognit. Lett. 31 (2010) 651–666.
[6] L. Chen, S.D. Brown, Bayesian estimation of membership uncertainty in model-based clustering, J. Chemom. 28 (2014) 358–369.
[7] M. Gardner, J. Bieker, Data mining solves tough semiconductor manufacturing problems, in: KDD, Boston, MA, USA, 2000.
[8] A.J. Torabi, X. Li, B.E. Lim, G.O. Peen, Application of clustering methods for online tool condition monitoring and fault diagnosis in high-speed milling processes, IEEE Syst. J. 10 (2) (2016) 721–732.
[9] J.A. Harding, M. Shahbaz, S. Srinivas, A. Kusiak, Data mining in manufacturing: a review, J. Manuf. Sci. Eng. 128 (2006).
[10] S.G. Munoz, J.F. MacGregor, Success stories in process industries, Chem. Eng. Prog. 112 (3) (2016) 36–40.
[11] X.Z. Wang, C. McGreavy, Automatic classification for mining process operational data, Ind. Eng. Chem. Res. 37 (1998) 2215–2222.
[12] Bhushan, Romagnoli, Self-organizing, self-clustering network: a strategy for unsupervised pattern classification with its application to fault diagnosis, Ind. Eng. Chem. Res. 47 (2008) 4209–4219.
[13] M. Maestri, A. Farall, P. Groisman, M. Cassanello, G. Horowitz, A robust clustering method for detection of abnormal situations in a process with multiple steady-state operation modes, Comput. Chem. Eng. 34 (2010) 223–231.
[14] Z. Zhu, Z. Song, A. Palazoglu, Transition process modeling and monitoring based on dynamic ensemble clustering and multiclass support vector data description, Ind. Eng. Chem. Res. 50 (2011) 13969–13983.
[15] Singhal, Seborg, Clustering multivariate time-series data, in: Proceedings of the American Control Conference, Anchorage, AK, USA, 2002.
[16] J.F. Barragan, C.H. Fontes, M. Embirucu, A wavelet-based clustering of multivariate time series using a multiscale SPCA approach, Comput. Ind. Eng. 95 (2016) 144–155.
[17] N.F. Thornhill, H. Melbo, J. Wiik, Multidimensional visualization and clustering of historical process data, Ind. Eng. Chem. Res. 45 (2006) 5971–5985.
[18] C. Ding, X. He, K-means clustering via principal component analysis, in: Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004.
[19] L.H. Chiang, E.L. Russell, R.D. Braatz, Fault Detection and Diagnosis in Industrial Systems, Springer-Verlag, London, 2001.
[20] J. Lee, C. Yoo, I. Lee, Statistical process monitoring with independent component analysis, J. Process Control 14 (2004) 468–485.
[21] J. Lee, C. Yoo, S.W. Choi, P.A. Vanrolleghem, I. Lee, Nonlinear process monitoring using kernel principal component analysis, Chem. Eng. Sci. 59 (2004) 223–234.
[22] F. Pedregosa, et al., Scikit-learn: machine learning in Python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[23] S.J. Qin, Statistical process monitoring: basics and beyond, J. Chemom. 17 (2003) 480–502.
[24] L. van der Maaten, E. Postma, J. van den Herik, Dimensionality Reduction: A Comparative Review, Online Preprint, 2008.
[25] J.A. Lee, M. Verleysen, Nonlinear Dimensionality Reduction, Springer, New York, 2007.
[26] J. Lee, S.J. Qin, I. Lee, Fault detection and diagnosis based on modified independent component analysis, AIChE J. 52 (10) (2006) 3501–3514.
[27] A. Hyvarinen, E. Oja, Independent component analysis: algorithms and applications, Neural Netw. 13 (2000) 411–430.
[28] S. Yin, S.X. Ding, A. Haghani, H. Hao, P. Zhang, A comparison study of basic data-driven fault diagnosis and process monitoring methods based on the benchmark Tennessee Eastman process, J. Process Control 22 (2012) 1567–1581.
[29] B. Scholkopf, A. Smola, K.R. Muller, Kernel principal component analysis, in: Advances in Kernel Methods – Support Vector Learning, MIT Press, Cambridge, 1999, pp. 327–352.
[30] J. Tenenbaum, Mapping a manifold of perceptual observations, NIPS 97 (1997) 682–688.
[31] M.A.A. Cox, T.F. Cox, Multidimensional scaling, in: Handbook of Data Visualization, Springer, Heidelberg, 2008, pp. 315–347.
[32] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396.
[33] U. von Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (2007) 395–416.
[34] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 888–905.
[35] A. Ng, M. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: T. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, MIT Press, Cambridge, 2002, pp. 849–856.
[36] P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Longman, Boston, 2005.
[37] X. Wu, et al., Top 10 algorithms in data mining, Knowl. Inf. Syst. 14 (2008) 1–37.
[38] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.
[39] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in: SIGMOD, 1996, pp. 103–114.
[40] K. Fukunaga, L.D. Hostetler, The estimation of the gradient of a density function, with applications in pattern recognition, IEEE Trans. Inf. Theory 21 (1) (1975) 32–40.
[41] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, 1986.
[42] Y. Cheng, Mean shift, mode seeking, and clustering, IEEE Trans. Pattern Anal. Mach. Intell. 17 (8) (1995) 790–799.
[43] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 603–619.
[44] A. Rosenberg, J. Hirschberg, V-measure: a conditional entropy-based external cluster evaluation measure, in: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007, pp. 410–420.
[45] L. Hubert, P. Arabie, Comparing partitions, J. Classif. 2 (1985) 193–218.
[46] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (1936) 179–188.
[47] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. 1 (1979) 224–227.
[48] J.J. Downs, E.F. Vogel, A plant-wide industrial process control problem, Comput. Chem. Eng. 17 (1993) 245–255.
[49] N.L. Ricker, Decentralized control of the Tennessee Eastman challenge process, J. Process Control 6 (1996) 205–221.
[50] T.W. Liao, Clustering of time series data – a survey, Pattern Recognit. 38 (2005) 1857–1874.
[51] E. Keogh, J. Lin, Clustering of time-series subsequences is meaningless: implications for previous and future research, Knowl. Inf. Syst. 8 (2005) 154–177.
[52] P. Esling, C. Agon, Time-series data mining, ACM Comput. Surv. 45 (1) (2012) 12:1–12:34.
[53] X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, E. Keogh, Experimental comparison of representation methods and distance measures for time series data, Data Min. Knowl. Discov. 26 (2013) 275–309.
[54] R. Srinivasan, C. Wang, W.K. Ho, K.W. Lim, Dynamic principal component analysis based methodology for clustering process states in agile chemical plants, Ind. Eng. Chem. Res. 43 (2004) 2123–2139.
[55] S. Beaver, A. Palazoglu, Cluster analysis for autocorrelated and cyclic chemical process data, Ind. Eng. Chem. Res. 46 (2007) 3610–3622.
[56] J. Abonyi, B. Feil, S. Nemeth, P. Arva, Modified Gath-Geva clustering for fuzzy segmentation of multivariate time-series, Fuzzy Sets Syst. 149 (2005) 39–56.
[57] Z. Bo, Y. Hao, Qualitative trend clustering of process data for fault diagnosis, in: IEEE International Conference on Automation Science and Engineering, Gothenburg, Sweden, 2015.
[58] T. Fu, A review on time series data mining, Eng. Appl. Artif. Intell. 24 (2011) 164–181.
