Abstract—Dictionary.com defines learning as the process of acquiring knowledge. In psychology, learning is defined as the modification of behavior through training. In our work, we combine these definitions to define learning as the modification of a system model to incorporate the knowledge acquired by new observations. During learning, the system creates and modifies a model to improve its performance. As new samples are introduced, the system updates its model based on the new information provided by the samples. However, this update may not necessarily improve the model. We propose a Bayesian surprise metric to differentiate good data (beneficial) from outliers (detrimental), and thus help to selectively adapt the model parameters. The surprise metric is calculated based on the difference between the prior and the posterior distributions of the model when a new sample is introduced. The metric is useful not only to identify outlier data, but also to differentiate between data carrying useful information for improving the model and data carrying no new information (redundant). Allowing only the relevant data to update the model speeds up the learning process and prevents the system from overfitting. The method is demonstrated in all three learning procedures: supervised, semi-supervised, and unsupervised. The results show the benefit of surprise in both clustering and outlier detection.

Index Terms—surprise metric, Bayesian probabilistic framework, online learning, information theory, outlier detection, clustering

I. INTRODUCTION

When designing a pattern recognition algorithm, we assume that training data are available that provide a good representation of the problem environment. For most real-world problems, though, providing training data that capture all the information required to design an optimal model is very difficult, if not impossible. One of the reasons is that data may evolve through time, and new patterns are introduced that were not available during training. This requires that the system be updated online to accommodate the new samples, which would potentially enhance the statistical models and improve performance. However, if this is not handled with care, the models created from the training set can become corrupted and hinder performance (this is the so-called stability-plasticity dilemma [1]).

While it is important to allow the system to be flexible so that it learns from new samples, it is just as important not to forget previously learned information. In other words, the system needs to retain previously stored material while learning new information. To accommodate this approach, it is necessary to monitor the information available in a new sample and its effect on the state of the system. A good sample would provide information beneficial to the model; it would carry a reasonable amount of new information that would sufficiently but not excessively change the model. An outlier, on the other hand, would provide information detrimental to the model. It is expected to carry an element of surprise that would significantly change the model. The monitoring system would prevent samples that do not conform to expected behavior, outliers, from incorrectly modifying the model and, as a result, drastically skewing the state of the system, but would allow interesting observations to improve the system knowledge. Detecting and preventing outliers from changing the model is very important in many application areas such as fraud, intrusion, or fault detection.

While outliers incorrectly modify the model, redundant data will cause overfitting. Most of the data that the system observes will carry little to no new information for the model. Allowing these redundant samples to update the model is not only wasteful and time-consuming but will also over-train the system, making it rigid and unable to adapt to new, interesting samples. Thus, it is as important to identify samples that do not provide any new information and prevent them from updating the model as it is to identify those that will drastically change it.

A quantitative definition of the relevance a new observation has to the system is required to subjectively measure the information available in the observation based on the current system knowledge. To understand this measure, we first need to look at a few basic definitions from information theory. In information theory, information measures the uncertainty or probability of occurrence of an outcome [2]. The information content of an outcome x, whose probability is p(x), is defined as

    I(x) = log(1 / p(x)).    (1)

For a random variable (r.v.) X, the average information content is defined as

    H(X) = Σ_x p(x) log(1 / p(x)),    (2)

which is also called Shannon's entropy [3] of the r.v. X. This classical information measure is based solely upon objective probabilities and cannot represent a system's individual knowledge or discriminate between outcomes which are important to
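To make Eqs. (1) and (2) concrete, here is a small illustration (ours, not from the paper) of information content and entropy for a discrete distribution, computed in nats:

```python
import math

def information_content(p):
    """I(x) = log(1/p(x)): information carried by an outcome of probability p (nats)."""
    return math.log(1.0 / p)

def entropy(pmf):
    """H(X) = sum_x p(x) log(1/p(x)): average information content (nats)."""
    return sum(p * math.log(1.0 / p) for p in pmf if p > 0)

# A rare outcome carries more information than a common one.
print(information_content(0.01) > information_content(0.5))  # True
# A uniform distribution over 4 outcomes has entropy log(4) ≈ 1.386 nats.
print(round(entropy([0.25] * 4), 3))  # 1.386
```

Note that a deterministic outcome (p = 1) carries zero information, matching the intuition that a fully expected sample is redundant.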
Compute Surprise: When a new sample x_n arrives at time t, calculate the surprise, S(x_n, M^t), for the model M^t = {m_i^t}, i = 1, ..., M, as:

    S(x_n, M^t) = KL(p(x | x_n, M^t), p(x | M^t)),    (9)

where p(x | M^t) is obtained from the EM algorithm. The term p(x | x_n, M^t) is the mixture of Gaussians updated when the new sample x_n is added into the model.

The Kullback-Leibler divergence, D_KL, is one of the most well-known divergences; however, it does not yield an analytic closed-form expression for MoGs. To work around this problem, a simple approach is numerical integration, where the whole feature space is uniformly gridded and D_KL is computed for each grid cell individually. The accuracy depends on the resolution of the grid, and the memory required grows exponentially with the dimensionality of the data.

To mitigate this problem, we use a different divergence measure, the Cauchy-Schwarz divergence [9], D_CS, whose closed-form solution for MoGs is easily computed [10]. This measure is derived from the Cauchy-Schwarz inequality

    sqrt( ∫ f^2(x) dx ∫ g^2(x) dx ) ≥ ∫ f(x) g(x) dx,

where equality holds if and only if f(x) = C g(x) for a constant scalar C. To simplify the calculations, we take the square of the inequality, and the Cauchy-Schwarz divergence measure of two PDFs is defined as [11]:

    D_CS(f, g) = -log [ ∫ f(x) g(x) dx / sqrt( ∫ f(x)^2 dx ∫ g(x)^2 dx ) ].    (10)

D_CS(f, g) is a symmetric measure and D_CS(f, g) ≥ 0, where equality holds if and only if f(x) = g(x). However, the triangle inequality property does not hold, so it cannot be considered a metric. D_CS(f, g) can be broken down and rewritten as

    D_CS(f, g) = -2 log ∫ f(x) g(x) dx + log ∫ f^2(x) dx + log ∫ g^2(x) dx.    (11)

The argument of the first term, ∫ f(x) g(x) dx, estimates the interactions on locations within the support of f(x) when evaluated by g(x), and (11) can equivalently be written as

    D_CS(f, g) = 2 H_2(f; g) - H_2(f) - H_2(g),    (12)

where H_2(f; g) is the quadratic Renyi cross-entropy, and H_2(f), H_2(g) are the Renyi quadratic entropies of f and g, respectively.

If we denote by f(x) and g(x) two distributions, each a mixture of Gaussians with different parameters and numbers of clusters,

    f(x) = Σ_{m=1}^{M} π_m N(x | µ_m, Λ_m^{-1})  and  g(x) = Σ_{k=1}^{K} τ_k N(x | ν_k, Ω_k^{-1}),

where M and K denote the number of Gaussian components in f(x) and g(x), respectively, π_m, µ_m, and Λ_m denote the mixture coefficient, the mean, and the precision matrix of the m-th component of f(x), and τ_k, ν_k, and Ω_k denote the respective terms of the k-th component of g(x), then the closed-form expression for D_CS of a pair of MoGs can be derived by rewriting (11) as

    D_CS(f, g) = -2 log( Σ_{m=1}^{M} Σ_{k=1}^{K} π_m τ_k z_mk )
                 + log( Σ_{m=1}^{M} π_m^2 |Λ_m|^{1/2} / (2^D π^{D/2}) + 2 Σ_{m=1}^{M} Σ_{m'<m} π_m π_m' z_mm' )
                 + log( Σ_{k=1}^{K} τ_k^2 |Ω_k|^{1/2} / (2^D π^{D/2}) + 2 Σ_{k=1}^{K} Σ_{k'<k} τ_k τ_k' z_kk' ),    (13)

where

    z_mk  = N(µ_m | ν_k,  Λ_m^{-1} + Ω_k^{-1}),
    z_mm' = N(µ_m | µ_m', Λ_m^{-1} + Λ_m'^{-1}),
    z_kk' = N(ν_k | ν_k',  Ω_k^{-1} + Ω_k'^{-1})

are the integrals of the products of the corresponding pairs of Gaussian PDFs. Substituting p(x | M^t) and p(x | x_n, M^t) for f(x) and g(x), we obtain the surprise measure

    S(x_n, M^t) = D_CS(p(x | M^t), p(x | x_n, M^t)).    (14)

Update MoGs: Once the surprise value is computed, we need to check the three conditions:
• Outlier: S(x_n, M) > T_out; the new sample x_n is an outlier. (Depending on the problem, these samples may be stored in an outlier buffer and used to generate new models.)
• Interesting: T_out ≥ S(x_n, M) ≥ T_red; x_n belongs to the model and is used to update it.
• Redundant: S(x_n, M) < T_red; x_n carries no new information and is discarded.
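Since Eqs. (10)–(13) are fully determined by Gaussian product integrals, the closed-form D_CS is straightforward to implement. The sketch below (our illustration, not the authors' code; the mixture parameters are arbitrary toy values) evaluates Eq. (10) directly from z-style terms N(µ_m | ν_k, Σ_m + Σ_k):

```python
import numpy as np

def gauss_eval(x, mean, cov):
    """Evaluate the multivariate Gaussian density N(x | mean, cov)."""
    d = len(mean)
    diff = np.asarray(x) - np.asarray(mean)
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    return np.exp(expo) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))

def mog_product_integral(w1, mu1, cov1, w2, mu2, cov2):
    """Closed-form integral of the product of two MoGs:
    sum_m sum_k w1_m w2_k N(mu1_m | mu2_k, cov1_m + cov2_k)."""
    return sum(a * b * gauss_eval(m1, m2, c1 + c2)
               for a, m1, c1 in zip(w1, mu1, cov1)
               for b, m2, c2 in zip(w2, mu2, cov2))

def dcs(f, g):
    """Cauchy-Schwarz divergence of Eq. (10): -log(<f,g> / sqrt(<f,f><g,g>))."""
    fg = mog_product_integral(*f, *g)
    ff = mog_product_integral(*f, *f)
    gg = mog_product_integral(*g, *g)
    return -np.log(fg / np.sqrt(ff * gg))

# Toy 2-D mixtures f and g as (weights, means, covariances); values are arbitrary.
f = ([0.5, 0.5], [np.zeros(2), np.array([3.0, 0.0])], [np.eye(2), np.eye(2)])
g = ([0.7, 0.3], [np.array([0.5, 0.5]), np.array([3.0, 1.0])], [np.eye(2), np.eye(2)])

print(abs(dcs(f, f)) < 1e-12)            # True: D_CS(f, f) = 0
print(np.isclose(dcs(f, g), dcs(g, f)))  # True: symmetric
print(dcs(f, g) > 0)                     # True: positive for f != g
```

Computing ∫f g, ∫f², and ∫g² with the same product-integral routine keeps the implementation term-by-term consistent with the three logarithms in Eq. (11), and avoids the gridding cost that makes D_KL impractical in high dimensions.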
Procedure 1 Online MoG Update
  1) Build the Bayesian network structure from the training data
  2) Calculate the model parameters and thresholds
  3) Infer the posterior distribution of the location p(r_i | {x_j | j ∈ object_i})
  4) When a new sample x_n arrives, check the following:
     (a) Calculate the surprise S(x_n, M) from p(x) and p(x | x_n) as in Eq. 14
     (b) IF T_out ≥ S(x_n, M) ≥ T_red:
         – update the MoG parameters
         – update the thresholds
     (c) ELSE
         • Discard x_n, or include it in the outlier buffer
  5) Repeat steps 2-4

The surprise value of a good data sample is very small compared to that of an outlier. The surprise field reiterates this point and illustrates that as data samples move further apart from the library data, their respective surprise values become larger.

Fig. 2. Example of the surprise value on a good data sample and an outlier. The figure shows the surprise field along with the library data. In addition, the surprise factors of a good data sample and an outlier are computed and shown in white and red, respectively.

(a) Surprise Behavior on Clustering

Fig. 4. Original dataset: clusters (good data) and noise (outliers), where the noise is composed of scattered samples and stripes of outliers connected horizontally or vertically.
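The three-way decision in Procedure 1 can be sketched as follows. This is a minimal illustration under assumed names (surprise_fn, update_fn, t_out, t_red), and the toy running-mean "model" stands in for the paper's MoG update:

```python
def process_sample(x, model, surprise_fn, update_fn, t_out, t_red, outlier_buffer):
    """Route one new sample based on its surprise S(x, M)."""
    s = surprise_fn(x, model)
    if s > t_out:                 # outlier: leave the model untouched
        outlier_buffer.append(x)
        return model, "outlier"
    if s >= t_red:                # interesting: adapt the model
        return update_fn(x, model), "interesting"
    return model, "redundant"     # redundant: discard to avoid overfitting

# Toy usage: a running-mean "model" with a distance-based surprise.
buf = []
model = 0.0
surprise = lambda x, m: abs(x - m)
update = lambda x, m: 0.9 * m + 0.1 * x
for x in [0.05, 0.5, 5.0]:
    model, tag = process_sample(x, model, surprise, update,
                                t_out=1.0, t_red=0.1, outlier_buffer=buf)
    print(tag)  # redundant, then interesting, then outlier
```

Only the middle branch touches the model, which is the source of both the speed-up and the overfitting protection discussed above.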
Fig. 3. Though the new sample (represented with the black star) is closer to the left cluster (blue squares), the surprise value of the right cluster (red circles) is smaller due to the larger number of samples it has observed. This is shown in subfigures (b) and (c), respectively (S = 0.025 vs. S = 0.007). To account for this discrepancy due to the different number of samples per cluster, normalized surprise values need to be computed, as shown in subfigures (d) Normalized S.F. on G1 and (e) Normalized S.F. on G2 (S = 2.125 vs. S = 3.704), where, as expected, the surprise of the new sample with respect to the right cluster is clearly much larger.

We analyze the outlier detection problem using the three fundamental learning approaches:

A. Supervised learning

In the supervised approach, we provide the system with prior knowledge of both the good data and the outliers. Anal- … are provided to allow good generalization of the system as both good and outlier data; otherwise, samples that are derived from unknown areas may not be classified correctly, and good …
B. Semi-supervised learning

In the semi-supervised approach, we provide prior knowledge of the clusters as above, but also allow the system to adapt and learn online. As new samples arrive, the model is adjusted to learn from the information provided in the samples. The system identifies the boundary between good data and outliers on its own and continues to update it as new samples arrive. Fig. 7 shows the initial training data. Even though we provide fewer samples than in the supervised case, since the system will update online, this lack of initial samples does not lower the performance of the system as long as the initial samples cover the entire distribution.

Fig. 7. Training data: good data samples to generate an initial model for the semi-supervised case.

Fig. 8 shows the results of the semi-supervised approach. Despite the low number of initial data samples, the results of semi-supervised learning are very good, and in fact better than the supervised approach. This is because the system was able to learn from the new samples and adapt its model to integrate the new information.

C. Unsupervised learning

Finally, we determine the outliers with no prior knowledge of the data. If the data were available offline, the unsupervised approach would be simple, as the samples with the highest surprise values would be considered outliers and the samples with the lowest surprise would be considered good data. Starting from this initial assumption, we could generate an initial model and then utilize the semi-supervised approach to distinguish the rest of the samples that fall in the middle of the surprise spectrum.

However, in the online case, we do not have all the data available to derive an initial model for the system and an outlier/good-data boundary. Since our approach to outlier detection requires that we have an initial distribution model, we utilize Gaussian Mean Shift (GMS) to obtain the initial cluster(s). At the beginning, all the samples are considered outliers and placed in an outlier bin. As the outlier bin grows, GMS is applied to these samples, and when a cluster is present, it is extracted as the initial model. As new samples arrive, they are classified either as part of the model, in which case the system uses them to modify its model, or as outliers, in which case they are placed in the outlier bin. Since most of the samples (even the ones belonging to the cluster) will initially be considered outliers due to the poor representation of the model, they are always placed in the outlier bin instead of being discarded as in the previous two approaches. GMS constantly processes the data in the outlier bin and, whenever possible, extracts additional clusters, which are then fed back to the system. Fig. 9 shows the results of unsupervised learning up to a mid-point through the dataset, and Fig. 10 shows the final results. The unsupervised approach models the clusters well; however, it tends to incorrectly classify some of the outer areas as outliers. This is due to the difficulty GMS has generating new clusters in these areas, and also due to the order of introduction of these samples, which becomes apparent in both Fig. 9 and Fig. 10. In addition, some of the outlier samples that are close to the clusters, especially the mini outlier-clusters, are misclassified as good data due to their vicinity to the actual clusters. Also, it is important to note that this process is more time-consuming than the previous approaches.

D. Discussion

To better understand the performance of surprise on each learning approach, we designed 20 Monte Carlo runs for each approach. The results are summarized in Table I.
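The Gaussian Mean Shift step used to bootstrap the unsupervised model can be illustrated with a generic 1-D mean-shift iteration. This is our sketch, not the authors' implementation; the bandwidth and sample values are arbitrary, and the paper's GMS operates on the multivariate outlier bin:

```python
import math

def mean_shift_mode(points, start, bandwidth, iters=50):
    """Gaussian mean-shift iteration: repeatedly move x to the
    kernel-weighted mean of the samples, converging toward a density mode."""
    x = start
    for _ in range(iters):
        w = [math.exp(-((x - p) ** 2) / (2.0 * bandwidth ** 2)) for p in points]
        x = sum(wi * p for wi, p in zip(w, points)) / sum(w)
    return x

# A 1-D "outlier bin": a tight cluster near 0 plus one distant outlier.
bin_samples = [-0.2, -0.1, 0.0, 0.1, 0.2, 8.0]
mode = mean_shift_mode(bin_samples, start=0.3, bandwidth=0.5)
print(abs(mode) < 0.05)  # True: the iteration settles on the cluster center
```

The distant sample at 8.0 receives negligible kernel weight, so it neither attracts the mode nor joins the extracted cluster, mirroring how samples left in the outlier bin await further evidence.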
Table I shows the percentage of good data that were incorrectly classified as outliers (false negatives) and the percentage of outliers that were incorrectly classified as good data (false positives) for each approach. The supervised approach has the most false negatives because learning is done offline on the training dataset and surprise is not utilized to update the model as new samples arrive. This is confirmed by the low false negatives for both the semi-supervised and unsupervised approaches. The supervised approach has the lowest number of misclassified outliers due to the algorithm being conservative, based on the knowledge extracted from the training data. The semi-supervised approach also performs well on the false positives. For the unsupervised approach, the number of misclassified outliers is much larger because there exists no prior knowledge about the clusters and outliers, and the system tends to misclassify the outliers that are close to the clusters, especially the mini outlier-clusters, as mentioned above.

TABLE I
PERCENTAGE OF GOOD DATA AND OUTLIERS INCORRECTLY CLASSIFIED IN THE THREE LEARNING APPROACHES

    Incorrect (%)   Supervised   Semi-supervised   Unsupervised
    Good Data       4.86%        1.24%             1.67%
    Outliers        1.32%        1.46%             3.79%

In the supervised approach, the training samples are very important because they are the only information that the system has about the environment. As a result, a good representation of the overall distribution is required; otherwise, the system will not perform well. In the semi-supervised approach, the effect of the initial training data is not as drastic, but it will still affect the performance of the system. If the initial training data provide information about only parts of the distribution, the system will suffer and not be able to adapt as easily. Fig. 11 shows an example where the initial data samples do not contain any information about the left circular cluster. In addition, the top horizontal strip is split into two clusters, with a large portion considered outliers because of the lack of initial samples in that area.

Fig. 9. Results of the unsupervised approach at a mid point in the dataset. The dataset is shown in black dots, the cluster samples are shown in blue circles, and the currently identified outliers are shown in red circles.

Fig. 10. Results of the unsupervised approach. The system is able to learn most of the cluster model and correctly identify most of the outliers.

Fig. 11. Results of the semi-supervised approach when the initial learning data do not provide a good representation of the whole distribution.

In addition to the initial data samples, the order in which the new samples arrive will also affect the final outcome of the system. The model is adapted based on the information available in the new samples and how these samples are categorized. So, if a new sample belonging to a cluster is far from the present distribution state, it will be considered an outlier. To avoid such problems, especially in the unsupervised approach where the system model is very minimal, we present the data samples in order from left to right and top to bottom. Since surprise is a subjective measure, its performance will depend on the state of the system. An alternative way to alleviate the problem of incorrectly assigning good data samples as outliers would be to store the outlier data and, at regular intervals, retest them on the updated system model.

To decide the outlier and redundant-data threshold values, we run two tests:
• Outlier threshold: In the case where multiple models are present, randomly select several samples from every other model and compute their surprise. Since these samples belong to other models, they are considered outliers for the current model. Obtain the lowest surprise value and set that as the outlier threshold. In the case where only one model is present, select the sample with the highest surprise value and set that as the outlier threshold.
• Redundant-data threshold: Randomly select a few samples from the model, slightly perturb them, and compute their surprise. Obtain the highest surprise value. This indicates the surprise value of samples already known to the system, carrying no new knowledge. Set it as the redundant-data threshold.

The surprise value of new samples that could belong to the model should be less than the high (outlier) threshold, and that of samples carrying useful information should be higher than the low (redundant-data) threshold.

V. CONCLUSIONS AND DISCUSSIONS

Pattern recognition algorithms, in general, require that a training dataset be available to extract features and generate a model of the environment. In real-world problems, such a dataset is not available ahead of time or, if it is, is very limited. Usually, samples become available through time, and as a result, the system models initially generated need to adapt to incorporate the new information found in these new samples. The goal of this paper was to create an evolving system that uses online learning to update its model. However, rather than allowing every sample to modify the model, a surprise metric was introduced to determine when an update was necessary. This prevents outliers from drastically changing the system state and avoids overfitting from redundant data. The Bayesian framework was utilized to compute the prior and posterior PDFs, and thus derive surprise.

In the experiments section, we showed results following all three learning procedures: supervised, semi-supervised, and unsupervised learning. The supervised approach simply trained the system based on the initial labeled data. While the surprise of the new samples was computed to classify them, it was not further utilized to update the system model. The semi-supervised approach extended the supervised procedure by creating an initial model based on the labeled data, but then further improved the system by extracting information from the new samples. This approach takes full advantage of our method. The unsupervised approach requires additional assistance to generate an initial model for the system; Gaussian Mean Shift was utilized to cluster the unlabeled data.

The surprise metric is based on subjective measures of information, and thus the surprise value of a sample depends on the present state of the system. Since the model of the system depends on the samples it has observed, the outcome of the system will always depend on the order in which the samples are introduced to the system. This becomes more pronounced in sparse datasets.

While our current model is designed so as to not forget previous information, in some online problems where the data evolve through time, it is important that, while adapting the model to fit new samples, the system also forget the information obtained from samples far in the past. Our method can accommodate this requirement by modifying the mixture of Gaussians model. This will require storing the data samples that were used to generate the MoGs and weighting them based on the time they were introduced to the system. Most of the methods that accommodate this requirement utilize a sliding window for the samples that need to be kept. While this approach is simple, it is time-consuming when the MoGs are updated with every sample, or not very adaptive when they are updated only after a fixed number of samples are introduced. Our approach of computing surprise, and more specifically of distinguishing between redundant and interesting samples, provides an advantage here, as we need to update the MoGs only when interesting data samples are available, thus speeding up the algorithm and also making it more dynamically adaptable.

ACKNOWLEDGMENT

This work was supported by the Office of Naval Research, Award Number: N00014-10-1-0101.

REFERENCES

[1] G. A. Carpenter and S. Grossberg, "The ART of adaptive pattern recognition by a self-organizing neural network," Computer, vol. 21, pp. 77–88, March 1988.
[2] T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[3] C. E. Shannon, "A mathematical theory of communication," SIGMOBILE Mob. Comput. Commun. Rev., vol. 5, pp. 3–55, 2001.
[4] M. Belis and S. Guiasu, "A qualitative-quantitative measure of information in cybernetic systems," IEEE Transactions on Information Theory, vol. 14, pp. 593–594, 1968.
[5] E. Pfaffelhuber, "Learning and information theory," International Journal of Neuroscience, vol. 3, no. 2, pp. 83–88, 1972.
[6] ——, "Information-theoretic stability and evolution criteria in irreversible thermodynamics," Journal of Statistical Physics, vol. 16, no. 1, pp. 69–90, 1977.
[7] S. Kullback, Information Theory and Statistics. Dover, 1968.
[8] L. Itti and P. F. Baldi, "Bayesian surprise attracts human attention," in Advances in Neural Information Processing Systems, Vol. 19 (NIPS*2005). Cambridge, MA: MIT Press, 2006, pp. 547–554.
[9] J. C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. New York, NY: Springer, 2010.
[10] K. Kampa, E. Hasanbelliu, and J. Principe, "Closed-form Cauchy-Schwarz PDF divergence for mixture of Gaussians," in Neural Networks (IJCNN), The 2011 International Joint Conference on, 2011, pp. 2578–2585.
[11] R. Jenssen, D. Erdogmus, K. E. Hild II, J. C. Principe, and T. Eltoft, "Optimizing the Cauchy-Schwarz PDF distance for information theoretic, non-parametric clustering," in EMMCVPR, 2005, pp. 34–45.
[12] J. Almeida, L. Barbosa, A. Pais, and S. Formosinho, "Improving hierarchical cluster analysis: A new method with outlier detection and automatic clustering," Chemometrics and Intelligent Laboratory Systems, vol. 87, no. 2, pp. 208–217, 2007.