Abstract—Dictionary.com defines learning as the process of acquiring knowledge. In psychology, learning is defined as the modification of behavior through training. In our work, we combine these definitions to define learning as the modification of a system model to incorporate the knowledge acquired by new observations. During learning, the system creates and modifies a model to improve its performance. As new samples are introduced, the system updates its model based on the new information provided by the samples. However, this update may not necessarily improve the model. We propose a Bayesian surprise metric to differentiate good data (beneficial) from outliers (detrimental), and thus help to selectively adapt the model parameters. The surprise metric is calculated based on the difference between the prior and the posterior distributions of the model when a new sample is introduced. The metric is useful not only to identify outlier data, but also to differentiate between data carrying useful information for improving the model and data carrying no new information (redundant). Allowing only the relevant data to update the model speeds up the learning process and prevents the system from overfitting. The method is demonstrated in all three learning procedures: supervised, semi-supervised, and unsupervised. The results show the benefit of surprise in both clustering and outlier detection.

Index Terms—surprise metric, Bayesian probabilistic framework, online learning, information theory, outlier detection, clustering

I. INTRODUCTION

When designing a pattern recognition algorithm, we assume that training data are available that provide a good representation of the problem environment. For most real-world problems, though, providing training data that capture all the information required to design an optimal model is very difficult, if not impossible. One of the reasons is that data may evolve through time, and new patterns are introduced that were not available during training. This requires that the system be updated online to accommodate the new samples, which would potentially enhance the statistical models and improve performance. However, if this is not handled with care, the models created from the training set can become corrupted and hinder performance (this is the so-called stability-plasticity dilemma [1]).

While it is important to allow the system to be flexible so that it learns from new samples, it is just as important not to forget previously learned information. In other words, the system needs to retain previously stored material while learning new information. To accommodate this approach, it is necessary to monitor the information available in a new sample and its effect on the state of the system. A good sample would provide information beneficial to the model; it would carry a reasonable amount of new information that would sufficiently but not excessively change the model. An outlier, on the other hand, would provide information detrimental to the model. It is expected to carry an element of surprise that would significantly change the model. The monitoring system would prevent samples that do not conform to expected behavior, outliers, from incorrectly modifying the model and, as a result, drastically skewing the state of the system, but would allow interesting observations to improve the system knowledge. Detecting and preventing outliers from changing the model is very important in many application areas such as fraud, intrusion, or fault detection.

While outliers incorrectly modify the model, redundant data will cause overfitting. Most of the data that the system observes will carry little to no new information for the model. Allowing these redundant samples to update the model is not only wasteful and time-consuming but will also over-train the system, making it rigid and unable to adapt to new, interesting samples. Thus, it is as important to identify samples that do not provide any new information and prevent them from updating the model as it is to identify those that will drastically change it.

A quantitative definition of the relevance a new observation has to the system is required to subjectively measure the information available in the observation based on the current system knowledge. To understand this measure, we first need to look at a few basic definitions from information theory. In information theory, information measures the uncertainty or probability of occurrence of an outcome [2]. The information content of an outcome x, whose probability is p(x), is defined as

    I(x) = log(1 / p(x)).    (1)

For a random variable (r.v.) X, the average information content is defined as

    H(X) = Σ_x p(x) log(1 / p(x)),    (2)

which is also called Shannon's entropy [3] of the r.v. X. This classical information measure is based solely upon objective probabilities and cannot represent a system's individual knowledge or discriminate between outcomes which are important to
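To make Eqs. (1) and (2) concrete, here is a small illustration (ours, not from the paper) of information content and entropy for a discrete distribution, computed in nats:

```python
import math

def information_content(p):
    """I(x) = log(1/p(x)): information carried by an outcome of probability p (nats)."""
    return math.log(1.0 / p)

def entropy(pmf):
    """H(X) = sum_x p(x) log(1/p(x)): average information content (nats)."""
    return sum(p * math.log(1.0 / p) for p in pmf if p > 0)

# A rare outcome carries more information than a common one.
print(information_content(0.01) > information_content(0.5))  # True
# A uniform distribution over 4 outcomes has entropy log(4) ≈ 1.386 nats.
print(round(entropy([0.25] * 4), 3))  # 1.386
```

Note that a deterministic outcome (p = 1) carries zero information, matching the intuition that a fully expected sample is redundant.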
Compute Surprise: When a new sample x_n arrives at time t, calculate the surprise, S(x_n, M^t), for the model M^t = {m_i^t}, i = 1, ..., M, as:

    S(x_n, M^t) = KL(p(x | x_n, M^t), p(x | M^t)),    (9)

where p(x | M^t) is obtained from the EM algorithm. The term p(x | x_n, M^t) is the mixture of Gaussians updated when the new sample x_n is added into the model.

The Kullback-Leibler divergence, D_KL, is one of the most well-known divergences; however, it does not yield an analytic closed-form expression for MoGs. To work around this problem, a simple approach is numerical integration, where the whole feature space is uniformly gridded and D_KL is computed for each grid cell individually. The accuracy depends on the resolution of the grid, and the memory required grows exponentially with the dimensionality of the data.

To mitigate this problem, we use a different divergence measure, the Cauchy-Schwarz divergence [9], D_CS, whose closed-form solution for MoGs is easily computed [10]. This measure is derived from the Cauchy-Schwarz inequality

    sqrt( ∫ f^2(x) dx ∫ g^2(x) dx ) ≥ ∫ f(x) g(x) dx,

where equality holds if and only if f(x) = C g(x) for a constant scalar C. To simplify the calculations, we take the square of the inequality, and the Cauchy-Schwarz divergence measure of two PDFs is defined as [11]:

    D_CS(f, g) = -log [ ∫ f(x) g(x) dx / sqrt( ∫ f(x)^2 dx ∫ g(x)^2 dx ) ].    (10)

D_CS(f, g) is a symmetric measure and D_CS(f, g) ≥ 0, where equality holds if and only if f(x) = g(x). However, the triangle inequality property does not hold, so it cannot be considered a metric. D_CS(f, g) can be broken down and rewritten as

    D_CS(f, g) = -2 log ∫ f(x) g(x) dx + log ∫ f^2(x) dx + log ∫ g^2(x) dx.    (11)

The argument of the first term, ∫ f(x) g(x) dx, estimates the interactions on locations within the support of f(x) when evaluated by g(x), and (11) can equivalently be written as

    D_CS(f, g) = 2 H_2(f; g) - H_2(f) - H_2(g),    (12)

where H_2(f; g) is the quadratic Renyi cross-entropy, and H_2(f), H_2(g) are the Renyi quadratic entropies of f and g, respectively.

If we denote by f(x) and g(x) two distributions, each a mixture of Gaussians with different parameters and numbers of clusters,

    f(x) = Σ_{m=1}^{M} π_m N(x | µ_m, Λ_m^{-1})  and  g(x) = Σ_{k=1}^{K} τ_k N(x | ν_k, Ω_k^{-1}),

where M and K denote the number of Gaussian components in f(x) and g(x), respectively, π_m, µ_m, and Λ_m denote the mixture coefficient, the mean, and the precision matrix of the m-th component of f(x), and τ_k, ν_k, and Ω_k denote the respective terms of the k-th component of g(x), then the closed-form expression for D_CS of a pair of MoGs can be derived by rewriting (11) as

    D_CS(f, g) = -2 log( Σ_{m=1}^{M} Σ_{k=1}^{K} π_m τ_k z_mk )
                 + log( Σ_{m=1}^{M} π_m^2 |Λ_m|^{1/2} / (2^D π^{D/2}) + 2 Σ_{m=1}^{M} Σ_{m'<m} π_m π_m' z_mm' )
                 + log( Σ_{k=1}^{K} τ_k^2 |Ω_k|^{1/2} / (2^D π^{D/2}) + 2 Σ_{k=1}^{K} Σ_{k'<k} τ_k τ_k' z_kk' ),    (13)

where

    z_mk  = N(µ_m | ν_k,  Λ_m^{-1} + Ω_k^{-1}),
    z_mm' = N(µ_m | µ_m', Λ_m^{-1} + Λ_m'^{-1}),
    z_kk' = N(ν_k | ν_k',  Ω_k^{-1} + Ω_k'^{-1})

are the integrals of the products of the corresponding pairs of Gaussian PDFs. Substituting p(x | M^t) and p(x | x_n, M^t) for f(x) and g(x), we obtain the surprise measure

    S(x_n, M^t) = D_CS(p(x | M^t), p(x | x_n, M^t)).    (14)

Update MoGs: Once the surprise value is computed, we need to check the three conditions:
• Outlier: S(x_n, M) > T_out; the new sample x_n is an outlier. (Depending on the problem, these samples may be stored in an outlier buffer and used to generate new models.)
• Interesting: T_out ≥ S(x_n, M) ≥ T_red; x_n belongs to the model and is used to update it.
• Redundant: S(x_n, M) < T_red; x_n carries no new information and is discarded.
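Since Eqs. (10)–(13) are fully determined by Gaussian product integrals, the closed-form D_CS is straightforward to implement. The sketch below (our illustration, not the authors' code; the mixture parameters are arbitrary toy values) evaluates Eq. (10) directly from z-style terms N(µ_m | ν_k, Σ_m + Σ_k):

```python
import numpy as np

def gauss_eval(x, mean, cov):
    """Evaluate the multivariate Gaussian density N(x | mean, cov)."""
    d = len(mean)
    diff = np.asarray(x) - np.asarray(mean)
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    return np.exp(expo) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))

def mog_product_integral(w1, mu1, cov1, w2, mu2, cov2):
    """Closed-form integral of the product of two MoGs:
    sum_m sum_k w1_m w2_k N(mu1_m | mu2_k, cov1_m + cov2_k)."""
    return sum(a * b * gauss_eval(m1, m2, c1 + c2)
               for a, m1, c1 in zip(w1, mu1, cov1)
               for b, m2, c2 in zip(w2, mu2, cov2))

def dcs(f, g):
    """Cauchy-Schwarz divergence of Eq. (10): -log(<f,g> / sqrt(<f,f><g,g>))."""
    fg = mog_product_integral(*f, *g)
    ff = mog_product_integral(*f, *f)
    gg = mog_product_integral(*g, *g)
    return -np.log(fg / np.sqrt(ff * gg))

# Toy 2-D mixtures f and g as (weights, means, covariances); values are arbitrary.
f = ([0.5, 0.5], [np.zeros(2), np.array([3.0, 0.0])], [np.eye(2), np.eye(2)])
g = ([0.7, 0.3], [np.array([0.5, 0.5]), np.array([3.0, 1.0])], [np.eye(2), np.eye(2)])

print(abs(dcs(f, f)) < 1e-12)            # True: D_CS(f, f) = 0
print(np.isclose(dcs(f, g), dcs(g, f)))  # True: symmetric
print(dcs(f, g) > 0)                     # True: positive for f != g
```

Computing ∫f g, ∫f², and ∫g² with the same product-integral routine keeps the implementation term-by-term consistent with the three logarithms in Eq. (11), and avoids the gridding cost that makes D_KL impractical in high dimensions.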
Procedure 1 Online MoG Update
  1) Build the Bayesian network structure from the training data
  2) Calculate the model parameters and thresholds
  3) Infer the posterior distribution of the location p(r_i | {x_j | j ∈ object_i})
  4) When a new sample x_n arrives, check the following:
     (a) Calculate the surprise S(x_n, M) from p(x) and p(x | x_n) as in Eq. 14
     (b) IF T_out ≥ S(x_n, M) ≥ T_red:
         – update the MoG parameters
         – update the thresholds
     (c) ELSE
         • Discard x_n, or include it in the outlier buffer
  5) Repeat steps 2-4

The surprise value of a good data sample is very small compared to that of an outlier. The surprise field reiterates this point and illustrates that as data samples move further apart from the library data, their respective surprise values become larger.

Fig. 2. Example of the surprise value on a good data sample and an outlier. The figure shows the surprise field along with the library data. In addition, the surprise factors of a good data sample and an outlier are computed and shown in white and red, respectively.

(a) Surprise Behavior on Clustering

Fig. 4. Original dataset: clusters (good data) and noise (outliers), where the noise is composed of scattered samples and stripes of outliers connected horizontally or vertically.
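The three-way decision in Procedure 1 can be sketched as follows. This is a minimal illustration under assumed names (surprise_fn, update_fn, t_out, t_red), and the toy running-mean "model" stands in for the paper's MoG update:

```python
def process_sample(x, model, surprise_fn, update_fn, t_out, t_red, outlier_buffer):
    """Route one new sample based on its surprise S(x, M)."""
    s = surprise_fn(x, model)
    if s > t_out:                 # outlier: leave the model untouched
        outlier_buffer.append(x)
        return model, "outlier"
    if s >= t_red:                # interesting: adapt the model
        return update_fn(x, model), "interesting"
    return model, "redundant"     # redundant: discard to avoid overfitting

# Toy usage: a running-mean "model" with a distance-based surprise.
buf = []
model = 0.0
surprise = lambda x, m: abs(x - m)
update = lambda x, m: 0.9 * m + 0.1 * x
for x in [0.05, 0.5, 5.0]:
    model, tag = process_sample(x, model, surprise, update,
                                t_out=1.0, t_red=0.1, outlier_buffer=buf)
    print(tag)  # redundant, then interesting, then outlier
```

Only the middle branch touches the model, which is the source of both the speed-up and the overfitting protection discussed above.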
Fig. 3. Though the new sample (represented with the black star) is closer to the left cluster (blue squares), the surprise value of the right cluster (red circles) is smaller due to the larger number of samples it has observed. This is shown in subfigures (b) and (c), respectively (S = 0.025 vs. S = 0.007). To account for this discrepancy due to the different number of samples per cluster, normalized surprise values need to be computed, as shown in subfigures (d) Normalized S.F. on G1 and (e) Normalized S.F. on G2 (S = 2.125 vs. S = 3.704), where, as expected, the surprise of the new sample with respect to the right cluster is clearly much larger.

We analyze the outlier detection problem using the three fundamental learning approaches:

A. Supervised learning

In the supervised approach, we provide the system with prior knowledge of both the good data and the outliers. Anal- … are provided to allow good generalization of the system as both good and outlier data; otherwise, samples that are derived from unknown areas may not be classified correctly, and good …
B. Semi-supervised learning

In the semi-supervised approach, we provide prior knowledge of the clusters as above, but also allow the system to adapt and learn online. As new samples arrive, the model is adjusted to learn from the information provided in the samples. The system identifies the boundary between good data and outliers on its own and continues to update it as new samples arrive. Fig. 7 shows the initial training data. Even though we provide fewer samples than in the supervised case, since the system will update online, this lack of initial samples does not lower the performance of the system as long as the initial samples cover the entire distribution.

Fig. 7. Training data: good data samples to generate an initial model for the semi-supervised case.

Fig. 8 shows the results of the semi-supervised approach. Despite the low number of initial data samples, the results of semi-supervised learning are very good, and in fact better than the supervised approach. This is because the system was able to learn from the new samples and adapt its model to integrate the new information.

C. Unsupervised learning

Finally, we determine the outliers with no prior knowledge of the data. If the data were available offline, the unsupervised approach would be simple, as the samples with the highest surprise values would be considered outliers and the samples with the lowest surprise would be considered good data. Starting from this initial assumption, we could generate an initial model and then utilize the semi-supervised approach to distinguish the rest of the samples that fall in the middle of the surprise spectrum.

However, in the online case, we do not have all the data available to derive an initial model for the system and an outlier/good-data boundary. Since our approach to outlier detection requires that we have an initial distribution model, we utilize Gaussian Mean Shift (GMS) to obtain the initial cluster(s). At the beginning, all the samples are considered outliers and placed in an outlier bin. As the outlier bin grows, GMS is applied to these samples, and when a cluster is present, it is extracted as the initial model. As new samples arrive, they are classified either as part of the model, in which case the system uses them to modify its model, or as outliers, in which case they are placed in the outlier bin. Since most of the samples (even the ones belonging to the cluster) will initially be considered outliers due to the poor representation of the model, they are always placed in the outlier bin instead of being discarded as in the previous two approaches. GMS constantly processes the data in the outlier bin and, whenever possible, extracts additional clusters, which are then fed back to the system. Fig. 9 shows the results of unsupervised learning up to a mid-point through the dataset, and Fig. 10 shows the final results. The unsupervised approach models the clusters well; however, it tends to incorrectly classify some of the outer areas as outliers. This is due to the difficulty GMS has generating new clusters in these areas, and also due to the order of introduction of these samples, which becomes apparent in both Fig. 9 and Fig. 10. In addition, some of the outlier samples that are close to the clusters, especially the mini outlier-clusters, are misclassified as good data due to their vicinity to the actual clusters. Also, it is important to note that this process is more time-consuming than the previous approaches.

D. Discussion

To better understand the performance of surprise on each learning approach, we designed 20 Monte Carlo runs for each approach. The results are summarized in Table I.
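The Gaussian Mean Shift step used to bootstrap the unsupervised model can be illustrated with a generic 1-D mean-shift iteration. This is our sketch, not the authors' implementation; the bandwidth and sample values are arbitrary, and the paper's GMS operates on the multivariate outlier bin:

```python
import math

def mean_shift_mode(points, start, bandwidth, iters=50):
    """Gaussian mean-shift iteration: repeatedly move x to the
    kernel-weighted mean of the samples, converging toward a density mode."""
    x = start
    for _ in range(iters):
        w = [math.exp(-((x - p) ** 2) / (2.0 * bandwidth ** 2)) for p in points]
        x = sum(wi * p for wi, p in zip(w, points)) / sum(w)
    return x

# A 1-D "outlier bin": a tight cluster near 0 plus one distant outlier.
bin_samples = [-0.2, -0.1, 0.0, 0.1, 0.2, 8.0]
mode = mean_shift_mode(bin_samples, start=0.3, bandwidth=0.5)
print(abs(mode) < 0.05)  # True: the iteration settles on the cluster center
```

The distant sample at 8.0 receives negligible kernel weight, so it neither attracts the mode nor joins the extracted cluster, mirroring how samples left in the outlier bin await further evidence.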
Table I shows the percentage of good data that were incorrectly classified as outliers (false negatives) and the percentage of outliers that were incorrectly classified as good data (false positives) for each approach. The supervised approach has the most false negatives because learning is done offline on the training dataset and surprise is not utilized to update the model as new samples arrive. This is confirmed by the low false negatives for both the semi-supervised and unsupervised approaches. The supervised approach has the lowest number of misclassified outliers due to the algorithm being conservative, based on the knowledge extracted from the training data. The semi-supervised approach also performs well on the false positives. For the unsupervised approach, the number of misclassified outliers is much larger because there exists no prior knowledge about the clusters and outliers, and the system tends to misclassify the outliers that are close to the clusters, especially the mini outlier-clusters, as mentioned above.

TABLE I
PERCENTAGE OF GOOD DATA AND OUTLIERS INCORRECTLY CLASSIFIED IN THE THREE LEARNING APPROACHES

    Incorrect (%)   Supervised   Semi-supervised   Unsupervised
    Good Data       4.86%        1.24%             1.67%
    Outliers        1.32%        1.46%             3.79%

In the supervised approach, the training samples are very important because they are the only information that the system has about the environment. As a result, a good representation of the overall distribution is required; otherwise, the system will not perform well. In the semi-supervised approach, the effect of the initial training data is not as drastic, but it will still affect the performance of the system. If the initial training data provide information about only parts of the distribution, the system will suffer and not be able to adapt as easily. Fig. 11 shows an example where the initial data samples do not contain any information about the left circular cluster. In addition, the top horizontal strip is split into two clusters, with a large portion considered outliers because of the lack of initial samples in that area.

Fig. 9. Results of the unsupervised approach at a mid point in the dataset. The dataset is shown in black dots, the cluster samples are shown in blue circles, and the currently identified outliers are shown in red circles.

Fig. 10. Results of the unsupervised approach. The system is able to learn most of the cluster model and correctly identify most of the outliers.

Fig. 11. Results of the semi-supervised approach when the initial learning data do not provide a good representation of the whole distribution.

In addition to the initial data samples, the order in which the new samples arrive will also affect the final outcome of the system. The model is adapted based on the information available in the new samples and how these samples are categorized. So, if a new sample belonging to a cluster is far from the present distribution state, it will be considered an outlier. To avoid such problems, especially in the unsupervised approach where the system model is very minimal, we present the data samples in order from left to right and top to bottom. Since surprise is a subjective measure, its performance will depend on the state of the system. An alternative way to alleviate the problem of incorrectly assigning good data samples as outliers would be to store the outlier data and, at regular intervals, retest them on the updated system model.

To decide the outlier and redundant-data threshold values, we run two tests:
• Outlier threshold: In the case where multiple models are present, randomly select several samples from every other model and compute their surprise. Since these samples belong to other models, they are considered outliers for the current model. Obtain the lowest surprise value and set that as the outlier threshold. In the case where only one model is present, select the sample with the highest surprise value and set that as the outlier threshold.
• Redundant-data threshold: Randomly select a few samples from the model, slightly perturb them, and compute their surprise. Obtain the highest surprise value. This indicates the surprise value of samples already known to the system, carrying no new knowledge. Set it as the redundant-data threshold.

The surprise value of new samples that could belong to the model should be less than the high (outlier) threshold, and that of samples carrying useful information should be higher than the low (redundant-data) threshold.

V. CONCLUSIONS AND DISCUSSIONS

Pattern recognition algorithms, in general, require that a training dataset be available to extract features and generate a model of the environment. In real-world problems, such a dataset is not available ahead of time or, if it is, is very limited. Usually, samples become available through time, and as a result, the system models initially generated need to adapt to incorporate the new information found in these new samples. The goal of this paper was to create an evolving system that uses online learning to update its model. However, rather than allowing every sample to modify the model, a surprise metric was introduced to determine when an update was necessary. This prevents outliers from drastically changing the system state and avoids overfitting from redundant data. The Bayesian framework was utilized to compute the prior and posterior PDFs, and thus derive surprise.

In the experiments section, we showed results following all three learning procedures: supervised, semi-supervised, and unsupervised learning. The supervised approach simply trained the system based on the initial labeled data. While the surprise of the new samples was computed to classify them, it was not further utilized to update the system model. The semi-supervised approach extended the supervised procedure by creating an initial model based on the labeled data, but then further improved the system by extracting information from the new samples. This approach takes full advantage of our method. The unsupervised approach requires additional assistance to generate an initial model for the system; Gaussian Mean Shift was utilized to cluster the unlabeled data.

The surprise metric is based on subjective measures of information, and thus the surprise value of a sample depends on the present state of the system. Since the model of the system depends on the samples it has observed, the outcome of the system will always depend on the order in which the samples are introduced to the system. This becomes more pronounced in sparse datasets.

While our current model is designed so as to not forget previous information, in some online problems where the data evolve through time, it is important that, while adapting the model to fit new samples, the system also forget the information obtained from samples far in the past. Our method can accommodate this requirement by modifying the mixture of Gaussians model. This will require storing the data samples that were used to generate the MoGs and weighting them based on the time they were introduced to the system. Most of the methods that accommodate this requirement utilize a sliding window for the samples that need to be kept. While this approach is simple, it is time-consuming when the MoGs are updated with every sample, or not very adaptive when they are updated only after a fixed number of samples are introduced. Our approach of computing surprise, and more specifically of distinguishing between redundant and interesting samples, provides an advantage here, as we need to update the MoGs only when interesting data samples are available, thus speeding up the algorithm and also making it more dynamically adaptable.

ACKNOWLEDGMENT

This work was supported by the Office of Naval Research, Award Number: N00014-10-1-0101.

REFERENCES

[1] G. A. Carpenter and S. Grossberg, "The ART of adaptive pattern recognition by a self-organizing neural network," Computer, vol. 21, pp. 77–88, March 1988.
[2] T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[3] C. E. Shannon, "A mathematical theory of communication," SIGMOBILE Mob. Comput. Commun. Rev., vol. 5, pp. 3–55, 2001.
[4] M. Belis and S. Guiasu, "A qualitative-quantitative measure of information in cybernetic systems," IEEE Transactions on Information Theory, vol. 14, pp. 593–594, 1968.
[5] E. Pfaffelhuber, "Learning and information theory," International Journal of Neuroscience, vol. 3, no. 2, pp. 83–88, 1972.
[6] ——, "Information-theoretic stability and evolution criteria in irreversible thermodynamics," Journal of Statistical Physics, vol. 16, no. 1, pp. 69–90, 1977.
[7] S. Kullback, Information Theory and Statistics. Dover, 1968.
[8] L. Itti and P. F. Baldi, "Bayesian surprise attracts human attention," in Advances in Neural Information Processing Systems, Vol. 19 (NIPS*2005). Cambridge, MA: MIT Press, 2006, pp. 547–554.
[9] J. C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. New York, NY: Springer, 2010.
[10] K. Kampa, E. Hasanbelliu, and J. Principe, "Closed-form Cauchy-Schwarz PDF divergence for mixture of Gaussians," in Neural Networks (IJCNN), The 2011 International Joint Conference on, 2011, pp. 2578–2585.
[11] R. Jenssen, D. Erdogmus, K. E. Hild II, J. C. Principe, and T. Eltoft, "Optimizing the Cauchy-Schwarz PDF distance for information theoretic, non-parametric clustering," in EMMCVPR, 2005, pp. 34–45.
[12] J. Almeida, L. Barbosa, A. Pais, and S. Formosinho, "Improving hierarchical cluster analysis: A new method with outlier detection and automatic clustering," Chemometrics and Intelligent Laboratory Systems, vol. 87, no. 2, pp. 208–217, 2007.