
Dipartimento di Ingegneria Biofisica ed Elettronica
Università di Genova
Prof. Sebastiano B. Serpico

3. Supervised estimation of probability density functions

Supervised Classifier Design

Approach 1: the training samples of the classes {ω_i} are used to estimate the class-conditional pdfs {p(x|ω_i)}; decision theory then provides the decision rule, which is applied to the data set to produce the data-set classification.

Approach 2: the training samples are used directly to train a non-Bayesian classifier; the resulting decision rule is then applied to the data set to produce the data-set classification.

Supervised Estimation of a pdf

The use of decision theory to design classifiers requires a preliminary estimate of the class-conditional pdfs. In a supervised approach, the estimation of the pdf p(x|ω_i) can be performed on the basis of the training data of the class ω_i.

Problem definition and notation:
- Consider a feature vector x with (unknown) pdf p(x) and a finite set X = {x_1, x_2, ..., x_N} of N independent samples drawn from this pdf.
- We would like to compute, on the basis of the available samples, an estimated pdf \hat{p}(x).
- In order to perform supervised classification, the estimation process has to be repeated for each class: in particular, to estimate p(x|ω_i), we assume that the set X corresponds to the set of training samples of the class ω_i.

Approaches to pdf Estimation

Parametric Estimation:
- a given model (e.g., Gaussian, exponential, ...) is assumed for the analytical form of p(x);
- the parameters of this model are estimated.
Remarks:
- a given model might not be physically realistic;
- most parametric methods assume single-mode pdfs, while many real problems involve multimodal pdfs;
- complex methods (not considered here) have been developed to identify the different modes of a pdf.

Non-Parametric Estimation:
- no analytical model is assumed for the pdf; p(x) is estimated directly from the samples in X.
Remarks:
- typically, the lack of predefined models allows more flexibility;
- however, the computational complexity of the estimation problem is generally higher than in the parametric case.

Parametric Estimation

Given an analytical model of the pdf p(x) to be estimated, the parameters that characterize the model are collected into a vector θ ∈ ℝ^r.

We highlight the dependence on the parameters by adopting the notation p(x|θ); in particular, p(x|θ), considered as a function of θ, is called the likelihood function.

The training samples x_1, x_2, ..., x_N are collected into a single vector of observations X. The samples are considered as random vectors, and a pdf p(X|θ) is associated with them. Usually the samples are assumed to be identically distributed (because they are all drawn from the same pdf p(x)) and independent of each other (i.i.d. random vectors, that is, independent and identically distributed); then:

p(X|θ) = \prod_{k=1}^{N} p(x_k|θ)

General Properties of Estimates

General properties
- The estimate of the parameter vector depends on the observation vector: \hat{θ} = \hat{θ}(X). Therefore the estimate is itself a random vector.

Bias
- The estimation error is ε = ε(X, θ) = \hat{θ} - θ, and its expected value E{ε} is called the bias.
- The estimate is said to be unbiased if, for each parameter vector θ, the estimation error ε has zero mean:

E{ε} = 0, or equivalently E{\hat{θ}} = θ

For the estimate of the parameter θ_i (i = 1, 2, ..., r) to be good, we want the estimation error ε_i (the i-th component of ε) to have zero mean (i.e., the estimate should be unbiased), but it should also have a small variance var{ε_i}.

Variance of the Estimation Error

Cramér-Rao Inequality:
For each unbiased estimate of the vector θ, it holds that

var{ε_i} ≥ [J^{-1}(θ)]_{ii},  i = 1, 2, ..., r

where J(θ) = E{∇_θ ln p(X|θ) ∇_θ ln p(X|θ)^t} is the Fisher information matrix, with elements

[J(θ)]_{ij} = E{ (∂ ln p(X|θ)/∂θ_i) (∂ ln p(X|θ)/∂θ_j) }

The Cramér-Rao inequality provides a lower bound on the variance of the estimation error:
- var{ε_i} cannot be made arbitrarily small; it is always lower bounded by [J^{-1}(θ)]_{ii}. The Fisher information matrix therefore measures how good an estimate can be.
- In particular, an unbiased estimate that satisfies the equality for each parameter vector θ is said to be efficient.

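As a purely illustrative sketch (not from the slides), the following Python/NumPy snippet checks the Cramér-Rao bound by Monte Carlo for the mean of a one-dimensional Gaussian with known variance σ²: the Fisher information of N i.i.d. observations is N/σ², so no unbiased estimator can have variance below σ²/N, and the sample mean attains this bound. The numerical values are arbitrary choices.

```python
import numpy as np

# Monte Carlo check of the Cramer-Rao bound for the mean of N(m, sigma^2), sigma known.
# Fisher information of N i.i.d. samples: J = N / sigma^2, so var{m_hat} >= sigma^2 / N.
rng = np.random.default_rng(0)
m_true, sigma, N, trials = 2.0, 1.5, 50, 20000     # illustrative values

samples = rng.normal(m_true, sigma, size=(trials, N))
m_hat = samples.mean(axis=1)                       # sample mean: an unbiased estimator

print("empirical variance of the estimator:", m_hat.var())
print("Cramer-Rao lower bound sigma^2 / N :", sigma**2 / N)   # the two values nearly coincide
```
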
Asymptotic Properties of Estimates

Often biased and/or non-efficient estimates are used, provided they exhibit a good behavior for large values of N.

- An estimate is called asymptotically unbiased if the error mean goes to zero for N → +∞:

lim_{N→+∞} E{ε} = 0, that is, lim_{N→+∞} E{\hat{θ}} = θ

- An estimate is said to be asymptotically efficient if the error variance approaches the Cramér-Rao lower bound for N → +∞:

lim_{N→+∞} var{ε_i} / [J^{-1}(θ)]_{ii} = 1,  i = 1, 2, ..., r

- An estimate is said to be consistent if it converges to the true value in probability for N → +∞:

lim_{N→+∞} P{ ||ε|| < δ } = 1,  ∀ δ > 0

A sufficient condition for an estimate to be consistent is that it is asymptotically unbiased and that the estimation error has infinitesimal variance for N → +∞ [Mendel, 1987].

ML Estimation

Definition
The Maximum Likelihood (ML) estimate of the vector θ is defined as:

\hat{θ} = argmax_θ p(X|θ)

Remarks
- For different values of θ, different pdfs are obtained, each evaluated at the observations X. The pdf taking the maximum value at X is identified: the ML estimate is the value of θ that produces this pdf.
- It is often advantageous to maximize not the likelihood function p(X|θ) but (equivalently) the log-likelihood function:

\hat{θ} = argmax_θ ln p(X|θ)

ML Estimation: Example

ML estimation of the mean of a one-dimensional Gaussian with known variance (equal to one), starting from a single observed sample x_0. Considered as a function of m, the likelihood p(x_0|m) is a Gaussian centered at x_0, so it is maximized by

\hat{m} = x_0

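As an illustration only (not part of the slides), this NumPy sketch maximizes the Gaussian log-likelihood numerically over a grid of candidate means; with a single observation the maximum sits at the observed sample itself, and with several i.i.d. samples it sits at their sample mean. The observed value 1.3 is an arbitrary choice.

```python
import numpy as np

def log_likelihood(m, x):
    """Log-likelihood of the mean m for i.i.d. samples x drawn from N(m, 1)."""
    return -0.5 * np.sum((x - m) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)

x = np.array([1.3])                      # a single observed sample x_0 (arbitrary value)
grid = np.linspace(-6, 6, 2401)          # candidate values of the mean
m_ml = grid[np.argmax([log_likelihood(m, x) for m in grid])]
print(m_ml)                              # ~1.3: the ML estimate equals the observed sample
```
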
Properties of the ML Estimate

Under mild assumptions on the function p(X|θ), it can be proven that, if an efficient estimate exists and the ML estimate is unbiased, then the efficient estimate is the ML estimate.

Even when an efficient estimate does not exist, the ML estimate exhibits good asymptotic properties. In particular, the ML estimate is:
- asymptotically unbiased;
- asymptotically efficient;
- consistent;
- asymptotically Gaussian, i.e., for large N the estimate \hat{θ} is approximately Gaussian-distributed, with mean θ and covariance given by the Cramér-Rao bound (which decreases as 1/N).

These properties explain the wide diffusion of ML estimators in classification methods.

Parametric Gaussian Estimation

Let us consider the problem of parametric pdf estimation in the case of a Gaussian model.

Let us assume an N(m, Σ) model. The parameters to be estimated are the mean vector and the covariance matrix:

p(x|m, Σ) = \frac{1}{(2π)^{n/2} |Σ|^{1/2}} exp[ -\frac{1}{2} (x - m)^t Σ^{-1} (x - m) ]

where:

m = E{x},   Σ = Cov{x} = E{(x - m)(x - m)^t} = E{x x^t} - m m^t

It can be proven that, if the training samples x_1, x_2, ..., x_N are i.i.d., then the ML estimates of the mean and of the covariance matrix are:

\hat{m} = \frac{1}{N} \sum_{k=1}^{N} x_k,   \hat{Σ} = \frac{1}{N} \sum_{k=1}^{N} (x_k - \hat{m})(x_k - \hat{m})^t = \frac{1}{N} \sum_{k=1}^{N} x_k x_k^t - \hat{m} \hat{m}^t

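The following NumPy sketch (an illustration, not part of the slides) computes these ML estimates for a synthetic data set; np.cov with bias=True reproduces the same covariance with denominator N.

```python
import numpy as np

def gaussian_ml_estimates(X):
    """ML estimates of mean and covariance from an (N, n) array of i.i.d. samples."""
    m_hat = X.mean(axis=0)
    D = X - m_hat                               # deviations from the sample mean
    sigma_hat = D.T @ D / X.shape[0]            # denominator N (biased ML estimate)
    return m_hat, sigma_hat

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)
m_hat, sigma_hat = gaussian_ml_estimates(X)
print(m_hat, sigma_hat, sep="\n")
print(np.allclose(sigma_hat, np.cov(X.T, bias=True)))   # True: same covariance estimate
```
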
Properties of the Parametric Gaussian Estimates

The estimates of m and Σ, being ML estimates, are asymptotically unbiased, asymptotically efficient and consistent. Moreover, the following additional properties hold:

- The estimate of m is unbiased, while the estimate of Σ is biased:

E{\hat{m}} = m,   E{\hat{Σ}} = \frac{N - 1}{N} Σ

- Therefore, the estimate of Σ is usually modified as follows:

\hat{Σ}' = \frac{1}{N - 1} \sum_{k=1}^{N} (x_k - \hat{m})(x_k - \hat{m})^t

- The two estimates coincide for N → +∞ (which is consistent with the fact that ML estimates are asymptotically unbiased).

The estimates introduced for the mean and the covariance matrix are generally called the sample mean and the sample covariance.

Iterative Expressions of the Gaussian Parametric Estimates

The estimates of m and Σ can also be expressed in an iterative form, using each sample in sequence instead of the whole training set at once.

Iterative computation of the sample mean:

\hat{m}^{(k+1)} = \frac{k \hat{m}^{(k)} + x_{k+1}}{k + 1},  k = 1, 2, ..., N - 1;   \hat{m}^{(1)} = x_1;   \hat{m} = \hat{m}^{(N)}

Iterative computation of the sample covariance (referred to the expression with denominator N), where S^{(k)} = \frac{1}{k} \sum_{h=1}^{k} x_h x_h^t:

S^{(k+1)} = \frac{k S^{(k)} + x_{k+1} x_{k+1}^t}{k + 1},  k = 1, 2, ..., N - 1;   S^{(1)} = x_1 x_1^t;   \hat{Σ}^{(k)} = S^{(k)} - \hat{m}^{(k)} \hat{m}^{(k)t};   \hat{Σ} = \hat{Σ}^{(N)}

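A minimal sketch of these recursive updates (illustrative, not from the slides); after all N samples are processed it reproduces the batch sample mean and the covariance with denominator N.

```python
import numpy as np

def iterative_gaussian_estimates(X):
    """Recursive sample mean and covariance (denominator N), one sample at a time."""
    m = X[0].astype(float).copy()                   # m^(1) = x_1
    S = np.outer(X[0], X[0]).astype(float)          # S^(1) = x_1 x_1^t
    for k, x in enumerate(X[1:], start=1):          # k = 1, ..., N-1
        m = (k * m + x) / (k + 1)                   # m^(k+1)
        S = (k * S + np.outer(x, x)) / (k + 1)      # S^(k+1)
    return m, S - np.outer(m, m)                    # Sigma_hat = S^(N) - m m^t

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
m_it, C_it = iterative_gaussian_estimates(X)
print(np.allclose(m_it, X.mean(axis=0)))            # True: matches the batch sample mean
print(np.allclose(C_it, np.cov(X.T, bias=True)))    # True: matches the batch covariance (denominator N)
```
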
Example (1/2)

Given n = 3 features and two classes ω_1 and ω_2, characterized by the following training sets (each sample listed as (x_1, x_2, x_3)):

ω_1: (0, 0, 0), (1, 1, 0), (1, 0, 0), (1, 0, 1)
ω_2: (0, 0, 1), (0, 1, 1), (1, 1, 1), (0, 1, 0)

In this case it makes no sense to normalize the features: the features are binary, and the samples of the two classes together cover all possible combinations of the three binary features.

We assume the class-conditional pdfs are Gaussian and apply ML estimation. It is necessary to estimate the mean vector and the covariance matrix for each class. Let us see the computation for the class ω_1.

The estimated mean is:

\hat{m}_1 = \frac{1}{4} [ (0, 0, 0)^t + (1, 1, 0)^t + (1, 0, 0)^t + (1, 0, 1)^t ] = (3/4, 1/4, 1/4)^t

Example (2/2)

Estimation of the covariance matrix for ω_1. Let us use N = 4 as the denominator, thus obtaining a biased estimate. Using N - 1 = 3 would give an unbiased estimate; however, for large N (say N > 30), 1/(N - 1) ≈ 1/N.

\hat{Σ}_1 = \frac{1}{4} \sum_{k=1}^{4} (x_k - \hat{m}_1)(x_k - \hat{m}_1)^t

with deviations (x_k - \hat{m}_1) equal to (-3/4, -1/4, -1/4)^t, (1/4, 3/4, -1/4)^t, (1/4, -1/4, -1/4)^t and (1/4, -1/4, 3/4)^t. Summing the four outer products and dividing by 4:

\hat{Σ}_1 = \frac{1}{64} \begin{bmatrix} 12 & 4 & 4 \\ 4 & 12 & -4 \\ 4 & -4 & 12 \end{bmatrix} = \frac{1}{16} \begin{bmatrix} 3 & 1 & 1 \\ 1 & 3 & -1 \\ 1 & -1 & 3 \end{bmatrix}

In this case (not in general!) the same computation for the second class gives \hat{Σ}_2 = \hat{Σ}_1.

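A quick numerical check of this example with NumPy (illustrative, not part of the slides):

```python
import numpy as np

w1 = np.array([[0, 0, 0], [1, 1, 0], [1, 0, 0], [1, 0, 1]], dtype=float)
w2 = np.array([[0, 0, 1], [0, 1, 1], [1, 1, 1], [0, 1, 0]], dtype=float)

for X in (w1, w2):
    m_hat = X.mean(axis=0)
    sigma_hat = np.cov(X.T, bias=True)     # ML estimate with denominator N = 4
    print(m_hat)
    print(16 * sigma_hat)                  # [[3, 1, 1], [1, 3, -1], [1, -1, 3]] for both classes
```
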
Non-Gaussian Parametric Estimation

When a Gaussian model does not appear accurate for the considered problem, other parametric models can be adopted.

- In the case n > 1, an extension of the Gaussian model is given by the elliptically contoured pdfs:

p(x) = |Σ|^{-1/2} f[ (x - m)^t Σ^{-1} (x - m) ]

where m = E{x}, Σ = Cov{x} and f is an appropriate non-negative function. The level curves of such pdfs are hyperellipses, as in the Gaussian case.

- In the case n = 1, very general models are Pearson's pdfs, which, as their parameters vary, include uniform and Gaussian pdfs and also impulsive models with vertical asymptotes.

Non-Parametric Estimation: Problem Definition (1/2)

In a non-parametric context the estimate of the unknown pdf p(x) is not restricted to satisfy any predefined model; it is built directly from the training samples x_1, x_2, ..., x_N (assumed i.i.d.).

Let x* be a generic point and R a predefined region of the feature space containing x*. Assuming that the true pdf p(x) is continuous and that R is small enough that p(x) does not vary significantly over R, we have:

P_R = P{x ∈ R} = \int_R p(x) dx ≈ p(x*) V

where V is the n-dimensional volume (measure) of R.

If K is the number of training samples belonging to R (out of a total of N training samples), a consistent estimate of the probability P_R is the relative frequency:

\hat{P}_R = \frac{K}{N},   with   lim_{N→+∞} P{ |\hat{P}_R - P_R| < δ } = 1, ∀ δ > 0   (law of large numbers)

Non-Parametric Estimation: Problem Definition (2/2)

Pdf estimation
From the estimate of the probability P_R that a sample belongs to R, we can derive an estimate of the pdf at the point x*:

\hat{p}(x*) = \frac{\hat{P}_R}{V} = \frac{K}{N V}

Remarks
- R has to be large enough to contain a number of training samples that justifies the application of the law of large numbers;
- R has to be small enough to justify the hypothesis that p(x) does not vary significantly over R.

To obtain an accurate estimate, a compromise between these two needs is therefore necessary. However, a good compromise cannot be reached if the total number N of samples in the training set is small.

Two Non-Parametric Approaches

By exchanging the roles of the quantities K and V, the above reasoning leads to two possible approaches to non-parametric estimation:

- the k-nearest-neighbor approach: K is fixed; for a given point x of the feature space, the region R containing the K training samples nearest to x is identified, its hypervolume V is computed, and the estimate of the pdf is deduced;
- the Parzen-window approach: a region R centered at x, with fixed hypervolume V, is considered; K is computed by counting the training samples falling in R, and the estimate is then derived.

It is possible to prove that both approaches lead to consistent estimates. However, it is not possible to draw general conclusions about their behavior in a real context, characterized by a finite number of training samples.

K-Nearest-Neighbor Estimation

Hypotheses
- The number K of training samples to be enclosed in the cell is fixed in advance.
- A reference cell (e.g., a sphere) centered at x* is considered.

Methodology
- The k-nearest-neighbor (k-nn) estimator expands the cell until it contains exactly K training samples: V_K(x*) is the volume of the resulting cell.
- The pdf at the point x* is estimated as:

\hat{p}(x*) = \frac{K}{N V_K(x*)}

It can be proved that, choosing K as a function of N (K = K_N), a necessary and sufficient condition for the k-nn estimate to be consistent at all points where p(x) is continuous is that K_N → +∞ for N → +∞, but more slowly than N, i.e., K_N/N → 0 (e.g., K_N = N^{1/2}).

Remarks on the k-nn Method

Typically the cell used with k-nn is a hypersphere; the k-nn estimate is then based on the following steps:
- identify the K training samples closest to the considered point x* (with respect to the Euclidean metric);
- identify the radius r of the smallest hypersphere that, centered at x*, includes these K samples (r coincides with the distance from x* to the farthest of the K samples);
- compute the volume of the n-dimensional hypersphere of radius r and then the value of the estimate \hat{p}(x*).

Disadvantages
- The pdf estimated by the k-nn method is not a true pdf, since its integral diverges because of the singularities due to the term V_K(x*) in the denominator (e.g., for K = 1, V_1(x_k) = 0 for k = 1, 2, ..., N).
- The k-nn estimate is computationally heavy, even though ad hoc techniques have been proposed to decrease the computational burden (e.g., KD-trees).

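A minimal sketch of the k-nn density estimate (illustrative, not from the slides), using brute-force Euclidean distances and the volume of the n-dimensional hypersphere, V_n(r) = π^{n/2} r^n / Γ(n/2 + 1); a KD-tree could replace the sort for efficiency.

```python
import numpy as np
from math import gamma, pi

def knn_density(x_star, X, K):
    """k-nn estimate p_hat(x*) = K / (N * V_K(x*)) with a Euclidean hypersphere."""
    N, n = X.shape
    r = np.sort(np.linalg.norm(X - x_star, axis=1))[K - 1]   # distance to the K-th nearest sample
    V = (pi ** (n / 2) / gamma(n / 2 + 1)) * r ** n           # volume of the hypersphere of radius r
    return K / (N * V)

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 2))                 # training samples from a 2D standard Gaussian
K = int(np.sqrt(X.shape[0]))                   # K_N ~ sqrt(N), as suggested above
print(knn_density(np.zeros(2), X, K))          # roughly 1/(2*pi) ~ 0.159, the true pdf at the origin
```
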
Parzen-Window Estimation: Introduction

Hypotheses and notation
- Suppose R is an n-dimensional hypercube with side h (and hence volume V = h^n), centered at the point x*.
- Introduce the following rectangular function:

φ(x) = 1 if x falls inside the hypercube with unit side centered at the origin, 0 elsewhere.

Introduction to the method
- The training sample x_k belongs to the hypercube R with center x* and side h if φ((x_k - x*)/h) = 1, otherwise φ((x_k - x*)/h) = 0. Then the number of training samples that fall into R is:

K = \sum_{k=1}^{N} φ( \frac{x_k - x*}{h} )

- Consequently, the estimate can be computed as:

\hat{p}(x*) = \frac{K}{N V} = \frac{1}{N h^n} \sum_{k=1}^{N} φ( \frac{x_k - x*}{h} )

Parzen-Window Estimation

The estimate just illustrated, based on counting the number of training samples included in a fixed volume, can be interpreted as the superposition of rectangular contributions, each associated with a single sample.

To obtain more regular estimates (the rectangular function is discontinuous), the previous expression is generalized: the pdf estimate is expressed as the sum of N contributions, one per sample, where each contribution is given by a function φ(·) that is in general not rectangular but varies continuously. The following estimate is obtained:

\hat{p}(x) = \frac{1}{N h^n} \sum_{k=1}^{N} φ( \frac{x - x_k}{h} )

The function φ(·) is called the Parzen window or kernel, and the parameter h is the width of the window (or of the kernel).

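A minimal sketch of this estimator with a spherically symmetric Gaussian kernel (an illustration under assumed settings, not part of the slides):

```python
import numpy as np

def parzen_estimate(x, X, h):
    """Parzen-window estimate at point x with an n-dimensional Gaussian kernel of width h."""
    N, n = X.shape
    u = (x - X) / h                                             # (x - x_k) / h for all k
    phi = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (n / 2)
    return phi.sum() / (N * h ** n)

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 1))                  # training samples from a 1D standard Gaussian
h = 1.0 / np.sqrt(len(X))                       # h_N = h_1 / sqrt(N), as in the later examples
print(parzen_estimate(np.array([0.0]), X, h))   # roughly 1/sqrt(2*pi) ~ 0.399, the true pdf at 0
```
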
Features of the Kernel Function

For the Parzen-window estimate to make sense, it is necessary to impose restrictions on the kernel φ(·).

A necessary and sufficient condition for the Parzen-window estimate to be a pdf is that the kernel function itself be a pdf (i.e., a non-negative and normalized function):

φ(x) ≥ 0 ∀ x ∈ ℝ^n,   \int_{ℝ^n} φ(x) dx = 1

Moreover, some further conditions are usually imposed with the aim of obtaining a good estimate:
- φ(·) takes its global maximum at 0;
- φ(·) is continuous (this guarantees that the estimate does not vary abruptly or have discontinuities);
- φ(x) is infinitesimal for ||x|| → ∞ (so that the effect of a sample vanishes at large distances from the sample itself):

lim_{||x||→∞} φ(x) = 0

Examples of Kernel Functions for n = 1

- Rectangular kernel: φ(x) = Π(x) = 1 for |x| ≤ 1/2, 0 elsewhere. (This kernel does not satisfy the continuity condition.)
- Triangular kernel: φ(x) = Λ(x) = 1 - |x| for |x| ≤ 1, 0 elsewhere.
- Gaussian kernel: φ(x) = \frac{1}{\sqrt{2π}} exp( -x²/2 )
- Exponential kernel: φ(x) = \frac{1}{2} exp( -|x| )
- Cauchy kernel: φ(x) = \frac{1}{π} \frac{1}{1 + x²}
- sinc² kernel: φ(x) = \frac{1}{2π} ( \frac{sin(x/2)}{x/2} )²

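For illustration (not from the slides), the kernels listed above written as Python functions, with a numerical check that each integrates to approximately one over a wide grid:

```python
import numpy as np

kernels = {
    "rectangular": lambda x: np.where(np.abs(x) <= 0.5, 1.0, 0.0),
    "triangular":  lambda x: np.maximum(1.0 - np.abs(x), 0.0),
    "gaussian":    lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi),
    "exponential": lambda x: 0.5 * np.exp(-np.abs(x)),
    "cauchy":      lambda x: 1.0 / (np.pi * (1.0 + x ** 2)),
    "sinc2":       lambda x: np.sinc(x / (2 * np.pi)) ** 2 / (2 * np.pi),  # (sin(x/2)/(x/2))^2 / (2*pi)
}

x = np.linspace(-200, 200, 400001)
dx = x[1] - x[0]
for name, phi in kernels.items():
    print(name, phi(x).sum() * dx)   # each value is close to 1 (the heavy-tailed kernels converge slowly)
```
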
Remarks on the Parzen-Window Estimate

Multidimensional case
Often, in multidimensional feature spaces (n > 1), the choice of the kernel function is reduced to the one-dimensional case by adopting

φ(x) = φ'(||x||)   (up to a normalization constant)

where φ'(·) is a one-dimensional kernel function (e.g., one of those listed on the previous slide). In other words, the (multidimensional) kernel φ(·) has spherical symmetry, and its behavior moving outward from the center is derived from φ'(·).

Properties of the Parzen-window estimate
- It can be proved that, in general, the Parzen-window estimate is biased.
- However, choosing the width h of the kernel as a function of the number N of training samples (i.e., h = h_N) and imposing that {h_N} be an infinitesimal sequence of order smaller than 1/n (so that N h_N^n → +∞), the Parzen-window estimate becomes asymptotically unbiased and consistent (e.g., h_N = N^{-1/(2n)}).

Parzen-Window Estimation with a Finite Number of Samples

The asymptotic properties of the Parzen-window estimate are derived by letting the number of training samples go to infinity, which is obviously not realistic.

- With a finite training set, as h → 0 the estimate tends to a sequence of Dirac pulses centered on the individual samples, and thus exhibits excessive variability. If, instead, h is too large, excessive smoothing results.
- Therefore, the application of the method requires a large number of training samples, an adequate choice of the kernel function, and a compromise choice of the value of h.
- Automatic algorithms exist (not described in this course) for the optimization [Scott, Terrell, 1987] [Sain et al., 1994] or even the adaptive optimization [Mucciardi, Gose, 1970] of the kernel width.

Further Remarks on the Parzen-Window Estimate

Computational complexity
Like the k-nn estimate, the Parzen-window estimate is computationally heavy. However, approaches exist to reduce its complexity (not presented in this course).

Probabilistic Neural Networks
The Parzen-window estimate with multidimensional, spherically symmetric Gaussian kernels can be implemented by means of a neural architecture called the Probabilistic Neural Network (PNN) [Specht, 1990].

Example (1)

Parzen-window estimates of a one-dimensional Gaussian pdf using a Gaussian kernel N(0, 1). Since n = 1, we have considered

h_N = \frac{h_1}{\sqrt{N}}

with constant h_1 > 0.

Example (2)

Parzen-window estimates of a bimodal one-dimensional pdf (one uniform and one triangular mode) using Gaussian kernels N(0, 1). The same expression of h_N as for the 1D Gaussian pdf has been adopted.

Example (3)

Parzen-window estimates of a bidimensional Gaussian pdf using Gaussian kernels N(0, I). Since n = 2, we have considered

h_N = \frac{h_1}{N^{1/4}}

with constant h_1 > 0.

Orthogonal Function Approximations

An alternative non-parametric approach is the functional approximation of the unknown pdf by means of a system of orthonormal basis functions.

Let A ⊂ ℝ^n (assumed measurable) be the set where x takes its values.

The method aims at defining an estimate in the space of the (real) functions having finite energy over A. On this space, a scalar product and the corresponding norm are defined:

⟨f, g⟩ = \int_A f(x) g(x) dx,   ||f||² = ⟨f, f⟩ = \int_A f²(x) dx

Let U = {φ_1, φ_2, ..., φ_m} be a finite set of orthonormal functions:

⟨φ_i, φ_j⟩ = δ_{ij} = 1 if i = j (in particular, ||φ_i|| = 1), 0 if i ≠ j

We look for an estimate of the unknown pdf in the space spanned by the m basis functions φ_1, φ_2, ..., φ_m:

\hat{p}(x) = \sum_{i=1}^{m} c_i φ_i(x),   c_1, c_2, ..., c_m ∈ ℝ

34
In particular, the estimate that presents the minimum mean
quadratic error wrt the true pdf in the space of the m basis
functions is searched for.
Therefore, the minimization of the following functional is
considered:
The functional to minimize is a quadratic form into the coefficients
c
1
, c
2
, , c
m
. By imposing a simple condition of stationarity (null
gradient) we obtain:

Minimization of the Quadratic Error
( )
= = =
= = = =
=
= = =
= + =
= +

2
2
1 1 1
1 1 1 1
2
2
1

( ) ,
, , , ,
2 ,
m m m
i i i i j j
i i j
m m m m
i j i j i i j j
i j i j
m
i i i
i
p p p c p c p c p
c c c p c p p p
c p c p
=
c
= = = =
c

1

0 , , 1, 2,..., ,
m
i i i i
i i
c p i m p p
c
35
Computation of the Optimal Coefficients
Estimation of the coefficients of the expansion
Taking into account that p(x) is a pdf defined over A, we can
estimate the scalar product (p,
i
) (i = 1, 2, ..., m) by using a set of
training samples :
Approximation error
Increasing the number of the basis-functions m, we can obtain
estimations with smaller and smaller approximation errors: for m
+ , we expect an infinitesimal error.
In fact, the existence of complete orthonormal bases can be
demonstrated, that is, sequences of orthonormal functions {
i
(): i =
1, 2, ...} such that any f function with finite energy can be
expanded as:
=
= = = =

}
1
1

, ( ) ( ) { ( )} ( )
N
i i i i i i k
k
A
c p p d E c
N
x x x x x

=
=

1
,
i i
i
f f
The series converges in
quadratic mean (with respect
to the introduced norm)
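A small sketch of this coefficient estimation (illustrative only; the target density, its rejection sampler and the truncation at five basis functions are assumptions, not from the slides), using the orthonormal trigonometric basis on [0, 2π] described on the next slide:

```python
import numpy as np

# Orthonormal trigonometric basis on [0, 2*pi].
basis = [
    lambda x: np.full_like(x, 1 / np.sqrt(2 * np.pi)),
    lambda x: np.cos(x) / np.sqrt(np.pi),
    lambda x: np.sin(x) / np.sqrt(np.pi),
    lambda x: np.cos(2 * x) / np.sqrt(np.pi),
    lambda x: np.sin(2 * x) / np.sqrt(np.pi),
]

# Training samples from the (assumed) pdf p(x) = (1 + cos x) / (2*pi) on [0, 2*pi],
# drawn by rejection sampling from a uniform proposal.
rng = np.random.default_rng(5)
u = rng.uniform(0, 2 * np.pi, 200000)
samples = u[rng.uniform(0, 1, u.size) < (1 + np.cos(u)) / 2]

# c_hat_i = (1/N) * sum_k phi_i(x_k): sample-mean estimate of <p, phi_i>.
c_hat = np.array([phi(samples).mean() for phi in basis])
print(np.round(c_hat, 3))   # ~ [1/sqrt(2*pi), 1/(2*sqrt(pi)), 0, 0, 0]

p_hat = lambda t: sum(c * phi(t) for c, phi in zip(c_hat, basis))
print(p_hat(np.array([0.0])), (1 + np.cos(0.0)) / (2 * np.pi))   # estimate vs. true pdf at x = 0
```
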
36
Choice of the Basis Functions (1)
In general, being the true pdf unknown , its not possible to a
priori identify the orthonormal basis that provides a given
approximation error with the minimum number of coefficients.
A choice on the basis of operational issues can be taken, such as the
implementation simplicity or the computation time.
Examples of complete orthonormal bases in the case n = 1
The goniometric functions form a complete orthonormal basis over
[0, 2t] (Fourier series expansion):
Complete orthonormal bases can be generated (over various
domains) by means of systems of orthogonal polynomials (Legendre,
Hermite, Lagrange, Tchebitshev polynomials).

t =

= t = =

t = + =

1/2
1/2
1/2
(2 ) for 1
( ) cos( ) for 2 ( 1, 2,...)
sin( ) for 2 1 ( 1, 2,...)
i
i
x rx i r r
rx i r r
37
Choice of the Basis Functions (2)
Legendre polynomials
They are a sequence of recursively defined polynomials:
they are orthogonal into [ 1, 1] and need to be normalized:
In the case n > 1, a complete orthonormal basis can be obtained
by multiplying one-dimensional basis functions:
given a one-dimensional basis {
i
}, a bi-dimensional basis can be
defined as follow:
+
+

=
=

+ +

= =

2
3 1
2
2 2 1 1
3
5 3
3
2 2
0 1
2 1
( ) ,
( ) ( ) ( )
1 1
( ) , ...
( ) 1, ( )
i i i
i i
P x x
P x xP x P x
i i
P x x x
P x P x x

= o = +
+
}
1
1
2 1
( ) ( ) ( ) ( )
2 1 2
i j ij i i
P x P x dx x i P x
i
1 1 2 1 1 1 2 2 1 2 2 1 1 2
3 1 2 1 1 2 2 4 1 2 2 1 2 2
5 1 2 3 1 1 2
( , ) ( ) ( ) ( , ) ( ) ( )
( , ) ( ) ( ) ( , ) ( ) ( )
( , ) ( ) ( ) ...
x x x x x x x x
x x x x x x x x
x x x x
= =

= =

38
Accuracy of the Functional Approximation
The quality of the approximation depends on different
elements:
orthogonality of the basis functions over the region of the feature
space in which the samples take values;
number m of the adopted basis functions .
Number of basis functions
The number m necessary to reach a certain approximation error
depends on the chosen type of basis functions (i.e.: a sinusoidal
p(x) will require, in general, less functions from a trigonometric
basis than from a polynomial one).
In the lack of a priori information on p(x), for a given basis,
typically m is derived inserting the estimated pdf into the
adopted classifier, evaluating the performances on the test set and
increasing m till reaching the desired accuracy.
39
Example (1)
ML classification with estimations based on Legendre polynomials.
Given two classes described by the following training samples in a
bi-dimensional feature space:
Adopt a Legendre polynomials basis of order 4 (m = 4). Note that the
features are normalized into [ 1, 1], corresponding to the orthogonality
interval of the Legendre polynomials (if its not so, its sufficient to
normalize them on such an interval). The basis-functions are:

= =

= =

= =

= =

1 1 1
1 1 2 0 1 0 2
2 2 2
3 3 1
2 1 2 1 1 0 2 1
2 2 2
3 3 1
3 1 2 0 1 1 2 2
2 2 2
3 3 3
4 1 2 1 1 1 2 1 2
2 2 2
( , ) ( ) ( )
( , ) ( ) ( )
( , ) ( ) ( )
( , ) ( ) ( )
x x P x P x
x x P x P x x
x x P x P x x
x x P x P x x x
x
1
x
2
O

1
1
1
1
e
e
1
2
: ( 3/ 5, 0), (0, 3/ 5), ( 3/ 5, 3/ 5) ,
: (1,1), (4/ 5, 2/ 5), (3/ 5, 4/ 5).
40
Example (2)
Computation of the coefficients:
For class e
1
:
For class e
2
:
( ) ( ) ( )
( ) ( ) ( ) ( )
( ) ( ) ( ) ( )
(
= + + =

(
= + + = + = =

(
= + + = + + =

e = + e
3 3 3 3 1 1
1 1 1 1
3 5 5 5 5 2
3 3 3 3 3 3 3 3 1 1
2 2 2 2 3
3 5 5 5 5 3 2 5 5 5
3 3 3 3 3 9 9 1 1
4 4 4 4
3 5 5 5 5 3 2 25 50
3 3 1 27
1 1 2 1 2 1 2
4 10 10 100

, 0 0, ,

, 0 0, , 0

, 0 0, , 0 0

( | ) , , [ 1,1]
c
c c
c
p x x x x x x x
( ) ( )
( ) ( ) ( )
( ) ( ) ( )
( ) ( ) ( )
(
= + + =

(
= + + = + + =

(
= + + = + + =

(
= + + = + + =

e =
3 1 4 2 4 1
1 2 2 2
3 5 5 5 5 2
3 2 3 3 3 1 4 2 4 1 4
2 2 2 2
3 5 5 5 5 3 2 5 5 5
3 11 3 3 1 4 2 4 1 2 4
3 3 3 3
3 5 5 5 5 3 2 5 5 30
3 3 8 9 1 4 2 4 1 12
4 4 4 4
3 5 5 5 5 3 2 25 25 10
1
2
4

(1,1) , ,

(1,1) , , 1

(1,1) , , 1

(1,1) , , 1

( | )
c
c
c
c
p x + + + e
3 11 27
1 2 1 2 1 2
5 20 20
, , [ 1,1] x x x x x x
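A quick numerical check of these coefficients (illustrative, NumPy only, not part of the slides):

```python
import numpy as np

# Normalized Legendre-product basis used in the example: 1/2, (sqrt(3)/2)x1, (sqrt(3)/2)x2, (3/2)x1x2.
basis = [
    lambda x1, x2: 0.5 * np.ones_like(x1),
    lambda x1, x2: np.sqrt(3) / 2 * x1,
    lambda x1, x2: np.sqrt(3) / 2 * x2,
    lambda x1, x2: 1.5 * x1 * x2,
]

w1 = np.array([[-0.6, 0.0], [0.0, -0.6], [-0.6, -0.6]])
w2 = np.array([[1.0, 1.0], [0.8, 0.4], [0.6, 0.8]])

for X in (w1, w2):
    c_hat = [phi(X[:, 0], X[:, 1]).mean() for phi in basis]
    print(np.round(c_hat, 4))
# class 1: [0.5, -0.3464, -0.3464, 0.18]  i.e. 1/2, -sqrt(3)/5, -sqrt(3)/5, 9/50
# class 2: [0.5,  0.6928,  0.6351, 0.9 ]  i.e. 1/2, 2*sqrt(3)/5, 11*sqrt(3)/30, 9/10
```
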
Example (3)

Computation of the ML discriminant curve. Setting \hat{p}(x|ω_1) = \hat{p}(x|ω_2):

\frac{1}{4} - \frac{3}{10} x_1 - \frac{3}{10} x_2 + \frac{27}{100} x_1 x_2 = \frac{1}{4} + \frac{3}{5} x_1 + \frac{11}{20} x_2 + \frac{27}{20} x_1 x_2
⇒ \frac{9}{10} x_1 + \frac{17}{20} x_2 + \frac{27}{25} x_1 x_2 = 0
⇒ 90 x_1 + 85 x_2 + 108 x_1 x_2 = 0
⇒ x_2 = -\frac{90 x_1}{108 x_1 + 85}

The discriminant curve is an equilateral hyperbola (i.e., it has orthogonal asymptotes). If we used only three basis functions, we would obtain a linear discriminant function.

Estimation of the Prior Probabilities

The application of minimum-risk theory requires, in addition to the estimation of the class-conditional pdfs, the estimation of the prior probabilities of the classes.

Given M classes ω_1, ω_2, ..., ω_M, let N_i be the number of training samples of the class ω_i (i = 1, 2, ..., M) and N = N_1 + N_2 + ... + N_M the total number of training samples.

A consistent estimate of the prior probabilities P_i = P(ω_i) (i = 1, 2, ..., M) is the relative frequency of the samples of the class ω_i in the training set:

\hat{P}_i = \frac{N_i}{N}

This estimate is based on the implicit hypothesis that the training samples have been extracted from the various classes in proportion to the class prior probabilities. Often this hypothesis is not realistic. More accurate estimators of the prior probabilities have been developed, which use not only the training samples but also unlabeled samples (e.g., the Expectation-Maximization technique applied to the global pdf).

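A trivial sketch of this relative-frequency estimate (the label array is a made-up illustration):

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])        # hypothetical class labels of the training set
classes, counts = np.unique(labels, return_counts=True)
priors = counts / counts.sum()                         # P_hat_i = N_i / N
print(dict(zip(classes.tolist(), priors)))
```
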
Bibliography

R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd edition, Wiley, New York, 2001.
K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, New York, 1990.
J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, Massachusetts, 1974.
H. L. Van Trees, Detection, Estimation and Modulation Theory, vol. I, John Wiley & Sons, New York, 1968.
J. M. Mendel, Lessons in Digital Estimation Theory, Prentice-Hall, Englewood Cliffs, 1987.
E. Parzen, "On Estimation of a Probability Density Function and Mode," Annals of Mathematical Statistics, vol. 33, 1962.
D. Specht, "Probabilistic Neural Networks," Neural Networks, vol. 3, 1990.
T. Cacoullos, "Estimation of a Multivariate Density," Annals of the Institute of Statistical Mathematics, Tokyo, vol. 18, no. 2, 1966.
A. N. Mucciardi and E. L. Gose, "An Algorithm for Automatic Clustering in N-dimensional Spaces Using Hyperellipsoidal Cells," Proc. IEEE Systems Science and Cybernetics Conference, Pittsburgh, Penn., Oct. 1970.
D. W. Scott and G. R. Terrell, "Biased and unbiased cross-validation in density estimation," Journal of the American Statistical Association, vol. 82, pp. 1131-1146, 1987.
S. R. Sain, K. A. Baggerly, and D. W. Scott, "Cross-validation of multivariate densities," Journal of the American Statistical Association, vol. 89, no. 427, pp. 807-817, 1994.
