INFO-F-422
Gianluca Bontempi
Département d'Informatique
Boulevard de Triomphe - CP 212
http://www.ulb.ac.be/di
Classical or frequentist approach: the unknown parameter $\theta$ is treated as a fixed, non-random quantity to be estimated from the data.
Bayesian approach: the parameter $\theta$ is itself treated as a random variable, described by a prior distribution that is updated by the data.
Parametric estimation
Consider a r.v. $z$. Suppose that
1. we do not know completely the distribution $F_z(z)$ but that we can write it in a parametric form
$$F_z(z) = F_z(z, \theta)$$
where $\theta$ is a parameter,
2. we have access to a set $D_N$ of $N$ measurements of $z$, called sample data.
Goal of the estimation procedure: to find a value $\hat{\theta}$ of the parameter so that the parametrized distribution $F_z(z, \hat{\theta})$ closely matches the distribution of the data.
We assume that the $N$ observations are the observed values of $N$ i.i.d. random variables $z_i$, each having a distribution identical to $F_z(z, \theta)$.
I.I.D. samples
Consider a set of $N$ i.i.d. (independent and identically distributed) random variables $z_i$.
Point estimation
Consider a random variable $z$ with a parametric distribution $F_z(z, \theta)$, $\theta \in \Theta$.
The parameter $\theta$ can be written as a function(al) of $F$:
$$\theta = t(F)$$
This corresponds to the fact that $\theta$ is a characteristic of the population described by $F_z(\cdot)$.
Suppose we have a set of $N$ observations $D_N = \{z_1, z_2, \dots, z_N\}$.
Maximum likelihood
Least squares
Minimum Chi-Squared
TP R: empirical distribution
Suppose that our dataset of age observations consists of the following $N = 14$ samples:
$$D_N = \{20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29\}$$
Here is the empirical distribution function $\hat{F}_z$ (cumdis.R):
[Figure: empirical distribution function $\hat{F}_N(x)$ of the age data; $x$ ranges from 20 to 30 and $\hat{F}_N(x)$ from 0 to 1.]
The mean of the empirical distribution is the sample average:
$$\int z \, d\hat{F}(z) = \frac{1}{N} \sum_{i=1}^{N} z_i$$
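A minimal R sketch along the lines of cumdis.R (the actual script may differ): it computes the empirical distribution function of the age data and the plug-in estimate of the mean.

# Age data from the slide (N = 14)
DN <- c(20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29)
# Empirical distribution function: Fn(x) = #{z_i <= x} / N
Fn <- ecdf(DN)
plot(Fn, xlab = "x", ylab = "Fn(x)", main = "Empirical distribution")
# Plug-in estimate of the mean: the sample average
mean(DN)   # 23.5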
Sample average
Consider a r.v. $z \sim F_z(\cdot)$ such that
$$\mu = E[z] = \int z \, dF(z)$$
with $\mu$ unknown.
Suppose we have available the sample $F_z \rightarrow D_N$, made of $N$ observations.
The plug-in point estimate of $\mu$ is given by the sample average
$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i$$
Sample variance
Consider a r.v. $z \sim F_z(\cdot)$ where the mean $\mu$ and the variance $\sigma^2$ are unknown.
Suppose we have available the sample $F_z \rightarrow D_N$. The sample variance is
$$\hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (z_i - \hat{\mu})^2$$
Note the presence of $N-1$ instead of $N$ in the denominator (it will be explained later).
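In R, the built-in var() already uses the $N-1$ denominator, which can be checked by hand (continuing with the DN vector of the age example above):

# var() uses the unbiased N - 1 denominator
var(DN)
sum((DN - mean(DN))^2) / (length(DN) - 1)   # same value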
Note that the following relation holds for all $z_i$:
$$\frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})^2 = \frac{1}{N} \sum_{i=1}^{N} z_i^2 - \hat{\mu}^2$$
Other plug-in estimates:
Sample third central moment: $\frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})^3$
Sample upper critical point: $\hat{z}_\alpha = \sup\{z : \hat{F}(z) \le 1 - \alpha\}$
Sample correlation:
$$\hat{\rho}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \hat{\mu}_x)(y_i - \hat{\mu}_y)}{\sqrt{\sum_{i=1}^{N} (x_i - \hat{\mu}_x)^2} \, \sqrt{\sum_{i=1}^{N} (y_i - \hat{\mu}_y)^2}}$$
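A small check of this formula against R's built-in cor(); the vectors x and y below are hypothetical paired data, used only for illustration:

# Hypothetical paired data
x <- c(1.2, 2.3, 3.1, 4.8, 5.0)
y <- c(2.0, 2.9, 3.9, 5.2, 6.1)
# Plug-in sample correlation, computed from the definition
num <- sum((x - mean(x)) * (y - mean(y)))
den <- sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2))
num / den
cor(x, y)   # identical result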
Sampling distribution
Given a dataset $D_N$, we have a point estimate
$$\hat{\theta} = g(D_N)$$
which is a specific value.
However, it is important to remark that $D_N$ is the outcome of the sampling of a r.v. $z$. As a consequence, $D_N$ can be considered as the realization of a random variable $\mathbf{D}_N$.
Applying the transformation $g$ to the random variable $\mathbf{D}_N$, we obtain another random variable
$$\hat{\boldsymbol{\theta}} = g(\mathbf{D}_N)$$
whose distribution is called the sampling distribution.
[Figure: different realizations $D_N^{(1)}, D_N^{(2)}, D_N^{(3)}, \dots$ of the dataset yield different estimates $\hat{\theta}^{(1)}, \hat{\theta}^{(2)}, \hat{\theta}^{(3)}, \dots$; their distribution $p(\hat{\theta})$ is the sampling distribution.]
$$\text{Bias}[\hat{\theta}] = E_{D_N}[\hat{\theta}] - \theta$$
Some considerations
An unbiased estimator is an estimator that takes on average the right value.
Many unbiased estimators may exist for a parameter $\theta$.
If $\hat{\theta}$ is an unbiased estimator of $\theta$, it may happen that $f(\hat{\theta})$ is a BIASED estimator of $f(\theta)$.
A biased estimator with a known bias (not depending on $\theta$) is equivalent to an unbiased estimator, since we can easily compensate for the bias.
Given a r.v. $z$ and the set $D_N$, it can be shown that the sample average $\hat{\mu}$ and the sample variance $\hat{\sigma}^2$ are unbiased estimators of the mean $E[z] = \mu$ and of the variance $\text{Var}[z] = \sigma^2$, respectively.
Bias and variance of $\hat{\mu}$
Consider a random variable $z \sim F_z(\cdot)$. Then
$$E_{D_N}[\hat{\mu}] = E_{D_N}\!\left[\frac{1}{N} \sum_{i=1}^{N} z_i\right] = \frac{\sum_{i=1}^{N} E[z_i]}{N} = \frac{N\mu}{N} = \mu$$
This means that the sample average estimator is not biased! This holds for whatever distribution $F_z(\cdot)$.
Since $\text{Cov}[z_i, z_j] = 0$ for $i \ne j$, the variance of the sample average estimator is
$$\text{Var}[\hat{\mu}] = \text{Var}\!\left[\frac{1}{N} \sum_{i=1}^{N} z_i\right] = \frac{1}{N^2} \text{Var}\!\left[\sum_{i=1}^{N} z_i\right] = \frac{1}{N^2} N\sigma^2 = \frac{\sigma^2}{N}$$
See R script sam_dis.R.
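A minimal sketch of the kind of simulation sam_dis.R might perform (parameter values assumed here): repeatedly sample a dataset and look at the distribution of the sample average.

# Monte Carlo view of the sampling distribution of the sample average
set.seed(0)
N <- 10; R <- 10000
mu <- 0; sigma <- 1                 # assumed true parameters
mu.hat <- replicate(R, mean(rnorm(N, mu, sigma)))
mean(mu.hat)                        # close to mu (unbiasedness)
var(mu.hat)                         # close to sigma^2 / N
hist(mu.hat, main = "Sampling distribution of the sample average")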
Bias of $\hat{\sigma}^2$
Given an i.i.d. $D_N \sim z$:
$$E_{D_N}[\hat{\sigma}^2] = E_{D_N}\!\left[\frac{1}{N-1} \sum_{i=1}^{N} (z_i - \hat{\mu})^2\right] = \frac{N}{N-1} E_{D_N}\!\left[\frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})^2\right] = \frac{N}{N-1} E_{D_N}\!\left[\frac{1}{N} \sum_{i=1}^{N} z_i^2 - \hat{\mu}^2\right]$$
Since $E\!\left[\frac{1}{N} \sum_{i=1}^{N} z_i^2\right] = \frac{1}{N} N(\mu^2 + \sigma^2) = \mu^2 + \sigma^2$, it remains to compute $E_{D_N}[\hat{\mu}^2] = E_{D_N}\!\left[\left(\frac{1}{N} \sum_{i=1}^{N} z_i\right)^2\right]$.
Bias of $\hat{\sigma}^2$ (continued)
$$E_{D_N}[\hat{\mu}^2] = \frac{1}{N^2} E_{D_N}\!\left[\left(\sum_{i=1}^{N} z_i\right)^2\right] = \frac{1}{N^2} (N^2 \mu^2 + N \sigma^2) = \mu^2 + \sigma^2/N$$
It follows that
$$E_{D_N}[\hat{\sigma}^2] = \frac{N}{N-1} \left[(\mu^2 + \sigma^2) - (\mu^2 + \sigma^2/N)\right] = \frac{N}{N-1} \cdot \frac{N-1}{N} \sigma^2 = \sigma^2$$
i.e. the sample variance, with its $N-1$ denominator, is an unbiased estimator of $\sigma^2$.
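This result can also be verified numerically; a minimal sketch (assumed parameters), comparing the $N-1$ and $N$ denominators:

# Check E[sigma.hat^2] = sigma^2 by simulation
set.seed(0)
N <- 10; R <- 10000; sigma2 <- 4
s2 <- replicate(R, var(rnorm(N, 0, sqrt(sigma2))))   # N - 1 denominator
mean(s2)                    # close to sigma2 = 4
mean(s2) * (N - 1) / N      # the N denominator underestimates sigma2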
Important relations
$$E[(z - \mu)^2] = \sigma^2 = E[z^2 - 2\mu z + \mu^2] = E[z^2] - 2\mu E[z] + \mu^2 = E[z^2] - 2\mu^2 + \mu^2 = E[z^2] - \mu^2$$
For $N = 2$:
$$E[(z_1 + z_2)^2] = E[z_1^2] + E[z_2^2] + 2E[z_1 z_2] = 2E[z^2] + 2\mu^2 = 4\mu^2 + 2\sigma^2$$
For $N = 3$:
$$E[(z_1 + z_2 + z_3)^2] = E[z_1^2] + E[z_2^2] + E[z_3^2] + 2E[z_1 z_2] + 2E[z_1 z_3] + 2E[z_2 z_3] = 3E[z^2] + 6\mu^2 = 9\mu^2 + 3\sigma^2$$
In general, for $N$ i.i.d. $z_i$: $E[(z_1 + z_2 + \dots + z_N)^2] = N^2 \mu^2 + N \sigma^2$.
Considerations
The results so far are independent of the form $F(\cdot)$ of the distribution.
The variance of $\hat{\mu}$ is $1/N$ times the variance of $z$. This is a reason for collecting several samples: the larger $N$, the smaller $\text{Var}[\hat{\mu}]$, so a bigger $N$ means a better estimate of $\mu$.
According to the central limit theorem, under quite general conditions on the distribution $F_z$, the distribution of $\hat{\mu}$ will be approximately normal as $N$ gets large, which we can write as
$$\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/N) \quad \text{for } N \rightarrow \infty$$
The standard error $\sqrt{\text{Var}[\hat{\mu}]}$ is a common way of indicating statistical accuracy. Roughly speaking, we expect $\hat{\mu}$ to be less than one standard error away from $\mu$ about 68% of the time, and less than two standard errors away from $\mu$ about 95% of the time.
$$\text{MSE} = E_{D_N}[(\hat{\theta} - \theta)^2] = (E_{D_N}[\hat{\theta}] - \theta)^2 + \text{Var}\big[\hat{\theta}\big] = \text{Bias}[\hat{\theta}]^2 + \text{Var}\big[\hat{\theta}\big]$$
i.e., the mean-square error is equal to the sum of the variance and the squared bias. This decomposition is typically called the bias-variance decomposition.
See R script mse_bv.R (a sketch follows below).
The decomposition follows from
$$E_{D_N}[(\hat{\theta} - \theta)^2] = E_{D_N}[(\hat{\theta} - E_{D_N}[\hat{\theta}])^2] + (E_{D_N}[\hat{\theta}] - \theta)^2 + 2(\theta - E_{D_N}[\hat{\theta}])(E_{D_N}[\hat{\theta}] - E_{D_N}[\hat{\theta}]) = \text{Bias}[\hat{\theta}]^2 + \text{Var}\big[\hat{\theta}\big]$$
where the cross term vanishes.
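A minimal sketch of the kind of check mse_bv.R might perform (assumptions: normal data and, as an example, the biased variance estimator with denominator $N$):

# Numerical check of MSE = Bias^2 + Var for the N-denominator variance estimator
set.seed(0)
N <- 10; R <- 100000; sigma2 <- 1
s2.ml <- replicate(R, { z <- rnorm(N); sum((z - mean(z))^2) / N })
bias <- mean(s2.ml) - sigma2
c(MSE = mean((s2.ml - sigma2)^2), Bias2.plus.Var = bias^2 + var(s2.ml))
# the two numbers agree up to simulation error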
TP: example
Suppose $z_1, \dots, z_N$ is a random sample of observations from a distribution with mean $\mu$ and variance $\sigma^2$.
Study the unbiasedness of the three estimators of the mean $\mu$ (a Monte Carlo check is sketched below):
$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N} z_i}{N-1}, \qquad \hat{\mu}_2 = \frac{\sum_{i=1}^{N} z_i}{N+1}, \qquad \hat{\mu}_3 = z_1$$
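A quick Monte Carlo look at this TP; a sketch assuming, for instance, normal data with $\mu = 5$, $\sigma = 1$:

# Empirical bias and variance of the three estimators of the mean
set.seed(0)
N <- 20; R <- 10000; mu <- 5
est <- replicate(R, {
  z <- rnorm(N, mean = mu)
  c(mu1 = sum(z) / (N - 1), mu2 = sum(z) / (N + 1), mu3 = z[1])
})
rowMeans(est)        # mu1 overestimates, mu2 underestimates, mu3 is unbiased
apply(est, 1, var)   # but mu3 has by far the largest variance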
Efficiency
Suppose we have two unbiased and consistent estimators. How to choose between them?
Definition 3 (Relative efficiency). Let us consider two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$. If
$$\text{Var}\big[\hat{\theta}_1\big] < \text{Var}\big[\hat{\theta}_2\big]$$
we say that $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$.
If the estimators are biased, typically the comparison is done on the basis of the mean square error.
Consider a normal r.v. $z \sim \mathcal{N}(\mu, \sigma^2)$, with $\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i$ and $\hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (z_i - \hat{\mu})^2$.
It can be shown that the following relations hold:
1. $\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/N)$ and $N(\hat{\mu} - \mu)^2/\sigma^2 \sim \chi^2_1$.
2. $z_i - \mu \sim \mathcal{N}(0, \sigma^2)$, so $\sum_{i=1}^{N} (z_i - \mu)^2 \sim \sigma^2 \chi^2_N$.
3. $\sum_{i=1}^{N} (z_i - \mu)^2 = \widehat{SS} + N(\hat{\mu} - \mu)^2$, where $\widehat{SS} = \sum_{i=1}^{N} (z_i - \hat{\mu})^2$.
4. $\widehat{SS} \sim \sigma^2 \chi^2_{N-1}$, or equivalently $(N-1)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{N-1}$. See R script sam_dis2.R (sketched below).
5. $\sqrt{N}(\hat{\mu} - \mu)/\hat{\sigma} \sim \mathcal{T}_{N-1}$ (Student distribution with $N-1$ degrees of freedom).
6. If $E[|z - \mu|^4] = \mu_4$, then $\text{Var}\big[\hat{\sigma}^2\big] = \frac{1}{N} \left(\mu_4 - \frac{N-3}{N-1} \sigma^4\right)$.
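A minimal sketch of what sam_dis2.R might illustrate (assumed parameters): relation 4 says that $(N-1)\hat{\sigma}^2/\sigma^2$ follows a $\chi^2_{N-1}$ distribution.

# Compare the simulated distribution of (N-1) * sigma.hat^2 / sigma^2 with chi-square
set.seed(0)
N <- 10; R <- 10000; sigma <- 2
stat <- replicate(R, (N - 1) * var(rnorm(N, 0, sigma)) / sigma^2)
hist(stat, freq = FALSE, breaks = 50, main = "(N-1) sigma.hat^2 / sigma^2")
curve(dchisq(x, df = N - 1), add = TRUE)   # density matches the histogram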
Likelihood
Let us consider
1. a density $p_z(z, \theta)$ which depends on a parameter $\theta$,
2. a sample dataset $D_N = \{z_1, z_2, \dots, z_N\}$ drawn independently from this distribution.
The joint probability density of the sample data is
$$p_{D_N}(D_N, \theta) = \prod_{i=1}^{N} p_z(z_i, \theta) = L_N(\theta)$$
where $L_N(\theta)$ is called the likelihood function.
Maximum likelihood
The principle of maximum likelihood was first used by Lambert around 1760 and by D. Bernoulli about 13 years later. It was detailed by Fisher in 1920.
Idea: given an unknown parameter $\theta$ and a sample dataset $D_N$, the maximum likelihood estimate $\hat{\theta}$ is the value for which the likelihood $L_N(\theta)$ has a maximum:
$$\hat{\theta}_{ml} = \arg\max_{\theta} L_N(\theta)$$
For example, for a normal sample with known $\sigma$,
$$L_N(\mu) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(z_i - \mu)^2}{2\sigma^2}}$$
[Figure: likelihood $L_N(\mu)$ of a normal sample plotted as a function of $\mu$.]
Some considerations
The likelihood measures the relative ability of the various parameter values to explain the observed data.
The principle of m.l. is that the value of the parameter under which the obtained data would have had the highest probability of arising must be intuitively our best estimator of $\theta$.
M.l. can be considered a measure of how plausible the parameter values are in light of the data.
The likelihood function is a function of the parameter $\theta$.
[Figure: log-likelihood $\log L_N(\mu)$ of a normal sample plotted as a function of $\mu$.]
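The two curves above can be reproduced in a few lines of R; a sketch assuming a small normal sample with known $\sigma = 1$:

# Likelihood and log-likelihood of a normal sample as functions of mu (sigma = 1)
set.seed(0)
z <- rnorm(10)
mu.grid <- seq(-2, 2, by = 0.01)
loglik <- sapply(mu.grid, function(m) sum(dnorm(z, mean = m, sd = 1, log = TRUE)))
plot(mu.grid, exp(loglik), type = "l", xlab = "mu", ylab = "L")    # likelihood
plot(mu.grid, loglik, type = "l", xlab = "mu", ylab = "log(L)")    # log-likelihood
mu.grid[which.max(loglik)]   # maximum is close to mean(z)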
M.l. estimation
If we take a parametric approach, the analytical form of the log-likelihood $l_N(\theta) = \log L_N(\theta)$ is known. In many cases the function $l_N(\theta)$ is well behaved, being continuous with a single maximum away from the extremes of the range of variation of $\theta$. Then $\hat{\theta}_{ml}$ is the solution of
$$\frac{\partial l_N(\theta)}{\partial \theta} = 0 \quad \text{subject to} \quad \frac{\partial^2 l_N(\theta)}{\partial \theta^2}\bigg|_{\hat{\theta}_{ml}} < 0$$
For a normal sample $z_i \sim \mathcal{N}(\mu, \sigma^2)$, the likelihood is
$$L_N(\mu, \sigma^2) = \prod_{i=1}^{N} p_z(z_i, \mu, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(z_i - \mu)^2}{2\sigma^2}\right)$$
The log-likelihood is
$$l_N(\mu, \sigma^2) = \log L_N(\mu, \sigma^2) = \sum_{i=1}^{N} \log p_z(z_i, \mu, \sigma^2) = -\left[\frac{\sum_{i=1}^{N} (z_i - \mu)^2}{2\sigma^2} + N \log(\sqrt{2\pi}\,\sigma)\right]$$
Maximizing $l_N$ yields
$$\hat{\mu}_{ml} = \frac{\sum_{i=1}^{N} z_i}{N}, \qquad \hat{\sigma}^2_{ml} = \frac{\sum_{i=1}^{N} (z_i - \hat{\mu}_{ml})^2}{N} \ne \hat{\sigma}^2$$
Note that the m.l. estimator of the mean coincides with the sample average, but the m.l. estimator of the variance differs from the sample variance in its denominator.
TP
1. Consider a Poisson r.v. with density
$$p_z(z, \lambda) = \frac{e^{-\lambda} \lambda^z}{z!}$$
If $F_z(z, \lambda) \rightarrow D_N = \{z_1, \dots, z_N\}$, find the m.l.e. of $\lambda$.
2. Script emp_ml.R (a sketch follows below).
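A hedged sketch in the spirit of emp_ml.R (the actual script may differ): maximize the Poisson log-likelihood numerically over a grid and compare with the analytical m.l.e. $\hat{\lambda}_{ml} = \hat{\mu}$.

# Numerical m.l.e. of the Poisson parameter lambda
set.seed(0)
z <- rpois(50, lambda = 3)      # simulated sample; true lambda = 3 is assumed
lambda.grid <- seq(0.1, 10, by = 0.01)
loglik <- sapply(lambda.grid, function(l) sum(dpois(z, lambda = l, log = TRUE)))
lambda.grid[which.max(loglik)]  # numerical maximum
mean(z)                         # analytical m.l.e.: the two values coincide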
Properties of the m.l. estimator $\hat{\theta}_{ml}$:
The Cramer-Rao theorem establishes a lower bound to the variance of an estimator.
$\hat{\theta}_{ml}$ is consistent.
If $\hat{\theta}_{ml}$ is the m.l.e. of $\theta$ and $\tau(\cdot)$ is a monotone function, then $\tau(\hat{\theta}_{ml})$ is the m.l.e. of $\tau(\theta)$.
If $\tau(\cdot)$ is a nonlinear function, then even if $\hat{\theta}_{ml}$ is an unbiased estimator of $\theta$, the m.l.e. $\tau(\hat{\theta}_{ml})$ of $\tau(\theta)$ is usually biased.
$\hat{\theta}_{ml}$ is asymptotically normally distributed around $\theta$.
Interval estimation
Unlike point estimation, which is based on a one-to-one mapping from the space of data to the space of parameters, interval estimation maps $D_N$ to an interval of $\Theta$.
An estimator is a function which, given a dataset $D_N$ generated from $F_z(z, \theta)$, returns an estimate $\hat{\theta}$ of $\theta$.
An interval estimator is a transformation which, given a dataset $D_N$, returns an interval estimate of $\theta$.
While an estimator is a random variable, an interval estimator is a random interval.
Let $\underline{\theta}$ and $\bar{\theta}$ be the lower and the upper bound respectively, such that
$$\text{Prob}\{\underline{\theta} \le \theta \le \bar{\theta}\} = 1 - \alpha, \qquad \alpha \in [0, 1]$$
$$1 - \alpha = \text{Prob}\{x \le x_\alpha\} = F(x_\alpha) \iff x_\alpha = F^{-1}(1 - \alpha)$$
We will denote with $z_\alpha$ the upper critical point of the standard normal density:
$$1 - \alpha = \text{Prob}\{z \le z_\alpha\} = F(z_\alpha) = \Phi(z_\alpha), \qquad z_\alpha = \Phi^{-1}(1 - \alpha), \qquad z_{1-\alpha} = -z_\alpha$$
$$1 - \alpha = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_\alpha} e^{-z^2/2} \, dz$$
alpha:  0.1    0.075  0.05   0.025  0.01   0.005  0.001  0.0005
z_alpha: 1.282  1.440  1.645  1.960  2.326  2.576  3.090  3.291
Distribution      R function
Standard normal   qnorm(alpha, lower.tail=F)
Student T_N       qt(alpha, df=N, lower.tail=F)
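For example, the upper critical points in the table above can be recovered directly:

# Upper critical points
qnorm(0.05, lower.tail = FALSE)          # 1.645
qnorm(0.025, lower.tail = FALSE)         # 1.960
qt(0.05, df = 10, lower.tail = FALSE)    # Student version, N = 10 degrees of freedom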
Confidence interval of $\mu$
Consider a random sample $D_N$ of a normal r.v. $\mathcal{N}(\mu, \sigma^2)$ and the estimator $\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/N)$.
It follows that
$$\frac{\hat{\mu} - \mu}{\sigma/\sqrt{N}} \sim \mathcal{N}(0, 1), \qquad \frac{\hat{\mu} - \mu}{\hat{\sigma}/\sqrt{N}} \sim \mathcal{T}_{N-1}$$
and consequently
$$\text{Prob}\left\{\hat{\mu} - z_{\alpha/2} \frac{\sigma}{\sqrt{N}} \le \mu \le \hat{\mu} + z_{\alpha/2} \frac{\sigma}{\sqrt{N}}\right\} = 1 - \alpha$$
$\underline{\theta} = \hat{\mu} - z_{\alpha/2} \, \sigma/\sqrt{N}$ is a lower $1 - \alpha$ confidence bound for $\mu$.
$\bar{\theta} = \hat{\mu} + z_{\alpha/2} \, \sigma/\sqrt{N}$ is an upper $1 - \alpha$ confidence bound for $\mu$.
By varying $\alpha$ we vary the width and the confidence of the interval.
TP R script confidence.R
The user sets $\mu$, $\sigma$, $N$, and a number of iterations $N_{iter}$. The script generates $N_{iter}$ datasets and computes the percentage of times that
$$\hat{\mu} - z_{\alpha/2} \frac{\sigma}{\sqrt{N}} < \mu < \hat{\mu} + z_{\alpha/2} \frac{\sigma}{\sqrt{N}}$$
We can easily check that this percentage converges to $1 - \alpha$ for $N_{iter} \rightarrow \infty$.
[Figure: empirical coverage (between roughly 0.86 and 1.02) as a function of $N_{iter}$, converging to $1 - \alpha$.]
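A minimal sketch of what confidence.R might do (parameter values assumed here): count how often the known-$\sigma$ interval covers $\mu$.

# Empirical coverage of the (1 - alpha) confidence interval for mu, sigma known
set.seed(0)
mu <- 0; sigma <- 1; N <- 10; alpha <- 0.1; Niter <- 10000
z.half <- qnorm(1 - alpha / 2)
covered <- replicate(Niter, {
  mu.hat <- mean(rnorm(N, mu, sigma))
  abs(mu.hat - mu) < z.half * sigma / sqrt(N)
})
mean(covered)   # converges to 1 - alpha = 0.9 as Niter grows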
When $\sigma$ is unknown, we use $\sqrt{N}(\hat{\mu} - \mu)/\hat{\sigma} \sim \mathcal{T}_{N-1}$, with $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$. For example, with $N = 6$, $\alpha = 0.1$ and $\hat{\sigma} = 1.87$:
$$t = t_{\alpha/2, N-1} \, \hat{\sigma}/\sqrt{N} = 2.015 \cdot 1.87/\sqrt{6} \approx 1.54$$
The $(1 - \alpha)$ confidence interval of $\mu$ is
$$\hat{\mu} - t < \mu < \hat{\mu} + t$$
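The numbers in this example are easily reproduced:

# t-based margin with alpha = 0.1, N = 6, sigma.hat = 1.87
qt(1 - 0.1 / 2, df = 6 - 1) * 1.87 / sqrt(6)   # 2.015 * 1.87 / sqrt(6) = 1.538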