
Statistical foundations of machine learning
INFO-F-422
Gianluca Bontempi
Département d'Informatique
Boulevard de Triomphe - CP 212
http://www.ulb.ac.be/di

Machine learning p. 1/51

Approaches to parametric estimation


There are two main approaches to parametric estimation:

Classical or frequentist: it is based on the idea that sample data are the sole quantifiable form of relevant information and that the parameters are fixed but unknown. It is related to the frequentist view of probability.

Bayesian approach: the parameters are supposed to be random variables, having a distribution prior to data observation and a distribution posterior to data observation. This approach assumes that there exists something beyond data (i.e. a subjective degree of belief) and that this belief can be described in probabilistic form.

Machine learning p. 2/51

Classical approach: some history


It dates back to the period 1920-35.
J. Neyman and E.S. Pearson, stimulated by problems in biology and industry, concentrated on the principles of hypothesis testing.
R.A. Fisher, who was interested in agricultural issues, gave attention to the estimation problem.

Machine learning p. 3/51

Parametric estimation
Consider a r.v. $\mathbf{z}$. Suppose that
1. we do not know completely the distribution $F_{\mathbf{z}}(z)$ but that we can write it in a parametric form
$$F_{\mathbf{z}}(z) = F_{\mathbf{z}}(z, \theta)$$
where $\theta \in \Theta$ is a parameter,
2. we have access to a set $D_N$ of $N$ measurements of $\mathbf{z}$, called sample data.
Goal of the estimation procedure: to find a value $\hat{\theta}$ of the parameter so that the parametrized distribution $F_{\mathbf{z}}(z, \hat{\theta})$ closely matches the distribution of the data.
We assume that the $N$ observations are the observed values of $N$ i.i.d. random variables $\mathbf{z}_i$, each having a distribution identical to $F_{\mathbf{z}}(z, \theta)$.

Machine learning p. 4/51

I.I.D. samples
Consider a set of N i.i.d. random variables zi .

I.I.D. means Identically and Independently Distributed.

Identically distributed means that all the observations have been sampled from the same distribution, that is
$$\text{Prob}\{\mathbf{z}_i = z\} = \text{Prob}\{\mathbf{z}_j = z\} \quad \text{for all } i, j = 1, \dots, N \text{ and } z \in \mathcal{Z}$$
Independently distributed means that the fact that we have observed a certain value $z_i$ does not influence the probability of observing the value $z_j$, that is
$$\text{Prob}\{\mathbf{z}_j = z \mid \mathbf{z}_i = z_i\} = \text{Prob}\{\mathbf{z}_j = z\}$$

Machine learning p. 5/51

Some estimation problems


1. Let $D_N = \{20, 31, 14, 11, 19, \dots\}$ be the times in minutes spent over the last two weeks going home. How long does it take on average to reach my house from ULB?
2. Consider the model of the traffic on the boulevard. Suppose that the measured inter-arrival times are $D_N = \{10, 11, 1, 21, 2, \dots\}$ seconds. What does this imply about the average inter-arrival time?
3. Consider the students of the last year of Computer Science. What is the
variance of their grades?

Machine learning p. 6/51

Parametric estimation (II)


Parametric estimation is a mapping from the space of the sample data to the parameter space $\Theta$. There are two possible outcomes:
1. some specific value of $\Theta$. In this case we have the so-called point estimation.
2. some particular region of $\Theta$. In this case we obtain a confidence interval.

Machine learning p. 7/51

Point estimation
Consider a random variable $\mathbf{z}$ with a parametric distribution $F_{\mathbf{z}}(z, \theta)$, $\theta \in \Theta$.
The parameter $\theta$ can be written as a function(al) of $F$:
$$\theta = t(F)$$
This corresponds to the fact that $\theta$ is a characteristic of the population described by $F_{\mathbf{z}}(\cdot)$.
Suppose we have a set of $N$ observations $D_N = \{z_1, z_2, \dots, z_N\}$.
Any function of the sample data $D_N$ is called a statistic. A point estimate is an example of a statistic.
A point estimate is a function
$$\hat{\theta} = g(D_N)$$
of the sample dataset $D_N$.

Machine learning p. 8/51

Methods of constructing estimators


Examples are:
Plug-in principle

Maximum likelihood
Least squares

Minimum Chi-Squared

Machine learning p. 9/51

Empirical distribution function


Suppose we have observed an i.i.d. random sample of size $N$ from the distribution function $F_{\mathbf{z}}(z)$ of a continuous r.v. $\mathbf{z}$,
$$F_{\mathbf{z}} \rightarrow \{z_1, z_2, \dots, z_N\}$$
where
$$F_{\mathbf{z}}(z) = \text{Prob}\{\mathbf{z} \le z\}$$
Let $N(z)$ be the number of samples in $D_N$ that do not exceed $z$.
The empirical distribution function is
$$\hat{F}_{\mathbf{z}}(z) = \frac{N(z)}{N} = \frac{\#\{z_i \le z\}}{N}$$
This function is a staircase function with discontinuities at the points $z_i$.

Machine learning p. 10/51

TP R: empirical distribution
Suppose that our dataset of age observations is made of the following $N = 14$ samples:
$$D_N = \{20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29\}$$
Here is the empirical distribution function $\hat{F}_{\mathbf{z}}$ (cumdis.R):

[Figure: "Empirical Distribution function" — the staircase Fn(x) of the age sample, rising from 0 to 1 over the range 20-30]
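A minimal R sketch of what a script like cumdis.R might do (the script itself is not reproduced here): it builds and plots the empirical distribution function with the built-in ecdf function.

# Empirical distribution function of the age sample (sketch, cf. cumdis.R)
DN <- c(20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29)

Fhat <- ecdf(DN)                     # step function approximating F_z
plot(Fhat, main = "Empirical Distribution function",
     xlab = "z", ylab = "Fn(x)")

Fhat(23)                             # fraction of samples <= 23 (7/14 = 0.5)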

Machine learning p. 11/51

Plug-in principle to define an estimator


Consider a r.v. $\mathbf{z}$ and a sample dataset $D_N$ drawn from the parametric distribution $F_{\mathbf{z}}(z, \theta)$.
How can we define an estimate of $\theta$?
A possible answer is given by the plug-in principle, a simple method of estimating parameters from samples.
The plug-in estimate of a parameter $\theta = t(F)$ is defined to be
$$\hat{\theta} = t(\hat{F}(z))$$
obtained by replacing the distribution function with the empirical distribution in the analytical expression of the parameter.
The sample average is an example of a plug-in estimate:
$$\hat{\theta} = \int z \, d\hat{F}(z) = \frac{1}{N} \sum_{i=1}^{N} z_i$$

Machine learning p. 12/51

Sample average
Consider a r.v. $\mathbf{z} \sim F_{\mathbf{z}}(\cdot)$ such that
$$\theta = E[\mathbf{z}] = \mu = \int z \, dF(z)$$
with $\mu$ unknown.
Suppose we have available the sample $F_{\mathbf{z}} \rightarrow D_N$, made of $N$ observations.
The plug-in point estimate of $\mu$ is given by the sample average
$$\hat{\theta} = \hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} z_i$$
which is indeed a statistic, i.e. a function of the dataset.

Machine learning p. 13/51

Sample variance
Consider a r.v. $\mathbf{z} \sim F_{\mathbf{z}}(\cdot)$ where both the mean $\mu$ and the variance $\sigma^2$ are unknown.
Suppose we have available the sample $F_{\mathbf{z}} \rightarrow D_N$.
Once we have the sample average $\hat{\mu}$, the estimate of $\sigma^2$ is given by the sample variance
$$\hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (z_i - \hat{\mu})^2$$
Note the presence of $N-1$ instead of $N$ in the denominator (it will be explained later).
Note that the following relation holds for all $z_i$:
$$\frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})^2 = \frac{1}{N} \sum_{i=1}^{N} z_i^2 - \hat{\mu}^2$$
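A small R check, under the setup of this slide, of the difference between the $N-1$ sample variance returned by var() and the plug-in $1/N$ version, and of the relation above:

# Sample average and the two variance estimates (sketch)
DN <- c(20, 21, 22, 20, 23, 25, 26, 25, 20, 23, 24, 25, 26, 29)
N  <- length(DN)

mu.hat     <- mean(DN)                        # sample average
var.unbias <- sum((DN - mu.hat)^2)/(N - 1)    # sample variance (N-1 denominator)
var.plugin <- sum((DN - mu.hat)^2)/N          # plug-in version (N denominator)

all.equal(var.unbias, var(DN))                # TRUE: var() uses N-1
all.equal(var.plugin, mean(DN^2) - mu.hat^2)  # the relation on this slide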

Machine learning p. 14/51

Other plug-in estimators


Skewness estimator:
$$\frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{\mu})^3$$
Upper critical point estimator:
$$\hat{z}_\alpha = \sup\{z : \hat{F}(z) \le 1 - \alpha\}$$
Sample correlation:
$$\hat{\rho}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \hat{\mu}_x)(y_i - \hat{\mu}_y)}{\sqrt{\sum_{i=1}^{N} (x_i - \hat{\mu}_x)^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \hat{\mu}_y)^2}}$$
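A hedged R sketch computing these plug-in quantities on a simulated sample (the exact normalizations above are reconstructed from the slide; the skewness term is the third central moment, which could also be standardized by $\hat{\sigma}^3$):

# Plug-in estimates on a simulated sample (sketch)
set.seed(0)
N <- 1000
z <- rnorm(N, mean = 1, sd = 2)

mu.hat  <- mean(z)
m3.hat  <- mean((z - mu.hat)^3)        # third central moment (skewness numerator)
z.alpha <- quantile(z, probs = 0.95)   # empirical upper 0.05 critical point

x <- rnorm(N); y <- 2*x + rnorm(N)
rho.hat <- cor(x, y)                   # sample correlation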

Machine learning p. 15/51

Sampling distribution
Given a dataset $D_N$, we have a point estimate
$$\hat{\theta} = g(D_N)$$
which is a specific value.
However, it is important to remark that $D_N$ is the outcome of the sampling of a r.v. $\mathbf{z}$. As a consequence, $D_N$ can be considered as a realization of a random variable $\mathbf{D}_N$.
Applying the transformation $g$ to the random variable $\mathbf{D}_N$ we obtain another random variable
$$\hat{\boldsymbol{\theta}} = g(\mathbf{D}_N)$$
which is called the point estimator of $\theta$.
The probability distribution of the r.v. $\hat{\boldsymbol{\theta}}$ is called the sampling distribution.

Machine learning p. 16/51

Sampling distribution
[Diagram: several datasets $D_N^{(1)}, D_N^{(2)}, D_N^{(3)}, \dots$ drawn from the unknown r.v. distribution each yield an estimate $\hat{\theta}^{(1)}, \hat{\theta}^{(2)}, \hat{\theta}^{(3)}, \dots$; together these estimates form the sampling distribution $p(\hat{\theta})$]

See the R file sam_dis.R.


Machine learning p. 17/51

Bias and variance


Estimates are the first output of a data analysis. The next thing we want to know is the accuracy of $\hat{\theta}$. This leads us to the definitions of bias, variance and standard error of an estimator.
Definition 1. An estimator $\hat{\boldsymbol{\theta}}$ of $\theta$ is said to be unbiased if and only if
$$E_{D_N}[\hat{\boldsymbol{\theta}}] = \theta$$
Otherwise, it is called biased with bias
$$\text{Bias}[\hat{\boldsymbol{\theta}}] = E_{D_N}[\hat{\boldsymbol{\theta}}] - \theta$$
Definition 2. The variance of an estimator $\hat{\boldsymbol{\theta}}$ of $\theta$ is the variance of its sampling distribution
$$\text{Var}\left[\hat{\boldsymbol{\theta}}\right] = E_{D_N}\left[(\hat{\boldsymbol{\theta}} - E[\hat{\boldsymbol{\theta}}])^2\right]$$

Machine learning p. 18/51

Some considerations
An unbiased estimator is an estimator that takes on average the right value.
Many unbiased estimators may exist for a parameter $\theta$.
If $\hat{\theta}$ is an unbiased estimator of $\theta$, it may happen that $f(\hat{\theta})$ is a BIASED estimator of $f(\theta)$.
A biased estimator with a known bias (not depending on $\theta$) is equivalent to an unbiased estimator, since we can easily compensate for the bias.
Given a r.v. $\mathbf{z}$ and the set $D_N$, it can be shown that the sample average $\hat{\mu}$ and the sample variance $\hat{\sigma}^2$ are unbiased estimators of the mean $E[\mathbf{z}]$ and of the variance $\text{Var}[\mathbf{z}]$, respectively.
In general $\hat{\sigma}$ is not an unbiased estimator of $\sigma$, even though $\hat{\sigma}^2$ is an unbiased estimator of $\sigma^2$.

Machine learning p. 19/51


Bias and variance of $\hat{\mu}$
Consider a random variable $\mathbf{z} \sim F_{\mathbf{z}}(\cdot)$.
Let $\mu$ and $\sigma^2$ be the mean and the variance of $F_{\mathbf{z}}(\cdot)$, respectively.
Suppose we have observed the i.i.d. sample $D_N \leftarrow F_{\mathbf{z}}$.
The following relation holds:
$$E_{D_N}[\hat{\boldsymbol{\mu}}] = E_{D_N}\left[\frac{1}{N}\sum_{i=1}^{N} \mathbf{z}_i\right] = \frac{\sum_{i=1}^{N} E[\mathbf{z}_i]}{N} = \frac{N\mu}{N} = \mu$$
This means that the sample average estimator is unbiased! This holds for whatever distribution $F_{\mathbf{z}}(\cdot)$.
Since $\text{Cov}[\mathbf{z}_i, \mathbf{z}_j] = 0$ for $i \neq j$, the variance of the sample average estimator is
$$\text{Var}[\hat{\boldsymbol{\mu}}] = \text{Var}\left[\frac{1}{N}\sum_{i=1}^{N} \mathbf{z}_i\right] = \frac{1}{N^2}\,\text{Var}\left[\sum_{i=1}^{N} \mathbf{z}_i\right] = \frac{1}{N^2}\, N\sigma^2 = \frac{\sigma^2}{N}$$
See R script sam_dis.R.
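A hedged Monte Carlo sketch in the spirit of sam_dis.R (the actual script is not reproduced here), checking numerically that $E[\hat{\mu}] \approx \mu$ and $\text{Var}[\hat{\mu}] \approx \sigma^2/N$:

# Sampling distribution of the sample average (sketch, cf. sam_dis.R)
set.seed(0)
mu <- 10; sigma <- 2; N <- 20; R <- 10000    # R = number of simulated datasets

mu.hat <- replicate(R, mean(rnorm(N, mean = mu, sd = sigma)))

mean(mu.hat)        # close to mu           -> unbiasedness
var(mu.hat)         # close to sigma^2 / N  -> variance of the estimator
sigma^2 / N
hist(mu.hat, main = "Sampling distribution of the sample average")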
Machine learning p. 20/51


Bias of $\hat{\sigma}^2$

Given an i.i.d. $D_N \leftarrow \mathbf{z}$:
$$E_{D_N}[\hat{\boldsymbol{\sigma}}^2] = E_{D_N}\left[\frac{1}{N-1}\sum_{i=1}^{N} (\mathbf{z}_i - \hat{\boldsymbol{\mu}})^2\right] = \frac{N}{N-1}\, E_{D_N}\left[\frac{1}{N}\sum_{i=1}^{N} (\mathbf{z}_i - \hat{\boldsymbol{\mu}})^2\right] = \frac{N}{N-1}\, E_{D_N}\left[\frac{1}{N}\sum_{i=1}^{N} \mathbf{z}_i^2 - \hat{\boldsymbol{\mu}}^2\right]$$
Since $E[\mathbf{z}^2] = \mu^2 + \sigma^2$ and $\text{Cov}[\mathbf{z}_i, \mathbf{z}_j] = 0$ for $i \neq j$, the first term inside the $E[\cdot]$ is
$$E_{D_N}\left[\frac{1}{N}\sum_{i=1}^{N} \mathbf{z}_i^2\right] = \frac{1}{N}\, N(\mu^2 + \sigma^2) = \mu^2 + \sigma^2$$

Machine learning p. 21/51

Bias of $\hat{\sigma}^2$

Since $E\left[\left(\sum_{i=1}^{N} \mathbf{z}_i\right)^2\right] = N^2\mu^2 + N\sigma^2$, the second term is
$$E_{D_N}[\hat{\boldsymbol{\mu}}^2] = \frac{1}{N^2}\, E_{D_N}\left[\left(\sum_{i=1}^{N} \mathbf{z}_i\right)^2\right] = \frac{1}{N^2}\,(N^2\mu^2 + N\sigma^2) = \mu^2 + \sigma^2/N$$
It follows that
$$E_{D_N}[\hat{\boldsymbol{\sigma}}^2] = \frac{N}{N-1}\left[(\mu^2 + \sigma^2) - (\mu^2 + \sigma^2/N)\right] = \frac{N}{N-1}\,\frac{N-1}{N}\,\sigma^2 = \sigma^2$$
The sample variance (with $N-1$ in the denominator) is unbiased!
Question: is $\hat{\boldsymbol{\sigma}}$ an unbiased estimator of $\sigma$?

Machine learning p. 22/51

Important relations
$$E[(\mathbf{z} - \mu)^2] = \sigma^2 = E[\mathbf{z}^2 - 2\mu\mathbf{z} + \mu^2] = E[\mathbf{z}^2] - 2\mu E[\mathbf{z}] + \mu^2 = E[\mathbf{z}^2] - 2\mu^2 + \mu^2 = E[\mathbf{z}^2] - \mu^2$$
For $N = 2$:
$$E[(\mathbf{z}_1 + \mathbf{z}_2)^2] = E[\mathbf{z}_1^2] + E[\mathbf{z}_2^2] + 2E[\mathbf{z}_1\mathbf{z}_2] = 2E[\mathbf{z}^2] + 2\mu^2 = 4\mu^2 + 2\sigma^2$$
For $N = 3$:
$$E[(\mathbf{z}_1 + \mathbf{z}_2 + \mathbf{z}_3)^2] = E[\mathbf{z}_1^2] + E[\mathbf{z}_2^2] + E[\mathbf{z}_3^2] + 2E[\mathbf{z}_1\mathbf{z}_2] + 2E[\mathbf{z}_1\mathbf{z}_3] + 2E[\mathbf{z}_2\mathbf{z}_3] = 3E[\mathbf{z}^2] + 6\mu^2 = 9\mu^2 + 3\sigma^2$$
In general, for $N$ i.i.d. $\mathbf{z}_i$, $E[(\mathbf{z}_1 + \mathbf{z}_2 + \dots + \mathbf{z}_N)^2] = N^2\mu^2 + N\sigma^2$.
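A quick, hedged R check of the last identity by simulation (any distribution with mean $\mu$ and variance $\sigma^2$ will do; a uniform is used here):

# Monte Carlo check of E[(z1 + ... + zN)^2] = N^2*mu^2 + N*sigma^2 (sketch)
set.seed(0)
N <- 5; R <- 200000
mu <- 0.5; sigma2 <- 1/12                 # mean and variance of U(0,1)

s <- replicate(R, sum(runif(N)))          # z1 + ... + zN
mean(s^2)                                 # Monte Carlo estimate
N^2 * mu^2 + N * sigma2                   # theoretical value (6.666...)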
Machine learning p. 23/51

Considerations
The results so far are independent of the form $F(\cdot)$ of the distribution.
The variance of $\hat{\mu}$ is $1/N$ times the variance of $\mathbf{z}$. This is a reason for collecting several samples: the larger $N$, the smaller $\text{Var}[\hat{\mu}]$, so a bigger $N$ means a better estimate of $\mu$.
According to the central limit theorem, under quite general conditions on the distribution $F_{\mathbf{z}}$, the distribution of $\hat{\mu}$ will be approximately normal as $N$ gets large, which we can write as
$$\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/N) \quad \text{for } N \to \infty$$
The standard error $\sqrt{\text{Var}[\hat{\mu}]}$ is a common way of indicating statistical accuracy. Roughly speaking, we expect $\hat{\mu}$ to be less than one standard error away from $\mu$ about 68% of the time, and less than two standard errors away from $\mu$ about 95% of the time.

Machine learning p. 24/51

Bias/variance decomposition of MSE


When $\hat{\theta}$ is a biased estimator of $\theta$, its accuracy is usually assessed by its mean-square error (MSE) rather than by its variance.
The MSE is defined by
$$\text{MSE} = E_{D_N}[(\theta - \hat{\theta})^2]$$
The MSE of an unbiased estimator is its variance.
For a generic estimator it can be shown that
$$\text{MSE} = (E_{D_N}[\hat{\theta}] - \theta)^2 + \text{Var}\left[\hat{\theta}\right] = \left[\text{Bias}[\hat{\theta}]\right]^2 + \text{Var}\left[\hat{\theta}\right]$$
i.e., the mean-square error is equal to the sum of the variance and the squared bias. This decomposition is typically called the bias-variance decomposition.
See R script mse_bv.R.
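A hedged Monte Carlo sketch in the spirit of mse_bv.R (not the actual script): it estimates bias, variance and MSE of a deliberately biased estimator of the mean and checks that MSE ≈ Bias² + Var.

# Bias/variance decomposition of the MSE (sketch, cf. mse_bv.R)
set.seed(0)
mu <- 10; sigma <- 2; N <- 20; R <- 50000

# A biased estimator of mu: sum(z_i)/(N+1)
theta.hat <- replicate(R, sum(rnorm(N, mu, sigma))/(N + 1))

bias <- mean(theta.hat) - mu
vari <- var(theta.hat)
mse  <- mean((theta.hat - mu)^2)

mse
bias^2 + vari          # approximately equal to mse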

Machine learning p. 25/51

Bias/variance decomposition of MSE (II)


$$\begin{aligned}
\text{MSE} &= E_{D_N}[(\theta - \hat{\theta})^2] = \\
&= E_{D_N}[(\theta - E_{D_N}[\hat{\theta}] + E_{D_N}[\hat{\theta}] - \hat{\theta})^2] = \\
&= E_{D_N}[(\theta - E_{D_N}[\hat{\theta}])^2] + E_{D_N}[(E_{D_N}[\hat{\theta}] - \hat{\theta})^2] + E_{D_N}[2(\theta - E_{D_N}[\hat{\theta}])(E_{D_N}[\hat{\theta}] - \hat{\theta})] = \\
&= E_{D_N}[(\theta - E_{D_N}[\hat{\theta}])^2] + E_{D_N}[(E_{D_N}[\hat{\theta}] - \hat{\theta})^2] + 2(\theta - E_{D_N}[\hat{\theta}])(E_{D_N}[\hat{\theta}] - E_{D_N}[\hat{\theta}]) = \\
&= (E_{D_N}[\hat{\theta}] - \theta)^2 + \text{Var}\left[\hat{\theta}\right] = \left[\text{Bias}[\hat{\theta}]\right]^2 + \text{Var}\left[\hat{\theta}\right]
\end{aligned}$$

Machine learning p. 26/51

TP: example
Suppose $z_1, \dots, z_N$ is a random sample of observations from a distribution with mean $\theta$ and variance $\sigma^2$.
Study the unbiasedness of the three estimators of the mean $\theta$:
$$\hat{\theta}_1 = \hat{\mu} = \frac{\sum_{i=1}^{N} z_i}{N}, \qquad \hat{\theta}_2 = \frac{\sum_{i=1}^{N} z_i}{N+1}, \qquad \hat{\theta}_3 = z_1$$

Machine learning p. 27/51

Efficiency
Suppose we have two unbiased and consistent estimators. How do we choose between them?
Definition 3 (Relative efficiency). Let us consider two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$. If
$$\text{Var}\left[\hat{\theta}_1\right] < \text{Var}\left[\hat{\theta}_2\right]$$
we say that $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$.
If the estimators are biased, the comparison is typically done on the basis of the mean square error.

Machine learning p. 28/51

Sampling distributions for Gaussian r.v


Let $\mathbf{z}_1, \dots, \mathbf{z}_N$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$ and let us consider the following sample statistics:
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{z}_i, \qquad \widehat{SS} = \sum_{i=1}^{N} (\mathbf{z}_i - \hat{\mu})^2, \qquad \hat{\sigma}^2 = \frac{\widehat{SS}}{N-1}$$
It can be shown that the following relations hold:
1. $\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/N)$ and $N(\hat{\mu} - \mu)^2/\sigma^2 \sim \chi^2_1$.
2. $\mathbf{z}_i - \mu \sim \mathcal{N}(0, \sigma^2)$, so $\sum_{i=1}^{N} (\mathbf{z}_i - \mu)^2/\sigma^2 \sim \chi^2_N$.
3. $\sum_{i=1}^{N} (\mathbf{z}_i - \mu)^2 = \widehat{SS} + N(\hat{\mu} - \mu)^2$.
4. $\widehat{SS}/\sigma^2 \sim \chi^2_{N-1}$, or equivalently $(N-1)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{N-1}$. See R script sam_dis2.R.
5. $\sqrt{N}(\hat{\mu} - \mu)/\hat{\sigma} \sim \mathcal{T}_{N-1}$.
6. If $E[|\mathbf{z} - \mu|^4] = \mu_4$ then $\text{Var}\left[\hat{\sigma}^2\right] = \frac{1}{N}\left(\mu_4 - \frac{N-3}{N-1}\,\sigma^4\right)$.
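A hedged simulation sketch in the spirit of sam_dis2.R (not the actual script), checking relation 4 above: the scaled sample variance $(N-1)\hat{\sigma}^2/\sigma^2$ should follow a $\chi^2_{N-1}$ distribution.

# Check that (N-1)*sigma.hat^2 / sigma^2 ~ chi-square with N-1 d.o.f. (sketch, cf. sam_dis2.R)
set.seed(0)
mu <- 0; sigma <- 2; N <- 10; R <- 20000

stat <- replicate(R, (N - 1) * var(rnorm(N, mu, sigma)) / sigma^2)

hist(stat, freq = FALSE, breaks = 50,
     main = "Scaled sample variance vs chi-square density")
curve(dchisq(x, df = N - 1), add = TRUE, col = "red")

mean(stat)      # should be close to N - 1 (mean of the chi-square)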

Machine learning p. 29/51

Likelihood
Let us consider
1. a density $p_{\mathbf{z}}(z, \theta)$ which depends on a parameter $\theta$,
2. a sample dataset $D_N = \{z_1, z_2, \dots, z_N\}$ drawn independently from this distribution.
The joint probability density of the sample data is
$$p_{D_N}(D_N, \theta) = \prod_{i=1}^{N} p_{\mathbf{z}}(z_i, \theta) = L_N(\theta)$$
where, for a fixed $D_N$, $L_N(\theta)$ is a function of $\theta$ and is called the empirical likelihood of $\theta$ given $D_N$.

Machine learning p. 30/51

Maximum likelihood
The principle of maximum likelihood was first used by Lambert around 1760 and by D. Bernoulli about 13 years later. It was detailed by Fisher in 1920.
Idea: given an unknown parameter $\theta$ and sample data $D_N$, the maximum likelihood estimate $\hat{\theta}$ is the value for which the likelihood $L_N(\theta)$ has a maximum:
$$\hat{\theta}_{ml} = \arg\max_{\theta} L_N(\theta)$$
The estimator $\hat{\boldsymbol{\theta}}_{ml}$ is called the maximum likelihood estimator (m.l.e.).
It is usual to consider the log-likelihood $l_N(\theta)$: since $\log(\cdot)$ is a monotone function, we have
$$\hat{\theta}_{ml} = \arg\max_{\theta} L_N(\theta) = \arg\max_{\theta} \log(L_N(\theta)) = \arg\max_{\theta} l_N(\theta)$$
Machine learning p. 31/51

Example: maximum likelihood


Let us observe $N = 10$ realizations of a continuous variable $\mathbf{z}$:
$$D_N = \{z_1, \dots, z_{10}\} = \{1.263, 0.326, 1.330, 1.272, 0.415, 1.540, 0.929, 0.295, 0.006, 2.405\}$$
Suppose that the probabilistic model underlying the data is Gaussian with an unknown mean $\mu$ and a known variance $\sigma^2 = 1$.
The likelihood $L_N(\mu)$ is a function of (only) the unknown parameter $\mu$.
By applying the maximum likelihood technique we have
$$\hat{\mu} = \arg\max_{\mu} L(\mu) = \arg\max_{\mu} \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(z_i - \mu)^2}{2\sigma^2}}$$
Machine learning p. 32/51

Example: maximum likelihood (II)

By plotting $L_N(\mu)$ for $\mu \in [-2, 2]$ we have:
[Figure: likelihood $L_N(\mu)$ as a function of mu on $[-2, 2]$, peaking at the maximum likelihood estimate]
Then the most likely value of $\mu$ according to the data is $\hat{\mu} \approx 0.358$. Note that in this case $\hat{\mu} = \frac{\sum_i z_i}{N}$.
R script ml_norm.R
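A hedged sketch along the lines of ml_norm.R (not the actual script; note that some signs of the data values above may have been lost in this transcription, so the numerical maximum may differ from 0.358): it evaluates $L_N(\mu)$ on a grid and compares the argmax with the sample average.

# Likelihood of mu for a Gaussian model with known variance sigma^2 = 1 (sketch, cf. ml_norm.R)
DN <- c(1.263, 0.326, 1.330, 1.272, 0.415, 1.540, 0.929, 0.295, 0.006, 2.405)

lik <- function(mu) prod(dnorm(DN, mean = mu, sd = 1))   # L_N(mu)

mu.grid <- seq(-2, 2, by = 0.01)
L <- sapply(mu.grid, lik)

plot(mu.grid, L, type = "l", xlab = "mu", ylab = "likelihood")
mu.grid[which.max(L)]     # grid argmax of the likelihood
mean(DN)                  # coincides (up to grid resolution) with the sample average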
Machine learning p. 33/51

Some considerations
The likelihood measures the relative ability of the various parameter values to explain the observed data.
The principle of m.l. is that the value of the parameter under which the obtained data would have had the highest probability of arising must be, intuitively, our best estimate of $\theta$.
M.l. can be considered a measure of how plausible the parameter values are in light of the data.
The likelihood function is a function of the parameter $\theta$.
According to the classical approach to estimation, since $\theta$ is not a r.v., the likelihood function is NOT the probability function of $\theta$.
$L_N(\theta)$ is rather the probability of observing the dataset $D_N$ for a given $\theta$. In other terms, the likelihood is the probability of the data given the parameter and not the probability of the parameter given the data.

Machine learning p. 34/51

Example: log likelihood

-30
-35
-40

log(L)

-25

-20

-15

Consider the previous example.


The behaviour of the log-likelihood for this model is

-2

-1

mu

Machine learning p. 35/51

M.l. estimation
If we take a parametric approach, the analytical form of the log-likelihood $l_N(\theta)$ is known. In many cases the function $l_N(\theta)$ is well behaved, being continuous with a single maximum away from the extremes of the range of variation of $\theta$.
Then $\hat{\theta}_{ml}$ is obtained simply as the solution of
$$\frac{\partial l_N(\theta)}{\partial \theta} = 0$$
subject to
$$\left.\frac{\partial^2 l_N(\theta)}{\partial \theta^2}\right|_{\hat{\theta}_{ml}} < 0$$
to ensure that the identified stationary point is a maximum.

Machine learning p. 36/51

Gaussian case: ML estimators


Let $D_N$ be a random sample from the r.v. $\mathbf{z} \sim \mathcal{N}(\mu, \sigma^2)$.
The likelihood of the $N$ samples is given by
$$L_N(\mu, \sigma^2) = \prod_{i=1}^{N} p_{\mathbf{z}}(z_i, \mu, \sigma^2) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(z_i - \mu)^2}{2\sigma^2}\right)$$
The log-likelihood is
$$l_N(\mu, \sigma^2) = \log L_N(\mu, \sigma^2) = \log \prod_{i=1}^{N} p_{\mathbf{z}}(z_i, \mu, \sigma^2) = \sum_{i=1}^{N} \log p_{\mathbf{z}}(z_i, \mu, \sigma^2) = -\frac{\sum_{i=1}^{N} (z_i - \mu)^2}{2\sigma^2} + N \log \frac{1}{\sqrt{2\pi\sigma^2}}$$
Note that, for a given $\sigma$, maximizing the log-likelihood is equivalent to minimizing the sum of squares of the differences between the $z_i$ and the mean.

Machine learning p. 37/51

Gaussian case: ML estimators (II)


Taking the derivatives with respect to $\mu$ and $\sigma^2$ and setting them equal to zero, we obtain
$$\hat{\mu}_{ml} = \frac{\sum_{i=1}^{N} z_i}{N}, \qquad \hat{\sigma}^2_{ml} = \frac{\sum_{i=1}^{N} (z_i - \hat{\mu}_{ml})^2}{N} \neq \hat{\sigma}^2$$
Note that the m.l. estimator of the mean coincides with the sample average, but the m.l. estimator of the variance differs from the sample variance because of the different denominator.
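A short, hedged R check of these formulas on simulated data (any Gaussian sample will do):

# ML estimates for a Gaussian sample vs. sample average / sample variance (sketch)
set.seed(0)
N  <- 50
DN <- rnorm(N, mean = 3, sd = 2)

mu.ml      <- sum(DN)/N                    # equals mean(DN)
sigma2.ml  <- sum((DN - mu.ml)^2)/N        # ML variance, denominator N
sigma2.hat <- var(DN)                      # sample variance, denominator N-1

c(mu.ml, mean(DN))
c(sigma2.ml, sigma2.hat, sigma2.ml * N/(N - 1))   # last two coincide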

Machine learning p. 38/51

TP
1. Let $\mathbf{z} \sim \mathcal{U}(0, M)$ and $F_{\mathbf{z}} \rightarrow D_N = \{z_1, \dots, z_N\}$. Find the maximum likelihood estimator of $M$.
2. Let $\mathbf{z}$ have a Poisson distribution, i.e.
$$p_{\mathbf{z}}(z, \lambda) = \frac{e^{-\lambda}\lambda^z}{z!}$$
If $F_{\mathbf{z}}(z, \lambda) \rightarrow D_N = \{z_1, \dots, z_N\}$, find the m.l.e. of $\lambda$.

Machine learning p. 39/51

M.l. estimation with numerical methods


Computational difficulties may arise if:
1. No explicit solution exists for $\partial l_N(\theta)/\partial\theta = 0$. Iterative numerical methods must be used (see Calcul numérique). This is particularly serious for a vector of parameters or when $l_N$ has several relative maxima.
2. $l_N(\theta)$ may be discontinuous, have a discontinuous first derivative, or have a maximum at an extremal point.

Machine learning p. 40/51

TP: Numerical optimization in R


Suppose we know the analytical form of a one-dimensional function $f(x): I \to \mathbb{R}$.
We want to find the value of $x \in I$ that minimizes the function.
If no analytical solution is available, numerical optimization methods can be applied (see the course Calcul numérique).
In the R language these methods are already implemented.
Let $f(x) = (x - 1/3)^2$ and $I = [0, 1]$. The minimum is found by
f <- function (x, a) (x - a)^2          # parabola with minimum at x = a
xmin <- optimize(f, c(0, 1), tol = 0.0001, a = 1/3)
xmin

Machine learning p. 41/51

TP: Numerical max. of likelihood in R


Let $D_N$ be a random sample from the r.v. $\mathbf{z} \sim \mathcal{N}(\mu, \sigma^2)$.
The minus log-likelihood function of the $N$ samples can be written in R as
eml <- function(m, D, var) {
  # minus log-likelihood of mean m for the sample D, with known variance var
  N <- length(D)
  Lik <- 1
  for (i in 1:N)
    Lik <- Lik * dnorm(D[i], m, sqrt(var))
  -log(Lik)
}
The numerical minimization of $-l_N(\mu, s^2)$ for a given variance $\sigma^2 = s$ over the interval $I = [-10, 10]$ can be written in R as
xmin <- optimize(eml, c(-10, 10), D = DN, var = s)
Script emp_ml.R.
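For larger N the product of densities above can underflow to zero; a more robust variant (an assumption of mine, not part of emp_ml.R) sums the log-densities directly:

# Numerically stabler minus log-likelihood: sum of log-densities (sketch, not emp_ml.R)
eml2 <- function(m, D, var) -sum(dnorm(D, m, sqrt(var), log = TRUE))

set.seed(0)
DN <- rnorm(1000, mean = 3, sd = 2)      # example data (assumed here)
s  <- 4                                  # known variance
optimize(eml2, c(-10, 10), D = DN, var = s)$minimum   # close to mean(DN)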

Machine learning p. 42/51

Properties of m.l. estimators


Under the (strong) assumption that the probabilistic model structure is known, the maximum likelihood technique features the following properties:
$\hat{\theta}_{ml}$ is asymptotically unbiased but usually biased in small samples (e.g. $\hat{\sigma}^2_{ml}$).
The Cramer-Rao theorem establishes a lower bound on the variance of an estimator.
$\hat{\theta}_{ml}$ is consistent.
If $\hat{\theta}_{ml}$ is the m.l.e. of $\theta$ and $\gamma(\cdot)$ is a monotone function, then $\gamma(\hat{\theta}_{ml})$ is the m.l.e. of $\gamma(\theta)$.
If $\gamma(\cdot)$ is a nonlinear function, then even if $\hat{\theta}_{ml}$ is an unbiased estimator of $\theta$, the m.l.e. $\gamma(\hat{\theta}_{ml})$ of $\gamma(\theta)$ is usually biased.
$\hat{\theta}_{ml}$ is asymptotically normally distributed around $\theta$.

Machine learning p. 43/51

Interval estimation
Unlike point estimation, which is based on a one-to-one mapping from the space of data to the space of parameters, interval estimation maps $D_N$ to an interval of $\Theta$.
An estimator is a function which, given a dataset $D_N$ generated from $F_{\mathbf{z}}(z, \theta)$, returns an estimate of $\theta$.
An interval estimator is a transformation which, given a dataset $D_N$, returns an interval estimate of $\theta$.
While an estimator is a random variable, an interval estimator is a random interval.
Let $\underline{\boldsymbol{\theta}}$ and $\overline{\boldsymbol{\theta}}$ be the lower and the upper bound of the interval, respectively.
While an interval either contains or does not contain a certain value, a random interval has a certain probability of containing a value.

Machine learning p. 44/51

Interval estimation (II)


Suppose that our interval estimator satisfies
$$\text{Prob}\left\{\underline{\boldsymbol{\theta}} \le \theta \le \overline{\boldsymbol{\theta}}\right\} = 1 - \alpha, \qquad \alpha \in [0, 1]$$
Then the random interval $[\underline{\boldsymbol{\theta}}, \overline{\boldsymbol{\theta}}]$ is called a $100(1-\alpha)\%$ confidence interval of $\theta$.
Notice that $\theta$ is a fixed unknown value and that at each realization $D_N$ the interval either does or does not contain the true $\theta$.
If we repeat the procedure of sampling $D_N$ and constructing the confidence interval many times, then our confidence interval will contain the true $\theta$ at least $100(1-\alpha)\%$ of the time (i.e. 95% of the time if $\alpha = 0.05$).
While an estimator is characterized by bias and variance, an interval estimator is characterized by its endpoints and its confidence.

Machine learning p. 45/51

Upper critical points


Definition 4. The upper $\alpha$ critical point of a continuous r.v. $\mathbf{x}$ is the smallest number $x_\alpha$ such that
$$1 - \alpha = \text{Prob}\{\mathbf{x} \le x_\alpha\} = F(x_\alpha) \quad \Leftrightarrow \quad x_\alpha = F^{-1}(1 - \alpha)$$
We will denote by $z_\alpha$ the upper critical points of the standard normal density:
$$1 - \alpha = \text{Prob}\{\mathbf{z} \le z_\alpha\} = F(z_\alpha) = \Phi(z_\alpha)$$
Note that the following relations hold:
$$z_\alpha = \Phi^{-1}(1 - \alpha), \qquad z_{1-\alpha} = -z_\alpha, \qquad 1 - \alpha = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_\alpha} e^{-z^2/2} \, dz$$
Here we list the most commonly used values of $z_\alpha$:

alpha:    0.1    0.075  0.05   0.025  0.01   0.005  0.001  0.0005
z_alpha:  1.282  1.440  1.645  1.960  2.326  2.576  3.090  3.291

Machine learning p. 46/51

TP: Upper critical points in R


File stu_gau.R:
> x <- seq(-5, 5, by = 0.01)                 # grid of points (assumed; defined in stu_gau.R)
> plot(dnorm(x), type="l")
> lines(dt(x, df=10), type="l", col="red")
> lines(dt(x, df=3), type="l", col="green")
> qnorm(0.05, lower.tail=F)
[1] 1.644854
> qt(0.05, lower.tail=F, df=10)
[1] 1.812461

Distribution             R function
Normal standard          qnorm(alpha, lower.tail=F)
Student with N d.o.f.    qt(alpha, df=N, lower.tail=F)

Machine learning p. 47/51

Confidence interval of $\mu$
Consider a random sample $D_N$ of a normal r.v. $\mathcal{N}(\mu, \sigma^2)$ and the estimator $\hat{\mu} \sim \mathcal{N}(\mu, \sigma^2/N)$.
It follows that
$$\frac{\hat{\mu} - \mu}{\sigma/\sqrt{N}} \sim \mathcal{N}(0, 1), \qquad \frac{\hat{\mu} - \mu}{\hat{\sigma}/\sqrt{N}} \sim \mathcal{T}_{N-1}$$
and consequently
$$\text{Prob}\left\{-z_{\alpha/2} \le \frac{\hat{\mu} - \mu}{\sigma/\sqrt{N}} \le z_{\alpha/2}\right\} = 1 - \alpha$$
$$\text{Prob}\left\{\hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{N}} \le \mu \le \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{N}}\right\} = 1 - \alpha$$
$\underline{\theta} = \hat{\mu} - z_{\alpha/2}\,\sigma/\sqrt{N}$ is a lower $1-\alpha$ confidence bound for $\mu$.
$\overline{\theta} = \hat{\mu} + z_{\alpha/2}\,\sigma/\sqrt{N}$ is an upper $1-\alpha$ confidence bound for $\mu$.
By varying $\alpha$ we vary the width and the confidence of the interval.

Machine learning p. 48/51

Example: Confidence interval


Let $\mathbf{z} \sim \mathcal{N}(\mu, 0.0001)$ and $D_N = \{10, 11, 12, 13, 14, 15\}$.
We want to estimate the confidence interval of $\mu$ with level $\alpha = 0.1$.
We have $\hat{\mu} = 12.5$ and half-width
$$\delta = z_{\alpha/2}\,\sigma/\sqrt{N} = 1.645 \cdot 0.01/\sqrt{6} \approx 0.0067$$
The 90% confidence interval for the given $D_N$ is
$$\{\mu : |\hat{\mu} - \mu| \le \delta\} = \{12.5 - 0.0067 \le \mu \le 12.5 + 0.0067\}$$
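A hedged R check of this computation (the half-width symbol $\delta$ is my notation):

# 90% confidence interval for mu with known sigma (sketch)
DN    <- c(10, 11, 12, 13, 14, 15)
sigma <- 0.01                     # known standard deviation (variance 0.0001)
alpha <- 0.1
N     <- length(DN)

mu.hat <- mean(DN)                               # 12.5
delta  <- qnorm(1 - alpha/2) * sigma / sqrt(N)   # ~0.0067
c(mu.hat - delta, mu.hat + delta)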

Machine learning p. 49/51

TP R script confidence.R
The user sets $\mu$, $\sigma$, $N$, $\alpha$ and a number of iterations $N_{iter}$.
The script generates $N_{iter}$ times $D_N \sim \mathcal{N}(\mu, \sigma^2)$ and computes $\hat{\mu}$.
The script returns the percentage of times that
$$\hat{\mu} - z_{\alpha/2}\frac{\sigma}{\sqrt{N}} < \mu < \hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{N}}$$
We can easily check that this percentage converges to $100(1-\alpha)\%$ for $N_{iter} \to \infty$.
[Figure: empirical coverage as a function of Niter, oscillating around and converging to 1 - alpha]
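A hedged sketch of what a script like confidence.R might do (not the actual script):

# Empirical coverage of the (1 - alpha) confidence interval for mu (sketch, cf. confidence.R)
set.seed(0)
mu <- 0; sigma <- 1; N <- 10; alpha <- 0.1; Niter <- 10000

z <- qnorm(1 - alpha/2)
covered <- replicate(Niter, {
  mu.hat <- mean(rnorm(N, mu, sigma))
  (mu.hat - z * sigma/sqrt(N) < mu) && (mu < mu.hat + z * sigma/sqrt(N))
})
mean(covered)        # converges to 1 - alpha = 0.9 as Niter grows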

Machine learning p. 50/51

Example: Confidence region (II)


Let $\mathbf{z} \sim \mathcal{N}(\mu, \sigma^2)$, with $\sigma^2$ unknown, and $D_N = \{10, 11, 12, 13, 14, 15\}$.
We want to estimate the confidence region of $\mu$ with level $\alpha = 0.1$.
We have $\hat{\mu} = 12.5$ and $\hat{\sigma}^2 = 3.5$. Since
$$\frac{\hat{\mu} - \mu}{\sqrt{\hat{\sigma}^2/N}} \sim \mathcal{T}_{N-1}$$
where $\mathcal{T}_{N-1}$ is a Student distribution with $N-1$ degrees of freedom, we have
$$\delta = t_{\{\alpha/2, N-1\}}\,\hat{\sigma}/\sqrt{N} = 2.015 \cdot 1.87/\sqrt{6} \approx 1.53$$
The $(1-\alpha)$ confidence interval of $\mu$ is
$$\hat{\mu} - \delta < \mu < \hat{\mu} + \delta$$
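A hedged R check of this Student-t interval (again, $\delta$ is my notation for the half-width):

# 90% t-based confidence interval for mu with unknown variance (sketch)
DN    <- c(10, 11, 12, 13, 14, 15)
alpha <- 0.1
N     <- length(DN)

mu.hat    <- mean(DN)                                           # 12.5
sigma.hat <- sd(DN)                                             # sqrt(3.5) ~ 1.87
delta     <- qt(1 - alpha/2, df = N - 1) * sigma.hat / sqrt(N)  # ~1.53
c(mu.hat - delta, mu.hat + delta)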

Machine learning p. 51/51
