
Statistical foundations of machine learning
INFO-F-422
Gianluca Bontempi
Département d'Informatique
Boulevard de Triomphe - CP 212
http://www.ulb.ac.be/di


Testing hypotheses
Hypothesis testing is the second major area of statistical inference.
A statistical hypothesis is an assertion or conjecture about the distribution
of one or more random variables.
A test of a statistical hypothesis is a rule or procedure for deciding
whether to reject the assertion on the basis of the observed data.
The basic idea is to formulate some statistical hypothesis and look to see if
the data provide any evidence to reject it.


A hypothesis-testing problem
Consider the model of the traffic in the boulevard.
Suppose that the measured inter-arrival times are
DN = {10, 11, 1, 21, 2, ...} seconds.
Can we say that the mean inter-arrival time is different from 10?
Consider the grades of two different school sections.
Section A had {15, 10, 12, 19, 5, 7}.
Section B had {14, 11, 11, 12, 6, 7}.
Can we say that Section A had better grades than Section B?
Consider two protein-coding genes and their expression levels in a cell.
Are the two genes differentially expressed?
A statistical test is a procedure that aims to answer such questions.


Types of hypothesis
We start by declaring the working (basic, null) hypothesis H to be tested, in the
form θ = θ0 or θ ∈ ω, where θ0 or ω are given.
The hypothesis can be
Simple: it fully specifies the distribution of z.
Composite: it partially specifies the distribution of z.
Example: if DN constitutes a random sample of size N from N(μ, σ²), the
hypothesis H : μ = μ0, σ = σ0 (with μ0 and σ0 known values) is simple, while
the hypothesis H : μ = μ0 is composite since it leaves open the value of σ in
(0, ∞).


Types of statistical test
Suppose we have collected N samples DN = {z1, ..., zN} from a distribution
Fz and we have declared a null hypothesis H about F.
The three most common types of statistical test are:
Pure significance test: the data DN are used to assess the inferential evidence
against H.
Significance test: the inferential evidence against H is used to judge whether
H is inappropriate. In other words, it is a rule for rejecting H.
Hypothesis test: the data DN are used to assess the hypothesis H against a
specific alternative hypothesis H̄. In other words, this is a rule for
rejecting H in favour of H̄.


Pure significance test
Suppose that the null hypothesis H is simple.
Let t(DN) be a statistic such that the larger its value, the more it casts
doubt on H.
The quantity t(DN) is called the test statistic or discrepancy measure.
Let tN = t(DN) be the value of t calculated on the basis of the sample data
DN.
Let us consider the p-value quantity
p = Prob{t(DN) > tN | H}
If p is small, the sample data DN are highly inconsistent with H, and p
(significance probability or significance level) is the measure of such
inconsistency.
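As an illustration (a sketch of ours, not from the slides): when H is simple, the null distribution of t(DN) can always be approximated by simulation, giving a Monte Carlo estimate of p. Here H states that the data are i.i.d. N(0, 1), the statistic is the absolute sample mean, and the sample DN is hypothetical.

    set.seed(0)
    t.stat <- function(D) abs(mean(D))    # test statistic: large values cast doubt on H
    DN <- c(0.5, 1.2, -0.3, 0.8, 1.1)     # hypothetical observed sample
    tN <- t.stat(DN)
    ## simulate t(DN) under H and estimate p = Prob{t(DN) > tN | H}
    t.sim <- replicate(10000, t.stat(rnorm(length(DN))))
    mean(t.sim > tN)                      # Monte Carlo p-value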


Some considerations
p is the proportion of situations, under the hypothesis H, in which we would
observe a degree of inconsistency at least as large as the one represented by
tN.
tN is the observed value of the statistic for a given DN. Different DN
yield different values of p ∈ (0, 1).
It is essential that the distribution of t(DN) under H is known.
We cannot say that p is the probability that H is true; rather, p is the
probability of observing a dataset like DN given that H is true.
Open issues:
1. What if H is composite?
2. How to choose t(DN)?


Tests of significance
Suppose that the value p is known. If p is small, either a rare event has
occurred or H is not true.
Idea: if p is less than some stated value α, we reject H.
We choose a critical level α, we observe DN, and we reject H at level α if
Prob{t(DN) > tN | H} ≤ α
This is equivalent to choosing some critical value tα and rejecting H if
tN > tα.
We obtain two regions in the space of sample data:
Critical region S0: if DN ∈ S0 we reject H.
Non-critical region S1: if DN ∈ S1 the sample data give us no reason to
reject H on the basis of the level-α test.


Some considerations
The principle is that we accept H unless we witness some event that has a
sufficiently small probability of arising when H is true.
If H were true we could still obtain data in S0 and consequently wrongly
reject H with probability
Prob{DN ∈ S0 | H} = Prob{t(DN) > tα | H} = α
The significance level α provides an upper bound on the probability of
incorrectly rejecting H.
The p-value is the probability that the test statistic is more extreme than
its observed value. The p-value changes with the observed data (i.e. it is
a random variable) while α is a level fixed by the user.


Standard normal distribution
[Figure: density function and cumulative distribution function of the standard normal (μ = 0, σ = 1).]
Remember that z0.05 ≈ 1.645.
This means that, if z ~ N(0, 1), then Prob{z ≥ z0.05} = 0.05 and also that
Prob{|z| ≥ z0.05} = 2 · 0.05 = 0.1.
For a generic z ~ N(μ, σ²),
Prob{|z − μ| ≥ z0.05 σ} = 2 · 0.05 = 0.1
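These quantiles can be checked directly in R with the standard functions qnorm and pnorm:

    z05 <- qnorm(0.05, lower.tail = FALSE)    # z0.05 ~ 1.645
    pnorm(z05, lower.tail = FALSE)            # Prob{z >= z0.05} = 0.05
    2 * pnorm(-z05)                           # Prob{|z| >= z0.05} = 0.1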


TP: example
Let DN consist of N independent observations of x ~ N(μ, σ²), with
known variance σ².
We want to test the hypothesis H : μ = μ0, with μ0 known.
Consider as test statistic t(DN) the quantity |μ̂ − μ0|, where μ̂ is the
sample average estimator. If H is true, we know that μ̂ ~ N(μ0, σ²/N).
Let us calculate the value t(DN) = |μ̂ − μ0| and assume that the
rejection region is S0 = {DN : |μ̂ − μ0| > tα}.
Let us set a significance level α = 10% = 0.1. This means that tα should
satisfy
Prob{t(DN) > tα | H} = Prob{|μ̂ − μ0| > tα | H}
  = Prob{(μ̂ − μ0 > tα) OR (μ̂ − μ0 < −tα) | H} = 0.1


TP: example (II)
For a normal variable x ~ N(0, 1),
Prob{x > 1.645} = 1 − Fx(1.645) = 0.05
and consequently
Prob{x > 1.645 OR x < −1.645} = 0.05 + 0.05 = 0.1
It follows that, since μ̂ ~ N(μ0, σ²/N) (i.e. (μ̂ − μ0)√N / σ ~ N(0, 1)), once we put
tα = 1.645 σ / √N
we have
Prob{|μ̂ − μ0| > tα | H} = 0.1
and the critical region is
S0 = { DN : |μ̂ − μ0| > 1.645 σ / √N }

TP: example (III)
Suppose that σ = 0.1 and that we want to test whether μ = μ0 = 10 with a
significance level of 10%.
After N = 6 observations we have DN = {10, 11, 12, 13, 14, 15}.
On the basis of the dataset we compute
μ̂ = (10 + 11 + 12 + 13 + 14 + 15) / 6 = 12.5
and
t(DN) = |μ̂ − μ0| = 2.5
Since tα = 1.645 · 0.1 / √6 = 0.0672 and t(DN) > tα, the observations
DN fall in the critical region.
The hypothesis is rejected.
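The same computation can be reproduced in R (a minimal sketch; the variable names are ours):

    DN    <- c(10, 11, 12, 13, 14, 15)   # observed sample
    mu0   <- 10                          # hypothesised mean
    sigma <- 0.1                         # known standard deviation
    t.obs   <- abs(mean(DN) - mu0)       # |mu.hat - mu0| = 2.5
    t.alpha <- qnorm(0.05, lower.tail = FALSE) * sigma / sqrt(length(DN))  # ~ 0.0672
    t.obs > t.alpha                      # TRUE: DN is in the critical region, H is rejected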


Hypothesis testing: types of error
So far we considered a single hypothesis. Let us now consider two
alternative hypotheses: H and H̄.
Type I error: the error we make when we reject H although it is true.
The significance level α represents the probability of making the type I error.
Type II error: the error we make when we accept H although it is false.
In order to define this error, we are forced to declare an alternative
hypothesis H̄ as a formal definition of what is meant by H being false.
The probability of type II error is the probability that the test leads to
acceptance of H when in fact H̄ prevails.
When the alternative hypothesis is composite, there is no unique Type II
error.


An analogy
Consider the analogy with a murder trial, where the suspect is Mr. Bean.
The null hypothesis H is "Mr. Bean is innocent".
The dataset is the amount of evidence collected by the police against
Mr. Bean.
The Type I error is the error that we make if, Mr. Bean being innocent,
we sentence him to death.
The Type II error is the error that we make if, Mr. Bean being guilty, we
acquit him.


Hypothesis testing
Suppose we have some data {z1, ..., zN} drawn from a distribution F.
H and H̄ represent two hypotheses about F.
On the basis of the data, one is accepted and one is rejected.
Note that the two hypotheses have a different philosophical status
(asymmetry).
H is a conservative hypothesis, not to be rejected unless evidence is
clear. This means that a type I error is more serious than a type II error
(benefit of the doubt).
It is often assumed that F belongs to a parametric family F(z, θ). The
test on F then becomes a test on θ.
A particular example of hypothesis test is the goodness-of-fit test, where
we test H : F = F0 against H̄ : F ≠ F0.


The five steps of hypothesis testing
1. Declare the null hypothesis (e.g. H: honest student) and the alternative
hypothesis (H̄: cheating student).
2. Choose the numeric value α of the type I error (e.g. the risk I want to run).
3. Choose a procedure to obtain the test statistic (e.g. number of similar lines).
4. Determine the critical value of the test statistic (e.g. 4 identical lines) that
leads to a rejection of H. This is done in order to ensure the Type I error
defined in Step 2.
5. Obtain the data and determine whether the observed value of the test
statistic leads to acceptance or rejection of H.


Quality of the test
Suppose that
N students took part in the exam,
NN did not copy,
NP copied,
N̄N were considered not guilty and passed the exam,
N̄P were considered guilty and were refused,
FP honest students were refused,
FN cheating students passed.


Confusion matrix
Then we have

                                 Not refused   Refused   Total
  H: Not guilty student (−)      TN            FP        NN
  H̄: Guilty student (+)         FN            TP        NP
  Total                          N̄N           N̄P       N

FP is the number of False Positives, and the ratio FP/NN represents the
type I error.
FN is the number of False Negatives, and the ratio FN/NP represents
the type II error.


Specificity and sensitivity
Specificity: the ratio (to be maximized)

SP = TN / (FP + TN) = TN / NN = (NN − FP) / NN = 1 − FP/NN,   0 ≤ SP ≤ 1

It increases as the number of false positives decreases.

Sensitivity: the ratio (to be maximized)

SE = TP / (TP + FN) = TP / NP = (NP − FN) / NP = 1 − FN/NP,   0 ≤ SE ≤ 1

It increases as the number of false negatives decreases and corresponds
to the power of the test (i.e. it estimates the quantity 1 − Type II error).
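All these ratios (including the rates and predictive values defined on the following slides) are direct arithmetic on the four counts; a minimal R sketch with hypothetical counts:

    TN <- 90; FP <- 10    # honest students: correctly passed / wrongly refused
    FN <- 5;  TP <- 15    # cheating students: wrongly passed / correctly refused
    SP  <- TN / (TN + FP)    # specificity = 1 - FP/NN
    SE  <- TP / (TP + FN)    # sensitivity (power) = 1 - FN/NP
    FPR <- FP / (FP + TN)    # false positive rate: estimates the Type I error
    FNR <- FN / (FN + TP)    # false negative rate: estimates the Type II error
    PPV <- TP / (TP + FP)    # positive predictive value
    NPV <- TN / (TN + FN)    # negative predictive value
    FDR <- FP / (TP + FP)    # false discovery rate = 1 - PPV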


Specificity and sensitivity (II)
There exists a trade-off between these two quantities.
In the case of a test which always returns H (e.g. a very kind professor) we
have N̄P = 0, N̄N = N, FP = 0, TN = NN, and SP = 1 but SE = 0.
In the case of a test which always returns H̄ (e.g. a very suspicious
professor) we have N̄P = N, N̄N = 0, FN = 0, TP = NP, and SE = 1 but
SP = 0.


False Positive and False Negative Rate
False Positive Rate:

FPR = 1 − SP = 1 − TN / (FP + TN) = FP / (FP + TN) = FP / NN,   0 ≤ FPR ≤ 1

It decreases as the number of false positives decreases and estimates the Type I error.

False Negative Rate:

FNR = 1 − SE = 1 − TP / (TP + FN) = FN / (TP + FN) = FN / NP,   0 ≤ FNR ≤ 1

It decreases as the number of false negatives decreases.


Predictive value
Positive Predictive Value: the ratio (to be maximized)

PPV = TP / (TP + FP) = TP / N̄P,   0 ≤ PPV ≤ 1

Negative Predictive Value: the ratio (to be maximized)

NPV = TN / (TN + FN) = TN / N̄N,   0 ≤ NPV ≤ 1

False Discovery Rate: the ratio (to be minimized)

FDR = FP / (TP + FP) = FP / N̄P = 1 − PPV,   0 ≤ FDR ≤ 1


Receiver Operating Characteristic curve
The Receiver Operating Characteristic (also known as ROC curve) is a plot
of the true positive rate (i.e. sensitivity or power) against the false positive
rate (Type I error) for the different possible decision thresholds of a test.
Consider an example where t+ ~ N(1, 1) and t− ~ N(−1, 1). Suppose that
the examples are classed as positive if t > THR and negative if t < THR,
where THR is a threshold.
If THR = −∞, all the examples are classed as positive: TN = FN = 0,
which implies SE = TP/NP = 1 and FPR = FP/(FP + TN) = 1.
If THR = +∞, all the examples are classed as negative: TP = FP = 0,
which implies SE = 0 and FPR = 0.


ROC curve
[Figure: ROC curve, SE (true positive rate) versus FPR (false positive rate), both axes from 0.0 to 1.0.]
R script roc.R
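The script roc.R is not reproduced here; a minimal sketch consistent with the example above (our own code, not necessarily the original script) is:

    set.seed(0)
    tp <- rnorm(1000, mean =  1)    # statistic for positive examples, t+ ~ N(1, 1)
    tn <- rnorm(1000, mean = -1)    # statistic for negative examples, t- ~ N(-1, 1)
    THR <- seq(-4, 4, by = 0.1)                      # candidate decision thresholds
    SE  <- sapply(THR, function(th) mean(tp > th))   # sensitivity at each threshold
    FPR <- sapply(THR, function(th) mean(tn > th))   # false positive rate at each threshold
    plot(FPR, SE, type = "l", main = "ROC curve")
    abline(0, 1, lty = 2)    # diagonal: performance of a random classifier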

Choice of test
The choice of test, and consequently the choice of the partition {S0, S1}, is
based on two steps:
1. Define a significance level α, that is the probability of type I error
Prob{reject H | H} = Prob{DN ∈ S0 | H}
that is, the probability of incorrectly rejecting H.
2. Among the set of tests {S0, S1} of level α, choose the test that minimizes
the probability of type II error
Prob{accept H | H̄} = Prob{DN ∈ S1 | H̄}
that is, the probability of incorrectly accepting H. This is equivalent to
maximizing the power of the test
Prob{reject H | H̄} = Prob{DN ∈ S0 | H̄} = 1 − Prob{DN ∈ S1 | H̄}
which is the probability of correctly rejecting H. The higher the power,
the better!


TP example
Consider a r.v. z ~ N(μ, σ²), where σ is known, and a set of N i.i.d.
observations.
We want to test the null hypothesis μ = μ0 = 0, with α = 0.1.
Consider the 3 critical regions S0:
1. |μ̂ − μ0| > 1.645 σ/√N
2. μ̂ − μ0 > 1.282 σ/√N
3. |μ̂ − μ0| < 0.126 σ/√N
For all these tests Prob{DN ∈ S0 | H} = α, hence the significance level
is the same.
However, if H̄ : μ = 10, the type II error of the three tests is significantly
different.
What is the best one? (A simulation comparing them is sketched below.)
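A quick way to compare them is to estimate the power of each test by simulation under H̄ : μ = 10. A minimal sketch of ours (σ = 1 and N = 10 are illustrative choices; the slide leaves them unspecified):

    set.seed(0)
    sigma <- 1; N <- 10; mu0 <- 0; mu1 <- 10
    s <- sigma / sqrt(N)
    mu.hat <- replicate(10000, mean(rnorm(N, mean = mu1, sd = sigma)))
    mean(abs(mu.hat - mu0) > 1.645 * s)    # test 1: power ~ 1
    mean(mu.hat - mu0 > 1.282 * s)         # test 2: power ~ 1
    mean(abs(mu.hat - mu0) < 0.126 * s)    # test 3: power ~ 0 (it rejects only near mu0!)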


TP example (II)
[Figure: sampling distributions of the test statistic μ̂ under H (centred at 0, left) and under H̄ (centred at 10, right), with the non-critical region S1 and the critical region S0 marked on the axis.]
On the left: distribution of the test statistic if H : μ0 = 0 is true. On the right: distribution of
the test statistic if H̄ : μ1 = 10 is true. The interval marked by S1 denotes the set of observed
values for which H is accepted (non-critical region). The interval marked by S0 denotes the set
of observed values for which H is rejected (critical region). The area of the black pattern
region on the right equals Prob{DN ∈ S0 | H}, i.e. the probability of rejecting H when H is true
(Type I error). The area of the grey shaded region on the left equals the probability of accepting
H when H is false (Type II error).


TP example (III)
[Figure: the same two distributions with a different partition of the axis: a critical region S0 flanked by two non-critical intervals S1.]
On the left: distribution of the test statistic if H : μ0 = 0 is true. On the right: distribution of
the test statistic if H̄ : μ1 = 10 is true. The two intervals marked by S1 denote the set of observed
values for which H is accepted (non-critical region). The interval marked by S0 denotes the set
of observed values for which H is rejected (critical region). The area of the pattern region
equals Prob{DN ∈ S0 | H}, i.e. the probability of rejecting H when H is true (Type I error).
Which area corresponds to the probability of the Type II error?


Types of parametric tests
Consider random variables with a parametric distribution F(·, θ).
One-sample vs. two-sample: in the one-sample test we consider a single r.v.
and we formulate a hypothesis about its distribution. In the two-sample
test we consider 2 r.v.s z1 and z2 and we formulate hypotheses about their
differences/similarities.
Simple vs. composite: the test is simple if H completely describes the
distributions of the involved r.v.s; otherwise it is composite.
Single-sided (or one-tailed) vs. two-sided (or two-tailed): in the single-sided
test the region of rejection concerns only one tail of the null distribution.
This means that H̄ indicates the predicted direction of the difference
(e.g. H̄ : θ > θ0). In the two-sided test, the region of rejection concerns
both tails of the null distribution. This means that H̄ does not indicate
the predicted direction of the difference (e.g. H̄ : θ ≠ θ0).


Example of parametric test
Consider a parametric test on the distribution of a Gaussian r.v., and
suppose that the null hypothesis is H : μ = μ0, where μ0 is given and μ
represents the mean.
The test is one-sample and composite.
In order to know whether it is one- or two-sided we have to define the
alternative configuration: if H̄ : μ < μ0 the test is one-sided down, if
H̄ : μ > μ0 the test is one-sided up, if H̄ : μ ≠ μ0 the test is two-sided.


z-test (one-sample and one-sided)
Consider a random sample DN from x ~ N(μ, σ²), with μ unknown and σ² known.
STEP 1: consider the null hypothesis and the alternative (composite and one-sided)
H : μ = μ0;   H̄ : μ > μ0
STEP 2: fix the value α of the type I error.
STEP 3: choose a test statistic.
If H is true then the distribution of μ̂ is N(μ0, σ²/N). This means that the
variable
z = (μ̂ − μ0) √N / σ ~ N(0, 1)
It is convenient to rephrase the test in terms of the test statistic z.


z-test (one-sample and one-sided) (II)
STEP 4: determine the critical value for z.
We reject the hypothesis H if zN > zα, where zα is such that
Prob{N(0, 1) > zα} = α.
Ex: for α = 0.05 we would take zα = 1.645, since 5% of the standard normal
distribution lies to the right of 1.645.
R command: z_alpha = qnorm(alpha, lower.tail=FALSE)
STEP 5: once the dataset DN is measured, the value of the test statistic is
zN = (μ̂ − μ0) √N / σ


TP: example z-test
Consider a r.v. z ~ N(μ, 1).
We want to test H : μ = 5 against H̄ : μ > 5 with significance level 0.05.
Suppose that the data are DN = {5.1, 5.5, 4.9, 5.3}.
Then μ̂ = 5.2 and zN = (5.2 − 5) · √4 / 1 = 0.4.
Since this is less than zα = 1.645, we do not reject the null hypothesis.
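The same test in R (a minimal sketch re-tracing the computation above):

    DN  <- c(5.1, 5.5, 4.9, 5.3)
    mu0 <- 5; sigma <- 1; alpha <- 0.05
    zN <- (mean(DN) - mu0) * sqrt(length(DN)) / sigma   # observed statistic: 0.4
    zN > qnorm(alpha, lower.tail = FALSE)               # FALSE: do not reject H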


Two-sided parametric tests
Assumption: all the variables are normal!

  Name      one/two-sample   known        H              H̄
  z-test    one              σ²           μ = μ0         μ ≠ μ0
  z-test    two              σ1² = σ2²    μ1 = μ2        μ1 ≠ μ2
  t-test    one                           μ = μ0         μ ≠ μ0
  t-test    two              σ1 = σ2      μ1 = μ2        μ1 ≠ μ2
  χ²-test   one              μ            σ² = σ0²       σ² ≠ σ0²
  χ²-test   one                           σ² = σ0²       σ² ≠ σ0²
  F-test    two                           σ1² = σ2²      σ1² ≠ σ2²


Student's t-distribution
If x ~ N(0, 1) and y ~ χ²_N are independent, then the Student's t-distribution
with N degrees of freedom is the distribution of the r.v.
z = x / √(y/N)
We denote this with z ~ t_N.
If z1, ..., zN are i.i.d. ~ N(μ, σ²), then
√N (μ̂ − μ) / √( SS / (N − 1) ) ~ t_{N−1}
where μ̂ is the sample average and SS = Σ_i (zi − μ̂)².


t-test: one-sample and two-sided
Consider a random sample from N(μ, σ²) with σ² unknown. Let
H : μ = μ0;   H̄ : μ ≠ μ0
Let
T = √N (μ̂ − μ0) / σ̂ = √N (μ̂ − μ0) / √( Σ_{i=1}^N (zi − μ̂)² / (N − 1) )
be a statistic computed using the data set DN.


t-test: one-sample and two-sided (II)
It can be shown that, if the hypothesis H holds, T ~ T_{N−1}, i.e. it is a r.v.
with a Student distribution with N − 1 degrees of freedom.
The size-α t-test consists in rejecting H if
|T| > k = t_{α/2,N−1}
where t_{α/2,N−1} is the upper α/2 point of a T-distribution on N − 1 degrees
of freedom, i.e.
Prob{t_{N−1} > t_{α/2,N−1}} = α/2,  so that  Prob{|t_{N−1}| > t_{α/2,N−1}} = α
where t_{N−1} ~ T_{N−1}.
In other terms, H is rejected when |T| is large.
R command: t_{α/2,N−1} = qt(alpha/2, N-1, lower.tail=FALSE)


TP example
Does jogging lead to a reduction in pulse rate? Eight non-jogging volunteers
engaged in a one-month jogging programme. Their pulses were taken before
and after the programme.

  pulse rate before   74   86   98   102   78   84   79   70
  pulse rate after    70   85   90   110   71   80   69   74
  decrease             4    1    8    -8    7    4   10    -4

Suppose that the decreases are samples from N(μ, σ²) for some unknown σ².
We want to test H : μ = μ0 = 0 against H̄ : μ ≠ 0 with a significance level α = 0.05.
We have N = 8, μ̂ = 2.75, T = 1.263, t_{α/2,N−1} = 2.365.
Since |T| ≤ t_{α/2,N−1}, the data are not sufficient to reject the hypothesis H. In
other terms, we do not have enough evidence to show that there is a reduction in
pulse rate.
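In R the whole procedure is one call to the built-in t.test (a sketch; the numbers match the slide):

    before <- c(74, 86, 98, 102, 78, 84, 79, 70)
    after  <- c(70, 85, 90, 110, 71, 80, 69, 74)
    t.test(before - after, mu = 0)    # one-sample two-sided t-test on the decreases
    ## equivalently: t.test(before, after, paired = TRUE)
    ## gives t = 1.263, df = 7, p-value ~ 0.25 > 0.05: H is not rejected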


The chi-squared distribution
For N a positive integer, a r.v. z has a χ²_N distribution if
z = x1² + ... + xN²
where x1, x2, ..., xN are i.i.d. random variables ~ N(0, 1).
The probability distribution is a gamma distribution with parameters (N/2, 1/2).
E[z] = N and Var[z] = 2N.
The distribution is called a chi-squared distribution with N degrees of
freedom.


χ²-test: one-sample and two-sided
Consider a random sample from N(μ, σ²) with μ known.
Let
H : σ² = σ0²;   H̄ : σ² ≠ σ0²
Let SS = Σ_i (zi − μ)².
It can be shown that if H is true then SS/σ0² ~ χ²_N.
The size-α χ²-test rejects H if SS/σ0² < a1 or SS/σ0² > a2, where
Prob{SS/σ0² < a1} + Prob{SS/σ0² > a2} = α
If μ is unknown, you must
1. replace μ with μ̂ in the quantity SS,
2. use a χ²_{N−1} distribution.
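A minimal R sketch of this test (the function name is ours; we split α equally between the two tails, one common choice the slide leaves open):

    var.chisq.test <- function(z, mu, sigma2.0, alpha = 0.05) {
      SS <- sum((z - mu)^2)                 # mu known, hence N degrees of freedom
      N  <- length(z)
      a1 <- qchisq(alpha / 2, df = N)       # lower critical value
      a2 <- qchisq(1 - alpha / 2, df = N)   # upper critical value
      (SS / sigma2.0 < a1) || (SS / sigma2.0 > a2)   # TRUE = reject H
    }
    set.seed(0)
    var.chisq.test(rnorm(20, sd = 2), mu = 0, sigma2.0 = 1)   # TRUE: reject H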


t-test: two-sample, two-sided
Consider two r.v.s x ~ N(μ1, σ²) and y ~ N(μ2, σ²) with the same variance.
Let D_N^x and D_M^y be two independent sets of samples.
We want to test H : μ1 = μ2 against H̄ : μ1 ≠ μ2.
Let
μ̂x = (Σ_{i=1}^N xi) / N,   SSx = Σ_{i=1}^N (xi − μ̂x)²,
μ̂y = (Σ_{i=1}^M yi) / M,   SSy = Σ_{i=1}^M (yi − μ̂y)²
Once we define the statistic
T = (μ̂x − μ̂y) / √( (1/M + 1/N) · (SSx + SSy) / (M + N − 2) ) ~ T_{M+N−2}
it can be shown that a test of size α rejects H if
|T| > t_{α/2,M+N−2}
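In R this is the pooled-variance version of the built-in t.test (a sketch with made-up data):

    set.seed(0)
    x <- rnorm(10, mean = 0)    # hypothetical sample from the first population
    y <- rnorm(12, mean = 1)    # hypothetical sample from the second population
    t.test(x, y, var.equal = TRUE)   # pooled-variance test: the statistic T ~ T_{M+N-2} above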


F-distribution
Let x ~ χ²_M and y ~ χ²_N be two independent r.v.s. A r.v. z has an
F-distribution F_{M,N} with M and N degrees of freedom if
z = (x/M) / (y/N)
If z ~ F_{M,N} then 1/z ~ F_{N,M}.
If z ~ T_N then z² ~ F_{1,N}.


F-distribution
[Figure: density and cumulative distribution function of F_{M,N} with M = 20, N = 10.]
R script s_f.R.


F-test: two-sample, two-sided
Consider a random sample x1, ..., xM from N(μ1, σ1²) and a random sample
y1, ..., yN from N(μ2, σ2²), with μ1 and μ2 unknown. Suppose we want to test
H : σ1² = σ2²;   H̄ : σ1² ≠ σ2²
Let us consider the statistic
f = σ̂1² / σ̂2² = ( SS1 / (M − 1) ) / ( SS2 / (N − 1) )
Since SS1/σ1² ~ χ²_{M−1} and SS2/σ2² ~ χ²_{N−1}, it can be shown that if H is
true the ratio f has an F-distribution F_{M−1,N−1}.
We reject H if the ratio f is large, i.e. f > F_{α,M−1,N−1}, where
Prob{z > F_{α,M−1,N−1}} = α   if z ~ F_{M−1,N−1}.
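In R the built-in var.test implements this F-test (a sketch with made-up data; note that var.test is two-tailed by default, while the slide rejects only for large f):

    set.seed(0)
    x <- rnorm(20, sd = 1)    # hypothetical sample, sigma1 = 1
    y <- rnorm(15, sd = 2)    # hypothetical sample, sigma2 = 2
    var.test(x, y)            # f = var(x)/var(y) ~ F_{M-1,N-1} under H
    ## one-tailed critical value as on this slide:
    var(x) / var(y) > qf(0.05, df1 = 19, df2 = 14, lower.tail = FALSE)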

