

BIOINFORMATIK II
PROBABILITY & STATISTICS
Summer semester 2006
The University of Zürich and ETH Zürich

Lecture 2a: Statistical estimation.

Prof. Andrew Barbour
Dr. Béatrice de Tilière
Adapted from a course by
Dr. D. Schuhmacher & Dr. D. Svensson.

Problems in statistics:
Given: a probability model for some (chance) experiment:
X ∼ P_θ.
Here, P_θ is a probability distribution (given by a probability function
p_θ(x) or a distribution function F_θ(x)) for any θ. The P_θ are all known,
but the actual value of the parameter θ is unknown.
(Ex: X ∼ Bin(100, p), but the probability p is unknown.)
Two main areas in statistics are:
ESTIMATION: estimate the unknown value of θ given observations
of X (a single observation is usually not enough).
TESTING: test a hypothesis about the unknown value of θ. Base
acceptance/rejection upon observations of X (a single observation is
usually not enough).

Statistical estimation:
Given: a probability model X ∼ P_θ,
where the P_θ are known, but the actual value of the parameter θ is unknown.
(Ex: X ∼ Bin(100, p), but the probability p is unknown.)
To be able to estimate the value of θ, we repeat the experiment n
times independently, which gives x1, x2, . . . , xn. These are n
observations of X.
Next step: use the observations x1, x2, . . . , xn to compute an
estimate of θ. (Observe the values, and then take a good guess.)
The collection x1, x2, . . . , xn is called an (observed) sample of random
variables X1, X2, . . . , Xn; the latter are independent and have the same
distribution as X.
The collection X1, X2, . . . , Xn is called a (random) sample.

Estimator, estimate
Def: An estimator of θ is a function of the random variables
X1, X2, . . . , Xn, written θ̂(X1, X2, . . . , Xn).
For theory and principles of estimation. It is RANDOM!
******
Def: An estimate of θ is the quantity θ̂(x1, x2, . . . , xn) calculated from
the observed values x1, x2, . . . , xn of X1, X2, . . . , Xn.
For practice. The value computed after the experiment.
It is not random.

Example: Suppose X ∼ Bin(100, θ), where the value of θ is unknown.

Let X1, . . . , X20 be a random sample; i.e.
Xi ∼ Bin(100, θ) for i = 1, 2, . . . , 20
and independent. Then:
θ̂1(X1, X2, . . . , X20) := (X1 + X2 + . . . + X20) / 2000
is an estimator of the unknown value of θ, and
θ̂2(X1, X2, . . . , X20) := X1 + X2 + . . . + X20
is another one.
However, θ̂2(X1, X2, . . . , X20) is not very useful, since it might be larger
than 1 (but θ has to be between 0 and 1, since θ here is a probability).
There are many possible estimators. How can we find a good one?
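As an illustration (not part of the slides), the two estimators can be compared on simulated data; the true value θ = 0.3 and the seed below are arbitrary choices:

```python
import random

def simulate_sample(theta, n=20, trials=100, seed=1):
    """Draw n observations of X ~ Bin(trials, theta) by summing coin flips."""
    rng = random.Random(seed)
    return [sum(rng.random() < theta for _ in range(trials)) for _ in range(n)]

def theta_hat_1(xs):
    # (X1 + ... + X20) / 2000: total successes over total trials.
    return sum(xs) / 2000

def theta_hat_2(xs):
    # The plain sum: typically far larger than 1, hence useless for a probability.
    return sum(xs)

xs = simulate_sample(theta=0.3)
print(theta_hat_1(xs), theta_hat_2(xs))
```

θ̂1 lands near the true value 0.3, while θ̂2 is in the hundreds.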

Some principles for finding good estimators

Consistency: For every ε > 0 and every possible value of θ:
P_θ(|θ̂(X1, X2, . . . , Xn) − θ| > ε) → 0 as n → ∞
(the more observations, the closer to the truth).

The mean square error MSE_θ[θ̂] as low as possible:
MSE_θ[θ̂] := E_θ[(θ̂(X1, X2, . . . , Xn) − θ)²]
for every θ.
Not too much variation, not too far away from the truth θ.
(Note: MSE_θ[θ̂] = Var_θ[θ̂] if θ̂ is unbiased, i.e. if E_θ[θ̂] = θ.)

Nice if the estimator θ̂(X1, X2, . . . , Xn) has a known probability
distribution (at least, in an asymptotic sense).
Maximum likelihood estimators have these properties!
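A quick Monte Carlo sketch (parameter values arbitrary) of the MSE of the sample-proportion estimator θ̂ = (X1 + . . . + Xn)/(100n) in the Bin(100, θ) model; consistency shows up as the MSE shrinking with n:

```python
import random

def mse_by_simulation(theta, n, reps=1000, seed=2):
    """Monte Carlo estimate of MSE_theta[theta_hat] = E[(theta_hat - theta)^2]
    for theta_hat = (X1 + ... + Xn) / (100 n), Xi ~ Bin(100, theta)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        s = sum(sum(rng.random() < theta for _ in range(100)) for _ in range(n))
        total += (s / (100 * n) - theta) ** 2
    return total / reps

# More observations -> smaller mean square error.
print(mse_by_simulation(0.3, 5), mse_by_simulation(0.3, 40))
```

For this (unbiased) estimator the exact MSE is θ(1 − θ)/(100n), so the two numbers should come out near 4.2e−4 and 5.3e−5.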

Maximum Likelihood Estimation (here: for discrete RVs)

Suppose X1, X2, . . . , Xn is a (random) sample from some distribution P_θ
with probability function p_θ(x), where the value of θ is unknown.
Let
L(x1, x2, . . . , xn; θ) := P_θ(X1 = x1, X2 = x2, . . . , Xn = xn).
This is the probability of observing the sample x1, x2, . . . , xn if the
unknown parameter takes the value θ.
L(x1, x2, . . . , xn; θ) is called the likelihood function.
By independence of X1, . . . , Xn,
L(x1, x2, . . . , xn; θ) = p_θ(x1) · p_θ(x2) · · · p_θ(xn).
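For the binomial model used throughout this lecture, the likelihood product can be written down directly (the sample below is hypothetical):

```python
import math

def binom_pmf(x, trials, theta):
    """p_theta(x) for X ~ Bin(trials, theta): C(trials, x) theta^x (1-theta)^(trials-x)."""
    return math.comb(trials, x) * theta**x * (1 - theta) ** (trials - x)

def likelihood(xs, theta, trials=100):
    """L(x1, ..., xn; theta) = p_theta(x1) * ... * p_theta(xn)."""
    out = 1.0
    for x in xs:
        out *= binom_pmf(x, trials, theta)
    return out

xs = [30, 25, 35]
# The likelihood is larger near the value of theta that explains the data well.
print(likelihood(xs, 0.30), likelihood(xs, 0.90))
```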

Now suppose that we really did observe the values x1, x2, . . . , xn.

Def: The maximum likelihood estimate (ML estimate) of θ is
the value θ_ml of θ that maximizes the likelihood function
L(x1, x2, . . . , xn; θ).
That is, the value θ_ml is the ML estimate of θ if
L(x1, x2, . . . , xn; θ_ml) > L(x1, x2, . . . , xn; θ)
for all other values θ ≠ θ_ml.
INTERPRETATION: It is more likely to observe the sample
x1, x2, . . . , xn if the parameter is equal to θ_ml than for any other value
of θ.
θ_ml is the value of θ that best explains the data.

X1, X2, . . . , Xn is a random sample from a probability distribution P_θ,
θ unknown.
Def: The maximum likelihood estimator is the (random!) value θ̂_ml
that maximizes
L(X1, X2, . . . , Xn; θ).
The maximization is carried out with respect to θ.
The maximization depends upon the random variables X1, . . . , Xn, so θ̂_ml
is a function of X1, . . . , Xn, written θ̂_ml(X1, X2, . . . , Xn).
(The maximum likelihood estimate is the value obtained by computing
θ̂_ml(x1, x2, . . . , xn) using the observed sample x1, x2, . . . , xn.)


Two important things to remember:

1: The maximum likelihood estimator θ̂_ml(X1, X2, . . . , Xn) has
good properties!
2: How to compute the estimate:
Plug the observed data into the likelihood function
L(x1, x2, . . . , xn; θ)
and vary the value of θ until you find the value that maximizes
L(x1, x2, . . . , xn; θ). (Mathematically, the ML estimate is usually found
by differentiation.)


Example: Suppose that we want to find the maximum likelihood
estimate for θ in the Bin(100, θ)-distribution based on an observed
sample x1, x2, . . . , xn.
Then
p_θ(x) = C(100, x) θ^x (1 − θ)^(100−x)   for x = 0, 1, . . . , 100;
and hence
L(x1, . . . , xn; θ) = C(100, x1) θ^(x1) (1 − θ)^(100−x1) · · · C(100, xn) θ^(xn) (1 − θ)^(100−xn).

The binomial coefficients C(100, xi) do not vary with θ, so finding the θ that
maximizes the above expression amounts to finding the θ that maximizes
θ^(x1) (1 − θ)^(100−x1) · · · θ^(xn) (1 − θ)^(100−xn) = θ^(Σ xi) (1 − θ)^(100n − Σ xi).
This can be done by differentiating in θ and setting the result to zero . . .


. . . differentiating in θ and setting the result to zero:

Set s := Σ_{i=1}^n xi, and solve
0 = d/dθ [ θ^s (1 − θ)^(100n−s) ]
  = s θ^(s−1) (1 − θ)^(100n−s) − (100n − s) θ^s (1 − θ)^(100n−s−1)
  = [ s(1 − θ) − (100n − s)θ ] θ^(s−1) (1 − θ)^(100n−s−1)
  = [ s − 100nθ ] θ^(s−1) (1 − θ)^(100n−s−1).
Possible solutions are θ = 0, θ = 1 (→ no maximum!), and
θ = s / (100n) = (1/(100n)) Σ_{i=1}^n xi   (→ maximum!).


Hence, the maximum likelihood estimate for θ in the
Bin(100, θ)-distribution is given by
θ̂(x1, x2, . . . , xn) = (1/(100n)) Σ_{i=1}^n xi.
(Compare this with the two estimators on Slide 5: θ̂1(X1, X2, . . . , X20) is
the ML estimator!)
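A numerical sanity check (the observed counts are made up): a grid search over θ recovers the closed-form estimate s/(100n):

```python
import math

def log_likelihood(theta, s, n):
    """log of theta^s (1-theta)^(100n - s), the theta-dependent part
    of the Bin(100, theta) likelihood for n observations summing to s."""
    return s * math.log(theta) + (100 * n - s) * math.log(1 - theta)

def ml_estimate(xs):
    return sum(xs) / (100 * len(xs))

xs = [30, 25, 35, 28]          # hypothetical observed counts out of 100 each
s, n = sum(xs), len(xs)
theta_ml = ml_estimate(xs)     # 118 / 400 = 0.295
# Grid search confirms the closed-form estimate maximizes the likelihood.
grid = [k / 1000 for k in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, s, n))
print(theta_ml, best)
```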


Likelihoods are not just for independent observations!

If x1, x2, . . . , xn is an observed sample of independent and identically
distributed random variables X1, . . . , Xn, then the likelihood function
is
L(x1, x2, . . . , xn; θ) = p_θ(x1) · p_θ(x2) · · · p_θ(xn).
(A product, due to independence.)
Be careful with dependent variables, and with random variables
having different distributions! (E.g., observations from a Markov
chain.)
The likelihood function L(x1, x2, . . . , xn; θ) is then still defined as the
probability of observing the sample, but L(x1, x2, . . . , xn; θ) cannot
be computed as the product above!
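For instance, for observations from a Markov chain the likelihood of an observed path is the initial-state probability times the product of one-step transition probabilities, not a product of marginals. A minimal sketch with a hypothetical two-state chain:

```python
def markov_likelihood(states, init, trans):
    """Likelihood of an observed Markov chain path: the initial-state
    probability times the product of one-step transition probabilities
    (NOT a product of marginals, because the observations are dependent)."""
    like = init[states[0]]
    for a, b in zip(states, states[1:]):
        like *= trans[a][b]
    return like

# Hypothetical two-state chain over {'a', 'b'}.
init = {'a': 0.5, 'b': 0.5}
trans = {'a': {'a': 0.9, 'b': 0.1},
         'b': {'a': 0.2, 'b': 0.8}}
print(markov_likelihood(['a', 'a', 'b'], init, trans))  # 0.5 * 0.9 * 0.1 = 0.045
```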


Unbiased estimators
Let θ̂(X1, X2, . . . , Xn) be some estimator of the unknown value of θ.
If
E_θ[θ̂(X1, X2, . . . , Xn)] = θ
holds for every possible value of θ, we say that θ̂ is unbiased.

It is nice if our estimator has this property, but it is not a good
principle to rely on in order to find good estimators!
One often gets unbiasedness only at the expense of other nice
properties.
Sometimes unbiased estimators are useless
(e.g. the unbiased estimator for p in the Geo(p)-distribution).
Sometimes there is no unbiased estimator at all
(e.g. there is no unbiased estimator for 1/p in the
Bin(n, p)-distribution).
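For the binomial ML estimator from the previous slides, unbiasedness does hold (E[Σ Xi] = 100nθ); a simulation sketch with an arbitrary θ = 0.3:

```python
import random

def avg_estimate(theta, n=10, reps=2000, seed=3):
    """Monte Carlo approximation of E_theta[theta_hat] for
    theta_hat = (X1 + ... + Xn) / (100 n), Xi ~ Bin(100, theta)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(reps):
        s = sum(sum(rng.random() < theta for _ in range(100)) for _ in range(n))
        acc += s / (100 * n)
    return acc / reps

# E[theta_hat] = theta, so the average should land close to 0.3.
print(avg_estimate(0.3))
```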


Lecture 2b: Statistical hypothesis testing.


Statistical testing problem:

Given: a probability model for some (chance) experiment:
X ∼ P_θ.
Typically, the form of the probability distribution P_θ is known for each θ,
but the actual value of the parameter θ is unknown.
(Ex: X ∼ Bin(100, p), but the probability p is unknown.)
Want: to be able to test the plausibility of certain hypotheses concerning
the probability model.
Typically: the hypotheses specify values for the parameter θ.
(Ex: X ∼ Bin(100, p);
null hypothesis: p = 0.25,
alternative hypothesis: p = 0.35.)


Statistical hypothesis testing involves the test of a
null hypothesis H0 against an alternative hypothesis HA.
H0 is the "default" hypothesis, taken to be the truth unless
convincing evidence against it is found.
(Default? In the sense of:
"two given sequences are not evolutionarily related",
"there is no life on Mars".)
HA is more controversial, and its acceptance (in place of H0)
requires strong evidence.
(Controversial? Like:
"two given sequences are evolutionarily related",
"there is life on Mars".)


Example (sequence matching):

Two random DNA sequences of length N, where
the letters within each sequence are independently generated;
each letter equals a, c, g, or t with uniform probabilities, i.e.
0.25, 0.25, 0.25, 0.25.
Let X = the number of matches; then
X ∼ Bin(N, p), where p = P(match).
H0: p = 0.25 (i.e. the sequences are not evolutionarily related).
HA: p > 0.25 (the sequences are evolutionarily related).
(The value of p depends on whether the sequences are dependent or not,
i.e. on the joint probabilities!)
If X takes an unexpectedly large value, we reject H0 and accept HA as
being the truth.


The decision to make:

Accept H0, or reject it in favour of HA.
How?
The decision taken is based upon the observed value of some function
T(X1, X2, . . . , Xn)
of the sample X1, X2, . . . , Xn.
This function is called a test statistic.
It is a random variable!


Critical value(s), rejection region.

Ideally, the distribution of T(X1, X2, . . . , Xn) is known under the
assumption that H0 is true.
From this "null hypothesis distribution", one or several critical values can
be determined.
If H0 is true, it is unlikely that these critical values will be reached by T.
But if that happens, then H0 is rejected
(since H0 then explains the data X1, X2, . . . , Xn badly).
(Exactly what does "unlikely" mean? That depends upon the significance
level of the test, determined by the experimenter!)


Type I error.
Ex: Suppose that the null hypothesis H0 will be rejected if
T (X1 , X2 , . . . , Xn ) > C for some constant C.
And otherwise, if T (X1 , X2 , . . . , Xn ) C, H0 is accepted.
(C is the critical value in this example).
NOTE: Even if H0 is true it is (in general) possible that
T (X1 , X2 , . . . , Xn ) > C, since the data are random!
If this occurs, a Type I error is being made: rejection of a true null
hypothesis.


Significance level
(Type I error = rejection of a true null hypothesis.)
The probability of this type of incorrect decision, the significance
level α, should be kept (reasonably) low:
α = P(T(X1, X2, . . . , Xn) > C | H0 true)
  = P(Type I error).
Typically, one takes α equal to some low probability (often 0.05 or 0.01),
and then determines the corresponding value of C.
C depends upon α!
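Concretely, in a binomial model C can be found by scanning the distribution function under H0 (this sketch mirrors the Bin(20, 0.25) example used later in the lecture):

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def critical_value(n, p0, alpha):
    """Smallest C with P(T > C | H0 true) <= alpha for T ~ Bin(n, p0)."""
    cdf = 0.0
    for c in range(n + 1):
        cdf += binom_pmf(c, n, p0)
        if 1 - cdf <= alpha:
            return c
    return n

# For Bin(20, 0.25) and alpha = 0.05 the critical value is C = 8:
# rejecting when T > 8 gives a type I error probability of about 0.041.
print(critical_value(20, 0.25, 0.05))
```

Lowering α pushes C upward, which is the sense in which C depends upon α.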


Statistical Significance
Our test (at significance level α) is:
Reject H0 ⟺ T(X1, X2, . . . , Xn) > C.

If we observe values x1, . . . , xn such that T(x1, x2, . . . , xn) > C, then we
reject H0 and say that we have "statistical significance".
***
This statement "statistical significance" is always relative to some
significance level α.
(Statistical significance does not automatically mean good scientific
evidence. If α = 1, then any experiment outcome would be "statistically
significant".)


Type II error
Another type of incorrect decision can also be made:
Type II error: acceptance of a false null hypothesis.
(Type II is usually a less serious error than Type I.)
Suppose that the significance level is fixed (α = 0.05 or some other
value), and the critical value C has been determined such that
α = P(T(X1, X2, . . . , Xn) > C | H0 true)
holds.
Then, the probability of a Type II error is
β = P(T(X1, X2, . . . , Xn) ≤ C | H0 false).


Power, Type II error

Furthermore, the power of the test is then defined as
1 − β = P(T(X1, X2, . . . , Xn) > C | H0 false).
The power describes how good the test is at detecting that the null
hypothesis is false.


Power, Type II errors. The power is typically more complicated to
compute than the significance level.
In fact, the power might depend upon "how false" H0 is, in the following
sense:
Ex: X ∼ Bin(20, p). H0: p = 0.25, HA: p > 0.25. Suppose the
significance level is fixed at α = 0.041. Then
P(X > 8 | H0 true) = P(Bin(20, 0.25) > 8) = 0.041.
The power (1 − β)? Assume H0 false (i.e., p > 0.25):
1 − β = P(X > 8 | p = 0.26) = P(Bin(20, 0.26) > 8) = 0.0515.
1 − β = P(X > 8 | p = 0.3) = P(Bin(20, 0.3) > 8) = 0.1133.
. . .
1 − β = P(X > 8 | p = 0.9) = P(Bin(20, 0.9) > 8) ≈ 1.
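These tail probabilities are straightforward to reproduce with the standard library:

```python
import math

def tail_prob(n, p, c):
    """P(Bin(n, p) > c): the power of the 'reject if T > c' test when the
    true match probability is p (and its level when p is the H0 value)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(c + 1, n + 1))

# Power of the 'reject if X > 8' test for X ~ Bin(20, p), at several p:
for p in (0.25, 0.26, 0.3, 0.9):
    print(p, round(tail_prob(20, p, 8), 4))
```

The p = 0.25 row reproduces the significance level 0.041; the power then climbs toward 1 as p moves away from 0.25.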


p-values:
The p-value is a probability, and can only be computed after the data
have been observed.
Suppose that the test is: we reject H0 if and only if
T(X1, X2, . . . , Xn) > C, for some critical value C.
The significance level is
α = P(T(X1, X2, . . . , Xn) > C | H0 true).
Now suppose we observe x1, x2, . . . , xn.
Compute the observed test statistic t := T(x1, x2, . . . , xn).
The p-value is defined as
P(T(X1, X2, . . . , Xn) ≥ t | H0 true).
(Interpretation: the probability of seeing something at least as extreme
as just observed . . . how unlikely the observed value is.)
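A sketch for the binomial setting (observing t matches out of n positions, under H0: p = p0):

```python
import math

def p_value(t_obs, n, p0):
    """P(T >= t_obs | H0 true) for T ~ Bin(n, p0)."""
    return sum(math.comb(n, k) * p0**k * (1 - p0) ** (n - k)
               for k in range(t_obs, n + 1))

# Observing t = 9 in the Bin(20, 0.25) example: P(T >= 9) = P(T > 8).
print(p_value(9, 20, 0.25))
```

The more extreme the observation, the smaller the p-value.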


The five main steps in statistical testing:

1. Declare the hypotheses H0 and HA. (Before the data are seen!!!)
2. Determine a test statistic.
3. Choose the significance level α ∈ (0, 1).
4. Determine those observed values of the test statistic that lead to
rejection of H0 (determine the critical value(s)).
5. Obtain the data and determine whether the observed value of the test
statistic is equal to or more extreme than the critical value(s) calculated
in step 4.


NOTE: Point 4 requires knowledge of the distribution of the test
statistic under the assumption that H0 is true.
4. Determine those observed values of the test statistic that lead to
rejection of H0 (determine the critical value(s)).
This typically means:
Given the significance level α ∈ (0, 1),
for which C do we have
P(T(X1, X2, . . . , Xn) > C | H0 true) = α?
If this distribution is unknown or too complicated, it can often be
approximately determined from a computer simulation.


How can we find a good test statistic?


To be able to perform a statistical test, one is required to find a suitable
test statistic.
2. Determine a test statistic.
In many cases it is optimal to use the likelihood ratio as a statistic.
(Optimal in certain probabilistic senses.)


Simple hypotheses
Suppose that we have a test problem where the hypotheses are simple,
i.e. they completely specify the probability function.
Ex: X ∼ Bin(N, p). H0: p = 0.25, HA: p = 0.35, which is equivalent to
H0: P(X = k) = C(N, k) 0.25^k (1 − 0.25)^(N−k)
and
HA: P(X = k) = C(N, k) 0.35^k (1 − 0.35)^(N−k).


Likelihood ratio test:

Let X1, X2, . . . , Xn be the sample (independent RVs), and let p0(x) and
p1(x) be the probability functions specified by the simple hypotheses H0
and HA, respectively.
Define the likelihood ratio LR as
LR := L(X1, X2, . . . , Xn; θ1) / L(X1, X2, . . . , Xn; θ0)
    = [p1(X1) · p1(X2) · · · p1(Xn)] / [p0(X1) · p0(X2) · · · p0(Xn)].

Choose a constant C such that
P(LR ≥ C | H0 true) = α.
Then this yields the most powerful test at the significance level α for
this testing problem (the Neyman-Pearson Lemma).
There are good reasons for using likelihood ratios in statistics
(good test properties, a well-studied topic).
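A sketch of the LR statistic for the Bernoulli match model of the following slides (the observed 0/1 vector below is made up):

```python
def likelihood_ratio(xs, p0, p1):
    """LR = L(x; p1) / L(x; p0) for i.i.d. 0/1 observations
    under the simple hypotheses H0: p = p0 and HA: p = p1."""
    lr = 1.0
    for x in xs:
        lr *= (p1 if x == 1 else 1 - p1) / (p0 if x == 1 else 1 - p0)
    return lr

xs = [1, 0, 0, 1, 1, 0]
s, n = sum(xs), len(xs)
# Closed form for this model: (p1/p0)^s * ((1-p1)/(1-p0))^(n-s).
closed = (0.35 / 0.25) ** s * (0.65 / 0.75) ** (n - s)
print(likelihood_ratio(xs, 0.25, 0.35), closed)
```

More matches (larger s) make LR larger, which is exactly why rejecting for large LR reduces to rejecting for large s.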


Example (sequence matching):

Two DNA sequences of length n. We want to test if they are
evolutionarily related.
Assumption: letters in each sequence independently generated with
uniform probabilities (probability 1/4 for each letter).
Let Xi = 1 if there is a match at position i, otherwise Xi = 0. Suppose
that X1, X2, . . . , Xn are i.i.d. (independent and identically distributed)
with match probability p.
(Note that in this case P(Xi = x) = p^x (1 − p)^(1−x) for x ∈ {0, 1}.)
We now want to test
H0: p = 0.25 (i.e. the sequences are not evolutionarily related) against
HA: p = 0.35 (the sequences are evolutionarily related, and there is some
evidence that in that case p ≈ 0.35)
at a significance level of α = 0.05 (or approximately so), using an LR test.


We make a likelihood ratio test:

Likelihood: For x1, . . . , xn ∈ {0, 1},
L(x1, . . . , xn; p) = p^(x1) (1 − p)^(1−x1) · · · p^(xn) (1 − p)^(1−xn).
Therefore, with p0 = 0.25 and p1 = 0.35,
LR = [p1^s (1 − p1)^(n−s)] / [p0^s (1 − p0)^(n−s)]
   = (p1/p0)^s ((1 − p1)/(1 − p0))^(n−s)
   = (7/5)^s (13/15)^(n−s),
where s = Σ_{i=1}^n xi is the total number of matches. But this LR is just an
(increasing) function of s! So instead of rejecting H0 if LR is "too big",
we can also use the test that we reject H0 if s is "too big". What does
"too big" mean??


What is "too big"? What is the critical value C for rejecting H0?

We want significance level ≈ 0.05; that is, for S := Σ_{i=1}^n Xi, we choose
C in such a way that
P(S > C | H0 true) ≈ 0.05.
We know that, under the assumption that H0 is true, S has the
Bin(n, 0.25) distribution, so we just have to find out at what value C the
distribution function of the Bin(n, 0.25)-distribution jumps from below
0.95 to above 0.95.
Choose n = 1000 (say); then
P(S ≤ 272 | H0 true) ≈ 0.9488, so P(S > 272 | H0 true) ≈ 0.0512;
P(S ≤ 273 | H0 true) ≈ 0.9559, so P(S > 273 | H0 true) ≈ 0.0441.
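The jump of the Bin(1000, 0.25) distribution function past 0.95 can be checked numerically (log-pmf via lgamma to keep the huge binomial coefficients manageable):

```python
import math

def binom_logpmf(k, n, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def binom_cdf(c, n, p):
    """P(S <= c) for S ~ Bin(n, p)."""
    return sum(math.exp(binom_logpmf(k, n, p)) for k in range(c + 1))

print(binom_cdf(272, 1000, 0.25))  # just below 0.95
print(binom_cdf(273, 1000, 0.25))  # just above 0.95
```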


Thus for an observed sample (x1, x2, . . . , x1000), our test at significance
level 0.0441 says: we can reject H0 if Σ_{i=1}^{1000} xi > 273, and we cannot
reject it if Σ_{i=1}^{1000} xi ≤ 273.

This test has power
P(S > C | H0 false) = P(S > 273 | p = 0.35) ≈ 0.9999998 ≈ 1.
(Which is so good because we have such a big sample!!)
***
NOTE: If one wants a significance level of exactly 0.05, one usually
decides randomly whether to reject H0 in the critical case that
Σ_{i=1}^{1000} xi = 273!


The introductory example, revisited

For the sequences from the first lecture (Slide 5), one obtains the test
(n = 26): Reject H0 if Σ_{i=1}^{26} xi > 10, and do not reject H0 if
Σ_{i=1}^{26} xi ≤ 10, which is at a significance level of 0.0401.
This means that we have statistical significance for rejecting the
null hypothesis that the two sequences are not evolutionarily related.
(The power of the test is this time only about 0.278.)
***
Note, however, that if we wanted a lower significance level, say 0.01,
we would not be able to reject H0 (the p-value for 11 matches in two
sequences of length 26 is about 0.0155).
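The n = 26 numbers can be reproduced the same way (pure stdlib sketch):

```python
import math

def tail(n, p, c):
    """P(S > c) for S ~ Bin(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(c + 1, n + 1))

alpha = tail(26, 0.25, 10)   # significance level of 'reject if sum > 10'
power = tail(26, 0.35, 10)   # power against the alternative p = 0.35
print(alpha, power)
```

With only 26 positions the power is low; compare this with the near-1 power obtained above for n = 1000.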
