
I. Estimation continued
 Least Squares (OLS)
 Likelihood

II. Hypothesis Testing


 t-test
 one sample
 two sample
 paired

 non-parametric tests
 assumptions
 Mann-Whitney-Wilcoxon

Methods for estimating parameters


Frequentist (classical approaches)
 when sampling distribution is known
 Ordinary Least Squares
 Maximum Likelihood

 when sampling distribution is unknown


 Numerical Resampling
 Bootstrap
 Jackknife
 Permutation / Randomization tests

Bayesian inference - estimation

Ordinary Least Squares (OLS)


 the parameter estimate that minimizes the sum of
squared differences between each value in a
sample and the parameter
 in this example, the parameter is the mean

OLS: min Σ (yi − ȳ)², i = 1 to n

 OLS identifies the parameter estimate that minimizes the sum of squared differences between each value in a sample and the parameter

[Figure: SS = Σd² plotted against candidate parameter values b; the minimum of the curve marks the OLS estimate]

Ordinary Least Squares (OLS)


 frequently used to estimate parameters of linear
models (e.g., linear regression, y=a+bx)
 unbiased and has minimum variance when
distributional assumptions are met (i.e., is a precise
estimator)

 no distributional assumptions required for point


estimates
 for interval estimation & hypothesis testing, OLS
estimators have restrictive assumptions of
normality and patterns of variance

Maximum Likelihood (ML)


 estimate of a parameter that maximizes the
likelihood function based on the observed data
 likelihood function estimates likelihood of observing
sample data for all possible values of the parameter


e.g., likelihood of observing the data for all possible values of the mean, μ, in a normally distributed population

 when assumptions of specified underlying distribution


are met, ML estimators are unbiased and have
minimum variance

Maximum Likelihood (ML)


 differences between a likelihood function and a
probability distribution
 in a probability distribution for a random variable, the
data are variable and the parameter fixed
 in a likelihood function, the sample data are fixed and
the parameter varies across all possible values


the maximum likelihood is the value of the


parameter that best fits the observed data

 constraints of ML estimators:


requires knowing sampling distribution underlying the statistic


(e.g., normal, multinomial, etc.)

great for large samples, biased estimates for small samples

General Likelihood function:

L(y; θ) = Π f(yi; θ), i = 1 to n

i.e., the joint probability distribution of the yi given θ (the probability distribution of y for possible values of θ)

Where:
L = likelihood
y = the frequency distribution of your data
θ = some parameter you want to estimate (e.g., the mean)
yi = variates of your sample
f(yi; θ) = the function describing the sampling distribution (e.g., the equation for the normal distribution)

Log-Likelihood function:

ln L = ln Π f(yi; θ) = Σ ln[f(yi; θ)], i = 1 to n

 we maximize the log-likelihood function rather than the likelihood function because:
 working with products is computationally difficult, and the probability distribution of L is poorly known
 the natural logarithm of L is easy to work with, and for large sample sizes it is approximately χ² distributed
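To make the log-likelihood concrete, here is a small Python sketch (made-up data, σ held fixed at 1 for simplicity): the μ that maximizes Σ ln f(yi; μ) for a normal density is the sample mean, matching the OLS estimate.

```python
import math

# Normal log-likelihood for the mean mu, with sigma held fixed.
y = [4.2, 5.1, 3.8, 6.0, 4.9]
sigma = 1.0

def log_likelihood(mu):
    # ln L = sum over i of ln f(y_i; mu) for the normal density
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (yi - mu) ** 2 / (2 * sigma ** 2) for yi in y)

candidates = [i / 100 for i in range(300, 701)]   # grid of mu values
mle = max(candidates, key=log_likelihood)
print(round(mle, 2))   # 4.8, the sample mean
```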

Maximum Likelihood vs. Ordinary Least Squares


 for most population parameters, ML and OLS estimators
are the same when normality assumptions of OLS are met
exception is the variance for which ML estimator is slightly biased
unless n is large

 in balanced linear models (e.g., regression and ANOVA)


for which normality assumptions hold, ML and OLS estimators are identical
 OLS cannot be used for estimation with other distributions (e.g., binomial and multinomial), so generalized linear modeling (e.g., logistic regression and log-linear models) and non-linear modeling are based on ML estimation

Introduction to hypothesis testing

Hypothesis testing
 main approach to statistical inference in biology
So far covered:
collecting data samples
estimating parameters from sample statistics
calculating SE and CI as estimates of the reliability of these statistics
Now, we want
some objective way to decide whether our sample
differs from some expected distribution, or from
other similarly collected samples

Common philosophy of all tests


Classical statistical hypothesis testing rests on disproving H0:
 must state a statistical null hypothesis (Ho)
includes all possibilities except the prediction of research hypothesis, HA

H0 is usually a hypothesis of no difference or effect


because
(1) inductive reasoning requires verification of all possible
observations (impossible) and
(2) we often don't know what constitutes proof of a hypothesis (alternative biological hypotheses are often more vague than H0)
in contrast, we know what constitutes disproof (falsification),
because null hypotheses are exact
disproof of H0 constitutes evidence for HA

Possible outcomes of tests of null hypothesis (H0)


            H0 accepted        H0 rejected
H0 true     correct decision   Type I error
H0 false    Type II error      correct decision

What level of Type I error is acceptable?


 P(Type I error) = α = the critical p-value
 the smaller the allowable Type I error, the more deviant an outcome has to be from the expected outcome in order to reject H0
 AKA the significance level, since it is the level at which we decide to accept or reject H0
 by convention, α = 0.05 (5%)
 this level is a convention and is arbitrary, not a law! it might be altered based on the context
 the 5% in the tails of the distribution comprises the rejection region; the range comprising 95% of the outcomes comprises the acceptance region

One vs. Two-tailed tests and rejection region


 when we reject H0 at a specified significance level, α, we say that the sample is significantly different from what we expect under the null hypothesis
 the 5% rejection region can be split between 2 tails
(2.5% in each one), or can be all in one tail
 these are called 2-tailed or 1-tailed tests, respectively
 decision about which to use is based on whether you have any a
priori knowledge or assumptions about possible alternatives to the
null hypothesis
 if expected outcomes could be either above or below the expectation under the null, then a two-tailed test is appropriate
 if outcomes are only likely and interesting in one direction, then
the test is 1-tailed

 advantage of 1-tailed tests: more powerful (easier to reject H0), with a lower probability of a Type II error

Power of a test
Power = probability that we correctly accept HA (i.e., reject a false H0)
 power = 1 − β
 once we specify α, then P(Type I error), P(Type II error), and power are set
 decreasing α increases β and thus decreases power
 increasing α decreases β and thus increases power
 why is a one-tailed test more powerful than a two-tailed test?

Power of a test
 with a constant α and a given s², power decreases as the location (mean) of the distribution under HA approaches that of H0
 power also decreases as spread (s²) increases
 thus, increase power by
 reducing spread (s²) (i.e., maximizing n)
 or testing for large effects (a big difference between means)

the t-test
review: the t statistic and the development of a confidence interval

t = (ȳ − μ) / (s / √n) = (ȳ − μ) / SEM

Rearrange to solve for μ for a confidence interval:

1.  t = (ȳ − μ) / (s / √n)

2.  t(s / √n) = ȳ − μ

Review

Solve for μ (using df), based on either:

1. the calculated t value
or
2. the desired confidence level (to determine the range of values that are likely to contain μ)

for a two-tailed test:

3.  μ = ȳ − t(s / √n)   and   μ = ȳ + t(s / √n)

P[ȳ − t(s / √n) ≤ μ ≤ ȳ + t(s / √n)]

95% Confidence Interval (2-tailed)

Review

P[ȳ − t(s / √n) ≤ μ ≤ ȳ + t(s / √n)]

Lovett et al. (2000), 38 df

Two-tailed probability:

df    .01     .02     .05     .10     .20
1     63.66   31.82   12.71   6.314   3.078
2     9.925   6.965   4.303   2.920   1.886
3     5.841   4.541   3.182   2.353   1.638
4     4.604   3.747   2.776   2.132   1.533
5     4.032   3.365   2.571   2.015   1.476
10    3.169   2.764   2.228   1.812   1.372
15    2.947   2.602   2.132   1.753   1.341
20    2.845   2.528   2.086   1.725   1.325
25    2.787   2.485   2.060   1.708   1.316
38    2.705   2.426   2.020   1.685   1.302

Sample mean = 61.92, SEM = 0.84, df = 38

ȳ − t(s / √n) = 61.92 − 2.02(0.84) = 60.22
ȳ + t(s / √n) = 61.92 + 2.02(0.84) = 63.62

95% CI: 60.22 ≤ μ ≤ 63.62
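The interval above can be reproduced in a couple of lines; this Python sketch just plugs in the slide's rounded numbers (mean 61.92, SEM 0.84, two-tailed t = 2.02 at 38 df).

```python
# 95% CI from summary statistics: mean +/- t * SEM
mean, sem, t_crit = 61.92, 0.84, 2.02   # values from the slide
lower = mean - t_crit * sem
upper = mean + t_crit * sem
print(round(lower, 2), round(upper, 2))   # 60.22 63.62
```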

there are 3 types of t-tests


 hypothesis tests based on the t distribution
 3 types of t-tests:
 one-sample t-test: the mean of a sample is different from a constant
 two-sample t-test: the means of two samples are different
 paired t-test: the mean difference between paired observations is different from a constant (usually 0)

Assumptions of t-test
 t-tests are parametric tests
 thus, the t statistic only follows t distribution if:
 variable has normal distribution (normality assumption)
 two groups have equal population variances
(homogeneity of variance assumption)
 observations are independent or specifically paired
(independence assumption)

One-sample t-test: testing a simple H0


 test of H0 that the population mean equals a particular value (H0: μ = μ0)
 e.g., population mean density of kelp after some impact (e.g., oil spill) is the same as before (H0: μ = x̄ before)
 the hypothesized mean (μ0) may come from the literature, other research, or legislation

t-statistic
general form of the t statistic:

t = (St − θ) / SE

where
 St is the sample statistic
 θ is the parameter value specified in H0
 SE is the standard error of the sample statistic

specific form for a population mean:

t = (ȳ − μ) / SEM = (ȳ − μ) / (s / √n)

where μ is the value of the mean specified in H0

sampling distribution of t
[Figure: sampling distribution of t; P(t) plotted over t < 0, t = 0, t > 0]

 different sampling distributions of t for different


sample sizes
 use degrees of freedom (df = n - 1)

 area under each sampling (probability) distribution


equals one
 we determine probabilities of obtaining particular
ranges of t when H0 is true

one-tailed tests

Two possible independent alternative hypotheses, so two possible tests:

1) H0: μ ≤ μ0   HA: μ > μ0   →   t = (ȳ − μ0) / SE > 0
2) H0: μ ≥ μ0   HA: μ < μ0   →   t = (ȳ − μ0) / SE < 0

1) only reject H0 for large positive values of t, i.e., when the sample mean is much greater than μ0
2) only reject H0 for large negative values of t, i.e., when the sample mean is much less than μ0

[Figure: the two one-tailed rejection regions, (1) in the upper tail and (2) in the lower tail]

two-tailed tests

Only one alternative hypothesis, so only one test possible:

H0: μ = μ0   HA: μ > μ0 or μ < μ0 (i.e., μ ≠ μ0)

t = (ȳ − μ0) / SE ≠ 0

 reject H0 for large positive or negative values of t, i.e., when the sample mean is much greater than or much less than μ0

α = 0.05, split as α/2 = 0.025 in each tail

[Figure: P(t) with rejection regions of α/2 = 0.025 in each tail, over t < 0, t = 0, t > 0]

One and two tailed t-values (df = 4)


One-tailed / two-tailed probability:

df    .005/.01   .01/.02   .025/.05   .05/.10   .10/.20
1     63.66      31.82     12.71      6.314     3.078
2     9.925      6.965     4.303      2.920     1.886
3     5.841      4.541     3.182      2.353     1.638
4     4.604      3.747     2.776      2.132     1.533
5     4.032      3.365     2.571      2.015     1.476
10    3.169      2.764     2.228      1.812     1.372
15    2.947      2.602     2.132      1.753     1.341
20    2.845      2.528     2.086      1.725     1.325
25    2.787      2.485     2.060      1.708     1.316
∞     2.575      2.326     1.960      1.645     1.282

[Figure: at df = 4, the two-tailed test places 95% between t = −2.78 and +2.78; the one-tailed tests place 95% below +2.132 or above −2.132]

Example: one-sample t-test


General question:
Are birth:death (B/D) ratios of human populations near the no-population-growth ratio of 1.25?
H0: B/D ratio = 1.25
HA: B/D ratio ≠ 1.25
 Are the B/D ratios for any of these groups ≠ 1.25?
 test using a one-sample t-test

Ourworld.syd

Example: one-sample t-test


Single population:
H0: μ = 1.25

t = (ȳ − 1.25) / SEM = (ȳ − 1.25) / s(ȳ) = (ȳ − 1.25) / (s / √n)

df = n − 1
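A Python sketch of this calculation, using the summary figures reported on the next slide for the Europe group (mean 1.25701, s 0.21295, n = 20); the variable names are illustrative, not part of the course software.

```python
import math

# One-sample t: distance of the sample mean from the H0 value,
# in units of the standard error of the mean.
mean, s, n, mu0 = 1.25701, 0.21295, 20, 1.25
sem = s / math.sqrt(n)
t = (mean - mu0) / sem
print(round(t, 3))   # ~0.147, matching the reported t of 0.14727
```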

Example: one-sample t-test: Results


Hypothesis Testing: One-sample t-test

Results for GROUP$ = Europe

One-sample t-test of B_TO_D with 20 cases
H0: Mean = 1.25000 vs Alternative = 'not equal'
Mean                       : 1.25701
95.00% Confidence Interval : 1.15735 to 1.35668
Standard Deviation         : 0.21295
t                          : 0.14727
df                         : 19
p-value                    : 0.88447

[Figure: 1. box plot, 2. normal approximation, 3. dot plot of the Europe B/D ratios]

Example: one-sample t-test: Results


Results for GROUP$ = Islamic

One-sample t-test of B_TO_D with 16 cases
H0: Mean = 1.25000 vs Alternative = 'not equal'
Mean                       : 3.47825
95.00% Confidence Interval : 2.84977 to 4.10672
Standard Deviation         : 1.17943
t                          : 7.55705
df                         : 15
p-value                    : 0.00000

Results for GROUP$ = NewWorld

One-sample t-test of B_TO_D with 21 cases
H0: Mean = 1.25000 vs Alternative = 'not equal'
Mean                       : 3.95091
95.00% Confidence Interval : 3.26380 to 4.63802
Standard Deviation         : 1.50949
t                          : 8.19954
df                         : 20
p-value                    : 0.00000

Example: one-sample t-test:


a way to present the results

[Figure: bar chart of births/deaths (±95% CI) for the Europe, Islamic, and New World groups, y-axis 0 to 8, with a reference line at H0: μ = 1.25]

Two-sample t-test
 used to compare two populations, each of which has
been sampled
 the simplest form of tests comparing populations
 example: does the average annual income differ for
males and females?
H0: μ1 = μ2; mean income (males) = mean income (females)

Survey2.syd

Calculation of t for two-sample t-test


H0: μ1 = μ2, i.e., μ1 − μ2 = 0
 independent observations

t = (ȳ1 − ȳ2 − (μ1 − μ2)) / s(ȳ1 − ȳ2) = (ȳ1 − ȳ2) / s(ȳ1 − ȳ2)

where s(ȳ1 − ȳ2) = sp √(1/n1 + 1/n2), sp = the pooled standard deviation (more later), and

df = (n1 − 1) + (n2 − 1) = n1 + n2 − 2
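A Python sketch of the pooled two-sample t with made-up data (nothing here comes from the course datasets):

```python
import math

# Pooled two-sample t: pooled variance from the two sums of squares,
# df = n1 + n2 - 2.
y1 = [12.0, 15.0, 14.0, 10.0, 13.0]
y2 = [9.0, 11.0, 8.0, 12.0, 10.0]

def mean(v):
    return sum(v) / len(v)

def ss(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v)

n1, n2 = len(y1), len(y2)
sp2 = (ss(y1) + ss(y2)) / ((n1 - 1) + (n2 - 1))   # pooled variance
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))           # SE of the difference
t = (mean(y1) - mean(y2)) / se
df = n1 + n2 - 2
print(round(t, 3), df)   # 2.514 8
```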

Logic of two-sample t-test


Assume H0: μ1 = μ2, and HA: μ1 > μ2

1) if H0 is true, then the null distribution of t is known (for a set df)

2) if HA is true, we don't know the distribution of t, but we do know that it is not the null distribution

[Figure: probability of t under a true H0 (curve centered on 0), y-axis 0.0 to 0.4, x-axis −5 to 5]

t = (ȳ1 − ȳ2) / (sp √(1/n1 + 1/n2))

If H0 is true: μ1 = μ2 (given 4 df)

t(0.05, 4 df) = 2.14

Any t > 2.14 will lead to incorrect rejection of H0

1. this implies that the difference between ȳ1 and ȳ2 is > 2.14 standard errors (pooled)
2. this will happen 5% of the time

[Figure: the null t distribution with the rejection region beyond t = 2.14]

If H0 is false: μ1 > μ2 (given 4 df)

t(0.05, 4 df) = 2.14

Any t < 2.14 will lead to incorrect rejection of HA (i.e., incorrect acceptance of H0)

1. this means that the difference between ȳ1 and ȳ2 is < 2.14 standard errors (pooled)
2. the probability that this will happen depends on α, n, and the true difference between μ1 and μ2

[Figure: the distribution under a true HA overlapping the rejection threshold t = 2.14]

Results of example
Two-sample t-test on INCOME grouped by SEX$ vs Alternative = 'not equal'

GROUP    N     Mean       Standard Deviation
Female   152   20.25658   14.82771
Male     104   24.97115   16.41776

The separate variance t-test is based on the Satterthwaite adjustment (of degrees of freedom); it is not recommended unless the variance terms are very different and the sample sizes (n) are very different.

Separate Variance
Difference in Means        : -4.71457
95.00% Confidence Interval : -8.67643 to -0.75272
t                          : -2.34611
df                         : 206.23313
p-value                    : 0.01992

Pooled Variance
Difference in Means        : -4.71457
95.00% Confidence Interval : -8.59712 to -0.83203
t                          : -2.39138
df                         : 254.00000
p-value                    : 0.01751

What is the conclusion?

meaning of the P value, review


 probability of obtaining our sample data if H0 is true, i.e., P(data | H0)
 NOT the probability that H0 is true!
 strictly, it is the long-run probability (from repeated sampling) of obtaining the sample result if H0 is true

Graphical results of example


[Figure: left, back-to-back histograms of INCOME by SEX with counts for females and males; right, mean annual income (×1000) ± 95% CI for females (n = 152) and males (n = 104)]

 which graph would you present in a


talk or paper?
 which tells you that the assumptions
of this analysis may have been
violated?

Paired t-test: the logic


1. often we want to compare observations that can be
considered paired within a subject (replicate)
for example:
i. comparison of activity level before and after eating in the same
individual
ii. comparison of longevity of males vs females, where county is
the replicate

2. in such cases, there is often benefit in accounting for


variance that could be caused by differences among
subjects (= replicates)
3. and it is wrong to consider the observations on the same subject as true replicates, because they are not independent

Paired t-test uses paired observations


 the null hypothesis is no difference between the paired
observations

H0: μd = 0

t = d̄ / s(d̄) = d̄ / (sd / √nd)

where
d̄ = the mean difference between paired observations
sd = the standard deviation of the differences
df = nd − 1, where nd is the number of pairs
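A sketch of the paired calculation with made-up before/after data; note that it reduces to a one-sample t on the differences.

```python
import math

# Paired t: difference within each pair, then a one-sample t on the
# differences against 0.
before = [10.2, 9.8, 11.5, 10.9, 9.4]
after = [11.0, 10.1, 12.2, 11.4, 10.0]
d = [a - b for a, b in zip(after, before)]
n = len(d)                                           # number of pairs
dbar = sum(d) / n                                    # mean difference
sd = math.sqrt(sum((x - dbar) ** 2 for x in d) / (n - 1))
t = dbar / (sd / math.sqrt(n))
print(round(t, 2), "df =", n - 1)   # 6.74 df = 4
```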

Paired t-test: example


 the sea star Pisaster comes in two colors along the west coast: purple and orange
 H0: density of purple per site = density of orange
 individual reefs are the replicates of interest
 looks like a no-brainer

Sea star colors all sites two sample.syd

Paired t-test: example, results if incorrectly treated as a 2-sample t-test

GROUP    N   Mean        Standard Deviation
Orange   7   144.71429   101.75086
Purple   7   457.28571   353.47829

Separate Variance
Difference in Means        : -312.57143
95.00% Confidence Interval : -641.43752 to 16.29466
t                          : -2.24827
df                         : 6.98755
p-value                    : 0.05942

Pooled Variance
Difference in Means        : -312.57143
95.00% Confidence Interval : -615.48591 to -9.65695
t                          : -2.24827
df                         : 12.00000
p-value                    : 0.04413   (barely significant; WHY?)

equal variances?

[Figure: mean density (±95% CI) of orange vs purple sea stars, y-axis 0 to 1200]

Consider site-to-site variability


(remember, sites are replicates)
 given that the observations are paired at the level of
site, can we account for variation among sites?

Paired t-test: details of calculation


Sea star colors all sites.syd

Paired samples t-test on PURPLE vs ORANGE with 7 cases
Alternative = 'not equal'
Mean PURPLE                      : 457.28571
Mean ORANGE                      : 144.71429
Mean Difference                  : 312.57143
95.00% Confidence Interval       : 74.58766 to 550.55520
Standard Deviation of Difference : 257.32266
t                                : 3.21381
df                               : 6
p-value                          : 0.01828   (smaller than before; WHY?)

Note the slopes: are they the same? perhaps log transform

Paired t-test: details of calculation: use of log-transformed data

Note the slopes are much more similar

Paired samples t-test on LPURPLE vs LORANGE with 7 cases
Alternative = 'not equal'
Mean LPURPLE                     : 2.48624
Mean LORANGE                     : 1.99536
Mean Difference                  : 0.49088
95.00% Confidence Interval       : 0.37685 to 0.60492
Standard Deviation of Difference : 0.12330
t                                : 10.53299
df                               : 6
p-value                          : 0.00004   (smaller than before; WHY?)

indicates that purples are more common
 by a constant ratio
 rather than by a constant amount

Review of calculations of t for the 3 kinds of t-test

 One-sample test:   t = (ȳ − μ) / (s / √n)

 Two-sample test:   t = (ȳ1 − ȳ2) / (sp √(1/n1 + 1/n2))

 Paired test:   t = d̄ / (sd / √nd)

Standard Error used for calculating t

 One-sample test:   SE = s / √n

 Two-sample test (calculation based on the pooled variance term):   SE = sp √(1/n1 + 1/n2)

 Paired test:   SE = sd / √nd

Variance (s²)

s² = SS / (n − 1)

sp² = (SS1 + SS2) / ((n1 − 1) + (n2 − 1))

or   SE² = [((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)] × (1/n1 + 1/n2)

sd² = SSd / (nd − 1)

Presenting results of t-tests in scientific writing


 Methods:
 A two-sample t-test was used to compare the
mean number of eggs per capsule from the two
zones. Assumptions were checked with.

 Results:
 The mean number of eggs per capsule from the
mussel zone was significantly greater than that
from the littorinid zone (t = 5.39, df = 77, P <
0.001; Fig. 2).

Evaluating Assumptions of the t-test: Normality


 Assumption: data in each group are normally distributed
 Checks:
 Frequency distributions (be careful)
 Boxplots
 Probability plots
 formal tests for normality (too powerful, not powerful
enough?)

 Solutions:
 transformations
 don't worry: run it anyway, with a disclaimer
 if there is another appropriate test where assumptions
are met, use it instead, but often violations make little
difference in reported P-value

Evaluating Assumptions of t-test: Homogeneity of Variance
 Assumption: population variances are equal in the 2 groups
 Checks:
 subjective comparison of sample variances
 boxplots
 F-ratio test of H0: σ1² = σ2²

 Solutions
 transformations
 run it anyway (same comments as for the normality assumption)

Evaluating Assumptions of t-test: Homogeneity of Variance

the F-statistic (AKA F-ratio)
 H0: σ1² = σ2²
 F-statistic = ratio of the 2 sample variances: F = s1² / s2²
 reject H0 if F is significantly less than or greater than 1

 if H0 is true, F-ratio follows F distribution


 follows usual logic of a statistical test
 will this test be too powerful or not powerful
enough?
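The F-ratio check can be sketched in a few lines of Python (made-up data; a formal test would compare F to the F distribution with n1 − 1 and n2 − 1 df):

```python
from statistics import variance   # sample variance, (n - 1) denominator

g1 = [12.0, 15.0, 14.0, 10.0, 13.0]
g2 = [9.0, 11.0, 8.0, 12.0, 10.0]
F = variance(g1) / variance(g2)   # ratio of the two sample variances
print(round(F, 2))   # 1.48; values far from 1 cast doubt on H0
```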

Evaluating Assumptions: boxplot & histogram

[Figure: histogram of limpet numbers per quadrat (0 to 90) with counts up to 70, used alongside a boxplot to evaluate assumptions]

Evaluating Assumptions: boxplots


1. IDEAL
2. SKEWED
3. OUTLIERS (shown as *)
4. UNEQUAL VARIANCES

[Figure: four example boxplots illustrating each case]

Evaluating Assumptions: transformations to mitigate departures from normality and homoscedasticity

[Figure: normal probability plots (pplots) and boxplots of the raw and log-transformed data]

Variance:

Group            Pop_1990 (raw)   Lpop1990 (log)
Europe           441              0.17
Islamic          1378             0.30
Newworld         1042             0.34
Greatest ratio   3.12:1           2:1

Ourworld.syd

Nonparametric Tests
 these tests don't assume a particular underlying distribution of the data
 normal distributions are not necessary

 usually based on ranks of the data


 H0: samples come from populations with identical
distributions
 equal means or medians

 equal variances and independence still required


 typically less powerful than parametric tests

Which type of test to use?


 use the test that is more efficient (i.e., has the
greatest power given the sample size (n); results in less
cost and effort)

 if assumptions of parametric tests are met, they


are always more efficient
 parametric tests can handle more complex experimental designs; there may be no nonparametric equivalent

Parametric tests are usually better


 if assumptions not met, then explore the data
 try transformations to normalize the data or equate
variances
 if normality assumption violated, still hard to
recommend NP tests unless distributions are very
weird, transformations do not help, or outliers are
present

 do a parametric test based on the ranks!


 use a robust parametric test that does not assume
equal variances (e.g., separate variance t-test)
 do a randomization test of your data in
conjunction with a parametric test

Mann-Whitney U / Wilcoxon test


 a nonparametric 2-sample t-test
 calculates sum of ranks in 2 samples
 should be similar if H0 is true

 compares rank sum to sampling distribution of


rank sums
 i.e., the distribution of rank sums when H0 true

 equivalent to t-test on data transformed to ranks

Mann-Whitney U / Wilcoxon test


 DATA: consist of 2 random samples
 ASSUMPTIONS
 both samples are random samples from respective
populations
 independent samples
 measurement scale is at least ordinal
 if there is a difference between sample distributions,
that difference is one of location (i.e., the variances are
equal)
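The rank-sum idea can be sketched directly (made-up data with no ties; a full implementation would assign midranks to tied values, as scipy.stats.mannwhitneyu does):

```python
# Pool the two samples, rank them, and compute the rank sum and
# U statistic for sample 1.
g1 = [3.1, 4.5, 2.8, 5.0]
g2 = [6.2, 7.1, 5.9, 6.8]
pooled = sorted(g1 + g2)
rank = {v: i + 1 for i, v in enumerate(pooled)}   # ranks 1..8 (no ties)
R1 = sum(rank[v] for v in g1)                     # rank sum of sample 1
n1 = len(g1)
U1 = R1 - n1 * (n1 + 1) / 2                       # U statistic for sample 1
print(R1, U1)   # 10 0.0; sample 1 holds the four lowest ranks
```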

Robust parametric tests


 these tests are more reliable than the traditional tests when variances or sample sizes are very unequal, but they still require normality
 e.g., Satterthwaite's adjusted t-test for unequal variances (= separate variances t-test)
 the common version recalculates the df for the test to make it more conservative (a lower df, which may no longer be an integer)

df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)² / (n1 + 1) + (s2²/n2)² / (n2 + 1) ] − 2
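A sketch of that adjustment in Python, plugged into the income example's summary figures (s ≈ 14.83, n = 152; s ≈ 16.42, n = 104). This implements the (n + 1) / −2 form shown above; statistics packages often use a slightly different (n − 1) variant, which is why the result differs a little from the 206.2 df reported earlier.

```python
def satterthwaite_df(s1, n1, s2, n2):
    # Adjusted (non-integer) df for the separate-variance t-test,
    # using the (n + 1) / -2 form from the slide.
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 + 1) + v2 ** 2 / (n2 + 1)) - 2

df = satterthwaite_df(14.83, 152, 16.42, 104)
print(round(df, 1))   # 208.0, close to the software's 206.2
```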

Randomization (permutation) tests


 reshuffling the data many times to generate the sampling
distribution of a statistic directly
 principle: if H0 is true, then any random arrangement of
observations to groups is equally likely
 calculate difference between the averages of the groups (D0)
 randomly reassign the observations so that there are n1 in group 1 and
n2 in group 2
 calculate D1
 repeat this procedure ~1000 times, each time calculating Di
 calculate the proportion of all Di values that are ≥ D0; this is the p-value that can be compared to α to decide whether to accept or reject H0
Given the power of computers, this procedure is starting to replace nonparametric testing when distributional assumptions are violated, distributions are unknown, or random sampling is not possible
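The procedure above translates almost line-for-line into Python (made-up data; ~1000 reshuffles, two-tailed by using absolute differences):

```python
import random

random.seed(1)                       # reproducible reshuffles
g1 = [12.0, 15.0, 14.0, 10.0, 13.0, 16.0]
g2 = [9.0, 11.0, 8.0, 12.0, 10.0, 9.5]

def mean_diff(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

d0 = mean_diff(g1, g2)               # observed difference D0
pooled = g1 + g2
n1 = len(g1)
count = 0
n_iter = 1000
for _ in range(n_iter):
    random.shuffle(pooled)           # random reassignment to groups
    if abs(mean_diff(pooled[:n1], pooled[n1:])) >= abs(d0):
        count += 1
p = count / n_iter                   # proportion of |Di| >= |D0|
print(round(d0, 2), p)               # compare p to alpha
```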
