
# STAT 141

## Confidence Intervals and Hypothesis Testing

10/26/04

Today (Chapter 7): CI with unknown $\sigma$ and the t-distribution; CI for proportions; two-sample CI with known or unknown $\sigma$; hypothesis testing and the z-test.

## Confidence Intervals with unknown $\sigma$

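Last time: confidence interval when $\sigma$ is known. A level $C$, or $100(1-\alpha)\%$, confidence interval for $\mu$ is

$$\left[\bar X - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar X + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right].$$

But to return to reality, we don't know $\sigma$. Thus we must estimate the standard deviation of $\bar X$ with

$$SE_{\bar X} = \frac{s}{\sqrt{n}}.$$

But $s$ is just a function of our $X_i$'s and thus is a random variable too: it has a sampling distribution of its own. Before, when we knew $\sigma$, we could say

$$P\left(-z_{\alpha/2} < \frac{\bar X - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha,$$

which after algebra gave the confidence interval. [Remember that for any $s$, $z_s$ is defined so that $1-2s$ of the area falls in $(-z_s, z_s)$. So $z_s$ = `qnorm(1-s)` = `-qnorm(s)` is the $1-s$ quantile, i.e. $z_s$ is the positive side.] Now we want a similar setup, so that

$$P\left(?? < \frac{\bar X - \mu}{SE_{\bar X}} < ??\right) = 1 - \alpha.$$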
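The definition of $z_s$ can be checked numerically. The following is a minimal sketch; the tail probability `s = 0.025` is an illustrative choice, not a value fixed by the notes.

```r
# z_s is the upper-s quantile of N(0,1): area 1 - 2s lies in (-z_s, z_s).
s   <- 0.025                      # illustrative tail probability
z_s <- qnorm(1 - s)

# The two expressions for z_s agree by symmetry of the normal density
c(upper = z_s, mirrored = -qnorm(s))

# Central area between -z_s and z_s; should equal 1 - 2 * s
central_area <- pnorm(z_s) - pnorm(-z_s)
central_area
```

With `s` set to $\alpha/2$, `z_s` is the multiplier used in the known-$\sigma$ interval.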

We need to know the probability distribution of $T = \frac{\bar X - \mu}{SE_{\bar X}}$. $T$ has the Student's t-distribution with $n-1$ degrees of freedom. We write this as $T \sim t_{n-1}$. The degrees of freedom, $\nu$, is the only parameter of this distribution. [The book uses $t_s$ for $T$.]
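A small simulation can make the need for the t-distribution concrete. This is only a sketch under assumed settings: the sample size `n = 5`, the number of replications and the seed are arbitrary choices made for illustration.

```r
set.seed(141)     # arbitrary seed, for reproducibility only
n    <- 5         # small sample size (illustrative choice)
reps <- 20000     # number of simulated samples
mu   <- 0         # true mean of the simulated population

# One T = (xbar - mu) / (s / sqrt(n)) per simulated normal sample
t_stat <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = 1)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})

# Share of samples whose T falls inside the 95% cut-offs
cover_z <- mean(abs(t_stat) < qnorm(0.975))            # normal quantile
cover_t <- mean(abs(t_stat) < qt(0.975, df = n - 1))   # t quantile
c(normal = cover_z, t = cover_t)
```

The normal cut-off covers noticeably less than 95% of the simulated $T$ values at this sample size, while the $t_{n-1}$ cut-off comes out close to 95%.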

[Figure: 2×2 grid of panels titled "t-dist w/ df=1", "t-dist w/ df=5", "t-dist w/ df=10" and "t-dist w/ df=50", each overlaying the t density on the N(0,1) density; x-axis from -6 to 6, y-axis from 0.0 to 0.4.]

R code:

    > par(mfrow=c(2,2)) # tdist1.pdf
    > plot(seq(-6,6,length=10000), dnorm(seq(-6,6,length=10000)),
    +      type="l", lty=3, ylab="", xlab="", main="t-dist w/ df=1")
    > lines(seq(-6,6,length=10000), dt(seq(-6,6,length=10000), df=1))
    > legend(x=2, y=.4, lty=c(1,3), legend=c("t-dist, df=1","N(0,1)"))
    ...

Thus the t-distribution approaches the normal as $\nu$ increases, but for small $n$ it gives wider intervals.

Why "degrees of freedom"? Let $y_i = x_i - \bar x$. We have

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} y_i^2 \quad\text{and}\quad \sum_{i=1}^{n} y_i = 0. \qquad (*)$$

Now $(*)$ $\Leftrightarrow$ 1 constraint on $n$ numbers, hence the phrase "$n-1$ degrees of freedom".

Now that we know the distribution, we can find the ?? from above: these are just the $\alpha/2$ and $1-\alpha/2$ quantiles of the t-distribution. Let $t_{n-1,s}$ be defined similarly to $z_s$; it is equal to `qt(1-s, df=n-1)` = `-qt(s, df=n-1)`. We then have

$$P\left(-t_{n-1,\alpha/2} < \frac{\bar X - \mu}{SE_{\bar X}} < t_{n-1,\alpha/2}\right) = 1 - \alpha.$$
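The single constraint on the deviations and the behaviour of the t quantiles can both be checked in a few lines. A minimal sketch; the vector `x` holds made-up illustrative numbers, and the list of degrees of freedom is an arbitrary choice.

```r
# Deviations from the mean always sum to zero: one constraint on n numbers
x <- c(1.2, 2.4, 1.3, 1.3, 0.0)     # illustrative values only
y <- x - mean(x)
sum(y)                               # zero up to rounding error
s2_by_hand <- sum(y^2) / (length(x) - 1)
all.equal(s2_by_hand, var(x))        # var() uses the same n - 1 divisor

# Upper 2.5% quantiles of the t-distribution shrink towards the normal one
dfs <- c(1, 5, 10, 50, 100)
t_q <- qt(0.975, df = dfs)
z_q <- qnorm(0.975)
rbind(df = dfs, t = round(t_q, 3), z = round(z_q, 3))
```

Every t quantile in the table exceeds the normal quantile, which is why the t-based interval is wider for small $n$.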

This gives us a confidence interval like before, only we use the quantiles of the t-distribution rather than the normal distribution.

Example. Taken from the original paper on the t-test by W.S. Gosset, 1908. [Gosset was employed by Guinness Breweries, Dublin, as a chemist turned statistician. Guinness, fearing the results to be of commercial importance, forbade Gosset to publish under his own name. He chose the pseudonym "Student" out of modesty.]

Two drugs to induce sleep: A = dextro, B = laevo. Each of ten patients receives both drugs (presumably in random order). Issue: is drug B better than drug A? Student's sleep data:

    data(sleep)
       extra group
    1    0.7     1
    2   -1.6     1
    3   -0.2     1
    4   -1.2     1
    5   -0.1     1
    6    3.4     1
    7    3.7     1
    8    0.8     1
    9    0.0     1
    10   2.0     1
    11   1.9     2
    12   0.8     2
    13   1.1     2
    14   0.1     2
    15  -0.1     2
    16   4.4     2
    17   5.5     2
    18   1.6     2
    19   4.6     2
    20   3.4     2
    extra1=sleep[sleep[,2]==1,]
    extra2=sleep[sleep[,2]==2,]
    extradiff=extra2[,1]-extra1[,1]

    > extradiff
     [1] 1.2 2.4 1.3 1.3 0.0 1.0 1.8 0.8 4.6 1.4
    > mean(extradiff)
    [1] 1.58
    > sqrt(var(extradiff))
    [1] 1.229995
    > sqrt(var(extradiff)/10)
    [1] 0.3889587
    > 1.58/0.38896
    [1] 4.062114
    > qt(.975,9)
    [1] 2.262157
    > qt(.995,9)
    [1] 3.249836
    > qnorm(0.975)
    [1] 1.959964
    > qnorm(0.995)
    [1] 2.575829

A level $C$ confidence interval with $\sigma$ unknown is exact if $X \sim$ Normal, and otherwise approximately correct for large $n$. The margin of error $M$ in $E \pm M$ (estimate $\pm$ margin of error) is

$$M = t_{n-1,\alpha/2}\,\frac{s}{\sqrt{n}} = t_{n-1,\alpha/2}\,SE_{\bar X}.$$

Remark: the large value 4.6 is a possible outlier, so there is some doubt about the normal assumption here.

What's different? Since we don't know $\sigma$, we pay a penalty with a (slightly) wider interval (e.g. $t = 2.262$ vs. $z = 1.96$ at the 95% confidence level, i.e. $\alpha = 5\%$). For large sample sizes we can just use the normal distribution quantiles $z_{\alpha/2}$, since the t-distribution quickly looks like the normal distribution.
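The one-sample t interval for the sleep differences can be assembled directly from the margin-of-error formula and compared with R's own `t.test`. A short sketch using the built-in `sleep` data; the variable names `ci_hand` and `ci_r` are chosen here for illustration.

```r
data(sleep)   # built-in data set with columns extra and group

# Paired differences: drug 2 minus drug 1 for each of the ten patients
extradiff <- sleep$extra[sleep$group == "2"] - sleep$extra[sleep$group == "1"]

n  <- length(extradiff)
se <- sd(extradiff) / sqrt(n)          # SE of the mean difference
M  <- qt(0.975, df = n - 1) * se       # margin of error at the 95% level

ci_hand <- mean(extradiff) + c(-1, 1) * M
ci_r    <- t.test(extradiff, conf.level = 0.95)$conf.int

rbind(by_hand = ci_hand, t_test = as.numeric(ci_r))
```

Both rows should agree, since `t.test` on a single vector applies the same $t_{n-1}$ quantile to $s/\sqrt{n}$.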

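## Proportions

We saw last time that $\hat p$ is approximately distributed as $N\left(p, \frac{p(1-p)}{n}\right)$. If we want a confidence interval for $p$ we can use this normality to get an approximate confidence interval, with margin of error

$$M = z_{\alpha/2}\,SE_{\hat p} = z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}.$$

The book offers a correction to this using

$$\tilde p = \frac{y + 0.5\,z_{\alpha/2}^2}{n + z_{\alpha/2}^2} \quad\text{and}\quad SE_{\tilde p} = \sqrt{\frac{\tilde p(1-\tilde p)}{n + z_{\alpha/2}^2}},$$

where $y$ is the observed number of successes, so that $\hat p = y/n$.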
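Both proportion intervals fit in one small function. This is a sketch only: `prop_ci` is a hypothetical helper written for these notes, not a base R function, and the counts `y = 7`, `n = 20` are made-up illustrative data.

```r
# prop_ci() is a hypothetical helper, not part of base R.
# y = number of successes, n = sample size.
prop_ci <- function(y, n, conf = 0.95, corrected = FALSE) {
  z <- qnorm(1 - (1 - conf) / 2)
  if (corrected) {
    n_eff <- n + z^2                  # adjusted sample size
    p     <- (y + 0.5 * z^2) / n_eff  # adjusted estimate
  } else {
    n_eff <- n
    p     <- y / n                    # usual sample proportion
  }
  se <- sqrt(p * (1 - p) / n_eff)
  c(estimate = p, lower = p - z * se, upper = p + z * se)
}

plain <- prop_ci(7, 20)                    # made-up counts
adj   <- prop_ci(7, 20, corrected = TRUE)
rbind(plain, adj)
```

The corrected estimate is pulled from $y/n$ towards 0.5, which matters most when $n$ is small or $\hat p$ is near 0 or 1.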

## Two-samples

One of the most common statistical procedures. Is there a difference? Is it real? Because of the preparatory work with one-sample problems, this should seem rather familiar, a case of déjà vu, but with slightly more complex formulas.

What do we mean by two samples?

- Two groups
- Distinct populations [treatment/control, ..., male/female, ...]
- Grouping variable: categorical variable with 2 levels
- Data is independent between groups

Example (Dalgaard p. 87), energy expenditure: two groups of women, lean and obese. Twenty-four-hour energy expenditure in MJ.

    library(ISwR)  # the energy data set ships with Dalgaard's ISwR package
    data(energy)
    lean <- energy[energy$stature=="lean",1]
    obese <- energy[energy$stature=="obese",1]
    obese
    [1]  9.21 11.51 12.79 11.85  9.97  8.79  9.69  9.68  9.19
    lean
     [1]  7.53  7.48  8.08  8.09 10.15  8.40 10.88  6.13  7.90  7.05  7.48  7.58  8.11
    plot(expend~stature,data=energy)

Beware: some data sets that may look like two-sample problems are really better treated as paired data. Example: the sleep drugs data from above, with 10 patients and drugs A and B. Since each patient received both A and B, the samples are not really independent (there is a common component of variation due to the patient), so it is better to look at differences. It becomes a one-sample problem. (We will discuss more about pairing/blocking later.)

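Notation:

| Population | Variable | Mean | SD |
|---|---|---|---|
| Population 1 | $X_1$ | $\mu_1$ | $\sigma_1$ |
| Population 2 | $X_2$ | $\mu_2$ | $\sigma_2$ |

SRS from each population:

| Sample | Sample size | Sample mean | Sample SD |
|---|---|---|---|
| Sample 1 | $n_1$ | $\bar X_1$ | $s_1$ |
| Sample 2 | $n_2$ | $\bar X_2$ | $s_2$ |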

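## Distribution of $\bar X_1 - \bar X_2$

Sample mean difference: $\bar X_1 - \bar X_2$. All depends on the variability and distribution of this difference!

Recall in general that if $E(V) = \mu_V$ and $E(W) = \mu_W$ then $E(V - W) = \mu_V - \mu_W$, and if $V$ and $W$ are independent then $\mathrm{var}(V - W) = \mathrm{var}(V) + \mathrm{var}(W)$. So if $\bar X_1 \sim \left(\mu_1, \frac{\sigma_1^2}{n_1}\right)$ and $\bar X_2 \sim \left(\mu_2, \frac{\sigma_2^2}{n_2}\right)$, we will have

$$\mu_{\bar X_1 - \bar X_2} = E(\bar X_1 - \bar X_2) = \mu_1 - \mu_2,$$

and for independent random variables $\bar X_1$ and $\bar X_2$:

$$\sigma^2_{\bar X_1 - \bar X_2} = \sigma^2_{\bar X_1} + \sigma^2_{\bar X_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$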

We need estimates for $\mu_1 - \mu_2$ and $\sigma^2_{\bar X_1 - \bar X_2}$. Clearly $\bar X_1 - \bar X_2$ is an estimate for $\mu_1 - \mu_2$. Once we have an estimate for $\sigma^2_{\bar X_1 - \bar X_2}$, we can use a similar method as for the one-sample case to get a confidence interval.

1. Unequal variances: if $\sigma_1^2 \ne \sigma_2^2$ then use

   $$SE^2_{\bar X_1 - \bar X_2} = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}.$$

2. Equal variances: if $\sigma_1^2 = \sigma_2^2 = \sigma^2$ is unknown but assumed to be equal, we can use a pooled estimate of variance:

   $$s^2_{pooled} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},$$

   i.e. an average with weights equal to the respective degrees of freedom. Then our estimate of $\sigma^2_{\bar X_1 - \bar X_2}$ is

   $$SE^2_{pooled} = s^2_{pooled}\left(\frac{1}{n_1} + \frac{1}{n_2}\right).$$
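The two standard-error estimates can be computed side by side for the energy data. A minimal sketch: the vectors below are typed in from the values printed in the notes, so the ISwR package is not needed, and the variable names are chosen here for illustration.

```r
# Energy expenditure values as printed in the notes (MJ per 24 hours)
obese <- c(9.21, 11.51, 12.79, 11.85, 9.97, 8.79, 9.69, 9.68, 9.19)
lean  <- c(7.53, 7.48, 8.08, 8.09, 10.15, 8.40, 10.88, 6.13, 7.90,
           7.05, 7.48, 7.58, 8.11)
n1 <- length(obese)
n2 <- length(lean)

# 1. Unequal variances: add the two squared standard errors
se_unequal <- sqrt(var(obese) / n1 + var(lean) / n2)

# 2. Equal variances: pool the sample variances, weighted by their df
s2_pooled <- ((n1 - 1) * var(obese) + (n2 - 1) * var(lean)) / (n1 + n2 - 2)
se_pooled <- sqrt(s2_pooled * (1 / n1 + 1 / n2))

c(unequal = se_unequal, pooled = se_pooled)
```

For these data the two estimates are close, in line with the remark that pooling changes little when the sample SDs are similar.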

Good method if the two SDs are close, but if the sample sizes are also moderate to large, there won't be much difference from the unequal variances method (below). If the two SDs are different, it is better to use the unequal variances method. We will use this pooled estimate again when we study Analysis of Variance.

As above, we need the distribution of

$$T = \frac{(\bar X_1 - \bar X_2) - (\mu_1 - \mu_2)}{\text{SE of } \bar X_1 - \bar X_2}.$$

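If $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$ then:

Equal variances: if we have equal variances in the two populations, then the SE of $\bar X_1 - \bar X_2$ is $SE_{pooled}$ and $T \sim t_\nu$ with $\nu = n_1 + n_2 - 2$.

Unequal variances: then the SE of $\bar X_1 - \bar X_2$ is $SE_{\bar X_1 - \bar X_2}$ and $T$ is approximately distributed as $t_\nu$. We use one of two values for $\nu$:

1. $\nu = \min(n_1 - 1, n_2 - 1)$

2. $$\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}$$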
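Welch's degrees of freedom are easy to compute directly and to compare with the two simpler choices. A sketch under the assumption that the energy values printed in the notes are typed in by hand; `welch_df` is a hypothetical helper name, not a base R function.

```r
# Energy expenditure values as printed in the notes (MJ per 24 hours)
obese <- c(9.21, 11.51, 12.79, 11.85, 9.97, 8.79, 9.69, 9.68, 9.19)
lean  <- c(7.53, 7.48, 8.08, 8.09, 10.15, 8.40, 10.88, 6.13, 7.90,
           7.05, 7.48, 7.58, 8.11)

# welch_df() is a hypothetical helper implementing the fractional-df formula
welch_df <- function(x1, x2) {
  v1 <- var(x1) / length(x1)    # s1^2 / n1
  v2 <- var(x2) / length(x2)    # s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (length(x1) - 1) + v2^2 / (length(x2) - 1))
}

nu_conservative <- min(length(obese) - 1, length(lean) - 1)
nu_welch        <- welch_df(obese, lean)
nu_pooled       <- length(obese) + length(lean) - 2

c(conservative = nu_conservative, welch = nu_welch, pooled = nu_pooled)
```

The Welch value lies between the conservative and the pooled degrees of freedom, and it is the `df` that `t.test(obese, lean)` reports by default.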

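This is known as Welch's formula, which gives fractional degrees of freedom. It is the more accurate formula (generally used by packages, and only on computers!). Can use either approximation, but say which! Note that one can generally not go too far wrong, since one can show by algebra that

$$\min(n_1 - 1, n_2 - 1) \le \nu \le n_1 + n_2 - 2.$$

Summary: two-sample confidence intervals for $\mu_1 - \mu_2$ at the $100(1-\alpha)\%$ level, $E \pm M$:

- $\sigma$ known: $M = z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
- $\sigma$ unknown, large samples: $M = z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
- $\sigma$ unknown, equal: $M = t_{\alpha/2,\nu}\, s_{pooled}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, with $\nu = n_1 + n_2 - 2$
- $\sigma$ unknown, unequal: $M = t_{\alpha/2,\nu}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$, with $\nu = \min(n_1 - 1, n_2 - 1)$ or Welch's formula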

where $z_{\alpha/2}$ and $t_{\alpha/2,\nu}$ use the same notation as in the one-sample case.

In the energy data above, we can construct a 95% confidence interval for the difference in the true means between obese and lean. Here $n_1 = 9$, $n_2 = 13$ and $\bar X_1 - \bar X_2 = 2.23$. We'll use the conservative estimate $\nu = \min(9 - 1, 13 - 1) = 8$, and $SE_{\bar X_1 - \bar X_2} = 0.58$. A 95% interval needs the 0.975 quantile, $t_{0.025,8} = 2.306$, so our $M = 2.306 \times 0.5788 \approx 1.335$. Then a (conservative) 95% confidence interval is $[0.90, 3.57]$. Computer output for Welch's formula gives $[1.00, 3.46]$.

    > mean(obese)-mean(lean)
    [1] 2.231624
    > qt(.975,df=8)
    [1] 2.306004
    > sqrt(var(obese)/length(obese)+var(lean)/length(lean))
    [1] 0.5788152
    > t.test(obese,lean, conf.level=.95)

            Welch Two Sample t-test

    data:  obese and lean
    t = 3.8555, df = 15.919, p-value = 0.001411
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     1.004081 3.459167
    sample estimates:
    mean of x mean of y
    10.297778  8.066154
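The three unknown-$\sigma$ intervals can be compared in one go. This is a sketch rather than a definitive implementation: `two_sample_ci` is a hypothetical helper written for illustration, the data are typed in from the values printed in the notes, and a 95% interval is taken to use the 0.975 quantile of the reference distribution.

```r
# Energy expenditure values as printed in the notes (MJ per 24 hours)
obese <- c(9.21, 11.51, 12.79, 11.85, 9.97, 8.79, 9.69, 9.68, 9.19)
lean  <- c(7.53, 7.48, 8.08, 8.09, 10.15, 8.40, 10.88, 6.13, 7.90,
           7.05, 7.48, 7.58, 8.11)

# two_sample_ci() is a hypothetical helper, not a base R function
two_sample_ci <- function(x1, x2, conf = 0.95,
                          method = c("conservative", "welch", "pooled")) {
  method <- match.arg(method)
  n1 <- length(x1); n2 <- length(x2)
  v1 <- var(x1) / n1; v2 <- var(x2) / n2
  if (method == "pooled") {
    s2_pooled <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
    se <- sqrt(s2_pooled * (1 / n1 + 1 / n2))
    nu <- n1 + n2 - 2
  } else {
    se <- sqrt(v1 + v2)
    nu <- if (method == "welch") {
      (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
    } else {
      min(n1 - 1, n2 - 1)
    }
  }
  M <- qt(1 - (1 - conf) / 2, df = nu) * se   # margin of error
  d <- mean(x1) - mean(x2)                    # point estimate E
  c(lower = d - M, upper = d + M, df = nu)
}

# One column per method, rows: lower, upper, df
ci_all <- sapply(c("conservative", "welch", "pooled"),
                 function(m) two_sample_ci(obese, lean, method = m))
round(ci_all, 3)
```

The conservative choice gives the widest interval, because it uses the smallest degrees of freedom with the unpooled standard error.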

## Hypothesis Tests