
APPENDIX: Calculation Details

CHAPTER 0: Statistical Thinking, Variability, and Probability


Notation
• µ represents the population mean
• σ represents the population standard deviation (SD)
• xi represents the value of the ith observation on a variable of interest
• n represents the sample size
• x̄ represents the sample mean
• s represents the sample SD
• Σ represents summation

A common measure of location is the (arithmetic) mean, sometimes called the average.

1. Sample mean = \( \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n} \)

Example: Consider the following data on how much students paid for their last haircut (including tips) in dollars:
18, 55, 23, 75, 36.

Sample mean = \( \bar{x} = \frac{18 + 55 + 23 + 75 + 36}{5} \) = $41.40.


People in this sample paid $41.40 for their last haircut, on average.

A common measure of data variability is the standard deviation (SD). The standard deviation is approximately
equal to the average deviation of the values from the mean.

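2. Sample SD = \( s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} \)

\( = \sqrt{\frac{\text{sum of observed squared differences from the sample mean}}{n - 1}} \)
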
Recall that the observed sample mean is x̄ = $41.40.

i                        1         2         3         4          5         Sum
xi                       18        55        23        75         36
xi – x̄ (= xi – 41.40)    -23.4     13.6      -18.4     33.6       -5.4      Σ(xi – x̄) = 0
(xi – x̄)²                547.56    184.96    338.56    1128.96    29.16     Σ(xi – x̄)² = 2229.2

Thus, sample SD = \( s = \sqrt{\frac{2229.2}{5 - 1}} = \sqrt{\frac{2229.2}{4}} = \sqrt{557.3} \) = $23.61. Thus, the typical deviation of a
haircut cost from the sample mean ($41.40) was about $23.61 for this sample of 5 people.
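
The mean and SD calculations above can be checked with a few lines of Python. This is a minimal sketch that follows the two formulas step by step; the variable names are illustrative and not part of the original text.

```python
import math

# Haircut costs (dollars) from the example above.
costs = [18, 55, 23, 75, 36]
n = len(costs)

# Sample mean: sum of the observations divided by the sample size.
mean = sum(costs) / n

# Sample SD: square root of the sum of squared deviations divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in costs]
sd = math.sqrt(sum(squared_deviations) / (n - 1))

print(f"mean = {mean:.2f}, SD = {sd:.2f}")
```

Dividing by n - 1 rather than n is what distinguishes the sample SD used here from a population SD.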

OTHER MEASURES OF LOCATION:


3. Minimum = smallest number in the (sorted) dataset.
4. Maximum = largest number in the (sorted) dataset.

Example: Here are the weights (lbs) of 20 male basketball players.


225, 180, 230, 210, 224, 220, 185, 263, 160, 194, 245, 240, 225, 235, 190, 230, 237, 240, 190, 185
Sorted data:
160, 180, 185, 185, 190, 190, 194, 210, 220, 224, 225, 225, 230, 230, 235, 237, 240, 240, 245, 263

Then, observed Minimum = 160 lbs, and observed Maximum = 263 lbs.

5. Median = middle value in a sorted dataset when there is an odd number of observations. When there is an even
number of observations, the median is the average of the two middle values in the sorted dataset.

Example: Weights (lbs) of 20 male basketball players (contd.):

Then, observed Median = (224 + 225)/2 = 224.5 lbs.

6. Quartiles: Divide the sorted dataset into four equal quarters.

• Lower quartile = median of the half of the sorted dataset that contains the smaller observed values.
• Upper quartile = median of the half of the sorted dataset that contains the larger observed values.

Example: Weights (lbs) of 20 male basketball players (contd.):

Then, observed lower quartile = (190 + 190)/2 = 190 lbs, and observed upper quartile = (235 + 237)/2 = 236 lbs.

OTHER MEASURES OF VARIABILITY:


7. Range = Maximum – Minimum

Example: Weights (lbs) of 20 male basketball players (contd.):


Then, Range = 263 – 160 = 103 lbs.

8. Interquartile range (IQR) = Upper quartile – Lower quartile

Example: Weights (lbs) of 20 male basketball players (contd.):

Then, observed IQR = 236 – 190 = 46 lbs.
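
The measures of location and variability for the basketball data can be reproduced as follows. This is a sketch that uses the "median of each half" definition of quartiles given above; the helper name `median` and the handling of an odd number of observations (the middle value is left out of both halves) are assumptions, since the text only works an even-sized example.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

# Sorted weights (lbs) of the 20 players, as listed above.
weights = [160, 180, 185, 185, 190, 190, 194, 210, 220, 224,
           225, 225, 230, 230, 235, 237, 240, 240, 245, 263]
data = sorted(weights)
half = len(data) // 2

minimum, maximum = data[0], data[-1]
med = median(data)

# Quartiles as medians of the lower and upper halves of the sorted data.
lower_q = median(data[:half])
upper_q = median(data[len(data) - half:])

data_range = maximum - minimum
iqr = upper_q - lower_q

print(minimum, maximum, med, lower_q, upper_q, data_range, iqr)
```

Statistical software often uses other quartile conventions, so its quartiles may differ slightly from the ones computed this way.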

CHAPTER 1: Significance: How Strong is the Evidence?


Notation
• π represents the underlying process probability or population proportion (of “successes”)
• π0 represents the hypothesized value of the underlying process probability or population proportion, if the
null hypothesis is true
• p̂ represents the sample proportion of “successes”
• n represents the sample size
• z represents the standardized statistic

1. Sample proportion = \( \hat{p} = \frac{\text{number of successes}}{n} \)

2. In general, a z-statistic is

\[ z = \frac{\text{statistic} - \text{hypothesized value}}{\text{SD of the statistic under the null hypothesis}} \]

3. Let the null hypothesis be H0: π = π0.
Then, the z-statistic for the one-proportion scenario is given by

\[ z = \frac{\hat{p} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0)/n}} \]

Example:
Suppose that in 32 attempts, the dolphin Buzz pushed the correct button 30 times. Does this provide evidence
that Buzz was doing better than guessing which button to push?

Then, the
Null hypothesis, H0: Buzz is just guessing, so his long-run probability of choosing the correct button is 0.5.
Alternative hypothesis, Ha: Buzz is doing better than guessing, so his long-run probability of choosing the correct
button is greater than 0.5.
• Let π represent Buzz’s long-run probability of pushing the correct button.
• Then, H0: π = 0.5 versus Ha: π > 0.5, where π0 = 0.5.
• From the sample data, n = 32 and p̂ = 30/32 = 0.9375.
• Then, \( z = \frac{0.9375 - 0.5}{\sqrt{0.5 (1 - 0.5)/32}} = 4.95 \)
• The observed proportion of successes (0.9375) that Buzz had is 4.95 SDs above the hypothesized
proportion 0.5, which is what Buzz’s long-run probability of success would have been if he had been
guessing.
• Note: Using the z-statistic along with the theory-based approach to find a p-value is valid if the sample
size is large enough, that is, at least 10 “successes” and at least 10 “failures.”
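
The one-proportion z-statistic for the Buzz example can be sketched in Python as below. The function name `one_proportion_z` is hypothetical; the arithmetic is the formula z = (p̂ – π0)/√(π0(1 – π0)/n) from item 3.

```python
import math

def one_proportion_z(successes, n, pi0):
    """Return the sample proportion and the z-statistic for H0: pi = pi0."""
    p_hat = successes / n
    # SD of the sample proportion if the null hypothesis is true.
    sd_null = math.sqrt(pi0 * (1 - pi0) / n)
    return p_hat, (p_hat - pi0) / sd_null

p_hat, z = one_proportion_z(successes=30, n=32, pi0=0.5)
print(f"p_hat = {p_hat:.4f}, z = {z:.2f}")
```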

CHAPTER 2: Estimation: How Large is the Effect?

Notation
• π represents the underlying process probability or population proportion (of “successes”)
• p̂ represents the sample proportion of “successes”
• n represents the sample size

1. Sample proportion = \( \hat{p} = \frac{\text{number of successes}}{n} \)
2. In general, a confidence interval is given by
statistic ± margin of error

Alternatively we can write this as:


statistic ± Multiplier × SE

3. Theory-based confidence intervals for π, the underlying process probability or population proportion, are
given by

Confidence level    Margin of error                              Confidence interval
90%                 \( 1.645 \sqrt{\hat{p}(1 - \hat{p})/n} \)    \( \hat{p} \pm 1.645 \sqrt{\hat{p}(1 - \hat{p})/n} \)
95%                 \( 1.96 \sqrt{\hat{p}(1 - \hat{p})/n} \)     \( \hat{p} \pm 1.96 \sqrt{\hat{p}(1 - \hat{p})/n} \)
99%                 \( 2.576 \sqrt{\hat{p}(1 - \hat{p})/n} \)    \( \hat{p} \pm 2.576 \sqrt{\hat{p}(1 - \hat{p})/n} \)

• For a different confidence level, change the multiplier used in the margin of error formula.
• These theory-based confidence intervals are valid when the sample size is large enough, that is, there are
at least 10 “successes” and at least 10 “failures” in the sample data.

Example:
In a July 2012 Gallup survey of 1,014 randomly selected U.S. adults, 5% said that they consider themselves to be
vegetarians. Find a 95% confidence interval for the proportion of all U.S. adults who consider themselves to be
vegetarians.
• Let π represent the proportion of all U.S. adults who consider themselves to be vegetarians.
• From the sample data, n = 1014 and p̂ = 0.05. Thus, there are (1014) × (0.05) = 50.7 ≈ 51 “successes”
and (1014) × (1 – 0.05) = 963.3 ≈ 963 “failures” in the sample.
• Then, the theory-based 95% confidence interval is given by

\[ \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = 0.05 \pm 1.96 \sqrt{\frac{0.05 (1 - 0.05)}{1014}} = 0.05 \pm 1.96 (0.0068) = 0.05 \pm 0.0134 = (0.037, 0.063) \]

• So, we are 95% confident that the proportion of U.S. adults who consider themselves to be vegetarians is
somewhere between 0.037 and 0.063.
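
The interval in the Gallup example can be reproduced with the sketch below. The multipliers come from the table in item 3; the function name `proportion_ci` and the dictionary `MULTIPLIERS` are illustrative names, not part of the original text.

```python
import math

# Multipliers for the three confidence levels listed in the table above.
MULTIPLIERS = {90: 1.645, 95: 1.96, 99: 2.576}

def proportion_ci(p_hat, n, level=95):
    """Theory-based confidence interval: p_hat +/- multiplier * SE."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = MULTIPLIERS[level] * se
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(p_hat=0.05, n=1014, level=95)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

Passing `level=90` or `level=99` changes only the multiplier, which matches the note that a different confidence level changes the multiplier in the margin of error.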
CHAPTER 5: Comparing Two Proportions

Notation

• π1 represents the underlying probability for the 1st process or the population proportion for the 1st
population (of “successes”)
• π2 represents the underlying probability for the 2nd process or the population proportion for the 2nd
population (of “successes”)
• p̂1 represents the sample proportion of successes in the 1st sample
• p̂2 represents the sample proportion of successes in the 2nd sample
• p̂ represents the combined proportion of successes in the sample = \( \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} \)
• n1 represents the sample size of the 1st sample
• n2 represents the sample size of the 2nd sample
• z represents the standardized z-statistic

Let H0: π1 – π2 = 0.

Then, the corresponding z-statistic is given by

\[ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]

Note: Using the z-statistic along with the theory-based approach to find a p-value is valid if the sample size is
large enough, that is, there are at least 10 observations in each of the four cells, when the data are written out as a
2×2 table.

And, the theory-based 95% confidence interval for π1 – π2 is given by:

\[ \hat{p}_1 - \hat{p}_2 \pm 1.96 \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} \]

Note: The theory-based confidence interval is valid provided the sample sizes are large enough. That is, there are
at least 10 observations in each of the four cells, when the data are written out as a 2×2 table.

Example: Let p̂1 = 0.548, p̂2 = 0.451, n1 = 3602, n2 = 565.

Then, p̂ = (3602 × 0.548 + 565 × 0.451)/(3602 + 565) = 0.535.

• The z-statistic can be computed as

\[ z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.548 - 0.451}{\sqrt{0.535 (1 - 0.535)\left(\frac{1}{3602} + \frac{1}{565}\right)}} = 4.30 \]

• Thus, the observed difference in the sample proportions of 0.097 is 4.3 SDs above the hypothesized
difference of 0. Recall that the null hypothesis says π1 – π2 is equal to 0.
• Because the sample sizes are large enough, the theory-based 95% confidence interval for π1 – π2 can be
computed as

\[ 0.548 - 0.451 \pm 1.96 \sqrt{\frac{0.548 (1 - 0.548)}{3602} + \frac{0.451 (1 - 0.451)}{565}} = (0.053, 0.141) \]

• We are 95% confident that π1 is higher than π2 by somewhere between 0.053 and 0.141.
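
Both calculations in this example can be checked with the sketch below. The function names are hypothetical; note that the z-statistic uses the combined proportion p̂ in its standard error, whereas the confidence interval uses the two separate sample proportions.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Combined proportion and z-statistic for H0: pi1 - pi2 = 0."""
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return pooled, (p1 - p2) / se_null

def two_proportion_ci95(p1, n1, p2, n2):
    """Theory-based 95% confidence interval for pi1 - pi2."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - 1.96 * se, diff + 1.96 * se

pooled, z = two_proportion_z(0.548, 3602, 0.451, 565)
low, high = two_proportion_ci95(0.548, 3602, 0.451, 565)
print(f"pooled = {pooled:.3f}, z = {z:.2f}, CI = ({low:.3f}, {high:.3f})")
```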

CHAPTER 6: Comparing Two Averages

Notation
• µ1 represents the “population” average for population 1
• µ2 represents the “population” average for population 2
• x̄1 represents the sample mean for sample 1
• x̄2 represents the sample mean for sample 2
• s1 represents the sample SD for sample 1
• s2 represents the sample SD for sample 2
• n1 represents the sample size for sample 1
• n2 represents the sample size for sample 2
• t represents the standardized statistic

Let H0: µ1 – µ2 = 0.
The formula used to calculate the t-statistic to compare two groups on a quantitative response is

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

An approximate theory-based 95% confidence interval for µ1 – µ2 can be written as follows:

\[ \bar{x}_1 - \bar{x}_2 \pm 2 \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

Note: Using the t-statistic along with the theory-based approach to find a p-value, and/or finding the theory-based
confidence interval, is valid provided
(i) the sample sizes are large enough, that is, each of the samples has at least 20 observations,
OR (ii) each sample comes from a normally distributed population.

Example: Consider data on BMI of women participating in a randomized experiment of lifestyle change
programs.

Group           Sample size    Sample mean         Sample SD
Intervention    n1 = 60        x̄1 = 30 kg/m²       s1 = 2.1 kg/m²
Control         n2 = 60        x̄2 = 34 kg/m²       s2 = 2.4 kg/m²

• We can define our parameters of interest to be
µ1 = Average BMI after 2 years of being enrolled in an individualized lifestyle change
program, for (obese) women like those in the study, and
µ2 = Average BMI after 2 years of being enrolled in a “one size fits all” lifestyle change program,
for (obese) women like those in the study

• And using the symbols µ1 and µ2 we can restate our hypotheses to be
Null hypothesis, H0: µ1 – µ2 = 0
Alternative hypothesis, Ha: µ1 – µ2 ≠ 0

• The t-statistic can be computed as follows:

\[ t = \frac{30 - 34}{\sqrt{\frac{2.1^2}{60} + \frac{2.4^2}{60}}} = \frac{-4}{0.4117} = -9.72 \]

• This tells us that the observed difference (-4) is 9.72 SDs below the hypothesized difference of 0.

• Using the data from the study, we can see that each sample had 60 observational units, which is more
than 20. Thus, the theory-based approximate 95% confidence interval for µ1 – µ2 is:

\[ (30 - 34) \pm 2 \sqrt{\frac{2.1^2}{60} + \frac{2.4^2}{60}} = -4 \pm 2 (0.4117) = -4 \pm 0.8234 = (-4.8234, -3.1766) \]

• Thus, we are 95% confident that enrollment in individualized lifestyle change programs (compared to
“one size fits all” programs) decreases the average BMI of women like the obese Italian women in our
study by somewhere between 3.18 kg/m² and 4.82 kg/m².
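
A short Python sketch of the two-sample calculation, using the summary statistics from the BMI table. The function name `two_sample_t` is an illustrative choice; the multiplier 2 is the approximate 95% multiplier used in this appendix.

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t-statistic for H0: mu1 - mu2 = 0 and an approximate 95% CI."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    diff = mean1 - mean2
    t = diff / se
    ci = (diff - 2 * se, diff + 2 * se)
    return t, se, ci

# Intervention: n1 = 60, mean 30, SD 2.1; control: n2 = 60, mean 34, SD 2.4.
t, se, ci = two_sample_t(30, 2.1, 60, 34, 2.4, 60)
print(f"SE = {se:.4f}, t = {t:.2f}, CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```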

CHAPTER 7: Paired Data: One Quantitative Variable

Notation

• µd represents the mean difference for the “population”
• x̄d represents the sample mean difference
• sd represents the SD of the sample differences
• n represents the sample size
• t represents the standardized statistic

Let H0: µd = 0.

The t-statistic for paired data on a quantitative response is \( t = \frac{\bar{x}_d}{s_d/\sqrt{n}} \).

When the sample size is large enough, we can also find an approximate theory-based 95% confidence
interval for µd, as follows:

\[ \bar{x}_d \pm 2 \times \frac{s_d}{\sqrt{n}} \]

Note: Using the t-statistic along with the theory-based approach to find a p-value, and/or finding the theory-
based confidence interval, is valid if
(i) the sample size (that is, the number of pairs observed) is large enough (at least 20),
OR (ii) the sample of differences comes from a normally distributed population.
Example: Suppose that we have the following data

                                   Sample size    Mean of sample differences (kcal)    SD of sample differences (kcal)
Difference = Estimated – Actual    80             x̄d = -435                            527.7

• Let µd represent the average difference in estimated and actual energy intake, by men like those in the
study.
• Then, H0: µd = 0 versus Ha: µd ≠ 0.
• The t-statistic can be found as \( t = \frac{-435}{527.7/\sqrt{80}} = -7.37 \)

• Thus, the observed average difference between estimated and actual energy intake, -435 kcal, is 7.37 SDs
below 0.
• Using the data from the study, because there are 80 observational units (that is, 80 pairs of responses), the
approximate theory-based 95% confidence interval for µd is:

\[ -435 \pm 2 \times \frac{393.5}{\sqrt{80}} = -435 \pm 2 \times 43.99 = -435 \pm 87.99 = (-523, -347) \]

• Thus, we are 95% confident that, on average, men such as the ones in this study underestimate their
energy intake by somewhere between 347 kcal and 523 kcal.
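
The paired-data formulas can be sketched as below. The function names are hypothetical. The t-statistic in the example uses the tabulated SD of 527.7, while the worked interval uses 393.5 in its standard error, so the sketch takes the SD as a parameter and evaluates the interval with each value.

```python
import math

def paired_t(mean_diff, sd_diff, n):
    """t-statistic for H0: mu_d = 0."""
    return mean_diff / (sd_diff / math.sqrt(n))

def paired_ci95(mean_diff, sd_diff, n):
    """Approximate 95% CI: mean_diff +/- 2 * sd_diff / sqrt(n)."""
    se = sd_diff / math.sqrt(n)
    return mean_diff - 2 * se, mean_diff + 2 * se

t = paired_t(-435, 527.7, 80)
ci_worked = paired_ci95(-435, 393.5, 80)  # SD value that appears in the worked interval
ci_table = paired_ci95(-435, 527.7, 80)   # SD value that appears in the data table

print(f"t = {t:.2f}")
print(f"CI with SD 393.5: ({ci_worked[0]:.0f}, {ci_worked[1]:.0f})")
print(f"CI with SD 527.7: ({ci_table[0]:.0f}, {ci_table[1]:.0f})")
```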

CHAPTER 8: Comparing More Than Two Groups Using Proportions

Notation

• πi represents the underlying probability for the ith process or the population proportion for the ith population
• p̂i represents the sample proportion of successes in the ith sample
• ni represents the sample size of the ith sample
• p̂ represents the overall proportion of successes in the entire study
• χ² represents the chi-square statistic
• Σ represents summation

The chi-square statistic for comparing multiple groups on a binary response variable is given by

\[ \chi^2 = \frac{1}{\hat{p}(1 - \hat{p})} \sum_i n_i (\hat{p}_i - \hat{p})^2 \]

Example: Consider the following data from Chapter 8

Observed counts                        Real acupuncture    Sham acupuncture    Non-acupuncture    Total
Substantial reduction in pain          184                 171                 106                461
Not a substantial reduction in pain    203                 216                 282                701
Total                                  387                 387                 388                1162

• The hypotheses can be written as

H0: There is no association between type of treatment and whether or not a person experiences substantial
reduction in pain,

Ha: There is an association between type of treatment and whether or not a person experiences substantial
reduction in pain.

• p̂1 = sample proportion of subjects experiencing substantial reduction in pain among those who received
the real acupuncture = 184/387 = 0.475
• Similarly, p̂2 = 171/387 = 0.442, and p̂3 = 106/388 = 0.273
• n1 = 387, n2 = 387, n3 = 388
• p̂ = overall proportion of successes in the entire study = (184 + 171 + 106)/(1162) = 0.397
• Then, the observed value of the chi-square statistic can be calculated as

\[ \chi^2_{\text{observed}} = \frac{1}{0.397 (1 - 0.397)} \left[ 387 (0.475 - 0.397)^2 + 387 (0.442 - 0.397)^2 + 388 (0.273 - 0.397)^2 \right] \approx 38.05 \]

(computed from the unrounded proportions).

Another form of the chi-square statistic:


The chi-square statistic for comparing multiple groups on a categorical response variable can also be written
out as

\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]

where

• Oi represents the observed number of observational units in the ith cell
• Ei represents the number of observational units we would expect in the ith cell if the null hypothesis were
true, and there were no association between the explanatory variable and the response variable; these Ei
are often referred to as the expected cell counts
• Σ represents summation

To calculate the expected counts, we determine the overall proportion in each response category, and then apply
this proportion to each explanatory variable group. One way of implementing this is by calculating the expected
cell counts (Ei) using the following formula

\[ E_i = \frac{\text{row total} \times \text{column total}}{\text{overall total}} \]

where the row and column totals correspond to the row and column in which the ith cell appears.

Note: Using the chi-square statistic along with the theory-based approach to find a p-value is valid provided the
sample sizes are large enough. That is, there are at least 10 observations in each of the cells, when the data are
written out as a two-way table.
Example: Let us use the data from Chapter 8 again, where the observed counts are as follows

Observed counts (Oi)                   Real acupuncture    Sham acupuncture    Non-acupuncture    Total
Substantial reduction in pain          184                 171                 106                461
Not a substantial reduction in pain    203                 216                 282                701
Total                                  387                 387                 388                1162

Then, expected cell counts, Ei, can be calculated as

Expected counts (Ei)                   Real acupuncture            Sham acupuncture            Non-acupuncture             Total
Substantial reduction in pain          (461 × 387)/1162 = 153.5    (461 × 387)/1162 = 153.5    (461 × 388)/1162 = 153.9    461
Not a substantial reduction in pain    (701 × 387)/1162 = 233.5    (701 × 387)/1162 = 233.5    (701 × 388)/1162 = 234.1    701
Total                                  387                         387                         388                         1162

And, the observed value of the chi-square statistic is then

\[ \chi^2 = \frac{(184 - 153.5)^2}{153.5} + \frac{(171 - 153.5)^2}{153.5} + \frac{(106 - 153.9)^2}{153.9} + \frac{(203 - 233.5)^2}{233.5} + \frac{(216 - 233.5)^2}{233.5} + \frac{(282 - 234.1)^2}{234.1} \]

\[ = 6.05 + 1.99 + 14.92 + 3.98 + 1.31 + 9.82 \approx 38.05 \]

Each calculated value of \( (O_i - E_i)^2 / E_i \) is called a chi-square cell contribution. For example, the chi-square cell
contribution for the cell corresponding to “real acupuncture” and “substantial reduction in pain” is given by
\( \frac{(184 - 153.5)^2}{153.5} = 6.05 \).
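
The expected counts, cell contributions, and chi-square statistic for the acupuncture data can be reproduced with the sketch below, which also evaluates the proportion-based form of the statistic for comparison. Variable names are illustrative; the code works from unrounded values, so individual contributions may differ from hand calculations in the last digit.

```python
# Observed counts: rows are response categories, columns are treatment groups.
observed = [
    [184, 171, 106],  # substantial reduction in pain
    [203, 216, 282],  # not a substantial reduction in pain
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
overall_total = sum(row_totals)

# Expected count = (row total x column total) / overall total.
expected = [[r * c / overall_total for c in col_totals] for r in row_totals]

# Cell contribution = (O - E)^2 / E; the statistic is the sum over all cells.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi_square = sum(sum(row) for row in contributions)

# Proportion form: sum of n_i * (p_i - p)^2, divided by p * (1 - p).
p_overall = row_totals[0] / overall_total
p_groups = [observed[0][j] / col_totals[j] for j in range(len(col_totals))]
chi_square_alt = sum(
    n * (p - p_overall) ** 2 for n, p in zip(col_totals, p_groups)
) / (p_overall * (1 - p_overall))

print([[round(e, 1) for e in row] for row in expected])
print(round(chi_square, 2), round(chi_square_alt, 2))
```

For a binary response the two forms are algebraically the same quantity, which is why both computations agree.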

CHAPTER 9: Comparing More Than Two Groups with Averages

Notation

• µi represents the “population” average for the ith population
• x̄i represents the sample mean for the ith sample
• si represents the sample SD for the ith sample
• ni represents the sample size for the ith sample
• x̄ represents the overall sample mean
• I = number of samples/groups being compared
• N = n1 + n2 + … + nI
• F represents the F-statistic

The hypotheses can be written as

H0: µ1 = µ2 = … = µI, versus Ha: At least one of the µi is different

Or as,

H0: There is no association between the two variables, versus

Ha: There is an association between the two variables

Then, the F-statistic for comparing more than two groups on a quantitative response is given by

\[ F = \frac{\text{between-group variability}}{\text{within-group variability}} = \frac{\sum_{i=1}^{I} n_i (\bar{x}_i - \bar{x})^2 / (I - 1)}{\sum_{i=1}^{I} (n_i - 1) s_i^2 / (N - I)} \]


Example: Consider the following data from Chapter 9 on the comprehension scores of an ambiguous prose
passage:

                        Sample size    Sample mean    Sample SD
No picture              n1 = 19        x̄1 = 3.37      s1 = 1.26
Picture shown before    n2 = 19        x̄2 = 4.95      s2 = 1.31
Picture shown after     n3 = 19        x̄3 = 3.21      s3 = 1.40
Overall                 N = 57         x̄ = 3.84       1.52
And I = 3.

To find the observed value of the F-statistic, let us first calculate the numerator and then the denominator:

• between-group variability = \( \frac{19 (3.37 - 3.84)^2 + 19 (4.95 - 3.84)^2 + 19 (3.21 - 3.84)^2}{3 - 1} = 17.57 \)
• within-group variability = \( \frac{(19 - 1)(1.26)^2 + (19 - 1)(1.31)^2 + (19 - 1)(1.40)^2}{57 - 3} = 1.75 \)
• Thus, the observed value of the F-statistic = 17.57/1.75 = 10.02 (computed from the unrounded values).

Note: Using the F-statistic along with the theory-based approach to find a p-value is valid as long as
(i) The sample sizes are large enough. That is, there are at least 20 observations in each sample,
OR, the samples each come from normally distributed populations.
(ii) The variability (SD) in each of the populations is the same.
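
The F-statistic calculation in the example above can be sketched from the summary statistics alone. The function name `f_statistic` is hypothetical, and the overall mean of 3.84 is taken from the table rather than recomputed.

```python
def f_statistic(sizes, means, sds, overall_mean):
    """Between-group variability, within-group variability, and their ratio F."""
    groups = len(sizes)
    total_n = sum(sizes)
    between = sum(
        n * (m - overall_mean) ** 2 for n, m in zip(sizes, means)
    ) / (groups - 1)
    within = sum(
        (n - 1) * s ** 2 for n, s in zip(sizes, sds)
    ) / (total_n - groups)
    return between, within, between / within

between, within, f = f_statistic(
    sizes=[19, 19, 19],
    means=[3.37, 4.95, 3.21],
    sds=[1.26, 1.31, 1.40],
    overall_mean=3.84,
)
print(f"between = {between:.2f}, within = {within:.2f}, F = {f:.2f}")
```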
CHAPTER 10: Paired Data: Two Different Quantitative Variables

Notation

 ρ represents the population correlation coefficient


 r represents the sample correlation coefficient
 β represents the population slope
 b represents the sample slope
 a represents the sample intercept
 xi represents the observed value of the explanatory variable for the ith observational unit
 yi represents the observed value of the explanatory variable for the ith observational unit
 n represents the sample size
 ̅ represents the sample mean of the observed values of the explanatory variable
 represents the sample mean of the observed values of the response variable
 sx represents the sample SD of the observed values of the explanatory variable
 sy represents the sample SD of the observed values of the response variable
 SEb is the standard error of the sample slope.
 t represents the standardized statistic

The sample correlation coefficient for data on a quantitative explanatory variable and a quantitative response
variable, is given by
̅

The least squares regression line for data on a quantitative explanatory variable and a quantitative response is
given by

Where, the sample slope for least squares regression line for data on a quantitative explanatory variable and a
quantitative response is given by

And, the sample intercept for least squares regression line for data on a quantitative explanatory variable and a
quantitative response is given by

̅
Let, H0: ρ = 0 versus Ha: ρ ≠ 0

Then, the corresponding statistic for the hypotheses about the population correlation coefficient (ρ) is given
as

t=

Let, H0: β = 0 versus Ha: β ≠ 0

Then, the corresponding statistic for the hypotheses about the population slope (β) is given as

t= , where SEb can be read off applet or software output.

And, an approximate theory-based 95% confidence interval for β is given by

b ± 2 SEb

The theory-based confidence interval for the slope, β, is valid if


(i) The values of the response variable at each possible value of the explanatory variable have a normal distribution
in the population from which the sample was drawn.
(ii) Each of these normal distributions has the same SD.

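Example: Let us use the data on height (inches) and handspan (cm) for a sample of 10 college students.

                       Observed values                                       Mean     SD
Height (inches), yi    64    67    65      72    71    70    66    62    73    65    67.50    3.75
Handspan (cm), xi      17    21    20.3    26    24    22    21    19    20    19    20.93    2.59

Then,
• x̄ = 20.93 cm
• ȳ = 67.50 inches
• sx = 2.59 cm
• sy = 3.75 inches
• n = 10

The observed value of the sample correlation coefficient can be calculated to be

\[ r = \frac{1}{10 - 1} \left[ \left( \frac{17 - 20.93}{2.59} \right) \left( \frac{64 - 67.5}{3.75} \right) + \left( \frac{21 - 20.93}{2.59} \right) \left( \frac{67 - 67.5}{3.75} \right) + \left( \frac{20.3 - 20.93}{2.59} \right) \left( \frac{65 - 67.5}{3.75} \right) + \dots + \left( \frac{19 - 20.93}{2.59} \right) \left( \frac{65 - 67.5}{3.75} \right) \right] \]

\[ = \frac{1}{87.44} \left[ 13.76 - 0.04 + 1.58 + \dots + 4.83 \right] = \frac{1}{87.44} (61.75) = 0.71 \]

To find the least squares regression line, we can calculate the sample slope and sample intercept as

\[ b = r \times \frac{s_y}{s_x} = 0.71 \times \frac{3.75}{2.59} = 1.02 \]

\[ a = \bar{y} - b \bar{x} = 67.5 - 1.02 \times 20.93 = 46.12 \]

(both computed from unrounded values).

Thus, the calculated least squares regression line for the sample data is found to be \( \hat{y} = 46.12 + 1.02 x \),
where x is handspan (cm) and ŷ is the predicted height (inches).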

The standardized t-statistic for H0: ρ = 0 versus Ha: ρ ≠ 0 can be calculated to be

\[ t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} = \frac{0.71}{\sqrt{(1 - 0.71^2)/(10 - 2)}} = 2.85 \]

Thus, the observed (sample) correlation coefficient 0.71 between height and handspan is about 2.85 SDs above
the hypothesized value 0 of the population correlation coefficient, the value ρ would be if the null hypothesis were
true.
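
The correlation, slope, intercept, and t-statistic above can be recomputed from the raw data with the sketch below. Names such as `sample_sd` are illustrative. Because the code carries unrounded values through every step, its t-statistic comes out near 2.82, whereas plugging the rounded r = 0.71 into the formula gives the 2.85 shown above.

```python
import math

handspan = [17, 21, 20.3, 26, 24, 22, 21, 19, 20, 19]  # x, in cm
height = [64, 67, 65, 72, 71, 70, 66, 62, 73, 65]      # y, in inches
n = len(handspan)

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

x_bar, y_bar = mean(handspan), mean(height)
s_x, s_y = sample_sd(handspan), sample_sd(height)

# Correlation: average product of standardized x and y values (divisor n - 1).
r = sum(
    ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
    for x, y in zip(handspan, height)
) / (n - 1)

b = r * s_y / s_x      # sample slope
a = y_bar - b * x_bar  # sample intercept

# t-statistic for H0: rho = 0, from unrounded r and from r rounded to 0.71.
t = r / math.sqrt((1 - r ** 2) / (n - 2))
t_rounded_r = 0.71 / math.sqrt((1 - 0.71 ** 2) / (n - 2))

print(f"r = {r:.2f}, b = {b:.2f}, a = {a:.2f}")
print(f"t = {t:.2f} (unrounded r), {t_rounded_r:.2f} (r rounded to 0.71)")
```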

Here is statistical software output with b, SEb, and the corresponding value of the t-statistic circled.

Notice that the t-statistic for H0: β = 0 versus Ha: β ≠ 0 can be calculated to be

\[ t = \frac{b}{SE_b} = \frac{1.02}{0.36} = 2.82 \]


And, the approximate theory-based 95% confidence interval for β can be calculated to be

1.02 ± 2 (0.36) = 1.02 ± 0.72 = (0.30, 1.74)

We are 95% confident that the increase in average height associated with an increase of 1 cm in handspan is
somewhere between 0.30 inches and 1.74 inches.
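
The text reads SEb off software output. As a hedged illustration of where that number comes from, the sketch below uses the standard least-squares formula SEb = s_e / √Σ(xi – x̄)², where s_e is the residual SD with n – 2 degrees of freedom. This formula is not stated in the text and is included here as an assumption about what the software reports.

```python
import math

handspan = [17, 21, 20.3, 26, 24, 22, 21, 19, 20, 19]  # x, in cm
height = [64, 67, 65, 72, 71, 70, 66, 62, 73, 65]      # y, in inches
n = len(handspan)

x_bar = sum(handspan) / n
y_bar = sum(height) / n

sxx = sum((x - x_bar) ** 2 for x in handspan)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(handspan, height))

# Least squares slope and intercept (equivalent to b = r * s_y / s_x).
b = sxy / sxx
a = y_bar - b * x_bar

# Residual SD with n - 2 degrees of freedom, then the standard error of b.
residuals = [y - (a + b * x) for x, y in zip(handspan, height)]
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
se_b = s_e / math.sqrt(sxx)

t = b / se_b
ci = (b - 2 * se_b, b + 2 * se_b)  # approximate 95% CI: b +/- 2 * SE_b

print(f"b = {b:.2f}, SE_b = {se_b:.2f}, t = {t:.2f}")
print(f"approximate 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because the sketch keeps unrounded values, the upper endpoint can differ in the second decimal place from the interval computed by hand from b = 1.02 and SEb = 0.36.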
