
Chapter 10: Comparing Two Groups

Bivariate Analysis: Methods for comparing two groups are special cases of bivariate statistical methods. Two variables are involved:
Response variable: the outcome variable on which comparisons are made
Explanatory variable: a binary variable that specifies the groups
Statistical methods analyze how the outcome on the response variable depends on, or is explained by, the value of the explanatory variable
Independent Samples: Most comparisons of groups use independent samples from the groups. The observations in one sample are independent of those in the other sample
Example: Randomized experiments that randomly allocate subjects to two treatments
Example: An observational study that separates subjects into groups according to their value for an explanatory variable
Dependent samples: Dependent samples result when the data are matched pairs: each subject in one sample is matched with a subject in the other sample
Example: A set of married couples, the men being in one sample and the women in the other
Example: Each subject is observed at two times, so the two samples have the same subjects
Categorical response variable:
For a categorical response variable:
- Inferences compare groups in terms of their population proportions in a particular category
- We can compare the groups by the difference in their population proportions: (p1 - p2)
Example (the Physicians Health Study experiment): Subjects were 22,071 male physicians
Every other day for five years, study participants took either an aspirin or a placebo
The physicians were randomly assigned to the aspirin or to the placebo group
The study was double-blind: the physicians did not know which pill they were taking, nor did those who evaluated the results
What is the response variable?
The response variable is whether the subject had a heart attack, with categories yes or no.
What are the groups to compare?
The groups to compare are:
Group 1: Physicians who took a placebo
Group 2: Physicians who took aspirin

Estimate the difference between the two population parameters of interest:
p1: the proportion of the population who would have a heart attack if they participated in this experiment and took the placebo
p2: the proportion of the population who would have a heart attack if they participated in this experiment and took the aspirin
Sample statistics:
p̂1 = 189/11034 = 0.017 (placebo group)
p̂2 = 104/11037 = 0.009 (aspirin group)
p̂1 - p̂2 = 0.017 - 0.009 = 0.008
To make an inference about the difference of population proportions, (p1 - p2), we need to learn about the variability of the sampling distribution of (p̂1 - p̂2)
Standard error for comparing two proportions:
The difference, (p̂1 - p̂2), is obtained from sample data
It will vary from sample to sample
This variation is the standard error of the sampling distribution of (p̂1 - p̂2):

se = \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}

Confidence Interval for the Difference Between Two Population Proportions
The confidence interval is (\hat{p}_1 - \hat{p}_2) \pm z(se); the z-score depends on the confidence level
This method requires:
Categorical response variable for two groups
Independent random samples for the two groups
Large enough sample sizes so that there are at least 10 successes and at least 10 failures in each group
Confidence Interval Comparing Heart Attack Rates for Aspirin and Placebo
95% CI: (0.017 - 0.009) \pm 1.96\sqrt{0.017(1-0.017)/11034 + 0.009(1-0.009)/11037} = 0.008 \pm 0.003, or (0.005, 0.011)
Since both endpoints of the confidence interval (0.005, 0.011) for (p1 - p2) are positive, we infer that (p1 - p2) is positive
Conclusion: The population proportion of heart attacks is larger when subjects take the placebo than when they take aspirin
The population difference is small (the interval runs from 0.005 to 0.011)
Even though it is a small difference, it may be important in public health terms
For example, a decrease of 0.01 over a 5-year period in the proportion of people suffering heart attacks would mean 2 million fewer people having heart attacks
The study used male doctors in the U.S., so the inference applies to the U.S. population of male doctors. Before concluding that aspirin benefits a larger population, we'd want to see results of studies with more diverse groups
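A minimal sketch of this interval calculation in Python, using only the study's counts as given above:

```python
# 95% confidence interval for a difference of proportions (p1 - p2),
# using the aspirin/placebo counts from the example above.
import math

x1, n1 = 189, 11034   # heart attacks, placebo group
x2, n2 = 104, 11037   # heart attacks, aspirin group

p1_hat, p2_hat = x1 / n1, x2 / n2       # 0.017 and 0.009
diff = p1_hat - p2_hat                  # 0.008

se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
z = 1.96                                # z-score for 95% confidence
print(f"95% CI: ({diff - z*se:.3f}, {diff + z*se:.3f})")   # about (0.005, 0.011)
```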
Interpreting a confidence interval for a difference of proportions
Check whether 0 falls in the CI
If so, it is plausible that the population proportions are equal
If all values in the CI for (p1 - p2) are positive, you can infer that (p1 - p2) > 0
If all values in the CI for (p1 - p2) are negative, you can infer that (p1 - p2) < 0
Which group is labeled 1 and which is labeled 2 is arbitrary
The magnitude of values in the confidence interval tells you how large any true difference is. If all values in the confidence interval are near 0, the true difference may be relatively small in practical terms
Significance tests comparing population proportions:
1. Assumptions:
- Categorical response variable for two groups
- Independent random samples
- Significance tests comparing proportions use the sample size guideline from confidence intervals: each sample should have at least about 10 successes and 10 failures
- Two-sided tests are robust against violations of this condition; at least 5 successes and 5 failures is adequate
2. Hypotheses:
The null hypothesis is the hypothesis of no difference or no effect: H0: p1 = p2
The alternative hypothesis is the hypothesis of interest to the investigator:
Ha: p1 ≠ p2 (two-sided test)
Ha: p1 < p2 (one-sided test)
Ha: p1 > p2 (one-sided test)
Pooled Estimate
Under the presumption that p1 = p2, we estimate the common value of p1 and p2 by the proportion of the total sample in the category of interest
This pooled estimate is calculated by combining the number of successes in the two groups and dividing by the combined sample size (n1 + n2)
3. The test statistic is:

z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}}

where p̂ is the pooled estimate
4. P-value: The probability, obtained from the standard normal table, of values even more extreme than the observed z test statistic
5. Conclusion: Smaller P-values give stronger evidence against H0 and supporting Ha




Example: TV violence and aggressive behavior; 707 families, observed over 17 years
Define Group 1 as those who watched less than 1 hour of TV per day, on the average, as teenagers
Define Group 2 as those who averaged at least 1 hour of TV per day, as teenagers
p1 = population proportion committing aggressive acts for the lower level of TV watching
p2 = population proportion committing aggressive acts for the higher level of TV watching

Test the hypotheses: H0: (p1 - p2) = 0 vs. Ha: (p1 - p2) ≠ 0, using a significance level of 0.05
Test statistic:

\hat{p} = (5 + 154)/(88 + 619) = 0.225
se_0 = \sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)} = \sqrt{0.225(0.775)(1/88 + 1/619)} = 0.0476
z = (\hat{p}_1 - \hat{p}_2)/se_0 = (0.057 - 0.249)/0.0476 = -0.192/0.0476 = -4.04

Conclusion: Since the P-value is less than 0.05, we reject H0
We conclude that the population proportions of aggressive acts differ for the two groups
The sample values suggest that the population proportion is higher for the higher level of TV watching
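A short sketch of the pooled z-test in Python; the counts (5 of 88 and 154 of 619) are the ones recoverable from the worked numbers above:

```python
# Pooled two-proportion z-test for the TV-watching example.
import math
from scipy.stats import norm

x1, n1 = 5, 88       # aggressive acts among those watching < 1 hour TV/day
x2, n2 = 154, 619    # aggressive acts among those watching >= 1 hour TV/day

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)                       # 0.225
se0 = math.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
z = (p1_hat - p2_hat) / se0                            # about -4.04
p_value = 2 * norm.sf(abs(z))                          # two-sided P-value
print(f"z = {z:.2f}, P-value = {p_value:.5f}")
```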
Example: Two Proportions (Summer Jobs)
Is there evidence that the proportion of male students who had summer jobs differs from the proportion of female students who had summer jobs?

Summer Status    Men    Women
Employed         718    593
Not Employed      79    139
Total            797    732

Null: The proportion of male students who had summer jobs is the same as the proportion of female students who had summer jobs [H0: p1 = p2]
Alt: The proportion of male students who had summer jobs differs from the proportion of female students who had summer jobs [Ha: p1 ≠ p2]
Test statistic: n1 = 797 and n2 = 732 (both large, so the test statistic follows a Normal distribution)
Pooled sample proportion: \hat{p} = (718 + 593)/(797 + 732) = 1311/1529
Test statistic:

z = (718/797 - 593/732) / \sqrt{(1311/1529)(1 - 1311/1529)(1/797 + 1/732)} = 5.07

Hypotheses: H0: p1 = p2, Ha: p1 ≠ p2
Test Statistic: z = 5.07
P-value: P-value = 2P(Z > 5.07) = 0.000000396 (using a computer)
Conclusion: Since the P-value is quite small, there is very strong evidence that the proportion of male students who had summer jobs differs from that of female students.

Comparing Means: We can compare two groups on a quantitative response variable by comparing their means
Example: A 30-month study evaluated the degree of addiction that teenagers form to nicotine. 332 students who had used nicotine were evaluated. The response variable was constructed using a questionnaire called the Hooked on Nicotine Checklist (HONC)
The HONC score is the total number of questions to which a student answered yes during the study. The higher the score, the more hooked on nicotine a student is judged to be
The study considered explanatory variables, such as gender, that might be associated with the HONC score




How can we compare the sample HONC scores for females and males?
We estimate (μ1 - μ2) by (x̄1 - x̄2): 2.8 - 1.6 = 1.2
On average, females answered yes to about one more question on the HONC scale than males did
To make an inference about the difference between population means, (μ1 - μ2), we need to learn about the variability of the sampling distribution of (x̄1 - x̄2)
Standard error for comparing two means:
The difference, (x̄1 - x̄2), is obtained from sample data. It will vary from sample to sample.
This variation is the standard error of the sampling distribution of (x̄1 - x̄2):

se = \sqrt{s_1^2/n_1 + s_2^2/n_2}
Confidence interval for the difference between two population means:
A 95% confidence interval for (μ1 - μ2) is:

(\bar{x}_1 - \bar{x}_2) \pm t_{.025}\sqrt{s_1^2/n_1 + s_2^2/n_2}

t.025 is the critical value for a 95% confidence level from the t distribution
The degrees of freedom are calculated using software. If you are not using software, you can take df to be the smaller of (n1 - 1) and (n2 - 1) as a safe estimate

This method assumes:
Independent random samples from the two groups
An approximately normal population distribution for each group
This is mainly important for small sample sizes, and even then the method is robust to
violations of this assumption
Example: Data as summarized by HONC scores for the two groups:
Smokers: x̄1 = 5.9, s1 = 3.3, n1 = 75
Ex-smokers: x̄2 = 1.0, s2 = 2.3, n2 = 257
Were the sample data for the two groups approximately normal?
Most likely not for Group 2: with x̄2 = 1.0 and s2 = 2.3, the lowest possible score of 0 is less than half a standard deviation below the mean, indicating right skew
Since the sample sizes are large, this lack of normality is not a problem
95% CI for (μ1 - μ2):

(5.9 - 1.0) \pm 1.985\sqrt{3.3^2/75 + 2.3^2/257} = 4.9 \pm 0.8, or (4.1, 5.7)

We can infer that the population mean HONC score for the smokers is between 4.1 and 5.7 higher than for the ex-smokers
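A sketch of this interval from the summary statistics, using the conservative df rule described earlier (software would use a slightly different df):

```python
# 95% CI for a difference of means from summary statistics (HONC example).
import math
from scipy.stats import t

m1, s1, n1 = 5.9, 3.3, 75     # smokers
m2, s2, n2 = 1.0, 2.3, 257    # ex-smokers

se = math.sqrt(s1**2 / n1 + s2**2 / n2)
df = min(n1 - 1, n2 - 1)              # safe estimate; software computes exact df
t_crit = t.ppf(0.975, df)
diff = m1 - m2
print(f"95% CI: ({diff - t_crit*se:.1f}, {diff + t_crit*se:.1f})")  # about (4.1, 5.7)
```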
Example: Exercise and pulse rate
A study is performed to compare the mean resting pulse rate of adult subjects who exercise regularly to the mean resting pulse rate of those who do not exercise regularly. This is an example of when to use the two-sample t procedures.

                 n    mean   std. dev.
Exercisers      29     66      8.6
Non-exercisers  31     75      9.0

Find a 95% confidence interval for the difference in population means (non-exercisers minus exercisers):

(75 - 66) \pm 2.048\sqrt{9.0^2/31 + 8.6^2/29} = 9 \pm 4.65, or 4.35 to 13.65

Note: we use the safe estimate of df = 29 - 1 = 28 in this calculation, for which t.025 = 2.048
We are 95% confident that the difference in mean resting pulse rates (non-exercisers minus exercisers) is between 4.35 and 13.65 beats per minute.
How can we interpret a confidence interval for a difference of means?
Check whether 0 falls in the interval
When it does, 0 is a plausible value for (μ1 - μ2), meaning that it is possible that μ1 = μ2
A confidence interval for (μ1 - μ2) that contains only positive numbers suggests that (μ1 - μ2) is positive. We then infer that μ1 is larger than μ2
A confidence interval for (μ1 - μ2) that contains only negative numbers suggests that (μ1 - μ2) is negative. We then infer that μ1 is smaller than μ2
Which group is labeled 1 and which is labeled 2 is arbitrary
Significance tests comparing population means:
1. Assumptions:
Quantitative response variable for two groups
Independent random samples
Approximately normal population distributions for each group
This is mainly important for small sample sizes, and even then the two-sided t test is robust
to violations of this assumption
2. Hypotheses:
The null hypothesis is the hypothesis of no difference or no effect: H0: (μ1 - μ2) = 0
The alternative hypothesis:
Ha: (μ1 - μ2) ≠ 0 (two-sided test)
Ha: (μ1 - μ2) < 0 (one-sided test)
Ha: (μ1 - μ2) > 0 (one-sided test)
3. The test statistic is (note the change from z to t in the formula):

t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}

4. P-value: Probability obtained from the t distribution (software supplies the df)
5. Conclusion: Smaller P-values give stronger evidence against H0 and supporting Ha

Example: Does cell phone use while driving impair reaction times? Experiment:
64 college students, 32 randomly assigned to the cell phone group, 32 to control group
Students used a machine that simulated driving situations
At irregular periods a target flashed red or green
Participants were instructed to press a brake button as soon as possible when they
detected a red light
For each subject, the experiment analyzed their
mean response time over all the trials
Averaged over all trials and subjects, the mean
response time for the cell-phone group was 585.2
milliseconds
The mean response time for the control group was
533.7 milliseconds
Boxplot of data: (not shown; as noted below, it reveals an extreme outlier in the cell-phone group)
Test the hypotheses: H0: (μ1 - μ2) = 0 vs. Ha: (μ1 - μ2) ≠ 0, using a significance level of 0.05
Conclusion:
The P-value is less than 0.05, so we can reject H0
There is enough evidence to conclude that the population mean response times differ between the cell phone and control groups
The sample means suggest that the population mean is higher for the cell phone group
What do the box plots tell us?
There is an extreme outlier for the cell phone group
It is a good idea to make sure the results of the analysis aren't affected too strongly by that single observation
Delete the extreme outlier and redo the analysis
In this example, the t-statistic changes only slightly
Insight:
In practice, you should not delete outliers from a data set without sufficient cause (i.e., if it seems the observation was incorrectly recorded)
It is, however, a good idea to check for sensitivity of an analysis to an outlier
If the results change much, it means that the inference including the outlier is on shaky ground
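A sketch of this two-sample t-test from summary statistics; the means and group sizes come from the study description above, but the standard deviations below are invented placeholders, since the text does not report them:

```python
# Welch two-sample t-test from summary statistics (cell-phone example).
from scipy.stats import ttest_ind_from_stats

result = ttest_ind_from_stats(
    mean1=585.2, std1=90.0, nobs1=32,   # cell-phone group (std1 is assumed)
    mean2=533.7, std2=65.0, nobs2=32,   # control group    (std2 is assumed)
    equal_var=False)                    # do not assume equal variances
print(f"t = {result.statistic:.2f}, two-sided P-value = {result.pvalue:.4f}")
```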

Alternative method for Comparing Means:
An alternative t method can be used when, under the null hypothesis, it is reasonable to expect the variability as well as the mean to be the same
This method requires the assumption that the population standard deviations be equal






The Pooled Standard Deviation:
This alternative method estimates the common value σ of σ1 and σ2 by:

s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}
Comparing population means, assuming equal population standard deviations:
Using the pooled standard deviation estimate, a 95% CI for (μ1 - μ2) is:

(\bar{x}_1 - \bar{x}_2) \pm t_{.025}\, s\sqrt{1/n_1 + 1/n_2}

This method has df = n1 + n2 - 2
These methods assume:
- Independent random samples from the two groups
- An approximately normal population distribution for each group (mainly important for small sample sizes, and even then, the CI and the two-sided test are usually robust to violations of this assumption)
- σ1 = σ2
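As an illustration, here is the pooled method applied to the earlier exercise/pulse-rate summary statistics (a sketch; the earlier worked example used the unpooled method instead):

```python
# 95% CI for (mu1 - mu2) using the pooled standard deviation,
# which assumes equal population standard deviations.
import math
from scipy.stats import t

m1, s1, n1 = 75.0, 9.0, 31   # non-exercisers (from the earlier table)
m2, s2, n2 = 66.0, 8.6, 29   # exercisers

s_pooled = math.sqrt(((n1 - 1)*s1**2 + (n2 - 1)*s2**2) / (n1 + n2 - 2))
se = s_pooled * math.sqrt(1/n1 + 1/n2)
t_crit = t.ppf(0.975, n1 + n2 - 2)     # df = n1 + n2 - 2
diff = m1 - m2
print(f"95% CI: ({diff - t_crit*se:.2f}, {diff + t_crit*se:.2f})")
```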

The ratio of proportions: The relative risk
The ratio of proportions for two groups is: p1/p2
In medical applications for which the proportion refers to a category that is an undesirable outcome, such as death or having a heart attack, this ratio is called the relative risk
The ratio describes the sizes of the proportions relative to each other
Recall the Physicians Health Study:
p̂1 = 189/11034 = 0.0171, p̂2 = 104/11037 = 0.0094
Sample relative risk = 0.0171/0.0094 = 1.82
The proportion of the placebo group who had a heart attack was 1.82 times the proportion of the aspirin group who had a heart attack.
Dependent samples:
Each observation in one sample has a matched observation in the other sample
The observations are called matched pairs
Benefits of using dependent samples (matched pairs):
Many sources of potential bias are controlled so we can make a more accurate comparison
Using matched pairs keeps many other factors fixed that could affect the analysis
Often this results in the benefit of smaller standard errors
To compare means with matched pairs, use paired differences:
For each matched pair, construct a difference score
d = (reaction time using cell phone) - (reaction time without cell phone)
Calculate the sample mean, x̄d, of these differences
The difference (x̄1 - x̄2) between the means of the two samples equals the mean x̄d of the difference scores for the matched pairs
The difference (μ1 - μ2) between the population means is identical to the parameter μd, the population mean of the difference scores
Confidence interval for dependent samples:
Let n denote the number of observations in each sample; this equals the number of difference scores
The 95% CI for the population mean difference is:

\bar{x}_d \pm t_{.025}(s_d/\sqrt{n})

where x̄d is the sample mean of the difference scores and sd is their standard deviation
Paired difference inferences: These paired-difference inferences are special cases of single-sample inferences about a population mean, so they make the same assumptions
To test the hypothesis H0: μ1 = μ2 of equal means, we can conduct the single-sample test of H0: μd = 0 with the difference scores
The test statistic is:

t = \frac{\bar{x}_d - 0}{s_d/\sqrt{n}}, with df = n - 1
Assumptions:
The sample of difference scores is a random sample from a population of such difference
scores
The difference scores have a population distribution that is approximately normal
This is mainly important for small samples (less than about 30) and for one-sided inferences
Confidence intervals and two-sided tests are robust: They work quite well even if the
normality assumption is violated
One-sided tests do not work well when the sample size is small and the distribution of
differences is highly skewed

Example: Cell phones and driving study
The box plot shows skew to the right for the
difference scores
Two-sided inference is robust to violations of the
assumption of normality
The box plot does not show any severe outliers

Significance test:
H0: μd = 0 (and hence equal population means for the two conditions)
Ha: μd ≠ 0
Test statistic:

t = \frac{50.6}{52.5/\sqrt{32}} = 5.46

where 50.6 is the sample mean of the differences and 52.5 is their standard deviation
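A sketch reproducing this test from the difference-score summaries (n = 32, mean 50.6, standard deviation 52.5); with the raw paired data, scipy.stats.ttest_rel would give the same result:

```python
# Paired t-test from difference-score summary statistics.
import math
from scipy.stats import t

n, d_bar, s_d = 32, 50.6, 52.5
se = s_d / math.sqrt(n)
t_stat = d_bar / se                          # about 5.45
p_value = 2 * t.sf(abs(t_stat), df=n - 1)    # two-sided P-value
print(f"t = {t_stat:.2f}, P-value = {p_value:.6f}")
```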
Comparing proportions with dependent samples: A recent GSS asked subjects whether they
believed in Heaven and whether they believed in Hell:
Belief in Hell
Belief in Heaven Yes No Total
Yes 833 125 958
No 2 160 162
Total 835 285 1120
We can estimate p1 - p2 as: p̂1 - p̂2 = 958/1120 - 835/1120 = 0.11
Note that the data consist of matched pairs.
Recode the data so that for belief in heaven or hell, 1=yes and 0=no
Heaven Hell Interpretation Difference, d Frequency
1 1 believe in Heaven and Hell 1-1=0 833
1 0 believe in Heaven, not Hell 1-0=1 125
0 1 believe in Hell, not Heaven 0-1=-1 2
0 0 do not believe in Heaven or Hell 0-0=0 160
Sample mean of the 1120 difference scores is [0(833) + 1(125) - 1(2) + 0(160)]/1120 = 0.11
Note that this equals the difference in proportions, p̂1 - p̂2
We have converted the two samples of binary observations into a single sample of 1120
difference scores. We can now use single-sample methods with the differences as we did
for the matched-pairs analysis of means.
Confidence interval comparing proportions with matched-pairs data
Use the fact that the sample difference p̂1 - p̂2 is the mean of the difference scores of the recoded data
We can then find a confidence interval for the population mean of difference scores using single-sample methods:

n = 1120, x̄d = 0.1098, sd = 0.3185
95% CI: 0.1098 \pm 1.96(0.3185/\sqrt{1120}) = 0.1098 \pm 0.0187, or (0.091, 0.128)
McNemar Test for Comparing Proportions with Matched-Pairs Data
Hypotheses: H0: p1 = p2; Ha can be one or two sided
Test Statistic: For the two counts of the frequency of yes on one response and no on the other, the z test statistic equals their difference divided by the square root of their sum
P-value: The probability of observing a sample even more extreme than the observed sample
Assumptions: The sum of the counts used in the test should be at least 30, but in practice, the two-sided test works well even if this is not true
Recall the GSS example about belief in Heaven and Hell:
McNemar's Test: z = (125 - 2)/\sqrt{125 + 2} = 10.9; the P-value is approximately 0
Note that this result agrees with the confidence interval for p1 - p2 calculated earlier
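A sketch of the McNemar calculation; only the two discordant counts from the table enter the statistic:

```python
# McNemar z-statistic for the Heaven/Hell matched-pairs data.
import math
from scipy.stats import norm

yes_no = 125   # believe in Heaven, not in Hell
no_yes = 2     # believe in Hell, not in Heaven

z = (yes_no - no_yes) / math.sqrt(yes_no + no_yes)   # about 10.9
p_value = 2 * norm.sf(abs(z))                        # essentially 0
print(f"z = {z:.1f}, two-sided P-value = {p_value:.2g}")
```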
A practically significant difference:
When we find a practically significant difference between two groups, can we identify a
reason for the difference?
Warning: An association may be due to a lurking variable not measured in the study
Control variable:
In a previous example, we saw that teenagers who watch more TV have a tendency later in
life to commit more aggressive acts
Could there be a lurking variable that influences this association?
Perhaps teenagers who watch more TV tend to attain lower educational levels and perhaps
lower education tends to be associated with higher levels of aggression
- We need to measure potential lurking variables and use them in the statistical analysis
- If we thought that education was a potential lurking variable, we would want to measure it
- Including a potential lurking variable in the study changes it from a bivariate study to a multivariate study
- A variable that is held constant in a multivariate analysis is called a control variable
This analysis uses three variables:
- Response variable: Whether the subject
has committed aggressive acts
- Explanatory variable: Level of TV watching
- Control variable: Educational level
Can An Association Be Explained by a Third Variable?
- Treat the third variable as a control variable
- Conduct the ordinary bivariate analysis while holding that control variable constant at fixed values (multivariate analysis)
- Whatever association occurs cannot be due to the effect of the control variable
- At each educational level, the percentage committing an aggressive act is higher for those who watched more TV
- For this hypothetical data, the association observed between TV watching and aggressive acts was not because of education









Chapter 11: Analyzing the Association Between Categorical Variables






Association between categorical variables:
The chi-squared test and measures of association such as (p1 - p2) and p1/p2 are fundamental methods for analyzing contingency tables
The P-value for the χ² statistic summarizes the strength of evidence against H0: independence
If the P-value is small, then we conclude that somewhere in the contingency table the population cell proportions differ from independence
The chi-squared test does not indicate whether all cells deviate greatly from independence or perhaps only some of them do so


Residual analysis
A cell-by-cell comparison of the observed counts with the counts that are expected when H0 is true reveals the nature of the evidence against H0
The difference between an observed and expected count in a particular cell is called a residual
The residual is negative when fewer subjects are in the cell than expected under H0
The residual is positive when more subjects are in the cell than expected under H0
To determine whether a residual is large enough to indicate strong evidence of a deviation from independence in that cell, we use an adjusted form of the residual: the standardized residual
The standardized residual for a cell = (observed count - expected count)/se
A standardized residual reports the number of standard errors that an observed count falls from its expected count
The se describes how much the difference would tend to vary in repeated sampling if the variables were independent
Its formula is complex; software can be used to find its value
A large standardized residual value provides evidence against independence in that cell
Example: To what extent do you consider yourself a religious person?
- Interpret the standardized residuals in the table
- The table exhibits large positive residuals for the cells for females who are very religious and for males who are not at all religious.
- In these cells, the observed count is much larger than the expected count
- There is strong evidence that the population has more subjects in these cells than if the
variables were independent
The table exhibits large negative residuals for the cells for females who are not at all
religious and for males who are very religious
In these cells, the observed count is much smaller than the expected count
There is strong evidence that the population has fewer subjects in these cells than if the
variables were independent
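A sketch of the chi-squared test together with standardized residuals. The counts below are invented for illustration, since the text does not reproduce the actual gender-by-religiosity table:

```python
# Chi-squared test of independence plus standardized residuals.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[170, 340, 90],     # female: very / moderately / not at all
                     [110, 320, 160]])   # male   (hypothetical counts)
chi2, p, df, expected = chi2_contingency(observed)

n = observed.sum()
row_p = observed.sum(axis=1, keepdims=True) / n
col_p = observed.sum(axis=0, keepdims=True) / n
se = np.sqrt(expected * (1 - row_p) * (1 - col_p))   # se for each cell
std_resid = (observed - expected) / se
print(f"chi2 = {chi2:.1f}, P-value = {p:.3g}")
print(np.round(std_resid, 1))   # values beyond about +/-3 are noteworthy
```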
Fisher's exact test:
The chi-squared test of independence is a large-sample test
When the expected frequencies are small, any of them being less than about 5, small-sample tests are more appropriate
Fisher's exact test is a small-sample test of independence
The calculations for Fisher's exact test are complex
Statistical software can be used to obtain the P-value for the test that the two variables are independent
The smaller the P-value, the stronger the evidence that the variables are associated
A famous example is an experiment conducted by Sir Ronald Fisher
His colleague, Dr. Muriel Bristol, claimed that when drinking tea she could tell whether the milk or the tea had been added to the cup first
Experiment: Fisher asked her to taste eight cups of tea:
Four had the milk added first
Four had the tea added first
She was asked to indicate which four had the milk added first
The order of presenting the cups was randomized
Results: She correctly identified three of the four milk-first cups (the outcome consistent with the P-value quoted below)
Analysis: Fisher's exact test gives a one-sided P-value of 0.243
The one-sided version of the test pertains to the alternative that her predictions are better
than random guessing
Does the P-value suggest that she had the ability to predict better than random guessing?
The P-value of 0.243 does not give much evidence against the null hypothesis
The data did not support Dr. Bristol's claim that she could tell whether the milk or the tea had been added to the cup first
Assumptions: Two binary categorical variables; data are random
Hypotheses: H0: the two variables are independent (p1 = p2)
Ha: the two variables are associated (p1 ≠ p2 or p1 > p2 or p1 < p2)
Test Statistic: First cell count (this determines the others given the margin totals)
P-value: Probability that the first cell count equals the observed value or a value even more extreme as predicted by Ha
Conclusion: Report the P-value and interpret in context. If a decision is required, reject H0 when P-value ≤ significance level
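A sketch of the tea-tasting analysis; the 3-correct-out-of-4 table below is the one consistent with the quoted P-value of 0.243:

```python
# Fisher's exact test for the tea-tasting experiment.
from scipy.stats import fisher_exact

#        guessed milk-first, guessed tea-first
table = [[3, 1],    # cups actually milk-first
         [1, 3]]    # cups actually tea-first
odds_ratio, p_one_sided = fisher_exact(table, alternative="greater")
print(f"one-sided P-value = {p_one_sided:.3f}")   # 0.243
```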













Chapter 12: Analyzing Association between Quantitative Variables: Regression Analysis
The first step of a regression analysis is to identify the response and explanatory variables
We use y to denote the response variable and x to denote the explanatory variable
The scatterplot: The first step in answering the question of association is to look at the data
A scatterplot is a graphical display of the relationship between the response variable (y-axis) and the explanatory variable (x-axis)
Example: What do we learn from the scatterplot in the strength study?
The MINITAB output shows the following regression equation:
BP = 63.5 + 1.49 (BP_60)
The y-intercept is 63.5 and the slope is 1.49
The slope of 1.49 tells us that predicted maximum bench press increases by about 1.5 pounds for every additional 60-pound bench press an athlete can do
The regression line equation:
When the scatterplot shows a linear trend, a straight line can be fitted through the data points to describe that trend
The regression line is: ŷ = a + bx
ŷ is the predicted value of the response variable y; a is the y-intercept and b is the slope
Check for outliers by plotting the data. The regression line can be pulled toward an outlier and away from the general trend of points
Influential points: An observation can be influential in affecting the regression line when two things happen:
- Its x-value is low or high compared to the rest of the data
- It does not fall in the straight-line pattern that the rest of the data have
Residuals are prediction errors: The regression equation is often called a prediction equation. The difference between an observed outcome y and its predicted value ŷ is the prediction error, called a residual: residual = y - ŷ
Each observation has a residual; a residual is the vertical distance between the data point and the regression line
We can summarize how near the regression line the data points fall by the sum of squared residuals: \sum(y - \hat{y})^2
The regression line has the smallest sum of squared residuals and is called the least squares line
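A sketch of a least-squares fit in Python; the six data points are invented for illustration, while the strength study's actual fit was ŷ = 63.5 + 1.49x:

```python
# Fitting a least-squares regression line with NumPy (hypothetical data).
import numpy as np

bp_60 = np.array([10, 15, 20, 25, 30, 35])      # x: 60-lb bench presses
bp_max = np.array([80, 85, 95, 100, 110, 115])  # y: maximum bench press (lb)

b, a = np.polyfit(bp_60, bp_max, deg=1)   # slope b, intercept a
y_hat = a + b * bp_60                     # predicted values
residuals = bp_max - y_hat                # prediction errors
print(f"y-hat = {a:.1f} + {b:.2f}x, SSE = {(residuals**2).sum():.1f}")
```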
Regression model: A line describes how the mean of y depends on x
At a given value of x, the equation ŷ = a + bx predicts a single value of the response variable
But we should not expect all subjects at that value of x to have the same value of y; variability occurs in the y values
The regression line connects the estimated means of y at the various x values
In summary, ŷ = a + bx describes the relationship between x and the estimated means of y at the various values of x
The population regression equation, μ_y = α + βx, describes the relationship in the population between x and the means of y
In the population regression equation, α is a population y-intercept and β is a population slope; these are parameters
In practice, we estimate the population regression equation using the prediction equation for the sample data
The population regression equation merely approximates the actual relationship between x and the population means of y; it is a model
A model is a simple approximation for how variables relate in the population
The regression model:
If the true relationship is far from a straight line, this regression model may be a poor one


























Chapter 13: Multiple Regression
Regression models:
The model that contains only two variables, x and y, is called a bivariate model: μ_y = α + βx
Suppose there are two predictors, denoted by x1 and x2; this is called a multiple regression model: μ_y = α + β1x1 + β2x2
The multiple regression model relates the mean μ_y of a quantitative response variable y to a set of explanatory variables x1, x2, ...
Example: For three explanatory variables, the multiple regression equation is:
μ_y = α + β1x1 + β2x2 + β3x3
Example: The sample prediction equation with three explanatory variables is:
ŷ = a + b1x1 + b2x2 + b3x3
Example: Predicting selling price using house and lot size
The data set house selling prices contains observations on 100 home sales in Florida in
November 2003
A multiple regression analysis was done with selling price as the response variable and with
house size and lot size as the explanatory variables
Output from the analysis (regression table not shown) gave the following:
Prediction Equation: ŷ = -10,536 + 53.8x1 + 2.84x2
where ŷ = predicted selling price, x1 = house size and x2 = lot size
One house listed in the data set had house size = 1240 square feet, lot size = 18,000 square feet and selling price = $145,000
Find its predicted selling price: ŷ = -10,536 + 53.8(1240) + 2.84(18,000) ≈ 107,276
Find its residual: y - ŷ = 145,000 - 107,276 = 37,724
The residual tells us that the actual selling price was $37,724 higher than predicted
The number of explanatory variables
You should not use many explanatory variables in a multiple regression model unless you
have lots of data
A rough guideline is that the sample size n should be at least 10 times the number of
explanatory variables
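A sketch reproducing the prediction and residual from the reported equation, plus a least-squares fit on made-up data to show how such coefficients are obtained:

```python
# Multiple regression: prediction, residual, and fitting (NumPy).
import numpy as np

a, b1, b2 = -10_536, 53.8, 2.84         # coefficients reported in the text
y_hat = a + b1*1240 + b2*18_000         # ~107,300 (text: 107,276, unrounded coefs)
print(f"predicted = {y_hat:,.0f}, residual = {145_000 - y_hat:,.0f}")

# Fitting from raw data (columns: house size, lot size) -- hypothetical sales:
X = np.array([[1240, 18000], [2000, 25000], [1600, 12000], [2400, 30000]])
y = np.array([145000, 190000, 140000, 235000])
design = np.column_stack([np.ones(len(y)), X])      # add intercept column
coef, *_ = np.linalg.lstsq(design, y, rcond=None)   # [a, b1, b2]
print("fitted coefficients:", np.round(coef, 1))
```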
Plotting relationships
Always look at the data before doing a multiple regression
Most software has the option of constructing scatterplots on a single graph for each pair of
variables - This is called a scatterplot matrix










Interpretation of multiple regression coefficients:
The simplest way to interpret a multiple regression equation is to look at it in two dimensions, as a function of a single explanatory variable
We can look at it this way by fixing values for the other explanatory variable(s)
Example using the housing data: Suppose we fix x1 = house size at 2000 square feet
The prediction equation becomes:

ŷ = -10,536 + 53.8(2000) + 2.84x2 ≈ 97,022 + 2.84x2

Since the slope coefficient of x2 is 2.84, the predicted selling price increases by $2.84 for every square foot increase in lot size when the house size is 2000 square feet
For a 1000 square-foot increase in lot size, the predicted selling price increases by 1000(2.84) = $2840 when the house size is 2000 square feet
Example using the housing data:
- Suppose we fix x2 = lot size at 30,000 square feet
- The prediction equation becomes:

ŷ = -10,536 + 53.8x1 + 2.84(30,000) ≈ 74,676 + 53.8x1

Since the slope coefficient of x1 is 53.8, for houses with a lot size of 30,000 square feet, the predicted selling price increases by $53.80 for every square foot increase in house size
In summary, an increase of a square foot in house size has a larger impact on the selling
price ($53.80) than an increase of a square foot in lot size ($2.84)
We can compare slopes for these explanatory variables because their units of measurement
are the same (square feet)
Slopes cannot be compared when the units differ
Summarizing the effect while controlling for a variable:
The multiple regression model assumes that the slope for a particular explanatory variable is identical for all fixed values of the other explanatory variables
For example, the coefficient of x1 in the prediction equation ŷ = -10,536 + 53.8x1 + 2.84x2 is 53.8 regardless of whether we plug in x2 = 10,000 or x2 = 30,000 or x2 = 50,000







Slopes in multiple regression and in bivariate regression:
In multiple regression, a slope describes the effect of an explanatory variable while
controlling effects of the other explanatory variables in the model
Bivariate regression has only a single explanatory variable
A slope in bivariate regression describes the effect of that variable while ignoring all other
possible explanatory variables
Importance of multiple regression:
One of the main uses of multiple regression is to identify potential lurking variables and
control for them by including them as explanatory variables in the model
Multiple correlation:
To summarize how well a multiple regression model predicts y, we analyze how well the
observed y values correlate with the predicted y values
The multiple correlation is the correlation between the observed y values and the predicted
y values
- It is denoted by R
For each subject, the regression equation provides a predicted value
Each subject has an observed y-value and a predicted y-value




The correlation computed between all pairs of observed y-values and predicted y-values is
the multiple correlation, R
The larger the multiple correlation, the better are the predictions of y by the set of
explanatory variables
The R-value always falls between 0 and 1
In this way, the multiple correlation R differs from the bivariate correlation r between y
and a single variable x, which falls between -1 and +1
R-squared
For predicting y, the square of R describes the relative improvement from using the prediction equation instead of using the sample mean, ȳ
The error in using the prediction equation to predict y is summarized by the residual sum of squares: \sum(y - \hat{y})^2
The error in using ȳ to predict y is summarized by the total sum of squares: \sum(y - \bar{y})^2
The proportional reduction in error is:

R^2 = \frac{\sum(y - \bar{y})^2 - \sum(y - \hat{y})^2}{\sum(y - \bar{y})^2}

The better the predictions are using the regression equation, the larger R² is
For multiple regression, R² is the square of the multiple correlation, R
Example: How well can we predict house selling prices?
For the 100 observations on y = selling price, x1 = house size, and x2 = lot size, a table called the ANOVA (analysis of variance) table was created
The table displays the sums of squares in the SS column
The R² value can be created from the sums of squares in the table:

R² = (314,433 - 90,756)/314,433 = 0.711

Using house size and lot size together to predict selling price reduces the prediction error by 71%, relative to using ȳ alone to predict selling price
Find and interpret the multiple correlation:

R = \sqrt{R^2} = \sqrt{0.711} = 0.84

There is a strong association between the observed and the predicted selling prices
House size and lot size are very helpful in predicting selling prices
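A one-line check of these numbers (a sketch using only the two sums of squares quoted above):

```python
# R-squared and multiple correlation R from the ANOVA sums of squares.
import math

ss_total = 314_433     # total sum of squares,    sum of (y - ybar)^2
ss_residual = 90_756   # residual sum of squares, sum of (y - yhat)^2

r_squared = (ss_total - ss_residual) / ss_total   # 0.711
R = math.sqrt(r_squared)                          # 0.84
print(f"R^2 = {r_squared:.3f}, R = {R:.2f}")
```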
If we used a bivariate regression model to predict selling price with house size as the predictor, the r² value would be 0.58
If we used a bivariate regression model to predict selling price with lot size as the predictor, the r² value would be 0.51
The multiple regression model has R² = 0.71, so it provides better predictions than either bivariate model






The single predictor in the data set that is most strongly associated with y is the house's real estate tax assessment (r² = 0.679)
When we add house size as a second predictor, R² goes up from 0.679 to 0.730
As other predictors are added, R² continues to go up, but not by much
R² does not increase much after a few predictors are in the model
When there are many explanatory variables but the correlations among them are strong, once you have included a few of them in the model, R² usually doesn't increase much more when you add additional ones
This does not mean that the additional variables are uncorrelated with the response variable
It merely means that they don't add much new power for predicting y, given the values of the predictors already in the model

Properties of R²:
The previous example showed that R² for the multiple regression model was larger than r² for a bivariate model using only one of the explanatory variables
A key factor of R² is that it cannot decrease when predictors are added to a model
R² falls between 0 and 1: the larger the value, the better the explanatory variables collectively predict y
R² = 1 only when all residuals are 0, that is, when all regression predictions are perfect
R² = 0 when the correlation between y and each explanatory variable equals 0
R² gets larger, or at worst stays the same, whenever an explanatory variable is added to the multiple regression model
The value of R² does not depend on the units of measurement


























Chapter 14: Comparing Groups: Analysis of Variance Methods
How can we compare several means? ANOVA
The analysis of variance method compares means of several groups
- Let g denote the number of groups
- Each group has a corresponding population of subjects
- The means of the response variable for the g populations are denoted μ1, μ2, ..., μg
Hypotheses and Assumptions for the ANOVA test:
- The analysis of variance is a significance test of the null hypothesis of equal population means: H0: μ1 = μ2 = ... = μg
- Alternative hypothesis: Ha: At least two of the population means are unequal
The assumptions for the ANOVA test comparing population means are as follows:
1. The population distributions of the response variable for the g groups are normal with the same standard deviation for each group
2. Randomization:
- In a survey sample, independent random samples are selected from the g populations
- In an experiment, subjects are randomly assigned separately to the g groups
Ex: How long will you tolerate being put on hold? Callers heard one of three recorded messages: an advertisement, Muzak, or classical music
Denote the holding time means for the populations that these three random samples represent by:
μ1 = mean for the advertisement
μ2 = mean for the Muzak
μ3 = mean for the classical music
The hypotheses for the ANOVA test are:
H0: μ1 = μ2 = μ3
Ha: at least two of the population means are different
Since the sample means are quite different, can we conclude that the population means differ?
This alone is not sufficient evidence to enable us to reject H0
Variability between groups and within groups
The ANOVA method is used to compare population means
It is called analysis of variance because it uses evidence about two types of variability
Example: Two data sets with equal means but unequal variability
Which case do you think gives stronger evidence against H0: μ1 = μ2 = μ3?
What is the difference between the data in these two cases?
In both cases the variability between pairs of means is the same
In Case b the variability within each sample is much smaller than in Case a
The fact that Case b has less variability within each sample gives stronger evidence against H0
ANOVA F-test statistic:
The analysis of variance (ANOVA) F-test statistic is:

F = (Between-groups variability) / (Within-groups variability)

The larger the variability between groups relative to the variability within groups, the larger the F test statistic tends to be
The test statistic for comparing means has the F distribution
The larger the F-test statistic value, the stronger the evidence against H0
ANOVA F-test for comparing population means of several groups:
1. Assumptions:
- Independent random samples
- Normal population distributions with equal standard deviations
2. Hypotheses: H0: μ1 = μ2 = ... = μg
Ha: at least two of the population means are different
3. Test statistic: F = (Between-groups variability) / (Within-groups variability)
The F sampling distribution has df1 = g - 1 and df2 = N - g (total sample size minus number of groups)
4. P-value: Right-tail probability above the observed F-value
5. Conclusion: If a decision is needed, reject H0 if P-value ≤ significance level (such as 0.05)
The variance estimates and the ANOVA table
- Let σ denote the standard deviation for each of the g population distributions
- One assumption for the ANOVA F-test is that each population has the same standard deviation, σ
- The F-test statistic is the ratio of two estimates of σ², the population variance for each group
- The estimate of σ² in the denominator of the F-test statistic uses the variability within each group
- The estimate of σ² in the numerator of the F-test statistic uses the variability between each sample mean and the overall mean for all the data
- Computer software displays the two estimates of σ² in the ANOVA table
- The MS column contains the two estimates, which are called mean squares
- The ratio of the two mean squares is the F-test statistic
- This F-statistic has a P-value
Example: customers' telephone holding time again:
Since the P-value < 0.05, there is sufficient evidence to reject H0: μ1 = μ2 = μ3
We conclude that a difference exists among the three types of messages in the population mean amount of time that customers are willing to remain on hold
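A sketch of the ANOVA F-test in SciPy; the holding times below are invented stand-ins for the three recorded-message samples (five subjects per group):

```python
# One-way ANOVA F-test for several group means.
from scipy.stats import f_oneway

advertisement = [5, 1, 11, 2, 8]    # minutes on hold (hypothetical)
muzak = [0, 1, 4, 6, 3]
classical = [13, 9, 8, 15, 7]

F, p_value = f_oneway(advertisement, muzak, classical)
print(f"F = {F:.2f}, P-value = {p_value:.4f}")
# df1 = g - 1 = 2 and df2 = N - g = 12 for these samples
```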
