
Single Variable Regression (Part II)

7. Residual Plots
After the curve is fit, it is important to examine whether the fitted curve is reasonable. This is
done using residuals. The residual for a point is the difference between the observed value
and the predicted value, i.e., the residual from fitting a straight line is found as:

residual = Y - predicted Y = Y - (b0 + b1 X)

There are several standard residual plots:


plot of residuals vs predicted values;
plot of residuals vs X;
plot of residuals vs time ordering.

In all cases, the residual plots should show random scatter around zero with no obvious
pattern. Don't plot residuals vs Y - this will lead to odd-looking plots which are an artifact of
the plot and don't mean anything.
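As a minimal sketch of these diagnostics (with illustrative data, not data from this document), the residuals from a straight-line fit can be computed and then scatter-plotted against the predicted values, against X, and against time order:

```python
import numpy as np

# Illustrative data only (not from the tomato experiment below).
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=x.size)

# Least-squares straight-line fit: polyfit returns (slope, intercept).
b1, b0 = np.polyfit(x, y, deg=1)
predicted = b0 + b1 * x
residuals = y - predicted          # observed minus predicted

# These residuals would then be scatter-plotted against `predicted`,
# against `x`, and against the time ordering of the observations;
# each plot should show random scatter around zero.
print(residuals.round(2))
```

A quick numeric sanity check: least-squares residuals sum to (essentially) zero, so any systematic pattern in the plots reflects lack of fit rather than a shifted baseline.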

8. Probability Plots
The probability plot is a graphical technique for assessing whether or not a data set
follows a given distribution such as the normal distribution. The data are plotted against a
theoretical normal distribution in such a way that the points should form approximately a
straight line. Departures from this straight line indicate departures from the specified
distribution.

Page 9 of 15

The points on this plot form a nearly linear pattern, which indicates that the normal
distribution is a good model for this data set.

The normal probability plot is formed by:

Vertical axis: Ordered response values

Horizontal axis: Normal order statistic medians


The observations are plotted as a function of the corresponding normal order statistic
quantiles. In addition, a straight line can be fit to the points and added as a reference line.
The further the points vary from this line, the greater the indication of departures from
normality. The correlation coefficient of the points on the normal probability plot can be
compared to a table of critical values to provide a formal test of the hypothesis that the
data come from a normal distribution.
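In Python, this construction is available as scipy's `probplot`, which pairs the ordered data with normal order-statistic quantiles and fits a reference line (a sketch with simulated data; the returned correlation `r` is the quantity compared against the critical-value table):

```python
import numpy as np
from scipy import stats

# Simulated data for illustration.
rng = np.random.default_rng(7)
sample = rng.normal(loc=5.0, scale=2.0, size=100)

# probplot returns the plotting positions and a fitted reference line;
# osm = theoretical normal quantiles, osr = ordered sample values.
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm")

# For normal data the points (osm, osr) fall close to a straight line,
# so the correlation r is close to 1.
print(f"probability-plot correlation: {r:.4f}")
```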

  n     α = 0.10   α = 0.05   α = 0.01
        .8951      .8734      .8318
        .9033      .8804      .8320
 10     .9347      .9180      .8804
 15     .9506      .9383      .9110
 20     .9600      .9503      .9290
 25     .9662      .9582      .9408
 30     .9707      .9639      .9490
 40     .9767      .9715      .9597
 50     .9807      .9764      .9664
 60     .9835      .9799      .9710
 75     .9865      .9835      .9757
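The formal test can be sketched as a small table lookup: compute the probability-plot correlation and compare it with the critical value for the sample size (the helper below is hypothetical, and only the α = 0.05 column is transcribed from the table above):

```python
import numpy as np
from scipy import stats

# alpha = 0.05 critical values transcribed from the table above.
CRIT_05 = {10: .9180, 15: .9383, 20: .9503, 25: .9582, 30: .9639,
           40: .9715, 50: .9764, 60: .9799, 75: .9835}

def ppcc_normality_check(sample):
    """Hypothetical helper: reject normality at the 5% level when the
    probability-plot correlation falls below the tabled critical value.
    Assumes len(sample) is one of the tabled sample sizes."""
    _, (_slope, _intercept, r) = stats.probplot(sample, dist="norm")
    crit = CRIT_05[len(sample)]
    return r, crit, r < crit        # True means "reject normality"

rng = np.random.default_rng(3)
r, crit, reject = ppcc_normality_check(rng.normal(size=50))
print(r, crit, reject)
```

In practice, sample sizes between tabled values would need interpolation; the sketch sidesteps that by assuming an exact match.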

The normal probability plot is used to answer the following questions:

1. Are the data (meaning the residuals) normally distributed?
2. What is the nature of the departure from normality (data skewed, shorter than
expected tails, longer than expected tails)?


Typical Normal Probability Plot: Normally Distributed Data

The following normal probability plot is from heat flow meter data.

Conclusions

We can make the following conclusions from the above plot.

1. The normal probability plot shows a strongly linear pattern. There are only minor
deviations from the line fit to the points on the probability plot.
2. The normal distribution appears to be a good model for these data.

Discussion

Visually, the probability plot shows a strongly linear pattern. This is verified by the
correlation coefficient of 0.9989 of the line fit to the probability plot. The fact that the
points in the lower and upper extremes of the plot do not deviate significantly from the
straight-line pattern indicates that there are not any significant outliers (relative to a
normal distribution).

In this case, we can quite reasonably conclude that the normal distribution provides an
excellent model for the data. The intercept and slope of the fitted line give estimates of
9.26 and 0.023 for the location and scale parameters of the fitted normal distribution.

Typical Normal Probability Plot: Data Have Short Tails



The following is a normal probability plot for 500 random numbers generated from a
Tukey-Lambda distribution with the λ parameter equal to 1.1.

Conclusions

We can make the following conclusions from the above plot.

1. The normal probability plot shows a non-linear pattern.
2. The normal distribution is not a good model for these data.

Discussion

For data with short tails relative to the normal distribution, the non-linearity of the
normal probability plot shows up in two ways. First, the middle of the data shows an S-like
pattern. This is common for both short and long tails. Second, the first few and the last
few points show a marked departure from the reference fitted line. In comparing this plot
to the long-tail example in the next section, the important difference is the direction of
the departure from the fitted line for the first few and last few points. For short tails,
the first few points show increasing departure from the fitted line above the line and the
last few points show increasing departure from the fitted line below the line. For long
tails, this pattern is reversed.

In this case, we can reasonably conclude that the normal distribution does not provide an
adequate fit for this data set. For probability plots that indicate short-tailed
distributions, the next step might be to generate a Tukey Lambda PPCC plot. The Tukey
Lambda PPCC plot can often be helpful in identifying an appropriate distributional family.


Typical Normal Probability Plot: Data Have Long Tails


The following is a normal probability plot of 500 numbers generated from a double
exponential distribution. The double exponential distribution is symmetric, but relative to
the normal it declines rapidly and has longer tails.

Conclusions

We can make the following conclusions from the above plot.

1. The normal probability plot shows a reasonably linear pattern in the center of the
data. However, the tails, particularly the lower tail, show departures from the fitted line.
2. A distribution other than the normal distribution would be a good model for these data.

Discussion

For data with long tails relative to the normal distribution, the non-linearity of the
normal probability plot can show up in two ways. First, the middle of the data may show an
S-like pattern. This is common for both short and long tails. In this particular case, the S
pattern in the middle is fairly mild. Second, the first few and the last few points show
marked departure from the reference fitted line. In the plot above, this is most noticeable
for the first few data points. In comparing this plot to the short-tail example in the
previous section, the important difference is the direction of the departure from the
fitted line for the first few and the last few points. For long tails, the first few points
show increasing departure from the fitted line below the line and the last few points show
increasing departure from the fitted line above the line. For short tails, this pattern is
reversed.

In this case we can reasonably conclude that the normal distribution can be improved upon
as a model for these data. For probability plots that indicate long-tailed distributions,
the next step might be to generate a Tukey Lambda PPCC plot. The Tukey Lambda PPCC plot can
often be helpful in identifying an appropriate distributional family.


Typical Normal Probability Plot: Data are Skewed Right

Conclusions

We can make the following conclusions from the above plot.

1. The normal probability plot shows a strongly non-linear pattern. Specifically, it shows
a quadratic pattern in which all the points are below a reference line drawn between the
first and last points.
2. The normal distribution is not a good model for these data.

Discussion

This quadratic pattern in the normal probability plot is the signature of a significantly
right-skewed data set. Similarly, if all the points on the normal probability plot fell
above the reference line connecting the first and last points, that would be the signature
pattern for a significantly left-skewed data set.

In this case we can quite reasonably conclude that we need to model these data with a
right-skewed distribution such as the Weibull or lognormal.
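The departure patterns described above can be reproduced with simulated samples (a sketch; the distributions are chosen only to illustrate each shape). The probability-plot correlation drops as the shape moves away from normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 500

samples = {
    "normal":       rng.normal(size=n),
    "short tails":  stats.tukeylambda.rvs(1.1, size=n, random_state=rng),
    "long tails":   rng.laplace(size=n),      # double exponential
    "skewed right": rng.lognormal(size=n),
}

# Correlation of each normal probability plot; values noticeably below
# that of the normal sample signal a departure from normality.
r_values = {}
for name, data in samples.items():
    _, (_slope, _intercept, r) = stats.probplot(data, dist="norm")
    r_values[name] = r
    print(f"{name:>12}: r = {r:.4f}")
```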

9. Example - Yield and fertilizer

We wish to investigate the relationship between yield (Liters) and fertilizer (kg/ha) for
tomato plants. An experiment was conducted in the Schwarz household one summer on 11
plots of land where the amount of fertilizer was varied and the yield measured at the end
of the season.
The amount of fertilizer applied to each plot was chosen between 5 and 18 kg/ha. While
the levels were not systematically chosen (e.g. they were not evenly spaced between the
highest and lowest values), they represent commonly used amounts based on a preliminary
survey of producers.
Interest also lies in predicting the yield when 16 kg/ha are applied. The levels of
fertilizer were randomly assigned to the plots. At the end of the experiment, the yields
were measured and the following data were obtained.


In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the
response variable (Y) is the yield.
The population consists of all possible field plots with all possible tomato plants of this
type grown under all possible fertilizer levels between about 5 and 18 kg/ha.
If all of the population could be measured (which it can't), you could find a relationship
between the yield and the amount of fertilizer applied. This relationship would have the
form:

Yield = β0 + β1 × (amount of fertilizer) + ε

where β0 and β1 represent the true population intercept and slope respectively. The term ε
represents random variation that is always present, i.e. even if the same plot was grown
twice in a row with the same amount of fertilizer, the yield would not be identical (why?).
The population parameters to be estimated are β0 - the true average yield when the
amount of fertilizer is 0 - and β1 - the true average change in yield per unit change in the
amount of fertilizer. These are taken over all plants in all possible field plots of this
type. The values of β0 and β1 are impossible to obtain, as the entire population could never
be measured.
KYPLOT Analysis
Here is the data entered into a KYPLOT data sheet. Note the scale of both variables
(continuous). The ordering of the rows is NOT important; however, it is often easier to
find individual data points if the data is sorted by the X value and the rows for future
predictions are placed at the end of the dataset.


Use the Statistics-> Regression Analysis -> Simple Regression platform to start the
analysis. Specify the Y and X variable as needed.

Then click OK. A new spread sheet will be created that contains the regression results.


At this stage, it would be also useful to draw a scatter plot of the data (refer to previous
KYPLOT tutorials).

The relationship looks approximately linear; there don't appear to be any outliers or
influential points; the scatter appears to be roughly equal across the entire regression
line. Residual plots will be used later to check these assumptions in more detail.
The Fit menu item allows you to fit the least-squares line. The actual fitted line is drawn
on the scatter plot, and the straight line equation coefficients, (here called A1 for the
intercept and A2 for the slope) of the fitted line are printed below the fit spread sheet.


The estimated regression line is:

Estimated yield = 12.856 + 1.101 × (amount of fertilizer)

In terms of estimates, b0 = 12.856 is the estimated intercept, and b1 = 1.101 is the
estimated slope.
The estimated slope is the estimated
change in yield when the amount of
fertilizer is increased by 1 unit. In this
case, the yield is expected to increase
(why?) by 1.10137 L when the fertilizer
amount is increased by 1 kg/ha. NOTE
that the slope is the CHANGE in Y when
X increases by 1 unit - not the value of Y
when X = 1.
The estimated intercept is the estimated yield when the amount of fertilizer is 0. In this
case, the estimated yield when no fertilizer is added is 12.856 L. In this particular case
the intercept has a meaningful interpretation, but I'd be worried about extrapolating
outside the range of the observed X values.
Once again, these are the results from a single experiment. If the experiment were
repeated, you would obtain different estimates (b0 and b1 would change). The sampling
distribution over all possible experiments would describe the variation in b0 and b1 over all
possible experiments. The standard deviation of b0 and b1 over all possible experiments is
again referred to as the standard error of b0 and b1.
The formulae for the standard errors of b0 and b1 are messy and hopeless to compute by
hand. Just like inference for a mean or a proportion, we can obtain estimates of the
standard error from KYPLOT (from the regression results sheet created earlier).


The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an
estimate of the standard deviation of b1 over all possible experiments. Normally, the
intercept is of limited interest, but a standard error can also be found for it as shown in
the above table.
Using exactly the same logic as when we found a confidence interval for the population
mean, a confidence interval for the population slope (β1) is found (approximately) as
b1 ± 2(estimated se). In the above example, an approximate confidence interval for β1 is
found as:

1.101 ± 2(0.132) = 1.101 ± 0.264 = (0.837 to 1.365) L/kg

of fertilizer applied. An exact confidence interval can be computed by KYPLOT as shown
above. The exact confidence interval is based on the t-distribution and is slightly wider
than our approximate confidence interval because the total sample size (11 pairs of points)
is rather small.
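The arithmetic of the approximate and exact intervals can be checked directly (using the estimates quoted above; the exact interval replaces the factor 2 with the t quantile on n − 2 = 9 degrees of freedom):

```python
from scipy import stats

b1, se_b1, n = 1.101, 0.132, 11      # estimates reported above

# Approximate 95% interval: estimate +/- 2 standard errors.
approx = (b1 - 2 * se_b1, b1 + 2 * se_b1)

# Exact 95% interval: the t quantile on n - 2 = 9 df is a bit larger
# than 2, so this interval is slightly wider.
t_crit = stats.t.ppf(0.975, df=n - 2)
exact = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(f"approximate: ({approx[0]:.3f}, {approx[1]:.3f})")
print(f"exact:       ({exact[0]:.3f}, {exact[1]:.3f})")
```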
We interpret this interval as being 95% confident that the true increase in yield when the
amount of fertilizer is increased by one unit is somewhere between 0.837 and 1.365 L/kg.


Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is
computed using b1, but is a confidence interval for β1 - the population parameter that is
unknown.
In linear regression problems, one hypothesis of interest is whether the true slope is zero.
This would correspond to no linear relationship between the response and predictor variables
(why?). In many cases, a confidence interval tells the entire story.
KYPLOT produces a test of the hypothesis that each of the parameters (the slope and the
intercept in the population) is zero. The output is reproduced again below:

The test of hypothesis about the intercept is not of interest (why?).


Let
β1 be the true (unknown) slope.
b1 be the estimated slope. In this case b1 = 1.1014.
The hypothesis testing proceeds as follows. Again note that we are interested in the
population parameters and not the sample statistics:
1. Specify the null and alternate hypothesis:

H0: β1 = 0
HA: β1 ≠ 0

Notice that the null hypothesis is in terms of the population parameter β1. This is a
two-sided test as we are interested in detecting differences from zero in either direction.
2. Find the test statistic and the p-value. The test statistic is computed as:

T = (b1 - 0) / se(b1) = 1.1014 / 0.132 ≈ 8.34

In other words, the estimate is over 8 standard errors away from the hypothesized value!
This will be compared to a t-distribution with n − 2 = 9 degrees of freedom. The p-value is
found to be very small (less than 0.0001).
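This step can be verified numerically (a sketch using the rounded estimates above; `t.sf` gives the upper-tail probability, doubled for a two-sided test):

```python
from scipy import stats

b1, se_b1, n = 1.1014, 0.132, 11

# Test statistic for H0: beta1 = 0.
T = (b1 - 0) / se_b1

# Two-sided p-value from the t distribution with n - 2 = 9 df.
p_value = 2 * stats.t.sf(abs(T), df=n - 2)

print(f"T = {T:.2f}, p-value = {p_value:.2e}")
```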


3. Conclusion. There is strong evidence that the true slope is not zero. This is not too
surprising given that the 95% confidence intervals show that plausible values for the true
slope are from about 0.8 to about 1.4.
It is possible to construct tests of the slope equal to some value other than 0. Most
packages can't do this directly; you would compute the T value as shown above, replacing
the value 0 with the hypothesized value.
If the hypothesis is rejected, a natural question to ask is "well, what values of the
parameter are plausible given this data?" This is exactly what a confidence interval tells
you. Consequently, I usually prefer to find confidence intervals rather than doing formal
hypothesis testing.
What about making predictions for future yields when certain amounts of fertilizer are
applied? For example, what would be the future yield when 16 kg/ha of fertilizer are
applied?
The predicted value is found by substituting the new X into the estimated regression line:

Estimated yield = 12.856 + 1.101 × 16 ≈ 30.47 L

KYPLOT makes it easier to do such a prediction. If you scroll down in the Regression
Results spread sheet, you will find a table with 2 blank spaces to insert X values and
predict Y values.

If you insert 16, you get:

As noted earlier, there are two types of estimates of precision associated with
predictions using the regression line. It is important to distinguish between them as these
two intervals are the source of much confusion in regression problems.
First, the experimenter may be interested in predicting a single FUTURE individual value
for a particular X. This would correspond to the predicted yield for a single future plot
with 16 kg/ha of fertilizer added.


Second, the experimenter may be interested in predicting the average of ALL FUTURE
responses at a particular X. This would correspond to the average yield over all future
plots when 16 kg/ha of fertilizer is added.
The prediction interval for an individual response is sometimes called a "confidence
interval for an individual response", but this is an unfortunate (and incorrect) use of the
term confidence interval. Strictly speaking, confidence intervals are computed for fixed
unknown parameter values; prediction intervals are computed for future random variables.
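The distinction can be made concrete with the standard formulas (a sketch on simulated data, since the study's data table is not reproduced here; the extra "1 +" under the square root is what widens the interval for a single future response):

```python
import numpy as np
from scipy import stats

# Simulated data in the spirit of the example (not the actual table).
rng = np.random.default_rng(5)
x = rng.uniform(5, 18, size=11)
y = 12.0 + 1.1 * x + rng.normal(scale=1.0, size=11)

n = x.size
res = stats.linregress(x, y)
fitted = res.intercept + res.slope * x
s = np.sqrt(((y - fitted) ** 2).sum() / (n - 2))   # residual std. error
Sxx = ((x - x.mean()) ** 2).sum()

x0 = 16.0
yhat0 = res.intercept + res.slope * x0
t_crit = stats.t.ppf(0.975, df=n - 2)

# Interval for the AVERAGE of all future responses at x0:
se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)
# Interval for a SINGLE future response at x0 (note the extra 1):
se_single = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)

ci_mean = (yhat0 - t_crit * se_mean, yhat0 + t_crit * se_mean)
pi_single = (yhat0 - t_crit * se_single, yhat0 + t_crit * se_single)
print("interval for mean response:  ", ci_mean)
print("interval for single response:", pi_single)
```

As expected, the single-response interval always contains the mean-response interval.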
(To be continued).

