7. Residual Plots
After the curve is fit, it is important to examine whether the fitted curve is reasonable. This is
done using residuals. The residual for a point is the difference between the observed value
and the predicted value. For a fitted straight line with intercept b0 and slope b1, the
residual for a point (X, Y) is found as:
residual = Y − (b0 + b1 X)
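As a minimal sketch of this calculation, the snippet below computes residuals for a straight line. The data points and the coefficients b0 and b1 are hypothetical values chosen for illustration (they are not from any data set in this tutorial); the coefficients happen to be the least-squares line for these points, so the residuals sum to zero.

```python
# Hypothetical data and fitted coefficients, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = 0.05, 1.99  # intercept and slope of the fitted line


def residuals(x, y, b0, b1):
    """Observed value minus predicted value for each point."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


res = residuals(x, y, b0, b1)
for xi, ri in zip(x, res):
    print(f"x = {xi:4.1f}   residual = {ri:+.3f}")
```

Plotting these residuals against x (or against the predicted values) gives the residual plot; a reasonable fit shows no systematic pattern.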
8. Probability Plots
The probability plot is a graphical technique for assessing whether or not a data set
follows a given distribution such as the normal distribution. The data are plotted against a
theoretical normal distribution in such a way that the points should form approximately a
straight line. Departures from this straight line indicate departures from the specified
distribution.
The points on this plot form a nearly linear pattern, which indicates that the normal
distribution is a good model for this data set.
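The coordinates of such a plot can be sketched with the Python standard library. The plotting positions (i − 0.375) / (n + 0.25) used here are one common convention and are an assumption; other references use slightly different formulas. The sample is hypothetical. The correlation between the two coordinates is one way to summarize how close the points are to a straight line.

```python
from statistics import NormalDist, mean


def normal_plot_points(data):
    """Pair each theoretical normal quantile with a sorted observation.

    Assumption: plotting positions (i - 0.375) / (n + 0.25); other
    conventions exist and give slightly different quantiles.
    """
    n = len(data)
    ordered = sorted(data)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


def correlation(pairs):
    """Pearson correlation between the two coordinates of each pair."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical sample, for illustration only.
sample = [4.8, 5.1, 5.3, 5.6, 5.9, 6.0, 6.2, 6.5, 6.9, 7.4]
points = normal_plot_points(sample)
r = correlation(points)
print(f"correlation of the probability plot: {r:.4f}")
```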
  n    α = 0.10   α = 0.05   α = 0.01
  ?     .8951      .8734      .8318
  ?     .9033      .8804      .8320
 10     .9347      .9180      .8804
 15     .9506      .9383      .9110
 20     .9600      .9503      .9290
 25     .9662      .9582      .9408
 30     .9707      .9639      .9490
 40     .9767      .9715      .9597
 50     .9807      .9764      .9664
 60     .9835      .9799      .9710
 75     .9865      .9835      .9757
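A table like this can be used as a simple lookup. The sketch below assumes (the text does not say so explicitly) that the entries are critical values for the correlation coefficient of a normal probability plot, so that a correlation below the tabulated value at the chosen significance level is evidence against normality. The function name is hypothetical, and only rows with a known sample size are included.

```python
# Critical values copied from the table, keyed by sample size n.
# Rows whose sample size is missing in the table are omitted.
CRITICAL = {
    10: {0.10: 0.9347, 0.05: 0.9180, 0.01: 0.8804},
    15: {0.10: 0.9506, 0.05: 0.9383, 0.01: 0.9110},
    20: {0.10: 0.9600, 0.05: 0.9503, 0.01: 0.9290},
    25: {0.10: 0.9662, 0.05: 0.9582, 0.01: 0.9408},
    30: {0.10: 0.9707, 0.05: 0.9639, 0.01: 0.9490},
    40: {0.10: 0.9767, 0.05: 0.9715, 0.01: 0.9597},
    50: {0.10: 0.9807, 0.05: 0.9764, 0.01: 0.9664},
    60: {0.10: 0.9835, 0.05: 0.9799, 0.01: 0.9710},
    75: {0.10: 0.9865, 0.05: 0.9835, 0.01: 0.9757},
}


def normality_not_rejected(r, n, alpha=0.05):
    """Return True when r is at or above the tabulated critical value.

    Assumption: small correlations are evidence against normality.
    """
    if n not in CRITICAL:
        raise ValueError(f"no table row for n = {n}")
    return r >= CRITICAL[n][alpha]


print(normality_not_rejected(0.95, 10))        # 0.95 >= 0.9180
print(normality_not_rejected(0.90, 10))        # 0.90 <  0.9180
print(normality_not_rejected(0.90, 10, 0.01))  # 0.90 >= 0.8804
```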
[Figure: Normal Probability Plot]
For data with short tails relative to the normal distribution,
the non-linearity of the normal probability plot shows up in
two ways. First, the middle of the data shows an S-like
pattern. This is common for both short and long tails. Second,
the first few and the last few points show a marked
departure from the reference fitted line. In comparing this
plot to the long tail example in the next section, the
important difference is the direction of the departure from
the fitted line for the first few and last few points. For
short tails, the first few points show increasing departure
from the fitted line above the line and the last few points show
increasing departure from the fitted line below the line. For
long tails, this pattern is reversed.
In this case, we can reasonably conclude that the normal
distribution does not provide an adequate fit for this data
set. For probability plots that indicate short-tailed
distributions, the next step might be to generate a Tukey
Lambda PPCC plot. The Tukey Lambda PPCC plot can often be
helpful in identifying an appropriate distributional family.
This quadratic pattern in the normal probability plot is the
signature of a significantly right-skewed data set. Similarly,
if all the points on the normal probability plot fell above the
reference line connecting the first and last points, that would
be the signature pattern for a significantly left-skewed data
set.
In this case, we can quite reasonably conclude that we need to
model these data with a right-skewed distribution such as the
Weibull or lognormal.
In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the
response variable (Y) is the yield.
The population consists of all possible field plots with all possible tomato plants of this
type grown under all possible fertilizer levels between about 5 and 18 kg/ha.
If all of the population could be measured (which it can't), you could find a relationship
between the yield and the amount of fertilizer applied. This relationship would have the
form:
Y = β0 + β1 X + ε
where β0 and β1 represent the true population intercept and slope respectively. The term ε
represents random variation that is always present, i.e. even if the same plot was grown
twice in a row with the same amount of fertilizer, the yield would not be identical (why?).
The population parameters to be estimated are β0, the true average yield when the
amount of fertilizer is 0, and β1, the true average change in yield per unit change in the
amount of fertilizer. These are taken over all plants in all possible field plots of this type.
The values of β0 and β1 are impossible to obtain as the entire population could never be
measured.
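To see what the random variation term means in practice, here is a small simulation sketch. The intercept, slope, and standard deviation are hypothetical values (the true ones are unknown, as noted above), and the random variation is assumed to be normally distributed. Five plots given the same amount of fertilizer produce five different yields.

```python
import random


def simulate_yield(fertilizer, beta0, beta1, sigma, rng):
    """One simulated yield: straight-line mean plus random variation.

    Assumption: the random variation is normal with mean 0 and
    standard deviation sigma.
    """
    return beta0 + beta1 * fertilizer + rng.gauss(0.0, sigma)


rng = random.Random(1)                 # fixed seed for reproducibility
beta0, beta1, sigma = 10.0, 1.0, 1.5   # hypothetical population values

# The same fertilizer level applied to five plots.
yields = [simulate_yield(12.0, beta0, beta1, sigma, rng) for _ in range(5)]
for i, value in enumerate(yields, start=1):
    print(f"plot {i}: yield = {value:.2f}")
```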
KYPLOT Analysis
Here is the data entered into a KYPLOT data sheet. Note the scale of both variables
(continuous). The ordering of the rows is NOT important; however, it is often easier to
find individual data points if the data is sorted by the X value and the rows for future
predictions are placed at the end of the dataset.
Use the Statistics -> Regression Analysis -> Simple Regression platform to start the
analysis. Specify the Y and X variables as needed.
Then click OK. A new spreadsheet will be created that contains the regression results.
At this stage, it would also be useful to draw a scatter plot of the data (refer to previous
KYPLOT tutorials).
The relationship looks approximately linear; there don't appear to be any outliers or
influential points; the scatter appears to be roughly equal across the entire regression
line. Residual plots will be used later to check these assumptions in more detail.
The Fit menu item allows you to fit the least-squares line. The actual fitted line is drawn
on the scatter plot, and the coefficients of the fitted straight line (here called A1 for the
intercept and A2 for the slope) are printed below the fit spreadsheet.
The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an
estimate of the standard deviation of b1 over all possible experiments. Normally, the
intercept is of limited interest, but a standard error can also be found for it as shown in
the above table.
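For readers who want to see where such a standard error comes from, the sketch below uses the usual simple linear regression formulas: the slope is Sxy / Sxx, and its estimated standard error is s / sqrt(Sxx), where s² is the residual sum of squares divided by n − 2. The data are hypothetical and are not the fertilizer data, so the numbers will not match the KYPLOT table.

```python
from math import sqrt


def slope_and_se(x, y):
    """Least-squares slope and its estimated standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual sum of squares, then residual standard deviation on n - 2 df.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))
    return b1, s / sqrt(sxx)


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b1, se_b1 = slope_and_se(x, y)
print(f"b1 = {b1:.3f}, se(b1) = {se_b1:.4f}")
```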
Using exactly the same logic as when we found a confidence interval for the population
mean, a confidence interval for the population slope (β1) is found (approximately) as
b1 ± 2(estimated se). In the above example, an approximate confidence interval for β1 is found
as
1.101 ± 2(0.132) = 1.101 ± 0.264 = (0.837 to 1.365) L/kg
of fertilizer applied. An exact confidence interval can be computed by KYPLOT as shown
above. The exact confidence interval is based on the t-distribution and is slightly wider
than our approximate confidence interval because the total sample size (11 pairs of points)
is rather small.
We interpret this interval as being 95% confident that the true increase in yield when the
amount of fertilizer is increased by one unit is somewhere between 0.837 and 1.365 L/kg.
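The arithmetic for both kinds of interval can be checked with a few lines of Python. The slope and standard error are the values quoted in the text. The multiplier 2.262 is the standard two-sided 95% critical value of the t-distribution with 9 degrees of freedom, taken from a t-table rather than from the KYPLOT output, so the "exact" interval here is a sketch of the calculation and not a copy of the software's result.

```python
b1 = 1.101      # estimated slope quoted in the text
se_b1 = 0.132   # estimated standard error quoted in the text


def interval(estimate, se, multiplier):
    """Return (lower, upper) = estimate -/+ multiplier * se."""
    half_width = multiplier * se
    return estimate - half_width, estimate + half_width


approx = interval(b1, se_b1, 2.0)    # rule-of-thumb multiplier of 2
exact = interval(b1, se_b1, 2.262)   # t critical value for 11 - 2 = 9 df

print(f"approximate: {approx[0]:.3f} to {approx[1]:.3f}")
print(f"t-based:     {exact[0]:.3f} to {exact[1]:.3f}")
```

The t-based interval is slightly wider than the approximate one because 2.262 is larger than 2.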
Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is
computed using b1, but is a confidence interval for β1, the population parameter that is
unknown.
In linear regression problems, one hypothesis of interest is whether the true slope is zero. This
would correspond to no linear relationship between the response and predictor variable
(why?). In many cases, a confidence interval tells the entire story.
KYPLOT produces a test of the hypothesis that each of the parameters (the slope and the
intercept in the population) is zero. The output is reproduced again below:
In other words, the estimate is over 8 standard errors away from the hypothesized value!
This will be compared to a t-distribution with n − 2 = 9 degrees of freedom. The p-value is
found to be very small (less than 0.0001).
3. Conclusion. There is strong evidence that the true slope is not zero. This is not too
surprising given that the 95% confidence intervals show that plausible values for the true
slope are from about 0.8 to about 1.4.
It is possible to construct tests of the slope equal to some value other than 0. Most
packages can't do this. You would compute the T value as shown above, replacing the value
0 with the hypothesized value.
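A minimal sketch of this calculation follows. It takes the T value to be (estimate − hypothesized value) / (estimated se), which is consistent with the "over 8 standard errors" figure quoted earlier. The hypothesized slope of 1.0 in the second call is an arbitrary value chosen for illustration, and the function name is hypothetical.

```python
def t_statistic(estimate, se, hypothesized=0.0):
    """Number of standard errors between the estimate and the hypothesized value."""
    return (estimate - hypothesized) / se


# Slope and standard error quoted in the text.
t_zero = t_statistic(1.101, 0.132)       # test of slope = 0
t_one = t_statistic(1.101, 0.132, 1.0)   # test of slope = 1 (illustrative)

print(f"T for slope = 0: {t_zero:.2f}")
print(f"T for slope = 1: {t_one:.2f}")
```

Either T value would then be compared to a t-distribution with n − 2 = 9 degrees of freedom.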
If the hypothesis is rejected, a natural question to ask is "well, what values of the
parameter are plausible given this data?" This is exactly what a confidence interval tells
you. Consequently, I usually prefer to find confidence intervals, rather than doing formal
hypothesis testing.
What about making predictions for future yields when certain amounts of fertilizer are
applied? For example, what would be the future yield when 16 kg/ha of fertilizer are
applied?
The predicted value is found by substituting the new X into the estimated regression line.
KYPLOT makes it easier to do such a prediction. If you scroll down in the Regression
Results spreadsheet, you will find a table with 2 blank spaces to insert X values and
predict Y values.
As noted earlier, there are two types of estimates of precision associated with
predictions using the regression line. It is important to distinguish between them as these
two intervals are the source of much confusion in regression problems.
First, the experimenter may be interested in predicting a single FUTURE individual value
for a particular X. This would correspond to the predicted yield for a single future plot
with 16 kg/ha of fertilizer added.
Second, the experimenter may be interested in predicting the average of ALL FUTURE
responses at a particular X. This would correspond to the average yield for all future plots
when 16 kg/ha of fertilizer is added.
The prediction interval for an individual response is sometimes called a confidence interval
for an individual response, but this is an unfortunate (and incorrect) use of the term
confidence interval. Strictly speaking, confidence intervals are computed for fixed
unknown parameter values; prediction intervals are
computed for future random variables.
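The difference between the two intervals can be sketched with the standard simple linear regression formulas. For a new X, the standard error for the mean response is s·sqrt(1/n + (X − X̄)²/Sxx), while the standard error for a single future response is s·sqrt(1 + 1/n + (X − X̄)²/Sxx). The data below are hypothetical (not the fertilizer data), the function name is hypothetical, and 3.182 is the standard two-sided 95% t critical value for 3 degrees of freedom.

```python
from math import sqrt


def prediction_summary(x, y, x_new, t_crit):
    """Return the predicted value and the half-widths of both intervals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))
    spread = 1 / n + (x_new - x_bar) ** 2 / sxx
    y_hat = b0 + b1 * x_new
    mean_half = t_crit * s * sqrt(spread)        # average of all future responses
    indiv_half = t_crit * s * sqrt(1 + spread)   # single future response
    return y_hat, mean_half, indiv_half


# Hypothetical data, for illustration only; n = 5, so n - 2 = 3 df.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat, mean_half, indiv_half = prediction_summary(x, y, x_new=4.0, t_crit=3.182)

print(f"predicted value:            {y_hat:.2f}")
print(f"interval for mean response: +/- {mean_half:.3f}")
print(f"interval for single value:  +/- {indiv_half:.3f}")
```

The interval for a single future value is always the wider of the two, because it includes the variation of an individual response around the mean as well as the uncertainty in the fitted line.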
(To be continued).