Diagnostics
The theoretical assumptions for the simple linear regression model with one response variable Y and one explanatory variable x are:

- Linearity: yᵢ = β₀ + β₁xᵢ + uᵢ, for i = 1, . . . , n
- Homogeneity: E[uᵢ] = 0, for i = 1, . . . , n
- Homoscedasticity: Var[uᵢ] = σ², for i = 1, . . . , n
- Independence: uᵢ and uⱼ are independent for i ≠ j
- Normality: uᵢ ~ N(0, σ²), for i = 1, . . . , n
We study how to apply diagnostic procedures to test whether these assumptions are appropriate for the available data (xᵢ, yᵢ), based on the analysis of the residuals eᵢ = yᵢ − ŷᵢ.
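As a concrete illustration, the fit and the residuals can be computed directly from the least-squares formulas. A minimal sketch in Python with NumPy (the data are Anscombe's dataset 1, tabulated later in this section):

```python
import numpy as np

# Anscombe's dataset 1 (also used later in this section)
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

# Least-squares estimates of beta_0 and beta_1
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Fitted values and residuals e_i = y_i - yhat_i
y_hat = b0 + b1 * x
e = y - y_hat

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # b0 = 3.00, b1 = 0.50
```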
Diagnostics: scatterplots
Scatterplots
The simplest diagnostic procedure is based on the visual examination of the scatterplot for (xᵢ, yᵢ). Often this simple but powerful method reveals patterns suggesting whether the theoretical model might be appropriate or not. We illustrate its application on a classical example. Consider the four following datasets.
The Anscombe datasets

DATASET 1         DATASET 2         DATASET 3         DATASET 4
  X      Y          X      Y          X      Y          X      Y
 10.0   8.04       10.0   9.14       10.0   7.46        8.0   6.58
  8.0   6.95        8.0   8.14        8.0   6.77        8.0   5.76
 13.0   7.58       13.0   8.74       13.0  12.74        8.0   7.71
  9.0   8.81        9.0   8.77        9.0   7.11        8.0   8.84
 11.0   8.33       11.0   9.26       11.0   7.81        8.0   8.47
 14.0   9.96       14.0   8.10       14.0   8.84        8.0   7.04
  6.0   7.24        6.0   6.13        6.0   6.08        8.0   5.25
  4.0   4.26        4.0   3.10        4.0   5.39       19.0  12.50
 12.0  10.84       12.0   9.13       12.0   8.15        8.0   5.56
  7.0   4.82        7.0   7.26        7.0   6.42        8.0   7.91
  5.0   5.68        5.0   4.74        5.0   5.73        8.0   6.89

SOURCE: F. J. Anscombe, op. cit.
The Anscombe example
The estimated regression model for each of the four previous datasets is the same:

ŷᵢ = 3.0 + 0.5xᵢ,  with n = 11, x̄ = 9.0, ȳ = 7.5, r_xy = 0.817

The estimated standard error of the slope takes the value 0.118 in all four cases, so the corresponding T statistic takes the value T = 0.5/0.118 = 4.237. But the corresponding scatterplots show that the four datasets are quite different. Which conclusions could we reach from these diagrams?
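These shared summary statistics are easy to verify numerically. A minimal sketch (Python with NumPy) for datasets 1 and 2; datasets 3 and 4 give the same values up to rounding:

```python
import numpy as np

x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
# Y columns for datasets 1 and 2 (both share the same X values)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def fit_stats(x, y):
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    e = y - (b0 + b1 * x)
    s_r = np.sqrt(np.sum(e ** 2) / (n - 2))  # residual (quasi-)std deviation
    se_b1 = s_r / np.sqrt(sxx)               # standard error of the slope
    r = np.corrcoef(x, y)[0, 1]
    return b0, b1, se_b1, b1 / se_b1, r

for y in (y1, y2):
    b0, b1, se, t, r = fit_stats(x, y)
    print(f"b0={b0:.2f} b1={b1:.3f} se={se:.3f} T={t:.2f} r={r:.3f}")
# Both lines report b0 ≈ 3.00, b1 ≈ 0.500, se ≈ 0.118, T ≈ 4.24, r ≈ 0.816
```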
Anscombe data scatterplots

[Figure 3-29: scatterplots for the four data sets of Table 3-10 (Data Set 1, Data Set 2, Data Set 3, Data Set 4). SOURCE: F. J. Anscombe, op. cit.]
Residual analysis
Further analysis of the residuals
If the observation of the scatterplot is not sufficient to reject the model, a further step is to use diagnostic methods based on the analysis of the residuals eᵢ = yᵢ − ŷᵢ.

This analysis starts by standardizing the residuals, that is, dividing them by the residuals' (quasi-)standard deviation s_R. The resulting quantities, eᵢ/s_R, are known as standardized residuals.

Under the assumptions of the linear regression model, the standardized residuals are approximately independent standard normal random variables. A plot of these standardized residuals should show no clear pattern.
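A minimal sketch of the standardization step, on simulated data that satisfy the model assumptions (all parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
u = rng.normal(0.0, 2.0, size=x.size)  # errors satisfying the model assumptions
y = 3.0 + 0.5 * x + u                  # illustrative true parameters

# Least-squares fit and residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

# Residuals' (quasi-)standard deviation, dividing by n - 2
s_r = np.sqrt(np.sum(e ** 2) / (x.size - 2))
std_resid = e / s_r

# Under the model assumptions these behave like iid N(0, 1) draws,
# so a plot against x (or against yhat) should show no clear pattern.
print(f"mean = {std_resid.mean():.3f}, sd = {std_resid.std():.3f}")
```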
Residual plots
Several types of residual plots can be constructed. The most common ones are:
- Plot of standardized residuals vs. xᵢ
- Plot of standardized residuals vs. ŷᵢ (the predicted responses)

Deviations from the model hypotheses result in patterns in these plots, which should be visually recognizable.
Heteroscedasticity
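As a hypothetical illustration of this pattern: if the error variance grows with x, the residual plot fans out, and the spread of the residuals differs between the low-x and high-x halves of the sample. A sketch with simulated data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 200)
# Error standard deviation grows with x: violates Var[u_i] = sigma^2
y = 3.0 + 0.5 * x + rng.normal(0.0, 0.3 * x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)

# Rough check: residual spread in the low-x half vs. the high-x half;
# a clearly larger spread on one side suggests heteroscedasticity.
lo = e[x < x.mean()].std()
hi = e[x >= x.mean()].std()
print(f"low-x spread = {lo:.2f}, high-x spread = {hi:.2f}")
```

In a residual plot this shows up as the funnel shape mentioned above; the two-halves comparison is only a quick numeric stand-in for the visual check.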
Outliers
In a plot of the regression line we may observe outlier data, that is, data showing significant deviations from the regression line (or from the remaining data). The parameter estimators for the regression model, β̂₀ and β̂₁, are very sensitive to these outliers. It is important to identify the outliers and make sure that they really are valid data. Statgraphics is able to show, for example, the data that generate "Unusual residuals" or "Influential points".
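One simple way to flag unusual residuals is the rule of thumb |eᵢ/s_R| > 2 (the ±2 cutoff is a common convention, not taken from the source). A sketch on Anscombe's dataset 3, whose outlier at x = 13 lies far above the line traced by the remaining points:

```python
import numpy as np

# Anscombe's dataset 3: one point deviates strongly from the rest
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
s_r = np.sqrt(np.sum(e ** 2) / (x.size - 2))

# Flag "unusual residuals": standardized residual beyond +/- 2
outliers = np.flatnonzero(np.abs(e / s_r) > 2)
print(outliers, x[outliers], y[outliers])  # index 2: the point (13.0, 12.74)
```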
Linearizing transformations

Suppose the response is related to x through y = f(x; a, b), where f is a nonlinear function of x that depends on two parameters a and b (for example, f(x; a, b) = abˣ). In some cases we may apply transformations to the data to linearize them; we are then able to apply the linear regression procedure. From the original data (xᵢ, yᵢ) we obtain the transformed data (xᵢ′, yᵢ′). The parameters β₀ and β₁ corresponding to the linear relation between xᵢ′ and yᵢ′ are transformations of the parameters a and b.
Examples of linearizing transformations:
If y = f(x; a, b) = axᵇ, then log y = log a + b log x. We have

y′ = log y,  x′ = log x,  β₀ = log a,  β₁ = b
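A sketch of this log–log transformation in practice, on data simulated from y = a·xᵇ with illustrative values a = 2.0 and b = 1.5:

```python
import numpy as np

rng = np.random.default_rng(2)
a_true, b_true = 2.0, 1.5          # illustrative parameter values
x = np.linspace(1.0, 10.0, 40)
# Multiplicative noise keeps y > 0, so the log transform is well defined
y = a_true * x ** b_true * np.exp(rng.normal(0.0, 0.05, size=x.size))

# Transformed data: log y = log a + b log x is linear in log x
xp = np.log(x)
yp = np.log(y)
beta1 = np.sum((xp - xp.mean()) * (yp - yp.mean())) / np.sum((xp - xp.mean()) ** 2)
beta0 = yp.mean() - beta1 * xp.mean()

# Back-transform the estimates: a = exp(beta_0), b = beta_1
a_hat, b_hat = np.exp(beta0), beta1
print(f"a_hat = {a_hat:.2f}, b_hat = {b_hat:.2f}")  # close to 2.0 and 1.5
```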