A regression analysis is typically conducted to obtain a model that may be needed for one of the following reasons:
• to explore whether a hypothesis regarding the relationship between the response and predictors is true.
• to estimate a known theoretical relationship between the response and predictors.
The model will then be used for:
• Prediction: the model will be used to predict the response variable from a chosen set of predictors, and
• Inference: the model will be used to explore the strength of the relationships between the response and the predictors.
Therefore, steps in model building may be summarized as follows:
1. Choosing the predictor variables and response variable on which to collect the data.
2. Collecting data. You may be using data that already exists (retrospective), or you may be conducting an
experiment during which you will collect data (prospective). Note that this step is important in determining the
researcher’s ability to claim ‘association’ or ‘causality’ based on the regression model.
3. Exploring the data.
• check for data errors and missing values.
• study the bivariate relationships to reveal other outliers and influential observations, relationships, and
identify possible multicollinearities to suggest possible transformations. (a document was sent to you on
Sept. 21st regarding these topics).
4. Dividing the data into a model-building set and a model-validation set:
• The training set is used to estimate the model.
• The validation set is later used for cross-validation of the selected model.
5. Identify several candidate models:
• Use best subsets regression.
• Use stepwise regression.
6. Evaluate the selected models for violations of the model conditions. The checks below may be performed visually via residual plots as well as with formal statistical tests.
• Check the linearity condition.
• Check for normality of the residuals.
• Check for constant variance of the residuals.
• After time-ordering your data (if appropriate), assess the independence of the observations.
• Check the overall goodness-of-fit of the model. If any of the above checks turn out to be unsatisfactory, then modifications to the model may be needed (such as a different functional form).
Regardless, checking the assumptions of your model as well as the model’s overall adequacy is usually
accomplished through residual diagnostic procedures.
Remember, there is not necessarily only one good model for a given set of data. There may be a few equally
satisfactory models.
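The steps above can be sketched end-to-end in code. Below is a minimal illustration in Python (the handout itself uses SPSS); the data, seed, and split sizes are hypothetical, chosen only to show the training/validation workflow:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 100 observations of one predictor and a response.
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, 100)

# Step 4: divide the data into a model-building (training) set
# and a validation set.
idx = rng.permutation(100)
train, valid = idx[:70], idx[70:]

# Step 5: estimate a candidate model on the training set only.
b1, b0 = np.polyfit(x[train], y[train], deg=1)

# Cross-validation: evaluate predictive ability on the held-out set.
pred = b0 + b1 * x[valid]
mspe = np.mean((y[valid] - pred) ** 2)
print(f"slope = {b1:.2f}, intercept = {b0:.2f}, validation MSPE = {mspe:.2f}")
```

A validation MSPE close to the training-set mean squared error suggests the model generalizes; a much larger value suggests overfitting.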
The following table is a good summary of why checking the assumptions is of vital importance:
http://people.stern.nyu.edu/jsimonof/classes/1305/pdf/regression.class.pdf
The figure on the left summarizes the assumptions
in regression analysis.
CHECKING THE ASSUMPTION OF LINEARITY OF THE REGRESSION RELATIONSHIP
The graph on the left shows a clearly non-linear relationship
between SPI (https://www.socialprogressindex.com/) and GDP
per capita for 2016 (https://data.worldbank.org).
Below on the left is the scatterplot of standardized residuals vs standardized predicted values, where the non-linear nature of the relationship is even more obvious.
Y = -44595.588 + 914.412 (SPI),  R² = 0.742
For simple linear regression models with a single predictor, the
scatter plot of dependent vs independent variable will reveal
the nature of the relationship. The scatterplot of residuals vs
predicted values will provide a clearer picture. For multiple
linear regression models, since there are multiple independent
variables one has to look at residual plots.
Note on partial plots: When conducting multiple regression
analysis, partial residual plots will provide information
regarding the relationship between the response and each of
the predictors as follows:
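A partial residual plot for a given predictor adds that predictor's estimated contribution back onto the raw residuals and plots the sum against the predictor. A minimal Python sketch (illustrative data and variable names only; the handout's own analyses use SPSS):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: the response depends on two predictors.
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 0.5, n)

# Fit the multiple regression y ~ x1 + x2 by least squares.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Partial residuals for x1: raw residuals plus x1's estimated contribution.
# Plotting these against x1 shows the relationship between y and x1
# after adjusting for x2 (e.g. plt.scatter(x1, partial_resid_x1)).
partial_resid_x1 = resid + beta[1] * x1

print(f"estimated coefficient of x1: {beta[1]:.2f}")
```

If the partial residual plot for a predictor shows curvature, a transformation of that predictor is a natural candidate remedy.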
In the case of non-linearity, transformation of the independent variable may remedy the problem. In some cases,
transformation of both variables may be necessary. In any case, before addressing independence, normality and
constancy of variance issues, the non-linearity problem must be fixed. However, while some assumptions may
appear to hold prior to applying a transformation, they may no longer hold once a transformation is applied. In
other words, using transformations is part of an iterative process where all the assumptions are re-checked after
each iteration. Some common transformations and their effect are described below.
Square Transformation of X: Spreads out the high X values relative to smaller X values. Try with the data below:
first, plot Y vs X values, then take the square of X values and plot Y vs X2.
Before transformation of X      After transformation of X
 X     Y                         X²     Y
 0     2                          0     2
 1     3                          1     3
 2     6                          4     6
 3    11                          9    11
 4    18                         16    18
 5    27                         25    27
 6    38                         36    38
 7    51                         49    51
 8    66                         64    66
 9    83                         81    83
10   102                        100   102
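The effect of the square transformation on the table above can be checked numerically. In this data Y equals X² + 2 exactly, so Y is perfectly linear in X² but only approximately linear in X. A short Python check (illustrative only; the handout uses SPSS for plotting):

```python
import numpy as np

# The data from the table above: Y = X**2 + 2 exactly.
x = np.arange(11, dtype=float)
y = np.array([2, 3, 6, 11, 18, 27, 38, 51, 66, 83, 102], dtype=float)

r_before = np.corrcoef(x, y)[0, 1]      # correlation of Y with X
r_after = np.corrcoef(x**2, y)[0, 1]    # correlation of Y with X**2

print(f"r(X, Y)   = {r_before:.4f}")
print(f"r(X^2, Y) = {r_after:.4f}")     # exactly 1: a perfect straight line
```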
1/X (Inverse) transformation of X: Compresses large X values relative to the smaller X values, to a greater extent
than the log transformation. Values of X less than 1 become greater than 1, and values of X greater than 1 become
less than 1, so the order of the data values is reversed. Try with the data below: first, plot Y vs X values, then take
the inverse of X values and plot Y vs 1/X.
Before transformation of X      After transformation of X
 X      Y                        1/X     Y
0.20   10.0                      5.000  10.0
0.33   51.0                      3.030  51.0
0.55   79.0                      1.818  79.0
0.70   87.0                      1.429  87.0
0.90   94.0                      1.111  94.0
1.10   99.0                      0.909  99.0
1.30  103.0                      0.769  103.0
1.50  106.0                      0.667  106.0
1.70  108.0                      0.588  108.0
1.85  109.0                      0.541  109.0
1.95  109.7                      0.513  109.7
Log Transformation: Compresses high X values relative to smaller X values. Note that the log function can only be
applied to values that are greater than 0. Try with the data below: first, plot Y vs X values, then take the natural log
of X values and plot Y vs ln(X).
Before transformation of X      After transformation of X
 X      Y                        ln(X)   Y
0.20   10.0                      -1.61  10.0
0.33   33.0                      -1.11  33.0
0.55   55.0                      -0.60  55.0
0.70   66.0                      -0.36  66.0
0.90   78.0                      -0.11  78.0
1.10   86.0                       0.10  86.0
1.30   94.0                       0.26  94.0
1.50  100.0                       0.41  100.0
1.70  106.0                       0.53  106.0
1.85  110.0                       0.62  110.0
1.95  113.0                       0.67  113.0
The square root transformation has a similar effect.
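The improvement from the log transformation of the data above can also be checked numerically. A short Python illustration (the handout itself uses SPSS for plotting):

```python
import numpy as np

# The data from the table above: Y is very nearly linear in ln(X).
x = np.array([0.20, 0.33, 0.55, 0.70, 0.90, 1.10, 1.30, 1.50, 1.70, 1.85, 1.95])
y = np.array([10, 33, 55, 66, 78, 86, 94, 100, 106, 110, 113], dtype=float)

r_before = np.corrcoef(x, y)[0, 1]          # correlation of Y with X
r_after = np.corrcoef(np.log(x), y)[0, 1]   # correlation of Y with ln(X)

print(f"r(X, Y)     = {r_before:.4f}")
print(f"r(ln X, Y)  = {r_after:.4f}")       # very close to 1
```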
In summary:
If the relationship looks as below, then perform a square root transformation (X'= SQRT(X))
of the independent variable.
If the relationship looks as below, then perform a reciprocal (inverse) transformation
(X' = 1 / X) of the independent variable.
If the relationship looks as below, then, again, perform a reciprocal (inverse) transformation
(X' = 1 / X) of the independent variable.
If the relationship looks as below, perform a log transformation (X'=log X) of the
independent variable.
If the relationship looks as below, then, again, perform a log transformation (X'=log X) of the
independent variable.
If transformation of the independent variable (X) alone fails to meet the linearity assumption, a log transformation of both the dependent and independent variables will most likely linearize the relationship.
A Note on Data Transformations – Tukey’s Ladder of Power
Note that the transformations above are 'power' transformations of X. In general, a transformation of a variable (Y or any of the X variables) is usually chosen from the "power family" of transformations (referred to as Tukey's Ladder of Powers), X' = X^λ, with common choices λ = 2, 1, 1/2, 0 (taken to mean ln X), -1/2, and -1:
In the window for transforming a variable (above, on the right), create a new target variable name, choose how you
want to create it (such as taking ln of SPI), then click ‘OK’. A new variable will be added to your data set with the
name you defined. Now you can perform regression using the newly created variable by the transformation of an
existing variable.
A Note on Natural Log Transformation
The default logarithmic transformation involves taking the natural logarithm — ln or loge or simply, log — of each
data value. The general characteristics of the natural logarithmic function are:
• The natural logarithm of x is the power to which e = 2.718282... must be raised in order to get x, i.e., ln(e^x) = x. For example, the natural logarithm of 5 is the power to which e = 2.718282... should be raised in order to obtain 5. Since 2.718282^1.60944 is approximately 5, the natural logarithm of 5 is 1.60944, or ln(5) = 1.60944.
• The natural logarithm of e is equal to one, that is, ln(e) = 1, and the natural logarithm of 1 is zero, that is, ln(1) = 0.
The effects of taking the natural logarithmic transformation are:
• Small values that are close together are spread further out.
• Large values that are spread out are brought closer together.
• And: if a variable grows exponentially, its logarithm grows linearly.
A different kind of logarithm, such as log base 10 or log base 2 could be used. However, the natural logarithm —
which can be thought of as log base e where e is the constant 2.718282... — is the most common logarithmic scale
used in scientific work.
When both variables are transformed, the regression equation becomes: ln(Y) = a + b ln(X),
where a and b are the intercept and the regression coefficient, respectively. Regression analysis is then conducted on the transformed data. Then, to visualize the actual relationship between Y and X, back-transformation of the transformed variables is needed, as follows:

Y = e^(a + b ln(X))
EXAMPLE: Data involving 'the period of revolution' and 'distance from the sun' (in astronomical units, AU) of the nine planets are given in the table below.

          period of      distance from
          revolution     the sun (AU)
mercury      0.241          0.387
venus        0.615          0.723
earth        1.000          1.000
mars         1.881          1.524
jupiter     11.862          5.203
saturn      29.456          9.539
uranus      84.070         19.191
neptune    164.810         30.061
pluto      248.530         39.529

• First plot the raw data.
• Then transform only distance to ln(distance), plot the data.
• Then transform only period to ln(period), plot the data.
• Finally, transform both variables, and plot ln(period) vs ln(distance).
• Find the regression relationship after both variables are transformed.

The regression equation is:
ln(period) = 0.0002544 + 1.49986 ln(distance)

• To predict the period of revolution of planet Eris, whose distance from the sun is 102.15 AU:
ln(period) = 0.0002544 + 1.49986 ln(102.15) = 6.939
• To view the predicted period of revolution in original units:
Period = e^6.939, so Period ≈ 1032 years.
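The ln-ln regression and the back-transformed prediction above can be reproduced with a few lines of Python (illustrative; the handout's own workflow uses SPSS):

```python
import math
import numpy as np

# Period of revolution (years) and distance from the sun (AU), nine planets.
period = np.array([0.241, 0.615, 1.000, 1.881, 11.862,
                   29.456, 84.070, 164.810, 248.530])
distance = np.array([0.387, 0.723, 1.000, 1.524, 5.203,
                     9.539, 19.191, 30.061, 39.529])

# Regress ln(period) on ln(distance).
b, a = np.polyfit(np.log(distance), np.log(period), deg=1)
print(f"ln(period) = {a:.6f} + {b:.5f} ln(distance)")

# Predict the period for Eris (distance = 102.15 AU), back-transformed.
eris = math.exp(a + b * math.log(102.15))
print(f"predicted period of Eris: {eris:.0f} years")
```

The fitted slope of about 1.5 is exactly what Kepler's third law predicts (period² proportional to distance³).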
An objective criterion for choosing the type of transformation is the R² value. The best transformation should have the highest R² value (some people argue otherwise, and point to the residual distribution as being most important). For the most part, this will be a trial and error process, with the end result being improved precision of your estimates of α and β.
Once a transformation has achieved a linear relationship, any remaining problems with normality or homoscedasticity are addressed by transforming the dependent variable (Y).
Exercise: Use the GDP per capita and SPI.xlsx to conduct a regression analysis where the GDP per capita (of 126
countries) is the dependent variable and their SPI (Social Progress Index) is the predictor (these are real data for
2016). First, try transforming SPI only. If you don’t achieve the desired linearity of relationship by transforming SPI,
try transforming both GDP and SPI. After you achieve a linear relationship, use the regression function to predict
GDP per capita when the SPI is 55.
HOMOSCEDASTICITY (Homogeneity of Variance):
The figure on the left shows a hypothetical, linear relationship
between X and Y, and the regression line passing through the
observations:
Remember: Residual = Y-actual – Y-fitted.
Plotting the residuals against X (or Y-fitted), we obtain a residual
plot as shown in the figure on the left.
The horizontal line running through zero on the Y-axis represents
our regression line, allowing us to visualize the distribution of the
observations around the line. The distribution of the points
around the line remains constant across all values of X. This
suggests that the homogeneity of variance assumption has been
met.
The figure on the left, on the other hand, shows a situation where the distribution of the residuals spreads further out as the value of X (and thus the fitted Y) increases, displaying a wedge pattern. This is the situation we refer to as heteroscedasticity – non-constancy of the residual variance.
This is a common violation of the assumption of homoscedasticity
– residual variance increases proportional to the mean value of Y,
so that the spread of the observations gets wider as the value of X
(and, therefore Y-fitted) increases.
Plotting residuals against X (or Y-fitted) would produce a plot like
the figure on the left, where increasing variability of the residuals
is clearly visible.
Example: Use dataset: realestatedata.xlsx
Regress sales price on square feet only, save standardized
residuals and standardized fitted values.
Then view the scatter plot of zresid vs. zpred. Does the plot
indicate constant variance?
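The wedge pattern described above can be simulated and detected numerically. A minimal Python sketch with hypothetical data (error spread made proportional to X on purpose; not the realestatedata.xlsx exercise):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data where the error spread grows with X (a wedge pattern).
n = 300
x = rng.uniform(1, 10, n)
y = 5.0 + 2.0 * x + rng.normal(0, 0.5 * x)   # error SD proportional to x

# Fit the line and compute residuals.
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# Crude check: compare residual spread in the lower and upper halves of x.
lo = np.std(resid[x < 5.5])
hi = np.std(resid[x >= 5.5])
print(f"residual SD (low x):  {lo:.2f}")
print(f"residual SD (high x): {hi:.2f}")   # noticeably larger: heteroscedasticity
```

A formal alternative to this eyeball comparison would be a test such as Breusch-Pagan, but the residual plot is usually the first diagnostic.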
Transformations of Y are used to stabilize the residual variance. The type of transformation may be selected using the following general guidelines:
• the square root transformation (Y' = sqrt(Y) or Y' = Y^(1/2)), when the variance is proportional to the mean of the estimate of Y,
• the log transformation (Y' = log(Y) or Y' = ln(Y), the λ = 0 case of the power family), when the variance is proportional to the square of the estimate of Y,
• the reciprocal square root transformation (Y' = 1/sqrt(Y) or Y' = Y^(-1/2)), when the variance is proportional to the cube of the estimate of Y, and
• the reciprocal transformation (Y' = 1/Y or Y' = Y^(-1)), when the variance is proportional to Y^4.
Refer to the Note on Transformations above for more detail.
One procedure for estimating an appropriate value of the 'power' to which Y should be raised is the Box-Cox transformation. Box-Cox transformations are a family of power transformations on Y such that Y' = Y^λ, where λ is a parameter to be determined from the data. The maximum likelihood estimate λ̂ is the value of λ for which the SSE is a minimum.
Also note that the transformation of Y involves changing the metric in which the fitted values are analyzed, which
may make interpretation of the results difficult. The resulting relationship must be back-transformed to make
meaningful predictions.
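The Box-Cox estimate of λ can be obtained with SciPy; a minimal sketch on hypothetical right-skewed data (lognormal, so the estimated λ should land near 0, i.e. close to a log transformation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical positive, right-skewed response (lognormal by construction).
y = rng.lognormal(mean=2.0, sigma=0.8, size=500)

# Box-Cox estimates the power lambda by maximum likelihood.
y_transformed, lam = stats.boxcox(y)
print(f"estimated lambda: {lam:.3f}")   # near 0 for lognormal data
```

Remember that predictions made with the transformed model must be back-transformed into the original units, as noted above.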
Use the diamond.xlsx data and apply the appropriate transformation if necessary.
NORMALITY OF RESIDUAL DISTRIBUTION
The dependent and independent variables in a regression model do not need to be normally distributed; only the prediction errors need to be normally distributed. But if the distributions of some of the variables are extremely asymmetric or long-tailed, this may signal a problem, because the calculation of confidence intervals and significance tests for regression coefficients is based on the assumption of normally distributed errors. Therefore, it is important to check whether the error distribution is significantly non-normal.
A significant violation of the normal distribution assumption may indicate:
1. There are a few unusual data points that should be studied closely, and the error distribution is "skewed" by the presence of a few large outliers. (Checking for the presence of unusual data points was covered previously.)
2. There is some other problem with the model assumptions, and/or that the model is not a good fit for the
data.
Normality of the residuals can be checked using:
• The normal probability plot (P-P plot) of residuals, which is produced automatically by SPSS in linear regression. The P-P plot compares the observed cumulative distribution function of the residuals (on the X-axis) with the theoretical cumulative normal distribution function (values ranging from 0 to 1, on the Y-axis). If the residuals were exactly normally distributed, the points would lie on a straight line.
• The normal quantile plot (Q-Q plot) of residuals. The Q-Q plot compares the fractiles of the error distribution with the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on such a plot should fall close to the diagonal reference line.
• Statistical tests for normality, including the Kolmogorov-Smirnov test, the Shapiro-Wilk test, the Jarque-Bera test, and the Anderson-Darling test. Real data rarely have errors that are perfectly normally distributed, and it may not be possible to fit your data with a model whose errors do not violate the normality assumption at the 0.05 level of significance. These tests are quite strict, so it is usually necessary to look at the plots and draw conclusions about the seriousness of the problem.
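The Shapiro-Wilk test (and the Q-Q plot) can be run outside SPSS as well; here is a hedged Python sketch on two hypothetical residual sets, one roughly normal and one deliberately right-skewed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical residuals: one approximately normal set, one right-skewed set.
resid_ok = rng.normal(0.0, 1.0, 200)
resid_skew = rng.exponential(1.0, 200) - 1.0   # mean zero but skewed

w_ok, p_ok = stats.shapiro(resid_ok)
w_skew, p_skew = stats.shapiro(resid_skew)
print(f"normal residuals: W = {w_ok:.3f}, p = {p_ok:.3f}")
print(f"skewed residuals: W = {w_skew:.3f}, p = {p_skew:.5f}")

# A normal Q-Q plot can be produced with, e.g.:
# stats.probplot(resid_ok, dist="norm", plot=matplotlib.pyplot)
```

The skewed set is decisively rejected, illustrating how sharply these tests react to asymmetry; as noted above, with real data the plots should be inspected before declaring the violation serious.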
Example: Use residualskew.xlsx – First regress variable Y on X.
In SPSS, linear regression, in ‘Plots’, check ‘Histogram’ and ‘Normal Probability Plot’.
To obtain a scatterplot, place ‘Standardized Residuals’ in the Y-axis, and ‘Standardized Predicted’ in the X-axis.
Also, in ‘Save’, check ‘Standardized Residuals’,
Then click ‘OK’.
Your data set will now include ‘Standardized Residuals’
as a variable.
Your output will include:
In the main SPSS window, under ‘Analyze’ choose ‘Descriptive Statistics’, then ‘Explore’
In the ‘Explore’ Window, place the
‘standardized residuals’ variable into the
dependents window:
Then the output will show:
In the output, note the Shapiro-Wilk test result, as well as the skewness and kurtosis statistics.
Now repeat the above by regressing Y2 on X2. Compare the results to those from Y and X.
PROCEDURES FOR FINAL MODEL SELECTION
In general, if there are p - 1 predictors, then there are 2^(p-1) alternative models that can be constructed. For example, 10 predictors yield 2^10 = 1024 possible regression models.
Several model selection procedures have been developed to suggest ‘the best subset’ regression model according
to specified criteria. These are briefly summarized below.
• The model with the largest Adjusted R2 (equivalent to selecting the model with the smallest MSE)
• Mallows’ Cp-statistic
Mallows' Cp-statistic selects the subset of variables that has the smallest Cp value – a measure combining the bias and the variance of the fitted model. Note that an unbiased model with p - 1 predictors (p parameters including the intercept) is expected to have a Cp value of about p. Therefore, the simplest model yielding a Cp value close to p is selected.
• Information Criteria
Three information criteria, Akaike's Information Criterion (AIC), the Bayesian Information Criterion (BIC) (which is sometimes called Schwarz's Bayesian Criterion (SBC)), and Amemiya's Prediction Criterion (APC)
are also used in the selection of best subsets. These criteria, in general, penalize models having a large
number of predictors.
Each of the information criteria is used in a similar way—in comparing subset models, the model with the
lower value is preferred.
• PRESS is a method used to assess a model's predictive ability. For a data set of size n, PRESS is calculated by omitting each observation in turn, fitting a regression equation to the remaining n - 1 observations, and using that equation to predict the omitted response value. In general, the smaller the PRESS value, the better the model's predictive ability.
Note that these criteria are often used together where the results of all may be pooled to determine the
best subset of predictors to keep in the regression model.
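PRESS need not be computed by literally refitting n times: for least squares there is an exact shortcut using the hat matrix, e_(i) = e_i / (1 - h_ii). A minimal Python sketch on hypothetical data (one real predictor plus one pure-noise predictor):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: y depends on x1 only; x2 is pure noise.
n = 60
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 + rng.normal(0, 1.0, n)

def press(X, y):
    """PRESS: sum of squared leave-one-out prediction errors.
    Uses the identity e_(i) = e_i / (1 - h_ii), so no refitting is needed."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
    resid = y - H @ y
    return np.sum((resid / (1.0 - np.diag(H))) ** 2)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, x2])
p1 = press(X_small, y)
p2 = press(X_big, y)
print(f"PRESS (x1 only):   {p1:.1f}")
print(f"PRESS (x1 and x2): {p2:.1f}")
```

As the text notes, the subset with the smaller PRESS value is preferred, usually alongside the other criteria rather than on its own.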
Note that SPSS has an ‘Automatic Linear Modeling’ procedure – a relatively new procedure, under the ‘Analyze’,
‘Regression’, and ‘Automatic Linear Modeling’.
When the window opens, all variables are in the same list of 'predictors'. Remove the dependent variable (and any predictors that you do not want included in the analysis) from the list, then place the dependent variable in the 'Target' field.
In ‘Basics’, you will see that SPSS has the option of automatically
preparing data – leave it checked.
You can choose any confidence level, for now, leave it as 95%.
In ‘Model Selection’, you have the options ‘Forward Stepwise’ and ‘Best
Subsets’ in addition to all predictors. Choose ‘Best Subsets’.
You will see at the bottom on the right side that there is a choice for
criteria to select the best subsets. Leave it as Information Criterion
(AICC)
Now you can scroll up and down on the boxes on the left to get detailed information about the final selected
model. You will see that some variables’ outliers have been trimmed, and certain variables are manipulated to get
a good fit.
Scroll down to the last box down on the left with letter i on it to summarize the model information.
Automatic Linear Modeling is attractive functionality. However, since you do not really know what SPSS is doing in
the background, for the sake of ‘learning by doing’, I do not recommend using this functionality before completing
the other exercises in the handout.
Stepwise Regression:
This is a procedure to determine which predictor variables should be in the final model, starting with one
variable, and then adding others, one at a time, depending on the extent of their marginal contribution in
accounting for the variation in the predicted variable. The extent of marginal contribution is measured by
the p-value of the t-test about the regression coefficient.
§ Specify an Alpha-to-Enter significance level. This will typically be greater than the usual 0.05 level so that it
is not too difficult to enter predictors into the model. Many software packages set this significance level by
default to αE = 0.15.
§ Specify an Alpha-to-Remove significance level. This will typically be greater than the usual 0.05 level so that
it is not too easy to remove predictors from the model. Again, many software packages set this significance
level by default to αR = 0.15.
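The forward half of this procedure can be sketched in a few lines of Python. This is an illustrative simplification on hypothetical data (a full stepwise procedure would also re-test already-entered predictors against the Alpha-to-Remove level; SPSS handles that automatically):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# Hypothetical data: y depends on predictors 0 and 1; predictor 2 is noise.
n = 100
X = rng.normal(0, 1, (n, 3))
y = 1.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 1.0, n)

def coef_p_values(Xd, y):
    """OLS t-test p-values for each coefficient (Xd includes the intercept)."""
    n, k = Xd.shape
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    beta = XtX_inv @ Xd.T @ y
    resid = y - Xd @ beta
    s2 = resid @ resid / (n - k)
    se = np.sqrt(s2 * np.diag(XtX_inv))
    return 2 * stats.t.sf(np.abs(beta / se), df=n - k)

alpha_enter = 0.15
selected, remaining = [], [0, 1, 2]
while remaining:
    # p-value of each candidate when added to the current model
    pvals = {}
    for j in remaining:
        Xd = np.column_stack([np.ones(n)] + [X[:, i] for i in selected + [j]])
        pvals[j] = coef_p_values(Xd, y)[-1]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= alpha_enter:
        break
    selected.append(best)
    remaining.remove(best)

print("selected predictors:", selected)
```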
Stepwise Regression Example bloodpress.xlsx
Some researchers observed the following data (bloodpress.xlsx) on 20 individuals with high blood pressure:
§ blood pressure (y = BP, in mm Hg)
§ age (x1 = Age, in years)
§ weight (x2 = Weight, in kg)
§ body surface area (x3 = BSA, in sq m)
§ duration of hypertension (x4 = Dur, in years)
§ basal pulse (x5 = Pulse, in beats per minute)
§ stress index (x6 = Stress)
The researchers are interested in determining if a relationship exists between blood pressure and age, weight, body
surface area, duration, pulse rate and/or stress level. Use stepwise regression and state the resulting final model.
MODEL VALIDATION
Most of the time it is difficult to obtain new independent data to validate a regression model. An alternative is to
partition the sample data into a model-building set (where the number of observations should be at least 6 to 10
times the number of predictor variables) which will be used to develop the model, and a validation set, which will
be used to evaluate the predictive ability of the model. This is called cross-validation.
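Cross-validation as described above can be illustrated by comparing a simple and a deliberately over-flexible model on a held-out set. A hedged Python sketch (hypothetical data; the true relationship is a straight line):

```python
import numpy as np

rng = np.random.default_rng(11)

# Hypothetical sample: the true relationship is a straight line.
n = 80
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(0, 2.0, n)

# Partition into a model-building set (~75%) and a validation set.
idx = rng.permutation(n)
tr, va = idx[:60], idx[60:]

line = np.polyfit(x[tr], y[tr], 1)   # simple candidate
poly = np.polyfit(x[tr], y[tr], 5)   # more flexible candidate

def mspe(coefs, i):
    """Mean squared prediction error on the observations indexed by i."""
    return np.mean((y[i] - np.polyval(coefs, x[i])) ** 2)

print(f"line:   train {mspe(line, tr):.2f}   validation {mspe(line, va):.2f}")
print(f"deg-5:  train {mspe(poly, tr):.2f}   validation {mspe(poly, va):.2f}")
```

The flexible model always fits the training set at least as well, but that apparent advantage typically shrinks or reverses on the validation set, which is exactly why the validation set is held back from model building.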
References
Kutner et al., Applied Linear Statistical Models (5th ed.)
https://stats.idre.ucla.edu/spss/seminars/introduction-to-regression-with-spss/introreg-lesson2/
https://onlinecourses.science.psu.edu/stat501/
http://www.cambridge.edu.au/downloads/education/extra/209/PageProofs/Further%20Maths%20TINCP/Core%206.pdf
http://www.basic.northwestern.edu/statguidefiles/linreg_alts.html
http://people.stern.nyu.edu/jsimonof/classes/1305/pdf/regression.class.pdf
http://people.duke.edu/~rnau/testing.htm
http://core.ecu.edu/psyc/wuenschk/StatData/StatData.htm
https://erc.barnard.edu/spss/descriptives_normality