
Outline

- Introduction
- The multiple linear regression model
- Underlying assumptions
- Parameter estimation and hypothesis testing
- Residual diagnostics
- Goodness of fit and model selection
- Examples in R


Multiple linear regression is a generalization of the simple linear regression model to more than a single explanatory variable. The main goal is to model the relationship between a random response (or dependent) variable $y$ and several explanatory variables $x_1, x_2, \dots, x_p$, which are usually assumed to be known or under the control of the investigator, i.e. not regarded as random variables at all. In practice, however, the observed values of the explanatory variables are subject to random variation, just like the response variable.

By contrast, in multivariate regression more than one dependent variable is modeled.

This does not imply that some variables are more important than others (though they may be); rather, it means that there is a distinction between the dependent variable and the explanatory variables (also called predictor, controlled, or independent variables).


Let $y_i$ be the value of the response variable for the $i$th individual, and $x_{i1}, x_{i2}, \dots, x_{ip}$ be the $i$th individual's values on the $p$ explanatory variables. The multiple linear regression model is then given by
$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \dots, n,$$

where the residual or error terms $\varepsilon_i$, $i = 1, \dots, n$, are assumed to be independent random variables having a normal distribution with mean zero and constant variance $\sigma^2$. As a consequence, the distribution of the random response variable is also normal, $y \sim N(\mu, \sigma^2)$, with expected value $\mu$ given by
$$\mu = E(y \mid x_1, x_2, \dots, x_p) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p,$$
and variance $\sigma^2$.
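As a concrete illustration, the data-generating process above can be simulated directly. This is a minimal sketch in Python/NumPy (the deck's own examples are in R); all parameter values are hypothetical.

```python
import numpy as np

# Simulate y_i = b0 + b1*x_i1 + b2*x_i2 + eps_i with eps_i ~ N(0, sigma^2);
# the parameter values below are hypothetical, chosen only for illustration.
rng = np.random.default_rng(0)
n = 500
b0, b1, b2, sigma = 1.0, 2.0, -0.5, 0.3

x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)
eps = rng.normal(0.0, sigma, n)     # independent, homoscedastic normal errors

mu = b0 + b1 * x1 + b2 * x2         # E(y | x1, x2): the linear predictor
y = mu + eps                        # the response is normal around mu
```

By construction, `y - mu` recovers the error draws, whose sample mean is close to zero and whose sample standard deviation is close to `sigma`.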


The parameters $\beta_k$, $k = 1, 2, \dots, p$, are known as regression coefficients and give the change in the response variable associated with a unit change in the corresponding explanatory variable, conditional on the other explanatory variables in the model remaining unchanged.

Note

The term linear in multiple linear regression refers to the regression parameters, not to the response or explanatory variables. Consequently, models in which, for example, the logarithm of the response variable is modeled in terms of quadratic functions of some of the explanatory variables are included in this class of models. An example of a nonlinear model is
$$y_i = \beta_1 e^{\beta_2 x_{i1}} + \beta_3 e^{\beta_4 x_{i2}} + \varepsilon_i.$$


The multiple regression model may be written using the matrix representation
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$
where $\mathbf{y} = [y_1, y_2, \dots, y_n]'$, $\boldsymbol{\beta} = [\beta_0, \beta_1, \dots, \beta_p]'$, $\boldsymbol{\varepsilon} = [\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n]'$, and
$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix}.$$

Each row of $\mathbf{X}$ (sometimes known as the design matrix) contains the values of the explanatory variables for one individual in the sample, with a leading 1 to take into account the intercept parameter $\beta_0$.


- It is assumed that the relationship between the variables is linear.
- The design matrix $\mathbf{X}$ must have full rank $p+1$. Otherwise the parameter vector $\boldsymbol{\beta}$ is not identified; at most its value can be narrowed down to some linear subspace of $\mathbb{R}^{p+1}$. For this property to hold, we must have $n > p$, where $n$ is the sample size.
- The independent variables are assumed to be error-free, that is, not contaminated with measurement error.
- The error term is assumed to be a random variable with mean zero conditional on the explanatory variables.
- The errors are uncorrelated (no serial correlation) and have constant variance (homoscedasticity).
- The predictors must be linearly independent, i.e. it must not be possible to express any predictor as a linear combination of the others (otherwise multicollinearity problems arise).

These are sufficient (but not all necessary) conditions for the least-squares estimator to possess desirable properties such as unbiasedness, consistency, and efficiency.

Statistical methods and applications (2009) Multiple Linear Regression Model 7 / 63


While simple linear regression draws the straight line that best fits the data (in the least-squares sense) in the $(x, y)$ plane, multiple linear regression fits a plane to the cloud of data points in a higher-dimensional space. More precisely, with $p$ predictors the regression equation
$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \dots + \hat\beta_p x_{ip}, \quad i = 1, 2, \dots, n,$$
defines a $p$-dimensional hyperplane in a $(p+1)$-dimensional space that minimizes the sum of the squared distances (measured parallel to the $y$ axis) between the hyperplane and the data points.


Properties of LS estimators

Gauss-Markov theorem

In a linear model in which the errors have expectation zero, are uncorrelated, and have equal variances, the Best Linear Unbiased Estimators (BLUE) of the coefficients are the Least-Squares (LS) estimators. More generally, the BLUE of any linear combination of the coefficients is its least-squares estimator. It is noteworthy that the errors are not assumed to be normally distributed, nor are they assumed to be independent (only uncorrelated, a weaker condition), nor identically distributed (only homoscedastic, a weaker condition).


Parameter estimation

The Least-Squares (LS) procedure is used to estimate the parameters of the multiple regression model. Assuming that $\mathbf{X}'\mathbf{X}$ is nonsingular, hence invertible, the LS estimator of the parameter vector $\boldsymbol{\beta}$ is
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
This estimator has the following properties: $E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$ and $\operatorname{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. The diagonal elements of the matrix $\operatorname{cov}(\hat{\boldsymbol{\beta}})$ give the variances of the $\hat\beta_j$, whereas the off-diagonal elements give the covariances between pairs $\hat\beta_j, \hat\beta_k$. The square roots of the diagonal elements of the matrix are thus the standard errors of the $\hat\beta_j$.


In detail

One method of estimating the population parameters is ordinary least squares. This method finds the vector $\boldsymbol{\beta}$ that minimizes the sum of squared residuals, i.e. the function $G(\boldsymbol{\beta})$ given by
$$G(\boldsymbol{\beta}) = \mathbf{e}'\mathbf{e} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).$$
It follows that
$$G(\boldsymbol{\beta}) = \mathbf{y}'\mathbf{y} + \boldsymbol{\beta}'(\mathbf{X}'\mathbf{X})\boldsymbol{\beta} - 2\,\boldsymbol{\beta}'\mathbf{X}'\mathbf{y}.$$

Minimization of this function results in a set of $p+1$ normal equations, which are solved to yield the parameter estimators. The minimum is found by setting the gradient to zero:
$$\mathbf{0} = \nabla G(\boldsymbol{\beta}) = -2\,\mathbf{X}'\mathbf{y} + 2\,(\mathbf{X}'\mathbf{X})\boldsymbol{\beta} \;\Rightarrow\; \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
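Numerically, the closed-form solution and the covariance formula can be verified on simulated data. A sketch in Python/NumPy (hypothetical data; in practice `np.linalg.lstsq` or a QR-based solver is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

# Hypothetical simulated data with p = 2 explanatory variables.
rng = np.random.default_rng(1)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix
beta = np.array([1.0, 2.0, -0.5])                           # true parameters
y = X @ beta + rng.normal(0.0, 0.1, n)

# Normal-equations solution: beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# cov(beta_hat) = sigma^2 (X'X)^{-1}, with sigma^2 estimated by
# s^2 = SSE/(n-p-1); the square roots of its diagonal are the standard errors.
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p - 1)
se = np.sqrt(s2 * np.diag(XtX_inv))
```

With a small error variance, `beta_hat` lands close to the true parameter vector and `se` gives the standard error of each coefficient.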


Variance table

The regression analysis can be assessed using the following analysis of variance (ANOVA) table.

Table: ANOVA table

| Source of Variation | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square |
|---|---|---|---|
| Regression | $SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$ | $p$ | MSR = SSR/$p$ |
| Residual | $SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ | $n-p-1$ | MSE = SSE/$(n-p-1)$ |
| Total | $SST = \sum_{i=1}^n (y_i - \bar{y})^2$ | $n-1$ | |

where $\hat{y}_i$ is the predicted value of the response variable for the $i$th individual and $\bar{y}$ is the mean value of the response variable.


The mean square ratio MSR/MSE provides an $F$-test of the general hypothesis
$$H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0.$$
An estimate of $\sigma^2$ is provided by $s^2$, given by
$$s^2 = \frac{1}{n-p-1} \sum_{i=1}^n (y_i - \hat{y}_i)^2.$$

Under $H_0$, the mean square ratio has an $F$-distribution with $p$ and $n-p-1$ degrees of freedom. On the basis of the ANOVA table, we may calculate the multiple correlation coefficient $R^2$, related to the $F$-test, which gives the proportion of variance of the response variable accounted for by the explanatory variables. We will discuss it in more detail later on.
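The ANOVA quantities and the overall $F$-statistic can be computed directly. A sketch in Python/NumPy on hypothetical simulated data (a p-value would additionally require the $F$ distribution, e.g. from scipy.stats):

```python
import numpy as np

# Hypothetical data: n = 120 observations, p = 3 explanatory variables.
rng = np.random.default_rng(2)
n, p = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, 0.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat

SSE = np.sum((y - y_hat) ** 2)          # residual SS,   df = n - p - 1
SSR = np.sum((y_hat - y.mean()) ** 2)   # regression SS, df = p
SST = np.sum((y - y.mean()) ** 2)       # total SS,      df = n - 1

MSR, MSE = SSR / p, SSE / (n - p - 1)
F = MSR / MSE       # ~ F(p, n-p-1) under H0: all slopes equal zero
s2 = MSE            # the estimate of sigma^2 from the slide above
```

Because the model contains an intercept, the decomposition $SST = SSR + SSE$ holds exactly (up to floating-point error).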


Individual regression coefficients can be assessed using the ratio $\hat\beta_j / SE(\hat\beta_j)$, although these ratios should be used only as rough guides to the significance, or otherwise, of the coefficients. Under the null hypothesis $H_0: \beta_j = 0$, we have
$$\frac{\hat\beta_j}{SE(\hat\beta_j)} \sim t_{n-p-1}.$$

The $t$-statistics are conditional on which explanatory variables are included in the current model. As a consequence, the values of these statistics will change, as will the values of the estimated regression coefficients and their standard errors, as other variables are included in or excluded from the model.
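A sketch of the coefficient $t$-ratios in Python/NumPy (hypothetical data, in which one predictor truly has no effect):

```python
import numpy as np

# Hypothetical data: x1 has a real effect (slope 1.5), x2 has none.
rng = np.random.default_rng(3)
n, p = 150, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.0, 1.5, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p - 1)                   # estimate of sigma^2
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X))) # standard errors

t = beta_hat / se   # under H0: beta_j = 0, each ratio is ~ t_{n-p-1}
```

Here the $t$-ratio for $x_1$ comes out far in the tail of $t_{n-p-1}$, while the one for the irrelevant $x_2$ stays small.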


Two models are nested if both contain the same terms and one has at least one additional term. For example, the model (a)
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$
is nested within model (b)

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2 + \varepsilon.$$

Model (a) is the reduced model and model (b) is the full model. To decide whether the full model is better than the reduced one (i.e., does it contribute additional information about the association between $y$ and the predictors?), we test the hypothesis $H_0: \beta_4 = \beta_5 = 0$ against the alternative that at least one of the additional terms is $\neq 0$.


We denote by $SSE_R$ the SSE of the reduced model and by $SSE_C$ the SSE of the complete model. Since it is always true that $SSE_R \geq SSE_C$, the question becomes: is the drop in SSE from fitting the complete model large enough? To compare a model with $k$ parameters (reduced) against one with $k+q$ parameters (complete or full), that is, to test the hypotheses
$$H_0: \beta_{k+1} = \beta_{k+2} = \dots = \beta_{k+q} = 0 \quad \text{against} \quad H_1: \text{at least one } \beta \neq 0,$$
an $F$-test, defined as
$$F = \frac{(SSE_R - SSE_C)/q}{SSE_C/\left[n - (k + q + 1)\right]},$$
is used, where $q$ is the number of additional $\beta$s. Having chosen an appropriate level $\alpha$, if $F \geq F_{\alpha,\nu_1,\nu_2}$, with $\nu_1 = q$ and $\nu_2 = n - (k + q + 1)$, then $H_0$ is rejected.
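The partial $F$-test can be sketched as follows in Python/NumPy (hypothetical data; the reduced model has $k = 3$ slope terms and the complete model adds $q = 2$ quadratic terms, mirroring models (a) and (b) above):

```python
import numpy as np

# Hypothetical data in which the extra quadratic terms contribute nothing.
rng = np.random.default_rng(4)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

def sse(X, y):
    """Residual sum of squares of the LS fit of y on the columns of X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return float(r @ r)

ones = np.ones(n)
X_red = np.column_stack([ones, x1, x2, x1 * x2])                  # k = 3 slopes
X_full = np.column_stack([ones, x1, x2, x1 * x2, x1**2, x2**2])   # adds q = 2
SSE_R, SSE_C = sse(X_red, y), sse(X_full, y)

q = 2
F = ((SSE_R - SSE_C) / q) / (SSE_C / (n - X_full.shape[1]))
# Compare F with the F(q, n-(k+q+1)) critical value; a large F rejects H0.
```

Since the quadratic terms are truly absent from the data-generating process here, $F$ stays small and the reduced model is retained.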


To complete a regression analysis, one needs to check assumptions such as constant variance and normality of the error terms, since violation of these assumptions may invalidate conclusions based on the regression analysis. Diagnostic plots generally used when assessing model assumptions are discussed below.

- Residuals versus fitted values. If the fitted model is appropriate, the plotted points should lie in an approximately horizontal band across the plot. Departures from this appearance may indicate that the functional form of the assumed model is incorrect or, alternatively, that the variance is not constant.
- Residuals versus explanatory variables. Systematic patterns in these plots can indicate violations of the constant variance assumption or an inappropriate model form.
- Normal probability plot of the residuals. This plot checks the normal distribution assumption on which the statistical inference procedures are based.


A further diagnostic that is often very useful is an index plot of the Cook's distances for each observation. This statistic is defined as
$$D_k = \frac{1}{(1+p)\,s^2} \sum_{i=1}^n \left( \hat{y}_{i(k)} - \hat{y}_i \right)^2,$$

where $\hat{y}_{i(k)}$ is the fitted value of the $i$th observation when the $k$th observation is omitted from the model. The values of $D_k$ assess the influence of the $k$th observation on the estimated regression coefficients. Values of $D_k$ greater than one suggest that the corresponding observation has undue influence on the estimated regression coefficients (cause for concern).
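Because the definition above involves the fit with each observation omitted, Cook's distances can be computed naively with $n$ refits. A sketch in Python/NumPy (hypothetical data with one planted outlier; dedicated software computes the same quantity from leverages without refitting):

```python
import numpy as np

# Hypothetical data; observation 0 is shifted to make it influential.
rng = np.random.default_rng(5)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 1.0, -1.0]) + rng.normal(0.0, 0.5, n)
y[0] += 8.0                                   # plant one outlier

def fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_hat = fit(X, y)
y_hat = X @ beta_hat
s2 = np.sum((y - y_hat) ** 2) / (n - p - 1)   # s^2 from the full fit

D = np.empty(n)
for k in range(n):
    keep = np.arange(n) != k                  # drop observation k
    y_hat_k = X @ fit(X[keep], y[keep])       # fitted values without it
    D[k] = np.sum((y_hat_k - y_hat) ** 2) / ((1 + p) * s2)
```

An index plot of `D` would show the planted outlier towering over the other observations.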


Principle of parsimony

In science, parsimony is the preference for the least complex explanation of an observation: one should always choose the simplest explanation of a phenomenon, the one that requires the fewest leaps of logic (Burnham and Anderson, 2002). William of Occam suggested in the 14th century that one "shave away all that is unnecessary", an aphorism known as Occam's razor. Albert Einstein is supposed to have said: "Everything should be made as simple as possible, but no simpler." According to Box and Jenkins (1970), the principle of parsimony should lead to a model with the smallest number of parameters that adequately represents the data. Statisticians view the principle of parsimony as a bias versus variance tradeoff: usually, the bias of the parameter estimates decreases and their variance increases as the dimension (number of parameters) of the model increases. All model selection methods are based on the principle of parsimony.


As previously said, the relation $SST = SSR + SSE$ holds. The squared multiple correlation coefficient is given by
$$R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}.$$

It is the proportion of variability in a data set that is accounted for by the statistical model, and it gives a measure of the strength of the association between the independent (explanatory) variables and the one dependent variable. It can take any value from 0 to 1: the closer $R^2$ is to one, the stronger the linear association; if it equals zero, there is no linear association between the dependent variable and the independent variables. Another formulation of the statistic $F$ for the entire model is
$$F = \frac{R^2/p}{(1-R^2)/(n-p-1)}.$$


However, since the value of $R^2$ increases when increasing the number of regressors (an overestimation problem), it is convenient to calculate the adjusted version of $R^2$, given by
$$R^2_{adj} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2/(n-p-1)}{\sum_{i=1}^n (y_i - \bar{y})^2/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}.$$

Indeed, since $R^2 = SSR/SST = 1 - SSE/SST$,
$$R^2_{adj} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - \frac{(1-R^2)\,SST/(n-p-1)}{SST/(n-1)} = 1 - (1-R^2)\,\frac{n-1}{n-p-1}.$$
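Both versions of $R^2$ can be computed from the residuals. A sketch in Python/NumPy (hypothetical data):

```python
import numpy as np

# Hypothetical data with p = 3 explanatory variables.
rng = np.random.default_rng(6)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.0, 1.0, 1.0, 1.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
SSE = np.sum((y - y_hat) ** 2)
SST = np.sum((y - y.mean()) ** 2)

R2 = 1.0 - SSE / SST                                # proportion of variance explained
R2_adj = 1.0 - (1.0 - R2) * (n - 1) / (n - p - 1)   # penalizes extra regressors
```

Since $(n-1)/(n-p-1) \geq 1$ whenever $p \geq 1$, the adjusted value never exceeds the raw $R^2$.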


The log-likelihood of the normal model, evaluated at the ML estimates, is given by
$$\ell = -\frac{n}{2}\left(\log(2\pi) + \log(SSE/n) + 1\right).$$
If the object under study is the multiple linear regression model with $p$ explanatory variables and $n$ units, then
$$AIC = -2\ell + 2K = -2\ell + 2(p+1), \quad\text{and}\quad BIC = -2\ell + K\log(n) = -2\ell + (p+1)\log(n).$$


Computing AIC in the least squares case

If all models in the set assume normally distributed errors with a constant variance, then AIC can easily be computed from least-squares regression statistics as
$$AIC = n\log(\hat\sigma^2) + 2K,$$
where $\hat\sigma^2 = SSE/n$ (the ML estimate of $\sigma^2$).

A common mistake when computing AIC is to take the estimate of $\sigma^2$ from the computer output instead of computing the ML estimate above. Moreover, $K$ is the total number of estimated parameters, including the intercept and $\sigma^2$.
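A sketch of the least-squares AIC and BIC computation in Python/NumPy (hypothetical data; following the note above, $K$ here counts the $p$ slopes, the intercept, and $\sigma^2$):

```python
import numpy as np

# Hypothetical data with p = 2 explanatory variables.
rng = np.random.default_rng(7)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
SSE = np.sum((y - X @ beta_hat) ** 2)
sigma2_ml = SSE / n          # the ML estimate, NOT SSE/(n-p-1)

K = p + 2                    # p slopes + intercept + sigma^2 (per the note above)
AIC = n * np.log(sigma2_ml) + 2 * K
BIC = n * np.log(sigma2_ml) + K * np.log(n)
```

With $n = 100$, $\log(n) > 2$, so BIC penalizes model size more heavily than AIC.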


Mallows Cp statistic

Mallows' $C_p$ statistic is defined as
$$C_p = \frac{SSE_p}{s^2} - (n - 2p),$$

where $SSE_p$ is the residual sum of squares from a regression model containing a certain set of $p-1$ of the explanatory variables plus an intercept, and $s^2$ is the estimate of $\sigma^2$ from the model that includes all the explanatory variables under consideration.

- $C_p$ is an unbiased estimator of the mean squared prediction error.
- If $C_p$ is plotted against $p$, the subsets of variables ensuring a parsimonious model are those lying close to the line $C_p = p$.
- In this plot, the value $p$ is (roughly) the contribution to $C_p$ from the variance of the estimated parameters, whereas the remainder $C_p - p$ is (roughly) the contribution from the bias of the model.
- The $C_p$ plot is a useful device for evaluating the $C_p$ values of a range of models (Mallows, 1973, 1995; Burman, 1996).
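A sketch of Mallows' $C_p$ in Python/NumPy (hypothetical data; for the model containing all candidate variables, $C_p$ equals the number of parameters by construction):

```python
import numpy as np

# Hypothetical data: three candidate predictors, the third is irrelevant.
rng = np.random.default_rng(8)
n = 100
Z = rng.normal(size=(n, 3))
y = 1 + 2 * Z[:, 0] - Z[:, 1] + rng.normal(size=n)

def sse(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return float(r @ r)

ones = np.ones(n)
X_full = np.column_stack([ones, Z])
s2 = sse(X_full, y) / (n - X_full.shape[1])   # sigma^2 from the full model

def mallows_cp(X_sub, y):
    p = X_sub.shape[1]                        # parameters in the submodel
    return sse(X_sub, y) / s2 - (n - 2 * p)

cp_full = mallows_cp(X_full, y)                              # equals 4 exactly
cp_good = mallows_cp(np.column_stack([ones, Z[:, :2]]), y)   # true model, Cp near 3
cp_bad = mallows_cp(np.column_stack([ones, Z[:, 1:]]), y)    # drops the strong predictor
```

On a $C_p$-versus-$p$ plot, the submodel that drops the strong predictor would sit far above the line $C_p = p$, flagging its bias.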


To summarize

Rules of thumb

There are a number of measures, based on the entire estimated equation, that can be used to compare two or more alternative specifications and select the best one according to that specific criterion.

- $R^2_{adj}$: higher is better.
- AIC: lower is better.
- BIC: lower is better.
- $C_p$: low values (close to the line $C_p = p$) indicate the best models to consider.


We recall that with $p$ covariates, $M = 2^p - 1$ potential models can be formed.

- Forward selection starts with an initial model that contains only a constant term, and successively adds explanatory variables from the pool of candidates until a stage is reached where none of the remaining candidates, if added to the current model, would contribute statistically important information about the expected value of the response. This is generally decided by comparing a function of the decrease in the residual sum of squares with a threshold value set by the investigator.
- Backward elimination starts with all the variables in the model and drops the least significant one at a time, until only significant variables are retained.
- Stepwise regression is a mixture of forward selection and backward elimination: it proceeds by forward selection, but drops variables that are no longer significant after the introduction of new variables.
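A minimal forward-selection sketch in Python/NumPy (hypothetical data; the stopping rule uses a rough fixed threshold on the partial $F$-statistic rather than an exact $F$ quantile):

```python
import numpy as np

# Hypothetical data: 4 candidate predictors, only columns 0 and 2 matter.
rng = np.random.default_rng(9)
n = 150
Z = rng.normal(size=(n, 4))
y = 2.0 * Z[:, 0] - Z[:, 2] + rng.normal(size=n)

def sse(X, y):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return float(r @ r)

selected = []
pool = list(range(Z.shape[1]))
X_cur = np.ones((n, 1))                  # start from the intercept-only model
while pool:
    cur_sse = sse(X_cur, y)
    # Candidate giving the largest drop in SSE at this step.
    drop, j = max((cur_sse - sse(np.column_stack([X_cur, Z[:, [j]]]), y), j)
                  for j in pool)
    new_X = np.column_stack([X_cur, Z[:, [j]]])
    F = drop / (sse(new_X, y) / (n - new_X.shape[1]))  # partial F for the new term
    if F < 4.0:                          # rough threshold, not an exact quantile
        break
    selected.append(j)
    pool.remove(j)
    X_cur = new_X
```

The strongest predictor enters first, and the procedure stops once no remaining candidate clears the threshold.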


Multicollinearity occurs when variables are so highly correlated that it is difficult to obtain reliable estimates of their individual regression coefficients. It leads to the following problems.

- The variances of the regression coefficients are inflated. As a consequence, the fitted model is less stable and the parameter estimates become inaccurate.
- The power of significance tests for the regression coefficients decreases, thus leading to Type II errors.
- The $R^2$ of the estimated regression can be large even if none of the coefficients is significant, since the explanatory variables are largely attempting to explain much of the same variability in the response variable (see Dizney and Gromen, 1967).
- The effects of the explanatory variables are confounded by their intercorrelations, hence it is difficult to determine the importance of a given explanatory variable.
- The removal of a single observation may greatly affect the estimated coefficients.


Examine the correlations between explanatory variables (helpful but not infallible). Evaluate the variance inflation factors of the explanatory variables. The VIF for the jth variable is VIF_j = 1/(1 − R_j²), where R_j² is the square of the multiple correlation coefficient from the regression of the jth explanatory variable on the remaining explanatory variables. VIF_j indicates the strength of the linear relationship between the jth variable and the remaining explanatory variables; a rough rule of thumb is that VIF_j > 10 gives some cause for concern. Combine in some way explanatory variables that are highly correlated, or select one of the set of correlated variables. Use more complex approaches, e.g. principal component regression (Jolliffe, 1982) and ridge regression (Chatterjee and Price, 2006; Brown, 1994).
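As a minimal numerical illustration of the VIF formula (toy data, and deliberately only two predictors, so that R_j² reduces to the squared correlation between the two columns and VIF_1 = VIF_2 = 1/(1 − r²)):

```python
def correlation(x, z):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz / (sxx * szz) ** 0.5

def vif_two_predictors(x, z):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = correlation(x, z)
    return 1.0 / (1.0 - r * r)

# Nearly collinear columns give a VIF far above the rule-of-thumb cutoff of 10.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
z = [1.1, 2.0, 3.1, 3.9, 5.2]   # almost a copy of x
print(vif_two_predictors(x, z))
```

With more than two predictors each R_j² must come from a full auxiliary regression of x_j on the others, but the interpretation of the resulting VIF_j is the same.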



Polynomial regression is a generalization of linear regression in which the relationship between the independent variable x and the dependent variable y is modelled as an nth-order polynomial. Dummy variables (or indicator variables) take the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome. They are useful to represent categorical variables (e.g., sex, employment status) and may be entered as regressors in the linear regression model, giving rise to dummy-variable regression.
http://socserv.mcmaster.ca/jfox/Courses/soc740/lecture-5.pdf
http://www.slideshare.net/lmarsh/dummy-variable-regression
Cohen, J. (1968) Multiple regression as a general data-analytic system. Psychological Bulletin 70, 426-443.
Harrell, F. E. (2001) Regression modeling strategies (chapter 2). http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RmS/rms.pdf
Weisberg, S. (2005) Applied Linear Regression, 3rd edition. John Wiley & Sons, New York.
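A tiny sketch of what a dummy regressor does (the 0/1 coding and the toy responses below are hypothetical): regressing y on a single indicator recovers the two group means, since the intercept estimates the mean of the group coded 0 and the slope estimates the difference between the group means.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope of y on a single regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return b0, b1

# sex coded 0 = female, 1 = male (hypothetical coding)
sex = [0, 0, 0, 1, 1, 1]
y   = [4.0, 5.0, 6.0, 7.0, 9.0, 11.0]
b0, b1 = simple_ols(sex, y)
# b0 -> mean of the group coded 0 (5.0); b1 -> difference of group means (9.0 - 5.0 = 4.0)
```

A categorical variable with k levels needs k − 1 such indicators, one level being absorbed into the intercept as the reference category.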


p <- 3   # number of explanatory variables
x1 <- c(1.1, 2.3, 4.5, 6.7, 8.9, 3.4, 5.6, 6.7)
x2 <- c(1.2, 3.4, 5.6, 7.5, 7.5, 6.7, 8.6, 7.6)
x3 <- c(1.4, 5.6, 7.56, 6, 5.4, 6.6, 8.7, 8.7)
y  <- c(1.5, 6.4, 9.6, 8.8, 8.86, 7.8, 8.6, 8.6)
n <- 8   # residual df = n - p - 1
model <- lm(y ~ x1 + x2 + x3, x = TRUE, y = TRUE)
names(model)
mod1 <- lm(y ~ -1 + x1 + x2 + x3)   # without intercept
# adding polynomial terms
mod2 <- lm(y ~ x1 + x2 + x3 + I(x3^2) + I(x3^3))
mod3 <- lm(y ~ x1 + x2 + poly(x3, degree = 3, raw = TRUE))
summary(mod1)
summary(mod2)
summary(mod3)
summary(model)
model$coefficients
model$residuals
model$rank
model$fitted.values
model$x
model$y


first.fitted <- 0.988514333 + 0.422516384*1.1 - 0.001737381*1.2 + 0.716029046*1.4
model$fitted.values[1]
second.fitted <- 0.988514333 + 0.422516384*2.3 - 0.001737381*3.4 + 0.716029046*5.6
model$fitted.values[2]
model1 <- summary.lm(model, correlation = TRUE)
model1$r.squared
model1$adj.r.squared
model1$coefficients
t.alpha <- qt(0.975, (n - p - 1))
t.alpha
model1$coefficients[1, 1] + model1$coefficients[1, 2]*t.alpha
model1$coefficients[1, 1] - model1$coefficients[1, 2]*t.alpha
vcov(model)
sqrt(diag(vcov(model)))
varcov.beta <- (model1$sigma)^2 * solve(t(model$x) %*% model$x)
sqrt(diag(varcov.beta))
confint(object = model, parm = c(1, 2, 3, 4), level = 0.95)
confint(object = model, parm = c(2, 4), level = 0.99)
library(lmtest)   # note: library(), not load(); load() does not attach packages
# df = NULL / Inf: significance tests using t or standard normal quantiles
summary(model)
coeftest(x = model, df = NULL)
coeftest(x = model, df = Inf)


new <- data.frame(x1 = 1.3, x2 = 2.1, x3 = 2.3)   # new observation
s <- summary.lm(object = model)$sigma
res <- predict.lm(object = model, newdata = new, se.fit = TRUE,
                  scale = s, df = Inf, interval = "confidence", level = 0.95)
res$fit
extractAIC(model)
X <- model.matrix(model)
A <- X[, -1]
# leaps: "which" = selected variables, "size" = number of parameters
# criteria: r2 = R2, adjr2 = adjusted R2, Cp = Mallows' Cp
res <- leaps(x = A, y, method = "r2", nbest = 1)     # maximise
names(res)
res$which
res$r2
leaps(x = A, y, method = "adjr2", nbest = 1)         # maximise
leaps(x = A, y, method = "Cp", nbest = 1)            # minimise
step(model)   # variables 1 and 3 are selected
qqnorm(model$res)
qqline(model$res)
shapiro.test(residuals(model))   # under the null hypothesis residuals are normally distributed
hist(residuals(model), 15)


This data set concerns air pollution in the United States. For 41 cities the following variables were recorded.

SO2:    Sulphur dioxide content of air, in micrograms per cubic metre
Temp:   Average annual temperature, in °F
Manuf:  Number of manufacturing enterprises employing 20 or more workers
Pop:    Population size (1970 census), in thousands
Wind:   Average annual wind speed, in miles per hour
Precip: Average annual precipitation, in inches
Days:   Average number of days with precipitation per year

Air Pollution in the U.S. Cities. From Biometry, 2/E, R. R. Sokal and F. J. Rohlf. Copyright © 1969, 1981 by W. H. Freeman and Company.


> library(leaps)
> library(xtable)
> usair <- read.table("usair.txt", header = TRUE, sep = "\t", dec = ".")
> attach(usair)
> Neg.temp <- (-Temp)
> newdata <- data.frame(cbind(SO2, Neg.temp, Manuf, Pop, Wind, Precip, Days))
> attach(newdata)
> model1 <- lm(SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip + Days, data = newdata)
> formula(model1)
SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip + Days
> names(model1)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
> summary(model1)
> xtable(summary(model1))


             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  111.7285     47.3181     2.36    0.0241
Neg.temp       1.2679      0.6212     2.04    0.0491
Manuf          0.0649      0.0157     4.12    0.0002
Pop           -0.0393      0.0151    -2.60    0.0138
Wind          -3.1814      1.8150    -1.75    0.0887
Precip         0.5124      0.3628     1.41    0.1669
Days          -0.0521      0.1620    -0.32    0.7500

# R^2 for model1
# Residual standard error: 14.64 on 34 degrees of freedom
# Multiple R-squared: 0.6695, Adjusted R-squared: 0.6112
# F-statistic: 11.48 on 6 and 34 DF, p-value: 5.419e-07
> model1$coefficients
 (Intercept)     Neg.temp        Manuf          Pop         Wind       Precip         Days
112.72848064 -> 111.72848064   1.26794109   0.06491817  -0.03927674  -3.18136579   0.51235896  -0.05205019


For example, the fitted value for Atlanta (city #10) is then

> Atlanta <- 111.72848064 + 1.26794109*(-61.5) + 0.06491817*368 - 0.03927674*497 +
+   (-3.18136579)*9.1 + 0.51235896*48.34 - 0.05205019*115
> Atlanta
[1] 27.95068
> model1$fitted.values
         1          2          3          4          5          6          7
 -3.789143  28.674536  20.542095  28.694105  56.991475  31.367410  22.078815
         8          9         10         11         12         13         14
  6.927136  11.623630  27.950681 110.542989  22.241636  23.270039   8.194604
        15         16         17         18         19         20         21
 26.111320  20.673981  31.866991  35.121805  45.166185  29.436860  49.432339
        22         23         24         25         26         27         28
 20.235861   5.760100  31.744282  29.618343  46.003663  59.770345  27.116241
        29         30         31         32         33         34         35
 59.682401  30.928374  45.242416  21.415035  28.911046  15.931135   7.715774
        36         37         38         39         40         41
 24.506704  13.806381  31.143835  32.118061  33.717942  33.512572


> anova(model1)
> xtable(anova(model1))

           Df   Sum Sq  Mean Sq  F value  Pr(>F)
Neg.temp    1  4143.33  4143.33    19.34  0.0001
Manuf       1  7230.76  7230.76    33.75  0.0000
Pop         1  2125.16  2125.16     9.92  0.0034
Wind        1   447.90   447.90     2.09  0.1573
Precip      1   785.38   785.38     3.67  0.0640
Days        1    22.11    22.11     0.10  0.7500
Residuals  34  7283.27   214.21


> step(model1)
Start:  AIC=226.37
SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip + Days

           Df  Sum of Sq      RSS    AIC
- Days      1       22.1   7305.4  224.5
<none>                     7283.3  226.4
- Precip    1      427.3   7710.6  226.7
- Wind      1      658.1   7941.4  227.9
- Neg.temp  1      892.5   8175.8  229.1
- Pop       1     1443.1   8726.3  231.8
- Manuf     1     3640.1  10923.4  241.0

Step:  AIC=224.49
SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip

           Df  Sum of Sq      RSS    AIC
<none>                     7305.4  224.5
- Wind      1      636.1   7941.5  225.9
- Precip    1      785.4   8090.8  226.7
- Pop       1     1447.5   8752.9  229.9
- Neg.temp  1     1517.4   8822.8  230.2
- Manuf     1     3636.8  10942.1  239.1

Call:
lm(formula = SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip)

Coefficients:
(Intercept)     Neg.temp        Manuf          Pop         Wind       Precip
  100.15245      1.12129      0.06489     -0.03933     -3.08240      0.41947


> xtable(step(model1))

# R^2 for the model selected by the stepwise procedure
# Residual standard error: 14.45 on 35 degrees of freedom
# Multiple R-squared: 0.6685, Adjusted R-squared: 0.6212
# F-statistic: 14.12 on 5 and 35 DF, p-value: 1.409e-07
> step(model1, direction = "backward")
> step(model1, direction = "forward")
> step(model1, direction = "both")


> AIC(model1)
[1] 344.7232
> ## a version of BIC or Schwarz BC:
> AIC(model1, k = log(nrow(newdata)))
[1] 358.4318
> extractAIC(model1)
[1]   7.0000 226.3703
> y <- SO2
> x <- model.matrix(model1)[, -1]
> leapcp <- leaps(x, y, method = "Cp")
> leapcp
> library(faraway)   ## load the faraway package
> Cpplot(leapcp)
> leapadjr <- leaps(x, y, method = "adjr")
> maxadjr(leapadjr, 8)
  1,2,3,4,5  1,2,3,4,5,6    1,2,3,4,6      1,2,3,5      1,2,3,4      1,2,3,6    1,2,3,5,6      2,3,4,6
      0.621        0.611        0.600        0.600        0.592        0.588        0.588        0.587
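Why do AIC(model1) and extractAIC(model1) disagree (344.72 versus 226.37)? For a Gaussian linear model, AIC() is based on the full log-likelihood (and counts σ as an estimated parameter), while extractAIC() drops the terms that are identical for every model fitted to the same data; the two values differ by the constant n(log 2π + 1) + 2, so both criteria always rank models the same way. A quick check in Python against the numbers printed above:

```python
import math

n = 41  # number of cities in the usair data
constant = n * (math.log(2 * math.pi) + 1) + 2   # model-independent offset
# The difference reported on the slide, 344.7232 - 226.3703, agrees with the
# constant to about four decimal places.
print(round(344.7232 - 226.3703, 4))
print(round(constant, 4))
```

The BIC variant shown above behaves the same way: changing k only changes the penalty per parameter, not the likelihood part.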

[Figure: Cp plot — Mallows' Cp against the number of parameters p for the candidate variable subsets; the minimum is attained at the combination 12345]

Note

The Cp plot shows that the minimum value of the Cp index is found at the combination 12345, hence the variables selected for inclusion in a parsimonious model are Neg.temp, Manuf, Pop, Wind and Precip.


> plot.lm(model1)
> qqnorm(model1$res)
> qqline(model1$res)
> shapiro.test(model1$res)

        Shapiro-Wilk normality test

data:  model1$res
W = 0.923, p-value = 0.008535

> hist(model1$res, 15)
> plot.lm(model1)

[Figure: Normal Q-Q plot of the model1 residuals]

[Figure: Histogram of the model1 residuals]

The term refers to points that make a lot of difference in the regression analysis. They have two characteristics. They are outliers, i.e. observations lying outside the overall pattern of a distribution; these are often indicative either of measurement error or of a model that is not appropriate for the data. They are high-leverage points, i.e. they exert a great deal of influence on the path of the fitted equation, since their values of the x variables are far from the mean x̄.

When performing a regression analysis, it is often best to discard outliers before computing the line of best fit. This is particularly true of outliers in the x direction, since these points may greatly influence the result.

The usair data set contains at least one city that should be considered an outlier. On the Manuf variable, e.g., Chicago, with a value of 3344, has about twice as many manufacturing enterprises employing 20 or more workers as the city with the second highest number (Philadelphia). Philadelphia and Phoenix may also be suspects in this sense.
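Leverage can be computed directly. In simple regression h_i = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)², and the leverages always sum to the number of fitted parameters (here 2); a common flag is h_i > 2p/n. The sketch below uses made-up Manuf-like values with one Chicago-like extreme (the numbers are illustrative, not the actual usair data):

```python
def leverages(x):
    """Hat-matrix diagonal for a simple (intercept + slope) regression on x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    return [1.0 / n + (v - xbar) ** 2 / sxx for v in x]

x = [200, 350, 410, 520, 640, 3344]   # last value is the Chicago-like extreme
h = leverages(x)
# average leverage is p/n = 2/6; the extreme point far exceeds the 2p/n flag
```

Note that leverage depends only on the x values, not on the response, which is why a point can be influential without being a residual outlier.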


> rstand <- rstandard(model1)
> plot(rstand, main = "Standardized Errors")
> abline(h = 2)
> abline(h = -2)
> outlier <- rstand[abs(rstand) > 2]
> identify(1:length(rstand), rstand, names(rstand))
# we highlight the values of the standardized errors outside the 95% confidence
# interval of the normal distribution, which may be considered anomalous

[Figure: Standardized residuals of model1, with horizontal reference lines at ±2]

> data3 <- newdata[-c(30, 31), ]   # refit the model discarding Pittsburgh and Providence
> attach(data3)
> model3 <- lm(SO2 ~ Neg.temp + Manuf + Pop + Wind + Precip + Days, data = data3)
> summary(model3)
> qqnorm(model3$res)
> qqline(model3$res)
> shapiro.test(residuals(model3))

        Shapiro-Wilk normality test

data:  residuals(model3)
W = 0.9723, p-value = 0.4417

[Figure: Normal Q-Q plot of the model3 residuals]

[Figure: Standardized residuals of model3]

A data frame containing estimates of the percentage of body fat determined by underwater weighing, together with various body circumference measurements, for 252 males aged 21 to 81 (Johnson, 1996). The response variable is y = 1/density, as in Burnham and Anderson (2002). There are 13 potential predictors: age, weight, height, and 10 body circumference measurements. The best model is selected using the AIC, Mallows' Cp and adjusted R² criteria through stepwise regression.


 1  density  Density from underwater weighing (gm/cm^3)
 2  age      Age (years)
 3  weight   Weight (lbs)
 4  height   Height (inches)
 5  neck     Neck circumference (cm)
 6  chest    Chest circumference (cm)
 7  abdomen  Abdomen circumference (cm)
 8  hip      Hip circumference (cm)
 9  thigh    Thigh circumference (cm)
10  knee     Knee circumference (cm)
11  ankle    Ankle circumference (cm)
12  biceps   Biceps (extended) circumference (cm)
13  forearm  Forearm circumference (cm)
14  wrist    Wrist circumference (cm)

Full model

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    0.8748      0.0359    24.35    0.0000
age            0.0001      0.0001     1.56    0.1198
weight        -0.0002      0.0001    -1.91    0.0580
height        -0.0002      0.0002    -0.79    0.4314
neck          -0.0009      0.0005    -1.97    0.0504
chest         -0.0001      0.0002    -0.53    0.5993
abdomen        0.0020      0.0002    11.43    0.0000
hip           -0.0005      0.0003    -1.54    0.1261
thigh          0.0005      0.0003     1.71    0.0892
knee           0.0000      0.0005     0.01    0.9893
ankle          0.0006      0.0005     1.24    0.2144
biceps         0.0005      0.0004     1.41    0.1606
forearm        0.0009      0.0004     2.22    0.0276
wrist         -0.0036      0.0011    -3.23    0.0014

> model11$r.squared
[1] 0.7424321
> model11$adj.r.squared
[1] 0.7283632

> qqnorm(model1$res)
> qqline(model1$res)
> shapiro.test(residuals(model1))

        Shapiro-Wilk normality test

data:  residuals(model1)
W = 0.9924, p-value = 0.2232

[Figure: Normal Q-Q plot of the residuals of the full body fat model]

[Figure: Histogram of the residuals of the full body fat model]

[Figure: Standardized residuals of the full body fat model]

> step(model1)
> extractAIC(model1)
[1]    14.00  -2365.27
> extractAIC(model2)
[1]     9.000  -2370.712

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    0.8644      0.0244    35.49    0.0000
age            0.0001      0.0001     1.73    0.0848
weight        -0.0002      0.0001    -2.56    0.0111
neck          -0.0010      0.0005    -2.04    0.0421
abdomen        0.0020      0.0001    13.30    0.0000
hip           -0.0004      0.0003    -1.51    0.1320
thigh          0.0007      0.0003     2.56    0.0109
forearm        0.0011      0.0004     2.76    0.0063
wrist         -0.0033      0.0011    -3.09    0.0023

# Residual standard error: 0.008903 on 243 degrees of freedom
# Multiple R-squared: 0.7377, Adjusted R-squared: 0.7291
# F-statistic: 85.44 on 8 and 243 DF, p-value: < 2.2e-16


x <- model.matrix(model1)[, -1]
leapcp <- leaps(x, y, method = "Cp")
leapcp
leaps(x, y, method = "r2", nbest = 1)      # maximise
leaps(x, y, method = "adjr2", nbest = 1)   # maximise
library(faraway)   ## load the faraway package
Cpplot(leapcp)

[Figure: Cp plot for the body fat candidate models — Cp against the number of parameters p]

Using the AIC criterion the best model is
y ~ age + weight + neck + abdomen + hip + thigh + forearm + wrist
with AIC = -2370.71, as we can find in Burnham and Anderson (2002), while using the adjusted R² criterion the best model has 10 covariates:
y ~ age + weight + neck + abdomen + hip + thigh + ankle + biceps + forearm + wrist
Finally, using Mallows' Cp criterion the best model has 8 covariates:
y ~ age + weight + neck + abdomen + hip + thigh + forearm + wrist


In this data set, CO2 emissions (metric tons per capita) measured in 116 countries are related to other variables:

1  Energy use (kg of oil equivalent per capita)
2  Exports of goods and services (% of GDP)
3  Gross Domestic Product (GDP) growth (annual %)
4  Population growth (annual %)
5  Annual deforestation (% of change)
6  Gross National Income (GNI), Atlas method (current US$)

Full model

> mod11$r.squared
[1] 0.8059332
> mod11$adj.r.squared
[1] 0.7952506


> qqnorm(mod1$res)
> qqline(mod1$res)
> shapiro.test(residuals(mod1))

        Shapiro-Wilk normality test

data:  residuals(mod1)
W = 0.7228, p-value = 1.673e-13

# Observations 6, 9, 48, 89, 97 and 104 correspond to the countries Australia,
# Bahrain, Iceland, Russian Federation, Sudan and Togo

[Figure: Normal Q-Q plot of the mod1 residuals]

[Figure: Histogram of the mod1 residuals]

[Figure: Standardized residuals of mod1]

> stpmod1 <- step(mod1)
> result <- summary(stpmod1)
> mod2 <- lm(CO2 ~ energy + exports + GNI, data = pollution, x = TRUE, y = TRUE)
> xtable(summary(mod2))

> extractAIC(mod1)
[1]   7.0000 213.2029
> extractAIC(mod2)
[1]   4.0000 207.6525

The same model is selected using the adjusted R² and Mallows' Cp criteria, as we can find in Ricci (2006).


> y <- CO2
> x <- model.matrix(mod1)[, -1]
> adjr2 <- leaps(x, y, method = "adjr2")
> maxadjr(adjr2, 8)
      1,2,6     1,2,5,6     1,2,4,6     1,2,3,6   1,2,3,5,6   1,2,4,5,6   1,2,3,4,6 1,2,3,4,5,6
      0.800       0.799       0.798       0.798       0.797       0.797       0.796       0.795
> leapcp <- leaps(x, y, method = "Cp")
> Cpplot(leapcp)

[Figure: Cp plot; the minimum Cp is attained at model 126, i.e. energy, exports and GNI]

All this material and more may be found at http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf.

anova: compute an analysis of variance table for one or more linear model fits (stats)
coef: generic function which extracts model coefficients from objects returned by modeling functions; coefficients is an alias for it (stats)
coeftest: testing estimated coefficients (lmtest)
confint: computes confidence intervals for one or more parameters in a fitted model; base R has a method for objects inheriting from class lm (stats)
deviance: returns the deviance of a fitted model object (stats)
fitted: generic function which extracts fitted values from objects returned by modeling functions; fitted.values is an alias for it (stats)
formula: provides a way of extracting formulae which have been included in other objects (stats)
linear.hypothesis: test a linear hypothesis (car)
lm: used to fit linear models; it can carry out regression, single stratum analysis of variance and analysis of covariance (stats)
predict: predicted values based on a linear model object (stats)
residuals: generic function which extracts model residuals from objects returned by modeling functions (stats)
summary.lm: summary method for class lm (stats)
vcov: returns the variance-covariance matrix of the main parameters of a fitted model object (stats)



add1: compute all the single terms in the scope argument that can be added to or dropped from the model, fit those models and compute a table of the changes in fit (stats)
AIC: generic function calculating the Akaike information criterion for one or several fitted model objects for which a log-likelihood value can be obtained, according to the formula -2*log-likelihood + k*npar, where npar is the number of parameters in the fitted model and k = 2 for the usual AIC, or k = log(n) (n the number of observations) for the so-called BIC or SBC (Schwarz's Bayesian criterion) (stats)
Cpplot: Cp plot (faraway)
drop1: compute all the single terms in the scope argument that can be added to or dropped from the model, fit those models and compute a table of the changes in fit (stats)
extractAIC: computes the (generalized) Akaike information criterion for a fitted parametric model (stats)
maxadjr: maximum adjusted R-squared (faraway)
offset: an offset is a term to be added to a linear predictor, such as in a generalised linear model, with known coefficient 1 rather than an estimated coefficient (stats)
step: select a formula-based model by AIC (stats)
update.formula: used to update model formulae; this typically involves adding or dropping terms, but updates can be more general (stats)



cookd: Cook's distances for linear and generalized linear models (car)
cooks.distance: Cook's distance (stats)
hat: diagonal elements of the hat matrix (stats)
hatvalues: diagonal elements of the hat matrix (stats)
ls.diag: computes basic statistics, including standard errors, t- and p-values for the regression coefficients (stats)
rstandard: standardized residuals (stats)
rstudent: studentized residuals (stats)
vif: variance inflation factor (car)
plot.lm: four plots (selectable by which) are currently provided: a plot of residuals against fitted values, a Scale-Location plot of the square root of the standardized residuals against fitted values, a normal Q-Q plot, and a plot of Cook's distances versus row labels (stats)
qq.plot: quantile-comparison plots (car)
qqline: adds a line to a normal quantile-quantile plot which passes through the first and third quartiles (stats)
qqnorm: generic function whose default method produces a normal Q-Q plot of the values in y (stats)
reg.line: plot a regression line (car)


Bibliography

AUSTIN, P. C. and TU, J. V. (2004). Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality. Journal of Clinical Epidemiology 57, 1138-1146.
BROWN, P. J. (1994). Measurement, Regression and Calibration. Oxford: Clarendon.
BURNHAM, K. P. and ANDERSON, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag.
BURMAN, P. (1996). Model fitting via testing. Statistica Sinica 6, 589-601.
CHATTERJEE, S. and HADI, A. S. (2006). Regression Analysis by Example, 4th edition. Wiley & Sons: Hoboken, New Jersey.
DER, G. and EVERITT, B. S. (2006). Statistical Analysis of Medical Data Using SAS. Chapman & Hall/CRC, Boca Raton, Florida.
DIZNEY, H. and GROMEN, L. (1967). Predictive validity and differential achievement in three MLA comparative foreign language tests. Educational and Psychological Measurement 27, 1127-1130.


Bibliography

EVERITT, B. S. (2005). An R and S-PLUS Companion to Multivariate Analysis. Springer-Verlag: London.
FINOS, L., BROMBIN, C. and SALMASO, L. (2009). Adjusting stepwise p-values in generalized linear models. Accepted for publication in Communications in Statistics: Theory and Methods.
FREEDMAN, L. S., PEE, D. and MIDTHUNE, D. N. (1992). The problem of underestimating the residual error variance in forward stepwise regression. The Statistician 41, 405-412.
GABRIEL, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika 58, 453-467.
HARSHMAN, R. A. and LUNDY, M. E. (2006). A randomization method of obtaining valid p-values for model changes selected post hoc. Poster presented at the Seventy-first Annual Meeting of the Psychometric Society, Montreal, Canada, June, 2006. Available at http://publish.uwo.ca/ harshman/imps2006.pdf.
JOHNSON, R. W. (1996). Fitting percentage of body fat to simple body measurements. Journal of Statistics Education 4 (e-journal); see http://www.amstat.org/publications/jse/toc.html


Bibliography

JOLLIFFE, I. T. (1982). A note on the use of principal components in regression. Journal of the Royal Statistical Society, Series C 31, 300-303.
JOLLIFFE, I. T. (1986). Principal Component Analysis. Springer: New York.
MALLOWS, C. L. (1973). Some comments on Cp. Technometrics 15, 661-675.
MALLOWS, C. L. (1975). More comments on Cp. Technometrics 37, 362-372.
MORRISON, D. F. (1967). Multivariate Statistical Methods. McGraw-Hill: New York.
RICCI, V. (2006). Principali tecniche di regressione con R. See cran.r-project.org/doc/contrib/Ricci-regression-it.pdf.

- Numerical Modelling - Prediction or ProcessHochgeladen vonsonusk777
- AIC Model SelectionHochgeladen vonhammoudeh13
- Occam's RazorHochgeladen vonSyed Putra Iqmal
- A New Approach to Fitting Linear Models in High Dimensional SpacesHochgeladen vongiangblackk
- TUK EECQ 5291- Professional Engineering Practice _ April 2018Hochgeladen vongaza man
- Evidence for Connection TheoryHochgeladen vonAlyssa Vance
- SCIENCE DAN TEORI (KEDOKTERAN).pptHochgeladen vonYus Ani
- Barthes- SZHochgeladen vonMylèneYannick
- Levy 2007Hochgeladen vonLucas Belmino Freitas
- Schema_6Hochgeladen vonaarongroh1
- Enlivenment Andreas WeberHochgeladen vonsoto
- Pa 021214Hochgeladen vonsarahloren_thepress
- Chapter 2 - Framing an Analytic QuestionHochgeladen vonJeremiahOmwoyo
- Designing Design EngineersHochgeladen vonraveemakwana
- Mid Term &Final for 595-ArthurHochgeladen vonBiao Wang

## Viel mehr als nur Dokumente.

Entdecken, was Scribd alles zu bieten hat, inklusive Bücher und Hörbücher von großen Verlagen.

Jederzeit kündbar.