
NR120.508 Biostatistics for Evidence-based Practice

Multiple Linear Regression

Song Ge
BSN, RN, PhD Candidate
Johns Hopkins University School of Nursing

www.nursing.jhu.edu
Learning Objectives

By the end of this module, you will be able to:


1. Articulate assumptions for multiple linear
regression
2. Explain the primary components of multiple
linear regression
3. Identify and define the variables included in
the regression equation
4. Construct a multiple regression equation
5. Calculate a predicted value of a dependent
variable using a multiple regression equation
Learning Objectives Cont’d

6. Distinguish between unstandardized (B)
and standardized (Beta) regression
coefficients
7. Distinguish between different methods
for entering predictors into a regression
model (simultaneous, hierarchical and
stepwise)
8. Identify strategies to assess model fit
9. Interpret and report the results of
multiple linear regression analysis
Review of lecture two weeks ago
• Linear regression assumes a linear
relationship between independent
variable(s) and dependent variable
• Linear regression allows us to predict an
outcome based on one or several
predictors
• Linear regression allows us to explain
the interrelationships among variables
• Linear regression is a parametric test
How to choose X and Y?
• Y can be regressed on X
• X can be regressed on Y
• The regression is not symmetric
• The choice of which regression to
perform depends on the scientific
question: Is X to be used to explain or
predict Y?
• Is Y to be used to explain or predict X?
(e.g. Does poor health status explain high
pollution level?)
Linear Regression Assumptions

1. Independent variables can be on any scale
(ratio, nominal, etc.)
2. The dependent variable needs to be on a
ratio/interval scale
3. The dependent variable needs to be
normally distributed overall and
normally distributed for each value of
the independent variable
4. If the dependent variable is not normally
distributed, we can transform it
Review: Normal distribution
Example of transformed data

https://infoactive.co/data‐design/ch11.html

[Figure panels: Positively skewed; Normally distributed]

Method       Math operation    Good for:                    Bad for:
Log          ln(x), log10(x)   Right skewed data            Zero values, negative values
Square root  √x                Right skewed data            Negative values
Square       x^2               Left skewed data             Negative values
Cube root    x^(1/3)           Right skewed data,           Not as effective as log
                               negative values              transform
Reciprocal   1/x               Making small values bigger   Zero values, negative values
                               and big values smaller
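As a rough illustration of the table above, the sketch below applies four of the transformations to a small made-up right-skewed sample and compares skewness before and after. The data and the `skewness` helper are assumptions for illustration only, not part of the lecture material.

```python
import math
import statistics


def skewness(values):
    """Population skewness: mean of cubed z-scores (no bias correction)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed sample; all values are positive so that
# every transformation below is defined.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 60]

transforms = {
    "log": math.log,                       # fails on zero or negative values
    "square root": math.sqrt,              # fails on negative values
    "cube root": lambda x: x ** (1 / 3),   # written for non-negative x here
    "reciprocal": lambda x: 1 / x,         # fails on zero
}

skew_raw = skewness(raw)
skew_after = {
    name: skewness([f(x) for x in raw]) for name, f in transforms.items()
}

print(f"raw skewness: {skew_raw:.2f}")
for name, value in skew_after.items():
    print(f"{name:12s} skewness: {value:.2f}")
```

A skewness closer to zero after transformation suggests a more symmetric distribution; a histogram is still worth checking.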
Continued…
5. Samples must be
representative of the
population
6. There is no multicollinearity.
Multicollinearity means the independent
variables are so strongly intercorrelated
that they are indistinguishable
from each other

The variance inflation factor (VIF) is always at least 1
Rule of thumb: if VIF lies between 1 and 10, no serious multicollinearity
If VIF > 10, then there is multicollinearity
Continued…
7. The relationship between x and y must
be linear. When two scores are graphed,
they should tend to form a straight line.
If the relationship is not linear, other
methods must be used.
8. For every value of X, the distribution of Y
scores must have approximately equal
variability (homoscedasticity)
Multiple Linear Regression

• Recall student scores example from previous module


• What will you do if you are interested in studying
relationship between final grade with midterm (or
screening) score and other variables such as previous
(undergraduate) GPA, GRE score and motivation?
• A simple linear regression (SLR) cannot handle this
• A separate SLR with each explanatory (independent)
variable will provide information in isolation
• You will need to use a multiple linear regression
(MLR) method to study them together
Multiple Linear Regression
• A multiple linear regression model shows the
relationship between the dependent variable and
multiple (two or more) independent variables
• The overall variance explained by the model (R2) as
well as the unique contribution (strength and
direction) of each independent variable can be
obtained
• In MLR, the shape is not really a line. If there are
three variables, the shape is a plane, and if there
are four or more variables, it is impossible to
visualize or graph. However, by convention, we still
refer to the regression equation as a regression
'line'.
MLR with Two Predictors

http://www.aetheling.com/models/cusp/Intro.htm
Multiple Linear Regression Equation

• MLR is sometimes also called multivariate linear
regression, although strictly that term refers to
models with more than one dependent variable
• The prediction equation is
Y′ = a + b1X1 + b2X2 + b3X3 + ∙∙∙ + bkXk
• There is still one intercept constant, a, but
each independent variable (e.g., X1, X2, X3)
has its own regression coefficient
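To make the prediction equation concrete, here is a minimal sketch that estimates a, b1 and b2 by ordinary least squares on simulated data. The data, the true coefficient values and the use of NumPy's `lstsq` are assumptions for illustration; the course itself uses SPSS.

```python
import numpy as np

# Simulated data (hypothetical): Y depends linearly on two predictors.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones for the intercept a, then X1 and X2.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimates of (a, b1, b2).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef

# Predicted values Y' = a + b1*X1 + b2*X2.
y_pred = a + b1 * x1 + b2 * x2

print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

The estimates should land close to the values used to simulate the data (2.0, 1.5 and -0.5), with one intercept and one coefficient per predictor.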
Review: Simple linear regression
• Y’ is a linear function of X
• Y’ = a + bx
• a = intercept
• b = slope
Interpretation of MLR Coefficients
Group exercise: interpret B0, B1
and B2
• Data are from children aged 1 to 5 years in the
• Variables
– Y is the child’s arm circumference (cm)
– X1 is the age of the child (months)
– X2 is the height of the child (cm)
• Does arm circumference increase with
increasing child age after controlling for child
height?
• Multiple linear regression model
Y = B0 + B1 X1 + B2 X2
Answers
• B0 = the estimated mean arm
circumference when the values of age
and height are zero
• B1 = the change in the estimated mean
arm circumference associated with each 1
month increase in age if height is
unchanged
• B2 = Your turn!
Multiple Linear Regression Models

• We can get six critical pieces of information from
an MLR:
– The overall significance of the model
– The variance in the dependent variable that comes
from the set of independent variables in the model
– The statistical significance of each individual
independent variable (controlling for the others)
– The direct effect (and direction of the effect) of each
independent variable on the dependent variable
– The relative strength of the independent variables
– The regression equation, which allows us to predict
values of the dependent variable given values of the
independent variables
The overall piece: R2 (coefficient of
determination)

• R2 provides the proportion of variability explained by
using X
• R2 measures the ability to predict an individual Y using
its X(s)
• Statistical significance of the overall model (Model F-test)
• Recall the Pearson correlation coefficient (r in a sample, which
estimates the population correlation)
– Takes on values between -1 and +1
– 0 indicates no linear association; 1 indicates a perfect positive
linear relationship; -1 indicates a perfect negative linear
relationship
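The "proportion of variability explained" can be written as 1 minus the ratio of residual to total sum of squares. The sketch below is one way to compute it; the function name and the toy numbers are hypothetical.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SS_residual / SS_total."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


# Hypothetical outcome values.
observed = [3.0, 5.0, 7.0, 9.0]

# A model that predicts every value exactly explains all the variability.
perfect = [3.0, 5.0, 7.0, 9.0]

# A model that always predicts the mean explains none of it.
mean_only = [6.0, 6.0, 6.0, 6.0]

print(r_squared(observed, perfect))
print(r_squared(observed, mean_only))
```

These two extremes bracket what a fitted least-squares model with an intercept can achieve on its own data.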
R: population correlation coefficient

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
SPSS output for R square
The individual piece: Correlation coefficient

Test of a regression coefficient (reported as a t-test in the SPSS coefficients table): whether the independent variable
associated with it contributes significantly to the variance accounted for
in the dependent variable
Group exercise

• Propose a research question that can
be answered by MLR
• State the assumptions under which we
use this statistical method
• State the formula and what B0, B1 and
B2 stand for
• Break
• https://www.youtube.com/watch?v=LVw9YdP1O-0
Example
• We are interested in knowing if going to
restaurants frequently (five or more times/week)
can lead to higher cholesterol. We also know that
age, gender, and race/ethnicity can affect
cholesterol. How can we tell if going out to
restaurants frequently, this factor alone, will
affect cholesterol levels?
• Do age, gender, ethnicity, and going out to eat
frequently all affect cholesterol levels?
– Dependent variable: cholesterol level
– Independent variables: age (years), gender
(male/female), race/ethnicity (Black, White, Asian,
or Hispanic), frequency of going out to eat (5+
times/week vs less than 5 times/week)
Linear Regression Assumptions
• Linear regression is a parametric method and
requires that certain assumptions be met to be
valid.
1. The sample must be representative of the
population
2. The dependent variable must be of ratio/interval
scale and normally distributed overall and normally
distributed for each value of the independent
variables
3. For every value of X, the distribution of Y scores
must have approximately equal variability
(homoscedasticity)
4. The relationship between X and Y must be linear
5. The independent variables are not very strongly
inter-correlated (no multicollinearity)
Creating Dummy Variables
• Using dummy variables is a way to express a
nominal independent variable with multiple
categories by a series of dichotomous (binary)
variables that compare one category to a
different category that serves as the reference
• The number of dummy variables created will be
one less than the number of categories of the
variable
• One of the categories is chosen to serve as the
“reference” category
• You then include all the dummy variables in the
regression model instead of the original
categorical variable
Creating Dummy Variables:
Example
• Let's say we have a race/ethnicity variable
with four categories (non-Hispanic White,
non-Hispanic Black, non-Hispanic Asian, and
Hispanic)
• If we want to use it in a multiple regression,
we would need to create three variables (4-1)
to represent the four categories
• We would put these variables into the
multiple regression equation instead of the
four category race/ethnicity variable
Example Cont’d
• We would therefore create 3 (4−1) dummy variables
and choose one category as the reference, in this
case, non-Hispanic White
– Non-Hispanic Black (1=yes, 0=no)
– Non-Hispanic Asian (1=yes, 0=no)
– Hispanic (1=yes, 0=no)
– Say these are called Dummy1, Dummy2 and
Dummy3

Race/Ethnicity       Dummy1   Dummy2   Dummy3
Non-Hispanic Black   1        0        0
Non-Hispanic Asian   0        1        0
Hispanic             0        0        1
Non-Hispanic White   0        0        0
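The coding in the table can be sketched in a few lines. The function name `dummy_code` is hypothetical; the category labels and the choice of non-Hispanic White as reference follow the example above.

```python
CATEGORIES = [
    "Non-Hispanic White",
    "Non-Hispanic Black",
    "Non-Hispanic Asian",
    "Hispanic",
]
REFERENCE = "Non-Hispanic White"


def dummy_code(value, categories=CATEGORIES, reference=REFERENCE):
    """Return [Dummy1, Dummy2, Dummy3] for one race/ethnicity value.

    One indicator per non-reference category, so a variable with
    k categories yields k - 1 dummy variables.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories if c != reference]


for category in CATEGORIES:
    print(f"{category:20s} {dummy_code(category)}")
```

The reference category is the one coded 0 on every dummy variable, so each coefficient is read as a difference from that group.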
Information from MLR
• Overall variance explained by the model (e.g., do
the independent variables in the model, taken
together, do a good job at predicting the
dependent variable?) using the adjusted R2
• Statistical significance of the overall model
(Model F-test of R2)
• The strength, direction, and statistical
significance of each independent variable
(regression coefficients)
• Regression equation as a whole can be used to
predict values of the dependent variable for a
given set of values of the independent variables
MLR: Analysis Example

• We will use data on 489
NYCHANES study
participants to look at a
number of potential
predictors of total
cholesterol (mg/dL)
• The dependent variable is
total cholesterol (mg/dL)
• We can see that total
cholesterol is somewhat
right-skewed
MLR: Analysis Example Cont’d

• To correct for this
departure from
normality, the variable
can be transformed
• In this case, we take
the natural log of
cholesterol (a nonlinear
transformation). This makes
the dependent variable
approximately normally
distributed
MLR: Analysis Example Cont’d
• We will use multiple linear regression to look
at a number of independent variables
– Gender (female=1 vs. male=0)
– Age (continuous)
– Frequency of eating in restaurants
(frequent=1 vs. infrequent=0)
– Race/ethnicity (Black, White, Asian, or
Hispanic)
• Note that the race/ethnicity variable has four
categories. In order to look at this variable in a
regression model, we will have to create dummy
variables.
MLR: Analysis Example Cont’d
• We will create 3 (4−1) dummy variables
and use the category “White” as the
reference. The variable coding will be
– Black (1 = person is non-Hispanic Black; 0
= person is any other race/ethnicity)
– Asian (1 = person is non-Hispanic Asian; 0
= person is any other race/ethnicity)
– Hispanic (1 = person is Hispanic, 0 =
person is not Hispanic)
MLR: Analysis Example Cont’d
• We are testing a number of hypotheses, one null
and one alternate hypothesis for each
independent variable in the model. For example,
one hypothesis we are testing is
– H0: There is no association between frequency
of eating out and total cholesterol, adjusting
for gender, age, and race/ethnicity (adjusted
beta=0)
– Ha: There is an association between frequency
of eating out and total cholesterol, adjusting
for gender, age, and race/ethnicity (adjusted
beta≠0)
Analysis Example: Model Summary

• Adjusted R2=0.031
• The four independent variables explain 3.1%
of the variance in the dependent variable.

Model Summary

Model   R        R2      Adjusted R2   Std. Error of the Estimate
1       0.211a   0.044   0.031         0.19918

a. Predictors: (Constant), Hispanic, restaurant_dich, participant gender, age in years, Asian, Black
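Adjusted R2 penalizes R2 for the number of predictors. A minimal sketch of the usual formula follows; the function name is hypothetical, and the inputs in the example are made up because the number of cases actually analyzed is not stated in the output above.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1).

    r2: unadjusted R^2
    n:  number of cases used in the model
    k:  number of predictors (dummy variables each count as one)
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Made-up inputs, not the NYCHANES analysis.
example = adjusted_r2(r2=0.5, n=101, k=4)
print(round(example, 4))
```

Adjusted R2 is never larger than R2, and the gap widens as predictors are added relative to the sample size.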
Analysis Example: ANOVA
• The p-value for the overall model is 0.004.
The amount of variance explained by the
model (independent variables) is statistically
significant
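The model F-test can be expressed in terms of R2. The sketch below computes the F statistic and its degrees of freedom. The function name and example inputs are assumptions; the p-value lookup is left to statistical software.

```python
def model_f_statistic(r2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)).

    Returns the F statistic with its numerator and denominator
    degrees of freedom for a model with k predictors and n cases.
    """
    df_model = k
    df_resid = n - k - 1
    f_value = (r2 / df_model) / ((1 - r2) / df_resid)
    return f_value, df_model, df_resid


# Made-up inputs, not the NYCHANES analysis.
f_value, df_model, df_resid = model_f_statistic(r2=0.5, n=101, k=4)
print(f"F({df_model}, {df_resid}) = {f_value:.2f}")
```

A larger F for the same degrees of freedom corresponds to a smaller p-value for the overall model.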
Analysis Example: Coefficients
• Unstandardized coefficients (B): gender (−0.015), age (0.002),
eating in restaurants (0.008), Black (−0.053), Asian (0.006), and
Hispanic (−0.040); the regression constant is 5.189

Coefficients(a)

                       Unstandardized        Standardized                       95% Confidence Interval for B
Model                  B        Std. Error   Beta           t         Sig.     Lower Bound   Upper Bound
1 (Constant)           5.189    0.040                       129.516   0.000    5.110         5.268
  participant gender   −0.015   0.021        −0.038         −0.750    0.453    −0.056        0.025
  age in years         0.002    0.001        0.163          3.252     0.001    0.001         0.004
  restaurant_dich      0.008    0.025        0.016          0.316     0.752    −0.041        0.056
  Black                −0.053   0.030        −0.100         −1.767    0.078    −0.111        0.006
  Asian                0.006    0.031        0.011          0.197     0.844    −0.056        0.068
  Hispanic             −0.040   0.025        −0.096         −1.607    0.109    −0.088        0.009
a. Dependent variable: LNCholesterol
Example: Estimated Equation
• We can construct the regression equation and use it
to make predictions:
• Predicted ln(cholesterol) = 5.189 − 0.015 (gender) +
0.002 (age) – 0.053 (Black) + 0.006 (Asian) – 0.040
(Hispanic ) + 0.008 (restaurant_dich)

Example: Prediction
• From this model, we can predict the total
cholesterol of a 25-year-old White woman who
does not eat out often
• Predicted ln(cholesterol) = 5.189 − 0.015(1) +
0.002(25) − 0.053(0) + 0.006(0) − 0.040(0) +
0.008(0) = 5.224 (using the rounded coefficients shown)
• Cholesterol values were log-transformed, so we
need to back-transform (exponentiate in this
case). Don’t forget about it!
• e^5.224 ≈ 185.7 mg/dL
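The prediction and back-transformation can be checked with a short script. This is a sketch using the rounded coefficients from the coefficients table; the dictionary keys and function name are hypothetical, and results based on unrounded coefficients from the software may differ somewhat. With the rounded values, the sum is 5.224 on the log scale, which is about 185.7 mg/dL after exponentiating.

```python
import math

# Rounded unstandardized coefficients (B) from the coefficients table.
INTERCEPT = 5.189
COEFFICIENTS = {
    "gender": -0.015,          # female = 1, male = 0
    "age": 0.002,              # years
    "black": -0.053,           # dummy, reference = White
    "asian": 0.006,            # dummy, reference = White
    "hispanic": -0.040,        # dummy, reference = White
    "restaurant_dich": 0.008,  # frequent = 1, infrequent = 0
}


def predict_ln_cholesterol(person):
    """Predicted ln(total cholesterol) for one person's predictor values."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * person[name] for name in COEFFICIENTS
    )


# 25-year-old White woman who does not eat out often.
person = {
    "gender": 1, "age": 25,
    "black": 0, "asian": 0, "hispanic": 0,
    "restaurant_dich": 0,
}

ln_chol = predict_ln_cholesterol(person)
chol = math.exp(ln_chol)  # back-transform from the natural-log scale

print(f"predicted ln(cholesterol) = {ln_chol:.3f}")
print(f"predicted cholesterol     = {chol:.1f} mg/dL")
```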
Example: Significance
• Can look at significance of individual coefficients if
overall model is significant in ANOVA
• From the p-values, we can see that only age is
significantly associated with total cholesterol
(p=0.001)
Example: Interpretation
• The conclusions can be stated as follows:
– This model explains 3.1% of the variance in total
cholesterol, and this is statistically significant at
α=0.05. Age is positively associated with total
cholesterol such that, adjusting for the other
variables in the model, for each additional year in
age, the natural log (recall that we logged the
outcome to make it more normal) of total
cholesterol is predicted to increase by 0.002
units, and this association is statistically
significant (p=0.001). None of the other variables
in the model were significantly associated with
total cholesterol.
Standardized Coefficients
• Different independent variables are often measured in
different units (we can standardize them to bring them to the
same scale, e.g. z-scores)
• If you are interested in the relative importance of
the relationship of each IV with the DV, look at the
standardized coefficients (called Beta in SPSS
output) and compare their size regardless of direction
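A standardized coefficient can be obtained from an unstandardized one by rescaling with the standard deviations of the predictor and the outcome: Beta = B × SD(X) / SD(Y). The sketch below uses a tiny made-up data set with a single predictor, where Beta equals the Pearson correlation. The function name and data are hypothetical.

```python
import statistics


def standardized_beta(b, x_values, y_values):
    """Convert an unstandardized slope B to a standardized Beta."""
    return b * statistics.stdev(x_values) / statistics.stdev(y_values)


# Made-up data with one predictor.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares slope for a single predictor: Sxy / Sxx.
mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx

beta = standardized_beta(b, x, y)
print(f"B = {b:.3f}, Beta = {beta:.3f}")
```

Because Beta is unit-free, Betas for predictors measured in years, dollars or categories can be compared by absolute size.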
Multicollinearity in MLR
• Multicollinearity: the independent variables in
the model are so strongly associated with each
other that they are essentially measuring the
same thing
• You would like to see no or very small
multicollinearity among the independent
variables
• The tolerance of a variable is used as a measure
of collinearity. To obtain measures of tolerance,
each independent variable is treated as a
dependent variable and is regressed on the
other independent variables
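The procedure described above (regress each independent variable on the others) can be sketched as follows. Tolerance is 1 − R2 from that auxiliary regression and VIF is its reciprocal. The function name and simulated data are assumptions for illustration; SPSS reports these values directly in its collinearity diagnostics.

```python
import numpy as np


def tolerance_and_vif(X):
    """Return a list of (tolerance, VIF) pairs, one per column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    results = []
    for j in range(k):
        target = X[:, j]                   # treat predictor j as the outcome
        others = np.delete(X, j, axis=1)   # remaining predictors
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        tolerance = 1 - r2
        results.append((tolerance, 1 / tolerance))
    return results


# Simulated predictors (hypothetical): x3 is nearly a copy of x1.
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + 0.05 * rng.normal(size=300)

results = tolerance_and_vif(np.column_stack([x1, x2, x3]))
for name, (tol, vif) in zip(["x1", "x2", "x3"], results):
    print(f"{name}: tolerance = {tol:.4f}, VIF = {vif:.1f}")
```

In this simulation x1 and x3 should show very low tolerance (high VIF), while x2 stays near a VIF of 1.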
Checking Assumptions using Residuals
• Normal distribution of the
dependent variable and linear
relationship with the
independent variables
• If the relationships are linear
and the dependent variable is
normally distributed for each
value of the independent
variable, then the distribution
of the residuals (the residual or
error is the difference between
the actual and the predicted
values in the model) should be
approximately normal. This can
be assessed by using a
histogram of the standardized
residuals.
Checking Assumptions using Residuals
• Homoscedasticity—for every
value of X, the distribution of Y
scores must have
approximately equal variability.
• To check this assumption, the
residuals can be plotted
against the predicted values
and against the independent
variables. When standardized
predicted values are plotted
against observed values, the
data would form a straight line
from the lower-left corner to
the upper-right corner, if the
model fit the data exactly.
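Residual checks start from the residuals themselves. The sketch below fits a one-predictor model to simulated data, computes standardized residuals, and compares residual spread between the lower and upper halves of the predicted values. That comparison is a crude numeric stand-in for the residual plot, not a formal test, and all data and names here are assumptions.

```python
import numpy as np

# Simulated data (hypothetical) with constant error variance.
rng = np.random.default_rng(2)
n = 400
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

predicted = X @ coef
residuals = y - predicted  # actual minus predicted

# Divide by the residual standard error (n minus number of parameters).
standardized = residuals / residuals.std(ddof=X.shape[1])

# Crude homoscedasticity check: residual SD for low vs high predictions.
order = np.argsort(predicted)
low_sd = residuals[order[: n // 2]].std(ddof=1)
high_sd = residuals[order[n // 2:]].std(ddof=1)
spread_ratio = high_sd / low_sd

print(f"share of |standardized residual| > 2: "
      f"{np.mean(np.abs(standardized) > 2):.3f}")
print(f"residual SD ratio (high/low predictions): {spread_ratio:.2f}")
```

A ratio far from 1 would hint at unequal variability; in practice a histogram of `standardized` and a scatterplot of `residuals` against `predicted` give a fuller picture.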
Approaches for Selecting Variables
• First, it is common to look at a correlation matrix or SLR
models
• There are a number of approaches to choosing which
variables to include in a multiple regression model
• Standard approach (simultaneous method)—all the
independent variables are entered at once
• Stepwise (forward, backward, or stepwise solution)—the
software selects the best model based on a series of
steps in which variables are added and removed
depending on their association with the outcome
• Hierarchical—the researcher compares two or more
models before and after the addition of certain variables
of interest and uses preset criteria for selecting the best
model
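For the hierarchical approach, one common comparison is the change in R2 when the variables of interest are added to a baseline model. The sketch below illustrates this on simulated data. The variable names, the data and the helper `fit_r2` are hypothetical, and the criterion for choosing a model is left to the researcher.

```python
import numpy as np


def fit_r2(predictors, y):
    """Fit OLS with an intercept and return R^2."""
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)


# Simulated data (hypothetical): outcome depends on age and an exposure.
rng = np.random.default_rng(3)
n = 300
age = rng.normal(50, 10, size=n)
exposure = rng.integers(0, 2, size=n).astype(float)
y = 0.02 * age + 0.5 * exposure + rng.normal(size=n)

# Step 1: baseline model with the covariate only.
r2_step1 = fit_r2([age], y)

# Step 2: add the variable of interest.
r2_step2 = fit_r2([age, exposure], y)

r2_change = r2_step2 - r2_step1
print(f"R2 step 1 = {r2_step1:.3f}, R2 step 2 = {r2_step2:.3f}, "
      f"change = {r2_change:.3f}")
```

Because the first model is nested in the second, R2 cannot decrease; the question is whether the increase is large enough to matter.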
Summary
• Multiple linear regression can be used when
the outcome of interest is of interval or ratio
scale and normally distributed
• The independent variables can be on any scale,
but dummy variables need to be created for all
polytomous (categorical or nominal) independent
variables
• Using the regression model, we can estimate the
strength and direction of the association from
the adjusted betas
• We can also determine the statistical significance
of each parameter in the model using the p-value
or the 95% confidence interval around the beta
References/Acknowledgement
Kellar, S. P., & Kelvin, E. A. (2013). Munro’s Statistical
Methods for Health Care Research (6th ed.). Philadelphia,
PA: Wolters Kluwer Health | Lippincott Williams and
Wilkins. Chapter 14
Polit, D. F. (2010). Statistics and data analysis for nursing
research (2nd ed.). Upper Saddle River, NJ: Pearson.
Adapted some slides from the publisher’s course
instructor resources
Adapted some slides from a biostatistics course from
the school of public health
