
REGRESSION

REGRESSION ANALYSIS
In regression analysis we analyze the relationship between two or more variables: the effect of one or more independent variables on a dependent variable.

The relationship between two or more variables could be linear or non-linear.

We will only talk about the relationship between two variables. We also restrict our topic
to Linear Regression.

Simple Linear Regression : Linear Regression Between Two Variables

How can we use available data to investigate such a relationship?

If a relationship exists, how can we use it to forecast the future?

Simple Linear Relationship
Linear relationship between two variables is stated as

y = β₀ + β₁x

This is the general equation for a line.

β₀ : Intersection with the y axis (intercept)

β₁ : The slope

x : The independent variable

y : The dependent variable
Simple regression seeks to predict an outcome variable from a single predictor
variable


Outcome = ( Model ) + error

What if there is nothing given to us to predict the outcome?

On what data is REGRESSION done?

Interval & Ratio Level Data
THING TO REMEMBER!

Regression focuses on ASSOCIATION, not CAUSATION.

Association is a necessary prerequisite for inferring causation, but also:
1. The independent variable must precede the dependent variable in time.
2. The two variables must be plausibly linked by a theory.
3. Competing independent variables must be eliminated.

Simple linear regression uses one independent variable to explain the dependent
variable.

Some relationships are too complex to be described using a single independent
variable

Multiple regression models use two or more independent variables to describe the
dependent variable

This allows multiple regression models to handle more complex situations

There is no limit to the number of independent variables a model can use

Multiple regression has only one dependent variable
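As a minimal sketch (made-up data, not from the lecture) of the contrast, here is what fitting one dependent variable on two independent variables looks like with ordinary least squares in Python:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y  = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=50)  # simulated outcome

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix: intercept + 2 predictors
coef, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares estimates
print(coef)                                       # roughly [1.0, 2.0, -0.5]
```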

Simple Linear Regression Equation
For the time being, forget the error term. The following equation describes how the mean value of y is
related to x.

E(y) = β₀ + β₁x

β₀ is the intersection with the y axis, β₁ is the slope.

β₁ > 0 : y increases as x increases
β₁ < 0 : y decreases as x increases
β₁ = 0 : no linear relationship

The Least Square Method
Our judgmental approach and our eyes tried to minimize the
difference between observed values and values obtained on the
estimated regression line.

We may implement the same approach algebraically:

Minimize the sum of the squared differences between observed
values and values on the regression line.

The Least Square Method.
credit card usage survey.xlsx

[Figure: the regression picture. For an observation yᵢ, the plot shows the fitted line ŷᵢ = α + βxᵢ, the naive mean of y, and the distances A (observation to mean), B (regression line to mean), and C (observation to regression line), together with their squares A², B², C². Least squares estimation gave us the line that minimized ΣC².]


SS_total: Total squared distance of observations from the naive mean of y (total variation).

SS_reg: Distance from the regression line to the naive mean of y (variability due to x, i.e. regression).

SS_residual: Variance around the regression line (additional variability not explained by x); this is what the least squares method aims to minimize.


Σᵢ (yᵢ - ȳ)² = Σᵢ (ŷᵢ - ȳ)² + Σᵢ (yᵢ - ŷᵢ)²
The Regression Picture
R² = SS_reg / SS_total
Results of least squares

Slope (beta coefficient):

β̂ = SS_xy / SS_x

where SS_xy = Σᵢ (xᵢ - x̄)(yᵢ - ȳ) and SS_x = Σᵢ (xᵢ - x̄)²

Intercept: calculate

α̂ = ȳ - β̂·x̄
The regression line always goes through the point (x̄, ȳ).
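The slope and intercept formulas above translate directly into a few lines of code. Below is a minimal sketch, not part of the original slides, using NumPy and made-up illustrative data:

```python
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 8.0])    # hypothetical predictor values
y = np.array([3.0, 6.0, 7.0, 10.0, 11.0])  # hypothetical outcome values

x_bar, y_bar = x.mean(), y.mean()
ss_xy = np.sum((x - x_bar) * (y - y_bar))   # SS_xy = sum of (x_i - x_bar)(y_i - y_bar)
ss_x  = np.sum((x - x_bar) ** 2)            # SS_x  = sum of (x_i - x_bar)^2

beta_hat  = ss_xy / ss_x                    # slope
alpha_hat = y_bar - beta_hat * x_bar        # intercept
print(beta_hat, alpha_hat)

# The fitted line passes through (x_bar, y_bar):
assert np.isclose(alpha_hat + beta_hat * x_bar, y_bar)
```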

Relationship with correlation
β̂ = r · √(SS_y / SS_x), equivalently r = β̂ · √(SS_x / SS_y)
In correlation, the two variables are treated as equals. In regression, one variable is considered
independent (=predictor) variable (X) and the other the dependent (=outcome) variable Y.


where SS_y = Σᵢ (yᵢ - ȳ)² = Σᵢ yᵢ² - n·ȳ² and SS_x = Σᵢ (xᵢ - x̄)² = Σᵢ xᵢ² - n·x̄²
Expected value of y at the level x = xᵢ:

ŷᵢ = α̂ + β̂xᵢ

Residual:

eᵢ = yᵢ - ŷᵢ = yᵢ - (α̂ + β̂xᵢ)
We fit the regression coefficients such that the sum of the
squared residuals is minimized (least squares regression).
Residual
Residual = observed value - predicted value

At 33.5 weeks gestation, the predicted baby weight is 3350 grams. This baby was actually 3380 grams, so his residual is +30 grams.

[Figure: Y = baby weights (g) versus X = gestation times (weeks), 20 to 40 weeks, showing the predicted value of 3350 grams at 33.5 weeks.]
The standard error of Y given X (denoted S_y/x) is the average variability around the
regression line at any given value of X. It is assumed to be equal at all values of X.
HOW TO START WITH LINEAR REGRESSION?
To see whether your data fit a regression model, it is wise to conduct a scatter
plot analysis.

The reason?

Regression analysis assumes a linear relationship. If you have a
curvilinear relationship or no relationship, regression analysis is of little
use.
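A minimal sketch of that first step, assuming the data sit in two NumPy arrays x and y (the gestation and birth-weight numbers below are simulated, not real data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(20, 40, size=50)            # e.g. gestation times (weeks), simulated
y = 100 * x + rng.normal(0, 400, size=50)   # e.g. birth weights (g), simulated

plt.scatter(x, y)                           # eyeball the cloud before fitting anything
plt.xlabel("X (predictor)")
plt.ylabel("Y (outcome)")
plt.title("Does the relationship look roughly linear?")
plt.show()
```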

Scatter Plots of Data with Various Correlation Coefficients
[Figure: scatter plots of Y versus X with r = -1, r = -.6, r = 0, r = +.3, r = +1, and r = 0.]
Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall

Linear Correlation
[Figure: examples of linear relationships versus curvilinear relationships.]
Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall

Linear Correlation
[Figure: examples of strong relationships versus weak relationships.]
Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall

Linear Correlation
[Figure: example of no relationship.]
Slide from: Statistics for Managers Using Microsoft Excel, 4th Edition, 2004, Prentice-Hall
What is Linear?
Remember this: Y = mX + B, where B is the intercept and m is the slope.

What's slope?
A slope of 2 means that every 1-unit change in X
yields a 2-unit change in Y.


Simple linear regression
The linear regression model:

Love of Math = 5 + .01*(math SAT score)

where 5 is the intercept and .01 is the slope (P = .22; not significant).
Prediction
If you know something about X, this knowledge helps you
predict something about Y. (Sound familiar? Sounds like
conditional probabilities?)

EXAMPLE
The distribution of baby weights at Stanford
~ N(3400, 360000)
Your best guess at a random baby's weight,
given no information about the baby, is what?
3400 grams
But what if you have relevant information? Can
you make a better guess?
Predictor variable
X=gestation time

Assume that babies that gestate for longer are born
heavier, all other things being equal.
Pretend (at least for the purposes of this example)
that this relationship is linear.
Example: suppose a one-week increase in gestation,
on average, leads to a 100-gram increase in birth-
weight
Y depends on X
[Figure: Y = birth-weight (g) versus X = gestation time (weeks).]
The best-fit line is chosen such that the sum of the squared
(why squared?) distances of the points (the Yᵢ's) from the line
is minimized. Or mathematically (remember maxima and minima
from calculus):

Derivative[Σᵢ (Yᵢ - (mXᵢ + b))²] = 0
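Spelling out that calculus step (a standard derivation, not shown on the slides): differentiate the sum of squared distances with respect to b and with respect to m, and set both derivatives to zero.

∂/∂b Σᵢ (Yᵢ - (mXᵢ + b))² = -2 Σᵢ (Yᵢ - (mXᵢ + b)) = 0
∂/∂m Σᵢ (Yᵢ - (mXᵢ + b))² = -2 Σᵢ Xᵢ (Yᵢ - (mXᵢ + b)) = 0

Solving these two equations simultaneously gives m = SS_xy / SS_x and b = Ȳ - m·X̄, the same slope and intercept given earlier.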
Prediction
A new baby is born that gestated for just
30 weeks. What's your best guess at the
birth-weight?
Are you still best off guessing 3400? NO!
[Figure: Y = birth-weight (g) versus X = gestation time (weeks); at 30 weeks the point (x, y) = (30, 3000) sits on the regression line.]
At 30 weeks
The babies that gestate for 30 weeks appear
to center around a weight of 3000 grams.

In math-speak: E(Y | X = 30 weeks) = 3000 grams. Note that this is a conditional expectation.
But...
Note that not every Y-value (Yᵢ) sits on the line.
There's variability.

Yᵢ = 3000 + random errorᵢ

In fact, babies that gestate for 30 weeks have
birth-weights that center at 3000 grams, but vary
around 3000 with some variance σ².
Approximately what distribution do birth-weights follow? Normal:

Y | X = 30 weeks ~ N(3000, σ²)


[Figure: Y = baby weights (g) versus X = gestation times (weeks) at X = 20, 30, and 40 weeks.]

If X = 20, 30, or 40:

Y | X = 40 weeks ~ N(4000, σ²)
Y | X = 30 weeks ~ N(3000, σ²)
Y | X = 20 weeks ~ N(2000, σ²)

Mean values fall on the line:

E(Y | X = 40 weeks) = 4000
E(Y | X = 30 weeks) = 3000
E(Y | X = 20 weeks) = 2000

E(Y | X) = μ_Y|X = 100 grams/week · X weeks



Linear Regression Model
The Y's are modeled as

Yᵢ = 100·Xᵢ + random errorᵢ

where 100·Xᵢ is fixed exactly on the line and the random error follows a normal distribution.
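A minimal simulation sketch of this model (the 100 g/week slope comes from the example; the error standard deviation below is an added assumption for illustration only): least squares should recover a slope close to 100.

```python
import numpy as np

rng = np.random.default_rng(42)
weeks  = rng.uniform(20, 40, size=200)                 # X = gestation time (weeks)
weight = 100 * weeks + rng.normal(0, 300, size=200)    # Y_i = 100*X_i + random error_i

slope, intercept = np.polyfit(weeks, weight, deg=1)    # least-squares line fit
print(f"estimated slope = {slope:.1f} g/week, intercept = {intercept:.1f} g")
# The slope should come out near 100 and the intercept near 0 for this model.
```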
Assumptions (or the fine print)
Linear regression assumes that
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same
(homogeneity of variances)

Why? The math requires it. The mathematical
process is called least squares because it fits the
regression line by minimizing the squared errors
from the line (mathematically easy, but not
general; it relies on the above assumptions).
Non-homogeneous variance
[Figure: Y = birth-weight (100 g) versus X = gestation time (weeks), illustrating non-homogeneous variance.]
Coefficient of Determination
Question: How well does the estimated regression line fit the data?

The coefficient of determination is a measure of goodness of fit: how well the estimated regression line fits the data.

Given an observation with values yᵢ and xᵢ, we put xᵢ into the equation and get

ŷᵢ = b₀ + b₁xᵢ

(yᵢ - ŷᵢ) is called the residual. It is the error in using ŷᵢ to estimate yᵢ.

SSE = Σᵢ (yᵢ - ŷᵢ)²

SSE, SST and SSR
SST: A measure of how well the observations cluster around ȳ.
SSE: A measure of how well the observations cluster around ŷ.

If x did not play any role in the value of y, then we should have SST = SSE.

If x plays the full role in the value of y, then SSE = 0.

SST = SSE + SSR

SSR: Sum of the squares due to regression

Coefficient of Determination for Goodness of Fit
SSE = SST - SSR

The largest possible value for SSE is SSE = SST.

SSE = SST  ⟹  SSR = 0

SSR/SST = 0  ⟹  the worst fit
SSR/SST = 1  ⟹  the best fit

R² = SSR/SST

R² = Coefficient of Determination
SSR = Sum of Squares due to Regression
SST = Total Sum of Squares
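A minimal sketch (made-up data, not from the slides) of the decomposition SST = SSR + SSE and of R² = SSR/SST, computed from the fitted values of a least-squares line:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)       # slope b1, intercept b0
y_hat = b0 + b1 * x                    # fitted values

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
sse = np.sum((y - y_hat) ** 2)         # error (residual) sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares

print(np.isclose(sst, sse + ssr))      # True: SST = SSE + SSR
print(ssr / sst)                       # R^2, the coefficient of determination
```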
REGRESSION COEFFICIENT
The regression coefficient is the slope of the regression line and tells you
the nature of the relationship between the variables: how much change in the
independent variable is associated with how much change in the dependent variable.

The larger the regression coefficient, the more change.
PEARSON'S r
To determine strength, you look at how closely the dots are clustered
around the line. The more tightly the cases are clustered, the stronger the
relationship; the more distant, the weaker.
Pearson's r ranges from -1 to +1, with 0 being no linear relationship
at all.
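A minimal sketch (made-up data) of computing Pearson's r with NumPy; values near -1 or +1 mean the dots cluster tightly around a line, values near 0 mean no linear relationship:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.8, 8.2, 9.9])

r = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
print(r)                      # close to +1 for this tightly clustered example
```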

READING THE TABLE
Model Summary

Model 1: R = .736(a); R Square = .542; Adjusted R Square = .532; Std. Error of the Estimate = 2760.003
a. Predictors: (Constant), Percent of Population 25 years and Over with Bachelor's Degree or More, March 2000 estimates
When you run a regression analysis in SPSS you get 3 tables. Each tells you something
about the relationship.
The first is the Model Summary.
The R is the Pearson product-moment correlation coefficient. In this case R is .736.
R is the square root of R-Square and is the correlation between the observed and
predicted values of the dependent variable.

R-Square
R-Square is the proportion of variance in the dependent variable (income per capita)
which can be predicted from the independent variable (level of education).

This value indicates that 54.2% of the variance in income can be predicted from the
variable education. Note that this is an overall measure of the strength of association,
and does not reflect the extent to which any particular independent variable is
associated with the dependent variable.

R-Square is also called the coefficient of determination.
Adjusted R-square
As predictors are added to the model, each predictor will explain some of the variance in
the dependent variable simply due to chance.

One could continue to add predictors to the model which would continue to improve the
ability of the predictors to explain the dependent variable, although some of this increase in
R-square would be simply due to chance variation in that particular sample.

The adjusted R-square attempts to yield a more honest value to estimate the R-square for
the population. The value of R-square was .542, while the value of Adjusted R-square was
.532. There isn't much difference because we are dealing with only one variable.

When the number of observations is small and the number of predictors is large, there will
be a much greater difference between R-square and adjusted R-square.

By contrast, when the number of observations is very large compared to the number of
predictors, the value of R-square and adjusted R-square will be much closer.
Difference between R² and Adjusted R²

R² tells how much of the variance in Y is accounted for by the regression
model from our sample.

Adjusted R² indicates the loss of predictive power, or shrinkage: how
much variance in Y would be accounted for if the model had been derived
from the population from which the sample was taken.
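The standard adjustment formula is not shown on the slides, but as a sketch it reproduces the table's value from R² = .542, n = 50 observations (total df = 49 in the ANOVA table below) and k = 1 predictor:

```python
r2, n, k = 0.542, 50, 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # adjusted R-square
print(round(adj_r2, 3))                         # about 0.532, matching the Model Summary
```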

ANOVA
ANOVA(b)

Model 1:
Regression: Sum of Squares = 4.32E+08, df = 1, Mean Square = 432493775.8, F = 56.775, Sig. = .000(a)
Residual: Sum of Squares = 3.66E+08, df = 48, Mean Square = 7617618.586
Total: Sum of Squares = 7.98E+08, df = 49

a. Predictors: (Constant), Percent of Population 25 years and Over with Bachelor's Degree or More, March 2000 estimates
b. Dependent Variable: Personal Income Per Capita, current dollars, 1999
The p-value associated with this F value is very small (0.0000).

These values are used to answer the question "Do the independent variables reliably
predict the dependent variable?".

The p-value is compared to your alpha level (typically 0.05) and, if smaller, you can
conclude "Yes, the independent variables reliably predict the dependent variable".

If the p-value were greater than 0.05, you would say that the group of independent
variables does not show a statistically significant relationship with the dependent
variable, or that the group of independent variables does not reliably predict the
dependent variable.
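As a quick check on the table above, the F statistic is the regression mean square divided by the residual mean square:

```python
ms_regression = 432493775.8        # Sum of Squares 4.32E+08 over df = 1
ms_residual   = 7617618.586        # Sum of Squares 3.66E+08 over df = 48
print(round(ms_regression / ms_residual, 3))   # about 56.775, the reported F value
```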
ASSUMPTIONS OF LINEAR REGRESSION MODEL
1. Linear Functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification of the model (no omitted
variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
EXPLANATION OF THE ASSUMPTIONS
1. Linear functional form
   1. Does not detect curvilinear relationships
2. Independent observations
   1. Representative samples
   2. Autocorrelation inflates the t, r, and F statistics and warps the
      significance tests
3. Normality of the residuals
   1. Permits proper significance testing
4. Equality of variance
   1. Heteroskedasticity precludes generalization and external validity
   2. This also warps the significance tests
5. Multicollinearity prevents proper parameter estimation. It may also preclude
   computation of the parameter estimates completely if it is serious enough.
6. Outlier distortion may bias the results: if outliers have high influence and the
   sample is not large enough, they may seriously bias the parameter estimates.
DIAGNOSTIC TESTS FOR THE REGRESSION ASSUMPTIONS
1. Linearity tests: regression curve fitting
   1. No level shifts: one regime

2. Independence of observations: runs test

3. Normality of the residuals: Shapiro-Wilk or Kolmogorov-Smirnov test (see the sketch below)

4. Homogeneity of variance of the residuals: White's general specification test

5. No autocorrelation of residuals: Durbin-Watson test, or ACF or PACF of residuals

6. Multicollinearity: correlation matrix of independent variables; condition index
   or condition number

7. No serious outlier influence: tests of additive outliers: pulse dummies.
   1. Plot residuals and look for high leverage of residuals
   2. Lists of standardized residuals
   3. Lists of studentized residuals
   4. Cook's distance or leverage statistics
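A minimal sketch of a few of these diagnostics in Python (an assumption on my part: the course used SPSS, so this is only an illustration), using statsmodels and SciPy on simulated data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)             # simulated data for illustration

model = sm.OLS(y, sm.add_constant(x)).fit()  # ordinary least squares fit
resid = model.resid

print(stats.shapiro(resid))                          # Shapiro-Wilk: normality of residuals
print(durbin_watson(resid))                          # Durbin-Watson: autocorrelation of residuals
print(model.get_influence().cooks_distance[0][:5])   # Cook's distance for the first 5 observations
```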
