
Regression Analysis

Scatter Plots and Correlation

A scatter plot (or scatter diagram) is used to show the relationship between two variables
Correlation analysis is used to measure the strength of the linear association between two variables
Only concerned with the strength of the relationship
No causal effect is implied
Examples

Salary and number of years of experience
Household income and expenditure
Price and supply of commodities
Amount of rainfall and yield of crops
Price and demand of goods
Weight and blood pressure
Sales and GDP
Scatter Plot Examples
[Figure: example scatter plots of y against x, grouped as linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship]
Correlation Coefficient
The population correlation coefficient ($\rho$) measures the strength of the linear association between the variables
The sample correlation coefficient ($r$) is an estimate of $\rho$ and is used to measure the strength of the linear relationship in the sample observations
Calculating sample Correlation Coefficient

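$r_{xy} = \dfrac{\mathrm{cov}(x, y)}{s_x s_y}$

$\mathrm{cov}(x, y) = \dfrac{1}{n} \sum_i (x_i - \bar{x})(y_i - \bar{y})$

$s_x = \sqrt{\dfrac{1}{n} \sum_i (x_i - \bar{x})^2}, \quad s_y = \sqrt{\dfrac{1}{n} \sum_i (y_i - \bar{y})^2}$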
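The calculation above can be sketched in plain Python. This is a minimal illustration, not part of the original slides: the function name `sample_correlation` and the five data points are hypothetical, and the 1/n divisor follows the formulas on this slide.

```python
import math

def sample_correlation(x, y):
    """Sample correlation r_xy = cov(x, y) / (s_x * s_y), using 1/n divisors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / n)
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / n)
    return cov_xy / (s_x * s_y)

# Hypothetical illustration data (not from the slides)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = sample_correlation(x, y)
print(round(r, 4))  # 0.7746
```

Because the same divisor appears in the covariance and in both standard deviations, it cancels, so using n - 1 throughout would give the same r.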
Features of correlation coefficient
Unit free
Range between -1.00 and 1.00
$-1 \le \rho < 0$ implies that as X increases (↑), Y decreases (↓)
$0 < \rho \le 1$ implies that as X increases (↑), Y increases (↑)
The closer to -1.00, the stronger the negative linear relationship
The closer to 1.00, the stronger the positive linear relationship
The closer to 0.00, the weaker the linear relationship
$\rho = 0$ implies that X and Y are not linearly associated
Significance Test for Correlation

Hypotheses
$H_0: \rho = 0$ (no linear correlation)
$H_1: \rho \ne 0$ (linear correlation)
Significance test for Correlation

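Test statistic:

$t_{obs} = \dfrac{r \sqrt{n-2}}{\sqrt{1 - r^2}} \sim t_{n-2}$, under $H_0$

Critical Region:
$\{t_{obs} > t_{\alpha; n-2}\}$ (for $H_1: \rho > 0$)
$\{t_{obs} < -t_{\alpha; n-2}\}$ (for $H_1: \rho < 0$)
$\{|t_{obs}| > t_{\alpha/2; n-2}\}$ (for $H_1: \rho \ne 0$)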
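As a rough sketch of the two-sided test, the statistic can be computed and compared with a tabulated critical value. The values r = 0.6 and n = 27 are hypothetical, and the critical value 2.060 is the usual table entry for a t distribution with 25 degrees of freedom at the 5% two-sided level, supplied by hand because the Python standard library has no t distribution.

```python
import math

def correlation_t_statistic(r, n):
    """t_obs = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical sample: r = 0.6 from n = 27 observations
r, n = 0.6, 27
t_obs = correlation_t_statistic(r, n)

# Assumed table value t_{0.025; 25} for a two-sided test at alpha = 0.05
t_crit = 2.060
reject_h0 = abs(t_obs) > t_crit
print(round(t_obs, 2), reject_h0)  # 3.75 True
```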
What is Regression
Regression is a tool for establishing whether an association exists between a dependent variable (Y) and one or more independent variables ($X_1, X_2, \ldots, X_k$) in a study.
The relationship can be linear or non-linear.
Mathematical vs Statistical Relationship
Mathematical Relationship is exact

$y = \beta_0 + \beta_1 x$
Statistical Relationship is not exact

$y = \beta_0 + \beta_1 x + \varepsilon$
Nomenclature in Regression
A dependent variable (response variable)
measures an outcome of a study (also called
outcome variable).
An independent variable (explanatory
variable) explains changes in a response
variable.
In regression, values of the explanatory variable are often set to see how they affect the response variable (that is, to predict the response variable)
Population Linear Regression

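The population regression model:

$y = \beta_0 + \beta_1 x + \varepsilon$

where
y = dependent variable
$\beta_0$ = population y intercept
$\beta_1$ = population slope coefficient
x = independent variable
$\varepsilon$ = random error term, or residual

Linear (systematic) component: $\beta_0 + \beta_1 x$
Random error component: $\varepsilon$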
Linear Regression Assumptions

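Distribution of error: $\varepsilon_i \sim N(0, \sigma_e^2)$
i.e. $E(\varepsilon_i) = 0$, $\mathrm{var}(\varepsilon_i) = \sigma_e^2$, $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \ne j$
i.e. error values ($\varepsilon$) are statistically independent
The probability distribution of the errors has constant variance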
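To make the assumptions concrete, the sketch below generates data from a population model with independent normal errors of constant variance. Everything here is illustrative: the function name `simulate_population_model`, the parameter values (intercept 5, slope 2, error standard deviation 1), and the seed are all hypothetical choices, not values from the slides.

```python
import random

def simulate_population_model(x_values, beta0, beta1, sigma_e, seed=0):
    """Draw y = beta0 + beta1 * x + error, with independent N(0, sigma_e^2) errors."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    # One independent draw per observation, same sigma_e for every x (constant variance)
    errors = [rng.gauss(0.0, sigma_e) for _ in x_values]
    y_values = [beta0 + beta1 * x + e for x, e in zip(x_values, errors)]
    return y_values, errors

# Hypothetical parameter values for illustration
x_values = [i / 10 for i in range(1000)]
y_values, errors = simulate_population_model(x_values, beta0=5.0, beta1=2.0, sigma_e=1.0)

mean_error = sum(errors) / len(errors)
print(round(mean_error, 3))  # close to 0, as E(error) = 0 suggests
```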
Population Linear Regression

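[Figure: the population regression line $y = \beta_0 + \beta_1 x$, with intercept $\beta_0$ and slope $\beta_1$. For a given $x_i$, the observed value of y differs from the predicted value of y by the random error $\varepsilon_i$ for this x value.]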
Estimated Regression Model
The sample regression line provides an estimate of the
population regression line

$\hat{y} = b_0 + b_1 x$

where
$\hat{y}$ = estimated (or predicted) y value
$b_0$ = estimate of the regression intercept
$b_1$ = estimate of the regression slope
x = independent variable
Estimation of parameters
Least squares method of estimation
Confidence interval
Prediction interval
p-value
Interpretation of the Slope and the Intercept

b0 is the estimated average value of y when the value of x is zero
b1 is the estimated change in the average value of y as a result of a one-unit change in x
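The slides name the least squares method without writing out the estimates. A minimal sketch, assuming the standard closed-form solution $b_1 = SS_{xy}/SS_x$ and $b_0 = \bar{y} - b_1 \bar{x}$, is given below; the function name and the data are hypothetical.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) minimising the sum of squared residuals for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_x          # estimated slope
    b0 = y_bar - b1 * x_bar    # estimated intercept
    return b0, b1

# Hypothetical illustration data (not the Advertising data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares_fit(x, y)
print(round(b0, 2), round(b1, 2))  # 2.2 0.6
```

Read with the interpretation above: the estimated average y at x = 0 is 2.2, and each one-unit increase in x is associated with an estimated 0.6 increase in average y.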
Advertising data
The Advertising data set consists of the sales (in thousands of
units) of a particular product in 200 different markets, along with
advertising budgets (in thousands of dollars) for the product in
each of those markets.
Let us first check how sales are related to ad expenditure.
Simple Linear Regression

Is there a relationship between advertising budget and sales?
How strong is the relationship between advertising budget and sales?
Can we forecast sales on the basis of ad budget?
Scatter Diagram
Correlation coefficient
Test of correlation coefficient
Interpretation of regression coefficients and
corresponding s.e.
Confidence interval of parameters
p-value of t-tests
Test for Significance

To test for a significant regression relationship, we test the intercept parameter b0, the slope parameter b1, and the predicted y

The test commonly used is the t test

The t test requires an estimate of $\sigma_e^2$, the variance of the error in the regression model.
Testing for Significance
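An Estimate of $\sigma_e^2$
The mean square error (MSE) provides an unbiased estimator of $\sigma_e^2$, given as

$s^2 = MSE = SSE/(n - 2)$

where:

$SSE = \sum_i (y_i - \hat{y}_i)^2 = SS_y - \dfrac{(SS_{xy})^2}{SS_x} = SS_y - b_1 SS_{xy}$

with $SS_x = \sum_i (x_i - \bar{x})^2$, $SS_y = \sum_i (y_i - \bar{y})^2$ and $SS_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$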
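The two routes to SSE (summing squared residuals directly, or using the sums of squares) can be checked against each other. The sketch below uses hypothetical data and assumes the usual least squares slope $b_1 = SS_{xy}/SS_x$.

```python
# Hypothetical illustration data (not the Advertising data)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
ss_y = sum((yi - y_bar) ** 2 for yi in y)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

# Route 1: sum of squared residuals
sse_direct = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
# Route 2: shortcut SSE = SS_y - b1 * SS_xy
sse_shortcut = ss_y - b1 * ss_xy

mse = sse_direct / (n - 2)  # estimate of the error variance
print(round(sse_direct, 4), round(sse_shortcut, 4), round(mse, 4))  # 2.4 2.4 0.8
```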
Testing for slope parameter
Hypotheses
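$H_0: \beta_1 = \beta_{10}$

$H_1: \beta_1 \ne \beta_{10}$

Test Statistic, under $H_0$

$t_{obs} = \dfrac{b_1 - \beta_{10}}{s_{b_1}}$, where $s_{b_1} = \dfrac{s}{\sqrt{\sum_i (x_i - \bar{x})^2}}$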
Testing for intercept parameter
Hypotheses
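$H_0: \beta_0 = \beta_{00}$

$H_1: \beta_0 \ne \beta_{00}$

Test Statistic, under $H_0$

$t_{obs} = \dfrac{b_0 - \beta_{00}}{s_{b_0}}$, where $s_{b_0} = s \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}}$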
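A small sketch of both test statistics follows, testing each parameter against a hypothesised value of zero. The helper name `coefficient_t_statistics` and the data are hypothetical, and s is taken as the square root of MSE = SSE/(n - 2).

```python
import math

def coefficient_t_statistics(x, y, beta00=0.0, beta10=0.0):
    """Return (t for intercept, t for slope) for the fitted line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))                      # residual standard deviation
    s_b1 = s / math.sqrt(ss_x)                        # standard error of the slope
    s_b0 = s * math.sqrt(1 / n + x_bar ** 2 / ss_x)   # standard error of the intercept
    return (b0 - beta00) / s_b0, (b1 - beta10) / s_b1

# Hypothetical illustration data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_b0, t_b1 = coefficient_t_statistics(x, y)
print(round(t_b0, 3), round(t_b1, 3))  # 2.345 2.121
```

Each statistic is then compared with a t distribution on n - 2 degrees of freedom (here 3).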
Testing for Significance: t Test
Critical Region

Reject $H_0$ if p-value < $\alpha$
or $t_{obs} < -t_{\alpha/2; n-2}$ or $t_{obs} > t_{\alpha/2; n-2}$

where:
$t_{\alpha/2; n-2}$ is based on a t distribution with n - 2 degrees of freedom
Testing for Significance: Example

1. Determine the hypotheses. $H_0: \beta_1 = 0$, $H_a: \beta_1 \ne 0$
2. Specify the level of significance. $\alpha = 0.05$
3. Select the test statistic. $t = b_1 / s_{b_1}$
4. State the rejection rule. Reject $H_0$ if p-value < 0.05

Testing for Significance: t Test

5. Compute the value of the test statistic. $t = b_1 / s_{b_1} = 0.048687879 / 0.001982108 = 24.56369137$

6. Determine whether to reject $H_0$. Check the p-value or compare the test statistic with the critical value

Confidence Interval for $\beta_1$
The form of a confidence interval for $\beta_1$ is:

$b_1 \pm t_{\alpha/2} \, s_{b_1}$

where $b_1$ is the point estimator, $t_{\alpha/2} \, s_{b_1}$ is the margin of error, and $t_{\alpha/2}$ is the t value providing an area of $\alpha/2$ in the upper tail of a t distribution with n - 2 degrees of freedom
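As a rough illustration, the slope estimate and standard error from the significance-test example (0.048687879 and 0.001982108) can be turned into a 95% interval. The critical value 1.972 is an assumed, approximate table value for 198 degrees of freedom (n = 200 markets), entered by hand; the function name is hypothetical.

```python
def confidence_interval(estimate, std_error, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * std_error."""
    margin = t_crit * std_error  # margin of error
    return estimate - margin, estimate + margin

b1 = 0.048687879     # slope estimate from the example
s_b1 = 0.001982108   # its standard error from the example
t_crit = 1.972       # assumed approximate t_{0.025; 198} from a t table

lower, upper = confidence_interval(b1, s_b1, t_crit)
print(round(lower, 4), round(upper, 4))
```

The interval excludes zero, which agrees with rejecting $H_0: \beta_1 = 0$ at the 5% level.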
Confidence Interval for $\beta_0$
The form of a confidence interval for $\beta_0$ is:

$b_0 \pm t_{\alpha/2} \, s_{b_0}$

where $b_0$ is the point estimator, $t_{\alpha/2} \, s_{b_0}$ is the margin of error, and $t_{\alpha/2}$ is the t value providing an area of $\alpha/2$ in the upper tail of a t distribution with n - 2 degrees of freedom
Prediction Interval of E(yp/x)
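$\hat{y}_p \pm t_{\alpha/2} \, s_p$
where

$s_p = s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$

where:
the confidence coefficient is $1 - \alpha$ and
$t_{\alpha/2}$ is based on a t distribution with n - 2 degrees of freedom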
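The interval above can be sketched as follows. The data, the function name `prediction_interval`, and the point x_p = 3 are hypothetical; the critical value 3.182 is the usual table entry for 3 degrees of freedom at the 5% two-sided level and is passed in by hand.

```python
import math

def prediction_interval(x, y, x_p, t_crit):
    """Return (y_hat_p, s_p, lower, upper) for a prediction at x_p."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    y_hat_p = b0 + b1 * x_p
    # s_p grows as x_p moves away from the mean of x
    s_p = s * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / ss_x)
    return y_hat_p, s_p, y_hat_p - t_crit * s_p, y_hat_p + t_crit * s_p

# Hypothetical illustration data; t_{0.025; 3} = 3.182 taken from a t table
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat_p, s_p, lower, upper = prediction_interval(x, y, x_p=3, t_crit=3.182)
print(round(y_hat_p, 2), round(s_p, 4))  # 4.0 0.9798
```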
Assessing Model Accuracy

2
Residual Standard Error (interpretation?)
F Statistic
Coefficient of Determination
Relationship Among SST, SSR, SSE
SST = SSR + SSE

$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2$

where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
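The decomposition can be verified numerically for a least squares fit. The data below are hypothetical and the fit uses the standard closed-form estimates.

```python
# Hypothetical illustration data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]  # fitted values

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # due to regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # due to error

print(round(sst, 4), round(ssr, 4), round(sse, 4))  # 6.0 3.6 2.4
```

The identity holds for a least squares line that includes an intercept; it is not guaranteed for an arbitrary line through the data.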
Goodness of fit of regression
Coefficient of Determination
A fitted model can be said to be good when the residuals are small. Since SSE is based on the residuals, a measure of the quality of the fitted model can be based on SSE relative to SST.
R2 is a measure of relative fit based on a comparison of SSR and SST
$R^2 = r^2 = SSR/SST = 1 - SSE/SST$
where:
SSR = sum of squares due to regression
SST = total sum of squares
A value of $R^2$ closer to 1 indicates a better fit, and a value closer to zero indicates a poor fit.
Coefficient of Determination (example)

$R^2 = SSR/SST = 4078.704774 / 5417.14875 = 0.75292464$

The regression relationship is strong; 75.3% of the variability in the sales of the item can be explained by the linear model between the sales and the ad expenditure
or
75.3% of the variability in the sales is explained by ad expenditure
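The arithmetic in this example can be reproduced directly from the two sums of squares quoted above; the variable names are illustrative.

```python
ssr = 4078.704774   # sum of squares due to regression (from the example)
sst = 5417.14875    # total sum of squares (from the example)

r_squared = ssr / sst
sse = sst - ssr     # follows from SST = SSR + SSE

print(round(r_squared, 4))      # 0.7529
print(round(1 - sse / sst, 4))  # same value via 1 - SSE/SST
```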
Residual Standard Error

$RSE = \sqrt{MSE} = \sqrt{\dfrac{SSE}{n - 2}}$

Interpretation: Here the RSE is 2.6. In other words, actual sales in each market deviate from the true regression line by approximately 2,600 units, on average. Another way to think about this is that even if the model were correct and the true values of the unknown coefficients were known exactly, any prediction of sales on the basis of TV advertising would still be off by about 2,600 units on average
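The value 2.6 can be recovered from the sums of squares in the coefficient of determination example, assuming n = 200 markets as stated for the Advertising data and SSE = SST - SSR.

```python
import math

sst = 5417.14875    # total sum of squares (from the R2 example)
ssr = 4078.704774   # sum of squares due to regression (from the R2 example)
n = 200             # number of markets in the Advertising data

sse = sst - ssr                  # sum of squares due to error
rse = math.sqrt(sse / (n - 2))   # residual standard error

print(round(rse, 2))  # 2.6
```

Sales are recorded in thousands of units, so an RSE of 2.6 corresponds to roughly 2,600 units.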
Multiple Regression
Example
Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product.

The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper

Response or dependent variable?
Predictors or independent variable(s)?
Common questions in regression

Which predictors are associated with the response?

What is the relationship between the response and each predictor?

Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Advertising data
One may be interested in answering questions
such as:
Which media contribute to sales?
Which media generate the biggest boost in
sales? or
How much increase in sales is associated with
a given increase in TV advertising?
Multiple Regression
The model is

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$


Scatter matrix
Correlation matrix
Test of correlation coefficients
Interpretation of regression coefficients and
corresponding s.e.
Confidence interval of parameters
p-value of t-tests
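The slides do not say how the coefficients of the multiple regression model are estimated. One common approach, shown here only as a sketch, is least squares via the normal equations $(A^T A) b = A^T y$, solved with Gaussian elimination. The function name, the two-predictor data, and the noise-free response are all hypothetical, and the code assumes the predictors are not perfectly collinear.

```python
def fit_multiple_regression(X, y):
    """Least squares coefficients [b0, b1, ..., bk] via the normal equations."""
    rows = [[1.0] + [float(v) for v in row] for row in X]  # prepend intercept column
    p = len(rows[0])
    # Augmented matrix [A'A | A'y]
    aug = []
    for i in range(p):
        line = [sum(r[i] * r[j] for r in rows) for j in range(p)]
        line.append(sum(r[i] * yi for r, yi in zip(rows, y)))
        aug.append(line)
    # Forward elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(col + 1, p):
            factor = aug[k][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[k][c] -= factor * aug[col][c]
    # Back substitution
    coeffs = [0.0] * p
    for i in range(p - 1, -1, -1):
        known = sum(aug[i][j] * coeffs[j] for j in range(i + 1, p))
        coeffs[i] = (aug[i][p] - known) / aug[i][i]
    return coeffs

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [9, 8, 19, 18, 26]
coeffs = fit_multiple_regression(X, y)
print([round(c, 6) for c in coeffs])
```

With real data such as the Advertising set, each row of X would hold the TV, radio, and newspaper budgets for one market; a numerical library would normally be preferred over hand-written elimination.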
