Introduction to Linear Regression


Scatter Plots and Correlation


A scatter plot (or scatter diagram) is used to show the relationship between two variables

Correlation analysis is used to measure the strength of the association (linear relationship) between two variables

Only concerned with the strength of the relationship

No causal effect is implied

Scatter Plot Examples

[Figure: example scatter plots of y against x showing linear relationships, curvilinear relationships, strong relationships, weak relationships, and no relationship.]

Correlation Coefficient

The population correlation coefficient ρ (rho) measures the strength of the association between the variables

The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations

Features of ρ and r

Unit free
Range between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
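The "unit free" and "range" features can be checked numerically. The sketch below uses the tree measurements from the deck's calculation example; the rescaling factors 2.54 and 0.3048 are arbitrary illustrative choices (the deck does not state the measurement units), and the variable names are assumptions.

```r
# Tree measurements from the deck's calculation example
x <- c(8, 9, 7, 6, 13, 7, 11, 12)        # trunk diameter
y <- c(35, 49, 27, 33, 60, 21, 45, 51)   # tree height

r_original <- cor(x, y)

# Unit free: rescaling either variable by a positive constant
# (arbitrary factors here, standing in for a change of units) leaves r unchanged
r_rescaled <- cor(2.54 * x, 0.3048 * y)

# Reversing the direction of one variable flips the sign of r
r_flipped <- cor(x, -y)

print(c(original = r_original, rescaled = r_rescaled, flipped = r_flipped))
```

The first two values agree, and the third has the same magnitude with the opposite sign, so r stays within -1 and 1 in every case.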

Examples of Approximate r Values

[Figure: five scatter plots of y against x illustrating r = -1, r = -0.6, r = 0, r = +0.3, and r = +1.]

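Calculating the Correlation Coefficient

Sample correlation coefficient:

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}} $$

or the algebraic equivalent:

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} $$

where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent variable
y = Value of the dependent variable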

Calculation Example

Tree Height (y)   Trunk Diameter (x)   xy       y²        x²
35                8                    280      1225      64
49                9                    441      2401      81
27                7                    189      729       49
33                6                    198      1089      36
60                13                   780      3600      169
21                7                    147      441       49
45                11                   495      2025      121
51                12                   612      2601      144
Σ = 321           Σ = 73               Σ = 3142 Σ = 14111 Σ = 713

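Calculation Example (continued)

$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} = \frac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}} = 0.886 $$

r = 0.886 indicates a relatively strong positive linear association between x and y

[Figure: scatter plot of Tree Height, y (axis 0 to 70) against Trunk Diameter, x (axis 0 to 14).]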
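The hand calculation can be reproduced in R. This is a minimal sketch that evaluates both forms of the sample correlation formula on the tree data and compares them with R's built-in `cor()`; the variable names are assumptions, not part of the deck.

```r
# Tree data from the calculation example
x <- c(8, 9, 7, 6, 13, 7, 11, 12)        # trunk diameter
y <- c(35, 49, 27, 33, 60, 21, 45, 51)   # tree height
n <- length(x)

# Deviation form: sum of cross-deviations over the root of the
# product of the two sums of squared deviations
r_dev <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Algebraic equivalent, using the column totals from the table
r_alg <- (n * sum(x * y) - sum(x) * sum(y)) /
  sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2))

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

print(round(c(deviation = r_dev, algebraic = r_alg, builtin = r_builtin), 3))
```

All three values round to 0.886, matching the slide.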

Excel Output

Excel correlation output (Tools / Data Analysis / Correlation), showing the correlation between Tree Height and Trunk Diameter:

                 Tree Height   Trunk Diameter
Tree Height      1
Trunk Diameter   0.886231      1
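A comparable correlation matrix can be produced in R by passing a data frame to `cor()`. This is a sketch rather than a reproduction of the Excel tool; the data frame and column names are assumptions. Unlike the Excel output, R fills in both triangles of the matrix.

```r
# Tree data from the calculation example, as a data frame
trees_df <- data.frame(
  tree_height    = c(35, 49, 27, 33, 60, 21, 45, 51),
  trunk_diameter = c(8, 9, 7, 6, 13, 7, 11, 12)
)

# Pairwise Pearson correlations between all columns
corr_matrix <- cor(trees_df)

print(round(corr_matrix, 6))
```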

Introduction to Regression Analysis

Regression analysis is used to:
Predict the value of a dependent variable based on the value of at least one independent variable
Explain the impact of changes in an independent variable on the dependent variable

Dependent variable: the variable we wish to explain

Independent variable: the variable used to explain the dependent variable

Simple Linear Regression Model

Only one independent variable, x

Relationship between x and y is described by a linear function

Changes in y are assumed to be caused by changes in x

Types of Regression Models

[Figure: four scatter plots illustrating a positive linear relationship, a negative linear relationship, a relationship that is NOT linear, and no relationship.]

Population Linear Regression

The population regression model:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

where:
y = Dependent variable
x = Independent variable
β0 = Population y intercept
β1 = Population slope coefficient
ε = Random error term, or residual

β0 + β1x is the linear component; ε is the random error component

Linear Regression Assumptions

Error values (ε) are statistically independent
Error values are normally distributed for any given value of x
The probability distribution of the errors is normal
The probability distribution of the errors has constant variance
The underlying relationship between the x variable and the y variable is linear
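One way to see what these assumptions describe is to simulate data that satisfies them. The sketch below is illustrative only: the parameter values (`beta0`, `beta1`, `sigma`), the sample size, the range of x, and the random seed are all hypothetical choices, not values from the deck.

```r
set.seed(1)          # arbitrary seed, for reproducibility only

# Hypothetical population parameters (not from the deck)
beta0 <- 10          # population intercept
beta1 <- 2           # population slope
sigma <- 3           # constant error standard deviation
n     <- 200         # sample size

x   <- runif(n, min = 0, max = 20)     # independent variable
eps <- rnorm(n, mean = 0, sd = sigma)  # independent, normal, constant-variance errors
y   <- beta0 + beta1 * x + eps         # linear component plus random error

# Fitting a line to the simulated sample should give estimates
# close to the hypothetical population values
fit <- lm(y ~ x)
print(coef(fit))
```

Because every assumption holds by construction, the fitted intercept and slope land close to 10 and 2.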

Population Linear Regression (continued)

[Figure: the population regression line y = β0 + β1x + ε, with Intercept = β0 and Slope = β1. For a given xi, the plot marks the observed value of y, the predicted value of y on the line, and the random error εi for this x value.]

Estimated Regression Model

The sample regression line provides an estimate of the population regression line:

$$ \hat{y}_i = b_0 + b_1 x $$

where:
ŷi = Estimated (or predicted) y value
b0 = Estimate of the regression intercept
b1 = Estimate of the regression slope
x = Independent variable

The individual random error terms ei have a mean of zero
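The zero-mean property of the residuals can be checked on a fitted line. This is a small sketch using the tree data from the correlation example and R's `lm()`; the variable names are assumptions.

```r
# Tree data from the correlation example
x <- c(8, 9, 7, 6, 13, 7, 11, 12)        # trunk diameter
y <- c(35, 49, 27, 33, 60, 21, 45, 51)   # tree height

fit   <- lm(y ~ x)       # sample regression line
y_hat <- fitted(fit)     # estimated (predicted) y values
e     <- residuals(fit)  # e_i = y_i - y_hat_i

# The mean residual is zero up to floating-point rounding
print(mean(e))
```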

Least Squares Criterion

b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals:

$$ \sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2 $$
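The criterion can be written directly as a function of b0 and b1. In this sketch, `sse()` is a hypothetical helper, the tree data comes from the correlation example, and the perturbation sizes are arbitrary. It compares the sum of squared residuals at the `lm()` estimates with nearby candidate lines.

```r
# Tree data from the correlation example
x <- c(8, 9, 7, 6, 13, 7, 11, 12)        # trunk diameter
y <- c(35, 49, 27, 33, 60, 21, 45, 51)   # tree height

# Sum of squared residuals for a candidate line (hypothetical helper)
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

fit <- lm(y ~ x)
b   <- coef(fit)                 # least squares estimates (b0, b1)
sse_min <- sse(b[[1]], b[[2]])

# Moving either coefficient away from the estimate increases the SSE
sse_nearby <- c(
  sse(b[[1]] + 1, b[[2]]),
  sse(b[[1]] - 1, b[[2]]),
  sse(b[[1]], b[[2]] + 0.1),
  sse(b[[1]], b[[2]] - 0.1)
)

print(sse_min)
print(sse_nearby)
```

Every perturbed line has a larger sum of squared residuals than the least squares line.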

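The Least Squares Equation

The formulas for b1 and b0 are:

$$ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$

algebraic equivalent:

$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} $$

and

$$ b_0 = \bar{y} - b_1 \bar{x} $$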
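These formulas can be evaluated step by step. The sketch below applies them to the house price sample used later in the deck (price in $1000s, size in square feet); the variable names are assumptions.

```r
# House price sample from the deck's regression example
sqft   <- c(1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700)
hprice <- c(245, 312, 279, 308, 199, 219, 405, 324, 319, 255)
n <- length(sqft)

# Slope, deviation form
b1_dev <- sum((sqft - mean(sqft)) * (hprice - mean(hprice))) /
  sum((sqft - mean(sqft))^2)

# Slope, algebraic equivalent
b1_alg <- (sum(sqft * hprice) - sum(sqft) * sum(hprice) / n) /
  (sum(sqft^2) - sum(sqft)^2 / n)

# Intercept from the sample means and the slope
b0 <- mean(hprice) - b1_alg * mean(sqft)

print(round(c(b0 = b0, b1 = b1_alg), 5))
```

Both slope formulas agree, and the results round to the slope of 0.10977 and intercept of 98.24833 reported in the house price example.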

Interpretation of the Slope and the Intercept

b0 is the estimated average value of y when the value of x is zero

b1 is the estimated change in the average value of y as a result of a one-unit change in x

Finding the Least Squares Equation

The coefficients b0 and b1 will usually be found using computer software, such as Excel or Minitab

Other regression measures will also be computed as part of computer-based regression analysis

Simple Linear Regression Example

A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)

A random sample of 10 houses is selected

Dependent variable (y) = house price in $1000s

Independent variable (x) = square feet

Sample Data for House Price Model

House Price in $1000s (y)   Square Feet (x)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700

Graphical Presentation

House price model: scatter plot and regression line

[Figure: scatter plot of House Price ($1000s), axis 0 to 450, against Square Feet, axis 0 to 3000, with the fitted regression line. Slope = 0.10977, Intercept = 98.248.]

house price = 98.24833 + 0.10977 (square feet)

Interpretation of the Intercept, b0

house price = 98.24833 + 0.10977 (square feet)

b0 is the estimated average value of y when the value of x is zero (if x = 0 is in the range of observed x values)

Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet

Interpretation of the Slope Coefficient, b1

house price = 98.24833 + 0.10977 (square feet)

b1 measures the estimated change in the average value of y as a result of a one-unit change in x

Here, b1 = 0.10977 tells us that the average value of a house increases by 0.10977($1000) = $109.77 for each additional square foot of size
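The slope interpretation can be illustrated by comparing predictions for two houses that differ by one square foot. This is a sketch using `lm()` and `predict()` on the deck's sample data; the sizes 1500 and 1501 are arbitrary values inside the observed range, and the variable names are assumptions.

```r
# House price sample from the deck (price in $1000s, size in square feet)
sqft   <- c(1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700)
hprice <- c(245, 312, 279, 308, 199, 219, 405, 324, 319, 255)

fit <- lm(hprice ~ sqft)
b1  <- coef(fit)[["sqft"]]   # estimated slope

# Predicted prices for two houses one square foot apart (arbitrary sizes)
pred <- predict(fit, newdata = data.frame(sqft = c(1500, 1501)))
diff_per_sqft <- unname(pred[2] - pred[1])

# The difference equals the slope; times 1000 converts $1000s to dollars
print(round(1000 * diff_per_sqft, 2))
```

The printed value is the estimated dollar change in average price per additional square foot.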

Example: House Prices

Estimated Regression Equation:

house price = 98.25 + 0.1098 (sq.ft.)

Predict the price for a house with 2000 square feet

Example: House Prices (continued)

Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq.ft.) = 98.25 + 0.1098(2000) = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850
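The same prediction can be obtained from a fitted model in R. This is a sketch, assuming the model is fitted with `lm()` as in the deck's R example. `predict()` uses the unrounded coefficients, so its result differs slightly from the slide's 317.85, which comes from the rounded values 98.25 and 0.1098.

```r
# House price sample from the deck (price in $1000s, size in square feet)
sqft   <- c(1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700)
hprice <- c(245, 312, 279, 308, 199, 219, 405, 324, 319, 255)

fit <- lm(hprice ~ sqft)

# Prediction with the full-precision coefficients
price_2000 <- unname(predict(fit, newdata = data.frame(sqft = 2000)))

# Prediction with the rounded coefficients shown on the slide
price_rounded <- 98.25 + 0.1098 * 2000

print(round(c(full_precision = price_2000, rounded = price_rounded), 2))
```

The full-precision prediction is about 317.78 ($1000s), against 317.85 with the rounded coefficients.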

Summary

Introduced correlation analysis
Discussed correlation to measure the strength of a linear association
Introduced simple linear regression analysis
Calculated the coefficients for the simple linear regression equation
Described inference about the slope
Addressed estimation of mean values and prediction of individual values
Discussed residual analysis

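R software regression

# House price ($1000s) and square feet, stored as (price, sqft) pairs
yx <- c(245, 1400, 312, 1600, 279, 1700, 308, 1875, 199, 1100,
        219, 1550, 405, 2350, 324, 2450, 319, 1425, 255, 1700)
mx <- matrix(yx, 10, 2, byrow = TRUE)  # one row per house
hprice <- mx[, 1]                      # column 1: house price
sqft <- mx[, 2]                        # column 2: square feet
reg1 <- lm(hprice ~ sqft)              # fit the simple linear regression
summary(reg1)                          # coefficient estimates and fit statistics
plot(reg1)                             # diagnostic plots for the fitted model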
