
EDU5950 SEM2 2010-11
CORRELATION & SIMPLE REGRESSION

Correlation - Test of association

A correlation measures the degree of association between two variables (interval or ordinal).
Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other).
Correlation is measured by r (parametric, Pearson's) or ρ (non-parametric, Spearman's).
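As a quick illustration (hypothetical attitude/behaviour scores, not from the slides), both coefficients can be computed with SciPy:

```python
# Hypothetical attitude/behaviour scores illustrating Pearson's r
# (parametric) and Spearman's rho (non-parametric, rank-based).
from scipy.stats import pearsonr, spearmanr

attitude = [2, 4, 5, 7, 8, 10, 11, 13]
behaviour = [50, 80, 95, 130, 150, 190, 200, 240]

r, p_r = pearsonr(attitude, behaviour)       # parametric
rho, p_rho = spearmanr(attitude, behaviour)  # non-parametric (uses ranks)

print(r, rho)  # both near +1: a strong positive association
```

Because the made-up behaviour scores rise strictly with attitude, Spearman's ρ is exactly 1 and Pearson's r is close to it.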

Test of association - Correlation

Compare two continuous variables in terms of degree of association
e.g. attitude scale vs behavioural frequency
[Figure: two scatterplots - left panel, positive correlation; right panel, negative correlation]

Test of association - Correlation

Test statistic is r (parametric) or ρ (non-parametric)
0 = random distribution, zero correlation
±1 = perfect correlation

[Figure: two scatterplots - left panel, high correlation; right panel, low correlation]

Test of association - Correlation


Test statistic is r (parametric) or ρ (non-parametric)
0 = random distribution, zero correlation
±1 = perfect correlation

[Figure: two scatterplots - left panel, high correlation; right panel, zero correlation]

Regression & Correlation

A correlation measures the degree of association between two variables (interval (50, 100, 150) or ordinal (1, 2, 3, ...)).
Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other).

Example: Height vs. Weight

Graph One: Relationship between Height and Weight
[Figure: scatterplot of Weight (kgs) against Height (cms)]

Strong positive correlation between height and weight.
Can see how the relationship works, but cannot predict one from the other.
If 120 cm tall, then how heavy?

Example: Symptom Index vs Drug A

Graph Two: Relationship between Symptom Index and Drug A
[Figure: scatterplot of Symptom Index against Drug A (dose in mg)]

Strong negative correlation.
Can see how the relationship works, but cannot make predictions.
What Symptom Index might we predict for a standard dose of 150 mg?

Example: Symptom Index vs Drug A

Graph Three: Relationship between Symptom Index and Drug A (with best-fit line)
[Figure: scatterplot of Symptom Index against Drug A (dose in mg), with best-fit line]

Best-fit line allows us to describe the relationship between variables more accurately.
We can now predict specific values of one variable from knowledge of the other.
All points are close to the line.

Example: Symptom Index vs Drug B

Graph Four: Relationship between Symptom Index and Drug B (with best-fit line)
[Figure: scatterplot of Symptom Index against Drug B (dose in mg), with best-fit line]

We can still predict specific values of one variable from knowledge of the other.
Will predictions be as accurate? Why not? Residuals.

Correlation examples


Regression

Regression analysis procedures have as their primary purpose the development of an equation that can be used for predicting values on some DV for all members of a population.
A secondary purpose is to use regression analysis as a means of explaining causal relationships among variables.

The most basic application of regression analysis is the bivariate situation, which is referred to as simple linear regression, or just simple regression.
Simple regression involves a single IV and a single DV.
Goal: to obtain a linear equation so that we can predict the value of the DV if we have the value of the IV.
Simple regression capitalizes on the correlation between the DV and IV in order to make specific predictions about the DV.

The correlation tells us how much information about the DV is contained in the IV.
If the correlation is perfect (i.e. r = ±1.00), the IV contains everything we need to know about the DV, and we will be able to perfectly predict one from the other.
Regression analysis is the means by which we determine the best-fitting line, called the regression line.
The regression line is the straight line that lies closest to all points in a given scatterplot.
This line always passes through the centroid of the scatterplot.

3 important facts about the regression line must be known:
The extent to which points are scattered around the line
The slope of the regression line
The point at which the line crosses the Y-axis

The extent to which the points are scattered around the line is typically indicated by the degree of relationship between the IV (X) and DV (Y).
This relationship is measured by a correlation coefficient: the stronger the relationship, the higher the degree of predictability between X and Y.

The degree of slope is determined by the amount of change in Y that accompanies a unit change in X.
It is the slope that largely determines the predicted values of Y from known values for X.
It is important to determine exactly where the regression line crosses the Y-axis (this value is known as the Y-intercept).

The regression line is essentially an equation that expresses Y as a function of X.
The basic equation for simple regression is:
Ŷ = a + bX
where Ŷ is the predicted value for the DV,
X is the known raw score value on the IV,
b is the slope of the regression line,
a is the Y-intercept.

Simple Linear Regression

Purpose:
determine relationship between two metric variables
predict value of the dependent variable (Y) based on value of independent variable (X)
Requirement:
DV - Interval / Ratio
IV - Interval / Ratio
Requirement:
The independent and dependent variables are normally distributed in the population
The cases represent a random sample from the population

Simple Regression
How best to summarise the data?
[Figure: two scatterplots of Symptom Index against Drug A (dose in mg) - the raw data, and the same data with a best-fit line]
Adding a best-fit line allows us to describe data simply

General Linear Model (GLM)

How best to summarise the data?
Establish equation for the best-fit line:
Y = a + bX
Where: a = y intercept (constant)
b = slope of best-fit line
Y = dependent variable
X = independent variable
[Figure: scatterplot with best-fit line]

Simple Regression
R² - Goodness of fit

For simple regression, R² is the square of the correlation coefficient
Reflects variance accounted for in data by the best-fit line
Takes values between 0 (0%) and 1 (100%)
Frequently expressed as percentage, rather than decimal
High values show good fit, low values show poor fit
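A small sketch (made-up data, not from the slides) confirming that in simple regression R² from the best-fit line is the squared correlation coefficient:

```python
# Made-up, roughly linear data: R^2 from the best-fit line equals r^2.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

r = np.corrcoef(x, y)[0, 1]        # Pearson's r

b, a = np.polyfit(x, y, 1)         # least-squares slope and intercept
resid = y - (a + b * x)
r2 = 1 - resid.var() / y.var()     # variance accounted for by the line

print(r ** 2, r2)  # the two values agree
```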

Simple Regression
Low values of R²
[Figure: scatterplot of DV against IV (regressor, predictor) - randomly scattered points]
R² = 0 (0% - randomly scattered points, no apparent relationship between X and Y)
Implies that a best-fit line will be a very poor description of data

Simple Regression
High values of R²
[Figure: two scatterplots of DV against IV - points lying directly on the line]
R² = 1 (100% - points lie directly on the line - perfect relationship between X and Y)
Implies that a best-fit line will be a very good description of data

Simple Regression
R² - Goodness of fit
[Figure: Symptom Index against Drug A (dose in mg) and against Drug B (dose in mg), each with best-fit line]
Drug A: good fit, R² high - high variance explained
Drug B: moderate fit, R² lower - less variance explained

Problem: to draw a straight line through the points that best explains the variance.
Line can then be used to predict Y from X.
[Figure: scatterplot with candidate straight line]
Example: Symptom Index vs Drug A

Graph Three: Relationship between Symptom Index and Drug A (with best-fit line)
[Figure: scatterplot of Symptom Index against Drug A (dose in mg), with best-fit line]

Best-fit line allows us to describe the relationship between variables more accurately.
We can now predict specific values of one variable from knowledge of the other.
All points are close to the line.

Regression

Establish equation for the best-fit line:
Ŷ = a + bX
Best-fit line same as regression line
b is the regression coefficient for X
X is the predictor or regressor variable for Y

Regression - Types

Linear Regression - Model

Population model:  Yᵢ = β₀ + β₁Xᵢ + εᵢ
(β₀ = constant/intercept, β₁ = regression coefficient, εᵢ = error term)

Sample estimate:  Ŷ = a + bX

The population parameters β₀ and β₁ are simply the least-squares estimates computed on all the members of the population, not just the sample.
Population parameters: β₀ and β₁
Sample statistics: a and b

Inference About the Population Slope and Intercept

Y = β₀ + β₁X + ε

If β₁ > 0, then we have a graph like this:
[Figure: upward-sloping line β₀ + β₁X plotted against X]

Inference About the Population Slope and Intercept

Y = β₀ + β₁X + ε

If β₁ < 0, then we have a graph like this:
[Figure: downward-sloping line β₀ + β₁X plotted against X. The line value β₀ + β₁X is the mean of Y for those whose independent variable is X]

Inference About the Population Slope and Intercept

Y = β₀ + β₁X + ε

If β₁ = 0, then we have a graph like this:
[Figure: horizontal line β₀ + β₁X plotted against X]
Note how the mean of Y does not depend on X: Y and X are independent

Copyright (c) Bani K. Mallick

Linear Regression and Correlation

Y = β₀ + β₁X + ε

If β₁ = 0, then Y and X are independent.
So, we can test the null hypothesis that Y and X are independent by testing
H₀: β₁ = 0
The p-value in regression tables tests this hypothesis.
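A sketch of this test with hypothetical temperature/sales pairs; SciPy's `linregress` reports the two-sided p-value for H₀: β₁ = 0:

```python
# Hypothetical temperature/sales pairs: the slope's p-value tests
# H0: beta_1 = 0, i.e. that Y and X are (linearly) independent.
from scipy.stats import linregress

x = [63, 70, 73, 75, 80, 82, 85, 88, 90, 91]
y = [1.52, 1.68, 1.80, 2.05, 2.36, 2.25, 2.68, 2.90, 3.14, 3.06]

res = linregress(x, y)
print(res.slope, res.pvalue)  # small p-value -> reject H0: beta_1 = 0
```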

Ice Cream Example

X (Temperature)   Y (Sales)
63                1.52
70                1.68
73                1.80
75                2.05
80                2.36
82                2.25
85                2.68
88                2.90
90                3.14
91                3.06
92                3.24
75                1.92
98                3.40
100               3.28
92                3.17
87                2.83
84                2.58
88                2.86
80                2.26
82                2.14
76                1.98

[Figure: scatterplot of Ice Cream Sales against Temperature]

Ŷ = a + bX - Simple Regression Line
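The regression line for the ice cream data above can be sketched with NumPy's least-squares fit (a minimal sketch; statistical software would return the same a and b):

```python
# Fitting the simple regression line Sales = a + b * Temperature
# to the ice cream data from the slides.
import numpy as np

temp = np.array([63, 70, 73, 75, 80, 82, 85, 88, 90, 91, 92,
                 75, 98, 100, 92, 87, 84, 88, 80, 82, 76])
sales = np.array([1.52, 1.68, 1.80, 2.05, 2.36, 2.25, 2.68, 2.90, 3.14,
                  3.06, 3.24, 1.92, 3.40, 3.28, 3.17, 2.83, 2.58, 2.86,
                  2.26, 2.14, 1.98])

b, a = np.polyfit(temp, sales, 1)     # slope b, intercept a
r = np.corrcoef(temp, sales)[0, 1]    # Pearson's r

print(f"Sales = {a:.3f} + {b:.3f} * Temperature, r = {r:.3f}")
```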

TWO STEPS TO SIMPLE LINEAR REGRESSION

Descriptive:
Regression equation: Ŷ = a + bX
Correlation coefficient (r)
Coefficient of Determination (r²)

Inferential:
Hypothesis Test:
1 Regression Model
2 Slope

First Step - Descriptive

Derive Regression / Prediction equation
Calculate a and b:
b = SP / SSx   (where SP = ΣXY − (ΣX)(ΣY)/n and SSx = ΣX² − (ΣX)²/n)
a = ȳ − b x̄
Ŷ = a + bX

Example 1:
Data were collected from a randomly selected sample to determine the relationship between average assignment scores and test scores in statistics. The distribution for the data is presented in the table below.
1. Calculate the coefficient of determination and the correlation coefficient.
2. Determine the prediction equation.
3. Test the hypothesis for the slope at the 0.05 level of significance.

Data set:

ID   Assign   Test
1    8.5      88
2    6        66
3    9        94
4    10       98
5    8        87
6    7        72
7    5        45
8    6        63
9    7.5      85
10   5        77

1. Derive Regression / Prediction equation (data as in the table above)

Summary statistics: n = 10, ΣX = 72, ΣY = 775, ΣX² = 544.5, ΣY² = 62,441, ΣXY = 5,795.5
so x̄ = 7.2, ȳ = 77.5, SP = 215.5, SSx = 26.1

b = SP / SSx = 215.5 / 26.1 = 8.257

a = ȳ − b x̄
  = 77.5 − 8.257 (7.2)
  = 18.05

Prediction equation:
Ŷ = 18.05 + 8.257X

Interpretation of regression equation

Ŷ = 18.05 + 8.257X
For every 1-unit change in X, Ŷ changes by 8.257 units; when X = 0, Ŷ = 18.05 (the Y-intercept).
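The computation for Example 1 can be checked with a few lines of Python, using only the summary statistics given in the slide:

```python
# Verifying Example 1: b = SP / SSx and a = y-bar - b * x-bar,
# computed from the slide's summary statistics.
n, sum_x, sum_y = 10, 72, 775
sum_x2, sum_xy = 544.5, 5795.5

sp = sum_xy - sum_x * sum_y / n      # 215.5
ss_x = sum_x2 - sum_x ** 2 / n       # 26.1
b = sp / ss_x                        # slope
a = sum_y / n - b * (sum_x / n)      # intercept

print(round(b, 3), round(a, 2))      # 8.257 18.05
```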

Example 2:
MARITAL SATISFACTION

Parents : X   Children : Y
1             3
3             2
7             6
9             7
8             8
4             6
5             3

(Compute: means of X and Y, number of pairs, ΣX, ΣX², ΣXY, standard deviations)

1. Derive Regression / Prediction equation

b = SP / SSx = .65

a = ȳ − b x̄
  = 5.00 − .65 (5.29)
  = 1.56

Prediction equation:
Ŷ = 1.56 + .65X

Interpretation of regression equation

Ŷ = 1.56 + .65X
For every 1-unit change in X, Ŷ changes by .65 units.

Descriptive Statistics
                    Mean     Std. Deviation   N
Grade - PMR MATH    2.53     1.468            62
TEACHER_FACTOR      3.9643   .91443           62

Correlations
                                         Grade - PMR MATH   TEACHER_FACTOR
Pearson Correlation   Grade - PMR MATH   1.000              .571
                      TEACHER_FACTOR     .571               1.000
Sig. (1-tailed)       Grade - PMR MATH   .                  .000
                      TEACHER_FACTOR     .000               .
N                     Grade - PMR MATH   62                 62
                      TEACHER_FACTOR     62                 62

Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .571a   .326       .315                1.215

a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

ANOVA(b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   42.848           1    42.848        29.021   .000a
Residual     88.588           60   1.476
Total        131.435          61

a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

Coefficients(a)
Model 1          Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)       -1.101             .692                             -1.591   .117
TEACHER_FACTOR   .917               .170          .571               5.387    .000

a. Dependent Variable: Grade - PMR MATH
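A quick sketch checking the internal consistency of the output above: in simple regression R² equals the squared correlation, and the slope's F statistic equals t²:

```python
# Consistency checks on the SPSS output (TEACHER_FACTOR predicting
# Grade - PMR MATH): R^2 = r^2 and F = t^2 in simple regression.
r = 0.571          # Pearson correlation
r_square = 0.326   # Model Summary R Square
t_slope = 5.387    # t for TEACHER_FACTOR
f_model = 29.021   # ANOVA F

print(round(r ** 2, 3))        # 0.326 -> matches R Square
print(round(t_slope ** 2, 1))  # 29.0  -> matches F, up to rounding
```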

Descriptive Statistics
                    Mean     Std. Deviation   N
Grade - PMR MATH    2.53     1.468            62
TEACHER_FACTOR      3.9643   .91443           62
Race                1.90     .593             62

Correlations
                                         Grade - PMR MATH   TEACHER_FACTOR   Race
Pearson Correlation   Grade - PMR MATH   1.000              .571             -.015
                      TEACHER_FACTOR     .571               1.000            .019
                      Race               -.015              .019             1.000
Sig. (1-tailed)       Grade - PMR MATH   .                  .000             .453
                      TEACHER_FACTOR     .000               .                .440
                      Race               .453               .440             .
N                     (all variables)    62                 62               62

Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .572a   .327       .304                1.225

a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

ANOVA(b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   42.939           2    21.469        14.313   .000a
Residual     88.497           59   1.500
Total        131.435          61

a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

Coefficients(a)
Model 1          Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)       -.980              .853                             -1.150   .255
TEACHER_FACTOR   .917               .172          .571               5.349    .000
Race             -.065              .265          -.026              -.246    .806

a. Dependent Variable: Grade - PMR MATH
