
EDU5950 SEM2 2010-11
CORRELATION & SIMPLE REGRESSION

Correlation - Test of association

A correlation measures the degree of association between two variables (interval or ordinal).
Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other).
Correlation is measured by r (parametric, Pearson's) or ρ (non-parametric, Spearman's).
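As a quick illustration (hypothetical attitude/behaviour scores, not from the slides), both coefficients can be computed with SciPy:

```python
# Hypothetical attitude/behaviour scores illustrating Pearson's r
# (parametric) and Spearman's rho (non-parametric, rank-based).
from scipy.stats import pearsonr, spearmanr

attitude = [2, 4, 5, 7, 8, 10, 11, 13]
behaviour = [50, 80, 95, 130, 150, 190, 200, 240]

r, p_r = pearsonr(attitude, behaviour)       # parametric
rho, p_rho = spearmanr(attitude, behaviour)  # non-parametric (uses ranks)

print(r, rho)  # both near +1: a strong positive association
```

Because the made-up behaviour scores rise strictly with attitude, Spearman's ρ is exactly 1 and Pearson's r is close to it.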

Test of association - Correlation

Compare two continuous variables in terms of degree of association
e.g. attitude scale vs behavioural frequency
[Figure: two scatterplots - left panel, positive correlation; right panel, negative correlation]

Test of association - Correlation

Test statistic is r (parametric) or ρ (non-parametric)
0 = random distribution, zero correlation
±1 = perfect correlation

[Figure: two scatterplots - left panel, high correlation; right panel, low correlation]

Test of association - Correlation


Test statistic is r (parametric) or ρ (non-parametric)
0 = random distribution, zero correlation
±1 = perfect correlation

[Figure: two scatterplots - left panel, high correlation; right panel, zero correlation]

Regression & Correlation

A correlation measures the degree of association between two variables (interval (50, 100, 150) or ordinal (1, 2, 3, ...)).
Associations can be positive (an increase in one variable is associated with an increase in the other) or negative (an increase in one variable is associated with a decrease in the other).

Example: Height vs. Weight

Graph One: Relationship between Height and Weight
[Figure: scatterplot of Weight (kgs) against Height (cms)]

Strong positive correlation between height and weight.
Can see how the relationship works, but cannot predict one from the other.
If 120 cm tall, then how heavy?

Example: Symptom Index vs Drug A

Graph Two: Relationship between Symptom Index and Drug A
[Figure: scatterplot of Symptom Index against Drug A (dose in mg)]

Strong negative correlation.
Can see how the relationship works, but cannot make predictions.
What Symptom Index might we predict for a standard dose of 150 mg?

Example: Symptom Index vs Drug A

Graph Three: Relationship between Symptom Index and Drug A (with best-fit line)
[Figure: scatterplot of Symptom Index against Drug A (dose in mg), with best-fit line]

Best-fit line allows us to describe the relationship between variables more accurately.
We can now predict specific values of one variable from knowledge of the other.
All points are close to the line.

Example: Symptom Index vs Drug B

Graph Four: Relationship between Symptom Index and Drug B (with best-fit line)
[Figure: scatterplot of Symptom Index against Drug B (dose in mg), with best-fit line]

We can still predict specific values of one variable from knowledge of the other.
Will predictions be as accurate? Why not? Residuals.

Correlation examples


Regression

Regression analysis procedures have as their primary purpose the development of an equation that can be used for predicting values on some DV for all members of a population.
A secondary purpose is to use regression analysis as a means of explaining causal relationships among variables.

The most basic application of regression analysis is the bivariate situation, which is referred to as simple linear regression, or just simple regression.
Simple regression involves a single IV and a single DV.
Goal: to obtain a linear equation so that we can predict the value of the DV if we have the value of the IV.
Simple regression capitalizes on the correlation between the DV and IV in order to make specific predictions about the DV.

The correlation tells us how much information about the DV is contained in the IV.
If the correlation is perfect (i.e. r = ±1.00), the IV contains everything we need to know about the DV, and we will be able to perfectly predict one from the other.
Regression analysis is the means by which we determine the best-fitting line, called the regression line.
The regression line is the straight line that lies closest to all points in a given scatterplot.
This line always passes through the centroid of the scatterplot.

3 important facts about the regression line must be known:
The extent to which points are scattered around the line
The slope of the regression line
The point at which the line crosses the Y-axis

The extent to which the points are scattered around the line is typically indicated by the degree of relationship between the IV (X) and DV (Y).
This relationship is measured by a correlation coefficient: the stronger the relationship, the higher the degree of predictability between X and Y.

The degree of slope is determined by the amount of change in Y that accompanies a unit change in X.
It is the slope that largely determines the predicted values of Y from known values for X.
It is important to determine exactly where the regression line crosses the Y-axis (this value is known as the Y-intercept).

The regression line is essentially an equation that expresses Y as a function of X.
The basic equation for simple regression is:
Ŷ = a + bX
where Ŷ is the predicted value for the DV,
X is the known raw score value on the IV,
b is the slope of the regression line,
a is the Y-intercept.

Simple Linear Regression

Purpose:
determine relationship between two metric variables
predict value of the dependent variable (Y) based on value of independent variable (X)
Requirement:
DV - Interval / Ratio
IV - Interval / Ratio
Requirement:
The independent and dependent variables are normally distributed in the population
The cases represent a random sample from the population

Simple Regression
How best to summarise the data?
[Figure: two scatterplots of Symptom Index against Drug A (dose in mg) - the raw data, and the same data with a best-fit line]
Adding a best-fit line allows us to describe data simply

General Linear Model (GLM)

How best to summarise the data?
Establish equation for the best-fit line:
Y = a + bX
Where: a = y intercept (constant)
b = slope of best-fit line
Y = dependent variable
X = independent variable
[Figure: scatterplot with best-fit line]

Simple Regression
R² - Goodness of fit

For simple regression, R² is the square of the correlation coefficient
Reflects variance accounted for in data by the best-fit line
Takes values between 0 (0%) and 1 (100%)
Frequently expressed as percentage, rather than decimal
High values show good fit, low values show poor fit
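A small sketch (made-up data, not from the slides) confirming that in simple regression R² from the best-fit line is the squared correlation coefficient:

```python
# Made-up, roughly linear data: R^2 from the best-fit line equals r^2.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

r = np.corrcoef(x, y)[0, 1]        # Pearson's r

b, a = np.polyfit(x, y, 1)         # least-squares slope and intercept
resid = y - (a + b * x)
r2 = 1 - resid.var() / y.var()     # variance accounted for by the line

print(r ** 2, r2)  # the two values agree
```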

Simple Regression
Low values of R²
[Figure: scatterplot of DV against IV (regressor, predictor) - randomly scattered points]
R² = 0 (0% - randomly scattered points, no apparent relationship between X and Y)
Implies that a best-fit line will be a very poor description of data

Simple Regression
High values of R²
[Figure: two scatterplots of DV against IV - points lying directly on the line]
R² = 1 (100% - points lie directly on the line - perfect relationship between X and Y)
Implies that a best-fit line will be a very good description of data

Simple Regression
R² - Goodness of fit
[Figure: Symptom Index against Drug A (dose in mg) and against Drug B (dose in mg), each with best-fit line]
Drug A: good fit, R² high - high variance explained
Drug B: moderate fit, R² lower - less variance explained

Problem: to draw a straight line through the points that best explains the variance.
Line can then be used to predict Y from X.
[Figure: scatterplot with candidate straight line]
Example: Symptom Index vs Drug A

Graph Three: Relationship between Symptom Index and Drug A (with best-fit line)
[Figure: scatterplot of Symptom Index against Drug A (dose in mg), with best-fit line]

Best-fit line allows us to describe the relationship between variables more accurately.
We can now predict specific values of one variable from knowledge of the other.
All points are close to the line.

Regression

Establish equation for the best-fit line:
Ŷ = a + bX
Best-fit line same as regression line
b is the regression coefficient for X
X is the predictor or regressor variable for Y

Regression - Types

Linear Regression - Model

Population model:  Yᵢ = β₀ + β₁Xᵢ + εᵢ
(β₀ = constant/intercept, β₁ = regression coefficient, εᵢ = error term)

Sample estimate:  Ŷ = a + bX

The population parameters β₀ and β₁ are simply the least-squares estimates computed on all the members of the population, not just the sample.
Population parameters: β₀ and β₁
Sample statistics: a and b

Inference About the Population Slope and Intercept

Y = β₀ + β₁X + ε

If β₁ > 0, then we have a graph like this:
[Figure: upward-sloping line β₀ + β₁X plotted against X]

Inference About the Population Slope and Intercept

Y = β₀ + β₁X + ε

If β₁ < 0, then we have a graph like this:
[Figure: downward-sloping line β₀ + β₁X plotted against X. The line value β₀ + β₁X is the mean of Y for those whose independent variable is X]

Inference About the Population Slope and Intercept

Y = β₀ + β₁X + ε

If β₁ = 0, then we have a graph like this:
[Figure: horizontal line β₀ + β₁X plotted against X]
Note how the mean of Y does not depend on X: Y and X are independent

Copyright (c) Bani K. Mallick

Linear Regression and Correlation

Y = β₀ + β₁X + ε

If β₁ = 0, then Y and X are independent.
So, we can test the null hypothesis that Y and X are independent by testing
H₀: β₁ = 0
The p-value in regression tables tests this hypothesis.
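A sketch of this test with hypothetical temperature/sales pairs; SciPy's `linregress` reports the two-sided p-value for H₀: β₁ = 0:

```python
# Hypothetical temperature/sales pairs: the slope's p-value tests
# H0: beta_1 = 0, i.e. that Y and X are (linearly) independent.
from scipy.stats import linregress

x = [63, 70, 73, 75, 80, 82, 85, 88, 90, 91]
y = [1.52, 1.68, 1.80, 2.05, 2.36, 2.25, 2.68, 2.90, 3.14, 3.06]

res = linregress(x, y)
print(res.slope, res.pvalue)  # small p-value -> reject H0: beta_1 = 0
```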

Ice Cream Example

X (Temperature)   Y (Sales)
63                1.52
70                1.68
73                1.80
75                2.05
80                2.36
82                2.25
85                2.68
88                2.90
90                3.14
91                3.06
92                3.24
75                1.92
98                3.40
100               3.28
92                3.17
87                2.83
84                2.58
88                2.86
80                2.26
82                2.14
76                1.98

[Figure: scatterplot of Ice Cream Sales against Temperature]

Ŷ = a + bX - Simple Regression Line
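The regression line for the ice cream data above can be sketched with NumPy's least-squares fit (a minimal sketch; statistical software would return the same a and b):

```python
# Fitting the simple regression line Sales = a + b * Temperature
# to the ice cream data from the slides.
import numpy as np

temp = np.array([63, 70, 73, 75, 80, 82, 85, 88, 90, 91, 92,
                 75, 98, 100, 92, 87, 84, 88, 80, 82, 76])
sales = np.array([1.52, 1.68, 1.80, 2.05, 2.36, 2.25, 2.68, 2.90, 3.14,
                  3.06, 3.24, 1.92, 3.40, 3.28, 3.17, 2.83, 2.58, 2.86,
                  2.26, 2.14, 1.98])

b, a = np.polyfit(temp, sales, 1)     # slope b, intercept a
r = np.corrcoef(temp, sales)[0, 1]    # Pearson's r

print(f"Sales = {a:.3f} + {b:.3f} * Temperature, r = {r:.3f}")
```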

TWO STEPS TO SIMPLE LINEAR REGRESSION

Descriptive:
Regression equation: Ŷ = a + bX
Correlation coefficient (r)
Coefficient of Determination (r²)

Inferential:
Hypothesis Test:
1 Regression Model
2 Slope

First Step - Descriptive

Derive Regression / Prediction equation
Calculate a and b:
b = SP / SSx   (where SP = ΣXY − (ΣX)(ΣY)/n and SSx = ΣX² − (ΣX)²/n)
a = ȳ − b x̄
Ŷ = a + bX

Example 1:
Data were collected from a randomly selected sample to determine the relationship between average assignment scores and test scores in statistics. The distribution for the data is presented in the table below.
1. Calculate the coefficient of determination and the correlation coefficient.
2. Determine the prediction equation.
3. Test the hypothesis for the slope at the 0.05 level of significance.

Data set:

ID   Assign   Test
1    8.5      88
2    6        66
3    9        94
4    10       98
5    8        87
6    7        72
7    5        45
8    6        63
9    7.5      85
10   5        77

1. Derive Regression / Prediction equation (data as in the table above)

Summary statistics: n = 10, ΣX = 72, ΣY = 775, ΣX² = 544.5, ΣY² = 62,441, ΣXY = 5,795.5
so x̄ = 7.2, ȳ = 77.5, SP = 215.5, SSx = 26.1

b = SP / SSx = 215.5 / 26.1 = 8.257

a = ȳ − b x̄
  = 77.5 − 8.257 (7.2)
  = 18.05

Prediction equation:
Ŷ = 18.05 + 8.257X

Interpretation of regression equation

Ŷ = 18.05 + 8.257X
For every 1-unit change in X, Ŷ changes by 8.257 units; when X = 0, Ŷ = 18.05 (the Y-intercept).
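The computation for Example 1 can be checked with a few lines of Python, using only the summary statistics given in the slide:

```python
# Verifying Example 1: b = SP / SSx and a = y-bar - b * x-bar,
# computed from the slide's summary statistics.
n, sum_x, sum_y = 10, 72, 775
sum_x2, sum_xy = 544.5, 5795.5

sp = sum_xy - sum_x * sum_y / n      # 215.5
ss_x = sum_x2 - sum_x ** 2 / n       # 26.1
b = sp / ss_x                        # slope
a = sum_y / n - b * (sum_x / n)      # intercept

print(round(b, 3), round(a, 2))      # 8.257 18.05
```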

Example 2:
MARITAL SATISFACTION

Parents : X   Children : Y
1             3
3             2
7             6
9             7
8             8
4             6
5             3

(Compute: means of X and Y, number of pairs, ΣX, ΣX², ΣXY, standard deviations)

1. Derive Regression / Prediction equation

b = SP / SSx = .65

a = ȳ − b x̄
  = 5.00 − .65 (5.29)
  = 1.56

Prediction equation:
Ŷ = 1.56 + .65X

Interpretation of regression equation

Ŷ = 1.56 + .65X
For every 1-unit change in X, Ŷ changes by .65 units.

Descriptive Statistics
                    Mean     Std. Deviation   N
Grade - PMR MATH    2.53     1.468            62
TEACHER_FACTOR      3.9643   .91443           62

Correlations
                                         Grade - PMR MATH   TEACHER_FACTOR
Pearson Correlation   Grade - PMR MATH   1.000              .571
                      TEACHER_FACTOR     .571               1.000
Sig. (1-tailed)       Grade - PMR MATH   .                  .000
                      TEACHER_FACTOR     .000               .
N                     Grade - PMR MATH   62                 62
                      TEACHER_FACTOR     62                 62

Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .571a   .326       .315                1.215

a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

ANOVA(b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   42.848           1    42.848        29.021   .000a
Residual     88.588           60   1.476
Total        131.435          61

a. Predictors: (Constant), TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

Coefficients(a)
Model 1          Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)       -1.101             .692                             -1.591   .117
TEACHER_FACTOR   .917               .170          .571               5.387    .000

a. Dependent Variable: Grade - PMR MATH
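A quick sketch checking the internal consistency of the output above: in simple regression R² equals the squared correlation, and the slope's F statistic equals t²:

```python
# Consistency checks on the SPSS output (TEACHER_FACTOR predicting
# Grade - PMR MATH): R^2 = r^2 and F = t^2 in simple regression.
r = 0.571          # Pearson correlation
r_square = 0.326   # Model Summary R Square
t_slope = 5.387    # t for TEACHER_FACTOR
f_model = 29.021   # ANOVA F

print(round(r ** 2, 3))        # 0.326 -> matches R Square
print(round(t_slope ** 2, 1))  # 29.0  -> matches F, up to rounding
```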

Descriptive Statistics
                    Mean     Std. Deviation   N
Grade - PMR MATH    2.53     1.468            62
TEACHER_FACTOR      3.9643   .91443           62
Race                1.90     .593             62

Correlations
                                         Grade - PMR MATH   TEACHER_FACTOR   Race
Pearson Correlation   Grade - PMR MATH   1.000              .571             -.015
                      TEACHER_FACTOR     .571               1.000            .019
                      Race               -.015              .019             1.000
Sig. (1-tailed)       Grade - PMR MATH   .                  .000             .453
                      TEACHER_FACTOR     .000               .                .440
                      Race               .453               .440             .
N                     (all variables)    62                 62               62

Model Summary(b)
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .572a   .327       .304                1.225

a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

ANOVA(b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   42.939           2    21.469        14.313   .000a
Residual     88.497           59   1.500
Total        131.435          61

a. Predictors: (Constant), Race, TEACHER_FACTOR
b. Dependent Variable: Grade - PMR MATH

Coefficients(a)
Model 1          Unstandardized B   Std. Error   Standardized Beta   t        Sig.
(Constant)       -.980              .853                             -1.150   .255
TEACHER_FACTOR   .917               .172          .571               5.349    .000
Race             -.065              .265          -.026              -.246    .806

a. Dependent Variable: Grade - PMR MATH
