Regression
Cal State Northridge
427
Ainsworth
The Question
Scatterplots
AKA scatter diagram or scattergram.
Graphically depicts the relationship between two variables in two-dimensional space.
Direct Relationship
Inverse Relationship
An Example
Does smoking cigarettes increase systolic blood pressure?
Plotting number of cigarettes smoked per day against systolic blood pressure
Trend?
[Scatterplot: SMOKING (cigarettes per day, 0–30) on the X axis vs. SYSTOLIC blood pressure (100–170) on the Y axis]
Smoking and BP
Note: the relationship is moderate, but real.
Why do we care about this relationship?
The Data
Surprisingly, the U.S. is the first country on the list: the country with the highest consumption and highest mortality.
Why?
[Scatterplot: cigarette consumption (X) vs. CHD mortality (Y); one labeled point is {X = 6, Y = 11}, to be discussed later]
Correlation
Co-relation
The relationship between two variables
Measured with a correlation coefficient
Most popularly seen correlation coefficient: Pearson Product-Moment Correlation
Types of Correlation
Positive correlation
Negative correlation
No correlation: no consistent tendency for values on Y to increase or decrease as X increases
Correlation Coefficient
A measure of degree of relationship.
Between -1 and +1
Sign refers to direction.
Based on covariance
Covariance
The formula for covariance is:
$\mathrm{Cov}_{XY} = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}$
Example
$\mathrm{Cov}_{\text{cig,CHD}} = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1} = \dfrac{222.44}{21 - 1} = 11.12$
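The covariance formula, $\mathrm{Cov}_{XY} = \sum(X-\bar{X})(Y-\bar{Y})/(N-1)$, can be sketched in a few lines of Python. This uses a small made-up dataset, not the cigarette/CHD data from the slides:

```python
def covariance(x, y):
    """Cov_XY = sum((X - Xbar)(Y - Ybar)) / (N - 1)"""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]     # y = 2x, so Cov_XY = 2 * Var(X) = 2 * 2.5
print(covariance(x, y))  # 5.0
```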
Correlation Coefficient
Pearson's Product-Moment Correlation
Symbolized by r
Covariance divided by the product of the 2 SDs:
$r = \dfrac{\mathrm{Cov}_{XY}}{s_X s_Y}$
Correlation is a standardized covariance:
$r = \dfrac{\mathrm{Cov}_{XY}}{s_X s_Y} = \dfrac{11.12}{(2.33)(6.69)} = \dfrac{11.12}{15.59} = .713$
Example
Correlation = .713
Sign is positive
Why?
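The definition of r as a standardized covariance can be sketched directly, again on a toy dataset rather than the slide data:

```python
import statistics

def pearson_r(x, y):
    """r = Cov_XY / (s_X * s_Y), using sample (N-1) statistics."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(pearson_r(x, y))  # ~1.0: a perfect positive linear relationship
```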
Other Calculations
Z-score method:
$r = \dfrac{\sum z_x z_y}{N - 1}$
Computational (raw-score) formula:
$r = \dfrac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$
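A minimal sketch of the z-score method, $r = \sum z_x z_y / (N-1)$, using sample standard deviations and a toy dataset:

```python
import statistics

def r_from_z(x, y):
    """Standardize each variable, then average the cross-products over N-1."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    zx = [(v - mx) / sx for v in x]
    zy = [(v - my) / sy for v in y]
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]
print(r_from_z(x, y))  # ~-1.0: a perfect negative relationship
```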
Spearman Rank-Order Correlation Coefficient (r_sp)
Phi coefficient (φ)
Factors Affecting r
Range restrictions
Nonlinearity
Heterogeneous subsamples
Outliers (can lead us to overestimate or underestimate the correlation)
Truncation
[Scatterplot illustrating truncation/range restriction: X axis roughly 2.5–5.5, Y axis 2–16]
Non-linearity
Heterogeneous samples
Outliers
Testing Correlations
Testing r
Population parameter = ρ (rho)
Null hypothesis H₀: ρ = 0
Two-tailed; use tables of significance
$t = r\sqrt{\dfrac{N - 2}{1 - r^2}}$, where df = N − 2
Tables of Significance
In our example r was .713
df = N − 2 = 21 − 2 = 19
$t = r\sqrt{\dfrac{N - 2}{1 - r^2}} = .713\sqrt{\dfrac{19}{1 - .713^2}} = .713\sqrt{\dfrac{19}{.4916}} = 4.43$
t-crit(19) = 2.09
Since 4.43 is larger than 2.09, reject H₀: ρ = 0.
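As an arithmetic check of the t formula with the slide's r = .713 and N = 21 (with one predictor, t² should match the F in the later SPSS ANOVA output, about 19.59):

```python
import math

def t_for_r(r, n):
    """t = r * sqrt((N - 2) / (1 - r**2)), df = N - 2."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t = t_for_r(0.713, 21)
print(round(t, 2))  # 4.43, which exceeds t-crit(19) = 2.09
```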
Computer Printout

                               CIGARET   CHD
CIGARET   Pearson Correlation  1         .713**
          Sig. (2-tailed)      .         .000
          N                    21        21
CHD       Pearson Correlation  .713**    1
          Sig. (2-tailed)      .000      .
          N                    21        21
Regression
What is regression?
Linear Regression
Why Do We Care?
An Example
The Data
Based on the data we have, what would we predict the rate of CHD to be in a country that smoked 10 cigarettes on average?
First, we need to establish a prediction equation for CHD from smoking.
[Scatterplot with fitted regression line: at X = 10 cigarettes, we predict a CHD rate of about 14]
Regression Line
Formula
$\hat{Y} = bX + a$
$\hat{Y}$ = the predicted value of Y (e.g. CHD mortality)
X = the predictor variable (e.g. average cig./adult/country)
Regression Coefficients
b = slope
a = intercept: the value of $\hat{Y}$ when X = 0
Calculation
Slope: $b = \dfrac{\mathrm{Cov}_{XY}}{s_X^2}$ or $b = r\dfrac{s_Y}{s_X}$ or $b = \dfrac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$
Intercept: $a = \bar{Y} - b\bar{X}$
$\mathrm{Cov}_{XY} = 11.12$
$s_X^2 = 2.334^2 = 5.447$
b = 11.12 / 5.447 = 2.042
a = 14.524 − 2.042 × 5.952 = 2.37
See SPSS printout on next slide.
Answers are not exact due to rounding error and the desire to match SPSS.
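A sketch of the slope/intercept calculation, plugging in the summary statistics from the slides (Cov_XY, s_X, and the two means) rather than the raw 21-country data, which is not reproduced here:

```python
# Summary statistics taken from the slides.
cov_xy = 11.12
s_x = 2.334
x_bar, y_bar = 5.952, 14.524

b = cov_xy / s_x ** 2    # slope = Cov_XY / s_X^2  -> about 2.04
a = y_bar - b * x_bar    # intercept = Ybar - b * Xbar  -> about 2.37
print(b, a)
```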
SPSS Printout
Note:
Making a Prediction
Finnish smokers smoke 6 C/A/D
We predict:
$\hat{Y} = bX + a = 2.042X + 2.367$
$\hat{Y} = 2.042(6) + 2.367 = 14.619$
Accuracy of Prediction
The prediction misses Finland's actual CHD rate by a large error.
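The fitted line can be used directly for prediction; a minimal sketch with the slide's coefficients (b = 2.042, a = 2.367), where `predict_chd` is a hypothetical helper name:

```python
def predict_chd(cigs):
    """Yhat = b*X + a, with the slide's fitted coefficients."""
    return 2.042 * cigs + 2.367

print(round(predict_chd(6), 3))  # 14.619, the prediction for Finland's 6 C/A/D
```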
[Scatterplot showing a residual: the vertical distance between an observed point and its prediction on the regression line]
Residuals
Minimizing Residuals
Regression Line: A Mathematical Definition
The regression line is the line that minimizes the sum of squared residuals, $\sum (Y - \hat{Y})^2$.
Summarizing Errors of Prediction
Residual variance:
$s_{Y-\hat{Y}}^2 = \dfrac{\sum (Y_i - \hat{Y}_i)^2}{N - 2} = \dfrac{SS_{\text{residual}}}{N - 2}$
Standard error of estimate:
$s_{Y-\hat{Y}} = \sqrt{\dfrac{\sum (Y_i - \hat{Y}_i)^2}{N - 2}} = \sqrt{\dfrac{SS_{\text{residual}}}{N - 2}}$
The standard error of estimate is a common measure of the accuracy of our predictions.
We want it to be as small as possible.
Example
$s_{Y-\hat{Y}}^2 = \dfrac{\sum (Y_i - \hat{Y}_i)^2}{N - 2} = \dfrac{440.756}{21 - 2} = 23.198$
$s_{Y-\hat{Y}} = \sqrt{\dfrac{440.756}{21 - 2}} = \sqrt{23.198} = 4.816$
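The standard error of estimate is easy to sketch on a toy fit (made-up observed values and predictions, not the slide data):

```python
import math

y    = [3.0, 5.0, 7.0, 10.0]   # observed values
yhat = [3.5, 4.5, 7.5,  9.5]   # predictions from some fitted line
ss_residual = sum((a - b) ** 2 for a, b in zip(y, yhat))
see = math.sqrt(ss_residual / (len(y) - 2))  # sqrt(SS_residual / (N - 2))
print(see)  # sqrt(1.0 / 2) ~ 0.707
```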
Partitioning Variability
Total: $SS_{\text{total}} = \sum (Y - \bar{Y})^2$
Regression: $SS_{\text{regression}} = \sum (\hat{Y} - \bar{Y})^2$
Residual: $SS_{\text{residual}} = \sum (Y - \hat{Y})^2$
Partitioning Variability
Degrees of freedom:
$df_{\text{total}} = N - 1$
$df_{\text{regression}}$ = number of predictors
$df_{\text{residual}} = df_{\text{total}} - df_{\text{regression}}$
Partitioning Variability
Total variance: $s^2_{\text{total}} = SS_{\text{total}} / df_{\text{total}}$
Regression variance: $s^2_{\text{regression}} = SS_{\text{regression}} / df_{\text{regression}}$
Residual variance: $s^2_{\text{residual}} = SS_{\text{residual}} / df_{\text{residual}}$
Example
$SS_{\text{total}} = \sum (Y - \bar{Y})^2 = 895.247$; $df_{\text{total}} = 21 - 1 = 20$
$SS_{\text{regression}} = \sum (\hat{Y} - \bar{Y})^2 = 454.307$; $df_{\text{regression}} = 1$
$SS_{\text{residual}} = \sum (Y - \hat{Y})^2 = 440.757$; $df_{\text{residual}} = 20 - 1 = 19$
$s^2_{\text{total}} = 895.247 / 20 = 44.762$
$s^2_{\text{regression}} = 454.307 / 1 = 454.307$
$s^2_{\text{residual}} = 440.757 / 19 = 23.198$
Note: $s^2_{\text{residual}} = s^2_{Y-\hat{Y}}$
Coefficient of Determination
$r^2 = \dfrac{SS_{\text{regression}}}{SS_Y}$
r = .713; $r^2 = .713^2 = .508$
or $r^2 = \dfrac{SS_{\text{regression}}}{SS_Y} = \dfrac{454.307}{895.247} = .507$
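A quick numerical check that the two routes to r² agree (up to the slides' rounding), using the slide values:

```python
r2_from_r  = 0.713 ** 2           # square the correlation
r2_from_ss = 454.307 / 895.247    # SS_regression / SS_Y
print(round(r2_from_r, 3), round(r2_from_ss, 3))  # 0.508 vs 0.507 (rounding)
```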
Coefficient of Alienation
It is defined as $1 - r^2 = \dfrac{SS_{\text{residual}}}{SS_Y}$
Example: 1 − .508 = .492, or $1 - r^2 = \dfrac{440.757}{895.247} = .492$
Note: $r^2 \times SS_{\text{total}} = SS_{\text{regression}}$ and $(1 - r^2) \times SS_{\text{total}} = SS_{\text{residual}}$
We can also use r² to calculate the standard error of estimate as:
$s_{Y-\hat{Y}} = s_Y\sqrt{(1 - r^2)\dfrac{N - 1}{N - 2}} = 6.690\sqrt{(.492)\dfrac{20}{19}} = 4.816$
F Statistic
$F = \dfrac{s^2_{\text{regression}}}{s^2_{\text{residual}}}$
Example: $F = \dfrac{454.307}{23.198} = 19.58$ (SPSS, using $SS_{\text{regression}} = 454.482$, reports F = 19.592; the difference is rounding)
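Assuming the slide's variance estimates, a quick check of the F ratio; with a single predictor, F is the square of the t from the test of r:

```python
f = 454.307 / 23.198        # s2_regression / s2_residual
print(round(f, 2))          # ~19.58
print(round(f ** 0.5, 2))   # ~4.43, the t for testing r
```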
SPSS Output

Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .713a   .508       .482                4.81640

ANOVA
             Sum of Squares   df   Mean Square   F        Sig.
Regression   454.482          1    454.482       19.592   .000a
Residual     440.757          19   23.198
Total        895.238          20
$se_b = \dfrac{s_{Y-\hat{Y}}}{s_X\sqrt{N - 1}} = \dfrac{4.816}{2.334\sqrt{21 - 1}} = \dfrac{4.816}{10.438} = .461$
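A sketch of the standard error of b using the slide values, followed by the usual t for the slope, t = b / se_b (the t formula is the standard one, not shown on the slides):

```python
import math

see, s_x, n, b = 4.816, 2.334, 21, 2.042
se_b = see / (s_x * math.sqrt(n - 1))  # se_b = s_{Y-Yhat} / (s_X * sqrt(N-1))
t = b / se_b                           # standard t test of the slope
print(round(se_b, 3), round(t, 2))     # 0.461 and ~4.43, matching the test of r
```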
Testing