A mathematical equation that allows us to predict the values of one dependent variable from
known values of one or more independent variables is called a regression equation. The term
regression is derived from the original heredity studies made by Francis Galton. In his
study, he found that the heights of the sons of tall fathers over successive generations regressed
toward the mean height of the population. In other words, sons of unusually tall fathers tend to be
shorter than their fathers and sons of unusually short fathers tend to be taller than their fathers. Today
the term regression is applied to all types of prediction problems and does not necessarily imply a
regression toward a population mean.
In the study of linear regression, we consider the problem of estimating or predicting the
value of a dependent variable Y on the basis of a known measurement of an independent and
frequently controlled variable X.
Using a scatter diagram, we can determine whether the two variables are linearly related to some
extent. Once a reasonable linear relationship has been ascertained, we usually express it
mathematically by a straight-line equation called the linear regression line. The linear regression
line is written in slope-intercept form as
$\hat{y} = a + bx$
where the constants a and b represent the y-intercept and slope, respectively. The symbol $\hat{y}$ is used
here to distinguish between the value given by the regression line and an actual observed value y for
some value of x.
Once the point estimates a and b are determined from the sample data, the linear regression
line can be used to predict the value $\hat{y}$ corresponding to any given value x.
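$$b = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2} \quad\text{and}\quad a = \bar{y} - b\bar{x}$$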
Total    21    24    75    91    106
(a) $n = 6$, $\sum x_i = 21$, $\sum y_i = 24$, $\sum x_i y_i = 75$, $\sum x_i^2 = 91$, $\sum y_i^2 = 106$,
$\bar{y} = 4$ and $\bar{x} = 3.5$
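The slope and intercept can be obtained directly from the totals quoted above, without the individual observations. The following is a minimal Python sketch of that arithmetic; the function name `least_squares_from_sums` is an illustrative choice and does not come from the text.

```python
# Totals quoted in the text for the n = 6 sample.
n = 6
sum_x, sum_y = 21, 24
sum_xy, sum_x2 = 75, 91


def least_squares_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for the line y-hat = a + b*x, computed from raw sums."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # a = y-bar - b * x-bar
    return a, b


a, b = least_squares_from_sums(n, sum_x, sum_y, sum_xy, sum_x2)
print(f"b = {b:.4f}, a = {a:.4f}")  # b = -0.5143, a = 5.8000
```

With these totals the numerator of b is 6(75) - 21(24) = -54 and the denominator is 6(91) - 21^2 = 105, so the fitted line slopes downward.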
LINEAR CORRELATION
We shall consider here the problem of measuring the relationship between two variables X
and Y rather than predicting a value of Y from a knowledge of the independent variable X. For
example, if X represents the amount of money spent yearly on advertising by a retail merchandising
firm and Y represents their total yearly sales, we might ask whether a decrease in advertising is likely
to be accompanied by a decrease in the yearly sales.
Correlation analysis attempts to measure the strength of such relationships between two
variables by means of a single number called a correlation coefficient.
[Figure: four scatter diagrams of Y against X, panels (a), (b), (c), and (d).]
The correlation coefficient between two variables is a measure of their linear relationship, so
a value of r = 0 implies a lack of linearity and not a lack of association. Hence, if a strong quadratic
relationship exists between X and Y, as indicated in (d), we can still obtain a zero correlation even though
there is a strong nonlinear relationship.
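A small numerical check makes this point concrete. The sketch below uses hypothetical data, with x symmetric about zero and y = x^2, and computes r with the usual computing formula for the sample correlation coefficient. The helper name `pearson_r` is an illustrative choice.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient from the raw-sum computing formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


# Hypothetical data: y is completely determined by x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # 0.0
```

Here the sum of x and the sum of xy are both zero, so the numerator vanishes even though y depends exactly on x.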
The most widely used measure of linear correlation between two variables is called the
PEARSON PRODUCT-MOMENT CORRELATION COEFFICIENT, or simply the SAMPLE
CORRELATION COEFFICIENT, and is denoted by r.
The measure of linear relationship between two variables X and Y is estimated by the sample
correlation coefficient r, where
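$$r = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}} = b\,\frac{S_x}{S_y}$$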
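The two forms of r, the raw-sum formula and the slope b scaled by the ratio of the sample standard deviations, can be compared numerically. The sketch below does so on a small hypothetical data set. It assumes that S_x and S_y are the sample standard deviations with divisor n - 1, which is what `statistics.stdev` computes.

```python
from math import sqrt
from statistics import stdev

# Hypothetical data, used only to compare the two expressions for r.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)
syy = sum(y * y for y in ys)

numerator = n * sxy - sx * sy
r = numerator / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

b = numerator / (n * sxx - sx ** 2)  # least-squares slope
r_alt = b * stdev(xs) / stdev(ys)    # b * S_x / S_y

print(f"{r:.4f} {r_alt:.4f}")
```

Both expressions share the same numerator, which is why r always carries the sign of the slope b.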
Since $SSE = (n-1)\left(S_y^2 - b^2 S_x^2\right)$,
$$r^2 = 1 - \frac{SSE}{(n-1)S_y^2}$$
Because SSE and $S_y^2$ are always nonnegative, $r^2$ must be between zero and 1.
Consequently r must range from -1 to +1. A value of r = -1 will occur when SSE = 0 and all
points lie exactly on a straight line having a negative slope. If all points lie exactly on a straight line
having a positive slope, once again SSE = 0 and we obtain a value r = +1. Hence a perfect linear
relationship exists between the values of X and Y in our sample when $r = \pm 1$. If r is close to +1 or
-1, the linear relationship between the two variables is strong and we say that we have a high
correlation. However, if r is close to zero, the linear relationship between X and Y is weak or perhaps
nonexistent.
A number that expresses the proportion of the total variation in the values of the variable Y
that can be accounted for or explained by the linear relationship with the values of the variable X is
usually referred to as the sample coefficient of determination and is denoted by $r^2$. Thus a correlation of
r = 0.6 means that 0.36 or 36% of the total variation of the values of Y in our sample is accounted for
by the linear relationship with the values of X.
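The reading of r^2 as a proportion of explained variation can be checked by computing SSE from the residuals of a fitted line. The sketch below uses hypothetical data and assumes that S_y^2 is the sample variance with divisor n - 1, which is what `statistics.variance` computes.

```python
from statistics import mean, variance

# Hypothetical data, used only to illustrate r^2 = 1 - SSE / ((n - 1) * S_y^2).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 6]
n = len(xs)

x_bar, y_bar = mean(xs), mean(ys)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

# SSE: sum of squared vertical distances from the fitted line.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / ((n - 1) * variance(ys))
print(f"SSE = {sse:.2f}, r^2 = {r_squared:.3f}")
```

For these numbers the total variation (n - 1)S_y^2 is 14.8 and SSE is 4.8, so about two thirds of the variation in y is accounted for by the line.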
The values of r and their interpretation
r                   Interpretation
1                   Perfect positive correlation
0.91 to 0.99        Very highly positively correlated
0.71 to 0.90        Highly positively correlated
0.41 to 0.70        Marked or moderately positively correlated
0.21 to 0.40        Low or slightly positively correlated
0.01 to 0.20        Very low positive or negligible
-0.01 to -0.20      Very low negative or negligible
-0.21 to -0.40      Low or slightly negatively correlated
-0.41 to -0.70      Marked or moderately negatively correlated
-0.71 to -0.90      Highly negatively correlated
-0.91 to -0.99      Very highly negatively correlated
-1                  Perfect negative correlation
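The table can be turned into a simple lookup. The sketch below is one possible encoding, not part of the original notes. It makes two assumptions that the table does not state: r is rounded to two decimal places before the band is chosen, and a value that rounds to 0.00 is reported as showing no linear correlation. The function name `interpret_r` and the shortened labels are illustrative.

```python
def interpret_r(r):
    """Map a correlation coefficient to a verbal label based on the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    sign = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1.0:
        return f"perfect {sign} correlation"
    # Assumption: round to two decimals; never promote a value to "perfect".
    size = min(round(size, 2), 0.99)
    if size == 0.0:
        return "no linear correlation"  # assumption: the table has no row for 0
    bands = [
        (0.91, "very high"),
        (0.71, "high"),
        (0.41, "moderate"),
        (0.21, "low"),
        (0.01, "very low (negligible)"),
    ]
    for lower, label in bands:
        if size >= lower:
            return f"{label} {sign} correlation"


print(interpret_r(-0.53))  # moderate negative correlation
```

Labels of this kind are conventions for describing a sample; they say nothing about whether the correlation is statistically significant.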
Example 1: Compute and interpret the correlation coefficient for the following data:
X 4 5 9 14 18 22 24
Y 16 22 11 16 7 3 17
Solution:
        $x_i$   $y_i$   $x_i^2$   $y_i^2$   $x_i y_i$
        4       16      16        256       64
        5       22      25        484       110
        9       11      81        121       99
        14      16      196       256       224
        18      7       324       49        126
        22      3       484       9         66
        24      17      576       289       408
Total   96      92      1702      1464      1097
$n = 7$, $\sum x_i = 96$, $\sum y_i = 92$, $\sum x_i y_i = 1097$,
$\sum x_i^2 = 1702$ and $\sum y_i^2 = 1464$
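$$\begin{aligned}
r &= \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}} \\
  &= \frac{7(1097) - (96)(92)}{\sqrt{\left[7(1702) - 96^2\right]\left[7(1464) - 92^2\right]}} \\
  &= \frac{7679 - 8832}{\sqrt{(11914 - 9216)(10248 - 8464)}} \\
  &= \frac{-1153}{\sqrt{(2698)(1784)}} = \frac{-1153}{\sqrt{4813232}} = \frac{-1153}{2193.9079} \\
  &= -0.5255462
\end{aligned}$$
$r \approx -0.53$
Since r = -0.53, the two variables X and Y are moderately negatively correlated.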
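The hand computation in Example 1 can be cross-checked by evaluating the same computing formula on the seven data pairs. The Python sketch below is only a verification aid, and its variable names are illustrative.

```python
from math import sqrt

# Data of Example 1.
x = [4, 5, 9, 14, 18, 22, 24]
y = [16, 22, 11, 16, 7, 3, 17]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 96 92 1702 1464 1097
print(round(r, 4))                           # -0.5255
```

The column totals printed on the first line match the Total row of the solution table.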
Example 2. Compute and interpret the correlation coefficient for the aptitude scores and grade point
averages below:
Grade-point Average Aptitude Score
Y X
1.93 565
2.55 525
1.72 477
2.48 555
2.87 502
1.87 469
1.34 517
3.03 555
2.54 576
2.34 559
1.40 574
1.45 578
1.72 548
3.80 656
2.13 688
1.81 465
2.33 661
2.53 477
2.04 490
3.20 524
Solution:
GPA      AS
$Y_i$    $X_i$    $X_i Y_i$    $X_i^2$    $Y_i^2$
1.93 565 1090.45 319225 3.7249
2.55 525 1338.75 275625 6.50250
1.72 477 820.44 227529 2.95840
2.48 555 1376.40 308025 6.15040
2.87 502 1440.74 252004 8.23690
1.87 469 877.03 219961 3.49690
1.34 517 692.78 267289 1.79560
3.03 555 1681.65 308025 9.18090
2.54 576 1463.04 331776 6.45160
2.34 559 1308.06 312481 5.47560
1.40 574 803.60 329476 1.96000
1.45 578 838.10 334084 2.10250
1.72 548 942.56 300304 2.95840
3.80 656 2492.80 430336 14.44000
2.13 688 1465.44 473344 4.53690
1.81 465 841.65 216225 3.27610
2.33 661 1540.13 436921 5.42890
2.53 477 1206.81 227529 6.40090
2.04 490 999.60 240100 4.16160
3.20 524 1676.80 274576 10.24000
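$n = 20$, $\sum X_i = 10961$, $\sum Y_i = 45.08$, $\sum X_i Y_i = 24896.83$,
$\sum X_i^2 = 6084835$ and $\sum Y_i^2 = 109.47900$
$$\begin{aligned}
r &= \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}} \\
  &= \frac{20(24896.83) - (10961)(45.08)}{\sqrt{\left[20(6084835) - 10961^2\right]\left[20(109.47900) - 45.08^2\right]}} \\
  &= \frac{497936.6 - 494121.88}{\sqrt{(121696700 - 120143521)(2189.58 - 2032.2064)}} \\
  &= \frac{3814.72}{\sqrt{(1553179)(157.3736)}} = \frac{3814.72}{\sqrt{244429370.67}} = \frac{3814.72}{15634.2371} \\
  &= 0.2439978
\end{aligned}$$
$r \approx 0.24$
Since r = 0.24, the grade-point averages are only slightly positively correlated with the aptitude scores.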
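With twenty pairs and seven- to nine-digit intermediate values, a direct recomputation is a useful check on the hand arithmetic, in particular on the subtraction 121696700 - 120143521 = 1553179 in the denominator. The Python sketch below evaluates the computing formula on the data of Example 2 exactly as tabulated. Its variable names are illustrative.

```python
from math import sqrt

# (grade-point average Y, aptitude score X) pairs from Example 2.
pairs = [
    (1.93, 565), (2.55, 525), (1.72, 477), (2.48, 555), (2.87, 502),
    (1.87, 469), (1.34, 517), (3.03, 555), (2.54, 576), (2.34, 559),
    (1.40, 574), (1.45, 578), (1.72, 548), (3.80, 656), (2.13, 688),
    (1.81, 465), (2.33, 661), (2.53, 477), (2.04, 490), (3.20, 524),
]

n = len(pairs)
sum_x = sum(x for _, x in pairs)
sum_y = sum(y for y, _ in pairs)
sum_xy = sum(x * y for y, x in pairs)
sum_x2 = sum(x ** 2 for _, x in pairs)
sum_y2 = sum(y ** 2 for y, _ in pairs)

x_term = n * sum_x2 - sum_x ** 2  # exact integer arithmetic
y_term = n * sum_y2 - sum_y ** 2
r = (n * sum_xy - sum_x * sum_y) / sqrt(x_term * y_term)

print(sum_x, sum_x2, x_term)
print(round(sum_y, 2), round(sum_y2, 4), round(r, 3))
```

On the tabulated data this evaluates to r of about 0.244, which falls in the low positive band of the interpretation table.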
The sample correlation coefficient r is a value computed from a random sample of n pairs of
measurements. Different random samples of size n from the same population will generally produce
different values of r.
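A short simulation illustrates this sampling variability. The sketch below draws several samples of size n = 20 from one hypothetical population, in which y = 0.5x plus independent normal noise, and computes r for each. The population model, the seed, and the function names are illustrative assumptions and do not come from the text.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient from the raw-sum computing formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


def draw_sample(n):
    """Draw n (x, y) pairs from the hypothetical population y = 0.5x + noise."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [0.5 * x + random.gauss(0, 1) for x in xs]
    return xs, ys


random.seed(0)  # fixed seed so that reruns give the same samples
rs = [pearson_r(*draw_sample(20)) for _ in range(5)]
print([round(r, 2) for r in rs])
```

Each value in the printed list estimates the same population correlation, yet the five estimates differ from sample to sample.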
1. The grades of a class of 9 students on a midterm report (x) and on the final examination
(y) are as follows:
x 77 50 71 72 81 94 96 99 67
y 82 66 78 34 47 85 99 99 67
2. A study was made on the amount of converted sugar in a certain process at various
temperatures. The data were coded and recorded as follows:
Placement Test Course Grade Placement Test Course Grade
50 53 90 54
35 41 80 91
35 61 60 48
40 56 60 71
55 68 60 71
65 36 40 47
35 11 55 53
60 70 50 68
90 79 65 57
35 59 50 79
4. Compute and interpret the correlation for the following grades of 6 students selected at
random.
Mathematics Grade 70 92 80 74 65 83
English Grade 74 84 63 87 78 90
5. The following data were obtained in a study of the relationship between the weight and
chest size of infants at birth:
Weight (kg) Chest Size (cm) Weight (kg) Chest Size (cm)
2.75 29.5 4.32 27.7
2.15 26.3 2.31 28.3
4.41 32.2 4.30 30.3
5.52 36.5 3.71 28.7
3.21 27.2
(a) Calculate r.
(b) Graph the line on a scatter diagram.
(c) Find the point estimate of y14 .