LINEAR REGRESSION
A mathematical equation that allows us to predict the values of one dependent variable from
known values of one or more independent variables is called a regression equation. The term
regression is derived from the original heredity studies made by Francis Galton. In his
study, he found that the heights of the sons of tall fathers regressed, over successive generations,
toward the mean height of the population. In other words, sons of unusually tall fathers tend to be
shorter than their fathers, and sons of unusually short fathers tend to be taller than their fathers. Today
the term regression is applied to all types of prediction problems and does not necessarily imply a
regression toward a population mean.
In the study of linear regression, we consider the problem of estimating or predicting the
value of a dependent variable Y on the basis of a known measurement of an independent and
frequently controlled variable X.
Using a scatter diagram, we can determine if the two variables are linearly related to some
extent. Once a reasonable linear relationship has been ascertained, we usually express it
mathematically by a straight-line equation called the linear regression line. The linear regression
line is written in slope-intercept form

$$\hat{y} = a + bx$$

where the constants a and b represent the y-intercept and slope, respectively. The symbol $\hat{y}$ is used
here to distinguish between the value given by the regression line and an actual observed value y for
some value of x.
Once the point estimates a and b are determined from the sample data, the linear regression
line can be used to predict the value $\hat{y}$ corresponding to any given value x.
Estimation of Parameters. Given the sample $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$, the least-squares estimates
of the parameters in the regression line

$$\hat{y} = a + bx$$

are obtained from the formulas

$$b = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2}$$

and

$$a = \bar{y} - b\bar{x}$$
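The two formulas above translate almost line for line into code. The following is a minimal sketch, not a reference implementation; the function name `fit_line` and the small test data set are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom   # slope
    a = sum_y / n - b * sum_x / n              # intercept: y_bar - b * x_bar
    return a, b

# Illustrative data lying exactly on the line y = 1 + 2x
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

Because the illustrative points lie exactly on a line, the fitted intercept and slope should reproduce that line up to floating-point error.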
Example 1. Consider the following data:

x 1 2 3 4 5 6
y 6 4 3 5 4 2
(a) Find the equation of the regression line.
(b) Graph the line on a scatter diagram.
(c) Find the point estimate of $\mu_{Y|4}$, the mean of Y when x = 4.
Solution:
        x_i    y_i    x_i y_i    x_i^2    y_i^2
         1      6        6         1       36
         2      4        8         4       16
         3      3        9         9        9
         4      5       20        16       25
         5      4       20        25       16
         6      2       12        36        4
Total   21     24       75        91      106
(a) $n = 6$, $\sum x_i = 21$, $\sum y_i = 24$, $\sum x_i y_i = 75$, $\sum x_i^2 = 91$, $\sum y_i^2 = 106$,
$\bar{y} = 4$, and $\bar{x} = 3.5$.
Substituting these values in the formula for b, we get

$$b = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n\sum x_i^2 - \left(\sum x_i\right)^2} = \frac{6(75) - (21)(24)}{6(91) - (21)^2} = \frac{450 - 504}{546 - 441} = \frac{-54}{105} = -0.5143$$

so $b \approx -0.514$. Then

$$a = \bar{y} - b\bar{x} = 4 - (-0.514)(3.5) = 4 + 1.799 = 5.799$$

and the regression line is

$$\hat{y} = a + bx = 5.799 - 0.514x$$

(b) [Figure: scatter diagram of the six data points, with X on the horizontal axis (1 to 8) and Y on the vertical axis (1 to 7).]
Since the slope of the regression line is negative, the predicted value of y decreases as x increases.
(c) $\hat{y} = a + bx = 5.799 + (-0.514)(4) = 5.799 - 2.056 = 3.743$
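The arithmetic in this example can be checked with a few lines of code. The sketch below recomputes the sums, the slope, the intercept, and the prediction at x = 4 from the six data pairs; the variable names are chosen for illustration only. Carrying full precision gives an intercept of 5.8; the value 5.799 in the text comes from rounding b to -0.514 before computing a.

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [6, 4, 3, 5, 4, 2]
n = len(xs)

sum_x = sum(xs)                              # 21
sum_y = sum(ys)                              # 24
sum_xy = sum(x * y for x, y in zip(xs, ys))  # 75
sum_x2 = sum(x * x for x in xs)              # 91

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # -54 / 105
a = sum_y / n - b * sum_x / n                                 # y_bar - b * x_bar
y_hat_4 = a + b * 4                                           # prediction at x = 4

print(round(b, 4), round(a, 4), round(y_hat_4, 4))
```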
LINEAR CORRELATION
We shall consider here the problem of measuring the relationship between two variables X
and Y rather than predicting a value of Y from a knowledge of the independent variable X. For
example, if X represents the amount of money spent yearly on advertising by a retail merchandising
firm and Y represents their total yearly sales, we might ask whether a decrease in advertising is likely
to be accompanied by a decrease in the yearly sales.
Correlation analysis attempts to measure the strength of such relationships between two
variables by means of a single number called a correlation coefficient.
A linear correlation coefficient is a measure of the linear relationship between
the two random variables X and Y and is denoted by r. That is, r measures the extent to which
the points cluster about a straight line. By constructing a scatter diagram for the n pairs of
measurements $\{(x_i, y_i);\ i = 1, 2, \ldots, n\}$ in our random sample (as in the graphs below), we are able
to draw certain conclusions concerning r. If the points follow closely a straight line of positive slope
as in (a), we have a high positive correlation between the two variables. On the other hand, if the
points follow closely a straight line of negative slope as in (b), we have a high negative correlation
between the two variables. The correlation between the two variables decreases numerically as the
scattering of points from a straight line increases. If the points follow a strictly random pattern as in
(c) below, we have zero correlation and conclude that no linear relationship exists between X and Y.
[Figure: four scatter diagrams of Y against X. (a) points close to a straight line of positive slope; (b) points close to a straight line of negative slope; (c) points scattered at random; (d) points following a quadratic pattern.]
The correlation coefficient between two variables is a measure of their linear relationship, so
a value of r = 0 implies a lack of linearity and not a lack of association. Hence, if a strong quadratic
relationship exists between X and Y as indicated in (d), we can still obtain a zero correlation even though
there is a strong nonlinear relationship.
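A small numerical illustration of this point: in the sketch below, y is completely determined by x through y = x^2, yet the correlation coefficient is zero because the x values are symmetric about 0. The helper `pearson_r` is an illustrative name; it applies the product-moment formula for r given later in this section.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]   # exact quadratic relationship, no noise
r = pearson_r(xs, ys)
print(r)                   # 0.0: no linear relationship despite perfect association
```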
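The most widely used measure of linear correlation between two variables is called the
PEARSON PRODUCT-MOMENT CORRELATION COEFFICIENT, or simply the SAMPLE
CORRELATION COEFFICIENT, and is denoted by r.
The measure of linear relationship between two variables X and Y is estimated by the sample
correlation coefficient r, where

$$r = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}} = b\,\frac{S_x}{S_y}$$

Here $S_x$ and $S_y$ are the sample standard deviations of the x values and the y values, respectively.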
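Both forms of r can be compared numerically. The sketch below computes r from the raw sums and again as the slope b times the ratio of the sample standard deviations; the function names are illustrative, and the data are the six pairs from the regression example earlier in these notes.

```python
import math
import statistics

def pearson_r(xs, ys):
    """r from the raw-sums form of the product-moment formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den

def slope(xs, ys):
    """Least-squares slope b of the regression line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

xs = [1, 2, 3, 4, 5, 6]
ys = [6, 4, 3, 5, 4, 2]

r_from_sums = pearson_r(xs, ys)
# statistics.stdev uses the n - 1 divisor, i.e. the sample standard deviation
r_from_slope = slope(xs, ys) * statistics.stdev(xs) / statistics.stdev(ys)
print(round(r_from_sums, 4), round(r_from_slope, 4))
```

The two printed values should agree, since the second form is an algebraic rearrangement of the first.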
Since

$$SSE = (n - 1)\left(S_y^2 - b^2 S_x^2\right),$$

dividing both sides of the equation by $(n - 1)S_y^2$ gives the relation

$$r^2 = 1 - \frac{SSE}{(n - 1)S_y^2}$$
Since SSE and $S_y^2$ are always nonnegative, $r^2$ must lie between 0 and 1.
Consequently, r must range from -1 to +1. A value of r = -1 occurs when SSE = 0 and all
points lie exactly on a straight line having a negative slope. If all points lie exactly on a straight line
having a positive slope, once again SSE = 0 and we obtain r = +1. Hence a perfect linear
relationship exists between the values of X and Y in our sample when $r = \pm 1$. If r is close to +1 or
-1, the linear relationship between the two variables is strong and we say that we have a high
correlation. However, if r is close to zero, the linear relationship between X and Y is weak or perhaps
nonexistent.
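The relation between SSE and r can be checked numerically. In the sketch below, SSE is computed directly as the sum of squared residuals about the fitted line and then compared with the expression (n - 1)(S_y^2 - b^2 S_x^2). The slope is computed from deviations about the means, which is algebraically equivalent to the raw-sums formula; the variable names and the choice of data (the six pairs from the regression example) are for illustration only.

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [6, 4, 3, 5, 4, 2]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = s_xy / s_xx                 # least-squares slope
a = y_bar - b * x_bar           # least-squares intercept
var_x = s_xx / (n - 1)          # S_x^2
var_y = s_yy / (n - 1)          # S_y^2

# SSE as the sum of squared residuals about the fitted line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
sse_from_identity = (n - 1) * (var_y - b ** 2 * var_x)

r_squared = 1 - sse / ((n - 1) * var_y)
print(round(sse, 4), round(sse_from_identity, 4), round(r_squared, 4))
```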
A number that expresses the proportion of the total variation in the values of the variable Y
that can be accounted for or explained by the linear relationship with the values of the variable X is
usually referred to as the sample coefficient of determination and is denoted by $r^2$. Thus a correlation of
r = 0.6 means that 0.36, or 36%, of the total variation of the values of Y in our sample is accounted for
by the linear relationship with the values of X.
The values of r and their interpretation

r                   Interpretation
1                   Perfect positive correlation
0.91 to 0.99        Very highly positively correlated
0.71 to 0.90        Highly positively correlated
0.41 to 0.70        Marked or moderately positively correlated
0.21 to 0.40        Low or slightly positively correlated
0.01 to 0.20        Very low positive or negligible
-0.01 to -0.20      Very low negative or negligible
-0.21 to -0.40      Low or slightly negatively correlated
-0.41 to -0.70      Marked or moderately negatively correlated
-0.71 to -0.90      Highly negatively correlated
-0.91 to -0.99      Very highly negatively correlated
-1                  Perfect negative correlation
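The table can be expressed as a small lookup function. This is a sketch under stated assumptions: the function name `interpret_r` is hypothetical, values that fall between two listed ranges (for example 0.705) are assigned to the higher category, and r = 0, which the table does not list, is labelled "Zero correlation".

```python
def interpret_r(r):
    """Return the verbal interpretation of r from the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "Zero correlation"          # assumption: not listed in the table
    sign = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1.0:
        return f"Perfect {sign} correlation"
    if size > 0.90:
        return f"Very highly {sign}ly correlated"
    if size > 0.70:
        return f"Highly {sign}ly correlated"
    if size > 0.40:
        return f"Marked or moderately {sign}ly correlated"
    if size > 0.20:
        return f"Low or slightly {sign}ly correlated"
    return f"Very low {sign} or negligible"

print(interpret_r(0.6))
print(interpret_r(-0.95))
```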
Example 1: Compute and interpret the correlation coefficient for the following data:
X 4 5 9 14 18 22 24
Y 16 22 11 16 7 3 17
Solution:
        x_i    y_i    x_i^2    y_i^2    x_i y_i
         4     16       16      256       64
         5     22       25      484      110
         9     11       81      121       99
        14     16      196      256      224
        18      7      324       49      126
        22      3      484        9       66
        24     17      576      289      408
Total   96     92     1702     1464     1097
$n = 7$, $\sum x_i = 96$, $\sum y_i = 92$, $\sum x_i y_i = 1097$, $\sum x_i^2 = 1702$, and $\sum y_i^2 = 1464$.
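Substituting these values in the formula for r, we get

$$\begin{aligned}
r &= \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}} \\
  &= \frac{7(1097) - (96)(92)}{\sqrt{\left[7(1702) - (96)^2\right]\left[7(1464) - (92)^2\right]}} \\
  &= \frac{7679 - 8832}{\sqrt{(11914 - 9216)(10248 - 8464)}} \\
  &= \frac{-1153}{\sqrt{(2698)(1784)}} = \frac{-1153}{\sqrt{4813232}} = \frac{-1153}{2193.9079} = -0.5255
\end{aligned}$$

so $r \approx -0.53$.
Since r = -0.53, the two variables X and Y are moderately negatively correlated.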
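The column totals and the value of r in this example can be reproduced directly from the seven data pairs. The sketch below uses illustrative variable names and follows the same raw-sums formula.

```python
import math

xs = [4, 5, 9, 14, 18, 22, 24]
ys = [16, 22, 11, 16, 7, 3, 17]
n = len(xs)

sum_x = sum(xs)                              # 96
sum_y = sum(ys)                              # 92
sum_x2 = sum(x * x for x in xs)              # 1702
sum_y2 = sum(y * y for y in ys)              # 1464
sum_xy = sum(x * y for x, y in zip(xs, ys))  # 1097

num = n * sum_xy - sum_x * sum_y             # 7679 - 8832 = -1153
den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = num / den
print(round(r, 4))                           # -0.5255
```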
Example 2. Compute and interpret the correlation coefficient for the aptitude scores and grade-point
averages below:
Grade-Point Average (Y)   Aptitude Score (X)
1.93 565
2.55 525
1.72 477
2.48 555
2.87 502
1.87 469
1.34 517
3.03 555
2.54 576
2.34 559
1.40 574
1.45 578
1.72 548
3.80 656
2.13 688
1.81 465
2.33 661
2.53 477
2.04 490
3.20 524
Solution:
         GPA      AS
         Y_i      X_i     X_i Y_i      X_i^2       Y_i^2
         1.93     565     1090.45     319225      3.7249
         2.55     525     1338.75     275625      6.5025
         1.72     477      820.44     227529      2.9584
         2.48     555     1376.40     308025      6.1504
         2.87     502     1440.74     252004      8.2369
         1.87     469      877.03     219961      3.4969
         1.34     517      692.78     267289      1.7956
         3.03     555     1681.65     308025      9.1809
         2.54     576     1463.04     331776      6.4516
         2.34     559     1308.06     312481      5.4756
         1.40     574      803.60     329476      1.9600
         1.45     578      838.10     334084      2.1025
         1.72     548      942.56     300304      2.9584
         3.80     656     2492.80     430336     14.4400
         2.13     688     1465.44     473344      4.5369
         1.81     465      841.65     216225      3.2761
         2.33     661     1540.13     436921      5.4289
         2.53     477     1206.81     227529      6.4009
         2.04     490      999.60     240100      4.1616
         3.20     524     1676.80     274576     10.2400
TOTAL   45.08   10961    24896.83    6084835    109.4790
$n = 20$, $\sum X_i = 10961$, $\sum Y_i = 45.08$, $\sum X_i Y_i = 24896.83$, $\sum X_i^2 = 6084835$, and $\sum Y_i^2 = 109.4790$.
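Substituting these values in the formula for r, we get

$$\begin{aligned}
r &= \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right]}} \\
  &= \frac{20(24896.83) - (10961)(45.08)}{\sqrt{\left[20(6084835) - (10961)^2\right]\left[20(109.4790) - (45.08)^2\right]}} \\
  &= \frac{497936.6 - 494121.88}{\sqrt{(121696700 - 120143521)(2189.58 - 2032.2064)}} \\
  &= \frac{3814.72}{\sqrt{(1553179)(157.3736)}} = \frac{3814.72}{\sqrt{244429370.67}} = \frac{3814.72}{15634.2371} = 0.2440
\end{aligned}$$

so $r \approx 0.24$.
Since r = 0.24, the grade-point averages show only a low positive correlation with the aptitude scores.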
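With sums this large, hand arithmetic is easy to get wrong, so it is worth recomputing the totals and r directly from the 20 data pairs. The sketch below does that with the same raw-sums formula; the variable names are illustrative.

```python
import math

# (grade-point average Y, aptitude score X) for the 20 students
data = [
    (1.93, 565), (2.55, 525), (1.72, 477), (2.48, 555), (2.87, 502),
    (1.87, 469), (1.34, 517), (3.03, 555), (2.54, 576), (2.34, 559),
    (1.40, 574), (1.45, 578), (1.72, 548), (3.80, 656), (2.13, 688),
    (1.81, 465), (2.33, 661), (2.53, 477), (2.04, 490), (3.20, 524),
]
ys = [gpa for gpa, _ in data]
xs = [score for _, score in data]
n = len(data)

sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

x_term = n * sum_x2 - sum_x ** 2   # n * sum(X^2) - (sum X)^2
y_term = n * sum_y2 - sum_y ** 2   # n * sum(Y^2) - (sum Y)^2
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(x_term * y_term)

print(sum_x, round(sum_y, 2), round(sum_xy, 2), sum_x2, round(sum_y2, 4))
print(x_term, round(y_term, 4), round(r, 4))
```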
The sample correlation coefficient r is a value computed from a random sample of n pairs of
measurements. Different random samples of size n from the same population will generally produce
different values of r.
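This sampling variability can be seen in a small simulation. The sketch below builds a hypothetical population (an assumption for illustration: y = 2x plus normally distributed noise, with a fixed random seed so the run is repeatable), draws five random samples of size 20 from it, and computes r for each sample.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den

rng = random.Random(0)  # fixed seed: repeated runs give the same samples

# Hypothetical population of 10,000 (x, y) pairs
population = []
for _ in range(10_000):
    x = rng.uniform(0, 10)
    population.append((x, 2.0 * x + rng.gauss(0, 5)))

sample_rs = []
for _ in range(5):
    sample = rng.sample(population, 20)      # one random sample of size n = 20
    xs = [p[0] for p in sample]
    ys = [p[1] for p in sample]
    sample_rs.append(pearson_r(xs, ys))

print([round(r, 3) for r in sample_rs])      # five different values of r
```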
EXERCISES: Solve each of the following problems. Show all solutions.
1. The grades of a class of 9 students on a midterm report (x) and on the final examination
(y) are as follows:
x 77 50 71 72 81 94 96 99 67
y 82 66 78 34 47 85 99 99 67
(a) Find the equation of the regression line.
(b) Estimate the final examination grade of a student who received a grade of 85 on the
midterm report but was ill at the time of the final examination.
(c) Compute r.
2. A study was made on the amount of converted sugar in a certain process at various
temperatures. The data were coded and recorded as follows:
Temperature, x   Converted Sugar, y   Temperature, x   Converted Sugar, y
     1.0                8.1                1.6                8.6
     1.1                7.8                1.7               10.2
     1.2                8.5                1.8                9.3
     1.3                9.8                1.9                9.2
     1.4                9.5                2.0               10.5
     1.5                8.9
(a) Estimate the linear regression line.
(b) Estimate the amount of converted sugar produced when the coded temperature is
1.75.
3. A mathematics placement test is given to all entering freshmen at a small college. A
student who receives a grade below 35 is denied admission to the regular mathematics
course and placed in a remedial class. The placement test scores and the final grades for
20 students who took the regular course were recorded as follows:
Placement Test Course Grade Placement Test Course Grade
50 53 90 54
35 41 80 91
35 61 60 48
40 56 60 71
55 68 60 71
65 36 40 47
35 11 55 53
60 70 50 68
90 79 65 57
35 59 50 79
(a) Plot a scatter diagram.
(b) Find the equation of the regression line to predict course grades from placement
test scores.
(c) Graph the line on the scatter diagram.
(d) If 60 is the minimum passing grade, below which placement test score should
students in the future be denied admission to this course?
4. Compute and interpret the correlation coefficient for the following grades of 6 students selected at
random.
Mathematics Grade 70 92 80 74 65 83
English Grade 74 84 63 87 78 90
5. The following data were obtained in a study of the relationship between the weight and
chest size of infants at birth:
Weight (kg) Chest Size (cm) Weight (kg) Chest Size (cm)
2.75 29.5 4.32 27.7
2.15 26.3 2.31 28.3
4.41 32.2 4.30 30.3
5.52 36.5 3.71 28.7
3.21 27.2
(a) Calculate r.
(b) Graph the line on a scatter diagram.
(c) Find the point estimate of $\mu_{Y|4}$.
