
Unit 3: Simple Correlation And Regression Analysis


Simple correlation:

1 Introduction
Correlation is a statistical method used to determine whether a relationship exists between two or
more variables. When it is used to measure the relationship between two variables, it is called
simple correlation. When the values of both variables increase (or decrease) together, the
correlation is called positive. When the value of one variable increases while the value of the
other variable decreases, the correlation is called negative.

2 Methods of measuring the correlation between two variables, say ‘X’ and ‘Y’
i) The scatter diagram (scatter plot)
ii) Karl Pearson’s correlation coefficient (r)

i) Scatter diagram: This is the graphical method of examining the relationship between
two variables. In a scatter plot, pairs of data are plotted with the values of one variable
along the X-axis and the values of the other variable along the Y-axis. The resulting scatter
of points is called the scatter plot. If the plotted points show an upward or downward
trend, the two variables are said to be correlated; if the plotted points do not show any
trend, the two variables are said to be uncorrelated.
Some types of relationship revealed by a scatter plot are shown in the following figure.

[Figure: four scatter plots, each with X on the horizontal axis and Y on the vertical axis.
Panels: positive relationship; negative relationship; positive curvilinear relationship;
negative curvilinear relationship.]

ii) Karl Pearson’s correlation coefficient (r): The coefficient of correlation between
two variables, say ‘X’ and ‘Y’, was defined by Karl Pearson to measure the strength of the
linear relationship between them. It is denoted by ‘r’, its value always lies between -1
and +1, and it is calculated by using any of the following relations.

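$$r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} \qquad (1)$$

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} \qquad (2)$$

$$r = \frac{\sum xy - n\,\bar{x}\,\bar{y}}{\sqrt{\sum x^2 - n\,\bar{x}^2}\,\sqrt{\sum y^2 - n\,\bar{y}^2}} \qquad (3)$$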

Where
r = Correlation coefficient; its value always lies between -1 and +1
n = Number of pairs of data.

$$\bar{x} = \frac{\sum x}{n} \qquad (4)$$

$$\bar{y} = \frac{\sum y}{n} \qquad (5)$$

$$SS_x = \sum (x - \bar{x})^2 = \sum x^2 - n\,\bar{x}^2 \qquad (6)$$

$$SS_y = \sum (y - \bar{y})^2 = \sum y^2 - n\,\bar{y}^2 \qquad (7)$$

$$SS_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - n\,\bar{x}\,\bar{y} \qquad (8)$$
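
The relations above can be sketched in a few lines of Python. This is only a minimal
illustration: the function name pearson_r and the five data pairs are made up for the example
and do not come from these notes. It uses relation (1) together with the computational forms
(6) to (8).

```python
import math

def pearson_r(x, y):
    """Pearson's r = SSxy / sqrt(SSx * SSy), using the computational forms (6)-(8)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum(v * v for v in x) - n * x_bar ** 2
    ss_y = sum(v * v for v in y) - n * y_bar ** 2
    ss_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    return ss_xy / math.sqrt(ss_x * ss_y)

# Illustrative data (an assumption for this sketch, not taken from the notes)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```

Here SSx = 10, SSy = 6 and SSxy = 6, so r = 6 / sqrt(60), a positive correlation.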

3 Interpretation of correlation coefficient (r)


If r = 0, this means there is no correlation (no linear relationship) between the two variables.
If r > 0, this means there is positive correlation between the two variables.
If r < 0, this means there is negative correlation between the two variables.
If r = -1, this means there is a perfect negative relationship between the two variables.
If r = +1, this means there is a perfect positive relationship between the two variables.

4 Test of significance for correlation coefficient:


The test statistic for testing the significance of the correlation coefficient is obtained under
the assumption that, under the null hypothesis, the population correlation coefficient (ρ) is zero.
Thus, the null and alternative hypotheses are set as

Null hypothesis (H0): ρ = 0 (This null hypothesis means that there is no correlation between
the x and y variables in the population.)
Alternative hypothesis (H1): ρ ≠ 0 (This alternative hypothesis means that there is a significant
correlation between the x and y variables in the population.) (Two tailed)

If the null hypothesis is accepted, you can conclude that there is no significant association
between the two variables. If the alternative hypothesis is accepted, you can conclude that there
is a significant association (correlation) between the two variables.

Test statistic:
$$t_{cal} = \frac{r\,\sqrt{n-2}}{\sqrt{1-r^2}}$$
It follows Student’s t-distribution with (n-2) degrees of freedom.

Decision: If the absolute value of the calculated test statistic (|tcal|) is less than the
tabulated value (ttab), the null hypothesis is accepted; otherwise the null hypothesis is rejected, i.e.
If |tcal| < tα, n-2, the null hypothesis is accepted. Otherwise the alternative hypothesis is accepted.

Where
tα, n-2 = tabulated value of ‘t’ at (n-2) degrees of freedom and ‘α’ level of significance, obtained
from the two-tailed t-table.
n = number of pairs of data.
α = level of significance
r = sample correlation coefficient
ρ = population correlation coefficient.
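
As a rough illustration of this test, the Python sketch below computes tcal and applies the
two-tailed decision rule. The values r = 0.7746 and n = 5 are made up for the example, the
function name correlation_t is hypothetical, and the tabulated value 3.182 (5% level, 3 degrees
of freedom) is read from a standard two-tailed t-table rather than computed.

```python
import math

def correlation_t(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Illustrative values (assumptions, not taken from the notes)
r, n = 0.7746, 5
t_tab = 3.182  # two-tailed t-table value for alpha = 0.05 and n - 2 = 3 degrees of freedom

t_cal = correlation_t(r, n)
reject_h0 = abs(t_cal) >= t_tab  # two-tailed test, so compare |t_cal| with t_tab

print(f"t_cal = {t_cal:.3f}, reject H0: {reject_h0}")
```

With these numbers |tcal| is about 2.12, which is below 3.182, so the null hypothesis is not
rejected even though r looks fairly large; with only five pairs the test has little power.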

5 Problems on Correlation analysis:


1. A study of the car running cost and family income of 10 families gave the following data
∑X = 3150, ∑Y = 315, ∑X² = 1128750, ∑Y² = 12225, ∑XY = 116375 and n = 10
i) Calculate the correlation coefficient and interpret its value.
2. Calculate Karl Pearson’s correlation coefficient for the following data on sales and
expenses (in thousands of rupees) of 5 firms. Also interpret the value of the correlation
coefficient.
Sales: 43 41 36 34 50
Expenses: 12 24 15 21 19
Test the significance of the correlation coefficient at the 5% level of significance.
3. A computer, while calculating the correlation coefficient between two variables X and Y from
12 pairs of observations, obtained the following results:
n = 12, ∑X = 30, ∑X² = 670, ∑Y = 5, ∑Y² = 285, ∑XY = 334
It was, however, later discovered at the time of checking that one pair had been copied down as
(x = 11, y = 4), while the correct values were (x = 10, y = 14). Obtain the correct value of
the correlation coefficient.

Simple Regression Analysis:

1 Introduction
Regression analysis is used to measure the relationship between two or more variables. It is used
to predict or estimate the value of one dependent (response, or endogenous) variable based on the
known values of the independent (explanatory, or regressor) variables. The unknown variable which
we have to estimate or predict is called the dependent variable and is denoted by ‘y’. The variable
whose value is given in order to estimate the value of the dependent variable (y) is called the
independent variable and is denoted by ‘x’.
When regression analysis is used to measure the relationship between one dependent variable (y)
and one independent variable (x), it is called simple regression analysis.

2 Regression line (Regression Model)


A simple linear regression line between one dependent variable (y) and one independent variable
(x1) is written as
$$y = \beta_0 + \beta_1 x_1 + e \qquad (1)$$

Where
y = dependent variable
x1 = independent variable.
β0 = y-intercept for the population.
β1 = slope for the population, i.e. the regression coefficient of the dependent variable (y) on
the independent variable (x1).
e = error term, i.e. the difference between the observed and estimated values of the dependent
variable (y).

To obtain the best fit of the regression model of y on x, we need the values of β0 and β1, which
are unknown. They are estimated by b0 and b1. By using the principle of least squares, we get two
normal equations of regression model (1).
The two normal equations of regression line (1) are
$$\sum y = n\,b_0 + b_1 \sum x_1 \qquad (2)$$

$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 \qquad (3)$$

By solving these two normal equations we get the values of b0 and b1 as


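$$b_0 = \bar{y} - b_1 \bar{x}_1 \qquad (4)$$

$$b_1 = \frac{SS_{xy}}{SS_x} = \frac{\sum (x_1 - \bar{x}_1)(y - \bar{y})}{\sum (x_1 - \bar{x}_1)^2} \qquad (5)$$

Or

$$b_1 = \frac{\sum x_1 y - n\,\bar{x}_1\,\bar{y}}{\sum x_1^2 - n\,\bar{x}_1^2} \qquad (6)$$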

After finding the values of b0 and b1, we get the required fitted regression model of y on x as
$$\hat{y} = b_0 + b_1 x_1$$
Where
ŷ = estimated value of the dependent variable (y) for a given value of the independent variable (x1)
x1 = independent variable.
b0 = estimated value of β0, i.e. the y-intercept.
b1 = estimated value of β1, i.e. the regression coefficient of y on x1, or the slope of the regression line.
n = Number of pairs of data.
x̄1 = mean of the independent variable
ȳ = mean of the dependent variable.
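
A minimal Python sketch of the least-squares fit follows. The function name fit_line and the
data are assumptions made for illustration. It computes b1 from relation (6) and b0 from
relation (4), then checks that the solution satisfies both normal equations (2) and (3).

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line, using relations (4) and (6)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    ss_x = sum(a * a for a in x) - n * x_bar ** 2
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Illustrative data (an assumption for this sketch, not taken from the notes)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
print(f"y_hat = {b0:.1f} + {b1:.1f} * x1")  # y_hat = 2.2 + 0.6 * x1

# Both sides of the normal equations (2) and (3) for the fitted values
n = len(x)
lhs2, rhs2 = sum(y), n * b0 + b1 * sum(x)
lhs3 = sum(a * b for a, b in zip(x, y))
rhs3 = b0 * sum(x) + b1 * sum(a * a for a in x)
print(abs(lhs2 - rhs2) < 1e-9, abs(lhs3 - rhs3) < 1e-9)  # True True
```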

3 Interpreting the regression coefficients:


Suppose we have the fitted simple regression model
$$\hat{y} = 15 - 3x_1$$
a. The coefficient ‘b0’ (estimated value of β0) represents the average value of the dependent
variable (y) when the value of the independent variable (x1) is zero.
For example, in the above model, b0 = 15. This means the average value of the
dependent variable (y) is 15 when x1 = 0.
b. The regression coefficient ‘b1’ (estimated value of β1) measures the average rate of
increase or decrease in the value of the dependent variable (y) when the value of the
independent variable (x1) increases by one unit.
For example, in the above model, b1 = -3. This means the value of the dependent variable
(y) decreases by 3 on average when the value of the independent variable (x1) increases by 1.
Note: If in the above model b1 = 3, the value of the dependent variable (y)
increases by 3 on average when the value of the independent variable (x1) increases by 1.
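
The two interpretations can be checked numerically for the example model with b0 = 15 and
b1 = -3. The helper name predict is hypothetical and used only for this sketch.

```python
def predict(x1, b0=15.0, b1=-3.0):
    """Fitted value y_hat = b0 + b1 * x1 for the example model."""
    return b0 + b1 * x1

print(predict(0))               # 15.0, the intercept b0
print(predict(5) - predict(4))  # -3.0, the change in y_hat per unit increase in x1
```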

4 Error term (Residual):


The difference between the observed and estimated values of the dependent variable (y) is called
the error or residual, and it is denoted by ‘e’.
$$e = y - \hat{y}$$
Where
e = Error term
y = Observed value of the dependent variable.
ŷ = Estimated value of the dependent variable for a given value of the independent variable.
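
A short Python sketch of the residual calculation, under assumed data: the five pairs and the
fitted line ŷ = 2.2 + 0.6x1 are illustrative only, and the function name residuals is
hypothetical.

```python
def residuals(x, y, b0, b1):
    """Residual e = y - y_hat for each observed pair."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Illustrative data and fitted line (assumptions for this sketch)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6  # least-squares fit for the data above

e = residuals(x, y, b0, b1)
print([round(v, 2) for v in e])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

For a least-squares line with an intercept the residuals sum to zero, which is a quick check
on the arithmetic.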

5 Measures of Variation:
To examine the ability of the independent variable to predict the dependent variable (y) in the
regression model, several measures of variation need to be developed. In a regression analysis,
the total variation or total sum of squares (SST) is subdivided into the explained variation or
regression sum of squares (SSR) and the unexplained variation or error sum of squares (SSE). These
different measures of variation are shown in the following figure.

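[Figure: the fitted line ŷ = b0 + b1x1 plotted with X on the horizontal axis and Y on the
vertical axis, with SST, SSR and SSE marked for an observed point.]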

From the figure, mathematically

Total Sum of Squares (SST) = Regression Sum of Squares (SSR) + Error Sum of Squares (SSE), i.e.
$$SST = SSR + SSE \qquad (7)$$
Where,
$$SST = \sum (y - \bar{y})^2 = \sum y^2 - n\,\bar{y}^2$$

$$SSR = \sum (\hat{y} - \bar{y})^2 = b_0 \sum y + b_1 \sum x_1 y - n\,\bar{y}^2$$

$$SSE = \sum (y - \hat{y})^2 = \sum y^2 - b_0 \sum y - b_1 \sum x_1 y$$
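
The decomposition in relation (7) can be verified numerically. The sketch below uses assumed
data and its least-squares line (b0 = 2.2, b1 = 0.6); the function name variation is
hypothetical. Note that SST = SSR + SSE holds for the least-squares line, not for an arbitrary
line through the data.

```python
def variation(x, y, b0, b1):
    """Return (SST, SSR, SSE) from the definitional forms."""
    n = len(y)
    y_bar = sum(y) / n
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return sst, ssr, sse

# Illustrative data and its least-squares line (assumptions for this sketch)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

sst, ssr, sse = variation(x, y, b0, b1)
print(round(sst, 4), round(ssr, 4), round(sse, 4))  # 6.0 3.6 2.4
```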

6 Standard error of the estimate (Se or Sy.x)


The standard error of the estimate measures the average variation or scatter of the observed
data points around the regression line. It is used to measure the reliability of the regression
equation, is denoted by Se or Sy.x, and is calculated by using any of the following relations.

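$$S_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}} \qquad (8)$$

$$S_e = \sqrt{\frac{\sum y^2 - b_0 \sum y - b_1 \sum x_1 y}{n-2}} \qquad (9)$$

$$S_e = \sqrt{\frac{SS_y - b_1\,SS_{xy}}{n-2}} \qquad (10)$$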

The notations have their usual meaning.
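
As a hedged illustration, the Python sketch below evaluates relation (10) on assumed data. The
function name standard_error and the data are not from these notes.

```python
import math

def standard_error(x, y):
    """Standard error of the estimate via relation (10): sqrt((SSy - b1*SSxy) / (n - 2))."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum(v * v for v in x) - n * x_bar ** 2
    ss_y = sum(v * v for v in y) - n * y_bar ** 2
    ss_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    b1 = ss_xy / ss_x
    return math.sqrt((ss_y - b1 * ss_xy) / (n - 2))

# Illustrative data (an assumption for this sketch)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

se = standard_error(x, y)
print(round(se, 4))  # 0.8944
```

Here SSy - b1·SSxy = 6 - 0.6 × 6 = 2.4, which is the same SSE that relation (8) would give, so
Se = sqrt(2.4 / 3).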

Interpreting the standard error of the estimate:


The regression line with the smaller standard error of the estimate is more reliable than the
regression line with the larger standard error of the estimate, i.e. the smaller the standard
error of the estimate, the more reliable the fitted regression line.
a. If Se = 0, there is no variation of the observed data around the regression line,
i.e. all the observed data lie on the regression line. So we expect the regression line
to be perfect for predicting the dependent variable.
b. If the value of Se is large, the fitted regression line is poor for predicting the dependent
variable, since there is greater variation of the observed data around the regression line.
c. If the value of Se is small, there is less variation of the observed data around
the regression line. So the regression line will be better for predicting the dependent
variable.
If Se = 2.5, the average variation of the observed data around the regression line is 2.5,
measured in the units of the dependent variable.

7 Confidence Interval Estimate


a. Confidence interval for the Y-intercept (β0)
$$b_0 - t_{n-2,\alpha}\,S_{b_0} \le \beta_0 \le b_0 + t_{n-2,\alpha}\,S_{b_0}$$

b. Confidence interval for the regression coefficient or slope (β1)
$$b_1 - t_{n-2,\alpha}\,S_{b_1} \le \beta_1 \le b_1 + t_{n-2,\alpha}\,S_{b_1}$$

c. Confidence interval estimate for the mean of the dependent variable (y)
$$\hat{y} - t_{n-2,\alpha}\,S_e\sqrt{h} \le \mu_{Y|X=x} \le \hat{y} + t_{n-2,\alpha}\,S_e\sqrt{h}$$

d. Prediction interval for an individual response of the dependent variable (y)
$$\hat{y} - t_{n-2,\alpha}\,S_e\sqrt{1+h} \le Y_{X=x} \le \hat{y} + t_{n-2,\alpha}\,S_e\sqrt{1+h}$$

e. Approximate prediction interval: This interval gives the range within which the actual value
of the dependent variable (Y) lies for a given value of the independent variable.
$$\hat{y} - t_{n-2,\alpha}\,S_e \le Y_{X=x} \le \hat{y} + t_{n-2,\alpha}\,S_e$$

Where
ŷ = Estimated value of the dependent variable for a given value of the independent variable.
Sb0 = Standard error of the y-intercept (b0)
Sb1 = Standard error of the regression coefficient (b1)

$$S_{b_0} = \frac{S_e\,\sqrt{\sum x^2 / n}}{\sqrt{\sum x^2 - n\,\bar{x}^2}} \qquad (11)$$

$$S_{b_1} = \frac{S_e}{\sqrt{\sum x^2 - n\,\bar{x}^2}} \qquad (12)$$

$$h = \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - n\,\bar{x}^2} \qquad (13)$$

Se = Standard error of the estimate
tn-2, α = Tabulated value of ‘t’ obtained from the two-tailed Student’s t-table at (n-2) degrees
of freedom and ‘α’ level of significance.
n = Number of pairs of observations.
Other notations have their usual meanings.
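
The five intervals can be computed together, as in the Python sketch below. Everything specific
here is an assumption made for illustration: the data, the point x = 4 at which the intervals
are evaluated, the helper name interval, and the tabulated value 3.182 (5% level, 3 degrees of
freedom, two-tailed), which is read from a t-table rather than computed.

```python
import math

# Illustrative data (assumption) and tabulated t for alpha = 0.05, n - 2 = 3 df
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t_tab = 3.182

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sum_x2 = sum(v * v for v in x)
ss_x = sum_x2 - n * x_bar ** 2
ss_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2))                        # standard error of the estimate

s_b0 = se * math.sqrt(sum_x2 / n) / math.sqrt(ss_x)  # relation (11)
s_b1 = se / math.sqrt(ss_x)                          # relation (12)

def interval(center, half_width):
    """Return (lower, upper) limits of center +/- half_width."""
    return center - half_width, center + half_width

x_new = 4                                            # point at which y is estimated
y_hat = b0 + b1 * x_new
h = 1 / n + (x_new - x_bar) ** 2 / ss_x              # relation (13)

ci_intercept = interval(b0, t_tab * s_b0)                        # a
ci_slope = interval(b1, t_tab * s_b1)                            # b
ci_mean = interval(y_hat, t_tab * se * math.sqrt(h))             # c
pi_individual = interval(y_hat, t_tab * se * math.sqrt(1 + h))   # d
pi_approx = interval(y_hat, t_tab * se)                          # e

for name, (low, high) in [("intercept", ci_intercept), ("slope", ci_slope),
                          ("mean of y", ci_mean), ("individual y", pi_individual),
                          ("approximate", pi_approx)]:
    print(f"{name}: ({low:.3f}, {high:.3f})")
```

The prediction interval for an individual response is always wider than the confidence
interval for the mean at the same x, because of the extra 1 under the square root.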

8 Test of significance for the regression coefficient (β1):

To determine the existence of a significant linear relationship between the dependent variable (y)
and the independent variable (x1), a hypothesis test concerning the population slope (β1, i.e. the
regression coefficient) is made by setting the null and alternative hypotheses as stated below.

Null hypothesis (H0): β1 = 0 (This means there is no linear relationship between the dependent and
independent variables.)

Alternative hypothesis (H1): β1 ≠ 0 (This means there is a significant linear relationship
between the dependent and independent variables.) (Two tailed)

If the null hypothesis is accepted, you can conclude that there is no significant linear
relationship between the dependent and independent variables. If the alternative hypothesis is
accepted, you can conclude that there is a significant linear relationship between the dependent
and independent variables.

Test statistic:
$$t_{cal} = \frac{b_1}{S_{b_1}}$$
This test statistic follows the t-distribution with (n-2) degrees of freedom.

Decision: If the absolute value of the calculated test statistic (|tcal|) is less than the
tabulated value (ttab), the null hypothesis is accepted; otherwise the null hypothesis is rejected, i.e.
If |tcal| < tα, n-2, the null hypothesis is accepted. Otherwise the alternative hypothesis is accepted.

Where
tα, n-2 = Tabulated value of ‘t’ at (n-2) degrees of freedom and ‘α’ level of significance, obtained
from the two-tailed t-table.
n = number of pairs of data.
α = level of significance
b1 = Regression coefficient of y on x1.
Sb1 = Standard error of the regression coefficient (b1)
$$S_{b_1} = \frac{S_e}{\sqrt{SS_x}} \quad \text{or} \quad S_{b_1} = \frac{S_e}{\sqrt{\sum x^2 - n\,\bar{x}^2}}$$
Other notations have their usual meanings.
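
A brief Python sketch of the slope test, using assumed values: b1 = 0.6, Se = 0.8944, SSx = 10
and n = 5 are illustrative, the function name slope_t is hypothetical, and the tabulated value
3.182 (5% level, 3 degrees of freedom, two-tailed) is read from a t-table.

```python
import math

def slope_t(b1, se, ss_x):
    """Test statistic t = b1 / S_b1, where S_b1 = Se / sqrt(SSx)."""
    s_b1 = se / math.sqrt(ss_x)
    return b1 / s_b1

# Illustrative values (assumptions, not taken from the notes)
b1, se, ss_x = 0.6, 0.8944, 10.0
t_tab = 3.182  # two-tailed t-table value for alpha = 0.05 and n - 2 = 3 degrees of freedom

t_cal = slope_t(b1, se, ss_x)
reject_h0 = abs(t_cal) >= t_tab

print(f"t_cal = {t_cal:.3f}, reject H0: {reject_h0}")
```

In simple regression this statistic equals the one obtained from the t-test for the correlation
coefficient on the same data, so the two tests always lead to the same decision.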

9 Coefficient of determination (r²):

The coefficient of determination measures the strength or extent of the association that exists
between the dependent variable (y) and the independent variable (x1). It measures the proportion of
variation in the dependent variable (y) that is explained by the regression line. In other words,
the coefficient of determination measures the share of the total variation in the dependent
variable that is due to the variation in the independent variable, and it is denoted by ‘r²’. The
following relations are used to obtain the value of the coefficient of determination.
$$r^2 = \frac{SSR}{SST} \qquad (14)$$

$$r^2 = 1 - \frac{SSE}{SST} \qquad (15)$$

$$r^2 = \frac{b_0 \sum y + b_1 \sum x_1 y - n\,\bar{y}^2}{\sum y^2 - n\,\bar{y}^2} \qquad (16)$$
Note: Since the coefficient of determination is the square of the correlation coefficient, the
correlation coefficient is the square root of the coefficient of determination and can be obtained
from it by the following relation.
$$r = \pm\sqrt{r^2} \qquad (17)$$
If the regression coefficient (b1) is negative, take the negative sign.
If the regression coefficient (b1) is positive, take the positive sign.
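
The relations (14) and (17) can be sketched in Python as follows. The function names and the
input values (SSR = 3.6, SSE = 2.4, b1 = 0.6) are assumptions chosen for illustration.

```python
import math

def coefficient_of_determination(ssr, sse):
    """r^2 = SSR / SST, where SST = SSR + SSE."""
    return ssr / (ssr + sse)

def correlation_from_r2(r2, b1):
    """r = +/- sqrt(r^2), taking the sign of the regression coefficient b1."""
    return math.copysign(math.sqrt(r2), b1)

# Illustrative values (assumptions for this sketch)
r2 = coefficient_of_determination(ssr=3.6, sse=2.4)
r_pos = correlation_from_r2(r2, b1=0.6)
r_neg = correlation_from_r2(r2, b1=-0.6)

print(round(r2, 4), round(r_pos, 4), round(r_neg, 4))  # 0.6 0.7746 -0.7746
```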

Adjusted coefficient of determination (r²adj): The adjusted coefficient of determination is
calculated by using the following relation.
$$r^2_{adj} = 1 - (1 - r^2)\,\frac{n-1}{n-2} \qquad (18)$$
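
A one-function Python sketch of relation (18). The function name adjusted_r2 is hypothetical,
and the sample size n = 20 is an assumed value chosen only to have something to plug in
alongside r² = 0.91.

```python
def adjusted_r2(r2, n):
    """Adjusted r^2 for simple regression: 1 - (1 - r^2) * (n - 1) / (n - 2)."""
    return 1 - (1 - r2) * (n - 1) / (n - 2)

# r2 = 0.91 with an assumed sample size of n = 20
print(f"{adjusted_r2(0.91, 20):.3f}")  # 0.905
```

The adjusted value is never larger than r², and the gap shrinks as n grows.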

Interpretation of the coefficient of determination (r²): A regression model with a higher value
of the coefficient of determination is better and more reliable than a regression model with a
smaller value, i.e. a higher value of r² is better than a lower value of r².
For example, if r² = 0.91, this means 91% of the total variation in the dependent variable
(y) is due to the variation in the independent variable (x1), and the remaining 9% of the variation
in the dependent variable is due to other factors which are not accounted for by the independent
variable.

10 Assumption on Regression Analysis:


The following three assumptions are necessary for regression analysis:
i) Normality of errors
ii) Homoscedasticity
iii) Independence of errors

i) Normality of errors: This assumption requires that the errors around the regression line
be normally distributed for each value of X (the independent variable). As long as the
distribution of the errors around the regression line for each value of the independent
variable is not extremely different from a normal distribution, inferences about the
line of regression and the regression coefficients will not be seriously affected.
ii) Homoscedasticity: This assumption requires that the variation around the line of
regression be constant for all values of the independent variable (X). This means that the
errors vary by the same amount when X is a low value as when X is a high value. The
homoscedasticity assumption is important for using the least squares method to fit the
regression line. If there are serious departures from this assumption, either data
transformations or the weighted least squares method can be applied.
iii) Independence of errors: This assumption requires that the errors around the regression
line be independent for each value of the explanatory variable. This is particularly important
when data are collected over a period of time. In such situations, the errors for a specific time
period are often correlated with those of the previous time period.

11 Residual analysis:
Residual analysis is a graphical method for evaluating whether the regression model that has been
fitted to the data is an appropriate model. In addition, residual analysis enables potential
violations of the assumptions of the regression model to be detected.
The aptness of the fitted regression model is evaluated by plotting the residuals on the vertical
axis against the corresponding X values of the independent variable on the horizontal axis. If the
fitted model is appropriate for the data, there will be no apparent pattern in this plot. However,
if the fitted model is not appropriate, there will be a relationship between the X values and the
residuals (e).
By plotting a histogram, box-and-whisker plot, or stem-and-leaf display of the residuals, we
can assess the normality of the errors.

12 Problems on Simple Correlation and Regression Analysis:


1. A study of the car running cost and family income of 10 families gave the following data
∑X = 3150, ∑Y = 315, ∑X² = 1128750, ∑Y² = 12225, ∑XY = 116375 and n = 10
a. Calculate the correlation coefficient.
b. Calculate the regression equation relating the running cost of a car and the family income.
2. For the following set of data
X: 13 16 14 11 17 9 13 17 18 12
Y: 6.2 8.6 7.2 4.5 9.0 3.5 6.5 9.3 9.5 5.7
a. Plot the scatter diagram.
b. Compute the coefficient of correlation and interpret its value.
c. Develop the estimating equation that best describes the data and interpret the
regression coefficients.
d. Predict Y for X = 10, 15, 20.
e. Compute the residual when X = 14.
3. An instructor is interested in finding out how the number of students absent on a given
day is related to the mean temperature that day. A random sample of 10 days was used for
the study. The following data indicate the number of students absent (ABS) and the mean
temperature (TEMP) for each day.
ABS: 8 7 5 4 2 3 5 6 8 9
TEMP: 10 20 25 30 40 45 50 55 59 60
a. Develop the estimating equation that best describes the data.
b. What is the logical explanation for the observed relationship?
c. Compute the residual when X= 25
d. Compute the standard error of the estimate and interpret the standard error of the
estimate.
4. Cost accountants often estimate overhead based on the level of production. At the Standard
Knitting Co., they have collected information on overhead expenses and units produced at
different plants, and want to estimate a regression equation to predict future overhead.
Overhead: 191 170 272 155 280 173 234 116 153 178
Units: 40 42 53 35 56 39 48 30 37 40
a. Develop the regression equation for the cost accountants.
b. Predict overhead when 50 units are produced.
c. Calculate the standard error of estimate and interpret the value of the standard error of
the estimate.
d. Calculate the correlation coefficient and test it at the 95% confidence level.

5. Using the data given below


X 16 6 10 5 12 14
Y -4.4 8 2.1 8.7 0.1 -2.9
a. Develop the estimating equation that best describes the data.
b. Predict Y for X= 5, 6, 7
c. Interpret the meaning of the slope.
d. Compute the coefficient of determination and interpret its meaning.
e. Obtain the estimate of the correlation coefficient and interpret its meaning.
f. Obtain the standard error of the estimate and interpret its meaning.
g. Test for the significance of the slope.
h. Obtain the 95% confidence interval estimate of the slope.
i. Carry out the t-test for the correlation coefficient.
j. Obtain the confidence interval estimate for the mean of Y for x= 7.
k. Obtain the 95% prediction interval for individual Y for the value of x= 7.
l. Obtain the 95% approximate prediction interval of Y for the value of x=7.

6. Sales of major appliances vary with the new housing market: when new home sales are
good, so are the sales of dishwashers, washing machines, driers, and refrigerators. A trade
association compiled the following historical data (in thousands of units) on major
appliance sales and housing starts:
Housing starts (thousands): 2.0 2.5 3.2 3.6 3.3 4.0 4.2 4.6 4.8
Appliance sales (thousands): 5 5.5 6 7 7.2 7.7 8.4 9 9.7
a. Develop an equation for the relationship between appliance sales (in thousands) and
housing starts (in thousands)
b. Interpret the slope of the regression line.
c. Compute and interpret the standard error of estimate.
d. Compute the 90% prediction interval for the appliance sales when housing is 8.0
e. Compute the coefficient of determination and coefficient of correlation and interpret the
value.
7. A study by the department of transportation on the effect of bus ticket price upon the
number of passengers produced the following results
Ticket price (Rs.): 25 30 35 40 45 50 55 60
Passengers per 100 miles: 800 780 780 660 640 600 620 620
a. Develop the estimating equation that best describes these data.
b. Interpret the regression coefficient (slope of the regression line)
c. Predict the number of passengers per 100 miles if the ticket price were Rs. 50. And also
obtain the 95% approximate prediction intervals for ticket price Rs 50.
8. Campus stores have been selling the Believe It or Not: Wonders of Statistics Study Guide
for 10 semesters and would like to estimate the relationship between sales and the number of
sections of elementary statistics taught in each semester. The following data have been
collected
Sales (units): 33 38 24 61 52 45 65 82 29 63
Number of sections: 3 7 6 6 10 12 12 13 12 13
a. Develop the estimating equation that best fits the data.
b. Calculate the sample coefficient of determination and the sample coefficient of
correlation. Test whether this linear relationship is significant at the 5% level of significance.
9. A statistician for an American automobile manufacturer would like to develop a model for
predicting delivery time (the days between the ordering of the car and the actual delivery
of the car) of custom-ordered new automobiles. A random sample of 15 cars is selected,
with the results summarized in the following table
Car: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
No. of options ordered (x): 3 4 4 7 7 8 9 11 12 12 14 16 20 23 25
Delivery time in days (y): 25 32 26 38 34 41 39 46 44 51 58 53 64 66 70
a. Given a correlation coefficient r = 0.9726 between the number of options ordered and the
delivery time in days, examine whether this linear relationship is significant at the 5% level of
significance.
b. Given the linear regression line Ŷ = 22.2123 + 2.0218X, compute the residual for car 6.
c. Next, given that ∑(Y − Ŷ)² = 153.421, compute the standard error of the estimate Se (Sy.x)
and interpret its meaning.
d. Given that ∑(X − X̄)² = 657.33, test the regression coefficient at the 5% level of
significance.
e. Compute the 95% prediction interval of the delivery time for a car with 14 options
ordered.
f. Compute the 95% confidence interval estimate of the slope (regression coefficient).
10. Fitting a straight line to a set of data yields the regression equation: Ŷ = 2 + 5X
a. Interpret the meaning of the y-intercept b0 and the slope of the regression line b1.

b. Predict the average value of Y for X = 3.


11.
a. What does it mean if the coefficient of determination r2 is equal to 0.80?
b. If SSR = 36 and SSE = 4 find SST, and then compute the coefficient of determination
r2 and interpret its meaning.
c. If SSR = 66 and SST = 88, compute the coefficient of determination r 2 and interpret
its meaning.
12. Suppose that you are testing the null hypothesis that there is no relationship between two
variables X and Y. From your sample of n = 18, you determine that b1 = 4.5 and Sb1 = 1.5
a. What is the value of the test statistic?
b. At the α = 0.05 level of significance, test the regression coefficient.
13. Suppose that you are testing the null hypothesis that there is no relationship between two
variables X and Y. From your sample of n = 20, you determine that SSR = 60 and SSE = 40.
a. Calculate the coefficient of correlation and test the sample correlation coefficient.
14. Based on a sample of 20 observations, the least squares method was used to obtain the
following linear regression equation: Ŷ = 5 + 3X. In addition, Syx = 1.0, X̄ = 2 and
∑(X − X̄)² = 20
a. Set up a 95% confidence interval estimate of the population average response for
X = 2.
b. Set up a 95% prediction interval of the individual response for X = 2.

15. In a regression problem with a sample size of 17, the slope was found to be 3.71 and the
standard error of estimate 28.654. The quantity ∑X² − nX̄² = 871.56, where X is the
independent variable.
a) Find the standard error of the regression coefficient (slope).
b) Construct a 95% confidence interval for the population slope and
interpret it.
16. The city council of Pokhara has gathered data on number of minor traffic accidents and
the number of youth football games that occurred in town over the weekend.
X (football games): 20 30 10 12 15 25 34
Y (minor accident): 6 9 4 5 7 8 9
a. Develop the estimating linear equation to predict minor accident from football games.
b. Predict the number of minor traffic accidents that will occur at weekends during which 30
soccer games will take place in Pokhara.
c. Calculate the value of the coefficient of determination.

17. A consultant is interested in seeing how accurately a new job performance index
measures what is important for a corporation. One way to check is to look at the
relationship between the job evaluation index and an employee’s salary. A sample of eight
employees was taken and information about salary (in thousands of Rs.) and job
performance index (1-10; 10 is best) was collected.
Job performance index: 9 7 8 4 7 5 5 6
Salary: 36 25 33 15 28 19 20 22
a) Calculate the standard error of estimate, Se, for these data.
b) Calculate the sample coefficient of determination and sample coefficient of correlation
for these data.
18. The managers of a brokerage firm are interested in finding out if the number of new
clients a broker brings into the firm affects the sales generated by the broker. They
sample 10 brokers and determine the number of new clients they enrolled in the last
year and their sales amounts in thousands of dollars. These data are presented in the
table that follows.
Broker   Clients (X)   Sales (Y)
1        27            52
2        11            37
3        40            64
4        33            55
5        15            29
6        15            34
7        25            58
8        36            59
9        28            44
10       30            48
Calculation shows that:
n = 10, ∑X = 260, ∑Y = 480, ∑X² = 7594, ∑Y² = 24276, ∑XY = 13377
SSX = ∑(X − X̄)² = 834, SST = ∑(Y − Ȳ)² = 1236, SSE = ∑(Y − Ŷ)² = 271.241
a) Assuming a linear relationship, what is the least squares prediction for the amount of sales
(in $1,000) for a person who brings 25 new clients into the firm?
b) Calculate the standard error of estimate and interpret the result.
c) Suppose the managers of the brokerage firm want to obtain a 99% prediction interval for
the sales made by a broker who has brought 18 new clients into the firm. What would be
the prediction interval for this problem?
19. Coca-Cola is studying the effect of its latest advertising. People chosen at random
were called and asked how many cans of Coca-Cola they had bought in the past week
and how many Coca-Cola advertisements they had either read or seen in the past
week. The data collected from different people are as follows
People: 1 2 3 4 5 6 7 8 9 10 11 12
Number of ads (x): 3 7 6 6 10 12 12 13 12 13 14 15
Cans purchased (y): 33 38 24 61 52 45 65 82 29 63 50 79
Calculation shows that
∑X = 123, ∑X² = 1421, ∑Y = 621, ∑Y² = 36059, ∑XY = 6833
Find the coefficient of correlation between the number of ads and cans purchased, and examine
whether this linear relationship is significant at the 5% level of significance.
a. Find the linear regression line. Calculate the standard error of the estimate, Syx, and
interpret its meaning.
b. Test the regression coefficient at the 1% level of significance.
c. Compute the 90% prediction interval of the cans purchased for person 7.

20. The marketing manager of a large supermarket chain would like to determine the effect of
shelf space on the sales of pet food. A random sample of 12 equal-sized stores is selected,
with the following results
Store: 1 2 3 4 5 6 7 8 9 10 11 12
Weekly sales, Y (hundreds of $): 1.6 2.2 1.4 1.9 2.4 2.6 2.3 2.7 2.8 2.6 2.9 3.1
Shelf space, X (feet): 5 5 5 10 10 10 15 15 15 20 20 20
Calculation shows that:
∑X = 150, ∑X² = 2250, ∑Y = 28.5, ∑Y² = 70.69, ∑XY = 384
a. Assuming a linear relationship, use the least squares method to find the best-fitting
regression equation and hence compute the residual for store 6.
b. What percentage of the total variation in sales is explained by shelf space?
c. Set up a 95% confidence interval estimate of the average weekly sales for all stores that
have 10 feet of shelf space for pet food.
