
Regression Analysis using Microsoft Excel
1. Before you start

Microsoft Excel has a built-in feature for performing a regression analysis. This
feature is part of the Analysis ToolPak. First, check that the option Data
Analysis is available under the Data menu. If it is not there, select
File/Options/Add-Ins and press the Go button at the bottom of the dialog box.
In the ensuing dialog box, select Analysis ToolPak and Analysis ToolPak - VBA.
Data Analysis should now appear as an option in the Data menu.

Copy the Excel workbook Heating Cost.xls, which can be found on the course Web
site, to your personal drive. The spreadsheet contains heating
cost data for 20 small houses in different geographical regions, together with
details on local average minimum external temperature, inches of insulation in
the house, the age of the central heating equipment and the number of windows.
The objective of the regression analysis in this tutorial is to discover whether
these variables explain the differences in the heating costs for the 20 houses,
and hence whether the approach would be useful for predicting heating costs for
other similar properties.

2. Summarising and describing data sets

You can compute summary statistics for data sets by using appropriate functions,
such as:
=AVERAGE(B4:B23), which yields the average heating cost for all the houses;
=MIN(B4:B23), which yields the minimum heating cost for all the houses;
=MAX(B4:B23), which yields the maximum heating cost for all the houses;
=STDEV(B4:B23), which yields the sample standard deviation of the heating costs;
etc.
      A        B               C                      D                      E      F
 1    HEATING COST DATA
 2
 3    House    Heating Cost    Minimum Temperature    Insulation (inches)    Age    Windows
 4    1        250             35                     3                      6      10
 5    2        360             29                     4                      10     1
 6    3        165             36                     7                      3      9
 7    4        43              60                     6                      9      8
 8    5        92              65                     5                      6      8
 9    6        200             30                     5                      5      9
10    7        355             10                     6                      7      14
11    8        290             7                      10                     10     9
12    9        230             21                     9                      11     11
13    10       120             55                     2                      5      9
14    11       73              54                     12                     4      11
15    12       205             48                     5                      1      10
16    13       400             20                     5                      15     12
17    14       320             39                     4                      7      10
18    15       72              60                     8                      6      8
19    16       272             20                     5                      8      10
20    17       94              58                     7                      3      10
21    18       190             40                     8                      11     11
22    19       235             27                     9                      8      14
23    20       139             30                     7                      5      9

If you enter, for instance, =AVERAGE(B4:B23) in cell B24, the average heating cost
will be displayed in that cell. If you select the cell and drag the fill handle
across the adjacent cells, the formula is copied automatically and displays the
average temperature, insulation, age and number of windows.
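For readers who want to cross-check the worksheet functions outside Excel, the following is a minimal sketch in Python using only the standard library. The variable names are illustrative and not part of the workbook; the values are the heating costs from column B of the table above.

```python
import statistics

# Heating cost for the 20 houses (cells B4 to B23 of the worksheet).
heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]

summary = {
    "average": statistics.mean(heating_cost),  # counterpart of AVERAGE
    "minimum": min(heating_cost),              # counterpart of MIN
    "maximum": max(heating_cost),              # counterpart of MAX
    "stdev": statistics.stdev(heating_cost),   # sample standard deviation, like STDEV
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The same call pattern applies to the other columns (temperature, insulation, age and windows).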

Alternatively, you can use Data/Data Analysis/Descriptive Statistics to get a
variety of summary statistics automatically. In the Descriptive Statistics dialog
box, specify:
Input Range: B3:F23 (by selecting the region with the mouse);
Select Labels in First Row;
Select New Worksheet Ply with the name Descriptive Statistics;
Select Summary Statistics.

The resulting spreadsheet is shown below (you may need to reformat the cells
and column widths).

                      Heating Cost    Minimum Temperature    Insulation (inches)    Age      Windows
Mean                  205.25          37.2                   6.35                   7        9.65
Standard Error        23.67           3.89                   0.55                   0.75     0.60
Median                202.5           35.5                   6                      6.5      10
Mode                  #N/A            60                     5                      6        10
Standard Deviation    105.86          17.41                  2.48                   3.34     2.66
Sample Variance       11206.09        303.12                 6.13                   11.16    7.08
Kurtosis              -0.97           -1.05                  0.05                   0.36     5.65
Skewness              0.21            0.02                   0.45                   0.48     -1.48
Range                 357             58                     10                     14       13
Minimum               43              7                      2                      1        1
Maximum               400             65                     12                     15       14
Sum                   4105            744                    127                    140      193
Count                 20              20                     20                     20       20

3. Correlation Analysis

You can compute correlation statistics for a data set by using the following
function: =CORREL(B4:B23,C4:C23), which yields the correlation between the
heating cost and the minimum outside temperature (some regional settings use a
semicolon instead of the comma). A correlation coefficient
indicates the level of linear association between a pair of variables. In this case,
the correlation between the heating cost and the minimum outside temperature
is -0.81, implying a rather strong negative correlation in the sense that if the
outside temperature is low, the heating cost is high and vice versa.
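As a cross-check of the correlation coefficient, here is a minimal sketch in Python that computes the Pearson correlation from its definition. The function name pearson and the variable names are illustrative; the data are columns B and C of the worksheet.

```python
import math

heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
min_temperature = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
                   54, 48, 20, 39, 60, 20, 58, 40, 27, 30]


def pearson(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson(heating_cost, min_temperature)
print(f"correlation: {r:.2f}")
```

The negative sign is what carries the interpretation: higher minimum temperatures go together with lower heating costs.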

Again, you can use an automated tool by selecting Data\Data Analysis\Correlation.
In the Correlation dialog box, specify:
Input Range: B3:F23;
Grouped By: Columns, so that Excel knows that each column represents a
variable;
Select Labels in First Row;
Select New Worksheet Ply with the name Correlation Analysis.

A new spreadsheet with the following correlation matrix should appear.

                       Heating Cost    Minimum Temperature    Insulation (inches)    Age     Windows
Heating Cost           1.00
Minimum Temperature    -0.81           1.00
Insulation (inches)    -0.26           -0.10                  1.00
Age                    0.54            -0.49                  0.06                   1.00
Windows                0.10            -0.26                  0.31                   0.03    1.00

Notice the high correlation between heating cost and the minimum temperature
(negative) and the age of the heating installation (positive). Also notice that
some of the explanatory variables are correlated with each other, for instance
minimum temperature and age (-0.49).

3. Scatter Plots

Scatter plots are of great help in identifying the strength, nature and direction of
relationships between variables. In particular, they can highlight non-linear
relationships, which will not necessarily be apparent from the correlation values.
Since the observed correlation, -0.81, between the heating cost and the
minimum outside temperature suggests a strong (linear) relationship, let us
examine their scatter plot:
Select the data range B3:C23 (using the mouse);
Select Insert\Scatter Plot and then the first available type;
Specify the Chart title as Cost & Temperature, Value (X) Axis as
Temperature, and Value (Y) Axis as Cost;

[Figure: scatter plot titled "Cost & Temperature", with axes labelled Temperature and Cost]

The scatter plot confirms the rather strong, linear relationship between heating
cost and temperature, with heating cost declining as the temperature increases.
Similar scatter plots can be examined for other pairs of variables.

4. Simple Linear Regression Analysis

A regression analysis estimates the linear equation that best fits a set of data,
in the sense that it minimises the sum of squared residuals. Let us perform a
regression analysis of heating cost as a function of the temperature, i.e.

heating cost = a + b(temperature) + e

where a is the intercept, b is the slope and e is the error term.

Select Data\Data Analysis\Regression;
Specify Input Y Range as B3:B23; this is the dependent variable;
Specify Input X Range as C3:C23; this is the explanatory variable;
Select Labels;
Select New Worksheet Ply, with the name Regression 1;
Under the heading Residuals, select all four options (Residuals, Standardized
Residuals, Residual Plots and Line Fit Plots).
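To make explicit what the Regression tool estimates for a single explanatory variable, here is a minimal sketch in Python of the textbook least-squares formulas. The helper fit_line is a hypothetical name introduced for illustration, not something Excel exposes; the data are columns B and C of the worksheet.

```python
heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
min_temperature = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
                   54, 48, 20, 39, 60, 20, 58, 40, 27, 30]


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through the means
    return a, b


intercept, slope = fit_line(min_temperature, heating_cost)
print(f"heating cost = {intercept:.2f} + ({slope:.2f})(temperature)")
```

The printed coefficients can be compared with the Coefficients column of the Regression tool output.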

The results consist of different sections:
Summary Output, containing summary statistics for the regression as a whole,
of which Adjusted R Square (R2) and Standard Error (the standard deviation of
the residuals) are the most important;
ANOVA (Analysis of Variance), which can be ignored for the purposes of this
tutorial;
a table with the actual regression model (the estimated coefficients);
Residual Output, containing the predicted values for each of the observations
in the data set, and the prediction errors (residuals);
a Residual Plot;
a Line Fit Plot.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.81
R Square             0.66
Adjusted R Square    0.64
Standard Error       63.55
Observations         20

ANOVA
              df    SS        MS        F        Significance F
Regression    1     140215    140215    34.72    0.00
Residual      18    72701     4039
Total         19    212916

                       Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept              388.80          34.24             11.35     0.00       316.86       460.74
Minimum Temperature    -4.93           0.84              -5.89     0.00       -6.69        -3.17

RESIDUAL OUTPUT

Observation Predicted Heating Cost Residuals Standard Residuals


1 216.11 33.89 0.55
2 245.71 114.29 1.85
3 211.17 -46.17 -0.75
4 92.75 -49.75 -0.80
5 68.08 23.92 0.39
6 240.78 -40.78 -0.66
7 339.46 15.54 0.25
8 354.26 -64.26 -1.04
9 285.18 -55.18 -0.89
10 117.42 2.58 0.04
11 122.36 -49.36 -0.80
12 151.96 53.04 0.86
13 290.12 109.88 1.78
14 196.37 123.63 2.00
15 92.75 -20.75 -0.34
16 290.12 -18.12 -0.29
17 102.62 -8.62 -0.14
18 191.43 -1.43 -0.02
19 255.58 -20.58 -0.33
20 240.78 -101.78 -1.65

The suggested regression equation is:

heating cost = 388.80 - 4.93(temperature) + e

The slope, -4.93, has a t-value of -5.89 (in absolute terms bigger than 2) and a
very small p-value (smaller than our significance level of 5%). The coefficient
of the temperature variable is therefore significantly different from zero,
which can also be seen from the 95% confidence interval [-6.69; -3.17], which
does not include zero. We may conclude that temperature has a significant effect
on heating cost.
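The t-value and the confidence interval for the slope can be reproduced from the data. The sketch below is one way to do so in plain Python, under one stated assumption: the critical value 2.101 (two-sided 5% point of Student's t with 18 degrees of freedom) is taken from standard tables and hard-coded rather than computed. All variable names are illustrative.

```python
import math

heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
min_temperature = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
                   54, 48, 20, 39, 60, 20, 58, 40, 27, 30]

n = len(heating_cost)
mean_x = sum(min_temperature) / n
mean_y = sum(heating_cost) / n
sxx = sum((x - mean_x) ** 2 for x in min_temperature)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(min_temperature, heating_cost))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Standard error of the regression: residual spread with n - 2 degrees of freedom.
residuals = [y - (intercept + slope * x)
             for x, y in zip(min_temperature, heating_cost)]
std_error = math.sqrt(sum(r * r for r in residuals) / (n - 2))

# Standard error of the slope and the resulting t-value.
se_slope = std_error / math.sqrt(sxx)
t_stat = slope / se_slope

T_CRIT = 2.101  # assumption: tabulated t critical value, 18 df, 5% two-sided
ci_low = slope - T_CRIT * se_slope
ci_high = slope + T_CRIT * se_slope

print(f"t = {t_stat:.2f}, 95% CI = [{ci_low:.2f}; {ci_high:.2f}]")
```

Because the interval lies entirely below zero, it leads to the same conclusion as the t-value and the p-value.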

The regression model is able to explain 64% of the variability in heating cost in
terms of differences in outside temperature (Adjusted R2). The standard
error of the regression is 63.55, implying that if we want to make a prediction with
approximately 95% confidence, we should subtract and add 127.10 (2*63.55) to the
prediction to obtain an interval. For instance, for an outside temperature of 50,
the predicted heating cost is 388.80 - 4.93(50) = 142.30, so we expect the heating
cost to be in the region of [142.30 - 127.10; 142.30 + 127.10] = [15.20; 269.40].
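The arithmetic behind this rough interval can be written as a small Python helper. This is a sketch of the tutorial's rule of thumb (prediction plus or minus two standard errors), using the rounded coefficients reported in the regression output; the function name predict_with_band is hypothetical, and the rule ignores the uncertainty in the estimated coefficients themselves.

```python
INTERCEPT = 388.80       # reported intercept
SLOPE = -4.93            # reported coefficient of minimum temperature
STANDARD_ERROR = 63.55   # reported standard error of the regression


def predict_with_band(temperature, width=2.0):
    """Point prediction and a rough band of +/- width standard errors."""
    centre = INTERCEPT + SLOPE * temperature
    margin = width * STANDARD_ERROR
    return centre, centre - margin, centre + margin


centre, low, high = predict_with_band(50)
print(f"prediction {centre:.2f}, interval [{low:.2f}; {high:.2f}]")
```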

The Regression tool also displays several charts (you may have to move them to
make them visible):
The Line Fit Plot (see below) shows actual costs and predicted costs, plotted
for different values of temperature. This plot is identical to the scatter plot of
cost and temperature we constructed earlier, with the predicted points
superimposed. The regression line is shown as points rather than as a line.
This can be changed by double-clicking the estimated points, and selecting
Patterns\Line\Automatic.
The Residual Plot shows the forecast errors versus temperature. If this plot
exhibits an obvious pattern, it would suggest that the model is ill-specified.
Ideally, the residuals should be random. Residual plots are also useful for
spotting outliers, i.e. data points much further from the regression line than
others.

[Figure: Minimum Temperature Line Fit Plot, showing Heating Cost against Minimum Temperature]

[Figure: Minimum Temperature Residual Plot, showing Residuals against Minimum Temperature]

5. Multiple Linear Regression Analysis

By adding extra explanatory variables to the regression model, we may be able
to improve our predictions of heating cost. However, including extra explanatory
variables may also cause problems such as multicollinearity. We therefore have
to find the best possible regression model for the purpose of predicting heating
costs using one or more explanatory variables.

Let us perform a regression analysis of heating cost as a function of all the
available explanatory variables, i.e.

heating cost = a + b1(temperature) + b2(insulation) + b3(age) + b4(windows) + e

where a is the intercept, b1 to b4 are the coefficients of the explanatory
variables and e is the error term.

Select Data\Data Analysis\Regression;
Specify Input Y Range as B3:B23; this is the dependent variable;
Specify Input X Range as C3:F23; these are the explanatory variables (they
should be in adjacent columns, so you may need to move columns);
Select Labels;
Select New Worksheet Ply, with the name Regression 2;
Under the heading Residuals, select all four options (Residuals, Standardized
Residuals, Residual Plots and Line Fit Plots).

The suggested regression equation is (see regression results on next page):

heating cost = 424.74 - 4.57(temperature) - 14.91(insulation)
               + 6.13(age) + 0.24(windows) + e

The Adjusted R2 has increased from 64% to 75%, indicating that we are now able
to explain more of the variability in heating cost. The standard error of the
regression has also decreased, from 63.55 to 52.72, so the approximate 95% margin
(twice the standard error) shrinks from 127.10 to 105.44, enabling more
accurate predictions. However, although the coefficients of temperature and
insulation are significantly different from zero, the coefficients of the age of
the installation and the number of windows are not (their p-values are 0.16 and
0.96). These two variables should therefore be excluded from the model and the
model re-estimated. The residual plots and line fit plots also need to be examined.
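To illustrate the comparison between the full model and the reduced model, the sketch below fits both by solving the least-squares normal equations in plain Python and reports the adjusted R2 of each. It is an illustration of the general technique, not a description of how Excel computes its output; the helper names solve and ols are hypothetical, and significance tests are left to the Regression tool.

```python
cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
        73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
temperature = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
               54, 48, 20, 39, 60, 20, 58, 40, 27, 30]
insulation = [3, 4, 7, 6, 5, 5, 6, 10, 9, 2, 12, 5, 5, 4, 8, 5, 7, 8, 9, 7]
age = [6, 10, 3, 9, 6, 5, 7, 10, 11, 5, 4, 1, 15, 7, 6, 8, 3, 11, 8, 5]
windows = [10, 1, 9, 8, 8, 9, 14, 9, 11, 9, 11, 10, 12, 10, 8, 10, 10, 11, 14, 9]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """Return ([intercept, b1, ...], adjusted R2) for a least-squares fit."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])  # number of coefficients, intercept included
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    adj_r2 = 1 - (ss_res / (n - k)) / (ss_tot / (n - 1))
    return beta, adj_r2


full_beta, full_adj = ols([temperature, insulation, age, windows], cost)
reduced_beta, reduced_adj = ols([temperature, insulation], cost)
print("full model:   ", [round(b, 2) for b in full_beta], round(full_adj, 2))
print("reduced model:", [round(b, 2) for b in reduced_beta], round(reduced_adj, 2))
```

Adjusted R2 penalises extra variables, which is why dropping age and windows need not lower it noticeably even though plain R2 can only go down.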

The modified regression equation is (see regression results):

heating cost = 490.29 - 5.15(temperature) - 14.72(insulation) + e

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.90
R Square             0.80
Adjusted R Square    0.75
Standard Error       52.72
Observations         20

ANOVA
              df    SS        MS       F        Significance F
Regression    4     171227    42807    15.40    0.00
Residual      15    41689     2779
Total         19    212916

                       Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept              424.74          79.23             5.36      0.00       255.86       593.61
Minimum Temperature    -4.57           0.83              -5.53     0.00       -6.33        -2.81
Insulation (inches)    -14.91          5.14              -2.90     0.01       -25.86       -3.95
Age                    6.13            4.17              1.47      0.16       -2.77        15.02
Windows                0.24            4.95              0.05      0.96       -10.31       10.80

RESIDUAL OUTPUT

Observation Predicted Heating Cost Residuals Standard Residuals


1 259.20 -9.20 -0.20
2 294.04 65.96 1.41
3 176.38 -11.38 -0.24
4 118.08 -75.08 -1.60
5 91.75 0.25 0.01
6 245.88 -45.88 -0.98
7 335.88 19.12 0.41
8 307.13 -17.13 -0.37
9 264.65 -34.65 -0.74
10 176.30 -56.30 -1.20
11 26.18 46.82 1.00
12 139.32 65.68 1.40
13 353.59 46.41 0.99
14 232.13 87.87 1.88
15 69.89 2.11 0.05
16 310.22 -38.22 -0.82
17 76.05 17.95 0.38
18 192.69 -2.69 -0.06
19 219.57 15.43 0.33
20 216.07 -77.07 -1.65

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.88
R Square             0.78
Adjusted R Square    0.75
Standard Error       52.98
Observations         20

ANOVA
              df    SS        MS       F        Significance F
Regression    2     165195    82597    29.42    0.00
Residual      17    47721     2807
Total         19    212916

                       Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept              490.29          44.41             11.04     0.00       396.59       583.98
Minimum Temperature    -5.15           0.70              -7.34     0.00       -6.63        -3.67
Insulation (inches)    -14.72          4.93              -2.98     0.01       -25.13       -4.31

RESIDUAL OUTPUT

Observation Predicted Heating Cost Residuals Standard Residuals


1 265.89 -15.89 -0.32
2 282.07 77.93 1.56
3 201.86 -36.86 -0.74
4 92.98 -49.98 -1.00
5 81.95 10.05 0.20
6 262.20 -62.20 -1.24
7 350.48 4.52 0.09
8 307.06 -17.06 -0.34
9 249.68 -19.68 -0.39
10 177.61 -57.61 -1.15
11 35.57 37.43 0.75
12 169.50 35.50 0.71
13 313.70 86.30 1.72
14 230.57 89.43 1.78
15 63.55 8.45 0.17
16 313.70 -41.70 -0.83
17 88.57 5.43 0.11
18 166.55 23.45 0.47
19 218.78 16.22 0.32
20 232.76 -93.76 -1.87

6. Non-Linear Regression Analysis

In some cases a linear model is not suitable for modelling the relationship
between two variables. Let us have a look at another example: General
Public Electric (GPE). GPE operates 11 thermal power stations of basically
the same design. We will investigate the cost efficiency (pence per
kilowatt-hour) of the electricity generating plants as a
function of their generating capacity (megawatts installed). The object of
the exercise is to model the economy of scale effect which allows larger
plants to generate electricity at lower cost per unit. In practice,
this analysis might be part of a larger exercise in which economy of scale
would be one of a variety of factors taken into account in
deciding between alternative development plans. A more accurate
understanding of the relative efficiency of plants of different sizes would make
it easier to balance this factor against capital investment costs,
environmental factors, construction time, etc. The data can be found in the
Excel workbook GPE.xls, which is available on the course Web site.

The spreadsheet, scatter plot and regression results are shown below.

      A        B           C
 1    General Public Electric
 2
 3    Plant    Capacity    Cost per Unit
 4    1        525         1.80
 5    2        555         1.70
 6    3        600         1.60
 7    4        610         1.58
 8    5        700         1.35
 9    6        990         1.20
10    7        1100        1.13
11    8        1450        0.95
12    9        1950        0.85
13    10       1950        0.84
14    11       2400        0.75

[Figure: scatter plot of Cost per Unit against Capacity]

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.94
R Square             0.88
Adjusted R Square    0.86
Standard Error       0.14
Observations         11

ANOVA
              df    SS      MS      F        Significance F
Regression    1     1.26    1.26    64.05    0.00
Residual      9     0.18    0.02
Total         10    1.43

             Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept    1.87            0.09              21.26     0.00       1.67         2.06
Capacity     -0.00053        0.00007           -8.00     0.00       -0.00068     -0.00038

RESIDUAL OUTPUT

Observation Predicted Cost per Unit Residuals Standard Residuals


1 1.59 0.21 1.59
2 1.57 0.13 0.96
3 1.55 0.05 0.38
4 1.54 0.04 0.27
5 1.50 -0.15 -1.10
6 1.34 -0.14 -1.08
7 1.29 -0.16 -1.17
8 1.10 -0.15 -1.13
9 0.84 0.01 0.10
10 0.84 0.00 0.03
11 0.60 0.15 1.14

The model, namely

Cost = 1.87 - 0.00053(Capacity) + e

seems reasonable because of the large t-statistic of the Capacity variable (-8.00)
and the high Adjusted R2 (86%), implying that there is a significant
relationship between capacity and cost, and that we are able to explain a
lot of the variability in cost purely by examining capacity. The
result is also logical in the sense that we indeed observe an economy of scale
effect: cost decreases as capacity increases. For every additional megawatt of
generating capacity, the estimated unit cost decreases by 0.00053 pence per
kilowatt-hour.

However, if we examine the line fit plot and the residual plot (see below), we
observe the following:
the line does not fit the data perfectly: it slightly underestimates the cost for
small capacity values, overestimates it for medium capacity values and again
underestimates it for high capacity values;
the residual plot clearly exhibits a pattern: the errors are positive for small
capacity values, negative for medium capacity values and again positive for
high capacity values.
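The pattern in the residuals can also be seen numerically. The following minimal Python sketch refits the straight line to the GPE data and prints the sign of each residual in order of increasing capacity; the variable names are illustrative.

```python
capacity = [525, 555, 600, 610, 700, 990, 1100, 1450, 1950, 1950, 2400]
cost_per_unit = [1.80, 1.70, 1.60, 1.58, 1.35, 1.20, 1.13, 0.95, 0.85, 0.84, 0.75]

n = len(capacity)
mean_x = sum(capacity) / n
mean_y = sum(cost_per_unit) / n
sxx = sum((x - mean_x) ** 2 for x in capacity)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(capacity, cost_per_unit))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = actual cost minus the cost predicted by the straight line.
residuals = [y - (intercept + slope * x) for x, y in zip(capacity, cost_per_unit)]

for x, r in zip(capacity, residuals):
    sign = "+" if r > 0 else "-"
    print(f"capacity {x:5d}: residual {r:+.2f} ({sign})")
```

A run of positive signs, then negative signs, then positive signs again is the numerical counterpart of the curved pattern in the residual plot.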

[Figure: Capacity Line Fit Plot, showing Cost per Unit against Capacity]

[Figure: Capacity Residual Plot, showing Residuals against Capacity]

This indicates that the model is ill-specified, and more specifically that we have
been trying to fit a line to data which exhibits a non-linear relationship. We
therefore should look for a more suitable specification of the model.

Let us try to regress cost on the reciprocal of the plant capacity, i.e.

Cost = a + b(1/Capacity) + e

We therefore have to transform the capacity data. The proposed relationship
(Cost as a function of 1/Capacity) resembles the shape of the curve we observe
in the scatter plot. Sometimes, however, several candidate transformations
exist.

In order to transform the capacity data in our model, we add another column. In
cell D3, we enter the title 1/Capacity. In cell D4, we enter the formula =1/B4.
The value 0.0019 should appear (= 1/525). We select cell D4 and drag the
handle down to fill the entire column with the transformed capacity data. Now,
we can run another regression analysis, using the new column as the
explanatory variable.
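The same transformation and regression can be sketched in Python as a cross-check. The list inv_capacity plays the role of the new 1/Capacity column; all names are illustrative, and the fit uses the textbook least-squares formulas rather than Excel's own routine.

```python
capacity = [525, 555, 600, 610, 700, 990, 1100, 1450, 1950, 1950, 2400]
cost_per_unit = [1.80, 1.70, 1.60, 1.58, 1.35, 1.20, 1.13, 0.95, 0.85, 0.84, 0.75]

# Counterpart of the formula =1/B4 filled down the new column.
inv_capacity = [1 / c for c in capacity]

n = len(capacity)
mean_x = sum(inv_capacity) / n
mean_y = sum(cost_per_unit) / n
sxx = sum((x - mean_x) ** 2 for x in inv_capacity)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(inv_capacity, cost_per_unit))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R2: share of the variability in cost explained by 1/Capacity.
ss_res = sum((y - (intercept + slope * x)) ** 2
             for x, y in zip(inv_capacity, cost_per_unit))
ss_tot = sum((y - mean_y) ** 2 for y in cost_per_unit)
r_squared = 1 - ss_res / ss_tot

print(f"Cost = {intercept:.2f} + {slope:.2f}(1/Capacity), R2 = {r_squared:.2f}")
```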

We obtain the following results (see results on following pages):
R2 increased to 99%, indicating a near-perfect fit;
standard error reduced to 0.04;
residual plot reveals no obvious pattern;
line fit plot indicates near-perfect fit of line and data.

Below, you will also find a plot of the estimated costs as a function of Capacity,
revealing the non-linear nature of the estimated relationship. In order to draw
such a graph, add another column in your spreadsheet with the cost predictions,
computed using the regression coefficients and the explanatory variable data.
Then draw a scatter plot of the predictions as well as the actual cost per unit
data versus capacity. Again, the predictions will be displayed as points rather
than a line, but this can be changed by double-clicking the points and selecting a
different format.
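The prediction column can be sketched as follows in Python, using the rounded coefficients reported in the regression output (intercept 0.50, coefficient 664.08). The function name predict_cost is hypothetical; in the worksheet the same calculation would be a formula filled down a new column.

```python
INTERCEPT = 0.50   # reported intercept of the 1/Capacity model
SLOPE = 664.08     # reported coefficient of 1/Capacity

capacity = [525, 555, 600, 610, 700, 990, 1100, 1450, 1950, 1950, 2400]
cost_per_unit = [1.80, 1.70, 1.60, 1.58, 1.35, 1.20, 1.13, 0.95, 0.85, 0.84, 0.75]


def predict_cost(cap):
    """Predicted cost per unit for a plant of the given capacity."""
    return INTERCEPT + SLOPE / cap


predictions = [predict_cost(c) for c in capacity]

for c, actual, predicted in zip(capacity, cost_per_unit, predictions):
    print(f"{c:5d}  actual {actual:.2f}  predicted {predicted:.2f}")
```

Plotting both series against capacity shows a curve that flattens out as capacity grows, which a straight line in Capacity cannot reproduce.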

SUMMARY OUTPUT

Regression Statistics
Multiple R           1.00
R Square             0.99
Adjusted R Square    0.99
Standard Error       0.04
Observations         11

ANOVA
              df    SS             MS             F              Significance F
Regression    1     1.418077648    1.418077648    957.9913963    1.87988E-10
Residual      9     0.013322352    0.001480261
Total         10    1.4314

              Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept     0.50            0.03              18.37     0.00       0.43         0.56
1/Capacity    664.08          21.46             30.95     0.00       615.55       712.62

RESIDUAL OUTPUT

Observation Predicted Cost per Unit Residuals Standard Residuals


1 1.76 0.04 1.08
2 1.69 0.01 0.21
3 1.60 0.00 -0.07
4 1.58 0.00 -0.12
5 1.44 -0.09 -2.59
6 1.17 0.03 0.91
7 1.10 0.03 0.83
8 0.95 0.00 -0.10
9 0.84 0.01 0.37
10 0.84 0.00 0.10
11 0.77 -0.02 -0.62

[Figure: 1/Capacity Line Fit Plot, showing Cost per Unit against 1/Capacity]

[Figure: 1/Capacity Residual Plot, showing Residuals against 1/Capacity]

[Figure: worksheet with the columns Plant, Capacity, Cost per Unit, 1/Capacity and Prediction, next to a chart of cost against capacity]
