
Guide for Windows Excel 2003
Regression Modelling with Analysis ToolPak

James W. Taylor
The purpose of this guide is to explore linear regression using Excel. This note consists of the following sections:
1. Summarising and describing a multi-variable data set
2. Correlation analysis
3. Scatter plots
4. Simple regression
5. Multiple regression

We must first attach Excel's statistical add-in options. From the Tools menu, select Add-Ins. In the Add-Ins dialog box, select both Analysis ToolPak and Analysis ToolPak - VBA.

1. SUMMARISING & DESCRIBING A MULTI-VARIABLE DATA SET

The Excel file ElectricityConsumption.xls contains monthly observations from January 2004 to July 2012 for the following variables:

ELEC   Residential electricity sales (KWh) per customer in a mid-Atlantic U.S. city
C66    Cooling degree hours at base temperature 66 degrees (a measure of summer heat) [1]
C76    Cooling degree hours at base temperature 76 degrees (a measure of summer heat)
H55    Heating degree hours at base temperature 55 degrees (a measure of winter cold) [2]
DINC   Disposable income per household ($)
AIRC   Proportion of households with air conditioning

The ultimate aim is to build a forecasting model for residential electricity consumption.
      A        B        C      D      E       F       G
 1    MONTH    ELEC     C66    C76    H55     DINC    AIRC
 2    Jan-04   681.7    20     0      10148   34825   0.698
 3    Feb-04   620.3    0      0      12504   34934   0.701
 4    Mar-04   590.8    20     0      9300    35050   0.705
 5    Apr-04   538.0    14     0      5333    35172   0.708
 6    May-04   513.4    559    3      2846    35302   0.712
 7    Jun-04   575.5    1601   83     282     35438   0.716
 8    Jul-04   1019.3   5348   833    1       35583   0.72
 9    Aug-04   1203.9   7416   1547   0       35734   0.724
10    Sep-04   1176.7   6887   1287   0       35892   0.728
11    Oct-04   723.0    2975   398    155     36056   0.731
12    Nov-04   519.0    427    5      1812    36222   0.735
13    Dec-04   604.9    9      0      5779    36391   0.739

Use the Analysis ToolPak Descriptive Statistics tool to get summary statistics (in one sequence of operations) for all 6 variables, by selecting Tools > Data Analysis > Descriptive Statistics.

In the Descriptive Statistics dialog box, specify:
- Input Range as the range containing values and variable names: B1:G104
- Click the Labels in First Row checkbox
- Output options as New Worksheet Ply with the name Summary
- Click the Summary Statistics checkbox
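As a rough cross-check on the Summary worksheet, the sketch below computes a few of the same statistics in Python for the twelve 2004 ELEC values listed above. It is only an illustration: the function name `describe` is hypothetical, and it assumes the tool reports the sample (n - 1) standard deviation. Because it uses 12 rows rather than all 103, the numbers will differ from those in the Summary worksheet.

```python
import statistics

# First 12 monthly ELEC values (Jan-04 to Dec-04) from the data listing.
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]


def describe(values):
    """Return a few of the statistics listed by a descriptive summary."""
    return {
        "Mean": statistics.mean(values),
        "Median": statistics.median(values),
        # Sample standard deviation (divides by n - 1); assumed to match Excel.
        "Standard Deviation": statistics.stdev(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": len(values),
    }


summary = describe(elec)
for name, value in summary.items():
    print(f"{name:20s}{value:12.3f}")
```

The same function can be applied to any other column once its values are typed in as a list.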

[1] The cooling degree hours at base temperature $T$ is:

$$\sum_{i \ge 1} i \, n_i$$

where $n_i$ is the number of hours in the month at temperature $T+i$.

[2] The heating degree hours at base temperature $T$ is:

$$\sum_{i \ge 1} i \, n_i$$

where $n_i$ is the number of hours in the month at temperature $T-i$.
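The two definitions above can be sketched in Python as follows. This is a minimal illustration, assuming hourly temperatures recorded in whole degrees; the function names and the sample readings are hypothetical and are not taken from the data set.

```python
from collections import Counter


def cooling_degree_hours(hourly_temps, base):
    """Sum of i * n_i, where n_i counts the hours at temperature base + i."""
    counts = Counter(t - base for t in hourly_temps if t > base)
    return sum(i * n_i for i, n_i in counts.items())


def heating_degree_hours(hourly_temps, base):
    """Sum of i * n_i, where n_i counts the hours at temperature base - i."""
    counts = Counter(base - t for t in hourly_temps if t < base)
    return sum(i * n_i for i, n_i in counts.items())


# Hypothetical hourly readings in whole degrees (illustration only).
temps = [64, 66, 70, 70, 78, 50, 55, 40]

print(cooling_degree_hours(temps, 66))  # hours at 70, 70 and 78 contribute 4 + 4 + 12
print(cooling_degree_hours(temps, 76))  # only the hour at 78 contributes
print(heating_degree_hours(temps, 55))  # hours at 50 and 40 contribute 5 + 15
```

Hours exactly at the base temperature contribute nothing, which is why a higher cooling base such as 76 gives a smaller total than a base of 66.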

2. CORRELATION ANALYSIS

Return to the Data worksheet.

1. From the main menu, choose Tools > Data Analysis... and in the Data Analysis dialog box, select Correlation and confirm with OK. The Correlation dialog box should appear.

2. In the Correlation dialog box, specify:
- Input Range: as B1:G104 (don't include the MONTH column)
- Grouped By: as Columns, so that Excel knows that each column is a variable
- The Labels in First Row checkbox should be ticked
- Output options: as New Worksheet Ply with the name Correlations
Click OK.

The correlation matrix below should result. Correlation coefficients for pairs of variables indicate the levels of linear association between them, e.g. ELEC and C76 have a correlation of 0.94, so ELEC tends to rise as C76 rises.
        ELEC    C66     C76     H55     DINC    AIRC
ELEC    1.00    0.92    0.94   -0.36    0.14    0.14
C66     0.92    1.00    0.95   -0.65    0.02    0.02
C76     0.94    0.95    1.00   -0.52    0.01    0.01
H55    -0.36   -0.65   -0.52    1.00   -0.04   -0.05
DINC    0.14    0.02    0.01   -0.04    1.00    0.94
AIRC    0.14    0.02    0.01   -0.05    0.94    1.00

You should get the same value for any pair of variables using the Excel function =CORREL.

Note any variables strongly correlated with ELEC, and any strong inter-correlations between the potential explanatory variables C66, C76, H55, DINC and AIRC.
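To make the calculation behind each entry of the matrix concrete, here is a short Python sketch of the Pearson correlation coefficient, applied to the twelve 2004 rows listed in section 1. The function name `pearson` is an illustrative choice. Since only 12 of the 103 observations are used, the values will not match the matrix above exactly.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Jan-04 to Dec-04 rows only (the full data set has 103 observations).
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
h55 = [10148, 12504, 9300, 5333, 2846, 282, 1, 0, 0, 155, 1812, 5779]

r_elec_c76 = pearson(elec, c76)
r_elec_h55 = pearson(elec, h55)
print(f"ELEC vs C76: {r_elec_c76:.2f}")
print(f"ELEC vs H55: {r_elec_h55:.2f}")
```

The coefficient is symmetric, so swapping the two arguments gives the same value, just as the matrix is symmetric about its diagonal.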

3. SCATTER PLOTS

Scatter plots are of great help in identifying the strength, nature and direction of relationships between pairs of variables. In particular, they can highlight non-linear relationships, which will not necessarily be apparent from the correlation values. Since the observed correlation, 0.94, between ELEC and C76 suggests a relationship, let's examine their scatter plot.

Return to the Data worksheet. Copy the ELEC column of data to column K. Copy C76 to column J. From the main menu, select Insert > Chart.
- In Step 1 of the Chart Wizard, select the chart type XY (Scatter) and click Next>.
- In Step 2, specify J1:K104 as the Data range.
- In Step 3, specify the Chart title as Electricity Consumption, Value (X) Axis as C76 and Value (Y) Axis as ELEC, then click Next>.
- In Step 4, specify that the chart should be placed As object in the Data worksheet, then click Finish.

The scatter plot confirms the reasonably strong linear relationship, with ELEC rising as C76 increases.

4. SIMPLE REGRESSION

Regression analysis produces the estimated linear equation that best fits a set of data. By "best fitting" we mean the line (or linear model) for which there is least residual scatter, i.e. the smallest sum of squared residuals.
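The idea of "least residual scatter" can be illustrated with a small Python sketch that fits a least-squares line and compares its sum of squared residuals with that of two slightly different lines. It uses only the twelve 2004 rows listed in section 1, so the coefficients are not those Excel reports for the full data set; the function names are hypothetical.

```python
# Jan-04 to Dec-04 rows only (the full data set has 103 observations).
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]


def fit_line(x, y):
    """Least-squares intercept and slope for the line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Residual scatter of a given line, measured as the sum of squared residuals."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


intercept, slope = fit_line(c76, elec)
best = sum_squared_residuals(c76, elec, intercept, slope)

# Any other line should leave more residual scatter than the fitted one.
steeper = sum_squared_residuals(c76, elec, intercept, slope * 1.1)
shifted = sum_squared_residuals(c76, elec, intercept + 25, slope)

print(f"Fitted line: ELEC = {intercept:.2f} + {slope:.3f}*C76")
print(f"SSR fitted {best:.0f}, steeper {steeper:.0f}, shifted {shifted:.0f}")
```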

1. Choose from the main menu: Tools > Data Analysis > Regression

2. Complete the Regression dialog box. Specify:
- Input Y Range as B1:B104 (ELEC as dependent variable)
- Input X Range as D1:D104 (C76 as independent variable)
- Check the Labels box, as the first entries in each cell range are labels
- Output options as New Worksheet Ply, with the name Regression1
- Under the heading Residuals, select Residuals, Residual Plots and Line Fit Plots
Then click OK.

4.1 REGRESSION ANALYSIS - INTERPRETING NUMERICAL OUTPUT

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.936601141
R Square             0.877221698
Adjusted R Square    0.876006071
Standard Error       84.01563552
Observations         103

ANOVA
              df     SS             MS             F              Significance F
Regression    1      5093652.918    5093652.918    721.6209201    8.45675E-48
Residual      101    712921.3281    7058.627011
Total         102    5806574.246

              Coefficients    Standard Error    t Stat         P-value        Lower 95%      Upper 95%
Intercept     632.1967321     9.685863338       65.2700446     2.09858E-84    612.9825852    651.410879
C76           0.538125757     0.020032227       26.86300281    8.45675E-48    0.498387209    0.577864305

The 1st part of the output contains summary statistics for the regression as a whole, including R Square and the residual standard deviation (called Standard Error). Ignore the 2nd part, which displays the ANOVA (Analysis of Variance) calculations. The 3rd part of the output indicates that the best fitting linear model has the equation:

ELEC = 632.20 + 0.538*C76

The slope, 0.538, has a t-stat of 26.86 and a very small p-value, so the variable C76 explains a statistically significant part of the variation in ELEC. The 4th part shows the predicted value and the residual for each observation.
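The figures in the SUMMARY OUTPUT are linked by a handful of standard formulas. The Python sketch below recomputes several of them from the sums of squares and the slope estimate printed above; all input values are copied from that output, and the variable names are illustrative only.

```python
import math

# Values copied from the SUMMARY OUTPUT in section 4.1.
n = 103
ss_regression = 5093652.918
ss_residual = 712921.3281
ss_total = 5806574.246
df_regression, df_residual = 1, 101
slope, se_slope = 0.538125757, 0.020032227

# Share of the variation in ELEC explained by the model.
r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)

# Adjusted R Square penalises the number of explanatory variables.
adj_r_square = 1 - (1 - r_square) * (n - 1) / df_residual

# Residual mean square and its square root (Excel's "Standard Error").
ms_residual = ss_residual / df_residual
standard_error = math.sqrt(ms_residual)

# F statistic for the regression and t statistic for the slope.
f_stat = (ss_regression / df_regression) / ms_residual
t_slope = slope / se_slope

print(f"R Square          {r_square:.6f}")
print(f"Multiple R        {multiple_r:.6f}")
print(f"Adjusted R Square {adj_r_square:.6f}")
print(f"Standard Error    {standard_error:.4f}")
print(f"F                 {f_stat:.3f}")
print(f"t Stat (C76)      {t_slope:.3f}")
```

With a single explanatory variable, the squared t statistic of the slope equals the F statistic, which is why the P-value for C76 and Significance F coincide in the output.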

4.2 REGRESSION - INTERPRETING EXCEL'S GRAPHICAL OUTPUT

The Regression tool puts one chart on top of another. Click on the top chart so that it becomes the active chart, then move it below.

1. The Line Fit Plot shows actual ELEC and predicted ELEC, plotted for different values of C76. This plot is the same as your scatter plot of ELEC & C76 (only with the axes flipped round) with points from the regression line superimposed. The regression line (called Predicted ELEC in the legend) is shown as points rather than as a line. This can be changed by formatting the data series.
[Figure: C76 Line Fit Plot. Horizontal axis: C76; vertical axis: ELEC; series: ELEC and Predicted ELEC.]

2. The Residual Plot shows the residuals plotted against the value of the C76 variable. Check that the residuals do not display an obvious pattern. Ideally, residuals should look as if they were random: showing no systematic pattern, of much the same average size, and not increasing in size as X (C76) increases. Residual plots are also useful for spotting outliers, i.e. data points much further from the regression line than others.
[Figure: C76 Residual Plot. Horizontal axis: C76; vertical axis: Residuals.]
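The residual check can also be sketched numerically. The Python example below applies the fitted equation from section 4.1 to the twelve 2004 rows listed in section 1 and flags any month whose residual is more than two standard errors from zero. The two-standard-error cut-off is a common rule of thumb assumed here for illustration; it is not part of Excel's output.

```python
# Coefficients and standard error copied from the SUMMARY OUTPUT in section 4.1.
INTERCEPT = 632.1967321
SLOPE = 0.538125757
STANDARD_ERROR = 84.01563552

# Jan-04 to Dec-04 rows only (the full data set has 103 observations).
months = ["Jan-04", "Feb-04", "Mar-04", "Apr-04", "May-04", "Jun-04",
          "Jul-04", "Aug-04", "Sep-04", "Oct-04", "Nov-04", "Dec-04"]
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]


def residuals(x, y):
    """Actual minus predicted ELEC for each observation."""
    return [actual - (INTERCEPT + SLOPE * value) for value, actual in zip(x, y)]


resid = residuals(c76, elec)

# Assumed rule of thumb: flag residuals larger than 2 standard errors in size.
flagged = [m for m, e in zip(months, resid) if abs(e) > 2 * STANDARD_ERROR]

for m, e in zip(months, resid):
    print(f"{m}  {e:8.1f}")
print("Possible outliers:", flagged)
```

For these twelve rows, only August 2004 exceeds the threshold. A flagged month is a prompt to look more closely at that observation, not proof that the data point is wrong.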

5. MULTIPLE REGRESSION

Can the ELEC predictions be improved if other possible explanatory variables are brought into the model? This section contains a brief description of the way Excel's regression can be extended from simple (ELEC on C76) to multiple regression (ELEC on two or more variables). The purpose is to find the best equation for predicting ELEC from one or more of the independent variables. Let's regress ELEC on the other five variables. Return to the Data worksheet.

1. Starting from a cell on the Data sheet, choose from the main menu: Tools > Data Analysis > Regression

2. In the Regression dialog box, specify:
- Input Y Range as B1:B104, i.e. ELEC as dependent variable
- Input X Range as C1:G104, i.e. five explanatory variables
- Check the Labels checkbox
- Output options: as New Worksheet Ply, with the name Regression2
- Under the heading Residuals, select Residuals, Residual Plots and Line Fit Plots
Then click OK.
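To show what moving from simple to multiple regression involves, here is a minimal Python sketch that fits ELEC on C76 alone and then on C76 and H55 together, using the twelve 2004 rows from section 1. It solves the least-squares normal equations by Gaussian elimination, which is an assumption made for illustration and not necessarily the numerical method Excel uses. Only two explanatory variables are included to keep the small example well conditioned, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(columns, y):
    """Return [intercept, b1, ..., bk] minimising the sum of squared residuals."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]  # leading 1 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(columns, y, coefs):
    """Proportion of the variation in y explained by the fitted equation."""
    preds = [coefs[0] + sum(b * v for b, v in zip(coefs[1:], vals))
             for vals in zip(*columns)]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot


# Jan-04 to Dec-04 rows only (the full data set has 103 observations).
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
h55 = [10148, 12504, 9300, 5333, 2846, 282, 1, 0, 0, 155, 1812, 5779]

simple = fit_ols([c76], elec)
multiple = fit_ols([c76, h55], elec)
r2_simple = r_squared([c76], elec, simple)
r2_multiple = r_squared([c76, h55], elec, multiple)

print(f"R Square, ELEC on C76:         {r2_simple:.3f}")
print(f"R Square, ELEC on C76 and H55: {r2_multiple:.3f}")
```

R Square cannot fall when an explanatory variable is added, so a higher value alone does not show that the larger model forecasts better; Adjusted R Square and the t statistics of the individual coefficients are more informative for that comparison.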