
Manual 3 of 4

Practical and Clear Graduate Statistics


in Excel

Correlation – Pearson & Spearman


Confidence Intervals
Simple & Multiple Regression
Logistic Regression
One & Two-Factor ANOVA

The Excel Statistical Master


(that’ll be you!)

By Mark Harmon
Copyright © 2014 Mark Harmon
No part of this publication may be reproduced
or distributed without the express permission
of the author.
mark@ExcelMasterSeries.com
ISBN: 978-1-937159-22-1
Table of Contents

Confidence Intervals in Excel


Confidence Intervals in Excel ................................................................ 13
Overview .................................................................................................................... 13
Margin of Error ........................................................................................................... 13
Factors Affecting Size of the Confidence Interval ....................................................... 13
1) Degree of Confidence ....................................................................................................... 13
2) Sample Size ....................................................................................................................... 13
3) Variability of the Population .......................................................................................... 13
C. I. of a Population Mean vs. C. I. of a Population Proportion ................................... 14
Confidence Intervals for t-Tests ................................................................................. 14
Prediction Interval of a Regression Estimate .............................................................. 14

t-Based Confidence Interval of a Population Mean in Excel ................ 15


Overview .................................................................................................................... 15
Example of a t-Based Confidence Interval of a Population Mean in Excel ................. 15
Summary of Problem Information ..................................................................................... 18
Question 1) Type of Confidence Interval? ......................................................................... 19
a) Confidence Interval of Population Mean or Population Proportion? ................................... 19
b) t-Based or z-Based Confidence Interval? ............................................................................... 19
Question 2) All Required Assumptions Met? ................................................................... 19
a) Normal Distribution of the Sample Mean ................................................................................ 19
Sample Means Are Normally Distributed If Any of the Following Are True: ............... 20
1) Sample Size of Each Sample, n, Is Greater Than 30. ............................................................. 20
2) Population Is Normally Distributed. ......................................................................................... 20
3) Sample Is Normally Distributed. ............................................................................................... 20
Evaluating the Normality of the Sample Data ................................................................... 20
Histogram in Excel ......................................................................................................................... 21
Normal Probability Plot in Excel ................................................................................................... 23
Kolmogorov-Smirnov Test For Normality in Excel ..................................................................... 23
Anderson-Darling Test For Normality in Excel ........................................................................... 25
Shapiro-Wilk Test For Normality in Excel .................................................................................... 27
Correctable Reasons That Normal Data Can Appear Non-Normal ............................... 28
Step 1) Calculate Half-Width of Confidence Interval ........................................................ 29
Step 2) Confidence Interval = Sample Mean ± C.I. Half-Width ....................................... 29

Min Sample Size to Limit Width of a Confidence Interval of a Mean ... 32


Example of Calculating Min Sample Size in Excel ..................................................... 33

z-Based Confidence Interval of a Population Mean in Excel ............... 34


Overview .................................................................................................................... 34
Example of a z-Based Confidence Interval in Excel ................................................... 35
Summary of Problem Information ..................................................................................... 35
Question 1) Type of Confidence Interval? ......................................................................... 36
a) Confidence Interval of Population Mean or Population Proportion? ................................... 36
b) t-Based or z-Based Confidence Interval? ............................................................................... 36
Question 2) All Required Assumptions Met? ................................................................... 36
a) Normal Distribution of the Sample Mean ................................................................................ 36
b) Population Standard Deviation Is Known (σ = 30 MPa) ......................................................... 36
Step 1) Calculate Half-Width of Confidence Interval ........................................................ 36
Step 2) Confidence Interval = Sample Mean ± C.I. Half-Width ...................................... 37

Min Sample Size to Limit Width of a Confidence Interval of a Mean ... 38


Example of Calculating Min Sample Size in Excel ..................................................... 39

Confidence Interval of a Population Proportion in Excel ..................... 40


Overview .................................................................................................................... 40
Example of a Confidence Interval of a Population Proportion in Excel ....................... 41
Summary of Problem Information ..................................................................................... 41
Question 1) Type of Confidence Interval? ......................................................................... 42
a) Confidence Interval of Population Mean or Population Proportion? ................................... 42
b) t-Based or z-Based Confidence Interval? ............................................................................... 42
Question 2) All Required Assumptions Met? ................................................................... 42
Binomial Distribution Can Be Approximated By Normal Distribution? ................................... 42
Step 1) Calculate Half-Width of Confidence Interval ........................................................ 44
Step 2) Confidence Interval = Sample Proportion ± C.I. Half-Width ............................. 44
Min Sample Size to Limit Width of a Confidence Interval of a
Population Proportion ............................................................................ 45
Example 1 of Calculating Min Sample Size in Excel .................................................. 46
Min Number of Voters Surveyed to Limit Poll Error Margin ...................................................... 46
Example 2 of Calculating Min Sample Size in Excel .................................................. 46
Min Number of Production Samples to Limit Defect Rate Estimate Error Margin .................. 46

Prediction Interval of a Regression Estimate in Excel ......................... 47


Example of Prediction Interval in Excel ...................................................................... 50

Correlation in Excel ................................................................................ 52


Overview .................................................................................................................... 52
Quick Indicator of a Correlation ........................................................................................ 52
Correlation Does Not Mean Causation .............................................................................. 54
Types of Data ........................................................................................................................ 54
Nominal data ................................................................................................................................... 54
Ordinal data .................................................................................................................................... 54
Interval data .................................................................................................................................... 54
Ratio data ........................................................................................................................................ 54
Pearson Correlation vs. Spearman Correlation ................................................................ 54
Pearson Correlation’s Six Required Assumptions ..................................................................... 56
Spearman Correlation’s Only Two Required Assumptions ....................................................... 56
Interesting History of Both Correlations...................................................................................... 56

Pearson Correlation Coefficient, r, in Excel.......................................... 57


Overview .................................................................................................................... 57
Pearson Correlation’s Six Required Assumptions ............................................................ 59
Pearson Correlation Formulas ............................................................................................ 59
Example of Pearson Correlation in Excel ................................................................... 61
Step 1 – Create a Scatterplot of the Data ........................................................................... 61
Step 2 – Calculate r in Excel With One of Three Methods .............................................. 62
1) Data Analysis Correlation Tool................................................................................................. 62
2) Correlation Formula ................................................................................................................... 62
3) Covariance Formula .................................................................................................................. 62
Step 3 - Determine Whether r Is Significant...................................................................... 63
Calculate p Value............................................................................................................................ 63
Calculate r Critical .......................................................................................................................... 63
Comparing Chart Values of r Critical and p value in Excel with Calculated Values ............... 64
Calculating r Critical with the Formula .......................................................................................... 64
Calculating p Value With the Formula .......................................................................................... 64
Performing Correlation Analysis On More Than 3 Variables .................................................... 65

Spearman Correlation Coefficient, rs, in Excel ..................................... 66


Overview .................................................................................................................... 66
Spearman Correlation Formula ......................................................................................... 66
Tied Data Values............................................................................................................................. 66
No Ties Among Data Values ......................................................................................................... 66
Spearman Correlation’s Only Two Required Assumptions ............................................ 67
Example of Spearman Correlation in Excel ................................................................ 67
Step 1 – Plot the Data to Check For a Monotonic Relationship ...................................... 67
Step 2 – Check For Tied X or Y Values ............................................................................. 68
Step 4 – Calculate the Sum of the Square of the Rank Differences ................................. 69
Two Different Methods Used to Calculate rs Critical Values ..................................................... 73

Covariance, sxy, in Excel ........................................................................ 75


Overview .................................................................................................................... 75
Using Covariance To Calculate a Line’s Slope and Y-Intercept ................................. 75

Single-Variable Linear Regression in Excel.......................................... 77


Overview .................................................................................................................... 77
The Regression Equation ........................................................................................... 77
Purposes of Linear Regression .................................................................................. 77
The Inputs For Linear Regression .............................................................................. 77
Simple Linear Regression .......................................................................................... 78
Null and Alternative Hypotheses ................................................................................ 78
X and Y Variables Must Have a Linear Relationship .................................................. 79
Do Not Extrapolate Regression Beyond Existing Data ............................................... 79
Example of Why Regression Should Not Be Extrapolated .............................................. 79
Linear Regression Should Not Be Done By Hand ...................................................... 79
Complete Example of Simple Linear Regression in Excel.......................................... 80
Step 1 – Remove Extreme Outliers ..................................................................................... 81
Sorting the Data To Quickly Spot Extreme Outliers ................................................................... 81
Step 2 – Create a Correlation Matrix ................................................................................. 82
Step 3 – Scale Variables If Necessary ................................................................................. 83
Step 4 – Plot the Data ........................................................................................................... 84
Step 5 – Run the Regression Analysis ................................................................................ 85
Step 6 – Evaluate the Residuals .......................................................................................... 88
Linear Regression’s Required Residual Assumptions .............................................................. 88
Locating and Removing Outliers .................................................................................................. 89
Determining Whether Residuals Are Independent ..................................................................... 90
Determining If Autocorrelation Exists.......................................................................................... 90
Determining if Residual Mean Equals Zero ................................................................................. 92
Determining If Residual Variance Is Constant ............................................................................ 92
Determining if Residuals Are Normally-Distributed ................................................................... 93
Histogram of the Residuals in Excel ............................................................................................. 94
Normal Probability Plot of Residuals in Excel .............................................................................. 97
Kolmogorov-Smirnov Test For Normality of Residuals in Excel ................................................... 98
Anderson-Darling Test For Normality of Residuals in Excel ........................................................ 99
Shapiro-Wilk Test For Normality in Excel ................................................................................... 101
Correctable Reasons Why Normal Data Can Appear Non-Normal ........................................... 102
Determining If Any Input Variables Are Too Highly Correlated .............................................. 103
Determining If There Are Enough Data Points .......................................................................... 105
Step 7 – Evaluate the Excel Regression Output .............................................................. 105
Regression Equation ................................................................................................................... 106
R Square –The Equation’s Overall Predictive Power ............................................................... 107
Significance of F - Overall p Value and Validity Measure ........................................................ 107
p Value of Intercept and Coefficients – Measure of Their Validity .......................................... 108
All Calculations That Created Excel’s Regression Output ............................................ 109
Calculation of Coefficient and Intercept in Excel ..................................................................... 110
Calculation of R Square in Excel ................................................................................................ 112
Calculation of Adjusted R Square in Excel ............................................................................... 114
Calculation of the Standard Error of the Regression Equation in Excel ................................ 114
ANOVA Calculations in Excel ..................................................................................................... 114
Analysis of the Independent Variable Coefficient in Excel ...................................................... 116
Standard Error of Coefficient ...................................................................................................... 116
t Stat of Coefficient ..................................................................................................................... 116
p-Value of the Coefficient ........................................................................................................... 116
95% Confidence Interval of Coefficient ...................................................................................... 116
Analysis of Intercept in Excel ..................................................................................................... 117
Standard Error of the Intercept ................................................................................................... 117
t Stat of the Intercept .................................................................................................................. 118
p-Value of the Intercept .............................................................................................................. 118
95% Confidence Interval of Intercept ......................................................................................... 118
Prediction Interval of a Regression Estimate .................................................................. 118
Prediction Interval Estimate Formula ......................................................................................... 119

Multiple-Variable Linear Regression in Excel ..................................... 124


Overview .................................................................................................................. 124
The Regression Equation ......................................................................................... 124
Purposes of Linear Regression ................................................................................ 124
The Inputs For Linear Regression ............................................................................ 125
Null and Alternative Hypotheses .............................................................................. 125
X and Y Variables Must Have a Linear Relationship ................................................ 126
Do Not Extrapolate Regression Beyond Existing Data ............................................. 126
Example of Why Regression Should Not Be Extrapolated ............................................ 126
Linear Regression Should Not Be Done By Hand .................................................... 126
Complete Example of Multiple Linear Regression in Excel ...................................... 127
Step 1 – Remove Extreme Outliers ................................................................................... 128
Sorting the Data To Quickly Spot Extreme Outliers ................................................................. 128
Step 2 – Create a Correlation Matrix ............................................................................... 130
Step 3 – Scale Variables If Necessary ............................................................................... 133
Step 4 – Plot the Data ......................................................................................................... 134
Step 5 – Run the Regression Analysis .............................................................................. 135
Step 6 – Evaluate the Residuals ........................................................................................ 141
Linear Regression’s Required Residual Assumptions ............................................................ 141
Locating and Removing Outliers ................................................................................................ 141
Determining Whether Residuals Are Independent ................................................................... 143
Determining If Autocorrelation Exists........................................................................................ 144
Determining if Residual Mean Equals Zero ............................................................................... 146
Determining If Residual Variance Is Constant .......................................................................... 146
Determining if Residuals Are Normally-Distributed ................................................................. 147
Histogram of the Residuals in Excel ........................................................................................... 148
Normal Probability Plot of Residuals in Excel ............................................................................ 149
Kolmogorov-Smirnov Test For Normality of Residuals in Excel ................................................. 150
Anderson-Darling Test For Normality of Residuals in Excel ...................................................... 152
Shapiro-Wilk Test For Normality in Excel ................................................................................... 154
Correctable Reasons Why Normal Data Can Appear Non-Normal ........................................... 155
Determining If Any Input Variables Are Too Highly Correlated With Residuals ................... 156
Determining If There Are Enough Data Points .......................................................................... 157
Step 7 – Evaluate the Excel Regression Output .............................................................. 158
Regression Equation ................................................................................................................... 158
R Square –The Equation’s Overall Predictive Power ............................................................... 160
Significance of F - Overall p Value and Validity Measure ........................................................ 160
p Value of Intercept and Coefficients – Measure of Their Validity .......................................... 161
Prediction Interval of a Regression Estimate .................................................................. 162
Prediction Interval Formula ......................................................................................................... 162
Prediction Interval Estimate Formula ......................................................................................... 162
Example in Excel ........................................................................................................................ 163

Logistic Regression.............................................................................. 164


Overview .................................................................................................................. 164
The Goal of Binary Logistic Regression ........................................................................... 164
Allowed Variable Types For Binary Logistic Regression............................................... 164
Logistic Regression Calculates the Probability of an Event Occurring ........................ 165
The Difference Between Linear Regression and Logistic Regression ................................... 165
The Relationship Between Probability and Odds ..................................................................... 165
The Logit – The Natural Log of the Odds........................................................................ 165
LE - The Likelihood Estimation ....................................................................................... 167
MLE – The Maximum Likelihood Estimation ................................................................ 167
LL - The Log-Likelihood Function ................................................................................... 167
MLL – Maximum Log Likelihood Function ................................................................... 167
Example of Binary Logistic Regression .................................................................... 169
Step 1 – Sort the Data ........................................................................................................ 170
Step 2 – Calculate a Logit For Each Data Record .......................................................... 171
Step 3 – Calculate e^L For Each Data Record ................................................................... 172
Step 4 – Calculate P(X) For Each Data Record............................................................... 173
Step 5 – Calculate LL, the Log-Likelihood Function ...................................................... 174
Step 6 – Use the Excel Solver to Calculate MLL, the Maximum Log-Likelihood
Function............................................................................................................................... 175
Solver Results .............................................................................................................................. 178
Step 7 – Test the Solver Output By Running Scenarios ................................................. 179
Step 8 – Calculate R Square .............................................................................................. 183
Step 1) Calculate the Maximum Log-Likelihood for Full Model .............................................. 183
Step 2) Calculate the Maximum Log-Likelihood for the Model With No Explanatory Variables
........................................................................................................................................................ 183
Step 3) Calculate R Square ......................................................................................................... 186
Step 9 – Determine if the Variable Coefficients Are Significant.................................... 186
The Wald Statistic ........................................................................................................................ 186
The Likelihood Ratio .................................................................................................................... 187
Using the Likelihood Ratio to Determine Whether Coefficient b1 Is Significant ................... 188
Using the Likelihood Ratio to Determine Whether Coefficient b2 Is Significant ................... 191
Step 10 – Create a Classification Table ............................................................................ 194
Step 11 – Determine if the Overall Logistic Regression Equation Is Significant ......... 195

Single-Factor ANOVA in Excel ............................................................. 198


Overview .................................................................................................................. 198
ANOVA = Analysis of Variance ................................................................................ 198
Null and Alternative Hypotheses for Single-factor ANOVA....................................... 198
Single-Factor ANOVA vs. Two-Sample, Pooled t-Test ............................................. 199
2-Sample One-Way ANOVA = 2-Sample, Pooled t-Test ................................................ 202
Sample Groups With Small Variances (the first graph) ........................................................... 202
Sample Groups With Large Variances (the second graph) ..................................................... 204
Single-Factor ANOVA Should Not Be Done By Hand ................................................... 206
Single-Factor ANOVA Example in Excel .................................................................. 207
Step 1 – Place Data in Excel Group Columns .................................................................. 208
Step 2 – Remove Extreme Outliers ................................................................................... 212
Find Outliers From the Sorted Data ........................................................................................... 212
Find Outliers By Standardizing Residuals ................................................................................ 213
Step 3 – Verify Required Assumptions ............................................................................ 214
Single-Factor ANOVA Required Assumptions .......................................................................... 214
1) Independence of Sample Group Data .................................................................................... 214
2) Sample Data Are Continuous ................................................................................................. 214
3) Independent Variable is Categorical ...................................................................................... 215
4) Extreme Outliers Removed If Necessary ............................................................................... 215
5) Normally-Distributed Data In All Sample Groups ................................................................... 215
6) Relatively Similar Variances In All Sample Groups ................................................................ 215
Determining If Sample Groups Are Normally-Distributed........................................................ 215
Shapiro-Wilk Test For Normality in Excel ................................................................................... 215
Correctable Reasons Why Normal Data Can Appear Non-Normal ........................................... 218
Nonparametric Alternatives To Single-Factor ANOVA For Non-Normal Data ........................... 219
Determining If Sample Groups Have Similar Variances ........................................................... 219
Levene’s Test in Excel For Sample Variance Comparison ........................................................ 220
Brown-Forsythe Test in Excel For Sample Variance Comparison ............................................. 222
Step 4 – Run the Single-Factor ANOVA Tool in Excel .................................................. 224
Step 5 – Interpret the Excel Output ................................................................................. 226
All Calculations That Created Excel’s One-Way ANOVA Output ............................................ 226
Step 6 – Perform Post-Hoc Testing in Excel .................................................................... 230
The Many Types of Post-Hoc Tests Available ........................................................................... 230
Post-Hoc Tests Used When Group Variances Are Equal .......................................................... 230
Post-Hoc Tests Used When Group Variances Are Not Equal ................................................... 231
Tukey’s HSD (Honestly Significant Difference) Test .................................................................. 231
Tukey-Kramer Test ..................................................................................................................... 232
Games-Howell Test .................................................................................................................... 233
Tukey-Kramer Post-Hoc Test in Excel ....................................................................................... 235
Games-Howell Post-Hoc Test in Excel....................................................................................... 239
Step 7 – Calculate Effect Size ............................................................................................ 242
The Three Most Common Measures of Effect Size ................................................................... 242
Eta Squared (η²) ........................................................................................................... 242
Psi (ψ) - RMSSE ......................................................................................................... 243
Omega Squared (ω²) .................................................................................................. 244
Calculating Eta Squared (η²) in Excel ........................................................................ 245
Calculating Psi (ψ) – RMSSE – in Excel ..................................................................... 246
Calculating Omega Squared (ω²) in Excel ................................................................. 247
Step 8 – Calculate the Power of the Test .......................................................................... 248
Calculating Power With Online Tool G Power ........................................................................... 249
1) A Priori .................................................................................................................................... 249
2) Post hoc .................................................................................................................................. 250
What To Do When Groups Do Not Have Similar Variances ......................................... 252
Welch’s ANOVA in Excel ............................................................................................................. 252
Brown Forsythe F-Test in Excel ................................................................................................. 259
What To Do When Groups Are Not Normally-Distributed ........................................... 262
Kruskal-Wallis Test in Excel ....................................................................................................... 262

Two-Factor ANOVA With Replication in Excel.................................... 282


Overview .................................................................................................................. 282
Independent Variables vs. Dependent Variables ..................................................... 283
Two-Way ANOVA..................................................................................................... 283
Balanced Two-Way ANOVA With Replication .............................................................. 283
ANOVA = Analysis of Variance........................................................................................ 283
The Independent and Dependent Variables of ANOVA ................................................ 284
Two-Way ANOVA With Replication Performs Three F Tests ................................... 284
Factor 1 Main Effects F Test ............................................................................................. 284
Factor 2 Main Effects F Test ............................................................................................. 284
Factor 1 and 2 Interaction Effects F Test ........................................................................ 284
Requirements of Each F Test ............................................................................................ 284
Factor 1 Main Effects F Test ....................................................................................................... 284
Factor 2 Main Effects F Test ....................................................................................................... 284
Factor 1 and 2 Interaction Effects F Test ................................................................................... 285
Alternative Test When Data Are Normally Distributed .............................................. 285
Null and Alt. Hypotheses For 2-Way ANOVA W/Rep. .............................................. 285
Null and Alternative Hypotheses for the Two Main Effects F Tests ............................. 285
Null and Alternative Hypotheses for the Interaction Effect F Tests ............................. 286
Two-Factor ANOVA Should Not Be Done By Hand ...................................................... 286
Two-Factor ANOVA With Replication Example in Excel .......................................... 286
Step 1 – Arrange the Data Properly ................................................................................. 288
Step 2 – Evaluate Extreme Outliers ................................................................................. 292
Step 3 – Verify Required Assumptions ............................................................................ 293
Two-Factor ANOVA With Replication Required Assumption .................................................. 293
1) Independence of Sample Group Data .................................................................................... 293
2) Sample Data Are Continuous ................................................................................................. 293
3) Independent Variables Are Categorical ................................................................................. 293
4) Extreme Outliers Removed If Necessary ............................................................................... 293
5) Normally Distributed Data In All Sample Groups ................................................................... 293
6) Relatively Similar Variances In All Sample Groups In Each F Test ....................................... 293
Determining If Sample Groups Are Normally-Distributed........................................................ 294
Shapiro-Wilk Test For Normality ................................................................................................. 295
Shapiro-Wilk Normality Test in Excel of Factor 1 Level 1 Data .................................................. 296
Shapiro-Wilk Normality Test in Excel of Factor 1 Level 2 Data .................................................. 296
Shapiro-Wilk Normality Test in Excel of Factor 1 Level 3 Data .................................................. 297
Shapiro-Wilk Normality Test in Excel of Factor 2 Level 1 Data .................................................. 297
Shapiro-Wilk Normality Test in Excel of Factor 2 Level 2 Data .................................................. 298
Correctable Reasons Why Normal Data Can Appear Non-Normal ........................................... 298
Determining If Sample Groups Have Similar Variances ........................................................... 299
Levene’s Test in Excel For Sample Variance Comparison ........................................................ 300
Brown-Forsythe Test in Excel For Sample Variance Comparison ............................................. 303
Step 4 – Run the Two-Factor ANOVA With Replication Tool in Excel ....................... 305
Step 5 – Interpret the Excel Output ................................................................................. 308
Main Effects F Test for Factor 1 .................................................................................................. 308
Main Effects F Test for Factor 2 .................................................................................................. 308
Interaction Effects F Test for Factors 1 and 2 ........................................................................... 308
Step 6 – Perform Post-Hoc Testing in Excel .................................................................... 309
Post-Hoc Tests Used When Group Variances Are Equal .......................................................... 309
Tukey’s HSD Test in Excel For Each Main Effects F Test For Factor 1 .................................... 310
Determining Where the Strongest Interactions Between Factor 1 and Factor 2 Occur ............. 320
Step 7 – Calculate Effect Size ............................................................................................ 322
Eta Squared (η²) .............................................................................................................. 322
Calculating Eta Squared (η²) in Excel .......................................................................... 323
Step 8 – Calculate the Power of the Test .......................................................................... 324
Calculating Power With Online Tool G Power ........................................................................... 325
What To Do When Groups Are Not Normally-Distributed ........................................... 330
Scheirer-Ray-Hare Test in Excel ................................................................................................. 330

Two-Factor ANOVA Without Replication ............................................ 333


Overview .................................................................................................................. 333
Two-Factor ANOVA Without Replication Example in Excel ..................................... 333
Power Analysis of Two-Factor ANOVA Without Replication....................................... 335
Performing a priori Power Analysis for the Main Effect of Factor 1 ....................................... 337

Check Out the Latest Book in the Excel Master Series! .................... 341

Meet the Author .................................................................................... 346


Confidence Intervals in Excel

Overview
A confidence interval is a range of values that is believed to contain a population parameter (usually the
population’s mean) with a specified degree of certainty. For example, after calculating a confidence
interval you can make a statement like this: “I am 95% certain that the mean of the population from which
I obtained the sample is somewhere between points A and B.” That is equivalent to saying that the range
of values between A and B is the 95% confidence interval for the mean of the population from which the
sample was obtained.

Margin of Error
The Margin of Error is half the width of a Confidence Interval and is always expressed in the same units
as the sample data.
The Margin of Error of a population proportion can be roughly and quickly estimated as follows (these
shortcuts assume the worst-case proportion, p = 0.5):
Margin of error at 99 percent ≈ 1.29 / SQRT(n)
Margin of error at 95 percent ≈ 0.98 / SQRT(n)
Margin of error at 90 percent ≈ 0.82 / SQRT(n)
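For example, a poll with a sample size of n = 400 would have a rough 95 percent margin of error of
0.98 / SQRT(400) = 0.98 / 20 = 0.049, or about ±4.9 percentage points.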

Factors Affecting Size of the Confidence Interval

1) Degree of Confidence
The size of the Confidence Interval increases as the degree of confidence increases. That is intuitive
because the wider the interval, the greater is the certainty that the interval contains the true population
mean.

2) Sample Size
The greater the sample size, the more certain is the position of the population parameter. The Confidence
Interval therefore becomes smaller as sample size increases.

3) Variability of the Population


The greater the variation within the population, the less certain is the position of the population’s mean.
The Confidence Interval therefore increases as population variability increases.

C. l. of a Population Mean vs. C. I. of a Population Proportion
Confidence Intervals are closely related to hypothesis tests. The difference between a Confidence Interval
of a mean and a Confidence Interval of a proportion is the same as the difference between a Hypothesis
Test of Mean and a Hypothesis Test of Proportion.
The difference lies in the sample data.
Samples taken for both a Confidence Interval of a Population Mean and a Hypothesis Test of Mean can
take a wide range of values.
Samples taken for both a Confidence Interval of a Population Proportion and a Hypothesis Test of
Proportion are binary: they can take only one of two values.

Confidence Intervals for t-Tests


A t-Test is a method that is used to make an inference about a population parameter based on a sample
statistic. A t-Test is used to determine whether it is likely that the sample statistic is the same as or
different from the population parameter. That is another way of determining whether the sample came
from the population described by the population parameter or was drawn from a different population.
Another way to evaluate the population from which the sample was drawn is to create a Confidence
Interval about the sample statistic. This Confidence Interval has a certain probability of containing the
matching population parameter.

Prediction Interval of a Regression Estimate


A prediction interval is a confidence interval about a Y value that is estimated from a regression
equation. A regression prediction interval is a value range above and below the Y estimate calculated by
the regression equation that would contain the actual value of a sample with, for example, 95 percent
certainty.
Calculating prediction intervals of a regression equation is covered in the chapters of this book on
simple regression and multiple regression.

t-Based Confidence Interval of a Population Mean in
Excel

Overview
This confidence interval of a population mean is based upon the sample mean being distributed
according to the t distribution. A 95-percent confidence interval of a population mean is an interval that
has a 95-percent chance of containing the population mean.

The sample mean is distributed according to the t distribution if any of the following sets of conditions is in
effect:

1) The sample size is large and the population standard deviation is not known.

2) The sample size is small (n < 30) and the population is shown to be normally distributed.

3) The sample size is small (n < 30) and the sample is proven to be normally distributed.

x_bar = Observed Sample Mean

Margin of Error = Half-Width of C.I. = t-Value(α/2) * Standard Error

Margin of Error = Half-Width of C.I. = T.INV(1 – α/2, df) * s / SQRT(n)
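A minimal worksheet sketch of this calculation (assuming, purely for illustration, that the sample data
occupy cells A2:A21):

B1: =STDEV.S(A2:A21)                         (sample standard deviation, s)
B2: =COUNT(A2:A21)                           (sample size, n)
B3: =B2-1                                    (degrees of freedom, df)
B4: =T.INV(1-0.05/2,B3)*B1/SQRT(B2)          (Margin of Error for a 95 percent interval, α = 0.05)

The confidence interval is then the sample mean plus and minus the value in B4.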
A confidence interval of a population mean that is based on the normal distribution is z-based. A
confidence interval of a population mean that is based on the t distribution is t-based.
It is much more common to use the t distribution than the normal distribution to create a confidence
interval of the population mean. Requirements for t-based confidence intervals are much less restrictive
than the requirements for a z-based confidence interval.

Example of a t-Based Confidence Interval of a Population Mean in Excel
In this example a 95 percent Confidence Interval is created around a sample mean using the t
distribution. There is a 95 percent chance that the population mean is contained within this Confidence
Interval.
A company is evaluating whether to purchase a large number of electric-powered machines. An
important purchase criterion is how long a machine can operate after being fully charged.
To determine how long the machine can be expected to operate on a single charge, the company
purchased 20 machines and fully charged each. Each of these machines was then operated at full speed

until the charge ran out and the machine stopped running. The number of hours that each machine was
able to operate after a full charge at full speed was recorded.
Calculate the interval that contains the average operating length of all of the machines with 95
percent certainty. In other words, calculate the 95 percent Confidence Interval of the mean operating
length for all machines based upon the sample of 20 machines that was tested. The data provided is as
follows:

Running the Excel data analysis tool Descriptive Statistics will provide the Sample Mean, the Sample
Standard Deviation, the Standard Error, and the Sample Size. The output of this tool appears as follows:

The above Descriptive Statistics are obtained by running Excel’s Descriptive Statistics data analysis tool
as shown below. It is important to select the Confidence Level checkbox and specify the confidence level
desired (95 percent in this case). Doing so calculates half of the width of the 95 percent Confidence
Interval using the t distribution, as this example will also do. Below is the completed Descriptive Statistics
dialogue box:

Summary of Problem Information
x_bar = sample mean = AVERAGE() = 250
µ = (population) mean of all machines = Unknown
s = sample standard deviation = STDEV.S() = 37.170
σ (Greek letter “sigma”) = population standard deviation = Not Known
n = sample size = COUNT() = 20
SE = Standard Error = s / SQRT(n) = 37.170 / SQRT(20)
SE = 8.311
Note that this calculation of the Standard Error using the sample standard deviation, s, is an estimate of
the true Standard Error which would be calculated using the population standard deviation, σ.
df = degrees of freedom = n – 1 = 20 – 1 = 19
Level of Certainty = 0.95
Alpha = 1 - Level of Certainty = 1 – 0.95 = 0.05
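Plugging these summary values into the two-step method described below gives a preview of the final
answer (the full calculation appears in Steps 1 and 2 later in this chapter):

t Value = T.INV(1 – α/2, df) = T.INV(0.975, 19) ≈ 2.093
Half-Width of C.I. = t Value * SE = 2.093 * 8.311 ≈ 17.4
Confidence Interval = x_bar ± 17.4 = 250 ± 17.4 = 232.6 to 267.4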

As when creating any Confidence Interval of a Mean, we must satisfactorily answer these two questions
and then proceed to the two-step method of creating the Confidence Interval.

The Initial Two Questions That Must be Answered Satisfactorily


What Type of Confidence Interval Should Be Created?
Have All of the Required Assumptions For This Confidence Interval Been Met?

The Two-Step Method For Creating a Confidence Interval of a Mean Is the Following:
Step 1 - Calculate the Half-Width of the Confidence Interval (Sometimes Called the Margin of Error)
Step 2 – Create the Confidence Interval By Adding to and Subtracting From the Sample Mean Half
the Confidence Interval’s Width

The Initial Two Questions That Need To Be Answered Before Creating a Confidence Interval of the Mean
or Proportion Are as Follows:

Question 1) Type of Confidence Interval?


a) Confidence Interval of Population Mean or Population Proportion?
This is a Confidence Interval of a population mean because each individual observation (each sampled
machine’s length of operation) within the entire sample can have a wide range of values. The sample
values are spread out between 175 and 300. On the other hand, sampled data points used to create a
Confidence Interval of a population proportion are binary: they can take only one of two possible values.

b) t-Based or z-Based Confidence Interval?


A Confidence Interval created using the t distribution is said to be t-based. A Confidence Interval created
using the normal distribution is said to be z-based. It is much more common to use the t distribution to
create Confidence Intervals of a population mean because the t distribution is much less restrictive. The t
distribution can always be used. The normal distribution can only be used if:
Sample size is large (n > 30)
AND
The population standard deviation, σ, is known.
In this case sample size is small (n < 30) and the population standard deviation is not known. The t
distribution must therefore be used to create this Confidence Interval of a population mean. This
Confidence Interval of a population mean will be t-based.
This Confidence Interval will be a Confidence Interval of a population mean and will be created
using the t distribution.

Question 2) All Required Assumptions Met?


a) Normal Distribution of the Sample Mean
We are attempting to create a confidence interval about the sample mean which contains the population
mean. To create a confidence interval that is based on the normal distribution or t distribution, the sample
mean must be normally distributed. In other words, if we took multiple samples just like the one
mentioned here, the means of those samples would have to be normally distributed in order to be able to
create a confidence interval that is based upon the normal or t distributions.

For example, 30 independent, random samples of 20 machines each could be tested for mean length of
operation just like the single sample of 20 machines in this example was tested. If those means of all 30
samples are normally distributed, a confidence interval based on the t distribution can be created around
the mean of the single sample taken.

Sample Means Are Normally Distributed If Any of the Following Are True:
1) Sample Size of Each Sample, n, Is Greater Than 30.
The Central Limit Theorem states that the means of similar-sized, random, independent samples will be
normally distributed if the sample size is large (n > 30), no matter how the underlying population from
which the samples came is distributed. In reality, the distribution of sample means converges toward
normality when n is as small as 5, as long as the underlying population is not too skewed.

2) Population Is Normally Distributed.


If this is the case, the means of similar sized, random, independent samples will also be normally
distributed. It is quite often the case that the distribution of the underlying population is not known and the
normal distribution of a population should not be assumed until proven.

3) Sample Is Normally Distributed.


If the sample is normally distributed, the means of other similar-sized, independent, random samples will
also be normally distributed. Normality testing must be performed on the sample to determine whether the
sample is normally distributed.
In this case the sample size is small (n = 20) and the population’s distribution is unknown. The only
remaining way to verify normal distribution of the sample mean is to verify normal distribution of the
sample. The sample must therefore be tested and confirmed for normality before a Confidence Interval
based on the t distribution can be created.

Evaluating the Normality of the Sample Data


The following five normality tests will be performed on the sample data here:
1) An Excel histogram of the sample data will be created.
2) A normal probability plot of the sample data will be created in Excel.
3) The Kolmogorov-Smirnov test for normality of the sample data will be performed in Excel.
4) The Anderson-Darling test for normality of the sample data will be performed in Excel.
5) The Shapiro-Wilk test for normality of the sample data will be performed in Excel.

Histogram in Excel
The quickest way to check the sample data for normality is to create an Excel histogram of the data as
shown below, or to create a normal probability plot of the data if you have access to an automated
method of generating that kind of a graph.

To create this histogram in Excel, fill in the Excel Histogram dialogue box as follows:
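If you prefer worksheet formulas to the Histogram tool, the same bin counts can be produced with
Excel’s FREQUENCY() function. A minimal sketch, assuming for illustration that the sample data are in
A2:A21 and the bin upper boundaries are in C2:C7: select D2:D8 (one cell more than the number of
bins; the extra cell counts any values above the top boundary), type

=FREQUENCY(A2:A21,C2:C7)

and, in versions of Excel without dynamic arrays, confirm with Ctrl+Shift+Enter.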

The sample group appears to be distributed reasonably closely to the bell-shaped normal distribution. It should be noted that bin size in an Excel histogram is manually set by the user. This arbitrary setting of the bin sizes can have a significant influence on the shape of the histogram’s output. Different bin sizes could result in an output that would not appear bell-shaped at all. What is actually set by the user in an Excel histogram is the upper boundary of each bin.

Normal Probability Plot in Excel
Another way to graphically evaluate normality of each data sample is to create a normal probability plot
for each sample group. This can be implemented in Excel and appears as follows:

The normal probability plot for the sample group shows that the data appear to be very close to normally distributed. The actual sample data (red) match very closely the data values that would occur if the sample were perfectly normally distributed (blue) and never go beyond the 95 percent confidence interval boundaries (green).
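If the plot is being built outside of Excel, the same check can be scripted. The following Python sketch is a minimal example; the sample values are illustrative stand-ins, since this example’s actual data set appears only in the workbook figures.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Illustrative stand-in for the example's sample of 20 data points
rng = np.random.default_rng(0)
sample = rng.normal(loc=250, scale=37, size=20)

# Ordered sample values are plotted against theoretical normal quantiles;
# near-normal data falls close to the fitted straight line
stats.probplot(sample, dist="norm", plot=plt)
plt.show()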

Kolmogorov-Smirnov Test For Normality in Excel


The Kolmogorov-Smirnov Test is a hypothesis test that is widely used to determine whether a data
sample is normally distributed. The Kolmogorov-Smirnov Test calculates the distance between the
Cumulative Distribution Function (CDF) of each data point and what the CDF of that data point would be if
the sample were perfectly normally distributed. The Null Hypothesis of the Kolmogorov-Smirnov Test
states that the distribution of actual data points matches the distribution that is being tested. In this case
the data sample is being compared to the normal distribution.
The largest distance between the CDF of any data point and its expected CDF is compared to the Kolmogorov-Smirnov Critical Value for a specific sample size and Alpha. If this largest distance exceeds the Critical Value, the Null Hypothesis is rejected and the data sample is determined to have a different distribution than the tested distribution. If the largest distance does not exceed the Critical Value, we cannot reject the Null Hypothesis, which states that the sample has the same distribution as the tested distribution.
F(X_k) = CDF(X_k) for the normal distribution
F(X_k) = NORM.DIST(X_k, Sample Mean, Sample Stan. Dev., TRUE)

0.1500 = Max Difference Between Actual and Expected CDF
20 = n = Number of Data Points
0.05 = α

24
The Null Hypothesis Stating That the Data Are Normally Distributed Cannot Be Rejected
The Max Difference Between the Actual and Expected CDF (0.1500) is less than the Kolmogorov-
Smirnov Critical Value for n = 20 and α = 0.05 so do not reject the Null Hypothesis.
The Null Hypothesis for the Kolmogorov-Smirnov Test for Normality, which states that the sample data are normally distributed, is rejected if the maximum difference between the expected and actual CDF of any of the data points exceeds the Critical Value for the given n and α.
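The maximum-distance statistic described above can also be computed programmatically. Below is a minimal Python sketch of that calculation; the sample array is an illustrative stand-in for the example’s data. Note that when the mean and standard deviation are estimated from the sample itself, the standard Kolmogorov-Smirnov critical values are conservative (the Lilliefors variant of the test corrects for this).

import numpy as np
from scipy import stats

def ks_normality_statistic(data):
    # Largest distance between the empirical CDF and the CDF the data
    # would have if it were perfectly normally distributed
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    expected_cdf = stats.norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))
    ecdf_above = np.arange(1, n + 1) / n   # empirical CDF at/after each point
    ecdf_below = np.arange(0, n) / n       # empirical CDF just before each point
    return max(np.max(np.abs(ecdf_above - expected_cdf)),
               np.max(np.abs(ecdf_below - expected_cdf)))

rng = np.random.default_rng(0)
sample = rng.normal(250, 37, 20)          # stand-in for the example's 20 data points
print(ks_normality_statistic(sample))     # compare to the critical value for n and alpha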

Anderson-Darling Test For Normality in Excel


The Anderson-Darling Test is a hypothesis test that is widely used to determine whether a data sample is
normally distributed. The Anderson-Darling Test calculates a test statistic based upon the actual value of
each data point and the Cumulative Distribution Function (CDF) of each data point if the sample were
perfectly normally distributed.
The Anderson-Darling Test is considered to be slightly more powerful than the Kolmogorov-Smirnov test
for the following two reasons:
1) The Kolmogorov-Smirnov test is distribution-free, i.e., its critical values are the same for all distributions tested. The Anderson-Darling test requires critical values calculated for each tested distribution and is therefore more sensitive to the specific distribution.
2) The Anderson-Darling test gives more weight to values in the outer tails than the Kolmogorov-Smirnov test. The K-S test is less sensitive to aberrations in outer values than the A-D test.
If the test statistic exceeds the Anderson-Darling Critical Value for a given Alpha, the Null Hypothesis is
rejected and the data sample is determined to have a different distribution than the tested distribution. If
the test statistic does not exceed the Critical Value, we cannot reject the Null Hypothesis, which states
that the sample has the same distribution as the tested distribution.
F(X_k) = CDF(X_k) for the normal distribution
F(X_k) = NORM.DIST(X_k, Sample Mean, Sample Stan. Dev., TRUE)

Adjusted Test Statistic A* = 0.407
Reject the Null Hypothesis of the Anderson-Darling Test, which states that the data are normally distributed, if any of the following are true:
A* > 0.576 When Level of Significance (α) = 0.15
A* > 0.656 When Level of Significance (α) = 0.10
A* > 0.787 When Level of Significance (α) = 0.05
A* > 1.092 When Level of Significance (α) = 0.01
The Null Hypothesis Stating That the Data Are Normally Distributed Cannot Be Rejected
The Null Hypothesis for the Anderson-Darling Test for Normality, which states that the sample data are
normally distributed, is rejected if the Adjusted Test Statistic (A*) exceeds the Critical Value for the given
n and α.
The Adjusted Test Statistic (A*) for the sample group (0.407) is significantly less than the Anderson-Darling Critical Value for α = 0.05 (0.787), so the Null Hypothesis of the Anderson-Darling Test for the sample group cannot be rejected.
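The Anderson-Darling test is also available programmatically; a minimal Python sketch with stand-in data follows. Note that scipy reports the test statistic alongside critical values that are already adjusted for sample size, so its statistic is compared directly to its own critical values.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(250, 37, 20)   # stand-in for the example's data

result = stats.anderson(sample, dist="norm")
print(result.statistic)            # Anderson-Darling test statistic
print(result.critical_values)      # critical values for the levels below
print(result.significance_level)   # [15. 10. 5. 2.5 1.] percent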

Shapiro-Wilk Test For Normality in Excel
The Shapiro-Wilk Test is a hypothesis test that is widely used to determine whether a data sample is
normally distributed. A test statistic W is calculated. If this test statistic is less than a critical value of W for
a given level of significance (alpha) and sample size, the Null Hypothesis which states that the sample is
normally distributed is rejected.
The Shapiro-Wilk Test is a robust normality test and is widely used because of its slightly superior performance against other normality tests, especially with small sample sizes. Superior performance means that it correctly rejects the Null Hypothesis of normality when the data are actually not normally distributed a slightly higher percentage of the time than most other normality tests, particularly at small sample sizes.
The Shapiro-Wilk normality test is generally regarded as being slightly more powerful than the Anderson-
Darling normality test, which in turn is regarded as being slightly more powerful than the Kolmogorov-
Smirnov normality test.
Sample Data

0.967452 = Test Statistic W


0.905 = W Critical for the following n and Alpha
20 = n = Number of Data Points
0.05 = α

The Null Hypothesis Stating That the Data Are Normally Distributed Cannot Be Rejected
Test Statistic W (0.967452) is larger than W Critical 0.905. The Null Hypothesis therefore cannot be
rejected. There is not enough evidence to state that the data are not normally distributed with a
confidence level of 95 percent.
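The equivalent calculation in Python, again with stand-in data: scipy reports a p-value rather than a critical W, so the Null Hypothesis of normality is not rejected when the p-value exceeds α.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(250, 37, 20)   # stand-in for the example's 20 data points

w_stat, p_value = stats.shapiro(sample)
print(w_stat, p_value)             # do not reject normality if p_value > 0.05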
******************
One note on the data set used for this example: the data set is the same one used for the one-sample hypothesis test, except that each data value has been divided by 800 for this example. If you go to that section in this book, you will observe that the Kolmogorov-Smirnov test, the Anderson-Darling test, and the Shapiro-Wilk test produce exactly the same results for both sets of numbers. This is expected, because all three tests are unaffected by a linear rescaling of the data.
******************

Correctable Reasons That Normal Data Can Appear Non-Normal


If a normality test indicates that data are not normally distributed, it is a good idea to do a quick evaluation
of whether any of the following factors have caused normally-distributed data to appear to be non-
normally-distributed:
1) Outliers – Too many outliers can easily skew normally-distributed data. An outlier can often be
removed if a specific cause of its extreme value can be identified. Some outliers are expected in normally-
distributed data.
2) Data Has Been Affected by More Than One Process – Variations to a process such as shift changes
or operator changes can change the distribution of data. Multiple modal values in the data are common
indicators that this might be occurring. The effects of different inputs must be identified and eliminated
from the data.
3) Not Enough Data – Normally-distributed data will often not assume the appearance of normality until
at least 25 data points have been sampled.
4) Measuring Devices Have Poor Resolution – Sometimes (but not always) this problem can be solved
by using a larger sample size.
5) Data Approaching Zero or a Natural Limit – If a large number of data values approach a limit such
as zero, calculations using very small values might skew computations of important values such as the
mean. A simple solution might be to raise all the values by a certain amount.
6) Only a Subset of a Process’ Output Is Being Analyzed – If only a subset of data from an entire process is being used, a representative sample is not being collected. Normally-distributed results would not appear normally distributed if a representative sample of the entire process is not collected.

We now proceed to the two-step method for creating all Confidence Intervals of a population mean. These steps are as follows:
Step 1) Calculate the Width of Half of the Confidence Interval
Step 2) Create the Confidence Interval By Adding and Subtracting the Width of Half of the Confidence Interval from the Sample Mean
Proceeding through the two steps is done as follows:

Step 1) Calculate the Half-Width of the Confidence Interval


Half the width of the Confidence Interval is sometimes referred to as the Margin of Error. The Margin of Error will always be measured in the same units as the sample mean. Calculating the half-width of the Confidence Interval using the t distribution is done as follows in Excel:
Margin of Error = Half Width of C.I. = t-Value_α/2 * Standard Error
Margin of Error = Half Width of C.I. = T.INV(1-α/2, df) * s/SQRT(n)
Margin of Error = Half Width of C.I. = T.INV(0.975, 19) * 37.17/SQRT(20)
Margin of Error = Half Width of C.I. = 2.093 * 8.311 = 17.396 hours

Step 2 Confidence Interval = Sample Mean ± C.I. Half-Width


Confidence Interval = Sample Mean ± (Half Width of Confidence Interval)
Confidence Interval = x_bar ± 17.396
Confidence Interval = 250 ± 17.396
Confidence Interval = [ 232.604 hours, 267.396 hours ]
A graphical representation of this Confidence Interval is shown as follows:

It should be noted that the legacy formula TINV(α,df) can be replaced in Excel 2010 and later by the
following formula: T.INV(1-α/2,df)
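As a quick cross-check outside of Excel, the same interval can be computed in Python. This is a minimal sketch using the summary statistics from this example.

import math
from scipy import stats

# Summary statistics from the example: n = 20, x_bar = 250, s = 37.17
x_bar, s, n, alpha = 250.0, 37.17, 20, 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)   # same value as Excel T.INV(0.975, 19)
margin = t_crit * s / math.sqrt(n)              # half-width of the confidence interval

print(round(margin, 3))                                     # ~17.396
print(round(x_bar - margin, 3), round(x_bar + margin, 3))   # ~232.604, ~267.396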

An Excel Shortcut For Calculating the t-Based Confidence Interval


The formula for calculating the Confidence Interval is the following:
Confidence Interval = Sample Mean ± (Half Width of Confidence Interval)
Descriptive Statistics in Excel instantly calculates the following:
Sample mean
Half the width of the Confidence Interval at the specified level of confidence using the t distribution
Here, once again, is the Descriptive Statistics for this data set:

This is created with the following information filled in the Excel dialogue box for the Descriptive Statistics
data analysis tool:

Note the following from the Descriptive Statistics:
Sample mean = x_bar = 250
Half the Width of the Confidence Interval = 17.396 hours
These numbers can simply be plugged into the Confidence Interval formula below to obtain the t-based
C. I. as long as the sample mean has been proven to be normally distributed.
Confidence Interval = Sample Mean ± (Half Width of Confidence Interval)
The half-width of a t-based confidence interval can also be quickly found by the following Excel formula:
Half-width of a t-based confidence interval = CONFIDENCE.T(α,s,n)
Half-width of a t-based confidence interval = CONFIDENCE.T(0.05,37.17,20)
Half-width of a t-based confidence interval = 17.396

z-Based Confidence Interval of a Population Mean in
Excel

Overview
This confidence interval of a population mean is based upon the sample mean being normally distributed.
A 95-percent confidence interval of a population mean is an interval that has a 95-percent chance of
containing the population mean.
The sample mean is normally distributed if the sample size is large (n > 30) as per the Central Limit Theorem. The CLT states that the means of large samples randomly taken from the same population will be normally distributed no matter how the population is distributed. Confidence intervals of a population mean can be based upon the normal distribution only if the sample size is large (n > 30).
In addition to the required large sample size that ensures the normal distribution of the sample mean, the
population standard deviation must be known as well.
x_bar = Observed Sample Mean

Margin of Error = Half Width of C.I. = z Value_α/2 * Standard Error


Margin of Error = Half Width of C.I. = NORM.S.INV(1 – α/2) * σ/SQRT(n)
A confidence interval of a population mean that is based on the normal distribution is z-based. A
confidence interval of a population mean that is based on the t distribution is t-based.
It is much more common to use the t distribution than the normal distribution to create a confidence
interval of the population mean. Requirements for t-based confidence intervals are much less restrictive
than the requirements for a z-based confidence interval.
A confidence interval of a population mean can be based on t distribution if only the sample standard
deviation is known and any of the following three conditions are met:
1) Sample size is large (n > 30). The Central Limit Theorem states that the means of large, similar-sized,
random samples will be normally distributed no matter how the underlying population is distributed.
2) The population from which the sample was drawn is proven to be normally distributed.
3) The sample is proven to be normally distributed.
A confidence interval of the mean can be created based on the normal distribution only if the sample size is large (n > 30) AND the population standard deviation, σ, is known. For this reason, confidence intervals are nearly always created using the t distribution in the professional environment.
This example will demonstrate how to create a confidence interval of the mean using the normal distribution.

Example of a z-Based Confidence Interval in Excel
In this example a 95 percent Confidence Interval is created around a sample mean using the normal
distribution.
A company received a shipment of 5,000 steel rods of unknown tensile strength. All rods originated from
the same source. The company randomly selected 100 rods from the shipment and tested each for
tensile strength. The average tensile strength of the 100 rods tested was found to be 250 MPa
(megapascals). The tensile strength of steel rods of this exact type is known to have a standard deviation
of 30 MPa.
Calculate the endpoints of the interval that is 95 percent certain to contain the true mean tensile strength of all 5,000 rods in the shipment. In other words, calculate the 95 percent confidence interval of the population (entire shipment) mean tensile strength.

Summary of Problem Information


x_bar = sample mean = AVERAGE() = 250 MPa
µ = (population) mean tensile strength of entire shipment = Unknown
σ (Greek letter “sigma”) = population tensile strength standard deviation = 30 MPa
n = sample size = COUNT() = 100
SE = Standard Error = σ / SQRT(n) = 30 / SQRT(100)
SE = 3
Level of Certainty = 0.95
Alpha = 1 - Level of Certainty = 1 – 0.95 = 0.05
As when creating all Confidence Intervals of a Mean, we must satisfactorily answer these two questions and then proceed to the two-step method of creating the Confidence Interval.
The Initial Two Questions That Must be Answered Satisfactorily
What Type of Confidence Interval Should Be created?
Have All of the Required Assumptions For This Confidence Interval Been Met?

The Two-Step Method For Creating Confidence Intervals of a Mean is the following:
Step 1) Calculate the Half-Width of the Confidence Interval (Sometimes Called the Margin of Error)
Step 2) Create the Confidence Interval By Adding to and Subtracting From the Sample Mean Half the Confidence Interval’s Width

The Initial Two Questions That Need To Be Answered Before Creating a Confidence Interval of the Mean
or Proportion Are as Follows:

Question 1) Type of Confidence Interval?


a) Confidence Interval of Population Mean or Population Proportion?
This is a Confidence Interval of a population mean because each individual observation (each sampled
rod’s tensile strength) within the entire sample can have a wide range of values. Most of the sample
values are spread out between 200 MPa and 300 MPa.
Sampled data points used to create a Confidence Interval of a population proportion are binary: they can
take only one of two possible values.

b) t-Based or z-Based Confidence Interval?


A confidence interval based on the normal distribution can be created only if both of the following conditions are met:
Sample size is large (n > 30)
AND
The population standard deviation, σ, is known.
In this case sample size is large (n = 100) and the population standard deviation is known (σ = 30 MPa).
This Confidence Interval can be created using either the t distribution or the normal distribution. In this
case, the normal distribution will be used to create this Confidence Interval of a population mean. This
Confidence Interval of a population mean will be z-based.
This confidence interval will be a confidence interval of a population mean and will be created
using the normal distribution.

Question 2) All Required Assumptions Met?


a) Normal Distribution of the Sample Mean
As per the Central Limit Theorem, the large sample size (n = 100) guarantees that the sample mean is
normally distributed.

b) Population Standard Deviation Is Known (σ = 30 MPa)


We now proceed to the two-step method for creating all Confidence intervals of a population mean.
These steps are as follows:
Step 1) Calculate the Width of Half of the Confidence Interval
Step 2) Create the Confidence Interval By Adding and Subtracting the Width of Half of the Confidence Interval from the Sample Mean

Proceeding through the two-step method of creating a confidence interval is done as follows:

Step 1) Calculate the Half-Width of the Confidence Interval


Half the width of the Confidence Interval is sometimes referred to as the Margin of Error. The Margin of Error will always be measured in the same units as the sample mean, which in this case is MPa (megapascals).

Calculating the Half Width of the Confidence Interval using the normal distribution would be done as
follows in Excel:
Margin of Error = Half Width of C.I. = z Value_α/2 * Standard Error
Margin of Error = Half Width of C.I. = NORM.S.INV(1 – α/2) * σ/SQRT(n)
Margin of Error = Half Width of C.I. = NORM.S.INV(1 – 0.05/2) * 30/SQRT(100)
Margin of Error = Half Width of C.I. = NORM.S.INV(0.975) * 30/10
Margin of Error = Half Width of C.I. = 1.96 * 3
Margin of Error = Half Width of C.I. = 5.88 MPa
The Half Width of z-based Confidence Interval can also be calculated by the following Excel formula:
Margin of Error = Half Width of C.I. = CONFIDENCE.NORM(α, σ, n)
Margin of Error = Half Width of C.I. = CONFIDENCE.NORM(0.05, 30, 100)
Margin of Error = Half Width of C.I. = 5.88 MPa

Step 2 Confidence Interval = Sample Mean ± C.I. Half-Width


Confidence Interval = Sample Mean ± (Half Width of Confidence Interval)
Confidence Interval = x_bar ± 5.88
Confidence Interval = 250 ± 5.88
Confidence Interval = [ 244.12 MPa, 255.88 MPa ]
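The same interval can be cross-checked in Python; a minimal sketch using this example’s summary statistics:

import math
from scipy import stats

# Summary statistics from the example: n = 100 rods, x_bar = 250 MPa, sigma = 30 MPa
x_bar, sigma, n, alpha = 250.0, 30.0, 100, 0.05

z_crit = stats.norm.ppf(1 - alpha / 2)    # same value as Excel NORM.S.INV(0.975)
margin = z_crit * sigma / math.sqrt(n)    # half-width, ~5.88 MPa

print(round(x_bar - margin, 2), round(x_bar + margin, 2))   # 244.12, 255.88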
A graphical representation of this Confidence Interval is shown as follows:

Min Sample Size to Limit Width of a Confidence
Interval of a Mean
The larger the sample taken, the smaller the Confidence Interval becomes. That makes intuitive sense
because the more sample information that is gathered, the more tightly the position of the population
mean can be defined. The Confidence Interval is an interval believed to contain the population mean with
specific degree of certainty.
As sample size increases, the Confidence Interval shrinks because greater certainty has been attained. The margin of error, which is equal to half the width of the Confidence Interval, therefore shrinks as well.
During the design phase of a statistical experiment, sample size should be determined. Sampling has a
cost and additional sampling beyond what is necessary to attain a desired level of certainty is often
undesirable. One common objective of the design phase of a statistical test involving sampling is to
determine the minimum sample size required to obtain a specified degree of certainty.
Calculating the minimum sample size necessary to limit the size of a confidence interval of a population mean can be done using the normal distribution but not the t distribution. A t-based confidence interval requires specifying the degrees of freedom, which is derived from the sample size, and the sample size is exactly the unknown quantity being solved for.
A z-based confidence interval (a confidence interval based upon the normal distribution) requires that the
sample mean be normally distributed and the population standard deviation be known. These
requirements are met if both of the following are true:
1) The minimum sample size is at least 30. This ensures that the sample mean is normally distributed
as per the Central Limit Theorem. If the calculated minimum sample size is less than 30, the sample or
the population must be confirmed to be normally distributed.
and
2) Population standard deviation or a reasonable estimate of the population standard deviation is
known. Sample standard deviation cannot be used because a sample has not been taken.
The minimum sample size, n, to limit the width of a z-based confidence interval of a population mean to a specific size can be derived with the following algebra:
Confidence Interval = Sample mean ± z Score_α,two-tailed * SE
Confidence Interval = x_bar ± NORM.S.INV(1 – α/2) * σ/SQRT(n)
(Half-width of C.I.) = NORM.S.INV(1 – α/2) * σ/SQRT(n)
Squaring both sides gives the following:
(Half-width of C.I.)^2 = [NORM.S.INV(1 – α/2)]^2 * σ^2 / n

Further algebraic manipulation produces the following:


n = [NORM.S.INV(1 – α/2)]^2 * σ^2 / (Half-width of C.I.)^2
or, equivalently, because Half-width of C.I. = Margin of Error,
n = [NORM.S.INV(1 – α/2)]^2 * σ^2 / (Margin of Error)^2

The count of data observations in a sample, n, must be a whole number so n must be rounded up to the
nearest whole number. This is implemented in Excel as follows:
n = Roundup( [NORM.S.INV(1 – α/2)]^2 * σ^2 / (Half-width of C.I.)^2 )

Example of Calculating Min Sample Size in Excel
A survey was taken of the monthly salaries of full-time employees of the California Department of Transportation. The standard deviation of monthly salaries throughout the entire California DOT is known to be $500.
What is the minimum number of employees that would have to be surveyed to be at least 95% certain
that the sample average monthly salary is within $50 of the true average monthly salary of all employees
in the California DOT?
In other words, what is the minimum sample size needed to create a 95-percent confidence interval about
the population mean that has a margin of error no larger than $50?
Another way to state the problem is to ask how large must the sample size be to create a 95-percent
confidence interval about the population mean that has a half-width of no more than $50?
σ = Population standard deviation = $500
Half-width of Confidence Interval = Margin of Error = $50
(The confidence interval must be specified in the same units as the population standard deviation is.)
α = 1 – Level of Certainty = 1 – 0.95 = 0.05
n = Roundup( [NORM.S.INV(1 – α/2)]^2 * σ^2 / (Half-width of C.I.)^2 )
n = Roundup( [NORM.S.INV(1 – 0.05/2)]^2 * (500)^2 / (50)^2 )
n = Roundup( [NORM.S.INV(0.975)]^2 * (500)^2 / (50)^2 )
n = Roundup( (1.96)^2 * (500)^2 / (50)^2 )
n = Roundup( 384.1459 )
n = 385
A minimum of 385 employees must be surveyed to be 95 percent certain that the average salary of the
sample is no more than $50 from the true average salary within the entire California DOT.
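The calculation is easy to wrap in a reusable function. The following minimal Python sketch (the function name is my own) reproduces the result above.

import math
from scipy import stats

def min_sample_size_mean(sigma, margin_of_error, confidence=0.95):
    # Minimum n so that a z-based confidence interval of a population mean
    # has a half-width no larger than margin_of_error
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha / 2)   # same value as Excel NORM.S.INV(1 - alpha/2)
    return math.ceil((z * sigma / margin_of_error) ** 2)

print(min_sample_size_mean(sigma=500, margin_of_error=50))   # 385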

Confidence Interval of a Population Proportion in
Excel

Overview
Confidence intervals covered in this manual will either be Confidence Intervals of a Population Mean or
Confidence Intervals of a Population Proportion. A data point of a sample taken for a confidence
interval of a population mean can have a range of values. A data point of a sample taken for a confidence
interval of a population proportion is binary; it can take only one of two values.
Data observations in the sample taken for a confidence interval of a population proportion are required to
be distributed according to the binomial distribution. Data that are binomially distributed are independent
of each other, binary (can assume only one of two states), and all have the same probability of assuming
the positive state.
A basic example of a confidence interval of a population proportion would be to create a 95-percent
confidence interval of the overall proportion of defective units produced by one production line based
upon a random sample of completed units taken from that production line. A sampled unit is either
defective or it is not. The 95-percent confidence interval is a range of values that has a 95-percent certainty of containing the proportion defective (the defect rate) of all of the production from that production line, based on a random sample taken from the production line.
The data sample used to create a confidence interval of a population proportion must be distributed
according to the binomial distribution. The confidence interval is created by using the normal distribution
to approximate the binomial distribution. The normal approximation of the binomial distribution allows for
the convenient application of the widely-understood z-based confidence interval to be applied to
binomially-distributed data.
The binomial distribution can be approximated by the normal distribution under the following two
conditions:
1) p (the probability of a positive outcome on each trial) and q (q = 1 – p) are not too close to 0 or 1.
2) np > 5 and nq > 5
The Standard Error and half the width of a confidence interval of proportion are calculated as follows:

Margin of Error = Half Width of C.I. = z Value_α,2-tailed * Standard Error


Margin of Error = Half Width of C.I. = NORM.S.INV(1 – α/2) * SQRT[ (p_bar * q_bar) / n]

Example of a Confidence Interval of a Population Proportion
in Excel
In this example a 95 percent confidence interval of a population proportion is created around a sample
proportion using the normal distribution to approximate the binomial distribution.
This example evaluates a group of shoppers who either prefer to pay by credit or by cash. A random
sample of 1,000 shoppers was taken. 70% of the sampled shoppers preferred to pay with a credit card.
The remaining 30% of the sampled shoppers preferred to pay with cash.
Determine the 95% Confidence Interval for the proportion of the general population that prefers to pay
with a credit card. In other words, determine the endpoints of the interval that is 95 percent certain to
contain the true proportion of the total shopping population that prefers to pay by credit card.

Summary of Problem Information


p_bar = sample proportion = 0.70
q_bar = 1 – p_bar = 1 – 0.70 = 0.30
p = population proportion = Unknown (This is what the confidence interval will contain.)
n = sample size = 1,000
α = Alpha = 1 – Level of Certainty = 1 – 0.95 = 0.05

SE = Standard Error = SQRT[ (p_bar * q_bar) / n]


SE = SQRT[ (0.70 * 0.30) / 1000] = 0.014491
As when creating all Confidence Intervals of Proportion, we must satisfactorily answer these two questions and then proceed to the two-step method of creating the Confidence Interval of Proportion.

The Initial Two Questions That Must be Answered Satisfactorily


What Type of Confidence Interval Should Be Created?
Have All of the Required Assumptions For This Confidence Interval Been Met?

The Two-Step Method For Creating Confidence Intervals of Proportion is the following:
Step 1) Calculate the Half-Width of the Confidence Interval (Sometimes Called the Margin of Error)
Step 2) Create the Confidence Interval By Adding to and Subtracting From the Sample Proportion Half the Confidence Interval’s Width

The Initial Two Questions That Need To Be Answered Before Creating a Confidence Interval of the Mean
or Proportion Are as Follows:

Question 1) Type of Confidence Interval?


a) Confidence Interval of Population Mean or Population Proportion?
This is a Confidence Interval of a population proportion because sampled data observations are binary:
they can take only one of two possible values. A shopper sampled either prefers to pay with a credit card
or prefers to pay with cash.
The data sample is distributed according to the binomial distribution because each observation has only
two possible outcomes, the probability of a positive outcome is the same for all sampled data
observations, and each data observation is independent from all others.
Sampled data points used to create a confidence interval of a population mean can take multiple values
or values within a range. This is not the case here because sampled data observations can have only two
possible outcomes: a sampled shopper either prefers to pay with credit card or with cash.

b) t-Based or z-Based Confidence Interval?


A Confidence Interval of proportion is always created using the normal distribution. The binomial
distribution of binary sample data is closely approximated by the normal distribution in certain conditions.
The next step in this example will evaluate whether the correct conditions are in place that permit the
approximation of the binomial distribution by the normal distribution.
It should be noted that the sample size (n) equals 1,000. At that sample size, the t distribution is nearly
identical to the normal distribution. Using the t distribution to create this Confidence Interval would
produce exactly the same result as the normal distribution produces.
This confidence interval will be a confidence interval of a population proportion and will be created using the normal distribution to approximate the binomial distribution of the sample data.

Question 2) All Required Assumptions Met?


Binomial Distribution Can Be Approximated By Normal Distribution?
The most important requirement of a Confidence Interval of a population proportion is the validity of
approximating the binomial distribution (that the sampled objects follow because they are binary) with the
normal distribution.
The binomial distribution can be approximated by the normal distribution if the sample size, n, is large enough and p is not too close to 0 or 1. This can be summed up with the following rule:
The binomial distribution can be approximated by the normal distribution if np > 5 and nq >5. In this case,
the following are true:
n = 1,000
p = 0.70 (p is approximated by p_bar)
q = 0.30 (q is approximated by q_bar)
np = 700 and nq = 300
It is therefore valid to approximate the binomial distribution with the normal distribution.
The binomial distribution has the following parameters:
Mean = np
Variance = npq

Each unique normal distribution can be completely described by two parameters: its mean and its standard deviation. As long as np > 5 and nq > 5, the following substitution can be made:
Normal (mean, standard deviation) approximates Binomial (n,p)
np is substituted for the normal distribution’s mean, and SQRT(npq), the square root of the binomial variance, is substituted for the normal distribution’s standard deviation, as follows:
Normal (mean, standard deviation)
becomes
Normal (np, SQRT(npq)), which approximates Binomial (n,p)
This can be demonstrated with Excel using data from this problem.
n = 1000
n = the number of trials in one sample
p = 0.7 (p is approximated by p_bar)
p = the probability of obtaining a positive result in a single trial
q = 0.3 (q is approximated by q_bar)
q = 1 - p
np = 700
npq = 210
SQRT(npq) = 14.49

at arbitrary point X = 700

(X equals the number of positive outcomes in n trials)
BINOM.DIST(X, n, p, FALSE) = BINOM.DIST(700, 1000, 0.7, FALSE) = 0.0275
The Excel formula to calculate the PDF (Probability Density Function) of the normal distribution at point X is the following:
NORM.DIST(X, Mean, Stan. Dev, FALSE)
The binomial distribution can now be approximated by the normal distribution in Excel by the following substitutions:
BINOM.DIST(X, n, p, FALSE) ≈ NORM.DIST(X, np, SQRT(npq), FALSE)
NORM.DIST(X, np, SQRT(npq), FALSE) = NORM.DIST(700, 700, 14.49, FALSE) = 0.0275
BINOM.DIST(X, n, p, FALSE) = BINOM.DIST(700, 1000, 0.7, FALSE) = 0.0275
The two values agree to four decimal places, so the approximation is very close at this point. The approximation also holds for the CDF (Cumulative Distribution Function): replacing FALSE with TRUE in the above formulas calculates the CDF instead of the PDF.
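A quick numerical check of this approximation in Python:

from scipy import stats

n, p = 1000, 0.70
mean = n * p                       # 700
sd = (n * p * (1 - p)) ** 0.5      # SQRT(210), about 14.49

print(stats.binom.pmf(700, n, p))                # ~0.0275
print(stats.norm.pdf(700, loc=mean, scale=sd))   # ~0.0275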
We now proceed to the two-step method for creating all Confidence Intervals of a population proportion.
These steps are as follows:
Step 1) Calculate the Width of Half of the Confidence Interval
Step 2) Create the Confidence Interval By Adding and Subtracting the Width of Half of the Confidence Interval from the Sample Proportion

Proceeding through the two steps is done as follows:

Step 1) Calculate the Half-Width of the Confidence Interval


Half the width of the Confidence Interval is sometimes referred to as the Margin of Error. The Margin of Error will always be measured in the same units as the sample proportion, which in this case is a percentage. Calculating the half-width of the Confidence Interval using the normal distribution is done as follows in Excel:
Margin of Error = Half Width of C.I. = z Value_α,2-tailed * Standard Error
Margin of Error = Half Width of C.I. = NORM.S.INV(1 – α/2) * SQRT[ (p_bar * q_bar) / n]
Margin of Error = Half Width of C.I. = NORM.S.INV(0.975) * SQRT[ (0.7 * 0.3) / 1000]
Margin of Error = Half Width of C.I. = 1.95996 * 0.014491
Margin of Error = Half Width of C.I. = 0.0284, which equals 2.84 percent

Step 2 Confidence Interval = Sample Proportion ± C.I. Half-Width


Confidence Interval = Sample Proportion ± (Half Width of Confidence Interval)
Confidence Interval = p_bar ± 0.0284
Confidence Interval = 0.70 ± 0.0284
Confidence Interval = [ 0.6716, 0.7284 ], which equals 67.16 percent to 72.84 percent
We now have 95 percent certainty that the true proportion of all shoppers who prefer to pay with a credit
card is between 67.16 percent and 72.84 percent.
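The same computation in Python, as a minimal sketch using this example’s summary statistics:

import math
from scipy import stats

# Shopper example: p_bar = 0.70 from a sample of n = 1,000
p_bar, n, alpha = 0.70, 1000, 0.05

se = math.sqrt(p_bar * (1 - p_bar) / n)       # standard error, ~0.014491
margin = stats.norm.ppf(1 - alpha / 2) * se   # half-width, ~0.0284

print(round(p_bar - margin, 4), round(p_bar + margin, 4))   # 0.6716, 0.7284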
A graphical representation of this confidence interval is shown as follows:

Min Sample Size to Limit Width of a Confidence
Interval of a Population Proportion
The larger the sample taken, the smaller the Confidence Interval becomes. That makes intuitive sense because the more sample information that is gathered, the more tightly the position of the population proportion can be defined. The Confidence Interval is an interval believed to contain the population proportion with a specific degree of certainty.
As sample size increases, the Confidence Interval shrinks because greater certainty has been attained. The margin of error, which is equal to half the width of the Confidence Interval, therefore shrinks as well.
During the design phase of a statistical experiment, sample size should be determined. Sampling has a
cost and additional sampling beyond what is necessary to attain a desired level of certainty is often
undesirable. One common objective of the design phase of a statistical test involving sampling is to
determine the minimum sample size required to obtain a specified degree of certainty.
This minimum sample size, n, can be derived by the following equation:
(Half-width of C.I.) = z Value_α,2-tailed * SQRT[ (p_est * q_est) / n ]
Estimates of the population parameters p and q must be used in this equation because the sample statistics p_bar and q_bar are not available, since a sample has not yet been taken.
(Half-width of C.I.) = NORM.S.INV(1 – α/2) * SQRT[ (p_est * q_est) / n ]
Squaring both sides gives the following:
(Half-width of C.I.)^2 = [NORM.S.INV(1 – α/2)]^2 * p_est * q_est / n
Further algebraic manipulation provides the following:
n = [NORM.S.INV(1 – α/2)]^2 * p_est * q_est / (Half-width of C.I.)^2
or, equivalently,
n = [NORM.S.INV(1 – α/2)]^2 * p_est * q_est / (Margin of Error)^2
The count of data observations in a sample, n, must be a whole number so n must be rounded up to the
nearest whole number. This is implemented in Excel as follows:
n = Roundup( [NORM.S.INV(1 – α/2)]^2 * p_est * q_est / (Half-width of C.I.)^2 )
p_est and q_est are estimates of the actual population parameters p and q. The most conservative estimate of the minimum sample size would use p_est = 0.50.
If p_est = 0.50, then q_est = 1 – p_est = 0.50.
The product p_est * q_est has its maximum value of 0.25 when p_est = 0.50. This maximum value of p_est * q_est produces the highest and therefore most conservative value of the minimum sample size, n.
If p is believed to be fairly close to 0.5, then p_est should be set at 0.5. If p is estimated to be significantly different from 0.5, p_est should be set to its estimated value.

Example 1 of Calculating Min Sample Size in Excel
Min Number of Voters Surveyed to Limit Poll Error Margin
Two candidates are running against each other in a national election. This election is considered fairly
even. What is the minimum number of voters who should be randomly surveyed to obtain a survey result
that has 95 percent certainty of being within 2 percent of the nationwide preference for either one of the
candidates?
p_est should be set at 0.5 since the election is considered even.
p_est = 0.5
q_est = 1 – p_est = 0.5
Half-width of the confidence interval = Margin of Error = 2 percent = 0.02
n = Roundup( [NORM.S.INV(1 – α/2)]^2 * p_est * q_est / (Half-width of C.I.)^2 )
n = Roundup( [NORM.S.INV(1 – 0.05/2)]^2 * 0.50 * 0.50 / (0.02)^2 )
n = Roundup( [NORM.S.INV(0.975)]^2 * 0.25 / (0.02)^2 )
n = Roundup(2400.912)
n = 2401
The preferences of at least 2,401 voters would have to be randomly surveyed to obtain a sample
proportion that has 95 percent certainty of being within 2 percent of the national voter preference for one
of the candidates.

Example 2 of Calculating Min Sample Size in Excel


Min Number of Production Samples to Limit Defect Rate Estimate Error Margin
A production line is estimated to have a defect rate of approximately 15 percent of all units produced on the line. What would be the minimum number of completed production units that should be randomly sampled for defects to obtain a sample proportion of defective units that has 95 percent certainty of being within 1 percent of the real defect rate of all units produced on that production line?
p_est should be set more conservatively than its estimate. The more conservative (closer to 0.5) that p_est is, the higher the required minimum sample size will be. The most conservative setting for p_est would be 0.5, so p_est should be set between its estimate of 0.15 and 0.5. A reasonable setting for p_est would be 0.25.
p_est = 0.25
q_est = 1 – p_est = 0.75
Half-width of the confidence interval = Margin of Error = 1 percent = 0.01
n = Roundup( [NORM.S.INV(1 – α/2)]^2 * p_est * q_est / (Half-width of C.I.)^2 )
n = Roundup( [NORM.S.INV(1 – 0.05/2)]^2 * 0.25 * 0.75 / (0.01)^2 )
n = Roundup( [NORM.S.INV(0.975)]^2 * 0.1875 / (0.01)^2 )
n = Roundup(7202.735)
n = 7203
At least 7,203 completed units should be randomly sampled from the production line to obtain a sample proportion defective that has 95 percent certainty of being within 1 percent of the actual proportion defective of all units produced on that production line. If p_est were set at 0.15 instead of the more conservative 0.25, the minimum sample size would have been reduced to 4,898.
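Both examples can be reproduced with a small Python function (the function name is my own):

import math
from scipy import stats

def min_sample_size_proportion(p_est, margin_of_error, confidence=0.95):
    # Minimum n so that a z-based confidence interval of a population
    # proportion has a half-width no larger than margin_of_error
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha / 2)   # same value as Excel NORM.S.INV(1 - alpha/2)
    return math.ceil(z ** 2 * p_est * (1 - p_est) / margin_of_error ** 2)

print(min_sample_size_proportion(0.50, 0.02))   # 2401 (Example 1)
print(min_sample_size_proportion(0.25, 0.01))   # 7203 (Example 2)
print(min_sample_size_proportion(0.15, 0.01))   # 4898 (less conservative estimate)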

Prediction Interval of a Regression Estimate in Excel
A prediction interval is a confidence interval about a Y value that is estimated from a regression equation.
A regression prediction interval is a value range above and below the Y estimate calculated by the
regression equation that would contain the actual value of a sample with, for example, 95 percent
certainty.
Calculating an exact prediction interval for any regression with more than one independent variable
(multiple regression) involves some pretty heavy-duty matrix algebra. Fortunately there is an easy short-
cut that can be applied to simple and multiple regression that will give a fairly accurate estimate of the
prediction interval.
The data and the Excel Regression output for that data are shown below:

This appears to be a valid linear regression because of the following:
1) Relatively high R Square (0.879)
2) Extremely low overall p value (Significance of F) = 2.465E-10
3) No patterns noticeable in the Residual graph, except maybe a bit of fanning out as X values increase.

4) Residuals have a reasonable resemblance to the normal distribution as per the following Excel
histogram:

This Excel histogram was created with the following input in the Excel Histogram dialogue box:

The formula for a prediction interval about an estimated Y value (a Y value calculated from the regression equation) is the following:
Prediction Interval = Y_est ± t-Value_α/2 * Prediction Error
Prediction Error = Standard Error of the Regression * SQRT(1 + distance value)
The Standard Error of the Regression is the yellow-highlighted cell in the Excel regression output titled Standard Error and has the value of 1588.4.
Distance value is a measure of the distance of the combination of values x_1, x_2, …, x_k from the center of the observed data. Calculating the distance value in any type of multiple regression requires some heavy-duty matrix algebra. This is given in Bowerman and O’Connell (1990).
Some software packages, such as Minitab, perform the internal calculations to produce an exact Prediction Error for a given Alpha. Excel does not. Fortunately there is an easy substitution that provides a fairly accurate estimate of the Prediction Interval. The following fact enables this:
The Prediction Error for a point estimate of Y is always slightly larger than the Standard Error of the Regression (the yellow-highlighted Standard Error in the above Excel regression output).
The Standard Error (highlighted in yellow in the Excel regression output) is used to calculate a confidence interval about the mean Y value. The Prediction Error is used to create a confidence interval about a predicted Y value. There will always be slightly more uncertainty in predicting an individual Y value than in estimating the mean Y value.
The Prediction Error is always slightly bigger than the Standard Error of a Regression. The Prediction Error can be estimated reasonably accurately by the following formula:
Prediction Error_est = (Standard Error of the Regression) * 1.1
Prediction Interval_est = Y_est ± t-Value_α/2 * Prediction Error_est
Prediction Interval_est = Y_est ± t-Value_α/2 * (Standard Error of the Regression) * 1.1
Prediction Interval_est = Y_est ± T.INV(1-α/2, df_Residual) * (Standard Error of the Regression) * 1.1
The t-value must be calculated using the degrees of freedom, df, of the Residual (highlighted in yellow in the Excel Regression output), which equals n – 2.
df_Residual = n – 2 = 20 – 2 = 18

Example of Prediction Interval in Excel


Create a 95 percent prediction interval about the value of Y when X = 1,000.
From the Excel Regression output, the regression equation is:
Y_est = 1045.70 + 2.05 * X
Plugging in the numbers from the problem…
Y_est = 1045.70 + 2.05 * 1,000
Y_est = 3099.67 (with a bit of rounding error, since the coefficients shown are rounded)
Prediction Interval_est = Y_est ± T.INV(1-α/2, df_Residual) * (Standard Error of the Regression) * 1.1
Prediction Interval_est = 3099.67 ± T.INV(0.975, 18) * (1588.4) * 1.1
Prediction Interval_est = 3099.67 ± 3670.8
Prediction Interval_est = [ -571.14, 6770.49 ]
This is a relatively wide prediction interval that results from the large Standard Error of the Regression (1588.4).
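A minimal Python sketch of this shortcut follows, using the rounded coefficients shown above; its endpoints differ slightly from the figures above because those use the unrounded coefficients from the regression output.

from scipy import stats

# Values from the example's regression output
b0, b1 = 1045.70, 2.05      # rounded regression coefficients
se_regression = 1588.4      # Standard Error of the Regression
df_residual = 18            # n - 2 = 20 - 2
alpha = 0.05

x = 1000
y_est = b0 + b1 * x                                # point estimate of Y
t_crit = stats.t.ppf(1 - alpha / 2, df_residual)   # same as Excel T.INV(0.975, 18)
half_width = t_crit * se_regression * 1.1          # shortcut: ~1.1 x Standard Error

print(round(y_est - half_width, 2), round(y_est + half_width, 2))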

It should be noted that a regression line should never be extrapolated beyond the input data values. In this case, Y values should be estimated from X values only within the range of 103 to 6,592. These are the lowest and highest input X values.

Correlation in Excel

Overview
Correlation analysis describes the strength of the relationship between two variables. A positive correlation means that two variables usually move in the same direction, i.e., when one goes up, the other usually goes up as well. A negative correlation means that the variables usually move in opposite directions, i.e., when one goes up, the other usually goes down. If changes in one variable can be closely estimated by changes in the other variable, the two variables have a high correlation.
If two variables have little or no correlation, there exists very little pattern between the movement of one
variable and the movement of the other variable.

Quick Indicator of a Correlation


The quickest way to see if a correlation exists between two variables is to plot them on an X-Y scatterplot graph. The graph needs to indicate a monotonic relationship between the two variables in order to conclude that there might be a correlation. A monotonic relationship is one in which one variable generally moves in only one direction (either always up or always down) as the other variable moves in a specific direction. In other words, when one variable goes up, the other variable nearly always goes up as well or nearly always goes down.
Correlations can have values from -1 to +1. The closer the correlation value is to +1, the more positively
correlated the two variables are. An X-Y scatterplot graph of two positively correlated variables looks like
this:

The closer the correlation value is to -1, the more negatively correlated the two variables are. An X-Y
scatterplot graph of two negatively correlated variables looks like this:

The closer the value of the correlation is to 0, the less correlated the two variables are. An X-Y scatterplot
graph of two variables with very little correlation looks like this:

Correlation Does Not Mean Causation
Using correlation to imply causation is probably the most frequently occurring incorrect use of statistics.
If data pairs X and Y are correlated, the following relationships are possible:
1) X causes Y
2) Y causes X
3) X and Y are consequences of a common cause, but do not cause each other;
4) There is no connection between X and Y; the correlation is coincidental.
Misinterpretation of correlation occurs when the correlation is interpreted to be the result of either point 1
or point 2 when in fact the underlying cause of the correlation was either point 3 or point 4. It is
commonplace to find occurrences of correlation incorrectly being used to imply causation in advertising
and political speeches.
It should be noted that while correlation does not mean causation, a causal relationship between the variables can often not be ruled out. Correlation often indicates that a relationship between two variables might exist that warrants further investigation.

Types of Data
Nominal data are categorical data whose order does not matter. Nominal data are merely name labels
that are only used to differentiate but not to indicate any ordering of the data.

Ordinal data are categorical data whose order matters but for which there is no specific measurement of the difference between values. A customer satisfaction scale or a Likert scale are examples of ordinal data.

Interval data are data whose difference between values is meaningful but the zero point is arbitrary.
Fahrenheit and Celsius temperature scales are interval data.

Ratio data are data whose difference between values is meaningful and the zero point indicates that
there is none of that variable. The absolute temperature scale is ratio data.

Pearson Correlation vs. Spearman Correlation


The two types of correlation most commonly used are the Pearson Correlation and the Spearman Correlation.
The Pearson Correlation is generally used when the relationship between two variables appears to be linear, there are not too many outliers, and both variables are interval or ratio but not ordinal.
The Spearman Correlation is generally used when the relationship between two variables appears to be nonlinear, there are many outliers, or at least one of the variables is ordinal.
An X-Y scatterplot graph of two variables whose correlation is linear looks like this:

An X-Y scatterplot graph of two variables whose correlation is nonlinear looks like this:

Pearson Correlation’s Six Required Assumptions
1) Both variables are either interval or ratio data.
2) The Pearson Correlation is most accurate when the variables are approximately normally distributed. Normality is not an absolute requirement for applying the Pearson Correlation, though. Normality can be checked quickly by creating a Normal Probability Plot of the input data.
3) The relationship is reasonably linear. This can be seen on an X-Y scatterplot.
4) Outliers are removed or kept to a minimum. Outliers can badly skew the Pearson correlation.
5) Each variable has approximately the same variance. In statistical terms, variables with the same variance are said to be homoscedastic. Variance in data sets can be compared using the nonparametric Levene’s Test and Brown-Forsythe Test. The F Test (available in Excel both as a function and as a Data Analysis tool) can be used to compare variance in data sets but is highly sensitive to non-normality of the data.
6) There is a monotonic relationship between the two variables.

Spearman Correlation’s Only Two Required Assumptions


1) The variables can be ratio, interval, or ordinal, but not nominal. Nominal variables are simply labels
whose order doesn’t mean anything. The Spearman Correlation is nonparametric, i.e., the test’s outcome
is not affected by the distributions of the data being compared.
2) There is a monotonic relationship between the two variables.

Interesting History of Both Correlations


The inventors of the two correlations, Karl Pearson and Charles Spearman, were both professors in
nearby universities in Europe at the beginning of the twentieth century. Each became the other’s arch-
enemy as a result of their feud over the principles of correlation. Karl Pearson went on to become much more famous and is credited with establishing the discipline of mathematical statistics. Further, the Pearson Correlation is more widely used in statistics than the Spearman Correlation, so it appears that Professor Pearson won the feud.

Pearson Correlation Coefficient, r, in Excel

Overview
Pearson’s Correlation Coefficient, r, is widely used as a measure of linear dependency between two
variables. Pearson’s Correlation Coefficient is also referred to as Pearson’s r or Pearson’s Product
Moment Correlation Coefficient.
r^2 is denoted as R Square and tells how well data points fit a line or curve. In simple linear regression, R Square is simply the square of the correlation coefficient between the dependent variable (the Y values) and the single independent variable (the X values). R Square represents the proportion of the total variance of the Y values that can be explained by the variance of the X values. R Square can assume values from 0 to +1.
Pearson’s Correlation Coefficient, r, can assume values from -1 to +1.
A value of +1 indicates that two variables have a perfect positive correlation. A perfect positive correlation
means that one of the variables moves exactly the same positive amount for each unit positive change in
the other variable. A scatterplot of linear data having a Pearson Correlation, r, near +1 is as follows:

An r value of -1 indicates that two variables have a perfect negative correlation. A perfect negative
correlation means that one of the variables moves exactly the same negative amount for each unit
positive change in the other variable. A scatterplot of linear data having a Pearson Correlation, r, near -1
is as follows:

An r value near 0 indicates very low correlation between two variables. The movements of one variable
have a very low correspondence with the movements of the other variable. A scatterplot of linear data
having a Pearson Correlation, r, near 0 is as follows:

Pearson Correlation’s Six Required Assumptions
1) Both variables are either interval or ratio data.
2) The Pearson Correlation is most accurate when the variables are approximately normally distributed.
Normality is not an absolute requirement for applying the Pearson Correlation, though some texts state
that it is. Normality of the input data can be checked quickly with a Normal Probability Plot.
3) The relationship is reasonably linear. This can be seen on an X-Y scatterplot.
4) Outliers are removed or kept to a minimum. Outliers can badly skew the Pearson correlation.
5) Each variable has approximately the same variance. In statistical terms, variables with the same
variance are said to be homoscedastic. Variances of data sets can be compared using the
nonparametric Levene’s Test and Brown-Forsythe Test. The F Test (available in Excel both as a
function and as a Data Analysis tool) can also be used to compare variances in data sets but is highly
sensitive to non-normality of data.
6) There is a monotonic relationship between the two variables.
Pearson’s Correlation can be applied to a population or to a sample.

Pearson Correlation Formulas


Pearson’s Correlation when applied to a population is referred to as the Population Pearson’s Correlation
Coefficient or simply the Population Correlation Coefficient. The Population Pearson Correlation
Coefficient is designated by the symbol ρ (Greek letter “rho”) and is calculated by the following formula:

ρ = σxy / (σx * σy)

σxy is the population covariance between x and y, and σx and σy are the population standard deviations of x and y.
Pearson’s Correlation when applied to a sample is referred to as the Sample Pearson’s Correlation
Coefficient or simply the Sample Correlation Coefficient. The Sample Pearson Correlation Coefficient
is designated by the symbol r or rxy and is equal to the sample covariance between two variables divided
by the product of their sample standard deviations as given by the following formula:

r = rxy = sxy / (sx * sy)
sxy is the Sample Covariance between variables x and y and is calculated by the following formula:

sxy = Σ(Xi – x_bar)(Yi – y_bar) / (n – 1)
sx is the Sample Standard Deviation of variable x and is calculated by the following formula:

sx = SQRT( Σ(Xi – x_bar)^2 / (n – 1) )
sy is the Sample Standard Deviation of variable y and is calculated by the following formula:

sy = SQRT( Σ(Yi – y_bar)^2 / (n – 1) )
Example of Pearson Correlation in Excel

Step 1 – Create a Scatterplot of the Data


Before calculating the Pearson Correlation between two variables, it is a good idea to create an X-Y
scatterplot to determine if there appears to be a linear relationship between the two. Following is an
example of creating an Excel scatterplot of a sample of X-Y data. The chart type in Excel is an X-Y
scatterplot with only markers using Chart Layout 3, Style 2. A Least-Squares Line is created using Chart
Layout 3.
The chart appears as follows:

The scatterplot chart shows a strong linear relationship between the two variables X and Y. The Pearson
correlation would be the correct choice to determine the correlation between the two variables.

Step 2 – Calculate r in Excel With One of Three Methods
The Pearson Sample Correlation Coefficient, rxy, can be calculated using any of the three following
methods in Excel:

1) Data Analysis Correlation Tool – This tool can also be used to create a correlation matrix between
more than two variables. An example of this will be performed later in this section.

2) Correlation Formula – The CORREL function, which is the following:

CORREL(array1, array2)

3) Covariance Formula – The sample covariance between two variables divided by the product of their
sample standard deviations, as given by the following formula:

COVARIANCE.S(array1, array2)/(STDEV.S(array1)*STDEV.S(array2))
These three methods are implemented in Excel as follows:
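As a minimal sketch (the cell ranges here are hypothetical, not taken from the worksheet shown), if the X values were in cells A2:A8 and the Y values in cells B2:B8, methods 2 and 3 would be entered as:

=CORREL(A2:A8,B2:B8)

=COVARIANCE.S(A2:A8,B2:B8)/(STDEV.S(A2:A8)*STDEV.S(B2:B8))

Both formulas should return the same value of r.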

Step 3 - Determine Whether r Is Significant
After calculating the Pearson Correlation Coefficient, r, between two data sets, the significance of r should
be checked. If r has been calculated based upon just a few pairs of numbers, it is difficult to determine
whether this calculated correlation really exists between the two sets of numbers or if that calculated r is
just a random occurrence because there are so few data pairs.
On the other hand, if the r is calculated from a large number of data pairs, the certainty level is much
higher that the calculated correlation r really does exist between the two sets of numbers.
There are two equivalent ways to determine whether or not the calculated r should be considered
significant at a given α. These two methods are the following:
a) Calculate the p value and compare it to the specified α
b) Calculate r Critical and compare it to r

Calculate p Value
To find the p Value for a given r and sample size, use the following formula:

p Value = 1 - F.DIST( ((n-2)*r^2)/(1-r^2), 1, n-2, TRUE )


df = n - 2
n = number of X-Y data pairs
The p value can be directly compared to Alpha to determine if the calculated correlation coefficient is
statistically significant.
For example, if Alpha is set to 0.05, the p Value must be less than 0.05 to be considered statistically
significant. If the p Value is less than 0.05, you can be at least 95% certain that the calculated correlation
value was not a random event.
The calculation in Excel for this example is performed as follows:
p Value = 0.0008 =1-F.DIST(((7-2)*0.9544^2)/(1-0.9544^2),1,7-2,TRUE)
The p Value of 0.0008 is much less than alpha (0.05). This indicates that r is significant.

Calculate r Critical
r Critical is the minimum value of r that would be considered significant for a given sample size and alpha
level. r Critical is usually looked up on a chart but can be calculated directly with the following Excel
formula:

r Critical = T.INV(1 – α/2, n – 2) / SQRT( (T.INV(1 – α/2, n – 2))^2 + n – 2 )
For a small number of data pairs, the calculated r must be very high to be reasonably certain that this
calculated correlation really does exist between the two variables and is not just a random occurrence.
The calculation in Excel is performed as follows:

r Critical = 0.7545 =(T.INV(1-0.05/2,7-2))/SQRT((T.INV(1-0.05/2,7-2))^2+7-2)
r Critical(α= 0.05, df = n-2 =5) = 0.7545
The correlation coefficient r (0.9544) is much greater than r Critical (0.7545). This indicates that r is
significant.

Comparing Chart Values of r Critical and p value in Excel with Calculated Values
Charts containing r Critical values list the following r Critical value for α = 0.05 and sample size n = 10:
r Critical(α= 0.05, df = n-2 =8) = 0.632
r Critical and the p value will now be calculated by the formulas to verify that chart values for r Critical
match those calculated with the formulas.

Calculating r Critical with the Formula


Plugging values α = 0.05 and df = 8 into the r Critical formula produces the following result:

The calculation in Excel is performed as follows:


r Critical =(T.INV(1-0.05/2,10-2))/SQRT((T.INV(1-0.05/2,10-2))^2+10-2) = 0.632

Calculating p Value With the Formula


The p Value for the r Critical with df = n – 2 = 8 should be 0.05. Plugging that r Critical and df value into
the p value formula produces the following result:

The calculation in Excel is performed as follows:


p Value =1-F.DIST(((10-2)*0.632^2)/(1-0.632^2),1,10-2,TRUE) = 0.05

The value of r Critical for Alpha = 0.05 equals 0.632. This agrees with the value calculated with the r
Critical formula.

Performing Correlation Analysis On More Than Two Variables
As mentioned, the Data Analysis Correlation tool can be used to create a correlation matrix if there are
more than two variables. An example of creating a correlation matrix between three variables is shown as
follows:

Each r must be evaluated separately to determine if that r is significant. A correlation coefficient r is
significant if its calculated p Value is less than alpha or, equivalently, if the r is greater than r Critical. The
p value and r Critical are calculated in the same way as before with the following formulas:

Spearman Correlation Coefficient, rs, in Excel

Overview
The Spearman Correlation Coefficient is designated by either rs or by the Greek letter ρ, “rho.” As
mentioned, the Spearman correlation should be used instead of the Pearson correlation in any of the
following circumstances:
1) An X-Y scatterplot of the data indicates that there is a nonlinear monotonic relationship between two
variables. Monotonic simply means that one variable generally goes in one direction (either always up or
always down) when the other variable moves in one direction.
2) There are significant outliers. The Pearson Correlation is very sensitive to outliers. The Spearman
Correlation is not because the Spearman Correlation bases its calculation on the ranks and not the mean
(as the Pearson Correlation does).
3) At least one of the variables is ordinal. An ordinal variable is one in which the order matters but the
difference between values does not have meaning. A customer satisfaction scale or a Likert scale are
examples of ordinal data. Satisfaction scales and Likert scales can be analyzed as interval data if the
distance between values is considered to be the same. A Pearson correlation can be used if the variables
are either interval or ratio but cannot be used if any of the variables are ordinal.

Spearman Correlation Formula


The Spearman Correlation Coefficient is defined as the Pearson Correlation Coefficient between ranked
variables. The Spearman Correlation is sometimes called the Spearman Rank-Order Correlation or
simply Spearman’s rho (ρ) and is calculated as follows:

rs = sxy / (sx * sy), where the covariance and standard deviations are computed on the ranks x and y
rather than on the raw data values.
For a sample of n (X-Y) data pairs, each Xi,Yi are converted to ranks xi,yi that appear in the preceding
formula for Spearman’s rho.

Tied Data Values


Tied values of X or Y are assigned the average rank of the tied values. For example, if the 2nd, 3rd, and
4th X values were all equal to 19, the rank assigned to each would be 3. This is the average rank, which would
be calculated as follows:
Average rank = (Sum of ranks)/(Number of ranks) = (2 + 3 + 4)/3 = 3
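A sketch of this in Excel (the range here is hypothetical): if the seven X values were in cells A2:A8 and the three tied 19s were the 2nd, 3rd, and 4th smallest values, then =RANK.AVG(A3,$A$2:$A$8,1) entered beside each tied value would return the average rank of 3 for all three of them.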

No Ties Among Data Values


If there are no tied values of X or Y, the following simpler formula can be used to calculate Spearman’s
rho:

rs = 1 – (6 * Σdi^2) / (n * (n^2 – 1))

di is the difference between the two ranks of each data pair.
Spearman Correlation’s Only Two Required Assumptions
1) The variables can be ratio, interval, or ordinal, but not nominal. Nominal variables are simply labels
whose order doesn’t mean anything. The Spearman Correlation is nonparametric, i.e., the test’s outcome
is not affected by the distributions of the data being compared.
2) There is a monotonic relationship between the two variables.

Example of Spearman Correlation in Excel


The Spearman Correlation Coefficient will be calculated for the following data:

Step 1 – Plot the Data to Check For a Monotonic Relationship

A monotonic relationship exists if one variable generally moves in a single direction (either increasing or
decreasing) as the other variable moves in a single direction. A monotonic relationship does not imply
linearity. A monotonic relationship appears to exist between the X and Y variables: X values generally
increase as Y values increase.

Step 2 – Check For Tied X or Y Values


Checking a column of data for tied values can be automated in Excel. The cell U8 has the following
formula:
=IF(SUM(IF(FREQUENCY($S$9:$S$15,$S$9:$S$15)>0,1))=COUNT($S$9:$S$15),
"There Are No Tied Values","There Are Tied Values")
(In older versions of Excel this must be entered as an array formula with Ctrl+Shift+Enter.)
Cell C18 contains a similar formula.

No Tied Values

Step 3 – Calculate the Ranks of the X and Y Values


This can be done in a single step in Excel with the RANK.AVG() formula as follows:

Step 4 – Calculate the Sum of the Square of the Rank Differences

Step 5 – Calculate rs
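A minimal sketch of this calculation, assuming the X and Y ranks from Step 3 sit in hypothetical ranges C9:C15 and D9:D15: the SUMXMY2 function returns Σ(x – y)^2 in a single step, so for the n = 7 data pairs here rs could be computed as:

=1 - 6*SUMXMY2(C9:C15,D9:D15)/(7*(7^2-1))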

Step 6 – Determine If rs Is Significant

Method 1 – Compare rs to r Critical

rs is not significant at α = 0.05 because rs (0.6786) is less than r Critical (0.7545).

Method 2 – Compare the p value to Alpha

rs is not significant at α = 0.05 because the p Value (0.0536) is greater than Alpha (0.05). This rs would
be significant at α = 0.10 but not at α = 0.05.

If There Are Any Tied X or Y Values

Step 3 – Calculate rs
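When ties are present, the simpler Σd^2 formula no longer applies exactly; rs is instead calculated as the Pearson correlation of the two rank columns, per the definition given earlier. A sketch with the same hypothetical rank ranges as before:

=CORREL(C9:C15,D9:D15)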

Step 4 – Determine If rs Is Significant

Method 1 – Compare rs to r Critical

rs is not significant at α = 0.05 because rs (0.7339) is less than r Critical (0.7545). This rs would be
significant at α = 0.10 but not at α = 0.05.

Method 2 – Compare the p value to Alpha

rs is not significant at α = 0.05 because the p Value (0.0536) is greater than Alpha (0.05). This rs would
be significant at α = 0.10 but not at α = 0.05.

Two Different Methods Used to Calculate rs Critical Values


There is slight disagreement in the statistical community about how to calculate rs Critical Values.
Some use a table of Critical rs values. This table of values was created in 1938 in the journal The Annals
of Mathematical Statistics.
Others use the formula for Critical rs as was done in this example. This formula is once again shown here
as follows:

r Critical = T.INV(1 – α/2, n – 2) / SQRT( (T.INV(1 – α/2, n – 2))^2 + n – 2 )

The results are quite close. As sample size increases, the results of both methods converge. This is shown
in the following comparison with α set to 0.05:

Covariance, sxy, in Excel

Overview
Covariance is a measure of how much two random variables change together. sxy is the Sample
Covariance between variables x and y and is calculated by the following formula:

sxy = Σ(Xi – x_bar)(Yi – y_bar) / (n – 1)
Sample covariance is calculated in Excel with the following formula:


COVARIANCE.S(array1, array2)
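A quick sketch with hypothetical ranges: if the X values were in A2:A21 and the Y values in B2:B21, the sample covariance would be =COVARIANCE.S(A2:A21,B2:B21). The population version, =COVARIANCE.P(A2:A21,B2:B21), divides by n instead of by n – 1.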
A Covariance matrix can be created between more than two variables using the Covariance Data
Analysis tool in a similar manner as the Correlation Data Analysis tool would be used to create a
correlation matrix.
A positive covariance between two variables indicates that both variables tend to move in the same
direction. A negative covariance between two variables indicates that both variables tend to move in
opposite directions. A covariance near zero indicates that the two variables have very little tendency to move together.
Covariance is used much less often than correlation to describe the degree of relationship between two
random variables because the magnitude of the covariance is difficult to interpret. Covariance values of
data sets using different units of measure are not comparable.
The Pearson correlation can be used to compare data sets whose units of measure are different because
the Pearson correlation coefficient is the normalized version of covariance. The values of the Pearson
correlation coefficient are all between -1 and +1 and are therefore directly comparable. The dimensions or units
of measure are removed from the problem by using the Pearson correlation instead of the covariance to
express the degree of relationship between two random variables.
The Pearson sample correlation coefficient is calculated by dividing the sample covariance by the product
of the sample standard deviations as follows:

r = sxy / (sx * sy)
Variance is a special case of covariance when the two variables are equal.

Using Covariance To Calculate a Line’s Slope and Y-Intercept


The Least-Squares Line that most accurately describes the linear relationship between two random
variables X and Y is given by the following equation:
Yi = b0 + b1*Xi
The slope of this line, b1, can be calculated using the covariance between the two variables as follows:
b1 = sxy / (sx)^2
The Y-intercept of this line, b0, can be calculated as follows:
b0 = y_bar - b1*x_bar
x_bar is the mean x value and y_bar is the mean y value.
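A sketch of these two calculations in Excel, using hypothetical ranges with the X values in A2:A21, the Y values in B2:B21, and b1 placed in cell D1:

b1: =COVARIANCE.S(A2:A21,B2:B21)/VAR.S(A2:A21)
b0: =AVERAGE(B2:B21) - D1*AVERAGE(A2:A21)

Excel’s built-in =SLOPE(B2:B21,A2:A21) and =INTERCEPT(B2:B21,A2:A21) should return the same values and make a convenient cross-check.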

An example of calculating the slope and Y-intercept of a least-squares line of an X-Y data set using the
covariance is shown as follows:

Single-Variable Linear Regression in Excel

Overview
Linear regression is a statistical technique used to model the relationship between one or more
independent, explanatory variables and a single dependent variable. The linear regression type is
classified as Simple Linear Regression if there is only a single explanatory variable. The regression type
is classified as Multiple Linear Regression if there is more than one explanatory variable.

The Regression Equation


The end result of linear regression is a linear equation that models actual data as closely as possible.
This equation is called the Regression Equation. The more linear the relationship is between each of the
explanatory variables and the single dependent variable, the more closely the Regression Equation will
model the actual data.
In the Regression Equation, the variable Y is usually designated as the single dependent variable. The
independent explanatory variables are usually labeled X1, X2, …, Xk.
The Regression Equation for multiple regression appears as follows:
Y = b0 + b1X1 + b2X2 + … + bkXk
The Regression Equation for simple regression appears as follows:
Y = b0 + b1X
b0 is the Y-intercept of the Regression Equation.
b1, b2, …, bk are the coefficients of the independent variables.
The most important part of regression analysis is the calculation of b0, b1, b2, …, bk in order to be able to
construct the Regression Equation
Y = b0 + b1X for simple regression
or
Y = b0 + b1X1 + b2X2 + … + bkXk for multiple regression.

Purposes of Linear Regression


Linear regression, whether simple or multiple, generally has two main uses. They are as
follows:
1) To quantify the linear relationship between the dependent variable and the independent variable(s) by
calculating a regression equation.
2) To quantify how much of the movement or variation of the dependent variable is explained by the
independent variable(s).

The Inputs For Linear Regression


The input data for linear regression analysis consists of a number of data records each having a single Y
(dependent variable) value and one or more X (explanatory independent variable) values. Simple
regression has only a single X value. Multiple regression has more than one X (independent) variable for
each Y (dependent) variable.

Each data record occupies its own unique row in the regression input. Each data record contains the
specific values of the input (independent) X variables that are associated with a specific value of the
dependent Y variable shown in that data record.
The input data for multiple regression analysis appear as separate data records on each row as follows:

Y X1 X2 … Xk

4 6 10 …15
5 7 11 …16
6 8 12 …17
7 9 13 …18
8 10 14 …19

Simple Linear Regression


Simple regression has only a single X (independent) variable. Simple linear regression is sometimes
called bivariate linear regression. Simple linear regression uses a single independent variable (X) known
as the explanatory, predictor, or regressor variable. The single dependent variable (Y) is the target or
outcome variable.
Simple linear regression requires that both the dependent variable and the independent variable be
continuous. If ordinal data such as a Likert scale is used as a dependent or independent variable, it must
be treated as a continuous variable that has equal distance between values. Ordinal data is normally
defined as data whose order matters but not the differences between values.
The input data for simple linear regression analysis appear as separate data records on each row as
follows:

Y X

4 6
5 7
6 8
7 9
8 10

Null and Alternative Hypotheses


The Null Hypothesis of linear regression states that the coefficient(s) of the independent variable(s) in the
regression equation equal(s) zero. The Alternative Hypothesis for linear regression therefore states that
these coefficients do not equal zero.
For multiple linear regression this Null Hypothesis is expressed as follows:
H0: b1 = b2 = … = bk = 0
For simple linear regression this Null Hypothesis is expressed as follows:

H0: b1 = 0
b1 is the slope of the regression line for simple regression.
The Alternative Hypothesis, H1, for linear regression states that these coefficients do not equal zero.
The Y Intercept b0 is not included in the Null Hypothesis.

X and Y Variables Must Have a Linear Relationship


Linear regression is a technique that provides accurate information only if a linear relationship exists
between the dependent variable and each of the independent variables. Independent variables that do
not have a linear relationship with the dependent variable should not be included as inputs. An X-Y
scatterplot of each independent variable against the dependent variable provides a good
indication of whether the relationship is linear.
When data are nonlinear, there are often two solutions available to allow regression analysis to be performed.
They are the following:
1) Transform the nonlinear data to linear using a logarithmic transformation. This will not be discussed in
this section.
2) Perform nonlinear regression on the data. One way to do that is to apply curve-fitting software that will
calculate the mathematical equation that most closely models the data. Another section in this book
focuses on using the Excel Solver to fit a curve to nonlinear data. For linear data, the least-squares
method employed in this section is the simplest approach.

Do Not Extrapolate Regression Beyond Existing Data


The major purpose of linear regression is to create a Regression Equation that accurately predicts a Y
value based on a new set of independent, explanatory X values. The new set of X values should not
contain any X values that are outside of the range of the X values used to create the original regression
equation. The following simple example illustrates why a Regression Equation should not be extrapolated
beyond the original X values.

Example of Why Regression Should Not Be Extrapolated


Imagine that the height of a boy was measured every month from when the boy was one year old until the
boy was eighteen years old. The independent, explanatory X variable would be the month number (12
months to 216 months) and the dependent Y variable would be the height measured in inches. Typically,
most boys stop growing in height when they reach their upper teens.
If the Regression Equation was created from the above data and then extrapolated to predict the boy’s
height when he reached 50 years of age, the Regression Equation might predict that the boy would be
fifteen feet tall.

Linear Regression Should Not Be Done By Hand


Excel provides an excellent data analysis regression tool that can perform simple or multiple regression
with equal ease. Doing the calculations by hand would be very tedious and provide lots of opportunities to
make a mistake. Excel produces a very detailed output when the regression tool is run. I have recreated all
of the simple regression calculations that Excel performs in this chapter. It will probably be clear from
viewing this that it is wise to let Excel do the regression calculations. A number of statistics textbooks
probably place too much emphasis on teaching the ability to perform the regression calculations by hand. In
the real world regression analysis would never be done manually.
The best way to understand simple linear regression is to perform an example as follows:

Complete Example of Simple Linear Regression in Excel


A company has a large plastic injection molding machine. The company would like to create an equation
that will calculate the number of identical plastic parts that would be produced for a specified quantity of
input plastic pellets.
The company conducted 21 independent production runs on the machine. In each case a different-sized
batch of plastic pellets was input into the machine and the total number of identical plastic parts
produced from each batch was recorded.
If the relationship between quantity of input plastic pellets in each batch and the number of parts
produced from each batch is linear, calculate the equation that describes that relationship.
It is important to note that all trial runs were performed as identically as possible. The same operator ran
the machine on each trial run at approximately the same time during a shift. The machine was calibrated
to the same settings and cleaned prior to each trial run and input plastic pellets from the same batch were
used in all 21 trial runs.
The data from the 21 trial runs are as follows:

Step 1 – Remove Extreme Outliers
Calculation of the mean is one of the fundamental computations when performing linear regression
analysis. The mean is unduly affected by outliers. Extreme outliers should be removed before beginning
regression analysis. Not all outliers should be removed. An outlier should be removed if it is obviously
extreme and inconsistent with the remainder of the data.
At this point in the beginning of the analysis, the objective is to remove outliers that are obviously
extreme. After Excel has performed the regression and calculated the residuals, further analysis will be
performed to determine if any of the data points can also be considered outliers based upon any
unusually large residual terms generated. A data point is often considered to be an outlier if its residual
value is more than three standard deviations from the mean of the residuals.

Sorting the Data To Quickly Spot Extreme Outliers


An easy way to quickly spot extreme outliers is to sort the data. Extremely high or low outlier values will
appear at the ends of the sort. A convenient, one-step method to sort a column of data in Excel is shown
here.
The formula in cell D3 is the following:
=IF($A3="","",LARGE($A$3:$A$23,ROW()-ROW($C$2)))
Copy this formula down as shown to create a descending sort of the data in cells A3 to A23.
Exchanging the word SMALL for LARGE would create an ascending sort instead of the descending sort
performed here.

The lowest Y value, 9, is obviously an extreme outlier and is very different from the rest of the data. The
cause of this extreme outlier value is not known. Perhaps something unexpected happened during this
production run? It is clear that this value should be removed from the analysis because it would severely
skew the final result.
Removing this outlier from the data produces this set of 20 data samples:

Step 2 – Create a Correlation Matrix


This step is only necessary when performing multiple regression, i.e., linear regression that has more
than one independent variable, not single variable regression as we are doing here. The purpose of this
step is to identify independent variables that are highly correlated. Different input variables that are highly
correlated cause an error called multicollinearity. There is no need to check for correlated independent
variables when performing single-variable regression as we are doing here because there is only one
independent variable. This step should always be carried out when performing multiple regression. When
highly correlated pairs of independent variables are found, one of the variables of the pair should be
removed from the regression.

Step 3 – Scale Variables If Necessary
All variables should be scaled so that each has a similar number of decimal places beyond zero. This
limits rounding error and also ensures that the slope of the fitted line will be a convenient size to work with
and not too large or too small.
The weight of the input pellets is measured in grams. If these weights were specified in kilograms, both
variables would be presented in much closer scales. Changing the scale of the incoming pellet weight
from grams to kilograms provides the following properly-scaled data:

Step 4 – Plot the Data
The purpose of this step is to check for linearity. Each independent variable should be plotted against the
dependent variable in a scatterplot graph. Linear regression should only be performed if linear
relationships exist between the dependent variable and each of the input variables. An Excel X-Y
scatterplot of the two X-Y variables is shown as follows. The relationship appears to be a linear one.

Step 5 – Run the Regression Analysis
Below is the Regression dialogue box with all necessary information filled in. Many of the required
regression assumptions concerning the Residuals have not yet been validated. At this point the
regression is being run in Excel to calculate the Residuals in order to analyze them. Further analysis of
the Excel regression output should take place only after linear regression’s required assumptions
concerning the Residuals have been evaluated.
Calculating the Residuals as part of the Excel regression output is specified in the Excel regression
dialogue box as follows:

It should be noted that the Residuals are sometimes referred to as the Error terms. The checkbox next to
Residuals should be checked in order to have Excel automatically calculate the residual for each data
point. The residual is the difference between the actual data point and its value as predicted by the
regression equation. Analysis of the residuals is a very important part of linear regression analysis
because a number of required assumptions are based upon the residuals.
The checkbox next to Standardized Residuals should also be checked. If this is checked, Excel will
calculate the number of standard deviations that each residual value is from the mean of the residuals.
Data points are often considered outliers if their residual values are located more than three standard
deviations from the residual mean.
The checkbox next to Residual Plots should also be checked. This will create graphs of the residuals
plotted against each of the input (independent) variables. Visual observation of these graphs is an
important part of evaluating whether the residuals are independent. If the residuals show patterns in any
graph, the residuals are considered to not be independent and the regression should not be considered
valid. Independence of the residuals is one of linear regression’s most important required assumptions.

The checkbox next to Line Fit plots should be checked as well. This will produce graphs of the Y Values
plotted against each X value in a separate graph. This provides visual analysis of the spread of each
input (X) variable and any patterns between any X variable and the output Y variable.
The checkbox for the Normal Probability Plot was not checked because that produces a normal
probability plot of the Y data (the dependent variable data). A normal probability plot is used to evaluate
whether data is normally-distributed. Linear regression does not require the independent or dependent
variable data be normally-distributed. Many textbooks incorrectly state that the dependent and/or
independent data need to be normally-distributed. This is not the case.
Linear regression does however require that the residuals be normally-distributed. A normal probability
plot of the residuals would be very useful to evaluate the normality of the residuals but is not included as
a part of Excel’s regression output.
A normal probability plot of the Y data does not provide any useful information and the checkbox that
would produce that graph is therefore not checked. It is unclear why Excel includes that functionality with
its regression data analysis tool.
Those settings shown in the preceding Excel regression dialogue box produce the following output:

The Excel regression output includes the calculation of the Residuals as specified. Linear regression’s
required assumptions regarding the Residuals should be evaluated before analyzing any other part of the
Excel regression output. The required Residual assumptions must be verified before the regression
output is considered valid.
The Residual output includes each Dependent variable’s predicted value, its Residual value (the
difference between the predicted value and the actual value), and the Residual standardized value (the
number of standard deviations that the Residual value is from the mean of the Residual values). This
Residual output is shown as follows:

The following graphs were also generated as part of the Excel regression output:

Step 6 – Evaluate the Residuals
The purpose of Residual analysis is to confirm the underlying validity of the regression. Linear regression
has a number of required assumptions about the residuals. These assumptions should be evaluated
before continuing the analysis of the Excel regression output. If one or more of the required residual
assumptions are shown to be invalid, the entire regression analysis might be considered, at best,
questionable or, at worst, invalid. The residuals should therefore be analyzed first before analyzing any
other part of the Excel regression output.
The Residuals are sometimes called the Error Terms. The Residual is the difference between an
observed data value and the value predicted by the regression equation. The formula for the Residual is
as follows:
Residual = Yactual – Yestimated

Linear Regression’s Required Residual Assumptions


Linear regression has several required assumptions regarding the residuals. These required residual
assumptions are as follows:
1) Outliers have been removed.
2) The residuals must be independent of each other. They must not be correlated with each other.
3) The residuals should have a mean of approximately 0.
4) The residuals must have similar variances throughout all residual values.
5) The residuals must be normally-distributed.
6) The residuals must not be highly correlated with any of the independent (X) variables.
7) There must be enough data points to conduct normality testing of residuals.
Here is how to evaluate each of these assumptions in Excel.

Locating and Removing Outliers
In many cases a data point is considered to be an outlier if its residual value is more than three standard
deviations from the mean of the residuals. Checking the checkbox next to Standardized Residuals in the
regression dialogue box calculates the standardized value of each residual, which is the number of standard
deviations that the residual is from the residual mean. Below once again is the Excel regression output
showing the residuals and their distance in standard deviations from the residual mean.
Following are the standardized residuals of the current data set. None are larger in absolute value than
1.69 standard deviations from the residual mean.

A data point is often considered an outlier if its residual value is more than three standard deviations from
the residual mean. The following Excel output shows that none of the residuals are more than 1.69
standard deviations from the residual mean. On that basis, no data points are considered outliers as a
result of having excessively large residuals.
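A quick way to flag such points automatically (a sketch, assuming the standardized residuals were placed in a hypothetical range E2:E21):

=IF(ABS(E2)>3,"Possible outlier","")

copied down the column beside the standardized residuals.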
Any outliers that have been removed should be documented and evaluated. Outliers more than 3
standard deviations from the mean are to be expected occasionally for normally-distributed data. If an
outlier appears to have been generated by the normal process and not be an aberration of the process,
then perhaps it should not be removed. One item to check is whether a data entry error occurred when
inputting the data set. Another item that should be checked is whether there was a measurement error
when that data point’s parameters were recorded.
If a data point is removed, the regression analysis has to be performed again on the new data set that
does not include that data point.

Determining Whether Residuals Are Independent
This is the most important residual assumption that must be confirmed. If the residuals are not found to
be independent, the regression is not considered to be valid.
If the residuals are independent of each other, a graph of the residuals will show no patterns. The
residuals should be graphed across all values of the dependent variable. The Excel regression output
produced individual graphs of the residuals across all values of each independent variable, but not across
all values of the dependent variable. This graph is not part of Excel’s regression output and needs to be
generated separately.
An Excel X-Y scatterplot graph of the Residuals plotted against all values of the dependent variable is
shown as follows:

Residuals that are not independent of each other will show patterns in a Residual graph. No patterns
among Residuals are evidenced in this Residual graph so the required regression assumption of Residual
independence is validated.
It is important to note that an upward or downward linear trend appearing in the Residuals probably indicates
that an independent (X) variable is missing. The first hint that this might be occurring is a Residual
mean that does not equal approximately zero.

Determining If Autocorrelation Exists


An important part of evaluating whether the residuals are independent is to calculate the degree of
autocorrelation that exists within the residuals. If the residuals are shown to have a high degree of
correlation with each other, the residuals are not independent and the regression is not considered valid.
Autocorrelation often occurs with time-series or any other type of longitudinal data. Autocorrelation is
evident when data values are influenced by the time interval between them. An example might be a graph
of a person’s income. A person’s level of income in one year is likely influenced by that person’s income
level in the previous year.
The degree of autocorrelation existing within a variable is calculated by the Durbin-Watson statistic, d.
The Durbin-Watson statistic can take values from 0 to 4. A Durbin-Watson statistic near 2 indicates very
little autocorrelation among the residuals. This, along with no apparent patterns in the Residuals, would
confirm the independence of the Residuals.
Values near 0 indicate strong positive autocorrelation: successive values are similar to each other and
appear to follow one another. Values near 4 indicate strong negative autocorrelation: successive values
are opposite of each other in an alternating pattern.
The data used in this example is not time series data but the Durbin-Watson statistic of the Residuals will
be calculated in Excel to show how it is done. Before calculating the Durbin-Watson statistic, the data
should be sorted chronologically. The Durbin-Watson statistic for the Residuals would be calculated in Excel as
follows:

d = Σ(ei – ei-1)^2 / Σ(ei)^2
SUMXMY2(x_array,y_array) calculates the sum of the square of (X – Y) for the entire array.
SUMSQ(array) squares the values in the array and then sums those squares.
If the Residuals are in cells C1:C50, then the Excel formula to calculate the Durbin-Watson statistic for
those Residuals is the following:
SUMXMY2(C2:C50,C1:C49)/SUMSQ(C1:C50)
The Durbin-Watson statistic of 2.07 calculated here indicates that the Residuals have very little
autocorrelation. The Residuals can be considered independent of each other because of the value of the
Durbin-Watson statistic and the lack of apparent patterns in the scatterplot of the Residuals.

Determining if Residual Mean Equals Zero
The mean of the residuals is shown to be zero as follows:
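The check itself is a single cell (hypothetical range, with the residuals in D2:D21): =AVERAGE(D2:D21). For a least-squares fit that includes an intercept term, this average is forced to zero apart from floating-point rounding, so a result such as 1E-14 should be read as zero.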

Determining If Residual Variance Is Constant


If the Residuals have similar variances across all residual values, the Residuals are said to be
homoscedastic. The property of having similar variance across all sample values or across different
sample groups is known as homoscedasticity.
If the Residuals do not have similar variances across all residual values, the Residuals are said to be
heteroscedastic. The property of having different variance across sample values or across different
sample groups is known as heteroscedasticity.
Linear regression requires that Residuals be homoscedastic, i.e., have similar variances across all
residual values.
The variance of the Residuals is the degree of spread among the Residual values. This can be observed
on the Residual scatterplot graph. If the variance of the residuals changes as residual values increase,
the spread between the values will visibly change on the Residual scatterplot graph. If Residual variance
increases, the Residual values will appear to fan out along the graph. If Residual variance decreases, the
Residual values will do the opposite; they will appear to clump together along the graph.

Here is the Residual graph again:

The Residuals appear to fan out slightly as Residual values increase. This indicates a slight increase in
Residual variance across the values of the dependent variable. The degree of fanning out is not
significant. Slight unequal variance in Residuals is not usually a reason to discard an otherwise
good model. One way to remove unequal variance in the residuals is to reduce the interval between data
points. Shorter intervals will have closer variances.
If the number of data points is too small, the residual spread will sometimes produce a cigar-shaped
pattern.

Determining if Residuals Are Normally-Distributed


An important assumption of linear regression is that the Residuals be normally-distributed. Normality
testing must be performed on the Residuals. The following five normality tests will be performed here:
1) An Excel histogram of the Residuals will be created.
2) A normal probability plot of the Residuals will be created in Excel.
3) The Kolmogorov-Smirnov test for normality of Residuals will be performed in Excel.
4) The Anderson-Darling test for normality of Residuals will be performed in Excel.
5) The Shapiro-Wilk test for normality of Residuals will be performed in Excel.

Histogram of the Residuals in Excel
An Excel histogram of the Residuals is shown as follows:

The Residuals appear to be distributed according to the bell-shaped normal distribution in this Excel
histogram. This histogram was created in Excel by inserting the following information into the Excel
histogram dialogue box:

This histogram can also be created with formulas and a chart. The advantage of creating a histogram with
formulas and a chart instead of using the Histogram tool from the Data Analysis ToolPak is that chart and
formulas in Excel update their output automatically when data is changed. All of the tools in the Data
Analysis ToolPak must be rerun to update the output when input data has changed. The histogram can
be created with charts and formulas as follows:
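A minimal sketch of the formula approach (the layout here is hypothetical): with the residuals in A2:A21 and the bin upper limits in C2:C7, selecting D2:D8 and entering =FREQUENCY(A2:A21,C2:C7) as an array formula (Ctrl+Shift+Enter in older versions of Excel) returns the count of residuals falling in each bin, plus one extra cell for any values above the last bin.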

Using this data to create an Excel bar chart produces the following histogram:

The advantage of creating the histogram with an Excel chart is that the chart automatically updates itself
when the input data is changed.

Normal Probability Plot of Residuals in Excel
A Normal Probability Plot created in Excel of the Residuals is shown as follows:

The Normal Probability Plot of the Residuals provides strong evidence that the Residuals are normally-
distributed. The more closely the graph of the Actual Residual values (in red) resembles a straight line (in
blue), the closer the Residuals are to being normally-distributed. The Actual Residual values are
very close to being a straight line (the red graph deviates only slightly from the blue straight line).

Kolmogorov-Smirnov Test For Normality of Residuals in Excel
The Kolmogorov-Smirnov Test is a hypothesis test that is widely used to determine whether a data
sample is normally-distributed. The Kolmogorov-Smirnov Test calculates the distance between the
Cumulative Distribution Function (CDF) of each data point and what the CDF of that data point would be if
the sample were perfectly normally-distributed. The Null Hypothesis of the Kolmogorov-Smirnov Test
states that the distribution of actual data points matches the distribution that is being tested. In this case
the data sample is being compared to the normal distribution.
The largest distance between the CDF of any data point and its expected CDF is compared to the
Kolmogorov-Smirnov Critical Value for a specific sample size and Alpha. If this largest distance exceeds
the Critical Value, the Null Hypothesis is rejected and the data sample is determined to have a different
distribution than the tested distribution. If the largest distance does not exceed the Critical Value, we
cannot reject the Null Hypothesis, which states that the sample has the same distribution as the tested
distribution.
F(Xk) = CDF(Xk) for normal distribution
F(Xk) = NORM.DIST(Xk, Sample Mean, Sample Stan. Dev., TRUE)
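A sketch of the per-point distance calculation (hypothetical layout): with the 20 residuals sorted ascending in A2:A21, their positions k = 1 to 20 in B2:B21, the sample mean in E1, and the sample standard deviation in E2, the K-S distance for each point could be computed as

=MAX(ABS(B2/20 - NORM.DIST(A2,$E$1,$E$2,TRUE)), ABS((B2-1)/20 - NORM.DIST(A2,$E$1,$E$2,TRUE)))

and the test statistic is the MAX() of that column, which is the 0.1480 value shown below.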

Residual Data

0.1480 = Max Difference Between Actual and Expected CDF
20 = n = Number of Data Points
0.05 = α

The Null Hypothesis Stating That the Residuals Are Normally-Distributed Cannot Be Rejected
The Null Hypothesis for the Kolmogorov-Smirnov Test for Normality, which states that the sample data
are normally-distributed, is rejected only if the maximum difference between the expected and actual CDF
of any of the data points exceeds the Critical Value for the given n and α. That is not the case here.
The Max Difference Between the Actual and Expected CDF for the Residual data (0.1480) is significantly less
than the Kolmogorov-Smirnov Critical Value for n = 20 (0.29) at α = 0.05, so the Null Hypothesis of the
Kolmogorov-Smirnov Test for the Residual data cannot be rejected.

Anderson-Darling Test For Normality of Residuals in Excel


The Anderson-Darling Test is a hypothesis test that is widely used to determine whether a data sample is
normally-distributed. The Anderson-Darling Test calculates a test statistic based upon the actual value of
each data point and the Cumulative Distribution Function (CDF) of each data point if the sample were
perfectly normally-distributed.
The Anderson-Darling Test is considered to be slightly more powerful than the Kolmogorov-Smirnov test
for the following two reasons:
The Kolmogorov-Smirnov test is distribution-free, i.e., its critical values are the same for all distributions
tested. The Anderson-Darling test requires critical values calculated for each tested distribution and is
therefore more sensitive to the specific distribution.
The Anderson-Darling test gives more weight to values in the outer tails than the Kolmogorov-Smirnov
test. The K-S test is less sensitive to aberrations in outer values than the A-D test.
If the test statistic exceeds the Anderson-Darling Critical Value for a given Alpha, the Null Hypothesis is
rejected and the data sample is determined to have a different distribution than the tested distribution. If
the test statistic does not exceed the Critical Value, we cannot reject the Null Hypothesis, which states
that the sample has the same distribution as the tested distribution.

F(Xk) = CDF(Xk) for normal distribution
F(Xk) = NORM.DIST(Xk, Sample Mean, Sample Stan. Dev., TRUE)

Residual Data

Test Statistic A = 1.333


The above test statistic should be adjusted in the general case in which both the population mean and
population variance are unknown. This is often the case, and it is an assumption that can always be applied.
When the population mean and population variance are unknown, make the following adjustment:
Adjusted Test Statistic A* = ( 1 + 0.75/n + 2.25/n^2 )*A
However, the population mean of the residuals is known to be 0. The population standard deviation of the
residuals is not known.
In this case Test Statistic A should be used and not Adjusted Test Statistic A*.
Reject the Null Hypothesis of the Anderson-Darling Test, which states that the data are normally-
distributed when the population mean is known but the population standard deviation is not known, if any
of the following are true:

A > 1.760 When Level of Significance (α) = 0.10
A > 2.323 When Level of Significance (α) = 0.05
A > 3.69 When Level of Significance (α) = 0.01
The Null Hypothesis Stating That the Residuals Are Normally-Distributed Cannot Be Rejected
The Null Hypothesis for the Anderson-Darling Test for Normality, which states that the sample data are
normally-distributed, is rejected if the Test Statistic (A) exceeds the Critical Value for the given n and α.
The Test Statistic (A) for the Residual data is significantly less than the Anderson-Darling Critical Value
for α = 0.05, so the Null Hypothesis of the Anderson-Darling Test for the Residual data is not rejected.
The Null Hypothesis states that the residuals are normally-distributed.

Shapiro-Wilk Test For Normality in Excel


The Shapiro-Wilk Test is a hypothesis test that is widely used to determine whether a data sample is
normally-distributed. A test statistic W is calculated. If this test statistic is less than a critical value of W for
a given level of significance (alpha) and sample size, the Null Hypothesis which states that the sample is
normally-distributed is rejected.
The Shapiro-Wilk Test is a robust normality test and is widely-used because of its slightly superior
performance against other normality tests, especially with small sample sizes. Superior performance
means that it correctly rejects the Null Hypothesis (that the data are normally-distributed) when the data
are in fact not normally-distributed a slightly higher percentage of the time than most other normality
tests, particularly at small sample sizes.
The Shapiro-Wilk normality test is generally regarded as being slightly more powerful than the Anderson-
Darling normality test, which in turn is regarded as being slightly more powerful than the Kolmogorov-
Smirnov normality test.

Residual Data

0.966014 = Test Statistic W
0.905 = W Critical for the following n and Alpha
20 = n = Number of Data Points
0.05 = α
The Null Hypothesis Stating That the Data Are Normally-Distributed Cannot Be Rejected
Test Statistic W (0.966014) is larger than W Critical (0.905). The Null Hypothesis therefore cannot be
rejected. There is not enough evidence to state that the data are not normally-distributed with a
confidence level of 95 percent.

Correctable Reasons Why Normal Data Can Appear Non-Normal


If a normality test indicates that data are not normally-distributed, it is a good idea to do a quick evaluation
of whether any of the following factors have caused normally-distributed data to appear to be non-
normally-distributed:
1) Outliers – Too many outliers can easily skew normally-distributed data. An outlier can often be
removed if a specific cause of its extreme value can be identified. Some outliers are expected in normally-
distributed data.
2) Data Has Been Affected by More Than One Process – Variations to a process such as shift changes
or operator changes can change the distribution of data. Multiple modal values in the data are common
indicators that this might be occurring. The effects of different inputs must be identified and eliminated
from the data.
3) Not Enough Data – Normally-distributed data will often not assume the appearance of normality until
at least 25 data points have been sampled.
4) Measuring Devices Have Poor Resolution – Sometimes (but not always) this problem can be solved
by using a larger sample size.
5) Data Approaching Zero or a Natural Limit – If a large number of data values approach a limit such
as zero, calculations using very small values might skew computations of important values such as the
mean. A simple solution might be to raise all the values by a certain amount.
6) Only a Subset of a Process’ Output Is Being Analyzed – If only a subset of data from an entire
process is being used, a representative sample is not being collected. Normally-distributed results would
not appear normally-distributed if a representative sample of the entire process is not collected.

Determining If the Residuals Are Too Highly Correlated With Any Input Variables


To determine whether the Residuals have significant correlation with any other variables, an Excel
correlation matrix can be created. An Excel correlation matrix will simultaneously calculate correlations
between all variables. The Excel correlation matrix for all variables in this regression is shown as follows:

The correlation matrix shows the correlations between the Residuals and each of the other two variables to be low.
Correlation values go from (-1) to (+1). Correlation values near zero indicate very low correlation. This
correlation matrix was created by inserting the following information into the Excel correlation data
analysis tool dialogue box:

Determining If There Are Enough Data Points
Violations of important assumptions such as normality of Residuals are difficult to detect if too few data
points exist. 20 data points is sufficient. 10 data points is probably on the borderline of being too few. All of the
normality tests become significantly more powerful (accurate) as the data size goes from 15 to 20 data points.
Normality of data is very difficult to assess accurately when only 10 data points are present.

All required regression assumptions concerning the Residuals have been met. The next step is to
evaluate the remainder of the Excel regression output.

Step 7 – Evaluate the Excel Regression Output


The Excel regression output that will now be evaluated is as follows:

Explanations of the most important individual parts of the Excel regression output are as follows:

Regression Equation

The regression equation is shown to be the following:


Yi = b0 + b1 * Xi
Number of Parts Produced = 1,345.09 + 1.875 (Weight of Input Pellets in kg.)
On an X-Y graph the Y intercept of the regression line would be 1,345.09 and the slope of the regression
line would be 1.875.
For example, if 5,000 kg. of pellets were input into the molding machine, then approximately 10,720 parts would be
expected to be produced by the machine. This regression equation calculation is as follows:
Number of Parts Produced = 1,345.09 + 1.875 (Weight of Input Pellets in kg.)
Number of Parts Produced = 1,345.09 + 1.875 (5,000)
Number of Parts Produced = 10,720
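Equivalently (a sketch with hypothetical ranges), the prediction could be computed directly from the original data with Excel’s TREND function, which fits the least-squares line and evaluates it at a new X value:

=TREND(B2:B21,A2:A21,5000)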
It is very important to note that a regression equation should never be extrapolated outside the range of
the original data set used to create the regression equation. The inputs for a regression prediction should
not be outside of the following range of the original data set:
Weight of Input Pellets (kg.): 103 to 6,592

A simple example to illustrate why a regression line should never be extrapolated is as follows: Imagine
that the height of a child was recorded every six months from ages one to seventeen. Most people stop
growing in height at approximately age seventeen. If a regression line was created from that data and
then extrapolated to predict that person’s height at age 50, the regression equation might predict that the
person would be fifteen feet tall. Conditions are often very different outside the range of the original data
set.

R Square –The Equation’s Overall Predictive Power

R Square tells how closely the Regression Equation approximates the data. R Square tells what
percentage of the variance of the output variables is explained by the input variables. We would like to
see at least 0.6 or 0.7 for R Square. The remainder of the variance of the output is unexplained. R Square
here is a relatively high value of 0.904. This indicates that 90.4 percent of the total variance in the output
variable (number of parts produced) is explained by the variance of the input variable (weight of the input
pellets).
Adjusted R Square is quoted more often than R Square because it is more conservative. Adjusted R
Square only increases when new independent variables are added to the regression analysis if those new
variables increase an equation’s predictive ability. When you are adding independent variables to the
regression equation, add them one at a time and check whether Adjusted R Square has gone up with the
addition of the new variable. The value of Adjusted R Square here is 0.898.
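Both measures can be cross-checked with short formulas (hypothetical ranges again, with Y in B2:B21 and X in A2:A21). R Square for simple regression can be reproduced with =RSQ(B2:B21,A2:A21) or, equivalently, =CORREL(B2:B21,A2:A21)^2. Adjusted R Square follows from R Square, the number of data points n, and the number of independent variables k:

Adjusted R Square = 1 – (1 – R Square)*(n – 1)/(n – k – 1)

With R Square = 0.904, n = 20, and k = 1, this gives 1 – (0.096)(19/18) = 0.898, matching the Excel output.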

Significance of F - Overall p Value and Validity Measure

The Significance of F is the overall p Value of the regression equation. A very small Significance of F
confirms the validity of the Regression Equation. If the Significance of F is 0.03, then there is only a 3%
chance that the relationship shown by the Regression Equation is merely random. This is
strong evidence of the validity of the Regression Equation.

To be more specific, this p value (Significance of F) indicates whether to reject the overall Null Hypothesis
of this regression analysis. The overall Null Hypothesis for this regression equation states that all
coefficients of the independent variables equal zero. In other words, that for this multiple regression
equation:
Y = b0 + b1X1 + b2X2 + … + bkXk
The Null Hypothesis for multiple regression states that the coefficients b1, b2, …, bk all equal zero. The Y
intercept, b0, is not included in this Null Hypothesis.
For this simple regression equation:
Y = b0 + b1X
The Null Hypothesis for simple regression states that the Coefficient b1 equals zero. The Y intercept, b0,
is not included in this Null Hypothesis. Coefficient b1 is the slope of the regression line in simple
regression.
In this case, the p Value (Significance of F) is extremely low (1.39666E-10) so we have very strong
evidence that this is a valid regression equation. There is almost no probability that the relationship
shown to exist between the dependent and independent variables (the nonzero values of coefficients b1,
b2, … , bk) was obtained merely by chance.
This low p Value (or corresponding high F Value) indicates that there is enough evidence to reject the
Null Hypothesis of this regression analysis.
The 95 percent Level of Confidence is usually required to reject the Null Hypothesis. This translates to a 5
percent Level of Significance. The Null Hypothesis is rejected if the p Value (Significance of F) is less
than 0.05. If the Null Hypothesis is rejected, the regression output stating that the regression coefficients
b1, b2, … , bk do not equal zero is deemed to be statistically significant.

p Value of Intercept and Coefficients – Measure of Their Validity

The lower the p Value for each, the more likely it is that the Y-Intercept or coefficient is valid. The
Intercept's low p Value of 0.017 indicates that there is only a 1.7 percent chance that this calculated value
of the Intercept is a random occurrence.
The coefficient’s extremely low p Value of 1.4E-10 indicates that there is almost no chance that this
calculated value of the coefficient is a random occurrence.

All Calculations That Created Excel’s Regression Output
Performing regression analysis manually can be done but is somewhat tedious. Remember also that the
single-variable regression performed here is the simplest type of regression. If even a few more independent
variables are added, the calculations become dramatically more complicated. All of the calculations needed to
duplicate Excel’s regression output are as follows:

Calculation of Coefficient and Intercept in Excel

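The textbook formulas are b1 = SSxy/SSxx and b0 = y_bar – b1 * x_bar, where SSxy is the sum of (x – x_bar)(y – y_bar). As a quick check, Excel's built-in functions return the same values. A minimal sketch, assuming the Y data (parts produced) occupy the hypothetical range B2:B21 and the X data (pellet weight) the hypothetical range A2:A21:
Coefficient b1: =SLOPE(B2:B21, A2:A21)
Intercept b0: =INTERCEPT(B2:B21, A2:A21)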
Calculation of R Square in Excel

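As a sketch, assuming the same hypothetical ranges as above (Y in B2:B21, X in A2:A21), R Square for simple regression can be verified with:
=RSQ(B2:B21, A2:A21)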
Another Way To Calculate R Square in Excel

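For simple regression, R Square also equals the square of the Pearson correlation between X and Y, so an equivalent check (same hypothetical ranges) is:
=CORREL(B2:B21, A2:A21)^2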
Calculation of Adjusted R Square in Excel

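As a sketch, assuming R Square has been placed in the hypothetical cell F5, with n = 20 data points and k = 1 independent variable:
=1-(1-F5)*(20-1)/(20-1-1)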
Calculation of the Standard Error of the Regression Equation in Excel


The Standard Error of the Regression Equation is calculated from the residuals.
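For simple regression this equals SQRT(SSE/(n – 2)), where SSE is the sum of the squared residuals. Two equivalent checks, assuming the residuals occupy the hypothetical range E2:E21 and the raw data the hypothetical ranges used above:
=SQRT(SUMSQ(E2:E21)/(20-2))
=STEYX(B2:B21, A2:A21)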

ANOVA Calculations in Excel

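As a sketch of the ANOVA quantities for simple regression (standard definitions): SST = DEVSQ of the Y data, SSE = sum of the squared residuals, SSR = SST – SSE, MSR = SSR/1, MSE = SSE/(n – 2), and F Stat = MSR/MSE. The legacy Excel formula for the overall p Value is then:
Significance of F = p Value = FDIST(F Stat, 1, n – 2)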
The p Value formula above is the legacy formula for Excel versions prior to 2010. Excel 2010 and later
use the following p Value formula:
Significance of F = p Value = F.DIST.RT(F Stat, 1, n – 2)
The F Statistic is the result of an F Test that calculates the ratio of the Explained variance over the
Unexplained variance. If this ratio is large enough, it is unlikely that this result was obtained by chance.
Significance of F is a p Value that determines the overall validity of the regression equation. If the p Value
is smaller than the designated Alpha (Level of Significance), then it can be said that the regression
equation is significant at the designated Level of Confidence (Level of Confidence = 1 – Level of
Significance).
It is the p Value derived from the F Test that produced the F Statistic. This p Value (the percentage of the
total area under the F distribution curve beyond the F Statistic) provides the probability that the regression
equation was arrived at merely by chance.

Analysis of the Independent Variable Coefficient in Excel
The overall test being conducted on the Variable Coefficient is a t-Test that has a Null Hypothesis stating
that this coefficient = 0. This Null Hypothesis will be rejected if the t Statistic of this Regression
Variable is large enough or, equivalently, the p-Value associated with that t Statistic is small enough.

Standard Error of Coefficient


One of the first steps in a hypothesis test is to determine the standard error of the distributed variable.

t Stat of Coefficient
The t Statistic of the coefficient states how many standard errors that the coefficient is from zero.
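As a sketch, assuming the coefficient and its standard error occupy the hypothetical cells F18 and G18 in the regression output, the t Stat is simply:
=F18/G18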

p-Value of the Coefficient


The p-value of the coefficient is calculated from the t Statistic. The larger the t Statistic is in absolute
value, the smaller the p Value will be. The very small p Value in this case indicates the validity of the
calculated value of the coefficient. This p Value of approximately zero indicates that there is almost no
possibility that this calculated value of the coefficient occurred merely by chance.
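As a sketch, assuming the t Stat occupies the hypothetical cell H18 and using this example's n = 20 (so df = n – 2 = 18), the p Value is:
=TDIST(ABS(H18), 18, 2) in legacy Excel, or =T.DIST.2T(ABS(H18), 18) in Excel 2010 and later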

95% Confidence Interval of Coefficient


This interval has a 95% chance of containing the true value of the coefficient.
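The interval itself is constructed as Coefficient ± t-Valueα/2,df=n-2 * (Standard Error of the Coefficient), where the t-value for this data set is TINV(0.05, 18) = 2.1009. The Intercept's confidence interval below is constructed the same way.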

Analysis of Intercept in Excel

The overall test being conducted on the Y Intercept is a t-Test that has a Null Hypothesis stating that the
intercept = 0. This Null Hypothesis will be rejected if the t Statistic of this Intercept is large enough or,
equivalently, the p-Value associated with that t Statistic is small enough.

Standard Error of the Intercept


One of the first steps in a hypothesis test is to determine the standard error of the intercept.

t Stat of the Intercept
The t Statistic of the Intercept states how many standard errors that the Intercept is from zero.

p-Value of the Intercept


The p-value of the Intercept is calculated from the t Statistic. The larger the t Statistic is in absolute
value, the smaller the p Value will be. The small p Value in this case indicates the validity of the
calculated value of the Intercept. This p Value of 0.017 indicates that there is only a 1.7 percent chance
that this calculated value of the Intercept occurred merely by chance.

95% Confidence Interval of Intercept


This interval has a 95% chance of containing the true value of the Intercept.

Prediction Interval of a Regression Estimate


A prediction interval is a confidence interval about a Y value that is estimated from a regression equation.
A regression prediction interval is a value range above and below the Y estimate calculated by the

regression equation that would contain the actual value of a sample with, for example, 95 percent
certainty.
The Prediction Error for a point estimate of Y is always slightly larger than the Standard Error of the
Regression Equation shown in the Excel regression output directly under Adjusted R Square.
The Standard Error of the Regression Equation is used to calculate a confidence interval about the mean
Y value. The Prediction Error is used to create a confidence interval about a predicted Y value. There will
always be slightly more uncertainty in predicting an individual Y value than in estimating the mean Y
value.
For that reason, a Prediction Interval will always be larger than a Confidence Interval for any type of
regression analysis.
Calculating an exact prediction interval for any regression with more than one independent variable
(multiple regression) involves some pretty heavy-duty matrix algebra. Fortunately a prediction interval for
simple regression can be calculated by hand as follows:

Prediction Interval Estimate Formula


The formula for a prediction interval about an estimated Y value (a Y value calculated from the regression
equation) is found by the following formula:
Prediction Interval = Yest ± t-Valueα/2,df=n-2 * Prediction Error
Prediction Error = Standard Error of the Regression * SQRT(1 + distance value)
Distance value, sometimes called leverage value, is the measure of distance of the combinations of
values, x1, x2,…, xk from the center of the observed data. Distance value in any type of multiple regression
requires some heavy-duty matrix algebra. This is given in Bowerman and O’Connell (1990).
Distance value can be calculated for single-variable regression in a fairly straightforward manner as
follows:
Distance value = 1/n + (x0 – x_bar)^2/SSxx
If, for example, we wanted to calculate the 95 percent Prediction Interval for the estimated Y value when X
= 5000 kg. of input pellets, the following calculations would be performed:
x0 = 5,000
n = 20

Yest = Number of Parts Produced = 1,345.09 + 1.875 (Weight of Input Pellets in kg.)
Yest = 1,345.09 + 1.875 (5,000)
Yest = 10,730

t-Valueα/2,df=n-2 = TINV(0.05, 20-2)
t-Valueα/2,df=n-2 = TINV(0.05, 18) = 2.1009
In Excel 2010 and beyond, TINV(α, n – 2) can also be calculated by the following Excel formula:
TINV(α, n – 2) = T.INV(1 – α/2, n – 2)

x_bar and SSxx are found as follows:
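As a sketch, assuming the X data occupy the hypothetical range A2:A21:
x_bar = AVERAGE(A2:A21)
SSxx = DEVSQ(A2:A21)
DEVSQ returns the sum of squared deviations from the mean, which is exactly SSxx.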

Now we have the following:
x0 = 5,000
n = 20
Yest = 10,730
t-Valueα/2,df=n-2 = 2.1009
x_bar = 2,837.65
SSxx = 94,090,690.55
Distance value = 1/n + (x0 – x_bar)^2/SSxx
Distance value = 1/20 + (5,000 – 2,837.65)^2/94,090,690.55
Distance value = 0.099694
Prediction Error = Standard Error of the Regression * SQRT(1 + distance value)
Standard Error of the Regression = 1,400.463
This is found from the Excel regression output as follows:

Prediction Error = 1,400.463 * SQRT(1 + 0.099694)
Prediction Error = 1,400.463 * 1.048663
Prediction Error = 1,468

Prediction Interval = Yest ± t-Valueα/2,df=n-2 * Prediction Error


Prediction Interval = 10,730 ± 2.1009 * 1,468
Prediction Interval = 10,730 ± 3,084
Prediction Interval = [ 7,646, 13,814 ]

Multiple-Variable Linear Regression in Excel

Overview
Linear regression is a statistical technique used to model the relationship between one or more
independent, explanatory variables and a single dependent variable. The linear regression type is
classified as Simple Linear Regression if there is only a single explanatory variable. The regression type
is classified as Multiple Linear Regression if there is more than one explanatory variable.

The Regression Equation


The end result of linear regression is a linear equation that models actual data as closely as possible.
This equation is called the Regression Equation. The more linear the relationship is between each of the
explanatory variables and the single dependent variable, the more closely the Regression Equation will
model the actual data.
In the Regression Equation, the variable Y is usually designated as the single dependent variable. The
independent explanatory variables are usually labeled X1, X2, …, Xk.
The Regression Equation for multiple regression appears as follows:
Y = b0 + b1X1 + b2X2 + … + bkXk
The Regression Equation for simple regression appears as follows:
Y = b0 + b1X
b0 is the Y-intercept of the Regression Equation.
b1, b2, …, bk are the coefficients of the independent variables.
The most important part of regression analysis is the calculation of b0, b1, b2, …, bk in order to be able to
construct the Regression Equation
Y = b0 + b1X for simple regression
or
Y = b0 + b1X1 + b2X2 + … + bkXk for multiple regression.

Purposes of Linear Regression


Linear regression, whether simple or multiple, generally has two main uses. They are as
follows:
1) To quantify the linear relationship between the dependent variable and the independent variable(s) by
calculating a regression equation.
2) To quantify how much of the movement or variation of the dependent variable is explained by the
independent variable(s).

The Inputs For Linear Regression
The input data for linear regression analysis consists of a number of data records each having a single Y
(dependent variable) value and one or more X (explanatory independent variable) values. Simple
regression has only a single X value. Multiple regression has more than one X (independent) variable for
each Y (dependent) variable.
Each data record occupies its own unique row in the regression input. Each data record contains the
specific values of the input (independent) X variables that are associated with a specific value of the
dependent Y variable shown in that data record.
The input data for multiple regression analysis appear as separate data records on each row as follows:

Y X1 X2 … Xk

4 6 10 …15
5 7 11 …16
6 8 12 …17
7 9 13 …18
8 10 14 …19

Multiple linear regression has more than one X (independent) variable. These independent variables
(X's) are known as the explanatory, predictor, or regressor variables. The single dependent variable (Y) is the
target or outcome variable.
Multiple linear regression requires that both the dependent variable and the independent variables be
continuous. If ordinal data such as a Likert scale is used as a dependent or independent variable, it must
be treated as a continuous variable that has equal distance between values. Ordinal data is normally
defined as data whose order matters but whose differences between values are not meaningful.

Null and Alternative Hypotheses


The Null Hypothesis of linear regression states that the coefficient(s) of the independent variable(s) in the
regression equation equal(s) zero. The Alternative Hypothesis for linear regression therefore states that
these coefficients do not equal zero.
For multiple linear regression this Null Hypothesis is expressed as follows:
H0: b1 = b2 = … = bk = 0
For simple linear regression this Null Hypothesis is expressed as follows:
H0: b1 = 0
b1 is the slope of the regression line for simple regression.
The Alternative Hypothesis, H1, for linear regression states that these coefficients do not equal zero.
The Y Intercept b0 is not included in the Null Hypothesis.

X and Y Variables Must Have a Linear Relationship
Linear regression is a technique that provides accurate information only if a linear relationship exists
between the dependent variable and each of the independent variables. Independent variables that do
not have a linear relationship with the dependent variable should not be included as inputs. An X-Y
scatterplot diagram between each independent variable and the dependent variable provides a good
indication of whether the relationship is linear.
When data are nonlinear, there are often two solutions available to allow regression analysis to be performed.
They are the following:
1) Transform the nonlinear data to linear using a logarithmic transformation. This will not be discussed in
this section.
2) Perform nonlinear regression on the data. One way to do that is to apply curve-fitting software that will
calculate the mathematical equation that most closely models the data. Another section in this book focuses
on using the Excel Solver to fit a curve to nonlinear data using the least-squares method.

Do Not Extrapolate Regression Beyond Existing Data


The major purpose of linear regression is to create a Regression Equation that accurately predicts a Y
value based on a new set of independent, explanatory X values. The new set of X values should not
contain any X values that are outside of the range of the X values used to create the original regression
equation. The following simple example illustrates why a Regression Equation should not be extrapolated
beyond the original X values.

Example of Why Regression Should Not Be Extrapolated


Imagine that the height of a boy was measured every month from when the boy was one year old until the
boy was eighteen years old. The independent, explanatory X variable would be the month number (12
months to 216 months) and the dependent Y variable would be the height measured in inches. Typically,
most boys stop growing in height when they reach their upper teens.
If the Regression Equation was created from the above data and then extrapolated to predict the boy’s
height when he reached 50 years of age, the Regression Equation might predict that the boy would be
fifteen feet tall.

Linear Regression Should Not Be Done By Hand


Excel provides an excellent data analysis regression tool that can perform simple or multiple regression
with equal ease. Doing the calculations by hand would be very tedious and provide lots of opportunities to
make a mistake. Excel produces a very detailed output when the regression tool is run. I have recreated all
of the simple regression calculations that Excel performs in this chapter. It will probably be clear from
viewing this that it is wise to let Excel do the regression calculations. A number of statistics textbooks
probably place too much emphasis on teaching the ability to perform the regression equations by hand. In
the real world regression analysis would never be done manually.
The best way to understand multiple-variable linear regression is to perform an example as follows:

Complete Example of Multiple Linear Regression in Excel
A researcher is attempting to create a model that accurately predicts the total annual power consumption
of companies within a specific industry. The researcher has collected information from 21 companies that
specialize in a single industry. The four pieces of information collected from each of the 21 companies are
as follows:
1) The company’s total power consumption last year in kilowatts.
2) The company’s total number of production machines.
3) The company’s number of new employees added in the last five years.
4) The company’s total increase in salary paid over the last five years.
The collected data are as follows:

Step 1 – Remove Extreme Outliers
Calculation of the mean is one of the fundamental computations when performing linear regression
analysis. The mean is unduly affected by outliers. Extreme outliers should be removed before beginning
regression analysis. Not all outliers should be removed. An outlier should be removed if it is obviously
extreme and inconsistent with the remainder of the data.

Sorting the Data To Quickly Spot Extreme Outliers

An easy way to spot extreme outliers is to sort the data. Extremely high or low outlier values will appear at
the ends of the sort. A convenient, one-step method to sort a column of data in Excel is shown here.
The formula in cell I4 is the following:

=IF($G4="","",LARGE($G$4:$G$24,ROW()-ROW($I$3)))
Copy this formula down as shown to create a descending sort of the data in cells I4 to I24.
Exchanging the word SMALL for LARGE would create an ascending sort instead of the descending sort
performed here.

Here is the original data with the outlier data record highlighted.

The lowest Y value, 509090, is obviously an extreme outlier and is very different from the rest of the data.
The cause of this extreme outlier value is not known. Perhaps something unusual is happening in the
company from which this data was drawn? It is clear that this value should be removed from the analysis
because it would severely skew the final result.

Removing this outlier from the data produces this set of 20 data records:

Step 2 – Create a Correlation Matrix


This step is only necessary when performing multiple regression. The purpose of this step is to identify
independent variables that are highly correlated. Input variables of multiple regression that are
highly correlated can cause a problem called multicollinearity.
Multicollinearity does not reduce the overall predictive power of the model but it can cause the coefficients
of the independent variables in the regression equation to change erratically when small changes are
introduced to the regression inputs. Multicollinearity can drastically reduce the validity of the individual
predictors without affecting the overall reliability of the regression equation.
When highly correlated pairs of independent variables are found, one of the variables of the pair should
be removed from the regression. The variable that should be removed is the one with the lowest
correlation with the dependent variable, Y.
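Each entry in such a matrix is a Pearson correlation, which can also be computed one pair at a time. As a sketch, assuming two variables occupy the hypothetical ranges B2:B21 and C2:C21:
=CORREL(B2:B21, C2:C21)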

An Excel correlation matrix of all independent and dependent variables is shown as follows:

This Excel correlation matrix was created using the following inputs for the Excel correlation dialogue box:

We can see from the correlation matrix that there is a very high correlation between two independent
variables. The correlation between Total_Salary_Increases and Number_of_Production_Machines is
0.989.
One of these two independent variables should be removed to prevent multicollinearity. The variable that
should be removed is the one that has the lower correlation with the dependent variable,
Power_Consumption. The independent variable Total_Salary_Increases has a lower correlation with the
dependent variable Power_Consumption (0.967) than Number_of_Production_Machines (0.980) and
should be removed from the regression analysis.
Here is the data after the variable Total_Salary_Increases is removed from the analysis:

Step 3 – Scale Variables If Necessary
All variables should be scaled so that each has a similar number of decimal places beyond zero. This
limits rounding error and also ensures that the slope of the fitted line will be a convenient size to work with
and not too large or too small. Ideally, the coefficients of the independent variables should be between
one and ten.

The next step following this one (Step 4) is to view individual scatterplots of each independent variable
versus the dependent variable. Rescaling the independent variables is one way to ensure that the data
points do not have too extreme a slope in the scatterplot graphs.

Performing a regression analysis with the current independent variables would produce coefficients for
each variable that are over 1,000. This can be corrected by multiplying each of the two independent
variables by 1,000. Rescaling the variables in that manner is shown as follows:

Step 4 – Plot the Data
The purpose of plotting the data is to be able to visually inspect the data for linearity. Each independent
variable should be plotted against the dependent variable in a scatterplot graph. Linear regression should
only be performed if linear relationships exist between the dependent variable and each of the input
variables. Excel X-Y scatterplots of the two independent variables versus the dependent variable are
shown as follows. The relationships in both cases appear to be linear.

Step 5 – Run the Regression Analysis
Below is the Regression dialogue box with all of the necessary information filled in. Many of the required
regression assumptions concerning the Residuals have not yet been validated. Calculating and
evaluating the Residuals will be done before analyzing any other part of the regression output. All four
checkboxes in the Residuals section of the regression dialogue box should be checked. This will be
discussed shortly.

Here is a close-up of the completed Excel regression dialogue box:

It should be noted that the Residuals are sometimes referred to as the Error terms. The checkbox next to
Residuals should be checked in order to have Excel automatically calculate the residual for each data
point. The residual is the difference between the actual data point and its value as predicted by the
regression equation. Analysis of the residuals is a very important part of linear regression analysis
because a number of required assumptions are based upon the residuals.
The checkbox next to Standardized Residuals should also be checked. If this is checked, Excel will
calculate the number of standard deviations that each residual value is from the mean of the residuals.
Data points are often considered outliers if their residual values are located more than three standard
deviations from the residual mean.
The checkbox next to Residual Plots should also be checked. This will create graphs of the residuals
plotted against each of the input (independent) variables. Visual observation of these graphs is an
important part of evaluating whether the residuals are independent. If the residuals show patterns in any
graph, the residuals are considered to not be independent and the regression should not be considered
valid. Independence of the residuals is one of linear regression’s most important required assumptions.
The checkbox next to Line Fit plots should be checked as well. This will produce graphs of the Y Values
plotted against each X value in a separate graph. This provides visual analysis of the spread of each
input (X) variable and of any patterns between any X variable and the output Y variable.
The checkbox for the Normal Probability Plot was not checked because that produces a normal
probability plot of the Y data (the dependent variable data). A normal probability plot is used to evaluate
whether data is normally-distributed. Linear regression does not require the independent or dependent

variable data be normally-distributed. Many textbooks incorrectly state that the dependent and/or
independent data need to be normally-distributed. This is not the case.
Linear regression does however require that the residuals be normally-distributed. A normal probability
plot of the residuals would be very useful to evaluate the normality of the residuals but is not included as
a part of Excel’s regression output.
A normal probability plot of the Y data does not provide any useful information and the checkbox that
would produce that graph is therefore not checked. It is unclear why Excel includes that functionality with
its regression data analysis tool.
Those settings shown in the previous Excel regression dialogue box produce the following Excel output:

The Excel regression output includes the calculation of the Residuals as specified. Linear regression’s
required assumptions regarding the Residuals should be evaluated before analyzing any other part of the
Excel regression output. The required Residual assumptions must be verified before the regression
output is considered valid.

The Residual output includes each Dependent variable’s predicted value, its Residual value (the
difference between the predicted value and the actual value), and the Residual’s standardized value (the
number of standard deviations that the Residual value is from the mean of the Residual values). This
Residual output is shown as follows:

The following graphs were also generated as part of the Excel regression output:

Step 6 – Evaluate the Residuals
The purpose of Residual analysis is to confirm the underlying validity of the regression. Linear regression
has a number of required assumptions about the residuals. These assumptions should be confirmed
before evaluating the remainder of the Excel regression output. If one or more of the required residual
assumptions are shown to be invalid, the entire regression analysis might be questionable. The residuals
should therefore be analyzed before the remainder of the Excel regression output.

The Residual is sometimes called the Error Term. The Residual is the difference between an observed
data value and the value predicted by the regression equation. The formula for the Residual is as follows:

Residual = Yactual – Yestimated

Linear Regression’s Required Residual Assumptions


Linear regression has several required assumptions regarding the residuals. These required residual
assumptions are as follows:
1) Outliers have been removed.
2) The residuals must be independent of each other. They must not be correlated with each other.
3) The residuals should have a mean of approximately 0.
4) The residuals must have similar variances throughout all residual values.
5) The residuals must be normally-distributed.
6) The residuals may not be highly correlated with any of the independent (X) variables.
7) There must be enough data points to conduct normality testing of residuals.
Here is how to evaluate each of these assumptions in Excel.

Locating and Removing Outliers


In many cases a data point is considered to be an outlier if its residual value is more than three standard
deviations from the mean of the residuals. Checking the checkbox next to Standardized Residuals in the
regression dialogue box calculates the standardized value of each residual, which is the number of standard
deviations that the residual is from the residual mean. Below once again is the Excel regression output
showing the residuals and their distance in standard deviations from the residual mean.
Following are the standardized residuals of the current data set. None are larger in absolute value than
1.755 standard deviations from the residual mean.
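Each standardized residual is approximately the residual divided by the standard deviation of all the residuals. As a sketch, reusing the hypothetical residual range DV11:DV30 introduced in the Durbin-Watson calculation later in this step:
=DV11/STDEV.S($DV$11:$DV$30)
Note that Excel's regression tool may use a slightly different denominator convention, so results can differ marginally.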

A data point is often considered an outlier if its residual value is more than three standard deviations from
the residual mean. The following Excel output shows that none of the residuals are more than 1.755
standard deviations from the residual mean. On that basis, no data points are considered outliers as a
result of having excessively large residuals.

Any outliers that have been removed should be documented and evaluated. Outliers more than 3
standard deviations from the mean are to be expected occasionally for normally-distributed data. If an
outlier appears to have been generated by the normal process and not be an aberration of the process,
then perhaps it should not be removed. One item to check is whether a data entry error occurred when
inputting the data set. Another item that should be checked is whether there was a measurement error
when that data point’s parameters were recorded.
If a data point is removed, the regression analysis has to be performed again on the new data set that
does not include that data point.

Determining Whether Residuals Are Independent


This is the most important residual assumption that must be confirmed. If the residuals are not found to
be independent, the regression is not considered to be valid.
If the residuals are independent of each other, a graph of the residuals will show no patterns. The
residuals should be graphed across all values of the dependent variable. The Excel regression output
produced individual graphs of the residuals across all values of each independent variable, but not across
all values of the dependent variable. This graph is not part of Excel’s regression output and needs to be
generated separately.

An Excel X-Y scatterplot graph of the Residuals plotted against all values of the dependent variable is
shown as follows:

Residuals that are not independent of each other will show patterns in a Residual graph. No patterns
among Residuals are evidenced in this Residual graph so the required regression assumption of Residual
independence is validated.
It is important to note that an upward or downward linear trend appearing in the Residuals probably indicates
that an independent (X) variable is missing. The first tip-off that this might be occurring is a Residual
mean that does not equal approximately zero.

Determining If Autocorrelation Exists


An important part of evaluating whether the residuals are independent is to calculate the degree of
autocorrelation that exists within the residuals. If the residuals are shown to have a high degree of
correlation with each other, the residual are not independent and the regression is not considered valid.
Autocorrelation often occurs with time-series or any other type of longitudinal data. Autocorrelation is
evident when data values are influenced by the time interval between them. An example might be a graph
of a person’s income. A person’s level of income in one year is likely influenced by that person’s income
level in the previous year.
The degree of autocorrelation existing within a variable is calculated with the Durbin-Watson statistic, d.
The Durbin-Watson statistic can take values from 0 to 4. A Durbin-Watson statistic near 2 indicates very
little autocorrelation within a variable. Values close to 2 indicate that little to no correlation exists among
residuals. This, along with no apparent patterns in the Residuals, would confirm the independence of the
Residuals.
Values near 0 indicate a perfect positive autocorrelation; subsequent values are similar to each other and
will appear to follow each other. Values near 4 indicate a perfect negative autocorrelation; subsequent
values are opposite of each other in an alternating pattern.

The data used in this example is not time series data but the Durbin-Watson statistic of the Residuals will
be calculated in Excel to show how it is done. Before calculating the Durbin-Watson statistic, the data
should be sorted chronologically. The Durbin-Watson for the Residuals would be calculated in Excel as
follows:

SUMXMY2(x_array,y_array) calculates the sum of the square of (X – Y) for the entire array.
SUMSQ(array) squares the values in the array and then sums those squares.
If the Residuals are in cells DV11:DV30, then the Excel formula to calculate the Durbin-Watson statistic
for those Residuals is the following:
=SUMXMY2(DV12:DV30,DV11:DV29)/SUMSQ(DV11:DV30)
The Durbin-Watson statistic of 2.07 calculated here indicates that the Residuals have very little
autocorrelation. The Residuals can be considered independent of each other because of the value of the
Durbin-Watson statistic and the lack of apparent patterns in the scatterplot of the Residuals.

Determining if Residual Mean Equals Zero
The mean of the residuals is shown to be zero as follows:
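As a one-cell check, again assuming the residuals occupy the hypothetical range DV11:DV30:
=AVERAGE(DV11:DV30)
For least-squares regression that includes an intercept, the residual mean is always zero up to rounding error.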

Determining If Residual Variance Is Constant


If the Residuals have similar variances across all residual values, the Residuals are said to be
homoscedastic. The property of having similar variance across all sample values or across different
sample groups is known as homoscedasticity.
If the Residuals do not have similar variances across all residual values, the Residuals are said to be
heteroscedastic. The property of having different variances across sample values or across different
sample groups is known as heteroscedasticity.

Linear regression requires that Residuals be homoscedastic, i.e., have similar variances across all
residual values.
The variance of the Residuals is the degree of spread among the Residual values. This can be observed
on the Residual scatterplot graph. If the variance of the residuals changes as residual values increase,
the spread between the values will visibly change on the Residual scatterplot graph. If Residual variance
increases, the Residual values will appear to fan out along the graph. If Residual variance decreases, the
Residual values will do the opposite; they will appear to clump together along the graph.
Here is the Residual graph again:

The Residuals' spread appears to be fairly consistent across all Residual values. This indicates that the
Residuals are homoscedastic, i.e., have similar variance across all Residual values. There appears to
be no fanning in or fanning out.
Slightly unequal variance in the Residuals is not usually a reason to discard an otherwise good model. One
way to remove unequal variance in the residuals is to reduce the interval between data points. Shorter
intervals will have closer variances.
If the number of data points is too small, the residual spread will sometimes produce a cigar-shaped
pattern.

Determining if Residuals Are Normally-Distributed


An important assumption of linear regression is that the Residuals be normally-distributed. Normality
testing must be performed on the Residuals. The following five normality tests will be performed here:
1) An Excel histogram of the Residuals will be created.
2) A normal probability plot of the Residuals will be created in Excel.
3) The Kolmogorov-Smirnov test for normality of Residuals will be performed in Excel.
4) The Anderson-Darling test for normality of Residuals will be performed in Excel.
5) The Shapiro-Wilk test for normality of Residuals will be performed in Excel.

Histogram of the Residuals in Excel
An Excel histogram of the Residuals is shown as follows:

The Residuals appear to be distributed in reasonable resemblance to the bell-shaped normal distribution
in this Excel histogram. This histogram was created by building an Excel bar chart from the following
data and formulas:

An Excel histogram created using formulas and a bar chart is automatically updated when the input data
changes. An Excel histogram created with the Histogram tool in the Data Analysis ToolPak is not
automatically updated when input data is changed and must be rerun to update the histogram.

Normal Probability Plot of Residuals in Excel


A Normal Probability Plot created in Excel of the Residuals is shown as follows:

The Normal Probability Plot of the Residuals provides strong evidence that the Residuals are normally-
distributed. The more closely the graph of the Actual Residual values (in red) resembles a straight line,
the closer the Residuals are to being normally-distributed. The Actual Residual values here are very
close to forming a straight line.

Kolmogorov-Smirnov Test For Normality of Residuals in Excel
The Kolmogorov-Smirnov Test is a hypothesis test that is widely used to determine whether a data
sample is normally-distributed. The Kolmogorov-Smirnov Test calculates the distance between the
Cumulative Distribution Function (CDF) of each data point and what the CDF of that data point would be if
the sample were perfectly normally-distributed. The Null Hypothesis of the Kolmogorov-Smirnov Test
states that the distribution of actual data points matches the distribution that is being tested. In this case
the data sample is being compared to the normal distribution.
The largest distance between the CDF of any data point and its expected CDF is compared to
Kolmogorov-Smirnov Critical Value for a specific sample size and Alpha. If this largest distance exceeds
the Critical Value, the Null Hypothesis is rejected and the data sample is determined to have a different
distribution than the tested distribution. If the largest distance does not exceed the Critical Value, we
cannot reject the Null Hypothesis, which states that the sample has the same distribution as the tested
distribution.
F(Xk) = CDF(Xk) for normal distribution
F(Xk) = NORM.DIST(Xk, Sample Mean, Sample Stan. Dev., TRUE)
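As a sketch of the K-S computation, assuming the residuals have been sorted in ascending order into the hypothetical range A2:A21 (n = 20):
Expected CDF in B2, copied down: =NORM.DIST(A2, AVERAGE($A$2:$A$21), STDEV.S($A$2:$A$21), TRUE)
Actual CDF in C2, copied down: =(ROW()-ROW($A$1))/20
Difference in D2, copied down: =ABS(B2-C2)
Max difference: =MAX(D2:D21)
K-S implementations differ in how the empirical CDF step is defined; the k/n form shown here is one common choice.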
Residual Data

0.1319 = Max Difference Between Actual and Expected CDF


20 = n = Number of Data Points
0.05 = α

The Null Hypothesis Stating That the Residuals Are Normally-Distributed Cannot Be Rejected
The Null Hypothesis for the Kolmogorov-Smirnov Test for Normality, which states that the sample data
are normally-distributed, is rejected only if the maximum difference between the expected and actual CDF
of any of the data points exceeds the Critical Value for the given n and α. That is not the case here.
The Max Difference Between the Actual and Expected CDF for the Residual data (0.1319) is significantly less
than the Kolmogorov-Smirnov Critical Value for n = 20 (0.29) at α = 0.05, so the Null Hypothesis of the
Kolmogorov-Smirnov Test for the Residual data cannot be rejected.

Anderson-Darling Test For Normality of Residuals in Excel
The Anderson-Darling Test is a hypothesis test that is widely used to determine whether a data sample is
normally-distributed. The Anderson-Darling Test calculates a test statistic based upon the actual value of
each data point and the Cumulative Distribution Function (CDF) of each data point if the sample were
perfectly normally-distributed.
The Anderson-Darling Test is considered to be slightly more powerful than the Kolmogorov-Smirnov test
for the following two reasons:
The Kolmogorov-Smirnov test is distribution-free, i.e., its critical values are the same for all distributions
tested. The Anderson-Darling test requires critical values calculated for each tested distribution and is
therefore more sensitive to the specific distribution.
The Anderson-Darling test gives more weight to values in the outer tails than the Kolmogorov-Smirnov
test. The K-S test is less sensitive to aberrations in outer tail values than the A-D test.
If the test statistic exceeds the Anderson-Darling Critical Value for a given Alpha, the Null Hypothesis is
rejected and the data sample is determined to have a different distribution than the tested distribution. If
the test statistic does not exceed the Critical Value, we cannot reject the Null Hypothesis, which states
that the sample has the same distribution as the tested distribution.
F(Xk) = CDF(Xk) for normal distribution
F(Xk) = NORM.DIST(Xk, Sample Mean, Sample Stan. Dev., TRUE)

Residual Data

Test Statistic A = 1.279


The above test statistic should be adjusted in the general case that both the population mean and the
population variance are unknown.
When the population mean and population variance are unknown, make the following adjustment:
Adjusted Test Statistic A* = (1 + 0.75/n + 2.25/n^2) * A
However, the population mean of the residuals is known to be 0. The population standard deviation of the
residuals is not known.
In this case Test Statistic A should be used and not Adjusted Test Statistic A*.
Reject the Null Hypothesis of the Anderson-Darling Test, which states that the data are normally-
distributed, when the population mean is known but the population standard deviation is not known if any
of the following are true:
A > 1.760 When Level of Significance (α) = 0.10
A > 2.323 When Level of Significance (α) = 0.05
A > 3.69 When Level of Significance (α) = 0.01

The Null Hypothesis Stating That the Residuals Are Normally-Distributed Cannot Be Rejected
The Null Hypothesis for the Anderson-Darling Test for Normality, which states that the sample data are
normally-distributed, is rejected if the Test Statistic (A) exceeds the Critical Value for the given n and α.
The Test Statistic (A) for the Residual data is significantly less than the Anderson-Darling Critical Value
for α = 0.05, so the Null Hypothesis of the Anderson-Darling Test for the Residual data is not rejected.
That Null Hypothesis states that the residuals are normally-distributed.

Shapiro-Wilk Test For Normality in Excel


The Shapiro-Wilk Test is a hypothesis test that is widely used to determine whether a data sample is
normally-distributed. A test statistic W is calculated. If this test statistic is less than a critical value of W for
a given level of significance (alpha) and sample size, the Null Hypothesis which states that the sample is
normally-distributed is rejected.
The Shapiro-Wilk Test is a robust normality test and is widely used because of its slightly superior
performance against other normality tests, especially with small sample sizes. Superior performance
means that it correctly rejects the Null Hypothesis (that the data are normally-distributed) when the data
are in fact not normally-distributed a slightly higher percentage of the time than most other normality
tests, particularly at small sample sizes.
The Shapiro-Wilk normality test is generally regarded as being slightly more powerful than the Anderson-
Darling normality test, which in turn is regarded as being slightly more powerful than the Kolmogorov-
Smirnov normality test.
Residual Data

0.972299 = Test Statistic W

0.905 = W Critical for the following n and Alpha
20 = n = Number of Data Points
0.05 = α
The Null Hypothesis Stating That the Data Are Normally-Distributed Cannot Be Rejected
Test Statistic W (0.972299) is larger than W Critical 0.905. The Null Hypothesis therefore cannot be
rejected. There is not enough evidence to state that the data are not normally-distributed with a
confidence level of 95 percent.

Correctable Reasons Why Normal Data Can Appear Non-Normal


If a normality test indicates that data are not normally-distributed, it is a good idea to do a quick evaluation
of whether any of the following factors have caused normally-distributed data to appear to be non-
normally-distributed:
1) Outliers – Too many outliers can easily skew normally-distributed data. An outlier can often be
removed if a specific cause of its extreme value can be identified. Some outliers are expected in normally-
distributed data.
2) Data Has Been Affected by More Than One Process – Variations to a process such as shift changes
or operator changes can change the distribution of data. Multiple modal values in the data are common
indicators that this might be occurring. The effects of different inputs must be identified and eliminated
from the data.
3) Not Enough Data – Normally-distributed data will often not assume the appearance of normality until
at least 25 data points have been sampled.
4) Measuring Devices Have Poor Resolution – Sometimes (but not always) this problem can be solved
by using a larger sample size.
5) Data Approaching Zero or a Natural Limit – If a large number of data values approach a limit such
as zero, calculations using very small values might skew computations of important values such as the
mean. A simple solution might be to raise all the values by a certain amount.
6) Only a Subset of a Process' Output Is Being Analyzed – If only a subset of data from an entire
process is being used, a representative sample is not being collected. Normally-distributed results will
not appear normally-distributed if a representative sample of the entire process is not collected.

Determining If Any Input Variables Are Too Highly Correlated With Residuals
To determine whether the Residuals have significant correlation with any other variables, an Excel
correlation matrix can be created. An Excel correlation matrix will simultaneously calculate correlations
between all variables. The Excel correlation matrix for all variables in this regression is shown as follows:

The correlation matrix shows that the correlations between the Residuals and each of the other variables are
low. Correlation values range from -1 to +1; values near zero indicate very low correlation. This correlation
matrix was created by inserting the following information into the Excel correlation data analysis tool
dialogue box as follows:

Determining If There Are Enough Data Points
Violations of important assumptions such as normality of the Residuals are difficult to detect if too few data
points exist. 20 data points is sufficient; 10 data points is probably on the borderline of being too few. All
of the normality tests become significantly more powerful (accurate) as the data size grows from 15 to 20 data
points. Normality of data is very difficult to assess accurately when only 10 data points are present.
All required regression assumptions concerning the Residuals have been met. The next step is to
evaluate the remainder of the Excel regression output.

Step 7 – Evaluate the Excel Regression Output
The Excel regression output that will now be evaluated is as follows:

Interpretation of the most important individual parts of the Excel regression output is as follows:

Regression Equation

The regression equation is shown to be the following:
Yi = b0 + b1 * X1i + b2 * X2i
Power Consumption (kW) = 37,123,164 + 10.234 (Number of Production Machines X 1,000) + 3.573
(New Employees Added in Last 5 Years X 1,000)
Note that the scaling of the independent variables in Step 3 ensured that the calculated coefficients in the
regression equation are of reasonable size (between 1 and 10).
For example, if a company had 1,000 production machines and added 500 new employees in the last 5
years, the company's annual power consumption would be predicted as follows:
Annual Power Consumption (kW) = 37,123,164 + 10.234 (Number of Production Machines X 1,000) +
3.573 (New Employees Added in Last 5 Years X 1,000)
Annual Power Consumption (kW) = 37,123,164 + 10.234 (1,000 X 1,000) + 3.573 (500 X 1,000)
Annual Power Consumption = 49,143,690 kW
It is very important to note that a regression equation should never be extrapolated outside the range of
the original data set used to create the regression equation. The inputs for a regression prediction should
not be outside of the following ranges of the original data set:
Number of machines: 442 to 28,345
New employees added in last 5 years: -1,460 to 7,030
A simple example to illustrate why a regression line should never be extrapolated is as follows: Imagine
that the height of a child was recorded every six months from ages one to seventeen. Most people stop
growing in height at approximately age seventeen. If a regression line was created from that data and
then extrapolated to predict that person’s height at age 50, the regression equation might predict that the
person would be fifteen feet tall. Conditions are often very different outside the range of the original data
set.
Extrapolation of a regression equation beyond the range of the original input data is one of the most
common statistical mistakes made.

R Square –The Equation’s Overall Predictive Power

R Square tells how closely the Regression Equation approximates the data. R Square tells what
percentage of the variance of the output variables is explained by the input variables. We would like to
see at least 0.6 or 0.7 for R Square. The remainder of the variance of the output is unexplained. R Square
here is a relatively high value of 0.963. This indicates that 96.3 percent of the total variance in the output
variable (annual power consumption) is explained by the variance of the input variables (number of
production machines and number of new employees added in the last five years).
Adjusted R Square is quoted more often than R Square because it is more conservative. Adjusted R
Square only increases when new independent variables are added to the regression analysis if those new
variables increase an equation’s predictive ability. When you are adding independent variables to the
regression equation, add them one at a time and check whether Adjusted R Square has gone up with the
addition of the new variable. The value of Adjusted R Square here is 0.959.
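As a worked check with the standard formula (n = 20 data points, k = 2 independent variables):
Adjusted R Square = 1 – (1 – R Square)(n – 1)/(n – k – 1)
Adjusted R Square = 1 – (1 – 0.963)(19/17) ≈ 0.959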

Significance of F - Overall p Value and Validity Measure

The Significance of F is the overall p Value of the regression equation. A very small Significance of F
confirms the validity of the Regression Equation. Since the p Value (Significance of F) here is nearly zero,
there is almost no chance that the Regression Equation's apparent relationship is random. This is very strong
evidence of the validity of the overall Regression Equation.
To be more specific, this p value (Significance of F) indicates whether to reject the overall Null Hypothesis
of this regression analysis. The overall Null Hypothesis for this regression equation states that all
coefficients of the independent variables equal zero. In other words, that for this multiple regression
equation:
Y = b0 + b1X1 + b2X2 + … + bkXk
The Null Hypothesis for multiple regression states that the coefficients b1, b2, … , bk all equal zero. The Y
intercept, b0, is not included in this Null Hypothesis.

For this simple regression equation:
Y = b0 + b1X
The Null Hypothesis for simple regression states that the coefficient b1 equals zero. The Y intercept, b0,
is not included in this Null Hypothesis. Coefficient b1 is the slope of the regression line in simple
regression.
In this case, the p Value (Significance of F) is extremely low (6.726657E-13) so we have very strong
evidence that this is a valid regression equation. There is almost no probability that the relationship
shown to exist between the dependent and independent variables (the nonzero values of coefficients b1,
b2, … , bk) was obtained merely by chance.
This low p Value (or corresponding high F Value) indicates that there is enough evidence to reject the
Null Hypothesis of this regression analysis.
The 95 percent Level of Confidence is usually required to reject the Null Hypothesis. This translates to a 5
percent Level of Significance. The Null Hypothesis is rejected if the p Value (Significance of F) is less
than 0.05. If the Null Hypothesis is rejected, the regression output stating that the regression coefficients
b1, b2, … , bk do not equal zero is deemed to be statistically significant.

p Value of Intercept and Coefficients – Measure of Their Validity

The lower the p Value for each, the more likely it is that the Y-Intercept or coefficient is valid. The
Intercept's low p Value of 0.0003 indicates that there is only a 0.03 percent chance that this calculated
value of the Intercept is a random occurrence.
The extremely low p value for the coefficient for the Number_of_Production_Machines indicates that there
is almost no chance that this calculated value of this coefficient is a random occurrence.
The p Value for the Number_of_New_Employees_Added is relatively large. This coefficient cannot be
considered statistically significant (reliable) at a 95 percent certainty level. A 95 percent certainty level
would be the equivalent of a Level of Significance (Alpha) equal to 0.05. The coefficient for the
Number_of_New_Employees_Added would be considered statistically significant at a 0.05 Level of
Significance if its p Value were less than 0.05. This is not the case because this p Value is shown to be
0.2432.
The coefficient for Number_of_Production_Machines can therefore be considered reliable but the coefficient
for Number_of_New_Employees_Added cannot.

Prediction Interval of a Regression Estimate
A prediction interval is a confidence interval about a Y value that is estimated from a regression equation.
A regression prediction interval is a value range above and below the Y estimate calculated by the
regression equation that would contain the actual value of a sample with, for example, 95 percent
certainty.
The Prediction Error for a point estimate of Y is always slightly larger than the Standard Error of the
Regression Equation shown in the Excel regression output directly under Adjusted R Square.
The Standard Error of the Regression Equation is used to calculate a confidence interval about the mean
Y value. The Prediction Error is used to create a confidence interval about a predicted Y value. There will
always be slightly more uncertainty in predicting an individual Y value than in estimating the mean Y
value.
For that reason, a Prediction Interval will always be larger than a Confidence Interval for any type of
regression analysis.
Calculating an exact prediction interval for any regression with more than one independent variable
(multiple regression) involves some pretty heavy-duty matrix algebra. Fortunately there is an easy short-
cut that can be applied to multiple regression that will give a fairly accurate estimate of the prediction
interval.

Prediction Interval Formula


The formula for a prediction interval about an estimated Y value (a Y value calculated from the regression
equation) is found by the following formula:
Prediction Interval = Yest ± t-Valueα/2 * Prediction Error
Prediction Error = Standard Error of the Regression * SQRT(1 + distance value)
Distance value, sometimes called leverage value, is the measure of distance of the combinations of
values, x1, x2,…, xk from the center of the observed data. Calculation of Distance value for any type of
multiple regression requires some heavy-duty matrix algebra. This is given in Bowerman and O’Connell
(1990).
Some software packages such as Minitab perform the internal calculations to produce an exact Prediction
Error for a given Alpha. Excel does not. Fortunately there is an easy substitution that provides a fairly
accurate estimate of Prediction Interval. The following fact enables this:
The Prediction Error for a point estimate of Y is always slightly larger than the Standard Error of the
Regression Equation shown in the Excel regression output directly under Adjusted R Square.
The Standard Error (highlighted in yellow in the Excel regression output) is used to calculate a confidence
interval about the mean Y value. The Prediction Error is used to create a confidence interval about a
predicted Y value. There will always be slightly more uncertainty in predicting an individual Y value than in
estimating the mean Y value.

Prediction Interval Estimate Formula


The Prediction Error is always slightly bigger than the Standard Error of a Regression. The Prediction
Error can be estimated with reasonable accuracy by the following formula:
Prediction Errorest = P.E.est
P.E.est = (Standard Error of the Regression)* 1.1

Prediction Intervalest = Yest ± t-Valueα/2 * P.E.est


Prediction Intervalest = Yest ± t-Valueα/2 * (Standard Error of the Regression)* 1.1

Prediction Intervalest = Yest ± TINV(α, dfResidual) * (Standard Error of the Regression) * 1.1
The t-value must be calculated using the degrees of freedom, df, of the Residual (highlighted in yellow in
the Excel Regression output). For multiple regression the Residual df equals n – k – 1, where k is the
number of independent variables.
dfResidual = n – k – 1 = 20 – 2 – 1 = 17
t-Valueα/2,df=17 = TINV(0.05, 17) = 2.1098
In Excel 2010 and later, TINV(α, df) can be replaced by T.INV(1 – α/2, df).

Example in Excel
Create a 95 percent prediction interval about the estimated value of Y if a company had 1,000
production machines and added 500 new employees in the last 5 years.
In this case the company's annual power consumption would be predicted as follows:
Yest = Annual Power Consumption (kW) = 37,123,164 + 10.234 (Number of Production Machines X 1,000)
+ 3.573 (New Employees Added in Last 5 Years X 1,000)
Yest = Annual Power Consumption (kW) = 37,123,164 + 10.234 (1,000 X 1,000) + 3.573 (500 X 1,000)
Yest = Estimated Annual Power Consumption = 49,143,690 kW
Yest = 49,143,690
Prediction Intervalest = Yest ± TINV(α, dfResidual) * (Standard Error of the Regression) * 1.1
The Standard Error of the Regression is found to be 21,502,161 in the Excel regression output as follows:

Prediction Intervalest = 49,143,690 ± TINV(0.05, 17) * (21,502,161) * 1.1


Prediction Intervalest = 49,143,690 ± 49,901,785
Prediction Intervalest = [ -758,095, 99,045,475 ]
This is a relatively wide Prediction Interval that results from a large Standard Error of the Regression
(21,502,161).
It is very important to note that a regression equation should never be extrapolated outside the range of
the original data set used to create the regression equation. The inputs for a regression prediction should
not be outside of the following ranges of the original data set:
Number of machines: 442 to 28,345
New employees added in last 5 years: -1,460 to 7,030

Logistic Regression

Overview
Binary logistic regression is a predictive technique that is applied when the dependent variable (y) is
dichotomous (binary), i.e., there are only two possible outcomes. Binary logistic regression calculates the
probability of the event designated as the positive event occurring.
Logistic regression is widely used in many fields. Engineers often use logistic regression to predict the probability of a system or part failing. Marketers use logistic regression to calculate the probability of a prospective customer making a purchase or a subscriber cancelling a subscription. Bankers might use logistic regression to calculate the probability of a homeowner defaulting on a mortgage. Doctors use logistic regression to calculate the probability of a patient surviving trauma or serious disease.
Binary logistic regression is sometimes called Dummy Dependent Variable Regression because the
dependent variable is binary and therefore resembles a dummy variable, which is binary. Dummy
variables are binary variables that must be substituted when categorical independent variables are used
as inputs to multiple linear regression. Multiple linear regression requires that independent variables be
continuous or binary. Categorical independent variables must be converted to binary dummy variables
before they can serve as inputs for multiple linear regression. Another chapter in this book covers this
type of dummy variable regression in detail.

The Goal of Binary Logistic Regression


The goal of binary logistic regression analysis is to create an equation, P(X), that most accurately calculates the probability of the occurrence of binary event X for a given set of inputs X1, X2, …, Xk.
Variable Y describes the observed occurrence of event X. Y takes the value of 1 when event X actually
occurred and the value of 0 when event X did not occur for a given set of inputs X1, X2, …,Xk.
P(X) should calculate a probability close to 1 as often as possible for any given set of inputs for which
event X occurred (Y = 1). P(X) should also calculate a probability close to 0 as often as possible for any
given set of inputs for which event X did not occur (Y = 0).

Allowed Variable Types For Binary Logistic Regression


The dependent variable of binary logistic regression is a categorical variable with two possible outcomes.
The independent variables (the inputs, a.k.a. the predictor variables) can be any of the four variable types: nominal, ordinal, interval, and ratio.
Nominal variables are categorical and are simply arbitrary labels whose order doesn’t matter.
Ordinal variables are categorical variables whose order has meaning but the distance between units is
usually not measurable.
Interval variables have measurable distance between units and a zero point that is arbitrarily chosen.
Fahrenheit and Celsius temperatures are interval data.
Ratio variables have measurable distance between units and a zero point that indicates that there is none
of the variable present. Absolute temperature is an example of ratio data.

Logistic Regression Calculates the Probability of an Event Occurring
Logistic regression calculates the probability of the positive event (the event whose observed occurrence is designated by Y = 1) occurring for a given set of inputs X1, X2, …, Xk.
Binary logistic regression therefore calculates the following conditional probability:
Pr(Y=1 | X1, X2, …, Xk)
This is the probability that the actual observed output, Y, equals 1 given the inputs X1, X2, …, Xk.

The Difference Between Linear Regression and Logistic Regression


Linear regression requires that the dependent variable (y) be continuous. The dependent variable for binary logistic regression is binary and is therefore not continuous. Logistic regression is a method for
calculating a continuous probability for a discontinuous event. A brief description of how that continuous
probability is created follows.

The Relationship Between Probability and Odds


Event X is the event whose actual occurrence is designated by Y = 1. The probability of event X occurring
is given as P(X). The odds of event X occurring are given as O(X). Note that the "X" in P(X), O(X), and Event X is not related to the logistic regression inputs X1, X2, …, Xk.
The relationship between the probability of event X occurring and the odds of event X occurring is given
as follows:
O(X) = P(X) / (1 – P(X))
For example, if the probability of event X occurring is 75 percent, the odds of event X occurring are 3-to-1 (0.75/0.25 = 3).
The odds, O(X), of discontinuous, binary event X occurring can be expressed as a continuous variable by
taking the natural log of the odds. A complicated derivation proving this will not be shown here.

The Logit – The Natural Log of the Odds


The natural log of the odds is called the Logit, L, (pronounced LOH-jit) and is calculated as follows:
Given the following k inputs, X1, X2, …, Xk, and the following k + 1 constants, b0, b1, b2, …, bk, the Logit equals the following:
ln[O(X)] = Logit = L = b0 + b1X1 + b2X2 + …+ bkXk
Since ln[O(X)] = Logit = L,
O(X) therefore equals e^L:
O(X) = e^L = e^(b0 + b1X1 + b2X2 + … + bkXk)
Since O(X) = P(X) / (1 - P(X)), simple algebra can be applied to define P(X) as follows:
P(X) = O(X) / (1 + O(X))
or
P(X) = e^L / (1 + e^L)

With algebraic manipulation, this can also be expressed as the following for occasions when this formula is simpler to work with:
P(X) = 1 / (1 + e^-L)
Keep in mind that P(X) is the conditional probability Pr(Y=1 | X1, X2, …, Xk).

Showing How Closely The Predicted Value Matches The Actual Value
P(X) is the estimated probability of Event X occurring. Variable Y records the actual occurrence of Event
X. The goal of binary logistic regression analysis is to create an equation P(X) that most closely matches
Y for each set of inputs X1, X2, …, Xk.
P(X) should calculate a probability close to 1 as often as possible for any given set of inputs for which
event X occurred (Y = 1). P(X) should also calculate a probability close to 0 as often as possible for any
given set of inputs for which event X did not occur (Y = 0).
The conditional probability Pr(Yi=yi|X1i, X2i, …, Xki) is the probability that the predicted dependent variable yi equals the actual observed value Yi given the values of the independent variable inputs X1i, X2i, …, Xki.
The conditional probability Pr(Yi=yi|X1i,X2i,…Xki) will be abbreviated Pr(Y=y|X) from here forward for
convenience.
The conditional probability Pr(Y=y|X) is calculated by the following formula:
Pr(Y=y|X) = P(X)^Y * [1-P(X)]^(1-Y)
Pr(Y=y|X) can take values between 0 and 1 just like any other probability.
Pr(Y=y|X) = P(X)^Y * [1-P(X)]^(1-Y) is maximized (approaches 1) when P(X) matches Y:
In other words, Pr(Y=y|X) is maximized (approaches 1) when either of the following occur:
1) Y = 1 and P(X) approaches 1
2) Y = 0 and P(X) approaches 0

To demonstrate this, here are several scenarios. In the first two scenarios Y and P(X) are nearly the same and Pr(Y=y|X) is maximized (approaches 1):
Y = 1 and P(X) = 0.995:
Pr(Y=y|X) = P(X)^Y * [1-P(X)]^(1-Y)
Pr(Y=y|X) = 0.995^1 * [1-0.995]^(1-1) = 0.995
Y = 0 and P(X) = 0.005:
Pr(Y=y|X) = P(X)^Y * [1-P(X)]^(1-Y)
Pr(Y=y|X) = 0.005^0 * [1-0.005]^(1-0) = 0.995

In the third scenario Y and P(X) are very different and Pr(Y=y|X) is not maximized (does not approach 1):
Y = 0 and P(X) = 0.45:
Pr(Y=y|X) = P(X)^Y * [1-P(X)]^(1-Y)
Pr(Y=y|X) = 0.45^0 * [1-0.45]^(1-0) = 0.55

LE - The Likelihood Estimation
As explained, the following equation is maximized (approaches 1) when P(X) matches Y:
Pr(Y=y|X) = P(X)^Y * [1-P(X)]^(1-Y)
If that conditional probability is calculated for each data record (each set of inputs and the associated output, Y), the product of all of these conditional probabilities is called the Likelihood Estimation, LE. The Likelihood Estimation is given by the following formula:
Likelihood Estimation = LE = ∏ Pr(Yi=yi|Xi)
LE = ∏ P(Xi)^Yi * [1-P(Xi)]^(1-Yi)
In simple language, the LE is equal to the product of all P(X)^Y * [1-P(X)]^(1-Y) terms calculated for each of the data records.
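As a sketch of this calculation as Excel formulas, assume each record's observed Y is in cell A2 and its calculated P(X) is in cell B2 (hypothetical cell locations):
=B2^A2*(1-B2)^(1-A2)     (one record's Pr(Y=y|X) term; copy down for all records)
=PRODUCT(C2:C21)         (LE for 20 records, assuming the terms above occupy C2:C21)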

MLE – The Maximum Likelihood Estimation


The goal of binary logistic regression analysis is to create an equation P(X) that most accurately calculates the probability of the occurrence of binary event X for a given set of inputs X1, X2, …, Xk:
P(X) = e^L / (1 + e^L)
Logit = L = b0 + b1X1 + b2X2 + … + bkXk
The highest possible value of the Likelihood Estimation, LE, is called the Maximum Likelihood Estimation, the MLE. The specific P(X) equation that maximizes the Likelihood Estimation, LE, to produce the Maximum Likelihood Estimation, the MLE, is the most accurate predictive equation.
The goal is therefore to determine the values of the constants b0, b1, b2, …, bk that create an equation P(X) that maximizes the LE to create the MLE.

LL - The Log-Likelihood Function


The Likelihood Function has been given by the following formula:
LE = ∏ P(Xi)^Yi * [1-P(Xi)]^(1-Yi)
Taking the natural logarithm, ln(), of both sides of that equation creates LL, the Log-Likelihood Function. The formula for the Log-Likelihood Function is as follows:
ln [ LE ] = LL = ln [ ∏ P(Xi)^Yi * [1-P(Xi)]^(1-Yi) ]
LL = ∑ [ Yi*ln(P(Xi)) + (1-Yi)*ln(1-P(Xi)) ]
This is due to the following property of logarithms:
ln( a^b * c^d ) = b*ln(a) + d*ln(c)

MLL – Maximum Log Likelihood Function


It is often more convenient to work with the logarithm of a number than the actual number. That is the case here. Each LE term, P(Xi)^Yi * [1-P(Xi)]^(1-Yi), is a value between zero and one. The MLE is equal to the maximum possible ∏ P(Xi)^Yi * [1-P(Xi)]^(1-Yi). The product of a large number of such terms, e.g., 1,000 terms each between zero and one, would produce an unwieldy small number.
A better solution is to maximize the natural log of the LE. Maximizing the log of the LE involves calculating a sum of terms rather than a product. Maximizing the sum of small terms is much more convenient than maximizing the product of small terms.
The Log-Likelihood Function, LL, is given as follows:
LL = ∑ [ Yi*ln(P(Xi)) + (1-Yi)*ln(1-P(Xi)) ]
The Maximum Log-Likelihood Function, MLL, is the maximum possible value of LL.
The LE is maximized when its natural log, the LL, is maximized since the logarithm is a monotonically increasing function. Two variables are monotonic if they either always move in the same direction or always move in the opposite direction. Two variables are monotonically increasing if one variable always increases when the other increases. Variables X and ln(X) are monotonically increasing because ln(X) always increases when X increases. The maximum value of X will produce the maximum value of ln(X) and vice versa.
The parameters that produce the MLE (the Maximum Likelihood Estimation) also produce the MLL (the Maximum Log-Likelihood Function). In other words, the values of the constants b0, b1, b2, …, bk that create an equation P(X) that maximizes the LE to create the MLE are the same constants that maximize the LL to produce the MLL.

Using the Excel Solver To Calculate the MLL and the Optimal P(X)
The coefficients b0, b1, b2, …, bk that produce the MLL are the same coefficients b0, b1, b2, …, bk that produce the most accurate predictive equation P(X). The ultimate goal of binary logistic regression is to produce the most accurate predictive equation P(X). The Excel Solver is a quick and easy way to calculate the values of the coefficients b0, b1, b2, …, bk that produce the MLL, the Maximum Log-Likelihood Function.
Working step-by-step through the following example will provide clarity to what has just been covered in
this section.

Example of Binary Logistic Regression
The purpose of this example of binary logistic regression is to create an equation that will calculate the
probability that a production machine is currently producing output that conforms to desired specifications
based upon the age of the machine in months and the average number of shifts that the machine has
operated during each week of its lifetime.
Data was collected on 20 similar machines as follows:
1) Whether the machine produces output that meets specifications at least 99 percent of the time. (1 = Machine Meets Spec – It Does Produce Conforming Output at Least 99 Percent of the Time, 0 = Machine Does Not Meet Spec – It Does Not Produce Conforming Output at Least 99 Percent of the Time)
2) The Machine’s Age in Months
3) The Average Number of Shifts That the Machine Has Operated Each Week During Its Lifetime.

Step 1 – Sort the Data
The purpose of sorting the data is to make data patterns more evident. Using the Excel data sorting tool, perform the primary sort on the dependent variable. In this case, the dependent variable is the response variable indicating whether the machine produces conforming output. Perform subordinate sorts (secondary, tertiary, etc.) on the remaining variables.
The following data was sorted initially according to the response variable (Y). The secondary sort was
done according to Machine Age and the tertiary sort was done according to Average Number of Shifts of
Operation Per Week. The results are as follows:

Patterns are evident from the data sort. Machines that did not produce conforming output tended to be the older machines and/or machines that operated during a higher average number of shifts per week.

Step 2 – Calculate a Logit For Each Data Record
Given the following inputs, X1, X2, …, Xk, the Logit equals the following:
Logit = L = b0 + b1X1 + b2X2 + …+ bkXk
If the explanatory variables are Age and Average Number of Shifts, the Logit, L, is as follows:
Logit = L = b0 + b1*Age + b2*(Average Number of Weekly Shifts)
The Excel Solver will ultimately optimize the variables b0, b1, and b2 in order to create an equation that will accurately predict the probability of a machine producing conforming output given the machine's age and average number of operating shifts per week.
The Decision Variables are the variables that the Solver adjusts during the optimization process. It is a good idea to initially set the Solver Decision Variables so that the resulting Logit is well below 20 for each record, because Logits that exceed 20 cause extreme values to occur in later steps of logistic regression. The Solver Decision Variables b0, b1, and b2 have therefore been arbitrarily set to the value of 0.1, which initially produces reasonably small Logits as shown next.
A unique Logit is created for each of the 20 data records based on the initial settings of the Decision
Variables as follows:
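A sketch of the per-record Logit formula, assuming b0, b1, and b2 sit in cells C2, C3, and C4 (as they do in the Solver setup described in Step 6) and that a record's Machine Age and Average Weekly Shifts sit in cells E7 and F7 (hypothetical data locations):
=$C$2+$C$3*E7+$C$4*F7     (the record's Logit; copy down for all 20 records)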

Step 3 – Calculate e^L For Each Data Record
The number e is the base of the natural logarithm. It is approximately equal to 2.71828183 and is the limit of (1 + 1/n)^n as n approaches infinity. e^L must be calculated for each data record. This step will be shown in the image in the next step, Step 4.

Step 4 – Calculate P(X) For Each Data Record
P(X) is the probability of event X occurring. Event X occurs when a machine produces conforming output. P(X) is the probability of a machine producing conforming output.
P(X) = e^L / (1 + e^L)
L = Logit = b0 + b1*X1 + b2*X2 + … + bk*Xk
Calculating e^L and P(X) for each of the data records is done as follows:

e^L can also be calculated in Excel as EXP(L).
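For example, assuming a record's Logit is in cell G7 (a hypothetical location), that record's e^L and P(X) could be computed as:
=EXP(G7)                 (the record's e^L)
=EXP(G7)/(1+EXP(G7))     (the record's P(X))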

Step 5 – Calculate LL, the Log-Likelihood Function
The conditional probability Pr(Yi=yi|X1i, X2i, …, Xki) is the probability that the predicted dependent variable yi equals the actual observed value Yi given the values of the independent variable inputs X1i, X2i, …, Xki. As before, this conditional probability is abbreviated Pr(Y=y|X) for convenience.
The conditional probability Pr(Y=y|X) is calculated by the following formula:
Pr(Y=y|X) = P(X)^Y * [1-P(X)]^(1-Y)
Taking the natural log of both sides yields the following:
ln [ Pr(Y=y|X) ] = Y*ln[ P(X) ] + (1-Y)*ln[ 1-P(X) ]
The Log-Likelihood Function, LL, is the sum of the ln [ Pr(Y=y|X) ] terms for all data records as per the following formula:
LL = ∑ [ Yi*ln(P(Xi)) + (1-Yi)*ln(1-P(Xi)) ]
Calculating LL is done as follows:
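As a sketch, assuming each record's observed Y is in cell A7 and its P(X) is in cell H7 (hypothetical locations), each record's ln [ Pr(Y=y|X) ] term and the total LL could be computed as:
=A7*LN(H7)+(1-A7)*LN(1-H7)     (one record's term; copy down for all records)
=SUM(I7:I26)                   (LL for 20 records, assuming the terms above occupy I7:I26)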

Step 6 – Use the Excel Solver to Calculate MLL, the Maximum Log-Likelihood Function
The objective of Logistic Regression is to find the coefficients of the Logit (b0, b1, b2, …, bk) that maximize LL, the Log-Likelihood Function in cell H30, to produce MLL, the Maximum Log-Likelihood Function.
The functionality of the Excel Solver is fairly straightforward: the Excel Solver adjusts the numeric values
in specific cells in order to maximize or minimize the value in a single other cell.
The cell that the Solver is attempting to maximize or minimize is called the Solver Objective. This is LL in
cell H30.
The cells whose values the Solver adjusts are called the Decision Variables. The Solver Decision Variables are therefore in cells C2, C3, and C4. These contain b0, b1, and b2, the coefficients of the Logit. These cells will be adjusted to maximize LL, which is in cell H30.
The Excel Solver is an add-in that is included with most Excel packages. The Solver must be manually activated by the user before it can be utilized for the first time. Different versions of Excel require different methods of activation for the Solver. The best advice is to search Microsoft's documentation online to locate instructions for activating the add-ins that are included with your version of Excel. YouTube videos are often another convenient source of step-by-step instructions for activating the Solver in your version of Excel. Once activated, the Solver is normally found in the Data tab of versions of Excel from 2007 onward that use the ribbon navigation. Excel 2003 provides a link to the Solver in the drop-down menu under Tools.

These Decision Variables and Objective are entered into the Solver dialogue box as follows:

Make sure not to check the checkbox next to Make Unconstrained Variables Non-Negative.

The GRG Nonlinear Solving Method
The GRG Nonlinear solving method should be selected if any of the equations involving Decision
variables or Constraints is nonlinear and smooth (uninterrupted, continuous, i.e., having no breaks). GRG
stands for Generalized Reduced Gradient and is a long-time, proven, reliable method for solving
nonlinear problems.
The equations on the path to the calculation of the Objective (maximizing LL) involve the calculation of e^L, P(X), and Pr(Y=y|X). Each of these three equations is nonlinear and smooth. An equation is "smooth" if that equation and the derivative of that equation have no breaks (are continuous). The GRG Nonlinear solving method should therefore be selected.
One way to determine whether an equation or function is non-smooth (the graph has a sharp point
indicating that the derivative is discontinuous) or discontinuous (the equation’s graph abruptly changes
values at certain points – the graph is disconnected at these points) is to graph the equation over its
expected range of values.

The Solver Should Be Run Through Several Trials To Ensure an Optimal Solution
When the Solver runs the GRG algorithm, it picks a starting point for its calculations. Each time the Solver
GRG algorithm is run, it picks a slightly different starting point. This is why different answers will often
appear after each run of the GRG Nonlinear solving method. The Solver should be re-run several times until the Objective (LL) is not maximized further. This should produce the best locally optimal values of the Decision Variables (b0, b1, b2, …, bk).
The GRG Nonlinear solving method is guaranteed to produce locally optimal solutions but not globally
optimal solutions. The GRG Nonlinear solving method will produce a Globally Optimal solution if all functions in the path to the Objective and all Constraints are convex. If any of the functions or Constraints is non-convex, the GRG Nonlinear solving method may find only Locally Optimal Solutions.
A function is convex if it has only one peak either up or down. A convex function can always be solved to
a Globally Optimal solution. A function is non-convex if it has more than one peak or is discontinuous.
Non-convex solutions can often be solved only to Locally Optimal solutions.
A Globally Optimal solution is the best possible solution that meets all Constraints. A Globally Optimal
solution might be comparable to Mount Everest since Mount Everest is the highest of all mountains.
A Locally Optimal solution is the best nearby solution that meets all Constraints. It may not be the best overall solution, but it is the best nearby solution. A Locally Optimal solution might be comparable to Mount McKinley, which is the highest mountain in North America but not the highest of all mountains.
The function e^L with L = b0 + b1*X1 + b2*X2 + … + bk*Xk can be non-convex because the inputs X1, X2, …, Xk can be nonlinear. The GRG Nonlinear solving method is therefore only guaranteed to find a Locally Optimal Solution.

How to Increase the Chance That the Solver Will Find a Globally Optimal Solution
There are three ways to increase the chance that the Solver will arrive at a Globally Optimal solution:
The first is to run the Solver multiple times using different sets of values for the Decision Variables. This
option allows you to select initial sets of Decision Variables based on your understanding of the overall
problem and is often the best way to arrive at the most desirable solution.
The second way is to select "Use Multistart." This runs the GRG Solver a number of times and randomly selects a different set of initial values for the Decision Variables during each run. The Solver then presents the best of all of the Locally Optimal solutions that it has found.
The third way is to set constraints in the Solver dialogue box that will force the Solver to try a new set of values. Constraints are limitations manually placed on the Decision Variables. Constraints can be useful if the Decision Variables should be limited to a specific range of values. A Globally Optimal solution will not likely be found by applying constraints, but a more realistic solution can be obtained by limiting the Decision Variables to likely values.

Solver Results
Running the Solver produces the following results for this problem:

MLL, the Maximum Log-Likelihood was calculated to be -6.654560484 when the constants were adjusted
as Solver Decision Variables to the values of:
b0 = 12.48285608
b1 = -0.117031374
b2 = -1.469140055

Step 7 – Test the Solver Output By Running Scenarios
Validate the output by running several scenarios through the Solver results. Each scenario will employ a different combination of input variables X1, X2, …, Xk to produce outputs that should be consistent with the initial data set.
The sort of the initial data showed a pattern that nonconforming product was more likely on older
machines and/or machines that were run more often.
The following three scenarios were run as follows:

Scenario 1

Machine Age = 40 months


Average Number of Weekly Shifts = 7
P(X) = Probability of Conforming Output = 8 percent

Scenario 2

Machine Age = 40 months


Average Number of Weekly Shifts = 4
P(X) = Probability of Conforming Output = 87 percent

Scenario 3

Machine Age = 12 months


Average Number of Weekly Shifts = 7
P(X) = Probability of Conforming Output = 69 percent

The outcomes of these three scenarios are consistent with the patterns apparent in the initial sorted data
set below that nonconforming product was more likely to be produced by older machines and/or
machines that were run more often:

Step 8 – Calculate R Square
A reliable goodness-of-fit calculation is essential for any model. The measures of goodness-of-fit for linear regression are R Square and the related Adjusted R Square. These metrics calculate the percentage of the total variance that can be explained by the combined variance of the input variables, since variances can be added.
R Square is calculated for binary logistic regression in a different way. R Square in this case is based
upon the difference in predictive ability of the logistic regression equation with and without the
independent variables. This is sometimes referred to as pseudo R Square.
A summarization of this method is as follows:

Step 1) Calculate the Maximum Log-Likelihood for the Full Model


The Maximum Log-Likelihood Function, MLL, is calculated for the full model. This has already been done by the Excel Solver in order to determine the constants b0, b1, b2, …, bk that create the most accurate P(X) equation. MLL for the full model is designated as MLLm and has already been calculated to be the following:
MLLm = Maximum Log-Likelihood for Full Model
MLLm = -6.6545

Step 2) Calculate the Maximum Log-Likelihood for the Model With No Explanatory Variables
Calculating the Maximum Log-Likelihood Function for the model with no explanatory variables is done by setting all constants (Solver Decision Variables) except b0 to zero before calculating the MLL. The Maximum Log-Likelihood for the model with no explanatory variables (b1 = b2 = … = bk = 0) is designated as MLL0.
The constant b0 is the Y intercept of the regression equation. This is the only constant that will be included in the calculation of MLL0. The other constants, b1, b2, …, bk, are the coefficients of the input variables X1, X2, …, Xk. Setting the constants b1, b2, …, bk to zero removes all explanatory variables X1, X2, …, Xk. The terms b1*X1, b2*X2, …, bk*Xk will now all equal zero in the Logit (and therefore in the logistic equation P(X)) no matter what the values of the input variables X1, X2, …, Xk are.
Constants b1 and b2 are set to zero as follows before running the Excel Solver to calculate MLL0:

Below is the Solver dialogue box to calculate MLL0. Note that there is only one Solver Decision Variable
(b0 in cell C2) that will be adjusted to find MLL0.

Running the Solver produced the following MLL0:

MLL0 = Maximum Log-Likelihood for Model With Only Intercept and No Explanatory Variables (b1 = b2 = … = bk = 0)
MLL0 = MLLb1=b2=…=bk=0 = -13.8629

Calculating MLL for the full model produced the following:


MLLm = Maximum Log-Likelihood for Full Model
MLLm = -6.6545

Step 3) Calculate R Square
There are three different measures of R Square that are commonly quoted for binary logistic regression.
They are the Log-Linear Ratio R Square, the Cox and Snell R Square, and the Nagelkerke R Square.
The Nagelkerke R Square is generally the largest value of the three and is the preferred metric. The Nagelkerke R Square is preferred over the Cox and Snell R Square because the Cox and Snell R Square has the limitation that it cannot achieve the value of 1.0 as R Square in linear regression can. The Nagelkerke R Square overcomes that limitation.
The calculations of each of the three R Square methods are shown as follows:
MLLm = -6.6545
MLL0 = -13.8629
n = 20

Log-Linear Ratio R Square = R Square L


R Square L = 1 – MLLm/MLL0 = 0.5199

Cox and Snell R Square = R Square CS


R Square CS = 1 – exp[(-2) * (MLLm - MLL0 ) / n ] = 0.5137

Nagelkerke R Square = R Square N


R Square N = [ R Square CS ] / [ 1 – exp( 2 * MLL0 / n ) ] = 0.6849
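As a sketch of these three calculations as Excel formulas, assuming MLLm is in cell B2, MLL0 is in cell B3, and n is in cell B4 (hypothetical locations):
=1-B2/B3                                     (Log-Linear Ratio R Square)
=1-EXP(-2*(B2-B3)/B4)                        (Cox and Snell R Square)
=(1-EXP(-2*(B2-B3)/B4))/(1-EXP(2*B3/B4))     (Nagelkerke R Square)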

These R Square calculations, particularly the preferred Nagelkerke R Square of 0.6849, indicate that the
logistic regression equation, P(X) for the full model, has reasonably good predictive power.

Step 9 – Determine if the Variable Coefficients Are Significant


An essential part of linear regression analysis is the determination of whether the calculated coefficients of the input variables are statistically significant. A variable coefficient is considered to be statistically significant if the probability that the observed coefficient value would arise by chance when the true coefficient is zero is less than the specified level of significance. This probability is shown in the Excel linear regression output as the P-value next to the coefficient. The normal level of significance is α = 0.05. A P-value of less than 0.05 indicates that the variable coefficient is statistically significant if α = 0.05.
The significance of the variable coefficients b1, b2 , …, bk in logistic regression is calculated by different
methods.

The Wald Statistic


Until recently the most common metric used to evaluate the significance of a variable coefficient in binary logistic regression was the Wald Statistic. The Wald Statistic for each coefficient is calculated by the following formula:
Wald Statistic = (Coefficient)^2 / (Standard Error of Coefficient)^2

The standard errors of the coefficients are equal to the square roots of the diagonals of the covariance matrix of the coefficients. This requires a bit of matrix work to compute. The Wald Statistic is approximately distributed according to the Chi-Square distribution with one degree of freedom. The p value for the Wald Statistic is calculated as follows:
p Value = CHIDIST(Wald Statistic, 1)
In Excel 2010 and later, this formula can be replaced by the following:
p Value = CHISQ.DIST.RT(Wald Statistic, 1)
The coefficient is considered statistically significant if its p value is less than the specified level of significance, which is commonly set at 0.05.
The 95 percent confidence interval for the coefficient is calculated in Excel as follows:
95 percent C.I. = Coefficient ± S.E. * NORM.S.INV(1 – α/2)
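A sketch of these calculations as Excel formulas, assuming a coefficient is in cell B2 and its standard error is in cell B3 (hypothetical locations), with α = 0.05:
=B2^2/B3^2                      (Wald Statistic)
=CHISQ.DIST.RT(B2^2/B3^2,1)     (p value of the Wald Statistic)
=B2-B3*NORM.S.INV(0.975)        (lower limit of the 95 percent C.I.)
=B2+B3*NORM.S.INV(0.975)        (upper limit of the 95 percent C.I.)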
The Wald Statistic is currently the main logistic regression metric of variable coefficient significance
calculated by well-known statistical packages such as SAS and SPSS. The reliability of the Wald Statistic
is, however, considered questionable. In the case that a large coefficient is produced, the standard error of the coefficient can be inflated. This will result in an undersized Wald Statistic, which could lead to the conclusion that a significant coefficient is not significant. This is a false negative, which is a Type 2 Error. (A false positive is a Type 1 Error.) An additional reliability issue occurs with the Wald Statistic when the sample size is small. The Wald Statistic is often biased for small sample sizes.
Due to the reliability issues associated with the Wald Statistic, the preferred method to evaluate the
significance of logistic regression variable coefficients is the Likelihood Ratio calculated for each
coefficient.

The Likelihood Ratio


The Likelihood Ratio is a statistical test that compares the likelihood of obtaining the data using a full
model with the likelihood of obtaining the same data with a model that is missing the coefficient being
evaluated. The Likelihood Ratio for logistic regression is a Chi-Square test that compares the goodness
of fit of two models when one of the models is a subset of the other.
The general formula for the Likelihood Ratio is as follows:
Likelihood RatioReduced_Model = -2*MLLReduced_Model + 2*MLLFull_Model
MLLFull_Model is equal to the MLLm that was initially calculated to determine the values of all coefficients b1, b2, …, bk that create the most accurate P(X). This has already been calculated to equal the following:
MLLm = -6.6545
MLLReduced_Model is simply the calculation of MLL with the coefficient being evaluated set to zero.
For example, MLLb1=0 would be the MLL calculated with coefficient b1 set to zero. MLLb2=0 would be the MLL calculated with coefficient b2 set to zero.
The Likelihood Ratio is approximately distributed according to the Chi-Square distribution with the degrees of freedom equal to the number of coefficients that have been set to zero. This will be one.
The p value of the Likelihood Ratio determines whether the removal of the coefficient made a real difference. If the p value is lower than the specified level of significance (usually 0.05), the coefficient is considered significant.
The p value of the Likelihood Ratio is calculated with the following Excel formula:
p Value = CHISQ.DIST.RT(Likelihood RatioReduced_Model, 1)
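A sketch of this calculation as Excel formulas, assuming MLLFull_Model is in cell B2 and MLLReduced_Model is in cell B3 (hypothetical locations):
=-2*B3+2*B2                     (Likelihood Ratio)
=CHISQ.DIST.RT(-2*B3+2*B2,1)    (p value of the Likelihood Ratio)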

Using the Likelihood Ratio to Determine Whether Coefficient b1 Is Significant
The Solver will be used to calculate MLLb1=0. The p Value of the resulting Likelihood Ratio, calculated as CHISQ.DIST.RT(LR b1, 1), will determine whether coefficient b1 is significant. Setting the value of coefficient b1 to zero before calculating MLLb1=0 with the Solver is done as follows:

The Solver dialogue is configured as follows to calculate MLLb1=0. Note that b1 (cell C3) is no longer a
Solver Decision Variable.

Running the Solver produces the following calculation of MLLb1=0.

MLLm = MLLFull_Model = -6.6546


MLLb1=0 = MLLReduced_Model = -10.9104

Likelihood RatioReduced_Model = -2*MLLReduced_Model + 2*MLLFull_Model


Likelihood Ratio b1 = LR b1 = -2*MLLb1=0 + 2*MLLm

LR b1 = 8.5117

This statistic is distributed approximately according to the Chi-Square distribution with its degrees of freedom equal to the difference between the number of variables in the full model and the reduced model. In this case that difference is one variable, so df = 1.
p Value = CHIDIST(LR b1,1) = CHIDIST(8.5117,1) = 0.0035
The very low p Value indicates that LR b1 is statistically significant. Coefficient b1 is therefore significant.

Using the Likelihood Ratio to Determine Whether Coefficient b2 Is Significant
The Solver will be used to calculate MLLb2=0. The p Value of the resulting Likelihood Ratio, calculated as CHISQ.DIST.RT(LR b2, 1), will determine whether coefficient b2 is significant. Setting the value of coefficient b2 to zero before calculating MLLb2=0 with the Solver is done as follows:

The Solver dialogue is configured as follows to calculate MLLb2=0. Note that b2 (cell C4) is no longer a Solver Decision Variable.

Running the Solver produces the following calculation of MLLb2=0.

MLLm = MLLFull_Model = -6.6546


MLLb2=0 = MLLReduced_Model = -9.5059

Likelihood RatioReduced_Model = -2*MLLReduced_Model + 2*MLLFull_Model


Likelihood Ratio b2 = LR b2 = -2*MLLb2=0 + 2*MLLm

LR b2 = 5.7025

This statistic is distributed approximately according to the Chi-Square distribution with its degrees of
freedom equal to the difference between the number of variables in the full model and the reduced model.
In this case that difference is one variable so df = 1.
p Value = CHISQ.DIST.RT(LR b2,1) = CHISQ.DIST.RT(5.7025,1) = 0.0169
The low p Value (below 0.05) indicates that LR b2 is statistically significant. Coefficient b2 is therefore significant.

Step 10 – Create a Classification Table
Perhaps the most easily understood and intuitive way to present the results of binary logistic regression is
to create a classification table of the output as follows:

A probability of greater than 0.500 is recorded as a "1" in the Predicted column. The classification table can be set up to automatically create the columns of 1's by using Excel If-Then-Else statements. An example of an If-Then-Else Excel formula that might be placed in the upper-left cell of the table under the heading "1 Predicted Correctly" might be the following:
=IF(AND(A5=1,E5>0.5),1,"")

One overall metric that is commonly calculated is the PAC, the Percentage Accuracy in Classification.
PAC = [ (number of correct positives) + (number of correct negatives) ] / [total observations]
PAC = [ 8 + 9 ] / 20 = 85 percent
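As a sketch, the PAC can be computed directly with Excel's COUNTIFS function. Assuming the observed Y values are in cells A5:A24 and the predicted probabilities P(X) are in cells E5:E24 (hypothetical locations consistent with the classification formula shown above):
=(COUNTIFS(A5:A24,1,E5:E24,">0.5")+COUNTIFS(A5:A24,0,E5:E24,"<=0.5"))/COUNT(A5:A24)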

Step 11 – Determine if the Overall Logistic Regression Equation Is Significant
Another goodness-of-fit test commonly applied to logistic regression results is the Hosmer-Lemeshow
test. This is a Chi-Square Goodness-Of-Fit test that quantifies how closely the predicted results match the
actual observations. The test can be summarized as follows:
The total number of observations is split up into ten groups, called deciles. The number of expected (predicted) positives and negatives in each decile is compared with the observed number of positives and negatives in each decile. The comparison of expected numbers and observed numbers produces a test statistic called the Chi-Square Statistic. A p Value is then derived which determines whether or not the model is a good fit.
A large p Value indicates that the difference between the number of observed and expected values is
insignificant and the model is therefore considered valid. If the p Value is smaller than the specified level
of significance (usually set at 0.05), the difference between the number of observed and expected values
is statistically significant and the model is therefore considered not valid.

Details of the Hosmer-Lemeshow test are as follows:


The data should be divided up into 10 equally-sized groups called deciles or bins. Produce the following four counts of the data in each bin:
- Positive values observed in that bin
- Positive values expected in that bin
- Negative values observed in that bin
- Negative values expected in that bin

Arrange all of that data as shown in the following diagram. Place the positive observed and expected values together on one side. Place the negative observed and expected values together on the other side. This is shown as follows:

A Chi-Square Goodness-Of-Fit test requires that the average number of values in each “Expected” bin is
at least 5 and that every “Expected” bin has a value of at least 1.
This test suffers when the total number of observations is not large. Test creators David Hosmer and
Stanley Lemeshow recommend that the minimum number of observations be at least 200.
This test is performed almost exactly like any other Chi-Square Goodness-Of-Fit test except the degrees
of freedom equals the number of bins – 2. In this case, that would be as follows:
df = Number of bins – 2 = 10 – 2 = 8
Calculate the following for each positive observed/expected group and for each negative observed/expected group:
(Number observed – Number expected)^2 / (Number expected)
Calculate the test statistic called the Chi-Square Statistic, Χ²:
Χ² = ∑ (Number observed – Number expected)^2 / (Number expected)
Χ² = 6.08418
This test statistic, Χ², is distributed approximately according to the Chi-Square distribution with (Number of bins) – 2 degrees of freedom if the average number of values in each "Expected" bin is at least 5 and every "Expected" bin has a value of at least 1.
A p Value can be derived from the Chi-Square Statistic as follows:
p Value = CHISQ.DIST.RT(Χ², 8) = CHISQ.DIST.RT(6.08418, 8) = 0.63780
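A sketch of these two calculations as Excel formulas, assuming the observed positive and negative counts per bin are in cells B2:B11 and D2:D11 and the corresponding expected counts are in cells C2:C11 and E2:E11 (hypothetical locations):
=SUMPRODUCT((B2:B11-C2:C11)^2/C2:C11)+SUMPRODUCT((D2:D11-E2:E11)^2/E2:E11)     (Chi-Square Statistic)
=CHISQ.DIST.RT(F2,8)     (p value, assuming the Chi-Square formula above is in cell F2)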
This p Value states that there is a 63.78 percent chance that the difference between the observed and expected values is merely a random result and is not significant. The model is therefore considered to be a good model because the predicted values appear to be a good fit to the observed values. The Null Hypothesis stating that there is no difference between the Expected and Observed values cannot be rejected.
A small p Value would indicate that the model was not that good of a fit.
The p Value indicates the percentage of the area under the Chi-Square distribution curve that is to the right of the Chi-Square Statistic of 6.08418. This is illustrated in the following diagram.

In Excel 2010 and later, the formula CHIDIST(Χ², df) can be replaced with the following formula:
CHISQ.DIST.RT(Χ², df)

Single-Factor ANOVA in Excel

Overview
Single-factor ANOVA is used to determine if there is a real difference between three or more sample
groups of continuous data. ANOVA answers the following question: Is it likely that all sample groups
came from the same population?
Single-factor ANOVA is useful in the following two circumstances:
Determining if three or more independent samples are different. In this case Single-Factor ANOVA
might be used to determine whether there is a real difference between the test scores of three or more
separate groups of people. Another example would be to use Single-Factor ANOVA to determine whether
there is a real difference between retail sales of groups of stores in different regions.
Determining if three or more different treatments applied to similar groups have produced
different results. A common example for this case is to compare test scores from groups that underwent
different training programs.

ANOVA = Analysis of Variance


ANOVA stands for Analysis of Variance. ANOVA determines whether or not all sample groups are likely
to have come from the same population by performing a comparison of the variance between sample
groups to the variance within the sample groups.
Single-factor ANOVA operates on groupings of objects that are described by two variables. One of the variables describing each grouped object is a categorical variable. The value of each object's categorical variable determines into which group the object is placed. The other variable describing each object is continuous and is the object's displayed value in the data group.
The categorical variable is sometimes referred to as the independent variable while the continuous variable is sometimes referred to as the dependent variable. In the case of Single-Factor ANOVA, the independent variable simply predicts into which group each object's continuous measurement will be placed. This independent-dependent relationship is different from that in regression because the independent variable does not predict the value of the dependent variable, only the group into which it will be placed.
ANOVA is a parametric test because one of ANOVA's requirements is that the data in each sample group are normally-distributed. ANOVA is relatively robust against minor deviations from normality. When normality of sample group data cannot be confirmed or if the sample data are ordinal instead of continuous, a nonparametric test called the Kruskal-Wallis test should be substituted for ANOVA.
Ordinal data are data whose order matters but whose specific distances between units are not measurable. Customer-rating survey data and Likert-scale data can be examples of ordinal data. These types of data can, however, be treated as continuous data if the distances between successive units are considered equal.

Null and Alternative Hypotheses for Single-factor ANOVA


The Null Hypothesis for Single-Factor ANOVA states that the samples ALL come from the same
population. This would be written as follows:
Null Hypothesis = H0: µ1 = µ2 = … = µk (k equals the number of sample groups)
Note that the Null Hypothesis is not referring to the sample means, s1, s2, …, sk, but to the population means, µ1, µ2, …, µk.

The Alternative Hypothesis for Single-Factor ANOVA states that at least one sample group is likely to have come from a different population. Single-Factor ANOVA does not clarify which groups are different or how large any of the differences between the groups are; it only states whether at least one sample group is likely to have come from a different population.
Alternative Hypothesis = H1: µi ≠ µj for some i and j

Single-Factor ANOVA vs. Two-Sample, Pooled t-Test


Single-Factor ANOVA is nearly the same test as the two-independent-sample, pooled t-test. The major difference is that Single-Factor ANOVA can be used to compare more than two sample groups. Performing Single-Factor ANOVA or a two-independent-sample, pooled t-test on the same two sample groups will produce exactly the same results.
As stated, ANOVA compares the variance between the sample groups to the variance within the sample groups. If the ratio of the variance between sample groups to the variance within sample groups is high enough, the samples are said to be different from each other.
Another way to understand ANOVA (or the two-independent-sample, pooled t-test) is to state that the sample groups become easier to tell apart as the sample groups become more spread out from each other or as each of the sample groups becomes smaller and tighter. That might be more intuitive if presented visually.

Below are box plots of three sample groups:

Each of the sample groups is easy to differentiate from the others. The measures of spread - standard deviation and variance - are shown for each sample group. Remember that variance equals standard deviation squared. Each sample group is a small, tightly-bunched group as a result of having a small standard deviation.
If each sample group’s spread is increased (widened), the sample groups become much harder to
differentiate from each other. The graph shown below is of three sample groups having the same means
as above but much wider spread. The between-groups variance has remained the same but the within-
groups variance has increased.

It is easy to differentiate the sample groups in the top graph but much less easy to differentiate the
sample groups in the bottom graph simply because the sample groups in the bottom graph have much
wider spread.
In statistical terms, one could say that it is easy to tell that the samples in the top graph were drawn from
different populations. It is much more difficult to say whether the sample groups in the bottom graph were
drawn from different populations.
That is the underlying principle behind both t-tests and ANOVA tests. The main purpose of t-tests and ANOVA tests is to determine whether samples are from the same population or from different populations. The variance (or equivalently, the standard deviation) of the sample groups is what determines how difficult it is to tell the sample groups apart.
The two-independent-sample, pooled t-test is essentially the same test as single-factor ANOVA. The two-
independent-sample, pooled t-test can only be applied to two sample groups at one time. Single-Factor
ANOVA can be applied to three or more groups at one time.

2-Sample One-Way ANOVA = 2-Sample, Pooled t-Test
We will apply both the two-independent sample, pooled t-test and single-factor ANOVA to the first two
samples in each of the above graphs to verify that the results are equivalent.

Sample Groups With Small Variances (the first graph)


Applying a two-independent sample t-test to the first two samples with the small variances would produce
the following result:

This result would have been obtained by filling in the Excel dialogue box as follows:

Running Single-Factor ANOVA on those same two sample groups would produce this result:

This chapter has not yet covered how to perform ANOVA in Excel but this result would have been
obtained by filling in the Excel dialogue box as follows:

Both the t-test and the ANOVA test produce the same result when applied to these two sample groups. (For two groups, the ANOVA F statistic equals the square of the t statistic.) They both produce the same p Value (1.51E-10), which is extremely small. This indicates that the result is statistically significant and that the difference in the means of the two groups is real. More correctly put, it can be stated that there is a very small chance (1.51E-10) that the samples came from the same population and that the result obtained (that their means are different) was merely a random occurrence.

Sample Groups With Large Variances (the second graph)
Applying a two-independent sample t-test to the first two samples with the large variances would produce
the following result:

This result would have been obtained by filling in the Excel dialogue box as follows:

Running Single-Factor ANOVA on those same two sample groups would produce this result:

This chapter has not yet covered how to perform ANOVA in Excel but this result would have been
obtained by filling in the Excel dialogue box as follows:

Both the t-test and the ANOVA test produce the same result when applied to these two sample groups.
They both produce the same p Value (0.230876). This is relatively large. 95 percent is the standard level
of confidence usually required in statistical hypothesis tests to conclude that the results are statistically
significant (real). The p value needs to be less than 0.05 to achieve a 95 percent confidence level that a
difference really exists. The sample groups with the large spread produced a p Value greater than 0.05
and we can therefore not reject the Null Hypothesis which states that the sample groups are the same.
The results are not statistically significant and we cannot conclude that the two samples were not drawn
from the same population.

Single-Factor ANOVA Should Not Be Done By Hand
Excel provides an excellent ANOVA tool that can perform Single-Factor or Two-Factor ANOVA with equal ease. Doing the calculations by hand would be tedious and provide lots of opportunities to make a mistake. Excel produces a very detailed output when the ANOVA tool is run. The end of this chapter will show the example of Single-Factor ANOVA with all calculations performed individually in Excel.
It will probably be clear from viewing this that it is wise to let Excel do the ANOVA calculations. A number of statistics textbooks probably place too much emphasis on teaching the ability to perform the ANOVA equations by hand. In the real world that would not likely be done for Single-Factor ANOVA because the Excel tool is so convenient to use.
The best way to understand Single-Factor ANOVA is to perform an example as follows:

Single-Factor ANOVA Example in Excel
A company was attempting to determine whether there was a difference in the results produced by three different training programs. The three unique training programs had the same objective, and the training results were measured by a single, common test taken by participants at the end of the training.
In this test three groups of similar employees underwent the training. Each of the three groups was put through one of the three training programs so that no two groups were given the same training program. At the end of the training, all participants in each group were given the same test. The groups all had a different number of participants. The test results from all three groups were as follows:

Group 1 had 22 participants, Group 2 had 23 participants, and Group 3 had 19 participants.

Step 1 – Place Data in Excel Group Columns
The Excel Single-Factor ANOVA tool requires that the data be arranged in columns. Each data column
will hold only data whose categorical variable is the same. In this case, all data whose categorical
variable is Group 1 will be in the first column, Group 2 in the second column, and Group 3 data in the third
column.

Quite often the data is not conveniently arranged that way. Very often the data is arranged in one long
column with each row containing each observation’s independent (categorical) variable value and its
dependent (measured) value as follows:

The data now has to be separated into columns so that each column contains data from one level of the
independent variable. In other words, each column will contain a unique group of data that will consist of
all data having a single level of the independent variable. This will be done as follows:

The blank cells now have to be removed from the columns. This is accomplished as follows:

Cell J3 contains the formula:


=IF(ISNUMBER(LARGE($E$3:$E$22,ROW()-ROW($J$2))),LARGE($E$3:$E$22,ROW()-ROW($J$2)),"")
Cell K3 contains the formula:
=IF(ISNUMBER(LARGE($F$3:$F$22,ROW()-ROW($K$2))),LARGE($F$3:$F$22,ROW()-ROW($K$2)),"")
Cell L3 contains the formula:
=IF(ISNUMBER(LARGE($G$3:$G$22,ROW()-ROW($L$2))),LARGE($G$3:$G$22,ROW()-ROW($L$2)),"")
These three formulas are copied down to row 22 to produce the result shown here.

It is easier to work with sorted data columns when performing Single-Factor ANOVA so the data will be
sorted in the next step. Data can be sorted in Excel by copying a single command down a column as
follows:

Step 2 – Remove Extreme Outliers
Calculation of the mean is one of the fundamental computations when performing ANOVA. The mean is unduly affected by outliers. Extreme outliers should be removed before performing ANOVA. Not all outliers should be removed, however. An outlier should be removed if it is obviously extreme and inconsistent with the remainder of the data.

Find Outliers From the Sorted Data


An easy way to spot extreme outliers is to look at the sorted data. Extremely high or low outlier values will
appear at the ends of the sort. A convenient, one-step method to sort a column of data in Excel is shown
here.
The formula in cell H2 is the following:
=IF($D2="","",LARGE($D$2:$D$19,ROW()-ROW($D$1)))
Copy this formula down as shown to create a descending sort of the data in cells D2 to D19. Exchanging the word SMALL for LARGE would create an ascending sort instead of the descending sort performed here.
No extreme outliers are apparent from the sort.

Find Outliers By Standardizing Residuals
Another way to evaluate data for outliers is to calculate the standardized residual value for each data
point. In the case of ANOVA, the residual for each data point is the difference between the data point and
its group mean. The standardized residual value is simply this residual length expressed as the number of
standard deviations.
For example, the value in cell G3 is calculated by the following formula:
=ABS((C3-AVERAGE($C$3:$C$20))/STDEV($C$3:$C$20))
Quite often outliers are considered to be those data that are more than three standard deviations from the
group mean. No data points are that far from the column mean. The farthest data point is only 2.15
standard deviations from its column mean. These numbers are shown as follows:

After obvious outliers have been removed, it is a good idea to visually inspect a box plot of the data to get a better feel for the dispersion between groups (how spread out the group means are) and within the groups (how dispersed the data are within each group).

All data points that are deemed extreme outliers and removed should be recorded. Before an outlier is removed, causes of the outlying value should be considered. It is always a good idea to ensure that no data recording errors or data measurement errors have caused the outlying values. Any reports that record and interpret the results of the ANOVA test should list any outlier values that were removed and the reason that they were removed.

Step 3 – Verify Required Assumptions


Single-Factor ANOVA Required Assumptions
Single-Factor ANOVA has six required assumptions whose validity should be confirmed before this test is
applied. The six required assumptions are the following:
1) Independence of Sample Group Data Sample groups must be differentiated in such a way that there
can be no cross-over of data between sample groups. No data observation in any sample group could
have been legitimately placed in another sample group. No data observation affects the value of another
data observation in the same group or in a different group. This is verified by an examination of the test
procedure.
2) Sample Data Are Continuous Sample group data (the dependent variable’s measured value) can be
ratio or interval data, which are the two major types of continuous data. Sample group data cannot be
nominal or ordinal data, which are the two major types of categorical data.

3) Independent Variable is Categorical The determinant of which group each data observation belongs
to is a categorical, independent variable. Single-factor ANOVA uses a single categorical variable that has
at least two levels. All data observations associated with each variable level represent a unique data
group and will occupy a separate column on the Excel worksheet.
4) Extreme Outliers Removed If Necessary ANOVA is a parametric test that relies upon calculation of
the means of sample groups. Extreme outliers can skew the calculation of the mean. Outliers should be
identified and evaluated for removal in all sample groups. Occasional outliers are to be expected in
normally-distributed data but all outliers should be evaluated to determine whether their inclusion will
produce a less representative result of the overall data than their exclusion.
5) Normally-Distributed Data In All Sample Groups Single-factor ANOVA is a parametric test having the required assumption that the data from each sample group come from a normally-distributed population. Each sample group's data should be tested for normality. Normality testing becomes significantly less powerful (accurate) when a group's size falls below 20. An effort should be made to obtain group sizes that exceed 20 to ensure that normality tests will provide accurate results.
6) Relatively Similar Variances In All Sample Groups Single-Factor ANOVA requires that sample
groups are obtained from populations that have similar variances. This requirement is often worded to
state that the populations must have equal variances. The variances do not have to be exactly equal but
do have to be similar enough so the variance testing of the sample groups will not detect significant
differences. Variance testing becomes significantly less powerful (accurate) when a group's size falls
below 20. An effort should be made to obtain group sizes that exceed 20 to ensure that variance tests will
provide accurate results.

Determining If Sample Groups Are Normally-Distributed


There are a number of normality tests that can be performed on each group's data. The preferred normality test, because it is considered to be more powerful (accurate) than the others, particularly with smaller sample sizes, is the Shapiro-Wilk test.

Shapiro-Wilk Test For Normality in Excel


The Shapiro-Wilk Test is a hypothesis test that is widely used to determine whether a data sample is
normally-distributed. A test statistic W is calculated. If this test statistic is less than a critical value of W for
a given level of significance (alpha) and sample size, the Null Hypothesis, which states that the sample is normally-distributed, is rejected.
The Shapiro-Wilk Test is a robust normality test and is widely-used because of its slightly superior performance against other normality tests, especially with small sample sizes. Superior performance means that it correctly rejects the Null Hypothesis of normality for data that are actually non-normal a slightly higher percentage of the time than most other normality tests, particularly at small sample sizes.
The Shapiro-Wilk normality test is generally regarded as being slightly more powerful than the Anderson-
Darling normality test, which in turn is regarded as being slightly more powerful than the Kolmogorov-
Smirnov normality test.

Shapiro-Wilk Normality Test of Group 1 Test Scores

0.964927 = Test Statistic W


0.911 = W Critical for the following n and Alpha
22 = n = Number of Data Points
0.05 = α
The Null Hypothesis Stating That the Data Are Normally-Distributed Cannot Be Rejected
Test Statistic W (0.964927) is larger than W Critical 0.911. The Null Hypothesis therefore cannot be
rejected. There is not enough evidence to state that Group 1 data are not normally-distributed with a
confidence level of 95 percent.

Shapiro-Wilk Normality Test of Group 2 Test Scores

0.966950 = Test Statistic W


0.914 = W Critical for the following n and Alpha
23 = n = Number of Data Points
0.05 = α
The Null Hypothesis Stating That the Data Are Normally-distributed Cannot Be Rejected
Test Statistic W (0.966950) is larger than W Critical (0.914). The Null Hypothesis therefore cannot be
rejected. There is not enough evidence to state that Group 2 data are not normally-distributed with a
confidence level of 95 percent.

Shapiro-Wilk Normality Test of Group 3 Test Scores

0.969332 = Test Statistic W


0.897 = W Critical for the following n and Alpha
18 = n = Number of Data Points
0.05 = α
The Null Hypothesis Stating That the Data Are Normally-distributed Cannot Be Rejected
Test Statistic W (0.969332) is larger than W Critical 0.897. The Null Hypothesis therefore cannot be
rejected. There is not enough evidence to state that Group 3 data are not normally-distributed with a
confidence level of 95 percent.

Correctable Reasons Why Normal Data Can Appear Non-Normal


If a normality test indicates that data are not normally-distributed, it is a good idea to do a quick evaluation
of whether any of the following factors have caused normally-distributed data to appear to be non-
normally-distributed:
1) Outliers – Too many outliers can easily skew normally-distributed data. An outlier can often be
removed if a specific cause of its extreme value can be identified. Some outliers are expected in normally-
distributed data.
2) Data Has Been Affected by More Than One Process – Variations to a process such as shift changes
or operator changes can change the distribution of data. Multiple modal values in the data are common
indicators that this might be occurring. The effects of different inputs must be identified and eliminated
from the data.
3) Not Enough Data – Normally-distributed data will often not assume the appearance of normality until
at least 25 data points have been sampled.
4) Measuring Devices Have Poor Resolution – Sometimes (but not always) this problem can be solved
by using a larger sample size.

5) Data Approaching Zero or a Natural Limit – If a large number of data values approach a limit such
as zero, calculations using very small values might skew computations of important values such as the
mean. A simple solution might be to raise all the values by a certain amount.
6) Only a Subset of a Process’ Output Is Being Analyzed – If only a subset of data from an entire
process is being used, a representative sample is not being collected. Normally-distributed results would
not appear normally-distributed if a representative sample of the entire process is not collected.

Nonparametric Alternatives To Single-Factor ANOVA For Non-Normal Data


When groups cannot be shown to all have normally-distributed data, a nonparametric test called the
Kruskal-Wallis test should be performed instead of Single-Factor ANOVA. This test will be performed at
the end of this chapter on the original sample data.

Determining If Sample Groups Have Similar Variances


Single-Factor ANOVA requires that the variances of all sample groups be similar. Sample groups that have similar variances are said to be homoscedastic. Sample groups that have significantly different variances are said to be heteroscedastic.
A rule-of-thumb is as follows: Variances are considered similar if the standard deviation of any one group
is no more than twice as large as the standard deviation of any other group. That is the case here as the
following are true:
s1 = Group1 standard deviation = 1.495
s2 = Group2 standard deviation = 1.514
s3 = Group3 standard deviation = 1.552
The variances of all three groups are very similar. A quick look at the box plot of the data would have
confirmed that as well.
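This rule-of-thumb can be verified with a few worksheet formulas. A minimal sketch, assuming the three groups occupy the hypothetical ranges A2:A23, B2:B24, and C2:C19:

=STDEV.S(A2:A23) returns s1 ≈ 1.495
=STDEV.S(B2:B24) returns s2 ≈ 1.514
=STDEV.S(C2:C19) returns s3 ≈ 1.552
=MAX(1.495,1.514,1.552)/MIN(1.495,1.514,1.552) returns ≈ 1.04

A largest-to-smallest ratio at or below 2 satisfies the rule-of-thumb.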
Two statistical tests are commonly performed when it is necessary to evaluate the equality of variances in
sample groups. These tests are Levene’s Test and the Brown-Forsythe Test. The Brown-Forsythe Test is
more robust against outliers but Levene’s Test is the more popular test.

Levene’s Test in Excel For Sample Variance Comparison
Levene’s Test is a hypothesis test commonly used to test for the equality of variances of two or more
sample groups. Levene’s Test is much more robust against non-normality of data than the F Test. That is
why Levene’s Test is nearly always preferred over the F Test as a test for variance equality.
The Null Hypothesis of Levene’s Test states that the average distance to the sample mean is the same for each sample group. Acceptance of this Null Hypothesis implies that the variances of the sampled groups are the same. The distance to the mean for each data point in each of the three sample groups is shown as follows:

Levene’s Test involves performing Single-Factor ANOVA on the groups of distances to the mean. This
can be easily implemented in Excel by applying the Excel data analysis tool ANOVA: Single Factor.
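The distance columns that feed the tool can be built with a single formula per group. A minimal sketch, assuming Group 1's data occupy the hypothetical range A2:A23:

=ABS(A2 - AVERAGE(A$2:A$23))

Copying this formula down, and repeating it for the other two groups' ranges, produces the three columns of distances to the group means on which the ANOVA: Single Factor tool is then run.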
Here is the completed dialogue box for this test:

Applying this tool on the above data produces the following output:

The Null Hypothesis of Levene’s Test states that the average distances to the mean for the three groups are the same. Acceptance of this Null Hypothesis would imply that the sample groups have similar variances. The p Value shown in the Excel ANOVA output equals 0.9566. This is much larger than the Alpha (0.05) that is typically used for an ANOVA Test, so the Null Hypothesis cannot be rejected.
We therefore conclude as a result of Levene’s Test that the variances are the same or, at least, that we don’t have enough evidence to state that the variances are different. Levene’s Test is sensitive to outliers because it relies on the sample mean, which can be unduly affected by outliers. A very similar test called the Brown-Forsythe Test relies on sample medians and is therefore much less affected by outliers than Levene’s Test is, and much less affected by non-normality than the F Test is.

Brown-Forsythe Test in Excel For Sample Variance Comparison
The Brown-Forsythe Test is a hypothesis test commonly used to test for the equality of variances of two or more sample groups. The Null Hypothesis of the Brown-Forsythe Test states that the average distance to the sample median is the same for each sample group. Acceptance of this Null Hypothesis implies that the variances of the sampled groups are similar. The distance to the median for each data point of the three sample groups is shown as follows:

The Brown-Forsythe Test involves performing Single-Factor ANOVA on the groups of distances to the
median. This can be easily implemented in Excel by applying the Excel data analysis tool ANOVA:
Single Factor. Here is the completed dialogue box for this test:
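As with Levene's Test, the distance columns can be built with one formula per group; the only change is that MEDIAN() replaces AVERAGE(). A minimal sketch, again assuming Group 1's data occupy the hypothetical range A2:A23:

=ABS(A2 - MEDIAN(A$2:A$23))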

Applying this tool on the above data produces the following output:

The Null Hypothesis of the Brown-Forsythe Test states that the average distances to the median for the
three groups are the same. Acceptance of this Null Hypothesis would imply that the sample groups have
similar variances. The p Value shown in the Excel ANOVA output equals 0.9582. This is much larger than
the Alpha (0.05) that is typically used for an ANOVA Test so the Null Hypothesis cannot be rejected.
We therefore conclude as a result of the Brown-Forsythe Test that the variances are the same or, at
least, that we don’t have enough evidence to state that the variances are different.
Each of these two variance tests can be considered relatively equivalent to the other.

Alternative Tests To Single-Factor ANOVA When Group Variances Are Not Similar
When groups cannot be shown to have homogeneous (similar) variances, either Welch’s ANOVA or the
Brown-Forsythe F test should be used in place of Single-Factor ANOVA. Both of these tests will be
performed on the same data set near the end of this chapter.

Step 4 – Run the Single-Factor ANOVA Tool in Excel
The Single-Factor ANOVA tool can be found in Excel 2007 and later by clicking the Data Analysis link
located under the Data tab. In Excel 2003, the Data Analysis link is located in the Tools drop-down menu.
Clicking Anova: Single-Factor brings up the Excel dialogue box for this tool.
The data need to be arranged in contiguous (columns touching with the rows correctly lined up) columns.
The completed dialogue box for this data set would appear as follows:

Hitting OK runs the tool and produces the following output:

The meaning of this output can be understood by reviewing the Null and Alternative Hypotheses that
Single-Factor ANOVA evaluates.
The Null Hypothesis states that all populations from which all samples were drawn have the same mean.
Null Hypothesis = H0: µ1 = µ2 = … = µk (k equals the number of sample groups)
Note that the Null Hypothesis refers not to the sample means, x_bar1, x_bar2, …, x_bark, but to the population means, µ1, µ2, …, µk.
The Alternative Hypothesis for Single-Factor ANOVA states that at least one sample group is likely to
have come from a different population. Single-Factor ANOVA does not clarify which groups are different
or how large any of the differences between the groups are. This Alternative Hypothesis only states
whether at least one sample group is likely to have come from a different population.
Alternative Hypothesis = H1: µi ≠ µj for some i and j

Step 5 – Interpret the Excel Output
The Null Hypothesis is rejected if ANOVA’s calculated p Value is smaller than the designated Level of
Significance (alpha). Alpha is most commonly set at 0.05. In this case the Null Hypothesis would be
rejected because the p Value (0.0369) is smaller than Alpha (0.05).
The exact interpretation of a p Value of 0.0369 is that there is only a 3.69 percent chance that samples
having these values could have been drawn if all of the populations had the same means.
Although a Hypothesis Test can only result in the rejection of the Null Hypothesis, we can conclude with
at least 95 percent certainty that at least one sample has been drawn from a population with a different
mean than the other samples.
ANOVA can only indicate that at least one sample is different; it does not provide specific information about where that difference comes from. Further testing, called Post-Hoc testing, can indicate where the specific differences come from. Post-Hoc testing on this data set will be performed shortly in this chapter.

All Calculations That Created Excel’s One-Way ANOVA Output


Excel’s Single-Factor ANOVA tool works so well that there is no reason to perform the ANOVA
calculations by hand except to understand how they work. The Excel Single-Factor ANOVA output is
once again as follows:

The steps to the calculation of the p Value are as follows:


Calculate SSBetween_Groups and SSWithin_Groups from the original data set.
Calculate MSBetween_Groups and MSWithin_Groups by MS = SS/df
Calculate the F Statistic by F = MSBetween_Groups / MSWithin_Groups
Calculate the p Value by p Value = F.DIST.RT(F, dfBetween, dfWithin)
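As a worked sketch using the values from this chapter's ANOVA output (SSBetween_Groups = 16.079, dfBetween = 2, MSWithin_Groups = 2.306, dfWithin = 60):

MSBetween_Groups = 16.079 / 2 = 8.040
F = 8.040 / 2.306 ≈ 3.486
p Value = F.DIST.RT(3.486, 2, 60) ≈ 0.0369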

Original Data Set

Calculating SSBetween_Groups from the Original Data Set

Calculating SSWithin_Groups from the Original Data Set

Calculating the Remaining Equations to the p Value

Step 6 – Perform Post-Hoc Testing in Excel
The F-test in ANOVA is classified as an omnibus test. An omnibus test is one that tests the overall significance of a model to determine whether a difference exists, but not exactly where the difference is.
ANOVA tests the Null Hypothesis that all of the group means are the same. When this Null Hypothesis is
rejected, further testing must be performed to determine which pairs of means are significantly different.
That type of testing is called Post-Hoc testing.
Post-Hoc testing is a pairwise comparison. Group means are compared two at a time to determine
whether the difference between the pair of means is significant.

The Many Types of Post-Hoc Tests Available


There are many types of Post-Hoc tests available in most major statistical software packages but two
have become the preferred tests. These two are the Tukey-Kramer test and the Games-Howell test. The
Tukey-Kramer test should be used when group variances are similar. The Games-Howell test should be
used when group variances are dissimilar. Both tests can be used when group sizes are unequal.

The Tukey-Kramer test is a slight variation of the well-known Tukey HSD test. The Tukey-Kramer test can be used when group sizes are unequal, which the Tukey HSD test is not designed for.

Post-Hoc Tests Used When Group Variances Are Equal


SPSS lists the following Post-Hoc tests or corrections available when group variances are equal:
LSD
Bonferroni
Sidak
Scheffe
REGWF
REGWQ
S-N-K
Tukey (Tukey’s HSD or Tukey-Kramer)
Tukey’s b
Duncan
Hochberg’s GT2
Gabriel
Waller-Duncan
Dunnett
Of all of the Post-Hoc tests available when group variances are found to be similar, Tukey’s HSD test is
used much more often than the others. A slight variation of Tukey’s HSD called the Tukey-Kramer test is
normally used when group variances are the same but group sample sizes are different. Tukey’s HSD
can only be used when group sizes are exactly the same.
The Tukey test (Tukey’s HSD test or the Tukey-Kramer test) is generally a good choice when group
variances are similar. Hochberg’s GT2 produces the best result when group sizes are very different.
REGWQ is slightly more accurate than the Tukey test but should only be used when group sizes are the
same.

Post-Hoc Tests Used When Group Variances Are Not Equal
SPSS lists the following Post-Hoc tests available when group variances are not equal:
Tamhane’s T2
Dunnett’s T3
Games-Howell
Dunnett’s C
The Games-Howell test is the most widely used of the above and is generally a good choice when group
variances are not similar. The Games-Howell test can be used when group sizes are not the same.

Tukey’s HSD (Honestly Significant Difference) Test


Used When Group Sizes and Group Variances Are Equal
Tukey’s HSD test compares the difference between each pair of group means to determine which
differences are large enough to be considered significant.
Tukey’s HSD test is very similar to a t-test except that it corrects for the experiment-wide error rate, which a t-test does not. The experiment-wide error rate is the increased probability of type 1 errors (false
positives) when multiple comparisons are made.
Tukey’s HSD test can be summarized as follows:
The means of all groups are arranged into as many unique pair combinations as possible. The pair
combination with the largest difference between the two means is tested first. A test statistic for this pair
of means is calculated as follows:

where

n = number of samples in any group (all groups must be of equal size for Tukey’s HSD Post-Hoc test)
This test statistic is compared to qCritical . The critical q values are found on the Studentized Range q table.
A unique critical q value is calculated for each unique combination of level of significance (usually set at
0.05), the degrees of freedom, and the total number of groups in the ANOVA analysis.
Tukey’s test calculates degrees of freedom as follows:
df = Degrees of freedom = (total number of samples in all groups combined) – (total number of groups)
The difference between the two means is designated as significant if its test statistic q is larger than the
critical q value from the table.
If the difference between the means with the largest difference is found to be significant, the next inside
pair of means is tested. This step is repeated until an innermost pair is found to have a difference that is
not significant. Once an inner pair of means is found to have a difference that is not large enough to be

significant, no further testing needs to be done because all untested pairs will be inside this one and have
even smaller differences between the means.

Tukey-Kramer Test
Used When Group Variances Are Equal But Group Sizes Are Unequal
A slight variation of Tukey’s HSD test should be used when group sizes are not the same. This variation
of Tukey’s HSD test is called the Tukey-Kramer test.
This Tukey-Kramer test will normally be performed instead of Tukey’s HSD test by most statistical
packages. The Tukey-Kramer test produces the same answer as Tukey’s HSD when group sizes are the
same and can be used when group sizes are different (unlike Tukey’s HSD).
Recall that the Tukey’s HSD test statistic for a pair of means is calculated as follows:

where

The Tukey-Kramer test makes the following adjustment to standard error to account for unequal group
sizes. The pooled variance MSWithin_Groups is multiplied by the average of ( 1/ni + 1/nj ) instead of 1/n.

As with Tukey’s HSD test, the Tukey-Kramer test calculates Test Statistic q for each pair of means. This
Test Statistic is compared to qCritical . The critical q values are found on the Studentized Range q table. A
unique critical q value is calculated for each unique combination of level of significance (usually set at
0.05), the degrees of freedom, and the total number of groups in the ANOVA analysis. An Excel lookup
function can be conveniently used to obtain the critical q value. The easiest Excel lookup function in this
case is Index(array, row, column).
The Tukey-Kramer test calculates degrees of freedom in the same way as Tukey’s HSD test as follows:
df = Degrees of freedom = (total number of samples) – (total number of groups)
The difference between the two means is designated as significant if its test statistic q is larger than the
critical q value from the table.
The Tukey-Kramer test will be performed on the sample data shortly.

Games-Howell Test
Used When Group Variances Are Not Equal. Group Sizes Can Be Unequal.
The Games-Howell test is the Post-Hoc test used when group variances cannot be confirmed to be
homogeneous (similar). The Games-Howell test can be used whether or not sample sizes are the same.
When group variances are shown to be dissimilar, the Single-factor ANOVA test should be replaced by
either Welch’s ANOVA or the Brown-Forsythe F-Test. Both of these tests will be performed on the sample
data at the end of this chapter.
The two main tests used to evaluate samples for homogeneity (sameness) of variance are Levene’s test
and the Brown-Forsythe test.
Levene’s test is an ANOVA test that compares distances between sample values and group means.
The Brown-Forsythe test is an ANOVA test that compares distances between sample values and group
medians. The Brown-Forsythe test is more robust because it is less affected by outliers since it is based
on the median and not the mean as Levene’s test is.
The F-test is not a good test to compare variances because it is extremely sensitive to non-normality of
sample data.
The Games-Howell Post-Hoc test is performed in the same manner as Tukey’s HSD and the Tukey-
Kramer test. The only differences are the formulas used to calculate standard error and the degrees of
freedom.
Recall that the Tukey’s HSD test statistic for a pair of means is calculated as follows:

where

The Tukey-Kramer test makes the following adjustment to standard error to account for unequal group
sizes. The pooled variance MSWithin_Groups is multiplied by the average of ( 1/ni + 1/nj ) instead of 1/n.

Notice that both Tukey’s HSD and the Tukey-Kramer test use a pooled variance MSWithin_Groups because group variances are similar. The Games-Howell test assumes dissimilar group variances and calculates
the standard error using individual variances of each group as follows:

The Games-Howell test calculates degrees of freedom in a different way as well. The formula for df is as
follows:

In Excel terms, the formula is expressed as the following:


df = ( ( (Var1/n1) + (Var2/n2) )^2 ) / ( ( (Var1/n1)^2 / (n1 - 1) ) + ( (Var2/n2)^2 / (n2 - 1) ) )
This is the same formula used to calculate degrees of freedom for a two-independent sample, unpooled t-
test. This t-test is known as Welch’s t-test. As mentioned, when group variances are unequal, Single-
Factor ANOVA is replaced by Welch’s ANOVA or the Brown-Forsythe F-Test.
As with Tukey’s HSD test and the Tukey-Kramer test, the Games-Howell test calculates Test Statistic q
for each pair of means. This Test Statistic is compared to qCritical . The critical q values are found on the
Studentized Range q table. A unique critical q value is calculated for each unique combination of level of
significance (usually set at 0.05), the degrees of freedom, and the total number of groups in the ANOVA
analysis. An Excel lookup function can be conveniently used to obtain the critical q value. The easiest
Excel lookup function in this case is Index(array, row, column).
The difference between the two means is designated as significant if its test statistic q is larger than the
critical q value from the table.
The Games-Howell test will be performed on the sample data shortly.

Tukey-Kramer Post-Hoc Test in Excel
The Tukey-Kramer Post-Hoc test is performed when group variances are equal and group sizes are
unequal. The Tukey-Kramer test is normally performed in place of Tukey’s HSD when group sizes are the
same because both Post-Hoc tests produce the same answer.
The Tukey-Kramer test calculates Test Statistic q for each pair of means. This Test Statistic is compared
to qCritical . The critical q values are found on the Studentized Range q table using the Excel lookup
function, INDEX(array, row number, column number).
The difference between the two means is designated as significant if its test statistic q is larger than the
critical q value from the table.
The Test Statistic q is calculated as follows:
q = (Max Group Mean – Min Group Mean) / SE
where SE (standard error) is calculated as follows:

df = Degrees of freedom = (total number of samples) – (total number of groups)


The first step when performing the Tukey-Kramer test is to list all unique mean pairs and the differences
between the means. All of this information can be found from the Excel ANOVA output as follows:

Group 1 Mean = x_bar1 = 19.045


Group 2 Mean = x_bar2 = 19.261
Group 3 Mean = x_bar3 = 18.056

Three unique group pairings exist: (1,2), (1,3), and (2,3)

The differences in means of each pair are as follows:


Pair (1,2) Mean Difference = ABS(19.045-19.261) = 0.216
Pair (1,3) Mean Difference = ABS(19.045-18.056) = 0.989
Pair (2,3) Mean Difference = ABS(19.261-18.056) = 1.205
The pair of groups having the largest difference in means occurs in groups 2 and 3. This group pair will
therefore be the first evaluated to determine if its difference is large enough to be significant.

Test Statistic q for this group pair will be calculated as follows:


q2,3 = ABS(x_bar2 – x_bar3) / SE
where SE = SQRT(1/2 * MSWithin * (1/n2 + 1/n3) )
ABS(x_bar2 – x_bar3) = 1.205
MSWithin = 2.306
n2 = 23
n3 = 18
q2,3 = 1.205 / SQRT((0.5)*2.306*(1/23+1/18)) = 3.566
df = (total number of samples) – (total number of groups)
df = 63 – 3 = 60
From the Studentized Range q table
qCritical = 3.399

According to the Tukey-Kramer test, the largest difference (pair 2,3) is significant because q(2,3) = 3.566 is larger than qCritical (3.399). The differences between the other pairs are not significant because q(1,2) = 0.6745 and q(1,3) = 2.898 are both smaller than qCritical. The Games-Howell test will shortly be shown to produce very similar results.
Looking Up qCritical on the Studentized Range q Table With the Excel INDEX() Function
The Studentized Range q table and the Excel Index() function appear as follows:
=INDEX( array, relative row number, relative column number )
A relative address is the address relative to the cell in the upper left corner of the array. If the INDEX() function is attempting to locate a value in a cell that is in the third column to the right of and the third row down from the cell in the upper left corner of the array, the relative row number equals 3 and the relative column number equals 3.

The array argument is the absolute address of the array, given as upper-left-corner cell:lower-right-corner cell. In this case it would be D5:K103.
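For example, assuming the q table occupies D5:K103 as described:

=INDEX(D5:K103, 3, 3)

returns the value in cell F7, which is 3 rows down and 3 columns to the right of D5 (counting D5 itself as relative row 1, column 1).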

Games-Howell Post-Hoc Test in Excel
The Games-Howell test is performed the same way as the Tukey-Kramer test except that standard error
and degrees of freedom are calculated with different formulas as follows:
The Test Statistic q is calculated as follows:

where SE (standard error) is calculated as follows:

and
df = degrees of freedom =

In Excel terms, the formula is expressed as the following:


df = ( ( (Var1/n1) + (Var2/n2) )^2 ) / ( ((Var1/n1)^2 / (n1 - 1) ) + ( (Var2/n2)^2 / (n2-1) ) )

The Excel ANOVA output for the sample data set is given once again as follows:

The two groups having the largest difference in means are groups 2 and 3. This group pair will therefore
be the first evaluated to determine if its difference is large enough to be significant.
Test Statistic q for this group pair will be calculated as follows:

q2,3 = ABS(x_bar2 – x_bar3) / SE2,3


ABS(x_bar2 – x_bar3) = 1.205
Var2 = 2.292
Var3 = 2.408
n2 = 23
n3 = 18

SE2,3 = SQRT ( 1/2 * (Var2 /n2 + Var3 /n3) )


SE2,3 = =SQRT((0.5)*(2.292/23+2.408/18)) = 0.3416
q2,3 = 1.205/0.3416 = 3.527

df2,3 = degrees of freedom =
= ( ( (Var2/n2) + (Var3/n3) )^2 ) / ( ((Var2/n2)^2 / (n2 - 1)) + ( (Var3/n3)^2 / (n3-1) ) )
df2,3 = 37
and number of groups equals 3

From the Studentized Range q table


qCritical = 3.453

According to the Games-Howell test, the largest difference (pair 2,3) is significant because q(2,3) = 3.527 is larger than q(2,3)Critical = 3.453. The differences between the other pairs are not significant because q(1,2) = 0.6809 is smaller than q(1,2)Critical = 3.433 and q(1,3) = 2.883 is smaller than q(1,3)Critical = 3.457.

The Tukey-Kramer test and the Games-Howell produced very similar results when applied to this data.

Step 7 – Calculate Effect Size
Effect size is a way of describing how effectively the method of data grouping allows those groups to be
differentiated. A simple example of a grouping method that would create easily differentiated groups
versus one that does not is the following.
Imagine a large random sample of height measurements of adults of the same age from a single country.
If those heights were grouped according to gender, the groups would be easy to differentiate because the
mean male height would be significantly different than the mean female height. If those heights were
instead grouped according to the region where each person lived, the groups would be much harder to
differentiate because there would not be significant difference between the means and variances of
heights from different regions.
Because the various measures of effect size indicate how effectively the grouping method makes the
groups easy to differentiate from each other, the magnitude of effect size tells how large of a sample must
be taken to achieve statistical significance. A small effect can become significant if a large enough
sample is taken. A large effect might not achieve statistical significance if the sample size is too small.

The Three Most Common Measures of Effect Size


The three most common measures of effect size of single-factor ANOVA are the following:

η² – eta squared
(Greek letter “eta” rhymes with “beta”)

ψ – psi or RMSSE
Sometimes denoted as d because it is derived directly from Cohen’s d. This is also referred to as the RMSSE, the root-mean-square-standardized-effect.

ω² – omega squared
The first two measures, eta squared and RMSSE, are based upon Cohen’s d. The third measure, omega squared, is based upon r², the coefficient of determination, used in regression analysis.
Eta Squared (η²)
Eta squared quantifies the percentage of variance in the dependent variable (the variable that is measured and placed into groups) that is explained by the independent variable (the method of grouping). If eta squared = 0.35, then 35 percent of the variance associated with the dependent variable is attributed to the independent variable (the method of grouping).
Eta squared provides an overestimate (a positively-biased estimate) of the explained variance of the population from which the sample was drawn because eta squared estimates only the effect size on the sample. The effect size on the sample will be larger than the effect size on the population. This bias grows smaller as the sample size grows larger.
Eta squared is affected by the number and size of the other effects.
η² = SSBetween_Groups / SSTotal. These two terms are part of the ANOVA calculations found in the Single-Factor ANOVA output.
Magnitudes of eta squared are generally classified exactly as magnitudes of r² (the coefficient of determination) are, as follows: 0.01 is considered a small effect, 0.06 is considered a medium effect, and 0.14 is considered a large effect. Small, medium, and large are relative terms. A large effect is easily discernible but a small effect is not.

Partial eta squared (pη²) is the proportion of the total variance attributed to a given factor when ANOVA is performed using more than a single factor, which is not the case in this chapter.
Eta squared is sometimes called the nonlinear correlation coefficient because it provides a measure of the strength of the curvilinear relationship between the dependent and independent variables. If the relationship is linear, eta squared will have the same value as r squared.
The recommended measure of effect size for Single-Factor ANOVA is omega squared instead of eta squared due to the tendency of eta squared to overestimate the percent of population variance associated with the grouping method.
Psi (ψ) - RMSSE
RMSSE = Root-Mean-Square-Standardized-Effect. Sometimes RMSSE is denoted as d because it is derived directly from Cohen’s d as follows:
Cohen’s d is used to measure effect sizes when comparing two population variables. The formula for Cohen’s d is as follows:

Cohen’s d is implemented in the form of Hedges’ g when estimating the population variances based upon two samples. The formula for Hedges’ g is the following:

When applied to omnibus Single-Factor ANOVA, this measure becomes the RMSSE. The formula for
RMSSE for Single-Factor (One-way) ANOVA is the following:

The Grand Mean is the mean of the group means.


RMSSE is often denoted as Cohen’s d for Single Factor ANOVA. The Excel formula to calculate RMSSE
is the following:
=SQRT(DEVSQ(array of group means) / ((k-1)*MSWithin_Groups))
DEVSQ(array) returns the sum of the squares of deviations of sample points in the array from their mean.
In this case DEVSQ(array of group means) would return the sum of the square of the deviations of the
groups means from the grand mean (the mean of the group means).
Magnitudes of RMSSE are generally classified as follows: 0.10 is considered a small effect, 0.25 is considered a medium effect, and 0.40 is considered a large effect. Small, medium, and large are relative terms. A large effect is easily discernible but a small effect is not.
Omega Squared (ω²)
Omega squared is an estimate of the proportion of the population’s variance that is explained by the treatment (the method of grouping).
Omega squared is less biased (but still slightly biased) than eta squared and is always smaller than eta squared because eta squared overestimates the explained variance of the population from which the sample was drawn. Eta squared estimates only the effect size on the sample. The effect size on the sample will be larger than the same effect size on the population.
Magnitudes of omega squared are generally classified as follows: Up to 0.06 is considered a small effect,
from 0.06 to 0.14 is considered a medium effect, and above 0.14 is considered a large effect. Small,
medium, and large are relative terms. A large effect is easily discernible but a small effect is not.
The relationship between omega squared and r squared is shown as follows:

The first equation shown above is applicable to regression. The second equation is applicable to Single-Factor ANOVA.
SSBetween is often referred to as SSTreatment or SSEffect.
SSWithin is often referred to as SSError,
so that

becomes

Calculating Eta Squared (η²) in Excel
Eta squared is calculated with the formula
η² = SSBetween_Groups / SSTotal
and is implemented in Excel on the data set as follows:
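A minimal worked sketch, using SSBetween_Groups = 16.079 from this chapter's ANOVA output and SSTotal = SSBetween_Groups + SSWithin_Groups = 16.079 + 138.36 = 154.44:

=16.079/154.44

which returns approximately 0.104.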

An eta-squared value of 0.104 would be classified as a medium-size effect.


Magnitudes of eta squared are generally classified exactly as magnitudes of r² (the coefficient of determination) are, as follows: 0.01 is considered a small effect, 0.06 is considered a medium effect, and 0.14 is considered a large effect. Small, medium, and large are relative terms. A large effect is easily discernible but a small effect is not.

Calculating Psi (ψ) – RMSSE – in Excel
RMSSE is calculated with the formula

and is implemented in Excel on the data set as follows:
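A minimal worked sketch, assuming the three group means (19.045, 19.261, and 18.056) occupy the hypothetical range B1:B3 and using k = 3 and MSWithin_Groups = 2.306 from the ANOVA output:

=SQRT(DEVSQ(B1:B3)/((3-1)*2.306))

which returns approximately 0.423 (the small difference from the 0.4233 reported below is rounding of the group means).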

An RMSSE value of 0.4233 would be classified as a large effect, falling just above the 0.40 threshold for a large effect.
Magnitudes of RMSSE are generally classified as follows: 0.10 is considered a small effect, 0.25 is considered a medium effect, and 0.40 is considered a large effect. Small, medium, and large are relative terms. A large effect is easily discernible but a small effect is not.

Calculating Omega Squared (ω²) in Excel
Omega squared is calculated with the formula

and is implemented in Excel on the data set as follows:
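A minimal worked sketch, using the standard formula ω² = (SSBetween – (k – 1)·MSWithin) / (SSTotal + MSWithin), with SSBetween_Groups = 16.079, MSWithin_Groups = 2.306, k = 3, and SSTotal = 154.44 from the ANOVA output:

=(16.079-(3-1)*2.306)/(154.44+2.306)

which returns approximately 0.0732.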

An omega-squared value of 0.0732 would be classified as a medium size effect.


Magnitudes of omega squared are generally classified as follows: Up to 0.06 is considered a small effect,
from 0.06 to 0.14 is considered a medium effect, and above 0.14 is considered a large effect. Small,
medium, and large are relative terms. A large effect is easily discernible but a small effect is not.

Step 8 – Calculate the Power of the Test
The accuracy of a statistical test is very dependent upon the sample size. The larger the sample size, the
more reliable will be the test’s results. The accuracy of a statistical test is specified as the Power of the
test. A statistical test’s Power is the probability that the test will detect an effect of a given size at a given
level of significance (alpha). The relationships are as follows:
α (“alpha”) = Level of Significance = 1 – Level of Confidence
α = probability of a type 1 error (a false positive)
α = probability of detecting an effect where there is none
β (“beta”) = probability of a type 2 error (a false negative)
β = probability of not detecting a real effect
1 – β = probability of detecting a real effect
Power = 1 – β
Power needs to be clarified further. Power is the probability of detecting a real effect of a given size at a
given Level of Significance (alpha) at a given total sample size and number of groups.
The term Power can be described as the accuracy of a statistical test. The Power of a statistical test is
related with alpha, sample size, and effect size in the following ways:
1) The larger the sample size, the larger is a test’s Power because a larger sample size increases a
statistical test’s accuracy.
2) The larger alpha is, the larger is a test’s Power because a larger alpha reduces the amount of
confidence needed to validate a statistical test’s result. Alpha = 1 – Level of Confidence. The lower the
Level of Confidence needed, the more likely a statistical test will detect an effect.
3) The larger the specified effect size, the larger is a test’s Power because a larger effect size is more
likely to be detected by a statistical test.
If any three of the four related factors (Power, alpha, sample size, and effect size) are known, the fourth
factor can be calculated. These calculations can be very tedious. Fortunately there are a number of free
utilities available online that can calculate a test’s Power or the sample size needed to achieve a specified
Power. One very convenient and easy-to-use downloadable Power calculator called G-Power is available
at the following link at the time of this writing:
http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/
Power calculations are generally used in two ways:
1) A priori - Calculation of the minimum sample size needed to achieve a specified Power to detect an
effect of a given size at a given alpha. This is the most common use of Power analysis and is normally
conducted a priori (before the test is conducted) when designing the test. A Power level of 80 percent for
a given alpha and effect size is a common target. Sample size is increased until the desired Power level
can be achieved. Since Power equals 1 – β, the resulting β of the targeted Power level represents the
highest acceptable level of a type 2 error (a false negative – failing to detect a real effect). Calculation of
the sample size necessary to achieve a specified Power requires three input variables:
a) Power level – This is often set at 0.8, meaning that the test has an 80 percent chance of detecting an effect of a given size.

b) Effect size – Effect sizes are specified by the variable f. Effect size f is calculated from a different measure of effect size called η² (eta squared). η² = SSBetween_Groups / SSTotal. These two terms are part of the ANOVA calculations found in the Single-Factor ANOVA output.

The relationship between effect size f and effect size η² is as follows:
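The standard form of this relationship (which I am assuming matches the equation shown in the original figure) is:

f = SQRT( η² / (1 – η²) )

For this chapter's data, where η² = 0.104, this gives f = SQRT(0.104 / 0.896) ≈ 0.34.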

Jacob Cohen, in his landmark 1988 book Statistical Power Analysis for the Behavioral Sciences, proposed that effect sizes could be generalized as follows:
η² = 0.01 for a small effect. A small effect is one that is not easily observable.
η² = 0.06 for a medium effect. A medium effect is more easily detected than a small effect but less easily detected than a large effect.
η² = 0.14 for a large effect. A large effect is one that is readily detected with the current measuring equipment.
The above values of η² produce the following values of effect size f:
f = 0.1 for a small effect.
f = 0.25 for a medium effect.
f = 0.4 for a large effect.

c) Alpha – This is commonly set at 0.05.

Calculating Power With the Online Tool G-Power


1) A Priori - An example of a priori Power calculation would be the following. Power calculations are
normally used a priori to determine the total ANOVA sample size necessary to achieve a specific Power
level for detecting an effect of a specified size at a given alpha.
The single-factor ANOVA example used in this chapter has three groups. The G-Power utility could be
used a priori in this way:
Calculate the total sample size needed to achieve the following parameters:
Power level = 0.8 (80 percent chance of detecting the effect)
Effect size f = 0.4 (a large effect)
Number of Groups = 3
Alpha = 0.05
The G-Power dialogue box would be filled in as follows and calculates that a total sample size of 66 would be needed to attain a Power of 0.818 (81.8 percent) to detect a large effect of effect size f = 0.4.
The example used in this chapter has a total of 63 data observations. That would be nearly a large enough total sample size to have an 80 percent chance of detecting a large effect (f = 0.4) at an alpha of 0.05.

2) Post hoc - Calculation of a test’s Power to detect an effect of a given size at a given alpha for a given
sample size. This is usually conducted post hoc (after a test has been performed). If a test’s Power is
deemed unacceptably low, the test’s results are usually considered invalid.
An example of a post hoc Power calculation would be the following. Power calculations are normally used
post hoc to determine the current Power level of an ANOVA test for detecting an effect of a specified size
at a given alpha given the total sample size.
The single-factor ANOVA example used in this chapter has three groups. The G-Power utility could be
used post hoc in this way:
Calculate the test’s Power given the following parameters:
Effect size f = 0.25 (a medium effect)
Number of Groups = 3
Total sample size = 63
Alpha = 0.05

The G-Power dialogue box would be filled in as follows and calculates that this single-factor ANOVA test
achieves a Power level of 0.391 (39.1 percent chance) to detect a medium effect (effect size f = 0.25) with
three groups of 63 total data observations.

What To Do When Groups Do Not Have Similar Variances
Single-Factor ANOVA requires that the variances of all sample groups be similar. Sample groups that have similar variances are said to be homoscedastic. Sample groups that have significantly different variances are said to be heteroscedastic.
When groups cannot be shown to have homogeneous (similar) variances, either Welch’s ANOVA or the Brown-Forsythe F Test should be used in place of Single-Factor ANOVA. Both of these tests will be performed on the original data set.

Welch’s ANOVA in Excel

where

The Excel formula for the p Value that determines whether or not the Welch ANOVA test shows that at least one group mean is significantly different than the others is the following:
p Value = F.DIST.RT(FWelch, dfBetween, dfWithin)
This onerous set of formulas is much more manageable if it is broken down into its component parts as follows:
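For reference, one standard formulation of these components (an assumption on my part, since the equation figures are not reproduced here, and the chapter's A and B may be defined slightly differently) is the following, where j runs over the k groups:

wj = nj / sj² (the weight for group j)
Grand Weighted Mean = x_barw = Σ(wj · x_barj) / Σwj
A = Σ wj · (x_barj – x_barw)² / (k – 1)
B = 1 + [ 2(k – 2) / (k² – 1) ] · Σ [ (1 – wj/Σwj)² / (nj – 1) ]
FWelch = A / B
dfBetween = k – 1
dfWithin = (k² – 1) / ( 3 · Σ [ (1 – wj/Σwj)² / (nj – 1) ] )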

It can now be solved as follows:

Step 1) Calculate w

Step 2)

Step 3) Calculate Grand Weighted Mean

Step 4)

Step 5) Calculate A

Step 6) Calculate B

Step 7) Calculate the Welch MSWithin

Step 8) Calculate FWelch and then the p Value

This Welch’s ANOVA calculation shows the differences between group means to be significant at a Level
of Significance (Alpha) of 0.05 since the p Value (0.0463) is less than Alpha (0.05).
The p Value formula shown here is used in Excel versions prior to 2010. The equivalent formula in Excel
2010 and later is the following:
p Value = F.DIST.RT(FWelch, dfBetween, dfWithin)

Brown Forsythe F-Test in Excel

The Excel formula for the p Value that determines whether or not the Brown-Forsythe F Test shows that at least one group mean is significantly different than the others is the following:
p Value = F.DIST.RT(FBF, dfBetween, dfWithin)
SSBetween_Groups is taken from the Single-Factor ANOVA output shown here and equals 16.079.
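For reference, a standard statement of the Brown-Forsythe F statistic (an assumption on my part, since the equation figure is not reproduced here) is:

FBF = SSBetween_Groups / Σ (1 – nj/n) · sj²

Using this chapter's values (group variances of 2.235, 2.292, and 2.408, i.e., the squares of the standard deviations listed earlier; group sizes 22, 23, and 18; n = 63), the denominator is (41/63)(2.235) + (40/63)(2.292) + (45/63)(2.408) ≈ 4.63, giving FBF ≈ 16.079 / 4.63 ≈ 3.47.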

It can then be solved as follows:
Step 1)

Step 2) Calculate mi

Step 3)

Step 4) Calculate FBF and the p Value

This Brown-Forsythe F Test shows the differences between group means to be significant at a Level of
Significance (Alpha) of 0.05 since the p Value (0.0378) is less than Alpha (0.05).
The p Value formula shown here is used in Excel versions prior to 2010. The equivalent formula in Excel 2010 and later is the following:
p Value = F.DIST.RT(FBF, dfBetween, dfWithin)

What To Do When Groups Are Not Normally-Distributed
Single-Factor ANOVA requires that the samples are taken from normally-distributed populations. If the populations are normally-distributed, the samples will generally appear normally-distributed if the sample size is large enough, i.e., each sample contains at least 15 to 20 data points.

Kruskal-Wallis Test in Excel


If normality tests indicate that the samples are likely not normally-distributed, the nonparametric Kruskal-
Wallis test should be substituted for Single-Factor ANOVA. The Kruskal-Wallis test is based upon the
rankings of all data points and does not require that the data be normally-distributed.
The Kruskal-Wallis test does have a requirement that the data samples have similar distribution shapes.
The Excel histogram is a convenient tool to quickly view the distribution shape of each sample group.
Excel histograms will be created for each sample group of the original data set. The original data set was
already successfully tested for normality using the Shapiro-Wilk normality test. Excel histograms would
therefore be expected to resemble the bell-shaped normal distribution curve. Histograms of each of the
three data groups are shown in the following diagram:

This histogram was created in Excel by inputting the following information into the histogram dialogue
box:

This histogram was created in Excel by inputting the following information into the histogram dialogue
box:

This histogram was created in Excel by inputting the following information into the histogram dialogue
box:

Excel histograms of each of the data groups reveal similar distribution shapes thus validating this required
assumption of the Kruskal-Wallis test.

The Kruskal-Wallis test is based upon the overall rankings of all data points. The sum of the rankings for each sample group, Ri, is used to calculate the value of test statistic H as follows:

k = the number of sample groups


Test statistic H is very nearly distributed as the Chi-Square distribution with k – 1 degrees of freedom as
long as the number of samples in each group is at least 5.
A p Value can therefore be derived from H as follows:
p value = CHISQ.DIST.RT(H, k-1)
If the p Value is smaller than the designated Level of Significance (Alpha is usually set at 0.05) then at least one of the groups has a disproportionately large share of higher numbers. A larger-than-expected share of higher numbers will produce an unexpectedly large rank sum, Ri, for that sample group. This will result in a small p Value, which indicates that the difference between the rankings of the sample groups is significant.
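For reference, the standard form of H (an assumption on my part, since the equation figure is not reproduced here) is:

H = [ 12 / (n(n + 1)) ] · Σ (Ri² / ni) – 3(n + 1)

In Excel, assuming the three rank sums occupy the hypothetical cells B1, B2, and B3 and using this chapter's group sizes (22, 23, and 18, so n = 63):

=12/(63*64)*(B1^2/22 + B2^2/23 + B3^2/18) - 3*64

The p Value then follows from =CHISQ.DIST.RT(H, 2).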

The Kruskal-Wallis test is performed on the original sample data as follows:

Step 1 – Arrange All Data In One Column

Step 2 - Sort and Then Rank the Data Column.
The data sort must keep the group number attached to each data value.

Step 3 – Take Care of Tied Data Values
Tied data values are all assigned the same rank. This rank is the average of the ranks that the tied values would otherwise have occupied. This calculation is performed as follows:
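For example, if three tied values would otherwise have occupied sorted positions 10, 11, and 12, each receives rank 11, i.e., =AVERAGE(10,11,12). In Excel 2010 and later, the RANK.AVG() function applies this tie-handling automatically. A minimal sketch, assuming the combined data occupy the hypothetical range A2:A64:

=RANK.AVG(A2, A$2:A$64, 1)

The final argument of 1 ranks the values in ascending order.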

Step 4 – Return Data To Original Groups
The data are then resorted back into their original group. The sort must retain the ranking for each data
point.

Step 5 – Calculate Rank Sum For Each Group
Calculate the Rank Sum for each data group by adding the rankings of all data points in the group.

Step 6 – Calculate Test Statistic
Calculate test statistic H based upon the following formula:

Ri = Rank Sum for group i


ni = number of data points in group i
n = the total number of data points in all groups

Step 7 – Calculate the p Value
Calculate the p Value based upon H and k, the number of groups, as follows:

The p Value formula shown here is used in Excel versions prior to 2010. The equivalent formula in Excel
2010 and later is the following:
p Value = CHISQ.DIST.RT(H, df)
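As a quick back-check (my own calculation, not part of the original output): with k = 3, the p Value of 0.0542 reported below corresponds to H ≈ 5.83, since =CHISQ.DIST.RT(5.83, 2) returns approximately 0.0542.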
This Kruskal-Wallis test does not show (just barely) a significant difference between the rankings of the
sample groups. The Kruskal-Wallis test is less sensitive than Single-Factor ANOVA. This is usually the
case with any nonparametric test that is used to replace a parametric test.
In this case, the Kruskal-Wallis test shows a higher chance of a type 2 error than Single-Factor ANOVA. A
type 2 error is a false negative. In other words, the Kruskal-Wallis test (p value = 0.0542) is less able to
detect a significant difference than Single-Factor ANOVA (p value = 0.0369), Welch’s ANOVA (p Value =
0.0463), or the Brown-Forsythe F-test (p value = 0.0378).
Two-Factor ANOVA With Replication in Excel

Overview
Two-factor ANOVA with replication is used to determine if either of two categorical factors and/or the
interaction between these two factors has had a significant effect on a data set of continuous data.
Two-factor ANOVA with replication is useful in the following two circumstances:
1) Determining if either of two categorical factors has independently affected a data set in a
significant way. The data set is divided into horizontal groups that are each affected by a different level
of one categorical factor. The same data set is also simultaneously divided into vertical groups that are
each affected by a different level of another categorical factor. An example of a data set that is arranged
for two-factor ANOVA with replication analysis is as follows:
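A minimal sketch of such a layout (hypothetical; each x stands for one data observation) places Factor 1's three levels in columns and Factor 2's two levels in row blocks, with four replicated observations per treatment cell:

            Factor 1, Level 1   Factor 1, Level 2   Factor 1, Level 3
Factor 2,          x                   x                   x
Level 1            x                   x                   x
                   x                   x                   x
                   x                   x                   x
Factor 2,          x                   x                   x
Level 2            x                   x                   x
                   x                   x                   x
                   x                   x                   x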

The test for main effects of each of the two factors is very similar to main effects test of the one factor in
single-factor ANOVA. The main effects test for each of the two factors determines whether there is a
significant difference between the means of the groups (the levels) within that factor. Factor 1’s main
effect test determines if there is a significant difference between the means of Levels 1, 2, and 3 of Factor
1. Factor 2’s main effects test determines if there is a significant difference between the means of Levels
1 and 2 of Factor 2.
2) Determining if the interaction between the two categorical factors has significantly affected a
data set. The interaction test determines whether data values across the levels of one factor vary
significantly at different levels of the other factor. This test determines whether the levels of one factor
have different effects on the data values across the levels of the other factor. It determines whether there
is interaction between Factor 1 and Factor 2, that is, between rows and columns. Ultimately this test determines whether the differences between data observations in columns vary from row to row and whether the differences between data observations in rows vary from column to column.

Independent Variables vs. Dependent Variables
The two factors and their levels are categorical. These two factors are sometimes referred to as the
independent variables of Two-Way ANOVA. The dependent variable contains the values of the data
observations in the ANOVA table. The dependent variable is a continuous variable.

Two-Way ANOVA
Two-way ANOVA means that there are two factors that are being evaluated. Each factor has at least two
or more levels. One of the factors has its levels distributed in columns. Each data column contains all of
the data observations of one of that factor’s levels. The other factor has its levels distributed in rows.
Each data row contains all of the data observations of one of that factor’s levels.

Balanced Two-Way ANOVA With Replication


Replication in two-way ANOVA occurs when there are multiple instances of data observations for each
combination of levels between the two factors. Each unique combination of levels of the two factors is
called a treatment cell. It is important to note that only one of the two factors will always be replicated and
the other factor will never be replicated in the treatment cells. In the example provided here, each
treatment cell contains four data observations that are replications of Factor 1.
It is also important to note that the replication occurs the same number of times at all combinations of
levels. In the example shown here, each combination of levels of Factors 1 and 2 contains four data
observations of the same level of Factor 1. This is called “balanced” ANOVA. Balanced ANOVA means
that each treatment cell (each unique combination of levels of Factors 1 and 2) has one of the factors
replicated the same number of times.
ANOVA can be performed on unbalanced data but it is significantly more complicated and will not be
discussed here. It is always a good idea to design two-factor ANOVA with replication testing to have
balanced treatment cells. It should be noted that single-factor ANOVA can be performed without any
additional complication when treatment cells (data groups) have different sizes.

ANOVA = Analysis of Variance


ANOVA stands for Analysis of Variance. ANOVA determines whether or not all of the sample groups
being compared in a single F test are likely to have come from the same population by comparing the
variance between sample groups to the variance within the sample groups.
Two-factor ANOVA represents groupings of data observations that are each described by two categorical
variables and one continuous variable. The value of each object’s categorical variables determines into
which group (treatment cell) the data observation is placed. A treatment cell is a unique combination of
levels of the two factors. Two-Way ANOVA with one factor that has two levels and a second factor that
has three levels would have a total of six unique treatment cells.
The number of data observations in each treatment cell depends on how much replication has occurred
in the ANOVA test. Two-Way ANOVA Without Replication has a single data observation in each
treatment cell. Two-Way ANOVA with one factor replicated twice has two data observations in each
treatment cell. The example shown in this section has one factor replicated four times and therefore has
four data observations in each treatment cell. Note that this ANOVA example is balanced because each
treatment cell contains the same number of data observations (four) that are replications of the same
factor.

The Independent and Dependent Variables of ANOVA
The categorical variables are sometimes referred to as the independent variables of the ANOVA while the
continuous variable is sometimes referred to as the dependent variable of the ANOVA. In the case of
Two-Factor ANOVA, the independent variables predict which unique group (treatment cell) each data observation’s continuous value or measurement will be placed into. This independent-dependent relationship is different from that in regression because the independent variable does not predict the value of the dependent variable, only the group (factor level) into which the data observation will be placed.

Two-Way ANOVA With Replication Performs Three F Tests


The three separate F Tests performed are the following:

Factor 1 Main Effects F Test


This F Test determines whether at least one level of the Factor 1 groupings of the data set has a
significantly different mean than the other Factor 1 levels. This is a Main Effects test.

Factor 2 Main Effects F Test


This F Test determines whether at least one level of the Factor 2 groupings of the data set has a
significantly different mean than the other Factor 2 levels. This is a Main Effects test.

Factor 1 and 2 Interaction Effects F Test


This F Test determines whether any level of Factor 1 interacts with Factor 2 to create significantly
different mean values in treatment cells across the Factor 2 levels. This is an Interaction Test.
Each of these three F Tests produces its own p value and a result that is reported separately from the
other two F Tests.

Requirements of Each F Test


All groups that are part of one F Test should be drawn from normally distributed populations that have
similar variances. This means that all data groups in one F Test must have similar variances and be
normally distributed. Only data groups that are being used in the same F Test are required to have similar
variances. All data groups for all F Tests must be normally distributed. The three F Tests of Two-Factor
ANOVA With Replication are valid only if the following conditions are met:

Factor 1 Main Effects F Test


All data groupings for Factor 1 (each Factor 1 level is its own data grouping) must have similar variances
and be normally distributed.

Factor 2 Main Effects F Test


All data groupings for Factor 2 (each Factor 2 level is its own data grouping) must have similar variances
and be normally distributed.

Factor 1 and 2 Interaction Effects F Test
If the two points above are true, then all interaction groupings (the unique treatment cells) will have similar
variance and be normally distributed.
Note that the variances of the groups within each F Test need to be similar, not identical as is often stated
in statistics texts. A rule-of-thumb is that the groups of an F Test are considered to have similar variances
if the standard deviation of any group is no more than twice as large as the standard deviation of any
other group in that F Test.
Group variances for each F Test will be compared in this section using both Levene’s test and the Brown-
Forsythe test. These are widely-used hypothesis tests that indirectly determine whether group variances
are significantly different.
Normality testing will be conducted on all groups of all F Tests in this section using the well-known
Shapiro-Wilk normality test.

Alternative Test When Data Are Not Normally Distributed


ANOVA is a parametric test because one of ANOVA’s requirements is that the data in each sample group
are normally distributed. ANOVA is relatively robust against minor deviations from normality. When
normality of sample group data cannot be confirmed or if the sample data are ordinal instead of continuous,
a relatively unknown but very useful nonparametric test called the Scheirer-Ray-Hare test should be
substituted for Two-Factor ANOVA With Replication. This test will be performed on the data at the end of
this section.
Ordinal data are data whose order matters but whose specific distances between units are not measurable.
Customer-rating survey data and Likert scales data can be examples of ordinal data. These types of data
can, however, be treated as continuous data if distances between successive units are considered equal.
The nonparametric Friedman test is sometimes mentioned as a substitute for Two-Way ANOVA With
Replication but this is incorrect. The Friedman test is a nonparametric substitute for Repeated-Measure
ANOVA but not for Two-Way ANOVA With Replication.

Null and Alt. Hypotheses For 2-Way ANOVA W/Rep.


Each of the three F Tests has its own Null and Alternative Hypotheses.

Null and Alternative Hypotheses for the Two Main Effects F Tests
The Null Hypothesis for the F Test that compares the means of the Factor 1 levels states that all of the
means are the same.
The Null Hypothesis for the F Test that compares the means of the Factor 2 levels states that all of the
means are the same.
This would be written as follows:
Null Hypothesis = H0: µ1 = µ2 = … = µk (k equals the number of sample groups or levels in each
factor)
Note that the Null Hypothesis is not referring to the sample means, x̄1, x̄2, … , x̄k, but to the population
means, µ1, µ2, … , µk. Each of these two F Tests determines whether all of the data groups in a single F
Test could have come from the same population.

The Alternative Hypothesis for ANOVA states that at least one sample group in the F Test is likely to have
come from a different population. The F Tests do not clarify which groups are different or how large any of
the differences between the groups are. This Alternative Hypothesis for an F Test only states whether at
least one sample group in that F Test is likely to have come from a different population.

Null and Alternative Hypotheses for the Interaction Effect F Tests


The Null Hypothesis for the F Test that compares the interaction effect states that there is no interaction
between Factor 1 and Factor 2, that is, between rows and columns. This Null Hypothesis states that the
differences between data observations in columns do not vary from row to row and that the differences
between data observations in rows do not vary from column to column.
The Alternative Hypothesis for each of the three F Tests states that its Null Hypothesis is not true. Keep
in mind that a hypothesis test never accepts or rejects an alternative hypothesis: a hypothesis test can
only reject or fail to reject its Null Hypothesis. Rejection of the Null Hypothesis is however usually deemed
as being supportive of the Alternative Hypothesis stating that there is a difference in what is being
compared.

Two-Factor ANOVA Should Not Be Done By Hand


Excel provides an excellent ANOVA tool that can perform Single-Factor or Two-Factor ANOVA with equal
ease. The section in this manual covering single-factor ANOVA has the example recreated with all of the
individual calculations that go into ANOVA. This will not be done for two-factor ANOVA because that
would not, in this author’s view, provide additional insight into two-factor ANOVA because its calculations
are very numerous and tedious.
The best way to understand Two-Factor ANOVA with replication is to perform an example as follows:

Two-Factor ANOVA With Replication Example in Excel


The two factors of this ANOVA test will generically be called Factor 1 and Factor 2. Factor 1 will have
three levels and Factor 2 will have two levels. Each level of Factor 1 will be replicated four times.
The three levels of Factor 1 are labeled as follows:
Factor 1 Level 1
Factor 1 Level 2
Factor 1 Level 3

The two levels of Factor 2 are labeled as follows:


Factor 2 Level 1
Factor 2 Level 2

The generic labels were retained through the entire example to provide additional clarity and ease of
interpretation of the output. The data will be arranged as follows so they can be processed in Excel:

This example could represent a scenario similar to the following:


Three groups of eight people simultaneously underwent training programs. Each of the three training
programs was different. Each group contains four men and four women. All people in all groups are
judged to have similar abilities. At the end of the training program, all eight people in each group took the
same test to evaluate comprehension of the training topics.
The three levels of Factor 1 would, in this case, specify which training program each person had
undergone. The two levels of Factor 2 would specify the gender of each person.
Arranging the data in a table as shown would allow Two-Factor ANOVA With Replication to determine
the following:
1) Whether the training programs made a significant difference in the test scores.
2) Whether test scores were significantly different between genders.
3) Whether there was interaction between training program type and gender. In other words, whether
participants of one gender seemed to perform better or worse in at least one training program than
participants of the other gender did.

Step 1 – Arrange the Data Properly
Typically the data are provided in the manner shown as follows. Each data observation is listed on a
separate row along with its respective level of each of the two factors.

To perform Two-Factor ANOVA with replication in Excel, the data needs to be arranged in rows and
columns as follows:

The quickest way to arrange the data correctly is to sort the rows of data by the two factors. The factor
that will not be replicated should be the primary sort. Levels of this factor will wind up in separate
columns. The factor that will be replicated should be the secondary sort. Levels of this factor will wind up in
blocks of rows as just shown.

Second, create the framework into which the sorted data will be placed as follows:

Third, paste the data into the respective columns.

Fourth and finally, outline each treatment cell as follows. Each treatment cell is a unique combination of
levels of both factors and contains four data observations. The data should be balanced, meaning that
every treatment cell has the same number of data observations.

Step 2 – Evaluate Extreme Outliers


Calculation of the mean is one of the fundamental computations when performing ANOVA. The mean is
unduly affected by outliers. Extreme outliers should be removed before performing ANOVA. Not all outliers should
be removed. An outlier should be removed if it is obviously extreme and inconsistent with the remainder
of the data.
Outlier evaluation needs to be carefully performed before or during data collection, not after. Two-Way
ANOVA With Replication requires that the data be balanced. Individual data observations cannot simply
be discarded or there will be a hole in the data and the data will no longer be balanced. Note that Single-
Factor ANOVA can easily be performed on unbalanced data, but not Two-Factor ANOVA With
Replication. This type of ANOVA can be done with unbalanced data but it is significantly more
complicated and cannot be performed by the data analysis tool in Excel.

Step 3 – Verify Required Assumptions
Two-Factor ANOVA With Replication Required Assumptions
Two-Factor ANOVA With Replication has six required assumptions whose validity should be confirmed
before this test is applied. The six required assumptions are the following:

1) Independence of Sample Group Data – Sample groups must be differentiated in such a way that there
can be no cross-over of data between sample groups. No data observation in any sample group could
have been legitimately placed in another sample group. No data observation affects the value of another
data observation in the same group or in a different group. This is verified by an examination of the test
procedure.
2) Sample Data Are Continuous – Sample group data (the dependent variable’s measured value) can be
ratio or interval data, which are the two major types of continuous data. Data observation values cannot
be nominal or ordinal data, which are the two major types of categorical data.
3) Independent Variables Are Categorical – The determinant of which group each data observation
belongs to is a categorical, independent variable. ANOVA uses two categorical variables that each have
at least two levels. All data observations associated with each variable level represent a unique data
group and will occupy a separate column or row on the Excel worksheet.
4) Extreme Outliers Removed If Necessary – ANOVA is a parametric test that relies upon calculation of
the means of sample groups. Extreme outliers can skew the calculation of the mean. Outliers should be
identified and evaluated for removal in all sample groups. Occasional outliers are to be expected in
normally distributed data but all outliers should be evaluated.
5) Normally Distributed Data In All Sample Groups – Each of the three F Tests of Two-Factor ANOVA
has the required assumption that the data from each sample group in that F Test come from a normally
distributed population. Each of the two F Tests that are main effects tests for the two factors should have
their sample groups evaluated for normality. If all of the sample groups in the two F Tests are normally
distributed, the sample groups for the interaction F Test will also be normally distributed.
Normality testing becomes significantly less powerful (accurate) when a group’s size falls below 20. An
effort should be made to obtain group sizes that exceed 20 to ensure that normality tests will provide
accurate results. The F Tests in ANOVA are somewhat robust to minor deviations from normality.
6) Relatively Similar Variances In All Sample Groups In Each F Test – Two-Factor ANOVA requires
that sample groups be obtained from populations that have similar variances.
Each of the three F Tests of Two-Factor ANOVA has the required assumption that all sample groups in
that specific F Test have similar variances. Each of the two F Tests that are main effects tests for the two
factors should have their sample groups evaluated for homoscedasticity (similarity of variances). If all of
the sample groups in each of these F Tests have similar variances, the sample groups for the interaction F
Test will also have similar variances. Note that variances only have to be similar in groups of a single F
Test. All data groups that are the levels from one factor must have similar variances. Levels of one factor
do not have to have similar variances to levels of the other factor though. The requirement is that sample
groups for a single F Test have similar variances.
This requirement actually states that the populations from which the samples are drawn must have equal
variances. Normally the population variances are unknown, so the sample groups themselves must be tested for
variance equality. The variances do not have to be exactly equal but do have to be similar enough so the
variance testing of the sample groups, which are hypothesis tests, will not detect significant differences.
Variance testing becomes significantly less powerful (accurate) when a group’s size falls below 20. An
effort should be made to obtain group sizes that exceed 20 to ensure that variance tests will provide
accurate results.

Determining If Sample Groups Are Normally-Distributed
There are a number of normality tests that can be performed on each group’s data. The normality test that
is preferred here, because it is considered to be more powerful (accurate) than the others, particularly with
smaller sample sizes, is the Shapiro-Wilk test.

Shapiro-Wilk Test For Normality
The Shapiro-Wilk Test is a hypothesis test that is widely used to determine whether a data sample is
normally distributed. A test statistic W is calculated. If this test statistic is less than a critical value of W for
a given level of significance (alpha) and sample size, the Null Hypothesis which states that the sample is
normally distributed is rejected.
The Shapiro-Wilk Test is a robust normality test and is widely-used because of its slightly superior
performance against other normality tests, especially with small sample sizes. Superior performance
means that it correctly rejects the Null Hypothesis when the data are actually not normally distributed a slightly
higher percentage of the time than most other normality tests, particularly at small sample sizes.
The Shapiro-Wilk normality test is generally regarded as being slightly more powerful than the Anderson-
Darling normality test, which in turn is regarded as being slightly more powerful than the Kolmogorov-
Smirnov normality test.
Here is a summary of the results of the Shapiro-Wilk normality test performed on the sample groups that
constitute each of the levels of each of the two factors.
The Shapiro-Wilk test is a hypothesis test that compares sample group test statistic W to a critical value
of W. If test statistic W is higher than the critical value of W, the Null Hypothesis is not rejected. The Null
Hypothesis of the Shapiro-Wilk normality test states that the sample group is normally distributed. The
following results indicate that the test statistic W for the data group of each factor level is greater than its
respective critical W value. All factor levels are deemed to have normally distributed data.

The individual Shapiro-Wilk normality tests for the data groups of each level will be shown as follows. The
critical W values are taken from a table based upon n (the number of data observations in the sample
group) and α (the Level of Significance, set to 0.05 here).

Shapiro-Wilk Normality Test in Excel of Factor 1 Level 1 Data

Shapiro-Wilk Normality Test in Excel of Factor 1 Level 2 Data

Shapiro-Wilk Normality Test in Excel of Factor 1 Level 3 Data

Shapiro-Wilk Normality Test in Excel of Factor 2 Level 1 Data

Shapiro-Wilk Normality Test in Excel of Factor 2 Level 2 Data

Test Statistic W is larger than W Critical in all five cases. The Null Hypothesis therefore cannot be
rejected. There is not enough evidence to state that any of the data groups is not normally distributed with
a confidence level of 95 percent.

Correctable Reasons Why Normal Data Can Appear Non-Normal


If a normality test indicates that data are not normally-distributed, it is a good idea to do a quick evaluation
of whether any of the following factors have caused normally-distributed data to appear to be non-
normally-distributed:
1) Outliers – Too many outliers can easily skew normally-distributed data. An outlier can often be
removed if a specific cause of its extreme value can be identified. Some outliers are expected in normally-
distributed data.
2) Data Has Been Affected by More Than One Process – Variations to a process such as shift changes
or operator changes can change the distribution of data. Multiple modal values in the data are common
indicators that this might be occurring. The effects of different inputs must be identified and eliminated
from the data.
3) Not Enough Data – Normally-distributed data will often not assume the appearance of normality until
at least 25 data points have been sampled.
4) Measuring Devices Have Poor Resolution – Sometimes (but not always) this problem can be solved
by using a larger sample size.
5) Data Approaching Zero or a Natural Limit – If a large number of data values approach a limit such
as zero, calculations using very small values might skew computations of important values such as the
mean. A simple solution might be to raise all the values by a certain amount.

6) Only a Subset of a Process’ Output Is Being Analyzed – If only a subset of data from an entire
process is being used, a representative sample is not being collected. Normally-distributed results would
not appear normally-distributed if a representative sample of the entire process is not collected.
Nonparametric Alternative For Two-Way ANOVA W/ Replication When Data Are Not Normal
When groups cannot be shown to all have normally-distributed data, a relatively unknown nonparametric
test called the Scheirer-Ray-Hare Test should be performed instead of Two-Factor ANOVA With
Replication. This test will be performed at the end of this chapter on the original sample data. The
Friedman test is occasionally mentioned as an alternative but that is incorrect. The Friedman test is a
nonparametric alternative for Repeated-Measures ANOVA but not for Two-Factor ANOVA With
Replication.

Determining If Sample Groups Have Similar Variances


Each of the three F Tests of Two-Factor ANOVA With Replication requires that the variances of all
sample groups in the same F Test be similar. Sample groups that have similar variances are said to be
homoscedastic. Sample groups that have significantly different variances are said to be
heteroscedastic.
A rule-of-thumb is as follows: Variances are considered similar if the standard deviation of any one group
is no more than twice as large as the standard deviation of any other group. This is equivalent to stating
that no data group’s variance can be more than four times the variance of another data group in the same
F Test. That is the case here as the following are true for the levels of Factor 1 and Factor 2:
If the sample variances (VAR() in Excel, or VAR.S() in Excel 2010 and later) of the data groups at each
factor level are calculated, the results are as follows:

Variances of Factor 1 Levels


Var (Factor 1 Level 1) = 1,597
Var (Factor 1 Level 2) = 1,064
Var (Factor 1 Level 3) = 532
None of the Factor 1 level data groups has a sample variance that is more than four times as large as
the sample variance of another Factor 1 level group. The variance rule-of-thumb indicates that the
variances of all data groups that are part of the Factor 1 Main Effects F Test should be considered similar.
All of the Factor 1 level data groups are therefore homoscedastic (have similar variances).

Variances of Factor 2 Levels


Var (Factor 2 Level 1) = 852
Var (Factor 2 Level 2) = 1,368
Neither of the Factor 2 level data groups has a sample variance that is more than four times as large as
the sample variance of the other Factor 2 level group. The variance rule-of-thumb indicates that the
variances of all data groups that are part of the Factor 2 Main Effects F Test should be considered similar.
All of the Factor 2 level data groups are therefore homoscedastic (have similar variances).
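As a quick worksheet check of this rule-of-thumb, the variances and their largest ratio can be computed directly with a few formulas. This is a minimal sketch; the ranges B2:B9, C2:C9, and D2:D9 for the three Factor 1 level groups and the cells E2:E4 holding their variances are hypothetical placements, not taken from the example workbook:

=VAR.S(B2:B9) (sample variance of one level group; VAR() performs the same calculation in Excel 2007 and earlier)
=MAX(E2:E4)/MIN(E2:E4) (ratio of the largest group variance to the smallest)

If the ratio returned by the second formula is no greater than 4 (equivalent to a standard deviation ratio of 2), the rule-of-thumb deems the group variances similar.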
In addition to the variance comparison rule-of-thumb, two statistical tests are commonly performed when
it is necessary to evaluate the equality of variances in sample groups. These tests are Levene’s Test and
the Brown-Forsythe Test. The Brown-Forsythe Test is more robust against outliers but Levene’s Test is
the more popular test.

Levene’s Test in Excel For Sample Variance Comparison


Levene’s Test is a hypothesis test commonly used to test for the equality of variances of two or more
sample groups. Levene’s Test is much more robust against non-normality of data than the F Test. That is
why Levene’s Test is nearly always preferred over the F Test as a test for variance equality.
The Null Hypothesis of Levene’s Test is that the average distance to the sample mean is the same for each
sample group. Failure to reject this Null Hypothesis implies that the variances of the sampled groups are
similar.
Separate Levene’s Tests will now be performed on the data groups for the Factor 1 Main Effects F Test and
for the Factor 2 Main Effects F Test. The absolute value of the distance from each sample point to the
sample mean must be calculated. Single-Factor ANOVA in Excel is then run on these data sets.
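The distance transformation can be done with a single fill-down formula. As a minimal sketch, assume (hypothetically) that one level group's data occupy cells A2:A9 and the transformed values are built in column B:

=ABS(A2-AVERAGE(A$2:A$9))

Entered in B2 and filled down to B9, this converts each observation into its absolute distance from that group's mean. The same formula pattern is applied to each of the other level groups in their own columns, and the Anova: Single Factor tool is then run on the transformed columns.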
Levene’s Test is performed on the Factor 1 level groups as follows:

α was set at 0.05 for this ANOVA test. The p Value of 0.2526 is larger than 0.05. This indicates that the
average distances to the sample mean for each Factor 1 level data group are not significantly different.
This result of Levene’s test is interpreted to mean that the Factor 1 level data groups have similar
variances and are therefore homoscedastic.

Levene’s Test is performed on the Factor 2 level groups as follows:

α was set at 0.05 for this ANOVA test. The p Value of 0.2519 is larger than 0.05. This indicates that the
average distances to the sample mean for each Factor 2 level data group are not significantly different.
This result of Levene’s test is interpreted to mean that the Factor 2 level data groups have similar
variances and are therefore homoscedastic.
We therefore conclude as a result of Levene’s Test that the group variances for each F Test are similar
or, at least, that we don’t have enough evidence to state that the variances within either of the F Tests are
different. Levene’s Test is sensitive to outliers because it relies on the sample mean, which can be unduly
affected by outliers. A very similar nonparametric test called the Brown-Forsythe Test relies on sample
medians and is therefore much less affected by outliers than Levene’s Test is and much less affected by
non-normality than the F Test is.

Brown-Forsythe Test in Excel For Sample Variance Comparison
The Brown-Forsythe Test is a hypothesis test commonly used to test for the equality of variances of two
or more sample groups. The Null Hypothesis of the Brown-Forsythe Test is that the average distance to the
sample median is the same for each sample group. Failure to reject this Null Hypothesis implies that the
variances of the sampled groups are similar. The distance to the median for each data point of the three
sample groups is shown as follows:
Separate Brown-Forsythe Tests will now be performed on the data groups for the Factor 1 Main Effects F Test
and for the Factor 2 Main Effects F Test. The absolute value of the distance from each sample point to
the sample median must be calculated. Single-Factor ANOVA in Excel is then run on these data sets.
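The worksheet transformation is the same as the one sketched for Levene's Test except that MEDIAN() replaces AVERAGE(). Again assuming (hypothetically) that a level group's data occupy cells A2:A9:

=ABS(A2-MEDIAN(A$2:A$9))

This is filled down the column for each level group before the Anova: Single Factor tool is run on the transformed columns.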
The Brown-Forsythe Test is performed on the Factor 1 level groups as follows:

α was set at 0.05 for this ANOVA test. The p Value of 0.2530 is larger than 0.05. This indicates that the
average distances to the sample median for each Factor 1 level data group are not significantly different.
The result of this Brown-Forsythe test is interpreted to mean that the Factor 1 level data groups have
similar variances and are therefore homoscedastic.

The Brown-Forsythe Test is performed on the Factor 2 level groups as follows:

α was set at 0.05 for this ANOVA test. The p Value of 0.3065 is larger than 0.05. This indicates that the
average distances to the sample median for each Factor 2 level data group are not significantly different.
The result of this Brown-Forsythe test is interpreted to mean that the Factor 2 level data groups have
similar variances and are therefore homoscedastic.

We therefore conclude as a result of this Brown-Forsythe Test that the group variances for each F Test
are similar or, at least, that we don’t have enough evidence to state that the variances within either of the F
Tests are different.
Each of these two variance tests, Levene’s Test and the Brown-Forsythe Test, can be considered
relatively equivalent to the other.

Step 4 – Run the Two-Factor ANOVA With Replication Tool in Excel


ANOVA tools can be found in Excel 2007 and later by clicking the Data Analysis link located under the
Data tab. In Excel 2003, the Data Analysis link is located in the Tools drop-down menu. Clicking Anova:
Two-Factor With Replication brings up the Excel dialogue box for this tool.
The data need to be arranged in contiguous columns (columns touching, with the rows correctly lined up).

The completed dialogue box for this ANOVA test and data set would appear as follows:

Hitting OK runs the tool and produces the following output:

Step 5 – Interpret the Excel Output
Two-Way ANOVA With Replication involves three separate F Tests. Each of these three F Tests
produces its own p value and a result that is reported separately from the other two F Tests. Results of an
F Test are deemed to be significant if the p Value generated by that F Test is smaller than the designated
Level of Significance (α is usually set at 0.05). A significant result is one in which observed differences
have only a small chance of being random results. For example, if one of the Main Effects F Tests
produces a significant result (the p Value is smaller than α) then at least one of the means of the level
groups of that factor is different than the means of the other level data groups.
Those three separate F Tests are the following:
Main Effects F Test for Factor 1 - An F Test determining whether at least one level of the Factor 1
groupings of the data set has a significantly different mean than the other Factor 1 levels. This is a Main
Effects test for Factor 1.
Main Effects F Test for Factor 2 - An F Test determining whether at least one level of the Factor 2
groupings of the data set has a significantly different mean than the other Factor 2 levels. This is a Main
Effects test for Factor 2.

Main Effects F Test for Factor 1


This F Test has produced a p Value of 0.0333. At a Level of Significance (alpha) of 0.05, this F Test has
produced a significant result because the generated p Value is smaller than the alpha level of 0.05. This
result indicates at least 95 percent certainty that the mean at least one of the level data groups is different
than means of the other level data groups of Factor 1. That can be stated equivalently by saying that
there is less than a 5 percent chance that the detected difference is merely a random result of the sample
taken and not real. An F Test is an omnibus test meaning that it can detect significant difference(s)
between the means but not the location of the significant difference(s) if there are more than two sample
groups in the F Test. A post hoc test called Tukey’s HSD test will be performed to determine which
differences between the means of Factor 1’s level groups are significant.

Main Effects F Test for Factor 2


This F Test has produced a p Value of 0.0442. At a Level of Significance (alpha) of 0.05, this F Test has
produced a significant result because the generated p Value is smaller than the alpha level of 0.05. This
result indicates at least 95 percent certainty that the mean of at least one of the level data groups is different
from the means of the other level data groups of Factor 2. That can be stated equivalently by saying that
there is less than a 5 percent chance that the detected difference is merely a random result of the sample
taken and not real. An F Test is an omnibus test meaning that it can detect difference(s) but not the
location of the difference(s) if there are more than two sample groups in the F Test. In this case there are
only two levels in Factor 2. The significant result of this F Test indicates the difference between those two
levels is significant. Post hoc testing is not needed because the location of the significant difference is
already known since there is only one difference.
Interaction Effects F Test for Factors 1 and 2 - An F Test determining whether any level of Factor 1
interacts with any level of Factor 2 to create significantly different mean values in treatment cells across
the Factor 2 levels. This F Test analyzes whether the systematic differences of means of treatment cells
along rows vary at different column levels and vice versa. This is an Interaction Test.
This F Test has produced a p Value of 0.0142. At a Level of Significance (alpha) of 0.05, this F Test has
produced a significant result because the generated p Value is smaller than the alpha level of 0.05. This
result indicates at least 95 percent certainty that there is interaction between Factor 1 and Factor 2. That
can be stated equivalently by saying that there is less than a 5 percent chance that the detected
interaction is merely a random result of the sample taken and not real.
Post hoc testing would not be the most intuitive method to determine where the significant interactions
occur. These differences will be most prominently displayed on a line graph connecting the means of
treatment cells. A line graph of two-factor ANOVA will produce two line graphs that are next to each other.

The greater the difference in the slopes of these lines, the more interaction between the Factors has
occurred. The closer the two lines are to being parallel, the less interaction has occurred between the two
factors. This graph will shortly be created and explained.

Step 6 – Perform Post-Hoc Testing in Excel


The F-test in ANOVA is classified as an omnibus test. An omnibus test is one that tests the overall
significance of a model to determine whether a difference exists but not exactly where the difference is.
The F Test of ANOVA tests the Null Hypothesis that states that all of the group means in that F Test are
the same. When a significant result from the F Test (the p value is smaller than alpha) causes the Null
Hypothesis to be rejected, further testing must be performed to determine which pairs of means are
significantly different. That type of testing is called post hoc testing.
Post hoc testing is a pairwise comparison. Group means are compared two at a time to determine
whether the difference between the pair of means is significant.

Post-Hoc Tests Used When Group Variances Are Equal


SPSS lists the following Post-Hoc tests or corrections available when group variances are equal:
LSD
Bonferroni
Sidak
Scheffe
REGWF
REGWQ
S-N-K
Tukey (Tukey’s HSD or Tukey-Kramer)
Tukey’s b
Duncan
Hochberg’s GT2
Gabriel
Waller-Duncan
Dunnett
Of all of the post hoc tests available when group variances are found to be similar, Tukey’s HSD test is
used much more often than the others. Tukey’s HSD can only be used when group sizes are exactly the
same, which is the case for balanced two-factor ANOVA with replication.
Tukey’s HSD (Honestly Significant Difference) Test – Used When Group Sizes and Group Variances Are
Equal
Tukey’s HSD test compares the difference between each pair of group means to determine which
differences are large enough to be considered significant.
Tukey’s HSD test is very similar to a t-test except that it makes a correction for the experiment-wide error
rate that a t-test doesn’t. The experiment-wide error rate is the increased probability of type 1 errors (false
positives – detecting a difference where none exists) when multiple comparisons are made.
Tukey’s HSD test can be summarized as follows:

The means of all groups are arranged into as many unique pair combinations as possible. The pair
combination with the largest difference between the two means is tested first. A test statistic for this pair
of means is calculated as follows:

q = (larger group mean – smaller group mean) / SE

where

SE = √(MSWithin / n)
n = number of samples in any group (all groups must be of equal size for Tukey’s HSD Post-Hoc test)
This test statistic is compared to qCritical. The critical q values are found on the Studentized Range q table.
A unique critical q value is calculated for each unique combination of level of significance (usually set at
0.05), the degrees of freedom, and the total number of groups in the ANOVA analysis.
Tukey’s test calculates degrees of freedom as follows:
df = Degrees of freedom = (total number of samples in all groups combined) – (total number of groups in
that F test)
The difference between the two means is designated as significant if its test statistic q is larger than the
critical q value from the table.
If the difference between the means with the largest difference is found to be significant, the next inside
pair of means is tested. This step is repeated until an innermost pair is found to have a difference that is
not significant. Once an inner pair of means is found to have a difference that is not large enough to be
significant, no further testing needs to be done because all untested pairs will be inside this one and have
even smaller differences between the means.
The Tukey HSD test calculates Test Statistic q for each pair of means and compares it to qCritical. The
critical q values are found on the Studentized Range q table using the Excel lookup function
INDEX(array, row number, column number).
The Test Statistic q is calculated as follows:
q = (Max Group Mean – Min Group Mean) / SE
df = Degrees of freedom = (total number of samples) – (total number of groups in that F Test)
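Both quantities can be computed directly on the worksheet. As a minimal sketch, assume (hypothetically) that MSWithin from the ANOVA output sits in cell B2, that each Factor 1 level group holds n = 8 observations, and that the pair's two group means sit in cells B3 and B4:

SE: =SQRT(B2/8)
q: =ABS(B3-B4)/SQRT(B2/8)

The resulting q is then compared with the critical q value looked up for the appropriate df and number of groups k.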

Tukey’s HSD Test in Excel For the Factor 1 Main Effects F Test
Tukey’s HSD Test should be performed for the Factor 1 Main Effects F Test but not for the Factor 2 Main
Effects F Test. The purpose of Tukey’s HSD post hoc test is to determine which difference(s) between
means are significant. Factor 1 has three levels and therefore three pairwise differences between the means of the
three level groups. The significant result of this F Test indicates that at least one of the level group means is
different from the other two level group means. The F Test is an omnibus test meaning that it does not tell

where that difference lies. Tukey’s HSD test will indicate whether each of the differences between any
combination of the three means is significant.
Post hoc testing does not need to be performed on the two level groups of Factor 2’s Main Effects test.
Tukey’s HSD test does not need to be performed when an F Test is run on only two groups. There is only
one difference between the two group means. If the F Test indicates that there is a significant difference
between the means of the two groups, there is no need to determine which difference is significant
because there is only one difference.
The first step when performing the Tukey HSD test is to list all unique mean pairs and the differences
between the means. All of this information can be found from the Excel ANOVA output as follows:

The total number of combinations of pairs of n objects can be found by the following Excel formula:
=COMBIN(n,2)
If there are three level group means of Factor 1 (n = 3), the total number of combination pairs of these
means is three, as a result of the following Excel formula:
COMBIN(3,2) = 3

From the Excel output, the three level group means of Factor 1 are the following:

Factor 1 Level 1 group mean = 76.125


Factor 1 Level 2 group mean = 49.625
Factor 1 Level 3 group mean = 85.375

Three unique group pairings exist: (1,2), (1,3), and (2,3)

The absolute differences in means of each pair are as follows:


Pair (1,2) Mean Difference = ABS(76.125-49.625) = 26.50
Pair (1,3) Mean Difference = ABS(76.125-85.375) = 9.25
Pair (2,3) Mean Difference = ABS(49.625-85.375) = 35.75

The differences between these means in descending order are as follows:


Largest difference = Pair (2,3) Mean Difference = 35.75
2nd largest difference = Pair (1,2) Mean Difference = 26.50
Smallest difference = Pair (1,3) Mean Difference = 9.25

Calculating q and q Critical for each difference requires MSWithin and dfWithin from the following section of
the Excel ANOVA output.

Difference between group means are checked for significance starting with the largest difference and
working down to the smallest difference. As soon as one difference is found to be insignificant, no further
differences need to be checked because all smaller differences will also be insignificant.
Calculating q and q Critical for the largest difference between the means of Factor 1 level groups is done
as follows:

The q Critical value for α = 0.05 can be looked up on the critical value table for the specific k and df as
follows:

Calculating q and q Critical for the 2nd largest difference between the means of Factor 1 level groups is
done as follows:

Calculating q and q Critical for the smallest difference between the means of Factor 1 level groups is done
as follows:

Looking Up qCritical on the Studentized Range q Table With the Excel INDEX() Function
The Studentized Range q table and the Excel INDEX() function appear as follows:
=INDEX( array, relative row number, relative column number )
A relative address is the address relative to the cell in the upper left corner of the array. If the INDEX()
function is attempting to locate a value in a cell that is in the third column over (to the right of) and the third
row down from the cell in the upper left corner of the array, the relative row number equals 3 and the
relative column number equals 3.
The array argument is the absolute address of the array. This is given as
upper left corner cell:lower right corner cell. In this case it would be D5:K103.
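As a usage sketch with this array, the following formula would return the value stored three rows down and three columns over from cell D5:

=INDEX(D5:K103,3,3)

In practice the relative row number is chosen to match the table row containing the correct df and the relative column number is chosen to match the column containing the correct number of groups k. The specific row and column numbers here are only illustrations.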

Determining Where the Strongest Interactions Between Factor 1 and Factor 2 Occur
The first step is to calculate the mean of each treatment cell as follows:

The second step is to plot the treatment cell means on a scatterplot chart, with separate line graphs for
each level of one of the factors. In this case each level of Factor 2 is given its own line graph as follows.

The preceding scatterplot shows separate line graphs for each of the two levels of Factor 2 at successive
levels of Factor 1. Interaction occurs in Two-Way ANOVA when systematic differences between levels of one
factor vary across different levels of the other factor.
The interaction of the two factors between various levels is indicated by the slopes of adjacent line
segments. Adjacent line segments that are parallel show no interactions between the levels of the factors

at the endpoints of the line segments. The more that the slopes differ, the greater is the interaction of the
two factors between the levels at the endpoints of the adjacent line segments.
The relative degree of interaction between the two factors across all combinations of their levels can be
determined by calculating the absolute difference in the slopes of adjacent line segments. The adjacent
line segments that have the greatest absolute difference in slopes display the greatest degree of
interaction between factor levels at the endpoints of the adjacent line segments.
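These slope calculations reduce to simple differences of adjacent treatment-cell means. As a minimal sketch, assume (hypothetically) that the treatment-cell means for Factor 2 Level 1 occupy cells B2:D2 and those for Factor 2 Level 2 occupy cells B3:D3, with the three Factor 1 levels running across the columns:

Slope of one segment on the Level 1 line: =C2-B2
Slope of the matching segment on the Level 2 line: =C3-B3
Absolute difference in slopes: =ABS((C2-B2)-(C3-B3))

The segment pair returning the largest absolute difference marks the adjacent Factor 1 levels at which the interaction between the two factors is strongest.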
The calculations below indicate that there is significantly greater interaction between Factors 1 and 2 at
higher levels of Factor 1 than at lower levels of Factor 1.

Step 7 – Calculate Effect Size
Effect size is a way of describing how effectively the method of data grouping allows those groups to be
differentiated. A simple example of a grouping method that would create easily differentiated groups
versus one that does not is the following.
Imagine a large random sample of height measurements of adults of the same age from a single country.
If those heights were grouped according to gender, the groups would be easy to differentiate because the
mean male height would be significantly different than the mean female height. If those heights were
instead grouped according to the region where each person lived, the groups would be much harder to
differentiate because there would not be a significant difference between the means and variances of
heights from different regions.
Because the various measures of effect size indicate how effectively the grouping method makes the
groups easy to differentiate from each other, the magnitude of effect size tells how large of a sample must
be taken to achieve statistical significance. A small effect can become significant if a large enough
sample is taken. A large effect might not achieve statistical significance if the sample size is too small.
The most common measure of effect size of two-factor ANOVA is the following:

Eta Squared (η²)

(The Greek letter “eta” rhymes with “beta.”) Eta squared quantifies the percentage of variance in the dependent
variable (the variable that is measured and placed into groups) that is explained by the independent
variable (the method of grouping). If eta squared = 0.35, then 35 percent of the variance associated with
the dependent variable is attributed to the independent variable (the method of grouping).
Eta squared provides an overestimate (a positively-biased estimate) of the explained variance of the
population from which the sample was drawn because eta squared estimates only the effect size on the
sample. The effect size on the sample will be larger than the effect size on the population. This bias
grows smaller as the sample size grows larger.
Eta squared is affected by the number and size of the other effects.
η² = SSBetween_Groups / SSTotal. These two terms are part of the ANOVA calculations found in the
Single-Factor ANOVA output.
Jacob Cohen in his landmark 1988 book Statistical Power Analysis for the Behavioral Sciences proposed that
effect sizes could be generalized as follows:
η² = 0.01 for a small effect. A small effect is one that is not easily observable.
η² = 0.06 for a medium effect. A medium effect is more easily detected than a small effect but less easily
detected than a large effect.
η² = 0.14 for a large effect. A large effect is one that is readily detected with the current measuring
equipment.
Eta squared is sometimes called the nonlinear correlation coefficient because it provides a measure of
strength of the curvilinear relationship between the dependent and independent variables. If the
relationship is linear, eta squared will have the same value as r squared.

Calculating Eta Squared (η²) in Excel
Eta squared is calculated with the formula
η² = SSBetween_Groups / SSTotal
and is implemented in Excel on this data set as follows:

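As a worksheet sketch of this calculation, assume (hypothetically) that the SS values in the Excel ANOVA output for the Sample row (the row-block factor, here Factor 1), Columns (Factor 2), Interaction, and Total occupy cells B10, B11, B12, and B14:

η² for Factor 1: =B10/B14
η² for Factor 2: =B11/B14
η² for the Interaction: =B12/B14

The cell addresses are placeholders; the actual positions depend on where the ANOVA output table was placed on the worksheet.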
Magnitudes of eta squared are generally classified exactly as magnitudes of r² (the coefficient of
determination) are, as follows: η² = 0.01 is considered a small effect. η² = 0.06 is considered a medium effect.
η² = 0.14 is considered a large effect. Small, medium, and large are relative terms. A large effect is easily
discernible but a small effect is not.
η²Factor_1 = 0.198, which is considered to be a large effect.
η²Factor_2 = 0.111, which is considered to be a medium effect.
η²Interaction = 0.260, which is considered to be a large effect.
η²Error = 0.431, which is considered to be a very large effect.
Such a large eta-squared term for the error component of the variation indicates that perhaps another
independent variable that has not been included in the test accounts for a substantial part of the total
variation of the data.
A large eta-squared error term also indicates the possibility of inaccuracy during data collection and
recording.

Step 8 – Calculate the Power of the Test
The accuracy of a statistical test is very dependent upon the sample size. The larger the sample size, the
more reliable will be the test’s results. The accuracy of a statistical test is specified as the Power of the
test. A statistical test’s Power is the probability that the test will detect an effect of a given size at a given
level of significance (alpha). The relationships are as follows:
α (“alpha”) = Level of Significance = 1 – Level of Confidence
α = probability of a type 1 error (a false positive)
α = probability of detecting an effect where there is none
β (“beta”) = probability of a type 2 error (a false negative)
β = probability of not detecting a real effect
1 – β = probability of detecting a real effect
Power = 1 – β
Power needs to be clarified further. Power is the probability of detecting a real effect of a given size at a
given Level of Significance (alpha) at a given total sample size and number of groups.
The term Power can be described as the accuracy of a statistical test. The Power of a statistical test is
related with alpha, sample size, and effect size in the following ways:
The larger the sample size, the larger is a test’s Power because a larger sample size increases a
statistical test’s accuracy.
The larger alpha is, the larger is a test’s Power because a larger alpha reduces the amount of confidence
needed to validate a statistical test’s result. Alpha = 1 – Level of Confidence. The lower the Level of
Confidence needed, the more likely a statistical test will detect an effect.
The larger the specified effect size, the larger is a test’s Power because a larger effect size is more likely
to be detected by a statistical test.
If any three of the four related factors (Power, alpha, sample size, and effect size) are known, the fourth
factor can be calculated. These calculations can be very tedious. Fortunately there are a number of free
utilities available online that can calculate a test’s Power or the sample size needed to achieve a specified
Power. One very convenient and easy-to-use downloadable Power calculator called G-Power is available
at the following link at the time of this writing:
http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/
Power calculations are generally used in two ways:
1) A priori - Calculation of the minimum sample size needed to achieve a specified Power to detect an
effect of a given size at a given alpha. This is the most common use of Power analysis and is normally
conducted a priori (before the test is conducted) when designing the test. A Power level of 80 percent for
a given alpha and effect size is a common target. Sample size is increased until the desired Power level
can be achieved. Since Power equals 1 – β, the resulting β of the targeted Power level represents the
highest acceptable level of a type 2 error (a false negative – failing to detect a real effect). Calculation of
the sample size necessary to achieve a specified Power requires three input variables:
a) Power level – This is often set at .8, meaning that the test has an 80 percent chance of detecting an effect of a
given size.

b) Effect size – Effect sizes are specified by the variable f. Effect size f is calculated from a different
measure of effect size called η² (eta squared). η² = SSBetween_Groups / SSTotal. These two terms are part of
the ANOVA calculations found in the Single-Factor ANOVA output.

The relationship between effect size f and effect size η² is as follows:

f = √(η² / (1 − η²))
As mentioned, effect sizes are often generalized as follows:

η² = 0.01 for a small effect. A small effect is one that is not easily observable.

η² = 0.06 for a medium effect. A medium effect is more easily detected than a small effect but less easily
detected than a large effect.

η² = 0.14 for a large effect. A large effect is one that is readily detected with the current measuring
equipment.

The above values of η² produce the following values of effect size f:

f = 0.1 for a small effect.

f = 0.25 for a medium effect.

f = 0.4 for a large effect.

c) Alpha – This is commonly set at 0.05.
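The conversion in item b) from η² to effect size f takes a single worksheet formula. As a minimal sketch, assuming (hypothetically) that η² sits in cell B2:

=SQRT(B2/(1-B2))

For example, B2 = 0.14 returns approximately 0.40, matching the large-effect benchmarks listed above.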

Calculating Power With the Online Tool G*Power


An example of a priori Power calculation would be the following. Power calculations are normally used a
priori to determine the total ANOVA sample size necessary to achieve a specific Power level for detecting
an effect of a specified size at a given alpha.
Power will be calculated separately for each of the Main Effects F Tests.
Power Calculation For the Factor 1 Main Effects F Test
The Factor 1 Main Effect F Test has the following parameters:
Number of Groups in the Factor 1 Main Effects F Test = k = 3
Numerator df = k – 1 = 2
Total number of groups = (number of levels of Factor 1) X (number of levels of Factor 2) = 3 X 2 = 6
Total sample size = total number of data observations contained in all level data groups that are part of the
Factor 1 Main Effects F Test = 24
Determining the power of this F Test to detect a large effect (f = 0.4) at an alpha level of 0.05 would be
calculated using the G*Power utility as follows:

The preceding G*Power dialogue box and output shows the power of this F Test to be 0.345. That means
that this F Test has a 34.5 percent chance of detecting a large effect (f = 0.4) at an alpha level of 0.05.
Determining the power of the current test is post hoc analysis. The type of analysis selected in the
dialogue box is the Post Hoc selection.
A priori analysis can also be performed with the G*Power utility. A priori analysis would be used to
determine the sample size necessary to achieve a given power level. When a priori analysis is selected in
G*Power, the following chart can be generated which indicates the total sample size necessary to
generate various power levels for the test using the current parameters.

This diagram shows that a total sample size of at least 63 or 64 would be necessary for this 3-level F Test
within this two-factor ANOVA test to generate a power level of 0.8 to detect a large effect (f = 0.4) at an
alpha level of 0.05. Four replicates in each of the six unique treatment cells means that the current total
sample size is 24. At least 11 replicates would be needed in each treatment cell for this F Test to achieve
a power level of 0.8. A power level of 0.8 means that a test has an 80 percent chance of detecting an
effect of the specified size at the given alpha level.
Power Calculation For the Factor 2 Main Effects F Test
The Factor 2 Main Effects F Test has the following parameters:
Number of Groups in the Factor 2 Main Effects F Test = k = 2
Numerator df = k – 1 = 1
Total number of groups = (number of levels of Factor 1) X (number of levels of Factor 2) = 3 X 2 = 6
Total sample size = total number of data observations contained in all level data groups that are part of the
Factor 2 Main Effects F Test = 24
Determining the power of this F Test to detect a large effect (f = 0.4) at an alpha level of 0.05 would be
calculated using the G*Power utility as follows:

The preceding G*Power dialogue box and output shows the power of this F Test to be 0.458. That means
that this F Test has a 45.8 percent chance of detecting a large effect (f = 0.4) at an alpha level of 0.05.
Determining the power of the current test is post hoc analysis. The type of analysis selected in the
dialogue box is the Post Hoc selection.
A priori analysis can also be performed with the G*Power utility. A priori analysis would be used to
determine the sample size necessary to achieve a given power level. When a priori analysis is selected in
G*Power, the following chart can be generated which indicates the total sample size necessary to
generate various power levels for the test using the current parameters.

This diagram shows that a total sample size of at least 50 would be necessary for this 2-level F Test within
this two-factor ANOVA test to generate a power level of 0.8 to detect a large effect (f = 0.4) at an alpha
level of 0.05. Four replicates in each of the six unique treatment cells means that the current total sample
size is 24. At least 9 replicates would be needed in each treatment cell for this F Test to achieve a power
level of 0.8. A power level of 0.8 means that a test has an 80 percent chance of detecting an effect of the
specified size at the given alpha level.

Power Calculation For the Interaction Effect F Test


All F Tests that are part of the same ANOVA test use nearly all of the same input parameters for the
G*Power utility. The only input parameter that varies for the different F Tests is the Numerator df. The
Numerator df for the interaction effect equals (number of Factor 1 levels – 1) X (Number of Factor 2 levels
– 1). In this case, the following calculation is performed:
Numerator df = (3 – 1) X (2 – 1) = 2
This is the same Numerator df as used by G*Power for the Factor 1 Main Effects F Test. The G*Power
output will therefore be the same for both F Tests.

What To Do When Groups Are Not Normally-Distributed
The Scheirer-Ray-Hare Test in Place of Two-Factor ANOVA With Replication

Scheirer-Ray-Hare Test in Excel


A relatively unknown but very useful nonparametric substitute for two-way ANOVA with replication (must
be balanced ANOVA) is the Scheirer-Ray-Hare test. It is an extension of the Kruskal-Wallis test. It is done
in this way:
1) Replace each data observation with its overall rank (lowest number is ranked 1 and tied observations
are all given the average rank)
2) Run the two-way ANOVA as usual with the ranks instead of the actual data values.
3) Discard the MS, F, and p value terms in the ANOVA output.
4) Sum SS for SS factors, SS interaction, and SS error. Divide this sum by df total. The result is MS total.
5) The test statistic, H, for each factor and interaction equals its SS / MS total
6) The Excel formula for the p value for each is: CHISQ.DIST.RT(H, df). The df is the usual df for each
factor and interaction. The Excel output provides these df figures.
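A minimal worksheet sketch of these steps, with all cell references hypothetical: if the 24 original observations occupy cells A2:A25, then

=RANK.AVG(A2,A$2:A$25,1)

entered in B2 and filled down replaces each observation with its ascending rank (the third argument, 1, ranks the lowest value 1, and ties automatically receive the average rank). After the two-factor ANOVA tool is run on the ranks, if the SS for one factor sits in cell E10 and MS Total (the summed SS divided by df Total) sits in cell E15, the test statistic and p Value for that factor are:

H: =E10/E15
p Value: =CHISQ.DIST.RT(E10/E15,2)

where 2 is the df for a factor with three levels (k – 1 = 2).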
Just like the Kruskal-Wallis test, the Scheirer-Ray-Hare test requires that the data groups be symmetrical
about an axis. The normality of all data groups has already been confirmed with the Shapiro-Wilk test. A
data group that is normally distributed will be symmetrical about its mean. The Scheirer-Ray-Hare test
symmetry requirement is therefore validated.
Just like the Kruskal-Wallis test, the Scheirer-Ray-Hare test statistic H for each F Test is very nearly
distributed as the Chi-Square distribution with k – 1 degrees of freedom as long as the number of
samples in each group is at least 5. The Factor 1 Main Effects F Test has three level groups each with 8
data observations. The Factor 2 Main Effects F Test has two level groups each with 12 data observations.
The test statistics for each of these F Tests will be distributed nearly as the Chi-Square distribution with k – 1
degrees of freedom.
There are, however, only four replicates in each treatment cell. The requirement of at least five samples
for each group is not met for the Interaction Effects F Test. The Scheirer-Ray-Hare test statistic for this F
Test is not confirmed to be distributed similarly to the Chi-Square distribution.

The p Value formula used here is for Excel versions prior to 2010. Excel 2010 and later would use the
following formula:
p Value = CHISQ.DIST.RT(H,df)
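The pre-2010 formula being referred to is the legacy function CHIDIST, which returns the same right-tail chi-square probability:

p Value = CHIDIST(H, df)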
The p Values generated by the Scheirer-Ray-Hare test are compared here with the p Values generated
by the two-factor ANOVA with replication test performed on the data set.
The Interaction p Value from the Scheirer-Ray-Hare test from this data set is not considered valid
because there are fewer than five samples in each sample group of this F Test.

Two-Factor ANOVA Without Replication

Overview
Single-Factor ANOVA tests whether a significant proportion of the variation present in a data set can be
accounted for by a single factor that affects the objects being measured.
Two-Factor ANOVA tests whether a significant proportion of the variation present in a data set can be
accounted for by either or both of two factors that simultaneously affect the objects being measured.
Two-Factor ANOVA can also be used to test whether a significant proportion of the variation present in a
data set can be accounted for by the interaction between two factors that simultaneously affect the
objects being measured.

Two-Factor ANOVA Without Replication Example in Excel


Excel provides two options for Two-Factor ANOVA. This Excel test can be performed with replication or
without replication. The difference is fairly simple. Two-Factor ANOVA without replication contains exactly
one data point for each possible combination of levels between the two factors.
Two-Factor ANOVA without replication should not be considered to be a reliable statistical test
because the data samples on which this test is based are always too small. This will be discussed
shortly.
An example of a data set for two-factor ANOVA without replication is shown as follows:

Factor 1 contains four levels and Factor 2 contains three levels. There are 12 possible combinations of levels between Factors 1 and 2. Each of these 12 combinations is a unique treatment cell and contains a single data observation. There are 12 data observations in total in this data set.
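The general shape of such a data set (entries are shown symbolically here; the actual values appear in the screenshot above) is:

                      Factor 2
Factor 1     Level 1   Level 2   Level 3
Level 1        x11       x12       x13
Level 2        x21       x22       x23
Level 3        x31       x32       x33
Level 4        x41       x42       x43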
Two-factor ANOVA with replication contains more than one observation for each combination of factor levels. Excel's two-factor ANOVA with replication requires an equal number of data observations for every combination of factor levels. This arrangement of data for ANOVA testing is referred to as being “balanced”: each treatment cell (unique combination of factor levels) contains the same number of data observations. It is possible to conduct unbalanced two-factor ANOVA, but that is much more complicated and will not be discussed here.

Performing two-factor ANOVA without replication can be done by selecting the Data Analysis tool entitled
Anova:Two-Factor Without Replication and then completing the tool’s dialogue box as follows:
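For readers working without the screenshot: this tool's dialogue box asks only for the Input Range (the data block, including the row and column labels), a Labels checkbox, Alpha (0.05 here), and an output location.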

Hitting the OK button will produce the following output:

The output shown here can be interpreted as follows:
The p Value associated with the main effect of Factor 1 (the factor whose levels are arranged in rows) is 0.0734. This is not significant at an alpha of 0.05. By this measure, Factor 1 has not had a significant effect on the data.
The p Value associated with the main effect of Factor 2 (the factor whose levels are arranged in columns) is 0.0417. This is significant at an alpha of 0.05. By this measure, Factor 2 has had a significant effect on the data.
There is, however, one major issue that dramatically reduces the validity of this test’s conclusions just
shown. Two-Factor ANOVA without replication nearly always tests too little data to be considered
reliable. Because each combination of levels contains only a single data observation, the number of
observations in each level group is very small and the total number of observations is very small. This
affects the validity of the test results in the following two important ways:
1) Small Sample Size Makes ANOVA’s Required Assumptions Unverifiable. ANOVA’s required
assumptions that data come from normally-distributed populations having similar variances cannot be
verified. ANOVA’s required assumptions of data normality and homoscedasticity (similarity of variances)
are derived from the requirements of the F-tests that are performed in the ANOVA tests. Two-Factor
ANOVA performs a separate F-test for each factor that is tested. This can be seen in the Excel ANOVA
output shown in this section. Each F-test requires that the data from all data groups used to construct the
Sum of Squares be taken from populations that are normally distributed and have similar variances.
Group sizes for Two-Factor ANOVA without replication are nearly always smaller than ten. This size is too
small to credibly validate ANOVA’s required assumptions of data normality and similar variances within
the groups of each F test.
2) Small Sample Size Reduces the Test’s Power to an Unacceptably Low Level. The small group
sizes reduce the ANOVA test’s power to an unacceptable level. A statistical test’s power is its probability
of detecting an effect of a specified size. Power is defined as 1 – β. Beta, β, represents a test's probability of a type 2 error. A type 2 error is a false negative. In other words, β is a test's probability of not detecting an effect that should have been detected. 1 – β (the power of the test) is a test's probability
of detecting an effect that should have been detected. Calculating the power of an ANOVA test is tedious
but fortunately there are a number of utilities freely available online that can quickly calculate an ANOVA
test’s power. The power of the Two-Factor ANOVA without replication will be discussed in detail as
follows:

Power Analysis of Two-Factor ANOVA Without Replication


The accuracy of a statistical test is very dependent upon the sample size: the larger the sample size, the more reliable the test's results will be. The accuracy of a statistical test is specified as the Power of the test. A statistical test's Power is the probability that the test will detect an effect of a given size at a given level of significance (alpha). The relationships are as follows:
α (“alpha”) = Level of Significance = 1 – Level of Confidence
α = probability of a type 1 error (a false positive)
α = probability of detecting an effect where there is none
β (“beta”) = probability of a type 2 error (a false negative)
β = probability of not detecting a real effect
1 – β = probability of detecting a real effect
Power = 1 – β
Power needs to be clarified further. Power is the probability of detecting a real effect of a given size at a
given Level of Significance (alpha) at a given total sample size and number of groups.
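For example, a test run with β = 0.20 has Power = 1 – 0.20 = 0.80, i.e., an 80 percent chance of detecting a real effect of the specified size.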
The term Power can be described as the accuracy of a statistical test. The Power of a statistical test is related to alpha, sample size, and effect size in the following ways:
1) The larger the sample size, the larger is a test’s Power because a larger sample size increases a
statistical test’s accuracy.
2) The larger alpha is, the larger is a test’s Power because a larger alpha reduces the amount of
confidence needed to validate a statistical test’s result. Alpha = 1 – Level of Confidence. The lower the
Level of Confidence needed, the more likely a statistical test will detect an effect.
3) The larger the specified effect size, the larger is a test’s Power because a larger effect size is more
likely to be detected by a statistical test.
If any three of the four related factors (Power, alpha, sample size, and effect size) are known, the fourth
factor can be calculated. These calculations can be very tedious. Fortunately there are a number of free
utilities available online that can calculate a test’s Power or the sample size needed to achieve a specified
Power. One very convenient and easy-to-use downloadable Power calculator called G*Power is available
at the following link at the time of this writing:
http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/
Power calculations are generally used in two ways. The second way, post hoc analysis, computes the Power achieved by a test that has already been run; it is not used in this chapter. The first way is as follows:
1) A priori - Calculation of the minimum sample size needed to achieve a specified Power to detect an
effect of a given size at a given alpha. This is the most common use of Power analysis and is normally
conducted a priori (before the test is conducted) when designing the test. A Power level of 80 percent for
a given alpha and effect size is a common target. Sample size is increased until the desired Power level
can be achieved. Since Power equals 1 – β, the resulting β of the targeted Power level represents the
highest acceptable level of a type 2 error (a false negative – failing to detect a real effect). Calculation of
the sample size necessary to achieve a specified Power requires three input variables:
a) Power level – This is often set at .8, meaning that the test has an 80 percent chance of detecting an effect of a given size.
b) Effect size – Effect sizes are specified by the variable f. Effect size f is calculated from a different measure of effect size called η² (eta squared). η² = SS_Between_Groups / SS_Total. These two terms are part of the ANOVA calculations found in the Single-factor ANOVA output.
The relationship between effect size f and effect size η² is as follows (a worked check of this conversion appears after item c below):
f = √( η² / (1 – η²) )
Jacob Cohen in his landmark 1988 book Statistical Power Analysis for the Behavioral Sciences proposed that effect sizes could be generalized as follows:
η² = 0.01 for a small effect. A small effect is one that is not easily observable.
η² = 0.06 for a medium effect. A medium effect is more easily detected than a small effect but less easily detected than a large effect.
η² = 0.14 for a large effect. A large effect is one that is readily detected with the current measuring equipment.
The above values of η² produce the following values of effect size f:
f = 0.1 for a small effect
f = 0.25 for a medium effect
f = 0.4 for a large effect
c) Alpha – This is commonly set at 0.05.
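As a quick check of the conversion formula given above, η² = 0.14 gives f = √(0.14 / 0.86) ≈ 0.40, and η² = 0.06 gives f = √(0.06 / 0.94) ≈ 0.25, matching the large- and medium-effect values just listed.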

Performing a priori Power Analysis for the Main Effect of Factor 1


The G*Power utility will be used in an a priori manner to demonstrate how incredibly low the Power of
two-factor ANOVA without replication is. The example used in this chapter will be analyzed. The data set
and the Excel output of this example are shown as follows:

Two-Factor ANOVA without replication has two factors. Because each treatment cell contains only a single data observation, there is no term to account for the effect of interaction between these two factors. Each factor has its own unique Power that must be calculated. The Power for each factor is the probability that the ANOVA test will detect an effect of a given size caused by that factor. A separate Power calculation can be performed for each of the two factors in this example.
Power analysis performed a priori calculates how large the total sample size must be to achieve a specific
Power level to detect an effect of a specified size at a given alpha level. A priori Power analysis of the
main effect of factor 1 of this example is done as follows:
The following parameters must be entered into the G*Power dialogue box for an a priori analysis of the general ANOVA:

Power (1 – β): 0.8 – This is a commonly used Power target. A test that achieves a Power level of 0.8 has an 80 percent chance of detecting the specified effect.
Effect size: 0.4 – This is a large effect. This analysis will calculate the sample size needed to achieve an
80 percent probability of detecting an effect of this size.
α (alpha): 0.05
Numerator df: 3 – The degrees of freedom specified for a test of a main effect of a factor equals the
number of factor levels – 1. Factor 1 has 4 levels. This numerator df therefore equals 4 – 1 = 3. Note that
this is the same df that is specified in the Excel ANOVA output for factor 1.
Number of groups: 12 – The number of groups equals (number of levels in factor 1) x (number of levels in factor 2). This equals 4 x 3 = 12. The number of groups is equal to the total number of unique treatment cells. One unique treatment cell exists for each unique combination of levels between the factors.
Running the G*Power analysis produces the following output:

This indicates that a total sample size of 73 is needed to achieve a Power level of 0.8 to detect a large (f = 0.4) main effect of factor 1. The total sample size in this example, however, is only 12 because there are 12 total data observations in this ANOVA test.
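This a priori calculation can also be reproduced outside of G*Power. The short Python sketch below is not part of the original text; it assumes the fixed-effects ANOVA model that G*Power documents for this routine (noncentrality λ = f² × N and error df = N – number of groups) and searches for the smallest total N that reaches the target Power:

from scipy.stats import f as f_dist, ncf

def anova_power(n_total, f_effect, df_num, n_groups, alpha=0.05):
    df_den = n_total - n_groups                # error degrees of freedom
    nc = (f_effect ** 2) * n_total             # noncentrality parameter lambda
    f_crit = f_dist.ppf(1 - alpha, df_num, df_den)
    return ncf.sf(f_crit, df_num, df_den, nc)  # P(noncentral F > F critical)

n = 13                                         # smallest N with at least 1 error df
while anova_power(n, f_effect=0.4, df_num=3, n_groups=12) < 0.80:
    n += 1
print(n)                                       # should land at N = 73, as G*Power reports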
G*Power also creates an additional plot showing the Power of this test across a range of values for the
total sample size. This plot will confirm how low the Power of two-factor ANOVA without replication really
is:

This plot shows the Power of this particular test using a total sample size of 12 to be slightly less than 0.1.
This means that this two-factor ANOVA test has less than a 10 percent chance of detecting a large main
effect caused by factor 1 if the total sample size is 12.
Two-factor ANOVA without replication is a two-factor ANOVA test performed on a data set having only a single data observation in each treatment cell. Performing this same test on a data set with two data observations in each treatment cell (total sample size equals 24) would still only attain a Power level of approximately 0.25.
This plot shows that this two-factor ANOVA test would require at least 6 data observations in each
treatment cell (total sample size equals 72) to achieve a Power level of 0.8 for a large main effect (f = 0.4)
of factor 1 at alpha = 0.05.
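The anova_power sketch shown earlier reproduces these plotted values under the same assumptions: anova_power(24, 0.4, 3, 12) comes out near 0.25, and anova_power(72, 0.4, 3, 12) comes out near 0.8.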

Conclusion
Two-Factor ANOVA without replication nearly always tests too little data to be considered reliable.
The small group sizes that occur with two-factor ANOVA without replication reduce the test's Power to an unacceptable level. Small group size also prevents validation of ANOVA's required assumptions of data normality within groups and similar variances of all groups within each factor. The Excel output of the two-factor ANOVA without replication test conducted in this section shows Factor 2 having a significant effect on the output (p Value = 0.0417) and Factor 1 not having a significant effect (p Value = 0.0734) at a significance level of alpha = 0.05. These would clearly not be valid conclusions given the small group sizes and the resulting lack of Power of this ANOVA test.

Check Out the Latest Book in the Excel Master Series!

Click Here To Download This 200+ Page Excel Solver Optimization Manual Right Now for $19.95

http://37.solvermark.pay.clickbank.net/

For anyone who wants to be performing optimization at a high level with the Excel Solver quickly, Step-
By-Step Optimization With Excel Solver is the e-manual for you. This is a hands-on, step-by-step,
complete guidebook for both beginner and advanced Excel Solver users. This book is perfect for the
many students who are now required to be proficient in optimization in so many majors, as well as industry professionals who have an immediate need to get up to speed with advanced optimization in a short time frame.
Step-By-Step Optimization With Excel Solver is a 200+ page .pdf e-manual of simple yet thorough explanations on how to use the Excel Solver to solve today's most widely known optimization problems.
Loaded with screen shots that are coupled with easy-to-follow instructions, this .pdf e-manual will simplify
many difficult optimization problems and make you a master of the Excel Solver almost immediately.
The author of Step-By-Step Optimization With Excel Solver, Mark Harmon, was the Internet marketing manager for several years at the company that created the Excel Solver and that continues to develop it for Microsoft Excel today. He shares his deep knowledge of and experience with optimization using the Excel Solver in this book.
Here are just some of the Solver optimization problems that are solved completely with simple-to-understand instructions and screen shots in this book:
● The famous “Traveling Salesman” problem using Solver’s Alldifferent constraint and the Solver’s
Evolutionary method to find the shortest path to reach all customers. This also provides an advanced use
of the Excel INDEX function.

● The well-known “Knapsack Problem,” which shows how to optimize the use of limited space while satisfying numerous other criteria.
● How to perform nonlinear regression and curve-fitting on the Solver using the Solver’s GRG Nonlinear
solving method
● How to solve the “Cutting Stock Problem” faced by many manufacturing companies who are trying to
determine the optimal way to cut sheets of material to minimize waste while satisfying customer orders.
● Portfolio optimization to maximize return or minimize risk.
● Venture capital investment selection using the Solver’s Binary constraint to maximize Net Present Value
of selected cash flows at year 0. Clever use of the If-Then-Else statements makes this a simple problem.
● How to use Solver to minimize the total cost of purchasing and shipping goods from multiple suppliers to multiple locations.
● How to optimize the selection of different production machines to minimize cost while fulfilling an order.
● How to optimally allocate a marketing budget to generate the greatest reach and frequency or number
of inbound leads at the lowest cost.

Step-By-Step Optimization With Excel Solver has complete instructions and numerous tips on every aspect of operating the Excel Solver. You’ll fully understand the reports and know exactly how to tweak all of the Solver’s settings for total custom use. The book also provides lots of inside advice and guidance on setting up the model in Excel so that it will be as simple and intuitive as possible to work with. All of the optimization problems in this book are solved step-by-step using a 6-step process that works every time.
In addition to detailed screen shots and easy-to-follow explanations on how to solve every optimization
problem in this e-manual, a link is provided to download an Excel workbook that has all problems
completed exactly as they are in this e-manual.
Step-By-Step Optimization With Excel Solver is exactly the e-manual you need if you want to be
optimizing at an advanced level with the Excel Solver quickly.

Reader Testimonials

"Step-By-Step Optimization With Excel Solver is the "Missing Manual" for the Excel Solver. It is pretty
difficult to find good documentation anywhere on solving optimization problems with the Excel Solver.
This book came through like a champ!
Optimization with the Solver is definitely not intuitive, but this book is. I found it very easy to work through
every single one of the examples. The screen shots are clear and the steps are presented logically. The
downloadable Excel spreadsheet with all examples completed was quite helpful as well.
Once again, it's really amazing how little understandable documentation there is on doing real-life optimization problems with Solver.
For example, just try to find anything anywhere about the well-known Traveling Salesman Problem (a
salesman needs to find the shortest route to visit all customers once).
It is a tricky problem for sure, but this book showed a quick and easy way to get it done. I'm not sure I would have ever figured that problem out, or some of the other problems in the book, without this manual.
I can say that this is the book for anyone who wants or needs to get up to speed on an advanced level
quickly with the Excel Solver. It appears that every single aspect of using the Solver is covered thoroughly and yet simply. The author presents a lot of tricks in how to set the correct Solver settings to
get it to do exactly what you want.
The book flows logically. It's an easy read. Step-By-Step Optimization With Excel Solver got me up to speed on the Solver quickly and without too much mental strain at all. I can definitely recommend this book."
Pam Copus
Sonic Media Inc

“As a graduate student in the Graduate Program in International Studies (GPIS) at Old Dominion University, I'm required to have a thorough knowledge of Excel in order to use it as a tool for interpreting data, conducting research and analysis.
I've always found the Excel Solver to be one of the more difficult Excel tools to totally master. Not any more. This book was so clearly written that I was able to do almost every one of the advanced optimization examples in the book as soon as I read through it once.
I can tell that the author really made an effort to make this manual as intuitive as possible. The screen
shots were totally clear and logically presented.
Some of the examples that were very advanced, such as the venture capital investment example, had
screen shot after screen shot to ensure clarity of the difficult Excel spreadsheet and Solver dialogue
boxes.
It definitely was "Step-By-Step" just like the title says. I must say that I did have to cheat a little bit and look at the Excel spreadsheet, downloadable with the book, that contains all of the book's examples. The spreadsheet was also a great help.
Step-By-Step Optimization With Excel Solver is not only totally easy to understand and follow, but it is
also very complete. I feel like I'm a master of the Solver. I have purchased a couple of other books in the
Excel Master Series (the Excel Statistical Master and the Advanced Regression in Excel book) and they
have all been excellent.
I am lucky to have come across this book because the graduate program that I am in has a number of
optimization assignments using the Solver. Thanks Mark for such an easy-to-follow and complete book
on using the Solver. It really saved me a lot of time in figuring this stuff out."
Federico Catapano
Graduate Student
International Studies Major
Old Dominion University
Norfolk, Virginia

"I'm finished with school (Financial Economics major) and currently work for a fortune 400 company as a
business analyst. I find that the statistics and optimization manuals are indispensable reference tools
throughout the day.

I keep both eManuals loaded on my iPad at all times just in case I have to recall a concept I don't use all the time. It's easier to recall the concepts from the eManuals rather than trying to sift through the convoluted banter in a text book, and for that I applaud the author!
In a business world where I need on-demand answers now, this optimization eManual is the perfect tool. I just recently used the bond investment optimization problem to build a model in Excel and help my VP understand that a certain process we're doing wasn't maximizing our resources.
That's the great thing about this manual: you can use any practice problem (with a little outside thinking) to mold it into your own real-life problem and come up with answers that matter in the workplace!"
Sean Ralston
Sr. Financial Analyst
Enogex LLC
Oklahoma City, Oklahoma

"Excel Solver is a tool that most folks never use. I was one of those people. I was working on a project,
and was told that solver might be helpful. I did some research online, and was more confused than ever. I
started looking for a book that might help me. I got this book, and was not sure what to expect.
It surpassed my expectations! The book explains the concepts behind the solver, the best way to set up
the "problem", and how to use the tool effectively. It also gives many examples including the files. The
files are stored online, and you can download them so you can see everything in excel.
The author does a fantastic job on this book. While I'm not a solver "expert", I am definitely much smarter
about it than I was before. Trust me, if you need to understand the solver tool, this book will get you
there."
Scott Kinsey
Missouri

“The author, Mark, has a writing style that is simple, understandable, and easy to follow, with clear examples. This book is no exception.
Mark explains how Solver works, the different types of solutions that can be obtained and when to use one or another, and explains the content and meaning of the reports available. Then he presents several examples, defining each problem, setting it up in Excel and in Solver, and interpreting the solution.
It is a really good book that teaches you how to apply solver (linear programming) to a problem.”
Luis R. Heimpel
El Paso, Texas

Click Here To Download This 200+ Page Excel Solver Optimization Manual Right Now for $19.95

http://37.solvermark.pay.clickbank.net/

Meet the Author

Mark Harmon is a university statistics instructor and statistical/optimization consultant. He was the
Internet marketing manager for several years for the company that created the Excel Solver and currently
develops that add-in for Excel. He has made contributions to the development of Excel over a long period of time, dating all the way back to 1992, when he was one of the beta users of Excel 4, creating the sales force deployment plan for the introduction of the anti-depressant drug Paxil into the North American market.
Mark Harmon is a natural teacher. As an adjunct professor, he spent five years teaching more than thirty
semester-long courses in marketing and finance at the Anglo-American College in Prague, Czech
Republic and the International University in Vienna, Austria. During that five-year time period, he also
worked as an independent marketing consultant in the Czech Republic and performed long-term
assignments for more than one hundred clients. His years of teaching and consulting have honed his
ability to present difficult subject matter in an easy-to-understand way.
This manual got its start when Mark Harmon began conducting statistical analysis to increase the
effectiveness of various types of Internet marketing that he was performing during the first decade of the
2000s. Mark initially formulated the practical, statistical guidelines for his own use but eventually realized
that others would also greatly benefit from this step-by-step collection of statistical instructions that really did
not seem to be available elsewhere. Over the course of a number of years and several editions, this
instruction manual blossomed into the Excel Master Series, the graduate-level, step-by-step, complete, practical, and clear set of guidebooks that it is today.
Mark Harmon received a degree in electrical engineering from Villanova University and an MBA in marketing from the Wharton School.
Mark is an avid fan of the beach life and can nearly always be found by a warm and sunny beach.

