

Pre-Assessment Test Introduction

Welcome to the pre-assessment test for the HBS Quantitative Methods Tutorial.

Navigation:

To advance from one question to the next, select one of the answer choices or, if applicable, complete with your own choice and

click the “Submit” button. After submitting your answer, you will not be able to change it, so make sure you are satisfied with your

selection before you submit each answer. You may also skip a question by pressing the forward advance arrow. Please note that

you can return to “skipped” questions using the “Jump to unanswered question” selection menu or the navigational arrows at any

time. Although you can skip a question, you must navigate back to it and answer it; all questions must be answered for the exam to be scored.

In the Briefcase, links to Excel spreadsheets containing z-value and t-value tables are provided for your convenience. For some

questions, additional links to Excel spreadsheets containing relevant data will appear immediately below the question text.

After completion, you can review your answers at any time by returning to the exam.

Good luck!

How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in the

course.

Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book

examination.

May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on exams

such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam question.

Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity with

the material, but you may take longer if you need to.

What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices will

be recorded for the questions you were able to complete and you will be able to pick up where you left off when you return to

the exam site.

How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question. The

results screen will indicate which questions you answered correctly.

Welcome to QM...

Welcome! You are about to embark on a journey that will introduce you to the basics of quantitative and statistical analysis.

This course will help you develop your skills and instincts in applying quantitative methods to formulate, analyze, and solve

management decision-making problems.

Click on the link labeled "The Tutorial and its Method" in the left menu to get started.

QM is designed to help you develop quantitative analysis skills in business contexts. Mastering its content will help you

evaluate management situations you will face not only in your studies but also as a manager. Click on the right arrow icon

below to advance to the next page.

This isn't a formal or comprehensive course in quantitative methods. QM won't make you a statistician, but it will help you

become a more effective manager.


The tutorial's primary emphasis is on developing good judgment in analyzing management problems. Whether you are

learning the material for the first time or are using QM to refresh your quantitative skills, you can expect the tutorial to

improve your ability to formulate, analyze, and solve managerial problems.

You won't be learning quantitative analysis in the typical textbook fashion. QM's interactive nature provides frequent

opportunities to assess your understanding of the concepts and how to apply them — all in the context of actual management

problems.

You should take 15 to 20 hours to run through the whole tutorial, depending on your familiarity with the material. QM offers

many features we hope you will explore, utilize, and enjoy.

Naturally, the most appropriate setting for a course on statistics is a tropical island...

Somehow, "internship" is not the way you'd describe your summer plans to your friends. You're flying out to Hawaii after all,

staying at a 5-star hotel as a Summer Associate with Avio Consulting.

This is a great learning opportunity, no doubt about it. To think that you had almost skipped over this summer internship, as

you prepared to enroll in a two-year MBA program this fall.

You are also excited that the firm has assigned Alice, one of its rising stars, as your mentor. It seems clear that Avio partners

consider you a high potential intern — they are willing to invest in you with the hope that you will later return after you

complete your MBA program.

Alice recently received the latest in a series of quick promotions at Avio. This is her first assignment as a project lead:

providing consulting assistance to the Kahana, an exclusive resort hotel on the Hawaiian island Kauai.

Needless to say, one of the perks of the job is the lodging. The Kahana's brochure looks inviting — luxury suites, fine cuisine,

a spa, sports activities. And above all, the pristine beach and glorious ocean.

After your successful interview with Avio, Alice had given you a quick briefing on the hotel and its manager, Leo.

Leo inherited the Kahana just three years ago. He has always been in the hospitality industry, but the sheer scope of the

luxury hotel's operations has him slightly overwhelmed. He has asked for Avio's help to bring a more rigorous approach to his

management decision-making process.

Before you start packing your beach towel, read this section to learn how to use this tutorial to your greatest advantage.

QM's structure and navigational tools are easy to master. If you're reading this text, you must have clicked on the link labeled

"Using the Tutorial" on the left.

There are three types of interactive clips: Kahana Clips, Explanatory Clips, and Exercise Clips.

Kahana Clips pose problems that arise in the context of your consulting engagement at the Kahana. Typically, one clip will

have Leo assign you and Alice a specific task. In a later Kahana Clip you will analyze the problem, and you and Alice will

present your results to Leo for his consideration. The Kahana Clips will give you exposure to the types of business problems

that benefit from the analytical methods you'll be learning, and a context for practicing the methods and interpreting their

results.

To fully benefit from the tutorial, you should solve all of Leo's problems. At the end of the tutorial, a multiple-choice

assessment exam will evaluate your understanding of the material.

In Explanatory Clips, you will learn everything needed to analyze management problems like Leo's. Complementing the

text are graphs, illustrations, and animations that will help you understand the material. Keep on your toes: you'll be asked

questions even in Explanatory Clips that you should answer to check your understanding of the concepts.

Some Explanatory Clips give you directions or tips on how to use the analytical and computational features of Microsoft

Excel. Facility with the necessary Excel functions will be critical to solving the management decision problems in this course.

QM is supplemented with spreadsheets of data relating to the examples and problems presented. When you see a Briefcase

link in a clip, we strongly encourage you to click on the link to access the data. Then, practice using the Excel functions to

reproduce the graphs and analyses that appear in the clips.

You will also see Data links that you should click to view summary data relating to the problem.

Exercise Clips provide additional opportunities for you to test your understanding of the material. They are a resource that

you can use to make sure that you have mastered the important concepts in each section.


Work through exercises to solidify your knowledge of the material. Challenge exercises provide opportunities to tackle

somewhat more advanced problems. The challenge exercises are optional; you do not need to complete them to gain the mastery needed to pass the tutorial assessment test.

The arrow buttons immediately below are used for navigation within clips. If you've made it this far, you've been using the

one on the right to move forward.

Use the one on the left if you want to back up a page or two.

In the upper right of the QM tutorial screen are three buttons. From left to right they are links to the Help, Briefcase, and

Glossary.

In your Briefcase you'll find all the data you'll need to complete the course, neatly stored as Excel workbooks. In many of

the clips there will be links to specific documents in the Briefcase, but the entire Briefcase is available at any time.

In the Glossary/Index you'll find a list of helpful definitions of terms used in the course, along with brief descriptions of the

Excel functions used in the course.

We encourage you to use all of QM's features and resources to the fullest. They are designed to help you build an intuition for

quantitative analysis that you will need as an effective and successful manager.

The day of departure has come, and you're in flight over the Pacific Ocean. Alice graciously let you take the window seat, and

you watch as the foggy West Coast recedes behind you.

I've been to Hawaii before, so I'll let you have the experience of seeing the islands from the air before you set foot on them.

This Leo sounds like quite a character. He's been in business all his life, involved in many ventures — some more successful

than others. Apparently, he once owned and managed a gourmet Spam restaurant!

Spam is really popular among the islanders. Leo tried to open a second location in downtown Honolulu for the tourists, but that

didn't do so well. He had to declare bankruptcy.

Then, just three years ago, his aunt unexpectedly left him the Kahana. Now Leo is back in business, this time with a large

operation on his hands.

It sounds to me like he's the kind of manager who usually relies on gut instincts to make business decisions, and likes to take

risks. I think he's hired Avio to help him make managerial decisions with, well, better judgment. He wants to learn how to

approach management problems in a more sophisticated, analytical fashion.

We'll be using some basic statistical tools and methods. I know you're no expert in statistics, but I'll fill you in along the way.

You'll be surprised at how quickly they'll become second nature to you. I'm confident you'll be able to do quite a bit of the

analytic work soon.

Once your plane touches down in Kauai, you quickly pick up your baggage and meet your host, Leo, outside the airport.

Inheriting the Kahana came as a big surprise. My aunt had run the Kahana for a long time, but I never considered that she

would leave it to me.

Anyway, I've been trying my best to run the Kahana the way a hotel of its quality deserves. I've had some ups and downs.

Things have been fairly smooth for the past year now, but I've realized that I have to get more serious about the way I make

decisions. That's where you come into the picture.

I used to be quite a risk-taker. I made a lot of decisions on impulse. Now, when I think of what I have to lose, I just want to

get it right.

After you arrive at the Kahana, Leo personally shows you to your rooms. "I have a table reserved for the three of us at 8 in the

main restaurant," Leo announces. "You just have to try our new chef's mango and brie tart."


After your welcome dinner in the Kahana's main restaurant, Leo asks you and Alice to meet him the next morning. You wake up

early enough to take a short walk on the beach before you make your way to Leo's office.

Good morning! I hope you found your rooms comfortable last night and are starting to recover from your trip.

Unfortunately, I don't have much time this morning. As you requested on the phone, I've assembled the most important data on

the Kahana. It wasn't easy — this hasn't been the most organized hotel in the world, especially since I took over. There's just so

much to keep track of.

Thank you, Leo. We'll have a look at your data right away, so we can get a more detailed understanding of the Kahana and the

type of data you have available for us to work with. Anything in particular that you'd like us to focus on as we peruse your files?

Yes. There are two things in particular that have been on my mind recently.

For one, we offer some recreational activities here at the Kahana, including a scuba diving certification course. I contract out

the operations to a local diving school. The contract is up soon, and I need to renew it, hire another school, or discontinue

offering scuba lessons altogether.

I'd like you to get me some quotes from other diving schools on the island so I get an idea of the competition's pricing and how

it compares to the school I've been using.

I'm also very concerned about hotel occupancy rates. As you might imagine, the Kahana's occupancy fluctuates during the year,

and I'd like to know how, when, and why. I'd love to have a better feeling for how many guests I can expect in a given month.

These files contain some information about tourism on the island, but I'd really like you to help me make better sense of it.

Somehow I feel that if I could understand the patterns in the data, I could better predict my own occupancy rates.

That's what we're here to do. We'll take a look at your files to get better acquainted with the Kahana, and then focus on diving

school prices and occupancy patterns.

Thanks, or as we say in Hawaiian, Mahalo. By the way, we're not too formal here in Hawaii. As you probably noticed, Alice,

your suite includes a room that has been set up as an office. But feel free to take your work down to the beach or by the pool

whenever you like.

Later, under a parasol at the beach, you pore over Leo's folders. Feeling a bit overwhelmed, you find yourself staring out to sea.

Alice tells you not to worry: "We have a number of strategies we can use to compile a mountain of data like this into concise and

useful information. But no matter what data you are working with, always make sure you really understand the data before

doing a lot of analysis or making managerial decisions."

What is Alice getting at when she tells you to "understand the data"? And how can you develop such an understanding?

Data can be represented by graphs like histograms. These visual displays allow you to quickly recognize patterns in the

distribution of data.

Information overload. Inventory costs. Payroll. Production volume. Asset utilization. What's a manager to do?

The data we encounter each day have valuable information buried within them. As managers, we can greatly improve the quality of the decisions we make by correctly analyzing financial, production, or marketing data.

Analyzing data can be revealing, but challenging. As managers, we want to extract as much of the relevant information and insight as possible from the data we have available.

When we acquire a set of data, we should begin by asking some important questions: Where do the data come from? How

were they collected? How can we help the data tell their story?

Suppose a friend claims to have measured the heights of everyone in a building. She reports that the average height was three

and a half feet. We might be surprised, and would first want to know who was in the building: an average that low suggests small children, not adults.

We'd also want to know if our friend used a proper measuring stick. Finally, we'd want to be sure we knew how she measured

height: with or without shoes.


Before starting any type of formal data analysis, we should try to get a preliminary sense of the data. For example, we might

first try to detect any patterns, trends, or relationships that exist in the data.

We might start by grouping the data into logical categories. Grouping data can help us identify patterns within a single

category or across different categories. But how do we do this? And is this often time-consuming process worth it?

Accountants think so. Balance Sheets and Profit and Loss Statements arrange information to make it easier to comprehend.

In addition, accountants separate costs into categories such as capital investments, labor costs, and rent. We might ask: Are

operating expenses increasing or decreasing? Do office space costs vary much from year to year?

Comparing data across different years or different categories can give us further insight. Are selling costs growing more

rapidly than sales? Which division has the highest inventory turns?

Histograms

In addition to grouping data, we often graph them to better visualize any patterns in the data. Seeing data displayed

graphically can significantly deepen our understanding of a data set and the situation it describes.

To see the value a graphical approach can add, let's look at worldwide consumption of oil and gas in 2000. What questions

might we want to answer with the energy data? Which country is the largest consumer? How much energy do most

countries use?

In order to create a graph that provides good visual insight into these questions, we might sort the countries by their level

of energy consumption, then group together countries whose consumption falls in the same range — e.g., the countries that

use 100 to 199 million tonnes per year, or 200 to 299 million tonnes.

We can find the number of countries in each range, and then create a bar graph in which the height of each bar represents

the number of countries in each range. This graph is called a histogram.

A histogram shows us where the data tend to cluster. What are the most common values? The least common? For example,

we see that most countries consume less than 100 million tonnes per year, and the vast majority less than 200 million

tonnes. Only three countries, Japan, Russia, and the US, consume more than 300 million tonnes per year.

Why are there so many countries in the first range — the lowest consumption? What factors might influence this?

Population might be our first guess.

Yet despite a large population, India's energy consumption is significantly less than that of Germany, a much smaller

nation. Why might this be? Clearly other factors, like climate and the extent of industrialization, influence a country's

energy usage.

Outliers

In many data sets, there are occasional values that fall far from the rest of the data. For example, if we graph the age

distribution of students in a college course, we might see a data point at 75 years. Data points like this one that fall far from

the rest of the data are known as outliers. How do we interpret them?

First, we must investigate why an outlier exists. Is it just an unusual, but valid value? Could it be a data entry error? Was it

collected in a different way than the rest of the data? At a different time?

We might discover that the data point refers to a 75-year-old retiree, taking the course for fun.

After making an effort to understand where an outlier comes from, we should have a deeper understanding of the situation

the data represent. Then, we can think about how to handle the outlier in our analysis. Typically, we do one of three things:

leave the outlier alone, or — very rarely — remove it or change it to a corrected value.

A senior citizen in a college class may be an outlier, but his age represents a legitimate value in the data set. If we truly want

to understand the age distribution of all students in the class, we would leave the point in.

Or, if we now realize that what we really want is the age distribution of students in the course who are also enrolled in full-time degree-granting programs, we would exclude the senior citizen and all other non-degree program students enrolled in the course.

Occasionally, we might change the value of an outlier. This should be done only after examining the underlying situation in

great detail.


For example, if we look at the inventory graph below, a data point showing 80 pairs of roller-blades in inventory would be

highly unusual.

Notice that the data point "80" was recorded on April 13th, and that the inventory was 10 pairs on April 12th, and 6 on

April 14th.

Based on our management understanding of how inventory levels rise and fall, we realize that the value of 80 is

extraordinarily unlikely. We conclude that the data point was likely a data entry error. Further investigation of sales and

purchasing records reveals that the actual inventory level on that day was 8, not 80. Having found a reliable value, we

correct the data point.

Excluding or changing data is not something we do often. We should never do it to help the data 'fit' a conclusion we want

to draw. Such changes to a data set should be made on a case-by-case basis only after careful investigation of the situation.

Summary

With any data set we encounter, we must find ways to allow the data to tell their story. Ordering and graphing data sets

often expose patterns and trends, thus helping us to learn more about the data and the underlying situation. If data can

provide insight into a situation, they can help us to make the right decisions.

Creating Histograms

Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to create histograms using the Histogram tool. However, we suggest you read through the instructions to learn how Excel creates histograms so you can construct them in the future when you do have access to the Data Analysis ToolPak.

To check if the ToolPak is installed on your computer, go to the Data tab in the Toolbar in Excel 2007. If "Data Analysis" appears in the Ribbon, the ToolPak has already been installed. If not, click the Office Button in the top left and select "Excel Options." Choose "Add-Ins," highlight the "Analysis ToolPak" in the list, and click "Go." Check the box next to Analysis ToolPak and click "OK." Excel will then walk you through a setup process to install the ToolPak.

Creating a histogram with Excel involves two steps: preparing our data, and processing them with the Data Analysis

Histogram tool.

To prepare the data, we enter or copy the values into a single column in an Excel worksheet.

Often, we have specific ranges in mind for classifying the data. We can enter these ranges, which Excel calls "bins," into a

second column of data.

In the Toolbar, select the Data tab, and then choose Data Analysis.

In the Data Analysis pop-up window, choose Histogram and click OK.

Click on the Input Range field and enter the range of data values by either typing the range or by dragging the cursor over

the range.

Next, to use the bins we specified, click on the Bin Range field and enter the appropriate range. Note: if we don't specify

our own bins, Excel will create its own bins, which are often quite peculiar.

Click the Chart Output checkbox to indicate that we want a histogram chart to be generated in addition to the summary

table, which is created by default.

Click New Worksheet Ply, and enter the name you would like to give the output sheet.

Finally, click OK, and the histogram with the summary table will be created in a new sheet.
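
To make these steps concrete, here is a minimal sketch using hypothetical cell locations (the actual layout depends on your workbook). Suppose the energy consumption values sit in cells A2:A64 and the bin boundaries (99, 199, 299, and so on) in cells B2:B8. The Histogram dialog would then be filled in as:

Input Range: $A$2:$A$64
Bin Range: $B$2:$B$8
Chart Output: checked
New Worksheet Ply: Histogram

Excel counts the values that fall at or below each bin boundary (and above the previous boundary) and draws one bar per bin.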

Graphs are very useful for gaining insight into data. However, sometimes we would like to summarize the data in a concise

way with a single number.

The Mean

Often, we'd like to summarize a set of data with a single number. We'd like that summary value to describe the data as well

as possible. But how do we do this? Which single value best represents an entire set of data? That depends on the data

we're investigating and the type of questions we'd like the data to answer.

What number would best describe employee satisfaction data collected from annual review questionnaires? The numerical

average would probably work quite well as a single value representing employees' experiences.


To calculate average — or mean — employee satisfaction, we take all the scores, sum them up, and divide the result by 11,

the number of surveys. The Greek letter mu represents the mean of the data set.
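
In symbols, for a data set with n points x₁, x₂, …, xₙ, the mean is:

μ = (x₁ + x₂ + … + xₙ) / n

For the satisfaction data, μ is the sum of the 11 scores divided by 11.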

The mean is by far the most common measure used to describe the "center" or "central tendency" of a data set. However,

it isn't always the best value to represent data. Outliers can exercise undue influence and pull the mean value towards one

extreme.

In addition, if the distribution has a tail that extends out to one side — a skewed distribution — the values on that side will

pull the mean towards them. Here, the distribution is strongly skewed to the right: the high value of US consumption pulls

the mean to a value higher than the consumption of most other countries. What other numbers can we use to find the

central tendency of the data?

The Median

Let's look at the revenues of the top 100 companies in the US. The mean revenue of these companies is about $42 billion.

How should we interpret this number? How well does this average represent the revenues of these companies?

When we examine the revenue distribution graphically, we see that most companies bring in less than $42 billion of

revenue a year. If this is true, why is the mean so high?

As our intuition might tell us, the top companies have revenues that are much higher than $42 billion. These higher

revenues pull up the average considerably.

In cases like income, where the data are typically very skewed, the mean often isn't the best value to represent the data. In

these cases, we can use another central value called the median.

The median is the middle value of a data set whose values are arranged in numerical order. Half the values are higher than

the median, and half are lower.

For the revenue data, the median revenue of the top 100 US companies is $30 billion, significantly less than $42 billion. Half of all the companies earn less than $30 billion, and half earn more than $30 billion.

Median revenue is a more informative revenue estimate because it is not pulled upwards by a small number of high-revenue earners. How can we find the median?

With an odd number of data points, listed in order, the median is simply the middle value. For example, consider this set

of 7 data points. The median is the 4th data point, $32.51.

In a data set with an even number of points, we average the two middle values — here, the fourth and fifth values — and

obtain a median of $41.92.

When deciding whether to use a mean or median to represent the central tendency of our data, we should weigh the pros

and cons of each. The mean weighs the value of every data point, but is sometimes biased by outliers or by a highly skewed

distribution.

By contrast, the median is not biased by outliers and is often a better value to represent skewed data.

The Mode

A third statistic to represent the "center" of a data set is its mode: the data set's most frequently occurring value. We might

use the mode to represent data when knowing the average value isn't as important as knowing the most common value.

In some cases, data may cluster around two or more points that occur especially frequently, giving the histogram more

than one peak. A distribution that has two peaks is called a bimodal distribution.

Summary


To summarize a data set using a single value, we can choose one of three values: the mean, the median, or the mode. They are often called summary statistics or descriptive statistics. All three give a sense of the "center" or "central tendency" of the data set, but we need to understand how they differ before using them.

To find the mean of a data set entered in Excel, we use the AVERAGE function.

We can find the mean of numerical values by entering the values in the AVERAGE function, separated by commas.

In most cases, it's easier to calculate a mean for a data set by indicating the range of cell references where the data are

located.

Excel ignores blank values in cells, but not zeros. Therefore, we must be careful not to put a zero in the data set if it does

not represent an actual data point.

Excel can find the median, even if a data set is unordered, using the MEDIAN function.

The easiest way to calculate a data set's median is to select a range of cell references.

Excel can also find the most common value of a data set, the mode, using the MODE function.

If more than one mode exists in a data set, Excel will find the one that occurs first in the data.
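
As a quick sketch, assuming the satisfaction scores were entered in cells A2 through A12 (a hypothetical layout), the three statistics would be computed as:

=AVERAGE(A2:A12)
=MEDIAN(A2:A12)
=MODE(A2:A12)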

Mean, median, and mode are fairly intuitive concepts. Already, Leo's mountain of data seems less intimidating.

Variability

The mean, median, and mode give you a sense of the center of the data, but none of these indicate how far the data are spread

around the center. "Two sets of data could have the same mean and median, and yet be distributed completely differently

around the center value," Alice tells you. "We need a way to measure variation in the data."

It's often critical to have a sense of how much data vary. Do the data cluster close to the center, or are the values widely

dispersed?

Let's look at an example. To identify good target markets, a car dealership might look at several communities and find the

average income of each. Two communities — Silverhaven and Brighton — have average household incomes of $95,500 and

$97,800. If the dealer wants to target households with incomes above $90,000, he should focus on Brighton, right?

We need to be more careful: the mean income doesn't tell the whole story. Are most of the incomes near the mean, or is there

a wide range around the average income? A market might be less attractive if fewer households have an income above the

dealer's target level. Based on average income alone, Brighton might look more attractive, but let's take a closer look at the

data.

Despite having a lower average income, incomes in Silverhaven have less variability, and more households are in the dealer's

target income range. Without understanding the variability in the data, the dealer might have chosen Brighton, which has

fewer targeted homes.

Clearly it would be helpful to have a simple way to communicate the level of variability in the household incomes in two

communities.

Just as we have summary statistics like the mean, median, and mode to give us a sense of the 'central tendency' of a data set,

we need a summary statistic that captures the level of dispersion in a set of data.

The standard deviation is a common measure for describing how much variability there is in a set of data. We represent the standard deviation with the Greek letter sigma, σ.

The standard deviation emerges from a formula that looks a bit complicated initially, so let's try to understand it at a

conceptual level first. Then we'll build up step by step to help understand where the formula comes from.

The standard deviation tells us how far the data are spread out. A large standard deviation indicates that the data are widely

dispersed. A smaller standard deviation tells us that the data points are more tightly clustered together.


Calculating

A hotel manager has to staff the front reception desk in her lobby. She initially focuses on a staffing plan for Saturdays,

typically a heavy traffic day. In the hospitality industry, like many service industries, proper staffing can make the

difference between unhappy guests and satisfied customers who want to return.

On the other hand, overstaffing is a costly mistake. Knowing the average number of customer requests for services during a

shift gives the manager an initial sense of her staffing needs; knowing the standard deviation gives her invaluable

additional information about how those requests might vary across different days.

The average number of customer requests is 172, but this doesn't tell us there are 172 requests every Saturday. To staff

properly, the hotel manager needs a sense of whether the number of requests will typically be between 150 and 195, for

example, or between 120 and 220.

To calculate the standard deviation for data — in this case the hotel traffic — we perform two steps. The first is to calculate

a summary statistic called the variance.

Each Saturday's number of requests lies a certain distance from 172, the mean number of requests. To find the variance, we

first sum the squares of these differences. Why square the differences?

A hotel manager would want information about the magnitude of each difference, which can be positive, negative, or zero.

If we simply summed the differences between each Saturday's requests and the mean, positive and negative differences

would cancel each other out.

But we are interested in the magnitude of the differences, regardless of their sign. By squaring the differences, we get only

positive numbers that do not cancel each other out in a sum.

The formula for variance adds up the squared differences and divides by n-1 to get a type of "average" squared difference as

a measure of variability. (The reason we divide by n-1 to get an average here is a technicality beyond the scope of this

course.) The variance in the hotel's front desk requests is 637.2. Can we use this number to express the variability of the

data?

Sure, but variances don't come out in the most convenient form. Because we square the differences, we end up with a value

in 'squared' requests. What is a request-squared? Or a dollar-squared, if we were solving a problem involving money?

We would like a way to express variability that is in the same units as the original data — front-desk requests, for example.

The standard deviation — the first formula we saw — accomplishes this.

The standard deviation is simply the square root of the variance. It returns our measure to our original units. The standard

deviation for the hotel's Saturday desk traffic is 25.2 requests.
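
Putting the two steps together in symbols, with μ denoting the mean number of requests and n the number of Saturdays observed:

variance = Σ(xᵢ − μ)² / (n − 1) = 637.2

σ = √637.2 ≈ 25.2 requests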

Interpreting

What does a standard deviation of 25.2 requests tell us? Suppose the standard deviation had been 50 requests.

With a larger standard deviation, the data would be spread farther from the mean. A higher standard deviation would

translate into more difficult staffing: when request traffic is unusually high, disgruntled customers wait in long lines; when

traffic is very low, desk staff are idle.

For a data set, a smaller standard deviation indicates that more data points are near the mean, and that the mean is more

representative of the data. The lower the standard deviation, the more stable the traffic, thereby reducing both customer

dissatisfaction and staff idle time.

Fortunately, we almost never have to calculate a standard deviation by hand. Spreadsheet tools like Excel make it easy for

us to calculate variance and standard deviation.

Summary

The standard deviation measures how much data vary about their mean value.

Finding in Excel

Excel's STDEV function calculates the standard deviation.

To find the standard deviation, we can enter data values into the STDEV formula, one by one, separated by commas.

In most cases, however, it's much easier to select a range of cell references to calculate a standard deviation.

To calculate variance, we can use Excel's VAR function in the same way.
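
For instance, if the Saturday request counts were stored in cells B2:B53 (a hypothetical layout, one row per Saturday), we could compute:

=VAR(B2:B53) for the variance
=STDEV(B2:B53) for the standard deviation

Because the standard deviation is the square root of the variance, =STDEV(B2:B53) returns the same value as =SQRT(VAR(B2:B53)).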


The standard deviation measures how much a data set varies from its mean. But the standard deviation only tells you so

much. How can you compare the variability in different data sets?

A standard deviation describes how much the data in a single data set vary. How can we compare the variability of two data

sets? Do we just compare their standard deviations? If one standard deviation is larger, can we say that data set is "more

variable"?

Standard deviations must be considered within the data's context. The standard deviations for two stock indices below —

The Street.Com (TSC) Internet Index and the Pacific Exchange Technology (PET) Index — were roughly equivalent over a

period. But were the two indices equally variable?

If the average price of an index is $200, a $20 standard deviation is relatively high (10% of the average); if the average is

$700, $20 is relatively low (not quite 3% of the average). To gauge volatility, we'd certainly want to know that PET's average index price was over three and a half times higher than TSC's average index price.

To get a sense of the relative magnitude of the variation in a data set, we want to compare the standard deviation of the data

to the data's mean.

We can translate this concept of relative volatility into a standardized measure called the coefficient of variation, which is

simply the ratio of the standard deviation to the mean. It can be interpreted as the standard deviation expressed as a percent

of the mean.
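
Applied to the index example above: a $20 standard deviation on a $200 average price gives a coefficient of variation of 20/200 = 0.10, or 10%, while the same $20 on a $700 average gives 20/700 ≈ 0.03, or about 3%.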

To get a feeling for the coefficient of variation, let's compare a few data sets. Which set has the highest relative variation?

Click the answer you select.

Because the coefficient of variation has no units, we can use it to compare different kinds of data sets and find out which data

set is most variable in this relative sense.

The coefficient of variation describes the standard deviation as a fraction of the mean, giving you a standard measure of

variability.

Summary

The coefficient of variation expresses the standard deviation as a fraction of the mean. We can use it to compare variation

in different data sets of different scales or units.

After a good night's sleep, you meet Alice for breakfast.

"It's time to get started on Leo's assignments. Could you get those price quotes from diving schools and prepare a presentation

for Leo? We'll want to present our findings as neatly and concisely as possible. Use graphs and summary statistics wherever

appropriate. Meanwhile, I'll start working on Leo's hotel occupancy problem."

In addition to the school Leo is currently using, you find 20 other scuba services in the phone book. You call those 20 and get

price quotes on how much they would charge the Kahana per guest for a Scuba Certification Course.

You create a histogram of the prices. Use the bin ranges provided in the data spreadsheet, or experiment with your own bins.

If you do not have the Excel Analysis ToolPak installed, click on the Briefcase link labeled "Histogram" to see the finished histogram.

This distribution is skewed to the right, since a tail of higher prices extends to the right side of the histogram. The shape of

the distribution suggests that:


You calculate the key summary statistics. The correct values are (Mean, Median, Standard Deviation):

Your report looks good. This graphic is very helpful. At the moment, I'm paying $330 per guest, which is about average for

the island. Clearly, I could get a cheaper deal — only 6 schools would charge a higher rate. On the other hand, maybe these

more expensive schools offer a better diving experience? I wonder how satisfied my guests have been with the course offered

by my current contractor...

After a company completes its initial public offering, how is the ownership of common stock distributed among individuals in the firm, often termed "named insiders"?

Let's examine a company, VA Linux, that chose to sell its stock in an Initial Public Offering (IPO) during the IPO craze in

the late 1990s.

According to its prospectus, after the IPO, VA Linux would have the following distribution of outstanding shares of

common stock owned by insiders:

From the VA Linux common stock data, what could we learn by creating a histogram? (Choose the best answer)

Here is a histogram graphing annual turnover rates at a consulting firm.

The J. B. Honidew Corporation offers a prestigious summer internship to first-year students at a local business school. The

human resources department of Honidew wants to publish a brochure to advertise the position.

To attract a suitable pool of applicants, the brochure should give an indication of Honidew's high academic expectations.

The human resources manager calculates the mean GPA of the previous 8 interns, to include in the brochure.

In 1997, J. B. Honidew's grandson's girlfriend was awarded the internship, even though her GPA was only 3.35. In the

presence of outliers or a strongly skewed data set, the median is often a better measure of the 'center'. What's the median

GPA in this data set?

Safety equipment typically needs to fall within very precise specifications. Such specifications apply, for example, to scuba

equipment using a device called a "rebreather" to recycle oxygen from exhaled air.

Recycled air must be enriched with the right amount of oxygen from the tank before delivery to the diver. With too little

oxygen, the diver can become disoriented; too much, and the diver can experience oxygen poisoning. Minimizing the

deviation of oxygen concentration levels from the specified level is clearly a matter of life and death!

A scuba equipment-testing lab compared the oxygen concentrations of two different brands of rebreathers, A and B.

Examine the data. Without doing any calculations, for which of the two rebreathers does the oxygen concentration appear

to have a lower standard deviation?

Notice that data set A's extreme values are closer to the center, and more of its data points cluster near the middle of the range. Even without calculations, we have a good knack for seeing which set is more variable.


We can back up our observations: using the standard deviation formula or the STDEV function in Excel, we can calculate that the standard deviation of A is 0.58%, whereas that of B is 1.05%.

After decades of government control, states across the US are deregulating energy markets. In a deregulated market,

electricity prices tend to spike in times of high demand.

This volatility is a concern. A primary benefit to consumers in a regulated market is that prices are fairly stable. To provide

a baseline measure for the volatility of prices prior to deregulation, we want to compute the standard deviation of prices

during the 1990s, when electricity prices were largely regulated.

From 1990 to 2000, the average national price in July of 500 kWh of electricity ranged between $45.02 and $50.55. What is

the standard deviation of these eleven prices?

Excel makes the job much easier, because all that's required is entering the data into cells and inputting the range of cells

into the =STDEV() function. The result is $2.02.

On the other hand, to calculate the standard deviation by hand, use the formula:

σ = √( Σ(xᵢ − μ)² / (n − 1) )

First, calculate the mean, $48.40. Then, find the difference between each data point and the mean. Calculate the sum of

these squared differences, 40.79. Divide by the number of points minus one (11 - 1 =10 in this case) to obtain 4.08. Taking

the square root of 4.08 gives us the standard deviation, $2.02.
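
In compact form, the hand calculation is:

σ = √( 40.79 / (11 − 1) ) = √4.08 ≈ $2.02

which matches the result returned by the STDEV function.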

Suppose you are a purchasing agent for a wholesale retailer, Big-Mart. Big-Mart offers several generic versions of

household items, like deodorant, to consumers at a considerable discount.

Every 18 months, Big-Mart requests bids from personal care companies to produce these generic products.

After simply choosing the lowest individual bidder for years, Big-Mart has decided to introduce a vendor "score card" that

measures multiple aspects of each vendor's performance. One of the criteria on the score card is the level of year-to-year

fluctuation in the vendor's pricing.

Compare the variability of prices from each supplier. Which company's prices vary the least from year to year in relation to

their average price, as measured by the coefficient of variation?

Summary

Pleased with your work, Alice decides to teach you more data description techniques, so you can take over a greater share of

the project.

So far, you have learned how to work with a single variable, but many managerial problems involve several factors that need to be considered simultaneously.

Two Variables

We use histograms to help us answer questions about one variable. How do we start to investigate patterns and trends with

two variables?

Let's look at two data sets: heights and weights of athletes. What can we say about the two data sets? Is there a relationship

between the two?

Our intuition tells us that height and weight should be related. How can we use the data to inform that intuition? How can we

let the data tell their story about the strength and nature of that relationship?

Because we know that each height and weight belong to a specific athlete, we first pair the two variables, with one height-weight pair for each athlete.


Plotting these data pairs on axes of height and weight — one data point for each athlete in our data set — we can see a

relationship between height and weight. This type of graph is called a "scatter diagram."

Scatter diagrams provide a visual summary of the relationship between two variables. They are extremely helpful in

recognizing patterns in a relationship. The more data points we have, the more apparent the relationship becomes.

In our scatter diagram, there's a clear general trend: taller athletes tend to be heavier.

We need to be careful not to draw conclusions about causality when we see these types of relationships.

Growing taller might make us a bit heavier, but height certainly doesn't tell the whole story about our weights.

Assuming causality in the other direction would be just plain wrong. Although we may wish otherwise, growing heavier

certainly doesn't make us taller!

The direction and extent of causality might be easy to understand with the height and weight example, but in business

situations, these issues can be quite subtle.

Managers who use data to make decisions without firm understanding of the underlying situation often make blunders that

in hindsight can appear as ludicrous as assuming that gaining weight can make us taller.

Why don't we try graphing another pair of data sets to see if we can identify a relationship? On a scatter diagram, we plot for

each day the number of massages purchased at a spa resort versus the total number of guests visiting the resort.

We can see a relationship between the number of guests and the number of massages. The more guests that stay at the resort, the more massages purchased — up to a point, after which the number of massages levels off.

Why does the number of massages reach a plateau? We should investigate further. Perhaps there are limited numbers of

massage rooms at the spa. Scatter plots can give us insights that prompt us to ask good questions, those that deepen our

understanding of the underlying context from which the data are drawn.

Sometimes, we are not as interested in the relationship between two variables as we are in the behavior of a single variable

over time. In such cases, we can consider time as our second variable.

Suppose we are planning the purchase of a large amount of high-speed computer memory from an electronics distributor.

Experience tells us these components have high price volatility. Should we make the purchase now? Or wait?

Assuming we have price data collected over time, we can plot a scatter diagram for memory price, in the same way we

plotted height and weight. Because time is one of the variables, we call this graph a time series.

Time series are extremely useful because they put data points in temporal order and show how data change over time.

Have prices been steadily declining or rising? Or have prices been erratic over time? Are there seasonal patterns, with

prices in some months consistently higher than in others?

Time series will help us recognize seasonal patterns and yearly trends. But we must be careful: we shouldn't rely only on

visual analysis when looking for relationships and patterns.

False Relationships

Our intuition tells us that pairs of variables with a strong relationship on a scatter plot must be related to each other. But human intuition isn't foolproof: we often infer relationships where none exist. We must be careful to avoid some common pitfalls.

Let's look at an example. For US presidents of the last 150 years, there seems to be a connection between being elected in a

year that is a multiple of 20 (1900, 1920, 1940, etc.) and dying in office. Abraham Lincoln (elected in 1860) was the first

victim of this unfortunate relationship.

James Garfield (elected 1880), William McKinley (1900), Warren Harding (1920), Franklin Roosevelt (1940), and John F. Kennedy (1960) all died in office.

Ronald Reagan (elected 1980) only narrowly survived an assassination attempt. What do the data suggest about the

president elected in 2020?

Probably nothing. Unless we have a reasonable theory about the connection between the two variables, the relationship is

no more than an interesting coincidence.


Hidden Variables

Even when two data sets seem to be directly related, we may need to investigate further to understand the reason for the

relationship.

We may find that the reason is not due to any fundamental connection between the two variables themselves, but that they

are instead mutually related to another underlying factor.

Suppose we're examining sales of ice-hockey pucks and baseballs at a sporting goods store.

The sales of the two products form a relationship on a scatter plot: when puck sales slump, baseball sales jump. But are the

two data sets actually related? If so, why?

A third, hidden factor probably drives both data sets: the season. In winter, people play ice hockey. In spring and summer,

people play baseball.

If we had simply plotted puck and baseball sales without thinking further, we might not have considered the time of year at

all. We could have neglected a critical variable driving the sales of both products.

In many business contexts, hidden variables can complicate the investigation of a relationship between almost any two

variables.

A final point: Keep in mind that scatter plots don't prove anything about causality. They never prove that one variable

causes the other, but simply illustrate how the data behave.

Summary

Plotting two variables helps us see relationships between two data sets. But even when relationships exist, we still need to

be skeptical: is the relationship plausible? An apparent relationship between two variables may simply be coincidental, or

may stem from a relationship each variable has with a third, often hidden variable.

To create a scatter diagram in Excel with two data sets, we need to first prepare the data, and then use Excel's built-in chart tools to plot the data.

To prepare our data, we need to be sure that each data point in the first set is aligned with its corresponding value in the

other set. The sets don't need to be contiguous, but it's easier if the data are aligned side by side in two columns.

If the data sets are next to each other, simply select both sets.

Next, from the Insert tab in the toolbar, select Scatter in the Charts bin from the Ribbon, and choose the first type:

Scatter with Only Markers.

Excel will insert a plain, unformatted scatter plot into the worksheet, with the first column of data represented on the X-axis and the second column of data on the Y-axis.

We can include a chart title and label the axes by selecting Quick Layout from the Ribbon and choosing Layout 1.

Then we can add the chart title and label the axes by selecting and editing the text.

Finally, our scatter diagram is complete. You can explore more of Excel's new Chart Tools to edit and design elements of

your chart.
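
For example, with a hypothetical layout in which the athletes' heights occupy cells A2:A21 and their corresponding weights occupy cells B2:B21, selecting the range A2:B21 and inserting a Scatter with Only Markers chart plots one point per athlete, with height on the X-axis and weight on the Y-axis.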

Correlation

By plotting two variables on a scatter plot, we can examine their relationship. But can we measure the strength of that

relationship? Can we describe the relationship in a standardized way?

Humans have an uncanny ability to discern patterns in visual displays of data. We "know" when the relationship between two

variables looks strong ...

... positive (when one variable increases, the other tends to increase) ...


... or negative (when one variable increases, the other tends to decrease).

Suppose we are trying to discern if there is a linear relationship between two variables. Intuitively, we notice when data

points are close to an imaginary line running through a scatter plot.

Logically, the closer the data points are to that line, the more confidently we can say there is a linear relationship between the

two variables.

However, it is useful to have a simple measure to quantify and communicate to others what we so readily perceive visually.

The correlation coefficient is such a measure: it quantifies the extent to which there is a linear relationship between two

variables.

To describe the strength of a linear relationship, the correlation coefficient takes on values between -1 and +1. Here's a strong

positive correlation (about 0.85) ...

If every point falls exactly on a line with a negative slope, the correlation coefficient is exactly -1.

At the extremes of the correlation coefficient, we see relationships that are perfectly linear, but what happens in the middle?

Even when the correlation coefficient is 0, a relationship might exist — just not a linear relationship. As we've seen, scatter

plots can reveal patterns and help us better understand the business context the data describe.

To reinforce our understanding of how our intuition about the strength of a linear relationship between variables translates

into a correlation coefficient, let's revisit the examples we analyzed visually earlier.

Influence of Outliers

In some cases, the correlation coefficient may not tell the whole story. Managers want to understand the attendance

patterns of their employees. For example, do workers' absence rates vary by time of year?

Suppose a manager suspects that his employees skip work to enjoy the good life more often as the temperature rises. After

pairing absences with daily temperature data, he finds the correlation coefficient to be 0.466.

While not a strong linear relationship, a coefficient of 0.466 does indicate a positive relationship — suggesting that the

weather might indeed be the culprit.

But look at the data — besides a few outliers, there isn't a clear relationship. Seeing the scatter plot, the manager might

realize that the three outliers correspond to a late-summer, three-day transportation strike that kept some workers

homebound the previous year.

Without looking at the data, the correlation coefficient can lead us down false paths. If we exclude the outliers, the

relationship disappears, and the correlation essentially drops to zero, quieting any suspicion of weather. Why do the

outliers influence our measure of linearity so much?

As a summary statistic for the data, the correlation coefficient is calculated numerically, incorporating the value of every

data point. Just as it does with the mean, this inclusiveness can get us into trouble...

Because measures like correlation give more weight to points distant from the center of the data, outliers can strongly

influence the correlation coefficient of the entire set. In these situations, our intuition and the measure we use to quantify

our intuition can be quite different. We should always attempt to reconcile those differences by returning to the data.

Summary

The correlation coefficient characterizes the strength and direction of a linear relationship between two data sets. The value

of the correlation coefficient ranges between -1 and +1.
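
Although the tutorial does not spell out the calculation, the standard formula behind this measure (the Pearson correlation coefficient) is:

r = Σ(xᵢ − μₓ)(yᵢ − μᵧ) / ( (n − 1) · σₓ · σᵧ )

where μₓ and μᵧ are the means and σₓ and σᵧ the standard deviations of the two data sets. Each point contributes the product of its distances from the two means, which is why points far from the center, such as outliers, influence the coefficient so strongly.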

Finding in Excel

Excel's CORREL function calculates the correlation coefficient for two variables. Let's return to our data on athletes' height

and weight.

Enter the data set into the spreadsheet as two paired columns. We must make sure that each data point in the first set is

aligned with its corresponding value in the other set.

To compute the correlation, simply enter the two variables' ranges, separated by a comma, into the CORREL function as

shown below.


The order in which the two data sets are selected does not matter, as long as the data "pairs" are maintained. With height

and weight, both values certainly need to refer to the same person!
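
If you are working outside of Excel, NumPy's corrcoef plays the same role as CORREL. A minimal sketch, using made-up height and weight pairs rather than the actual athlete file:

    import numpy as np

    # Hypothetical paired observations standing in for the athlete file.
    height = np.array([70, 72, 68, 75, 71, 69, 74, 73])          # inches
    weight = np.array([170, 185, 160, 210, 175, 165, 200, 190])  # pounds

    # np.corrcoef is the counterpart of CORREL; swapping the arguments
    # changes nothing, as long as each pair stays aligned.
    print(round(np.corrcoef(height, weight)[0, 1], 2))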

Alice is eager to move forward: "With your new understanding of scatter diagrams and correlation, you'll be able to help me

with Leo's hotel occupancy problem."

In the hotel industry, one of the most important management performance measures is room occupancy rate, the percentage

of available rooms occupied by guests.

Alice suggests that the monthly occupancy rate might be related to the number of visitors arriving on the island each month.

On a geographically isolated location like Hawaii, visitors almost all arrive by airplane or cruise ship, so state agencies can

gather very precise data on arrivals.

Alice asks you to investigate the relationship between room occupancy rates and the influx of visitors, as measured by the

average number of visitors arriving to Kauai per day in a given month. She wants a graphical overview of this relationship,

and a measure of its strength.

Leo's folders include data on the number of arrivals on Kauai, and on average hotel occupancy rates in Kauai, as tracked by

the Hawaii Department of Business, Economic Development, and Tourism. Click on the link to access these data.

Kauai Data


The best way to graphically represent the relationship between arrivals and occupancy is a _______.


You generate the scatter diagram using the data file and Excel's Chart Wizard. The relationship can be characterized as


You calculate the correlation coefficient. Enter the correlation coefficient in decimal notation with 2 digits to the right of the

decimal (e.g., enter "5" as "5.00"). Round if necessary.


To find the correlation coefficient, open the Kauai Data file. In any empty cell, type =CORREL(B2:B37,C2:C37). When you
hit enter, the correct answer, 0.71, will appear.


Together with Alice, you compile your findings and present them to Leo.


I see. The relationship between the number of people arriving on Kauai and the island's hotel occupancy rate follows a

general trend, but not a precise pattern. Look at this: in two months with nearly the same average number of daily arrivals,

the occupancy rates were very different — 68% in one month and 82% in the other.

But why should they be so different? When people arrive on the island, they have to sleep somewhere. Do more campers

come to Kauai in one month, and more hotel patrons in the other?

Well, that might be one explanation. There could be differences in the type of tourists arriving. The vacation preferences of

the arrivals would be what we call a hidden variable.

Another hidden variable might be the average length of stay. If the length of stay varies month to month, then so will hotel

occupancy. When 50 arrivals check into a hotel, the occupancy rate will be higher if they spend 10 days each at the hotel than

if they spend only 3 days.

I'm following you, but I'm beginning to see that the occupancy issue is more complex than I expected. Let's get back to it at a

later time. The scuba school contract is more pressing at the moment.

As online retailing expands, many companies are interested in knowing how effective search engines are in helping

consumers find goods online.


Computer scientists study the effectiveness of such search engines by comparing how many results a search engine recalls
and the precision with which it recalls them. "Precision" measures how often a result actually hits its target, for
example a page containing both the phrases "winter parka" and "Eddie Bauer."

What could you say about the relationship between the precision and the number of results recalled?


Is an education a good investment in your future? Some very successful business executives are college dropouts, but is

there a relationship in the general population between income and education level?

Consider the following scatter plot, which lists the income and years of formal education for 18 people. What is the

correlation between the two variables?


Though we should always calculate the correlation coefficient if we want to have a precise measure, it's good to have a

rough feel for the correlation between two variables we see plotted on a scatter diagram. For the income-education data,

the coefficient is nearest to _______.

Leo asks you to help him evaluate the Kahana's contract with the scuba school.

Scuba diving lessons are an ideal way for our guests to enjoy their vacation or take a break from their business activities. We

have an excellent coral reef, and scuba diving is becoming very popular among vacationers and business travelers.

We started our year-round diving program last year, contracting a local diving school to do a scuba certification course. The

one-year trial contract is now up for renewal.

Maintaining the scuba offerings on-site isn't cheap. We have to staff the scuba desk seven days a week, and we subsidize the

costs associated with each course. So I want to get a good handle on how satisfied the guests are with the lessons before I decide

whether or not to renew the contract.

The hotel has a database with information about which guests took scuba lessons and when. Feel free to take a look at it, but I

can't spend a fortune figuring this out. And I need to know as soon as possible, since our contract expires at the end of the

month.

Alice convinces you to do some field research and join her for a scuba diving lesson. You return late that afternoon exhausted

but exhilarated. Alice is especially enthusiastic.

"Well, I certainly give the lessons two thumbs up. And we haven't even been out to sea yet!

"But our opinions alone can't decide the matter. We shouldn't infer from our experience that Leo's clientele as a whole enjoyed

the scuba certification course. After all, we may have caught the instructor on his best day this year."

Alice suggests creating a survey to find out how satisfied guests are with the scuba diving school.

Naturally, you can't ask the opinion of every guest who took scuba lessons over the past year. You have to survey a few guests,

and from their opinions draw conclusions about hotel guests in general. The guests you choose to survey must be representative

of all of the guests who have taken the scuba course at the resort. But how can you be sure you get a good sample?

As managers, we often need to know something about a large group of people or products. For example, how many defective

parts does a large plant produce each year? What are the average annual earnings of a Wall Street investment banker? How

many people in our industry plan to attend the annual conference?

When it is too costly to gather the information we want to know about every person or every thing in an entire group, we

often ask the question of a subset, or sample of the group. We then try to use that information to draw conclusions about the

whole group.


To take a sample, we first select elements from the entire group, or "population," at random. We then analyze that sample

and try to infer something about the total population we're interested in. For example, we could select a sample of people in

our industry, ask them if they plan to attend the annual conference, and then infer from their answers how many people in

the entire industry plan to attend.

For example, if 10% of the people in our sample say they will attend, we might feel quite confident saying that between 7%

and 13% of our entire population will attend.

This is the general structure of all the problems we'll address in this unit — we'll work out the details as we go forward. We

want to know something about a population large enough to make examining every population member impractical.

...and then draw an inference about the total population we're interested in.

The first trick to sampling is to make sure we select a sample that broadly represents the entire group we're interested in.

For example, we couldn't just ask the conference organizers if they wanted to attend. They would not be representative of

the whole group — they would be biased in favor of attending the conference!

To get a good sample, we must make sure we select the sample "at random" from the full population. This means that every

person or thing in the population is equally likely to be selected. If there are 15,000 people in the industry, and we are

choosing a sample of 1,000, then every person needs to have the same chance — 1 out of 15 — of being selected.

Selecting a random sample sounds easy, but actually doing it can be quite challenging. In this section, we'll see examples

of some major mistakes people have made while trying to select a random sample, and provide some advice about how to

avoid the most common types of sampling errors.

In some cases, selecting a random sample can be fairly easy. If we have a complete list of each member of the group in a

database, we can just assign a unique number to each member of the group. We then let a computer draw random numbers

from the list. This would ensure that each element of the population has an equal likelihood of being selected.
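
As a sketch of this idea in Python (the industry directory below is hypothetical; any table with one row per population member would do), random.sample draws without replacement, so every member has the same chance of being chosen:

    import random

    # Hypothetical industry directory: one entry per person, 15,000 in all.
    population = [f"person_{i:05d}" for i in range(1, 15_001)]

    random.seed(7)                              # fixed seed for reproducibility only
    sample = random.sample(population, 1_000)   # each person has a 1-in-15 chance

    print(len(sample), len(set(sample)))        # 1,000 distinct members, no repeats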

If the population about which we need to obtain information is not listed in an easy-to-access database, the task of

selecting a sample at random becomes more difficult. In these cases, we have to be extremely careful not to introduce a bias

in the way we select the sample.

For example, if we want to know something about the opinions of an entire company, we cannot just pick employees from

one department. We have to make sure that each employee has an equal chance of being included in the sample. A

department as a whole might be biased in favor of one opinion.

Sample Size

Once we have decided how to select a sample, we have to ask how large our sample needs to be. How many members of

the group do we need to study to get a good estimate about what we want to know about the entire population?

The answer is: It depends on how "accurate" we want our estimate to be. We might expect that the larger the population,

the larger the sample size needed to achieve a given level of accuracy, but this is not true.

A sample size of 1,000 randomly-selected individuals can often give a satisfactory estimation about the underlying

population, as long as the sample is representative of the whole population. This is true regardless of whether the

population consists of thousands of employees or millions of factory parts.

Sometimes, a sample size of 100 or even 50 might be enough when we are not that concerned about the accuracy of our

estimate. Other times, we might need to sample thousands to obtain the accuracy we require.

Later in this unit, we will find out how to calculate a good sample size. For now, it's important to understand that the

sample size depends on the level of accuracy we require, not on the size of the population.

Once we select our sample, we need to make sure we obtain accurate information about each member of the sample. For

example, if we want to learn about the number of defects a plant produces, we must carefully measure each item in the

sample.

When we want to learn something about a group of people and don't have any existing data, we often use a survey to learn

about an issue of interest. Conducting a survey raises problems that can be surprisingly tricky to resolve.


First, how do we phrase our questions? Is there a bias in any questions that might lead participants to answer them in a

certain way? Are any questions worded ambiguously? If some of the people in the sample interpret a question one way, and

others interpret it differently, our results will be meaningless!

Second, how do we best conduct the survey? Should we send the survey in the mail, or conduct it over the phone? Should

we interview survey participants in person, or distribute handouts at a meeting?

There are advantages and disadvantages to all methods. A survey sent through the mail may be relatively inexpensive, but

might have a very low response rate. This is a major problem if those who respond have a different opinion than those who

don't respond. After all, the sample is meant to learn about the entire population, not just those with strong opinions!

Creating a telephone survey creates other issues: When do we call people? Who is home during regular business hours?

Most likely not working professionals. On the other hand, if we call household numbers in the evening, the "happy hour

crowd" might not be available.

When we decide to conduct a survey in person, we have to consider whether the presence of the person asking the

questions might influence the survey results. Are the survey participants likely to conceal certain information out of

embarrassment? Are they likely to exaggerate?

Clearly, every survey will have different issues that we need to confront before going into the field to collect the data.

Response Rates

With any type of survey, we must pay close attention to the response rate. We have to be sure that those who respond to the

survey answer questions in much the same way as those who don't respond would answer them. Otherwise, we will have a

biased view of what the whole population thinks.

Surveys with low response rates are particularly susceptible to bias. If we get a low response rate, we must try to follow up

with the people who did not respond the first time. We either need to increase the response rate by getting answers from

those who originally did not respond, or we must demonstrate that the non-respondents' opinions do not differ from those

of the respondents on the issue of interest.

Tracking down everyone in a sample and getting their response can be costly and time consuming. When our resources are

limited, it is often better to take a small sample and relentlessly pursue a high response rate than to take a larger sample

and settle for a low response rate.

Summary

Often it makes sense to infer facts about a large population from a smaller sample. To make sound inferences: select the sample at random so that it represents the entire population, choose a sample size that matches the accuracy you require, measure each member of the sample accurately, and pursue a high response rate.

To understand the importance of representative samples, let's go back in history and look at some mistakes made in the

Literary Digest poll of 1936.

The Literary Digest, a popular magazine in the 1930s, had correctly predicted the outcome of U.S. presidential elections

from 1916 to 1932. When the results of the 1936 poll were announced, the public paid attention. Who would become the

next president?

Newscaster: "Once again, the Literary Digest sent out a survey to the American public, asking, "Whom will you vote for in

this year's presidential election?" This may well be the largest poll in American history."

Newscaster: "The Digest sent the survey to over 10 million Americans and over two million responded!"

Newscaster: "And the survey results predict: Alf Landon will beat Franklin D. Roosevelt by a large margin and become

President of the United States."

As it turned out, Alf Landon did not become President of the United States. Instead, Franklin D. Roosevelt was re-elected

to a second term in office in the largest landslide victory recorded to that date. This was a devastating blow to the Digest's

reputation. What went wrong? How could such a large survey be so far off the mark?

The Literary Digest made two mistakes that led it to predict the wrong election outcome. First, it mailed the survey to

people on three different lists: the magazine's subscribers, car owners, and people listed in telephone directories. What was

wrong with choosing a sample from these lists?

The sample was not representative of the American public. Most lower-income people did not subscribe to the Digest and

did not own phones or cars back in 1936. This led the poll to be biased towards higher-income households and greatly

distorted the poll's results. Lower-income households were more likely to vote for the Democrat, Roosevelt, but they were

not included in the poll.


Second, the magazine relied on people to voluntarily send their responses back to the magazine. Out of the ten million

voters who were sent a poll, over two million responded. Two million is a huge number of people. What was wrong with

this survey?

The mistake was simple: Republicans, who wanted political change, felt more strongly about the election than Democrats.

Democrats, who were generally happy with Roosevelt's policies, were less interested in returning the survey. Among those

who received the survey, a disproportionate number of Republicans responded, and the results became even more biased.

The Digest had put an unprecedented effort into the poll and had staked its reputation on predicting the outcome of the

election. Its reputation wounded, the Digest went out of business soon thereafter.

During the same election year, a little-known psychologist named George Gallup correctly predicted what the Digest

missed: Roosevelt's victory. What did Gallup do that the Literary Digest did not? Did he create an even bigger sample?

Surprisingly, George Gallup used a much smaller sample. He knew that large samples were no guarantee of accurate results

if they weren't randomly selected from the population.

Gallup's team interviewed only 3,000 people, but made sure that the people they selected were truly representative of the

US population. He also instructed his team to be persistent in asking the opinion of each person in the sample, which

generated a high response rate.

Gallup's correct prediction of the 1936 election winner boosted his reputation and Gallup's method of polling soon became

a standard for public opinion polls.

Today's polls usually consist of a sample of around a thousand randomly selected people who are truly representative of the

underlying population. For example, look at a poll reported in a leading newspaper: the sample size will likely be around a

thousand.

Another common survey mistake is phrasing the questions in a way that leads to a biased response. Let's take a look at a

recent example of a biased question.

In 1992, Ross Perot, an independent contender for the US Presidential election, conducted a mail-in survey to show that

the public supported his desire to abolish special interest groups. This is the question he asked:

"Should laws be passed to eliminate all possibilities of special interests giving huge sums of money to candidates?"

In Perot's mail-in survey, 99 percent of respondents said "yes" to that question. It seemed as if everyone in America agreed

with Perot's stance.


Soon after Perot's survey, Yankelovich Partners, an independent market research firm, conducted two interesting follow-up

surveys. In the first survey, it used the same question that Perot asked and found that 80 percent of the population favored

passing the law. YP attributed the difference to the fact that it was able to create a more representative sample than Perot.


Interestingly, Yankelovich then conducted a similar survey, but rephrased the question in the following way:

"Should laws be passed to prohibit interest groups from contributing to campaigns, or do groups have a right to contribute to the candidate they support?"

The response to this question was strikingly different. Only 40 percent of the sampled population agreed to prohibit

contributions. As it turned out, the results of the survey all came down to the way the question was phrased.


For any survey we conduct, it's critical to phrase the question in the most neutral way possible to avoid bias in the sample

results.


The real lesson of these two examples is this: How data are collected is at least as important as how data are analyzed. A

sample that is unrepresentative, biased, or not drawn at random can give highly misleading results.

Knowing that sample data need to be representative and unbiased, you conduct a survey of the hotel guests.

How can you best determine if hotel guests are enjoying the scuba course? By searching the hotel database, you determine

that 2,804 hotel guests took scuba trips in the past year. The scuba certification course was offered year-round. The database

includes each guest's name, address, phone number, age, date of arrival, length of stay, and room number.


Your first step is deciding what type of survey to conduct that will be inexpensive, quick, and will provide a good sample of all

the guests who took scuba lessons.

Should you mail a survey to the whole list of guests who took scuba lessons, expecting that a small percentage will respond, or

conduct a telephone survey, which would likely provide a higher response rate, but cost more per guest contacted?

To ensure a good response rate — and because Leo wants an answer quickly — you choose to contact customers by phone.

Alice warns that to keep costs low, you can only contact 50 hotel guests, and reminds you to create a random, representative

sample.

You open up the list of names in the hotel database. The names were entered as guests arrived. To make things simple, you

randomly select a date and then record the first 50 guests arriving after that date who took the course. You ask the hotel

operator to call them for you, and tell him to be persistent. Eventually he is able to contact 45 of the guests on the list. He asks

the guests to rate their scuba experience on a 1 to 6 scale and reports the results back to you. Click the link below to view your

sample.

Enter the average satisfaction level as a decimal number with one digit to the right of the decimal point (e.g., enter "5" as

"5.0"). Round if necessary.

Hotel Database

You compute the average satisfaction level and find that it is 2.5. You give Leo the news. He explodes.

Two point five! That's impossible! I know for sure that it must be higher than that! You'd better go over your data again.

Back in your room, you look over your list of data. What should you tell Leo?

We were hit with a hurricane at the beginning of April. Half the scuba classes were cancelled, and the ones that did meet had

to deal with choppy water and bad visibility. Even the weeks following the hurricane were bad. Usually guests see a manta ray

every week, and the guests in April could barely see the underwater coral. No wonder they weren't happy.

You assure Leo you will conduct the survey again with a more representative sample. This time, you make sure that the guests

are truly randomly selected. Later, you have new data in your hands from 45 randomly chosen guests that show the average

satisfaction rate to be 4.4 on a 1 to 6 scale. The standard deviation of the sample is 1.54.

Mr. Gavin Collins is the Chief Operating Officer of Bell Computers, a market leader in personal computers. This morning,

he opened the latest issue of Business 4.0, a business journal, and noticed an article on Bell Computers.

The article praised the high quality and low cost of the PCs made by Bell. However, it also included some negative

comments about Bell's customer service.

Currently, customer service is only available to customers of Bell Computers over the phone.

Collins wants to understand more fully what customers think of Bell's customer service. His marketing department designs

a survey that asks customers to rate Bell's customer service from 1 to 10.

"Wave" is a company that manufactures laundry detergent in several countries around the world. In India, the competition

among laundry detergents is fierce.

Wave's monthly sales have been constant for the past five years. Wave CEO Mr. Sharma instructed his marketing

team to come up with a strong advertising campaign stressing Wave's superiority over other competitors. Wave conducted

a survey in the month of June.

They asked the following questions: "Have you heard of Wave?" "Do you think Wave is a good product?" "Do you notice a

difference in the color of your clothes after using Wave?" Then, citing the results of their survey, Wave aired a major

television campaign claiming that 75% of the population thought that Wave was a good product.

You are a new associate at Madison Consulting. With your partner, Ms. Mehta, you have been asked to conduct a study for

Wave's main competitor, the Coral Reef Detergent Company, about whether Wave's claims hold water. Coral Reef wonders

how the Wave results are possible, considering that Coral Reef holds over 45% of the current market share.


Ms. Mehta has been going through the survey methodology, and she tells you, "This sample is clearly biased and
unrepresentative. Coral Reef can dispute Wave's claim!" What has Ms. Mehta noticed?

You have been asked to conduct a survey to determine the percentage of flights arriving at a small airport that were filled to

capacity that morning. You decide to stand outside the airport's single exit door and ask a sample of 60 passengers leaving

the airport how full their flight was.

Your first thought is to just ask the first 60 passengers departing the airport how full their flight was, but you quickly

realize that that could be a highly biased sample. Any 60 people leaving at the same time would likely have come from only

a couple of flights, and you want to get a good sense of what percent of all flights arriving that morning were filled to

capacity. Thus, you decide to randomly select 60 people from all the passengers departing the building that morning.

After conducting your survey, you tally the results: 10 people decline to answer, 30 people tell you that their flight was filled

to capacity, and 20 people tell you that their flight was not filled to capacity. What can you conclude from your survey

results so far?

To see this, imagine that 10 planes have arrived that morning — five of which were full (having 100 passengers each) and

five of which had only a single passenger on the plane. In this case, half of the planes were full. However, almost all of the

passengers (500 of the total 505) departing from the airport would report (correctly!) that they had been on a full plane.

Since people from a full plane are more likely to be selected, there is a systematic bias in your response.

It is important, in every survey, to try to make your sample as representative as possible. In this case, your sample was not

representative of the planes arriving at the airport.

A better approach might be to ask the people you select what their flight number was, and then ask them how full their

flight was. Make sure you have at least one passenger from every plane. Then count the responses of only one person from

each flight. By including only one person per flight in your sample, you ensure that your sample is an accurate prediction of

how many planes are filled to capacity.

Sampling is complicated, and it is important to think through all the factors that might influence your results. In this case,

the mistake is that you are trying to estimate a population of planes by sampling a population of passengers. This makes

the sample unrepresentative of the underlying population. By randomly sampling the passengers rather than the flights,

each flight is not equally likely to be selected, and the sample is biased.
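
A small simulation makes the bias vivid. This sketch recreates the hypothetical morning described above, five full planes and five nearly empty ones, and shows how a passenger-level sample overstates the share of full flights:

    import random

    # Hypothetical morning: five full planes (100 passengers each)
    # and five planes carrying a single passenger.
    flights = [100] * 5 + [1] * 5

    # One entry per passenger, tagged True if that passenger's flight was full.
    passengers = []
    for size in flights:
        passengers.extend([size == 100] * size)

    random.seed(1)
    survey = random.sample(passengers, 60)

    # Passenger-level estimate of the share of full flights: badly inflated.
    print(sum(survey) / len(survey))           # close to 500/505, not 0.5

    # Flight-level truth: exactly half the planes were full.
    print(sum(size == 100 for size in flights) / len(flights))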

You report the results of your survey, the sample mean, and its standard deviation to Leo.

A sample mean of 4.4 makes more sense to me, but I'm still a bit uneasy about your survey result. After all, you've only

collected 45 responses.

If you'd chosen different people, they likely would have given different responses. What if — just by chance — these 45 people

loved the scuba course, and no one else did?

You have a good point there, Leo. Our intuition is that the average satisfaction rate for all guests isn't too far from 4.4, but at

this point we're not sure exactly how far away it might be. Without more calculations, all we can say is that 4.4 is the best

estimate we have. That is why...

Wait a minute! This is very unsatisfying. Are you telling me that there's no way to gauge the accuracy of this survey result?

If the results are a little off, that's not a problem. But you have to tell me how far off they might be. What if you're off by two

whole points, and the true satisfaction of my hotel guests is 2.4, not 4.4? In that case, my decision would be completely

different.

I need to know how accurately this sample reflects the opinions of all the hotel guests who went scuba diving!

The sample mean is the best point estimate of the population mean, but it cannot tell you how accurately the sample reflects

the population.

Alice suggests giving Leo a range of values that is almost certain to contain the population mean. "We may not be able to pin

down mean satisfaction precisely. But confining it to a range of likely values will provide Leo with enough information to

make a sound business decision."

That sounds like a good idea, but you wonder how to actually do it.


The sample mean is the best estimate of our population mean. However, it is only a point estimate. It does not give us a sense

of how accurately the sample mean estimates the population mean.

Think about it. If we know only the sample mean, what can we really say about the population mean? In the case of our scuba

school, what can we say about the average satisfaction rate of all scuba-diving hotel guests? Could it be 4.3? 4.0? 4.7? 2.0?

To make decisions as a manager, we need to have more than just a good point estimate. We need to have a sense of how

close or far away the true population mean might be from our estimate.

We can indicate the most likely values of the true population mean by creating a range, or interval, around the sample mean.

If we construct it correctly, this range will very likely contain the true population mean.

For example, by constructing a range, we might be able to tell Leo that we are very confident that the true average customer

satisfaction for all scuba guests falls between 4.2 and 4.6.

Knowing that the true average is almost certainly between 4.2 and 4.6, Leo is better equipped to make a decision than if he

simply knew the estimated average of 4.4.

Creating a range around the sample mean is quite easy. First, we need to know three statistics of the sample: the mean x-bar,

the standard deviation s, and the sample size n.

We also need to know how "confident" we'd like to be that the range contains the true mean of the population. For any level

of "confidence", there is a value we'll call z to put into the formula. We'll learn later in this unit exactly what we mean by

"confidence," and how to compute z. For now, just keep in mind that for higher levels of confidence, we'll need to put in a

larger value of z.

Using these numbers, we can create a range around the sample mean according to the following formula: x-bar ± z × s/√n.
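
As a preview, here is a minimal sketch of that range calculation in Python. It plugs in the scuba survey's statistics from earlier (x-bar = 4.4, s = 1.54, n = 45) and uses z = 1.96, the value we will later associate with 95% confidence:

    from math import sqrt

    def mean_range(xbar, s, n, z):
        # Range around the sample mean: x-bar +/- z * s / sqrt(n).
        half_width = z * s / sqrt(n)
        return xbar - half_width, xbar + half_width

    # Scuba survey statistics; z = 1.96 corresponds to 95% confidence.
    print(mean_range(4.4, 1.54, 45, 1.96))    # roughly (3.95, 4.85)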

Before we actually use the formula, let's try to develop our intuition about the range we're creating. Where should the range

be centered? How wide must the range be to make us confident that it contains the true population mean? What factors

would lead us to need a wider or narrower range?

Let's see how the statistics of the sample influence the location and width of the range. Let's start with the sample mean.

The sample mean is our best estimate of the population mean. This suggests that the sample mean should always be the

center of the range. Move the slider bar to see how the sample mean affects the range.

Second, the width of the range depends on the standard deviation of the sample. When the sample standard deviation is

large, we have greater uncertainty about the accuracy of the sample mean as an estimate of the population mean. Thus, we

have to create a wider range to be confident that it includes the true population mean.

On the other hand, if the sample standard deviation is small, we feel more confident that our sample mean is an accurate

predictor of the true population mean. In this case, we can draw a narrower range.

The larger the standard deviation, the wider the range must be. Move the slider bar to see how the sample standard deviation

affects the range.

Third, the width of the range depends on the sample size. With a very small sample, it's quite possible that one or two atypical

points in the sample could throw the sample mean off considerably from the true population mean. So with a small sample,

we need to create a wide range to feel comfortable that the true mean is likely to be inside it.

The larger the sample, the more certain we can be that the sample mean represents the population mean. With a large

sample, even if our sample includes a few atypical points, there are likely to be many more typical points in the sample to

compensate for the outliers. Thus, with a large sample, we can feel comfortable with a small range.

Move the slider bar to see how the sample size influences the range.

Finally, the width of the range depends on our desired level of confidence. The level of confidence states how certain we want

to be that the range contains the mean of the population. The more confident we want to be that the range contains the true

population mean, the wider we have to make the range.

If our desired level of confidence is fairly low, we can draw a narrower range.

In the language of statistics, we indicate our level of confidence by saying, for example, that we are "95% confident" that the

range contains the true population mean. This means there is a 95% chance that the range contains the true population

mean.

Move the slider bar to see how the confidence level affects the range.


These variables determine the size of the range that we want to construct. We will learn exactly how to construct this range in

a later section.

For now, all we have to understand is that the population mean can best be estimated by a range of values and that the range

depends on three sample statistics as well as the level of confidence that we want to assign to the range.

Summary

The sample mean is our best initial estimate of the population mean. To indicate how accurate this estimate is, we construct a

range around the sample mean that likely contains the population mean. The width of the range is determined by the sample

size, sample standard deviation, and the level of confidence. The confidence level measures how certain we are that the range

we construct contains the true population mean.

Alice recommends taking a step back from sampling and learning about the normal distribution.

The normal distribution helps us create a range around a sample mean that is likely to contain the true population mean. You

can use the normal distribution to turn the intuitive notion of "confidence in your estimate" into a precisely defined concept.

Understanding the normal distribution will also give you deeper insight into how sampling works.

The normal distribution is a probability distribution that is centered at the mean. It is shaped like a bell, and is sometimes

called the "bell curve."

Like any probability distribution, the normal distribution is shown on two axes: the x-axis for the variable we're studying —

women's heights, for example — and the y-axis for the likelihood that different values of the variable will occur.

For example, few women are very short and few are very tall. Most are in the middle somewhere, with fairly average heights.

Since women of average height are so much more common, the distribution of women's heights is much higher in the center

near the average, which is about 63.5 inches.

As it turns out, for a probability distribution like the normal distribution, the percent of all values falling into a specific range is

equal to the area under the curve over that range.

For example, the percentage of all women who are between 61 and 66 inches tall is equal to the area under the curve over that

range.

The percentage of all women taller than 66 inches is equal to the area under the curve to the right of 66 inches.

Like any probability distribution, the total area under the curve is equal to 1, or 100%, because the height of every woman is

represented in the curve.

Over the years, statisticians have discovered that many populations have the properties of the normal distribution. For

example, IQ test scores follow a normal distribution. The weights of pennies produced by U.S. mints have been shown to follow

a normal distribution.

The normal distribution has two defining properties. First, its mean and median are equal; they are located exactly at the
center of the distribution. Hence, the probability that a normal distribution will have a value less than the mean is 50%,
and the probability that it will have a value greater than the mean is 50%.

Second, the normal distribution has a unique symmetrical shape around this mean. How wide or narrow the curve is depends

solely on the distribution's standard deviation.

In fact, the location and width of any normal curve are completely determined by two variables: the mean and the standard

deviation of the distribution.

Large standard deviations make the curve very flat. Small standard deviations produce tight, tall curves with most of the values

very close to the mean.

Regardless of how wide or narrow the curve, it always retains its bell-shaped form. Because of this unique shape, we can create

a few useful "rules of thumb" for the normal distribution.

For a normal distribution, about 68% (roughly two-thirds) of the probability is contained in the range reaching one standard

deviation away from the mean on either side.


It's easiest to see this with a standard normal curve, which has a mean of zero and a standard deviation of one.

If we go two standard deviations away from the mean for a standard normal curve we'll cover about 95% of the probability.

The amazing thing about normal distributions is that these rules of thumb hold for any normal distribution, no matter what its

mean or standard deviation.

For example, about two thirds of all women have heights within one standard deviation, 2.5 inches, of the average height, which

is 63.5 inches.

95% of women have heights within two standard deviations (or 5 inches) of the average height.

To see how these rules of thumb translate into specific women's heights, we can label the x-axis twice to show which values

correspond to being one standard deviation above or below the mean, which values correspond to being two standard

deviations above or below the mean, and so on.

Essentially, by labeling the x-axis twice we are translating the normal curve into a standard normal curve, which is easier to

work with.

For women's height, the mean is 63.5 and the standard deviation is 2.5. So, one standard deviation above the mean is 63.5 +

2.5, and one standard deviation below the mean is 63.5 - 2.5.

Thus, we can see that about 68% of all women have heights between 61 and 66 inches, since we know that about 68% of the

probability is between -1 and +1 on a standard normal curve.

Similarly, we can read the heights corresponding to two standard deviations above and below the mean to see that about 95% of

all women have heights between 58.5 and 68.5 inches.
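
These rules of thumb are easy to verify with software. A minimal sketch using SciPy's normal-distribution functions (a Python stand-in for the Excel functions the course uses):

    from scipy.stats import norm

    # Probability within one and within two standard deviations of the mean.
    print(norm.cdf(1) - norm.cdf(-1))    # about 0.68
    print(norm.cdf(2) - norm.cdf(-2))    # about 0.95

    # The same 68% rule applied directly to heights (mean 63.5, sd 2.5).
    print(norm.cdf(66, 63.5, 2.5) - norm.cdf(61, 63.5, 2.5))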

The z-statistic

The unique shape of the normal curve allows us to translate any normal distribution into a standard normal curve, as we did

with women's heights simply by re-labeling the x-axis. To do this more formally, we use something called the z-statistic.

For a normal distribution, we usually refer to the number of standard deviations we must move away from the mean to cover

a particular probability as "z", or the "z-value." For any value of z, there is a specific probability of being within z standard

deviations of the mean.

For example, for a z-value of 1, the probability of being within z standard deviations of the mean is about 68%, the probability

of being between -1 and +1 on a standard normal curve.

A good way to think about what the z-statistic can do is this analogy: if a giant tells you his house is four steps to the north,

and you want to know how many steps it will take you to get there, what else do you need to know?

You would need to know how much bigger his stride is than yours. Four steps could be a really long way.

The same is true of a standard deviation. To know how far you must go from the mean to cover a certain area under the curve,

you have to know the standard deviation of the distribution.

Using the z-statistic, we can then "standardize" the distribution, making it into a standard normal distribution with a mean of

0 and a standard deviation of 1. We are translating the real value in its original units — inches in our example — into a z-

value.

The z-statistic translates any value into its corresponding z-value simply by subtracting the mean and dividing by the

standard deviation.

Thus, for the women's height of 66 inches, the z-value, z = (66-63.5)/2.5, equals 1. Therefore, 66 is exactly one standard

deviation above the mean.

Essentially, the z-statistic allows us to measure the distance from the mean in terms of standard deviations instead of real

values. It gives everyone the same size feet in statistics.
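
The z-statistic itself is one line of arithmetic. A minimal sketch, mirroring the STANDARDIZE function that appears later in this unit:

    def standardize(x, mu, sigma):
        # Distance from the mean, measured in standard deviations.
        return (x - mu) / sigma

    print(standardize(66, 63.5, 2.5))    # 1.0: one standard deviation above the mean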

We can extend the rules of thumb we've developed beyond the two cases we've looked at. For example, we may want to know

the likelihood of being within 1.5 standard deviations from the mean, or within three standard deviations from the mean.

Select different values of z — that is, select different numbers of standard deviations from the mean — and see how the

probability changes. Be sure to try z values of 1 and 2 to verify that our rules of thumb are on target!

Sometimes we may want to go in the other direction, starting with the probability and figuring out how many standard

deviations are necessary on either side of the mean to capture that probability.

For example, suppose we want to know how many standard deviations we need to be from the mean to capture 95% of the

probability.


Our second rule of thumb tells us that when we move two standard deviations from the mean, we capture about 95% of the

probability. More precisely, to capture exactly 95% of the probability, we must be within 1.96 standard deviations of the

mean.

This means that for a normal distribution, there is a 95% probability of falling between -1.96 and 1.96 standard deviations

from the mean.

Select different probabilities and see how many standard deviations we have to move away from the mean to cover that

probability.

We can create a table that shows which values of z correspond to each probability or we can calculate z using a simple

function in Microsoft Excel. We'll explain how to use both of these approaches in the next few clips.

z-table

Remember, the probabilities and the rules of thumb we've described apply ONLY to a normal distribution. Don't think you

can use them for any distribution!

Sometimes, probabilities are shown in other forms. If we start at the very left side of the distribution, the area underneath the

curve is called the cumulative probability. For example, the probability of being less than the mean is 0.5, or 50%. This is just

one example of a cumulative probability.

A cumulative probability of 70% corresponds to a point that has 70% of the area under the curve to its left.

There are easy ways to find cumulative probabilities using spreadsheet packages such as Microsoft Excel. You'll have

opportunities to practice solving these types of problems shortly.

Cumulative probabilities can be used to find the probability of any range of values. For example, to find the percentage of all

women who have heights between 63.5 and 68 inches, we would simply subtract the percent whose heights are less than 63.5

inches from the percent whose heights are less than 68 inches.
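
In code, that subtraction is a single line. A sketch using SciPy's cumulative distribution function with the women's-height parameters:

    from scipy.stats import norm

    # P(63.5 < height < 68) = P(height < 68) - P(height < 63.5)
    print(norm.cdf(68, 63.5, 2.5) - norm.cdf(63.5, 63.5, 2.5))   # about 0.46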

Summary

The normal distribution has a unique symmetrical shape whose center and width are completely determined by its mean and

its standard deviation. For every normal distribution, the probability of being within a specified number of standard

deviations of the mean is the same. The distance from the mean, as measured in standard deviations, is known as the z-value.

Using the properties of the normal distribution, we can calculate a probability associated with any range of values.

To find the cumulative probability associated with a given z-value for a standard normal curve, we use the Excel function

NORMSDIST. Note the S between the M and the D. It indicates we are working with a 'standard' normal curve with mean

zero and standard deviation one.

For example, to find the cumulative probability for the z-value 1, we enter the Excel function =NORMSDIST(1).

The value returned, 0.84, is the area under the standard normal curve to the left of 1. This tells us that the probability of

obtaining a value less than 1 for a standard normal curve is about 84%.

We shouldn't be surprised that the probability of being less than 1 is 84%. Why? First, we know that the normal curve is

symmetric, so there is a 50% chance of being below the mean.

Next, we know that about 68% of the probability for a standard normal curve is between -1 and +1.

Since the normal curve is symmetric, half of that 68% — or 34% of the probability — must lie between 0 and 1.

Putting these two facts together confirms that there is an 84% chance of obtaining a value less than 1 for a standard normal

curve.

If we want to find the cumulative probability of a value in a general normal curve — one that does not necessarily have a

mean of zero and a standard deviation of one — we have two options. One option is to first standardize the value in question

to find the equivalent z-value, and then use the NORMSDIST to find the cumulative probability for that z-value.

For example, if we have a normal distribution with mean 26 and standard deviation 8, we may wish to know the probability

of obtaining a value less than 24.

Standardizing can be done easily by hand, but Excel also has a STANDARDIZE function. We enter the function in a cell and

insert three values: the value to be standardized, and the mean and standard deviation of the normal distribution.

We find that the standardized value (or z value) of 24 for a normal curve with mean 26 and standard deviation 8 is -0.25.


Now, to find the cumulative probability for the z-value -0.25, we enter the Excel function =NORMSDIST(-0.25), which tells

us that the probability of a value less than -0.25 on a standard normal curve is 40%. Thus, the probability of a value less than

24 on a normal curve with mean 26 and standard deviation 8 is 40%.

The second way to find a cumulative probability in a general normal curve is to use the NORMDIST function. Here, we enter

the function in a cell and insert four values: the number whose cumulative probability we want to find, the mean and

standard deviation of the normal distribution, and the word "TRUE."

As with our previous approach, we find that the probability of obtaining a value less than 24 on a normal curve with mean 26

and standard deviation 8 is 40%.

The value "TRUE" tells Excel to return a cumulative probability. If instead of "TRUE" we enter "FALSE," Excel returns the y-

value of the normal curve — something we are usually not interested in.
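
Both routes can also be checked outside of Excel. A minimal sketch using SciPy, where norm.cdf accepts either a pre-standardized z-value or the raw value together with the curve's mean and standard deviation:

    from scipy.stats import norm

    # Route 1: standardize by hand, then use the standard normal curve.
    z = (24 - 26) / 8                  # z = -0.25
    print(norm.cdf(z))                 # about 0.40

    # Route 2: give norm.cdf the mean and standard deviation directly,
    # the counterpart of NORMDIST(24, 26, 8, TRUE).
    print(norm.cdf(24, 26, 8))         # same answer, about 0.40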

Quite often, we have a cumulative probability, and want to work backwards, translating it into a value on a normal curve.

Suppose we want to find the z-value associated with the cumulative probability 95%.

To translate a cumulative probability back to a z-value on the standard normal curve, we use the Excel function NORMSINV.

Note once again the S, which tells us we are working with a standard normal curve.

We find that the z-value associated with the cumulative probability 95% is 1.64.

Sometimes we may want to translate a cumulative probability back to a value on a general normal curve. For example, we

may want to find the value associated with the cumulative probability 95% for a normal curve with mean 26 and standard

deviation 8.

If we want to translate a cumulative probability back to a value on a general normal curve, we use the NORMINV function.

NORMINV requires three values: the cumulative probability, and the mean and standard deviation of the normal distribution

in question.

We find that the value associated with the cumulative probability 95% for a normal curve with mean 26 and standard

deviation 8 is 39.2.
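
SciPy's norm.ppf (the percent point function) inverts the cumulative probability the same way NORMSINV and NORMINV do. A quick sketch reproducing both results above:

    from scipy.stats import norm

    # Counterpart of NORMSINV: cumulative probability back to a z-value.
    print(norm.ppf(0.95))              # about 1.64

    # Counterpart of NORMINV: back to a value on a general normal curve.
    print(norm.ppf(0.95, 26, 8))       # about 39.2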

The previous clip shows us how to use software programs like Excel to calculate z-values and cumulative probabilities for the

normal curve. Another way to find z-values and cumulative probabilities is to use a z-table. Using z-tables is a bit more

cumbersome than using Excel, but it helps reinforce the concepts.

Let's use the z-table to find a cumulative probability. Women's heights are distributed normally, with mean around 63.5

inches, and standard deviation 2.5 inches. What percentage of women are shorter than 65.6 inches?

First, we calculate the z-value for 65.6 inches, 0.84. The cumulative probability associated with the z-value is the area under

the standard normal curve to the left of the z-value. This cumulative probability is the percentage of women who are shorter

than 65.6 inches.

We next use the table to find the cumulative probability corresponding to a z-value of 0.84. First, we find the row by locating

the z-value up to the first digit to the right of the decimal point, 0.8. Then we choose the column corresponding to the

remainder of the z-value (0.84 - 0.8 = 0.04). The cumulative probability is 0.7995. About 80% of women are shorter than

65.6 inches.

Finding the cumulative probability for a value less than the mean is a bit trickier. For example, we might want to know what

percentage of women are shorter than 61.6 inches.

We find that the z-value for a height of 61.6 inches is a negative number: -0.76.

When a z-value is negative, we must first use the table to find the cumulative probability corresponding to the positive z-

value, in this case +0.76. Then, since the normal curve is symmetric, we will be able to conclude that the probability of being

less than the z-value -0.76 is the same as the probability of being greater than the z-value +0.76.

We find the cumulative probability for +0.76 by locating the row corresponding to the z-value up to the first digit to the right

of the decimal point, 0.7, and the column corresponding to the remainder of the z-value (0.76 - 0.7 = 0.06). The cumulative

probability is 0.7764.

Since the probability of being less than a z-value of +0.76 is 0.7764, then the probability of being greater than a z-value of

+0.76 is 1 - 0.7764 = 0.2236. Thus, we can conclude that the probability of being less than a z-value of -0.76 is also 0.2236.

Finally, we reach our conclusion. About 22.36% of women are shorter than 61.6 inches.
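
The symmetry argument is easy to confirm in code. A one-line check with SciPy:

    from scipy.stats import norm

    print(norm.cdf(-0.76))             # about 0.2236
    print(1 - norm.cdf(0.76))          # identical, by symmetry of the curve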


Find the cumulative probability associated with the z-value 2.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.

z-table

Excel


For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the value 115.

z-table

Excel

For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the value 80.

z-table

Excel

For a normal curve with mean 100 and standard deviation 10, find the probability of obtaining a value greater than 80 but

less than 115.

z-table

Excel

For a normal curve with mean 80 and standard deviation 5, find the probability of obtaining a value greater than 85 but less

than 95.

z-table

Excel

For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than 45.

Enter your answer in decimal notation with 3 digits to the right of the decimal (e.g., enter "5" as "5.000"). Round if necessary.

z-table

Excel

For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than 38 but less

than 45.


Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.

z-table

Excel


For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative probability of

88%.

z-table

Excel

For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative probability of

28%.

z-table

Excel

How can the normal distribution help you sample Leo's hotel guests?

How do the unique properties of the normal distribution help us when we use a random sample to infer something about the

underlying population?

After all, when we sample a population, we usually have no idea whether or not the population is normally distributed. We're

typically sampling because we don't even know the mean of the population! If the normal distribution is such a great tool,

when can we use it?

It turns out that even if a population is not normally distributed, the properties of the normal distribution are very helpful to

us in sampling. To see why, let's first learn about a well-established statistical fact known as the "Central Limit Theorem".

Definition

Roughly speaking, the Central Limit Theorem says that if we took many random samples from a population and plotted the

means of each sample, then — assuming the samples we take are sufficiently large — the resulting plot of the sample means

would look normally distributed.

Furthermore, if we took enough of these samples, the mean of the resulting distribution of sample means would be equal to

the true mean of the population.

To repeat: no matter what type of distribution the population has — uniform, skewed, bi-modal, or completely bizarre — if

we took enough samples, and the samples were sufficiently large, then the means of those samples would form a normal

distribution centered around the true mean of the population.

The Central Limit Theorem is one of the subtlest aspects of basic statistics. It may seem odd to be drawing a distribution of

the means of many samples, but that is exactly what we are doing. We'll call this distribution the Distribution of Sample
Means. (Statisticians also often call it the Sampling Distribution of the Mean).

Let's walk through this step-by-step. If we have a population — any population — we can take a random sample. This

sample has a mean.

Then we take another sample. That sample also has a mean, which we also plot on the graph.

Now, if we plot a lot of sample means in this way, they will start to form a normal distribution around the population's

mean.

The more samples we take, the more the graph of the sample means would look like a normal distribution. Eventually, the

graph of the sample means — the Distribution of the Sample Means — would form a nearly perfect replica of a normal

distribution.

Now, nobody would actually take a lot of samples, calculate all of the sample means, and then construct a normal

distribution with them. We're taking a lot of samples here just to let you see that graphing the means of many samples

would give you a normal curve.
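
If you do want to see the theorem in action, a short simulation is enough. This sketch draws repeated samples from a deliberately skewed (exponential) population, with arbitrary parameters chosen only for illustration, and shows that the sample means cluster normally around the true population mean:

    import numpy as np

    rng = np.random.default_rng(0)

    # A deliberately non-normal population: exponential, heavily skewed.
    population = rng.exponential(scale=10, size=100_000)

    # Take 2,000 samples of size 50 and record each sample's mean.
    sample_means = np.array(
        [rng.choice(population, 50).mean() for _ in range(2_000)]
    )

    print(population.mean())       # true population mean, about 10
    print(sample_means.mean())     # the sample means center on that value
    print(sample_means.std())      # and cluster far more tightly around it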

In the real world, we take a single sample and squeeze it for all the information it's worth. But what does the Central Limit

Theorem allow us to say based on that single sample?

The Central Limit Theorem tells us that the mean of that one sample is part of a normal distribution. More specifically, we

know that the sample mean falls somewhere in a normal Distribution of Sample Means that is centered at the true

population mean.

The Central Limit Theorem is so powerful for sampling and estimation because it allows us to ignore the underlying

distribution of the population we want to learn about. Since we know the Distribution of Sample Means is normally

distributed and centered at the true population mean, we can completely disregard the underlying distribution of the

population.

As we'll see shortly, because we know so much about the normal distribution, we can use the information about the

Distribution of Sample Means to draw conclusions about the likelihood of different values of the actual population mean.

Summary

The Central Limit Theorem states that for any population distribution, the means of samples from that population are

distributed approximately normally. The more samples, and the larger the sample size, the closer the Distribution of

Sample Means fits a normal curve. The mean of a single sample lies on this normal curve, so we can use the normal

curve's special properties to extract more information from a single sample mean.

Illustrating

Let's see how the Central Limit Theorem works using a graphical illustration. The three icons are marked "Uniform,"

"Bimodal," and "Skewed." On a later page, clicking on each of the three sections in the navigation will display a different

kind of distribution.

On the next page, clicking on "Uniform" will display a distribution that is uniform in shape, i.e. a distribution for which all

values in a specified range are equally likely to occur. Clicking on "Bimodal" will display a distribution that has two

separate areas where values are more likely to occur than elsewhere. Clicking on "Skewed" will display a distribution that is

not symmetrical — values are more likely to fall above the mean than below.

Uniform

The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean.

Let's start building a distribution of the sample means on the bottom half of the page by placing each sample mean on a

graph. We repeat this process several times to create our distribution.

Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the distribution of

the sample means.

As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will always

form a normal distribution. This is what the Central Limit Theorem predicts.

Bimodal

The demonstration works the same way: take a sample from the bimodal population, find its mean, and record it in the sample mean histogram. Repeating this process, the distribution of sample means again forms a normal distribution centered at the population mean.

Skewed

Likewise for the skewed population: the shape of the original distribution doesn't matter. The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts.

The Central Limit Theorem states that the means of sufficiently large samples are always normally distributed, a key

insight that will allow you to estimate the population mean from a sample.
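If you would rather see the theorem in action than take it on faith, a short simulation makes the same point as the interactive demo. This is a sketch in Python with numpy; the exponential population stands in for a heavily skewed distribution:

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # A skewed population: exponential with true mean 1.0.
    # Draw 2,000 samples of 50 points each and record each sample's mean.
    sample_means = rng.exponential(scale=1.0, size=(2000, 50)).mean(axis=1)

    # The sample means cluster around the true population mean (1.0)...
    print(round(sample_means.mean(), 2))  # about 1.00
    # ...with spread close to sigma/sqrt(n) = 1/sqrt(50), about 0.14,
    # and a histogram of sample_means looks like a normal curve.
    print(round(sample_means.std(), 2))   # about 0.14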

Confidence Intervals

Using the properties of the normal distribution and the Central Limit Theorem, you can construct a range of values that is

almost certain to contain the population mean.

For a normal distribution, we know that if we select a value at random, it will be within two standard deviations of the

distribution's mean 95% of the time.

The Central Limit Theorem offers us two additional insights. First, we know that the means of sufficiently large samples are

normally distributed, regardless of the distribution of the underlying population.

Second, we know that the mean of the Distribution of Sample Means is equal to the true population mean.

Combining these facts can give us a measure of how accurately the mean of a sample estimates the population mean.

Specifically, we can now conclude that if we take a sufficiently large sample — let's say at least 30 points — from a

population, there is a 95% chance that the mean of that sample falls within two standard deviations of the true population

mean.

Let's build this up step by step to make sure we understand the logic.

First, we take a sample from a population and compute its mean. We know that the mean of that sample is a point on a

normal distribution — the Distribution of Sample Means.

Since the mean of our sample is a value randomly obtained from a normal distribution, there is a 95% chance that the sample

mean is within two standard deviations of the mean of the distribution.

The Central Limit Theorem tells us that the mean of that distribution is the same as the true population mean. Thus, we can

conclude that there is a 95% chance that the sample mean is within two standard deviations of the population mean.

We have argued that 95% of our samples will have a mean within the range shown around the true population mean.

Next we'll turn this around and look at intervals around sample means, because that's exactly what a confidence interval is.

Let's look at intervals around the means of two different types of samples: those whose sample means fall within the 2

standard deviation range around the population mean (which should be the case for 95% of all samples) and those whose

sample means fall outside the 2 standard deviation range around the population mean (which should be the case for 5% of all

samples).

First, let's look at a sample whose mean falls outside the 2 standard deviation range shown around the population mean.

Since this sample mean is outside the range, it must be more than 2 standard deviations away from the population mean.

Since the population mean is more than 2 standard deviations away from this sample mean, an interval of width 2 standard


deviations around this sample mean could not contain the true population mean.

We know that 5% of all samples should have sample means outside the 2 standard deviation range around the population

mean. Therefore 5% of all samples we obtain will have intervals that do not contain the population mean.

Now let's think about the remaining 95% of samples whose means do fall within the 2 standard deviation range around the

population mean.

If we draw an interval of width 2 standard deviations around any one of these sample means, the interval would contain the

true population mean. Thus, 95% of all samples we obtain will have intervals that contain the population mean.

We've just shown how to go from any sample mean — a point estimate — to a range around the sample mean — a 95%

confidence interval. We've also argued that 95% of confidence intervals obtained in this way should contain the true

population mean.

It's important to emphasize: We are not saying that 95% of the time our sample mean is the population mean, but we are

saying that 95% of the time a range that is two standard deviations wide centered around the sample mean contains the

population mean.

To visualize the general concept of a confidence interval, imagine taking 20 different samples from a population and drawing

a confidence interval around each. As the diagram shows, on average 95% of these intervals — or 19 out of 20 — would

actually contain the population mean.

What does this insight mean for us as managers? When we set a confidence level of 95%, we are agreeing to an approach that

1 out of 20 times will give us an interval that does not contain the true population mean. If we aren't comfortable with those

odds, we should raise the confidence level.

If we increase the confidence level to 98%, we have only a 1 out of 50 chance of obtaining an interval that does not contain the

true population mean. However, this higher confidence comes at a cost. If we keep the same sample size, then the confidence

interval will widen, thereby decreasing the accuracy of our estimate. Alternatively, to keep the same interval width, we can

increase our sample size.

How do we know if an interval is too wide? Typically, if we would make a different decision for different values within an

interval, that interval is too wide. Let's look at an example.

To estimate the percent of people in our industry who will attend the annual conference, we might construct a confidence

interval that ranges from 7% to 13%. If we would choose a different conference venue when the true percentage is 7% than when it is 13%, we need to tighten our range.

Now, before we are ready to actually create our own confidence intervals, there is a technical point we need to be acquainted

with. We need to know that the standard deviation of the Distribution of Sample Means is σ, the standard deviation of the

underlying population, divided by the square root of n, the sample size.

We won't prove this fact here, but simply note that it is true, and that it should confirm our general intuition about the

Distribution of Sample Means. For example, if we have huge samples, we'd expect the means of those large samples to be

tightly clustered around the true population mean, and thereby form a narrow distribution.

Summary

A confidence interval is an estimate for the mean of a population. It specifies a range that is likely to contain the population

mean. A confidence interval is centered at the mean of a sample randomly drawn from the population under study. When

we use a confidence level of 95%, we expect the intervals constructed around 95 out of 100 such sample means to contain the population mean.

You understand the theory behind a confidence interval. But how do you actually construct one?

We can now translate the previous discussion into a simple method for finding a confidence interval for the mean of any

population. First, we randomly select a sample of size at least 30 from the population.

Next, we assign the sample mean as the center of the confidence interval.

To find the width of the interval, we must know the level of confidence we want to assign to the interval. If we want a 95%

confidence interval, the interval should be 2 times the standard deviation of the population divided by the square root of n,

the sample size.

Since we typically don't know the standard deviation of the population, we substitute the best estimate that we do have — the

standard deviation of the sample.


Here's what the equation looks like for our example: the interval runs from (sample mean) - 2*s/sqrt(n) to (sample mean) + 2*s/sqrt(n), where s is the sample standard deviation and n is the sample size.

If we want a level of confidence other than 95%, instead of multiplying s/sqrt(n) by 2, we multiply by the z-value

corresponding to the desired level of confidence.

We can use this formula to compute any confidence interval. There is one restriction: in order for it to work, the sample size

has to be at least 30.

Let's walk through an example. Wine Lover's Magazine's managers have asked us to help them estimate the average age of

their subscribers so they can better target potential advertisers.

We tell them we plan to survey a sample of their subscribers. They say they're comfortable with our working with a sample,

but emphasize that they want to be 95% confident that the range we give them contains the true average age of its full set of

subscribers.

We obtain survey results from 60 randomly-chosen subscribers and determine that the sample has a mean of 52 and a

standard deviation of 40.

To find an appropriate confidence interval, we incorporate information about the sample into the formula: 52 ± z*40/sqrt(60).

The z-value for a 95% confidence interval is about 2, or more accurately, about 1.96. This tells us that a 95% confidence

interval would begin at 52 minus 10.12, or 41.88, and end at the mean plus 10.12, or 62.12.

We give management the range from 41.88 to 62.12 as an estimate of the average age of its subscribers, telling them they

can be 95% confident that the true population mean falls between these values.
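As a check on the arithmetic, the same calculation takes a few lines of Python (nothing here beyond the formula above):

    import math

    n, sample_mean, s, z = 60, 52, 40, 1.96

    half_width = z * s / math.sqrt(n)  # about 10.12
    print(round(sample_mean - half_width, 2), round(sample_mean + half_width, 2))
    # 41.88 62.12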

What if we want a confidence level other than 95%? We can use the sample mean, standard deviation, and size from the

sample data, but how do we obtain the right z-value?

The z-value for 95% confidence is well known to be about 2, but how do we find a z-value for a less common confidence

interval? To be 98% confident that our interval contains the population mean, how do we obtain the appropriate z-value?

To find the z-value for 98% confidence level, we are essentially asking: How far to the left and right of the standard normal

curve's mean do we have to go to capture 98% of the area?

Capturing 98% of the area centered at the mean of the normal curve leaves two areas at the tails, each covering 1% of the

area under the curve. The z-value of the right boundary is the z-value associated with a cumulative probability of 99% —

the sum of the central 98% and the 1% in the left tail.

Converting the desired confidence level into the corresponding cumulative probability on the standard normal curve is

essential because Excel's NORMSINV function and the z-table work with cumulative probabilities.

To find the z-value associated with a cumulative probability of 99%, enter into Excel =NORMSINV(0.99), which returns

the z-value 2.33. Or, look in the z table and find the cell that contains a cumulative probability closest to 0.9900. The z-

value is 2.33, the sum of the row-value 2.3 and the column-value 0.03.

Try finding a z-value yourself. Find the z-value associated with a 99.5% confidence level using the appropriate normal

distribution function in Excel. Alternatively, you can use the Standard Normal Table (z-table) in your briefcase.


Our first step is to convert the confidence level of 99.5% into the corresponding cumulative probability on the standard

normal curve. To do this, note that to have 99.5% probability in the middle of the standard normal curve, we must exclude

a total area of 1 - 99.5% = 0.5% from the curve. That area is divided into two equal parts in the distribution's tails: 0.25% in

each tail.

We can now see that the cumulative probability associated with confidence level of 99.5% is 1 − 0.25% = 99.75%. Thus, the

z-value for a confidence level of 99.5% is the same as the z-value of a cumulative probability of 99.75%. We find the z-value

in Excel by entering =NORMSINV(0.9975), which returns the value 2.81. Alternatively, we could find the z-value in the z-

table by looking up the probability 0.9975.
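If Excel isn't handy, the same lookup can be done in Python with scipy (an assumed tool, not part of the course materials); the results match the z-table:

    from scipy.stats import norm

    # 98% confidence -> cumulative probability of 99%
    print(round(norm.ppf(0.99), 2))    # 2.33

    # 99.5% confidence -> cumulative probability of 99.75%
    print(round(norm.ppf(0.9975), 2))  # 2.81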

Summary


To calculate a confidence interval, we take a sample, compute its mean and standard deviation, and then build a range

around the sample mean with a specified level of confidence. The confidence level indicates how confident we are that the interval we constructed around the sample mean contains the population mean.

We assumed in our confidence interval calculations that the sample size was at least 30. What if it isn't? What if we have only

a small sample? Let's consider a different survey, one that concerns a delicate matter.

The business manager of a large ocean liner, the Demiurgos, asks for our help. She wants us to find out the value of her

guests' belongings. She needs this value to determine the correct insurance protection in case guest belongings disappear

from their cabins, are destroyed in a fire, or sink with the ship.

She has no idea how valuable her guests' belongings are, but she feels uneasy asking them for this information.

She is willing to ask only 16 guests to estimate the total value of the belongings in their cabins. From this sample, we need

to prepare an estimate.

With a sample size less than 30, we cannot calculate confidence intervals in the same way as with a large sample size. A

small sample increases our uncertainty about two important aspects of our estimate of the population mean.

First, with a small sample, the consequences of the Central Limit Theorem are not assured, so we cannot be sure that the

sample means follow a normal distribution.

Second, with a small sample, we can't be sure that the sample standard deviation is a good estimate of the population

standard deviation.

Due to these additional uncertainties, we cannot use z-values to construct confidence intervals. Using a z-value would

overstate our confidence in our estimate.

Can we still create a confidence interval? Is there a way to estimate the population mean even if we have only a handful of

data points?

It depends: if we don't know anything about the underlying population, we cannot create a confidence interval with fewer

than 30 data points. However, if the underlying population is normally distributed — or even roughly normally distributed

— we can use a confidence interval to estimate the population mean.

In practice, as long as we are sure the underlying population is not highly skewed or extremely bimodal, we can construct a

confidence interval, even when we have a small sample. However, we do need to modify our approach slightly.

To estimate the population mean with a small sample, we use a t-distribution, which was discovered in the early 20th

century at the Guinness Brewing Company in Ireland.

A t-distribution gives us t-values in much the same way as a normal distribution gives us z-values. What is the difference

between the normal distribution and the t-distribution?

A t-distribution looks similar to a normal distribution, but is not as tall in the center and has thicker tails, because it is

more likely than the normal distribution to have values fall farther away from the mean.

Therefore, the normal distribution's "rules of thumb" for 68% and 95% probabilities no longer hold. For example, we must

go more than 2 standard deviations on either side of the mean to capture 95% of the probability for a t-distribution.

Thus, to achieve the same level of confidence, a confidence interval based on a t-distribution will be wider than one based

on a normal distribution. This reinforces our intuition: we have less certainty about our estimate with a smaller sample, so

we need a wider interval to achieve a given level of confidence.

The t-distribution is also different because it varies with the sample size: For each sample size, there is a different t-value

associated with a given level of confidence. The smaller the sample size n, the shorter the height and the thicker the tails of

the t-distribution curve, and the farther we have to go from the mean to reach a given level of confidence.

On the other hand, as the sample size increases, the shape of the t-distribution becomes more and more like the shape of a

normal distribution. Once we reach a sample size of 30, the t-distribution becomes virtually identical to the z-distribution,

so t-values and z-values can be used interchangeably.

Incidentally, we can use the t-distribution even for sample sizes larger than 30. However, most people use the z-

distribution for larger samples, partially out of habit and partially because it's easier, since the z-value doesn't vary based

on the sample size.


To find the right t-value, we first have to identify the t-distribution that corresponds to our sample size. We do this by

finding the number of "degrees of freedom" of the sample, which for our purposes is simply the sample size minus one. If

our sample size is 16, we have 15 degrees of freedom, and so on.

Excel provides a simple function for finding the appropriate t-value for a confidence interval. If we enter 1 minus the

level of confidence we want and the degrees of freedom into the Excel function TINV, Excel gives us the appropriate t-

value.

For example, for a 95% confidence interval and a sample size of n = 16, the Excel function TINV(0.05,15) would return

the value 2.131.
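For reference, here is the equivalent lookup outside Excel, sketched in Python with scipy: TINV takes the total area in the two tails, while scipy's t.ppf takes a cumulative probability, so the two calls below agree.

    from scipy.stats import t

    # Excel: TINV(0.05, 15) returns 2.131.
    # scipy: cumulative probability 0.975 (95% in the middle, 2.5% in each tail).
    print(round(t.ppf(0.975, df=15), 3))  # 2.131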


Once we find the t-value, we use it just like we used the z-value to find a confidence interval. For example, for t = 2.131,

the appropriate confidence interval is: (sample mean) ± 2.131*s/sqrt(n).


If we don't have Excel handy, we can use a t-distribution table to find the t-value associated with the degrees of freedom

and the confidence level we specify. When using different t-value tables, we need to be careful to note which probability

the table depicts.


Some tables report values associated with the confidence level, like 0.95. Others report values based on the area in the

tails, which would be 0.05 for a 95% confidence interval. Our t-table, like many others, reports values associated with a

cumulative probability, so for a 95% level of confidence, we would have to look at a cumulative probability of 97.5%.


Returning to the good ship Demiurgos, let's determine an estimate of the average value of passengers' belongings. The

manager samples 16 guests, and reports that they have an average of $10,200 worth of clothing, jewelry, and personal

effects in their cabins. From her survey numbers, we calculate a standard deviation of $4,800.

We need to double check that the distribution isn't too skewed, which we might expect, since some of the passengers are

quite wealthy. The manager explains that the insurance policy has a limited liability clause that limits a passenger's

maximum claim to $20,000.

Above $20,000, passengers' own homeowners' policies must cover any losses. Thus, in the survey, if a guest reported

values above $20,000, the manager simply reported $20,000 as the value to be covered for our data set.

We sketch a graph of the 16 values that confirms that the distribution is not too asymmetric, so we feel comfortable using

the t-distribution.

Since we have a sample of 16 passengers, there are 15 degrees of freedom. The Excel function =TINV(0.05,15) tells us

that the appropriate t-value is 2.131.


Using the confidence interval formula, the guests' valuables are worth $10,200 plus or minus 2.131 times $4,800 over

the square root of 16. Thus, the width of the confidence interval is 2.131*4,800/4 = $2,557, and we can report that we are

95% confident that the average value of passengers' belongings is between $7,643 and $12,757.
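The whole small-sample calculation fits in a few lines; this is a Python sketch of the formula above, not part of the original course tools:

    import math
    from scipy.stats import t

    n, sample_mean, s = 16, 10200, 4800

    t_val = t.ppf(0.975, df=n - 1)         # about 2.131 for 15 degrees of freedom
    half_width = t_val * s / math.sqrt(n)  # about 2,557
    print(round(sample_mean - half_width), round(sample_mean + half_width))
    # 7642 12758 (rounding t to 2.131 gives the $7,643 to $12,757 reported above)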


What if the manager decides this range is too wide? She will have to survey more guests. Increasing the sample size causes the t-value to decrease, and also increases the size

of the denominator (the square root of n). Both factors narrow the confidence interval.



For example, if she asks 10 more guests, and the standard deviation of the sample does not change, the t-value would

drop to 2.06 and the square root of n in the denominator would increase. The width of the interval would decrease

significantly, from $2,557 to $1,939.


Summary

Confidence intervals can be constructed even with a sample size of less than 30, as long as the population is roughly

normally distributed (or, at least not too skewed or bimodal). To find a confidence interval with a small sample, use a t-

distribution. T-distributions are a set of distributions that resemble the normal distribution, but with shorter heights

near the mean and thicker tails. To find a confidence interval for a small sample size, place the appropriate t-value into

the confidence interval formula.

When we take a survey, we often want a specific level of accuracy in our estimate of the population mean. For example,

when estimating car owners' average spending on car repairs each year, we might want to be 95% confident that our

estimate is within $50 of the true mean.

We know that the sample size of our survey directly affects the accuracy of our estimate. The larger the sample size, the

tighter the confidence interval and the more accurate our estimate. A sample of size n gives us a confidence interval that

extends a distance of d on either side of the mean: d = z*sigma/sqrt(n).

To find the sample size necessary to give us a specified distance d from the mean, we must have an estimate of sigma, the

standard deviation of spending. If we do not have an estimate based on past data or some other source, we might take a

preliminary survey to obtain a rough estimate of sigma.

In this example, we estimate sigma to be $300 based on past experience. Since we want a 95% level of confidence, we set z

= 1.96. To ensure our desired accuracy — that d is no more than $50 — we must randomly sample at least 139 people.

In general, to ensure a confidence interval extends a distance of at most d on either side of the mean, we choose a sample

size n that satisfies n ≥ (z*sigma/d)^2. We can solve for n with simple algebra, or by using the attached Excel utility.
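Translated into code, the sample-size calculation is a one-liner; in this Python sketch, sigma is the $300 estimate from past experience:

    import math

    sigma, d, z = 300, 50, 1.96

    # n must satisfy n >= (z*sigma/d)^2 = (1.96*300/50)^2 = 138.3; round up.
    n = math.ceil((z * sigma / d) ** 2)
    print(n)  # 139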

Summary

When estimating a population mean, we can ensure that our confidence interval extends a distance of at most d on either

side of the mean by choosing an appropriate sample size.

Step-by-Step Guide

Here is a step-by-step process for creating a confidence interval:

First, we choose a level of confidence and a sample size n appropriate to the decision context.

Second, we take a random sample and find the sample mean. This is our best estimate for the population mean.

Third, we compute the sample standard deviation. This is our best estimate of the population standard deviation.

Fourth, we find the z-value or t-value associated with the proper confidence level. If our sample size is at least 30, we find the z-value for our confidence level. If not, we find the t-value for our confidence level, with degrees of freedom = sample size - 1.

Fifth, we calculate the end points of the confidence interval: (sample mean) ± (z or t)*s/sqrt(n).

Summary

Construct confidence intervals using the steps outlined above. With a confidence interval derived from an unbiased

random sample, we can say that the true population mean falls within the interval with the corresponding level of

confidence.

Excel Utility


Click here to open an Excel utility that allows you to create confidence intervals by providing the sample mean, standard

deviation, size, and desired level of confidence. You should enter data only in the yellow input areas of the utility. To ensure

you are using the utility correctly, try to reproduce the results for the Wine Lover's Magazine and the Demiurgos

examples.

The sample you collected earlier has all the data you need to create a confidence interval for Leo's problem.

You take another look at the survey you created earlier for Leo: you sampled 45 guests, and calculated that the average

satisfaction rate of the sample was 4.4, with a standard deviation of 1.54. Using this information, you decide to create a 95%

confidence interval for Leo.


To create a 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by the sample

standard deviation divided by the square root of the sample size. Using the numbers given, you obtain a 95% confidence

interval by going 0.45 points above and below the sample mean of 4.4, which translates into a confidence interval from 3.95

to 4.85.

You meet with Leo and tell him that you can be 95% certain that the population mean falls between 3.95 and 4.85. Leo looks

at your numbers and shakes his head.

That's just not accurate enough for me to make a decision. If the mean is close to 4.85, I'd be happy, but if it's closer to 4, I'm

concerned. Can we narrow the range at all?

Looking over your notes, you think you can give Leo some options.

Why don't you create a larger sample and report the results back to me?

You select another 40 guests at random and ask the hotel operator to conduct the survey for you again. He is able to reach 25

guests. You combine the two samples, which gives a new sample size of 70.

For the combined sample, you find that the new sample mean is 4.5 and the new sample standard deviation is 1.2. Armed

with more data, you create another confidence interval.

We can be 95% certain that the average satisfaction of all hotel guests with the scuba school is between:


To create this 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by the

sample standard deviation divided by the square root of the sample size. Using the new numbers given for the larger sample,

you obtain a 95% confidence interval by going 0.28 points above and below the sample mean of 4.5, which translates into a

confidence interval from 4.22 to 4.78.

Thank you. I am much happier with this result. I have enough information now to decide whether to keep the current scuba

diving school.

Toshi Matsumoto is the Chief Operating Officer of a consumer electronics retailer with over 150 stores spread throughout

Japan. For over a year, the sales of high-end VCRs have lagged, due to a shift towards DVD players.

Just today, Toshi heard that Veetek, a large South Asian electronics retailer, is looking to purchase a bulk shipment of high-

end VCRs.

This would be a perfect opportunity for Toshi to liquidate slow-moving inventory currently languishing on the shelves of

his stores. Before he calls Veetek, he wants to know how many high-end VCRs he can promise. After two days of furious

phone calls, his deputy has gathered data from 36 representative outlets in his retail chain.

The mean high-end VCR inventory in each store polled was 500 units. The standard deviation was 180. Toshi needs you to

find a 95% confidence interval for the average VCR inventory per store. The interval is


Paul Segal manages the pig-farming division of the agricultural company Bowman-Lyons-Centerville. A rumored outbreak

of Pulluscular Pig Disorder (PPD) in one of Paul's herds is on the verge of causing a public relations disaster.

The main symptom of PPD is a shrinking brain, and the only certain way to diagnose PPD is by measuring brain size post-

mortem.

Paul needs to know if his herd is affected by PPD, but he does not want to have to slaughter hundreds of swine to find out.

At the preliminary stage, he can offer no more than 5 prime porkers to be slaughtered and diagnosed.

For the pigs slaughtered, the mean brain weight was 0.18 lbs, with a standard deviation of 0.06 lbs. With 95% confidence,

in what range does the herd's average brain weight lie?


Proportions

The next morning, you and Alice are about to head off to the hotel pool when Leo calls you.

I'm sorry to disturb you, but I have another problem, and I think you might be able to help.

The Kahana is a very popular resort during the summer tourist season. But the number of leisure visitors drops significantly

during the off-season, from September through February and then April through May.

We usually have quite a few room vacancies during that period of time. We expect to have about 200 rooms vacant for

weeklong periods during the slow season this year.

I've developed a new program that rewards our best guests with a special discount if they book a weeklong stay during our

slow period. They won't have complete date flexibility of course, but the steep discount should make the offer attractive for

them.

To see how many of our past guests would accept such an offer, I sent promotional brochures to 100 of them. The deadline by

which they had to respond to the offer has passed. Ten guests responded with the required room deposit prior to the deadline

— that's a solid 10 percent.

I figure if we send out 2,000 promotions, we'll get about 200 responses.

This is a nice idea Leo, but I'm concerned it could backfire. If more than 10% respond to this offer, you might end up

disappointing some of the very guests you're trying to reward. Or, if too many respond and you give them all the discount,

you'll have to turn away customers willing to pay full price.

That is exactly my concern. I wonder how accurate the 10% response rate is. Just because it held for 100 guests, will it hold

for 2,000? What if 11% actually respond to the promotions?

Imagine what would happen if 220 guests responded. I don't want to anger 20 loyal customers by telling them the offer is not

valid, but I also don't want to turn away full paying guests to accommodate the extra 20 guests at a discount.

I'm willing to reserve 200 rooms for these discount weeklong stays during the slow season. How many return guests can I

safely send the discount offer and be confident that no more than 200 will respond?

You can tell that Leo is growing quite comfortable with relying on your statistical methods. He seems almost as interested in

them as he is in your results.

Sometimes, the question we pose to members of a sample calls for a yes or no answer.

We might survey people in a target market and ask if they plan to buy a new car this year. Or survey voters and ask if they

plan to vote for the incumbent candidate for office. Or we might take a sample of the products our plant produced yesterday

and count how many are defective.


Even though our question has only two answers, we still have to address an inherent uncertainty: We know what values our

data can take — yes or no — but we don't know how often each response will be given.

In these cases, we usually convey the survey results by reporting the percentage of yes responses as a proportion, p-bar. This

is our best estimate of p, the true percentage of "yes" responses in the underlying population.

Suppose, for example, that we have posted advertisements in the subway cars on Boston's "Red Line," and want to know what

percentage of all passengers remembers seeing our ad.

We create a proper survey, and ask randomly selected Red Line passengers if they remember seeing our ad. 300 passengers

respond to our survey, of which 100 passengers report remembering the ad.

Then p-bar is simply 33%, which is the number of people that remember the ad, 100, divided by the number of respondents,

300.

The remaining 200 passengers, or 67% of the sample, report not remembering the ad. The two proportions always add up to 1

because survey respondents report either remembering the ad or not.

Once we know the proportion of the sample, we can draw conclusions about all Red Line passengers. Our best estimate, or

point estimate, for p, the percentage of all passengers who remember seeing our ad, is 33%.

As managers, we typically want more than this simple point estimate — we want to know how accurate the estimate is. How

far from 33% might the true percentage be? Can we say confidently that it is between 30% and 36%, for example?

When we work with proportions, how do we find a confidence interval around our point estimate?

The process for creating a confidence interval around a proportion is nearly identical to the process we've used before. The

only difference is that we can approximate the standard deviation of the population with a simple formula rather than

calculating it directly from the raw data.

Based on our sample, our best estimate of the true population proportion is p-bar, the percentage of "yes" responses in our

survey. Statistical theory tells us that our best estimate of the standard deviation of the underlying population is the square root of [(p-bar)*(1 - (p-bar))]. We can use this approximate standard deviation to determine a confidence interval for

the proportion.

For our Red Line ad, we approximate the standard deviation with the square root of 0.33 times 0.67, or 0.47. A 95%

confidence interval is 0.33 plus or minus 1.96 times 0.47 divided by the square root of 300. This is equal to 0.33 plus or

minus 0.053, or 27.7% to 38.3%.
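Here is the Red Line calculation as a Python sketch, including the sample-size rule of thumb that the next section discusses:

    import math

    n = 300
    p_bar = 0.33  # 100 out of 300, rounded to two decimals as in the text

    # Rule of thumb (see below): need n*p_bar >= 5 and n*(1 - p_bar) >= 5.
    assert n * p_bar >= 5 and n * (1 - p_bar) >= 5

    sd_estimate = math.sqrt(p_bar * (1 - p_bar))    # about 0.47
    half_width = 1.96 * sd_estimate / math.sqrt(n)  # about 0.053
    print(round(p_bar - half_width, 3), round(p_bar + half_width, 3))
    # 0.277 0.383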

Unfortunately, there is one catch when we calculate confidence intervals around proportions...

Sample Size

Sample size matters, particularly when dealing with very small or very large proportions. Suppose we are sampling New Yorkers to estimate the prevalence of Amyotrophic Lateral Sclerosis, commonly known as Lou Gehrig's Disease. In the U.S., the odds of having the

disease are less than 1 in 10,000. Would our sample be useful if we surveyed 100 people?

No. We probably wouldn't find a single person with the disease in our sample. Since the true proportion is very small, we

need to have a large enough sample to make sure we find at least a few people with the disease. Otherwise, we will not have

enough data to get a good estimate of the true proportion.

There is a guideline we must meet to make sure that our sample is large enough when estimating proportions. Two

conditions must be met: First, the product of the sample size and the proportion must be at least 5. Second, the product of

the sample size and 1 minus the proportion must also be at least 5.

If both these requirements are met, we can use the sample. Essentially, this guideline guarantees that our sample contains

a reasonable number of "yes" and a reasonable number of "no" answers. Our sample will not be useful otherwise.

To avoid an invalid sample, we need to create a large enough sample size to satisfy the requirements. However, since we

don't know the proportion p-bar before sampling, we don't know if the two conditions are met before setting the sample

size. How can we get around this problem?

We can obtain a preliminary estimate of p-bar using either of two methods: first, we can use past experience. For example,

to estimate the rate of Lou Gehrig's disease, we can research the rate of occurrence in the general population. This is a

reasonable first estimate for p-bar.

In many cases, however, we are sampling for the first time. Without past experience, we don't know what p-bar might be.

In this case, it may well be worth our time to take a small test sample to estimate the proportion, p-bar.


For example, if the proportion of yes answers in our small test sample is 3%, then we can use 3% as our preliminary

estimate of p-bar.

Substituting 3% for p-bar in our two requirements, n(p-bar) ≥ 5 and n(1 - (p-bar)) ≥ 5, tells us that n must satisfy n*0.03 ≥

5 and n*0.97 ≥ 5. Thus the sample size we need for our real sample must be at least 167.

We would then use a real sample — with at least 167 respondents — to find an actual sample value of p-bar to create a

confidence interval for the population proportion.

Summary

Proportions are often used to indicate the frequency of some characteristic in a population. The sample proportion p-bar is

the number of occurrences of the characteristic in the sample divided by the number of respondents, the sample size. It is

our best estimate of the true proportion in the population. We can construct a confidence interval for the population

proportion. Two guidelines for the sample size must be met for a valid confidence interval: n(p-bar) and n(1 - (p-bar)) must

each be at least five.

Creating confidence intervals around proportions is not much different from creating them around means. Finding the right

number of Leo's promotional brochures to mail should be easy.

Leo needs to know how accurate the 10 percent response rate of his 100-customer sample is. Will this response rate hold for

2,000 guests? To how many guests can he send the discount offer for his 200 rooms?

First, you calculate a 95% confidence interval for the response rate.

Enter the lower bound as a decimal number with two digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if

necessary.


The 95% confidence interval for the proportion estimate is 0.0412 to 0.1588, or 4.12% and 15.88%. You obtain that answer by

using the sample data and applying the familiar formula:

Then, after giving Leo's questions some thought, you recommend to him that he send the mailing to a specific number of

guests.


Based on the confidence interval for the proportion, the maximum percentage of people who are likely to respond to the

discount offer (at the 95% confidence level) is 15.88%. If as many as 15.88% of recipients respond and only 200 rooms are available, how many people can Leo safely mail? Divide 200 by 0.1588 to get the answer: Leo should send the offer to at most 1,259 past customers.
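The two steps, sketched in Python:

    import math

    n, p_bar, z = 100, 0.10, 1.96

    # Upper end of the 95% confidence interval for the response rate.
    half_width = z * math.sqrt(p_bar * (1 - p_bar)) / math.sqrt(n)
    upper = p_bar + half_width
    print(round(upper, 4))  # 0.1588

    # With 200 rooms and a worst-case response rate of 15.88%:
    print(int(200 / upper))  # 1259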

Leo is pleased with your work. He tells you to relax and enjoy the resort.

GMW is a German auto manufacturer that has regional sales subsidiaries throughout the world. Arturo Lopez heads the

Mexican sales division of the company's Latin American subsidiary.

GMW earns additional profit when customers choose to finance their car purchase with a GMW financing package. Arturo

has been asked to submit a report to the GMW CEO in Germany about the percentage of GMW customers who opt for

financing.

Arturo has asked you, a new member of the division sales team, to devise a way to estimate this percentage. You take a

random sample of 64 cars sold in the Mexican sales division, and find that 13 of them, or about 20.3%, opted for GMW

financing.

If you want to be 95% confident in your report to Mr. Lopez, you should tell him that the percentage of all Mexican

customers opting for GMW financing falls in the range:



Kayleigh Marlon is the Chief Buyer at Tar-Mart, a company that operates a chain of superstores selling discount

merchandise. Tar-Mart has a huge national presence, and manufacturers compete fiercely to get their products onto Tar-

Mart's shelves.

Crown Toothpaste, a new entrant in the toothpaste market, is one of them. Kayleigh agreed to stock Crown for 4 weeks and

display it prominently. After that period, she will stop stocking Crown unless 5% of Tar-Mart's customers bought Crown or

were considering buying Crown within the next month.

The trial period is now over. Kayleigh has asked you to take a sample of customers to see if Tar-Mart should continue

stocking Crown. She would like you to be at least 95% confident in your answer.

The first step is to decide how large a sample size to choose. Kayleigh tells you that, in the past, when Tar-Mart introduced

a new product, the percentage of people who expressed interest ranged between 2% and 10%. What sample size should you

use?

Since interest could run as low as 2%, satisfying n(p-bar) ≥ 5 in the worst case requires n ≥ 5/0.02 = 250. You choose a sample size of 250. After conducting the survey, you find that 10 out of 250 people surveyed had bought

Crown or were considering buying Crown within the next month. What is the 95% confidence interval for the population

proportion?


First, you find the sample proportion: 10 out of 250 is a proportion of 4%. You verify that n(p-bar) = 250*0.04 = 10 ≥ 5 and

n(1 - (p-bar)) = 250*0.96 = 240 ≥ 5. Then, using the formula, you find the confidence interval around the sample

proportion. The endpoints of that interval are 1.6% and 6.4%.

OO-P-S is a small-package delivery service with worldwide operations. Celine Bedex, VP Marketing, has heard increasing

complaints about late deliveries, and wants to know how many of the shipments are late by one day or more.

Celine would like an estimate of the percentage of late deliveries. In a sample of 256 shipments, 2 were delivered late, a

proportion of about 0.008, or 0.8%. If Celine wants to be 99% confident in the result of a confidence interval calculation,

the interval is: not valid to compute from this sample. Note that n(p-bar) = 256*(2/256) = 2, which is less than 5, so this sample fails our sample-size guideline for proportions, and a larger sample is needed.

Celine collects a new sample, this time of 729 shipments. Of these, 8 were late. Celine can be 99% confident that the

population proportion of late packages is between:


First, calculate the sample proportion for the new sample: 8/729 = 0.011. Then, verify that the new sample size satisfies the

rules of thumb. Both n(p-bar) and n(1 - (p-bar)) are greater than 5.

Using the new sample size and sample proportion, calculate the confidence interval: [0.1%, 2.1%].

Hypothesis Testing

Introduction

After finishing the sampling assignments, you and Alice decide to take some time off to enjoy the beach. Just as you are

gathering your beach gear, Leo gives you another call.

Hi there! Don't let me keep you from enjoying the beach. I just wanted to let you know what I'd like you to help me with next.

I've been working on ideas to increase the Kahana's profits.

Is it possible to increase profits by raising the room prices? That would be an easy solution.

I wish it were that easy. Room prices are extremely competitive and are often the first thing potential guests take into

consideration. So if we increase room prices, I'm afraid we'll have fewer guests. That might put us back where we started from

with profits — or even worse.


Profits here depend on several factors I can influence. The two major ones are room occupancy rates and discretionary spending. "Discretionary spending" is the money guests

spend on non-room amenities. You know, food, drinks, spa services, sports activities, and so on.

As a manager I can affect a variety of factors that influence discretionary spending: the quality of the restaurant, for example,

or the types of amenities offered.

And you'd like us to help you understand your guests' discretionary spending patterns better.

Right. Then I can explore new ways to increase profits on non-room amenities. I can also see if some of my recent efforts to

increase guest spending have paid off.

I'm particularly interested in restaurant operations. I've made some changes to the restaurants recently. For example, I hired

a new executive chef last year. I'd like to know if restaurant revenues per person have changed since then.

I'd also like to find out if the renovation of our premier cocktail lounge has resulted in higher spending on beverages.

Finally, I've been wondering if discretionary spending patterns are different for leisure and business guests. If so, I might

change our marketing campaigns to better suit each of those market segments.

We don't have a consolidated report for this year yet, so we'll need to conduct some surveys and analyze the results.

You're really getting into these statistical methods, aren't you, Leo?

Definition

Leo made some important changes to his business and he has some ideas of what the impact of these changes has been. How

do you put his ideas to the test?

As managers, we often need to put our claims, ideas, or theories to the test before we make important decisions. Based on

whether or not our claim is statistically supported, we may wish to take managerial action.

Hypothesis testing is a statistical method for testing such claims. A hypothesis is simply a claim that we want to

substantiate. To begin, we will learn how to test hypotheses about population means.

For instance, suppose we know that the historical average number of defects in a production process is 3 defects per 1,000

units produced. We have a hunch that a certain change to the process — a new machine, say — has changed this number. The

hypothesis we wish to substantiate is that the average defect rate has changed — that it is no longer 3 per 1,000.

How do we conduct a hypothesis test? First, we collect a random sample of units produced by the process. Then, we see

whether or not what we learn about the sample supports our hypothesis that the defect rate has changed.

Suppose our sample has an average defect rate of 2.7 defects per 1,000. Based on this sample, can we confidently say that the

defect rate has changed?

That depends. To find out, we construct a range around the historical defect rate of 3 — the population mean that has been

cast in doubt. We construct the range so that if the mean defect rate in the population is still 3, it is very likely for the

mean of a sample taken from the population to fall within that range.

The outcome of our test will depend on whether 2.7, the mean of the sample we have taken, falls within the range or not.

If the sample mean of 2.7 falls outside of the range, we feel comfortable rejecting the hypothesis that the defect rate is still 3.

However, if the sample mean falls within the range, we don't have enough evidence to support the claim that the defect rate

has changed.

This example captures the essence of hypothesis testing, but we need to formalize our intuition about the example and define

our new statistical technique more precisely.

To conduct a hypothesis test, we formulate two hypotheses: the so-called null hypothesis and the alternative

hypothesis.

Based on experience or conventional wisdom, we have an initial value of the population mean in mind. The null hypothesis

states that the population mean is equal to that initial value: in our example, the null hypothesis states that the current

population mean is 3 defects per 1,000. We use the Greek letter mu to represent the population mean, in this case the current

average defect rate.

The alternative hypothesis is the claim we are trying to substantiate. Here, the alternative hypothesis is that the average

defect rate has changed. Note that the alternative hypothesis states that the null hypothesis does not hold.


As the example suggests, in a hypothesis test, we test the null hypothesis. Based on evidence we gather from a sample, there

are only two possible conclusions we can draw from a hypothesis test: either we reject the null hypothesis or we do not

reject it.

Since the alternative hypothesis states the opposite of the null hypothesis, by "rejecting" the null hypothesis we necessarily

"accept" the alternative hypothesis.

In our example, the evidence from our sample will help us determine whether or not we should reject the null hypothesis that

the defect rate is still 3 in favor of the alternative hypothesis that the defect rate has changed.

Based on our sample evidence, which conclusion should we draw? We reject the null hypothesis if it is highly unlikely that

our sample mean would come from a population with the mean stated by the null hypothesis.

For example, if the sample we drew had a defect rate of 14 per 1,000, we would reject the null hypothesis. Drawing a sample with an average defect rate of 14 from a population with an average defect rate of 3 would be very unlikely.

"We cannot reject the null hypothesis if it is reasonably likely that our sample mean would come from a population with the

mean stated by the null hypothesis. The null hypothesis may or may not be true: we simply don't have enough evidence to

draw a definite conclusion."

For example, if the sample we drew had a defect rate of 3.05 per 1,000, we could not reject the null hypothesis, since it wouldn't be unusual to randomly draw a sample with an average defect rate of 3.05 from a population with an average defect rate of 3.

Note that having the sample's average defect rate very close to 3 does not "prove" that the mean is 3. Thus we never say that

we "accept" the null hypothesis — we simply don't reject it.

It is because we can never "accept" the null hypothesis that we do not pose the claim that we actually want to substantiate as

the null hypothesis — such a test would never allow us to "accept" our claim! The only way we can substantiate our claim is to

state it as the opposite of the null hypothesis, and then reject the null hypothesis based on the evidence.

It is important that we understand exactly how to interpret the results of a hypothesis test. Let's illustrate the two types of

conclusions with an analogy: a US jury trial.

In the US judicial system, the accused is considered innocent until proven guilty. So, the null hypothesis is that the accused is

innocent. The alternative hypothesis is that the accused is guilty: this is the claim that the prosecution is trying to prove.

The two possible outcomes of a jury trial are "guilty" or "not guilty." The jury does not convict the accused unless it is certain

beyond reasonable doubt that the accused is guilty. With insufficient evidence, the jury cannot conclude that the accused

truly is innocent. The jury simply declares that the accused is "not guilty."

Similarly, in a hypothesis test, if our evidence is not strong enough to reject the null hypothesis, then that does not prove that

the null hypothesis is true. We simply have failed to show it is false, and thus cannot reject it.

A hypothesis is a claim or assertion that can be tested. On the basis of a hypothesis test we either reject or leave unchallenged

a particular statement: the null hypothesis.

Alice promises Leo that the two of you will drop by his office first thing in the morning to test if Leo's survey results support

his claims that food and beverage spending patterns have changed.

Summary

We use hypothesis tests to substantiate a claim about a population mean. The null hypothesis states that the population

mean is equal to an initial value that is based on our experience or conventional wisdom. We test the null hypothesis to

learn if we should reject it in favor of our claim, the alternative hypothesis, which states that the null hypothesis does not

hold.

The next morning, Leo explains the measures he has undertaken to increase customer spending on food and beverages. "I'd like

to see if they've had a discernible impact on my guests' restaurant-related spending patterns."

Last year, I made two major changes to restaurant operations: I brought in a new executive chef and renovated the main

cocktail lounge.

The chef introduced a new menu: a fusion of traditional Hawaiian and French cuisine. She put some elaborate items on the

menu, like that mango and brie tart I recommended to you. She also has offerings that cater to simpler tastes. But the

question is, have restaurant profits been affected by the new chef?


Since we set our food margins as a fixed percentage of food revenue, I know that if revenues have increased, profits have

increased too. Based on last year's consolidated reports, the average spending on food per person per day was $55. I'm

curious to see if that has changed.

In addition, I renovated the cocktail lounge. The old bar was designed poorly and used space inefficiently. Now more guests

can be seated in the lounge, and more seats have good views of the ocean.

I also invested in a large machine that makes a wide variety of frozen drinks. Pina coladas are very, very popular.

I hope my investments in the bar are paying off in terms of higher guest spending on drinks. Beverages have high margins,

but I'm not sure if beverage sales have increased enough to cover the investments.

Can we say, for beverages, as for food, that "changes in revenues" are a good proxy for "changes in profits?"

Absolutely. I set my profit margins as a fixed percentage of revenues for beverages as well. Last year, the average spending on

beverages per guest per day was $21.

We don't have the consolidated report yet, but I've already had my staff choose a random sample of guests.

We pulled the restaurant and lounge receipts for the guests in the sample and noted three items: total food revenues, total

beverage revenues, and number of guests at the table. Using this information, we should be able to estimate the daily

spending on food and beverages per guest.

You look at Leo's data and wonder how you can discern whether Leo's changes — the new chef and the bar renovations —

have influenced the resort's profits.

Leo has prepared data for you. How are you going to put it to use?

Our first type of hypothesis test is used to study population means. Let's walk through an example of this type of test.

Suppose the manager of a movie theater implemented a new strategy at the beginning of the year: he started showing old

classics instead of recent releases.

He knows that prior to the change in strategy, average customer satisfaction was 6.7 out of a possible 10 points. He would like

to know if average customer satisfaction has changed since he altered his theater's artistic focus.

The manager's null hypothesis states that the current mean satisfaction has not changed; it is still 6.7. We use the Greek letter

mu to represent the current mean satisfaction rating of the theater's entire film-going population.

His alternative hypothesis is the opposite of the null hypothesis: it states that average customer satisfaction is now different.

To substantiate his claim that the mean has changed, the manager takes a random sample of 196 moviegoers. He is careful to

sample across movies, show times, and dates. The mean satisfaction rating for the sample is 7.3, with a standard deviation of

2.8.

Does the fact that the random sample's mean of 7.3 is higher than the historical mean of 6.7 indicate that this year's

moviegoers really are more satisfied?

Or, is the mean still the same, and the manager "just happened" to pick a sample with an unusually high average satisfaction

rating? This is equivalent to asking the question: If the null hypothesis is true — the average satisfaction is still 6.7 — would

we be likely to randomly draw the sample that we did, with average satisfaction 7.3?

To answer this question, we have to first define what we mean by "likely." As in sampling and estimation, we typically use

95% as our threshold level of likelihood.

We then construct a range around the population mean specified by our null hypothesis. The range should be drawn so that if

the null hypothesis is true, 95% of all samples drawn from the population would fall in that range. In other words, we create a

range of likely sample means.

The central limit theorem tells us that the distribution of sample means follows a normal curve, so we can use its familiar

properties to find probabilities. Moreover, the distribution of sample means is centered at our assumed population mean,

mu, and has standard deviation sigma/sqrt(n). We don't know sigma, the underlying population standard deviation, so we

use the sample standard deviation as our best estimate.

As we do when constructing 95% confidence intervals, we create a range extending z*s/sqrt(n) = 1.96*s/sqrt(n) on either side of the mean. However, when we conduct a hypothesis test, we center the range around the mean specified in the null


hypothesis because we always start a hypothesis test by assuming the null hypothesis is true.

In our example, the null hypothesis is that the population mean is 6.7, n is 196, and s is 2.8. Our 95% confidence level translates into a z-value of 1.96. We construct the range of likely sample means:

6.7 ± 1.96 * 2.8/sqrt(196) = 6.7 ± 0.39

This tells us that if the population mean is 6.7, there is a 95% chance that the mean of a randomly selected sample will fall

between 6.3 and 7.1.
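To make the arithmetic concrete, here is a minimal sketch of this calculation in Python. The course itself uses Excel and its utilities; the variable names here are illustrative only.

# Minimal sketch: two-sided range of likely sample means (movie theater example).
# Uses only the Python standard library.
from math import sqrt
from statistics import NormalDist

mu0, s, n = 6.7, 2.8, 196            # null hypothesis mean, sample std dev, sample size
z = NormalDist().inv_cdf(0.975)      # two-sided 95% confidence -> z = 1.96
half_width = z * s / sqrt(n)         # 1.96 * 2.8 / 14 = 0.39
print(f"range: [{mu0 - half_width:.1f}, {mu0 + half_width:.1f}]")   # [6.3, 7.1]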

Now, if we take a sample, and the mean does not fall within the range around 6.7, we can reject the null hypothesis. Why?

Because if the population mean were 6.7, it would be unlikely to collect a sample whose mean falls outside this range.

The region outside the range of likely sample means is called the "rejection region," since we reject the null hypothesis if our

sample mean falls into it. In the movie theater example, the rejection region contains all values less than 6.3 and all values

greater than 7.1.

In this example, the sample mean, 7.3, falls in the rejection region, so we reject the null hypothesis. Whenever we reject the

null hypothesis, we in effect accept the alternative hypothesis.

We conclude that customer satisfaction has indeed changed from the historical mean value of 6.7.

If our sample mean had fallen within the range around 6.7, we could not make a definite statement about moviegoers'

satisfaction. We would not have enough evidence to state that things have changed, but we can never claim that they have

definitely remained the same.

Unless we poll every customer, we'll never know for sure if customer satisfaction has truly changed. Working only with

sample data, there is always a chance that we'll draw the wrong conclusion about the population. We can go wrong in two

ways: rejecting a null hypothesis that is in fact true or failing to reject a null hypothesis that is in fact false. Let's look at the

first of these: the null hypothesis is true, but we reject it.

We choose the confidence level so it is unlikely — but not impossible — for the sample mean to fall in the rejection region

when the null hypothesis is true. In this case, we are using a 95% confidence level, so by unlikely we mean a 5% chance.

However, 5% of all samples from a population with the null hypothesis mean would fall in the rejection region, so when we

reject a null hypothesis, there is a 5% chance we will do so erroneously.

Therefore, when the sample mean falls in the rejection region, we can only be 95% confident that we are justified in rejecting

the null hypothesis. Hence we continue to speak of a confidence level of 95%.

A hypothesis test with a 95% confidence level is said to have a 5% level of significance. A 5% significance level says that there

is a 5% chance of a sample mean falling in the rejection region when the null hypothesis is true. This is what people mean

when they say that something is "statistically significant" at a 5% significance level.

If we increase our confidence level, we widen the range around the null hypothesis mean. At a 99% confidence level, our

range captures 99% of all sample means. This reduces to 1% our chance of rejecting the null hypothesis erroneously. But

doing this has a downside: by decreasing the chance of one type of error, we increase the chance of the other type.

The higher the confidence level, the smaller the rejection region, and the less likely it is that we can reject the null hypothesis

when it is in fact false. This decreases our chance of being able to substantiate the alternative hypothesis when it is true. As

managers, we need to choose the confidence level of our test based on the relative costs of making each type of error.

The range of likely sample means should not be confused with a confidence interval. Confidence intervals are always

constructed around sample means, never around population means. When we construct a confidence interval, we don't even

have an initial estimate of the population mean. Constructing a confidence interval is a process for estimating the population

mean, not for testing particular claims about that mean.

Summary

In a hypothesis test for population means, we assume that the null hypothesis is true. Then, we construct a range of likely

sample means around the null hypothesis mean. If the sample mean we collect falls in the rejection region, we reject the

null hypothesis. Otherwise, we cannot reject the null hypothesis. The confidence level measures how confident we are that

we are justified in rejecting the null hypothesis.

The movie theater manager did not have a strong conviction about the direction of change for customer satisfaction prior

to performing the hypothesis test.

He wanted to test for change in both directions — up or down — and thus he used a two-sided hypothesis test. The null

hypothesis — that no change has taken place — could have been wrong in either of two ways: Customer satisfaction may

have increased or decreased. The two-tailed nature of the test was reflected in the two-sided range we drew around the

population mean.


Sometimes, we may want to know if the actual population mean differs from our initial value of the population mean in a

specific direction. For instance, if the theater manager were quite sure that satisfaction had not decreased, he wouldn't

have to test in that direction; rather, he'd only have to test for positive change.

In these cases, our alternative hypothesis should clearly state which direction of change we want to test for. These kinds of

tests are called one-sided hypothesis tests. Here, we substantiate the claim that the mean has increased only if the sample

mean is sufficiently higher than 6.7, so our rejection region extends only to the right.

Let's outline how to formulate one- and two-sided tests. For a two-sided test we have an initial understanding of the

population: the population mean is equal to a specified initial value.

If we want to substantiate the claim that a population mean has changed, the null hypothesis should state that the mean

still equals that initial value. The alternative hypothesis should state that the mean does not equal that initial value.

If we want to substantiate the claim that the actual population mean is greater than the initial value (the null hypothesis mean), then the null hypothesis should state that the population mean is at most that value. The alternative hypothesis states that the mean is greater than the null hypothesis mean, and the rejection region extends only to the right.

Likewise, if we want to substantiate the claim that a population mean is less than the initial value, the null hypothesis

should state that the mean is at least that initial value. The alternative hypothesis should state that the mean is less than

the null hypothesis mean, and the rejection region extends only to the left.

When we conduct a one-sided hypothesis test, we need to create a one-sided range of likely sample means. Suppose the

theater manager claims that satisfaction improved. As usual, he states the claim he wants to substantiate as his alternative

hypothesis.

The 196-person sample has mean 7.3 and standard deviation 2.8. Does this sample provide sufficient evidence to

substantiate the claim that mean satisfaction increased? To find out, the manager creates a one-sided range: he assumes

the population mean is the null hypothesis mean, 6.7, and finds the range that contains the lower 95% of all sample means.

To find this range, all he needs to do is calculate its upper bound: the value below which 95% of all sample means would fall.

To find out, we use what we know about the cumulative probability under the normal curve: a cumulative probability of

95% corresponds to a z-value of 1.645.


Why is this different from the z-value for a two-sided test with a 95% confidence level? For a two-sided test, the z-value

corresponds to a 97.5% cumulative probability, since 2.5% of the probability is excluded from each tail. For a one-sided

test, the z-value corresponds to a 95% cumulative probability, since 5% of the probability is excluded from the upper tail.


We now have all the information we need to find the upper bound on the range of likely sample means: 6.7 + 1.645 * 2.8/sqrt(196) = 6.7 + 0.33 ≈ 7.0.

The rejection region is everything above the value 7.0. The sample mean falls in the rejection region, so the manager rejects the null hypothesis. He is confident that customer satisfaction is higher.
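A similar sketch, assuming the same sample statistics, computes the one-sided bound:

# Minimal sketch: one-sided (upper) range of likely sample means (same example).
from math import sqrt
from statistics import NormalDist

mu0, s, n, x_bar = 6.7, 2.8, 196, 7.3
z = NormalDist().inv_cdf(0.95)       # one-sided 95% confidence -> z = 1.645
upper = mu0 + z * s / sqrt(n)        # 6.7 + 1.645 * 0.2, about 7.0
print("reject H0" if x_bar > upper else "cannot reject H0")   # 7.3 > 7.0 -> reject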

Summary

When we want to test for change in a specific direction, we use a one-sided test. Instead of finding a range containing

95% of all sample means centered at the null hypothesis mean, we find a one-sided range. We calculate its endpoint

using the cumulative probability under the normal curve.

The Excel Utility link below allows you to perform hypothesis tests for single populations.

Make sure you do at least one example by hand to ensure you thoroughly understand the basic concepts before using the

utility. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to

reproduce the results for the theater manager's example.

A single-population hypothesis test tests a claim using a sample from a single population. With a plan in mind, you take a

look at Leo's sample data.


You are ready to analyze the impact of the changes Leo has made to his restaurant operations. You draw a table to organize

the data from your sample on daily guest spending on restaurant food.

One change Leo made to his restaurant operations was to hire a new chef. He wants to know whether average restaurant

spending per guest has changed since she took over the menu and the kitchen. This is a clear case for a hypothesis test.

Last year's average spending on food per person was $55; this gives you an initial value for the mean.

Leo wants to know if mean spending has changed, so you use a two-sided test. You jot down your null hypothesis, which

states that the average revenue per guest is still $55.

If the null hypothesis is true, the difference between the sample mean of $64 and the initial value of $55 can be accounted for

by chance.

Next, you assume that the null hypothesis is true: the population mean is $55. Now you need to construct a range of likely

sample means around $55 and ask: does the sample mean of $64 fall within that range? Or does it fall in the rejection region?

Leo didn't specify what level of confidence he wanted for your results. You call him for clarification.

I suppose a 95% confidence level is okay. I'd like to be more confident, of course.

After you point out that higher confidence would reduce his chances of being able to substantiate a change in spending if a

change has taken place, he agrees to 95%. You pull out your trusty calculator and get ready to compute a range around the

null hypothesis mean of $55. Consulting your notes, you find the correct formula for the range of likely sample means:

55 ± 1.96 * s/sqrt(n)

Using the sample statistics, you calculate the endpoints of the range containing 95% of all sample means.

You pause for a moment to reflect on the interpretation of this range. Suppose the null hypothesis is true. Then 19 out of 20

samples of this size from the population of hotel guests would have means that would fall in the calculated range.

The sample mean of $64 falls outside of this range. You and Alice report your results to Leo.

Looks like hiring that chef was a good decision. The evidence suggests that mean spending per person has increased.

I'm glad to hear it. Now what about renovating the bar? Can you run a similar test to see if that has affected average beverage

spending?

Leo emphasizes that he can't imagine that his investments in the bar could have reduced average beverage spending per

guest. He wants to know if spending has gone up. You decide to do a one-sided test.

First, you write down all of Leo's data, along with the hypotheses:

You need to find an upper bound such that 95% of all sample means are smaller than it. To do so, you use a z-value of 1.645.

The upper bound is $24.29.

What is the correct interpretation of this number? Given that the null hypothesis is true, 95% of all sample means would fall at or below $24.29.

The range of likely sample means contains the collected sample mean of $24. This tells you that you cannot reject the null hypothesis.

How is this possible? Why hasn't renovating the bar increased revenues? Even if the frozen drink machine didn't pay off,

shouldn't the increase in seats have helped?

First of all, we haven't concluded that average revenue has not increased. We just can't be sure that it has. The fact that our

sample mean is $24 vs. $21 last year does not allow us to say anything definitive about the change in average beverage

revenue.

Remember, we set out to substantiate our hypothesis that spending has improved. Based just on this sample, we are unable

to conclude that spending has increased.

You added seats and now more people can be seated in your lounge. But a greater number of guests does not necessarily

translate into more spending per person.

Your overall revenues may have actually increased, because more guests can be seated in the lounge.


Gosh, I'm glad to hear that. For a moment there, I thought I had made a really bad investment. I'm quite optimistic I'll see a

jump in total beverage revenues in the consolidated report at the end of the year.

Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes toasted pretzel snacks, and

advertises that these pretzels contain an average of 112 calories per serving.

In a recent test, an independent consumer research organization conducted an experiment to see if this claim was true.

Somewhat to their surprise, the researchers found that the average calorie content in a sample of 32 bags was 102 calories

per serving. The standard deviation of the sample was 19.

Blanche would like to know if the calorie content of Oma's pretzels really has changed, so she can market them

appropriately. With 99% confidence, do these data indicate that the pretzels' calorie content has changed?


You begin any hypothesis test by formulating a null and an alternative hypothesis. The null hypothesis states that the

population mean is equal to the initial value. In this problem, the null hypothesis is that the caloric content in the actual

population is what Oma's has always advertised.

The alternative hypothesis should contradict the null hypothesis. For a two-sided test, the alternative hypothesis simply

states that the mean does not equal the initial value. A two-sided test is more appropriate in this problem, since Blanche

only wants to know if the mean calorie content has changed.

You assume that the null hypothesis is true and construct a range of likely sample means around the population mean. Using the data and the appropriate formula, 112 ± 2.576 * 19/sqrt(32) ≈ 112 ± 8.7, you find the range [103.3; 120.7].

The sample mean of 102 falls outside of that range, so you can reject the null hypothesis. Blanche can be 99% confident

that the population mean is not 112.
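As a check on the arithmetic, here is a minimal Python sketch of this 99% range; the variable names are illustrative, not part of the course materials.

# Minimal sketch: 99% two-sided range for the pretzel example.
from math import sqrt
from statistics import NormalDist

mu0, s, n, x_bar = 112, 19, 32, 102
z = NormalDist().inv_cdf(0.995)      # two-sided 99% confidence -> z = 2.576
hw = z * s / sqrt(n)                 # about 8.7 calories
print(f"range: [{mu0 - hw:.1f}, {mu0 + hw:.1f}]")                 # about [103.3, 120.7]
print("reject H0" if abs(x_bar - mu0) > hw else "cannot reject H0")   # reject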

Why might Blanche have chosen a 99% confidence level rather than the more typical 95% level for her test?


The Clearwater Power Company produces electrical power from coal. A local environmental group claims that Clearwater's

emissions have raised sulfur dioxide levels above permissible standards in Blue Sky, the town downwind of the plant.

According to Environmental Protection Agency standards, an acceptable average sulfur dioxide level is 30 parts per billion

(ppb). As Clearwater's PR consultant, you want to defend the company, and you try to anticipate the environmentalist's

argument.

The environmental group collects 36 samples on randomly selected days over the course of a year. It finds a mean sulfur

dioxide content of 35 ppb with a standard deviation of 24 ppb.

The environmentalist group will use a hypothesis test to back up its claim that the sulfur dioxide levels are higher than

permitted. Which of the following is an appropriate null hypothesis for this problem?


The environmentalists' claim is that sulfur dioxide levels are higher, so they will want to run a one-sided test. The

alternative hypothesis states that the sulfur dioxide levels are above the accepted standard. We assume they will choose a

95% confidence level.


They calculate the one-sided range around the null hypothesis mean that contains 95% of all sample means. The z-value for a one-sided 95% range is 1.645. The upper bound on the range of likely sample means is 30 + 1.645 * 24/sqrt(36) = 36.58 ppb.

The sample mean of 35 ppb falls below this upper bound, inside the range of likely sample means, so the environmentalists cannot reject the null hypothesis. The data do not substantiate their claim that average sulfur dioxide levels exceed the standard.

You are the plant manager of a Neshey's chocolate factory. The shop was flooded during the recent storms. The machine

that wraps Neshey's popular chocolate confection, Smooches, still works, but you are afraid it may not be working at its

former capacity.

If the machine isn't working at top capacity, you will need to have it replaced.

The hourly output of the machine is normally distributed. Before the flood, the machine wrapped an average of 340

Smooches per hour. Over the first week after the flood, you counted wrapped Smooches during 32 randomly selected one-

hour periods. The machine averaged 318 Smooches per hour, with a standard deviation of 44.

You conduct a one-sided hypothesis test using a 95% confidence level. According to your calculations, what should you do?

The null hypothesis is that µ ≥ 340. The alternative hypothesis is that µ < 340, since you are using a one-tailed test and you suspect that the new population mean is lower than the population mean before the flood.

Identify the relevant values. The sample size n=32. The standard deviation s=44. The appropriate z-value is 1.645 if you

want to capture 95% of all sample means in a one-sided range around the null hypothesis mean.

Use the formula and calculate the lower bound: 340 - 1.645 * 44/sqrt(32) ≈ 327. The sample mean of 318 falls well outside of the calculated range of

likely sample means. You accept this as strong evidence against the null hypothesis, substantiating the alternative

hypothesis that the mean output rate has dropped. You should replace the machine.
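A minimal sketch of this test in Python, assuming the figures above:

# Minimal sketch: one-sided (lower) test for the Smooches machine.
from math import sqrt
from statistics import NormalDist

mu0, s, n, x_bar = 340, 44, 32, 318
z = NormalDist().inv_cdf(0.95)       # one-sided 95% confidence -> z = 1.645
lower = mu0 - z * s / sqrt(n)        # about 327
print("replace the machine" if x_bar < lower else "cannot reject H0")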

Happy with your work on restaurant spending, Leo jumps right into the next problem. "It's not just the revenue of the

restaurants that I care about," Leo says, "It's also my guests' satisfaction with their restaurant experience."

When I go out to eat, I expect more than just excellent food. The whole dining experience is essential — everything from the

service, to the décor, to the design and quality of the silverware.

And it's not just that all of these factors must be excellent individually — they have to fit together. The restaurant has to have

ambiance! I'm sure my guests have similar expectations, and I want to be sure my restaurant meets them.

Since my new chef introduced more sophisticated cuisine, I made some changes to the décor that I think have improved the

ambiance.

It took me a long time and a substantial amount of money to get everything right, but I'm pleased with the result: the

restaurants are elegant and distinctly Hawaiian. Just like the new chef's cuisine.

In the past, I've contracted a local market research firm to conduct surveys, asking guests to rate the Kahana's restaurants'

ambiance on a scale of one to five.

Historically, the percentage of people that rated ambiance the top score of 5 gave me a good idea of how well we were doing.

That percentage has been very high: 72%.

I've collected this year's data for you. Can you figure out if my guests are happier with my restaurants' ambiance?

Alice tells you that testing Leo's claim about a proportion will be very similar to testing a mean.

Often the summary statistic we want to make a claim about is a proportion. How do we test a hypothesis about a population

proportion instead of a population mean?

We know from our work with confidence intervals that the processes for estimating population proportions and population

means are virtually identical. Similarly, hypothesis tests for proportions are much like hypothesis tests for means.

Because we are examining a population proportion instead of a population mean, we use slightly different notation: we use a

lower case p to represent the population proportion in place of µ for a population mean. We construct a hypothesis test to

test a claim about the value of p.


Again, we formulate null and alternative hypotheses. Based on conventional wisdom or past experience, we have an initial

understanding of the population proportion. The null hypothesis for a proportion test states the initial understanding. For

example, in a two-sided test, the null hypothesis asserts that the population proportion, p, is equal to the initial value we had

in mind.

The alternative hypothesis is the claim we are using the hypothesis test to substantiate. The alternative hypothesis typically

states the opposite of the null hypothesis: it states that our initial understanding is incorrect.

As with population means, we collect a random sample and calculate the sample proportion, "p-bar." However, for a

hypothesis test about a population proportion, we don't need to calculate a standard deviation from the sample.

Statistical theory tells us that σ, the standard deviation of the underlying population, is the square root of p*(1 - p). Since we always start the test assuming the null hypothesis is true, we calculate σ using the null hypothesis proportion.

Analogously to population mean tests, we create a range of likely sample proportions around the null hypothesis proportion. To create the range, we substitute sqrt(p*(1 - p)) for σ in the usual formula: p ± z * sqrt(p*(1 - p))/sqrt(n).

If our sample proportion falls outside the range of likely sample proportions, we reject the null hypothesis. Otherwise, we

cannot reject the null hypothesis.
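If you prefer code to the Excel utility, here is a small helper sketching the same calculation in Python; the function name and defaults are our own, not part of the course.

# Sketch of a helper for proportion tests (illustrative, standard library only).
from math import sqrt
from statistics import NormalDist

def proportion_bounds(p0, n, confidence=0.95, two_sided=True):
    """Return (lower, upper) bounds of likely sample proportions around p0.

    For a one-sided test, use only the relevant bound."""
    sigma = sqrt(p0 * (1 - p0))                        # std dev under H0
    q = 1 - (1 - confidence) / 2 if two_sided else confidence
    z = NormalDist().inv_cdf(q)
    hw = z * sigma / sqrt(n)
    return p0 - hw, p0 + hw

print(proportion_bounds(0.72, 126))                    # two-sided example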

Summary

In a hypothesis test for population proportions, we assume that the null hypothesis is true. Then, we construct a range of

likely sample proportions around the null hypothesis proportion. If the sample proportion we collect falls in the rejection

region, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis.

Once you understand hypothesis testing for means, using the same techniques on proportions is easy.

By now, you're familiar with the concept of testing a hypothesis. You recognize that Leo's restaurant ambiance problem calls

for a hypothesis test for a population proportion.

Leo wants you to find out if the proportion of his guests that rate restaurant ambiance "excellent" has increased. Historically,

that population proportion has been 0.72. Since Leo wants to see if there has been positive change, you do a one-sided test.

You are doing a one-sided test to see if the proportion of guests rating the restaurant "excellent" has increased. The

alternative hypothesis states that the proportion has increased, and the null hypothesis states that it has not increased.

You look at Leo's data. The sample proportion is 0.81 and the sample size is 126.


Here's how you find the standard deviation for a proportion problem: σ = sqrt(p*(1 - p)). Using the null hypothesis proportion, you calculate the standard deviation: sqrt(0.72 * 0.28) ≈ 0.45.

Leo wanted you to use a 95% confidence level. Now you're ready to construct a range of likely sample proportions around the null hypothesis value of the population proportion: 0.72.

Find the range of likely sample proportions around the null hypothesis proportion, and formulate a short answer for Leo.


A one-sided test calls for a one-sided range of likely sample proportions. You need to find the upper bound for this range such

that the range captures the lower 95% of the sample proportions.

The z-value for a one-sided 95% confidence level is 1.645. Substitute the null hypothesis proportion, 0.72, for p. The upper bound for the range containing the lower 95% of all sample proportions is 0.72 + 1.645 * 0.45/sqrt(126) ≈ 0.79.

Since the sample proportion 0.81 falls in the rejection region, you reject the null hypothesis. The data provide sufficient evidence that the population proportion has, in fact, increased.
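A minimal sketch of Leo's ambiance test in Python, assuming the sample figures above:

# Minimal sketch: one-sided proportion test for the ambiance ratings.
from math import sqrt
from statistics import NormalDist

p0, n, p_bar = 0.72, 126, 0.81
sigma = sqrt(p0 * (1 - p0))                                   # about 0.45
upper = p0 + NormalDist().inv_cdf(0.95) * sigma / sqrt(n)     # about 0.79
print("reject H0" if p_bar > upper else "cannot reject H0")   # 0.81 > 0.79 -> reject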

Alice presents your findings to Leo, telling him that with 95% confidence, the data you collected indicate that the difference

between the historical population proportion and the proportion of the random sample is not due to chance.


The proportion of your guests that rate the restaurants' ambiance as "excellent" has increased.

Luther Lenya, the new product guru of The Ventura Automotive Insurance Company, is considering marketing a special

insurance package to members of certain professional groups.

To find out what rate to charge for this package, Luther conducts a preliminary study to see if health professionals are less

likely to be involved in car accidents than the rest of his customer base.

If the data indicate that health professionals are less likely to be involved in car accidents, then Ventura can offer health

professionals a lower, more competitive rate.

In the past 5 years, 8.3% of Ventura's customers have been involved in accidents. Which is the correct pair of hypotheses for solving Luther's problem? Since Luther wants to substantiate the claim that health professionals have fewer accidents, the null hypothesis states that the accident proportion among health professionals is at least 0.083, and the alternative hypothesis states that it is less than 0.083.

A sample of 240 customers in the health profession reveals that 12 (5.0%) have had accidents.

If he uses a 95% confidence level, which of the following is the best conclusion Luther can come to?


You need to find a range of likely sample proportions. To find this range, you calculate a standard deviation: sqrt(0.083 * 0.917) ≈ 0.28.

For a one-sided test, a confidence level of 95% corresponds to a z-value of 1.645. The lower bound of the range is 0.083 - 1.645 * sqrt(0.083 * 0.917)/sqrt(240) ≈ 0.054 = 5.4%.

The range of likely sample proportions does not contain 5.0%, so you should reject the null hypothesis. With 95%

confidence, the proportion of health professionals involved in car accidents is lower than the proportion of the overall

population of drivers.

P-Values

After sleeping over your analysis of restaurant operations, Leo seems unsatisfied.

Don't get me wrong, I appreciate your hard work. But look here: these hypothesis tests result in a "reject/don't reject"

decision. If I understand you correctly, it doesn't matter how close to the border of the rejection region our sample statistic

falls: "reject" is "reject."

But can't you tell me more? I want to know how strong the evidence against the null hypothesis is, not just if it is strong

enough.

I'm glad you brought that issue up, Leo. We have a second method of doing hypothesis tests, one that provides a measure of

the strength of the evidence.

P-Values

The evening before, Alice had acquainted you with p-values: "We can use the p-value method of hypothesis testing to make

'reject/not reject' decisions in the same way we have been doing all along. But the p-value also measures the strength of

evidence against a null hypothesis."

In hypothesis tests we've done so far, we first chose the confidence level of the test. The confidence level tells us the

significance level of the test, which is simply 1 minus the confidence level.

Typically, we chose a 5% significance level — a 95% confidence level — as our threshold value for rejection. Assuming that the

null hypothesis is true, we reasoned that certain sample mean values are less likely to appear than others. If the mean of the

sample we collected was sufficiently unlikely to appear (that is, less than 5% likely) we considered the null hypothesis

implausible and rejected it.

Now, rather than simply checking whether the likelihood of collecting our sample is above or below our chosen threshold,

we'll ask: if the null hypothesis is true, how likely is it to choose a sample with a mean at least as far from the null


hypothesis mean as the sample mean we collected?

The "p-value" measures this likelihood: it tells us how likely it is to collect a sample mean that falls at least a certain distance

from the null hypothesis mean.

In the familiar hypothesis testing procedure, if the p-value is less than our threshold of 5%, we reject our null hypothesis.

The p-value does more than simply answer the question of whether or not we can reject the hypothesis. It also indicates the

strength of the evidence for rejecting the null hypothesis. For example, if the p-value is 0.049, we barely have enough

evidence to reject the null hypothesis at the 0.05 level of significance; if it is 0.001, we have strong evidence for rejecting the

null hypothesis.

Let's look at an example. Recall the movie theater manager who wanted to know if the average satisfaction rate for his

clientele had changed from its historical rate of 6.7.

To find out, we constructed the range, 6.3 to 7.1, which would have contained 95% of the sample means if the null hypothesis

mean had still been true. Since the mean of the sample of current moviegoers we collected, 7.3, fell outside of that range, we

rejected the null hypothesis.

Because 7.3 fell in the rejection region, we know that the likelihood of collecting a sample mean as extreme as 7.3 is less than

5% if the null hypothesis is true. Now let's find out exactly how unlikely it is by calculating the p-value.

Calculating the p-value is a little tricky, but we have all the tools we need to do it. Recall that for samples of sufficient size, the

sample means of any population are distributed normally.

To calculate the likelihood of a certain range of sample mean values — in our example, sample mean values greater than 7.3

or less than 6.1 — we just need to find the appropriate area under the distribution curve of the sample means.

To calculate the p-value for this two-sided test, we want to find the area under the normal curve to the right of 7.3 and to the

left of 6.1. The standard deviation in this example is 2.8, and the sample size is 196.

We can calculate this probability by first calculating the z-value associated with the value 7.3: z = (7.3 - 6.7)/(2.8/sqrt(196)) = 0.6/0.2 = 3.

Then, we find the probability of having a z-value less than -3 or greater than 3.

The area to the left of the z-value of -3 is 0.00135. The area to the right of the z-value of +3 is the same size, so the total area is

0.0027. That is our p-value. These areas and the p-value can be found in Excel using the NORMSDIST(-3) function, in the z-

table, or with the Excel utility provided.
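The same p-value can be sketched in Python, where NormalDist().cdf plays the role of Excel's NORMSDIST; this is a standard-library equivalent, not the course's utility.

# Minimal sketch: two-sided p-value for the movie theater example.
from math import sqrt
from statistics import NormalDist

z = (7.3 - 6.7) / (2.8 / sqrt(196))      # z = 3.0
p_value = 2 * NormalDist().cdf(-abs(z))  # both tails: 2 * 0.00135 = 0.0027
print(f"p-value: {p_value:.4f}")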


Our p-value calculation tells us that the probability of collecting a sample mean at least as far from 6.7 as 7.3 is 0.0027. The

p-value is lower than 0.05. Thus, at a significance level of 0.05, we would reject the null hypothesis and conclude that

moviegoers' average satisfaction rating is no longer 6.7.


But the p-value 0.0027 is much smaller than 0.05. Thus, we can reject the null hypothesis at 0.0027, a much lower

significance level. In other words, we can reject the null hypothesis with 99.73% confidence. In general, the lower the p-value,

the higher our confidence in rejecting the null hypothesis.

One-sided hypothesis tests are also easily conducted with p-values. For one-sided tests, the p-value is the area under one side

of the curve. In our movie theater example, if the alternative hypothesis states that the population mean is larger than 6.7, the

p-value is the area under the normal curve to the right of the sample mean of 7.3.

Summary

The p-value measures the strength of the evidence against the null hypothesis. It is the likelihood, assuming that the null

hypothesis is true, of collecting a sample mean at least as far from the null hypothesis mean as the sample actually

collected. We compare the p-value to the threshold significance level to make a reject/not reject decision. The p-value also

tells us how comfortable we can be with that decision.

Now Alice explains the basics of p-values to Leo, so you can present the results of your restaurant revenue hypothesis test

again. This time, you'll be able to give Leo an idea of how strong the statistical evidence is.


Leo wants you to complete the p-value hypothesis test right there in his office. You're a little nervous — you've never had a

client peering over your shoulder when you work. But you oblige him, because you're growing more confident of your

statistical skills.

Looking back at your notes on the problem, you find the data and the hypotheses. You make a mental note that you are doing

a two-sided test to see whether or not average spending on food has changed from its historical level of $55.

When you ran the hypothesis test earlier, I had you use a 95% confidence level. That corresponds to a significance level of 0.5,

right?

To find the significance level corresponding to a confidence level of 95%, simply subtract 95% from 100%, and convert into

decimal notation: 0.05.

After you clarify Leo's mistake, he sits back and lets you finish your analysis without further interruption. First, you find the

appropriate z-value. Enter the z-value as a decimal number with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00").

Round if necessary.


Your sample has a mean (x-bar = $64) that is $9 higher than the assumed population mean, $55. You want to calculate the

likelihood of getting a sample mean that is at least as far from the population mean as x-bar.

That likelihood is not just the tail probability to the right of the sample mean. Sample means on the other side of the normal

curve are just as far from the population mean as x-bar. They must be included, too, when you calculate the p-value for a two-

sided hypothesis test.

The z-value is 3.00, giving a right-tail probability of 0.00135. Doubling the right-tail probability gives you the correct p-value: 0.0027.

All we have to do is compare the p-value to the significance level. The p-value 0.0027 is less than the significance level 0.05.

Our data are statistically significant at the 0.05 level.

Just as we calculated earlier by constructing a range around the null hypothesis mean, the p-value method suggests that we

reject the null hypothesis. With 95% confidence, average food spending per guest has changed.

But now we can also see that the evidence is very strong, because the p-value is much lower than the significance level. We

can claim that food spending has changed at the 0.0027 level of significance. Thanks, you two. I feel much more comfortable

concluding that average guest spending in my restaurant has changed.

In the following exercise you will revisit an earlier problem, this time solving it with the p-value method.

Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes toasted pretzel snacks. Each

bag of pretzels contains one serving, and Oma's advertises that the pretzel snacks contain an average of 112 calories per

serving.

In a recent test, an independent consumer research organization conducted an experiment to see if this claim was true. The

researchers found that the average calorie content in a sample of 32 bags was 102 calories per serving. The standard

deviation of the sample was 19.

Blanche would like to know if the calorie content of Oma's pretzels has really changed, so she can market them

appropriately. At the significance level 0.01, do these data indicate that the pretzels' calorie content has changed?


In this problem, the null hypothesis is that the actual population mean is what Oma's has always advertised. A two-sided

test is more appropriate in this problem, since Blanche only wants to know if the mean calorie content has changed.


Assuming that the null hypothesis is true, you find a z-value for the sample mean of 102 using the appropriate formula: z = (102 - 112)/(19/sqrt(32)) ≈ -2.98.

Using the Excel NORMSDIST function or the Standard Normal Table, you can find the corresponding left-tail probability

of 0.0014. For a two-sided test, you double this number to find the p-value, in this case 0.0028.

Since this p-value is less than the significance level, you can reject the null hypothesis. Moreover, you now can say that you

are rejecting the null hypothesis at the 0.0028 level of significance. You can recommend to Blanche that she have the

labeling changed on the pretzel bags, and adjust her marketing accordingly.
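A minimal sketch of this p-value calculation, assuming the sample statistics above:

# Minimal sketch: the pretzel test redone with a p-value.
from math import sqrt
from statistics import NormalDist

z = (102 - 112) / (19 / sqrt(32))        # about -2.98
p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value, about 0.003
print(p_value < 0.01)                    # True -> reject at the 0.01 level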

In the following challenge you will revisit an earlier problem, this time solving it with the p-value method.

You are the plant manager of a Neshey's chocolate factory. The shop was flooded during the recent storms. The machine

that wraps Neshey's popular chocolate confection, Smooches, still works, but you are afraid it may not be working at its

former capacity.

If the machine isn't working at top capacity, you will need to have it replaced.

The hourly output of the machine is normally distributed. Before the flood, the machine wrapped an average of 340

Smooches per hour. Over the first week after the flood, you counted wrapped Smooches during 32 randomly selected one-

hour periods. The machine averaged 318 Smooches per hour, with a standard deviation of 44.

You conduct a one-sided hypothesis test using a 95% confidence level. According to your calculations, what should you do?


The null hypothesis is that the population mean is no lower than 340. The alternative hypothesis is that the population

mean is now less than 340. You are using a one-tailed test, and you are assuming that the new population mean is lower

than the population mean before the flood.

Identify the values of the relevant quantities. Use the appropriate formula and calculate the z-value: z = (318 - 340)/(44/sqrt(32)) ≈ -2.83.

This z-value corresponds to a left-tail probability of 0.0023. This is the tail you are interested in, since you are conducting a

one-sided test to see if the actual population mean is less than it was in the past. This tail probability is the p-value.

Since the p-value is less than the significance level, you reject the null hypothesis that the population mean is unchanged.

Moreover, you now can say that you are rejecting the null hypothesis at the 0.0023 level of significance. You should replace

the machine.
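Sketched in Python, assuming the same figures:

# Minimal sketch: one-sided p-value for the Smooches machine.
from math import sqrt
from statistics import NormalDist

z = (318 - 340) / (44 / sqrt(32))        # about -2.83
p_value = NormalDist().cdf(z)            # left tail only, about 0.0023
print("replace the machine" if p_value < 0.05 else "cannot reject H0")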

Now satisfied with your analysis of the restaurant, Leo asks you to compare the discretionary spending habits of two categories

of guests: leisure and business.

Every hotel manager wrestles with the problem of stretching limited marketing resources. I want to make sure that I'm wisely

allocating each marketing dollar.

Leisure guests, such as tourists and honeymooners, are especially attracted to Hawaii. Also, many professional associations

like to have their conventions here, so our islands attract business travelers, who mix business and pleasure.

Business travelers pay lower room prices because conferences book rooms in bulk. Bulk reservations are good for me because

they keep my occupancy levels high.

However, I don't have a good sense of whether the discretionary spending of my business guests is different from that of my

leisure guests: they may take fewer scuba lessons but use the spa services more, for example.

Can you help me figure out whether there is any significant difference between leisure and business travelers' discretionary

spending habits? Your conclusions might influence my marketing efforts.

I collected two random samples: one of leisure guests and one of business guests. Not including room, meal, and beverage charges, leisure travelers spent an average of $75 a day, compared to $64 a day for the business travelers.

I knew that the difference between the two averages of the two samples could be due to chance, so I thought I'd have you do a

hypothesis test to find out.


When I was compiling the data for you, I realized that my samples were of different sizes. I was able to get 85 leisure guests to

respond, but only 76 business guests returned my survey.

Which figure will you use as the sample size? Or will you add them together?

I also realized that with these data, you'd have to calculate two sample standard deviations, one for each sample. How do you

go about solving a problem like this?

How do you test whether two populations have different means?

So far, we've used hypothesis tests to study the mean or proportion of a single population. Often, managers want to compare

the means or proportions of two different populations: in this case, we use a two-population hypothesis test. Let's clarify

when we use each type of test.

We conduct single-population tests when we have an initial value for a population mean and want to test to see if it is correct.

Single population tests are especially useful when we suspect that the population mean has changed. For example, we use a

single-population test when we know the historical average of a population and want to test whether that historical average

has changed.

We conduct two-population tests to compare a characteristic of two groups for which we have access to sample data for each

group. For example, we'd use a two-population test to study which of two educational software packages better prepares

students for the GMAT. Do the students using package 1 perform better on the GMAT than the students using package 2?

In two-population tests, we take two samples, one from each population. For each sample, we calculate the sample mean,

standard deviation, and sample size.

We can then use the two sets of sample data to test claims about differences between the two populations. For example, when

we want to know whether two populations have different means, we formulate a null hypothesis stating that the means are

not different: the first population mean is equal to the second.

Let's look at the GMAT software package example more closely. The manager of one educational software company might

wonder if the average GMAT score of students using her software is different from the average GMAT score of students using

the competitor's software.

Since the manager only wants to test if the average GMAT scores are different, she conducts a two-sided hypothesis test for

two populations. The null hypothesis states that there is no difference between the average GMAT scores of the students who

use the two companies' software.

The alternative hypothesis states that the average GMAT scores of the students who use the two companies' software are

different.

We denote the average scores of the two populations by the Greek letter mu and distinguish them with subscripts. Our hypotheses are:

Null hypothesis: mu_1 = mu_2. Alternative hypothesis: mu_1 ≠ mu_2.

To be 95% confident in the result of the test, we use a significance level of 0.05.

We collect two samples, one from each population. We denote the sample means with the familiar x-bar, which we again

distinguish with subscripts.

We are able to collect the GMAT scores of 45 people who used the company's software, and 36 people who used the

competitor's software. As we will see shortly, the different sample sizes will not pose a problem.

The respective sample means are 650 and 630, and the standard deviations are 60 and 50.

Could the two random samples we picked just happen to have different means by chance, even though they came from populations with the same population mean?

The null hypothesis states that there is no difference in the two population means. As with single-population tests, we test the

null hypothesis by asking how likely it would be to produce the sample results if the null hypothesis is in fact true.

That is, if the average GMAT scores for students using the two different software packages actually are the same, what is the

chance that two samples we collect would have sample means as different as 650 and 630?

Our intuition tells us that the greater the difference between the means of the two samples, the more likely it is that the

samples came from different populations. But how do we know when the numerical difference is large enough to be

statistically significant? When do we have enough evidence to actually conclude that the two populations must be different?

We use p-values to answer this question. First, we calculate a z-value for the difference of the sample means, incorporating the data from both populations. It looks a bit complicated:

z = [(x-bar_1 - x-bar_2) - (mu_1 - mu_2)] / sqrt(s_1^2/n_1 + s_2^2/n_2)

Let's compute the z-value for our example. Since we assume that the null hypothesis is true, mu_1 - mu_2 = 0, and we have:

z = (650 - 630) / sqrt(60^2/45 + 50^2/36) = 20/12.2 ≈ 1.64

For a two-sided test, a z-value of 1.64 translates into a probability in one tail of 0.05, and thus a p-value of 0.10.

Since this p-value is greater than the significance level of 0.05, we cannot reject the null hypothesis.

In other words, the high p-value tells us that there is insufficient evidence from the two samples to conclude that the average

GMAT score of the students who use the company's software is different from the average GMAT score of students who use

the competitor's software.
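A minimal sketch of the GMAT comparison in Python; the variable names are ours, not the course's.

# Minimal sketch: two-population test of means for the GMAT example.
from math import sqrt
from statistics import NormalDist

x1, s1, n1 = 650, 60, 45                          # company's software
x2, s2, n2 = 630, 50, 36                          # competitor's software
z = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)     # about 1.64
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # about 0.10
print("reject H0" if p_value < 0.05 else "cannot reject H0")   # cannot reject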

Two-population hypothesis tests can be performed using the formula shown above, or with the Excel utility for hypothesis testing.

Summary

In a hypothesis test for two population means, we assume a null hypothesis: that the two population means are equal. We

collect a sample from each population and calculate its sample statistics. We calculate a p-value for the difference between

the two samples. If the p-value is less than the significance level, we reject the null hypothesis.

Often, managers want to know if two population proportions are equal. For example, a marketing manager of a packaged

snack foods company might want to compare the snack food habits of different states in the US.

The marketing manager might think that the proportion of consumers who favor potato chips in Texas is different from the

proportion of consumers who favor potato chips in Oklahoma.

Comparing two population proportions is similar to comparing two population means. We have two populations: the null

hypothesis states that their proportions are the same; the alternative hypothesis states that they are different.

We collect a sample from each population and calculate its sample size and sample proportion. As in the single population

proportion test, we don't need to find the sample standard deviation, since we know that the population standard deviation

is the square root of

[p*(1 - p)].

Similarly to the hypothesis tests for comparing two population means, we calculate a z-value for the difference between the proportions using the formula below:

z = [(p-bar_1 - p-bar_2) - (p_1 - p_2)] / sqrt[p-bar_1*(1 - p-bar_1)/n_1 + p-bar_2*(1 - p-bar_2)/n_2]

We translate the z-value into a p-value just as we would for any other type of hypothesis test. If the p-value is less than our

significance level, we reject the null hypothesis and conclude that the proportions are different. If the p-value is greater

than the significance level, we do not reject the null hypothesis.

Optional Example

Let's take a closer look at the study of snacking habits in Texas and Oklahoma.

The manager does not wish to test for a particular direction of difference; he just wants to know if the proportions are

different. Thus, he should use a two-sided test.

The marketing manager wants to be 95% confident in the result of this test, so the significance level is 0.05.

Suppose we collect responses from 400 people in Texas and 225 people in Oklahoma. The sample proportions are 45%

and 35%, respectively.

Could the two random samples we picked just happen to have different sample proportions? That is, if the true

proportions of Texans and Oklahomans favoring potato chips actually are the same, what would be the chance that the

sample proportions are 45% and 35% respectively?

We use p-values to answer this question. First, we calculate a z-value for the difference of the sample proportions that incorporates the data from both populations. The null hypothesis states that the population proportions are equal, so their difference is 0:

z = (0.45 - 0.35) / sqrt(0.45*0.55/400 + 0.35*0.65/225) ≈ 2.48

For a two-sided test, a z-value of 2.48 translates into a probability in one tail of 0.0065 and hence a p-value of 0.013.

Since this p-value is less than the significance level of 0.05, we can reject the null hypothesis.


In other words, the low p-value tells us that there is sufficient evidence from the samples to conclude that there is a

difference between the proportions of Texan and Oklahoman potato chip lovers. We can make this claim at a 0.013 level

of significance.
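The same arithmetic, sketched in Python with illustrative variable names:

# Minimal sketch: two-population test of proportions for the potato chip study.
from math import sqrt
from statistics import NormalDist

p1, n1 = 0.45, 400                                 # Texas sample
p2, n2 = 0.35, 225                                 # Oklahoma sample
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = (p1 - p2) / se                                 # about 2.48
p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # about 0.013
print("reject H0" if p_value < 0.05 else "cannot reject H0")   # reject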

Two-population hypothesis tests for population proportions can be performed using the formula shown above, or with the Excel utility for hypothesis tests.

Summary

In a hypothesis test for two population proportions, we assume a null hypothesis: the two population proportions are

equal. We collect two samples and calculate the sample proportions. We calculate a p-value for the difference between

the sample proportions. If the p-value is less than the significance level, we reject the null hypothesis.

Click here to open an Excel Utility that allows you to perform hypothesis tests for two populations.

Make sure you do at least one example by hand to ensure you thoroughly understand the basic concepts before using the

utility. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to

reproduce the results for the GMAT and potato chip examples.

Two-population hypothesis tests help you determine whether two populations have different means. You use a two-

population test to solve Leo's problem.

You have to find out if leisure guests' average daily discretionary spending is different from business guests' average daily

discretionary spending.

Now it's time to state the null hypothesis. The best formulation: the average daily discretionary spending of leisure guests is equal to the average daily discretionary spending of business guests.

You want to find out if the means of two populations — average spending by leisure guests vs. average spending by business

guests — are different. The two samples from those populations have different means: $75 and $64, respectively.

The samples may come from populations with the same means, and the numerical difference is due to chance in getting these

particular samples. It could be that the first sample just happened to have a high mean and the second sample just happened

to have a lower mean.

You test the null hypothesis that the population means have the same value.

You make note of your null hypothesis and the corresponding alternative hypothesis. You use a two-sided test because you

don't have any reason to believe that one type of guest spends more than the other.

At Leo's request you do a p-value test using a significance level of 0.05. To calculate the p-value, you first find the z-value.

Enter the z-value as a decimal number with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.


The z-value is +2.12 or -2.12, depending on how you set up the difference, i.e., in what order you subtract the sample means.

Either way, the final conclusion will be the same.

You use the z-value to calculate the p-value. What is the p-value?


A z-value of 2.12 has a cumulative probability of 0.9830. You subtract this probability from the total probability, 1, for a right-

tail probability of 0.017.

Because we are doing a two-sided test, we want to measure the probability of extreme values on both sides. Thus, we double 0.017 to get a p-value of 0.034.

Since the p-value 0.034 is less than the significance level 0.05, you recommend to Leo that he reject the null hypothesis. The

average daily discretionary spending per person is different for leisure and business guests.


I see. We can tell if two population means are different by running a hypothesis test on their difference. We test the null

hypothesis that there is no difference.

As you see, the p-value is less than the significance level. This tells you that...

...the difference in the two sample means is probably not due to chance. The spending habits of the two types of guests are

different. Got it.

Claude Forbes is a regional manager of The Burger Baron restaurant chain. The Baron would like to open a franchise in

Sappington. Two properties are up for sale, and Claude wants to choose the location that has the most traffic.

On 54 randomly selected days over a six-month period, an average of 92 cars passed by location A during the lunch hour,

from noon to 1 p.m. On 62 randomly selected days, 101 cars passed by location B during the lunch hour. The standard

deviations for the two locations were 16 and 23, respectively.

Is the difference between the amount of traffic at locations A and B statistically significant at the 0.01 level?


First, you set up the hypotheses. The null hypothesis states that there is no difference in mean traffic flow during the lunch

hour at the two locations. Since you are only interested in whether the traffic is different at the two locations, you conduct a

two-sided hypothesis test.

Next, you find the z-value for the difference between the two sample means using the appropriate formula: z = (92 - 101) / sqrt(16^2/54 + 23^2/62) ≈ -2.47.

A z-value of -2.47 corresponds to a left-tail probability of 0.0068. Since you are doing a two-sided test, you must double

this probability to find the correct p-value: 0.0136.

The p-value is higher than the significance level. You cannot reject the null hypothesis. On the basis of these data, Claude

has insufficient evidence to show that one location has more traffic than the other.

The Magical Toy company manufactures a line of wrestling action figures. Maude Troston, the head of the doll department,

must make the final decision regarding which of two models, "Karnivorous Kong" or "Peter the Pipsqueak," should be

discontinued.

Magical Toy ships crates containing equal numbers of Pipsqueaks and Kongs to retail outlets. 45 randomly selected toy

stores have sold an average of 78% of all Pipsqueaks delivered to them. 48 other stores have sold an average of 84% of their

Kongs.

At a significance level of 0.05, do the data indicate that one action figure sells better than the other?

z-table

Utility for Two Populations

Maude wants to know if one figure sells better, but doesn't have a preconception about which figure might be flying off the

shelves. Thus you use a two-sided test, and your alternative hypothesis simply states that there is a difference between the

two population proportions.

Using the appropriate formula and the data, you find the z-value for the difference between the sample proportions. The z-

value is -0.738.

A z-value of -0.738 corresponds to a left-tail probability of 0.2303. You double this to find the p-value for the two-sided

test. The p-value is 0.4606.

The p-value is greater than the significance level. Therefore, you can't reject the null hypothesis. The data do not show

conclusively that one action figure sells better than the other.

Grapefruit Bizarre

The Regal Beverage Company makes the soft drink Grapefruit Bizarre. The marketing department wants to refocus its

energies and resources.

You have been asked to determine if there are regional differences in consumers' response to advertisements for Grapefruit

Bizarre. Specifically, you must find out if the Midwest responds to Grapefruit Bizarre advertisements as well as the West


Coast.

The marketing department is clamoring to start a second campaign. It claims that ads that are effective on the West Coast

do not go over as well in the Midwest. Management demands statistical evidence at a significance level of 0.05.

In the context of a free movie screening, an ad for Grapefruit Bizarre is shown to 173 Midwesterners. The viewers had been

randomly selected, and had not previously tasted the drink.

When asked later, 33% claimed that they were at least mildly interested in trying Grapefruit Bizarre. In a similar survey

conducted on the West Coast, 42% of 152 test subjects claimed at least a mild interest in trying Grapefruit Bizarre.

Enter your z-value as a decimal number with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if

necessary.

z-table

Utility for Two Populations

The z-value is +1.68 or -1.68, depending on how you set up the difference, i.e., in what order you subtract the sample proportions.

A z-value of -1.68 corresponds to a left-tail probability of 0.0465. What do you report to the marketing department?

z-table

Utility for Two Populations

You are conducting a one-sided hypothesis test. The alternative hypothesis states that the population proportion of interested

consumers in the Midwest is less than the population proportion on the West Coast. Therefore, you are interested only in the left-tail probability.

Your p-value is 0.0465.

0.0465 is less than the significance level 0.05. You should reject the null hypothesis. Midwesterners do not respond to the

ads as well as people from the West Coast. Marketing's claims are valid.
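As a check on the one-sided test, here is a hedged Python sketch; it assumes the unpooled standard error for a difference of two sample proportions, sqrt(p1(1-p1)/n1 + p2(1-p2)/n2), which matches the z-value quoted above:

```python
# A sketch of the one-sided z-test for the Grapefruit Bizarre proportions.
# The unpooled standard error is assumed here; a pooled version gives a
# nearly identical z-value for these data.
from math import sqrt
from scipy.stats import norm

p_mw, n_mw = 0.33, 173   # Midwest sample
p_wc, n_wc = 0.42, 152   # West Coast sample

se = sqrt(p_mw * (1 - p_mw) / n_mw + p_wc * (1 - p_wc) / n_wc)
z = (p_mw - p_wc) / se    # about -1.68
p_value = norm.cdf(z)     # one-sided test: left-tail probability only
print(round(z, 2), round(p_value, 4))   # -1.68 0.0468 (table: 0.0465)
```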

Upscale fashion designer Marjorie LeMer must decide from which supplier she should purchase bolts of cloth. Rumor has

it that BlueTex's product is superior to Southern Halifax's.

Random 10-yard sections from 43 bolts of Halifax's cloth contain a mean of 1.8 flaws per yard. Similar sections from 42

bolts of BlueTex's product contain 1.6 flaws per yard. The standard deviations are 0.3 and 0.6, respectively.

Marjorie wants you to find out if the rumors that BlueTex makes a better product are statistically warranted.

You conduct a one-sided test. Which of the following is the best alternative hypothesis?

z-table

Utility for Two Populations

z-table

Utility for Two Populations

You find the z-value for the difference between the two sample means using the appropriate formula. The z-value is -1.94.

The cumulative probability for z=-1.94 is 0.0262. This is the left-tail probability. Since you are running a one-sided test,

0.0262 is your p-value.

0.0262 is greater than 0.01. At this significance level, you would not reject the null hypothesis. 0.0262 is less than 0.05. At

this significance level, you would reject the null hypothesis.

Southern Halifax's product is slightly cheaper than BlueTex's. All other factors being equal, Marjorie would like to buy the

less expensive product. Unless she is 99% confident that there is a difference in quality, she will go with the cheaper cloth.

Based on this information and your calculations, what should Marjorie do?

z-table

Utility for Two Populations

The data are not significant at the 0.01 level, so you can't reject the null hypothesis that the two products are of equal quality.


A significance level of 0.01 corresponds to a confidence level of 99%. So at a 99% confidence level, you can't reject the null

hypothesis. Marjorie can't be 99% confident that BlueTex's product is better than Halifax's.

Since Halifax's product is cheaper and you can't determine a difference in quality at the level of statistical significance

Marjorie requires, you recommend Halifax to Marjorie.

"Good work!" says Alice. "You're ready for a new challenge: investigating relationships between variables."

Regression Basics

Introduction

As you relax in your room during a brief afternoon downpour, your phone rings.

Leo just called. He wants us to come to his office immediately. He sounds a little angry. We'd better not keep him waiting.

I'm sorry if I was short on the phone. I'm very upset. We just had a little incident down in the restaurant. A server spilled a

tureen of crab bisque on one of our most "favored" guests, Mr. Pitt.

The Kahana's occupancy this year has been higher than I expected, and I had to hire extra help from a staffing agency. Those

staffing agencies charge a fortune, which is especially irritating considering that the employees they refer to us are often

poorly suited to customer service in an upscale hotel.

Really, this is my fault for not having a more effective staffing process. I just wish I could predict my needs better. Sometimes,

when demand is lower than I expected, I'm overstaffed. Then I lose money paying idle bellhops. If I had a good sense of my

staffing needs at least a month in advance, I could avoid hiring workers at the last minute and having idle staff.

I had been thinking that the number of advance reservations would give me a good idea of how high my occupancy would be

a month down the road. But clearly advance reservations don't tell me the whole story. I've been making way too many false

predictions.

Is there anything you can do to help me here? What predictions about occupancy can I make based on advance bookings?

And how much can I trust them?

We'll take a look at the data on advance bookings and occupancy and let you know what we find out.

Alice seems confident that the two of you can offer useful advice on Leo's staffing problem:

"This will be a great opportunity for you to learn regression. It's a powerful statistical tool used all the time in business: in

finance, demand forecasting, market research to name just a few areas. I'm sure you'll use it in your MBA program. And it's a

great chance to review what you've learned so far: sampling, confidence intervals, and hypothesis testing all play a part in

regression."

As we have seen, it is often useful to examine the relationship between two variables. Using scatter diagrams, we can

visualize such relationships.

We can learn more about the relationship by finding the correlation coefficient, which measures the strength of the linear

relationship on a scale from -1 to 1.

Regression is a statistical tool that goes even further: it can help us understand and characterize the specific structure of the

relationship between two variables.

Let's look at an example. Julius Tabin owns a small food processing company that produces the spreadable lunchmeat

product EasyMeat. Julius is trying to understand the relationship between his firm's advertising and its sales.

Total sales in the spreadable meat industry have been fairly flat over the last decade, and Julius' competitors' actions have

been quite stable. Julius believes that his advertising levels influence his firm's sales positively, but he doesn't have a clear

understanding of what the relationship looks like.

Let's have a look at data on his firm's advertising and sales over the last 10 years. Click on the Excel link to create the scatter

diagram yourself from an Excel spreadsheet.

EasyMeat Data


Plotting annual sales against annual advertising expenditures gives us a visual sense of the relationship between the two

variables. Looking at the graph, we can see that as advertising has gone up, sales have generally increased. The relationship

looks reasonably linear.

EasyMeat Data

The correlation coefficient for the two variables is 0.93, indicating a strong linear relationship between advertising and sales.

EasyMeat Data

What if we were to draw a line that characterizes this relationship? Which line would best fit the data? Our mind's eye already

sees how the two variables are related, but how can we formalize our visual impression?

EasyMeat Data

Before we start any calculations, let's look at several lines that could describe the relationship.

EasyMeat Data

One of these lines most accurately describes the relationship between the two variables: the "best-fit" or regression line.

EasyMeat Data

In our example, the best-fit line is Sales = -333,831 + 50*Advertising. For this line, the y-intercept is -333,831 and the slope is

50.

EasyMeat Data

In general, a regression line can be described by a simple linear equation: y = a + bx, with y-intercept a and slope b.

EasyMeat Data

In this equation, the y-variable, sales, is called the dependent variable, to suggest that we think Julius' sales depend to

some degree on his advertising. The x-variable, advertising, is called the independent variable, or the explanatory

variable.

EasyMeat Data

When we observe that a change in the independent variable (here advertising) is typically accompanied by a proportional

change in the dependent variable (here sales), regression analysis can identify and formalize that relationship.

EasyMeat Data

Summary

Regression analysis helps us find the mathematical relationship between two variables. We can use regression to describe a

linear relationship: one that can be represented by a straight line and characterized by an equation of the form y = a + bx.

What kinds of questions can regression analysis help answer?

How does regression help us as managers? It can help in two ways: first, it helps us forecast. For example, we can make

predictions about future values of sales based on possible future values of advertising.

Second, it helps us deepen our understanding of the structure of the relationship between two variables by expressing the

relationship mathematically.

EasyMeat Data

Let's talk first about how managers can use regression to forecast. In our example, regression can help Julius predict his

company's sales for a specified level of advertising.

For example, if he plans to spend $65,000 in advertising next year, what might we expect sales to be?

If we didn't know anything about the relationship, but only had the historical data, we might simply note that the last time

Julius spent $65,000 on advertising, his sales were $3,200,200. But is this the best prediction we can make?

EasyMeat Data


Not at all. Regression analysis brings the entire data set to bear on our prediction. In general, this will allow us to make

more accurate predictions than if we infer the future value of sales from a single observation of advertising and sales.

Having identified the relationship between the two variables from the full data set, we can apply our understanding of

that relationship to our forecast.

EasyMeat Data

Using regression analysis, we found the regression line to be Sales = -333,831 + 50*Advertising. If Julius plans to spend

$65,000 in advertising, what would we predict sales to be?

EasyMeat Data

The point on the line shows us what level of sales to expect. In this case, we would expect sales of $2,916,169.

EasyMeat Data
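As a quick check, here is a one-line computation of this forecast from the fitted equation (plain Python, no special libraries):

```python
# Predicting sales from the EasyMeat regression line.
intercept, slope = -333_831, 50
advertising = 65_000
print(intercept + slope * advertising)   # 2916169, i.e., sales of $2,916,169
```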

With regression, we can forecast sales for any advertising level within the range of advertising levels we've seen

historically. For example, even if Julius has never spent exactly $50,000 on advertising, we can still forecast a

corresponding level of sales.

EasyMeat Data

We must be extremely cautious about forecasting sales for values of advertising beyond the range of values we have already

observed. The further we are from the historical values of advertising, the more we should question the reliability of our

forecast.

EasyMeat Data

For example, we might feel comfortable forecasting sales for advertising levels a bit above the observed range, perhaps as

high as $100,000 or $105,000. But we shouldn't infer that if Julius spent $10 million on advertising, he would achieve

$500 million in sales. The total market for spreadable meat is probably much less than $500 million annually!

EasyMeat Data

Likewise, we might feel comfortable forecasting sales for advertising levels just below the observed range. But we certainly

shouldn't report that if Julius spent $0 on advertising, he would have negative sales!

EasyMeat Data

If we try to use our regression equation to forecast sales for advertising levels outside of the historical range, we are

implicitly assuming that the relationship between advertising and sales continues to be linear outside of the historical

range.

EasyMeat Data

In reality, although the relationship may be quite linear for the range of values we've observed, the curve may well level off

for advertising values much lower or much higher than those we've observed. With no observations outside the historical

data range, we simply don't have evidence about what the relationship looks like there.

EasyMeat Data

Another critical caveat to keep in mind is that whenever we use historical data to predict future values, we are assuming

that the past is a reasonable predictor of the future. Thus, we should only use regression to predict the future if the general

circumstances that held in the past, such as competition, industry dynamics, and economic environment, are expected to

hold in the future.

Regression can be used to deepen our understanding of the structural relationship between two variables. If we think about

it, many business decisions are about increasing or decreasing one variable — investments or advertising, for example — to

affect some other variable — productivity, brand recognition, or profits, for example. Regression can reveal the structure of

relationships of this type.

Our regression analysis stipulates a linear relationship between sales and advertising. Understanding "the structure" of this

relationship translates into finding and interpreting the coefficients of the regression equation.

As we've noted in the previous slide, the constant term -333,831 may have no real managerial significance; it just "anchors"

the regression line by telling us the y-intercept. We've never seen advertising levels close to $0, so we cannot infer that

spending no money on advertising will lead to sales of -$333,831!

The more important term is the advertising coefficient, 50, which gives us the slope of the line. The advertising coefficient

tells us how sales have changed on average as advertising has increased.


In the past, when advertising has increased by $10,000, what has been the average corresponding change in sales?

Assuming that the relationship between sales and advertising is linear, each $1 increase in advertising should be

accompanied by the same average increase in sales. In our example, for every incremental $1 in advertising, sales increase

on average by $50. Thus, for every incremental $10,000 in advertising, sales increase on average by $500,000.

The regression line gives us insight into how two variables are related. As one variable increases, by how much does the

other variable typically change? How much growth in sales can we anticipate from an incremental increase in advertising

expenditures? Regression analysis helps managers answer questions like these.

Summary

We use regression analysis for two primary purposes: forecasting and studying the structure of the relationship between

two variables. We can use regression to predict the value of the dependent variable for a specified value of the independent

variable. The regression equation also tells us how the dependent variable has typically changed with changes in the

independent variable.

Per-capita consumption of soft drink beverages is related to per-capita gross domestic product (GDP). Generally, the

higher the GDP of a country, the more soda its citizens consume. Soft drink consumption is measured in number of 8-oz

servings.

Based on data from 12 countries, the relationship can be expressed mathematically as: average soda consumption = 130 + 0.018*(per-capita GDP).

Source

Based on this relationship, you can expect that, on average, for each additional $1,000 of per-capita GDP a country's soda

consumption increases by _____ servings.

The regression equation tells us that in our data set, average soda consumption increases by 0.018 servings for every

additional $1 of per-capita GDP. So, for an additional $1,000, average consumption increases by ($1,000)(0.018

servings/$) = 18 servings.

The per-capita GDP in the Netherlands is $25,034. What do you predict is the average number of servings of soda

consumed in the Netherlands per year?

Enter predicted average soda consumption (in servings) as an integer (e.g., "5"). Round if necessary.

The regression equation tells us that average soda consumption = 130 + 0.018*(per-capita GDP). Therefore, we anticipate

the Netherlands' average soda consumption to be 580.6 servings.

Although the regression predicts a soda consumption of around 581 servings per person for the Netherlands, the actual

measured number of servings consumed is much lower: 362. The discrepancy in the actual and predicted consumption

reinforces that per-capita GDP alone is not a perfect predictor of soda consumption.
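A short sketch makes the prediction and the discrepancy concrete (plain Python):

```python
# Applying the soda-consumption regression to the Netherlands.
intercept, slope = 130, 0.018
gdp_per_capita = 25_034
predicted = intercept + slope * gdp_per_capita   # 580.6 servings
actual = 362                                     # measured consumption
print(round(predicted, 1), round(actual - predicted, 1))   # 580.6 -218.6
```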

A regression line helps you understand the relationship between two variables and forecast future values of the dependent

variable. Alice points out to you that these two features of regression analysis make it a powerful tool for managers who make

important decisions in the uncertain world of business.

But how do you generate a regression line from observed data? Of all the straight lines that you could draw through a scatter

diagram, which one is the regression line?

Let's return to Julius Tabin's sales and advertising data. As we can see from the graph, no straight line could be drawn that

would pass through every point in the data set.

This is not surprising. Typically, advertising is not a perfect predictor of sales, so we don't expect every data point to fall in a

perfect line. The regression line depicts the best linear relationship between the two variables. We attribute the difference

between the actual data points and the line to the influence that other variables have on sales, or to chance alone.

Since the regression line does not pass through every point, the line does not fit the data perfectly. How accurately does the

regression line represent the data?

To measure the accuracy of a line, we'll quantify the dispersion of the data around the line. Let's look at one line we could

draw through our data set.


Let's consider a second line. Click on the line that more closely fits the ten data points.

Although in this example we can see which of two lines is more accurate, it is useful to have a precise measure of a line's

accuracy.

To quantify how accurately a line fits a data set, we measure the vertical distance between each data point and the line.

Why don't we measure the shortest distance between the point and the line — the distance perpendicular to the line? Why

do we measure vertically?

We measure vertical distance because we are interested in how well the line predicts the value of the dependent variable.

The dependent variable — in our case, sales — is measured on the vertical axis. For each data point, we want to know how

close the value of sales predicted by the line is to the historically observed value of sales.

From now on we will refer to this vertical distance between a data point and the line as the error in prediction or the

residual error, or simply the error. The error is the difference between the observed value and the line's prediction for our

dependent variable. This difference may be due to the influence of other variables or to plain chance.

Going forward, we will refer to the value of the dependent variable predicted by the line as y-hat and to the actual value of the

dependent variable as y. Then the error is y - (y-hat), the difference between the actual and predicted values of the dependent

variable.

The complete mathematical description of the relationship between the dependent and independent variables is y = a + bx +

error. The y-value of any data point is exactly defined by these terms: the value y-hat given by the regression line plus the

error, y - (y-hat).

Collectively, the errors in prediction for all the data points measure how accurately a line fits a set of data.

To quantify the total size of the errors, we cannot just sum each of the vertical distances. If we did, positive and negative

distances would cancel each other out.

Instead, we take the square of each distance and then sum all the squares, similarly to what we do when we calculate

variance.

This measure, called the Sum of Squared Errors (SSE), or the Residual Sum of Squares, gives us a good measure of how

accurately a line describes a set of data.

The less well the line fits the data, the larger the errors, and the higher the Sum of Squared Errors.

Summary

To find the line that best fits a data set, we first need a measure of the accuracy of a line's fit: the Sum of Squared Errors. To

find the Sum of Squared Errors, we calculate the vertical distances from the data points to the line, square the distances,

and sum the squares.
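A minimal sketch of this calculation in Python may make it concrete; the four data points below are purely illustrative, not taken from the EasyMeat file:

```python
# Sum of Squared Errors for a candidate line y = a + b*x over paired
# observations (xs, ys). The data here are illustrative only.
def sum_of_squared_errors(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]          # hypothetical independent-variable values
ys = [2.1, 3.9, 6.2, 7.8]  # hypothetical dependent-variable values
print(round(sum_of_squared_errors(0, 2, xs, ys), 2))   # 0.1 for the line y = 2x
```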

Now that you have a way to measure how well a line fits a set of data, you need a way to identify the line that "best fits" the

data: the regression line.

We can calculate the Sum of Squared Errors for any line that passes through the data. Of course, different lines will give us

different Sums of Squared Errors. The line we are looking for — the regression line — is the one with the smallest Sum of

Squared Errors.

Let's look at several lines that could describe the relationship between advertising and sales in our example. Our intuition

tells us that the middle line is a much better fit than line a or line b.

Let's check our intuition. For each line, we can calculate the Sum of Squared Errors to determine its accuracy.

The lower the Sum of Squared Errors, the more precisely the line fits the data, and the higher the line's accuracy.

The line that most accurately describes the relationship between advertising and sales — the regression line — is the line that

minimizes the sum of squares. Finding the regression line for a set of data is a calculation-intensive process best left to

statistical software.

Summary

The line that most accurately fits the data — the regression line — is the line for which the Sum of Squared Errors is

minimized.
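Although the course leaves the minimization to statistical software, the one-variable case has a well-known closed form. Here is a hedged sketch, reusing the illustrative data from the previous example:

```python
# Closed-form least-squares coefficients that minimize the Sum of
# Squared Errors for a single independent variable.
def least_squares(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))   # slope
    a = mean_y - b * mean_x                      # intercept
    return a, b

xs = [1, 2, 3, 4]          # same illustrative data as above
ys = [2.1, 3.9, 6.2, 7.8]
a, b = least_squares(xs, ys)
print(round(a, 2), round(b, 2))   # 0.15 1.94: the best-fit intercept and slope
```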


Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to do regression analysis using

the regression tool. However, we suggest you read through the following instructions to learn how Excel's regression tool

works, so you can run regressions in the future, when you do have access to the Data Analysis ToolPak.

Performing regression analysis by hand is a time-consuming process. Fortunately, statistical software packages and major

spreadsheet programs — Excel, for example — can do the necessary calculations for you in a matter of seconds. Click on the

Excel link to access the data file so you can practice doing the analysis in Excel as you read through the instructions.

EasyMeat Data

Let's go through the process step by step. We start with data entered into two columns in an Excel spreadsheet. Each

column contains values of a variable. To perform regression analysis, there must be an equal number of entries in each

column.

Under the Data tab in the toolbar we select the Data Analysis option.

A window pops open containing an alphabetical list of statistical tools. We select "Regression" and click "OK".

In the regression window, we see a prompt field titled "Input Y Range." In it, we enter C1:C11, the range of cells containing

the column label (C1) and the data (C2:C11) for the dependent variable: Sales ($).

We repeat this for the prompt field titled "Input X Range," entering B1:B11 to include both the column label (B1) and the

data (B2:B11) for the independent variable: Advertising ($).

Since we included the column labels in row 1 in our ranges, we must check the "Labels" box. Including labels is helpful

because Excel uses the labels to identify the variable coefficients in the output sheet. If you do not include the labels in your

ranges, do not check the label box, or Excel will treat the first row of data as labels, excluding those entries from the

regression.

Finally, we select the output option "New Worksheet Ply:", enter the name for the new worksheet, and click "OK."

Excel opens a new worksheet with the name we specified. In it, we see an intimidating array of data.

For the moment, we're mainly interested in the entries in the cells labeled "Coefficients," which specify the intercept and

slope of the regression line.

Note that the label "Advertising ($)" has been carried over from the original data column. The coefficient in the

"Advertising ($)" row is the slope of the regression line.

For the exercises in this unit, we strongly recommend you find the relevant data in an Excel spreadsheet and perform the

regression analyses yourself. If you do not have the Analysis ToolPak, you can open a file containing the relevant regression

output.

EasyMeat Data

EasyMeat Regression
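If you do not have the Analysis ToolPak, one alternative is to run the same regression in Python. The sketch below uses scipy's linregress; the ten yearly advertising and sales figures are not reproduced in this text, so apart from the one pair quoted earlier ($65,000 and $3,200,200) the numbers are placeholders:

```python
# A sketch of simple regression outside Excel, using scipy.
# Replace these placeholder values with the ten rows of the EasyMeat file.
from scipy.stats import linregress

advertising = [30_000, 45_000, 65_000, 80_000, 95_000]            # hypothetical
sales = [1_200_000, 1_900_000, 3_200_200, 3_700_000, 4_400_000]   # hypothetical

result = linregress(advertising, sales)
print(result.intercept, result.slope)   # the regression coefficients
print(result.rvalue ** 2)               # R-squared for the fit
```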

To practice using Excel's regression tool, run a regression using the world soft drink consumption data from an earlier

exercise. Use soft drink consumption for the dependent variable and per capita GDP for the independent variable.

What is the slope of the regression line? Enter the slope as a decimal number with 3 digits to the right of the decimal point

(e.g., enter "5" as "5.000"). Round if necessary.

Soft Drink Consumption Regression

Source

We run the regression by selecting range C1:C13 for the Y-range, the dependent variable consumption, and B1:B13 for the

X-range, the independent variable GDP per capita. We check the label box, and see the output below. The slope of the

regression line is the coefficient of the independent variable, GDP per capita.

What is the intercept of the regression line? Enter the intercept as an integer (e.g., "5"). Round if necessary.


Soft Drink Consumption Data

Soft Drink Consumption Regression

Source

Equipped with the basic tools needed to find and interpret the regression line, you feel ready to tackle Leo's assignment. But

Alice cautions you not to be hasty and urges you to consider some tricky questions: "How well does the regression line actually

characterize the relationship in the data? Is a straight line even a good descriptor of the relationship?"

How much does the relationship between advertising and sales help us understand and predict sales? We'd like to be able to

quantify the predictive power of the relationship in determining sales levels. How much more do we know about sales thanks

to the advertising data?

To answer this question we need a benchmark telling us how much we know about the behavior of sales without the

advertising data. Only then does it make sense to ask how much more information the advertising data give us.

Without the advertising data, we have the sales data alone to work with. Using no information other than the sales data, the

best predictor for future sales is simply the mean of previous sales. Thus, we use mean sales as our benchmark, and draw a

"mean sales line" through the data.

Let's compare the accuracy of the regression line and the mean sales line. We already have a measure of how accurately an

individual line fits a set of data: the Sum of Squared Errors about the line. Now we want a measure of how much more

accurate the regression line is than the mean line.

To obtain such a measure, we'll calculate the Sum of Squared Errors for each of the two lines, and see how much smaller the

error is around the regression line than around the mean line.

The Sum of Squared Errors for the mean sales line measures the total variation in the sales data. In fact, it is the same

measure of variation we use to derive the standard deviation of sales. We call the Sum of Squared Errors for our benchmark

— the mean sales line — the Total Sum of Squares. Here, the Total Sum of Squares is 8.01 trillion.

The Sum of Squared Errors for the regression line is often called the Residual Sum of Squared Errors, or the Residual Sum of

Squares. The Residual Sum of Squares is the variation left "unexplained" by the regression. Here, the Residual Sum of

Squares is 1.13 trillion.

The difference between the Total Sum of Squares and the Residual Sum of Squares, 6.88 trillion in this case, is called the

Regression Sum of Squares. The Regression Sum of Squares measures the variation in sales "explained" by the regression

line.

A standardized measure of the regression line's explanatory power is called R-squared. R-squared is the fraction of the total

variation in the dependent variable that is explained by the regression line.

R-squared will always be between 0 and 1 — at worst, the regression line explains none of the variation in sales; at best it

explains all of it.

We find R-squared by dividing the variation explained by the regression line — the Regression Sum of Squares — by the total

variation in the dependent variable — the Total Sum of Squares.

R-squared is presented either as a fraction, a percentage, or a decimal. We find that in the advertising and sales example, the

R-squared value is 6.88 trillion/8.01 trillion = 0.859 = 85.9%.

An equivalent approach to computing R-squared is somewhat less intuitive but more common. In this approach we first find

the fraction of the total variation in the dependent variable that is NOT explained by the regression line: we divide the

Residual Sum of Squares by the Total Sum of Squares.
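Using the sums of squares reported above, both routes to R-squared are easy to verify (plain Python; values in trillions):

```python
# R-squared computed both ways from the EasyMeat sums of squares.
total_ss = 8.01        # Total Sum of Squares (trillions)
residual_ss = 1.13     # Residual Sum of Squares (trillions)
regression_ss = total_ss - residual_ss   # 6.88 trillion

r_squared = regression_ss / total_ss         # explained / total
r_squared_alt = 1 - residual_ss / total_ss   # 1 - unexplained share
print(round(r_squared, 3), round(r_squared_alt, 3))   # 0.859 0.859
```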

Fortunately, we don't need to calculate R-squared ourselves — Excel computes R-squared and includes it in the standard

regression output.

In a regression that has only one independent variable, R-squared is closely related to the correlation coefficient between the

independent and dependent variables: the correlation coefficient is simply the positive or negative square root of R-squared—

positive if the slope of the regression line is positive and negative if the slope of the regression line is negative.


Excel's regression output always computes the square root of R-squared, which it labels "Multiple R."

Summary

R-squared measures how well the behavior of the independent variable explains the behavior of the dependent variable. R-

squared is the ratio of the Regression Sum of Squares to the Total Sum of Squares. As such, it tells us what proportion of

the total variation in the dependent variable is explained by its linear relationship with the independent variable.

Residual Analysis

Although the regression line is the line that best fits the observed data, the data points typically do not fall precisely on the

line. Collectively, the vertical distances from the data to the line — the errors — measure how well the line fits the data.

These errors are also known as residuals.

A careful study of the residuals can tell us a lot about a regression analysis and the validity of the assumptions we base it

on.

For example, when we run a regression, we assume that a straight line best describes the relationship between our two

variables. In fact, sometimes the relationship may be better described by a curve.

In this graph, we can clearly see a negative trend. If we run a regression on these data, we find a relatively high R-squared.

How do we use the residuals to check our assumption that the relationship is linear?

First, we measure the residuals: the distance from the data points to the regression line.

Then we plot the residuals against the values of the independent variable. This graph — called a residual plot — helps us

identify patterns in the residuals.

We can recognize a pattern in the residual plot: a curve. This pattern strongly indicates that a straight line is not the best

way to express the relationship between the variables: a curve would be a much better fit.

A residual plot often is better than the original scatter plot for recognizing patterns because it isolates the errors from the

general trend in the data. Residual plots are critical for studying error patterns in more advanced regressions with multiple

independent variables.
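For readers working outside Excel, here is a hedged sketch of building a residual plot in Python (assuming numpy and matplotlib are available); the data are simulated so that the curved pattern described above shows up clearly:

```python
# A residual plot: fit a straight line, compute residuals, and plot them
# against the independent variable. The data are simulated for illustration.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 40)
y = 25 - x**2 + np.random.normal(0, 3, size=x.size)   # a curved relationship

b, a = np.polyfit(x, y, 1)            # slope and intercept of the fitted line
residuals = y - (a + b * x)           # observed minus predicted values

plt.scatter(x, residuals)
plt.axhline(0)                        # residuals should straddle this axis
plt.xlabel("independent variable")
plt.ylabel("residual")
plt.show()                            # a curved band here signals non-linearity
```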

If the only pattern in the dependent variable is accounted for by a linear relationship with the independent variable, then

we should see no systematic pattern in the residual plot. The residuals should be spread randomly around the horizontal

axis.

In fact, the distribution of the residuals should be a normal distribution with mean zero and a fixed variance. Residuals

are called homoskedastic if their distributions have the same variance.

If we see a pattern in the distribution of the residuals, then we can infer that there is more to the behavior of the dependent

variable than what is explained by our linear regression. Other factors may be influencing the dependent variable, or the

assumption that the relationship is linear may be unwarranted.

We've already seen the pattern of a curved relationship. What other patterns might we see?

Let's look at this scatter diagram and its corresponding residual plot. The residuals appear to be getting larger for higher

values of the independent variable. This phenomenon is known as heteroskedasticity.

Residual analysis reveals that the distribution of the residuals changes with the independent variable: the variance

increases as the independent variable increases. Since the variance of the residuals — which contributes to the variation of

the dependent variable — is affected by the behavior of the independent variable, we can conclude that there must be more

to the story than just the linear relationship.

There are a number of other assumptions about regression whose validity can be tested by performing a residual analysis.

Although interesting, these uses of residual analysis are beyond the scope of this course.

Summary

A complete regression analysis should include a careful inspection of the residuals. Plot the residuals against the

independent variable to reveal patterns in the distribution of the residuals.

Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to perform residual analysis

using the regression tool. However, we suggest you read through the instructions to learn how Excel's regression tool

works, so you can perform residual analysis in the future, when you do have access to the Data Analysis ToolPak.


To study patterns in residuals and other regression data, visual representations can be very helpful. To plot a regression

line we first generate a scatter diagram of the data we are studying.

Once we have the scatter diagram, we add the regression line by right-clicking on any one of the data points. A menu will

pop up. From that menu, we select "Add Trendline."

If we wish to display the regression equation and R-squared value on the same scatter plot, we can select the check boxes

next to those items at the bottom and press "Close."

The scatter diagram will now have been augmented by the regression line, the regression equation, and the R-squared

value. This is a quick way to perform a simple regression analysis from a scatter plot, though it doesn't provide all of the

output we'll want to thoroughly review the results.

Residual analysis is an option in Excel's Regression tool. To calculate the residuals and generate the residual plot, we select

"Residual Plots" in the Residuals section of the Regression Menu and click "Ok."

R-squared and the residuals are not the only things to keep an eye on when running a regression.

The regression line is the line that best fits our observed data. But are we sure that the regression line depicts the "true" linear

relationship between the variables?

In fact, the regression line is almost never a perfect descriptor of the true linear relationship between the variables. Why?

Because the data we use to find the regression line typically represent only a sample from the entire population of data

pertaining to the relationship.

Returning to our spreadable lunchmeat example, advertising and sales data for a different set of years would likely have given

us a different line. Since each regression line comes from a limited set of data, it gives us only an approximation of the

"true" linear relationship between the variables.

As we do in sampling, we'll use Greek letters to denote parameters describing the population, and Latin letters to denote the

estimates of those parameters based on sample data. We'll use alpha and beta to represent the coefficients of the "true" linear

relationship in the population, and a and b to represent our estimates of alpha and beta.

When we calculate the coefficients of a regression line from a set of observed data, the value a is an estimate of alpha, the

intercept of the "true" line. Similarly, the value b is an estimate of beta, the slope of the true line.

Like all estimates based on sample data, the calculated estimate for a coefficient is probably different from the true

coefficient. Just as in sampling, to find a range of likely values for the true coefficient, we construct confidence intervals

around each estimated coefficient.

Excel's output gives us 95% confidence intervals for each coefficient by default.

The Excel output for our EasyMeat example tells us that the best estimate of the slope is 50, and we are 95% confident that

the slope of the true line is between 33.5 and 66.5.

We can specify any other confidence level when setting up the regression. We might be interested in a 99% confidence

interval for the slope of the EasyMeat regression line.

The Excel output for our EasyMeat example tells us that the best estimate of the slope is 50, and we are 99% confident that

the slope of the true line is between 25.9 and 74.1.

Since we don't know the exact value of the slope of the true advertising line, we might well question whether there actually

IS a linear relationship between advertising and sales. How can we assure ourselves that the pattern we see in the sample

data is not simply due to chance?

If there truly were no linear relationship, then in the full population of relevant data, changes in advertising would not

correspond systematically to proportional changes in sales. In that case, the slope of the best-fitting line for the true

relationship would be zero.

There are two quick ways to test if the slope of the true line might be zero. One is simply to look at the confidence interval

for the slope coefficient and see if it includes zero.


The 95% confidence interval for the slope in our case, [33.5, 66.5], does not include zero, so we can reject with 95%

confidence the hypothesis that the slope of the true line is zero. We can say that we are 95% confident that there really is a

linear relationship between advertising and sales.

Alternatively, we can test how likely it is that the slope really is zero using a hypothesis test. We formulate the null

hypothesis that there is no linear relationship: the true slope, beta, of the regression line is zero.

Then we calculate the p-value: the likelihood of having a slope at least as far from zero as the slope calculated, b, assuming

that the slope is in fact zero.

In our example, the p-value tells us the likelihood of having a slope as far from zero as 50 if the true slope were in fact zero.

Fortunately, Excel saves us the trouble of calculating the p-value and reports it along with the regression output. This p-

value tells us exactly what we want to know: if there really is no linear relationship between the variables, how likely is it

that our sample would have given us a slope as large as 50?

In our example, the p-value for the advertising coefficient is 0.0001, indicating that we can be confident at the 99.99% level

that beta, the slope of the true line, is different from zero. Thus, we are confident at the 99.99% level that there really is a

linear relationship between advertising and sales.

In general, if the p-value for a slope coefficient is less than 0.05, then we can reject the null hypothesis that the slope beta is

zero, and conclude with 95% confidence that there is a linear relationship between the two variables. Moreover, the smaller

the p-value, the more confident we are that a linear relationship exists.

If the p-value for a slope coefficient is greater than 0.05, then we do not have enough evidence to conclude with 95%

confidence that there is a significant linear relationship between the variables.

Additionally, p-values can be used to test hypotheses about the intercept of a regression line. In our example, the p-value

for the intercept is 0.498, much larger than 0.05. However, since the intercept simply anchors the regression line, whether

or not it is zero is not particularly important.

We won't use the standard error of the coefficient in this course, but will simply point out that the standard error is similar

to a standard deviation of our distribution for the coefficient. Thus, the 95% confidence interval is approximately 50 +/- 2*

(7.2). It's not exact, because the distribution isn't quite normal, but the concept is very similar.

We won't use the t-stat in this course either, since the p-value gives the equivalent information in a more readily usable

form. The t-stat tells us how many standard errors the coefficient is from the value zero. Thus, if the t-stat is greater than 2,

we are quite sure (approximately 95% confident) that the true coefficient is not zero.
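For completeness, here is a hedged sketch of obtaining the same coefficient estimates, p-values, and confidence intervals outside Excel, using Python's statsmodels package; the data arrays are placeholders standing in for the advertising and sales columns:

```python
# Coefficient estimates, p-values, and confidence intervals in Python,
# analogous to Excel's regression output. Placeholder data shown.
import numpy as np
import statsmodels.api as sm

advertising = np.array([30_000, 45_000, 65_000, 80_000, 95_000])           # hypothetical
sales = np.array([1_200_000, 1_900_000, 3_200_200, 3_700_000, 4_400_000])  # hypothetical

X = sm.add_constant(advertising)   # adds the intercept term to the model
model = sm.OLS(sales, X).fit()

print(model.params)                # intercept and slope estimates
print(model.pvalues)               # p-value for each coefficient
print(model.conf_int(alpha=0.05))  # 95% confidence intervals
```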

Summary

The slope and intercept of the regression line are estimates based on sample data: how closely they approximate the actual

values is uncertain. Confidence intervals for the regression coefficients specify a range of likely values for the regression

coefficients. Excel reports a p-value for each coefficient. If the p-value for a slope coefficient is less than 0.05 we can be

95% confident that the slope is nonzero, and hence that there is a linear relationship between the independent and

dependent variables.

It is important not to confuse the p-value for a coefficient with the R-squared for the regression. R-squared tells us what

percentage of the variation seen in the dependent variable is explained by its relationship with the independent variable.

The p-value tells us the likelihood that there is no real relationship between the dependent and independent variables - that

the true coefficient of the line in the full population is zero.

Let's look at an example that illustrates the difference in these two measures. This scatter diagram shows data tightly

clustered around the regression line. R-squared is very high — very little variation in Y is left unexplained by the line. The

p-value on the coefficient is very low — there is clearly a significant linear relationship between X and Y.

The second scatter diagram has a much lower R-squared — a lot of variation in Y is left unexplained by the line. But there is

no question that there is a significant relationship between X and Y, so the p-value on the coefficient is again very low.

Here, the linear relationship may not explain as much variation as it does in the upper graph, but the relationship is clearly

significant.

A high p-value indicates a lack of confidence in the underlying linear relationship. We would not expect a questionable

relationship to explain much variation. So if we have only one independent variable and it has a high p-value, we would not

expect to find a very high R-squared. However, as we'll see shortly, the story becomes more complex when we have two or

more independent variables.

So far, we haven't discussed how sample size affects the accuracy of a regression analysis. The larger the sample we use to

conduct the regression analysis, the more precise the information we obtain about the true nature of the relationship under


investigation. Specifically, the larger the sample, the better our estimates for the slope and the intercept, and the tighter the

confidence intervals around those estimates.

Let's look at how sample size affects p-values in an example. This scatter plot is based on a large sample size — 50

observations. With a p-value of zero we are extremely confident that there is a linear relationship between X and Y, and our

confidence interval for the slope coefficient is quite tight.

The second scatter plot is based on a sample size of only 10. The p-value rises to 0.07, so we cannot be 95% confident that

there is a linear relationship between X and Y. Our lack of confidence in our estimate is also evident in the wide confidence

interval for the slope coefficient.

Summary

The p-value and R-squared provide different information. A linear relationship can be significant but not explain a large

percentage of the variation, so having a low p-value does not ensure a high R-squared. Sample size is an important determinant

of regression accuracy: as with all sampling, larger samples give more accurate estimates.

Aware of many of the subtleties of basic regression, you are ready to turn to Leo's staffing problem.

The Kahana has 500 guest rooms. Over the last three years, average yearly occupancy has varied a good deal: my record low

was 250 rooms occupied, and my high was 484. That range of variation makes staff planning difficult.

I'd like to be able to predict the Kahana's occupancy one month in advance. So far, I've been making educated guesses about

the level of occupancy one month in the future based on the number of advance bookings. I take the number of bookings and

add another 50%, to be on the safe side. Obviously, I haven't done very well with that method.

You need to find out more precisely what the relationship is between advance bookings and occupancy. Alice asks you to run

a regression analysis on Leo's occupancy and advance bookings data.

The results of your analysis tell you that, for every 100 additional advance bookings,

Bookings and Occupancy Regression

When you run the regression analysis, be sure you choose the dependent and independent variables correctly. We are using

advance bookings to predict occupancy, suggesting that we believe that occupancy levels "depend" on advance bookings.

Thus, the occupancy is the dependent variable, and the number of advance bookings is the independent variable.

In the regression output, find the coefficient of the slope of the regression line. This tells you how many additional guests Leo

can expect for each additional advance booking.

Based on these data, can you be 95% confident that the slope of the regression line is not 0?

Bookings and Occupancy Regression

How much of the variation in occupancy is explained by the variation in the number of advance bookings?

Bookings and Occupancy Regression

So there is a positive relationship between advance bookings and occupancy. But the power of advance bookings to predict

occupancy is pretty small. More than half of the variation in your room occupancy is due to other factors.

I suppose the mathematical relationship you found will help me make slightly more informed staffing choices. But mostly I'll

still be stumbling in the dark.

Not so fast Leo: We may be able to identify other factors linked to occupancy, and use them to come up with an improved

forecasting model. Why don't we meet tomorrow and discuss what other factors might help us predict occupancy?

Alright. In the meantime, I'll be doing my best to appease Mr. Pitt. He's talking about filing suit against me. He says he was

burned. Burned by a bisque!


You have been asked to examine the relationship between two important macroeconomic quantities: the change in

business inventories and factory capacity utilization levels.

If you wish to learn what percent of the variation in changes in inventories is explained by capacity utilization levels, which

variable should you choose as your independent variable?

Click here to access US economic data from the years 1971-1986. Run the regression with the change in business

inventories as the dependent variable and the capacity utilization as the independent variable.

Source

Using the regression output, find the slope of the regression line. Enter the slope as a decimal number with 2 digits to the

right of the decimal point (e.g., enter "5" as "5.00"). Round if necessary.

Inventories and Capacity Regression

Greta John is the human resources manager at the software consulting firm Clever Solutions. Recently, some of the

programmers have been restless: the senior programmers feel that length of service and loyalty have not been rewarded in

their compensation. The junior programmers think that seniority should not be a major basis for pay.

Greta wants some hard data to inform the debate. As a preliminary step, she plots employees' salaries against their length

of service. Using the data provided, perform a regression analysis with salary as the dependent variable and length of

service as the independent variable.

What is the average increase in a Clever Solutions programmer's salary per year of service?

Clever Solutions Regression

Based on the regression analysis, Greta can tell that approximately _____ of the variation in compensation can be

explained by length of service.

Clever Solutions Regression

Productivity measures a nation's average output per labor-hour. It is one of the most closely watched variables in

economics: as workers produce more per hour, employers can pay them more without increasing the price of the product.

Since wages can rise without provoking a corresponding rise in consumer prices, the growth of productivity is essential to a

real increase in a nation's standard of living.

Peter Agarwal, a student at the Harvard Business School, wants to investigate the relationship between change in

productivity and change in real hourly compensation.

Peter has data on change in productivity and change in compensation for eight industrialized nations. The figures are

annual averages over the period from 1979-1990.

Source

Run a regression with change in compensation as the dependent variable and change in productivity as the independent

variable.

Source

How much of the variation in the change in compensation can be explained by the change in productivity? Enter the

percentage as a decimal number with 2 digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Round if

necessary.

Productivity and Compensation Regression

Given these data, Peter finds that the relationship can be mathematically expressed as:


Can Peter claim (with a 95% level of confidence) that the relationship is statistically significant?

Productivity and Compensation Regression

The coefficient for the slope given by Excel is an estimate based on the data in Peter's sample. The estimate for the slope of

the regression line is about 0.75. If the actual slope of the relationship is 0, there is no significant linear relationship

between the change in productivity and the change in compensation.

On the regression output, there are two ways to tell if the slope coefficient is significant at the 0.05 level. First, we can look

at the 95% confidence interval provided and see that it ranges from -0.05 to +1.56. Since the 95% confidence interval

contains zero, the coefficient is not significant at the 0.05 level.

Alternatively, we can note that the p-value of the slope coefficient, 0.0625, is greater than 0.05. Peter cannot be 95%

confident that the actual slope is not zero.

Since Peter cannot be confident that the slope is not zero, he cannot be confident that there is a linear relationship between

the two variables.

Suppose Peter collects data on eight more countries. Run the regression for the entire data set with change in

compensation as the dependent variable and change in productivity as the independent variable.

The new data set indicates that the variables have a slightly different relationship:

Can Peter claim (with a 95% level of confidence) that the relationship is statistically significant?

Expanded Productivity and Compensation Regression

What might explain why the coefficient is significant in the second (combined) data set?

Expanded Productivity and Compensation Regression

Multiple Regression

Introduction

After brainstorming for several hours, you and Alice devise a plan to improve your regression analysis of the staffing problem,

to help Leo better predict the Kahana's occupancy.

Good morning. I hope you had a good night's sleep and have come up with some ideas on how to better predict my

occupancy.

For my part, I slept horribly last night. This bisque lawsuit is taking years off my life.

I'm sorry to hear that. I'm afraid we don't have any legal advice to give you. We do have some ideas about how to improve

your ability to forecast your hotel's occupancy.

We can do better than explaining 39% of the variation in occupancy as we did with our earlier regression using advance

bookings as the independent variable.

Remember when we first arrived, we analyzed the relationship between Kauai's average hotel occupancy rates and arrivals on

the island? We found a fairly strong correlation between occupancy and arrivals: 71%.

Source

I took a look at the relationship between arrivals on Kauai and your hotel's occupancy numbers. In the regression of

Kahana occupancy versus arrivals, arrivals explain 80% of the variation in occupancy. That's much better than the 39%

explained by advance bookings.

Wait a minute. Those numbers add up to more than 100%. It seems like your regression technique explains more variation

than there is!


Good observation, Leo. The numbers don't add up. The reason is that there is a statistical relationship between arrivals and

advance bookings that we aren't taking into account when we run the two regressions separately.

We intend to find one equation that incorporates the data on both arrivals and advance bookings and takes into account the

relationship between them. We'll also investigate the impact of other factors, such as the business practices of your

competitors. Who is your main competitor in the area?

That would be the Hotel Excelsior. Its manager, Knut Steinkalt, is a real cut-throat. He's always offering special promotions

that undercut my room prices.

Fortunately, the Excelsior, though very luxurious, is not nearly as inviting as the Kahana. That place feels like an undertaker's

parlor! I've been able to keep ahead of old Knut by offering a better product.

We'll study the Excelsior's promotions, and see if they've had a significant influence on the Kahana's occupancy.

Thanks. Let me know as soon as you have some results. I have to warn you, though, I may be out: I'm going see my lawyers in

Honolulu this week to discuss Mr. Pitt's bisque lawsuit. What a mess!

"Most management problems are too complex to be completely described by the interactions between only two variables,"

Alice tells you. "Incorporating multiple independent variables can give managers a more accurate mathematical

representation of their business."

When buying a new home, we know that the house's size influences its selling price. All other factors being equal, we expect

that the larger the home, the higher its price.

We can gain a better sense of this relationship by graphing data on price and house size in a scatter diagram, and running a

regression with price as the dependent variable and house size as the independent variable. We'll use data on 15 recent home

purchases in the town of Silverhaven.

We can tell from the value of R-squared, 26%, that the relationship is fairly weak: house size variation explains only about a

quarter of the variation in price.

Silverhaven Real Estate Regressions

This should come as no surprise: house size is not the only variable affecting the price of a home. There are at least three

other important factors, as any real estate agent will tell you:

One facet of a desirable location is a low commuting time, so we'd expect average commuting time to be related to price. To

study this relationship, we'll use a proxy variable: the house's distance from the business center in downtown Silverhaven. A

proxy variable is a variable that is closely correlated with the variable we want to investigate, but for which data are typically more readily available.

The data on distance for the same 15 houses reveals a negative relationship: the farther away from downtown, the less

expensive houses tend to be. Again, the strength of the relationship is relatively weak: R-squared for the regression of price

versus distance is 37%.

Silverhaven Real Estate Regressions

We might be tempted to think that, if house size explains 26% of the price of a house and distance explains 37%, then the two

variables together would explain 63% of the price. But what if there is a relationship between the two independent variables,

house size and distance?

In fact, the correlation coefficient of house size and distance is 31%. As we might expect, there is a positive relationship

between the two variables: as we move farther from the city center, houses tend to be larger. How should we factor this

relationship into our analysis of how house size and distance affect the price of a house?

Rather than considering each individual relationship, we need to find a way to express the three-way relationship among all

three variables: price, house size, and location. Instead of two separate equations describing price, each with a different

independent variable, we need one equation that includes both independent variables.

We find this three-way relationship using multiple regression. In multiple regression, we adapt what we know about

regression with one independent variable — often called simple regression — to situations in which we take into account

the influence of several variables.

Thinking about several variables simultaneously can be quite challenging. Given only the price of a home, we cannot make

inferences about its size and location. A $500,000 home might be a mansion in the countryside, a modest house in the

suburbs, or a cozy cardboard box on the corner of 1st and Main.

Graphing data on more than two variables poses its own set of difficulties. Three variables can still be represented, but

beyond that, visualization and graphical representation become essentially impossible.

Although we can carry over many of the central ideas behind simple regression to the multivariate case, we'll have to consider

several interesting complications.

As managers, almost any quantity we wish to study will be influenced by more than one variable: to construct an accurate

model of a business' dynamics, we'll usually need several variables. Multiple regression is an essential and powerful

management tool for analyzing these situations.

Incorporating more than one independent variable into your analysis of the Kahana's occupancy sounds like a good idea. But

how do you adapt regression analysis to accommodate multiple variables?

Summary

Multiple regression is an extension of simple regression that allows us to analyze the relationships between multiple

independent variables and a dependent variable. Relationships among independent variables complicate multivariate

regression. With more than two independent variables, graphing multivariable relationships is impossible, so we must

proceed with caution and conduct additional analyses to identify patterns.

"Multiple regression equations look very similar to regression equations with only one independent variable," Alice explains.

"But be careful - you have to interpret them slightly differently."

In our home price example, we found two regression equations: one for the relationship between price and house size, and

one for the relationship between price and distance. What will the equation look like for the three-way relationship between

price, the dependent variable, and the two independent variables, house size and distance?

The regression equation in our housing example will have the form below: house size and distance each have their own

coefficients, and they are summed together along with the constant coefficient a.
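Written out in the notation used elsewhere in this course (as in y = a + bx), the form described above is:

price = a + b1 × (house size) + b2 × (distance)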

In general, the linear equation for a regression model with k different variables has the form below. Since the coefficients we

obtain from the data are just estimates, we must distinguish between the idealized equation that represents the "true"

relationship and the regression line that estimates that relationship. To express that even the "true" equation does not fit

perfectly, we include an error term in the idealized equation.
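In the same notation, the idealized equation with k independent variables is:

y = a + b1x1 + b2x2 + ... + bkxk + error term

The estimated regression equation has the same form but omits the error term and replaces the true coefficients with the estimates computed from the data.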

Running the regression gives us coefficients for house size and distance: 252 and -55,006, respectively. We can use this

multiple regression equation to predict the price of other houses not in our data set. To predict a house's price, we need to

know only its size and its distance to downtown.

Silverhaven Real Estate Regressions

Suppose "Windsor" is a modest mansion of 3,500 square feet, located in the outer suburbs of Silverhaven, approximately 11

miles from downtown. Based on our regression equation, how much would we expect Windsor to sell for?

Silverhaven Real Estate Regressions

We simply enter Windsor's square footage and distance to downtown into the equation, and calculate an expected selling

price of $699,938.
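The constant coefficient of the fitted equation is not quoted above, but this prediction implies a constant of about $423,004, so the arithmetic runs:

423,004 + (252 × 3,500) - (55,006 × 11) = 423,004 + 882,000 - 605,066 = $699,938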

Let's take a closer look at the coefficients in the housing example, focusing on the distance coefficient: -55,006. This

coefficient is substantially different from the coefficient in the original simple regression: -39,505. Why is it so different?

The coefficient in the simple regression and the coefficient in the multiple regression have very different meanings. In the

simple regression equation of price versus distance, we interpret the coefficient, -39,505, in the following way: for every

additional mile farther from downtown, we expect house price to decrease by an average of $39,505.

We describe this average decrease of $39,505 as a gross effect - it is an average computed over the range of variation of all

other factors that influence price.

In the multiple regression of price versus size and distance, the value of the distance coefficient, -55,006, is different, because

it has a different meaning. Here, the coefficient tells us that, for every additional mile, we should expect the price to decrease

by $55,006, provided the size of the house stays the same.

In other words, among houses that are similarly sized, we expect prices to decrease by $55,006 per mile of distance to

downtown. We refer to this decrease as the net effect of distance on price. Alternatively, we refer to it as "the effect of

distance on price controlling for house size."

Two houses are similar in size, but located in different neighborhoods: "Shangri La" is five miles farther from downtown than

"Xanadu." If Xanadu's selling price is $450,000, what would we expect Shangri La's selling price to be?

Silverhaven Real Estate Regressions

Since the two houses are the same size, we use the net effect of distance on price,

-$55,006/mile, to predict the expected difference in their selling prices. Shangri La is 5 additional miles from downtown, so

its price should be $55,006/mile * 5 miles = $275,030 less than Xanadu's, or $450,000 - $275,030 = $174,970.

"Valhalla" is another house located 5 miles farther from downtown than Xanadu. We have no information about the relative

sizes of the two homes. If Xanadu's selling price is $450,000, what would we expect Valhalla's selling price to be?

Silverhaven Real Estate Regressions

Since we cannot assume that the sizes of the two houses are equal, we should not control for size. Thus we use the gross effect

of distance on price,

-$39,505/mile, to predict the expected difference in the two homes' selling prices. Valhalla is 5 additional miles from

downtown, so its price should be $39,505/mile * 5 miles = $197,525 less than Xanadu's, or $450,000 - $197,525 = $252,475.

Let's try to build our intuition about the difference in the distance coefficients in the simple and multiple regressions. The

coefficients are different because they have different meanings. But what exactly accounts for the drop from -39,505 to

-55,006?

In the multiple regression, by essentially considering only houses that are of equal size, we separate out the effect of house

size on price. We are left with a distance coefficient that is net relative to house size.

In the simple regression, the gross effect of distance, -$39,505/mile, represents an average over the range of house sizes. As

such, it also captures some of the effect that house size has on price. Let's take a closer look at the distance and house size

data.

Calculating the correlation coefficient between sizes of homes and their distances from downtown Silverhaven, we see that

there is a slight positive relationship, with a correlation coefficient of 31%. In other words, as we move farther from

downtown, houses tend to be larger.

We have seen that two things happen as we move farther from downtown — housing prices drop because the commute is

longer, and house size increases. The fact that house size increases with distance complicates the pricing story, because larger

houses tend to be more expensive.

Longer distances from downtown translate into two different effects on price. One effect of distance on price is negative: as

distance increases, commute times increase and prices drop.

A second effect of distance on price is positive: as distance increases, house sizes increase, and larger houses correspond to

higher prices.

Running a multiple regression with both size and distance as independent variables helps tease out these two separate effects.

When we control for house size, we see the net effect of distance on price: prices drop by $55,006 per additional mile.

When we don't control for house size, the effect of distance alone on price is confounded by the fact that house size tends to

rise as distance increases. The "real" effect of distance on price is diminished by the relationship between price and house

size.

When we look at the net relationship between distance and price, we consider only similarly sized houses. Now we assume

that as distance from downtown grows, house size stays the same. If house size didn't increase as we moved farther out,

prices would drop more sharply: by $55,006 rather than $39,505 per additional mile.

Let's analyze the house size coefficient in a similar fashion. In the multiple regression model of home prices, the house

size coefficient is net relative to distance. The coefficient of 252 tells us to expect prices of homes equally distant from

downtown to increase by an average of $252 for each additional square foot of size.

The gross effect of house size on price is $167 per square foot, considerably less than $252, the net effect. When we do not

control for distance, house size "offsets" some of the negative effect of distance: the fact that larger houses are typically

located farther from the city counteracts some of the effect of increased house size.

We should always be careful to interpret regression coefficients properly. A coefficient is "net" with respect to all variables

included in the regression, but "gross" with respect to all omitted variables. An included variable may be picking up the

effects on price of a variable that is not included in the model — school district, for example.

Finally, we should note that these coefficients are based on sample data, and as such are only estimates of the coefficients of

the true relationship. For each independent variable, we must inspect its p-value in the regression output to make sure that

its relationship with the dependent variable is significant.

Since the p-value is less than 0.05 for both house size and distance to downtown, we can be 95% confident that the true

coefficients of the two independent variables are not zero. In other words, we are confident that there are linear relationships

between each independent variable and house price.

There are four steps we should always follow when interpreting the coefficient of an independent variable in a multiple

regression:

Summary

We use multiple regression to understand the structure of relationships between multiple variables and a dependent

variable, and to forecast values of the dependent variable. A coefficient for an independent variable in a regression

equation characterizes the net relationship between the independent variable and the dependent variable: the effect of the

independent variable on the dependent variable when we control for the other independent variables included in the

regression.

Performing in Excel

Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to perform multiple regression

analysis using the regression tool. However, we suggest you read through the following instructions to learn how Excel's

regression tool works, so you can perform multiple regression in the future, when you do have access to the Data Analysis

ToolPak.

Performing multiple regression in Excel is nearly identical to performing simple regressions. In the "Input Y range" field,

enter the range that includes the column label and the data for the dependent variable as you would for a simple

regression.

To enter data on the independent variables, the columns of all the independent variables must be contiguous and have the

same number of rows.

In the "Input X Range" field, enter the cell reference of the top cell of the independent variable appearing farthest to the

left. Following a colon, enter the cell reference of the bottom cell of the independent variable appearing farthest to the

right.

Select "Labels" so that Excel can properly label the independent variables in the output. Select "Residual Plots" to include

the residuals and residual plots of the residuals against each independent variable in the output. Enter the desired

"Confidence Level," or accept the default level of 95%.

The regression output prints the values of the coefficients for each variable in separate rows, along with the p-values for the

coefficients and the desired confidence intervals.
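Excel's regression tool is the method this course uses. As an aside, for readers who want to reproduce similar output outside Excel, here is a minimal sketch using Python's statsmodels library; the data values below are illustrative placeholders, not the actual Silverhaven figures.

import pandas as pd
import statsmodels.api as sm

# Illustrative placeholder data: price is the dependent variable; size and
# distance are the independent variables.
data = pd.DataFrame({
    "price":    [520000, 430000, 610000, 380000, 700000, 455000],
    "size":     [2400, 1900, 2800, 1600, 3200, 2000],
    "distance": [5.0, 8.5, 3.0, 11.0, 2.5, 7.0],
})

X = sm.add_constant(data[["size", "distance"]])  # adds the intercept term
model = sm.OLS(data["price"], X).fit()

# Prints a table analogous to Excel's output: coefficients, p-values, and
# 95% confidence intervals for each independent variable.
print(model.summary())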

Residual Analysis

When running a multiple regression, you must distinguish between net and gross relationships. What about R-squared? How

do you measure the predictive power of a regression model with multiple independent variables?

One of the Silverhaven houses in our data set, "Windsor," sold for $570,000, substantially less than its predicted price,

$699,938. Rumor has it that Windsor is haunted, and it was hard to find a buyer for it. The difference between the actual

price and the predicted price is the residual error, in this case -$129,938.

In simple regression, the residuals — the differences between the actual and predicted values of the dependent variables —

are easy to visualize on a scatterplot of the data. The residuals are the vertical distances from the regression line to the data

points.

When we have only one independent variable, we create a scatter plot of the data points, then draw a line through the data

representing the relationship defined by the regression equation: y = a + bx.

With two independent variables, an ordinary scatter plot will no longer do. Instead, we keep track of the third variable by

picturing a three-dimensional space.

Our data set is "scattered" in the space with distance from downtown measured on one axis, house size measured on another

axis, and the dependent variable, price, measured on the vertical axis.

The regression equation with two independent variables defines a plane that passes through the data. The residuals — the

differences between the actual and predicted prices in the data set — are the vertical distances from the regression plane to

the data points.

As in simple regression, these vertical distances are known as residuals or errors. The regression plane is the plane that

"best fits" the data in the sense that it is the plane for which the sum of the squared errors is minimized.

We can plot the residuals against the values of each independent variable to look for patterns that could indicate that our

linear regression model is inadequate in some way. Here is the residual plot analyzing the behavior of the residuals over the

range of the independent variable house size...

...and here is the residual plot analyzing the behavior of the residuals over the range of the independent variable distance.

When linear regression is a good model for the relationships studied, each of the residual plots should reveal a random

distribution of the residuals. The distribution should be normal, with mean zero and fixed variance.

The residual plot against distance in the multiple regression looks different from the residual plot against distance in the

simple regression. They represent different concepts: the first gives insight into the net relationship between distance and

price controlling for house size, and the second gives insight into the gross relationship.

With more than two independent variables, visualizing the residuals in the context of the full regression relationship becomes

essentially impossible. We can only think of residuals in terms of their meaning: the differences between the actual and

predicted values of the dependent variable.

Because residual plots always involve only two variables — the magnitude of the residual and one of the independent

variables — they provide an indispensable visual tool for detecting patterns such as heteroskedasticity and non-linearity in

regressions with multiple independent variables.
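Continuing the statsmodels sketch from the "Performing in Excel" note above (and assuming the matplotlib library is available), residual plots against each independent variable take only a few lines:

import matplotlib.pyplot as plt

# model.resid holds the residuals: actual minus predicted prices.
for var in ["size", "distance"]:
    plt.figure()
    plt.scatter(data[var], model.resid)
    plt.axhline(0, linestyle="--")  # residuals should scatter randomly around zero
    plt.xlabel(var)
    plt.ylabel("Residual")
    plt.title("Residuals vs. " + var)
plt.show()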

Summary

The residuals, or errors, are the differences between the actual values of the dependent variable and the predicted values of

the dependent variable. For a regression with two independent variables, the residuals are the vertical distances from the

regression plane to the data points. We can graph a residual plot for each independent variable to help identify patterns

such as heteroskedasticity or non-linearity.

In simple regressions, R-squared measures the predictive or explanatory power of the independent variable: the percentage

of the variation in the dependent variable explained by the independent variable. When we run the regression of house

price versus house size, we find an R-squared of 26%. The R-squared for price versus distance from downtown is 37%.

The multiple regression of price versus the two independent variables, house size and distance, also returns an R-squared

value: 90%. Here, R-squared is the percentage of price variation explained by the variation in both of the independent

variables.

The multiple regression's R-squared is much higher than the R-squared of either simple regression. But how can we be

sure we really gained predictive power by considering more than one variable?

In fact, R-squared cannot decrease when we add another independent variable to a regression — it can only stay the same

or increase, even if the new independent variable is completely unrelated to the dependent variable.

To understand why R-squared always improves when we add another variable, let's return to the simple regression of

house price versus distance from downtown. When we look at only two observations — two houses — we can find a line

that fits the two points perfectly.

R-squared for these two data points with this line is 100%. But clearly we can't use this line to explain the true

relationship between house price and distance. The high R-squared is a result of the fact that the number of observations,

2, is so small relative to the number of independent variables, 1.

If we have 3 observations — three houses — we can't find a line that fits perfectly. Here, R-squared is 35%.

But when we add another independent variable — no matter how irrelevant — we can find a plane that fits the three data

points perfectly, increasing R-squared to 100%. The "perfect" R-squared is again due to the fact that the number of

observations, 3, is only one more than the number of independent variables, 2.

Improving R-squared by adding irrelevant variables is "cheating." We can always increase R-squared to 100% by adding

independent variables until we have one fewer than the number of observations. We are "over-fitting" the data when we

obtain a regression equation in this way: the equation fits our particular data set exactly, but almost surely does not explain

the true relationship between the independent and dependent variables.

To balance out the effect of the difference between the number of observations and the number of independent variables,

we modify R-squared by an adjustment factor. This transformation looks quite complicated, but notice that it is largely

determined by n-k, the difference between the number of observations n and the number of independent variables k.
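The on-screen formula is not reproduced in this text, but the standard adjustment, with n observations and k independent variables as defined above, is:

Adjusted R-squared = 1 - (1 - R-squared) × (n - 1) / (n - k - 1)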

This adjustment reduces R-squared slightly for each variable we add: unless the new variable explains enough additional

variance to increase R-squared by more than the adjustment factor reduces it, we should not add the new variable to the

model.

Excel reports both the "raw" R-squared and the Adjusted R-squared.

It is critical to use adjusted R-squared when comparing the predictive power of regressions with different numbers of

independent variables. For example, since the adjusted R-squared of the multiple regression of house price versus house

size and distance is greater than the adjusted R-squared of either simple regression, we can conclude that we gained real

predictive power by considering both independent variables simultaneously.

A final caveat: we should never compare R-squared or adjusted R-squared values for regressions with different dependent

variables. An R-squared of 50% might be considered low when we are trying to explain product sales, since we expect to be

able to identify and quantify many key drivers of sales. An R-squared of 50% would be considered high if we were trying to

explain human personality traits, since they are influenced by so many factors and random events.

Summary

R-squared measures how well the behavior of the independent variables explains the behavior of the dependent variable.

It is the percentage of variation in the dependent variable explained by its relationship with the independent variables.

Because R-squared never decreases when independent variables are added to a regression, we multiply it by an

adjustment factor. This adjustment balances out the apparent advantage gained just by increasing the number of

independent variables.

Alice urges you to use your newfound knowledge to analyze the relationship between the Kahana's occupancy, its advance bookings, and arrivals on Kauai.

You and Alice want to find the three-way relationship between the dependent variable, the Kahana's occupancy, and the two

independent variables, arrivals on Kauai and the Kahana's advance bookings.

The first thing Alice asks you to do is to measure the strength of the relationship between the two independent variables.

Enter the correlation coefficient as a decimal number with three digits to the right of the decimal point (e.g., enter "5" as

"5.000"). Round if necessary.

The correlation between arrivals and bookings isn't especially strong, 44%.

You run the multiple regression of occupancy versus Kauai arrivals and advance bookings.

Which of the following indicates that the multiple regression with two independent variables is an improvement over both

simple regressions?

Kahana Occupancy Regressions

Do the data indicate that the regression coefficients of the two independent variables are significant at the 0.05 significance

level?

Kahana Occupancy Regressions

Suppose Leo has 200 advance bookings for the month of January and 250 advance bookings for February. What is the best

estimate of how many more guests Leo can expect in February compared to January?

Kahana Occupancy Regressions

You and Alice try to contact Leo, but he appears to be unavailable.

Leo must be with his lawyers in Honolulu. I'm sure we'll be able to get a hold of him later.

In the meantime, I can complete my research into the Excelsior's promotions. And there are some complexities of multiple

regression that you should become familiar with...

Empire Learning is a developer of educational software. CEO Bill Hartborne is making a bid for a contract to create an e-

learning module for a new client.

Preparing the bid requires an estimate of the number of labor-hours it will take to create the new module. Bill believes that

the length of a module and the complexity of its animations directly affect the amount of labor required to complete it.

Bill has data on the labor-hours Empire used to complete previous courses. He also knows the number of pages and the

animation run-time of each previous course — quantities he thinks are reasonable proxies for course length and animation

complexity, respectively.

Perform a simple regression analysis for each of the independent variables: number of pages and run-time of animations.

Empire Learning Regressions

In the simple regressions, which of the independent variables contributes significantly to the number of labor-hours it

takes Empire to create an e-learning course?

Empire Learning Regressions

The p-values for the coefficients on animation run-time and number of pages are 0.003 and 0.0002 respectively—well

below 0.05, the most commonly used level of significance. Thus, we conclude that both independent variables contribute

significantly in their respective simple regressions to the number of labor hours Empire takes to create an e-learning

course.

Run the multiple regression of labor-hours versus number of pages and run-time of animations.

According to this multiple regression, which of the independent variables contributes significantly to the number of labor-

hours it takes Empire to create an e-learning course?

Empire Learning Regressions

The p-values for the coefficients on animation run-time and number of pages are 0.014 and 0.0015 respectively—well

below 0.05, the most commonly used level of significance. Thus, we conclude that both independent variables contribute

significantly in the multiple regression to the number of labor hours Empire takes to create an e-learning course.

For this exercise, refer to the regression analyses performed in Exercise 1 of this section.

Empire Learning Regressions

Bill Hartborne, CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it will take his

team to create a new e-learning course. He is using data on previous courses Empire created, with the number of pages and

the total run-time of animations as independent variables.

Empire Learning Regressions

In the multiple regression of labor-hours versus number of pages and run-time of animations, what does the coefficient of

0.84 for the number of pages tell us?

Empire Learning Data

Empire Learning Regressions

In the multiple regression equation, the coefficient of the independent variable "number of pages" is gross relative to:

Empire Learning Regressions

For this exercise, refer to the regression analyses performed in Exercise 1 of this section.

Empire Learning Regressions

Bill Hartborne, the CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it will take

his team to create a new e-learning course. He is using data on previous courses Empire created, with the number of pages

and the total run-time of animations as independent variables.

Empire Learning Regressions

Bill bills out his talent at $70/hour. Based on the multiple regression, how much should he charge for the labor content of a

course with 400 pages and 170 seconds of animations? Enter the estimated cost of the labor (in $) as an integer (e.g., enter

"$5.00" as "5"). Round if necessary.

Empire Learning Regressions

First use the regression equation to predict the number of labor-hours required to complete the course.

Empire Learning Regressions

Then multiply that number by Empire Learning's billing rate of $70/hour to find the total amount he should charge for the

labor content of the course, $77,210.
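The hour count is implied by the quoted total: the regression predicts 1,103 labor-hours, and 1,103 hours × $70/hour = $77,210.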

Empire Learning Regressions

Bill is sure that the client will balk at a labor bill of over $70,000. He knows that animation is important to the client, so he

doesn't want to cut corners there. However, he believes that his lead writer can cover the content in fewer pages without

compromising his renowned clear and engaging prose.

Empire Learning Regressions

To reduce total labor costs to $70,000, how many pages must Bill cut from the plan to meet his client's cost limits?

Empire Learning Regressions

To reduce the labor bill from $77,210 to $70,000, Bill must reduce labor costs by $7,210. To achieve this reduction, Bill

must cut the contract's labor hours by 103 hours, since he bills out his talent at $70/hour.

Empire Learning Regressions

Since the animation run time will not change, we use the net relationship between labor-hours and number of pages, which

tells us that each additional page consumes 0.84 labor hours. Thus, Bill must reduce the number of pages by 123.
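The arithmetic: $7,210 ÷ $70/hour = 103 hours, and 103 hours ÷ 0.84 hours/page ≈ 123 pages.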

Empire Learning Regressions

"I expect Leo will call us this evening after his meeting with the lawyers," Alice predicts. "I hope things are going well. If Mr.

Pitt's lawsuit materializes, Leo might not have much of a business left to help him with."

I just spent the whole day at my lawyers' offices. Please give me some good news about the occupancy problem.

Well, we've found a regression model that incorporates arrivals on Kauai and advance bookings. We're now able to explain

about 86% of the variation in occupancy.

That's great. That's so much better than the 39% you calculated using only advance bookings as the independent variable. I

should be able to make much more reliable predictions based on your new model!

Unfortunately, no. Although this new model helps us understand why your occupancy varies, we can't exactly use the model

to make predictions.

You can use advance bookings to make a prediction about occupancy in a given month because the bookings are known to

you ahead of time. But you won't get the data on the number of arrivals in a month until it's too late and your guests are

already on your doorstep.

That's terrible! Sure, it's nice to know how today's occupancy is affected by today's arrivals. But I need to make business

decisions! I need to know one month in advance how many staffers to hire! Please, isn't there something you can do?

Don't give up yet, Leo. We still have a number of statistical approaches at our disposal. We'll have something for you when

you get back from Honolulu.

Multicollinearity

"We need some more advanced statistical tools to find a regression suitable to Leo's purposes," Alice tells you. "And there is

still a pitfall you'll need to learn to avoid when using multiple regression."

Another key factor that influences the price of a house is the size of the property it is built on - its "lot size." Naturally, we'd

pay more for a spacious acre of land than for 800 cramped square feet.

Using new data on the lot sizes of the 15 sample houses in Silverhaven, run the simple regression of house price on lot size. If

we don't control for any other factors, how much of the variation in price can be explained by variation in lot size? Enter R-

squared as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Round if

necessary.

Silverhaven Real Estate Regressions

Variation in lot size explains 30 percent of the variation in home prices. Do the data provide evidence that there is a significant linear relationship

between house price and lot size?

Silverhaven Real Estate Regressions

The low p-value of 0.033 tells us that we can be confident that the gross relationship between lot size and home price is

significant. What happens when we add lot size as a third independent variable in our multiple regression of price on three

independent variables: house size, distance from downtown Silverhaven, and lot size?

Run a multiple regression of price on the three independent variables: house size, distance, and lot size. How does the

addition of the new independent variable, "lot size" affect the predictive power of the regression model?

Silverhaven Real Estate Regressions

By adding the independent variable "lot size", we improve adjusted R-squared slightly: from 89% to 91%, telling us that the

predictive power of the regression has improved. What about the significance of the independent variables? Has adding the

new variable changed the p-values of the coefficients?

Something odd has happened. In our earlier regression with two independent variables, the p-values for both the house size

and the distance coefficients were less than 0.05. Now, adding lot size into the equation has somehow raised the p-value for

house size.

The new p-value for house size, 0.2179, is so high that there is no longer evidence of a significant linear relationship between

price and house size after taking lot size and distance into account. How do we explain this drop in significance?

When a multiple regression delivers a surprising result such as this, we can usually attribute it to a relationship between two

or more of the independent variables. Let's look at the data on house size and lot size.

The high correlation coefficient between house size and lot size — 94% — is the culprit in the Case of the Dropping

Significance.

When two of the independent variables are highly correlated, one is essentially a proxy for the other. This phenomenon is

called multicollinearity. In our example, lot size is a good proxy for house size.

Both house size and lot size contribute to the price of a home. But because these two variables are closely correlated in our

data set, there is not enough information in the data to discern how their combined contributions should be attributed to

each of these two variables.

The net effect of house size on price should tell us the effect of house size on price assuming that the lot size is fixed.

However, we can't detect this effect in the data: house size and lot size are so closely related that we've never seen house size

vary much when lot size is fixed.

Would dropping the variable house size improve the predictive power of the regression model? The multiple regressions with

and without house size have different numbers of independent variables, so we use adjusted R-squared to compare their

predictive power.

Without house size, adjusted R-squared is 90.89%, slightly lower than 91.40%, the adjusted R-squared for the regression

including house size. Thus, although the regression model cannot accurately estimate the effect of house size when we control

for lot size and distance, the addition of house size does help explain a bit more of the variance in selling price.

A common indication of lurking multicollinearity in a regression is a high adjusted R-squared value accompanied by low

significance for one or more of the independent variables. One way to diagnose multicollinearity is to check if the p-value

on an independent variable rises when a new independent variable is added, suggesting strong correlation between those

independent variables.
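A quick numerical check is the pairwise correlation matrix of the independent variables. Here is a minimal sketch in Python with pandas; the numbers are illustrative, not the Silverhaven data.

import pandas as pd

predictors = pd.DataFrame({
    "size":     [2400, 1900, 2800, 1600, 3200, 2000],
    "lot_size": [9800, 7400, 11500, 6300, 13200, 8200],
    "distance": [5.0, 8.5, 3.0, 11.0, 2.5, 7.0],
})

# Off-diagonal entries near +1 or -1 flag pairs of independent variables
# whose separate effects a regression will struggle to distinguish.
print(predictors.corr())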

How much of a problem is multicollinearity? That depends on what we are using the regression analysis for. If we're using

it to make predictions, multicollinearity is not a problem, assuming as always that the historically observed relationships

among the variables continue to hold going forward.

In the house price example, we'd keep the house size variable in the model, because its presence improves adjusted R-

squared and because our judgment would suggest that house size should have an impact on price separate from the effect

of lot size.

If we're trying to understand the net relationships of the independent variables, multicollinearity is a serious problem that

must be addressed. One way to reduce multicollinearity is to increase the sample size. The more observations we have, the

easier it will be to discern the net effects of the individual independent variables.

We can also reduce or eliminate multicollinearity by removing one of the collinear independent variables. Identifying

which variable to remove requires a careful analysis of the relationships between the independent variables and the

dependent variable. This is where a manager's deep understanding of the dynamics of the situation becomes invaluable.

In our home price example, we'd expect both house size and lot size to have significant and discernable effects on the price

of a home: a shack on an acre of land should cost less than a mansion on a similar property.

To better understand the net effect of house size, we should probably gather a larger sample to reduce multicollinearity. If

we didn't expect lot size and house size to have distinct effects on price, we might remove house size from the equation.

Summary

Multicollinearity occurs when some of the independent variables are strongly interrelated: distinguishing the respective

effects of some of the independent variables on the dependent variable is not possible using the available data.

Multicollinearity is typically not a problem when we use regression for forecasting. When using regression to understand

the net relationships between independent variables and the dependent variable, multicollinearity should be reduced or

eliminated.

Lagged Variables

In the Silverhaven real estate example, we looked at a number of houses and collected data on four characteristics: price,

house size, lot size, and distance from the city center. These are cross-sectional data: we looked at a cross-section of the

Silverhaven real estate market at a specific point in time.

A time series is a set of data collected over a range of time: each data point pertains to a specific time period. Our

EasyMeat data is an example of a time series — we have data on sales and advertising levels for ten consecutive years.

Sometimes, the value of the dependent variable in a given period is affected by the value of an independent variable in an

earlier period. We incorporate the delayed effect of an independent variable on a dependent variable using a lagged

variable.

The effects of variables such as advertising often carry over into later time periods. For example, last year's EasyMeat

advertising levels may continue to affect this year's sales.

To study this carry-over effect, we add the lagged variable "previous year's advertising." Our regression now has two

independent variables: the current year's advertising level and last year's advertising level.

To run a regression on the lagged EasyMeat advertising variable, we first need to prepare the data for the lagged variable:

we copy the column of advertising data over to a new column, shifted down by one row.

The first and last data points draw attention: the first has all the necessary data except a lagged advertising value, and the

last has a lagged value but no other information. Since we need observations with data on all variables, we are forced to

discard the first observation as well as the extraneous piece of information in the lagged variable column.

In effect, by introducing a lagged variable, we lose a data point, because we have no value for "previous year's advertising"

for the first observation.
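For readers working outside Excel, the same shift-and-discard preparation is a few lines in Python with pandas. This is a sketch with illustrative numbers, not the actual EasyMeat data.

import pandas as pd

df = pd.DataFrame({
    "year":        list(range(2001, 2011)),
    "sales":       [1.2, 1.4, 1.3, 1.7, 1.9, 1.8, 2.1, 2.3, 2.2, 2.5],
    "advertising": [0.3, 0.4, 0.3, 0.5, 0.6, 0.5, 0.7, 0.8, 0.7, 0.9],
})

# Shift advertising down one row so that each row also carries the previous
# year's advertising level.
df["advertising_lag1"] = df["advertising"].shift(1)

# The first observation has no lagged value, so it is discarded -- this is
# the lost data point described above.
df = df.dropna()
print(df)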

We run the regression as we would any ordinary multiple regression. The equation we obtain is:

EasyMeat Sales Regressions

Comparing the output from the regressions with and without the lagged variable, we first notice that the number of

observations has decreased from 10 to 9. Does the addition of the lagged advertising variable improve the regression?

The lagged variable decreases adjusted R-squared from 84.11% to 76.62%. Moreover, the lagged variable's high p-value —

0.3305 — tells us that its effect is too insignificant for us to distinguish from the current year's advertising's effect. We

conclude that the addition of the lagged variable has not improved the regression.

We can introduce variables with even longer lags. We might try to use advertising data from two or more years in the past

to help explain the present sales level. However, for each additional period of lag we lose another data point.

Running the EasyMeat regression with the current year's advertising, last year's advertising, and the advertising level from

two years ago leaves only eight observations. We lose predictive power: adjusted R-squared drops from 76.62% to 62.47%.

Moreover, none of the coefficients is significant at the 0.05 level.

Adding a lagged variable is costly in two ways. The loss of a data point decreases our sample size, which reduces the

precision of our estimate of the regression coefficients. At the same time, because we are adding another variable, we

decrease adjusted R-squared.

Thus, we include a lagged variable only if we believe the benefits of adding it outweigh the loss of an observation and the

"penalty" imposed by the adjustment to R-squared.

Despite these costs, lagged variables can be very useful. Since they pertain to previous time periods, they are usually

available ahead of time. Lagged variables are often good "leading indicators" that help us predict future values of a

dependent variable.

Summary

We can use lagged variables when data consist of a time series, and we believe that the value of the dependent variable at

one point in time is related to the value of an independent variable at a previous point in time. Lagged variables are

especially useful for prediction since they are available ahead of time. However, they come at a cost: we lose one

observation for each time series interval of delayed effect we incorporate into the lagged variable.

Dummy Variables

So far, we've been constructing regression models using only quantitative variables such as advertising expenditures,

house size, or distance. These variables by their nature take on numerical values.

But many variables we study are qualitative or categorical: they do not naturally take on numerical values, but can be

classified into categories.

For example, in our house price example, a categorical variable might be the home's primary construction material: wood,

brick, or straw. Assigning numerical values to such categories doesn't make sense — wood cannot be described as a brick

plus some number. How do we incorporate categorical variables into a regression analysis?

Let's look at an example from Julius Tabin's EasyMeat business. We have found that EasyMeat sales are determined to a

large extent by how much he spends on advertising. Another factor affecting sales is which flavor of EasyMeat Julius

features in his advertising campaigns.

Julius produces both EasyMeat Classic, made from beef for the most part, and EasyMeat Poulk!, a pork and poultry blend.

Hoping to match the right spreadable meat flavor to the mood of the nation, Julius features Poulk! in his ad campaigns in

some years. In other years, Classic takes the lead role.

To study the effects of Julius' flavor-of-the-year choice, we use a type of categorical variable called a dummy variable. A

dummy variable takes on one of two values, 0 or 1, to indicate which of two categories a data point falls into.

This dummy variable — we'll call it "Poulk!" flavor — is set to 1 for years when EasyMeat ads feature Poulk!, and 0 for years

when they feature Classic.

Run the regression on the data, with the sales as the dependent variable, and advertising and the Poulk! flavor dummy

variable as the independent variables. What is the coefficient for the Poulk! flavor variable? Enter the coefficient for the

Poulk! flavor variable as an integer (e.g., "5"). Round if necessary.

EasyMeat Sales Regressions

We find a coefficient of 533,024 for Poulk! flavor. What does this coefficient tell us?

The coefficient 533,024 tells us that after controlling for the level of advertising expenditures, average sales are $533,024

higher when Poulk! is the flavor-of-the-year, rather than EasyMeat Classic.

This regression model can be expressed graphically: essentially we have two parallel regression lines: one for years in

which Classic is promoted, and one for "Poulk! Years." The vertical distance between the lines is the average increase in

sales Julius would expect if he chooses to feature Poulk! in a given year, after controlling for advertising level.

The slope of each line is the same: it tells us the average increase in sales when Julius spends another dollar in advertising

after controlling for which flavor is featured.

In this example, we arbitrarily chose to set the dummy variable equal to zero for the Classic flavor, i.e. to make Classic the

"base case" flavor. If we made Poulk! the base case flavor, we would obtain exactly the same graph. The coefficient on flavor

now would be -533,024, but it would again indicate that Julius should expect average sales to be $533,024 lower in the

years that his ads feature Classic rather than Poulk!.

As for any independent variable in a regression, we can test a dummy variable's significance by calculating a p-value for its

coefficient. When the p-value is less than 0.05, we can be 95% confident that the coefficient is not zero and that the dummy

variable is significantly related to the dependent variable.

If Julius produced and promoted more than two flavors, we'd need another dummy variable for each additional flavor. In

general, we'd need one fewer dummy variable than the number of flavors: one variable for each of the flavors except for

Classic, the base case.

For each year that a given flavor is featured, its dummy variable is set to one, and all other flavor variables are set to zero.

In a "Classic Year," all dummy variables are set to zero.

Why don't we have exactly as many dummy variables as categories? Suppose we have just two categories — Poulk! and

Classic — and we tried to use separate dummy variables for each — P and C. In a Poulk! Year, P = 1 and C = 0. In a Classic

Year, P = 0 and C = 1. The second dummy variable would be completely redundant: since there are only 2 flavors, if P = 0

we know the flavor is not Poulk!, so it must be Classic.

Technically, including both dummy variables would be problematic because they would be perfectly correlated: whenever P

= 1, C = 0, and whenever P = 0, C = 1. Perfectly correlated variables are perfectly collinear: we gain nothing by including

both and must drop one to avoid crippling multicollinearity.
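In pandas, leaving out the base case is built into the dummy-encoding helper. A minimal sketch with illustrative data:

import pandas as pd

flavors = pd.Series(["Classic", "Poulk!", "Poulk!", "Classic", "Poulk!"])

# drop_first=True omits the alphabetically first category ("Classic"),
# making it the base case and avoiding perfect collinearity.
dummies = pd.get_dummies(flavors, drop_first=True, dtype=int)
print(dummies)  # a single "Poulk!" column: 1 in Poulk! years, 0 otherwise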

It is useful to note that running a simple regression analysis with only a dummy variable is equivalent to running a

hypothesis test for the difference between the means of two populations. In this case, one population would be sales in

Poulk! years, and the other would be sales in Classic years. From the data we find the mean and standard deviation of sales

in Poulk! and Classic years. The hypothesis test gives us a p-value of 0.43, so we cannot conclude that the means differ

when the feature flavor is different.

A simple regression of sales on the dummy variable flavor gives us the same result. The p-value on the dummy variable is

0.45, telling us that the flavor dummy variable is not statistically significant. The p-values differ slightly for the two

approaches due to differences in the way Excel calculates the terms, but the tests are conceptually equivalent.
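To see the t-test side of this equivalence concretely, here is a minimal sketch in Python using scipy, with illustrative sales figures rather than the actual EasyMeat data; regressing the same figures on a 0/1 flavor dummy yields the same p-value, up to the kind of small computational differences noted above.

import numpy as np
from scipy import stats

# Illustrative sales figures, split by which flavor was featured that year.
poulk_sales   = np.array([2.1, 2.4, 1.9, 2.6, 2.2])
classic_sales = np.array([1.8, 2.3, 2.0, 1.7, 2.1])

# Pooled-variance two-sample t-test for a difference between the group means.
t_stat, p_value = stats.ttest_ind(poulk_sales, classic_sales)
print(t_stat, p_value)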

Summary

We can use dummy variables to incorporate qualitative variables into a regression analysis. Dummy variables have the

values 0 and 1: the value is 1 when an observation falls into a category of the qualitative variable and 0 when it doesn't.

For qualitative variables with more than two categories, we need multiple dummy variables: one fewer than the number

of categories.

With one eye on the lookout for signs of lurking multicollinearity and two new types of variables in your toolbox, you set out

to settle the staffing problem for good.

You might have noticed that the number of arrivals follows an annual seasonal pattern. Arrivals tend to drop off during the

late summer, surge again in October, and drop very low for the rest of the year.

They pick up briefly in February, but the tourist business slows through the spring. During the early summer, vacationers

start arriving in droves, with arrivals peaking in June or July.

Perhaps we can use the seasonality of arrivals in some way. We might come up with a lagged variable that functions as a

proxy for the current month's arrivals. Then we can run a regression with the lagged variable. Since the values of a lagged

variable would be based on historical data, Leo would know them ahead of time.

Given that arrivals follow this seasonal pattern, which of the following variables is likely to be a good proxy for this month's

arrivals on Kauai?

You run the simple regression of Kahana occupancy versus 12-month lagged arrivals.

What is the lowest level at which the "lagged arrivals" variable is significant?

Kahana Occupancy Regressions

The regression output shows a p-value of 0.000 for the lagged arrivals coefficient, far below the significance level of 0.01.

At 55%, the R-squared for the simple regression on lagged arrivals is much lower than for the simple regression on current

arrivals, 80%. The adjusted R-squared is only 53%. Still, lagged arrivals have substantial predictive power. Run a multiple

regression of occupancy versus lagged arrivals and advance bookings.

What is the adjusted R-squared for this multiple regression? Enter the adjusted R-squared as a decimal number with two

digits to the right of the decimal point (e.g., enter "50%" as "0.50").

Kahana Occupancy Regressions

The adjusted R-squared of 60% for the multiple regression is higher than the 53% for the simple regression. But is that the

best you can do? For the moment, you report your analysis to Alice.

This model with the lagged variable is more useful to Leo, even if its predictive power is still pretty low. But let's look at these

data on Leo's competition that I dug up. Maybe we can use them to give us an even better model.

Leo's competitor, Knut Steinkalt at the Hotel Excelsior, frequently launches promotion campaigns: in some months, Steinkalt

slashes room prices dramatically.

These data show the months in which the Excelsior's promotions took place in the last three years. We can use a dummy

variable to see how the competition's promotions have affected Leo's occupancy.

The Excelsior has to advertise these promotional packages at least a month in advance to attract customers. As long as Leo

keeps an eye on the Excelsior's advertising, he'll be alerted to these promotions in enough time to take them into account

when he makes staffing decisions for the following month.

Using Alice's research, you run a simple regression of occupancy versus the promotions. Then you run the multiple regression

of occupancy versus advance bookings, lagged arrivals, and Excelsior promotions.

Suppose Leo has used the multiple regression equation to predict July's occupancy using a given level of advance bookings

and last July's arrivals. Just before he makes his staffing decisions, he learns that, unexpectedly, his rival Knut has cut the

Excelsior's room prices for July. Leo should revise his predicted occupancy by

Kahana Occupancy Regressions

The gross effect of Excelsior promotions on Kahana occupancy, a reduction of 52 guests, is not relevant here since we know

the advance bookings and lagged arrivals for July are still the same as they were before the Excelsior launched its campaign.

Leo wants to use the net effect of the promotion on his occupancy levels, assuming advance bookings and lagged arrivals

remain fixed — an average reduction in occupancy of 60 guests.

Kahana Occupancy Regressions

This is good work. Naturally, I'd like to have even greater predictive power than 84%, but I realize you aren't psychics. This

model will really help me when I hire staff. Thanks!

I have some good news of my own. Mr. Pitt agreed not to file suit against me! I did have to promise him an extended stay in

the Kahana's penthouse, free of charge.

He's coming next month — there's one occupant I don't need to use statistics to predict. This time, I'll serve the bisque myself.

Linda Szewczyk, marketing director of Amalgamated Fruits Vegetables & Legumes (AFV&L), is researching the nation's

fruit consumption habits. In particular, she would like greater insight into household consumption of the kiwana, a cross-

breed of kiwis and bananas that AFV&L pioneered.

Naturally, one important determinant of household consumption is the size of the household — the number of members.

Since AFV&L has positioned the kiwana as a "high end" fruit, Linda believes that household income may also influence its

consumption.

Run a multiple regression of household kiwana consumption versus household size and income. Make note of important

regression parameters such as R-squared, adjusted R-squared, the coefficients, and the coefficients' significance. The

income variable has a coefficient of 0.0004. Can a variable with such a small coefficient be statistically significant?

The independent variable, income, is statistically significant since its p-value is less than 0.05, the most common level of

significance. The small coefficient tells us that for every additional $10,000 of income, average kiwana consumption

increases by 4 lbs. a year.
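In other words, the expected increase is 0.0004 lbs per dollar × $10,000 = 4 lbs of kiwanas per year.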

To date, AFV&L has focused its marketing campaigns on high-income, highly educated consumers. Linda would like to

deepen her understanding of how the educational level of the household members might affect their appetite for kiwanas.

To incorporate education into her kiwana consumption analysis, Linda separated the households in her data set into three

categories based on the highest level of education attained by any member of the household — no college degree, college

degree but no post-graduate degree, and post-graduate degree. She represents these categories using two dummy variables

— "college only" and "post-graduate."

Controlling for household size, income, and post-graduate degree, how many more pounds of kiwanas are consumed in a

household in which the highest educational level is a college degree, compared to a household in which no one holds a

college degree?


Kiwana Consumption Data

Kiwana Consumption Regressions

The coefficient for the dummy variable "college only," 51.63, tells us the expected difference in kiwana consumption for "college

degrees only" households compared to the excluded educational category: households in which no one holds a college

degree. The coefficient describes the net relationship between "college degrees only" and household kiwana consumption,

controlling for household size, income, and post-graduate degree.

Controlling for household size and income, how many more pounds of kiwanas are consumed in a household in which the

highest educational level is a post-graduate degree, compared to a household in which the highest educational level is a

college degree? Enter the difference in consumption between the two households as a decimal number with two digits to

the right of the decimal point (e.g., enter "5" as "5.00"). Round if necessary.

Kiwana Consumption Regressions

When you control for household size and income, college degree households consume 51.63 lbs more than non-college

households. Post-graduate degree households consume 51.95 lbs more than non-college households. In other words, post-

graduate households consume 0.32 lbs more than college households.
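To make the dummy-variable mechanics concrete, here is a minimal Python sketch of how a regression like Linda's could be run with pandas and statsmodels. The file and column names are hypothetical, and the coefficient values in the comments are the ones reported above.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("kiwana_consumption.csv")  # hypothetical file and column names

# Encode the three education categories as two dummy variables, leaving
# "no college degree" as the excluded baseline category.
df["college_only"] = (df["education"] == "college").astype(int)
df["post_graduate"] = (df["education"] == "post_graduate").astype(int)

X = sm.add_constant(df[["household_size", "income", "college_only", "post_graduate"]])
model = sm.OLS(df["kiwana_lbs"], X).fit()
print(model.summary())  # R-squared, adjusted R-squared, coefficients, p-values

# Net post-graduate vs. college-only difference, controlling for size and income
# (51.95 - 51.63 = 0.32 lbs in the course data).
print(model.params["post_graduate"] - model.params["college_only"])
```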

Kiwana Consumption Regressions

For this exercise, refer to the regression analyses you ran in exercise 1 of the previous section.

Empire Learning Regressions

Bill Hartborne, the CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it will take

his team to create a new e-learning course. He is using data on previous courses Empire created, with the number of pages

and the total animation run-time as independent variables.

Empire Learning Regressions

Bill believes that the number of illustrations used in the course may also have a significant impact on the number of labor-

hours it takes to complete an e-learning course. He wants to add the number of illustrations to the model as another

independent variable.

Empire Learning Regressions

Run the multiple regression of labor-hours versus number of pages, illustrations, and animation run-time.

Empire Learning Regressions

A common symptom of multicollinearity is a high adjusted R-squared — in this case 94% — accompanied by one or more

independent variables with low significance. Here, the coefficient for the number of illustrations is not significant at

the 0.05 level, and the p-value for the number of pages has risen to 0.0291, up from 0.0015 in the regression without

illustrations.


Empire Learning Data

Empire Learning Regressions

Multicollinearity occurs when the respective effects of two or more independent variables on the dependent variable are

not distinguishable in the data. This can be the result of correlated independent variables. The fact that the p-value for the

number of pages rises when we add the illustrations raises our suspicions that the number of illustrations and the number

of pages might be correlated.

We can compute the correlation between the number of pages and the number of illustrations: the correlation coefficient,

67%, is fairly high.

We could also attempt to diagnose the cause of the multicollinearity by running a regression of labor-hours versus number

of pages and number of illustrations - omitting animations. Here, the significance of illustrations is extremely low, with a

p-value of 0.85.

In the regression of labor-hours versus number of illustrations and run-time of animations - omitting pages - the respective

effects of the independent variables on the dependent variable can be distinguished. Here, the p-values for both variables

are much lower than 0.05. All of the evidence points to a linear relationship between number of pages and number of

illustrations as the culprit for the multicollinearity.
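Here is a minimal Python sketch of how this diagnosis could be reproduced with pandas and statsmodels. The file and column names are assumptions for illustration, not the course's actual labels.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("empire_learning.csv")  # hypothetical file and column names

# Correlation between the two suspect predictors (about 0.67 in the course data).
print(df["pages"].corr(df["illustrations"]))

def pvalues_for(predictors):
    """Regress labor-hours on the given predictors and return the p-values."""
    X = sm.add_constant(df[predictors])
    return sm.OLS(df["hours"], X).fit().pvalues

# Watch how each variable's significance shifts as predictors are added or dropped.
print(pvalues_for(["pages", "animation_runtime"]))                   # original model
print(pvalues_for(["pages", "illustrations", "animation_runtime"]))  # all three
print(pvalues_for(["pages", "illustrations"]))                       # omitting animations
print(pvalues_for(["illustrations", "animation_runtime"]))           # omitting pages
```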

Bill wants to use the regression analysis to predict the number of labor-hours it will take to complete a new e-learning

course. Comparing the two regression models - one with all three independent variables, one without illustrations - which

should Bill use?

Empire Learning Regressions

Decision Analysis

Introduction

Three more days before you leave Hawaii. As excited as you are by your imminent debut at business school, you don't look

forward to your departure. You find yourself weighing the relative merits of business school and just staying on Kauai. Leo has

an opening in the restaurant...

The shrill ring of the telephone interrupts your reverie. Sounding excited even by Leo standards, the hotelier demands your

immediate attention in his office.

I've had an exciting business venture on my mind for some time: I want to run a floating restaurant. It's going to be amazing,

offering spectacular views of volcanic silhouettes and the bright lights of resort towns, the best Hawaiian culinary artistry,

tastefully alluring hula dancing and music, ...

Slow down, slow down, Leo. I'm interested to hear your idea, but let's focus on the big picture first. A floating restaurant?

I want to run a world-class restaurant on a yacht. Chez Tethys, I'll call it. I've had the idea for some time. Chez Tethys will be

Hawaii's most luxurious dining experience. The yacht will travel from island to island, docking in a different harbor each

night.

Seating will be limited, and you can board only at selected times. The challenge alone of making a reservation will make it

irresistible. It would become a matter of prestige to include a meal Chez Tethys on any trip to the islands.

The Chez Tethys has been on my mind since before I inherited the Kahana. I'd been looking for investors, but without luck.

Then the Kahana fell into my lap, and the hotel has preoccupied me since then.

I'm beginning to feel confident about running the hotel — thanks to you two, I should add. Now I feel like I'm ready to expand

my operations. In addition to its own profits, just think of the prestige a high-profile floating restaurant would add to the

Kahana!

Listen, Leo, this is an intriguing proposition, but I'm not fully convinced that a floating restaurant would really catch on. Let's

go over your business plan, and then we'll think about how likely it is that the Tethys will be a profitable venture. One quick

question: How do you intend to finance the Tethys?

I'm way ahead of you. I've already been to the bank, and I think I can take out a loan using the Kahana as collateral. I have a

really good feeling about the Tethys. I'm pretty confident that I can make a floating restaurant work.


"Leo's Chez Tethys idea is out of the ordinary, but very interesting," Alice muses. "But we'll need to guide him through the

decision to expand his operations so extravagantly, especially since he's putting the Kahana at risk. We wouldn't want him to

make a choice he'll regret."

S&C Films, a small film production company, just purchased a script for a new film called Cloven. Cloven's unusual plot has

the makings of a cult classic: a pastoral romance shattered by a livestock revolution and the creation of a poultry-dominated

totalitarian farm state. S&C's owner and producer, Seth Chaplin, enthusiastically pitches the film: "It's Animal Farm meets

Casablanca."

Seth entered into negotiations with two major studios — Pony Pictures and K2 Classics — and hammered out a prospective

deal with each of them.

Pony Pictures liked the script and agreed to pay S&C $10 million to produce Cloven, covering the production costs and a

profit margin for S&C. Seth estimates Cloven would cost S&C $9 million to produce. Under this agreement, Pony would

acquire all rights to the movie.

The second deal, negotiated with K2 Classics, stipulates that K2 would market and distribute the movie, and S&C would cover

the production costs itself. Under this agreement, K2 and S&C would split ownership of the movie. Seth believes that the

movie has huge potential. Even with partial ownership, S&C would reap enormous profits if Cloven becomes a blockbuster.

This deal specifies that K2 would collect all revenues until its marketing and distribution costs are covered, then take 35% of

any further revenues. S&C would collect the remaining 65%. If Cloven is a blockbuster, S&C will make a killing. If it flops,

S&C will lose some or all of its production investment.

Decision-making is an essential management responsibility. Managers are charged with choosing from multiple courses of

action that can each lead to radically different consequences. Seth's decision about how to finance Cloven's production might

lead to anything from blockbuster profits to financial disaster.

When faced with a choice of actions and a range of possible outcomes, decision making can be difficult, in large part due to

uncertainty. Although the course of action we choose influences the outcome, much of what happens is not only beyond our

control, but beyond our powers to predict with certainty.

Nonetheless, decisions like Seth's that involve uncertainty can be analyzed logically and rigorously: decision analysis tools

can help managers weigh alternative options and make informed and rational choices.

In a world of uncertainty, applying decision analysis will not guarantee that each decision you make will lead to the best

result, or even to a good result. But if you apply effective decision analysis consistently over the course of a managerial career,

you are almost certain to gain a reputation for sound judgment.

Summary

Decision analysis is a set of formal tools that can help managers make more informed decisions in the face of uncertainty.

Although applying even the most rigorous decision analysis does not guarantee infallibility, it can help you make sound

judgments over the course of your career.

Decision Trees

"The problem with Leo's floating restaurant is that its success hinges on so many factors," Alice tells you. "Some, Leo can

predict, like the approximate price of a yacht, or the cost of labor. But what about consumer demand? Truth be told, not even

Leo knows whether the Chez Tethys will catch on. Leo has to incorporate this uncertainty into his decision-making process."

Producer Seth Chaplin boasts that when Cloven is released in theaters, it will become an instant classic. But even an incurable

enthusiast like Seth would never claim that success is certain, and if asked how confident he is that Cloven will achieve at

least modest success, he might reply with a percentage: "80%."

Uncertainty clearly has a quantitative dimension: we can distinguish between degrees of uncertainty.

But our intuition about uncertainty is often underdeveloped. How do we measure uncertainty consistently and

systematically?

Probability is a measure of uncertainty. A meteorologist reports the probability of rain in her forecast: "We'll see cloudy skies

tomorrow, with a 20% chance of rain."


Probability is measured on a scale from 0 to 1, or 0% to 100%. Events that are impossible are said to have zero (or zero

percent) probability, and events that are absolutely certain are said to have a probability of one (or 100%). But how do we

interpret probabilities of 20% or 37.8%?

Let's look at a very simple and familiar uncertain event: a spin of a wheel of fortune. Our wheel consists of two equally-sized

areas: an orange half and a green half. Spin the wheel a few times and count the number of green and orange outcomes you

get.

Since the two areas are of equal size, we'd expect "greens" and "oranges" to occur with equal frequency. About half of the

spins will end up "green" and half will end up "orange." Of course, if we only spin a few times, we won't expect exactly half of

the spins to result in "green." But the more times we spin, the closer we expect the percentage of greens will be to 50%. This is

the probability of getting a "green" each time we spin.

This interpretation of probability is a relative frequency interpretation. We have a set of events — in our case, spins — and

a set of outcomes — "green" or "orange." We form the ratio of the number of times a particular color occurs to the total

number of spins. The probability of that particular color is the value that ratio approaches as the total number of spins

approaches infinity.
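A short simulation makes the relative frequency interpretation concrete: the fraction of "green" spins drifts toward the true probability as the number of spins grows. This is a minimal sketch in Python, not part of the course materials.

```python
import random

def green_fraction(n_spins, p_green=0.5):
    """Spin the 50/50 wheel n_spins times; return the fraction of greens."""
    greens = sum(random.random() < p_green for _ in range(n_spins))
    return greens / n_spins

for n in (10, 100, 10_000, 1_000_000):
    print(n, green_fraction(n))  # settles toward 0.5 as n increases
```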

Let's look at a different wheel. This wheel is also divided into two areas, but now the green area is larger: it covers three

quarters of the wheel.

What is the probability of a spin of this wheel resulting in the outcome "green"?

Since the green area covers three quarters of the whole wheel, we can expect that three out of four spins will result in a green

outcome. This corresponds to a probability of 75%.

For simple uncertain events like a wheel of fortune game, figuring out the probabilities of the outcomes is easy. There are a

limited number of possible outcomes and there are no hidden factors affecting the outcome. Besides, if we had any difficulty

believing that the probability of a "green" is 75%, we could simply spin the wheel over and over again to calculate an

approximation of the probability.

For more complex uncertain events like the weather or the success of a new venture, finding the probabilities of the outcomes

is more difficult. Many factors can affect the outcome; indeed, we may not even know all the possible outcomes. Furthermore,

there may be no simple experiment we can repeatedly run to generate approximations of the relative frequencies. How

should we think about probabilities in these situations?

Let's think about the weather. Anytime you leave your house, you face the decision of whether or not to take an umbrella. If

the sky is perfectly blue, you probably won't take one. Nor will you do so if there are a few white clouds in the sky.

On the other hand, if the sky is completely overcast, you might consider taking some kind of rain protection, and if it's already

raining, you almost certainly would. You'd be able to give a rough estimate of the probability of rain; for instance if it's

overcast, you might estimate a 40% chance of rain.

This estimate may be rough, but it isn't completely arbitrary, and it's clearly a better estimate than either 0% or 100%. Your

subjective estimate may be based on an informal assessment of relative frequency. The estimate of "40%" might be

shorthand for the reflection: "In my past experience, it has rained on a little under half the days when it has been this

overcast in the morning at this time of the year, in this geographic location."

Although you probably never sat down and tabulated the days according to season, atmospheric conditions, and rainfall, your

informal assessment is powerful enough to make a reasonable choice about taking your umbrella, though sometimes you may

make the wrong choice.

Similarly, when Seth Chaplin gives his estimate of an 80% chance of Cloven's success, he is probably articulating the

following: "In my experience, about four out of five movies of a similar genre, with a similar budget, released under similar

conditions were successful."

In the examples we've discussed thus far, our uncertainty about the events we faced has stemmed from the fact that the

events had not yet happened. Sometimes we face uncertainty for a different reason: an event may have already occurred, but

we simply don't know what the outcome was. Let's return to the wheel of fortune to see how we might think about uncertainty

in these cases.

Imagine you and a friend spin the wheel. You close your eyes and keep them shut while your friend observes the wheel. While

the wheel is spinning, both you and your friend are uncertain about the outcome, and you would each assess the probability

of "green" at 75%.

When the wheel stops, the result becomes definite. Your friend knows the outcome: for her, the probability of "green" is now

either 0% or 100%. But as long as your eyes are closed, you remain uncertain about the outcome.

If someone promised to give you $10 if you correctly guessed the outcome of that spin before you opened your eyes, which

color would you choose?


From your limited point of view, the best choice you can make is "green," even though the outcome may already be a definite

"orange." If you played the same game over and over again and always chose "green," you would win three quarters of the

games. The probability of the outcome turning out to have been "green" when you open your eyes is 75%.

Even though the event has already occurred, the outcome is uncertain from your point of view. It makes sense to measure the

uncertainty due to a lack of information the same way we measure the uncertainty due to our inability to predict the

outcomes of future events.

Summary

Uncertainty makes decision-making challenging. We can be uncertain about outcomes that have or have not occurred. To

make the most informed decisions, we quantify our uncertainty using probability measures. Probability is measured on a

scale from 0% to 100%: events with 0% probability are impossible; events with 100% probability are certain. An event's

probability can often be determined by observing the relative frequency of its occurrence within a set of opportunities for

the event to occur. For events for which relative frequencies are difficult to assess, we often make subjective estimates of

the probabilities. Though not based on "hard" data, these estimates are often a sufficient basis for sound decision making.

Alice whips out a notepad and pencil, and begins drawing something that vaguely resembles a saguaro cactus. "Let's look at

how we can analyze and structure decisions like Leo's."

Graphical representations are often a highly efficient way to organize and convey information. Scatter plots and histograms

efficiently communicate distributions of data; an organizational chart can quickly outline the structure of a company; and pie

charts effectively express information about proportions and probabilities.

Is there such a thing as a graphical tool that helps inform and organize the decision-making process?

The answer is yes, and the graphical tool is called a decision tree. Let's look at a simple decision tree.

Seth Chaplin's business decision — whether to produce Cloven for Pony Pictures or in partnership with K2 Classics —

involves two alternatives. At some point, Seth must make a decision. Until he does, there are two paths along which history

can unfold. Cloven will either be owned by Pony or co-owned by S&C Films and K2 Classics. A different branch of the tree

represents each alternative.

Decisions aren't the only points at which alternatives can branch off. Events with uncertain outcomes also lead to branching

alternatives. Once released in theaters, Cloven could be a "Blockbuster" hit, have a fair, but "Lackluster" performance, or

"Flop" completely. Each of these outcomes corresponds to a separate branch on the decision tree.

Decision trees have two types of branching points, or nodes. At decision nodes, the tree branches into the alternatives the

decision-maker can choose. We use square boxes to represent decision nodes.

At chance nodes, the tree branches because the uncertainty of the business world permits multiple possible outcomes.

We represent chance nodes by circles.

We arrange the nodes from left to right in the order in which we will eventually determine their results. In Seth's example, the

first thing to be determined is his own decision about which deal to choose. The next thing to be determined is Cloven's box

office performance.

One of the challenges of creating a decision tree is determining the correct sequence of nodes: in what order will the relevant

outcomes be determined? Which events "depend" on the occurrence of other events? What alternatives are created or

foreclosed by prior decisions or by the outcomes of uncertain events?

Drawing a decision tree forces us to delineate each alternative and clarify our assumptions about those alternatives. Thus,

even before we use a tree to make a decision, simply structuring the tree helps us clarify and organize our thoughts about a

problem. A decision tree is also useful in its own right as a tool for communicating our understanding of a complex situation

to others.

Categorizing branching points as decision or chance nodes makes clear which events and outcomes are within our power and

which are beyond our control.

Writing down alternatives in a systematic fashion often allows us to think of ideas for new alternatives. For instance, after

viewing the decision tree, Seth might realize that he would prefer a deal that pays part of his production costs but still allows

a small ownership stake as opposed to either of the deals he has hammered out with Pony and K2.

Summary

Decision trees are a graphical tool managers use to organize, structure, and inform their decision-making. Decision trees

branch at two types of nodes: at decision nodes, the branches represent different courses of action the decision-maker can


choose. At chance nodes, the branches represent different possible outcomes of an uncertain event - at the time of the

initial decision, the decision maker does not know which of these outcomes will occur (or has occurred). The nodes are

arranged from left to right, in the order in which the decision-maker will determine which of the possible branches actually

occurs.

Laying out a decision tree — mapping a logical structure and identifying the alternative scenarios — is an essential first step

in the decision-making process. But how do we compare different scenarios? On what criteria might we base our

preference for one scenario over another?

In the business world, success is often measured in profits. Seth Chaplin will consider Cloven a success if it brings in at

least a modest profit. How should Seth incorporate possible profit levels into his decision analysis?

Seth's decision tree contains four possible scenarios: four unique paths from his current decision node on the left to the

ultimate outcomes on the right. He must assign to each scenario a dollar amount — S&C's profits. For example, if Seth

chooses to produce Cloven for Pony Pictures, his expected profits will be $1 million.

What if Seth chooses the K2 deal? If he has a significant ownership stake in the movie, his profits will depend on Cloven's

box office success. With a "Blockbuster" movie come "Blockbuster" profits: Seth estimates around $6 million.

If Cloven's performance is "Lackluster," Seth thinks the Cloven project would break even for S&C Films: S&C's expected

profits would be roughly $0. If Cloven "Flops," S&C's production costs will exceed its revenues from the movie, leading to

an expected loss of $2 million.

Are these three the only possible outcomes? Clearly, no. S&C's profits on the release of Cloven could be $6.6 million or

$15.03. In fact there is a range of outcomes stretching from profits in the hundreds of millions of dollars — the historical

upper limit of movie profits — to a loss of $9 million — the amount S&C will invest in production.

Although it is mathematically possible to analyze the full range of possible outcomes, the procedure for doing so is usually

more complicated and time consuming than is warranted for many decision problems. In practice, considering only a few

representative scenarios as we have done in the Cloven example will typically lead to good decisions.

When we use a scenario to represent a range of possible outcomes, the outcome figure we assign to that scenario should

represent the weighted average of all outcomes in the range that scenario represents. For example, in the Cloven decision,

$0 million might represent the weighted average of all possible profits from -$3 million to +$3 million.

How can Seth use his decision tree and estimated profit values to choose the better deal?

If Seth chooses the partnership with K2, S&C could make $6 million. Clearly, that scenario is preferable to the scenario in

which S&C simply sells Cloven to Pony for a profit of $1 million.

If S&C retains part ownership of Cloven and it "Flops" in the theaters, S&C stands to lose $2 million. That scenario is

clearly worse than the scenario in which S&C sells Cloven to Pony.

If Cloven is highly likely to "Flop" in theaters, then Seth should choose the Pony deal; if it's highly likely to be a

"Blockbuster" he should choose the K2 deal. Clearly, the likelihood of the different outcomes in the K2 deal should weigh

heavily in Seth's decision.

To each branch emanating from a chance node we must associate a probability: the probability of that outcome occurring.

These probabilities may be based on historical data or on our best judgment of the likelihood of each outcome.

Based on the quality of the script and his years of experience in the movie industry, Seth estimates the probability of

Cloven becoming a "Blockbuster" at 30%, the probability of a fair, but "Lackluster" performance at 50%, and the

probability of a "Flop" at 20%.

If S&C sells Cloven to Pony, the level of box office success is largely irrelevant to Seth. S&C's profits are certain to be $1

million.

When assigning outcomes and probabilities to chance nodes, we must meet two requirements. First, outcomes that branch

off from the same chance node must be "mutually exclusive." Thus, for example, two outcomes emanating from the same

node cannot both occur at the same time: the occurrence of one outcome excludes the occurrence of any other.

Second, the set of outcomes that branch off from the same chance node must be "collectively exhaustive": the branches

must represent all possible outcomes.

In practice, we typically don't depict every possible outcome separately. For Cloven, we've reduced the vast range of

possible financial outcomes to a manageable set of three representative outcomes. However, we consider these three

outcomes collectively exhaustive in that they represent three ranges that together exhaust all possibilities.


Also, we usually consider extremely unlikely but not impossible events to be included in one of the existing branches, or if

sufficiently unlikely, to be irrelevant to decision-making. In the Cloven example, we don't include a branch representing

the simultaneous destruction of all existing copies of the Cloven film.

When we construct a chance node, we must make sure that the outcomes emanating from that node are mutually exclusive

and collectively exhaustive. Since it is certain that one and only one of the possible scenarios will occur, the individual

probabilities must add up to 100%.

Summary

For a decision tree to effectively inform a decision, it must incorporate two types of relevant data: endpoint values

corresponding to each scenario and the probabilities of the possible outcomes of each uncertain event. An outcome value

is associated with a scenario - a unique path from the first node on the left of the tree to an endpoint on the right. We

place the appropriate probability on each branch emanating from a chance node. Outcomes represented by branches

emanating from the same chance node must be mutually exclusive and collectively exhaustive.

Ted Nesbit is a project manager at Tardytech, a software development company. Ted is planning the project schedule and

budget for the development of a new piece of custom business software for a Fortune 500 client.

The client has imposed a strict deadline for completion of the project. However, based on past experience with similar

clients, Ted knows that there is a significant probability that Tardytech will not be able to meet its deadline due — in most

cases — to client-side delays.

Experience indicates that 56% of all Tardytech projects are completed on time, 42% are delayed due to client-side delays,

and 2% are delayed due to errors and process failures at Tardytech.

What is the probability that Tardytech's new project will not be completed on time?

The total probability of a delay is the sum of the probabilities of a client-side delay and a Tardytech-side delay: 44%.

A potential new client has asked Tardytech to report the probability that a project will not be delayed due to Tardytech

errors or process failures. What probability should Ted report?

Only 2% of all projects are delayed due to problems caused by Tardytech. The remaining projects are either completed on

time (56%) or delayed by client-side delays (42%). The probability that a project will not be delayed by Tardytech is the

sum of the probabilities of those outcomes, 98%.

Another potential client has asked Tardytech to report the percentage of all delayed projects whose delays could be

attributed to Tardytech's errors and process failures. What percentage should Ted report?

In the past, 44 out of 100 projects were not completed on time. On average, 2 of those 44 delayed projects were delayed

due to Tardytech errors and process failures. Thus, the proportion of Tardytech-caused delays among all delays is

2/44, or 4.5%. We'll discuss and analyze similar questions in greater depth in the next unit.

As a loan officer at the First Bank of Silverhaven, Carla Wu determines whether or not to grant loans to small business loan

applicants.

Of 108 recent successful small-business loan applicants with a full year of payment history, 28 were unable to meet at least

one loan payment in the first year they had an outstanding loan.

What is the probability that a randomly selected small business in this pool missed a payment in the first year of the loan?

Enter the probability as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50").

Round if necessary.

The probability that a randomly selected small business missed a payment in the first year of the loan is simply the ratio of

the number who missed at least one payment to the total number of loans, i.e., 25.9%.

Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add to her

current fleet.

If she leases the truck, she may be able to generate new business that would increase her company's profits. However, if

additional business fails to materialize, she may not be able to cover the incremental leasing costs.


Based on her assessment of future trends in the transportation sector, Robin identifies three scenarios that might

transpire: "Boom," "Moderate Growth," and "Slowdown," depending upon the performance of the economy and its impact

on the transportation sector. She associates with each scenario an estimate of the total profits an expanded fleet will

generate for her firm.

If she doesn't lease the new truck, Robin probably won't generate as much revenue, but her leasing costs will be lower.

Again, she breaks down the possible outcomes if she does not lease the additional truck into three scenarios corresponding

to the performance of the economy. Then she associates an estimate of total firm profits with each scenario.

The first branching of the tree occurs at the point when Robin makes her decision: to lease or not to lease a new truck. A

square decision node should represent this decision.

Each of Robin's options splits into three possible outcomes depending on the performance of the economy. Robin will learn

which economic scenario transpires only after making her decision. The second branching represents this uncertainty. The

nodes of these branches are chance nodes and should be drawn as circles.

You and a friend have a wheel of fortune that has a 75% chance of a "green" outcome and a 25% chance of a "red" outcome.

Before you spin, the friend offers you $100 if you correctly predict the outcome. If you choose the wrong color, you receive

nothing. The good news: you don't have to choose until the spin is complete. The bad news: you will have to keep your eyes

closed until the wheel has stopped and you have made your prediction.

You close your eyes and keep them shut while the wheel spins. When the wheel stops, your eyes are still closed. You now

must decide whether to choose "red" or "green."

The nodes of a decision tree are arranged from left to right in the order in which we discover their results, not in the order

in which the events actually occur. Thus, even though the wheel stops and the result is finalized before you make your

decision, because you don't know the result until after your decision is made, the chance node for the spin result should

appear after — to the right of — the decision node.

Alice's decision trees are neat devices to help organize and structure a decision problem. But how are they going to help Leo

evaluate his options and choose the best one?

Seth Chaplin must decide how to produce the movie Cloven. He has mapped out the logical structure of his decision in a

decision tree and has incorporated the appropriate data, evaluating each scenario in terms of its expected profits and its

probability of occurring. Now, how should he use the tree to inform his decision?

If S&C Films works in partnership with K2 and retains part ownership of Cloven, S&C could earn a profit of $6 million, a

highly desirable outcome. However, that scenario is relatively unlikely: 30%. How do we balance the high value of that

outcome against its low probability?

The answer is elegant and simple: we multiply the outcome value by its probability: $6 million * 0.3 = $1.8 million.

Essentially, we "credit" the "Blockbuster" outcome with only 30% of its value.

Similarly, we credit a "Lackluster" performance of $0 profits with only half of its value.

The magnitude of the loss incurred if the film "Flops" is mitigated by its low likelihood of 20%.

Then we add these weighted values together, for a total of $1.4 million. This total is called the expected monetary value

(EMV) of the K2 option. The EMV is a weighted average of the expected outcomes of the scenarios in which S&C retains part

ownership of Cloven as stipulated in the K2 deal.
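The same calculation takes only a few lines of Python. This sketch restates the probability-weighted sum above, with outcome values in $millions:

```python
# (probability, outcome value in $millions) for each branch of the K2 chance node
k2_branches = [(0.30, 6.0),    # Blockbuster
               (0.50, 0.0),    # Lackluster
               (0.20, -2.0)]   # Flop

emv = sum(p * value for p, value in k2_branches)
print(emv)  # about 1.4, i.e., $1.4 million
```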

Let's return to our wheel of fortune game to build our intuition for the concept of the EMV. This wheel consists of three areas:

"blue," "green," and "red." The blue area is 30% of the total area. If a spin results in "blue," you win $6.

"Green" covers 50% and "red" covers 20% of the wheel's area. If a spin results in "green," you gain nothing at all, if the result

is "red," you lose $2.

Suppose you play the game 100 times. How much money do you think you will have gained or lost over 100 spins? What do

you estimate will be your average yield per game?


About 30% of the time, a spin will result in "blue." In other words, you can expect 30 of the 100 spins to yield $6.

About 50% of the time a spin will result in "green." You expect 50 of the 100 spins to yield $0. And about 20% of the time a

spin will result in "red" — that is, you expect 20 of the 100 spins to cost you $2.

The total amount you can expect to win after playing the game 100 times is $140, and the average yield per spin is $1.40.

That average is the expected monetary value (EMV) for a single spin.

The expected monetary value for a single spin is $1.40. If you spin the wheel once, which of the following results is least

likely to occur as the outcome?

The probability of each outcome is shown below. With a probability of 0%, $1.40 is the least likely outcome.

It is important to understand the nature of the expected monetary value. We do not actually "expect" an outcome of $1.40. In

fact, $1.40 is not even a possible outcome. The EMV of $1.40 is the long-term average value of the outcomes of a large

number of spins.
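A quick simulation illustrates this point: the average payoff over many simulated spins converges to the EMV of $1.40, even though no single spin ever pays that amount. A minimal sketch:

```python
import random

def spin():
    """One spin of the wheel: blue 30% (+$6), green 50% ($0), red 20% (-$2)."""
    r = random.random()
    if r < 0.30:
        return 6
    elif r < 0.80:
        return 0
    return -2

n = 100_000
print(sum(spin() for _ in range(n)) / n)  # close to 1.40
```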

In the Cloven case, the EMV of $1.4 million is the average amount of profits Seth Chaplin can expect to make when he

produces similar films in similar circumstances.

We can use the EMV as a measure with which to compare alternative options. First, we calculate the EMV for each chance

node, beginning at the right of the tree. For the chance node associated with the K2 deal, the EMV is $1.4 million.

Now that we have calculated the EMV of the K2 chance node, we can "collapse" the branches emanating from the chance

node to a single point. Going forward, we can treat the EMV of $1.4 million as the endpoint value of the K2 option.

The EMV for the Pony Option is simply $1.0 million: the outcome value multiplied by its probability of 100%.

At a decision node, we choose the best EMV of all the branches emanating from that decision node. In our Cloven example,

the best EMV is the one with the highest expected profits. The EMV of the K2 deal, $1.4 million, exceeds $1.0 million, the

EMV of the Pony deal. Selecting the option with the best EMV and removing all other options from consideration is known as

"pruning" the tree.

Any decision tree — no matter how large or complex — can be analyzed using two simple procedures. At each chance node,

calculate the EMV, collapse the branches to a point, and replace the chance node with its EMV. At each decision node,

compare the EMVs and prune the branches with less favorable EMVs. This entire process is known as folding back the

decision tree.
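The fold-back procedure is mechanical enough to express in code. Below is a minimal Python sketch; the nested-tuple representation of the tree is chosen here for illustration, not something the course prescribes, and it assumes higher EMVs are better, as in the Cloven profit example.

```python
def fold_back(node):
    """Return the EMV of a (sub)tree by folding back from right to left."""
    kind = node[0]
    if kind == "leaf":                      # ("leaf", endpoint value)
        return node[1]
    if kind == "chance":                    # ("chance", [(prob, subtree), ...])
        return sum(p * fold_back(sub) for p, sub in node[1])
    # ("decision", [subtree, ...]): keep the best EMV, prune the rest
    return max(fold_back(sub) for sub in node[1])

# Seth's Cloven decision, with outcome values in $millions:
cloven = ("decision", [
    ("leaf", 1.0),                          # Pony deal: certain $1.0M profit
    ("chance", [(0.30, ("leaf", 6.0)),      # K2 deal: Blockbuster
                (0.50, ("leaf", 0.0)),      #          Lackluster
                (0.20, ("leaf", -2.0))]),   #          Flop
])
print(fold_back(cloven))  # about 1.4: the K2 branch has the better EMV
```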

Summary

We often use expected monetary value (EMV) to quantify the value of uncertain outcomes. The EMV is the sum of the

values of the possible outcomes of an uncertain event after each has been weighted by its probability of occurring. The

EMV can be interpreted as the expected average outcome value of the uncertain event, if that uncertain event were

repeated a large number of times. To analyze a tree, we "fold it back": we move from right (the future) to left, finding the

EMV for each node. For chance nodes we calculate the EMV as described above. For decision nodes we simply choose the

option with the best EMV — lower costs or higher profits — among the choices represented by a decision node's branches

and prune the others.

Relevant Costs

Burning to use decision trees and EMVs, you start to draw up the square and circular nodes that make up Leo's Chez Tethys

decision. Alice, however, urges caution: "Before you get too trigger-happy with those decision trees, you should be aware of a

few common pitfalls."

Jen Amato has been driving "Millie," an old jalopy of a car, for the past five years. Millie — an Oldsmobile Delta-88 — is

twenty years old. A week ago, the air conditioner broke down, and Jen had it replaced at a cost of $500.

Now, the drive train is worn out. Replacing it will cost $1,200. If she doesn't repair it, Jen can sell Millie "as is" for about

$300. Jen must decide: should she sell her car or should she have the drive train replaced?

If Jen sells her car now, she will not even recoup her $500 investment in the air conditioner. If she has the drive train

replaced, her beloved Millie will last a little longer, and Jen will be able to enjoy the cool air she spent so much money on.

How should the fact that Jen hasn't had the opportunity to benefit from her air conditioner investment affect her decision?

With a broken drive train, Millie is worth about $300. According to her mechanic, if Jen pays $1,200 to replace the drive

train, Millie will be worth approximately $1,100 in terms of resale value. In the analysis of Jen's decision, what role should

her $500 investment in the air conditioner play?

In any scenario in which Jen sells her car, the $500 air conditioner cost has already been incurred, so it could be included in

the total cost.


Likewise, the air conditioner repair costs are part of the prehistory of any scenario in which Jen has Millie's drive train

replaced. Here, too, the $500 investment in the air conditioner could be included as a cost.

However, when we compare the total costs of the two options, we recognize that the $500 does not contribute to a

difference in the outcome values. Because the $500 cost is incurred in both cases, it plays no role in the comparison of

the two options, and thus is irrelevant to Jen's decision — she will have spent the $500 no matter what she decides to do now.

Costs that were incurred or committed to in the past, before a decision is made, contribute to the total costs of all scenarios

that could possibly unfold after the decision is made. As such these costs — called sunk costs — should not have any bearing

on the decision, because we cannot devise a scenario in which they are avoided.

It isn't wrong to include a sunk cost in the analysis as long as it is included in the value of every outcome. However, including

sunk costs distracts from the differences between scenarios — the relevant costs. Imagine the complexity of Jen's tree if she

included every sunk cost she's incurred from owning Millie — from the original purchase price of the car to the cost of all the

gasoline she's pumped into Millie over the years, to her expenditure on a dashboard hula dancer.

Misinterpreting a sunk cost as a cost that weighs on only some of the scenarios is a common error. After sinking $500 of

repairs into her car, selling it for $300 will be quite painful for Jen. Nonetheless, good decisions are made based on possible

future outcomes, not on the desire to correct or justify past decisions or mistakes.

Another common decision-making error is to omit from the analysis relevant costs that should have a bearing on the

decision. If Jen has Millie's drive train replaced, the repair costs are just one of many costs involved with that option. Given

Millie's age, there is a high likelihood of another repair cost soon. Similarly, if Jen sells Millie now, she will have costs

associated with buying a new car or arranging other transportation services.

Opportunity costs are an important cost category that decision makers often neglect to include in their analyses. For

example, selling the car will require Jen to devote 10 hours of her time that she would otherwise devote to her part-time job

paying $12 per hour. Thus, Jen should add $120 in opportunity costs to the outcomes of the "sell Millie" decision.

We should also take non-monetary costs into account. Jen will feel sad at leaving her trusted vehicle and companion on many

a road trip behind. Although such costs can be difficult to quantify, they should not be neglected. We will see shortly how to

use sensitivity analysis to incorporate non-monetary costs into a decision analysis.

Summary

Among the most common errors in decision analysis is the failure to properly account for the costs involved in different

possible scenarios. On the one hand, relevant costs such as opportunity costs or non-monetary consequences are often

omitted. On the other hand, irrelevant costs such as sunk costs are incorrectly included in the analysis. Sunk costs are costs

that were incurred or committed to prior to making the decision and cannot be recovered at the time the decision is being

made. Since these costs factor into any possible future outcome, they can be safely omitted from the analysis; sunk costs

must never be included in only selected branches of a decision tree.

Time Horizons

Jen has decided to replace her old car, Millie the Oldsmobile. Her friend, Sven, is leaving the country for two years and

doesn't want to pay to store his Mazda Miata. He offers Jen two options.

In the first option, Jen leases the car, paying Sven $700 each year. When Sven returns from abroad, he reclaims his car. In

the second option, Jen buys the car outright for $4,000. How should Jen compare these two options?

What is the appropriate cost difference Jen should use to compare the two options Sven has offered?

To begin, note that in the first option, the costs are spread across two years, whereas in the second option, Jen makes a

single payment when she closes the deal with Sven. To compare the options meaningfully, we must define a time horizon

for this problem, that is, we must compare their costs over a common time period. In this case, two years is a convenient

time horizon.

Next, note that we cannot directly compare $1,400, the simple sum of the two lease payments made at two different times,

to the $4,000 one-time cost of the purchase option paid entirely at the beginning of the first year. We can only compare

costs when they are valued at the same point in time.

We'll compare the present value of the costs associated with each option — that is, the value of the costs at the time Jen

makes her decision. In order to compare the costs, we first need to convert the second installment of the leasing payment

into its present value.

Jen currently has sufficient cash for either alternative in an investment account with a 5% rate of return. What is the

present value of the second installment of $700 paid under the leasing option?

The present value of the second installment of $700 is $666.67. At the end of one year, $666.67 residing in her investment

account today will have increased in value to $666.67*(1.05) = $700, the amount she must pay for the second lease


installment.

The present value of the total cost of the leasing option is $1,366.67. Can we use this number to compare the two options?

Although $1,366.67 and $4,000 are now comparable costs because they are given at their present value, we have not yet

considered the fact that under the purchase option, Jen will own the Miata at the end of the two-year period. In two years,

Jen expects that the Miata's market value will be around $3,000. The value of an asset at the end of the time horizon is

typically called its terminal value.

What's the net present value of the cost of the purchase option - that is, the present value of future cash flows after

subtracting, or "netting out," the initial payment to Sven? Recall that Jen currently has her money invested in accounts that

earn a 5% average annual return.

If Jen resells the car in two years, she would receive $3,000. The present value of $3,000 received at the end of two years is

$2,721.09.

The net present value of the cost of the purchase option is $1,278.91.

If cost is the only deciding factor in Jen's choice, she'd save $87.76 if she chose to purchase the Miata. Before finalizing her

decision, Jen should make sure to weigh other considerations relevant to the decision, such as whether or not she will need

to borrow money to cover other expenses this year if she spends $4,000 to buy the Miata now, what the borrowing rate is,

etc.
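The entire lease-versus-purchase comparison fits in a few lines of Python; this sketch simply restates the arithmetic above at Jen's 5% rate of return:

```python
rate = 0.05

# Lease: $700 now plus $700 in one year, discounted to present value.
lease_pv = 700 + 700 / (1 + rate)             # about 1366.67

# Purchase: $4,000 now, less the discounted $3,000 terminal value in two years.
purchase_npv = 4000 - 3000 / (1 + rate) ** 2  # about 1278.91

print(lease_pv - purchase_npv)                # about 87.76 in favor of purchase
```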

Summary

When we make a decision, we need to choose a time horizon over which we quantify the outcomes of our decision. To

compare monetary values that take place at different times, we must account for the time value of money by comparing

cash flows at their values at a common point in time. Generally, we compare the present values — or net present values

— of different outcomes by discounting future cash flows at the appropriate discount rate. Terminal values of assets held

at the end of the time horizon must be determined and discounted to their present value.

"Now let's find Leo, and get to work on his decision problem," Alice says, snapping her laptop shut.

I calculated that if the floating restaurant idea really takes off, I'd make over $2 million in profits over the next five years. And

that's after I subtract my initial investment, including the purchase of the ship.

How certain are you that the Tethys will be the success you envision?

Gosh, I don't know. It's almost like a flip of a coin to me — chances are about fifty/fifty that the Tethys will be a big hit.

That sounds a little too optimistic. Let's take a close look at your business plan.

The three of you spend the rest of the day in energetic discussion and research. Finally, you agree on two representative

scenarios that might occur if Leo launches the Chez Tethys: "Phenomenon" or "Fad."

If the Tethys becomes a cultural "Phenomenon" with staying power, Leo can expect $2 million in profits over five years, in

terms of net present value. Leo grudgingly agrees that the likelihood of this scenario is only 35%.

Alternatively, dining Chez Tethys might become a passing "Fad" for a couple of years, then be replaced by "The Next Big

Thing." In that case, Leo would face substantial losses, estimated to have a present value of about $800,000. Leo grants that

his brainchild has a 65% chance of just being a fad.

Enter the EMV in $millions as a decimal number with three digits to the right of the decimal point (e.g., enter ''$5,500,000''

as ''5.500''). Round if necessary.

The expected monetary value of launching the Tethys is the outcome value of the "Phenomenon" scenario weighted by its

probability added to the outcome value of the "Fad" scenario weighted by its probability: $2 million * 0.35 + (-$0.8 million) *

0.65 = $0.18 million, or $180,000.

Hmm. I can see how this analysis is helpful. I'm glad that my expected profits are positive, but somehow I'm not satisfied. I

don't feel very comfortable with some of our estimates.

Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add to her

current fleet.


Robin identifies three states of the transportation sector that might occur: "Boom," "Moderate Growth," and "Slowdown."

Her firm's profits depend on whether she leases an additional truck and on the state of the sector: she associates an

estimate of total firm profits with each of the six scenarios.

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as

''5.5''). Round if necessary.

The EMV of the decision to lease the truck is $31,000. Weigh each of the scenario's outcomes by the probabilities that they

will occur, then add the weighted values.

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as

''5.5''). Round if necessary.

The EMV of the option to not lease the truck is $18,200. Weigh each of the scenario's outcomes by the probabilities that

they will occur, then add the weighted values.

Jari Lipponen of the Silverhaven Home for Abandoned Miniature Horses (SHAMH) needs funds to maintain operations.

He can either apply for a government grant or run a local fundraiser, but the demands on his time are too high for him to

be able to do both.

Jari believes he has a 90% chance that he will win a grant of $25,000 if he submits the grant application. Grants in this

category are for a fixed amount, so if he loses the grant, he'll have no money to run the SHAMH.

Based on his past fundraising experience, he estimates that if he runs a local fundraiser, he has a 30% chance of raising

$30,000 and 70% chance of raising $20,000.

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as

''5.5''). Round if necessary.

To find the EMV of launching the fundraiser, weigh the $30,000 outcome value by its probability of 30% and add that to

the $20,000 outcome value weighted by its probability of 70%: $30,000 * 0.3 + $20,000 * 0.7 = $23,000.

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as

''5.5''). Round if necessary.

To find the EMV of applying for the grant, weigh the grant award amount of $25,000 by the probability of winning it (90%)

and add that to the value of losing the grant ($0) weighted by the probability of losing (10%): $25,000 * 0.9 + $0 * 0.1 = $22,500.

The EMV of the fundraiser option is $23,000, higher than the $22,500 EMV of applying for the grant. Based on this analysis, Jari

should organize a fundraiser.

Marsha Ratulangi is the chief operating officer at Gaiacorps, a non-profit organization dedicated to preserving natural

habitats around the world. Gaiacorps' IT hardware is aging, and Marsha must decide whether to extend the current lease

on Gaiacorps' IT desktop computers or purchase new ones.

The cost of the lease extension and the future maintenance costs of the current hardware are incurred only in the scenario

in which Gaiacorps extends its current lease. The purchase price of the new hardware is incurred only in the scenario in

which Gaiacorps purchases new hardware. Which of the three costs are incurred depends on which option Marsha chooses,

so all three are highly relevant to her decision.

The cost of the memory card upgrade and the maintenance costs previously invested in the current hardware are sunk

costs that have already been incurred.

Marsha's salary is not a sunk cost, but is incurred in all possible scenarios. Neither of the options Marsha is considering

will affect the amount of these three costs (the memory card upgrade, the past maintenance, and her salary), so all three

are irrelevant to the decision.


In contrast, the cost of disposing of the new hardware is relevant, since it is incurred only in the case that Marsha buys new

desktop computers.

Under pressure from his company's ad hoc advisory board to lower operating expenses, the CEO of Empire Learning, Bill

Hartborne, is considering canceling the biweekly cleaning service for the company offices. Each cleaning engagement costs

$75.

Instead of hiring a cleaning service, Bill could simply clean the office himself whenever a client visits the office. The

probability of exactly one client visiting the office in a given month is 25%, and the probability that two clients will visit is

10%. The probability that no clients will visit is 65%.

Bill draws up the structure of his decision in the tree depicted below. Given a one-month time horizon, what is your best

estimate of the EMV of the cost of not hiring a cleaning service?

Instead of cleaning the office, Bill could be creating value for his company by making sales, networking, boosting employee

morale, or simply increasing his productivity by napping on the office sofa. We need to calculate the opportunity cost of

the time Bill would spend cleaning the office, so we need to know how long he spends cleaning, and how highly his time is

valued.

Assuming that it always takes Bill 2 full hours to clean Empire Learning's offices, and that Bill's time is valued at

$200/hour, what is the EMV of Bill cleaning the office?

Enter the EMV in dollars as an integer (e.g., enter ''$5.00'' as ''5''). Round if necessary.

The EMV of Bill cleaning the office is $180: each cleaning takes 2 hours at $200/hour, or $400 per client visit, so the expected monthly opportunity cost is $0 * 0.65 + $400 * 0.25 + $800 * 0.10 = $180. A biweekly cleaning service costs about 2 * $75 = $150 per month, so based on this analysis, Bill should continue employing the cleaning service.

Val Purcell, CEO of a supply-chain management consulting firm Purcell & Co., must decide whether or not to put in a bid

for a contract to re-engineer the supply chain of a potential new client: the Eris Shoe Company.

Creating the bid will cost $16,000 in Val's time and legal fees. If his bid beats out the competition, Val expects the contract

to return profits of $100,000 — from which the cost of preparing the bid has not yet been subtracted. Val believes he has a

20% chance of winning the bid.

Eris will pay the consulting fee and Val will accrue his estimated $100,000 profits upon completion of the project, which is

scheduled for one year from now.

Under these terms, what is the expected monetary value of creating and submitting the bid?

The answer cannot be determined without knowing Purcell's discount rate. If Purcell wins the bid, it won't receive its

consulting fee until completion of the project one year from now. Cash flows related to the contract should be discounted at

Purcell's discount rate. For simplicity assume that the cost/profit figures represent cash outflows and inflows, and that

Purcell's discount rate is 15%.

Enter the EMV in $thousands as a decimal number with two digits to the right of the decimal point (e.g., enter ''$5,500'' as

''5.50''). Round if necessary.

To find the EMV of creating the bid, first calculate the present value of the cash inflow from Purcell's profits on the

contract, to be received a year from now upon completion of the project.

To determine the net present value of winning the bid, we subtract the $16,000 bid preparation costs.

Then, use the probability of winning the bid to weight the EMVs of the "Win" and "Don't win" scenarios to determine the

EMV of placing the bid.

Before Val starts work on the bid, Eris decides to move the project's completion date to two years after the contract is

signed. Again assume that the cost/profit figures represent cash outflows and inflows, and that Purcell's discount rate is

15%. What is the EMV for submitting the bid now?

Enter the EMV in dollars as an integer (e.g., enter "$5.00" as "5"). Round if necessary.

To find the new EMV of putting in the bid, first calculate the present value of the cash inflow from Purcell's profits on the

contract, to be received two years from now upon completion of the project.

To determine the net present value of winning the bid, we subtract the $16,000 bid preparation costs.


Then, use the probability of winning the bid to weight the EMVs of the "Win" and "Don't win" scenarios and determine the

EMV of placing the bid. The EMV of putting in the bid is -$877. If Val's payments are delayed by another year, he cannot

expect the project to be profitable, so he should not submit a bid. Delaying the profits changes Val's optimal decision!
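The discounting in both versions of the bid decision can be checked with a short sketch (the function and variable names here are illustrative):

```python
# EMV of submitting the bid, with the $100,000 profit discounted to today.
profit, bid_cost = 100_000, 16_000
p_win, r = 0.20, 0.15  # probability of winning; Purcell's discount rate

def bid_emv(years_to_completion):
    pv_profit = profit / (1 + r) ** years_to_completion
    return p_win * pv_profit - bid_cost  # bid cost is paid win or lose

print(round(bid_emv(1)))  # 1391  -> EMV of roughly $1,390 ($1.39 thousand)
print(round(bid_emv(2)))  # -877  -> matches the -$877 above
```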

Cap Winestone of Universal Learning is preparing a bid for a new building to accommodate its expanding operations. As luck would have it, the headquarters of former competitor Empire Learning is for sale through a sealed-bid auction. The building is

well suited to Universal's needs, containing computing equipment, network infrastructure and other important e-learning

accessories.

Cap estimates the building is worth about $900,000 to him and is trying to decide what bid to place. To simplify his

decision, he narrows down his bid choices to four possible bids. He muses: "With lower bids I gain more value. If I bid

$600,000, I'll get a building worth $900,000 to me, so I gain $300,000 in value. In terms of the value I gain, the lower the

bid, the better."

"On the other hand," Cap continues, "with a low bid I'm not likely to win. From the point of view of winning the bid, the

higher the bid the better. How do I balance these two opposing factors?"

Cap lays out a decision tree with the possible bid amounts, the likelihood of winning for each bid, and the outcome values.

What amount should Cap bid?

Folding back the tree, we find the $800,000 bid gives the highest expected monetary value. The $800,000 bid best

balances the value gained against the probability of winning.

Sensitivity Analysis

Still in Leo's office after you initially calculated the expected monetary value of launching the Chez Tethys, you listen to Leo's

growing concerns about the estimates that inform his decision.

Now that you've mapped out the possible downside and its probability for me, I'm a little discouraged. Sure, the analysis

indicates that ventures like the Tethys will be profitable on average, but the fact that I have almost a two-thirds chance of an

$800,000 loss scares me.

What if the potential loss is even greater? Or, what if it's even more likely that the Chez Tethys will just be a passing "Fad"?

Both points are well taken, Leo. I'll tell you what: let's break for lunch, and then delve a little deeper into our analysis.

Over lunch, Alice comments on Leo's reaction to your analysis. "Leo's questions bring us to a crucial component of decision

analysis: sensitivity analysis."

Seth Chaplin has all but decided how to produce the film Cloven. Based on his initial analysis, he is inclined to produce

Cloven in partnership with K2 Classics, thereby retaining part ownership of the film. The EMV of the K2 option is $1.4

million. The EMV of the alternative option — in which Seth's company produces Cloven for Pony Pictures and relinquishes

ownership in the film — is $1.0 million.

But Seth's calculations were based on estimates. The probabilities and the outcome values he used in his analysis were

educated guesses. No matter how detailed and rigorous the methodology, a decision analysis is only as good as the data on

which it is based. What if Seth isn't completely certain about these data?

Seth is particularly unsure about the $6 million value he used to represent the profits associated with a "Blockbuster" film.

What would the EMV of the K2 option be if $4 million were a more representative value for "Blockbuster" profits?

Enter the EMV in $millions as a decimal number with one digit to the right of the decimal point (e.g., enter "$5,500,000" as "5.5"). Round if necessary.

If $4 million is the expected value of S&C's profits for the "Blockbuster" scenario, then the EMV of the K2 option drops to
$800,000, less than the $1 million EMV of the Pony option. Note that if the "Blockbuster" profit figure drops to $4 million,

Seth's optimal decision changes: he should now choose the Pony option. Seth has learned that his optimal decision is

sensitive to the figure he uses to represent S&C's profits if Cloven attains "Blockbuster" status.

If Seth's optimal decision switches when the "Blockbuster" profit figure drops to $4 million, he might reasonably wonder

what his decision would be for other values. What about $5 million? $4.5 million? How low would the "Blockbuster" profit

figure have to be to make the Pony option preferable to the K2 option in terms of the EMV?


At what "Blockbuster" profit figure does Seth's optimal decision switch?

Enter the profit figure in $millions as a decimal number with one digit to the right of the decimal point (e.g., enter "$5,500,000" as "5.5"). Round if necessary.

To answer that question, we first write the EMV of the K2 option with a variable, B, to represent "Blockbuster" profits in millions of dollars: using the 30% "Blockbuster" probability, EMV(K2) = 0.3B - 0.4.

Now, we compare the EMV of the K2 option to the EMV of the Pony option. Solving 0.3B - 0.4 > 1.0, we find that the K2 EMV is greater than the Pony EMV when the "Blockbuster" profit figure is greater than $4.67 million.

We call $4.67 million the breakeven value for "Blockbuster" profits: above it, the EMV criterion recommends the K2

option; below it, the EMV criterion recommends the Pony option.

Seth may not be sure if the "Blockbuster" profits are best represented by $6 million or $5 million or $4.8 million, but as long

as he is confident that the figure is greater than $4.67 million, he need not lose any sleep over finding a more accurate value.

Knowing that his estimates for the "Blockbuster" profit figure are firmly on one side of the breakeven value allows him to stop

worrying about the precise value of that number in the decision analysis, since knowing that number with greater precision

will not change his decision.

On the other hand, if Seth thinks the "Blockbuster" profit figure could be below $4.67 million, he may wish to invest

additional resources to find a more accurate figure to represent "Blockbuster" profits.

If Seth thinks the true number is around the breakeven value of $4.67 million, but isn't certain if it's slightly above or slightly

below that figure, he can also stop worrying about the precise value of the number. As long as the figure is around $4.67

million, Seth should be indifferent between the K2 and Pony deals, since the EMVs of the two options are about the same.

In fact, a good way to check that we have calculated a breakeven value correctly is to substitute the value into the EMV

calculation: if the options have the same EMV, the breakeven value is correct.
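As a sketch, the breakeven calculation and the check just described can be written out as follows; the relation EMV(K2) = 0.3B - 0.4 (in $millions) follows from the 30% "Blockbuster" probability and the $1.4 million EMV at B = $6 million:

```python
# Breakeven "Blockbuster" profit B for the K2 option (all values in $millions).
p_blockbuster = 0.30
emv_pony = 1.0
B0, emv_k2_at_B0 = 6.0, 1.4

c = emv_k2_at_B0 - p_blockbuster * B0        # non-"Blockbuster" contribution: -0.4
breakeven_B = (emv_pony - c) / p_blockbuster
print(round(breakeven_B, 2))                 # 4.67

# Check: at the breakeven value, the two options have equal EMVs.
assert abs((p_blockbuster * breakeven_B + c) - emv_pony) < 1e-9
```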

Once we calculate a breakeven value, we know whether or not expending additional time and other resources to find a more

accurate estimate is worthwhile. The breakeven value establishes a comfort zone: as long as we are confident that the value

we are estimating is within the zone, we can feel comfortable choosing the option recommended by our initial analysis, based

on our original estimate.

If we think the value we are estimating could be close to the breakeven value, we need to be more cautious. If the true value

we are trying to estimate could lie outside of the comfort zone, we might want to try to make our estimate of that value more

accurate before we reach a final decision.

How confident we are that the true value we are estimating lies inside the comfort zone given by the breakeven analysis is a

matter of judgment and experience. Sometimes, we might collect sample data to estimate an outcome value. In this case, we

should look closely at the variation in the data to see how widely and in what way the data can vary.

Calculating a breakeven value for data used in a decision analysis is called sensitivity analysis: for each estimated value in

the analysis, we check to see by how much it would have to change to affect our decision, assuming our estimates for all the

other data are correct.

Sensitivity analysis is an important and powerful tool for management decision-making. Managers who base decisions on an

initial analysis without performing sensitivity analysis on critical data risk lulling themselves into a false sense of security in

their decisions.

Summary

After completing an initial decision analysis, always conduct a sensitivity analysis for each outcome value estimate you are

uncomfortable with. First, calculate the outcome value's breakeven value: the value for which the EMV of the option

initially recommended by the decision analysis ceases to be the best EMV. The breakeven value defines a comfort zone: If

we believe that the actual outcome value might be outside that zone — thereby changing the optimal decision — we should

reconsider our analysis and refine our estimate of the outcome value in question.

During his negotiations with K2 and Pony, Seth realized that he did not particularly look forward to working with the K2

team. Based on past experience, he knows that interpersonal frictions can be highly frustrating and can make a

collaboration unpleasant. This frustration is clearly a cost — albeit a non-monetary cost — associated with the K2 option.

Sensitivity analysis can give us a "reality check" on how highly we value non-monetary consequences such as frustration,

reputation costs and benefits, and sentimental values. Although it may be difficult to assign a value to such consequences,

we can often answer questions about the most we'd be willing to pay to avoid (or to obtain) them by calculating a threshold

value.


Clearly, Seth wouldn't want to spoil any magical Hollywood days just to make an additional $5 in profits. But the more the

K2 option pays relative to the alternatives, the more willing Seth might be to suffer working with the K2 team. How can

Seth determine whether he should accept the K2 deal and bear the resulting interpersonal trials and tribulations?

Sensitivity analysis can help us analyze non-monetary consequences such as frustration. Let's use "F" to represent the

frustration cost (in $millions) Seth will incur if he has to work with the K2 team. Since frustration will occur in any

scenario involving the K2 option, $F million must be subtracted from all outcomes associated with the K2 option.

How large must the cost of frustration be for Seth's optimal decision to switch to the Pony option?

Enter F, the cost of frustration, in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary.

Subtracting $F million from the K2 outcomes changes the EMV of the K2 option to $1.4 million - $F million. For Seth's decision to

switch from the K2 deal to the Pony deal, the EMV of the K2 option, $1.4 million - $F million, must drop below $1.0

million. In other words, F must satisfy the inequality below.

In order to lower the EMV of the K2 option below the EMV of the Pony option, the cost of frustration would have to be

valued at $400,000 or more. Seth needs to ask himself if he would pay $400,000 to avoid the frustration of working with

K2. Sensitivity analysis provides a clear, monetary upper bound against which he can measure the strength of his feelings.

In the end, Seth decides that he can learn to love the K2 team for $400,000.
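A one-line version of this threshold calculation, for completeness:

```python
# Frustration cost F is subtracted from every K2 outcome, so EMV(K2) = 1.4 - F.
emv_k2, emv_pony = 1.4, 1.0          # $ millions
breakeven_F = emv_k2 - emv_pony      # decision switches when 1.4 - F < 1.0
print(round(breakeven_F, 2))         # 0.4 -> $400,000
```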

Summary

To incorporate non-monetary consequences into a decision, first find the option with the best EMV. Then, add the non-

monetary consequence to the outcome values of all scenarios affected by that non-monetary consequence, and calculate

the breakeven value for which the option recommended by the initial decision analysis ceases to have the best EMV. The

breakeven value defines a comfort zone: If we believe that the actual value of the non-monetary consequences might be

outside that zone — thereby changing the optimal decision — we should try to gain a firm estimate of the non-monetary

consequence's value.

Seth is somewhat unsure about his estimates for the probabilities of how successful Cloven will be. He is quite sure that the

probability of the film flopping is around 20%, but he's less sure about the probabilities of "Lackluster" and "Blockbuster"

performance levels. He wants to know how sensitive his decision to choose the K2 deal is to the values of these

probabilities.

Let's call the probability of a "Blockbuster" "p." Seth is confident that the probability of a "Flop" is 20%. What is the

probability of a "Lackluster" outcome?

Since the three outcomes are mutually exclusive and collectively exhaustive, their probabilities must add to 100%. Seth is

confident that the probability of a "Flop" is 20%, so he knows that the probabilities of the remaining two outcomes

("Blockbuster" and "Lackluster") must add to 80%. Thus, the probability of a "Lackluster" performance is 0.8 - p.

Using p to denote the probability of a "Blockbuster" outcome and 0.8 - p to denote the probability of a "Lackluster"

outcome, what is the EMV of the K2 option?

The EMV of the K2 option is p * $6 million - $0.4 million. For what values of p, the probability of Cloven becoming a

"Blockbuster" film, is the Pony option preferred to the K2 option on the basis of EMV?

The breakeven value for the probability of a "Blockbuster" hit is 23.3%: when the probability is above 23.3%, the EMV of

the K2 option is higher; when the probability is below 23.3%, the EMV of the Pony option is higher. If Seth is confident that

the probability of a "Blockbuster" is at least 23.3%, he can feel comfortable choosing the K2 option. He need not expend

additional effort trying to further refine his estimates for the probabilities of different levels of success.
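The same breakeven logic in a short sketch, using the EMV formula quoted above:

```python
# EMV(K2) = 6p - 0.4 (in $millions), from the formula above.
emv_pony = 1.0
breakeven_p = (emv_pony + 0.4) / 6   # solve 6p - 0.4 = 1.0
print(round(breakeven_p, 3))         # 0.233 -> 23.3%
```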

Decision-making is an iterative, multi-step process. When analyzing a decision, we should first construct and analyze a

decision tree based on our best estimates of the outcomes and probabilities involved. After reaching a tentative decision, it

is critical to scrutinize the data used in a decision analysis and conduct sensitivity analyses for each estimate that we feel

unsure about.

As long as we are comfortable that the true value we are estimating is within the range specified by the breakeven

calculation, we can confidently proceed with our decision. If not, we should focus our efforts on refining our estimates for

those values to which our decisions are most sensitive.

Finally, we should note that managers sometimes need to test the sensitivity of their decisions to two or more estimates

simultaneously. Sensitivity analysis techniques can be extended to address these situations; these techniques are beyond

the scope of this course.


Summary

After completing an initial decision analysis, always conduct a sensitivity analysis for each probability value you are

uncomfortable with. First calculate the probability's breakeven value: the probability for which the EMV of the option

initially recommended by the decision analysis ceases to be the best EMV. The breakeven value defines a comfort zone: If

we believe that the actual probability might be outside that zone — thereby changing the optimal decision — we should

reconsider our analysis and refine our estimate of the probability in question.

Sensitivity analysis helps managers cope with the uncertainty that surrounds the estimates they base decisions on. With your

new knowledge, you're ready to turn to Leo's problem.

Your initial analysis recommends that Leo launch his floating restaurant, but Leo has expressed some doubt about the

estimate of the losses he'd incur if the Tethys turns out to be a passing "Fad."

How high do the losses have to be in the "Fad" scenario to make launching the Tethys ill-advised in terms of EMV?

Launching the Tethys would be less attractive than not launching it if the EMV of launching is less than $0, the EMV of not

launching. Solving the inequality below, we find that the losses in the event that the Tethys is just a "Fad" would have to

exceed $1.08 million for Leo to abandon the project based on the EMV criterion.

Assume that, in fact, Leo's estimate — that he will incur an $800,000 loss if Chez Tethys turns out to be a "Fad" — is correct.

For what probability of a "Fad" does launching the Tethys cease to be a worthwhile venture, in terms of the EMV?

Launching the Tethys is less attractive than not launching it if the EMV of launching is less than $0, the EMV of not

launching. Solving the inequality below, we find that the probability of a "Fad" would have to be higher than 71% for Leo to

prefer to abandon the project based on the EMV criterion.
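Leo's decision tree isn't reproduced here, but both breakevens can be recovered with assumed inputs consistent with the results above: a 65% "Fad" probability (Leo's "almost two-thirds chance") and a $2 million profit if the Tethys succeeds. Both inputs are inferred, not stated:

```python
# Assumed (inferred) tree values: P(Fad) = 0.65, Success profit = $2 million.
p_fad, success_profit, fad_loss = 0.65, 2.0, 0.8   # $ millions

# Breakeven "Fad" loss: EMV of launching falls to $0, the EMV of not launching.
breakeven_loss = (1 - p_fad) * success_profit / p_fad
print(round(breakeven_loss, 2))   # 1.08

# Breakeven "Fad" probability, holding the $0.8 million loss fixed.
breakeven_p_fad = success_profit / (success_profit + fad_loss)
print(round(breakeven_p_fad, 2))  # 0.71
```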

Hmm, I'm fairly confident that our estimate of the possible loss is on target — certainly it won't be higher than a million

dollars! But the more I think about it, the more I wish we had a better estimate for the probability that the Tethys will be the

success I've always dreamed of.

I have some ideas for ways to get a better handle on the likelihood of the Tethys' success. But I need to make some phone calls

to see if I'm on the right track. Let's break for today and meet tomorrow morning.

Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add to her

current fleet.

Based on her assessment of future trends in the transportation sector, Robin identified three scenarios that might occur:

"Boom," "Moderate Growth," and "Slowdown," corresponding to the performance of the economy and its impact on the

transportation sector. She associated estimates of total firm profits with each scenario, and calculated the EMVs of

"leasing" and "not leasing" a new truck: $31,000 and $18,200, respectively.

Robin is afraid that she may have been a little overoptimistic in her estimate of $70,000 of total profits in the scenario in

which she "leases a new truck" and the transportation sector "booms." She wants to know how sensitive her decision is to

the outcome value of this scenario.

What must the value of total profits of the scenario in which Robin "leases the truck" and the transportation sector "booms"

be in order for the "don't lease truck" option to be the more attractive one?

For sufficiently low total profits in the "Boom" scenario, the EMV of leasing the truck is lower than the EMV of not leasing the truck. To find the breakeven value X, we solve the inequality between the two EMVs shown below. The inequality is solved when X, the value of total profits in the "lease truck" and "Boom" scenario, is below the breakeven value of $27,333.
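A sketch of the calculation; the 30% "Boom" probability is not stated above, but it is the value consistent with the $31,000 lease EMV and the $27,333 breakeven:

```python
# Breakeven "Boom" profit X for the "lease truck" option (values in dollars).
p_boom = 0.30                             # assumed, see note above
emv_lease, emv_no_lease = 31_000, 18_200
boom_profit = 70_000

rest = emv_lease - p_boom * boom_profit   # non-"Boom" contribution: 10,000
breakeven_X = (emv_no_lease - rest) / p_boom
print(round(breakeven_X))                 # 27333
```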

Val Purcell, CEO of the supply-chain management consulting firm Purcell & Co., must decide whether or not to put in a bid

for a contract to re-engineer the procurement process for a potential new client: the Eris Shoe Company.

Creating the bid will cost $16,000 in Val's time and legal fees. If he wins the bid, he will receive payment for the project

upon completion, two years after signing the contract. He has estimated the present value of the profits if he wins the bid at

$75,614. After factoring in the $16,000 cost of submitting the bid, the net present value of the profits if he wins the bid is

$59,614.

Val believes he has a 20% chance of winning the bid, so the EMV for submitting the bid is -$877: under these

circumstances, Val can't expect submitting a bid to be profitable. But Val hasn't been able to figure out how stiff the


competition for the contract is, and he's fairly uncertain about the probability of winning.

How high would the probability of winning the bid have to be to make Val change his mind and choose to put in a bid?

To change Val's decision, the EMV of bidding will have to be higher than the EMV of not bidding: $0. To find out how high

p, the probability of winning the bid, would have to be, we solve the inequality below. Note that the probability of losing the

bid (1 - p) must change as the probability of winning changes. The inequality is solved when the probability of winning the

bid is greater than 21.2%.
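The same threshold, sketched in one step from the figures above:

```python
# EMV of bidding = p * 75,614 - 16,000; it turns positive when p exceeds:
breakeven_p = 16_000 / 75_614
print(round(breakeven_p, 3))   # 0.212 -> 21.2%
```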

Jari Lipponen of the Silverhaven Home for Abandoned Miniature Horses (SHAMH) needs funds to maintain operations.

He can either apply for a government grant or run a local fundraiser, but the demands on his time are too high for him to

be able to do both.

Jari believes he has a 90% chance that he will win a grant of $25,000 if he submits the grant application. He estimates that

as a local fundraiser, he has a 30% chance of raising $30,000 and 70% chance of raising $20,000. The EMV of the

fundraiser option is $23,000, higher than $22,500, the EMV of applying for the grant. Based on the initial analysis, Jari

should organize a fundraiser.

Jari has expressed uncertainty about his estimate for the probabilities of the two levels of fundraising success. How low

would the probability of raising $30,000 have to be to change his decision? (Enter your answer as a decimal number with

two digits to the right of the decimal point)

For Jari to change his initial decision, the EMV of the fundraising option would have to be less than $22,500. That would

be the case if the probability of the high fundraising success — valued at $30,000 — were less than 25%.

Jari needs $20,000 to operate the SHAMH through the next year. Raising $25,000 or more would permit Jari to operate

the SHAMH and expand the capacity of the operation, thereby allowing even more abandoned miniature horses to be

saved. This year will be Jari's last running the SHAMH, and he really wants to expand its capacity before he leaves.

If he's unable to raise at least $25,000 to cover the SHAMH expansion, he'll be very disappointed. How highly would Jari

have to value his disappointment in order for him to prefer the government grant option that is more likely to secure funds

for expansion? Assume that his original probability assessments for the success of the fundraiser are correct.

If Jari is only moderately successful in his fundraising activities, he won't be able to expand the SHAMH on his watch. The

disappointment cost D should be subtracted from the outcome value for the scenario in which he takes in only $20,000 in

contributions. After incorporating the disappointment cost, the EMV of the fundraiser option becomes $23,000 - 0.7D.

However, if he applies for the grant and doesn't win it, he won't be able to expand, either. The disappointment cost D must

also be subtracted from the outcome value in the scenario in which he writes the grant but doesn't win it. After

incorporating the disappointment cost, the EMV of the grant-writing option becomes $22,500 - 0.1D.

The breakeven value is the value of disappointment for which the EMVs of the two options are equal. Thus, to find the

breakeven value of D, we must set up an inequality between the two EMVs and solve for D. In this case, the disappointment

cost would have to be greater than $833.33 in order for Jari to prefer the grant-writing option.
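Both of Jari's breakeven calculations, as a sketch:

```python
emv_grant = 0.9 * 25_000   # $22,500

# (1) Breakeven probability q of the $30,000 fundraising outcome:
#     q*30,000 + (1-q)*20,000 = 22,500
q = (emv_grant - 20_000) / (30_000 - 20_000)
print(q)                   # 0.25

# (2) Breakeven disappointment cost D:
#     23,000 - 0.7*D = 22,500 - 0.1*D
D = (23_000 - emv_grant) / (0.7 - 0.1)
print(round(D, 2))         # 833.33
```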

Decision Analysis II

Conditional Probabilities

After a Kahana breakfast of crab benedict, you meet once more with Leo.

So, I've been thinking: my decision is heavily dependent on the likelihood that the Chez Tethys will be a success. Couldn't we

do some market research to find out how interested our target market would be? Then I could make a better decision — one

that would be less risky!

Sure, Leo. But market research costs money. How much would you be willing to pay for information that would help you

better predict the success of the Chez Tethys?

Hmmm. Good question...frankly, I don't have a clue how to even start thinking about that. Can you two help me?

"Whatever market research Leo has in mind," remarks Alice, "it won't reveal with certainty how consumers will take to the

Tethys. In other words, we're uncertain about the Tethys' chance of success, and we'll also be uncertain about whether the market research we collect is accurate."


Pondering two layers of uncertainty, you begin to feel vertigo...

When analyzing a decision, we typically use estimates for the probabilities and financial implications of various outcomes

that could occur. Often, we can imagine additional information — scientific tests, market research data, or professional

expertise — that would help make our estimates more accurate. How much should we be willing to pay for this type of

information? And how do we incorporate new information into our decision analysis?

Before we answer these questions, we'll need to expand our understanding of probability and introduce the important

concepts of conditional probability and statistical independence. Let's look at an example first.

British automaker Chariot's most sought-after model is the Ben Hur. Consumers love the Bennie, as it's affectionately called.

It was offered as a limited edition: to date, only 1,000 Bennies are on the road in Britain. We'll take a closer look at two

properties of the Bennie: its "Color" and its "Stereo."

The Bennie comes in two color options: Burgundy and Champagne. Also, the Bennie is offered with a high-end car stereo

system by Sweetone. Let's look at a table of the 1,000 Bennies currently on the road and see how the two properties — "Color"

and "Stereo" — are distributed.

Of the 1,000 Bennies, 150 are Burgundy and have a Sweetone stereo. 600 are Burgundy and do not have the Sweetone stereo.

Furthermore, there are 50 Champagne Bennies with the stereo and 200 without.

Let's add another column to our table and fill in the total numbers of Burgundy and Champagne Bennies. To calculate these

numbers, we simply take the sums of the rows. For example, the number of Burgundy Bennies is simply the number of

Burgundy Bennies with the Sweetone stereo added to the number of Burgundy Bennies without.

Next, in the final row, we'll fill in the total numbers of Bennies with and without the Sweetone stereo. Here, we simply take

the sums of the two columns. In the bottom right cell, we enter the total number of cars: 1,000.

This number — 1,000 — should be equal to the sum of the numbers in the final row: the total number of Bennies with the

Sweetone stereo added to the total number without one. Also, it should be equal to the sum of the numbers in the final

column: the number of Burgundy Bennies plus the number of Champagne Bennies. Finally, since 1,000 is the total number of

all Bennies, it should be the sum of all the original numbers we entered into the table.

A Venn diagram is a useful graphical way to represent the contents of the table. The Burgundy rectangle on top represents the

set of all Burgundy Bennies. The Champagne rectangle below represents the set of all Champagne Bennies. The rectangles do

not intersect because a Bennie cannot be both Champagne and Burgundy.

We now add a patterned rectangle on the left to represent the set of Bennies with the Sweetone stereo. The area without the

pattern represents the set of Bennies that are not equipped with the Sweetone. These areas intersect the rectangles that

represent the distribution of "color": some Burgundy Bennies are Sweetone-equipped; some are not. Some Champagne

Bennies are Sweetone-equipped; some are not.

The sizes of the areas in the Venn diagram are directly proportional to the incidence of the different Bennie properties:

Burgundy/Champagne and Sweetone/no Sweetone. We use Venn diagrams to effectively communicate information about

sets of things — for example, Burgundy cars and Sweetone-equipped cars — and their interactions.

The table is a useful tool for calculating proportions of potential interest to managers. For instance, we can find the

proportion of Burgundy Bennies with the Sweetone stereo simply by locating the cell that contains their number and dividing

it by the total number of cars: 15%.

Or, to find the proportion of Burgundy Bennies in general, we find the cell that contains the total number of Burgundy

Bennies and divide it by the total number of cars: 75%.

In this way, we can create an entire table of proportions. These proportions can be interpreted as probabilities. For

example, since 15% of the Bennies on the road are Burgundy-colored and Sweetone-equipped, the probability that a

randomly selected Bennie will be Burgundy-colored and have the Sweetone is 15%.

The probability that a randomly selected Bennie will be Burgundy is 75%. Going forward, when talking about the table, we'll

use the words "proportion" and "probability" interchangeably.

The probabilities on the inside of the table are called joint probabilities: the probabilities of a single car having two

particular Bennie features, for example, Burgundy-colored and Sweetone-equipped. We'll denote joint probabilities in the

following way: P(Burgundy & Sweetone) is the probability that a Bennie is Burgundy-colored and Sweetone-equipped.

The probabilities of each "Color" or "Stereo" option occurring in a given Bennie — Burgundy or Champagne, Sweetone-

equipped or not Sweetone-equipped — are often called marginal probabilities. They are denoted simply as P(Burgundy),

P(Champagne), P(Sweetone), and P(no Sweetone).
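The Bennie probability table can be built directly from the counts above, as in this sketch:

```python
# Joint and marginal probabilities from the counts of the 1,000 Bennies.
counts = {("Burgundy", "Sweetone"): 150, ("Burgundy", "No Sweetone"): 600,
          ("Champagne", "Sweetone"): 50, ("Champagne", "No Sweetone"): 200}
total = sum(counts.values())                        # 1,000

joint = {k: v / total for k, v in counts.items()}   # joint probabilities
p_color = {c: sum(p for (col, _), p in joint.items() if col == c)
           for c in ("Burgundy", "Champagne")}      # row sums
p_stereo = {s: sum(p for (_, st), p in joint.items() if st == s)
            for s in ("Sweetone", "No Sweetone")}   # column sums

print(joint[("Burgundy", "Sweetone")])  # 0.15 -> P(Burgundy & Sweetone)
print(p_color["Burgundy"])              # 0.75 -> P(Burgundy)
print(p_stereo["Sweetone"])             # 0.2  -> P(Sweetone)
```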

Information about the distribution of properties in populations is often available in terms of probabilities, so the table of

probabilities is a very natural way to represent the Bennie data.


Summary

For two events A and B with outcomes A1, A2, etc. and B1, B2, etc., respectively, the joint probability P(A1 & B1) is the

probability that the uncertain event A has outcome A1 and the uncertain event B has outcome B1. The joint probabilities of

all possible outcomes of two uncertain events can be summarized in a probability table. The marginal probability of

the outcome A1 of the first uncertain event is the sum of the joint probabilities of outcomes A1 and all possible outcomes

B1, B2, etc. of the second uncertain event.

Conditional Probabilities

Automaker Chariot's limited edition Ben Hur model comes in two possible colors — Champagne or Burgundy — and with

or without a high-end Sweetone stereo. The table below shows the distribution of these properties — "Color" and "Stereo"

— in the population of 1,000 Bennies that Chariot has sold to date.

Restricting our focus to the Burgundy Bennies, we ask the following question: among the set of Burgundy Bennies, what is

the proportion of Burgundy Bennies with a Sweetone stereo? Stated differently: what is the probability that a randomly

selected Bennie has a Sweetone given that we know it is Burgundy?

To answer the question, we find the ratio of Sweetone-equipped Burgundy Bennies among the set of all Burgundy

Bennies. That is, we divide the number of Bennies that are both Burgundy and have a Sweetone — 150 — by the total

number of Bennies that are Burgundy — 750. The probability is 20%.

This probability is called a conditional probability: the probability that a Bennie is Sweetone-equipped given that it is

Burgundy. We'll denote this probability P(Sweetone | Burgundy), and read the vertical line as "given."

P(Sweetone | Champagne) is the number of Bennies that are Champagne and Sweetone-equipped divided by the total

number of Champagne Bennies.

Enter the percentage as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50").

Round if necessary.

Using the table of actual numbers of cars we can calculate P(No Sweetone | Burgundy) and a fourth conditional probability,

P(No Sweetone | Champagne).

Earlier, we used the table of actual numbers of cars to calculate a table of probabilities. When given information as a table

of probabilities, we can use the probabilities to calculate conditional probabilities as well: we simply form the ratios of the

appropriate probabilities.

For example, the probability that a Bennie is Sweetone-equipped given that it is Burgundy is the probability of a Sweetone-

equipped, Burgundy Bennie divided by the probability of a Burgundy Bennie.

In fact, a conditional probability is formally defined in terms of the ratio of a joint probability to a marginal probability.

If P(Sweetone | Burgundy) represents the probability that a Bennie is Sweetone-equipped given that it is Burgundy, what

does P(Burgundy | Sweetone) represent?

P(Burgundy | Sweetone) represents the probability that a Bennie is Burgundy given that it is Sweetone-equipped. Note that

P(Sweetone | Burgundy) and P(Burgundy | Sweetone) are not the same.
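A sketch of both conditionals, each computed as the ratio of a joint probability to a marginal probability:

```python
p_joint = 0.15      # P(Burgundy & Sweetone)
p_burgundy = 0.75   # P(Burgundy)
p_sweetone = 0.20   # P(Sweetone)

print(round(p_joint / p_burgundy, 2))  # 0.2  -> P(Sweetone | Burgundy)
print(round(p_joint / p_sweetone, 2))  # 0.75 -> P(Burgundy | Sweetone)
```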

We can calculate the other conditional probabilities, this time conditioning the property "Color" on the property "Stereo."

It is useful to write the conditional probabilities next to the table of probabilities to facilitate calculation: since P(Burgundy |

Sweetone) and P(Champagne | Sweetone) require only the probabilities in the Sweetone column, we write them directly

below the Sweetone column.

Similarly, since P(Burgundy | No Sweetone) and P(Champagne | No Sweetone) require only the probabilities in the "No

Sweetone" column, we write them directly below the "No Sweetone" column. Note that the new rows mimic the rows in the

original table: Burgundy on top and Champagne below it.

Similarly, we write the conditional probabilities for the property "Stereo" conditioned on the property "Color" to the right

of the table. Again, we mimic the columns in the original table: a column for "Sweetone" and one for "No Sweetone." We

now have a full table of joint, marginal, and conditional probabilities.

It is important to double check which event we are conditioning on, and to make sure our calculations are realistic. For

example, the probability that a randomly chosen Indian citizen is the prime minister is nearly zero, but the probability that

the prime minister of India is an Indian citizen is 100%!


The table of joint probabilities informs our understanding of the likelihood of different properties or events in a crucial

way: as we've seen in the Bennie example, once we have the joint probabilities, we can compute any conditional probability

and any marginal probability.

Thus, when presented with a decision problem in which the outcomes are influenced by multiple uncertain events,

constructing the joint probabilities of these events is almost always a wise first step.

Summary

The conditional probability P(A | B) is the probability of the outcome A of one uncertain event, given that the outcome B

of a second uncertain event has already occurred. The table of joint probabilities provides all the information needed to

compute all conditional probabilities. First calculate the marginal probabilities for each event, then compute the

conditional probabilities as ratios of joint probabilities to marginal probabilities: P(A | B) = P(A & B) / P(B).

Statistical Independence

Let's return once more to our Chariot Ben Hur example. The Bennie comes in two possible colors — Champagne or

Burgundy — and with or without a Sweetone stereo. The probability table below shows the distribution of these properties -

"Color" and "Stereo" - in the population of Bennies that Chariot has sold to date.

Note that the proportion of Sweetone-equipped Bennies in the Burgundy population is the same as the proportion of

Sweetone-equipped Bennies in the overall population: 20%. In the language of conditional probabilities, this is the same as

saying that P(Sweetone | Burgundy) = P(Sweetone).

In other words, if we randomly select a Bennie, then discovering that it is Burgundy gives us no additional information

about whether or not it is equipped with a Sweetone stereo, beyond what we had before we knew its color: we still think

there is a 20% chance it has a Sweetone.

Similarly, the proportion of Burgundy Bennies in the Sweetone-equipped population is the same as the proportion of

Burgundy Bennies in the overall population: P(Burgundy | Sweetone) = P(Burgundy) = 75%.

A Bennie's color tells us nothing about its stereo system. Its stereo system tells us nothing about its color. When this is true,

we say that the Bennie properties "Color" and "Stereo" are statistically independent, or, more simply, independent.

In general, we can interpret the fact that two uncertain events are independent in the following way: knowing that one

event has occurred gives us no additional information about whether or not the other event has. For example, the results of

two spins of a wheel of fortune are independent. The first result does not reveal anything about the second.

We can confirm the independence of the Bennie's stereo and color by looking at our Venn diagram and noting that

Sweetone-equipped Bennies occupy the same percentage in the population of Burgundy Bennies — 20% — as they do in the entire Bennie population.

Thus, to find the joint probability that a Bennie has a Sweetone stereo and is Burgundy, we take 20% of the 75% of Bennies that are Burgundy, giving us a joint probability of 15%. This property is true for any two statistically independent

properties: the joint probabilities are simply the products of the marginal probabilities.

Although it may seem plausible to assume that certain properties are independent, managers who take statistical

independence for granted do so at their peril. We need to verify the assumption that the properties are independent by

looking at and evaluating data or by proving independence on the theoretical level.

Statistical Dependence

When are two outcomes statistically dependent? Let's look at another optional feature of the Bennie: a unique factory-

installed theft discouragement system (TDS).

Using Chariot's data, we can create the following table of the distribution of the properties "Stereo" and "Protection."

Before going forward, practice your skills by calculating the joint, marginal, and conditional probabilities for these

properties. Place them in the usual format in the complete probability table.

From the table, we can see that P(TDS | Sweetone) is not equal to P(TDS). We can infer that the properties "Stereo" and

"Protection" are not statistically independent — once you know that a randomly selected Bennie has a Sweetone, you

know the chances of it having a TDS system are greater than they would be if you didn't have that information.

Once again our Venn diagram provides visual confirmation, in this case that the properties are not independent. In the

overall population, the proportion of Bennies that are TDS-protected is only 35%. However, in the population of

Sweetone-equipped Bennies, the proportion is significantly higher: 75%. Why might that be?


Bennie buyers who opt for the Sweetone stereo feel that their cars will be especially targeted for theft or vandalism, and

are more likely to choose the TDS option than are buyers who choose the low-end car stereo.

A savvy car thief knows that the next Bennie he passes has a 35% probability of being protected by a theft

discouragement system.

Once he identifies that a particular Bennie has a Sweetone stereo, he gains more information: then he knows that the car

has a 75% chance of being TDS-protected. This may affect his decision about whether or not to break into that car.
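A quick sketch of both checks; the joint probability P(TDS & Sweetone) = 0.15 below is derived from the stated P(TDS | Sweetone) = 75% and P(Sweetone) = 20%:

```python
def independent(p_joint, p_a, p_b, tol=1e-9):
    # A and B are independent iff P(A & B) = P(A) * P(B).
    return abs(p_joint - p_a * p_b) < tol

# "Color" and "Stereo": P(Burgundy & Sweetone) = 0.15 = 0.75 * 0.20.
print(independent(0.15, 0.75, 0.20))  # True

# "Stereo" and "Protection": P(TDS & Sweetone) = 0.75 * 0.20 = 0.15,
# but P(TDS) * P(Sweetone) = 0.35 * 0.20 = 0.07.
print(independent(0.15, 0.35, 0.20))  # False -> statistically dependent
```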

Summary

Two uncertain events A and B are said to be statistically independent if knowing that A has occurred does not tell us

anything about the probability of B occurring, and vice versa. Statistical independence of two events can be

demonstrated based on data or proved from theory; it should never be assumed. Events that are not statistically

independent are said to be statistically dependent.

Probability theory is fascinating, sure, but what do conditional probabilities have to do with decision analysis?

To see how we might apply conditional probabilities in a decision analysis, let's revisit the Cloven film production example.

Seth Chaplin of S&C Films is ecstatic. In a sushi lunch meeting with the agent of superstar actor Shawn Connelly, the agent

agreed to try to convince Connelly to be the voice of the main character in Seth's new film, Cloven.

Under Seth's charming yet irresistible pressure, the agent told Seth that — just between the two of them — he thought he had

a 40% chance of persuading Connelly to take the role. Connelly's fame is such that by lending the prestige of his name to

Cloven, the likelihood of Cloven's success increases dramatically.

Seth has booked the services of a crack animation team, so he'll have to start production soon — probably before Connelly

makes a decision. But that shouldn't hinder production because Connelly can add his voiceovers well after the animation has

been created.

Seth has not yet signed either of the two deals he negotiated, and feels he should reconsider his options in light of Connelly's

possible participation. With a decent shot at star power behind Cloven, Seth feels he can close a better deal with Pony.

Connelly's voice would also substantially increase the likelihood of blockbuster success, making the deal with K2 potentially

more lucrative.

Seth returns to Pony and hammers out a second agreement, contingent on Connelly's participation. Pony will pay an

additional $5 million — $15 million in total — if Seth can retain Connelly's voice acting services. Even after accounting for

Connelly's salary, Seth expects to make a total of $2.2 million in profits from the Pony deal if Connelly agrees to take the role.

How should all of this new information affect Seth's decision? Let's take a look at the new decision tree.

If Seth chooses the Pony deal, there are now two possible scenarios: in the first, Connelly agrees to lend his voice to Cloven, in

the second, Connelly declines. These two scenarios are associated with two different profit outcomes: $2.2 million and $1

million, respectively.

Based on Connelly's agent's assessment, there is a 40% chance that Connelly will take the role, and a 60% chance that he

won't.

What about the K2 option? Which of the following decision trees correctly reflects the newly introduced circumstances?

The nodes of a decision tree are arranged from left to right in the order in which the decision maker will eventually know

their results. In this case, Seth will first discover whether or not Connelly takes the role; only then will he see how Cloven

performs at the box office.

What are the probabilities of the different branches? As in the Pony option, the first branching in the K2 option is a split

between the scenarios in which Connelly signs onto the Cloven project and those in which he doesn't. These branches are

associated with probabilities of 40% and 60%, respectively.

What are the probabilities of the six final branches? Let's look closely at the top three branches. Once we know that Connelly

has signed on, we need to know what the respective probabilities are of a "Blockbuster," a "Lackluster" performance, and a

"Flop." In other words, we need the conditional probabilities of the three outcomes, given that Connelly takes part in Cloven.

In any decision tree, to "be at a node" is to assume that all the events on the path leading to that node from the left have

already taken place. Thus, the data we associate with any decision or chance node depend directly on the sequence of events

on the unique path leading up to that node from the left.

Specifically, the probabilities after a chance node must be conditioned on all events on the preceding path, and the outcome

values must incorporate the effects of all of the preceding events.


Returning to Seth's decision: if Connelly takes part in Cloven, Seth estimates at 50% each the conditional probability that

Cloven will be a "Blockbuster" and the conditional probability that it will have a "Lackluster" theater run. At this stage in

Connelly's career, a "Flop" is virtually impossible. If Connelly doesn't take part, the conditional probabilities are simply the

original probabilities of 30%, 50%, and 20% for the three possible outcomes.

Using these conditional probabilities, we can determine which of the two options Seth should choose by calculating the EMVs

of all the nodes and folding back the decision tree.

Let's begin with the Pony option. What is its EMV? Enter the EMV in $millions as a decimal number with two digits to the right

of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary.

Now we find the EMV of the K2 option, constructing it step by step. First, we must find the EMV for the event that Connelly

decides to voice the main character in Cloven. That EMV is $3 million.

Next, we find the EMV for the event in which Connelly refuses the part. That EMV is simply the EMV of the original K2

option: $1.4 million.

The total EMV for the K2 option is $2.04 million: the EMV with Connelly's participation weighted by the probability of his

participation, plus the EMV of Cloven filmed without Connelly, weighted by the probability that he refuses the part.

The possibility that Shawn Connelly might take part greatly enhances the attractiveness of the K2 option.
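A sketch folding back the full two-stage tree; the "Lackluster" ($0) and "Flop" (-$2 million) outcome values are inferred from EMV(K2) = 6p - 0.4 and the original $1.4 million K2 EMV rather than stated directly above:

```python
# All values in $millions.
p_connelly = 0.40
outcomes  = {"Blockbuster": 6.0, "Lackluster": 0.0, "Flop": -2.0}  # inferred
p_with    = {"Blockbuster": 0.5, "Lackluster": 0.5, "Flop": 0.0}   # given Connelly
p_without = {"Blockbuster": 0.3, "Lackluster": 0.5, "Flop": 0.2}   # given no Connelly

def emv(probs):
    return sum(probs[s] * outcomes[s] for s in outcomes)

emv_k2 = p_connelly * emv(p_with) + (1 - p_connelly) * emv(p_without)
emv_pony = p_connelly * 2.2 + (1 - p_connelly) * 1.0

print(round(emv_k2, 2), round(emv_pony, 2))  # 2.04 1.48 -> choose K2
```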

Summary

The eventual outcomes of many decisions involve sequential uncertain events whose outcomes are determined over time.

To conduct a decision analysis in such cases, we need the probabilities of the first uncertain event's possible outcomes and

conditional probabilities of future uncertain events' possible outcomes, conditioned on the previous events' outcomes.

Captain Ahab Fisheries cans herring and sardines in either spicy tomato sauce or vegetable oil. The table to the right

summarizes the distribution of Ahab's product line. What is the marginal probability that a randomly selected can of fish is

canned in tomato sauce?

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as "0.500").

Round if necessary.

The marginal probability of tomato sauce is 23%. Similarly, the marginal probability of vegetable oil is 77%.

What is the marginal probability that a randomly selected can of fish contains sardines?

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as "0.500").

Round if necessary.

The marginal probability of sardines is 40%. Similarly, the marginal probability of herring is 60%.

What is the conditional probability that a randomly selected can of fish is canned in tomato sauce, given that it is a can of

sardines?

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as "0.500").

Round if necessary.

The conditional probability of tomato sauce given that a can contains sardines is 35%.

What is the conditional probability that a randomly selected can of fish is canned in tomato sauce, given that it is a can of

herring?

Enter the probability as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50").

Round if necessary.

The conditional probability of tomato sauce given that a can contains herring is 15%.

Which is larger, the conditional probability that a randomly selected can of fish contains sardines, given that it is canned in

vegetable oil, or the conditional probability that a randomly selected can of fish is canned in vegetable oil, given that it

contains sardines?

The conditional probability of a can containing sardines given that it contains vegetable oil is 34%. The conditional

probability of the fish being canned in vegetable oil given that the can contains sardines is 65%. Note that P(Sardines | Oil)

is not the same as P(Oil | Sardines).
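The joint table referenced above can be reconstructed from the quoted probabilities, as in this sketch:

```python
# Marginals for contents and conditionals for sauce, from the answers above.
p_sardines, p_herring = 0.40, 0.60
p_tomato_given_sardines, p_tomato_given_herring = 0.35, 0.15

joint = {
    ("sardines", "tomato"): p_sardines * p_tomato_given_sardines,     # 0.14
    ("sardines", "oil"): p_sardines * (1 - p_tomato_given_sardines),  # 0.26
    ("herring", "tomato"): p_herring * p_tomato_given_herring,        # 0.09
    ("herring", "oil"): p_herring * (1 - p_tomato_given_herring),     # 0.51
}

p_tomato = joint[("sardines", "tomato")] + joint[("herring", "tomato")]
print(round(p_tomato, 2))                                      # 0.23
print(round(joint[("sardines", "oil")] / (1 - p_tomato), 2))   # 0.34 P(Sardines|Oil)
print(round(joint[("sardines", "oil")] / p_sardines, 2))       # 0.65 P(Oil|Sardines)
```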


The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities

is less than 100% tells us

We cannot infer anything about whether or not these events are mutually exclusive from the information provided. The

events A, B, and C on the left are mutually exclusive. The events D, E, and F on the right are not mutually exclusive.

The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities

is less than 100% tells us

For a set of events to be collectively exhaustive, the probabilities of the events must sum to at least 100%.

The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities

is greater than 100% tells us

The probabilities of the events in a mutually exclusive set cannot add up to more than 100%. If the probabilities of a set of

events add up to more than 100%, then at least two of them are not mutually exclusive, i.e., they can occur simultaneously.

The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities

is greater than 100% tells us

We cannot infer anything about whether or not these events are collectively exhaustive from the information provided.

Below are examples of events with these probabilities; one set is collectively exhaustive and one isn't.

Market research can be expensive. If the cost of the research outweighs its value to Leo, then obviously he shouldn't spend

money on the research. "We'll need to assess the value of Leo's market research," Alice tells you, "to determine whether or not

paying for it is a wise choice."

The Eris Shoe Company is considering sourcing some of its production from the developing country Arboria. Arboria was a

land of civil war and strife until a controversial UN intervention two years ago reconciled warring factions and helped install a

national unity government. Today, guerrilla warfare persists in the mountains, but the major coastal cities are relatively

secure.

Eris CEO Emily Ville has identified Arboria as a candidate location due to its low labor costs and is considering opening a

small buying office in the capital city. However, she also recognizes the potential for substantial risk if civil war should break

out, including the loss of Eris' investments in the office and the disruption of its supply chain.

Emily estimates the present value of sourcing from Arboria at $1 million in net savings as long as the Arborian production

facility and infrastructure work at a high level of reliability. In her opinion, the probability of such a "Serene" scenario is 40%.

Emily believes there is a 40% chance of a more "Troubled" environment beset with minor supply chain disruptions, which

would reduce Eris' expected cost savings to $0.5 million. Finally, she estimates the likelihood of major political unrest at

20%, and the net present value of the losses associated with such a "Chaotic" scenario at $1 million.

Instead of sourcing from Arboria, Eris could continue its current sourcing agreements, which would not result in any cost

reduction.

The EMV of sourcing from Arboria is $0.4 million. Based on the EMV criterion, Emily should choose to source from Arboria.

However, given the stakes involved, Emily would like to have additional information that would increase her understanding

of Arboria's political situation.

Ideally, she'd like to base her decision on perfect information, i.e., she'd like to know now exactly which one of the three

outcomes will materialize. Such perfect information would be extremely valuable: how much should Emily be willing to pay

for it?

Managers often have the opportunity to gather more data or expertise to inform their business decisions. Depending on the

business context, new information can be obtained from statistical data, expert consultants, or scientific tests and

experiments. In many cases, such information can improve the accuracy of the estimates incorporated into a decision

analysis.

However, the cost of such data can drag down the bottom line. Clearly, the cost of additional information must be weighed

against its value. How can managers assess the value of information?

Suppose — for the sake of argument — that Emily knows a fortune-teller, Frieda Featherlight, who has genuine psychic

powers and can predict the future flawlessly, giving Emily the perfect information she craves. How much should Emily pay


Frieda for her supernatural services?

To answer this question, we calculate the Expected Value of Perfect Information — the EVPI — provided by Frieda. We

can determine this value by framing a decision: should Emily purchase Frieda's information before she makes her decision to

source from Arboria?

Characterized as a decision problem, we can determine the EVPI using familiar tools: a decision tree and the expected

monetary value. Let's construct a tree for Emily's decision, beginning with a decision node that branches into two options:

"Buy the information" or "Don't buy the information."

The lower branch — "Don't buy the information" — is simply the original tree for the decision about whether or not to source

from Arboria. This tree uses only the information Emily already possesses. The EMV of this option is $0.4 million.

If Emily engages Frieda's services and "buys the information," three possibilities emerge based on which scenario Frieda

predicts — "Serene," "Troubled," or "Chaotic." If the probabilities of these scenarios actually occurring are 40%, 40%, and

20%, respectively, what is Emily's best assessment of the likelihood that Frieda will predict each scenario?

Without further information, Emily has no reason to change her initial assessments. Unless she probes Frieda for

information, Emily's best guess that Frieda will predict each outcome — "Serene," "Troubled," and "Chaotic" — is the same

as her best estimates for the probabilities that those outcomes will actually occur: 40%, 40%, and 20%, respectively.

Let's assume for the moment that Frieda won't charge Emily for her services. The beauty of perfect information is that Emily

can delay her decision until after she has heard Frieda's completely accurate prediction of what the Arborian political climate

will be. If Frieda predicts a "Serene" sourcing experience, then Emily will choose to source from Arboria, and cost savings of

about $1 million are certain.

Likewise, if Frieda predicts a "Troubled" sourcing experience, Emily will source from Arboria, and cost savings of about $0.5

million are certain. If Frieda predicts a "Chaotic" sourcing experience, then Emily will choose not to source from Arboria,

thereby avoiding a certain loss in the $1 million range. In this case, Eris will receive no cost savings.

Still working under the assumption that Frieda won't charge for her services, what is the EMV of "buying the information"?

Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as

"5.50"). Round if necessary.

At $600,000, the EMV of "buying the information" is $200,000 higher than the EMV of "not buying the information." This

difference between the EMVs of the option with perfect information and without perfect information is the expected value of

perfect information (EVPI). The EVPI of $200,000 is the maximum amount Emily should pay Frieda for a séance.
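A sketch of the EVPI calculation, using the values above:

```python
# All values in $millions; "not sourcing" yields $0 in every scenario.
probs   = {"Serene": 0.40, "Troubled": 0.40, "Chaotic": 0.20}
savings = {"Serene": 1.0,  "Troubled": 0.5,  "Chaotic": -1.0}

# Without information: commit to the better of the two acts up front.
emv_without = max(0.0, sum(probs[s] * savings[s] for s in probs))  # 0.4

# With perfect (free) information: pick the best act for each prediction.
emv_with = sum(probs[s] * max(0.0, savings[s]) for s in probs)     # 0.6

print(round(emv_with - emv_without, 2))  # 0.2 -> EVPI of $200,000
```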

The EVPI establishes an upper bound on what we should pay for perfect information. In some business cases, perfect — or

near perfect — information may be available without supernatural means.

For instance, suppose that aerospace company Airbus wants to know if it could perfect all of the technologies necessary to

create a safe and reliable SpaceCruiser, a space shuttle for tourists that would have all the comforts of a luxury cruise ship.

Airbus could build a small but completely functional prototype and see if it works. Although Airbus might collect near perfect

information by doing so, the cost of developing the SpaceCruiser prototype would likely be higher than the expected value of

that perfect information.

Instead, Airbus might first use computer simulations and limited functionality prototypes to assess the technical viability of

the SpaceCruiser. The information thus gained wouldn't be perfect, but it would be helpful and it would cost far less than a

fully functional prototype.

The EVPI establishes an upper bound on the value of any information, perfect or flawed. Through sampling, educated

expertise, or imperfect testing, we might be able to gain better — though not perfect — estimates of the probabilities and the

outcome values of our decision's possible scenarios.

Shortly, we'll learn how to value imperfect information — information that reduces, but does not eliminate, our uncertainty

about future events. For the time being, we know that we should never spend more for imperfect information than we are

willing to pay for perfect information.

In her quest for an infallible choice in the Arborian sourcing decision, if Emily won't pay more than $200,000 for perfect

information, she certainly shouldn't pay more than $200,000 for imperfect information. If the price of imperfect information

exceeds the EVPI, we should not expend resources on it.

Summary

As managers, we would like to know exactly which outcomes will occur so we can make the best decisions. We can calculate

the expected value of such perfect information — EVPI — to find an upper limit on the amount we would be willing


to pay for any additional information. To calculate the EVPI, we first frame the decision problem "to buy or not to buy the

perfect information." We subtract the EMV of buying perfect information — assuming it's free — from the EMV of not

buying perfect information to find the EVPI.

In almost all circumstances, perfect information is either impossible to assemble or prohibitively expensive. In these cases,

we use imperfect information — often called "sample" information — to inform our decision. As with perfect information,

the cost of sample information must be weighed against its value. As managers, how do we assess the value of sample

information?

Let's return to the Eris Shoe Company's decision to source production in Arboria. Emily Ville has distinguished three possible

scenarios for the Arborian political climate: "Serene," "Troubled," and "Chaotic," which she believes have probabilities of

40%, 40%, and 20%, respectively. The outcomes she associates with these scenarios — in cost savings for Eris — are $1

million, $0.5 million, and -$1 million, respectively.

Emily would like to improve her estimates of the probabilities. She contacts PoliFor, a consulting company that specializes

in business outlook intelligence. PoliFor produces an analysis of a country's or region's political climate, and then provides

a risk assessment specific to the needs of its client, distinguishing between "High" and "Low" risk situations.

In a world of uncertainty, PoliFor's assessments are not always accurate. Sometimes, regions with "Low" risk assessments

burst into revolutionary flame. Sometimes, regions with "High" risk assessments turn out to be as stable and placid as the

moon's orbit. How high a price should Emily pay for PoliFor's services given that PoliFor's assessments are not perfectly

reliable?

Based on preliminary conversations with PoliFor's lead consultant, Emily assesses the probabilities that PoliFor will report

"High" or "Low" risk for sourcing in Arboria at 30% and 70%, respectively. She then considers how each of these risk level

assessments would affect her estimates of the relative likelihood of her three representative scenarios: "Serene,"

"Troubled," and "Chaotic."

Emily believes that if PoliFor predicts "Low" risk, then the likelihood of a "Chaotic" political climate would be low: 5%, and

the probabilities of "Troubled" or "Serene" scenarios would be 40% and 55%, respectively. If we use "Low" to represent

PoliFor predicting a low-risk political environment, which of the following best expresses the beliefs Emily has summarized

above?

The information upon which Emily is basing her assessments is the PoliFor prediction of a "Low" risk political climate.

Thus, she is estimating conditional probabilities that Eris' experience sourcing from Arboria will be "Serene," "Troubled,"

or "Chaotic," given that the consultants predict a "Low" level of risk.

Similarly, Emily assesses the conditional probabilities that Eris' experience sourcing from Arboria will be "Serene,"

"Troubled," or "Chaotic" given that the consultants predict a "High" level of risk at 5%, 40%, and 55%, respectively. How do

we calculate the expected value of the imperfect information PoliFor offers?

As we did for perfect information, we frame this question as a decision problem: should Emily purchase the imperfect

information from PoliFor or not? First, we'll construct a decision tree. The tree begins with a decision node for the choice

Emily faces: "Buy the information" or "Don't buy the information."

The lower branch for the option "Don't buy the information" is the original tree that Emily constructed using her initial

estimates for the outcome values and probabilities of the three basic scenarios. What should emanate to the right from the

"Buy Information" branch?

If we are at the end of the "Buy Information" branch, we assume that Emily has purchased the information. In that case,

the first thing that she will do is open the report envelope and examine the information to learn what level of risk PoliFor has predicted. Thus, the upper branch for the option "Buy the information" leads to a chance node that splits into two branches, one for each risk level the consulting company might predict.

What should emanate from each of the Low and High branches?

The information that Emily has purchased is valuable only if she uses it to support her decision-making. The value of the

information resides in Emily's ability to make her sourcing decision after learning the level of risk PoliFor predicts. Thus,

after each prediction, we place a decision node: should she source from Arboria or continue to source from her current

suppliers?

To complete the decision tree, we must incorporate the possible scenarios that can occur if Emily chooses to source from

Arboria: "Serene," Troubled," or "Chaotic." What probabilities should we assign to the three branches emanating from the

chance node highlighted to the right?

At the highlighted node, we assume that every event on the unique path leading from the node back to the beginning of the

tree has transpired: Emily bought the information; PoliFor reported "High" risk; and Emily chose to source from Arboria.

Thus, the probabilities assigned to the branches must be conditioned on those events. For example, we assign P(Serene |

High) to the "Serene" scenario on the "High" branch.


Now we complete the tree, substituting the values for the conditional probabilities and placing the appropriate monetary

values at the endpoints. As usual, we'll assume for the moment that the information is free, and start to fold back the tree.

What is the EMV of the "High Risk" branch's decision node?

If PoliFor predicts a "High" risk level, Emily should not choose to source from Arboria since the EMV of the "Source from Arboria" node, -$0.3 million, is less than the EMV of the "Don't source" node, $0 million. Instead, she should continue her

current supply arrangements, in which case she will realize $0 in cost savings.

Folding back the decision tree, we find an EMV of $490,000 for the option "Buy the information."

What is the most Emily should pay for the imperfect information?

The difference between the EMVs of the "Buy" and "Don't buy the information" options is $90,000, the expected value of

imperfect (or sample) information (EVSI) provided by PoliFor. The EVSI is the maximum amount that Emily should be

willing to pay for PoliFor's report.
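The same fold-back arithmetic can be verified in a few lines. Below is a rough Python sketch under the assumptions stated above (Emily's report probabilities and her conditional scenario estimates); all names are ours:

```python
# EVSI for PoliFor's report; all figures from the text ($ millions).
savings = {"Serene": 1.0, "Troubled": 0.5, "Chaotic": -1.0}
p_report = {"Low": 0.70, "High": 0.30}
posterior = {  # P(scenario | report)
    "Low":  {"Serene": 0.55, "Troubled": 0.40, "Chaotic": 0.05},
    "High": {"Serene": 0.05, "Troubled": 0.40, "Chaotic": 0.55},
}

emv_without = 0.40  # EMV of the original sourcing decision

emv_with = 0.0
for report, p in p_report.items():
    emv_source = sum(posterior[report][s] * savings[s] for s in savings)
    emv_with += p * max(emv_source, 0.0)  # source only when it beats $0

print(f"EMV with report = {emv_with:.2f}")     # 0.49
print(f"EVSI = {emv_with - emv_without:.2f}")  # 0.09
```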

As we would anticipate, the expected value of this imperfect information is quite a bit less than $200,000, the expected

value of perfect information we computed earlier.

PoliFor's information, although imperfect, has value because it allows Emily to update her original probability estimates

based on the risk level PoliFor predicts. She can make better decisions based on these more accurate updated probability

assessments.

The probability estimates we started with — Emily's estimates for P(Chaotic), etc. — are called prior probabilities. The

updated conditional probabilities — P(Chaotic | High Risk), P(Troubled | High Risk), etc. — are called posterior

probabilities.

Summary

By investing in professional expertise, statistical studies, or tests, managers can often improve their understanding of

uncertain events. Such imperfect (or "sample") information is itself a source of uncertainty because it isn't a perfect

predictor of the outcomes of interest. Although imperfect information can improve our predictions, such information

comes at a price. Responsible managers calculate the expected value of sample information (EVSI), and purchase

information only if its expected value exceeds its cost. To calculate the EVSI, we pose the decision problem "to buy or not

to buy the sample information," and subtract the EMV of buying the information — assuming that it's free — from the

EMV of not buying it.

Zeke "Claw" Crankshaw owns and runs Oligarchol, one of the largest independent oil producers in the Western

Hemisphere.

Recently, Zeke acquired the mineral rights to a piece of land near Chanchito Gordo Canyon. Based on his own professional

assessment, Zeke believes that he has a 50% chance of finding an oil field if he drills in Chanchito Gordo, and a 50% chance

of drilling nothing but "Dry" holes.

Like any subjective probabilities, these estimates are the product of Zeke's educated guesswork. Clearly, the oil is either

beneath the surface at Chanchito or it isn't. However, Zeke's experience tells him, for example, that in about half of all

similar situations — similar terrain, similarly exposed rock strata, etc. — oil prospectors have struck oil.

He estimates the value of striking oil — in present value of future profits — to be $8 million after netting out drilling costs.

If all of his boreholes come up "Dry," Zeke will have lost the $2 million cost of drilling. With these assessments in mind,

Zeke must decide if he should drill on the property.

If Zeke decides not to drill, his investment in the mineral rights will not pay off in the way he had hoped. However, since

the purchase price of the mineral rights has already been incurred, it is a sunk cost that shouldn't bear on the decision.

What is the EMV of Zeke's decision to drill? Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary.

Without further information, Zeke should start shipping drill bits and oil rigs over the dusty back-country roads to

Chanchito. But old Zeke is not the impetuous young whippersnapper of the early days of Oligarchol, back when the

company was derisively referred to in industry circles as "Pipe Dreams, Incorporated."

Now, Zeke is happy to use the latest in seismic testing technology to gain better information and make more informed

decisions. Zeke can order a $100,000 seismic test of the Chanchito Gordo property. Although the seismic test will improve

his estimate of the likelihood of finding oil, it won't provide perfect information.


The seismic test will report one of two possible outcomes: a "Positive" or a "Negative" result, depending on whether or not

the test indicates the presence of oil. Because the seismic test's accuracy is a key determinant of its value, Zeke has gathered

some historical data on the test's performance.

Zeke believes that if there is an oil field on his property, then the probability that the test will return a "Positive" result is

90%. This is the conditional probability P(Positive | Oil). Clearly, the test isn't perfect — even if there is a reserve on his

property, there is a 1 in 10 chance the test will fail to detect it. This is the conditional probability of a "False Negative"

result: P(Negative | Oil) = 10%.

And, unfortunately for Zeke, even when there isn't a drop of oil to be found, the test will still report a positive result — a

false "Positive" — with probability 30%. The probability that the seismic test will indicate the absence of oil when there is,

in fact, no oil — P(Negative|Dry) — is 70%. How can Zeke use these data to calculate the EVSI — the expected value of the

test information?

As usual, we start with a decision node that branches into two options: "Test" or "Don't Test." The lower branch, "Don't

Test," is the same as Zeke's original tree for whether to drill or not. The EMV for the "Don't Test" branch is $3 million; it is

based only on the information Zeke previously had available to him.

If Zeke buys the test, he will learn the test's result — positive or negative — before he has to decide whether to drill or not.

What probability should Zeke assign to the "Test Positive" branch?

The probability that Zeke should assign to the "Test Positive" branch is P(Positive). However, this probability is not in the

information Zeke originally had available, which is summarized in the table to the right.

To find P(Positive), we must first find the joint probabilities and enter them into the table. What is P(Oil & Positive)?

Round if necessary.

To find P(Oil & Positive), we use the familiar relationship for conditional probabilities: P(Oil & Positive) = P(Positive | Oil)P(Oil) = (90%)(50%) = 45%. If you wish to practice, calculate the other joint probabilities and P(Positive) and P(Negative) before advancing to the next screen.

Adding the joint probabilities, we find that P(Positive) = 60% and P(Negative) = 40%.
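If you want to check these numbers, here is a short Python sketch of the joint-and-marginal bookkeeping, using the priors and reliabilities given above (variable names are ours):

```python
# From priors and test reliability to the joint/marginal table
# (Zeke's figures from the text).
p_oil, p_dry = 0.5, 0.5          # prior probabilities
p_pos_given_oil = 0.9            # P(Positive | Oil)
p_pos_given_dry = 0.3            # P(Positive | Dry), the false-positive rate

# Joints via P(A & B) = P(B | A) * P(A)
p_pos_and_oil = p_pos_given_oil * p_oil        # 0.45
p_pos_and_dry = p_pos_given_dry * p_dry        # 0.15
p_neg_and_oil = (1 - p_pos_given_oil) * p_oil  # 0.05
p_neg_and_dry = (1 - p_pos_given_dry) * p_dry  # 0.35

# Marginals: sum the joints for each test result.
p_pos = p_pos_and_oil + p_pos_and_dry          # 0.60
p_neg = p_neg_and_oil + p_neg_and_dry          # 0.40

# Posteriors via P(Oil | TR) = P(TR & Oil) / P(TR)
print(round(p_pos_and_oil / p_pos, 3))  # 0.75
print(round(p_neg_and_oil / p_neg, 3))  # 0.125
```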

We are now ready to complete the tree. After Zeke learns the test result, he makes a decision about whether or not to drill.

If he drills, he will either strike oil or not. What probability should we associate with the branch highlighted to the right?

If Zeke is at the highlighted node, then all the events on the branch leading up to the node must have occurred: he decided

to test, the test was positive, and he decided to drill. Thus, Zeke must condition the probability of striking oil on past

events: specifically, on the fact that the test report was "Positive." Thus, he should associate the conditional probability

P(Oil | Positive) with the highlighted "Oil" branch.

We find the conditional probabilities in the usual way. The full table of joint, marginal, and conditional probabilities is

shown to the right.

Our tree now has all the necessary data. We fold it back and find that the EMV for conducting a seismic test — assuming

the test is free — is $3.3 million.

Enter the EVSI in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000"

as "5.50"). Round if necessary.

The EVSI is $300,000 — the difference between the EMV of conducting the test, $3.3 million, and the EMV of not

conducting the test, $3 million. This is the maximum amount Zeke should be willing to spend on the seismic test. Since the

expected value of the test exceeds its $100,000 cost, Zeke should order the test.
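Folding back the test tree follows the same pattern. A minimal sketch, assuming the posterior probabilities just computed ($ millions; the helper function is ours):

```python
# Fold back Zeke's seismic-test tree ($ millions; posteriors from above).
payoff_oil, payoff_dry = 8.0, -2.0

def emv_drill(p_oil):
    """EMV of drilling for a given probability of striking oil."""
    return p_oil * payoff_oil + (1 - p_oil) * payoff_dry

# No test: drill on the prior P(Oil) = 0.5 only if it beats not drilling.
emv_no_test = max(emv_drill(0.5), 0.0)          # 3.0, so Zeke drills

# With the (free) test: after "Positive" (p = 0.6) use P(Oil | Positive)
# = 0.75; after "Negative" (p = 0.4) use P(Oil | Negative) = 0.125.
emv_test = 0.6 * max(emv_drill(0.75), 0.0) + 0.4 * max(emv_drill(0.125), 0.0)

print(f"EMV with test = {emv_test:.1f}")       # 3.3
print(f"EVSI = {emv_test - emv_no_test:.1f}")  # 0.3
```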

Summary

Often, information is not available in the form we need to inform our decision process and to calculate the

EVSI. Frequently, for example, we have access to data about the reliability of a test. The reliability of a test is given as the

set of conditional probabilities P(TRi | Si), where TRi is a test result and Si is one of the scenarios the test is intended to

help predict. Collectively, these conditional probabilities reveal the test's ability to predict the outcome of the uncertain

event in question. Using this information, we update our initial estimates for the probabilities of each scenario — the

prior probabilities P(Si) — to achieve improved, conditional probabilities of each scenario, given the result of the test —

posterior probabilities P(Si | TRi). We use the posterior probabilities to calculate the EVSI and aid our decision process.

You know how to calculate the expected value of information and set an upper limit on the amount Leo should be willing to

pay for market research. But what kind of market research could Leo have in mind? And would it be worth its cost?


Leo wants to know how much money he should be willing to spend on additional information about consumers' likely

reaction to a floating restaurant. First, you calculate the expected value of perfect information to give Leo an absolute upper

bound on the amount he should pay for any kind of information.

You begin by setting up a decision tree to help inform the decision of whether or not to buy perfect information. Which of the

following best represents the "Buy Information" branch of the perfect information tree?

Whatever the source of perfect information, it will predict the occurrence of either a "Phenomenon" or a "Fad." Which one of

these scenarios it will predict is uncertain; the respective probabilities are equal to the probabilities of the two scenarios: 35%

and 65%.

Once Leo has perfect information, he can choose to either launch the Tethys venture or not.

What is the EMV of acquiring the perfect information, assuming that it is free?

Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as

"5.50"). Round if necessary.

If Leo is certain that the Chez Tethys will be a "Phenomenon," he will launch her and realize a $2 million profit. If he is

certain that the Tethys would be a mere "Fad," he'll avoid the $800,000 loss by not launching, thereby realizing no profits or

losses in the floating restaurant industry. The EMV of acquiring perfect information is $700,000.

Recall that the EMV of launching the Tethys — based on Leo's original estimates — was $180,000. What is the expected value

of the perfect information?

"5.50"). Round if necessary.

The expected value of perfect information is the difference between $700,000 — the EMV of acquiring the perfect

information — and the EMV of not acquiring the perfect information. In this case, the EMV of not acquiring the perfect

information is the original EMV of launching the Chez Tethys, $180,000. The difference is $520,000.

That's a lot of money! I have an idea for a market research event that is expensive, but it's much cheaper than that. Let's get to

it!

Not so fast, Leo. Up to $520,000 is what you should be willing to pay for perfect information. But real world information is

usually much less than perfect. What exactly do you have in mind?

You two are going to love this. I'll rent a cruise ship and do a week-long test cruise around the islands, inviting guests from the

finest hotels on board for free. At dinner, I'll run an advertisement for the Tethys, then survey all the guests and ask them if

they'd be interested in dining there!

That's an audacious market research plan for an audacious business venture. I like the idea, though, and I think you'll get

much better information than if you hire a firm to do conventional market research. Let's try to estimate the value of this

information.

The three of you debate the merits of Leo's marketing event. You split the results of Leo's event into two response scenarios

based on how Leo's guests/subjects receive his demonstration: "Enthusiastic" and "Tepid," estimating the probabilities of

these responses at 40% and 60%, respectively.

If the subjects respond to Leo's demonstration "Enthusiastically," you believe that the Tethys becoming a "Phenomenon" is

much more likely than your earlier estimate: 65%. On the other hand, if the reception is "Tepid," you estimate that the

probability of a "Phenomenon" is a mere 15%.

Given these data and the tree to the right, what is the expected value of Leo's sample information?

Enter the EVSI in $millions as a decimal number with four digits to the right of the decimal point (e.g., enter "$5,555,000" as

"5.5550"). Round if necessary.

To calculate the EVSI, calculate the EMV of Leo's running his planned event. First, calculate the EMV of launching the Tethys

given that the event participants' response is "Enthusiastic." The EMV is the sum of the outcomes of the "Phenomenon" and

"Fad" scenarios, after they have been weighted by the conditional probabilities that these scenarios will occur given an

"Enthusiastic" reception: $1.02 million.

Since the EMV for launching the Tethys is higher than the EMV for not launching her, Leo should embark on the Tethys

venture if his guests/subjects respond "Enthusiastically" to his event. Thus, the EMV if guests give an "Enthusiastic" response

to Leo's event is $1.02 million.

Next, calculate the EMV of launching the Tethys given that the event participants' response is "Tepid." The EMV is the sum of

the outcomes of the "Phenomenon" and "Fad" scenarios, after they have been weighted by the conditional probabilities that

these scenarios will occur given a "Tepid" reception: -$380,000.


Since the EMV for launching the Tethys is lower than the EMV for maintaining the status quo at the Kahana, Leo should not

embark on the Tethys venture if the participants' response is "Tepid." Thus, the EMV of a "Tepid" response to Leo's event is

$0.

The EMV of Leo's event is $408,000: the sum of the EMVs of the "Enthusiastic" and "Tepid" responses, after they have been

weighted by the probabilities that these responses will occur.

The expected value of Leo's sample information is the difference between the EMV of his event and $180,000 — the EMV of

launching the Tethys without further information. The difference is $228,000.
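The procedure is the same every time: for each possible research result, take the best conditional decision, weight it by the probability of that result, and subtract the EMV without the information. A generic Python sketch applied to Leo's numbers from above; the function name and the fallback-pays-zero assumption are ours:

```python
def evsi(p_result, posterior, payoffs, emv_without):
    """Expected value of sample information, assuming the information
    itself is free and the fallback option (don't launch) pays 0."""
    emv_with = 0.0
    for result, p in p_result.items():
        emv_launch = sum(posterior[result][s] * payoffs[s] for s in payoffs)
        emv_with += p * max(emv_launch, 0.0)
    return emv_with - emv_without

leo = evsi(
    p_result={"Enthusiastic": 0.40, "Tepid": 0.60},
    posterior={"Enthusiastic": {"Phenomenon": 0.65, "Fad": 0.35},
               "Tepid":        {"Phenomenon": 0.15, "Fad": 0.85}},
    payoffs={"Phenomenon": 2.0, "Fad": -0.8},   # $ millions
    emv_without=0.18,                           # EMV of launching outright
)
print(f"{leo:.4f}")  # 0.2280, i.e., $228,000
```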

I talked to some friends last night, and the total cost of renting and running a cruise ship for a week and shooting an

advertising short comes to at least $240,000.

That's higher than the expected value of this information we calculated: $228,000.

Are you telling me that my event would not be worth its cost?

I'm afraid so, Leo. But that doesn't necessarily mean that you shouldn't do it. You might choose to have your event even if its

expected monetary value is low. It's late. Let's break for tonight and meet over breakfast tomorrow.

The grocery chain FUD offers two lines of private label products: "FUD Brand" and "Pleasant Valley Fare." The two brands are used to label two sets of grocery items with very little overlap. For example, "FUD Brand" is used for meat and cheese

products; "Pleasant Valley Fare" for canned goods.

Esther Smith, CEO of FUD, anticipates that eliminating the brand with weaker consumer loyalty and repackaging its

product line under the label of the stronger brand would result in an additional $200,000 in profits from increased sales

and from cost savings on packaging and advertising. If Esther eliminates the stronger brand, she anticipates a net bottom

line improvement of $0 — reductions in sales volume or price discounts would roughly offset cost savings.

Esther doesn't know which of the brands enjoys greater customer loyalty — since the product lines don't overlap, she can't

directly compare the two brands' performances. Initially, she believes that there is a 50% chance that customers are more

loyal to FUD Brand, and a 50% chance that they are more loyal to Pleasant Valley Fare (PVF).

Based on this information, which of the two options - eliminating FUD Brand or Pleasant Valley Fare - has the higher

EMV?

Each option has an EMV of $100,000. Based on Esther's current information, she has no basis for preferring one brand

over the other.

Suppose Esther had access to perfect information, i.e., she could discover with certainty which brand her customers prefer,

so she could eliminate the weaker brand and increase FUD's profits by $200,000. How much should she be willing to pay

for such perfect information?

Enter the expected value of perfect information in $millions as a decimal number with two digits to the right of the decimal

point (e.g., enter "$5,500,000" as "5.50"). Round if necessary.

First, construct the tree for the decision to buy or not to buy the information. The EMV for not buying the information is

$100,000 no matter which brand Esther chooses to eliminate.

There is a 50% chance that the perfect information Esther acquires will reveal that customers prefer the FUD Brand and a

50% chance it will reveal that customers prefer Pleasant Valley Fare.

Once Esther has the perfect information, she will choose to eliminate the brand that customers are less loyal to, adding

$200,000 to FUD's bottom line.

The EMV of buying perfect information is $200,000. The expected value of perfect information, EVPI, is the difference

between the EMVs of buying and not buying perfect information: $100,000.

Since Esther cannot get perfect information, she's willing to settle for information that is less than perfect and decides to

hire a market research firm to determine which of FUD's two brands customers prefer.

Esther believes that the probability that the market research firm will find that FUD Brand has higher customer loyalty is

50%. Likewise, the probability that the market research firm will predict that Pleasant Valley Fare has higher customer

loyalty is 50%.

Whichever brand the market research firm indicates is stronger, Esther believes that the finding will be correct with

probability 85%. Thus, there is a 15% chance that the finding the firm provides will be wrong.


Enter the expected value of sample information in $millions as a decimal number with three digits to the right of the

decimal point (e.g., enter "$5,500,000" as "5.500"). Round if necessary.

First, construct the tree for the decision to buy or not to buy the sample information using the probabilities and outcomes

cited above. The EMV for not buying the information is $100,000 no matter which brand Esther chooses.

Once Esther has the sample information, she will choose to eliminate the brand that the market research indicates is

weaker. The EMV of eliminating the brand that market research predicts enjoys less customer loyalty is $170,000.

The EMV of buying this imperfect information is $170,000. Thus, the expected value of sample information, EVSI, is the

difference between the EMVs of buying and not buying the sample information: $70,000.
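The arithmetic behind Esther's EVSI fits in three lines; a quick Python check using the figures above:

```python
# Esther's EVSI, in $ thousands; figures from the text.
emv_without = 0.5 * 200 + 0.5 * 0    # either brand: $100K
emv_with = 0.85 * 200 + 0.15 * 0     # eliminate the indicated weaker brand
print(emv_with - emv_without)        # 70.0, i.e., an EVSI of $70,000
```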

After a public relations debacle over a possible outbreak of pulluscular pig disorder (PPD) at one of Bowman-Lyons-

Centerville's hog farms, Paul Segal, head of the hog farming division, is considering vaccinating another herd at a hog farm

located near the original outbreak.

Immunity to PPD is fairly common among pigs; the probability that the entire herd in question is immune is fairly high:

60%.

However, if any pigs in the herd are not immune, the expected cost to Bowman-Lyons-Centerville is $150,000. This cost

has been calculated based on the probability of a PPD outbreak and the ensuing costs to contain the disease and manage

the expected PR fallout.

The cost of vaccinating the herd is $40,000. What is the EMV of the cost of not vaccinating the herd?

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter "$5,500" as

"5.5"). Round if necessary.

The EMV of the cost of not vaccinating the herd is $60,000. Since the cost of vaccinating the herd — $40,000 — is lower

than the expected cost of not vaccinating the herd, Paul should have the herd vaccinated, barring the emergence of any

further information.

There is a test that Paul could perform on the herd to determine if the herd is immune to PPD. This test delivers two

results: "Positive" for immunity, or "Negative" for immunity.

The test is not perfectly accurate: if the entire herd is immune, the test will report "Positive" with probability 85% and

"Negative" with probability 15%. Similarly, if any of the pigs in the herd are not immune, the test will report "Positive" with

probability 30% and "Negative" with probability 70%.

Using Paul's prior probability for the immunity of the herd, calculate P(Positive), the marginal probability that the test will

report a "Positive" result.

Round if necessary.

First, calculate the joint probabilities P(Positive & Immune) and P(Positive & Not Immune), using the marginal

probabilities for immunity and non-immunity — 60% and 40%, respectively — and the conditional probabilities that

quantify the reliability of the test - P(Positive | Immune) and P(Positive | Not Immune).

Then, sum the joint probabilities to find the marginal probability of a "Positive" test result: 63%. Since the test results

"Positive" and "Negative" are mutually exclusive and collectively exhaustive events, the probability of a "Negative" is simply

100% minus the probability of a "Positive," i.e., P(Negative) is 37%.

The tree below will help Paul determine whether or not ordering the immunity test will be worthwhile. The calculated

probabilities of the test results are entered in the appropriate branches. However, to reach a decision, Paul needs the

conditional probabilities of immunity and non-immunity conditioned on a "Positive" test result. What is P(Immune |

Positive)?

Round if necessary.

To calculate P(Immune | Positive), use the definition of conditional probability and divide the joint probability P(Positive &

Immune) by the marginal probability P(Positive). P(Immune | Positive) is 81%. Since immunity and non-immunity are

mutually exclusive and collectively exhaustive events, P(Not Immune | Positive) is simply 100% minus P(Immune |

Positive), i.e., 19%.

Calculate the remaining probabilities on the decision tree. What is the highest amount Paul should be willing to spend on

the immunity test?


Enter the expected value of sample information in $thousands as a decimal number with one digit to the right of the

decimal point (e.g., enter "$5,500" as "5.5"). Round if necessary.

The expected value of not vaccinating given a "Positive" test result is $28,500. Since this is lower than the cost of

vaccination, Paul should choose not to vaccinate on EMV grounds.

The probability that the herd is immune given that the test reports "Negative" is 24.3%. Since the EMV of the cost of

vaccinating the herd ($40,000) is lower than the EMV of the cost of not vaccinating ($113,550) Paul should choose to

vaccinate if the test result is "Negative."

The EMV of conducting the immunity test is $32,755, i.e., the sum of the EMVs of "Negative" and "Positive" test results,

weighted by their respective probabilities.

The expected value of sample information, EVSI, is the difference between the EMVs of testing and not testing for

immunity, i.e., $7,245. The most Paul should be willing to pay for the immunity test is $7,245.
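Because this problem is stated in costs rather than profits, the fold-back minimizes expected cost instead of maximizing EMV. A rough Python sketch using the figures above; note that the course's $7,245 reflects posteriors rounded to 19% and 75.7%, while unrounded arithmetic gives about $7,200:

```python
# EVSI for the herd immunity test, in $ thousands; figures from the text.
# Outcomes here are COSTS, so the fold-back minimizes instead of maximizes.
vaccinate_cost = 40.0
outbreak_cost = 150.0     # expected cost if the herd is not immune
p_immune = 0.60

# Marginals and posteriors from the test's reliability.
p_pos = 0.85 * p_immune + 0.30 * (1 - p_immune)     # 0.63
p_neg = 1 - p_pos                                   # 0.37
p_not_immune_pos = 0.30 * (1 - p_immune) / p_pos    # ~0.19
p_not_immune_neg = 0.70 * (1 - p_immune) / p_neg    # ~0.757

def best_cost(p_not_immune):
    # Pick the cheaper of vaccinating and risking an outbreak.
    return min(vaccinate_cost, p_not_immune * outbreak_cost)

cost_no_test = best_cost(1 - p_immune)              # min(40, 60) = 40
cost_test = p_pos * best_cost(p_not_immune_pos) \
          + p_neg * best_cost(p_not_immune_neg)     # ~32.8

# The course rounds the posteriors to 19% / 75.7%, giving $32,755 and an
# EVSI of $7,245; unrounded arithmetic gives about $7,200.
print(round(cost_no_test - cost_test, 2))           # ~7.2
```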

Risk Analysis

Just as an e-learning course must have a bittersweet last section, so too must your internship on Hawaii have a final day. A

parting Pacific frolic is refreshing preparation for your meeting with Leo.

Introducing Risk

As you dry in the morning sun, Alice invites you to consider Leo's position should he wager the Kahana on the Tethys'

success. "Even if the potential profits are substantial, losing the Kahana if the Tethys tanks would be a severe blow. I wonder

how seriously he's considered the potential downside."

Imagine: upon your arrival at business school you buy yourself a nice new car. It's a sleek, powerful Burgundy Ben Hur by

Chariot — with both a Sweetone stereo and a Theft Discouragement System — worth over $25,000.

City driving can be rough: the probability that you'll be involved in an accident that "totals" your car — reduces its value from

$25,000 to zero — in your first year is 0.01%. Collision insurance to cover such a loss in your new city is optional: it will cost

you $1,600 per year above and beyond the premium for required liability insurance. Will you buy the collision insurance?

Most people would pay the premium to mitigate the risks associated with a car accident leading to such large losses. But let's

look at the decision tree for buying collision insurance. For simplicity, we'll ignore accidents that do not total your car.

After a year in school, your Bennie's market value will have a present value of $25,000. If you don't buy collision insurance

and the car is totaled, you'll lose the full value of the car — $25,000. If — as we all hope — you and your car survive the city's

roads unscathed, you won't lose anything.

Opting for collision insurance entails a $1,600 insurance premium. If you are spared the calamity of a wreck, you'll have

incurred just the premium cost at the end of the year. If traffic tragedy befalls you, your insurance company will pay out the

value of the car minus a $1,000 deductible: so at the end of the year you will be out $2,600: the premium plus the $1,000

deductible.

At an expected loss of just $2.50, the EMV of not buying collision insurance is much better than the EMV of buying the insurance. Essentially, if you

buy the collision insurance, you are paying $1,600 to avoid an expected loss of $2.50! We shouldn't be surprised — insurance

companies are not charities, and from their perspective, selling you insurance has a very favorable EMV.

Still, a majority of drivers choose collision insurance, even though it has a significantly lower EMV. Why might this be the

case?

Let's consider a different situation. Suppose you owned a bright and shiny miniature windup toy Bennie worth $25. Would you

take out a $1.60 insurance policy on it, even if the probability were upwards of 50% that it would be stepped on or lost in the

coming year?

Probably not. A $25 loss is not even a minor catastrophe. If you lost or accidentally destroyed your miniature Bennie, you'd

either replace it quickly, or accept its departure without fuss.

If instead of a nice, new, $25,000 Bennie you owned a 20-year old beat-up and rusty Oldsmobile Delta-88, worth about

$2,500, you'd also think twice about paying $160 to insure it against collision. A loss of $2,500 — though stinging — is

unlikely to be a major setback, especially when weighed against a $160 premium.

But, imagine that you — assisted by your MBA degree — amass a multi-million dollar fortune over the next decades. Would

you still insure a $25,000 car against collision? Perhaps not.

Clearly, the value of the potential loss relative to your net worth has an important influence on your willingness to take on the

risk associated with uncertainty. Suppose you are invited to play a gambling game in which you could win $10 million with


50% probability or lose $2 million with 50% probability. Even though the EMV of playing this game is $4 million, you'd

probably decline participation unless you can afford a $2 million loss.

For most of us, the pain of the $2 million loss would weigh more heavily than the joy of winning $10 million, so we would not

accept this gamble. For someone who can afford a $2 million loss, this game might be very profitable, especially if played

multiple times.

There are formal ways to measure such assessments by assigning a personal utility value to each outcome. Then we can show

that maintaining the status quo — our current assets — provides greater utility than playing the game does. Rigorous

methods of quantifying utility are beyond this course's scope. For now, let's look at how we might gain insight into the

personal utility we associate with different monetary outcomes.

Returning to collision insurance, let's summarize the scenarios and their probabilities and outcome values. You have two

options: "Buy collision insurance" and "Don't buy collision insurance.

If you insure against collision, there are two possible outcomes: either you avoid a wreck for one year and lose the insurance

premium of $1,600, or misfortune strikes and your Bennie is totaled. In the latter case, you lose the premium and the

deductible. The probabilities associated with these two outcomes are 99.99% and 0.01%, respectively. The possible outcomes

and their respective probabilities are listed in the table below.

If you don't insure against collision and park your car safely at year's end, you won't sustain any loss at all. If the unfortunate

alternative occurs and your Bennie is totaled, you'll have lost its value of $25,000. Again, the probabilities of these outcomes

are 99.99% and 0.01%, respectively.

We now have two tables that completely summarize the possible scenarios of our decision. The monetary values of these

scenarios are arranged from top to bottom, from the best outcomes to the worst. These tables — one for each option — are

together referred to as the risk profiles for the decision.
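A risk profile is just a table mapping each outcome to its probability, which makes it easy to sketch in code. A minimal Python version of the two profiles above (the structure and names are ours):

```python
# Risk profiles for the collision-insurance decision:
# each option maps outcomes (in $) to their probabilities.
risk_profiles = {
    "Buy insurance":       {-1_600: 0.9999, -2_600: 0.0001},
    "Don't buy insurance": {0: 0.9999, -25_000: 0.0001},
}

for option, profile in risk_profiles.items():
    emv = sum(outcome * p for outcome, p in profile.items())
    worst = min(profile)  # the worst outcome the option exposes us to
    print(f"{option}: EMV = ${emv:,.2f}, worst case = ${worst:,}")
# "Don't buy" has the better EMV (-$2.50 vs. -$1,600.10),
# but a far worse worst case (-$25,000 vs. -$2,600).
```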

From the risk profiles, it's easy to recognize that the option of not buying insurance will deliver the most preferred outcome if

you don't have an accident but will deliver the least desirable outcome if you do. This insight provides the basis for you to

prefer to buy the collision insurance: it exposes you to lower risk, even though it has a lower EMV.

Summary

The EMV criterion is not the only criterion that informs decisions. Besides wanting to maximize our average outcomes in

the long run, we want to minimize our exposure to risk, especially when potential losses are high relative to our own net

worth.

Risk Attitudes

Recall Seth Chaplin and S&C Films' production of the movie Cloven. Seth has a choice between two options. In a deal with

Pony Pictures, S&C produces the film, and Pony acquires the complete rights to Cloven for a flat production fee. In a

second deal with K2 Classics, S&C retains part ownership, and the financial outcomes for S&C depend on Cloven's

performance at the box office.

Seth knows from previous analysis that the K2 deal has a higher EMV: $2.04 million vs. $1.48 million for the Pony deal.

However, he also knows the K2 deal is riskier. Before making a final decision, Seth wants to understand the full

implications of his two options. Let's build risk profiles for each deal.

There are seven possible scenarios that can occur. The outcome values of these scenarios depend on Seth's choice of a

production partner, on whether or not superstar actor Shawn Connelly lends his gravelly baritone to the film's lead

character, and on the audience's reception of the film.

Let's summarize the possible scenarios. If Seth chooses the Pony option, two scenarios could occur. In one, Connelly

participates in the movie and Seth's profits are $2.2 million. In the other, Connelly does not participate in the movie and

Seth's profits are only $1 million. If Seth opts for the Pony deal, the probabilities of these two scenarios occurring are 40%

and 60%, respectively.

If Seth takes the K2 option, there are five possible scenarios, but only three possible outcomes: "Blockbuster" success, a

"Lackluster" performance, and a "Flop."

A "Blockbuster" success can occur whether or not Connelly participates in the production. The probability of a

"Blockbuster" is 38%: this probability is calculated by adding the joint probability that Cloven stars Connelly and is a

"Blockbuster" to the joint probability that Cloven is a "Blockbuster" without Connelly's participation.

A "Lackluster" performance can also occur whether or not Connelly participates in the production. The probability of a

"Lackluster" performance is 50%: this probability is calculated by adding the joint probability that Cloven stars Connelly

and has a "Lackluster" performance to the joint probability that Cloven has a "Lackluster" performance without Connelly's

participation.


A "Flop" occurs only when Connelly doesn't take part in the production of Cloven. The total probability of a "Flop" is the

joint probability of Cloven being a "Flop" and Connelly not taking part: 12%.

Seth now has complete risk profiles for the two options. The risk profiles give Seth a quick overview of the possible

outcomes — and their respective probabilities — for each choice. Note that we've combined scenarios so that we now have

probabilities for each distinct outcome value. How does a manager like Seth use risk profiles to inform a decision?

Risk profiles allow Seth to compare options not just in terms of their EMVs but in terms of the risk they expose him to.

From Seth's risk profile, we can see that, although the K2 option has the higher EMV, it is much riskier: a $2 million loss is

possible, whereas the Pony option doesn't include any possible loss.

Which option Seth should choose ultimately depends on how comfortable he is with the risk associated with the K2 option.

The Pony option is certain to deliver a profit. The K2 option might generate a substantial loss.

If a $2 million loss is more than Seth believes his company can — or should — sustain, he may want to choose the Pony

option despite its lower EMV. If Seth chooses the Pony option, he will be demonstrating that he is risk averse: he is

choosing an option with a lower EMV in return for lower exposure to risk.

Most people are at least slightly risk averse, as shown by their inclinations to buy insurance. Some people tend to be risk

seeking, that is, they are willing to forgo options with higher EMVs and accept higher levels of risk in return for the

possibility of extremely high returns.

Although risk-seeking behavior is generally rare in the population when significant down-side risk is involved, many people

tend to be somewhat risk seeking when the value of the loss risked is low. For example, the EMV for playing the lottery is

lower than the EMV for not playing the lottery. However, since the amount risked by playing the lottery — the price of a

lottery ticket — is so low, many people choose to play it in return for the unlikely outcome that they will win a multi-million

dollar prize.

When we use the EMV as the basis for decision making, we are acting in a risk neutral manner. Most people are risk

neutral when they make routine "day in, day out" types of decisions — those for which potential losses are not terribly

harmful to the decision-maker or her organization.

People feel comfortable using the EMV for routine decisions because they feel comfortable "playing the averages": some

decisions will result in losses and some in gains, but over the long run, the outcomes will average out. We expect that, over

time, choosing options with the highest EMVs will lead to highest total value.

Most people feel comfortable using EMV as the basis for decisions that are not made repeatedly — provided the decisions

have outcome values similar to those of other, more routine decisions that they make on a regular basis.

Why would Pony Pictures agree to purchase the rights to Cloven and take on all the risk associated with marketing and

distributing the film? For Pony, the deal with Seth is a routine decision. Pony is a large enough company to sustain a

potential multi-million dollar loss if Cloven flops. Pony also distributes 10 to 12 movies a year, so overall, the positive EMV

of a film release promises that in the long run, Pony's operations will be profitable.

Seth knows that the K2 deal is very risky: if he chooses the K2 deal, he will tie his company's financial success to box office

performance and to Shawn Connelly's whims. If Cloven "Flops," S&C Films might well go bankrupt.

Summary

Risk profiles allow us to assess the utility different outcomes bring us, as opposed to their monetary value. The concise summaries risk profiles provide help us compare and contrast our different decision options, allowing us to choose the option we prefer based on our attitude toward risk: risk averse, risk seeking, or risk neutral.

"Although Leo's market research event doesn't pay off in terms of its expected value," Alice explains, "it could help him

manage his risk."

Let's assemble the risk profile for Leo's decision, including the decision about whether or not to run his market research

event. We have to keep in mind that the cost of running his event is $240,000.

Which of the following tables correctly summarizes the outcomes and probabilities if Leo chooses to run his market research

event?

If Leo chooses to run his market research event, then the participants' response can be either "Enthusiastic" or "Tepid." If it's

"Tepid," Leo will choose to stay out of the floating restaurant business, so there is a 60% chance that Leo will incur only the

cost of the event, $240,000.


If the response to the event is "Enthusiastic," then there is a 65% chance that Leo will make a profit of $1.76 million — the

original profit estimate of $2 million, minus the event's cost of $240,000. The likelihood of a $1.76 million profit is (40%)

(65%), or 26%.

If the response to the event is "Enthusiastic," then there is a 35% chance that Leo will suffer a loss of $1.04 million — the

original loss estimate of $800,000, minus the event's cost of $240,000. The likelihood of a $1.04 million loss is (40%)(35%),

or 14%.
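The full risk profile for running the event can be checked the same way; a short Python sketch using the outcomes and probabilities above:

```python
# Risk profile for "run the market research event" ($ millions).
# Outcomes are net of the event's $0.24M cost; probabilities from above.
profile = {
    1.76: 0.40 * 0.65,   # Enthusiastic, then Phenomenon: 26%
   -0.24: 0.60,          # Tepid: stay out, lose only the event cost
   -1.04: 0.40 * 0.35,   # Enthusiastic, then Fad: 14%
}
assert abs(sum(profile.values()) - 1.0) < 1e-9
emv = sum(outcome * p for outcome, p in profile.items())
print(f"EMV = {emv:.3f}")  # 0.168 = 0.408 minus the 0.24 event cost
```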

You next complete the risk profile for Leo's profits if he chooses not to run the event. You present your risk analysis to Leo to

show him why he might want to run his event even if its expected monetary value is not optimal.

I see. So if I run my event, I'm most likely to incur the cost of the event, and nothing else. But, if the response is

"Enthusiastic," I should go ahead with my plan. Then, there is a small chance that I will incur a large loss if the restaurant

becomes a mere "Fad," and a good chance that the Tethys will become a "Phenomenon."

On the other hand, if I don't run my event, although my potential profits are slightly higher, the risk is too great. I just have too

much to lose at this point.

At the same time, I think I can afford to pay for a market research event now, without going to the bank. Then if my guests

react favorably, I might consider diving into the Tethys adventure.

I want you two to enjoy your last day here and have dinner with me tonight. Anywhere special you'd like to go?

Hmmm. I know this wonderful place that serves the most delightful mix of French and Hawaiian cuisines...

Aw, shucks.

Danforth Pitt was injured in a bisque-spilling incident in a Hawaiian luxury resort restaurant. Danforth, who insists he

hasn't been able to remove the scent of crab from his knees, is pursuing his legal options.

After initial investigation and consultation, Pitt's lawyers explain to Danforth that he can expect to win $250,000 in

damages with 5% probability or $100,000 with 25% probability, before subtracting the attorney's fees. The fees associated

with pursuing legal action are estimated at $50,000, which Pitt must pay whether he wins or loses the suit.

What is the EMV of pursuing legal action against the resort hotel?

Enter the EMV in $thousands as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,000" as

"5.00"). Round if necessary.

To avoid a legal battle, the hotel's owner agrees to settle out of court, offering Danforth two full weeks in the hotel's

penthouse suite free of charge. Including amenities, the value of this offer is $12,000.

If Danforth chooses to reject the offer to settle out of court, and to sue the hotel instead, he should be characterized as

which of the following?

Since the EMV of pursuing legal action is lower than the EMV of settling out of court, but the possible gains of legal action,

though relatively unlikely, are much higher than the EMV of settling, Danforth would have to be characterized as risk

seeking.

Welcome to the post-assessment test for the HBS Quantitative Methods Tutorial.

Navigation:

To advance from one question to the next, select one of the answer choices or, if applicable, complete with your own choice and

click the “Submit” button. After submitting your answer, you will not be able to change it, so make sure you are satisfied with your

selection before you submit each answer. You may also skip a question by pressing the forward advance arrow. Please note that

you can return to “skipped” questions using the “Jump to unanswered question” selection menu or the navigational arrows at any

time. Although you can skip a question, you must navigate back to it and answer it - all questions must be answered for the exam

to be scored.

In the briefcase, links to Excel spreadsheets containing z-value and t-value tables as well as utilities for finding confidence

intervals and conducting hypothesis tests are provided for your convenience. For some questions, additional links to Excel

spreadsheets containing relevant data will appear immediately below the question text.


Your results will be displayed immediately upon completion of the exam.

After completion, you can review your answers at any time by returning to the exam.

Good luck!

How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in the

course.

Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book

examination.

May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on exams

such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam question.

Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity with

the material, but you may take longer if you need to.

What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices will

be recorded for the questions you were able to complete and you will be able to pick up where you left off when you return to

the exam site.

How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question. The

results screen will indicate which questions you answered correctly.

Welcome to the second post-assessment test for the HBS Quantitative Methods Tutorial.

Navigation:

To advance from one question to the next, select one of the answer choices or, if applicable, complete with your own choice and

click the “Submit” button. After submitting your answer, you will not be able to change it, so make sure you are satisfied with your

selection before you submit each answer. You may also skip a question by pressing the forward advance arrow. Please note that

you can return to “skipped” questions using the “Jump to unanswered question” selection menu or the navigational arrows at any

time. Although you can skip a question, you must navigate back to it and answer it - all questions must be answered for the exam

to be scored.

In the briefcase, links to Excel spreadsheets containing z-value and t-value tables are provided for your convenience. For some

questions, additional links to Excel spreadsheets containing relevant data will appear immediately below the question text.

IMPORTANT: Take this exam only after taking the previous final assessment exam. There are only two final

assessment exams. Thus, if you did not receive a passing score on the first, it is critical that you review the

course material carefully before taking this exam.

After completion, you can review your answers at any time by returning to the exam.

Good luck!

How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in the

course.

Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book

examination.

May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on exams

such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam question.

Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity with

the material, but you may take longer if you need to.

What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices will

be recorded for the questions you were able to complete and you will be able to pick up where you left off when you return to


the exam site.

How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question. The

results screen will indicate which questions you answered correctly.

