

Pre-Assessment Test Introduction

Welcome to the pre-assessment test for the HBS Quantitative Methods Tutorial.

Navigation:

To advance from one question to the next, select one of the answer choices or, if applicable, complete with your own choice and

click the “Submit” button. After submitting your answer, you will not be able to change it, so make sure you are satisfied with your

selection before you submit each answer. You may also skip a question by pressing the forward advance arrow. Please note that

you can return to “skipped” questions using the “Jump to unanswered question” selection menu or the navigational arrows at any

time. Although you can skip a question, you must navigate back to it and answer it; all questions must be answered for the exam to be scored.

In the Briefcase, links to Excel spreadsheets containing z-value and t-value tables are provided for your convenience. For some

questions, additional links to Excel spreadsheets containing relevant data will appear immediately below the question text.

After completion, you can review your answers at any time by returning to the exam.

Good luck!

How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in the

course.

Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book

examination.

May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on exams

such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam question.

Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity with

the material, but you may take longer if you need to.

What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices will

be recorded for the questions you were able to complete and you will be able to pick up where you left off when you return to

the exam site.

How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question. The

results screen will indicate which questions you answered correctly.

Welcome to QM...

Welcome! You are about to embark on a journey that will introduce you to the basics of quantitative and statistical analysis.

This course will help you develop your skills and instincts in applying quantitative methods to formulate, analyze, and solve

management decision-making problems.

Click on the link labeled "The Tutorial and its Method" in the left menu to get started.

QM is designed to help you develop quantitative analysis skills in business contexts. Mastering its content will help you

evaluate management situations you will face not only in your studies but also as a manager. Click on the right arrow icon

below to advance to the next page.

This isn't a formal or comprehensive course in quantitative methods. QM won't make you a statistician, but it will help you

become a more effective manager.


The tutorial's primary emphasis is on developing good judgment in analyzing management problems. Whether you are

learning the material for the first time or are using QM to refresh your quantitative skills, you can expect the tutorial to

improve your ability to formulate, analyze, and solve managerial problems.

You won't be learning quantitative analysis in the typical textbook fashion. QM's interactive nature provides frequent

opportunities to assess your understanding of the concepts and how to apply them — all in the context of actual management

problems.

You should take 15 to 20 hours to run through the whole tutorial, depending on your familiarity with the material. QM offers

many features we hope you will explore, utilize, and enjoy.

Naturally, the most appropriate setting for a course on statistics is a tropical island...

Somehow, "internship" is not the way you'd describe your summer plans to your friends. You're flying out to Hawaii after all,

staying at a 5-star hotel as a Summer Associate with Avio Consulting.

This is a great learning opportunity, no doubt about it. To think that you had almost skipped over this summer internship, as

you prepared to enroll in a two-year MBA program this fall.

You are also excited that the firm has assigned Alice, one of its rising stars, as your mentor. It seems clear that Avio partners

consider you a high potential intern — they are willing to invest in you with the hope that you will later return after you

complete your MBA program.

Alice recently received the latest in a series of quick promotions at Avio. This is her first assignment as a project lead:

providing consulting assistance to the Kahana, an exclusive resort hotel on the Hawaiian island Kauai.

Needless to say, one of the perks of the job is the lodging. The Kahana's brochure looks inviting — luxury suites, fine cuisine,

a spa, sports activities. And above all, the pristine beach and glorious ocean.

After your successful interview with Avio, Alice had given you a quick briefing on the hotel and its manager, Leo.

Leo inherited the Kahana just three years ago. He has always been in the hospitality industry, but the sheer scope of the

luxury hotel's operations has him slightly overwhelmed. He has asked for Avio's help to bring a more rigorous approach to his

management decision-making process.

Before you start packing your beach towel, read this section to learn how to use this tutorial to your greatest advantage.

QM's structure and navigational tools are easy to master. If you're reading this text, you must have clicked on the link labeled

"Using the Tutorial" on the left.

There are three types of interactive clips: Kahana Clips, Explanatory Clips, and Exercise Clips.

Kahana Clips pose problems that arise in the context of your consulting engagement at the Kahana. Typically, one clip will

have Leo assign you and Alice a specific task. In a later Kahana Clip you will analyze the problem, and you and Alice will

present your results to Leo for his consideration. The Kahana Clips will give you exposure to the types of business problems

that benefit from the analytical methods you'll be learning, and a context for practicing the methods and interpreting their

results.

To fully benefit from the tutorial, you should solve all of Leo's problems. At the end of the tutorial, a multiple-choice

assessment exam will evaluate your understanding of the material.

In Explanatory Clips, you will learn everything needed to analyze management problems like Leo's. Complementing the

text are graphs, illustrations, and animations that will help you understand the material. Keep on your toes: you'll be asked

questions even in Explanatory Clips that you should answer to check your understanding of the concepts.

Some Explanatory Clips give you directions or tips on how to use the analytical and computational features of Microsoft

Excel. Facility with the necessary Excel functions will be critical to solving the management decision problems in this course.

QM is supplemented with spreadsheets of data relating to the examples and problems presented. When you see a Briefcase

link in a clip, we strongly encourage you to click on the link to access the data. Then, practice using the Excel functions to

reproduce the graphs and analyses that appear in the clips.

You will also see Data links that you should click to view summary data relating to the problem.

Exercise Clips provide additional opportunities for you to test your understanding of the material. They are a resource that

you can use to make sure that you have mastered the important concepts in each section.


Work through exercises to solidify your knowledge of the material. Challenge exercises provide opportunities to tackle

somewhat more advanced problems. The challenge exercises are optional; you do not need to complete them to gain the mastery needed to pass the tutorial assessment test.

The arrow buttons immediately below are used for navigation within clips. If you've made it this far, you've been using the

one on the right to move forward.

Use the one on the left if you want to back up a page or two.

In the upper right of the QM tutorial screen are three buttons. From left to right they are links to the Help, Briefcase, and

Glossary.

In your Briefcase you'll find all the data you'll need to complete the course, neatly stored as Excel workbooks. In many of

the clips there will be links to specific documents in the Briefcase, but the entire Briefcase is available at any time.

In the Glossary/Index you'll find a list of helpful definitions of terms used in the course, along with brief descriptions of the

Excel functions used in the course.

We encourage you to use all of QM's features and resources to the fullest. They are designed to help you build an intuition for

quantitative analysis that you will need as an effective and successful manager.

The day of departure has come, and you're in flight over the Pacific Ocean. Alice graciously let you take the window seat, and

you watch as the foggy West Coast recedes behind you.

I've been to Hawaii before, so I'll let you have the experience of seeing the islands from the air before you set foot on them.

This Leo sounds like quite a character. He's been in business all his life, involved in many ventures — some more successful

than others. Apparently, he once owned and managed a gourmet Spam restaurant!

Spam is really popular among the islanders. Leo tried to open a second location in downtown Honolulu for the tourists, but that

didn't do so well. He had to declare bankruptcy.

Then, just three years ago, his aunt unexpectedly left him the Kahana. Now Leo is back in business, this time with a large

operation on his hands.

It sounds to me like he's the kind of manager who usually relies on gut instincts to make business decisions, and likes to take

risks. I think he's hired Avio to help him make managerial decisions with, well, better judgment. He wants to learn how to

approach management problems in a more sophisticated, analytical fashion.

We'll be using some basic statistical tools and methods. I know you're no expert in statistics, but I'll fill you in along the way.

You'll be surprised at how quickly they'll become second nature to you. I'm confident you'll be able to do quite a bit of the

analytic work soon.

Once your plane touches down in Kauai, you quickly pick up your baggage and meet your host, Leo, outside the airport.

Inheriting the Kahana came as a big surprise. My aunt had run the Kahana for a long time, but I never considered that she

would leave it to me.

Anyway, I've been trying my best to run the Kahana the way a hotel of its quality deserves. I've had some ups and downs.

Things have been fairly smooth for the past year now, but I've realized that I have to get more serious about the way I make

decisions. That's where you come into the picture.

I used to be quite a risk-taker. I made a lot of decisions on impulse. Now, when I think of what I have to lose, I just want to

get it right.

After you arrive at the Kahana, Leo personally shows you to your rooms. "I have a table reserved for the three of us at 8 in the

main restaurant," Leo announces. "You just have to try our new chef's mango and brie tart."


After your welcome dinner in the Kahana's main restaurant, Leo asks you and Alice to meet him the next morning. You wake up

early enough to take a short walk on the beach before you make your way to Leo's office.

Good morning! I hope you found your rooms comfortable last night and are starting to recover from your trip.

Unfortunately, I don't have much time this morning. As you requested on the phone, I've assembled the most important data on

the Kahana. It wasn't easy — this hasn't been the most organized hotel in the world, especially since I took over. There's just so

much to keep track of.

Thank you, Leo. We'll have a look at your data right away, so we can get a more detailed understanding of the Kahana and the

type of data you have available for us to work with. Anything in particular that you'd like us to focus on as we peruse your files?

Yes. There are two things in particular that have been on my mind recently.

For one, we offer some recreational activities here at the Kahana, including a scuba diving certification course. I contract out

the operations to a local diving school. The contract is up soon, and I need to renew it, hire another school, or discontinue

offering scuba lessons altogether.

I'd like you to get me some quotes from other diving schools on the island so I get an idea of the competition's pricing and how

it compares to the school I've been using.

I'm also very concerned about hotel occupancy rates. As you might imagine, the Kahana's occupancy fluctuates during the year,

and I'd like to know how, when, and why. I'd love to have a better feeling for how many guests I can expect in a given month.

These files contain some information about tourism on the island, but I'd really like you to help me make better sense of it.

Somehow I feel that if I could understand the patterns in the data, I could better predict my own occupancy rates.

That's what we're here to do. We'll take a look at your files to get better acquainted with the Kahana, and then focus on diving

school prices and occupancy patterns.

Thanks, or as we say in Hawaiian, Mahalo. By the way, we're not too formal here in Hawaii. As you probably noticed, Alice,

your suite includes a room that has been set up as an office. But feel free to take your work down to the beach or by the pool

whenever you like.

Later, under a parasol at the beach, you pore over Leo's folders. Feeling a bit overwhelmed, you find yourself staring out to sea.

Alice tells you not to worry: "We have a number of strategies we can use to compile a mountain of data like this into concise and

useful information. But no matter what data you are working with, always make sure you really understand the data before

doing a lot of analysis or making managerial decisions."

What is Alice getting at when she tells you to "understand the data"? And how can you develop such an understanding?

Data can be represented by graphs like histograms. These visual displays allow you to quickly recognize patterns in the

distribution of data.

Information overload. Inventory costs. Payroll. Production volume. Asset utilization. What's a manager to do?

The data we encounter each day have valuable information buried within them. As managers, we can greatly improve the quality of the decisions we make by correctly analyzing financial, production, or marketing data.

Analyzing data can be revealing, but challenging. As managers, we want to extract as much of the relevant information and insight as possible from the data we have available.

When we acquire a set of data, we should begin by asking some important questions: Where do the data come from? How

were they collected? How can we help the data tell their story?

Suppose a friend claims to have measured the heights of everyone in a building. She reports that the average height was three

and a half feet. We might be surprised, and would first want to know who was in the building: an average that low suggests small children, not adults.

We'd also want to know if our friend used a proper measuring stick. Finally, we'd want to be sure we knew how she measured

height: with or without shoes.


Before starting any type of formal data analysis, we should try to get a preliminary sense of the data. For example, we might

first try to detect any patterns, trends, or relationships that exist in the data.

We might start by grouping the data into logical categories. Grouping data can help us identify patterns within a single

category or across different categories. But how do we do this? And is this often time-consuming process worth it?

Accountants think so. Balance Sheets and Profit and Loss Statements arrange information to make it easier to comprehend.

In addition, accountants separate costs into categories such as capital investments, labor costs, and rent. We might ask: Are

operating expenses increasing or decreasing? Do office space costs vary much from year to year?

Comparing data across different years or different categories can give us further insight. Are selling costs growing more

rapidly than sales? Which division has the highest inventory turns?

Histograms

In addition to grouping data, we often graph them to better visualize any patterns in the data. Seeing data displayed

graphically can significantly deepen our understanding of a data set and the situation it describes.

To see the value a graphical approach can add, let's look at worldwide consumption of oil and gas in 2000. What questions

might we want to answer with the energy data? Which country is the largest consumer? How much energy do most

countries use?

In order to create a graph that provides good visual insight into these questions, we might sort the countries by their level

of energy consumption, then group together countries whose consumption falls in the same range — e.g., the countries that

use 100 to 199 million tonnes per year, or 200 to 299 million tonnes.

We can find the number of countries in each range, and then create a bar graph in which the height of each bar represents

the number of countries in each range. This graph is called a histogram.

A histogram shows us where the data tend to cluster. What are the most common values? The least common? For example,

we see that most countries consume less than 100 million tonnes per year, and the vast majority less than 200 million

tonnes. Only three countries, Japan, Russia, and the US, consume more than 300 million tonnes per year.

Why are there so many countries in the first range — the lowest consumption? What factors might influence this?

Population might be our first guess.

Yet despite a large population, India's energy consumption is significantly less than that of Germany, a much smaller

nation. Why might this be? Clearly other factors, like climate and the extent of industrialization, influence a country's

energy usage.

Outliers

In many data sets, there are occasional values that fall far from the rest of the data. For example, if we graph the age

distribution of students in a college course, we might see a data point at 75 years. Data points like this one that fall far from

the rest of the data are known as outliers. How do we interpret them?

First, we must investigate why an outlier exists. Is it just an unusual, but valid value? Could it be a data entry error? Was it

collected in a different way than the rest of the data? At a different time?

We might discover that the data point refers to a 75-year-old retiree, taking the course for fun.

After making an effort to understand where an outlier comes from, we should have a deeper understanding of the situation

the data represent. Then, we can think about how to handle the outlier in our analysis. Typically, we do one of three things:

leave the outlier alone, or — very rarely — remove it or change it to a corrected value.

A senior citizen in a college class may be an outlier, but his age represents a legitimate value in the data set. If we truly want

to understand the age distribution of all students in the class, we would leave the point in.

Or, if we now realize that what we really want is the age distribution of students in the course who are also enrolled in full-time degree-granting programs, we would exclude the senior citizen and all other non-degree program students enrolled in the course.

Occasionally, we might change the value of an outlier. This should be done only after examining the underlying situation in

great detail.


For example, if we look at the inventory graph below, a data point showing 80 pairs of roller-blades in inventory would be

highly unusual.

Notice that the data point "80" was recorded on April 13th, and that the inventory was 10 pairs on April 12th, and 6 on

April 14th.

Based on our management understanding of how inventory levels rise and fall, we realize that the value of 80 is

extraordinarily unlikely. We conclude that the data point was likely a data entry error. Further investigation of sales and

purchasing records reveals that the actual inventory level on that day was 8, not 80. Having found a reliable value, we

correct the data point.

Excluding or changing data is not something we do often. We should never do it to help the data 'fit' a conclusion we want

to draw. Such changes to a data set should be made on a case-by-case basis only after careful investigation of the situation.

Summary

With any data set we encounter, we must find ways to allow the data to tell their story. Ordering and graphing data sets

often expose patterns and trends, thus helping us to learn more about the data and the underlying situation. If data can

provide insight into a situation, they can help us to make the right decisions.

Creating Histograms

Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to create histograms using the Histogram tool. However, we suggest you read through the instructions to learn how Excel creates histograms so you can construct them in the future when you do have access to the Data Analysis ToolPak.

To check if the ToolPak is installed on your computer, go to the Data tab in the Toolbar in Excel 2007. If "Data Analysis" appears in the Ribbon, the ToolPak has already been installed. If not, click the Office Button in the top left and select "Excel Options." Choose "Add-Ins," highlight the "Analysis ToolPak" in the list, and click "Go." Check the box next to Analysis ToolPak and click "OK." Excel will then walk you through a setup process to install the ToolPak.

Creating a histogram with Excel involves two steps: preparing our data, and processing them with the Data Analysis

Histogram tool.

To prepare the data, we enter or copy the values into a single column in an Excel worksheet.

Often, we have specific ranges in mind for classifying the data. We can enter these ranges, which Excel calls "bins," into a

second column of data.

In the Toolbar, select the Data tab, and then choose Data Analysis.

In the Data Analysis pop-up window, choose Histogram and click OK.

Click on the Input Range field and enter the range of data values by either typing the range or by dragging the cursor over

the range.

Next, to use the bins we specified, click on the Bin Range field and enter the appropriate range. Note: if we don't specify

our own bins, Excel will create its own bins, which are often quite peculiar.

Click the Chart Output checkbox to indicate that we want a histogram chart to be generated in addition to the summary

table, which is created by default.

Click New Worksheet Ply, and enter the name you would like to give the output sheet.

Finally, click OK, and the histogram with the summary table will be created in a new sheet.
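
To make these steps concrete, here is a minimal sketch using hypothetical cell locations (the actual layout depends on your workbook). Suppose the energy consumption values sit in cells A2:A64 and the bin boundaries (99, 199, 299, and so on) in cells B2:B8. The Histogram dialog would then be filled in as:

Input Range: $A$2:$A$64
Bin Range: $B$2:$B$8
Chart Output: checked
New Worksheet Ply: Histogram

Excel counts the values that fall at or below each bin boundary (and above the previous boundary) and draws one bar per bin.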

Graphs are very useful for gaining insight into data. However, sometimes we would like to summarize the data in a concise

way with a single number.

The Mean

Often, we'd like to summarize a set of data with a single number. We'd like that summary value to describe the data as well

as possible. But how do we do this? Which single value best represents an entire set of data? That depends on the data

we're investigating and the type of questions we'd like the data to answer.

What number would best describe employee satisfaction data collected from annual review questionnaires? The numerical

average would probably work quite well as a single value representing employees' experiences.


To calculate average — or mean — employee satisfaction, we take all the scores, sum them up, and divide the result by 11,

the number of surveys. The Greek letter mu represents the mean of the data set.
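
In symbols, for a data set with n points x₁, x₂, …, xₙ, the mean is:

μ = (x₁ + x₂ + … + xₙ) / n

For the satisfaction data, μ is the sum of the 11 scores divided by 11.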

The mean is by far the most common measure used to describe the "center" or "central tendency" of a data set. However,

it isn't always the best value to represent data. Outliers can exercise undue influence and pull the mean value towards one

extreme.

In addition, if the distribution has a tail that extends out to one side — a skewed distribution — the values on that side will

pull the mean towards them. Here, the distribution is strongly skewed to the right: the high value of US consumption pulls

the mean to a value higher than the consumption of most other countries. What other numbers can we use to find the

central tendency of the data?

The Median

Let's look at the revenues of the top 100 companies in the US. The mean revenue of these companies is about $42 billion.

How should we interpret this number? How well does this average represent the revenues of these companies?

When we examine the revenue distribution graphically, we see that most companies bring in less than $42 billion of

revenue a year. If this is true, why is the mean so high?

As our intuition might tell us, the top companies have revenues that are much higher than $42 billion. These higher

revenues pull up the average considerably.

In cases like income, where the data are typically very skewed, the mean often isn't the best value to represent the data. In

these cases, we can use another central value called the median.

The median is the middle value of a data set whose values are arranged in numerical order. Half the values are higher than

the median, and half are lower.

For the revenue data, the median revenue of the top 100 US companies is $30 billion, significantly less than $42 billion. Half of all the companies earn less than $30 billion, and half earn more than $30 billion.

Median revenue is a more informative revenue estimate because it is not pulled upwards by a small number of high-revenue earners. How can we find the median?

With an odd number of data points, listed in order, the median is simply the middle value. For example, consider this set

of 7 data points. The median is the 4th data point, $32.51.

In a data set with an even number of points, we average the two middle values — here, the fourth and fifth values — and

obtain a median of $41.92.

When deciding whether to use a mean or median to represent the central tendency of our data, we should weigh the pros

and cons of each. The mean weighs the value of every data point, but is sometimes biased by outliers or by a highly skewed

distribution.

By contrast, the median is not biased by outliers and is often a better value to represent skewed data.

The Mode

A third statistic to represent the "center" of a data set is its mode: the data set's most frequently occurring value. We might

use the mode to represent data when knowing the average value isn't as important as knowing the most common value.

In some cases, data may cluster around two or more points that occur especially frequently, giving the histogram more

than one peak. A distribution that has two peaks is called a bimodal distribution.

Summary


To summarize a data set using a single value, we can choose one of three values: the mean, the median, or the mode. They are often called summary statistics or descriptive statistics. All three give a sense of the "center" or "central tendency" of the data set, but we need to understand how they differ before using them.

To find the mean of a data set entered in Excel, we use the AVERAGE function.

We can find the mean of numerical values by entering the values in the AVERAGE function, separated by commas.

In most cases, it's easier to calculate a mean for a data set by indicating the range of cell references where the data are

located.

Excel ignores blank values in cells, but not zeros. Therefore, we must be careful not to put a zero in the data set if it does

not represent an actual data point.

Excel can find the median, even if a data set is unordered, using the MEDIAN function.

The easiest way to calculate a data set's median is to select a range of cell references.

Excel can also find the most common value of a data set, the mode, using the MODE function.

If more than one mode exists in a data set, Excel will find the one that occurs first in the data.
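
As a quick sketch, assuming the satisfaction scores were entered in cells A2 through A12 (a hypothetical layout), the three statistics would be computed as:

=AVERAGE(A2:A12)
=MEDIAN(A2:A12)
=MODE(A2:A12)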

Mean, median, and mode are fairly intuitive concepts. Already, Leo's mountain of data seems less intimidating.

Variability

The mean, median, and mode give you a sense of the center of the data, but none of these indicate how far the data are spread

around the center. "Two sets of data could have the same mean and median, and yet be distributed completely differently

around the center value," Alice tells you. "We need a way to measure variation in the data."

It's often critical to have a sense of how much data vary. Do the data cluster close to the center, or are the values widely

dispersed?

Let's look at an example. To identify good target markets, a car dealership might look at several communities and find the

average income of each. Two communities — Silverhaven and Brighton — have average household incomes of $95,500 and

$97,800. If the dealer wants to target households with incomes above $90,000, he should focus on Brighton, right?

We need to be more careful: the mean income doesn't tell the whole story. Are most of the incomes near the mean, or is there

a wide range around the average income? A market might be less attractive if fewer households have an income above the

dealer's target level. Based on average income alone, Brighton might look more attractive, but let's take a closer look at the

data.

Despite having a lower average income, incomes in Silverhaven have less variability, and more households are in the dealer's

target income range. Without understanding the variability in the data, the dealer might have chosen Brighton, which has

fewer targeted homes.

Clearly it would be helpful to have a simple way to communicate the level of variability in the household incomes in two

communities.

Just as we have summary statistics like the mean, median, and mode to give us a sense of the 'central tendency' of a data set,

we need a summary statistic that captures the level of dispersion in a set of data.

The standard deviation is a common measure for describing how much variability there is in a set of data. We represent the standard deviation with the Greek letter sigma, σ.

The standard deviation emerges from a formula that looks a bit complicated initially, so let's try to understand it at a

conceptual level first. Then we'll build up step by step to help understand where the formula comes from.

The standard deviation tells us how far the data are spread out. A large standard deviation indicates that the data are widely

dispersed. A smaller standard deviation tells us that the data points are more tightly clustered together.


Calculating

A hotel manager has to staff the front reception desk in her lobby. She initially focuses on a staffing plan for Saturdays,

typically a heavy traffic day. In the hospitality industry, like many service industries, proper staffing can make the

difference between unhappy guests and satisfied customers who want to return.

On the other hand, overstaffing is a costly mistake. Knowing the average number of customer requests for services during a

shift gives the manager an initial sense of her staffing needs; knowing the standard deviation gives her invaluable

additional information about how those requests might vary across different days.

The average number of customer requests is 172, but this doesn't tell us there are 172 requests every Saturday. To staff

properly, the hotel manager needs a sense of whether the number of requests will typically be between 150 and 195, for

example, or between 120 and 220.

To calculate the standard deviation for data — in this case the hotel traffic — we perform two steps. The first is to calculate

a summary statistic called the variance.

Each Saturday's number of requests lies a certain distance from 172, the mean number of requests. To find the variance, we

first sum the squares of these differences. Why square the differences?

A hotel manager would want information about the magnitude of each difference, which can be positive, negative, or zero.

If we simply summed the differences between each Saturday's requests and the mean, positive and negative differences

would cancel each other out.

But we are interested in the magnitude of the differences, regardless of their sign. By squaring the differences, we get only

positive numbers that do not cancel each other out in a sum.

The formula for variance adds up the squared differences and divides by n-1 to get a type of "average" squared difference as

a measure of variability. (The reason we divide by n-1 to get an average here is a technicality beyond the scope of this

course.) The variance in the hotel's front desk requests is 637.2. Can we use this number to express the variability of the

data?

Sure, but variances don't come out in the most convenient form. Because we square the differences, we end up with a value

in 'squared' requests. What is a request-squared? Or a dollar-squared, if we were solving a problem involving money?

We would like a way to express variability that is in the same units as the original data — front-desk requests, for example.

The standard deviation — the first formula we saw — accomplishes this.

The standard deviation is simply the square root of the variance. It returns our measure to our original units. The standard

deviation for the hotel's Saturday desk traffic is 25.2 requests.
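
Putting the two steps together in symbols, with μ denoting the mean number of requests and n the number of Saturdays observed:

variance = Σ(xᵢ − μ)² / (n − 1) = 637.2

σ = √637.2 ≈ 25.2 requests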

Interpreting

What does a standard deviation of 25.2 requests tell us? Suppose the standard deviation had been 50 requests.

With a larger standard deviation, the data would be spread farther from the mean. A higher standard deviation would

translate into more difficult staffing: when request traffic is unusually high, disgruntled customers wait in long lines; when

traffic is very low, desk staff are idle.

For a data set, a smaller standard deviation indicates that more data points are near the mean, and that the mean is more

representative of the data. The lower the standard deviation, the more stable the traffic, thereby reducing both customer

dissatisfaction and staff idle time.

Fortunately, we almost never have to calculate a standard deviation by hand. Spreadsheet tools like Excel make it easy for

us to calculate variance and standard deviation.

Summary

The standard deviation measures how much data vary about their mean value.

Finding in Excel

Excel's STDEV function calculates the standard deviation.

To find the standard deviation, we can enter data values into the STDEV formula, one by one, separated by commas.

In most cases, however, it's much easier to select a range of cell references to calculate a standard deviation.

To calculate variance, we can use Excel's VAR function in the same way.
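
For instance, if the Saturday request counts were stored in cells B2:B53 (a hypothetical layout, one row per Saturday), we could compute:

=VAR(B2:B53) for the variance
=STDEV(B2:B53) for the standard deviation

Because the standard deviation is the square root of the variance, =STDEV(B2:B53) returns the same value as =SQRT(VAR(B2:B53)).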


The standard deviation measures how much a data set varies from its mean. But the standard deviation only tells you so

much. How can you compare the variability in different data sets?

A standard deviation describes how much the data in a single data set vary. How can we compare the variability of two data

sets? Do we just compare their standard deviations? If one standard deviation is larger, can we say that data set is "more

variable"?

Standard deviations must be considered within the data's context. The standard deviations for two stock indices below —

The Street.Com (TSC) Internet Index and the Pacific Exchange Technology (PET) Index — were roughly equivalent over a

period. But were the two indices equally variable?

If the average price of an index is $200, a $20 standard deviation is relatively high (10% of the average); if the average is

$700, $20 is relatively low (not quite 3% of the average). To gauge volatility, we'd certainly want to know that PET's average index price was over three and a half times higher than TSC's average index price.

To get a sense of the relative magnitude of the variation in a data set, we want to compare the standard deviation of the data

to the data's mean.

We can translate this concept of relative volatility into a standardized measure called the coefficient of variation, which is

simply the ratio of the standard deviation to the mean. It can be interpreted as the standard deviation expressed as a percent

of the mean.
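
Applied to the index example above: a $20 standard deviation on a $200 average price gives a coefficient of variation of 20/200 = 0.10, or 10%, while the same $20 on a $700 average gives 20/700 ≈ 0.03, or about 3%.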

To get a feeling for the coefficient of variation, let's compare a few data sets. Which set has the highest relative variation?

Click the answer you select.

Because the coefficient of variation has no units, we can use it to compare different kinds of data sets and find out which data

set is most variable in this relative sense.

The coefficient of variation describes the standard deviation as a fraction of the mean, giving you a standard measure of

variability.

Summary

The coefficient of variation expresses the standard deviation as a fraction of the mean. We can use it to compare variation

in different data sets of different scales or units.

After a good night's sleep, you meet Alice for breakfast.

"It's time to get started on Leo's assignments. Could you get those price quotes from diving schools and prepare a presentation

for Leo? We'll want to present our findings as neatly and concisely as possible. Use graphs and summary statistics wherever

appropriate. Meanwhile, I'll start working on Leo's hotel occupancy problem."

In addition to the school Leo is currently using, you find 20 other scuba services in the phone book. You call those 20 and get

price quotes on how much they would charge the Kahana per guest for a Scuba Certification Course.

You create a histogram of the prices. Use the bin ranges provided in the data spreadsheet, or experiment with your own bins.

If you do not have the Excel Analysis ToolPak installed, click on the Briefcase link labeled "Histogram" to see the finished histogram.

This distribution is skewed to the right, since a tail of higher prices extends to the right side of the histogram. The shape of

the distribution suggests that:


You calculate the key summary statistics. The correct values are (Mean, Median, Standard Deviation):

Your report looks good. This graphic is very helpful. At the moment, I'm paying $330 per guest, which is about average for

the island. Clearly, I could get a cheaper deal — only 6 schools would charge a higher rate. On the other hand, maybe these

more expensive schools offer a better diving experience? I wonder how satisfied my guests have been with the course offered

by my current contractor...

After a company completes its initial public offering, how is the ownership of common stock distributed among individuals in the firm, often termed "named insiders"?

Let's examine a company, VA Linux, that chose to sell its stock in an Initial Public Offering (IPO) during the IPO craze in

the late 1990s.

According to its prospectus, after the IPO, VA Linux would have the following distribution of outstanding shares of

common stock owned by insiders:

From the VA Linux common stock data, what could we learn by creating a histogram? (Choose the best answer)

Here is a histogram graphing annual turnover rates at a consulting firm.

The J. B. Honidew Corporation offers a prestigious summer internship to first-year students at a local business school. The

human resources department of Honidew wants to publish a brochure to advertise the position.

To attract a suitable pool of applicants, the brochure should give an indication of Honidew's high academic expectations.

The human resources manager calculates the mean GPA of the previous 8 interns, to include in the brochure.

In 1997, J. B. Honidew's grandson's girlfriend was awarded the internship, even though her GPA was only 3.35. In the

presence of outliers or a strongly skewed data set, the median is often a better measure of the 'center'. What's the median

GPA in this data set?

Safety equipment typically needs to fall within very precise specifications. Such specifications apply, for example, to scuba

equipment using a device called a "rebreather" to recycle oxygen from exhaled air.

Recycled air must be enriched with the right amount of oxygen from the tank before delivery to the diver. With too little

oxygen, the diver can become disoriented; too much, and the diver can experience oxygen poisoning. Minimizing the

deviation of oxygen concentration levels from the specified level is clearly a matter of life and death!

A scuba equipment-testing lab compared the oxygen concentrations of two different brands of rebreathers, A and B.

Examine the data. Without doing any calculations, for which of the two rebreathers does the oxygen concentration appear

to have a lower standard deviation?

Notice that data set A's extreme values are closer to the center, and more of its data points cluster near the middle of the range. Even without calculations, we have a good knack for seeing which set is more variable.


We can back up our observations: using the standard deviation formula or the STDEV function in Excel, we can calculate that the standard deviation of A is 0.58%, whereas that of B is 1.05%.

After decades of government control, states across the US are deregulating energy markets. In a deregulated market,

electricity prices tend to spike in times of high demand.

This volatility is a concern. A primary benefit to consumers in a regulated market is that prices are fairly stable. To provide

a baseline measure for the volatility of prices prior to deregulation, we want to compute the standard deviation of prices

during the 1990s, when electricity prices were largely regulated.

From 1990 to 2000, the average national price in July of 500 kWh of electricity ranged between $45.02 and $50.55. What is

the standard deviation of these eleven prices?

Excel makes the job much easier, because all that's required is entering the data into cells and inputting the range of cells

into the =STDEV() function. The result is $2.02.

On the other hand, to calculate the standard deviation by hand, use the formula:

σ = √( Σ(xᵢ − μ)² / (n − 1) )

First, calculate the mean, $48.40. Then, find the difference between each data point and the mean. Calculate the sum of

these squared differences, 40.79. Divide by the number of points minus one (11 - 1 =10 in this case) to obtain 4.08. Taking

the square root of 4.08 gives us the standard deviation, $2.02.
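
In compact form, the hand calculation is:

σ = √( 40.79 / (11 − 1) ) = √4.08 ≈ $2.02

which matches the result returned by the STDEV function.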

Suppose you are a purchasing agent for a wholesale retailer, Big-Mart. Big-Mart offers several generic versions of

household items, like deodorant, to consumers at a considerable discount.

Every 18 months, Big-Mart requests bids from personal care companies to produce these generic products.

After simply choosing the lowest individual bidder for years, Big-Mart has decided to introduce a vendor "score card" that

measures multiple aspects of each vendor's performance. One of the criteria on the score card is the level of year-to-year

fluctuation in the vendor's pricing.

Compare the variability of prices from each supplier. Which company's prices vary the least from year to year in relation to

their average price, as measured by the coefficient of variation?

Summary

Pleased with your work, Alice decides to teach you more data description techniques, so you can take over a greater share of

the project.

So far, you have learned how to work with a single variable, but many managerial problems involve several factors that need to be considered simultaneously.

Two Variables

We use histograms to help us answer questions about one variable. How do we start to investigate patterns and trends with

two variables?

Let's look at two data sets: heights and weights of athletes. What can we say about the two data sets? Is there a relationship

between the two?

Our intuition tells us that height and weight should be related. How can we use the data to inform that intuition? How can we

let the data tell their story about the strength and nature of that relationship?

Because we know that each height and weight belong to a specific athlete, we first pair the two variables, with one height-weight pair for each athlete.


Plotting these data pairs on axes of height and weight — one data point for each athlete in our data set — we can see a

relationship between height and weight. This type of graph is called a "scatter diagram."

Scatter diagrams provide a visual summary of the relationship between two variables. They are extremely helpful in

recognizing patterns in a relationship. The more data points we have, the more apparent the relationship becomes.

In our scatter diagram, there's a clear general trend: taller athletes tend to be heavier.

We need to be careful not to draw conclusions about causality when we see these types of relationships.

Growing taller might make us a bit heavier, but height certainly doesn't tell the whole story about our weights.

Assuming causality in the other direction would be just plain wrong. Although we may wish otherwise, growing heavier

certainly doesn't make us taller!

The direction and extent of causality might be easy to understand with the height and weight example, but in business

situations, these issues can be quite subtle.

Managers who use data to make decisions without firm understanding of the underlying situation often make blunders that

in hindsight can appear as ludicrous as assuming that gaining weight can make us taller.

Why don't we try graphing another pair of data sets to see if we can identify a relationship? On a scatter diagram, we plot for

each day the number of massages purchased at a spa resort versus the total number of guests visiting the resort.

We can see a relationship between the number of guests and the number of massages. The more guests that stay at the resort, the more massages purchased — up to a point, after which the number of massages levels off.

Why does the number of massages reach a plateau? We should investigate further. Perhaps there are limited numbers of

massage rooms at the spa. Scatter plots can give us insights that prompt us to ask good questions, those that deepen our

understanding of the underlying context from which the data are drawn.

Sometimes, we are not as interested in the relationship between two variables as we are in the behavior of a single variable

over time. In such cases, we can consider time as our second variable.

Suppose we are planning the purchase of a large amount of high-speed computer memory from an electronics distributor.

Experience tells us these components have high price volatility. Should we make the purchase now? Or wait?

Assuming we have price data collected over time, we can plot a scatter diagram for memory price, in the same way we

plotted height and weight. Because time is one of the variables, we call this graph a time series.

Time series are extremely useful because they put data points in temporal order and show how data change over time.

Have prices been steadily declining or rising? Or have prices been erratic over time? Are there seasonal patterns, with

prices in some months consistently higher than in others?

Time series will help us recognize seasonal patterns and yearly trends. But we must be careful: we shouldn't rely only on

visual analysis when looking for relationships and patterns.

False Relationships

Our intuition tells us that pairs of variables with a strong relationship on a scatter plot must be related to each other. But human intuition isn't foolproof: we often infer relationships where none exist. We must be careful to avoid some common pitfalls.

Let's look at an example. For US presidents of the last 150 years, there seems to be a connection between being elected in a

year that is a multiple of 20 (1900, 1920, 1940, etc.) and dying in office. Abraham Lincoln (elected in 1860) was the first

victim of this unfortunate relationship.

James Garfield (elected 1880), William McKinley (1900), Warren Harding (1920), Franklin Roosevelt (1940), and John F. Kennedy (1960) all died in office.

Ronald Reagan (elected 1980) only narrowly survived an assassination attempt. What do the data suggest about the

president elected in 2020?

Probably nothing. Unless we have a reasonable theory about the connection between the two variables, the relationship is

no more than an interesting coincidence.


Hidden Variables

Even when two data sets seem to be directly related, we may need to investigate further to understand the reason for the

relationship.

We may find that the reason is not due to any fundamental connection between the two variables themselves, but that they

are instead mutually related to another underlying factor.

Suppose we're examining sales of ice-hockey pucks and baseballs at a sporting goods store.

The sales of the two products form a relationship on a scatter plot: when puck sales slump, baseball sales jump. But are the

two data sets actually related? If so, why?

A third, hidden factor probably drives both data sets: the season. In winter, people play ice hockey. In spring and summer,

people play baseball.

If we had simply plotted puck and baseball sales without thinking further, we might not have considered the time of year at

all. We could have neglected a critical variable driving the sales of both products.

In many business contexts, hidden variables can complicate the investigation of a relationship between almost any two

variables.

A final point: Keep in mind that scatter plots don't prove anything about causality. They never prove that one variable

causes the other, but simply illustrate how the data behave.

Summary

Plotting two variables helps us see relationships between two data sets. But even when relationships exist, we still need to

be skeptical: is the relationship plausible? An apparent relationship between two variables may simply be coincidental, or

may stem from a relationship each variable has with a third, often hidden variable.

To create a scatter diagram in Excel with two data sets, we need to first prepare the data, and then use Excel's built-in chart tools to plot the data.

To prepare our data, we need to be sure that each data point in the first set is aligned with its corresponding value in the

other set. The sets don't need to be contiguous, but it's easier if the data are aligned side by side in two columns.

If the data sets are next to each other, simply select both sets.

Next, from the Insert tab in the toolbar, select Scatter in the Charts bin from the Ribbon, and choose the first type:

Scatter with Only Markers.

Excel will insert a plain, unformatted scatter plot into the worksheet, with the first column of data represented on the X-axis and the second column of data on the Y-axis.

We can include a chart title and label the axes by selecting Quick Layout from the Ribbon and choosing Layout 1.

Then we can add the chart title and label the axes by selecting and editing the text.

Finally, our scatter diagram is complete. You can explore more of Excel's new Chart Tools to edit and design elements of

your chart.
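
For example, with a hypothetical layout in which the athletes' heights occupy cells A2:A21 and their corresponding weights occupy cells B2:B21, selecting the range A2:B21 and inserting a Scatter with Only Markers chart plots one point per athlete, with height on the X-axis and weight on the Y-axis.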

Correlation

By plotting two variables on a scatter plot, we can examine their relationship. But can we measure the strength of that

relationship? Can we describe the relationship in a standardized way?

Humans have an uncanny ability to discern patterns in visual displays of data. We "know" when the relationship between two

variables looks strong ...

... positive (when one variable increases, the other tends to increase) ...


... or negative (when one variable increases, the other tends to decrease).

Suppose we are trying to discern if there is a linear relationship between two variables. Intuitively, we notice when data

points are close to an imaginary line running through a scatter plot.

Logically, the closer the data points are to that line, the more confidently we can say there is a linear relationship between the

two variables.

However, it is useful to have a simple measure to quantify and communicate to others what we so readily perceive visually.

The correlation coefficient is such a measure: it quantifies the extent to which there is a linear relationship between two

variables.

To describe the strength of a linear relationship, the correlation coefficient takes on values between -1 and +1. Here's a strong

positive correlation (about 0.85) ...

If every point falls exactly on a line with a negative slope, the correlation coefficient is exactly -1.

At the extremes of the correlation coefficient, we see relationships that are perfectly linear, but what happens in the middle?

Even when the correlation coefficient is 0, a relationship might exist — just not a linear relationship. As we've seen, scatter

plots can reveal patterns and help us better understand the business context the data describe.

To reinforce our understanding of how our intuition about the strength of a linear relationship between variables translates

into a correlation coefficient, let's revisit the examples we analyzed visually earlier.

Influence of Outliers

In some cases, the correlation coefficient may not tell the whole story. Managers want to understand the attendance

patterns of their employees. For example, do workers' absence rates vary by time of year?

Suppose a manager suspects that his employees skip work to enjoy the good life more often as the temperature rises. After

pairing absences with daily temperature data, he finds the correlation coefficient to be 0.466.

While not a strong linear relationship, a coefficient of 0.466 does indicate a positive relationship — suggesting that the

weather might indeed be the culprit.

But look at the data — besides a few outliers, there isn't a clear relationship. Seeing the scatter plot, the manager might

realize that the three outliers correspond to a late-summer, three-day transportation strike that kept some workers

homebound the previous year.

Without looking at the data, the correlation coefficient can lead us down false paths. If we exclude the outliers, the

relationship disappears, and the correlation essentially drops to zero, quieting any suspicion of weather. Why do the

outliers influence our measure of linearity so much?

As a summary statistic for the data, the correlation coefficient is calculated numerically, incorporating the value of every

data point. Just as it does with the mean, this inclusiveness can get us into trouble...

Because measures like correlation give more weight to points distant from the center of the data, outliers can strongly

influence the correlation coefficient of the entire set. In these situations, our intuition and the measure we use to quantify

our intuition can be quite different. We should always attempt to reconcile those differences by returning to the data.

Summary

The correlation coefficient characterizes the strength and direction of a linear relationship between two data sets. The value

of the correlation coefficient ranges between -1 and +1.
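
Although the tutorial does not spell out the calculation, the standard formula behind this measure (the Pearson correlation coefficient) is:

r = Σ(xᵢ − μₓ)(yᵢ − μᵧ) / ( (n − 1) · σₓ · σᵧ )

where μₓ and μᵧ are the means and σₓ and σᵧ the standard deviations of the two data sets. Each point contributes the product of its distances from the two means, which is why points far from the center, such as outliers, influence the coefficient so strongly.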

Finding in Excel

Excel's CORREL function calculates the correlation coefficient for two variables. Let's return to our data on athletes' height

and weight.

Enter the data set into the spreadsheet as two paired columns. We must make sure that each data point in the first set is

aligned with its corresponding value in the other set.

To compute the correlation, simply enter the two variables' ranges, separated by a comma, into the CORREL function as

shown below.


The order in which the two data sets are selected does not matter, as long as the data "pairs" are maintained. With height

and weight, both values certainly need to refer to the same person!
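
If you are working outside of Excel, NumPy's corrcoef plays the same role as CORREL. A minimal sketch, using made-up height and weight pairs rather than the actual athlete file:

    import numpy as np

    # Hypothetical paired observations standing in for the athlete file.
    height = np.array([70, 72, 68, 75, 71, 69, 74, 73])          # inches
    weight = np.array([170, 185, 160, 210, 175, 165, 200, 190])  # pounds

    # np.corrcoef is the counterpart of CORREL; swapping the arguments
    # changes nothing, as long as each pair stays aligned.
    print(round(np.corrcoef(height, weight)[0, 1], 2))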

Alice is eager to move forward: "With your new understanding of scatter diagrams and correlation, you'll be able to help me

with Leo's hotel occupancy problem."

In the hotel industry, one of the most important management performance measures is room occupancy rate, the percentage

of available rooms occupied by guests.

Alice suggests that the monthly occupancy rate might be related to the number of visitors arriving on the island each month.

On a geographically isolated location like Hawaii, visitors almost all arrive by airplane or cruise ship, so state agencies can

gather very precise data on arrivals.

Alice asks you to investigate the relationship between room occupancy rates and the influx of visitors, as measured by the

average number of visitors arriving to Kauai per day in a given month. She wants a graphical overview of this relationship,

and a measure of its strength.

Leo's folders include data on the number of arrivals on Kauai, and on average hotel occupancy rates in Kauai, as tracked by

the Hawaii Department of Business, Economic Development, and Tourism. Click on the link to access these data.

Kauai Data


The best way to graphically represent the relationship between arrivals and occupancy is a _______.


You generate the scatter diagram using the data file and Excel's Chart Wizard. The relationship can be characterized as


You calculate the correlation coefficient. Enter the correlation coefficient in decimal notation with 2 digits to the right of the

decimal (e.g., enter "5" as "5.00"). Round if necessary.


To find the correlation coefficient, open the Kauai Data file. In any empty cell, type =CORREL(B2:B37,C2:C37). When you
hit enter, the correct answer, 0.71, will appear.


Together with Alice, you compile your findings and present them to Leo.


I see. The relationship between the number of people arriving on Kauai and the island's hotel occupancy rate follows a

general trend, but not a precise pattern. Look at this: in two months with nearly the same average number of daily arrivals,

the occupancy rates were very different — 68% in one month and 82% in the other.

But why should they be so different? When people arrive on the island, they have to sleep somewhere. Do more campers

come to Kauai in one month, and more hotel patrons in the other?

Well, that might be one explanation. There could be differences in the type of tourists arriving. The vacation preferences of

the arrivals would be what we call a hidden variable.

Another hidden variable might be the average length of stay. If the length of stay varies month to month, then so will hotel

occupancy. When 50 arrivals check into a hotel, the occupancy rate will be higher if they spend 10 days each at the hotel than

if they spend only 3 days.

I'm following you, but I'm beginning to see that the occupancy issue is more complex than I expected. Let's get back to it at a

later time. The scuba school contract is more pressing at the moment.

As online retailing expands, many companies are interested in knowing how effective search engines are in helping

consumers find goods online.


Computer scientists study the effectiveness of such search engines by comparing how many results a search engine recalls
and the precision with which it recalls them. "Precision" measures how often a result actually hits its target, for
example a page containing both the phrases "winter parka" and "Eddie Bauer."

What could you say about the relationship between the precision and the number of results recalled?


Is an education a good investment in your future? Some very successful business executives are college dropouts, but is

there a relationship in the general population between income and education level?

Consider the following scatter plot, which lists the income and years of formal education for 18 people. What is the

correlation between the two variables?


Though we should always calculate the correlation coefficient if we want to have a precise measure, it's good to have a

rough feel for the correlation between two variables we see plotted on a scatter diagram. For the income-education data,

the coefficient is nearest to _______.

Leo asks you to help him evaluate the Kahana's contract with the scuba school.

Scuba diving lessons are an ideal way for our guests to enjoy their vacation or take a break from their business activities. We

have an excellent coral reef, and scuba diving is becoming very popular among vacationers and business travelers.

We started our year-round diving program last year, contracting a local diving school to do a scuba certification course. The

one-year trial contract is now up for renewal.

Maintaining the scuba offerings on-site isn't cheap. We have to staff the scuba desk seven days a week, and we subsidize the

costs associated with each course. So I want to get a good handle on how satisfied the guests are with the lessons before I decide

whether or not to renew the contract.

The hotel has a database with information about which guests took scuba lessons and when. Feel free to take a look at it, but I

can't spend a fortune figuring this out. And I need to know as soon as possible, since our contract expires at the end of the

month.

Alice convinces you to do some field research and join her for a scuba diving lesson. You return late that afternoon exhausted

but exhilarated. Alice is especially enthusiastic.

"Well, I certainly give the lessons two thumbs up. And we haven't even been out to sea yet!

"But our opinions alone can't decide the matter. We shouldn't infer from our experience that Leo's clientele as a whole enjoyed

the scuba certification course. After all, we may have caught the instructor on his best day this year."

Alice suggests creating a survey to find out how satisfied guests are with the scuba diving school.

Naturally, you can't ask the opinion of every guest who took scuba lessons over the past year. You have to survey a few guests,

and from their opinions draw conclusions about hotel guests in general. The guests you choose to survey must be representative

of all of the guests who have taken the scuba course at the resort. But how can you be sure you get a good sample?

As managers, we often need to know something about a large group of people or products. For example, how many defective

parts does a large plant produce each year? What are the average annual earnings of a Wall Street investment banker? How

many people in our industry plan to attend the annual conference?

When it is too costly to gather the information we want to know about every person or every thing in an entire group, we

often ask the question of a subset, or sample of the group. We then try to use that information to draw conclusions about the

whole group.


To take a sample, we first select elements from the entire group, or "population," at random. We then analyze that sample

and try to infer something about the total population we're interested in. For example, we could select a sample of people in

our industry, ask them if they plan to attend the annual conference, and then infer from their answers how many people in

the entire industry plan to attend.

For example, if 10% of the people in our sample say they will attend, we might feel quite confident saying that between 7%

and 13% of our entire population will attend.

This is the general structure of all the problems we'll address in this unit — we'll work out the details as we go forward. We

want to know something about a population large enough to make examining every population member impractical.

...and then draw an inference about the total population we're interested in.

The first trick to sampling is to make sure we select a sample that broadly represents the entire group we're interested in.

For example, we couldn't just ask the conference organizers if they wanted to attend. They would not be representative of

the whole group — they would be biased in favor of attending the conference!

To get a good sample, we must make sure we select the sample "at random" from the full population. This means that every

person or thing in the population is equally likely to be selected. If there are 15,000 people in the industry, and we are

choosing a sample of 1,000, then every person needs to have the same chance — 1 out of 15 — of being selected.

Selecting a random sample sounds easy, but actually doing it can be quite challenging. In this section, we'll see examples

of some major mistakes people have made while trying to select a random sample, and provide some advice about how to

avoid the most common types of sampling errors.

In some cases, selecting a random sample can be fairly easy. If we have a complete list of each member of the group in a

database, we can just assign a unique number to each member of the group. We then let a computer draw random numbers

from the list. This would ensure that each element of the population has an equal likelihood of being selected.
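
As a sketch of this idea in Python (the industry directory below is hypothetical; any table with one row per population member would do), random.sample draws without replacement, so every member has the same chance of being chosen:

    import random

    # Hypothetical industry directory: one entry per person, 15,000 in all.
    population = [f"person_{i:05d}" for i in range(1, 15_001)]

    random.seed(7)                              # fixed seed for reproducibility only
    sample = random.sample(population, 1_000)   # each person has a 1-in-15 chance

    print(len(sample), len(set(sample)))        # 1,000 distinct members, no repeats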

If the population about which we need to obtain information is not listed in an easy-to-access database, the task of

selecting a sample at random becomes more difficult. In these cases, we have to be extremely careful not to introduce a bias

in the way we select the sample.

For example, if we want to know something about the opinions of an entire company, we cannot just pick employees from

one department. We have to make sure that each employee has an equal chance of being included in the sample. A

department as a whole might be biased in favor of one opinion.

Sample Size

Once we have decided how to select a sample, we have to ask how large our sample needs to be. How many members of

the group do we need to study to get a good estimate about what we want to know about the entire population?

The answer is: It depends on how "accurate" we want our estimate to be. We might expect that the larger the population,

the larger the sample size needed to achieve a given level of accuracy, but this is not true.

A sample size of 1,000 randomly-selected individuals can often give a satisfactory estimation about the underlying

population, as long as the sample is representative of the whole population. This is true regardless of whether the

population consists of thousands of employees or millions of factory parts.

Sometimes, a sample size of 100 or even 50 might be enough when we are not that concerned about the accuracy of our

estimate. Other times, we might need to sample thousands to obtain the accuracy we require.

Later in this unit, we will find out how to calculate a good sample size. For now, it's important to understand that the

sample size depends on the level of accuracy we require, not on the size of the population.

Once we select our sample, we need to make sure we obtain accurate information about each member of the sample. For

example, if we want to learn about the number of defects a plant produces, we must carefully measure each item in the

sample.

When we want to learn something about a group of people and don't have any existing data, we often use a survey to learn

about an issue of interest. Conducting a survey raises problems that can be surprisingly tricky to resolve.


First, how do we phrase our questions? Is there a bias in any questions that might lead participants to answer them in a

certain way? Are any questions worded ambiguously? If some of the people in the sample interpret a question one way, and

others interpret it differently, our results will be meaningless!

Second, how do we best conduct the survey? Should we send the survey in the mail, or conduct it over the phone? Should

we interview survey participants in person, or distribute handouts at a meeting?

There are advantages and disadvantages to all methods. A survey sent through the mail may be relatively inexpensive, but

might have a very low response rate. This is a major problem if those who respond have a different opinion than those who

don't respond. After all, the sample is meant to learn about the entire population, not just those with strong opinions!

Creating a telephone survey creates other issues: When do we call people? Who is home during regular business hours?

Most likely not working professionals. On the other hand, if we call household numbers in the evening, the "happy hour

crowd" might not be available.

When we decide to conduct a survey in person, we have to consider whether the presence of the person asking the

questions might influence the survey results. Are the survey participants likely to conceal certain information out of

embarrassment? Are they likely to exaggerate?

Clearly, every survey will have different issues that we need to confront before going into the field to collect the data.

Response Rates

With any type of survey, we must pay close attention to the response rate. We have to be sure that those who respond to the

survey answer questions in much the same way as those who don't respond would answer them. Otherwise, we will have a

biased view of what the whole population thinks.

Surveys with low response rates are particularly susceptible to bias. If we get a low response rate, we must try to follow up

with the people who did not respond the first time. We either need to increase the response rate by getting answers from

those who originally did not respond, or we must demonstrate that the non-respondents' opinions do not differ from those

of the respondents on the issue of interest.

Tracking down everyone in a sample and getting their response can be costly and time consuming. When our resources are

limited, it is often better to take a small sample and relentlessly pursue a high response rate than to take a larger sample

and settle for a low response rate.

Summary

Often it makes sense to infer facts about a large population from a smaller sample. To make sound inferences: select the sample at random so that it represents the entire population, choose a sample size that matches the accuracy you require, measure each member of the sample accurately, and pursue a high response rate.

To understand the importance of representative samples, let's go back in history and look at some mistakes made in the

Literary Digest poll of 1936.

The Literary Digest, a popular magazine in the 1930s, had correctly predicted the outcome of U.S. presidential elections

from 1916 to 1932. When the results of the 1936 poll were announced, the public paid attention. Who would become the

next president?

Newscaster: "Once again, the Literary Digest sent out a survey to the American public, asking, "Whom will you vote for in

this year's presidential election?" This may well be the largest poll in American history."

Newscaster: "The Digest sent the survey to over 10 million Americans and over two million responded!"

Newscaster: "And the survey results predict: Alf Landon will beat Franklin D. Roosevelt by a large margin and become

President of the United States."

As it turned out, Alf Landon did not become President of the United States. Instead, Franklin D. Roosevelt was re-elected

to a second term in office in the largest landslide victory recorded to that date. This was a devastating blow to the Digest's

reputation. What went wrong? How could such a large survey be so far off the mark?

The Literary Digest made two mistakes that led it to predict the wrong election outcome. First, it mailed the survey to

people on three different lists: the magazine's subscribers, car owners, and people listed in telephone directories. What was

wrong with choosing a sample from these lists?

The sample was not representative of the American public. Most lower-income people did not subscribe to the Digest and

did not own phones or cars back in 1936. This led the poll to be biased towards higher-income households and greatly

distorted the poll's results. Lower-income households were more likely to vote for the Democrat, Roosevelt, but they were

not included in the poll.


Second, the magazine relied on people to voluntarily send their responses back to the magazine. Out of the ten million

voters who were sent a poll, over two million responded. Two million is a huge number of people. What was wrong with

this survey?

The mistake was simple: Republicans, who wanted political change, felt more strongly about the election than Democrats.

Democrats, who were generally happy with Roosevelt's policies, were less interested in returning the survey. Among those

who received the survey, a disproportionate number of Republicans responded, and the results became even more biased.

The Digest had put an unprecedented effort into the poll and had staked its reputation on predicting the outcome of the

election. Its reputation wounded, the Digest went out of business soon thereafter.

During the same election year, a little-known psychologist named George Gallup correctly predicted what the Digest

missed: Roosevelt's victory. What did Gallup do that the Literary Digest did not? Did he create an even bigger sample?

Surprisingly, George Gallup used a much smaller sample. He knew that large samples were no guarantee of accurate results

if they weren't randomly selected from the population.

Gallup's team interviewed only 3,000 people, but made sure that the people they selected were truly representative of the

US population. He also instructed his team to be persistent in asking the opinion of each person in the sample, which

generated a high response rate.

Gallup's correct prediction of the 1936 election winner boosted his reputation and Gallup's method of polling soon became

a standard for public opinion polls.

Today's polls usually consist of a sample of around a thousand randomly selected people who are truly representative of the

underlying population. For example, look at a poll reported in a leading newspaper: the sample size will likely be around a

thousand.

Another common survey mistake is phrasing the questions in a way that leads to a biased response. Let's take a look at a

recent example of a biased question.

In 1992, Ross Perot, an independent contender for the US Presidential election, conducted a mail-in survey to show that

the public supported his desire to abolish special interest groups. This is the question he asked:

"Should laws be passed to eliminate all possibilities of special interests giving huge sums of money to candidates?"

In Perot's mail-in survey, 99 percent of respondents said "yes" to that question. It seemed as if everyone in America agreed

with Perot's stance.


Soon after Perot's survey, Yankelovich Partners, an independent market research firm, conducted two interesting follow-up

surveys. In the first survey, it used the same question that Perot asked and found that 80 percent of the population favored

passing the law. YP attributed the difference to the fact that it was able to create a more representative sample than Perot.


Interestingly, Yankelovich then conducted a similar survey, but rephrased the question in the following way:

"Should laws be passed to prohibit interest groups from contributing to campaigns, or do groups have a right to contribute to the candidate they support?"

The response to this question was strikingly different. Only 40 percent of the sampled population agreed to prohibit

contributions. As it turned out, the results of the survey all came down to the way the question was phrased.


For any survey we conduct, it's critical to phrase the question in the most neutral way possible to avoid bias in the sample

results.


The real lesson of these two examples is this: How data are collected is at least as important as how data are analyzed. A

sample that is unrepresentative, biased, or not drawn at random can give highly misleading results.

Knowing that sample data need to be representative and unbiased, you conduct a survey of the hotel guests.

How can you best determine if hotel guests are enjoying the scuba course? By searching the hotel database, you determine

that 2,804 hotel guests took scuba trips in the past year. The scuba certification course was offered year-round. The database

includes each guest's name, address, phone number, age, date of arrival, length of stay, and room number.


Your first step is deciding what type of survey to conduct that will be inexpensive, quick, and will provide a good sample of all

the guests who took scuba lessons.

Should you mail a survey to the whole list of guests who took scuba lessons, expecting that a small percentage will respond, or

conduct a telephone survey, which would likely provide a higher response rate, but cost more per guest contacted?

To ensure a good response rate — and because Leo wants an answer quickly — you choose to contact customers by phone.

Alice warns that to keep costs low, you can only contact 50 hotel guests, and reminds you to create a random, representative

sample.

You open up the list of names in the hotel database. The names were entered as guests arrived. To make things simple, you

randomly select a date and then record the first 50 guests arriving after that date who took the course. You ask the hotel

operator to call them for you, and tell him to be persistent. Eventually he is able to contact 45 of the guests on the list. He asks

the guests to rate their scuba experience on a 1 to 6 scale and reports the results back to you. Click the link below to view your

sample.

Enter the average satisfaction level as a decimal number with one digit to the right of the decimal point (e.g., enter "5" as

"5.0"). Round if necessary.

Hotel Database

You compute the average satisfaction level and find that it is 2.5. You give Leo the news. He explodes.

Two point five! That's impossible! I know for sure that it must be higher than that! You'd better go over your data again.

Back in your room, you look over your list of data. What should you tell Leo?

We were hit with a hurricane at the beginning of April. Half the scuba classes were cancelled, and the ones that did meet had

to deal with choppy water and bad visibility. Even the weeks following the hurricane were bad. Usually guests see a manta ray

every week, and the guests in April could barely see the underwater coral. No wonder they weren't happy.

You assure Leo you will conduct the survey again with a more representative sample. This time, you make sure that the guests

are truly randomly selected. Later, you have new data in your hands from 45 randomly chosen guests that show the average

satisfaction rate to be 4.4 on a 1 to 6 scale. The standard deviation of the sample is 1.54.

Mr. Gavin Collins is the Chief Operating Officer of Bell Computers, a market leader in personal computers. This morning,

he opened the latest issue of Business 4.0, a business journal, and noticed an article on Bell Computers.

The article praised the high quality and low cost of the PCs made by Bell. However, it also included some negative

comments about Bell's customer service.

Currently, customer service is only available to customers of Bell Computers over the phone.

Collins wants to understand more fully what customers think of Bell's customer service. His marketing department designs

a survey that asks customers to rate Bell's customer service from 1 to 10.

"Wave" is a company that manufactures laundry detergent in several countries around the world. In India, the competition

among laundry detergents is fierce.

Wave's monthly sales have been constant for the past five years. Wave CEO Mr. Sharma instructed his marketing

team to come up with a strong advertising campaign stressing Wave's superiority over other competitors. Wave conducted

a survey in the month of June.

They asked the following questions: "Have you heard of Wave?" "Do you think Wave is a good product?" "Do you notice a

difference in the color of your clothes after using Wave?" Then, citing the results of their survey, Wave aired a major

television campaign claiming that 75% of the population thought that Wave was a good product.

You are a new associate at Madison Consulting. With your partner, Ms. Mehta, you have been asked to conduct a study for

Wave's main competitor, the Coral Reef Detergent Company, about whether Wave's claims hold water. Coral Reef wonders

how the Wave results are possible, considering that Coral Reef holds over 45% of the current market share.


Ms. Mehta has been going through the survey methodology, and she tells you, "This sample is clearly biased and
unrepresentative. Coral Reef can dispute Wave's claim!" What has Ms. Mehta noticed?

You have been asked to conduct a survey to determine the percentage of flights arriving at a small airport that were filled to

capacity that morning. You decide to stand outside the airport's single exit door and ask a sample of 60 passengers leaving

the airport how full their flight was.

Your first thought is to just ask the first 60 passengers departing the airport how full their flight was, but you quickly

realize that that could be a highly biased sample. Any 60 people leaving at the same time would likely have come from only

a couple of flights, and you want to get a good sense of what percent of all flights arriving that morning were filled to

capacity. Thus, you decide to randomly select 60 people from all the passengers departing the building that morning.

After conducting your survey, you tally the results: 10 people decline to answer, 30 people tell you that their flight was filled

to capacity, and 20 people tell you that their flight was not filled to capacity. What can you conclude from your survey

results so far?

To see this, imagine that 10 planes have arrived that morning — five of which were full (having 100 passengers each) and

five of which had only a single passenger on the plane. In this case, half of the planes were full. However, almost all of the

passengers (500 of the total 505) departing from the airport would report (correctly!) that they had been on a full plane.

Since people from a full plane are more likely to be selected, there is a systematic bias in your response.

It is important, in every survey, to try to make your sample as representative as possible. In this case, your sample was not

representative of the planes arriving at the airport.

A better approach might be to ask the people you select what their flight number was, and then ask them how full their

flight was. Make sure you have at least one passenger from every plane. Then count the responses of only one person from

each flight. By including only one person per flight in your sample, you ensure that your sample is an accurate prediction of

how many planes are filled to capacity.

Sampling is complicated, and it is important to think through all the factors that might influence your results. In this case,

the mistake is that you are trying to estimate a population of planes by sampling a population of passengers. This makes

the sample unrepresentative of the underlying population. By randomly sampling the passengers rather than the flights,

each flight is not equally likely to be selected, and the sample is biased.
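
A small simulation makes the bias vivid. This sketch recreates the hypothetical morning described above, five full planes and five nearly empty ones, and shows how a passenger-level sample overstates the share of full flights:

    import random

    # Hypothetical morning: five full planes (100 passengers each)
    # and five planes carrying a single passenger.
    flights = [100] * 5 + [1] * 5

    # One entry per passenger, tagged True if that passenger's flight was full.
    passengers = []
    for size in flights:
        passengers.extend([size == 100] * size)

    random.seed(1)
    survey = random.sample(passengers, 60)

    # Passenger-level estimate of the share of full flights: badly inflated.
    print(sum(survey) / len(survey))           # close to 500/505, not 0.5

    # Flight-level truth: exactly half the planes were full.
    print(sum(size == 100 for size in flights) / len(flights))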

You report the results of your survey, the sample mean, and its standard deviation to Leo.

A sample mean of 4.4 makes more sense to me, but I'm still a bit uneasy about your survey result. After all, you've only

collected 45 responses.

If you'd chosen different people, they likely would have given different responses. What if — just by chance — these 45 people

loved the scuba course, and no one else did?

You have a good point there, Leo. Our intuition is that the average satisfaction rate for all guests isn't too far from 4.4, but at

this point we're not sure exactly how far away it might be. Without more calculations, all we can say is that 4.4 is the best

estimate we have. That is why...

Wait a minute! This is very unsatisfying. Are you telling me that there's no way to gauge the accuracy of this survey result?

If the results are a little off, that's not a problem. But you have to tell me how far off they might be. What if you're off by two

whole points, and the true satisfaction of my hotel guests is 2.4, not 4.4? In that case, my decision would be completely

different.

I need to know how accurately this sample reflects the opinions of all the hotel guests who went scuba diving!

The sample mean is the best point estimate of the population mean, but it cannot tell you how accurately the sample reflects

the population.

Alice suggests giving Leo a range of values that is almost certain to contain the population mean. "We may not be able to pin

down mean satisfaction precisely. But confining it to a range of likely values will provide Leo with enough information to

make a sound business decision."

That sounds like a good idea, but you wonder how to actually do it.


The sample mean is the best estimate of our population mean. However, it is only a point estimate. It does not give us a sense

of how accurately the sample mean estimates the population mean.

Think about it. If we know only the sample mean, what can we really say about the population mean? In the case of our scuba

school, what can we say about the average satisfaction rate of all scuba-diving hotel guests? Could it be 4.3? 4.0? 4.7? 2.0?

To make decisions as a manager, we need to have more than just a good point estimate. We need to have a sense of how

close or far away the true population mean might be from our estimate.

We can indicate the most likely values of the true population mean by creating a range, or interval, around the sample mean.

If we construct it correctly, this range will very likely contain the true population mean.

For example, by constructing a range, we might be able to tell Leo that we are very confident that the true average customer

satisfaction for all scuba guests falls between 4.2 and 4.6.

Knowing that the true average is almost certainly between 4.2 and 4.6, Leo is better equipped to make a decision than if he

simply knew the estimated average of 4.4.

Creating a range around the sample mean is quite easy. First, we need to know three statistics of the sample: the mean x-bar,

the standard deviation s, and the sample size n.

We also need to know how "confident" we'd like to be that the range contains the true mean of the population. For any level

of "confidence", there is a value we'll call z to put into the formula. We'll learn later in this unit exactly what we mean by

"confidence," and how to compute z. For now, just keep in mind that for higher levels of confidence, we'll need to put in a

larger value of z.

Using these numbers, we can create a range around the sample mean according to the following formula: x-bar ± z × s/√n.
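
As a preview, here is a minimal sketch of that range calculation in Python. It plugs in the scuba survey's statistics from earlier (x-bar = 4.4, s = 1.54, n = 45) and uses z = 1.96, the value we will later associate with 95% confidence:

    from math import sqrt

    def mean_range(xbar, s, n, z):
        # Range around the sample mean: x-bar +/- z * s / sqrt(n).
        half_width = z * s / sqrt(n)
        return xbar - half_width, xbar + half_width

    # Scuba survey statistics; z = 1.96 corresponds to 95% confidence.
    print(mean_range(4.4, 1.54, 45, 1.96))    # roughly (3.95, 4.85)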

Before we actually use the formula, let's try to develop our intuition about the range we're creating. Where should the range

be centered? How wide must the range be to make us confident that it contains the true population mean? What factors

would lead us to need a wider or narrower range?

Let's see how the statistics of the sample influence the location and width of the range. Let's start with the sample mean.

The sample mean is our best estimate of the population mean. This suggests that the sample mean should always be the

center of the range. Move the slider bar to see how the sample mean affects the range.

Second, the width of the range depends on the standard deviation of the sample. When the sample standard deviation is

large, we have greater uncertainty about the accuracy of the sample mean as an estimate of the population mean. Thus, we

have to create a wider range to be confident that it includes the true population mean.

On the other hand, if the sample standard deviation is small, we feel more confident that our sample mean is an accurate

predictor of the true population mean. In this case, we can draw a narrower range.

The larger the standard deviation, the wider the range must be. Move the slider bar to see how the sample standard deviation

affects the range.

Third, the width of the range depends on the sample size. With a very small sample, it's quite possible that one or two atypical

points in the sample could throw the sample mean off considerably from the true population mean. So with a small sample,

we need to create a wide range to feel comfortable that the true mean is likely to be inside it.

The larger the sample, the more certain we can be that the sample mean represents the population mean. With a large

sample, even if our sample includes a few atypical points, there are likely to be many more typical points in the sample to

compensate for the outliers. Thus, with a large sample, we can feel comfortable with a small range.

Move the slider bar to see how the sample size influences the range.

Finally, the width of the range depends on our desired level of confidence. The level of confidence states how certain we want

to be that the range contains the mean of the population. The more confident we want to be that the range contains the true

population mean, the wider we have to make the range.

If our desired level of confidence is fairly low, we can draw a narrower range.

In the language of statistics, we indicate our level of confidence by saying, for example, that we are "95% confident" that the

range contains the true population mean. This means there is a 95% chance that the range contains the true population

mean.

Move the slider bar to see how the confidence level affects the range.


These variables determine the size of the range that we want to construct. We will learn exactly how to construct this range in

a later section.

For now, all we have to understand is that the population mean can best be estimated by a range of values and that the range

depends on three sample statistics as well as the level of confidence that we want to assign to the range.

Summary

The sample mean is our best initial estimate of the population mean. To indicate how accurate this estimate is, we construct a

range around the sample mean that likely contains the population mean. The width of the range is determined by the sample

size, sample standard deviation, and the level of confidence. The confidence level measures how certain we are that the range

we construct contains the true population mean.

Alice recommends taking a step back from sampling and learning about the normal distribution.

The normal distribution helps us create a range around a sample mean that is likely to contain the true population mean. You

can use the normal distribution to turn the intuitive notion of "confidence in your estimate" into a precisely defined concept.

Understanding the normal distribution will also give you deeper insight into how sampling works.

The normal distribution is a probability distribution that is centered at the mean. It is shaped like a bell, and is sometimes

called the "bell curve."

Like any probability distribution, the normal distribution is shown on two axes: the x-axis for the variable we're studying —

women's heights, for example — and the y-axis for the likelihood that different values of the variable will occur.

For example, few women are very short and few are very tall. Most are in the middle somewhere, with fairly average heights.

Since women of average height are so much more common, the distribution of women's heights is much higher in the center

near the average, which is about 63.5 inches.

As it turns out, for a probability distribution like the normal distribution, the percent of all values falling into a specific range is

equal to the area under the curve over that range.

For example, the percentage of all women who are between 61 and 66 inches tall is equal to the area under the curve over that

range.

The percentage of all women taller than 66 inches is equal to the area under the curve to the right of 66 inches.

Like any probability distribution, the total area under the curve is equal to 1, or 100%, because the height of every woman is

represented in the curve.

Over the years, statisticians have discovered that many populations have the properties of the normal distribution. For

example, IQ test scores follow a normal distribution. The weights of pennies produced by U.S. mints have been shown to follow

a normal distribution.

The normal distribution has two defining properties. First, its mean and median are equal; they are located exactly at the
center of the distribution. Hence, the probability that a normal distribution will have a value less than the mean is 50%,
and the probability that it will have a value greater than the mean is 50%.

Second, the normal distribution has a unique symmetrical shape around this mean. How wide or narrow the curve is depends

solely on the distribution's standard deviation.

In fact, the location and width of any normal curve are completely determined by two variables: the mean and the standard

deviation of the distribution.

Large standard deviations make the curve very flat. Small standard deviations produce tight, tall curves with most of the values

very close to the mean.

Regardless of how wide or narrow the curve, it always retains its bell-shaped form. Because of this unique shape, we can create

a few useful "rules of thumb" for the normal distribution.

For a normal distribution, about 68% (roughly two-thirds) of the probability is contained in the range reaching one standard

deviation away from the mean on either side.


It's easiest to see this with a standard normal curve, which has a mean of zero and a standard deviation of one.

If we go two standard deviations away from the mean for a standard normal curve we'll cover about 95% of the probability.

The amazing thing about normal distributions is that these rules of thumb hold for any normal distribution, no matter what its

mean or standard deviation.

For example, about two thirds of all women have heights within one standard deviation, 2.5 inches, of the average height, which

is 63.5 inches.

95% of women have heights within two standard deviations (or 5 inches) of the average height.

To see how these rules of thumb translate into specific women's heights, we can label the x-axis twice to show which values

correspond to being one standard deviation above or below the mean, which values correspond to being two standard

deviations above or below the mean, and so on.

Essentially, by labeling the x-axis twice we are translating the normal curve into a standard normal curve, which is easier to

work with.

For women's height, the mean is 63.5 and the standard deviation is 2.5. So, one standard deviation above the mean is 63.5 +

2.5, and one standard deviation below the mean is 63.5 - 2.5.

Thus, we can see that about 68% of all women have heights between 61 and 66 inches, since we know that about 68% of the

probability is between -1 and +1 on a standard normal curve.

Similarly, we can read the heights corresponding to two standard deviations above and below the mean to see that about 95% of

all women have heights between 58.5 and 68.5 inches.
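
These rules of thumb are easy to verify with software. A minimal sketch using SciPy's normal-distribution functions (a Python stand-in for the Excel functions the course uses):

    from scipy.stats import norm

    # Probability within one and within two standard deviations of the mean.
    print(norm.cdf(1) - norm.cdf(-1))    # about 0.68
    print(norm.cdf(2) - norm.cdf(-2))    # about 0.95

    # The same 68% rule applied directly to heights (mean 63.5, sd 2.5).
    print(norm.cdf(66, 63.5, 2.5) - norm.cdf(61, 63.5, 2.5))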

The z-statistic

The unique shape of the normal curve allows us to translate any normal distribution into a standard normal curve, as we did

with women's heights simply by re-labeling the x-axis. To do this more formally, we use something called the z-statistic.

For a normal distribution, we usually refer to the number of standard deviations we must move away from the mean to cover

a particular probability as "z", or the "z-value." For any value of z, there is a specific probability of being within z standard

deviations of the mean.

For example, for a z-value of 1, the probability of being within z standard deviations of the mean is about 68%, the probability

of being between -1 and +1 on a standard normal curve.

A good way to think about what the z-statistic can do is this analogy: if a giant tells you his house is four steps to the north,

and you want to know how many steps it will take you to get there, what else do you need to know?

You would need to know how much bigger his stride is than yours. Four steps could be a really long way.

The same is true of a standard deviation. To know how far you must go from the mean to cover a certain area under the curve,

you have to know the standard deviation of the distribution.

Using the z-statistic, we can then "standardize" the distribution, making it into a standard normal distribution with a mean of

0 and a standard deviation of 1. We are translating the real value in its original units — inches in our example — into a z-

value.

The z-statistic translates any value into its corresponding z-value simply by subtracting the mean and dividing by the

standard deviation.

Thus, for the women's height of 66 inches, the z-value, z = (66-63.5)/2.5, equals 1. Therefore, 66 is exactly one standard

deviation above the mean.

Essentially, the z-statistic allows us to measure the distance from the mean in terms of standard deviations instead of real

values. It gives everyone the same size feet in statistics.
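
The z-statistic itself is one line of arithmetic. A minimal sketch, mirroring the STANDARDIZE function that appears later in this unit:

    def standardize(x, mu, sigma):
        # Distance from the mean, measured in standard deviations.
        return (x - mu) / sigma

    print(standardize(66, 63.5, 2.5))    # 1.0: one standard deviation above the mean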

We can extend the rules of thumb we've developed beyond the two cases we've looked at. For example, we may want to know

the likelihood of being within 1.5 standard deviations from the mean, or within three standard deviations from the mean.

Select different values of z — that is, select different numbers of standard deviations from the mean — and see how the

probability changes. Be sure to try z values of 1 and 2 to verify that our rules of thumb are on target!

Sometimes we may want to go in the other direction, starting with the probability and figuring out how many standard

deviations are necessary on either side of the mean to capture that probability.

For example, suppose we want to know how many standard deviations we need to be from the mean to capture 95% of the

probability.


Our second rule of thumb tells us that when we move two standard deviations from the mean, we capture about 95% of the

probability. More precisely, to capture exactly 95% of the probability, we must be within 1.96 standard deviations of the

mean.

This means that for a normal distribution, there is a 95% probability of falling between -1.96 and 1.96 standard deviations

from the mean.

Select different probabilities and see how many standard deviations we have to move away from the mean to cover that

probability.

We can create a table that shows which values of z correspond to each probability or we can calculate z using a simple

function in Microsoft Excel. We'll explain how to use both of these approaches in the next few clips.

z-table

Remember, the probabilities and the rules of thumb we've described apply ONLY to a normal distribution. Don't think you

can use them for any distribution!

Sometimes, probabilities are shown in other forms. If we start at the very left side of the distribution, the area underneath the

curve is called the cumulative probability. For example, the probability of being less than the mean is 0.5, or 50%. This is just

one example of a cumulative probability.

A cumulative probability of 70% corresponds to a point that has 70% of the area under the curve to its left.

There are easy ways to find cumulative probabilities using spreadsheet packages such as Microsoft Excel. You'll have

opportunities to practice solving these types of problems shortly.

Cumulative probabilities can be used to find the probability of any range of values. For example, to find the percentage of all

women who have heights between 63.5 and 68 inches, we would simply subtract the percent whose heights are less than 63.5

inches from the percent whose heights are less than 68 inches.
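
In code, that subtraction is a single line. A sketch using SciPy's cumulative distribution function with the women's-height parameters:

    from scipy.stats import norm

    # P(63.5 < height < 68) = P(height < 68) - P(height < 63.5)
    print(norm.cdf(68, 63.5, 2.5) - norm.cdf(63.5, 63.5, 2.5))   # about 0.46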

Summary

The normal distribution has a unique symmetrical shape whose center and width are completely determined by its mean and

its standard deviation. For every normal distribution, the probability of being within a specified number of standard

deviations of the mean is the same. The distance from the mean, as measured in standard deviations, is known as the z-value.

Using the properties of the normal distribution, we can calculate a probability associated with any range of values.

To find the cumulative probability associated with a given z-value for a standard normal curve, we use the Excel function

NORMSDIST. Note the S between the M and the D. It indicates we are working with a 'standard' normal curve with mean

zero and standard deviation one.

For example, to find the cumulative probability for the z-value 1, we enter the Excel function =NORMSDIST(1).

The value returned, 0.84, is the area under the standard normal curve to the left of 1. This tells us that the probability of

obtaining a value less than 1 for a standard normal curve is about 84%.

We shouldn't be surprised that the probability of being less than 1 is 84%. Why? First, we know that the normal curve is

symmetric, so there is a 50% chance of being below the mean.

Next, we know that about 68% of the probability for a standard normal curve is between -1 and +1.

Since the normal curve is symmetric, half of that 68% — or 34% of the probability — must lie between 0 and 1.

Putting these two facts together confirms that there is an 84% chance of obtaining a value less than 1 for a standard normal

curve.

If we want to find the cumulative probability of a value in a general normal curve — one that does not necessarily have a

mean of zero and a standard deviation of one — we have two options. One option is to first standardize the value in question

to find the equivalent z-value, and then use the NORMSDIST to find the cumulative probability for that z-value.

For example, if we have a normal distribution with mean 26 and standard deviation 8, we may wish to know the probability

of obtaining a value less than 24.

Standardizing can be done easily by hand, but Excel also has a STANDARDIZE function. We enter the function in a cell and

insert three values: the value to be standardized, and the mean and standard deviation of the normal distribution.

We find that the standardized value (or z value) of 24 for a normal curve with mean 26 and standard deviation 8 is -0.25.


Now, to find the cumulative probability for the z-value -0.25, we enter the Excel function =NORMSDIST(-0.25), which tells

us that the probability of a value less than -0.25 on a standard normal curve is 40%. Thus, the probability of a value less than

24 on a normal curve with mean 26 and standard deviation 8 is 40%.

The second way to find a cumulative probability in a general normal curve is to use the NORMDIST function. Here, we enter

the function in a cell and insert four values: the number whose cumulative probability we want to find, the mean and

standard deviation of the normal distribution, and the word "TRUE."

As with our previous approach, we find that the probability of obtaining a value less than 24 on a normal curve with mean 26

and standard deviation 8 is 40%.

The value "TRUE" tells Excel to return a cumulative probability. If instead of "TRUE" we enter "FALSE," Excel returns the y-

value of the normal curve — something we are usually not interested in.
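
Both routes can also be checked outside of Excel. A minimal sketch using SciPy, where norm.cdf accepts either a pre-standardized z-value or the raw value together with the curve's mean and standard deviation:

    from scipy.stats import norm

    # Route 1: standardize by hand, then use the standard normal curve.
    z = (24 - 26) / 8                  # z = -0.25
    print(norm.cdf(z))                 # about 0.40

    # Route 2: give norm.cdf the mean and standard deviation directly,
    # the counterpart of NORMDIST(24, 26, 8, TRUE).
    print(norm.cdf(24, 26, 8))         # same answer, about 0.40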

Quite often, we have a cumulative probability, and want to work backwards, translating it into a value on a normal curve.

Suppose we want to find the z-value associated with the cumulative probability 95%.

To translate a cumulative probability back to a z-value on the standard normal curve, we use the Excel function NORMSINV.

Note once again the S, which tells us we are working with a standard normal curve.

We find that the z-value associated with the cumulative probability 95% is 1.64.

Sometimes we may want to translate a cumulative probability back to a value on a general normal curve. For example, we

may want to find the value associated with the cumulative probability 95% for a normal curve with mean 26 and standard

deviation 8.

If we want to translate a cumulative probability back to a value on a general normal curve, we use the NORMINV function.

NORMINV requires three values: the cumulative probability, and the mean and standard deviation of the normal distribution

in question.

We find that the value associated with the cumulative probability 95% for a normal curve with mean 26 and standard

deviation 8 is 39.2.
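
SciPy's norm.ppf (the percent point function) inverts the cumulative probability the same way NORMSINV and NORMINV do. A quick sketch reproducing both results above:

    from scipy.stats import norm

    # Counterpart of NORMSINV: cumulative probability back to a z-value.
    print(norm.ppf(0.95))              # about 1.64

    # Counterpart of NORMINV: back to a value on a general normal curve.
    print(norm.ppf(0.95, 26, 8))       # about 39.2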

The previous clip shows us how to use software programs like Excel to calculate z-values and cumulative probabilities for the

normal curve. Another way to find z-values and cumulative probabilities is to use a z-table. Using z-tables is a bit more

cumbersome than using Excel, but it helps reinforce the concepts.

Let's use the z-table to find a cumulative probability. Women's heights are distributed normally, with mean around 63.5

inches, and standard deviation 2.5 inches. What percentage of women are shorter than 65.6 inches?

First, we calculate the z-value for 65.6 inches, 0.84. The cumulative probability associated with the z-value is the area under

the standard normal curve to the left of the z-value. This cumulative probability is the percentage of women who are shorter

than 65.6 inches.

We next use the table to find the cumulative probability corresponding to a z-value of 0.84. First, we find the row by locating

the z-value up to the first digit to the right of the decimal point, 0.8. Then we choose the column corresponding to the

remainder of the z-value (0.84 - 0.8 = 0.04). The cumulative probability is 0.7995. About 80% of women are shorter than

65.6 inches.

Finding the cumulative probability for a value less than the mean is a bit trickier. For example, we might want to know what

percentage of women are shorter than 61.6 inches.

We find that the z-value for a height of 61.6 inches is a negative number: -0.76.

When a z-value is negative, we must first use the table to find the cumulative probability corresponding to the positive z-

value, in this case +0.76. Then, since the normal curve is symmetric, we will be able to conclude that the probability of being

less than the z-value -0.76 is the same as the probability of being greater than the z-value +0.76.

We find the cumulative probability for +0.76 by locating the row corresponding to the z-value up to the first digit to the right

of the decimal point, 0.7, and the column corresponding to the remainder of the z-value (0.76 - 0.7 = 0.06). The cumulative

probability is 0.7764.

Since the probability of being less than a z-value of +0.76 is 0.7764, then the probability of being greater than a z-value of

+0.76 is 1 - 0.7764 = 0.2236. Thus, we can conclude that the probability of being less than a z-value of -0.76 is also 0.2236.

Finally, we reach our conclusion. About 22.36% of women are shorter than 61.6 inches.
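
The symmetry argument is easy to confirm in code. A one-line check with SciPy:

    from scipy.stats import norm

    print(norm.cdf(-0.76))             # about 0.2236
    print(1 - norm.cdf(0.76))          # identical, by symmetry of the curve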


Find the cumulative probability associated with the z-value 2.

Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.

z-table

Excel


For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the value 115.

z-table

Excel

For a normal curve with mean 100 and standard deviation 10, find the cumulative probability associated with the value 80.

z-table

Excel

For a normal curve with mean 100 and standard deviation 10, find the probability of obtaining a value greater than 80 but

less than 115.

z-table

Excel

For a normal curve with mean 80 and standard deviation 5, find the probability of obtaining a value greater than 85 but less

than 95.

z-table

Excel

For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than 45.

Enter your answer in decimal notation with 3 digits to the right of the decimal (e.g., enter "5" as "5.000"). Round if necessary.

z-table

Excel

For a normal curve with mean 47 and standard deviation 6, find the probability of obtaining a value greater than 38 but less

than 45.


Enter your answer in decimal notation with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.

z-table

Excel


For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative probability of

88%.

z-table

Excel

For a normal curve with mean 222 and standard deviation 17, find the value associated with the cumulative probability of

28%.

z-table

Excel

How can the normal distribution help you sample Leo's hotel guests?

How do the unique properties of the normal distribution help us when we use a random sample to infer something about the

underlying population?

After all, when we sample a population, we usually have no idea whether or not the population is normally distributed. We're

typically sampling because we don't even know the mean of the population! If the normal distribution is such a great tool,

when can we use it?

It turns out that even if a population is not normally distributed, the properties of the normal distribution are very helpful to

us in sampling. To see why, let's first learn about a well-established statistical fact known as the "Central Limit Theorem".

Definition

Roughly speaking, the Central Limit Theorem says that if we took many random samples from a population and plotted the

means of each sample, then — assuming the samples we take are sufficiently large — the resulting plot of the sample means

would look normally distributed.

Furthermore, if we took enough of these samples, the mean of the resulting distribution of sample means would be equal to

the true mean of the population.

To repeat: no matter what type of distribution the population has — uniform, skewed, bi-modal, or completely bizarre — if

we took enough samples, and the samples were sufficiently large, then the means of those samples would form a normal

distribution centered around the true mean of the population.

The Central Limit Theorem is one of the subtlest aspects of basic statistics. It may seem odd to be drawing a distribution of

the means of many samples, but that is exactly what we are doing. We'll call this distribution the Distribution of Sample
Means. (Statisticians also often call it the Sampling Distribution of the Mean).

Let's walk through this step-by-step. If we have a population — any population — we can take a random sample. This

sample has a mean.

Then we take another sample. That sample also has a mean, which we also plot on the graph.

Now, if we plot a lot of sample means in this way, they will start to form a normal distribution around the population's

mean.

The more samples we take, the more the graph of the sample means would look like a normal distribution. Eventually, the

graph of the sample means — the Distribution of the Sample Means — would form a nearly perfect replica of a normal

distribution.

Now, nobody would actually take a lot of samples, calculate all of the sample means, and then construct a normal

distribution with them. We're taking a lot of samples here just to let you see that graphing the means of many samples

would give you a normal curve.
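
If you do want to see the theorem in action, a short simulation is enough. This sketch draws repeated samples from a deliberately skewed (exponential) population, with arbitrary parameters chosen only for illustration, and shows that the sample means cluster normally around the true population mean:

    import numpy as np

    rng = np.random.default_rng(0)

    # A deliberately non-normal population: exponential, heavily skewed.
    population = rng.exponential(scale=10, size=100_000)

    # Take 2,000 samples of size 50 and record each sample's mean.
    sample_means = np.array(
        [rng.choice(population, 50).mean() for _ in range(2_000)]
    )

    print(population.mean())       # true population mean, about 10
    print(sample_means.mean())     # the sample means center on that value
    print(sample_means.std())      # and cluster far more tightly around it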

In the real world, we take a single sample and squeeze it for all the information it's worth. But what does the Central Limit

Theorem allow us to say based on that single sample?

The Central Limit Theorem tells us that the mean of that one sample is part of a normal distribution. More specifically, we

know that the sample mean falls somewhere in a normal Distribution of Sample Means that is centered at the true

population mean.

The Central Limit Theorem is so powerful for sampling and estimation because it allows us to ignore the underlying

distribution of the population we want to learn about. Since we know the Distribution of Sample Means is normally

distributed and centered at the true population mean, we can completely disregard the underlying distribution of the

population.

As we'll see shortly, because we know so much about the normal distribution, we can use the information about the

Distribution of Sample Means to draw conclusions about the likelihood of different values of the actual population mean.

Summary

The Central Limit Theorem states that for any population distribution, the means of samples from that population are

distributed approximately normally. The more samples, and the larger the sample size, the closer the Distribution of

Sample Means fits a normal curve. The mean of a single sample lies on this normal curve, so we can use the normal

curve's special properties to extract more information from a single sample mean.

Illustrating

Let's see how the Central Limit Theorem works using a graphical illustration. The three icons are marked "Uniform,"

"Bimodal," and "Skewed." On a later page, clicking on each of the three sections in the navigation will display a different

kind of distribution.

On the next page, clicking on "Uniform" will display a distribution that is uniform in shape, i.e. a distribution for which all

values in a specified range are equally likely to occur. Clicking on "Bimodal" will display a distribution that has two

separate areas where values are more likely to occur than elsewhere. Clicking on "Skewed" will display a distribution that is

not symmetrical — values are more likely to fall above the mean than below.

Uniform

The population distribution is on the top half of the page. Let's take a sample of it. This sample has a mean.

Let's start building a distribution of the sample means on the bottom half of the page by placing each sample mean on a

graph. We repeat this process several times to create our distribution.

Take a sample. Find its mean. Record it in the sample mean histogram. This histogram approximates the distribution of

the sample means.

As we can see, the shape of the original distribution doesn't matter. The distribution of the sample means will always

form a normal distribution. This is what the Central Limit Theorem predicts.

Bimodal

The demonstration works the same way: take a sample from the bimodal population, find its mean, and record it in the sample mean histogram. Repeating this process, the distribution of sample means again forms a normal distribution centered at the population mean.

Skewed

Likewise for the skewed population: the shape of the original distribution doesn't matter. The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts.

The Central Limit Theorem states that the means of sufficiently large samples are always normally distributed, a key

insight that will allow you to estimate the population mean from a sample.
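If you would rather see the theorem in action than take it on faith, a short simulation makes the same point as the interactive demo. This is a sketch in Python with numpy; the exponential population stands in for a heavily skewed distribution:

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # A skewed population: exponential with true mean 1.0.
    # Draw 2,000 samples of 50 points each and record each sample's mean.
    sample_means = rng.exponential(scale=1.0, size=(2000, 50)).mean(axis=1)

    # The sample means cluster around the true population mean (1.0)...
    print(round(sample_means.mean(), 2))  # about 1.00
    # ...with spread close to sigma/sqrt(n) = 1/sqrt(50), about 0.14,
    # and a histogram of sample_means looks like a normal curve.
    print(round(sample_means.std(), 2))   # about 0.14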

Confidence Intervals

Using the properties of the normal distribution and the Central Limit Theorem, you can construct a range of values that is

almost certain to contain the population mean.

For a normal distribution, we know that if we select a value at random, it will be within two standard deviations of the

distribution's mean 95% of the time.

The Central Limit Theorem offers us two additional insights. First, we know that the means of sufficiently large samples are

normally distributed, regardless of the distribution of the underlying population.

Second, we know that the mean of the Distribution of Sample Means is equal to the true population mean.

Combining these facts can give us a measure of how accurately the mean of a sample estimates the population mean.

Specifically, we can now conclude that if we take a sufficiently large sample — let's say at least 30 points — from a

population, there is a 95% chance that the mean of that sample falls within two standard deviations of the true population

mean.

Let's build this up step by step to make sure we understand the logic.

First, we take a sample from a population and compute its mean. We know that the mean of that sample is a point on a

normal distribution — the Distribution of Sample Means.

Since the mean of our sample is a value randomly obtained from a normal distribution, there is a 95% chance that the sample

mean is within two standard deviations of the mean of the distribution.

The Central Limit Theorem tells us that the mean of that distribution is the same as the true population mean. Thus, we can

conclude that there is a 95% chance that the sample mean is within two standard deviations of the population mean.

We have argued that 95% of our samples will have a mean within the range shown around the true population mean.

Next we'll turn this around and look at intervals around sample means, because that's exactly what a confidence interval is.

Let's look at intervals around the means of two different types of samples: those whose sample means fall within the 2

standard deviation range around the population mean (which should be the case for 95% of all samples) and those whose

sample means fall outside the 2 standard deviation range around the population mean (which should be the case for 5% of all

samples).

First, let's look at a sample whose mean falls outside the 2 standard deviation range shown around the population mean.

Since this sample mean is outside the range, it must be more than 2 standard deviations away from the population mean.

Since the population mean is more than 2 standard deviations away from this sample mean, an interval of width 2 standard


deviations around this sample mean could not contain the true population mean.

We know that 5% of all samples should have sample means outside the 2 standard deviation range around the population

mean. Therefore 5% of all samples we obtain will have intervals that do not contain the population mean.

Now let's think about the remaining 95% of samples whose means do fall within the 2 standard deviation range around the

population mean.

If we draw an interval of width 2 standard deviations around any one of these sample means, the interval would contain the

true population mean. Thus, 95% of all samples we obtain will have intervals that contain the population mean.

We've just shown how to go from any sample mean — a point estimate — to a range around the sample mean — a 95%

confidence interval. We've also argued that 95% of confidence intervals obtained in this way should contain the true

population mean.

It's important to emphasize: We are not saying that 95% of the time our sample mean is the population mean, but we are

saying that 95% of the time a range that is two standard deviations wide centered around the sample mean contains the

population mean.

To visualize the general concept of a confidence interval, imagine taking 20 different samples from a population and drawing

a confidence interval around each. As the diagram shows, on average 95% of these intervals — or 19 out of 20 — would

actually contain the population mean.

What does this insight mean for us as managers? When we set a confidence level of 95%, we are agreeing to an approach that

1 out of 20 times will give us an interval that does not contain the true population mean. If we aren't comfortable with those

odds, we should raise the confidence level.

If we increase the confidence level to 98%, we have only a 1 out of 50 chance of obtaining an interval that does not contain the

true population mean. However, this higher confidence comes at a cost. If we keep the same sample size, then the confidence

interval will widen, thereby decreasing the accuracy of our estimate. Alternatively, to keep the same interval width, we can

increase our sample size.

How do we know if an interval is too wide? Typically, if we would make a different decision for different values within an

interval, that interval is too wide. Let's look at an example.

To estimate the percent of people in our industry who will attend the annual conference, we might construct a confidence

interval that ranges from 7% to 13%. If we would choose a different conference venue when the true percentage is 7% than when it is 13%, we need to tighten our range.

Now, before we are ready to actually create our own confidence intervals, there is a technical point we need to be acquainted

with. We need to know that the standard deviation of the Distribution of Sample Means is σ, the standard deviation of the

underlying population, divided by the square root of n, the sample size.

We won't prove this fact here, but simply note that it is true, and that it should confirm our general intuition about the

Distribution of Sample Means. For example, if we have huge samples, we'd expect the means of those large samples to be

tightly clustered around the true population mean, and thereby form a narrow distribution.

Summary

A confidence interval is an estimate for the mean of a population. It specifies a range that is likely to contain the population

mean. A confidence interval is centered at the mean of a sample randomly drawn from the population under study. When

we use a confidence level of 95%, we expect the intervals constructed around 95 out of 100 such sample means to contain the population mean.

You understand the theory behind a confidence interval. But how do you actually construct one?

We can now translate the previous discussion into a simple method for finding a confidence interval for the mean of any

population. First, we randomly select a sample of size at least 30 from the population.

Next, we assign the sample mean as the center of the confidence interval.

To find the width of the interval, we must know the level of confidence we want to assign to the interval. If we want a 95%

confidence interval, the interval should be 2 times the standard deviation of the population divided by the square root of n,

the sample size.

Since we typically don't know the standard deviation of the population, we substitute the best estimate that we do have — the

standard deviation of the sample.


Here's what the equation looks like for our example: the interval runs from (sample mean) - 2*s/sqrt(n) to (sample mean) + 2*s/sqrt(n), where s is the sample standard deviation and n is the sample size.

If we want a level of confidence other than 95%, instead of multiplying s/sqrt(n) by 2, we multiply by the z-value

corresponding to the desired level of confidence.

We can use this formula to compute any confidence interval. There is one restriction: in order for it to work, the sample size

has to be at least 30.

Let's walk through an example. Wine Lover's Magazine's managers have asked us to help them estimate the average age of

their subscribers so they can better target potential advertisers.

We tell them we plan to survey a sample of their subscribers. They say they're comfortable with our working with a sample,

but emphasize that they want to be 95% confident that the range we give them contains the true average age of its full set of

subscribers.

We obtain survey results from 60 randomly-chosen subscribers and determine that the sample has a mean of 52 and a

standard deviation of 40.

To find an appropriate confidence interval, we incorporate information about the sample into the formula: 52 ± z*40/sqrt(60).

The z-value for a 95% confidence interval is about 2, or more accurately, about 1.96. This tells us that a 95% confidence

interval would begin at 52 minus 10.12, or 41.88, and end at the mean plus 10.12, or 62.12.

We give management the range from 41.88 to 62.12 as an estimate of the average age of its subscribers, telling them they

can be 95% confident that the true population mean falls between these values.
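As a check on the arithmetic, the same calculation takes a few lines of Python (nothing here beyond the formula above):

    import math

    n, sample_mean, s, z = 60, 52, 40, 1.96

    half_width = z * s / math.sqrt(n)  # about 10.12
    print(round(sample_mean - half_width, 2), round(sample_mean + half_width, 2))
    # 41.88 62.12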

What if we want a confidence level other than 95%? We can use the sample mean, standard deviation, and size from the

sample data, but how do we obtain the right z-value?

The z-value for 95% confidence is well known to be about 2, but how do we find a z-value for a less common confidence

interval? To be 98% confident that our interval contains the population mean, how do we obtain the appropriate z-value?

To find the z-value for 98% confidence level, we are essentially asking: How far to the left and right of the standard normal

curve's mean do we have to go to capture 98% of the area?

Capturing 98% of the area centered at the mean of the normal curve leaves two areas at the tails, each covering 1% of the

area under the curve. The z-value of the right boundary is the z-value associated with a cumulative probability of 99% —

the sum of the central 98% and the 1% in the left tail.

Converting the desired confidence level into the corresponding cumulative probability on the standard normal curve is

essential because Excel's NORMSINV function and the z-table work with cumulative probabilities.

To find the z-value associated with a cumulative probability of 99%, enter into Excel =NORMSINV(0.99), which returns

the z-value 2.33. Or, look in the z table and find the cell that contains a cumulative probability closest to 0.9900. The z-

value is 2.33, the sum of the row-value 2.3 and the column-value 0.03.

Try finding a z-value yourself. Find the z-value associated with a 99.5% confidence level using the appropriate normal

distribution function in Excel. Alternatively, you can use the Standard Normal Table (z-table) in your briefcase.


Our first step is to convert the confidence level of 99.5% into the corresponding cumulative probability on the standard

normal curve. To do this, note that to have 99.5% probability in the middle of the standard normal curve, we must exclude

a total area of 1 - 99.5% = 0.5% from the curve. That area is divided into two equal parts in the distribution's tails: 0.25% in

each tail.

We can now see that the cumulative probability associated with confidence level of 99.5% is 1 − 0.25% = 99.75%. Thus, the

z-value for a confidence level of 99.5% is the same as the z-value of a cumulative probability of 99.75%. We find the z-value

in Excel by entering =NORMSINV(0.9975), which returns the value 2.81. Alternatively, we could find the z-value in the z-

table by looking up the probability 0.9975.
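If Excel isn't handy, the same lookup can be done in Python with scipy (an assumed tool, not part of the course materials); the results match the z-table:

    from scipy.stats import norm

    # 98% confidence -> cumulative probability of 99%
    print(round(norm.ppf(0.99), 2))    # 2.33

    # 99.5% confidence -> cumulative probability of 99.75%
    print(round(norm.ppf(0.9975), 2))  # 2.81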

Summary


To calculate a confidence interval, we take a sample, compute its mean and standard deviation, and then build a range

around the sample mean with a specified level of confidence. The confidence level indicates how confident we are that the interval we constructed around the sample mean contains the population mean.

We assumed in our confidence interval calculations that the sample size was at least 30. What if it isn't? What if we have only

a small sample? Let's consider a different survey, one that concerns a delicate matter.

The business manager of a large ocean liner, the Demiurgos, asks for our help. She wants us to find out the value of her

guests' belongings. She needs this value to determine the correct insurance protection in case guest belongings disappear

from their cabins, are destroyed in a fire, or sink with the ship.

She has no idea how valuable her guests' belongings are, but she feels uneasy asking them for this information.

She is willing to ask only 16 guests to estimate the total value of the belongings in their cabins. From this sample, we need

to prepare an estimate.

With a sample size less than 30, we cannot calculate confidence intervals in the same way as with a large sample size. A

small sample increases our uncertainty about two important aspects of our estimate of the population mean.

First, with a small sample, the consequences of the Central Limit Theorem are not assured, so we cannot be sure that the

sample means follow a normal distribution.

Second, with a small sample, we can't be sure that the sample standard deviation is a good estimate of the population

standard deviation.

Due to these additional uncertainties, we cannot use z-values to construct confidence intervals. Using a z-value would

overstate our confidence in our estimate.

Can we still create a confidence interval? Is there a way to estimate the population mean even if we have only a handful of

data points?

It depends: if we don't know anything about the underlying population, we cannot create a confidence interval with fewer

than 30 data points. However, if the underlying population is normally distributed — or even roughly normally distributed

— we can use a confidence interval to estimate the population mean.

In practice, as long as we are sure the underlying population is not highly skewed or extremely bimodal, we can construct a

confidence interval, even when we have a small sample. However, we do need to modify our approach slightly.

To estimate the population mean with a small sample, we use a t-distribution, which was discovered in the early 20th

century at the Guinness Brewing Company in Ireland.

A t-distribution gives us t-values in much the same way as a normal distribution gives us z-values. What is the difference

between the normal distribution and the t-distribution?

A t-distribution looks similar to a normal distribution, but is not as tall in the center and has thicker tails, because it is

more likely than the normal distribution to have values fall farther away from the mean.

Therefore, the normal distribution's "rules of thumb" for 68% and 95% probabilities no longer hold. For example, we must

go more than 2 standard deviations on either side of the mean to capture 95% of the probability for a t-distribution.

Thus, to achieve the same level of confidence, a confidence interval based on a t-distribution will be wider than one based

on a normal distribution. This reinforces our intuition: we have less certainty about our estimate with a smaller sample, so

we need a wider interval to achieve a given level of confidence.

The t-distribution is also different because it varies with the sample size: For each sample size, there is a different t-value

associated with a given level of confidence. The smaller the sample size n, the shorter the height and the thicker the tails of

the t-distribution curve, and the farther we have to go from the mean to reach a given level of confidence.

On the other hand, as the sample size increases, the shape of the t-distribution becomes more and more like the shape of a

normal distribution. Once we reach a sample size of 30, the t-distribution becomes virtually identical to the z-distribution,

so t-values and z-values can be used interchangeably.

Incidentally, we can use the t-distribution even for sample sizes larger than 30. However, most people use the z-

distribution for larger samples, partially out of habit and partially because it's easier, since the z-value doesn't vary based

on the sample size.


To find the right t-value, we first have to identify the t-distribution that corresponds to our sample size. We do this by

finding the number of "degrees of freedom" of the sample, which for our purposes is simply the sample size minus one. If

our sample size is 16, we have 15 degrees of freedom, and so on.

Excel provides a simple function for finding the appropriate t-value for a confidence interval. If we enter 1 minus the

level of confidence we want and the degrees of freedom into the Excel function TINV, Excel gives us the appropriate t-

value.

For example, for a 95% confidence interval and a sample size of n = 16, the Excel function TINV(0.05,15) would return

the value 2.131.
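For reference, here is the equivalent lookup outside Excel, sketched in Python with scipy: TINV takes the total area in the two tails, while scipy's t.ppf takes a cumulative probability, so the two calls below agree.

    from scipy.stats import t

    # Excel: TINV(0.05, 15) returns 2.131.
    # scipy: cumulative probability 0.975 (95% in the middle, 2.5% in each tail).
    print(round(t.ppf(0.975, df=15), 3))  # 2.131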


Once we find the t-value, we use it just like we used the z-value to find a confidence interval. For example, for t = 2.131,

the appropriate confidence interval is: (sample mean) ± 2.131*s/sqrt(n).


If we don't have Excel handy, we can use a t-distribution table to find the t-value associated with the degrees of freedom

and the confidence level we specify. When using different t-value tables, we need to be careful to note which probability

the table depicts.


Some tables report values associated with the confidence level, like 0.95. Others report values based on the area in the

tails, which would be 0.05 for a 95% confidence interval. Our t-table, like many others, reports values associated with a

cumulative probability, so for a 95% level of confidence, we would have to look at a cumulative probability of 97.5%.


Returning to the good ship Demiurgos, let's determine an estimate of the average value of passengers' belongings. The

manager samples 16 guests, and reports that they have an average of $10,200 worth of clothing, jewelry, and personal

effects in their cabins. From her survey numbers, we calculate a standard deviation of $4,800.

We need to double check that the distribution isn't too skewed, which we might expect, since some of the passengers are

quite wealthy. The manager explains that the insurance policy has a limited liability clause that limits a passenger's

maximum claim to $20,000.

Above $20,000, passengers' own homeowners' policies must cover any losses. Thus, in the survey, if a guest reported

values above $20,000, the manager simply reported $20,000 as the value to be covered for our data set.

We sketch a graph of the 16 values that confirms that the distribution is not too asymmetric, so we feel comfortable using

the t-distribution.

Since we have a sample of 16 passengers, there are 15 degrees of freedom. The Excel function =TINV(0.05,15) tells us

that the appropriate t-value is 2.131.


Using the confidence interval formula, the guests' valuables are worth $10,200 plus or minus 2.131 times $4,800 over

the square root of 16. Thus, the width of the confidence interval is 2.131*4,800/4 = $2,557, and we can report that we are

95% confident that the average value of passengers' belongings is between $7,643 and $12,757.
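The whole small-sample calculation fits in a few lines; this is a Python sketch of the formula above, not part of the original course tools:

    import math
    from scipy.stats import t

    n, sample_mean, s = 16, 10200, 4800

    t_val = t.ppf(0.975, df=n - 1)         # about 2.131 for 15 degrees of freedom
    half_width = t_val * s / math.sqrt(n)  # about 2,557
    print(round(sample_mean - half_width), round(sample_mean + half_width))
    # 7642 12758 (rounding t to 2.131 gives the $7,643 to $12,757 reported above)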


What if the manager decides this range is too wide? She will have to survey more guests. Increasing the sample size causes the t-value to decrease, and also increases the size

of the denominator (the square root of n). Both factors narrow the confidence interval.



For example, if she asks 10 more guests, and the standard deviation of the sample does not change, the t-value would

drop to 2.06 and the square root of n in the denominator would increase. The width of the interval would decrease

significantly, from $2,557 to $1,939.


Summary

Confidence intervals can be constructed even with a sample size of less than 30, as long as the population is roughly

normally distributed (or, at least not too skewed or bimodal). To find a confidence interval with a small sample, use a t-

distribution. T-distributions are a set of distributions that resemble the normal distribution, but with shorter heights

near the mean and thicker tails. To find a confidence interval for a small sample size, place the appropriate t-value into

the confidence interval formula.

When we take a survey, we often want a specific level of accuracy in our estimate of the population mean. For example,

when estimating car owners' average spending on car repairs each year, we might want to be 95% confident that our

estimate is within $50 of the true mean.

We know that the sample size of our survey directly affects the accuracy of our estimate. The larger the sample size, the

tighter the confidence interval and the more accurate our estimate. A sample of size n gives us a confidence interval that

extends a distance of d on either side of the mean: d = z*sigma/sqrt(n).

To find the sample size necessary to give us a specified distance d from the mean, we must have an estimate of sigma, the

standard deviation of spending. If we do not have an estimate based on past data or some other source, we might take a

preliminary survey to obtain a rough estimate of sigma.

In this example, we estimate sigma to be $300 based on past experience. Since we want a 95% level of confidence, we set z

= 1.96. To ensure our desired accuracy — that d is no more than $50 — we must randomly sample at least 139 people.

In general, to ensure a confidence interval extends a distance of at most d on either side of the mean, we choose a sample

size n that satisfies n ≥ (z*sigma/d)^2. We can solve for n with simple algebra, or by using the attached Excel utility.
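Translated into code, the sample-size calculation is a one-liner; in this Python sketch, sigma is the $300 estimate from past experience:

    import math

    sigma, d, z = 300, 50, 1.96

    # n must satisfy n >= (z*sigma/d)^2 = (1.96*300/50)^2 = 138.3; round up.
    n = math.ceil((z * sigma / d) ** 2)
    print(n)  # 139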

Summary

When estimating a population mean, we can ensure that our confidence interval extends a distance of at most d on either

side of the mean by choosing an appropriate sample size.

Step-by-Step Guide

Here is a step-by-step process for creating a confidence interval:

First, we choose a level of confidence and a sample size n appropriate to the decision context.

Second, we take a random sample and find the sample mean. This is our best estimate for the population mean.

Third, we compute the sample standard deviation. This is our best estimate of the population standard deviation.

Fourth, we find the z-value or t-value associated with the proper confidence level. If our sample size is at least 30, we find the z-value for our confidence level. If not, we find the t-value for our confidence level, with degrees of freedom = sample size - 1.

Fifth, we calculate the end points of the confidence interval: (sample mean) ± (z or t)*s/sqrt(n).

Summary

Construct confidence intervals using the steps outlined above. With a confidence interval derived from an unbiased

random sample, we can say that the true population mean falls within the interval with the corresponding level of

confidence.

Excel Utility


Click here to open an Excel utility that allows you to create confidence intervals by providing the sample mean, standard

deviation, size, and desired level of confidence. You should enter data only in the yellow input areas of the utility. To ensure

you are using the utility correctly, try to reproduce the results for the Wine Lover's Magazine and the Demiurgos

examples.

The sample you collected earlier has all the data you need to create a confidence interval for Leo's problem.

You take another look at the survey you created earlier for Leo: you sampled 45 guests, and calculated that the average

satisfaction rate of the sample was 4.4, with a standard deviation of 1.54. Using this information, you decide to create a 95%

confidence interval for Leo.


To create a 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by the sample

standard deviation divided by the square root of the sample size. Using the numbers given, you obtain a 95% confidence

interval by going 0.45 points above and below the sample mean of 4.4, which translates into a confidence interval from 3.95

to 4.85.

You meet with Leo and tell him that you can be 95% certain that the population mean falls between 3.95 and 4.85. Leo looks

at your numbers and shakes his head.

That's just not accurate enough for me to make a decision. If the mean is close to 4.85, I'd be happy, but if it's closer to 4, I'm

concerned. Can we narrow the range at all?

Looking over your notes, you think you can give Leo some options.

Why don't you create a larger sample and report the results back to me?

You select another 40 guests at random and ask the hotel operator to conduct the survey for you again. He is able to reach 25

guests. You combine the two samples, which gives a new sample size of 70.

For the combined sample, you find that the new sample mean is 4.5 and the new sample standard deviation is 1.2. Armed

with more data, you create another confidence interval.

We can be 95% certain that the average satisfaction of all hotel guests with the scuba school is between:


To create this 95% confidence interval, you take the mean of the sample and add/subtract the z-value multiplied by the

sample standard deviation divided by the square root of the sample size. Using the new numbers given for the larger sample,

you obtain a 95% confidence interval by going 0.28 points above and below the sample mean of 4.5, which translates into a

confidence interval from 4.22 to 4.78.

Thank you. I am much happier with this result. I have enough information now to decide whether to keep the current scuba

diving school.

Toshi Matsumoto is the Chief Operating Officer of a consumer electronics retailer with over 150 stores spread throughout

Japan. For over a year, the sales of high-end VCRs have lagged, due to a shift towards DVD players.

Just today, Toshi heard that Veetek, a large South Asian electronics retailer, is looking to purchase a bulk shipment of high-

end VCRs.

This would be a perfect opportunity for Toshi to liquidate slow-moving inventory currently languishing on the shelves of

his stores. Before he calls Veetek, he wants to know how many high-end VCRs he can promise. After two days of furious

phone calls, his deputy has gathered data from 36 representative outlets in his retail chain.

The mean high-end VCR inventory in each store polled was 500 units. The standard deviation was 180. Toshi needs you to

find a 95% confidence interval for the average VCR inventory per store. The interval is


Paul Segal manages the pig-farming division of the agricultural company Bowman-Lyons-Centerville. A rumored outbreak

of Pulluscular Pig Disorder (PPD) in one of Paul's herds is on the verge of causing a public relations disaster.

The main symptom of PPD is a shrinking brain, and the only certain way to diagnose PPD is by measuring brain size post-

mortem.

Paul needs to know if his herd is affected by PPD, but he does not want to have to slaughter hundreds of swine to find out.

At the preliminary stage, he can offer no more than 5 prime porkers to be slaughtered and diagnosed.

For the pigs slaughtered, the mean brain weight was 0.18 lbs, with a standard deviation of 0.06 lbs. With 95% confidence,

in what range does the herd's average brain weight lie?


Proportions

The next morning, you and Alice are about to head off to the hotel pool when Leo calls you.

I'm sorry to disturb you, but I have another problem, and I think you might be able to help.

The Kahana is a very popular resort during the summer tourist season. But the number of leisure visitors drops significantly

during the off-season, from September through February and then April through May.

We usually have quite a few room vacancies during that period of time. We expect to have about 200 rooms vacant for

weeklong periods during the slow season this year.

I've developed a new program that rewards our best guests with a special discount if they book a weeklong stay during our

slow period. They won't have complete date flexibility of course, but the steep discount should make the offer attractive for

them.

To see how many of our past guests would accept such an offer, I sent promotional brochures to 100 of them. The deadline by

which they had to respond to the offer has passed. Ten guests responded with the required room deposit prior to the deadline

— that's a solid 10 percent.

I figure if we send out 2,000 promotions, we'll get about 200 responses.

This is a nice idea Leo, but I'm concerned it could backfire. If more than 10% respond to this offer, you might end up

disappointing some of the very guests you're trying to reward. Or, if too many respond and you give them all the discount,

you'll have to turn away customers willing to pay full price.

That is exactly my concern. I wonder how accurate the 10% response rate is. Just because it held for 100 guests, will it hold

for 2,000? What if 11% actually respond to the promotions?

Imagine what would happen if 220 guests responded. I don't want to anger 20 loyal customers by telling them the offer is not

valid, but I also don't want to turn away full paying guests to accommodate the extra 20 guests at a discount.

I'm willing to reserve 200 rooms for these discount weeklong stays during the slow season. How many return guests can I

safely send the discount offer and be confident that no more than 200 will respond?

You can tell that Leo is growing quite comfortable with relying on your statistical methods. He seems almost as interested in

them as he is in your results.

Sometimes, the question we pose to members of a sample calls for a yes or no answer.

We might survey people in a target market and ask if they plan to buy a new car this year. Or survey voters and ask if they

plan to vote for the incumbent candidate for office. Or we might take a sample of the products our plant produced yesterday

and count how many are defective.


Even though our question has only two answers, we still have to address an inherent uncertainty: We know what values our

data can take — yes or no — but we don't know how often each response will be given.

In these cases, we usually convey the survey results by reporting the percentage of yes responses as a proportion, p-bar. This

is our best estimate of p, the true percentage of "yes" responses in the underlying population.

Suppose, for example, that we have posted advertisements in the subway cars on Boston's "Red Line," and want to know what

percentage of all passengers remembers seeing our ad.

We create a proper survey, and ask randomly selected Red Line passengers if they remember seeing our ad. 300 passengers

respond to our survey, of which 100 passengers report remembering the ad.

Then p-bar is simply 33%, which is the number of people that remember the ad, 100, divided by the number of respondents,

300.

The remaining 200 passengers, or 67% of the sample, report not remembering the ad. The two proportions always add up to 1

because survey respondents report either remembering the ad or not.

Once we know the proportion of the sample, we can draw conclusions about all Red Line passengers. Our best estimate, or

point estimate, for p, the percentage of all passengers who remember seeing our ad, is 33%.

As managers, we typically want more than this simple point estimate — we want to know how accurate the estimate is. How

far from 33% might the true percentage be? Can we say confidently that it is between 30% and 36%, for example?

When we work with proportions, how do we find a confidence interval around our point estimate?

The process for creating a confidence interval around a proportion is nearly identical to the process we've used before. The

only difference is that we can approximate the standard deviation of the population with a simple formula rather than

calculating it directly from the raw data.

Based on our sample, our best estimate of the true population proportion is p-bar, the percentage of "yes" responses in our

survey. Statistical theory tells us that our best estimate of the standard deviation of the underlying population is the square root of [(p-bar)*(1 - (p-bar))]. We can use this approximate standard deviation to determine a confidence interval for

the proportion.

For our Red Line ad, we approximate the standard deviation with the square root of 0.33 times 0.67, or 0.47. A 95%

confidence interval is 0.33 plus or minus 1.96 times 0.47 divided by the square root of 300. This is equal to 0.33 plus or

minus 0.053, or 27.7% to 38.3%.
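Here is the Red Line calculation as a Python sketch, including the sample-size rule of thumb that the next section discusses:

    import math

    n = 300
    p_bar = 0.33  # 100 out of 300, rounded to two decimals as in the text

    # Rule of thumb (see below): need n*p_bar >= 5 and n*(1 - p_bar) >= 5.
    assert n * p_bar >= 5 and n * (1 - p_bar) >= 5

    sd_estimate = math.sqrt(p_bar * (1 - p_bar))    # about 0.47
    half_width = 1.96 * sd_estimate / math.sqrt(n)  # about 0.053
    print(round(p_bar - half_width, 3), round(p_bar + half_width, 3))
    # 0.277 0.383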

Unfortunately, there is one catch when we calculate confidence intervals around proportions...

Sample Size

Sample size matters, particularly when dealing with very small or very large proportions. Suppose we are sampling New Yorkers to estimate the prevalence of Amyotrophic Lateral Sclerosis, commonly known as Lou Gehrig's Disease. In the U.S., the odds of having the

disease are less than 1 in 10,000. Would our sample be useful if we surveyed 100 people?

No. We probably wouldn't find a single person with the disease in our sample. Since the true proportion is very small, we

need to have a large enough sample to make sure we find at least a few people with the disease. Otherwise, we will not have

enough data to get a good estimate of the true proportion.

There is a guideline we must meet to make sure that our sample is large enough when estimating proportions. Two

conditions must be met: First, the product of the sample size and the proportion must be at least 5. Second, the product of

the sample size and 1 minus the proportion must also be at least 5.

If both these requirements are met, we can use the sample. Essentially, this guideline guarantees that our sample contains

a reasonable number of "yes" and a reasonable number of "no" answers. Our sample will not be useful otherwise.

To avoid an invalid sample, we need to create a large enough sample size to satisfy the requirements. However, since we

don't know the proportion p-bar before sampling, we don't know if the two conditions are met before setting the sample

size. How can we get around this problem?

We can obtain a preliminary estimate of p-bar using either of two methods: first, we can use past experience. For example,

to estimate the rate of Lou Gehrig's disease, we can research the rate of occurrence in the general population. This is a

reasonable first estimate for p-bar.

In many cases, however, we are sampling for the first time. Without past experience, we don't know what p-bar might be.

In this case, it may well be worth our time to take a small test sample to estimate the proportion, p-bar.


For example, if the proportion of yes answers in our small test sample is 3%, then we can use 3% as our preliminary

estimate of p-bar.

Substituting 3% for p-bar in our two requirements, n(p-bar) ≥ 5 and n(1 - (p-bar)) ≥ 5, tells us that n must satisfy n*0.03 ≥

5 and n*0.97 ≥ 5. Thus the sample size we need for our real sample must be at least 167.

We would then use a real sample — with at least 167 respondents — to find an actual sample value of p-bar to create a

confidence interval for the population proportion.

Summary

Proportions are often used to indicate the frequency of some characteristic in a population. The sample proportion p-bar is

the number of occurrences of the characteristic in the sample divided by the number of respondents, the sample size. It is

our best estimate of the true proportion in the population. We can construct a confidence interval for the population

proportion. Two guidelines for the sample size must be met for a valid confidence interval: n(p-bar) and n(1 - (p-bar)) must

each be at least five.

Creating confidence intervals around proportions is not much different from creating them around means. Finding the right

number of Leo's promotional brochures to mail should be easy.

Leo needs to know how accurate the 10 percent response rate of his 100-customer sample is. Will this response rate hold for

2,000 guests? To how many guests can he send the discount offer for his 200 rooms?

First, you calculate a 95% confidence interval for the response rate.

Enter the lower bound as a decimal number with two digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if

necessary.


The 95% confidence interval for the proportion estimate is 0.0412 to 0.1588, or 4.12% and 15.88%. You obtain that answer by

using the sample data and applying the familiar formula:

Then, after giving Leo's questions some thought, you recommend to him that he send the mailing to a specific number of

guests.


Based on the confidence interval for the proportion, the maximum percentage of people who are likely to respond to the

discount offer (at the 95% confidence level) is 15.88%. If as many as 15.88% of recipients respond and only 200 rooms are available, how many people can Leo safely mail? Divide 200 by 0.1588 to get the answer: Leo should send the offer to at most 1,259 past customers.
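The two steps, sketched in Python:

    import math

    n, p_bar, z = 100, 0.10, 1.96

    # Upper end of the 95% confidence interval for the response rate.
    half_width = z * math.sqrt(p_bar * (1 - p_bar)) / math.sqrt(n)
    upper = p_bar + half_width
    print(round(upper, 4))  # 0.1588

    # With 200 rooms and a worst-case response rate of 15.88%:
    print(int(200 / upper))  # 1259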

Leo is pleased with your work. He tells you to relax and enjoy the resort.

GMW is a German auto manufacturer that has regional sales subsidiaries throughout the world. Arturo Lopez heads the

Mexican sales division of the company's Latin American subsidiary.

GMW earns additional profit when customers choose to finance their car purchase with a GMW financing package. Arturo

has been asked to submit a report to the GMW CEO in Germany about the percentage of GMW customers who opt for

financing.

Arturo has asked you, a new member of the division sales team, to devise a way to estimate this percentage. You take a

random sample of 64 cars sold in the Mexican sales division, and find that 13 of them, or about 20.3%, opted for GMW

financing.

If you want to be 95% confident in your report to Mr. Lopez, you should tell him that the percentage of all Mexican

customers opting for GMW financing falls in the range:



Kayleigh Marlon is the Chief Buyer at Tar-Mart, a company that operates a chain of superstores selling discount

merchandise. Tar-Mart has a huge national presence, and manufacturers compete fiercely to get their products onto Tar-

Mart's shelves.

Crown Toothpaste, a new entrant in the toothpaste market, is one of them. Kayleigh agreed to stock Crown for 4 weeks and

display it prominently. After that period, she will stop stocking Crown unless 5% of Tar-Mart's customers bought Crown or

were considering buying Crown within the next month.

The trial period is now over. Kayleigh has asked you to take a sample of customers to see if Tar-Mart should continue

stocking Crown. She would like you to be at least 95% confident in your answer.

The first step is to decide how large a sample size to choose. Kayleigh tells you that, in the past, when Tar-Mart introduced

a new product, the percentage of people who expressed interest ranged between 2% and 10%. What sample size should you

use?

Since interest could run as low as 2%, satisfying n(p-bar) ≥ 5 in the worst case requires n ≥ 5/0.02 = 250. You choose a sample size of 250. After conducting the survey, you find that 10 out of 250 people surveyed had bought

Crown or were considering buying Crown within the next month. What is the 95% confidence interval for the population

proportion?


First, you find the sample proportion: 10 out of 250 is a proportion of 4%. You verify that n(p-bar) = 250*0.04 = 10 ≥ 5 and

n(1 - (p-bar)) = 250*0.96 = 240 ≥ 5. Then, using the formula, you find the confidence interval around the sample

proportion. The endpoints of that interval are 1.6% and 6.4%.

OO-P-S is a small-package delivery service with worldwide operations. Celine Bedex, VP Marketing, has heard increasing

complaints about late deliveries, and wants to know how many of the shipments are late by one day or more.

Celine would like an estimate of the percentage of late deliveries. In a sample of 256 shipments, 2 were delivered late, a

proportion of about 0.008, or 0.8%. If Celine wants to be 99% confident in the result of a confidence interval calculation,

the interval is: not valid to compute from this sample. Note that n(p-bar) = 256*(2/256) = 2, which is less than 5, so this sample fails our sample-size guideline for proportions, and a larger sample is needed.

Celine collects a new sample, this time of 729 shipments. Of these, 8 were late. Celine can be 99% confident that the

population proportion of late packages is between:


First, calculate the sample proportion for the new sample: 8/729 = 0.011. Then, verify that the new sample size satisfies the

rules of thumb. Both n(p-bar) and n(1 - (p-bar)) are greater than 5.

Using the new sample size and sample proportion, calculate the confidence interval: [0.1%, 2.1%].

Hypothesis Testing

Introduction

After finishing the sampling assignments, you and Alice decide to take some time off to enjoy the beach. Just as you are

gathering your beach gear, Leo gives you another call.

Hi there! Don't let me keep you from enjoying the beach. I just wanted to let you know what I'd like you to help me with next.

I've been working on ideas to increase the Kahana's profits.

Is it possible to increase profits by raising the room prices? That would be an easy solution.

I wish it were that easy. Room prices are extremely competitive and are often the first thing potential guests take into

consideration. So if we increase room prices, I'm afraid we'll have fewer guests. That might put us back where we started from

with profits — or even worse.


Profits here depend on several factors I can influence. The two major ones are room occupancy rates and discretionary spending. "Discretionary spending" is the money guests

spend on non-room amenities. You know, food, drinks, spa services, sports activities, and so on.

As a manager I can affect a variety of factors that influence discretionary spending: the quality of the restaurant, for example,

or the types of amenities offered.

And you'd like us to help you understand your guests' discretionary spending patterns better.

Right. Then I can explore new ways to increase profits on non-room amenities. I can also see if some of my recent efforts to

increase guest spending have paid off.

I'm particularly interested in restaurant operations. I've made some changes to the restaurants recently. For example, I hired

a new executive chef last year. I'd like to know if restaurant revenues per person have changed since then.

I'd also like to find out if the renovation of our premier cocktail lounge has resulted in higher spending on beverages.

Finally, I've been wondering if discretionary spending patterns are different for leisure and business guests. If so, I might

change our marketing campaigns to better suit each of those market segments.

We don't have a consolidated report for this year yet, so we'll need to conduct some surveys and analyze the results.

You're really getting into these statistical methods, aren't you, Leo?

Definition

Leo made some important changes to his business and he has some ideas of what the impact of these changes has been. How

do you put his ideas to the test?

As managers, we often need to put our claims, ideas, or theories to the test before we make important decisions. Based on

whether or not our claim is statistically supported, we may wish to take managerial action.

Hypothesis testing is a statistical method for testing such claims. A hypothesis is simply a claim that we want to

substantiate. To begin, we will learn how to test hypotheses about population means.

For instance, suppose we know that the historical average number of defects in a production process is 3 defects per 1,000

units produced. We have a hunch that a certain change to the process — a new machine, say — has changed this number. The

hypothesis we wish to substantiate is that the average defect rate has changed — that it is no longer 3 per 1,000.

How do we conduct a hypothesis test? First, we collect a random sample of units produced by the process. Then, we see

whether or not what we learn about the sample supports our hypothesis that the defect rate has changed.

Suppose our sample has an average defect rate of 2.7 defects per 1,000. Based on this sample, can we confidently say that the

defect rate has changed?

That depends. To find out, we construct a range around the historical defect rate of 3 — the population mean that has been

cast in doubt. We construct the range so that if the mean defect rate in the population is still 3, it is very likely for the

mean of a sample taken from the population to fall within that range.

The outcome of our test will depend on whether 2.7, the mean of the sample we have taken, falls within the range or not.

If the sample mean of 2.7 falls outside of the range, we feel comfortable rejecting the hypothesis that the defect rate is still 3.

However, if the sample mean falls within the range, we don't have enough evidence to support the claim that the defect rate

has changed.

This example captures the essence of hypothesis testing, but we need to formalize our intuition about the example and define

our new statistical technique more precisely.

To conduct a hypothesis test, we formulate two hypotheses: the so-called null hypothesis and the alternative

hypothesis.

Based on experience or conventional wisdom, we have an initial value of the population mean in mind. The null hypothesis

states that the population mean is equal to that initial value: in our example, the null hypothesis states that the current

population mean is 3 defects per 1,000. We use the Greek letter mu to represent the population mean, in this case the current

average defect rate.

The alternative hypothesis is the claim we are trying to substantiate. Here, the alternative hypothesis is that the average

defect rate has changed. Note that the alternative hypothesis states that the null hypothesis does not hold.


As the example suggests, in a hypothesis test, we test the null hypothesis. Based on evidence we gather from a sample, there

are only two possible conclusions we can draw from a hypothesis test: either we reject the null hypothesis or we do not

reject it.

Since the alternative hypothesis states the opposite of the null hypothesis, by "rejecting" the null hypothesis we necessarily

"accept" the alternative hypothesis.

In our example, the evidence from our sample will help us determine whether or not we should reject the null hypothesis that

the defect rate is still 3 in favor of the alternative hypothesis that the defect rate has changed.

Based on our sample evidence, which conclusion should we draw? We reject the null hypothesis if it is highly unlikely that

our sample mean would come from a population with the mean stated by the null hypothesis.

For example, if the sample we drew had a defect rate of 14 per 1,000, we would reject the null hypothesis. Drawing a sample with an average defect rate of 14 from a population with an average defect rate of 3 would be very unlikely.

"We cannot reject the null hypothesis if it is reasonably likely that our sample mean would come from a population with the

mean stated by the null hypothesis. The null hypothesis may or may not be true: we simply don't have enough evidence to

draw a definite conclusion."

For example, if the sample we drew had a defect rate of 3.05 per 1,000, we could not reject the null hypothesis, since it wouldn't be unusual to randomly draw a sample with an average defect rate of 3.05 from a population with an average defect rate of 3.

Note that having the sample's average defect rate very close to 3 does not "prove" that the mean is 3. Thus we never say that

we "accept" the null hypothesis — we simply don't reject it.

It is because we can never "accept" the null hypothesis that we do not pose the claim that we actually want to substantiate as

the null hypothesis — such a test would never allow us to "accept" our claim! The only way we can substantiate our claim is to

state it as the opposite of the null hypothesis, and then reject the null hypothesis based on the evidence.

It is important that we understand exactly how to interpret the results of a hypothesis test. Let's illustrate the two types of

conclusions with an analogy: a US jury trial.

In the US judicial system, the accused is considered innocent until proven guilty. So, the null hypothesis is that the accused is

innocent. The alternative hypothesis is that the accused is guilty: this is the claim that the prosecution is trying to prove.

The two possible outcomes of a jury trial are "guilty" or "not guilty." The jury does not convict the accused unless it is certain

beyond reasonable doubt that the accused is guilty. With insufficient evidence, the jury cannot conclude that the accused

truly is innocent. The jury simply declares that the accused is "not guilty."

Similarly, in a hypothesis test, if our evidence is not strong enough to reject the null hypothesis, then that does not prove that

the null hypothesis is true. We simply have failed to show it is false, and thus cannot reject it.

A hypothesis is a claim or assertion that can be tested. On the basis of a hypothesis test we either reject or leave unchallenged

a particular statement: the null hypothesis.

Alice promises Leo that the two of you will drop by his office first thing in the morning to test if Leo's survey results support

his claims that food and beverage spending patterns have changed.

Summary

We use hypothesis tests to substantiate a claim about a population mean. The null hypothesis states that the population

mean is equal to an initial value that is based on our experience or conventional wisdom. We test the null hypothesis to

learn if we should reject it in favor of our claim, the alternative hypothesis, which states that the null hypothesis does not

hold.

The next morning, Leo explains the measures he has undertaken to increase customer spending on food and beverages. "I'd like

to see if they've had a discernible impact on my guests' restaurant-related spending patterns."

Last year, I made two major changes to restaurant operations: I brought in a new executive chef and renovated the main

cocktail lounge.

The chef introduced a new menu: a fusion of traditional Hawaiian and French cuisine. She put some elaborate items on the

menu, like that mango and brie tart I recommended to you. She also has offerings that cater to simpler tastes. But the

question is, have restaurant profits been affected by the new chef?


Since we set our food margins as a fixed percentage of food revenue, I know that if revenues have increased, profits have

increased too. Based on last year's consolidated reports, the average spending on food per person per day was $55. I'm

curious to see if that has changed.

In addition, I renovated the cocktail lounge. The old bar was designed poorly and used space inefficiently. Now more guests

can be seated in the lounge, and more seats have good views of the ocean.

I also invested in a large machine that makes a wide variety of frozen drinks. Pina coladas are very, very popular.

I hope my investments in the bar are paying off in terms of higher guest spending on drinks. Beverages have high margins,

but I'm not sure if beverage sales have increased enough to cover the investments.

Can we say, for beverages, as for food, that "changes in revenues" are a good proxy for "changes in profits?"

Absolutely. I set my profit margins as a fixed percentage of revenues for beverages as well. Last year, the average spending on

beverages per guest per day was $21.

We don't have the consolidated report yet, but I've already had my staff choose a random sample of guests.

We pulled the restaurant and lounge receipts for the guests in the sample and noted three items: total food revenues, total

beverage revenues, and number of guests at the table. Using this information, we should be able to estimate the daily

spending on food and beverages per guest.

You look at Leo's data and wonder how you can discern whether Leo's changes — the new chef and the bar renovations —

have influenced the resort's profits.

Leo has prepared data for you. How are you going to put it to use?

Our first type of hypothesis test is used to study population means. Let's walk through an example of this type of test.

Suppose the manager of a movie theater implemented a new strategy at the beginning of the year: he started showing old

classics instead of recent releases.

He knows that prior to the change in strategy, average customer satisfaction was 6.7 out of a possible 10 points. He would like

to know if average customer satisfaction has changed since he altered his theater's artistic focus.

The manager's null hypothesis states that the current mean satisfaction has not changed; it is still 6.7. We use the Greek letter

mu to represent the current mean satisfaction rating of the theater's entire film-going population.

His alternative hypothesis is the opposite of the null hypothesis: it states that average customer satisfaction is now different.

To substantiate his claim that the mean has changed, the manager takes a random sample of 196 moviegoers. He is careful to

sample across movies, show times, and dates. The mean satisfaction rating for the sample is 7.3, with a standard deviation of

2.8.

Does the fact that the random sample's mean of 7.3 is higher than the historical mean of 6.7 indicate that this year's

moviegoers really are more satisfied?

Or, is the mean still the same, and the manager "just happened" to pick a sample with an unusually high average satisfaction

rating? This is equivalent to asking the question: If the null hypothesis is true — the average satisfaction is still 6.7 — would

we be likely to randomly draw the sample that we did, with average satisfaction 7.3?

To answer this question, we have to first define what we mean by "likely." As in sampling and estimation, we typically use

95% as our threshold level of likelihood.

We then construct a range around the population mean specified by our null hypothesis. The range should be drawn so that if

the null hypothesis is true, 95% of all samples drawn from the population would fall in that range. In other words, we create a

range of likely sample means.

The central limit theorem tells us that the distribution of sample means follows a normal curve, so we can use its familiar

properties to find probabilities. Moreover, the distribution of sample means is centered at our assumed population mean,

mu, and has standard deviation sigma/sqrt(n). We don't know sigma, the underlying population standard deviation, so we

use the sample standard deviation as our best estimate.

As we do when constructing 95% confidence intervals, we create a range extending z*s/sqrt(n) = 1.96*s/sqrt(n) on either side of the mean. However, when we conduct a hypothesis test, we center the range around the mean specified in the null


hypothesis because we always start a hypothesis test by assuming the null hypothesis is true.

In our example, the null hypothesis is that the population mean is 6.7, n is 196, and s is 2.8. Our 95% confidence level translates into a z-value of 1.96. We construct the range of likely sample means:

6.7 ± 1.96 * 2.8/sqrt(196) = 6.7 ± 0.39

This tells us that if the population mean is 6.7, there is a 95% chance that the mean of a randomly selected sample will fall

between 6.3 and 7.1.
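To make the arithmetic concrete, here is a minimal sketch of this calculation in Python. The course itself uses Excel and its utilities; the variable names here are illustrative only.

# Minimal sketch: two-sided range of likely sample means (movie theater example).
# Uses only the Python standard library.
from math import sqrt
from statistics import NormalDist

mu0, s, n = 6.7, 2.8, 196            # null hypothesis mean, sample std dev, sample size
z = NormalDist().inv_cdf(0.975)      # two-sided 95% confidence -> z = 1.96
half_width = z * s / sqrt(n)         # 1.96 * 2.8 / 14 = 0.39
print(f"range: [{mu0 - half_width:.1f}, {mu0 + half_width:.1f}]")   # [6.3, 7.1]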

Now, if we take a sample, and the mean does not fall within the range around 6.7, we can reject the null hypothesis. Why?

Because if the population mean were 6.7, it would be unlikely to collect a sample whose mean falls outside this range.

The region outside the range of likely sample means is called the "rejection region," since we reject the null hypothesis if our

sample mean falls into it. In the movie theater example, the rejection region contains all values less than 6.3 and all values

greater than 7.1.

In this example, the sample mean, 7.3, falls in the rejection region, so we reject the null hypothesis. Whenever we reject the

null hypothesis, we in effect accept the alternative hypothesis.

We conclude that customer satisfaction has indeed changed from the historical mean value of 6.7.

If our sample mean had fallen within the range around 6.7, we could not make a definite statement about moviegoers'

satisfaction. We would not have enough evidence to state that things have changed, but we can never claim that they have

definitely remained the same.

Unless we poll every customer, we'll never know for sure if customer satisfaction has truly changed. Working only with

sample data, there is always a chance that we'll draw the wrong conclusion about the population. We can go wrong in two

ways: rejecting a null hypothesis that is in fact true or failing to reject a null hypothesis that is in fact false. Let's look at the

first of these: the null hypothesis is true, but we reject it.

We choose the confidence level so it is unlikely — but not impossible — for the sample mean to fall in the rejection region

when the null hypothesis is true. In this case, we are using a 95% confidence level, so by unlikely we mean a 5% chance.

However, 5% of all samples from a population with the null hypothesis mean would fall in the rejection region, so when we

reject a null hypothesis, there is a 5% chance we will do so erroneously.

Therefore, when the sample mean falls in the rejection region, we can only be 95% confident that we are justified in rejecting

the null hypothesis. Hence we continue to speak of a confidence level of 95%.

A hypothesis test with a 95% confidence level is said to have a 5% level of significance. A 5% significance level says that there

is a 5% chance of a sample mean falling in the rejection region when the null hypothesis is true. This is what people mean

when they say that something is "statistically significant" at a 5% significance level.

If we increase our confidence level, we widen the range around the null hypothesis mean. At a 99% confidence level, our

range captures 99% of all sample means. This reduces to 1% our chance of rejecting the null hypothesis erroneously. But

doing this has a downside: by decreasing the chance of one type of error, we increase the chance of the other type.

The higher the confidence level, the smaller the rejection region, and the less likely it is that we can reject the null hypothesis

when it is in fact false. This decreases our chance of being able to substantiate the alternative hypothesis when it is true. As

managers, we need to choose the confidence level of our test based on the relative costs of making each type of error.

The range of likely sample means should not be confused with a confidence interval. Confidence intervals are always

constructed around sample means, never around population means. When we construct a confidence interval, we don't even

have an initial estimate of the population mean. Constructing a confidence interval is a process for estimating the population

mean, not for testing particular claims about that mean.

Summary

In a hypothesis test for population means, we assume that the null hypothesis is true. Then, we construct a range of likely

sample means around the null hypothesis mean. If the sample mean we collect falls in the rejection region, we reject the

null hypothesis. Otherwise, we cannot reject the null hypothesis. The confidence level measures how confident we are that

we are justified in rejecting the null hypothesis.

The movie theater manager did not have a strong conviction about the direction of change for customer satisfaction prior

to performing the hypothesis test.

He wanted to test for change in both directions — up or down — and thus he used a two-sided hypothesis test. The null

hypothesis — that no change has taken place — could have been wrong in either of two ways: Customer satisfaction may

have increased or decreased. The two-tailed nature of the test was reflected in the two-sided range we drew around the

population mean.


Sometimes, we may want to know if the actual population mean differs from our initial value of the population mean in a

specific direction. For instance, if the theater manager were quite sure that satisfaction had not decreased, he wouldn't

have to test in that direction; rather, he'd only have to test for positive change.

In these cases, our alternative hypothesis should clearly state which direction of change we want to test for. These kinds of

tests are called one-sided hypothesis tests. Here, we substantiate the claim that the mean has increased only if the sample

mean is sufficiently higher than 6.7, so our rejection region extends only to the right.

Let's outline how to formulate one- and two-sided tests. For a two-sided test we have an initial understanding of the

population: the population mean is equal to a specified initial value.

If we want to substantiate the claim that a population mean has changed, the null hypothesis should state that the mean

still equals that initial value. The alternative hypothesis should state that the mean does not equal that initial value.

If we want to substantiate the claim that the actual population mean is greater than the initial value (the null hypothesis mean), then the null hypothesis should state that the population mean is at most that value. The alternative hypothesis states that the mean is greater than the null hypothesis mean, and the rejection region extends only to the right.

Likewise, if we want to substantiate the claim that a population mean is less than the initial value, the null hypothesis

should state that the mean is at least that initial value. The alternative hypothesis should state that the mean is less than

the null hypothesis mean, and the rejection region extends only to the left.

When we conduct a one-sided hypothesis test, we need to create a one-sided range of likely sample means. Suppose the

theater manager claims that satisfaction improved. As usual, he states the claim he wants to substantiate as his alternative

hypothesis.

The 196-person sample has mean 7.3 and standard deviation 2.8. Does this sample provide sufficient evidence to

substantiate the claim that mean satisfaction increased? To find out, the manager creates a one-sided range: he assumes

the population mean is the null hypothesis mean, 6.7, and finds the range that contains the lower 95% of all sample means.

To find this range, all he needs to do is calculate its upper bound: the value below which 95% of all sample means would fall.

To find out, we use what we know about the cumulative probability under the normal curve: a cumulative probability of

95% corresponds to a z-value of 1.645.


Why is this different from the z-value for a two-sided test with a 95% confidence level? For a two-sided test, the z-value

corresponds to a 97.5% cumulative probability, since 2.5% of the probability is excluded from each tail. For a one-sided

test, the z-value corresponds to a 95% cumulative probability, since 5% of the probability is excluded from the upper tail.


We now have all the information we need to find the upper bound on the range of likely sample means: 6.7 + 1.645 * 2.8/sqrt(196) = 6.7 + 0.33 ≈ 7.0.

The rejection region is everything above the value 7.0. The sample mean falls in the rejection region, so the manager rejects the null hypothesis. He is confident that customer satisfaction is higher.
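A similar sketch, assuming the same sample statistics, computes the one-sided bound:

# Minimal sketch: one-sided (upper) range of likely sample means (same example).
from math import sqrt
from statistics import NormalDist

mu0, s, n, x_bar = 6.7, 2.8, 196, 7.3
z = NormalDist().inv_cdf(0.95)       # one-sided 95% confidence -> z = 1.645
upper = mu0 + z * s / sqrt(n)        # 6.7 + 1.645 * 0.2, about 7.0
print("reject H0" if x_bar > upper else "cannot reject H0")   # 7.3 > 7.0 -> reject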

Summary

When we want to test for change in a specific direction, we use a one-sided test. Instead of finding a range containing

95% of all sample means centered at the null hypothesis mean, we find a one-sided range. We calculate its endpoint

using the cumulative probability under the normal curve.

The Excel Utility link below allows you to perform hypothesis tests for single populations.

Make sure you do at least one example by hand to ensure you thoroughly understand the basic concepts before using the

utility. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to

reproduce the results for the theater manager's example.

A single-population hypothesis test tests a claim using a sample from a single population. With a plan in mind, you take a

look at Leo's sample data.


You are ready to analyze the impact of the changes Leo has made to his restaurant operations. You draw a table to organize

the data from your sample on daily guest spending on restaurant food.

One change Leo made to his restaurant operations was to hire a new chef. He wants to know whether average restaurant

spending per guest has changed since she took over the menu and the kitchen. This is a clear case for a hypothesis test.

Last year's average spending on food per person was $55; this gives you an initial value for the mean.

Leo wants to know if mean spending has changed, so you use a two-sided test. You jot down your null hypothesis, which

states that the average revenue per guest is still $55.

If the null hypothesis is true, the difference between the sample mean of $64 and the initial value of $55 can be accounted for

by chance.

Next, you assume that the null hypothesis is true: the population mean is $55. Now you need to construct a range of likely

sample means around $55 and ask: does the sample mean of $64 fall within that range? Or does it fall in the rejection region?

Leo didn't specify what level of confidence he wanted for your results. You call him for clarification.

I suppose a 95% confidence level is okay. I'd like to be more confident, of course.

After you point out that higher confidence would reduce his chances of being able to substantiate a change in spending if a

change has taken place, he agrees to 95%. You pull out your trusty calculator and get ready to compute a range around the

null hypothesis mean of $55. Consulting your notes, you find the correct formula for the range of likely sample means:

55 ± 1.96 * s/sqrt(n)

Using the sample statistics, you calculate the endpoints of the range containing 95% of all sample means.

You pause for a moment to reflect on the interpretation of this range. Suppose the null hypothesis is true. Then 19 out of 20

samples of this size from the population of hotel guests would have means that would fall in the calculated range.

The sample mean of $64 falls outside of this range. You and Alice report your results to Leo.

Looks like hiring that chef was a good decision. The evidence suggests that mean spending per person has increased.

I'm glad to hear it. Now what about renovating the bar? Can you run a similar test to see if that has affected average beverage

spending?

Leo emphasizes that he can't imagine that his investments in the bar could have reduced average beverage spending per

guest. He wants to know if spending has gone up. You decide to do a one-sided test.

First, you write down all of Leo's data, along with the hypotheses:

You need to find an upper bound such that 95% of all sample means are smaller than it. To do so, you use a z-value of 1.645.

The upper bound is $24.29.

What is the correct interpretation of this number? Given that the null hypothesis is true, 95% of all sample means would fall at or below $24.29.

The range of likely sample means contains the collected sample mean of $24. This tells you that you cannot reject the null hypothesis.

How is this possible? Why hasn't renovating the bar increased revenues? Even if the frozen drink machine didn't pay off,

shouldn't the increase in seats have helped?

First of all, we haven't concluded that average revenue has not increased. We just can't be sure that it has. The fact that our

sample mean is $24 vs. $21 last year does not allow us to say anything definitive about the change in average beverage

revenue.

Remember, we set out to substantiate our hypothesis that spending has improved. Based just on this sample, we are unable

to conclude that spending has increased.

You added seats and now more people can be seated in your lounge. But a greater number of guests does not necessarily

translate into more spending per person.

Your overall revenues may have actually increased, because more guests can be seated in the lounge.


Gosh, I'm glad to hear that. For a moment there, I thought I had made a really bad investment. I'm quite optimistic I'll see a

jump in total beverage revenues in the consolidated report at the end of the year.

Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes toasted pretzel snacks, and

advertises that these pretzels contain an average of 112 calories per serving.

In a recent test, an independent consumer research organization conducted an experiment to see if this claim was true.

Somewhat to their surprise, the researchers found that the average calorie content in a sample of 32 bags was 102 calories

per serving. The standard deviation of the sample was 19.

Blanche would like to know if the calorie content of Oma's pretzels really has changed, so she can market them

appropriately. With 99% confidence, do these data indicate that the pretzels' calorie content has changed?


You begin any hypothesis test by formulating a null and an alternative hypothesis. The null hypothesis states that the

population mean is equal to the initial value. In this problem, the null hypothesis is that the caloric content in the actual

population is what Oma's has always advertised.

The alternative hypothesis should contradict the null hypothesis. For a two-sided test, the alternative hypothesis simply

states that the mean does not equal the initial value. A two-sided test is more appropriate in this problem, since Blanche

only wants to know if the mean calorie content has changed.

You assume that the null hypothesis is true and construct a range of likely sample means around the population mean. Using the data and the appropriate formula, 112 ± 2.576 * 19/sqrt(32) ≈ 112 ± 8.7, you find the range [103.3; 120.7].

The sample mean of 102 falls outside of that range, so you can reject the null hypothesis. Blanche can be 99% confident

that the population mean is not 112.
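As a check on the arithmetic, here is a minimal Python sketch of this 99% range; the variable names are illustrative, not part of the course materials.

# Minimal sketch: 99% two-sided range for the pretzel example.
from math import sqrt
from statistics import NormalDist

mu0, s, n, x_bar = 112, 19, 32, 102
z = NormalDist().inv_cdf(0.995)      # two-sided 99% confidence -> z = 2.576
hw = z * s / sqrt(n)                 # about 8.7 calories
print(f"range: [{mu0 - hw:.1f}, {mu0 + hw:.1f}]")                 # about [103.3, 120.7]
print("reject H0" if abs(x_bar - mu0) > hw else "cannot reject H0")   # reject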

Why might Blanche have chosen a 99% confidence level rather than the more typical 95% level for her test?


The Clearwater Power Company produces electrical power from coal. A local environmental group claims that Clearwater's

emissions have raised sulfur dioxide levels above permissible standards in Blue Sky, the town downwind of the plant.

According to Environmental Protection Agency standards, an acceptable average sulfur dioxide level is 30 parts per billion

(ppb). As Clearwater's PR consultant, you want to defend the company, and you try to anticipate the environmentalist's

argument.

The environmental group collects 36 samples on randomly selected days over the course of a year. It finds a mean sulfur

dioxide content of 35 ppb with a standard deviation of 24 ppb.

The environmentalist group will use a hypothesis test to back up its claim that the sulfur dioxide levels are higher than

permitted. Which of the following is an appropriate null hypothesis for this problem?


The environmentalists' claim is that sulfur dioxide levels are higher, so they will want to run a one-sided test. The

alternative hypothesis states that the sulfur dioxide levels are above the accepted standard. We assume they will choose a

95% confidence level.


They calculate the one-sided range around the null hypothesis mean that contains 95% of all sample means. The z-value for a one-sided 95% range is 1.645. The upper bound on the range of likely sample means is 30 + 1.645 * 24/sqrt(36) = 36.58 ppb.

The sample mean of 35 ppb falls below this upper bound, inside the range of likely sample means, so the environmentalists cannot reject the null hypothesis. The data do not substantiate their claim that average sulfur dioxide levels exceed the standard.

You are the plant manager of a Neshey's chocolate factory. The shop was flooded during the recent storms. The machine

that wraps Neshey's popular chocolate confection, Smooches, still works, but you are afraid it may not be working at its

former capacity.

If the machine isn't working at top capacity, you will need to have it replaced.

The hourly output of the machine is normally distributed. Before the flood, the machine wrapped an average of 340

Smooches per hour. Over the first week after the flood, you counted wrapped Smooches during 32 randomly selected one-

hour periods. The machine averaged 318 Smooches per hour, with a standard deviation of 44.

You conduct a one-sided hypothesis test using a 95% confidence level. According to your calculations, what should you do?

The null hypothesis is that µ ≥ 340. The alternative hypothesis is that µ < 340, since you are using a one-tailed test and you suspect that the new population mean is lower than the population mean before the flood.

Identify the relevant values. The sample size n=32. The standard deviation s=44. The appropriate z-value is 1.645 if you

want to capture 95% of all sample means in a one-sided range around the null hypothesis mean.

Use the formula and calculate the lower bound: 340 - 1.645 * 44/sqrt(32) ≈ 327. The sample mean of 318 falls well outside of the calculated range of

likely sample means. You accept this as strong evidence against the null hypothesis, substantiating the alternative

hypothesis that the mean output rate has dropped. You should replace the machine.
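A minimal sketch of this test in Python, assuming the figures above:

# Minimal sketch: one-sided (lower) test for the Smooches machine.
from math import sqrt
from statistics import NormalDist

mu0, s, n, x_bar = 340, 44, 32, 318
z = NormalDist().inv_cdf(0.95)       # one-sided 95% confidence -> z = 1.645
lower = mu0 - z * s / sqrt(n)        # about 327
print("replace the machine" if x_bar < lower else "cannot reject H0")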

Happy with your work on restaurant spending, Leo jumps right into the next problem. "It's not just the revenue of the

restaurants that I care about," Leo says, "It's also my guests' satisfaction with their restaurant experience."

When I go out to eat, I expect more than just excellent food. The whole dining experience is essential — everything from the

service, to the décor, to the design and quality of the silverware.

And it's not just that all of these factors must be excellent individually — they have to fit together. The restaurant has to have

ambiance! I'm sure my guests have similar expectations, and I want to be sure my restaurant meets them.

Since my new chef introduced more sophisticated cuisine, I made some changes to the décor that I think have improved the

ambiance.

It took me a long time and a substantial amount of money to get everything right, but I'm pleased with the result: the

restaurants are elegant and distinctly Hawaiian. Just like the new chef's cuisine.

In the past, I've contracted a local market research firm to conduct surveys, asking guests to rate the Kahana's restaurants'

ambiance on a scale of one to five.

Historically, the percentage of people that rated ambiance the top score of 5 gave me a good idea of how well we were doing.

That percentage has been very high: 72%.

I've collected this year's data for you. Can you figure out if my guests are happier with my restaurants' ambiance?

Alice tells you that testing Leo's claim about a proportion will be very similar to testing a mean.

Often the summary statistic we want to make a claim about is a proportion. How do we test a hypothesis about a population

proportion instead of a population mean?

We know from our work with confidence intervals that the processes for estimating population proportions and population

means are virtually identical. Similarly, hypothesis tests for proportions are much like hypothesis tests for means.

Because we are examining a population proportion instead of a population mean, we use slightly different notation: we use a

lower case p to represent the population proportion in place of µ for a population mean. We construct a hypothesis test to

test a claim about the value of p.


Again, we formulate null and alternative hypotheses. Based on conventional wisdom or past experience, we have an initial

understanding of the population proportion. The null hypothesis for a proportion test states the initial understanding. For

example, in a two-sided test, the null hypothesis asserts that the population proportion, p, is equal to the initial value we had

in mind.

The alternative hypothesis is the claim we are using the hypothesis test to substantiate. The alternative hypothesis typically

states the opposite of the null hypothesis: it states that our initial understanding is incorrect.

As with population means, we collect a random sample and calculate the sample proportion, "p-bar." However, for a

hypothesis test about a population proportion, we don't need to calculate a standard deviation from the sample.

Statistical theory tells us that σ, the standard deviation of the underlying population, is the square root of p*(1 - p). Since we always start the test assuming the null hypothesis is true, we calculate σ using the null hypothesis proportion.

Analogously to population mean tests, we create a range of likely sample proportions around the null hypothesis proportion. To create the range, we substitute sqrt(p*(1 - p)) for σ in the usual formula: p ± z * sqrt(p*(1 - p))/sqrt(n).

If our sample proportion falls outside the range of likely sample proportions, we reject the null hypothesis. Otherwise, we

cannot reject the null hypothesis.
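If you prefer code to the Excel utility, here is a small helper sketching the same calculation in Python; the function name and defaults are our own, not part of the course.

# Sketch of a helper for proportion tests (illustrative, standard library only).
from math import sqrt
from statistics import NormalDist

def proportion_bounds(p0, n, confidence=0.95, two_sided=True):
    """Return (lower, upper) bounds of likely sample proportions around p0.

    For a one-sided test, use only the relevant bound."""
    sigma = sqrt(p0 * (1 - p0))                        # std dev under H0
    q = 1 - (1 - confidence) / 2 if two_sided else confidence
    z = NormalDist().inv_cdf(q)
    hw = z * sigma / sqrt(n)
    return p0 - hw, p0 + hw

print(proportion_bounds(0.72, 126))                    # two-sided example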

Summary

In a hypothesis test for population proportions, we assume that the null hypothesis is true. Then, we construct a range of

likely sample proportions around the null hypothesis proportion. If the sample proportion we collect falls in the rejection

region, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis.

Once you understand hypothesis testing for means, using the same techniques on proportions is easy.

By now, you're familiar with the concept of testing a hypothesis. You recognize that Leo's restaurant ambiance problem calls

for a hypothesis test for a population proportion.

Leo wants you to find out if the proportion of his guests that rate restaurant ambiance "excellent" has increased. Historically,

that population proportion has been 0.72. Since Leo wants to see if there has been positive change, you do a one-sided test.

You are doing a one-sided test to see if the proportion of guests rating the restaurant "excellent" has increased. The

alternative hypothesis states that the proportion has increased, and the null hypothesis states that it has not increased.

You look at Leo's data. The sample proportion is 0.81 and the sample size is 126.


Here's how you find the standard deviation for a proportion problem: σ = sqrt(p*(1 - p)). Using the null hypothesis proportion, you calculate the standard deviation: sqrt(0.72 * 0.28) ≈ 0.45.

Leo wanted you to use a 95% confidence level. Now you're ready to construct a range of likely sample proportions around the null hypothesis value of the population proportion: 0.72.

Find the range of likely sample proportions around the null hypothesis proportion, and formulate a short answer for Leo.


A one-sided test calls for a one-sided range of likely sample proportions. You need to find the upper bound for this range such

that the range captures the lower 95% of the sample proportions.

The z-value for a one-sided 95% confidence level is 1.645. Substitute the null hypothesis proportion, 0.72, for p. The upper bound for the range containing the lower 95% of all sample proportions is 0.72 + 1.645 * 0.45/sqrt(126) ≈ 0.79.

Since the sample proportion 0.81 falls in the rejection region, you reject the null hypothesis. The data provide sufficient evidence that the population proportion has, in fact, increased.
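A minimal sketch of Leo's ambiance test in Python, assuming the sample figures above:

# Minimal sketch: one-sided proportion test for the ambiance ratings.
from math import sqrt
from statistics import NormalDist

p0, n, p_bar = 0.72, 126, 0.81
sigma = sqrt(p0 * (1 - p0))                                   # about 0.45
upper = p0 + NormalDist().inv_cdf(0.95) * sigma / sqrt(n)     # about 0.79
print("reject H0" if p_bar > upper else "cannot reject H0")   # 0.81 > 0.79 -> reject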

Alice presents your findings to Leo, telling him that with 95% confidence, the data you collected indicate that the difference

between the historical population proportion and the proportion of the random sample is not due to chance.


The proportion of your guests that rate the restaurants' ambiance as "excellent" has increased.

Luther Lenya, the new product guru of The Ventura Automotive Insurance Company, is considering marketing a special

insurance package to members of certain professional groups.

To find out what rate to charge for this package, Luther conducts a preliminary study to see if health professionals are less

likely to be involved in car accidents than the rest of his customer base.

If the data indicate that health professionals are less likely to be involved in car accidents, then Ventura can offer health

professionals a lower, more competitive rate.

In the past 5 years, 8.3% of Ventura's customers have been involved in accidents. Which is the correct pair of hypotheses for solving Luther's problem? Since Luther wants to substantiate the claim that health professionals have fewer accidents, the null hypothesis states that the accident proportion among health professionals is at least 0.083, and the alternative hypothesis states that it is less than 0.083.

A sample of 240 customers in the health profession reveals that 12 (5.0%) have had accidents.

If he uses a 95% confidence level, which of the following is the best conclusion Luther can come to?


You need to find a range of likely sample proportions. To find this range, you calculate a standard deviation: sqrt(0.083 * 0.917) ≈ 0.28.

For a one-sided test, a confidence level of 95% corresponds to a z-value of 1.645. The lower bound of the range is 0.083 - 1.645 * sqrt(0.083 * 0.917)/sqrt(240) ≈ 0.054 = 5.4%.

The range of likely sample proportions does not contain 5.0%, so you should reject the null hypothesis. With 95%

confidence, the proportion of health professionals involved in car accidents is lower than the proportion of the overall

population of drivers.

P-Values

After sleeping over your analysis of restaurant operations, Leo seems unsatisfied.

Don't get me wrong, I appreciate your hard work. But look here: these hypothesis tests result in a "reject/don't reject"

decision. If I understand you correctly, it doesn't matter how close to the border of the rejection region our sample statistic

falls: "reject" is "reject."

But can't you tell me more? I want to know how strong the evidence against the null hypothesis is, not just if it is strong

enough.

I'm glad you brought that issue up, Leo. We have a second method of doing hypothesis tests, one that provides a measure of

the strength of the evidence.

P-Values

The evening before, Alice had acquainted you with p-values: "We can use the p-value method of hypothesis testing to make

'reject/not reject' decisions in the same way we have been doing all along. But the p-value also measures the strength of

evidence against a null hypothesis."

In hypothesis tests we've done so far, we first chose the confidence level of the test. The confidence level tells us the

significance level of the test, which is simply 1 minus the confidence level.

Typically, we chose a 5% significance level — a 95% confidence level — as our threshold value for rejection. Assuming that the

null hypothesis is true, we reasoned that certain sample mean values are less likely to appear than others. If the mean of the

sample we collected was sufficiently unlikely to appear (that is, less than 5% likely) we considered the null hypothesis

implausible and rejected it.

Now, rather than simply checking whether the likelihood of collecting our sample is above or below our chosen threshold,

we'll ask: if the null hypothesis is true, how likely is it to choose a sample with a mean at least as far from the null


hypothesis mean as the sample mean we collected?

The "p-value" measures this likelihood: it tells us how likely it is to collect a sample mean that falls at least a certain distance

from the null hypothesis mean.

In the familiar hypothesis testing procedure, if the p-value is less than our threshold of 5%, we reject our null hypothesis.

The p-value does more than simply answer the question of whether or not we can reject the hypothesis. It also indicates the

strength of the evidence for rejecting the null hypothesis. For example, if the p-value is 0.049, we barely have enough

evidence to reject the null hypothesis at the 0.05 level of significance; if it is 0.001, we have strong evidence for rejecting the

null hypothesis.

Let's look at an example. Recall the movie theater manager who wanted to know if the average satisfaction rate for his

clientele had changed from its historical rate of 6.7.

To find out, we constructed the range, 6.3 to 7.1, which would have contained 95% of the sample means if the null hypothesis

mean had still been true. Since the mean of the sample of current moviegoers we collected, 7.3, fell outside of that range, we

rejected the null hypothesis.

Because 7.3 fell in the rejection region, we know that the likelihood of collecting a sample mean as extreme as 7.3 is less than

5% if the null hypothesis is true. Now let's find out exactly how unlikely it is by calculating the p-value.

Calculating the p-value is a little tricky, but we have all the tools we need to do it. Recall that for samples of sufficient size, the

sample means of any population are distributed normally.

To calculate the likelihood of a certain range of sample mean values — in our example, sample mean values greater than 7.3

or less than 6.1 — we just need to find the appropriate area under the distribution curve of the sample means.

To calculate the p-value for this two-sided test, we want to find the area under the normal curve to the right of 7.3 and to the

left of 6.1. The standard deviation in this example is 2.8, and the sample size is 196.

We can calculate this probability by first calculating the z-value associated with the value 7.3: z = (7.3 - 6.7)/(2.8/sqrt(196)) = 0.6/0.2 = 3.

Then, we find the probability of having a z-value less than -3 or greater than 3.

The area to the left of the z-value of -3 is 0.00135. The area to the right of the z-value of +3 is the same size, so the total area is

0.0027. That is our p-value. These areas and the p-value can be found in Excel using the NORMSDIST(-3) function, in the z-

table, or with the Excel utility provided.
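The same p-value can be sketched in Python, where NormalDist().cdf plays the role of Excel's NORMSDIST; this is a standard-library equivalent, not the course's utility.

# Minimal sketch: two-sided p-value for the movie theater example.
from math import sqrt
from statistics import NormalDist

z = (7.3 - 6.7) / (2.8 / sqrt(196))      # z = 3.0
p_value = 2 * NormalDist().cdf(-abs(z))  # both tails: 2 * 0.00135 = 0.0027
print(f"p-value: {p_value:.4f}")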


Our p-value calculation tells us that the probability of collecting a sample mean at least as far from 6.7 as 7.3 is 0.0027. The

p-value is lower than 0.05. Thus, at a significance level of 0.05, we would reject the null hypothesis and conclude that

moviegoers' average satisfaction rating is no longer 6.7.


But the p-value 0.0027 is much smaller than 0.05. Thus, we can reject the null hypothesis at 0.0027, a much lower

significance level. In other words, we can reject the null hypothesis with 99.73% confidence. In general, the lower the p-value,

the higher our confidence in rejecting the null hypothesis.

One-sided hypothesis tests are also easily conducted with p-values. For one-sided tests, the p-value is the area under one side

of the curve. In our movie theater example, if the alternative hypothesis states that the population mean is larger than 6.7, the

p-value is the area under the normal curve to the right of the sample mean of 7.3.

Summary

The p-value measures the strength of the evidence against the null hypothesis. It is the likelihood, assuming that the null

hypothesis is true, of collecting a sample mean at least as far from the null hypothesis mean as the sample actually

collected. We compare the p-value to the threshold significance level to make a reject/not reject decision. The p-value also

tells us how comfortable we can be with that decision.

Now Alice explains the basics of p-values to Leo, so you can present the results of your restaurant revenue hypothesis test

again. This time, you'll be able to give Leo an idea of how strong the statistical evidence is.


Leo wants you to complete the p-value hypothesis test right there in his office. You're a little nervous — you've never had a

client peering over your shoulder when you work. But you oblige him, because you're growing more confident of your

statistical skills.

Looking back at your notes on the problem, you find the data and the hypotheses. You make a mental note that you are doing

a two-sided test to see whether or not average spending on food has changed from its historical level of $55.

When you ran the hypothesis test earlier, I had you use a 95% confidence level. That corresponds to a significance level of 0.5,

right?

To find the significance level corresponding to a confidence level of 95%, simply subtract 95% from 100%, and convert into

decimal notation: 0.05.

After you clarify Leo's mistake, he sits back and lets you finish your analysis without further interruption. First, you find the

appropriate z-value. Enter the z-value as a decimal number with 2 digits to the right of the decimal, (e.g., enter "5" as "5.00").

Round if necessary.


Your sample has a mean (x-bar = $64) that is $9 higher than the assumed population mean, $55. You want to calculate the

likelihood of getting a sample mean that is at least as far from the population mean as x-bar.

That likelihood is not just the tail probability to the right of the sample mean. Sample means on the other side of the normal

curve are just as far from the population mean as x-bar. They must be included, too, when you calculate the p-value for a two-

sided hypothesis test.

The z-value is 3.00, giving a right-tail probability of 0.00135. Doubling the right-tail probability gives you the correct p-value: 0.0027.

All we have to do is compare the p-value to the significance level. The p-value 0.0027 is less than the significance level 0.05.

Our data are statistically significant at the 0.05 level.

Just as we calculated earlier by constructing a range around the null hypothesis mean, the p-value method suggests that we

reject the null hypothesis. With 95% confidence, average food spending per guest has changed.

But now we can also see that the evidence is very strong, because the p-value is much lower than the significance level. We

can claim that food spending has changed at the 0.0027 level of significance. Thanks, you two. I feel much more comfortable

concluding that average guest spending in my restaurant has changed.

In the following exercise you will revisit an earlier problem, this time solving it with the p-value method.

Blanche McCarthy is the marketing director of Oma's Own snack food company. Oma's makes toasted pretzel snacks. Each

bag of pretzels contains one serving, and Oma's advertises that the pretzel snacks contain an average of 112 calories per

serving.

In a recent test, an independent consumer research organization conducted an experiment to see if this claim was true. The

researchers found that the average calorie content in a sample of 32 bags was 102 calories per serving. The standard

deviation of the sample was 19.

Blanche would like to know if the calorie content of Oma's pretzels has really changed, so she can market them

appropriately. At the significance level 0.01, do these data indicate that the pretzels' calorie content has changed?


In this problem, the null hypothesis is that the actual population mean is what Oma's has always advertised. A two-sided

test is more appropriate in this problem, since Blanche only wants to know if the mean calorie content has changed.


Assuming that the null hypothesis is true, you find a z-value for the sample mean of 102 using the appropriate formula: z = (102 - 112)/(19/sqrt(32)) ≈ -2.98.

Using the Excel NORMSDIST function or the Standard Normal Table, you can find the corresponding left-tail probability

of 0.0014. For a two-sided test, you double this number to find the p-value, in this case 0.0028.

Since this p-value is less than the significance level, you can reject the null hypothesis. Moreover, you now can say that you

are rejecting the null hypothesis at the 0.0028 level of significance. You can recommend to Blanche that she have the

labeling changed on the pretzel bags, and adjust her marketing accordingly.
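A minimal sketch of this p-value calculation, assuming the sample statistics above:

# Minimal sketch: the pretzel test redone with a p-value.
from math import sqrt
from statistics import NormalDist

z = (102 - 112) / (19 / sqrt(32))        # about -2.98
p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value, about 0.003
print(p_value < 0.01)                    # True -> reject at the 0.01 level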

In the following challenge you will revisit an earlier problem, this time solving it with the p-value method.

You are the plant manager of a Neshey's chocolate factory. The shop was flooded during the recent storms. The machine

that wraps Neshey's popular chocolate confection, Smooches, still works, but you are afraid it may not be working at its

former capacity.

If the machine isn't working at top capacity, you will need to have it replaced.

The hourly output of the machine is normally distributed. Before the flood, the machine wrapped an average of 340

Smooches per hour. Over the first week after the flood, you counted wrapped Smooches during 32 randomly selected one-

hour periods. The machine averaged 318 Smooches per hour, with a standard deviation of 44.

You conduct a one-sided hypothesis test using a 95% confidence level. According to your calculations, what should you do?


The null hypothesis is that the population mean is no lower than 340. The alternative hypothesis is that the population

mean is now less than 340. You are using a one-tailed test, and you are assuming that the new population mean is lower

than the population mean before the flood.

Identify the values of the relevant quantities. Use the appropriate formula and calculate the z-value: z = (318 - 340)/(44/sqrt(32)) ≈ -2.83.

This z-value corresponds to a left-tail probability of 0.0023. This is the tail you are interested in, since you are conducting a

one-sided test to see if the actual population mean is less than it was in the past. This tail probability is the p-value.

Since the p-value is less than the significance level, you reject the null hypothesis that the population mean is unchanged.

Moreover, you now can say that you are rejecting the null hypothesis at the 0.0023 level of significance. You should replace

the machine.
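Sketched in Python, assuming the same figures:

# Minimal sketch: one-sided p-value for the Smooches machine.
from math import sqrt
from statistics import NormalDist

z = (318 - 340) / (44 / sqrt(32))        # about -2.83
p_value = NormalDist().cdf(z)            # left tail only, about 0.0023
print("replace the machine" if p_value < 0.05 else "cannot reject H0")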

Now satisfied with your analysis of the restaurant, Leo asks you to compare the discretionary spending habits of two categories

of guests: leisure and business.

Every hotel manager wrestles with the problem of stretching limited marketing resources. I want to make sure that I'm wisely

allocating each marketing dollar.

Leisure guests, such as tourists and honeymooners, are especially attracted to Hawaii. Also, many professional associations

like to have their conventions here, so our islands attract business travelers, who mix business and pleasure.

Business travelers pay lower room prices because conferences book rooms in bulk. Bulk reservations are good for me because

they keep my occupancy levels high.

However, I don't have a good sense of whether the discretionary spending of my business guests is different from that of my

leisure guests: they may take fewer scuba lessons but use the spa services more, for example.

Can you help me figure out whether there is any significant difference between leisure and business travelers' discretionary

spending habits? Your conclusions might influence my marketing efforts.

I collected two random samples: one of leisure guests and one of business guests. Not including room, meal, and beverage charges, leisure travelers spent an average of $75 a day, compared to $64 a day for the business travelers.

I knew that the difference between the two averages of the two samples could be due to chance, so I thought I'd have you do a

hypothesis test to find out.


When I was compiling the data for you, I realized that my samples were of different sizes. I was able to get 85 leisure guests to

respond, but only 76 business guests returned my survey.

Which figure will you use as the sample size? Or will you add them together?

I also realized that with these data, you'd have to calculate two sample standard deviations, one for each sample. How do you

go about solving a problem like this?

How do you test whether two populations have different means?

So far, we've used hypothesis tests to study the mean or proportion of a single population. Often, managers want to compare

the means or proportions of two different populations: in this case, we use a two-population hypothesis test. Let's clarify

when we use each type of test.

We conduct single-population tests when we have an initial value for a population mean and want to test to see if it is correct.

Single population tests are especially useful when we suspect that the population mean has changed. For example, we use a

single-population test when we know the historical average of a population and want to test whether that historical average

has changed.

We conduct two-population tests to compare a characteristic of two groups for which we have access to sample data for each

group. For example, we'd use a two-population test to study which of two educational software packages better prepares

students for the GMAT. Do the students using package 1 perform better on the GMAT than the students using package 2?

In two-population tests, we take two samples, one from each population. For each sample, we calculate the sample mean,

standard deviation, and sample size.

We can then use the two sets of sample data to test claims about differences between the two populations. For example, when

we want to know whether two populations have different means, we formulate a null hypothesis stating that the means are

not different: the first population mean is equal to the second.

Let's look at the GMAT software package example more closely. The manager of one educational software company might

wonder if the average GMAT score of students using her software is different from the average GMAT score of students using

the competitor's software.

Since the manager only wants to test if the average GMAT scores are different, she conducts a two-sided hypothesis test for

two populations. The null hypothesis states that there is no difference between the average GMAT scores of the students who

use the two companies' software.

The alternative hypothesis states that the average GMAT scores of the students who use the two companies' software are

different.

We denote the average scores of the two populations by the Greek letter mu and distinguish them with subscripts. Our hypotheses are:

Null hypothesis: mu_1 = mu_2. Alternative hypothesis: mu_1 ≠ mu_2.

To be 95% confident in the result of the test, we use a significance level of 0.05.

We collect two samples, one from each population. We denote the sample means with the familiar x-bar, which we again

distinguish with subscripts.

We are able to collect the GMAT scores of 45 people who used the company's software, and 36 people who used the

competitor's software. As we will see shortly, the different sample sizes will not pose a problem.

The respective sample means are 650 and 630, and the standard deviations are 60 and 50.

Could the two random samples we picked just happen to have different means by chance, even though they came from populations with the same population mean?

The null hypothesis states that there is no difference in the two population means. As with single-population tests, we test the

null hypothesis by asking how likely it would be to produce the sample results if the null hypothesis is in fact true.

That is, if the average GMAT scores for students using the two different software packages actually are the same, what is the

chance that two samples we collect would have sample means as different as 650 and 630?

Our intuition tells us that the greater the difference between the means of the two samples, the more likely it is that the

samples came from different populations. But how do we know when the numerical difference is large enough to be

statistically significant? When do we have enough evidence to actually conclude that the two populations must be different?

We use p-values to answer this question. First, we calculate a z-value for the difference of the sample means, incorporating the data from both populations. It looks a bit complicated:

z = [(x-bar_1 - x-bar_2) - (mu_1 - mu_2)] / sqrt(s_1^2/n_1 + s_2^2/n_2)

Let's compute the z-value for our example. Since we assume that the null hypothesis is true, mu_1 - mu_2 = 0, and we have:

z = (650 - 630) / sqrt(60^2/45 + 50^2/36) = 20/12.2 ≈ 1.64

For a two-sided test, a z-value of 1.64 translates into a probability in one tail of 0.05, and thus a p-value of 0.10.

Since this p-value is greater than the significance level of 0.05, we cannot reject the null hypothesis.

In other words, the high p-value tells us that there is insufficient evidence from the two samples to conclude that the average

GMAT score of the students who use the company's software is different from the average GMAT score of students who use

the competitor's software.
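A minimal sketch of the GMAT comparison in Python; the variable names are ours, not the course's.

# Minimal sketch: two-population test of means for the GMAT example.
from math import sqrt
from statistics import NormalDist

x1, s1, n1 = 650, 60, 45                          # company's software
x2, s2, n2 = 630, 50, 36                          # competitor's software
z = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)     # about 1.64
p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # about 0.10
print("reject H0" if p_value < 0.05 else "cannot reject H0")   # cannot reject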

Two-population hypothesis tests can be performed using the formula shown above, or with the Excel utility for hypothesis testing.

Summary

In a hypothesis test for two population means, we assume a null hypothesis: that the two population means are equal. We

collect a sample from each population and calculate its sample statistics. We calculate a p-value for the difference between

the two samples. If the p-value is less than the significance level, we reject the null hypothesis.

Often, managers want to know if two population proportions are equal. For example, a marketing manager of a packaged

snack foods company might want to compare the snack food habits of different states in the US.

The marketing manager might think that the proportion of consumers who favor potato chips in Texas is different from the

proportion of consumers who favor potato chips in Oklahoma.

Comparing two population proportions is similar to comparing two population means. We have two populations: the null

hypothesis states that their proportions are the same; the alternative hypothesis states that they are different.

We collect a sample from each population and calculate its sample size and sample proportion. As in the single population

proportion test, we don't need to find the sample standard deviation, since we know that the population standard deviation

is the square root of

[p*(1 - p)].

Similarly to the hypothesis tests for comparing two population means, we calculate a z-value for the difference between the proportions using the formula below:

z = [(p-bar_1 - p-bar_2) - (p_1 - p_2)] / sqrt[p-bar_1*(1 - p-bar_1)/n_1 + p-bar_2*(1 - p-bar_2)/n_2]

We translate the z-value into a p-value just as we would for any other type of hypothesis test. If the p-value is less than our

significance level, we reject the null hypothesis and conclude that the proportions are different. If the p-value is greater

than the significance level, we do not reject the null hypothesis.

Optional Example

Let's take a closer look at the study of snacking habits in Texas and Oklahoma.

The manager does not wish to test for a particular direction of difference; he just wants to know if the proportions are

different. Thus, he should use a two-sided test.

The marketing manager wants to be 95% confident in the result of this test, so the significance level is 0.05.

Suppose we collect responses from 400 people in Texas and 225 people in Oklahoma. The sample proportions are 45%

and 35%, respectively.

Could the two random samples we picked just happen to have different sample proportions? That is, if the true

proportions of Texans and Oklahomans favoring potato chips actually are the same, what would be the chance that the

sample proportions are 45% and 35% respectively?

We use p-values to answer this question. First, we calculate a z-value for the difference of the sample proportions that incorporates the data from both populations. The null hypothesis states that the population proportions are equal, so their difference is 0:

z = (0.45 - 0.35) / sqrt(0.45*0.55/400 + 0.35*0.65/225) ≈ 2.48

For a two-sided test, a z-value of 2.48 translates into a probability in one tail of 0.0065 and hence a p-value of 0.013.

Since this p-value is less than the significance level of 0.05, we can reject the null hypothesis.


In other words, the low p-value tells us that there is sufficient evidence from the samples to conclude that there is a

difference between the proportions of Texan and Oklahoman potato chip lovers. We can make this claim at a 0.013 level

of significance.
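The same arithmetic, sketched in Python with illustrative variable names:

# Minimal sketch: two-population test of proportions for the potato chip study.
from math import sqrt
from statistics import NormalDist

p1, n1 = 0.45, 400                                 # Texas sample
p2, n2 = 0.35, 225                                 # Oklahoma sample
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = (p1 - p2) / se                                 # about 2.48
p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # about 0.013
print("reject H0" if p_value < 0.05 else "cannot reject H0")   # reject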

Two-population hypothesis tests for population proportions can be performed using the formula shown above, or with the Excel utility for hypothesis tests.

Summary

In a hypothesis test for two population proportions, we assume a null hypothesis: the two population proportions are

equal. We collect two samples and calculate the sample proportions. We calculate a p-value for the difference between

the sample proportions. If the p-value is less than the significance level, we reject the null hypothesis.

Click here to open an Excel Utility that allows you to perform hypothesis tests for two populations.

Make sure you do at least one example by hand to ensure you thoroughly understand the basic concepts before using the

utility. You should enter data only in the yellow input areas of the utility. To ensure you are using the utility correctly, try to

reproduce the results for the GMAT and potato chip examples.

Two-population hypothesis tests help you determine whether two populations have different means. You use a two-

population test to solve Leo's problem.

You have to find out if leisure guests' average daily discretionary spending is different from business guests' average daily

discretionary spending.

Now it's time to state the null hypothesis. The best formulation: the average daily discretionary spending of leisure guests is equal to the average daily discretionary spending of business guests.

You want to find out if the means of two populations — average spending by leisure guests vs. average spending by business

guests — are different. The two samples from those populations have different means: $75 and $64, respectively.

The samples may come from populations with the same means, and the numerical difference is due to chance in getting these

particular samples. It could be that the first sample just happened to have a high mean and the second sample just happened

to have a lower mean.

You test the null hypothesis that the population means have the same value.

You make note of your null hypothesis and the corresponding alternative hypothesis. You use a two-sided test because you

don't have any reason to believe that one type of guest spends more than the other.

At Leo's request you do a p-value test using a significance level of 0.05. To calculate the p-value, you first find the z-value.

Enter the z-value as a decimal number with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if necessary.


The z-value is +2.12 or -2.12, depending on how you set up the difference, i.e., in what order you subtract the sample means.

Either way, the final conclusion will be the same.

You use the z-value to calculate the p-value. What is the p-value?


A z-value of 2.12 has a cumulative probability of 0.9830. You subtract this probability from the total probability, 1, for a right-

tail probability of 0.017.

Because we are doing a two-sided test, we want to measure the probability of extreme values on both sides. Thus, we double 0.017 to get a p-value of 0.034.

Since the p-value 0.034 is less than the significance level 0.05, you recommend to Leo that he reject the null hypothesis. The

average daily discretionary spending per person is different for leisure and business guests.


I see. We can tell if two population means are different by running a hypothesis test on their difference. We test the null

hypothesis that there is no difference.

As you see, the p-value is less than the significance level. This tells you that...

...the difference in the two sample means is probably not due to chance. The spending habits of the two types of guests are

different. Got it.

Claude Forbes is a regional manager of The Burger Baron restaurant chain. The Baron would like to open a franchise in

Sappington. Two properties are up for sale, and Claude wants to choose the location that has the most traffic.

On 54 randomly selected days over a six-month period, an average of 92 cars passed by location A during the lunch hour,

from noon to 1 p.m. On 62 randomly selected days, 101 cars passed by location B during the lunch hour. The standard

deviations for the two locations were 16 and 23, respectively.

Is the difference between the amount of traffic at locations A and B statistically significant at the 0.01 level?


First, you set up the hypotheses. The null hypothesis states that there is no difference in mean traffic flow during the lunch

hour at the two locations. Since you are only interested in whether the traffic is different at the two locations, you conduct a

two-sided hypothesis test.

Next, you find the z-value for the difference between the two sample means using the appropriate formula: z = (92 - 101) / sqrt(16^2/54 + 23^2/62) ≈ -2.47.

A z-value of -2.47 corresponds to a left-tail probability of 0.0068. Since you are doing a two-sided test, you must double

this probability to find the correct p-value: 0.0136.

The p-value is higher than the significance level. You cannot reject the null hypothesis. On the basis of these data, Claude

has insufficient evidence to show that one location has more traffic than the other.

The Magical Toy company manufactures a line of wrestling action figures. Maude Troston, the head of the doll department,

must make the final decision regarding which of two models, "Karnivorous Kong" or "Peter the Pipsqueak," should be

discontinued.

Magical Toy ships crates containing equal numbers of Pipsqueaks and Kongs to retail outlets. 45 randomly selected toy

stores have sold an average of 78% of all Pipsqueaks delivered to them. 48 other stores have sold an average of 84% of their

Kongs.

At a significance level of 0.05, do the data indicate that one action figure sells better than the other?

z-table

Utility for Two Populations

Maude wants to know if one figure sells better, but doesn't have a preconception about which figure might be flying off the

shelves. Thus you use a two-sided test, and your alternative hypothesis simply states that there is a difference between the

two population proportions.

Using the appropriate formula and the data, you find the z-value for the difference between the sample proportions. The z-

value is -0.738.

A z-value of -0.738 corresponds to a left-tail probability of 0.2303. You double this to find the p-value for the two-sided

test. The p-value is 0.4606.

The p-value is greater than the significance level. Therefore, you can't reject the null hypothesis. The data do not show

conclusively that one action figure sells better than the other.

Grapefruit Bizarre

The Regal Beverage Company makes the soft drink Grapefruit Bizarre. The marketing department wants to refocus its

energies and resources.

You have been asked to determine if there are regional differences in consumers' response to advertisements for Grapefruit

Bizarre. Specifically, you must find out if the Midwest responds to Grapefruit Bizarre advertisements as well as the West


Coast.

The marketing department is clamoring to start a second campaign. It claims that ads that are effective on the West Coast

do not go over as well in the Midwest. Management demands statistical evidence at a significance level of 0.05.

In the context of a free movie screening, an ad for Grapefruit Bizarre is shown to 173 Midwesterners. The viewers had been

randomly selected, and had not previously tasted the drink.

When asked later, 33% claimed that they were at least mildly interested in trying Grapefruit Bizarre. In a similar survey

conducted on the West Coast, 42% of 152 test subjects claimed at least a mild interest in trying Grapefruit Bizarre.

Enter your z-value as a decimal number with 2 digits to the right of the decimal (e.g., enter "5" as "5.00"). Round if

necessary.

z-table

Utility for Two Populations

The z-value is +1.68 or -1.68, depending on how you set up the difference, i.e., in what order you subtract the sample proportions.

A z-value of -1.68 corresponds to a left-tail probability of 0.0465. What do you report to the marketing department?

z-table

Utility for Two Populations

You are conducting a one-sided hypothesis test. The alternative hypothesis states that the population proportion of interested

consumers in the Midwest is less than the population proportion on the West Coast. Therefore, you are interested only in the left-tail probability.

Your p-value is 0.0465.

0.0465 is less than the significance level 0.05. You should reject the null hypothesis. Midwesterners do not respond to the

ads as well as people from the West Coast. Marketing's claims are valid.
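As a check on the one-sided test, here is a hedged Python sketch; it assumes the unpooled standard error for a difference of two sample proportions, sqrt(p1(1-p1)/n1 + p2(1-p2)/n2), which matches the z-value quoted above:

```python
# A sketch of the one-sided z-test for the Grapefruit Bizarre proportions.
# The unpooled standard error is assumed here; a pooled version gives a
# nearly identical z-value for these data.
from math import sqrt
from scipy.stats import norm

p_mw, n_mw = 0.33, 173   # Midwest sample
p_wc, n_wc = 0.42, 152   # West Coast sample

se = sqrt(p_mw * (1 - p_mw) / n_mw + p_wc * (1 - p_wc) / n_wc)
z = (p_mw - p_wc) / se    # about -1.68
p_value = norm.cdf(z)     # one-sided test: left-tail probability only
print(round(z, 2), round(p_value, 4))   # -1.68 0.0468 (table: 0.0465)
```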

Upscale fashion designer Marjorie LeMer must decide from which supplier she should purchase bolts of cloth. Rumor has

it that BlueTex's product is superior to Southern Halifax's.

Random 10-yard sections from 43 bolts of Halifax's cloth contain a mean of 1.8 flaws per yard. Similar sections from 42

bolts of BlueTex's product contain 1.6 flaws per yard. The standard deviations are 0.3 and 0.6, respectively.

Marjorie wants you to find out if the rumors that BlueTex makes a better product are statistically warranted.

You conduct a one-sided test. Which of the following is the best alternative hypothesis?

z-table

Utility for Two Populations

z-table

Utility for Two Populations

You find the z-value for the difference between the two sample means using the appropriate formula. The z-value is -1.94.

The cumulative probability for z=-1.94 is 0.0262. This is the left-tail probability. Since you are running a one-sided test,

0.0262 is your p-value.

0.0262 is greater than 0.01. At this significance level, you would not reject the null hypothesis. 0.0262 is less than 0.05. At

this significance level, you would reject the null hypothesis.

Southern Halifax's product is slightly cheaper than BlueTex's. All other factors being equal, Marjorie would like to buy the

less expensive product. Unless she is 99% confident that there is a difference in quality, she will go with the cheaper cloth.

Based on this information and your calculations, what should Marjorie do?

z-table

Utility for Two Populations

The data are not significant at the 0.01 level, so you can't reject the null hypothesis that the two products are of equal quality.


A significance level of 0.01 corresponds to a confidence level of 99%. So at a 99% confidence level, you can't reject the null

hypothesis. Marjorie can't be 99% confident that BlueTex's product is better than Halifax's.

Since Halifax's product is cheaper and you can't determine a difference in quality at the level of statistical significance

Marjorie requires, you recommend Halifax to Marjorie.

"Good work!" says Alice. "You're ready for a new challenge: investigating relationships between variables."

Regression Basics

Introduction

As you relax in your room during a brief afternoon downpour, your phone rings.

Leo just called. He wants us to come to his office immediately. He sounds a little angry. We'd better not keep him waiting.

I'm sorry if I was short on the phone. I'm very upset. We just had a little incident down in the restaurant. A server spilled a

tureen of crab bisque on one of our most "favored" guests, Mr. Pitt.

The Kahana's occupancy this year has been higher than I expected, and I had to hire extra help from a staffing agency. Those

staffing agencies charge a fortune, which is especially irritating considering that the employees they refer to us are often

poorly suited to customer service in an upscale hotel.

Really, this is my fault for not having a more effective staffing process. I just wish I could predict my needs better. Sometimes,

when demand is lower than I expected, I'm overstaffed. Then I lose money paying idle bellhops. If I had a good sense of my

staffing needs at least a month in advance, I could avoid hiring workers at the last minute and having idle staff.

I had been thinking that the number of advance reservations would give me a good idea of how high my occupancy would be

a month down the road. But clearly advance reservations don't tell me the whole story. I've been making way too many false

predictions.

Is there anything you can do to help me here? What predictions about occupancy can I make based on advance bookings?

And how much can I trust them?

We'll take a look at the data on advance bookings and occupancy and let you know what we find out.

Alice seems confident that the two of you can offer useful advice on Leo's staffing problem:

"This will be a great opportunity for you to learn regression. It's a powerful statistical tool used all the time in business: in

finance, demand forecasting, market research to name just a few areas. I'm sure you'll use it in your MBA program. And it's a

great chance to review what you've learned so far: sampling, confidence intervals, and hypothesis testing all play a part in

regression."

As we have seen, it is often useful to examine the relationship between two variables. Using scatter diagrams, we can

visualize such relationships.

We can learn more about the relationship by finding the correlation coefficient, which measures the strength of the linear

relationship on a scale from -1 to 1.

Regression is a statistical tool that goes even further: it can help us understand and characterize the specific structure of the

relationship between two variables.

Let's look at an example. Julius Tabin owns a small food processing company that produces the spreadable lunchmeat

product EasyMeat. Julius is trying to understand the relationship between his firm's advertising and its sales.

Total sales in the spreadable meat industry have been fairly flat over the last decade, and Julius' competitors' actions have

been quite stable. Julius believes that his advertising levels influence his firm's sales positively, but he doesn't have a clear

understanding of what the relationship looks like.

Let's have a look at data on his firm's advertising and sales over the last 10 years. Click on the Excel link to create the scatter

diagram yourself from an Excel spreadsheet.

EasyMeat Data


Plotting annual sales against annual advertising expenditures gives us a visual sense of the relationship between the two

variables. Looking at the graph, we can see that as advertising has gone up, sales have generally increased. The relationship

looks reasonably linear.

EasyMeat Data

The correlation coefficient for the two variables is 0.93, indicating a strong linear relationship between advertising and sales.

EasyMeat Data

What if we were to draw a line that characterizes this relationship? Which line would best fit the data? Our mind's eye already

sees how the two variables are related, but how can we formalize our visual impression?

EasyMeat Data

Before we start any calculations, let's look at several lines that could describe the relationship.

EasyMeat Data

One of these lines most accurately describes the relationship between the two variables: the "best-fit" or regression line.

EasyMeat Data

In our example, the best-fit line is Sales = -333,831 + 50*Advertising. For this line, the y-intercept is -333,831 and the slope is

50.

EasyMeat Data

In general, a regression line can be described by a simple linear equation: y = a + bx, with y-intercept a and slope b.

EasyMeat Data

In this equation, the y-variable, sales, is called the dependent variable, to suggest that we think Julius' sales depend to

some degree on his advertising. The x-variable, advertising, is called the independent variable, or the explanatory

variable.

EasyMeat Data

When we observe that a change in the independent variable (here advertising) is typically accompanied by a proportional

change in the dependent variable (here sales), regression analysis can identify and formalize that relationship.

EasyMeat Data

Summary

Regression analysis helps us find the mathematical relationship between two variables. We can use regression to describe a

linear relationship: one that can be represented by a straight line and characterized by an equation of the form y = a + bx.

What kinds of questions can regression analysis help answer?

How does regression help us as managers? It can help in two ways: first, it helps us forecast. For example, we can make

predictions about future values of sales based on possible future values of advertising.

Second, it helps us deepen our understanding of the structure of the relationship between two variables by expressing the

relationship mathematically.

EasyMeat Data

Let's talk first about how managers can use regression to forecast. In our example, regression can help Julius predict his

company's sales for a specified level of advertising.

For example, if he plans to spend $65,000 in advertising next year, what might we expect sales to be?

If we didn't know anything about the relationship, but only had the historical data, we might simply note that the last time

Julius spent $65,000 on advertising, his sales were $3,200,200. But is this the best prediction we can make?

EasyMeat Data


Not at all. Regression analysis brings the entire data set to bear on our prediction. In general, this will allow us to make

more accurate predictions than if we infer the future value of sales from a single observation of advertising and sales.

Having identified the relationship between the two variables from the full data set, we can apply our understanding of

that relationship to our forecast.

EasyMeat Data

Using regression analysis, we found the regression line to be Sales = -333,831 + 50*Advertising. If Julius plans to spend

$65,000 in advertising, what would we predict sales to be?

EasyMeat Data

The point on the line shows us what level of sales to expect. In this case, we would expect sales of $2,916,169.

EasyMeat Data
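As a quick check, here is a one-line computation of this forecast from the fitted equation (plain Python, no special libraries):

```python
# Predicting sales from the EasyMeat regression line.
intercept, slope = -333_831, 50
advertising = 65_000
print(intercept + slope * advertising)   # 2916169, i.e., sales of $2,916,169
```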

With regression, we can forecast sales for any advertising level within the range of advertising levels we've seen

historically. For example, even if Julius has never spent exactly $50,000 on advertising, we can still forecast a

corresponding level of sales.

EasyMeat Data

We must be extremely cautious about forecasting sales for values of advertising beyond the range of values we have already

observed. The further we are from the historical values of advertising, the more we should question the reliability of our

forecast.

EasyMeat Data

For example, we might feel comfortable forecasting sales for advertising levels a bit above the observed range, perhaps as

high as $100,000 or $105,000. But we shouldn't infer that if Julius spent $10 million on advertising, he would achieve

$500 million in sales. The total market for spreadable meat is probably much less than $500 million annually!

EasyMeat Data

Likewise, we might feel comfortable forecasting sales for advertising levels just below the observed range. But we certainly

shouldn't report that if Julius spent $0 on advertising, he would have negative sales!

EasyMeat Data

If we try to use our regression equation to forecast sales for advertising levels outside of the historical range, we are

implicitly assuming that the relationship between advertising and sales continues to be linear outside of the historical

range.

EasyMeat Data

In reality, although the relationship may be quite linear for the range of values we've observed, the curve may well level off

for advertising values much lower or much higher than those we've observed. With no observations outside the historical

data range, we simply don't have evidence about what the relationship looks like there.

EasyMeat Data

Another critical caveat to keep in mind is that whenever we use historical data to predict future values, we are assuming

that the past is a reasonable predictor of the future. Thus, we should only use regression to predict the future if the general

circumstances that held in the past, such as competition, industry dynamics, and economic environment, are expected to

hold in the future.

Regression can be used to deepen our understanding of the structural relationship between two variables. If we think about

it, many business decisions are about increasing or decreasing one variable — investments or advertising, for example — to

affect some other variable — productivity, brand recognition, or profits, for example. Regression can reveal the structure of

relationships of this type.

Our regression analysis stipulates a linear relationship between sales and advertising. Understanding "the structure" of this

relationship translates into finding and interpreting the coefficients of the regression equation.

As we've noted in the previous slide, the constant term -333,831 may have no real managerial significance; it just "anchors"

the regression line by telling us the y-intercept. We've never seen advertising levels close to $0, so we cannot infer that

spending no money on advertising will lead to sales of -$333,831!

The more important term is the advertising coefficient, 50, which gives us the slope of the line. The advertising coefficient

tells us how sales have changed on average as advertising has increased.


In the past, when advertising has increased by $10,000, what has been the average corresponding change in sales?

Assuming that the relationship between sales and advertising is linear, each $1 increase in advertising should be

accompanied by the same average increase in sales. In our example, for every incremental $1 in advertising, sales increase

on average by $50. Thus, for every incremental $10,000 in advertising, sales increase on average by $500,000.

The regression line gives us insight into how two variables are related. As one variable increases, by how much does the

other variable typically change? How much growth in sales can we anticipate from an incremental increase in advertising

expenditures? Regression analysis helps managers answer questions like these.

Summary

We use regression analysis for two primary purposes: forecasting and studying the structure of the relationship between

two variables. We can use regression to predict the value of the dependent variable for a specified value of the independent

variable. The regression equation also tells us how the dependent variable has typically changed with changes in the

independent variable.

Per-capita consumption of soft drink beverages is related to per-capita gross domestic product (GDP). Generally, the

higher the GDP of a country, the more soda its citizens consume. Soft drink consumption is measured in number of 8-oz

servings.

Based on data from 12 countries, the relationship can be expressed mathematically as: average soda consumption = 130 + 0.018*(per-capita GDP).

Source

Based on this relationship, you can expect that, on average, for each additional $1,000 of per-capita GDP a country's soda

consumption increases by _____ servings.

The regression equation tells us that in our data set, average soda consumption increases by 0.018 servings for every

additional $1 of per-capita GDP. So, for an additional $1,000, average consumption increases by ($1,000)(0.018

servings/$) = 18 servings.

The per-capita GDP in the Netherlands is $25,034. What do you predict is the average number of servings of soda

consumed in the Netherlands per year?

Enter predicted average soda consumption (in servings) as an integer (e.g., "5"). Round if necessary.

The regression equation tells us that average soda consumption = 130 + 0.018*(per-capita GDP). Therefore, we anticipate

the Netherlands' average soda consumption to be 580.6 servings.

Although the regression predicts a soda consumption of around 581 servings per person for the Netherlands, the actual

measured number of servings consumed is much lower: 362. The discrepancy in the actual and predicted consumption

reinforces that per-capita GDP alone is not a perfect predictor of soda consumption.
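A short sketch makes the prediction and the discrepancy concrete (plain Python):

```python
# Applying the soda-consumption regression to the Netherlands.
intercept, slope = 130, 0.018
gdp_per_capita = 25_034
predicted = intercept + slope * gdp_per_capita   # 580.6 servings
actual = 362                                     # measured consumption
print(round(predicted, 1), round(actual - predicted, 1))   # 580.6 -218.6
```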

A regression line helps you understand the relationship between two variables and forecast future values of the dependent

variable. Alice points out to you that these two features of regression analysis make it a powerful tool for managers who make

important decisions in the uncertain world of business.

But how do you generate a regression line from observed data? Of all the straight lines that you could draw through a scatter

diagram, which one is the regression line?

Let's return to Julius Tabin's sales and advertising data. As we can see from the graph, no straight line could be drawn that

would pass through every point in the data set.

This is not surprising. Typically, advertising is not a perfect predictor of sales, so we don't expect every data point to fall in a

perfect line. The regression line depicts the best linear relationship between the two variables. We attribute the difference

between the actual data points and the line to the influence that other variables have on sales, or to chance alone.

Since the regression line does not pass through every point, the line does not fit the data perfectly. How accurately does the

regression line represent the data?

To measure the accuracy of a line, we'll quantify the dispersion of the data around the line. Let's look at one line we could

draw through our data set.


Let's consider a second line. Click on the line that more closely fits the ten data points.

Although in this example we can see which of two lines is more accurate, it is useful to have a precise measure of a line's

accuracy.

To quantify how accurately a line fits a data set, we measure the vertical distance between each data point and the line.

Why don't we measure the shortest distance between the point and the line — the distance perpendicular to the line? Why

do we measure vertically?

We measure vertical distance because we are interested in how well the line predicts the value of the dependent variable.

The dependent variable — in our case, sales — is measured on the vertical axis. For each data point, we want to know how

close the value of sales predicted by the line is to the historically observed value of sales.

From now on we will refer to this vertical distance between a data point and the line as the error in prediction or the

residual error, or simply the error. The error is the difference between the observed value and the line's prediction for our

dependent variable. This difference may be due to the influence of other variables or to plain chance.

Going forward, we will refer to the value of the dependent variable predicted by the line as y-hat and to the actual value of the

dependent variable as y. Then the error is y - (y-hat), the difference between the actual and predicted values of the dependent

variable.

The complete mathematical description of the relationship between the dependent and independent variables is y = a + bx +

error. The y-value of any data point is exactly defined by these terms: the value y-hat given by the regression line plus the

error, y - (y-hat).

Collectively, the errors in prediction for all the data points measure how accurately a line fits a set of data.

To quantify the total size of the errors, we cannot just sum each of the vertical distances. If we did, positive and negative

distances would cancel each other out.

Instead, we take the square of each distance and then sum all the squares, similarly to what we do when we calculate

variance.

This measure, called the Sum of Squared Errors (SSE), or the Residual Sum of Squares, gives us a good measure of how

accurately a line describes a set of data.

The less well the line fits the data, the larger the errors, and the higher the Sum of Squared Errors.

Summary

To find the line that best fits a data set, we first need a measure of the accuracy of a line's fit: the Sum of Squared Errors. To

find the Sum of Squared Errors, we calculate the vertical distances from the data points to the line, square the distances,

and sum the squares.
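A minimal sketch of this calculation in Python may make it concrete; the four data points below are purely illustrative, not taken from the EasyMeat file:

```python
# Sum of Squared Errors for a candidate line y = a + b*x over paired
# observations (xs, ys). The data here are illustrative only.
def sum_of_squared_errors(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]          # hypothetical independent-variable values
ys = [2.1, 3.9, 6.2, 7.8]  # hypothetical dependent-variable values
print(round(sum_of_squared_errors(0, 2, xs, ys), 2))   # 0.1 for the line y = 2x
```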

Now that you have a way to measure how well a line fits a set of data, you need a way to identify the line that "best fits" the

data: the regression line.

We can calculate the Sum of Squared Errors for any line that passes through the data. Of course, different lines will give us

different Sums of Squared Errors. The line we are looking for — the regression line — is the one with the smallest Sum of

Squared Errors.

Let's look at several lines that could describe the relationship between advertising and sales in our example. Our intuition

tells us that the middle line is a much better fit than line a or line b.

Let's check our intuition. For each line, we can calculate the Sum of Squared Errors to determine its accuracy.

The lower the Sum of Squared Errors, the more precisely the line fits the data, and the higher the line's accuracy.

The line that most accurately describes the relationship between advertising and sales — the regression line — is the line that

minimizes the sum of squares. Finding the regression line for a set of data is a calculation-intensive process best left to

statistical software.

Summary

The line that most accurately fits the data — the regression line — is the line for which the Sum of Squared Errors is

minimized.
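Although the course leaves the minimization to statistical software, the one-variable case has a well-known closed form. Here is a hedged sketch, reusing the illustrative data from the previous example:

```python
# Closed-form least-squares coefficients that minimize the Sum of
# Squared Errors for a single independent variable.
def least_squares(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))   # slope
    a = mean_y - b * mean_x                      # intercept
    return a, b

xs = [1, 2, 3, 4]          # same illustrative data as above
ys = [2.1, 3.9, 6.2, 7.8]
a, b = least_squares(xs, ys)
print(round(a, 2), round(b, 2))   # 0.15 1.94: the best-fit intercept and slope
```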


Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to do regression analysis using

the regression tool. However, we suggest you read through the following instructions to learn how Excel's regression tool

works, so you can run regressions in the future, when you do have access to the Data Analysis ToolPak.

Performing regression analysis by hand is a time-consuming process. Fortunately, statistical software packages and major

spreadsheet programs — Excel, for example — can do the necessary calculations for you in a matter of seconds. Click on the

Excel link to access the data file so you can practice doing the analysis in Excel as you read through the instructions.

EasyMeat Data

Let's go through the process step by step. We start with data entered into two columns in an Excel spreadsheet. Each

column contains values of a variable. To perform regression analysis, there must be an equal number of entries in each

column.

Under the Data tab in the toolbar we select the Data Analysis option.

A window pops open containing an alphabetical list of statistical tools. We select "Regression" and click "OK".

In the regression window, we see a prompt field titled "Input Y Range." In it, we enter C1:C11, the range of cells containing

the column label (C1) and the data (C2:C11) for the dependent variable: Sales ($).

We repeat this for the prompt field titled "Input X Range," entering B1:B11 to include both the column label (B1) and the

data (B2:B11) for the independent variable: Advertising ($).

Since we included the column labels in row 1 in our ranges, we must check the "Labels" box. Including labels is helpful

because Excel uses the labels to identify the variable coefficients in the output sheet. If you do not include the labels in your

ranges, do not check the label box, or Excel will treat the first row of data as labels, excluding those entries from the

regression.

Finally, we select the output option "New Worksheet Ply:", enter the name for the new worksheet, and click "OK."

Excel opens a new worksheet with the name we specified. In it, we see an intimidating array of data.

For the moment, we're mainly interested in the entries in the cells labeled "Coefficients," which specify the intercept and

slope of the regression line.

Note that the label "Advertising ($)" has been carried over from the original data column. The coefficient in the

"Advertising ($)" row is the slope of the regression line.

For the exercises in this unit, we strongly recommend you find the relevant data in an Excel spreadsheet and perform the

regression analyses yourself. If you do not have the Analysis ToolPak, you can open a file containing the relevant regression

output.

EasyMeat Data

EasyMeat Regression
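If you do not have the Analysis ToolPak, one alternative is to run the same regression in Python. The sketch below uses scipy's linregress; the ten yearly advertising and sales figures are not reproduced in this text, so apart from the one pair quoted earlier ($65,000 and $3,200,200) the numbers are placeholders:

```python
# A sketch of simple regression outside Excel, using scipy.
# Replace these placeholder values with the ten rows of the EasyMeat file.
from scipy.stats import linregress

advertising = [30_000, 45_000, 65_000, 80_000, 95_000]            # hypothetical
sales = [1_200_000, 1_900_000, 3_200_200, 3_700_000, 4_400_000]   # hypothetical

result = linregress(advertising, sales)
print(result.intercept, result.slope)   # the regression coefficients
print(result.rvalue ** 2)               # R-squared for the fit
```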

To practice using Excel's regression tool, run a regression using the world soft drink consumption data from an earlier

exercise. Use soft drink consumption for the dependent variable and per capita GDP for the independent variable.

What is the slope of the regression line? Enter the slope as a decimal number with 3 digits to the right of the decimal point

(e.g., enter "5" as "5.000"). Round if necessary.

Soft Drink Consumption Regression

Source

We run the regression by selecting range C1:C13 for the Y-range, the dependent variable consumption, and B1:B13 for the

X-range, the independent variable GDP per capita. We check the label box, and see the output below. The slope of the

regression line is the coefficient of the independent variable, GDP per capita.

What is the intercept of the regression line? Enter the intercept as an integer (e.g., "5"). Round if necessary.


Soft Drink Consumption Data

Soft Drink Consumption Regression

Source

Equipped with the basic tools needed to find and interpret the regression line, you feel ready to tackle Leo's assignment. But

Alice cautions you not to be hasty and urges you to consider some tricky questions: "How well does the regression line actually

characterize the relationship in the data? Is a straight line even a good descriptor of the relationship?"

How much does the relationship between advertising and sales help us understand and predict sales? We'd like to be able to

quantify the predictive power of the relationship in determining sales levels. How much more do we know about sales thanks

to the advertising data?

To answer this question we need a benchmark telling us how much we know about the behavior of sales without the

advertising data. Only then does it make sense to ask how much more information the advertising data give us.

Without the advertising data, we have the sales data alone to work with. Using no information other than the sales data, the

best predictor for future sales is simply the mean of previous sales. Thus, we use mean sales as our benchmark, and draw a

"mean sales line" through the data.

Let's compare the accuracy of the regression line and the mean sales line. We already have a measure of how accurately an

individual line fits a set of data: the Sum of Squared Errors about the line. Now we want a measure of how much more

accurate the regression line is than the mean line.

To obtain such a measure, we'll calculate the Sum of Squared Errors for each of the two lines, and see how much smaller the

error is around the regression line than around the mean line.

The Sum of Squared Errors for the mean sales line measures the total variation in the sales data. In fact, it is the same

measure of variation we use to derive the standard deviation of sales. We call the Sum of Squared Errors for our benchmark

— the mean sales line — the Total Sum of Squares. Here, the Total Sum of Squares is 8.01 trillion.

The Sum of Squared Errors for the regression line is often called the Residual Sum of Squared Errors, or the Residual Sum of

Squares. The Residual Sum of Squares is the variation left "unexplained" by the regression. Here, the Residual Sum of

Squares is 1.13 trillion.

The difference between the Total Sum of Squares and the Residual Sum of Squares, 6.88 trillion in this case, is called the

Regression Sum of Squares. The Regression Sum of Squares measures the variation in sales "explained" by the regression

line.

A standardized measure of the regression line's explanatory power is called R-squared. R-squared is the fraction of the total

variation in the dependent variable that is explained by the regression line.

R-squared will always be between 0 and 1 — at worst, the regression line explains none of the variation in sales; at best it

explains all of it.

We find R-squared by dividing the variation explained by the regression line — the Regression Sum of Squares — by the total

variation in the dependent variable — the Total Sum of Squares.

R-squared is presented either as a fraction, a percentage, or a decimal. We find that in the advertising and sales example, the

R-squared value is 6.88 trillion/8.01 trillion = 0.859 = 85.9%.

An equivalent approach to computing R-squared is somewhat less intuitive but more common. In this approach we first find

the fraction of the total variation in the dependent variable that is NOT explained by the regression line: we divide the

Residual Sum of Squares by the Total Sum of Squares.
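Using the sums of squares reported above, both routes to R-squared are easy to verify (plain Python; values in trillions):

```python
# R-squared computed both ways from the EasyMeat sums of squares.
total_ss = 8.01        # Total Sum of Squares (trillions)
residual_ss = 1.13     # Residual Sum of Squares (trillions)
regression_ss = total_ss - residual_ss   # 6.88 trillion

r_squared = regression_ss / total_ss         # explained / total
r_squared_alt = 1 - residual_ss / total_ss   # 1 - unexplained share
print(round(r_squared, 3), round(r_squared_alt, 3))   # 0.859 0.859
```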

Fortunately, we don't need to calculate R-squared ourselves — Excel computes R-squared and includes it in the standard

regression output.

In a regression that has only one independent variable, R-squared is closely related to the correlation coefficient between the

independent and dependent variables: the correlation coefficient is simply the positive or negative square root of R-squared—

positive if the slope of the regression line is positive and negative if the slope of the regression line is negative.


Excel's regression output always computes the square root of R-squared, which it labels "Multiple R."

Summary

R-squared measures how well the behavior of the independent variable explains the behavior of the dependent variable. R-

squared is the ratio of the Regression Sum of Squares to the Total Sum of Squares. As such, it tells us what proportion of

the total variation in the dependent variable is explained by its linear relationship with the independent variable.

Residual Analysis

Although the regression line is the line that best fits the observed data, the data points typically do not fall precisely on the

line. Collectively, the vertical distances from the data to the line — the errors — measure how well the line fits the data.

These errors are also known as residuals.

A careful study of the residuals can tell us a lot about a regression analysis and the validity of the assumptions we base it

on.

For example, when we run a regression, we assume that a straight line best describes the relationship between our two

variables. In fact, sometimes the relationship may be better described by a curve.

In this graph, we can clearly see a negative trend. If we run a regression on these data, we find a relatively high R-squared.

How do we use the residuals to check our assumption that the relationship is linear?

First, we measure the residuals: the distance from the data points to the regression line.

Then we plot the residuals against the values of the independent variable. This graph — called a residual plot — helps us

identify patterns in the residuals.

We can recognize a pattern in the residual plot: a curve. This pattern strongly indicates that a straight line is not the best

way to express the relationship between the variables: a curve would be a much better fit.

A residual plot often is better than the original scatter plot for recognizing patterns because it isolates the errors from the

general trend in the data. Residual plots are critical for studying error patterns in more advanced regressions with multiple

independent variables.
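For readers working outside Excel, here is a hedged sketch of building a residual plot in Python (assuming numpy and matplotlib are available); the data are simulated so that the curved pattern described above shows up clearly:

```python
# A residual plot: fit a straight line, compute residuals, and plot them
# against the independent variable. The data are simulated for illustration.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 40)
y = 25 - x**2 + np.random.normal(0, 3, size=x.size)   # a curved relationship

b, a = np.polyfit(x, y, 1)            # slope and intercept of the fitted line
residuals = y - (a + b * x)           # observed minus predicted values

plt.scatter(x, residuals)
plt.axhline(0)                        # residuals should straddle this axis
plt.xlabel("independent variable")
plt.ylabel("residual")
plt.show()                            # a curved band here signals non-linearity
```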

If the only pattern in the dependent variable is accounted for by a linear relationship with the independent variable, then

we should see no systematic pattern in the residual plot. The residuals should be spread randomly around the horizontal

axis.

In fact, the distribution of the residuals should be a normal distribution with mean zero and a fixed variance. Residuals

are called homoskedastic if their distributions have the same variance.

If we see a pattern in the distribution of the residuals, then we can infer that there is more to the behavior of the dependent

variable than what is explained by our linear regression. Other factors may be influencing the dependent variable, or the

assumption that the relationship is linear may be unwarranted.

We've already seen the pattern of a curved relationship. What other patterns might we see?

Let's look at this scatter diagram and its corresponding residual plot. The residuals appear to be getting larger for higher

values of the independent variable. This phenomenon is known as heteroskedasticity.

Residual analysis reveals that the distribution of the residuals changes with the independent variable: the variance

increases as the independent variable increases. Since the variance of the residuals — which contributes to the variation of

the dependent variable — is affected by the behavior of the independent variable, we can conclude that there must be more

to the story than just the linear relationship.

There are a number of other assumptions about regression whose validity can be tested by performing a residual analysis.

Although interesting, these uses of residual analysis are beyond the scope of this course.

Summary

A complete regression analysis should include a careful inspection of the residuals. Plot the residuals against the

independent variable to reveal patterns in the distribution of the residuals.

Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to perform residual analysis

using the regression tool. However, we suggest you read through the instructions to learn how Excel's regression tool

works, so you can perform residual analysis in the future, when you do have access to the Data Analysis ToolPak.


To study patterns in residuals and other regression data, visual representations can be very helpful. To plot a regression

line we first generate a scatter diagram of the data we are studying.

Once we have the scatter diagram, we add the regression line by right-clicking on any one of the data points. A menu will

pop up. From that menu, we select "Add Trendline."

If we wish to display the regression equation and R-squared value on the same scatter plot, we can select the check boxes

next to those items at the bottom and press "Close."

The scatter diagram will now have been augmented by the regression line, the regression equation, and the R-squared

value. This is a quick way to perform a simple regression analysis from a scatter plot, though it doesn't provide all of the

output we'll want to thoroughly review the results.

Residual analysis is an option in Excel's Regression tool. To calculate the residuals and generate the residual plot, we select

"Residual Plots" in the Residuals section of the Regression Menu and click "Ok."

R-squared and the residuals are not the only things to keep an eye on when running a regression.

The regression line is the line that best fits our observed data. But are we sure that the regression line depicts the "true" linear

relationship between the variables?

In fact, the regression line is almost never a perfect descriptor of the true linear relationship between the variables. Why?

Because the data we use to find the regression line typically represent only a sample from the entire population of data

pertaining to the relationship.

Returning to our spreadable lunchmeat example, advertising and sales data for a different set of years would likely have given

us a different line. Since each regression line comes from a limited set of data, it gives us only an approximation of the

"true" linear relationship between the variables.

As we do in sampling, we'll use Greek letters to denote parameters describing the population, and Latin letters to denote the

estimates of those parameters based on sample data. We'll use alpha and beta to represent the coefficients of the "true" linear

relationship in the population, and a and b to represent our estimates of alpha and beta.

When we calculate the coefficients of a regression line from a set of observed data, the value a is an estimate of alpha, the

intercept of the "true" line. Similarly, the value b is an estimate of beta, the slope of the true line.

Like all estimates based on sample data, the calculated estimate for a coefficient is probably different from the true

coefficient. Just as in sampling, to find a range of likely values for the true coefficient, we construct confidence intervals

around each estimated coefficient.

Excel's output gives us 95% confidence intervals for each coefficient by default.

The Excel output for our EasyMeat example tells us that the best estimate of the slope is 50, and we are 95% confident that

the slope of the true line is between 33.5 and 66.5.

We can specify any other confidence level when setting up the regression. We might be interested in a 99% confidence

interval for the slope of the EasyMeat regression line.

The Excel output for our EasyMeat example tells us that the best estimate of the slope is 50, and we are 99% confident that

the slope of the true line is between 25.9 and 74.1.

Since we don't know the exact value of the slope of the true advertising line, we might well question whether there actually

IS a linear relationship between advertising and sales. How can we assure ourselves that the pattern we see in the sample

data is not simply due to chance?

If there truly were no linear relationship, then in the full population of relevant data, changes in advertising would not

correspond systematically to proportional changes in sales. In that case, the slope of the best-fitting line for the true

relationship would be zero.

There are two quick ways to test if the slope of the true line might be zero. One is simply to look at the confidence interval

for the slope coefficient and see if it includes zero.


The 95% confidence interval for the slope in our case, [33.5, 66.5], does not include zero, so we can reject with 95%

confidence the hypothesis that the slope of the true line is zero. We can say that we are 95% confident that there really is a

linear relationship between advertising and sales.

Alternatively, we can test how likely it is that the slope really is zero using a hypothesis test. We formulate the null

hypothesis that there is no linear relationship: the true slope, beta, of the regression line is zero.

Then we calculate the p-value: the likelihood of having a slope at least as far from zero as the slope calculated, b, assuming

that the slope is in fact zero.

In our example, the p-value tells us the likelihood of having a slope as far from zero as 50 if the true slope were in fact zero.

Fortunately, Excel saves us the trouble of calculating the p-value and reports it along with the regression output. This p-

value tells us exactly what we want to know: if there really is no linear relationship between the variables, how likely is it

that our sample would have given us a slope as large as 50?

In our example, the p-value for the advertising coefficient is 0.0001, indicating that we can be confident at the 99.99% level

that beta, the slope of the true line, is different from zero. Thus, we are confident at the 99.99% level that there really is a

linear relationship between advertising and sales.

In general, if the p-value for a slope coefficient is less than 0.05, then we can reject the null hypothesis that the slope beta is

zero, and conclude with 95% confidence that there is a linear relationship between the two variables. Moreover, the smaller

the p-value, the more confident we are that a linear relationship exists.

If the p-value for a slope coefficient is greater than 0.05, then we do not have enough evidence to conclude with 95%

confidence that there is a significant linear relationship between the variables.

Additionally, p-values can be used to test hypotheses about the intercept of a regression line. In our example, the p-value

for the intercept is 0.498, much larger than 0.05. However, since the intercept simply anchors the regression line, whether

or not it is zero is not particularly important.

We won't use the standard error of the coefficient in this course, but will simply point out that the standard error is similar

to a standard deviation of our distribution for the coefficient. Thus, the 95% confidence interval is approximately 50 +/- 2*

(7.2). It's not exact, because the distribution isn't quite normal, but the concept is very similar.

We won't use the t-stat in this course either, since the p-value gives the equivalent information in a more readily usable

form. The t-stat tells us how many standard errors the coefficient is from the value zero. Thus, if the t-stat is greater than 2,

we are quite sure (approximately 95% confident) that the true coefficient is not zero.
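For completeness, here is a hedged sketch of obtaining the same coefficient estimates, p-values, and confidence intervals outside Excel, using Python's statsmodels package; the data arrays are placeholders standing in for the advertising and sales columns:

```python
# Coefficient estimates, p-values, and confidence intervals in Python,
# analogous to Excel's regression output. Placeholder data shown.
import numpy as np
import statsmodels.api as sm

advertising = np.array([30_000, 45_000, 65_000, 80_000, 95_000])           # hypothetical
sales = np.array([1_200_000, 1_900_000, 3_200_200, 3_700_000, 4_400_000])  # hypothetical

X = sm.add_constant(advertising)   # adds the intercept term to the model
model = sm.OLS(sales, X).fit()

print(model.params)                # intercept and slope estimates
print(model.pvalues)               # p-value for each coefficient
print(model.conf_int(alpha=0.05))  # 95% confidence intervals
```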

Summary

The slope and intercept of the regression line are estimates based on sample data: how closely they approximate the actual

values is uncertain. Confidence intervals for the regression coefficients specify a range of likely values for the regression

coefficients. Excel reports a p-value for each coefficient. If the p-value for a slope coefficient is less than 0.05 we can be

95% confident that the slope is nonzero, and hence that there is a linear relationship between the independent and

dependent variables.

It is important not to confuse the p-value for a coefficient with the R-squared for the regression. R-squared tells us what

percentage of the variation seen in the dependent variable is explained by its relationship with the independent variable.

The p-value tells us the likelihood that there is no real relationship between the dependent and independent variables - that

the true coefficient of the line in the full population is zero.

Let's look at an example that illustrates the difference in these two measures. This scatter diagram shows data tightly

clustered around the regression line. R-squared is very high — very little variation in Y is left unexplained by the line. The

p-value on the coefficient is very low — there is clearly a significant linear relationship between X and Y.

The second scatter diagram has a much lower R-squared — a lot of variation in Y is left unexplained by the line. But there is

no question that there is a significant relationship between X and Y, so the p-value on the coefficient is again very low.

Here, the linear relationship may not explain as much variation as it does in the upper graph, but the relationship is clearly

significant.

A high p-value indicates a lack of confidence in the underlying linear relationship. We would not expect a questionable

relationship to explain much variation. So if we have only one independent variable and it has a high p-value, we would not

expect to find a very high R-squared. However, as we'll see shortly, the story becomes more complex when we have two or

more independent variables.

So far, we haven't discussed how sample size affects the accuracy of a regression analysis. The larger the sample we use to

conduct the regression analysis, the more precise the information we obtain about the true nature of the relationship under


investigation. Specifically, the larger the sample, the better our estimates for the slope and the intercept, and the tighter the

confidence intervals around those estimates.

Let's look at how sample size affects p-values in an example. This scatter plot is based on a large sample size — 50

observations. With a p-value of zero we are extremely confident that there is a linear relationship between X and Y, and our

confidence interval for the slope coefficient is quite tight.

The second scatter plot is based on a sample size of only 10. The p-value rises to 0.07, so we cannot be 95% confident that

there is a linear relationship between X and Y. Our lack of confidence in our estimate is also evident in the wide confidence

interval for the slope coefficient.

Summary

The p-value and R-squared provide different information. A linear relationship can be significant but not explain a large

percentage of the variation, so having a low p-value does not ensure a high R-squared. Sample size is an important determinant

of regression accuracy: as with all sampling, larger samples give more accurate estimates.

Aware of many of the subtleties of basic regression, you are ready to turn to Leo's staffing problem.

The Kahana has 500 guest rooms. Over the last three years, average yearly occupancy has varied a good deal: my record low

was 250 rooms occupied, and my high was 484. That range of variation makes staff planning difficult.

I'd like to be able to predict the Kahana's occupancy one month in advance. So far, I've been making educated guesses about

the level of occupancy one month in the future based on the number of advance bookings. I take the number of bookings and

add another 50%, to be on the safe side. Obviously, I haven't done very well with that method.

You need to find out more precisely what the relationship is between advance bookings and occupancy. Alice asks you to run

a regression analysis on Leo's occupancy and advance bookings data.

The results of your analysis tell you that, for every 100 additional advance bookings,

Bookings and Occupancy Regression

When you run the regression analysis, be sure you choose the dependent and independent variables correctly. We are using

advance bookings to predict occupancy, suggesting that we believe that occupancy levels "depend" on advance bookings.

Thus, the occupancy is the dependent variable, and the number of advance bookings is the independent variable.

In the regression output, find the coefficient of the slope of the regression line. This tells you how many additional guests Leo

can expect for each additional advance booking.

Based on these data, can you be 95% confident that the slope of the regression line is not 0?

Bookings and Occupancy Regression

How much of the variation in occupancy is explained by the variation in the number of advance bookings?

Bookings and Occupancy Regression

So there is a positive relationship between advance bookings and occupancy. But the power of advance bookings to predict

occupancy is pretty small. More than half of the variation in your room occupancy is due to other factors.

I suppose the mathematical relationship you found will help me make slightly more informed staffing choices. But mostly I'll

still be stumbling in the dark.

Not so fast Leo: We may be able to identify other factors linked to occupancy, and use them to come up with an improved

forecasting model. Why don't we meet tomorrow and discuss what other factors might help us predict occupancy?

Alright. In the meantime, I'll be doing my best to appease Mr. Pitt. He's talking about filing suit against me. He says he was

burned. Burned by a bisque!


You have been asked to examine the relationship between two important macroeconomic quantities: the change in

business inventories and factory capacity utilization levels.

If you wish to learn what percent of the variation in changes in inventories is explained by capacity utilization levels, which

variable should you choose as your independent variable?

Click here to access US economic data from the years 1971-1986. Run the regression with the change in business

inventories as the dependent variable and the capacity utilization as the independent variable.

Source

Using the regression output, find the slope of the regression line. Enter the slope as a decimal number with 2 digits to the

right of the decimal point (e.g., enter "5" as "5.00"). Round if necessary.

Inventories and Capacity Regression

Greta John is the human resources manager at the software consulting firm Clever Solutions. Recently, some of the

programmers have been restless: the senior programmers feel that length of service and loyalty have not been rewarded in

their compensation. The junior programmers think that seniority should not be a major basis for pay.

Greta wants some hard data to inform the debate. As a preliminary step, she plots employees' salaries against their length

of service. Using the data provided, perform a regression analysis with salary as the dependent variable and length of

service as the independent variable.

What is the average increase in a Clever Solutions programmer's salary per year of service?

Clever Solutions Regression

Based on the regression analysis, Greta can tell that approximately _____ of the variation in compensation can be

explained by length of service.

Clever Solutions Regression

Productivity measures a nation's average output per labor-hour. It is one of the most closely watched variables in

economics: as workers produce more per hour, employers can pay them more without increasing the price of the product.

Since wages can rise without provoking a corresponding rise in consumer prices, the growth of productivity is essential to a

real increase in a nation's standard of living.

Peter Agarwal, a student at the Harvard Business School, wants to investigate the relationship between change in

productivity and change in real hourly compensation.

Peter has data on change in productivity and change in compensation for eight industrialized nations. The figures are

annual averages over the period from 1979-1990.

Source

Run a regression with change in compensation as the dependent variable and change in productivity as the independent

variable.

Source

How much of the variation in the change in compensation can be explained by the change in productivity? Enter the

percentage as a decimal number with 2 digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Round if

necessary.

Productivity and Compensation Regression

Given these data, Peter finds that the relationship can be mathematically expressed as:


Can Peter claim (with a 95% level of confidence) that the relationship is statistically significant?

Productivity and Compensation Regression

The coefficient for the slope given by Excel is an estimate based on the data in Peter's sample. The estimate for the slope of

the regression line is about 0.75. If the actual slope of the relationship is 0, there is no significant linear relationship

between the change in productivity and the change in compensation.

On the regression output, there are two ways to tell if the slope coefficient is significant at the 0.05 level. First, we can look

at the 95% confidence interval provided and see that it ranges from -0.05 to +1.56. Since the 95% confidence interval

contains zero, the coefficient is not significant at the 0.05 level.

Alternatively, we can note that the p-value of the slope coefficient, 0.0625, is greater than 0.05. Peter cannot be 95%

confident that the actual slope is not zero.

Since Peter cannot be confident that the slope is not zero, he cannot be confident that there is a linear relationship between

the two variables.

Suppose Peter collects data on eight more countries. Run the regression for the entire data set with change in

compensation as the dependent variable and change in productivity as the independent variable.

The new data set indicates that the variables have a slightly different relationship:

Can Peter claim (with a 95% level of confidence) that the relationship is statistically significant?

Expanded Productivity and Compensation Regression

What might explain why the coefficient is significant in the second (combined) data set?

Expanded Productivity and Compensation Regression

Multiple Regression

Introduction

After brainstorming for several hours, you and Alice devise a plan to improve your regression analysis of the staffing problem,

to help Leo better predict the Kahana's occupancy.

Good morning. I hope you had a good night's sleep and have come up with some ideas on how to better predict my

occupancy.

For my part, I slept horribly last night. This bisque lawsuit is taking years off my life.

I'm sorry to hear that. I'm afraid we don't have any legal advice to give you. We do have some ideas about how to improve

your ability to forecast your hotel's occupancy.

We can do better than explaining 39% of the variation in occupancy as we did with our earlier regression using advance

bookings as the independent variable.

Remember when we first arrived, we analyzed the relationship between Kauai's average hotel occupancy rates and arrivals on

the island? We found a fairly strong correlation between occupancy and arrivals: 71%.

Source

I took a look at the relationship between arrivals on Kauai and your hotel's occupancy numbers. In the regression of

Kahana occupancy versus arrivals, arrivals explain 80% of the variation in occupancy. That's much better than the 39%

explained by advance bookings.

Wait a minute. Those numbers add up to more than 100%. It seems like your regression technique explains more variation

than there is!


Good observation, Leo. The numbers don't add up. The reason is that there is a statistical relationship between arrivals and

advance bookings that we aren't taking into account when we run the two regressions separately.

We intend to find one equation that incorporates the data on both arrivals and advance bookings and takes into account the

relationship between them. We'll also investigate the impact of other factors, such as the business practices of your

competitors. Who is your main competitor in the area?

That would be the Hotel Excelsior. Its manager, Knut Steinkalt, is a real cut-throat. He's always offering special promotions

that undercut my room prices.

Fortunately, the Excelsior, though very luxurious, is not nearly as inviting as the Kahana. That place feels like an undertaker's

parlor! I've been able to keep ahead of old Knut by offering a better product.

We'll study the Excelsior's promotions, and see if they've had a significant influence on the Kahana's occupancy.

Thanks. Let me know as soon as you have some results. I have to warn you, though, I may be out: I'm going see my lawyers in

Honolulu this week to discuss Mr. Pitt's bisque lawsuit. What a mess!

"Most management problems are too complex to be completely described by the interactions between only two variables,"

Alice tells you. "Incorporating multiple independent variables can give managers a more accurate mathematical

representation of their business."

When buying a new home, we know that the house's size influences its selling price. All other factors being equal, we expect

that the larger the home, the higher its price.

We can gain a better sense of this relationship by graphing data on price and house size in a scatter diagram, and running a

regression with price as the dependent variable and house size as the independent variable. We'll use data on 15 recent home

purchases in the town of Silverhaven.

We can tell from the value of R-squared, 26%, that the relationship is fairly weak: house size variation explains only about a

quarter of the variation in price.

Silverhaven Real Estate Regressions

This should come as no surprise: house size is not the only variable affecting the price of a home. There are at least three

other important factors, as any real estate agent will tell you:

One facet of a desirable location is a low commuting time, so we'd expect average commuting time to be related to price. To

study this relationship, we'll use a proxy variable: the house's distance from the business center in downtown Silverhaven. A

proxy variable is a variable that is closely correlated with the variable we want to investigate, but for which data are typically more readily available.

The data on distance for the same 15 houses reveals a negative relationship: the farther away from downtown, the less

expensive houses tend to be. Again, the strength of the relationship is relatively weak: R-squared for the regression of price

versus distance is 37%.

Silverhaven Real Estate Regressions

We might be tempted to think that, if house size explains 26% of the price of a house and distance explains 37%, then the two

variables together would explain 63% of the price. But what if there is a relationship between the two independent variables,

house size and distance?

In fact, the correlation coefficient of house size and distance is 31%. As we might expect, there is a positive relationship

between the two variables: as we move farther from the city center, houses tend to be larger. How should we factor this

relationship into our analysis of how house size and distance affect the price of a house?

Rather than considering each individual relationship, we need to find a way to express the three-way relationship among all

three variables: price, house size, and location. Instead of two separate equations describing price, each with a different

independent variable, we need one equation that includes both independent variables.

We find this three-way relationship using multiple regression. In multiple regression, we adapt what we know about

regression with one independent variable — often called simple regression — to situations in which we take into account

the influence of several variables.

Thinking about several variables simultaneously can be quite challenging. Given only the price of a home, we cannot make

inferences about its size and location. A $500,000 home might be a mansion in the countryside, a modest house in the

suburbs, or a cozy cardboard box on the corner of 1st and Main.

Graphing data on more than two variables poses its own set of difficulties. Three variables can still be represented, but

beyond that, visualization and graphical representation become essentially impossible.

Although we can carry over many of the central ideas behind simple regression to the multivariate case, we'll have to consider

several interesting complications.

As managers, almost any quantity we wish to study will be influenced by more than one variable: to construct an accurate

model of a business' dynamics, we'll usually need several variables. Multiple regression is an essential and powerful

management tool for analyzing these situations.

Incorporating more than one independent variable into your analysis of the Kahana's occupancy sounds like a good idea. But

how do you adapt regression analysis to accommodate multiple variables?

Summary

Multiple regression is an extension of simple regression that allows us to analyze the relationships between multiple

independent variables and a dependent variable. Relationships among independent variables complicate multivariate

regression. With more than two independent variables, graphing multivariable relationships is impossible, so we must

proceed with caution and conduct additional analyses to identify patterns.

"Multiple regression equations look very similar to regression equations with only one independent variable," Alice explains.

"But be careful - you have to interpret them slightly differently."

In our home price example, we found two regression equations: one for the relationship between price and house size, and

one for the relationship between price and distance. What will the equation look like for the three-way relationship between

price, the dependent variable, and the two independent variables, house size and distance?

The regression equation in our housing example will have the form below: house size and distance each have their own

coefficients, and they are summed together along with the constant coefficient a.
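Written out in the notation used elsewhere in this course (as in y = a + bx), the form described above is:

price = a + b1 × (house size) + b2 × (distance)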

In general, the linear equation for a regression model with k different variables has the form below. Since the coefficients we

obtain from the data are just estimates, we must distinguish between the idealized equation that represents the "true"

relationship and the regression line that estimates that relationship. To express that even the "true" equation does not fit

perfectly, we include an error term in the idealized equation.
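In the same notation, the idealized equation with k independent variables is:

y = a + b1x1 + b2x2 + ... + bkxk + error term

The estimated regression equation has the same form but omits the error term and replaces the true coefficients with the estimates computed from the data.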

Running the regression gives us coefficients for house size and distance: 252 and -55,006, respectively. We can use this

multiple regression equation to predict the price of other houses not in our data set. To predict a house's price, we need to

know only its size and its distance to downtown.

Silverhaven Real Estate Regressions

Suppose "Windsor" is a modest mansion of 3,500 square feet, located in the outer suburbs of Silverhaven, approximately 11

miles from downtown. Based on our regression equation, how much would we expect Windsor to sell for?

Silverhaven Real Estate Regressions

We simply enter Windsor's square footage and distance to downtown into the equation, and calculate an expected selling

price of $699,938.
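The constant coefficient of the fitted equation is not quoted above, but this prediction implies a constant of about $423,004, so the arithmetic runs:

423,004 + (252 × 3,500) - (55,006 × 11) = 423,004 + 882,000 - 605,066 = $699,938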

Let's take a closer look at the coefficients in the housing example, focusing on the distance coefficient: -55,006. This

coefficient is substantially different from the coefficient in the original simple regression: -39,505. Why is it so different?

The coefficient in the simple regression and the coefficient in the multiple regression have very different meanings. In the

simple regression equation of price versus distance, we interpret the coefficient, -39,505, in the following way: for every

additional mile farther from downtown, we expect house price to decrease by an average of $39,505.

We describe this average decrease of $39,505 as a gross effect - it is an average computed over the range of variation of all

other factors that influence price.

In the multiple regression of price versus size and distance, the value of the distance coefficient, -55,006, is different, because

it has a different meaning. Here, the coefficient tells us that, for every additional mile, we should expect the price to decrease

by $55,006, provided the size of the house stays the same.

In other words, among houses that are similarly sized, we expect prices to decrease by $55,006 per mile of distance to

downtown. We refer to this decrease as the net effect of distance on price. Alternatively, we refer to it as "the effect of

distance on price controlling for house size."

Two houses are similar in size, but located in different neighborhoods: "Shangri La" is five miles farther from downtown than

"Xanadu." If Xanadu's selling price is $450,000, what would we expect Shangri La's selling price to be?

Silverhaven Real Estate Regressions

Since the two houses are the same size, we use the net effect of distance on price,

-$55,006/mile, to predict the expected difference in their selling prices. Shangri La is 5 additional miles from downtown, so

its price should be $55,006/mile * 5 miles = $275,030 less than Xanadu's, or $450,000 - $275,030 = $174,970.

"Valhalla" is another house located 5 miles farther from downtown than Xanadu. We have no information about the relative

sizes of the two homes. If Xanadu's selling price is $450,000, what would we expect Valhalla's selling price to be?

Silverhaven Real Estate Regressions

Since we cannot assume that the sizes of the two houses are equal, we should not control for size. Thus we use the gross effect

of distance on price,

-$39,505/mile, to predict the expected difference in the two homes' selling prices. Valhalla is 5 additional miles from

downtown, so its price should be $39,505/mile * 5 miles = $197,525 less than Xanadu's, or $450,000 - $197,525 = $252,475.

Let's try to build our intuition about the difference in the distance coefficients in the simple and multiple regressions. The

coefficients are different because they have different meanings. But what exactly accounts for the drop from -39,505 to

-55,006?

In the multiple regression, by essentially considering only houses that are of equal size, we separate out the effect of house

size on price. We are left with a distance coefficient that is net relative to house size.

In the simple regression, the gross effect of distance, -$39,505/mile, represents an average over the range of house sizes. As

such, it also captures some of the effect that house size has on price. Let's take a closer look at the distance and house size

data.

Calculating the correlation coefficient between sizes of homes and their distances from downtown Silverhaven, we see that

there is a slight positive relationship, with a correlation coefficient of 31%. In other words, as we move farther from

downtown, houses tend to be larger.

We have seen that two things happen as we move farther from downtown — housing prices drop because the commute is

longer, and house size increases. The fact that house size increases with distance complicates the pricing story, because larger

houses tend to be more expensive.

Longer distances from downtown translate into two different effects on price. One effect of distance on price is negative: as

distance increases, commute times increase and prices drop.

A second effect of distance on price is positive: as distance increases, house sizes increase, and larger houses correspond to

higher prices.

Running a multiple regression with both size and distance as independent variables helps tease out these two separate effects.

When we control for house size, we see the net effect of distance on price: prices drop by $55,006 per additional mile.

When we don't control for house size, the effect of distance alone on price is confounded by the fact that house size tends to

rise as distance increases. The "real" effect of distance on price is diminished by the relationship between price and house

size.

When we look at the net relationship between distance and price, we consider only similarly sized houses. Now we assume

that as distance from downtown grows, house size stays the same. If house size didn't increase as we moved farther out,

prices would drop more sharply: by $55,006 rather than $39,505 per additional mile.

Let's analyze the house size coefficient in a similar fashion. In the multiple regression model of home prices, the house

size coefficient is net relative to distance. The coefficient of 252 tells us to expect prices of homes equally distant from

downtown to increase by an average of $252 for each additional square foot of size.

The gross effect of house size on price is $167 per square foot, considerably less than $252, the net effect. When we do not

control for distance, house size "offsets" some of the negative effect of distance: the fact that larger houses are typically

located farther from the city counteracts some of the effect of increased house size.

We should always be careful to interpret regression coefficients properly. A coefficient is "net" with respect to all variables

included in the regression, but "gross" with respect to all omitted variables. An included variable may be picking up the

effects on price of a variable that is not included in the model — school district, for example.

Finally, we should note that these coefficients are based on sample data, and as such are only estimates of the coefficients of

the true relationship. For each independent variable, we must inspect its p-value in the regression output to make sure that

its relationship with the dependent variable is significant.

Since the p-value is less than 0.05 for both house size and distance to downtown, we can be 95% confident that the true

coefficients of the two independent variables are not zero. In other words, we are confident that there are linear relationships

between each independent variable and house price.

There are four steps we should always follow when interpreting the coefficient of an independent variable in a multiple

regression:

Summary

We use multiple regression to understand the structure of relationships between multiple variables and a dependent

variable, and to forecast values of the dependent variable. A coefficient for an independent variable in a regression

equation characterizes the net relationship between the independent variable and the dependent variable: the effect of the

independent variable on the dependent variable when we control for the other independent variables included in the

regression.

Performing in Excel

Note: Unless you have installed the Excel Data Analysis ToolPak add-in, you will not be able to perform multiple regression

analysis using the regression tool. However, we suggest you read through the following instructions to learn how Excel's

regression tool works, so you can perform multiple regression in the future, when you do have access to the Data Analysis

ToolPak.

Performing multiple regression in Excel is nearly identical to performing simple regressions. In the "Input Y range" field,

enter the range that includes the column label and the data for the dependent variable as you would for a simple

regression.

To enter data on the independent variables, the columns of all the independent variables must be contiguous and have the

same number of rows.

In the "Input X Range" field, enter the cell reference of the top cell of the independent variable appearing farthest to the

left. Following a colon, enter the cell reference of the bottom cell of the independent variable appearing farthest to the

right.

Select "Labels" so that Excel can properly label the independent variables in the output. Select "Residual Plots" to include

the residuals and residual plots of the residuals against each independent variable in the output. Enter the desired

"Confidence Level," or accept the default level of 95%.

The regression output prints the values of the coefficients for each variable in separate rows, along with the p-values for the

coefficients and the desired confidence intervals.
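Excel's regression tool is the method this course uses. As an aside, for readers who want to reproduce similar output outside Excel, here is a minimal sketch using Python's statsmodels library; the data values below are illustrative placeholders, not the actual Silverhaven figures.

import pandas as pd
import statsmodels.api as sm

# Illustrative placeholder data: price is the dependent variable; size and
# distance are the independent variables.
data = pd.DataFrame({
    "price":    [520000, 430000, 610000, 380000, 700000, 455000],
    "size":     [2400, 1900, 2800, 1600, 3200, 2000],
    "distance": [5.0, 8.5, 3.0, 11.0, 2.5, 7.0],
})

X = sm.add_constant(data[["size", "distance"]])  # adds the intercept term
model = sm.OLS(data["price"], X).fit()

# Prints a table analogous to Excel's output: coefficients, p-values, and
# 95% confidence intervals for each independent variable.
print(model.summary())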

Residual Analysis

When running a multiple regression, you must distinguish between net and gross relationships. What about R-squared? How

do you measure the predictive power of a regression model with multiple independent variables?

One of the Silverhaven houses in our data set, "Windsor," sold for $570,000, substantially less than its predicted price,

$699,938. Rumor has it that Windsor is haunted, and it was hard to find a buyer for it. The difference between the actual

price and the predicted price is the residual error, in this case -$129,938.

In simple regression, the residuals — the differences between the actual and predicted values of the dependent variables —

are easy to visualize on a scatterplot of the data. The residuals are the vertical distances from the regression line to the data

points.

When we have only one independent variable, we create a scatter plot of the data points, then draw a line through the data

representing the relationship defined by the regression equation: y = a + bx.

With two independent variables, an ordinary scatter plot will no longer do. Instead, we keep track of the third variable by

picturing a three-dimensional space.

Our data set is "scattered" in the space with distance from downtown measured on one axis, house size measured on another

axis, and the dependent variable, price, measured on the vertical axis.

The regression equation with two independent variables defines a plane that passes through the data. The residuals — the

differences between the actual and predicted prices in the data set — are the vertical distances from the regression plane to

the data points.

As in simple regression, these vertical distances are known as residuals or errors. The regression plane is the plane that

"best fits" the data in the sense that it is the plane for which the sum of the squared errors is minimized.

We can plot the residuals against the values of each independent variable to look for patterns that could indicate that our

linear regression model is inadequate in some way. Here is the residual plot analyzing the behavior of the residuals over the

range of the independent variable house size...

...and here is the residual plot analyzing the behavior of the residuals over the range of the independent variable distance.

When linear regression is a good model for the relationships studied, each of the residual plots should reveal a random

distribution of the residuals. The distribution should be normal, with mean zero and fixed variance.

The residual plot against distance in the multiple regression looks different from the residual plot against distance in the

simple regression. They represent different concepts: the first gives insight into the net relationship between distance and

price controlling for house size, and the second gives insight into the gross relationship.

With more than two independent variables, visualizing the residuals in the context of the full regression relationship becomes

essentially impossible. We can only think of residuals in terms of their meaning: the differences between the actual and

predicted values of the dependent variable.

Because residual plots always involve only two variables — the magnitude of the residual and one of the independent

variables — they provide an indispensable visual tool for detecting patterns such as heteroskedasticity and non-linearity in

regressions with multiple independent variables.
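Continuing the statsmodels sketch from the "Performing in Excel" note above (and assuming the matplotlib library is available), residual plots against each independent variable take only a few lines:

import matplotlib.pyplot as plt

# model.resid holds the residuals: actual minus predicted prices.
for var in ["size", "distance"]:
    plt.figure()
    plt.scatter(data[var], model.resid)
    plt.axhline(0, linestyle="--")  # residuals should scatter randomly around zero
    plt.xlabel(var)
    plt.ylabel("Residual")
    plt.title("Residuals vs. " + var)
plt.show()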

Summary

The residuals, or errors, are the differences between the actual values of the dependent variable and the predicted values of

the dependent variable. For a regression with two independent variables, the residuals are the vertical distances from the

regression plane to the data points. We can graph a residual plot for each independent variable to help identify patterns

such as heteroskedasticity or non-linearity.

In simple regressions, R-squared measures the predictive or explanatory power of the independent variable: the percentage

of the variation in the dependent variable explained by the independent variable. When we run the regression of house

price versus house size, we find an R-squared of 26%. The R-squared for price versus distance from downtown is 37%.

The multiple regression of price versus the two independent variables, house size and distance, also returns an R-squared

value: 90%. Here, R-squared is the percentage of price variation explained by the variation in both of the independent

variables.

The multiple regression's R-squared is much higher than the R-squared of either simple regression. But how can we be

sure we really gained predictive power by considering more than one variable?

In fact, R-squared cannot decrease when we add another independent variable to a regression — it can only stay the same

or increase, even if the new independent variable is completely unrelated to the dependent variable.

To understand why R-squared always improves when we add another variable, let's return to the simple regression of

house price versus distance from downtown. When we look at only two observations — two houses — we can find a line

that fits the two points perfectly.

R-squared for these two data points with this line is 100%. But clearly we can't use this line to explain the true

relationship between house price and distance. The high R-squared is a result of the fact that the number of observations,

2, is so small relative to the number of independent variables, 1.

If we have 3 observations — three houses — we can't find a line that fits perfectly. Here, R-squared is 35%.

But when we add another independent variable — no matter how irrelevant — we can find a plane that fits the three data

points perfectly, increasing R-squared to 100%. The "perfect" R-squared is again due to the fact that the number of

observations, 3, is only one more than the number of independent variables, 2.

Improving R-squared by adding irrelevant variables is "cheating." We can always increase R-squared to 100% by adding

independent variables until we have one fewer than the number of observations. We are "over-fitting" the data when we

obtain a regression equation in this way: the equation fits our particular data set exactly, but almost surely does not explain

the true relationship between the independent and dependent variables.

To balance out the effect of the difference between the number of observations and the number of independent variables,

we modify R-squared by an adjustment factor. This transformation looks quite complicated, but notice that it is largely

determined by n-k, the difference between the number of observations n and the number of independent variables k.
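The on-screen formula is not reproduced in this text, but the standard adjustment, with n observations and k independent variables as defined above, is:

Adjusted R-squared = 1 - (1 - R-squared) × (n - 1) / (n - k - 1)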

This adjustment reduces R-squared slightly for each variable we add: unless the new variable explains enough additional

variance to increase R-squared by more than the adjustment factor reduces it, we should not add the new variable to the

model.

Excel reports both the "raw" R-squared and the Adjusted R-squared.

It is critical to use adjusted R-squared when comparing the predictive power of regressions with different numbers of

independent variables. For example, since the adjusted R-squared of the multiple regression of house price versus house

size and distance is greater than the adjusted R-squared of either simple regression, we can conclude that we gained real

predictive power by considering both independent variables simultaneously.

A final caveat: we should never compare R-squared or adjusted R-squared values for regressions with different dependent

variables. An R-squared of 50% might be considered low when we are trying to explain product sales, since we expect to be

able to identify and quantify many key drivers of sales. An R-squared of 50% would be considered high if we were trying to

explain human personality traits, since they are influenced by so many factors and random events.

Summary

R-squared measures how well the behavior of the independent variables explains the behavior of the dependent variable.

It is the percentage of variation in the dependent variable explained by its relationship with the independent variables.

Because R-squared never decreases when independent variables are added to a regression, we multiply it by an

adjustment factor. This adjustment balances out the apparent advantage gained just by increasing the number of

independent variables.

Alice urges you to use your newfound knowledge to analyze the relationship between the Kahana's occupancy, its advance bookings, and arrivals on Kauai.

You and Alice want to find the three-way relationship between the dependent variable, the Kahana's occupancy, and the two

independent variables, arrivals on Kauai and the Kahana's advance bookings.

The first thing Alice asks you to do is to measure the strength of the relationship between the two independent variables.

Enter the correlation coefficient as a decimal number with three digits to the right of the decimal point (e.g., enter "5" as

"5.000"). Round if necessary.

The correlation between arrivals and bookings isn't especially strong, 44%.

You run the multiple regression of occupancy versus Kauai arrivals and advance bookings.

Which of the following indicates that the multiple regression with two independent variables is an improvement over both

simple regressions?

Kahana Occupancy Regressions

Do the data indicate that the regression coefficients of the two independent variables are significant at the 0.05 significance

level?

Kahana Occupancy Regressions

Suppose Leo has 200 advance bookings for the month of January and 250 advance bookings for February. What is the best

estimate of how many more guests Leo can expect in February compared to January?

Kahana Occupancy Regressions

You and Alice try to contact Leo, but he appears to be unavailable.

Leo must be with his lawyers in Honolulu. I'm sure we'll be able to get a hold of him later.

In the meantime, I can complete my research into the Excelsior's promotions. And there are some complexities of multiple

regression that you should become familiar with...

Empire Learning is a developer of educational software. CEO Bill Hartborne is making a bid for a contract to create an e-

learning module for a new client.

Preparing the bid requires an estimate of the number of labor-hours it will take to create the new module. Bill believes that

the length of a module and the complexity of its animations directly affect the amount of labor required to complete it.

Bill has data on the labor-hours Empire used to complete previous courses. He also knows the number of pages and the

animation run-time of each previous course — quantities he thinks are reasonable proxies for course length and animation

complexity, respectively.

Perform a simple regression analysis for each of the independent variables: number of pages and run-time of animations.

Empire Learning Regressions

In the simple regressions, which of the independent variables contributes significantly to the number of labor-hours it

takes Empire to create an e-learning course?

Empire Learning Regressions

The p-values for the coefficients on animation run-time and number of pages are 0.003 and 0.0002 respectively—well

below 0.05, the most commonly used level of significance. Thus, we conclude that both independent variables contribute

significantly in their respective simple regressions to the number of labor hours Empire takes to create an e-learning

course.

Run the multiple regression of labor-hours versus number of pages and run-time of animations.

According to this multiple regression, which of the independent variables contributes significantly to the number of labor-

hours it takes Empire to create an e-learning course?

Empire Learning Regressions

The p-values for the coefficients on animation run-time and number of pages are 0.014 and 0.0015 respectively—well

below 0.05, the most commonly used level of significance. Thus, we conclude that both independent variables contribute

significantly in the multiple regression to the number of labor hours Empire takes to create an e-learning course.

For this exercise, refer to the regression analyses performed in Exercise 1 of this section.

Empire Learning Regressions

Bill Hartborne, CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it will take his

team to create a new e-learning course. He is using data on previous courses Empire created, with the number of pages and

the total run-time of animations as independent variables.

Empire Learning Regressions

In the multiple regression of labor-hours versus number of pages and run-time of animations, what does the coefficient of

0.84 for the number of pages tell us?

Empire Learning Data

Empire Learning Regressions

In the multiple regression equation, the coefficient of the independent variable "number of pages" is gross relative to:

Empire Learning Regressions

For this exercise, refer to the regression analyses performed in Exercise 1 of this section.

Empire Learning Regressions

Bill Hartborne, the CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it will take

his team to create a new e-learning course. He is using data on previous courses Empire created, with the number of pages

and the total run-time of animations as independent variables.

Empire Learning Regressions

Bill bills out his talent at $70/hour. Based on the multiple regression, how much should he charge for the labor content of a

course with 400 pages and 170 seconds of animations? Enter the estimated cost of the labor (in $) as an integer (e.g., enter

"$5.00" as "5"). Round if necessary.

Empire Learning Regressions

First use the regression equation to predict the number of labor-hours required to complete the course.

Empire Learning Regressions

Then multiply that number by Empire Learning's billing rate of $70/hour to find the total amount he should charge for the

labor content of the course, $77,210.
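The hour count is implied by the quoted total: the regression predicts 1,103 labor-hours, and 1,103 hours × $70/hour = $77,210.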

Empire Learning Regressions

Bill is sure that the client will balk at a labor bill of over $70,000. He knows that animation is important to the client, so he

doesn't want to cut corners there. However, he believes that his lead writer can cover the content in fewer pages without

compromising his renowned clear and engaging prose.

Empire Learning Regressions

To reduce total labor costs to $70,000, how many pages must Bill cut from the plan to meet his client's cost limits?

Empire Learning Regressions

To reduce the labor bill from $77,210 to $70,000, Bill must reduce labor costs by $7,210. To achieve this reduction, Bill

must cut the contract's labor hours by 103 hours, since he bills out his talent at $70/hour.

Empire Learning Regressions

Since the animation run time will not change, we use the net relationship between labor-hours and number of pages, which

tells us that each additional page consumes 0.84 labor hours. Thus, Bill must reduce the number of pages by 123.
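The arithmetic: $7,210 ÷ $70/hour = 103 hours, and 103 hours ÷ 0.84 hours/page ≈ 123 pages.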

Empire Learning Regressions

"I expect Leo will call us this evening after his meeting with the lawyers," Alice predicts. "I hope things are going well. If Mr.

Pitt's lawsuit materializes, Leo might not have much of a business left to help him with."

I just spent the whole day at my lawyers' offices. Please give me some good news about the occupancy problem.

Well, we've found a regression model that incorporates arrivals on Kauai and advance bookings. We're now able to explain

about 86% of the variation in occupancy.

That's great. That's so much better than the 39% you calculated using only advance bookings as the independent variable. I

should be able to make much more reliable predictions based on your new model!

Unfortunately, no. Although this new model helps us understand why your occupancy varies, we can't exactly use the model

to make predictions.

You can use advance bookings to make a prediction about occupancy in a given month because the bookings are known to

you ahead of time. But you won't get the data on the number of arrivals in a month until it's too late and your guests are

already on your doorstep.

That's terrible! Sure, it's nice to know how today's occupancy is affected by today's arrivals. But I need to make business

decisions! I need to know one month in advance how many staffers to hire! Please, isn't there something you can do?

Don't give up yet, Leo. We still have a number of statistical approaches at our disposal. We'll have something for you when

you get back from Honolulu.

Multicollinearity

"We need some more advanced statistical tools to find a regression suitable to Leo's purposes," Alice tells you. "And there is

still a pitfall you'll need to learn to avoid when using multiple regression."

Another key factor that influences the price of a house is the size of the property it is built on - its "lot size." Naturally, we'd

pay more for a spacious acre of land than for 800 cramped square feet.

Using new data on the lot sizes of the 15 sample houses in Silverhaven, run the simple regression of house price on lot size. If

we don't control for any other factors, how much of the variation in price can be explained by variation in lot size? Enter R-

squared as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50"). Round if

necessary.

Silverhaven Real Estate Regressions

Variation in lot size explains 30 percent of the variation in home prices. Do the data provide evidence that there is a significant linear relationship

between house price and lot size?

Silverhaven Real Estate Regressions

The low p-value of 0.033 tells us that we can be confident that the gross relationship between lot size and home price is

significant. What happens when we add lot size as a third independent variable in our multiple regression of price on three

independent variables: house size, distance from downtown Silverhaven, and lot size?

Run a multiple regression of price on the three independent variables: house size, distance, and lot size. How does the

addition of the new independent variable, "lot size" affect the predictive power of the regression model?

Silverhaven Real Estate Regressions

By adding the independent variable "lot size", we improve adjusted R-squared slightly: from 89% to 91%, telling us that the

predictive power of the regression has improved. What about the significance of the independent variables? Has adding the

new variable changed the p-values of the coefficients?

Something odd has happened. In our earlier regression with two independent variables, the p-values for both the house size

and the distance coefficients were less than 0.05. Now, adding lot size into the equation has somehow raised the p-value for

house size.

The new p-value for house size, 0.2179, is so high that there is no longer evidence of a significant linear relationship between

price and house size after taking lot size and distance into account. How do we explain this drop in significance?

When a multiple regression delivers a surprising result such as this, we can usually attribute it to a relationship between two

or more of the independent variables. Let's look at the data on house size and lot size.

The high correlation coefficient between house size and lot size — 94% — is the culprit in the Case of the Dropping

Significance.

When two of the independent variables are highly correlated, one is essentially a proxy for the other. This phenomenon is

called multicollinearity. In our example, lot size is a good proxy for house size.

Both house size and lot size contribute to the price of a home. But because these two variables are closely correlated in our

data set, there is not enough information in the data to discern how their combined contributions should be attributed to

each of these two variables.

The net effect of house size on price should tell us the effect of house size on price assuming that the lot size is fixed.

However, we can't detect this effect in the data: house size and lot size are so closely related that we've never seen house size

vary much when lot size is fixed.

Would dropping the variable house size improve the predictive power of the regression model? The multiple regressions with

and without house size have different numbers of independent variables, so we use adjusted R-squared to compare their

predictive power.

Without house size, adjusted R-squared is 90.89%, slightly lower than 91.40%, the adjusted R-squared for the regression

including house size. Thus, although the regression model cannot accurately estimate the effect of house size when we control

for lot size and distance, the addition of house size does help explain a bit more of the variance in selling price.

A common indication of lurking multicollinearity in a regression is a high adjusted R-squared value accompanied by low

significance for one or more of the independent variables. One way to diagnose multicollinearity is to check if the p-value

on an independent variable rises when a new independent variable is added, suggesting strong correlation between those

independent variables.
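A quick numerical check is the pairwise correlation matrix of the independent variables. Here is a minimal sketch in Python with pandas; the numbers are illustrative, not the Silverhaven data.

import pandas as pd

predictors = pd.DataFrame({
    "size":     [2400, 1900, 2800, 1600, 3200, 2000],
    "lot_size": [9800, 7400, 11500, 6300, 13200, 8200],
    "distance": [5.0, 8.5, 3.0, 11.0, 2.5, 7.0],
})

# Off-diagonal entries near +1 or -1 flag pairs of independent variables
# whose separate effects a regression will struggle to distinguish.
print(predictors.corr())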

How much of a problem is multicollinearity? That depends on what we are using the regression analysis for. If we're using

it to make predictions, multicollinearity is not a problem, assuming as always that the historically observed relationships

among the variables continue to hold going forward.

In the house price example, we'd keep the house size variable in the model, because its presence improves adjusted R-

squared and because our judgment would suggest that house size should have an impact on price separate from the effect

of lot size.

If we're trying to understand the net relationships of the independent variables, multicollinearity is a serious problem that

must be addressed. One way to reduce multicollinearity is to increase the sample size. The more observations we have, the

easier it will be to discern the net effects of the individual independent variables.

We can also reduce or eliminate multicollinearity by removing one of the collinear independent variables. Identifying

which variable to remove requires a careful analysis of the relationships between the independent variables and the

dependent variable. This is where a manager's deep understanding of the dynamics of the situation becomes invaluable.

In our home price example, we'd expect both house size and lot size to have significant and discernable effects on the price

of a home: a shack on an acre of land should cost less than a mansion on a similar property.

To better understand the net effect of house size, we should probably gather a larger sample to reduce multicollinearity. If

we didn't expect lot size and house size to have distinct effects on price, we might remove house size from the equation.

Summary

Multicollinearity occurs when some of the independent variables are strongly interrelated: distinguishing the respective

effects of some of the independent variables on the dependent variable is not possible using the available data.

Multicollinearity is typically not a problem when we use regression for forecasting. When using regression to understand

the net relationships between independent variables and the dependent variable, multicollinearity should be reduced or

eliminated.

Lagged Variables

In the Silverhaven real estate example, we looked at a number of houses and collected data on four characteristics: price,

house size, lot size, and distance from the city center. These are cross-sectional data: we looked at a cross-section of the

Silverhaven real estate market at a specific point in time.

A time series is a set of data collected over a range of time: each data point pertains to a specific time period. Our

EasyMeat data is an example of a time series — we have data on sales and advertising levels for ten consecutive years.

Sometimes, the value of the dependent variable in a given period is affected by the value of an independent variable in an

earlier period. We incorporate the delayed effect of an independent variable on a dependent variable using a lagged

variable.

The effects of variables such as advertising often carry over into later time periods. For example, last year's EasyMeat

advertising levels may continue to affect this year's sales.

To study this carry-over effect, we add the lagged variable "previous year's advertising." Our regression now has two

independent variables: the current year's advertising level and last year's advertising level.

To run a regression on the lagged EasyMeat advertising variable, we first need to prepare the data for the lagged variable:

we copy the column of advertising data over to a new column, shifted down by one row.

The first and last data points draw attention: the first has all the necessary data except a lagged advertising value, and the

last has a lagged value but no other information. Since we need observations with data on all variables, we are forced to

discard the first observation as well as the extraneous piece of information in the lagged variable column.

In effect, by introducing a lagged variable, we lose a data point, because we have no value for "previous year's advertising"

for the first observation.
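For readers working outside Excel, the same shift-and-discard preparation is a few lines in Python with pandas. This is a sketch with illustrative numbers, not the actual EasyMeat data.

import pandas as pd

df = pd.DataFrame({
    "year":        list(range(2001, 2011)),
    "sales":       [1.2, 1.4, 1.3, 1.7, 1.9, 1.8, 2.1, 2.3, 2.2, 2.5],
    "advertising": [0.3, 0.4, 0.3, 0.5, 0.6, 0.5, 0.7, 0.8, 0.7, 0.9],
})

# Shift advertising down one row so that each row also carries the previous
# year's advertising level.
df["advertising_lag1"] = df["advertising"].shift(1)

# The first observation has no lagged value, so it is discarded -- this is
# the lost data point described above.
df = df.dropna()
print(df)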

We run the regression as we would any ordinary multiple regression. The equation we obtain is:

EasyMeat Sales Regressions

Comparing the output from the regressions with and without the lagged variable, we first notice that the number of

observations has decreased from 10 to 9. Does the addition of the lagged advertising variable improve the regression?

The lagged variable decreases adjusted R-squared from 84.11% to 76.62%. Moreover, the lagged variable's high p-value —

0.3305 — tells us that its effect is too insignificant for us to distinguish from the current year's advertising's effect. We

conclude that the addition of the lagged variable has not improved the regression.

We can introduce variables with even longer lags. We might try to use advertising data from two or more years in the past

to help explain the present sales level. However, for each additional period of lag we lose another data point.

Running the EasyMeat regression with the current year's advertising, last year's advertising, and the advertising level from

two years ago leaves only eight observations. We lose predictive power: adjusted R-squared drops from 76.62% to 62.47%.

Moreover, none of the coefficients is significant at the 0.05 level.

Adding a lagged variable is costly in two ways. The loss of a data point decreases our sample size, which reduces the

precision of our estimate of the regression coefficients. At the same time, because we are adding another variable, we

decrease adjusted R-squared.

Thus, we include a lagged variable only if we believe the benefits of adding it outweigh the loss of an observation and the

"penalty" imposed by the adjustment to R-squared.

Despite these costs, lagged variables can be very useful. Since they pertain to previous time periods, they are usually

available ahead of time. Lagged variables are often good "leading indicators" that help us predict future values of a

dependent variable.

Summary

We can use lagged variables when data consist of a time series, and we believe that the value of the dependent variable at

one point in time is related to the value of an independent variable at a previous point in time. Lagged variables are

especially useful for prediction since they are available ahead of time. However, they come at a cost: we lose one

observation for each time series interval of delayed effect we incorporate into the lagged variable.

Dummy Variables

So far, we've been constructing regression models using only quantitative variables such as advertising expenditures,

house size, or distance. These variables by their nature take on numerical values.

But many variables we study are qualitative or categorical: they do not naturally take on numerical values, but can be

classified into categories.

For example, in our house price example, a categorical variable might be the home's primary construction material: wood,

brick, or straw. Assigning numerical values to such categories doesn't make sense — wood cannot be described as a brick

plus some number. How do we incorporate categorical variables into a regression analysis?

Let's look at an example from Julius Tabin's EasyMeat business. We have found that EasyMeat sales are determined to a

large extent by how much he spends on advertising. Another factor affecting sales is which flavor of EasyMeat Julius

features in his advertising campaigns.

Julius produces both EasyMeat Classic, made from beef for the most part, and EasyMeat Poulk!, a pork and poultry blend.

Hoping to match the right spreadable meat flavor to the mood of the nation, Julius features Poulk! in his ad campaigns in

some years. In other years, Classic takes the lead role.

To study the effects of Julius' flavor-of-the-year choice, we use a type of categorical variable called a dummy variable. A

dummy variable takes on one of two values, 0 or 1, to indicate which of two categories a data point falls into.

This dummy variable — we'll call it "Poulk!" flavor — is set to 1 for years when EasyMeat ads feature Poulk!, and 0 for years

when they feature Classic.

Run the regression on the data, with the sales as the dependent variable, and advertising and the Poulk! flavor dummy

variable as the independent variables. What is the coefficient for the Poulk! flavor variable? Enter the coefficient for the

Poulk! flavor variable as an integer (e.g., "5"). Round if necessary.

EasyMeat Sales Regressions

We find a coefficient of 533,024 for Poulk! flavor. What does this coefficient tell us?

The coefficient 533,024 tells us that after controlling for the level of advertising expenditures, average sales are $533,024

higher when Poulk! is the flavor-of-the-year, rather than EasyMeat Classic.

This regression model can be expressed graphically: essentially we have two parallel regression lines: one for years in

which Classic is promoted, and one for "Poulk! Years." The vertical distance between the lines is the average increase in

sales Julius would expect if he chooses to feature Poulk! in a given year, after controlling for advertising level.

The slope of each line is the same: it tells us the average increase in sales when Julius spends another dollar in advertising

after controlling for which flavor is featured.

In this example, we arbitrarily chose to set the dummy variable equal to zero for the Classic flavor, i.e. to make Classic the

"base case" flavor. If we made Poulk! the base case flavor, we would obtain exactly the same graph. The coefficient on flavor

now would be -533,024, but it would again indicate that Julius should expect average sales to be $533,024 lower in the

years that his ads feature Classic rather than Poulk!.

As for any independent variable in a regression, we can test a dummy variable's significance by calculating a p-value for its

coefficient. When the p-value is less than 0.05, we can be 95% confident that the coefficient is not zero and that the dummy

variable is significantly related to the dependent variable.

If Julius produced and promoted more than two flavors, we'd need another dummy variable for each additional flavor. In

general, we'd need one fewer dummy variable than the number of flavors: one variable for each of the flavors except for

Classic, the base case.

For each year that a given flavor is featured, its dummy variable is set to one, and all other flavor variables are set to zero.

In a "Classic Year," all dummy variables are set to zero.

Why don't we have exactly as many dummy variables as categories? Suppose we have just two categories — Poulk! and

Classic — and we tried to use separate dummy variables for each — P and C. In a Poulk! Year, P = 1 and C = 0. In a Classic

Year, P = 0 and C = 1. The second dummy variable would be completely redundant: since there are only 2 flavors, if P = 0

we know the flavor is not Poulk!, so it must be Classic.

Technically, including both dummy variables would be problematic because they would be perfectly correlated: whenever P

= 1, C = 0, and whenever P = 0, C = 1. Perfectly correlated variables are perfectly collinear: we gain nothing by including

both and must drop one to avoid crippling multicollinearity.
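In pandas, leaving out the base case is built into the dummy-encoding helper. A minimal sketch with illustrative data:

import pandas as pd

flavors = pd.Series(["Classic", "Poulk!", "Poulk!", "Classic", "Poulk!"])

# drop_first=True omits the alphabetically first category ("Classic"),
# making it the base case and avoiding perfect collinearity.
dummies = pd.get_dummies(flavors, drop_first=True, dtype=int)
print(dummies)  # a single "Poulk!" column: 1 in Poulk! years, 0 otherwise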

It is useful to note that running a simple regression analysis with only a dummy variable is equivalent to running a

hypothesis test for the difference between the means of two populations. In this case, one population would be sales in

Poulk! years, and the other would be sales in Classic years. From the data we find the mean and standard deviation of sales

in Poulk! and Classic years. The hypothesis test gives us a p-value of 0.43, so we cannot conclude that the means differ

when the feature flavor is different.

A simple regression of sales on the dummy variable flavor gives us the same result. The p-value on the dummy variable is

0.45, telling us that the flavor dummy variable is not statistically significant. The p-values differ slightly for the two

approaches due to differences in the way Excel calculates the terms, but the tests are conceptually equivalent.
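To see the t-test side of this equivalence concretely, here is a minimal sketch in Python using scipy, with illustrative sales figures rather than the actual EasyMeat data; regressing the same figures on a 0/1 flavor dummy yields the same p-value, up to the kind of small computational differences noted above.

import numpy as np
from scipy import stats

# Illustrative sales figures, split by which flavor was featured that year.
poulk_sales   = np.array([2.1, 2.4, 1.9, 2.6, 2.2])
classic_sales = np.array([1.8, 2.3, 2.0, 1.7, 2.1])

# Pooled-variance two-sample t-test for a difference between the group means.
t_stat, p_value = stats.ttest_ind(poulk_sales, classic_sales)
print(t_stat, p_value)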

Summary

We can use dummy variables to incorporate qualitative variables into a regression analysis. Dummy variables have the

values 0 and 1: the value is 1 when an observation falls into a category of the qualitative variable and 0 when it doesn't.

For qualitative variables with more than two categories, we need multiple dummy variables: one fewer than the number

of categories.

With one eye on the lookout for signs of lurking multicollinearity and two new types of variables in your toolbox, you set out

to settle the staffing problem for good.

You might have noticed that the number of arrivals follows an annual seasonal pattern. Arrivals tend to drop off during the

late summer, surge again in October, and drop very low for the rest of the year.

They pick up briefly in February, but the tourist business slows through the spring. During the early summer, vacationers

start arriving in droves, with arrivals peaking in June or July.

Perhaps we can use the seasonality of arrivals in some way. We might come up with a lagged variable that functions as a

proxy for the current month's arrivals. Then we can run a regression with the lagged variable. Since the values of a lagged

variable would be based on historical data, Leo would know them ahead of time.

Given that arrivals follow this seasonal pattern, which of the following variables is likely to be a good proxy for this month's

arrivals on Kauai?

You run the simple regression of Kahana occupancy versus 12-month lagged arrivals.

What is the lowest level at which the "lagged arrivals" variable is significant?

Kahana Occupancy Regressions

The regression output shows a p-value of 0.000 for the lagged arrivals coefficient, far below the significance level of 0.01.

At 55%, the R-squared for the simple regression on lagged arrivals is much lower than for the simple regression on current

arrivals, 80%. The adjusted R-squared is only 53%. Still, lagged arrivals have substantial predictive power. Run a multiple

regression of occupancy versus lagged arrivals and advance bookings.

What is the adjusted R-squared for this multiple regression? Enter the adjusted R-squared as a decimal number with two

digits to the right of the decimal point (e.g., enter "50%" as "0.50").

Kahana Occupancy Regressions

The adjusted R-squared of 60% for the multiple regression is higher than the 53% for the simple regression. But is that the

best you can do? For the moment, you report your analysis to Alice.

This model with the lagged variable is more useful to Leo, even if its predictive power is still pretty low. But let's look at these

data on Leo's competition that I dug up. Maybe we can use them to give us an even better model.

Leo's competitor, Knut Steinkalt at the Hotel Excelsior, frequently launches promotion campaigns: in some months, Steinkalt

slashes room prices dramatically.

These data show the months in which the Excelsior's promotions took place in the last three years. We can use a dummy

variable to see how the competition's promotions have affected Leo's occupancy.

The Excelsior has to advertise these promotional packages at least a month in advance to attract customers. As long as Leo

keeps an eye on the Excelsior's advertising, he'll be alerted to these promotions in enough time to take them into account

when he makes staffing decisions for the following month.

Using Alice's research, you run a simple regression of occupancy versus the promotions. Then you run the multiple regression

of occupancy versus advance bookings, lagged arrivals, and Excelsior promotions.

Suppose Leo has used the multiple regression equation to predict July's occupancy using a given level of advance bookings

and last July's arrivals. Just before he makes his staffing decisions, he learns that, unexpectedly, his rival Knut has cut the

Excelsior's room prices for July. Leo should revise his predicted occupancy by

Kahana Occupancy Regressions

The gross effect of Excelsior promotions on Kahana occupancy, a reduction of 52 guests, is not relevant here since we know

the advance bookings and lagged arrivals for July are still the same as they were before the Excelsior launched its campaign.

Leo wants to use the net effect of the promotion on his occupancy levels, assuming advance bookings and lagged arrivals

remain fixed — an average reduction in occupancy of 60 guests.

Kahana Occupancy Regressions

This is good work. Naturally, I'd like to have even greater predictive power than 84%, but I realize you aren't psychics. This

model will really help me when I hire staff. Thanks!

I have some good news of my own. Mr. Pitt agreed not to file suit against me! I did have to promise him an extended stay in

the Kahana's penthouse, free of charge.

He's coming next month — there's one occupant I don't need to use statistics to predict. This time, I'll serve the bisque myself.

Linda Szewczyk, marketing director of Amalgamated Fruits Vegetables & Legumes (AFV&L), is researching the nation's

fruit consumption habits. In particular, she would like greater insight into household consumption of the kiwana, a cross-

breed of kiwis and bananas that AFV&L pioneered.

Naturally, one important determinant of household consumption is the size of the household — the number of members.

Since AFV&L has positioned the kiwana as a "high end" fruit, Linda believes that household income may also influence its

consumption.

Run a multiple regression of household kiwana consumption versus household size and income. Make note of important

regression parameters such as R-squared, adjusted R-squared, the coefficients, and the coefficients' significance. The

income variable has a coefficient of 0.0004. Can a variable with such a small coefficient be statistically significant?

The independent variable, income, is statistically significant since its p-value is less than 0.05, the most common level of

significance. The small coefficient tells us that for every additional $10,000 of income, average kiwana consumption

increases by 4 lbs. a year.
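In other words, the expected increase is 0.0004 lbs per dollar × $10,000 = 4 lbs of kiwanas per year.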

To date, AFV&L has focused its marketing campaigns on high-income, highly educated consumers. Linda would like to

deepen her understanding of how the educational level of the household members might affect their appetite for kiwanas.

To incorporate education into her kiwana consumption analysis, Linda separated the households in her data set into three

categories based on the highest level of education attained by any member of the household — no college degree, college

degree but no post-graduate degree, and post-graduate degree. She represents these categories using two dummy variables

— "college only" and "post-graduate."

Controlling for household size, income, and post-graduate degree, how many more pounds of kiwanas are consumed in a

household in which the highest educational level is a college degree, compared to a household in which no one holds a

college degree?


Kiwana Consumption Data

Kiwana Consumption Regressions

The coefficient for the dummy variable "college only," 51.63, tells us the expected difference in kiwana consumption for "college

degrees only" households compared to the excluded educational category: households in which no one holds a college

degree. The coefficient describes the net relationship between "college degrees only" and household kiwana consumption,

controlling for household size, income, and post-graduate degree.

Controlling for household size and income, how many more pounds of kiwanas are consumed in a household in which the

highest educational level is a post-graduate degree, compared to a household in which the highest educational level is a

college degree? Enter the difference in consumption between the two households as a decimal number with two digits to

the right of the decimal point (e.g., enter "5" as "5.00"). Round if necessary.

Kiwana Consumption Regressions

When you control for household size and income, college degree households consume 51.63 lbs more than non-college

households. Post-graduate degree households consume 51.95 lbs more than non-college households. In other words, post-

graduate households consume 0.32 lbs more than college households.
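To make the dummy-variable mechanics concrete, here is a minimal Python sketch of how a regression like Linda's could be run with pandas and statsmodels. The file and column names are hypothetical, and the coefficient values in the comments are the ones reported above.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("kiwana_consumption.csv")  # hypothetical file and column names

# Encode the three education categories as two dummy variables, leaving
# "no college degree" as the excluded baseline category.
df["college_only"] = (df["education"] == "college").astype(int)
df["post_graduate"] = (df["education"] == "post_graduate").astype(int)

X = sm.add_constant(df[["household_size", "income", "college_only", "post_graduate"]])
model = sm.OLS(df["kiwana_lbs"], X).fit()
print(model.summary())  # R-squared, adjusted R-squared, coefficients, p-values

# Net post-graduate vs. college-only difference, controlling for size and income
# (51.95 - 51.63 = 0.32 lbs in the course data).
print(model.params["post_graduate"] - model.params["college_only"])
```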

Kiwana Consumption Regressions

For this exercise, refer to the regression analyses you ran in exercise 1 of the previous section.

Empire Learning Regressions

Bill Hartborne, the CEO of Empire Learning, is using regression analysis to predict the number of labor-hours it will take

his team to create a new e-learning course. He is using data on previous courses Empire created, with the number of pages

and the total animation run-time as independent variables.

Empire Learning Regressions

Bill believes that the number of illustrations used in the course may also have a significant impact on the number of labor-

hours it takes to complete an e-learning course. He wants to add the number of illustrations to the model as another

independent variable.

Empire Learning Regressions

Run the multiple regression of labor-hours versus number of pages, illustrations, and animation run-time.

Empire Learning Regressions

A common symptom of multicollinearity is a high adjusted R-squared — in this case 94% — accompanied by one or more

independent variables with low significance. Here, the coefficient for the number of illustrations is not significant at

the 0.05 level, and the p-value for the number of pages has risen to 0.0291, up from 0.0015 in the regression without

illustrations.


Empire Learning Data

Empire Learning Regressions

Multicollinearity occurs when the respective effects of two or more independent variables on the dependent variable are

not distinguishable in the data. This can be the result of correlated independent variables. The fact that the p-value for the

number of pages rises when we add the illustrations raises our suspicions that the number of illustrations and the number

of pages might be correlated.

We can compute the correlation between the number of pages and the number of illustrations: the correlation coefficient,

67%, is fairly high.

We could also attempt to diagnose the cause of the multicollinearity by running a regression of labor-hours versus number

of pages and number of illustrations - omitting animations. Here, the significance of illustrations is extremely low, with a

p-value of 0.85.

In the regression of labor-hours versus number of illustrations and run-time of animations - omitting pages - the respective

effects of the independent variables on the dependent variable can be distinguished. Here, the p-values for both variables

are much lower than 0.05. All of the evidence points to a linear relationship between number of pages and number of

illustrations as the culprit for the multicollinearity.
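Here is a minimal Python sketch of how this diagnosis could be reproduced with pandas and statsmodels. The file and column names are assumptions for illustration, not the course's actual labels.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("empire_learning.csv")  # hypothetical file and column names

# Correlation between the two suspect predictors (about 0.67 in the course data).
print(df["pages"].corr(df["illustrations"]))

def pvalues_for(predictors):
    """Regress labor-hours on the given predictors and return the p-values."""
    X = sm.add_constant(df[predictors])
    return sm.OLS(df["hours"], X).fit().pvalues

# Watch how each variable's significance shifts as predictors are added or dropped.
print(pvalues_for(["pages", "animation_runtime"]))                   # original model
print(pvalues_for(["pages", "illustrations", "animation_runtime"]))  # all three
print(pvalues_for(["pages", "illustrations"]))                       # omitting animations
print(pvalues_for(["illustrations", "animation_runtime"]))           # omitting pages
```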

Bill wants to use the regression analysis to predict the number of labor-hours it will take to complete a new e-learning

course. Comparing the two regression models - one with all three independent variables, one without illustrations - which

should Bill use?

Empire Learning Regressions

Decision Analysis

Introduction

Three more days before you leave Hawaii. As excited as you are by your imminent debut at business school, you don't look

forward to your departure. You find yourself weighing the relative merits of business school and just staying on Kauai. Leo has

an opening in the restaurant...

The shrill ring of the telephone interrupts your reverie. Sounding excited even by Leo standards, the hotelier demands your

immediate attention in his office.

I've had an exciting business venture on my mind for some time: I want to run a floating restaurant. It's going to be amazing,

offering spectacular views of volcanic silhouettes and the bright lights of resort towns, the best Hawaiian culinary artistry,

tastefully alluring hula dancing and music, ...

Slow down, slow down, Leo. I'm interested to hear your idea, but let's focus on the big picture first. A floating restaurant?

I want to run a world-class restaurant on a yacht. Chez Tethys, I'll call it. I've had the idea for some time. Chez Tethys will be

Hawaii's most luxurious dining experience. The yacht will travel from island to island, docking in a different harbor each

night.

Seating will be limited, and you can board only at selected times. The challenge alone of making a reservation will make it

irresistible. It would become a matter of prestige to include a meal Chez Tethys on any trip to the islands.

The Chez Tethys has been on my mind since before I inherited the Kahana. I'd been looking for investors, but without luck.

Then the Kahana fell into my lap, and the hotel has preoccupied me since then.

I'm beginning to feel confident about running the hotel — thanks to you two, I should add. Now I feel like I'm ready to expand

my operations. In addition to its own profits, just think of the prestige a high-profile floating restaurant would add to the

Kahana!

Listen, Leo, this is an intriguing proposition, but I'm not fully convinced that a floating restaurant would really catch on. Let's

go over your business plan, and then we'll think about how likely it is that the Tethys will be a profitable venture. One quick

question: How do you intend to finance the Tethys?

I'm way ahead of you. I've already been to the bank, and I think I can take out a loan using the Kahana as collateral. I have a

really good feeling about the Tethys. I'm pretty confident that I can make a floating restaurant work.


"Leo's Chez Tethys idea is out of the ordinary, but very interesting," Alice muses. "But we'll need to guide him through the

decision to expand his operations so extravagantly, especially since he's putting the Kahana at risk. We wouldn't want him to

make a choice he'll regret."

S&C Films, a small film production company, just purchased a script for a new film called Cloven. Cloven's unusual plot has

the makings of a cult classic: a pastoral romance shattered by a livestock revolution and the creation of a poultry-dominated

totalitarian farm state. S&C's owner and producer, Seth Chaplin, enthusiastically pitches the film: "It's Animal Farm meets

Casablanca."

Seth entered into negotiations with two major studios — Pony Pictures and K2 Classics — and hammered out a prospective

deal with each of them.

Pony Pictures liked the script and agreed to pay S&C $10 million to produce Cloven, covering the production costs and a

profit margin for S&C. Seth estimates Cloven would cost S&C $9 million to produce. Under this agreement, Pony would

acquire all rights to the movie.

The second deal, negotiated with K2 Classics, stipulates that K2 would market and distribute the movie, and S&C would cover

the production costs itself. Under this agreement, K2 and S&C would split ownership of the movie. Seth believes that the

movie has huge potential. Even with partial ownership, S&C would reap enormous profits if Cloven becomes a blockbuster.

This deal specifies that K2 would collect all revenues until its marketing and distribution costs are covered, then take 35% of

any further revenues. S&C would collect the remaining 65%. If Cloven is a blockbuster, S&C will make a killing. If it flops,

S&C will lose some or all of its production investment.

Decision-making is an essential management responsibility. Managers are charged with choosing from multiple courses of

action that can each lead to radically different consequences. Seth's decision about how to finance Cloven's production might

lead to anything from blockbuster profits to financial disaster.

When faced with a choice of actions and a range of possible outcomes, decision making can be difficult, in large part due to

uncertainty. Although the course of action we choose influences the outcome, much of what happens is not only beyond our

control, but beyond our powers to predict with certainty.

Nonetheless, decisions like Seth's that involve uncertainty can be analyzed logically and rigorously: decision analysis tools

can help managers weigh alternative options and make informed and rational choices.

In a world of uncertainty, applying decision analysis will not guarantee that each decision you make will lead to the best

result, or even to a good result. But if you apply effective decision analysis consistently over the course of a managerial career,

you are almost certain to gain a reputation for sound judgment.

Summary

Decision analysis is a set of formal tools that can help managers make more informed decisions in the face of uncertainty.

Although applying even the most rigorous decision analysis does not guarantee infallibility, it can help you make sound

judgments over the course of your career.

Decision Trees

"The problem with Leo's floating restaurant is that its success hinges on so many factors," Alice tells you. "Some, Leo can

predict, like the approximate price of a yacht, or the cost of labor. But what about consumer demand? Truth be told, not even

Leo knows whether the Chez Tethys will catch on. Leo has to incorporate this uncertainty into his decision-making process."

Producer Seth Chaplin boasts that when Cloven is released in theaters, it will become an instant classic. But even an incurable

enthusiast like Seth would never claim that success is certain, and if asked how confident he is that Cloven will achieve at

least modest success, he might reply with a percentage: "80%."

Uncertainty clearly has a quantitative dimension: we can distinguish between degrees of uncertainty.

But our intuition about uncertainty is often underdeveloped. How do we measure uncertainty consistently and

systematically?

Probability is a measure of uncertainty. A meteorologist reports the probability of rain in her forecast: "We'll see cloudy skies

tomorrow, with a 20% chance of rain."


Probability is measured on a scale from 0 to 1, or 0% to 100%. Events that are impossible are said to have zero (or zero

percent) probability, and events that are absolutely certain are said to have a probability of one (or 100%). But how do we

interpret probabilities of 20% or 37.8%?

Let's look at a very simple and familiar uncertain event: a spin of a wheel of fortune. Our wheel consists of two equally-sized

areas: an orange half and a green half. Spin the wheel a few times and count the number of green and orange outcomes you

get.

Since the two areas are of equal size, we'd expect "greens" and "oranges" to occur with equal frequency. About half of the

spins will end up "green" and half will end up "orange." Of course, if we only spin a few times, we won't expect exactly half of

the spins to result in "green." But the more times we spin, the closer we expect the percentage of greens will be to 50%. This is

the probability of getting a "green" each time we spin.

This interpretation of probability is a relative frequency interpretation. We have a set of events — in our case, spins — and

a set of outcomes — "green" or "orange." We form the ratio of the number of times a particular color occurs to the total

number of spins. The probability of that particular color is the value that ratio approaches as the total number of spins

approaches infinity.
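A short simulation makes the relative frequency interpretation concrete: the fraction of "green" spins drifts toward the true probability as the number of spins grows. This is a minimal sketch in Python, not part of the course materials.

```python
import random

def green_fraction(n_spins, p_green=0.5):
    """Spin the 50/50 wheel n_spins times; return the fraction of greens."""
    greens = sum(random.random() < p_green for _ in range(n_spins))
    return greens / n_spins

for n in (10, 100, 10_000, 1_000_000):
    print(n, green_fraction(n))  # settles toward 0.5 as n increases
```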

Let's look at a different wheel. This wheel is also divided into two areas, but now the green area is larger: it covers three

quarters of the wheel.

What is the probability of a spin of this wheel resulting in the outcome "green"?

Since the green area covers three quarters of the whole wheel, we can expect that three out of four spins will result in a green

outcome. This corresponds to a probability of 75%.

For simple uncertain events like a wheel of fortune game, figuring out the probabilities of the outcomes is easy. There are a

limited number of possible outcomes and there are no hidden factors affecting the outcome. Besides, if we had any difficulty

believing that the probability of a "green" is 75%, we could simply spin the wheel over and over again to calculate an

approximation of the probability.

For more complex uncertain events like the weather or the success of a new venture, finding the probabilities of the outcomes

is more difficult. Many factors can affect the outcome; indeed, we may not even know all the possible outcomes. Furthermore,

there may be no simple experiment we can repeatedly run to generate approximations of the relative frequencies. How

should we think about probabilities in these situations?

Let's think about the weather. Anytime you leave your house, you face the decision of whether or not to take an umbrella. If

the sky is perfectly blue, you probably won't take one. Nor will you do so if there are a few white clouds in the sky.

On the other hand, if the sky is completely overcast, you might consider taking some kind of rain protection, and if it's already

raining, you almost certainly would. You'd be able to give a rough estimate of the probability of rain; for instance if it's

overcast, you might estimate a 40% chance of rain.

This estimate may be rough, but it isn't completely arbitrary, and it's clearly a better estimate than either 0% or 100%. Your

subjective estimate may be based on an informal assessment of relative frequency. The estimate of "40%" might be

shorthand for the reflection: "In my past experience, it has rained on a little under half the days when it has been this

overcast in the morning at this time of the year, in this geographic location."

Although you probably never sat down and tabulated the days according to season, atmospheric conditions, and rainfall, your

informal assessment is powerful enough to make a reasonable choice about taking your umbrella, though sometimes you may

make the wrong choice.

Similarly, when Seth Chaplin gives his estimate of an 80% chance of Cloven's success, he is probably articulating the

following: "In my experience, about four out of five movies of a similar genre, with a similar budget, released under similar

conditions were successful."

In the examples we've discussed thus far, our uncertainty about the events we faced has stemmed from the fact that the

events had not yet happened. Sometimes we face uncertainty for a different reason: an event may have already occurred, but

we simply don't know what the outcome was. Let's return to the wheel of fortune to see how we might think about uncertainty

in these cases.

Imagine you and a friend spin the wheel. You close your eyes and keep them shut while your friend observes the wheel. While

the wheel is spinning, both you and your friend are uncertain about the outcome, and you would each assess the probability

of "green" at 75%.

When the wheel stops, the result becomes definite. Your friend knows the outcome: for her, the probability of "green" is now

either 0% or 100%. But as long as your eyes are closed, you remain uncertain about the outcome.

If someone promised to give you $10 if you correctly guessed the outcome of that spin before you opened your eyes, which

color would you choose?


From your limited point of view, the best choice you can make is "green," even though the outcome may already be a definite

"orange." If you played the same game over and over again and always chose "green," you would win three quarters of the

games. The probability of the outcome turning out to have been "green" when you open your eyes is 75%.

Even though the event has already occurred, the outcome is uncertain from your point of view. It makes sense to measure the

uncertainty due to a lack of information the same way we measure the uncertainty due to our inability to predict the

outcomes of future events.

Summary

Uncertainty makes decision-making challenging. We can be uncertain about outcomes that have or have not occurred. To

make the most informed decisions, we quantify our uncertainty using probability measures. Probability is measured on a

scale from 0% to 100%: events with 0% probability are impossible; events with 100% probability are certain. An event's

probability can often be determined by observing the relative frequency of its occurrence within a set of opportunities for

the event to occur. For events for which relative frequencies are difficult to assess, we often make subjective estimates of

the probabilities. Though not based on "hard" data, these estimates are often a sufficient basis for sound decision making.

Alice whips out a notepad and pencil, and begins drawing something that vaguely resembles a saguaro cactus. "Let's look at

how we can analyze and structure decisions like Leo's."

Graphical representations are often a highly efficient way to organize and convey information. Scatter plots and histograms

efficiently communicate distributions of data; an organizational chart can quickly outline the structure of a company; and pie

charts effectively express information about proportions and probabilities.

Is there such a thing as a graphical tool that helps inform and organize the decision-making process?

The answer is yes, and the graphical tool is called a decision tree. Let's look at a simple decision tree.

Seth Chaplin's business decision — whether to produce Cloven for Pony Pictures or in partnership with K2 Classics —

involves two alternatives. At some point, Seth must make a decision. Until he does, there are two paths along which history

can unfold. Cloven will either be owned by Pony or co-owned by S&C Films and K2 Classics. A different branch of the tree

represents each alternative.

Decisions aren't the only points at which alternatives can branch off. Events with uncertain outcomes also lead to branching

alternatives. Once released in theaters, Cloven could be a "Blockbuster" hit, have a fair, but "Lackluster" performance, or

"Flop" completely. Each of these outcomes corresponds to a separate branch on the decision tree.

Decision trees have two types of branching points, or nodes. At decision nodes, the tree branches into the alternatives the

decision-maker can choose. We use square boxes to represent decision nodes.

At chance nodes, the tree branches because the uncertainty of the business world permits multiple possible outcomes.

We represent chance nodes by circles.

We arrange the nodes from left to right in the order in which we will eventually determine their results. In Seth's example, the

first thing to be determined is his own decision about which deal to choose. The next thing to be determined is Cloven's box

office performance.

One of the challenges of creating a decision tree is determining the correct sequence of nodes: in what order will the relevant

outcomes be determined? Which events "depend" on the occurrence of other events? What alternatives are created or

foreclosed by prior decisions or by the outcomes of uncertain events?

Drawing a decision tree forces us to delineate each alternative and clarify our assumptions about those alternatives. Thus,

even before we use a tree to make a decision, simply structuring the tree helps us clarify and organize our thoughts about a

problem. A decision tree is also useful in its own right as a tool for communicating our understanding of a complex situation

to others.

Categorizing branching points as decision or chance nodes makes clear which events and outcomes are within our power and

which are beyond our control.

Writing down alternatives in a systematic fashion often allows us to think of ideas for new alternatives. For instance, after

viewing the decision tree, Seth might realize that he would prefer a deal that pays part of his production costs but still allows

a small ownership stake as opposed to either of the deals he has hammered out with Pony and K2.

Summary

Decision trees are a graphical tool managers use to organize, structure, and inform their decision-making. Decision trees

branch at two types of nodes: at decision nodes, the branches represent different courses of action the decision-maker can


choose. At chance nodes, the branches represent different possible outcomes of an uncertain event - at the time of the

initial decision, the decision maker does not know which of these outcomes will occur (or has occurred). The nodes are

arranged from left to right, in the order in which the decision-maker will determine which of the possible branches actually

occurs.

Laying out a decision tree — mapping a logical structure and identifying the alternative scenarios — is an essential first step

in the decision-making process. But how do we compare different scenarios? On what criteria might we base our

preference for one scenario over another?

In the business world, success is often measured in profits. Seth Chaplin will consider Cloven a success if it brings in at

least a modest profit. How should Seth incorporate possible profit levels into his decision analysis?

Seth's decision tree contains four possible scenarios: four unique paths from his current decision node on the left to the

ultimate outcomes on the right. He must assign to each scenario a dollar amount — S&C's profits. For example, if Seth

chooses to produce Cloven for Pony Pictures, his expected profits will be $1 million.

What if Seth chooses the K2 deal? If he has a significant ownership stake in the movie, his profits will depend on Cloven's

box office success. With a "Blockbuster" movie come "Blockbuster" profits: Seth estimates around $6 million.

If Cloven's performance is "Lackluster," Seth thinks the Cloven project would break even for S&C Films: S&C's expected

profits would be roughly $0. If Cloven "Flops," S&C's production costs will exceed its revenues from the movie, leading to

an expected loss of $2 million.

Are these three the only possible outcomes? Clearly, no. S&C's profits on the release of Cloven could be $6.6 million or

$15.03. In fact there is a range of outcomes stretching from profits in the hundreds of millions of dollars — the historical

upper limit of movie profits — to a loss of $9 million — the amount S&C will invest in production.

Although it is mathematically possible to analyze the full range of possible outcomes, the procedure for doing so is usually

more complicated and time consuming than is warranted for many decision problems. In practice, considering only a few

representative scenarios as we have done in the Cloven example will typically lead to good decisions.

When we use a scenario to represent a range of possible outcomes, the outcome figure we assign to that scenario should

represent the weighted average of all outcomes in the range that scenario represents. For example, in the Cloven decision,

$0 million might represent the weighted average of all possible profits from -$3 million to +$3 million.

How can Seth use his decision tree and estimated profit values to choose the better deal?

If Seth chooses the partnership with K2, S&C could make $6 million. Clearly, that scenario is preferable to the scenario in

which S&C simply sells Cloven to Pony for a profit of $1 million.

If S&C retains part ownership of Cloven and it "Flops" in the theaters, S&C stands to lose $2 million. That scenario is

clearly worse than the scenario in which S&C sells Cloven to Pony.

If Cloven is highly likely to "Flop" in theaters, then Seth should choose the Pony deal; if it's highly likely to be a

"Blockbuster" he should choose the K2 deal. Clearly, the likelihood of the different outcomes in the K2 deal should weigh

heavily in Seth's decision.

To each branch emanating from a chance node we must associate a probability: the probability of that outcome occurring.

These probabilities may be based on historical data or on our best judgment of the likelihood of each outcome.

Based on the quality of the script and his years of experience in the movie industry, Seth estimates the probability of

Cloven becoming a "Blockbuster" at 30%, the probability of a fair, but "Lackluster" performance at 50%, and the

probability of a "Flop" at 20%.

If S&C sells Cloven to Pony, the level of box office success is largely irrelevant to Seth. S&C's profits are certain to be $1

million.

When assigning outcomes and probabilities to chance nodes, we must meet two requirements. First, outcomes that branch

off from the same chance node must be "mutually exclusive." Thus, for example, two outcomes emanating from the same

node cannot both occur at the same time: the occurrence of one outcome excludes the occurrence of any other.

Second, the set of outcomes that branch off from the same chance node must be "collectively exhaustive": the branches

must represent all possible outcomes.

In practice, we typically don't depict every possible outcome separately. For Cloven, we've reduced the vast range of

possible financial outcomes to a manageable set of three representative outcomes. However, we consider these three

outcomes collectively exhaustive in that they represent three ranges that together exhaust all possibilities.


Also, we usually consider extremely unlikely but not impossible events to be included in one of the existing branches, or if

sufficiently unlikely, to be irrelevant to decision-making. In the Cloven example, we don't include a branch representing

the simultaneous destruction of all existing copies of the Cloven film.

When we construct a chance node, we must make sure that the outcomes emanating from that node are mutually exclusive

and collectively exhaustive. Since it is certain that one and only one of the possible scenarios will occur, the individual

probabilities must add up to 100%.

Summary

For a decision tree to effectively inform a decision, it must incorporate two types of relevant data: endpoint values

corresponding to each scenario and the probabilities of the possible outcomes of each uncertain event. An outcome value

is associated with a scenario - a unique path from the first node on the left of the tree to an endpoint on the right. We

place the appropriate probability on each branch emanating from a chance node. Outcomes represented by branches

emanating from the same chance node must be mutually exclusive and collectively exhaustive.

Ted Nesbit is a project manager at Tardytech, a software development company. Ted is planning the project schedule and

budget for the development of a new piece of custom business software for a Fortune 500 client.

The client has imposed a strict deadline for completion of the project. However, based on past experience with similar

clients, Ted knows that there is a significant probability that Tardytech will not be able to meet its deadline due — in most

cases — to client-side delays.

Experience indicates that 56% of all Tardytech projects are completed on time, 42% are delayed due to client-side delays,

and 2% are delayed due to errors and process failures at Tardytech.

What is the probability that Tardytech's new project will not be completed on time?

The total probability of a delay is the sum of the probabilities of a client-side delay and a Tardytech-side delay: 44%.

A potential new client has asked Tardytech to report the probability that a project will not be delayed due to Tardytech

errors or process failures. What probability should Ted report?

Only 2% of all projects are delayed due to problems caused by Tardytech. The remaining projects are either completed on

time (56%) or delayed by client-side delays (42%). The probability that a project will not be delayed by Tardytech is the

sum of the probabilities of those outcomes, 98%.

Another potential client has asked Tardytech to report the percentage of all delayed projects whose delays could be

attributed to Tardytech's errors and process failures. What percentage should Ted report?

In the past, 44 out of 100 projects were not completed on time. On average, 2 of those 44 delayed projects were delayed

due to Tardytech errors and process failures. Thus, the proportion of Tardytech-caused delays among all delays is

2/44, or 4.5%. We'll discuss and analyze similar questions in greater depth in the next unit.

As a loan officer at the First Bank of Silverhaven, Carla Wu determines whether or not to grant loans to small business loan

applicants.

Of 108 recent successful small-business loan applicants with a full year of payment history, 28 were unable to meet at least

one loan payment in the first year they had an outstanding loan.

What is the probability that a randomly selected small business in this pool missed a payment in the first year of the loan?

Enter the probability as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50").

Round if necessary.

The probability that a randomly selected small business missed a payment in the first year of the loan is simply the ratio of

the number who missed at least one payment to the total number of loans, i.e., 25.9%.

Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add to her

current fleet.

If she leases the truck, she may be able to generate new business that would increase her company's profits. However, if

additional business fails to materialize, she may not be able to cover the incremental leasing costs.


Based on her assessment of future trends in the transportation sector, Robin identifies three scenarios that might

transpire: "Boom," "Moderate Growth," and "Slowdown," depending upon the performance of the economy and its impact

on the transportation sector. She associates with each scenario an estimate of the total profits an expanded fleet will

generate for her firm.

If she doesn't lease the new truck, Robin probably won't generate as much revenue, but her leasing costs will be lower.

Again, she breaks down the possible outcomes if she does not lease the additional truck into three scenarios corresponding

to the performance of the economy. Then she associates an estimate of total firm profits with each scenario.

The first branching of the tree occurs at the point when Robin makes her decision: to lease or not to lease a new truck. A

square decision node should represent this decision.

Each of Robin's options splits into three possible outcomes depending on the performance of the economy. Robin will learn

which economic scenario transpires only after making her decision. The second branching represents this uncertainty. The

nodes of these branches are chance nodes and should be drawn as circles.

You and a friend have a wheel of fortune that has a 75% chance of a "green" outcome and a 25% chance of a "red" outcome.

Before you spin, the friend offers you $100 if you correctly predict the outcome. If you choose the wrong color, you receive

nothing. The good news: you don't have to choose until the spin is complete. The bad news: you will have to keep your eyes

closed until the wheel has stopped and you have made your prediction.

You close your eyes and keep them shut while the wheel spins. When the wheel stops, your eyes are still closed. You now

must decide whether to choose "red" or "green."

The nodes of a decision tree are arranged from left to right in the order in which we discover their results, not in the order

in which the events actually occur. Thus, even though the wheel stops and the result is finalized before you make your

decision, because you don't know the result until after your decision is made, the chance node for the spin result should

appear after — to the right of — the decision node.

Alice's decision trees are neat devices to help organize and structure a decision problem. But how are they going to help Leo

evaluate his options and choose the best one?

Seth Chaplin must decide how to produce the movie Cloven. He has mapped out the logical structure of his decision in a

decision tree and has incorporated the appropriate data, evaluating each scenario in terms of its expected profits and its

probability of occurring. Now, how should he use the tree to inform his decision?

If S&C Films works in partnership with K2 and retains part ownership of Cloven, S&C could earn a profit of $6 million, a

highly desirable outcome. However, that scenario is relatively unlikely: 30%. How do we balance the high value of that

outcome against its low probability?

The answer is elegant and simple: we multiply the outcome value by its probability: $6 million * 0.3 = $1.8 million.

Essentially, we "credit" the "Blockbuster" outcome with only 30% of its value.

Similarly, we credit a "Lackluster" performance of $0 profits with only half of its value.

The magnitude of the loss incurred if the film "Flops" is mitigated by its low likelihood of 20%.

Then we add these weighted values together, for a total of $1.4 million. This total is called the expected monetary value

(EMV) of the K2 option. The EMV is a weighted average of the expected outcomes of the scenarios in which S&C retains part

ownership of Cloven as stipulated in the K2 deal.
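The same calculation takes only a few lines of Python. This sketch restates the probability-weighted sum above, with outcome values in $millions:

```python
# (probability, outcome value in $millions) for each branch of the K2 chance node
k2_branches = [(0.30, 6.0),    # Blockbuster
               (0.50, 0.0),    # Lackluster
               (0.20, -2.0)]   # Flop

emv = sum(p * value for p, value in k2_branches)
print(emv)  # about 1.4, i.e., $1.4 million
```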

Let's return to our wheel of fortune game to build our intuition for the concept of the EMV. This wheel consists of three areas:

"blue," "green," and "red." The blue area is 30% of the total area. If a spin results in "blue," you win $6.

"Green" covers 50% and "red" covers 20% of the wheel's area. If a spin results in "green," you gain nothing at all, if the result

is "red," you lose $2.

Suppose you play the game 100 times. How much money do you think you will have gained or lost over 100 spins? What do

you estimate will be your average yield per game?


About 30% of the time, a spin will result in "blue." In other words, you can expect 30 of the 100 spins to yield $6.

About 50% of the time a spin will result in "green." You expect 50 of the 100 spins to yield $0. And about 20% of the time a

spin will result in "red" — that is, you expect 20 of the 100 spins to cost you $2.

The total amount you can expect to win after playing the game 100 times is $140, and the average yield per spin is $1.40.

That average is the expected monetary value (EMV) for a single spin.

The expected monetary value for a single spin is $1.40. If you spin the wheel once, which of the following results is least

likely to occur as the outcome?

The probability of each outcome is shown below. With a probability of 0%, $1.40 is the least likely outcome.

It is important to understand the nature of the expected monetary value. We do not actually "expect" an outcome of $1.40. In

fact, $1.40 is not even a possible outcome. The EMV of $1.40 is the long-term average value of the outcomes of a large

number of spins.
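A quick simulation illustrates this point: the average payoff over many simulated spins converges to the EMV of $1.40, even though no single spin ever pays that amount. A minimal sketch:

```python
import random

def spin():
    """One spin of the wheel: blue 30% (+$6), green 50% ($0), red 20% (-$2)."""
    r = random.random()
    if r < 0.30:
        return 6
    elif r < 0.80:
        return 0
    return -2

n = 100_000
print(sum(spin() for _ in range(n)) / n)  # close to 1.40
```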

In the Cloven case, the EMV of $1.4 million is the average amount of profits Seth Chaplin can expect to make when he

produces similar films in similar circumstances.

We can use the EMV as a measure with which to compare alternative options. First, we calculate the EMV for each chance

node, beginning at the right of the tree. For the chance node associated with the K2 deal, the EMV is $1.4 million.

Now that we have calculated the EMV of the K2 chance node, we can "collapse" the branches emanating from the chance

node to a single point. Going forward, we can treat the EMV of $1.4 million as the endpoint value of the K2 option.

The EMV for the Pony Option is simply $1.0 million: the outcome value multiplied by its probability of 100%.

At a decision node, we choose the best EMV of all the branches emanating from that decision node. In our Cloven example,

the best EMV is the one with the highest expected profits. The EMV of the K2 deal, $1.4 million, exceeds $1.0 million, the

EMV of the Pony deal. Selecting the option with the best EMV and removing all other options from consideration is known as

"pruning" the tree.

Any decision tree — no matter how large or complex — can be analyzed using two simple procedures. At each chance node,

calculate the EMV, collapse the branches to a point, and replace the chance node with its EMV. At each decision node,

compare the EMVs and prune the branches with less favorable EMVs. This entire process is known as folding back the

decision tree.
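The fold-back procedure is mechanical enough to express in code. Below is a minimal Python sketch; the nested-tuple representation of the tree is chosen here for illustration, not something the course prescribes, and it assumes higher EMVs are better, as in the Cloven profit example.

```python
def fold_back(node):
    """Return the EMV of a (sub)tree by folding back from right to left."""
    kind = node[0]
    if kind == "leaf":                      # ("leaf", endpoint value)
        return node[1]
    if kind == "chance":                    # ("chance", [(prob, subtree), ...])
        return sum(p * fold_back(sub) for p, sub in node[1])
    # ("decision", [subtree, ...]): keep the best EMV, prune the rest
    return max(fold_back(sub) for sub in node[1])

# Seth's Cloven decision, with outcome values in $millions:
cloven = ("decision", [
    ("leaf", 1.0),                          # Pony deal: certain $1.0M profit
    ("chance", [(0.30, ("leaf", 6.0)),      # K2 deal: Blockbuster
                (0.50, ("leaf", 0.0)),      #          Lackluster
                (0.20, ("leaf", -2.0))]),   #          Flop
])
print(fold_back(cloven))  # about 1.4: the K2 branch has the better EMV
```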

Summary

We often use expected monetary value (EMV) to quantify the value of uncertain outcomes. The EMV is the sum of the

values of the possible outcomes of an uncertain event after each has been weighted by its probability of occurring. The

EMV can be interpreted as the expected average outcome value of the uncertain event, if that uncertain event were

repeated a large number of times. To analyze a tree, we "fold it back": we move from right (the future) to left, finding the

EMV for each node. For chance nodes we calculate the EMV as described above. For decision nodes we simply choose the

option with the best EMV — lower costs or higher profits — among the choices represented by a decision node's branches

and prune the others.

Relevant Costs

Burning to use decision trees and EMVs, you start to draw up the square and circular nodes that make up Leo's Chez Tethys

decision. Alice, however, urges caution: "Before you get too trigger-happy with those decision trees, you should be aware of a

few common pitfalls."

Jen Amato has been driving "Millie," an old jalopy of a car, for the past five years. Millie — an Oldsmobile Delta-88 — is

twenty years old. A week ago, the air conditioner broke down, and Jen had it replaced at a cost of $500.

Now, the drive train is worn out. Replacing it will cost $1,200. If she doesn't repair it, Jen can sell Millie "as is" for about

$300. Jen must decide: should she sell her car or should she have the drive train replaced?

If Jen sells her car now, she will not even recoup her $500 investment in the air conditioner. If she has the drive train

replaced, her beloved Millie will last a little longer, and Jen will be able to enjoy the cool air she spent so much money on.

How should the fact that Jen hasn't had the opportunity to benefit from her air conditioner investment affect her decision?

With a broken drive train, Millie is worth about $300. According to her mechanic, if Jen pays $1,200 to replace the drive

train, Millie will be worth approximately $1,100 in terms of resale value. In the analysis of Jen's decision, what role should

her $500 investment in the air conditioner play?

In any scenario in which Jen sells her car, the $500 air conditioner cost has already been incurred, so it could be included in

the total cost.


Likewise, the air conditioner repair costs are part of the prehistory of any scenario in which Jen has Millie's drive train

replaced. Here, too, the $500 investment in the air conditioner could be included as a cost.

However, when we compare the total costs of the two options, we recognize that the $500 does not contribute to a

difference in the outcome values. Because the $500 cost is incurred in both cases, it plays no role in the comparison of

the two options, and thus is irrelevant to Jen's decision — she will have spent the $500 no matter what she decides to do now.

Costs that were incurred or committed to in the past, before a decision is made, contribute to the total costs of all scenarios

that could possibly unfold after the decision is made. As such these costs — called sunk costs — should not have any bearing

on the decision, because we cannot devise a scenario in which they are avoided.

It isn't wrong to include a sunk cost in the analysis as long as it is included in the value of every outcome. However, including

sunk costs distracts from the differences between scenarios — the relevant costs. Imagine the complexity of Jen's tree if she

included every sunk cost she's incurred from owning Millie — from the original purchase price of the car to the cost of all the

gasoline she's pumped into Millie over the years, to her expenditure on a dashboard hula dancer.

Misinterpreting a sunk cost as a cost that weighs on only some of the scenarios is a common error. After sinking $500 of

repairs into her car, selling it for $300 will be quite painful for Jen. Nonetheless, good decisions are made based on possible

future outcomes, not on the desire to correct or justify past decisions or mistakes.

Another common decision-making error is to omit from the analysis relevant costs that should have a bearing on the

decision. If Jen has Millie's drive train replaced, the repair costs are just one of many costs involved with that option. Given

Millie's age, there is a high likelihood of another repair cost soon. Similarly, if Jen sells Millie now, she will have costs

associated with buying a new car or arranging other transportation services.

Opportunity costs are an important cost category that decision makers often neglect to include in their analyses. For

example, selling the car will require Jen to devote 10 hours of her time that she would otherwise devote to her part-time job

paying $12 per hour. Thus, Jen should add $120 in opportunity costs to the outcomes of the "sell Millie" decision.

We should also take non-monetary costs into account. Jen will feel sad at leaving her trusted vehicle and companion on many

a road trip behind. Although such costs can be difficult to quantify, they should not be neglected. We will see shortly how to

use sensitivity analysis to incorporate non-monetary costs into a decision analysis.

Summary

Among the most common errors in decision analysis is the failure to properly account for the costs involved in different

possible scenarios. On the one hand, relevant costs such as opportunity costs or non-monetary consequences are often

omitted. On the other hand, irrelevant costs such as sunk costs are incorrectly included in the analysis. Sunk costs are costs

that were incurred or committed to prior to making the decision and cannot be recovered at the time the decision is being

made. Since these costs factor into any possible future outcome, they can be safely omitted from the analysis; sunk costs

must never be included in only selected branches of a decision tree.

Time Horizons

Jen has decided to replace her old car, Millie the Oldsmobile. Her friend, Sven, is leaving the country for two years and

doesn't want to pay to store his Mazda Miata. He offers Jen two options.

In the first option, Jen leases the car, paying Sven $700 each year. When Sven returns from abroad, he reclaims his car. In

the second option, Jen buys the car outright for $4,000. How should Jen compare these two options?

What is the appropriate cost difference Jen should use to compare the two options Sven has offered?

To begin, note that in the first option, the costs are spread across two years, whereas in the second option, Jen makes a

single payment when she closes the deal with Sven. To compare the options meaningfully, we must define a time horizon

for this problem, that is, we must compare their costs over a common time period. In this case, two years is a convenient

time horizon.

Next, note that we cannot directly compare $1,400, the simple sum of the two lease payments made at two different times,

to the $4,000 one-time cost of the purchase option paid entirely at the beginning of the first year. We can only compare

costs when they are valued at the same point in time.

We'll compare the present value of the costs associated with each option — that is, the value of the costs at the time Jen

makes her decision. In order to compare the costs, we first need to convert the second installment of the leasing payment

into its present value.

Jen currently has sufficient cash for either alternative in an investment account with a 5% rate of return. What is the

present value of the second installment of $700 paid under the leasing option?

The present value of the second installment of $700 is $666.67. At the end of one year, $666.67 residing in her investment

account today will have increased in value to $666.67*(1.05) = $700, the amount she must pay for the second lease


installment.

The present value of the total cost of the leasing option is $1,366.67. Can we use this number to compare the two options?

Although $1,366.67 and $4,000 are now comparable costs because they are given at their present value, we have not yet

considered the fact that under the purchase option, Jen will own the Miata at the end of the two-year period. In two years,

Jen expects that the Miata's market value will be around $3,000. The value of an asset at the end of the time horizon is

typically called its terminal value.

What's the net present value of the cost of the purchase option - that is, the present value of future cash flows after

subtracting, or "netting out," the initial payment to Sven? Recall that Jen currently has her money invested in accounts that

earn a 5% average annual return.

If Jen resells the car in two years, she would receive $3,000. The present value of $3,000 received at the end of two years is

$2,721.09.

The net present value of the cost of the purchase option is $1,278.91.

If cost is the only deciding factor in Jen's choice, she'd save $87.76 if she chose to purchase the Miata. Before finalizing her

decision, Jen should make sure to weigh other considerations relevant to the decision, such as whether or not she will need

to borrow money to cover other expenses this year if she spends $4,000 to buy the Miata now, what the borrowing rate is,

etc.
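The entire lease-versus-purchase comparison fits in a few lines of Python; this sketch simply restates the arithmetic above at Jen's 5% rate of return:

```python
rate = 0.05

# Lease: $700 now plus $700 in one year, discounted to present value.
lease_pv = 700 + 700 / (1 + rate)             # about 1366.67

# Purchase: $4,000 now, less the discounted $3,000 terminal value in two years.
purchase_npv = 4000 - 3000 / (1 + rate) ** 2  # about 1278.91

print(lease_pv - purchase_npv)                # about 87.76 in favor of purchase
```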

Summary

When we make a decision, we need to choose a time horizon over which we quantify the outcomes of our decision. To

compare monetary values that take place at different times, we must account for the time value of money by comparing

cash flows at their values at a common point in time. Generally, we compare the present values — or net present values

— of different outcomes by discounting future cash flows at the appropriate discount rate. Terminal values of assets held

at the end of the time horizon must be determined and discounted to their present value.

"Now let's find Leo, and get to work on his decision problem," Alice says, snapping her laptop shut.

I calculated that if the floating restaurant idea really takes off, I'd make over $2 million in profits over the next five years. And

that's after I subtract my initial investment, including the purchase of the ship.

How certain are you that the Tethys will be the success you envision?

Gosh, I don't know. It's almost like a flip of a coin to me — chances are about fifty/fifty that the Tethys will be a big hit.

That sounds a little too optimistic. Let's take a close look at your business plan.

The three of you spend the rest of the day in energetic discussion and research. Finally, you agree on two representative

scenarios that might occur if Leo launches the Chez Tethys: "Phenomenon" or "Fad."

If the Tethys becomes a cultural "Phenomenon" with staying power, Leo can expect $2 million in profits over five years, in

terms of net present value. Leo grudgingly agrees that the likelihood of this scenario is only 35%.

Alternatively, dining Chez Tethys might become a passing "Fad" for a couple of years, then be replaced by "The Next Big

Thing." In that case, Leo would face substantial losses, estimated to have a present value of about $800,000. Leo grants that

his brainchild has a 65% chance of just being a fad.

Enter the EMV in $millions as a decimal number with three digits to the right of the decimal point (e.g., enter ''$5,500,000''

as ''5.500''). Round if necessary.

The expected monetary value of launching the Tethys is the outcome value of the "Phenomenon" scenario weighted by its

probability added to the outcome value of the "Fad" scenario weighted by its probability: $2 million * 0.35 + (-$0.8 million) *

0.65 = $0.18 million, or $180,000.

Hmm. I can see how this analysis is helpful. I'm glad that my expected profits are positive, but somehow I'm not satisfied. I

don't feel very comfortable with some of our estimates.

Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add to her

current fleet.


Robin identifies three states of the transportation sector that might occur: "Boom," "Moderate Growth," and "Slowdown."

Her firm's profits depend on whether she leases an additional truck and on the state of the sector: she associates an

estimate of total firm profits with each of the six scenarios.

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as

''5.5''). Round if necessary.

The EMV of the decision to lease the truck is $31,000. Weigh each of the scenario's outcomes by the probabilities that they

will occur, then add the weighted values.

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as

''5.5''). Round if necessary.

The EMV of the option to not lease the truck is $18,200. Weigh each of the scenario's outcomes by the probabilities that

they will occur, then add the weighted values.

Jari Lipponen of the Silverhaven Home for Abandoned Miniature Horses (SHAMH) needs funds to maintain operations.

He can either apply for a government grant or run a local fundraiser, but the demands on his time are too high for him to

be able to do both.

Jari believes he has a 90% chance that he will win a grant of $25,000 if he submits the grant application. Grants in this

category are for a fixed amount, so if he loses the grant, he'll have no money to run the SHAMH.

Based on his past fundraising experience, he estimates that if he runs a local fundraiser, he has a 30% chance of raising

$30,000 and 70% chance of raising $20,000.

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as

''5.5''). Round if necessary.

To find the EMV of launching the fundraiser, weigh the $30,000 outcome value by its probability of 30% and add that to

the $20,000 outcome value weighted by its probability of 70%: $30,000 * 0.3 + $20,000 * 0.7 = $23,000.

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter ''$5,500'' as

''5.5''). Round if necessary.

To find the EMV of applying for the grant, weigh the grant award amount of $25,000 by the probability of winning it (90%)

and add that to the value of losing the grant ($0) weighted by the probability of losing (10%): $25,000 * 0.9 + $0 * 0.1 = $22,500.

The EMV of the fundraiser option is $23,000, higher than the $22,500 EMV of applying for the grant. Based on this analysis, Jari

should organize a fundraiser.

Marsha Ratulangi is the chief operating officer at Gaiacorps, a non-profit organization dedicated to preserving natural

habitats around the world. Gaiacorps' IT hardware is aging, and Marsha must decide whether to extend the current lease

on Gaiacorps' IT desktop computers or purchase new ones.

The cost of the lease extension and the future maintenance costs of the current hardware are incurred only in the scenario

in which Gaiacorps extends its current lease. The purchase price of the new hardware is incurred only in the scenario in

which Gaiacorps purchases new hardware. Which of the three costs are incurred depends on which option Marsha chooses,

so all three are highly relevant to her decision.

The cost of the memory card upgrade and the maintenance costs previously invested in the current hardware are sunk

costs that have already been incurred.

Marsha's salary is not a sunk cost, but is incurred in all possible scenarios. Neither of the options Marsha is considering

will affect the amount of these three costs (the memory card upgrade, the past maintenance, and her salary), so all three

are irrelevant to the decision.


In contrast, the cost of disposing of the new hardware is relevant, since it is incurred only in the case that Marsha buys new

desktop computers.

Under pressure from his company's ad hoc advisory board to lower operating expenses, the CEO of Empire Learning, Bill

Hartborne, is considering canceling the biweekly cleaning service for the company offices. Each cleaning engagement costs

$75.

Instead of hiring a cleaning service, Bill could simply clean the office himself whenever a client visits the office. The

probability of exactly one client visiting the office in a given month is 25%, and the probability that two clients will visit is

10%. The probability that no clients will visit is 65%.

Bill draws up the structure of his decision in the tree depicted below. Given a one-month time horizon, what is your best

estimate of the EMV of the cost of not hiring a cleaning service?

Instead of cleaning the office, Bill could be creating value for his company by making sales, networking, boosting employee

morale, or simply increasing his productivity by napping on the office sofa. We need to calculate the opportunity cost of

the time Bill would spend cleaning the office, so we need to know how long he spends cleaning, and how highly his time is

valued.

Assuming that it always takes Bill 2 full hours to clean Empire Learning's offices, and that Bill's time is valued at

$200/hour, what is the EMV of Bill cleaning the office?

Enter the EMV in dollars as an integer (e.g., enter ''$5.00'' as ''5''). Round if necessary.

The EMV of Bill cleaning the office is $180: each cleaning takes 2 hours at $200/hour, or $400 per client visit, so the expected monthly opportunity cost is $0 * 0.65 + $400 * 0.25 + $800 * 0.10 = $180. A biweekly cleaning service costs about 2 * $75 = $150 per month, so based on this analysis, Bill should continue employing the cleaning service.

Val Purcell, CEO of a supply-chain management consulting firm Purcell & Co., must decide whether or not to put in a bid

for a contract to re-engineer the supply chain of a potential new client: the Eris Shoe Company.

Creating the bid will cost $16,000 in Val's time and legal fees. If his bid beats out the competition, Val expects the contract

to return profits of $100,000 — from which the cost of preparing the bid has not yet been subtracted. Val believes he has a

20% chance of winning the bid.

Eris will pay the consulting fee and Val will accrue his estimated $100,000 profits upon completion of the project, which is

scheduled for one year from now.

Under these terms, what is the expected monetary value of creating and submitting the bid?

The answer cannot be determined without knowing Purcell's discount rate. If Purcell wins the bid, it won't receive its

consulting fee until completion of the project one year from now. Cash flows related to the contract should be discounted at

Purcell's discount rate. For simplicity assume that the cost/profit figures represent cash outflows and inflows, and that

Purcell's discount rate is 15%.

Enter the EMV in $thousands as a decimal number with two digits to the right of the decimal point (e.g., enter ''$5,500'' as

''5.50''). Round if necessary.

To find the EMV of creating the bid, first calculate the present value of the cash inflow from Purcell's profits on the

contract, to be received a year from now upon completion of the project.

To determine the net present value of winning the bid, we subtract the $16,000 bid preparation costs.

Then, use the probability of winning the bid to weight the EMVs of the "Win" and "Don't win" scenarios to determine the

EMV of placing the bid.

Before Val starts work on the bid, Eris decides to move the project's completion date to two years after the contract is

signed. Again assume that the cost/profit figures represent cash outflows and inflows, and that Purcell's discount rate is

15%. What is the EMV for submitting the bid now?

Enter the EMV in dollars as an integer (e.g., enter "$5.00" as "5"). Round if necessary.

To find the new EMV of putting in the bid, first calculate the present value of the cash inflow from Purcell's profits on the

contract, to be received two years from now upon completion of the project.

To determine the net present value of winning the bid, we subtract the $16,000 bid preparation costs.


Then, use the probability of winning the bid to weight the EMVs of the "Win" and "Don't win" scenarios and determine the

EMV of placing the bid. The EMV of putting in the bid is -$877. If Val's payments are delayed by another year, he cannot

expect the project to be profitable, so he should not submit a bid. Delaying the profits changes Val's optimal decision!
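The discounting in both versions of the bid decision can be checked with a short sketch (the function and variable names here are illustrative):

```python
# EMV of submitting the bid, with the $100,000 profit discounted to today.
profit, bid_cost = 100_000, 16_000
p_win, r = 0.20, 0.15  # probability of winning; Purcell's discount rate

def bid_emv(years_to_completion):
    pv_profit = profit / (1 + r) ** years_to_completion
    return p_win * pv_profit - bid_cost  # bid cost is paid win or lose

print(round(bid_emv(1)))  # 1391  -> EMV of roughly $1,390 ($1.39 thousand)
print(round(bid_emv(2)))  # -877  -> matches the -$877 above
```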

Cap Winestone of Universal Learning is preparing a bid for a new building to accommodate its expanding operations. As luck would have it, the headquarters of former competitor Empire Learning is for sale through a sealed-bid auction. The building is

well suited to Universal's needs, containing computing equipment, network infrastructure and other important e-learning

accessories.

Cap estimates the building is worth about $900,000 to him and is trying to decide what bid to place. To simplify his

decision, he narrows down his bid choices to four possible bids. He muses: "With lower bids I gain more value. If I bid

$600,000, I'll get a building worth $900,000 to me, so I gain $300,000 in value. In terms of the value I gain, the lower the

bid, the better."

"On the other hand," Cap continues, "with a low bid I'm not likely to win. From the point of view of winning the bid, the

higher the bid the better. How do I balance these two opposing factors?"

Cap lays out a decision tree with the possible bid amounts, the likelihood of winning for each bid, and the outcome values.

What amount should Cap bid?

Folding back the tree, we find the $800,000 bid gives the highest expected monetary value. The $800,000 bid best

balances the value gained against the probability of winning.

Sensitivity Analysis

Still in Leo's office after you initially calculated the expected monetary value of launching the Chez Tethys, you listen to Leo's

growing concerns about the estimates that inform his decision.

Now that you've mapped out the possible downside and its probability for me, I'm a little discouraged. Sure, the analysis

indicates that ventures like the Tethys will be profitable on average, but the fact that I have almost a two-thirds chance of an

$800,000 loss scares me.

What if the potential loss is even greater? Or, what if it's even more likely that the Chez Tethys will just be a passing "Fad"?

Both points are well taken, Leo. I'll tell you what: let's break for lunch, and then delve a little deeper into our analysis.

Over lunch, Alice comments on Leo's reaction to your analysis. "Leo's questions bring us to a crucial component of decision

analysis: sensitivity analysis."

Seth Chaplin has all but decided how to produce the film Cloven. Based on his initial analysis, he is inclined to produce

Cloven in partnership with K2 Classics, thereby retaining part ownership of the film. The EMV of the K2 option is $1.4

million. The EMV of the alternative option — in which Seth's company produces Cloven for Pony Pictures and relinquishes

ownership in the film — is $1.0 million.

But Seth's calculations were based on estimates. The probabilities and the outcome values he used in his analysis were

educated guesses. No matter how detailed and rigorous the methodology, a decision analysis is only as good as the data on

which it is based. What if Seth isn't completely certain about these data?

Seth is particularly unsure about the $6 million value he used to represent the profits associated with a "Blockbuster" film.

What would the EMV of the K2 option be if $4 million were a more representative value for "Blockbuster" profits?

Enter the EMV in $millions as a decimal number with one digit to the right of the decimal point (e.g., enter "$5,500,000" as "5.5"). Round if necessary.

If $4 million is the expected value of S&C's profits for the "Blockbuster" scenario, then the EMV of the K2 option drops to
$800,000, less than the $1 million EMV of the Pony option. Note that if the "Blockbuster" profit figure drops to $4 million,

Seth's optimal decision changes: he should now choose the Pony option. Seth has learned that his optimal decision is

sensitive to the figure he uses to represent S&C's profits if Cloven attains "Blockbuster" status.

If Seth's optimal decision switches when the "Blockbuster" profit figure drops to $4 million, he might reasonably wonder

what his decision would be for other values. What about $5 million? $4.5 million? How low would the "Blockbuster" profit

figure have to be to make the Pony option preferable to the K2 option in terms of the EMV?


At what "Blockbuster" profit figure does Seth's optimal decision switch?

Enter the profit figure in $millions as a decimal number with one digit to the right of the decimal point (e.g., enter "$5,500,000" as "5.5"). Round if necessary.

To answer that question, we first write the EMV of the K2 option with a variable, B, to represent "Blockbuster" profits in millions of dollars: using the 30% "Blockbuster" probability, EMV(K2) = 0.3B - 0.4.

Now, we compare the EMV of the K2 option to the EMV of the Pony option. Solving 0.3B - 0.4 > 1.0, we find that the K2 EMV is greater than the Pony EMV when the "Blockbuster" profit figure is greater than $4.67 million.

We call $4.67 million the breakeven value for "Blockbuster" profits: above it, the EMV criterion recommends the K2

option; below it, the EMV criterion recommends the Pony option.

Seth may not be sure if the "Blockbuster" profits are best represented by $6 million or $5 million or $4.8 million, but as long

as he is confident that the figure is greater than $4.67 million, he need not lose any sleep over finding a more accurate value.

Knowing that his estimates for the "Blockbuster" profit figure are firmly on one side of the breakeven value allows him to stop

worrying about the precise value of that number in the decision analysis, since knowing that number with greater precision

will not change his decision.

On the other hand, if Seth thinks the "Blockbuster" profit figure could be below $4.67 million, he may wish to invest

additional resources to find a more accurate figure to represent "Blockbuster" profits.

If Seth thinks the true number is around the breakeven value of $4.67 million, but isn't certain if it's slightly above or slightly

below that figure, he can also stop worrying about the precise value of the number. As long as the figure is around $4.67

million, Seth should be indifferent between the K2 and Pony deals, since the EMVs of the two options are about the same.

In fact, a good way to check that we have calculated a breakeven value correctly is to substitute the value into the EMV

calculation: if the options have the same EMV, the breakeven value is correct.
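As a sketch, the breakeven calculation and the check just described can be written out as follows; the relation EMV(K2) = 0.3B - 0.4 (in $millions) follows from the 30% "Blockbuster" probability and the $1.4 million EMV at B = $6 million:

```python
# Breakeven "Blockbuster" profit B for the K2 option (all values in $millions).
p_blockbuster = 0.30
emv_pony = 1.0
B0, emv_k2_at_B0 = 6.0, 1.4

c = emv_k2_at_B0 - p_blockbuster * B0        # non-"Blockbuster" contribution: -0.4
breakeven_B = (emv_pony - c) / p_blockbuster
print(round(breakeven_B, 2))                 # 4.67

# Check: at the breakeven value, the two options have equal EMVs.
assert abs((p_blockbuster * breakeven_B + c) - emv_pony) < 1e-9
```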

Once we calculate a breakeven value, we know whether or not expending additional time and other resources to find a more

accurate estimate is worthwhile. The breakeven value establishes a comfort zone: as long as we are confident that the value

we are estimating is within the zone, we can feel comfortable choosing the option recommended by our initial analysis, based

on our original estimate.

If we think the value we are estimating could be close to the breakeven value, we need to be more cautious. If the true value

we are trying to estimate could lie outside of the comfort zone, we might want to try to make our estimate of that value more

accurate before we reach a final decision.

How confident we are that the true value we are estimating lies inside the comfort zone given by the breakeven analysis is a

matter of judgment and experience. Sometimes, we might collect sample data to estimate an outcome value. In this case, we

should look closely at the variation in the data to see how widely and in what way the data can vary.

Calculating a breakeven value for data used in a decision analysis is called sensitivity analysis: for each estimated value in

the analysis, we check to see by how much it would have to change to affect our decision, assuming our estimates for all the

other data are correct.

Sensitivity analysis is an important and powerful tool for management decision-making. Managers who base decisions on an

initial analysis without performing sensitivity analysis on critical data risk lulling themselves into a false sense of security in

their decisions.

Summary

After completing an initial decision analysis, always conduct a sensitivity analysis for each outcome value estimate you are

uncomfortable with. First, calculate the outcome value's breakeven value: the value for which the EMV of the option

initially recommended by the decision analysis ceases to be the best EMV. The breakeven value defines a comfort zone: If

we believe that the actual outcome value might be outside that zone — thereby changing the optimal decision — we should

reconsider our analysis and refine our estimate of the outcome value in question.

During his negotiations with K2 and Pony, Seth realized that he did not particularly look forward to working with the K2

team. Based on past experience, he knows that interpersonal frictions can be highly frustrating and can make a

collaboration unpleasant. This frustration is clearly a cost — albeit a non-monetary cost — associated with the K2 option.

Sensitivity analysis can give us a "reality check" on how highly we value non-monetary consequences such as frustration,

reputation costs and benefits, and sentimental values. Although it may be difficult to assign a value to such consequences,

we can often answer questions about the most we'd be willing to pay to avoid (or to obtain) them by calculating a threshold

value.


Clearly, Seth wouldn't want to spoil any magical Hollywood days just to make an additional $5 in profits. But the more the

K2 option pays relative to the alternatives, the more willing Seth might be to suffer working with the K2 team. How can

Seth determine whether he should accept the K2 deal and bear the resulting interpersonal trials and tribulations?

Sensitivity analysis can help us analyze non-monetary consequences such as frustration. Let's use "F" to represent the

frustration cost (in $millions) Seth will incur if he has to work with the K2 team. Since frustration will occur in any

scenario involving the K2 option, $F million must be subtracted from all outcomes associated with the K2 option.

How large must the cost of frustration be for Seth's optimal decision to switch to the Pony option?

Enter F, the cost of frustration, in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary.

Subtracting $F million from the K2 outcomes changes the EMV of the K2 option to $1.4 million - $F million. For Seth's decision to

switch from the K2 deal to the Pony deal, the EMV of the K2 option, $1.4 million - $F million, must drop below $1.0

million. In other words, F must satisfy the inequality below.

In order to lower the EMV of the K2 option below the EMV of the Pony option, the cost of frustration would have to be

valued at $400,000 or more. Seth needs to ask himself if he would pay $400,000 to avoid the frustration of working with

K2. Sensitivity analysis provides a clear, monetary upper bound against which he can measure the strength of his feelings.

In the end, Seth decides that he can learn to love the K2 team for $400,000.
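A one-line version of this threshold calculation, for completeness:

```python
# Frustration cost F is subtracted from every K2 outcome, so EMV(K2) = 1.4 - F.
emv_k2, emv_pony = 1.4, 1.0          # $ millions
breakeven_F = emv_k2 - emv_pony      # decision switches when 1.4 - F < 1.0
print(round(breakeven_F, 2))         # 0.4 -> $400,000
```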

Summary

To incorporate non-monetary consequences into a decision, first find the option with the best EMV. Then, add the non-

monetary consequence to the outcome values of all scenarios affected by that non-monetary consequence, and calculate

the breakeven value for which the option recommended by the initial decision analysis ceases to have the best EMV. The

breakeven value defines a comfort zone: If we believe that the actual value of the non-monetary consequences might be

outside that zone — thereby changing the optimal decision — we should try to gain a firm estimate of the non-monetary

consequence's value.

Seth is somewhat unsure about his estimates for the probabilities of how successful Cloven will be. He is quite sure that the

probability of the film flopping is around 20%, but he's less sure about the probabilities of "Lackluster" and "Blockbuster"

performance levels. He wants to know how sensitive his decision to choose the K2 deal is to the values of these

probabilities.

Let's call the probability of a "Blockbuster" "p." Seth is confident that the probability of a "Flop" is 20%. What is the

probability of a "Lackluster" outcome?

Since the three outcomes are mutually exclusive and collectively exhaustive, their probabilities must add to 100%. Seth is

confident that the probability of a "Flop" is 20%, so he knows that the probabilities of the remaining two outcomes

("Blockbuster" and "Lackluster") must add to 80%. Thus, the probability of a "Lackluster" performance is 0.8 - p.

Using p to denote the probability of a "Blockbuster" outcome and 0.8 - p to denote the probability of a "Lackluster"

outcome, what is the EMV of the K2 option?

The EMV of the K2 option is p * $6 million - $0.4 million. For what values of p, the probability of Cloven becoming a

"Blockbuster" film, is the Pony option preferred to the K2 option on the basis of EMV?

The breakeven value for the probability of a "Blockbuster" hit is 23.3%: when the probability is above 23.3%, the EMV of

the K2 option is higher; when the probability is below 23.3%, the EMV of the Pony option is higher. If Seth is confident that

the probability of a "Blockbuster" is at least 23.3%, he can feel comfortable choosing the K2 option. He need not expend

additional effort trying to further refine his estimates for the probabilities of different levels of success.
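The same breakeven logic in a short sketch, using the EMV formula quoted above:

```python
# EMV(K2) = 6p - 0.4 (in $millions), from the formula above.
emv_pony = 1.0
breakeven_p = (emv_pony + 0.4) / 6   # solve 6p - 0.4 = 1.0
print(round(breakeven_p, 3))         # 0.233 -> 23.3%
```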

Decision-making is an iterative, multi-step process. When analyzing a decision, we should first construct and analyze a

decision tree based on our best estimates of the outcomes and probabilities involved. After reaching a tentative decision, it

is critical to scrutinize the data used in a decision analysis and conduct sensitivity analyses for each estimate that we feel

unsure about.

As long as we are comfortable that the true value we are estimating is within the range specified by the breakeven

calculation, we can confidently proceed with our decision. If not, we should focus our efforts on refining our estimates for

those values to which our decisions are most sensitive.

Finally, we should note that managers sometimes need to test the sensitivity of their decisions to two or more estimates

simultaneously. Sensitivity analysis techniques can be extended to address these situations; these techniques are beyond

the scope of this course.


Summary

After completing an initial decision analysis, always conduct a sensitivity analysis for each probability value you are

uncomfortable with. First calculate the probability's breakeven value: the probability for which the EMV of the option

initially recommended by the decision analysis ceases to be the best EMV. The breakeven value defines a comfort zone: If

we believe that the actual probability might be outside that zone — thereby changing the optimal decision — we should

reconsider our analysis and refine our estimate of the probability in question.

Sensitivity analysis helps managers cope with the uncertainty that surrounds the estimates they base decisions on. With your

new knowledge, you're ready to turn to Leo's problem.

Your initial analysis recommends that Leo launch his floating restaurant, but Leo has expressed some doubt about the

estimate of the losses he'd incur if the Tethys turns out to be a passing "Fad."

How high do the losses have to be in the "Fad" scenario to make launching the Tethys ill-advised in terms of EMV?

Launching the Tethys would be less attractive than not launching it if the EMV of launching is less than $0, the EMV of not

launching. Solving the inequality below, we find that the losses in the event that the Tethys is just a "Fad" would have to

exceed $1.08 million for Leo to abandon the project based on the EMV criterion.

Assume that, in fact, Leo's estimate — that he will incur an $800,000 loss if Chez Tethys turns out to be a "Fad" — is correct.

For what probability of a "Fad" does launching the Tethys cease to be a worthwhile venture, in terms of the EMV?

Launching the Tethys is less attractive than not launching it if the EMV of launching is less than $0, the EMV of not

launching. Solving the inequality below, we find that the probability of a "Fad" would have to be higher than 71% for Leo to

prefer to abandon the project based on the EMV criterion.
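Leo's decision tree isn't reproduced here, but both breakevens can be recovered with assumed inputs consistent with the results above: a 65% "Fad" probability (Leo's "almost two-thirds chance") and a $2 million profit if the Tethys succeeds. Both inputs are inferred, not stated:

```python
# Assumed (inferred) tree values: P(Fad) = 0.65, Success profit = $2 million.
p_fad, success_profit, fad_loss = 0.65, 2.0, 0.8   # $ millions

# Breakeven "Fad" loss: EMV of launching falls to $0, the EMV of not launching.
breakeven_loss = (1 - p_fad) * success_profit / p_fad
print(round(breakeven_loss, 2))   # 1.08

# Breakeven "Fad" probability, holding the $0.8 million loss fixed.
breakeven_p_fad = success_profit / (success_profit + fad_loss)
print(round(breakeven_p_fad, 2))  # 0.71
```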

Hmm, I'm fairly confident that our estimate of the possible loss is on target — certainly it won't be higher than a million

dollars! But the more I think about it, the more I wish we had a better estimate for the probability that the Tethys will be the

success I've always dreamed of.

I have some ideas for ways to get a better handle on the likelihood of the Tethys' success. But I need to make some phone calls

to see if I'm on the right track. Let's break for today and meet tomorrow morning.

Robin Bea is the CEO of a small shipping company. She needs to decide whether or not to lease another truck to add to her

current fleet.

Based on her assessment of future trends in the transportation sector, Robin identified three scenarios that might occur:

"Boom," "Moderate Growth," and "Slowdown," corresponding to the performance of the economy and its impact on the

transportation sector. She associated estimates of total firm profits with each scenario, and calculated the EMVs of

"leasing" and "not leasing" a new truck: $31,000 and $18,200, respectively.

Robin is afraid that she may have been a little overoptimistic in her estimate of $70,000 of total profits in the scenario in

which she "leases a new truck" and the transportation sector "booms." She wants to know how sensitive her decision is to

the outcome value of this scenario.

What must the value of total profits of the scenario in which Robin "leases the truck" and the transportation sector "booms"

be in order for the "don't lease truck" option to be the more attractive one?

For sufficiently low total profits in the "Boom" scenario, the EMV of leasing the truck is lower than the EMV of not leasing the truck. To find the breakeven value X, we solve the inequality between the two EMVs shown below. The inequality is solved when X, the value of total profits in the "lease truck" and "Boom" scenario, is below the breakeven value of $27,333.
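A sketch of the calculation; the 30% "Boom" probability is not stated above, but it is the value consistent with the $31,000 lease EMV and the $27,333 breakeven:

```python
# Breakeven "Boom" profit X for the "lease truck" option (values in dollars).
p_boom = 0.30                             # assumed, see note above
emv_lease, emv_no_lease = 31_000, 18_200
boom_profit = 70_000

rest = emv_lease - p_boom * boom_profit   # non-"Boom" contribution: 10,000
breakeven_X = (emv_no_lease - rest) / p_boom
print(round(breakeven_X))                 # 27333
```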

Val Purcell, CEO of the supply-chain management consulting firm Purcell & Co., must decide whether or not to put in a bid

for a contract to re-engineer the procurement process for a potential new client: the Eris Shoe Company.

Creating the bid will cost $16,000 in Val's time and legal fees. If he wins the bid, he will receive payment for the project

upon completion, two years after signing the contract. He has estimated the present value of the profits if he wins the bid at

$75,614. After factoring in the $16,000 cost of submitting the bid, the net present value of the profits if he wins the bid is

$59,614.

Val believes he has a 20% chance of winning the bid, so the EMV for submitting the bid is -$877: under these

circumstances, Val can't expect submitting a bid to be profitable. But Val hasn't been able to figure out how stiff the


competition for the contract is, and he's fairly uncertain about the probability of winning.

How high would the probability of winning the bid have to be to make Val change his mind and choose to put in a bid?

To change Val's decision, the EMV of bidding will have to be higher than the EMV of not bidding: $0. To find out how high

p, the probability of winning the bid, would have to be, we solve the inequality below. Note that the probability of losing the

bid (1 - p) must change as the probability of winning changes. The inequality is solved when the probability of winning the

bid is greater than 21.2%.
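The same threshold, sketched in one step from the figures above:

```python
# EMV of bidding = p * 75,614 - 16,000; it turns positive when p exceeds:
breakeven_p = 16_000 / 75_614
print(round(breakeven_p, 3))   # 0.212 -> 21.2%
```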

Jari Lipponen of the Silverhaven Home for Abandoned Miniature Horses (SHAMH) needs funds to maintain operations.

He can either apply for a government grant or run a local fundraiser, but the demands on his time are too high for him to

be able to do both.

Jari believes he has a 90% chance that he will win a grant of $25,000 if he submits the grant application. He estimates that

as a local fundraiser, he has a 30% chance of raising $30,000 and 70% chance of raising $20,000. The EMV of the

fundraiser option is $23,000, higher than $22,500, the EMV of applying for the grant. Based on the initial analysis, Jari

should organize a fundraiser.

Jari has expressed uncertainty about his estimate for the probabilities of the two levels of fundraising success. How low

would the probability of raising $30,000 have to be to change his decision? (Enter your answer as a decimal number with

two digits to the right of the decimal point)

For Jari to change his initial decision, the EMV of the fundraising option would have to be less than $22,500. That would

be the case if the probability of the high fundraising success — valued at $30,000 — were less than 25%.

Jari needs $20,000 to operate the SHAMH through the next year. Raising $25,000 or more would permit Jari to operate

the SHAMH and expand the capacity of the operation, thereby allowing even more abandoned miniature horses to be

saved. This year will be Jari's last running the SHAMH, and he really wants to expand its capacity before he leaves.

If he's unable to raise at least $25,000 to cover the SHAMH expansion, he'll be very disappointed. How highly would Jari

have to value his disappointment in order for him to prefer the government grant option that is more likely to secure funds

for expansion? Assume that his original probability assessments for the success of the fundraiser are correct.

If Jari is only moderately successful in his fundraising activities, he won't be able to expand the SHAMH on his watch. The

disappointment cost D should be subtracted from the outcome value for the scenario in which he takes in only $20,000 in

contributions. After incorporating the disappointment cost, the EMV of the fundraiser option becomes $23,000 - 0.7D.

However, if he applies for the grant and doesn't win it, he won't be able to expand, either. The disappointment cost D must

also be subtracted from the outcome value in the scenario in which he writes the grant but doesn't win it. After

incorporating the disappointment cost, the EMV of the grant-writing option becomes $22,500 - 0.1D.

The breakeven value is the value of disappointment for which the EMVs of the two options are equal. Thus, to find the

breakeven value of D, we must set up an inequality between the two EMVs and solve for D. In this case, the disappointment

cost would have to be greater than $833.33 in order for Jari to prefer the grant-writing option.
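Both of Jari's breakeven calculations, as a sketch:

```python
emv_grant = 0.9 * 25_000   # $22,500

# (1) Breakeven probability q of the $30,000 fundraising outcome:
#     q*30,000 + (1-q)*20,000 = 22,500
q = (emv_grant - 20_000) / (30_000 - 20_000)
print(q)                   # 0.25

# (2) Breakeven disappointment cost D:
#     23,000 - 0.7*D = 22,500 - 0.1*D
D = (23_000 - emv_grant) / (0.7 - 0.1)
print(round(D, 2))         # 833.33
```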

Decision Analysis II

Conditional Probabilities

After a Kahana breakfast of crab benedict, you meet once more with Leo.

So, I've been thinking: my decision is heavily dependent on the likelihood that the Chez Tethys will be a success. Couldn't we

do some market research to find out how interested our target market would be? Then I could make a better decision — one

that would be less risky!

Sure, Leo. But market research costs money. How much would you be willing to pay for information that would help you

better predict the success of the Chez Tethys?

Hmmm. Good question...frankly, I don't have a clue how to even start thinking about that. Can you two help me?

"Whatever market research Leo has in mind," remarks Alice, "it won't reveal with certainty how consumers will take to the

Tethys. In other words, we're uncertain about the Tethys' chance of success, and we'll also be uncertain about whether the market research we collect is accurate."


Pondering two layers of uncertainty, you begin to feel vertigo...

When analyzing a decision, we typically use estimates for the probabilities and financial implications of various outcomes

that could occur. Often, we can imagine additional information — scientific tests, market research data, or professional

expertise — that would help make our estimates more accurate. How much should we be willing to pay for this type of

information? And how do we incorporate new information into our decision analysis?

Before we answer these questions, we'll need to expand our understanding of probability and introduce the important

concepts of conditional probability and statistical independence. Let's look at an example first.

British automaker Chariot's most sought-after model is the Ben Hur. Consumers love the Bennie, as it's affectionately called.

It was offered as a limited edition: to date, only 1,000 Bennies are on the road in Britain. We'll take a closer look at two

properties of the Bennie: its "Color" and its "Stereo."

The Bennie comes in two color options: Burgundy and Champagne. Also, the Bennie is offered with a high-end car stereo

system by Sweetone. Let's look at a table of the 1,000 Bennies currently on the road and see how the two properties — "Color"

and "Stereo" — are distributed.

Of the 1,000 Bennies, 150 are Burgundy and have a Sweetone stereo. 600 are Burgundy and do not have the Sweetone stereo.

Furthermore, there are 50 Champagne Bennies with the stereo and 200 without.

Let's add another column to our table and fill in the total numbers of Burgundy and Champagne Bennies. To calculate these

numbers, we simply take the sums of the rows. For example, the number of Burgundy Bennies is simply the number of

Burgundy Bennies with the Sweetone stereo added to the number of Burgundy Bennies without.

Next, in the final row, we'll fill in the total numbers of Bennies with and without the Sweetone stereo. Here, we simply take

the sums of the two columns. In the bottom right cell, we enter the total number of cars: 1,000.

This number — 1,000 — should be equal to the sum of the numbers in the final row: the total number of Bennies with the

Sweetone stereo added to the total number without one. Also, it should be equal to the sum of the numbers in the final

column: the number of Burgundy Bennies plus the number of Champagne Bennies. Finally, since 1,000 is the total number of

all Bennies, it should be the sum of all the original numbers we entered into the table.

A Venn diagram is a useful graphical way to represent the contents of the table. The Burgundy rectangle on top represents the

set of all Burgundy Bennies. The Champagne rectangle below represents the set of all Champagne Bennies. The rectangles do

not intersect because a Bennie cannot be both Champagne and Burgundy.

We now add a patterned rectangle on the left to represent the set of Bennies with the Sweetone stereo. The area without the

pattern represents the set of Bennies that are not equipped with the Sweetone. These areas intersect the rectangles that

represent the distribution of "color": some Burgundy Bennies are Sweetone-equipped; some are not. Some Champagne

Bennies are Sweetone-equipped; some are not.

The sizes of the areas in the Venn diagram are directly proportional to the incidence of the different Bennie properties:

Burgundy/Champagne and Sweetone/no Sweetone. We use Venn diagrams to effectively communicate information about

sets of things — for example, Burgundy cars and Sweetone-equipped cars — and their interactions.

The table is a useful tool for calculating proportions of potential interest to managers. For instance, we can find the

proportion of Burgundy Bennies with the Sweetone stereo simply by locating the cell that contains their number and dividing

it by the total number of cars: 15%.

Or, to find the proportion of Burgundy Bennies in general, we find the cell that contains the total number of Burgundy

Bennies and divide it by the total number of cars: 75%.

In this way, we can create an entire table of proportions. These proportions can be interpreted as probabilities. For

example, since 15% of the Bennies on the road are Burgundy-colored and Sweetone-equipped, the probability that a

randomly selected Bennie will be Burgundy-colored and have the Sweetone is 15%.

The probability that a randomly selected Bennie will be Burgundy is 75%. Going forward, when talking about the table, we'll

use the words "proportion" and "probability" interchangeably.

The probabilities on the inside of the table are called joint probabilities: the probabilities of a single car having two

particular Bennie features, for example, Burgundy-colored and Sweetone-equipped. We'll denote joint probabilities in the

following way: P(Burgundy & Sweetone) is the probability that a Bennie is Burgundy-colored and Sweetone-equipped.

The probabilities of each "Color" or "Stereo" option occurring in a given Bennie — Burgundy or Champagne, Sweetone-

equipped or not Sweetone-equipped — are often called marginal probabilities. They are denoted simply as P(Burgundy),

P(Champagne), P(Sweetone), and P(no Sweetone).
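The Bennie probability table can be built directly from the counts above, as in this sketch:

```python
# Joint and marginal probabilities from the counts of the 1,000 Bennies.
counts = {("Burgundy", "Sweetone"): 150, ("Burgundy", "No Sweetone"): 600,
          ("Champagne", "Sweetone"): 50, ("Champagne", "No Sweetone"): 200}
total = sum(counts.values())                        # 1,000

joint = {k: v / total for k, v in counts.items()}   # joint probabilities
p_color = {c: sum(p for (col, _), p in joint.items() if col == c)
           for c in ("Burgundy", "Champagne")}      # row sums
p_stereo = {s: sum(p for (_, st), p in joint.items() if st == s)
            for s in ("Sweetone", "No Sweetone")}   # column sums

print(joint[("Burgundy", "Sweetone")])  # 0.15 -> P(Burgundy & Sweetone)
print(p_color["Burgundy"])              # 0.75 -> P(Burgundy)
print(p_stereo["Sweetone"])             # 0.2  -> P(Sweetone)
```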

Information about the distribution of properties in populations is often available in terms of probabilities, so the table of

probabilities is a very natural way to represent the Bennie data.


Summary

For two events A and B with outcomes A1, A2, etc. and B1, B2, etc., respectively, the joint probability P(A1 & B1) is the

probability that the uncertain event A has outcome A1 and the uncertain event B has outcome B1. The joint probabilities of

all possible outcomes of two uncertain events can be summarized in a probability table. The marginal probability of

the outcome A1 of the first uncertain event is the sum of the joint probabilities of outcomes A1 and all possible outcomes

B1, B2, etc. of the second uncertain event.

Conditional Probabilities

Automaker Chariot's limited edition Ben Hur model comes in two possible colors — Champagne or Burgundy — and with

or without a high-end Sweetone stereo. The table below shows the distribution of these properties — "Color" and "Stereo"

— in the population of 1,000 Bennies that Chariot has sold to date.

Restricting our focus to the Burgundy Bennies, we ask the following question: among the set of Burgundy Bennies, what is

the proportion of Burgundy Bennies with a Sweetone stereo? Stated differently: what is the probability that a randomly

selected Bennie has a Sweetone given that we know it is Burgundy?

To answer the question, we find the ratio of Sweetone-equipped Burgundy Bennies among the set of all Burgundy

Bennies. That is, we divide the number of Bennies that are both Burgundy and have a Sweetone — 150 — by the total

number of Bennies that are Burgundy — 750. The probability is 20%.

This probability is called a conditional probability: the probability that a Bennie is Sweetone-equipped given that it is

Burgundy. We'll denote this probability P(Sweetone | Burgundy), and read the vertical line as "given."

P(Sweetone | Champagne) is the number of Bennies that are Champagne and Sweetone-equipped divided by the total

number of Champagne Bennies.

Enter the percentage as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50").

Round if necessary.

Using the table of actual numbers of cars we can calculate P(No Sweetone | Burgundy) and a fourth conditional probability,

P(No Sweetone | Champagne).

Earlier, we used the table of actual numbers of cars to calculate a table of probabilities. When given information as a table

of probabilities, we can use the probabilities to calculate conditional probabilities as well: we simply form the ratios of the

appropriate probabilities.

For example, the probability that a Bennie is Sweetone-equipped given that it is Burgundy is the probability of a Sweetone-

equipped, Burgundy Bennie divided by the probability of a Burgundy Bennie.

In fact, a conditional probability is formally defined in terms of the ratio of a joint probability to a marginal probability.

If P(Sweetone | Burgundy) represents the probability that a Bennie is Sweetone-equipped given that it is Burgundy, what

does P(Burgundy | Sweetone) represent?

P(Burgundy | Sweetone) represents the probability that a Bennie is Burgundy given that it is Sweetone-equipped. Note that

P(Sweetone | Burgundy) and P(Burgundy | Sweetone) are not the same.
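A sketch of both conditionals, each computed as the ratio of a joint probability to a marginal probability:

```python
p_joint = 0.15      # P(Burgundy & Sweetone)
p_burgundy = 0.75   # P(Burgundy)
p_sweetone = 0.20   # P(Sweetone)

print(round(p_joint / p_burgundy, 2))  # 0.2  -> P(Sweetone | Burgundy)
print(round(p_joint / p_sweetone, 2))  # 0.75 -> P(Burgundy | Sweetone)
```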

We can calculate the other conditional probabilities, this time conditioning the property "Color" on the property "Stereo."

It is useful to write the conditional probabilities next to the table of probabilities to facilitate calculation: since P(Burgundy |

Sweetone) and P(Champagne | Sweetone) require only the probabilities in the Sweetone column, we write them directly

below the Sweetone column.

Similarly, since P(Burgundy | No Sweetone) and P(Champagne | No Sweetone) require only the probabilities in the "No

Sweetone" column, we write them directly below the "No Sweetone" column. Note that the new rows mimic the rows in the

original table: Burgundy on top and Champagne below it.

Similarly, we write the conditional probabilities for the property "Stereo" conditioned on the property "Color" to the right

of the table. Again, we mimic the columns in the original table: a column for "Sweetone" and one for "No Sweetone." We

now have a full table of joint, marginal, and conditional probabilities.

It is important to double check which event we are conditioning on, and to make sure our calculations are realistic. For

example, the probability that a randomly chosen Indian citizen is the prime minister is nearly zero, but the probability that

the prime minister of India is an Indian citizen is 100%!


The table of joint probabilities informs our understanding of the likelihood of different properties or events in a crucial

way: as we've seen in the Bennie example, once we have the joint probabilities, we can compute any conditional probability

and any marginal probability.

Thus, when presented with a decision problem in which the outcomes are influenced by multiple uncertain events,

constructing the joint probabilities of these events is almost always a wise first step.

Summary

The conditional probability P(A | B) is the probability of the outcome A of one uncertain event, given that the outcome B

of a second uncertain event has already occurred. The table of joint probabilities provides all the information needed to

compute all conditional probabilities. First calculate the marginal probabilities for each event, then compute the

conditional probabilities as ratios of joint probabilities to marginal probabilities: P(A | B) = P(A & B) / P(B).

Statistical Independence

Let's return once more to our Chariot Ben Hur example. The Bennie comes in two possible colors — Champagne or

Burgundy — and with or without a Sweetone stereo. The probability table below shows the distribution of these properties -

"Color" and "Stereo" - in the population of Bennies that Chariot has sold to date.

Note that the proportion of Sweetone-equipped Bennies in the Burgundy population is the same as the proportion of

Sweetone-equipped Bennies in the overall population: 20%. In the language of conditional probabilities, this is the same as

saying that P(Sweetone | Burgundy) = P(Sweetone).

In other words, if we randomly select a Bennie, then discovering that it is Burgundy gives us no additional information

about whether or not it is equipped with a Sweetone stereo, beyond what we had before we knew its color: we still think

there is a 20% chance it has a Sweetone.

Similarly, the proportion of Burgundy Bennies in the Sweetone-equipped population is the same as the proportion of

Burgundy Bennies in the overall population: P(Burgundy | Sweetone) = P(Burgundy) = 75%.

A Bennie's color tells us nothing about its stereo system. Its stereo system tells us nothing about its color. When this is true,

we say that the Bennie properties "Color" and "Stereo" are statistically independent, or, more simply, independent.

In general, we can interpret the fact that two uncertain events are independent in the following way: knowing that one

event has occurred gives us no additional information about whether or not the other event has. For example, the results of

two spins of a wheel of fortune are independent. The first result does not reveal anything about the second.

We can confirm the independence of the Bennie's stereo and color by looking at our Venn diagram and noting that

Sweetone-equipped Bennies occupy the same percentage in the population of Burgundy Bennies — 20% — as they do in the entire Bennie population.

Thus, to find the joint probability that a Bennie has a Sweetone stereo and is Burgundy, we take 20% of the 75% of Bennies that are Burgundy, giving us a joint probability of 15%. This property is true for any two statistically independent

properties: the joint probabilities are simply the products of the marginal probabilities.

Although it may seem plausible to assume that certain properties are independent, managers who take statistical

independence for granted do so at their peril. We need to verify the assumption that the properties are independent by

looking at and evaluating data or by proving independence on the theoretical level.

Statistical Dependence

When are two outcomes statistically dependent? Let's look at another optional feature of the Bennie: a unique factory-

installed theft discouragement system (TDS).

Using Chariot's data, we can create the following table of the distribution of the properties "Stereo" and "Protection."

Before going forward, practice your skills by calculating the joint, marginal, and conditional probabilities for these

properties. Place them in the usual format in the complete probability table.

From the table, we can see that P(TDS | Sweetone) is not equal to P(TDS). We can infer that the properties "Stereo" and

"Protection" are not statistically independent — once you know that a randomly selected Bennie has a Sweetone, you

know the chances of it having a TDS system are greater than they would be if you didn't have that information.

Once again our Venn diagram provides visual confirmation, in this case that the properties are not independent. In the

overall population, the proportion of Bennies that are TDS-protected is only 35%. However, in the population of

Sweetone-equipped Bennies, the proportion is significantly higher: 75%. Why might that be?


Bennie buyers who opt for the Sweetone stereo feel that their cars will be especially targeted for theft or vandalism, and

are more likely to choose the TDS option than are buyers who choose the low-end car stereo.

A savvy car thief knows that the next Bennie he passes has a 35% probability of being protected by a theft

discouragement system.

Once he identifies that a particular Bennie has a Sweetone stereo, he gains more information: then he knows that the car

has a 75% chance of being TDS-protected. This may affect his decision about whether or not to break into that car.
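A quick sketch of both checks; the joint probability P(TDS & Sweetone) = 0.15 below is derived from the stated P(TDS | Sweetone) = 75% and P(Sweetone) = 20%:

```python
def independent(p_joint, p_a, p_b, tol=1e-9):
    # A and B are independent iff P(A & B) = P(A) * P(B).
    return abs(p_joint - p_a * p_b) < tol

# "Color" and "Stereo": P(Burgundy & Sweetone) = 0.15 = 0.75 * 0.20.
print(independent(0.15, 0.75, 0.20))  # True

# "Stereo" and "Protection": P(TDS & Sweetone) = 0.75 * 0.20 = 0.15,
# but P(TDS) * P(Sweetone) = 0.35 * 0.20 = 0.07.
print(independent(0.15, 0.35, 0.20))  # False -> statistically dependent
```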

Summary

Two uncertain events A and B are said to be statistically independent if knowing that A has occurred does not tell us

anything about the probability of B occurring, and vice versa. Statistical independence of two events can be

demonstrated based on data or proved from theory; it should never be assumed. Events that are not statistically

independent are said to be statistically dependent.

Probability theory is fascinating, sure, but what do conditional probabilities have to do with decision analysis?

To see how we might apply conditional probabilities in a decision analysis, let's revisit the Cloven film production example.

Seth Chaplin of S&C Films is ecstatic. In a sushi lunch meeting with the agent of superstar actor Shawn Connelly, the agent

agreed to try to convince Connelly to be the voice of the main character in Seth's new film, Cloven.

Under Seth's charming yet irresistible pressure, the agent told Seth that — just between the two of them — he thought he had

a 40% chance of persuading Connelly to take the role. Connelly's fame is such that by lending the prestige of his name to

Cloven, the likelihood of Cloven's success increases dramatically.

Seth has booked the services of a crack animation team, so he'll have to start production soon — probably before Connelly

makes a decision. But that shouldn't hinder production because Connelly can add his voiceovers well after the animation has

been created.

Seth has not yet signed either of the two deals he negotiated, and feels he should reconsider his options in light of Connelly's

possible participation. With a decent shot at star power behind Cloven, Seth feels he can close a better deal with Pony.

Connelly's voice would also substantially increase the likelihood of blockbuster success, making the deal with K2 potentially

more lucrative.

Seth returns to Pony and hammers out a second agreement, contingent on Connelly's participation. Pony will pay an

additional $5 million — $15 million in total — if Seth can retain Connelly's voice acting services. Even after accounting for

Connelly's salary, Seth expects to make a total of $2.2 million in profits from the Pony deal if Connelly agrees to take the role.

How should all of this new information affect Seth's decision? Let's take a look at the new decision tree.

If Seth chooses the Pony deal, there are now two possible scenarios: in the first, Connelly agrees to lend his voice to Cloven, in

the second, Connelly declines. These two scenarios are associated with two different profit outcomes: $2.2 million and $1

million, respectively.

Based on Connelly's agent's assessment, there is a 40% chance that Connelly will take the role, and a 60% chance that he

won't.

What about the K2 option? Which of the following decision trees correctly reflects the newly introduced circumstances?

The nodes of a decision tree are arranged from left to right in the order in which the decision maker will eventually know

their results. In this case, Seth will first discover whether or not Connelly takes the role; only then will he see how Cloven

performs at the box office.

What are the probabilities of the different branches? As in the Pony option, the first branching in the K2 option is a split

between the scenarios in which Connelly signs onto the Cloven project and those in which he doesn't. These branches are

associated with probabilities of 40% and 60%, respectively.

What are the probabilities of the six final branches? Let's look closely at the top three branches. Once we know that Connelly

has signed on, we need to know what the respective probabilities are of a "Blockbuster," a "Lackluster" performance, and a

"Flop." In other words, we need the conditional probabilities of the three outcomes, given that Connelly takes part in Cloven.

In any decision tree, to "be at a node" is to assume that all the events on the path leading to that node from the left have

already taken place. Thus, the data we associate with any decision or chance node depend directly on the sequence of events

on the unique path leading up to that node from the left.

Specifically, the probabilities after a chance node must be conditioned on all events on the preceding path, and the outcome

values must incorporate the effects of all of the preceding events.


Returning to Seth's decision: if Connelly takes part in Cloven, Seth estimates at 50% each the conditional probability that

Cloven will be a "Blockbuster" and the conditional probability that it will have a "Lackluster" theater run. At this stage in

Connelly's career, a "Flop" is virtually impossible. If Connelly doesn't take part, the conditional probabilities are simply the

original probabilities of 30%, 50%, and 20% for the three possible outcomes.

Using these conditional probabilities, we can determine which of the two options Seth should choose by calculating the EMVs

of all the nodes and folding back the decision tree.

Let's begin with the Pony option. What is its EMV? Enter the EMV in $millions as a decimal number with two digits to the right

of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary.

Now we find the EMV of the K2 option, constructing it step by step. First, we must find the EMV for the event that Connelly

decides to voice the main character in Cloven. That EMV is $3 million.

Next, we find the EMV for the event in which Connelly refuses the part. That EMV is simply the EMV of the original K2

option: $1.4 million.

The total EMV for the K2 option is $2.04 million: the EMV with Connelly's participation weighted by the probability of his

participation, plus the EMV of Cloven filmed without Connelly, weighted by the probability that he refuses the part.

The possibility that Shawn Connelly might take part greatly enhances the attractiveness of the K2 option.
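A sketch folding back the full two-stage tree; the "Lackluster" ($0) and "Flop" (-$2 million) outcome values are inferred from EMV(K2) = 6p - 0.4 and the original $1.4 million K2 EMV rather than stated directly above:

```python
# All values in $millions.
p_connelly = 0.40
outcomes  = {"Blockbuster": 6.0, "Lackluster": 0.0, "Flop": -2.0}  # inferred
p_with    = {"Blockbuster": 0.5, "Lackluster": 0.5, "Flop": 0.0}   # given Connelly
p_without = {"Blockbuster": 0.3, "Lackluster": 0.5, "Flop": 0.2}   # given no Connelly

def emv(probs):
    return sum(probs[s] * outcomes[s] for s in outcomes)

emv_k2 = p_connelly * emv(p_with) + (1 - p_connelly) * emv(p_without)
emv_pony = p_connelly * 2.2 + (1 - p_connelly) * 1.0

print(round(emv_k2, 2), round(emv_pony, 2))  # 2.04 1.48 -> choose K2
```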

Summary

The eventual outcomes of many decisions involve sequential uncertain events whose outcomes are determined over time.

To conduct a decision analysis in such cases, we need the probabilities of the first uncertain event's possible outcomes and

conditional probabilities of future uncertain events' possible outcomes, conditioned on the previous events' outcomes.

Captain Ahab Fisheries cans herring and sardines in either spicy tomato sauce or vegetable oil. The table to the right

summarizes the distribution of Ahab's product line. What is the marginal probability that a randomly selected can of fish is

canned in tomato sauce?

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as "0.500").

Round if necessary.

The marginal probability of tomato sauce is 23%. Similarly, the marginal probability of vegetable oil is 77%.

What is the marginal probability that a randomly selected can of fish contains sardines?

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as "0.500").

Round if necessary.

The marginal probability of sardines is 40%. Similarly, the marginal probability of herring is 60%.

What is the conditional probability that a randomly selected can of fish is canned in tomato sauce, given that it is a can of

sardines?

Enter the probability as a decimal number with three digits to the right of the decimal point (e.g., enter "50%" as "0.500").

Round if necessary.

The conditional probability of tomato sauce given that a can contains sardines is 35%.

What is the conditional probability that a randomly selected can of fish is canned in tomato sauce, given that it is a can of

herring?

Enter the probability as a decimal number with two digits to the right of the decimal point (e.g., enter "50%" as "0.50").

Round if necessary.

The conditional probability of tomato sauce given that a can contains herring is 15%.

Which is larger, the conditional probability that a randomly selected can of fish contains sardines, given that it is canned in

vegetable oil, or the conditional probability that a randomly selected can of fish is canned in vegetable oil, given that it

contains sardines?

The conditional probability of a can containing sardines given that it contains vegetable oil is 34%. The conditional

probability of the fish being canned in vegetable oil given that the can contains sardines is 65%. Note that P(Sardines | Oil)

is not the same as P(Oil | Sardines).
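The joint table referenced above can be reconstructed from the quoted probabilities, as in this sketch:

```python
# Marginals for contents and conditionals for sauce, from the answers above.
p_sardines, p_herring = 0.40, 0.60
p_tomato_given_sardines, p_tomato_given_herring = 0.35, 0.15

joint = {
    ("sardines", "tomato"): p_sardines * p_tomato_given_sardines,     # 0.14
    ("sardines", "oil"): p_sardines * (1 - p_tomato_given_sardines),  # 0.26
    ("herring", "tomato"): p_herring * p_tomato_given_herring,        # 0.09
    ("herring", "oil"): p_herring * (1 - p_tomato_given_herring),     # 0.51
}

p_tomato = joint[("sardines", "tomato")] + joint[("herring", "tomato")]
print(round(p_tomato, 2))                                      # 0.23
print(round(joint[("sardines", "oil")] / (1 - p_tomato), 2))   # 0.34 P(Sardines|Oil)
print(round(joint[("sardines", "oil")] / p_sardines, 2))       # 0.65 P(Oil|Sardines)
```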


The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities

is less than 100% tells us

We cannot infer anything about whether or not these events are mutually exclusive from the information provided. The

events A, B, and C on the left are mutually exclusive. The events D, E, and F on the right are not mutually exclusive.

The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities

is less than 100% tells us

For a set of events to be collectively exhaustive, the probabilities of the events must sum to at least 100%.

The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities

is greater than 100% tells us

The probabilities of the events in a mutually exclusive set cannot add up to more than 100%. If the probabilities of a set of

events add up to more than 100%, then at least two of them are not mutually exclusive, i.e., they can occur simultaneously.

The table to the right shows three possible outcomes of an uncertain event. The fact that the sum of the three probabilities

is greater than 100% tells us

We cannot infer anything about whether or not these events are collectively exhaustive from the information provided.

Below are examples of events with these probabilities; one set is collectively exhaustive and one isn't.

Market research can be expensive. If the cost of the research outweighs its value to Leo, then obviously he shouldn't spend

money on the research. "We'll need to assess the value of Leo's market research," Alice tells you, "to determine whether or not

paying for it is a wise choice."

The Eris Shoe Company is considering sourcing some of its production from the developing country Arboria. Arboria was a

land of civil war and strife until a controversial UN intervention two years ago reconciled warring factions and helped install a

national unity government. Today, guerrilla warfare persists in the mountains, but the major coastal cities are relatively

secure.

Eris CEO Emily Ville has identified Arboria as a candidate location due to its low labor costs and is considering opening a

small buying office in the capital city. However, she also recognizes the potential for substantial risk if civil war should break

out, including the loss of Eris' investments in the office and the disruption of its supply chain.

Emily estimates the present value of sourcing from Arboria at $1 million in net savings as long as the Arborian production

facility and infrastructure work at a high level of reliability. In her opinion, the probability of such a "Serene" scenario is 40%.

Emily believes there is a 40% chance of a more "Troubled" environment beset with minor supply chain disruptions, which

would reduce Eris' expected cost savings to $0.5 million. Finally, she estimates the likelihood of major political unrest at

20%, and the net present value of the losses associated with such a "Chaotic" scenario at $1 million.

Instead of sourcing from Arboria, Eris could continue its current sourcing agreements, which would not result in any cost

reduction.

The EMV of sourcing from Arboria is $0.4 million. Based on the EMV criterion, Emily should choose to source from Arboria.

However, given the stakes involved, Emily would like to have additional information that would increase her understanding

of Arboria's political situation.

Ideally, she'd like to base her decision on perfect information, i.e., she'd like to know now exactly which one of the three

outcomes will materialize. Such perfect information would be extremely valuable: how much should Emily be willing to pay

for it?

Managers often have the opportunity to gather more data or expertise to inform their business decisions. Depending on the

business context, new information can be obtained from statistical data, expert consultants, or scientific tests and

experiments. In many cases, such information can improve the accuracy of the estimates incorporated into a decision

analysis.

However, the cost of such data can drag down the bottom line. Clearly, the cost of additional information must be weighed

against its value. How can managers assess the value of information?

Suppose — for the sake of argument — that Emily knows a fortune-teller, Frieda Featherlight, who has genuine psychic

powers and can predict the future flawlessly, giving Emily the perfect information she craves. How much should Emily pay


Frieda for her supernatural services?

To answer this question, we calculate the Expected Value of Perfect Information — the EVPI — provided by Frieda. We

can determine this value by framing a decision: should Emily purchase Frieda's information before she makes her decision to

source from Arboria?

Characterized as a decision problem, we can determine the EVPI using familiar tools: a decision tree and the expected

monetary value. Let's construct a tree for Emily's decision, beginning with a decision node that branches into two options:

"Buy the information" or "Don't buy the information."

The lower branch — "Don't buy the information" — is simply the original tree for the decision about whether or not to source

from Arboria. This tree uses only the information Emily already possesses. The EMV of this option is $0.4 million.

If Emily engages Frieda's services and "buys the information," three possibilities emerge based on which scenario Frieda

predicts — "Serene," "Troubled," or "Chaotic." If the probabilities of these scenarios actually occurring are 40%, 40%, and

20%, respectively, what is Emily's best assessment of the likelihood that Frieda will predict each scenario?

Without further information, Emily has no reason to change her initial assessments. Unless she probes Frieda for

information, Emily's best guess that Frieda will predict each outcome — "Serene," "Troubled," and "Chaotic" — is the same

as her best estimates for the probabilities that those outcomes will actually occur: 40%, 40%, and 20%, respectively.

Let's assume for the moment that Frieda won't charge Emily for her services. The beauty of perfect information is that Emily

can delay her decision until after she has heard Frieda's completely accurate prediction of what the Arborian political climate

will be. If Frieda predicts a "Serene" sourcing experience, then Emily will choose to source from Arboria, and cost savings of

about $1 million are certain.

Likewise, if Frieda predicts a "Troubled" sourcing experience, Emily will source from Arboria, and cost savings of about $0.5

million are certain. If Frieda predicts a "Chaotic" sourcing experience, then Emily will choose not to source from Arboria,

thereby avoiding a certain loss in the $1 million range. In this case, Eris will receive no cost savings.

Still working under the assumption that Frieda won't charge for her services, what is the EMV of "buying the information"?

Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as

"5.50"). Round if necessary.

At $600,000, the EMV of "buying the information" is $200,000 higher than the EMV of "not buying the information." This

difference between the EMVs of the option with perfect information and without perfect information is the expected value of

perfect information (EVPI). The EVPI of $200,000 is the maximum amount Emily should pay Frieda for a séance.
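A sketch of the EVPI calculation, using the values above:

```python
# All values in $millions; "not sourcing" yields $0 in every scenario.
probs   = {"Serene": 0.40, "Troubled": 0.40, "Chaotic": 0.20}
savings = {"Serene": 1.0,  "Troubled": 0.5,  "Chaotic": -1.0}

# Without information: commit to the better of the two acts up front.
emv_without = max(0.0, sum(probs[s] * savings[s] for s in probs))  # 0.4

# With perfect (free) information: pick the best act for each prediction.
emv_with = sum(probs[s] * max(0.0, savings[s]) for s in probs)     # 0.6

print(round(emv_with - emv_without, 2))  # 0.2 -> EVPI of $200,000
```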

The EVPI establishes an upper bound on what we should pay for perfect information. In some business cases, perfect — or

near perfect — information may be available without supernatural means.

For instance, suppose that aerospace company Airbus wants to know if it could perfect all of the technologies necessary to

create a safe and reliable SpaceCruiser, a space shuttle for tourists that would have all the comforts of a luxury cruise ship.

Airbus could build a small but completely functional prototype and see if it works. Although Airbus might collect near perfect

information by doing so, the cost of developing the SpaceCruiser prototype would likely be higher than the expected value of

that perfect information.

Instead, Airbus might first use computer simulations and limited functionality prototypes to assess the technical viability of

the SpaceCruiser. The information thus gained wouldn't be perfect, but it would be helpful and it would cost far less than a

fully functional prototype.

The EVPI establishes an upper bound on the value of any information, perfect or flawed. Through sampling, educated

expertise, or imperfect testing, we might be able to gain better — though not perfect — estimates of the probabilities and the

outcome values of our decision's possible scenarios.

Shortly, we'll learn how to value imperfect information — information that reduces, but does not eliminate, our uncertainty

about future events. For the time being, we know that we should never spend more for imperfect information than we are

willing to pay for perfect information.

In her quest for an infallible choice in the Arborian sourcing decision, if Emily won't pay more than $200,000 for perfect

information, she certainly shouldn't pay more than $200,000 for imperfect information. If the price of imperfect information

exceeds the EVPI, we should not expend resources on it.

Summary

As managers, we would like to know exactly which outcomes will occur so we can make the best decisions. We can calculate

the expected value of such perfect information — EVPI — to find an upper limit on the amount we would be willing


to pay for any additional information. To calculate the EVPI, we first frame the decision problem "to buy or not to buy the

perfect information." We subtract the EMV of buying perfect information — assuming it's free — from the EMV of not

buying perfect information to find the EVPI.

In almost all circumstances, perfect information is either impossible to assemble or prohibitively expensive. In these cases,

we use imperfect information — often called "sample" information — to inform our decision. As with perfect information,

the cost of sample information must be weighed against its value. As managers, how do we assess the value of sample

information?

Let's return to the Eris Shoe Company's decision to source production in Arboria. Emily Ville has distinguished three possible

scenarios for the Arborian political climate: "Serene," "Troubled," and "Chaotic," which she believes have probabilities of

40%, 40%, and 20%, respectively. The outcomes she associates with these scenarios — in cost savings for Eris — are $1

million, $0.5 million, and -$1 million, respectively.

Emily would like to improve her estimates of the probabilities. She contacts PoliFor, a consulting company that specializes

in business outlook intelligence. PoliFor produces an analysis of a country's or region's political climate, and then provides

a risk assessment specific to the needs of its client, distinguishing between "High" and "Low" risk situations.

In a world of uncertainty, PoliFor's assessments are not always accurate. Sometimes, regions with "Low" risk assessments

burst into revolutionary flame. Sometimes, regions with "High" risk assessments turn out to be as stable and placid as the

moon's orbit. How high a price should Emily pay for PoliFor's services given that PoliFor's assessments are not perfectly

reliable?

Based on preliminary conversations with PoliFor's lead consultant, Emily assesses the probabilities that PoliFor will report

"High" or "Low" risk for sourcing in Arboria at 30% and 70%, respectively. She then considers how each of these risk level

assessments would affect her estimates of the relative likelihood of her three representative scenarios: "Serene,"

"Troubled," and "Chaotic."

Emily believes that if PoliFor predicts "Low" risk, then the likelihood of a "Chaotic" political climate would be low: 5%, and

the probabilities of "Troubled" or "Serene" scenarios would be 40% and 55%, respectively. If we use "Low" to represent

PoliFor predicting a low-risk political environment, which of the following best expresses the beliefs Emily has summarized

above?

The information upon which Emily is basing her assessments is the PoliFor prediction of a "Low" risk political climate.

Thus, she is estimating conditional probabilities that Eris' experience sourcing from Arboria will be "Serene," "Troubled,"

or "Chaotic," given that the consultants predict a "Low" level of risk.

Similarly, Emily assesses the conditional probabilities that Eris' experience sourcing from Arboria will be "Serene,"

"Troubled," or "Chaotic" given that the consultants predict a "High" level of risk at 5%, 40%, and 55%, respectively. How do

we calculate the expected value of the imperfect information PoliFor offers?

As we did for perfect information, we frame this question as a decision problem: should Emily purchase the imperfect

information from PoliFor or not? First, we'll construct a decision tree. The tree begins with a decision node for the choice

Emily faces: "Buy the information" or "Don't buy the information."

The lower branch for the option "Don't buy the information" is the original tree that Emily constructed using her initial

estimates for the outcome values and probabilities of the three basic scenarios. What should emanate to the right from the

"Buy Information" branch?

If we are at the end of the "Buy Information" branch, we assume that Emily has purchased the information. In that case,

the first thing that she will do is open the report envelope and examine the information to learn what level of risk PoliFor has predicted. Thus, the upper branch for the option "Buy the information" leads to a chance node that splits into two branches, one for each risk level the consulting company might predict.

What should emanate from each of the Low and High branches?

The information that Emily has purchased is valuable only if she uses it to support her decision-making. The value of the

information resides in Emily's ability to make her sourcing decision after learning the level of risk PoliFor predicts. Thus,

after each prediction, we place a decision node: should she source from Arboria or continue to source from her current

suppliers?

To complete the decision tree, we must incorporate the possible scenarios that can occur if Emily chooses to source from

Arboria: "Serene," Troubled," or "Chaotic." What probabilities should we assign to the three branches emanating from the

chance node highlighted to the right?

At the highlighted node, we assume that every event on the unique path leading from the node back to the beginning of the

tree has transpired: Emily bought the information; PoliFor reported "High" risk; and Emily chose to source from Arboria.

Thus, the probabilities assigned to the branches must be conditioned on those events. For example, we assign P(Serene |

High) to the "Serene" scenario on the "High" branch.


Now we complete the tree, substituting the values for the conditional probabilities and placing the appropriate monetary

values at the endpoints. As usual, we'll assume for the moment that the information is free, and start to fold back the tree.

What is the EMV of the "High Risk" branch's decision node?

If PoliFor predicts a "High" risk level, Emily should not choose to source from Arboria since the EMV of the "Source from Arboria" node, -$0.3 million, is less than the EMV of the "Don't source" node, $0 million. Instead, she should continue her

current supply arrangements, in which case she will realize $0 in cost savings.

Folding back the decision tree, we find an EMV of $490,000 for the option "Buy the information."

What is the most Emily should pay for the imperfect information?

The difference between the EMVs of the "Buy" and "Don't buy the information" options is $90,000, the expected value of

imperfect (or sample) information (EVSI) provided by PoliFor. The EVSI is the maximum amount that Emily should be

willing to pay for PoliFor's report.
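The same fold-back arithmetic can be verified in a few lines. Below is a rough Python sketch under the assumptions stated above (Emily's report probabilities and her conditional scenario estimates); all names are ours:

```python
# EVSI for PoliFor's report; all figures from the text ($ millions).
savings = {"Serene": 1.0, "Troubled": 0.5, "Chaotic": -1.0}
p_report = {"Low": 0.70, "High": 0.30}
posterior = {  # P(scenario | report)
    "Low":  {"Serene": 0.55, "Troubled": 0.40, "Chaotic": 0.05},
    "High": {"Serene": 0.05, "Troubled": 0.40, "Chaotic": 0.55},
}

emv_without = 0.40  # EMV of the original sourcing decision

emv_with = 0.0
for report, p in p_report.items():
    emv_source = sum(posterior[report][s] * savings[s] for s in savings)
    emv_with += p * max(emv_source, 0.0)  # source only when it beats $0

print(f"EMV with report = {emv_with:.2f}")     # 0.49
print(f"EVSI = {emv_with - emv_without:.2f}")  # 0.09
```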

As we would anticipate, the expected value of this imperfect information is quite a bit less than $200,000, the expected

value of perfect information we computed earlier.

PoliFor's information, although imperfect, has value because it allows Emily to update her original probability estimates

based on the risk level PoliFor predicts. She can make better decisions based on these more accurate updated probability

assessments.

The probability estimates we started with — Emily's estimates for P(Chaotic), etc. — are called prior probabilities. The

updated conditional probabilities — P(Chaotic | High Risk), P(Troubled | High Risk), etc. — are called posterior

probabilities.

Summary

By investing in professional expertise, statistical studies, or tests, managers can often improve their understanding of

uncertain events. Such imperfect (or "sample") information is itself a source of uncertainty because it isn't a perfect

predictor of the outcomes of interest. Although imperfect information can improve our predictions, such information

comes at a price. Responsible managers calculate the expected value of sample information (EVSI), and purchase

information only if its expected value exceeds its cost. To calculate the EVSI, we pose the decision problem "to buy or not

to buy the sample information," and subtract the EMV of buying the information — assuming that it's free — from the

EMV of not buying it.

Zeke "Claw" Crankshaw owns and runs Oligarchol, one of the largest independent oil producers in the Western

Hemisphere.

Recently, Zeke acquired the mineral rights to a piece of land near Chanchito Gordo Canyon. Based on his own professional

assessment, Zeke believes that he has a 50% chance of finding an oil field if he drills in Chanchito Gordo, and a 50% chance

of drilling nothing but "Dry" holes.

Like any subjective probabilities, these estimates are the product of Zeke's educated guesswork. Clearly, the oil is either

beneath the surface at Chanchito or it isn't. However, Zeke's experience tells him, for example, that in about half of all

similar situations — similar terrain, similarly exposed rock strata, etc. — oil prospectors have struck oil.

He estimates the value of striking oil — in present value of future profits — to be $8 million after netting out drilling costs.

If all of his boreholes come up "Dry," Zeke will have lost the $2 million cost of drilling. With these assessments in mind,

Zeke must decide if he should drill on the property.

If Zeke decides not to drill, his investment in the mineral rights will not pay off in the way he had hoped. However, since

the purchase price of the mineral rights has already been incurred, it is a sunk cost that shouldn't bear on the decision.

What is the EMV of Zeke's decision to drill? Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as "5.50"). Round if necessary.

Without further information, Zeke should start shipping drill bits and oil rigs over the dusty back-country roads to

Chanchito. But old Zeke is not the impetuous young whippersnapper of the early days of Oligarchol, back when the

company was derisively referred to in industry circles as "Pipe Dreams, Incorporated."

Now, Zeke is happy to use the latest in seismic testing technology to gain better information and make more informed

decisions. Zeke can order a $100,000 seismic test of the Chanchito Gordo property. Although the seismic test will improve

his estimate of the likelihood of finding oil, it won't provide perfect information.


The seismic test will report one of two possible outcomes: a "Positive" or a "Negative" result, depending on whether or not

the test indicates the presence of oil. Because the seismic test's accuracy is a key determinant of its value, Zeke has gathered

some historical data on the test's performance.

Zeke believes that if there is an oil field on his property, then the probability that the test will return a "Positive" result is

90%. This is the conditional probability P(Positive | Oil). Clearly, the test isn't perfect — even if there is a reserve on his

property, there is a 1 in 10 chance the test will fail to detect it. This is the conditional probability of a "False Negative"

result: P(Negative | Oil) = 10%.

And, unfortunately for Zeke, even when there isn't a drop of oil to be found, the test will still report a positive result — a

false "Positive" — with probability 30%. The probability that the seismic test will indicate the absence of oil when there is,

in fact, no oil — P(Negative|Dry) — is 70%. How can Zeke use these data to calculate the EVSI — the expected value of the

test information?

As usual, we start with a decision node that branches into two options: "Test" or "Don't Test." The lower branch, "Don't

Test," is the same as Zeke's original tree for whether to drill or not. The EMV for the "Don't Test" branch is $3 million; it is

based only on the information Zeke previously had available to him.

If Zeke buys the test, he will learn the test's result — positive or negative — before he has to decide whether to drill or not.

What probability should Zeke assign to the "Test Positive" branch?

The probability that Zeke should assign to the "Test Positive" branch is P(Positive). However, this probability is not in the

information Zeke originally had available, which is summarized in the table to the right.

To find P(Positive), we must first find the joint probabilities and enter them into the table. What is P(Oil & Positive)?

Round if necessary.

To find P(Oil & Positive), we use the familiar relationship for conditional probabilities: P(Oil & Positive) = P(Positive | Oil)P(Oil) = (90%)(50%) = 45%. If you wish to practice, calculate the other joint probabilities and P(Positive) and P(Negative) before advancing to the next screen.

Adding the joint probabilities, we find that P(Positive) = 60% and P(Negative) = 40%.
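If you want to check these numbers, here is a short Python sketch of the joint-and-marginal bookkeeping, using the priors and reliabilities given above (variable names are ours):

```python
# From priors and test reliability to the joint/marginal table
# (Zeke's figures from the text).
p_oil, p_dry = 0.5, 0.5          # prior probabilities
p_pos_given_oil = 0.9            # P(Positive | Oil)
p_pos_given_dry = 0.3            # P(Positive | Dry), the false-positive rate

# Joints via P(A & B) = P(B | A) * P(A)
p_pos_and_oil = p_pos_given_oil * p_oil        # 0.45
p_pos_and_dry = p_pos_given_dry * p_dry        # 0.15
p_neg_and_oil = (1 - p_pos_given_oil) * p_oil  # 0.05
p_neg_and_dry = (1 - p_pos_given_dry) * p_dry  # 0.35

# Marginals: sum the joints for each test result.
p_pos = p_pos_and_oil + p_pos_and_dry          # 0.60
p_neg = p_neg_and_oil + p_neg_and_dry          # 0.40

# Posteriors via P(Oil | TR) = P(TR & Oil) / P(TR)
print(round(p_pos_and_oil / p_pos, 3))  # 0.75
print(round(p_neg_and_oil / p_neg, 3))  # 0.125
```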

We are now ready to complete the tree. After Zeke learns the test result, he makes a decision about whether or not to drill.

If he drills, he will either strike oil or not. What probability should we associate with the branch highlighted to the right?

If Zeke is at the highlighted node, then all the events on the branch leading up to the node must have occurred: he decided

to test, the test was positive, and he decided to drill. Thus, Zeke must condition the probability of striking oil on past

events: specifically, on the fact that the test report was "Positive." Thus, he should associate the conditional probability

P(Oil | Positive) with the highlighted "Oil" branch.

We find the conditional probabilities in the usual way. The full table of joint, marginal, and conditional probabilities is

shown to the right.

Our tree now has all the necessary data. We fold it back and find that the EMV for conducting a seismic test — assuming

the test is free — is $3.3 million.

Enter the EVSI in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000"

as "5.50"). Round if necessary.

The EVSI is $300,000 — the difference between the EMV of conducting the test, $3.3 million, and the EMV of not

conducting the test, $3 million. This is the maximum amount Zeke should be willing to spend on the seismic test. Since the

expected value of the test exceeds its $100,000 cost, Zeke should order the test.
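Folding back the test tree follows the same pattern. A minimal sketch, assuming the posterior probabilities just computed ($ millions; the helper function is ours):

```python
# Fold back Zeke's seismic-test tree ($ millions; posteriors from above).
payoff_oil, payoff_dry = 8.0, -2.0

def emv_drill(p_oil):
    """EMV of drilling for a given probability of striking oil."""
    return p_oil * payoff_oil + (1 - p_oil) * payoff_dry

# No test: drill on the prior P(Oil) = 0.5 only if it beats not drilling.
emv_no_test = max(emv_drill(0.5), 0.0)          # 3.0, so Zeke drills

# With the (free) test: after "Positive" (p = 0.6) use P(Oil | Positive)
# = 0.75; after "Negative" (p = 0.4) use P(Oil | Negative) = 0.125.
emv_test = 0.6 * max(emv_drill(0.75), 0.0) + 0.4 * max(emv_drill(0.125), 0.0)

print(f"EMV with test = {emv_test:.1f}")       # 3.3
print(f"EVSI = {emv_test - emv_no_test:.1f}")  # 0.3
```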

Summary

Often, information is not available in the form we need to inform our decision process and to calculate the

EVSI. Frequently, for example, we have access to data about the reliability of a test. The reliability of a test is given as the

set of conditional probabilities P(TRi | Si), where TRi is a test result and Si is one of the scenarios the test is intended to

help predict. Collectively, these conditional probabilities reveal the test's ability to predict the outcome of the uncertain

event in question. Using this information, we update our initial estimates for the probabilities of each scenario — the

prior probabilities P(Si) — to achieve improved, conditional probabilities of each scenario, given the result of the test —

posterior probabilities P(Si | TRi). We use the posterior probabilities to calculate the EVSI and aid our decision process.

You know how to calculate the expected value of information and set an upper limit on the amount Leo should be willing to

pay for market research. But what kind of market research could Leo have in mind? And would it be worth its cost?


Leo wants to know how much money he should be willing to spend on additional information about consumers' likely

reaction to a floating restaurant. First, you calculate the expected value of perfect information to give Leo an absolute upper

bound on the amount he should pay for any kind of information.

You begin by setting up a decision tree to help inform the decision of whether or not to buy perfect information. Which of the

following best represents the "Buy Information" branch of the perfect information tree?

Whatever the source of perfect information, it will predict the occurrence of either a "Phenomenon" or a "Fad." Which one of

these scenarios it will predict is uncertain; the respective probabilities are equal to the probabilities of the two scenarios: 35%

and 65%.

Once Leo has perfect information, he can choose to either launch the Tethys venture or not.

What is the EMV of acquiring the perfect information, assuming that it is free?

Enter the EMV in $millions as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,500,000" as

"5.50"). Round if necessary.

If Leo is certain that the Chez Tethys will be a "Phenomenon," he will launch her and realize a $2 million profit. If he is

certain that the Tethys would be a mere "Fad," he'll avoid the $800,000 loss by not launching, thereby realizing no profits or

losses in the floating restaurant industry. The EMV of acquiring perfect information is $700,000.

Recall that the EMV of launching the Tethys — based on Leo's original estimates — was $180,000. What is the expected value

of the perfect information?

"5.50"). Round if necessary.

The expected value of perfect information is the difference between $700,000 — the EMV of acquiring the perfect

information — and the EMV of not acquiring the perfect information. In this case, the EMV of not acquiring the perfect

information is the original EMV of launching the Chez Tethys, $180,000. The difference is $520,000.

That's a lot of money! I have an idea for a market research event that is expensive, but it's much cheaper than that. Let's get to

it!

Not so fast, Leo. Up to $520,000 is what you should be willing to pay for perfect information. But real world information is

usually much less than perfect. What exactly do you have in mind?

You two are going to love this. I'll rent a cruise ship and do a week-long test cruise around the islands, inviting guests from the

finest hotels on board for free. At dinner, I'll run an advertisement for the Tethys, then survey all the guests and ask them if

they'd be interested in dining there!

That's an audacious market research plan for an audacious business venture. I like the idea, though, and I think you'll get

much better information than if you hire a firm to do conventional market research. Let's try to estimate the value of this

information.

The three of you debate the merits of Leo's marketing event. You split the results of Leo's event into two response scenarios

based on how Leo's guests/subjects receive his demonstration: "Enthusiastic" and "Tepid," estimating the probabilities of

these responses at 40% and 60%, respectively.

If the subjects respond to Leo's demonstration "Enthusiastically," you believe that the Tethys becoming a "Phenomenon" is

much more likely than your earlier estimate: 65%. On the other hand, if the reception is "Tepid," you estimate that the

probability of a "Phenomenon" is a mere 15%.

Given these data and the tree to the right, what is the expected value of Leo's sample information?

Enter the EVSI in $millions as a decimal number with four digits to the right of the decimal point (e.g., enter "$5,555,000" as

"5.5550"). Round if necessary.

To calculate the EVSI, calculate the EMV of Leo's running his planned event. First, calculate the EMV of launching the Tethys

given that the event participants' response is "Enthusiastic." The EMV is the sum of the outcomes of the "Phenomenon" and

"Fad" scenarios, after they have been weighted by the conditional probabilities that these scenarios will occur given an

"Enthusiastic" reception: $1.02 million.

Since the EMV for launching the Tethys is higher than the EMV for not launching her, Leo should embark on the Tethys

venture if his guests/subjects respond "Enthusiastically" to his event. Thus, the EMV if guests give an "Enthusiastic" response

to Leo's event is $1.02 million.

Next, calculate the EMV of launching the Tethys given that the event participants' response is "Tepid." The EMV is the sum of

the outcomes of the "Phenomenon" and "Fad" scenarios, after they have been weighted by the conditional probabilities that

these scenarios will occur given a "Tepid" reception: -$380,000.


Since the EMV for launching the Tethys is lower than the EMV for maintaining the status quo at the Kahana, Leo should not

embark on the Tethys venture if the participants' response is "Tepid." Thus, the EMV of a "Tepid" response to Leo's event is

$0.

The EMV of Leo's event is $408,000: the sum of the EMVs of the "Enthusiastic" and "Tepid" responses, after they have been

weighted by the probabilities that these responses will occur.

The expected value of Leo's sample information is the difference between the EMV of his event and $180,000 — the EMV of

launching the Tethys without further information. The difference is $228,000.
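The procedure is the same every time: for each possible research result, take the best conditional decision, weight it by the probability of that result, and subtract the EMV without the information. A generic Python sketch applied to Leo's numbers from above; the function name and the fallback-pays-zero assumption are ours:

```python
def evsi(p_result, posterior, payoffs, emv_without):
    """Expected value of sample information, assuming the information
    itself is free and the fallback option (don't launch) pays 0."""
    emv_with = 0.0
    for result, p in p_result.items():
        emv_launch = sum(posterior[result][s] * payoffs[s] for s in payoffs)
        emv_with += p * max(emv_launch, 0.0)
    return emv_with - emv_without

leo = evsi(
    p_result={"Enthusiastic": 0.40, "Tepid": 0.60},
    posterior={"Enthusiastic": {"Phenomenon": 0.65, "Fad": 0.35},
               "Tepid":        {"Phenomenon": 0.15, "Fad": 0.85}},
    payoffs={"Phenomenon": 2.0, "Fad": -0.8},   # $ millions
    emv_without=0.18,                           # EMV of launching outright
)
print(f"{leo:.4f}")  # 0.2280, i.e., $228,000
```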

I talked to some friends last night, and the total cost of renting and running a cruise ship for a week and shooting an

advertising short comes to at least $240,000.

That's higher than the expected value of this information we calculated: $228,000.

Are you telling me that my event would not be worth its cost?

I'm afraid so, Leo. But that doesn't necessarily mean that you shouldn't do it. You might choose to have your event even if its

expected monetary value is low. It's late. Let's break for tonight and meet over breakfast tomorrow.

The grocery chain FUD offers two lines of private label products: "FUD Brand" and "Pleasant Valley Fare." The two brands are used to label two sets of grocery items with very little overlap. For example, "FUD Brand" is used for meat and cheese

products; "Pleasant Valley Fare" for canned goods.

Esther Smith, CEO of FUD, anticipates that eliminating the brand with weaker consumer loyalty and repackaging its

product line under the label of the stronger brand would result in an additional $200,000 in profits from increased sales

and from cost savings on packaging and advertising. If Esther eliminates the stronger brand, she anticipates a net bottom

line improvement of $0 — reductions in sales volume or price discounts would roughly offset cost savings.

Esther doesn't know which of the brands enjoys greater customer loyalty — since the product lines don't overlap, she can't

directly compare the two brands' performances. Initially, she believes that there is a 50% chance that customers are more

loyal to FUD Brand, and a 50% chance that they are more loyal to Pleasant Valley Fare (PVF).

Based on this information, which of the two options - eliminating FUD Brand or Pleasant Valley Fare - has the higher

EMV?

Each option has an EMV of $100,000. Based on Esther's current information, she has no basis for preferring one brand

over the other.

Suppose Esther had access to perfect information, i.e., she could discover with certainty which brand her customers prefer,

so she could eliminate the weaker brand and increase FUD's profits by $200,000. How much should she be willing to pay

for such perfect information?

Enter the expected value of perfect information in $millions as a decimal number with two digits to the right of the decimal

point (e.g., enter "$5,500,000" as "5.50"). Round if necessary.

First, construct the tree for the decision to buy or not to buy the information. The EMV for not buying the information is

$100,000 no matter which brand Esther chooses to eliminate.

There is a 50% chance that the perfect information Esther acquires will reveal that customers prefer the FUD Brand and a

50% chance it will reveal that customers prefer Pleasant Valley Fare.

Once Esther has the perfect information, she will choose to eliminate the brand that customers are less loyal to, adding

$200,000 to FUD's bottom line.

The EMV of buying perfect information is $200,000. The expected value of perfect information, EVPI, is the difference

between the EMVs of buying and not buying perfect information: $100,000.

Since Esther cannot get perfect information, she's willing to settle for information that is less than perfect and decides to

hire a market research firm to determine which of FUD's two brands customers prefer.

Esther believes that the probability that the market research firm will find that FUD Brand has higher customer loyalty is

50%. Likewise, the probability that the market research firm will predict that Pleasant Valley Fare has higher customer

loyalty is 50%.

Whichever brand the market research firm indicates is stronger, Esther believes that the finding will be correct with

probability 85%. Thus, there is a 15% chance that the finding the firm provides will be wrong.


Enter the expected value of sample information in $millions as a decimal number with three digits to the right of the

decimal point (e.g., enter "$5,500,000" as "5.500"). Round if necessary.

First, construct the tree for the decision to buy or not to buy the sample information using the probabilities and outcomes

cited above. The EMV for not buying the information is $100,000 no matter which brand Esther chooses.

Once Esther has the sample information, she will choose to eliminate the brand that the market research indicates is

weaker. The EMV of eliminating the brand that market research predicts enjoys less customer loyalty is $170,000.

The EMV of buying this imperfect information is $170,000. Thus, the expected value of sample information, EVSI, is the

difference between the EMVs of buying and not buying the sample information: $70,000.
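The arithmetic behind Esther's EVSI fits in three lines; a quick Python check using the figures above:

```python
# Esther's EVSI, in $ thousands; figures from the text.
emv_without = 0.5 * 200 + 0.5 * 0    # either brand: $100K
emv_with = 0.85 * 200 + 0.15 * 0     # eliminate the indicated weaker brand
print(emv_with - emv_without)        # 70.0, i.e., an EVSI of $70,000
```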

After a public relations debacle over a possible outbreak of pulluscular pig disorder (PPD) at one of Bowman-Lyons-

Centerville's hog farms, Paul Segal, head of the hog farming division, is considering vaccinating another herd at a hog farm

located near the original outbreak.

Immunity to PPD is fairly common among pigs; the probability that the entire herd in question is immune is fairly high:

60%.

However, if any pigs in the herd are not immune, the expected cost to Bowman-Lyons-Centerville is $150,000. This cost

has been calculated based on the probability of a PPD outbreak and the ensuing costs to contain the disease and manage

the expected PR fallout.

The cost of vaccinating the herd is $40,000. What is the EMV of the cost of not vaccinating the herd?

Enter the EMV in $thousands as a decimal number with one digit to the right of the decimal point (e.g., enter "$5,500" as

"5.5"). Round if necessary.

The EMV of the cost of not vaccinating the herd is $60,000. Since the cost of vaccinating the herd — $40,000 — is lower

than the expected cost of not vaccinating the herd, Paul should have the herd vaccinated, barring the emergence of any

further information.

There is a test that Paul could perform on the herd to determine if the herd is immune to PPD. This test delivers two

results: "Positive" for immunity, or "Negative" for immunity.

The test is not perfectly accurate: if the entire herd is immune, the test will report "Positive" with probability 85% and

"Negative" with probability 15%. Similarly, if any of the pigs in the herd are not immune, the test will report "Positive" with

probability 30% and "Negative" with probability 70%.

Using Paul's prior probability for the immunity of the herd, calculate P(Positive), the marginal probability that the test will

report a "Positive" result.

Round if necessary.

First, calculate the joint probabilities P(Positive & Immune) and P(Positive & Not Immune), using the marginal

probabilities for immunity and non-immunity — 60% and 40%, respectively — and the conditional probabilities that

quantify the reliability of the test - P(Positive | Immune) and P(Positive | Not Immune).

Then, sum the joint probabilities to find the marginal probability of a "Positive" test result: 63%. Since the test results

"Positive" and "Negative" are mutually exclusive and collectively exhaustive events, the probability of a "Negative" is simply

100% minus the probability of a "Positive," i.e., P(Negative) is 37%.

The tree below will help Paul determine whether or not ordering the immunity test will be worthwhile. The calculated

probabilities of the test results are entered in the appropriate branches. However, to reach a decision, Paul needs the

conditional probabilities of immunity and non-immunity conditioned on a "Positive" test result. What is P(Immune |

Positive)?

Round if necessary.

To calculate P(Immune | Positive), use the definition of conditional probability and divide the joint probability P(Positive &

Immune) by the marginal probability P(Positive). P(Immune | Positive) is 81%. Since immunity and non-immunity are

mutually exclusive and collectively exhaustive events, P(Not Immune | Positive) is simply 100% minus P(Immune |

Positive), i.e., 19%.

Calculate the remaining probabilities on the decision tree. What is the highest amount Paul should be willing to spend on

the immunity test?


Enter the expected value of sample information in $thousands as a decimal number with one digit to the right of the

decimal point (e.g., enter "$5,500" as "5.5"). Round if necessary.

The expected value of not vaccinating given a "Positive" test result is $28,500. Since this is lower than the cost of

vaccination, Paul should choose not to vaccinate on EMV grounds.

The probability that the herd is immune given that the test reports "Negative" is 24.3%. Since the EMV of the cost of

vaccinating the herd ($40,000) is lower than the EMV of the cost of not vaccinating ($113,550) Paul should choose to

vaccinate if the test result is "Negative."

The EMV of conducting the immunity test is $32,755, i.e., the sum of the EMVs of "Negative" and "Positive" test results,

weighted by their respective probabilities.

The expected value of sample information, EVSI, is the difference between the EMVs of testing and not testing for

immunity, i.e., $7,245. The most Paul should be willing to pay for the immunity test is $7,245.
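Because this problem is stated in costs rather than profits, the fold-back minimizes expected cost instead of maximizing EMV. A rough Python sketch using the figures above; note that the course's $7,245 reflects posteriors rounded to 19% and 75.7%, while unrounded arithmetic gives about $7,200:

```python
# EVSI for the herd immunity test, in $ thousands; figures from the text.
# Outcomes here are COSTS, so the fold-back minimizes instead of maximizes.
vaccinate_cost = 40.0
outbreak_cost = 150.0     # expected cost if the herd is not immune
p_immune = 0.60

# Marginals and posteriors from the test's reliability.
p_pos = 0.85 * p_immune + 0.30 * (1 - p_immune)     # 0.63
p_neg = 1 - p_pos                                   # 0.37
p_not_immune_pos = 0.30 * (1 - p_immune) / p_pos    # ~0.19
p_not_immune_neg = 0.70 * (1 - p_immune) / p_neg    # ~0.757

def best_cost(p_not_immune):
    # Pick the cheaper of vaccinating and risking an outbreak.
    return min(vaccinate_cost, p_not_immune * outbreak_cost)

cost_no_test = best_cost(1 - p_immune)              # min(40, 60) = 40
cost_test = p_pos * best_cost(p_not_immune_pos) \
          + p_neg * best_cost(p_not_immune_neg)     # ~32.8

# The course rounds the posteriors to 19% / 75.7%, giving $32,755 and an
# EVSI of $7,245; unrounded arithmetic gives about $7,200.
print(round(cost_no_test - cost_test, 2))           # ~7.2
```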

Risk Analysis

Just as an e-learning course must have a bittersweet last section, so too must your internship on Hawaii have a final day. A

parting Pacific frolic is refreshing preparation for your meeting with Leo.

Introducing Risk

As you dry in the morning sun, Alice invites you to consider Leo's position should he wager the Kahana on the Tethys'

success. "Even if the potential profits are substantial, losing the Kahana if the Tethys tanks would be a severe blow. I wonder

how seriously he's considered the potential downside."

Imagine: upon your arrival at business school you buy yourself a nice new car. It's a sleek, powerful Burgundy Ben Hur by

Chariot — with both a Sweetone stereo and a Theft Discouragement System — worth over $25,000.

City driving can be rough: the probability that you'll be involved in an accident that "totals" your car — reduces its value from

$25,000 to zero — in your first year is 0.01%. Collision insurance to cover such a loss in your new city is optional: it will cost

you $1,600 per year above and beyond the premium for required liability insurance. Will you buy the collision insurance?

Most people would pay the premium to mitigate the risks associated with a car accident leading to such large losses. But let's

look at the decision tree for buying collision insurance. For simplicity, we'll ignore accidents that do not total your car.

After a year in school, your Bennie's market value will have a present value of $25,000. If you don't buy collision insurance

and the car is totaled, you'll lose the full value of the car — $25,000. If — as we all hope — you and your car survive the city's

roads unscathed, you won't lose anything.

Opting for collision insurance entails a $1,600 insurance premium. If you are spared the calamity of a wreck, you'll have

incurred just the premium cost at the end of the year. If traffic tragedy befalls you, your insurance company will pay out the

value of the car minus a $1,000 deductible: so at the end of the year you will be out $2,600: the premium plus the $1,000

deductible.

At an expected loss of just $2.50, the EMV of not buying collision insurance is much better than the EMV of buying the insurance. Essentially, if you

buy the collision insurance, you are paying $1,600 to avoid an expected loss of $2.50! We shouldn't be surprised — insurance

companies are not charities, and from their perspective, selling you insurance has a very favorable EMV.

Still, a majority of drivers choose collision insurance, even though it has a significantly lower EMV. Why might this be the

case?

Let's consider a different situation. Suppose you owned a bright and shiny miniature windup toy Bennie worth $25. Would you

take out a $1.60 insurance policy on it, even if the probability were upwards of 50% that it would be stepped on or lost in the

coming year?

Probably not. A $25 loss is not even a minor catastrophe. If you lost or accidentally destroyed your miniature Bennie, you'd

either replace it quickly, or accept its departure without fuss.

If instead of a nice, new, $25,000 Bennie you owned a 20-year old beat-up and rusty Oldsmobile Delta-88, worth about

$2,500, you'd also think twice about paying $160 to insure it against collision. A loss of $2,500 — though stinging — is

unlikely to be a major setback, especially when weighed against a $160 premium.

But, imagine that you — assisted by your MBA degree — amass a multi-million dollar fortune over the next decades. Would

you still insure a $25,000 car against collision? Perhaps not.

Clearly, the value of the potential loss relative to your net worth has an important influence on your willingness to take on the

risk associated with uncertainty. Suppose you are invited to play a gambling game in which you could win $10 million with


50% probability or lose $2 million with 50% probability. Even though the EMV of playing this game is $4 million, you'd

probably decline participation unless you can afford a $2 million loss.

For most of us, the pain of the $2 million loss would weigh more heavily than the joy of winning $10 million, so we would not

accept this gamble. For someone who can afford a $2 million loss, this game might be very profitable, especially if played

multiple times.

There are formal ways to measure such assessments by assigning a personal utility value to each outcome. Then we can show

that maintaining the status quo — our current assets — provides greater utility than playing the game does. Rigorous

methods of quantifying utility are beyond this course's scope. For now, let's look at how we might gain insight into the

personal utility we associate with different monetary outcomes.

Returning to collision insurance, let's summarize the scenarios and their probabilities and outcome values. You have two

options: "Buy collision insurance" and "Don't buy collision insurance.

If you insure against collision, there are two possible outcomes: either you avoid a wreck for one year and lose the insurance

premium of $1,600, or misfortune strikes and your Bennie is totaled. In the latter case, you lose the premium and the

deductible. The probabilities associated with these two outcomes are 99.99% and 0.01%, respectively. The possible outcomes

and their respective probabilities are listed in the table below.

If you don't insure against collision and park your car safely at year's end, you won't sustain any loss at all. If the unfortunate

alternative occurs and your Bennie is totaled, you'll have lost its value of $25,000. Again, the probabilities of these outcomes

are 99.99% and 0.01%, respectively.

We now have two tables that completely summarize the possible scenarios of our decision. The monetary values of these

scenarios are arranged from top to bottom, from the best outcomes to the worst. These tables — one for each option — are

together referred to as the risk profiles for the decision.
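A risk profile is just a table mapping each outcome to its probability, which makes it easy to sketch in code. A minimal Python version of the two profiles above (the structure and names are ours):

```python
# Risk profiles for the collision-insurance decision:
# each option maps outcomes (in $) to their probabilities.
risk_profiles = {
    "Buy insurance":       {-1_600: 0.9999, -2_600: 0.0001},
    "Don't buy insurance": {0: 0.9999, -25_000: 0.0001},
}

for option, profile in risk_profiles.items():
    emv = sum(outcome * p for outcome, p in profile.items())
    worst = min(profile)  # the worst outcome the option exposes us to
    print(f"{option}: EMV = ${emv:,.2f}, worst case = ${worst:,}")
# "Don't buy" has the better EMV (-$2.50 vs. -$1,600.10),
# but a far worse worst case (-$25,000 vs. -$2,600).
```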

From the risk profiles, it's easy to recognize that the option of not buying insurance will deliver the most preferred outcome if

you don't have an accident but will deliver the least desirable outcome if you do. This insight provides the basis for you to

prefer to buy the collision insurance: it exposes you to lower risk, even though it has a lower EMV.

Summary

The EMV criterion is not the only criterion that informs decisions. Besides wanting to maximize our average outcomes in

the long run, we want to minimize our exposure to risk, especially when potential losses are high relative to our own net

worth.

Risk Attitudes

Recall Seth Chaplin and S&C Films' production of the movie Cloven. Seth has a choice between two options. In a deal with

Pony Pictures, S&C produces the film, and Pony acquires the complete rights to Cloven for a flat production fee. In a

second deal with K2 Classics, S&C retains part ownership, and the financial outcomes for S&C depend on Cloven's

performance at the box office.

Seth knows from previous analysis that the K2 deal has a higher EMV: $2.04 million vs. $1.48 million for the Pony deal.

However, he also knows the K2 deal is riskier. Before making a final decision, Seth wants to understand the full

implications of his two options. Let's build risk profiles for each deal.

There are seven possible scenarios that can occur. The outcome values of these scenarios depend on Seth's choice of a

production partner, on whether or not superstar actor Shawn Connelly lends his gravelly baritone to the film's lead

character, and on the audience's reception of the film.

Let's summarize the possible scenarios. If Seth chooses the Pony option, two scenarios could occur. In one, Connelly

participates in the movie and Seth's profits are $2.2 million. In the other, Connelly does not participate in the movie and

Seth's profits are only $1 million. If Seth opts for the Pony deal, the probabilities of these two scenarios occurring are 40%

and 60%, respectively.

If Seth takes the K2 option, there are five possible scenarios, but only three possible outcomes: "Blockbuster" success, a

"Lackluster" performance, and a "Flop."

A "Blockbuster" success can occur whether or not Connelly participates in the production. The probability of a

"Blockbuster" is 38%: this probability is calculated by adding the joint probability that Cloven stars Connelly and is a

"Blockbuster" to the joint probability that Cloven is a "Blockbuster" without Connelly's participation.

A "Lackluster" performance can also occur whether or not Connelly participates in the production. The probability of a

"Lackluster" performance is 50%: this probability is calculated by adding the joint probability that Cloven stars Connelly

and has a "Lackluster" performance to the joint probability that Cloven has a "Lackluster" performance without Connelly's

participation.


A "Flop" occurs only when Connelly doesn't take part in the production of Cloven. The total probability of a "Flop" is the

joint probability of Cloven being a "Flop" and Connelly not taking part: 12%.

Seth now has complete risk profiles for the two options. The risk profiles give Seth a quick overview of the possible

outcomes — and their respective probabilities — for each choice. Note that we've combined scenarios so that we now have

probabilities for each distinct outcome value. How does a manager like Seth use risk profiles to inform a decision?

Risk profiles allow Seth to compare options not just in terms of their EMVs but in terms of the risk they expose him to.

From Seth's risk profile, we can see that, although the K2 option has the higher EMV, it is much riskier: a $2 million loss is

possible, whereas the Pony option doesn't include any possible loss.

Which option Seth should choose ultimately depends on how comfortable he is with the risk associated with the K2 option.

The Pony option is certain to deliver a profit. The K2 option might generate a substantial loss.

If a $2 million loss is more than Seth believes his company can — or should — sustain, he may want to choose the Pony

option despite its lower EMV. If Seth chooses the Pony option, he will be demonstrating that he is risk averse: he is

choosing an option with a lower EMV in return for lower exposure to risk.

Most people are at least slightly risk averse, as shown by their inclinations to buy insurance. Some people tend to be risk

seeking, that is, they are willing to forgo options with higher EMVs and accept higher levels of risk in return for the

possibility of extremely high returns.

Although risk-seeking behavior is generally rare in the population when significant down-side risk is involved, many people

tend to be somewhat risk seeking when the value of the loss risked is low. For example, the EMV for playing the lottery is

lower than the EMV for not playing the lottery. However, since the amount risked by playing the lottery — the price of a

lottery ticket — is so low, many people choose to play it in return for the unlikely outcome that they will win a multi-million

dollar prize.

When we use the EMV as the basis for decision making, we are acting in a risk neutral manner. Most people are risk

neutral when they make routine "day in, day out" types of decisions — those for which potential losses are not terribly

harmful to the decision-maker or her organization.

People feel comfortable using the EMV for routine decisions because they feel comfortable "playing the averages": some

decisions will result in losses and some in gains, but over the long run, the outcomes will average out. We expect that, over

time, choosing options with the highest EMVs will lead to highest total value.

Most people feel comfortable using EMV as the basis for decisions that are not made repeatedly — provided the decisions

have outcome values similar to those of other, more routine decisions that they make on a regular basis.

Why would Pony Pictures agree to purchase the rights to Cloven and take on all the risk associated with marketing and

distributing the film? For Pony, the deal with Seth is a routine decision. Pony is a large enough company to sustain a

potential multi-million dollar loss if Cloven flops. Pony also distributes 10 to 12 movies a year, so overall, the positive EMV

of a film release promises that in the long run, Pony's operations will be profitable.

Seth knows that the K2 deal is very risky: if he chooses the K2 deal, he will tie his company's financial success to box office

performance and to Shawn Connelly's whims. If Cloven "Flops," S&C Films might well go bankrupt.

Summary

Risk profiles allow us to assess the utility different outcomes bring us, as opposed to their monetary value. The concise summaries risk profiles provide help us compare and contrast our different decision options, allowing us to choose the option we prefer based on our attitude toward risk: risk averse, risk seeking, or risk neutral.

"Although Leo's market research event doesn't pay off in terms of its expected value," Alice explains, "it could help him

manage his risk."

Let's assemble the risk profile for Leo's decision, including the decision about whether or not to run his market research

event. We have to keep in mind that the cost of running his event is $240,000.

Which of the following tables correctly summarizes the outcomes and probabilities if Leo chooses to run his market research

event?

If Leo chooses to run his market research event, then the participants' response can be either "Enthusiastic" or "Tepid." If it's

"Tepid," Leo will choose to stay out of the floating restaurant business, so there is a 60% chance that Leo will incur only the

cost of the event, $240,000.


If the response to the event is "Enthusiastic," then there is a 65% chance that Leo will make a profit of $1.76 million — the

original profit estimate of $2 million, minus the event's cost of $240,000. The likelihood of a $1.76 million profit is (40%)

(65%), or 26%.

If the response to the event is "Enthusiastic," then there is a 35% chance that Leo will suffer a loss of $1.04 million — the

original loss estimate of $800,000, minus the event's cost of $240,000. The likelihood of a $1.04 million loss is (40%)(35%),

or 14%.
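The full risk profile for running the event can be checked the same way; a short Python sketch using the outcomes and probabilities above:

```python
# Risk profile for "run the market research event" ($ millions).
# Outcomes are net of the event's $0.24M cost; probabilities from above.
profile = {
    1.76: 0.40 * 0.65,   # Enthusiastic, then Phenomenon: 26%
   -0.24: 0.60,          # Tepid: stay out, lose only the event cost
   -1.04: 0.40 * 0.35,   # Enthusiastic, then Fad: 14%
}
assert abs(sum(profile.values()) - 1.0) < 1e-9
emv = sum(outcome * p for outcome, p in profile.items())
print(f"EMV = {emv:.3f}")  # 0.168 = 0.408 minus the 0.24 event cost
```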

You next complete the risk profile for Leo's profits if he chooses not to run the event. You present your risk analysis to Leo to

show him why he might want to run his event even if its expected monetary value is not optimal.

I see. So if I run my event, I'm most likely to incur the cost of the event, and nothing else. But, if the response is

"Enthusiastic," I should go ahead with my plan. Then, there is a small chance that I will incur a large loss if the restaurant

becomes a mere "Fad," and a good chance that the Tethys will become a "Phenomenon."

On the other hand, if I don't run my event, although my potential profits are slightly higher, the risk is too great. I just have too

much to lose at this point.

At the same time, I think I can afford to pay for a market research event now, without going to the bank. Then if my guests

react favorably, I might consider diving into the Tethys adventure.

I want you two to enjoy your last day here and have dinner with me tonight. Anywhere special you'd like to go?

Hmmm. I know this wonderful place that serves the most delightful mix of French and Hawaiian cuisines...

Aw, shucks.

Danforth Pitt was injured in a bisque-spilling incident in a Hawaiian luxury resort restaurant. Danforth, who insists he

hasn't been able to remove the scent of crab from his knees, is pursuing his legal options.

After initial investigation and consultation, Pitt's lawyers explain to Danforth that he can expect to win $250,000 in

damages with 5% probability or $100,000 with 25% probability, before subtracting the attorney's fees. The fees associated

with pursuing legal action are estimated at $50,000, which Pitt must pay whether he wins or loses the suit.

What is the EMV of pursuing legal action against the resort hotel?

Enter the EMV in $thousands as a decimal number with two digits to the right of the decimal point (e.g., enter "$5,000" as

"5.00"). Round if necessary.

To avoid a legal battle, the hotel's owner agrees to settle out of court, offering Danforth two full weeks in the hotel's

penthouse suite free of charge. Including amenities, the value of this offer is $12,000.

If Danforth chooses to reject the offer to settle out of court, and to sue the hotel instead, he should be characterized as

which of the following?

Since the EMV of pursuing legal action is lower than the EMV of settling out of court, but the possible gains of legal action,

though relatively unlikely, are much higher than the EMV of settling, Danforth would have to be characterized as risk

seeking.

Welcome to the post-assessment test for the HBS Quantitative Methods Tutorial.

Navigation:

To advance from one question to the next, select one of the answer choices or, if applicable, complete with your own choice and

click the “Submit” button. After submitting your answer, you will not be able to change it, so make sure you are satisfied with your

selection before you submit each answer. You may also skip a question by pressing the forward advance arrow. Please note that

you can return to “skipped” questions using the “Jump to unanswered question” selection menu or the navigational arrows at any

time. Although you can skip a question, you must navigate back to it and answer it - all questions must be answered for the exam

to be scored.

In the briefcase, links to Excel spreadsheets containing z-value and t-value tables as well as utilities for finding confidence

intervals and conducting hypothesis tests are provided for your convenience. For some questions, additional links to Excel

spreadsheets containing relevant data will appear immediately below the question text.


Your results will be displayed immediately upon completion of the exam.

After completion, you can review your answers at any time by returning to the exam.

Good luck!

How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in the

course.

Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book

examination.

May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on exams

such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam question.

Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity with

the material, but you may take longer if you need to.

What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices will

be recorded for the questions you were able to complete and you will be able to pick up where you left off when you return to

the exam site.

How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question. The

results screen will indicate which questions you answered correctly.

Welcome to the second post-assessment test for the HBS Quantitative Methods Tutorial.

Navigation:

To advance from one question to the next, select one of the answer choices or, if applicable, complete with your own choice and

click the “Submit” button. After submitting your answer, you will not be able to change it, so make sure you are satisfied with your

selection before you submit each answer. You may also skip a question by pressing the forward advance arrow. Please note that

you can return to “skipped” questions using the “Jump to unanswered question” selection menu or the navigational arrows at any

time. Although you can skip a question, you must navigate back to it and answer it - all questions must be answered for the exam

to be scored.

In the briefcase, links to Excel spreadsheets containing z-value and t-value tables are provided for your convenience. For some

questions, additional links to Excel spreadsheets containing relevant data will appear immediately below the question text.

IMPORTANT: Take this exam only after taking the previous final assessment exam. There are only two final

assessment exams. Thus, if you did not receive a passing score on the first, it is critical that you review the

course material carefully before taking this exam.

After completion, you can review your answers at any time by returning to the exam.

Good luck!

How difficult are the questions on the exam? The exam questions have a level of difficulty similar to the exercises in the

course.

Can I refer to statistics textbooks and online resources to help me during the test? Yes. This is an open-book

examination.

May I receive assistance on the exam? No. Although we strongly encourage collaborative learning at HBS, work on exams

such as the assessment tests must be entirely your own. Thus you may neither give nor receive help on any exam question.

Is this a timed exam? No. You should take about 60-90 minutes to complete the exam, depending on your familiarity with

the material, but you may take longer if you need to.

What happens if I am (or my internet connection is) interrupted while taking the exam? Your answer choices will

be recorded for the questions you were able to complete and you will be able to pick up where you left off when you return to


the exam site.

How do I see my exam results? Your results will be displayed as soon as you submit your answer to the final question. The

results screen will indicate which questions you answered correctly.

