Albright DADM 5e - PPT - CH 02

© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
BUSINESS ANALYTICS:
DATA ANALYSIS AND
DECISION MAKING
Chapter 2
Describing the Distribution of a Single Variable

Introduction
(slide 1 of 2)
 The goal is to present data in a form that makes
sense to people. Tools that are used to do this
include:
 Graphs: bar charts, pie charts, histograms, scatterplots,
time series graphs
 Numerical summary measures: counts, percentages,
averages, measures of variability
 Tables of summary measures: totals, averages, counts,
grouped by categories
 It is a challenge to summarize data so that the
important information stands out clearly.
Introduction
(slide 2 of 2)
 There are four steps in data analysis:
1. Recognize a problem that needs to be solved.
2. Gather data to help understand and then solve the
problem.
3. Analyze the data.
4. Act on this analysis.
 It is up to you to ask good questions—and then take
advantage of the most appropriate tools to answer
them.
Populations and Samples
 A population includes all of the entities of interest
in a study (people, households, machines, etc.)
 Examples:
 All potential voters in a presidential election
 All subscribers to cable television
 All invoices submitted for Medicare reimbursement by
nursing homes
 A sample is a subset of the population, often
randomly chosen and preferably representative of
the population as a whole.
 Examples: Gallup, Harris, other polls today
Data Sets, Variables, and
Observations
 A data set is usually a rectangular array of data,
with variables in columns and observations in rows.
 A variable (or field or attribute) is a characteristic
of members of a population, such as height, gender,
or salary.
 An observation (or case or record) is a list of all
variable values for a single member of a
population.
Example 2.1:
Questionnaire Data.xlsx
 Objective: To illustrate variables and observations in a typical data
set.
 Solution: Data set includes observations on 30 people who
responded to a questionnaire on the president’s environmental
policies.
 Variables include: age, gender, state, children, salary, opinion.
 Include a row that lists variable names.
 Include a column that shows an index of the observation.
Types of Data
(slide 1 of 5)
 A variable is numerical if meaningful arithmetic can
be performed on it.
 Otherwise, the variable is categorical.
 There is also a third data type, a date variable.
 Excel® stores dates as numbers, but dates are treated
differently from typical numbers.
 A categorical variable is ordinal if there is a
natural ordering of its possible values.
 If there is no natural ordering, it is nominal.
Types of Data
(slide 2 of 5)
 Categorical variables can be coded numerically or
left uncoded.
 A dummy variable is a 0–1 coded variable for a
specific category.
 It is coded as 1 for all observations in that category
and 0 for all observations not in that category.
 Categorizing a numerical variable by putting the
data into discrete categories (called bins) is called
binning or discretizing.
A variable that has been categorized in this way is
called a binned or discretized variable.
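The two recodings above can be sketched in Python. This is only an illustration: the gender and age values, the category labels, and the bin cutoffs (35 and 60) are assumptions made up for the example, not values taken from the questionnaire file.

```python
# Hypothetical data: these values are invented for illustration.
genders = ["Male", "Female", "Female", "Male", "Female"]
ages = [23, 35, 47, 61, 52]

# Dummy variable: 1 if the observation is in the "Female" category, else 0.
female_dummy = [1 if g == "Female" else 0 for g in genders]


def bin_age(age):
    """Assign an age to one of three bins (the cutoffs are arbitrary assumptions)."""
    if age < 35:
        return "Young"
    if age < 60:
        return "Middle-aged"
    return "Elderly"


# Binned (discretized) version of the numerical Age variable.
age_binned = [bin_age(a) for a in ages]

print(female_dummy)  # [0, 1, 1, 0, 1]
print(age_binned)    # ['Young', 'Middle-aged', 'Middle-aged', 'Elderly', 'Middle-aged']
```

Note that the binned variable is ordinal (its categories have a natural order), while the dummy variable records only membership in one category.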
Environmental Data
Using a Different Coding (slide 3 of 5)
Types of Data
(slide 4 of 5)
 A numerical variable is discrete if it results from a
count, such as the number of children.
 A continuous variable is the result of an essentially
continuous measurement, such as weight or height.
 Cross-sectional data are data on a cross section of
a population at a distinct point in time.
 Time series data are data collected over time.
Typical Time Series Data Set
(slide 5 of 5)
Descriptive Measures for
Categorical Variables
 There are only a few possibilities for describing a
categorical variable, all based on counting:
 Count the number of categories.
 Give the categories names.
 Count the number of observations in each category
(referred to as the count of categories).
 Once you have the counts, you can display them graphically,
usually in a column chart or a pie chart.
Example 2.2:
Supermarket Transactions.xlsx (slide 1 of 3)
 Objective: To summarize categorical variables in a large
data set.
 Solution: Data set contains transactions made by
supermarket customers over a two-year period.
 Children, Units Sold, and Revenue are numerical.
 Purchase Date is a date variable.
 Transaction and Customer ID are used only to identify.
 All of the other variables are categorical.
Example 2.2:
 To get the counts in column S, use Excel’s COUNTIF function.
 To get the percentages in column T, divide each count by the
total number of observations.
 When creating charts, be careful to use appropriate scales.
Example 2.2:
 Another efficient way to find counts for a categorical variable is to
use dummy (0–1) variables.
 Recode each variable so that one category is replaced by 1 and all
others by 0.
 This can be done using a simple IF formula.
 Find the count of that category by summing the 0s and 1s.
 Find the percentage of that category by averaging the 0s and 1s.
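Both counting approaches from Example 2.2 can be sketched outside Excel. The Python below is a rough analogue, not the workbook's method: the variable name and the category codes "S" and "M" are hypothetical, and the eight values are invented.

```python
from collections import Counter

# Hypothetical values of one categorical variable (not taken from the actual file).
marital_status = ["S", "M", "M", "S", "M", "S", "M", "M"]
n = len(marital_status)

# Approach 1: count each category directly, analogous to one COUNTIF per category,
# then divide each count by the total number of observations.
counts = Counter(marital_status)
percentages = {category: count / n for category, count in counts.items()}

# Approach 2: recode "M" as 1 and every other category as 0.
married_dummy = [1 if value == "M" else 0 for value in marital_status]
married_count = sum(married_dummy)   # summing the 0s and 1s gives the count
married_share = married_count / n    # averaging the 0s and 1s gives the proportion

print(counts["M"], percentages["M"])  # 5 0.625
print(married_count, married_share)   # 5 0.625
```

The two approaches agree because the average of a 0–1 variable is exactly the fraction of 1s.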
Descriptive Measures for
Numerical Variables
 There are many ways to summarize numerical
variables, both with numerical summary measures
and with charts.
 To learn how the values of a variable are
distributed, ask:
 What are the most “typical” values?
 How spread out are the values?
 What are the “extreme” values on either end?
 Is the chart of the values symmetric about some middle
value, or is it skewed in some direction? Does it have
any other peculiar features besides possible skewness?
Example 2.3:
Baseball Salaries 2011.xlsx (slide 1 of 2)
 Objective: To learn how salaries are distributed across all 2011
MLB players.
 Solution: Data set contains data on 843 Major League Baseball
players in the 2011 season.
 Variables are player’s name, team, position, and salary.
 Create summary measures of baseball salaries using Excel functions.
Example 2.3:
Measures of Central Tendency
(slide 1 of 3)
 The mean is the average of all values.
 If the data set represents a sample from some larger
population, this measure is called the sample mean
and is denoted by X̄.
 If the data set represents the entire population, it is
called the population mean and is denoted by μ.
 In Excel, the mean can be calculated with the
AVERAGE function.
Measures of Central Tendency
(slide 2 of 3)
 The median is the middle observation when the
data are sorted from smallest to largest.
 If the number of observations is odd, the median is
literally the middle observation.
 If the number of observations is even, the median is
usually defined as the average of the two middle
observations.
 In Excel, the median can be calculated with the
MEDIAN function.
Measures of Central Tendency
(slide 3 of 3)
 The mode is the value that appears most often.
 In most cases where a variable is essentially continuous,
the mode is not very interesting because it is often the
result of a few lucky ties.
 However, it is not always a result of luck and may
reveal interesting information.
 In Excel, the mode can be calculated with the MODE
function.
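As a rough Python counterpart to AVERAGE, MEDIAN, and MODE, the standard-library statistics module offers functions of the same names. The salary figures here are hypothetical and chosen so that one large value pulls the mean well above the median.

```python
import statistics

# Hypothetical salaries in $1000s; the last value is deliberately extreme.
salaries = [400, 450, 450, 500, 700, 3000]

mean_salary = statistics.mean(salaries)      # sum of values divided by their number
median_salary = statistics.median(salaries)  # even count: average of 450 and 500
mode_salary = statistics.mode(salaries)      # most frequent value

print(round(mean_salary, 1))  # 916.7
print(median_salary)          # 475.0
print(mode_salary)            # 450
```

With an even number of observations the median is the average of the two middle values, which matches the definition given on the median slide.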
Minimum, Maximum,
Percentiles, and Quartiles
 For any percentage p, the pth percentile is the value
such that a percentage p of all values are less than it.
 The quartiles divide the data into four groups, each
with (approximately) a quarter of all observations.
 The first, second and third quartiles are the percentiles
corresponding to p = 25%, p = 50%,
and p = 75%.
 By definition, the second quartile (p = 50%) is equal to the
median.
 The minimum and maximum values can be calculated
with Excel’s MIN and MAX functions, and the percentiles
and quartiles with Excel’s PERCENTILE and QUARTILE
functions.
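A small Python sketch of the same quantities follows. The data are hypothetical. The quartiles use statistics.quantiles with method="inclusive", which treats the smallest value as the 0th percentile and the largest as the 100th; tools differ in how they interpolate between observations, so results need not agree with Excel's for every data set.

```python
import statistics

# Hypothetical data set with nine observations.
values = [2, 4, 4, 5, 7, 9, 10, 12, 15]

minimum = min(values)
maximum = max(values)

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(minimum, maximum)  # 2 15
print(q1, q2, q3)        # 4.0 7.0 10.0
```

The second quartile equals the median of the same data, as the definition requires.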
Measures of Variability
(slide 1 of 3)
 The range is the maximum value minus the minimum
value.
 The interquartile range (IQR) is the third quartile
minus the first quartile.
 Thus, it is the range of the middle 50% of the data.
 It is less sensitive to extreme values than the range.
 The variance is essentially the average of the
squared deviations from the mean.
 If $X_i$ is a typical observation, its squared deviation from
the mean is $(X_i - \text{mean})^2$.
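To illustrate why the interquartile range is less sensitive to extreme values than the range, here is a short Python sketch. Both data sets are hypothetical and differ only in their largest value; the quartile convention (method="inclusive") is one of several in use and is an assumption of this example.

```python
import statistics


def value_range(values):
    """Maximum value minus minimum value."""
    return max(values) - min(values)


def iqr(values):
    """Third quartile minus first quartile, using inclusive interpolation."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Two hypothetical data sets that differ only in the largest observation.
typical = [2, 4, 4, 5, 7, 9, 10, 12, 15]
with_extreme = [2, 4, 4, 5, 7, 9, 10, 12, 60]

print(value_range(typical), iqr(typical))            # 13 6.0
print(value_range(with_extreme), iqr(with_extreme))  # 58 6.0
```

Replacing 15 with 60 more than quadruples the range but leaves the IQR unchanged, because the middle 50% of the data is the same.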
Measures of Variability
(slide 2 of 3)
 The sample variance is denoted by $s^2$, and the
population variance by $\sigma^2$.
 If all observations are close to the mean, their squared
deviations from the mean, and hence the variance, will be
relatively small.
 If at least a few of the observations are far from the mean,
their squared deviations from the mean—and the variance—
will be large.
 In Excel, use the VAR function to obtain the sample variance
and the VARP function to obtain the population variance.
Measures of Variability
(slide 3 of 3)
 A fundamental problem with variance is that it is in
squared units (e.g., dollars become squared dollars).
 A more natural measure is the standard deviation,
which is the square root of variance.
 The sample standard deviation, denoted by s, is the
square root of the sample variance.
 The population standard deviation, denoted by σ, is
the square root of the population variance.
 In Excel, use the STDEV function to find the sample
standard deviation or the STDEVP function to find the
population standard deviation.
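The calculation can be worked through step by step in Python. The eight data values are hypothetical. The sample versions divide the sum of squared deviations by n − 1 and the population versions divide by n, which is the distinction between VAR and VARP (and between STDEV and STDEVP).

```python
import math
import statistics

# Hypothetical data with mean exactly 5.
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)
mean = sum(x) / n

# Squared deviation of each observation from the mean: 9, 1, 1, 1, 0, 0, 4, 16.
squared_deviations = [(xi - mean) ** 2 for xi in x]
total = sum(squared_deviations)  # 32.0

sample_variance = total / (n - 1)   # divide by n - 1 for a sample
population_variance = total / n     # divide by n for an entire population
sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(population_variance)

print(round(sample_variance, 3), round(sample_sd, 3))  # 4.571 2.138
print(population_variance, population_sd)              # 4.0 2.0

# The standard library gives the same results.
assert math.isclose(sample_variance, statistics.variance(x))
assert math.isclose(population_variance, statistics.pvariance(x))
```

The standard deviations are back in the original units of the data, which is why they are usually easier to interpret than the variances.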
Calculating Variance and
Standard Deviation
Empirical Rules for Interpreting
Standard Deviation (slide 1 of 3)
 The interpretation of the standard deviation can be
stated as three empirical rules.
 If the values of a variable are approximately normally
distributed (symmetric and bell-shaped), then the
following rules hold:
 Approximately 68% of the observations are within one
standard deviation of the mean.
 Approximately 95% of the observations are within two
standard deviations of the mean.
 Approximately 99.7% of the observations are within three
standard deviations of the mean.
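One way to check the empirical rules against a data set is to compute the actual share of observations within k standard deviations of the mean. The sketch below does this in Python for a small, hypothetical, strongly right-skewed data set; the helper name share_within is an assumption of this example.

```python
import statistics


def share_within(values, k):
    """Fraction of values within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = [v for v in values if abs(v - mean) <= k * sd]
    return len(inside) / len(values)


# Hypothetical right-skewed data: nine small values and one very large one.
data = [1, 1, 2, 2, 2, 3, 3, 4, 5, 30]

for k in (1, 2, 3):
    print(k, share_within(data, k))
# 1 0.9
# 2 0.9
# 3 1.0
```

The shares 90%, 90%, and 100% are far from 68%, 95%, and 99.7%, which is expected because these data are not close to normally distributed.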
Empirical Rules for Baseball Salaries
(slide 2 of 3)
 The empirical rules should be applied with caution,
especially when the data are clearly skewed, as
illustrated by the calculations for baseball salaries
below.
Empirical Rules for Interpreting
Standard Deviation (slide 3 of 3)
 The mean absolute deviation (MAD) is the
average of the absolute deviations.
 In Excel, use the AVEDEV function to calculate MAD.
 There is another empirical rule for MAD: For many
variables, the standard deviation is approximately
25% larger than MAD.
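A minimal Python sketch of the MAD calculation, using hypothetical data, is given here. It also prints the ratio of the sample standard deviation to MAD so the 25% rule can be compared with an actual data set; the rule is only approximate, and small or unusual data sets can deviate from it noticeably.

```python
import statistics

# Hypothetical data with mean exactly 5.
x = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(x)

# Mean absolute deviation: average of |x_i - mean| = (3+1+1+1+0+0+2+4) / 8.
mad = sum(abs(xi - mean) for xi in x) / len(x)

ratio = statistics.stdev(x) / mad

print(mad)              # 1.5
print(round(ratio, 3))  # 1.425
```

For these eight values the standard deviation is about 43% larger than MAD rather than 25%, a reminder that the rule describes typical behavior, not a guarantee.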
Measures of Shape
(slide 1 of 2)
 Skewness occurs when there is a lack of symmetry.
 A variable can be skewed to the right (or positively
skewed) because of some really large values (e.g.,
really large baseball salaries).
 Or it can be skewed to the left (or negatively skewed)
because of some really small values (e.g., temperature
lows in Antarctica).
 In Excel, a measure of skewness can be calculated
with the SKEW function.
Measures of Shape
(slide 2 of 2)
 Kurtosis has to do with the “fatness” of the tails of
the distribution relative to the tails of a normal
distribution.
 A distribution with high kurtosis has many more
extreme observations.
 In Excel, kurtosis can be calculated with the KURT
function.
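For readers who want to see what such measures compute, here is a Python sketch of the bias-adjusted sample skewness and excess kurtosis formulas. To the best of my knowledge these are the formulas Microsoft documents for SKEW and KURT, but treat that correspondence as an assumption. The function names and the data are hypothetical.

```python
import math


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def skewness(values):
    """Adjusted sample skewness; requires at least 3 observations."""
    n = len(values)
    mean = sum(values) / n
    s = sample_sd(values)
    cubed = sum(((v - mean) / s) ** 3 for v in values)
    return n / ((n - 1) * (n - 2)) * cubed


def kurtosis(values):
    """Adjusted sample excess kurtosis; requires at least 4 observations."""
    n = len(values)
    mean = sum(values) / n
    s = sample_sd(values)
    fourth = sum(((v - mean) / s) ** 4 for v in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 5, 30]

print(round(skewness(symmetric), 6), round(kurtosis(symmetric), 6))
print(round(skewness(right_skewed), 3), round(kurtosis(right_skewed), 3))
```

The symmetric data give a skewness of essentially zero and a negative kurtosis (thinner tails than a normal distribution), while the data with one very large value give positive skewness and positive kurtosis.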
Numerical Summary Measures in the
Status Bar and with StatTools
 If you select multiple cells, summary measures
appear for the selected cells in the status bar at the
bottom of the Excel window.
 You can choose the summary measures that appear by
right-clicking the status bar and selecting your favorites.
 Although Excel’s built-in functions can be used to
calculate a number of summary measures, a much
quicker way is to use the StatTools add-in.
Example 2.3 (Continued):
Baseball Salaries 2011.xlsx
 Objective: To learn the
fundamentals of StatTools and
use it to generate summary
measures of baseball salaries.
 Solution: First, define a
StatTools data set, by selecting
any cell in the data set and
clicking the Data Set Manager
button.
 Then generate summary
measures for the Salary
variable, by selecting One-
Variable Summary from the
Summary Statistics dropdown
list and filling in the dialog box
that appears.
Charts for Numerical Variables
 There are many graphical ways to indicate the
distribution of a numerical variable.
 For cross-sectional variables:
 Histograms
 Box plots
 For time series variables:
 Time series graphs
Histograms
 A histogram is the most common type of chart for
showing the distribution of a numerical variable.
 It is based on binning the variable—that is, dividing it
up into discrete categories.
 It is a column chart of the counts in the various
categories (with no gaps between the vertical bars).
 A histogram is great for showing the shape of a
distribution—whether the distribution is symmetric or
skewed in one direction.
 Objective: To see the shape of the salary distribution
through a histogram.
 Solution: It is possible to create a histogram with Excel
tools only—but it is a tedious process.
 The resulting table of counts is usually called a frequency
table.
 The counts are called frequencies.
 It is much easier to create a histogram with StatTools.
 First, designate a StatTools data set.
 Next, select Histogram from the Summary Graphs dropdown
list.
 In the dialog box, select the Salary variable and click OK.
Example 2.4:
Late or Lost Baggage.xlsx (slide 1 of 2)
 Objective: To fine-tune a
histogram for a variable with
integer counts.
 Solution: Data set lists the number
of bags that were either late or
lost for 456 flights.
 In the Histogram dialog box,
request 9 bins and set the
minimum and maximum to -0.5
and 8.5.
 StatTools divides the range into 9
equal-length bins.
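The reason for choosing −0.5 and 8.5 with 9 bins is that each bin then has length 1 and is centered on one integer from 0 to 8. The Python sketch below reproduces that binning logic; it is not StatTools' implementation, the function name bin_counts is hypothetical, and the bag counts are invented.

```python
def bin_counts(values, minimum, maximum, n_bins):
    """Count how many values fall into each of n_bins equal-length bins."""
    width = (maximum - minimum) / n_bins
    counts = [0] * n_bins
    for v in values:
        if minimum <= v <= maximum:
            # A value exactly equal to the maximum goes into the last bin.
            index = min(int((v - minimum) / width), n_bins - 1)
            counts[index] += 1
    return counts


# Hypothetical numbers of late or lost bags on nine flights.
bags = [0, 1, 1, 2, 3, 3, 3, 5, 8]

# Bins are [-0.5, 0.5), [0.5, 1.5), ..., [7.5, 8.5]: one bin per integer 0..8.
frequencies = bin_counts(bags, -0.5, 8.5, 9)
print(frequencies)  # [1, 2, 1, 3, 0, 1, 0, 0, 1]
```

Because each bin contains exactly one integer, the frequency in position k is simply the number of flights with k late or lost bags.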
Example 2.4:
Late or Lost Baggage.xlsx (slide 2 of 2)
Box Plots
 A box plot (or box-whisker plot) is an alternative
type of chart for showing the distribution of a
variable.
 The elements of a generic box plot are shown below:
Baseball Salaries 2011.xlsx
 Objective: To illustrate the features of a box plot,
particularly how it indicates skewness.
 Solution: In StatTools, select Box-Whisker Plot from
the Summary Graphs dropdown list and fill in the
dialog box.
Time Series Data
 Our main interest in time series variables is how
they change over time, and this information is lost in
traditional summary measures and in histograms or
box plots.
 For time series data, a time series graph is used.
This is a graph of the values of one or more time
series, using time on the horizontal axis.
 This is always the place to start a time series analysis.
Example 2.5:
Crime in US.xlsx (slide 1 of 3)
 Objective: To see how time series graphs help to detect trends in crime
data.
 Solution: Data set contains annual data on violent and property crimes for
the years 1960 to 2010.
 In StatTools, designate a StatTools data set.
 Then select Time Series Graph from the Time Series and Forecasting
dropdown list and fill in the resulting dialog box.
Example 2.5:
Total Violent and Property Crimes
Population Totals
Example 2.5:
Violent and Property Crime Rates
Example 2.6:
DJIA Monthly Close.xlsx (slide 1 of 2)
 Objective: To find useful ways to summarize the monthly
Dow data.
 Solution: Data set contains monthly values of the Dow
from 1950 through 2011.
 Create summary measures and time series graphs for
monthly values and percentage changes of the Dow.
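Percentage changes are computed from each pair of consecutive monthly values. A minimal Python sketch, using four hypothetical closing values rather than actual Dow data:

```python
# Hypothetical monthly closing values (not actual Dow data).
closes = [100.0, 110.0, 99.0, 108.9]

# Percentage change from one month to the next: (current - previous) / previous.
pct_changes = [
    (current - previous) / previous
    for previous, current in zip(closes, closes[1:])
]

print([round(p, 4) for p in pct_changes])  # [0.1, -0.1, 0.1]
```

A series of n monthly values yields n − 1 percentage changes, since the first month has no previous value to compare with.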
Example 2.6:
DJIA Monthly Close.xlsx (slide 2 of 2)
Outliers
 An outlier is a value or an entire observation (row)
that lies well outside of the norm.
 Some statisticians define an outlier as any value more
than three standard deviations from the mean, but this
is only a rule of thumb.
 Even if values are not unusual by themselves, there
still might be unusual combinations of values.
 When dealing with outliers, it is best to run the
analyses two ways: with the outliers and without
them.
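The three-standard-deviation rule of thumb and the advice to run the analysis both ways can be sketched in Python as follows. The data are hypothetical: nineteen values near 10 and a single value of 100.

```python
import statistics

# Hypothetical data: nineteen ordinary values and one extreme value.
data = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
        10, 11, 9, 10, 12, 8, 10, 11, 9, 100]

mean = statistics.mean(data)
sd = statistics.stdev(data)

# Rule of thumb: flag values more than three standard deviations from the mean.
outliers = [v for v in data if abs(v - mean) > 3 * sd]
without_outliers = [v for v in data if abs(v - mean) <= 3 * sd]

print(outliers)                           # [100]
print(mean)                               # 14.5 with the outlier
print(statistics.mean(without_outliers))  # 10 without it
```

Note that in a very small data set no value can be more than three standard deviations from the mean, because the extreme value itself inflates the standard deviation; this is one reason the rule is only a rule of thumb.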
Missing Values
 Most real data sets have gaps in the data.
 There are two issues: how to detect these missing
values and what to do about them.
 The more important issue is what to do about them:
 One option is to simply ignore them. Then you will have to
be aware of how the software deals with missing values.
 Another option is to fill in missing values with the average of
nonmissing values, but this isn’t usually a very good option.
 A third option is to examine the nonmissing values in the row
of a missing value; these values might provide clues on what
the missing value should be.
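The first two options can be sketched in Python, under the assumption that missing values are represented by None. The salary values are hypothetical. The sketch also shows one reason filling in with the average is usually not a good option: it makes the data look less variable than the observed values are.

```python
import statistics

# Hypothetical salary column; None marks a missing value.
salaries = [52000, None, 61000, 58000, None, 49000]

# Detect: positions of the missing values.
missing_positions = [i for i, v in enumerate(salaries) if v is None]

# Option 1: ignore missing values and summarize only the observed ones.
observed = [v for v in salaries if v is not None]
mean_observed = statistics.mean(observed)

# Option 2: fill in each missing value with the average of the observed values.
filled = [mean_observed if v is None else v for v in salaries]

print(missing_positions)                    # [1, 4]
print(mean_observed)                        # 55000
print(round(statistics.stdev(observed)))    # 5477
print(round(statistics.stdev(filled)))      # 4243
```

The mean is unchanged by the fill-in, but the standard deviation drops from about 5477 to about 4243 because the filled values sit exactly at the mean.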
Excel Tables for Filtering,
Sorting, and Summarizing
 Tables are a tool introduced in Excel 2007.
 You now have the ability to designate a rectangular
data set as a table and then employ a number of
powerful tools for analyzing tables.
 These tools include:
 Filtering
 Sorting
 Summarizing
Example 2.7:
Catalog Marketing.xlsx (slide 1 of 2)
 Objective: To illustrate Excel tables for analyzing the HyTex data.
 Solution: Data set contains data on 1000 customers of HyTex, a
fictional direct marketing company.
 Designate the data set as a table by selecting any cell in the data
set and clicking the Table button on the Insert ribbon.
 Use the dropdown arrows next to the variable names to filter in
many different ways.
Example 2.7:
Filtering
 Finding records that match particular criteria is called
filtering.
 One way to filter is to create an Excel table, which
automatically provides dropdown arrows next to the
field names that allow you to filter.
 There are also three ways to filter on any rectangular
data set with variable names:
1. Use the Filter button from the Sort & Filter dropdown list
on the Home ribbon.
2. Use the Filter button from the Sort & Filter group on the
Data ribbon.
3. Right-click any cell in the data set and select Filter. You
get several options, the most popular of which is Filter by
Selected Cell’s Value.
 Objective: To investigate the types of filters that can be
applied to the HyTex data.
 Solution: There is almost no limit to the filters you can
apply, but here are a few possibilities:
 Filter on one or more values in a field.
 Filter on more than one field.
 Filter on a continuous numerical field.
 Top 10 and Above/Below Average filters.
 Filter on a text field.
 Filter on a date field.
 Filter on color or icon.
 Use a custom filter.
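A few of these filters have simple analogues outside Excel. The Python sketch below treats each record as a dictionary and filters with list comprehensions. The field names (Person, Region, AmountSpent) and all values are hypothetical and are not taken from the HyTex file.

```python
# Hypothetical customer records (field names and values are invented).
customers = [
    {"Person": 1, "Region": "East", "AmountSpent": 218},
    {"Person": 2, "Region": "West", "AmountSpent": 2632},
    {"Person": 3, "Region": "East", "AmountSpent": 1048},
    {"Person": 4, "Region": "South", "AmountSpent": 435},
    {"Person": 5, "Region": "East", "AmountSpent": 760},
]

# Filter on one value in a field.
east = [c for c in customers if c["Region"] == "East"]

# Filter on more than one field, including a numerical condition.
east_big = [c for c in east if c["AmountSpent"] > 500]

# Above Average filter on a numerical field.
average = sum(c["AmountSpent"] for c in customers) / len(customers)
above_average = [c for c in customers if c["AmountSpent"] > average]

print([c["Person"] for c in east])           # [1, 3, 5]
print([c["Person"] for c in east_big])       # [3, 5]
print([c["Person"] for c in above_average])  # [2, 3]
```

Each filter returns the matching records without changing the original list, much as an Excel filter hides rows rather than deleting them.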
Results from a Typical Filter
