
Business Statistics

Graphs, Charts, and Tables
Describing Your Data
Dr. M. Raghunadh Acharya
4/26/2012
Contents
Construct a frequency distribution both
manually and with a computer
Construct and interpret a histogram
Create and interpret bar charts, pie
charts, and stem-and-leaf diagrams
Present and interpret data in line charts
and scatter diagrams
Frequency Distributions
What is a Frequency Distribution?
A frequency distribution is a list or a table
containing the values of a variable (or a set
of ranges within which the data falls) ...
and the corresponding frequencies with
which each value occurs (or frequencies with
which data falls within each range)
Why Use Frequency Distributions?
A frequency distribution is a way
to summarize data
The distribution condenses the
raw data into a more useful
form...
and allows for a quick visual
interpretation of the data
Frequency Distribution:
Discrete Data
Discrete data: possible values are
countable
Example: An advertiser asks 200 customers how many days per week they read the daily newspaper.

Number of days read    Frequency
0                      44
1                      24
2                      18
3                      16
4                      20
5                      22
6                      26
7                      30
Total                  200
Relative Frequency
Relative Frequency: What proportion is in each
category?
Number of days read    Frequency    Relative Frequency
0                      44           .22
1                      24           .12
2                      18           .09
3                      16           .08
4                      20           .10
5                      22           .11
6                      26           .13
7                      30           .15
Total                  200          1.00

Example: 44/200 = .22, so 22% of the people in the sample report that they read the newspaper 0 days per week
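The relative frequency column can be reproduced with a few lines of Python. This is a minimal sketch using the newspaper counts from the slide; the variable names are illustrative and not part of the original deck.

```python
# Frequencies from the newspaper example: days read per week -> number of customers.
frequency = {0: 44, 1: 24, 2: 18, 3: 16, 4: 20, 5: 22, 6: 26, 7: 30}

total = sum(frequency.values())  # number of customers in the sample

# Relative frequency = class frequency / total number of observations.
relative_frequency = {days: count / total for days, count in frequency.items()}

for days, count in frequency.items():
    print(f"{days}  {count:3d}  {relative_frequency[days]:.2f}")
print(f"Total  {total}  {sum(relative_frequency.values()):.2f}")
```

The relative frequencies always sum to 1, which is a quick check on the table.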
Frequency Distribution: Continuous Data
Continuous Data: may take on any value in some
interval
Example: A manufacturer of insulation randomly selects 20 winter
days and records the daily high temperature

24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27

(Temperature is a continuous variable because it could
be measured to any degree of precision desired)
Grouping Data by Classes
Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43,
44, 46, 53, 58
Find range: 58 - 12 = 46
Select number of classes: 5 (usually between 5 and 20)
Compute class width: 10 (46/5 = 9.2, then round up)
Determine class boundaries: 10, 20, 30, 40, 50, 60
Compute class midpoints: 15, 25, 35, 45, 55
Count observations and assign to classes
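The grouping steps can be sketched in Python as follows. Rounding the class width up to a multiple of 10 and starting the first class at a multiple of the width are assumptions chosen to match the slide's classes; other rounding choices are equally valid.

```python
import math

# Daily high temperatures from the insulation example (20 winter days).
temperatures = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
                32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

num_classes = 5
data_range = max(temperatures) - min(temperatures)  # 58 - 12 = 46

# Assumption: round the minimum width (46 / 5 = 9.2) up to a multiple of 10.
class_width = math.ceil(data_range / num_classes / 10) * 10

# Assumption: start the first class at a multiple of the width at or below the minimum.
first_lower = (min(temperatures) // class_width) * class_width

# Each class is "lower but under upper", i.e. lower <= t < upper.
classes = [(first_lower + k * class_width, first_lower + (k + 1) * class_width)
           for k in range(num_classes)]
counts = [sum(1 for t in temperatures if low <= t < high) for low, high in classes]
midpoints = [(low + high) / 2 for low, high in classes]

for (low, high), count in zip(classes, counts):
    print(f"{low} but under {high}: {count}  ({count / len(temperatures):.2f})")
```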
Frequency Distribution Example
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Frequency Distribution

Class               Frequency    Relative Frequency
10 but under 20     3            .15
20 but under 30     6            .30
30 but under 40     5            .25
40 but under 50     4            .20
50 but under 60     2            .10
Total               20           1.00
Histograms
The classes or intervals are shown on the
horizontal axis
frequency is measured on the vertical axis

Bars of the appropriate heights can be
used to represent the number of
observations within each class

Such a graph is called a histogram
Histogram Example
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: histogram of frequency by class, with class midpoints on the horizontal axis]
No gaps between bars, since the data are continuous
Questions for Grouping Data
into Classes
1. How wide should each interval be?
(How many classes should be used?)

2. How should the endpoints of the
intervals be determined?
Often answered by trial and error, subject
to user judgment
The goal is to create a distribution that is
neither too "jagged" nor too "blocky"
The goal is to appropriately show the pattern of
variation in the data
How Many Class Intervals?
Many (Narrow class
intervals)
may yield a very jagged
distribution with gaps from
empty classes
Can give a poor indication of how
frequency varies across classes

Few (Wide class intervals)
may compress variation too much
and yield a blocky distribution
can obscure important patterns
of variation.
(X axis labels are upper class endpoints)
General Guidelines
Number of Data Points    Number of Classes
under 50                 5 - 7
50 - 100                 6 - 10
100 - 250                7 - 12
over 250                 10 - 20

Class widths can typically be reduced as the number of
observations increases
Distributions with numerous observations are more likely
to be smooth and have gaps filled since data are plentiful
Class Width
The class width is the distance
between the lowest possible value
and the highest possible value for a
frequency class
The minimum class width is
\[ W = \frac{\text{Largest Value} - \text{Smallest Value}}{\text{Number of Classes}} \]
Histograms in Excel
1. Select Tools/Data Analysis
2. Choose Histogram
3. Input data and bin ranges, then select Chart Output
Stem and Leaf Diagram
A simple way to see distribution
details in a data set
METHOD: Separate the sorted
data series into leading digits
(the stem) and the trailing digits
(the leaves)
Example:
Here, use the 10s digit for the
stem unit:
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
                  Stem | Leaf
12 is shown as      1  | 2
35 is shown as      3  | 5
Example:
Completed Stem-and-leaf diagram:
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Stem | Leaves
1    | 2 3 7
2    | 1 4 4 6 7 7
3    | 0 2 5 7 8
4    | 1 3 4 6
5    | 3 8
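A stem-and-leaf diagram with the 10s digit as the stem can be built by splitting each value with integer division. This is a minimal sketch using the ordered array from the slide; the names are illustrative.

```python
from collections import defaultdict

# Ordered array from the temperature example.
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)  # 10s digit is the stem, units digit is the leaf
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```

Each row keeps every observation visible, so the number of leaves in a row equals the class frequency for that stem.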
Using other stem units
Using the 100s digit as the stem:
Round off the 10s digit to form the
leaves
                      Stem | Leaf
613 would become        6  | 1
776 would become        7  | 8
. . .
1224 becomes           12  | 2
Graphing Categorical Data
Categorical Data:
Pie Charts
Bar Charts
Pareto Diagram
Bar and Pie Charts
Bar charts and Pie charts are
often used for qualitative
(category) data

Height of bar or size of pie slice
shows the frequency or
percentage for each category
Pie Chart Example
Current Investment Portfolio

Investment Type    Amount (in thousands $)    Percentage
Stocks             46.5                       42.27
Bonds              32.0                       29.09
CD                 15.5                       14.09
Savings            16.0                       14.55
Total              110                        100

(Variables are Qualitative)

[Figure: pie chart of the current investment portfolio: Stocks 42%, Bonds 29%, Savings 15%, CD 14%]
Percentages are rounded to the nearest percent
Bar Chart Example
Pareto Diagram Example
[Figure: Pareto diagram. Bar graph: % invested in each category. Line graph: cumulative % invested.]
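The two series plotted in a Pareto diagram, the percentage in each category and the running cumulative percentage, can be computed as below. This sketch assumes the diagram uses the investment portfolio amounts from the pie chart example, sorted from largest to smallest.

```python
# Investment amounts in thousands of dollars, from the portfolio example.
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}
total = sum(amounts.values())

# A Pareto diagram orders the categories from largest to smallest.
ordered = sorted(amounts.items(), key=lambda item: item[1], reverse=True)

cumulative = 0.0
pareto_rows = []  # (category, % in category, cumulative %)
for category, amount in ordered:
    percent = 100 * amount / total
    cumulative += percent
    pareto_rows.append((category, round(percent, 2), round(cumulative, 2)))

for row in pareto_rows:
    print(row)
```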
Bar Chart Example
Number of days read    Frequency
0                      44
1                      24
2                      18
3                      16
4                      20
5                      22
6                      26
7                      30
Total                  200
Tabulating and Graphing
Multivariate Categorical Data
Investment in thousands of dollars

Investment Category    Investor A    Investor B    Investor C    Total
Stocks                 46.5          55            27.5          129
Bonds                  32.0          44            19.0          95
CD                     15.5          20            13.5          49
Savings                16.0          28            7.0           51
Total                  110.0         147           67.0          324
Tabulating and Graphing
Multivariate Categorical Data (continued)
Side-by-side charts
Side-by-Side Chart Example
Sales by quarter for three sales territories:

1st Qtr 2nd Qtr 3rd Qtr 4th Qtr
East 20.4 27.4 59 20.4
West 30.6 38.6 34.6 31.6
North 45.9 46.9 45 43.9
Line Charts and Scatter Diagrams
Line charts show values of one variable vs. time
Time is traditionally shown on the horizontal axis
Scatter diagrams show points for bivariate data
One variable is measured on the vertical axis and the other variable is measured on the horizontal axis
Line Chart Example
Year Inflation Rate
1985 3.56
1986 1.86
1987 3.65
1988 4.14
1989 4.82
1990 5.40
1991 4.21
1992 3.01
1993 2.99
1994 2.56
1995 2.83
1996 2.95
1997 2.29
1998 1.56
1999 2.21
2000 3.36
2001 2.85
2002 1.58
Scatter Diagram Example
Volume per day, Cost per day
23 125
26 140
29 146
33 160
38 167
42 170
50 188
55 195
60 200
Types of Relationships
Linear Relationships
[Figure: two scatter plots of Y against X]
Curvilinear Relationships
[Figure: two scatter plots of Y against X]
No Relationship
[Figure: two scatter plots of Y against X]
Chapter Summary
Data in raw form are usually not easy
to use for decision making. Some
type of organization is needed:
a table or a graph
Techniques reviewed in this chapter:
Frequency Distributions and
Histograms
Bar Charts and Pie Charts
Stem and Leaf Diagrams
Line Charts and Scatter Diagrams

Summarization measures
Summarization measures represent the data with a single number or a few numbers. They help
describe a data set and compare data sets. Population measures can be estimated from the
summary measures of a sample.

The measures used to represent the data are:

1. Measures of center and location: mean, median, mode, geometric mean, midrange
2. Other measures of location: weighted mean, percentiles, quartiles
3. Measures of variation: range, interquartile range, variance and standard deviation,
coefficient of variation
Summary Measures
Describing Data Numerically
Center and Location: Mean, Median, Mode, Weighted Mean
Other Measures of Location: Percentiles, Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
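Overview: Measures of Center and Location
Center and Location: Mean, Median, Mode, Weighted Mean
Mean (sample and population):
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]
Weighted mean (sample and population):
\[ \bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} \qquad \mu_W = \frac{\sum w_i x_i}{\sum w_i} \]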
Mean (Arithmetic Average)
The mean is the arithmetic average of the data values
Sample mean (n = sample size):
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
Population mean (N = population size):
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
Mean (Arithmetic Average)
The most common measure of central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers)
[Figure: two dot plots on a 0 to 10 axis, one with Mean = 3 and one with Mean = 4]
\[ \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3 \]
\[ \frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4 \]

Median
Not affected by extreme values
In an ordered array, the median is the middle number
If n or N is odd, the median is the middle number
If n or N is even, the median is the average of the two middle numbers
[Figure: two dot plots on a 0 to 10 axis, each with Median = 3]
Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode
There may be several modes
[Figure: dot plot on a 0 to 14 axis with Mode = 5; dot plot on a 0 to 6 axis with no mode]
Weighted Mean
Used when values are grouped by frequency or relative importance

Example: Sample of 26 Repair Projects

Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2

Weighted Mean Days to Complete:
\[ \bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31 \text{ days} \]
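The weighted mean for the repair projects can be checked with a short Python sketch. The frequencies act as the weights; the variable names are illustrative.

```python
# Repair project example: days to complete (x) and frequency (w).
days_to_complete = [5, 6, 7, 8]
frequency = [4, 12, 8, 2]

weighted_sum = sum(w * x for w, x in zip(frequency, days_to_complete))  # sum of w_i * x_i
total_weight = sum(frequency)                                            # sum of w_i

weighted_mean = weighted_sum / total_weight
print(weighted_sum, total_weight, round(weighted_mean, 2))
```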
Review Example
Five houses on a hill by the beach
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Summary Statistics
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Sum 3,000,000

Mean: ($3,000,000/5) = $600,000
Median: middle value of ranked data = $300,000
Mode: most frequent value = $100,000
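Python's standard `statistics` module computes all three measures directly. A minimal sketch with the five house prices:

```python
import statistics

# House prices from the review example, in dollars.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)      # sum of values / number of values
median_price = statistics.median(prices)  # middle value of the ranked data
mode_price = statistics.mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

The single $2,000,000 house pulls the mean well above the median, which is why the two measures differ so much here.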
Which measure of location is the best?
The mean is generally used, unless extreme values
(outliers) exist
Then the median is often used, since the median is
not sensitive to extreme values.
Example: Median home prices may be reported
for a region, since the median is less sensitive to outliers
Shape of a Distribution
Describes how data is distributed
Symmetric or skewed
Left-Skewed: Mean < Median < Mode (longer tail extends to left)
Symmetric: Mean = Median = Mode
Right-Skewed: Mode < Median < Mean (longer tail extends to right)
Other Location Measures
Other Measures of Location: Percentiles, Quartiles

Percentiles
The p-th percentile in a data array:
p% are less than or equal to this value
(100 - p)% are greater than or equal to this value
(where 0 ≤ p ≤ 100)

Quartiles
1st quartile = 25th percentile
2nd quartile = 50th percentile = median
3rd quartile = 75th percentile
Percentiles
The p-th percentile in an ordered array of n values is the value in the i-th position, where
\[ i = \frac{p}{100}(n + 1) \]
Example: The 60th percentile in an ordered array of 19 values is the value in the 12th position:
\[ i = \frac{p}{100}(n + 1) = \frac{60}{100}(19 + 1) = 12 \]
Quartiles
Quartiles split the ranked data into 4 equal groups (25% of the data in each), separated by Q1, Q2, and Q3
Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22 (n = 9)
Q1 = 25th percentile, so find the
\[ \frac{25}{100}(9 + 1) = 2.5 \text{ position} \]
so use the value halfway between the 2nd and 3rd values,
so Q1 = 12.5
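The position rule i = (p/100)(n + 1) can be turned into a small helper. This is a sketch, not a standard library routine: the function name `percentile` is hypothetical, and using linear interpolation for every fractional position is an assumption that generalizes the "halfway between" step of the quartile example.

```python
def percentile(ordered, p):
    """Return the p-th percentile of an ordered list using i = (p/100)(n + 1)."""
    n = len(ordered)
    position = p * (n + 1) / 100  # 1-based position i
    whole = int(position)         # position of the value just below
    fraction = position - whole   # how far to move toward the next value
    if whole < 1:                 # position falls before the first value
        return ordered[0]
    if whole >= n:                # position falls at or after the last value
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1 = percentile(sample, 25)  # position 2.5
q2 = percentile(sample, 50)  # position 5
q3 = percentile(sample, 75)  # position 7.5
print(q1, q2, q3)
```

Other textbooks and software packages use slightly different position rules, so quartiles computed elsewhere may differ a little from these.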
Box and Whisker Plot
A graphical display of data using the 5-number summary:

Minimum -- Q1 -- Median -- Q3 -- Maximum
Example:
[Figure: box-and-whisker plot marking Minimum, 1st Quartile, Median, 3rd Quartile, and Maximum, with 25% of the data in each of the four segments]
Shape of Box and Whisker Plots
The box and central line are centered between the
endpoints if the data are symmetric around the median
A box and whisker plot can be shown in either vertical or
horizontal format
Distribution Shape and Box and Whisker Plot
[Figure: three box-and-whisker plots, each marked Q1, Q2, Q3, for left-skewed, symmetric, and right-skewed distributions]
Box-and-Whisker Plot Example
Below is a box-and-whisker plot for the following data:

0 2 2 2 3 3 4 5 5 10 27

Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27
[Figure: box-and-whisker plot with marks at 0, 2, 3, 5, 27]
This data is very right skewed, as the plot depicts
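The five-number summary behind the plot can be computed with the standard `statistics` module. This sketch relies on `statistics.quantiles` with `method="exclusive"`, which uses the same (n + 1) position rule as the percentile formula; it requires Python 3.8 or later.

```python
import statistics

data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]

# n=4 asks for the three cut points that split the data into quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (min(data), q1, q2, q3, max(data))
print(five_number_summary)
```

The distance from Q3 to the maximum is far larger than from the minimum to Q1, which is the right skew the plot shows.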
Measures of Variation
Variation:
Range
Interquartile Range
Variance (Population Variance, Sample Variance)
Standard Deviation (Population Standard Deviation, Sample Standard Deviation)
Coefficient of Variation
Variation
Measures of variation give information on the spread or
variability of the data values.
[Figure: distributions with the same center but different variation]

Range
Difference between the largest and the smallest observations.
\[ \text{Range} = x_{\text{maximum}} - x_{\text{minimum}} \]
Example:
[Figure: dot plot on a 0 to 14 axis]
Range = 14 - 1 = 13
Disadvantages of the Range
Ignores the way in which data are distributed
[Figure: two dot plots on a 7 to 12 axis with different distributions; each has Range = 12 - 7 = 5]
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Interquartile Range
Can eliminate some outlier problems by using the interquartile range
Eliminate some high- and low-valued observations and calculate the
range from the remaining values.
Interquartile range = 3rd quartile - 1st quartile
Interquartile Range
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
(25% of the data lie in each of the four segments)
Interquartile range = 57 - 30 = 27
Variance
Average of squared deviations of values from the mean
Sample variance:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Population variance:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
Has the same units as the original data
Sample standard deviation:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
Population standard deviation:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]
Calculation Example: Sample Standard Deviation
Sample Data (x_i): 10 12 14 15 17 18 18 24
n = 8, Mean = \bar{x} = 16
\[ s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}} \]
\[ = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095 \]
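The sample standard deviation can be computed step by step from the definition. A minimal sketch using the eight sample values from the example; the variable names are illustrative.

```python
import math

data = [10, 12, 14, 15, 17, 18, 18, 24]
n = len(data)
mean = sum(data) / n  # 128 / 8 = 16

# Squared deviation of each value from the sample mean.
squared_deviations = [(x - mean) ** 2 for x in data]
sum_of_squares = sum(squared_deviations)  # 36 + 16 + 4 + 1 + 1 + 4 + 4 + 64 = 130

sample_variance = sum_of_squares / (n - 1)  # divide by n - 1 for a sample
sample_std = math.sqrt(sample_variance)

print(sum_of_squares, round(sample_variance, 4), round(sample_std, 4))
```

Dividing by n - 1 rather than n is what distinguishes the sample formula from the population formula.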
Comparing Standard Deviations
[Figure: three dot plots on an 11 to 21 axis]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Is used to compare two or more sets of data measured in different
units
Population:
\[ CV = \left( \frac{\sigma}{\mu} \right) \cdot 100\% \]
Sample:
\[ CV = \left( \frac{s}{\bar{x}} \right) \cdot 100\% \]
Comparing Coefficient of Variation
Stock A:
Average price last year = $50
Standard deviation = $5
\[ CV_A = \left( \frac{s}{\bar{x}} \right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\% \]
Stock B:
Average price last year = $100
Standard deviation = $5
\[ CV_B = \left( \frac{s}{\bar{x}} \right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\% \]
Both stocks have the same standard deviation, but stock B is less variable relative to its price
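The stock comparison reduces to one line of arithmetic per stock. In this sketch the helper name `coefficient_of_variation` is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)   # Stock A: s = $5, mean = $50
cv_b = coefficient_of_variation(5, 100)  # Stock B: s = $5, mean = $100

print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```

Because the CV is unit-free, the same comparison works for data sets measured in different units.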
The Empirical Rule
If the data distribution is bell-shaped, then the interval:
\( \mu \pm 1\sigma \) contains about 68% of the values in the population or the sample
\( \mu \pm 2\sigma \) contains about 95% of the values in the population or the sample
\( \mu \pm 3\sigma \) contains about 99.7% of the values in the population or the sample
Tchebysheff's Theorem
Regardless of how the data are distributed, at least \( (1 - 1/k^2) \) of
the values will fall within k standard deviations of the mean
Examples:
At least                  within
(1 - 1/1^2) = 0%          k = 1 (\mu \pm 1\sigma)
(1 - 1/2^2) = 75%         k = 2 (\mu \pm 2\sigma)
(1 - 1/3^2) = 89%         k = 3 (\mu \pm 3\sigma)
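The bound is easy to tabulate for any k. A minimal sketch; the function name `chebyshev_lower_bound` is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k < 1 the formula goes negative, so the bound is simply 0.
    return max(0.0, 1 - 1 / k ** 2)


for k in (1, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.0%}")
```

Unlike the empirical rule, this bound makes no assumption about the shape of the distribution, which is why it is much looser.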
Standardized Data Values
A standardized data value refers to the number of standard
deviations a value is from the mean

Standardized data values are sometimes referred to as z-scores
Standardized Population Values
\[ z = \frac{x - \mu}{\sigma} \]
where:
x = original data value
\mu = population mean
\sigma = population standard deviation
z = standard score
(number of standard deviations x is from \mu)
Standardized Sample Values
\[ z = \frac{x - \bar{x}}{s} \]
where:
x = original data value
\bar{x} = sample mean
s = sample standard deviation
z = standard score
(number of standard deviations x is from \bar{x})
Remark: The standardized sample values are used for
constructing the confidence limits for the
population parameters.
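Standardized sample values can be computed for a whole data set at once. This sketch reuses the eight values from the sample standard deviation example and the standard `statistics` module; the function name `z_score` is hypothetical.

```python
import statistics


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


data = [10, 12, 14, 15, 17, 18, 18, 24]
sample_mean = statistics.mean(data)  # x-bar
sample_std = statistics.stdev(data)  # s, computed with n - 1 in the denominator

z_scores = [z_score(x, sample_mean, sample_std) for x in data]
for x, z in zip(data, z_scores):
    print(f"x = {x:2d}  z = {z:+.3f}")
```

Values below the mean get negative z-scores and values above it get positive ones, and the z-scores of a sample always sum to zero.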
