
Displaying and Describing Categorical Data
Summarizing categorical data
Two-way tables
Relationships between categorical variables
Marginal distributions
Conditional distributions
Simpson's paradox


Summary of Categories
Count
Each category has a number of occurrences (frequency tables)
Percentages are useful (relative frequency tables)

Sex      Count   Percentage
Male       157        46.4%
Female     181        53.6%
Total      338         100%
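
A minimal sketch of how a relative frequency table can be computed from the counts on this slide. The variable names are illustrative choices, not part of the original material.

```python
# Counts by sex, taken from the frequency table on this slide.
counts = {"Male": 157, "Female": 181}

total = sum(counts.values())

# Relative frequency = category count / total count, expressed as a percentage.
percentages = {sex: 100 * n / total for sex, n in counts.items()}

for sex, n in counts.items():
    print(f"{sex:<8}{n:>5}{percentages[sex]:>8.1f}%")
print(f"{'Total':<8}{total:>5}{sum(percentages.values()):>8.1f}%")
```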
Two Categories
What is the relationship between two categories?
Cross-classification table is a good summary
(contingency tables or two-way tables)

               Freshman  Sophomore  Junior  Senior(+)  Row Totals
Male                 43         67      30         17         157
Female               48         62      25         46         181
Column Totals        91        129      55         63         338
Column Percentages
          Freshman    Sophomore    Junior      Senior(+)   Total
Male      43 (47%)    67 (52%)     30 (55%)    17 (27%)    157 (46%)
Female    48 (53%)    62 (48%)     25 (45%)    46 (73%)    181 (54%)
Total     91 (100%)   129 (100%)   55 (100%)   63 (100%)   338
Row Percentages
          Freshman    Sophomore    Junior      Senior(+)   Total
Male      43 (27%)    67 (43%)     30 (19%)    17 (11%)    157 (100%)
Female    48 (27%)    62 (34%)     25 (14%)    46 (25%)    181 (100%)
Total     91 (27%)    129 (38%)    55 (16%)    63 (19%)    338
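
The row and column percentages can be reproduced from the counts with a few lines of Python. This is only a sketch using plain dictionaries and lists; the names `row_pct` and `col_pct` are illustrative.

```python
# Two-way table of counts: sex by year in school (from the slides).
classes = ["Freshman", "Sophomore", "Junior", "Senior(+)"]
counts = {
    "Male": [43, 67, 30, 17],
    "Female": [48, 62, 25, 46],
}

row_totals = {sex: sum(row) for sex, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]

# Row percentages: each cell divided by its row total.
row_pct = {
    sex: [round(100 * n / row_totals[sex]) for n in row]
    for sex, row in counts.items()
}

# Column percentages: each cell divided by its column total.
col_pct = {
    sex: [round(100 * n / t) for n, t in zip(row, col_totals)]
    for sex, row in counts.items()
}

for sex in counts:
    print(sex, "row %:", row_pct[sex], "column %:", col_pct[sex])
```

Row percentages answer "what share of males are freshmen?", while column percentages answer "what share of freshmen are male?". Which one is useful depends on which variable is treated as explanatory.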
Which way is better?
There is a basic asymmetry to many problems.
Explanatory variable
Predictor, cause, available variable
Response variable
Predicted, effect, interesting variable
Visualize Categorical Data
Give a clear picture of what the data contain
Emphasize differences (or similarities)
Bar graphs and pie charts are usually best
Many varieties, actual form of the graph depends on the use
Height of bar or size of pie slice shows the frequency or percentage for each category (area principle)
What is the graph communicating?
Bar Graph
[Figure: bar graph of counts by sex (Male 157, Female 181); vertical axis from 0 to 200]
Year in School
[Figure: bar graph of the percentage of students in each class (Freshman, Sophomore, Junior, Senior); vertical axis from 0% to 45%]
For 2 variables use multiple columns
[Figure: grouped bar graph of counts (axis from 0 to 80) with Male and Female on the horizontal axis and one bar per class: Freshman, Sophomore, Junior, Senior]
Or the other way around
[Figure: grouped bar graph of counts (axis from 0 to 80) with Freshman, Sophomore, Junior, Senior(+) on the horizontal axis and one bar per sex: Male, Female]
My Pet Peeves
Graphs should
Be clear
Allow comparisons
Tell a story
Graphs should not
Have uninformative aspects
Obscure

Arrrgh!
[Figures: two bar graphs with Male and Female on the horizontal axis; vertical axis from 0 to 70]
Wouldn't it just be simpler?
[Figure: grouped bar graph of counts (axis from 0 to 80) with Male and Female on the horizontal axis and one bar per class: Freshman, Sophomore, Junior, Senior]
USA Today Snapshots
Pie Charts
[Figure: pie chart with slices for Senior, Junior, Freshman, and Sophomore]
Or
[Figure: single stacked bar split into Freshman, Sophomore, Junior, and Senior; axis from 0% to 100%]
Allows clear comparisons
[Figure: stacked bars for Male and Female, each split by class; axis from 0% to 100%]
What is wrong with this picture?
Constructing Bar and Pie Charts
1. Define categories for variables of interest

2. Determine the appropriate measure for each category
For pie charts, the value assigned is the proportion of the total for all categories

3. Develop the chart
For pie charts, the size of each slice is proportional to its value, and the slices must sum to 100%


More examples
Investor's Portfolio
[Figure: horizontal bar chart of amount in $1000's (axis from 0 to 50) by investment type; surviving labels: Stocks, CD]
Bar charts can also be displayed with vertical bars
Newspaper readership per week
[Figure: vertical bar chart; horizontal axis: number of days newspaper is read per week (0 to 7); vertical axis: frequency (0 to 50)]

Number of days read   Frequency
0                            44
1                            24
2                            18
3                            16
4                            20
5                            22
6                            26
7                            30
Total                       200
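
As a rough illustration of the idea that bar length encodes frequency, the readership table can be drawn as a text bar chart. The `bar` helper and its `scale` parameter are hypothetical conveniences for this sketch, not part of the original slides.

```python
# Frequency table from the slide: days per week the newspaper is read.
frequency = {0: 44, 1: 24, 2: 18, 3: 16, 4: 20, 5: 22, 6: 26, 7: 30}
total = sum(frequency.values())


def bar(count, scale=2):
    """Return one '#' per `scale` respondents (scale is an arbitrary choice)."""
    return "#" * (count // scale)


for days, count in frequency.items():
    share = 100 * count / total
    print(f"{days} | {bar(count):<22} {count:>3} ({share:.0f}%)")
```

Because every bar uses the same scale, the bars can be compared directly, which is the point of the area principle mentioned earlier.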
Pie Chart Example
Current Investment Portfolio
[Figure: pie chart with slices Stocks 42%, Bonds 29%, Savings 15%, CD 14%]
Percentages are rounded to the nearest percent

Investment Type   Amount (in thousands $)   Percentage
Stocks                               46.5        42.27
Bonds                                32.0        29.09
CD                                   15.5        14.09
Savings                              16.0        14.55
Total                               110         100

Investment type is a qualitative variable.
Percentages must equal 100%.
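
The pie chart values follow from the amounts by dividing each one by the total. The sketch below reproduces both the two-decimal percentages in the table and the rounded slice labels; the variable names are illustrative.

```python
# Portfolio amounts in thousands of dollars, from the table on this slide.
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}

total = sum(amounts.values())

# Each slice is the category's proportion of the total, as a percentage.
percent = {name: 100 * value / total for name, value in amounts.items()}

# Labels on the pie chart are rounded to the nearest percent.
labels = {name: round(p) for name, p in percent.items()}

for name in amounts:
    print(f"{name:<8}{amounts[name]:>6.1f}{percent[name]:>8.2f}{labels[name]:>5}%")
print(f"{'Total':<8}{total:>6.1f}{sum(percent.values()):>8.2f}{sum(labels.values()):>5}%")
```

Here the rounded labels happen to add up to exactly 100. In general, rounding each slice separately can produce a sum of 99 or 101, so it is worth checking.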
Marginal distributions
We can look at each categorical variable separately in a two-way table by studying the row totals and the column totals. They represent the marginal distributions, expressed in counts or percentages. (They are written as if in a margin.)
2000 U.S. census
The marginal distributions can then be displayed on separate bar graphs, typically expressed as percents instead of raw counts. Each graph represents only one of the two variables, completely ignoring the second one.
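
The census table itself is not reproduced in these slides, so the sketch below uses the sex-by-class counts from the earlier two-way table instead. It shows how the two marginal distributions come from the row totals and the column totals alone; all names are illustrative.

```python
# Two-way table of counts from earlier in the slides (not the census data).
classes = ["Freshman", "Sophomore", "Junior", "Senior(+)"]
counts = {
    "Male": [43, 67, 30, 17],
    "Female": [48, 62, 25, 46],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of sex: row totals, ignoring class entirely.
sex_marginal = {sex: sum(row) for sex, row in counts.items()}

# Marginal distribution of class: column totals, ignoring sex entirely.
class_marginal = dict(zip(classes, (sum(col) for col in zip(*counts.values()))))

sex_pct = {k: round(100 * v / grand_total) for k, v in sex_marginal.items()}
class_pct = {k: round(100 * v / grand_total) for k, v in class_marginal.items()}

print("Sex:  ", sex_pct)
print("Class:", class_pct)
```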

Conditional distribution
Music and wine purchase decision

What is the relationship between type of music played in supermarkets and type of wine purchased?

We want to compare the conditional distributions of the response variable (wine purchased) for each value of the explanatory variable (music played). Therefore, we calculate column percents.

Calculations: When no music was played, there were 84 bottles of wine sold. Of these, 30 were French wine. 30/84 = 0.357, so 35.7% of the wine sold was French when no music was played.

column percent = cell total / column total = 30 / 84 = 35.7%

We calculate the column conditional percents similarly for each of the nine cells in the table.
For every two-way table, there are two sets of possible conditional distributions:
Wine purchased for each kind of music played (column percents)
Music played for each kind of wine purchased (row percents)
Does background music in supermarkets influence customer purchasing decisions?
Simpson's paradox
An association or comparison that holds for all of several groups can reverse direction when the data are combined (aggregated) to form a single group. This reversal is called Simpson's paradox.
Example: Hospital death rates

            Hospital A   Hospital B
Died                63           16
Survived          2037          784
Total             2100          800
% surv.          97.0%        98.0%

On the surface, Hospital B would seem to have a better record.

Here, patient condition was the lurking variable.

Patients in good condition
            Hospital A   Hospital B
Died                 6            8
Survived           594          592
Total              600          600
% surv.          99.0%        98.7%

Patients in poor condition
            Hospital A   Hospital B
Died                57            8
Survived          1443          192
Total             1500          200
% surv.          96.2%        96.0%

But once patient condition is taken into account, we see that hospital A has in fact a better record for both patient conditions (good and poor).
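
The reversal can be checked numerically. The sketch below recomputes the survival rates from the hospital counts, first within each patient condition and then after combining the groups. The data layout and the helper `survival_rate` are illustrative choices.

```python
# (died, survived) counts per hospital, split by patient condition (from the slides).
data = {
    "good": {"Hospital A": (6, 594), "Hospital B": (8, 592)},
    "poor": {"Hospital A": (57, 1443), "Hospital B": (8, 192)},
}


def survival_rate(died, survived):
    """Percentage of patients who survived."""
    return 100 * survived / (died + survived)


# Survival rate within each patient condition.
by_condition = {
    cond: {h: survival_rate(*c) for h, c in hospitals.items()}
    for cond, hospitals in data.items()
}

# Survival rate after aggregating over patient condition.
combined = {}
for hospital in ("Hospital A", "Hospital B"):
    died = sum(data[cond][hospital][0] for cond in data)
    survived = sum(data[cond][hospital][1] for cond in data)
    combined[hospital] = survival_rate(died, survived)

for cond, rates in by_condition.items():
    print(cond, {h: round(r, 1) for h, r in rates.items()})
print("combined", {h: round(r, 1) for h, r in combined.items()})
```

Hospital A has the higher survival rate within each condition, yet the lower rate overall, because it treats far more patients in poor condition (1500 of 2100) than Hospital B (200 of 800).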
