Association between
Categorical Variables
Chapter 5
5.1 Contingency Tables
Which hosts send more buyers to
Amazon.com?
To answer this question we must gather data on
two categorical variables: Host and Purchase.
Host identifies the originating site: MSN,
RecipeSource, or Yahoo. Purchase indicates
whether or not the visit results in a sale.
Copyright 2011 Pearson Education, Inc.
5.1 Contingency Tables
Consider Two Categorical Variables
Simultaneously
A table that shows counts of cases on one
categorical variable contingent on the value of
another (for every combination of both variables)
Cells in a contingency table are mutually
exclusive
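As a rough illustration of the definition above, the following sketch tallies a contingency table from a list of cases. The visit records are invented for illustration only (they are not the Amazon.com data), and the function name `contingency_table` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical visit records (host, purchase); invented for illustration only.
visits = [
    ("MSN", "No"), ("MSN", "Yes"), ("MSN", "No"),
    ("RecipeSource", "No"), ("RecipeSource", "No"),
    ("Yahoo", "Yes"), ("Yahoo", "No"), ("Yahoo", "No"),
]

def contingency_table(cases):
    """Count cases for every combination of the two categorical variables."""
    counts = Counter(cases)
    rows = sorted({purchase for _, purchase in cases})
    cols = sorted({host for host, _ in cases})
    # Each case lands in exactly one cell, so the cells are mutually exclusive.
    return {r: {c: counts[(c, r)] for c in cols} for r in rows}

table = contingency_table(visits)
for purchase, row in table.items():
    print(purchase, row)
```

Because every case is counted in exactly one cell, the cell counts add up to the number of cases.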
5.1 Contingency Tables
Contingency Table for Web Shopping
5.1 Contingency Tables
Marginal and Conditional Distributions
Marginal distributions appear in the margins of a
contingency table and represent the totals
(frequencies) for each categorical variable
separately
Conditional distributions refer to counts within a
row or column of a contingency table (restricted
to cases satisfying a condition)
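A minimal sketch of both ideas, using the 2x2 counts (30, 70, 50, 50) that appear in the chi-squared calculation of Section 5.3. The row and column labels are placeholders, and the arrangement of the counts into rows is an assumption inferred from the expected counts of 40 and 60.

```python
# Counts from the Section 5.3 example; labels and layout are assumed placeholders.
table = {
    "Group A": {"Attitude 1": 30, "Attitude 2": 70},
    "Group B": {"Attitude 1": 50, "Attitude 2": 50},
}

def marginals(table):
    """Row and column totals: the marginal distributions of each variable."""
    row_totals = {r: sum(cells.values()) for r, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for c, count in cells.items():
            col_totals[c] = col_totals.get(c, 0) + count
    return row_totals, col_totals

def conditional_by_row(table):
    """Percentages within each row: one conditional distribution per row."""
    return {
        r: {c: 100 * count / sum(cells.values()) for c, count in cells.items()}
        for r, cells in table.items()
    }

row_totals, col_totals = marginals(table)
conditional = conditional_by_row(table)
print(row_totals, col_totals)
print(conditional)
```

If the conditional percentages differ noticeably from one row to the next, as they do here, the two variables are associated.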
5.1 Contingency Tables
Conditional Distribution of Purchase for each
Host (Column Counts and Percentages)
5.1 Contingency Tables
Conditional Distribution
Reveals the percentage of purchases
among visitors from RecipeSource to be
much less than for MSN and Yahoo
Host and Purchase are associated
5.1 Contingency Tables
Segmented Bar Charts
Used to display conditional distributions
Divides the bars in a bar chart into
segments that are proportional to the
percentage in each category of a second
variable
5.1 Contingency Tables
Contingency Table of Purchase by Region
5.1 Contingency Tables
Segmented Bar Chart Shows Association
5.1 Contingency Tables
Mosaic Plots
Alternative to segmented bar chart
A plot in which the size of each tile is
proportional to the count in a cell of a
contingency table
5.1 Contingency Tables
Contingency Table of Shirt Size by Style
5.1 Contingency Tables
Mosaic Plot Shows Association
4M Example 5.1: CAR THEFT
Motivation
Should insurance companies vary the
premiums for different car models (are
some cars more likely to be stolen than
others)?
4M Example 5.1: CAR THEFT
Method
Data obtained from the National Highway Traffic
Safety Administration (NHTSA) on car theft for
seven popular models (two categorical variables:
type of car and whether the car was stolen).
4M Example 5.1: CAR THEFT
Mechanics
4M Example 5.1: CAR THEFT
Mechanics
4M Example 5.1: CAR THEFT
Message
The Dodge Intrepid is more likely to be stolen than
other popular models. The data suggest that
higher premiums for theft insurance should be
charged for models that are more likely to be
stolen.
5.2 Lurking Variables
and Simpson's Paradox
Association Not Necessarily Causation
Lurking Variable: a concealed variable that
affects the apparent relationship between two
other variables
Simpson's Paradox: a change in the association
between two variables when data are separated
into groups defined by a third variable
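The reversal can be seen in a small numerical sketch. The counts below are hypothetical, invented only to show the effect; they are not the Bureau of Transportation Statistics data, and the airline and destination names are placeholders.

```python
# Hypothetical (on_time, total) flight counts, invented to illustrate the reversal.
flights = {
    "Airline A": {"Destination X": (90, 100), "Destination Y": (600, 1000)},
    "Airline B": {"Destination X": (850, 1000), "Destination Y": (50, 100)},
}

def rate(on_time, total):
    """On-time proportion within one group."""
    return on_time / total

def overall_rate(by_destination):
    """On-time proportion after pooling all destinations together."""
    on_time = sum(o for o, _ in by_destination.values())
    total = sum(t for _, t in by_destination.values())
    return on_time / total

for airline, by_dest in flights.items():
    per_dest = {d: round(rate(*counts), 3) for d, counts in by_dest.items()}
    print(airline, per_dest, "overall:", round(overall_rate(by_dest), 3))
```

In this made-up table Airline A has the higher on-time rate at each destination, yet Airline B has the higher rate overall, because each airline flies mostly to a different destination. Destination plays the role of the lurking variable.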
4M Example 5.2: AIRLINE ARRIVALS
Motivation
Does it matter which of two airlines a
corporate CEO chooses when flying to
meetings if he wants to avoid delays?
4M Example 5.2: AIRLINE ARRIVALS
Method
Data obtained from US Bureau of
Transportation Statistics on flight delays for
two airlines (two categorical variables:
airline and whether the flight arrived on
time).
4M Example 5.2: AIRLINE ARRIVALS
Mechanics
4M Example 5.2: AIRLINE ARRIVALS
Mechanics
Is destination a lurking variable?
4M Example 5.2: AIRLINE ARRIVALS
Mechanics
This is Simpson's Paradox
4M Example 5.2: AIRLINE ARRIVALS
Message
The CEO should book on US Airways as it is
more likely to arrive on time regardless of
destination.
5.3 Strength of Association
Chi-Squared Statistic
A measure of association in a contingency
table
Calculated based on a comparison of the
observed contingency table to an artificial
table with the same marginal totals but no
association
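The comparison described above can be sketched as follows. The observed counts are the ones used in this section's worked calculation; their arrangement into two rows of (30, 70) and (50, 50) is an assumption inferred from the expected counts of 40 and 60, and the function names are illustrative.

```python
def expected_counts(observed):
    """Artificial table with the same marginal totals but no association."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Under no association, each cell is (row total * column total) / n.
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_squared(observed):
    """Sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

observed = [[30, 70], [50, 50]]  # assumed layout of the section's example counts
expected = expected_counts(observed)
chi2 = chi_squared(observed)
print(expected)
print(round(chi2, 2))
```

For these counts the expected table is 40 and 60 in each row, and the statistic rounds to 8.33, in line with the worked calculation in this section.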
5.3 Strength of Association
Contingency Table
5.3 Strength of Association
Calculating the Chi-Squared Statistic
5.3 Strength of Association
Calculating the Chi-Squared Statistic
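\[
\begin{aligned}
\chi^2 &= \frac{(30-40)^2}{40} + \frac{(70-60)^2}{60} + \frac{(50-40)^2}{40} + \frac{(50-60)^2}{60} \\
       &= \frac{10^2}{40} + \frac{10^2}{60} + \frac{10^2}{40} + \frac{10^2}{60} \\
       &= 2.5 + 1.67 + 2.5 + 1.67 \\
       &= 8.33
\end{aligned}
\]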
5.3 Strength of Association
Cramér's V
Derived from the Chi-Squared Statistic
Ranges in value from 0 (variables are not
associated) to 1 (variables are perfectly
associated)
5.3 Strength of Association
Calculating Cramér's V
V = 0.20 for our example
There is a weak association between group
(students or staff) and attitude toward sharing
copyrighted music
\[
V = \sqrt{\frac{\chi^2}{n \, \min(r-1,\; c-1)}}
\]
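As a quick check of the value quoted for the example, here is a small sketch of the calculation. It assumes the usual definition of Cramér's V (chi-squared divided by n times the smaller of r - 1 and c - 1, then square-rooted) and takes n = 200 as the sum of the four observed counts 30, 70, 50 and 50. The function name `cramers_v` is illustrative.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V from a chi-squared statistic, sample size, and table shape."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Chi-squared from the section's 2x2 example, before rounding to 8.33.
chi2 = 100 / 40 + 100 / 60 + 100 / 40 + 100 / 60
n = 30 + 70 + 50 + 50  # assumed total: the sum of the four observed counts

v = cramers_v(chi2, n, n_rows=2, n_cols=2)
print(round(v, 2))
```

For a 2x2 table the minimum term equals 1, so V reduces to the square root of chi-squared divided by n, which gives roughly 0.20 here.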
5.3 Strength of Association
Checklist: Chi-Squared and Cramér's V
Verify that variables are categorical
Verify that there are no obvious lurking
variables
4M Example 5.3: REAL ESTATE
Motivation
Do people who heat their homes with gas
prefer to cook with gas as well? What
heating systems and appliances should a
developer select for newly built homes?
4M Example 5.3: REAL ESTATE
Method
The developer contacts homeowners to
obtain the data. Two categorical variables:
type of fuel used for home heating (gas or
electric) and type of fuel used for cooking
(gas or electric).
4M Example 5.3: REAL ESTATE
Mechanics
Chi-Squared = 98.62; Cramér's V = 0.47
4M Example 5.3: REAL ESTATE
Message
Homeowners prefer gas to electric heat by
about 2 to 1. The developer should build
about two-thirds of new homes with gas
heat. Put electric appliances in all homes
with electric heat and in half of the homes
with gas heat (assuming that buyers for
new homes have the same preferences).
Best Practices
Use contingency tables to find and summarize
association between two categorical variables.
Be on the lookout for lurking variables.
Use plots to show association.
Exploit the absence of association.
Pitfalls
Don't interpret association as causation.
Don't display too many numbers in a table.