Copyright 2011 Pearson Education, Inc.

Chapter 5
Association between Categorical Variables
5.1 Contingency Tables
Which hosts send more buyers to
Amazon.com?

To answer this question we must gather data on
two categorical variables: Host and Purchase

Host identifies the originating site: MSN,
RecipeSource, or Yahoo; Purchase indicates
whether or not the visit results in a sale

5.1 Contingency Tables
Consider Two Categorical Variables
Simultaneously

Contingency table: a table that shows counts of cases on one
categorical variable contingent on the value of
another (for every combination of both variables)
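
As a minimal sketch of how such a table can be tallied from raw cases, the snippet below counts (host, purchase) pairs. The visits listed are invented for illustration and are not the Amazon data; the names `visits`, `cell_counts`, and `table` are hypothetical.

```python
from collections import Counter

# Hypothetical visits as (host, purchase) pairs, invented for illustration only.
visits = [
    ("MSN", "No"), ("MSN", "Yes"), ("MSN", "No"),
    ("RecipeSource", "No"), ("RecipeSource", "No"),
    ("Yahoo", "No"), ("Yahoo", "Yes"), ("Yahoo", "No"),
]

# Each case lands in exactly one (host, purchase) cell.
cell_counts = Counter(visits)

hosts = sorted({host for host, _ in visits})
outcomes = sorted({purchase for _, purchase in visits})

# Rows are Purchase outcomes, columns are Hosts.
table = {p: {h: cell_counts[(h, p)] for h in hosts} for p in outcomes}

for p in outcomes:
    print(p, [table[p][h] for h in hosts])
```

Because every visit contributes to exactly one cell, the cell counts add up to the number of cases.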

Cells in a contingency table are mutually
exclusive: each case is counted in exactly one cell
5.1 Contingency Tables
Contingency Table for Web Shopping

5.1 Contingency Tables
Marginal and Conditional Distributions

Marginal distributions appear in the margins of a
contingency table and represent the totals
(frequencies) for each categorical variable
separately

Conditional distributions refer to counts within a
row or column of a contingency table (restricted
to cases satisfying a condition)
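
The two definitions above can be sketched in a few lines. The counts here are hypothetical (the original table is not reproduced in this text), and the variable names are assumptions chosen for illustration.

```python
# Hypothetical counts (rows: Purchase, columns: Host), invented for illustration.
table = {
    "No":  {"MSN": 950, "RecipeSource": 990, "Yahoo": 960},
    "Yes": {"MSN": 50,  "RecipeSource": 10,  "Yahoo": 40},
}
hosts = ["MSN", "RecipeSource", "Yahoo"]

# Marginal distributions: totals for each variable taken separately.
purchase_totals = {p: sum(row.values()) for p, row in table.items()}
host_totals = {h: sum(table[p][h] for p in table) for h in hosts}

# Conditional distribution of Purchase given Host: column percentages.
conditional = {
    h: {p: 100 * table[p][h] / host_totals[h] for p in table}
    for h in hosts
}

for h in hosts:
    print(h, conditional[h])
```

Comparing the conditional percentages across hosts, rather than the raw counts, is what makes differences between columns of unequal size comparable.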





5.1 Contingency Tables
Conditional Distribution of Purchase for each
Host (Column Counts and Percentages)




5.1 Contingency Tables
Conditional Distribution

Reveals the percentage of purchases
among visitors from RecipeSource to be
much less than for MSN and Yahoo

Host and Purchase are associated




5.1 Contingency Tables
Segmented Bar Charts

Used to display conditional distributions

Divides the bars in a bar chart into
segments that are proportional to the
percentage in each category of a second
variable
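
A rough text-only sketch of the idea, assuming the conditional percentages are already known: each bar has the same total length, and its segments are sized by the percentage in each category. The function name `segmented_bar` and the example percentages are hypothetical.

```python
def segmented_bar(percentages, symbols, width=20):
    """Render one bar whose segments are proportional to the given percentages."""
    bar = ""
    cumulative = 0.0
    for share, symbol in zip(percentages, symbols):
        cumulative += share
        # Extend the bar up to the rounded cumulative boundary.
        target = round(width * cumulative / 100)
        bar += symbol * (target - len(bar))
    return bar

# Hypothetical conditional percentages for two groups (category 1, category 2).
bar_a = segmented_bar([75, 25], "#.")
bar_b = segmented_bar([50, 50], "#.")
print("Group A", bar_a)
print("Group B", bar_b)
```

When the segments line up differently from bar to bar, the conditional distributions differ, which is the visual sign of association.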




5.1 Contingency Tables
Contingency Table of Purchase by Region





5.1 Contingency Tables
Segmented Bar Chart Shows Association




5.1 Contingency Tables
Mosaic Plots

Alternative to segmented bar chart

A plot in which the size of each tile is
proportional to the count in a cell of a
contingency table
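
One way to see why tile size tracks the cell count is to compute the tile geometry directly. This is a sketch under assumed conventions (column width from the column total, tile height from the share within the column); the counts and labels are invented for illustration.

```python
# Hypothetical counts (rows: shirt size, columns: style), invented for illustration.
counts = {
    "Small": {"Polo": 10, "Tee": 30},
    "Large": {"Polo": 30, "Tee": 30},
}
styles = ["Polo", "Tee"]
total = sum(sum(row.values()) for row in counts.values())

tiles = {}
for style in styles:
    column_total = sum(counts[size][style] for size in counts)
    width = column_total / total  # share of all cases in this column
    for size in counts:
        height = counts[size][style] / column_total  # share within the column
        tiles[(size, style)] = (width, height)

# The area of each tile equals the cell's share of all cases.
for (size, style), (width, height) in tiles.items():
    print(size, style, round(width * height, 3))
```

Unlike a segmented bar chart, the column widths also vary, so the plot shows the marginal distribution of the column variable as well.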





5.1 Contingency Tables
Contingency Table of Shirt Size by Style




5.1 Contingency Tables
Mosaic Plot Shows Association




4M Example 5.1: CAR THEFT
Motivation

Should insurance companies vary the
premiums for different car models (are
some cars more likely to be stolen than
others)?
4M Example 5.1: CAR THEFT
Method

Data obtained from the National Highway Traffic
Safety Administration (NHTSA) on car theft for
seven popular models (two categorical variables:
type of car and whether the car was stolen).
4M Example 5.1: CAR THEFT
Mechanics





4M Example 5.1: CAR THEFT
Mechanics





4M Example 5.1: CAR THEFT
Message

The Dodge Intrepid is more likely to be stolen than
other popular models. The data suggest that
higher premiums for theft insurance should be
charged for models that are more likely to be
stolen.
5.2 Lurking Variables
and Simpson's Paradox
Association Is Not Necessarily Causation

Lurking Variable: a concealed variable that
affects the apparent relationship between two
other variables

Simpson's Paradox: a change in the association
between two variables when data are separated
into groups defined by a third variable
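
A small numerical sketch may help show how the association can change once a third variable is taken into account. All counts below are hypothetical, as are the airline and destination labels; they are chosen only so that the reversal appears.

```python
# Hypothetical (on_time, total) flight counts, invented for illustration.
flights = {
    "Airline A": {"Dest X": (90, 100), "Dest Y": (300, 500)},
    "Airline B": {"Dest X": (425, 500), "Dest Y": (50, 100)},
}

def on_time_rate(on_time, total):
    """Fraction of flights that arrived on time."""
    return on_time / total

# Overall rates ignore the destination (the lurking variable).
overall = {}
for airline, by_dest in flights.items():
    on_time = sum(o for o, _ in by_dest.values())
    total = sum(t for _, t in by_dest.values())
    overall[airline] = on_time_rate(on_time, total)

# Rates within each destination separately.
by_destination = {
    airline: {dest: on_time_rate(*cell) for dest, cell in by_dest.items()}
    for airline, by_dest in flights.items()
}

print(overall)
print(by_destination)
```

In these made-up numbers Airline A has the better on-time rate at each destination, yet Airline B looks better overall, because Airline A flies mostly to the destination where delays are common.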
4M Example 5.2: AIRLINE ARRIVALS
Motivation

Does it matter which of two airlines a
corporate CEO chooses when flying to
meetings if he wants to avoid delays?



4M Example 5.2: AIRLINE ARRIVALS
Method

Data obtained from US Bureau of
Transportation Statistics on flight delays for
two airlines (two categorical variables:
airline and whether the flight arrived on
time).



4M Example 5.2: AIRLINE ARRIVALS
Mechanics




4M Example 5.2: AIRLINE ARRIVALS
Mechanics
Is destination a lurking variable?




4M Example 5.2: AIRLINE ARRIVALS
Mechanics
This is Simpson's Paradox





4M Example 5.2: AIRLINE ARRIVALS
Message

The CEO should book on US Airways as it is
more likely to arrive on time regardless of
destination.


5.3 Strength of Association
Chi-Squared Statistic

A measure of association in a contingency
table

Calculated based on a comparison of the
observed contingency table to an artificial
table with the same marginal totals but no
association
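
The "artificial table" can be sketched as follows: each expected count is the row total times the column total, divided by the grand total. The observed counts 30, 70, 50, 50 are the ones used in the chi-squared calculation in this section; arranging them as two rows of two is an assumption that reproduces the expected counts 40 and 60.

```python
# Observed counts, assumed to be laid out as two rows of two cells.
observed = [[30, 70], [50, 50]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under no association: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

print(expected)
```

The expected table has the same row and column totals as the observed one, but every row has the same conditional distribution.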
5.3 Strength of Association
Contingency Table

5.3 Strength of Association
Calculating the Chi-Squared Statistic


5.3 Strength of Association
Calculating the Chi-Squared Statistic



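\begin{align*}
\chi^2 &= \frac{(30-40)^2}{40} + \frac{(70-60)^2}{60} + \frac{(50-40)^2}{40} + \frac{(50-60)^2}{60} \\
       &= \frac{10^2}{40} + \frac{10^2}{60} + \frac{10^2}{40} + \frac{10^2}{60} \\
       &\approx 2.5 + 1.67 + 2.5 + 1.67 \\
       &\approx 8.33
\end{align*}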
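
The same arithmetic can be checked with a short script. The observed and expected counts are taken from the calculation above; the variable names are illustrative.

```python
observed = [30, 70, 50, 50]
expected = [40, 60, 40, 60]

# Each cell contributes (observed - expected)^2 / expected.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_squared = sum(contributions)

print([round(c, 2) for c in contributions])
print(round(chi_squared, 2))  # 8.33
```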
5.3 Strength of Association
Cramér's V

Derived from the Chi-Squared Statistic

Ranges in value from 0 (variables are not
associated) to 1 (variables are perfectly
associated)
5.3 Strength of Association
Calculating Cramér's V




V = 0.20 for our example
There is a weak association between group
(students or staff) and attitude toward sharing
copyrighted music



$$
V = \sqrt{\frac{\chi^2}{n \, \min(r-1,\; c-1)}}
$$
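
As a quick check of the value quoted for the example, the formula can be evaluated directly. This sketch assumes a 2-by-2 table with n = 200 cases (the sum of the observed counts 30, 70, 50, and 50) and uses the unrounded chi-squared value 25/3; the function name `cramers_v` is hypothetical.

```python
import math

def cramers_v(chi_squared, n, rows, cols):
    """Cramér's V from a chi-squared statistic, sample size, and table shape."""
    return math.sqrt(chi_squared / (n * min(rows - 1, cols - 1)))

# chi-squared = 25/3 (about 8.33), n = 200, 2 rows and 2 columns.
v = cramers_v(chi_squared=25 / 3, n=200, rows=2, cols=2)
print(round(v, 2))  # 0.2
```

Dividing by n and by the smaller of r - 1 and c - 1 rescales chi-squared so that V stays between 0 and 1 regardless of sample size or table dimensions.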


5.3 Strength of Association
Checklist: Chi-Squared and Cramér's V

Verify that variables are categorical

Verify that there are no obvious lurking
variables
4M Example 5.3: REAL ESTATE
Motivation

Do people who heat their homes with gas
prefer to cook with gas as well? What
heating systems and appliances should a
developer select for newly built homes?



4M Example 5.3: REAL ESTATE
Method

The developer contacts homeowners to
obtain the data. Two categorical variables:
type of fuel used for home heating (gas or
electric) and type of fuel used for cooking
(gas or electric).



4M Example 5.3: REAL ESTATE
Mechanics






Chi-Squared = 98.62; Cramér's V = 0.47






4M Example 5.3: REAL ESTATE
Message
Homeowners prefer gas to electric heat by
about 2 to 1. The developer should build
about two-thirds of new homes with gas
heat. Put electric appliances in all homes
with electric heat and in half of the homes
with gas heat (assuming that buyers for
new homes have the same preferences).
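
The recommendation above implies a simple allocation, sketched here with exact fractions. Treating "about two-thirds" as exactly 2/3 and assuming a development of 300 homes are both simplifications made for illustration.

```python
from fractions import Fraction

gas_heat = Fraction(2, 3)       # "about two-thirds", treated as exact here
electric_heat = 1 - gas_heat

# Electric appliances in all electric-heat homes and half of gas-heat homes.
electric_cooking = electric_heat + gas_heat * Fraction(1, 2)
gas_cooking = gas_heat * Fraction(1, 2)

homes = 300  # hypothetical development size
plan = {
    ("gas heat", "gas cooking"): int(homes * gas_cooking),
    ("gas heat", "electric cooking"): int(homes * gas_heat * Fraction(1, 2)),
    ("electric heat", "electric cooking"): int(homes * electric_heat),
}

print(electric_cooking, gas_cooking)
print(plan)
```

Under these assumptions two-thirds of all homes end up with electric cooking, even though two-thirds have gas heat.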


Best Practices

Use contingency tables to find and summarize
association between two categorical variables.

Be on the lookout for lurking variables.

Use plots to show association.

Exploit the absence of association.


Pitfalls

Don't interpret association as causation.

Don't display too many numbers in a table.