Chapter 8

Associations Between
Categorical Variables
Case where both the explanatory (independent)
variable and the response (dependent) variable
are qualitative (Chapter 7 includes the case
where both are binary, with 2 levels each)
Association: The distributions of responses
differ among the levels of the explanatory
variable (e.g. Party affiliation by gender)
Contingency Tables
Cross-tabulations of frequency counts where the
rows (typically) represent the levels of the
explanatory variable and the columns represent
the levels of the response variable.
Numbers within the table represent the numbers
of individuals falling in the corresponding
combination of levels of the two variables
Row and column totals are called the marginal
distributions for the two variables
Example - Cyclones Near Antarctica

Period of Study: September 1973 - May 1975
Explanatory Variable: Region (40-49, 50-59, 60-79
degrees South latitude)
Response: Season (Aut(4), Wtr(5), Spr(4), Sum(8))
(Number of months in parentheses)
Units: Cyclones in the study area
Treating the observed cyclones as a random
sample of all cyclones that could have occurred
Source: Howarth (1983), An Analysis of the Variability of Cyclones around Antarctica and Their
Relation to Sea-Ice Extent, Annals of the Association of American Geographers, Vol. 73, pp. 519-537

Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S             370      452      273      422    1517
50-59S             526      624      513     1059    2722
60-79S             980     1200      995     1751    4926
Total             1876     2276     1781     3232    9165
For each region (row) we can compute the percentage of storms
occurring during each season, the conditional distribution. Of the
1517 cyclones in the 40-49S band, 370 occurred in Autumn, a
proportion of 370/1517=.244, or 24.4% as a percentage.
Region\Season   Autumn   Winter   Spring   Summer   Total% (n)
40-49S            24.4     29.8     18.0     27.8   100.0 (1517)
50-59S            19.3     22.9     18.8     38.9   100.0 (2722)
60-79S            19.9     24.4     20.2     35.5   100.0 (4926)
(Rounded percentages may not sum to exactly 100.0.)
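The row percentages can be reproduced with a short script. This is a minimal sketch using only the counts from the cyclone table; the names `counts`, `row_percents`, and `conditional` are illustrative and not part of the original slides.

```python
# Observed cyclone counts: one row per region, one column per season.
seasons = ["Autumn", "Winter", "Spring", "Summer"]
counts = {
    "40-49S": [370, 452, 273, 422],
    "50-59S": [526, 624, 513, 1059],
    "60-79S": [980, 1200, 995, 1751],
}


def row_percents(row):
    """Convert one row of counts to percentages of that row's total."""
    total = sum(row)
    return [100.0 * x / total for x in row]


# Conditional distribution of season within each region.
conditional = {region: row_percents(row) for region, row in counts.items()}

for region, pcts in conditional.items():
    cells = "  ".join(f"{s}: {p:.1f}" for s, p in zip(seasons, pcts))
    print(f"{region} (n={sum(counts[region])})  {cells}")
```

Each row of percentages sums to 100 before rounding, which is a quick check that the row total was used as the denominator.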

[Figure: Graphical Conditional Distributions for Regions. Clustered bar chart with season (Autumn, Winter, Spring, Summer) on the horizontal axis, a vertical axis running from 10.00 to 40.00, and one bar per region (40-49S, 50-59S, 60-79S).]
Guidelines for Contingency Tables

Compute percentages for the response (column)
variable within the categories of the explanatory
(row) variable. Note that in journal articles, rows
and columns may be interchanged.
Divide the cell counts by the row (explanatory
category) total and multiply by 100 to obtain a
percent. The row percents will add to 100.
Give title and clearly define variables and
categories.
Include row (explanatory) total sample sizes
Independence & Dependence

Statistically Independent: Population conditional
distributions of one variable are the same across
all levels of the other variable
Statistically Dependent: Conditional Distributions
are not all equal
When testing, researchers typically wish to
demonstrate dependence (alternative hypothesis),
and wish to refute independence (null hypothesis)
Pearson's Chi-Square Test

Can be used for nominal or ordinal explanatory and
response variables
Variables can have any number of distinct levels
Tests whether the distribution of the response
variable is the same for each level of the explanatory
variable (H0: No association between the variables)
r = # of levels of explanatory variable
c = # of levels of response variable

Intuition behind test statistic
Obtain marginal distribution of outcomes for the
response variable
Apply this common distribution to all levels of
the explanatory variable, by multiplying each
proportion by the corresponding sample size
Measure the difference between actual cell counts
and the expected cell counts in the previous step

Notation to obtain test statistic
Rows represent explanatory variable (r levels)
Cols represent response variable (c levels)
            1      2     ...    c     Total
1          n11    n12    ...   n1c     n1.
2          n21    n22    ...   n2c     n2.
...        ...    ...    ...   ...     ...
r          nr1    nr2    ...   nrc     nr.
Total      n.1    n.2    ...   n.c     n..

Observed frequency (fo): The number of
individuals falling in a particular cell
Expected frequency (fe): The number we would
expect in that cell, given the sample sizes
observed in the study and the assumption of
independence.
Computed by multiplying the row total and the
column total, and dividing by the overall sample size:
$$f_e = \frac{n_{i.} \, n_{.j}}{n_{..}}$$
Applies the overall marginal probability of the
response category to the sample size of the explanatory
category

Large-sample test (all fe > 5)
H0: Variables are statistically independent
(No association between variables)
Ha: Variables are statistically dependent
(Association exists between variables)
Test Statistic:
$$\chi^2_{obs} = \sum \frac{(f_o - f_e)^2}{f_e}$$
P-value: Area above $\chi^2_{obs}$ in the chi-squared
distribution with (r-1)(c-1) degrees of
freedom. (Critical values in Table 8.5)

Observed Cell Counts (fo):
Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S             370      452      273      422    1517
50-59S             526      624      513     1059    2722
60-79S             980     1200      995     1751    4926
Total             1876     2276     1781     3232    9165
Note that overall: (1876/9165)100%=20.5% of all cyclones
occurred in Autumn (a proportion of 0.2047). If we apply that proportion to the 1517 that
occurred in the 40-49S band, we would expect (0.2047)(1517)=310.5
to have occurred in the first cell of the table. The full table of fe:
Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S           310.5    376.7    294.8    535.0    1517
50-59S           557.2    676.0    529.0    959.9    2722
60-79S          1008.3   1223.3    957.3   1737.1    4926
Total             1876     2276     1781     3232    9165
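The expected counts follow directly from the margins of the observed table. The sketch below assumes the observed counts are stored as a list of rows (regions) by columns (seasons); `expected_counts` is an illustrative helper name, not something defined in the slides.

```python
# Observed counts: rows are regions (40-49S, 50-59S, 60-79S),
# columns are seasons (Autumn, Winter, Spring, Summer).
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


expected = expected_counts(observed)

for row in expected:
    print("  ".join(f"{fe:8.1f}" for fe in row))
```

Because each expected count is a share of the row total, every row of expected counts sums back to the observed row total.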

Computation of $\chi^2_{obs}$
Region   Season     fo       fe    (fo-fe)^2   ((fo-fe)^2)/fe
40-49S   Autumn    370    310.5     3540.25       11.402
40-49S   Winter    452    376.7     5670.09       15.052
40-49S   Spring    273    294.8      475.24        1.612
40-49S   Summer    422    535.0    12769.00       23.867
50-59S   Autumn    526    557.2      973.44        1.747
50-59S   Winter    624    676.0     2704.00        4.000
50-59S   Spring    513    529.0      256.00        0.484
50-59S   Summer   1059    959.9     9820.81       10.231
60-79S   Autumn    980   1008.3      800.89        0.794
60-79S   Winter   1200   1223.3      542.89        0.444
60-79S   Spring    995    957.3     1421.29        1.485
60-79S   Summer   1751   1737.1      193.21        0.111
Total                                              71.229

H0: Seasonal distribution of cyclone occurrences
is independent of latitude band
Ha: Seasonal distributions of cyclone occurrences
differ among latitude bands
Test Statistic: $\chi^2_{obs} = 71.2$
P-value: Area in the chi-squared distribution with (3-1)(4-1)=6 degrees of freedom above 71.2
From Table 8.5, $P(\chi^2 \geq 22.46) = .001$, so P < .001
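The test can be checked numerically. The sketch below recomputes the statistic from unrounded expected counts and, to stay within the standard library, uses a closed-form upper-tail formula that is valid only when the degrees of freedom are even (as they are here, with 6). The function names are illustrative; a statistics package would normally supply the P-value.

```python
import math

observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def pearson_chi2(table):
    """Return (statistic, degrees of freedom) for Pearson's chi-squared test."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            stat += (fo - fe) ** 2 / fe
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


def chi2_upper_tail_even_df(x, df):
    """Upper-tail area of a chi-squared distribution (even df only).

    For df = 2k, P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


chi2, df = pearson_chi2(observed)
p_value = chi2_upper_tail_even_df(chi2, df)
print(f"chi-squared = {chi2:.3f}, df = {df}, P-value = {p_value:.3g}")
```

The statistic from unrounded expected counts differs slightly in the second decimal from a hand calculation that uses expected counts rounded to one decimal place.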
SPSS Output - Cyclone Example

REGION * SEASON Crosstabulation
                                       SEASON
REGION                     Autumn   Winter   Spring   Summer    Total
40-49S  Count                 370      452      273      422     1517
        Expected Count      310.5    376.7    294.8    535.0   1517.0
        % within REGION     24.4%    29.8%    18.0%    27.8%   100.0%
50-59S  Count                 526      624      513     1059     2722
        Expected Count      557.2    676.0    529.0    959.9   2722.0
        % within REGION     19.3%    22.9%    18.8%    38.9%   100.0%
60-79S  Count                 980     1200      995     1751     4926
        Expected Count     1008.3   1223.3    957.3   1737.1   4926.0
        % within REGION     19.9%    24.4%    20.2%    35.5%   100.0%
Total   Count                1876     2276     1781     3232     9165
        Expected Count     1876.0   2276.0   1781.0   3232.0   9165.0
        % within REGION     20.5%    24.8%    19.4%    35.3%   100.0%

Chi-Square Tests
                                 Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             71.189a      6   .000
Likelihood Ratio               71.337       6   .000
Linear-by-Linear Association   23.418           .000
N of Valid Cases                 9165
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 294.79.
The Asymp. Sig. (2-sided) column gives the P-value.
Misuses of chi-squared Test

Expected frequencies too small (all
expected counts should be above 5; this is not
required of the observed counts)
Dependent samples (the same individuals
are in each row; see McNemar's test)
Can be used for nominal or ordinal
variables, but more powerful methods exist
for when both variables are ordinal and a
directional association is hypothesized
Residual Analysis
Once dependence has been determined from a chi-squared test, we are often interested in determining which
cells contributed
Residual: fo-fe measures the difference between the
observed and expected counts
Positive implies observed more than expected
A residual's practical importance depends on the level of fe
Adjusted Residual (computed for each cell):
$$\frac{f_o - f_e}{\sqrt{f_e \, (1 - \text{row proportion})(1 - \text{column proportion})}}$$
Adjusted residuals above 3 in absolute value give strong evidence against independence in
that cell

Adjusted residuals are computed in the following table.
Row proportion for Region 40-49S: 1517/9165=0.1655
Column Proportion for Season Autumn is: 1876/9165=0.2047
Region   Season     fo       fe    row prop   col prop   adj res
40-49S   Autumn    370    310.5     0.1655     0.2047      4.14
40-49S   Winter    452    376.7     0.1655     0.2483      4.90
40-49S   Spring    273    294.8     0.1655     0.1943     -1.55
40-49S   Summer    422    535.0     0.1655     0.3526     -6.65
50-59S   Autumn    526    557.2     0.2970     0.2047     -1.77
50-59S   Winter    624    676.0     0.2970     0.2483     -2.75
50-59S   Spring    513    529.0     0.2970     0.1943     -0.92
50-59S   Summer   1059    959.9     0.2970     0.3526      4.74
60-79S   Autumn    980   1008.3     0.5375     0.2047     -1.47
60-79S   Winter   1200   1223.3     0.5375     0.2483     -1.13
60-79S   Spring    995    957.3     0.5375     0.1943      2.00
60-79S   Summer   1751   1737.1     0.5375     0.3526      0.61
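The adjusted residuals can be computed for every cell at once. This is a minimal sketch that applies the adjusted residual formula to the observed cyclone counts with unrounded expected counts and proportions; `adjusted_residuals` is an illustrative name.

```python
import math

regions = ["40-49S", "50-59S", "60-79S"]
seasons = ["Autumn", "Winter", "Spring", "Summer"]
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def adjusted_residuals(table):
    """(fo - fe) / sqrt(fe * (1 - row proportion) * (1 - column proportion))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out = []
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            se = math.sqrt(fe * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            out.append((fo - fe) / se)
        result.append(out)
    return result


adj = adjusted_residuals(observed)

# Flag cells whose adjusted residual exceeds 3 in absolute value.
for region, row in zip(regions, adj):
    for season, value in zip(seasons, row):
        flag = "  <-- |adj res| > 3" if abs(value) > 3 else ""
        print(f"{region} {season:7s} {value:7.2f}{flag}")
```

With these data the flagged cells are Autumn, Winter, and Summer in the 40-49S band and Summer in the 50-59S band.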
2x2 Tables
Each variable has 2 levels
Explanatory Variable: Groups (typically based
on demographics, exposure, or treatment)
Response Variable: Outcome (typically
presence or absence of a characteristic)
Measures of association
Relative Risk (Prospective Studies)
Odds Ratio (Prospective or Retrospective)
Absolute Risk (Prospective Studies)
2x2 Tables - Notation
                Outcome Present   Outcome Absent   Group Total
Group 1               n11               n12            n1.
Group 2               n21               n22            n2.
Outcome Total         n.1               n.2            n..
Relative Risk
Ratio of the probability that the outcome
characteristic is present for one group, relative to
the other
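Sample proportions with characteristic from
groups 1 and 2:
$$\hat{\pi}_1 = \frac{n_{11}}{n_{1.}} \qquad \hat{\pi}_2 = \frac{n_{21}}{n_{2.}}$$
Estimated Relative Risk:
$$RR = \frac{\hat{\pi}_1}{\hat{\pi}_2}$$
95% Confidence Interval for Population Relative Risk:
$$\left( RR \, e^{-1.96\sqrt{v}} \; , \; RR \, e^{1.96\sqrt{v}} \right)$$
$$e \approx 2.71828 \qquad v = \frac{1 - \hat{\pi}_1}{n_{11}} + \frac{1 - \hat{\pi}_2}{n_{21}}$$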
Relative Risk
Interpretation
Conclude that the probability that the outcome is
present is higher (in the population) for group 1 if
the entire interval is above 1
Conclude that the probability that the outcome is
present is lower (in the population) for group 1 if
the entire interval is below 1
Do not conclude that the probability of the
outcome differs for the two groups if the interval
contains 1
Example - Coccidioidomycosis and TNF-antagonists
Research Question: Risk of developing
Coccidioidomycosis associated with arthritis
therapy?
Groups: Patients receiving tumor necrosis
factor (TNF) antagonists versus patients not receiving
TNF antagonists (all patients arthritic)
Source: Bergstrom, et al (2004)
         COC   No COC   Total
TNF        7      240     247
Other      4      734     738
Total     11      974     985

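Example - Coccidioidomycosis and TNF-antagonists
Group 1: Patients on TNF
Group 2: Patients not on TNF
$$\hat{\pi}_1 = \frac{7}{247} = .0283 \qquad \hat{\pi}_2 = \frac{4}{738} = .0054$$
$$RR = \frac{\hat{\pi}_1}{\hat{\pi}_2} = \frac{.0283}{.0054} = 5.24$$
$$v = \frac{1 - .0283}{7} + \frac{1 - .0054}{4} = .3874$$
$$95\% \ CI: \left( 5.24 \, e^{-1.96\sqrt{.3874}} \; , \; 5.24 \, e^{1.96\sqrt{.3874}} \right) = (1.55 \, , \, 17.76)$$
Entire CI above 1: conclude higher risk if on TNF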
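The relative risk interval can be reproduced in a few lines. The sketch below assumes the large-sample log-scale interval with v = (1 - p1)/n11 + (1 - p2)/n21; `relative_risk_ci` and its parameter names are illustrative.

```python
import math


def relative_risk_ci(n11, n1_total, n21, n2_total, z=1.96):
    """Estimated relative risk and its large-sample confidence interval.

    n11, n21: counts with the outcome in groups 1 and 2.
    n1_total, n2_total: group sample sizes.
    """
    p1 = n11 / n1_total
    p2 = n21 / n2_total
    rr = p1 / p2
    v = (1 - p1) / n11 + (1 - p2) / n21
    margin = z * math.sqrt(v)
    return rr, rr * math.exp(-margin), rr * math.exp(margin)


# TNF example: 7 of 247 patients on TNF and 4 of 738 other patients developed COC.
rr, lower, upper = relative_risk_ci(7, 247, 4, 738)
print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Because the code carries unrounded proportions through the calculation, its output differs slightly in the second decimal from a hand calculation based on .0283 and .0054; the conclusion (entire interval above 1) is unchanged.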
Odds Ratio
Odds of an event is the probability it occurs
divided by the probability it does not occur
Odds ratio is the odds of the event for group 1
divided by the odds of the event for group 2
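Sample odds of the outcome for each group:
$$odds_1 = \frac{n_{11} / n_{1.}}{n_{12} / n_{1.}} = \frac{n_{11}}{n_{12}} \qquad odds_2 = \frac{n_{21}}{n_{22}}$$
Estimated Odds Ratio:
$$OR = \frac{odds_1}{odds_2} = \frac{n_{11} / n_{12}}{n_{21} / n_{22}} = \frac{n_{11} n_{22}}{n_{12} n_{21}}$$
95% Confidence Interval for Population Odds Ratio:
$$\left( OR \, e^{-1.96\sqrt{v}} \; , \; OR \, e^{1.96\sqrt{v}} \right)$$
$$e \approx 2.71828 \qquad v = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}$$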
Odds Ratio
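Interpretation
Conclude that the odds that the outcome is
present are higher (in the population) for group 1 if
the entire interval is above 1
Conclude that the odds that the outcome is
present are lower (in the population) for group 1 if
the entire interval is below 1
Do not conclude that the odds of the
outcome differ for the two groups if the interval
contains 1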
Example - NSAIDs and GBM

Case-Control Study (Retrospective)
Cases: 137 Self-Reporting Patients with Glioblastoma
Multiforme (GBM)
Controls: 401 Population-Based Individuals matched to
cases with respect to demographic factors
                 GBM Present   GBM Absent   Total
NSAID User            32           138        170
NSAID Non-User       105           263        368
Total                137           401        538
Source: Sivak-Sears, et al
Example - NSAIDs and GBM

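$$OR = \frac{32(263)}{138(105)} = \frac{8416}{14490} = 0.58$$
$$v = \frac{1}{32} + \frac{1}{138} + \frac{1}{105} + \frac{1}{263} = 0.0518$$
$$95\% \ CI: \left( 0.58 \, e^{-1.96\sqrt{0.0518}} \; , \; 0.58 \, e^{1.96\sqrt{0.0518}} \right) = (0.37 \, , \, 0.91)$$
Interval is entirely below 1: NSAID
use appears to be lower among
cases than controls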
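The odds ratio and its interval can be checked the same way. This sketch assumes the 2x2 cell counts are passed in the order n11, n12, n21, n22; `odds_ratio_ci` is an illustrative name.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Estimated odds ratio and its large-sample confidence interval."""
    odds_ratio = (n11 * n22) / (n12 * n21)
    v = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22
    margin = z * math.sqrt(v)
    return odds_ratio, odds_ratio * math.exp(-margin), odds_ratio * math.exp(margin)


# NSAID example: users (32 GBM present, 138 absent), non-users (105 present, 263 absent).
or_hat, lower, upper = odds_ratio_ci(32, 138, 105, 263)
print(f"OR = {or_hat:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The odds ratio uses only the four cell counts, which is why it remains usable in a retrospective (case-control) design where the group totals are fixed by the sampling.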
Absolute Risk
Difference between the proportions with
an outcome characteristic for 2 groups
Sample proportions with characteristic from
groups 1 and 2:
$$\hat{\pi}_1 = \frac{n_{11}}{n_{1.}} \qquad \hat{\pi}_2 = \frac{n_{21}}{n_{2.}}$$
Estimated Absolute Risk:
$$AR = \hat{\pi}_1 - \hat{\pi}_2$$
95% Confidence Interval for Population Absolute Risk:
$$AR \pm 1.96 \sqrt{\frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_{1.}} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_{2.}}}$$
Absolute Risk
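Interpretation
Conclude that the probability that the outcome is
present is higher (in the population) for group 1 if
the entire interval is positive
Conclude that the probability that the outcome is
present is lower (in the population) for group 1 if
the entire interval is negative
Do not conclude that the probability of the
outcome differs for the two groups if the interval
contains 0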

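Example - Coccidioidomycosis and TNF-antagonists
Group 1: Patients on TNF
Group 2: Patients not on TNF
$$\hat{\pi}_1 = \frac{7}{247} = .0283 \qquad \hat{\pi}_2 = \frac{4}{738} = .0054$$
$$AR = \hat{\pi}_1 - \hat{\pi}_2 = .0283 - .0054 = .0229$$
$$95\% \ CI: .0229 \pm 1.96 \sqrt{\frac{.0283(.9717)}{247} + \frac{.0054(.9946)}{738}} = .0229 \pm .0213 = (0.0016 \, , \, 0.0442)$$
Interval is entirely positive: TNF is associated
with higher risk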
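A short check of the absolute risk interval, as a sketch under the large-sample formula for a difference of two proportions. The function name `absolute_risk_ci` is illustrative. The upper limit is the point estimate plus the margin, about .0229 + .0213 = .0442.

```python
import math


def absolute_risk_ci(n11, n1_total, n21, n2_total, z=1.96):
    """Difference in sample proportions and its large-sample confidence interval."""
    p1 = n11 / n1_total
    p2 = n21 / n2_total
    ar = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1_total + p2 * (1 - p2) / n2_total)
    margin = z * se
    return ar, ar - margin, ar + margin


# TNF example: 7 of 247 patients on TNF and 4 of 738 other patients developed COC.
ar, lower, upper = absolute_risk_ci(7, 247, 4, 738)
print(f"AR = {ar:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```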
Ordinal Explanatory and Response Variables
Pearson's Chi-square test can be used to test
associations among ordinal variables, but more
powerful methods exist
When theories exist that the association is
directional (positive or negative), measures exist
to describe and test for these specific alternatives
from independence:
Gamma
Kendall's tau-b ($\tau_b$)
Concordant and Discordant Pairs

Concordant Pairs - Pairs of individuals where one
individual scores higher on both ordered variables
than the other individual
Discordant Pairs - Pairs of individuals where one
individual scores higher on one ordered variable
and the other individual scores higher on the other
C = # Concordant Pairs D = # Discordant Pairs
Under Positive association, expect C > D
Under Negative association, expect C < D
Under No association, expect C ≈ D
Example - Alcohol Use and Sick Days

Alcohol Risk (Without Risk, Hardly any Risk,
Some to Considerable Risk)
Sick Days (0, 1-6, 7+)
Concordant Pairs - Pairs of respondents where one
scores higher on both alcohol risk and sick days
than the other
Discordant Pairs - Pairs of respondents where one
scores higher on alcohol risk and the other scores
higher on sick days
Source: Hermansson, et al
(2003)

ALCOHOL * SICKDAYS Crosstabulation
Count
                                    SICKDAYS
ALCOHOL                    0 days   1-6 days   7+ days   Total
Without Risk                  347        113       145     605
Hardly any Risk               154         63        56     273
Some-Considerable Risk         52         25        34     111
Total                         553        201       235     989
Concordant Pairs: Each individual in a
given cell is concordant with each individual
in cells Southeast of theirs
Discordant Pairs: Each individual in a given
cell is discordant with each individual in
cells Southwest of theirs

C = 347(63 + 56 + 25 + 34) + 113(56 + 34) + 154(25 + 34) + 63(34) = 83164
D = 145(154 + 63 + 52 + 25) + 113(154 + 52) + 56(52 + 25) + 63(52) = 73496
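The pair counts can be verified by brute force over the cells. In this sketch, rows are the alcohol risk levels and columns are the sick-day categories, both in increasing order; for each cell it pairs the count with every cell in a lower row, treating cells to the right as concordant (Southeast) and cells to the left as discordant (Southwest). `concordant_discordant` is an illustrative name.

```python
# Rows: Without Risk, Hardly any Risk, Some-Considerable Risk.
# Columns: 0 days, 1-6 days, 7+ days.
table = [
    [347, 113, 145],
    [154, 63, 56],
    [52, 25, 34],
]


def concordant_discordant(table):
    """Count concordant and discordant pairs in an ordered r x c table."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = 0
    discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for lower_row in range(i + 1, n_rows):
                for col in range(n_cols):
                    pairs = table[i][j] * table[lower_row][col]
                    if col > j:
                        concordant += pairs  # Southeast of cell (i, j)
                    elif col < j:
                        discordant += pairs  # Southwest of cell (i, j)
    return concordant, discordant


C, D = concordant_discordant(table)
print(f"C = {C}, D = {D}")
```

Pairs that fall in the same row or the same column are tied on one variable and count toward neither C nor D.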
Measures of Association
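Goodman and Kruskal's Gamma:
$$\hat{\gamma} = \frac{C - D}{C + D} \qquad -1 \leq \hat{\gamma} \leq 1$$
Kendall's $\tau_b$:
$$\hat{\tau}_b = \frac{C - D}{0.5 \sqrt{\left( n^2 - \sum_i n_{i.}^2 \right) \left( n^2 - \sum_j n_{.j}^2 \right)}}$$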
When there is no association between the ordinal variables,
the population-based values of these measures are 0.
Statistical software packages provide these tests.

$$\hat{\gamma} = \frac{C - D}{C + D} = \frac{83164 - 73496}{83164 + 73496} = 0.0617$$
Symmetric Measures
                                       Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Ordinal by Ordinal   Kendall's tau-b    .035          .030                1.187           .235
                     Gamma              .062          .052                1.187           .235
N of Valid Cases                         989
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
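Both point estimates in the SPSS output can be reproduced from C, D, and the table margins. The sketch below takes C = 83164 and D = 73496 as given and computes the margins from the alcohol by sick-days counts; the function names are illustrative. It covers only the estimates, not the standard errors or significance tests.

```python
import math

# Rows: alcohol risk levels; columns: sick-day categories (both in increasing order).
table = [
    [347, 113, 145],
    [154, 63, 56],
    [52, 25, 34],
]
C, D = 83164, 73496  # concordant and discordant pair counts for this table


def gamma_hat(c, d):
    """Goodman and Kruskal's gamma: (C - D) / (C + D)."""
    return (c - d) / (c + d)


def tau_b_hat(c, d, table):
    """Kendall's tau-b: (C - D) / (0.5 * sqrt((n^2 - sum ni.^2)(n^2 - sum n.j^2)))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    row_term = n ** 2 - sum(rt ** 2 for rt in row_totals)
    col_term = n ** 2 - sum(ct ** 2 for ct in col_totals)
    return (c - d) / (0.5 * math.sqrt(row_term * col_term))


gamma = gamma_hat(C, D)
tau_b = tau_b_hat(C, D, table)
print(f"gamma = {gamma:.3f}, tau-b = {tau_b:.3f}")
```

Tau-b is smaller in magnitude than gamma here because its denominator also accounts for pairs tied on one of the two variables.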
