
Associations Between Categorical Variables
Case where both the explanatory (independent) variable and the response (dependent) variable are qualitative (Chapter 7 includes the case where both are binary, i.e. have 2 levels)
Association: The distributions of responses differ among the levels of the explanatory variable (e.g. party affiliation by gender)

Contingency Tables
Cross-tabulations of frequency counts where the
rows (typically) represent the levels of the
explanatory variable and the columns represent
the levels of the response variable.
Numbers within the table represent the numbers
of individuals falling in the corresponding
combination of levels of the two variables
Row and column totals are called the marginal
distributions for the two variables

Example - Cyclones Near Antarctica


Period of Study: September 1973 - May 1975
Explanatory Variable: Region (40-49, 50-59, 60-79 degrees South latitude)
Response: Season (Aut(4), Wtr(5), Spr(4), Sum(8)) (number of months in parentheses)
Units: Cyclones in the study area
The observed cyclones are treated as a random sample of all cyclones that could have occurred
Source: Howarth (1983), An Analysis of the Variability of Cyclones around Antarctica and Their Relation to Sea-Ice Extent, Annals of the Association of American Geographers, Vol. 73, pp. 519-537

Example - Cyclones Near Antarctica


Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S             370      452      273      422    1517
50-59S             526      624      513     1059    2722
60-79S             980     1200      995     1751    4926
Total             1876     2276     1781     3232    9165

For each region (row) we can compute the percentage of storms occurring during each season, the conditional distribution. Of the 1517 cyclones in the 40-49S band, 370 occurred in Autumn, a proportion of 370/1517 = .244, or 24.4% as a percentage.
Region\Season   Autumn   Winter   Spring   Summer   Total% (n)
40-49S            24.4     29.8     18.0     27.8   100.0 (1517)
50-59S            19.3     22.9     18.8     38.9   100.0 (2722)
60-79S            19.9     24.4     20.2     35.5   100.0 (4926)
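
The row percentages above can be reproduced with a few lines of Python. This is only a minimal sketch: the dictionary layout and the helper name `row_percentages` are illustrative assumptions, not something taken from the slides.

```python
# Observed cyclone counts: rows are regions, columns are seasons
# in the order Autumn, Winter, Spring, Summer.
counts = {
    "40-49S": [370, 452, 273, 422],
    "50-59S": [526, 624, 513, 1059],
    "60-79S": [980, 1200, 995, 1751],
}


def row_percentages(row):
    """Return the conditional distribution (in percent) for one row."""
    total = sum(row)
    return [100.0 * cell / total for cell in row]


# Hypothetical structure: region -> list of seasonal percentages.
conditional = {region: row_percentages(row) for region, row in counts.items()}

for region, percents in conditional.items():
    # Report the row sample size next to the percentages, as the guidelines suggest.
    print(region, [round(p, 1) for p in percents], "n =", sum(counts[region]))
```

Each row of percentages sums to 100, which is a quick check that the division used the row total rather than the column total.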

Example - Cyclones Near Antarctica


[Figure: Graphical Conditional Distributions for Regions. Bar chart of the percentage of cyclones (vertical axis, 10 to 40) by season (Autumn, Winter, Spring, Summer), with one bar per region (40-49S, 50-59S, 60-79S).]

Guidelines for Contingency Tables


Compute percentages for the response (column) variable within the categories of the explanatory (row) variable. Note that in journal articles, rows and columns may be interchanged.
Divide the cell counts by the row (explanatory category) total and multiply by 100 to obtain a percent; the row percents will add to 100.
Give a title and clearly define variables and categories.
Include row (explanatory) total sample sizes.

Independence & Dependence


Statistically Independent: Population conditional
distributions of one variable are the same across
all levels of the other variable
Statistically Dependent: Conditional Distributions
are not all equal
When testing, researchers typically wish to
demonstrate dependence (alternative hypothesis),
and wish to refute independence (null hypothesis)

Pearson's Chi-Square Test

Can be used for nominal or ordinal explanatory and response variables
Variables can have any number of distinct levels
Tests whether the distribution of the response variable is the same for each level of the explanatory variable (H0: No association between the variables)
r = # of levels of explanatory variable
c = # of levels of response variable

Pearson's Chi-Square Test


Intuition behind test statistic
Obtain marginal distribution of outcomes for the
response variable
Apply this common distribution to all levels of
the explanatory variable, by multiplying each
proportion by the corresponding sample size
Measure the difference between actual cell counts
and the expected cell counts in the previous step

Pearson's Chi-Square Test

Notation to obtain test statistic
Rows represent explanatory variable (r levels)
Cols represent response variable (c levels)

          1      2     ...     c     Total
1        n11    n12    ...    n1c    n1.
2        n21    n22    ...    n2c    n2.
...      ...    ...    ...    ...    ...
r        nr1    nr2    ...    nrc    nr.
Total    n.1    n.2    ...    n.c    n..

Pearson's Chi-Square Test

Observed frequency (fo): The number of individuals falling in a particular cell
Expected frequency (fe): The number we would expect in that cell, given the sample sizes observed in the study and the assumption of independence.
Computed by multiplying the row total and the column total, and dividing by the overall sample size: $f_e = n_{i.} n_{.j} / n_{..}$ for the cell in row i and column j.
Applies the overall marginal probability of the response category to the sample size of the explanatory category
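
As an illustration of the expected-frequency rule (row total times column total, divided by the overall sample size), here is a short Python sketch using the cyclone counts from this deck. The function name `expected_counts` and the list-of-lists layout are assumptions made for the example.

```python
# Observed counts: rows are regions (40-49S, 50-59S, 60-79S),
# columns are seasons (Autumn, Winter, Spring, Summer).
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print([round(fe, 1) for fe in row])
```

The expected counts in each row add up to the observed row total, because the same marginal distribution of the response is applied to every level of the explanatory variable.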

Pearson's Chi-Square Test

Large-sample test (all fe > 5)
H0: Variables are statistically independent (No association between variables)
Ha: Variables are statistically dependent (Association exists between variables)
Test Statistic: $\chi^2_{obs} = \sum \frac{(f_o - f_e)^2}{f_e}$
P-value: Area above $\chi^2_{obs}$ in the chi-squared distribution with (r-1)(c-1) degrees of freedom. (Critical values in Table 8.5)

Example - Cyclones Near Antarctica


Observed Cell Counts (fo):

Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S             370      452      273      422    1517
50-59S             526      624      513     1059    2722
60-79S             980     1200      995     1751    4926
Total             1876     2276     1781     3232    9165

Note that overall, (1876/9165)100% = 20.5% of all cyclones occurred in Autumn. If we apply that proportion to the 1517 cyclones that occurred in the 40-49S band, we would expect (0.2047)(1517) = 310.5 to have occurred in the first cell of the table. The full table of fe:

Region\Season   Autumn   Winter   Spring   Summer   Total
40-49S           310.5    376.7    294.8    535.0    1517
50-59S           557.2    676.0    529.0    959.9    2722
60-79S          1008.3   1223.3    957.3   1737.1    4926
Total             1876     2276     1781     3232    9165

Example - Cyclones Near Antarctica


Computation of $\chi^2_{obs}$:

Region   Season      fo       fe   (fo-fe)^2   ((fo-fe)^2)/fe
40-49S   Autumn     370    310.5     3540.25       11.4017713
40-49S   Winter     452    376.7     5670.09       15.0520042
40-49S   Spring     273    294.8      475.24       1.61207598
40-49S   Summer     422    535.0    12769          23.8672897
50-59S   Autumn     526    557.2      973.44       1.74702082
50-59S   Winter     624    676.0     2704          4
50-59S   Spring     513    529.0      256          0.48393195
50-59S   Summer    1059    959.9     9820.81       10.2310762
60-79S   Autumn     980   1008.3      800.89       0.79429733
60-79S   Winter    1200   1223.3      542.89       0.44379138
60-79S   Spring     995    957.3     1421.29       1.4846861
60-79S   Summer    1751   1737.1      193.21       0.11122561
Total                                              71.2291706

Example - Cyclones Near Antarctica


H0: Seasonal distribution of cyclone occurrences is independent of latitude band
Ha: Seasonal distributions of cyclone occurrences differ among latitude bands
Test Statistic: $\chi^2_{obs} = 71.2$
P-value: Area in the chi-squared distribution with (3-1)(4-1) = 6 degrees of freedom above 71.2
From Table 8.5, $P(\chi^2 \geq 22.46) = .001$, so P < .001
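
The test can be checked numerically with the sketch below, which recomputes the statistic from the observed counts and then evaluates the upper-tail area. It rests on one assumption that should be stated plainly: the tail-area function uses a closed-form expression that holds only when the degrees of freedom are even (true here, df = 6), so it is not a general replacement for a chi-squared table or a statistics library. The function names are illustrative.

```python
import math

# Observed counts: rows are regions, columns are seasons.
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def chi_square_statistic(table):
    """Return (Pearson chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n  # expected count under H0
            stat += (fo - fe) ** 2 / fe
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def chi_square_upper_tail(x, df):
    """Upper-tail area of the chi-squared distribution (even df only)."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even number of df")
    half = x / 2.0
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


stat, df = chi_square_statistic(observed)
p_value = chi_square_upper_tail(stat, df)
print(round(stat, 2), df, p_value)
```

Using unrounded expected counts gives a statistic that agrees with the hand computation to about one decimal place; the small difference comes from rounding fe to one decimal in the table.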

SPSS Output - Cyclone Example


REGION * SEASON Crosstabulation

                                        SEASON
REGION                      Autumn   Winter   Spring   Summer    Total
40-49S   Count                 370      452      273      422     1517
         Expected Count      310.5    376.7    294.8    535.0   1517.0
         % within REGION     24.4%    29.8%    18.0%    27.8%   100.0%
50-59S   Count                 526      624      513     1059     2722
         Expected Count      557.2    676.0    529.0    959.9   2722.0
         % within REGION     19.3%    22.9%    18.8%    38.9%   100.0%
60-79S   Count                 980     1200      995     1751     4926
         Expected Count     1008.3   1223.3    957.3   1737.1   4926.0
         % within REGION     19.9%    24.4%    20.2%    35.5%   100.0%
Total    Count                1876     2276     1781     3232     9165
         Expected Count     1876.0   2276.0   1781.0   3232.0   9165.0
         % within REGION     20.5%    24.8%    19.4%    35.3%   100.0%

Chi-Square Tests

                                Value       df   Asymp. Sig. (2-sided)
Pearson Chi-Square              71.189(a)    6   .000
Likelihood Ratio                71.337       6   .000
Linear-by-Linear Association    23.418           .000
N of Valid Cases                9165

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 294.79.

The Asymp. Sig. (2-sided) column gives the P-value.

Misuses of chi-squared Test


Expected frequencies too small (all expected counts should be above 5; this is not necessary for the observed counts)
Dependent samples (the same individuals are in each row; see McNemar's test)
Can be used for nominal or ordinal variables, but more powerful methods exist when both variables are ordinal and a directional association is hypothesized

Residual Analysis
Once dependence has been determined from a chi-squared test, we are often interested in determining which cells contributed
Residual: fo - fe measures the difference between the observed and expected counts
Positive implies observed more than expected
A residual's practical importance depends on the level of fe

Adjusted Residual (computed for each cell):
$\frac{f_o - f_e}{\sqrt{f_e (1 - \text{row proportion})(1 - \text{column proportion})}}$
Adjusted residuals above 3 in absolute value give strong evidence against independence in that cell

Example - Cyclones Near Antarctica


Adjusted residuals are computed in the following table.
Row proportion for Region 40-49S: 1517/9165 = 0.1655
Column proportion for Season Autumn: 1876/9165 = 0.2047

Region   Season      fo       fe   row prop   col prop     adj res
40-49S   Autumn     370    310.5     0.1655     0.2047    4.144837
40-49S   Winter     452    376.7     0.1655     0.2483    4.898484
40-49S   Spring     273    294.8     0.1655     0.1943    -1.54843
40-49S   Summer     422    535       0.1655     0.3526    -6.64664
50-59S   Autumn     526    557.2     0.297      0.2047    -1.76769
50-59S   Winter     624    676       0.297      0.2483    -2.75125
50-59S   Spring     513    529       0.297      0.1943    -0.92433
50-59S   Summer    1059    959.9     0.297      0.3526    4.741291
60-79S   Autumn     980   1008.3     0.5375     0.2047    -1.4695
60-79S   Winter    1200   1223.3     0.5375     0.2483    -1.12983
60-79S   Spring     995    957.3     0.5375     0.1943    1.996065
60-79S   Summer    1751   1737.1     0.5375     0.3526    0.609481
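
The adjusted residuals in the table can be recomputed with the following sketch. It assumes the adjusted-residual formula given earlier in this section and uses unrounded expected counts, so results may differ from the table in the third decimal place. The function name `adjusted_residuals` is hypothetical.

```python
import math

# Observed counts: rows are regions, columns are seasons.
observed = [
    [370, 452, 273, 422],
    [526, 624, 513, 1059],
    [980, 1200, 995, 1751],
]


def adjusted_residuals(table):
    """Adjusted residual per cell: (fo - fe) / sqrt(fe (1 - row prop)(1 - col prop))."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        row_prop = row_totals[i] / n
        out_row = []
        for j, fo in enumerate(row):
            col_prop = col_totals[j] / n
            fe = row_totals[i] * col_totals[j] / n
            denom = math.sqrt(fe * (1 - row_prop) * (1 - col_prop))
            out_row.append((fo - fe) / denom)
        result.append(out_row)
    return result


adj = adjusted_residuals(observed)
for row in adj:
    print([round(value, 2) for value in row])
```

Cells whose adjusted residual exceeds 3 in absolute value (for example Summer in the 40-49S band) are the ones that contribute most to the evidence against independence.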

2x2 Tables
Each variable has 2 levels
Explanatory Variable: Groups (typically based on demographics, exposure, or treatment)
Response Variable: Outcome (typically presence or absence of a characteristic)

Measures of association
Relative Risk (Prospective Studies)
Odds Ratio (Prospective or Retrospective)
Absolute Risk (Prospective Studies)

2x2 Tables - Notation

                 Outcome Present   Outcome Absent   Group Total
Group 1                n11               n12            n1.
Group 2                n21               n22            n2.
Outcome Total          n.1               n.2            n..

Relative Risk
Ratio of the probability that the outcome
characteristic is present for one group, relative to
the other
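Sample proportions with characteristic from groups 1 and 2:
$\hat{\pi}_1 = \frac{n_{11}}{n_{1.}} \qquad \hat{\pi}_2 = \frac{n_{21}}{n_{2.}}$

Relative Risk
Estimated Relative Risk: $RR = \frac{\hat{\pi}_1}{\hat{\pi}_2}$
95% Confidence Interval for Population Relative Risk:
$\left( RR \, e^{-1.96\sqrt{v}} \; , \; RR \, e^{1.96\sqrt{v}} \right)$
where $e \approx 2.71828$ and $v = \frac{1 - \hat{\pi}_1}{n_{11}} + \frac{1 - \hat{\pi}_2}{n_{21}}$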

Relative Risk
Interpretation
Conclude that the probability that the outcome is
present is higher (in the population) for group 1 if
the entire interval is above 1
Conclude that the probability that the outcome is
present is lower (in the population) for group 1 if
the entire interval is below 1
Do not conclude that the probability of the
outcome differs for the two groups if the interval
contains 1

Example - Coccidioidomycosis and TNF-antagonists
Research Question: Is the risk of developing coccidioidomycosis associated with arthritis therapy?
Groups: Patients receiving tumor necrosis factor (TNF) antagonists versus patients not receiving them (all patients arthritic)
Source: Bergstrom, et al (2004)

         COC   No COC   Total
TNF        7      240     247
Other      4      734     738
Total     11      974     985

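Example - Coccidioidomycosis and TNF-antagonists
Group 1: Patients on TNF
Group 2: Patients not on TNF
$\hat{\pi}_1 = \frac{7}{247} = .0283 \qquad \hat{\pi}_2 = \frac{4}{738} = .0054$
$RR = \frac{\hat{\pi}_1}{\hat{\pi}_2} = \frac{.0283}{.0054} = 5.24$
$v = \frac{1 - .0283}{7} + \frac{1 - .0054}{4} = .3874$
95% CI: $\left( 5.24 e^{-1.96\sqrt{.3874}} \; , \; 5.24 e^{1.96\sqrt{.3874}} \right) = (1.55 , 17.76)$

Entire CI above 1: Conclude higher risk if on TNF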
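
The relative risk and its 95% confidence interval can be sketched in Python as follows. The function name `relative_risk_ci` and its argument names are assumptions for illustration. Because the code carries full precision rather than the rounded proportions .0283 and .0054, its results differ slightly from the hand computation (for example RR is about 5.23 rather than 5.24).

```python
import math


def relative_risk_ci(n11, n1_total, n21, n2_total, z=1.96):
    """Return (RR, lower, upper) for a 2x2 table.

    n11, n21: counts with the outcome present in groups 1 and 2.
    n1_total, n2_total: group sample sizes.
    """
    p1 = n11 / n1_total
    p2 = n21 / n2_total
    rr = p1 / p2
    # Variance term of log(RR), as defined in the slides.
    v = (1 - p1) / n11 + (1 - p2) / n21
    margin = math.exp(z * math.sqrt(v))
    return rr, rr / margin, rr * margin


# TNF example: 7 of 247 on TNF and 4 of 738 not on TNF developed COC.
rr, lower, upper = relative_risk_ci(7, 247, 4, 738)
print(round(rr, 2), round(lower, 2), round(upper, 2))
```

The interval is built on the log scale and then exponentiated, which is why it is not symmetric around the estimated relative risk.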

Odds Ratio
Odds of an event is the probability it occurs
divided by the probability it does not occur
Odds ratio is the odds of the event for group 1
divided by the odds of the event for group 2
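Sample odds of the outcome for each group:
$odds_1 = \frac{n_{11}/n_{1.}}{n_{12}/n_{1.}} = \frac{n_{11}}{n_{12}} \qquad odds_2 = \frac{n_{21}}{n_{22}}$

Odds Ratio
Estimated Odds Ratio: $OR = \frac{odds_1}{odds_2} = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11} n_{22}}{n_{12} n_{21}}$
95% Confidence Interval for Population Odds Ratio:
$\left( OR \, e^{-1.96\sqrt{v}} \; , \; OR \, e^{1.96\sqrt{v}} \right)$
where $e \approx 2.71828$ and $v = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}$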

Odds Ratio
Interpretation
Conclude that the probability that the outcome is
present is higher (in the population) for group 1 if
the entire interval is above 1
Conclude that the probability that the outcome is
present is lower (in the population) for group 1 if
the entire interval is below 1
Do not conclude that the probability of the
outcome differs for the two groups if the interval
contains 1

Example - NSAIDs and GBM


Case-Control Study (Retrospective)
Cases: 137 Self-Reporting Patients with Glioblastoma Multiforme (GBM)
Controls: 401 Population-Based Individuals matched to cases with respect to demographic factors

                 GBM Present   GBM Absent   Total
NSAID User                32          138     170
NSAID Non-User           105          263     368
Total                    137          401     538

Source: Sivak-Sears, et al

Example - NSAIDs and GBM


$OR = \frac{32(263)}{138(105)} = \frac{8416}{14490} = 0.58$
$v = \frac{1}{32} + \frac{1}{138} + \frac{1}{105} + \frac{1}{263} = 0.0518$
95% CI: $\left( 0.58 e^{-1.96\sqrt{0.0518}} \; , \; 0.58 e^{1.96\sqrt{0.0518}} \right) = (0.37 , 0.91)$

Interval is entirely below 1; NSAID use appears to be lower among cases than controls
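
A minimal Python sketch of the odds ratio and its 95% confidence interval for this example follows. The function name `odds_ratio_ci` is an assumption made for the illustration; the arguments are the four cell counts of the 2x2 table in the notation used above.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Return (OR, lower, upper) from the four cell counts of a 2x2 table."""
    odds_ratio = (n11 * n22) / (n12 * n21)
    # Variance term of log(OR): sum of reciprocal cell counts.
    v = 1 / n11 + 1 / n12 + 1 / n21 + 1 / n22
    margin = math.exp(z * math.sqrt(v))
    return odds_ratio, odds_ratio / margin, odds_ratio * margin


# NSAID example: users (32 GBM present, 138 absent), non-users (105, 263).
odds_ratio, lower, upper = odds_ratio_ci(32, 138, 105, 263)
print(round(odds_ratio, 2), round(lower, 2), round(upper, 2))
```

Unlike the relative risk, this calculation needs only the cell counts, which is why it remains usable in a retrospective (case-control) design where the group totals are fixed by the sampling.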

Absolute Risk
Difference between the proportions with an outcome characteristic for 2 groups
Sample proportions with characteristic from groups 1 and 2:
$\hat{\pi}_1 = \frac{n_{11}}{n_{1.}} \qquad \hat{\pi}_2 = \frac{n_{21}}{n_{2.}}$

Absolute Risk
Estimated Absolute Risk: $AR = \hat{\pi}_1 - \hat{\pi}_2$
95% Confidence Interval for Population Absolute Risk:
$AR \pm 1.96 \sqrt{\frac{\hat{\pi}_1 (1 - \hat{\pi}_1)}{n_{1.}} + \frac{\hat{\pi}_2 (1 - \hat{\pi}_2)}{n_{2.}}}$

Absolute Risk
Interpretation
Conclude that the probability that the outcome is
present is higher (in the population) for group 1 if
the entire interval is positive
Conclude that the probability that the outcome is
present is lower (in the population) for group 1 if
the entire interval is negative
Do not conclude that the probability of the
outcome differs for the two groups if the interval
contains 0

Example - Coccidioidomycosis and TNF-antagonists
Group 1: Patients on TNF
Group 2: Patients not on TNF
$\hat{\pi}_1 = \frac{7}{247} = .0283 \qquad \hat{\pi}_2 = \frac{4}{738} = .0054$
$AR = \hat{\pi}_1 - \hat{\pi}_2 = .0283 - .0054 = .0229$
95% CI: $.0229 \pm 1.96 \sqrt{\frac{.0283(.9717)}{247} + \frac{.0054(.9946)}{738}} = .0229 \pm .0213 = (0.0016 , 0.0442)$

Interval is entirely positive; TNF is associated with higher risk
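
The absolute risk interval can be checked with the sketch below. The function name `absolute_risk_ci` is hypothetical, and the code uses unrounded proportions, so the endpoints agree with the hand computation only to about three decimal places.

```python
import math


def absolute_risk_ci(n11, n1_total, n21, n2_total, z=1.96):
    """Return (AR, lower, upper) for the difference of two proportions."""
    p1 = n11 / n1_total
    p2 = n21 / n2_total
    ar = p1 - p2
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p1 * (1 - p1) / n1_total + p2 * (1 - p2) / n2_total)
    return ar, ar - z * se, ar + z * se


# TNF example: 7 of 247 on TNF and 4 of 738 not on TNF developed COC.
ar, lower, upper = absolute_risk_ci(7, 247, 4, 738)
print(round(ar, 4), round(lower, 4), round(upper, 4))
```

Here the interval is symmetric around the estimate, and the comparison value for "no difference" is 0 rather than 1.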

Ordinal Explanatory and Response Variables
Pearson's Chi-square test can be used to test associations among ordinal variables, but more powerful methods exist
When theories exist that the association is directional (positive or negative), measures exist to describe and test for these specific alternatives from independence:
Gamma
Kendall's $\tau_b$

Concordant and Discordant Pairs


Concordant Pairs - Pairs of individuals where one
individual scores higher on both ordered variables
than the other individual
Discordant Pairs - Pairs of individuals where one
individual scores higher on one ordered variable
and the other individual scores higher on the other
C = # Concordant Pairs D = # Discordant Pairs
Under Positive association, expect C > D
Under Negative association, expect C < D
Under No association, expect C ≈ D

Example - Alcohol Use and Sick Days


Alcohol Risk (Without Risk, Hardly any Risk, Some to Considerable Risk)
Sick Days (0, 1-6, 7+)
Concordant Pairs - Pairs of respondents where one scores higher on both alcohol risk and sick days than the other
Discordant Pairs - Pairs of respondents where one scores higher on alcohol risk and the other scores higher on sick days
Source: Hermansson, et al (2003)

Example - Alcohol Use and Sick Days


ALCOHOL * SICKDAYS Crosstabulation
Count
                                   SICKDAYS
ALCOHOL                  0 days   1-6 days   7+ days   Total
Without Risk                347        113       145     605
Hardly any Risk             154         63        56     273
Some-Considerable Risk       52         25        34     111
Total                       553        201       235     989

Concordant Pairs: Each individual in a given cell is concordant with each individual in cells Southeast of theirs
Discordant Pairs: Each individual in a given cell is discordant with each individual in cells Southwest of theirs

Example - Alcohol Use and Sick Days


ALCOHOL * SICKDAYS Crosstabulation
Count
                                   SICKDAYS
ALCOHOL                  0 days   1-6 days   7+ days   Total
Without Risk                347        113       145     605
Hardly any Risk             154         63        56     273
Some-Considerable Risk       52         25        34     111
Total                       553        201       235     989

C = 347(63+56+25+34) + 113(56+34) + 154(25+34) + 63(34) = 83164
D = 145(154+63+52+25) + 113(154+52) + 56(52+25) + 63(52) = 73496
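
The Southeast/Southwest counting rule can be written as a short loop. The sketch below is one possible way to do it, with rows and columns assumed to be in increasing order of the two ordinal variables; the function name `concordant_discordant` is illustrative.

```python
# Alcohol risk (rows, low to high) by sick days (columns: 0, 1-6, 7+).
table = [
    [347, 113, 145],
    [154, 63, 56],
    [52, 25, 34],
]


def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in an ordered r x c table."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = 0
    discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Compare cell (i, j) with every cell in a lower row.
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:    # Southeast: higher on both variables
                        concordant += table[i][j] * table[k][m]
                    elif m < j:  # Southwest: higher on one, lower on the other
                        discordant += table[i][j] * table[k][m]
    return concordant, discordant


C, D = concordant_discordant(table)
print(C, D)
```

Pairs that fall in the same row or the same column are tied on one variable and are counted as neither concordant nor discordant.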

Measures of Association
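Goodman and Kruskal's Gamma:
$\hat{\gamma} = \frac{C - D}{C + D} \qquad -1 \leq \hat{\gamma} \leq 1$

Kendall's $\tau_b$:
$\hat{\tau}_b = \frac{C - D}{0.5 \sqrt{\left(n^2 - \sum n_{i.}^2\right)\left(n^2 - \sum n_{.j}^2\right)}}$

When there's no association between the ordinal variables, the population-based values of these measures are 0.
Statistical software packages provide these tests.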

Example - Alcohol Use and Sick Days


$\hat{\gamma} = \frac{C - D}{C + D} = \frac{83164 - 73496}{83164 + 73496} = 0.0617$

Symmetric Measures

                                       Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Ordinal by Ordinal   Kendall's tau-b    .035                   .030          1.187           .235
                     Gamma              .062                   .052          1.187           .235
N of Valid Cases                         989

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
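
Both measures can be reproduced from the pair counts and the marginal totals. The sketch below assumes C = 83164 and D = 73496 from the alcohol and sick days example, and the row and column totals of that crosstabulation; the function names are hypothetical. It computes only the point estimates, not the standard errors or significance tests shown in the software output.

```python
import math

C, D = 83164, 73496            # concordant and discordant pair counts
row_totals = [605, 273, 111]   # alcohol risk margins
col_totals = [553, 201, 235]   # sick days margins


def goodman_kruskal_gamma(c, d):
    """Gamma = (C - D) / (C + D)."""
    return (c - d) / (c + d)


def kendall_tau_b(c, d, row_totals, col_totals):
    """tau-b = (C - D) / (0.5 * sqrt((n^2 - sum ni.^2)(n^2 - sum n.j^2)))."""
    n = sum(row_totals)
    row_term = n ** 2 - sum(r ** 2 for r in row_totals)
    col_term = n ** 2 - sum(c_j ** 2 for c_j in col_totals)
    return (c - d) / (0.5 * math.sqrt(row_term * col_term))


gamma_hat = goodman_kruskal_gamma(C, D)
tau_b = kendall_tau_b(C, D, row_totals, col_totals)
print(round(gamma_hat, 4), round(tau_b, 3))
```

Gamma ignores tied pairs entirely, while tau-b adjusts its denominator for ties, which is why tau-b is the smaller of the two values here.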
