
LOGISTIC REGRESSION (Chapter 20)

Example - High Dieldrin Levels in Western Australian Breast Feeding Mothers
Data File: Pestmilk.JMP
These data come from a study of breast feeding mothers in Western Australia in 1979-80.
Earlier research had discovered surprisingly high levels of pesticides in human breast
milk. The research conducted in 1979-80 hoped to show that the levels had decreased as
a result of stricter government regulations on the use of pesticides on food crops. The
researchers did find decreases for several types of pesticides. Levels of the pesticide
Dieldrin, however, had substantially increased. These data were collected in the hope of
explaining why.
For 45 breast milk donors, we have information on the mother's age in years, whether
she lived in a new suburb (0 = no, 1 = yes), whether her house had been treated for
termites within the past three years (0 = no, 1 = yes), and whether her breast milk
contained above average (> .009 ppm) levels of the pesticide Dieldrin. By law, new
houses in Australia are treated for termites.
The variables in the Pestmilk.JMP data file are:
age - age of mother (yrs.)
ns - new suburb indicator (1 = yes, 0 = no)
ht - house treated for termites in the last 3 years (1 = yes, 0 = no)
hd - high Dieldrin level (1 = yes, 0 = no)
New Sub (New or Old)
Treated (HT = house treated or NT= not treated)
High Dieldrin (High or Low)
Important JMP Note: For interpretation purposes it is best to code the
outcome so that the adverse outcome is alphabetically first. The same is true
for risk factors, code them so the level that would be associated with
increased risk is alphabetically first.
One way to examine the relationship between the response (High Dieldrin) and the
categorical predictors (New Sub & Treated) is to construct 2 X 2 contingency tables and
compute conditional probabilities, relative risks, and odds ratios. The tables and plots
below were obtained in JMP by using Fit Y by X and placing each of the predictors
(New Sub & Treated) in the X box and the response (High Dieldrin) in the Y box. The
results are shown below.
The plots and the contingency tables with the conditional probabilities added suggest that
both living in a new suburb (New Sub) and living in a home treated for termites (HT) lead
to increased risk of having high dieldrin levels in breast milk.
Contingency Analysis of High Dieldrin By Treated
[Mosaic plot: proportion of High vs. Low dieldrin (vertical axis, 0.00 to 1.00) by Treated (HT, NT)]
OR = (13*16)/(3*11) = 6.30 Mothers living in a home treated for termites have 6.30
times higher odds for having high dieldrin levels in their breast milk when compared to
mothers living in homes not treated for termites.
Contingency Analysis of High Dieldrin By New Sub
[Mosaic plot: proportion of High vs. Low dieldrin (vertical axis, 0.00 to 1.00) by New Sub (New, Old)]
OR = (7*22)/(9*5) = 3.42 Mothers living in a new suburb have 3.42 times the odds of
having high dieldrin levels in their breast milk when compared to mothers living in an
older suburb.
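High Dieldrin By Treated: Count (Row %)
Treated    High            Low             Total
HT         13 (54.17%)     11 (45.83%)     24
NT          3 (15.79%)     16 (84.21%)     19
Total      16              27              43

High Dieldrin By New Sub: Count (Row %)
New Sub    High            Low             Total
New         7 (58.33%)      5 (41.67%)     12
Old         9 (29.03%)     22 (70.97%)     31
Total      16              27              43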
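The odds ratios quoted above can be reproduced directly from the table counts. The following is a minimal Python sketch, not JMP output; the function name `two_by_two_summary` and its argument names are illustrative assumptions.

```python
# Sketch: conditional probabilities, relative risk, and odds ratio
# from a 2 x 2 table. Counts come from the two tables above; the
# function and variable names are illustrative, not part of JMP.

def two_by_two_summary(high_exposed, low_exposed, high_unexposed, low_unexposed):
    """Return P(High|exposed), P(High|unexposed), relative risk, odds ratio."""
    p_exposed = high_exposed / (high_exposed + low_exposed)
    p_unexposed = high_unexposed / (high_unexposed + low_unexposed)
    relative_risk = p_exposed / p_unexposed
    odds_ratio = (high_exposed * low_unexposed) / (low_exposed * high_unexposed)
    return p_exposed, p_unexposed, relative_risk, odds_ratio

treated = two_by_two_summary(13, 11, 3, 16)   # HT vs. NT
new_sub = two_by_two_summary(7, 5, 9, 22)     # New vs. Old

for label, result in [("Treated", treated), ("New Sub", new_sub)]:
    p1, p0, rr, oratio = result
    print(f"{label}: P(High|exposed)={p1:.4f}, P(High|unexposed)={p0:.4f}, "
          f"RR={rr:.2f}, OR={oratio:.2f}")
```

The odds ratio uses the cross-product form (a*d)/(b*c), the same arithmetic as in the OR lines above.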
Logistic Regression Model
In logistic regression we model the log of odds for success as a function of the predictors
using a linear model. For example, consider the logistic regression model for the risk
factor New Suburb.
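$$\ln(\text{odds for high dieldrin}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \, \text{NewSuburb}$$
where
$$\text{NewSuburb} = \begin{cases} +1 & \text{if mother lives in new suburb} \\ -1 & \text{if mother lives in old suburb} \end{cases}$$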
The log odds for a breast feeding mother living in a new suburb is given by
$$\ln(\text{odds for High for mothers living in a new suburb}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1$$
and for a mother living in an old suburb is given by
$$\ln(\text{odds for High for mothers living in an old suburb}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 - \beta_1$$
The difference in the log odds is equivalent to the log of the odds ratio (OR) because of
the following property of logarithms.

$$\ln(x) - \ln(y) = \ln\left(\frac{x}{y}\right)$$
Applying this property here we have
$$\ln(\text{odds for High for mothers in a new suburb}) - \ln(\text{odds for High for mothers in an old suburb})$$
$$= \ln\left(\frac{\text{odds for High for mothers in a new suburb}}{\text{odds for High for mothers in an old suburb}}\right) = (\beta_0 + \beta_1) - (\beta_0 - \beta_1) = 2\beta_1$$
This says that the OR associated with living in a new suburb is given by
$$OR = e^{2\beta_1}$$
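To see the +1 / -1 coding at work, here is a small Python sketch. It assumes a model with a single two-level predictor, in which case the fitted log odds in each group equal the observed log odds, so the coefficients can be recovered from the 2 X 2 counts without running JMP. The function name `effect_coded_fit` is hypothetical.

```python
import math

def effect_coded_fit(high_pos, low_pos, high_neg, low_neg):
    """Coefficients of ln(odds) = b0 + b1*x for one predictor x coded +1 / -1.

    Assumption: with a single two-level predictor the fitted log odds
    match the observed log odds in each group, so b0 and b1 follow
    directly from the table counts.
    """
    log_odds_pos = math.log(high_pos / low_pos)   # group coded +1
    log_odds_neg = math.log(high_neg / low_neg)   # group coded -1
    b0 = (log_odds_pos + log_odds_neg) / 2
    b1 = (log_odds_pos - log_odds_neg) / 2
    return b0, b1

# New Sub table: New = 7 High / 5 Low, Old = 9 High / 22 Low
b0, b1 = effect_coded_fit(7, 5, 9, 22)
odds_ratio = math.exp(2 * b1)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, OR = exp(2*b1) = {odds_ratio:.2f}")
```

Because the two groups sit at +1 and -1, they are two units apart on the coded scale, which is why the exponent is 2*b1 rather than b1.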
Fitting the New Suburb Logistic Regression Model in JMP
Select Fit Model and place High Dieldrin in the Y box and New Suburb in the Model
Effects box.
Resulting output
The estimated OR associated with living in a new suburb is then
We can use JMP to compute the ORs by selecting Nominal Logistic > Odds Ratio
Similarly for House Treated we have the following logistic regression model.
Finding Predicted Probabilities
The logistic regression model can be used to estimate the probability of success given a
set of predictor values as follows:
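$$p = P(\text{success} \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
for situations where we have a single predictor, and is given by
$$p = P(\text{success} \mid X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$
for situations where we have p predictors.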


For the example above we can estimate the probability of high dieldrin levels for women
living in a home treated for termites as follows:
$$P(\text{High} \mid \text{House Treated}) = \frac{e^{-.7535 + .9205}}{1 + e^{-.7535 + .9205}} = .5417$$
$$P(\text{High} \mid \text{House Not Treated}) = \frac{e^{-.7535 - .9205}}{1 + e^{-.7535 - .9205}} = .1579$$
How do these estimated probabilities compare to those we
obtain by using a 2 X 2 contingency table?
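One way to check the comparison is to evaluate the logistic formula and the table row proportions side by side. This Python sketch uses the House Treated coefficients quoted above (-.7535 and .9205) and assumes the +1 / -1 coding for Treated; the helper name `logistic` is illustrative.

```python
import math

def logistic(eta):
    """Convert log odds (eta) to a probability."""
    return math.exp(eta) / (1 + math.exp(eta))

b0, b1 = -0.7535, 0.9205            # coefficients quoted in the text

p_treated = logistic(b0 + b1 * (+1))      # Treated coded +1
p_not_treated = logistic(b0 + b1 * (-1))  # Not treated coded -1

row_pct_treated = 13 / 24           # High among HT in the 2 x 2 table
row_pct_not_treated = 3 / 19        # High among NT in the 2 x 2 table

print(f"model: {p_treated:.4f}  table: {row_pct_treated:.4f}")
print(f"model: {p_not_treated:.4f}  table: {row_pct_not_treated:.4f}")
```

Up to rounding of the coefficients, the model probabilities agree with the row percentages in the contingency table.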
We now consider the age effect. Again select Fit Y by X from the Analyze menu and
place High Dieldrin in the Y box and age in the X box. The resulting output is given
below.
Logistic Fit of High Dieldrin By age
[Logistic plot: fitted probability curve separating High and Low (vertical axis, 0.00 to 1.00) against age (horizontal axis, 20 to 35)]
Whole Model Test
Model         -LogLikelihood    DF    ChiSquare    Prob>ChiSq
Difference     0.924219          1    1.848438     0.1740
Full          27.458371
Reduced       28.382590

RSquare (U)                   0.0326
Observations (or Sum Wgts)    43
Converged by Gradient

Parameter Estimates
Term         Estimate       Std Error    ChiSquare    Prob>ChiSq
Intercept    -4.0886156     2.7245511    2.25         0.1334
age           0.12223765    0.0922011    1.76         0.1849
For log odds of High/Low
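The entries in this output are linked by simple arithmetic, which the following Python sketch reproduces from the printed values. It assumes the usual relationships: the likelihood ratio ChiSquare is twice the difference in -LogLikelihood, RSquare (U) is that difference divided by the Reduced value, and each Wald ChiSquare is (Estimate / Std Error) squared. Variable names are illustrative.

```python
import math

neg_loglik_full = 27.458371
neg_loglik_reduced = 28.382590

# Likelihood ratio chi-square for the whole model test (1 df here)
difference = neg_loglik_reduced - neg_loglik_full
chi_square = 2 * difference

# Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

# RSquare (U): share of the reduced-model -LogLikelihood explained
r_square_u = difference / neg_loglik_reduced

def wald_chi_square(estimate, std_error):
    """Wald test statistic for a single coefficient."""
    return (estimate / std_error) ** 2

wald_intercept = wald_chi_square(-4.0886156, 2.7245511)
wald_age = wald_chi_square(0.12223765, 0.0922011)

print(f"ChiSquare = {chi_square:.6f}, Prob>ChiSq = {p_value:.4f}")
print(f"RSquare (U) = {r_square_u:.4f}")
print(f"Wald: intercept = {wald_intercept:.2f}, age = {wald_age:.2f}")
```

The `erfc` shortcut only applies to 1 degree of freedom; a model with more predictors would need a general chi-square tail function.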
The logistic model using age as a predictor is given by
$$\ln\left(\frac{p}{1-p}\right) = \hat{\beta}_0 + \hat{\beta}_1 \, \text{Age} = -4.0886156 + .1222 \cdot \text{Age}$$
Note: The response in logistic regression is the natural log of the odds for success.
The blue curve added to the plot gives P(High|Age) = p. For example, for mothers 25
years of age the predicted probability of finding a high dieldrin level in their breast
milk is about .25. For mothers 35 years of age this probability increases to around .50.
The distance from the top of the plot to the curve represents P(Low|Age). To attach an
odds ratio to mother's age we need to pick an incremental increase of interest, e.g.
suppose we wanted to find the odds ratio associated with a 5-year increase in age. The
associated odds ratio is found as follows:
$$\text{OR for 5-year increase in age} = e^{5 \cdot .122} = 1.84$$
Thus for a 5-year increase in age a mother's odds for having high dieldrin are 1.84 times
higher, or alternatively there is an 84% increase in her odds for having high dieldrin
levels in her breast milk.
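The same calculation works for any increment of a continuous predictor. A minimal Python sketch, using the age slope .1222 from the output above (the function name `odds_ratio_for_increase` is an illustrative assumption):

```python
import math

def odds_ratio_for_increase(beta, increase):
    """Odds ratio for raising a continuous predictor by `increase` units."""
    return math.exp(beta * increase)

beta_age = 0.1222                      # slope for age from the fitted model

or_5 = odds_ratio_for_increase(beta_age, 5)
percent_increase = (or_5 - 1) * 100

print(f"OR for a 5-year increase: {or_5:.2f}")
print(f"Percent increase in odds: {percent_increase:.0f}%")
```

Note that the 5-year OR is the 1-year OR raised to the 5th power, not 5 times the 1-year OR, because the model is linear on the log odds scale.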
Predicted Probabilities for Logistic Model Using Age
We can use the logistic regression model to obtain predicted probabilities of high dieldrin
levels as a function of age by using
$$P(\text{High} \mid \text{Age}) = \frac{e^{-4.089 + .1222 \cdot \text{Age}}}{1 + e^{-4.089 + .1222 \cdot \text{Age}}}$$
For example,
$$P(\text{High} \mid \text{Age} = 25) = \frac{e^{-4.089 + .1222 \cdot 25}}{1 + e^{-4.089 + .1222 \cdot 25}} = .2623$$
$$P(\text{High} \mid \text{Age} = 35) = \frac{e^{-4.089 + .1222 \cdot 35}}{1 + e^{-4.089 + .1222 \cdot 35}} = .5469$$
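These two probabilities can be checked with a few lines of Python. The sketch below hard-codes the rounded coefficients used in the formula above (-4.089 and .1222); the function name `p_high_given_age` is an illustrative assumption.

```python
import math

B0 = -4.089      # intercept, rounded as in the formula above
B1 = 0.1222      # slope for age

def p_high_given_age(age):
    """Predicted probability of a high dieldrin level at a given age."""
    eta = B0 + B1 * age                  # log odds
    return math.exp(eta) / (1 + math.exp(eta))

for age in (25, 30, 35):
    print(f"P(High | Age = {age}) = {p_high_given_age(age):.4f}")
```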
Multiple Logistic Regression Model
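Now we consider a logistic regression model that uses all three predictors.
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \, \text{NewSuburb} + \beta_2 \, \text{Treated} + \beta_3 \, \text{Age}$$
where
$$\text{NewSuburb} = \begin{cases} +1 & \text{if mother lives in new suburb} \\ -1 & \text{if mother lives in old suburb} \end{cases}$$
$$\text{Treated} = \begin{cases} +1 & \text{if mother lives in a home treated for termites} \\ -1 & \text{if mother lives in a home not treated for termites} \end{cases}$$
Age = mother's age in years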
Select Fit Model from the Analyze menu and put the high dieldrin indicator in the Y box
and Age, HT, and New Sub in the Effects in Model box as shown at the top of the
following page.
The resulting output is shown below.
The Whole Model Test is testing
$H_o$: The logistic model is NOT useful.
$H_a$: The logistic model is useful.
The p-value = .0013, so here we have evidence to suggest that the model is
useful for explaining the presence of high dieldrin levels in a mother's breast milk.
The Lack of Fit test is testing
$H_o$: The model is adequate.
$H_a$: The model is inadequate, i.e. there is lack of fit.
The p-value = .2220, so there is no evidence of lack of fit.
The Parameter Estimates and Effect Wald Tests both contain the results of tests that
are used to test the significance of the predictors in the logistic model. Here we see
that both the new suburb and house treated indicators are statistically significant at
the .05 level, while mother's age is significant at the .10 level.
Finding ORs associated with the predictors
For a dichotomous (two-level) categorical predictor, e.g. new suburb and house treated,
in order to find the associated OR we do the following:
OR associated with risk factor i = $\exp(2\hat{\beta}_i)$, i.e. $e^{2\hat{\beta}_i}$.
Examples:
For New Suburb we have: OR = exp(2 * 1.0703), approximately 8.5
For House Treated we have: OR = exp(2 * 1.2984), approximately 13.4
To find a crude 95% CI for the OR associated with risk factor i we compute
$$\exp\left(2\left(\hat{\beta}_i \pm (\text{normal or t-table value}) \cdot SE(\hat{\beta}_i)\right)\right)$$
which will give lower and upper confidence limits for the true OR associated with the
risk factor.
Examples:
For New Suburb we have:
$$\exp(2 \cdot (1.0703 \pm 1.96 \cdot .4678)) = (1.359, 53.22)$$
For House Treated we have:
$$\exp(2 \cdot (1.2984 \pm 1.96 \cdot .4873)) = (1.986, 90.65)$$
These intervals are very wide because the sample size (n = 45) is not very big. Typically
these types of studies require a larger sample size to get precise CIs for ORs.
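The crude intervals above are easy to reproduce. The Python sketch below assumes the +1 / -1 coding (hence the factor of 2) and the normal value 1.96; the function name `or_with_wald_ci` is hypothetical, and the estimates and standard errors are those shown in the examples.

```python
import math

def or_with_wald_ci(beta_hat, std_error, z=1.96):
    """OR and crude CI for a +1 / -1 coded risk factor.

    Computes exp(2 * (beta_hat +/- z * SE)); the factor 2 reflects the
    two-unit gap between the +1 and -1 codes.
    """
    estimate = math.exp(2 * beta_hat)
    lower = math.exp(2 * (beta_hat - z * std_error))
    upper = math.exp(2 * (beta_hat + z * std_error))
    return estimate, lower, upper

new_suburb = or_with_wald_ci(1.0703, 0.4678)
house_treated = or_with_wald_ci(1.2984, 0.4873)

for label, (est, lo, hi) in [("New Suburb", new_suburb),
                             ("House Treated", house_treated)]:
    print(f"{label}: OR = {est:.2f}, 95% CI = ({lo:.3f}, {hi:.2f})")
```

Small differences in the last digit relative to the values quoted above can arise because the printed coefficients are rounded.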
We can obtain both the ORs and their confidence intervals using JMP as follows.
Select both the options
The resulting output is shown on the following page.
Multiple Logistic Regression Model
Odds Ratios calculates the odds ratios for
all predictors in the model.
Confidence Intervals provides CIs for
the odds ratios, calculated using a method
slightly different from the approach above.
ROC Curve draws an ROC curve, which
is shown and discussed later in the
handout. (Professional JMP only!)
The ORs associated with living in a home treated for termites and living in a new suburb
are considerably larger than those found when examining their effects independently. The
differences arise because the factors themselves are potentially related, and as a result
their estimated effects differ when they are placed in a model jointly.
The odds ratio reported for age is found by using Max(Age) - Min(Age) as the
incremental increase. For these data Max(Age) = 37 and Min(Age) = 21, thus a mother
who is 37 has 28.055 times higher odds for having high dieldrin levels in her breast milk
when compared to a mother who is 21 years of age. It is better to use an increment like 5
years instead, i.e. the OR associated with a 5 year increase in age is calculated as follows:
$$OR = \exp(.2083 \cdot 5) = \exp(1.042) = 2.833$$
As stated previously, the confidence intervals for all of the ORs are quite broad in this
study because the sample size is small (n = 45).
Predicted Probabilities Using All Available Predictors
The predicted probabilities of high dieldrin can be found as follows.
$$P(\text{High Dieldrin} \mid \text{House Treated}, \text{New Suburb}, \text{Age}) = \frac{e^{-6.604 + 1.070 \, \text{NewSuburb} + 1.298 \, \text{HouseTreated} + .2084 \, \text{Age}}}{1 + e^{-6.604 + 1.070 \, \text{NewSuburb} + 1.298 \, \text{HouseTreated} + .2084 \, \text{Age}}}$$
For example, the probability of high dieldrin for a 30 year old mother living in a home
treated for termites in an old suburb is estimated to be:
$$P(\text{High} \mid \text{Old Suburb}, \text{House Treated}, \text{Age} = 30) = \frac{e^{-6.604 - 1.070 + 1.298 + .2084 \cdot 30}}{1 + e^{-6.604 - 1.070 + 1.298 + .2084 \cdot 30}} = .4690$$
For a 25 year old mother living in a home treated for termites located in a new suburb the
probability of high dieldrin is estimated to be:
$$P(\text{High} \mid \text{New Suburb}, \text{House Treated}, \text{Age} = 25) = \frac{e^{-6.604 + 1.070 + 1.298 + .2084 \cdot 25}}{1 + e^{-6.604 + 1.070 + 1.298 + .2084 \cdot 25}} = .7259$$
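A short Python sketch of the same calculation may help make the sign convention explicit: each two-level factor enters as +1 or -1, so an old suburb subtracts 1.070 from the log odds rather than contributing zero. The coefficients are the rounded values used above; the function name and boolean arguments are illustrative assumptions.

```python
import math

# Rounded coefficients from the fitted three-predictor model above
INTERCEPT = -6.604
B_NEW_SUBURB = 1.070
B_HOUSE_TREATED = 1.298
B_AGE = 0.2084

def p_high(new_suburb, house_treated, age):
    """Predicted P(High Dieldrin) with +1 / -1 coding for the two factors."""
    ns = 1 if new_suburb else -1
    ht = 1 if house_treated else -1
    eta = INTERCEPT + B_NEW_SUBURB * ns + B_HOUSE_TREATED * ht + B_AGE * age
    return math.exp(eta) / (1 + math.exp(eta))

# 30 year old, old suburb, treated home
print(f"{p_high(new_suburb=False, house_treated=True, age=30):.4f}")
# 25 year old, new suburb, treated home
print(f"{p_high(new_suburb=True, house_treated=True, age=25):.4f}")
```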
Estimates of the P(High Dieldrin|New Suburb, House Treated, Age) Using
Professional Version of JMP (FYI)
Selecting Save Probability Formula from the Nominal Logistic Fit pull down menu
places the predicted probabilities of high and low dieldrin levels in the spreadsheet along
with the predicted status. The predicted status is determined by whichever probability is
larger, low dieldrin level or high dieldrin level, given their demographics.
Here is a portion of this output which will appear back in the original data spreadsheet.
P(Low|X) P(High|X)
We can compare the predicted dieldrin status to the actual via a contingency table. Select
Fit Y by X from the Analyze menu and place Most Likely High Dieldrin in the X box and
High Dieldrin in the Y box. The table and mosaic plot are shown below.
Contingency Analysis of High Dieldrin By MostLikely High Dieldrin
[Mosaic plot: proportion of actual High vs. Low dieldrin (vertical axis, 0.00 to 1.00) by MostLikely High Dieldrin (High, Low)]
From the table we see that 26.7% of mothers classified as having high dieldrin levels
actually had low dieldrin levels, similarly 17.9% of those classified as having low
dieldrin levels actually had high dieldrin levels. In total 9 out of 43 mothers were
misclassified for an estimated overall error rate of 20.9%.
Receiver Operating Characteristic (ROC) Curve
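Classification table: Predicted Status (rows) by Actual Status (columns), Count (Row %)
Predicted    Actual High     Actual Low      Total
High         11 (73.33%)      4 (26.67%)     15
Low           5 (17.86%)     23 (82.14%)     28
Total        16              27              43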
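The error rates quoted for this classification table, along with the sensitivity and specificity of the "most likely" rule, follow directly from the four counts. A minimal Python sketch (variable names are illustrative; sensitivity and specificity are standard definitions and are not printed in the table itself):

```python
# Counts from the classification table: predicted status vs. actual status
pred_high_actual_high = 11   # true positives
pred_high_actual_low = 4     # false positives
pred_low_actual_high = 5     # false negatives
pred_low_actual_low = 23     # true negatives

total = (pred_high_actual_high + pred_high_actual_low
         + pred_low_actual_high + pred_low_actual_low)

# Row percentages: errors among those classified High / classified Low
error_among_pred_high = pred_high_actual_low / (pred_high_actual_high + pred_high_actual_low)
error_among_pred_low = pred_low_actual_high / (pred_low_actual_high + pred_low_actual_low)

# Overall misclassification rate
overall_error = (pred_high_actual_low + pred_low_actual_high) / total

# Column-based rates: one point on the ROC curve for this cutoff
sensitivity = pred_high_actual_high / (pred_high_actual_high + pred_low_actual_high)
specificity = pred_low_actual_low / (pred_low_actual_low + pred_high_actual_low)

print(f"overall error rate = {overall_error:.3f}")
print(f"sensitivity = {sensitivity:.4f}, 1 - specificity = {1 - specificity:.4f}")
```

Classifying by whichever probability is larger corresponds to a single cutoff of .50, so it gives one (1 - specificity, sensitivity) point; an ROC curve traces such points over all cutoffs.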
The Receiver Operating Characteristic (ROC) curve plots the true positive probability
(sensitivity) vs. the false positive probability (1 - specificity). As the sensitivity
increases the false positive rate increases, as expected. A good classification rule based
upon a logistic model should have an area beneath the ROC curve of .90 or higher. Here we
do not quite meet that standard.
Receiver Operating Characteristic
[ROC curve: True Positive (Sensitivity) on the vertical axis vs. False Positive (1-Specificity) on the horizontal axis, both from 0.00 to 1.00]
Area Under ROC Curve = 0.83449
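To illustrate how an ROC curve and its area are built from predicted probabilities, here is a small Python sketch. The labels and scores below are hypothetical toy values, not the Pestmilk data, and the function names are illustrative; JMP's own computation may differ in detail.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, one per cutoff.

    labels: 1 = High (event), 0 = Low; scores: predicted P(High).
    A case is classified High when its score is >= the cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical toy data (NOT from the study)
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(f"AUC for the toy data = {auc:.2f}")
```

An area of 1.0 means every High case scores above every Low case, while .50 corresponds to a rule no better than chance.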
Example 2: Risk Factors for Low Birth Weight
These data come from a case-control study where risk factors for having an infant with
low birth weight (< 2500g) were studied. The following information was recorded for
each mother in the study: (Data File: LowBirth)
Low Birth Weight - indicator of birth weight status (Low or Normal)
Prev? - previous history of premature labor (History or None)
Hyper - hypertension during pregnancy (HT or Normal)
Smoke - mother smoked during pregnancy (Cig or No Cig)
Uterine - uterine irritability during pregnancy (Irritation or None)
Minority - minority status of mother (Nonwhite or White)
Age - age of mother
Lwt - mother's weight at last menstrual cycle
Important JMP Note: For interpretation purposes it is best to code the
outcome so that the adverse outcome is alphabetically first. The same is true
for risk factors, code them so the level that would be associated with
increased risk is alphabetically first.
To fit the multiple logistic regression model select Analyze > Fit Model and set up the
dialog box as shown below.
After using backward elimination to remove non-significant predictors (uterine irritability
and mother's age here), we have the following.
The only predictor which represents something a mother could control or change is
smoking during pregnancy. This is the primary factor of interest in this study and the
other factors, while interesting, are there for control purposes only. In summarizing the
effect of smoking we would see a phrase such as: adjusting for age, pre-pregnancy weight,
race, hypertension, uterine irritability, and previous history of premature labor, we find
the OR associated with smoking is OR = 2.66. This says that, after adjusting for these
factors, the odds for having a low birth weight infant are 2.66 times larger for mothers
who smoked during pregnancy.