
GENERALIZED LINEAR MODEL

LOGISTIC REGRESSION

Generalized Linear Models


Traditional applications of linear models, such as
simple linear regression (SLR) and multiple linear regression, assume that
the response variable is:
- Normally distributed
- Constant in variance
- Independent

There are many situations where these
assumptions are inappropriate:
- The response is binary (0, 1), or a count
- The response is continuous, but non-normal

Some Approaches to These Problems


- Data transformation
  - Induce approximate normality
  - Stabilize variance
  - Simplify model form
- Weighted least squares
  - Often used to stabilize variance
- Generalized linear models (GLM)
  - Approach is about 25-30 years old; it unifies linear and nonlinear
    regression models
  - Response distribution is a member of the exponential family
    (normal, exponential, gamma, binomial, Poisson)

Generalized Linear Models


- Original applications were in the biopharmaceutical sciences
- There is much recent interest in GLMs in industrial statistics
- GLMs include linear regression and OLS as a special case
- Parameter estimation is by maximum likelihood
  (assume that the response distribution is known)
- Inference on parameters is based on large-sample or
  asymptotic theory
- We will consider logistic regression, Poisson regression,
  then the GLM

Logistic regression: an overview


1. Models with binary outcomes
2. Problems with linear models using binary outcomes
3. What logit models look like
4. Predicting y and/or odds of y for a given x

1. Binary outcomes
Binary outcomes are outcomes with two possible values: success
or failure.
The outcome of interest (success) is commonly scored 1 if it
occurs, otherwise 0 (failure). The units of analysis for binary
(0, 1) outcomes are individuals.
- Binary outcomes occur often in the biopharmaceutical field: dose-response
  studies, bioassays, clinical trials
- Industrial applications include failure analysis, fatigue
  testing, and reliability testing. Example: functional electrical
  testing on a semiconductor can yield success, in which case the
  device works, or failure due to a short, an open, or some other
  failure mode
- Other examples: college graduation, employment,
  improvement under a treatment

Binary Response Variables


Possible model:

y_i = β0 + Σ_{j=1}^{k} β_j x_ij + ε_i = x_i'β + ε_i,   i = 1, 2, ..., n
y_i = 0 or 1

The response y_i is a Bernoulli random variable:

P(y_i = 1) = π_i, with 0 ≤ π_i ≤ 1
P(y_i = 0) = 1 − π_i
E(y_i) = π_i = x_i'β
Var(y_i) = σ²_{y_i} = π_i(1 − π_i)

2. Problems With This Model


- The error terms take on only two values, so they can't
  possibly be normally distributed.
- The error distribution is neither identical nor normal.
- The variance of the observations is a function of the mean
  (see previous slide).
- Heteroskedasticity is more of a problem, but still often not
  fatal, because it acts in a conservative direction.

2. Problems With This Model


- A linear response function could result in predicted values that
  fall outside the [0, 1] range, which is impossible because

  0 ≤ E(y_i) = π_i = x_i'β ≤ 1

  Such predictions are nonsensical.
- Predictions can also be poor due to the nonlinear functional form,
  even within reasonable values of y.

Binary Response Variables: The Challenger Data

Temperature at launch (°F), with whether at least one O-ring failure
occurred recorded for each launch:
53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

[Scatter plot: At Least One O-ring Failure (0.0 to 1.0) vs. Temperature at Launch (50 to 80 °F)]

Data for space shuttle
launches and static tests
prior to the launch of
Challenger

A solution for nonlinear relationships:
Generalized Linear Models

Linear model: y_i = α + β x_i + ε_i
(identity transform: no change in y_i)

Generalized linear model: F(y_i) = α + β x_i + ε_i
(F is some function such that F(y) is linear in x)

Logit model: log( p_i / (1 − p_i) ) = α + β x_i + ε_i
(log( p_i / (1 − p_i) ) is the log odds, or logit, of p_i)

Binary Response Variables

- There is a lot of empirical evidence that the response
  function should be nonlinear; an S shape is quite logical
  (see the scatter plot of the Challenger data)
- The logistic response function is a common choice:

E(y) = exp(x'β) / (1 + exp(x'β)) = 1 / (1 + exp(−x'β))
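As a quick check of the two equivalent algebraic forms of the logistic response function, here is a minimal Python sketch (the function name is ours, for illustration only):

```python
import math

def logistic(eta):
    """Logistic response function: maps eta = x'beta in (-inf, inf) to (0, 1)."""
    return math.exp(eta) / (1.0 + math.exp(eta))

# The two forms on the slide are algebraically identical:
#   exp(x'b) / (1 + exp(x'b))  ==  1 / (1 + exp(-x'b))
for eta in (-4.0, -1.0, 0.0, 1.0, 4.0):
    assert abs(logistic(eta) - 1.0 / (1.0 + math.exp(-eta))) < 1e-12

print(round(logistic(0.0), 2))  # prints 0.5: the curve crosses 1/2 at eta = 0
```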

Logistic Regression Curve

[Plot: S-shaped logistic curve; probability (0.0 to 1.0) vs. predictor.
A companion panel shows the logit transform of π plotted against the
predictor, which is linear.]

The Logistic Response Function

The logistic response function can be easily linearized. Let:

η = x'β and π = E(y)

Define

η = ln( π / (1 − π) )

This is called the logit transformation.

Logistic Regression Model

Model:

y_i = E(y_i) + ε_i
where
E(y_i) = π_i = exp(x_i'β) / (1 + exp(x_i'β))

The model parameters are estimated by the method of
maximum likelihood (MLE).

Maximum Likelihood Estimation in Logistic Regression

The distribution of each observation y_i is

f_i(y_i) = π_i^{y_i} (1 − π_i)^{1 − y_i},   i = 1, 2, ..., n

The likelihood function is

L(y, β) = Π_{i=1}^{n} f_i(y_i) = Π_{i=1}^{n} π_i^{y_i} (1 − π_i)^{1 − y_i}

We usually work with the log-likelihood:

ln L(y, β) = Σ_{i=1}^{n} ln f_i(y_i) = Σ_{i=1}^{n} y_i ln( π_i / (1 − π_i) ) + Σ_{i=1}^{n} ln(1 − π_i)
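The Bernoulli log-likelihood above can be computed directly; a small Python sketch (the data and fitted probabilities here are made up for illustration):

```python
import math

def bernoulli_loglik(y, pi):
    """ln L = sum_i [ y_i ln(pi_i) + (1 - y_i) ln(1 - pi_i) ],
    which equals sum_i y_i ln(pi_i / (1 - pi_i)) + sum_i ln(1 - pi_i)."""
    return sum(yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
               for yi, p in zip(y, pi))

# Illustrative 0/1 responses and fitted probabilities (not from a real fit):
y = [1, 0, 1, 1, 0]
pi = [0.8, 0.3, 0.6, 0.9, 0.2]
print(round(bernoulli_loglik(y, pi), 4))
```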

Maximum Likelihood Estimation in Logistic Regression

- The maximum likelihood estimators (MLEs) of the model
  parameters are those values that maximize the likelihood
  (or log-likelihood) function
- ML has been around since the first part of the previous
  century
- It often gives estimators that are intuitively pleasing
- MLEs have nice properties: unbiased (for large samples),
  minimum variance (or nearly so), and they have an
  approximate normal distribution when n is large

Maximum Likelihood Estimation in Logistic Regression

If we have n_i trials at each observation, we can write the
log-likelihood as

ln L(y, β) = y'Xβ − Σ_{i=1}^{n} n_i ln[1 + exp(x_i'β)]

The derivative of the log-likelihood is

∂ ln L(y, β)/∂β = X'y − Σ_{i=1}^{n} n_i [ exp(x_i'β) / (1 + exp(x_i'β)) ] x_i
                = X'y − Σ_{i=1}^{n} n_i π_i x_i
                = X'y − X'μ   (because μ_i = n_i π_i)

Maximum Likelihood Estimation in Logistic Regression

Setting this last result to zero gives the maximum likelihood
score equations

X'(y − μ) = 0

These equations look easy to solve; we've actually seen them
before, in linear regression:

y = Xβ + ε
X'(y − ŷ) = 0 results from OLS or ML with normal errors
Since ŷ = Xβ̂, we have X'y − X'Xβ̂ = 0,
so X'Xβ̂ = X'y, and β̂ = (X'X)⁻¹X'y (OLS or the normal-theory MLE)

Maximum Likelihood Estimation in Logistic Regression

Solving the ML score equations in logistic regression isn't quite as
easy, because

μ_i = n_i π_i = n_i exp(x_i'β) / (1 + exp(x_i'β)),   i = 1, 2, ..., n

Logistic regression is a nonlinear model.

It turns out that the solution is actually fairly easy, and is based on
iteratively reweighted least squares (IRLS):
- An iterative procedure is necessary because parameter estimates
  must be updated from an initial guess through several steps
- Weights are necessary because the variance of the observations is
  not constant
- The weights are functions of the unknown parameters
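To make the iteration concrete, here is a bare-bones IRLS/Newton sketch for the Bernoulli (n_i = 1) case in pure Python; in practice you would use R's glm or a statistics library, and all names and data here are illustrative only:

```python
import math

def solve(A, b):
    """Gauss-Jordan elimination for the small p x p system A x = b."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def irls_logistic(X, y, n_iter=25):
    """Fit logistic regression by IRLS (Newton's method).
    X: rows [1, x1, ...] including the intercept column; y: 0/1 responses.
    Each step solves the weighted least squares system
        (X' W X) delta = X' (y - pi),  with weights w_i = pi_i (1 - pi_i)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        eta = [sum(b * xj for b, xj in zip(beta, row)) for row in X]
        pi = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [pij * (1.0 - pij) for pij in pi]
        XtWX = [[sum(w[i] * X[i][a] * X[i][c] for i in range(len(X)))
                 for c in range(p)] for a in range(p)]
        score = [sum((y[i] - pi[i]) * X[i][a] for i in range(len(X)))
                 for a in range(p)]
        beta = [b + d for b, d in zip(beta, solve(XtWX, score))]
    return beta

# Illustrative, non-separable data (separable data would send beta to infinity):
X = [[1.0, x] for x in (1, 2, 3, 4, 5, 6)]
y = [0, 0, 1, 0, 1, 1]
beta = irls_logistic(X, y)
print([round(b, 3) for b in beta])
```

At convergence the score equations X'(y − π̂) = 0 are satisfied, which is the property the slide derives.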

3. What does the logit mean?

The logit model is related to the odds for a binary outcome:
odds = Pr(y=1) / Pr(y=0) = p/(1−p)
log odds = ln(odds) = ln( p/(1−p) )
(In statistics, all logs refer to the natural log:
if x = e^n where e = 2.718..., then ln(x) = n.)
Thus, the logit is the predicted log odds of y for a given value of x.

Interpretation of the Parameters in Logistic Regression

The log odds at x is

η(x) = ln( π(x) / (1 − π(x)) ) = β0 + β1 x

The log odds at x + 1 is

η(x + 1) = ln( π(x + 1) / (1 − π(x + 1)) ) = β0 + β1 (x + 1)

The difference in the log odds is

η(x + 1) − η(x) = β1

Interpretation of the Parameters in Logistic Regression

The odds ratio is found by taking antilogs:

OR = Odds_{x+1} / Odds_x = e^{β1}

The odds ratio is interpreted as the estimated multiplicative change in
the odds of success associated with a one-unit increase in the value of
the predictor variable.

4. Computing p from a log odds

The formal statement of the logit model is (again)

log( p_i / (1 − p_i) ) = α + β x_i + ε_i

Then, noting that π(x) = p,

p / (1 − p) = e^{α + βx}

so

p = e^{α + βx} (1 − p)

Or, the predicted p is

p̂ = e^{α + βx} / (1 + e^{α + βx})

Thus, getting meaningful predictions out of your model for a reader
to understand takes a bit of work.
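These conversions are one-liners; a hedged Python sketch with made-up coefficients (alpha and beta are illustrative, not from any fitted model):

```python
import math

def inv_logit(eta):
    """Predicted probability from a log odds: e^eta / (1 + e^eta)."""
    return math.exp(eta) / (1.0 + math.exp(eta))

def logit(p):
    """Log odds: ln(p / (1 - p)); undefined at p = 0 and p = 1."""
    return math.log(p / (1.0 - p))

# Illustrative coefficients, not from a real fit:
alpha, beta = -2.0, 0.5
x = 3.0
eta = alpha + beta * x        # predicted log odds at x
p = inv_logit(eta)            # predicted probability at x
odds = p / (1.0 - p)          # predicted odds; equals e^eta
print(round(eta, 2), round(p, 3), round(odds, 3))
```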

Inference on the Model Parameters

Likelihood ratio tests (LRT):
- An LRT can be used to compare a full model with a reduced model
  of interest.
- This is analogous to the extra-sum-of-squares technique for comparing
  full and reduced models.
- The LRT compares twice the logarithm of the likelihood for the full
  model (FM) to twice the logarithm of the likelihood of the reduced
  model (RM) to obtain the test statistic:

LR = 2 ln[ L(FM) / L(RM) ] = 2[ ln L(FM) − ln L(RM) ]

For large samples, when the reduced model is correct, the test
statistic LR follows a chi-square distribution with df equal to the
difference in the number of parameters between the full and reduced models.
If LR > χ²_{α,df}, we reject the claim that the reduced model is
appropriate.
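A small sketch of the LR computation (the log-likelihood values here are made up; the df = 1 tail probability uses the identity P(χ²₁ > x) = erfc(√(x/2)), so only the standard library is needed):

```python
import math

def likelihood_ratio(llf_full, llf_reduced):
    """LR = 2 [ ln L(FM) - ln L(RM) ]."""
    return 2.0 * (llf_full - llf_reduced)

def chi2_sf_df1(x):
    """Upper-tail chi-square probability for 1 degree of freedom:
    P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical log-likelihoods, for illustration only:
llf_full, llf_reduced = -10.2, -13.6
lr = likelihood_ratio(llf_full, llf_reduced)
print(round(lr, 2), round(chi2_sf_df1(lr), 4))
```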

Inference on the Model Parameters

The LR approach can be used to provide a test for significance of
logistic regression:
- It uses the current model (fit to the data) as the full model and
  compares it to a reduced model that has only a constant probability of
  success. The constant-probability-of-success model is

E(y) = p = e^{β0} / (1 + e^{β0})

that is, a logistic regression model with no regressor variables.
- The MLE of p in the reduced model is just y/n, where y is the total
  number of successes.
- Substituting this into the log-likelihood function gives the maximum
  value of the likelihood function for the reduced model as

ln L(RM) = y ln(y) + (n − y) ln(n − y) − n ln(n)

Therefore, the LRT for testing significance of regression is:

LR = 2 { Σ_{i=1}^{n} [ y_i ln π̂_i + (n_i − y_i) ln(1 − π̂_i) ] − [ y ln(y) + (n − y) ln(n − y) − n ln(n) ] }

Testing Goodness of Fit (GOF)

- The GOF of the logistic regression model can also be assessed using an
  LRT procedure.
- This test compares the current model to a saturated model, where each
  observation (or group of observations when n_i > 1) is allowed to have
  its own parameter (a success probability).
- The current model's fitted success probabilities are

π̂_i = exp(x_i'β̂) / (1 + exp(x_i'β̂))

The deviance is defined as twice the difference in log-likelihoods
between the saturated model and the full model (current model) that has
been fit to the data:

D = 2 ln[ L(saturated model) / L(FM) ]
  = 2 Σ_{i=1}^{n} [ y_i ln( y_i / (n_i π̂_i) ) + (n_i − y_i) ln( (n_i − y_i) / (n_i (1 − π̂_i)) ) ]

Testing Goodness of Fit (GOF)

In calculating the deviance, we take y ln( y / (nπ̂) ) = 0 if y = 0, and
(n − y) ln( (n − y) / (n(1 − π̂)) ) = 0 if y = n.

- When the logistic regression model is an adequate fit to the data, and
  the sample size is large, the deviance has a chi-square distribution
  with n − p df, where p is the number of parameters in the model.
- Small values of the deviance (or a large P-value) imply that the model
  provides a satisfactory fit to the data, while large values of the
  deviance imply that the current model is not adequate.
- A good rule of thumb is to divide the deviance by its number of
  degrees of freedom: if the ratio D/(n − p) is much greater than unity,
  the current model is not an adequate fit to the data.
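The deviance formula, including the y = 0 and y = n conventions, can be coded directly; a sketch with made-up grouped data:

```python
import math

def deviance(y, n, pi_hat):
    """D = 2 sum_i [ y_i ln(y_i/(n_i pi_i)) + (n_i-y_i) ln((n_i-y_i)/(n_i(1-pi_i))) ],
    with the convention that the first term is 0 when y_i = 0 and the
    second term is 0 when y_i = n_i."""
    d = 0.0
    for yi, ni, p in zip(y, n, pi_hat):
        if yi > 0:
            d += yi * math.log(yi / (ni * p))
        if yi < ni:
            d += (ni - yi) * math.log((ni - yi) / (ni * (1.0 - p)))
    return 2.0 * d

# Illustrative grouped data: y_i successes out of n_i trials, fitted probabilities
y = [2, 5, 9]
n = [10, 10, 10]
pi_hat = [0.25, 0.50, 0.85]
D = deviance(y, n, pi_hat)
print(round(D, 4))
# Rule of thumb from the slide: compare D / (n - p) with 1
```

Note that when π̂_i = y_i/n_i for every group (the saturated model), every log term vanishes and D = 0.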

Pearson chi-square goodness-of-fit statistic:

- The Pearson chi-square GOF statistic can be compared to a chi-square
  distribution with n − p degrees of freedom.
- Small values of the statistic (or a large P-value) imply that the
  model provides a satisfactory fit to the data.
- The Pearson chi-square statistic can also be divided by the number of
  df, n − p, and the ratio compared to unity.
- If the ratio greatly exceeds unity, the GOF of the model is questionable.

The Hosmer-Lemeshow (HL) goodness-of-fit statistic:

- When there are no replicates on the regressor variables, the
  observations can be grouped to perform a GOF test called the
  Hosmer-Lemeshow test.
- In this procedure, the observations are classified into g groups based
  on the estimated probabilities of success.
- Generally, about 10 groups are used (when g = 10 the groups are called
  the deciles of risk), and the observed numbers of successes O_j and
  failures N_j − O_j are compared with the expected frequencies in each
  group, N_j π̄_j and N_j(1 − π̄_j), where N_j is the number of
  observations in the jth group and π̄_j is the average estimated
  success probability in the jth group.
- The Hosmer-Lemeshow statistic is really just a Pearson chi-square GOF
  statistic comparing observed and expected frequencies:

HL = Σ_{j=1}^{g} (O_j − N_j π̄_j)² / [ N_j π̄_j (1 − π̄_j) ]

- If the fitted logistic regression model is correct, the HL statistic
  follows a chi-square distribution with g − 2 df when the sample size
  is large.
- Large values of HL imply that the model is not an adequate fit to the
  data.
- It is also useful to compute the ratio of HL to its number of df,
  g − 2, with values close to unity implying an adequate fit.

H0: the model is adequate
H1: the model is not adequate
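The HL statistic is a short loop once the groups are formed; a sketch with an invented deciles-of-risk summary (only 3 groups shown for brevity):

```python
def hosmer_lemeshow(obs_success, group_size, avg_pi):
    """HL = sum_j (O_j - N_j * pibar_j)^2 / [ N_j * pibar_j * (1 - pibar_j) ],
    comparing observed successes O_j with expected N_j * pibar_j in each
    of the g groups."""
    hl = 0.0
    for oj, nj, pj in zip(obs_success, group_size, avg_pi):
        e = nj * pj                       # expected successes in group j
        hl += (oj - e) ** 2 / (e * (1.0 - pj))
    return hl

# Hypothetical grouped summary, for illustration only:
O = [3, 6, 12]
N = [20, 20, 20]
pbar = [0.10, 0.35, 0.55]
print(round(hosmer_lemeshow(O, N, pbar), 4))
```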

Likelihood Inference on the Model Parameters

The deviance can also be used to test hypotheses about subsets of the
model parameters (analogous to the extra-SS method).

Procedure:
The full model is Xβ = X1β1 + X2β2, with p parameters; β2 has r
parameters. This full model has deviance D(β).

H0: β2 = 0
H1: β2 ≠ 0

The reduced model is X1β1, with deviance D(β1).
The difference in deviance between the full and reduced models is

D(β2 | β1) = D(β1) − D(β), with r degrees of freedom

D(β2 | β1) has a chi-square distribution under H0: β2 = 0.
Large values of D(β2 | β1) imply that H0: β2 = 0 should be rejected.

Inference on the Model Parameters

Tests on individual model coefficients can also be done using Wald
inference.
This uses the result that the MLEs have an approximate normal
distribution, so the distribution of

Z0 = β̂ / se(β̂)

is standard normal if the true value of the parameter is zero. Some
computer programs report the square of Z0 (which is chi-square), and
others calculate the P-value using the t distribution.

Logistic Regression with 1 Predictor

- Response: presence/absence of a characteristic
- Predictor: numeric variable observed for each case
- Model: π(x) is the probability of presence at predictor level x

π(x) = e^{β0 + β1 x} / (1 + e^{β0 + β1 x})

- β1 = 0: P(presence) is the same at each level of x
- β1 > 0: P(presence) increases as x increases
- β1 < 0: P(presence) decreases as x increases

Logistic Regression with 1 Predictor

- β0 and β1 are unknown parameters and must be estimated using
  statistical software such as SPSS, SAS, R, or STATA (or in a matrix
  language)
- Primary interest is in estimating and testing hypotheses regarding β1
- Large-sample test (Wald test):

H0: β1 = 0
HA: β1 ≠ 0

T.S.: X²_obs = ( β̂1 / se(β̂1) )²
R.R.: X²_obs ≥ χ²_{α,1}
P-value: P( χ² ≥ X²_obs )

Note: some software packages perform this as an equivalent Z-test or
t-test.

Odds Ratio

Interpretation of the regression coefficient (β1):
- In linear regression, the slope coefficient is the change in the mean
  response as x increases by 1 unit
- In logistic regression, with odds(x) = π(x) / (1 − π(x)), we can show
  that:

odds(x + 1) / odds(x) = e^{β1}

Thus e^{β1} represents the (multiplicative) change in the odds of the
outcome from increasing x by 1 unit:
- If β1 = 0, the odds and probability are the same at all x levels (e^{β1} = 1)
- If β1 > 0, the odds and probability increase as x increases (e^{β1} > 1)
- If β1 < 0, the odds and probability decrease as x increases (e^{β1} < 1)

95% Confidence Interval for Odds Ratio

Step 1: Construct a 95% CI for β1:

( β̂1 − 1.96 se(β̂1), β̂1 + 1.96 se(β̂1) )

Step 2: Raise e = 2.718... to the lower and upper bounds of the CI:

( e^{β̂1 − 1.96 se(β̂1)}, e^{β̂1 + 1.96 se(β̂1)} )

- If the entire interval is above 1, conclude a positive association
- If the entire interval is below 1, conclude a negative association
- If the interval contains 1, we cannot conclude there is an association
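The two steps above are a couple of lines of Python; the estimates here are invented purely to show the mechanics:

```python
import math

def odds_ratio_ci(beta_hat, se_beta, z=1.96):
    """95% CI for the odds ratio e^beta: exponentiate the endpoints of
    beta_hat +/- z * se(beta_hat)."""
    lo = math.exp(beta_hat - z * se_beta)
    hi = math.exp(beta_hat + z * se_beta)
    return lo, hi

# Hypothetical slope estimate and standard error, for illustration:
lo, hi = odds_ratio_ci(0.40, 0.15)
print(round(lo, 3), round(hi, 3))
# Here the entire interval lies above 1, so we would conclude a
# positive association
```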

EXAMPLE 1: Sex ratios in insects

Ex. Sex ratios in insects (the proportion of all
individuals that are males)
In the species in question, it has been observed that the sex ratio
is highly variable, and an experiment was set up to see whether
population density was involved in determining the fraction of
males.

Density   1   4   10   22   55   121   210   444
Females   1   3    7   18   22    41    52    79
Males     0   1    3    4   33    80   158   365

Ex. Sex ratios in insects (the proportion of all
individuals that are males)

It certainly looks as if there are proportionally more males at
high density, but we should plot the data as proportions to see
this more clearly.

Enter the data into R

Make it a data frame:

> density <- c(1, 4, 10, 22, 55, 121, 210, 444)
> females <- c(1, 3, 7, 18, 22, 41, 52, 79)
> males <- c(0, 1, 3, 4, 33, 80, 158, 365)
> sexratio <- data.frame(density, females, males)   # data-frame name chosen here for illustration

Evidently, a logarithmic transformation of the explanatory variable is
likely to improve the model fit (population density is involved in
determining the fraction of males).

Question: does increasing population density lead to a significant
increase in the proportion of males in the population? (That is, is the
sex ratio density-dependent?)
- The response variable: a matched pair of counts that we wish to
  analyse as proportion data
- The explanatory variable: population density
- First, bind together the vectors of male and female counts into a
  single object that will be the response in the analysis:

y <- cbind(males, females)

- y will be interpreted in the model as the proportion of all
  individuals that were male.
- Then, fit the generalized linear model with family = binomial
  (logit link).

(The R summary output reports the intercept and slope estimates.)
If the residual deviance is greater than the residual degrees of
freedom, we have overdispersion.

- The slope is highly significantly steeper than zero (proportionately
  more males at higher population density).
- See whether a log transformation of the explanatory variable reduces
  the residual deviance below 22.091.

- In a GLM, it is assumed that the residual deviance is roughly equal to
  the residual degrees of freedom.
- If the residual deviance > residual degrees of freedom, we have
  OVERDISPERSION.
- Overdispersion means there is extra, unexplained variation over and
  above the binomial variance assumed by the model specification.
- How to overcome it?
  - By transformation
  - Or use quasi-likelihood (in the family argument),
    e.g. glm(y ~ log(x), family = quasibinomial) for a binomial response.

In the model with log(density), there is no evidence of
overdispersion.

Deviance tables correspond to ANOVA tables.

The analysis of deviance table:
- The Deviance column gives the differences between models as variables
  are added to the model in turn.
- The deviances are approximately chi-square distributed with the
  stated degrees of freedom.
- It is necessary to add the test="Chisq" argument to get the
  approximate chi-square tests.
- If there is more than one predictor, to test which predictors should
  stay in or be removed from the model, we can use:
> drop1(model, test="Chisq")

Measure of fit:
- The deviance shows how well the model fits the data
- To compare two models' deviances, use a likelihood ratio test,
  comparing against the chi-square distribution

To get the model coefficients back on the probability scale, apply the
reverse (inverse logit) transformation.

Model checking

Plot the model:
1. Residuals vs fitted values
2. Normal plot
3. Diagnostic checking: Cook's distance, etc.

> par(mfrow=c(2,2))
> plot(model)

- There is no pattern in the residuals against fitted values.
- The normal plot is reasonably linear.
- Point no. 4 is highly influential (it has a large Cook's distance),
  but the model is still significant with the point omitted.

Conclusion of the example:

- We conclude that the proportion of animals that are males increases
  significantly with increasing density.
- The logistic model is linearized by logarithmic transformation of the
  explanatory variable (population density).
- Draw the fitted line through the scatter plot:

xv <- seq(0, 6, 0.1)
plot(log(density), p, ylab="Proportion male")
lines(xv, predict(model, list(density=exp(xv)), type="response"))

The use of type="response" back-transforms from the logit scale to
the S-shaped proportion scale.

EXAMPLE 2:
Challenger Data

Temperature at launch (°F), with whether at least one O-ring failure
occurred recorded for each launch (the same data shown earlier):
53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

A Logistic Regression Model for the Challenger Data

Test that all slopes are zero: G = 5.944, DF = 1,
P-Value = 0.015

Goodness-of-Fit Tests
Method             Chi-Square   DF   P
Pearson            14.049       15   0.522
Deviance           15.759       15   0.398
Hosmer-Lemeshow    11.834            0.159

The fitted model is

ŷ = exp(10.875 − 0.17132x) / (1 + exp(10.875 − 0.17132x))

Note that the fitted function has
been extended down to 31 deg F,
the temperature at which
Challenger was launched.

Odds Ratio for the Challenger Data

ÔR = e^{−0.17132} = 0.84

- This implies that every decrease of one degree in temperature
  increases the odds of O-ring failure by about 1/0.84 = 1.19, or 19
  percent.
- The temperature at the Challenger launch was 22 degrees below the
  lowest observed launch temperature, so now

ÔR = e^{22(−0.17132)} = 0.0231

- This results in an increase in the odds of failure of 1/0.0231 =
  43.34, or about 4200 percent!
- There's a big extrapolation here, but if you knew this prior to
  launch, what decision would you have made?
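The odds-ratio arithmetic on this slide is easy to reproduce; a short Python check using the slide's fitted slope:

```python
import math

# Slope from the fitted Challenger model on the slides:
beta1 = -0.17132

or_one_degree = math.exp(beta1)     # odds ratio for a 1-degree increase
print(round(or_one_degree, 2))      # prints 0.84

# A 1-degree *decrease* multiplies the failure odds by the reciprocal:
print(round(1.0 / or_one_degree, 2))   # prints 1.19, i.e. about 19 percent

# Launch temperature was 22 degrees below the lowest observed temperature:
or_22 = math.exp(22 * beta1)
print(round(or_22, 4))                 # prints 0.0231
print(round(1.0 / or_22, 2))           # prints 43.34: a ~4200 percent increase
```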

The Pneumoconiosis Data (Example 3)

Another Logistic Regression Example: The Pneumoconiosis Data

A 1959 article in Biometrics reported the data.

[The data table, the fitted model, and the diagnostic-checking plots
appear as figures on the original slides.]

Source: Linear Regression Analysis 5E, Montgomery, Peck & Vining

Useful qualities of the logit for social analysis

- The use of odds in the outcome variable makes the model more
  sensitive to changes near p = 0 or p = 1 than to changes near p = .5.
  This is appropriate in that small absolute changes in proportions near
  0 or 1 tend to reflect bigger effects than small absolute changes in
  proportions near .5.
- The use of the log function in the outcome variable makes the model
  sensitive to relative changes in proportions rather than absolute
  changes.
  This is appropriate in that explanatory variables often have
  multiplicative rather than additive effects on response variables.

Advantages of the logit model over the linear
regression model for binary outcomes:

1. The logit of the outcome tends to have a linear relationship with
   the explanatory variables.
   (This is the most important advantage!)
2. The logit of the outcome can go to +∞ or −∞, so it is impossible
   to have meaningless predictions for the outcome variable.
3. The logit model produces results equivalent to those of a
   homoskedastic model.

One important disadvantage of the logit model:
Estimation

- A given individual either will or will not have the outcome, so the
  observed p = 0 or p = 1 for all cases.
  What is the logit when p = 0? When p = 1? (It is undefined.)
- This problem makes it impossible to do least squares estimation of a
  logit model: least squares estimates minimize Σ(observed − expected)²,
  and the logit of the observed is always undefined!
- It is also impossible to directly standardize logits, so there is no
  true r or r² for a logit model.

Solving the estimation problem for logit models:

- Logit models are not solved by least squares estimation, but by a
  completely different procedure called maximum likelihood estimation.
- Least squares procedures are based on the notion of a sampling
  distribution: a universe of possible samples coming from a single
  true population parameter.
- Maximum likelihood procedures are based on the notion of a universe
  of possible population parameters that could produce the one observed
  sample.
- Standard errors are comparable in the two procedures, but the
  computation for MLE is prohibitively time-consuming for humans.

Summary of this lecture:

You should be able to do the following:
- explain the problems (in order of importance) with using a linear
  regression model when there is a binary outcome
- define a logit model in equations and in words
- explain why a logit model often overcomes the problems of a linear
  regression model
- look at the output of a logit model and be able to
  predict y,
  predict the odds of y, and
  predict the log odds of y for a given x,
  and express the slope as an odds ratio