
LOGISTIC REGRESSION

Traditional applications of linear models assume that the response variable is:

- Normally distributed
- Of constant variance
- Independent

Often, however, the response is continuous but nonnormal.

Data transformation is used to:

- Induce approximate normality
- Stabilize variance
- Simplify the model form

Weighted least squares is often used to stabilize variance.

Generalized linear models (GLMs)

- The approach is about 25-30 years old; it unifies linear and nonlinear regression models
- The response distribution is a member of the exponential family (normal, exponential, gamma, binomial, Poisson)
- Original applications were in the biopharmaceutical sciences
- There is a lot of recent interest in GLMs in industrial statistics
- GLMs are simple models that include linear regression as a special case
- Parameter estimation is by maximum likelihood (assuming that the response distribution is known)
- Inference on the parameters is based on large-sample (asymptotic) theory

We will consider logistic regression, Poisson regression, and then the general GLM.

Models with binary outcomes

- Binary outcomes are outcomes with two possible values, success or failure, coded (0, 1) for individuals
- They occur often in the biopharmaceutical field: dose-response studies, bioassays, clinical trials
- Industrial applications include failure analysis and fatigue testing; a test on a semiconductor can yield a success, or a failure with an associated failure mode
- Other examples: college graduation, employment

Possible model:

$$y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i, \quad i = 1, 2, \ldots, n, \qquad y_i = 0 \text{ or } 1$$

$$P(y_i = 1) = \pi_i \text{ with } 0 \le \pi_i \le 1, \qquad P(y_i = 0) = 1 - \pi_i$$

$$E(y_i) = \pi_i = \mathbf{x}_i'\boldsymbol{\beta}, \qquad \mathrm{Var}(y_i) = \pi_i(1 - \pi_i)$$

Problems with this model:

- The error terms take on only two values, so they can't possibly be normally distributed; the error distribution is neither identical nor normal
- The variance of the observations is a function of the mean (see the previous slide)
- Heteroskedasticity is more of a problem, but still often not fatal, because it acts in a conservative direction
- A linear response function can produce predicted values that violate $0 \le E(y_i) = \pi_i = \mathbf{x}_i'\boldsymbol{\beta} \le 1$: nonsensical predictions
- Bad predictions due to the nonlinear functional form, even within reasonable values of y
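The out-of-range problem can be demonstrated directly: fit a straight line to 0/1 data by ordinary least squares and predict outside the observed range. This is an illustrative sketch with made-up data, not data from the notes.

```python
import numpy as np

# Made-up 0/1 response against a predictor; a straight line must
# eventually leave the interval [0, 1].
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([0., 0., 0., 1., 0., 1., 1., 1.])
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares fit of y on x.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b[0] + b[1] * 20)    # predicted "probability" well above 1
print(b[0] + b[1] * (-5))  # and well below 0
```

The fitted line is a perfectly valid least-squares solution, yet its predictions cannot be probabilities outside a narrow range of x.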

Data: temperature at launch (°F) and whether at least one O-ring failure occurred, for the 24 launches and static tests prior to the launch of Challenger.

Temperatures: 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

[Scatter plot: at least one O-ring failure (0/1) versus temperature at launch, 50-80 °F]

Generalized Linear Models

- Linear model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
- Generalized linear model: $g(E(y_i)) = \beta_0 + \beta_1 x_i$ for some link function $g$
- Logit model: $\log(p_i/(1 - p_i)) = \beta_0 + \beta_1 x_i$

There is a lot of empirical evidence that the response function is nonlinear (S-shaped); see the scatter plot of the Challenger data. The logistic response function is a common choice:

$$E(y) = \frac{\exp(\mathbf{x}'\boldsymbol{\beta})}{1 + \exp(\mathbf{x}'\boldsymbol{\beta})} = \frac{1}{1 + \exp(-\mathbf{x}'\boldsymbol{\beta})}$$
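The two algebraic forms of the logistic response function above are identical; a minimal sketch:

```python
import math

def logistic(eta):
    """Logistic response function: maps a linear predictor to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Bounded in (0, 1), equal to 0.5 when the linear predictor is 0.
print(logistic(0.0))   # 0.5
print(logistic(4.0))   # ~0.982
print(logistic(-4.0))  # ~0.018
```

Note the symmetry: logistic(-eta) = 1 - logistic(eta), which is why the curve is S-shaped around 0.5.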

[Figure: the S-shaped logistic curve, probability $P(\pi_i)$ versus the predictor; the logit transform maps it to a straight line in the predictor]

The logistic response function can be easily linearized. Let

$$\eta = \mathbf{x}'\boldsymbol{\beta} \quad \text{and} \quad \pi = E(y)$$

Define the logit transformation:

$$\eta = \ln\frac{\pi}{1 - \pi}$$

Model:

$$y_i = E(y_i) + \varepsilon_i, \quad \text{where} \quad E(y_i) = \pi_i = \frac{\exp(\mathbf{x}_i'\boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})}$$

The distribution of each observation $y_i$ is

$$f_i(y_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}, \quad i = 1, 2, \ldots, n$$

The likelihood function is

$$L(\mathbf{y}, \boldsymbol{\beta}) = \prod_{i=1}^{n} f_i(y_i) = \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$$

We usually work with the log-likelihood:

$$\ln L(\mathbf{y}, \boldsymbol{\beta}) = \sum_{i=1}^{n} \ln f_i(y_i) = \sum_{i=1}^{n} y_i \ln\left(\frac{\pi_i}{1 - \pi_i}\right) + \sum_{i=1}^{n} \ln(1 - \pi_i)$$

The maximum likelihood estimators (MLEs) of the model parameters maximize the likelihood (or log-likelihood) function.

- ML has been around since the first part of the previous century
- It often gives estimators that are intuitively pleasing
- MLEs have nice properties: they are unbiased (for large samples), of minimum variance (or nearly so), and approximately normally distributed when n is large

If we have $n_i$ trials at each observation, we can write the log-likelihood as

$$\ln L(\mathbf{y}, \boldsymbol{\beta}) = \sum_{i=1}^{n} y_i \mathbf{x}_i'\boldsymbol{\beta} - \sum_{i=1}^{n} n_i \ln[1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})]$$

Differentiating with respect to $\boldsymbol{\beta}$:

$$\frac{\partial \ln L(\mathbf{y}, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \mathbf{X}'\mathbf{y} - \sum_{i=1}^{n} n_i \pi_i \mathbf{x}_i = \mathbf{X}'\mathbf{y} - \mathbf{X}'\boldsymbol{\mu}, \quad \text{because } \mu_i = n_i \pi_i$$

Setting this last result to zero gives the maximum likelihood score equations

$$\mathbf{X}'(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}$$

Compare: $\mathbf{X}'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}$ results from OLS or ML with normal errors. Since $\mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$, we have $\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{y}$, and $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ (OLS, or the normal-theory MLE).

Solving the ML score equations in logistic regression isn't quite as easy, because

$$\mu_i = n_i \pi_i = \frac{n_i \exp(\mathbf{x}_i'\boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})}, \quad i = 1, 2, \ldots, n$$

is nonlinear in $\boldsymbol{\beta}$. It turns out that the solution is actually fairly easy, and is based on iteratively reweighted least squares:

- An iterative procedure is necessary because the parameter estimates must be updated from an initial guess through several steps
- Weights are necessary because the variance of the observations is not constant
- The weights are functions of the unknown parameters
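The iterative weighted procedure described above can be sketched in a few lines. This is an illustrative implementation for ungrouped 0/1 data with a made-up example, not production code:

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25):
    """Fit logistic regression by iteratively reweighted least squares.

    X: (n, p) design matrix (include a column of ones for the intercept).
    y: (n,) vector of 0/1 responses.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))   # fitted probabilities
        W = pi * (1.0 - pi)               # weights = Var(y_i), not constant
        # Newton/IRLS update: beta += (X'WX)^{-1} X'(y - pi)
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - pi))
    return beta

# Tiny illustration: success becomes more likely as x grows.
x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])
beta = fit_logistic_irls(X, y)
print(beta)  # the fitted slope should be positive
```

At convergence the score equations X'(y - pi) = 0 hold, exactly as derived above.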

The logit model is related to the odds for a binary outcome:

odds = Pr(y=1) / Pr(y=0) = p/(1-p)

log odds = ln(odds) = ln(p/(1-p))

(In statistics, all logs refer to the natural log: if x = e^n, where e = 2.718..., then ln(x) = n.)

Thus, the logit is the predicted log odds of y for a given value of x.

The log-odds at x is

$$\eta(x) = \ln\frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 x$$

The log-odds at x + 1 is

$$\eta(x + 1) = \ln\frac{\pi(x + 1)}{1 - \pi(x + 1)} = \beta_0 + \beta_1(x + 1)$$

so that

$$\eta(x + 1) - \eta(x) = \beta_1$$

The odds ratio is found by taking antilogs:

$$OR = \frac{\text{Odds}_{x+1}}{\text{Odds}_x} = e^{\beta_1}$$

The odds ratio is interpreted as the estimated multiplicative increase in the odds for a one-unit increase in the value of the predictor variable.
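The fact that this ratio is the same at every x can be checked numerically; the coefficients below are hypothetical, chosen only for illustration:

```python
import math

def odds(p):
    return p / (1.0 - p)

def p_of(x, b0, b1):
    """Logistic model probability at predictor value x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

b0, b1 = -2.0, 0.405   # hypothetical fitted coefficients
for x in (0.0, 1.0, 5.0):
    ratio = odds(p_of(x + 1, b0, b1)) / odds(p_of(x, b0, b1))
    print(round(ratio, 4))  # always exp(b1) ~ 1.4993, regardless of x
```

The odds ratio depends only on the slope, not on where along the curve you sit; the probability difference, by contrast, varies with x.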

The formal statement of the logit model is (again)

$$\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_i$$

Then

$$\frac{p}{1 - p} = e^{\mathbf{x}'\boldsymbol{\beta}}, \quad \text{so the predicted probability is} \quad p = \frac{e^{\mathbf{x}'\boldsymbol{\beta}}}{1 + e^{\mathbf{x}'\boldsymbol{\beta}}}$$

Note: $\pi(x) = p$.

Likelihood ratio tests (LRT)

- An LRT can be used to compare a full model with a reduced model of interest
- This is analogous to the extra-sum-of-squares technique for comparing full and reduced models
- The LRT compares twice the logarithm of the likelihood for the full model (FM) with twice the logarithm of the likelihood for the reduced model (RM) to obtain the test statistic:

$$LR = 2\ln\frac{L(FM)}{L(RM)} = 2[\ln L(FM) - \ln L(RM)]$$

- For large samples, when the reduced model is correct, the test statistic LR follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between the full and reduced models
- If $LR > \chi^2_{\alpha, df}$, we reject the claim that the reduced model is appropriate

The LR approach can be used to provide a test for significance of logistic regression: use the current model (fit to the data) as the full model and compare it to a reduced model with constant probability of success. The constant-probability-of-success model is

$$E(y) = p = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$$

that is, a logistic regression model with no regressor variables. The MLE of p in the reduced model is just $\hat{p} = \sum y_i / n$. Substituting this into the log-likelihood function gives the maximum value of the log-likelihood for the reduced model:

$$\ln L(RM) = \left(\sum_{i=1}^{n} y_i\right)\ln\hat{p} + \left(n - \sum_{i=1}^{n} y_i\right)\ln(1 - \hat{p})$$

Therefore, the LRT statistic for testing significance of regression is $LR = 2[\ln L(FM) - \ln L(RM)]$.
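A sketch of this significance-of-regression test with made-up grouped data; the fitted probabilities `p_full` stand in for the output of a logistic fit (they are assumed numbers, for illustration only):

```python
import math

def loglik(y, n, p):
    """Binomial log-likelihood contribution for y successes in n trials."""
    out = 0.0
    if y > 0:
        out += y * math.log(p)
    if n - y > 0:
        out += (n - y) * math.log(1.0 - p)
    return out

# Hypothetical grouped data: y successes out of n trials per group.
y = [1, 3, 6, 9]
n = [10, 10, 10, 10]
p_full = [0.12, 0.31, 0.58, 0.87]   # assumed fitted probabilities

# Reduced model: constant success probability, MLE = total y / total n.
p_bar = sum(y) / sum(n)

ll_full = sum(loglik(yi, ni, pi) for yi, ni, pi in zip(y, n, p_full))
ll_red = sum(loglik(yi, ni, p_bar) for yi, ni in zip(y, n))
LR = 2.0 * (ll_full - ll_red)
print(round(LR, 3))  # refer to a chi-square with 1 df (one slope tested)
```

A large LR relative to the chi-square reference rejects the constant-probability model, i.e. the regression is significant.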

The GOF of the logistic regression model can also be assessed using an LRT procedure. This test compares the current model to a saturated model, where each observation (or group of observations, when $n_i > 1$) is allowed to have its own parameter (a success probability).

The deviance is defined as twice the difference in log-likelihoods between this saturated model and the full model (the current model) that has been fit to the data, with estimated probabilities

$$\hat{p}_i = \frac{e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}}}{1 + e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}}}$$

$$D = 2\ln\frac{L(\text{saturated model})}{L(FM)} = 2\sum_{i=1}^{n}\left[y_i \ln\left(\frac{y_i}{n_i \hat{p}_i}\right) + (n_i - y_i)\ln\left(\frac{n_i - y_i}{n_i(1 - \hat{p}_i)}\right)\right]$$

In calculating the deviance, we take

$$y\ln\left(\frac{y}{n\hat{p}}\right) = 0 \text{ if } y = 0, \quad \text{and} \quad (n - y)\ln\left(\frac{n - y}{n(1 - \hat{p})}\right) = 0 \text{ if } y = n.$$

When the logistic regression model is an adequate fit to the data and the sample size is large, the deviance has a chi-square distribution with n - p degrees of freedom, where p is the number of parameters in the model.

- Small values of the deviance (or a large P-value) imply that the model provides a satisfactory fit to the data, while large values of the deviance imply that the current model is not adequate
- The deviance can also be divided by its degrees of freedom: if the ratio D/(n - p) is much greater than unity, the current model is not an adequate fit to the data
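The deviance for grouped binomial data, including the y = 0 and y = n conventions above, can be computed as follows (the fitted probabilities here are assumed, for illustration only):

```python
import math

def dev_term(y, n, p):
    """One observation's contribution to the deviance (grouped binomial).
    Terms with y = 0 or y = n contribute 0 by convention."""
    out = 0.0
    if y > 0:
        out += y * math.log(y / (n * p))
    if n - y > 0:
        out += (n - y) * math.log((n - y) / (n * (1.0 - p)))
    return 2.0 * out

# Hypothetical grouped data with assumed fitted probabilities p_hat.
y = [0, 3, 6, 10]
n = [10, 10, 10, 10]
p_hat = [0.05, 0.30, 0.60, 0.95]

D = sum(dev_term(yi, ni, pi) for yi, ni, pi in zip(y, n, p_hat))
print(round(D, 4))
# Rule of thumb from the notes: compare D / (n_obs - n_params) to 1.
```

Note that the middle two groups contribute exactly 0 here, because the fitted probabilities match the observed proportions.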

The Pearson chi-square goodness-of-fit statistic can be used in the same way; for large samples it also has a chi-square distribution with n - p degrees of freedom.

- Small values of the statistic (or a large P-value) imply that the model provides a satisfactory fit to the data
- The Pearson chi-square statistic can also be divided by the number of degrees of freedom n - p and the ratio compared to unity; if the ratio greatly exceeds unity, the GOF of the model is questionable

When there are no replicates on the regressor variables, the observations can be grouped to perform a GOF test called the Hosmer-Lemeshow (HL) test.

- In this procedure, the observations are classified into g groups based on their estimated probabilities of success
- Generally, about 10 groups are used (when g = 10 the groups are called the deciles of risk), and the observed numbers of successes $O_j$ and failures $N_j - O_j$ are compared with the expected frequencies $E_j = N_j\bar{p}_j$ and $N_j - E_j$, where $N_j$ is the number of observations in the jth group and $\bar{p}_j$ is the average estimated success probability in the jth group
- The HL statistic is a Pearson-type chi-square comparing observed and expected frequencies:

$$HL = \sum_{j=1}^{g} \frac{(O_j - N_j\bar{p}_j)^2}{N_j\bar{p}_j(1 - \bar{p}_j)}$$

If the fitted logistic regression model is correct, the HL statistic follows a chi-square distribution with g - 2 degrees of freedom when the sample size is large ($H_0$: the model fit is adequate; $H_1$: the model fit is not adequate). Large values of HL imply that the model is not an adequate fit to the data. It is also useful to compute the ratio of HL to its degrees of freedom, with values close to unity implying an adequate fit.
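A sketch of the HL computation: sort by fitted probability, form g groups, and compare observed with expected successes per group. The data here are simulated from a correctly specified model, so the statistic should typically be unremarkable:

```python
import math
import random

def hosmer_lemeshow(y, p_hat, g=10):
    """Hosmer-Lemeshow statistic: group observations by fitted probability,
    then compare observed vs. expected successes per group."""
    pairs = sorted(zip(p_hat, y))
    size = len(pairs) // g
    hl = 0.0
    for j in range(g):
        chunk = pairs[j * size:] if j == g - 1 else pairs[j * size:(j + 1) * size]
        Nj = len(chunk)
        Oj = sum(yi for _, yi in chunk)
        p_bar = sum(pi for pi, _ in chunk) / Nj
        # Expected successes E_j = N_j * p_bar; Pearson-type contribution.
        hl += (Oj - Nj * p_bar) ** 2 / (Nj * p_bar * (1.0 - p_bar))
    return hl

# Simulate 100 observations from a logistic model; treat the true
# probabilities as the "fitted" ones for this illustration.
random.seed(1)
xs = [i / 10.0 for i in range(-50, 50)]
p_hat = [1.0 / (1.0 + math.exp(-x)) for x in xs]
y = [1 if random.random() < p else 0 for p in p_hat]
hl = hosmer_lemeshow(y, p_hat, g=10)
print(round(hl, 3))  # compare to chi-square with g - 2 = 8 df
```

With a misspecified model, observed and expected counts drift apart in the extreme deciles and the statistic grows.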

The deviance can also be used to test hypotheses about subsets of the model parameters. Procedure: partition $\boldsymbol{\beta}' = [\boldsymbol{\beta}_1', \boldsymbol{\beta}_2']$ and test

$$H_0: \boldsymbol{\beta}_2 = \mathbf{0} \quad \text{versus} \quad H_1: \boldsymbol{\beta}_2 \ne \mathbf{0}$$

- The full model has deviance $\lambda(\boldsymbol{\beta})$
- The reduced model is $\mathbf{X}_1\boldsymbol{\beta}_1$, with deviance $\lambda(\boldsymbol{\beta}_1)$
- The difference in deviance between the full and reduced models, $\lambda(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1) = \lambda(\boldsymbol{\beta}_1) - \lambda(\boldsymbol{\beta})$, has a chi-square distribution under $H_0: \boldsymbol{\beta}_2 = \mathbf{0}$
- Large values of $\lambda(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1)$ imply that $H_0$ should be rejected

Tests on individual model coefficients can also be done using Wald inference. This uses the result that the MLEs have an approximate normal distribution, so the statistic

$$Z_0 = \frac{\hat{\beta}}{se(\hat{\beta})}$$

is standard normal if the true value of the parameter is zero. Some computer programs report the square of $Z_0$ (which is chi-square), and others calculate the P-value using the t distribution.

- Response: presence/absence of a characteristic
- Predictor: numeric variable observed for each case
- Model: $\pi(x)$ = probability of presence at predictor level x,

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$

- $\beta_1 = 0$: P(presence) is the same at each level of x
- $\beta_1 > 0$: P(presence) increases as x increases
- $\beta_1 < 0$: P(presence) decreases as x increases

Primary interest is in estimating and testing hypotheses regarding $\beta_1$.

Large-sample (Wald) test:

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_A: \beta_1 \ne 0$$

$$\text{T.S.: } X_{obs}^2 = \left(\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}\right)^2 \qquad \text{R.R.: } X_{obs}^2 \ge \chi^2_{\alpha, 1} \qquad P\text{-value: } P(\chi^2_1 \ge X_{obs}^2)$$

(Some software performs this as an equivalent Z-test or t-test.)

Odds Ratio

Interpretation of the regression coefficient $\beta_1$:

- In linear regression, the slope coefficient is the change in the mean response as x increases by 1 unit
- In logistic regression, with $\text{odds}(x) = \pi(x)/(1 - \pi(x))$, we can show that

$$\frac{\text{odds}(x + 1)}{\text{odds}(x)} = e^{\beta_1}$$

so the odds are multiplied by $e^{\beta_1}$ when x increases by 1 unit:

- If $\beta_1 = 0$, the odds and probability are the same at all x levels ($e^{\beta_1} = 1$)
- If $\beta_1 > 0$, the odds and probability increase as x increases ($e^{\beta_1} > 1$)
- If $\beta_1 < 0$, the odds and probability decrease as x increases ($e^{\beta_1} < 1$)

Step 1: Construct a 95% CI for $\beta_1$:

$$\left(\hat{\beta}_1 - 1.96\,se(\hat{\beta}_1),\; \hat{\beta}_1 + 1.96\,se(\hat{\beta}_1)\right)$$

Step 2: Raise e = 2.718... to the lower and upper bounds of the CI:

$$\left(e^{\hat{\beta}_1 - 1.96\,se(\hat{\beta}_1)},\; e^{\hat{\beta}_1 + 1.96\,se(\hat{\beta}_1)}\right)$$

- If the entire interval is above 1, conclude a positive association
- If the entire interval is below 1, conclude a negative association
- If the interval contains 1, we cannot conclude there is an association
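The two steps translate directly into code; the estimates below are hypothetical, chosen only for illustration:

```python
import math

# Hypothetical estimates from a fitted logistic regression:
beta_hat = 0.30   # assumed slope
se_hat = 0.12     # assumed standard error

# Step 1: 95% CI for the slope.
lo = beta_hat - 1.96 * se_hat
hi = beta_hat + 1.96 * se_hat

# Step 2: exponentiate the endpoints to get a CI for the odds ratio.
or_lo, or_hi = math.exp(lo), math.exp(hi)
print(round(or_lo, 3), round(or_hi, 3))
# The whole interval is above 1 here, so we'd conclude a positive association.
```

Because exp() is monotone, exponentiating the endpoints of the slope interval gives a valid interval for the odds ratio.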

Example 1 (in R): sex ratio, the proportion of individuals that are males.

In the species in question, it has been observed that the sex ratio is highly variable, and an experiment was set up to see whether population density was involved in determining the fraction of males.

Density:  1   4  10  22  55  121  210  444
Females:  1   3   7  18  22   41   52   79
Males:    0   1   3   4  33   80  158  365

It certainly looks as if there are proportionally more males at high density, but we should plot the data as proportions to see this more clearly.


Make the counts into a data frame.

We want to know whether population density is involved in determining the fraction of males (is the sex ratio density-dependent?).

- The response variable: a matched pair of counts (males, females) that we wish to model
- The explanatory variable: population density

First, bind together the vectors of male and female counts into a single object:

y <- cbind(males,females)

y will be interpreted in the model as the proportion of all individuals that were male. Then fit the generalized linear model with the binomial family (logit link):

model <- glm(y ~ density, family = binomial)

[Output: intercept and slope estimates]

If the residual deviance is greater than the residual degrees of freedom, we have what is called overdispersion. See whether a log transformation of the explanatory variable reduces the residual deviance below 22.091.

OVERDISPERSION: there is extra, unexplained variation over and above the binomial variance assumed by the model specification.

How to overcome it?

- By transformation
- Or use quasi-likelihood (in the family argument), e.g. glm(y ~ log(x), family = quasibinomial) for binomial data with overdispersion

- The Deviance column gives differences between models as variables are added
- The deviances are approximately chi-square distributed with the stated degrees of freedom
- It is necessary to add the test="Chisq" argument to get the approximate chi-square tests
- If there is more than one predictor, to test which predictors should stay in or be removed from the model, we can use the function:

> drop1(model, test="Chisq")

Measure of Fit

- The deviance shows how well the model fits the data
- Compare two models' deviances using a likelihood ratio test, referring the difference to a chi-square distribution
- Use summary(model) (or coef(model)) to get the model coefficients

Model checking..

1. Plot the model:
   > par(mfrow=c(2,2))
   > plot(model)
2. Normal plot
3. Diagnostic checking: Cook's distance etc.

The normal plot is reasonably linear. Point no. 4 is highly influential (it has a large Cook's distance), but the model is still significant with the point omitted.

We conclude that the proportion of animals that are males increases with population density. The logistic model is linearized by logarithmic transformation of the explanatory variable (population density).

Draw the fitted line through the scatter plot:

xv <- seq(0,6,0.1)
plot(log(density), p, ylab="Proportion male")
lines(xv, predict(model, list(density=exp(xv)), type="response"))

The use of type="response" back-transforms from the logit scale to the S-shaped proportion scale.

EXAMPLE 2: Challenger Data

Temperature at launch (°F) for the 24 launches, with whether at least one O-ring failure occurred:

53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Challenger Data: fitted logistic regression output

Test that all slopes are zero: G = 5.944, DF = 1, P-Value = 0.015

Goodness-of-Fit Tests:

Method            Chi-Square   DF   P
Pearson               14.049   15   0.522
Deviance              15.759   15   0.398
Hosmer-Lemeshow       11.834        0.159

The fitted model is

$$\hat{y} = \frac{\exp(10.875 - 0.17132\,x)}{1 + \exp(10.875 - 0.17132\,x)}$$

[Plot: the fitted response function has been extended down to 31 °F, the temperature at which Challenger was launched]

The odds ratio for temperature is

$$OR = e^{-0.17132} \approx 0.84$$

- This implies that every decrease of one degree in temperature increases the odds of O-ring failure by about 1/0.84 = 1.19, or 19 percent
- The temperature at the Challenger launch was 22 degrees below the lowest observed launch temperature, so now $OR = e^{22(-0.17132)} \approx 0.0231$
- This results in an increase in the odds of failure of 1/0.0231 = 43.34, or about 4200 percent!
- There's a big extrapolation here, but if you had known this prior to launch, what decision would you have made?
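The arithmetic above follows directly from the fitted slope; a quick check:

```python
import math

b1 = -0.17132  # reported slope for temperature in the Challenger fit

# Odds ratio for a one-degree increase in temperature:
OR = math.exp(b1)
print(round(OR, 2))          # 0.84

# A one-degree *decrease* multiplies the odds of failure by 1/OR:
print(round(1.0 / OR, 2))    # 1.19, i.e. about a 19% increase

# Launch temperature was 22 degrees below the lowest observed value:
OR22 = math.exp(22 * b1)
print(round(OR22, 4))        # 0.0231
print(round(1.0 / OR22, 2))  # 43.34-fold increase in the odds of failure
```

The multiplicative structure is the key point: each additional degree of extrapolation compounds the odds by the same factor.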

Example 3: Pneumoconiosis Data

A 1959 article in Biometrics reported the data.

[Slides (Vining): data, fitted model, and diagnostic checking]

- The use of odds in the outcome variable makes the model reflect relative, rather than absolute, differences in probability. This is appropriate in that small absolute changes in proportions near 0 or 1 matter more than the same changes near .5
- The use of the log function in the outcome variable makes the model reflect multiplicative, rather than absolute, changes in the odds. This is appropriate in that explanatory variables often have multiplicative effects

Advantages of the logit model over a linear regression model for binary outcomes:

1. The logit of the outcome tends to have a linear relationship with the predictors (this is the most important advantage!)
2. The logit of the outcome can go to +∞ or -∞, so the model can never predict out of range, no matter the value of the predictor variable
3. The logit model produces results equivalent to those of a homoskedastic model

Estimation

- A given individual either will or will not have the outcome, so the observed outcome is 0 or 1 while the model predicts a probability. (What is the logit when p = 0? When p = 1?)
- Least squares estimation does not suit the logit model: least squares estimates minimize Σ(observed - expected)², and the logit of an observed 0/1 outcome is ±∞. It is also impossible to directly standardize logits, so there is no true r or r² for a logit model
- Logit models are not solved by least squares estimation, but by maximum likelihood estimation
- Least squares procedures are based on the notion of a sampling distribution: a universe of possible samples produced by one population parameter. Maximum likelihood procedures are based on the notion of a universe of possible population parameters that could produce the one observed sample
- Standard errors are comparable in the two procedures

You should be able to do the following:

- Explain the problems (in order of importance) with using a linear regression model for a binary outcome
- Define a logit model in equations and in words
- Explain why a logit model often overcomes the problems of a linear regression model
- Look at the output of a logit model and be able to predict y, predict the log odds of y for a given x, and express the slope as an odds ratio
