
Logistic regression

Generalized linear models


Family of regression models
Outcome variable determines choice of model

Outcome       Model
Continuous    Linear regression
Counts        Poisson regression
Survival      Cox model
Binomial      Logistic regression

Uses
Control of confounding
Model building, risk prediction

What is Logistic Regression?


Form of regression that allows the prediction of discrete variables by a mix of continuous and discrete predictors
Logistic regression is often used because the relationship between the DV (a discrete variable) and a predictor is non-linear
Example: the probability of heart disease changes very little with a ten-point difference among people with low blood pressure, but a ten-point change can mean a drastic change in the probability of heart disease in people with high blood pressure

Logistic regression
Models the relationship between a set of variables xi
  dichotomous (yes/no)
  categorical (social class, ...)
  continuous (age, ...)
and a dichotomous (binary) variable Y
A dichotomous outcome is the most common situation in biology and epidemiology
We code the dichotomous variable:
  0 = no disease, survived, non-smoker, female ...
  1 = disease, died, smoker, male ...
We code with 1 what we want to predict

Linear regression

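$$y = \beta_0 + \beta_1 x + \varepsilon$$
The error $\varepsilon$ is normally distributed, with mean = 0 and constant variance (i.e., homogeneity of variance)
Binary dependent variable: y = 0 or y = 1
  y = 1: positive response, with probability P
  y = 0: negative response, with probability Q = (1 - P)
With a binary y the error can take only two values:
$$\varepsilon = 1 - \beta_0 - \beta_1 x \quad (y = 1), \qquad \varepsilon = 0 - \beta_0 - \beta_1 x \quad (y = 0)$$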

Applications
Compare two (success) probabilities with correction for prognostic factors (clinical trials)
Determine which risk factors are important/not important (epidemiology)
Determine the dose-response relation (toxicology)

Example 1: Is smoking a predictor for CHD?

Design:
  60 patients
  Two groups: smokers and non-smokers
  Smokers = 1, non-smokers = 0
Outcome variable: coronary heart disease (CHD), yes/no
Research question: does the occurrence of CHD differ between smokers and non-smokers?

Example 1
              CHD +     CHD -     total
smoking +     17 (a)    7 (c)     24 (m)
smoking -     9 (b)     27 (d)    36 (n)
total         26 (r)    34 (s)    60 (N)

Analysis: t test for proportions, comparing p(CHD+ | smokers) with p(CHD+ | non-smokers): t = 3.53, p < 0.01; or the χ² test

Dichotomous outcome variable Y (0/1):

[Figure: CHD (0/1) plotted against smoking (0/1)]

Data transformation is required!

Example 1
              CHD +     CHD -     total
smoking +     17 (a)    7 (c)     24 (m)
smoking -     9 (b)     27 (d)    36 (n)
total         26 (r)    34 (s)    60 (N)

$$\text{Odds}_{\text{CHD}}(\text{smokers}) = \frac{a/m}{c/m} = \frac{a}{c} = \frac{17}{7} = 2.429$$

A smoker is 2.429 times as likely to have CHD as not to have CHD

$$\text{Odds}_{\text{CHD}}(\text{non-smokers}) = \frac{b/n}{d/n} = \frac{b}{d} = \frac{9}{27} = 0.333$$

A non-smoker is 0.333 times as likely to have CHD as not to have CHD
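The two odds on this slide can be reproduced directly from the cell counts. The following is a minimal sketch; the function name `odds` is chosen for illustration and is not part of the original material.

```python
# Cell counts from the Example 1 table (a, b, c, d as labelled there).
a, c = 17, 7    # smokers: CHD+, CHD-
b, d = 9, 27    # non-smokers: CHD+, CHD-

def odds(events, non_events):
    """Odds = number with the event / number without the event."""
    return events / non_events

odds_smokers = odds(a, c)
odds_nonsmokers = odds(b, d)
print(round(odds_smokers, 3), round(odds_nonsmokers, 3))
```

The row totals m and n cancel in (a/m)/(c/m) and (b/n)/(d/n), which is why only the cell counts are needed.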

Odds
Odds for an event

$$\text{odds} = \frac{p}{1 - p}$$
$$\log(\text{odds}) = \log\left(\frac{p}{1 - p}\right)$$

p is the probability that an event occurs
The greater the odds of an event, the greater the probability that the event occurs

Logit transformation
The logit transformation gives a linear relationship between the log odds of the observed event and the values of the independent variable x
$$\log(\text{odds}) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x$$

The model is similar to the simple linear regression model, but:
  the distribution is binomial, not normal
  the coefficients (a and b in simple regression, here β0 and β1) are not estimated in the same way as in the linear regression model

Logit transformation
Logit: the natural logarithm (ln) of the odds that the observed event (coded 1) occurs
  also written as log odds
  the logit scale is continuous and behaves much like the z-score scale
  p = 0.50, logit = 0
  p = 0.70, logit = 0.85
  p = 0.30, logit = -0.85
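The logit values listed on this slide can be checked with a few lines of Python. This is a sketch only; the helper names `logit` and `inv_logit` are illustrative and do not come from the slides.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit: maps a log odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.30, 0.50, 0.70):
    print(p, round(logit(p), 3))
```

The scale is symmetric around p = 0.50: the logits of 0.30 and 0.70 have the same magnitude (about 0.847) and opposite signs.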


Logistic regression model


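Equation for P(y = 1) for one predictor:

$$\log(\text{odds}) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x$$
$$\frac{p}{1 - p} = e^{\beta_0 + \beta_1 x}$$
$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \quad \text{(for the population)}$$
$$p = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}} \quad \text{(for the sample)}$$

e ≈ 2.718
p = P(y = 1)
x = predictor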
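The step from the odds to the formula for p is plain algebra. A short derivation in the sample notation (b0, b1) is sketched here; the last form, with a negative exponent, is an equivalent way of writing the same expression and is not shown on the slide.

```latex
\begin{align*}
\ln\frac{p}{1-p} &= b_0 + b_1 x \\
\frac{p}{1-p} &= e^{b_0 + b_1 x} \\
p &= (1 - p)\, e^{b_0 + b_1 x} \\
p \left(1 + e^{b_0 + b_1 x}\right) &= e^{b_0 + b_1 x} \\
p &= \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}}
   = \frac{1}{1 + e^{-(b_0 + b_1 x)}}
\end{align*}
```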



Logistic regression model


Equation for P(y = 1) for more than one predictor:

$$p = \frac{e^{b_0 + b_1 x_1 + b_2 x_2 + \dots}}{1 + e^{b_0 + b_1 x_1 + b_2 x_2 + \dots}}$$

e ≈ 2.718
p = P(y = 1)

Interpretation of the coefficients b0 and b1
b0
  needed for the equation, but of no importance for interpretation
  is the value of the log odds when the predictor equals 0
b1
  measure of the association between the predictor and the log odds of the event of interest
  b1 > 0: positive association
  b1 = 0: no association
  b1 < 0: negative association

Interpretation of the coefficient b1
b1 is the amount by which the log odds of the event of interest changes when the predictor x changes by one unit
Example:
  person 1, predictor (x) = k
  person 2, predictor (x) = k + 1

The equations for the log odds are:
  log(odds of the event for person 2) = b0 + b1 (k + 1)
  log(odds of the event for person 1) = b0 + b1 (k)
Further:
  log(odds of the event for person 2) = b0 + b1 (k) + b1
  log(odds of the event for person 1) = b0 + b1 (k)

Interpretation of the coefficient b1
Difference between the log odds of person 2 and person 1:
  log(odds of the event for person 2) = b0 + b1 (k) + b1
  log(odds of the event for person 1) = b0 + b1 (k)
The log odds of the event of interest for person 2, whose predictor is x = k + 1, differs from the log odds for person 1, whose predictor is x = k, by the value of the coefficient b1
That is, b1 is the change in the log odds of the event of interest when the predictor x changes by one unit

Interpretation of the coefficient b1
$$b_1 = \log(\text{odds of the event for person 2}) - \log(\text{odds of the event for person 1})$$
$$b_1 = \log\left(\frac{\text{odds of the event for person 2}}{\text{odds of the event for person 1}}\right) = \log(\text{odds ratio})$$
$$\text{odds ratio (OR)} = e^{b_1}$$

Interpretation of the coefficient b1
b1 = 0: the odds and the probability of the event of interest are the same for all values of x ($e^{b_1}$ = OR = 1)
b1 > 0: the odds and the probability of the event of interest increase as x increases ($e^{b_1}$ = OR > 1)
b1 < 0: the odds and the probability of the event of interest decrease as x increases ($e^{b_1}$ = OR < 1)

Example 1: Odds ratio

              CHD +     CHD -     total
smoking +     17 (a)    7 (c)     24 (m)
smoking -     9 (b)     27 (d)    36 (n)
total         26 (r)    34 (s)    60 (N)

$$\text{Odds}_{\text{CHD}}(\text{smokers}) = \frac{a/m}{c/m} = \frac{a}{c} = \frac{17}{7} = 2.429$$
$$\text{Odds}_{\text{CHD}}(\text{non-smokers}) = \frac{b/n}{d/n} = \frac{b}{d} = \frac{9}{27} = 0.333$$
$$\text{Odds ratio (OR)} = \frac{2.429}{0.333} = 7.286$$

Interpretation:
The odds of CHD for smokers are 7.29 times the odds for non-smokers

Odds ratio (relative odds, cross-product ratio)

The odds ratio (OR) is the ratio of the odds of prior exposure in the group in which the event of interest is present (coded 1) to the odds of prior exposure in the group in which it is absent (coded 0):

                   event present (+)   event absent (-)   total
exposure yes (+)   a                   c                  m (a + c)
exposure no (-)    b                   d                  n (b + d)
total              r (a + b)           s (c + d)          N (a + b + c + d)

Odds of the event of interest among the exposed: (a/m) / (c/m) = a/c
Odds of the event of interest among the unexposed: (b/n) / (d/n) = b/d
Odds ratio: (a/c) / (b/d) = ad/bc
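The cross-product formula ad/bc can be written as a small function. This is a sketch under the cell labelling used in the table (a, c exposed; b, d unexposed); the function name `odds_ratio` and the zero-cell check are illustrative additions.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio (a/c) / (b/d) = a*d / (b*c).

    a, c: exposed with / without the event
    b, d: unexposed with / without the event
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)

# Example 1 counts: 17 and 7 among smokers, 9 and 27 among non-smokers.
print(round(odds_ratio(17, 9, 7, 27), 3))
```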


Interpretation of coefficients
Odds (smokers) = 2.429, ln(odds) = 0.887
Odds (non-smokers) = 0.333, ln(odds) = -1.099
The model for this example is
$$\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x$$

For non-smokers (x = 0) we have
$$\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 \cdot 0 = b_0$$

The estimate of the intercept is b0, which is the log odds for non-smokers:
$$\ln\left(\frac{p}{1 - p}\right) = b_0 = -1.099$$

Interpretation of coefficients
The estimate of the slope is the difference between the log odds for smokers and the log odds for non-smokers:

$$b_1 = \ln\left(\frac{p_1}{1 - p_1}\right) - \ln\left(\frac{p_0}{1 - p_0}\right) = 0.887 - (-1.099) = 1.986$$

The fitted model is: log(odds) = -1.099 + 1.986x
The odds ratio is:

$$\frac{\text{Odds}_{\text{smokers}}}{\text{Odds}_{\text{non-smokers}}} = \frac{e^{(-1.099 + 1.986)}}{e^{-1.099}} = e^{1.986} = 7.286$$
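With a single binary predictor, the intercept and slope follow directly from the cell counts of the 2x2 table, so the values on this slide can be recomputed without any fitting routine. A minimal sketch, using the same a, b, c, d labels as the Example 1 table:

```python
import math

a, b, c, d = 17, 9, 7, 27   # Example 1 cell counts

b0 = math.log(b / d)                      # log odds for non-smokers (x = 0)
b1 = math.log(a / c) - math.log(b / d)    # log odds smokers minus log odds non-smokers
print(round(b0, 3), round(b1, 3), round(math.exp(b1), 3))
```

This shortcut only applies to the one-predictor binary case; with several predictors or a continuous predictor the coefficients have to be estimated iteratively.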

Logistic regression in SPSS


In the menu, click on Analyze
Point to Regression
Point to Binary Logistic ... and click
Dependent: chd
Covariates: smoking
Method: Enter
Then Continue and OK
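For readers without SPSS, the same one-predictor model can be fitted by maximum likelihood in a few lines. The sketch below uses a hand-written Newton-Raphson iteration on records rebuilt from the Example 1 counts. It is an illustration of the estimation idea, not a description of how SPSS implements it, and the names `cells`, `data` and `fit_logistic` are hypothetical.

```python
import math

# Rebuild the 60 individual records from the Example 1 table:
# keys are (smoking, chd) pairs, values are frequencies.
cells = {(1, 1): 17, (1, 0): 7, (0, 1): 9, (0, 0): 27}
data = [(x, y) for (x, y), n in cells.items() for _ in range(n)]

def fit_logistic(data, iterations=25):
    """Maximum likelihood fit of logit(p) = b0 + b1*x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log likelihood
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logistic(data)
print(round(b0, 3), round(b1, 3))
```

A fixed number of iterations keeps the sketch short; a production routine would stop on a convergence criterion and guard against separation in the data.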


Example 1 in SPSS
Point to the variable labeled chd
Move variable chd to the box labeled Dependent Variable by clicking the arrow
Point to the variable labeled smoking
Move variable smoking to the box labeled Covariates by clicking the arrow
Method: Enter

Example 1 in SPSS
In the menu, click on Options
Check CI for exp(B) and Continue
Then click OK

Example 1 in SPSS - Output


We see that there are 60 cases used in the analysis.

Case Processing Summary
Unweighted Cases (a)                        N     Percent
Selected Cases     Included in Analysis     60    100.0
                   Missing Cases            0     .0
                   Total                    60    100.0
Unselected Cases                            0     .0
Total                                       60    100.0
a. If weight is in effect, see classification table for the total number of cases.

Classification Table (a,b)
                                  Predicted CHD
Observed                          0      1      Percentage Correct
Step 0   CHD   0                  34     0      100.0
               1                  26     0      .0
         Overall Percentage                     56.7
a. Constant is included in the model.
b. The cut value is .500

The Block 0 output is for a model that includes only the intercept (which SPSS calls the constant). Given the base rates of the two CHD outcomes (34/60 = 56.7% without CHD, 26/60 = 43.3% with CHD), and no other information, the best strategy is to predict, for every case, that the subject does not have CHD. Using that strategy, you would be correct 56.7% of the time.
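The 56.7% figure in the Block 0 classification table is simply the share of the more frequent outcome. A small sketch of that reasoning, with illustrative variable names and the 0.5 cut value from the table footnote:

```python
n_no_chd, n_chd = 34, 26   # observed counts from the classification table

total = n_no_chd + n_chd
# With only an intercept, every case gets the same predicted probability,
# so every case is assigned to the same category.
p_chd = n_chd / total
predicted = 1 if p_chd >= 0.5 else 0
accuracy = (n_chd if predicted == 1 else n_no_chd) / total
print(predicted, round(100 * accuracy, 1))
```

Because the predicted probability of CHD (26/60) is below the cut value, all 60 cases are classified as 0 (no CHD), and the 34 cases without CHD are the ones classified correctly.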

Example 1 in SPSS - Output


Under Variables in the Equation you see that the intercept-only model is ln(odds) = -.268. The predicted odds of CHD for any subject, smoker or not, are Exp(B) = 0.765 (that is, 26/34).

Variables in the Equation
                     B        S.E.    Wald     df    Sig.    Exp(B)
Step 0   Constant    -.268    .261    1.060    1     .303    .765

Omnibus Tests of Model Coefficients gives us a Chi-Square of 12.645 on 1 df, significant beyond 0.001. This is a test of the null hypothesis that adding the smoking variable to the model has not increased our ability to predict CHD in our subjects.

Omnibus Tests of Model Coefficients
                   Chi-square    df    Sig.
Step 1   Step      12.645        1     .000
         Block     12.645        1     .000
         Model     12.645        1     .000

Example 1 in SPSS - Output


Under Model Summary we see that the -2 Log Likelihood statistic is 69.463. This statistic measures how poorly the model predicts the outcome: the smaller the statistic, the better the model. The Cox & Snell R² can be interpreted like R² in a multiple regression, but cannot reach a maximum value of 1. The Nagelkerke R² can reach a maximum of 1.

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       69.463               .190                    .255
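For this one-predictor example the Model Summary statistics can be recomputed from the cell counts alone, because the fitted probabilities equal the observed CHD rates in each smoking group. The sketch below assumes the usual definitions of the Cox & Snell and Nagelkerke statistics; the function and variable names are illustrative.

```python
import math

a, b, c, d = 17, 9, 7, 27          # Example 1 cell counts
n = a + b + c + d

def neg2_loglik(groups):
    """-2 log likelihood when each group's fitted probability is its observed rate."""
    ll = 0.0
    for events, non_events in groups:
        p = events / (events + non_events)
        ll += events * math.log(p) + non_events * math.log(1.0 - p)
    return -2.0 * ll

dev_null = neg2_loglik([(a + b, c + d)])      # intercept-only model
dev_model = neg2_loglik([(a, c), (b, d)])     # model with smoking
chi_square = dev_null - dev_model             # omnibus (likelihood ratio) chi-square

cox_snell = 1.0 - math.exp(-chi_square / n)
nagelkerke = cox_snell / (1.0 - math.exp(-dev_null / n))
print(round(dev_model, 3), round(chi_square, 3),
      round(cox_snell, 3), round(nagelkerke, 3))
```

The printed values should agree, up to rounding, with the -2 Log likelihood, the omnibus Chi-square, and the two R Square values in the SPSS output. The Nagelkerke statistic divides the Cox & Snell value by its maximum attainable value, which is why it can reach 1.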


Example 1 in SPSS - Output


The Variables in the Equation output shows us that the regression equation is

$$\log(\text{odds}) = -1.099 + 1.986 \cdot \text{smoking}$$

Variables in the Equation
                                                                    95.0% C.I. for EXP(B)
                       B        S.E.    Wald      df    Sig.    Exp(B)    Lower    Upper
Step 1 (a)  PUSENJE    1.986    .591    11.274    1     .001    7.286     2.286    23.223
            Constant   -1.099   .385    8.147     1     .004    .333
a. Variable(s) entered on step 1: PUSENJE (smoking).

Wald χ²: significance of the coefficients in the model

$$\text{Wald } \chi^2 = \left(\frac{\text{coefficient}}{SE}\right)^2$$
df = 1, $\chi^2_{0.05;\,1} = 3.841$
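The Wald statistic, its p-value, and the confidence interval for Exp(B) can be approximated from the printed coefficient and standard error. The sketch below assumes a normal-theory 95% interval (z = 1.96) and uses the identity that, for df = 1, the upper tail of the chi-square distribution equals erfc(sqrt(w/2)). The function name `wald_summary` is illustrative.

```python
import math

def wald_summary(coef, se, z_crit=1.96):
    """Wald chi-square (df = 1), its p-value, and a 95% CI for exp(coef)."""
    wald = (coef / se) ** 2
    # For df = 1, P(chi2 > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return wald, p_value, lower, upper

# Smoking coefficient and standard error as printed in the table.
wald, p_value, lower, upper = wald_summary(1.986, 0.591)
print(round(wald, 2), round(p_value, 3), round(lower, 2), round(upper, 2))
```

Because the inputs are already rounded to three decimals, the results differ slightly from the SPSS values (Wald 11.274, interval 2.286 to 23.223), which are computed from unrounded estimates. The Wald value exceeds the critical value 3.841, so the coefficient is significant at the 0.05 level.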


Example 1 in SPSS - Output


The Variables in the Equation output also gives us Exp(B), the odds ratio predicted by the model.

Variables in the Equation
                                                                    95.0% C.I. for EXP(B)
                       B        S.E.    Wald      df    Sig.    Exp(B)    Lower    Upper
Step 1 (a)  PUSENJE    1.986    .591    11.274    1     .001    7.286     2.286    23.223
            Constant   -1.099   .385    8.147     1     .004    .333
a. Variable(s) entered on step 1: PUSENJE.

$$OR = e^{1.986} = 7.286$$

Example 1 in SPSS - Output


We can now use this model to predict the odds that a subject has CHD. The odds prediction equation is odds = $e^{a + bx}$.
If our subject is a non-smoker (smoking = 0), then odds = $e^{-1.099 + 1.986(0)} = e^{-1.099} = 0.333$
A non-smoker is only 0.333 times as likely to have CHD as not to have CHD.
If our subject is a smoker (smoking = 1), then odds = $e^{-1.099 + 1.986(1)} = e^{0.887} = 2.428$
A smoker is 2.428 times as likely to have CHD as not to have CHD.

Example 1 in SPSS - Output


Convert odds to probability: p = odds / (1 + odds)

Non-smokers: p = 0.333 / (1 + 0.333) = 0.250 = 25%
The probability that a non-smoker will have CHD is 25%
Smokers: p = 2.428 / (1 + 2.428) = 0.708 = 70.8%
The probability that a smoker will have CHD is 70.8%

Example 2: Risk factors for CHD

Show whether age, smoking, obesity and cholesterol are risk factors for CHD
If they are risk factors, how strong is their effect?
Variables:
  CHD: 0 = CHD absent; 1 = CHD present (dependent variable, nominal scale, binary)
  Age: 0 = under 50 years; 1 = over 50 years (predictor, categorical variable, nominal scale, binary)
  Smoking: 0 = non-smoker; 1 = smoker (predictor, categorical variable, nominal scale, binary)
  Obesity: 0 = not obese; 1 = obese (predictor, categorical variable, nominal scale, binary)
  Cholesterol: continuous values (predictor, ratio scale)

Example 2: Logistic regression

Makes it possible to compute an equation that expresses the relationship between a binary outcome and one or more influencing factors (predictors):
  probability of CHD and age
  probability of CHD and smoking
  probability of CHD and obesity
  probability of CHD and cholesterol
  probability of CHD and age + smoking + obesity + cholesterol
and, if we are interested:
  probability of CHD and age + smoking
  probability of CHD and age + obesity
  probability of CHD and age + cholesterol
  probability of CHD and smoking + obesity
  probability of CHD and smoking + cholesterol
  probability of CHD and obesity + cholesterol

Example 2 in SPSS
CHD: risk factor Age

Variables in the Equation
                                                                   95.0% C.I. for EXP(B)
                       B        S.E.    Wald     df    Sig.    Exp(B)    Lower    Upper
Step 1 (a)  AGE        1.810    .588    9.485    1     .002    6.111     1.931    19.338
            Constant   -1.299   .461    7.958    1     .005    .273
a. Variable(s) entered on step 1: AGE.

b0 = -1.299 (Constant), b1 = 1.810 (AGE)
$$OR = e^{b_1} = e^{1.810} = 6.111$$

The odds of developing CHD for persons older than 50 years are 6.11 times the odds for persons younger than 50 years

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       71.437               .163                    .219

Example 2 in SPSS
CHD: risk factor Smoking

Variables in the Equation
                                                                    95.0% C.I. for EXP(B)
                       B        S.E.    Wald      df    Sig.    Exp(B)    Lower    Upper
Step 1 (a)  PUSENJE    1.986    .591    11.274    1     .001    7.286     2.286    23.223
            Constant   -1.099   .385    8.147     1     .004    .333
a. Variable(s) entered on step 1: PUSENJE.

$$OR = e^{1.986} = 7.286$$

The odds of developing CHD for smokers are 7.29 times the odds for non-smokers

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       69.463               .190                    .255

Example 2 in SPSS
CHD: risk factor Obesity

Variables in the Equation
                                                                   95.0% C.I. for EXP(B)
                       B        S.E.    Wald     df    Sig.    Exp(B)    Lower    Upper
Step 1 (a)  OBESITY    1.176    .553    4.520    1     .034    3.241     1.096    9.581
            Constant   -.734    .351    4.368    1     .037    .480
a. Variable(s) entered on step 1: OBESITY.

$$OR = e^{1.176} = 3.241$$

The odds of developing CHD for obese persons are 3.24 times the odds for non-obese persons

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       77.415               .075                    .101

Example 2 in SPSS
CHD: risk factor Cholesterol

$$OR = e^{0.696} = 2.005$$

When cholesterol increases by one unit (1 mmol/L), the odds that a person develops CHD increase 2.005-fold

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       73.490               .134                    .179

Example 2
In the menu, click on Options
Check CI for exp(B) and Hosmer-Lemeshow goodness-of-fit, then Continue
Then click OK

Example 2
Point to the variable labeled chd
Move variable chd to the box labeled Dependent Variable by clicking the arrow
Point to the variable labeled smoking, then obesity, age and cholesterol
Move the variables to the box labeled Covariates by clicking the arrow
Method: Enter

Example 2 in SPSS - Output


The -2 Log Likelihood statistic has dropped to 55.86, indicating that our expanded model does a better job of predicting CHD than the one-predictor model did
The R² statistics have also increased

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       55.860               .354                    .475

The Hosmer-Lemeshow test examines the null hypothesis that the model fits the data, i.e. that the relationship between the predictor variables and the log odds of the criterion variable is adequately described as linear. Here the test is not significant (p = .694), so the null hypothesis is not rejected.

Hosmer and Lemeshow Test
Step    Chi-square    df    Sig.
1       5.583         8     .694

Example 2 in SPSS - Output


Variables in the Equation
                                                                   95.0% C.I. for EXP(B)
                       B        S.E.     Wald     df    Sig.    Exp(B)    Lower    Upper
Step 1 (a)  SMOKING    2.067    .695     8.843    1     .003    7.899     2.023    30.840
            OBESITY    1.126    .675     2.785    1     .095    3.084     .822     11.575
            AGE        1.615    .781     4.280    1     .039    5.027     1.089    23.216
            CHOLESTE   .314     .334     .886     1     .347    1.369     .712     2.633
            Constant   -4.482   2.013    4.960    1     .026    .011
a. Variable(s) entered on step 1: SMOKING, OBESITY, AGE, CHOLESTE.

Comparison of the one-predictor models with the four-predictor model
                one-predictor models      four-predictor model
                OR        p               OR        p
smoking         7.286     < 0.05          7.899     < 0.05
obesity         3.241     < 0.05          3.084     > 0.05
age             6.111     < 0.05          5.027     < 0.05
cholesterol     2.005     < 0.05          1.369     > 0.05
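The four-predictor equation can be used to predict the probability of CHD for an individual by inserting the B values into the logistic formula. The sketch below uses the coefficients printed in the table; the subject is hypothetical and not taken from the data set, and the names `COEF`, `CONSTANT` and `predict_probability` are illustrative.

```python
import math

# Coefficients (B) from the four-predictor "Variables in the Equation" table.
COEF = {"smoking": 2.067, "obesity": 1.126, "age": 1.615, "cholesterol": 0.314}
CONSTANT = -4.482

def predict_probability(subject):
    """P(CHD = 1) = e^z / (1 + e^z), with z = constant + sum of B * x."""
    z = CONSTANT + sum(COEF[name] * subject[name] for name in COEF)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical subject (an assumption for illustration): a non-obese smoker
# over 50 with a cholesterol value of 6.0 in the units used in the data.
subject = {"smoking": 1, "obesity": 0, "age": 1, "cholesterol": 6.0}
print(round(predict_probability(subject), 3))
```

Changing a single entry of `subject` while holding the others fixed shows the adjusted effect of that predictor on the predicted probability.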


Example 2 in SPSS, Method Forward: Wald

Point to the variable labeled chd
Move variable chd to the box labeled Dependent Variable by clicking the arrow
Point to the variable labeled smoking, then obesity, age and cholesterol
Move the variables to the box labeled Covariates by clicking the arrow
Method: Forward: Wald

Example 2 in SPSS, Method Forward: Wald - Output

Variables in the Equation
                                                                    95.0% C.I. for EXP(B)
                       B        S.E.    Wald      df    Sig.    Exp(B)    Lower    Upper
Step 1 (a)  SMOKING    1.986    .591    11.274    1     .001    7.286     2.286    23.223
            Constant   -1.099   .385    8.147     1     .004    .333
Step 2 (b)  SMOKING    2.092    .669    9.776     1     .002    8.104     2.183    30.080
            AGE        1.924    .675    8.129     1     .004    6.846     1.824    25.688
            Constant   -2.239   .636    12.376    1     .000    .107
a. Variable(s) entered on step 1: SMOKING.
b. Variable(s) entered on step 2: AGE.

Variables not in the Equation
                                       Score     df    Sig.
Step 1   Variables   OBESITY           3.769     1     .052
                     AGE               9.234     1     .002
                     CHOLESTE          6.060     1     .014
         Overall Statistics            12.654    3     .005
Step 2   Variables   OBESITY           3.247     1     .072
                     CHOLESTE          1.262     1     .261
         Overall Statistics            4.106     2     .128

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       69.463               .190                    .255
2       60.020               .308                    .413

Hosmer and Lemeshow Test
Step    Chi-square    df    Sig.
2       .053          2     .974

Example 3: Risk factors for CHD

Show whether age, smoking, obesity and cholesterol are risk factors for CHD
If they are risk factors, how strong is their effect?
Variables:
  CHD: 0 = CHD absent; 1 = CHD present (dependent variable, nominal scale, binary)
  Age: continuous values (predictor, ratio scale)
  Smoking: 0 = non-smoker; 1 = smoker (predictor, categorical variable, nominal scale, binary)
  Obesity (BMI): continuous values (predictor, ratio scale)
  Cholesterol: continuous values (predictor, ratio scale)

Example 3: Risk factors for CHD

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       43.255               .477                    .639

Hosmer and Lemeshow Test
Step    Chi-square    df    Sig.
1       6.370         8     .606

Variables in the Equation
                                                                     95.0% C.I. for EXP(B)
                       B         S.E.     Wald      df    Sig.    Exp(B)    Lower    Upper
Step 1 (a)  SMOKING    2.569     .867     8.788     1     .003    13.054    2.388    71.358
            BMI        .297      .125     5.680     1     .017    1.346     1.054    1.719
            YEARS      .106      .038     7.603     1     .006    1.111     1.031    1.198
            CHOLESTE   -.031     .396     .006      1     .938    .970      .446     2.107
            Constant   -14.624   4.594    10.134    1     .001    .000
a. Variable(s) entered on step 1: SMOKING, BMI, YEARS, CHOLESTE.
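For a continuous predictor, Exp(B) is the odds ratio for a one-unit change, which is often too small a step to be clinically meaningful. The odds ratio for a larger change follows from multiplying the coefficient by the size of the change before exponentiating. A sketch using the YEARS coefficient from the table above; the 10-year step is an arbitrary illustration and the function name is hypothetical.

```python
import math

def odds_ratio_for_change(b, delta):
    """Odds ratio for a change of `delta` units in a predictor with coefficient b."""
    return math.exp(b * delta)

b_years = 0.106   # coefficient of YEARS in the table above
or_1 = odds_ratio_for_change(b_years, 1)     # one additional year
or_10 = odds_ratio_for_change(b_years, 10)   # ten additional years
print(round(or_1, 3), round(or_10, 2))
```

The one-year value differs in the third decimal from the printed Exp(B) because SPSS exponentiates the unrounded coefficient. The ten-year odds ratio equals the one-year odds ratio raised to the tenth power, not ten times the one-year value.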


Example 3: Risk factors for CHD

Model Summary
Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       64.361               .256                    .343
2       50.473               .410                    .550
3       43.261               .477                    .639

Hosmer and Lemeshow Test
Step    Chi-square    df    Sig.
1       2.687         8     .952
2       4.078         8     .850
3       6.346         8     .609

Variables in the Equation
                                                                     95.0% C.I. for EXP(B)
                       B         S.E.     Wald      df    Sig.    Exp(B)    Lower    Upper
Step 1 (a)  YEARS      .085      .024     12.268    1     .000    1.089     1.038    1.142
            Constant   -4.744    1.339    12.558    1     .000    .009
Step 2 (b)  SMOKING    2.566     .784     10.724    1     .001    13.016    2.802    60.461
            YEARS      .101      .029     12.337    1     .000    1.106     1.046    1.171
            Constant   -6.703    1.763    14.451    1     .000    .001
Step 3 (c)  SMOKING    2.558     .854     8.973     1     .003    12.910    2.421    68.831
            BMI        .298      .125     5.681     1     .017    1.347     1.054    1.720
            YEARS      .104      .034     9.515     1     .002    1.110     1.039    1.186
            Constant   -14.739   4.365    11.402    1     .001    .000
a. Variable(s) entered on step 1: YEARS.
b. Variable(s) entered on step 2: SMOKING.
c. Variable(s) entered on step 3: BMI.