
Answer questions

Logistic Regression (LR)


1. Briefly discuss the concept of LR analysis.
Ans: When the dependent variable is qualitative/dichotomous, we cannot run ordinary
least squares (OLS) models, because OLS requires a quantitative, continuous,
unbounded dependent variable. We instead use logistic regression analysis to determine
how the independent variables affect such a qualitative dependent variable. Logistic regression
estimates the effects of the independent variables on the probability of the outcome by
maximum likelihood, without assuming normally distributed errors. Because a binary outcome
carries less information per observation than a continuous one, logistic regression will
generally not pick up relationships between variables as well as OLS regression analysis for a
given number of observations. In short, logistic regression is the type of regression analysis
usually employed when the dependent variable is dichotomous.
2. Define LR model
Ans: Logistic regression is the appropriate regression analysis to conduct when the dependent
variable is dichotomous (binary). Like all regression analyses, the logistic regression is a
predictive analysis.
3. What are the assumptions in LR analysis?
Ans: 1. The model is correctly specified, i.e., the true conditional probabilities are a logistic
function of the independent variables; no important variables are omitted; no
extraneous/unnecessary variables are included; and the independent variables are measured
without error.
2. The cases are independent.
3. The independent variables are not linear combinations of each other. Perfect
multicollinearity makes estimation impossible, while strong multicollinearity makes estimates
imprecise.


4. What are the differences between Linear Regression and Logistic Regression?
Ans: 1. Linear regression requires the dependent variable to be continuous, i.e., numeric values
(no categories or groups), while logistic regression requires the dependent variable to be
binary, i.e., two categories only (0/1).
2. Linear regression is based on ordinary least squares estimation, while logistic regression is
based on maximum likelihood estimation.
3. Linear regression needs a linear relationship between the dependent and independent
variables. Logistic regression does not; it assumes instead that the log odds of the outcome
are linear in the independent variables.
4. In linear regression analysis, the error term should be normally distributed, while logistic
regression does not require the error term to be normally distributed.

5. Linear regression assumes that the variance of the residuals is approximately equal for all
predicted values of the dependent variable, while logistic regression does not need the
residual variance to be equal at each level of the predicted values.
6. As a rule of thumb, linear regression requires at least 5 cases per independent variable in
the analysis, while logistic regression needs at least 10 events per independent variable.
7. Linear regression is very fast compared to logistic regression, because logistic regression is
fitted by an iterative maximum likelihood procedure.

5. When and why do we use LR?


Ans: We use LR to predict an outcome variable that is a categorical dichotomy from one or
more categorical or continuous predictor (independent) variables. We use it because having a
categorical dichotomy as the outcome variable violates the assumption of linearity in
normal (linear) regression.
6. How do we estimate the unknown parameters when fitting an LR model?
Ans: Let $\pi(X_i) = P(Y_i = 1 \mid X_i)$ denote the success probability of case $i$ under the
model with coefficient vector $\beta$. The likelihood function for the pair $(X_i, Y_i)$ is written as

$$L_i(\beta \mid X_i) = \pi(X_i)^{Y_i}\,[1 - \pi(X_i)]^{1 - Y_i}$$

Since the observations are assumed to be independent, the likelihood function expresses the
values of $\beta$ in terms of known, fixed values for $y$:

$$L(\beta) = \prod_{i=1}^{n} L_i(\beta \mid X_i)$$

The log likelihood function is written as

$$\ln L(\beta) = \sum_{i=1}^{n} \ln L_i(\beta \mid X_i)$$

Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like
iteratively reweighted least squares is used to find an estimate of the regression coefficients,
$\hat{\beta}$.
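The iterative fit described above can be sketched in a few lines. This is a minimal illustration, not a production routine: it assumes a single predictor, uses Newton-Raphson steps (which for logistic regression coincide with iteratively reweighted least squares), and the data and the names `sigmoid`, `log_likelihood`, and `fit_logistic` are made up for the example.

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, x, y):
    """ln L = sum of Y_i*ln(pi_i) + (1 - Y_i)*ln(1 - pi_i)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson for logit(pi_i) = b0 + b1*x_i, starting from (0, 0)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # score vector (gradient of ln L)
        h00 = h01 = h11 = 0.0    # information matrix X'WX, W = diag(pi*(1-pi))
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (X'WX) * step = score
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

# Made-up illustrative data (not from the text). The two classes overlap,
# so a finite maximum likelihood estimate exists.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit_logistic(x, y)
print("estimates:", b0_hat, b1_hat)
print("log likelihood at estimates:", log_likelihood(b0_hat, b1_hat, x, y))
print("log likelihood at (0, 0):   ", log_likelihood(0.0, 0.0, x, y))
```

Each pass recomputes the fitted probabilities and the weights pi*(1-pi), which is why the procedure is iterative rather than a one-shot formula as in OLS.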

7. Discuss Likelihood Ratio Test for test of significance.


Ans: The likelihood ratio test (LRT) is used to test the null hypothesis that any subset of the
$\beta$'s is equal to 0, i.e., we want to test the following hypothesis:
$H_0: \beta_1 = \beta_2 = \dots = \beta_{p-1} = 0$ vs $H_a$: not all of the $\beta_k$ in $H_0$ equal zero.
Then the LRT statistic is given by:

$$-2\left[\, l(\hat{\beta}^{(0)}) - l(\hat{\beta}) \,\right]$$

where $l(\hat{\beta})$ is the log likelihood of the fitted (full) model (i.e., under $H_a$) and
$l(\hat{\beta}^{(0)})$ is the log likelihood of the (reduced) model specified by the null hypothesis
($H_0$), evaluated at the maximum likelihood estimate of that reduced model. Under $H_0$ this
test statistic has approximately a $\chi^2$ distribution with $(p - r)$ degrees of freedom, where
$p$ and $r$ are the numbers of unknown parameters in the full and reduced models, respectively.
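As a small numerical sketch of the test, consider a single binary predictor, where both models have closed-form fits: the full model reproduces the two group proportions and the reduced (intercept-only) model uses the pooled proportion. The counts below are hypothetical, the helper names are made up for the example, and the p-value shortcut via `math.erfc` holds only for 1 degree of freedom.

```python
import math

def binomial_log_likelihood(events, trials, p):
    """Log likelihood of `trials` independent 0/1 cases with `events` ones."""
    return events * math.log(p) + (trials - events) * math.log(1.0 - p)

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical counts: (events, cases) in each group of a binary predictor X
group0 = (10, 50)   # X = 0
group1 = (25, 50)   # X = 1

# Full model (intercept + slope): fitted probabilities equal the group proportions
ll_full = (binomial_log_likelihood(*group0, group0[0] / group0[1])
           + binomial_log_likelihood(*group1, group1[0] / group1[1]))

# Reduced model (slope = 0): one pooled probability for everybody
events = group0[0] + group1[0]
cases = group0[1] + group1[1]
ll_reduced = binomial_log_likelihood(events, cases, events / cases)

lrt = -2.0 * (ll_reduced - ll_full)
df = 2 - 1                      # p = 2 parameters (full), r = 1 (reduced)
p_value = chi2_sf_df1(lrt)      # valid here because df == 1
print("LRT statistic:", lrt, "df:", df, "p-value:", p_value)
```

A large statistic (small p-value) is evidence against the reduced model, i.e., against the hypothesis that the slope is zero.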

8. Why does a linear regression model make no sense when the dependent variable Y is binary
(i.e., takes only the values 0 or 1)?
Ans: In least-squares regression, we model the dependent variable Y as a linear function of the X variables plus a random error that is assumed to have a normal distribution. That is, the ith Y observation is assumed to have been generated by the following equation:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i$$

The key point is that in regular least-squares regression, the error term $\varepsilon_i$ has a normal
distribution with mean 0 and standard deviation $\sigma$. This means that $Y_i$ must be a continuous
variable (i.e., one that takes values on an interval), not a binary variable (i.e., a variable taking
only the values 0 and 1).

Thus, if we use regular least-squares regression when the dependent variable is binary or
dichotomous (when we should be using logistic regression), some of its assumptions are
violated, such as the least-squares requirement that the regression errors have a normal
distribution. When the assumptions that underlie the least-squares regression model are
violated, we can no longer rely on the statistical inference (e.g., which regression coefficients
are significant) or the predictions that are made based on the least-squares model.
[Figure: the kind of data that is appropriate for regular least-squares regression, Y vs X]

[Figure: the kind of data that is appropriate for logistic regression, Y vs X]
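To make the problem concrete, the sketch below fits an ordinary least-squares line to a made-up binary outcome. It is only an illustration under assumed data; `ols_fit` is a hypothetical helper using the usual closed-form estimates for a single predictor. At each X the residual can take only one of two values, so the errors cannot be normal, and the fitted line also runs below 0 and above 1.

```python
def ols_fit(x, y):
    """Closed-form least-squares estimates for the line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Made-up binary outcome (not from the text)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]

a, b = ols_fit(x, y)
fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print("intercept, slope:", a, b)
print("fitted values:", [round(f, 3) for f in fitted])
print("residuals:    ", [round(r, 3) for r in residuals])
print("any fitted value outside [0, 1]?", any(f < 0 or f > 1 for f in fitted))
```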

9. Discuss odds and odds ratio with example


Ans: Odds: The odds of an event of interest occurring are defined by
odds = p/(1-p)
where p is the probability of the event occurring (success). So if p=0.1, the odds are equal to
0.1/0.9=0.111 (recurring). Here the probability (0.1) and the odds (0.111) are quite similar.
Indeed, whenever p is small, the probability and odds will be similar. This is because when p
is small, 1-p is approximately 1, so that p/(1-p) is approximately equal to p.
But when p is large, the probability and odds will generally be quite different. For example, if
p=0.5, we have odds=0.5/0.5=1. As p increases, the odds get larger and larger; for p=0.99,
odds=0.99/0.01=99.
As another example, let p=0.90. The odds are 0.90/(1-0.90), or 0.90 to 0.10, or 9 to 1 (also
written as 9/1 or 9:1), which means the event of interest will occur 9 times for every 1 time
that it does not occur. That is, in 10 replications, we expect the event of interest to happen
9 times and not to happen the other time.
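The conversions between probability and odds can be checked with a short sketch. The function names `odds` and `probability` are chosen here for illustration only.

```python
def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def probability(odds_value):
    """Inverse conversion: probability of an event with the given odds."""
    return odds_value / (1.0 + odds_value)

# Small p: odds are close to p. Large p: odds grow much faster than p.
for p in (0.1, 0.5, 0.9, 0.99):
    print("p =", p, " odds =", round(odds(p), 3))
```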
Odds ratios: In order to understand the output of a logistic regression analysis, you also need
to have some understanding of odds ratios. Odds ratios are frequently used to express the
relative chance of an event happening under two different conditions. So if the odds of thing
A are $M_A$ to $N_A$ and the odds of thing B are $M_B$ to $N_B$, then the odds ratio is defined as

$$\mathrm{OR} = \frac{M_A / N_A}{M_B / N_B}$$

We can also write the odds ratio in terms of probabilities. Using the formula given above to
calculate the probabilities gives

$$P_A = \frac{M_A}{M_A + N_A}, \qquad P_B = \frac{M_B}{M_B + N_B}$$

The odds ratio expressed in terms of probabilities is then:

$$\mathrm{OR} = \frac{P_A / (1 - P_A)}{P_B / (1 - P_B)}$$

For example: Suppose that the probability of a bad outcome is 0.2 if a patient takes the
existing treatment, but that this is reduced to 0.1 if they take the new treatment. The odds of a
bad outcome with the existing treatment are 0.2/0.8=0.25, while the odds on the new treatment
are 0.1/0.9=0.111 (recurring). The odds ratio comparing the new treatment to the old
treatment is then simply the corresponding ratio of odds: (0.1/0.9) / (0.2/0.8) = 0.111 / 0.25 =
0.444 (recurring). This means that the odds of a bad outcome if a patient takes the new
treatment are 0.444 times the odds of a bad outcome if they take the existing treatment. The
odds (and hence probability) of a bad outcome are reduced by taking the new treatment. We
could also express the reduction by saying that the odds are reduced by approximately 56%,
since the odds are multiplied by a factor of 0.444.
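The treatment example can be reproduced numerically. This is a small sketch using the probabilities quoted above; the function names are illustrative, and the count-based version expresses the same odds as 1 to 9 and 2 to 8.

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def odds_ratio(p_a, p_b):
    """Odds ratio of condition A relative to condition B, from probabilities."""
    return odds(p_a) / odds(p_b)

def odds_ratio_from_counts(m_a, n_a, m_b, n_b):
    """Odds ratio when the odds are given as M_A to N_A and M_B to N_B."""
    return (m_a / n_a) / (m_b / n_b)

p_existing = 0.2   # probability of a bad outcome, existing treatment
p_new = 0.1        # probability of a bad outcome, new treatment

or_new_vs_existing = odds_ratio(p_new, p_existing)
reduction_pct = (1.0 - or_new_vs_existing) * 100.0

print("odds ratio (new vs existing):", round(or_new_vs_existing, 3))
print("reduction in odds (%):", round(reduction_pct, 1))
print("same ratio from odds 1:9 vs 2:8:",
      round(odds_ratio_from_counts(1, 9, 2, 8), 3))
```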

10. Calculate Odds Ratio from a logistic regression

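Ans: A logistic regression is used for predicting the probability of occurrence of an event by
fitting data to a logit function. Let us consider the fitted logit function

$$\operatorname{logit}(p_i) = \beta_0 + \beta_1 X_i$$

where $p_i$ is the probability of occurrence of an event (Y=1) in the population and

$$\operatorname{logit}(p_i) = \log\frac{p_i}{1 - p_i} = \log(\text{odds})$$

Or we can write

$$\text{odds}_1 = \frac{p_{X_j}}{1 - p_{X_j}} = e^{\beta_0 + \beta_1 X_j} \quad \text{when } X = X_j,$$

$$\text{odds}_2 = \frac{p_{X_j + 1}}{1 - p_{X_j + 1}} = e^{\beta_0 + \beta_1 (X_j + 1)} \quad \text{when } X = X_j + 1.$$

Therefore, the odds ratio for a one-unit increase in X from a logistic regression is obtained by

$$\mathrm{OR} = \frac{\text{odds}_2}{\text{odds}_1} = e^{\beta_1}, \qquad \log \mathrm{OR} = \beta_1$$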
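The result can be verified numerically: for any starting value of X, the ratio of the odds at X+1 to the odds at X equals exp(beta1). The coefficients below are hypothetical values picked for illustration, not estimates from any data set, and the helper names are made up.

```python
import math

def fitted_probability(x, b0, b1):
    """p = P(Y = 1 | X = x) under logit(p) = b0 + b1*x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds_at(x, b0, b1):
    """Odds p/(1-p) at X = x, computed from the fitted probability."""
    p = fitted_probability(x, b0, b1)
    return p / (1.0 - p)

b0, b1 = -2.0, 0.7   # hypothetical coefficients

for xj in (0.0, 1.5, 10.0):
    ratio = odds_at(xj + 1.0, b0, b1) / odds_at(xj, b0, b1)
    print("X_j =", xj, " odds ratio =", round(ratio, 6),
          " exp(b1) =", round(math.exp(b1), 6))
```

The printed ratio does not depend on X_j, which is why a single number, exp(beta1), summarizes the effect of a one-unit increase in X on the odds.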

11. What is the logit model for the Bernoulli distribution?


Ans: Many categorical response variables have only two categories. The observation for each
subject might be classified as a Success or a Failure. Represent these outcomes by 1 and
0. The Bernoulli distribution for binary random variables specifies probabilities
$P(Y = 1) = \pi$ and $P(Y = 0) = 1 - \pi$ for the two outcomes, for which $E(Y) = \pi$. When $Y_i$ has a
Bernoulli distribution with parameter $\pi_i$, the probability mass function is

$$f(y_i; \pi_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} = (1 - \pi_i) \left( \frac{\pi_i}{1 - \pi_i} \right)^{y_i} = (1 - \pi_i) \exp\left[ y_i \log\left( \frac{\pi_i}{1 - \pi_i} \right) \right]$$

for $y_i = 0$ and $1$.

This distribution is in the natural exponential family. The natural parameter

$$Q(\pi) = \log\left( \frac{\pi}{1 - \pi} \right),$$

the log odds of response 1, is called the logit of $\pi$. GLMs that use the logit link are called
logit models.
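As a quick check of the algebra, the sketch below evaluates the Bernoulli probability mass function in its usual form and in its exponential-family form and confirms that they agree. The function names are illustrative only.

```python
import math

def logit(pi):
    """Natural parameter of the Bernoulli distribution: the log odds."""
    return math.log(pi / (1.0 - pi))

def bernoulli_pmf(y, pi):
    """Usual form: pi^y * (1 - pi)^(1 - y), for y in {0, 1}."""
    return pi ** y * (1.0 - pi) ** (1 - y)

def bernoulli_pmf_expfam(y, pi):
    """Exponential-family form: (1 - pi) * exp(y * logit(pi))."""
    return (1.0 - pi) * math.exp(y * logit(pi))

for pi in (0.1, 0.5, 0.8):
    for y in (0, 1):
        print("pi =", pi, " y =", y,
              " usual =", round(bernoulli_pmf(y, pi), 6),
              " exp-family =", round(bernoulli_pmf_expfam(y, pi), 6))
```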