
Logistic Regression

Jia Li

Department of Statistics
The Pennsylvania State University

Email: jiali@stat.psu.edu
http://www.stat.psu.edu/jiali


Logistic Regression

- Preserve linear classification boundaries.
- By the Bayes rule:

      \hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x) .

- The decision boundary between class k and class l is determined by the equation:

      \Pr(G = k \mid X = x) = \Pr(G = l \mid X = x) .

- Divide both sides by Pr(G = l | X = x) and take the log. The above equation is equivalent to

      \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = 0 .


- Since we enforce a linear boundary, we can assume

      \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = a_0^{(k,l)} + \sum_{j=1}^{p} a_j^{(k,l)} x_j .

- For logistic regression, there are restrictive relations between the a^{(k,l)} for different pairs (k, l).


Assumptions

      \log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x

      \log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x

      \vdots

      \log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x


- For any pair (k, l):

      \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \beta_{k0} - \beta_{l0} + (\beta_k - \beta_l)^T x .

- Number of parameters: (K - 1)(p + 1).
- Denote the entire parameter set by

      \theta = \{ \beta_{10}, \beta_1, \beta_{20}, \beta_2, \ldots, \beta_{(K-1)0}, \beta_{K-1} \} .

- The log ratios of the posterior probabilities are called log-odds or logit transformations.


- Under the assumptions, the posterior probabilities are given by:

      \Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} \quad \text{for } k = 1, \ldots, K-1

      \Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} .

- For the Pr(G = k | X = x) given above, obviously:
  - They sum up to 1: \sum_{k=1}^{K} \Pr(G = k \mid X = x) = 1.
  - A simple calculation shows that the assumptions (the K-1 logit equations) are satisfied.
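
As a concrete illustration of these formulas, below is a minimal Python/NumPy sketch that evaluates the posterior probabilities from given coefficients; the function name and the example coefficient values are hypothetical, not taken from the notes.

```python
import numpy as np

def posterior_probs(x, intercepts, coefs):
    """Pr(G = k | X = x) under the logistic model with reference class K.

    intercepts: shape (K-1,), the beta_{k0}.
    coefs:      shape (K-1, p), the beta_k.
    Returns a length-K array; the last entry is class K.
    """
    logits = intercepts + coefs @ x          # beta_{k0} + beta_k^T x, k = 1, ..., K-1
    num = np.exp(logits)
    denom = 1.0 + num.sum()                  # 1 + sum_l exp(beta_{l0} + beta_l^T x)
    return np.append(num, 1.0) / denom       # class K has numerator 1

# Hypothetical example with K = 3 classes and p = 2 inputs.
probs = posterior_probs(np.array([0.5, -1.0]),
                        intercepts=np.array([0.2, -0.1]),
                        coefs=np.array([[1.0, -0.5], [0.3, 0.8]]))
print(probs, probs.sum())                    # the probabilities sum to 1
```

The printed sum should equal 1 up to floating-point error, matching the first bullet above.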


Comparison with LR on Indicators

- Similarities:
  - Both attempt to estimate Pr(G = k | X = x).
  - Both have linear classification boundaries.
- Difference:
  - Linear regression on the indicator matrix approximates Pr(G = k | X = x) by a linear function of x; the estimate is not guaranteed to fall between 0 and 1 or to sum up to 1.
  - Logistic regression models Pr(G = k | X = x) as a nonlinear function of x; it is guaranteed to range from 0 to 1 and to sum up to 1.


Fitting Logistic Regression Models

- Criterion: find the parameters that maximize the conditional likelihood of G given X using the training data.
- Denote p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta).
- Given the first input x_1, the posterior probability of its class being g_1 is \Pr(G = g_1 \mid X = x_1).
- Since the samples in the training data set are independent, the posterior probability for the N samples each having class g_i, i = 1, 2, ..., N, given their inputs x_1, x_2, ..., x_N, is

      \prod_{i=1}^{N} \Pr(G = g_i \mid X = x_i) .


- The conditional log-likelihood of the class labels in the training data set is

      L(\theta) = \sum_{i=1}^{N} \log \Pr(G = g_i \mid X = x_i)
                = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta) .


Binary Classification

- For binary classification, if g_i = 1, denote y_i = 1; if g_i = 2, denote y_i = 0.
- Let p_1(x; \theta) = p(x; \theta); then

      p_2(x; \theta) = 1 - p_1(x; \theta) = 1 - p(x; \theta) .

- Since K = 2, the parameters are \theta = \{\beta_{10}, \beta_1\}.
  We denote \beta = (\beta_{10}, \beta_1)^T.


- If y_i = 1, i.e., g_i = 1,

      \log p_{g_i}(x; \theta) = \log p_1(x; \theta)
                              = 1 \cdot \log p(x; \theta)
                              = y_i \log p(x; \theta) .

  If y_i = 0, i.e., g_i = 2,

      \log p_{g_i}(x; \theta) = \log p_2(x; \theta)
                              = 1 \cdot \log(1 - p(x; \theta))
                              = (1 - y_i) \log(1 - p(x; \theta)) .

  Since either y_i = 0 or 1 - y_i = 0, we have

      \log p_{g_i}(x; \theta) = y_i \log p(x; \theta) + (1 - y_i) \log(1 - p(x; \theta)) .


- The conditional log-likelihood is

      L(\theta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta)
                = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \theta) + (1 - y_i) \log(1 - p(x_i; \theta)) \right] .

- There are p + 1 parameters in \beta = (\beta_{10}, \beta_1)^T.
- Assume a column vector form for \beta:

      \beta = (\beta_{10}, \beta_{11}, \beta_{12}, \ldots, \beta_{1,p})^T .


- Here we add the constant term 1 to x to accommodate the intercept:

      x = (1, x_{,1}, x_{,2}, \ldots, x_{,p})^T .


- By the assumption of the logistic regression model:

      p(x; \beta) = \Pr(G = 1 \mid X = x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)}

      1 - p(x; \beta) = \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta^T x)} .

- Substituting the above into L(\beta) gives

      L(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right] .
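
A minimal sketch, assuming NumPy and a design matrix whose first column is all ones, of how this form of L(β) could be evaluated; np.logaddexp is used so that log(1 + e^η) does not overflow for large η.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ].

    X: (N, p+1) design matrix with a leading column of ones; y: 0/1 labels.
    """
    eta = X @ beta                                    # beta^T x_i for each sample
    return np.sum(y * eta - np.logaddexp(0.0, eta))   # logaddexp(0, eta) = log(1 + e^eta)
```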


- To maximize L(\beta), we set the first order partial derivatives of L(\beta) to zero:

      \frac{\partial L(\beta)}{\partial \beta_{1j}} = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} \frac{x_{ij} e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}
                                                    = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} p(x_i; \beta) x_{ij}
                                                    = \sum_{i=1}^{N} x_{ij} (y_i - p(x_i; \beta))

  for all j = 0, 1, ..., p.


- In matrix form, we write

      \frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i (y_i - p(x_i; \beta)) .

- To solve the set of p + 1 nonlinear equations \partial L(\beta) / \partial \beta_{1j} = 0, j = 0, 1, ..., p, we use the Newton-Raphson algorithm.
- The Newton-Raphson algorithm requires the second derivatives, i.e., the Hessian matrix:

      \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - \sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta) (1 - p(x_i; \beta)) .


- The element in the jth row and nth column of the Hessian is (counting from 0):

      \frac{\partial^2 L(\beta)}{\partial \beta_{1j} \, \partial \beta_{1n}}
        = \sum_{i=1}^{N} \frac{ -(1 + e^{\beta^T x_i}) e^{\beta^T x_i} x_{ij} x_{in} + (e^{\beta^T x_i})^2 x_{ij} x_{in} }{ (1 + e^{\beta^T x_i})^2 }
        = \sum_{i=1}^{N} \left( - x_{ij} x_{in} p(x_i; \beta) + x_{ij} x_{in} p(x_i; \beta)^2 \right)
        = - \sum_{i=1}^{N} x_{ij} x_{in} \, p(x_i; \beta) (1 - p(x_i; \beta)) .


- Starting with \beta^{old}, a single Newton-Raphson update is

      \beta^{new} = \beta^{old} - \left( \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial L(\beta)}{\partial \beta} ,

  where the derivatives are evaluated at \beta^{old}.


- The iteration can be expressed compactly in matrix form.
  - Let y be the column vector of the y_i.
  - Let X be the N × (p + 1) input matrix.
  - Let p be the N-vector of fitted probabilities, with ith element p(x_i; \beta^{old}).
  - Let W be an N × N diagonal matrix of weights, with ith diagonal element p(x_i; \beta^{old})(1 - p(x_i; \beta^{old})).
- Then

      \frac{\partial L(\beta)}{\partial \beta} = X^T (y - p)

      \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - X^T W X .
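
A literal translation of these two matrix formulas into Python/NumPy, as a sketch; the function name is made up, and W is formed explicitly as a diagonal matrix to mirror the notation (the efficiency discussion later avoids building it).

```python
import numpy as np

def gradient_and_hessian(beta, X, y):
    """Gradient X^T (y - p) and Hessian -X^T W X of L(beta), evaluated at beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities p(x_i; beta)
    W = np.diag(p * (1.0 - p))            # N x N diagonal weight matrix
    grad = X.T @ (y - p)
    hess = -X.T @ W @ X
    return grad, hess
```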


- The Newton-Raphson step is

      \beta^{new} = \beta^{old} + (X^T W X)^{-1} X^T (y - p)
                  = (X^T W X)^{-1} X^T W \left( X \beta^{old} + W^{-1} (y - p) \right)
                  = (X^T W X)^{-1} X^T W z ,

  where z \triangleq X \beta^{old} + W^{-1} (y - p).
- If z is viewed as a response and X as the input matrix, \beta^{new} is the solution to a weighted least squares problem:

      \beta^{new} \leftarrow \arg\min_{\beta} (z - X\beta)^T W (z - X\beta) .

- Recall that linear regression by least squares solves

      \arg\min_{\beta} (z - X\beta)^T (z - X\beta) .

- z is referred to as the adjusted response.
- The algorithm is referred to as iteratively reweighted least squares, or IRLS.

Pseudo Code

1. \beta \leftarrow 0.
2. Compute y by setting its elements to

       y_i = 1 if g_i = 1, and y_i = 0 if g_i = 2,   i = 1, 2, ..., N.

3. Compute p by setting its elements to

       p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} ,   i = 1, 2, ..., N.

4. Compute the diagonal matrix W; its ith diagonal element is p(x_i; \beta)(1 - p(x_i; \beta)), i = 1, 2, ..., N.
5. z \leftarrow X \beta + W^{-1} (y - p).
6. \beta \leftarrow (X^T W X)^{-1} X^T W z.
7. If the stopping criterion is met, stop; otherwise go back to step 3.
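
The pseudo code above translates almost line for line into the following Python/NumPy sketch. The tolerance, the iteration cap, and the use of np.linalg.solve in place of an explicit matrix inverse are implementation choices, not part of the notes, and the sketch assumes the weights p(1 - p) stay away from zero.

```python
import numpy as np

def irls_logistic(X, g, tol=1e-8, max_iter=100):
    """Fit binary logistic regression by IRLS, following the pseudo code.

    X: (N, p+1) design matrix with a leading column of ones.
    g: length-N array of class labels taking values 1 or 2.
    """
    beta = np.zeros(X.shape[1])                  # step 1: beta <- 0
    y = (g == 1).astype(float)                   # step 2: y_i = 1 if g_i = 1, else 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # step 3: fitted probabilities
        w = p * (1.0 - p)                        # step 4: diagonal of W
        z = X @ beta + (y - p) / w               # step 5: adjusted response
        # step 6: weighted least squares, beta <- (X^T W X)^{-1} X^T W z
        beta_new = np.linalg.solve((X * w[:, None]).T @ X, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:    # step 7: stopping criterion
            return beta_new
        beta = beta_new
    return beta
```

On (nearly) separable data some weights approach zero and the update can become unstable, which relates to the overshooting issue mentioned later under Computation Issues.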

Computational Efficiency

- Since W is an N × N diagonal matrix, direct matrix operations with it may be very inefficient.
- A modified pseudo code is provided next.


1. \beta \leftarrow 0.
2. Compute y by setting its elements to

       y_i = 1 if g_i = 1, and y_i = 0 if g_i = 2,   i = 1, 2, ..., N.

3. Compute p by setting its elements to

       p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} ,   i = 1, 2, ..., N.

4. Compute the N × (p + 1) matrix \tilde{X} by multiplying the ith row of X by p(x_i; \beta)(1 - p(x_i; \beta)), i = 1, 2, ..., N:

       X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} , \qquad
       \tilde{X} = \begin{pmatrix} p(x_1; \beta)(1 - p(x_1; \beta)) \, x_1^T \\ p(x_2; \beta)(1 - p(x_2; \beta)) \, x_2^T \\ \vdots \\ p(x_N; \beta)(1 - p(x_N; \beta)) \, x_N^T \end{pmatrix}

5. \beta \leftarrow \beta + (X^T \tilde{X})^{-1} X^T (y - p).
6. If the stopping criterion is met, stop; otherwise go back to step 3.
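
The saving in steps 4-5 is that X^T W X equals X^T \tilde{X} when \tilde{X} is X with its rows rescaled, so the N × N matrix W never has to be formed. A sketch of a single update under this scheme, with the same assumptions as the earlier sketch:

```python
import numpy as np

def irls_step_efficient(beta, X, y):
    """One update: beta <- beta + (X^T X~)^{-1} X^T (y - p), without building W."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))             # p(x_i; beta)
    X_tilde = X * (p * (1.0 - p))[:, None]          # ith row of X scaled by p_i (1 - p_i)
    return beta + np.linalg.solve(X.T @ X_tilde, X.T @ (y - p))
```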

Example

Diabetes data set

- The input X is two-dimensional. X1 and X2 are the two principal components of the original 8 variables.
- Class 1: without diabetes; Class 2: with diabetes.
- Applying logistic regression, we obtain

      \hat{\beta} = (0.7679, -0.6816, -0.3664)^T .


- The posterior probabilities are:

      \Pr(G = 1 \mid X = x) = \frac{e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}

      \Pr(G = 2 \mid X = x) = \frac{1}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}

- The classification rule is:

      \hat{G}(x) = \begin{cases} 1 & \text{if } 0.7679 - 0.6816 X_1 - 0.3664 X_2 \ge 0 \\ 2 & \text{otherwise} \end{cases}
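
For illustration, the fitted rule can be applied directly; a small sketch follows, where the input point is made up and the coefficients are the estimates quoted above.

```python
import numpy as np

def classify_diabetes(x1, x2):
    """Class 1 if 0.7679 - 0.6816*X1 - 0.3664*X2 >= 0, otherwise class 2."""
    score = 0.7679 - 0.6816 * x1 - 0.3664 * x2
    prob_class1 = 1.0 / (1.0 + np.exp(-score))    # Pr(G = 1 | X = x)
    return (1 if score >= 0 else 2), prob_class1

print(classify_diabetes(0.5, -1.0))               # hypothetical point in the (X1, X2) plane
```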


Solid line: decision boundary obtained by logistic regression. Dashed line: decision boundary obtained by LDA.

- Within the training data set, the classification error rate is 28.12%.
- Sensitivity: 45.9%.
- Specificity: 85.8%.


Multiclass Case (K ≥ 3)

- When K ≥ 3, \beta is a (K-1)(p+1)-vector:

      \beta = \begin{pmatrix} \beta_{10} \\ \beta_1 \\ \beta_{20} \\ \beta_2 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{pmatrix}
            = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \vdots \\ \beta_{1p} \\ \beta_{20} \\ \vdots \\ \beta_{2p} \\ \vdots \\ \beta_{(K-1)0} \\ \vdots \\ \beta_{(K-1)p} \end{pmatrix}


 
- Let \bar{\beta}_l = \begin{pmatrix} \beta_{l0} \\ \beta_l \end{pmatrix}, with the convention \bar{\beta}_K = 0 so that the numerator below equals 1 when g_i = K.
- The likelihood function becomes

      L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta)
               = \sum_{i=1}^{N} \log \left( \frac{ e^{\bar{\beta}_{g_i}^T x_i} }{ 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} } \right)
               = \sum_{i=1}^{N} \left[ \bar{\beta}_{g_i}^T x_i - \log \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) \right]


- Note: the indicator function I(·) equals 1 when the argument is true and 0 otherwise.
- First order derivatives:

      \frac{\partial L(\beta)}{\partial \beta_{kj}} = \sum_{i=1}^{N} \left[ I(g_i = k) x_{ij} - \frac{ e^{\bar{\beta}_k^T x_i} x_{ij} }{ 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} } \right]
                                                    = \sum_{i=1}^{N} x_{ij} \left( I(g_i = k) - p_k(x_i; \beta) \right)


- Second order derivatives:

      \frac{\partial^2 L(\beta)}{\partial \beta_{kj} \, \partial \beta_{mn}}
        = \sum_{i=1}^{N} x_{ij} \frac{1}{ \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right)^2 }
          \left[ - e^{\bar{\beta}_k^T x_i} I(k = m) x_{in} \left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) + e^{\bar{\beta}_k^T x_i} e^{\bar{\beta}_m^T x_i} x_{in} \right]
        = \sum_{i=1}^{N} x_{ij} x_{in} \left( - p_k(x_i; \beta) I(k = m) + p_k(x_i; \beta) p_m(x_i; \beta) \right)
        = - \sum_{i=1}^{N} x_{ij} x_{in} \, p_k(x_i; \beta) \left[ I(k = m) - p_m(x_i; \beta) \right] .


- Matrix form:
  - y is the concatenated indicator vector of dimension N(K - 1):

        y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{K-1} \end{pmatrix} , \qquad
        y_k = \begin{pmatrix} I(g_1 = k) \\ I(g_2 = k) \\ \vdots \\ I(g_N = k) \end{pmatrix} , \quad 1 \le k \le K - 1 .

  - p is the concatenated vector of fitted probabilities, of dimension N(K - 1):

        p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_{K-1} \end{pmatrix} , \qquad
        p_k = \begin{pmatrix} p_k(x_1; \beta^{old}) \\ p_k(x_2; \beta^{old}) \\ \vdots \\ p_k(x_N; \beta^{old}) \end{pmatrix} , \quad 1 \le k \le K - 1 .

- \tilde{X} is an N(K - 1) × (p + 1)(K - 1) block-diagonal matrix built from K - 1 copies of X:

      \tilde{X} = \begin{pmatrix} X & 0 & \cdots & 0 \\ 0 & X & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & X \end{pmatrix}


- The matrix W is an N(K - 1) × N(K - 1) square matrix:

      W = \begin{pmatrix} W_{1,1} & W_{1,2} & \cdots & W_{1,K-1} \\ W_{2,1} & W_{2,2} & \cdots & W_{2,K-1} \\ \vdots & & & \vdots \\ W_{K-1,1} & W_{K-1,2} & \cdots & W_{K-1,K-1} \end{pmatrix}

- Each submatrix W_{k,m}, 1 \le k, m \le K - 1, is an N × N diagonal matrix.
  - When k = m, the ith diagonal element of W_{k,k} is p_k(x_i; \beta^{old})(1 - p_k(x_i; \beta^{old})).
  - When k \ne m, the ith diagonal element of W_{k,m} is - p_k(x_i; \beta^{old}) \, p_m(x_i; \beta^{old}).
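
As a sketch (assuming NumPy), \tilde{X} and W could be assembled as follows; this is mainly to make the block structure concrete, since for realistic N the N(K-1) × N(K-1) matrix W is exactly what one would avoid forming in practice.

```python
import numpy as np

def multiclass_blocks(X, P):
    """Assemble X~ (block diagonal) and W (grid of N x N diagonal blocks).

    X: (N, p+1) design matrix.
    P: (N, K-1) matrix with P[i, k-1] = p_k(x_i; beta_old).
    """
    N, Km1 = P.shape
    X_tilde = np.kron(np.eye(Km1), X)           # K-1 copies of X on the diagonal
    W = np.zeros((N * Km1, N * Km1))
    for k in range(Km1):
        for m in range(Km1):
            if k == m:
                d = P[:, k] * (1.0 - P[:, k])   # p_k (1 - p_k)
            else:
                d = -P[:, k] * P[:, m]          # -p_k p_m
            W[k*N:(k+1)*N, m*N:(m+1)*N] = np.diag(d)
    return X_tilde, W
```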


- Similarly to the binary classification case,

      \frac{\partial L(\beta)}{\partial \beta} = \tilde{X}^T (y - p)

      \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = - \tilde{X}^T W \tilde{X} .

- The formula for updating \beta^{new} in the binary classification case carries over to the multiclass case:

      \beta^{new} = (\tilde{X}^T W \tilde{X})^{-1} \tilde{X}^T W z ,

  where z \triangleq \tilde{X} \beta^{old} + W^{-1} (y - p). Or simply:

      \beta^{new} = \beta^{old} + (\tilde{X}^T W \tilde{X})^{-1} \tilde{X}^T (y - p) .


Computation Issues

- Initialization: one option is to use \beta = 0.
- Convergence is not guaranteed, but it usually occurs.
- Usually, the log-likelihood increases after each iteration, but overshooting can occur.
- In the rare cases in which the log-likelihood decreases, cut the step size by half.


Connection with LDA

- Under the model of LDA:

      \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)}
        = \log \frac{\pi_k}{\pi_K} - \frac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K)
        = a_{k0} + a_k^T x .

- The model of LDA satisfies the assumptions of the linear logistic model.
- The linear logistic model only specifies the conditional distribution Pr(G = k | X = x). No assumption is made about Pr(X).

- The LDA model specifies the joint distribution of X and G. Pr(X) is a mixture of Gaussians:

      \Pr(X) = \sum_{k=1}^{K} \pi_k \, \phi(X; \mu_k, \Sigma) ,

  where \phi is the Gaussian density function.
- Linear logistic regression maximizes the conditional likelihood of G given X: Pr(G = k | X = x).
- LDA maximizes the joint likelihood of G and X: Pr(X = x, G = k).


- If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently by using more information about the data.
- Samples without class labels can be used under the model of LDA.
- LDA is not robust to gross outliers.
- As logistic regression relies on fewer assumptions, it seems to be more robust.
- In practice, logistic regression and LDA often give similar results.


Simulation

- Assume the input X is one-dimensional.
- The two classes have equal priors, and the class-conditional densities of X are shifted versions of each other.
- Each conditional density is a mixture of two normals (see the sampling sketch below):
  - Class 1 (red): 0.6 N(2, 1/4) + 0.4 N(0, 1).
  - Class 2 (blue): 0.6 N(0, 1/4) + 0.4 N(-2, 1).
- The class-conditional densities are shown below.
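
A sketch, assuming NumPy, of how data from these two mixtures could be drawn; the random seed and the per-class sample size are arbitrary here, and the second argument of N(·, ·) is treated as a variance, so the standard deviations are 0.5 and 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(n, means, sds, weights):
    """Draw n points from a two-component univariate normal mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)           # pick a component per point
    return rng.normal(np.array(means)[comp], np.array(sds)[comp])

# Class 1: 0.6 N(2, 1/4) + 0.4 N(0, 1);  Class 2: 0.6 N(0, 1/4) + 0.4 N(-2, 1).
x1 = sample_class(2000, means=[2.0, 0.0], sds=[0.5, 1.0], weights=[0.6, 0.4])
x2 = sample_class(2000, means=[0.0, -2.0], sds=[0.5, 1.0], weights=[0.6, 0.4])
```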



LDA Result

- Training data set: 2000 samples for each class.
- Test data set: 1000 samples for each class.
- The estimates obtained by LDA: \hat{\mu}_1 = 1.1948, \hat{\mu}_2 = -0.8224, \hat{\sigma}^2 = 1.5268. The boundary value between the two classes is (\hat{\mu}_1 + \hat{\mu}_2)/2 = 0.1862.
- The classification error rate on the test data is 0.2315.
- Based on the true distribution, the Bayes (optimal) boundary value between the two classes is 0.7750 and the error rate is 0.1765.



Logistic Regression Result

- Linear logistic regression obtains

      \hat{\beta} = (0.3288, -1.3275)^T .

  The boundary value satisfies 0.3288 - 1.3275 X = 0, hence equals 0.2477.
- The error rate on the test data set is 0.2205.
- The estimated posterior probability is:

      \Pr(G = 1 \mid X = x) = \frac{e^{0.3288 - 1.3275 x}}{1 + e^{0.3288 - 1.3275 x}} .


The estimated posterior probability Pr(G = 1 | X = x) and its true value based on the true distribution are compared in the graph below.

