
Lecture 5 :: Logistic regression

Supervised learning

Figure 1: Supervised learning


Binary classification

y ∈ {0, 1}

Example: bigrams - multi-word expression (yes/no)? (Recall the data we
work with during the lab sessions - pdt-dep-full-pos-freq-all.csv.)

elementary school - 1 ("positive" instance)
at school - 0 ("negative" instance)

Could we do classification with linear regression? ...

Figure 2: Classification as linear regression (1)

You can classify as follows:

if h_θ(x) ≥ 0.5, predict y = 1
if h_θ(x) < 0.5, predict y = 0

Figure 3: Classification as linear regression (2)

Classification :: Logistic regression

Representation of h: we want 0 ≤ h_θ(x) ≤ 1

Linear regression: h_θ(x) = θ^T x
Logistic regression: h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^(-z)) is a
sigmoid (logistic) function, i.e.

h_θ(x) = 1 / (1 + e^(-θ^T x))    (1)
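A minimal NumPy sketch of the sigmoid and the hypothesis in equation (1); the function names sigmoid and h and the toy values of theta and x are illustrative, not part of the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x), equation (1)."""
    return sigmoid(theta @ x)

# Toy check: theta^T x = 0 gives g(0) = 0.5
theta = np.array([0.0, 1.0])
x = np.array([1.0, 0.0])    # x0 = 1 is the bias feature
print(h(theta, x))          # 0.5
```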

Figure 4: Sigmoid function g(z) = 1 / (1 + e^(-z))

Interpretation of h_θ(x)

h_θ(x) is the estimated probability that y = 1 on input x,
i.e. h_θ(x) = p(y = 1 | x; θ)
Predict y = 1 if h_θ(x) ≥ 0.5
Predict y = 0 if h_θ(x) < 0.5
Example: If x = ⟨x0, x1⟩ = ⟨1, joint probability⟩ and h_θ(x) = 0.8, then there is an
80% chance of the bigram being a multi-word expression.
Since h_θ(x) = g(θ^T x): g(z) ≥ 0.5 whenever z ≥ 0 and g(z) < 0.5
whenever z < 0.
Predict y = 1 if h_θ(x) ≥ 0.5, i.e. θ^T x ≥ 0
Predict y = 0 if h_θ(x) < 0.5, i.e. θ^T x < 0

Linear Decision Boundary

Let h_θ(x) = g(θ0 + θ1·x1 + θ2·x2) and, for example,
θ0 = -2, θ1 = 1, θ2 = 1.
Predict y = 1 if -2 + x1 + x2 ≥ 0, i.e. x1 + x2 ≥ 2
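A quick numerical check of this boundary, as a sketch: with θ = (-2, 1, 1) the classifier predicts y = 1 exactly when x1 + x2 ≥ 2. The predict function and the two test points are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-2.0, 1.0, 1.0])        # theta0, theta1, theta2 from the slide

def predict(x1, x2):
    x = np.array([1.0, x1, x2])           # prepend the bias feature x0 = 1
    return int(sigmoid(theta @ x) >= 0.5) # equivalent to x1 + x2 >= 2

print(predict(0.5, 0.5))  # 0: below the boundary x1 + x2 = 2
print(predict(1.5, 1.0))  # 1: on/above the boundary
```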

Figure 5: Linear decision boundary

The decision boundary x1 + x2 = 2 is a property of h_θ(x), not of the training data.
Non-linear Decision Boundary

Let h_θ(x) = g(θ0 + θ1·x1 + θ2·x2 + θ3·x1² + θ4·x2²) and, for example,
θ0 = -1, θ1 = 0, θ2 = 0, θ3 = 1, θ4 = 1.
Predict y = 1 if -1 + x1² + x2² ≥ 0, i.e. x1² + x2² ≥ 1
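The same kind of sketch, now with a polynomial feature mapping: the boundary is the unit circle x1² + x2² = 1. The feature order and the test points are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # theta0 .. theta4 from the slide

def predict(x1, x2):
    features = np.array([1.0, x1, x2, x1**2, x2**2])  # polynomial feature mapping
    return int(sigmoid(theta @ features) >= 0.5)      # y = 1 outside the unit circle

print(predict(0.3, 0.4))   # 0: inside x1^2 + x2^2 = 1
print(predict(1.0, 1.0))   # 1: outside the circle
```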

Figure 6: Non-linear decision boundary

More complicated decision boundary

Figure 7: Non-linear decision boundary (2)

How to choose parameters θ? :: Cost function

Notation: J(θ) = (1/n) Σ_{i=1}^n Cost(h_θ(x_i), y_i)

Linear regression:
J(θ) = (1/n) Σ_{i=1}^n Cost(h_θ(x_i), y_i) = (1/n) Σ_{i=1}^n (1/2)·(h_θ(x_i) - y_i)²

Logistic regression: Using the squared error function doesn't guarantee
finding a global minimum, because of the non-linearity of
h_θ(x) = 1 / (1 + e^(-θ^T x)) (many local minima).

How to choose parameters θ? :: Cost function (2)

Cost(h_θ(x), y) = -log(h_θ(x))        if y = 1
Cost(h_θ(x), y) = -log(1 - h_θ(x))    if y = 0        (2)

We can also write

Cost(h_θ(x), y) = -y·log(h_θ(x)) - (1 - y)·log(1 - h_θ(x))    (3)

It's a nice convex function - see the concept of Maximum Likelihood Estimation later on.
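As a sketch, the cost from equations (2)-(3) averaged over the n training examples, assuming NumPy and a design matrix X whose first column is all ones; the epsilon guard against log(0) is an implementation detail, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta): average of Cost(h_theta(x_i), y_i) from eq. (3).

    X has shape (n, m+1) with a leading column of ones, y is a 0/1 vector.
    """
    h = sigmoid(X @ theta)
    eps = 1e-12                      # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```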

How to choose parameters θ?

Fit parameters θ: min_θ J(θ)

Make a prediction for a new instance x: h_θ(x) = 1 / (1 + e^(-θ^T x))

How to choose parameters θ? :: Gradient descent

J(θ) = -(1/n) [ Σ_{i=1}^n y_i·log(h_θ(x_i)) + (1 - y_i)·log(1 - h_θ(x_i)) ]

min_θ J(θ)

Repeat {

    θ_j := θ_j - α · ∂J(θ)/∂θ_j        (4)

} (simultaneously update θ_j for j = 0, 1, ..., m)

How to choose parameters θ? :: Gradient descent (2)

Repeat {

    θ_j := θ_j - α · Σ_{i=1}^n (h_θ(x_i) - y_i)·x_ij        (5)

} (simultaneously update θ_j for j = 0, 1, ..., m)

Have you already met this? Yes, see linear regression. But:

Linear regression: h_θ(x) = θ^T x
Logistic regression: h_θ(x) = 1 / (1 + e^(-θ^T x))
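A minimal batch gradient-descent sketch for update rule (5), assuming NumPy; the defaults alpha=0.1 and num_iters=1000 are illustrative, and X is again assumed to carry a leading column of ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression, update rule (5)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)          # h_theta(x_i) for all n examples
        grad = X.T @ (h - y)            # sum_i (h_theta(x_i) - y_i) * x_ij
        theta = theta - alpha * grad    # simultaneous update of all theta_j
    return theta
```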

Logistic regression :: Multi-class classification

"Our" MWE classification:


0 non-MWE
1 stock phrases, frequent unpredictable usages
2 names of persons, organizations, geographical locations,
and other entities
3 support verb constructions
4 technical terms
5 idiomatic expressions

One-vs-all

Figure 8: One-vs-all

One-vs-all (2)

New instance x:

h_θ^(red)(x) = Pr(y = red | x; θ)
h_θ^(blue)(x) = Pr(y = blue | x; θ)
h_θ^(green)(x) = Pr(y = green | x; θ)

Classify x into the class i ∈ {red, green, blue} that maximizes h_θ^(i)(x).
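A sketch of the one-vs-all scheme, assuming NumPy and reusing the gradient_descent function from the earlier sketch; one_vs_all, classify, and the default hyperparameters are illustrative names and values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all(X, y, classes, alpha=0.1, num_iters=1000):
    """Train one binary logistic-regression classifier per class.

    Relies on gradient_descent() from the earlier sketch.
    """
    thetas = {}
    for c in classes:
        y_binary = (y == c).astype(float)              # 1 for class c, 0 for the rest
        thetas[c] = gradient_descent(X, y_binary, alpha, num_iters)
    return thetas

def classify(x, thetas):
    """Pick the class whose classifier assigns x the highest probability."""
    return max(thetas, key=lambda c: sigmoid(thetas[c] @ x))
```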

Addressing overfitting :: Regularization :: Cost function
J(θ) = -(1/n) [ Σ_{i=1}^n y_i·log(h_θ(x_i)) + (1 - y_i)·log(1 - h_θ(x_i)) ] + (λ/(2n)) Σ_{j=1}^m θ_j²

Repeat {

    θ_0 := θ_0 - α · Σ_{i=1}^n (h_θ(x_i) - y_i)·x_i0                  (6)

    θ_j := θ_j - α · [ Σ_{i=1}^n (h_θ(x_i) - y_i)·x_ij + λ·θ_j ]      (7)
                                                (for j = 1, ..., m)
}
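A sketch of updates (6) and (7) with L2 regularization, assuming NumPy; lam stands for the regularization strength λ, and the bias θ_0 is left unregularized, as in (6).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_reg(X, y, lam=1.0, alpha=0.1, num_iters=1000):
    """Gradient descent with L2 regularization, updates (6) and (7)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y)             # sum_i (h_theta(x_i) - y_i) * x_ij
        grad[1:] += lam * theta[1:]      # + lambda * theta_j for j >= 1 only, eq. (7)
        theta = theta - alpha * grad     # theta_0 follows eq. (6), no penalty
    return theta
```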

For more details refer to

Ng, Andrew: Machine Learning, online course at Stanford
(https://class.coursera.org/ml-2012-002/class/index)

Hastie, T. et al.: The Elements of Statistical Learning. Springer, 2009
(http://www-stat.stanford.edu/~tibs/ElemStatLearn/), Section 4.4

