
Lecture 5 :: Logistic regression

Supervised learning

Figure 1: Supervised learning


Binary classification

y ∈ {0, 1}

Example: bigrams - multi-word expression (yes/no)? (Recall the data we
work with during the lab sessions - pdt-dep-full-pos-freq-all.csv.)

elementary school - 1 ("positive" instance)
at school - 0 ("negative" instance)

Could we do classification with linear regression? ...

Figure 2: Classification as linear regression (1)

You can classify as follows:

if h_θ(x) ≥ 0.5, predict y = 1
if h_θ(x) < 0.5, predict y = 0

Figure 3: Classification as linear regression (2)

Classification :: Logistic regression

Representation of h: we want 0 ≤ h_θ(x) ≤ 1

Linear regression: h_θ(x) = θ^T x
Logistic regression: h_θ(x) = g(θ^T x), where g(z) = 1 / (1 + e^(-z)) is a
sigmoid (logistic) function, i.e.

h_θ(x) = 1 / (1 + e^(-θ^T x))    (1)
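A minimal NumPy sketch of the sigmoid and the hypothesis in equation (1); the function names sigmoid and h and the toy values of theta and x are illustrative, not part of the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x), equation (1)."""
    return sigmoid(theta @ x)

# Toy check: theta^T x = 0 gives g(0) = 0.5
theta = np.array([0.0, 1.0])
x = np.array([1.0, 0.0])    # x0 = 1 is the bias feature
print(h(theta, x))          # 0.5
```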

Figure 4: Sigmoid function g(z) = 1 / (1 + e^(-z))

Interpretation of h_θ(x)

h_θ(x) is the estimated probability that y = 1 on input x,
i.e. h_θ(x) = p(y = 1 | x; θ)
Predict y = 1 if h_θ(x) ≥ 0.5
Predict y = 0 if h_θ(x) < 0.5
Example: If x = ⟨x0, x1⟩ = ⟨1, joint probability⟩ and h_θ(x) = 0.8, then there is an
80% chance of the bigram being a multi-word expression.
Since h_θ(x) = g(θ^T x): g(z) ≥ 0.5 whenever z ≥ 0 and g(z) < 0.5
whenever z < 0.
Predict y = 1 if h_θ(x) ≥ 0.5, i.e. θ^T x ≥ 0
Predict y = 0 if h_θ(x) < 0.5, i.e. θ^T x < 0

Linear Decision Boundary

Let h_θ(x) = g(θ0 + θ1·x1 + θ2·x2) and, for example,
θ0 = -2, θ1 = 1, θ2 = 1.
Predict y = 1 if -2 + x1 + x2 ≥ 0, i.e. x1 + x2 ≥ 2
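A quick numerical check of this boundary, as a sketch: with θ = (-2, 1, 1) the classifier predicts y = 1 exactly when x1 + x2 ≥ 2. The predict function and the two test points are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-2.0, 1.0, 1.0])        # theta0, theta1, theta2 from the slide

def predict(x1, x2):
    x = np.array([1.0, x1, x2])           # prepend the bias feature x0 = 1
    return int(sigmoid(theta @ x) >= 0.5) # equivalent to x1 + x2 >= 2

print(predict(0.5, 0.5))  # 0: below the boundary x1 + x2 = 2
print(predict(1.5, 1.0))  # 1: on/above the boundary
```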

Figure 5: Linear decision boundary

The decision boundary x1 + x2 = 2 is a property of h_θ(x), not of the training data.
Non-linear Decision Boundary

Let h_θ(x) = g(θ0 + θ1·x1 + θ2·x2 + θ3·x1² + θ4·x2²) and, for example,
θ0 = -1, θ1 = 0, θ2 = 0, θ3 = 1, θ4 = 1.
Predict y = 1 if -1 + x1² + x2² ≥ 0, i.e. x1² + x2² ≥ 1
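The same kind of sketch, now with a polynomial feature mapping: the boundary is the unit circle x1² + x2² = 1. The feature order and the test points are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # theta0 .. theta4 from the slide

def predict(x1, x2):
    features = np.array([1.0, x1, x2, x1**2, x2**2])  # polynomial feature mapping
    return int(sigmoid(theta @ features) >= 0.5)      # y = 1 outside the unit circle

print(predict(0.3, 0.4))   # 0: inside x1^2 + x2^2 = 1
print(predict(1.0, 1.0))   # 1: outside the circle
```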

Figure 6: Non-linear decision boundary

More complicated decision boundary

Figure 7: Non-linear decision boundary (2)

How to choose parameters θ? :: Cost function

Notation: J(θ) = (1/n) Σ_{i=1}^n Cost(h_θ(x_i), y_i)

Linear regression:
J(θ) = (1/n) Σ_{i=1}^n Cost(h_θ(x_i), y_i) = (1/n) Σ_{i=1}^n (1/2)·(h_θ(x_i) - y_i)²

Logistic regression: Using the squared error function doesn't guarantee
finding a global minimum, because of the non-linearity of
h_θ(x) = 1 / (1 + e^(-θ^T x)) (many local minima).

How to choose parameters θ? :: Cost function (2)

Cost(h_θ(x), y) = -log(h_θ(x))        if y = 1
Cost(h_θ(x), y) = -log(1 - h_θ(x))    if y = 0        (2)

We can also write

Cost(h_θ(x), y) = -y·log(h_θ(x)) - (1 - y)·log(1 - h_θ(x))    (3)

It's a nice convex function - see the concept of Maximum Likelihood Estimation later on.
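As a sketch, the cost from equations (2)-(3) averaged over the n training examples, assuming NumPy and a design matrix X whose first column is all ones; the epsilon guard against log(0) is an implementation detail, not part of the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta): average of Cost(h_theta(x_i), y_i) from eq. (3).

    X has shape (n, m+1) with a leading column of ones, y is a 0/1 vector.
    """
    h = sigmoid(X @ theta)
    eps = 1e-12                      # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```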

How to choose parameters θ?

Fit parameters θ: min_θ J(θ)

Make a prediction for a new instance x: h_θ(x) = 1 / (1 + e^(-θ^T x))

How to choose parameters θ? :: Gradient descent

J(θ) = -(1/n) [ Σ_{i=1}^n y_i·log(h_θ(x_i)) + (1 - y_i)·log(1 - h_θ(x_i)) ]

min_θ J(θ)

Repeat {

    θ_j := θ_j - α · ∂J(θ)/∂θ_j        (4)

} (simultaneously update θ_j for j = 0, 1, ..., m)

How to choose parameters θ? :: Gradient descent (2)

Repeat {

    θ_j := θ_j - α · Σ_{i=1}^n (h_θ(x_i) - y_i)·x_ij        (5)

} (simultaneously update θ_j for j = 0, 1, ..., m)

Have you already met this? Yes, see linear regression. But:

Linear regression: h_θ(x) = θ^T x
Logistic regression: h_θ(x) = 1 / (1 + e^(-θ^T x))
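A minimal batch gradient-descent sketch for update rule (5), assuming NumPy; the defaults alpha=0.1 and num_iters=1000 are illustrative, and X is again assumed to carry a leading column of ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression, update rule (5)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)          # h_theta(x_i) for all n examples
        grad = X.T @ (h - y)            # sum_i (h_theta(x_i) - y_i) * x_ij
        theta = theta - alpha * grad    # simultaneous update of all theta_j
    return theta
```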

Logistic regression :: Multi-class classification

"Our" MWE classification:


0 non-MWE
1 stock phrases, frequent unpredictable usages
2 names of persons, organizations, geographical locations,
and other entities
3 support verb constructions
4 technical terms
5 idiomatic expressions

One-vs-all

Figure 8: One-vs-all

One-vs-all (2)

New instance x:

h_θ^(red)(x) = Pr(y = red | x; θ)
h_θ^(blue)(x) = Pr(y = blue | x; θ)
h_θ^(green)(x) = Pr(y = green | x; θ)

Classify x into the class i ∈ {red, green, blue} that maximizes h_θ^(i)(x).
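A sketch of the one-vs-all scheme, assuming NumPy and reusing the gradient_descent function from the earlier sketch; one_vs_all, classify, and the default hyperparameters are illustrative names and values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all(X, y, classes, alpha=0.1, num_iters=1000):
    """Train one binary logistic-regression classifier per class.

    Relies on gradient_descent() from the earlier sketch.
    """
    thetas = {}
    for c in classes:
        y_binary = (y == c).astype(float)              # 1 for class c, 0 for the rest
        thetas[c] = gradient_descent(X, y_binary, alpha, num_iters)
    return thetas

def classify(x, thetas):
    """Pick the class whose classifier assigns x the highest probability."""
    return max(thetas, key=lambda c: sigmoid(thetas[c] @ x))
```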

Addressing overfitting :: Regularization :: Cost function
J(θ) = -(1/n) [ Σ_{i=1}^n y_i·log(h_θ(x_i)) + (1 - y_i)·log(1 - h_θ(x_i)) ] + (λ/(2n)) Σ_{j=1}^m θ_j²

Repeat {

    θ_0 := θ_0 - α · Σ_{i=1}^n (h_θ(x_i) - y_i)·x_i0                  (6)

    θ_j := θ_j - α · [ Σ_{i=1}^n (h_θ(x_i) - y_i)·x_ij + λ·θ_j ]      (7)
                                                (for j = 1, ..., m)
}
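A sketch of updates (6) and (7) with L2 regularization, assuming NumPy; lam stands for the regularization strength λ, and the bias θ_0 is left unregularized, as in (6).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_reg(X, y, lam=1.0, alpha=0.1, num_iters=1000):
    """Gradient descent with L2 regularization, updates (6) and (7)."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y)             # sum_i (h_theta(x_i) - y_i) * x_ij
        grad[1:] += lam * theta[1:]      # + lambda * theta_j for j >= 1 only, eq. (7)
        theta = theta - alpha * grad     # theta_0 follows eq. (6), no penalty
    return theta
```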

For more details refer to

Ng, Andrew: Machine Learning, online course at Stanford
(https://class.coursera.org/ml-2012-002/class/index)

Hastie, T. et al.: The Elements of Statistical Learning. Springer, 2009
(http://www-stat.stanford.edu/~tibs/ElemStatLearn/), Section 4.4

